Intelligent Systems and Applications: Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 1: 294 (Lecture Notes in Networks and Systems) [1st ed. 2022] 3030821927, 9783030821920

This book presents Proceedings of the 2021 Intelligent Systems Conference which is a remarkable collection of chapters c


English Pages 897 [909] Year 2021


Table of contents :
Editor’s Preface
Contents
Late Fusion of Convolutional Neural Network with Wavelet-Based Ensemble Classifier for Acoustic Scene Classification
1 Introduction
2 Proposed Methodology
2.1 Pre-processing and Feature Extraction
2.2 Convolutional Neural Network
2.3 Wavelet Scattering
2.4 Ensemble Classifiers
2.5 Fusion of CNN and Classifiers
3 Results and Discussion
4 Conclusion
References
Deep Learning and Social Media for Managing Disaster: Survey
1 Introduction
2 Background and Related Works
2.1 Recent Surveys
2.2 Disaster
2.3 Disaster Management
3 Disaster Management Models
3.1 Discussion About Disaster Management Models
4 Social Media
5 Retrieving Relevant Information from Social Media
5.1 Classification Algorithms
5.2 Machine Learning (ML)
5.3 Deep Learning (DL)
6 Conclusion and Future Works
References
A Framework for Adaptive Mobile Ecological Momentary Assessments Using Reinforcement Learning
1 Introduction
2 Related Work
3 Adaptive Mobile EMA
3.1 An Unbiased Formulation for Mobile EMA
3.2 Using Reinforcement Learning Framework for Adaptive Mobile EMA
4 A Two-Level User State Model
5 K-Routine Mining Algorithm
5.1 Mining K-Routines
5.2 Merging K-Routines
5.3 Mapping K-Routines
6 Designing Adaptive mEMA Method Using RL
6.1 RL Algorithm
6.2 State Space for Adaptive mEMA
6.3 Action Space for Adaptive mEMA
6.4 Reward Signal for Adaptive mEMA
6.5 Experience Replay for Sample Efficiency Using Dyna-Q
6.6 Performance Evaluation
7 Experiments
7.1 Data
7.2 Baseline Methods
7.3 Experimental Settings and Research Questions
8 Results
8.1 Comparisons Within RL Strategies
8.2 Comparisons Between RL Strategies and Baseline Methods
8.3 Performance by Data Segments
9 Discussion
10 Conclusion
References
Reputation Analysis Based on Weakly-Supervised Bi-LSTM-Attention Network
1 Introduction
2 Related Work
2.1 Machine Learning for Sentiment Analysis
2.2 Deep Learning for Sentiment Analysis
3 Weakly-Supervised Deep Embedding
3.1 The Classic WDE Network Architecture
3.2 Model Enhancement – WDE-BiLSTM-Attention
4 Experiments
4.1 Oversampling
4.2 Baselines and Comparison
4.3 Sentiment Classification
4.4 Topic Mining Based on T-LDA
5 Conclusion
5.1 Deficiency and Future Work
References
Multi-GPU-based Convolutional Neural Networks Training for Text Classification
1 Introduction
2 Related Work
2.1 Data Parallelism Approaches
2.2 Communications in Distributed Environment
3 Distributed CNN for Text Categorization
3.1 Motivation and Objective
3.2 Baseline Model
3.3 A Parallel CNN Algorithm for Text Classification
4 Experimental Results
4.1 Experimental Protocol
4.2 Experiment 1: Sequential CNN Training
4.3 Experiment 2: Sequential vs Distributed Training
4.4 Experiment 3: Varying the Number of GPUs
5 Conclusion
References
Performance Analysis of Data-Driven Techniques for Solving Inverse Kinematics Problems
1 Introduction
2 Testing Model
3 Forward Kinematics
4 Analytical Approach
4.1 Results of Analytical Techniques
4.2 Limitation and Critical Analysis of Analytical Techniques
5 Neural Network Approach
5.1 Preparation of Data Set
5.2 The Neural Network Architecture
6 Experimental Results and Validation
7 Conclusion and Future Work
References
Machine Learning Based H2 Norm Minimization for Maglev Vibration Isolation Platform
1 Introduction
2 Vibration Isolator Modelling
2.1 Derivation of the Balancing Levitation Force
2.2 Isolator Dynamics
2.3 State-Space Framework of Single Axis Levitation
2.4 Four Pole Electromagnet Configuration
3 Experimental Setup
3.1 Hardware
3.2 General Structure
4 FSF Controller Syntheses
4.1 H2 SF Controller Structure
5 Deep Reinforcement Learning Algorithm
6 Experimental Results
7 Conclusions
References
A Vision Based Deep Reinforcement Learning Algorithm for UAV Obstacle Avoidance
1 Introduction
2 Related Work
2.1 Reinforcement Learning for Obstacle Avoidance
2.2 Exploration
3 Methodology: Towards Improving Exploration
3.1 Training Setup
3.2 Convergence Exploration
3.3 Guidance Exploration
4 Results and Discussion
5 Conclusion
References
Detecting and Fixing Nonidiomatic Snippets in Python Source Code with Deep Learning
1 Introduction
2 Related Work
3 Method
3.1 Formal Approach
3.2 Neural Architectures
4 Dataset Generation
4.1 Template Generation
4.2 Augmentation of Templates
5 Evaluation
5.1 Automated and Manual Evaluation
5.2 Precision
5.3 Recall
5.4 Precision of Subsystems
6 Conclusion
A Appendix
References
BreakingBED: Breaking Binary and Efficient Deep Neural Networks by Adversarial Attacks
1 Introduction
2 Compression of Deep Neural Networks
2.1 Knowledge Distillation
2.2 Pruning
2.3 Binarization
3 Adversarial Attacks
3.1 White-Box Attacks
3.2 Black-Box Attacks
4 Breaking Binary and Efficient DNNs
4.1 CNN Compressed Variants
4.2 Evaluation of Robustness
4.3 Class Activation Mapping on Attacked CNNs
4.4 Robustness Evaluation on ImageNet Dataset
4.5 Discussion
5 Conclusion
References
Parallel Dilated CNN for Detecting and Classifying Defects in Surface Steel Strips in Real-Time
1 Introduction
2 Related Work
3 Dataset and Augmentation
4 Proposed DSTEELNet Architecture
5 Experiments
5.1 Experiment Metrics
5.2 Setup
5.3 Results
5.4 Computational Time
6 Conclusion
References
Selective Information Control and Network Compression in Multi-layered Neural Networks
1 Introduction
2 Theory and Computational Methods
2.1 Network Compression
2.2 Controlling Selective Information
2.3 Selective Information-Driven Learning
3 Results and Discussion
3.1 Experimental Outline
3.2 Selective Information Control
3.3 Generalization Performance
3.4 Interpreting Compressed Weights
4 Conclusion
References
DAC–Deep Autoencoder-Based Clustering: A General Deep Learning Framework of Representation Learning
1 Introduction
2 Overview of Deep Autoencoder-Based Clustering
3 Deep Autoencoder for Representation Learning
3.1 Encoder
3.2 Decoder
3.3 Objective Function
4 Experimental Results
4.1 Data Set
4.2 Measurement Metrics
4.3 Experiment Setup
4.4 Results on MNIST
4.5 Results on Other Datasets
5 Limitation
6 Conclusion
References
Enhancing LSTM Models with Self-attention and Stateful Training
1 Introduction
2 Background
2.1 Feed-Forward Networks, Recurrent Neural Networks, Back Propagation Through Time
2.2 Long Short-Term Memory and Truncated BPTT
2.3 Self-attention
2.4 Experimental Rationale
3 Methodology
3.1 Statefulness
3.2 LSTM and Attention
4 Data
4.1 Data Characteristics
4.2 Data Sets
5 Models
5.1 Architectures
5.2 Hyperparameters
6 Experiments and Results
6.1 Model-to-Model and Model-to-Study Comparisons
7 Discussion: Training Behavior
8 Conclusions
References
Domain Generalization Using Ensemble Learning
1 Introduction
2 Related Work
2.1 Ensemble Learning
2.2 Transfer Learning
2.3 Domain Generalization
3 Methods
3.1 Data Preparation
3.2 Experiments
3.3 Hyperparameter Tuning
4 Results
5 Conclusion
References
Research on Text Classification Modeling Strategy Based on Pre-trained Language Model
1 Introduction
2 Related Work
3 Model Architecture
3.1 Model Input
3.2 Transformer
3.3 Capsule Networks
3.4 Model Framework
4 Experiment Design and Analysis
4.1 Experiment Corpus
4.2 Evaluation Metrics
4.3 Experimental Setup
4.4 Comparative Experiment
4.5 Ablation Experiment
4.6 Experiment Analysis
5 Conclusion and Future Work
References
Discovering Nonlinear Dynamics Through Scientific Machine Learning
1 Introduction
2 Scientific Machine Learning Models
2.1 Physics-Informed Neural Networks
2.2 Universal Differential Equations
2.3 Hamiltonian Neural Networks
2.4 Neural Ordinary Differential Equations (Neural ODE)
3 Physical Experiments
3.1 Quadruple Spring Mass System
3.2 Pendulum
3.3 Simulated Pendulum
3.4 Simulation of Wind Forced Pendulum
3.5 Physical Experimental Pendulum
4 Learning the Nonlinear Dynamics with Scientific Machine Learning
4.1 What Do These SciML Models Learn?
4.2 Can SciML Predict the Future?
4.3 Can HNN Solve Complex Dynamic Problems?
5 Conclusion
References
Tensor Data Scattering and the Impossibility of Slicing Theorem
1 Introduction
2 Tensor
3 Pick and Slice
4 Tensor Variator and Its Provision Tensor
5 Nondeterministic of Applying Variator
6 Scattering
6.1 Scatter APIs in Two Popular Deep Learning Frameworks
6.2 Defining Scattering
6.3 Sliceable Scattering
7 Sparse Tensor with X-Sparse Representation
7.1 The Limitations in Current Scattering APIs
7.2 X-Sparse Tensor
7.3 Counting Sparsity and Analyzing Performance
7.4 Mocking Current Scattering APIs
8 Conclusion
References
Scope and Sense of Explainability for AI-Systems
1 Introduction
2 Superhuman Abilities of AI
3 Forms of Explainability
4 Complex Dynamical Systems
5 Stability and Chaos
6 Nonclassical Approaches, Training of Attractors
7 Causality of Results?
8 Conclusions
References
Use Case Prediction Using Deep Learning
1 Introduction
2 Related Work
2.1 Parts of Speech
2.2 Deep Learning
3 Proposed Approach
4 Experiments and Results
4.1 Datasets Description
4.2 Metrics
5 Conclusions
References
VAMDLE: Visitor and Asset Management Using Deep Learning and ElasticSearch
1 Introduction
2 Background
2.1 Visitor Management and Asset Management
2.2 CNN and MobileNet
2.3 Deep Transfer Learning
2.4 ElasticSearch
2.5 High Performance Computing
3 Design
3.1 Architectural Design
3.2 UI and UX Design
4 Implementation and Evaluation
4.1 Dataset
4.2 Image Pre-Processing and Data Augmentation
4.3 Deep Transfer Learning Model
4.4 Android Application
4.5 Evaluation of the Proposed System
5 Conclusion
References
Wind Speed Time Series Prediction with Deep Learning and Data Augmentation
1 Introduction
2 Related Work
3 Background
3.1 Recurrent Neural Networks
3.2 Data Augmentation
4 Methodology
4.1 Time Series Selection
4.2 Time Series Imputation
4.3 Data Augmentation
4.4 Scaling
4.5 Modelling
4.6 Evaluation
5 Results
6 Discussion
7 Conclusion and Future Work
References
Evaluation for Angular Distortion of Welding Plate
1 Introduction
2 Equipment and CNN
3 Experiment
4 Validation of CNN
5 Conclusions
References
A Framework for Testing and Evaluation of Operational Performance of Multi-UAV Systems
1 Introduction
2 Literature Review
3 Problem Description
3.1 Terminology
3.2 Problem Statement
4 Proposed Framework
4.1 Overview of the Proposed Framework
4.2 Modes of Operation
4.3 Scenarios
4.4 Perception Inference Engine (PIE)
4.5 True Scenario
4.6 Evaluator
5 Synthetic Data Generation
6 Hardware Implementation
7 Experiments and Discussion
7.1 Data Collection and Model Selection
7.2 Deployment of PIE
7.3 Deployment of PIE in Hardware
8 Conclusion and Future Work
References
Addressing Consumer Demands: A Manufacturing Collaboration Process Using Blockchain for Knowledge Representation
1 Introduction
2 Background
2.1 Blockchain
2.2 Related Work
3 Proposed Solution
3.1 Collaborative Network of Entities
3.2 Reasoning and Interaction
3.3 Knowledge Representation
4 Conclusion and Future Work
References
Cellular Formation Maintenance and Collision Avoidance Using Centroid-Based Point Set Registration in a Swarm of Drones
1 Introduction
2 Proposed Approach
2.1 Obstacle Detection
2.2 Collision Avoidance
2.3 Re-formation
3 Simulation and Results
4 Conclusion
References
The Simulation with New Opinion Dynamics Using Five Adopter Categories
1 Introduction
2 Theory
2.1 Opinion Dynamics
2.2 Diffusion of Innovations
3 Modeling
4 Simulations
4.1 Manipulating the Initial Distribution of Opinions
4.2 Manipulation of Confidence Coefficient D_ij
4.3 Manipulating Mass Media Effects
4.4 Manipulating Network Connection Probabilities
5 Discussion
6 Conclusion
References
Intrinsic Rewards for Reinforcement Learning Within Complex 2D Environments
1 Introduction
2 Related Work
3 Data
4 Methods
4.1 Reinforcement Learning Background
4.2 Model Policies
4.3 Model Inputs
4.4 Model Architecture
5 Metrics
5.1 Quantitative Agent Comparison
5.2 Qualitative Comparison
6 Results and Discussion
6.1 Experiment Setup
6.2 Quantitative Results
6.3 Qualitative Results
7 Conclusion and Future Work
References
Analysis of Divided Society at the Standpoint of In-Group and Out-Group Using Opinion Dynamics
1 Introduction
2 Trust-Distrust Model
2.1 Theory of Trust-Distrust Model
2.2 Two-Agents Calculation
2.3 Calculation for 300 Persons
3 Model Setting for Social Simulation
4 Results
4.1 Calculation for the First Model
4.2 Calculation for the Second Model
5 Discussion
6 Conclusion
References
Simulation of Intragroup Alignment Using a New Model of Opinion Dynamics
1 Introduction
2 Theory
3 Simulation Model of Intragroup Alignment
4 Results
4.1 Trust to a Candidate from Voters
4.2 Sub-leaders
5 Discussion
6 Conclusion
References
Random Forest Classification with MapReduce in Holonic Multiagent Systems
1 Introduction
2 Related Work
3 Background
3.1 Multiagent Learning
3.2 Holonic Multiagent Systems
3.3 Decision Trees and Random Forests
4 Materials and Methods
4.1 Y-Combinator
4.2 Decision Tree Classification
4.3 Random Forest Classification
4.4 System Components
5 Results
6 Discussion
7 Conclusion
References
Monitoring Goal Driven Autonomy Agent's Expectations Generated from Durative Effects
1 Introduction
2 Related Work
3 Preliminaries
4 Two Basic Operations
5 Informed Expectations with Durative Effects
6 Regression Expectations
7 Goldilocks Expectations
8 Property of Regression
9 Empirical Evaluation
10 Conclusions
References
Sublinear Regret with Barzilai-Borwein Step Sizes
1 Introduction
1.1 Contributions
2 Problem Formulation
2.1 Algorithms for Online Optimization Problem
2.2 Quasi-Newton Methods
3 The Barzilai-Borwein Quasi-Newton Method
4 Regret Bounds
5 Conclusions
References
Fluid Dynamics of a Pandemic in a Spatial Social Network: A Reflective Measure of the Spreading
1 Introduction
2 Literature Review
3 Methodology
3.1 Preliminaries
3.2 Argumentative Game Theoretical Approach in Social Network
4 Illustrative Example
5 Conclusion
References
Affective Story-Morphing: Manipulating Shelley’s Frankenstein under Program Control using Emotionally Intelligent Agents
1 Introduction and Motivation
2 Story Morphing in the Affective Reasoner
3 How Development Proceeds
4 Additional Aspects of the Affective Reasoner
4.1 Humor
4.2 Case-Based Reasoning
4.3 Applications
4.4 Users as Agents
5 Morphing the Monster
5.1 A Paraphrase of the Original Narrative—Snippet One
5.2 Story Morph Snippet Two
5.3 Story Morph Snippet Three
5.4 Story Morph Four
5.5 Story Morph Five
5.6 Story-Morph Snippet Six
5.7 Story-Morph Snippet Seven
5.8 Story-Morph Snippet Eight
5.9 Story-Morph Snippet Nine
5.10 Story-Morph Snippet Ten
5.11 Some Finer-Grained Variations
6 Implementation
7 Conclusions and Summary
References
Dynamic Strategies and Opponent Hands Estimation for Reinforcement Learning in Gin Rummy Game
1 Introduction
2 Gin Rummy Rules
3 Related Work
4 Static Strategies
4.1 Discard Strategy
4.2 Draw Strategy
4.3 Opponent Hand’s Estimation Strategy
4.4 Knock Strategy
5 Dynamic Strategies
5.1 Dynamic Knock Strategy
5.2 Dynamic Draw/Discard Strategy
6 Experimental Results
7 Conclusion and Future Work
References
Wireless Sensor Network Smart Environment for Precision Agriculture: An Agent-Based Architecture
1 Introduction
1.1 Agriculture Evolution
1.2 Agriculture 4.0 Conceptual Model
2 Enabling Technologies for Precision Agriculture
3 Multi-agent Architecture for Precision Agriculture
3.1 Modeling the Precision Agriculture Smart Environment
4 Agent-Based PA Implementation Directives
4.1 Hardware Specifications
4.2 Software Specifications
4.3 Experiment Environment Setting
5 Conclusions
References
Autonomy Reconsidered: Towards Developing Multi-agent Systems
1 Introduction
2 Related Literature
3 Behavior, Success, and Autonomy
3.1 Absolute Autonomy: Behavior, Success, Fulfillment
3.2 Relative Autonomy: Levels, Asymmetries, Deficiencies
4 Multi-agent Systems
4.1 Group Potential: Synergy and Interference
4.2 Augmentation and Diminishment
5 Summary
References
A Real-Time Intelligent Intra-vehicular Temperature Control Framework
1 Introduction
2 Background
2.1 Object Detection
2.2 Convolutional Neural Network (CNN)
2.3 Controller Area Network (CAN) Bus
2.4 Message Queuing Telemetry Transport (MQTT)
3 Proposed System
3.1 Microcontroller M1
3.2 Microcontroller M2
3.3 Cloud Communication
4 Results
5 Conclusion
References
Intelligent Control of a Semi-autonomous Assistive Vehicle
1 Introduction
2 The Wheelchair
3 Control
3.1 Modelling
3.2 Controller Design
3.3 Path-Following
4 Conclusions and Future Work
References
One Shot Learning Approach to Identify Drivers
1 Introduction
2 The New Approach
3 Discussion and Results
4 Conclusions and Future Work
References
Facial Recognition Software for Identification of Powered Wheelchair Users
1 Introduction
1.1 Facial Recognition Systems
1.2 API
1.3 Software Libraries
2 Facial Recognition System
3 Results
4 Discussion and Conclusions
References
Intelligent User Interface to Control a Powered Wheelchair Using Infrared Sensors
1 Introduction
2 The New System
3 Testing
4 Conclusions and Future Work
References
A Classification Based Ensemble Pruning Framework with Multi-metric Consideration
1 Introduction
2 Related Work
3 The Proposed Framework
3.1 Problem Statement
3.2 Overview of the Proposed Framework
3.3 Ensemble Pruning with Classification Based Optimization
3.4 Multi-Metric Consideration and Its Optimization
4 Empirical Results
4.1 Compared Methods
4.2 Experiments on Benchmark Datasets
5 Application to Fraud Detection Tasks
6 Conclusion
References
Customs Risk Assessment Based on Unsupervised Anomaly Detection Using Autoencoders
1 Introduction
2 Data and Methodology
2.1 Autoencoder
2.2 Variational Autoencoder
3 Results
3.1 Autoencoder on ENS Data
3.2 Autoencoder on Synthetic Data
4 Future Work and Conclusions
4.1 Conclusions
4.2 Future Work
References
Best Next Preference Prediction Based on LSTM and Multi-level Interactions
1 Introduction
2 LSTM Based Recommendations
3 DeepCBPP for Next Preference Predictions
4 LSTM Model Architectures
5 Performance Evaluation
6 Conclusions and Future Work
References
Achieving Trust in Future Human Interactions with Omnipresent AI: Some Postulates
1 Introduction
2 Defining Omnipresent AI
2.1 What is an Omnipresent AI?
2.2 Interaction Models for Omnipresent AI
2.3 Interaction and Trust
2.4 The Aspirations of Omnipresent AI
3 Towards Postulates of Human-Omnipresent AI Interaction
3.1 Proposing a Natural Communication Method
3.2 Presence and Personality
3.3 Proprioception and the Understanding of Context
4 The Postulates of a Trustworthy Human-Omnipresent AI Interaction
5 Speculative Application of AI-Human Interaction in an Autonomous Vehicle
6 Conclusion
References
A Decentralized Explanatory System for Intelligent Cyber-Physical Systems
1 Introduction
2 Background and Related Works
3 Smart Home Scenarios
4 Decentralizing Explanatory Reasoning
4.1 Solution Overview
4.2 A Decentralized Knowledge
4.3 A Unifying Algorithm: D-CAS
4.4 Generating an Explanation
5 Implementation and Results
5.1 The Window Blinds
5.2 The Ventilation Monitoring System
6 Discussions and Future Works
7 Conclusion
References
Construction Control Organization with Use of Computer and Information Technologies in Context of Sustainable Development Providing
1 Introduction
2 Materials and Methods
3 Results
4 Discussion
5 Conclusion
References
Computational Rational Engineering and Development: Synergies and Opportunities
1 Introduction and Motivation
2 Recent and Past Perspectives on Computer Systems for Automation of Engineering and Development
3 Computational Rationality in Engineering Development
3.1 Domain Characteristics: Problem-Solving and Decision-Making in the Context of Industrial Design, Engineering, and Development
3.2 Interdisciplinary Opportunities and Synergies
4 Discussion and Perspectives
4.1 Mind the Gap: Intelligent Systems for Design, Engineering, and Development
4.2 Open Challenges and Prospective Research Directions
5 Concluding Remarks
References
QPSetter: An Artificial Intelligence-Based Web Enabled, Personalized Service Application for Educators
1 Introduction
2 Motivation
3 Related Work
4 System Architecture
4.1 Scraper Module
4.2 Educational Artificial Intelligence (EAI) Module
4.3 The Database
4.4 The Q-Adder
4.5 The User Interface
5 Conclusions
References
Is It Possible to Recognize a Philosophical Zombie and How to Do It
1 Introduction
2 Why is It Necessary to Think About Philosophical Zombies
3 How to Recognize a Philosophical Zombie
4 Could Artificial Intelligent Systems Get Qualia
5 Conclusion
References
Dynamic Analysis of Bitcoin Fluctuations by Means of a Fractal Predictor
1 Introduction
2 Related Work
3 Theoretical Framework
3.1 Bitcoin
3.2 Fractal Theory
4 Proposal
4.1 Theoretical Definition
4.2 Methodology
5 Experimental Results
5.1 Results
5.2 Discussion
6 Conclusions
References
Are Human Drivers a Liability or an Asset?
1 Introduction
2 Do Near Misses Suggest that Collisions May Occur?
2.1 Near Misses
2.2 Bowties
3 Method and Testing
4 Results
5 Discussion and Conclusions
References
Negative Emotions Induced by Non-verbal Video Clips
1 Introduction
2 Experiment
3 Method
3.1 Participants
3.2 Materials
3.3 Procedure
4 Results
5 Conclusion
References
Automatic Recognition of Key Modulations in Symbolic Musical Pieces Using Information Theory
1 Introduction
2 Related Work
3 Key and Modulation
3.1 Harmonic Analysis
4 Information Theory
5 Application and Analysis
5.1 Results
6 Discussion and Conclusions
Appendix A
References
Increasing Robustness for Machine Learning Services in Challenging Environments: Limited Resources and No Label Feedback
1 Introduction
2 Foundations
2.1 Machine Learning
2.2 Concept Drift
2.3 Outlier Detection
3 Problem Definition and Requirements
4 Design Options
4.1 Step 1: Data Validity
4.2 Step 2: Model Robustness
5 Evaluation
5.1 Evaluation of Data Validity (Step 1)
5.2 Evaluation of Model Robustness (Step 2)
5.3 Evaluation of Overall Prediction Method
6 Conclusion
References
Development Support for Intelligent Systems: Test, Evaluation, and Analysis of Microservices
1 Introduction: Microservices in General
2 Challenges with Testing Microservices
3 Analysis of Microservices
3.1 Collecting Key Figures
3.2 Approach
4 Test Concepts
4.1 Test Concept of Eberhard Wolff
4.2 Test Concept of Sam Newman
4.3 Test Concept of Google
4.4 Test Concept of Netflix
5 Tools for Analysis and Testing
5.1 Tools for Isolated Testing
5.2 Analysis
5.3 Netflix
6 Conclusion
References
An Analysis with Dynamics Between Human Motivation and Messaging on Social Networking Services
1 Introduction
1.1 Background and Purpose
1.2 Structure of This Paper
2 Issues of Previous Studies and Our Approach
2.1 Issues of Previous Studies
2.2 Our Approach
3 The Mechanism of Our Messaging Model
3.1 Event Driven Based
3.2 Messaging and Motivation
3.3 Messaging Strategy
4 Simulations
4.1 Initial Conditions
4.2 Validation Test
4.3 Increments of Modification, M_q(t)
4.4 Variable Reliability Factor, M_r(t) and Trust Level, M_tr(t)
4.5 Personalization (Level 1): Random Variation of M_q(t), M_r(t) and M_tr(t)
4.6 Personalization (Level 2): Random Variation of Thresholds for Motivation of Each Node
4.7 Personalization (Level 3): Random Variation Both M_q(t) and Thresholds
4.8 Personalization (Level 4): Random Variation Both M_q(t) and Thresholds with P/N Opinions
5 Discussions and Future Works
6 Conclusions
References
Author Index

Lecture Notes in Networks and Systems 294

Kohei Arai   Editor

Intelligent Systems and Applications Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 1

Lecture Notes in Networks and Systems Volume 294

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/15179


Editor
Kohei Arai, Faculty of Science and Engineering, Saga University, Saga, Japan

ISSN 2367-3370 ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-3-030-82192-0 ISBN 978-3-030-82193-7 (eBook)
https://doi.org/10.1007/978-3-030-82193-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Editor’s Preface

We are very pleased to introduce the Proceedings of the Intelligent Systems Conference (IntelliSys) 2021, which was held on September 2 and 3, 2021. The entire world was affected by COVID-19 and our conference was not an exception. To provide a safe conference environment, IntelliSys 2021, which was planned to be held in Amsterdam, Netherlands, was moved fully online. The Intelligent Systems Conference is a prestigious annual conference on areas of intelligent systems and artificial intelligence and their applications to the real world. This conference not only presented state-of-the-art methods and valuable experience, but also provided the audience with a vision of further development in the fields. One of the meaningful and valuable dimensions of this conference is the way it brings together researchers, scientists, academics, and engineers in the field from different countries. The aim was to further increase the body of knowledge in this specific area by providing a forum to exchange ideas and discuss results, and to build international links. The Program Committee of IntelliSys 2021 represented 25 countries, and authors from 50+ countries submitted a total of 496 papers. This certainly attests to the widespread, international importance of the theme of the conference. Each paper was reviewed on the basis of originality, novelty, and rigor. After the reviews, 195 papers were accepted for presentation, of which 180 (including 7 posters) are published in the proceedings. These papers provide good examples of current research on relevant topics, covering deep learning, data mining, data processing, human–computer interactions, natural language processing, expert systems, robotics, and ambient intelligence, to name a few.
The conference would truly not function without the contributions and support received from authors, participants, keynote speakers, program committee members, session chairs, organizing committee members, steering committee members, and others in their various roles. Their valuable support, suggestions, dedicated commitment, and hard work have made IntelliSys 2021 successful. We warmly thank and greatly appreciate the contributions, and we kindly invite all to continue to contribute to future IntelliSys.

We believe this event will certainly help further disseminate new ideas and inspire more international collaborations.

Kind Regards,
Kohei Arai

Contents

Late Fusion of Convolutional Neural Network with Wavelet-Based Ensemble Classifier for Acoustic Scene Classification
Cheng Siong Chin and Jianhua Zhang

Deep Learning and Social Media for Managing Disaster: Survey
Zair Bouzidi, Abdelmalek Boudries, and Mourad Amad

A Framework for Adaptive Mobile Ecological Momentary Assessments Using Reinforcement Learning
Lihua Cai, Laura E. Barnes, and Mehdi Boukhechba

Reputation Analysis Based on Weakly-Supervised Bi-LSTM-Attention Network
Kun Xiang and Akihiro Fujii

Multi-GPU-based Convolutional Neural Networks Training for Text Classification
Imen Ferjani, Minyar Sassi Hidri, and Ali Frihida

Performance Analysis of Data-Driven Techniques for Solving Inverse Kinematics Problems
Vijay Bhaskar Semwal and Yash Gupta

Machine Learning Based H2 Norm Minimization for Maglev Vibration Isolation Platform
Ahmet Fevzi Bozkurt, Barış Can Yalçın, and Kadir Erkan

A Vision Based Deep Reinforcement Learning Algorithm for UAV Obstacle Avoidance
Jeremy Roghair, Amir Niaraki, Kyungtae Ko, and Ali Jannesari

Detecting and Fixing Nonidiomatic Snippets in Python Source Code with Deep Learning
Balázs Szalontai, András Vadász, Zsolt Richárd Borsi, Teréz A. Várkonyi, Balázs Pintér, and Tibor Gregorics

BreakingBED: Breaking Binary and Efficient Deep Neural Networks by Adversarial Attacks
Manoj-Rohit Vemparala, Alexander Frickenstein, Nael Fasfous, Lukas Frickenstein, Qi Zhao, Sabine Kuhn, Daniel Ehrhardt, Yuankai Wu, Christian Unger, Naveen-Shankar Nagaraja, and Walter Stechele

Parallel Dilated CNN for Detecting and Classifying Defects in Surface Steel Strips in Real-Time
Khaled R. Ahmed

Selective Information Control and Network Compression in Multi-layered Neural Networks
Ryotaro Kamimura

DAC–Deep Autoencoder-Based Clustering: A General Deep Learning Framework of Representation Learning
Si Lu and Ruisi Li

Enhancing LSTM Models with Self-attention and Stateful Training
Alexander Katrompas and Vangelis Metsis

Domain Generalization Using Ensemble Learning
Yusuf Mesbah, Youssef Youssry Ibrahim, and Adil Mehood Khan

Research on Text Classification Modeling Strategy Based on Pre-trained Language Model
Yiou Lin, Hang Lei, Xiaoyu Li, and Yu Deng

Discovering Nonlinear Dynamics Through Scientific Machine Learning
Lei Huang, Daniel Vrinceanu, Yunjiao Wang, Nalinda Kulathunga, and Nishath Ranasinghe

Tensor Data Scattering and the Impossibility of Slicing Theorem
Wuming Pan

Scope and Sense of Explainability for AI-Systems
A.-M. Leventi-Peetz, T. Östreich, W. Lennartz, and K. Weber

Use Case Prediction Using Deep Learning
Tinashe Wamambo, Cristina Luca, Arooj Fatima, and Mahdi Maktab-Dar-Oghaz

VAMDLE: Visitor and Asset Management Using Deep Learning and ElasticSearch
Viswanathsingh Seenundun, Balkrishansingh Purmah, and Zahra Mungloo-Dilmohamud

Wind Speed Time Series Prediction with Deep Learning and Data Augmentation
Anibal Flores, Hugo Tito-Chura, and Victor Yana-Mamani

Evaluation for Angular Distortion of Welding Plate
Shigeru Kato, Shunsaku Kume, Takanori Hino, Fujioka Shota, Tomomichi Kagawa, Hironori Kumeno, and Hajime Nobuhara

A Framework for Testing and Evaluation of Operational Performance of Multi-UAV Systems
Mrinmoy Sarkar, Xuyang Yan, Shamila Nateghi, Bruce J. Holmes, Kyriakos G. Vamvoudakis, and Abdollah Homaifar

Addressing Consumer Demands: A Manufacturing Collaboration Process Using Blockchain for Knowledge Representation
Ricardo Barbosa, Ricardo Santos, and Paulo Novais

Cellular Formation Maintenance and Collision Avoidance Using Centroid-Based Point Set Registration in a Swarm of Drones
Jawad N. Yasin, Huma Mahboob, Mohammad-Hashem Haghbayan, Muhammad Mehboob Yasin, and Juha Plosila

The Simulation with New Opinion Dynamics Using Five Adopter Categories
Makoto Fujii and Akira Ishii

Intrinsic Rewards for Reinforcement Learning Within Complex 2D Environments
Nathaniel Grabaskas and Zhizhen Wang

Analysis of Divided Society at the Standpoint of In-Group and Out-Group Using Opinion Dynamics
Nozomi Okano and Akira Ishii

Simulation of Intragroup Alignment Using a New Model of Opinion Dynamics
Nozomi Okano, Hitoshi Yamamoto, Masaru Nishikawa, and Akira Ishii

Random Forest Classification with MapReduce in Holonic Multiagent Systems
Michéle Cullinan and Duncan Coulter

Monitoring Goal Driven Autonomy Agent’s Expectations Generated from Durative Effects
Noah Reifsnyder and Hector Munoz-Avila

Sublinear Regret with Barzilai-Borwein Step Sizes
Iyanuoluwa Emiola

Fluid Dynamics of a Pandemic in a Spatial Social Network: A Reflective Measure of the Spreading
Saad Alqithami

Affective Story-Morphing: Manipulating Shelley’s Frankenstein under Program Control using Emotionally Intelligent Agents
Clark Elliott

Dynamic Strategies and Opponent Hands Estimation for Reinforcement Learning in Gin Rummy Game
Yuexing Hao and Mark Vaysiberg

Wireless Sensor Network Smart Environment for Precision Agriculture: An Agent-Based Architecture
AbdulMutalib Wahaishi and Raafat Aburukba

Autonomy Reconsidered: Towards Developing Multi-agent Systems
Michael A. Goodrich, Julie A. Adams, and Matthias Scheutz

A Real-Time Intelligent Intra-vehicular Temperature Control Framework
Daniel Jacuinde-Alvarez, James Dols, and Shahab Tayeb

Intelligent Control of a Semi-autonomous Assistive Vehicle
David Sanders, Giles Tewkesbury, Malik Haddad, Ya Huang, and Boriana Vatchova

One Shot Learning Approach to Identify Drivers
Malik Haddad, David Sanders, Martin Langner, and Giles Tewkesbury

Facial Recognition Software for Identification of Powered Wheelchair Users
Giles Tewkesbury, Samuel Lifton, Malik Haddad, David Sanders, and Alex Gegov

Intelligent User Interface to Control a Powered Wheelchair Using Infrared Sensors
Malik Haddad, David Sanders, Giles Tewkesbury, Martin Langner, and Sarinova Simandjuntak

A Classification Based Ensemble Pruning Framework with Multi-metric Consideration
Ya-Lin Zhang, Qitao Shi, Meng Li, Xinxing Yang, Longfei Li, and Jun Zhou

Customs Risk Assessment Based on Unsupervised Anomaly Detection Using Autoencoders
Dion T. Oosterman, Wouter H. Langenkamp, and Ellen L. van Bergen

Best Next Preference Prediction Based on LSTM and Multi-level Interactions
Ivett Fuentes, Gonzalo Nápoles, Leticia Arco, and Koen Vanhoof

Achieving Trust in Future Human Interactions with Omnipresent AI: Some Postulates
Peer Sathikh, Zong Rui Dexter Fang, and Guan Yi Tan

A Decentralized Explanatory System for Intelligent Cyber-Physical Systems
Étienne Houzé, Jean-Louis Dessalles, Ada Diaconescu, David Menga, and Mathieu Schumann

Construction Control Organization with Use of Computer and Information Technologies in Context of Sustainable Development Providing
Zalina Ruslanovna Tuskaeva and Zaurbek Valerievich Albegov

Computational Rational Engineering and Development: Synergies and Opportunities
Ramses Sala

QPSetter: An Artificial Intelligence-Based Web Enabled, Personalized Service Application for Educators
Mohammad Ali Kadampur and Sulaiman Al Riyaee

Is It Possible to Recognize a Philosophical Zombie and How to Do It
R. V. Dushkin

Dynamic Analysis of Bitcoin Fluctuations by Means of a Fractal Predictor
Jesús Jaime Moreno Escobar, Oswaldo Morales Matamoros, Ana Lilia Coria Páez, and Ricardo Tejeida Padilla

Are Human Drivers a Liability or an Asset?
David Sanders, Malik Haddad, Giles Tewkesbury, Alex Gegov, and Mo Adda

Negative Emotions Induced by Non-verbal Video Clips
Flavia De Simone, Simona Collina, and Manuela Nuzzo

Automatic Recognition of Key Modulations in Symbolic Musical Pieces Using Information Theory
Michele Della Ventura

Increasing Robustness for Machine Learning Services in Challenging Environments: Limited Resources and No Label Feedback
Lucas Baier, Niklas Kühl, and Jörg Schmitt

Development Support for Intelligent Systems: Test, Evaluation, and Analysis of Microservices
Charline von Perbandt, Matthias Tyca, Arne Koschel, and Irina Astrova

An Analysis with Dynamics Between Human Motivation and Messaging on Social Networking Services
Hidehiro Matsumoto and Akira Ishii

Author Index

Late Fusion of Convolutional Neural Network with Wavelet-Based Ensemble Classifier for Acoustic Scene Classification

Cheng Siong Chin1(B) and Jianhua Zhang2

1 Faculty of Science, Agriculture, and Engineering, Newcastle University Singapore, Singapore 599493, Singapore
[email protected]
2 School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, Shandong, China

Abstract. Log-Mel spectrograms for a convolutional neural network (CNN) and wavelet time scattering for an ensemble of subspace discriminant classifiers are used to classify acoustic scenes that contain human speech. The Tampere University of Technology (TUT) Acoustic Scenes dataset is used to demonstrate the feasibility of the proposed model. Comparisons are performed with the baseline model on the TUT 2017 dataset used for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge, Task 1. The fused model shows a good acoustic classification accuracy of 79.43%. The proposed multi-model late fusion of the CNN and the ensemble classifiers exhibits 18.4 percentage points higher accuracy than the baseline model.

Keywords: Acoustic scene classification · Time scattering · Acoustic classification accuracy · Convolutional Neural Network · Wavelet multi-model late fusion system

1 Introduction

Acoustic scene classification (ASC) [1–4] classifies audio signals into a pre-selected list of scene types such as car parks, parks, and meeting rooms. The problem resembles speech recognition; the main difference is that the target classes are more diverse. There are various applications of ASC. For example, a mobile device can use acoustic event recognition to detect that an individual is in a meeting and then switch to silent mode automatically. ASC has been used in robots [5, 6], mobile devices [7–9], traffic [10, 11], and medical systems [12, 13]. One of the standard scientific challenges in ASC is the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge. The primary goal of ASC is to obtain the best acoustic classification accuracy in assigning audio recordings to the environment in which they were recorded.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 1–11, 2022. https://doi.org/10.1007/978-3-030-82193-7_1


Many ASC systems have used Convolutional Neural Networks (CNN) [14], Recurrent Neural Networks (RNN) [15], Support Vector Machines (SVM) [16], Gaussian Mixture Models [17], and Multilayer Perceptrons [18, 19]. Recurrent network architectures such as convolutional recurrent networks (CRNN), bidirectional Long Short-Term Memory (Bi-LSTM), and LSTM [20] were also used. However, LSTM inherently suffers from vanishing and exploding gradient problems. As observed in the DCASE Challenge, the best-performing systems used CNN. An ensemble of neural networks [21] and ensemble classifiers [22] were also used. The former approach, using CNN, has outperformed other approaches to the ASC task [23–25]. The latter has also demonstrated good acoustic classification accuracy with a shorter computation time than CNN.

To improve generalization, Mel-frequency cepstral coefficients (MFCCs) [26] and other signal representations such as the Constant Q Transform (CQT) [27] and wavelet time scattering [28] have been used to extract the acoustic features of the raw data. Multiple spectrograms [29, 30] such as MFCC, short-time Fourier transform (STFT), and CQT were also utilized to increase the number of features for training. This has shown positive results, as more time-frequency characteristics could be extracted. However, the computation time increases as more features have to be processed.

In this paper, a multi-model late fusion is used for ASC. A CNN is a reasonable choice: it is given a time-frequency representation of the audio to capture the spectro-temporal modulation patterns that identify various acoustic scenes. The time and frequency axes of this representation relate to the width and height dimensions of the convolutional filters, respectively. To reduce overfitting in the CNN, a Mixup algorithm [31] is used. The original and mixed datasets are combined to train a CNN [14] using the log-Mel spectrograms. It is followed by an ensemble random subspace discriminant classifier using wavelet scattering [28]. The Tampere University of Technology (TUT) 2017 dataset [32] used for the DCASE 2017 Challenge, Task 1, is used for both training and evaluation.

2 Proposed Methodology

There are 4680 and 1620 labeled audio files for training and evaluation, respectively. The original TUT-2017 dataset was recorded in different environmental scenes at different recording locations, with some human speech recorded. There is no more than 5 min of audio recorded at each site. The original recordings are split into 10 s segments, and each audio segment is stored as a sound file. The details of the outdoor (both open and enclosed areas) and indoor acoustic scenes can be found in the TUT-2017 dataset [32]. The 15 acoustic scenes are as follows.

• bus
• forest path
• home
• city center
• cafe
• lakeside beach (outdoor)
• library (indoor)
• car
• grocery store
• urban park (outdoor)
• office: multiple persons (indoor)
• metro station (indoor)
• train (traveling, vehicle)
• residential area (outdoor)
• tram (traveling, vehicle)

The recordings were made in different streets, homes, and parks [32]. Sound recordings were performed via different devices at 24-bit resolution and a 44100 Hz sampling rate. The microphones were worn during recording.

2.1 Pre-processing and Feature Extraction

The pre-processing steps for the log-scale Mel-spectrogram are briefly described below.

• The acoustic signal is sampled at 44100 Hz. The audio clips are then normalized.
• The audio is converted to mid-side encoded [14] data to obtain good spatial information for the CNN to detect moving sources.
• The signal is then divided into 1 s segments with an overlap of 0.5 s. This makes the network easier to train and reduces overfitting to certain acoustic events. The overlap also increases the amount of data available for the subsequent data augmentation.
• A short-time Fourier transform with a window size of 2048 samples and a hop size of 1024 samples is used, so consecutive windows overlap by 1024 samples. The spectrogram is mapped onto a 128-bin Mel scale and then converted into a logarithmic scale.
• The log-Mel spectrogram data are reshaped before they are used as images for training the CNN. The first two dimensions are the height and width of the image, followed by the channels and the segments. For example, the size is 128 × 42 × 2 × 19.
• The training labels are replicated to correspond with the 19 segments.
• The dataset is augmented via Mixup [31]. It mixes the features of two different classes in equal proportion, as shown.

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j \tag{1}$$

$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j \tag{2}$$

where $x_i$ and $x_j$ are from dissimilar classes. The corresponding class labels are denoted by $y_i$ and $y_j$, respectively. A mixing value of $\lambda = 0.5$ is used.

2.2 Convolutional Neural Network

The CNN’s architecture can be seen in Table 1. Batch normalization (BN) and the rectified linear unit (ReLU) [33] are used. The ReLU increases the non-linearity in the images. Batch normalization is used as a regularizer to prevent overfitting. The activation function and BN are located before the convolution layer to improve the acoustic classification accuracy. The max-pooling layers come after the convolution process. The feature map that includes a prominent feature is obtained from the output of the max-pooling layer. The average pooling reduces the activation by combining the non-maximal activations. The last few layers include a dropout layer that removes 50% of the visible and hidden units to reduce overfitting. The fully connected layer combines the data to form the output for the second-to-last layer, which uses the softmax activation function to obtain the probabilities of the input belonging to each of the 15 classes. Lastly, the classification layer produces the final classification.

Table 1. CNN architecture.

Description of each layer
imageInputLayer - 128 × 42 × 2
batchNormalizationLayer
convolution2dLayer - 32 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
convolution2dLayer - 32 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
maxPooling2dLayer - pool size 3 × 3, stride 2 × 2 and zero padding
convolution2dLayer - 32 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
convolution2dLayer - 32 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
maxPooling2dLayer - pool size 3 × 3, stride 2 × 2 and zero padding
convolution2dLayer - 128 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
convolution2dLayer - 128 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
maxPooling2dLayer - pool size 3 × 3, stride 2 × 2 and zero padding
convolution2dLayer - 256 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
convolution2dLayer - 256 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
averagePooling2dLayer - pool size 16 × 6
dropoutLayer(0.5)
fullyConnectedLayer(15)
softmaxLayer
classificationLayer
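The sizes quoted in Sects. 2.1 and 2.2 can be cross-checked with a little arithmetic, and the Mixup step of Eqs. (1) and (2) is a simple convex combination. The Python sketch below is illustrative only and is not the authors' implementation. The helper names are hypothetical, the STFT is assumed to keep only complete windows (no edge padding), and each pooling layer is assumed to use zero padding such that a stride of 2 yields ceil(n / 2) outputs.

```python
import math

FS = 44_100            # sampling rate in Hz (Sect. 2.1)
WIN, HOP = 2048, 1024  # STFT window and hop size in samples (Sect. 2.1)


def num_segments(clip_s, seg_s, hop_s):
    """Number of full segments of length seg_s taken every hop_s seconds."""
    return int((clip_s - seg_s) // hop_s) + 1


def num_frames(n_samples, win=WIN, hop=HOP):
    """STFT frames when only complete windows are kept (assumed: no padding)."""
    return (n_samples - win) // hop + 1


def pooled_size(n, stride=2):
    """Output length of a strided pooling layer (assumed: ceil(n / stride))."""
    return math.ceil(n / stride)


def mixup(x_i, x_j, y_i, y_j, lam=0.5):
    """Eqs. (1)-(2): mix two feature vectors and their one-hot labels."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


# 10 s clips cut into 1 s segments that start every 0.5 s.
segments = num_segments(10.0, 1.0, 0.5)

# Frames in a 1 s segment, then the map size after three stride-2 poolings.
frames = num_frames(FS * 1)
height, width = 128, frames
for _ in range(3):
    height, width = pooled_size(height), pooled_size(width)

# Toy two-feature examples from two different classes (hypothetical values).
x_mix, y_mix = mixup([0.0, 2.0], [4.0, 6.0], [1, 0], [0, 1])

print(segments, frames, (height, width), x_mix, y_mix)
```

Under these assumptions the sketch gives 19 segments per clip, 42 frames per segment, and a 16 × 6 feature map after the three max-pooling layers, which agrees with the 128 × 42 × 2 × 19 input size and the 16 × 6 average-pooling window quoted above. With λ = 0.5 the mixed label is an even split between the two classes.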

2.3 Wavelet Scattering

The next step involves feature extraction using wavelet scattering for the subsequent ensemble classifiers. Wavelet scattering provides a good representation [28] of the time-frequency content of a signal. The first- and second-order coefficients are used, as they capture most of the signal energy. The transform uses two filter banks of 1D Morlet wavelets: the first filter bank has a resolution of Q1 = 4, and the second has a resolution of Q2 = 1. The averaging filter (invariance scale) has a duration of 0.75 s, which sets the duration of the modulation structure that is captured. The sampling frequency is 44100 Hz.

2.4 Ensemble Classifiers

The proposed ensemble classifiers include different discriminant analysis learners, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and regularized linear discriminant analysis (RDA), with other treatments of the predictor covariance. The random subspace learning method is used to increase the acoustic classification accuracy. In the random subspace method, the feature subspaces are chosen randomly from the original feature space. The final prediction of the individual classifiers is then obtained using majority voting.

2.5 Fusion of CNN and Classifiers

The prediction results of the CNN and of the ensemble classifiers indicate the relative confidence of their predictions. Multiplying the responses and determining the maximum response creates a late fusion system that inherits the merits of each method.

$$\mathrm{class\_pred}_i = \arg\max\left(\mathrm{prob}_i^{\mathrm{CNN}},\ \mathrm{prob}_i^{\mathrm{ensem\_class}}\right) \tag{3}$$

where $\mathrm{prob}_i^{\mathrm{CNN}}$ and $\mathrm{prob}_i^{\mathrm{ensem\_class}}$ are the class probabilities of sound recording $i$ from the CNN and the ensemble classifiers, respectively, and the maximum is taken over the classes after the two responses are multiplied.
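The fusion rule of Eq. (3) can be sketched in a few lines of Python. The example assumes that "multiplying the responses" means an element-wise product of the two per-class probability vectors followed by an arg max over the classes. The function name and the toy scores are hypothetical and are not taken from the paper.

```python
def late_fusion(prob_cnn, prob_ens):
    """Return (predicted class index, fused scores) for one recording.

    Both inputs hold one probability per class, in the same class order.
    """
    if len(prob_cnn) != len(prob_ens):
        raise ValueError("both models must score the same classes")
    # Element-wise product: a class scores high only if both models agree.
    fused = [p * q for p, q in zip(prob_cnn, prob_ens)]
    # Index of the largest fused response (first one wins on ties).
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Hypothetical scores for a 3-class toy problem.
prob_cnn = [0.50, 0.40, 0.10]  # the CNN slightly prefers class 0
prob_ens = [0.20, 0.70, 0.10]  # the ensemble clearly prefers class 1
pred, fused = late_fusion(prob_cnn, prob_ens)
print(pred, fused)
```

In this toy case the fused decision is class 1: the ensemble is confident about it and the CNN rates it a close second, so its product is the largest. A product rule of this kind lets a confident model override a hesitant one, which is one plausible reading of how the fused system inherits the merits of both.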

3 Results and Discussion

The configurations of the proposed model are as follows.

• Optimizer: stochastic gradient descent with momentum, with a learning rate of 0.05
• Size of the mini-batch for each training iteration: 128
• Momentum: 0.9
• Maximum number of epochs: 8
• Factor for L2 regularization: 0.005
• Number of epochs for dropping the learning rate: 2
• Multiplicative factor applied to the learning rate at each drop: 0.2
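The learning-rate settings in the list above describe a step schedule. The short Python sketch below shows how such a schedule evolves over the eight epochs. It assumes a piecewise-constant schedule in which the factor 0.2 is applied once every 2 epochs, which is one reading of the last two items; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.05, drop_factor=0.2, drop_period=2):
    """Piecewise-constant learning rate for a 1-based epoch index."""
    if epoch < 1:
        raise ValueError("epoch is 1-based")
    # Number of drops that have already happened before this epoch.
    n_drops = (epoch - 1) // drop_period
    return base_lr * drop_factor ** n_drops


schedule = [learning_rate(epoch) for epoch in range(1, 9)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.4g}")
```

Under this assumption the rate stays at 0.05 for the first two epochs and is multiplied by 0.2 three times in total, ending at 0.05 × 0.2³ = 0.0004 for the last two epochs.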

The training data are shuffled before each training epoch. The entire experiment, including the pre-processing, took no more than three hours. The short audio segments (see Fig. 1) provide less information, which makes ASC difficult. A sample of the extracted Mel-spectrogram segments for the "lakeside beach" scene is shown in Fig. 1. Frequency is displayed along the y-axis, time along the x-axis, and the signal’s energy at a particular time and frequency is shown by the color map. An Intel® Core i7 CPU and a GeForce RTX 2060 were used.

Fig. 1. Example of segments of Mel-spectrogram from the lakeside beach scene.


The acoustic classification accuracy can be seen in Table 2. On average, the ensemble classifiers have a higher acoustic classification accuracy than the CNN. Compared to the baseline model (2 layers × 50 hidden units, 20% dropout), the fused model exhibits 18.4 percentage points higher accuracy. The details of the baseline model can be found in [32].

Table 2. Acoustic classification accuracy (%) of different models.

| Scene | Baseline model [32] | CNN model | Ensemble classifiers model | Fused model |
|---|---|---|---|---|
| Beach | 40.7 | 73.1 | 37.9 | 50.9 |
| Bus | 38.9 | 58.3 | 92.5 | 87.9 |
| cafe/restaurant | 43.5 | 74.0 | 82.4 | 82.4 |
| Car | 64.8 | 100 | 76.8 | 88.8 |
| city-center | 79.6 | 88.8 | 91.6 | 93.5 |
| forest path | 85.2 | 97.2 | 96.2 | 98.1 |
| grocery store | 49.1 | 70.3 | 79.6 | 79.6 |
| Home | 76.9 | 89.8 | 76.8 | 91.6 |
| Library | 30.6 | 49.0 | 36.1 | 40.7 |
| metro station | 93.5 | 100 | 95.3 | 100 |
| Office | 73.1 | 80.5 | 83.3 | 84.2 |
| Park | 32.4 | 20.3 | 68.5 | 60.1 |
| residential area | 77.8 | 63.8 | 77.7 | 81.4 |
| Train | 72.2 | 76.8 | 85.1 | 82.4 |
| Tram | 57.4 | 57.4 | 64.8 | 69.4 |
| Average | 61.0 | 73.3 | 76.3 | 79.4 |
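The averages in Table 2 and the reported gain over the baseline can be reproduced directly from the per-scene values. The Python sketch below copies the numbers from Table 2 (in row order, Beach to Tram); the variable names are hypothetical and the unweighted mean over the 15 scenes is an assumption about how the averages were formed.

```python
# Per-scene accuracy (%) copied from Table 2, in row order (Beach ... Tram).
ACCURACY = {
    "baseline": [40.7, 38.9, 43.5, 64.8, 79.6, 85.2, 49.1, 76.9,
                 30.6, 93.5, 73.1, 32.4, 77.8, 72.2, 57.4],
    "cnn":      [73.1, 58.3, 74.0, 100.0, 88.8, 97.2, 70.3, 89.8,
                 49.0, 100.0, 80.5, 20.3, 63.8, 76.8, 57.4],
    "ensemble": [37.9, 92.5, 82.4, 76.8, 91.6, 96.2, 79.6, 76.8,
                 36.1, 95.3, 83.3, 68.5, 77.7, 85.1, 64.8],
    "fused":    [50.9, 87.9, 82.4, 88.8, 93.5, 98.1, 79.6, 91.6,
                 40.7, 100.0, 84.2, 60.1, 81.4, 82.4, 69.4],
}


def mean(values):
    """Unweighted mean over the scenes."""
    return sum(values) / len(values)


# Column averages, rounded to one decimal as in the table.
averages = {name: round(mean(vals), 1) for name, vals in ACCURACY.items()}

# Gain of the fused model over the baseline, in percentage points.
gain = round(mean(ACCURACY["fused"]) - mean(ACCURACY["baseline"]), 1)

# Scenes where fusion is at least as good as the better single model.
n_fused_best = sum(
    f >= max(c, e)
    for c, e, f in zip(ACCURACY["cnn"], ACCURACY["ensemble"], ACCURACY["fused"])
)

print(averages, gain, n_fused_best)
```

The unweighted means match the Average row (61.0, 73.3, 76.3, and 79.4), and the difference between the fused and baseline averages is 18.4 percentage points. By this count the fused model matches or exceeds both single models in 9 of the 15 scenes.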

Although the result for the beach scene using the ensemble classifiers (37.96%) is quite poor compared to the CNN (73.14%), the fused model raises the acoustic classification accuracy above that of the ensemble classifiers, to 50.92%. Conversely, the accuracy for the park scene using the CNN model is relatively low compared to the ensemble classifiers, and the fused model increases it to 60.18%. The confusion matrices of the CNN, the ensemble classifiers, and the fused model are shown in Fig. 2. The confusion chart of the multi-model late fusion system shows better acoustic classification accuracy for city-center, forest path, and metro station than for the other scenes. The average acoustic classification accuracy of the fused model is 79.43%. The false-negative rate for the residential area is around 51.6%, with a false discovery rate of 18.5%. The false-negative rate is quite negligible for the remaining classes.


Fig. 2. Confusion charts of CNN (top), Ensemble classifiers (Middle), and Multi-model late fusion model (Bottom).


4 Conclusion

A multi-model late fusion system consisting of log-Mel spectrograms for a convolutional neural network and wavelet time scattering for an ensemble of subspace discriminant classifiers was proposed. Acoustic scene classification aims to classify acoustic scenes in different environments such as the park, car park, beach, and city-center. On the TUT Acoustic Scenes dataset, the fused model gives a good acoustic classification accuracy of 79.43%. The proposed multi-model late fusion system exhibits 18.4 percentage points higher acoustic classification accuracy than the baseline model, despite relatively low performance in a few scenes such as the beach and library. Nevertheless, the multi-model late fusion system shows good acoustic classification accuracy for most of the scenes. In future work, adaptive hyperparameter tuning and advanced feature extraction methods will be used to improve the performance further.

References

1. Mesaros, A., et al.: Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Trans. Audio Speech Lang. Process. 26(2), 379–393 (2018)
2. Mesaros, A., Diment, A., Elizalde, B., Heittola, T., Vincent, E., Raj, B., Virtanen, T.: Sound event detection in the DCASE 2017 challenge. IEEE/ACM Trans. Audio Speech Lang. Process. 27(6), 992–1006 (2019)
3. Rakotomamonjy, A.: Supervised representation learning for audio scene classification. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1253–1265 (2017)
4. Trowitzsch, I., Mohr, J., Kashef, Y., Obermayer, K.: Robust detection of environmental sounds in binaural auditory scenes. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1344–1356 (2017)
5. Ribeiro, P.O.C.S., et al.: Underwater place recognition in unknown environments with triplet based acoustic image retrieval. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, pp. 524–529 (2018)
6. Aziz, S., Awais, M., Akram, T., Khan, U., Alhussein, M., Aurangzeb, K.: Automatic scene recognition through acoustic classification for behavioral robotics. Electronics 8(5), 483–500 (2019)
7. Kojima, R., Sugiyama, O., Hoshiba, K., Suzuki, R., Nakadai, K.: HARK-Bird-Box: a portable real-time bird song scene analysis system. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, pp. 2497–2502 (2018)
8. Xu, X., Yu, J., Chen, Y., Zhu, Y., Qian, S., Li, M.: Leveraging audio signals for early recognition of inattentive driving with smartphones. IEEE Trans. Mob. Comput. 17(7), 1553–1567 (2018)
9. Song, X., Wang, M., Qiu, H., Li, K., Ang, C.: Auditory scene analysis-based feature extraction for indoor subarea localization using smartphones. IEEE Sens. J. 19(15), 6309–6316 (2019)
10. Jiang, D., et al.: An audio data representation for traffic acoustic scene recognition. IEEE Access 8, 177863–177873 (2020)
11. Wang, L., Roggen, D.: Sound-based transportation mode recognition with smartphones. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 930–934 (2019)
12. Li, Y., Chen, F., Sun, Z., Ji, J., Jia, W., Wang, Z.: A smart binaural hearing aid architecture leveraging a smartphone APP with deep-learning speech enhancement. IEEE Access 8, 56798–56810 (2020)
13. Vivek, V.S., Vidhya, S., Madhanmohan, P.: Acoustic scene classification in hearing aid using deep learning. In: 2020 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, pp. 0695–0699 (2020)
14. Han, Y., Park, J., Lee, K.: Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany (2017)
15. Pham, L., Doan, T., Ngo, D.T., Nguyen, H., Kha, H.H.: CDNN-CRNN joined model for acoustic scene classification. In: Detection and Classification of Acoustic Scenes and Events 2019 (DCASE2019) Challenge, Technical Report (2019)
16. Jimenez, A., Elizalde, B., Raj, B.: DCASE 2017 Task 1: acoustic scene classification using shift-invariant kernels and random features. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany (2017)
17. Fraile, R., Reina, J.C., Arriola, J.G., Blanco, E.: Classification of acoustic scenes based on modulation spectra and position-pitch maps. In: Detection and Classification of Acoustic Scenes and Events 2019 (DCASE2019) Challenge, Technical Report (2019)
18. Bilot, V., Duong, N.Q.K., Ozerov, A.: Acoustic scene classification with multiple instance learning and fusion. In: Detection and Classification of Acoustic Scenes and Events 2019 (DCASE2019) Challenge, Technical Report (2019)
19. Foleiss, J., Tavares, T.: MLP-based feature learning for automatic acoustic scene classification. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany (2017)
20. Hao, W., Zhao, L., Zhang, Q., Zhao, H., Wang, J.: DCASE 2018 Task 1A: acoustic scene classification by Bi-LSTM-CNN-net multichannel fusion. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Surrey, UK (2018)
21. Sakashita, Y., Aono, M.: Acoustic scene classification by ensemble of spectrograms based on adaptive temporal divisions. In: Proceedings DCASE2018, Woking, Surrey, UK (2018)
22. Maka, T.: Auditory scene classification using ensemble learning with small audio feature space. In: Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Technical Report (2018)
23. Vafeiadis, A., et al.: Acoustic scene classification: from a hybrid classifier to deep learning. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany (2017)
24. Zhang, T., Liang, J., Ding, B.: Acoustic scene classification using deep CNN with fine-resolution feature. Expert Syst. Appl. 143, 34 (2020)
25. Valenti, M., Squartini, S., Diment, A., Parascandolo, G., Virtanen, T.: A convolutional neural network approach for acoustic scene classification. In: Proceedings IJCNN, Anchorage, Alaska, pp. 1547–1554 (2017)
26. Ghodasara, V., Waldekar, S., Paul, D., Saha, G.: Acoustic scene classification using block based MFCC features. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Budapest, Hungary (2016)
27. Hong, L.: Acoustic scene classification using Mel-spectrum and CQT based neural network ensemble. In: Detection and Classification of Acoustic Scenes and Events 2020 (DCASE2020) Challenge, Technical Report (2020)
28. Chin, C.S., Kek, X.Y., Chan, T.K.: Wavelet scattering based gated recurrent units for binaural acoustic scenes classification. In: 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), pp. 1–5 (2020)
29. Zheng, W., Mo, Z., Xing, X., Zhao, G.: CNNs-based acoustic scene classification using multi-spectrogram fusion and label expansions. CoRR abs/1809.01543 (2018)
30. Zheng, W., Yi, J., Xing, X., Liu, X., Peng, S.: Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, Munich, Germany (2017)
31. Huszár, F.: Mixup: data-dependent data augmentation. inFERENCe (2017). https://www.inference.vc/mixup-data-dependent-data-augmentation/. Accessed 15 Jan 2019
32. Mesaros, A., et al.: DCASE 2017 challenge setup: tasks, datasets and baseline system. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp. 85–92 (2017)
33. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)

Deep Learning and Social Media for Managing Disaster: Survey

Zair Bouzidi1(B), Abdelmalek Boudries2, and Mourad Amad1

1 LIMPAF Laboratory, Computer Science Department, Science and Applied Science Faculty, Bouira University, Bouira, Algeria
2 Laboratory LMA, Commercial Science Department, Faculty of Economics, Business and Management, Bejaia University, Béjaïa, Algeria

Abstract. The broad reach of social networks enables individuals to exchange information in real time. This active involvement of society plays a major role in reducing disaster risk and in relieving at-risk populations. Any crisis management operation needs accurate information to enable a rapid response and reduce the potential loss of life, yet the timely retrieval of information from the various regions of a disaster-affected area is a demanding task. The effectiveness of a relief and response effort depends largely on a prompt and accurate assessment of the crisis. This knowledge is primarily collected on site by first responders and can be updated later on. Several techniques have been built to automate this task through the extraction and analysis of relevant content from social media. These approaches are not, however, well integrated into relief mechanisms. Surveying them is an important step toward further progress.

Keywords: Alert · Assessment · Awareness · Collaboration · Crowdsourcing · Deep learning · Disaster management models · Neural learning · Social networks · Relevant information

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 12–30, 2022. https://doi.org/10.1007/978-3-030-82193-7_2

1 Introduction

Increasing attention is being paid to the field of crisis management by various research disciplines [1]. Scientists have played a key role in developing ways to handle and analyze the data created in catastrophe management situations, especially first-hand information from social networks. In this paper, we survey and organize existing data management and analysis expertise for emergency situations, and present open issues and future research directions. The survey draws on a detailed bibliographic study and on our hands-on background in developing an Environment of Automated Learning [2–4] for managing emergencies at the LIMPAF laboratory, and for improving corporate marketing, business strategies, fraud detection and financial time series prediction [5].

This is a survey of emergency management applications using social networks. It provides a taxonomy of the characteristics of catastrophe management models, of social networking contributions, and of the classification algorithms used to extract relevant content from social networks, from statistical algorithms to automated learning. It gives the reader knowledge of existing and emerging trends in crisis management application research and of areas of focus for researchers. In addition, it raises problems in implementing catastrophe management, discusses contributions, compares results and critiques methods.

Most social media content exchanged during emergencies communicates timely, actionable data. However, analyzing social networking content to acquire such data involves solving many problems, including parsing short and informal messages, handling information overload, and prioritizing the various types of information found in messages. Classical information processing tasks such as filtering, classification, rating, aggregating, extracting, and summarizing can be mapped to these challenges. We discuss the state of the art in emergency management models and their stages, in social networking, and in the various methods of retrieving relevant content from social networks, and illustrate both their contributions and shortcomings. In addition, we analyze their details and methodically examine a set of key subproblems ranging from event identification to the development of useful and actionable summaries.

The rest of the paper is set out as follows. Section 2 presents the background and recent surveys. Section 3 introduces the different models of crisis management. Section 4 shows how content can be gathered from websites and social media. Section 5 presents the different techniques used for retrieving relevant information from social media, followed by a discussion that explains and classifies the different Deep Learning architectures. Finally, we conclude and outline some future work.

2 Background and Related Works

By analyzing and classifying recent reviews, we studied the concepts of catastrophe and the existing emergency management models.

2.1 Recent Surveys

Table 1 classifies the latest disaster management surveys. Many surveys address information systems, particularly Integrated Communication [6–8, 10]; artificial intelligence, especially automated learning, machine learning and deep learning [1, 8, 9, 11–14]; and collaboration, through quality assessment methods for volunteered geographic information [8, 15–18]. Even more studies address social media [1, 11–13, 19–22]. Risk assessment and mitigation have been discussed [23–25] for street floods, perceptions of environmental risk and disaster preparedness. Surveys on Big Data [14, 26], Crowdsourcing [27, 28] and Crowdtasking [28] were restricted to natural disasters. Some studies also touched on Disaster Education [29, 30], on Forecasting [31, 32], only in the field of forest fire danger prediction, and on Post-Disaster Coordination and Response [12, 21], in the case of Super-cyclone Amphan. Recent research on Situational Awareness and Damage Assessment [29] is available only for thermal agent disasters and fire disasters. All of this research dealt only with work performed on Twitter. Only our paper reviews applications from all data sources (Twitter, Facebook, Instagram, and so on).

Table 1. Classification of recent disaster management surveys

| DM tasks | Fields | Surveys |
|---|---|---|
| Information System | Integrated Communication | [6–8, 10] |
| Artificial Intelligence | Automated Learning, Machine Learning, Deep Learning | [1, 8, 9, 11–14] |
| Social Media | / | [1, 11–13, 19–22] |
| Big Data | Natural Disaster | [14, 26] |
| Collaboration | Volunteered geographic information quality assessment methods | [8, 15–18] |
| Crowdsourcing | Natural Disaster | [27, 28] |
| Crowdtasking | Natural Disaster | [28] |
| Education | / | [29, 30] |
| Forecasting | Forest Fire Danger Prediction | [34, 35] |
| Risk assessment/reduction | Street floods, Environmental Risk Perceptions and Disaster Preparedness | [23–25] |
| Situational Awareness | Thermal agent disaster and fire disaster | [29] |
| Damage Assessment | Thermal agent disaster and fire disaster | [29] |
| Post-Disaster Coordination and Response | Super-cyclone Amphan | [12, 21] |

2.2 Disaster

Disasters such as earthquakes, floods, fires, terrorist attacks and tsunamis result in severe human suffering, property destruction and other adverse effects. In addition to these, several anthropogenic disasters have arisen over the past two decades, primarily due to globalization, interconnected networks and substantial technological growth (see Table 2). Anthropogenic disasters include product forgery, biological risks, terrorism and ecological terrorism [1, 34]. In recent years, the planet has undergone several of the most significant natural and/or anthropogenic catastrophes of all time. They arise from biological, geological, seismic, hydrological or other natural processes, such as cyclones, earthquakes, tsunamis, floods, forest fires, landslides, sandstorms, volcanic eruptions, hydro-meteorological paroxysms (exceptional precipitation) and pandemics (such as the coronavirus pandemic, COVID-19, and its variants) [32], or from human processes. From these cases, we find 160,000 deaths and 60 million injured in 27 years (from 1980 to 2017), while the five years from 2012 to 2017 alone saw serious damage: 32,454 dead, 3,355 injured, 6,639 missing, more than 83,000 hectares burned, 350 homes destroyed and other significant damage.

Table 2. Latest catastrophic events

| No | Catastrophic event | Period | Damage |
|---|---|---|---|
| 1 | Forest Fire Haiti | Oct 2007 | 230,000, 220,000 |
| 2 | Earthquake California | Jan 2010 | 203 deaths, 6,152.9 km2 ravaged lands |
| 3 | Floods Thailand | Jul 2011 | 815 deaths |
| 4 | Tsunami earthquake Japan | Apr 2011 | 15,896 deaths, 6,157 injuries, 2,537 missing |
| 5 | Hurricane Sandy USA | Oct 2012 | 220 deaths |
| 6 | Typhoon Haiyan Philippines | Nov 2013 | 26,626 injuries |
| 7 | Flood of the Elbe Germany | Jun 2013 | 25 deaths |
| 8 | Subway bombing Russia | Apr 2017 | 15 deaths, 50 injuries |
| 9 | Suicide bombing England | May 2017 | 22 deaths, 116 injuries |
| 10 | Three explosions Indonesia | May 2018 | 9 deaths, 40 injuries |
| 11 | Japan Floods Japan | Jun 2018 | 235 deaths, 13 missing |
| 12 | Indonesian earthquake | Sep 2018 | 2,000 deaths, 1.5 million injuries |
| 13 | Earthquake Fire Haiti | Oct 2018 | 18 deaths, 548 injuries |
| 14 | Terrorist Attack Strasbourg | Dec 2018 | 5 deaths, 10 injuries |
| 15 | Kivu Ebola epidemic Congo | Aug 2018–Jun 2020 | 14,739,450 affected, 1,162 healed, 2,299 deaths |
| 16 | Coronavirus Pandemic COVID-19 | Jan 21st–Jul 23rd, 2020 | 14,739,450 affected, 8,332,461 healed, 610,776 deaths |

Recent threats from these disasters have reaffirmed the urgency and significance of loss estimation and the need for decision support resources. To meet these needs, various models for disaster management [2–5, 29, 34–37] have been studied, designed and developed. The major disaster management activities include hazard evaluation, risk management, mitigation, preparedness, response and recovery.


When disaster strikes, people seek information and ways [1–4, 35] to provide data and assist others. Disasters inspire altruism, where individuals support those who are in distress or suffering from the disaster. Information on the protection of people and goods, as well as on sources of aid, is among the most common forms of online assistance in the event of a disaster. Catastrophe is defined [34] as a complex problem that must be addressed using a multidimensional and multiplatform framework to collect information. It is characterized as a severe disruption to the functioning of society [22] involving extensive human, material or environmental losses. There are two main types of disasters: simple, where the structure of the community remains intact, and composed, where the structure and function of the community are disrupted. Catastrophes are usually fast-paced events. Slow and chronic social disruptions [35], however, should also be theorized as catastrophes, because they can have a greater effect than rapid-onset disasters.

2.3 Disaster Management

The disaster recovery cycle includes the following stages: warning, planning, action, prevention, mitigation and restoration (see Fig. 1). In catastrophe management, there are at least six key elements [34]: Prevention, Mitigation, Planning, Response (Relief), Restoration and Reconstruction. The emergency management process, however, is defined in four phases: mitigation (before disaster) [45], preparedness (before disaster) [22], response (during disaster) [45] and recovery (after disaster) [22, 34].

Fig. 1. Disaster management cycles.

3 Disaster Management Models

In this conceptual and theoretical chaos, discrepancies and variations between the different models of disaster management have led to complications, while the scope of disaster management calls for templates to be used [22, 34]. A well-formed typology [34, 36, 37] can be very useful in maintaining discipline and eliminating complications in a chaotic environment.


There are various Disaster Management Models, namely the Classical Model, the Computer Model and the Social Networking Model, which are different but complementary.

The Classical Disaster Management Model is based on preventative measures that can reduce seismic risk: starting with citizens' knowledge, by teaching them what to do before, during and after an earthquake; then reducing the seismic vulnerability of buildings, which can limit the damage; without forgetting cooperation between all volunteers (solidarity action). Part of the folk wisdom of disaster management is to use personal familiarity, and not just institutional contact, to facilitate communication and collaboration. Collaboration is an essential base, developing into a more complex and versatile network model [46] that promotes multi-organizational cooperation (see Fig. 2). Because of the reliance on volunteerism and community participation [46], collaboration has always been a necessary skill. Volunteers provide community services with leading-edge capability and connections. Organizational and individual volunteer mobilization often serves a social and psychological purpose: to bring people together and give them a sense of effectiveness. More adaptive leadership enables organizational learning and makes adaptation and improvisation easier. Coordination is strengthened by regular engagement, including involvement in planning and training exercises. Channels of communication developed during the mitigation process serve as a basis [42, 47] for meaningful coordination and contribute to improving resilience and cooperation, playing a big role in risk reduction. However, a large body of social and behavioral research identifies coordination as a significant obstacle for people, associations and organizations responding to disasters [47].

Beyond disaster warnings, relevant information containing daily updates, such as accurate and timely notices (for instance weather updates, traffic alerts and news), is also transmitted and exchanged. These types of data help [47] to keep people aware of their environment. In all phases of disaster management, communication between community members remains important. They interact with each other during the mitigation process, either to keep in contact or to help each other plan [35], and there is evidence that the affected local communities and local authorities are best placed to respond immediately. Individuals actively seek emotional support from the media [35], which is provided to isolated members of the group.

In the Computer Model, damage evaluation is one of the main criteria for understanding the situation, in order to assess the nature of the devastation and to prepare relief accordingly. It is equally important to integrate humanitarian principles [48] into the design requirements of an information system. First of all, the system must promote the development of disaster management skills through, for instance, disaster education or simulations. Education plays a critical role in motivating community members to improve their disaster management skills [4]. Evacuation exercises are also performed in schools, industries and neighborhoods. There are Game-Based Evacuation Drills (GBED) [33] that operate with a Motion Hazard Map (MHM) on a tablet with a GPS receiver and on smart devices (for example, tablets and smart glasses), and games with virtual children [49], in which adults (i.e., HMD wearers) provide them with suitable evacuation instructions according to the virtual disaster situation. Game-based evacuation drills are also offered as Disaster Education Based Services [4, 50], such as Paradigmatic Tourism integrating GBED. Black Tourism is a form of place-based disaster education, like Penumbral Tourism [51], that uses disaster simulation in the real world. A disaster fantasy game based on Tangible Bits [52] and a tower defense game [53] improve players' ability to prepare themselves for floods. Other approaches involve evacuating a three-dimensional (3D) virtual world [54], immersive virtual reality environments [55], Head-Mounted Displays (HMD) and other platforms [56], such as Geo-fencing MRG [57], with which users learn how to organize disaster response and view digital documents on a portable computer, electronic tablet or smart glasses while traveling to an evacuation location. Advanced models and broad data analyses have led to innovative disaster management methods that visualize disaster incidents that have not occurred, such as the Motion Hazard Map (MHM) with smart devices and the tsunami evacuation drill (TED) framework [33], built for simulation and configured using Google Maps. The causes of street flooding have also been identified through observations, road profiles and flood simulations [23], and suitable solutions have been suggested. Relatively inexpensive solutions to the traffic problems created by the floods have been analyzed through flood simulations, along with other variables.

In the Social Networking Model, the media play a very significant role in disaster management. The didactic role of the media differs only in content across the phases. During the planning stage, audiences seek information about risk, not preparedness. During the impact phase (the frightening moment), they get emotional support from the media, and connections to the outside world that break their isolation. After the disaster, the media focus on the most affected areas, providing estimates of harm and loss and assisting communities in their recovery efforts. During recovery, after impact, audiences seek to know the conditions of other communities.

Crowdsourcing, crowdtasking and Collaborative Disaster Management ease the difficult task of understanding voluminous and high-velocity data.

Fig. 2. Collaboration.

In Collaborative Disaster Management, geo-collaboration [8] refers to community work on problems of geographical scale, enabled by geo-spatial information technology. Large paper charts retain a distinct advantage in some situations, such as disaster response, because they combine high resolution with portability. To promote visualization and both asynchronous and online interaction between actors, and to support distributed spatial and temporal cognition, a Web-enabled geo-collaborative framework is designed to target the unique characteristics of mobile and ubiquitous computing environments.


As for Crowdsourcing and Crowdtasking, the advantages of work-sharing networks and social networking models (information collection, quasi-journalistic editing and crowdsourcing) in disaster management have been evaluated. Using motivational analysis to identify the app features most likely to sustain continued user interaction, a modern approach to developing a community-based computing environment yields a real-time dashboard for the government agencies responsible for monitoring populations during disasters. The continued engagement of users is measured through the performance of community-based computer systems such as eBayanihan [28]. Crowdsourcing [28] can be a feasible production instrument, in which intrinsic motives far outweigh extrinsic motivations (such as monetary reward), as shown by the merits of unaffiliated volunteers. Tools such as Ushahidi [28] allow people to quickly access relevant information, such as reports on crisis situations and needs in their community, based on their geographical position. This shows the significance of volunteer motivation in a serious gaming scenario that simulates involvement in crisis events. Computerized application guidance systems [6], known as public safety systems, are used to rapidly organize public emergency services and save lives and property.

In the P2P Management Model, the adaptability of P2P networks [61] can be exploited to meet the characteristics of disaster situations. Peer-to-peer (P2P) is a decentralized computer network model in which transactions take place between equally accountable nodes [61]. The peer-to-peer network is used to interconnect field staff and sustain the disaster management system [62] using a single active link between a peer and the control room, as in geo-collaborative implementations [62] built on P2P principles. There are also Android and iOS applications, and an Android chat application [63] that uses Wi-Fi peer-to-peer communication to allow communication with others in disaster situations [64].

3.1 Discussion About Disaster Management Models

Academics and organizations have suggested different models for disaster recovery. Despite their success in some areas, catastrophes still pose a major challenge to sustainable growth. Work on strategic management [65] showed that a comprehensive model should include all three listed models, because disaster management models are complementary. The quantitative method may seem more accurate than the qualitative method. But qualitative risk analysis is well suited to assessing probability and prioritizing risk in a way that is simple to understand, by rating severity in broader terms. It also makes it easier to recognize areas needing special attention, and it can be used to manage risk in real time at any point of the project. A combined approach is undeniably stronger.

Several crisis management applications exist for Disaster Education [4, 33, 49, 50, 52–59] and, on Twitter only, for Simulation [51] and Crowdsourcing [66]. Also in the Alert/Mitigation phase, there are applications for Forecasting on Twitter and Facebook [3] and on all the Web [4, 5], and finally for Collaboration [46]. In the Preparedness phase, there are applications on Twitter only for Situational Awareness [66–70] and Damage Assessment [68–71], and in the Response phase for Post-Disaster Coordination [60, 70]. There is no application in the Recovery phase (see Table 3).

Table 3. Examples of social media-based disaster management in different phases.

| Phase | Disaster management action | Social media applications |
|---|---|---|
| 1. Warning/mitigation | Disaster education | [4, 33, 49, 50, 52–59] |
| 1. Warning/mitigation | Simulation | Social Networking Model via a: only Twitter: [51]; b: Twitter & Facebook: /; c: All the Web: / |
| 1. Warning/mitigation | Forecasting | Social Networking Model via a: only Twitter: /; b: Twitter & Facebook: [3]; c: All the Web: [4, 5] |
| 1. Warning/mitigation | Collaboration | Social Networking Model: [46] |
| 1. Warning/mitigation | Crowdsourcing | Social Networking Model via a: only Twitter: [66]; b: Twitter & Facebook: /; c: All the Web: / |
| 2. Preparedness | Risk assessment and reduction | Social Networking Model via a: only Twitter: /; b: Twitter & Facebook: /; c: All the Web: / |
| 2. Preparedness | Situational Awareness | Social Networking Model via a: only Twitter: [66–70, 72]; b: Twitter & Facebook: /; c: All the Web: / |
| 2. Preparedness | Damage Assessment | Social Networking Model via a: only Twitter: [68–72]; b: Twitter & Facebook: /; c: All the Web: / |
| 3. Response | Post-Disaster Coordination and Response | Social Networking Model via a: only Twitter: [60, 70]; b: Twitter & Facebook: /; c: All the Web: / |
| 4. Recovery | Normal Activities Resumption | Social Networking Model via a: only Twitter: /; b: Twitter & Facebook: /; c: All the Web: / |


4 Social Media

Social media are sites where people share feelings, whether on Twitter, Facebook, Viber, Messenger or any forum on the Internet (see Table 4). The information available on social networks differs from other web sources (press articles, for example) in several respects. Messages use less formal language, may include words from more than one language, may contain grammar and spelling errors, and are, for the most part, unstructured, fuzzy and short-lived. Their length and content vary considerably [11]. Content can be gathered from websites and from all social media platforms that are automatically monitored by an online listening tool, such as Radian6 or one of its rivals [11]. In fact, many networking platforms allow access to their data via an Application Programming Interface (API) [11]. Online listening tools follow a model that captures the essential steps: harvesting content, cleaning the data of non-informative content, establishing relevance using a learning corpus built from tagged messages, and analyzing the results.

Table 4. Examples of disaster management applications using various social media

| Ref | Identification methods | Used OSN |
|---|---|---|
| [58] | Flood Disaster Game-based Learning | Twitter |
| [59] | Educational Purposes in Higher Education with Special Reference | Twitter |
| [68] | Social-temporal context summarization | Twitter |
| [69] | Capitalizing on a TREC Track to Build a Tweet Summarization Dataset | Twitter |
| [70] | Semi-automated artificial intelligence-based classifier for Disaster Response | Twitter |
| [72] | Summarizing situational tweets: An extractive-abstractive methodology | Twitter |
| [2] | Based on Artificial NN (ANN) | Twitter & Facebook |
| [3] | Based on Artificial NN (ANN) | All the Web |
| [4] | Based on FeedForward NN (FFNN) | All the Web |
| [5] | Based on LSTM | All the Web |
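The online listening steps described in this section (harvesting, cleaning, establishing relevance from tagged messages, and analysis) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the pipeline of Radian6 or of any tool cited here: the sample messages, the tagged corpus, the `clean`, `train_keyword_weights` and `is_relevant` helpers and the word-count relevance score are all hypothetical stand-ins for a real platform API and a trained classifier.

```python
import re
from collections import Counter

# Hypothetical harvested messages; a real system would pull these from a platform API.
HARVESTED = [
    "RT @user: Flood water rising near the bridge!!! http://t.co/xyz #flood",
    "lol best pizza ever #yum",
    "Bridge closed, families need shelter and water http://t.co/abc",
]

# Hypothetical tagged learning corpus: (message, is_relevant).
TAGGED = [
    ("flood water rising evacuate now", True),
    ("need shelter and food after earthquake", True),
    ("great pizza and movie tonight", False),
    ("best concert ever lol", False),
]

def clean(message):
    """Strip URLs, mentions, retweet markers and punctuation; return lowercase tokens."""
    text = re.sub(r"https?://\S+|@\w+|\bRT\b", " ", message)
    return re.findall(r"[a-z]+", text.lower())

def train_keyword_weights(tagged):
    """Score each word by (relevant count - irrelevant count) in the tagged corpus."""
    weights = Counter()
    for text, relevant in tagged:
        for token in clean(text):
            weights[token] += 1 if relevant else -1
    return weights

def is_relevant(message, weights):
    """A message counts as relevant when its summed word weights are positive."""
    return sum(weights[token] for token in clean(message)) > 0

weights = train_keyword_weights(TAGGED)
relevant = [m for m in HARVESTED if is_relevant(m, weights)]
```

The word-count score was chosen only because it fits in a few lines; any of the classifiers discussed in Sect. 5 could take its place once the messages have been cleaned and a tagged corpus is available.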

5 Retrieving Relevant Information from Social Media

After an overview of techniques for retrieving relevant information from social networks, we study artificial learning methods, from machine learning to deep learning.

5.1 Classification Algorithms

We study the techniques for retrieving relevant information from social networks, from Support Vector Classification to Neural Learning, including Random Forest Classification.


5.1.1 Support Vector Classification

The cost function for building a support vector classification model does not take into account training points that lie beyond the margin. The model therefore depends only on a subset of the training data [73], the support vectors. The approach can also be extended to solve regression problems.

5.1.2 Random Forest Classification

To control over-fitting and increase predictive accuracy, a random forest builds many decision trees on random selections of data and variables and averages their predictions. Each tree in the forest is built from a bootstrap sample of the training set. In addition, when a node is split during tree construction, the chosen split is not the best split among all features but the best split among a random subset of features. This randomness usually increases the bias of the forest slightly, but averaging also decreases its variance, usually by more than enough to compensate for the increase in bias [74].

5.1.3 Neural Learning

Neural Learning (NL) is an artificial intelligence technology that enables computers to learn without having been explicitly programmed to do so. To learn and improve, however, computers need data to analyze and train on [75]. Abiodun et al. (2018) [75] recommend that future research focus on combining various Artificial Neural Network (ANN) models, namely Machine Learning and Deep Learning, into one network-wide application.

5.2 Machine Learning (ML)

Although machine learning is not a new concept, many people are still uncertain what it entails. It is a modern science that uses statistics, data mining, pattern recognition, and predictive analysis to identify patterns in data and make predictions. The first algorithms were developed at the end of the 1950s. The best known of these is the Perceptron (see Table 5).

Table 5. Neural learning architecture.

| Type | Architecture | Model - Training - Algorithm - Application | Ref |
|---|---|---|---|
| Neural Network | Machine Learning | Discriminative - Supervised - Gradient Descent based Backpropagation - Classification | [2, 3] |

The perceptron is an algorithm for the supervised learning of binary classifiers. A binary classifier is a function that decides whether or not an input, represented by a vector of numbers, belongs to a specific class. The perceptron makes its predictions with a linear predictor function that combines a set of weights with the feature vector.
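The perceptron just described can be written out directly. The sketch below uses the classic perceptron update rule on a tiny, made-up data set; the two features (counts of crisis-related and leisure-related words in a message), the learning rate and the number of epochs are illustrative assumptions, not values taken from the surveyed systems.

```python
def predict(weights, bias, x):
    """Linear predictor: output 1 if w.x + b > 0, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule: adjust the weights only on misclassified examples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Hypothetical toy features per message: [crisis-word count, leisure-word count]
X = [[3, 0], [2, 1], [0, 2], [0, 3]]
y = [1, 1, 0, 0]  # 1 = disaster-related, 0 = not

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

Because this toy data set is linearly separable, the update rule settles on weights that classify every training example correctly. On data that is not linearly separable, the same loop would keep adjusting the weights without converging.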


5.3 Deep Learning (DL) Neural learning is carried out by Feedforward or Feedback neural network. In Feedforward, we have supervised learning such as Feedforward neural network itself for classification [4], convolutional neural network [37–39] for image recognition/classification or Residual neural network (ResNets) [40] for image recognition. Tables 6, 7 and 8 show, respectively, the classification of FeedForward, FeedBackward, Radial Basis Function and Kohonen Self Organizing Neural Network architectures of Deep Learning. Table 6. Classification of deep learning architectures FeedForward neural network. Architecture

Advantages

Limitations

FFNN (FFNN)

Supervised-Binary classification-Gradient Descent based Backpropagation

No Extrapolation [44]

ConvNets (CNN)

Discriminative-Supervised-Gradient Descent based Backpropagation-Image recognition/classification

Temporal modeling-No increasing accuracy with stacking layers-No coding objects position/orientation [76, 77]

ResNets

Discriminative-Supervised-Gradient Descent based Backpropagation-Image recognition

Increased complexity-BatchAdding skip level connections

Autoencoder

Generative-Unsupervised-Backpropagation-Dimensionality Reduction-Encoding

not discover slow modes [80]

Generative A (GAN)

Generative-Discriminative-Unsupervised-Backpropagation-Fake realistic-Image

distribution learning poorly madea

Restricted Boltzmann Machine (RBM)

Supervised/unsupervised-Generative with Discriminative finetuning-Unsupervised-Gradient Descent based Contrastive divergence-Dimensionality Reduction-Feature learning-Classification-Collaborative filtering

difficult training; tricky partition function making computing log likelihood infeasible

a https://simons.berkeley.edu/news/research-vignette-promise-and-limitations-generative-advers

arial-nets-gans.

For unsupervised learning, we have the Autoencoder [41] for dimensionality reduction and encoding, and the Generative Adversarial Network [42] for generating realistic fake data, reconstruction of 3D models or image improvement. The Restricted Boltzmann Machine [41] can be trained with supervised or unsupervised learning, for dimensionality reduction, feature learning, topic modeling, classification, collaborative filtering or many-body quantum mechanics. Neural networks can also be trained in either supervised or unsupervised ways by the Radial Basis Function Network [81], for K-means clustering, least square function, function approximation and time series prediction, or in unsupervised ways by the Kohonen Self Organizing Network [81], for dimensionality reduction, optimization problems or clustering analysis. In the feedback family, we have only supervised learning, such as the Recurrent Neural Network [5, 41], Bidirectional Recurrent Neural Network [42], Long Short-Term Memory [5, 41], Fully Connected-LSTM [42] and Bi-Directional LSTM [38], trained with backpropagation through time for natural language processing and language translation.

Table 7. Classification of deep learning architectures FeedBackward neural network.

Architecture | Advantages | Limitations
RNN | Discriminative; Supervised; Gradient Descent & Backpropagation through Time; Natural Language Processing; Language Translation | Difficult time series inference-unsupervised in negative time [79]
Bidirectional RNN (BRNN) | Discriminative; Supervised; Gradient Descent & Backpropagation through Time; Natural Language Processing; Language Translation | Trained with input information limitation up to preset future frame [79]
LSTM | Discriminative; Supervised; Gradient Descent & Backpropagation; Natural Language Processing; Translation | No obtaining well-defined temporal information [77, 78]
Fully Connected-LSTM | Discriminative; Supervised; Gradient Descent & Backpropagation through Time; Natural Language Processing; Language Translation | No obtaining well-defined temporal information [43, 77, 78]
Bi-Directional LSTM | Discriminative; Supervised; Gradient Descent; Backpropagation; Natural Language Processing; Translation | Bad presentation with multi-level features [76]

Table 8. Classification of deep learning architectures Radial Basis Function Neural Network and Kohonen Self Organizing Neural Network.

Architecture | Advantages | Limitations
Radial Basis Function Neural Network | Discriminative; Supervised/Unsupervised; K-means Clustering; Least Square Function; Function approximation; Time series prediction | Slow classification due to RBF function computation
Kohonen Self Organizing Neural Network | Generative; Unsupervised; Competitive Learning; Dimensionality Reduction; Optimization problems; Clustering analysis | SOM algorithm problems (a)

(a) https://pdfs.semanticscholar.org/c93a/e9ffeda90c9ea4cd951989a00a0afde8845b.pdf


6 Conclusion and Future Works

This work explores the potential of social networking in managing disasters. To that end, we defined and conceptualised catastrophes and crisis management, proposed a catastrophe classification, analyzed recent surveys and proposed a Classification of Recent Disaster Management Surveys, and analyzed social media-based crisis management packages in the different phases. The work shows the impact of the social networking paradigm on the improvement of the catastrophe management process, where interactions involving communities are discussed. Communities have their specific functional necessity to act during the different stages of the crisis management process. In addition, the role of the means of communication in the attenuation, response and recovery phases is also presented.

We have explored the potential of P2P networks in managing catastrophes. The adaptability of P2P networks [61] should be exploited to respond to the characteristics of crisis situations. New wireless applications are also possible in mobile networks, especially with Web 3.0. Sensor technology likewise holds great promise for disaster-prone regions, which need comprehensive and effective warning models to protect lives and property.

We studied techniques for retrieving information from social media, from Support Vector Classification and Random Forest Classification to Neural Learning. We reviewed the forms of neural learning, from simple neural learning to Deep Learning, and proposed a classification of Deep Learning architectures. Future work will be devoted to Web 3.0 with Deep Learning, Big Data and even Supercomputing.

Acknowledgments. We acknowledge the support of the “Direction Générale de la Recherche Scientifique et du Développement Technologique (DGRSDT)”, MESRS, Algeria.

References

1. Hui, L.H.D., Tsang, P.K.E.: Everyday knowledge and disaster management: the role of social media. In: Robertson, M., Tsang, P.K.E. (eds.) Everyday Knowledge, Education and Sustainable Futures. EARICP, vol. 30, pp. 107–121. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-0216-8_8
2. Bouzidi, Z., Boudries, A., Amad, M.: A new efficient alert model for disaster management. In: Proceedings of Conference AIAP 2018. Artificial Intelligence and Its Applications, El-Oued, Algeria (2018)
3. Bouzidi, Z., Amad, M., Boudries, A.: Intelligent and real-time alert model for disaster management based on information retrieval from multiple sources. Int. J. Adv. Media Commun. 7(4), 309–330 (2019). https://doi.org/10.1504/IJAMC.2019.111193
4. Bouzidi, Z., Boudries, A., Amad, M.: Towards a smart interface-based automated learning environment through social media for disaster management and smart disaster education. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) SAI 2020. AISC, vol. 1228, pp. 443–468. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52249-0_31


5. Bouzidi, Z., Amad, M., Boudries, A.: Deep learning-based automated learning environment using smart data to improve corporate marketing, business strategies, fraud detection in financial services and financial time series forecasting. In: International Conference on “Managing Business Through Web Analytics - (ICMBWA 2020)”. Khemis Miliana University, Algeria (2020). Accepted
6. Leitinger, S.H.: Comparison of GIS-based public safety systems for emergency management. In: Proceedings of 24th Urban Data Management Symposium (2004)
7. Hristidis, V., Chen, S.-C., Li, T., Luis, S., Deng, Y.: Survey of data management and analysis in disaster situations. J. Syst. Softw. 83, 1701–1714 (2016)
8. Benali, M., Ghomari, A.R.: Information and knowledge driven collaborative crisis management: a literature review. In: 3rd International Conference on ‘Information and Communication Technologies for Disaster Management (ICT-DM)’, Vienna, Austria (2016)
9. Ogie, R.I., Rho, J.C., Clarke, R.J.: Artificial intelligence in disaster risk communication: a systematic literature review. In: Proceedings of the 5th International Conference on Information and Communication Technologies for Disaster Management, (ICT-DM 2018) (2018). https://doi.org/10.1109/ict-dm.2018.8636380
10. Meissner, A., Luckenbach, T., Risse, T., Kirste, T., Kirchner, H.: Design challenges for an integrated disaster management communication and information system. In: The First IEEE Workshop on Disaster Recovery Networks (DIREN 2002), in Conjunction with IEEE INFOCOM, New York, USA (2002)
11. Imran, M., Ofli, F., Caragea, D., Torralba, A.: Using AI and social media multimodal content for disaster response and management: opportunities, challenges, and future directions. Inf. Process. Manag. 57(5), 1–9 (2020). http://sci-hub.tw/10.1016/j.ipm.2020.102261
12. Nazer, T.H., Xue, G., Ji, Y., Liu, H.: Intelligent disaster response via social media analysis a survey. SIGKDD Explor. Newslett. 19(1), 46–59 (2017)
13. Imran, M., Castillo, C., Diaz, F., Vieweg, S.: Processing social media messages in mass emergency: a survey. ACM Comput. Surv. (CSUR) 47(67), 1–38 (2015). https://doi.org/10.1145/2771588
14. Arinta, R., Emanuel, A.: Natural disaster application on big data and machine learning: a review (2019). https://doi.org/10.1109/ICITISEE48480.2019.9003984
15. Senaratne, H., Mobasheri, A., Ahmed Loai, A.A., Cristina, C., Mordechai, (M.)H.: A review of volunteered geographic information quality assessment methods. Int. J. Geogr. Inf. Sci. 31(1), 139–167 (2017). https://doi.org/10.1080/13658816.2016.1189556
16. Haworth, B., Bruce, E.: A review of volunteered geographic information for disaster management. Geogr. Compass J. 9(5), 237–250 (2015). https://doi.org/10.1111/gec3.12213
17. Klonner, C., Marx, S., Uson, T., Porto de Albuquerque, J., Hofle, B.: Volunteered geographic information in natural hazard analysis: a systematic literature review of current approaches with a focus on preparedness and mitigation. ISPRS Int. J. Geo-Inf. 5(7) (2016). https://doi.org/10.3390/ijgi5070103
18. Haworth, B.T.: Emergency management perspectives on volunteered geographic information: opportunities, challenges and change. Comput. Environ. Urban Syst. 57, 189–198 (2016). https://doi.org/10.1016/j.compenvurbsys.2016.02.009
19. Saroj, A., Pal, S.: Use of social media in crisis management: a survey. Int. J. Disaster Risk Reduction (2020). https://doi.org/10.1016/j.ijdrr.2020.101584
20. Ruggiero, A., Vos, M.: Social media monitoring for crisis communication: process, methods and trends in the scientific literature. Online J. Commun. Media Technol. 4(1), 105–130 (2014)
21. Poddar, S., Mondal, M., Ghosh, S.: A survey on disaster: understanding the after-effects of super-cyclone amphan and helping hand of social media. Computer Science, Computers and Society (2020)


22. Knuth, D., Szymczak, H., Kuecuekbalaban, P., Schmidt, S.: Social media in emergencies, how useful can they be. In: 3rd International Conference on Information and Communication Technologies for Disaster Management (ICT-DM) (2016)
23. Lagmay, A.M., et al.: Street floods in Metro Manila and possible solutions. J. Environ. Sci. 59, 39–47 (2017)
24. Kirschenbaum, A.: Preparing for the inevitable: environmental risk perceptions and disaster preparedness. Int. J. Mass Emerg. Disasters 23(2), 97–127 (2005)
25. Jabareen, Y.: Planning the resilient city: concepts and strategies for coping with climate change and environmental risk. Cities 31, 220–229 (2013). https://doi.org/10.1016/j.cities.2012.05.004
26. Yu, M., Yang, C., Li, Y.: Big data in natural disaster management: a review. Geosciences 8(5) (2018). https://doi.org/10.3390/geosciences8050165
27. Poblet, M., García-Cuesta, E., Casanovas, P.: Crowdsourcing tools for disaster management: a review of platforms and methods. In: Casanovas, P., Pagallo, U., Palmirani, M., Sartor, G. (eds.) AICOL -2013. LNCS (LNAI), vol. 8929, pp. 261–274. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-45960-7_19
28. Middelhoff, M., et al.: Crowdsourcing and crowdtasking in crisis management. In: 3rd International Conference on ‘Information and Communication Technologies for Disaster Management (ICT-DM)’, Vienna, Austria (2016)
29. Torani, S., Majd, P.M., Maroufi, S.S., Dowlati, M., Sheikhi, R.A.: The importance of education on disasters and emergencies: a review article. J. Educ. Health Promot. 8(85) (2019). https://doi.org/10.4103/jehp.jehp_262_18
30. Lin, C.K., Nifa, F.A.A., Musa, S., Shahron, S.A., Anuar, N.A.: Challenges and opportunities of disaster education program among UUM student. In: Proceedings of the 3rd International Conference on Applied Science and Technology (ICAST 2018), Georgetown, Penang, Malaysia (2018). https://doi.org/10.1063/1.5055440
31. Satoh, K., Weiguo, S., Yang, K.T.: A study of forest fire danger prediction system in Japan. In: Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA 2004), Zaragoza, Spain, pp. 598–602 (2004). https://doi.org/10.1109/DEXA.2004.1333540
32. Kohyu, S., Weiguo, S., Yang, K.T.: A study of forest fire danger prediction system in Japan. In: Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA 2004) (2004)
33. Kawai, J., Mitsuhara, H., Shishibori, M.: Tsunami evacuation drill system using motion hazard map and smart devices. In: 3rd International Conference on Information and Communication Technologies for Disaster Management (ICT-DM), pp. 13–15 (2016)
34. Ashir, A.: Use of social media in disaster management. In: ICITE 2012 Conference, Hong Kong, vol. IPEDR, no. 39 (2011)
35. Lamsal, R.: Design and analysis of a large-scale COVID-19 tweets dataset. Appl. Intell. 51(5), 2790–2804 (2020). https://doi.org/10.1007/s10489-020-02029-z
36. Asghar, S., Alahakoon, D., Churilov, L.: A dynamic integrated model for disaster management decision support systems. Int. J. Simul. Syst. Sci. Technol. 6(10) (2005)
37. Alam, F., Imran, M., Ofli, F.: Image4Act: online social media image processing for disaster response. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, (ASONAM 2017), pp. 601–604 (2017). https://doi.org/10.1145/3110025.3110164
38. Kabir, M.Y., Madria, S.K.: A deep learning approach for tweet classification and rescue scheduling for effective disaster management. In: Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, (SIGSPATIAL 2019), pp. 269–278 (2019). https://doi.org/10.1145/3347146.3359097


39. Nguyen, D.T., Al-Mannai, K., Joty, S.R., Sajjad, H., Imran, M., Mitra, P.: Robust classification of crisis-related data on social networks using convolutional neural networks. In: ICWSM, pp. 632–635 (2017)
40. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 770–778. IEEE (2016). https://doi.org/10.1109/CVPR.2016.90
41. Wu, Q., Ding, K., Huang, B.: Approach for fault prognosis using recurrent neural network. J. Intell. Manuf. 31(7), 1621–1633 (2018). https://doi.org/10.1007/s10845-018-1428-5
42. Canon, M.J., Satuito, A., Sy, C.: Determining disaster risk management priorities through a neural network-based text classifier. In: 2018 International Symposium on Computer, Consumer and Control (IS3C), Taichung, Taiwan, pp. 237–241 (2018). https://doi.org/10.1109/IS3C.2018.00067
43. Zhao, J., Deng, F., Cai, Y., Chen, J.: Long short-term memory - fully connected (LSTM-FC) neural network for PM2.5 concentration prediction. Chemosphere 220 (2018). https://doi.org/10.1016/j.chemosphere.2018.12.128
44. Haley, P.J., Soloway, D.: Extrapolation limitations of multilayer feedforward neural networks. In: Proceedings of the 1992 IJCNN International Joint Conference on Neural Networks, Baltimore, MD, USA, vol. 4, pp. 25–30 (1992). https://doi.org/10.1109/IJCNN.1992.227294
45. Chikoto, G.L., Sadiq, A.-A., Fordyce, E.: Disaster mitigation and preparedness comparison of nonprofit, public, and private organizations. Nonprofit Volunt. Sect. Q. 42(2), 391–410 (2013)
46. Waugh Jr, W.L., Streib, G.: Collaboration and leadership for effective emergency management. Public Adm. Rev. 66(s1) (2006). https://doi.org/10.1111/j.1540-6210.2006.00673.x
47. Yates, D., Paquette, S.: Emergency knowledge management and social media technologies: a case study of the 2010 Haitian earthquake. Int. J. Inf. Manage. 31, 6–13 (2011)
48. Coletti, P.G.S., Mays, R.E., Widera, A.: Bringing technology and humanitarian values together: a framework to design and assess humanitarian information systems. In: International Conference on Information and Communication Technologies for Disaster Management, Munster, Germany, vol. 4 (2017). https://doi.org/10.1109/ICT-DM.2017.8275687
49. Iguchi, K., Mitsuhara, H., Shishibori, M.: Evacuation instruction training system using augmented reality and a smartphone-based head-mounted display. In: 3rd International Conference on Information and Communication Technologies for Disaster Management (ICT-DM), Vienna, Austria (2016)
50. Mitsuhara, H., et al.: Penumbral tourism: place-based disaster education via real-world disaster simulation. In: 3rd International Conference on ‘Information and Communication Technologies for Disaster Management (ICT-DM)’, Vienna, Austria (2016)
51. Tobita, J., Fukuwa, H., Mori, M.: Integrated disaster simulator using WebGIS and its application to community disaster mitigation activities. J. Nat. Disaster Sci. 30(2), 71–82 (2009)
52. Kobayashi, K., Narita, A., Hirano, M., Tanaka, K., Katada, T., Kuwasawa, K.: DIGTable: a tabletop simulation system for disaster education. In: Proceedings of Sixth International Conference on Pervasive Computing (Pervasive 2008), pp. 57–60 (2008)
53. Tsai, M.-H., Chang, Y.-L., Kao, C., Kang, S.-C.: The effectiveness of a flood protection computer game for disaster education. Vis. Eng. 3(1), 1–13 (2015). https://doi.org/10.1186/s40327-015-0021-7
54. Dunwell, I., Petridis, P., Arnab, S., Protopsaltis, A., Hendrix, M., Freitas, S.: Blended game-based learning environments: extending a serious game into a learning content management system. In: Proceedings of Third International Conference on Intelligent Networking and Collaborative Systems (INCoS), pp. 830–835 (2011)


55. Smith, S., Ericson, E.: Using immersive game-based virtual reality to teach fire-safety skills to children. Virtual Reality 13(2), 87–99 (2009). https://doi.org/10.1007/s10055-009-0113-6
56. Wang, B., Li, H., Rezgui, Y., Bradley, A., Ong, H.N.: BIM based virtual environment for fire emergency evacuation. Sci. World J. 2014 (2014)
57. Fischer, J.E., Jiang, W., Moran, S.: AtomicOrchid: a mixed reality game to investigate coordination in disaster response. In: Herrlich, M., Malaka, R., Masuch, M. (eds.) ICEC 2012. LNCS, vol. 7522, pp. 572–577. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33542-6_75
58. Zaini, N.A., Noor, S.F.M., Zailani, S.Z.M.: Design and development of flood disaster game-based learning based on learning domain. Int. J. Eng. Adv. Technol. (IJEAT) 9(4), 679–685 (2020). https://doi.org/10.35940/ijeat.C6216.049420
59. Vivakaran, M.V., Neelamalar, M.: Utilization of social media platforms for educational purposes among the faculty of higher education with special reference to Tamil Nadu. High. Educ. Future 5(1), 4–19 (2018). https://doi.org/10.1177/2347631117738638
60. Qiu, L., Du, Z., Zhu, Q., Fan, Y.: An integrated flood management system based on linking environmental models and disaster-related data. Environ. Model. Softw. 91, 111–126 (2017). https://doi.org/10.1016/j.envsoft.2017.01.025
61. Androutsellis-Theotokis, S., Spinellis, D.: A survey of peer-to-peer content distribution technologies. ACM Comput. Surv. 36(4), 335–371 (2004)
62. Bortenschlager, M., Leitinger, S., Rieser, H., Steinmann, R.: Towards a P2P-based geocollaboration system for disaster management. In: Probst, F., Keler, C. (eds.) GI-Days (2007)
63. Sonawane, R., Doge, S., Vatti, R.: WiFi peer-to-peer communication in disaster management. Int. J. Electr. Electron. Comput. Sci. Eng. (IJEECSE) 4(6) (2017)
64. Geibig, J.: Peer-to-peer algorithms in wireless ad-hoc networks for Disaster Management. Fach Informatik eingereicht an der Mathematisch-Naturwissenschaftlichen Fakultat der Humboldt-Universitat zu Berlin, Berlin (2015)
65. Nojavan, M., Salehi, E., Omidvar, B.: Conceptual change of disaster management models: a thematic analysis. Jamba J. Disaster Risk Stud. 10 (2018). https://doi.org/10.4102/jamba.v10i1.451
66. Rogstadius, J., Vukovic, M., Teixeira, C.A., Kostakos, V., Karapanos, E., Laredo, J.A.: CrisisTracker: crowdsourced social media curation for disaster awareness. IBM J. Res. Dev. 57(5) (2013). https://doi.org/10.1147/jrd.2013.2260692
67. Clerveaux, V., Spence, B., Katada, T.: Promoting disaster awareness in multicultural societies: the DAG approach. Disaster Prev. Manag. Int. J. 19(2), 199–218 (2010). https://doi.org/10.1108/09653561011038002
68. He, R., Liu, Y., Yu, G., Tang, J., Hu, Q., Dang, J.: Twitter summarization with social-temporal context. World Wide Web 20(2), 267–290 (2016). https://doi.org/10.1007/s11280-016-0386-0
69. Dussart, A., Pinel-Sauvagnat, K., Hubert, G.: Capitalizing on a TREC track to build a tweet summarization dataset. In: Text Retrieval Conference (TREC 2020) (2020)
70. Lamsal, R., Kumar, T.V.V.: Classifying emergency tweets for disaster response. Int. J. Disaster Response Emerg. Manag. (IJDREM) 3(1), 14–29 (2020). https://doi.org/10.4018/IJDREM.2020010102
71. Kakooei, M., Baleghi, Y.: Fusion of satellite, aircraft, and UAV data for automatic disaster damage assessment. Int. J. Remote Sens. 38(8–10) (2017). https://doi.org/10.1080/01431161.2017.1294780
72. Rudra, K., Goyal, P., Ganguly, N., Imran, M., Mitra, P.: Summarizing situational tweets in crisis scenarios: an extractive-abstractive approach. IEEE Trans. Comput. Soc. Syst. 6(5), 981–993 (2019). https://doi.org/10.1109/tcss.2019.2937899
73. Curtin, R.R., et al.: MLPACK: a scalable C++ machine learning library. J. Mach. Learn. Res. 14, 801–805 (2013)


74. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
75. Abiodun, O.I., Jantan, A., Omolara, A.E., Dada, K.V., Mohamed, N.A., Arshad, H.: State-of-the-art in artificial neural network applications: a survey. Heliyon 4(11) (2018). https://doi.org/10.1016/j.heliyon.2018.e00938
76. Nguyen, N.K., Le, A.-C., Pham, H.T.: Deep bi-directional long short-term memory neural networks for sentiment analysis of social data. In: Huynh, V.-N., Inuiguchi, M., Le, B., Le, B.N., Denoeux, T. (eds.) IUKM 2016. LNCS (LNAI), vol. 9978, pp. 255–268. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49046-5_22
77. Sainath, T., Vinyals, O., Senior, A., Sak, H.: Convolutional, long short-term memory, fully connected deep neural networks, pp. 4580–4584 (2015). https://doi.org/10.1109/ICASSP.2015.7178838
78. Roshan, S., Srivathsan, G., Deepak, K., Chandrakala, S.: Violence detection in automated video surveillance: recent trends and comparative studies, pp. 157–171 (2020). https://doi.org/10.1016/B978-0-12-816385-6.00011-8
79. Berglund, M., Raiko, T., Honkala, M., Karkkainen, L., Vetek, A., Karhunen, J.: Bidirectional Recurrent Neural Networks as Generative Models. MIT Press, Cambridge (2015)
80. Chen, W., Sidky, H., Ferguson, A.: Capabilities and limitations of time-lagged autoencoders for slow mode discovery in dynamical systems. J. Chem. Phys. 151(6) (2019). https://doi.org/10.1063/1.5112048
81. Pouyanfar, S., Tao, Y., Tian, H., Chen, S.-C., Shyu, M.-L.: Multimodal deep learning based on multiple correspondence analysis for disaster management. World Wide Web 22(5), 1893–1911 (2018). https://doi.org/10.1007/s11280-018-0636-4

A Framework for Adaptive Mobile Ecological Momentary Assessments Using Reinforcement Learning

Lihua Cai, Laura E. Barnes, and Mehdi Boukhechba

University of Virginia, Charlottesville, VA 22904, USA
{lc3cp,lb3dp,mob3f}@virginia.edu

Abstract. Mobile ecological momentary assessments (mEMAs) require substantial user effort to complete, resulting in low user compliance. One major source of noncompliance is triggering mEMAs at inopportune moments. In this work, we propose a framework for implementing adaptive mEMAs using reinforcement learning (RL) to address the timing and context challenge, aiming to improve long term response compliance. To effectively model user state, we also propose a two-level user model with both momentary and routine state features. A novel k-routine mining algorithm is developed to extract routine state from passive sensing data. Using real mobile sensing data collected from 220 participants for over two weeks, we show that our proposed RL strategies consistently outperform the baseline methods, including a random strategy and a supervised strategy, in user compliance.

Keywords: EMA · Ecological momentary assessment · Mobile sensing · Reinforcement learning

1 Introduction

Mobile ecological momentary assessment (mEMA) is a digital surveying method that attempts to collect critical measurements of user behaviors and mental states in situ on mobile devices, most popularly on personal smartphones. Most recently, mEMA has also been implemented on wearable devices such as smartwatches [12]. Unlike traditional retrospective survey methods (e.g., telephone, paper, web surveys), mEMA frequently collects self-reports to capture the dynamics of human experiences, while reducing recall bias and enhancing ecological validity [33]. mEMA has become the typical choice of data collection in areas such as clinical assessment [8], psychological/cognitive processes and their mechanisms [32,37], and mobile health [14], owing to the increasing ownership of smartphones and accessibility of wireless networks in the past decade [5]. Many EMA studies also captured passive sensing data while collecting EMAs, thereby enabling context-aware mEMA [3]. Although becoming more convenient, active participation in mEMAs still demands substantial effort from users, and poses significant compliance challenges over time [33].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 31–50, 2022. https://doi.org/10.1007/978-3-030-82193-7_3

Low response compliance in mEMAs can be attributed to declining user motivation over time. Existing research has applied human behavior theories to engage and motivate users in mobile sensing applications (e.g., substance use logging [28] and weight management [36]). While motivation has been an important challenge to address in mEMAs, low compliance can also result from another significant challenge, inopportune timings and contexts, which could be caused by 1) unavailability at the moment of sensing requests, and 2) interruptions that distract the user’s attention from his/her current, more prioritized task(s) [3]. Underlying these causes are the different user contexts and cognitive states (e.g., activity, location, time, and stress level). At each data collection moment, the user may not be available and interruptible in certain contexts, and may fail to attend and respond to the sensing request (e.g., a student in a class). Our goal is to identify opportune moments to trigger EMAs to the users, while not interrupting them at unsuitable moments, thereby achieving higher compliance in the long term.

Adaptive mEMAs leverage passive sensing to understand the user’s context and, based on this understanding, adapt the trigger timings to those moments that are more likely free of interruption and convenient for the user to respond. In addition to being context-aware, adaptive mEMAs also need to avoid bias in the collected data that comes from being selective in trigger timings [15]. In this work, we design adaptive mEMA strategies using the reinforcement learning (RL) framework under a formulation that reduces bias in the collected mEMAs. Our contributions are threefold: 1) We propose a generalizable framework for the implementation of adaptive mEMAs using RL. 2) We propose a two-level user state model to capture both the momentary user state and the higher level routine state of the user. A concept called k-routine and its associated learning algorithm are developed to represent the user’s high level contextual state. 3) We demonstrate the feasibility of our proposed approach in a set of RL algorithms using real world mEMA data, and compare their performance with two baseline strategies.

2 Related Work

The response compliance problem in mEMA has been studied by different groups within the human-computer interaction (HCI) community. Most of the existing works focus on understanding different sets of factors that may influence mEMA response compliance. Serre et al. [29], Sokolovsky et al. [30], and Broda et al. [6] studied the impacts of demographic and self-reported contextual factors on mEMA response compliance. Vhaduri et al. systematically investigated the impacts of various design factors on response compliance and on the quality of the collected data in mEMAs [34]. Compared to our current work, these studies did not leverage passive sensing capabilities to understand users’ contextual states but relied on self-reports and pre-specified triggering schedules from mEMAs. In addition, they did not intervene with any strategies to improve user response compliance.


Vhaduri et al. [35], Markopoulos et al. [18], and Hofmann et al. [10] investigated the impacts of delivery timing and reminders on EMA response compliance. Their strategies, which use user-chosen delivery times and regularly dispersed reminders, are not adaptive to users’ changing contexts. Intille et al. proposed a microinteraction-based EMA method called μEMA using smartwatches. They leveraged the quick and convenient interface interaction in smartwatches, traded off higher intensity with more frequent interruptions in EMAs, and found a significantly higher compliance rate with this new approach [12]. However, smartwatches are still far less pervasive than smartphones among consumers, which limits large scale EMA deployments for data collection in various applications. Rabbi et al. proposed a context-assisted evening recall approach as an alternative to mEMA [27]. They showed a 5.6% increase in recall accuracy and a 27.8% increase in overall recall completion rate. In that work, context is applied to provide hints to users to reduce recall bias, not as delivery conditions to improve response compliance. We consider this approach complementary to, rather than a replacement for, our proposed adaptive mEMA approaches.

Of relevance to response compliance in mEMAs is interruption management, which aims to identify opportune moments in users’ routine lives to avoid disrupting their ongoing tasks. A number of studies have been conducted on when to deliver emails [11], text messages [25], and phone calls [2]. For mobile notifications, researchers found that content, social relationship, and physical activity level [19], location and time [23], current task [24], current activity [7,9,22], and psychological traits [20], obtained from both passive sensing and self-reports, can be leveraged to predict opportune moments for interruptions. Similar to our current work, these works leveraged both passive sensing data and self-reports to learn users’ contextual states and predict whether a moment is interruptible. However, in contrast to their supervised and rule-based methods, we propose to leverage the reinforcement learning framework to implement adaptive mEMAs.

3 Adaptive Mobile EMA

3.1 An Unbiased Formulation for Mobile EMA

Adaptive mEMA leverages passive sensing data to understand users’ context, and interacts with them to collect subjective data. The main goal of adaptive mEMA is to improve users’ response compliance in the long term. To achieve this goal, adaptive mEMA can be formulated as the selection of timings for EMAs within a given trigger budget to obtain maximum user compliance. The trigger budget refers to the allowable number of mEMAs that we can trigger in a given time frame. For example, the trigger budget is three when three mEMAs are delivered daily. Imposing a trigger budget is important to avoid overburdening users and to maintain user compliance [16,17]. We also need to spread the mEMAs as evenly as possible across a given time window to avoid ‘contextual dissonance’, which biases the collected data due to context selection [15]. We follow a classical approach and split each day into a number of blocks as shown in Fig. 1, and within each block randomly select a time for the mEMA delivery decision [17]. Figure 1 illustrates an example with a daily budget of 3 mEMAs and six 2-h windows from 9am to 9pm. To guarantee triggering exactly 3 mEMAs daily, we take the opportunity costs into consideration and incorporate them into the decision process. For example, if we decide not to trigger mEMAs in the first three opportunities, we have no choice but to trigger them in the remaining three opportunities in order to meet the budget (scenario 1 in Fig. 1). When three mEMAs have been triggered before the end of the daily cycle, later assessment moments are no longer considered (scenario 2 in Fig. 1). When the assessment time in the last window is considered, the decision is always to trigger (scenario 3 in Fig. 1).

Fig. 1. An unbiased formulation of random time mobile EMAs with fixed trigger budget. Scenario 1: when the number of remaining mEMA windows is equal to the remaining trigger budget, an EMA must be delivered regardless of the user’s context. Scenario 2: early termination when the trigger budget is met before the end of the episode. Scenario 3: if it gets to the last mEMA window, the action decision is always ‘Trigger’.

3.2 Using Reinforcement Learning Framework for Adaptive Mobile EMA

We propose to use reinforcement learning (RL) to implement adaptation at each randomly selected time, as shown in Fig. 2. RL is a natural fit for adaptive mEMAs because it learns to make optimal action decisions through interactions with the application environment. It has been proposed as a framework for a special type of digital behavior change intervention, namely the Just-in-Time Adaptive Intervention (JITAI), which adapts both the timing of intervention delivery and the intervention content [21]. Adaptive mEMAs can be formulated as a discrete-time, episodic, sequential control problem. An episode is often chosen to be a targeted time frame (e.g., from 9am to 9pm) within a day. In each episode, we follow the above formulation and apply the RL framework to develop sensing policies that assess the value of each randomly selected moment for the mEMA trigger decision.
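The budget-constrained formulation of Sect. 3.1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `constrained_action` and `run_day`, the action labels, and the idea of passing the policy as a plain callable are all assumptions made for the example.

```python
def constrained_action(windows_left, budget_left, policy_action):
    """Apply the trigger-budget constraints to a policy's proposed action.

    windows_left counts the current window. Returns None once the budget
    is spent (the episode terminates early).
    """
    if budget_left <= 0:
        return None                  # scenario 2: early termination
    if windows_left <= budget_left:
        return "trigger"             # scenarios 1 and 3: forced trigger
    return policy_action             # otherwise the policy decides


def run_day(policy, n_windows=6, budget=3):
    """Roll out one daily episode; policy maps a window index to an action."""
    actions, used = [], 0
    for w in range(n_windows):
        action = constrained_action(n_windows - w, budget - used, policy(w))
        if action is None:
            break
        actions.append(action)
        used += action == "trigger"
    return actions


# A policy that never wants to trigger is forced to use the last three windows.
print(run_day(lambda w: "not_trigger"))
# A policy that always triggers spends the budget in the first three windows.
print(run_day(lambda w: "trigger"))
```

Whatever the policy proposes, every rollout ends with exactly `budget` triggers, which is the property the three scenarios in Fig. 1 are meant to guarantee.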

Adaptive Mobile EMAs


Fig. 2. A framework for adaptive mobile EMA using reinforcement learning. Designing RL strategies for adaptive mEMA follows these steps: 1) Design the RL algorithm; 2) Design the state space; 3) Design the action space; and 4) Design the reward signal.

The RL framework for adaptive mEMAs requires the design of a state space that captures the critical user states for mEMA response compliance, an action space that controls how mEMAs are delivered, and a reward signal that provides feedback for learning the EMA delivery policy. All the sensing data can be uploaded to the cloud for storage and post-processing, followed by policy updates using the chosen RL algorithm. On a daily cycle, the updated policy is shared with each participant’s sensing app using a push notification service.

4 A Two-Level User State Model

The main challenge in applying the RL framework to adaptive mEMAs is to design a state space that captures the key contextual determinants of users’ response compliance to mEMAs. Most existing approaches look at momentary features such as the current time, location, and activity of a user. Features regarding the status of the user’s mobile device (e.g., a phone call having just ended) are also used to make trigger decisions on mEMAs. We hypothesize that adding higher-level routine contexts to trigger decisions on mEMAs can further enhance response compliance. In this work, we define higher-level routine contexts as the frequent patterns or arrangements that people live by each day. For example, a person goes to the gym to work out every day at 5pm and spends roughly 2 h there. Using this example, the momentary features at 5:26pm, while the person is exercising in the gym, could be “late afternoon, gym, highly active”, and the routine feature is “daily exercise in the gym”. There are certainly more complex routines to which we have become so accustomed that we do not even notice them ourselves, and both our physical and mental states are strongly affected by living through them.


Table 1. Momentary state features at each mEMA trigger decision time.

Features              Description
Time                  Early morning (8–10am), morning (10am–12pm), noon (12pm–2pm), early afternoon (2–4pm), late afternoon (4–6pm), early evening (6–8pm)
Location              Unique place labels that are learnt by a tempo-spatial clustering algorithm [13]
Speed                 Being still, walking, running, or being in a vehicle, using average speed cutoffs (0.1, 1, 5) m/s. Speed is calculated as the average distance between consecutive GPS coordinates within the 10 min time window divided by their corresponding time spans
Hourly activeness     Proportion of 5 min windows within the past hour in which the average acceleration is beyond 0.2
Momentary activeness  Proportion of 1 min windows within the past 10 min in which the average acceleration is beyond 0.2
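The discretization in Table 1 can be illustrated with a short Python sketch. The cutoffs and bin edges come from the table; the label strings, the function names, and the handling of values that fall exactly on a cutoff (assigned to the higher category here) are assumptions, since the table does not specify them.

```python
from bisect import bisect_right

SPEED_CUTOFFS = (0.1, 1.0, 5.0)            # m/s, from Table 1
SPEED_LABELS = ("still", "walking", "running", "in_vehicle")

TIME_EDGES = (8, 10, 12, 14, 16, 18, 20)   # hours of the day, from Table 1
TIME_LABELS = ("early morning", "morning", "noon",
               "early afternoon", "late afternoon", "early evening")


def speed_category(avg_speed):
    """Map an average speed in m/s to one of the four speed categories."""
    return SPEED_LABELS[bisect_right(SPEED_CUTOFFS, avg_speed)]


def time_bin(hour):
    """Map an hour of day (may be fractional) to a time label, or None outside 8am-8pm."""
    if hour < TIME_EDGES[0] or hour >= TIME_EDGES[-1]:
        return None
    return TIME_LABELS[bisect_right(TIME_EDGES, hour) - 1]


def activeness(window_means, threshold=0.2):
    """Proportion of windows whose average acceleration is beyond the threshold."""
    return sum(1 for m in window_means if m > threshold) / len(window_means)


print(speed_category(0.5), "|", time_bin(17 + 26 / 60), "|", activeness([0.1, 0.3, 0.25, 0.0]))
```

With 5 min windows over the past hour `window_means` would hold twelve values (hourly activeness); with 1 min windows over the past 10 min it would hold ten (momentary activeness).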

Based on this rationale, we propose a two-level user model to characterize users’ states, with the low level being the momentary features and the high level being the routine state. Based on the data available to us for the current study, we use time, location, speed, hourly activeness level, and momentary activeness level as the momentary features. Table 1 defines these momentary state features. To obtain the high-level routine state, we propose an algorithm called k-routine mining, which is inspired by association rule mining. Before we introduce the k-routine mining algorithm, we first introduce two basic concepts: the life-block and the k-routine. A life-block is the basic information unit that describes the whereabouts and activities of a user at a given time. A life-block generally consists of time, location, physical activity based on speed, duration, and any other available contexts that can be extracted from passive sensing data and other mobile phone usage logs. K life-blocks form a k-routine, which is analogous to a k-itemset in classic association rule mining [1]. Without loss of generality, we denote a life-block as (t, loc, act, d) using time (t), location (loc), physical activity based on speed (act), and duration (d) in our examples below. Figure 3 shows an example of a 3-routine with three life-blocks. Note that the life-blocks that form a k-routine need not be adjacent in time, as long as they do not overlap and are ordered in time. We assign each learned unique routine a unique code for reference. After being mapped to the momentary state, the routine state is represented by this assigned code, which makes the routine state feature categorical.
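As a hedged illustration of the two concepts, the sketch below represents a life-block as the tuple (t, loc, act, d) and checks the two conditions stated above for a k-routine: the blocks are ordered in time and do not overlap. Measuring time in minutes since midnight and the example places and activities are assumptions made for the example only.

```python
from typing import NamedTuple, Sequence


class LifeBlock(NamedTuple):
    t: int      # arrival time, in minutes since midnight (assumed unit)
    loc: str    # place label
    act: str    # speed-based activity label
    d: int      # duration in minutes


def is_k_routine(blocks: Sequence[LifeBlock]) -> bool:
    """True if the blocks are time-ordered and non-overlapping (they need not be adjacent)."""
    return len(blocks) > 0 and all(
        a.t + a.d <= b.t for a, b in zip(blocks, blocks[1:])
    )


# A hypothetical 3-routine: home in the morning, office during the day, gym at 5pm.
routine = (
    LifeBlock(7 * 60, "home", "still", 60),
    LifeBlock(9 * 60, "office", "still", 480),
    LifeBlock(17 * 60, "gym", "running", 120),
)
print(len(routine), is_k_routine(routine))
```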


Fig. 3. Illustration of an example 3-routine.

5 K-Routine Mining Algorithm

In this section, we provide the details on how we generate these daily k-routines, and map them to the momentary state to provide the high-level state for our two-level user model.

5.1 Mining K-Routines

The process of constructing life-blocks is similar to that of extracting the momentary state features. First, we process the incoming data in ten-minute segments and extract the location and speed from each segment. We choose ten-minute segments because they are reasonably long, providing sufficient data for feature extraction, yet short enough to capture people’s fine-grained status. When the user has been in more than one location or speed category within one segment, we adopt the place or speed category with the most data points. If the user is in transition from one place to another, the place label is ‘in-transition’. Consecutive segments in which the user has the same place and speed values are concatenated into a life-block, with the time being the arrival time at the place and the duration being the number of segments multiplied by 10 min. With this procedure, an entire day of mobile sensing data is converted into a trajectory of life-blocks. Focusing on the daily level, we treat life-blocks as items and all life-blocks within a day as an ordered transaction, in analogy to items and transactions in classic association rule mining. However, we cannot directly apply the frequent itemset generation algorithms of existing association rule mining methods to obtain k-routines, because of two key differences. The first difference lies in the temporal order of life-blocks within a day: life-blocks are sorted by time to form k-routines. The second difference is that data become available through an incremental online process. Data are made available throughout each day, and the algorithm processes the data in 10-min increments to generate daily life-block sequences. Whenever a new life-block is constructed, the k-routine database is updated to reflect the change. The k-routine mining algorithm is given in Algorithm 1. K-routines and Places are the accumulated learned k-routines and visited unique places up to time t, respectively. LBs are the life-blocks of the same day up to time t, and plb is the pending life-block that is being generated and maintained at time t.


Algorithm 1. K-Routine Mining Algorithm.
Input: K-routines, Places, LBs, plb, GPSs, t.
Output: K-routines, Places, LBs, plb, t.
    act = extractAct(GPSs)
    Places.update(GPSs)
    loc = extractLoc(GPSs)
    if plb.loc == loc and plb.act == act then
        plb.update()
    else
        LBs.append(plb)
        K-routines.update(LBs)
        Clear plb.
        plb = (t, loc, act, 10mins)
    end if
    if t + 10mins remains in the same day then
        t = t + 10mins
    else
        t = t + 10mins
        Clear LBs and plb.
    end if
    return K-routines, Places, LBs, plb, t.
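A rough Python rendering of one call of Algorithm 1 is given below as a sketch under stated assumptions. It takes the segment's `loc` and `act` as already extracted (the paper's `extractLoc` and `extractAct` are not specified here), omits the `Places` update, measures `t` in minutes since the start of the study, and interprets `K-routines.update(LBs)` as adding every time-ordered combination of at most `max_k` life-blocks that ends in the newly completed block. The names `MiningState`, `mine_step`, and `max_k` are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Set, Tuple

SEGMENT = 10                 # minutes processed per call
DAY = 24 * 60                # minutes per day
LifeBlock = Tuple[int, str, str, int]    # (t, loc, act, d)


@dataclass
class MiningState:
    k_routines: Set[Tuple[LifeBlock, ...]] = field(default_factory=set)
    life_blocks: List[LifeBlock] = field(default_factory=list)   # LBs of the current day
    pending: Optional[LifeBlock] = None                          # plb


def mine_step(state: MiningState, t: int, loc: str, act: str, max_k: int = 3) -> int:
    """Process the ten-minute segment starting at t and return the next t."""
    p = state.pending
    if p is not None and (p[1], p[2]) == (loc, act):
        # Same place and activity: extend the pending life-block.
        state.pending = (p[0], loc, act, p[3] + SEGMENT)
    else:
        if p is not None:
            # The pending block is complete: add every ordered routine ending in it.
            for k in range(1, max_k + 1):
                for earlier in combinations(state.life_blocks, k - 1):
                    state.k_routines.add(earlier + (p,))
            state.life_blocks.append(p)
        state.pending = (t, loc, act, SEGMENT)
    nxt = t + SEGMENT
    if nxt // DAY != t // DAY:
        # Day rollover: as in Algorithm 1, the daily buffers are cleared.
        state.life_blocks.clear()
        state.pending = None
    return nxt


state, t = MiningState(), 9 * 60
observations = ([("home", "still")] * 3 + [("in-transition", "in_vehicle")]
                + [("office", "still")] * 2 + [("cafe", "still")])
for loc, act in observations:
    t = mine_step(state, t, loc, act)
print(len(state.life_blocks), len(state.k_routines), state.pending)
```

Because `itertools.combinations` preserves input order and `life_blocks` is appended in time order, every stored routine is automatically ordered in time.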

GPSs are the newly available GPS points in a ten-minute segment starting at time t. Algorithm 1 is an online algorithm that is called repeatedly every 10 min. The number of life-blocks on each day depends on the number of context features used to define them and on the number of unique values of each context feature. However, due to the variation in arrival times, the uncertainty in visited places (i.e., new places are visited over time), and the duration of stay at each place, we cannot reliably estimate the per-day computational complexity. Assuming a day has $K$ life-blocks, without limiting the order of k-routines, this results in $2^K - 1$ k-routines with $k = 1, 2, \ldots, K$. If we limit $k$ to be at most $\hat{k}$, then the total number of unique k-routines on the day is $\sum_{i=1}^{\hat{k}} C_K^i$, where $C_K^i$ is the number of ways of choosing $i$ life-blocks from $K$ life-blocks. For example, if we limit $k$ to be 3, then we have $C_K^1 + C_K^2 + C_K^3$ unique k-routines.

5.2 Merging K-Routines

After obtaining these unique k-routines on a new day, we need to merge them with those learned in past days if they are similar to each other. We define similarity using the following rules:
1. A k1-routine and a k2-routine are similar only if k1 = k2.
2. If condition 1 is met, the k1-routine and the k2-routine are similar only if each pair of life-blocks in the same position is similar.
3. Two life-blocks are similar if their place and speed (or activity) are the same, and their arrival times and visiting durations are similar.
4. Let (t, d) denote the values of arrival time and visiting duration. (t1, d1) and (t2, d2) are similar if the Euclidean distance between them is smaller than a chosen threshold.

5.3 Mapping K-Routines

At each mEMA trigger decision moment t, we map the learned k-routines to the momentary state to obtain the high-level routine state. We take the following steps:
1. Let t.arrival and d denote the arrival time and stay duration of the last life-block in a k-routine. Existing k-routines are filtered out if t does not fall in [t.arrival, t.arrival + d].
2. The remaining k-routines are filtered out if the momentary location and activity are not the same as those of their last life-block.
3. For k-routines with k > 1, we apply the same procedure as in merging newly mined k-routines with existing ones to all life-blocks other than the last life-block, comparing them against the life-blocks of the same day prior to t. We choose the longest k-routine that survives the above filtering conditions as the routine state associated with moment t.
4. When no k-routines survive the above tests, we assign ‘new routine’ as the routine state.
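The similarity rules of Sect. 5.2 and the mapping steps above can be sketched together in Python. This is an illustrative reading, not the authors' code: times are assumed to be minutes since midnight, the 30-minute threshold is an arbitrary placeholder, step 3 is interpreted as requiring the earlier life-blocks of a routine to match, in order, life-blocks already observed on the same day, and ties between equally long routines are broken by dictionary order. All function names are hypothetical.

```python
from math import hypot
from typing import Dict, NamedTuple, Sequence, Tuple


class LifeBlock(NamedTuple):
    t: float    # arrival time, minutes since midnight (assumed unit)
    loc: str
    act: str
    d: float    # duration in minutes


def blocks_similar(a: LifeBlock, b: LifeBlock, threshold: float = 30.0) -> bool:
    """Rules 3 and 4: same place and activity, and (t, d) within a Euclidean threshold."""
    return (a.loc, a.act) == (b.loc, b.act) and hypot(a.t - b.t, a.d - b.d) < threshold


def routines_similar(r1: Sequence[LifeBlock], r2: Sequence[LifeBlock],
                     threshold: float = 30.0) -> bool:
    """Rules 1 and 2: equal length and position-wise similar life-blocks."""
    return len(r1) == len(r2) and all(
        blocks_similar(a, b, threshold) for a, b in zip(r1, r2))


def prefix_matches(prefix, todays_blocks, threshold=30.0) -> bool:
    """Each prefix block must match a distinct block of today, in temporal order."""
    remaining = iter(todays_blocks)
    return all(any(blocks_similar(p, b, threshold) for b in remaining) for p in prefix)


def map_routine(t, loc, act, routines: Dict[str, Tuple[LifeBlock, ...]],
                todays_blocks, threshold=30.0) -> str:
    """Return the code of the longest surviving k-routine, or 'new routine'."""
    best_code, best_len = "new routine", 0
    for code, routine in routines.items():
        last = routine[-1]
        if not (last.t <= t <= last.t + last.d):            # step 1
            continue
        if (last.loc, last.act) != (loc, act):               # step 2
            continue
        if not prefix_matches(routine[:-1], todays_blocks, threshold):   # step 3
            continue
        if len(routine) > best_len:
            best_code, best_len = code, len(routine)
    return best_code                                          # step 4 is the default


office = LifeBlock(9 * 60, "office", "still", 480)
gym = LifeBlock(17 * 60, "gym", "active", 120)
routines = {"R1": (gym,), "R2": (office, gym)}
today = [LifeBlock(9 * 60 + 10, "office", "still", 470)]
print(map_routine(17 * 60 + 26, "gym", "active", routines, today))
```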

6 Designing Adaptive mEMA Method Using RL

In this section, we provide a concrete design of the various components required in an RL framework to implement adaptive mEMAs.

6.1 RL Algorithm

We propose an RL algorithm (see Algorithm 2) based on Q-learning, which is an off-policy temporal difference (TD) RL algorithm. It can be combined with eligibility traces, most exploration strategies (e.g., ε-greedy), and planning (e.g., Dyna-Q) to enhance learning speed and sample efficiency. It can also be easily generalized to continuous state spaces using function approximation. Specifically, we denote the state space by S, the action space by A, the initial exploration rate by ε_0, and the step-size (learning rate) parameter, the discount rate, and the eligibility trace decay parameter by α, γ, and λ, respectively. We face three major challenges in designing our proposed algorithm. First, when we decide not to trigger a mEMA, we do not have any feedback on response compliance. We handle this by designing a reward signal that discounts the associated coefficients or adds to them a proportion of their magnitude (see Eq. 1). Second, because no feedback is available when the action is ‘not trigger’, an immediate update of the policy is not possible either. Lines 32 to 36 in Algorithm 2 are designed to tackle this challenge at the end of each episode. Lastly, real mEMA data are scarce and expensive to collect over the long term, which limits the samples available for learning an optimized policy. We therefore need to improve sample and learning efficiency with limited data. In the current work, we experiment with a simple strategy called Dyna-Q [31] to improve sample efficiency.

6.2 State Space for Adaptive mEMA

We propose several different state feature sets, including the momentary state features described in Table 1, a first-order routine feature, a second-order routine feature in two different encoding schemes, and a motivation feature using rolling compliance based on response data from the past three days. To compare the marginal effectiveness of each feature set, we combine them incrementally to create five RL strategies with different state spaces: 1) RL with Momentary state features (RL-M); 2) RL with Momentary and First-order routine state features (RL-MF); 3) RL with Momentary and Second-order routine state features (RL-MS); 4) RL with Momentary and more Compactly encoded Second-order routine state features (RL-MCS); 5) RL with Momentary and more Compactly encoded Second-order routine state features with Motivation (RL-MCS-M). The difference between the compact and non-compact second-order routine representations lies in how k-routines are encoded. In the non-compact encoding, a k-routine is represented by its routine ID, while in the compact representation, a k-routine is represented by the routine IDs of all the 1-routines that form the k-routine. The more compact encoding is potentially more efficient for learning because of partial overlaps among different k-routines.

6.3 Action Space for Adaptive mEMA

The action space in adaptive mEMA can include only two actions, ‘trigger’ and ‘not trigger’, or more than two actions that expand ‘trigger’ into triggering with different modalities such as sound, vibration, and flashing lights. In this study, we consider only the two actions ‘trigger’ and ‘not trigger’.

6.4 Reward Signal for Adaptive mEMA

We design the reward signal in the following way. When we trigger a mEMA, the reward takes a binary value: it is 1 if the EMA is answered and −1 if it is not. When we do not trigger a mEMA, the reward signal is more involved because we do not directly receive the feedback we would have received had we triggered it. To address this challenge, we estimate whether the ‘not trigger’ decisions were beneficial at the end of each episode, based on how many answered mEMAs we have received for the day. If all triggered mEMAs are answered, we want to reinforce these decisions in their associated states. In contrast, if we end up


Algorithm 2. Adaptive mEMA using Q-Learning with Linear Approximation and Decaying Exploration.
Input: S, A, γ, λ, α, ε_0, d.
Output: w^a, a ∈ A.
 1: Initialize w^a and e^a for each a ∈ A
 2: Set S^nt, E^nt, W^nt, W^{i+1}, S^{i+1}, A^{i+1} to be ∅.
 3: for all t = 1, 2, . . . until termination within an episode do
 4:     Observe s_t.
 5:     if s_t is not terminal state then
 6:         Take a_t ∼ ε-greedy with argmax_{a∈A} Φ(s_t, a)^T w^a.
 7:         Transition to s_{t+1}, and take a_{t+1} ∼ ε-greedy with argmax_{a∈A} Φ(s_{t+1}, a)^T w^a.
 8:         e^{a_t} = e^{a_t} + Φ(s_t, a_t)
 9:         if a_t ≠ Not Trigger then
10:             δ_t = r_t + γ Φ(s_{t+1}, a_{t+1})^T w^{a_{t+1}} − Φ(s_t, a_t)^T w^{a_t}
11:             for all a ∈ A do
12:                 w^a ← w^a + α δ_t e^a
13:                 e^a ← γλ e^a
14:             end for
15:         else
16:             Append s_t to S^i, e^nt to E^i, w^nt to W^i, w^{a_{t+1}} to W^{i+1}, s_{t+1} to S^{i+1}, and a_{t+1} to A^{i+1}.
17:             for all a ∈ A do
18:                 e^a ← γλ e^a
19:             end for
20:         end if
21:     else
22:         Take a_t ∼ ε-greedy with argmax_{a∈A\nt} Φ(s_t, a)^T w^a.
23:         Observe r_t, transition to s_{t+1}.
24:         Take a_{t+1} ∼ ε-greedy with argmax_{a∈A} Φ(s_{t+1}, a)^T w^a.
25:         e^{a_t} = e^{a_t} + Φ(s_t, a_t)
26:         δ_t = r_t + γ Φ(s_{t+1}, a_{t+1})^T w^{a_{t+1}} − Φ(s_t, a_t)^T w^{a_t}
27:         for all a ∈ A do
28:             w^a ← w^a + α δ_t e^a
29:             e^a ← γλ e^a
30:         end for
31:     end if
32:     for j ∈ range(|S^i|) do
33:         Set s_j = S^i[j], a_j = nt, s_{j+1} = S^{i+1}[j], a_{j+1} = A^{i+1}[j], w^{a_j} = W^i[j], w^{a_{j+1}} = W^{i+1}[j], e^{a_j} = E^i[j].
34:         δ_j = r_j + γ Φ(s_{j+1}, a_{j+1})^T w^{a_{j+1}} − Φ(s_j, a_j)^T w^{a_j}
35:         w^{a_j} ← w^{a_j} + α δ_j e^{a_j}
36:     end for
37:     if εd < 0.1 then
38:         ε ← 0.1
39:     else
40:         ε ← εd
41:     end if
42: end for
43: return w^a, for each a ∈ A
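The core update of Algorithm 2 for an action with observed feedback (the weight and trace update of lines 25 to 30, and the exploration decay of lines 37 to 41) can be sketched in pure Python as follows. This is a simplified illustration under assumptions: the feature vector is taken to be the same for every action and combined with per-action weights, the deferred end-of-episode update for ‘not trigger’ (lines 32 to 36) is omitted, and all function names are hypothetical.

```python
import random

ACTIONS = ("trigger", "not_trigger")


def q_value(w, phi, a):
    """Linear action value: the dot product of the features and the weights of action a."""
    return sum(wi * xi for wi, xi in zip(w[a], phi))


def epsilon_greedy(w, phi, epsilon, rng, allowed=ACTIONS):
    """Explore with probability epsilon, otherwise pick the greedy allowed action."""
    if rng.random() < epsilon:
        return rng.choice(allowed)
    return max(allowed, key=lambda a: q_value(w, phi, a))


def td_update(w, e, phi, a, r, phi_next, a_next, alpha, gamma, lam):
    """One TD step with accumulating eligibility traces; returns the TD error."""
    e[a] = [ei + xi for ei, xi in zip(e[a], phi)]
    q_next = 0.0 if phi_next is None else q_value(w, phi_next, a_next)
    delta = r + gamma * q_next - q_value(w, phi, a)
    for b in ACTIONS:
        w[b] = [wi + alpha * delta * ei for wi, ei in zip(w[b], e[b])]
        e[b] = [gamma * lam * ei for ei in e[b]]
    return delta


def decay_epsilon(epsilon, d, floor=0.1):
    """Multiply the exploration rate by d without letting it drop below the floor."""
    return max(floor, epsilon * d)


w = {a: [0.0, 0.0] for a in ACTIONS}
e = {a: [0.0, 0.0] for a in ACTIONS}
phi = [1.0, 0.0]
delta = td_update(w, e, phi, "trigger", 1.0, None, None, alpha=0.1, gamma=0.1, lam=0.5)
print(delta, w["trigger"], epsilon_greedy(w, phi, 0.0, random.Random(0)))
```

Passing `allowed=("trigger",)` to `epsilon_greedy` corresponds to restricting the choice to A \ nt, as in line 22 of the algorithm.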


having fewer answered mEMAs than triggered ones, we want to weaken these decisions in their associated states. Let $s_i^{nt}$, $i = 1, \ldots, m$, denote the states associated with the ‘not trigger’ actions on a given day, and $w_i^{nt}$, $i = 1, \ldots, m$, denote the associated coefficients. We simply reinforce or weaken the coefficients associated with each state feature in $s_i^{nt}$ by $\beta |w_i^{t}|$, a proportion of the weight coefficients corresponding to the ‘trigger’ action. The overall reward function is given below:

$$
r_t =
\begin{cases}
1, & a_t = \text{trigger and task completed} \\
-1, & a_t = \text{trigger and task not completed} \\
\beta |w_i^{t}|, & a_t = \text{not trigger and all tasks are completed} \\
-\beta |w_i^{t}|, & a_t = \text{not trigger and not all tasks are completed}
\end{cases}
\tag{1}
$$

In Algorithm 2, Lines 15 to 20 keep track of all the components required for updating the ‘not trigger’ action value function at the end of the episode, and Lines 32 to 36 update the ‘not trigger’ action value function after the episode ends.

6.5 Experience Replay for Sample Efficiency Using Dyna-Q

Because EMA studies usually last only a few weeks, we may not have sufficient data to train RL policies that can effectively guide mEMA delivery. To address this challenge, we apply an RL framework called Dyna-Q [31], which integrates planning with learning. In Dyna-Q, an environment model is created and used to generate simulated samples for policy updates. This model does not need to be a full model of the environment; a sample model is sufficient [31]. We simply adopt a bootstrapping sampler, in which past episodes, including the current one, are randomly drawn and replayed to update the policy. In our implementation, we replay ten randomly drawn episodes, including the current day, at the end of each day to improve sample efficiency. We combine all available state features with Dyna-Q to form a sixth RL strategy (RL-MCS-M-D).

6.6 Performance Evaluation

We measure mEMA compliance using the following metrics:
– Daily Compliance (DC). DC is the number of answered triggered mEMAs divided by the number of all triggered mEMAs on each day.
– Time Constrained Daily Compliance (TCDC). TCDC is the number of mEMAs that are answered within a 10 min window divided by the number of all triggered mEMAs on each day.
– Overall Compliance (OC). OC is the final compliance, calculated as the number of all answered triggered mEMAs divided by the number of all triggered mEMAs.


These metrics are not mutually exclusive. Specifically, the overall compliance reflects the ultimate compliance rate, while ignoring the daily differences. However, it is also important to maintain acceptable daily compliance level as the data can be more representative across time during the data collection. In some application scenarios, when the mEMAs are time sensitive, the time-constrained daily compliance is more critical.
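The three metrics can be computed as in the following sketch. The input layout (one list per day, each entry a pair of an answered flag and the response delay in minutes) and the function name are assumptions made for the example; days with no triggered mEMAs are skipped, and a response exactly at the 10 min mark is counted as timely.

```python
def compliance_metrics(days, window_min=10.0):
    """Return (daily DC list, daily TCDC list, overall OC).

    days: one list per day of (responded, delay_min) pairs, one pair per
    triggered mEMA; delay_min is None when the mEMA was not answered.
    """
    dc, tcdc = [], []
    total_triggered = total_responded = 0
    for day in days:
        n = len(day)
        if n == 0:
            continue                     # no triggers: daily compliance undefined
        responded = sum(1 for r, _ in day if r)
        timely = sum(1 for r, delay in day
                     if r and delay is not None and delay <= window_min)
        dc.append(responded / n)
        tcdc.append(timely / n)
        total_triggered += n
        total_responded += responded
    oc = total_responded / total_triggered if total_triggered else float("nan")
    return dc, tcdc, oc


days = [
    [(True, 2.0), (True, 15.0), (False, None)],   # day 1: two answered, one of them in time
    [(True, 5.0), (False, None)],                 # day 2: one answered in time
]
dc, tcdc, oc = compliance_metrics(days)
print(dc, tcdc, oc)
```

The example shows why the metrics differ: day 1 has a DC of 2/3 but a TCDC of 1/3, while OC pools all five triggers and ignores the day boundaries.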

Fig. 4. Study data and EMA statistics: (a) Distribution of number of days in the study for each participant. (b) Distribution of number of EMAs being delivered to each participant. (c) Distribution of actual overall compliance in the triggered EMAs of each participant.

7 Experiments

7.1 Data

To test the feasibility of our proposed adaptive mEMA framework using RL, we use real mEMA data from a mobile sensing study that aimed to understand students’ emotions and social anxiety over a two-week window [4]. Data in this study include accelerometer, GPS, communication (e.g., text messages and phone calls), and mobile EMA data from 220 college students using the Sensus mobile application [38]. In particular, accelerometer data were collected at 1 Hz for up to two weeks, and GPS coordinates were collected every two and a half minutes. Six random-time mobile EMAs were delivered in six two-hour windows from 9 am to 9 pm every day. Figure 4 summarizes these mEMA data.

7.2 Baseline Methods

We use two baseline strategies as comparisons to measure the performance of our proposed adaptive mEMA approaches. The first baseline is a random strategy that randomly selects 3 of the 6 two-hour windows each day for EMA delivery. The second baseline builds a supervised model from all cumulative data available up to the prior day, and applies this model for the mEMA trigger decision


Fig. 5. Performance by strategies. (a) Average overall compliance; (b) Average daily compliance; (c) Average time-constrained daily compliance across 6 RL strategies and 2 baseline strategies.

at each decision moment. At the end of each day, this model is retrained with all available data and deployed for the next day. We apply XGBoost, which is a boosting algorithm that can gracefully handle missing data. The settings for the second baseline method are the same as for the proposed RL methods. We use the same context features learned from our two-level user model approach, including both the momentary and routine features.

7.3 Experimental Settings and Research Questions

The proposed RL algorithm has four parameters: the initial exploration rate ε_0, the step-size (or learning rate) α, the discount rate γ, and the eligibility trace-decay parameter λ. All four parameters lie between 0 and 1. In addition, the reward signal has a discounting parameter β associated with the ‘not trigger’ action. Instead of tuning these parameters, for every participant in each method we randomly choose values for them from the following options: 1) α ∈ {0.01, 0.05, 0.1}; 2) γ ∈ {0.05, 0.1, 0.2}; 3) λ ∈ {0.05, 0.1, 0.2, 0.5, 0.8}; 4) ε_0 ∈ {0.1, 0.2, 0.5}; and 5) β ∈ {0.05, 0.1, 0.15, 0.2}. The exploration decay rate is fixed at d = 0.8. We try to cover a reasonable range of values for each parameter so that our final performance comparisons against the baseline methods are conservative. With the above settings, we want to answer the following research questions: 1) How does the design of the state features impact the performance on the various compliance metrics? 2) How does the performance of the proposed RL methods compare with that of the baseline methods?
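The random parameter selection described above can be sketched as follows. The option sets and the decay rate d = 0.8 are taken from the text; the dictionary layout, the function names, the seeding scheme, and the assumption that the exploration rate is decayed once per day with the 0.1 floor from Algorithm 2 are illustrative choices.

```python
import random

OPTIONS = {
    "alpha": (0.01, 0.05, 0.1),              # step size
    "gamma": (0.05, 0.1, 0.2),               # discount rate
    "lambda": (0.05, 0.1, 0.2, 0.5, 0.8),    # eligibility trace decay
    "epsilon0": (0.1, 0.2, 0.5),             # initial exploration rate
    "beta": (0.05, 0.1, 0.15, 0.2),          # 'not trigger' reward proportion
}
DECAY = 0.8                                  # fixed exploration decay rate d


def sample_hyperparameters(rng):
    """Draw one value per parameter, uniformly from its option set."""
    return {name: rng.choice(values) for name, values in OPTIONS.items()}


def epsilon_schedule(epsilon0, n_days, d=DECAY, floor=0.1):
    """Exploration rate on each day when it is decayed once per day."""
    schedule, eps = [], epsilon0
    for _ in range(n_days):
        schedule.append(eps)
        eps = max(floor, eps * d)
    return schedule


rng = random.Random(42)                      # one seeded generator per participant, for example
params = sample_hyperparameters(rng)
print(params)
print(epsilon_schedule(0.5, 5))
```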


Table 2. Momentary state features on the selected active sensing times. The cutoffs for the low, median, and high levels in 1) Number of days in study are 7 and 14 days; 2) Total number of triggered EMAs are 30 and 60; 3) Average daily EMAs are 2 and 3.

Segment  Strategy      Number of days         Number of EMAs         Average daily EMAs
                       OC     DC     TCDC     OC     DC     TCDC     OC     DC     TCDC
Low      RL-M          0.796  0.793  0.686    0.802  0.803  0.712    0.803  0.803  0.723
         RL-MF         0.795  0.798  0.692    0.805  0.806  0.714    0.816  0.813  0.728
         RL-MS         0.807  0.800  0.701    0.817  0.813  0.725    0.833  0.826  0.742
         RL-MCS        0.803  0.801  0.703    0.809  0.808  0.720    0.824  0.819  0.736
         RL-MCS-M      0.766  0.769  0.676    0.791  0.794  0.710    0.792  0.792  0.719
         RL-MCS-M-D    0.800  0.795  0.687    0.802  0.798  0.708    0.801  0.792  0.711
         Random        0.642  0.651  0.584    0.640  0.649  0.585    0.636  0.642  0.595
         Supervised    0.728  0.730  0.647    0.770  0.773  0.694    0.766  0.765  0.704
Median   RL-M          0.827  0.831  0.737    0.843  0.839  0.760    0.825  0.822  0.763
         RL-MF         0.827  0.826  0.728    0.839  0.841  0.760    0.816  0.820  0.761
         RL-MS         0.837  0.837  0.739    0.841  0.842  0.760    0.824  0.825  0.767
         RL-MCS        0.828  0.828  0.731    0.831  0.830  0.750    0.810  0.812  0.753
         RL-MCS-M      0.831  0.833  0.734    0.837  0.838  0.754    0.816  0.819  0.760
         RL-MCS-M-D    0.823  0.822  0.727    0.844  0.845  0.763    0.822  0.825  0.764
         Random        0.655  0.660  0.577    0.646  0.640  0.573    0.637  0.640  0.588
         Supervised    0.836  0.839  0.743    0.846  0.849  0.767    0.818  0.825  0.762
High     RL-M          0.746  0.739  0.665    0.657  0.652  0.540    0.744  0.742  0.614
         RL-MF         0.750  0.749  0.669    0.662  0.658  0.537    0.746  0.745  0.612
         RL-MS         0.752  0.752  0.671    0.670  0.667  0.545    0.746  0.744  0.616
         RL-MCS        0.740  0.737  0.657    0.660  0.657  0.538    0.741  0.740  0.615
         RL-MCS-M      0.745  0.743  0.661    0.654  0.651  0.531    0.739  0.738  0.606
         RL-MCS-M-D    0.746  0.744  0.667    0.655  0.651  0.537    0.747  0.745  0.617
         Random        0.594  0.587  0.515    0.567  0.562  0.449    0.616  0.613  0.498
         Supervised    0.750  0.751  0.666    0.658  0.656  0.531    0.737  0.736  0.605

8 Results

8.1 Comparisons Within RL Strategies

Figure 5 shows the average overall compliance, daily compliance, and time-constrained daily compliance across the 6 RL strategies with different state features and the Dyna-Q method. Since all RL strategies use momentary state features, we do not mention them unless necessary. The RL strategy without any k-routine state feature performs the same as the one with the 1-routine state feature, but the strategy with 2-routines outperforms both of them. Using the compact representation does not improve performance. This is likely because only 2-routines are used in our experiments. Adding the motivation feature does not enhance performance either. Lastly, the Dyna-Q framework does not improve the overall performance. Note that the ranking of the RL strategies is almost the same in all three metrics. The RL strategy with momentary and 2-routine state features outperforms all other strategies by a small margin.

8.2 Comparisons Between RL Strategies and Baseline Methods

From Fig. 5, we can see that the performance of all RL strategies is equal to or better than that of the two baseline methods in all three performance metrics. In particular, the best RL strategy attains an average overall compliance of 0.80, an average daily compliance of 0.80, and an average time-constrained daily compliance of 0.70, compared with 0.77, 0.77, and 0.69 for the corresponding metrics with the supervised method. To put the improvement over the supervised approach in perspective, consider a 2-week study with 3 mEMAs daily for 220 participants, i.e., 220 × 14 × 3 = 9,240 triggered surveys: a 3% improvement in overall compliance translates into about 277 additional surveys answered by the participants during the study window. Given that our current dataset had many missing triggers due to technical issues, we expect to see higher compliance improvements in future real-world deployments.

8.3 Performance by Data Segments

The performance of the proposed RL strategies in adaptive mEMAs depends greatly on constraints in the real data we used. To better understand these factors, we examine the performance of all strategies in different segments of the data based on the number of days in the study, the total number of triggered EMAs, and the average number of daily triggered EMAs (see Table 2).

Segments by Number of Days in Study. In the low (

… 0:
        element = x
        break

sum
    Nonidiomatic:
        sum = 0
        for loop_ind in range(len(arr)):
            if arr[loop_ind] > 0:
                sum = sum + arr[loop_ind]
    Idiomatic:
        sum = sum(a for a in arr if a > 0)

all
    Nonidiomatic:
        all = arr[0] > 0
        loop_ind = 0
        while loop_ind < len(arr)-1 and all:
            if not arr[loop_ind+1] > 0:
                all = False
            loop_ind = loop_ind + 1
    Idiomatic:
        all = all(a > 0 for a in arr)

any
    Nonidiomatic:
        any = arr[0] > 0
        loop_ind = 0
        while loop_ind < len(arr[1:]) and not any:
            loop_ind += 1
            any = arr[loop_ind] > 0
    Idiomatic:
        any = any(a > 0 for a in arr)

count
    Idiomatic:
        count = sum(1 for a in arr if a > 0)

3.1 Formal Approach

In this section, we formalize the process of finding and correcting nonidiomatic code snippets. The procedure of refactoring idioms is represented by function

Detecting and Fixing Nonidiomatic Snippets in Python


$\mathrm{REFACTOR} : Ch^* \to Ch^*$. This function expects a source code to be analyzed and fixed, and outputs the corrected version of the original fragment. Figure 1 shows a visual overview of this function and its parts. For easier understanding, we first consider exactly one snippet to be substituted per source code. The definition of REFACTOR is supported by the function $\mathrm{SUBSTITUTE} : Ch^* \times (\mathbb{N} \times \mathbb{N}) \times Ch^* \to Ch^*$. As its arguments it expects the source code to be analyzed, an index pair that represents the location of the nonidiomatic snippet, and a generated alternative of the snippet. Using these arguments, the snippet can be replaced in the original code at the given location with the given alternative. The fixed source code is the return value of SUBSTITUTE. Using this function, REFACTOR is defined as:

$$
\begin{aligned}
\mathrm{REFACTOR}(SC) := \mathrm{SUBSTITUTE}\big(&SC,\ \mathrm{LOCATE}(SC),\\
&\mathrm{GEN\_IDIOM}(\mathrm{SNIPPET\_TYPE}(\mathrm{SNIPPET}(SC)),\ \mathrm{VARIABLES}(\mathrm{SNIPPET}(SC)),\\
&\quad \mathrm{FEATURES}(\mathrm{SNIPPET}(SC),\ \mathrm{SNIPPET\_TYPE}(\mathrm{SNIPPET}(SC)),\ \mathrm{VARIABLES}(\mathrm{SNIPPET}(SC))))\big),
\end{aligned}
$$

where $\mathrm{LOCATE} : Ch^* \to \mathbb{N} \times \mathbb{N}$ is the function that returns the location of the snippet, represented by an index pair, in a full-length source code, while $\mathrm{SNIPPET} : Ch^* \to Ch^*$ returns the snippet itself. LOCATE is used in the substitution process, whereas SNIPPET provides the input for the procedure of generating the improved code. These functions are implemented as recurrent neural networks, detailed in Subsect. 3.2.1. A substitute for a nonidiomatic snippet is composed of the following components: a frame (such as “target=sum(1 for elem in list if condition)”), a dictionary which maps kinds of variables to object names (such as {target var → “cnt”, list var → “arr”, loop index → “i”}), and some additional features, for example the condition arr[i] > 0. Given these examples, the following idiomatic pattern would be generated: cnt = sum(1 for elem in arr if elem > 0). Consequently, the function GEN_IDIOM expects three parameters: the first is the type of frame, the second is the dictionary of variables, and the third contains the additional features (such as the condition). These three parameters are constructed by the following functions, respectively: $\mathrm{SNIPPET\_TYPE} : Ch^* \to [0..5]$ returns the index of the frame (0: count, 1: max, 2: search, 3: sum, 4: all, 5: any), $\mathrm{VARIABLES} : Ch^* \to (Type, Ch^*)^*$ returns the dictionary, and $\mathrm{FEATURES} : Ch^* \times [0..5] \times (Type, Ch^*)^* \to Ch^* \times Ch^* \times \{0, 1\} \times \{0, 1\}$ gives the extra features. SNIPPET_TYPE and VARIABLES are implemented by different neural networks. Their definitions can be found in Subsects. 3.2.2 and 3.2.3. Given the snippet, the type, and the identifier names, a set of four features can be determined:
– the condition (if the snippet contains any)
– if the snippet iterates over one row/column in a matrix/tuple, we need to know the indices being used

B. Szalontai et al.

REFACTOR

Input:
    fnd = False
    arr = list(range(0,10))
    cnt = 0
    for i in range(0,len(arr[j])):
        if arr[j][i] > 0:
            cnt += 1
    print(arr[j])
    while fnd:
        for elem in arr[0]:

SNIPPET → Found snippet:
    cnt = 0
    for i in range(0,len(arr[j])):
        if arr[j][i] > 0:
            cnt += 1

LOCATE → Location of snippet: (2, 5)

SNIPPET_TYPE → Type of pattern: count (0)

VARIABLES → Variables: target = cnt, list = arr, index = i

FEATURES → Extra information: condition: arr[i]>0; matrix: second idx is main; max or min: no; maxindex shift: no

GEN_IDIOM → Generated alternative:
    cnt = sum(1 for elem in arr[j] if elem > 0)

SUBSTITUTE → Output:
    fnd = False
    arr = list(range(0,10))
    cnt = sum(1 for elem in arr[j] if elem > 0)
    print(arr[j])
    while fnd:
        for elem in arr[0]:

Fig. 1. Visual overview of the refactoring process.

Detecting and Fixing Nonidiomatic Snippets in Python


– if the type of snippet is max, we need to know whether the minimum or maximum value is being calculated
– if the type of snippet is max, we need to know whether the theoretical indexing starts from one instead of zero.

These features are crucial for generation. Determining each of these features separately is fairly trivial given the snippet with its type and the variables. They can be implemented easily using explicit programming. Utilizing these functions, we define the function which returns the generated alternative:

GEN_IDIOM : [0..5] × (Type, Ch∗)∗ × (Ch∗ × Ch∗ × {0, 1} × {0, 1}) → Ch∗.

GEN_IDIOM is used in the following way in REFACTOR:

GEN_IDIOM(SNIPPET_TYPE(SNIPPET(SC)), VARIABLES(SNIPPET(SC)), FEATURES(SNIPPET(SC), SNIPPET_TYPE(SNIPPET(SC)), VARIABLES(SNIPPET(SC))))
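The frame-filling step can be illustrated with a minimal Python sketch. It assumes a simplified signature in which the feature tuple is reduced to the condition alone. Only the count frame is quoted in the text; the sum, all and any frame strings, the dictionary keys, and the name gen_idiom are hypothetical choices made for illustration, and the max and search frames are left out because they depend on the remaining features.

```python
from typing import Dict

# Frame indices as listed in the text (0: count, 1: max, 2: search, 3: sum, 4: all, 5: any).
COUNT, MAX, SEARCH, SUM, ALL, ANY = range(6)

# Hypothetical frame strings: only the count frame is quoted in the text,
# the other three are assumptions made for this sketch.
FRAMES = {
    COUNT: "{target} = sum(1 for elem in {list} if {condition})",
    SUM: "{target} = sum(elem for elem in {list} if {condition})",
    ALL: "{target} = all({condition} for elem in {list})",
    ANY: "{target} = any({condition} for elem in {list})",
}


def gen_idiom(frame: int, variables: Dict[str, str], condition: str) -> str:
    """Fill a frame with identifier names and a condition.

    `variables` maps the roles "target", "list" and "index" to identifier names.
    """
    if frame not in FRAMES:
        raise NotImplementedError(f"frame {frame} is not covered by this sketch")
    # Rewrite indexed accesses such as arr[i] to the comprehension variable elem.
    indexed = f"{variables['list']}[{variables['index']}]"
    condition = condition.replace(indexed, "elem")
    return FRAMES[frame].format(
        target=variables["target"], list=variables["list"], condition=condition
    )


names = {"target": "cnt", "list": "arr", "index": "i"}
print(gen_idiom(COUNT, names, "arr[i] > 0"))
```

With the dictionary and condition from the example in the text, the count frame yields the idiomatic pattern given there.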

As mentioned before, the functions LOCATE, SNIPPET, SNIPPET_TYPE, and VARIABLES are implemented by neural networks, denoted by M1, M2, and M3. Based on the model predictions (the return values of these functions), we get the desired output using the other functions in the definition of REFACTOR. A summary table of these functions can be seen in Table 2 and a summary figure in Fig. 1.

The above method was explained with one snippet per code, but in fact it is extended to multiple snippets by having LOCATE return a sequence of index pairs and SNIPPET a sequence of snippets, instead of exactly one index pair and one snippet. Hence, if LOCATE and SNIPPET return sequences with n elements, GEN_IDIOM gets evaluated n times, once for each snippet returned by SNIPPET. Even though the method works for multiple snippets in one program, the training data is built up in a simplified way where each fragment contains exactly one nonidiomatic snippet. In spite of this simplification of the dataset, the model learns to generalize well, and is able to handle fragments without any or with multiple snippets to be substituted.

3.2 Neural Architectures

As mentioned in the previous section, the three key functions of the method are implemented as neural networks. In the implementation, two different architectures are used: one solves a classification problem (M2), while the other is generally used for sequence tagging tasks (M1, M3). These three neural networks are created to solve the following subtasks:
1. Given the full source code, the nonidiomatic snippets need to be located.
2. Given a nonidiomatic snippet, the type of refactoring is to be determined.
3. The variables that the coder used in the algorithm need to be extracted.

For each model, the input needs to be an index sequence that represents tokenized source code. We incorporate two approaches to tokenizing. The function


B. Szalontai et al.

Table 2. A summary of all the functions defined above, including their types, the expected input, and their output.

| Name | Type | Expects | Returns |
|---|---|---|---|
| REFACTOR | Ch∗ → Ch∗ | Source code to be analyzed | Analyzed and fixed source code |
| SUBSTITUTE | Ch∗ × (N × N) × Ch∗ → Ch∗ | Source code, location of snippet, alternative for it | Source code with a replacement at the given location |
| LOCATE | Ch∗ → N × N | Source code to be analyzed | Location of the snippet to be replaced, represented by an index pair |
| SNIPPET | Ch∗ → Ch∗ | Source code to be analyzed | The found snippet |
| GEN_IDIOM | [0..5] × (Type, Ch∗)∗ × (Ch∗ × Ch∗ × {0, 1} × {0, 1}) → Ch∗ | Type of frame, variables (along with their types) to be utilized, and some extracted features | Variables inserted into the frame at their correct position and incorporating the features, resulting in the replacement snippet |
| SNIPPET_TYPE | Ch∗ → [0..5] | Found snippet | Type of frame represented by an index |
| VARIABLES | Ch∗ → (Type, Ch∗)∗ | Found snippet | Variables along with their types |
| FEATURES | Ch∗ × [0..5] × (Type, Ch∗)∗ → Ch∗ × Ch∗ × {0, 1} × {0, 1} | Found snippet, type of snippet, variables (along with their types) | Used condition, matrix indexing info, max or min, maxindex shift |

used for tokenizing and converting an entire source code to an index sequence will be abbreviated as TOKENIZE_CODE, and the one used for tokenizing the found snippets will be abbreviated as TOKENIZE_SNIPPET. Both take a Ch∗ as the argument and return the tokenized code represented by an index sequence.

There are special tokens (e.g. names of variables, objects, classes, functions, etc.) that occur in one code but not in any other. Including only some of these tokens in the tokenization process can lead to problems due to bad representations, as Karampatsis et al. [10] suggest. Since the ability to differentiate tokens is crucial in their work, they use an open-vocabulary neural language model to overcome this problem. We found that it is sufficient to simply unify all of these tokens for our classification and tagging problems. The representation returned by TOKENIZE_CODE replaces each of these tokens with a special NAME token. When TOKENIZE_SNIPPET tokenizes a localized nonidiomatic snippet, it differentiates special NAME tokens further into one of the following six categories:
– ARR - name of the list variable
– IDX - name of the loop index variable
– PRF - predicate or function
– FLD - field of an object

[Figure 2: token sequence (left) and the tags assigned by M1 (right), one tag per token]
fnd = False                        O O O
arr = list(range(0,10))            O O O O O O O O O O O
cnt = 0                            I I I
for i in range(0,len(arr)):        I I I I I I I I I I I I I
    if pr(arr[i]):                 I I I I I I I I I
        cnt += 1                   I I I
print(arr)                         O O O O
while fnd:                         O O O

Fig. 2. An example of how the first model tags (on the right) a certain token sequence in order to determine the location of a nonidiomatic snippet (on the left). The model is represented by M1 , and the tags I and O stand for IN and OUT.

– VAR - variable of other type
– UNK - indeterminable from the snippet without extra information, or not included in any of the categories defined above.

For basic tokenizing we use the tokenize module from the Python Standard Library, with the modification of splitting up tokens of multiple tabs. The tokenized source code is given as an input to the neural networks. In the following, these neural networks are explained.

3.2.1 Snippet Location (M1)

The first task is represented by the functions LOCATE and SNIPPET. As described above, LOCATE returns a sequence of index pairs representing the locations of the snippets and SNIPPET returns a sequence of the snippets to be substituted. The task is formulated as a sequence tagging problem, solved by a recurrent neural network (M1). It tags each element in an index sequence (which represents the tokens of a source code) with an index (0 or 1) representing either IN or OUT.

The input layer expects inputs of consistent length, which we achieve by padding each training example with the special PAD token beforehand. Next is an embedding layer which embeds tokens into a 256-dimensional vector space. The second layer is a bidirectional LSTM [8] with 256 units that returns the whole sequence of outputs. The last layer is a dense layer with 2 units (the number of possible tags), which is applied to each element of the sequence returned by the BiLSTM. We use softmax as the activation function, categorical cross-entropy as the loss function, and Adam [11] (with a learning rate of 0.001) as the optimizer. After around 15 epochs the model learns to tag a given tokenized code quite well.

With the correct tagging of a program, we can extract the desired information using the TAGS2LOC function: it takes the result of M1 (the tags) as its argument, and returns the sequence of beginning and ending indices. As described above, the found snippet itself is also needed for other


subtasks. For this purpose, the TAGS2SNP function is used: it takes the source code together with the result of M1 as its arguments, and returns only those tokens that got labeled with the IN tag. Based on these, the definition of LOCATE is

LOCATE(SC) := TAGS2LOC(M1(TOKENIZE_CODE(SC)))

and the definition of SNIPPET is

SNIPPET(SC) := TAGS2SNP(SC, M1(TOKENIZE_CODE(SC))).

3.2.2 Determining the Type of Refactor (M2)

The second task, represented by SNIPPET_TYPE, is to determine the kind of pattern the code snippet implements. This is a classification problem, which we solve with a feedforward neural network (M2). The first layer of the network is dense with 512 units and the input shape is the maximum length (number of tokens) of the training examples. We use the ReLU activation function, then a Dropout [14] with a rate of 20%. A dense layer with 6 units is integrated as the last layer. We use the softmax activation function, categorical cross-entropy as the loss function, and Adam [11] as the optimizer. After around 20 epochs the model learns to distinguish between the nonidiomatic pattern types. Using M2 we define

SNIPPET_TYPE(SCS) := M2(TOKENIZE_SNIPPET(SCS)).

3.2.3 Determining the Variables to Utilize for the Substitution (M3)

The function VARIABLES is used to determine the variables that the programmer used in the algorithm. We need to know what list was being iterated over, what functions the programmer used, etc. We need to find the tokens representing the variables that are used for storing the result, and any others being used when determining it. This subtask is also formulated as a sequence tagging problem. The goal of the model M3 is to tag an index sequence, which represents a localized snippet tokenized with the function TOKENIZE_SNIPPET. The following tags are used:
– LIST - the iterated list
– TARGET and TARGET2 - the variable(s) used for storing the results
– FIELD - field of an object
– PRED - predicate function
– FUNC - any function used on every element of the list
– UNK - any other unknown token.

The architecture of the model that we train to tag a sequence is almost the same as that of M1; the main difference comes from the different labeling. This difference implies that the dense output layer of M3 has 8 units (the number of possible tags, including PAD as a special tag) instead of 2 (as in M1). Assuming that we have a snippet that is tagged correctly with these tags, the information needed can be easily extracted. This process is represented by the function TAGS2VARS. It expects a code snippet and the tags of the tokens, and returns a tuple of token-type pairs describing the variables that need to be utilized.
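The extraction step can be sketched in a few lines of Python. This is an illustrative reading of TAGS2VARS, not the authors' implementation: the function name tags2vars, the (type, name) ordering of the pairs, and the removal of duplicates are assumptions.

```python
from typing import List, Tuple

# Tags that mark a variable of interest; UNK and PAD are ignored.
VARIABLE_TAGS = {"LIST", "TARGET", "TARGET2", "FIELD", "PRED", "FUNC"}


def tags2vars(tokens: List[str], tags: List[str]) -> Tuple[Tuple[str, str], ...]:
    """Collect (type, name) pairs for tagged tokens, keeping first occurrences in order."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    found: List[Tuple[str, str]] = []
    for token, tag in zip(tokens, tags):
        pair = (tag, token)
        if tag in VARIABLE_TAGS and pair not in found:
            found.append(pair)
    return tuple(found)


# The count snippet "cnt = 0 / for i in range(0,len(arr)): / if pr(arr[i]): / cnt += 1"
tokens = ["cnt", "=", "0",
          "for", "i", "in", "range", "(", "0", ",", "len", "(", "arr", ")", ")", ":",
          "if", "pr", "(", "arr", "[", "i", "]", ")", ":",
          "cnt", "+=", "1"]
tags = (["TARGET"] + ["UNK"] * 11 + ["LIST"] + ["UNK"] * 4
        + ["PRED", "UNK", "LIST"] + ["UNK"] * 5 + ["TARGET", "UNK", "UNK"])
print(tags2vars(tokens, tags))
```

The example tags the 28 tokens of a count snippet by hand; in the full method the tags would come from M3.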

[Figure 3: token sequence (left) and the tags assigned by M3 (right), one tag per token, U = UNK]
cnt = 0                            TARGET U U
for i in range(0,len(arr)):        U U U U U U U U U LIST U U U
    if pr(arr[i]):                 U PRED U LIST U U U U U
        cnt += 1                   TARGET U U

Fig. 3. An example of how the third model tags (on the right) a token sequence (on the left) in order to determine the variables to be utilized for generating an alternative for the original nonidiomatic snippet. The model is represented by M3, and the tag U stands for UNK.

Table 3. A summary of the functions defined in this section, including their names, the expected input, and their output.

| Name | Expects | Returns |
|---|---|---|
| TOKENIZE | Python source code | Tokenized code represented by an index sequence |
| M1 | Code represented by an index sequence | Predicted tags (In, Out) |
| TAGS2LOC | Tags (In, Out) | Locations of nonidiomatic snippet(s) |
| TAGS2SNP | Source code and tags (In, Out) | Nonidiomatic snippet(s) |
| M2 | Code represented by an index sequence | Predicted type of substitution |
| M3 | Code represented by an index sequence | Predicted tags (Target, List, Func, Pred, ...) |
| TAGS2VARS | Source code and tags (Target, List, Func, Pred, ...) | Main variables of a nonidiomatic snippet |
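The tokenizing step listed first in Table 3 can be approximated with the standard tokenize module. The sketch below is an assumption-laden illustration: it treats every identifier that is neither a keyword nor a builtin as a program-specific name, drops layout tokens, and builds the vocabulary on the fly. The function names tokenize_code and to_indices are hypothetical.

```python
import builtins
import io
import keyword
import tokenize
from typing import Dict, List

# Layout and comment tokens are dropped in this sketch (an assumption).
SKIPPED = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}


def tokenize_code(source: str) -> List[str]:
    """Return the token strings of `source`, with program-specific names unified to NAME."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        is_identifier = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        if is_identifier and not hasattr(builtins, tok.string):
            result.append("NAME")
        else:
            result.append(tok.string)
    return result


def to_indices(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to integer indices, extending the vocabulary for unseen tokens."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


code = "cnt = 0\nfor i in range(0, len(arr)):\n    cnt += 1\n"
vocab = {"PAD": 0}
print(to_indices(tokenize_code(code), vocab))
```

Keeping builtins such as range and len as distinct tokens is a design choice of this sketch, motivated by the observation that they occur across many programs, unlike user-defined names.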

Based on the functions defined above, VARIABLES is defined in the following way:

VARIABLES(SCS) := TAGS2VARS(SCS, M3(TOKENIZE_SNIPPET(SCS))).

Table 3 summarizes the functions defined in this section. In the next section, the dataset generation is explained for the three neural networks defined above.
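The two post-processing functions that turn the IN/OUT tags of M1 into locations and snippets can be sketched as follows. The sketch assumes that locations are inclusive token-index pairs and that the source code is passed as a token list; the names tags2loc and tags2snp are hypothetical stand-ins for TAGS2LOC and TAGS2SNP.

```python
from typing import List, Tuple


def tags2loc(tags: List[str]) -> List[Tuple[int, int]]:
    """Return one (start, end) pair, end inclusive, for every maximal run of IN tags."""
    locations = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "IN" and start is None:
            start = i  # a new snippet begins here
        elif tag != "IN" and start is not None:
            locations.append((start, i - 1))  # the snippet ended on the previous token
            start = None
    if start is not None:
        locations.append((start, len(tags) - 1))  # snippet runs to the end of the code
    return locations


def tags2snp(tokens: List[str], tags: List[str]) -> List[List[str]]:
    """Return the tokens of every snippet located by tags2loc."""
    return [tokens[start:end + 1] for start, end in tags2loc(tags)]


# Tag layout of a program with 14 leading OUT tokens, a 28-token snippet, and 7 trailing OUT tokens.
example_tags = ["OUT"] * 14 + ["IN"] * 28 + ["OUT"] * 7
print(tags2loc(example_tags))
```

Because every maximal run of IN tags produces its own pair, the same functions cover programs with zero, one, or several snippets.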

4 Dataset Generation

As mentioned in the previous section, three neural networks are used to locate and replace nonidiomatic snippets in source code. The neural networks are trained with two different generated datasets. The generation process consists of three steps. First, templates are generated for each code pattern using generic names for variables and functions. The second, augmentation step includes randomly renaming the identifiers and creating


random conditions for the code patterns. The last step is to combine these modified patterns with real-world code from GitHub. The results of the second and third steps are used as training sets. These are:
1. A collection of nonidiomatic code patterns with randomly renamed variables and functions, and randomly created conditions
2. A collection of Python scripts from GitHub projects containing randomly inserted nonidiomatic code patterns.

4.1 Template Generation

The templates of the possible nonidiomatic patterns are created via a context-free grammar (Table 4). The variable names and conditions are generic so they can be replaced easily. In the next step, we replace the identifiers with random strings and we create conditions with random numerical operators. The generator is written in Python using the Natural Language Toolkit (NLTK) [12]. The examples in Table 4 implement the nonidiomatic pattern count, which counts the elements in a list for which a predicate is satisfied. More examples can be found in Tables 5, 6, 7, 8 and 9 in the Appendix.

Table 4. Two code snippets generated by the grammar of the count pattern. Each entry gives the class index of the code (count: 0), the snippet, the name of the list, and the name of the result.

Class index: 0, list: arr, result: count
    count = 0
    for loopInd in range(len(arr)):
        if Pred(arr[loopInd]):
            count += 1

Class index: 0, list: arr, result: count
    count = 0
    for loopInd in range(len(arr)):
        if Pred(arr[loopInd]):
            count = count+1
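The idea of enumerating templates from a grammar can be shown with a small, dependency-free sketch. It deliberately replaces NLTK with a hand-rolled expansion, and the toy grammar below is an assumption: it only covers the two increment forms of Table 4 plus one hypothetical loop-header variant, not the grammar used by the authors.

```python
from itertools import product
from typing import Dict, List

# Toy grammar for the count pattern: each nonterminal maps to a list of
# alternatives, and each alternative is a sequence of symbols.
GRAMMAR: Dict[str, List[List[str]]] = {
    "SNIPPET": [["INIT", "LOOP", "TEST", "INCR"]],
    "INIT": [["count = 0\n"]],
    "LOOP": [["for loopInd in range(len(arr)):\n"],
             ["for loopInd in range(0, len(arr)):\n"]],  # hypothetical variant
    "TEST": [["    if Pred(arr[loopInd]):\n"]],
    "INCR": [["        count += 1\n"],
             ["        count = count+1\n"]],
}


def expand(symbol: str) -> List[str]:
    """Return every string derivable from `symbol`; unknown symbols are terminals."""
    if symbol not in GRAMMAR:
        return [symbol]
    results = []
    for alternative in GRAMMAR[symbol]:
        parts = [expand(s) for s in alternative]
        # Combine one derivation of each symbol in the alternative.
        results.extend("".join(combo) for combo in product(*parts))
    return results


templates = expand("SNIPPET")
print(len(templates))
print(templates[0])
```

The number of templates is the product of the alternatives per rule, which is why a handful of rules per pattern is enough to reach a few hundred templates.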

With these rules, about 400 templates can be generated. Figure 4 shows the exact numbers for each kind of pattern.

4.2 Augmentation of Templates

Augmenting the examples by replacing the generic variable and function names with random strings creates more data for the neural networks and is also a good method to create equal numbers of patterns. The conditions (generally predicate functions) are also changed at this stage. In most cases, the predicate function is an inline boolean expression (such as


1 == arr[i]). In some of the snippets, the list which we iterate over is replaced by a fixed row/column of a matrix/tuple (such as arr[i][1]). In some snippets we replace all the occurrences of the elements of the list with an indexed version (using the [] operator). With these modifications applied, we generate 120 000 snippets with equal distribution of the six classes.

These snippets are processed further to train M2 and M3, the models that are used to determine the type of the pattern in the snippets and extract the variables from it. To help generalization, the tokens denoting variable names and function names are converted to one of the special tokens (ARR, IDX, PRF, FLD, VAR, UNK) according to the TOKENIZE_SNIPPET tokenization procedure described in Sect. 3.2. The obtained processed snippets (each with its corresponding type of pattern) are used to train M2. In order to create the dataset for training M3, each token in the snippets is tagged according to its type (LIST, TARGET, TARGET2, FIELD, PRED or FUNC, see Sect. 3.2.3), while the rest is tagged UNK. Figure 3 shows an example of how the tagging works. In order to achieve consistent snippet length, the PAD special token and tag are used for padding the end of the sequences.

To train M1 to locate nonidiomatic snippets, programs are downloaded from GitHub in large amounts and nonidiomatic snippets are inserted into them. These are stored in a table where each row contains the type of the snippet, the larger code with the inserted snippet, the location of the insertion, and the important identifier names. Only the contexts of the snippets are kept to get rid of the large amount of unnecessary code. After tokenizing, each training example becomes approximately 1500 tokens long, of which the snippet is 30–60 tokens. We tag each token with the tag IN or OUT depending on whether the token is in the snippet or not. Figure 2 shows an example of how the tagging works.
We use TOKENIZE_CODE for tokenizing full source codes (as described in Subsect. 3.2.2). To achieve consistent input length, we use the special PAD token for padding with the OUT tag.

[Figure 4: bar chart of the number of templates (y-axis "Amount", 0 to 200) per pattern (x-axis: count, max, search, sum, all, any); bar labels: 192, 90, 64, 24, 24, 24.]

Fig. 4. The distribution of the number of different patterns before renaming.
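How one training example for M1 might be assembled is sketched below. The sketch assumes that the host program is available as a list of tokenized lines, that the snippet is inserted between two lines at a random position, and that the example is padded to a fixed length; the function name make_m1_example and its parameters are hypothetical.

```python
import random
from typing import List, Tuple


def make_m1_example(host_lines: List[List[str]], snippet: List[str],
                    max_len: int, rng: random.Random) -> Tuple[List[str], List[str]]:
    """Insert `snippet` between two host lines and return padded tokens and IN/OUT tags."""
    pos = rng.randint(0, len(host_lines))  # insertion point, 0..len inclusive
    before = [tok for line in host_lines[:pos] for tok in line]
    after = [tok for line in host_lines[pos:] for tok in line]
    tokens = before + snippet + after
    tags = ["OUT"] * len(before) + ["IN"] * len(snippet) + ["OUT"] * len(after)
    if len(tokens) > max_len:
        raise ValueError("example is longer than max_len")
    padding = max_len - len(tokens)
    # PAD tokens receive the OUT tag, as described in the text.
    return tokens + ["PAD"] * padding, tags + ["OUT"] * padding


host = [["NAME", "=", "0"], ["print", "(", "NAME", ")"]]
snippet = ["NAME", "+=", "1"]
tokens, tags = make_m1_example(host, snippet, max_len=12, rng=random.Random(0))
print(list(zip(tokens, tags)))
```

Inserting only at line boundaries keeps the host statements intact, so the surrounding context still resembles real code.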

5 Evaluation

In order to demonstrate the validity of our approach, we tested the method on a real dataset containing scripts written mostly by students with no experience with Python. The programs had originally been submitted as homework to the Mester assignment system operated by Eötvös Loránd University. We received 13373 Python files, some of which presumably contain nonidiomatic snippets. The assignment tasks are of varying difficulty. The easier ones can be solved with simple program constructions, while the more difficult ones require a deeper understanding of patterns. The tasks expect the input from the console and print the output to it as well. The inputs generally consist of multiple lines, where in most cases the first line describes the number of upcoming rows and/or columns. An example of an assignment task can be seen in Fig. 6 in Appendix A.

5.1 Automated and Manual Evaluation

In order to measure the precision, we combine two complementary testing approaches: automated testing and manual tagging. Automated testing is done with a testing tool which compares the original and the refactored programs. It generates inputs for the programs, runs them, then compares their outputs. The inputs are created randomly with the following structure. The length of an input ranges from 4 to 7 lines (inputs that are too short or too long tend to work poorly with the programs in the dataset). Each line contains 1 to 6 integers separated by spaces, with the values ranging from 1 to 5. The values are bounded because they represent index values in several cases. It should be noted that some of the solutions require string handling. Since our method is not designed to resolve such issues, string inputs are not generated in the automated evaluation process. For each code, 1000 inputs are generated in order to find some on which the original version terminates without error. If the outputs of both versions match on all of these inputs, we consider the fix successful. The second approach is applied if no appropriate inputs are found for the original version. These examples are manually tagged to tell whether the fix is successful or not. The tagging is done by a thorough examination of the original and refactored versions of each code. In order to filter out incorrect tags, this tagging procedure was performed twice, then the differences were compared and resolved.
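The automated comparison can be sketched as a small differential-testing harness. This is a minimal illustration under stated assumptions, not the authors' tool: programs are run as separate processes with a timeout, a non-zero exit code or a timeout counts as a failed run, and the names random_input, run and compare, as well as the two-second timeout, are hypothetical.

```python
import random
import subprocess
import sys
from typing import Optional


def random_input(rng: random.Random) -> str:
    """4 to 7 lines, each holding 1 to 6 space-separated integers between 1 and 5."""
    lines = []
    for _ in range(rng.randint(4, 7)):
        values = [str(rng.randint(1, 5)) for _ in range(rng.randint(1, 6))]
        lines.append(" ".join(values))
    return "\n".join(lines) + "\n"


def run(path: str, stdin_text: str, timeout: float = 2.0) -> Optional[str]:
    """Return the program's stdout, or None if it crashes or does not terminate in time."""
    try:
        proc = subprocess.run([sys.executable, path], input=stdin_text,
                              capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout if proc.returncode == 0 else None


def compare(original: str, refactored: str, n_inputs: int = 1000,
            seed: int = 0) -> Optional[bool]:
    """True if outputs match on every usable input, False on a mismatch,
    None if the original never terminates cleanly (manual tagging needed)."""
    rng = random.Random(seed)
    usable = 0
    for _ in range(n_inputs):
        text = random_input(rng)
        expected = run(original, text)
        if expected is None:
            continue  # input not suitable for the original program
        usable += 1
        if run(refactored, text) != expected:
            return False
    return True if usable else None
```

A call such as compare("original.py", "refactored.py") (file names chosen for illustration) then gives one verdict per refactored program.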

5.2 Precision

We ran our method on each program in the dataset. Out of the 13373 programs, changes were made to 1303. The automated testing approach could be applied in 1054 cases, of which 506 were identified as correct. For the remaining 249 codes, the manual tagging procedure counted 85 correct modifications. In total, this gives 591 correct localizations and substitutions (out of the 1303 cases). Thus the precision of our method is 45.35% (591/1303).

It should be noted that the automated testing process pointed to 300 trivially recognizable corrupted codes that crash, produce runtime errors, or do not terminate. These can be easily filtered out with suitable tools, so it is worth observing the precision when these 300 codes are excluded: 58.92% (591/(1303-300)). As we do not run the programs during the manual testing process, there might be more than 300 such files, making the aforementioned percentage a lower estimate.

5.3 Recall

In order to estimate the prevalence of nonidiomatic snippets in the test dataset, a sample of 300 programs was taken. The programs were selected randomly and tagged with whether they should be fixed or not. As in the previous tagging procedure, all of the programs were tagged twice, then compared. 28 programs were found to contain snippet(s) to be replaced. In the vast majority of these cases, such programs contain exactly one snippet. There are a few exceptions, resulting in 35 snippets in total. The pattern distribution of the found snippets is the following:
– count: 17/35 (48.57%)
– max: 7/35 (20.0%)
– search: 1/35 (2.85%)
– sum: 7/35 (20.0%)
– any: 1/35 (2.85%)
– all: 2/35 (5.71%).

According to the sample, about 9.33% of the programs contain one or more snippets to be refactored (28/300). Our method correctly localized and substituted 591 snippets, which is 4.41% of all of the codes (591/13373). This indicates the estimated recall of the whole system: 47.27% (4.41%/9.33%). Furthermore, the number of codes with correctly localized snippets (ignoring the correctness of the fix) is 706, which is 5.27% of all of the codes (706/13373), indicating the estimated recall of localization: 56.48% (5.27%/9.33%).

5.4 Precision of Subsystems

In order to determine the precision of the localization capability alone (correct snippet localizations to all localizations), the refactored programs were tagged


as follows. A program was marked with “correct localization” if all of the found snippets in the code were nonidiomatic. As in the previous cases, the programs were tagged twice in order to reduce the chance of incorrect taggings. Since 591 of the files had already been identified as correct fixes, it was only necessary to inspect the rest. 115 programs were found to contain correctly localized, but not appropriately substituted snippets. Altogether there were 706 correct localizations, which gives 54.18% (706/1303) as the precision of the finder algorithm. Given this result, it is natural to also identify the precision of the fixing process (correct substitutions to correct localizations). Out of the programs containing correctly localized snippet(s), 591 were correctly fixed, so the precision of the substitution process is 83.71% (591/706). A visualization of the performance of the subsystems can be seen in Fig. 5.
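The reported figures can be recomputed from the raw counts given in Sects. 5.2 to 5.4. The sketch below only restates that arithmetic; the function and key names are hypothetical. Because it divides unrounded ratios, the two recall estimates come out marginally higher than the values in the text, which are derived from percentages already cut to two decimals.

```python
from typing import Dict


def evaluation_metrics() -> Dict[str, float]:
    """Recompute the precision and recall estimates from the counts reported in the text."""
    total_programs = 13373
    changed = 1303                               # programs modified by the method
    correct_fixes = 506 + 85                     # automated testing + manual tagging
    trivially_corrupted = 300                    # crash, runtime error, or no termination
    correct_localizations = correct_fixes + 115  # 115 located correctly but fixed wrongly
    sample_size, sample_positive = 300, 28       # manually tagged random sample

    prevalence = sample_positive / sample_size
    return {
        "precision": correct_fixes / changed,
        "precision_without_corrupted": correct_fixes / (changed - trivially_corrupted),
        "prevalence": prevalence,
        "recall": (correct_fixes / total_programs) / prevalence,
        "localization_recall": (correct_localizations / total_programs) / prevalence,
        "localization_precision": correct_localizations / changed,
        "substitution_precision": correct_fixes / correct_localizations,
    }


for name, value in evaluation_metrics().items():
    print(f"{name}: {value:.2%}")
```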

Fig. 5. Visual overview of the precision of the subsystems.

6 Conclusion

We presented a method for locating and fixing nonidiomatic snippets by substituting them with Pythonic alternatives. We introduced a novel approach by using one feedforward and two recurrent neural networks along with explicit programming in order to locate the snippets and generate an equivalent alternative for them. The approach was validated by testing on more than 13 000 Python programs coded by students. According to the evaluation results, given a source code containing nonidiomatic snippets, our algorithm localizes and correctly fixes them in about half of the cases, making the code more Pythonic.


The precision of a practical system would go up to about 60% as many corrupted programs can be trivially recognized. Such a system could be utilized to idiomatize large Python projects with human supervision. It could also be used for educational purposes, since it has the ability to aid the learning process by offering an improved version of the code. In future work, we would like to apply a more general approach to the refactoring process in order to increase the number of substitutable nonidiomatic patterns. We also plan to experiment with different neural network architectures to improve the overall precision and recall of the system.

Acknowledgments. EFOP-3.6.3-VEKOP-16-2017-00001: Talent Management in Autonomous Vehicle Control Technologies. The Project is supported by the Hungarian Government and co-financed by the European Social Fund. We would like to express our great appreciation to László Zsakó and Gyula Horváth for providing an enormous amount of Python codes to test our algorithm on. The data is from the Eötvös Loránd University’s programming exercise bank and submission website.

A Appendix

Table 5. Pattern of maximum search with the name of the list, and the name of the maximum value.

Class index: 1, list: arr, result: MAx
    maXind = 0; MAx = arr[0]
    for loopInd in range(len(arr[1:])):
        if MAx < arr[loopInd+1]:
            MAx = arr[loopInd+1]
            maXind = loopInd+1

Table 6. Pattern of linear search with the name of the list, and the name of the boolean return value.

Class index: 2, list: arr, result: found
    found = Pred(arr[0])
    loopInd = 0
    while loopInd < len(arr)-1 and not found:
        if Pred(arr[loopInd+1]):
            found = True
        loopInd = loopInd + 1


Table 7. Pattern of summation with the name of the list, and the name of the sum value.

Class index: 3, list: arr, result: SUm
    SUm = 0
    for loopInd in range(len(arr)):
        if Pred(arr[loopInd]):
            SUm += arr[loopInd]

Table 8. Pattern of all with the name of the list, and the name of the boolean return value.

Class index: 4, list: arr, result: All
    All = Pred(arr[0])
    loopInd = 0
    while loopInd < len(arr)-1 and All:
        if not Pred(arr[loopInd+1]):
            All = False
        loopInd = loopInd + 1

Table 9. Pattern of any with the name of the list, and the name of the boolean return value.

Class index: 5, list: arr, result: Any
    Any = Pred(arr[0])
    loopInd = 0
    while loopInd < len(arr)-1 and not Any:
        if not Pred(arr[loopInd+1]):
            Any = True
        loopInd = loopInd + 1

Most expensive house
A real estate firm stores the area and price of the houses for sale. Write a program which finds the index of the most expensive house.
Input: The first line of the standard input contains the number of houses (1≤N≤100), the following N lines each contain the area of the house (in m², 1≤A≤500) and the price (in thousands of USD, 1≤P≤1000).
Output: The first line of the output should contain the index of the most expensive house. If there are multiple solutions then the smallest index should be written.
Example:
Input:
6
42 15
110 20
125 160
166 180
42 10
110 39
Output:
4

Fig. 6. An example exercise.


References

1. Aftandilian, E., Sauciuc, R., Priya, S., Krishnan, S.: Building useful program analysis tools using an extensible Java compiler. In: 2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation, pp. 14–23. IEEE (2012)
2. Ahmed, T., Hellendoorn, V., Devanbu, P.T.: Learning lenient parsing & typing via indirect supervision. CoRR, abs/1910.05879 (2019)
3. Allamanis, M., Barr, E.T., Devanbu, P., Sutton, C.: A survey of machine learning for big code and naturalness. ACM Comput. Surv. (CSUR) 51(4), 1–37 (2018)
4. Danish, M., Allamanis, M., Brockschmidt, M., Rice, A., Orchard, D.: Learning units-of-measure from scientific code. In: 2019 IEEE/ACM 14th International Workshop on Software Engineering for Science (SE4Science), pp. 43–46. IEEE (2019)
5. Gupta, R., Pal, S., Kanade, A., Shevade, S.: DeepFix: fixing common C language errors by deep learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
6. Habib, A., Pradel, M.: Neural bug finding: a study of opportunities and challenges. CoRR, abs/1906.00307 (2019)
7. Hellendoorn, V.J., Bird, C., Barr, E.T., Allamanis, M.: Deep type inference. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 152–162 (2018)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
9. Kaleeswaran, S., Santhiar, A., Kanade, A., Gulwani, S.: Semi-supervised verified feedback generation. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 739–750 (2016)
10. Karampatsis, R.-M., Sutton, C.: Maybe deep neural networks are the best choice for modeling source code. CoRR, abs/1903.05734 (2019)
11. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (Poster) (2015)
12. Loper, E., Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the ACL-2002 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, ETMTNLP 2002, USA, vol. 1, pp. 63–70. Association for Computational Linguistics (2002)
13. Pradel, M., Sen, K.: DeepBugs: a learning approach to name-based bug detection. Proc. ACM Program. Lang. 2(OOPSLA), 1–25 (2018)
14. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
15. Vasic, M., Kanade, A., Maniatis, P., Bieber, D., Singh, R.: Neural program repair by jointly learning to localize and repair. CoRR, abs/1904.01720 (2019)
16. Wong, W.E., Gao, R., Li, Y., Abreu, R., Wotawa, F.: A survey on software fault localization. IEEE Trans. Softw. Eng. 42(8), 707–740 (2016)

BreakingBED: Breaking Binary and Efficient Deep Neural Networks by Adversarial Attacks

Manoj-Rohit Vemparala¹(✉), Alexander Frickenstein¹, Nael Fasfous², Lukas Frickenstein¹, Qi Zhao¹, Sabine Kuhn¹, Daniel Ehrhardt¹, Yuankai Wu¹, Christian Unger¹, Naveen-Shankar Nagaraja¹, and Walter Stechele²

¹ BMW Autonomous Driving, Unterschleißheim, Germany
{manoj-rohit.vemparala,alexander.frickenstein,lukas.frickenstein,qi.zhao,sabine.kuhn,daniel.ehrhardt,yuankai.wu,christian.unger,naveen-shankar.nagaraja}@bmw.de
² Technical University of Munich, Munich, Germany
{nael.fasfous,walter.stechele}@tum.de

Abstract. Deploying convolutional neural networks (CNNs) for embedded applications presents many challenges in balancing resource-efficiency and task-related accuracy. These two aspects have been well-researched in the field of CNN compression. In real-world applications, a third important aspect comes into play, namely the robustness of the CNN. In this paper, we thoroughly study the robustness of uncompressed, distilled, pruned and binarized neural networks against white-box and black-box adversarial attacks (FGSM, PGD, C&W, DeepFool, LocalSearch and GenAttack). These new insights facilitate defensive training schemes or reactive filtering methods, where the attack is detected and the input is discarded and/or cleaned. Experimental results are shown for distilled CNNs, agent-based state-of-the-art pruned models, and binarized neural networks (BNNs) such as XNOR-Net and ABC-Net, trained on CIFAR-10 and ImageNet datasets. We present evaluation methods to simplify the comparison between CNNs under different attack schemes using loss/accuracy levels, stress-strain graphs, box-plots and class activation mapping (CAM). Our analysis reveals susceptible behavior of uncompressed and pruned CNNs against all kinds of attacks. The distilled models exhibit their strength against all white-box attacks with the exception of C&W. Furthermore, binary neural networks exhibit resilient behavior compared to their baselines and other compressed variants.

Keywords: Convolutional neural networks · Model compression · Model robustness · Adversarial attacks

M.-R. Vemparala, A. Frickenstein, N. Fasfous and L. Frickenstein: Equal contributions.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 148–167, 2022. https://doi.org/10.1007/978-3-030-82193-7_10

Breaking Binary and Efficient Deep Neural Networks

1 Introduction

Neural network compression is an extensively studied topic for reducing the computational complexity [21,27,36], the memory demand [15,19,25] and/or the energy consumption [42] of deep neural networks (DNN) deployed on embedded systems. These aspects widen the potential for DNN applications in real-world scenarios. Particularly in the field of robotics and autonomous driving, increasingly deeper and larger convolutional neural networks (CNNs) are deployed on resource-constrained hardware platforms, enabling computer vision-based applications, such as pedestrian detection or free-space detection. Systems in autonomous vehicles are safety critical, maintaining zero-tolerance for potential threats to functional safety. Attacking (breaking) neural networks can be done by injecting small perturbations to their inputs, referred to as adversarial attacks [39]. Under the assumption of varying degrees of information on the CNN and the accessibility of its internal parameters, several black-box (GenAttack [2], LocalSearch [31]) and white-box (FGSM [22], DeepFool [30] and Carlini & Wagner [5]) attacks are potential threats. Understanding these threats helps to develop pro-active [11] and re-active [33] methods to defend against adversarial examples and thereby improve CNN robustness.

Fig. 1. Experimental setup of BreakingBED for breaking binary (C) and efficient (A, B) DNNs attacked with white-box (FGSM, PGD and C&W) and black-box (LocalSearch and GenAttack) adversarial attacks. The evaluation uses loss/accuracy levels, stress-strain graphs, box-plots and class activation mapping (CAM).

Recent works investigated the mitigation of such threats through robust training of neural networks [14] and robust neural architecture search (NAS) techniques [12]. In [26], the authors compress neural networks through robust quantization, lowering the computational complexity while maintaining good performance against potential attacks. Further investigations on the robustness of binary neural networks (BNNs) were carried out in [10], where BNNs were attacked with white-box (FGSM [22] and C&W [5]) and black-box [34] techniques. The robustness of BNNs was concluded, albeit on basic adversarially trained networks from [34] and a small set of attacks. In order to get a deeper understanding of the effectiveness of adversarial attacks (Sect. 3), applied to binary and efficient DNNs (Sect. 2), we perform an extensive set of robustness evaluation experiments. In detail, we expose vanilla full-precision, distilled, pruned and binary DNNs to a variety of adversarial attacks in Sect. 4.

2 Compression of Deep Neural Networks

Many works in literature have focused on reducing the redundancy emerging from training deeper and wider neural networks, aiming to mitigate the challenges of their deployment on edge devices. Compression techniques such as knowledge distillation, pruning, and binarization can potentially make CNNs more efficient in embedded settings.

2.1 Knowledge Distillation

Knowledge distillation (KD) is the transfer of knowledge from a teacher to a student network [20,40]. The student can be a smaller DNN, which is trained on the soft labels of the larger teacher network, achieving an improvement in the accuracy-efficiency trade-off. The student represents a compressed version of the teacher, condensing its knowledge. This paper focuses on KD training, using the Kullback-Leibler (KL) divergence between the teacher and the student output distributions as the loss function in Eq. (1). Here, $\sigma(f_t(I))$ and $\sigma(f_s(I))$ represent the softmax outputs computed from the logits of the teacher and student network respectively, for a sample image $I$ in a mini-batch of $N$ samples.

$$L^{KL}_{KD}(f_t, f_s, T) = \sum_{n=1}^{N} \sigma\left(f_t(I_n)/T\right) \log\left(\frac{\sigma\left(f_t(I_n)/T\right)}{\sigma\left(f_s(I_n)/T\right)}\right) \tag{1}$$

During the knowledge transfer using the teacher's logits, a softmax temperature $T > 1$ is used. During the evaluation, we use $T = 1$ to obtain the softmax cross-entropy loss.
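As an illustration of Eq. (1), the following minimal sketch computes the temperature-scaled KL loss for a small batch in plain Python. The logit values and the helper names `softmax` and `kd_kl_loss` are hypothetical and chosen only for this example; the sum over classes inside the KL term is written out explicitly, which Eq. (1) leaves implicit.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, temperature):
    """Sum over the batch of KL(teacher || student) on softened outputs, cf. Eq. (1)."""
    loss = 0.0
    for f_t, f_s in zip(teacher_logits, student_logits):
        p = softmax(f_t, temperature)  # softened teacher distribution
        q = softmax(f_s, temperature)  # softened student distribution
        loss += sum(p_k * math.log(p_k / q_k) for p_k, q_k in zip(p, q))
    return loss

# Hypothetical logits for a mini-batch of N = 2 samples and 3 classes.
teacher = [[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]]
student = [[2.5, 1.2, 0.3], [0.5, 2.0, 0.4]]
loss_t1 = kd_kl_loss(teacher, student, temperature=1.0)
loss_t30 = kd_kl_loss(teacher, student, temperature=30.0)
```

With a high temperature both distributions move towards uniform, so the raw KL value shrinks; the loss is zero only when student and teacher produce identical softened outputs.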

2.2 Pruning

Pruning aims to eliminate redundancies in DNNs and produce smaller, more efficient neural networks. Pruning has been investigated in many works, over a wide range of DNN models, achieving high compression rates while maintaining high prediction accuracy [15,18,19]. Guo et al. [13] present an irregular pruning method, which can significantly reduce the parameter redundancy by integrating connection pruning with the retraining process. Recently, structured pruning techniques, which remove larger, regular parts of the network, have achieved a tangible improvement in hardware acceleration with a negligible accuracy loss [9,15,17,43]. More recently, He et al. proposed a learning-based compression method in AMC-AutoML [19]. The authors leverage a reinforcement learning (RL) agent, which learns the possible sparsities in each layer and prunes them based on an $\ell_2$-norm heuristic. We adapt the RL-agent of AMC-AutoML to support different pruning regularities such as filter-wise (F. Prune), channel-wise (Ch. Prune), kernel-wise (K. Prune) and weight-wise (W. Prune) pruning (shown in Fig. 1). Pruning input channels from a layer also discards the corresponding output filters from previous layers. Thus, Ch. Prune and F. Prune result in a similar compression ratio and CNN structure. The pruning regularity has a direct impact on the hardware implementation complexity and throughput benefits. In this paper, the pruning rate is set at a constant value of 50% over all experiments and pruning regularities.
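The $\ell_2$-norm heuristic for filter-wise pruning can be sketched as follows. This is a simplified illustration under stated assumptions, not the AMC-AutoML agent: the per-layer sparsity is fixed to 50% instead of being learned by an RL agent, each filter is given as a flat list of hypothetical weights, and `prune_filters` is a name introduced only for this example.

```python
import math

def l2_norm(filter_weights):
    """Euclidean norm of one flattened filter."""
    return math.sqrt(sum(w * w for w in filter_weights))

def prune_filters(filters, prune_rate=0.5):
    """Return the indices of the filters kept after removing the lowest-norm ones."""
    n_prune = int(len(filters) * prune_rate)
    ranked = sorted(range(len(filters)), key=lambda i: l2_norm(filters[i]))
    pruned = set(ranked[:n_prune])  # filters with the smallest norms are dropped
    return [i for i in range(len(filters)) if i not in pruned]

# Hypothetical layer with four flattened filters.
layer = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [0.05, -0.03, 0.02],
]
kept = prune_filters(layer, prune_rate=0.5)
```

In this toy layer the two filters with near-zero weights are removed and filters 0 and 2 remain. The same ranking could be applied to channels, kernels or individual weights to mimic the other pruning regularities.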

2.3 Binarization

Binarization represents the most aggressive form of quantization, where the network weights $W$ and activations are constrained to ±1 values. This greatly reduces the memory requirements of DNNs. In theory, binarizing a single-precision floating-point DNN reduces its memory footprint by up to 32×. Different schemes for binarization of a DNN have been proposed. Courbariaux et al. [7] introduced the concept of training neural networks with binary weights $B$ during inference and maintaining a latent representation during back-propagation. The authors later augmented this approach with binarized activations [21]. Rastegari et al. [36] introduced XNOR-Net, where the convolution of an input feature map $A_{l-1}$ and weight tensor $W$ is approximated by a combination of XNOR operations and popcounts $\oplus$, followed by a multiplication with a scaling factor $\alpha$, such that $\mathrm{Conv}(A_{l-1}, W) \approx (\mathrm{sign}(A_{l-1}) \oplus B) \cdot \alpha$ (shown in Fig. 1). Binary neural networks (BNNs) typically suffer from accuracy degradation. To mitigate this problem, Lin et al. [27] proposed a scheme for Accurate Binary CNNs (ABC-Net). The authors approximated the convolution by using a linear combination of multiple binary bases for weights and activations, shrinking the accuracy gap to full-precision counterparts. In this paper, we implement the ABC-Net and XNOR-Net binarization techniques to evaluate the effect of adversarial attacks on accurate BNNs.
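The XNOR-and-popcount approximation can be illustrated for a single dot product, which is the building block of a binarized convolution. The sketch below assumes that the scaling factor $\alpha$ is the mean absolute value of the real-valued weights, a common choice in XNOR-Net; the input values and helper names are hypothetical.

```python
def sign(x):
    """Binarize a real value to +1 or -1 (zero maps to +1)."""
    return 1 if x >= 0 else -1

def pack_bits(values):
    """Encode a +1/-1 vector as an integer bit mask (+1 -> 1, -1 -> 0)."""
    bits = 0
    for k, v in enumerate(values):
        if v > 0:
            bits |= 1 << k
    return bits

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two +1/-1 vectors of length n from their bit encodings."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the signs agree
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # agreements minus disagreements

def binary_dot(activations, weights):
    """Approximate sum(a * w) as (sign(a) XNOR-popcount sign(w)) * alpha."""
    n = len(weights)
    alpha = sum(abs(w) for w in weights) / n  # assumed scaling factor
    a_bin = [sign(a) for a in activations]
    b_bin = [sign(w) for w in weights]
    return xnor_popcount_dot(pack_bits(a_bin), pack_bits(b_bin), n) * alpha

# Hypothetical activations and latent full-precision weights.
activations = [0.7, -1.2, 0.3, -0.4]
weights = [0.5, -0.25, 0.75, 0.5]
approx = binary_dot(activations, weights)
exact = sum(a * w for a, w in zip(activations, weights))
```

Here three of the four sign pairs agree, so the binary dot product is 2 and the scaled result is 1.0, compared with an exact full-precision value of about 0.675. The gap illustrates the accuracy degradation that ABC-Net reduces by combining several binary bases.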

3 Adversarial Attacks

One option to attack (break) neural networks is by injecting small perturbations (adversarial biases) called adversarial attacks. An adversarial example $I^{Adv}$ that forces a given classifier with parameters $\theta$ to misclassify an image $I$ with true label $L$ renders a successful non-targeted attack: $A = \{I^{Adv} \mid \theta(I^{Adv}) \neq L\}$. A successful targeted attack, in contrast, can be defined as $A = \{I^{Adv} \mid \theta(I^{Adv}) = L_t\}$ for some target class $t$. The capability of the adversary can be described by a set of allowed perturbations $S: D(I, I^{Adv}) \leq \epsilon$, restricting the maximum possible perturbation distance $D(I, I^{Adv})$ to a given image $I$ by some adversarial manipulation budget $\epsilon$. Finding $I^{Adv}$ can be formulated as a maximization problem as defined in Eq. (2), whereby various attacks are designed to be effective using different distance metrics ($\ell_1$, $\ell_2$, $\ell_\infty$) [4].

$$\max_{I^{Adv} \in S} \mathcal{L}(I^{Adv}, L, \theta) \tag{2}$$


Attacks can be categorized regarding the degree of accessibility to a model’s internal parameters θ. White-box attacks [3,5,22,24,29,39] assume complete model transparency, allowing full control and access to the target CNN. In most real-world scenarios, a model’s fine internal details are not easily accessible, rendering white-box attacks less practical [6]. On the other hand, black-box attacks [2,31] assume no such information. The adversary can be a standard user, with access to only the inputs and the outputs of a targeted model. Such attacks are more practical and can have severe consequences in real-time critical applications. Different models learn similar features when they are trained for the same task. Adversarial perturbations are highly aligned with the weight vectors of a model. This results in the generalization of adversarial examples over different models [22], making it possible to transfer a white-box attack from one model as a black-box attack to another [24].

3.1 White-Box Attacks

Fast Gradient Sign Method: The most commonly used attack to verify the robustness of neural networks against input perturbations is the fast gradient sign method (FGSM) [22]. FGSM linearizes the loss function of a neural network around $\theta$ by calculating its gradient $\nabla \mathcal{L}(I, L, \theta)$ to generate adversarial examples $I^{Adv}$, resulting in an efficient solution to Eq. (2). The input variation parameter $\epsilon$ controls the perturbation’s amplitude [24], as expressed in Eq. (3).

$$I^{Adv} = I + \epsilon \cdot \mathrm{sign}\left(\nabla \mathcal{L}(I, L, \theta)\right) \tag{3}$$
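The single-step update of Eq. (3) can be sketched on a toy problem. The example below assumes a hypothetical linear classifier with logistic loss and labels in {-1, +1}, so that the input gradient has a closed form; for a CNN the gradient would come from back-propagation instead. All values and function names are illustrative.

```python
import math

def logistic_loss(x, w, y):
    """Loss of a toy linear classifier with label y in {-1, +1}."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-y * z))

def input_gradient(x, w, y):
    """Gradient of the logistic loss with respect to the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(y * z))
    return [coeff * wi for wi in w]

def sign(v):
    """Return -1, 0 or +1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(x, grad, eps):
    """Single-step attack, cf. Eq. (3): x_adv = x + eps * sign(grad)."""
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical input, weights and label.
x = [0.2, -0.1, 0.4]
w = [1.0, -2.0, 0.5]
y = 1
x_adv = fgsm(x, input_gradient(x, w, y), eps=0.1)
```

Every input component moves by exactly the amplitude in the direction that increases the loss, so the perturbation stays within an $\ell_\infty$ distance of 0.1 from the original input.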

The attack is strengthened when performed iteratively. This can be considered as an extension of FGSM, generating adversarial samples using a small step-size [24].

Projected Gradient Descent: An even more effective variant is iterative projected gradient descent (PGD) on the loss function with uniform random noise initialization [38], expressed in Eq. (4).

$$I^{Adv}_{i+1} = \pi_S\left(I^{Adv}_i + \alpha \cdot \nabla \mathcal{L}\left(I^{Adv}_i, L, \theta\right)\right) \tag{4}$$

Here, adversarial examples $I^{Adv}_{i+1}$ are generated by taking one step in the ascent direction of the loss gradient $\nabla \mathcal{L}(I^{Adv}_i, L, \theta)$ with respect to the previous image $I^{Adv}_i$ at iteration $i$, where the step-size is scaled by $\alpha$, followed by a projection $\pi$ onto the legal set $I + S$ with $S = \{\delta : \|\delta\|_p \leq \epsilon\}$, which ensures legal adversaries. If not mentioned otherwise, PGD attacks focus on the $\ell_\infty$-norm as a distance metric for $D(I, I^{Adv})$, representing an $\ell_\infty$-ball around natural images $I$. The iterative multi-step optimization method is able to converge to local maxima of the non-concave and constrained maximization problem defined in Eq. (2), representing possible worst-case adversaries for the underlying model. By considering random uniform initialization, arbitrary starting points on the corresponding loss surface are ensured, resulting in an exploration of potentially varying local maxima and ultimately revealing the structural behavior of the corresponding loss surface. This renders the PGD attack the “ultimate” first-order adversary, as stated by Madry et al. [28].

Carlini and Wagner: Carlini and Wagner (C&W) [5] presented a targeted attack to refute the promising approach of defensive distillation [35]. The proposed C&W attack emerged as one of the strongest attacks in literature [1]. C&W finds perturbations $\delta$ with minimal distance $D(I, I + \delta)$ that will change the classification of image $I$ to the target class $t$. This is a challenging non-linear optimization problem and therefore the authors introduce a function $g$, such that $g(I + \delta) = 0$ when the classifier gets fooled towards the target class. The attack constructs adversarial examples which try to minimize the objective in Eq. (5).

$$\min_{\delta}\left(\|\delta\|_p + \epsilon \cdot g(I + \delta)\right), \quad \text{where } g(I) = \left(\max_{j \neq t} Z(I)_j - Z(I)_t\right)^+ \tag{5}$$
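To make the objective in Eq. (5) concrete, the following sketch evaluates $g$ and the full objective for given pre-softmax outputs. It only evaluates the objective and does not perform the gradient-based minimization; the logit values, the perturbation and the function names `cw_g` and `cw_objective` are hypothetical.

```python
def cw_g(logits, target):
    """g(I) = (max_{j != t} Z(I)_j - Z(I)_t)^+ for pre-softmax outputs Z(I)."""
    best_other = max(z for j, z in enumerate(logits) if j != target)
    return max(best_other - logits[target], 0.0)

def cw_objective(delta, logits_adv, target, eps, p=2):
    """Objective of Eq. (5): ||delta||_p + eps * g(I + delta)."""
    norm = sum(abs(d) ** p for d in delta) ** (1.0 / p)
    return norm + eps * cw_g(logits_adv, target)

# Hypothetical logits of a 3-class model with target class t = 0.
not_fooled = cw_g([2.0, 5.0, 1.0], target=0)   # class 1 still dominates
fooled = cw_g([6.0, 5.0, 1.0], target=0)       # target class has the largest logit
objective = cw_objective([3.0, 4.0], [2.0, 5.0, 1.0], target=0, eps=1.0)
```

The function $g$ is positive as long as some other class has a larger logit than the target and drops to zero once the target class dominates, at which point only the perturbation norm remains in the objective.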

$Z(I)_j$ indicates the output of the CNN for class $j$ before the softmax layer. The minimum condition $g(I) = 0$ occurs when $Z(I)_t \geq Z(I)_j \; \forall j \neq t$, i.e. when the target class has the largest output. The choice of $\epsilon$ maintains a trade-off between the similarity of the attacked image to the original and the success rate for the target class. Using the $\ell_2$ distance metric, the objective function is minimized through gradient descent.

DeepFool: With the DeepFool [30] attack, the authors propose a method to generate adversarial examples that fool classifiers on large-scale datasets by estimating the distance of an input instance $I$ to the closest decision boundary. The iterative method estimates the perturbation $\delta_i$ at each iteration $i$ until the classifier $f(I_i)$ changes its prediction ($f(I_i) \neq L$). In practice, once an adversarial perturbation $\delta$ is found, the adversarial example is pushed further beyond the decision boundary. The algorithm is not guaranteed to converge to the optimal perturbation; nevertheless, it generates adversarial examples with good approximations of the minimal perturbation. The size of the calculated perturbation can also be interpreted as a metric for the model’s robustness against adversarial attacks [41].
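Returning to the iterative PGD update of Eq. (4), the loop of random start, gradient ascent step and projection can be sketched as follows. The example assumes the same kind of hypothetical linear classifier with logistic loss as a stand-in for a CNN and uses the $\ell_\infty$-ball as the legal set; the toy values of the amplitude, step-size and iteration count are illustrative and are not the settings used in the experiments.

```python
import math
import random

def loss_and_grad(x, w, y):
    """Logistic loss of a linear model and its gradient with respect to x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    loss = math.log(1.0 + math.exp(-y * z))
    coeff = -y / (1.0 + math.exp(y * z))
    return loss, [coeff * wi for wi in w]

def project_linf(x_adv, x, eps):
    """Projection onto the l_inf ball of radius eps around x."""
    return [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]

def pgd_attack(x, w, y, eps, alpha, iterations, seed=0):
    """Iterate Eq. (4): ascent step on the loss followed by projection."""
    rng = random.Random(seed)
    x_adv = [xi + rng.uniform(-eps, eps) for xi in x]  # uniform random start
    for _ in range(iterations):
        _, grad = loss_and_grad(x_adv, w, y)
        stepped = [xa + alpha * g for xa, g in zip(x_adv, grad)]
        x_adv = project_linf(stepped, x, eps)
    return x_adv

# Hypothetical input, weights and label.
x = [0.2, -0.1, 0.4]
w = [1.0, -2.0, 0.5]
y = 1
x_adv = pgd_attack(x, w, y, eps=0.1, alpha=0.5, iterations=100)
clean_loss, _ = loss_and_grad(x, w, y)
adv_loss, _ = loss_and_grad(x_adv, w, y)
```

For this convex toy loss every random start converges to the same corner of the $\ell_\infty$-ball. For a CNN the loss surface is non-concave, which is why reruns with different random starts are used to explore different local maxima.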

3.2 Black-Box Attacks

LocalSearch: LocalSearch [31] is a simple gradient-free adversarial black-box attack, which is based on random perturbation and a greedy search algorithm around the perturbed pixels. The LocalSearch procedure works in iterations, where each iteration consists of two steps. The first step is to select and evaluate a small subset of points $P_i$, referred to as the local neighborhood. In the second step, a new solution $P_{i+1}$ is selected by taking the evaluation of the previous solution $P_i$ into account. LocalSearch is simple to implement, but is computationally expensive, similar to most greedy search algorithms.


GenAttack: GenAttack [2] is a gradient-free optimization strategy based on a genetic algorithm. The initial population of perturbed image examples is generated by adding uniform random noise. The best individuals survive the generation based on their fitness evaluation, the selection strategy and the crossover and mutation probabilities. Fitness evaluation reflects the optimization objective, while the selection strategy allows elite individuals in the population to generate new children perturbations through crossover and mutation mechanisms. GenAttack is a faster search algorithm when compared to LocalSearch [31], and generates perturbations which are imperceptible to the human eye.

4 Breaking Binary and Efficient DNNs

Although a successful attack could easily be carried out by adding large perturbations, finding the minimum necessary perturbation in each case is typically desirable to perform the attack in an inconspicuous manner. This motivates making CNNs particularly robust against the adversarial attacks that are relevant or expected in practice. However, despite the requirement to keep the perturbation as small as possible, the target for training against an attack structure can be to maximize a corresponding loss function. A prior analysis of the robustness of real-world compressed CNNs provides insights which facilitate the realization of strong adversarial defense methods. We evaluate the robustness of CNNs which are trained and evaluated on the CIFAR-10 [23] or ImageNet [37] datasets. The 50K training and 10K test images (32 × 32 pixels) of CIFAR-10 are used to train and evaluate, respectively, compressed variants of ResNet20/56 [16,19,27,36,40]. The ImageNet dataset consists of ∼1.28M training and 50K validation images (256 × 256 pixels). Compressed variants of ResNet18/50 are trained and evaluated for the ImageNet experiments. If not otherwise mentioned, all hyper-parameters specifying the training and the attacks were adopted from the reference implementation. The robustness evaluation covers various white-box (FGSM, PGD, C&W, DeepFool) and black-box (LocalSearch, GenAttack) attacks on the CIFAR-10-trained ResNet20/56 compressed variants, as well as ImageNet-trained CNNs. We perform all the experiments using the trained statistics for the batch normalization layers.

4.1 CNN Compressed Variants

Table 1 summarizes the compressed CNNs and their full-precision counterparts analyzed in this paper. It shows that the neural networks drastically vary in their memory demand and their compute complexity. Deep learning inference accelerators such as the NVIDIA-T4 GPU [32] or Xilinx FPGAs with DSP48 blocks support SIMD-based bit-wise operations [8]. In particular, a single DSP48 block can perform two 16-bit fixed-point multiplications or 48 XNOR operations at once. The normalized compute complexity (NCC) is defined as the optimal utilization of MAC and XNOR operations in one compute unit. The DSP48 block serves as a reference implementation to compute NCC in Table 1.


Table 1. Top-1 accuracy [%], memory demand [MB] and the normalized compute complexity (NCC) of compressed CNNs and their full-precision counterparts.

Dataset    Model                      Acc. [%]   Memory demand [MB]   NCC [10^6]
CIFAR-10   ResNet20 [16]              92.46      1.07                 41
           KD-KL [40]                 93.25      1.07                 41
           Ch.Prune [19]              89.76      0.70                 19
           K.Prune [19]               90.73      0.61                 20
           W.Prune [19]               91.98      0.59                 20
           XNOR [36]                  82.71      0.04                 1.3
           ABC(1 × 1) [27]            83.42      0.04                 1.3
           ABC(3 × 3) [27]            88.94      0.12                 8.0
           ABC(5 × 5) [27]            90.64      0.20                 21.3
           ResNet56 [16]              93.88      3.40                 125
           KD-KL [40]                 94.24      3.40                 125
           Ch.Prune [19]              92.86      2.50                 62
           K.Prune [19]               93.04      2.19                 63
           W.Prune [19]               93.54      2.02                 62
           XNOR [36]                  83.24      0.11                 3.0
           ABC(1 × 1) [27]            86.29      0.11                 3.0
           ABC(3 × 3) [27]            92.48      0.33                 24
           ABC(5 × 5) [27]            92.82      0.55                 66
ImageNet   ResNet50 [16]              75.43      102.01               10216
           ResNet18 [16]              69.00      46.72                1814
           ResNet18-Ch.Prune [19]     67.62      34.52                884
           ResNet18-XNOR [36]         49.10      4.14                 173
           ResNet18-ABC(1 × 1) [27]   51.07      3.48                 153
           ResNet18-ABC(3 × 3) [27]   59.83      6.28                 417

4.2 Evaluation of Robustness

PGD-Evaluation: Considering the PGD attack as the “ultimate” first-order attack, this section experimentally explores the structure of the loss surfaces and the corresponding accuracy deterioration of the proposed efficient DNNs, while exposing the models to PGD adversaries, similar to those proposed by Madry et al. [28]. Investigating the resulting structural behavior, especially the loss level to which the PGD attack converges and the speed at which the accuracy deteriorates, helps in understanding the adversarial robustness of the underlying models with respect to a defined PGD threat model $\tau_{PGD} = \{\epsilon, \alpha, i\}$. All models are pre-trained on CIFAR-10 without any adversarial examples, to isolate the influence of the different compression techniques on adversarial robustness. Subsequently, each model is exposed to PGD attacks from $\tau_{PGD} = \{\epsilon = 2, \alpha = 0.5, i = 1000\}$. Following the method of Carlini et al. [4], $i$ was increased to verify convergence, ensuring local maxima that represent potentially worst-case adversarial examples for the underlying model with respect to the applied threat model $\tau_{PGD}$. However, results are only shown up to $i = 100$, since $\tau_{PGD}$ showed convergence for all investigated models in this range. The loss value and the corresponding accuracy of the models under attack were tracked every 5th iteration.

Fig. 2. PGD attack accuracy (solid) and loss (dashed) over PGD iterations for compressed variants of ResNet20 (left) and ResNet56 (right) averaged over five reruns of PGD attack. Additionally, the horizontal breaking line (BL - dashed black) visualizes the deterioration of model accuracy below random guessing (≤ 10%) for CIFAR-10. Visual markings are added to categorize models above and below the BL at i = 10.

In the following, the adversarial robustness of a model against PGD attacks is evaluated by (1) the overall loss level the PGD attack converges to and, as a consequence, the resulting accuracy, and (2) the number of iterations a model can sustain until it breaks. We consider a CNN model broken if its accuracy indicates that the classification is random (10% for the CIFAR-10 dataset), represented by model accuracy graphs dropping below the breaking line (BL). Figure 2 shows the mean over five reruns of the PGD attack for all models to exploit random initialization, which ensures random exploration of the underlying non-concave maximization problem as described in Sect. 3.

Consistently, all investigated pruning techniques harm adversarial robustness against the PGD attack relative to the vanilla versions of ResNet20/56, when considering (1) the loss and accuracy after a converged attack and (2) the speed of breaking. Vanilla and pruned versions of ResNet20 break within five iterations, whereas the respective ResNet56 versions break within ten iterations. KD shows greater resilience to the PGD attack since (1) its accuracy after the converged attack is higher compared to both the ResNet20/56 vanilla variants and (2) it breaks at a higher number of iterations. KD-KL breaks at i = 15 for its ResNet20 variant and at i = 40 for its ResNet56 variant. Binarization can improve the robustness against the defined PGD attack, materializing in (1) the higher loss and accuracy after a converged attack and (2) the greater resilience for a longer period of PGD iterations. XNOR-Net and ABC(5 × 5) break at i = 20, while ABC(3 × 3) and ABC(1 × 1) break at around i = 60 for their ResNet20 variants. For the ResNet56 variants, ABC(1 × 1) and ABC(5 × 5) break at i = 20, whereas ABC(3 × 3) sustains up to i = 40. The ResNet56 variant of XNOR-Net outperforms all other models in (1) accuracy after the converged attack (∼14%) and (2) being the only model that never breaks throughout this experiment (see Fig. 2 right).

Stress-Strain Evaluation: To facilitate the interpretation of the data generated from the experiments, we propose a method for evaluating robustness. Different models such as ResNet20 and ResNet56 have different baseline accuracies, making it difficult to directly compare the robustness of different training or compression schemes. Existing metrics, such as attack success-rate [2] or accuracy degradation, fail to capture the differences in the baseline accuracy of a network. Taking inspiration from the field of mechanics, we use formulas of stress and strain to make an analogy with the robustness of networks before they break. Applying a certain amount of stress on an object causes a certain measure of deformity or strain. We adapt the strain formula to our problem as $\varepsilon = \frac{A - A^*}{A}$, where $\varepsilon$ is the strain, $A$ is the accuracy before the attack and $A^*$ the deteriorated accuracy. Note that we use $\epsilon$ and $\varepsilon$ to represent perturbation amplitude and strain respectively. A network which sustains higher strain $\varepsilon$ w.r.t. an attack is less robust. The rate of change in $\varepsilon$ with increased stress indicates the resilience or fragility of the CNN under heavier forms of the same attack. Similar to the different types of mechanical stress (compressive, tensile or shear), iterative and amplitude-based attacks can represent different types of attack-stress $\sigma$. Using $\sigma$ and $\varepsilon$, we can compare the degree of robustness between networks, relative to their base accuracies. We can use inverted stress-strain graphs to better visualize the robustness of networks accordingly. Given the behavior of a network under a certain attack, we can classify its robustness in terms of material properties.
A network that sustains a high attack stress before breaking is a strong network. On the other hand, a network which gradually degrades with increased attack stress is a ductile network. Lastly, a network which breaks before it deforms can be considered a brittle network. Figure 3 shows a set of stress-strain graphs for all the networks and attacks investigated.

Fast Gradient Sign Method: For FGSM attacks, the results show that the KD-KL variant is more resilient compared to other compression techniques, as its strain $\varepsilon$ increases at a slower rate with intensified attack stress. During training, the distillation is performed using a higher temperature (T = 30). The attack perturbations are generated using the cross-entropy loss with T = 1, resulting in saturated gradients and therefore weakening the attack. Figure 3 shows an interesting effect of increased FGSM stress on the XNOR-Net variant. The robustness of ResNet56-XNOR is higher than that of other variants under low stress of up to $\sigma = 4$. Beyond that point, further attack stress severely harms the robustness of the network, making it the second-worst variant, following ABC(1 × 1). Generally, a boost in robustness is observed when the base CNN is the larger ResNet56 model. This increases the ductility of the networks, as they sustain more attack stress before breaking, when compared to the more brittle ResNet20 models. Interestingly, the same does not apply to the binarized ABC models, as they show similar robustness, irrespective of being ResNet20 or ResNet56 variants.
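The strain measure defined above is simple to compute once the accuracy before and after an attack is known. The sketch below applies it to a few FGSM rows copied from Table 4 (ImageNet), treating the perturbation amplitude as the attack stress; the function and variable names are introduced only for this illustration.

```python
def strain(acc_before, acc_after):
    """Strain = (A - A*) / A, the relative accuracy drop under attack."""
    return (acc_before - acc_after) / acc_before

# Natural accuracy [%] and accuracy after FGSM at amplitude 2, 4, 8, 16 (Table 4).
table4 = {
    "ResNet50": (75.43, [22.18, 16.24, 12.08, 7.46]),
    "ResNet18": (69.00, [12.82, 8.16, 5.19, 2.95]),
    "ResNet18-XNOR": (49.10, [7.57, 4.54, 2.19, 0.93]),
}

# One stress-strain curve per model, normalized by its own baseline accuracy.
curves = {
    name: [strain(natural, attacked) for attacked in after_attack]
    for name, (natural, after_attack) in table4.items()
}
```

Because each curve is normalized by the model's own natural accuracy, models with different baselines can be compared directly at the same stress level: a lower strain at a given stress indicates a more robust network under this definition.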

Fig. 3. Stress-strain graphs for various attacks on compressed variants of ResNet20 (top) and ResNet56 (bottom).
[Panels per model: FGSM, PGD (fixed i = 3), DeepFool, GenAttack ($\epsilon$ = 8, N = 16), LocalSearch (fixed $\epsilon$ = 16) and CW (fixed $\epsilon$ = 1). Axes: attack stress ($\sigma$) against strain ($\varepsilon$). Legend: Vanilla, Ch.Prune, K.Prune, W.Prune, XNOR, ABC(1 × 1), ABC(3 × 3), ABC(5 × 5), KD-KL. Annotations: strong, ductile, brittle.]

Fig. 4. Box-plots for attacks on compressed variants of ResNet20 and ResNet56.
[Panels: FGSM, PGD, CW, DeepFool, LocalSearch and GenAttack. Axis: accuracy after attack (0 to 100%). Legend: Vanilla, Ch.Prune, K.Prune, W.Prune, XNOR, ABC(1 × 1), ABC(3 × 3), ABC(5 × 5), KD-KL. Annotations: strong, brittle, ductile.]

Projected Gradient Descent: For PGD, increased attack stress can be interpreted as a higher perturbation amplitude $\epsilon$ or more iterations $i$. Figure 3 shows the attack stress $\sigma = \epsilon$, with iterations fixed to 3. The CNNs show various characteristics for this attack hyper-parameter setting. We observe that the KD-KL and XNOR variants of ResNet56 have a lower slope compared to other compressed CNNs, indicating ductile behavior.

Carlini & Wagner: For the C&W method, we set the attack stress $\sigma$ to the search iterations at $\epsilon = 1$ (see Eq. (5)). The results show the strength of this method, rendering all our networks brittle. This is characterized by the steep ascent in strain, breaking all CNNs with minimal attack stress.

DeepFool: Similar to the C&W attack, DeepFool renders most of the considered CNNs brittle. One exception is ResNet56-XNOR, which can sustain some amount of stress before completely breaking. It is worth noting that the other binary CNNs do not perform as well as ResNet56-XNOR in this case.

LocalSearch: The LocalSearch attack can also offer two types of stress: amplitude and iterations. Fig. 3 shows the stress-strain curves for a fixed amplitude of $\epsilon = 16$. For this amplitude, none of the networks completely break, even after 200 iterations of the attack. However, it is worth noting that binarized CNNs outperform the full-precision variants for both ResNet20 and ResNet56 experiments.

GenAttack: For GenAttack, we take the number of generations $i$ as the measure of attack stress, and fix the amplitude $\epsilon = 8$ and the population $N = 16$. In Fig. 3, a clear difference between the robustness of BNNs and other variants is observed. We can classify BNNs as strong against GenAttack, and all other variants as brittle.


Box-Plots: In Fig. 4, we present box-plots from data collected over a range of experiments. For each attack, we sweep over the respective strength and iterations mentioned in Table 2. The exact definition of strength and iteration for each attack can be recalled from Sect. 3. The data includes both models, ResNet20 and ResNet56.

Table 2. All strength and iteration combinations tested for ResNet20 and ResNet56 variants (vanilla, pruned, binary, and distilled). Strength and iteration definitions for each attack are explained in Sect. 3.

Attack        Strength $\epsilon$          Iterations i
FGSM          2, 4, 8, 16                  N/A
PGD           0.1, 0.5, 1.0, 2.0           2, 3, 4, 5
CW            0.01, 0.1, 1.0, 5.0, 10.0    1, 10, 20, 50
DeepFool      N/A                          1, 5, 10, 20
LocalSearch   8, 16, 32                    50, 100, 150, 200
GenAttack     8, 12                        50, 100, 150, 200 (popsize = 6, 16)

Each plot shows the distribution of all the accuracies achieved by the compression technique, after being attacked by the corresponding method, over all the considered strengths and iterations, as well as their combinations. The box-plots reveal the strength of BNNs against both black-box attacks (GenAttack and LocalSearch), when compared to other variants. Different compression techniques produce different distributions for the PGD attack (marked in Fig. 4). CW proves to be the strongest adversarial attack scheme across all the compressed variants.

4.3 Class Activation Mapping on Attacked CNNs

We use class activation maps (CAM) [44] to determine the region of interest (RoI) for the prediction class using clean and attacked images. The output feature maps of the last convolutional layer and the weight tensor of the fully-connected layer are taken as the input to the CAM. The CAM highlights regions of the image that influence the CNN’s prediction of a specific class. Similar to heat-maps, red regions indicate those with the highest contribution, while blue indicates the ones with the least. We applied CAM on various compressed variants of ResNet20 and ResNet56, trained on CIFAR-10, which are attacked by DeepFool (Table 3). As mentioned in Sect. 3, DeepFool attempts to find the adversarial perturbation which leads the CNN to the closest decision boundary. Once a perturbation is found, it is reinforced to push the prediction beyond that boundary. Through the CAM visualizations in this section, we attempt to capture this behaviour over the attack iterations.
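The CAM computation described above amounts to a weighted sum of the last convolutional feature maps, with the weights taken from the fully-connected layer row of the class of interest. A minimal sketch with tiny hypothetical feature maps follows; the rescaling to the range 0 to 255 is an assumption about how the heat-map intensities are obtained, and the function names are illustrative.

```python
def class_activation_map(feature_maps, fc_weights, class_idx):
    """CAM[y][x] = sum_k w[class][k] * F_k[y][x] over all feature maps k."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for k, fmap in enumerate(feature_maps):
        w = fc_weights[class_idx][k]
        for y in range(height):
            for x in range(width):
                cam[y][x] += w * fmap[y][x]
    return cam

def to_intensity(cam):
    """Rescale a CAM linearly to integer heat-map intensities in 0..255."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return [[round((v - lo) * scale) for v in row] for row in cam]

# Two hypothetical 2 x 2 feature maps and fully-connected weights for two classes.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
fc_weights = [[1.0, 0.5], [-1.0, 0.2]]
cam = class_activation_map(feature_maps, fc_weights, class_idx=0)
heat_map = to_intensity(cam)
```

In this toy case both the top-left and the bottom-right positions contribute equally to class 0 and receive the maximum intensity. For real images the map is usually upsampled to the input resolution before being overlaid.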


Table 3. CAM for ResNet20/56 and their compressed variants, computed on non-attacked and DeepFool-attacked versions of the automobile image from the CIFAR-10 dataset.
[Image grid. Columns: Vanilla, Distilled (KD-KL), Pruned (Ch./F., Kernel, Weight) and Binary (XNOR, ABC(1×1), ABC(3×3), ABC(5×5)). Rows: image → $I^{Adv}$ with No AA, i = 1 and i = 5, for ResNet20 - CIFAR10 and ResNet56 - CIFAR10.]

None of the compression techniques misclassifies the automobile example when given the unattacked raw image in Table 3. Three interpretations can be made from the heat maps. We support them with a quantitative analysis by measuring the third quartile value of the heat map intensity across all the pixels. First, in the CAM output of ResNet56's vanilla and channel-pruned variants for the unattacked input image, the RoI has large, focused interest regions. For an intensity range of (0, 255) blue→red, the third quartile value of the heat map intensity across all pixels is 184 and 162 for vanilla and channel-pruned respectively, indicating a large RoI. Second, the intensity of the interest regions decreases after the attack is applied for one iteration. The third quartile value decreases (171, 152), indicating weaker interest regions. Third, after the attack is applied for five iterations, the focus on the attacked region (bonnet) is reinforced to fool the network towards the nearest class (truck). The third quartile value decreases further (135, 121). Under DeepFool attacks, ResNet56 is more robust than ResNet20, as illustrated by the more distinct RoIs in the heat maps. The BNN variants have a small RoI compared to their vanilla model for unattacked images; the third quartile value for ResNet56-XNOR is 98, which reflects this. As the inherent RoI of BNNs is small and concentrated, the attack model has a lower chance of finding and perturbing the smaller set of critical pixels.
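The third-quartile statistic used above can be computed directly from the heat-map intensities. The sketch below is only illustrative: the function name `third_quartile` is hypothetical, and the text does not state which quartile convention was used, so the "inclusive" interpolation method of Python's `statistics.quantiles` is an assumption.

```python
import statistics
from typing import List


def third_quartile(heat_map: List[List[float]]) -> float:
    """Third quartile (Q3) of heat-map intensity over all pixels.

    Assumption: 'inclusive' interpolation; other conventions may give
    slightly different values on small maps.
    """
    pixels = [v for row in heat_map for v in row]
    return statistics.quantiles(pixels, n=4, method="inclusive")[2]


# A diffuse map keeps many pixels at high intensity, so its Q3 is high;
# a concentrated map has only a few hot pixels, so its Q3 stays low.
diffuse = [[0, 1, 2, 3, 4]]
concentrated = [[0, 0, 0, 0, 255]]
print(third_quartile(diffuse), third_quartile(concentrated))
```

A lower Q3 therefore indicates that fewer pixels carry high activation, which is how the text reads the smaller RoI of the BNN variants.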


M.-R. Vemparala et al.

Table 4. Accuracy (Top1) [%] of CNNs after FGSM adversarial attacks for ImageNet.

Model (FGSM)                Nat. Acc.   ε = 2    ε = 4    ε = 8    ε = 16
ResNet50 [16]               75.43       22.18    16.24    12.08    7.46
ResNet18 [16]               69.00       12.82    8.16     5.19     2.95
ResNet18-Ch.Prune [19]      67.62       11.18    6.64     3.99     2.34
ResNet18-XNOR [36]          49.10       7.57     4.54     2.19     0.93
ResNet18-ABC(1 × 1) [27]    51.07       9.11     4.65     2.30     1.13
ResNet18-ABC(3 × 3) [27]    59.83       11.33    5.73     2.65     1.43

Table 5. Accuracy [%] of CNNs after PGD adversarial attacks for ImageNet.

Model (Nat. Acc. [%])                 ε      i = 2    i = 3    i = 4    i = 5
ResNet50 [16] (75.43)                 0.1    25.77    16.07    9.83     5.91
                                      0.5    3.35     0.94     0.43     0.27
ResNet18 [16] (69.00)                 0.1    17.86    10.32    5.58     3.11
                                      0.5    1.33     0.17     0.04     0.01
ResNet18-Ch.Prune [19] (67.62)        0.1    17.02    10.23    5.92     3.50
                                      0.5    1.40     0.27     0.06     0.02
ResNet18-XNOR [36] (49.10)            0.1    13.16    11.46    10.06    8.84
                                      0.5    5.67     3.07     1.57     0.78
ResNet18-ABC(1 × 1) [27] (51.91)      0.1    18.35    16.22    14.20    12.37
                                      0.5    7.60     3.64     1.75     0.82
ResNet18-ABC(3 × 3) [27] (59.83)      0.1    23.90    20.81    17.80    15.07
                                      0.5    8.31     3.70     1.59     0.66

4.4 Robustness Evaluation on ImageNet Dataset

For the robustness evaluation on the ImageNet dataset [37], we use pre-trained ResNet50 and ResNet18 models, and compressed variants of ResNet18. We observe a higher attack search time for ImageNet compared to the CIFAR-10 dataset due to the larger image sizes and model complexity. Therefore, we limit our analysis to two white-box attacks (FGSM and PGD) and one black-box attack (GenAttack). We consider the compressed variants Ch-Prune, XNOR, ABC(1 × 1) and ABC(3 × 3), as specified in Tables 4–6.

Fast Gradient Sign Method: In Table 4, we report the natural accuracy and the attacked accuracy for different strengths (ε = {2, 4, 8, 16}). ResNet50 achieves the highest natural accuracy and the highest attacked accuracy at every strength. Among the compressed variants, the channel-pruned and ABC(3 × 3) models show slightly higher robustness across the strengths.
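The single-step FGSM perturbation behind Table 4 can be sketched as follows: each pixel moves by ε in the direction of the sign of the loss gradient. This is a minimal sketch rather than the evaluation code used in the paper; the name `fgsm_perturb` is hypothetical, the gradient is assumed to be supplied by the attacked model, and treating ε as a value on the 0 to 255 pixel scale is an assumption suggested by the strengths 2, 4, 8 and 16.

```python
from typing import List


def sign(value: float) -> int:
    """Return -1, 0 or +1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_perturb(pixels: List[float], loss_gradient: List[float],
                 epsilon: float, lo: float = 0.0, hi: float = 255.0) -> List[float]:
    """One FGSM step: x_adv = clip(x + epsilon * sign(dLoss/dx)).

    pixels        -- flattened input image (assumed 0..255 scale)
    loss_gradient -- gradient of the loss w.r.t. each pixel (from the model)
    epsilon       -- attack strength, in the same units as the pixels
    """
    return [min(hi, max(lo, x + epsilon * sign(g)))
            for x, g in zip(pixels, loss_gradient)]


# Only the sign of the gradient matters, not its magnitude.
adversarial = fgsm_perturb([10.0, 250.0, 100.0], [0.3, 2.0, -1.5], epsilon=8.0)
```

Because every pixel moves by at most ε, the perturbation is bounded in the L∞ norm; PGD repeats a step of this kind and projects back into the ε-ball after each iteration.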


Projected Gradient Descent: In Table 5, we report the attacked accuracy for two strengths (ε = 0.1, ε = 0.5). The attacked accuracy decreases for all the models as we increase the number of iterations i. At i = 5 and ε = 0.1, the binarized ResNet18 using ABC(3 × 3) retains an attacked accuracy 9.16 percentage points higher than the ResNet50 model. At the higher attack strength ε = 0.5, the prediction accuracy degrades for all the compressed variants.

Table 6. Accuracy (Top1) [%] of CNNs after GenAttack adversarial attacks for ImageNet. Pop Size = 6.

Model (Nat. Acc.)                    ε      i = 200        i = 400        i = 600        i = 800        i = 1000
                                            OA/TA          OA/TA          OA/TA          OA/TA          OA/TA
ResNet50 [16] (75.43 %)              8.0    21.29/12.80    11.64/34.46    6.87/51.94     4.67/64.08     3.06/72.82
                                     12.0   13.16/17.45    5.67/41.19     3.55/56.65     2.40/67.29     1.60/74.58
ResNet18 [16] (69.00 %)              8.0    16.41/14.52    8.11/41.83     4.35/62.58     2.36/75.62     1.34/83.29
                                     12.0   10.24/22.44    5.13/50.74     2.70/68.85     1.58/80.21     1.04/86.62
ResNet18-Ch.Prune [19] (67.62 %)     8.0    12.34/12.82    6.05/39.02     3.17/60.46     2.00/74.46     1.22/82.79
                                     12.0   7.33/20.25     3.29/49.44     1.84/68.97     1.08/80.11     0.88/86.80
ResNet18-XNOR [36] (49.10 %)         8.0    13.06/0.64     12.86/0.72     12.64/0.84     12.68/0.86     12.68/0.94
                                     12.0   11.56/0.78     11.14/0.92     11.14/1.04     11.04/1.16     10.82/1.22
ResNet18-ABC(1 × 1) [27] (51.07 %)   8.0    17.59/1.48     17.67/1.62     17.37/1.76     17.23/1.88     16.89/1.98
                                     12.0   15.83/1.90     15.40/2.08     15.20/2.26     15.02/2.34     14.86/2.52
ResNet18-ABC(3 × 3) [27] (59.83 %)   8.0    26.00/0.68     25.02/0.82     25.26/0.92     25.46/0.98     25.58/0.96
                                     12.0   22.50/0.74     22.04/0.94     22.36/1.02     21.75/1.08     21.90/1.14

OA/TA = Accuracy to original label/Accuracy to target label.

GenAttack: We set an adaptive mutation rate ρ and mutation range α for GenAttack based on the dataset configuration and set the population size to 6, as in [2]. In Table 6, we report the overall attacked accuracy and the accuracy w.r.t. the fooled target class at several iterations during the attack search (i = {200, 400, 600, 800, 1000}). We also analyze the robustness for two attack strengths (ε = 8, 12). Similar to previous observations, ABC models show higher robustness relative to their unattacked accuracy than the other compressed variants and the vanilla ResNet50 and ResNet18 models.

4.5 Discussion

The robustness of distilled models can be attributed to their soft label training, which can be more informative than plain hard labels. The student is ideally able to learn both the correct classification and the distribution of closeness among other classes. Furthermore, the student is distilled using a high temperature factor T, causing the magnitude of the predicted class to be T times more confident than when trained on hard labels [5]. Thus, white-box attacks like FGSM, PGD and DeepFool would require a strong adversarial perturbation to fool the final prediction towards its nearest class. However, the C&W attack is able to fool the distilled model even at higher temperatures, as the attack is not focused on the cross-entropy loss directly.

The training scheme for BNNs is not as simple as for vanilla or pruned models. It requires a straight-through estimator, which makes white-box attacks more challenging than for other variants. Introducing multiple scaling factors in the case of ABC-Net eases the approximation to its full-precision model. Thus, XNOR-Nets appear to be more resilient against white-box attacks (Fig. 3, Fig. 4). Moreover, the PGD loss levels in Fig. 2 demonstrate the robustness of XNOR-Net through lower loss convergence values and breaking speed. The discretization of weights and activations also makes BNNs stronger against black-box attacks. The CAM results support the robustness of BNNs, as they inherently possess a smaller and more concentrated RoI, reducing the chances of finding and perturbing the critical set of pixels. The BNN robustness is also observed for the ImageNet dataset when attacked with PGD and GenAttack (Table 5, Table 6).

Pruning is the process of eliminating unused and/or redundant parameters. Here, balancing the compression rate and the accuracy is a key factor. Due to the reduced learning ability, pruned models are not automatically more robust than their full-precision counterpart. This would call for an extra objective function for improving the robustness. Existing works have shown that it is possible to remove more model parameters when pruning is applied in an unstructured manner [15]. A similar behavior can be expected if the robustness is included in the pruning and fine-tuning process.
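To illustrate the role of the temperature factor T in the soft labels discussed above, the following sketch shows a softmax with temperature. It is a generic illustration of distillation-style soft labels, not the training code of the paper; the function name `softmax_with_temperature` and the example logits are hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float],
                             temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature T.

    A higher T flattens the distribution, so the soft label also carries
    information about how close the other classes are.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [5.0, 2.0, 0.5]                               # hypothetical teacher logits
hard_like = softmax_with_temperature(logits, 1.0)      # peaked, close to one-hot
soft = softmax_with_temperature(logits, 10.0)          # flatter, keeps class order
```

The class ranking is preserved at any temperature; only the spread of probability mass across the non-predicted classes changes.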

5 Conclusion

In this paper, we provided a comprehensive analysis of recent white-box and black-box adversarial attacks against state-of-the-art vanilla, distilled, pruned and binary neural networks. We demonstrated that the robustness of CNNs depends not only on the adversarial attack but also on the compression technique at hand. By varying the attacks' hyper-parameters, strong, ductile and brittle CNNs were identified. Conclusions on robustness were drawn by analyzing PGD loss/accuracy levels, box-plots, stress-strain graphs and CNN heat maps with CAM. From the presented data, we show that knowledge about the expected adversarial attack or the used compression technique can help the designer or the attacker generate more robust applications or stronger attacks, respectively.

References

1. Akhtar, N., Mian, A.S.: Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, 14410–14430 (2018)
2. Alzantot, M., Sharma, Y., Chakraborty, S., Zhang, H., Hsieh, C.J., Srivastava, M.B.: GenAttack: practical black-box attacks with gradient-free optimization. In: ACM Genetic and Evolutionary Computation Conference (GECCO), pp. 1111–1119. Association for Computing Machinery, New York (2019)
3. Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., Frossard, P.: Universal adversarial perturbations. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 86–94, July 2017
4. Carlini, N., et al.: On evaluating adversarial robustness. CoRR, abs/1902.06705 (2019)
5. Carlini, N., Wagner, D.A.: Towards evaluating the robustness of neural networks. In: IEEE Symposium on Security and Privacy (SP), pp. 39–57, May 2017
6. Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J., Hsieh, C.-J.: ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models, pp. 15–26, November 2017
7. Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: training deep neural networks with binary weights during propagations. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 3123–3131. Curran Associates Inc. (2015)
8. Fasfous, N., Vemparala, M.R., Frickenstein, A., Stechele, W.: OrthrusPE: runtime reconfigurable processing elements for binary neural networks. In: 2020 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1662–1667 (2020)
9. Frickenstein, A., Rohit Vemparala, M., Unger, C., Ayar, F., Stechele, W.: DSC: Dense-sparse convolution for vectorized inference of convolutional neural networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W), June 2019
10. Galloway, A., Taylor, G.W., Moussa, M.: Attacking binarized neural networks. In: International Conference on Learning Representations (2018)
11. Goldblum, M., Fowl, L., Feizi, S., Goldstein, T.: Adversarially robust distillation. In: AAAI (2020)
12. Guo, M., Yang, Y., Xu, R., Liu, Z., Lin, D.: When NAS meets robustness: in search of robust architectures against adversarial attacks (2019)
13. Guo, Y., Yao, A., Chen, Y.: Dynamic network surgery for efficient DNNs. In: Advances in Neural Information Processing Systems (NeurIPS) (2016)
14. Han, B., et al.: Co-teaching: robust training of deep neural networks with extremely noisy labels. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 8527–8537. Curran Associates Inc. (2018)
15. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 1135–1143. Curran Associates Inc. (2015)
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016
17. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: IEEE International Conference on Computer Vision (ICCV), pp. 1398–1406, October 2017
18. He, Y., Liu, P., Wang, Z., et al.: Filter pruning via geometric median for deep convolutional neural networks acceleration. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
19. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: The European Conference on Computer Vision (ECCV) (2018)
20. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015)
21. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 4107–4115. Curran Associates Inc. (2016)
22. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (ICLR) (2015)
23. Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. University of Toronto (2009)
24. Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial Machine Learning at Scale. abs/1611.01236 (2016)
25. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems (NeurIPS), pp. 598–605. Morgan-Kaufmann (1990)
26. Lin, J., Gan, C., Han, S.: Defensive quantization: when efficiency meets robustness. In: International Conference on Learning Representations (2019)
27. Lin, X., Zhao, C., Pan, W.: Towards accurate binary convolutional neural network. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 345–353. Curran Associates Inc. (2017)
28. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. ArXiv, abs/1706.06083 (2018)
29. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I.J.: Adversarial autoencoders. In: International Conference on Learning Representations Workshop (ICLR-W) (2016)
30. Moosavi-Dezfooli, S.-M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582 (2015)
31. Narodytska, N., Kasiviswanathan, S.P.: Simple black-box adversarial attacks on deep neural networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W), pp. 1310–1318, July 2017
32. NVIDIA. NVIDIA Turing GPU architecture (2017). https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf. Accessed 28 Feb 2020
33. Papernot, N., McDaniel, P.: Extending defensive distillation. ArXiv, abs/1705.05264 (2017)
34. Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. Association for Computing Machinery, New York (2017)
35. Papernot, N., McDaniel, P.D., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: IEEE Symposium on Security and Privacy (SP), pp. 582–597, May 2016
36. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) The European Conference on Computer Vision (ECCV), pp. 525–542. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_32
37. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015)
38. Shafahi, A., et al.: Adversarial training for free! In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 3358–3369. Curran Associates Inc. (2019)
39. Szegedy, C., et al.: Intriguing properties of neural networks. Presented at the (2014)
40. Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. Presented at the (2020)
41. Wiyatno, R.R., Xu, A., Dia, O., de Berker, A.: Adversarial examples in modern machine learning: a review, November 2019
42. Yang, T., Chen, Y., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6071–6079, July 2017
43. Zhang, T., et al.: StructADMM: a systematic, high-efficiency framework of structured weight pruning for DNNs (2018)
44. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929, June 2016

Parallel Dilated CNN for Detecting and Classifying Defects in Surface Steel Strips in Real-Time

Khaled R. Ahmed

School of Computing, Southern Illinois University, Carbondale, Illinois, USA

Abstract. Automatic defect inspection and classification is of great importance for improving quality in the steel industry. This paper proposes and develops DSTEELNet, a convolutional neural network (CNN) architecture that improves both the detection accuracy and the time required to detect defects in surface steel strips. DSTEELNet includes three parallel stacks of convolution blocks. Each convolution block uses dilated convolution, which expands the receptive field and increases the feature resolution. The experimental results indicate significant improvements in accuracy: DSTEELNet achieves 97% mAP in detecting defects in surface steel strips on the NEU dataset and detects defects in a single image in 22 ms.

Keywords: Computer vision · Defect detection · Defect classification · Parallel processing · Convolution Neural Network

1 Introduction

Quality is an important competitive factor in the success of the steel industry [1–3]. Surface defect detection is an important part of steel production and has a significant impact on product quality. Manual defect detection is time-consuming and subject to human error and hazards. To overcome these shortcomings, traditional automatic surface defect detection methods were proposed, including eddy current testing, infrared detection, magnetic flux leakage detection, and laser detection. These methods cannot detect all faults, especially tiny ones [4]. This has motivated many researchers [5–8] to develop computer vision systems capable of classifying and detecting defects in ceramic tiles [5], textile fabrics [9] and the steel industry [7–10]. Structure-based methods extract image structure features such as texture, skeleton and edge, while other methods extract statistical features such as mean, difference and variance from the defect surface and then apply machine learning algorithms to these features to recognize defective surfaces [11, 13]. The combination of statistical features and machine learning achieves higher accuracy and robustness than structure-based methods [44]. However, a machine learning classifier such as a Support Vector Machine (SVM) may take about 0.239 s to extract features from a single defect image during testing [12], and therefore fails to meet real-time surface defect detection requirements. In contrast, convolutional neural networks (CNNs) provide automated feature extraction: they take raw defect images, predict surface defects in a short time, and reduce the need to manually extract suitable features.

The main objective of this research is to enhance the accuracy of steel strip defect detection and to achieve significant generalization. To this end, this paper proposes a modular deep CNN-based architecture, DSTEELNet, for detecting and classifying defects in surface steel strips using traditional convolution and dilated convolution. Dilated convolution is able to capture more distinctive features by shifting the receptive field [36]. The main contributions of this paper are as follows:

• We designed and developed a novel framework called DSTEELNet that detects and classifies defects in surface steel strips. To enhance the detection accuracy significantly, the proposed CNN includes three parallel stacks of convolution blocks, as shown in Fig. 5. They are able to capture and propagate important features in parallel. Each convolution block uses a different dilation rate.
• We evaluated the proposed architecture against traditional CNN architectures to highlight the effectiveness of DSTEELNet in detecting and classifying defects in surface steel strips. The trained model improves the product quality of steel strips since it is able to accurately detect and classify defective regions.
• We developed a deep convolutional generative adversarial network (DCGAN) to extend the size of the NEU dataset.

The rest of this paper is organized as follows. Section 2 reviews the related work. The dataset and the traditional and neural augmentation techniques are described in Sect. 3. Section 4 details the proposed DSTEELNet architecture. Section 5 discusses the experimental setup and results. Section 6 concludes this paper and outlines future research directions.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 168–183, 2022. https://doi.org/10.1007/978-3-030-82193-7_11

2 Related Work

Many studies have investigated machine vision techniques for surface defect detection. They fall into two main categories: traditional image processing methods and machine learning methods. Traditional image processing methods detect and segment defects by using the primitive attributes reflected by local anomalies. They detect various defects with feature extraction techniques that are categorized into four approaches [15]: structural methods [16, 17], threshold methods [18–20], spectral methods [21–23], and model-based methods [24, 25]. Traditional image processing methods need multiple thresholds to detect various defects and are very sensitive to background colors and lighting conditions. These thresholds must be adjusted to handle different defects. The traditional algorithms also rely on handcrafted features, whose manual extraction requires considerable manpower [14].

Machine learning-based methods typically include two stages: feature extraction and pattern classification. The first stage analyzes the characteristics of the input image and produces a feature vector describing the defect information. These features include grayscale statistical features [28], local binary pattern (LBP) features [3, 26], histogram of oriented gradient (HOG) features [27], and the gray level co-occurrence matrix (GLCM) [28]. Some research efforts speed up feature extraction by running it in parallel on GPUs, as in our previous work [13]. The second stage feeds the feature vector into a classifier model that is trained in advance to detect whether the input image has a defect or not [43]. In complex conditions, handcrafted features or shallow learning techniques are not sufficiently discriminative. Therefore, these machine learning-based methods are typically dedicated to a specific scenario and lack adaptability and robustness.

Recently, neural network methods have achieved excellent results in many computer vision applications. Convolutional neural networks (CNNs) have been used to develop several defect detection methods. Some CNN research efforts classify the defects in steel images: the authors of [10] demonstrate that using a sequential CNN to extract features improves classification accuracy in defect inspection. The authors of [29] developed a multi-scale pyramidal pooling network for the classification of steel defects. The authors of [30] developed a flexible multi-layered deep feature extraction framework. Both works succeed in classifying defects but fail to localize them. Therefore, researchers have recast surface defect detection as an object detection problem in computer vision in order to localize defects, as in [42]. A simple and direct approach first locates the defect and then classifies it. The authors of [42] developed a defect detection network (DDN) that integrates ResNet [44] and a region proposal network (RPN) for precise defect detection and localization. In addition, they proposed a multilevel-feature fusion network that combines low-level and high-level features. In other words, the inspection task classifies regions of defects instead of a whole defect image. The work in [31] employed a traditional CNN with a sliding window to localize the defect. The authors of [32] developed a structural defect detection method based on Faster R-CNN [33] that detects five types of surface defects: concrete cracks, steel corrosion, steel delamination, and bolt corrosion. The authors of [34] developed a cascaded autoencoder (CASAE): in the first stage, they localize and extract the features of the defect from the input image; in the second stage, they use a compact CNN to accurately classify defects.

Deep learning techniques facilitate quality assurance in manufacturing, but they require large datasets to avoid overfitting. Annotating the data collected from manufacturing lines can be a time-consuming task, and there has been recent interest in the research community in mitigating this issue. The next section describes the use of data augmentation, based on both traditional techniques and neural networks, to enlarge the NEU dataset.

3 Dataset and Augmentation

This section introduces the dataset and the expansion techniques used to facilitate the training of the proposed model. Our experiments use the NEU dataset [3]. Originally, the NEU dataset has 1,800 grayscale steel images and includes six types of defect, as shown in Fig. 1. The defect types are crazing, inclusion, patches, pitted surface, scratches, and rolled-in scale, with 300 samples for each type. To annotate the dataset, each defect that appears in the defective images is marked by a red bounding box (groundtruth box), as shown in Fig. 1. About 5000 groundtruth boxes have been created. The bounding box is used to localize a defect; it does not represent the defect's borders and cannot describe its shape.

Fig. 1. Six types of surface steel strips defect

To expand the dataset with new samples, a naive approach is random oversampling with small geometric transformations such as an 8° rotation or a horizontal or vertical shift of the image. Other simple image manipulations, such as mixing images, color augmentations, kernel filters, and random erasing, can be used to oversample data in the same manner as geometric augmentations. This is useful for ease of implementation and for quick experimentation with different class ratios. In this paper, data augmentation is used to increase the size of the dataset by artificially creating different versions of the images from the original training dataset. Table 1 shows the image augmentation parameters used in the training process, such as fill mode, zoom range, and width shift. For example, width shift randomly shifts the pixels horizontally, either to the left or to the right, to generate transformed images.

Table 1. Image augmentation setting parameters

Parameters        Value
Height Shift      0.08
Width Shift       0.08
Rotation Range    8
Fill mode         Nearest
Zoom Range        0.08
Shear Range       0.3
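As an illustration of the width shift and "Nearest" fill mode listed in Table 1, the following plain-Python sketch shifts an image horizontally and fills the vacated pixels with the nearest edge value. It is not the augmentation pipeline used in the paper; the function names are hypothetical, and interpreting the value 0.08 as a fraction of the image width (the convention of common augmentation libraries) is an assumption.

```python
import random
from typing import List

Image = List[List[int]]


def shift_width_nearest(image: Image, shift: int) -> Image:
    """Shift pixels horizontally; shift > 0 moves the content to the right.

    Vacated positions copy the nearest edge pixel ('nearest' fill mode).
    """
    width = len(image[0])
    return [[row[min(width - 1, max(0, c - shift))] for c in range(width)]
            for row in image]


def random_width_shift(image: Image, fraction: float,
                       rng: random.Random) -> Image:
    """Draw a shift uniformly from +/- fraction * width (assumed semantics)."""
    max_shift = int(fraction * len(image[0]))
    return shift_width_nearest(image, rng.randint(-max_shift, max_shift))


# One augmented copy of a tiny 1 x 4 "image", shifted by at most one pixel.
augmented = random_width_shift([[1, 2, 3, 4]], fraction=0.25,
                               rng=random.Random(0))
```

Height shift works the same way along the row axis; rotation, zoom and shear need interpolation and are usually delegated to an image library.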


Fig. 2. Generator adversarial and discriminator loss during training

However, oversampling with basic image transformations may cause overfitting on the minority class that is being oversampled, because the biases present in the minority class become more prevalent after sampling with these techniques. Therefore, this paper also uses neural augmentation networks such as the Generative Adversarial Network (GAN) [36, 37]. A GAN is able to generate synthetic defect images that are nearly identical to the original groundtruth ones. We have developed a deep convolutional GAN, named DCGAN, that includes two CNNs [38]: a generator G (a reversed CNN) and a discriminator D. The generator G takes random input and generates an image by up-sampling the input with transposed convolutions. The discriminator D takes the generated images and the original images and tries to predict whether a given image is generated (fake) or original (real). The GAN plays a two-player min–max game with the value function V(D, G) [36]:

$$\min_{G} \max_{D} V(D, G) \tag{1}$$

$$V(D, G) = \mathbb{E}_{\omega \sim S_{data}(\omega)}\left[\log D(\omega)\right] + \mathbb{E}_{\tau \sim S_{\tau}(\tau)}\left[\log\left(1 - D(G(\tau))\right)\right] \tag{2}$$

where D(ω) is the probability that ω is a real image, S_data is the distribution of the original data, τ is the random noise used by the generator G to generate the image G(τ), and S_τ is the distribution of the noise. During training, the aim of the discriminator D is to maximize the probability of assigning the correct label to both real and fake images. Since this is a binary classification problem, the model is fit by minimizing the average binary cross entropy. The minimax GAN loss in Eq. 1 defines the simultaneous optimization of the discriminator and generator models. The discriminator seeks to maximize the average of the log probability for real images and the log of the inverted probabilities for fake images. In other words, it maximizes log D(ω) + log(1 − D(G(τ))). The generator seeks to minimize the log of the inverted probability predicted by the discriminator for fake images. In other words, it minimizes log(1 − D(G(τ))). The training results are shown in Fig. 2, which plots the discriminator loss and the adversarial loss over 600 training iterations. Both the D loss and the G adversarial loss converge. The mean discriminator loss and the mean adversarial loss are 0.031 and 1.617, respectively.
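The two objectives above can be estimated from a batch of discriminator outputs. The sketch below is a minimal numerical illustration of Eq. 2, not the DCGAN training code of this paper; the function names `gan_value` and `generator_objective`, and the small clamping constant that guards against log(0), are assumptions.

```python
import math
from typing import List

_EPS = 1e-12  # assumed clamp so that log(0) is never evaluated


def _clamp(p: float) -> float:
    return min(1.0 - _EPS, max(_EPS, p))


def gan_value(d_real: List[float], d_fake: List[float]) -> float:
    """Batch estimate of V(D, G) = E[log D(w)] + E[log(1 - D(G(t)))].

    d_real -- discriminator outputs D(w) on real images
    d_fake -- discriminator outputs D(G(t)) on generated images
    The discriminator tries to maximize this value.
    """
    real_term = sum(math.log(_clamp(p)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - _clamp(p)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_objective(d_fake: List[float]) -> float:
    """Batch estimate of E[log(1 - D(G(t)))], which the generator minimizes."""
    return sum(math.log(1.0 - _clamp(p)) for p in d_fake) / len(d_fake)


# A discriminator that separates real from fake scores a higher V(D, G)
# than one that outputs 0.5 everywhere.
confident = gan_value([0.9, 0.8], [0.1, 0.2])
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
```

Maximizing `gan_value` is equivalent to minimizing the average binary cross entropy over real and fake labels, which is how the discriminator is typically trained in practice.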


Fig. 3. Synthetic images by DCGAN

The training proceeded in six steps. In step 1, we randomly generate a noise vector from a Gaussian distribution, and in step 2 we pass it to the generator to produce a synthetic image. In step 3, we mix the authentic images from the training dataset with the generated synthetic images. In step 4, we train the discriminator on the mixed dataset with the aim of correctly labeling each image as fake or real. In step 5, we again generate random noise and label each noise vector as a real image. Finally, in step 6 we train the GAN using these noise vectors and the real-image labels, even though they are not actual real images. In summary, each iteration of the GAN algorithm first generates random images and trains the discriminator to distinguish fake from real images, then tries to fool the discriminator by generating more synthetic images, and finally updates the weights of the generator based on the feedback received from the discriminator, which enables us to generate more authentic images.

We developed the generator architecture as follows. First, it includes a dense layer with a ReLU activation function followed by batch normalization to stabilize the GAN, as in [36]. We feed a random noise vector drawn from a Gaussian distribution into this layer. To prepare the number of nodes to be reshaped into a 3D volume, we added another dense layer with a ReLU activation function followed by batch normalization. A Reshape layer is then added to generate a 3D volume from the input shape. To increase the spatial resolution during training, we add a transposed convolution (Conv2DTranspose) with stride 2 and 32 filters of size 5 × 5, with a ReLU activation function, followed by batch normalization and a dropout rate of 0.3 to avoid overfitting. Finally, we added five up-sampling transposed convolutions (Conv2DTranspose), each of which uses stride 2 and a tanh activation function. They increase the spatial resolution from 14 × 14 to 224 × 224, which matches the size of the input images.

Afterward, we developed the discriminator as follows. It includes two convolution layers (Conv2D) with stride 2 and 32 filters of size 5 × 5, with a Leaky ReLU activation function to stabilize training. We added flatten and dense layers with a sigmoid activation function to output the probability of whether the image is synthetic or real. Figure 3 shows examples of images generated from the NEU dataset. This paper feeds about 1800 images of the NEU dataset to the DCGAN framework, which generates 540 synthetic images; these are added to the original NEU dataset to create a new dataset called GNEU. We divide the GNEU dataset into training, validation and testing sets.


K. R. Ahmed

The training set includes 1260 real and synthetic images, and the validation set includes 540 real and synthetic images. The test set includes 540 real images.
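The six training steps described above can be sketched as a single function. This is only a structural sketch under stated assumptions: `generate`, `train_discriminator` and `train_gan` are hypothetical callables standing in for the generator forward pass, a discriminator update and a combined-model update, and the noise dimension of 100 is an illustrative default rather than a value given in the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

Noise = List[List[float]]


def sample_noise(batch_size: int, noise_dim: int, rng: random.Random) -> Noise:
    """Draw one Gaussian noise vector per image in the batch."""
    return [[rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
            for _ in range(batch_size)]


def gan_iteration(
    real_images: Sequence,
    generate: Callable[[Noise], Sequence],                          # hypothetical
    train_discriminator: Callable[[Sequence, List[int]], float],    # hypothetical
    train_gan: Callable[[Noise, List[int]], float],                 # hypothetical
    noise_dim: int = 100,                                           # assumed value
    rng: random.Random = None,
) -> Tuple[float, float]:
    """Run one iteration of the six-step procedure and return both losses."""
    rng = rng or random.Random(0)
    batch_size = len(real_images)

    # Step 1: sample Gaussian noise vectors.
    noise = sample_noise(batch_size, noise_dim, rng)
    # Step 2: let the generator turn the noise into synthetic images.
    fake_images = generate(noise)
    # Step 3: mix authentic and synthetic images (label 1 = real, 0 = fake).
    mixed = list(real_images) + list(fake_images)
    labels = [1] * batch_size + [0] * len(fake_images)
    # Step 4: train the discriminator on the mixed, correctly labeled batch.
    d_loss = train_discriminator(mixed, labels)
    # Step 5: sample fresh noise and deliberately label it as "real".
    noise = sample_noise(batch_size, noise_dim, rng)
    misleading_labels = [1] * batch_size
    # Step 6: train the combined model so the generator learns to fool
    # the discriminator.
    g_loss = train_gan(noise, misleading_labels)
    return d_loss, g_loss
```

In a Keras implementation the three callables would typically wrap model calls, with the discriminator weights frozen inside the combined model during step 6; that detail is an assumption here, not something the text states.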

4 Proposed DSTEELNet Architecture

This section describes the proposed DSTEELNet CNN framework for detecting and classifying defects in surface steel strips. DSTEELNet includes parallel stacks of convolution, activation and max-pooling layers, as shown in Fig. 5. At the feature level, we added parallel layers and then performed convolution with activation on the resulting feature maps. We added a flatten layer to unstack all the tensor values into a 1-D tensor. The flattened features are used as inputs to two dense layers (multi-layer perceptron). To reduce overfitting, we applied dropout. For the classification task, we added a dense layer with a softmax activation function. Finally, the architecture generates a class activation map.

The receptive field RF is the portion of the image from which the filter extracts features; it is defined by the filter size of the layer in the CNN [39]. To generate high-quality training results and capture fine details of the input 2D image, the feature resolution must be increased by expanding the receptive field RF. Therefore, this paper uses dilated convolution [35] with a dilation rate larger than 1, which expands the receptive field while keeping computational costs down by adding a dilation rate to the Conv2D kernel. The dilation rate is the spacing between the pixels sampled by the convolution filter. Equation 3 gives the receptive field RF, where k is the size of the kernel and d is the dilation rate.

$$RF = d(k - 1) + 1 \qquad (3)$$
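As a quick check of Eq. 3, the following minimal sketch evaluates the receptive field for a 3 × 3 kernel at the dilation rates used in the three parallel branches. The function name `receptive_field` is ours, introduced only for illustration.

```python
def receptive_field(kernel_size: int, dilation: int) -> int:
    """Side length covered by a dilated kernel, Eq. 3: RF = d * (k - 1) + 1."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive integers")
    return dilation * (kernel_size - 1) + 1


for d in (1, 2, 3):
    side = receptive_field(3, d)
    print(f"3 x 3 kernel, dilation rate {d}: field of view {side} x {side}")
```

The number of weights stays at 3 × 3 in every case; only the spacing between the sampled pixels changes.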

For example, with a dilation rate of 2 each input skips a pixel. Figure 4-c shows that a 3 × 3 kernel with a dilation rate of 2 has the same field of view as a 5 × 5 kernel. As a result, the receptive field RF increases and enables the filter to capture more contextual information. In contrast, a dilation rate of 1 with a 3 × 3 kernel yields a 3 × 3 receptive field, which is equivalent to the standard convolution shown in Fig. 4-b. From Eq. 3, the size of the output can be calculated as follows:

$$\sigma = \left\lfloor \frac{g + 2p - RF}{s} \right\rfloor + 1 \qquad (4)$$

where the input is of size g × g, and d, p and s are the dilation factor, padding and stride, respectively. By using a number of receptive fields of different sizes, we can capture important features in the scene area at different scales. Figure 5 shows the proposed DSTEELNet architecture. It includes five dilated convolution blocks in three parallel stacks. Assume each stack includes m convolution blocks $CB^{(i)}$, where $i \in \{1, 2, \ldots, m\}$, and the corresponding output of each $CB^{(i)}$ is denoted by $\beta_i$. The input and output features are denoted by $f_{in}$ and $f_{out}$, respectively, and $f_{out}$ can be obtained as follows:

$$f_{out} = f_{in} + \sum_{i=1}^{m} \beta_i \qquad (5)$$


Fig. 4. Dilated convolution in DSTEELNet

$$\beta_i = \begin{cases} CB^{(i)}\left(f_{in}\right), & i = 1 \\ CB^{(i)}\left(f_{in} + \sum_{k=1}^{i-1} \beta_k\right), & 1 < i \le m \end{cases} \qquad (6)$$
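The recursion in Eqs. 5 and 6 can be read as follows: each block receives the stack input plus the outputs of all earlier blocks, and the stack output is the input plus the sum of all block outputs. The sketch below illustrates this reading on plain Python lists. It assumes shape-preserving blocks so that the sums are well defined; in the actual network each block is followed by max-pooling, so this is a simplification, and `aggregate` and the toy `halve` block are hypothetical names.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def aggregate(f_in: Vector,
              blocks: Sequence[Callable[[Vector], Vector]]) -> Vector:
    """Compute f_out = f_in + sum_i beta_i with beta_i defined as in Eq. 6."""
    running = list(f_in)         # holds f_in + sum_{k < i} beta_k
    for block in blocks:
        beta = block(running)    # beta_i = CB^(i)(f_in + earlier outputs)
        running = add(running, beta)
    return running               # after the loop: f_in + sum_{i=1..m} beta_i


def halve(v: Vector) -> Vector:
    """Toy stand-in for a convolution block: scales every value by 0.5."""
    return [0.5 * x for x in v]


print(aggregate([2.0, 4.0], [halve, halve]))
```

With the two toy blocks, the first output is [1.0, 2.0] and the second is computed from [3.0, 6.0], which shows how later blocks see the accumulated features of earlier ones.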

Each convolution block $CB_{t=j} = \mathrm{conv}(n = F)$ is followed by a max-pooling block to reduce the feature size and the computational complexity for the next layer. For efficient pooling, we used pool_size = (2,2) and strides = (2,2) [41]. Each convolution block includes two Conv2D layers, each followed by a ReLU activation function, where F is the total number of filters and j is the dilation rate. We used 3 × 3 filters in all convolution blocks. The number of filters in the first convolution block is 64, and the subsequent blocks use 128, 256 and 512 filters, in that order. The three parallel stacks (branches) are identical except for their dilation rates, j = 1, 2 and 3 respectively, as shown in Fig. 5. Standard convolution is equivalent to a convolution with a dilation rate of 1. Each parallel branch generates features from images at different CNN layers and therefore captures different properties. We concatenated the features generated by these parallel branches and passed the result to the next convolution layer to produce the final low-level features. This convolution layer has 512 filters of size 3 × 3, a dilation rate of 1 and a stride of 1, and is followed by a ReLU activation function. To convert the square feature map into a one-dimensional feature vector, a flatten layer has been added. Two fully connected (perceptron) layers of size 1024 feed the flattened features to the dense layer that performs classification. The last dense layer uses a SoftMax activation function to determine the class scores. To reduce overfitting during training, a dropout layer has been added to discard some of the weights produced by the two fully connected layers. In this paper, we used a dropout rate of 0.3.
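To see why the three branches can be concatenated, it helps to trace the spatial size through one branch using the output-size rule of Eq. 4. The sketch below does this under assumptions that the paper does not state: "same"-style padding p = d(k − 1)/2, a stride of 1 in the convolutions, and four blocks matching the four filter counts listed above. All function names are illustrative.

```python
def conv_output_size(g: int, k: int, d: int = 1, p: int = 0, s: int = 1) -> int:
    """Eq. 4: floor((g + 2p - RF) / s) + 1, with RF = d * (k - 1) + 1."""
    rf = d * (k - 1) + 1
    return (g + 2 * p - rf) // s + 1


def pool_output_size(g: int, pool: int = 2, stride: int = 2) -> int:
    """Output size of max-pooling without padding."""
    return (g - pool) // stride + 1


def trace_branch(input_size: int, dilation: int,
                 filters=(64, 128, 256, 512), kernel: int = 3):
    """Return (filters, spatial size) after each block of one branch."""
    size = input_size
    pad = dilation * (kernel - 1) // 2      # assumed "same" padding
    trace = []
    for n_filters in filters:
        for _ in range(2):                  # two Conv2D layers per block
            size = conv_output_size(size, kernel, d=dilation, p=pad)
        size = pool_output_size(size)       # pool_size (2,2), strides (2,2)
        trace.append((n_filters, size))
    return trace


for rate in (1, 2, 3):
    print(rate, trace_branch(224, rate))
```

Under these assumptions every branch ends with the same spatial size regardless of its dilation rate, which is what allows the branch outputs to be concatenated along the channel axis.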


5 Experiments

The performance of DSTEELNet is evaluated on the generated dataset (GNEU). To demonstrate that DSTEELNet has a reasonable design and achieves significant results, we compare it with VGG16, VGG19, ResNet50 and MobileNet.

Fig. 5. DSTEELNet architecture

5.1 Experiment Metrics

For the performance evaluation we use the following performance metrics:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (7)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (8)$$

$$AP = \frac{\text{Recall} + \text{Precision}}{2} \qquad (9)$$

$$F1 = \frac{2TP}{2TP + FN + FP} \qquad (10)$$

where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives. A true positive is a defective steel image identified as defective. A false positive is a defect-free steel image identified as defective. A false negative is a defective steel image identified as defect-free. The F1 score is used to balance recall and precision. In addition, the mean average precision (mAP), which is the mean of the AP values of all classes, is calculated to evaluate the overall performance.
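The metrics in Eqs. 7–10 translate directly into code. The sketch below is a minimal illustration; the counts in the example (80 true positives, 20 false positives, 10 false negatives) are made up for demonstration and do not come from the experiments.

```python
from typing import Sequence


def recall(tp: int, fn: int) -> float:
    """Eq. 7: share of defective images that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Eq. 8: share of predicted defects that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def average_precision(tp: int, fp: int, fn: int) -> float:
    """Eq. 9 as defined in this paper: mean of recall and precision."""
    return (recall(tp, fn) + precision(tp, fp)) / 2


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Eq. 10: harmonic mean of precision and recall, written with counts."""
    denom = 2 * tp + fn + fp
    return 2 * tp / denom if denom else 0.0


def mean_average_precision(ap_values: Sequence[float]) -> float:
    """mAP: mean of the per-class AP values."""
    return sum(ap_values) / len(ap_values)


# Illustrative counts only (not taken from the paper).
tp, fp, fn = 80, 20, 10
print(f"precision={precision(tp, fp):.3f} recall={recall(tp, fn):.3f} "
      f"F1={f1_score(tp, fp, fn):.3f} AP={average_precision(tp, fp, fn):.3f}")
```

Note that Eq. 9 defines AP as the plain average of recall and precision, which differs from the area-under-curve definition used in many object detection benchmarks.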


5.2 Setup

The experiment platform in this work is an Intel(R) Core™ i7-9700L with a clock rate of 3.6 GHz, 16 GB of DDR4 RAM and an NVIDIA GeForce RTX 2080 SUPER graphics card. All experiments were conducted on the Microsoft Windows 10 Enterprise 64-bit operating system, using Keras 2.2.4 with the TensorFlow 1.14.0 backend. We train DSTEELNet, VGG16, VGG19, ResNet50 and MobileNet for about 150 epochs on the GNEU training and validation datasets with a batch size of 32 and an image input size of 224 × 224. We applied the Adam optimizer [40] with a learning rate of 1e-4 and the categorical cross-entropy loss function. The loss is measured between the class probability predicted by the softmax activation function and the true probability of the category. None of the trained models used pretrained weights such as ImageNet, because ImageNet has no steel surface images.

5.3 Results

This section presents the results of the proposed CNN architecture for detecting defects in surface steel strips. Table 3 shows the class-wise classification performance metrics listed in Eqs. 7–10 and compares DSTEELNet with the state-of-the-art CNN architectures. It shows that most models classify most categories well (such as crazing, patches, rolled-in_scale and scratches). The state-of-the-art models perform poorly on defects such as inclusion and pitted_surface because of similarities in their defect structures. However, DSTEELNet detects all the class categories with high accuracy. Table 3 shows that DSTEELNet reaches 97.2% mAP, which is 6% higher than VGG16 (91.2%), 7.2% higher than VGG19 (90.0%), 4.2% higher than ResNet50 (93%) and 3.2% higher than MobileNet (94%). Table 2 reports the weighted average results. It shows that, for steel surface defect detection, DSTEELNet achieves the highest precision, recall and F1 scores.

Table 2. Weighted average results

| Model | Precision | Recall | F1-score |
|---|---|---|---|
| DSTEELNet | 0.97 | 0.97 | 0.97 |
| Vgg16 | 0.89 | 0.89 | 0.92 |
| Vgg19 | 0.92 | 0.90 | 0.90 |
| ResNet50 | 0.95 | 0.93 | 0.93 |
| MobileNet | 0.94 | 0.93 | 0.93 |
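A weighted average such as the one in Table 2 weights each class score by the number of test images of that class. The sketch below illustrates the calculation with the DSTEELNet per-class F1 scores as we read them from Table 3 and the 90 test images per class mentioned in the text; treat both the reading of the table and the helper name `weighted_average` as assumptions.

```python
from typing import Dict


def weighted_average(scores: Dict[str, float], supports: Dict[str, int]) -> float:
    """Average per-class scores, weighting each class by its image count."""
    total = sum(supports[name] for name in scores)
    return sum(scores[name] * supports[name] for name in scores) / total


# DSTEELNet per-class F1 scores as read from Table 3 (assumed reading).
f1_per_class = {
    "crazing": 1.00,
    "inclusion": 0.91,
    "patches": 1.00,
    "pitted_surface": 0.92,
    "rolled-in_scale": 0.99,
    "scratches": 1.00,
}
# The test set contains 90 images of each defect class.
images_per_class = {name: 90 for name in f1_per_class}

print(f"weighted F1 = {weighted_average(f1_per_class, images_per_class):.2f}")
```

Because every class has the same number of test images, the weighted average reduces to the plain mean of the six class scores.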

Table 3. Detection results on GNEU dataset (each cell lists Precision / Recall / F1; the mAP row gives one value per model)

| Class | DSTEELNet | VGG16 | VGG19 | ResNet50 | MobileNet |
|---|---|---|---|---|---|
| Crazing | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 | 0.95 / 1.00 / 0.97 | 0.99 / 1.00 / 0.99 | 0.98 / 1.00 / 0.99 |
| Inclusion | 0.97 / 0.86 / 0.91 | 1.00 / 0.51 / 0.68 | 0.94 / 0.54 / 0.69 | 1.00 / 0.66 / 0.79 | 1.00 / 0.82 / 0.82 |
| Patches | 1.00 / 1.00 / 1.00 | 0.89 / 1.00 / 0.94 | 1.00 / 0.98 / 0.99 | 1.00 / 0.99 / 0.99 | 1.00 / 0.99 / 0.99 |
| Pitted_surface | 0.87 / 0.97 / 0.92 | 0.66 / 0.97 / 0.79 | 0.67 / 0.89 / 0.76 | 0.74 / 0.98 / 0.84 | 0.73 / 0.98 / 0.84 |
| Rolled-in_Scale | 0.99 / 1.00 / 0.99 | 0.96 / 1.00 / 0.98 | 0.94 / 1.00 / 0.97 | 0.96 / 1.00 / 0.98 | 0.98 / 0.90 / 0.94 |
| Scratches | 1.00 / 1.00 / 1.00 | 1.00 / 0.87 / 0.93 | 1.00 / 0.99 / 0.99 | 1.00 / 0.98 / 0.99 | 0.96 / 1.00 / 0.98 |
| mAP | 0.972 | 0.912 | 0.90 | 0.93 | 0.94 |


In addition, Table 3 shows that DSTEELNet delivers consistent precision, recall and F1 results for the crazing, patches, pitted_surface, rolled-in_scale and scratches defects. DSTEELNet detects the inclusion defect with the highest F1 score (0.91), followed by MobileNet (0.82), ResNet50 (0.79), VGG19 (0.69) and VGG16 (0.68). Similarly, DSTEELNet detects the pitted_surface defect with the highest F1 score (0.92), followed by MobileNet (0.84), ResNet50 (0.84), VGG16 (0.79) and VGG19 (0.76). Examples of DSTEELNet detection results are shown in Fig. 6; DSTEELNet detects defects with high confidence scores. Figure 7 shows the confusion matrices for DSTEELNet and the other evaluated models, where the test dataset includes 90 images of each surface defect class. As shown in Fig. 7-a, DSTEELNet detects all of the steel surface defects perfectly except the inclusion defects: it classifies about 13 of the 90 images with inclusion defects as pitted_surface. Furthermore, as shown in Fig. 7 (b-d), MobileNet, ResNet50 and VGG19 classify 24, 31 and 40 of the 90 inclusion images as pitted_surface, respectively. In other words, DSTEELNet fails to detect defects in 2.9% of the 540 images, whereas ResNet50, MobileNet, VGG19 and VGG16 fail to detect defects in 6.6%, 7.4%, 10% and 11% of the 540 images, respectively.
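The failure rates quoted above are easier to compare as image counts. The short sketch below converts each percentage into an approximate number of misclassified test images out of 540; since the percentages in the text are rounded, the counts are estimates, and `missed_images` is an illustrative helper.

```python
TEST_IMAGES = 540  # 6 defect classes x 90 test images each

# Failure rates in percent, as quoted in the text.
failure_rate_pct = {
    "DSTEELNet": 2.9,
    "ResNet50": 6.6,
    "MobileNet": 7.4,
    "VGG19": 10.0,
    "VGG16": 11.0,
}


def missed_images(rate_pct: float, total: int = TEST_IMAGES) -> int:
    """Approximate number of misclassified images for a given failure rate."""
    return round(total * rate_pct / 100)


for model, rate in failure_rate_pct.items():
    print(f"{model}: about {missed_images(rate)} of {TEST_IMAGES} images")
```

For DSTEELNet this gives roughly 16 images, of which about 13 are the inclusion images classified as pitted_surface.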

Fig. 6. Examples of detection results using DSTEELNet on the GNEU dataset; green boxes indicate defect locations with confidence scores

Figure 8 shows the training and validation accuracy for DSTEELNet. Both training and validation accuracy started to improve from epoch 25 and then converged to their highest values.

5.4 Computational Time

Table 4 shows the average inference time needed to detect defects in a single image for the proposed DSTEELNet and for other deep learning and traditional techniques. It reveals that the traditional methods generally cannot meet real-time requirements. In addition, Table 4 shows that the proposed DSTEELNet is the fastest at detecting defects and can meet the real-time requirements. DSTEELNet reduces the defect detection time of


Fig. 7. Confusion matrices for DSTEELNet, MobileNet, ResNet50 and VGG19 on test dataset

Fig. 8. Training and validation accuracy of DSTEELNet

the traditional techniques by a factor of about 20 and outperforms the deep learning techniques. The accuracy of MobileNet and ResNet50 is higher than that of VGG16 and VGG19, but they take longer to detect defects. In summary, DSTEELNet achieves the highest accuracy and the shortest detection time owing to its reduced computational complexity. It also outperforms the recent end-to-end defect detection network (EDDN) [45], which adds extra architectures to VGG16, including multi-scale feature maps and predictors for detection. The authors reported that EDDN achieved 0.724 mAP and detects defects in a single image in 27 ms. DSTEELNet detects defects in a single image in 22 ms with 0.972 mAP.


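Table 4. Comparison of computational time for traditional and deep learning techniques

| Category | Technique | Time per image |
|---|---|---|
| Traditional techniques | HOG-SVM | 443.53 ms |
| Traditional techniques | LBP-SVM | 382.35 ms |
| Traditional techniques | GLCM-SVM | 454.57 ms |
| Deep learning techniques | Vgg16 | 29 ms |
| Deep learning techniques | Vgg19 | 28 ms |
| Deep learning techniques | Resnet50 | 32 ms |
| Deep learning techniques | MobileNet | 34 ms |
| Deep learning techniques | DSTEELNet | 22 ms |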
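The speed-up claims can be checked directly from the per-image times in Table 4. The following sketch computes the speed-up of DSTEELNet over each technique and the corresponding throughput in images per second; the helper names are ours, and the throughput figure assumes images are processed one at a time.

```python
# Average inference time per image in milliseconds, from Table 4.
time_ms = {
    "HOG-SVM": 443.53,
    "LBP-SVM": 382.35,
    "GLCM-SVM": 454.57,
    "Vgg16": 29.0,
    "Vgg19": 28.0,
    "Resnet50": 32.0,
    "MobileNet": 34.0,
    "DSTEELNet": 22.0,
}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def images_per_second(ms_per_image: float) -> float:
    """Throughput when images are processed sequentially."""
    return 1000.0 / ms_per_image


for name, ms in time_ms.items():
    if name != "DSTEELNet":
        factor = speedup(ms, time_ms["DSTEELNet"])
        print(f"DSTEELNet vs {name}: {factor:.1f}x faster")

print(f"DSTEELNet throughput: {images_per_second(time_ms['DSTEELNet']):.1f} images/s")
```

The traditional techniques come out at roughly 17 to 21 times slower than DSTEELNet, consistent with the "about 20 times" stated in the text, while the gap to the other deep networks is well under 2x.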

6 Conclusion

The major aim of this paper is to design and develop a CNN architecture suitable for detecting defects in surface steel strips. DSTEELNet, which can form sparse receptive fields, is proposed to generate more robust and discriminative features for defect detection. The experimental results show that the proposed DSTEELNet achieves 97% mAP and outperforms state-of-the-art CNN architectures such as VGG16, VGG19, ResNet50 and MobileNet by 6%, 7.2%, 4.2% and 3.2% mAP, respectively. As future research, we will explore methods to obtain more precise defect boundaries, such as defect segmentation based on deep learning techniques.

References

1. Quality & Yield Optimization for Flat Steel Production (2017). www.isra-parsytec.com
2. Sadeghi, M., Soltani, H., Zamanifar, K.: Application of parallel algorithm in image processing of steel surfaces for defect detection. Fen Bilimleri Dergisi (CFD) 36, 4 (2015)
3. Song, K., Yan, Y.: A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Appl. Surf. Sci. 285, 858–864 (2013)
4. Tian, S., Xu, K.: An algorithm for surface defect identification of steel plates based on genetic algorithm and extreme learning machine. Metals 7(8), 311 (2017)
5. Ragab, K., Alsharay, N.: An efficient defect classification algorithm for ceramic tiles. In: 2017 IEEE 13th International Symposium on Autonomous Decentralized System (ISADS), pp. 255–261 (2017)
6. Ragab, K.: Fast and parallel summed area table for fabric defect detection. Int. J. Pattern Recogn. Artif. Intell. 30(09), 1660004 (2016)
7. Neogi, N., Mohanta, D.K., Pranab, K.: Review of vision-based steel surface inspection systems. EURASIP J. Image Video Process. 1(2014), 50 (2014)
8. Jia, H., et al.: An intelligent real-time vision system for surface defect detection. In: Proceedings of the 17th International Conference on Pattern Recognition. ICPR 2004, vol. 3. IEEE (2004)
9. Sager, K.H., George, L.E.: Defect detection in fabric images using fractal dimension approach. In: International Workshop on Advanced Image Technology, vol. 2011 (2011)
10. Zhou, S., et al.: Classification of surface defects on steel sheet using convolutional neural networks. Materiali Tehnologije 51(1), 123–131 (2017)
11. Ghorai, S., Mukherjee, A., Gangadaran, M., Dutta, P.K.: Automatic defect detection on hot-rolled flat steel products. IEEE Trans. Instrum. Meas. 62, 612–621 (2012)
12. Ke, X.U., Lei, W., Wang, J.: Surface defect recognition of hot-rolled steel plates based on tetrolet transform. J. Mech. Eng. 52, 13 (2016)


13. Ahmed, K.R., AlSaeed, M., AlJumah, M.: Parallel Algorithms to detect and classify defects in Surface Steel Strips. In: The World Congress in Computer Science, Computer Engineering, and Applied Computing (CSCE 2020). Transactions on Computational Science & Computational Intelligence. Springer, New York (2020)
14. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
15. Ren, R., Hung, T., Tan, K.C.: A generic deep-learning-based approach for automated surface inspection. IEEE Trans. Cybern. 48, 929–940 (2018)
16. Tastimur, C., Yetis, H., Karaköse, M., Akin, E.: Rail defect detection and classification with real time image processing technique. Int. J. Comput. Sci. Softw. Eng. 5, 283 (2016)
17. Jian, C., Gao, J., Ao, Y.: Automatic surface defect detection for mobile phone screen glass based on machine vision. Appl. Soft Comput. 52, 348–358 (2017)
18. Win, M., Bushroa, A.R., Hassan, M.A., Hilman, N.M., Ide-Ektessabi, A.: A contrast adjustment thresholding method for surface defect detection based on mesoscopy. IEEE Trans. Ind. Inform. 11, 642–649 (2015)
19. Kalaiselvi, T., Nagaraja, P.: A rapid automatic brain tumor detection method for MRI images using modified minimum error thresholding technique. Int. J. Imag. Syst. Technol. 1, 77–85 (2015)
20. Wang, L., Zhao, Y., Zhou, Y., Hao, J.: Calculation of flexible printed circuit boards (FPC) global and local defect detection based on computer vision. Circ. World 42, 49–54 (2016)
21. Bai, X., Fang, Y., Lin, W., Wang, L., Ju, B.F.: Saliency-based defect detection in industrial images by using phase spectrum. IEEE Trans. Ind. Inform. 10, 2135–2145 (2014)
22. Borwankar, R., Ludwig, R.: An optical surface inspection and automatic classification technique using the rotated wavelet transform. IEEE Trans. Instrum. Meas. 67, 690–697 (2018)
23. Hu, G.H.: Automated defect detection in textured surfaces using optimal elliptical Gabor filters. Optik 126, 1331–1340 (2015)
24. Susan, S., Sharma, M.: Automatic texture defect detection using Gaussian mixture entropy modeling. Neurocomputing 239, 232–237 (2017)
25. Cen, Y.G., Zhao, R.Z., Cen, L.H., Cui, L.H., Miao, Z.J., Wei, Z.: Defect inspection for TFT-LCD images based on the low-rank matrix reconstruction. Neurocomputing 149, 1206–1215 (2015)
26. Gibert, X., Patel, V.M., Chellappa, R.: Deep multitask learning for railway track inspection. IEEE Trans. Intell. Transp. Syst. 18, 153–164 (2017)
27. Shumin, D., Zhoufeng, L., Chunlei, L.: Adaboost learning for fabric defect detection based on hog and SVM. In: Proceedings of the International Conference on Multimedia Technology, Hangzhou, China, 26–28 July 2011
28. Chondronasios, A., Popov, I., Jordanov, I.: Feature selection for surface defect classification of extruded aluminum profiles. Int. J. Adv. Manuf. Technol. 83, 33–41 (2016)
29. Masci, J., Meier, U., Fricout, G., Schmidhuber, J.: Multi-scale pyramidal pooling network for generic steel defect classification. In: Proceedings of the Int. Joint Conf. on Neural Networks, Dallas, TX, USA, 4–9 August 2013
30. Natarajan, V., Hung, T.Y., Vaikundam, S., Chia, L.T.: Convolutional networks for voting-based anomaly classification in metal surface inspection. In: Proceedings of the IEEE International Conference on Industrial Technology, Toronto, ON, Canada, 22–25 March 2017
31. Wang, T., Chen, Y., Qiao, M., Snoussi, H.: A fast and robust convolutional neural network-based defect detection model in product quality control. Int. J. Adv. Manuf. Technol. 94, 3465–3471 (2018)
32. Cha, Y.J., et al.: Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types. Comput. Aided Civ. Infrastruct. Eng. 33, 731–747 (2018)
33. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015 Proceedings (2015)


34. Tao, X., Zhang, D., Ma, W., Liu, X., Xu, D.: Automatic metallic surface defect detection and recognition with convolutional neural networks. Appl. Sci. 8, 1575 (2018)
35. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: International Conference on Learning Representations (ICLR) (2016)
36. Xu, H., Warde-Farley, D., Ozair, S., Courville A., Yoshua, K.: Generative Adversarial Networks. arXiv:1406.2661 (2014)
37. Goodfellow, I., Pouget-Abadie, J., Mirza, M.: Generative Adversarial Networks. arXiv:1406.266 (2014)
38. Radford, A., Metz, L., Chintala, S.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 (2016)
39. Luo, W., et al.: Understanding the effective receptive field in deep convolutional neural networks. arXiv preprint arXiv:1701.04128 (2017)
40. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations, pp. 1–13 (2015)
41. Scherer, D., Müller, A., Behnke, S.: Evaluation of pooling operations in convolutional architectures for object recognition. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN 2010. LNCS, vol. 6354, pp. 92–101. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15825-4_10
42. He, Y., Song, K., Meng, Q., Yan, Y.: An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans. Instrum. Meas. 69(4), 1493–1504 (2020). https://doi.org/10.1109/TIM.2019.2915404
43. Mang Xiao, M., Jiamh, G., Li, L.X., Li, Y.: An evolutionary classifier for steel surface defects with small sample set. EURASIP J. Image Video Process. 48, 236 (2017). https://doi.org/10.1186/s13640-017-0197-y
44. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
45. Lv, X., Duan, F., Jiang, J.-J., Fu, X., Gan, L.: Deep metallic surface defect detection: the new benchmark and detection network. Sensors 20, 1562 (2020). https://doi.org/10.3390/s20061562

Selective Information Control and Network Compression in Multi-layered Neural Networks

Ryotaro Kamimura (1,2)

(1) Kumamoto Drone Technology and Development Foundation, Techno Research Park, Techno Lab 203, 1155-12 Tabaru Shimomashiki-Gun, Kumamoto 861-2202, Japan
(2) IT Education Center, Tokai University, 4-1-1 Kitakaname, Hiratsuka, Kanagawa 259-1292, Japan

Abstract. The present paper proposes a new type of information-theoretic method called “selective information control” to produce a variety of internal representations, from among which we can choose appropriate ones for interpretation. The new method aims to improve our network compression method so that it produces more interpretable representations by changing the selective information. The selective information proposed here represents the extent to which a network component responds selectively to inputs. When the component responds more selectively to the inputs, the selective information in the component becomes higher. We applied the method to a simplified bank marketing data set. By gradually increasing or decreasing the selective information, we could produce connection weights that improve generalization as well as weights close to the correlation coefficients of the original data set. The better interpretation was obtained by the gradual decrease in selective information. This means that better interpretation can be obtained by increasing the selective information in the lower hidden layers and then filtering out the information in the higher hidden layers.

Keywords: Selective information · Network compression · Interpretation · Generalization

1 Introduction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 184–204, 2022. https://doi.org/10.1007/978-3-030-82193-7_12

Since the beginning of the connectionist approach to the exploration of cognitive functions [42–44], there have been many attempts to interpret internal representations generated by neural networks and to relate them to actual cognitive processes. In addition to the clarification of cognitive functions, it has been well recognized that the interpretation of the inference mechanisms of neural networks can contribute to the trustworthiness of methods as well as to the improvement of general performance [24]. Recently, the trustworthiness of employed models has been one of the serious problems in machine learning, because the introduction of complicated machine


learning techniques has been endangering our daily life. In particular, the neural network, dealt with in this paper, has been considered one of the most typical black-box models among many machine learning methods [7,9,19,38,41,47]. Thus, neural networks can be used to improve generalization performance for many actual problems, but if we cannot explain the reasons why such improved performance can be obtained, the final results can be accepted with difficulty. Thus, a number of interpretation methods have been developed to respond to the trustworthy and safety problems of machine learning [8,11,17,51]. In addition, the black-boxed property can be a serious problem in improving the general performance itself of neural networks. Though neural networks have progressed rapidly in improved prediction performance, they cannot necessarily considerably outperform their counterparts, in particular, human beings. To improve the more general performance of neural networks, we need to deepen our knowledge on how neural networks respond to inputs and produce the corresponding outputs. In particular and among all, there are a number of serious problems to be solved for neural networks for them to be applicable to actual problems, for example, adversarial attacks [16,33] and catastrophic forgetting [12,20,25]. Those problems cannot be solved when the inner inference mechanism of neural networks cannot be well understood. Thus, in addition to the safety and trustworthiness of neural networks, to improve the general performance itself, we need to interpret exactly the inner mechanism, hidden in complicated neural network configurations. One of the major problems with those interpretation methods is that they have focused on the local interpretation where much effort has been on how a neural network responds to a given input pattern. In particular, in the field of convolutional neural networks (CNN), this tendency toward the local interpretation has been apparent. 
The CNN, dealing with image data sets, has penetrated rapidly into many application fields, and the necessity for interpretation has been urgent. However, the network architecture for CNN has been more and more complicated, with many specialized layers, such as convolutional layers, which has prevented us from interpreting their inner inference mechanism. Thus, in spite of an urgent need for the inner inference mechanism to be known and the proposal of many different types of interpretation methods, the main focus has been restricted to the local and individual interpretation. This means that, due to easy and intuitive understanding of image data sets, the interpretation has been replaced by instance-based visualization methods such as activation maximization, selectivity detection, local perturbations, Grad-CAM and layer-wise relevance propagation [3,5,14,15,34,37,45,46]. As mentioned, these types of visual interpretation methods cannot necessarily consider the inner inference mechanism of neural networks as was done in the beginning of the connectionism approach [42–44]. We should clarify the inner inference mechanism, hidden behind complicated input patterns as well as interwoven components of neural networks, to uncover the main and fundamental learning processes of neural networks. To extract this core inner inference mechanism, we have introduced a method of network compression [22,23]. The method aims to compress complicated and multi-layered neural networks into the simplest ones without hidden layers, whose


interpretation is much easier than that of multi-layered neural networks. Recently, model compression has received due attention to reduce the computational burden of complicated and multi-layered neural networks [2,4,10,13,18,21,32,36]. However, these types of conventional model compression have been developed to improve generalization performance. More concretely, complicated neural networks have been replaced by much simpler ones whose generalization performance is approximately equivalent to that of complicated ones. Actually, these compression methods have been performed by black-boxing all components in original complicated neural networks. Thus, even if we can interpret the inner mechanism of simplified networks, the interpretation does not necessarily represent that of original and complex neural networks. The interpretation of smaller models and original larger models are completely different from each other. Thus, it is impossible to apply the compression to the interpretation problem. Contrary to those conventional model compression methods, our network compression method tries to compress original neural networks to keep information contained in connection weights of original and complex neural networks as much as possible by considering all possible routes from inputs to the corresponding outputs. This compression method has been successfully applied to several data sets, [22,23], producing very simple networks with better interpretation performance. However, the simple compression of original complicated neural networks cannot necessarily produce internal representations for easy interpretation. This is due to the fact that, in compressing networks, connection weights and neurons may be complicatedly interwoven. Thus, we need to develop a method to produce networks with more interpretable representations for the results by compression to be applied to the interpretation problem. 
In this context, we introduce here a new type of information-theoretic method to control information content in neural networks, expecting that appropriate information control can lead us to simpler, compressed networks that are easy to interpret. As is well known, one way to control the final internal representations is to control the information content to be stored in neural networks. We should note that in information theory efficient information transmission is the most important consideration, but in neural networks the emphasis should be put not on transmission but on the storage, inside the networks, of information on input patterns and outer conditions. Since the pioneering works of Linsker, many different types of information-theoretic methods have been proposed, from the maximum to the minimum information preservation principles [6,27–31,39,40,48–50]. The maximum and minimum information principles can be differentiated depending on whether our focus is on generalization or on interpretation. Thus, one of the easiest ways is to borrow those conventional information measures for the problem of interpretation as well. However, two major problems in borrowing those measures can be pointed out, namely, computational complexity and interpretability. This means that information-theoretic measures such as mutual information cannot necessarily be applied to controlling the production of internal representations for interpretation, due to the computational complexity and abstract property of information


measures. First, information-theoretic measures such as entropy and mutual information cannot be easily implemented, and even if successfully implemented, much computational complexity has made it hard to apply them to actual data sets. Second, even if the implementation and computation is possible, the interpretation of information content tends to be not so easy due to the abstract property of information measures. Then, even if information can be extracted from neural networks, it may be impossible to extract concrete meanings with respect to the inner mechanism of neural networks. For solving those problems, the present paper proposes a new measure of information, called “selective information.” In this new information measure, we suppose that information content in neural networks should be represented in terms of selectivity control of components such as neurons and connection weights in neural networks. When a component can very selectively respond to a specific input, the component contains much information on the specific input. In our actual situation, we suppose that this selectivity can be represented in connection weights. When a neuron is firmly connected with a specific neuron under a given input pattern, this connection weight between the two neurons should have much information on the input pattern. If it is possible to describe this selectivity, we can have a new definition of information, which can be concretely described in terms of connection weights. Thus, we propose here a new information measure of selective information and try to use it for changing the information content to obtain multiple internal representations, from among which we can choose the appropriate one suited for interpretation. The present paper is organized as follows. In Sect. 2, we first present how to compress multi-layered neural networks to obtain the simplest ones without hidden layers. 
Then, we introduce the selective information, described in terms of the number of connection weights between layers for simple interpretation. In Sect. 3, we apply the method to a simplified version of the well-known bank marketing data set from the machine learning database. In the experiments, we show that the selective information can be increased or decreased, producing different compressed weights. The gradual information decrease tends to produce compressed weights close to the correlation coefficients of the original data set; thus, these compressed weights are easily interpretable. On the other hand, the gradual information increase can produce compressed weights close to the regression coefficients of the logistic regression analysis, with better generalization performance. The results show that the selective information control can be used to produce networks with much more interpretable weights, explaining why such a simple interpretation is possible.

2 Theory and Computational Methods

2.1 Network Compression

One of the main techniques for interpretation is to simplify network configurations as much as possible. We have proposed a new type of simplification in terms of network compression [23]. In this method, we try to compress all layers step

R. Kamimura

Fig. 1. Network architecture with six layers, including four hidden layers (a) and compressed network (b). Panels: (a) Original network; (b) 1st compression; (c) 2nd compression; (d) 3rd compression; (e) Final compression; (f) Final compressed network.

by step by considering all routes from inputs to outputs to obtain the simplest network without hidden layers. Let us show an example of network compression, and for simplicity’s sake, we suppose six layers, including the input and output layers, as in Fig. 1. We compress this six-layered neural network into a two-layered one without hidden layers step by step, considering all routes from inputs to outputs. Let us take connection weights from the second layer, represented by (2), to the third layer (3). As shown in Fig. 1(b), the weights from the first layer to the second layer, $w_{ij}^{(1,2)}$, and from the second layer to the third layer, $w_{jk}^{(2,3)}$, are combined into

$$w_{ik}^{(1,3)} = \sum_{j=1}^{n_2} w_{ij}^{(1,2)} w_{jk}^{(2,3)} \qquad (1)$$

where (1,3) represents a route from the first to the third layer, and $w_{ik}^{(1,3)}$ denotes a compressed weight from the first layer to the third layer. Then, this compressed weight is again compressed with the connection weights from the third to the fourth layer in Fig. 1(c):

$$w_{il}^{(1,4)} = \sum_{k=1}^{n_3} w_{ik}^{(1,3)} w_{kl}^{(3,4)} \qquad (2)$$

In the same way, we can have the compressed weight from the first layer to the fifth layer, $w_{im}^{(1,5)}$. By combining this compressed weight with the connection weights to the output layer in Fig. 1(e), we have

$$w_{in}^{(1,6)} = \sum_{m=1}^{n_5} w_{im}^{(1,5)} w_{mn}^{(5,6)} \qquad (3)$$
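Equations (1)–(3) amount to multiplying the layer-to-layer weight matrices in order. The following is a minimal NumPy sketch of that idea, not the author's implementation: the function name `compress_network`, the layer sizes, and the random weights are assumptions made only for illustration, and biases and activation functions are ignored, as in the equations.

```python
import numpy as np

def compress_network(weights):
    """Collapse a chain of weight matrices into one input-to-output matrix.

    weights[0] has shape (n1, n2), weights[1] has shape (n2, n3), and so on.
    Each step computes w^(1,l+1) = w^(1,l) @ w^(l,l+1), as in Eqs. (1)-(3).
    """
    compressed = weights[0]
    for w in weights[1:]:
        compressed = compressed @ w
    return compressed

# Hypothetical six-layer network; the layer sizes are illustrative only.
rng = np.random.default_rng(0)
sizes = [6, 5, 4, 4, 3, 1]
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]

w_16 = compress_network(weights)
print(w_16.shape)  # (6, 1): one compressed weight per input unit
```

Under these assumptions, each entry of the result sums the products of weights over all routes from one input unit to the output unit.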

Following these steps, we can compress any neural network, though the steps are limited to fully connected ones.

2.2 Controlling Selective Information

As mentioned in the introduction, network compression does not necessarily produce interpretable networks, due to the existence of complicated and interwoven connection weights, neurons, and layers. Thus, we must transform the original neural networks so that they are easily compressible and interpretable. One way to control network configurations is to control the information content contained in neural networks. As mentioned above, we do not use conventional information measures such as mutual information, because they have proved difficult to interpret when applied to the interpretation problem. For easy interpretation, we need a measure that can be interpreted more concretely in terms of components such as neurons and connection weights. For this purpose, we introduce selective information. The reason we adopt this concept is that information content that is not transmitted but stored in neural networks can be translated into how selectively the networks respond to specific input patterns. When neural networks respond more selectively to input patterns, they naturally have more information content on those patterns. Now, let us begin with the definition of selectivity; for simplicity’s sake, we compute the selectivity between the second and third layers (2,3). As a first approximation, we suppose that the importance of weights can be obtained by their absolute values

$$u_{jk}^{(2,3)} = \left| w_{jk}^{(2,3)} \right| \qquad (4)$$

Then, we normalize this importance by its maximum value:

$$g_{jk}^{(2,3)} = \frac{u_{jk}^{(2,3)}}{\max_{j',k'} u_{j'k'}^{(2,3)}} \qquad (5)$$

We call this importance “relative importance,” which can be used to increase or decrease the selectivity, as described below. By using this relative importance, selective information can be computed by

$$G^{(2,3)} = n_2 n_3 - \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} g_{jk}^{(2,3)} \qquad (6)$$
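As an illustration of Eqs. (4)–(6), the selective information of a single weight matrix can be sketched in NumPy as follows. The function names are hypothetical, and returning zero for an all-zero matrix is a convention assumed here so that the division in Eq. (5) stays defined.

```python
import numpy as np

def relative_importance(w):
    """Eqs. (4)-(5): absolute weights normalized by their maximum."""
    u = np.abs(np.asarray(w, dtype=float))
    u_max = u.max()
    if u_max == 0.0:
        # Assumed convention: no weights exist, so no importance either.
        return np.zeros_like(u)
    return u / u_max

def selective_information(w):
    """Eq. (6): number of weights minus the sum of relative importances."""
    w = np.asarray(w, dtype=float)
    if not np.any(w):
        # Assumed convention for the all-zero case: zero information.
        return 0.0
    return float(w.size - relative_importance(w).sum())

# A 3 x 4 matrix with a single non-zero weight: G = 3 * 4 - 1 = 11.
one_hot = np.zeros((3, 4))
one_hot[1, 2] = -2.5
print(selective_information(one_hot))

# All weights equal: every relative importance is one, so G = 0.
print(selective_information(np.full((3, 4), 0.7)))
```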

This selective information is maximized when only one connection weight has a non-zero value, while all the others are zero. In this case, all the information from the preceding layer is contained in one connection weight with the highest selectivity, and the selective information takes its highest value, $n_2 n_3 - 1$. On the other hand, the selective information is minimized when all connection weights are equal, and the minimum value is zero. In the extreme case where all connection weights are zero, the selective information is defined to be zero: because no connection weights exist, the information content stored in them should naturally be zero. This information measure is closely related to the entropy measure of information theory [1], but it has a more concrete meaning. When the selective information increases gradually, the number of strong connection weights decreases. Thus, this measure is directly related to regularization methods such as weight decay, which have played very important roles in improving the generalization performance of neural networks [26]. However, the selective information focuses on the condensation of information on important features of input patterns. We try to increase or decrease this selective information. To control it, we must control the normalized importance, or relative importance. For this, we introduce the inverse of the original relative importance

$$\bar{g}_{jk}^{(2,3)} = 1 - g_{jk}^{(2,3)} \qquad (7)$$

This means that, when the importance increases, the inverse importance decreases. The inverse importance can be used to decrease the strength of weights with larger importance, so that the selective information is reduced. Then, by combining those two types of importance, we have a unified importance

$$h_{jk}^{(2,3)} = \left( \alpha\, g_{jk}^{(2,3)} + \bar{\alpha}\, \bar{g}_{jk}^{(2,3)} \right)^{\beta} \qquad (8)$$

where the parameter α, ranging between zero and one, is used to control the magnitude of importance ($\bar{\alpha} = 1 - \alpha$), and the other parameter β, which should be larger than zero, is used to control the stability of learning processes. When the parameter α is one, this unified form is equivalent to the initial relative importance. On the other hand, when the parameter α is zero, the unified form is equivalent to the inverse form. Thus, by changing the parameter α between one and zero, we can adjust the importance in fine detail. The next step is to include this unified importance in the learning process. Though it might seem better to include it directly in the error minimization itself, this direct inclusion causes much contradiction, because minimizing errors between outputs and the corresponding targets conflicts with the selective information control.

2.3 Selective Information-Driven Learning

We introduce here a new type of learning method in which the selective information guides learning processes; this method is called “selective information-driven learning” to stress the importance of selective information. Each learning step has several sub-steps called “assimilation steps.” At the beginning of a learning step, the unified importance is applied to the connection weights, and in the following assimilation sub-steps, this importance is assimilated. Because the strength of importance can be weakened in the process of sub-steps, in the next learning step, the importance is again applied, followed by the subsequent assimilation sub-steps, and so on. The actual weights for the tth learning step can be computed by

$$w_{jk}^{(2,3)}(t) = h_{jk}^{(2,3)}(t-1)\, w_{jk}^{(2,3)}(t-1) \qquad (9)$$
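To make Eqs. (7)–(9) concrete, the sketch below computes the unified importance of a small weight matrix and applies it once. It is only an illustration: the function names and the example matrix are assumptions, and Eq. (8) is read as the bracketed sum raised to the power β.

```python
import numpy as np

def unified_importance(w, alpha, beta):
    """Eq. (8): (alpha * g + (1 - alpha) * g_bar) ** beta."""
    u = np.abs(w)
    g = u / u.max()          # Eq. (5); assumes at least one non-zero weight
    g_bar = 1.0 - g          # Eq. (7)
    return (alpha * g + (1.0 - alpha) * g_bar) ** beta

def apply_importance(w, alpha, beta):
    """Eq. (9): weights at step t from weights and importance at step t - 1."""
    return unified_importance(w, alpha, beta) * w

# Hypothetical 2 x 2 weight matrix used only for illustration.
w = np.array([[2.0, -1.0],
              [0.5,  0.0]])

# alpha = 1 keeps the strongest weight and shrinks the weaker ones.
w_increase = apply_importance(w, alpha=1.0, beta=1.0)

# alpha = 0 uses the inverse importance, so the strongest weight is removed.
w_decrease = apply_importance(w, alpha=0.0, beta=1.0)

print(w_increase)
print(w_decrease)
```

With a small β, as used in the experiments below, the importance values move toward one, so each application changes the weights only slightly.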

When the parameter α is larger, only one weight tends to be stronger, meaning that the corresponding selective information becomes larger. On the contrary, when the parameter α is smaller, stronger weights are pushed toward smaller ones, and the selective information becomes smaller. Compared with the abstract information measures of information theory, the selective information is directly connected with an actual meaning in terms of the number of connection weights. When the selective information increases, the number of strong connection weights becomes smaller, meaning that information content is stored in a smaller number of weights. On the other hand, when the selective information decreases, the number of strong weights becomes larger. Contrary to the information content of information theory, which represents the possible information to be transmitted, the information content here is information stored in connection weights. In this sense of information storage, the selective information represents how firmly a connection weight connects the corresponding layers. Let us see some examples of controlling the selective information in Fig. 2. Note that the selective information is actually controlled in this paper only for hidden layers. This is because it is easier to control the selective information for hidden layers than for the input and output layers, which are constrained to receive input and output information. Figure 2(a) shows a situation where the selective information increases from the second layer to the fifth layer. Actually, the parameter α increases from 0.1 to 0.9. We use parameter values less than one here because we try to make all connection weights smaller, preventing the explosion of weights by this selective information assimilation. On the other hand, when the strength of connection weights becomes smaller, connection weights are pushed toward smaller values. 
When the parameter α is larger, the relative importance becomes effective, in which larger connection weights remain the same because of the large importance. Gradually, several weights with large strength remain the same, while all the other weak connection weights become weaker and weaker. Actually, as shown in Fig. 2(a), the number of connection weights between layers becomes smaller when the layers become higher. In terms of information, information content in the connection weights remaining in the higher layer becomes larger. We should note again that the information content is not to be transmitted but to be stored, and the information content to be stored is the inverse to the information to be transmitted. On the contrary, Fig. 2(c) shows a case where the selective information decreases, meaning that the number of strong connection weights becomes larger from the

Fig. 2. Three types of network configuration: by gradual information increase (a), no information control (b), and gradual information decrease (c). Panels: (a) Information increase; (b) No information control; (c) Information decrease; (d)–(f) Compressed weights.

second to the fifth layer. In this case, the parameter α is gradually decreased from 0.9 to 0.1, and the selective information becomes smaller, and finally in the last and final hidden layer, all connection weights tend to have the same strength or importance. Because all connection weights have the same importance, we should say that information content to be stored should be small, and information is distributed into many connection weights. In the following section on the experimental results, we will show how information, generalization, and compressed weights can be changed by controlling the selective information.
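The learning procedure of this section can be sketched as a loop in which the unified importance is applied to the hidden-to-hidden weights and then assimilated over a few epochs. The code below is a rough illustration under several assumptions, not the author's implementation: it uses scikit-learn's `MLPClassifier` and edits its `coefs_` attribute from outside through repeated `partial_fit` calls, and the function names, layer sizes, step counts, and synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def unified_importance(w, alpha, beta):
    """Eq. (8) for one weight matrix; assumes some non-zero weight."""
    g = np.abs(w) / np.abs(w).max()
    return (alpha * g + (1.0 - alpha) * (1.0 - g)) ** beta

def train_with_information_control(X, y, n_hidden_layers=4, n_units=8,
                                   alpha_first=0.1, alpha_last=0.9,
                                   beta=0.05, n_steps=5, n_substeps=10,
                                   seed=0):
    clf = MLPClassifier(hidden_layer_sizes=(n_units,) * n_hidden_layers,
                        activation="tanh", random_state=seed)
    # The first call needs the class labels and creates the weight matrices.
    clf.partial_fit(X, y, classes=np.unique(y))
    # One alpha per hidden-to-hidden matrix: increasing values correspond to
    # gradual information increase, decreasing values to gradual decrease.
    alphas = np.linspace(alpha_first, alpha_last, n_hidden_layers - 1)
    for step in range(n_steps):
        # coefs_[0] (input side) and coefs_[-1] (output side) are left alone.
        for layer, alpha in enumerate(alphas, start=1):
            clf.coefs_[layer] *= unified_importance(clf.coefs_[layer],
                                                    alpha, beta)
        # Assimilation sub-steps: one epoch per partial_fit call.
        for sub in range(n_substeps):
            clf.partial_fit(X, y)
    return clf

# Synthetic stand-in data with six inputs (not the bank marketing data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)

clf = train_with_information_control(X, y)
print(len(clf.coefs_))  # 5 weight matrices for four hidden layers
```

Swapping `alpha_first` and `alpha_last` would give the gradual information decrease variant under the same assumptions.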

3 Results and Discussion

3.1 Experimental Outline

We applied the method to the well-known machine learning data set of direct marketing campaigns [35]. We tried to predict whether a client would subscribe to a term deposit. Figure 3 shows a network architecture for the bank marketing data set. The number of inputs was reduced to only six, which were chosen to have some correlations with the targets in the original data set for easy interpretation. The number of hidden layers increased as much as possible, and the actual number of hidden layers was set to 25, because it was impossible to increase the number of layers beyond this point with reasonably good generalization performance. The number of input patterns was set to 24,415. The parameter α for the unified function increased from 0.1 to 0.9 by the gradual information increase method. The learning parameter β was forced to be very small and actually set to 0.05, and this value was used to stabilize the learning processes by slowing the learning rate as much as possible. On the other hand, for the gradual information decrease method, the parameter was decreased from 0.9 to 0.1, where


Fig. 3. A network architecture with 25 hidden layers (27 layers with the input and output layers) and six input units (a) and the corresponding simplified network with compressed weights (b) for the bank marketing data set.

the selective information was forced to decrease when the hidden layer number increased from one to 25. For the experiment, we used the neural network package of scikit-learn with its default settings, except for the number of epochs (steps) and the hyperbolic tangent activation function. Naturally, we added our selective information control inside the package. The number of learning steps was set to 150, each with ten sub-steps (epochs) used to assimilate the initial importance of connection weights. We compared the present methods with the conventional method, which used the same settings except for the selective information control.

3.2 Selective Information Control

First, we increased the selective information gradually by increasing the parameter α from 0.1 to 0.9. Figure 4 shows the selective information from the initial hidden layer to the last (25th) hidden layer when the number of learning steps increased from one (top left) to 150 (bottom right). The selective information was plotted every three steps, from one to 150. As can be seen in the figure,


Fig. 4. Selective information by the gradual information increase method from the first hidden layer to the last hidden layer when the learning steps increased from one (top left) to 150 (bottom right).

the selective information remained small even if the layer number increased in the first several steps. When the learning steps increased further, the selective information tended to increase gradually from the first to the last hidden layer. However, one of the interesting things to see is that, when the learning steps increased, the lower hidden layers tended to have some higher selective information. This may be because the initial several hidden layers tended to be influenced by the input layer in which no selective information control was implemented in this experiment. The results show that the gradual information increase by increasing the parameter α from 0.1 to 0.9 had a natural effect of increasing the selective information content. Figure 5 shows the weights from the first hidden layer (top left) to the 25th hidden layer (bottom right). As can be seen in the figure, many connection weights were strong in the initial several hidden layers. Then, the number of strong weights became smaller and smaller. In the end, in the last hidden layer, only one weight had the stronger value, while all the others had much smaller values. The results show that the number of stronger connection weights became smaller when the selective information increased, as can be expected. On the other hand, we employed the gradual selective information decrease method by decreasing the parameter α from 0.9 to 0.1. Figure 6 shows the selective information for the hidden layers No.1 to No.25 and when the number of learning steps increased from one (top left) to 150 (bottom right), where


Fig. 5. Weights for all hidden layers by the gradual selective information increase method for the bank data set. Weights were arranged from top left (weights from the first hidden layer) to bottom right (weights to the last hidden layer).

Fig. 6. Selective information by the gradual information decrease method from the first hidden layer to the last hidden layer when the learning steps increased from one (top left) to 150 (bottom right).

the selective information was plotted every three learning steps. The selective information was small for all hidden layers in the first place (top left). Then, gradually, the selective information for the initial several hidden layers became larger. Then, the selective information gradually decreased when the hidden layer number increased, though in the last several hidden layers, the strength


Fig. 7. Weights for all hidden layers by the selective information decrease method for the bank data set. Weights were arranged from top left (weights from the first hidden layer) to bottom right (weights to the final, 25th layer).

Fig. 8. Selective information without information control from the first hidden layer to the last hidden layer when the learning steps increased from one (top left) to 150 (bottom right).

of connection weights became slightly larger. The results show that the gradual information decrease method by decreasing the parameter α was effective in actually decreasing the selective information. Figure 7 shows the connection weights for 25 hidden layers from the first (top left) to 25th hidden layer (bottom right). As can be seen in the figure, the number


Fig. 9. Weights for all hidden layers without information control for the bank data set. Weights were arranged from top left (the first step) to bottom right (150th step).

of strong weights was smaller in the first several hidden layers. Then, when the hidden layer number increased further, the number of strong weights became larger and larger. This means that the number of strong weights became larger when the selective information was forced to decrease by the gradual selective information decrease method. Figure 8 shows the selective information for networks without selective information control. As the hidden layer number increased from one to 25, the selective information decreased only very slightly, and little change could be seen. These results show that the selective information control was effective in controlling the selective information content. Then, Fig. 9 shows connection weights from the first hidden layer (top left) to the 25th hidden layer (bottom right); the strengths of all hidden weights were almost random, and no regularity could be seen. This shows that connection weights could not be explicitly arranged without the selective information control.

3.3 Generalization Performance

Then, we compared the generalization performance in terms of accuracy, precision, recall, and F-score for the methods with and without selective information control. Figure 10 shows generalization performance in terms of accuracy (top left), precision, recall, and F-score (bottom right) by the gradual selective information increase method. The accuracy increased and became close to 0.8, which was the best performance of all three methods. The precision increased up to the level of 0.7, but the recall value became close to 0.5. Thus, the information increase method, which had the highest accuracy, tended to increase the precision, while the recall remained small. Figure 11 shows the results by the gradual information decrease method. As can be seen in the figure (top left), the accuracy was lower than that by the information increase method in Fig. 10(a), and it could not increase beyond 0.7.


Fig. 10. Generalization errors in terms of accuracy (top left), precision, recall, and F-score (bottom right) by selective information increase for the bank data set.

Fig. 11. Generalization errors in terms of accuracy (top left), precision, recall, and F-score (bottom right) by selective information decrease for the bank data set.


Fig. 12. Generalization errors in terms of accuracy (top left), precision, recall, and F-score (bottom right) without selective information for the bank data set.

Compared with the results by the information increase method in Fig. 10, the precision was lower, but the recall was higher and larger than 0.5. Finally, the F-score was slightly lower than that by the gradual information increase method. These results show that the precision by the information increase method was higher than that of the information decrease method. On the contrary, the recall showed the opposite tendency, meaning that the information decrease method produced higher recall values. From these results, we can see that, by changing the information content in hidden layers, different generalization performances could be obtained, meaning that different internal representations were obtained. Finally, Fig. 12 shows the results by the method without information control. All measures were close to those by the gradual information increase or decrease method. However, one of the main differences is that the method without information control produced less stable learning processes, in which suddenly lower values were seen when the number of learning steps increased beyond about 100. This means that the conventional method without information control may be unstable when the number of hidden layers increases considerably. Thus, these results imply that the information control methods have the effect of stabilizing the learning processes when the number of hidden layers increases considerably.

3.4 Interpreting Compressed Weights

Finally, we tried to interpret compressed weights by the present methods and to compare them with the regression coefficient of the logistic regression analysis.


Figure 13(a) shows correlation coefficients between inputs and targets of the original data set. As can be seen in the figure, input No.2 (the duration of last contact) had the largest strength and importance. This means that, to subscribe to the term deposit, we need to make sustained contact with customers, which is intuitively reasonably valid. Figure 13(b) shows compressed weights by the gradual selective information increase. Among strong correlation coefficients in Fig. 13, only input No.2 had the largest value, followed by input No.3 (campaign) with moderately strong importance, while all the other inputs had almost no importance. As explained above, the prediction performance in terms of accuracy showed the best performance by the gradual information increase method; those inputs with moderately strong correlations in the original correlation coefficients in Fig. 13(a) were of no use in improving the prediction performance. Note that the correlation between the original correlation in Fig. 13(a) and those weights was 0.92. For the gradual information decrease method in Fig. 13(c), in addition to the strongest input No.2, inputs No.4 (housing) and No.6 (sending documents) had moderate importance, and the correlation coefficient between those compressed weights and the original correlation coefficient became the largest value of 0.97. This means that, though the information decrease method showed lower generalization performance than the information increase method, the gradual information decrease method produced compressed weights quite similar to the correlation coefficients in Fig. 13(a). Thus, the gradual information decrease method, though the prediction performance became lower, produced weights whose interpretation was easier due to the similarity to the original correlation coefficients. Figure 13(d) shows compressed weights by the method without information control. As can be seen in the figure, inputs No.2, No.4, and No.6 had larger importance. 
These three inputs had also some importance by the gradual information decrease in Fig. 13(c) and the original correlation coefficients in Fig. 13(a). However, the correlation between the original correlation and those compressed weights was 0.7, the lowest score. Finally, Fig. 13(e) shows the regression coefficients by the logistic regression analysis. As can be seen in the figure, the coefficients were similar to compressed weights by the gradual information increase method in Fig. 13(b), and the correlation between those weights and the original correlations was 0.92, slightly better than the 0.91 by the gradual information increase method. These results show that, though the logistic regression analysis could be expected to extract linear correlations between inputs and targets, the linear correlations quite close to the original correlation coefficients were obtained by the gradual information decrease method. To extract the real linear relations between inputs and outputs, we need to use multi-layered neural networks with information control, more exactly, gradual information decrease. Though with some speculation from these results, multi-layered neural networks have the property of losing information content naturally when we see neural networks from the point of view of the information channel in which information should decrease [1]. However, if we can appropriately control the information content

Fig. 13. The original correlation coefficients between inputs and targets (a), information increase (b), information decrease (c), no information control (d), and regression coefficients by the logistic regression analysis (e) for the bank data set. Panels: (a) Correlation coefficients; (b) Information increase; (c) Information decrease; (d) Without information control; (e) Logistic regression.

in multi-layered neural networks, complicated components such as connection weights may be disentangled as much as possible to produce the very simple and linear relations between inputs and outputs.
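The comparison used in this section, between compressed weights and the input-target correlation coefficients of the data set, can be sketched as follows. The data, the stand-in "compressed weights," and the function names are hypothetical and serve only to show how such a similarity score might be computed with Pearson correlation.

```python
import numpy as np

def input_target_correlations(X, y):
    """Pearson correlation between each input column and the target."""
    return np.array([np.corrcoef(X[:, i], y)[0, 1]
                     for i in range(X.shape[1])])

def weight_correlation_similarity(compressed_weights, correlations):
    """Pearson correlation between compressed weights and data correlations."""
    return float(np.corrcoef(np.ravel(compressed_weights), correlations)[0, 1])

# Synthetic stand-in data with six inputs; input index 1 drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
noise = rng.normal(size=500)
y = (2.0 * X[:, 1] - 0.5 * X[:, 2] + noise > 0).astype(float)

r = input_target_correlations(X, y)

# Toy stand-in for compressed weights: a scaled, shifted copy of r, so the
# similarity is one up to rounding. Real compressed weights would come from
# multiplying the trained layer-to-layer weight matrices.
hypothetical_weights = 3.0 * r + 0.01
print(weight_correlation_similarity(hypothetical_weights, r))
```

A similarity close to one would indicate, under these assumptions, that the compressed weights rank and scale the inputs in the same way as the simple correlation coefficients.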

4 Conclusion

The present paper aimed to propose a new type of information-theoretic method for interpretation. We suppose that information content on input patterns can be represented in terms of selectivity of components in neural networks. When a neuron responds to a specific input very selectively, the neuron should have some information content on the input. Thus, contrary to the abstract and incomprehensible property of conventional information-theoretic measures such as mutual information, the present measure of selective information can be interpreted very


concretely. Then, we proposed a method to flexibly control selective information content to obtain different types of internal representations. We applied the method to the bank marketing data set, examining how the gradual selective information increase or decrease could produce different internal representations. The results showed that the selective information control could produce different types of connection weights. The gradual information increase produced networks with better generalization, while the gradual information decrease was related to the production of simple relations between inputs and outputs. Thus, for simple and easy interpretation, we should adopt the gradual information decrease, and we should obtain much information in the hidden layers close to the input layer. We focused here on two types of information control, namely, gradual information increase and decrease. However, we should examine further, and more exactly, to what extent the information change in hidden layers affects the interpretability. Though some problems should be solved for actual applications, the results in this paper can certainly contribute to understanding the inner inference mechanism of neural networks.

References

1. Abramson, N.: Information theory and coding (1963)
2. Adriana, R., Nicolas, B., Ebrahimi, K.S., Antoine, C., Carlo, G., Yoshua, B.: FitNets: hints for thin deep nets. In: Proceedings of ICLR (2015)
3. Arbabzadah, F., Montavon, G., Müller, K.R., Samek, W.: Identifying individual facial expressions by deconstructing a neural network. In: German Conference on Pattern Recognition, pp. 344–354. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45886-1_28
4. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems, pp. 2654–2662 (2014)
5. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7), e0130140 (2015)
6. Becker, S.: Mutual information maximization: models of cortical self-organization. Netw. Comput. Neural Syst. 7, 7–31 (1996)
7. Benítez, J.M., Castro, J.L., Requena, I.: Are artificial neural networks black boxes? IEEE Trans. Neural Networks 8(5), 1156–1164 (1997)
8. Bojarski, M., et al.: Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911 (2017)
9. Bologna, G.: Is it worth generating rules from neural network ensembles? J. Appl. Logic 2(3), 325–348 (2004)
10. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541. ACM (2006)
11. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N.: Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM (2015)


12. Chen, X., Wang, S., Fu, B., Long, M., Wang, J.: Catastrophic forgetting meets negative transfer: batch spectral shrinkage for safe transfer learning. In: Advances in Neural Information Processing Systems, pp. 1908–1918 (2019)
13. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration for deep neural networks (2020)
14. Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. University of Montreal 1341 (2009)
15. Fu, R., Hu, Q., Dong, X., Guo, Y., Gao, Y., Li, B.: Axiom-based GradCAM: towards accurate visualization and explanation of CNNs. arXiv preprint arXiv:2008.02312 (2020)
16. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
17. Goodman, B., Flaxman, S.: European union regulations on algorithmic decision-making and a right to explanation. arXiv preprint arXiv:1606.08813 (2016)
18. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey (2020)
19. Hart, A., Wyatt, J.: Evaluating black-boxes as medical decision aids: issues arising from a study of neural networks. Med. Inform. 15(3), 229–236 (1990)
20. Hayes, T.L., Kafle, K., Shrestha, R., Acharya, M., Kanan, C.: Remind your neural network to prevent catastrophic forgetting. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) European Conference on Computer Vision, pp. 466–483. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3_28
21. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
22. Kamimura, R.: Collective mutual information maximization to unify passive and positive approaches for improving interpretation and generalization. Neural Netw. 90, 56–71 (2017)
23. Kamimura, R.: Neural self-compressor: collective interpretation by compressing multi-layered neural networks into non-layered networks. Neurocomputing 323, 12–36 (2019)
24. Kindermans, P.J., et al.: The (un)reliability of saliency methods. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L., Müller, K.R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 267–280. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_14
25. Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. Proc. Nat. Acad. Sci. 114(13), 3521–3526 (2017)
26. Kukačka, J., Golkov, V., Cremers, D.: Regularization for deep learning: a taxonomy. arXiv preprint arXiv:1710.10686 (2017)
27. Leiva-Murillo, J.M., Artés-Rodríguez, A.: Maximization of mutual information for supervised linear feature extraction. IEEE Trans. Neural Networks 18(5), 1433–1441 (2007)
28. Linsker, R.: Self-organization in a perceptual network. Computer 21(3), 105–117 (1988)
29. Linsker, R.: How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comput. 1(3), 402–411 (1989)
30. Linsker, R.: Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comput. 4(5), 691–702 (1992)
31. Linsker, R.: Improved local learning rule for information maximization and related applications. Neural Netw. 18(3), 261–265 (2005)
32. Luo, P., Zhu, Z., Liu, Z., Wang, X., Tang, X.: Face model compression by distilling knowledge from neurons. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)


33. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
34. Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and understanding deep neural networks. Digital Signal Process. 73, 1–15 (2018)
35. Moro, S., Cortez, P., Rita, P.: A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst. 62, 22–31 (2014)
36. Neill, J.O.: An overview of neural network compression. arXiv preprint arXiv:2006.03669 (2020)
37. Nguyen, A., Yosinski, J., Clune, J.: Understanding neural networks via feature visualization: a survey. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L., Müller, K.R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 55–76. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_4
38. Olden, J.D., Jackson, D.A.: Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks. Ecol. Model. 154(1–2), 135–150 (2002)
39. Principe, J.C.: Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives. Springer, New York (2010). https://doi.org/10.1007/978-1-4419-1570-2
40. Principe, J.C., Xu, D., Fisher, J.: Information theoretic learning. In: Unsupervised Adaptive Filtering, vol. 1, pp. 265–319 (2000)
41. Qiu, F., Jensen, J.: Opening the black box of neural networks for remote sensing image classification. Int. J. Remote Sens. 25(9), 1749–1768 (2004)
42. Rumelhart, D.E., Hinton, G.E., Williams, R.: Learning internal representations by error propagation. In: Rumelhart, D.E., Hinton, G.E., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, Cambridge (1986)
43. Rumelhart, D.E., McClelland, J.L.: On learning the past tenses of English verbs. In: Rumelhart, D.E., Hinton, G.E., Williams, R.J. (eds.) Parallel Distributed Processing, vol. 2, pp. 216–271. MIT Press, Cambridge (1986)
44. Rumelhart, D.E., Zipser, D.: Feature discovery by competitive learning. In: Rumelhart, D.E., Hinton, G.E., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 151–193. MIT Press, Cambridge (1986)
45. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
46. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
47. Spining, M., Darsey, J., Sumpter, B., Nold, D.: Opening up the black box of artificial neural networks. J. Chem. Educ. 71(5), 406 (1994)
48. Torkkola, K.: Nonlinear feature transform using maximum mutual information. In: Proceedings of International Joint Conference on Neural Networks, pp. 2756–2761 (2001)
49. Torkkola, K.: Feature extraction by non-parametric mutual information maximization. J. Mach. Learn. Res. 3, 1415–1438 (2003)
50. Van Hulle, M.M.: The formation of topographic maps that maximize the average mutual information of the output responses to noiseless input signals. Neural Comput. 9(3), 595–606 (1997)
51. Varshney, K.R., Alemzadeh, H.: On the safety of machine learning: cyber-physical systems, decision sciences, and data products. Big Data 5(3), 246–255 (2017)

DAC–Deep Autoencoder-Based Clustering: A General Deep Learning Framework of Representation Learning

Si Lu and Ruisi Li

Portland State University, Portland, USA

Abstract. Clustering plays an essential role in many real-world applications, such as market research, pattern recognition, data analysis, and image processing. However, because the input features are often high-dimensional, the data fed to clustering algorithms usually contains noise and can lead to inaccurate clustering results. While traditional dimension reduction and feature selection algorithms can be used to address this problem, the simple heuristic rules they rely on are based on particular assumptions. When those assumptions do not hold, these algorithms may not work. In this paper, we propose DAC, Deep Autoencoder-based Clustering, a generalized data-driven framework to learn clustering representations using deep neural networks. Experimental results show that our approach can effectively boost the performance of the K-Means clustering algorithm on a variety of dataset types.

Keywords: Clustering · K-Means · Representation learning · Deep neural networks · Deep autoencoder

1 Introduction

Clustering is the task of grouping samples such that those in the same group are more similar to each other than to those in other groups. Clustering now serves as a basic and essential pre-processing step in many real-world applications. For example, it can be used to help with fake news identification [6], document analysis [16], and marketing and sales. Specifically, clustering algorithms can extract useful information for these applications by grouping data according to a variety of similarity metrics and grouping schemes. For example, similar patches can be used for image denoising [1–3] or depth enhancement [9], and clustering can be used to find good similar patches [8].

To assign samples properly to different groups (called clusters), meaningful feature values of the samples need to be obtained first. However, in real-world applications, the data is often high-dimensional [5] and usually contains noise, which makes clustering difficult. For example, in the MNIST dataset [7], each input hand-written digit image has 784 pixels. While we know that some pixels (e.g. the ones at image corners) might not be as useful as others (e.g. the ones around image centers), it is difficult to distinguish them manually in clustering.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 205–216, 2022. https://doi.org/10.1007/978-3-030-82193-7_13

Traditional dimensionality reduction algorithms, namely Principal Component Analysis (PCA) [10], Linear Discriminant Analysis (LDA) [4], and Canonical Correlation Analysis (CCA) [13], can be used to reduce the number of features. In addition, feature selection algorithms can be used to select a set of useful, noise-free features from the original ones. These algorithms aim to extract the core information from redundant and correlated high-dimensional input features. However, they often fail for two main reasons. First, most of them require complex mathematical analysis, which is difficult and time-consuming. Second, there is no single approach that works for all types of datasets. Different datasets can have different dimensions and sizes, and may even be used in totally different applications. Some datasets are linear and some are non-linear. As a result, it is difficult to find an approach that works on all types of datasets.

Recently, with the emergence of powerful deep neural networks, deep learning-based approaches have been introduced to learn better data representations and have achieved appealing performance improvements for clustering algorithms. One simple approach is to learn representations using deep autoencoders. Specifically, the original high-dimensional input features are fed into an encoder that generates a low-dimensional output. This output is then fed to a decoder that tries to recover the raw input data as closely as possible. However, most existing approaches [11,15] take images as input and thus use convolutional neural networks.
In this paper, we propose DAC (Deep Autoencoder-based Clustering), a simple but more general framework for representation learning that takes feature vectors as input. Thus, our approach can be applied to more general datasets. In addition, we propose a scheme that uses the group labels to adaptively weight all input features, and we incorporate the estimated weights into the loss function during training. Experimental results show that our approach effectively improves the performance of the K-Means clustering algorithm on different types of datasets, namely MNIST, Fashion-MNIST [14], and the Human Activities and Postural Transitions Data Set (HAPT) [12].

The rest of the paper is organized as follows. Sect. 2 gives an overview of our deep autoencoder-based clustering. Sect. 3 describes the deep autoencoder for representation learning in more detail. Sect. 4 presents experimental results, Sect. 5 discusses limitations, and Sect. 6 concludes.

2 Overview of Deep Autoencoder-Based Clustering

Figure 1 shows an overview of our deep autoencoder-based clustering framework. There are two main steps: training and clustering (testing). In the training step, a deep autoencoder with an encoder and a decoder is trained using the training set.


Fig. 1. Overview of our deep autoencoder-based clustering on the MNIST dataset. The autoencoder (consisting of an encoder and a decoder) tries to encode and decode the input features such that the decoded output is as close to the input as possible. The input size is 28 × 28 = 784, and the size of the learned low-dimensional representation is 10. In the testing stage, the learned encoder output is fed into the classic K-Means algorithm for clustering.

Here a flattened input vector is fed into the multi-layer deep encoder, which outputs a low-dimensional learned representation. This learned representation is then fed into a decoder that tries to recover an output of the same size as the input. Training the autoencoder aims to reconstruct the input as closely as possible. In the subsequent clustering step, we apply the trained autoencoder to the testing set. The output of the encoder (the learned representations) is then fed to the classic K-Means algorithm for clustering. The learned low-dimensional representation vector contains the key information of the given input and thus yields better clustering results.
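The clustering step can be illustrated with a minimal sketch, assuming an encoder has already been trained. The `encode` function below is a hypothetical stand-in (a fixed linear projection), not the network of Fig. 1, and `kmeans` is a plain NumPy version of Lloyd's algorithm with a farthest-point initialization chosen only to keep the toy example deterministic.

```python
import numpy as np


def encode(x, projection):
    """Hypothetical stand-in for a trained encoder: a fixed linear map."""
    return x @ projection


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        # Pick the point whose nearest already-chosen center is farthest away.
        dists = np.min(
            [np.sum((points - c) ** 2, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


rng = np.random.default_rng(0)
# Toy "testing set": three groups in a 20-dimensional input space.
true_labels = np.repeat(np.arange(3), 30)
group_means = np.zeros((3, 20))
group_means[0, 0:5] = 10.0
group_means[1, 5:10] = 10.0
group_means[2, 10:15] = 10.0
x_test = group_means[true_labels] + rng.normal(scale=0.2, size=(90, 20))

# Assumed encoder weights: input feature i is added to output unit i % 4.
projection = np.zeros((20, 4))
projection[np.arange(20), np.arange(20) % 4] = 1.0

representation = encode(x_test, projection)   # low-dimensional features
cluster_ids = kmeans(representation, k=3)     # clustering on the encoder output
```

In the paper's setting, `representation` would be the 10-dimensional output of the trained encoder, and the resulting cluster assignments would be compared with those obtained by running K-Means directly on the raw input features.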

3 Deep Autoencoder for Representation Learning

The architecture of our deep autoencoder for representation learning is shown in Fig. 1. As can be seen, the model is not as complex as some advanced neural networks. The reason is that we want to avoid two kinds of over-fitting. First, we do not want our model to over-fit the training dataset at the expense of the testing dataset. Second, we do not want our model to over-fit the reconstruction problem itself at the expense of the clustering problem. Thus, we select a model of moderate complexity.

3.1 Encoder

The encoder aims to encode or compress the input data into a smaller size representation, and at the same time preserve as much key information as possible.


As shown in Fig. 1, the encoder consists of 8 layers, including the input layer and the learned representation output layer. The input layer is normalized such that all its values are in the range (0, 1). Starting from the input, each larger layer is fully connected to the next smaller layer, followed by a couple of activation layers. There are two main types of activation layers, Relu and Tanh, as shown in Eqs. 1 and 2. The Relu layers introduce non-linearity into our model, making it more robust to non-linear input data. The Tanh layer, on the other hand, maps the data into the normalized range (−1, 1) to alleviate the vanishing/exploding gradient problem.

$$\mathrm{Relu}(x) = \max(0, x) \tag{1}$$

$$\cosh(x) = \frac{e^{x} + e^{-x}}{2}, \qquad \sinh(x) = \frac{e^{x} - e^{-x}}{2}, \qquad \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2}$$
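The two activations can be written out directly from their definitions. The following is a small sketch on scalars; the function names are illustrative and are not taken from the authors' code.

```python
import math


def relu(x):
    """Eq. 1: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def tanh_from_definition(x):
    """Eq. 2: tanh as the ratio sinh(x) / cosh(x)."""
    sinh = (math.exp(x) - math.exp(-x)) / 2.0
    cosh = (math.exp(x) + math.exp(-x)) / 2.0
    return sinh / cosh


samples = [-3.0, -0.5, 0.0, 0.5, 3.0]
relu_values = [relu(v) for v in samples]                  # never negative
tanh_values = [tanh_from_definition(v) for v in samples]  # always inside (-1, 1)
```

Relu leaves positive inputs unchanged, while Tanh squashes every input into (−1, 1), which is the normalization effect described above.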

3.2 Decoder

The decoder aims to decode or decompress the encoded output to reconstruct the original input data as closely as possible. It contains nine layers, including its input layer, which is the output of the encoder, and the final output layer. Specifically, each smaller layer is fully connected to the next larger layer, followed by a Tanh activation layer. In addition, the decoder has a Sigmoid activation layer (shown in Eq. 3) at the final stage to force the output values into the range (0, 1).

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{3}$$

3.3 Objective Function

Clustering-Weighted MSE Loss. While the goal of the classic autoencoder is to reconstruct the original input as closely as possible, it weights every input feature value equally. However, individual input features may contribute differently to the final clustering results. For example, in the MNIST dataset, the pixels at the four corners of almost all images are black (zero intensity) and thus have no impact on the final clustering. On the other hand, some pixels around the center of the images are likely to play more important roles. We thus propose a clustering-weighted MSE loss that lets the autoencoder focus on reconstructing the more important input feature values, as shown in Eq. 4, where $y_i$ is the $i$-th input feature value, $\hat{y}_i$ is its reconstruction, and $n$ is the number of features.

$$L_{cmse} = \frac{\sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2}{n} \tag{4}$$


Fig. 2. A map of the clustering weights computed for the MNIST dataset using 1000 samples from the training set. It can be seen that pixels at boundaries and corners are less important than the ones around image centers.

Here $w_i$ is the weight of the $i$-th feature. It is computed using all $i$-th feature values sampled from a subset of the training dataset with $m$ samples. Denote these $i$-th feature values as $\{x_{ik} \mid k = 1, 2, \dots, m\}$ and the corresponding ground truth group/cluster labels of the $m$ samples as $\{l_k \mid k = 1, 2, \dots, m\}$. The weight of a feature is large if both of the following conditions are met. First, sampled values in the same group/cluster have small differences. Second, sampled values in different groups/clusters have large differences. Thus, the weight is computed as:

$$w_i = \frac{\sum_{l_p = l_q} e^{-(x_{ip} - x_{iq})^2}}{\sum_{l_p = l_q} 1} \cdot \frac{\sum_{l_p \neq l_q} \left(1 - e^{-(x_{ip} - x_{iq})^2}\right)}{\sum_{l_p \neq l_q} 1} \tag{5}$$
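The weighting scheme can be sketched in NumPy under one explicit assumption: the weight is taken as the product of two pair averages, a closeness term over same-label pairs and a separation term over different-label pairs, with pairs restricted to p < q. The function name and the toy data are illustrative only.

```python
import numpy as np


def clustering_weights(x, labels):
    """Per-feature weights from pairwise differences within and between groups.

    x: (m, n) array of m samples with n features; labels: (m,) group labels.
    Assumes pairs p < q and that both kinds of pairs exist in the sample.
    """
    p, q = np.triu_indices(len(labels), k=1)          # all sample pairs p < q
    same = labels[p] == labels[q]                     # pairs within one group
    closeness = np.exp(-(x[p] - x[q]) ** 2)           # shape (pairs, n)
    within = closeness[same].mean(axis=0)             # high if same-group values agree
    between = (1.0 - closeness[~same]).mean(axis=0)   # high if other-group values differ
    return within * between


# Toy data: feature 0 separates the two groups, feature 1 is constant
# (like an always-black corner pixel), feature 2 is unrelated to the groups.
labels = np.array([0, 0, 0, 1, 1, 1])
x = np.array([
    [0.0, 0.0, 0.1],
    [0.1, 0.0, 0.9],
    [0.0, 0.0, 0.5],
    [3.0, 0.0, 0.8],
    [3.1, 0.0, 0.2],
    [3.0, 0.0, 0.4],
])
weights = clustering_weights(x, labels)
```

Under this assumption a constant feature receives weight zero, because its values never differ between groups, while a feature that is stable within groups and different across groups receives a weight close to one.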

Figure 2 shows a map of the clustering weights computed for the MNIST dataset using 1000 samples from the training set. Pixels at boundaries and corners are less important than the ones around image centers and thus have smaller weights (white means larger weights).

Final Objective Function. The final objective function combines the clustering-weighted MSE loss and a standard L2 norm regularization, as shown in Eq. 6. Here the L2 norm regularization $L_r$ is computed over all parameters of the autoencoder, and $\beta$ is a balancing factor with a default value of 0.00001.

$$L = L_{cmse} + \beta \cdot L_r \tag{6}$$
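A minimal numerical sketch of this objective follows. It assumes that the L2 term is the sum of squared parameter values, which is one common form; the text does not spell out the exact form, and all names and numbers below are illustrative.

```python
import numpy as np


def clustering_weighted_mse(y, y_hat, w):
    """Weighted reconstruction loss: mean over features of w_i * (y_i - y_hat_i)^2."""
    return float(np.mean(w * (y - y_hat) ** 2))


def l2_regularization(params):
    """Sum of squared entries over all parameter arrays (assumed L2 form)."""
    return float(sum(np.sum(p ** 2) for p in params))


def total_loss(y, y_hat, w, params, beta=1e-5):
    """Final objective: weighted reconstruction loss plus beta times the L2 term."""
    return clustering_weighted_mse(y, y_hat, w) + beta * l2_regularization(params)


y = np.array([0.0, 1.0, 0.5, 0.0])        # input features
y_hat = np.array([0.2, 0.8, 0.5, 0.4])    # reconstruction
w = np.array([0.0, 1.0, 1.0, 0.5])        # per-feature clustering weights
params = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([3.0])]

loss = total_loss(y, y_hat, w, params)
```

Because the first feature has weight zero, its reconstruction error does not contribute to the loss at all, which is the intended effect for uninformative features such as corner pixels.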


Fig. 3. Samples of the MNIST dataset.

4 Experimental Results

4.1 Data Set

We evaluate our approach on the classic MNIST hand-written digits dataset. This dataset has 50,000 images as the training set and 10,000 images as the testing set. There are 10 groups in total. We show some samples of the MNIST dataset in Fig. 3.

4.2 Measurement Metrics

To evaluate our framework, we apply our trained encoder to the testing dataset. We then compare the representations generated by our trained encoder with the raw input features by feeding each of them to the K-Means algorithm. To measure clustering performance, we use the Adjusted Rand Index (ARI). This metric computes the similarity between two clustering results by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in the predicted and ground truth clusterings. The proposed approach is denoted as DAC.
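The pair-counting idea behind ARI can be made concrete with a short standard-library sketch. It implements the usual contingency-table form of the index and is meant as an illustration, not as the evaluation code used in the experiments.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts of the contingency table."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs placed together in both / in each clustering.
    together_both = sum(comb(c, 2) for c in pair_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        # Degenerate case (e.g. a single cluster on both sides).
        return 1.0
    return (together_both - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 1]
same_up_to_renaming = [1, 1, 1, 0, 0, 0]
one_mistake = [0, 0, 1, 1, 1, 1]

ari_perfect = adjusted_rand_index(truth, same_up_to_renaming)
ari_partial = adjusted_rand_index(truth, one_mistake)
```

The index does not depend on how clusters are numbered: a clustering that matches the ground truth up to renaming scores 1, and a single misassigned sample lowers the score.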

4.3 Experiment Setup

We implement our framework in Python with PyTorch and test it on a desktop with an RTX 2080 Ti GPU. We train the autoencoder for 200 epochs using the Adam optimizer. The initial learning rate is set to 0.003 and decreases with the number of epochs during training.

Model Complexity. Our model for MNIST has 944.86k parameters and a computational cost of 0.001 GMACs (multiply-accumulate operations) during inference. The average processing time per frame is 0.42 ms, corresponding to about 2381 frames per second (FPS).
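The throughput figure follows directly from the reported latency. As a small worked check, the snippet below converts the per-frame time to frames per second and, as an added illustration, estimates the time to encode the 10,000 MNIST testing images at that rate.

```python
latency_ms = 0.42          # average processing time per frame (from the text)
test_set_size = 10_000     # MNIST testing images (from Sect. 4.1)

# Throughput is the inverse of the per-frame latency.
frames_per_second = 1000.0 / latency_ms

# Estimated time to process the whole testing set at this rate.
test_set_seconds = test_set_size * latency_ms / 1000.0
```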


Table 1. Clustering results on MNIST testing dataset.

        K-Means   PCA      DAC
ARI     0.3477    0.4026   0.6624

Fig. 4. Sample results of our trained autoencoder on MNIST dataset. Top: Raw input images. Bottom: Reconstructed images

4.4 Results on MNIST

Table 1 shows the quantitative performance of the proposed approach in terms of ARI. Compared to K-Means on the raw input features, our approach (DAC) boosts the ARI from 0.3477 to 0.6624, a relative improvement of about 90.5%. We also compare our approach with PCA, which reduces the feature dimension from 784 to 10. Table 1 shows that while PCA improves the ARI of K-Means clustering from 0.3477 to 0.4026, our approach (DAC) still performs best. We also show some of the results reconstructed by our trained autoencoder in Fig. 4. They show that our trained autoencoder can properly reconstruct the raw input hand-written digits.
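The relative improvements can be recomputed from the ARI values in Table 1. The helper below is an illustrative calculation only.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


ari = {"K-Means": 0.3477, "PCA": 0.4026, "DAC": 0.6624}   # values from Table 1

dac_gain = relative_improvement(ari["K-Means"], ari["DAC"])   # roughly 90.5 %
pca_gain = relative_improvement(ari["K-Means"], ari["PCA"])   # roughly 15.8 %
```

On these numbers, the gain from DAC over raw K-Means is several times larger than the gain from PCA at the same output dimension of 10.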

4.5 Results on Other Datasets

To test the robustness of our approach to different data types, we apply our method to two other datasets: Fashion-MNIST [14] and the Human Activities and Postural Transitions Data Set (HAPT) [12].

Fashion-MNIST is similar to MNIST, with the same image format and image size. It has 60,000 images as the training set and 10,000 images as the testing set. The only difference is the content: it contains images of 10 types of clothing. The ten categories are shown in Table 2. We show some samples of this dataset in Fig. 5.

Table 2. Fashion-MNIST category labels.

T-shirt/top   Trouser   Pullover   Dress   Coat
Sandal        Shirt     Sneaker    Bag     Ankle boot


Fig. 5. Samples of the Fashion-MNIST dataset.

The Human Activities and Postural Transitions Data Set was captured with a smartphone’s sensors [12]. The authors captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz using the embedded accelerometer and gyroscope of a smartphone (Samsung Galaxy S II). There were 30 volunteers aged 19–48 years. During data capture, each volunteer performed one of twelve activities. There are six basic activities: three static postures (standing, sitting, lying) and three dynamic activities (walking, walking downstairs, and walking upstairs). Six postural transitions that occur between the static postures are also included in the dataset: stand-to-sit, sit-to-stand, sit-to-lie, lie-to-sit, stand-to-lie, and lie-to-stand. All twelve types of activities are shown in Table 3.

Table 3. HAPT category labels.

walking       walking upstairs   walking downstairs   sitting
standing      laying             stand to sit         sit to stand
sit to lie    lie to sit         stand to lie         lie to stand

The sensor signals (accelerometer and gyroscope) were then denoised with noise filters. The authors then sampled the signals in fixed-width sliding windows of 2.56 s with 50% overlap (128 readings per window), and each resulting sample has 561 features. Each sample is captured while the volunteer performs one type of activity. 70% of the volunteers were randomly selected to generate the training set and 30% to generate the testing set. In total, this dataset has 7767 samples for training and 3162 samples for testing.
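The windowing arithmetic can be illustrated with a short sketch: at 50 Hz, a 2.56 s window holds 128 readings, and a 50% overlap means consecutive windows start 64 readings apart. The `sliding_windows` helper and the placeholder signal are illustrative and are not part of the dataset's tooling.

```python
def sliding_windows(signal, window_size, overlap):
    """Split a 1-D sequence into fixed-width windows with fractional overlap."""
    step = int(window_size * (1.0 - overlap))
    return [
        signal[start:start + window_size]
        for start in range(0, len(signal) - window_size + 1, step)
    ]


sampling_rate_hz = 50
window_seconds = 2.56
window_size = round(sampling_rate_hz * window_seconds)   # readings per window

signal = list(range(512))          # placeholder for one sensor channel
windows = sliding_windows(signal, window_size, overlap=0.5)
```

With 512 readings this yields 7 windows, each sharing its second half with the first half of the next one.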


Fig. 6. Sample results of our trained autoencoder on the Fashion-MNIST dataset. Top: Raw input images. Bottom: Reconstructed images

We apply our method to the Fashion-MNIST dataset and report the results in Table 4. As Fashion-MNIST is a more complex dataset, we modified the autoencoder; the modified architecture is shown in Fig. 7. Compared to using the raw input features in K-Means clustering, our method boosts the ARI from 0.3039 to 0.4702, an improvement of 54.7%. We also show some of the results reconstructed by our trained autoencoder in Fig. 6. They show that our trained autoencoder can properly reconstruct the raw input fashion images.

We then apply our method to the HAPT dataset and report the results in Table 5. As this dataset’s inputs are of lower dimension than those of MNIST, we modified the autoencoder accordingly; the modified architecture is shown in Fig. 8. Even on this temporal sequence dataset, our method improves the ARI of K-Means from 0.4290 to 0.5594, an improvement of about 30%. These results show that our method can be applied generally to other data types.

Table 4. Clustering results on Fashion-MNIST testing dataset.

        K-Means   DAC
ARI     0.3039    0.4702

Table 5. Clustering results on HAPT testing dataset.

        K-Means   DAC
ARI     0.4290    0.5594

5 Limitation

While the proposed approach effectively improves the performance of K-Means clustering, it has some limitations. First, the models used are not adaptive to different input sizes. This means that we need to train different models for datasets with different sample input sizes. Second, we use only limited information from the ground truth labels during training. In the future, we plan to exploit more information from the ground truth labels by using more advanced network architectures. For example, when samples from different digit groups are fed to the model, the encoded output features should be as different as possible. On the other hand, when samples from the same digit group are fed to the model, the encoded output features should be similar to each other.

Fig. 7. Overview of our deep autoencoder-based clustering on Fashion-MNIST dataset.

Fig. 8. Overview of our deep autoencoder-based clustering on HAPT dataset.

6 Conclusion

In this paper, we propose DAC, Deep Autoencoder-based Clustering, a generalized data-driven framework to learn low-dimensional clustering representations using trained deep neural networks. Specifically, we train a multi-layer deep autoencoder to encode and decode the raw input samples. The output of the encoder is then fed to the classic K-Means algorithm for clustering. We design a scheme to compute a clustering-based weight for the training objective function, so that the autoencoder focuses more on reconstructing the more important features. Experimental results show that our approach boosts the ARI of the classic K-Means clustering algorithm by about 90% on the MNIST dataset. In addition, our method can also be applied to other types of clustering datasets, such as Fashion-MNIST and the Human Activities and Postural Transitions Data Set (HAPT), where it improves the performance of K-Means by about 55% and 30%, respectively.

References

1. Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 60–65 (2005)
2. Chen, F., Zhang, L., Yu, H.: External patch prior guided internal clustering for image denoising. In: IEEE International Conference on Computer Vision (ICCV), pp. 603–611 (2015)
3. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080–2095 (2007)
4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, vol. 2. John Wiley & Sons Inc., New York (2001)
5. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, New York (2011)
6. Hosseinimotlagh, S., Papalexakis, E.E.: Unsupervised content-based identification of fake news articles with tensor decomposition ensembles. In: Proceedings of the Workshop on Misinformation and Misbehavior Mining on the Web (MIS2) (2018)
7. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
8. Lu, S.: Good similar patches for image denoising. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1886–1895. IEEE (2019)
9. Lu, S., Ren, X., Liu, F.: Depth enhancement via low-rank matrix completion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3390–3397 (2014)
10. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2(11), 559–572 (1901)
11. Pu, Y., et al.: Variational autoencoder for deep learning of images, labels and captions. Adv. Neural Inf. Process. Syst. 29, 2352–2360 (2016)
12. Reyes-Ortiz, J.-L., Oneto, L., Samà, A., Parra, X., Anguita, D.: Transition-aware human activity recognition using smartphones. Neurocomputing 171, 754–767 (2016)


13. Sun, Q.-S., Zeng, S.-G., Liu, Y., Heng, P.-A., Xia, D.-S.: A new method of feature fusion and its application in image recognition. Pattern Recogn. 38(12), 2437–2448 (2005)
14. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
15. Yang, X., Deng, C., Zheng, F., Yan, J., Liu, W.: Deep spectral clustering using dual autoencoder network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019
16. Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 515–524 (2002)

Enhancing LSTM Models with Self-attention and Stateful Training

Alexander Katrompas and Vangelis Metsis

Texas State University, San Marcos, TX 78666, USA
{amk181,vmetsis}@txstate.edu

Abstract. When using LSTM networks to model time-series data, the standard approach is to segment the continuous data stream into fixed-size sequences and then independently feed each sequence to the LSTM network for training in a stateless fashion (i.e. in a fashion that resets the LSTM cell state per fixed-size sequence). As a result, long-term dependencies between patterns appearing in the data stream may be lost. In this work, we introduce a hybrid deep learning architecture that enables long-term inter-sequence modeling while maintaining focus on each sequence’s local characteristics. We use stateful LSTM training to model long-term dependencies that span the fixed-size sequences. We also utilize the attention mechanism to optimally learn each training sequence by focusing on the parts of each sequence that affect the classification outcome the most. Our experimental results show the advantages of each of these two mechanisms independently and in conjunction, compared to the standard stateless LSTM training approach.

Keywords: Recurrent neural networks · LSTM · Deep learning · Attention mechanisms · Time series data · Self-attention

1 Introduction

Recurrent neural networks (RNNs) are well known for their ability to model temporal dynamic data, especially in their ability to predict temporally correlated events [24]. RNNs form a family of neural networks in which a key feature is the additional input of the previous time-step’s network “state,” also known as “memory”. This memory allows RNNs to retain temporal relationships by creating an association between the current time-step and the previous time-step, thereby representing a chain of causation [8,24]. A vanilla RNN’s memory length is relatively short, and newer information is typically weighted more heavily than older information. Ideally, however, an RNN should not only retain information from further in the past, but also weight information based on its importance to the model and not simply on its proximity in time.

Well-established developments in these areas are the RNN architectural variant known as the Long Short-Term Memory (LSTM) network [9,15], as well as the Back-propagation Through Time (BPTT) learning algorithm [4,28]. In this study, LSTM networks and variants of BPTT are studied with two further enhancements: attention mechanisms and stateful training. The goal of these enhancements is to improve RNN memory in memory length (stateful training), feature importance, and inter-sequence weighting (self-attention). We have built a hybrid deep neural network architecture that enhances the ability of LSTM networks to both “focus” on importance within a sequence and “remember” long-term patterns, thereby not only increasing accuracy but also reducing or eliminating the need for extensive data preparation.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 217–235, 2022. https://doi.org/10.1007/978-3-030-82193-7_14

The first enhancement, a mechanism known as attention [1,11], allows the network to focus on more salient sequences within the LSTM memory space. Specifically, in this discussion, we examine self-attention, also known as intra-attention, which focuses on the important relationships between features within sequences [10,25]. This enhancement is architectural in nature at the network level: it alters the structure of the network while leaving the LSTM layer untouched. The second enhancement, a training model enhancement, overrides the typical LSTM back-propagation through time algorithm. This enhancement, which we term “stateful training,” allows the LSTM layer to retain its state between error correction updates while also retaining its “batch update” behavior, thereby capturing long sequences of information in an efficient manner. These enhancements are studied individually and in conjunction, so that four models are compared and contrasted for temporal classification performance: 1) baseline LSTM, 2) LSTM w/Attention, 3) stateful LSTM, 4) stateful LSTM w/Attention.

The remainder of this paper is organized as follows. We first introduce some background work on recurrent neural networks, LSTM networks, and the attention mechanism. Subsequently, we describe the details of our methodology and our proposed solution.
We then evaluate our method and compare its performance against a baseline LSTM model as well as against results of other studies on the same publicly available datasets. We discuss our observations on the training behavior of the proposed architecture. Finally, we summarize and conclude this work.
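The difference between stateless and stateful processing of fixed-size sequences can be shown with a toy forward pass. The sketch below uses a plain tanh recurrent cell as a simplified stand-in for an LSTM layer and covers only the forward computation, not training; the weights and the data stream are random placeholders.

```python
import numpy as np


def run_rnn(inputs, w_in, w_rec, state):
    """Run a plain tanh recurrent cell over `inputs`, returning the final state."""
    for x_t in inputs:
        state = np.tanh(w_in @ x_t + w_rec @ state)
    return state


rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.5, size=(3, 2))    # input-to-hidden weights
w_rec = rng.normal(scale=0.5, size=(3, 3))   # hidden-to-hidden weights
stream = rng.normal(size=(12, 2))            # continuous stream, 12 time steps
segments = stream.reshape(3, 4, 2)           # three fixed-size sequences

# Stateless: the state is reset to zero at the start of every sequence.
stateless_final = [run_rnn(seg, w_in, w_rec, np.zeros(3)) for seg in segments]

# Stateful: the final state of one sequence seeds the next one.
state = np.zeros(3)
for seg in segments:
    state = run_rnn(seg, w_in, w_rec, state)
stateful_final = state

# Reference: process the whole stream as a single long sequence.
full_stream_final = run_rnn(stream, w_in, w_rec, np.zeros(3))
```

Carrying the state across segments reproduces the result of processing the whole stream at once, whereas resetting it discards whatever earlier segments contributed.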

2 Background

RNNs and their associated learning algorithms are typically some variation or enhancement of the standard feed-forward neural network architecture and the back-propagation learning algorithm. A brief introduction to the feed-forward back-propagation (FFBP) algorithm is presented here to frame the challenge and solutions presented in this study. More details about these algorithms can be found in [5,22,25].

2.1 Feed-Forward Networks, Recurrent Neural Networks, Back Propagation Through Time

A FFBP network trains very simply by feeding information through the network forward, and then back-propagating errors in the reverse, typically with

Enhancing LSTM Models


some form of gradient descent. In the simplest case, each neuron’s activation in the network is “fed forward” through a simple sigmoid activation per neuron, errors are calculated in the output layer, and these errors are back-propagated through the network to correct the weights between neurons. This feed-forward and back-propagation process is executed with each iteration through the data. A complete pass through all data is known as an epoch [6].

An RNN and its training are derived from the basic FFBP network. The most basic RNN simply captures its current “state” as the output of the RNN layer, and “feeds” this output back to itself as an extra input in the next time step. Typically the RNN structure forms the first layer of a deeper network where a FFN is fed from the output of the RNN. Layered RNNs are also common. In the case of a layered network, the output of the RNN is “hidden” within the network and is therefore called a hidden layer, its neurons termed hidden nodes, and its output termed hidden output [21].

In addition to the aforementioned architectural change, the BP learning algorithm is typically modified into what is known as Back-propagation Through Time (BPTT). Rather than updating the weights with each iteration of input data, input is “batched,” where each batch is some uniform fraction of the total data. Data are fed forward in batches without error correction, all neural output is collected, and the network is updated over the entire batch at once [4,23,28].

2.2 Long Short-Term Memory and Truncated BPTT

Long Short-Term Memory (LSTM) networks are a variant of RNN which not only feed the previous hidden output back into the input of the LSTM but also maintain a separate “cell state,” which updates with each iteration, independent of batch error correction. This cell state is not directly affected by the backpropagation of errors, thereby giving the network the ability to avoid the well-known vanishing/exploding gradient problem [21]. Unlike vanilla RNNs, LSTMs can learn tasks which require memories of events that happened hundreds of discrete time-steps earlier [9,21]. LSTMs also use the batched BPTT algorithm (also known as Truncated BPTT, or TBPTT). In typical TBPTT some batch size, k, is chosen between 2 and n/2, where n is the number of instances in the training set. When training an LSTM, the internal cell state is typically reset between batches. This reset effectively removes the ability of the network to retain state (i.e. memory) across batches. A form of TBPTT that allows information to flow across batch boundaries is known as accelerated TBPTT (A-TBPTT). In this case, k1 is chosen to be the batch size and k2 the error size, where k1 < k2. In other words, k1 is when to correct the network and k2 is the amount by which to correct. In this fashion, some portion of a previous batch’s state is incorporated into the current batch [4,23,28]. Extending the idea of A-TBPTT, one can imagine training the network with TBPTT while using maximal information on everything seen to that point (i.e. k1 < k2 where k2 is all information seen to that point). The advantage would be to keep the benefits of TBPTT while also maintaining maximal state information from one batch to the next.


A. Katrompas and V. Metsis

However, if it were simply a matter of choosing k1 normally, and k2 to be n (i.e. all instances), to retain all state information, this would simply devolve into a very inefficient form of classical BPTT (i.e. k = n). This method of training also tends to “overload” the network with long-past information unlikely to be relevant to the current time-step, thereby creating noise.
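The relationship between k1 and k2 described in this section can be made concrete with a small sketch. This is an illustration only, not the authors' code: `tbptt_windows` is a hypothetical helper, and it assumes that an update is triggered every k1 time-steps and covers the k2 most recent steps.

```python
def tbptt_windows(n, k1, k2):
    """Return (start, end) index pairs, one per weight update.

    Assumption (hypothetical helper): an update is triggered every k1
    time-steps and covers the k2 most recent steps, clipped at the start
    of the data. k1 == k2 gives non-overlapping batches (plain TBPTT);
    k2 > k1 lets each update reach back into earlier batches.
    """
    if k1 < 1 or k2 < k1:
        raise ValueError("expected 1 <= k1 <= k2")
    windows = []
    for end in range(k1, n + 1, k1):
        start = max(0, end - k2)  # look back k2 steps, but not before step 0
        windows.append((start, end))
    return windows


plain = tbptt_windows(n=12, k1=4, k2=4)         # batches never overlap
overlapping = tbptt_windows(n=12, k1=4, k2=8)   # each update reuses one earlier batch
everything = tbptt_windows(n=12, k1=4, k2=12)   # every update starts from step 0
```

With k2 equal to n, every update reprocesses all earlier steps, which is the inefficiency the paragraph above points out.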

2.3 Self-attention

Attention mechanisms are a well-known technique in natural language processing using Seq2seq encoder-decoder models. Standard encoder-decoders generally operate with the encoder processing the input sequence and then “compressing” or “summarizing” the information into a context vector of a fixed length for passing to the decoder. A disadvantage of this fixed-length context vector is the inability of the system to remember longer sequences, as well as its tendency to weigh recent information as more important regardless of its true relevance. Attention mechanisms are designed to resolve these problems [1,11].

Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a sequence in order to model dependencies between different parts of the sequence. This differs from general attention in that instead of seeking to discover the “important” parts of the sequence relating to the network output, self-attention seeks to find the “important” portions of the sequence that relate to each other. This is done in order to leverage those intra-sequence relationships to improve network predictions [3,10,16,17,25].

Originally designed for text processing, the benefit of self-attention can be seen in the following example. In order to understand the sentence, “the dog did not run home because it was too tired,” the word “it” must be related to the word “dog” or the sentence makes no sense. However, if we change the word “tired” to “far,” then the word “it” must be related to the word “home.” Obviously, the relationship between “it” and the subject of the sentence is extremely important to the general understanding of the sentence as a whole. In general attention, the mechanism would seek to process the entire sentence and then emphasize the portions that are most important based on the correctness of network output. Conversely, self-attention seeks to relate portions of the sentence that are most important to each other prior to prediction, thereby enhancing understanding and prediction in a more context-specific manner [3,10,16,17]. This technique has proven so successful that in the case of text processing it has been shown to stand alone without the need for RNN or CNN layers and perform as well or better on its own [25].

2.4 Experimental Rationale

While it could be argued that text processing is temporal in nature since words have order and are related through time, strictly speaking, text processing is not time-series data. In fact, it could be argued that a text sentence, or even an entire paragraph, is more related to an image in that it represents a single “picture” conceptually in the mind of the reader. In fact, attention mechanisms designed for text processing found almost immediate further success being adapted to image


processing. This further emphasizes the “single concept” idea shared by image and text processing [29]. The analogy goes further in that attention in a sentence or paragraph generally focuses on subjects/verbs/adjectives for understanding, just as attention in an image focuses on objects/actions/attributes. This study seeks to conduct a preliminary analysis of attention’s efficacy on true time-series data, specifically in temporal classification tasks. As detailed in the landmark paper, Attention is all you need [25], the temporal layer can be removed from text and image processing. However, we seek to understand attention’s role when the temporal aspect of the data is its primary feature. To test this, we re-introduce the LSTM layer and study the interplay between LSTM and attention layers, where the LSTM layer is responsible for temporal relationships and attention is responsible for relationships between features. We propose that there is a benefit to understanding the data both “vertically” (i.e. through time) and “horizontally” (i.e. feature to feature) when learning true time-series data. This study investigates this empirically, prior to the next logical step: a theoretical study (should the approach prove worthy empirically).

3 Methodology

This study seeks to investigate the following challenge: How do we maintain maximum relevant temporal state information, without picking up noise and irrelevant information, without over-training, while also leveraging relevant feature importance? In other words, how do we make the LSTM maximally “stateful” but have it pay “attention” to only relevant information? The solutions proposed here study both the concept of statefulness, to preserve information through batches, and the concept of “attention,” to focus training on specific, short-term, feature-to-feature, high-value information.

3.1 Statefulness

In the context presented here, “statefulness” refers to the LSTM’s ability to preserve its cell state through batches [14,19]. Typically LSTMs are trained without any preservation of state between batches (i.e. k1 = k2 < n and n%k1 = 0). This can be partially solved through A-TBPTT. It should be noted that carrying state forward is not always desirable and this is highly data-dependent. Stateful training on data which has many short-term dependencies, and/or causation is a near-term event, and/or the data has clear and uniform temporal “sections,” may actually be harmful to the model’s performance. However, what is of concern in this study is data that is continuous with longer-term relevant knowledge throughout the data. To achieve this we begin with setting the batch size to 1. This is a matter of the TensorFlow/Keras API used to model the data, and not part of the general algorithm. Setting batch size to 1 has the effect of making the training sequence equal to 1. This normally would cause the loss of all LSTM cell state information since the LSTM cell state will be reset with every iteration. However,


we will alter the LSTM behavior to maintain state between batches (i.e. do not reset the cell state) by setting “stateful” to true (again, this is a matter of the TensorFlow/Keras API used as a method to achieve our algorithmic goals). In this programmatic form (batch size = 1, stateful = true), training is analogous to classical BP. However, we will also structure the data into time slices of tens to hundreds of steps (i.e. a “sequence”), thereby allowing TBPTT to be performed. LSTM cell state will be reset only at the end of an epoch, as opposed to at the end of a sequence, and multiple epochs will be presented. This can be seen in Algorithm 1, where the difference between common LSTM batched training and “stateful” training is the placement of the step, “reset LSTM cell state.” In typical LSTM training, this is performed automatically and immediately following the step, “execute TBPTT.” The end result of this altered training algorithm is an LSTM network that will maintain cell state throughout an entire data set (i.e. epoch) while still training and correcting in batches according to TBPTT [9,12,14,19,23,28].

Algorithm 1. Stateful Training Algorithm
Data: 3D matrix of r × c × s, where r is the number of training instances per sequence, c is the number of features, s is the number of sequences, and where N % s = 0, with N the total number of training instances.
Initialize network;
while epochs remaining do
    foreach s do
        for i ← 1 to r do
            propagate s_i forward;
            E += e_i;
        end
        execute TBPTT;
    end
    reset LSTM cell state;
end
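The placement of the reset step in Algorithm 1 can be illustrated with a toy sketch in plain Python. This is not the authors' TensorFlow/Keras code: `ToyCell` and `run_epochs` are hypothetical stand-ins, and the "cell state" here is only a running sum, so that the effect of the reset placement is easy to see.

```python
class ToyCell:
    """Stand-in for an LSTM layer: its state is a running sum of inputs."""

    def __init__(self):
        self.state = 0.0

    def step(self, x):
        self.state += x
        return self.state

    def reset_state(self):
        self.state = 0.0


def run_epochs(sequences, epochs, stateful):
    """Return the cell state observed at the start of every sequence.

    stateful=False resets after each sequence (common batched training);
    stateful=True resets only at the end of an epoch, as in Algorithm 1.
    The weight update ("execute TBPTT") is only a comment because this
    toy cell has no weights.
    """
    cell = ToyCell()
    starts = []
    for _ in range(epochs):
        for seq in sequences:
            starts.append(cell.state)
            for x in seq:
                cell.step(x)        # propagate s_i forward
            # execute TBPTT over this sequence here
            if not stateful:
                cell.reset_state()  # typical training: reset after every sequence
        cell.reset_state()          # end-of-epoch reset in both modes
    return starts


seqs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stateless_starts = run_epochs(seqs, epochs=2, stateful=False)
stateful_starts = run_epochs(seqs, epochs=2, stateful=True)
```

In the stateless run every sequence starts from a zero state, whereas in the stateful run each sequence starts from the state left by the previous one until the epoch ends.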

3.2 LSTM and Attention

When combined with LSTM architectures, attention operates by capturing all LSTM output within a sequence and training a separate layer to “pay attention” to (i.e. to weigh) some parts of the output more than others. Note that the LSTM is set to return sequences, i.e. for an input sequence $x = (x_1, x_2, \ldots, x_T)$ the LSTM layer produces the hidden vector sequence $h = (h_1, h_2, \ldots, h_T)$ and output $y = (y_1, y_2, \ldots, y_T)$ of the same length, by iterating the following equations from $t = 1$ to $T$:

$$h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{1}$$

$$y_t = W_{hy} h_t + b_y \tag{2}$$

where the W terms denote weight matrices, the b terms denote bias vectors, and H is the hidden layer function. Details about LSTM networks can be found in [7]. Attention is essentially a neural network within a neural network, which is learning to weigh portions of a sequence for relative feature importance [27,30]. The general concept of attention can be modified to work with temporal classification problems where the sequences are a collection of instances of time-series data and the “decoding” is classification. In the models presented here, rather than a sequence of words, the sequences are fixed-length vectors generated by segmenting the continuous data stream. Each value of the sequence vector is a time-step (data point) represented as a numeric value. This value can be a sensor measurement, a stock market price, etc. [18]. The attention used in this study is multiplicative self-attention (see footnote 1) and uses the following attention mechanism:

$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h) \tag{3}$$

$$e_t = \sigma(x_t^{T} W_a x_{t-1} + b_t) \tag{4}$$

$$a_t = \operatorname{softmax}(e_t) \tag{5}$$

where $h_t$ is the hidden node output from the LSTM layer in a two-dimensional matrix (i.e. the entire hidden output, achieved in Keras by setting return_sequences to true). $e_t$ is the sigmoid activation output of the attention two-layer network, where $W_a$ is the attention network weights, producing a corresponding matrix of the attention network activations. $a_t$ is the softmax activation of $e_t$, producing a vector “alignment score” weighting the importance of the individual parts of the batched input sequences.
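As a rough numeric illustration of Eqs. (4) and (5), the sketch below reads them literally in plain Python. It is not the implementation of the third-party attention library used in the study: the helper names, the scalar bias, and the choice to start scoring at the second time-step (the first step has no predecessor) are all assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear(x, w, y):
    """Compute x^T W y for list-based vectors and a nested-list matrix."""
    return sum(x[i] * w[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def attention_weights(xs, w_a, b):
    """Scores e_t = sigmoid(x_t^T W_a x_{t-1} + b) for t = 1..T-1, then softmax.

    Assumption: b is a single scalar bias shared by all time-steps.
    """
    scores = [sigmoid(bilinear(xs[t], w_a, xs[t - 1]) + b)
              for t in range(1, len(xs))]
    return softmax(scores)


# Three time-steps with two features each; identity weights and zero bias
# are illustrative values, not trained parameters.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(xs, identity, 0.0)
```

The resulting weights sum to one, and the step whose features align more strongly with its predecessor receives the larger weight.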

4 Data

In this section, we discuss the characteristics of the data for which the proposed architecture is advantageous as well as the datasets used in our experiments.

4.1 Data Characteristics

Sequential Nature: The data to be modeled must be time-series data, continuous and in-order, sampled at reasonably regular rates, with dependencies through time. An example is environmental data such as barometric pressure, air moisture, and current temperature used in the prediction of future temperature. Gathering data such as environmental data, process control data, physio-metric data, biometric data, etc. can be done continuously, in order, and at regular intervals, and is of high value to many classification problems.

Natural Order: The data to be modeled must be reasonably natural and not artificially staged into discrete, disparate groups. For example, the data cannot be EEG data in ordered experimental events such as hearing a noise on the left/right, or a vision event on the left/right [20]. Since the events (auditory or visual stimulus) in this dataset follow a predetermined pattern scripted by the researchers, the model very quickly learns the experimental design pattern and not the EEG signal characteristics that are associated with the stimulus type. This leads to dramatic over-fitting and no generalizability. It should be noted this does not apply to data collected experimentally in which purposeful natural randomness is simulated with uneven events.

Temporal Event Classification: The classifications to be modeled must be temporal events through time, and not single-point, discrete classifications. In other words, the events being predicted are things that happen over time continuously. An example is predicting a human fall based on smart-device accelerometer readings. The movements leading up to a fall can be running, walking, standing, etc. followed by a fall which happens over time with a series of time-steps including the initial falling period, striking the ground, remaining in the fall position, recovery, and then back to some non-falling activity.

Footnote 1: pypi.org/project/keras-self-attention/.

4.2 Data Sets

Three different datasets were used in our experiments.

SmartFall: The data set consists of raw (x, y, z) accelerometer readings representing activities of daily living (ADLs) such as walking and running with falls interspersed [13].

MobiAct: The data set consists of raw (x, y, z) accelerometer readings with various ADLs (jogging, walking upstairs, falls, etc.) recorded and labeled [26].

Occupancy Data: The data set consists of recorded ambient features of an enclosed space (temperature, humidity, light, CO2, and humidity ratio) and the associated event label that the space is occupied or not occupied for some period of time [2].

In our experiments, the SmartFall and MobiAct data are not pre-processed other than to concatenate various subjects together into a single training, test, and validation set. Conversely, the original SmartFall study, and especially the MobiAct study, both do extensive pre-processing and feature extraction. The occupancy data is not pre-processed and is taken as-is, in temporal order, in both our study and the cited work. However, the cited study does extensive statistical analysis to achieve the optimal model and feature set, whereas our technique simply uses the data as-is with the complete feature set.
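The minimal data preparation described above (concatenating subjects, then cutting the continuous stream into fixed-length sequences) might look like the following sketch. It is an illustration under stated assumptions rather than the study's pipeline: `make_sequences` is a hypothetical helper, the windows are assumed to be non-overlapping, and the accelerometer rows are made-up values.

```python
def make_sequences(rows, steps):
    """Split an ordered list of feature rows into consecutive sequences.

    Trailing rows that do not fill a whole sequence are dropped, so the
    number of rows kept is always a multiple of `steps`.
    """
    usable = len(rows) - (len(rows) % steps)
    return [rows[i:i + steps] for i in range(0, usable, steps)]


# Made-up raw (x, y, z) accelerometer rows for two subjects.
subject_a = [(0.1 * i, 0.0, 9.8) for i in range(5)]
subject_b = [(0.0, 0.2 * i, 9.8) for i in range(4)]

# Concatenate subjects in order, then segment into sequences of 4 steps.
stream = subject_a + subject_b
sequences = make_sequences(stream, steps=4)
```

Each element of `sequences` is one fixed-length sequence of raw rows, with no derived features added.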

5 Models

Four models were used to demonstrate the effectiveness of the enhancements discussed here. All models are built using TensorFlow 2.0 with Keras and the third-party library mentioned above for achieving attention models.

5.1 Architectures

Model 1: Vanilla LSTM: This model is a typical LSTM deep-learning model and consists of an LSTM input layer, a dense layer wrapped in a time-distributed layer, another dense layer, and an output classifier. The LSTM return sequences parameter is set to true, which enables the complete LSTM hidden layer sequences to be sent forward to the time-distributed layer as shown in Fig. 1a. The time-distributed wrapper allows each set of hidden layer sequences to be applied to individual identical copies of the first dense layer. This conforms to the idea that we want to capture and train on all hidden states equally, and not on just the resulting context vector of the hidden states. This is also analogous to the next model (Fig. 1b), wherein ‘return sequences’ is required to implement the attention layer, and it allows for a consistent comparison between models. The output of the time-distributed dense layer is forwarded to the subsequent dense layer, and finally to the output layer. It is assumed the reader is familiar with such models [9].
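The idea behind the time-distributed wrapper, one dense layer with shared weights applied to the hidden output of every time-step, can be sketched in plain Python as follows. This is a conceptual illustration and not the Keras layer itself; the ReLU activation and all numeric values are assumptions chosen for the example.

```python
def dense(vector, weights, bias):
    """One dense layer with ReLU; `weights` holds one row per output unit."""
    return [max(0.0, sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]


def time_distributed(hidden_sequence, weights, bias):
    """Apply the same dense layer (shared weights) to every time-step."""
    return [dense(h_t, weights, bias) for h_t in hidden_sequence]


# Hidden output for three time-steps with two hidden units each.
hidden = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three output units
b = [0.0, 0.0, -1.0]
out = time_distributed(hidden, w, b)
```

The output keeps one vector per time-step, which is why the full hidden sequence rather than only the final state is passed forward.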

Fig. 1. The figure shows the architectures of two networks designed for sequence classification.

Model 2: LSTM with Attention: This model replaces the time-distributed dense layer with an attention layer. Return sequences is set to true enabling the complete hidden layer sequences to be sent forward to the attention layer where they are processed similarly to the previously explained encoder/decoder model and the vanilla LSTM model (see Fig. 1b). Model 3: Stateful LSTM : This model is architecturally identical to the vanilla LSTM (Fig. 1a), however, the learning algorithm is altered to maintain state as described in the section on stateful training. Both return sequences and maintain state parameters are set to true. The state is reset at the end of each epoch as described in Algorithm 1.


Model 4: Stateful LSTM with Attention: This model utilizes the TensorFlow functional API and uses both stateful training and attention in parallel layers, which are then merged and fed forward to a common dense layer. In this model, each “side” of the network is trained according to its architecture as described in the previous two models, respectively (Fig. 2).

Fig. 2. Stateful LSTM with attention

5.2 Hyperparameters

In each case, the models were tuned with the number of nodes, time-steps, and epochs that performed the best for the dataset at hand, so that the best performance of each was measured both against each other and against the existing published work. These parameters were selected in a grid search pattern varying hidden layer nodes, time-steps, and the number of epochs in all combinations until the optimal parameters were discovered for each model. Figure 3 shows the typical Stateful LSTM w/Attention summary. This summary and the following general parameter ranges should be sufficient to reproduce all models. Hidden Layer Nodes were selected between 100 and 300, with an increasing number needed from models 1 to 4, in order, as described in the architectures sub-section. Time-Steps were chosen to be 40 for the attention models (models 2 and 4, as described in the architectures sub-section) and 200 for the non-attention models.


Epochs were chosen between 120 and 35 in generally decreasing numbers from models 1 to 4, as described in the architectures sub-section. This is especially notable in that as the number of nodes increased from model to model, epochs decreased dramatically.
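The grid search over nodes, time-steps, and epochs described above can be sketched as follows. This is a generic illustration, not the authors' tuning script: `grid_search` and `toy_evaluate` are hypothetical names, the scoring function is a placeholder for "train the model and return its validation accuracy," and the intermediate option values (200 nodes, 60 epochs) are chosen only for the example.

```python
from itertools import product


def grid_search(nodes_options, steps_options, epoch_options, evaluate):
    """Try every combination and return (best_score, best_params)."""
    best_score, best_params = None, None
    for nodes, steps, epochs in product(nodes_options, steps_options, epoch_options):
        score = evaluate(nodes, steps, epochs)
        if best_score is None or score > best_score:
            best_score, best_params = score, (nodes, steps, epochs)
    return best_score, best_params


def toy_evaluate(nodes, steps, epochs):
    """Placeholder score that peaks at nodes=200, steps=40, epochs=60."""
    return -abs(nodes - 200) - abs(steps - 40) - abs(epochs - 60)


score, params = grid_search(
    nodes_options=[100, 200, 300],  # range reported in this section
    steps_options=[40, 200],        # attention vs. non-attention settings
    epoch_options=[35, 60, 120],    # endpoints reported; 60 is illustrative
    evaluate=toy_evaluate,
)
```

In practice `evaluate` would train one model per combination, so the cost grows with the product of the option counts.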

Fig. 3. Typical stateful LSTM with attention model used in the study.

6 Experiments and Results

We first present the experimental results of comparing the four different architectures studied in this work (i.e. Vanilla LSTM, LSTM w/Attention, Stateful LSTM, and Stateful LSTM w/Attention) against each other. We show these results per data set, including accuracy, precision, recall, and F1 scores. Figure 4 shows the bar graph of accuracy per data set. Finally, we compare the results of our best model (Stateful LSTM w/Attention) with the results obtained by previously published work on the same datasets.

6.1 Model-to-Model and Model-to-Study Comparisons

Tables 1 through 7 show the results of optimally training each model on each dataset. Tables 8 through 12 compare the results of the best model studied here (measured by accuracy) with the results from the cited studies from which each data set was acquired. The first model-to-study comparison presented is the SmartFall study, which used a combination of (x, y, z) accelerometer readings, derived features, and

Table 1. SmartFall fall detection results

SmartFall    LSTM   Attn   State  Attn State
Accuracy     .939   .946   .958   .960
Precision    .687   .777   .828   .857
Recall       .824   .809   .844   .847
F1           .750   .793   .836   .852
ROC AUC      .912   .941   .963   .974
PR AUC       .827   .859   .893   .819   (source order; columns uncertain)
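The accuracy, precision, recall, and F1 rows reported in these tables follow the usual definitions from confusion-matrix counts, sketched below. The counts in the example are made up; the last line recomputes F1 from the rounded LSTM precision and recall in Table 1 (.687 and .824), which agrees with the reported F1 to within rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall, f1_score(precision, recall)


# Made-up counts: 8 true positives, 2 false positives,
# 2 false negatives, 88 true negatives.
acc, prec, rec, f1 = classification_metrics(tp=8, fp=2, fn=2, tn=88)

# F1 recomputed from the rounded LSTM precision and recall in Table 1.
table_f1 = f1_score(0.687, 0.824)
```

Because fall events are rare relative to other activities, precision, recall, and F1 are more informative here than accuracy alone.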

Table 2. MobiAct: Fall detection results

MobiAct - Fall   LSTM   Attn   State  Attn State
Accuracy         .929   .936   .945   .952
Precision        .814   .799   .929   .941
Recall           .871   .912   .847   .864
F1               .841   .852   .886   .901
ROC AUC          .960   .966   .990   .990
PR AUC           .933   .960   .966   .970   (source order; columns uncertain)

Table 3. MobiAct: Jogging detection results

MobiAct - Jogging   LSTM   Attn   State  Attn State
Accuracy            .963   .970   .970   .972
Precision           .991   .990   .990   .988
Recall              .969   .977   .978   .981
F1                  .980   .984   .984   .985
ROC AUC             .973   .980   .982   .965
PR AUC              .996   .997   .991   .986   (source order; columns uncertain)

Table 4. MobiAct: Detecting walking down stairs

MobiAct - Stairs down   LSTM   Attn   State  Attn State
Accuracy                .919   .943   .941   .948
Precision               .949   .960   .967   .969
Recall                  .929   .953   .944   .953
F1                      .939   .957   .955   .961
ROC AUC                 .950   .965   .968   .968
PR AUC                  .955   .965   .956   .968   (source order; columns uncertain)

Table 5. MobiAct: Detecting walking up stairs

MobiAct - Stairs up   LSTM   Attn   State  Attn State
Accuracy              .900   .919   .926   .933
Precision             .944   .953   .973   .975
Recall                .919   .935   .928   .935
F1                    .931   .944   .950   .955
ROC AUC               .946   .980   .964   .973
PR AUC                .996   .979   .989   .976   (source order; columns uncertain)

Table 6. Detecting occupancy of an enclosed space - Door closed

Occupancy 1   LSTM   Attn   State  Attn State
Accuracy      .978   .961   .978   .980
Precision     .998   .996   .998   .999
Recall        .944   .903   .942   .948
F1            .907   .947   .969   .973
ROC AUC       .990   .991   .994   .990
PR AUC        .980   .986   .977   .990   (source order; columns uncertain)

Table 7. Detecting occupancy of an enclosed space - door open

Occupancy 2   LSTM   Attn   State  Attn State
Accuracy      .925   .955   .948   .970
Precision     .778   .957   .922   .993
Recall        .860   .850   .840   .890
F1            .817   .901   .879   .939
ROC AUC       .984   .993   .984   .993
PR AUC        .965   .959   .972   .929   (source order; columns uncertain)

Fig. 4. Model to model accuracy comparison

post-processing of labels into categories of events as fall or not fall. Table 8 shows our results using only raw (x, y, z) accelerometer readings as input and no post-processing. Table 9 shows our results when post-processing similar to that of the SmartFall study is applied. Both results are compared to the deep learning model presented in the SmartFall study. Also presented are the MobiAct fall detection results from our experiments; however, the MobiAct study did not include fall detection. The results of our MobiAct fall detection experiments are presented in Table 10 as a comparison to both our SmartFall results and the original SmartFall study results. This is presented simply as a reinforcement of the overall results of our models in a similar activity, with a different but comparable data set. Again, no pre-processing of our data was done, and we use post-processing similar to SmartFall for comparison. Table 11 shows the MobiAct results for jogging detection, walking downstairs, and walking upstairs. In each case, the input data for our models was raw


accelerometer readings. Conversely, the input to the “multilayer perceptron” in the MobiAct study was a series of complex derived features, termed the Optimal Feature Set (OFS) in that study. This is notable in that our results are superior in two out of three cases. In the third case, our results have a lower accuracy score but are comparably close, which is notable given the difference in pre-processing effort.

Table 8. SmartFall: Stateful LSTM w/Attn versus study without post processing

SmartFall w/o post processing   Attn State   Study
Accuracy                        .960         .850
Precision                       .857         .770
Recall                          .847         1.0

Table 9. SmartFall: Stateful LSTM w/Att versus study with post processing

SmartFall w/Post processing   Attn State   Study
Accuracy                      .995         .850
Precision                     1.00         .770
Recall                        .989         1.0

Table 10. Stateful LSTM w/Att SmartFall, Stateful LSTM w/Att MobiAct, SmartFall comparison

MobiAct fall data versus SmartFall   Attn State SmartFall   Attn State MobiAct   Smart-Fall Study
Accuracy                             .995                   .984                 .850
Precision                            1.0                    .968                 .770
Recall                               .989                   1.0                  1.0

Table 11. MobiAct detecting ADLs versus study

MobiAct Data   Jogging                Stairs-D               Stairs-U
               Attn+ State   Study    Attn+ State   Study    Attn+ State   Study
Accy.          .972          .996     .948          .915     .933          .925

Table 12. Occupancy detection versus study

Occupancy detection data   Test 1                 Test 2
                           Attn+ State   Study    Attn+ State   Study
Accy.                      .980          .979     .970          .993

Table 12 shows occupancy detection compared to the cited study. This comparison is notable in that the cited study did not use a neural network model, but rather used several statistical models. Our results are presented as a comparison to these statistical models, specifically the best of statistical model results (Linear Discriminant Analysis). While our best results were similar to the cited study’s best results, there are several things of note that make our approach novel and valuable. Again, our models used no pre-processing or pre-selection of inputs, whereas the cited study was in fact a study of the statistically “best” input selection. In other words, we achieved slightly better results (test set 1), and slightly worse but comparable results (test set 2), by simply using the entire feature set without the need for extensive comparative statistics. This comparison is of value as a demonstration that our enhanced deep learning models achieve similar or better results than most of the statistical methods in the cited paper.

7 Discussion: Training Behavior

A notable and surprising effect on training became evident as the attention models were trained and studied. In each case, models that included attention resisted over-fitting, sometimes dramatically so. Even when trained well past the minimum achievable test error, the models did not exhibit over-fitting. This occurred in both the attention-only model and the stateful model with attention. Figures 5 through 7 show this effect. Figure 5 shows that as the standard LSTM model continued to train, the over-training effect became more and more pronounced. This was also observed in the stateful-only model (Fig. 6b). However, in the attention models (Fig. 6a and Fig. 7) the test error closely parallels the training error as the training error continues to decline. Even when both errors “flat-lined,” training and test error remained close and parallel. This was an unexpected result; however, upon closer inspection, it is seemingly intuitive. The purpose of attention mechanisms is to reduce noise (i.e. irrelevant information) and focus on the relatively “important” parts of the sequences. It seems intuitive that this should reduce over-fitting, in that the model has a more difficult time memorizing the training set and is instead constantly corrected toward the important and predictive input sequences and feature relationships. However, this is only a preliminary hypothesis and necessitates further study and validation.


Fig. 5. MobiAct walking down stairs - Standard LSTM

Fig. 6. MobiAct walking down stairs - Attention versus statefulness

Fig. 7. MobiAct walking down stairs - Stateful LSTM with attention

8 Conclusions

Our study of LSTM model enhancements demonstrates clearly that LSTM models with the enhancements of statefulness and attention are capable of equal or better classification results than many state-of-the-art models, and


most notably with far less pre-processing. This is an important finding in that pre-processing is not only cumbersome, it very often leads to human bias. With the enhancements presented here it is possible to effectively process raw data into accurate temporal classification models. This is an important consideration when attempting to train models in real-time and online while the models are in service, an area that warrants further study. In addition, this study both demonstrates the benefits of attention mechanisms as applied outside their typical domains (e.g. seq2seq text processing models) and re-examines the usefulness of RNN layers used in conjunction with attention for temporal classification. Furthermore, stateful training is an area gaining ground in the study of long-term pattern recognition, and these results support those efforts. Based on these results, attention mechanisms, specifically self-attention, can benefit from RNNs and vice versa, and this is an area worthy of further investigation.


Domain Generalization Using Ensemble Learning

Yusuf Mesbah, Youssef Youssry Ibrahim, and Adil Mehood Khan

Machine Learning and Knowledge Representation Lab, Innopolis University, Republic of Tatarstan, Russian Federation

Abstract. Domain generalization is a sub-field of transfer learning that aims at bridging the gap between two different domains in the absence of any knowledge about the target domain. Our approach tackles the problem of a model’s weak generalization when it is trained on a single source domain. From this perspective, we build an ensemble model on top of base deep learning models trained on a single source to enhance the generalization of their collective prediction. The results achieved thus far have demonstrated promising improvements of the ensemble over any of its base learners.

Keywords: Neural networks · Ensemble learning · Domain Generalization

1 Introduction

Ensemble learning is a method in supervised learning that combines multiple predictive models to obtain better and more robust predictions, which makes ensemble methods a strong choice when performance is of high importance. Regarding the number of classifiers in an ensemble, Bonab and Can (2016) demonstrated the law of diminishing returns in ensemble construction. Their theoretical framework shows that the highest accuracy is achieved by using the same number of independent component classifiers as class labels [6]. The theoretical basis of neural networks was proposed independently by Alexander Bain (1873) and William James (1890). Later, McCulloch and Pitts (1943) built a mathematical model based on neural networks and called it "threshold logic". After that, back-propagation was popularized by Rumelhart, Hinton, and Williams (1986). Over the following years, with scientific and technological advances, neural network algorithms became more sophisticated and able to solve bigger and more challenging problems, including object recognition [15], anomaly detection [26,30], accident detection [4,7], action recognition [11,16,27], scene classification [25], hyperspectral image classification [1,2], machine translation [17,28], medical image analysis [10,12], etc.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 236–247, 2022. https://doi.org/10.1007/978-3-030-82193-7_15


Fig. 1. Domain generalization is the problem of transferring knowledge from a source domain (such as SVHN cropped, on the left) to a different target domain (such as MNIST, on the right) to solve the same task, in the absence of any knowledge regarding the distributional shift in the feature space of the inputs.

Nowadays, deep learning (DL) and convolutional neural networks (CNNs) are widely used in everyday life. For example, modern smartphones offer authentication by facial recognition, and many self-driving cars rely mainly on a combination of deep CNNs to process road images. This increase in use raises the bar for computer vision systems to be more robust and stable. As useful as DL techniques are, deploying them in the real world exposes problems that we do not commonly encounter while working on toy datasets or on training data in general. As powerful as deep CNNs are, they have a considerable shortcoming in that they are heavily dependent on the dataset used for training; this problem is also known as over-fitting. The problem at hand (called domain shift) arises mainly because the training dataset (source domain) comes from a different distribution than the deployment data (target domain), resulting in a decrease in the model's performance. Such a discrepancy can occur in real life from slight changes in variables such as image resolution or picture brightness. Domain Generalization (DG) is a sub-field of Transfer Learning (TL) that aims to solve this problem by combining multiple data sources to train a more resilient model in the hope of generalizing to unseen domains. DG assumes the existence of multiple sources of data that are used for the same task, and a target domain dataset that is harder to work with (i.e., harder to label and/or to collect). All domains share the same task but have different marginal distributions. DG is very closely related to Domain Adaptation (DA), which also aims at solving the domain shift problem using one source domain and one target domain. DA can be approached in different ways depending on the availability of labels in the target domain: Supervised, Unsupervised, or Semi-Supervised. DG differs from DA in that we have access neither to the target data nor to its labels.
So, Domain Generalization aims at building a model that can generalize well to unseen domains rather than to a single known domain. Researchers have approached this problem in many ways. One traditional, yet very commonly used, technique is to treat this problem as an over-fitting problem and use regularisation techniques to help the (parametric) model generalize well [33]. Many techniques have proven useful in the case of deep neural networks, such as weight decay, dropout, batch normalization, and L1 and L2 regularization. Although these techniques have proven effective in helping the model generalize within the same dataset and achieve higher test accuracy, they are not the most effective methods for Domain Generalization.

In this paper, we deal with Domain Generalization in its broadest definition, where we handle the case of generalizing from a source domain to an unknown target domain. More specifically, we compare the performance of ensemble models with that of an individual deep neural network on single-source domain generalization. Since ensemble models have shown an increase in accuracy in difficult learning scenarios, we investigate how much benefit ensemble models can give us when dealing with Domain Generalization problems. Accordingly, we have implemented various ensemble models that consist of CNNs and of different traditional machine learning models. We have tested them on five different datasets and report the results in Sect. 4.

2 Related Work

2.1 Ensemble Learning

Ensemble methods have been extensively researched. The main idea is to train multiple predictors for the same problem and merge their outputs to get better results. Ensemble methods are commonly used in machine learning competitions such as ILSVRC, where many CNNs are trained and merged to improve performance [19,22,34]. One main difference between traditional model ensembling and our approach (when it comes to hyperparameter tuning) is the size of the models: even though we could use bigger models that perform better on the training dataset (and, subsequently, an ensemble of them), we preferred weaker models that still perform well on the training dataset (0.9+ accuracy) while providing much better generalization (more on that in Sect. 3.2).

2.2 Transfer Learning

Transfer learning (TL) in machine learning explores how to store the knowledge gained while solving one problem and apply it to a different but related problem. For example, the knowledge gained about recognizing cars could apply when trying to recognize trucks [29]. This is useful for decreasing the training time of the models and helps if the target dataset is small. Similarly, Semi-Supervised classification [3,20,32] tackles the problem of the labelled data not being large enough to build a strong classifier by utilizing a large amount of unlabelled data together with the small number of labels. For example, Zhu and Wu [35] discussed how to deal with noisy labels, and Yang et al. considered cost-sensitive learning [31]. Semi-Supervised classification assumes that the labelled and unlabelled data come from the same domain, while Transfer Learning allows the domains and tasks used in training and testing to be different [24].

2.3 Domain Generalization

Domain Generalization is less explored as a topic than Domain Adaptation [5], even though the ability to access multiple source domains has allowed for more innovative and creative techniques. These techniques mainly fall into two streams:

i. Combining the source domains in a way that helps the model learn domain-invariant features that can generalize well to unseen domains. For example, one state-of-the-art method tries to learn domain-agnostic representations by re-arranging the input images and asking the network to solve the result as a jigsaw puzzle. Although it has proven very effective, it faces a risk when different classes share the same sub-components but link them together differently.
ii. Measuring the similarity between each target image and the potential source domains and then using this information to either combine classifiers or choose a certain classifier for this sample, as in BSF [22].

3 Methods

Fig. 2. Base CNN used to learn CIFAR10 for the ensembles

We compare three CNN-based ensembles and one ensemble of traditional ML models with a single neural network to evaluate which performs better on various generalization problems. In supervised machine learning, there is a dataset $D$ that consists of input data points, where every data point $x$ has a class label $y$, with the assumption that there exists a function $f$ that maps each data point to its class label, $y = f(x)$. The purpose of a learner is to search through a space of possible functions, called hypotheses, to find the function $h$ that best approximates $f$ and can be used to assign the label $y$ to $x$. Such a function is called a classifier. Learners that use a single hypothesis for prediction can suffer from three main problems [9]:

i. The statistical problem arises when the learner is searching a space of hypotheses that is too big for the amount of training data available. In this case, there might be two or more hypotheses that achieve the same accuracy on the training data but perform differently when predicting future data. An ensemble can reduce the risk of this problem by taking the vote of different learners with different hypotheses, as this reduces the overall variance. In [13], the authors illustrated the variance reduction property of an ensemble system.
ii. The computational problem arises when the learner is not guaranteed to find the best hypothesis and can get stuck in a local minimum, as is the case with neural networks and decision tree algorithms. As with the statistical problem, an ensemble can help mitigate the computational problem because the weighted combination of several different local minima reduces the risk of choosing the wrong local minimum.
iii. The representational problem arises when the hypothesis space does not contain a good approximation of the true function $f$. An ensemble can help in some cases, as a weighted vote of the hypotheses can expand the hypothesis space and result in a better approximation of $f$.

The aforementioned problems can become even more severe when there is a domain gap between the training (source) data and the test (target) data. Usually, this problem is alleviated by training a model on multiple, different source domains. However, if there is a single domain to learn from, generalization can become extremely difficult. Therefore, it is interesting to see whether an ensemble model that uses a single source domain but benefits from having different base learners can help improve the generalization performance and, if so, what kind of ensemble model performs better. Our experiments are tailored to answer these questions. For every experiment conducted in this paper, we have a single source dataset $D_s$ and a single target dataset $D_t$ from a different domain, and $N$ CNN models (similar to Fig. 2) $(m_1, m_2, \ldots, m_N)$ that are trained independently on $D_s$.
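The variance-reduction argument behind the statistical problem, and the reason independence between the base models matters, can be illustrated with a standard identity. As a simplifying assumption that the paper does not state, suppose the output $\hat{y}_i$ of each of the $N$ base models has the same variance $\sigma^2$ and every pair of outputs has the same correlation $\rho$. The variance of the averaged output is then:

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\hat{y}_i\right)
  = \frac{1}{N^{2}}\left(N\sigma^{2} + N(N-1)\,\rho\,\sigma^{2}\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{N}\,\sigma^{2}
```

With uncorrelated models ($\rho = 0$) the variance falls as $\sigma^2 / N$, whereas with perfectly correlated models ($\rho = 1$) averaging brings no reduction at all, which is one motivation for diversifying the base learners.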
Then, they are tested on the target domain $D_t$ to give us their predictions $(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N)$; averaging their outputs to obtain $\bar{y}$ gives our first ensemble (average ensemble, denoted by EnA). For the second ensemble, with a meta learner (EnM), we take the models' outputs and train a layer of perceptrons as a meta learner to give us a weighted average of the models' outputs (see Fig. 3). The third ensemble, with meta learner v2 (EnM2), is similar to the previous one, the only difference being that its meta learner is a multi-layer perceptron. For the last ensemble, we combine different traditional ML algorithms (Random Forest (RF), Support Vector Machines (SVM), and Logistic Regression (LR)) into an average ensemble (EnT); see Fig. 3. Lastly, we add to the comparison a single huge CNN (HCNN) that has as many trainable parameters as all the CNNs in the ensemble combined, to see how a different use of trainable parameters might affect the results.
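The fusion rules for EnA and EnM can be sketched as follows. This is a minimal, standard-library-only illustration rather than the authors' implementation: the function names, the toy probabilities, and the hand-set weights are all assumptions made for the example (in the paper, the EnM weights are learned by the meta learner, not fixed by hand).

```python
from typing import List, Sequence

Probs = List[List[float]]  # probs[s][c]: probability of class c for sample s


def weighted_ensemble(model_outputs: Sequence[Probs], weights: Sequence[float]) -> Probs:
    """Weighted average of the base models' class probabilities (EnM-style fusion)."""
    if len(model_outputs) != len(weights):
        raise ValueError("need exactly one weight per base model")
    total = sum(weights)
    n_samples = len(model_outputs[0])
    n_classes = len(model_outputs[0][0])
    return [
        [
            sum(w * m[s][c] for m, w in zip(model_outputs, weights)) / total
            for c in range(n_classes)
        ]
        for s in range(n_samples)
    ]


def average_ensemble(model_outputs: Sequence[Probs]) -> Probs:
    """Unweighted mean of the outputs (EnA): the special case of equal weights."""
    return weighted_ensemble(model_outputs, [1.0] * len(model_outputs))


def predict(probs: Probs) -> List[int]:
    """Pick the most probable class for every sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Toy outputs of three hypothetical base models for two samples and three classes.
m1 = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
m2 = [[0.2, 0.7, 0.1], [0.1, 0.6, 0.3]]
m3 = [[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]]

ena_labels = predict(average_ensemble([m1, m2, m3]))
# Hand-set weights stand in for what a trained meta learner would provide.
enm_labels = predict(weighted_ensemble([m1, m2, m3], weights=[3.0, 1.0, 1.0]))
print(ena_labels, enm_labels)
```

With equal weights both samples are assigned class 1, whereas up-weighting the first model flips the first sample to class 0, which shows how a learned weighting can change the ensemble's decision.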


Fig. 3. General ensemble model

3.1 Data Preparation

There are two datasets from different domains; one is used for training and hyperparameter tuning, and the other is used for testing to see how the ensemble performs on a different domain. As for data preparation, a different data augmentation technique is applied to the training dataset for every neural network in the ensemble, to increase the variance in the training data across networks.

3.2 Experiments

For the first experiment, we use three digit datasets: MNIST [21], USPS [14] and SVHN cropped [23] (henceforth referred to as SVHN). MNIST and USPS are composed of white handwritten digits on a black background, but USPS is small and zoomed to fill the frame, while MNIST is large and padded. On the other hand, SVHN is composed of colored digits on a colored background (see Fig. 1). Moreover, the digits in SVHN are not perfectly isolated; there can be more than one digit in a single image, in which case the label is the middle digit. We train on one dataset and test on another (for every possible pairing of the three datasets). For the second experiment, we use the natural object datasets CIFAR10 [18] and STL10 [8]. CIFAR10 is a colored dataset that consists of 10 classes of natural objects: 6 animals and 4 vehicles. STL10 has the same setup, except that CIFAR10 has images of frogs and STL10 does not, while STL10 has images of monkeys and CIFAR10 does not. We therefore removed the uncommon labels, leaving 9 labels in common between the two datasets. In experiments involving USPS, the other datasets were resized to 16 × 16 to match USPS. In all other experiments, all the datasets were re-scaled to 32 × 32 pixels. SVHN was converted to gray-scale to match MNIST [9].

3.3 Hyperparameter Tuning

For every experiment there were two datasets: a source dataset $S$ and a target dataset $T$. The source dataset is further divided into training and validation parts, which we call $S_{train}$ and $S_{val}$, respectively.
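A minimal sketch of this source split, combined with the per-network augmentation described in Sect. 3.1, might look as follows. The validation fraction, the random seed, the augmentation names, and the subset size are illustrative assumptions, since the paper does not report them; only the use of five base models follows the result tables.

```python
import itertools
import random
from typing import List, Sequence, Set, Tuple

# Placeholder augmentation names; the paper does not list the augmentations it used.
AUGMENTATIONS = ["rotate", "shift", "zoom", "shear"]


def split_source(samples: Sequence, val_fraction: float = 0.1, seed: int = 0) -> Tuple[list, list]:
    """Shuffle the source data and split it into (S_train, S_val)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def unique_augmentation_subsets(pool: Sequence[str], n_models: int, subset_size: int) -> List[Set[str]]:
    """Assign every base model a distinct proper subset of the augmentation pool."""
    if not 0 < subset_size < len(pool):
        raise ValueError("subset_size must give a non-empty proper subset of the pool")
    candidates = list(itertools.combinations(pool, subset_size))
    if n_models > len(candidates):
        raise ValueError("not enough distinct subsets for the requested number of models")
    return [set(c) for c in candidates[:n_models]]


# Integers stand in for (image, label) pairs of the source dataset.
s_train, s_val = split_source(range(100))
subsets = unique_augmentation_subsets(AUGMENTATIONS, n_models=5, subset_size=2)
for i, subset in enumerate(subsets, start=1):
    print(f"model {i}: {sorted(subset)}")
```

Enumerating combinations of a fixed size is only one way to guarantee that no two base models share the same augmentation set; subsets of varying size would satisfy the same requirement.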

CNNs and Ensemble Meta Classifier. To train each CNN (Fig. 2), we used $S_{train}$ for training and $S_{val}$ for validation. To achieve independence between the base models, we have a set of different types of augmentations $A = \{a_1, a_2, \ldots, a_n\}$, and every model $i$ in the ensemble is trained using a unique subset of augmentations $A_i \subset A$. On the other hand, the ensemble meta classifier and the single CNN with the same number of parameters as the ensemble were trained using the full set of augmentations $A$.

Traditional ML Algorithms. Similarly, we used $S_{train}$ for training and $S_{val}$ to tune some parameters such as the number of trees in a random forest.

Table 1. Results for object recognition experiments: (1) from CIFAR10 as the source domain to STL10 as the target domain, (2) from STL10 as the source domain to CIFAR10 as the target domain.

Model      CIFAR10 to STL10             STL10 to CIFAR10
           S_train   S_val    T         S_train   S_val    T
model 1    0.987     0.886    0.706     0.721     0.597    0.460
model 2    0.978     0.879    0.675     0.944     0.664    0.557
model 3    0.978     0.877    0.686     0.903     0.636    0.504
model 4    0.976     0.868    0.684     0.984     0.641    0.509
model 5    0.969     0.888    0.696     0.818     0.633    0.515
EnA        0.99      0.903    0.724     0.964     0.681    0.558
EnM        0.99      0.904    0.727     0.964     0.684    0.559
EnM2       0.99      0.903    0.725     0.973     0.68     0.563
HCNN       0.971     0.878    0.683     0.466     0.423    0.358
EnT        0.958     0.487    0.366     0.709     0.371    0.285
RF         1.0       0.498    0.373     1.0       0.459    0.305
SVM        0.081     0.077    0.091     0.193     0.184    0.177
LR         0.464     0.429    0.305     0.654     0.359    0.281

4 Results

Tables 1, 2, 3 and 4 show the accuracy scores for every model on every problem. The tables show the poor performance of the traditional ML models, which is expected because they are being trained and tested on image datasets. The Random Forest model achieves high accuracy on the training set because it is composed of many decision trees and can easily over-fit the training data, but when tested on the target domain it achieves very low accuracy. Moreover, while tuning the hyperparameters for the random forest, we noticed that increasing the number of trees raises both the training and validation accuracy until the training accuracy reaches 1.0, at which point the validation accuracy starts to plateau.

Another observation is that the CNN-based ensembles (EnA, EnM, EnM2) give better accuracy in both domains in most of the experiments, such as in CIFAR-to-STL (Table 1), where they reached 99% accuracy on the training set and improved over the best of their base models in the target domain by about 2 percentage points (from 70.6% to 72.7%). A similar outcome was observed in the SVHN-to-USPS experiment (Table 4). On the other hand, an ensemble can show a slight drop in accuracy compared to its best base model, such as in the USPS-to-MNIST experiment (Table 3), where the best-performing base model reached 85.9% on the target domain, while EnA and EnM both reached 85.2%. This is because the other models in the ensemble have significantly lower accuracy than the best model. However, the ensembles generally still have higher accuracy than the mean accuracy of their base models.

For some of the experiments, we do not see good generalization, such as in the MNIST-to-SVHN experiment (Table 2), which is due to the huge domain gap between the two datasets. Even though the models achieve 95%+ accuracy on training, they get very poor results on the target domain, and in such cases ensemble methods do not help much.

Table 2. Results for digit recognition experiments: (1) from MNIST as the source domain to SVHN as the target domain, (2) from SVHN as the source domain to MNIST as the target domain.

Model      MNIST to SVHN                SVHN to MNIST
           S_train   S_val    T         S_train   S_val    T
model 1    0.971     0.973    0.069     0.934     0.936    0.647
model 2    0.971     0.973    0.069     0.933     0.935    0.648
model 3    0.978     0.978    0.069     0.934     0.936    0.649
model 4    0.974     0.976    0.069     0.933     0.935    0.649
model 5    0.967     0.968    0.07      0.934     0.936    0.649
EnA        0.98      0.98     0.069     0.934     0.936    0.649
EnM        0.979     0.979    0.069     0.85      0.842    0.527
EnM2       0.979     0.978    0.069     0.933     0.935    0.649
HCNN       0.992     0.991    0.072     0.933     0.935    0.649
EnT        0.99      0.97     0.104     0.332     0.27     0.093
RF         1.0       0.971    0.068     1.0       0.718    0.366
SVM        0.182     0.188    0.068     0.069     0.064    0.183
LR         0.935     0.927    0.108     0.265     0.242    0.053
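The comparison between an ensemble and the mean or best accuracy of its base models can be checked directly against the reported numbers. The short sketch below does this for the target-domain column of Table 1; the accuracies are copied from that table, while the data layout and the function name are assumptions made for illustration.

```python
from statistics import mean

# Target-domain (T) accuracies copied from Table 1.
TABLE1_TARGET = {
    "CIFAR10 to STL10": {
        "base": [0.706, 0.675, 0.686, 0.684, 0.696],
        "ensembles": {"EnA": 0.724, "EnM": 0.727, "EnM2": 0.725},
    },
    "STL10 to CIFAR10": {
        "base": [0.460, 0.557, 0.504, 0.509, 0.515],
        "ensembles": {"EnA": 0.558, "EnM": 0.559, "EnM2": 0.563},
    },
}


def ensemble_gains(base, ensembles):
    """Return {name: (gain over mean base accuracy, gain over best base accuracy)}."""
    base_mean, base_best = mean(base), max(base)
    return {
        name: (round(acc - base_mean, 3), round(acc - base_best, 3))
        for name, acc in ensembles.items()
    }


gains = {task: ensemble_gains(**entry) for task, entry in TABLE1_TARGET.items()}
for task, per_ensemble in gains.items():
    for name, (over_mean, over_best) in per_ensemble.items():
        print(f"{task:17s} {name:5s} vs mean: {over_mean:+.3f}  vs best: {over_best:+.3f}")
```

For Table 1, every CNN-based ensemble exceeds both the mean and the best of its base models on the target domain, with the margin over the best base model being much smaller in the STL10-to-CIFAR10 direction.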

Table 3. Results for digit recognition experiments: (1) from USPS as the source domain to MNIST as the target domain, (2) from MNIST as the source domain to USPS as the target domain.

Model      USPS to MNIST                MNIST to USPS
           S_train   S_val    T         S_train   S_val    T
model 1    0.996     0.976    0.776     0.998     0.994    0.968
model 2    0.999     0.981    0.85      0.997     0.994    0.888
model 3    1.0       0.98     0.794     0.996     0.993    0.958
model 4    1.0       0.975    0.801     0.998     0.994    0.919
model 5    0.999     0.981    0.859     0.996     0.993    0.973
EnA        0.999     0.982    0.852     0.998     0.995    0.962
EnM        1.0       0.982    0.852     0.998     0.995    0.962
EnM2       0.999     0.982    0.864     0.998     0.995    0.957
HCNN       1.0       0.977    0.885     0.995     0.993    0.904
EnT        0.999     0.942    0.112     0.962     0.945    0.113
RF         1.0       0.941    0.098     1.0       0.968    0.118
SVM        0.999     0.915    0.152     0.921     0.918    0.194
LR         0.301     0.308    0.372     0.803     0.798    0.084

Table 4. Results for digit recognition experiments: (1) from USPS as the source domain to SVHN as the target domain, (2) from SVHN as the source domain to USPS as the target domain.

Model      USPS to SVHN                 SVHN to USPS
           S_train   S_val    T         S_train   S_val    T
model 1    0.998     0.974    0.115     0.957     0.952    0.707
model 2    0.998     0.976    0.138     0.954     0.948    0.675
model 3    0.999     0.974    0.11      0.957     0.947    0.714
model 4    0.999     0.973    0.144     0.955     0.947    0.719
model 5    0.998     0.972    0.123     0.951     0.952    0.737
EnA        0.999     0.977    0.125     0.961     0.957    0.756
EnM        0.998     0.981    0.159     0.961     0.957    0.755
EnM2       0.999     0.977    0.125     0.96      0.957    0.748
HCNN       1.0       0.979    0.08      0.985     0.962    0.604
EnT        0.995     0.933    0.115     0.334     0.284    0.11
RF         1.0       0.94     0.068     1.0       0.694    0.465
SVM        0.994     0.947    0.148     0.124     0.125    0.167
LR         0.301     0.308    0.068     0.261     0.239    0.06

5 Conclusion

By providing a different data augmentation for each base learner, we improved the generalization from a single source domain to an unseen target domain in most of our experiments. These results support the usefulness of our ensemble approach, which is among the simplest methods for domain generalization. Moreover, it can utilize weak models to obtain a more robust model. Note, however, that the more base models there are, the more time training requires. For future research, we can explore the effectiveness of ensemble methods when using multiple source domains, how to use ensemble methods in domain adaptation, and how best to exploit access to the target domain. We will also explore how to incorporate ensemble methods into current approaches for solving the domain adaptation and generalization problems.

References

1. Ahmad, M., Khan, A.M., Mazzara, M., Distefano, S.: Multi-layer extreme learning machine-based autoencoder for hyperspectral image classification. In: VISIGRAPP (4: VISAPP), pp. 75–82 (2019)
2. Ahmad, M., Khan, A.M., Mazzara, M., Distefano, S., Ali, M., Sarfraz, M.S.: A fast and compact 3-D CNN for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. (2020)
3. Baralis, E., Chiusano, S., Garza, P.: A lazy approach to associative classification. IEEE Trans. Knowl. Data Eng. 20(2), 156–171 (2007)
4. Batanina, E., Bekkouch, I.E.I., Youssry, Y., Khan, A., Khattak, A.M., Bortnikov, A.: Domain adaptation for car accident detection in videos. In: 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–6. IEEE (2019)
5. Bekkouch, I.E.I., Youssry, Y., Gafarov, R., Khan, A., Khattak, A.M.: Triplet loss network for unsupervised domain adaptation. Algorithms 12(5), 96 (2019)
6. Bonab, H., Can, F.: A theoretical framework on the ideal number of classifiers for online ensembles in data streams, pp. 2053–2056 (2016)
7. Bortnikov, M., Khan, A., Khattak, A.M., Ahmad, M.: Accident recognition via 3D CNNs for automated traffic monitoring in smart cities. In: Arai, K., Kapoor, S. (eds.) CVC 2019. AISC, vol. 944, pp. 256–264. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-17798-0_22
8. Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223. JMLR Workshop and Conference Proceedings (2011)
9. Dietterich, T.G., et al.: Ensemble learning (2002)
10. Dobrenkii, A., Kuleev, R., Khan, A., Rivera, A.R., Khattak, A.M.: Large residual multiple view 3D CNN for false positive reduction in pulmonary nodule detection. In: 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 1–6. IEEE (2017)
11. Gavrilin, Y., Khan, A.: Across-sensor feature learning for energy-efficient activity recognition on mobile devices. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2019)
12. Gusarev, M., Kuleev, R., Khan, A., Rivera, A.R., Khattak, A.M.: Deep learning models for bone suppression in chest radiographs. In: 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 1–7. IEEE (2017)
13. Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12(10), 993–1001 (1990)
14. Hull, J.J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)
15. Khan, A., Fraz, K.: Post-training iterative hierarchical data augmentation for deep networks. In: Advances in Neural Information Processing Systems, vol. 33 (2020)
16. Khan, A.M., Lee, Y.-K., Lee, S., Kim, T.-S.: Accelerometer's position independent physical activity recognition system for long-term activity monitoring in the elderly. Med. Biol. Eng. Comput. 48(12), 1271–1279 (2010)
17. Khusainova, A., Khan, A., Rivera, A.R.: Sart-similarity, analogies, and relatedness for Tatar language: new benchmark datasets for word embeddings evaluation. arXiv preprint arXiv:1904.00365 (2019)
18. Krizhevsky, A., Nair, V., Hinton, G.: Cifar-10 (Canadian institute for advanced research)
19. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
20. Kuncheva, L.I., Rodriguez, J.J.: Classifier ensembles with a random linear oracle. IEEE Trans. Knowl. Data Eng. 19(4), 500–508 (2007)
21. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010)
22. Mancini, M., Bulò, S.R., Caputo, B., Ricci, E.: Best sources forward: domain generalization through source-specific nets (2018)
23. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning (2011)
24. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010)
25. Protasov, S., Khan, A.M., Sozykin, K., Ahmad, M.: Using deep features for video scene detection and annotation. Sig. Image Video Process. 12(5), 991–999 (2018)
26. Rivera, A.R., Khan, A., Bekkouch, I.E.I., Sheikh, T.S.: Anomaly detection based on zero-shot outlier synthesis and hierarchical feature distillation. IEEE Trans. Neural Networks Learn. Syst. (2020)
27. Sozykin, K., Protasov, S., Khan, A., Hussain, R., Lee, J.: Multi-label class-imbalanced action recognition in hockey videos via 3D convolutional neural networks. In: 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pp. 146–151. IEEE (2018)
28. Valeev, A., Gibadullin, I., Khusainova, A., Khan, A.: Application of low-resource machine translation techniques to Russian-Tatar language pair. arXiv preprint arXiv:1910.00368 (2019)
29. West, J., Ventura, D., Warnick, S.: Spring research presentation: a theoretical foundation for inductive transfer. Brigham Young Univ. Coll. Phys. Math. Sci. 1(08) (2007)
30. Yakovlev, K., Bekkouch, I.E.I., Khan, A.M., Khattak, A.M.: Abstraction-based outlier detection for image data. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1250, pp. 540–552. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55180-3_40
31. Yang, Q., Ling, C., Chai, X., Pan, R.: Test-cost sensitive classification on data with missing values. IEEE Trans. Knowl. Data Eng. 18(5), 626–638 (2006)
32. Yin, X., Han, J., Yang, J., Yu, P.S.: Efficient classification across multiple database relations: a crossmine approach. IEEE Trans. Knowl. Data Eng. 18(6), 770–783 (2006)
33. Zhang, C., Bengio, S., Recht, B., Vinyals, O., Hardt, M.: Understanding deep learning requires rethinking generalization (2017)
34. Zhou, Z.-H.: Ensemble Methods: Foundations and Algorithms. CRC Press, Boca Raton (2012)
35. Zhu, X., Wu, X.: Class noise handling for effective cost-sensitive learning by cost-guided iterative classification filtering. IEEE Trans. Knowl. Data Eng. 18(10), 1435–1440 (2006)

Research on Text Classification Modeling Strategy Based on Pre-trained Language Model

Yiou Lin(B), Hang Lei, Xiaoyu Li, and Yu Deng

School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
[email protected]

Abstract. Fine-tuning a pre-trained language model is the current mainstream method of text classification. Taking the fine-tuned BERT model as an example, this approach has three main problems. First, training a massive number of parameters causes high training costs. Second, the model very easily over-fits the training samples, resulting in low transferability. Third, the fine-tuned model is not good at long text classification. In this paper, we take sentiment classification as an example task and use classification accuracy as the metric. For the first problem, we compare two methods: using a compressed language model (accuracy decreased by 0.1%) and using entirely frozen weights (decreased by 4%). For the second problem, with fixed weights (decreased by 4%), Convolutional Networks (CNN) and Capsule Networks (CAP) are used to extract n-gram features and clustering features so that the classification accuracy is recovered (increased by 0.5% with CAP and decreased by 0.2% with CNN). The corpus transfer test shows that CAP's accuracy is greater than that of the fine-tuned model (increased by 0.6%). Meanwhile, this article proposes a method for long text classification that expands the position embedding to support long text input (increased by 1.1%). Finally, this paper compares the training speed and parameter scale of different models under combinations of these strategies, and uses the F1 value, precision, recall, and accuracy to measure each model.

Keywords: Deep learning · BERT model · Chinese sentiment analysis · Capsule Networks · Position embedding

1 Introduction

Text classification is a fundamental topic in Natural Language Processing (NLP). Before a computer program can process a text classification task, it needs a text representation that converts character features into numerical features. In the past, text was represented discretely with sparse vectors, such as the one-hot representation. But using discrete features means that it is impossible to measure the semantic relationship between two language units through Euclidean distance. Therefore, cluster-based language representations (such as Brown clustering word vectors [1]) were proposed. Although similar words are mapped to the same cluster to represent identical word meanings, this still leads to the problem of “multiple words with one meaning”. To solve this problem, distributed representations of words were proposed. A distributed representation uses points in a continuous semantic space to represent words, so that the relationship between words is carried by the relationship between points. With the rise of deep learning, language models (such as Word2vec [2]) that store latent grammatical and semantic features in neural networks have become the mainstream method of text representation. After that, pre-trained deep neural network language models represented by BERT [3] received extensive attention from the academic community. Since a fine-tuned model does not need to be learned from scratch, it achieves higher performance with less data and lower computational cost than a model that does not use pre-training.

Taking fine-tuning BERT as an example, there are three main problems:

1) The number of trainable parameters is exceptionally large. The basic version of BERT has about 100 million trainable parameters.
2) The fine-tuned model is prone to over-fitting to the domain of the training corpus. For example, a model that predicts the sentiment of hotel reviews will perform extremely badly when predicting the sentiment of product reviews, and retraining a new fine-tuned model takes time and effort.
3) During unsupervised pre-training, the length of the input text is limited, and text that exceeds this length is truncated. For example, the maximum token sequence length of BERT is 512. Therefore, the fine-tuned model loses information on long texts that exceed the acceptable input length, and its predictions are often less accurate than on short texts.

In response to each of the above problems, this article makes the following contributions:

1) Because of the large number of trainable parameters, this article compares two methods: using a compressed language model and entirely freezing the weights.
2) With entirely frozen weights, this article finds that using convolutional networks (CNN) and capsule networks (CAP) to extract n-gram features and clustering features, and feeding them into the feed-forward classification network, recovers and slightly improves the classification accuracy. Since this strategy keeps the pre-trained model unchanged, the resulting model has better domain transfer capability.
3) This paper proposes a method to extend the position embedding. Although the extended position embedding slightly changes the model structure, it can be adapted to text input of any length.
4) Finally, this article compares the training time and parameter scale of the models constructed under the above strategies and calculates the F1 value, accuracy, and recall rate of each model.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 248–260, 2022. https://doi.org/10.1007/978-3-030-82193-7_16


Fig. 1. Three commonly used pre-trained language model networks structure diagrams

2 Related Work

Traditional text classification methods mainly focus on feature representation, feature selection, and the choice of classification algorithm. The emergence of fine-tuned pre-trained language models blurs the boundaries between these three issues. The pre-trained language model represented by BERT is a dynamic text representation model based on transfer learning. Dynamic text representation means that the representation of the current text depends not only on the text itself but also on its context. As shown in Fig. 1, Peters et al. [9] proposed the ELMo model based on a two-layer BiLSTM, in which any embedding $T_i$ is related not only to the input $E_i$ but also to the hidden state representing context information. Dynamic text representation carries a wealth of text information. The experiment in [9] shows that the first layer of the ELMo model extracts a large amount of grammatical information, and the second layer extracts semantic information. After that, the Transformer-based GPT model [10] overcame the shortcoming of ELMo that it cannot be computed in parallel. Furthermore, the BERT model uses a bidirectional Transformer structure to consider the context word order comprehensively and thus better extract effective text information.

The attention mechanism is a recent achievement in deep learning that is used to mine the most representative features of a text. It reduces the need to design and screen hand-crafted features such as TF-IDF (term frequency–inverse document frequency). In 2017, the Transformer first appeared in machine translation [11]. The Transformer overcomes the non-parallel computation of Recurrent Neural Networks (RNN) and the lack of context dependence in CNN. In 2018, the Transformer-based dynamic text representation model BERT achieved the best results on 11 natural language processing tasks [3] and has since been widely used in natural language processing.
Although a pre-trained language model followed by a simple feed-forward neural network can achieve good results, the latest research is more inclined to regard it as a text representation encoder [7,8]. Researchers uniformly encode character sequences into fixed-size vectors or matrices and use other networks to extract advanced features. CNN and CAP are the most commonly used feature extraction networks. The traditional view is that CNN is good at extracting local features but cannot capture long-distance dependencies and does not consider word order [4]. Using a pre-trained language model as the underlying text representation makes up for these shortcomings of CNN.


Although the combination of CNN and BERT has been widely used, the convolution operation has no translation invariance, and the pooling operation completely discards position information [5]. Therefore, Hinton et al. proposed capsule networks and the dynamic routing algorithm to make up for the shortcomings of CNN [12]. CAP borrows ideas from neuroscience and holds that the brain is composed of a series of modules called capsules. These capsules are good at handling features such as posture (position, size, direction), deformation, speed, albedo, and hue [6]. In 2017, Sabour et al. introduced CAP into handwriting classification, replaced each scalar neuron of the original neural network with a vector, and used a dynamic routing algorithm instead of pooling to pass information between capsule layers [13]. In 2018, Zhao et al. first introduced CAP into text classification, using CNN to extract the n-gram information of the text and then a three-layer CAP to extract more advanced text features [14].

In addition to directly using pre-trained language models to generate text representations, a considerable number of researchers focus on reducing the model scale with as little loss of accuracy as possible [15,16]. Among them, ALBERT is a well-known example of model simplification that shares all the parameters of the 12-layer attention structure. Experiments show that ALBERT trains significantly faster than the corresponding BERT, and that the very large ALBERT-xxlarge surpasses BERT in all aspects [17].

The above works focus on how to make better use of fine-tuned language models, but the construction strategies they use are not comprehensive, and systematic research on construction strategies is still largely missing. Therefore, this article explores the construction of text classification modeling strategies based on pre-trained language models.

3 Model Architecture

3.1 Model Input

Fig. 2. Schematic diagram of BERT input embedding

As shown in Fig. 2, during BERT pre-training, each input sample is composed of a pair of sentences. The basic unit of a sentence is called a “word piece”. A certain percentage of word pieces are randomly replaced with masks. The ID of each word piece, its position in the sentence, and the label of its sentence are passed through the embedding layer to form the word piece embedding, position embedding, and segment embedding, which are then added together as the input of BERT. After 12 Transformer layers, the model completes the cloze task (predicting the true value at each masked position in the figure) and the next-sentence matching task (judging whether the sentence pair in the figure is contiguous context), which jointly train the model. In the fine-tuning stage, supervised training adapts the model to a specific task. BERT is highly general: almost all NLP tasks can apply this two-stage approach, and the results improve significantly.

At the same time, it is the position embedding that limits the length of the input text. BERT uses absolute position embeddings trained from random initialization, and the maximum number of positions is generally set to 512. When resources are limited, an ideal solution is to extend the position embedding of the trained BERT. Specifically, assuming that the trained absolute position embeddings are $p_1, p_2, \ldots, p_n$, we hope to construct a new set of absolute position embeddings $q_1, q_2, \ldots, q_m$, where $m > n$. To this end, we set

$$q_{(i-1)\times n+j} = p_j + \alpha\,(p_i - p_1) \tag{1}$$

where $i \in [1, n]$ and $j \in [1, n]$. We empirically set $\alpha = 0.9$. This yields $n^2$ position embeddings, the first $n$ of which are compatible with the original BERT model.
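The construction in Eq. (1) can be illustrated with a short Python sketch. This is not the authors' implementation: the function name `extend_position_embeddings` and the toy vectors are assumptions made for illustration, and each embedding is a plain list of floats rather than a row of BERT's embedding matrix.

```python
def extend_position_embeddings(p, alpha=0.9):
    """Build n*n position vectors from n trained ones, following Eq. (1).

    p: list of n trained position vectors (each a list of floats).
    The paper indexes from 1, so p[0] plays the role of p_1 and the
    returned list q satisfies q[(i-1)*n + (j-1)] = p_j + alpha*(p_i - p_1).
    """
    n = len(p)
    q = []
    for i in range(n):        # block index (i - 1 in the paper)
        for j in range(n):    # offset inside the block (j - 1 in the paper)
            q.append([pj + alpha * (pi - p1)
                      for pj, pi, p1 in zip(p[j], p[i], p[0])])
    return q


# Toy example with n = 3 two-dimensional "trained" embeddings (hypothetical values).
trained = [[0.0, 1.0], [2.0, 3.0], [4.0, 6.0]]
extended = extend_position_embeddings(trained, alpha=0.9)
print(len(extended))            # number of extended positions, n * n
print(extended[:3] == trained)  # whether the first n vectors are unchanged
```

For $i = 1$ the correction term $\alpha(p_1 - p_1)$ vanishes, which is why the first $n$ positions stay compatible with the original model. With $n = 512$ the construction gives $512^2 = 262{,}144$ positions, of which only the first $m$ that the model actually needs would have to be kept.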

3.2 Transformer

Fig. 3. Schematic diagram of transformer layer connected to the embedding layer

As shown in Fig. 3, each Transformer layer of the BERT model contains two sub-layers: the self-attention layer and the feed-forward layer. The attention calculation uses the following similarity formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \tag{2}$$

Here $Q$, $K$, and $V$ are the query, key, and value matrices used to compute self-attention; $QK^{T}$ is the attention matrix, which weights the $V$ matrix; and $d_k$ is the dimension of the keys. Self-attention is a special case of attention in which $Q$, $K$, and $V$ are derived from the same matrix. Similar to running multiple sets of convolution kernels concurrently, BERT's Transformer linearly maps the input tensor into multiple sets of tensors and performs the self-attention calculation on each set concurrently, which is called multi-head attention. In terms of parallelism, multi-head attention, like CNN, does not rely on previous calculations, so it parallelizes well and is better than RNN in this respect. In terms of long-distance dependence, since self-attention computes attention between each word and all words, the maximum path length is only 1 no matter how far apart two words are, so long-distance dependence can still be captured. There is a residual connection around each sub-layer (self-attention, feed-forward network) in each encoder, followed by a layer-normalization step. The output of one encoder is used as the input of the next, and 12 Transformer layers form the complete BERT encoding network. Unlike the original application scenario of the Transformer, BERT is only used to extract dynamic representations of text and does not require a decoder.
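The scaled dot-product attention of Eq. (2) can be sketched for a single head in plain Python. The helper names `softmax` and `attention` and the toy matrices are assumptions for illustration; a real BERT layer adds the learned linear maps for $Q$, $K$, $V$, multiple heads, and batching.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention, Eq. (2).

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    # Scaled similarity between every query and every key.
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Each row of weights sums to 1.
    weights = [softmax(row) for row in scores]
    # Weighted sum of the value vectors.
    return [[sum(w * v_row[c] for w, v_row in zip(w_row, V))
             for c in range(len(V[0]))] for w_row in weights]


# Toy input: identical keys give uniform weights, so every output row
# is the mean of the value rows.
out = attention(Q=[[1.0, 0.0], [0.0, 1.0]],
                K=[[1.0, 1.0], [1.0, 1.0]],
                V=[[0.0, 2.0], [4.0, 6.0]])
print(out)
```

Every query attends to every key in one step, which is the "maximum path length of 1" property described above.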

3.3 Capsule Networks

Since introductions to CNN are very common, this article only describes the construction of CAP. The essence of neural network computation is tensor transformation. Unlike the scalar-based computation of ordinary feed-forward neurons, each neuron (capsule) receives a vector as input (this can also be extended to a higher-dimensional tensor). The capsule network uses a dynamic routing algorithm to iteratively compute the cluster centers of the lower-level capsules as the outputs of the higher-level capsules. The calculation follows Algorithm 1 [14]: assuming that capsule $i$ of layer $l$ is connected to capsule $j$ of layer $l + 1$, the output-input relationship between the two capsules is expressed as

$$\hat{u}_{j|i} = W_{j|i}\, v_i \tag{3}$$

where $W_{j|i}$ is the trainable weight matrix. The dynamic routing algorithm, executed $r$ times at layer $l + 1$, is given in Algorithm 1. The compression function of step 5 is defined as follows:

$$\mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|} \tag{4}$$

The compression formula is an innovation of CAP: it is a new type of nonlinear activation function computed on vectors. Its main function is to make the length of the output vector $v_j$ not exceed 1 while preserving the direction of $s_j$. After experimental screening, we set $r$ to 3 and $l$ to 1. We take the output of the first CAP layer to be the output of BERT, and set the number of neurons in the second capsule layer to 32 and the vector length to 18.

Algorithm 1: The Dynamic Routing Algorithm
  Input: $\hat{u}_{j|i}$, $r$, $l$
  Output: $v_j$
  1  For each capsule $i$ in layer $l$ and each capsule $j$ in layer $(l + 1)$: $b_{ij} \leftarrow 0$;
  2  for $r > 0$ do
  3      For each capsule $i$ in layer $l$: $c_i \leftarrow \mathrm{softmax}(b_i)$;
  4      For each capsule $j$ in layer $l + 1$: $s_j \leftarrow \sum_i c_{ij}\, \hat{u}_{j|i}$;
  5      For each capsule $j$ in layer $l + 1$: $v_j \leftarrow \mathrm{squash}(s_j)$;
  6      For each capsule $i$ in layer $l$ and each capsule $j$ in layer $(l + 1)$: $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$;
  7      $r = r - 1$
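Algorithm 1 and the squash function of Eq. (4) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `squash` and `dynamic_routing`, the nested-list layout `u_hat[i][j]` for $\hat{u}_{j|i}$, and the toy values are hypothetical, and the prediction vectors are taken as given instead of being computed from trainable matrices $W_{j|i}$.

```python
import math


def squash(s):
    """Eq. (4): shrink the length of s into [0, 1) and keep its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, r=3):
    """Algorithm 1. u_hat[i][j] is the prediction vector from capsule i
    in layer l to capsule j in layer l + 1. Returns the outputs v_j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]          # step 1
    v = []
    for _ in range(r):                                # step 2
        c = [softmax(b_i) for b_i in b]               # step 3
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]               # step 4
            v.append(squash(s_j))                     # step 5
        for i in range(n_in):
            for j in range(n_out):                    # step 6: agreement update
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v


# Toy input: 2 lower capsules, 2 upper capsules, vector length 2.
u_hat = [[[1.0, 0.0], [0.0, 0.5]],
         [[0.8, 0.1], [0.0, -0.5]]]
outputs = dynamic_routing(u_hat, r=3)
lengths = [math.sqrt(sum(x * x for x in v_j)) for v_j in outputs]
print(lengths)  # every length stays below 1
```

The agreement update in step 6 raises the coupling between a lower capsule and the upper capsule whose output its prediction points toward, which is the clustering behaviour described above.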

3.4 Model Framework

Fig. 4. Schematic diagram of BERT input coding


The experimental verification framework proposed in this paper is shown in Fig. 4. The bottom layer (module one) is the BERT-base model, which is the core module for dealing with problems one and three. Module two is the output matrix of module one, in which the first column vector is generally regarded as a sentence vector; it is the only input of the classification network in the baseline fine-tuning model. Module three performs advanced feature extraction. The figure uses CAP as an example. To avoid trial and error in fine-tuning, this article uses only one CAP layer, with a dimension of 32 and 18 capsules. Module three can also use CNN in place of CAP. The experiment sets up 1-D convolutions with three window sizes (1, 2, and 3) and 128 kernels per window size, and uses max pooling to obtain three one-dimensional vectors. These three vectors are then concatenated and fed into the fully connected network. In module four, the softmax function is used as the activation function, and cross-entropy is used as the loss function to train the model parameters.
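The CNN variant of module three (convolutions with window sizes 1, 2, and 3, max pooling, then concatenation) can be sketched in plain Python. The function names, the omission of bias and activation, and the tiny kernel bank (one kernel per window instead of 128) are assumptions made to keep the example small.

```python
def conv1d_max(X, kernel):
    """Slide one kernel (w rows of d weights) over the token vectors X
    (T rows of d values) and return the maximum activation over positions."""
    w, d = len(kernel), len(X[0])
    best = None
    for t in range(len(X) - w + 1):
        act = sum(kernel[k][c] * X[t + k][c] for k in range(w) for c in range(d))
        best = act if best is None else max(best, act)
    return best


def cnn_features(X, kernel_bank):
    """kernel_bank holds one list of kernels per window size. The pooled
    activations of all kernels are concatenated into one feature vector."""
    return [conv1d_max(X, kernel) for kernels in kernel_bank for kernel in kernels]


# Toy input: 3 tokens with 2-dimensional representations (hypothetical values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel_bank = [
    [[[1.0, 1.0]]],                            # window size 1
    [[[1.0, 0.0], [0.0, 1.0]]],                # window size 2
    [[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]],    # window size 3
]
features = cnn_features(X, kernel_bank)
print(features)  # one pooled value per kernel, concatenated
```

With 128 kernels per window size, as in the paper, the concatenated vector would have $3 \times 128 = 384$ entries before the fully connected layer.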

4 Experiment Design and Analysis

4.1 Experiment Corpus

Chinese sentiment classification corpora are not only scarce but also of uneven quality. This paper selects and organizes three public Chinese sentiment corpora (see footnote 1). Among them, MioChnCorp-2 is obtained by de-duplicating the corpus of [18] and contains 120,000 balanced two-category samples. Corpus 2 is the Chinese data set of the 2014 NLPCC Conference Sentiment Analysis and Evaluation Task (NLPCC-SCDL), with a total of 12,500 balanced samples. Corpus 3, Su-CD, is a public commercial evaluation corpus with a total of 21,000 balanced samples. Table 1 describes the three corpora in detail, where L is the average sample length in characters, V is the size of the character dictionary, Train is the number of training samples, and Test is the number of test samples.

Table 1. Detailed description of the three corpora

Corpus          L       V      Train    Test
MioChnCorp-2    84.1    6090   100000   20000
NLPCC-SCDL      100.4   4778   10000    2500
Su-CD           63.8    4304   10522    10523

Footnote 1: https://pan.baidu.com/s/1GrgqQXk5vg6aiaJaZhUAew password: lel4.

4.2 Evaluation Metrics

This paper uses four evaluation metrics, namely accuracy, precision, recall, and the F1 measure, to evaluate the sentiment classification models. For a binary classification problem, let $N = TP + FP + TN + FN$ be the total sample size, where $TP$ is the number of samples predicted positive that are actually positive, $TN$ is the number predicted negative that are actually negative, $FP$ is the number predicted positive that are actually negative, and $FN$ is the number predicted negative that are actually positive. Then

$$P = \frac{TP}{TP + FP} \tag{5}$$

$$R = \frac{TP}{TP + FN} \tag{6}$$

$$F_1 = \frac{2PR}{P + R} \tag{7}$$

$$A = \frac{TP + TN}{TP + FP + TN + FN} \tag{8}$$

In the above formulas, $F_1$ is the F1 measure, $P$ the precision, $R$ the recall, and $A$ the accuracy.
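Eqs. (5)–(8) translate directly into code. The sketch below is illustrative only: the function names and the confusion-matrix counts are hypothetical, and the last line merely checks Eq. (7) against the precision and recall reported for the positive class of Model 1 in Table 2.

```python
def f1_score(p, r):
    """Eq. (7): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def binary_metrics(tp, tn, fp, fn):
    """Return (P, R, F1, A) from the four confusion-matrix counts, Eqs. (5)-(8)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    a = (tp + tn) / (tp + fp + tn + fn)
    return p, r, f1_score(p, r), a


# Hypothetical counts for 200 balanced test samples.
p, r, f1, a = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3), round(a, 3))

# Consistency check of Eq. (7) with P = 0.938 and R = 0.911 from Table 2.
print(round(f1_score(0.938, 0.911), 3))
```

The functions assume that at least one sample is predicted positive and at least one is actually positive; otherwise the denominators in Eqs. (5) and (6) are zero.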

4.3 Experimental Setup

The experimental setup in this paper is as follows: 20% of the training set is held out as the validation set. The number of training epochs is epoch = 10, the maximum length of each text is text_size = 256, and dropout = 0.2. To reduce the risk of over-fitting, the detection parameter is set to early_stop = 100; that is, if the model does not significantly improve the validation metric after training for 1000 batches, training ends early.

4.4 Comparative Experiment

This paper tested the following five models on MioChnCorp-2. The benchmark model is the fine-tuned BERT model with all parameters trainable.

1) Fine-tuning BERT: use the pre-trained model to directly output the sentence vector and then input it into the classifier, with all parameters trainable.
2) Fixed BERT + CAP: use the capsule network to cluster the vectors of each word piece output by the pre-trained model, and then input the result into the classifier.
3) Fixed BERT + CNN: use CNN to extract the local features of the dynamic text representation, pool and concatenate them, and then input them into the classification network.
4) Fixed BERT + CAP + CNN: concatenate the clustering features and local features, and then input them into the classifier.
5) Fine-tuning BERT + CAP + CNN: concatenate the clustering features and local features, and then input them into the classifier, with all parameters trainable.

The evaluation results are shown in Table 2.


Table 2. Evaluation results of the five comparison models

            Positive                 Negative
Model ID    P       R       F1       P       R       F1       A
1           0.938   0.911   0.924    0.914   0.941   0.927    0.925
2           0.936   0.924   0.930    0.925   0.937   0.931    0.930
3           0.922   0.924   0.923    0.924   0.923   0.923    0.923
4           0.927   0.936   0.931    0.936   0.926   0.931    0.931
5           0.953   0.894   0.923    0.901   0.957   0.928    0.926

4.5 Ablation Experiment

This section compares the training speed and parameter scale of the models under different construction strategies and measures the average F1 value and accuracy of each model. The evaluation results are shown in Table 3.

Table 3. Evaluation results of models built by different strategies

Language model   Trainable layer   CAP   CNN   Max length   Each step   Trainable parameters   F1      A
BERT             All               √     √     256          1.31 s      102.7 M                0.925   0.926
                                   √     √     256          0.45 s      1.0 M                  0.931   0.931
                                   √           256          0.36 s      443.5 K                0.931   0.930
                                         √     256          0.43 s      591.0 K                0.923   0.923
                 All                           256          1.17 s      101.6 M                0.926   0.925
                                               256          0.33 s      1.5 K                  0.875   0.879
ALBERT           All               √     √     256          1.16 s      9.2 M                  0.924   0.925
                                   √     √     256          0.43 s      1.0 M                  0.930   0.930
                                   √     √     768          0.43 s      1.0 M                  0.941   0.941
                                   √           768          0.35 s      443.5 K                0.937   0.938
                                         √     768          0.41 s      591.0 K                0.933   0.922

4.6 Experiment Analysis

Comparing Model 2 and Model 3 in Table 2, we find that both CAP and CNN help to extract advanced features. CAP is 0.7% higher than CNN, so it is more effective. Table 2 also shows that Model 4 achieves the best results and is 0.6% higher than Model 1 in F1, which means that fine-tuning the model may be unnecessary. Comparing Model 4 and Model 5, we find that fine-tuning the language model even reduces the best result by 0.5%, which means that fine-tuning the language model may even harm the extraction of high-level semantic features. Comparing Model 1 and Model 5, we find that after fine-tuning the language model, the final model is more inclined to give unbalanced predictions. We then replaced the test corpus with corpus 2 and corpus 3. The evaluation on the transfer corpora is shown in Fig. 5. The results show that the model without fine-tuning generalizes better, with an increase in F1 of about 1%. Compared with the best reported classification accuracy of 0.915, obtained with Dynamic Convolutional Neural Networks [19], the pre-trained language model represented by BERT significantly improves performance on text classification tasks.

Fig. 5. Results of the corpus transfer test

From the ablation experiment we find:

1) The final result of using BERT as the pre-trained model is better than that of the corresponding ALBERT model, but training is about 10% slower.
2) The computing time and resources of models based on a pre-trained language model are mainly spent on computing and updating the weights of the dynamic representation.
3) The position embedding extension proposed in this paper can greatly improve the accuracy. We measured the accuracy of Model 2 in each sample length interval when the longest input is 256 and when it is 768. The result is shown in Fig. 6. When the sample length is greater than 256, the model based on the extended position embedding has an obvious advantage.

Fig. 6. Results of the corpus transfer test

5 Conclusion and Future Work

At present, fine-tuning a pre-trained language model is the mainstream method for NLP tasks. This article discussed modeling strategies based on pre-trained language models. To deal with the huge number of parameters in a pre-trained language model (such as BERT), we find that using a compressed pre-trained model (such as ALBERT) incurs a minimal loss of accuracy and significantly reduces the number of parameters. This paper also finds that changing the weights of the pre-trained model may be unnecessary and may even hurt further feature extraction. In addition, using CAP and CNN to extract the clustering features and local features of the dynamic text representation gives better generalization ability and avoids retraining a massive number of parameters without affecting accuracy. For long text classification, this paper proposes an extended position embedding method so that the BERT model can support text input of any length. Although the model construction strategies proposed in this article achieve significant results, they also have some apparent shortcomings, especially that they need more training rounds than traditional fine-tuning of the language model. In the future, we will study a compressed language model based on knowledge distillation to further accelerate training, and use the attention mechanism to accelerate model convergence.

References

1. Brown, P.F., Della Pietra, V.J., Desouza, P.V., Lai, J.C., Mercer, R.L.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–480 (1992)
2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, vol. 26, pp. 3111–3119 (2013)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
4. Peng, H., et al.: Hierarchical taxonomy-aware and attentional graph capsule RCNNs for large-scale multi-label text classification. IEEE Trans. Knowl. Data Eng. 33(6), 2505–2519 (2019)
5. Safaya, A., Abdullatif, M., Yuret, D.: KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media. In: Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 2054–2059 (2020)
6. Wang, Z., et al.: A novel method for intelligent fault diagnosis of bearing based on capsule neural network. Complexity (2019)
7. Liu, Y.: Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318 (2019)
8. Rodrigues Makiuchi, M., Warnita, T., Uto, K., Shinoda, K.: Multimodal fusion of BERT-CNN and gated CNN representations for depression detection. In: Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pp. 55–63, October 2019
9. Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
10. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018). https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
11. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
12. Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds.) ICANN 2011. LNCS, vol. 6791, pp. 44–51. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21735-7_6
13. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: Advances in Neural Information Processing Systems, pp. 3856–3866 (2017)
14. Zhao, W., Ye, J., Yang, M., Lei, Z., Zhang, S., Zhao, Z.: Investigating capsule networks with dynamic routing for text classification. arXiv preprint arXiv:1804.00538 (2018)
15. Frosst, N., Hinton, G.: Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784 (2017)
16. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
17. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
18. Lin, Y., Lei, H., Wu, J., Li, X.: An empirical study on sentiment classification of Chinese review using word embedding. arXiv preprint arXiv:1511.01665 (2015)
19. Jia, X., Li, N., Jin, Y.: Dynamic convolutional neural network extreme learning machine for text sentiment classification. J. Beijing Univ. Technol. (01), 28–35 (2017)

Discovering Nonlinear Dynamics Through Scientific Machine Learning

Lei Huang1(B), Daniel Vrinceanu2, Yunjiao Wang2, Nalinda Kulathunga2, and Nishath Ranasinghe1

1 Department of Computer Science, Prairie View A&M University, Prairie View, TX 77446, USA
[email protected]
2 Texas Southern University, Houston, TX 77004, USA
https://computinglab.wixsite.com/computinglab

Abstract. Scientific Machine Learning (SciML) is a new multidisciplinary methodology that combines data-driven machine learning models with principle-based computational models to improve the simulation of scientific phenomena and uncover new scientific rules from existing measurements. This article reports our experience of using the SciML method to discover nonlinear dynamics that may be hard to model or unknown in real-world scenarios. The SciML method solves traditional principle-based differential equations by integrating a neural network that accurately models the nonlinear dynamics while respecting scientific constraints and principles. The paper discusses the latest SciML models and applies them to oscillator simulations and experiments. Besides a better capacity to simulate and match the observations, the results also demonstrate a successful discovery of the hidden physics in the pendulum dynamics using SciML.

Keywords: Scientific machine learning · Scientific simulation · Computational science · Nonlinear dynamics

1 Introduction

Scientific Machine Learning (SciML) [1] has recently emerged as a new method for solving scientific computing problems with machine learning models. The method leverages the success of traditional scientific computational models and the advances in data-driven machine learning models to improve the efficiency and accuracy of scientific simulation and inversion. Moreover, it facilitates scientific discovery by modeling both well-known scientific rules and unknown patterns based on observed data.

Traditional scientific computational models are mostly developed to simulate phenomena in physics, chemistry, biology, and other sciences by using various numerical methods, such as finite difference or finite element methods, to solve a variety of differential equations. These methods can achieve highly accurate simulation results; however, they are also notoriously expensive in computational resources. This is why scientific computing is typically conducted on supercomputers using complex programming models. Moreover, scientific computing depends heavily on our understanding of the theoretical principles, which do not fully represent the complexity and nonlinear dynamics of many real-world phenomena. Often the parameters and other constraints are not well known, and simulation scientists have nothing better to rely on than educated guesses.

In theory, SciML combines deterministic scientific principles with the universal approximation capability of machine learning to provide a more efficient yet reliable and explainable model-based, data-driven solution. SciML provides a sound scientific and theoretical foundation for unveiling new governing rules from a collection of data and models. For example, SciML may integrate a deep learning model into a partial differential equation (PDE) to fit the observed data, modeling the well-known principles with the PDE and the unknown portion, such as noise and friction, with the deep learning model. The latest theoretical and practical advances in machine learning, especially deep learning, have dramatically increased the capacity and accuracy of the universal approximation of nonlinear functions. Despite this progress, learning a complex system with nonlinear dynamics or chaos using deep learning alone is still neither reliable nor explainable. Moreover, training a deep learning model to cover all possible features requires huge, mostly unrealistic, data sets. Even if we can successfully train a deep-learning-based surrogate model, the model's extrapolation is questionable. It would be much more reliable and explainable to embed the physical principles that determine the nonlinear dynamics and leave only the unknown functions to machine learning.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 261–279, 2022. https://doi.org/10.1007/978-3-030-82193-7_17
SciML brings together the strengths of computational science and data science to improve the accuracy, performance, and interpretability of scientific simulation. Moreover, it may unveil scientific rules hidden inside the nonlinear dynamics learned by the machine learning models. In this paper, we present four recent SciML models and apply them to a couple of physics experiments to report our experience with the benefits and limitations of the SciML method. Section 2 describes the state of the art of SciML models; Sect. 3 presents the physics experiments, simulation, and data collection; Sect. 4 discusses the results of the SciML models; and Sect. 5 concludes with the findings of the paper.

2 Scientific Machine Learning Models

Scientific machine learning is developed to facilitate scientific computation either by building a surrogate model that replaces the numerical model or by combining it with a data-driven model to achieve better accuracy and performance. In this paper, we apply Physics-Informed Neural Networks (PINNs) [29], the Universal Differential Equation (UDE) method [27], Hamiltonian Neural Networks (HNNs) [11], and the Neural Ordinary Differential Equation (NODE) [7] to learn the nonlinear dynamics in several physics experiments.

Discovering Nonlinear Dynamics

2.1 Physics-Informed Neural Networks

Physics-Informed Neural Networks (PINNs) are a SciML approach that solves differential equations by modeling the latent solution u(t, x) directly with a deep learning model and by taking advantage of the automatic differentiation functionality [2] in machine learning (ML) software. The solution u(t, x) is replaced by a neural network or other machine learning model whose derivatives must satisfy the governing differential equation. For example, the harmonic pendulum is described by Eq. (1):

$$\frac{d^2\theta}{dt^2} + \frac{g}{L}\sin(\theta) = 0 \tag{1}$$

where g is the gravitational acceleration, L is the length of the pendulum and θ is the angle with respect to the vertical in radians. PINNs rewrite the equation by substituting its solution θ(t) with a neural network $N_p(t)$, where N is the neural network and p are its trainable parameters. The new equation is Eq. (2):

$$N_p''(t) + \frac{g}{L}\sin(\theta) = 0, \qquad \theta = N_p(t) \tag{2}$$

By creating a loss function that minimizes the residual of Eq. (2), PINNs use the automatic differentiation capability of machine learning software to calculate the second-order derivative of $N_p(t)$ and to optimize the loss function. As a result, $N_p(t)$ is an approximation of the solution of Eq. (1).

Furthermore, PINNs can be used effectively to solve the forward problem as well as the inverse problem with minimal modifications to the computational code [18]. Additionally, a Petrov-Galerkin [10] version of PINNs has been employed to solve the variational form of PDEs in order to reduce the training cost [14]. Likewise, modified versions of PINNs have been employed to solve fractional differential equations [23] and stochastic differential equations [34]. To address the lack of uncertainty quantification in PINNs, Zhang et al. [35] put forward the idea of using multiple deep neural networks to quantify the parametric uncertainty, and dropout to model the uncertainty stemming from the approximations made by the neural networks.

In an effort to develop a theoretical basis for PINNs, Shin et al. [31] studied the convergence of the sequence of minimizers generated by PINNs, corresponding to a sequence of neural networks, to the solution of the given PDE. They found that the sequence of minimizers converges strongly to the PDE solution in L2 space and that each minimizer satisfies both the initial and the boundary conditions.
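To make the residual loss of Eq. (2) concrete, the following is a minimal sketch in plain Python rather than the PINN software used in the paper. It assumes a one-hidden-layer tanh network and replaces automatic differentiation with hand-derived first and second derivatives; the values of g and L, the network size, the parameter values and the names `net` and `pinn_loss` are illustrative assumptions.

```python
import math

G, L = 9.81, 1.0  # assumed gravitational acceleration (m/s^2) and pendulum length (m)

def net(t, p):
    """One-hidden-layer tanh network N_p(t) with its first and second time derivatives."""
    n = n1 = n2 = 0.0
    for w1, b1, w2 in p["hidden"]:
        h = math.tanh(w1 * t + b1)
        s = 1.0 - h * h                       # derivative of tanh
        n += w2 * h
        n1 += w2 * w1 * s
        n2 += w2 * w1 * w1 * (-2.0 * h * s)   # second derivative of tanh
    return n + p["bias"], n1, n2

def pinn_loss(p, times, theta0, omega0):
    """Mean squared residual of Eq. (2) plus penalties for the two initial conditions."""
    residual = 0.0
    for t in times:
        n, _, n2 = net(t, p)
        residual += (n2 + (G / L) * math.sin(n)) ** 2
    n0, n0_dot, _ = net(0.0, p)
    return residual / len(times) + (n0 - theta0) ** 2 + (n0_dot - omega0) ** 2

# Hypothetical, untrained parameters: (input weight, bias, output weight) per hidden unit.
params = {"hidden": [(1.0, 0.0, 0.5), (-0.7, 0.3, 0.2), (2.0, -1.0, -0.1)], "bias": 0.1}
grid = [0.1 * k for k in range(51)]           # collocation points on [0, 5]
loss = pinn_loss(params, grid, theta0=math.pi / 2, omega0=0.0)
```

Training would then adjust `params` to drive `loss` toward zero; an ML framework would obtain the derivatives of the network automatically instead of by hand.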

2.2 Universal Differential Equations

The Universal Differential Equations (UDE) method has some similarities with the PINN method: both rely on scientific principles represented as differential equations to guide the computation and impose constraints. However, UDE is more flexible in modeling unknown functions and combining them with existing scientific knowledge. UDE relies on numerical differential equation solvers to solve the problem while learning the unknown functions during the calculation. The pendulum equation using UDE is shown in Eq. (3), which introduces a neural network $N_p$ that represents the unknown function in the experiment, such as the friction and/or the external force. As a result, the UDE solution fits the observed experimental results better, as described in Sect. 3.

$$\frac{d^2\theta}{dt^2} + \frac{g}{L}\sin(\theta) + N_p(u) = 0 \tag{3}$$

The method designs a machine learning model that represents the unknown physical functions while the ODE is computed numerically by the ODE solver. The benefit of the UDE method is that the neural network does not have to learn the full dynamics, which may be extremely hard or even impossible because of the nonlinear dynamics in chaotic systems. If we simply use a data-driven, statistics-based machine learning model, the approximation error of the neural network can cause the long-term prediction to diverge substantially. It is hard to believe that a neural network's universal approximation can be accurate enough for the prediction of nonlinear dynamics. The physical principles of the dynamics need to be honored during the calculation. The UDE method applies the powerful universal approximation capacity of machine learning while respecting physical constraints such as symmetry, invariance, and conservation.

2.3 Hamiltonian Neural Networks

In classical mechanics, Hamiltonian equations are widely adopted to model the continuous time evolution of dynamic systems with physically conserved quantities such as energy, and they can be used effectively to predict the phase space of dynamic systems from the current state of the generalized position and momentum. Additionally, Hamiltonian mechanics is smooth and time reversible, and it provides integral paths that conserve certain physical quantities such as energy. Greydanus et al. [11] introduced Hamiltonian neural networks (HNNs) by incorporating the Hamiltonian equations into the loss function of the neural network in order to learn the Hamiltonian of simple systems from noisy phase space data. Additionally, Toth et al. [33] used a generative model to infer the Hamiltonian of dynamic systems from high-dimensional data (pixels). Matteakis et al. [20], deviating from other studies that use the HNN method, embedded physical constraints into the structure of the neural network using the Hamiltonian equations. HNNs may also help reduce the expensive computational cost of solving scientific problems. The HNN method creates a neural network $N_p$ that meets the following requirements:

$$\frac{dx_1}{dt} = \frac{\partial H}{\partial x_2} = \frac{\partial N_p}{\partial x_2}, \qquad \frac{dx_2}{dt} = -\frac{\partial H}{\partial x_1} = -\frac{\partial N_p}{\partial x_1} \tag{4}$$

where $(x_1, x_2)$ are the two inputs of the HNN network and denote the position and momentum, so that $N_p$ plays the role of the Hamiltonian H.

2.4 Neural Ordinary Differential Equations (Neural ODE)

Since it was first observed by Weinan E [8], the relation between ResNet [13] and dynamical systems has been widely explored and used to increase the capability and stability of deep networks [4–6,17,19,32]. Connecting deep networks with ODEs was largely inspired by the success of ResNet, whose network architecture is strikingly similar to the well-known Euler method for differential equations. With this observation, a natural idea is to generalize existing numerical methods to deep networks [19,36]. Neural ODE, proposed by Chen et al. [6], went one step further: it replaces deep networks such as ResNet with existing efficient ODE solvers. One key difference between solving an ODE and training a deep network is that their goals are different. The goal of training a deep network is to find functions that fit the data, while the goal of a numerical ODE method is to approximate the solution of the ODE. The idea of the dynamical systems approach to deep networks is to tune the vector field so that its flow map can reproduce the nonlinear functions needed to fit the data [8]. More specifically, consider

$$\frac{dz}{dt} = f(z, t, p), \qquad z(0) = x \tag{5}$$

Let z(t, x, p) be the solution to the initial value problem (5), let T > 0 be a fixed time and let p be a set of parameters. The flow map x → z(T, x, p) defines a function from input to output, which is generally nonlinear [8]. Here f could be a neural network that models the vector field. Neural ODE uses existing solvers to solve the ODE (5) for a given set of parameter and input values. The step after solving the ODE is to adjust the values of p and to repeat the process until values of p are found for which the flow map fits the data best. As in regular optimization, this process requires computing the gradient of a designed loss function with respect to p. A notable benefit of the Neural ODE approach is that the computation of the gradient is independent of the solver and can be carried out by the classical adjoint sensitivity method [26]. Another benefit is that Neural ODE can naturally be used for time-dependent data such as the pendulum data discussed in this paper. A computational disadvantage is that the ODE solver often requires a larger number of function evaluations than a standard deep network, and this number tends to grow during training [16].
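The flow-map view of Eq. (5) can be illustrated with a small sketch. It assumes a scalar linear vector field as a stand-in for the neural network f, a forward Euler solver (whose update has the same form as a residual block), and a finite-difference gradient in place of the adjoint sensitivity method; the names `flow_map` and `fit` and all numeric settings are illustrative only.

```python
import math

def f(z, t, p):
    """Vector field of Eq. (5); a linear decay is an assumed stand-in for a neural network."""
    return -p * z

def flow_map(x, p, T=1.0, steps=100):
    """Approximate z(T, x, p) with forward Euler steps z <- z + h * f(z, t, p)."""
    h = T / steps
    z, t = x, 0.0
    for _ in range(steps):
        z = z + h * f(z, t, p)   # same form as a ResNet block: output = input + F(input)
        t += h
    return z

def fit(x, target, p=0.0, lr=0.5, iters=2000, eps=1e-6):
    """Adjust p so that the flow map sends x to target (finite-difference gradient descent)."""
    def loss(q):
        return (flow_map(x, q) - target) ** 2
    for _ in range(iters):
        grad = (loss(p + eps) - loss(p - eps)) / (2.0 * eps)
        p -= lr * grad
    return p

# Learn the parameter whose flow map reproduces z(1) = exp(-2) from z(0) = 1.
p_fit = fit(x=1.0, target=math.exp(-2.0))
```

The fitted parameter lands close to 2, the exact decay rate, with a small offset that reflects the Euler discretization rather than the optimization.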

3 Physical Experiments

3.1 Quadruple Spring Mass System

A quadruple-springs-mass system that is allowed to move on a 2-D frictionless surface (Fig. 1) can exhibit simple harmonic motion as well as non-linear dynamic motion. The motion depends on the initial conditions of the system and on the physical properties of the springs (i.e. the spring constants and the unstretched lengths of the springs). For simplicity, we only consider massless springs here. The time-independent Hamiltonian of the quadruple-springs-mass system can be written in terms of the generalized position (x, y) and momentum $(p_x, p_y)$ as

$$H = \frac{p_x^2 + p_y^2}{2m} + \frac{1}{2}\sum_{i=1}^{4} k_i (l_i - a_i)^2 \tag{6}$$

where $a_i$ are the unstretched lengths of the springs, $l_i$ are the stretched lengths of the springs, $k_i$ are the spring constants, and m is the mass of the particle in the middle.

Fig. 1. Quadruple-springs-mass system

The stretched lengths are

$$l_1 = \sqrt{(a_1 + x)^2 + y^2}, \quad l_2 = \sqrt{(a_2 - x)^2 + y^2}, \quad l_3 = \sqrt{(a_3 - y)^2 + x^2}, \quad l_4 = \sqrt{(a_4 + y)^2 + x^2}.$$

We simulate one instance of simple harmonic motion and one instance of non-linear dynamic motion of the quadruple-springs-mass system by solving the Hamiltonian equations (Eq. 4) with the Hamiltonian given in Eq. 6. The initial conditions for the simple harmonic motion and the non-linear dynamic motion are given in Table 1. The data are generated over a time span of 5π for both experiments.

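As a rough illustration of how such a simulation can be set up, the sketch below evaluates the Hamiltonian of Eq. (6) and integrates Hamilton's equations for the "Nonlinear dynamics" row of Table 1. It is not the authors' code: the partial derivatives of H are approximated by central differences, the integrator is a classical fourth-order Runge-Kutta scheme, and the step count is an assumed value.

```python
import math

def hamiltonian(state, a, k, m):
    """Eq. (6): kinetic energy plus the elastic energy of the four springs."""
    x, y, px, py = state
    lengths = (math.hypot(a[0] + x, y), math.hypot(a[1] - x, y),
               math.hypot(a[2] - y, x), math.hypot(a[3] + y, x))
    elastic = 0.5 * sum(ki * (li - ai) ** 2 for ki, li, ai in zip(k, lengths, a))
    return (px ** 2 + py ** 2) / (2.0 * m) + elastic

def hamilton_rhs(state, a, k, m, h=1e-6):
    """Hamilton's equations with central-difference partial derivatives of H."""
    def dH(i):
        up, dn = list(state), list(state)
        up[i] += h
        dn[i] -= h
        return (hamiltonian(up, a, k, m) - hamiltonian(dn, a, k, m)) / (2.0 * h)
    # dx/dt = dH/dpx, dy/dt = dH/dpy, dpx/dt = -dH/dx, dpy/dt = -dH/dy
    return [dH(2), dH(3), -dH(0), -dH(1)]

def rk4_step(state, dt, a, k, m):
    """One classical Runge-Kutta step for the four-dimensional state."""
    def shifted(base, slope, scale):
        return [b + scale * s for b, s in zip(base, slope)]
    k1 = hamilton_rhs(state, a, k, m)
    k2 = hamilton_rhs(shifted(state, k1, dt / 2), a, k, m)
    k3 = hamilton_rhs(shifted(state, k2, dt / 2), a, k, m)
    k4 = hamilton_rhs(shifted(state, k3, dt), a, k, m)
    return [s + dt / 6.0 * (p + 2 * q + 2 * r + w)
            for s, p, q, r, w in zip(state, k1, k2, k3, k4)]

# "Nonlinear dynamics" row of Table 1
a, k, m = (1.0, 2.0, 3.0, 4.0), (4.0, 3.0, 2.0, 1.0), 1.0
state = [-0.2, 0.1, 0.1, -0.2]           # x0, y0, px0, py0
n_steps = 1500                            # assumed resolution over the 5*pi time span
dt = 5.0 * math.pi / n_steps
energy_start = hamiltonian(state, a, k, m)
trajectory = [state]
for _ in range(n_steps):
    state = rk4_step(state, dt, a, k, m)
    trajectory.append(state)
energy_end = hamiltonian(state, a, k, m)
```

Because the Hamiltonian is time independent, comparing `energy_start` with `energy_end` gives a quick check that the integration respects energy conservation to within the accuracy of the scheme.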

Table 1. Initial conditions for the simple harmonic motion (SHM) and non-linear dynamic motion

| Motion type | Unstretched length | Spring const | Init. pos | Init. moment | Mass |
|---|---|---|---|---|---|
| Parameters | a1, a2, a3, a4 | k1, k2, k3, k4 | x0, y0 | Px0, Py0 | m |
| SHM | 1, 1, 1, 1 | 1, 1, 1, 1 | −0.2, −0.2 | 0.1, 0.1 | 1.0 |
| Nonlinear dynamics | 1, 2, 3, 4 | 4, 3, 2, 1 | −0.2, 0.1 | 0.1, −0.2 | 1.0 |

3.2 Pendulum

A pendulum is a classical physical phenomenon whose dynamics have been studied for a long time. Figure 2 shows a simple gravity pendulum with angle θ, the length L of the slender rod, the mass m of the pendulum bob, and its angular velocity ω = dθ/dt.

Fig. 2. Pendulum motion

The simple gravity pendulum [22] performs a harmonic motion without any friction or external forces, which is governed by the simple second-order differential Eq. (1). For Eq. (1), we may use numerical methods to solve the ODE by specifying a small enough time step. The results are a sequence of angles θ and angular velocities dθ/dt for each time step during the pendulum simulation time span. The differential equation (1) is changed by adding an air resistance or friction component to simulate a damped harmonic pendulum, linear or nonlinear, depending on the scenario. Equation (7) shows linear air resistance/friction integrated into the motion to slow the pendulum down gradually.

$$\frac{d^2\theta}{dt^2} + \mu\frac{d\theta}{dt} + \frac{g}{L}\sin(\theta) = 0 \tag{7}$$

where $\mu\frac{d\theta}{dt}$ is the linear air resistance/friction. The air resistance/friction may also be nonlinear and represented as a polynomial function such as $\mu_2\left(\frac{d\theta}{dt}\right)^2 + \mu_1\frac{d\theta}{dt}$; Eq. (7) then becomes Eq. (8).

$$\frac{d^2\theta}{dt^2} + \mu_2\left(\frac{d\theta}{dt}\right)^2 + \mu_1\frac{d\theta}{dt} + \frac{g}{L}\sin(\theta) = 0 \tag{8}$$

Besides gravity and resistance, an external force may interfere with the pendulum motion, which creates a non-harmonic oscillator. The external force f(θ) can come from a motor or from wind and varies with the pendulum's angle in radians. The differential equation (8) is expanded to the non-homogeneous differential equation (9), which includes the external force of Eq. (10).

$$\frac{d^2\theta}{dt^2} + \mu_2\left(\frac{d\theta}{dt}\right)^2 + \mu_1\frac{d\theta}{dt} + \frac{g}{L}\sin(\theta) = f(\theta) \tag{9}$$

and

$$f(\theta) = \frac{6}{mL^2}\cos(\theta) \tag{10}$$

where m is the mass of the pendulum bob and f is the external force driven by wind or a motor. In Sect. 3.3, we assume that the external force f(θ) is independent of time. However, it could be time-dependent, f(t, θ), as in the example detailed in Sect. 3.4, or even stochastic.

3.3 Simulated Pendulum

It is challenging or impossible to solve the nonlinear dynamics equation analytically, since the problem is hard to simplify or to divide and conquer. Fortunately, we can solve it approximately with numerical methods such as the finite difference method (FDM), the finite element method (FEM), or the finite volume method (FVM). The differential equations described in Sect. 3.2 can be solved using the numerical ODE solvers implemented in SciPy or Julia. These solvers offer many numerical algorithms for simulating the temporal behavior of the pendulum dynamics, that is, the pendulum's angle (θ) and its angular velocity (ω) over a time frame. This work uses the Julia [3] programming environment and its DifferentialEquations.jl package [28] to solve these ODEs for the pendulum simulations. Julia provides a high-level programming interface similar to Matlab/Python with scalable performance on both CPUs and GPUs. It includes a rich set of computational packages for differential equations (ODEs/PDEs), linear algebra, optimization, automatic differentiation and dynamical systems, as well as data science packages such as its machine learning package Flux and Boltzmann Machines. The Julia code that defines the pendulum ODE and its initial values is listed in Fig. 3. The ODE solver uses the Tsit5 algorithm, the Tsitouras 5/4 Runge-Kutta method with a free fourth-order interpolant, which is efficient and accurate in solving the pendulum ODE. The code defines a non-homogeneous differential equation with an external force and polynomial friction. The initial value of θ is π/2 and the initial velocity ω is 0. The time span is set from 0 to 10 s, with 0.1 s as the time step. The calculation generates 101 samples of (θ, ω).

Fig. 3. Pendulum motion ODE code
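For readers without the Julia listing at hand, the setup can be approximated in plain Python as follows. This is only a sketch under stated assumptions: a classical fourth-order Runge-Kutta scheme with internal substeps stands in for Tsit5, and the friction coefficients `mu1` and `mu2`, the mass `m` and the length `L` are hypothetical values because they are not given in the text. The initial state, time span and sampling step follow the description above.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def pendulum_rhs(theta, omega, m=1.0, L=1.0, mu1=0.1, mu2=0.01):
    """Eq. (9) as a first-order system; m, L, mu1 and mu2 are assumed values."""
    force = 6.0 * math.cos(theta) / (m * L ** 2)     # external force, Eq. (10)
    friction = mu2 * omega ** 2 + mu1 * omega        # polynomial friction, Eq. (8)
    return omega, force - friction - (G / L) * math.sin(theta)

def rk4_step(theta, omega, dt):
    """One classical Runge-Kutta step (a stand-in for the Tsit5 solver)."""
    k1 = pendulum_rhs(theta, omega)
    k2 = pendulum_rhs(theta + 0.5 * dt * k1[0], omega + 0.5 * dt * k1[1])
    k3 = pendulum_rhs(theta + 0.5 * dt * k2[0], omega + 0.5 * dt * k2[1])
    k4 = pendulum_rhs(theta + dt * k3[0], omega + dt * k3[1])
    theta += dt / 6.0 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    omega += dt / 6.0 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return theta, omega

def simulate(theta0=math.pi / 2, omega0=0.0, t_end=10.0, dt=0.1, substeps=10):
    """Return one (theta, omega) sample per output step, including the initial state."""
    samples = [(theta0, omega0)]
    theta, omega = theta0, omega0
    for _ in range(round(t_end / dt)):
        for _ in range(substeps):
            theta, omega = rk4_step(theta, omega, dt / substeps)
        samples.append((theta, omega))
    return samples

samples = simulate()   # 101 samples on [0, 10] s with a 0.1 s spacing
```

The resulting list can be plotted as θ against ω to obtain a phase-space picture, or against time to inspect the temporal behavior.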

Figure 4(a) shows the phase space of the pendulum dynamics in terms of θ and ω for 10 s of motion with nonlinear friction and external torque. Figure 4(b) illustrates how the pendulum angle θ and the angular velocity ω change over time during the 10 s of simulated pendulum motion. Because of the external torque and friction, the motion is non-harmonic and gradually leans toward the right-hand side.

3.4 Simulation of Wind Forced Pendulum

Fig. 4. Pendulum motion simulation shown in the phase space. (a) Pendulum phase space in 10 seconds. (b) Pendulum motion simulation in 10 seconds.

In this simulation, a quick air flow is used to start the oscillations of the pendulum. The goal of the experiment is to infer the time profile of the air flow pulse from the simulated measurements of the angular position of the pendulum (Fig. 5).

Fig. 5. A pendulum forced to oscillate by a quick blow of wind

We assume that the drag force that acts on the pendulum is proportional to the velocity of the air relative to the pendulum, according to Stokes' law:

$$\vec{F}_d = b(\vec{u} - \vec{v})$$

where the drag coefficient is b = 6πηr for a spherical object of radius r, and η is the viscosity. The air flow is assumed to be uniform and oriented along the x-axis, $\vec{u} = u(t)\hat{x}$. Under these assumptions, the differential equation for the pendulum is

$$\frac{d^2\theta}{dt^2} = -\frac{g}{L}\sin\theta - \frac{b}{m}\frac{d\theta}{dt} + \frac{b}{mL}u(t)\cos\theta \tag{11}$$

with initial conditions θ(0) = dθ/dt(0) = 0. The solution θ(t) depends on the airflow profile u(t) and the drag coefficient b as external parameters, while gravity g and the length L of the pendulum are assumed to be known. Given a set of measurements of time and angular position $(\{t_k, \theta_k\}, k = 1, 2, \ldots, N)$, the unknown airflow profile, as well as the drag coefficient, can be inferred by minimizing the loss function

$$L(u(t), b) = \frac{1}{2}\sum_{k=1}^{N} (\theta(t_k) - \theta_k)^2$$


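This can be achieved by using a Conjugate Gradient Descent method, where better candidates for b and u(t) are calculated at each iteration as

$$b \to b' = b - \eta\,\frac{\partial L}{\partial b}, \qquad u(t_k) \to u'(t_k) = u(t_k) - \eta\,\frac{\partial L}{\partial u(t_k)}$$

where η is a small learning rate chosen appropriately (not to be confused with the viscosity η in the drag coefficient) and the airflow profile is discretized at the same temporal points $t_k$, for convenience. The partial derivatives of L with respect to b and to the airflow value at a grid point $t_j$ are obtained in turn as

$$\frac{\partial L}{\partial b} = \sum_{k=1}^{N} (\theta(t_k) - \theta_k)\,\frac{\partial \theta}{\partial b}(t_k)$$

and

$$\frac{\partial L}{\partial u(t_j)} = \sum_{k=1}^{N} (\theta(t_k) - \theta_k)\,\frac{\partial \theta}{\partial u(t_j)}(t_k)$$

where the index j labels the grid point of the airflow profile and is kept distinct from the summation index k.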
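One way to obtain the sensitivity ∂θ/∂b that appears in these derivatives is forward-mode differentiation with dual numbers. The following is a minimal sketch in plain Python, not the ForwardDiff-based Julia implementation: it assumes a hand-written `Dual` class supporting only the operations needed here, integrates Eq. (11) with Heun's method, and uses the pendulum and pulse values quoted in this section (m = 1.0 kg, L = 1.0 m, b = 0.25 kg/s, Gaussian pulse of amplitude 4.0 m/s centered at 2.8 s with a standard deviation of 0.2 s). The end time and step size are assumed values.

```python
import math

class Dual:
    """Dual number re + du*eps with eps**2 == 0; du carries the derivative."""
    def __init__(self, re, du=0.0):
        self.re, self.du = re, du

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.re + o.re, self.du + o.du)
    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.re - o.re, self.du - o.du)

    def __mul__(self, other):
        o = Dual.lift(other)
        return Dual(self.re * o.re, self.re * o.du + self.du * o.re)
    __rmul__ = __mul__

def dsin(x):
    return Dual(math.sin(x.re), math.cos(x.re) * x.du)

def dcos(x):
    return Dual(math.cos(x.re), -math.sin(x.re) * x.du)

def airflow(t, amplitude=4.0, center=2.8, width=0.2):
    """Gaussian air pulse u(t) with the values quoted in the text."""
    return amplitude * math.exp(-0.5 * ((t - center) / width) ** 2)

def solve_theta(b, t_end=5.0, dt=0.002, m=1.0, L=1.0, g=9.81):
    """Integrate Eq. (11) with Heun's method and return theta(t_end) as a Dual."""
    b = Dual.lift(b)
    theta, omega = Dual(0.0), Dual(0.0)

    def rhs(t, th, om):
        accel = ((-g / L) * dsin(th) - (b * (1.0 / m)) * om
                 + (b * (airflow(t) / (m * L))) * dcos(th))
        return om, accel

    for k in range(round(t_end / dt)):
        t = k * dt
        k1_th, k1_om = rhs(t, theta, omega)                               # Euler predictor slope
        k2_th, k2_om = rhs(t + dt, theta + dt * k1_th, omega + dt * k1_om)
        theta = theta + (0.5 * dt) * (k1_th + k2_th)                      # Heun corrector
        omega = omega + (0.5 * dt) * (k1_om + k2_om)
    return theta

# Seed the dual part of b with 1 so that the result carries d(theta)/d(b).
result = solve_theta(Dual(0.25, 1.0))
theta_end, dtheta_db = result.re, result.du
```

The same pattern gives the sensitivity with respect to one airflow value if that value, rather than b, is seeded with a dual part of 1.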

The sensitivities of the ODE solution with respect to b and to each airflow value can be calculated in several ways [25]. For our example, we used the forward differentiation package ForwardDiff [30], which employs dual numbers [12] during the iterative calculation of the solution of Eq. (11). Each time step of the iterative calculation is computed according to Heun's modification of Euler's method [9]. At the start of the integration, all parameters are set as dual numbers with zero dual part, except the parameter for which the sensitivity is required, whose dual part is set to 1. The solution obtained at the grid points $t_k$ will in turn consist of dual numbers that represent the solution, as the main part, and the sensitivity of the solution with respect to the chosen parameter, as the dual part. The advantage of this approach is that all calculations are done in place with modest memory requirements.

Figure 6 shows the results of our experiment. We simulated a 1.0 kg pendulum of length 1.0 m that has a drag coefficient of b = 0.25 kg/s. The pendulum is forced to oscillate by a short Gaussian air pulse of amplitude 4.0 m/s, centered around t = 2.8 s, with a standard deviation of 0.2 s. Starting from random guesses for b and u(t), the procedure converges toward the anticipated values. At every time step, only positive values of u(t) are allowed. The convergence is slow, but it can be accelerated by using more refined strategies, such as ADAM or RMSProp [15].

3.5 Physical Experimental Pendulum

Fig. 6. Left panel: comparison between the exact and calculated airflow profiles, and angle vs. time (inset). Right panel: convergence of the loss function.

Besides the simulation, we also recorded a one-minute video of the pendulum experiment shown in Fig. 7(a). In the experiment, we measured the mass of the pendulum bob and the length of the pendulum. The angle θ and the angular velocity ω were calculated with image processing algorithms. The friction is unknown in the experiment. To process the experiment video, we first extract the frames from the video, which is recorded at a frame rate of 60 frames per second. We then apply the blob detection algorithm from the scikit-image image processing package, which detects the coordinates of the pendulum bob and center. The Determinant of Hessian (DoH) algorithm for blob detection gives us the best performance and the fewest false positives. Figure 7(b) shows the detected coordinates of the pendulum center and bob. These detected coordinates are used to compute the angle θ and the angular velocity ω based on the geometry and the prior state. The result is a collection of 3600 pendulum states (angle and angular velocity) over one minute, with 1/60 s for each time step.

Fig. 7. Pendulum experiment recorded in video. (a) Experimental pendulum video. (b) Labeled experimental pendulum video.
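The step from detected coordinates to pendulum states can be sketched as follows. The sketch assumes that the pivot and bob positions are already available as (row, column) image coordinates, which is the order in which scikit-image blob detectors report positions, and it uses a simple backward difference on the prior state for ω. The helper names and the synthetic track are hypothetical and not taken from the paper.

```python
import math

def pendulum_angle(pivot, bob):
    """Angle from the downward vertical in radians, from (row, col) image coordinates."""
    d_row = bob[0] - pivot[0]     # image rows grow downward
    d_col = bob[1] - pivot[1]
    return math.atan2(d_col, d_row)

def states_from_track(pivot, bob_track, fps=60.0):
    """Return one (theta, omega) pair per frame; omega is a backward difference."""
    dt = 1.0 / fps
    states, previous = [], None
    for bob in bob_track:
        theta = pendulum_angle(pivot, bob)
        omega = 0.0 if previous is None else (theta - previous) / dt
        states.append((theta, omega))
        previous = theta
    return states

# Synthetic check: a bob 100 pixels from the pivot at three known angles.
track = [(100.0 * math.cos(th), 100.0 * math.sin(th)) for th in (0.3, 0.2, 0.1)]
states = states_from_track((0.0, 0.0), track)
```

With real detections, noise in the coordinates is amplified by the difference quotient, so some smoothing of θ before differentiation may be advisable.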

4 Learning the Nonlinear Dynamics with Scientific Machine Learning

In Sect. 3, the paper shows simulation results for the nonlinear pendulum dynamics under the assumption that the functions for the friction and the external torque are known. Can SciML augment the scientific simulation by using the collected data set? In a real-world experiment, we may not know some of these functions, but we can collect the motion data (θ and ω) from experiments (Sect. 3.5). The question is whether a SciML model can learn the unknown nonlinear dynamics hidden in these systems.

4.1 What Do These SciML Models Learn?

In the pendulum study, we know that the motion of the simple harmonic pendulum is governed by Eq. (1). The quantities that determine the motion include the pendulum angle, the angular velocity, the length, the mass, and the constant gravity. All of them can be measured. In reality, what we do not know at the beginning is the friction and the external force in Eqs. (9) and (10). The SciML model uses a neural network to model only the friction and external force functions and learns the two functions from the recorded data, while the ODE solver calculates the harmonic motion.

$$\frac{d^2\theta}{dt^2} + \frac{g}{L}\sin(\theta) = N_p\left(\frac{d\theta}{dt}, m, L\right) \tag{12}$$

Equations (9) and (10) are revised as the new Eq. (12), in which $N_p(\frac{d\theta}{dt}, m, L)$ is a neural network with four inputs and one output that learns the friction and external torque. $N_p$ is a four-layer fully connected neural network with 4-64-64-1 neurons in its layers, and it uses the hyperbolic tangent tanh as its activation function. The software package used in the paper is one of the SciML packages, named DiffEqFlux, implemented in the Julia software stack. The neural network is trained on a small data set from the pendulum simulation with a time span of [0, 10] and a time step of 0.1, which gives us 101 samples. The training starts with the Adam optimizer for the first 100 iterations and then switches to the BFGS optimizer after the 100th iteration. Figure 8 shows the loss values for these two optimizers. In this experiment, the BFGS optimizer learns the function faster than the Adam optimizer.

Fig. 8. Loss values during the UDE training
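The structure of such a UDE fit can be sketched in plain Python, with heavy simplifications that should not be mistaken for the DiffEqFlux setup described above. The sketch assumes a 13-parameter tanh network with the angular velocity as its only input, synthetic data generated with linear friction (coefficient 0.3), Heun integration, finite-difference gradients, and a gradient descent with step halving in place of Adam and BFGS. All names and numeric settings are hypothetical.

```python
import math

G_OVER_L = 9.81  # g/L for an assumed 1 m pendulum

def net(omega, p):
    """Tiny tanh network N_p(omega): 1 input, 4 hidden units, 1 output (13 parameters)."""
    out = p[12]
    for i in range(4):
        out += p[8 + i] * math.tanh(p[i] * omega + p[4 + i])
    return out

def simulate(unknown, theta=1.0, omega=0.0, dt=0.05, steps=40):
    """Heun integration of theta'' = -(g/L) sin(theta) + unknown(omega); returns the angles."""
    def rhs(th, om):
        return om, -G_OVER_L * math.sin(th) + unknown(om)
    angles = [theta]
    for _ in range(steps):
        a1, b1 = rhs(theta, omega)
        a2, b2 = rhs(theta + dt * a1, omega + dt * b1)
        theta += 0.5 * dt * (a1 + a2)
        omega += 0.5 * dt * (b1 + b2)
        angles.append(theta)
    return angles

# Synthetic "measurements": the hidden term is linear friction with an assumed coefficient.
data = simulate(lambda om: -0.3 * om)

def loss(p):
    sim = simulate(lambda om: net(om, p))
    return sum((s - d) ** 2 for s, d in zip(sim, data)) / len(data)

def gradient(p, eps=1e-5):
    """Central finite differences, used here instead of automatic differentiation."""
    grad = []
    for i in range(len(p)):
        up, dn = p[:], p[:]
        up[i] += eps
        dn[i] -= eps
        grad.append((loss(up) - loss(dn)) / (2.0 * eps))
    return grad

def train(p, iters=50, lr=0.1):
    """Gradient descent that accepts a step only if it lowers the loss, else halves lr."""
    best = loss(p)
    history = [best]
    for _ in range(iters):
        g = gradient(p)
        trial = [pi - lr * gi for pi, gi in zip(p, g)]
        trial_loss = loss(trial)
        if trial_loss < best:
            p, best = trial, trial_loss
        else:
            lr *= 0.5
        history.append(best)
    return p, history

p0 = [0.5, -0.5, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.1, -0.1, 0.1, -0.1, 0.0]
p_fit, history = train(p0)
```

The known physics (the gravity term) stays inside the solver, and only the network standing in for the unknown term is adjusted, which is the division of labor that Eq. (12) expresses.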


Once the loss value becomes small enough (

(S2 ), then the tuple is called a normal representation. If and there is some and some j ∈ p2^{-1}(i)

We call (f, p1, p2, p) an entangled representation of T. Any variator T : IS0 → IS1 evidently has an x-variator representation

That is, for any I ∈ IS0 we have

Corollary 1. Given any variator T, it is an x-variator. The representation of a variator as an x-variator is not unique.

Theorem 2 (Impossibility of Slicing Theorem). Given a variator T, if for any representation (f, p1, p2, p) of T, that the condition holds implies that the representation is entangled, then T is not sliceable.


W. Pan

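Proof. It is clear.

Definition 9. We say a tensor variator T : IS1 → IS2 is weakly sliceable if and only if T has a normal representation (f, p1, p2, p), such that

Let us look at another example. The variator

E_3 = [ [0 0 0] [0 0 0]
        [1 0 1] [1 0 1]
        [0 1 0] [0 1 0]
        [1 1 1] [1 1 1] ]

is weakly sliceable, because there are picks p_3_1 = [0], p_3_2 = [1], p_3 = [0 2 1] and a variator f_3 whose provision tensor is

Definition 10. Given a tensor variator T : IS0 → IS1 which has a normal representation (f, p1, p2, p), where p is a shaffle, let F be the provisioner tensor of the variator f : Ip1(S0) → IS2, and let V be a tensor with shape S0. Then (F, p1, p2, p, V) is called an x-sparse tensor representation, or simply an x-sparse tensor.

An x-scattering is a pair e = (A, X), where A = (F, p1, p2, p, V) is the x-sparse tensor and X is a tensor. A result of the x-scattering e is a tensor B defined as follows: for any J, if J ∈ T(IS0), then there is some I ∈ IS0 such that J = p(J0), where J0 = p1(I) + p2(I), and B[J] = V(I); if J ∈ ISB and J ∉ T(IS0), then B[J] = X[J] holds. The result tensor of an x-scattering is likewise not uniquely determined. We implement x-scattering in Python code [7] and call it the scatterX API. C++ and CUDA code for x-scattering is also provided to move slices in parallel and to use CUDA streams.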

Definition 10. Given a tensor variator T : IS0 → IS1 which has a normal representation (f , p1 , p2 , p), where p is a shaffle. Let F be the provisioner tensor of the variator f : Ip1 (S0 ) → IS2 , and let V be a tensor with shape S0 , then (F, p1 , p2 , p, V ) is called a x-sparse tensor representation, or simply x-sparse tensor. A x-scattering is a binary e = (A, X ), where A = (F, p1 , p2 , p, V ) is the x-sparse tensor,  X is a tensor. A result of x-scattering e is a tensor B defined as for any J , if J ∈ T IS0 , then there is some I ∈ IS0 , such that J = p(J0 ) where J0 = p1 (I ) + p2 (I )   and B[J ] = V (I ); if J ∈ ISB and J ∈ / T IS0 , then B[J ] = X [J ] holds. The result tensor of a x-scattering also cannot be certainly determined. We implement x-scattering using python code [7] and call it scatterX API. C++ and CUDA code of x-scattering are also provided to move slices parallelly and utilize CUDA streams.

Tensor Data Scattering and the Impossibility of Slicing Theorem


7.3 Counting Sparsity and Analyzing Performance

A dense tensor E with shape S has an x-sparse tensor representation

Once we randomly remove a few elements from E and get a sparse tensor E′, then E′ has an x-sparse representation where indices is a provisioner of a variator and V is a one-dimensional tensor which contains the elements of E′. Thus, the inner variator in an x-sparse tensor indicates the efficiency of storing sparse indices.

Definition 11. Given an x-sparse tensor X = (F, p1, p2, p, V), the sparsity of the x-sparse tensor is defined as

Now we can count the sparsity of the former examples:

A sparsity of 1 means that the x-sparse tensor can hardly be used in parallel. An x-sparse tensor with a smaller sparsity is more likely to be usable in parallel.

7.4 Mocking Current Scattering APIs

The counterpart scattering of the TensorFlow scatter API as in Sect. 6 has an x-scattering representation ((indices, p1, p2, p, updates), ts) where

The sparsity

where


It can be any number smaller than 1. In contrast, the counterpart scattering of the PyTorch scatter API as in Sect. 6 has an x-scattering representation ((Eindex, q1, q2, q, src), self) where

The sparsity

This means that the PyTorch scatter API is not sliceable. The key difference between these two kinds of APIs is how the variator in the scattering is formed.
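The difference in how the two APIs address their targets can be illustrated with a plain-Python emulation on nested lists. This is only a sketch of the documented indexing behavior in two simple cases (row-wise indices of depth 1 for the TensorFlow style, `dim=0` for the PyTorch style); it is not either library's implementation, and the function names are hypothetical.

```python
import copy

def tf_style_scatter(tensor, indices, updates):
    """Row-wise update in the style of tf.tensor_scatter_nd_update with index depth 1."""
    out = copy.deepcopy(tensor)
    for (row,), new_row in zip(indices, updates):
        out[row] = list(new_row)              # a whole slice moves at once
    return out

def torch_style_scatter(tensor, index, src):
    """Element-wise update in the style of torch.Tensor.scatter with dim=0."""
    out = copy.deepcopy(tensor)
    for i, index_row in enumerate(index):
        for j, target_row in enumerate(index_row):
            out[target_row][j] = src[i][j]    # every element carries its own destination
    return out

base = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
by_rows = tf_style_scatter(base, [[2], [0]], [[1, 2, 3], [4, 5, 6]])
by_elements = torch_style_scatter(base, [[2, 0, 1]], [[7, 8, 9]])
```

In the first call each index selects a contiguous row, so the update can be moved as one slice; in the second call each element of one source row lands in a different row, so no single slice covers the update.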

8 Conclusion

Tensor data scattering is a kind of task for which it is difficult to exploit the hardware features of machine learning accelerators. This article analyzes the reasons for this difficulty theoretically and establishes a general theory and algorithm of tensor data scattering. Based on the theories and algorithms in this article, we will be able to implement algorithms that make better use of accelerator features. Moreover, a standard approach is proposed for representing sparse tensors, which can facilitate parallel computing and data transport in AI accelerators and which also provides a way to store the sparse indices of sparse tensors efficiently. A formula for measuring sparsity is provided in the last section; it indicates the storage efficiency of a sparse tensor and the possibility of using it in parallel. More experiments and comparisons with APIs in other deep learning frameworks require more time. We will continue our work in this area and publish the results in the GitHub project [7].

References

1. Soyata, T.: GPU Parallel Program Development using CUDA. CRC Press, Boca Raton (2018)
2. Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. CoRR abs/1904.10509 (2019). http://arxiv.org/abs/1904.10509
3. TensorFlow API: tf.tensor_scatter_nd_update. https://www.tensorflow.org/api_docs/python/tf/tensor_scatter_nd_update
4. PyTorch Docs: torch.Tensor.scatter. https://pytorch.org/docs/stable/tensors.html?Highlight=scatter#torch.Tensor.scatter
5. Harris, C.R., Millman, K.J., van der Walt, S.J., et al.: Array programming with NumPy. Nature 585, 357–362 (2020)
6. Zhang, T., Liu, X., Wang, X., Walid, A.: cuTensor-tubal: efficient primitives for tubal-rank tensor learning operations on GPUs. IEEE Trans. Parallel Distrib. Syst. 31(3), 595–610 (2020)
7. Algebraic Tensor Project. https://github.com/wmpan/AlgebraicTensor

Scope and Sense of Explainability for AI-Systems

A.-M. Leventi-Peetz¹(B), T. Östreich¹, W. Lennartz², and K. Weber²

¹ Federal Office for Information Security, BSI, Bonn, Germany
² inducto GmbH, Dorfen, Germany

Abstract. Certain aspects of the explainability of AI systems are critically discussed, with a particular focus on the feasibility of making every AI system explainable. Emphasis is given to difficulties related to the explainability of highly complex and efficient AI systems which deliver decisions whose explanation defies classical logical schemes of cause and effect. AI systems have provably delivered unintelligible solutions which in retrospect were characterized as ingenious (for example move 37 of game 2 of AlphaGo). Arguments are elaborated in support of the notion that if AI solutions were to be discarded in advance because they are not thoroughly comprehensible, a great deal of the potential of intelligent systems would be wasted.

Keywords: Artificial Intelligence (AI) · Machine Learning (ML) · Explainable AI (XAI) · Chaos · Criticality · Attractors · Echo State Networks (ESN) · Time series · Causality

1 Introduction

The next generation of AI systems is expected to extend into areas that correspond to human cognition, such as the real-time interpretation of contextual events and autonomous system adaptation. AI solutions are mostly based on neural network (NN) training and inference, developed on deterministic views of events that lack context and commonsense understanding. Many successful developments have been made in the direction of explainable AI algorithms, while further advancements in AI will still have to address novel situations and abstraction in order to automate ordinary human activities [15]. Various approaches already exist to explain the results of machine-learning systems (ML systems), and there are methods and tools which can interpret and verify, for example, classification results and decisions produced by sophisticated, complex ML systems. The explanations vary with the task and with the method which the ML system employs to reach its results. The aim of this work is to give a short, non-exhaustive report on known ambiguities, shortcomings, flaws and even mistakes which ML explainability methods entail, underlining the association of these problems with the growing complexity of the systems which have to be interpreted. Furthermore, the necessity of taking chaos-theoretical approaches to ML into account is discussed, together with some implemented examples which demonstrate the potential of this new direction. In conclusion, the doubt is naturally formulated as to whether it is possible, or whether it makes sense, to pursue the goal of making every ML system explainable.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 291–308, 2022. https://doi.org/10.1007/978-3-030-82193-7_19

A.-M. Leventi-Peetz et al.

In the section following this introduction, the importance of making AI explainable is emphasized by reference to some prominent applications of AI systems which directly imply the necessity of understanding the reasoning behind machine-made decisions. In this context, explainability is seen as a requirement of trustworthy AI. In Sect. 3, the difference between the explainability of the rule-based systems of first-generation AI and that of modern ML systems is emphasized. Technical aspects of feature-based explainability methods for advanced, dynamically adaptive deep learning systems are discussed in Sect. 4, with a focus on the evaluation of a number of recent improvements introduced to increase the reliability of explanations. In Sect. 5, examples are given to justify the comparison of the behavior of ML systems with the behavior of chaotic systems, whose results depend sensitively on their initial conditions. In the following section, the advantages of using echo state networks and reservoir computing as a computationally efficient and competitive alternative to deep learning methods are considered, especially with respect to their ability to simulate both deterministic and chaotic systems. In Sects. 6 and 7, the proof of causality in ML results is presented as an indispensable part of any sound explanation of ML-supported decisions.
At the same time, references will be given to scientific work, which asserts that the problem of assigning causation in observational data has not yet been solved. Some AI specialists assign to XAI the property of being brittle, easy to fool, unstable or wrong. In the conclusions there will be posed the question if it is absolutely necessary to make all AI systems interpretable in the first place. At the present state of developments, interpretability does not necessarily contribute to the trustworthiness of AI systems.

2 Superhuman Abilities of AI

Of crucial importance is the application of AI in so-called safety- and security-critical systems, for example in transportation and medicine, where there is very little or zero tolerance for machine errors. For instance, interpreting ML models employed in computer-aided diagnosis (CADx) to support cancer detection from digital medical images often amounts to recognizing certain patterns which the pixels in the images form [28]. These patterns are combinations of so-called features (for example gray levels, texture, shapes etc.) which the algorithm has extracted from the test image in order to infer a result. The term inference means “make a prediction on the basis of experience”, in this case the experience which the model has gained during its training phase by exploiting information stored in large labeled datasets. This would be the case of supervised learning, which has taught the model to discern between pathological and normal forms. The increasing accuracy of imaging methods calls for an
increase of accuracy and reliability of the algorithmic predictive mechanisms. Imaging examination no longer has only a qualitative and purely diagnostic character; it now also provides quantitative information on disease severity, as well as identification of so-called biomarkers of prognosis and treatment response. ML systems are tasked with complementing diagnostic imaging and supporting the therapeutic decision-making process. There is a move toward a rapid expansion of the use of ML tools in radiology and in the daily work of physicians, treating each patient as unique, in line with a multidisciplinary approach and precision medicine. The move from well-established predictive analysis to so-called prescriptive analysis, which expects systems to be even more efficient and in a way smarter, is gaining strength. The quality of these systems concerns not only the health sector but also industry and the economy, regarding for example the emergence of smart factories and the approaching realization of the fourth industrial revolution (Industry 4.0) with the planning of self-organizing intelligent systems, that is, systems which can anticipate and solve suddenly arising, and most probably also unforeseen, problems by themselves. This new generation of system automation will probably have an enormous social and economic impact worldwide. People and societies will have to rely on the decisions and the advice of machines to organize life. But can the advice and decisions of machine systems eventually be completely trusted, even without the final approval of a reviewing committee of human experts? Could they be regarded as reliable and secure? Could people perhaps trust these systems if their behavior becomes somehow explainable? In that case, could the development of adequate norms and criteria for what machine explanations should look like be enough to inspire trust? And who should be able to comprehend these explanations?
These questions have received a great deal of attention in recent years and will stay in the focus of research for a long time to come. Explainability has received special attention ever since AI algorithms managed to reach what are called superhuman abilities. People have realized that they can develop systems that are not only faster at solving problems but can also do better, because they can find solutions which no expert has been able to find so far. One has to recall the famous, creative and unique move 37 of game 2 of AlphaGo, which AlphaGo itself evaluated as having a probability of being played by a human close to one in ten thousand [5]. Experts have been asked about the implications of this kind of creativity. Some of them attributed the move to clever programming rather than to creativity of the software. In other articles the advance from AlphaGo to AlphaGo Zero (a program that can win a game without using any information based on human experience) has been seen as an example of AI becoming self-aware and creating its own AI which is as smart as itself, if not smarter. Experience shows that experts in general cannot always make the explanations of their decisions understandable, not even for fellow experts! However, it is expected that the self-awareness of AI systems should enable them to explain their decisions to humans. In fact, much higher demands are made on AI systems than on humans when they have to make decisions. In autonomous driving, for example, it is expected that the
technology must be at least 100 times better than humans, according to Prof. Trapp of Fraunhofer ESK [29].

3 Forms of Explainability

The rule-based systems, or expert systems, of the first generation of AI were deterministic. Their intelligence was fixed, following a definite series of rules and instructions, and their inference was based on Boolean or classical logic. The explanation of the decisions of those systems was the demonstration of the inference rules that led to a decision. But these systems followed rules which were determined by humans. They were as causal, fair, robust, trustworthy and usable as their developers had made them. These systems would not change or update on their own, and they would not learn from mistakes. They simulated AI, but for many experts they were not true AI systems. The first so-called reason-tracing explanations said nothing about the system’s general goals or resolution strategy [12]. The fact that knowledge of the problem to be solved offers advantages if it is expressed in a form that computers can handle motivated domain experts, the so-called knowledge engineers, to encode experts’ advice in the form of associational (also referred to as heuristic or empirical) rules that mapped observable features (evidence) to conclusions. For a large portion of real-world problems it is significantly easier to collect data and identify a desirable behavior than to explicitly write a program, as Karpathy aptly stated (2017) [22]. ML systems, nowadays powered by NNs and deep learning, shifted the paradigm from one in which the programmer must provide rules and inputs in order to obtain results to one where specialists and non-specialists can provide inputs and results to derive rules. The promise of this approach is that learned rules can be applied to many new inputs without requiring that the user have the expertise needed to derive results. This is sometimes also referred to as the democratization of AI.
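The shift from "rules and inputs give results" to "inputs and results give rules" can be illustrated with a toy sketch. The fever threshold, the data and the function names below are hypothetical and serve only as an illustration; the threshold search is one simple way to derive a rule from labeled examples, not a method taken from the cited works.

```python
# Toy contrast between a hand-written rule and a rule derived from data.
# All names, values and the threshold search are illustrative assumptions.

def rule_based(temperature_c):
    """Expert-system style: the rule (threshold 38.0) is fixed by a human."""
    return temperature_c >= 38.0

def learn_threshold(inputs, labels):
    """ML style: derive the rule from inputs and desired results.

    Tries each midpoint between neighbouring sorted inputs and keeps the
    threshold with the fewest misclassifications on the training data.
    """
    xs = sorted(inputs)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        return sum((x >= t) != y for x, y in zip(inputs, labels))

    return min(candidates, key=errors)

temps = [36.4, 36.9, 37.2, 37.6, 38.3, 38.9, 39.5]
fever = [False, False, False, False, True, True, True]

learned_t = learn_threshold(temps, fever)
predictions = [t >= learned_t for t in temps]
```

In the first function the explanation of a decision is the rule itself; in the second, the rule exists only as a number that depends on whatever data happened to be supplied.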
The motivation in this respect is that representing knowledge in datasets is much easier than having to provide methods of encoding and manipulating symbolic knowledge, because updating and improving learning systems can then be done more smoothly as the datasets grow and evolve over time. Furthermore, rule-based systems are of little help for solving problems in complex domains, and there are many cases (e.g., cancer detection in medical images) where rules cannot be explicitly defined in a programmatic or declarative way. The hope of AI research is to implement general AI by creating autonomously learning systems. These systems should finally become unlimited in their ability to simulate intelligence; they should be able to demonstrate all signs of an adaptively growing intelligence: previous knowledge should be modified, or eliminated if not needed any more, while new knowledge should be continuously gained. Hence, these systems should be able to build and update their rules actively on the fly. This is the difference between ML systems and rule-based ones. Neural networks represent instances of learning systems. A learning system implements a utility function representing the difference between the system’s prediction
and reality, and this difference is minimized, for example with the help of optimization techniques which change the system’s parameters. These optimization techniques (e.g., gradient descent, stochastic gradient descent) are in fact rule-based techniques, because they just compute the gradients needed to adjust the weights and biases in order to optimize the utility function. The details of the calculation vary considerably (e.g., between supervised and unsupervised learning). The learning process is deterministic (including the statistical and probabilistic part of the method); however, it is practically impossible to describe the learning system with a model, because this would involve millions of dynamic parameters (e.g., weights, biases) which make the internal system processes untraceable. Their enormous complexity makes learned systems very hard to explain, so that they can hardly be understood by humans [32]. It cannot become entirely clear how trained systems make their decisions. That is the dark secret at the heart of learning systems, according to Will Knight, Senior Editor of MIT Technology Review.¹ According to Tommi Jaakkola (MIT, Computer Science),² this is already a major problem for many applications; whether it is an investment decision, a medical decision, or a military decision, one does not want to just rely on a black box. The European Union issued the EU General Data Protection Regulation (GDPR) [14], which practically amounts to a right to explanation. Citizens are entitled to ask for an explanation of algorithmic decisions made about them. The question arises whether the GDPR will become a game-changer for AI technologies. The consequences of this regulation are not yet really clear. It remains to be seen whether such a right is legally enforceable. It is not clear whether the law provides a right to be informed rather than a right to explanation. Therefore, the impact of the GDPR on AI is still under dispute.
For the explainability of NN models, a large body of work focuses on post-training feature visualization to qualitatively understand the dynamics of the NNs. The following properties are important for explanations:

– Causality
– Fairness
– Robustness and Reliability
– Usability
– Trust

There have been long discussions about biased decisions; the famous husky which was misidentified as a wolf because of the snow in the background of the picture is known to almost everybody. Bias in the data is a serious issue, especially because, as experts point out, algorithms tend to amplify existing biases: they actually learn from differences, and any difference can under certain circumstances become a bias in the process of learning. However, one cannot discard the possibility that, even if all training datasets were balanced so that no biases were possible, there could still exist some kind of bias in the opinion of the users who are meant to understand the algorithm’s interpretation and judge the algorithmic fairness. There are many subtleties involved in interpretation which should be of concern in parallel to the technical refinement of algorithms and software.

¹ https://www.technologyreview.com/author/will-knight/
² https://people.csail.mit.edu/tommi/

4 Complex Dynamical Systems

Learning setups cannot always be static. The necessity of learning in continuous time from continuous data streams, to which online learning also belongs, has established incremental learning strategies to account for situations in which training data become available in sequential order. The best predictor for future data gets updated at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once. Online learning algorithms are also known to be prone to so-called catastrophic interference, which is the tendency of an artificial NN to completely and abruptly forget previously learned information upon learning new information. This is the well-known stability-plasticity dilemma [16]. An algorithm has to adapt dynamically to new patterns in the data when the data itself is generated as a function of time, e.g., in stock price prediction. In time series forecasting a model is employed to predict future values based on a previously performed time series analysis and the values observed therein. That is, historical data is used to forecast the future. Such predictions are delivered together with confidence intervals (CI) that reflect the confidence level of the prediction. The size of the sample and its variability are among the factors which affect the width of the confidence interval, as is the confidence level, usually set at 95% [4]. A larger sample will tend to produce a better estimate of the population parameter when all other factors stay unchanged. However, NNs with a specific configuration do not provide a unique solution, because their performance is determined by several factors, such as the initial values, usually chosen randomly from a distribution, the order of the input data during the training cycle and the number of training cycles [10,19,27].
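The dependence of a trained NN on its random initial values can be made concrete with a minimal sketch. The tiny network, the XOR data, the learning rate and the seeds are all assumptions chosen for illustration; the point is only that the same architecture, data and gradient-descent procedure end in different parameter values when the initialization differs.

```python
import numpy as np

# XOR data: a small problem with a non-convex loss surface.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

def train(seed, steps=3000, lr=0.5, hidden=4):
    """Train a one-hidden-layer network with plain gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (2, hidden))   # random initial values
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0, (hidden, 1))
    b2 = np.zeros(1)
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)                     # hidden activations
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))     # predicted probability
        g_out = (p - y) / len(X)                     # d(cross-entropy)/d(logit)
        g_hid = (g_out @ W2.T) * (1.0 - h ** 2)      # backpropagation through tanh
        W2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return W1, p

W1_a, p_a = train(seed=0)
W1_b, p_b = train(seed=1)

# Largest difference between the first-layer weights of the two runs.
weight_gap = np.max(np.abs(W1_a - W1_b))
```

Any explanation that refers to individual weights or hidden units therefore describes one particular training run, not the learning problem as such.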
Other variables belonging to the mathematical attributes of a specific NN, like learning rate and momentum, also affect the final state of a trained NN, which leads to a high number of possible combinations. Evolutionary algorithms have been proposed to find the most suitable design of NNs, in order to allow better predictions, given the high number of possible combinations of parameters. Also, many different NNs can be trained independently with the same set of data, so that an ensemble of artificial NNs is created whose members have a similar average performance but a different predisposition to make mistakes at the level of their individual predictions [7]. If one needs to estimate a new patient’s individual risk, for example in cardiovascular disease prediction, or the riskiness of a single stock, or if one must classify the danger of some unknown data traffic pattern that might hide a cyber attack, a set of independent NN models acting simultaneously on the same problem should be of advantage. An ensemble of models tends to perform better than any individual model, because the various errors of the models average out; therefore ensembles have dominated recent ML competitions [8,17]. Using model ensembles also
requires a much longer training time compared to training only one model. Each model is either trained from scratch or derived from a base model in significant ways. All kinds of ensemble methods (concatenated, averaged, weighted etc.) have certain advantages and disadvantages, with a reported accuracy of up to 89% on test data. Explainability refers to the ability of a model or an ensemble of models to explain its decisions in terms of human-observable decision boundaries or features. Should the user get a proof that a different choice of ensemble weights would not have resulted in a different classification in his case? What do the decision boundaries look like that resulted in the decision concerning him? One can also develop ensembles during fine-tuning operations by dividing the procedure into subtasks. Incremental and active learning remain a field of research aiming at developing recognition and decision systems that are able to deal with new data from known or even completely new classes by learning in a continuous fashion. Active learning and active knowledge discovery are approaches which require continuously changing models. How continuous learning with a series of update steps should be performed robustly and efficiently is a question that still remains open. And how explainable are these models for the user? If one may assume that the parameters of the NN vary smoothly with the time-varying training dataset, one can apply warm-start optimization at each time step, using the parameters of the previous step as initialization for the current parameters. In this case network fine-tuning is performed under the assumption that the introduction of new categories is not necessary for the classification of the new data.
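Warm-start optimization as described above can be sketched on a deliberately simple stand-in problem. Instead of a NN, the sketch fits a linear least-squares model by gradient descent; the data, the size of the drift, the learning rate and the tolerance are assumptions. The inputs are whitened so that the number of iterations depends only on how far the starting point is from the solution.

```python
import numpy as np

def fit_gd(X, y, w_init, lr=0.1, tol=1e-6, max_steps=10_000):
    """Least-squares fit by gradient descent; returns weights and step count."""
    w = w_init.copy()
    for step in range(max_steps):
        grad = X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return w, step
        w -= lr * grad
    return w, max_steps

rng = np.random.default_rng(0)
n = 200
# Whitened inputs: X.T @ X / n is the identity matrix.
Q, _ = np.linalg.qr(rng.normal(size=(n, 3)))
X = Q * np.sqrt(n)

w_old = np.array([1.0, -2.0, 0.5])
w_new = w_old + 0.05                      # the data-generating process drifts slightly

y_old = X @ w_old + 0.1 * rng.normal(size=n)
y_new = X @ w_new + 0.1 * rng.normal(size=n)

w_prev, _ = fit_gd(X, y_old, np.zeros(3))        # model of the previous time step
_, cold_steps = fit_gd(X, y_new, np.zeros(3))    # restart from scratch
_, warm_steps = fit_gd(X, y_new, w_prev)         # warm start from previous parameters
```

The warm start pays off only as long as the smoothness assumption holds; if the new data differ strongly from the old, the previous parameters are no better a starting point than a fresh initialization.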
If, however, the new datasets have little or nothing in common with the datasets of the previous step, new classes (known or unknown) have to be added with additional nodes at the output layer of the network, together with some new parameters and a new normalization for this network. Questions of convergence under time limitations or data sparsity are in general open. How many layers must be adapted so that a robust solution can be found for real-world and real-time applications? For example, how many SGD (stochastic gradient descent) iterations would be necessary for each update in order to achieve accuracy without overwhelming computational effort? There exist empirical studies which have investigated various factors, among others the ratio of older to new data to be considered during the SGD iterations so as to avoid overfitting. The dropout technique randomly changes the network architecture to minimize the risk that learned parameters do not generalize well. This method in essence simulates ensembles of models without creating multiple networks. The dropout technique requires tuning of hyperparameters such as the learning rate, weight decay, momentum, max-norm and the number of units in a layer in order to work well, and for a given network architecture and input data it requires experimentation with these hyperparameters. Dropout increases convergence time, and one needs to train models with different combinations of the hyperparameters that affect model behavior, which further increases training time [17]. However, dropout is detrimental to accuracy if used without normalization; therefore normalization techniques have been developed, some also
going beyond batch normalization to account for active learning. On top of this, wrong object labels (label noise) are not completely avoidable in real-world applications, which considerably degrades the accuracy of the results. Researchers have managed to spot changes in a continuously learning deep CNN (convolutional neural network) by visualizing the shifting of the mainly attended image regions, for example when a new class is introduced, and by observing the strongest network-filter changes during a single learning step [2]. Visual explanations for DNNs, for example CAM (class activation mapping) or Grad-CAM [3,24], are post hoc: they work on a NN after training is complete and the parameters are fixed, even if only for a short time. The network produces a feature map at its last convolutional layer, and the weights of features, or the gradients with respect to the feature map activations, are calculated post hoc and plotted. The result is a class-discriminative localization map which determines the position of objects of a particular class. However, explainability is not interpretability, and therefore post hoc attention mechanisms, although perhaps helpful for following the reactions of agents in video games, may not be optimal for real-world decisions connected with high risk. Explaining how a model made its decision delivers a chain of results after a sequence of mathematical operations has been applied to the model; this can perhaps help to better understand the functionality of the model, but it does not provide any known rules of the natural world which would make sense to humans. Moreover, model rules do not always translate to unique or comparable decisions, so that finding a way to translate model rules (explainability) into natural-world rules (interpretability for humans) would not be the only problem that has to be solved.
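The post hoc computation behind a class activation map can be sketched as a weighted sum of the last-layer feature maps. The activations and class weights below are hypothetical stand-ins for values that a trained CNN would supply, and the clipping of negative values follows the Grad-CAM convention rather than anything stated in the text.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of last-layer feature maps, rescaled to [0, 1].

    feature_maps: array of shape (K, H, W), activations of K filters.
    class_weights: array of shape (K,), weights linking each filter to one class.
    """
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # shape (H, W)
    cam = np.maximum(cam, 0.0)        # keep only positive evidence for the class
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Hypothetical activations: 3 filters on a 4x4 grid.
maps = np.zeros((3, 4, 4))
maps[0, 1, 2] = 2.0      # filter 0 fires at row 1, column 2
maps[1, 3, 0] = 1.0      # filter 1 fires at row 3, column 0
maps[2, 1, 2] = 0.5      # filter 2 also fires at row 1, column 2

weights_class_a = np.array([1.0, -0.5, 0.8])   # assumed classifier weights
cam_a = class_activation_map(maps, weights_class_a)
peak = np.unravel_index(np.argmax(cam_a), cam_a.shape)
```

The map says where the evidence for a class is located in the image, but nothing about why that region should count as evidence, which is the gap between explainability and interpretability discussed above.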
For instance, studies have demonstrated that the overlap of the features which filters extract in high convolutional layers leads to poor model expressiveness in CNNs. Methods have been developed to remove redundancies and feature ambiguity by inducing bias in the training process and confining each filter’s attention to just one or a few classes. Methods have also been developed to disentangle the middle-layer representations of CNNs so that they correspond to features of objects and object parts, in order to assign semantic meanings to filters [35]. Because there is a trade-off between explainability and performance, in real-world applications additional networks, so-called explainability networks, have been implemented and trained in parallel to the original performing networks, with the task of making the performing networks explainable. For the training and testing of explainable filters, benchmark datasets with ground-truth annotations have been employed. In a number of cases the majority of classifications could be attributed to these new filters, but there have also been cases where the performing network achieved better classifications than its corresponding explaining network. The additional computational effort and time associated with feature disentanglement make the concept inapplicable to dense networks or when a great number of features have to be recognized [26]. CNNs use pooling, which is the downsampling of the feature map, to ensure that the CNN recognizes the same object in images of different forms and also to reduce the memory requirements of the model. The pooling operation introduces spatial invariance in CNNs, which is also one of the
major weaknesses of CNNs. Max pooling, for instance, preserves the strongest features, and the feature map gets flattened into a column vector to be processed in the NN for further computations. As a result of pooling, CNNs can lose features in images, and a very large amount of training data would be needed to compensate for this weakness. CNNs are also unable to recognize pose, texture and deformations in images or parts of images. CNNs do not implement equivariance; they rely on translational invariance instead, and can therefore, for example, detect a face in a picture once they have detected an eye, independent of the spatial location of the eye with respect to the other features which usually belong to a face. It seems that the results of a CNN cannot generally be interpreted on the basis of features alone. Capsule networks, or CapsNets, have been proposed as an alternative to CNNs [23]. Their neurons accept and output vectors, as opposed to the scalar values of CNNs. Features can be learned together with their deformations and viewing conditions. In capsule networks, each capsule is made up of a group of neurons, with each neuron’s output representing a different property of the same feature. The output of a capsule is the probability that a feature is present, delivered together with the so-called instantiation parameters, which express the equivariance of the network, that is, its ability to track transformations of the input in its internal representation while keeping its decision unchanged. The introduction of CapsNets is considered promising for solving real-life problems like machine translation, intent detection, mood and emotion detection, traffic prediction on the basis of spatio-temporal traffic data expressed in images etc. Even though the training time for CapsNets is better than that for CNNs, it is still not acceptable for time-critical operations and highly unsuitable for online training. Research is currently ongoing in this area.
CapsNets are considered to be explainable by design, because during learning they construct relevance paths that reduce unrelated capsules without the necessity of a backward process for explanation.
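The loss of spatial information through pooling, which motivates the capsule approach, can be shown with a minimal sketch of non-overlapping 2×2 max pooling. The two toy images are assumptions for illustration: they contain the same feature at different positions within one pooling window.

```python
import numpy as np

def max_pool_2x2(image):
    """Non-overlapping 2x2 max pooling for an array with even height and width."""
    h, w = image.shape
    # Split the image into 2x2 blocks and take the maximum of each block.
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 1.0            # feature in the top-left pixel
b = np.zeros((4, 4))
b[1, 1] = 1.0            # same feature, shifted by one pixel diagonally

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
```

Both inputs yield the same pooled map, so the layers after pooling cannot distinguish them: the network has become invariant to the shift, and the exact position of the feature is no longer available for later computations or for an explanation.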

5 Stability and Chaos

An important issue concerning the trustworthiness of DNNs is their susceptibility to mistakes when adversarial examples are introduced as inputs, causing them to make wrong decisions. Examples intentionally designed to fool a model are called adversarial attacks; some call them optical illusions for machines, as they mostly concern widely discussed examples of striking miscategorization of pictures. Quite famous is the case of a classification network which had been trained to distinguish between a number of image categories, with panda and gibbon being two of them. The classifier recognizes the image of a panda with 57.7% confidence. If a small perturbation is added to the picture, the classifier classifies the image as a gibbon with 99% confidence [18]. Research has shown that the output of deep neural networks (DNN) can easily be changed by adding relatively small perturbations to the input vector. There also exist designed and successfully applied attacks with a one-pixel image perturbation, for example
based on what is called differential evolution (DE), which can fool more types of networks [30]. Reinforcement learning (RL) is the autonomous learning of agents which learn from experience how to carry out a designated task and discover the best policy of behavior, or the best actions to undertake, through interaction with their environment and evaluation of the resulting rewards and punishments. RL systems have also been shown to be liable to mistakes due to adversarial attacks. Research shows that widely used RL algorithms, such as DQN (deep Q-network), TRPO (trust region policy optimization), and A3C (asynchronous advantage actor-critic), are vulnerable to adversarial inputs. Even perturbations which are too subtle to be perceived by a human can degrade performance and cause an agent to make wrong decisions [9,21]. ML systems are highly complex, and complexity makes a system highly dependent on initial conditions. The examples mentioned here, where a small perturbation causes the system to jump in category space, present an analogy to the behavior known from chaotic systems: small changes in the starting state can generate a big difference in the dynamics of the system later on. The noise that had to be added to the panda picture in order to get the false classification was a so-called custom-made perturbation, computed specifically for this network from the gradient of its loss with respect to the input image (the fast gradient sign method) rather than drawn at random. Perturbations can be meticulously designed to serve certain purposes and make a DNN take a wrong decision; however, completely random perturbations, which can arise accidentally in the very complex environments where ML is already applied or is planned to be applied in the near future, especially in systems with real-time requirements, can also cause serious mistakes with possibly catastrophic consequences.
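How a small, purposefully designed perturbation can flip a decision can be sketched with a gradient-sign step on a linear classifier. The weights, the input and the perturbation size are hypothetical; a logistic model stands in for a DNN only to keep the gradient computation transparent.

```python
import numpy as np

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic (linear) classifier."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

w = np.linspace(-1.0, 1.0, 100)            # hypothetical trained weights
b = 0.0
x = np.ones(100) + 0.5 * w / (w @ w)       # input with components close to 1, logit about 0.5
y_true = 1.0

# Gradient of the cross-entropy loss with respect to the input.
grad_x = (predict_proba(w, b, x) - y_true) * w

eps = 0.02                                 # maximum change per input component
x_adv = x + eps * np.sign(grad_x)          # gradient-sign step that increases the loss

p_clean = predict_proba(w, b, x)
p_adv = predict_proba(w, b, x_adv)
```

Each component changes by at most 2% of its value, yet the many small changes add up along the direction the classifier is most sensitive to, and the predicted class changes. This accumulation over many input dimensions is what makes high-dimensional models sensitive to perturbations that look negligible component by component.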
In certain cases it can be difficult to discern between input signal and perturbation. There is a close relationship between complex systems research and ML, with a wide range of cross-disciplinary interactions. Exploring how ML works with respect to complexity is a subject of significant research, which also has to be considered in the context of interpretation [31]. For time series classification problems (TSCP), features have to be ordered by time, unlike in traditional classification problems. CNNs have been applied to time series, automatically tailoring filters that represent repeated patterns and learning and extracting features from the raw data. Recurrent neural networks (RNNs) are a family of NNs used especially to address tasks which involve time series as input, and are therefore deployed in sequential data processing and continuous-time environments. They are capable of memorizing historic inputs; they possess dynamic memory, as they preserve in their internal state a nonlinear transformation of the input history. They are characterized by the presence of feedback connections in their hidden layer, which gives them short-term memory capability. However, their learning of short- and long-term dependencies is problematic when implemented by means of gradient descent (vanishing/exploding gradients), and their training with backpropagation through time is computationally intensive and often inefficient. The interpretability of the internal dynamics of RNNs is input dependent and
almost infeasible given the complexity of the time and space dependent activity of their neurons.
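The vanishing and exploding gradients mentioned for RNN training can be illustrated with a short sketch. It assumes a purely linear recurrence h_t = W h_{t-1}, which is a simplification of a real RNN, and random recurrent weights; the matrix size, the number of time steps and the two spectral radii are arbitrary choices.

```python
import numpy as np

def backprop_norms(spectral_radius, steps=50, n=20, seed=0):
    """Norm of a gradient propagated back through `steps` time steps
    of the linear recurrence h_t = W h_{t-1}."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, n))
    # Rescale so that the largest absolute eigenvalue equals spectral_radius.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    g = np.ones(n)
    norms = [np.linalg.norm(g)]
    for _ in range(steps):
        g = W.T @ g                  # one step of backpropagation through time
        norms.append(np.linalg.norm(g))
    return np.array(norms)

vanishing = backprop_norms(0.5)      # gradient shrinks towards zero
exploding = backprop_norms(1.5)      # gradient grows without bound
```

In the first case early inputs have practically no influence on the weight update, in the second the update is dominated by them; both effects make long-term dependencies hard to learn by gradient descent.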

6 Nonclassical Approaches, Training of Attractors

Nonclassical approaches, for example some based on heteroclinic networks with multiple saddle fixed points as nodes connected by heteroclinic orbits as edges in the phase space of the learning system, have been elaborated to generate reproducible sequences of metastable states and attractors in order to explain RNN behavior. For this task, known engineering methods have been extended to enable data-based inference of heteroclinic dynamics [34]. These approaches use reservoir computing (RC) and reservoirs, which can be employed instead of temporal kernel functions to avoid the training-related challenges associated with RNNs (slow convergence, instabilities etc.). Echo state networks (ESNs) and liquid state machines (LSMs) have been proposed as possible RNN alternatives under the name of RC. Reservoirs, seen as generalizations of RNN architectures and ESNs, are far easier to train and have mainly been associated with the supervised learning tasks typical of RNNs. They map input signals into higher-dimensional computational spaces through the dynamics of fixed, nonlinear systems, the reservoirs. ESNs are considered appropriate for use as universal approximators of arbitrary dynamic systems. Furthermore, the NN of the reservoir is randomly generated and only the readout has to be trained. The trained output layer delivers linear combinations of the internal states, interpreting the dynamics of the reservoir and its perturbations by external inputs. Reservoir computing can be applied for model-free, data-based predictions of nonlinear dynamic systems. Reservoirs can also be applied to physical systems that are continuous in space and/or time, allowing computations in situations where partially or completely unknown interactions or extreme variations of the input signal take place, which allow for very limited functional control and almost no predictability.
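The basic ESN scheme, a fixed random reservoir with a trained linear readout, can be sketched in a few lines. The reservoir size, the weight ranges, the spectral radius of 0.9, the ridge parameter and the sine-wave task are assumptions chosen for illustration; the readout is fitted by ridge regression, a common choice that the text does not prescribe.

```python
import numpy as np

def run_reservoir(u, W_in, W):
    """Drive the fixed reservoir with the input sequence u and collect its states."""
    x = np.zeros(W.shape[0])
    states = []
    for u_t in u:
        x = np.tanh(W_in * u_t + W @ x)     # the reservoir itself is never trained
        states.append(x.copy())
    return np.array(states)

rng = np.random.default_rng(0)
n = 100
W = rng.uniform(-0.5, 0.5, (n, n))                      # random recurrent weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))         # set the spectral radius to 0.9
W_in = rng.uniform(-0.5, 0.5, n)                        # random input weights

signal = np.sin(0.1 * np.arange(1200))
u, target = signal[:-1], signal[1:]                     # task: predict the next value

states = run_reservoir(u, W_in, W)
washout, split = 100, 900                               # discard the initial transient
S_train, y_train = states[washout:split], target[washout:split]

# Train only the linear readout (ridge regression on the reservoir states).
ridge = 1e-6
W_out = np.linalg.solve(S_train.T @ S_train + ridge * np.eye(n), S_train.T @ y_train)

prediction = states[split:] @ W_out
test_mse = np.mean((prediction - target[split:]) ** 2)
```

Training reduces to solving one linear system, which is why reservoirs avoid the convergence problems of backpropagation through time. The explanation problem does not disappear, however: the readout is linear, but the reservoir states it combines are produced by random, untrained dynamics.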
Andrea Ceni, Peter Ashwin, and Lorenzo Livi have investigated the possibility of exploiting transient dynamical regimes and what they define as excitable network connections to switch between different stable attractors of the model for classification purposes [6]. They demonstrated how to extract such excitable network attractors (ENAs) from ESNs, whereby the previous training induced bifurcations that generated fixed points in phase space, so that the trained system could move from one stable attractor to another under small input perturbations. The hope is that this can be exploited for classification problems that involve switching between a finite set of classes (attractors) and could be used instead of RNNs. Input-dependent excitability thresholds of excitable connections have also been defined to measure the minimum distance in phase space which would be necessary for a solution to escape from one stable point and converge to another. The authors found that there exist local switching subspaces (LSS) in the vicinity of attractors, the dimensions of which relate directly to the activity of connections in the network when the ESN solves
a task, depending on the complexity of the input and its impact on the dynamics of the reservoir. This has to be assessed on a case-by-case basis. Finding fixed points for the dynamics of the system depends on the convergence of the optimization algorithm, and one can obtain similar solutions which, depending on the chosen tolerance, can be numerically different. Excitability thresholds should be important for the robustness of the solutions. ESNs which yield network attractors with low excitability thresholds were found to be less robust to noise perturbations. However, the sensitivity and accuracy of the network do suffer under low excitability. Training the reservoir simply means tuning the readout parameters by comparing input and output data and using an autoregressive process to minimize the difference. The result of the training could be, for example, a classification system which, when given a sequence of patterns, can recognize each pattern by itself. A trained reservoir should act as an autonomous dynamical system whose state evolution, given the initial conditions, represents the state evolution of the nonlinear dynamical system that has to be predicted (the task system). The forecast horizon is used to estimate the quality of the short-term predictions of such a trained system. It is defined as the time between the start of a prediction and the point where the prediction deviates from the test data by more than a fixed threshold. There have been investigations into how to choose training hyperparameters like reservoir size, spectral radius, network connectivity, training sample size, training window and so on, in order to get reliable predictions. The forecast horizon must be compared to the typical time scales of the motion of the system, determined by the maximum Lyapunov exponent. However, the calculation of the Lyapunov exponent is complex and numerically unstable, and one needs to know the mathematical model of the system to calculate it.
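The definition of the forecast horizon given above translates directly into code. In the following sketch the chaotic logistic map is a hypothetical stand-in for the task system, and a trajectory started from a minimally perturbed initial value stands in for the output of an almost perfect trained predictor; the threshold and the size of the perturbation are arbitrary assumptions.

```python
import numpy as np

def forecast_horizon(prediction, truth, dt, threshold):
    """Time from the start of the prediction until the absolute error
    first exceeds `threshold`; the full length if it never does."""
    error = np.abs(np.asarray(prediction) - np.asarray(truth))
    exceeded = np.flatnonzero(error > threshold)
    steps = exceeded[0] if exceeded.size else len(error)
    return steps * dt

def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the chaotic logistic map x -> r * x * (1 - x)."""
    xs = [x0]
    for _ in range(steps - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return np.array(xs)

truth = logistic_trajectory(0.2, 200)
model = logistic_trajectory(0.2 + 1e-10, 200)   # stand-in for a trained predictor
horizon = forecast_horizon(model, truth, dt=1.0, threshold=0.5)
```

Even with an initial error of only 1e-10, the prediction is useful for a limited number of steps, because the error grows exponentially at a rate set by the Lyapunov exponent of the task system. For the logistic map this rate can be derived from the known equation, which is exactly the kind of model knowledge that is missing when only measured data are available.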
This is not the case if one has only the time series data. The dynamics of a system can also be multiscale and noisy, which might sometimes lead to rare transition events. Some systems can also spend very long periods of time in various metastable states and rarely, at apparently random times and due to some influencing signal, suddenly transform into a new, quantitatively different state. Such changes in the dynamical behavior of complex systems are also known as critical transitions and occur at so-called tipping points. Theories explain this behavior as due to a large separation between the time scales on which the system state and the signal evolve. Complex and multiscale data also have to be analyzed for system behavior predictions. It is an open question how well events, including rare events, can be predicted in multiscale nonlinear dynamic systems when only the slow system state data are used for the training and perhaps only partial knowledge of the physics of the data-generating system is available. In this context there exist developments in the direction of what is called physics-informed ESNs, which are ESNs extended to represent solutions of ordinary differential equation (ODE) systems, aiming at introducing causality in ML. Physical information is imposed on the reservoir by means of special constraints of invariant principles. The ESN architecture should be represented by an ODE approximator, which implements a physics-informed training scheme for the reservoir computing model [13]. Jiang et al. (2019) [20] have

demonstrated, for reservoir computing systems employed for model-free prediction of nonlinear dynamical systems, that there exists an interval for their spectral radius within which the prediction error is minimized. The authors performed many experiments, keeping the hyperparameters of the reservoir fixed and leaving only the edge weights free. A reservoir consisting of a complex network of N interconnected neurons is characterized by its adjacency matrix, an N × N weighted matrix whose largest absolute eigenvalue is the network’s spectral radius. The authors used ensemble-averaged predictions to show that the spectral radius of the reservoir plays a fundamental role in achieving correct predictions. They substantiated this finding by experimenting with a number of spatiotemporal dynamical systems known from physics: the nonlinear Schrödinger equation (NLSE), the Kuramoto-Sivashinsky equation (KSE), and the Ginzburg-Landau equation (GLE), for which they could compare the evolution of the true solution with the corresponding results delivered by trained ESNs. For all the examined systems, optimal intervals for the values of the spectral radius could be found, and when the radius lies outside this interval the prediction error rises immensely. This result remained valid independent of the rest of the network parameters. Even in a case where calculations showed that only about 50 out of 100 ensemble realizations resulted in acceptable predictions, the spectral radius still had to be chosen from within the optimal interval in order to get reasonable results in terms of accuracy and time. Remarkably, even when the solution is chaotic, the spectral radius must still be chosen from within the optimal interval in order to get meaningful predictions.
Furthermore, it was demonstrated that the choice of a directed or undirected network topology strongly influences the magnitude of the spectral radius interval, with the directed case leading to different spectral radius values and also to an absolute minimum of achievable prediction error [20]. While traditional methods for chaotic dynamical systems manage to make short-term predictions for about one Lyapunov time, model-free reservoir-computing predictions based only on data demonstrate a prediction horizon of up to about half a dozen Lyapunov times [20]. It has also been discovered that the computational efficiency of ESNs is maximized when the network is at the border between a stable and an unstable dynamical regime, at the so-called edge of criticality or the region at the edge of chaos. This makes the state between ordered dynamics (where disturbances die out fast) and chaotic dynamics (where disturbances get amplified) especially interesting. The average sensitivity of a dynamical system to perturbations of its initial conditions makes it possible to decide whether it has ordered or chaotic dynamics. There seems to be no standard recipe for designing an RNN or an ESN so that it operates steadily in its critical regime, independent of task properties. Researchers suggest the development of mechanisms for self-organized criticality in ESNs [25,33]. Could a guarantee of a very low error in results finally substitute for the demand for explanations of ML system predictions, so that such systems can be categorized as trustworthy without case-dependent technicalities such as counterfactual explanations, feature-based explanations, adversarial perturbation-based explanations, etc.? It is quite obvious that using established XAI methods, the creation of


explanations would find it difficult to keep pace with the rate of production of results that need to be explained (dynamical systems, online learning, IIoT etc.).
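The forecast horizon defined earlier in this section can be made concrete with a short sketch. The following Python function is an illustration only, assuming a scalar time series sampled at a fixed time step; the function name, its parameters and the toy data are hypothetical and are not taken from the cited works. It counts the steps until the prediction first deviates from the test data by more than a fixed threshold and, if a maximum Lyapunov exponent is assumed to be known, expresses the result in Lyapunov times.

```python
from typing import Optional, Sequence


def forecast_horizon(
    predicted: Sequence[float],
    observed: Sequence[float],
    threshold: float,
    dt: float = 1.0,
    max_lyapunov_exponent: Optional[float] = None,
) -> float:
    """Time from the start of a prediction until it first deviates from the
    test data by more than `threshold`.

    If `max_lyapunov_exponent` is given, the horizon is returned in units of
    the Lyapunov time (1 / exponent); otherwise in the units of `dt`.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")

    steps = len(predicted)  # default: the threshold is never exceeded
    for step, (p, o) in enumerate(zip(predicted, observed)):
        if abs(p - o) > threshold:
            steps = step
            break

    horizon = steps * dt
    if max_lyapunov_exponent is not None:
        # Dividing by the Lyapunov time 1/lambda equals multiplying by lambda.
        horizon *= max_lyapunov_exponent
    return horizon


# Hypothetical toy data: the prediction slowly drifts away from the test series.
observed = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.0, 0.1, 0.25, 0.45, 0.8, 1.4]

horizon = forecast_horizon(predicted, observed, threshold=0.1, dt=0.5)
print(horizon)
```

For multi-dimensional states, the absolute difference would have to be replaced by a suitable norm; which norm and which threshold are appropriate depends on the task system.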

7

Causality of Results?

It is plausible that trust in, or a comprehensible interpretation of, the results of ML and deep learning is difficult to achieve unless causality regarding the production of these results can be established as a basis for the interpretation. Causality implies a temporal notion, in the sense that there is a direction in time which dictates how a past causal event in one variable produces a future event in some other variable. This leads to a natural spatiotemporal definition of causal effects that can be used to detect arrows of influence in real-world systems [1]. Mechanistic models which are fitted to predict results in complicated dynamical systems represent simplified, versatile descriptions of scientific hypotheses, and their parameters are interpretable because they have a correspondence in the physical world. It is different with causation inference from data, the so-called observational inference, which constitutes a challenging problem for complex dynamical systems, from theoretical foundations to practical computational issues [2]. Granger’s causality formulation describes a form of influence on predictability (or the lack of predictability): given time-dependent observations of a free complex system, without any probing activities exercised on it, it examines whether the knowledge of one time series is useful in forecasting another time series, in which case the former can be seen and interpreted as potentially causal for the latter. The question of causation is fundamental for problems of control, policy decisions and forecasts, and there can probably be no decision explanation without revealing the causation inference of the decision-supporting system. Measures based on the Shannon entropy information-theoretic approach allow for a very general characterization of dependencies in complex and dynamical systems, from symbolic to continuous descriptions.
In analogy to Wiener-Granger causality for linear systems, transfer entropy is a way to consider questions of pairwise information transfer between nonlinear dynamical systems. However, several works have shown limitations in measuring dependence and causation. Some researchers examine the causation problem with respect to dynamical attractors and the concept of generalized synchronization. Convergent cross-mapping tests implement the examination of the so-called closeness principle. Within the framework of structural causal models (SCMs), conditions have been examined under which nonlinear models can be identified from observational data. However, this method does not always deliver unique solutions.
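The predictability view of Granger causality described in this section can be illustrated with a minimal sketch. The code below is not a full statistical test and is not the procedure of any cited work: it assumes lag-1 linear models without intercept, fitted by ordinary least squares, and compares how well a target series is forecast from its own past alone versus its own past plus the past of a second series. The function name granger_gain and the synthetic data are hypothetical.

```python
import math
import random
from typing import List, Sequence


def _center(series: Sequence[float]) -> List[float]:
    mean = sum(series) / len(series)
    return [value - mean for value in series]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


def granger_gain(source: Sequence[float], target: Sequence[float]) -> float:
    """Log ratio of residual sums of squares: restricted model over full model.

    Larger values mean that the past of `source` helps to forecast `target`.
    Lag-1 linear models only; no significance test is performed.
    """
    s, t = _center(source), _center(target)
    future, own_past, other_past = t[1:], t[:-1], s[:-1]

    # Restricted model: future ~ a * own_past
    a = _dot(future, own_past) / _dot(own_past, own_past)
    restricted = sum((f - a * p) ** 2 for f, p in zip(future, own_past))

    # Full model: future ~ b1 * own_past + b2 * other_past,
    # solved through the 2x2 normal equations.
    spp = _dot(own_past, own_past)
    sqq = _dot(other_past, other_past)
    spq = _dot(own_past, other_past)
    sfp = _dot(future, own_past)
    sfq = _dot(future, other_past)
    det = spp * sqq - spq * spq
    b1 = (sfp * sqq - sfq * spq) / det
    b2 = (sfq * spp - sfp * spq) / det
    full = sum(
        (f - b1 * p - b2 * q) ** 2
        for f, p, q in zip(future, own_past, other_past)
    )
    return math.log(restricted / full)


# Synthetic example (an assumption for illustration): y is driven by the
# previous value of x, while x is independent noise.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [0.0] + [0.8 * x[i - 1] + 0.1 * rng.gauss(0.0, 1.0) for i in range(1, 500)]

x_to_y = granger_gain(x, y)  # expected to be clearly positive
y_to_x = granger_gain(y, x)  # expected to be close to zero
print(x_to_y, y_to_x)
```

The asymmetry between the two directions is what would be read as "x is potentially causal for y" in the Granger sense; for nonlinear systems such a linear comparison can fail, which is one motivation for transfer entropy.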

8

Conclusions

ML algorithms and their implementations are inherently highly complex systems and the quality of their predictions under real-world operation conditions cannot be safely quantified. To explain the functionality of a deep-learning system


under the influence of an arbitrary input from the domain for which the system has been designed and trained is considered to be generally impossible. NN-based ML systems are mostly explained through observations of the magnitude of network activations along paths connecting their neurons, followed back to the network input. Especially popular are XAI visualizations for interpretability, which highlight those parts of an image which are most strongly correlated with the classification result (attention-based explanations). Such explanations are not always unambiguous, nor are they intuitive, repeatable or unique. Arun Das et al. (2020) [11] write about the “inability of human-attention to deduce XAI explanation maps for decision-making and the unavailability of a quantitative measure of completeness and correctness of explanation maps”. The authors recommend further developments if visualization techniques are to be used for mission-critical AI applications. Returning to causality, it has to be emphasized that causal inference from observational data is an open issue and still a subject of research. Attempts to create explainable surrogate models, for example using ODE systems (for instance neural ODE architectures for sequential data processing) and adapting the equations’ parameters with the help of ML, are subject to uncertainty and errors. If dynamical systems could be endowed with some kind of self-awareness, that is, if they could maintain an inbuilt mechanism of internal active control able to instantaneously evaluate whether the system’s state is ordered, critical or chaotic, this would empower them even to ask for human intervention.
However, the time scale on which systems undergo phase transformations and the duration of their stay in new states are beyond control, so that a request might have lost its relevance before a human specialist can react, let alone prevent undesired system decisions by forcing some alternative decision or even stopping the system. Such an option would be a contradiction in itself, because AI systems are developed and employed to produce decisions correctly and fast based on data alone, as they are intended for tasks which no human experts can efficiently perform. This applies, of course, to the cases in which the AI systems operate as desired by their developers. Another matter is the significance and the priority of explanations, for example when a new, unforeseen and therefore not assessable decision has been delivered. Returning to the creative and unique move 37 of game 2 of AlphaGo, which would have been chosen with a probability close to one in ten million, how could it ever have been possible to explain this move to someone and convince them in advance that this is indeed the right move to make in order to win the game? There is a growing need for creative and unique decisions generated by AI systems in a world of increasing complexity, to open the way to new perceptions and novel concepts. For example, could AI prevent a disaster by predicting unforeseen threats in time? In this sense many AI systems may have to stay unpredictable in order to deal with unpredictable and even chaotic circumstances, which call for unexpected solutions that inherently lack explanations building upon previous experience and already discovered knowledge.


References

1. Bianco-Martinez, E., Baptista, M.S.: Space-time nature of causality. Chaos 28, 075509 (2018). https://doi.org/10.1063/1.5019917
2. Bollt, E.M., Sun, J., Runge, J.: Introduction to focus issue: causation inference and information flow in dynamical systems: theory and applications. Chaos 28, 075201 (2018). https://doi.org/10.1063/1.5046848
3. Buhrmester, V., Münch, D., Arens, M.: Analysis of explainers of black box deep neural networks for computer vision: a survey. arXiv e-print (2019). https://arxiv.org/abs/1911.12116
4. Brownlee, J.: Confidence intervals for machine learning. Tutorial at Machine Learning Mastery (2019). https://machinelearningmastery.com/confidence-intervals-for-machine-learning/
5. Canaan, R., Salge, C., Togelius, J., Nealen, A.: Leveling the playing field - fairness in AI versus human game benchmarks. arXiv e-print (2019). https://arxiv.org/abs/1903.07008
6. Ceni, A., Ashwin, P., Livi, L.: Interpreting recurrent neural networks behaviour via excitable network attractors. Cogn. Comput. 12(2), 330–356 (2020). https://doi.org/10.1007/s12559-019-09634-2
7. Cerliani, M.: Neural networks ensemble. Posted on towards data science (2020). https://towardsdatascience.com/neural-networks-ensemble-33f33bea7df3
8. Makhijani, C.: Advanced ensemble learning techniques. Posted on towards data science (2020). https://towardsdatascience.com/advanced-ensemble-learning-techniques-bf755e38cbfb
9. Chen, T., Liu, J., Xiang, Y., Niu, W., Tong, E., Han, Z.: Adversarial attack and defense in reinforcement learning-from AI security view. Cybersecurity 2(1), 1–22 (2019). https://doi.org/10.1186/s42400-019-0027-x
10. Cui, Y., Ahmad, S., Hawkins, J.: Continuous online sequence learning with an unsupervised neural network model. Neural Comput. 28, 2474–2504 (2016). https://numenta.com/neuroscience-research/research-publications/papers/continuous-online-sequence-learning-with-an-unsupervised-neural-network-model/
11. Das, A., Rad, P.: Opportunities and challenges in explainable artificial intelligence (XAI): a survey. arXiv e-print (2020). https://arxiv.org/abs/2006.11371
12. David, J.M., Krivine, J.P., Simmons, R.: Second generation expert systems: a step forward in knowledge engineering. In: David, J.M., Krivine, J.P., Simmons, R. (eds.) Second Generation Expert Systems, pp. 3–23. Springer, Heidelberg (1993). https://doi.org/10.1007/978-3-642-77927-5_1
13. Doan, N.A.K., Polifke, W., Magri, L.: Physics-informed echo state networks for chaotic systems forecasting. In: Rodrigues, J.M.F., et al. (eds.) ICCS 2019. LNCS, vol. 11539, pp. 192–198. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22747-0_15
14. General Data Protection Regulation. https://gdpr-info.eu/
15. Intel Labs: Neuromorphic Computing - Next Generation of AI. https://www.intel.com/content/www/us/en/research/neuromorphic-computing.html
16. French, R.M.: Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3(4), 128–135 (1999). https://doi.org/10.1016/S1364-6613(99)01294-2
17. Garbin, C., Zhu, X., Marques, O.: Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimed. Tools Appl. 79, 12777–12815 (2020). https://doi.org/10.1007/s11042-019-08453-9
18. Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (2015). http://arxiv.org/abs/1412.6572
19. Grossi, E.: How artificial intelligence tools can be used to assess individual patient risk in cardiovascular disease: problems with the current methods. BMC Cardiovasc. Disord. 6 (2006). Article number: 20. https://doi.org/10.1186/1471-2261-6-20
20. Jiang, J., Lai, Y.-C.: Model-free prediction of spatiotemporal dynamical systems with recurrent neural networks: role of network spectral radius. Phys. Rev. Res. 1(3), 033056-1–033056-14 (2019). https://doi.org/10.1103/PhysRevResearch.1.033056
21. Ilahi, I., et al.: Challenges and countermeasures for adversarial attacks on deep reinforcement learning. arXiv e-print (2020). https://arxiv.org/abs/2001.09684
22. Karpathy, A.: Software 2.0. medium.com (2017). https://medium.com/@karpathy/software-2-0-a64152b37c35
23. Patrick, M.K., Adekoya, A.F., Mighty, A.A., Edward, B.Y.: Capsule networks - a survey. J. King Saud Univ. Comput. Inf. Sci. 1319–1578 (2019). https://doi.org/10.1016/j.jksuci.2019.09.014
24. Montavon, G., Binder, A., Lapuschkin, S., Samek, W., Müller, K.-R.: Layer-wise relevance propagation: an overview. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS (LNAI), vol. 11700, pp. 193–209. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_10
25. Pathak, J., et al.: Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data. Chaos 27, 121102 (2017). https://doi.org/10.1063/1.5010300
26. Raffin, A., Hill, A., Traoré, R., Lesort, T., Díaz-Rodríguez, N., Filliat, D.: Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics. In: SPiRL Workshop ICLR (2019). https://openreview.net/forum?id=Hkl-di09FQ
27. Richter, J.: Machine learning approaches for time series. Posted on dida.do (2020). https://dida.do/blog/machine-learning-approaches-for-time-series
28. Singh, A., Sengupta, S., Lakshminarayanan, V.: Explainable deep learning models in medical image analysis. J. Imaging 6(6), 52 (2020). https://doi.org/10.3390/jimaging6060052
29. Strehlitz, M.: Wir können keine Garantien für das Funktionieren von KI geben. Interview with Prof. Dr. habil. Mario Trapp, director of Fraunhofer IKS (2019). https://barrytown.blog/2019/06/25/wir-koennen-keine-garantien-fuer-das-funktionieren-von-ki-geben/
30. Su, J., Vargas, D.V., Sakurai, K.: One pixel attack for fooling deep neural networks. IEEE Trans. Evol. Comput. 23(5), 828–841 (2019). https://doi.org/10.1109/TEVC.2019.2890858
31. Tang, Y., Kurths, J., Lin, W., Ott, E., Kocarev, L.: Introduction to focus issue: when machine learning meets complex systems: networks, chaos, and nonlinear dynamics. Chaos 30, 063151 (2020). https://doi.org/10.1063/5.0016505
32. Tricentis: AI In Software Testing. AI Approaches Compared: Rule-Based Testing vs. Learning. https://www.tricentis.com/artificial-intelligence-software-testing/ai-approaches-rule-based-testing-vs-learning/
33. Verzelli, P., Alippi, C., Livi, L.: Echo state networks with self-normalizing activations on the hyper-sphere. Sci. Rep. 9, 13887 (2019). https://doi.org/10.1038/s41598-019-50158-4
34. Voit, M., Meyer-Ortmanns, H.: Dynamical inference of simple heteroclinic networks. Front. Appl. Math. Stat. (2019). https://doi.org/10.3389/fams.2019.00063
35. Zhang, Q., Wu, Y.N., Zhu, S.: Interpretable convolutional neural networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 8827–8836 (2018). https://doi.org/10.1109/CVPR.2018.00920

Use Case Prediction Using Deep Learning

Tinashe Wamambo(B), Cristina Luca, Arooj Fatima, and Mahdi Maktab-Dar-Oghaz

Anglia Ruskin University, Cambridge Campus, East Road, Cambridge CB1 1PT, UK
[email protected] {cristina.luca,arooj.fatima,mahdi.maktabdar}@aru.ac.uk

Abstract. Research into utilising text classification to analyse product reviews from e-commerce websites has increased tremendously in recent years. Machine Learning and Deep Learning classifiers have been utilised to organise, categorise and classify product reviews, enabling the identification of polarity and sentiment within product reviews. In this paper, we propose a methodology to classify product reviews using machine learning and deep learning with the intention to identify and predict the activity (use case) in which the consumer used the product they have reviewed.

Keywords: Natural Language Processing · Text classification · Machine Learning · Deep Learning

1

Introduction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 309–317, 2022. https://doi.org/10.1007/978-3-030-82193-7_20

In the modern world, the internet is the most valuable resource for learning, getting ideas, and buying and selling products and services. E-commerce retail websites such as Amazon, eBay, etc. experience a large amount of internet traffic as consumers purchase products from their websites. Every day, millions of reviews are generated by consumers as they provide feedback about products and services and their experience using them. The increased popularity of e-commerce websites and the explosion of product reviews in record-high numbers has led to increased research into sentiment analysis and text classification. Sentiment analysis (also known as opinion mining) is the process of analysing text documents to extract and understand the sentiments expressed in the text. A combination of natural language processing (NLP) with a machine learning capability (also known as text classification) is utilised to determine the polarity of a text document, i.e. identifying whether the opinion expressed in a text document is positive, negative or neutral. Text classification is also utilised to determine a text document’s sentiment orientation, i.e. identifying whether the opinion expressed in a text document is subjective or objective. Product reviews are packed full of subjective opinions because consumers provide feedback on their experience using a product or service. Millions of product reviews have been generated by consumers who have purchased a product and have had experience using and benefiting from that


product. New consumers looking to purchase the same or similar products utilise these product reviews as part of their decision-making process to make sure they make informed purchasing decisions. However, given the large number of reviews generated for a given product, the average consumer will most likely not read all of the reviews to gain a holistic view of the product and of the sentiments that other consumers who have already bought and used the product hold towards it. Currently, the quickest and easiest decision-making element tied to product reviews is the star rating: the higher the star rating, the more likely a consumer is to purchase the product. However, a star rating is ambiguous and prone to grade inflation. For example, a 4.8 star rating does not mean the product is exceptional, and the difference between a product with a 4.5 star rating and one with a 4.8 star rating could be massive, which makes it very difficult for consumers to differentiate between OK products and very good ones. Another issue with star ratings is that they are shallow: they do not provide summarised information about the sentiments expressed by consumers or about why consumers expressed such sentiments in the product reviews. Star ratings also leave a lot of room for assumptions to be made about a product’s suitability for certain tasks, e.g. a pair of shoes with a high star rating may be suitable for running but not for hiking. New consumers looking to purchase that pair of shoes would not be provided with this information in a quick and easy manner. This research aims to identify the activities other consumers used a product for, alongside the polarity and sentiment they expressed in the product reviews, using sentiment analysis, text classification and machine learning.
The intention is to provide new consumers with valuable summarised information that they can use to decide if a product is suitable for the activity they intend to perform. For this research, the activity a consumer used a product for is known as a ‘use case’, i.e. an action or activity performed using the product as described in a product review. An example of this is “I bought these boots for walking my dogs around the local park, they have been fit for purpose thus far”. The use case in this example is “walking”. The rest of the paper is organised as follows: Sect. 2 describes the related work, Sect. 3 presents the proposed approach to detect and predict a use case, Sect. 4 discusses the experiments performed and the results, and Sect. 5 draws the conclusions.

2

Related Work

This research is motivated by advancements in machine learning techniques, sentiment analysis and text classification. In many reviews, users express their opinions towards a product’s features and their user experience. So, aspect/feature based sentiment analysis is a suitable direction to pursue. Action words/terms (verbs) are of particular importance for this research, so Parts of Speech


(POS) tagging will be considered for feature extraction. Many machine learning approaches have been implemented over the years; Support Vector Machines (SVM) and Naive Bayes are particularly popular for performing text classification with high accuracy and for dealing with large datasets.

2.1

Parts of Speech

Part of Speech (POS) tagging has been used for feature extraction. Devi et al. [1] performed sentence-level classification to detect words tagged as nouns, because aspects/features are usually described by nouns or noun phrases. Alfrjani et al. [2] applied POS tagging to determine whether tokens in reviews were nouns, verbs, adjectives, adverbs, etc., with the intention of extracting the POS tags as token features. Hemmatian and Sohrabi [7] utilised frequency-based nouns as well as order- and similarity-based filtering to improve feature extraction using POS tagging. Devi et al. [1] used POS tagging to extract product features from product reviews, whereas Alfrjani et al. [2] used POS tagging to categorise words in a review as part of an integration process between a semantic domain ontology and natural language processing. Likewise, Hemmatian and Sohrabi [7] used POS tagging to extract features described as nouns or noun phrases. However, their approach included word frequency, whilst Devi et al. [1] identified grammatical dependencies between words in sentences. Feature extraction strategies used by Devi et al. [1] and Hemmatian and Sohrabi [7] have been considered for this research because product features are expected to be described as nouns or noun phrases, so POS tagging is vital.

2.2

Deep Learning

Deep learning is a branch of machine learning that aims to enable machines to learn and evolve, similar to the way humans learn from their memories and experiences throughout their lifetime. Instead of using predefined equations, deep learning sets up “basic parameters about the data and trains the computer to learn on its own by recognizing patterns” [14] using neural networks. A neural network consists of “multiple hidden layers that can learn increasingly abstract representations of the input data” [5] using weights that are adjusted during training [3] to produce better predictions. Parvathi and Jyothis [11] proposed a text classification strategy that involved using a Convolutional Neural Network (CNN) to identify relevant text to determine the category a document belonged to. The proposed strategy used “layers of neurons and a bag of words approach” [11] to analyse text documents. In their conclusions, Parvathi and Jyothis [11] reported accurate and positive results. However, they noted that deep learning models have a limitation of learning through observation which means they only contain knowledge provided in the training data instead of learning in a generalised manner.


Parwez et al. [12] highlighted that traditional machine learning models suffered from a limitation of “relying on the bag of words representation of documents to generate features in which word order and context are ignored” [12], which could cause data sparsity problems. They proposed a neural network architecture that involved Convolutional Neural Networks (CNN) to classify short text documents, e.g. tweets. The CNN models used a combination of generic and domain-specific word embeddings to predict class labels, whilst considering the contextual information within text documents. Results showed that CNN models outperformed traditional machine learning models in terms of validation accuracy and optimal feature generation, which was used to analyse unlabelled text documents. Parwez et al. [12] concluded that the proposed approach could be used to perform social media surveillance focused on predicting disease outbreaks. Kolekar and Khanuja [8] compared machine learning algorithms and neural networks to find the better approach to classifying the polarity of tweets. They applied the Term Frequency and Inverse Document Frequency (TF-IDF) word embedding technique to the tweets and fed the features to Support Vector Machines (SVM), Naive Bayes and a Convolutional Neural Network (CNN) developed using Keras and TensorFlow. Results showed that “using deep learning approach has given better result compare to traditional machine learning technique like SVM and NB” [8]. Subramani et al. [15] implemented a neural network based topic modelling architecture to analyse text documents and cluster highly similar documents together. They called this the Neural Topic Modelling approach. Their architecture used Latent Dirichlet Allocation (LDA), Keras and TensorFlow.
According to the researchers, the approach provides a scalable and unsupervised learning framework that accurately discovers topics for a text corpus by considering the “semantic meanings of the words ensuring the usefulness of the topics” [15]. Testing with short and large text documents showed positive results, leading Subramani et al. [15] to conclude that their proposed topic modelling approach had real-world applications, i.e. movie recommendations and news clustering. To identify similar documents based on the semantic meaning of their text, Mo and Ma [10] built DocNet, a clustering system that combined word embedding vectors, a deep neural network and Euclidean distance. The expectation was for similar documents to have “small distances among their vectors while distinct document have large distances” [10]. Results showed this to be true, with DocNet performing better than traditional clustering techniques, i.e. TF-IDF and K-means clustering. However, Mo and Ma [10] stated that DocNet’s performance heavily depended on the similarity of “data distribution between data fed to DocNet and data to classificaton or clustering” [10], which means the DocNet system will most likely perform poorly on new datasets. At the Google I/O conference in 2019, Sara Robinson [13], a developer advocate at Google, presented a text classification model that predicted the topic of a Stack Overflow question. As part of pre-processing the train/test dataset,


words that specified the topic of the Stack Overflow question within the text were replaced with the word avocado. This was done to prevent the machine learning model from using signal words to perform classification, so that it would instead generalise by finding patterns within the dataset, because some Stack Overflow questions may not specify the topic. A bag of words approach was used to encode words into matrices, applying a multi-hot encoding technique to convert the Stack Overflow questions into vocabulary-size arrays. The training labels were also converted into a multi-hot array because the model was to be able to identify and predict multiple labels. A deep learning neural network was developed that took the bag of words matrices as input data and fed them into hidden layers. The output layer of the neural network used a sigmoid function to compute the model’s output. Sigmoid returns a value between zero and one for each label, which corresponds to the probability of the label being associated with the Stack Overflow question. To develop the deep learning text classification neural network, Sara Robinson [13] used Pandas, Scikit-learn and Keras to pre-process the data, applied an 80/20 train/test split to the dataset and built the neural network model architecture using TensorFlow. The model had 95% accuracy. The techniques applied in various research in sentiment analysis and text classification have been primarily focused on improving the classification of text to extract features, polarity, opinions and emotions expressed towards products by consumers. This has been valuable for understanding the sentiment expressed by consumers; however, it has not been able to identify why consumers hold those sentiments towards products or the activities consumers have used the products for, which is a major limitation.
This research aims to resolve this limitation by understanding why consumers express particular opinions towards products, through identifying the activities consumers have used the products for.
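The signal-word masking and multi-hot bag of words encoding described for [13] can be illustrated with a minimal sketch. It uses only the Python standard library; the example questions, the signal-word list and all function names are assumptions made for illustration and are not taken from the presentation or from this research.

```python
import re


def mask_signal_words(text, signal_words, placeholder="avocado"):
    """Replace every signal word with a neutral placeholder token."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, signal_words)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: index for index, tok in enumerate(tokens)}


def multi_hot(text, vocabulary):
    """Return a vocabulary-sized 0/1 vector marking which words occur in the text."""
    vector = [0] * len(vocabulary)
    for tok in text.lower().split():
        if tok in vocabulary:
            vector[vocabulary[tok]] = 1
    return vector


# Hypothetical questions and signal words, for illustration only.
questions = [
    "how to merge dataframes in pandas",
    "keras model does not converge",
]
masked = [mask_signal_words(q, ["pandas", "keras"]) for q in questions]
vocab = build_vocabulary(masked)
encoded = [multi_hot(q, vocab) for q in masked]
```

Each encoded row has one column per vocabulary word, so every question becomes a fixed-length input regardless of its original length.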

3 Proposed Approach

The proposed approach is a deep learning neural network developed to perform text classification on product reviews in order to detect and predict a use case. It is similar to the approach presented by Sara Robinson [13] at the 2019 Google I/O conference. However, key signal words that specify the use case within the product review text have not been replaced with the word avocado. Multi-hot encoding has been applied to the product reviews and their labels, producing a dataset of matrices. The product reviews used for this research (extracted from Amazon) have only one label each; therefore, the model does not classify on multiple labels as in [13], where the Stack Overflow questions being classified had multiple labels. A 90/10 train/test split has been applied to the dataset for this iteration, whereas Sara Robinson [13] applied an 80/20 train/test split to her dataset. As in [13], TensorFlow has been used to build, train and test the deep learning neural network model.

T. Wamambo et al.

This approach focuses on the use cases listed below:

– Run
– Walk
– Hike
– Climb
– Swim
– Unknown.

In total, 30,000 product reviews have been extracted from Amazon to train and test the proposed model: 5000 reviews for each of the use cases in focus, so that the train/test dataset is balanced across classes. The product reviews have been pre-processed and analysed: spelling mistakes have been corrected and stop words have been removed using NLTK’s natural language processing capabilities. TextBlob, a Python library that sits on top of NLTK, has been used to extract the features from which the multi-hot encoded bag of words matrices are created. The MultiLabelBinarizer class, which is part of the Scikit-learn library, has been used to multi-hot encode the labels. Part-of-speech (POS) tagging has been used to identify and extract nouns, noun phrases and verbs as features, which differs from the approach proposed by Devi et al. [1], who extracted only nouns. The extracted verbs describe the actions performed as denoted in the review text, and the activity (use case) in which the consumer used the product is expected to be described by these verbs in some way.

NLTK is referred to as “The Conqueror” in EliteDataScience [4]. It is a “leading platform for building Python programs to work with human language” [16] that provides an easy-to-use interface to a suite of text processing libraries. Many breakthroughs in analysing and processing text have been made using NLTK, as it is “responsible for conquering many text analysis problems” [4]. TextBlob, referred to as “The Prince” in [4], is a text processing library that “sits on the mighty shoulders of NLTK” [4]. It provides a “simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification” [9] and it also has a “gentle learning curve while boasting a surprising amount of functionality” [4]. TextBlob also allows NLTK tools to be used alongside its own, giving access to the NLTK toolkit and all of its benefits.

The deep learning neural network has been created using the Sequential class, which is part of the Keras library. Three Dense layers, which are also part of the Keras library, have been added to the network to classify each input matrix. The multi-hot encoded bag of words matrices and the multi-hot encoded labels have been provided as input to the neural network. The network is trained on the training data matrices over five epochs, meaning that it passes through the entire training dataset five times.
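To make the data flow through a stack of three fully connected layers concrete, the following is a minimal standard-library sketch of a forward pass. It is not the Keras model used in this research: the layer widths, the ReLU hidden activations, the sigmoid output, the random initial weights and the toy input are all assumptions chosen for illustration, and no training is performed.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b) for every output unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
VOCAB_SIZE, HIDDEN_UNITS, N_USE_CASES = 8, 4, 6  # assumed sizes
layers = [
    init_layer(VOCAB_SIZE, HIDDEN_UNITS, rng),
    init_layer(HIDDEN_UNITS, HIDDEN_UNITS, rng),
    init_layer(HIDDEN_UNITS, N_USE_CASES, rng),
]


def predict(multi_hot_review):
    """Pass one multi-hot encoded review through the three dense layers."""
    hidden = dense(multi_hot_review, *layers[0], relu)
    hidden = dense(hidden, *layers[1], relu)
    return dense(hidden, *layers[2], sigmoid)


scores = predict([1, 0, 0, 1, 0, 1, 0, 0])
```

The output contains one score between zero and one per use case; during training, an epoch corresponds to running every training review through this forward pass once and updating the weights accordingly.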

4 Experiments and Results

4.1 Datasets Description

Amazon product reviews have been used as training/testing data for this research because Amazon holds a large amount of free-text product reviews, owing to its vast product range and user base. An extensive dataset¹ containing millions of Amazon product reviews has been used. Approximately 5 million reviews from the clothing and shoes categories have been retrieved.

4.2 Metrics

Table 1. Neural network metrics

Accuracy    Precision    Recall
90%         96%          44%

Table 1 shows the accuracy, precision and recall of the neural network model. The model has high accuracy and high precision, but low recall. This means that roughly 96 out of 100 positive predictions are correct (precision), whereas only about 44 out of 100 actual positives are identified by the model (recall). As a result, the model may not generalise well and is likely to miss a significant number of correct labels. According to Google Developers [6], “improving precision typically reduces recall and vice versa” because of the tension between the two metrics. The good news is that the neural network is not biased towards positive predictions, since such a bias would show up as low precision and very high recall.

Table 2. Neural network performance

Accurately predicted use cases    Inaccurately predicted use cases
61%                               39%

Table 2 shows the performance of the neural network classifier when it is exposed to 1200 brand new product reviews extracted from the extensive dataset described in Sect. 4.1.

1 https://nijianmo.github.io/amazon/index.html

Even though the neural network classifier has a very high classification accuracy, as shown in Table 1, it did not predict as many use cases correctly as expected. The classifier accurately predicts the use case for the majority of the 1200 product reviews. This is a positive reflection of the neural network’s performance, given that it classified completely new product reviews and accurately predicted the use case for a relatively large share of them despite its low recall. The most likely reason the neural network fails to predict the use case for a larger share of product reviews is that the classifier does not generalise well enough.
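The three metrics reported in Table 1 follow from the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The sketch below shows the standard definitions; the counts are hypothetical values chosen only so that the results land close to Table 1, and they are not the confusion counts of the model described here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard accuracy, precision and recall from confusion counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are right
        "precision": tp / (tp + fp),     # share of positive predictions that are right
        "recall": tp / (tp + fn),        # share of actual positives that are found
    }


# Hypothetical counts, picked to approximate the figures in Table 1.
metrics = classification_metrics(tp=44, fp=2, fn=56, tn=478)
```

With these counts, few positive predictions are wrong (high precision) while more than half of the actual positives are missed (low recall), which is the pattern discussed above.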

5 Conclusions

This paper proposes an approach that develops a deep learning neural network to classify product reviews and predict the activity (use case) in which the consumer used the product. Natural language processing techniques, text classification and machine learning have been applied to develop the neural network. As shown by the metrics, the neural network has high accuracy and high precision; however, its low recall indicates that it misses many correct labels (false negatives). This was evidenced by exposing the neural network to completely new product reviews it had never consumed. The neural network accurately predicted the use case for the majority of the product reviews; however, approximately 40% of the product reviews were incorrectly classified, which is a significant share for a neural network to misclassify. The aim of this research is to classify product reviews and accurately predict the activity (use case) for which a consumer used the product, as described in the review text. Results and metrics from testing the neural network show that text classification recall needs to be improved to limit the prevalence of false predictions and to make sure that accurate predictions are reliably produced as output. This will be the focus of further work within this research.

References

1. Devi, D.V.N., Kumar, C.K., Prasad, S.: A feature based approach for sentiment analysis by using support vector machine. In: 2016 IEEE 6th International Conference on Advanced Computing (IACC) (2016). https://doi.org/10.1109/IACC.2016.11
2. Alfrjani, R., Osman, T., Cosma, G.: A new approach to ontology-based semantic modelling for opinion mining. In: 2016 UKSim-AMSS 18th International Conference on Computer Modelling and Simulation (UKSim) (2016). https://doi.org/10.1109/UKSim.2016.15
3. Allibhai, E.: Building A Deep Learning Model using Keras (2018). https://towardsdatascience.com/building-a-deep-learning-model-using-keras-1548ca149d37
4. EliteDataScience: 5 Heroic Python NLP Libraries (2017). https://elitedatascience.com/python-nlp-libraries
5. EliteDataScience: Keras Tutorial: The Ultimate Beginner’s Guide to Deep Learning in Python (2018). https://elitedatascience.com/keras-tutorial-deep-learning-in-python
6. Google Developers: Machine Learning Crash Course (2020). https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
7. Hemmatian, F., Sohrabi, M.K.: A survey on classification techniques for opinion mining and sentiment analysis. In: Artificial Intelligence Review (2017). https://link.springer.com/article/10.1007/s10462-017-9599-6
8. Kolekar, S.S., Khanuja, H.K.: Tweet classification with convolutional neural network. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (2018). https://doi.org/10.1109/ICCUBEA.2018.8697397
9. Loria, S.: TextBlob: Simplified Text Processing (2018). https://textblob.readthedocs.io/en/dev/
10. Mo, Z., Ma, J.: DocNet: a document embedding approach based on neural networks. In: 2018 24th International Conference on Automation and Computing (ICAC) (2018). https://doi.org/10.23919/IConAC.2018.8749095
11. Parvathi, P., Jyothis, T.S.: Identifying relevant text from text document using deep learning. In: 2018 International Conference on Circuits and Systems in Digital Enterprise Technology (ICCSDET) (2018). https://doi.org/10.1109/ICCSDET.2018.8821192
12. Parwez, MD., A., Abulaish, M., Jahiruddin: Multi-label classification of microblogging texts using convolution neural network. In: IEEE Access, vol. 7 (2019). https://doi.org/10.1109/ACCESS.2019.2919494
13. Robinson, S.: Live Coding A Machine Learning Model from Scratch (Google I/O’19), Google Cloud Platform (2019). https://www.youtube.com/watch?v= RPHiqF2bSs
14. SAS Institute: Deep Learning What it is and why it matters (2020). https://www.sas.com/en_us/insights/analytics/deep-learning.html
15. Subramani, S., Sridhar, V., Shetty, K.: A novel approach of neural topic modelling for document clustering. In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI) (2018). https://doi.org/10.1109/SSCI.2018.8628912
16. The NLTK Project: Natural Language Toolkit (2017). https://www.nltk.org/

VAMDLE: Visitor and Asset Management Using Deep Learning and ElasticSearch

Viswanathsingh Seenundun1, Balkrishansingh Purmah1, and Zahra Mungloo-Dilmohamud2(B)

1 Accenture Technology, Ebene, Mauritius
2 Department of Digital Technologies, FoICDT, University of Mauritius, Reduit, Moka, Mauritius

Abstract. Visitor management and asset management are crucial in restricted places. This paper focuses on how artificial intelligence and image recognition can drive innovation in both visitor and asset management by employing the latest advances in technology. The proposed solution, VAMDLE, is an Android application which uses deep transfer learning and Elasticsearch to facilitate the registration of visitors as well as the management of borrowed assets. TensorFlow was used to retrain a pre-trained model for asset image recognition, and the new model was integrated into the Android application with the aid of the TFLite library. A RESTful web API was developed with the aid of Spring Boot to manage all the data used by the client application. The unique identifiers of the assets and of the employees were read and recognized using text recognition and regular expressions, and Elasticsearch was used to automatically fill in forms. The use of these tools and technologies resulted in an app with a simple interface, a very good classification accuracy and a good average response time. The proposed system registered a classification accuracy of up to 97%.

Keywords: Asset management · Visitor management · Deep learning · ElasticSearch

1 Introduction

Visitor management in protected areas requires information about the visitors, such as who they are, where they are on the premises and how many there are. When there is a large number of visitors, capturing this information as quickly as possible becomes imperative. Asset management systems require information about the assets, such as the type of equipment, the date purchased, the availability of the asset and who the asset is currently assigned to. Often these two systems are disparate, but where both tasks are to be performed at the reception desk or kiosk, and as quickly as possible, it is much easier and more efficient to have a single application managing both. The use of technology coupled with machine learning and open source search and analytics engines can help to speed up these processes.

Deep learning, a subcategory of machine learning, has been used in many diverse fields. It learns by using multiple layers to gradually extract higher level features from the raw input [1]. One of the areas where deep learning excels is image classification [2]. Examples where deep learning has been used for the classification of images include crack detection in civil engineering structures [3], image detection in autonomous driving [4], visual search that lets someone scan a product with a mobile phone after seeing it in a store or in a magazine and order it instantly, and medical image analysis [5]. Deep transfer learning, a more recent technique, is also used extensively in image analysis; here the learning process does not start from scratch but from patterns obtained while solving a different problem. The resulting model is then applied to a new field.

In this research work, the use of deep learning together with ElasticSearch for a single Android app handling both visitor and asset management was investigated. Although deep learning has been used extensively for image classification, it has not been used in the context of an app for both visitor and asset management. The relevant literature, the design of the system and the resulting app are detailed in the sections that follow.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 318–329, 2022. https://doi.org/10.1007/978-3-030-82193-7_21

2 Background

2.1 Visitor Management and Asset Management

Visitor management software systems electronically track and record the usage of any public or private building or site [6]. They are used to increase security, to ensure that visitors comply with the site’s rules and regulations, to know who is on site and, finally, to impress visitors by enhancing the company image. They make the visitor sign-in process more efficient, accurate and consistent. Such software is usually installed on a self-service kiosk or on a device such as a PC or tablet at the reception desk. Examples of visitor management systems are Envoy [7], Proxyclick [8], SwipedOn [9] and Trackforce Valiant [10]; although they provide many advantages, they do not allow for asset management.

Asset management systems, also known as enterprise asset management systems, electronically track and record the equipment and inventory used in a company. Examples include Asset Panda [11], EZOfficeInventory [12], GoCodes Asset Management [13] and UpKeep [14]. However, most asset management systems are built around advanced features rather than simple tasks such as the borrowing and returning of assets. Furthermore, asset management tools do not have visitor management features.

2.2 CNN and MobileNet

Convolutional Neural Networks (CNNs) are deep learning algorithms that take an image as input, assign importance (learnable weights and biases) to various aspects or objects in the image and are able to differentiate one from the other (Fig. 1). The architecture of a CNN differs from that of a regular neural network, which transforms an input by passing it through several hidden layers (each consisting of a set of neurons), where each layer is fully connected to all neurons of the previous layer and a final output layer represents the predictions. In a CNN, the layers are organized in three dimensions (width, height and depth) [15].

Fig. 1. The different CNN layers [16]

MobileNet is a CNN architecture designed for mobile and embedded vision applications [16]. MobileNets are based on a restructured architecture that uses depthwise separable convolutions, each made up of two layers: a depthwise convolution and a pointwise convolution. A standard convolution filters and combines inputs into a new set of outputs in a single step, whereas a depthwise separable convolution splits this into two separate layers, one for filtering and one for combining. This results in significantly reduced computation and model size [16]. MobileNet has been pre-trained on the ImageNet dataset [17], which contains more than 14 million images to date [18].

Deep learning neural networks can be customized by setting variables known as hyper-parameters, which determine the network structure and how the network is trained. Hyper-parameters include the activation functions, the loss function, the probability distribution function, the number of training epochs and iterations, the learning rate, the optimizer and the regularization settings; they are tuned to avoid overfitting or underfitting [19]. These values are set before training, that is, before the weights and biases are optimized. Although the use of CNNs and deep learning has resulted in improved accuracy for many tasks, these models depend on the availability of very large computational resources, which in turn require substantial energy consumption. Hence, they are costly to train and develop, both financially and environmentally [20].

2.3 Deep Transfer Learning

Deep transfer learning is a deep learning method where a model, initially built for a specific task, is reused as the base for another model performing a different task [21]. Many state-of-the-art deep learning models exist and may be repurposed for other tasks in other domains. The use of transfer learning is widespread since models can be built more quickly while still providing good accuracy. Using pre-trained models also lowers hardware requirements compared to training a model from scratch. Several factors need to be considered when determining the best approach to repurposing a model, including the available computational power and the size and similarity of the dataset.
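The computational saving of the depthwise separable convolutions described in Sect. 2.2 can be illustrated by counting multiply-accumulate operations, following the cost model given in the MobileNets paper [16]. The layer dimensions below are arbitrary example values, not those of any layer used in this work.

```python
def standard_conv_cost(kernel, in_channels, out_channels, feature_map):
    """Multiply-adds of a standard convolution that filters and combines in one step."""
    return kernel * kernel * in_channels * out_channels * feature_map * feature_map


def depthwise_separable_cost(kernel, in_channels, out_channels, feature_map):
    """Multiply-adds of a depthwise (filtering) plus a pointwise 1x1 (combining) layer."""
    depthwise = kernel * kernel * in_channels * feature_map * feature_map
    pointwise = in_channels * out_channels * feature_map * feature_map
    return depthwise + pointwise


# Example layer: 3x3 kernel, 64 input channels, 128 output channels, 56x56 feature map.
k, m, n, f = 3, 64, 128, 56
ratio = depthwise_separable_cost(k, m, n, f) / standard_conv_cost(k, m, n, f)
```

The ratio equals 1/n + 1/k², so with a 3x3 kernel the separable form needs roughly eight to nine times fewer operations than the standard convolution.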

Figure 2 shows the different steps when using Transfer Learning. As can be seen from the figure, deep learning models are first trained on another dataset for another problem. In this case, the ImageNet dataset which contains 1.2 million images was used. The weights learned during the previous training are transferred to a new training process.

Fig. 2. Overview of transfer learning [29]

2.4 ElasticSearch

ElasticSearch is a real-time distributed search and analytics engine built on Apache Lucene, which has been described as the most advanced, high-performance and fully featured search engine library [22]. It is used for full-text search, structured search, analytics or any combination of these. It is used by Wikipedia to provide full-text search with highlighted search snippets, by GitHub to query billions of lines of code and by e-commerce websites for personalization [23]. Although full-text search, analytics systems and distributed systems are not new, ElasticSearch successfully combines them and is easy to use [22]. ElasticSearch can be accessed through a simple RESTful API.

2.5 High Performance Computing

High Performance Computing (HPC) refers to the ability to process data and perform complex calculations at high speeds. HPC solutions have three main components: compute, network and storage. A supercomputer is an example of an HPC solution. An HPC architecture comprises compute servers which are networked together into clusters, with each server being called a node. The nodes work in parallel to boost the processing speed. Each cluster is connected to a data storage where the output is captured. For the purpose of this research, only some assets have been tested, but in a real-life deployment many more images will need to be processed and hence high computational resources will be needed. Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks such as image rendering. However, the last few years have seen a major development in HPC, with GPUs evolving from simple graphics cards into a platform for HPC. GPUs have shown much better performance than both multi-core and single-core CPUs for different kinds of applications [24, 25].

3 Design

3.1 Architectural Design

The architecture of the proposed system, VAMDLE, is shown in Fig. 3. First, images from the training dataset are fed to TensorFlow for training using the MobileNet architecture. The output of the training is a file in ‘.pb’ format. This file is then converted to the ‘.tflite’ format, which is faster and has a relatively small model size, without any noticeable loss in accuracy. The converted file is exported to the Android application, where it is used for real-time image recognition. Text detection and recognition are performed using Google’s ML Kit library. The Android application communicates with the Ngrok server, where the REST API is hosted, through HTTPS requests, and the response is obtained in JSON format.

Fig. 3. Architectural design of VAMDLE.

3.2 UI and UX Design

Nowadays, when designing a system, both user interface (UI) and user experience (UX) have to be considered. According to the ISO standards [26], UX is defined as “the combined experience of what a user feels, perceives, thinks, and physically and mentally reacts to before and during the use of a product or service”. The interface has therefore been designed using all recommended UI and UX standards. Balsamiq WireFrames [27] was used to design all user interfaces, and two such wireframes are shown in Fig. 4.

Fig. 4. WireFrames of UI while scanning an asset and displaying the results

4 Implementation and Evaluation

VAMDLE is expected to be used to scan the National ID (NID) of visitors to the building or office, or to scan the asset that users are borrowing or returning at the reception desk of the same office or building. The app should allow receptionists to log in, to register visitors by scanning their NID and having all fields filled in automatically, to check out visitors when they are leaving, to view the time at which a visitor was registered, to view all visitors who are currently on the premises, to assign an asset to an employee and to record returned assets. The system should be able to identify an asset based on its text label.

4.1 Dataset

For asset management, the assets most commonly borrowed at the reception desk of a company were identified. These are HDMI converters, USB transmitters, meeting room speakers and projectors. Thirty images of each of these assets were therefore taken from different angles and used to build the dataset for training the transfer learning model.

4.2 Image Pre-Processing and Data Augmentation

Abundant, high-quality data are crucial to the successful implementation of deep learning models [28] and are considered as important as the algorithms themselves. Hence, data augmentation is often used in deep learning. Data augmentation refers to the application of one or more deformations to an annotated dataset, which results in new, additional training data [29]. Data augmentation significantly increases the diversity of the available data without actually having to collect new data or take new photos.

This technique can help to resolve data imbalance and can increase the overall accuracy of a model. Therefore, an initial step in this work was data augmentation. Different augmentation techniques exist, as shown in Fig. 5, and for this project the basic image manipulations listed in Table 1 were used. These techniques were not applied individually; instead, a random combination of them was applied to each image, with the objective of generating different images each time.

Table 1. Summary of applied data augmentation techniques.

#   Technique            Parameters
1   Image Flip           Horizontal
2   Random Rotation      Range: 18°
3   Random Translation   Range: −10 to 10 pixels
4   Shear: Right         Range: −10 to 10 pixels
5   Shear: Bottom        Range: −10 to 10 pixels
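The idea of applying a random combination of the manipulations in Table 1 to each image can be sketched as follows. This is a simplified standard-library illustration, not the pipeline used in this work: images are represented as nested lists of pixel values, only the flip and translation from Table 1 are implemented (rotation and shear are omitted for brevity), and the 50% application probability is an assumption.

```python
import random


def flip_horizontal(image):
    """Mirror every row of the image left to right."""
    return [row[::-1] for row in image]


def translate(image, dx, dy, fill=0):
    """Shift the image by (dx, dy) pixels, filling uncovered pixels with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            if 0 <= new_y < height and 0 <= new_x < width:
                shifted[new_y][new_x] = image[y][x]
    return shifted


def augment(image, rng, max_shift=10):
    """Apply a random subset of the transformations to one image."""
    if rng.random() < 0.5:  # assumed probability
        image = flip_horizontal(image)
    if rng.random() < 0.5:  # assumed probability
        image = translate(
            image,
            rng.randint(-max_shift, max_shift),
            rng.randint(-max_shift, max_shift),
        )
    return image


tiny_image = [[1, 2, 3], [4, 5, 6]]
rng = random.Random(42)
samples = [augment(tiny_image, rng, max_shift=1) for _ in range(5)]
```

Because each transformation is drawn independently, repeated calls on the same source image can yield different variants while the original image is left unchanged.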

Once the dataset was ready, it was split into training and testing datasets. The training dataset was used to train the model, and the testing dataset was used to validate and optimize the model by adjusting the hyper-parameters.

4.3 Deep Transfer Learning Model

The deep learning model was trained on Ubuntu (Linux), where a virtual environment was created and the TensorFlow pip package was installed together with Python 3. The pre-trained MobileNet was retrained to adapt it to the problem of recognizing images of different assets. Since transfer learning is used, only the final layer of the neural network was trained. The MobileNet architecture can be configured to produce optimum results by changing the values of its parameters: the input image resolution, which can be 128, 160, 192 or 224 px, the relative model size (for example 1.0, 0.75, 0.50 or 0.25) and the number of training steps. MobileNet was tested with different values of these parameters. Predictably, using higher resolution images results in longer processing times with better performance, so a balance has to be struck between processing time and performance. The classification accuracy, the area under the receiver operating characteristic curve (AUROC) and the Mean Absolute Error (MAE) were used as performance metrics to assess the deep learning model. It was found that an image resolution of 224, a relative model size fraction of 0.5 and 4000 training steps gave optimal results, with an acceptable retraining time and very accurate results. Figure 6 shows how TensorFlow was integrated into the whole system.
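The principle of retraining only the final layer on top of a frozen pre-trained base can be shown with a deliberately small sketch. It does not use MobileNet or TensorFlow: the `frozen_features` function is a hypothetical stand-in for the pre-trained network, the task is reduced to two classes with a logistic output, and the data, learning rate and step count are assumptions for illustration.

```python
import math


def frozen_features(x):
    """Stand-in for the pre-trained base network; it is never updated during training."""
    return [x[0] + x[1], x[0] - x[1], 1.0]  # last entry acts as a bias feature


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_final_layer(samples, labels, steps=200, learning_rate=0.5):
    """Fit only the weights of the final layer with batch gradient descent."""
    features = [frozen_features(x) for x in samples]  # computed once, base is frozen
    weights = [0.0] * len(features[0])
    losses = []
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        loss = 0.0
        for feats, label in zip(features, labels):
            prob = sigmoid(sum(w * f for w, f in zip(weights, feats)))
            loss -= label * math.log(prob + 1e-12) + (1 - label) * math.log(1 - prob + 1e-12)
            for i, f in enumerate(feats):
                gradient[i] += (prob - label) * f
        weights = [w - learning_rate * g / len(features) for w, g in zip(weights, gradient)]
        losses.append(loss / len(features))
    return weights, losses


# Toy two-class data, for illustration only.
toy_samples = [[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]
toy_labels = [0, 0, 1, 1]
head_weights, loss_history = train_final_layer(toy_samples, toy_labels)
```

Since the base features are computed once and reused, each training step only touches the small final layer, which is why retraining is much cheaper than training the whole network from scratch.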

Fig. 5. Image augmentation techniques [30]

4.4 Android Application

Since both visitor and asset management are sensitive and need to be secure, the application can only be accessed using uniquely allocated credentials. Once the user has been authenticated, the various functionalities of the system can be accessed. Registration of visitors is done by using the mobile device’s camera to scan the text on an official identity document of the visitor, which can be a passport or national ID card. Using regular expressions, the name and ID number of the visitor were extracted from the scanned text and used to auto-fill the registration form. Assigning and returning an asset was also implemented with the text reader, which scans the text label on the asset; spelling errors were handled by ElasticSearch in the backend. Because assets are used heavily, their text labels are often damaged or no longer readable. Therefore, an alternative was provided in which the system identifies the asset by detecting its shape using TensorFlow and deep transfer learning. With this technique, there is no need to change the asset’s label or buy additional hardware. Employees borrowing assets need to be identified, and this was done by scanning the text on the employee’s card and reading the name. Once again, spelling mistakes are handled using ElasticSearch in the backend. This resulted in a rapid process, since the employee simply needs to show the card to the camera and the system processes the information automatically.

The mobile application was implemented using the Android Studio Integrated Development Environment (IDE). The CameraSource library was used to manage the camera on the phone or tablet running the app. The camera works with a detector which receives frames from the camera at a specified rate. The preview frames were processed as quickly as possible to minimize any lag in the application. The model trained using deep transfer learning is executed by the Android application using the org.tensorflow:tensorflow-lite library. This library attaches a score and a label to each preview frame, which determine which asset best matches the real-time image. Google Mobile Vision, which is part of the Google ML Kit, was used since it offers a reliable and robust system for reading text from real-time images or video streams. Each time text was detected, the Text Recognition API was used to determine the text in each block and to segment it into lines and words.

The captured data is sent to the backend as JSON objects through HTTP requests. The Retrofit library was used to ensure a safe connection between the Android application and the REST API. Gson was used to convert Java objects into their JSON format before sending the request to the API, and to convert the JSON response from the API into Java objects to be used by the application.

Spring Boot was used to implement the REST API. ElasticSearch was integrated, and it asynchronously updates its cluster every 24 h to make sure that any CRUD operations performed in the database are reflected in the cluster. The REST API was used to communicate with both the ElasticSearch cluster and the MySQL database to obtain the required information. Spring Tool Suite 3 (STS), which is an extension of the Eclipse IDE, was used to implement the backend. MySQL-connector was used to connect to the MySQL database at runtime, and the Java Persistence API (JPA) was used to ease the management of data between the Java objects and the database.
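The two text-handling steps described in this section, extracting fields from scanned text with regular expressions and tolerating spelling errors through ElasticSearch, can be sketched as follows. The sketch is written in Python for brevity rather than in the Java used by the app. The ID format, the `Name:` layout of the scanned text, the field name `label` and all function names are assumptions for illustration; the only ElasticSearch feature relied on is the `fuzziness` option of the standard `match` query, and the query body is built but not sent to any cluster.

```python
import json
import re

# Assumed layouts: a "Name:" line and a 14-character ID starting with a letter.
NAME_PATTERN = re.compile(r"Name:\s*([A-Za-z' -]+)")
ID_PATTERN = re.compile(r"\b[A-Z][0-9A-Z]{13}\b")


def extract_visitor_fields(scanned_text):
    """Pull the visitor's name and ID number out of OCR text, if present."""
    name = NAME_PATTERN.search(scanned_text)
    id_number = ID_PATTERN.search(scanned_text)
    return {
        "name": name.group(1).strip() if name else None,
        "id_number": id_number.group(0) if id_number else None,
    }


def fuzzy_match_query(field, value):
    """Build an ElasticSearch match query body that tolerates small spelling errors."""
    return {"query": {"match": {field: {"query": value, "fuzziness": "AUTO"}}}}


scanned = "Name: JOHN DOE\nID No: D0101901234567"  # made-up OCR output
visitor = extract_visitor_fields(scanned)
query_body = json.dumps(fuzzy_match_query("label", "projecter"))
```

In such a setup, the extracted fields would pre-fill the registration form, while the fuzzy query lets a misread label such as "projecter" still match the indexed asset name.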

Fig. 6. Integration of Tensorflow into the System

4.5 Evaluation of the Proposed System

The app that was implemented was compared to some of the most used visitor and asset management tools currently on the market. The results are shown in Table 2. The proposed system can be used for both visitor and asset management at the various reception desks or registration kiosks at a site.

Table 2. Evaluation of the Proposed System with Existing Visitor and Asset Management Tools

Columns (grouped under Visitor Management Apps and Asset Management Apps): VAMDLE, Envoy, Proxyclick, SwipedOn, TractionGuest, Assetpanda, EZOfficeInventory, GoCodes Asset Management, UpKeep
Rows: Visitor Check-in, Dashboard Analytics, Asset Registration, Asset Status Tracking, Authentication

5 Conclusion

In this work, a novel visitor and asset management system, VAMDLE, which makes use of deep transfer learning and Elasticsearch, has been presented. An in-depth literature survey was conducted on the state of the art in the field, findings were analyzed and an architecture design for an efficient model was proposed. The system, which consists of a number of different parts (the deep transfer learning model, the text analyzer using ElasticSearch and the mobile application), was then implemented. The effect of different parameter values on the performance metrics and on the time taken to complete a task was compared, and it was found that an image resolution of 224, a relative model size fraction of 0.5 and 4000 training steps gave the best results, offering an acceptable balance of size, speed and accuracy.

As with any system, the proposed system can be improved. First of all, the RESTful web service can be deployed on a more robust and secure platform such as Azure or Amazon Web Services. The Android application can be further improved so that whenever a new asset is to be added, the system can auto-train to include the asset without any manual intervention from the user. The exhaustive literature review has shown that no such solution exists to date. Moreover, the system can be trained to identify fake user IDs and passports. This will help in thwarting fraud and the use of fake documents by visitors.

V. Seenundun et al.

Wind Speed Time Series Prediction with Deep Learning and Data Augmentation

Anibal Flores, Hugo Tito-Chura, and Victor Yana-Mamani

Universidad Nacional de Moquegua, Moquegua, Peru

Abstract. This paper presents a hybrid model based on the recurrent neural networks Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) for the prediction of daily wind speed time series in the Moquegua region of Peru. The proposed model, called GRU LSTM GRU LSTM + DA, consists of an architecture of four hybrid sequential layers and works on a normalization scale of [−1, +1] instead of the traditional [0, +1] scale. It also uses data augmentation (DA) to improve the training process and the prediction results of the model. The results of the proposal are compared with four benchmark models (GRU GRU GRU GRU, LSTM LSTM LSTM LSTM, GRU LSTM and GRU LSTM GRU LSTM), showing that the proposal clearly outperforms the benchmark models in terms of RMSE, RRMSE and MAPE. The RMSE results are also compared with those of related work, where the proposed model achieves the lowest reported value.

Keywords: Deep learning · Data augmentation · Wind speed prediction · Time series scaling

1 Introduction
Climate change [1] is a severe threat to the future of humanity, and the consumption of fossil fuels has contributed enormously to this problem. The use of renewable energies has therefore become an excellent alternative to mitigate its effects. Renewable energies include wind, geothermal, hydroelectric, tidal, solar, wave power, biomass, and biofuels. In the case of wind energy, electricity is generated by the force of the wind: the turbines in wind farms are connected to generators that transform the rotation of their blades into electrical energy. Here, the analysis of time series related to wind speed and wind direction plays an important role. Peru has great potential to exploit wind energy, especially along the coastal region that borders the Pacific Ocean. In this study, a hybrid four-layer model, GRU LSTM GRU LSTM, is proposed for the prediction of wind speed time series; the same architecture produced good prediction results with solar radiation time series in [2]. Some updates are made to the pre-processing stage of the time series: a data augmentation phase is added, and the min/max scale range in the normalization (scaling) phase is changed from [0, +1] to [−1, +1].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 330–343, 2022. https://doi.org/10.1007/978-3-030-82193-7_22

Data augmentation is commonly used to overcome the overfitting problem that deep learning models present in the training phase, as can be seen in works such as [3–6]. In other cases, data augmentation also helps to overcome the underfitting caused by a small amount of training data, as in the work of Flores et al. [7]. In this study, the benchmark models GRU GRU GRU GRU, LSTM LSTM LSTM LSTM, GRU LSTM, and GRU LSTM GRU LSTM present neither overfitting nor underfitting problems. The aim of using data augmentation on the wind speed time series is instead to enrich them in order to achieve better training and therefore better prediction results. Regarding the normalization phase, many works such as [2, 8–12] use the [0, 1] min/max scale. This work proposes the [−1, +1] min/max scale instead, because in the experiments carried out it was observed that, for wind speed time series, this scale improves the precision of the predictions of deep learning models such as recurrent neural networks. The rest of this work is structured as follows. Sect. 2 briefly describes the state-of-the-art works that served as a starting point for the proposal. Sect. 3 presents the theoretical background necessary for a better understanding of the paper. Sect. 4 describes the process for the implementation of the proposal. Sect. 5 describes the results achieved. Sect. 6 discusses the results, comparing them with similar works of the state of the art. Finally, Sect. 7 presents the conclusions of the study and the improvements that can be made in future work.

2 Related Work
Some related works, arranged chronologically, are briefly described below. Zhang et al. [13] propose a hybrid model based on the Wavelet Transform Technique (WTT), the Seasonal Adjustment Method (SAM) and a Radial Basis Function Neural Network (RBFNN) to predict daily wind speed time series; the results show an RMSE of 0.88. Qureshi et al. [14] propose a model based on a Deep Neural Network (DNN) that implements Meta Regression and Transfer Learning (MRT); the best RMSE achieved for hourly wind speed time series is 0.0953. Bokde et al. [15] propose Ensemble Empirical Mode Decomposition (EEMD) and Pattern Sequence Forecasting (PSF) for the prediction of hourly wind speed time series; the best RMSE achieved is 0.36. Mezaache et al. [16] propose a two-block architecture: the first block uses an AutoEncoder (AE) network to reduce the dimensionality of the wind speed time series, and the second block experiments with an Extreme Learning Machine (ELM) and an Elman Neural Network (ENN); the best RMSE achieved is 3.0506 using AE + ENN. Khodayar et al. [17] propose Interval Probability Distribution Learning (IPDL) to decrease wind data uncertainties, together with Restricted Boltzmann Machines (RBM) and a rough set theory neural network to capture unsupervised temporal features from wind speed time series. They experiment with 10-min wind speed time series, using data from 2004 to 2005 for training and data from 2006 for testing; the best RMSE is 11.126 for a 3-h forecast horizon. Li et al. [18] propose a hybrid model based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) for the prediction of short-term wind speed time series with feature extraction, using 3500 items for training and 500 for testing; the best RMSE is 3.0012. Liu et al. [19] propose the use of the Gated Recurrent Unit (GRU) recurrent neural network to predict daily wind speed time series, using 811 items for the training phase and 372 for the testing phase; the best RMSE achieved is 0.9899. Deng et al. [20] propose a bidirectional deep learning architecture based on the Gated Recurrent Unit (GRU) to predict wind speed and direction time series; the best RMSE achieved is 6.75. Wang et al. [21] propose a model based on Wavelet Transformation and Kullback-Leibler divergence to predict hourly wind speed data; the best RMSE achieved is 1.07. Jiang et al. [22] propose a Variable Weights Combined (VWC) model to predict wind speed time series; the results reach an RMSE of 0.2557. Cheng et al. [23] propose an approach called Multi-Objective Salp Swarm Optimizer (MSSO) for 10-min wind speed time series prediction; the results show very good precision (RMSE: 0.3002), surpassing other state-of-the-art techniques. Altan et al. [24] propose a hybrid model based on Long Short-Term Memory (LSTM), Decomposition Methods (DM) and the Grey Wolf Optimizer (GWO) to predict 10-h wind speed time series; the best RMSE achieved is 0.1878. Yan et al. [25] propose a model based on Improved Singular Spectrum Decomposition (ISSD), Long Short-Term Memory (LSTM) and a Deep Belief Network (DBN) to predict hourly wind speed time series; the best RMSE achieved by the proposal is 1.0156. Noman et al. [26] propose a model called NARX (Nonlinear Auto-Regressive Exogenous), which reached an RMSE of 0.3590 in the prediction of 10-min wind speed time series. Luo et al. [27] propose two types of approaches, Decomposition Ensemble (DE) and Multi-Objective Optimization (MOO); the best RMSE achieved is 0.2348 in the prediction of 10-min wind speed time series.

3 Background
3.1 Recurrent Neural Networks
A recurrent neural network (RNN) is a type of artificial neural network within Deep Learning [8]. It integrates feedback loops that allow information to persist across the steps of a sequence: the outputs of a layer are fed back and combined with the next input. This makes RNNs applicable to problems such as handwriting recognition, speech recognition and time series prediction. The best-known recurrent neural network is the Long Short-Term Memory (LSTM), from which different variants were derived, including the Gated Recurrent Unit (GRU).

Long Short-Term Memory (LSTM)
LSTM is a recurrent neural network designed to address the vanishing gradient problem [8] encountered when training classic RNNs. Unlike standard feedforward neural networks, LSTM has feedback connections. It can process not only individual data points but also sequences of data. For example, LSTM is applicable to tasks such as handwriting and speech recognition [28]. According to Fig. 1, a common LSTM unit is composed of a cell and three gates (input, output and forget). The cell remembers values over arbitrary time intervals, and the gates regulate the flow of information into and out of the cell. LSTM networks are suitable for classification and regression tasks.

Fig. 1. LSTM architecture

From Fig. 1, the cell state $C_t$ can be calculated with the following equations, where $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{1}$$

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{2}$$

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \tag{3}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$

And the output $h_t$ with the following equations:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{5}$$

$$h_t = o_t \odot \tanh(C_t) \tag{6}$$
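Equations (1) to (6) can be traced step by step in a few lines of NumPy. The following is only a minimal sketch of one LSTM time step: the dictionary layout of the weights, the toy dimensions and the random initialization are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (1)-(6).

    W maps a gate name to a matrix of shape (hidden, hidden + inputs);
    b maps a gate name to a vector of shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, Eq. (2)
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate state, Eq. (3)
    c_t = f_t * c_prev + i_t * c_hat         # cell state, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                 # hidden state, Eq. (6)
    return h_t, c_t


# Toy dimensions and random weights (illustrative assumptions only).
rng = np.random.default_rng(0)
hidden, inputs = 4, 1
W = {g: rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for g in "fico"}
b = {g: np.zeros(hidden) for g in "fico"}

h, c = np.zeros(hidden), np.zeros(hidden)
for value in [0.2, -0.1, 0.4]:               # a short, already scaled series
    h, c = lstm_step(np.array([value]), h, c, W, b)
```

Because $h_t$ is the product of a sigmoid output and a tanh output, every component of the hidden state stays strictly inside (−1, 1), whatever the weights are.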

Gated Recurrent Unit (GRU)
The GRU network is a variant of LSTM that combines the forget and input gates into a single update gate. GRUs are a gating mechanism in RNNs [29]. The GRU is simpler than the standard LSTM and has become increasingly popular; the main difference is that it has fewer parameters, since it does not have an output gate. GRUs have been shown to perform even better than LSTMs on certain smaller datasets. However, the LSTM is strictly stronger than the GRU, since it can easily perform unbounded counting, while the GRU cannot [30]. Figure 2 shows the architecture of the Gated Recurrent Unit (GRU) network.

Fig. 2. GRU architecture

From Fig. 2, the following equations are used to calculate $h_t$:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \tag{7}$$

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \tag{8}$$

$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}) + b) \tag{9}$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{10}$$

Where:
$W_z, W_r, W, U_z, U_r, U$  Parameter matrices
$b_r, b_z, b$  Parameter vectors
$\sigma$  Element-wise sigmoid function
$\odot$  Element-wise multiplication
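The GRU update in Eqs. (7) to (10) can be sketched in the same way. This is a minimal illustration under assumed toy dimensions; the parameter dictionary `params` and its key names are hypothetical and simply mirror the symbols in the equations.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, p):
    """One GRU time step following Eqs. (7)-(10)."""
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])      # reset gate, Eq. (7)
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])      # update gate, Eq. (8)
    h_hat = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev) + p["b"])  # candidate, Eq. (9)
    return (1.0 - z_t) * h_prev + z_t * h_hat                         # new state, Eq. (10)


# Toy dimensions and random weights (illustrative assumptions only).
rng = np.random.default_rng(0)
hidden, inputs = 4, 1
params = {}
for gate in ("_r", "_z", ""):
    params["W" + gate] = rng.normal(scale=0.1, size=(hidden, inputs))
    params["U" + gate] = rng.normal(scale=0.1, size=(hidden, hidden))
    params["b" + gate] = np.zeros(hidden)

h = np.zeros(hidden)
for value in [0.2, -0.1, 0.4]:               # a short, already scaled series
    h = gru_step(np.array([value]), h, params)
```

Equation (10) is a convex combination: when the update gate $z_t$ is close to 0 the previous state is carried over almost unchanged, and when it is close to 1 the state is replaced by the candidate.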

3.2 Data Augmentation
Data augmentation is the artificial generation of synthetic data through perturbations or transformations of the original data [7]. It increases both the size and the diversity of the training data. In computer vision, this technique has become a standard form of regularization, used to improve performance and to overcome the overfitting problem in Convolutional Neural Networks (CNNs). For time series classification, augmentation techniques have generally been created to solve overfitting problems; they include time-warping, rotation, scaling, permutation and jittering [4]. For time series forecasting it is also possible to apply some of these techniques; for example, time-warping and jittering are used in [5]. Figure 3 shows some of these techniques applied to wind speed time series.

Fig. 3. Basic data augmentation techniques for time series classification. a) Raw Data, b) Jittering, c) Scaling, d) Rotation, e) Time-Warping
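Two of the transformations named in Fig. 3, jittering and scaling, are simple enough to sketch with the standard library. The functions below are generic textbook versions; the noise level `sigma`, the scaling `factor` and the sample values are illustrative assumptions and are not the settings used to produce the figure.

```python
import random


def jitter(series, sigma=0.05, seed=None):
    """Jittering: add zero-mean Gaussian noise to every point."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in series]


def scale(series, factor):
    """Scaling: multiply the whole series by a single scalar."""
    return [x * factor for x in series]


wind = [3.1, 2.8, 3.5, 4.0, 3.7]             # made-up wind speeds in m/s
jittered = jitter(wind, sigma=0.1, seed=42)  # same length, slightly perturbed
scaled = scale(wind, 1.1)                    # same shape, 10% larger magnitude
```

Both transformations keep the length of the series unchanged, in contrast to time-warping, which stretches or compresses the time axis.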

4 Methodology
4.1 Time Series Selection
The daily wind speed time series corresponds to the coastal province of Ilo in the Moquegua Region of Peru, which has considerable potential for the generation of wind energy. The daily time series of wind speed at 50 m covers the period from 1981-01-01 to 2020-12-31 (14610 items) and was obtained from the NASA repository¹ through the POWER Data Access Viewer web tool, at latitude −17.6851 and longitude −71.3515. The data for model training corresponds to the years 1981–2016 (13149 items), and the data for model testing to the years 2017–2020 (1461 items). Figure 4 shows the location.

Fig. 4. Location of Ilo Province in Moquegua Region of Perú.

1 https://power.larc.nasa.gov/data-access-viewer/.
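The item counts quoted in Sect. 4.1 follow directly from the calendar, which gives a quick sanity check on the chronological train/test split. The helper `n_days` below is a hypothetical name introduced only for this check.

```python
from datetime import date


def n_days(start, end):
    """Number of daily records from start to end, both inclusive."""
    return (end - start).days + 1


total = n_days(date(1981, 1, 1), date(2020, 12, 31))
train = n_days(date(1981, 1, 1), date(2016, 12, 31))
test = n_days(date(2017, 1, 1), date(2020, 12, 31))
print(total, train, test)   # 14610 13149 1461
```

The split is chronological rather than random, so the model is always tested on days that come after every day it was trained on.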

4.2 Time Series Imputation
The time series did not present missing data or NA values, so it was not necessary to apply any imputation technique.

4.3 Data Augmentation
In this stage, the algorithm proposed by Flores et al. [7] was used, with a block size of 6 and a sub-block size of 3. The technique proposed in [7] combines two data augmentation techniques used for time series classification: time-warping and jittering. Time-warping increases the length of the time series, but because the synthetic data it generates is linear, it can introduce bias; hence it is combined with jittering. Jittering introduces noise so that the synthetic data is not linear: the synthetic values are generated randomly, taking the start and end values of each sub-block as limits. Figure 5 shows part of the results of this process.

Fig. 5. Wind Speed time series with real and augmented values.
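The interplay between time-warping and jittering described in Sect. 4.3 can be illustrated with a deliberately simplified sketch. This is not the algorithm of [7]: the block and sub-block parameters are not modeled, and inserting exactly one synthetic value between each pair of neighbouring observations is an assumption made only to show why bounded random values avoid purely linear synthetic data.

```python
import random


def warp_linear(series):
    """Time-warping by linear interpolation: insert the midpoint of each pair."""
    out = []
    for a, b in zip(series, series[1:]):
        out += [a, (a + b) / 2.0]
    out.append(series[-1])
    return out


def warp_jittered(series, seed=None):
    """Same lengthening, but each synthetic value is drawn at random
    between its two neighbours instead of lying on the straight line."""
    rng = random.Random(seed)
    out = []
    for a, b in zip(series, series[1:]):
        out += [a, rng.uniform(min(a, b), max(a, b))]
    out.append(series[-1])
    return out


wind = [3.0, 4.0, 2.0]                  # made-up wind speeds in m/s
linear = warp_linear(wind)              # [3.0, 3.5, 4.0, 3.0, 2.0]
augmented = warp_jittered(wind, seed=0)
```

In both variants the real observations stay at the even positions of the output, so the original series can always be recovered from the augmented one.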

4.4 Scaling
For the proposed model, the [−1, +1] min/max scale was used in this stage instead of the [0, +1] min/max scale that is commonly recommended or used in many of the state-of-the-art works mentioned above. Min/max scaling is based on Eq. (11):

$$O_s = \frac{O_i - O_{min}}{O_{max} - O_{min}} \tag{11}$$

Where:
$O_s$  The scaled value between the min and max values.
$O_i$  The vector element to be scaled.
$O_{min}$  The smallest element in the vector.
$O_{max}$  The largest element in the vector.

Equation (11) yields values in [0, 1]; a target range $[a, b]$, such as [−1, +1], is obtained by mapping $O_s$ to $a + (b - a)\,O_s$.

After 20 runs of the GRU LSTM GRU LSTM model without data augmentation, an average RMSE of 0.5179 was obtained for the [−1, +1] scale, while an average RMSE of 0.5192 was obtained for the [0, +1] scale, a small improvement in favor of the scale proposed for this study. Figure 6 shows a graphical comparison between these two scales.

Fig. 6. Comparison of min/max Scales: 0, + 1 vs −1, + 1.
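The two scales compared in Fig. 6 differ only in the target range of the min/max transformation. A minimal sketch follows, assuming the usual affine extension of Eq. (11) to an arbitrary range; the function names and sample values are hypothetical.

```python
def minmax_scale(values, lo=-1.0, hi=1.0):
    """Scale values linearly so that min -> lo and max -> hi.

    Returns the scaled list and the (min, max) bounds needed to invert it.
    """
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        raise ValueError("a constant series cannot be min/max scaled")
    scaled = [lo + (hi - lo) * (v - v_min) / span for v in values]
    return scaled, (v_min, v_max)


def minmax_inverse(scaled, bounds, lo=-1.0, hi=1.0):
    """Map scaled values back to the original units."""
    v_min, v_max = bounds
    return [v_min + (s - lo) * (v_max - v_min) / (hi - lo) for s in scaled]


wind = [2.0, 3.0, 6.0]                              # made-up wind speeds in m/s
sym, bounds = minmax_scale(wind)                    # [-1.0, -0.5, 1.0]
unit, _ = minmax_scale(wind, lo=0.0, hi=1.0)        # [0.0, 0.25, 1.0]
restored = minmax_inverse(sym, bounds)              # [2.0, 3.0, 6.0]
```

Returning the bounds together with the scaled values keeps the transformation invertible, so scaled predictions can be mapped back to the original units.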

4.5 Modelling
The GRU LSTM GRU LSTM architecture of the proposed model was implemented with the hyperparameters shown in Table 1.

Table 1. Training and testing data for proposal and benchmark models

Model                    Hyperparameter           Value
GRU LSTM GRU LSTM + DA   Hidden neurons           160
                         Epochs                   100
                         Optimizer                adam
                         Drop rate                0.2
                         Activation function      ReLU
                         Layers 1, 2, 3 and 4     (40, 40, 40, 40)
                         Batch size               40
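To give a sense of the size of a four-layer stack with 40 units per layer, the trainable recurrent parameters can be counted from the equations in Sect. 3.1. The sketch below assumes a univariate input (one feature per time step) and counts only the recurrent layers; both choices are assumptions, and library implementations may use additional bias terms, so their reported counts can differ slightly.

```python
def gru_params(inputs, units):
    """Three blocks (reset, update, candidate), each with W, U and b, Eqs. (7)-(9)."""
    return 3 * (units * inputs + units * units + units)


def lstm_params(inputs, units):
    """Four blocks (forget, input, candidate, output), each acting on
    [h_{t-1}, x_t] plus a bias, Eqs. (1)-(3) and (5)."""
    return 4 * (units * (inputs + units) + units)


layers = [("GRU", 40), ("LSTM", 40), ("GRU", 40), ("LSTM", 40)]
n_in = 1                 # assumption: one input feature per time step
total = 0
for kind, units in layers:
    count = gru_params(n_in, units) if kind == "GRU" else lstm_params(n_in, units)
    total += count
    n_in = units         # the next layer receives this layer's hidden state

print(total)             # 40680
```

For equal input and hidden sizes, a GRU layer needs three quarters of the parameters of an LSTM layer, which is the "fewer parameters" property mentioned in Sect. 3.1.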

Similar parameters were used for the four-layer benchmark models LSTM LSTM LSTM LSTM, GRU GRU GRU GRU and GRU LSTM GRU LSTM. For the two-layer GRU LSTM model, the number of hidden neurons was only 80, since 40 neurons were established for each hidden layer.

4.6 Evaluation
The proposed model is evaluated through the Root Mean Squared Error (RMSE), the Relative RMSE (RRMSE), and the Mean Absolute Percentage Error (MAPE). These metrics are estimated through Eqs. (12), (13), and (14), respectively.

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (P_i - O_i)^2}{n}} \tag{12}$$

$$RRMSE = \frac{RMSE}{\frac{1}{n}\sum_{i=1}^{n} O_i} \times 100 \tag{13}$$

$$MAPE = \frac{1}{n}\sum_{i=1}^{n} \left|\frac{O_i - P_i}{O_i}\right| \times 100 \tag{14}$$

Where:
$n$  Number of predicted values.
$P_i$  Vector of predicted values.
$O_i$  Vector of observed values (testing data).
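The three metrics translate directly into code. The sketch below follows Eqs. (12) to (14), taking the absolute value inside MAPE as its name implies; the function names and the two-point example are illustrative only.

```python
import math


def rmse(pred, obs):
    """Root Mean Squared Error, Eq. (12)."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def rrmse(pred, obs):
    """Relative RMSE in percent: RMSE divided by the mean observation, Eq. (13)."""
    return rmse(pred, obs) / (sum(obs) / len(obs)) * 100.0


def mape(pred, obs):
    """Mean Absolute Percentage Error in percent, Eq. (14)."""
    return sum(abs((o - p) / o) for p, o in zip(pred, obs)) / len(obs) * 100.0


observed = [2.0, 4.0]
predicted = [3.0, 3.0]
print(rmse(predicted, observed))    # 1.0
print(mape(predicted, observed))    # 37.5
```

RMSE is expressed in the units of the series, whereas RRMSE and MAPE are percentages. MAPE is undefined when an observed value is exactly zero.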

5 Results
According to Table 2, the GRU LSTM GRU LSTM + DA model is clearly superior to the benchmark models in the three metrics used for evaluation. The benchmark models, all being based on recurrent neural networks, present very similar results; the best of them is GRU LSTM GRU LSTM, the same one that was used as the basis of the proposed model. In terms of average RMSE, the proposed model (RMSE: 0.0876) outperforms the best benchmark model (RMSE: 0.5242) by 0.4366. In terms of average RRMSE, the proposed model outperforms the best benchmark model by 13.0129. According to [31] and [32], a model’s precision level is excellent if RRMSE < 10%, good if 10% < RRMSE < 20%, fair if 20% < RRMSE < 30%, and poor if RRMSE > 30%. Thus, the proposed model (GRU LSTM GRU LSTM + DA), with an average RRMSE of 2.6185 ± 0.2359, qualifies as an excellent model, surpassing the benchmark models, which qualify as good.
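The RRMSE precision levels quoted from [31] and [32] and the reported improvements can be checked with a few lines. The averages below are copied from the Avg column of Table 2; assigning the exact boundary values 10, 20 and 30 to the higher category is an assumption, since the source leaves them unspecified.

```python
def rrmse_category(rrmse_percent):
    """Precision level for an RRMSE given in percent."""
    if rrmse_percent < 10:
        return "excellent"
    if rrmse_percent < 20:
        return "good"
    if rrmse_percent < 30:
        return "fair"
    return "poor"


# Average values of the best benchmark (GRU LSTM GRU LSTM) and the proposal.
best_benchmark = {"RMSE": 0.5242, "RRMSE": 15.6314, "MAPE": 13.3112}
proposed = {"RMSE": 0.0876, "RRMSE": 2.6185, "MAPE": 2.3243}

gains = {m: round(best_benchmark[m] - proposed[m], 4) for m in proposed}
print(gains)
print(rrmse_category(best_benchmark["RRMSE"]), rrmse_category(proposed["RRMSE"]))
```

The differences reproduce the improvements stated in the text for RMSE, RRMSE and MAPE, and the two RRMSE averages fall in the "good" and "excellent" bands respectively.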

Table 2. Results of benchmark models and proposal model

                                  Predicted days
Model                    Metric   30        50        100       250       500       1000      1461      Avg
GRU GRU GRU GRU          RMSE     0.5218    0.4828    0.5680    0.5761    0.5181    0.5190    0.5146    0.5286 ± 0.0325
                         RRMSE    17.9351   14.9025   16.5702   16.5926   14.7703   14.8961   14.7443   15.7730 ± 1.2630
                         MAPE     16.3485   13.0355   14.1669   13.9296   12.1245   12.3141   12.2761   13.4564 ± 1.5133
LSTM LSTM LSTM LSTM      RMSE     0.5033    0.4748    0.5711    0.5824    0.5224    0.5224    0.5180    0.5277 ± 0.0374
                         RRMSE    17.2994   14.6557   16.6608   16.7721   14.8812   14.9946   14.8436   15.7296 ± 1.1266
                         MAPE     15.7064   0.4748    13.9701   13.8863   12.0400   12.1644   12.1151   11.4795 ± 5.0357
GRU LSTM                 RMSE     0.5367    0.4948    0.5850    0.5806    0.5202    0.5196    0.5156    0.5360 ± 0.0342
                         RRMSE    18.4472   15.2745   17.0658   16.7207   14.8190   14.9127   14.7740   16.0019 ± 1.4288
                         MAPE     17.1685   13.5276   14.7068   14.2251   12.2939   12.3979   12.3672   13.8124 ± 1.7658
GRU LSTM GRU LSTM        RMSE     0.5078    0.4728    0.5591    0.5792    0.5191    0.5181    0.5138    0.5242 ± 0.0349
                         RRMSE    17.4528   14.5948   16.3113   16.6807   14.7883   14.8699   14.7220   15.6314 ± 1.1599
                         MAPE     16.0820   12.7549   13.9166   14.0231   12.0868   12.1724   12.1431   13.3112 ± 1.4731
GRU LSTM GRU LSTM + DA   RMSE     0.0895    0.0826    0.0930    0.0935    0.0842    0.0844    0.0865    0.0876 ± 0.0043
                         RRMSE    3.0748    2.5492    2.7139    2.6922    2.3982    2.4221    2.4797    2.6185 ± 0.2359
                         MAPE     2.9858    2.3531    2.4360    2.3315    2.0335    2.0458    2.0849    2.3243 ± 0.3342

In terms of average MAPE, the proposed model outperforms the best benchmark model by 10.9869, which is consistent with the other two metrics. Figure 7 shows a graphical comparison of the three metrics analyzed in this work. Figure 8 shows the original wind speed time series and the predictions of the GRU LSTM GRU LSTM and GRU LSTM GRU LSTM + DA models for a prediction horizon of 100 days. The predictions of the proposed model (red line) fit the original values better, and the deviations of the base model are visibly reduced by the data augmentation of the proposed model.

Fig. 7. Metrics comparison of benchmark models and proposal model.

Fig. 8. Comparison of best benchmark model and proposal model for 100 predicted days.

6 Discussion
Table 3 summarizes the RMSEs obtained by different state-of-the-art models in the prediction of wind speed time series. Among the related works, the best RMSE is 0.0953, achieved by Qureshi et al. [14]. The proposed model GRU LSTM GRU LSTM + DA achieves an RMSE of 0.0876, which is lower. It can therefore be affirmed that the proposed model has considerable potential for the prediction of wind speed time series, being able to obtain better results than the state-of-the-art models.

Table 3. Results of related work and proposal model

Work                   Technique                 Frequency   Train       Test        RMSE
Zhang et al. [13]      WTT + SAM + RBFNN         Daily       696         48          0.88
Bokde et al. [15]      EEMD + PSF                Hourly      2160        720         0.36
Mezaache et al. [16]   AE + ENN                  10-min      26000       11000       3.0506
Khodayar et al. [17]   RBM + IPDL                10-min      105120      52560       11.126
Li et al. [18]         CNN + LSTM                15-min      3500        500         3.0012
Liu et al. [19]        GRU                       Daily       811         372         0.9899
Deng et al. [20]       Bi-GRU                    -           400         -           6.75
Jiang et al. [22]      VWC                       -           2304        576         0.2557
Wang et al. [21]       EWT + KLD                 Hourly      14016       3504        1.07
Qureshi et al. [14]    DNN + MRT                 Hourly      -           -           0.0953
Yan et al. [25]        ISSD + LSTM-GOADBN        Hourly      600         100         1.0156
Cheng et al. [23]      MSSO                      10-min      2880        720         0.3002
Altan et al. [24]      DM + LSTM + GWO           10-h        4397        775         0.1878
Noman et al. [26]      NARX                      10-min      Data 2017   Data 2018   0.3590
Luo et al. [27]        DE + MOO                  10-min      3200        800         0.2348
Proposal Model         GRU LSTM GRU LSTM + DA    Daily       13149       1461        0.0876

7 Conclusion and Future Work
According to the results obtained in this study, it can be concluded that the [−1, +1] scale increases the precision of the predictions of the hybrid four-layer model GRU LSTM GRU LSTM. Likewise, data augmentation, which is widely used to overcome the overfitting problem, can also be used to improve the precision of models that do not present overfitting. In this study, it reduced the average MAPE of the GRU LSTM GRU LSTM model by 10.9869 percentage points, turning a good model into an excellent one according to the RRMSE obtained. The main advantage of the approach proposed in this study for the prediction of wind speed time series lies in the precision of the results achieved. On the other hand, the increase in the amount of training data causes the computational cost to rise considerably. For future work, based on what was observed during the experiments, it is recommended to experiment with larger values of the block-size parameter of the data augmentation technique used. It is very likely that values greater than six (6) yield better results for the GRU LSTM GRU LSTM model. Likewise, experimentation with data of different frequencies and with other architectures based on recurrent neural networks is pertinent.

References
1. McMichael, A.J., Lindgren, E.: Climate change: present and future risks to health, and necessary responses. J. Intern. Med. 270(5), 401–413 (2011)
2. Flores, A., Tito, H., Centty, D.: Comparison of hybrid recurrent neural networks for univariate time series forecasting. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol. 1250, pp. 375–387. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-55180-3_28
3. Yeomans, J., Thwaites, S., Robertson, W.S.P., Booth, D., Ng, B., Thewlis, D.: Simulating time-series data for improved deep neural network performance. IEEE Access 7, 131248–131255 (2019)
4. Rashid, K.M., Louis, J.: Times-series data augmentation and deep learning for construction equipment activity recognition. Adv. Eng. Inform. 42, 1–12 (2019)
5. Iwana, B.K., Uchida, S.: Time series data augmentation for neural networks by time warping with a discriminative teacher. arxiv.org (2020)
6. Rashid, K.M., Louis, J.: Window-warping: a time series data augmentation of IMU data for construction equipment activity identification. In: 36th International Symposium on Automation and Robotics in Construction (ISARC 2019), Banff, Canada (2019)
7. Flores, A., Tito, H., Apaza-Alanoca, H.: Data augmentation for short-term time series prediction with deep learning. In: Computing Conference, London, United Kingdom (2021, in press)
8. Flores, A., Tito, H., Centty, D.: Recurrent neural networks for meteorological time series imputation. Int. J. Adv. Comput. Sci. Appl. 11(3), 482–487 (2020)
9. Flores, A., Tito, H., Centty, D.: Improving gated recurrent unit predictions with univariate time series imputation techniques. Int. J. Adv. Comput. Sci. Appl. 10(12), 710–714 (2019)
10. Huynh, A.N.-L., Deo, R.C., An-Vo, D.-A., Ali, M., Raj, N., Abdulla, S.: Near real-time global solar radiation forecasting at multiple time-step horizons using the long short-term memory network. Energies 13(3517), 11–30 (2020)
11. Gürel, A.E., Ağbulut, Ü., Biçen, Y.: Assessment of machine learning, time series, response surface methodology and empirical models in prediction of global solar radiation. J. Clean. Prod. 277, 122353 (2020)
12. Che, Z., Purushotham, S., Cho, K., Sontag, D., Liu, Y.: Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 8(6085), 1–12 (2018)
13. Zhang, W., Wang, J., Wang, J., Zhao, Z., Tian, M.: Short-term wind speed forecasting based on a hybrid model. J. Appl. Soft Comput. 92(106294), 1–20 (2013)
14. Qureshi, A.S., Khan, A., Zameer, A., Usman, A.: Wind power prediction using deep neural network based meta regression and transfer learning. Appl. Soft Comput. 58, 742–755 (2017)
15. Bokde, N., Feijoo, A., Kulat, K.: Analysis of differencing and decomposition preprocessing methods for wind speed prediction. Appl. Soft Comput. 71, 926–938 (2018)
16. Mezaache, H., Bouzgoud, H.: Auto-encoder with neural networks for wind speed forecasting. In: IEEE International Conference on Communications and Electrical Engineering, El Oued, Algeria (2018)
17. Khodayar, M.I., Wang, J., Manthouri, M.: Interval deep generative neural network for wind speed forecasting. IEEE Trans. Smart Grid 10(4), 3974–3989 (2019)
18. Li, G., Wang, T.F., Hu, F.X., Liu, T.C.: Algorithm considering multi-dimensional meteorological feature extraction in short-term wind speed prediction. In: IEEE Information Technology, Networking, Electronic and Automation Control Conference, Chengdu, China (2019)
19. Liu, M., Qiu, P., Wei, K.: Research on wind speed prediction of wind power system based on GRU deep learning. In: IEEE Conference on Energy Internet and Energy System Integration, Changsha, China (2019)
20. Deng, Y., Jia, H., Li, P., Tong, X., Qiu, X., Li, F.: A deep learning methodology based on bidirectional gated recurrent unit for wind power prediction. In: IEEE, Xi’an, China (2019)
21. Wang, J., Li, Y.: An innovative hybrid approach for multi-step ahead wind speed prediction. Appl. Soft Comput. J. 78, 296–309 (2019)
22. Jiang, P., Liu, Z.: Variable weights combined model based on multi-objective optimization for short-term wind speed forecasting. Appl. Soft Comput. J. 82(105587), 1–19 (2019)
23. Cheng, Z., Wang, J.: A new combined model based on multi-objective salp swarm optimization for wind speed forecasting. Appl. Soft Comput. J. 92(106294), 1–20 (2020)
24. Altan, A., Karasu, S., Zio, E.: A new hybrid model for wind speed forecasting combining long short-term memory neural network, decomposition methods and grey wolf optimizer. Appl. Soft Comput. 100, 106996 (2020)
25. Yan, X., Liu, Y., Xu, Y., Jia, M.: Multistep forecasting for diurnal wind speed based on hybrid deep learning model with improved singular spectrum decomposition. Energy Convers. Manage. 225(113456), 1–22 (2020)
26. Noman, F., et al.: Multistep short-term wind speed prediction using nonlinear auto-regressive neural network with exogenous variable selection. Alexand. Eng. J. 60(1), 1221–1229 (2020)
27. Luo, L., Li, H., Wang, J., Hu, J.: Design of a combined wind speed forecasting system based on decomposition-ensemble and multi-objective optimization approach. Appl. Math. Model. 89, 49–72 (2021)
28. Xiangang, L., Xihong, W.: Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition. arxiv.org (2014)
29. Kyunghyun, C., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation, pp. 1–15. arxiv.org (2014)
30. Gail, W., Yoav, G., Eran, Y.: On the practical computational power of finite precision RNNs for language recognition, pp. 1–9. arxiv.org (2018)
31. Huynh, A.N.-L., Deo, R.C., An-Vo, D.-A., Ali, M., Raj, N., Abdulla, S.: Near real-time global solar radiation forecasting at multiple time-step horizons using the long short-term memory network. Energies 13(14), 3517 (2020)
32. Liu, M., Qiu, P., Wei, K.: Research on wind speed prediction of wind power system based on GRU deep learning. In: IEEE Conference on Energy Internet and Energy System Integration, Changsha, China (2019)

Evaluation for Angular Distortion of Welding Plate

Shigeru Kato1(B), Shunsaku Kume1, Takanori Hino1, Fujioka Shota1, Tomomichi Kagawa1, Hironori Kumeno1, and Hajime Nobuhara2

1 National Institute of Technology, Niihama College, Niihama, Japan

[email protected]

2 University of Tsukuba, Tsukuba, Japan

[email protected]

Abstract. Welding is essential to daily life, and nurturing welding skills is a pressing need in Japan. Experts currently have to evaluate the welds of many beginners, and because this places a heavy burden on them, a computational assistant for evaluating beginners’ welding is required. This paper describes a simple system for evaluating plates welded by beginners. The authors consider four typical beginner defects: lack of welding metal, linear misalignment, welding metal unevenness, and angular distortion. To capture these defects simultaneously, the authors propose original equipment for photographing the welding plates. The computer extracts only the welding plate region using color markers, and a CNN (Convolutional Neural Network) evaluates the defects. As a first step, the authors address only angular distortion, one of the typical failures of beginners. In the experiment, the authors validated the CNN. The conclusion discusses the experimental results and future work.

Keywords: Welding joint · Angular distortion · Image processing · CNN

1 Introduction

Welding is vital to infrastructure [1] such as buildings, vehicles, and water pipes. It is therefore crucial to retain people with high welding skills in Japan, and researchers have developed educational systems for nurturing welding skills [2, 3]. A welding simulator for training has also been proposed [4]. However, as illustrated in Fig. 1(a), an expert must judge numerous welded plates to decide whether to grant a welding license. In addition, subjectivity differs among experts, so their evaluations can differ, as in Fig. 1(b). Such differences sometimes cause disputes between evaluators and examinees. Therefore, in our previous study [5], we began to develop a computational system that evaluates welding plates, as represented in Fig. 2.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 344–354, 2022. https://doi.org/10.1007/978-3-030-82193-7_23


Fig. 1. Problems in human subjective evaluation.

Fig. 2. Welding evaluation system in our previous study.

Our previous system, shown in Fig. 2, employs a convolutional neural network (CNN) [6] that evaluates the welding joint as good or bad. We confirmed that the CNN worked well. However, a human had to extract the welding area by hand, as depicted in Fig. 2. We therefore adopted R-CNN (Region-Based Convolutional Neural Network) to extract only the welding joint area automatically, as in Fig. 3 [7]. Although R-CNN could capture the welding joint area appropriately, it required a large number of training images and did not capture the welding joint area stably. We therefore decided to find a more stable method than R-CNN for extracting only the welding plate area.

Fig. 3. R-CNN could extract welding joint area automatically.


2 Equipment and CNN

Figure 4 shows an example of a good welding joint without defects.

Fig. 4. Good welding joint plate with no defects.

Experts evaluate a welding plate by considering four basic defects: insufficient metal on the joint line, as in Fig. 5(a); linear misalignment, as in Fig. 5(b); joint metal unevenness, as in Fig. 5(c); and angular distortion, as in Fig. 5(d). These defects are typical failures of beginners.

Fig. 5. Typical welding joint defects by beginners.


To capture the defects shown in Fig. 5 simultaneously, we decided to photograph the welding plate from a point 30° above the bottom line, as Fig. 6 shows.

Fig. 6. Equipment to take welding plate picture.

Instead of R-CNN detection, we employ a stable method to extract only welding plate area images from the background using pink color markers, as shown in Fig. 7.

Fig. 7. Welding plate is placed along with the pink markers.

In the present paper, we focus on evaluating angular distortion [8] as the first step. Angular distortion is explained in Fig. 8. In Fig. 8(a), the plate is bent largely (Bad). In Fig. 8(b), the plate is bent slightly (Neutral). In contrast, in Fig. 8(c), the plate is flat (Good).


Fig. 8. Angular distortion.

To keep the lighting conditions constant, the equipment is enclosed in a box, as shown in Fig. 9. An LED light is attached to the ceiling inside the box.

Fig. 9. Equipment is inside of the black box.

Figure 10 illustrates the automatic plate extraction process and the CNN construction. First, the metal welding plate area is extracted from the picture, as shown in Fig. 10(a), (b). The extracted welding plate image is then resized to 227-by-227 pixels, as shown in Fig. 10(c). The resized image is input to the proposed CNN in Fig. 10(d). We employed AlexNet [9] to evaluate the angular distortion of the welding plate.
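The extraction and resizing steps can be illustrated with a minimal NumPy sketch. It is not the authors’ MATLAB implementation: the pink color thresholds, the assumption that the plate lies inside the bounding box spanned by the marker pixels, the nearest-neighbor resize, and all function names are hypothetical choices made only for illustration.

```python
import numpy as np

def pink_mask(rgb, r_min=200, g_max=150, b_min=150):
    """Boolean mask of 'pink' pixels. Thresholds are assumed, not from the paper."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (r >= r_min) & (g <= g_max) & (b >= b_min)

def crop_between_markers(rgb):
    """Crop to the bounding box spanned by all pink marker pixels (assumed layout)."""
    ys, xs = np.nonzero(pink_mask(rgb))
    if ys.size == 0:
        raise ValueError("no pink marker pixels found")
    return rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

def resize_nearest(img, size=227):
    """Nearest-neighbor resize to size-by-size pixels (the 227-by-227 CNN input)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

# Synthetic photo: black background with two pink markers at opposite corners.
photo = np.zeros((300, 400, 3), dtype=np.uint8)
photo[50:60, 80:90] = (255, 105, 180)
photo[240:250, 330:340] = (255, 105, 180)

cropped = crop_between_markers(photo)
plate = resize_nearest(cropped)
print(cropped.shape, plate.shape)
```

A fixed-color marker makes the crop depend only on a threshold rather than on a trained detector, which is consistent with the stability argument given for replacing R-CNN; the constant lighting inside the box is what would make fixed thresholds plausible.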


Fig. 10. Welding plate extraction and CNN configuration.

Transfer learning [10] is adopted to tune the connection weights between fc7 and fc8. The proposed CNN outputs one of three angular distortion levels: “Good” (i.e., flat), “Neutral” (slightly bent), or “Bad” (largely bent).

3 Experiment

First, we took the eight pictures shown in Fig. 11 at a local welding school in our city.

Fig. 11. Photographs of welding plates taken in welding school.


On another day, we photographed other welding plates in our laboratory, where we hold 29 welding plates. To obtain more training images, we photographed each of the 29 plates twice, swapping its front and tail ends, as shown in Fig. 12. Consequently, we took 58 (29 × 2) pictures in our laboratory.

Fig. 12. Same welding plates rotating front and tail.

In total, we therefore obtained 66 (8 + 58) welding plate pictures. As displayed in Fig. 13, the Good (flat), Bad (angularly distorted), and Neutral (slightly distorted) welding images are extracted and resized correctly.

Fig. 13. Example of images extracted and resized.


All 66 pictures were extracted automatically and successfully using the pink markers. We classified the images into Good (21 images), Bad (34 images), and Neutral (11 images), as shown in Fig. 14.

Fig. 14. Example of extracted images: (a) Good, (b) Bad, and (c) Neutral.

4 Validation of CNN

To validate the proposed CNN in Fig. 10(d), we used the 66 images (21 Good, 34 Bad, and 11 Neutral). Four-fold cross-validation [11, 12] is conducted using Data Sets (1), (2), (3), and (4), as illustrated in Fig. 15. In the training phase, transfer learning [10] is employed. Table 1 lists the conditions for transfer learning of the CNN and the number of training and test data for each validation data set. We employed the “Deep Learning Toolbox” [13] of MATLAB. For each validation data set from (1) to (4) in Fig. 15, 10 images (5 Good, 5 Bad) not used for training are input to the trained CNN. The accuracies were 80%, 100%, 90%, and 60%, respectively, and the mean accuracy was 82.5%. Figure 16 shows the confusion matrix for each validation data set. “Good” or “Bad” plates are misjudged as “Neutral.” Since the “Neutral” evaluation includes fuzziness [14] between “Good” and “Bad,” the CNN can be confused much as a human would be. Several studies attempt to evaluate welding joint defects using CNNs [15]. However, they deal with joints welded by professionals or machines, whereas our study focuses on typical failures of beginners.


Fig. 15. CNN validation data sets.


Table 1. Transfer learning and CNN test settings for all validation data sets.

Parameter              Value/Condition
Solver                 sgdm
Learning rate          0.0001
Max epochs             50
Mini batch size        8
Total iterations       350 = 50 * 56/8
Number of train data   56
Number of test data    10
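The figures reported in the experiment and in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below only recomputes numbers already stated in the text; the variable names are illustrative.

```python
# Dataset bookkeeping as reported in the experiment section.
school_pictures = 8
lab_plates = 29
lab_pictures = lab_plates * 2              # each plate photographed in two orientations
total_pictures = school_pictures + lab_pictures

class_counts = {"Good": 21, "Bad": 34, "Neutral": 11}

# Training settings from Table 1.
max_epochs = 50
num_train = 56
num_test = 10
mini_batch_size = 8
iterations_per_epoch = num_train // mini_batch_size
total_iterations = max_epochs * iterations_per_epoch

# Accuracies (%) reported for validation data sets (1) to (4).
fold_accuracies = [80.0, 100.0, 90.0, 60.0]
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)

print(total_pictures, sum(class_counts.values()), num_train + num_test)  # 66 66 66
print(iterations_per_epoch, total_iterations)                            # 7 350
print(mean_accuracy)                                                     # 82.5
```

With only 10 test images per data set, each misclassified image changes that data set’s accuracy by 10 percentage points, which is worth keeping in mind when comparing the four results.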

Fig. 16. Confusion matrix for each validation data set.

5 Conclusions

The present paper describes a simple system for evaluating beginners’ welding. We constructed equipment to photograph the welding plate and developed a stable method that automatically extracts the welding plate area from the background using pink color markers. The extracted images are evaluated by a CNN, which evaluated angular distortion with a mean accuracy of 82.5% over the four validation data sets. In the present paper, we deal only with “angular distortion.” However, it is also necessary to evaluate other defects such as “less metal on joint” as in Fig. 5(a), “joint step” as in Fig. 5(b), and “joint metal unevenness” as in Fig. 5(c). In the future, we will address the evaluation of these other failures in addition to angular distortion.

Acknowledgments. This research is supported by the Japan Welding Engineering Society’s grant.

References

1. Niles, R.W., Jackson, C.E.: Weld thermal efficiency of the GTAW process. Weld. J. 54, 25–32 (1975)
2. Asai, S., Ogawa, T., Takebayashi, H.: Visualization and digitation of welder skill for education and training. Weld. World 56, 26–34 (2012)
3. Hino, T., et al.: Visualization of gas tungsten arc welding skill using brightness map of backside weld pool. Trans. Mat. Res. Soc. Japan 44(5), 181–186 (2019)
4. Byrd, A.P., Stone, R.T., Anderson, R.G., Woltjer, K.: The use of virtual welding simulators to evaluate experimental welders. Weld. J. 94(12), 389–395 (2015)
5. Kato, S., Hino, T., Yoshikawa, N.: Fundamental study on evaluation system of beginner’s welding using CNN. In: Lecture Notes in Networks and Systems, vol. 96, pp. 821–827 (2019)
6. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
7. Kato, S., Hino, T., Kumeno, H., Kagawa, T., Nobuhara, H.: Automatic detection of beginner’s welding joint. In: Proceedings of 2020 Joint 11th International Conference on Soft Computing and Intelligent Systems and 21st International Symposium on Advanced Intelligent Systems, pp. 465–467 (2020)
8. Mochizuki, M., Okano, S.: Effect of welding process conditions on angular distortion induced by bead-on-plate welding. ISIJ Int. 58(1), 153–158 (2018)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS 2012), pp. 1097–1105 (2012)
10. Shin, H.C., et al.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35(5), 1285–1298 (2016)
11. Priddy, K.L., Keller, P.E.: Artificial Neural Networks - An Introduction, Chapter 11 Dealing with Limited Amounts of Data, pp. 101–102. SPIE Press, Bellingham (2005)
12. Wong, T.-T.: Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recogn. 48(9), 2839–2846 (2015)
13. MathWorks: Transfer Learning Using AlexNet. https://jp.mathworks.com/help/deeplearning/ug/transfer-learning-using-alexnet.html;jsessionid=10ee690544b1eb830e5dc2412cf0?lang=en. Accessed 27 Feb 2021
14. Zadeh, L.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
15. Park, J.-K., An, W.-H., Kang, D.-J.: Convolutional neural network based surface inspection system for non-patterned welding defects. Int. J. Precis. Eng. Manuf. 20(3), 363–374 (2019)

A Framework for Testing and Evaluation of Operational Performance of Multi-UAV Systems

Mrinmoy Sarkar1, Xuyang Yan1, Shamila Nateghi1, Bruce J. Holmes2, Kyriakos G. Vamvoudakis3, and Abdollah Homaifar1(B)

1 North Carolina A&T State University, 1601 East Market Street, Greensboro, NC 27401, USA
{msarkar,xyan}@aggies.ncat.edu, {snateghiboroujeni,homaifar}@ncat.edu

2 Alakai Technologies Corporation, Hopkinton, USA
[email protected]
https://www.skai.co/

3 Georgia Institute of Technology, 270 Ferst Drive, NW, Atlanta, GA 30332-0150, USA
[email protected]

Abstract. In this paper, we propose a data-driven testing and evaluation framework for multi-UAVs to evaluate their performance in executing missions in the physical world. Seven micro-behaviors, termed here as modes of operation, are leveraged to describe the autonomous functionalities of the UAVs. These functionalities are then used to design five scenarios for model training, validation and testing of the proposed framework. Each scenario includes a distinct sequence of behaviors for the UAVs in order for the different autonomous functionalities to be evaluated. We develop and implement a simulation environment using the Robot Operating System (ROS), Gazebo, and the Pixhawk autopilot to generate synthetic data for the training of a classification model. This trained model is then utilized to evaluate the behaviors of the UAVs while performing real-world missions. Finally, the proposed framework is tested using synthetic data generated from a simulation environment and validated using real-world data.

Keywords: Test and evaluation · Multi-UAV testing · Autonomous behavioral testing · Cognitive systems · Physical flight testing · Bi-LSTM · ROS

1 Introduction

Recent developments in Unmanned Aerial Systems (UASs) introduce new challenges for the safety, verification and validation, and efficiency of Advanced Air Mobility (AAM) operating capabilities, and serve as technical foundations for the concept of Urban Air Mobility (UAM) [13]. In contrast to legacy air transportation systems, UAM aims to enable the growth of increasing traffic congestion.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 355–374, 2022. https://doi.org/10.1007/978-3-030-82193-7_24


The UAM initiatives for safe airspace services seek to support the projected growth by incorporating UASs into the development of future urban traffic networks. The National Aeronautics and Space Administration (NASA) projects that approximately 500 million package deliveries and 750 million air metro services will be accomplished by UASs by 2030 [20]. The upcoming revolution of urban air traffic not only provides a promising solution for traditional transportation systems, but also raises many concerns about the safety of UAM services, especially in dense environments. Given the deployment of large-scale UASs, the testing, verification, and evaluation of such systems has become one of the most important steps for ensuring safety and reliability. In general, testing and evaluation is a core step in the deployment of autonomous systems, especially unmanned ones [4]. At the beginning of UAS development, the testing community tested the systems with methodologies developed for manned systems [22]. However, there are significant differences between autonomous UASs and manned systems, which require the development of new approaches. The fundamental differences lie in the role of autonomy in the decision-making process. Therefore, it is incumbent to develop new methodologies that are capable of testing the entire decision process of a UAS without biasing the system toward default human solutions [25]. As a result, a data-driven approach is more suitable than human cognitive approaches for developing a UAS testing system that is adaptive and evolves over time.
The key challenges for testing and evaluation of UAS are [17, 22]: (i) establishing which characteristics to observe, i.e., the environment, characteristics of the system itself, or characteristics of different threats in the environment; (ii) developing metrics for each characteristic, such as the tilt or height of a wall, GPS coordinates, or the motion of dynamic obstacles in the environment; and (iii) providing standards for the metrics in terms of numeric specifications, such as the maximum height of a wall or the exact location of a static obstacle.

Toward this end, the contributions of the present paper are threefold. First, we develop a novel data-driven framework to potentially mitigate many of these fundamental challenges. The proposed framework incorporates an external observer, which is able to perceive the behavior of the UAVs, and employs an evaluator to automatically assess the UAVs’ performance. We define micro-behaviors for UAVs, primarily the modes of operation, to design different scenarios for the UAVs and quantify their autonomous capabilities for a particular mission from these scenarios. A classification model is then employed to learn the UAVs’ behavior and predict their performance in executing a mission using the collected sensor measurements. Second, we develop a simulation environment to generate synthetic data and train the classification model for evaluating the behaviors of the UAVs. Third, we implement the testing framework in our indoor flight testing environment with real-time data and test multiple UAVs simultaneously.

The remainder of this paper is structured as follows. The background of testing autonomous systems is described in Sect. 2. The problem at hand is described in Sect. 3. Section 4 discusses the details of the proposed testing framework. The development of the simulation environment and the synthetic data generation procedure are described in Sect. 5. The details regarding the development of the indoor flight testing facility are discussed in Sect. 6, and the performance of the proposed method is presented in Sect. 7. Finally, concluding remarks and future work are provided in Sect. 8.

2 Literature Review

An autonomous unmanned system [25] refers to any system that acquires data from a sensor, perceives information, compares the information against its previous knowledge, and makes a decision based upon this information. In [22], a UAS is defined as a system that is capable of performing tasks in the world by itself without explicit human control, as well as a system that senses, understands, and acts upon the environment in which it operates. In [25], the authors provided a comprehensive introduction about testing the intelligence of an autonomous system. To test any system effectively, a tester or testing system requires: (1) the autonomous system under test (SUT); (2) the documentation associated with operation and maintenance of the SUT; and (3) the specification against which the system will be tested. First, the platform needs to be tested for structural integrity and controllability. Second, the communication among the individual components of the system needs to be tested. Third, hardware-in-the-loop followed by vehicle-in-the-loop testing can be conducted, and finally the field testing is required to verify and validate the previous testing steps. In our proposed testing framework, we assume the first step has been completed and aim to provide a unified data-driven solution for the final two steps. An introductory framework for simulation-based test and evaluation of autonomous vehicles has been proposed in [10,18]. Based on virtual reality (VR) and augmented reality (AR) technologies, the framework is developed to integrate the testing facility with the development process of an autonomous system. In [2] a quantitative method for assuring coordinated autonomy is proposed. The testing and evaluation could be quantified with a reliability engineering approach. The authors adopted a probabilistic model checking to assure the autonomy of a coordinated multi-robot mission. 
Primarily the system is suitable for multi-robot model checking, but not applicable to test autonomous systems in the physical world when data from the autonomous system is not accessible. A mission-based test and evaluation framework of UAS has been proposed to simulate innovative concepts and applications across the Department of Defense in [6]. It is limited to those UASs that are developed for military missions while UASs are simultaneously evolving across numerous civilian applications. In [21], a specialized testing and evaluation of autonomous surface vehicles is proposed. The authors provided six different scenarios for an autonomous swarm of surface vehicles and developed quantitative metrics for each scenario. Since their approach is specific to the unmanned surface vehicle, it is difficult to extend those techniques to other autonomous systems such as unmanned ground and aerial vehicles.


An experimental test and evaluation framework is proposed for autonomous underwater vehicles in [15]. It defines the capabilities of the autonomous system initially and then designs appropriate scenarios to test those capabilities in the physical environment rather than a simulation environment. Our approach takes a similar procedure from [15] to test and evaluate UASs. However, [15] assumes that the SUT is a white-box system while the proposed framework considers the SUT as a black-box system. Generally, a system is considered as a white-box system when human operators have full knowledge about its internal dynamic or structure. Conversely, a black-box system refers to any system with completely unknown internal system structure. A detailed test and evaluation approach for autonomous unmanned ground vehicles (UGVs) is provided in [24]. The authors presented a scientific and comprehensive design approach to test different autonomous modules separately, or to test the autonomous system as a whole. The fuzzy comprehensive evaluation method along with the analytic hierarchy process (AHP) is used to evaluate the individual module and the entire technical performance of autonomous ground vehicles quantitatively. However, their approach also depends on the white-box assumption of the SUT. A description of the testing and evaluation of micro-air vehicles (MAVs) for both the physical realm and behavioral realm is provided in [19]. According to the author, testing the physical capability of a MAV is relatively easier than testing the autonomous behavior of the MAV. Although the work provides a scientific description of what needs to be tested, there is no quantitative, or qualitative description of how autonomous behaviors can be tested. In [23], the authors developed a data-driven framework to predict the autonomous behavior of UAV and validated it using five behaviors. 
However, in this paper, we extend it significantly by adding two complex behaviors of the autonomous agent and developing five different scenarios. More importantly, we propose a new metric for assignment to each autonomous agent after the mission completion in the testing framework. Although the decision tree classifier presents good performance for predicting the behaviors of UAVs in [23], it ignores the time dependency among the behaviors of UAVs and thus shows poor performance concerning the prediction of the two new behaviors developed in this paper. Hence, the Long Short-Term Memory (LSTM) based classifier is employed in the proposed testing framework to improve the prediction performance for the behavior of UAVs by exploring the temporal relationship. The details of the proposed framework are described in Sect. 4.

3 Problem Description

Before stating the problem, we introduce the basic terminology used throughout the paper.

3.1 Terminology

Modes of Operation: A behavior pattern that can be described visually or mathematically, such as vertical takeoff (vTakeoff), vertical land (vLand), or Search.

Scenario: A formal description of a UAV’s expected behavior from the start of a mission to the end of a mission, such as searching an area for specific targets.

External Observer: A combination of different sensors that are not mounted on the UAV agents but in the UAVs’ operational environment. This ensures that all UAV agents are in the field of view of the sensory system, for example fully synchronized high-resolution video cameras or a radar system, which tracks the UAVs and estimates the dynamics of the UAV agents (SUTs).

3.2 Problem Statement

The problem can be stated generally as follows: how can we automatically infer whether an autonomous UAV agent is performing well or exhibiting undesirable behavior in a multi-UAV or single-UAV mission by observing the agents with an external sensory system? Suppose that a UAV is assigned to a search mission. The objectives of the UAV are to search the whole area without colliding with any obstacle and to detect all the target objects in the given search region. After conducting experiments with the developed UAV in the real world, the question we focus on in this paper is how to assign a numerical value that evaluates the UAV’s performance in conducting the mission. To formulate this problem, we define $M \in \mathbb{N}$ as a discrete set of modes of operation and $S \in \mathbb{N}$ as a scenario composed of a subset of $M$. By observing the motions of the UAV agents in a physical test environment using an external observer, the objective is to predict the different modes in $S$. Let $z \in \mathbb{R}^n$ be the external observer measurement and $Z = [z_t, z_{t-1}, \cdots, z_{t-\tau}]$ be the sequence of measurements from the previous $\tau \in \mathbb{N}$ time steps, where $z_t$ is the current measurement and $t \in \mathbb{N}$ is the current time stamp. Each mode of operation in $S$ can be predicted using $Z$ such that $s = f(Z)$, where $s \in S$ and $f(\cdot)$ is a nonlinear function that maps the motion history of the UAV to the mode of operation. By predicting $s$ at each time stamp, we can assign a confidence score $\alpha$ to each UAV at the end of the mission, which reflects the overall performance of the UAV during the entire mission.
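As a rough illustration of how the measurement history Z might be assembled, the following NumPy sketch builds a 12-dimensional measurement from a 6D pose history by finite differences (the construction described in the caption of Fig. 1) and then stacks sliding windows of length τ + 1. The function names, the use of `np.gradient`, the sampling interval, and the toy trajectory are assumptions for illustration, not the authors’ implementation.

```python
import numpy as np

def pose_to_features(pose, dt):
    """Stack a (T, 6) pose history [x, y, z, phi, theta, psi] with its
    finite-difference time derivative to obtain (T, 12) measurements."""
    # np.gradient uses central differences inside and one-sided ones at the ends.
    # Angle wrap-around is ignored here for simplicity.
    velocity = np.gradient(pose, dt, axis=0)
    return np.hstack([pose, velocity])

def history_windows(z, tau):
    """Return an array of shape (T - tau, tau + 1, n) whose i-th entry is
    Z = [z_t, z_{t-1}, ..., z_{t-tau}] for t = i + tau."""
    T = z.shape[0]
    if T <= tau:
        raise ValueError("need more than tau measurements")
    return np.stack([z[t - tau:t + 1][::-1] for t in range(tau, T)])

# Toy example: a vertical climb at 0.5 m/s sampled every 0.1 s (made-up data).
dt = 0.1
time = np.arange(50) * dt
pose = np.zeros((50, 6))
pose[:, 2] = 0.5 * time

z = pose_to_features(pose, dt)
Z = history_windows(z, tau=9)
print(z.shape, Z.shape)  # (50, 12) (41, 10, 12)
```

Each window `Z[i]` is one input to the classifier f, so a mission of T time stamps yields T − τ mode predictions under this windowing choice.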

4 Proposed Framework

In this section, the proposed framework and its major components are described.

4.1 Overview of the Proposed Framework

As shown in Fig. 1, the proposed framework considers a physical testing environment containing objects such as boxes, artificial trees, blowers, and different light sources. These objects are used to create different types of real-world environments for testing. The environment is equipped with tracking devices, such as fully synchronized high-resolution video cameras, which act as the External Observer defined in Sect. 3.1. The purpose of the tracking devices is to estimate the motion of the UAV agents (position: $x, y, z$; orientation: $\phi, \theta, \psi$; linear velocity: $\dot{x}, \dot{y}, \dot{z}$; and angular velocity: $\dot{\phi}, \dot{\theta}, \dot{\psi}$; each in $\mathbb{R}$) while they are operating in a mission. Using the estimated motion data, the Perception Inference Engine (PIE) block predicts the current mode of operation of each UAV agent. The Evaluator block compares the predicted modes with the expected modes from the True Scenario block. At the end of the mission, the Evaluator block provides a confidence score regarding the performance of each UAV.

Fig. 1. In the testing framework, the observer module tracks all the UAVs individually and provides the 6D pose vector $(x, y, z, \phi, \theta, \psi) \in \mathbb{R}^6$. The 6D pose vector goes through a time derivative operation to generate the velocities. In the next step, the 12D vector $X = [x, y, z, \phi, \theta, \psi, \dot{x}, \dot{y}, \dot{z}, \dot{\phi}, \dot{\theta}, \dot{\psi}]$ and a history of these measurements are used as the feature representation in the Perception Inference Engine (PIE), which predicts the current mode of operation of each UAV. The predicted mode of operation is used by the evaluator along with the true scenario to generate a confidence score for each UAV in performing a defined mission. The confidence score is a measure of how well a UAV performs a mission that consists of several modes.

The proposed framework requires the following conditions to be satisfied:

1. All the UAVs are assumed to be autonomous and will execute a well-defined scenario.
2. During the execution of the scenario, all UAVs must be in the field of view of the Observer module shown in Fig. 1.

4.2 Modes of Operation

Hold: The UAV stays on the ground plane.

vTakeoff: The UAV starts flying vertically upwards until it reaches a predefined altitude.

Hover: The UAV stays at a fixed altitude and (x, y) coordinate.

vLand: The UAV begins flying vertically downwards until it reaches the ground plane.

Search: The UAV primarily moves in the (x, y) plane at a fixed altitude.

Loiter: The UAV follows a circular trajectory of fixed radius at a fixed altitude, centered on the detected (x, y) coordinates of the target object.

Obstacleavoid: The UAV follows a collision-free trajectory that depends on the properties of the obstacle. Two possible collision-free trajectories are shown in Fig. 2(g).

Each mode of operation is shown graphically in Fig. 2. These seven modes are used as a generalized description of a UAV’s behaviors and are not mission-specific. For example, the Loiter mode indicates the detection of any target, irrespective of its geometry or other properties.

Fig. 2. Graphical representation of the proposed seven modes of operation: (a) Hold, (b) vTakeoff, (c) Hover, (d) vLand, (e) Search, (f) Loiter, (g) Obstacleavoid.

4.3 Scenarios

Scenario-1: The UAV takes off from a predefined position until it reaches a user-defined altitude and then hovers for a specific period of time. Finally, it moves slightly in the positive x direction and lands at that location. The scenario is designed to test whether a UAV can fly autonomously.

Scenario-2: The UAV takes off from a predefined location until it gains a predefined altitude and then hovers for a specific period of time. Afterwards, it scans a rectangular area using a lawnmower-type search pattern. The UAV lands after finishing the scan. In this scenario, we test the UAV’s capability for searching an area.

Scenario-3: This is an extended version of Scenario-2. Here, some target objects are placed on the ground plane, which should be detected by the UAV while scanning the area. The objective of the scenario is to check whether the UAV’s perception module for target detection works properly. In this scenario, we expect to see the Loiter mode from the UAV, since the Loiter mode indicates that the UAV has detected the target.

Scenario-4: This is also an extended version of Scenario-2. Here, we put a static object on the nominal trajectory of the UAV so that it needs to avoid the obstacle in its path. This scenario is designed to test whether the UAV can avoid obstacles.

Scenario-5: This is a combination of Scenario-3 and Scenario-4. The test environment has both the target object and the static obstacle. This scenario is designed to test all the capabilities that have been tested individually in the other four scenarios.

The mode transition diagram of each scenario is shown in Fig. 3.
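One possible way to encode modes and scenarios for an automated check is sketched below. The seven mode names come from Sect. 4.2, but the mode ordering given for Scenario-2 is only a reading of the prose description (including the assumed initial and final Hold), and the helper functions are hypothetical; this is not the evaluator or the confidence score proposed in the paper.

```python
from enum import Enum
from itertools import groupby

class Mode(Enum):
    HOLD = "Hold"
    VTAKEOFF = "vTakeoff"
    HOVER = "Hover"
    VLAND = "vLand"
    SEARCH = "Search"
    LOITER = "Loiter"
    OBSTACLEAVOID = "Obstacleavoid"

# Assumed ordering for Scenario-2, read from the prose description only.
SCENARIO_2 = [Mode.HOLD, Mode.VTAKEOFF, Mode.HOVER, Mode.SEARCH, Mode.VLAND, Mode.HOLD]

def collapse_runs(per_step_modes):
    """Collapse consecutive repeats, e.g. [Hold, Hold, Hover] -> [Hold, Hover]."""
    return [mode for mode, _ in groupby(per_step_modes)]

def follows_scenario(per_step_modes, scenario):
    """True if the per-time-stamp modes visit the scenario's modes in order."""
    return collapse_runs(per_step_modes) == scenario

# Made-up per-time-stamp predictions for illustration.
predicted = [Mode.HOLD, Mode.HOLD, Mode.VTAKEOFF, Mode.HOVER, Mode.HOVER,
             Mode.SEARCH, Mode.SEARCH, Mode.VLAND, Mode.HOLD]
print(follows_scenario(predicted, SCENARIO_2))
```

An exact order match is deliberately strict; a graded score would need a definition of partial agreement, which this sketch does not attempt.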

Fig. 3. Mode transition diagram of the five scenarios (panels: Scenario-1, Scenario-2, Scenario-3, Scenario-4, Scenario-5).

4.4 Perception Inference Engine (PIE)

The Perception Inference Engine (PIE) is a classification model. The objective of PIE is to infer the current mode of operation of individual UAV agents using the motion data extracted from the motion capture system. In [23], the Decision Tree (DT), Support Vector Machine (SVM), and Naïve Bayes classifiers are used to predict the modes, and DT was reported to outperform the other two classifiers. However, the performance of the DT-based classifier degraded significantly when we designed the two new modes of operation (Loiter and Obstacleavoid). With further investigation, we discovered that the motion history of a UAV plays an important role in predicting the current mode. In [23], the motion history is not considered when predicting the mode of the UAVs; only the current measurement of the observer module is used. Therefore, the DT's performance is poor not only for the two new modes but also during the transition between two modes. More importantly, the behavior of UAVs is time-dependent, so it is necessary to explore the temporal relationship in the motion history. Hence, we treat the problem as a time series classification or sequence classification problem. The state-of-the-art time series classification technique is the LSTM-based classifier [7,9,11,12,14,26]. We use the bidirectional variant of the LSTM-based classifier to develop PIE.

The mathematical formulation of this type of network can be found in [14]. An extensive mathematical formulation and architecture of the Bi-LSTM network can be found in [5]. The architecture of PIE is a bidirectional LSTM followed by a dense layer to classify the mode of operation. The architecture is shown in Fig. 4. The LSTM block is described in the following paragraph.

Fig. 4. The architecture of Perception Inference Engine for ith UAV. Here, τ is the time step and FC is a fully connected layer.

LSTM Block: Each LSTM block in Fig. 4 takes three inputs: (i) the sensor measurement ($X_t$); (ii) the output ($h_{t-1}$) from the previous LSTM block; and (iii) the previous cell state ($C_{t-1}$). In an LSTM network the cell state is the mechanism to store the time dependency of input features in the form of memory. The memory is stored or deleted using gates. Therefore, each LSTM block is composed of three gates named as forget, input, and output as shown in Fig. 5.

– Forget gate: This gate decides what information needs to be stored or removed from the cell state. The output of this gate is calculated using

$$f_t = \sigma(W_f X_t + U_f h_{t-1} + b_f), \tag{1}$$

where $f_t \in \mathbb{R}^{n_o}$ is the output vector of the forget gate, $W_f \in \mathbb{R}^{n_o \times n_i}$ is the weight matrix of the forget gate for input features, $X_t \in \mathbb{R}^{n_i}$ is the input feature vector, $U_f \in \mathbb{R}^{n_o \times n_o}$ is the weight matrix of the forget gate for the output vector of the previous cell, $h_{t-1} \in \mathbb{R}^{n_o}$ is the output vector of the previous cell, $b_f \in \mathbb{R}^{n_o}$ is the bias vector of the forget gate, $n_o$ is the output dimension, $n_i$ is the input dimension, and $\sigma$ refers to a sigmoid function.

– Input gate: The input gate updates the old cell state using

$$i_t = \sigma(W_i X_t + U_i h_{t-1} + b_i), \quad \hat{C}_t = \tanh(W_c X_t + U_c h_{t-1} + b_c), \quad C_t = f_t \odot C_{t-1} + i_t \odot \hat{C}_t, \tag{2}$$

where $i_t \in \mathbb{R}^{n_o}$ is the output vector of the input gate, $W_i \in \mathbb{R}^{n_o \times n_i}$ is the weight matrix of the input gate for input features, $U_i \in \mathbb{R}^{n_o \times n_o}$ is the weight matrix of the input gate for the output vector of the previous cell, $b_i \in \mathbb{R}^{n_o}$ is the bias vector of the input gate, $\hat{C}_t \in \mathbb{R}^{n_o}$ is the candidate cell state vector, $W_c \in \mathbb{R}^{n_o \times n_i}$ is the weight matrix of the candidate cell for input features, $U_c \in \mathbb{R}^{n_o \times n_o}$ is the weight matrix of the candidate cell for the output vector of the previous cell, $b_c \in \mathbb{R}^{n_o}$ is the bias vector of the candidate cell, $C_t \in \mathbb{R}^{n_o}$ is the cell state vector, $\tanh$ is the hyperbolic tangent function, and $\odot$ denotes element-wise multiplication.

– Output gate: The output gate calculates the output of the LSTM block using

$$o_t = \sigma(W_o X_t + U_o h_{t-1} + b_o), \quad h_t = o_t \odot \tanh(C_t), \tag{3}$$

where $o_t \in \mathbb{R}^{n_o}$ is the output vector of the output gate, $W_o \in \mathbb{R}^{n_o \times n_i}$ is the weight matrix of the output gate for input features, $U_o \in \mathbb{R}^{n_o \times n_o}$ is the weight matrix of the output gate for the output vector of the previous cell, $b_o \in \mathbb{R}^{n_o}$ is the bias vector of the output gate, and $h_t \in \mathbb{R}^{n_o}$ is the output of the LSTM block.
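To make the three gate equations concrete, the following sketch evaluates one LSTM block update with plain Python lists. It is an illustration only, not the Keras implementation used for PIE: the helper names (`sigmoid`, `affine`, `lstm_step`), the parameter dictionary layout, and the all-zero toy parameters are assumptions introduced here.

```python
import math


def sigmoid(values):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-rows matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM block update following Eqs. (1)-(3)."""
    f_t = sigmoid(affine(p["Wf"], x_t, p["Uf"], h_prev, p["bf"]))  # forget gate, Eq. (1)
    i_t = sigmoid(affine(p["Wi"], x_t, p["Ui"], h_prev, p["bi"]))  # input gate, Eq. (2)
    c_hat = [math.tanh(v) for v in affine(p["Wc"], x_t, p["Uc"], h_prev, p["bc"])]
    # New cell state: keep part of the old memory, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = sigmoid(affine(p["Wo"], x_t, p["Uo"], h_prev, p["bo"]))  # output gate, Eq. (3)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy check with n_i = 2, n_o = 1 and all-zero (hypothetical) parameters:
# every gate evaluates to sigmoid(0) = 0.5 and the candidate state to tanh(0) = 0.
n_i, n_o = 2, 1
params = {}
for gate in "fico":
    params["W" + gate] = [[0.0] * n_i for _ in range(n_o)]
    params["U" + gate] = [[0.0] * n_o for _ in range(n_o)]
    params["b" + gate] = [0.0] * n_o

h_t, c_t = lstm_step([0.3, -0.7], [0.1], [2.0], params)
```

With zero parameters the forget gate halves the previous cell state and the candidate contributes nothing, so the new cell state is half of the old one. A bidirectional layer runs one such recurrence forward in time and a second one backward, then combines both outputs.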

Fig. 5. Graphical representation of the LSTM block.

4.5 True Scenario

The True Scenario block contains information about the expected behavior of the UAV agent in a particular scenario. It has a list of tuples, each containing seven variables denoted as $(m, x, y, z, w, l, h)$ with $x, y, z, w, l, h \in \mathbb{R}$. Here, $m$ refers to one of the seven modes ($m \in M$), and $(x, y, z)$ and $(w, l, h)$ are the center and dimensions of a virtual 3D bounding box within which the expected mode of operation is $m$. This block takes the current position of the UAV and compares it with all the list elements to find the UAV's current expected mode of operation. Suppose $(x_c, y_c, z_c) \in \mathbb{R}^3$ is the current position of the UAV; then the expected behavior of the UAV is $m$ if and only if $|x - x_c| \le \frac{w}{2}$, $|y - y_c| \le \frac{l}{2}$, and $|z - z_c| \le \frac{h}{2}$. Based on the scenarios defined in Sect. 4.3, the whole operational environment is partitioned into a number of 3D bounding boxes of different sizes. Each bounding box is assigned a unique mode of operation. Although this type of design may limit the development of scenarios in which a UAV could be in different modes at the same physical location at different time stamps, this design approach facilitates the automation of the testing framework. Moreover, the purpose of the framework is to test the UAV's capabilities, such as vTakeoff or vLand, irrespective of the physical location. For any scenario that requires the same location for different modes, we can design an alternative scenario to handle the particular location conflict among modes of operation.

4.6 Evaluator

The Evaluator block compares the predicted mode of operation from the PIE block with the expected mode of operation that comes from the True Scenario block. The True Scenario block provides the expected mode of operation based on the prior knowledge of the area where each scenario is supposed to be implemented and the current position of the UAV. For example, if the UAV is currently on top of a target object, then the expected mode of operation of the UAV is Loiter. The confidence score for each UAV is calculated using

$$\alpha = 1 - \frac{\sum_{n=0}^{N} W_{TM_n} \times L(TM_n, PM_n)}{\sum_{n=0}^{N} W_{TM_n}}, \tag{4}$$

where $\alpha \in [0, 1]$ is the confidence score, $N = T/\delta t \in \mathbb{N}$ is the total number of time-steps in a scenario, $T$ is the total mission execution time, $\delta t$ is the sampling period, $i = \{1, 2, \ldots, K\}$, $K$ is the number of modes in a particular scenario, $W_{TM} \in \mathbb{R}$ is the user-defined weight for the different modes in a scenario with $\sum_{i=1}^{K} W_{TM}^{i} = 1$, $TM$ is the true mode coming from the True Scenario block, $PM$ is the predicted mode coming from the PIE block, and $L(A, B) = 0$ if $A = B$ or $1$ if $A \neq B$. In Eq. (4), $\alpha$ is a weighted accuracy and reflects the accuracy of a UAV while performing a requested mission. Since $\alpha$ is calculated at the end of any mission, $T \le T_{max}$ varies, with $T_{max}$ the maximum allowed time for the execution of a mission. As an example, Scenario-1 has four modes of operation, so there are four user-defined weights, each associated with one of the four modes. Accordingly, the value of $K$ is 4 in Scenario-1. The user-defined weights are introduced to make the testing framework more flexible, so that more critical modes can be tested effectively: we can assign a small weight to the Hover mode and a larger weight to the Obstacleavoid mode so that the UAV is penalized more when it makes mistakes during obstacle avoidance.
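The interplay between the True Scenario block and the Evaluator can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the two bounding boxes, the sample positions, the predicted modes, and the weights are all hypothetical and serve only to show how the bounding-box test of Sect. 4.5 and the weighted accuracy of Eq. (4) fit together.

```python
def expected_mode(position, boxes):
    """Return the mode of the first bounding box containing `position`.

    Each box is a tuple (m, x, y, z, w, l, h): mode, center, dimensions.
    Returns None if the position lies outside every box.
    """
    xc, yc, zc = position
    for mode, x, y, z, w, l, h in boxes:
        if abs(x - xc) <= w / 2 and abs(y - yc) <= l / 2 and abs(z - zc) <= h / 2:
            return mode
    return None


def confidence_score(true_modes, predicted_modes, weights):
    """Weighted accuracy of Eq. (4); `weights` maps each mode to its weight."""
    penalty = sum(
        weights[tm] for tm, pm in zip(true_modes, predicted_modes) if tm != pm
    )
    total = sum(weights[tm] for tm in true_modes)
    return 1.0 - penalty / total


# Hypothetical partition of the environment into two boxes.
boxes = [
    ("Hover", 0.0, 0.0, 2.0, 2.0, 2.0, 2.0),
    ("Search", 5.0, 0.0, 2.0, 7.0, 2.0, 2.0),
]
positions = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.5), (4.0, 0.0, 2.0), (6.0, 0.5, 2.0)]
true_modes = [expected_mode(p, boxes) for p in positions]

# Hypothetical PIE output: the last time-step is misclassified.
predicted_modes = ["Hover", "Hover", "Search", "Hover"]

alpha_equal = confidence_score(true_modes, predicted_modes, {"Hover": 0.5, "Search": 0.5})
alpha_weighted = confidence_score(true_modes, predicted_modes, {"Hover": 0.2, "Search": 0.8})
```

With equal weights the single Search error costs one quarter of the score, while weighting Search more heavily lowers the score for the same error, which is the effect the user-defined weights are meant to provide.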

5 Synthetic Data Generation

The proposed data-driven testing framework requires a significant amount of good-quality data. Therefore, a simulation environment is developed using off-the-shelf, state-of-the-art tools and software packages that are both open source and user friendly. The Robot Operating System (ROS) is used to develop the intelligence of the UAV agent for mapping, path planning, or object recognition. We use the Pixhawk firmware to develop the low-level controller and on-board sensor data management of the UAV. Gazebo is used as the 3D simulation environment. All five scenarios are developed in this simulation environment and the motion data of the UAV agents are recorded to train the PIE classifier. The operational area of the UAVs is set to 50 m × 50 m × 10 m. The autonomy implementation details are out of the scope of this paper. We record twelve features: the $x$, $y$, $z$ positions; roll ($\phi$), pitch ($\theta$), and yaw ($\psi$); the $\dot{x}$, $\dot{y}$, $\dot{z}$ velocities in three directions; and roll rate ($\dot{\phi}$), pitch rate ($\dot{\theta}$), and yaw rate ($\dot{\psi}$). Using this feature vector, we can train a deep learning model to predict the current mode (one of the seven modes) of the UAV. The experimental results using these synthetic data are listed in Sect. 7.
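Because PIE is a sequence classifier (Sect. 4.4), the recorded feature stream has to be cut into fixed-length sequences before training. The sketch below shows one common way to do this. It rests on assumptions that the paper does not state: the function name `make_windows`, the stride, and the choice to label each window with the mode of its last sample are all hypothetical.

```python
def make_windows(samples, labels, time_steps, stride=1):
    """Cut a feature stream into overlapping windows of `time_steps` samples.

    Assumption: each window is labelled with the mode of its last sample.
    """
    windows, window_labels = [], []
    for end in range(time_steps, len(samples) + 1, stride):
        windows.append(samples[end - time_steps:end])
        window_labels.append(labels[end - 1])
    return windows, window_labels


# Toy stream: 10 samples with 12 (dummy) features each and per-sample mode labels.
samples = [[float(t)] * 12 for t in range(10)]
labels = ["Hover"] * 6 + ["vLand"] * 4

windows, window_labels = make_windows(samples, labels, time_steps=4, stride=2)
```

Each resulting window has shape (time_steps, number of features), which is the input layout expected by a recurrent classifier.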

6 Hardware Implementation

We use a UAV model that we built ourselves to implement the scenarios and test it using the proposed framework. We followed a development procedure for the UAV hardware similar to the one described in [8]. The UAV is designed using a quadcopter architecture. It is equipped with a Pixhawk autopilot for low-level actuator control and IMU data processing, and an Intel Aero Compute Board for the high-level implementation of mission planning and computer vision algorithms, such as 3D map generation of the environment using an Intel RealSense camera. Moreover, the UAV can carry an external load of 250 g for a maximum flight time of 12 min. We use a software architecture similar to that of the simulation platform, namely ROS and the Pixhawk firmware. As a result, all the software developed in the simulation platform can be used directly on the UAV hardware with minor modifications. The assembled UAV is shown in Fig. 6. The dimension of the indoor flight testing environment is 12 m × 4 m × m and it has ten high-definition, fully synchronized, high-resolution video cameras which are evenly distributed in the space. During the hardware implementation of the framework, we use the motion capture system to track each UAV and predict the current mode of operation of the UAV using the PIE block. The testing framework provides the confidence scores for these UAV agents after each implemented scenario. The experimental results are discussed in Sect. 7.

7 Experiments and Discussion

In this section, the experimental results from simulation and hardware implementation are described.

Fig. 6. The developed UAV which is used to implement the scenarios and tested in the indoor flight testing environment.

7.1 Data Collection and Model Selection

In the first stage, we implement all five scenarios described in Sect. 4.3 in the simulation environment and record the data during the simulation. All the data is sampled at 60 Hz. The recorded data is presented in Table 1. We label the data with respect to the seven modes and utilize the labeled data to train the classification model.

Table 1. An overview of the recorded synthetic data from four UAVs.

Mode            Scenario-1   Scenario-2   Scenario-3   Scenario-4   Scenario-5
Hold                 13264         3378        64310        15361        33644
vTakeoff              1813         1850         1900         1978         1974
Hover                 9001         9036         9045         9039         9025
Search                   0        40868       204655       154459       156413
Loiter                   0            0        83451            0        49397
Obstacleavoid            0            0            0         7665         9119
vLand                 1940         1974         1986         1993         2027

Since Scenario-5 contains all the developed modes of operation, we use only Scenario-5 to train the model. We split the data from Scenario-5 into training (80%) and validation (20%) sets. Then, we test the model with data from the other four scenarios. To select the best Bi-LSTM model, we use three different time-steps (32, 64, 128), two different feature sets (seven features: $[z, \dot{x}, \dot{y}, \dot{z}, \psi, \theta, \phi]$ and nine features: $[x, y, z, \dot{x}, \dot{y}, \dot{z}, \psi, \theta, \phi]$), and four different neuron sizes in the LSTM cell (64, 128, 256, 512). A total of 24 (3 × 2 × 4) different models are trained for 700 epochs with the Adam [16] optimizer (learning rate = 0.001) and categorical cross-entropy as the loss function. We use Keras [3] with the TensorFlow [1] back-end as the deep-learning framework to develop, train, and validate the models. We used an Intel Xeon(R) CPU at 2.2 GHz with 88 cores, 128 GB RAM, and an Nvidia GeForce RTX 2080 Ti GPU. Training one Bi-LSTM model took 7 h 46 min 7 s. Based on the accuracy metric, the best parameter set is obtained as 128 time-steps, the seven-feature set, and a neuron size of 64, as shown in Fig. 7. The best model has only 37,767 trainable parameters. Table 2 summarizes the testing scores under the best parameter setting in terms of Accuracy, Precision, Recall, and F1 Score. A comparison with the base decision tree model is also presented in Table 2. The decision tree is trained with the same number of input features and the same number of time-steps. Consequently, the input feature length of the decision tree is 896 (128 × 7). From Table 2, it is clear that Bi-LSTM outperforms the base decision tree (DT) model, and thus we use the Bi-LSTM model in the PIE block. With the best parameter setting, the trained model is used as the PIE block in the framework and experiments are conducted using four UAVs. The details of the experiments are discussed in the following subsections.
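The size of the search grid and the reported parameter count can be checked with a few lines of arithmetic. The sketch below assumes a single bidirectional LSTM layer whose forward and backward outputs are concatenated and fed to one fully connected layer with seven outputs (one per mode); this layout is inferred from Fig. 4 and is an assumption, as are the function names. Under that assumption the count reproduces the reported 37,767 trainable parameters.

```python
from itertools import product


def lstm_params(n_in, n_out):
    """Parameters of one LSTM direction: four weight sets (forget, input,
    candidate, output), each with W (n_out x n_in), U (n_out x n_out), b (n_out)."""
    return 4 * (n_out * n_in + n_out * n_out + n_out)


def bilstm_classifier_params(n_features, n_units, n_classes):
    """Bi-LSTM (forward + backward) followed by one dense layer (assumed layout)."""
    recurrent = 2 * lstm_params(n_features, n_units)
    dense = (2 * n_units) * n_classes + n_classes  # acts on concatenated outputs
    return recurrent + dense


# Hyperparameter grid described in the text.
time_steps = (32, 64, 128)
feature_sets = (7, 9)
neuron_sizes = (64, 128, 256, 512)
grid = list(product(time_steps, feature_sets, neuron_sizes))

# Best setting: 128 time-steps, seven features, 64 neurons, seven modes.
best_params = bilstm_classifier_params(n_features=7, n_units=64, n_classes=7)

# The decision tree baseline sees the flattened window instead.
dt_input_length = 128 * 7
```

Note that the number of time-steps does not change the parameter count of the recurrent model, because the same weights are reused at every step; it only changes the flattened input length of the decision tree.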

Fig. 7. Performance of different trained models for different combination of parameters.

7.2 Deployment of PIE

In the simulation interface, three different types of experiments are conducted to evaluate the efficacy of the proposed testing framework. In each experiment, all the UAVs operate concurrently.

Experiment-1: UAVs with healthy sensors, controllers, and planning algorithms are used in all five scenarios. Table 3 summarizes the testing results of UAVs with healthy functionalities. From Table 3, we can infer that each UAV gets a confidence score of close to 0.9 or above for performing well in these five scenarios. In each trial, the user-defined weights are set to $\frac{1}{K}$ for each mode of operation.

Table 2. Best testing scores of the best trained models (Bi-LSTM and DT) using data which are recorded from four UAVs during the data-collection phase.

             Accuracy         Precision        Recall           F1 Score
             Bi-LSTM   DT     Bi-LSTM   DT     Bi-LSTM   DT     Bi-LSTM   DT
Scenario-1   0.96      0.85   0.96      0.92   0.96      0.85   0.96      0.86
Scenario-2   0.96      0.92   0.98      0.95   0.96      0.92   0.97      0.94
Scenario-3   0.97      0.93   0.97      0.93   0.97      0.93   0.97      0.93
Scenario-4   0.97      0.90   0.98      0.98   0.97      0.90   0.98      0.94
Scenario-5   0.95      0.90   0.95      0.91   0.95      0.90   0.95      0.90

Table 3. Average confidence score (α) of 10 trials, for the four simulated UAVs for five different scenarios while UAVs have healthy implementation of autonomy.

             UAV1   UAV2   UAV3   UAV4
Scenario-1   0.86   0.87   0.90   0.87
Scenario-2   0.94   0.93   0.96   0.92
Scenario-3   0.93   0.93   0.92   0.94
Scenario-4   0.93   0.91   0.93   0.92
Scenario-5   0.88   0.90   0.90   0.91

Experiment-2: In this experiment, we disable different functionalities of UAVs to test whether the proposed testing framework can capture the deficiencies of UAVs in four scenarios. In Scenario-2, we inject delays in the path planning algorithms of UAV1 and UAV3. The target detection modules of UAV2 and UAV3 are disabled during the execution of Scenario-3. For Scenario-4 and Scenario-5, the obstacle detection sensors are removed from UAV1 and UAV4. The testing results are shown in Table 5. From Table 5, we observe that the affected UAVs show a poor confidence score in each scenario when such technical issues occur. The user-defined weights for this experiment are tabulated in Table 4.

Table 4. User-defined weights for different scenarios for Experiment-2. Here, "-" indicates that the respective scenario does not have the respective mode of operation.

             W_Hold   W_vTakeoff   W_Hover   W_Search   W_Loiter   W_Obstacleavoid   W_vLand
Scenario-2   0.1      0.1          0.1       0.6        -          -                 0.1
Scenario-3   0.1      0.1          0.1       0.1        0.5        -                 0.1
Scenario-4   0.1      0.1          0.1       0.1        -          0.5               0.1
Scenario-5   0.1      0.1          0.1       0.1        0.1        0.4               0.1

Table 5. Average confidence score (α) of 10 trials, for the four simulated UAVs for four different scenarios while different functionalities of UAVs are disabled.

             UAV1   UAV2   UAV3   UAV4
Scenario-2   0.15   0.89   0.17   0.88
Scenario-3   0.95   0.09   0.12   0.93
Scenario-4   0.05   0.94   0.93   0.02
Scenario-5   0.04   0.92   0.91   0.02

Experiment-3: In this experiment, we implement a new behavior in Scenario-3 for which the PIE model is not directly trained. In the modified Scenario-3, UAVs are commanded to move gradually upward when they find a target object in the search space. Therefore, the trajectory of the Loiter mode changes from a circle into a spiral in this experiment, as shown in Fig. 8. The maximum change of altitude in this mode is set as 25% of the vertical height of the indoor flight testing space, which is a 67% change of altitude from the nominal behavior. We introduce this type of behavior to represent atmospheric disturbance in the physical environment. The testing result is shown in Table 6. Table 6 indicates that the proposed framework still provides a high confidence score to each UAV when the UAV's behavior shifts significantly from the nominal behavior but in an acceptable direction. In each trial, the user-defined weights are set to $\frac{1}{K}$ for each mode of operation.

Fig. 8. The expected behavior of a UAV during Loiter mode in the modified Scenario-3.

Table 6. Average confidence score (α) of 10 trials, for the four simulated UAVs for the modified Scenario-3.

             UAV1   UAV2   UAV3   UAV4
Scenario-3   0.89   0.86   0.86   0.85

In summary, from Table 3, it is clear that the testing framework provides a high confidence score while all the UAVs are performing well in the simulation environment. Conversely, Table 5 shows very poor confidence scores for UAVs in the simulation when there are technical issues in the UAV's autonomy implementation. Moreover, Table 6 demonstrates that the testing framework can predict unseen behavior.

7.3 Deployment of PIE in Hardware

After obtaining satisfactory results from the simulation, we implement Scenario-1 and Scenario-2 in the indoor flight testing facility for two UAVs. The objective of the real-world implementation and experimentation is to further evaluate the performance of the proposed testing framework. Due to the limited space of the physical environment, we are unable to implement Scenario-3, Scenario-4, and Scenario-5 in the indoor flight testing facility. We do not implement any faulty behaviors in real UAVs because doing so may damage the hardware components of the UAVs. Moreover, we plan to develop a control mechanism that will take control of the UAVs in case an unexpected behavior occurs during the physical testing. The UAVs are tracked using high-definition, fully synchronized, high-resolution video cameras (External Observer) and the obtained features are fed to the trained PIE model. The Evaluator uses the predicted mode from the PIE module and the true mode from the True Scenario block to calculate the confidence score of each UAV in the indoor flight testing facility. The mean of the confidence scores over ten trials for the UAVs in the hardware implementation is presented in Table 7. Considering the dynamic nature of the physical world, we also provide the standard deviation of the confidence scores in Table 7.

Table 7. Average confidence score (α) of 10 trials, for the two real UAVs for two different scenarios while UAVs have healthy implementation of autonomy.

             UAV1            UAV2
             μ      σ        μ      σ
Scenario-1   0.95   0.015    0.96   0.011
Scenario-2   0.92   0.026    0.93   0.014

From Tables 3 and 7, we can infer that the proposed testing framework can be utilized to test the operational success of autonomous UAV missions not only in the simulation platform but also in the physical testing environment. More importantly, Table 5 indicates that the framework can evaluate the undesirable behavior of a UAV agent by assigning a relatively low confidence score.

8 Conclusion and Future Work

In this paper, a data-driven testing framework is proposed to evaluate the performance of multiple UAVs while performing missions in real-world scenarios. We define seven modes of operation to describe the behaviors of UAVs and use the defined modes of operation to assess the autonomous capabilities of UAVs in executing missions. The LSTM-based classifier is used to provide an accurate estimation of the behavior of UAVs by exploring the temporal relationship in the data and a comparison study with previously known best classifier (Decision tree) has been presented. To train and validate the proposed framework, we developed a simulation interface using ROS, Gazebo, and Pixhawk. Five different scenarios were designed and implemented in the developed simulation interface. Furthermore, we implemented two scenarios in the indoor flight testing facility and tested the proposed framework with real-world data. The experimental results justify the efficacy of the proposed testing framework for use in both simulation and real-world scenarios. A potential application of the proposed framework is the development of safe and reliable urban air mobility networks. Additionally, the proposed framework offers a promising solution for the recurrent qualification assurance of UASs after deployment. In the future, we will implement the other three scenarios in a larger indoor testing facility with a control mechanism that will reduce the chance of damaging the UAV during the testing procedure. Moreover, the collaboration among UAVs will be considered for designing more comprehensive scenarios for the performance evaluation of the proposed testing framework. Furthermore, efforts will be conducted on the interpretation of the confidence score in terms of the upper or lower bounds to consider real-world standards for the operational success of UASs. Acknowledgment. 
The authors would like to thank the Office of the Secretary of Defense (OSD) for the financial support under agreement number FA8750-15-2-0116. This work is also partially funded through the National Institute of Aerospace's Langley Distinguished Professor Program under grant number C16-2B00-NCAT, and by the NASA University Leadership Initiative (ULI) under grant number 80NSSC20M0161. Also, this work is supported in part by NSF under grant Nos. CAREER CPS-1851588, S&AS 1849198, and SATC-1801611.

References

1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), pp. 265–283 (2016)
2. Chaki, S., Dolan, J.M., Giampapa, J.A.: Toward a quantitative method for assuring coordinated autonomy. In: Proceedings of ARMS Workshop (2013)
3. Chollet, F., et al.: Keras (2015)
4. Cowart, K., Valerdi, R., Kenley, C.R.: Development, validation and implementation considerations of a decision support system for unmanned & autonomous system of systems test & evaluation (2010)
5. Cui, Z., Ke, R., Pu, Z., Wang, Y.: Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143 (2018)
6. Djang, P.A., Lopez, F.: Unmanned and autonomous systems mission based test and evaluation. In: Proceedings of the 9th Workshop on Performance Metrics for Intelligent Systems, pp. 81–85 (2009)
7. Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM (1999)
8. Girma, A., et al.: IoT-enabled autonomous system collaboration for disaster-area management. IEEE/CAA J. Automatica Sin. 7(5), 1249–1262 (2020)
9. Girma, A., Yan, X., Homaifar, A.: Driver identification based on vehicle telematics data using LSTM-recurrent neural network. In: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 894–902. IEEE (2019)
10. Gonda, N.D.: A framework for test & evaluation of autonomous systems along the virtuality-reality spectrum. Master's thesis, Old Dominion University, Norfolk, VA, USA (2019)
11. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)
12. Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: a search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28(10), 2222–2232 (2016)
13. Holden, J., Goel, N.: Fast-forwarding to a future of on-demand urban air transportation, San Francisco, CA (2016)
14. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
15. Keane, J., Joiner, K.: Experimental test and evaluation of autonomous underwater vehicles. Aust. J. Multi-Discip. Eng. 16(1), 67–79 (2020)
16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
17. Koopman, P., Wagner, M.: Challenges in autonomous vehicle testing and validation. SAE Int. J. Transp. Saf. 4(1), 15–24 (2016)
18. Leathrum, J.F., Mielke, R.R., Shen, Y., Johnson, H.: Academic/industry educational lab for simulation-based test & evaluation of autonomous vehicles. In: 2018 Winter Simulation Conference (WSC), pp. 4026–4037. IEEE (2018)
19. Michelson, R.C.: Test and evaluation for fully autonomous micro air vehicles. ITEA J. 29(4), 367–374 (2008)
20. NASA: Advanced air mobility studies/reports/presentations (2019)
21. Reitz, B.C., Wilkerson, J.L.: Test and evaluation of autonomous surface vehicles: a case study. In: 2020 IEEE/ION Position, Location and Navigation Symposium (PLANS), pp. 839–850. IEEE (2020)
22. Roske, V.P., Kohlberg, I., Wagner, R.: Autonomous systems challenges to test and evaluation. In: 28th Conference of National Defense Industrial Association (2012)
23. Sarkar, M., Homaifar, A., Erol, B.A., Behniapoor, M., Tunstel, E.: PIE: a tool for data-driven autonomous UAV flight testing. J. Intell. Robot. Syst. 98(2), 421–438 (2019). https://doi.org/10.1007/s10846-019-01078-y
24. Sun, Y., Xiong, G., Song, W., Gong, J., Chen, H.: Test and evaluation of autonomous ground vehicles. Adv. Mech. Eng. 6 (2014). https://doi.org/10.1155/2014/681326
25. Thompson, M.: Testing the intelligence of unmanned autonomous systems. Technical report, Trideum Corp., Aberdeen, MD (2008)
26. Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., Woo, W.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, pp. 802–810 (2015)

Addressing Consumer Demands: A Manufacturing Collaboration Process Using Blockchain for Knowledge Representation

Ricardo Barbosa1,2(B), Ricardo Santos1, and Paulo Novais2

1 CIICESI, Escola Superior de Tecnologia e Gestão, Politécnico do Porto, Porto, Portugal {rmb,rjs}@estg.ipp.pt
2 Department of Informatics, ALGORITMI Center, University of Minho, Braga, Portugal

Abstract. Under I4.0, the evolution of manufacturing processes is supported by an increase in the data that is available to and produced by organisations, the digitalisation of manufacturing pipelines, and a paradigm shift in production (from mass production to mass personalisation). Additionally, organisations need to gather the necessary conditions to ensure their quick adaptation to a changing environment and replace reactiveness with proactivity. Collaboration can act as the foundation of an answer to the increased demand for customised products, with an open and transparent environment where information is shared and actors can work together to solve a common problem. In this work we propose a model definition for an industrial collaboration network composed of a network of entities, with reasoning and interaction, that uses a blockchain for knowledge representation. Current definitions of MAS already include a representation of equipment, transportation, products, and organisations; our contribution proposes the inclusion of the consumer, represented by an agent, directly in the manufacturing process. This agent represents the preferences and needs of the consumer in product customisation scenarios and, together with the other agents, negotiates criteria and cooperates with them. The network is composed of distinct types of agents, across multiple organisations, that share common objectives. We use Hyperledger Fabric to represent knowledge, assuring that the data is stored and shared with all entities, while keeping the information secure and assuring that it cannot be tampered with.

Keywords: Collaboration · Negotiation · Industry 4.0 · Blockchain · Multi-agent system

This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UIDB/04728/2020. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 375–390, 2022. https://doi.org/10.1007/978-3-030-82193-7_25

1 Introduction

Recent developments regarding Industry 4.0 (I4.0) definitions are commonly naming collaboration scenarios, and the urge to collaborate, as an essential characteristic for the success of the fourth industrial revolution [3]. As a revolution that is destined to impact the overall performance, quality, and the control of the manufacturing process, is still facing some challenges [24]. To answer its demands, organisations have a necessity to collaborate more efficiently, making faster and reliable decisions, and establishing transactions between the right partners. Organisations establish among themselves a collaborative principle, which typically operates in the supply chain, in order to introduce benefits to their activities, as well as the ability to respond to the needs of the most demanding consumer. With the introduction of I4.0, the manufacturing process must be able to meet the consumers needs, resulting in an increase in the demand associated with the supply chain, which represents a necessity to improve communication and supplier integration [13]. In a set of entities, where all (or part of) services and/or products are highly dependent on the availability of services and/or products from other entities, it is difficult to execute quick and easy decisions. There will always be a set of dependencies between two or more entities that share a position in the supply chain [27], and problem is that in an environment where organisations need to make decisions assertively and quickly, there is no way of knowing which entity to depend on. Such developments in the manufacturing processes are being supported by governmental agenda, being part of the several strategy plans for the future of the industry in many countries, including the European Union [11]. Collaboration is based on trust, but without social capabilities or characteristics, collaboration can be based on agreeable contracts that bound two entities [30]. 
We evidence four necessities: (1) the decentralisation of the decision-making process; (2) supporting collaboration; (3) include the consumer in the manufacturing processes; (4) and represent the generated knowledge. As result this work proposes the inclusion of the consumer, represented by an agent, directly in the manufacturing process. To achieve that, this work proposes the definition of a model for an industrial collaboration network, that includes the consumer in the manufacturing process (social manufacturing) and is composed by a collaborative network of entities (based on a Multi-Agent System), a reasoning and interaction layer, and the usage of blockchain to represent knowledge. Our proposal is based on the definition of a Multi-Agent System (MAS) to represent industrial entities in an environment where there is a necessity for collaboration, while maintaining competitive natures. This model can support decision-making processes regarding which entity should one rely on, to solve existing dependencies and is initially focused on the manufacturing of customised products. This work is structured as follows. In Sect. 2 we present a background on the technologies and concepts included in this work, with special reference to the blockchain technology. This Section also includes related work entries that are the result of the combination of MAS and blockchain technologies; Sect. 3

Manufacturing Collaboration Process Using Blockchain for KR

describes our proposed solution, including the process that originated the proposed model and a description of its main components, namely the network of entities, the reasoning and interaction layer, and knowledge representation. The work concludes (Sect. 4) with a discussion of the proposed solution, its strengths, limitations, and future work.

2 Background

As a concept, Ambient Intelligence defines a vision of the Information Society with emphasis on greater ease of use, more efficient support services, and support for human interactions, referring to a digital environment that proactively but sensibly supports people in their daily activities. The focus of this concept has been adjusted over time according to the needs of each period [6], and a quantitative analysis of scientific publications in the field suggests that the term has been replaced by more popular terminologies suited to each area of application, including the I4.0 terminology that is typically associated with Ambient Intelligence in an industrial context (or an intelligent industrial environment). First introduced by German industry during the Hannover Fair in 2013 [15], the I4.0 concept is driven by emerging technologies that are being adopted in manufacturing environments, such as the Internet of Things, wireless sensor networks, big data, cloud computing, and embedded systems. One of its main objectives is the creation of new value for industry, through the creation of new business models and the resolution of numerous social problems [14]. Cyber-Physical Systems (CPS) are defined as a transforming technology that provides innovative services to enable interconnected operations between physical assets, computing, and communication [16]. Shafiq et al. [26] define CPS as being “the convergence of the physical and digital worlds by establishing global networks for business that incorporate their machinery, warehousing systems and production facilities”. Monostori et al. [20] affirm that CPS “are systems of collaborating computational entities which are in intensive connection with the surrounding physical world and its ongoing processes, providing and using, at the same time, data-accessing and data-processing services available on the Internet”.
With the growing use of sensors and network-connected machines, there will be a continuous generation of data that CPS can manage, leveraging the connectivity between machines to give rise to smart machines. Applying the concept of CPS to production, logistics, and services in current industrial practice would also transform the factories of today into smart factories with significant economic potential [16]. With the increasing use of online social networks and the adoption of new technologies, there is a demand to include consumers' opinions in product manufacturing, customisation, and delivery, requiring factories to become self-aware and self-maintaining, and capable of making market predictions and acting accordingly [17]. Social Cyber-Physical Systems (SCPS) are an evolution of the CPS

R. Barbosa et al.

model that combines production services with the consumer, understanding consumer demands and offering personalised products and services in a timely manner [33]. Agent-based technology is recognised as an important approach for twenty-first century manufacturing systems. The suitability of agent technology is a key factor to consider in real-world applications, particularly in I4.0, since it can bring a major improvement to decision-making processes and to the collaboration of different systems [2]. An agent is an entity that senses the environment and acts on it, performing a task continuously and with strong autonomy in a shifting environment, while coexisting with other entities and processes. Multi-Agent Systems (MAS) aim to provide both principles for the construction of complex systems involving multiple agents and mechanisms for the coordination of independent agent behaviour [28]. While an agent is any individual entity that makes decisions independently, a MAS is a network of agents that work together to solve a specific problem, implying a certain level of cooperation among the agents involved, which can be explicit by design or adapted. MAS are a particular type of intelligent system, in which autonomous agents dwell in a world with no control or persistent knowledge. This infrastructure has been studied as a solution for managing widely distributed systems, particularly industrial applications. MAS, which consist of multiple autonomous agents with distinct goals, are especially suitable for the development of complex and dynamic systems. Agents communicate with each other and with the environment, with a focus on understanding the latter and reasoning upon intelligent models, coordinating their efforts to achieve their own goals and those of the ecosystem in which they are inserted.

2.1 Blockchain

Since the publication of “Bitcoin: A Peer-to-Peer Electronic Cash System” by Nakamoto [21], and the subsequent announcement of the first public version of the bitcoin client, blockchain has been on its way to becoming one of the most popular topics today. Since then, blockchain has been commonly associated with cryptocurrency and has accompanied its success, which has triggered the curiosity of researchers from different academic backgrounds to pursue the many possible application scenarios for blockchain technology [4]. Despite the current success in digital currencies and financial assets, its potential range of applications is still a work in progress [1]. Blockchain is the generic designation given to transaction persistence protocols, which are based on different algorithms and cryptographic principles that ensure the integrity and traceability of all transactions within the system without the need to place trust in a central entity, thereby keeping the system decentralised and distributed. The successor of the initial blockchain protocols (Blockchain 1.0), whose implementation is restricted to ensuring that a predefined set of validations is respected, is Blockchain 2.0. This new designation is associated with the new generation of blockchain protocol implementations, designed from their inception to support the definition

of business rules and custom validations through smart contracts. This generation emerged as a direct response to increasing demand from industry, which was anxiously expecting a framework that would allow the full exploration of this technology for a wide variety of purposes. Smart contracts were introduced as a concept by Szabo in the 1990s [29] and defined as a computerised transaction protocol that executes the terms of a contract [8]. This definition was based on the need to translate contractual clauses into code and embed them in hardware or software capable of self-enforcing them, reducing the need for a trusted intermediary between transacting parties. In blockchain, smart contracts are self-enforcing scripts that represent a digital contract [18]. They work as a software protocol that performs an action when certain conditions are met, reducing the amount of human involvement required to create, execute, and enforce a contract. Since there is no need for the contract partners to fully trust each other, blockchain, as a distributed system, is suitable for this type of application, removing the intermediary and simplifying trustless protocols between multiple parties [32].

2.2 Related Work

The combination of these technologies (namely blockchain, MAS, and smart contracts) is not a novel concept. Existing proposals combine some or all of the technologies described above in the domain considered here (the intelligent industrial environment), namely:

– The work of Casado-Vara et al. [7] presents a model that uses a combination of blockchain, smart contracts, and a MAS to coordinate the tracking of food in the agriculture supply chain. The proposed model uses blockchain to store a record of all transactions, a decision the authors justify by security and decentralisation needs. The coordination of all the members of the supply chain is performed using a MAS, where agents verify the fulfilment of smart contracts for each transaction between entities.
– Abeyratne and Monfared [1] aimed to define a blockchain-based system to manage the vast amount of data that is required about products and their respective consumers in a manufacturing domain. Their approach consists of a decentralised distributed system that uses blockchain to collect, store, and manage the data related to the product life cycle. The authors claim that this solution allows consumers to access information related to a specific product at any given time, resulting (theoretically) in better buying decisions.
– Ghadimi et al. [12] propose a MAS approach to automate and facilitate sustainable supplier selection and order allocation, which results in a more cooperative partnership. Their proposed model is composed of two sub-models: a supplier evaluation model and an order allocation model. The first sub-model uses three types of agents: a database agent, a supplier agent, and a decision maker agent. The second sub-model uses an order allocator agent, a database agent, and a supplier agent. According to the authors, their model can improve the order fulfilment rate, decrease demand uncertainty, and eventually lead to improvements in the performance of a supply chain.
– The work of Wang et al. [31] proposes a MAS to represent an industrial network, defining the following agents: the Machine Agent (MA), which represents all the equipment that performs any production or test activity; the Conveying Agent (CA), which represents all entities that move a product, such as robots, conveyor belts, and others; and the Product Agent (PA), which represents the products that are or will be processed by MA and transported by CA. In addition, they propose an intelligent negotiation mechanism that lets agents cooperate with each other and prevents deadlocks by improving their decision-making and coordination behaviour.

3 Proposed Solution

The challenges that industries face today include the need for collaboration between different organisations and/or partners, including suppliers, service providers, shipping providers, and even competing organisations. While this collaboration might prove to be an improvement on the unidirectional communication channels that organisations previously had, there is still a need to create communication channels to consumers in order to answer the shift in the production paradigm from mass production towards mass customisation. With reports [9] affirming that consumers are asking for more complex or personalised products with increasing frequency, only through collaboration will organisations be capable of answering such exclusive demands. Collaboration occurs when organisations work jointly on the development of products and the distributed returns are sufficient for all the collaborating parties [23]. It involves a free flow of information between collaborating organisations, which in turn provides faster decision-making and can enhance the effectiveness of internal processes. With the increase in production efficiency under I4.0, manufacturing flexibility and the integration of different processes and activities are guaranteed by the intelligent manufacturing environment. The problem is how, besides handling manufacturing and process flexibility, industries will be able to fulfil the personalised demands of their consumers and respond better to their needs and preferences. I4.0 assumes that operations take place in a computerised and intelligent manufacturing environment that assures flexibility and high production efficiency, which allows faster communication between customer and producer, with consumers being much more demanding and requesting more personalised products. As a result, it is even more important to include the consumer in the manufacturing process (social manufacturing).
Therefore, this work proposes a model for an industrial collaboration network that includes the consumer in the manufacturing process. A visual representation of this model is given in Fig. 1, which is divided into three main components: (1) a collaborative network of entities based on a MAS; (2) a reasoning and interaction layer; and (3) knowledge representation using blockchain. The proposal is based on the definition of a MAS in an industrial context, in which agents represent the different entities of an industrial environment where there is a need for collaboration, decentralising decision-making processes and aiding the manufacturing process. The model is designed to be capable of representing and supporting the complex structure of dependencies created between entities, to improve decision-making processes, and to facilitate relationships through collaboration. Organisations need to look at the individualisation of customers' requirements, where the goal is to deliver a variety of goods that satisfy small customer groups with specific needs while reducing production costs and focusing on customisation, flexibility, and responsiveness.

Fig. 1. Industrial collaboration model that includes the consumer. On the left: entities network from different organisations (A, B, C) based on a MAS. In this network the MA are represented by , CA represented by ♦, PA represented by , and the CsA are represented by ; On the right: reasoning and interaction layer and a consortium blockchain node representation, used for knowledge representation.

An organisation's presence in the collaboration network does not mean that it is associated with, or is part of, another organisation. Instead, the organisation is allowed to use the resources that are available in the network. As a result, organisations A, B, and C might not have any form of business relationship between them, or even any past interaction or transaction. Although the initial definition of this model is oriented towards the increase in demand for customised products, the model can work as a collaborative network for any manufacturing process.

3.1 Collaborative Network of Entities

This initial part of the model is achieved through the creation of a network, as suggested by the work of Schuh et al. [25], in which entities can collaborate towards stronger cooperation and each can achieve its targets. We build this network of entities with an approach similar to the model presented by Wang et al. [31], who define a MAS that includes:

– Machine Agent (MA): represents all the equipment that performs any production or test activity;
– Conveying Agent (CA): represents all entities that move a product, such as robots, conveyor belts, and others;
– Product Agent (PA): represents the products that are or will be processed by MA and transported by CA.

Our model proposal, represented in Fig. 1, includes these agents in its network of entities, each with its own visual marker (CA, for instance, are represented by ♦). The objective of this network is not to create the idea that the entities belonging to the network appear and operate as a single larger entity. Instead, the point of the network is to encapsulate the different entities and their relationships in the same environment, so that the other components of the proposed model can be applied in an organised setup. Each entity has knowledge of all other entities present in the network, including their function, inputs, outputs, and credibility (to be discussed in Sect. 3.3). Smart manufacturing creates a new type of product: the intelligent product. These products contain embedded sensors, identifiable components, and processors that carry information and knowledge to convey functional guidelines to the production system, including information about their production requirements and the equipment required to meet them. In this way, each PA knows, at any given moment, all the steps it has already taken, all the MA it has passed through, the remaining steps, and which MA are needed for its completion.
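The state that a PA is described as carrying can be pictured with a small data structure. The following is only an illustrative sketch: the class name `ProductAgent`, the `route` representation, and the step and machine identifiers are assumptions made here and are not part of the proposed model or of the work of Wang et al. [31].

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProductAgent:
    """Hypothetical Product Agent (PA) that tracks its own production route."""

    product_id: str
    # Ordered production steps, each paired with the Machine Agent (MA) it requires.
    route: list[tuple[str, str]]
    completed: int = 0  # number of steps already taken

    def steps_taken(self) -> list[str]:
        return [step for step, _ in self.route[: self.completed]]

    def machines_visited(self) -> list[str]:
        return [ma for _, ma in self.route[: self.completed]]

    def remaining_steps(self) -> list[str]:
        return [step for step, _ in self.route[self.completed:]]

    def machines_needed(self) -> list[str]:
        return [ma for _, ma in self.route[self.completed:]]

    def record_step(self, machine_id: str) -> None:
        """Mark the next step as done, checking that it ran on the expected MA."""
        if self.completed >= len(self.route):
            raise ValueError("product already completed")
        step, expected = self.route[self.completed]
        if machine_id != expected:
            raise ValueError(f"step {step!r} requires {expected}, got {machine_id}")
        self.completed += 1
```

With a route such as `[("cut", "MA1"), ("drill", "MA2"), ("paint", "MA3")]`, calling `record_step("MA1")` moves the first step from the remaining list to the completed list, which is the bookkeeping the paragraph above attributes to each PA.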
The main contribution of this proposal is the introduction of a new actor (the consumer), represented by an agent that holds criteria representing the consumer's preferences and needs. This Consumer Agent (CsA) has its own marker in Fig. 1 and represents a consumer (or a group of consumers). In customisation scenarios, the needs and preferences represented by the CsA will have to be negotiated with the other agents. This cooperation is essential to understand the feasibility of the product, taking into account the existing raw material and the current processes performed by the MA and other entities present in the network. MAS already contemplate negotiation processes between agents, and the inclusion of the CsA and its integration into the system create a need to redefine or adjust the existing negotiation processes.

The goal that each CsA intends to achieve is directly tied to the consumer or group of consumers that it represents and, more specifically, to their needs and preferences. Capturing these criteria is not the main focus of this proposal, but in future work we will address the possibility of including external sources that can help identify consumer needs. For now, we assume that these needs and preferences are known and correctly represented by the CsA. Additionally, the inclusion of the consumer is a direct response to the needs of social manufacturing, which call for consumer involvement in the entire product life cycle. This model is initially focused on including the consumer in the manufacturing process (design, manufacturing, disposal), but it can be extended to other processes, including decision-making regarding the selection of materials and suppliers.

3.2 Reasoning and Interaction

The second part of this model is based on the MAS and is intended to handle the reasoning and the interactions between entities, deciding which entities in the network are the best to interact with in each situation. As a direct response to diverse consumer demands (represented by the CsA), each entity needs to connect and work effectively and efficiently with others, making the entity-to-entity relationship critical for the success of this model. The selection of the right entities for the manufacturing of a product (whose characteristics are represented by the PA as a result of a negotiation process with the CsA) is the main purpose of this layer. The MAS proposed in the reasoning and interaction layer is based on the methodology presented by Nikraz et al. [22] and the work of Ghadimi et al. [12]. These works focus on the key issues of the analysis and design of a MAS, following the Foundation for Intelligent Physical Agents (FIPA) standards. To design the system, agent types are identified, categorised, and refined during the analysis phase. The process starts with an initial identification of agent types based on two rules: (1) add one type of agent per user/device; (2) add one type of agent per resource. This step is followed by the identification of responsibilities, in which an initial list of the main responsibilities of each agent is created. This proposal defines the following agents: Blockchain Agent (BA), Entity Agent (EA), and Decision Maker Agent (DMA). The following responsibilities were defined for each of these agents:

– Blockchain Agent (BA):
  1. Receives the entity data from the EA;
  2. Saves the data from the EA in a blockchain transaction;
  3. Informs the EA that data was saved;
  4. Receives a data request from the EA;
  5. Returns data results to the EA;
  6. Receives data requests from the DMA;
  7. Returns results to the DMA;
– Entity Agent (EA):
  1. Requests data from the BA;
  2. Sends data to the BA to add to its public profile;
  3. Sends data to the BA to add to its private profile;
  4. Receives data from the BA;
  5. Requests results from the DMA;
  6. Receives results from the DMA;
– Decision Maker Agent (DMA):
  1. Starts the decision-making process;
  2. Requests data from the BA;
  3. Receives data;
  4. Evaluates the entities involved;
  5. Sends data to the BA;
  6. Informs all EA involved.

The process then focuses on the identification of acquaintances, where all the possible interactions need to be identified. The analysis ends with agent refinement, where a set of considerations is applied:

– Support: what supporting information agents need to accomplish their responsibilities, and how, when, and where this information is generated/stored;
– Discovery: how agents linked by an acquaintance relation discover each other;
– Management and monitoring: whether the system is required to keep track of existing agents, or whether there is a need to create or demand other agents.

How each agent relates to another is defined in the form of communications and interactions, with messages being sent between sender and receiver. To specify the system interactions, Nikraz et al. [22] advise creating an interaction table that considers each agent's responsibilities, including:

– A description of the interaction;
– The responsibility (identified by a corresponding number);
– An interaction protocol to implement the interaction;
– The role played by the agent (Initiator or Responder);
– The agent name of the complementary role;
– A description of the trigger condition that initiates the interaction.
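One way to picture such an interaction table is as a list of records per agent, with one field per column listed above. The sketch below is illustrative only: the row descriptions, trigger texts, and the choice of `"fipa-request"` as the protocol label for every row are assumptions made here, since the protocol assigned to each interaction is not specified above. The responsibility numbers refer to the lists given earlier in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    description: str     # what the interaction is about
    responsibility: int  # number in the agent's responsibility list
    protocol: str        # interaction protocol used to implement it
    role: str            # "Initiator" or "Responder"
    counterpart: str     # agent playing the complementary role
    trigger: str         # condition that starts the interaction


# Assumed protocol label; the text does not say which protocol each row uses.
REQUEST = "fipa-request"

# Hypothetical, partial tables covering only interactions listed on both sides.
TABLES = {
    "EA": [
        Interaction("Request entity data", 1, REQUEST, "Initiator", "BA",
                    "EA needs data about the network"),
        Interaction("Store public profile data", 2, REQUEST, "Initiator", "BA",
                    "EA public profile changed"),
    ],
    "BA": [
        Interaction("Receive entity data", 1, REQUEST, "Responder", "EA",
                    "EA sends profile data"),
        Interaction("Serve EA data request", 4, REQUEST, "Responder", "EA",
                    "EA asks for data"),
        Interaction("Serve DMA data request", 6, REQUEST, "Responder", "DMA",
                    "DMA asks for data"),
    ],
    "DMA": [
        Interaction("Request data for evaluation", 2, REQUEST, "Initiator", "BA",
                    "decision-making process started"),
    ],
}


def unmatched_initiators(tables):
    """Return (agent, row) pairs whose counterpart has no matching Responder row."""
    missing = []
    for agent, rows in tables.items():
        for row in rows:
            if row.role != "Initiator":
                continue
            partners = tables.get(row.counterpart, [])
            if not any(p.role == "Responder" and p.counterpart == agent
                       and p.protocol == row.protocol for p in partners):
                missing.append((agent, row))
    return missing
```

A check like `unmatched_initiators` is one possible use of the table during the acquaintance identification step: every Initiator row should be answered by a Responder row in the table of the complementary agent.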

3.3 Knowledge Representation

The final part of this model is responsible for handling the knowledge representation that supports the model as a whole. The model uses a blockchain to store entity and transaction data, providing a shared, immutable, and transparent append-only register of all actions that have happened to all the participants in the network. This is achieved through the adoption of a consortium blockchain (a middle ground between the low trust provided by the public blockchain and the ‘single entity that rules everything’ of the private blockchain) [19], since it provides many of the benefits found in a private blockchain (such as efficiency and the privacy of transactions and data access) without consolidating power in one entity, while maintaining the decentralisation of the decision-making process. This strategy is highly beneficial for collaboration between entities, since a consortium blockchain operates under the leadership of a group instead of a single entity. Transaction and general data on the blockchain are also controlled using permissions, managed by the network. These overall system rules are easier to manage and are capable of achieving better protection against external disturbances (when compared to other solutions). Entity data are created and registered for each entity that joins the network, and are used to identify entities and to ease the collaboration process. As a result, each entity is represented by a public and a private profile. The public profile contains data that is accessible to the network participants, and is used to validate and evaluate which entities should be approached to collaborate in a specific manufacturing process. This profile aids the identification of an entity in the network and stores the following values:

– Inputs: represent the needs of the entity, namely what it needs from the network to fulfil its processes. These inputs can be raw materials, maintenance needs, transportation services, among others. This value can be read by every participant of the network, but each entity can update only its own input values.
– Outputs: represent what an entity offers to the network. Each entity has a set of needs that it wants fulfilled (inputs) and can have a set of outputs (what it can offer or produce) that can be used as inputs by another entity. Ultimately, an output of one entity might represent the input of another.
– Credibility: a value attached to the public profile of an entity that represents how the entity is perceived by the other entities in the network. Defined as the quality of being trusted and believed in, this variable takes a value in a range (from zero to one, where zero is no trust and one is absolute trust) that, based on previous interactions, represents how much the network trusts a specific entity. Despite its presence in the public profile, this value cannot be tampered with.

In the specific case of the CsA, the inputs define the needs and preferences of the consumer or group of consumers that it represents. Initially these needs can be related to a product they want manufactured, but in further expansions of this model they can also be related to specific preferences such as processes, suppliers, or even raw materials. The private profile stores data regarding the level of confidence that a single entity has in every other entity of the network. In our proposal, one entity can have a certain level of confidence in another entity, regardless of the confidence levels held by the other entities. This confidence is represented by a range of values (from zero to one, where zero is no trust and one is absolute trust)

and is only accessible by the entity that owns it. This value is updated each time a transaction is performed between entities. The combination of confidence and credibility values is critical for the success of this model. Credibility can be described along four axes: trustworthiness, expertise, reliability, and quality. The first two axes relate to the credibility of the entity itself, while the latter two relate to the credibility of the transaction performed. In this model, credibility provides a means for an entity to be individually classified by others, while the simpler and more direct notion of confidence provides an entity with a way of storing its own evaluation of each other entity, based on their previous interactions. For example, MA1 may have a low level of credibility but, owing to previous successful collaborations, PA1 has a high level of confidence in MA1, which allows PA1 to trust MA1 and establish more transactions with it. As for the blockchain that supports the knowledge representation layer of this model, it requires transparency and privacy features, and therefore a special infrastructure that can provide such characteristics. As a result, this work relies on Hyperledger Fabric (HF) for knowledge representation. Similar to other blockchain technologies, HF has a ledger, uses smart contracts, and is a system in which participants manage their transactions. HF differs from other blockchains in that it is not an open system that allows unknown entities to participate in the network; instead, its members need special authorisation and validation to be part of the network [10]. It is an implementation of a distributed ledger platform for running smart contracts, leveraging familiar and proven technologies, with a modular architecture that allows pluggable implementations of various functions [5].
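The interplay between the public and private profiles described above (inputs, outputs, credibility, and confidence) can be sketched as follows. This is a minimal illustration under stated assumptions, not the model's actual update or selection rule: the exponential-smoothing update in `record_transaction`, the neutral prior of 0.5, the weighted blend in `rank_candidates`, and all class, function, and parameter names are hypothetical choices made here.

```python
from dataclasses import dataclass, field


def clamp01(x: float) -> float:
    """Keep a trust value inside the [0, 1] range used by the model."""
    return max(0.0, min(1.0, x))


@dataclass
class PublicProfile:
    entity_id: str
    inputs: set          # what the entity needs from the network
    outputs: set         # what the entity offers to the network
    credibility: float   # how the network perceives the entity, in [0, 1]


@dataclass
class PrivateProfile:
    owner_id: str
    confidence: dict = field(default_factory=dict)  # partner id -> value in [0, 1]

    def record_transaction(self, partner_id, success, rate=0.2, prior=0.5):
        """Assumed update rule: move confidence towards 1 on success, 0 on failure."""
        old = self.confidence.get(partner_id, prior)
        target = 1.0 if success else 0.0
        self.confidence[partner_id] = clamp01(old + rate * (target - old))


def rank_candidates(needed, private, public_profiles, weight=0.7):
    """Rank entities offering `needed` by an assumed blend of confidence and credibility."""
    scored = []
    for profile in public_profiles:
        if needed in profile.outputs and profile.entity_id != private.owner_id:
            # With no shared history, fall back to the network-wide credibility.
            conf = private.confidence.get(profile.entity_id, profile.credibility)
            score = weight * conf + (1 - weight) * profile.credibility
            scored.append((score, profile.entity_id))
    return [entity_id for _, entity_id in sorted(scored, reverse=True)]
```

Under these assumptions, the MA1 and PA1 example behaves as described: an MA1 with low credibility is ranked below a more credible machine until PA1 has recorded enough successful transactions with it, after which PA1's own confidence outweighs the network-wide value.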
The blockchain architecture introduced by HF is called “execute-order-validate”, and a distributed application for Fabric consists of two parts:

1. Smart contract (chaincode): the central part of a distributed application in Fabric, with special chaincodes existing to manage the blockchain system and maintain its parameters. Chaincode is invoked by an application external to the blockchain when there is a need to interact with the ledger.
2. An endorsement policy that is evaluated in the validation phase. This policy acts as a static library for the validation of transactions, which can only be parameterised by the chaincode. A typical endorsement policy allows the chaincode to specify the endorsers for a transaction in the form of a set of peers. This set of peers is defined as the smallest set of entities required to endorse a transaction for it to be valid. To endorse, an endorsing peer needs to run the smart contract associated with the transaction and sign its outcome.

In HF, a ledger consists of two distinct parts, a world state and a blockchain. The world state is a database that holds the current values of the ledger state, making them easy to access, while the blockchain works as a transaction log that registers every change that led to the current world state. The world state is implemented as a database, providing a rich set of operations for the efficient storage and retrieval of states. When a transaction that implies changes to the world state is submitted by invoking a smart contract, it ends up being committed to the blockchain, and a notification about the validity of the transaction is later sent to its committer. In addition to representing and registering every transaction performed in the network (and its participants), this knowledge representation layer is also capable of representing a product life cycle by analysing each transaction performed by a PA.
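The two-part ledger and the role of the endorsement policy can be illustrated with a toy model. The sketch below does not use the Hyperledger Fabric API and leaves out ordering, peers, signatures, and chaincode execution; the class `ToyLedger`, its method names, and the organisation names in the usage note are assumptions made purely for illustration.

```python
import hashlib
import json


class ToyLedger:
    """Toy ledger: a world state plus a hash-chained, append-only transaction log."""

    GENESIS = "0" * 64

    def __init__(self):
        self.world_state = {}  # current value for each key
        self.blocks = []       # every transaction ever submitted, in order

    @staticmethod
    def _hash(body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def commit(self, key, value, endorsements, required):
        """Log the transaction; update the world state only if the policy is met."""
        valid = set(required) <= set(endorsements)
        prev = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        body = {"index": len(self.blocks), "prev": prev, "key": key,
                "value": value, "endorsements": sorted(endorsements), "valid": valid}
        self.blocks.append(dict(body, hash=self._hash(body)))
        if valid:
            self.world_state[key] = value
        return valid

    def verify_chain(self):
        """Recompute every hash to detect tampering with the log."""
        prev = self.GENESIS
        for block in self.blocks:
            body = {k: v for k, v in block.items() if k != "hash"}
            if block["prev"] != prev or block["hash"] != self._hash(body):
                return False
            prev = block["hash"]
        return True

    def history(self, key):
        """Valid values recorded for a key, e.g. the life cycle of one PA."""
        return [b["value"] for b in self.blocks if b["key"] == key and b["valid"]]
```

For example, committing `"cut@MA1"` under key `"PA1"` with endorsements from both of two required organisations updates the world state, whereas a later transaction endorsed by only one of them is logged as invalid and leaves the world state unchanged. Reading `history("PA1")` then yields the sequence of valid transactions for that product, which is the kind of life-cycle view mentioned above.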

4 Conclusion and Future Work

While manufacturing processes are evolving under I4.0 by taking advantage of the amount of data produced and the digitalisation of manufacturing pipelines, organisations still face a variety of challenges. One of those challenges is the increasing demand for customised products from their consumers, which is shifting the manufacturing paradigm towards mass customisation. This specific challenge requires organisations to adapt their manufacturing process so that they can produce multiple products (or variations of the same product) without making significant changes to their production line, while minimising downtime. Besides the need to manufacture customised products, organisations will need to gather the necessary conditions to ensure their quick adaptation to a changing environment (motivated by trends and social influence), and to ensure that they have the materials and services required to answer the manufacturing needs. One solution to this problem can be found in collaboration, which, besides providing a solution to the increase in demand for customised products, can also act as a solution for many other challenges in I4.0. Collaboration is an open and transparent environment where information is shared and actors can work together to solve a common problem. The model proposed in this work is our solution to the need for collaboration between organisations and to the satisfaction of consumers' customised demands. We proposed a model for an industrial collaboration network composed of a network of entities, a reasoning and interaction layer, and knowledge representation using blockchain. Although the combination of MAS and blockchain is not novel, and existing works have proposed a similar base infrastructure, the novelty of this proposal lies in the inclusion of the consumer.
The initial portion of our model is a network of entities, based on a MAS, where each agent represents an entity that is directly related to the manufacturing process of a product, namely: Machine Agent (MA); Conveying Agent (CA); Product Agent (PA); and the Consumer Agent (CsA). This network is composed of different types of agents, belonging to different organisations, that share a common objective: collaborating to solve an existing problem, which in this scenario is the manufacturing of a product. The knowledge representation uses Hyperledger Fabric and is the entry point for all the information in the network. Structuring and saving the data in a solid way makes it possible, for each entity and its interactions, to store the data and share it with all the entities, while keeping the information secure and ensuring that stored information cannot be tampered with. Entity information contains data that helps build each organisation’s profile and supports the decision-making process, giving network participants a way to evaluate and classify each other’s performance when collaborating. The decision-making portion relies on a multi-agent system that interacts with the Hyperledger Fabric blockchain to gather the data needed to choose the right entity to collaborate with. This is crucial to help stakeholders and decision makers streamline their decision-making process, which can be the difference between acting in time and solving a problem, or failing.

As for the limitations of this work, the first that should be addressed is the usage of blockchain. Is it the right solution for this model? Despite the current success of cryptocurrencies, and although the combination of MAS and blockchain is well documented in the literature, the application of this technology in the real world is still limited and often associated with a certain level of distrust. However, blockchain aligns with our proposal, and a consortium blockchain provides a way to create interactions among a group of entities that exchange funds, goods, or information while none of them is willing to agree on a trusted third party. Also, smart contracts can simplify trust-less protocols between multiple parties, keep the details of a contract hidden from other network entities, and decentralise the decision-making process. However, there are some limitations. The MAS developed still lacks maturity in some areas, namely when it comes to the actions of the agents.
One entity that can potentially affect the operation of the model is the Decision Making Agent (DMA), through its behaviour and actions. It is important to consider which decision-making framework or algorithm, such as a Markov decision process or a fuzzy inference system, should be used and how it would affect the model. This would enable the model to be developed even further. It is also noted that, since different organisations will be sharing their resources and sensitive data may be available, security and privacy are a concern. At the moment, this work relies on the underlying privacy concepts that come with blockchain technology.


Cellular Formation Maintenance and Collision Avoidance Using Centroid-Based Point Set Registration in a Swarm of Drones

Jawad N. Yasin1(B), Huma Mahboob2, Mohammad-Hashem Haghbayan1, Muhammad Mehboob Yasin3, and Juha Plosila1

1 Autonomous Systems Laboratory, Department of Future Technologies, University of Turku, Vesilinnantie 5, 20500 Turku, Finland {janaya,mohhag,juplos}@utu.fi
2 Connected Shopping Ltd., Thetford, UK [email protected]
3 Department of Computer Networks, College of Computer Sciences and Information Technology, King Faisal University, Hofuf, Saudi Arabia [email protected]

Abstract. This work focuses on low-energy collision avoidance and formation maintenance in autonomous swarms of drones. Here, the two main problems are: 1) how to avoid collisions by temporarily breaking the formation, i.e., collision avoidance reformation, and 2) how to perform such reformation while minimizing the deviation, thereby minimizing the overall time and energy consumption of the drones. To address the first question, we use a cellular automata based technique to find an efficient formation that avoids the obstacle while minimizing the time and energy. Concerning the second question, a near-optimal reformation of the swarm after successful collision avoidance is achieved by applying a temperature function reduction technique, originally used in the point set registration process. The goal of the reformation process is to remove the disturbance while minimizing the overall time it takes for the swarm to reach the destination and consequently reducing the energy consumption required by this operation. To measure the degree of formation disturbance due to collision avoidance, the deviation of the centroid of the swarm formation is used, inspired by the concept of the center of mass in classical mechanics. Experimental results show the efficiency of the proposed technique in terms of performance and energy.

Keywords: Multi-agent system · Formation maintenance · Swarm intelligence · Collision avoidance · Point set registration

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 391–408, 2022. https://doi.org/10.1007/978-3-030-82193-7_26

1 Introduction

A swarm is a concept that seems to have no precise definition in the literature as such; instead, we find many definitions and discussions addressing swarming, i.e. swarm behaviour [11]. Swarm robotics can be classified as the study of how a system, consisting of a large number of relatively simple agents, can be designed to attain a desired cumulative behaviour based on the interactions between the agents themselves and between the agents and the environment [7,23]. Due to their ability to work collaboratively, swarms of autonomous agents offer significant advantages over the use of a single agent, and they are therefore in high demand in diverse fields such as search and rescue, surveying and mapping, inspection, and delivery, in both military and civilian/commercial contexts [22]. Consequently, the research community is increasingly interested in optimizing various autonomy characteristics of drone swarms, for instance collision avoidance, formation maintenance, resource allocation, and navigation [5,13].

In swarm navigation and formation flight, collision avoidance and maintenance of the formation are the most important problems [20,25]. Formation control methodologies can be categorized into three main approaches: 1) the leader-follower based approach, where every node/drone works autonomously and individually, maintaining a given formation as perfectly as possible by adjusting its position with respect to its neighbours and the leader drone [17,22,24]; 2) the behaviour based approach, where, based on a pre-determined strategy, one behaviour is chosen out of the available ones [3,14]; and 3) the virtual structure based approach, where the swarm is considered a single entity, i.e. effectively a single large drone, and navigated along a trajectory accordingly [6,15].

Cellular automata based modelling provides an environment in which each cell decides its movement by looking only at its neighbours and the environment, based on rules that are defined for each individual cell dynamically at run time [1].
The cellular automata model in our collision avoidance algorithm allows the swarm to re-form and pass the obstacle using only distributed rules defined for each individual drone. In other words, avoiding the collision does not require a central processing element that defines the path of each drone. Instead, each node individually passes the obstacle with a dynamic and flexible movement, so that the overall time and energy consumption of the swarm is optimized. To define these rules for individual drones, we make use of genetic algorithm techniques, which are highly compatible with the cellular automata model. A genetic algorithm (GA) is one of the simplest random-based classical evolutionary computing methods, where random changes are applied to the current solutions to generate new solutions for finding an optimal or near-optimal solution [8]. GAs work by generating potential random solutions, selecting the best solution by calculating the distance of each solution to the destination, generating new solutions based on the good solutions found so far, and repeating these steps in order to reach the desired result [2,18]. The ability of GAs to converge close to the global optimum and their relatively simple implementation make them quite popular among available optimization heuristics [4].

Point set registration is a commonly used method, playing an important role in various applications such as image retrieval, 3D reconstruction, shape and object recognition, and SLAM [9]. In point set registration, the correlation between two point sets is determined in order to retrieve the required transformation that maps one point set to the other [16].

Our algorithm has two phases:

– The first phase is a cellular automata based collision avoidance scheme that disturbs the original formation to pass the obstacle.
– The second phase is a re-formation scheme, inspired by point set registration, that returns the swarm from the highest disturbed formation to the original formation.

In this paper, the leader-follower based approach is utilized for drone swarm control due to its reliability, ease of implementation and analysis, and scalability [12,19]. However, in our proposed solution, there is no unique global leader, as the leader is changed dynamically based on certain constraints. The cellular formation and collision avoidance algorithms are integrated with a simple GA-inspired approach and a point set registration method in order to optimize the collision avoidance and re-formation phases. The goal is to calculate the escape routes and select a near-optimal path upon detection of an obstacle, with minimal deviation from the original route. Once a defined danger zone has been passed, centroid-based point set registration (CPSR) is used in the formation maintenance algorithm to reconstruct the formation, optimally bringing back each drone that lost its position in the formation when avoiding a collision with the detected moving obstacle. Using a GA-inspired approach for collision avoidance in a swarm of robots/UAVs is beneficial, as it allows the algorithm to check all possible maneuvers and select the best solution depending on pre-defined constraints such as the minimal movement requirement and power consumption limitations. Furthermore, with the help of CPSR, once a collision has been avoided, the formation can be obtained again swiftly and optimally by bringing the UAVs back to the desired formation shape.
CPSR facilitates this dynamic recovery process and autonomous switching of the swarm leader according to the requirements posed by the scenario at hand. This can be very helpful especially in cases where the time to complete a mission is critical. The rest of the paper is structured as follows. In Sect. 2, the proposed algorithm is described. Section 3 provides the simulation results. Finally, Sect. 4 concludes the paper with some discussion and comments on future work.

2 Proposed Approach

The general pseudo code of the proposed approach is given in Algorithm 1. We start with the assumption that the leader-follower connection has been established and that the formation is already maintained before a mission starts. Using its on-board processing unit, every individual node executes this top-level algorithm locally. Algorithm 1 starts by initializing the Boolean flag FLAG_obs (Line 2), whose role is to indicate the absence (False) or presence (True) of an obstacle. Then the target shape (TShape) of the swarm is initialized based on the current state or position of each node with respect to the others (Line 3). TShape is the next targeted formation shape, calculated at every interval for the next time interval, and it determines the next target position for each node to move to. After these initial steps, the main loop (Lines 4–13) is entered. First, the procedure ObstacleDetection is called, and the values it returns provide information on the presence and characteristics of a potential obstacle (Line 5). In case an obstacle is detected, i.e. FLAG_obs == True, the procedure CollisionAvoidance is called, and TShape is set up according to the feedback received (Lines 6–7). Once the collision has been successfully avoided, FLAG_obs is reset to False (Line 8). On the other hand, if no obstacle is detected, i.e. if FLAG_obs == False holds after the execution of ObstacleDetection, then TShape is updated without involving the collision avoidance procedure (Line 10). Finally, the point set registration based re-formation algorithm PS-ReFormation is called to re-establish the desired formation (Line 12). This has an effect only if the formation is distorted due to a collision avoidance event. It is important to note that, in the point set registration of the re-formation process, it is crucial to calculate the mapping between the current and desired shapes of the swarm optimally and rapidly. This is especially the case when complicated movements are involved that drastically change the TShape of the swarm, for instance, when the angle of the leader’s movement changes strongly due to the presence of an obstacle on the path.

Algorithm 1. Global Routine

 1: procedure Obstacle Detection & Navigation
 2:     FLAG_obs ← False;
 3:     TShape ← Initialization based on current state;
 4:     while True do
 5:         FLAG_obs, D_obs, A_obs, V_obs, D_impact ← ObstacleDetection();
 6:         if FLAG_obs then
 7:             TShape ← CollisionAvoidance(D_obs, A_obs, V_obs, D_impact);
 8:             FLAG_obs ← False;
 9:         else
10:             Update TShape;
11:         end if
12:         PS-ReFormation(TShape);
13:     end while
14: end procedure
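As a rough illustration, the control flow of Algorithm 1 can be written as a short Python skeleton. The callables `detect`, `avoid`, `update` and `reform` are hypothetical stand-ins for ObstacleDetection, CollisionAvoidance, the TShape update and PS-ReFormation, and the `max_steps` bound replaces the endless loop only so that the sketch terminates; none of these names come from the paper.

```python
from typing import Callable, List, NamedTuple, Tuple

Point = Tuple[float, float]
Shape = List[Point]


class Detection(NamedTuple):
    """Values returned by the detection step (field names mirror Algorithm 1)."""
    flag_obs: bool
    d_obs: float
    a_obs: Tuple[float, float]
    v_obs: float
    d_impact: float


def global_routine(
    initial_shape: Shape,
    detect: Callable[[], Detection],
    avoid: Callable[[Detection], Shape],
    update: Callable[[Shape], Shape],
    reform: Callable[[Shape], None],
    max_steps: int,
) -> Shape:
    """Run the per-node loop of Algorithm 1 for a bounded number of steps."""
    t_shape = list(initial_shape)       # Line 3: initialise TShape
    for _ in range(max_steps):          # Line 4: 'while True', bounded here
        det = detect()                  # Line 5: scan for an obstacle
        if det.flag_obs:                # Line 6
            t_shape = avoid(det)        # Line 7: TShape from collision avoidance
            # Line 8: the flag is re-read from detect() on the next pass,
            # which plays the role of resetting it to False.
        else:
            t_shape = update(t_shape)   # Line 10: regular TShape update
        reform(t_shape)                 # Line 12: point set registration step
    return t_shape
```

Passing the four steps in as callables keeps the skeleton independent of any particular sensor model or re-formation method, which is a design choice of this sketch rather than something the paper prescribes.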


Algorithm 2. Obstacle Detection

1: procedure Obstacle Detection()
2:     if obstacle in Detection Range then
3:         FLAG_obs ← True;
4:         D_obs, A_obs ← Calculate obstacle distance and angles at which the edges lie;
5:         V_obs ← Calculate obstacle velocity;
6:         D_impact ← Calculate distance to impact;
7:         return (FLAG_obs, D_obs, A_obs, V_obs, D_impact)
8:     end if
9: end procedure

2.1 Obstacle Detection

In this procedure (specified in Algorithm 2), the node continuously scans for obstacles, and as soon as there is an obstacle in the detection range of the on-board sensor system, the flag FLAG_obs is set to True (Lines 2–3). After this, the parameters of the obstacle are calculated, i.e. the distance to the obstacle (D_obs) and the angles at which its edges lie (A_obs) (Line 4), as shown in Fig. 1 (the variables are explained in Table 1). Then it is determined whether the obstacle is moving or stationary (Line 5), using Eqs. (1)–(3) and as illustrated in Fig. 2. The velocity of the obstacle (V_obs) is computed, and based on its value there are three possible scenarios: (1) if V_obs == 0, the environment is static or the obstacle under observation is stationary; (2) if V_obs is negative, the obstacle is coming towards the UAV; or (3) if V_obs is positive, the obstacle is going away from the UAV. Based on the computed velocity of the obstacle, the distance to the potential impact (D_impact) is calculated (Lines 5–6). These calculations are elaborated in the equations below.

Fig. 1. Distance and direction calculation

Table 1. Description of variables from Fig. 1

Variables         Description
D_Ri, D_Li        Distance of the right and left edges of the obstacle from the leader, follower 1 and follower 2, as shown in the figure
d_F1L, d_F2L      Distance of the leader from follower 1 and follower 2, respectively
θ_LOR, θ_LOL      Angles at which the right and left edges are detected from the leader, respectively
θ_F1L, θ_F2L      Angle of the leader from follower 1 and follower 2, respectively

We know that after t_1 seconds, the distance travelled by the UAV can be calculated by:

$$d_{UAV} = v \cdot (t_1 - t_0) \tag{1}$$

Fig. 2. Moving obstacle calculation

Then the distance travelled in the meantime by the obstacle and the velocity of the obstacle are calculated by (2) and (3), respectively:

$$d_{obs} = d_0 - d_{UAV} - d_1 \tag{2}$$

$$v_{obs} = d_{obs} / \Delta t \tag{3}$$

where d_0 and d_1 are the distances between the UAV and the obstacle at times t_0 and t_1, respectively. In Eq. (2), if d_obs == 0, the obstacle is stationary. Otherwise the obstacle is moving, and in that case, if the distance between the obstacle and the UAV at t_1, i.e., d_1, is less than the distance detected at time t_0, i.e., d_0, then the obstacle is moving towards the UAV (Fig. 2). If d_1 is greater than d_0, the obstacle is moving away from the UAV. Figure 3(a) shows the point when the obstacle has entered the detection range of the UAV. The obstacle’s trace of movement is shown as a red dotted line; similarly, the UAVs’ traces of movement are shown as correspondingly coloured lines behind the smaller coloured circles representing three UAVs. Based on the movement of the obstacle, the computational point of impact and the dimensions of the obstacle are calculated, and based on these the Danger Zone (the red circle; the point of impact is the red dot inside it) is defined as shown in Fig. 3(a).

Fig. 3. (a) Point of impact and danger zone as they appear dynamically. (b) Highest level of disturbance illustration
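Equations (1)–(3) can be evaluated directly from two range readings. The sketch below is a minimal illustration and makes two assumptions that the text does not state explicitly: Δt in Eq. (3) is taken to be t_1 − t_0, and a small tolerance `tol` replaces the exact test d_obs == 0 to absorb sensor noise. The function and field names are hypothetical. The sketch only reports the value given by Eq. (2); how its sign is mapped to "approaching" or "receding" depends on how the distances are measured and is left to the caller.

```python
from typing import NamedTuple


class ObstacleMotion(NamedTuple):
    d_uav: float  # distance covered by the UAV, Eq. (1)
    d_obs: float  # distance covered by the obstacle, Eq. (2)
    v_obs: float  # obstacle velocity, Eq. (3)


def obstacle_motion(v: float, t0: float, t1: float,
                    d0: float, d1: float) -> ObstacleMotion:
    """Evaluate Eqs. (1)-(3) from range readings d0 (at t0) and d1 (at t1)."""
    dt = t1 - t0                 # assumption: delta t in Eq. (3) is t1 - t0
    if dt <= 0:
        raise ValueError("t1 must be later than t0")
    d_uav = v * dt               # Eq. (1)
    d_obs = d0 - d_uav - d1      # Eq. (2)
    v_obs = d_obs / dt           # Eq. (3)
    return ObstacleMotion(d_uav, d_obs, v_obs)


def is_stationary(motion: ObstacleMotion, tol: float = 1e-6) -> bool:
    """d_obs == 0 in Eq. (2) means the obstacle did not move (within tol)."""
    return abs(motion.d_obs) <= tol
```

For example, a UAV flying at 2 m/s that measures 10 m and then 8 m one second later has closed the gap entirely by its own motion, so `obstacle_motion(2.0, 0.0, 1.0, 10.0, 8.0)` yields a zero obstacle displacement.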

2.2 Collision Avoidance

Collision avoidance in our proposed algorithm amounts to continuously defining the next-step formation for the swarm in such a way that the resulting sequence of formations passes the obstacle. This re-forming of the swarm continues until the swarm reaches the highest formation disturbance, at which it is guaranteed that the swarm can pass the obstacle. The highest formation disturbance is defined as the state in which all the drones have passed the line that is perpendicular to the velocity vector of the swarm (the swarm movement in Fig. 3(b)) and passes through the mass point of the obstacle. After that, the swarm returns to its original formation via a TPS-based algorithm [21], which is discussed in the re-formation section (see footnote 1). In this section we only cover the collision avoidance algorithm up to the highest level of disturbance. The goal here is to determine rules for each drone by which the drones pass the obstacle in a way that minimizes time and energy. To do this, we apply a GA to a cellular automata (CA) model of the swarm. This model separates the space, 2D or 3D, into identical grid zones, where the size of a grid zone is chosen so that it encompasses one drone together with its safe distance from the borders of the zone. Using this modelling method, a 2D/3D environment can be divided into identical grid zones (cells), where the presence or absence of a drone in a cell sets the state of the cell to one or zero, respectively. Figure 4(a) shows such a model for 2 drones and one obstacle in a 2D environment. Each cell in this model can only see its neighbours, as in a standard CA [1], and decides its next state based on them. If the cell is occupied by a drone, the next state of the cell is determined by the movement of the drone, which is limited to the cardinal and inter-cardinal directions (Fig. 4(b)). Our GA-based algorithm tries to find, for each cell occupied by a drone (i.e. each black cell), the rule that minimizes the time. The time in this model is the number of steps the swarm needs to reach the highest level of disturbance. The energy in each step, i.e. for each CA rule, can take only three values: 1) ‘0’, when the drone stays in its position; 2) ‘1’, when it moves in a cardinal direction; and 3) ‘2’, when it moves in an inter-cardinal direction. The overall energy for a drone is the sum of the energy consumed in each iteration from the start of the re-formation to the highest formation disturbance state.

1 Even though the highest formation disturbance state might not fully guarantee a collision-free TPS-based re-formation phase, we make this assumption to simplify finding the moment of switching from the GA-based collision avoidance phase to the TPS-based phase that restores the original formation.

Fig. 4. (a) 2D model of 2 drones and one obstacle. (b) Example of movement according to a rule.
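The step and energy accounting of the CA model can be illustrated with a few lines of Python. This is a sketch under stated assumptions: a rule is represented here simply as a sequence of named moves for one drone, obstacle cells are given as a set, and the names `MOVES`, `step_energy` and `apply_rule` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Cell = Tuple[int, int]

# "stay" plus the four cardinal and four inter-cardinal moves on the grid.
MOVES: Dict[str, Cell] = {
    "stay": (0, 0),
    "N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0),
    "NE": (1, 1), "NW": (-1, 1), "SE": (1, -1), "SW": (-1, -1),
}


def step_energy(move: str) -> int:
    """Energy of one CA step: 0 for staying, 1 cardinal, 2 inter-cardinal."""
    dx, dy = MOVES[move]
    return abs(dx) + abs(dy)


def apply_rule(start: Cell, rule: Iterable[str],
               blocked: Set[Cell]) -> Tuple[Cell, int, int]:
    """Follow a sequence of moves; return (final cell, steps, total energy).

    Raises ValueError if the drone would enter a blocked (obstacle) cell.
    """
    x, y = start
    steps = energy = 0
    for move in rule:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        if (x, y) in blocked:
            raise ValueError(f"move {move!r} enters obstacle cell {(x, y)}")
        steps += 1
        energy += step_energy(move)
    return (x, y), steps, energy
```

With an obstacle in cell (1, 0), the rule NE, E, SE takes a drone from (0, 0) around the obstacle to (3, 0) in three steps at a total energy of 2 + 1 + 2 = 5.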

Algorithm 3. CollisionAvoidance

procedure CollisionAvoidance(D_obs, A_obs, V_obs, D_impact)
    DangerZone ← Calculate based on D_impact and obstacle dimensions;
    while DangerZone do
        Calculate Escape routes;
        Update the TShape using GA;
    end while
    return (TShape)
end procedure
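The GA step of Algorithm 3 (random rules, scoring by time and energy, mutation of the best rules) can be sketched for a single drone as follows. This is an illustrative toy, not the authors' implementation: the fixed rule length, the elitist selection, the lexicographic (steps, energy) cost, the penalty value and all parameter defaults are assumptions made for the example, and "passing the obstacle" is simplified to reaching grid column `goal_x`.

```python
import random
from typing import List, Sequence, Set, Tuple

Cell = Tuple[int, int]
Cost = Tuple[int, int]
# Stay, four cardinal moves, four inter-cardinal moves.
MOVES: List[Cell] = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
                     (1, 1), (1, -1), (-1, 1), (-1, -1)]
PENALTY: Cost = (10**6, 10**6)  # rules that collide or never pass the obstacle


def cost(rule: Sequence[Cell], start: Cell, goal_x: int,
         blocked: Set[Cell]) -> Cost:
    """Return (steps, energy) needed to reach column goal_x, or PENALTY."""
    x, y = start
    energy = 0
    for step, (dx, dy) in enumerate(rule, start=1):
        x, y = x + dx, y + dy
        if (x, y) in blocked:
            return PENALTY
        energy += abs(dx) + abs(dy)   # 0 stay, 1 cardinal, 2 inter-cardinal
        if x >= goal_x:
            return (step, energy)
    return PENALTY


def mutate(rule: Sequence[Cell], rng: random.Random, rate: float) -> List[Cell]:
    """Replace each move by a random one with probability `rate`."""
    return [rng.choice(MOVES) if rng.random() < rate else m for m in rule]


def evolve(start: Cell, goal_x: int, blocked: Set[Cell], rule_len: int = 8,
           pop_size: int = 30, elites: int = 5, generations: int = 40,
           rate: float = 0.2, seed: int = 0) -> Tuple[List[Cell], List[Cost]]:
    """Evolve one drone's rule; return the best rule and the best cost per generation."""
    rng = random.Random(seed)
    pop = [[rng.choice(MOVES) for _ in range(rule_len)] for _ in range(pop_size)]

    def key(rule: Sequence[Cell]) -> Cost:
        return cost(rule, start, goal_x, blocked)

    history: List[Cost] = []
    for _ in range(generations):
        pop.sort(key=key)
        best = pop[:elites]                    # selection: keep the best rules
        history.append(key(best[0]))
        children = [mutate(rng.choice(best), rng, rate)
                    for _ in range(pop_size - elites)]
        pop = best + children                  # elites survive unchanged
    pop.sort(key=key)
    history.append(key(pop[0]))
    return pop[0], history
```

Because the elites are carried over unchanged, the best cost can never get worse from one generation to the next. Comparing costs as (steps, energy) tuples puts time first and uses energy as a tie-breaker, which is one possible reading of the objective described above.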


The population of potential solutions is generated by applying the following principles:

– generate potential random solutions by defining different rules for each drone;
– calculate the time and energy each solution needs to reach the highest disturbance formation, i.e., the state in which all the drones have passed the obstacle;
– generate new solutions by mutating the good rules obtained for each drone.

These routines are integrated with the collision avoidance developed and presented in [24], in such a way that the translated TShape destination of each node of the swarm is checked at each time interval. If there is an obstacle in the path of a node, the TShape destination may not be the same as the original destination. Therefore, the TShape destination is calculated by using a fixed grid around the danger zone in order to keep the GA from populating an unbounded set of candidate solutions. Afterwards, at each iteration, the way to reach the TShape destination is optimized by point set registration.

2.3 Re-formation

Observing and avoiding an obstacle in most cases disturbs the formation of the swarm until the highest disturbance formation is reached, after which the swarm must be restored to the initial formation state. This raises a formation construction problem that is widely covered in the literature. However, in our case, the re-formation algorithm, or in other words the disturbance rejection of the swarm, must be compatible with our obstacle detection and collision avoidance algorithm, whose main target is to reduce the overall settling time and energy of the system, i.e., to bring the disturbed centroid back to its intended state in the TShape. It is worth mentioning that, when resuming the formation, it is not necessary to keep the initial neighbouring relations among the drones, since all the drones in the formation are considered identical nodes. Furthermore, since there is no dedicated leader and the leader is selected according to the situation at hand, the re-formation process runs more smoothly, with no pauses or unnecessary waiting times for nodes. For example, if drone 1 has drones 2 and 3 as neighbours in the original state, this is not necessarily the case after the formation has been restored. Similarly, as shown in the next section (Fig. 6), the leader before the swarm gets disturbed and the leader after the re-formation process are not the same, as the leader is re-elected dynamically. In the process of returning from the disturbed state of the swarm formation, referred to as the scene in this section, to the initial formation state, i.e., the TShape model, the two main questions are: 1) what is the optimal alignment or mapping of identical nodes between the disturbed formation of the swarm (the scene) and the initial formation (the TShape model); and 2) what is the optimal trajectory for each node in the scene to reach its corresponding node in the TShape model?
For the first problem, we adopt a well-known idea in point set registration [9,16] that is based on the thin-plate spline (TPS) technique used in data interpolation and smoothing [10]. After determining the mapping strategy, for the second problem, we use the shortest path scheme when applying the proposed collision avoidance approach. In the following, we first explain the concept of thin-plate splines and then propose an algorithm based on it. A spline is a piecewise function defined by polynomials. Because splines are simple to construct, complex shapes can be approximated with ease by curve fitting with splines [10]. For simplicity, we discuss the algorithm in its 2-dimensional formulation and assume two sets of corresponding points, X = {x_i}, i = 1, 2, ..., n, and V = {v_i}, i = 1, 2, ..., n, where x_i = (1, x_ix, x_iy) and v_i = (1, v_ix, v_iy) are the coordinate representations of the location of a point in the scene and in the model, respectively. Taking the shape of the disturbed formation into account, a mapping function f(v_i) that fits the corresponding point sets X and V can be obtained by minimizing the following:

$$E_{TPS}(f) = \sum_{i=1}^{n} \|x_i - f(v_i)\|^2 + \lambda \iint \left[ \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x \partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right] dx\,dy \tag{4}$$

where $E_{TPS}$ is the energy function, which serves as the measure of formation disturbance. The integral term, the bending energy of the mapping, accounts for how the point sets are mapped onto each other while keeping the intended formation under consideration, and the factor $\lambda$ scales its weight. If the intention is to map one point set onto the other without considering the shape of the disturbed swarm, $\lambda$ is set to zero and the closest points are mapped to each other. In this situation, the disturbance $E_{TPS}$ reduces to:

$$E_{TPS}(f) = \sum_{i=1}^{n} \lVert x_i - f(v_i) \rVert^2 \tag{5}$$
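To make the $\lambda = 0$ case concrete, the following minimal sketch searches all one-to-one assignments of scene drones to model slots and keeps the one with the smallest sum of squared distances, i.e., the energy in Eq. 5. Restricting $f$ to a permutation of identical nodes, the brute-force search, and the example coordinates are assumptions made for illustration; they are not the paper's TPS-based registration.

```python
from itertools import permutations


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def best_assignment(scene, model):
    """Return (energy, mapping), where mapping[i] is the model slot of scene drone i.

    Brute force over all permutations: only practical for small swarms.
    """
    best_energy, best_perm = None, None
    for perm in permutations(range(len(model))):
        energy = sum(
            squared_distance(scene[i], model[perm[i]]) for i in range(len(scene))
        )
        if best_energy is None or energy < best_energy:
            best_energy, best_perm = energy, perm
    return best_energy, list(best_perm)


# Hypothetical V-shaped model: slot 0 is the leader position.
model = [(0.0, 1.0), (-1.0, 0.0), (1.0, 0.0)]
# Hypothetical scene: drone 0 (the former leader) has deviated and fallen back.
scene = [(1.2, -0.5), (0.1, 1.0), (-1.0, 0.1)]

energy, mapping = best_assignment(scene, model)
```

In this example drone 1 is assigned to the leader slot and drone 0 to a wing slot, which mirrors the point made in the text that nodes are interchangeable and the leader may change after reformation.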

Minimizing this temperature function determines the mapping from the disturbed swarm in its highest-disturbance formation, i.e., the scene, to the original shape after the obstacle. This mapping also determines the new leader. After the mapping function has been calculated, each drone in the scene follows the shortest path to its hypothetical location in the model. Since the model in Eq. 4 is hypothetical and uncertain events might affect the reformation process, a relative run-time measurement is needed to continuously ensure that the swarm is reaching its formation. This metric is based on the hypothetical center of the swarm, which can be calculated from the instantaneous locations of the drones in the swarm. The continuous formation error that should be dynamically observed and reduced is calculated by combining the deviations of the distance of each drone from the center w.r.t. the golden formation model, and is denoted by $d_{rms}$. As an example, Fig. 5 shows the centroid for three drones in the golden formation model (left side) and in the disturbed model (right side). Based on this, the deviation of the distance of each drone from the center w.r.t. the golden formation model is as follows:

$$\Delta d_1 = d_c - d_1, \qquad \Delta d_2 = d_c - d_2, \qquad \Delta d_3 = d_c - d_3 \tag{6}$$

and

$$d_{rms} = \sqrt{\Delta d_1^2 + \Delta d_2^2 + \Delta d_3^2} \tag{7}$$
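The metric in Eqs. 6 and 7 can be sketched as follows. The sketch assumes that $d_c$ is the distance a drone has from the centroid in the golden formation model and that it is supplied per drone; the function names and coordinates are hypothetical. The sum is not divided by the number of drones, following Eq. 7 as printed.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def distances_to_centroid(points):
    """Distance of every point to the centroid of the whole set."""
    cx, cy = centroid(points)
    return [math.hypot(p[0] - cx, p[1] - cy) for p in points]


def d_rms(current, reference_distances):
    """Eq. 7: square root of the summed squared deviations from Eq. 6."""
    current_distances = distances_to_centroid(current)
    deviations = [dc - d for dc, d in zip(reference_distances, current_distances)]
    return math.sqrt(sum(dev * dev for dev in deviations))


# Hypothetical golden formation and its reference distances (assumption).
golden = [(0.0, 2.0), (-1.0, -1.0), (1.0, -1.0)]
reference = distances_to_centroid(golden)

# The same formation shifted in space, and a disturbed formation.
shifted = [(x + 5.0, y + 3.0) for x, y in golden]
disturbed = [(0.0, 2.0), (-1.0, -1.0), (2.0, -1.0)]
```

Because only distances to the current centroid enter the metric, a formation that has merely been translated yields a value of zero, while the disturbed formation yields a positive value.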

Fig. 5. Centroid of the swarm

The measurement $d_{rms}$ is a figure of merit: it quantifies how much the current formation has been distorted from its original, predefined formation. Driving $d_{rms}$ to zero therefore restores the formation optimally, i.e., $d_{rms} \to 0$. The reformation process starts by calculating the centroid of the swarm, as shown in Fig. 5. These values are then fed to the point set registration in order to calculate the optimal solution for bringing the UAVs back to the desired formation while bringing the centroid to the final destination as quickly as possible, as can be seen in the next section, i.e., Simulation and Results. During the reformation the leader of the swarm changes dynamically, and the previous leader (UAV1/Blue UAV) goes to the position of the current leader (UAV2/Green UAV). The reason is that UAV1 slows down while it deviates from its current trajectory to avoid colliding with the obstacle. In the meantime UAV2, which continues on its path, becomes a more suitable candidate for taking the position of UAV1 than slowing down to wait for it. Therefore, UAV2 moves to the location of UAV1 and, as soon as UAV1 has successfully avoided the collision, UAV1 moves to the previous location of UAV2.

3 Simulation and Results

Fig. 6. Different time intervals from spawning to when the obstacle comes in detection range. (a) When the UAVs are spawned and obstacle is moving towards the swarm. (b) Obstacle is in detection range of the UAV, and point of impact is calculated and shown, UAV1 is deviating from its original path in order to avoid the danger zone. (c) Bypassing the obstacle. (d) bypassing the obstacle 2. (e) Leader changed while bypassing the obstacle.

The initial conditions/assumptions for our work are defined as follows:
1. There is no explicit unique leader; the leader of the swarm changes dynamically according to the situation, i.e., the leadership is a temporary role.
2. UAVs accelerate or decelerate as needed.
3. UAVs obtain their own position vectors using on-board localization techniques.
4. The communication channel is ideal, i.e., lossless.
5. An obstacle can be stationary, or moving towards or away from the swarm with unknown velocity.
6. For visualization purposes and to avoid overlapping, only the detection range circle of the leader is shown.

Fig. 7. Simulation snapshots at equal time intervals, i.e., 0%, 25%, 50%, 75%, 100% (a) When the UAVs are spawned and obstacle is moving towards the swarm. (b) Obstacle is in detection range of the UAV, and point of impact is calculated and shown, UAV1 is deviating from its original path in order to avoid the danger zone. (c) Bypassing the obstacle. (d) Notice the change of leader while bypassing the obstacle. (e) Leader changed while bypassing the obstacle.

The UAVs are spawned near the defined V-shaped formation (Fig. 6). The current leader, UAV1 (blue), starts moving towards the destination, and the other UAVs start moving towards their positions in the formation. An obstacle is also moving towards the swarm, but at this instant it is outside the detection range of the on-board sensor system of the UAVs, as shown in Fig. 6(a). In Fig. 6(b), the obstacle is already within the detection range, and the Point of Impact and the Danger Zone have been computed, as explained in Algorithm 2.


Fig. 8. Swarm movement from start to destination. Navigational traces using: (a) Proposed approach (b) Dedicated leader

Figure 6(c) shows the escape route chosen by the collision avoidance module, which deviates UAV1 to its right and reduces the velocity of UAV3 so that UAV1 can stay on the chosen route, as explained in Algorithm 3. Figure 6(d) and (e) show the reformation process using CPSR. Since UAV1 had to slow down and deviate to avoid the Danger Zone, UAV2, which continued on the same path with the same velocity, is dynamically declared the leader and gets ahead of the rest instead of waiting for UAV1 to return to its position in the formation. This is done to minimize the time of arrival of the centroid at the destination. Similarly, the optimal reformation requires UAV1 to go to UAV2's place, while UAV3 just speeds up to catch up with its position in the formation, as explained in Section II-C. Figure 8(a) shows the movement of the swarm from the starting point until it reaches the destination. For comparison, Fig. 8(b) shows the behaviour of the swarm when point set registration is used with an explicit unique leader and without balancing the centroid of the swarm. The graph in Fig. 9(a) shows the overall trend of the distances maintained by the UAVs throughout the course, where D31 is the distance between UAV1 and UAV3, D21 is the distance between UAV1 and UAV2, and D32 is the distance between UAV2 and UAV3. The obstacle is detected at t = 30 s, and collision avoidance is enforced, which distorts the formation (Fig. 6). To test the scalability of the proposed algorithm, the number of nodes in the swarm was increased to form a two-layered V-shaped formation, as shown in Fig. 7. The small overshoots in the movements of the drones in the figure can be reduced by integrating a more stable speed controller into the algorithm. Figure 9(b) shows the change in the temperature of the system, i.e., the swarm, from the start until the destination is reached. At t = 30 s, the disturbance/change in the temperature of the system reflects the obstacle detection. The other significant disturbance, at t = 75 s, is due to the leader change; from there on the swarm gradually reshapes itself into the target shape defined by TShape.

Fig. 9. (a) Distance maintained by drones with each other. (b) Change in temperature of the system as a whole

Figure 10 shows the time taken for the swarm to reach its destination in four scenarios: "No obstacle", i.e., the swarm is not disturbed at all and does not deviate from its path, with t0 = 166 s; "Unique", i.e., there is an obstacle in the path and the swarm has a unique dedicated leader, with t1 = 213 s; and two cases using our proposed approach, "3-CPSR" (a swarm of 3 nodes, as in the previous cases) and "8-CPSR" (8 nodes), with times to finish of t2 = 181 s and t3 = 198 s, respectively. This shows that dynamically re-forming the swarm and changing the leader at run-time whenever the situation requires considerably reduces the mission time. The reason is that in the CPSR approach the swarm does not stop at any moment but keeps progressing towards the destination, with each UAV deviating to avoid a collision when needed and accelerating afterwards to reach its position, defined by TShape, to maintain the formation. In the fixed-leader case ("Unique"), on the other hand, the swarm slows down and waits for the leader to resume its position at the front before continuing towards the destination. In the three-drone formation (Fig. 6), the swarm needed 56 s from the obstacle detection to return to the initial formation, whereas in the eight-drone case (Fig. 7) this took 85 s. It is evident from the experiments that in the latter case, due to a much bigger obstacle, the drones had to deviate more than in the former case. However, this did not affect the overall mission time very much, as UAV2, which became the new leader, neither had to deviate much from its path nor reduce its speed significantly.
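The size of the reported reduction can be checked with a few lines of arithmetic. The sketch below uses only the four completion times quoted above; the helper names are illustrative.

```python
# Mission completion times in seconds, as reported in the text.
times = {"No obstacle": 166, "3-CPSR": 181, "8-CPSR": 198, "Unique": 213}


def overhead(label):
    """Extra time in seconds relative to the undisturbed run."""
    return times[label] - times["No obstacle"]


def saving_vs_unique(label):
    """Time saved relative to the fixed-leader run, in percent."""
    return 100.0 * (times["Unique"] - times[label]) / times["Unique"]


for label in ("3-CPSR", "8-CPSR", "Unique"):
    print(f"{label}: +{overhead(label)} s overhead, "
          f"{saving_vs_unique(label):.1f}% faster than Unique")
```

With these numbers, the obstacle costs 15 s in the 3-CPSR case and 32 s in the 8-CPSR case, against 47 s with a dedicated leader, which corresponds to roughly 15% and 7% shorter missions than the fixed-leader run.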

Fig. 10. Time for mission completion for different approaches (vertical axis: time in s; bars: No obstacle 166, 3-CPSR 181, 8-CPSR 198, Unique 213)

4 Conclusion

In this paper, we developed a novel approach for collision avoidance and formation maintenance in a swarm of drones in dynamic environments. The proposed method uses a genetic-algorithm-inspired scheme in its collision avoidance part and point set registration in its formation maintenance part. In this approach, the swarm does not have a uniquely determined leader, and formation maintenance is accomplished by stabilizing the centroid of the swarm. The behaviour of the algorithm was theoretically analysed and tested in a simulation environment. The simulation results provide sufficient proof that the method works in a near-optimal manner in a dynamic environment in which an obstacle continues to move along its detected trajectory. We tested the efficiency of the proposed algorithm by comparing it with corresponding algorithms that assume the existence of an explicit/unique leader. It was demonstrated that the ability to re-elect the leader dynamically, if required, gets the mission completed more quickly, i.e., it saves time and consequently energy by sparing the swarm from waiting for the leader to return to its defined position in the formation. In our future work, we plan to extend the proposed approach by examining other environmental effects, such as air drag, on layers of drones, e.g., in a two-layered or multi-layered V-shaped formation. This can help in optimizing resource management in the swarm by dynamically swapping the outer layer with the inner layer in order to minimize the effect of air drag and maximize the flight time on a single charge. We will also consider more complex scenarios with several simultaneous obstacles and more versatile obstacle movement.

Acknowledgment. This work has been supported in part by the Academy of Finland-funded research project 314048 and Nokia Foundation (Award No. 20200147).


References
1. Wolfram, S.: A New Kind of Science - Online - Table of Contents. Library Catalog. www.wolframscience.com
2. Alander, J.T.: On finding the optimal genetic algorithms for robot control problems. In: Proceedings IROS 1991: IEEE/RSJ International Workshop on Intelligent Robots and Systems 1991, vol. 3, pp. 1313–1318, November 1991
3. Balch, T., Arkin, R.C.: Behavior-based formation control for multirobot teams. IEEE Trans. Robot. Autom. 14(6), 926–939 (1998)
4. Bansal, J.C., Singh, P.K., Pal, N.R.: Evolutionary and Swarm Intelligence Algorithms. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-91341-4
5. Campion, M., Ranganathan, P., Faruque, S.: A review and future directions of UAV swarm communication architectures. In: 2018 IEEE International Conference on Electro/Information Technology (EIT), pp. 0903–0908, May 2018
6. Dong, L., Chen, Y., Qu, X.: Formation control strategy for nonholonomic intelligent vehicles based on virtual structure and consensus approach. Proc. Eng. 137, 415–424 (2016). Green Intelligent Transportation System and Safety
7. Dorigo, M., Roosevelt, A.F.D.: Swarm robotics. In: Special Issue, Autonomous Robots. Citeseer (2004)
8. Gad, A.: Introduction to Optimization with Genetic Algorithm, July 2018
9. Guo, P., Hu, W., Ren, H., Zhang, Y.: PCAOT: a Manhattan point cloud registration method towards large rotation and small overlap. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7912–7917, October 2018
10. Chui, H., Rangarajan, A.: A new algorithm for non-rigid point matching. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2000 (Cat. No. PR00662), vol. 2, pp. 44–51, June 2000
11. Hamann, H.: Introduction to Swarm Robotics, pp. 1–32. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-74528-2_1
12. Han, Q., Li, T., Sun, S., Villarrubia, G., de la Prieta, F.: "1-n" leader-follower formation control of multiple agents based on bearing-only observation. In: Demazeau, Y., Decker, K.S., Bajo Pérez, J., de la Prieta, F. (eds.) Advances in Practical Applications of Agents, Multi-Agent Systems, and Sustainability: The PAAMS Collection, pp. 120–130. Springer, Cham (2015)
13. He, L., Bai, P., Liang, X., Zhang, J., Wang, W.: Feedback formation control of UAV swarm with multiple implicit leaders. Aerosp. Sci. Technol. 72, 327–334 (2018)
14. Lawton, J.R.T., Beard, R.W., Young, B.J.: A decentralized approach to formation maneuvers. IEEE Trans. Robot. Autom. 19(6), 933–941 (2003)
15. Li, N.H.M., Liu, H.H.T.: Formation UAV flight control using virtual structure and motion synchronization. In: 2008 American Control Conference, pp. 1782–1787, June 2008
16. Myronenko, A., Song, X.B.: Point-set registration: coherent point drift. CoRR, abs/0905.2635 (2009)
17. Kwang-Kyo, O., Park, M.-C., Ahn, H.-S.: A survey of multi-agent formation control. Automatica 53, 424–440 (2015)
18. Vicencio, K., Davis, B., Gentilini, I.: Multi-goal path planning based on the generalized traveling salesman problem with neighborhoods. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2985–2990, September 2014
19. Yasin, J.N., et al.: Night vision obstacle detection and avoidance based on bio-inspired vision sensors. In: 2020 IEEE Sensors, pp. 1–4 (2020)
20. Yasin, J.N., Mohamed, S.A.S., Haghbayan, M., Heikkonen, J., Tenhunen, H., Plosila, J.: Unmanned aerial vehicles (UAVs): collision avoidance systems and approaches. IEEE Access 8, 105139–105155 (2020)
21. Yasin, J.N., et al.: Energy-efficient formation morphing for collision avoidance in a swarm of drones. IEEE Access 8, 170681–170695 (2020)
22. Yasin, J.N., Mohamed, S.A.S., Haghbayan, M.-H., Heikkonen, J., Tenhunen, H., Plosila, J.: Navigation of autonomous swarm of drones using translational coordinates. In: Demazeau, Y., Holvoet, T., Corchado, J.M., Costantini, S. (eds.) Advances in Practical Applications of Agents, Multi-Agent Systems, and Trustworthiness. The PAAMS Collection, pp. 353–362. Springer, Cham (2020)
23. Yasin, J.N., et al.: Dynamic formation reshaping based on point set registration in a swarm of drones (2020)
24. Yasin, J.N., Haghbayan, M.-H., Heikkonen, J., Tenhunen, H., Plosila, J.: Formation maintenance and collision avoidance in a swarm of drones. In: Proceedings of the 2019 3rd International Symposium on Computer Science and Intelligent Control, ISCSIC 2019. Association for Computing Machinery, New York (2019)
25. Zhuge, C., Cai, Y., Tang, Z.: A novel dynamic obstacle avoidance algorithm based on collision time histogram. Chin. J. Electron. 26(3), 522–529 (2017)

The Simulation with New Opinion Dynamics Using Five Adopter Categories

Makoto Fujii(B) and Akira Ishii

Tottori University, Koyama, Tottori, Japan
[email protected], [email protected]

Abstract. The purpose of this paper is to interpret the diffusion of innovation (the transfer of opinions towards adoption) from a simulation of opinion dynamics in which the five adopter categories are set as agents, and to provide a computational social science method useful for marketing and mass media research. In the simulation, we observed the impact on the spread of innovation of manipulating variables such as the Initial Distribution of Opinions, the Confidence Coefficient between agents, the Mass Media Effects, and the Network Connection Probabilities of the random network. Simulation results show that when the media has a uniform impact on the market, the distribution of people's opinions is skewed in the direction in which the media leads. We also observed that manipulating the initial values of the opinions of the Early Adopters, the confidence coefficient, and the connection probability between the nodes of the random network affects the market and hence the spread of innovation.

Keywords: Opinion dynamics · Diffusion of Innovations · The five adopter categories · Simulation

1 Introduction

The dissemination of information has traditionally been the role of the mass media, but with the development of SNS, consumers have also taken on part of this role. Elucidating the mechanism of opinion transition (information diffusion and consensus building) in consumers' adoption of innovations, based on the theory of opinion dynamics, is expected to provide many insights for today's complicated marketing communications. Opinion dynamics is a theory that analyzes how people's opinions converge as they exchange opinions with one another. It is one of the research themes of social physics and has been studied from various aspects as a theory for analyzing the process of social consensus building [1–6]. In the field of social science, on the other hand, as represented by Rogers's "Diffusion of Innovation" [7], research is being conducted on the process by which innovations that people perceive as new are disseminated. Opinion dynamics and diffusion-of-innovation research are considered to have a high affinity, because both involve the elements of objects, exchange of opinions, communication channels, and the passage of time.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 409–424, 2022. https://doi.org/10.1007/978-3-030-82193-7_27


Therefore, the purpose of this research is to interpret the spread of innovation from a simulation of opinion dynamics in which the five adopter categories are set as agents, and to provide a computational social science method useful for marketing and mass media research. The rest of this article is organized as follows. In the next section, we discuss typical models of opinion dynamics and the new model of opinion dynamics used in this paper, which also handles negative confidence coefficients. We then consider the diffusion of innovation and the five adopter categories used as agents. In Sect. 4, we perform simulations with the proposed model, manipulating the Initial Distribution of Opinions, the Confidence Coefficient, the Mass Media Effects, and the Network Connection Probabilities that influence the diffusion of innovation. Section 5 concludes the paper after considering, from the simulation results, the conditions that are effective for the spread of innovation.

2 Theory

2.1 Opinion Dynamics

There are two main types of opinion dynamics models. In one, people's opinions take one of two discrete values (1 and 0, or 1 and −1); in the other, people's opinions are quantified and distributed continuously. Discrete binary theory can be applied to cases such as the French and American presidential elections [8] and referendums (as seen in Brexit) [9]. Typical mathematical models of binary theory include the Voter model [10–12] proposed by Granovetter, Galam's theory of magnetic physics [13], and the local majority decision model [8, 9]. Many extensions of the Voter model have been proposed, but its basic purpose is to describe a choice between two options (adopted/not adopted), as represented by voting behavior in elections. In the standard Voter model, an opinion is $s = \pm 1$. The opinion of $i$ is $S_i$ and the opinion of $j$ is $S_j$. Let the total number of people be $N$ and the total set of opinions be $S = \{S_i\}$. Also, let $S^k$ be the set of opinions that differs from $S$ only in the value of $S_k$. Then, if $P(S, t)$ is the probability that the overall opinion distribution is $S$ at time $t$, it obeys the following Eq. (1).

$$\frac{d}{dt} P(S, t) = \sum_{k} \left[ W_k(S^k)\, P(S^k, t) - W_k(S)\, P(S, t) \right] \tag{1}$$

On the other hand, the mathematical model of opinion dynamics that expresses opinions as continuous values from 0 to 1 is called the Bounded Confidence Model [14–16]; it is premised on people finding a compromise, i.e., a consensus, by exchanging opinions. The Deffuant-Weisbuch Model [17, 18] and the Hegselmann-Krause Model [19] are known as the major Bounded Confidence Models. The basic idea of the Bounded Confidence Model [14–16] is that an individual $i$ is influenced by the opinions of those around them, so that their own opinion changes. The Hegselmann-Krause Model is defined by Eq. (2) below.

$$x_i(t + 1) = \sum_{j=1}^{N} D_{ij}\, x_j(t) \tag{2}$$
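As a small illustration of the update in Eq. (2), the sketch below applies one synchronous step to three agents. It assumes that each row of the coefficient matrix is non-negative and sums to 1, so that opinions remain in the range from 0 to 1; this normalization and the example values are assumptions for illustration and are not stated in the text.

```python
def hk_step(opinions, D):
    """One synchronous update x_i(t+1) = sum_j D[i][j] * x_j(t), as in Eq. (2)."""
    n = len(opinions)
    return [sum(D[i][j] * opinions[j] for j in range(n)) for i in range(n)]


# Hypothetical coefficients: each row is non-negative and sums to 1 (assumption).
# D[0][2] = 0 means that agent 0 ignores the opinion of agent 2.
D = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
opinions = [0.0, 0.5, 1.0]

after_one_step = hk_step(opinions, D)
```

Repeating the step pulls the two extreme agents towards the middle one, which is the implicit consensus building that the model assumes.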


The opinion $x_i$ is bounded by $0 \le x_i(t) \le 1$. Here, assuming $N$ persons, the opinion of individual $i$ at time $t$ is written as $x_i(t)$, where $1 \le i \le N$. The coefficient $D_{ij}$ takes various positive real values for all combinations $i, j$ among the $N$ persons, and $D_{ij} = 0$ means that individual $i$ ignores the opinion of $j$. Since opinions are bounded by $0 \le x_i(t) \le 1$, this model does not assume that the coefficient $D_{ij}$ takes a negative value. Opinions take continuous values from 0 to 1 (1 = agree, 0 = disagree), and there is no assumption of disagreement between people. In other words, implicit consensus building is assumed. In the real world, however, we know that consensus building can sometimes be difficult. Therefore, Ishii-Kawahata [20, 21] extended the Hegselmann-Krause Model [19] to deal with problems on which consensus is difficult to form, and developed a model that includes lack of trust among people in opinion dynamics theory. Ishii et al. reinterpret the coefficient $D_{ij}$ as a confidence coefficient and assume that $D_{ij} > 0$ if there is a trust relationship between $i$ and $j$, and $D_{ij} < 0$ if there is a distrust relationship. Moreover, they consider that people ignore opinions that are far from their own, neither agreeing with nor repelling them, and assume that people are not particularly affected by opinions that are very close to their own. To include these two processes, Ishii uses the following expressions (3) and (4) instead of the $D_{ij} x_j(t)$ term used in the Hegselmann-Krause Model [21].

$$D_{ij}\, \Phi(I_i, I_j) \left( I_j(t) - I_i(t) \right) \tag{3}$$

where

$$\Phi(I_i, I_j) = \frac{1}{1 + \exp\left( \beta \left( \lvert I_i - I_j \rvert - b \right) \right)} \tag{4}$$

Furthermore, the models proposed by Ishii [21] and Ishii-Okano [22] consider the influence of mass media and the forgetting of topics. In addition to human contact (information exchange), information provided by the mass media influences the formation of public opinion and the opinions of individuals, and is therefore an important term in opinion dynamics theory. Moreover, when the passage of time is taken into account, it is important for opinion dynamics theory to consider the effect that the matter in question itself becomes old and people's interest in it decreases. Regarding the mass media effect, $A(t)$ is the external pressure at time $t$, and the difference in the reaction of each agent is represented by the coefficient $c_i$. The coefficient $c_i$ can differ from agent to agent and can be positive or negative. If $c_i$ is positive, person $i$ moves their opinion towards that of the mass media. Conversely, if $c_i$ is negative, the opinion of the person changes against the direction of the media. Forgetting is handled by introducing an exponential decay term. Including these two effects, the change in the opinion of an agent can be expressed as in Eq. (5).

$$\Delta I_i(t) = -\alpha I_i(t)\, \Delta t + c_i A(t)\, \Delta t + \sum_{j=1}^{N} D_{ij}\, \Phi\left(I_i(t), I_j(t)\right) \left( I_j(t) - I_i(t) \right) \Delta t \tag{5}$$
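A single time step of Eq. (5) can be sketched as below, reading the equation as an explicit update of each opinion by its increment. The parameter values, the two-agent examples, and the guard against floating-point overflow in the exponential are assumptions made for illustration; they are not taken from the paper.

```python
import math


def phi(opinion_i, opinion_j, beta, b):
    """Sigmoid cut-off of Eq. (4): close to 1 for nearby opinions, close to 0 for distant ones."""
    z = beta * (abs(opinion_i - opinion_j) - b)
    if z > 700.0:  # exp would overflow; the value is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def step(opinions, D, c, A, alpha, beta, b, dt):
    """Return the opinions after one explicit time step of Eq. (5)."""
    n = len(opinions)
    updated = []
    for i in range(n):
        interaction = sum(
            D[i][j] * phi(opinions[i], opinions[j], beta, b)
            * (opinions[j] - opinions[i])
            for j in range(n)
        )
        delta = (-alpha * opinions[i] + c[i] * A + interaction) * dt
        updated.append(opinions[i] + delta)
    return updated


# Hypothetical parameters (assumptions): no forgetting, no media, beta = 1, b = 5.
params = dict(alpha=0.0, beta=1.0, b=5.0, dt=0.1)
start = [1.0, -1.0]

trusting = step(start, [[0.0, 1.0], [1.0, 0.0]], c=[0.0, 0.0], A=0.0, **params)
distrusting = step(start, [[0.0, -1.0], [-1.0, 0.0]], c=[0.0, 0.0], A=0.0, **params)
media_only = step([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], c=[1.0, -1.0], A=1.0, **params)
```

With mutual trust the two opinions approach each other, with mutual distrust they move apart, and with only the media term active an agent with positive $c_i$ moves with the media while an agent with negative $c_i$ moves against it.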


Based on this model, Fujii-Ishii [23] replaced consent with adoption and distrust with rejection, and added agents in the middle of the simulation to observe the transition of opinions. The total number of agents in the simulation is 300; the consumers who exist in the market from the beginning are $N_{ini}$ and the consumers who enter the simulation later are $N_{user}$, that is, $N = 300 = N_{ini} + N_{user}$. The later entry of consumers into the market (simulation) is defined by Eq. (6).

$$N_{user} = \frac{(N - N_{ini}) \left( \tanh x + 1 \right)}{2} \tag{6}$$

This study showed that when the mass media had a uniform impact on the market, the distribution of people's opinions was biased in the direction in which the media led. It was confirmed that the mass media first affects $N_{ini}$ and then $N_{user}$. It was also shown that the stronger the connections between people, the more easily they are influenced by others. In other words, in order to turn the opinions of consumers who enter the market later towards adoption (plus), there need to be pre-existing positive consumers, mass media that encourage adoption, and a dense network of people.

2.2 Diffusion of Innovations

According to Rogers, an innovation is an idea, practice, or object that is perceived as new by an individual or other unit of adoption. Diffusion is the process by which an innovation is communicated among the members of a social system over time through communication channels [7]. In the marketing context, smartphones (iPhone) in the mobile phone market and the Greek yogurt Chobani in the yogurt market are good examples of innovations. The essence of the diffusion process is information exchange, through which one person conveys new ideas to others. The fastest and most efficient way to inform potential adopters that an innovation exists is the mass media, while face-to-face information exchange is effective in persuading people to accept new ideas. In the choice between adopting and rejecting an innovation, the merits of the innovation and the demand for it have a great influence on its spread, but the merits of the innovation itself are not discussed in this paper. We focus on the diffusion of innovations using opinion dynamics. In the previous research, the agents were divided into $N_{ini}$ and $N_{user}$ for the simulation, whereas in the Diffusion of Innovations, Rogers classifies consumers into five categories based on innovativeness. Therefore, in this study, agents are classified into five categories, and the transition of innovation propagation (adoption and rejection) is considered from a marketing perspective. Adopter categorization distinguishes between members of the social system based on individual innovativeness. Innovativeness is the degree to which an individual adopts new ideas and products relatively earlier than other members of the social system [7]. The five adopter categories are Innovators, Early Adopters, Early Majority, Late Majority, and Laggards (Fig. 1).


Fig. 1. Adopter categorization. Modified from “Diffusion of Innovations” (2003).

Innovators are adventurous and act as gatekeepers of social systems. In other words, it is the layer that adopts innovations earliest among the five categories. Early Adopters are opinion leaders who provide information about innovations for potential users. Although the Early Majority are cautious when adopting innovations, they are also an important category that act as a bridge in the dissemination process and also as a mutual liaison in interpersonal networks. Late Majority are said to be skeptical of innovations and will adopt innovations after the adoption rate exceeds 50%. For the Late Majority, pressures within the groups are needed to motivate innovation adoption. Laggards are the last category of social systems to adopt innovations. This is the category that takes the longest time to adopt innovation.

3 Modeling

In this study, tanh is again used to introduce agents into the simulation, and the difference in the timing with which each adopter category adopts innovations is modeled as follows (Fig. 2).

$$\text{Innovators:} \quad N_{inn} = \frac{\tanh(x) + 1}{2} \tag{7}$$

$$\text{Early Adopters:} \quad N_{ea} = \frac{\tanh(x - 3) + 1}{2} \tag{8}$$

$$\text{Early Majority:} \quad N_{em} = \frac{\tanh(x - 6) + 1}{2} \tag{9}$$

$$\text{Late Majority:} \quad N_{lm} = \frac{\tanh(x - 9) + 1}{2} \tag{10}$$

$$\text{Laggards:} \quad N_{lg} = \frac{\tanh(x - 12) + 1}{2} \tag{11}$$
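The five entry curves differ only in their horizontal offset, which a short sketch makes explicit. The dictionary and function names are illustrative; the offsets 0, 3, 6, 9, and 12 are those of Eqs. (7) to (11).

```python
import math

# Horizontal offsets of the tanh entry curves, taken from Eqs. (7)-(11).
OFFSETS = {
    "Innovators": 0.0,
    "Early Adopters": 3.0,
    "Early Majority": 6.0,
    "Late Majority": 9.0,
    "Laggards": 12.0,
}


def entered_fraction(category, x):
    """Fraction of a category that has entered the simulation at position x."""
    return (math.tanh(x - OFFSETS[category]) + 1.0) / 2.0


# Evaluate all five curves at one point of the horizontal axis of Fig. 2.
snapshot = {name: entered_fraction(name, 4.5) for name in OFFSETS}
```

Each curve passes through 0.5 at its own offset, so at any given position the categories are ordered from Innovators (most entered) to Laggards (least entered).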

Fig. 2. Innovation adoption timing for each adopter category (curves: Ninn, Nea, Nem, Nlm, Nlg; horizontal axis from −3.00 to 14.94, vertical axis from 0.00 to 1.00)

Based on the Ishii (2019) model, agents are classified into five adopter categories, and the initial distribution of opinions, the confidence coefficient, the media coefficient, and the random network connection probability are manipulated to calculate the transition of innovation propagation. As a first calculation example, Fig. 3 shows the result for 1,000 people (Ninn = 25, Nea = 135, Nem = 340, Nlm = 340, Nlg = 160). The initial opinion distribution of the Innovators is ±30, and the network of people is a random network with a link connection probability of 50%. The $D_{ij}$ values among the 1,000 people are drawn from a uniform random distribution from −1 to 1. As can be seen, people's opinions are both positive and negative and are widely scattered. In this calculation, the mass media effect A(t) is set to 0, so the opinion distribution in the figure is spread over positive and negative values without bias. With 1,000 people, opinions appear to be balanced to some extent as a whole, although they are distributed over both positive and negative values. The calculated distribution also shows that, starting from a uniform distribution of opinions, opinions spread and balance to some extent, but the final distribution of opinions is not uniform: several opinion groups can be identified. It can also be confirmed that both the adoption curve and the rejection curve are S-shaped. The simulations in this paper are based on this calculation result.
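The setup of this first calculation can be sketched as follows. The category sizes, the link probability, and the uniform range for the confidence coefficients come from the text; treating links as undirected, drawing $D_{ij}$ and $D_{ji}$ independently, setting $D_{ij} = 0$ where no link exists, and all function names are assumptions made for illustration.

```python
import random

# Category sizes of the 1,000-agent example, as given in the text.
CATEGORY_SIZES = {"Ninn": 25, "Nea": 135, "Nem": 340, "Nlm": 340, "Nlg": 160}


def build_population(sizes, link_probability, seed=0):
    """Return (labels, D): a category label per agent and the confidence matrix.

    Assumptions: each unordered pair is linked with the given probability,
    D[i][j] and D[j][i] are drawn independently from [-1, 1] for linked pairs,
    and D is 0 for unlinked pairs and on the diagonal.
    """
    rng = random.Random(seed)
    labels = [name for name, count in sizes.items() for _ in range(count)]
    n = len(labels)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < link_probability:
                D[i][j] = rng.uniform(-1.0, 1.0)
                D[j][i] = rng.uniform(-1.0, 1.0)
    return labels, D


# Small hypothetical population for a quick check; the full-size setup would be
# build_population(CATEGORY_SIZES, 0.5).
small_labels, small_D = build_population({"Ninn": 2, "Nea": 3}, 0.5, seed=1)
```

The resulting matrix can be passed to a time-stepping routine for Eq. (5); the initial opinions of each category still have to be chosen separately.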

Fig. 3. Calculation result of N = 1,000. The human network is a random network with a link connection probability of 0.5. The figure on the left shows the time evolution of the opinion trajectory of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The central figure shows the distribution of opinions at the final point of the calculation. The figure on the right shows the time-series change in the ratio of adopt (+) and reject (−), and red indicates adopt (+) and blue indicates reject (−). Dij is randomly set from −1 to 1, So everyone trusts or distrusts everyone. The media effect is zero (A (t) = 0). (Color Figure online)

4 Simulations 4.1 Manipulating the Initial Distribution of Opinions In the Fig. 3, we set the initial opinion distribution of the Innovator to ± 30, but we bias the initial opinion distribution to 10 to 30 and −10 to −30 and observe the propagation of dissemination. The left figure in Fig. 4 starts the opinion distribution of the Innovators from the plus, and the right figure starts from the minus, but it seems that the shares of adoption (plus) and rejection (minus) are in competition. This is because the confidence coefficient Dij is set to random (−1 to 1), and there is no bias variable other than the initial value of opinions that biases to plus or minus, so it is uniform in the process of people exchanging opinions. It is thought that this is because it is being transformed. In addition, among the adopter categories, the early adopters are said to hold the key to popularization, so the results of the calculation with a biased distribution of opinions of the Early adopters are shown in Fig. 5. In the first half of the simulation, where the opinions of the initial adopters started from positive, the positive ratio was high, and in the negative case, the negative ratio was high, but in the latter half of the simulation, the positive/negative ratio was about the same. Again, the confidence factor Dij is set to random (−1 to 1), and there is no bias variable other than the initial value of opinions that biases it to plus or minus, so it is leveled in the process of people exchanging opinions. 4.2 Manipulation of Confidence Coefficient Dij What if the Early Adopters are trusted by other adopter categories? Here, the basic conditions are the same as in Fig. 5, but the confidence coefficient from Early Adopters,


M. Fujii and A. Ishii

Fig. 4. Calculation result for N = 1,000. Left figure: Innovators’ initial opinion distribution = 10 to 30; right figure: Innovators’ initial opinion distribution = −10 to −30. The human network is a random network with a link connection probability of 0.5. The upper figures show the time evolution of the opinion trajectories of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom figures show the time series of the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is set randomly from −1 to 1, so everyone trusts or distrusts everyone. The media effect is zero (A(t) = 0). (Color figure online)

Early Majority, Late Majority, and Laggards to the Early Adopters is set to a value between 1 and 2 (always positive), and the simulation is run (Fig. 6). When the initial opinion distribution of the Early Adopters is positive, the adopt ratio is higher, and when it is negative, the reject ratio is higher. Because the Early Adopters are trusted by every category other than the Innovators, their opinions influence the other three categories. However, this trust can be both a poison and a medicine for those who want to spread an innovation: it helps when the Early Adopters are favorable and hurts when they are not. Promoters therefore need measures (variables that shift the direction of adoption) to keep the Early Adopters on their side.

4.3 Manipulating Mass Media Effects

We apply a constant mass media effect to the theory of opinion dynamics and consider how it affects each adopter category and the ratio of adoption (+) and rejection (−). That is, we assume A(t) = Aconst, where Aconst is a constant. Figure 7 is calculated under the same conditions as Fig. 3, except that the media effect A(t) is not zero: we use A(t) = 2.5, 5, and 10. Comparing A(t) = 2.5 and 5, the opinion distribution for A(t) = 5 appears to be more biased toward positive opinions.

The Simulation with New Opinion Dynamics


Fig. 5. Calculation result for N = 1,000. Left figure: Early Adopters’ initial opinion distribution = 10 to 20; right figure: Early Adopters’ initial opinion distribution = −10 to −20. The human network is a random network with a link connection probability of 0.5. The upper figures show the time evolution of the opinion trajectories of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom figures show the time series of the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is set randomly from −1 to 1, so everyone trusts or distrusts everyone. The media effect is zero (A(t) = 0). (Color figure online)

In the case of A(t) = 10, the calculated opinion distribution is clearly shifted in the positive direction. This is consistent with Ishii et al.’s theory of opinion dynamics, which qualitatively explains the phenomenon in which the media effect biases market opinions toward the media-led direction [22]. The effect appears not only for the Innovators but also for the adopter categories that enter later. It can also be seen that people who enter the market first, and have therefore been exposed to the mass media for longer, are affected more strongly by the media. As a result, consumers who enter the market later come into contact with, and are influenced by, more adopt (+) opinions. This suggests that, all other simulation conditions being equal, applying a larger mass media (advertising) variable shifts the opinions of the adopter categories toward the media-led direction. Comparing the ratios for A(t) = 2.5, 5, and 10 (Fig. 7, bottom line graphs), the difference between 5 and 10 is larger than the difference between 2.5 and 5. This suggests that the mass media (advertising) variable may have a threshold above which it becomes effective in the dissemination of innovation (between 5 and 10 in this case). Next, we manipulate the mass media variable with the confidence coefficient toward the Early Adopters set to positive values. The simulation conditions are as follows: the Early Adopters, who are the key to the diffusion of


Fig. 6. Calculation result for N = 1,000. Left figure: Early Adopters’ initial opinion distribution = 10 to 20; right figure: Early Adopters’ initial opinion distribution = −10 to −20. The human network is a random network with a link connection probability of 0.5. The upper figures show the time evolution of the opinion trajectories of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom figures show the time series of the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij from Early Adopters, Early Majority, Late Majority, and Laggards to Early Adopters is set randomly from 1 to 2, and all other Dij are set randomly from −1 to 1. The media effect is zero (A(t) = 0). (Color figure online)

innovations, are positive toward adopting the innovation, the adopter categories other than the Innovators trust the Early Adopters, and a mass media variable in the adoption (+) direction is applied (Fig. 8). Because the mass media variable is added, the right figure is more biased toward the positive side than the left figure. In the time-series graphs of the ratio of adoption (+) and rejection (−), the movements in the first half of the simulation are similar on the left and right, but in the latter half of the right graph the ratio of reject (−) decreases and the ratio of adopt (+) increases. If the Early Adopters, who are regarded as opinion leaders, are positive toward adopting the innovation and are trusted by the other adopter categories, and if a favorable environment can be created in which the media conveys positive opinions, the innovation can be expected to attract the majority.

4.4 Manipulating Network Connection Probabilities

Finally, we vary the connection probability between nodes in the random network. As before, we set N = 1,000 (Ninn = 25, Nea = 135, Nem = 340, Nlm = 340, Nlg = 160). Dij is set randomly from −1 to 1, so everyone trusts or distrusts


Fig. 7. Calculation result for N = 1,000. The human network is a random network with a link connection probability of 0.5. The upper three figures show the time evolution of the opinion trajectories of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom line graphs show the time series of the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is set randomly from −1 to 1, so everyone trusts or distrusts everyone. The mass media effect is A(t) = 2.5, 5, and 10 from the left. (Color figure online)

everyone. Here, we set the mass media parameter A to 5. We first set the probability of connecting to other nodes in the random network to 0% and 1%. The results are shown in Fig. 9. When the connection probability is 0%, consumers are not influenced by others and form their opinions under the influence of the mass media alone. Consumers who were initially negative shift their opinions from negative to positive over time, and the cumulative share of negative (rejection) opinions falls to zero while the cumulative share of positive (adoption) opinions reaches 100%. The adoption curve confirms that each adopter category joins the positive (adoption) side step by step. When the connection probability is 1%, the distribution is still shifted to the right, but it now extends to the negative side. This means that the distribution of opinions widens because people are influenced not only by the mass media but also by the opinions of others. Through long-term exposure to media information and increasing contact with other people, the share of positive (adopt) opinions increases and the share of negative (reject) opinions decreases. Next, we set the connection probability to 25%, 50%, and 75%. The results are shown in Fig. 10. They show that the higher the connection probability, the more strongly each person’s opinion is influenced by the opinions of others. Comparing the spread of opinions between 25% and 75%, the balance between adoption and rejection does not differ much. When people are not sufficiently connected to each other, opinions are formed under the influence of the mass media. If the connection probability is high, the results suggest that people’s opinions change under the influence


Fig. 8. Calculation result for N = 1,000. In both the left and right figures, the initial opinion distribution of the Innovators is ±30 and the initial opinion distribution of the Early Adopters is 10 to 20. The upper two figures show the time evolution of the opinion trajectories of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom line graphs show the time series of the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). The human network is a random network with a link connection probability of 0.5. Dij from Early Adopters, Early Majority, Late Majority, and Laggards to Early Adopters is set randomly from 1 to 2, and all other Dij are set randomly from −1 to 1. Media effect: left figure A(t) = 0, right figure A(t) = 5. (Color figure online)

of the opinions of others. This suggests that face-to-face communication works effectively when innovations are adopted. By extension, social media can be expected to contribute to the diffusion of innovations.
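To give a feel for the connection probabilities used above, the short sketch below computes how many direct neighbours one agent would have on average. It assumes that every pair of agents is linked independently with the same probability, which the text does not state explicitly; the function name is hypothetical.

```python
# Minimal sketch, assuming each pair of agents is linked independently
# with the same probability (an assumption, not stated in the paper).

def expected_neighbours(n_agents, link_percent):
    """Expected number of direct neighbours of one agent."""
    return (n_agents - 1) * link_percent / 100


# Connection probabilities used in Figs. 9 and 10, for N = 1,000.
for percent in (0, 1, 25, 50, 75):
    print(percent, expected_neighbours(1000, percent))
```

Under this assumption, even the 1% setting gives each agent roughly ten neighbours, which may help explain why opinions already spread to the negative side at that setting.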

5 Discussion

Based on Ishii et al.’s new opinion dynamics theory, which incorporates both trust and distrust in human relationships, we conducted a simulation incorporating the adopter categories (Innovators, Early Adopters, Early Majority, Late Majority, and Laggards). In this simulation, the number of agents is set to 1,000, and the confidence coefficient Dij that connects people is set in two patterns. In the first, Dij is a random number from −1 to 1, so the probability that Dij is positive and the probability that it is negative are both 50%. In the second, the confidence coefficient from Early Adopters, Early Majority, Late Majority, and Laggards toward Early Adopters is set to a random number from 1 to 2. The person-to-person connections form a random network, a subgraph of the complete graph, with the probability of a link between two nodes set to 50%, except in Figs. 9 and 10.
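The setup described above can be sketched as follows. This is only an illustration of how the random network and the two Dij patterns might be generated. It assumes that links are undirected and that D[i][j] denotes the trust of agent i toward agent j; neither convention is confirmed by the text, and all names are hypothetical. The opinion update rule itself is not shown.

```python
import random

# Category sizes for N = 1,000 as given in Sect. 4.4.
CATEGORY_SIZES = {
    "innovator": 25,
    "early_adopter": 135,
    "early_majority": 340,
    "late_majority": 340,
    "laggard": 160,
}


def assign_categories(sizes):
    """Return one category label per agent."""
    labels = []
    for name, count in sizes.items():
        labels.extend([name] * count)
    return labels


def build_network(labels, link_prob, trust_early_adopters, rng):
    """Return (adjacency, trust) for a random network.

    Assumptions: links are undirected, and trust[i][j] is the
    confidence coefficient of agent i toward agent j.
    """
    n = len(labels)
    adjacency = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < link_prob:
                adjacency[i][j] = adjacency[j][i] = True

    trust = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Second pattern: every category except the Innovators
            # trusts the Early Adopters (coefficient between 1 and 2).
            boosted = (trust_early_adopters
                       and labels[j] == "early_adopter"
                       and labels[i] != "innovator")
            if boosted:
                trust[i][j] = rng.uniform(1.0, 2.0)
            else:
                trust[i][j] = rng.uniform(-1.0, 1.0)
    return adjacency, trust
```

Calling `build_network(assign_categories(CATEGORY_SIZES), 0.5, False, random.Random(0))` would correspond to the first pattern with the 50% link probability; passing `True` for the third argument switches to the second pattern.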


Fig. 9. Calculation result for N = 1,000. The human network is a random network with a link connection probability of 0% on the right and 1% on the left. The upper two figures show the time evolution of the opinion trajectories of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom line graphs show the time series of the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is set randomly from −1 to 1, so everyone trusts or distrusts everyone. The media effect is A(t) = 5. (Color figure online)

The composition of the adopter categories is set to 2.5% Innovators, 13.5% Early Adopters, 34% Early Majority, 34% Late Majority, and 16% Laggards. Regarding the innovativeness of each adopter category, the order of adoption is clear, but the adoption curve of each category is not, so we add the categories to the simulation sequentially using a tanh function. Since the detailed adoption curve is expected to differ greatly from category to category, a reference axis needs to be defined and the curves examined in detail. Figures 4 and 5 show simulations that manipulate the initial distribution of opinions. In Fig. 4 the initial opinion distribution of the Innovators, and in Fig. 5 that of the Early Adopters, is set to adoption (+) or rejection (−), and the simulation is run. Regardless of whether the initial opinions of the Innovators and Early Adopters were adopting (+) or rejecting (−), there was no significant difference in the final adopt/reject ratio. Unless the influence of the mass media or the confidence coefficient is manipulated, we infer that the opinions of the people in the market will be leveled. This suggests that an overheated market will calm down, and that a flaming situation, although it calls for an appropriate response, may eventually settle down without excessive action.


Fig. 10. Calculation result for N = 1,000. The human network is a random network with a link connection probability of 25% on the right, 50% in the center, and 75% on the left. The upper figures show the time evolution of the opinion trajectories of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom line graphs show the time series of the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is set randomly from −1 to 1, so everyone trusts or distrusts everyone. The media effect is A(t) = 5. (Color figure online)

Figure 6 shows a simulation in which the confidence coefficient Dij is manipulated. Early Adopters are the adopter category expected to play the role of opinion leaders in the diffusion of innovations. We therefore let the adopter categories other than the Innovators trust the Early Adopters (positive coefficients). As a result, when the initial opinion distribution of the Early Adopters is adopting (+), the ratio of adopt (+) tends to be high, and when it is rejecting (−), the opposite tendency is observed. Since there is no guarantee that the Early Adopters will always be on the promoter’s side, detailed measures such as campaigns targeting the Early Adopters appear to be necessary. Figures 7 and 8 show simulations that manipulate media effects. They show that the media effect shifts people’s opinions toward the media-led direction. In particular, a large difference was found between the media effects A(t) = 5 and 10, which suggests the existence of a threshold above which the media exerts its effect. However, we have not yet run simulations that assume multiple media, such as mass media and digital media, and leave this for future study. Figure 8 shows a simulation that combines the initial opinion distribution of the Early Adopters, the confidence coefficient toward them, and the media effect. This can be regarded as an ideal environment for the spread of an innovation, because the trusted Early Adopters are positive about it and the influence of the media is biased toward the positive. It is therefore important to establish a positive initial opinion distribution among the Early Adopters, trust from the other adopter categories toward the Early Adopters, and media bias in the positive direction. In Figs. 9 and 10, we calculated while changing the connection probability between the nodes of the random network. The mass media effect is set to 5 (A(t) = 5). When the


connection probability is 0%, even consumers who enter the market first with a rejecting (−) opinion change to adopt (+) because of the mass media. This is because they are not affected by others, only by the mass media. However, the share of people with reject (−) opinions increases as the probability of connection between people increases. This result shows that, to raise the adopt (+) opinions of those who enter the market later, it is important to first lead the opinions of people already in the market toward adopt (+). It also shows that the higher the probability of connection between people, the stronger their influence on changes of opinion.

6 Conclusion

In this paper, we examined how the opinions of the adopter categories evolve, using the new opinion dynamics theory proposed by Ishii et al., which includes both trust and distrust in human relations. When the mass media affects the market uniformly, the distribution of people’s opinions is biased toward the media-led direction. The effect first reached the people already in the market and then the categories that adopted the innovation later. We confirmed that diffusion is promoted by the presence of Early Adopters who already hold positive opinions in the market, a certain level of mass media influence that encourages adoption, and trust in the Early Adopters. Elucidating the mechanism of opinion transition among new entrants on the basis of opinion dynamics theory is useful for marketing and mass media research. Here we have provided a computational social science method. However, the media that influence human decision-making are wide-ranging, including mass advertising, digital advertising, and SNS, and their influence is not uniform. In addition, the timing of adoption depends largely on the level of involvement of individual consumers, but in this paper it is fixed for convenience. The remaining issues are the construction of a model that considers the heterogeneity of media effects, and the study of other types of human network and their connection probabilities.

References

1. French, J.R.P.: A formal theory of social power. Psychol. Rev. 63, 181–194 (1956)
2. Harary, F.: A criterion for unanimity in French’s theory of social power. In: Cartwright, D. (ed.) Studies in Social Power. Institute for Social Research, Research Center for Group Dynamics, Ann Arbor (1959)
3. Abelson, R.P.: Mathematical models of the distribution of attitudes under controversy. In: Frederiksen, N., Gulliksen, H. (eds.) Contributions to Mathematical Psychology, pp. 142–160. Holt, Rinehart and Winston, New York (1964)
4. De Groot, M.H.: Reaching a consensus. J. Am. Statist. Assoc. 69, 118–121 (1974)
5. Lehrer, K.: Social consensus and rational agnoiology. Synthese 31, 141–160 (1975)
6. Chatterjee, S.: Reaching a consensus: some limit theorems. In: Proc. Int. Statist. Inst., pp. 159–164 (1975)
7. Rogers, E.M.: Diffusion of Innovations, 5th edn. Free Press, New York (2003)


8. Galam, S.: The Trump phenomenon: an explanation from sociophysics. Int. J. Mod. Phys. B 31, 1742015 (2017)
9. Galam, S.: Are referendums a mechanism to turn our prejudices into rational choices? An unfortunate answer from sociophysics. In: Morel, L., Qvortrup, M. (eds.) The Routledge Handbook to Referendums and Direct Democracy, Chapter 19. Taylor & Francis, London (2017)
10. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83(6), 1420–1443 (1978)
11. Clifford, P., Sudbury, A.: A model for spatial conflict. Biometrika 60, 581–588 (1973)
12. Holley, R., Liggett, T.: Ergodic theorems for weakly interacting infinite systems and the voter model. Ann. Probab. 3(4), 643–663 (1975)
13. Galam, S.: Rational group decision making: a random field Ising model at T = 0. Phys. A 238, 66 (1997)
14. Jager, W., Amblard, F.: Uniformity, bipolarization and pluriformity captured as generic stylized behavior with an agent-based simulation model of attitude change. Comput. Math. Organ. Theor. 10, 295–303 (2004)
15. Jager, W., Amblard, F.: Multiple attitude dynamics in large populations. In: Presented in the Agent 2005 Conference on Generative Social Progress, Models and Mechanisms, October 13–15. The University of Chicago (2005)
16. Kurmyshev, E., Juarez, H.A., Gonzalez-Silva, R.A.: Dynamics of bounded confidence opinion in heterogeneous social networks: concord against partial antagonism. Phys. A 390, 2945–2955 (2011)
17. Deffuant, G., Neau, D., Amblard, F., Weisbuch, G.: Mixing beliefs among interacting agents. Adv. Complex Syst. 3(15), 87–98 (2000)
18. Weisbuch, G., Deffuant, G., Amblard, F., Nadal, J.-P.: Meet, discuss and segregate! Complexity 7(3), 55–63 (2002)
19. Hegselmann, R., Krause, U.: Opinion dynamics and bounded confidence: models, analysis, and simulation. J. Artif. Soc. Soc. Simul. 5(3), 1–33 (2002)
20. Ishii, A., Kawahata, Y.: Opinion dynamics theory for analysis of consensus formation and division of opinion on the Internet. In: Proceedings of the 22nd Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES2018), pp. 71–76. arXiv:1812.11845 [physics.soc-ph] (2018)
21. Ishii, A.: Opinion dynamics theory considering trust and suspicion in human relations. In: Morais, D.C., Carreras, A., de Almeida, A.T., Vetschera, R. (eds.) GDN 2019. LNBIP, vol. 351, pp. 193–204. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21711-2_15
22. Ishii, A., Okano, N.: Sociophysics approach of simulation of mass media effects in society using new opinion. In: Arai, K., et al. (eds.) Advances in Intelligent Systems and Computing as the Proceedings of IntelliSys2020 (IntelliSys 2020), AISC 1252, pp. 13–28 (2021)
23. Fujii, M., Ishii, A.: The simulation of diffusion of innovations using new opinion dynamics. In: The 5th International Workshop on Application of Big Data for Computational Social Science (ABCSS2020 @ WI-IAT 2020) (2020)

Intrinsic Rewards for Reinforcement Learning Within Complex 2D Environments

Nathaniel Grabaskas1(B) and Zhizhen Wang2

1 PengFei Research, Cupertino, CA 95014, USA
2 Microsoft Australia, Sydney, Australia

Abstract. In this paper, we propose an approach to train an intelligent agent using reinforcement learning in order to draw on a two-dimensional grid. Painting is a creative art, and it takes human beings years to learn how to draw. In the training process, we build grid environments with obstacles and challenges resembling abstract art and then place the agent in different environments to reach the goal. In phase I, we propose using intrinsic rewards based on the state of the model to stimulate the agent’s desire to explore and to increase its adaptability in complex environments. In phase II, we prototype a rendering pipeline to translate the agent’s movement during the training process into a painting. Our results show that the intrinsic reward method increased the agent’s ability to learn in environments of moderate complexity. The rendering pipeline prototype was evaluated in a single round of crowd-sourced evaluation, and steps for further improvement are outlined.

Keywords: Reinforcement learning · Intrinsic rewards · Deep learning · Grid environments · Generative art · Intelligent agent

1 Introduction

Drawing and creative art are a critical part of human civilisation and culture. Learning how to draw takes humans years of study and practice. Hence we want to explore the idea of training an artificial intelligence agent and visualizing the training to produce interesting and abstract pieces of art. However, the scope of work in this paper is primarily focused on stimulating agent curiosity for a reward. Phase I of this project focuses on agents learning to explore an environment, avoid obstacles, and reach a goal. Phase II investigates the abstract artwork an agent can generate during the training process. Accordingly, the agent is evaluated quantitatively in phase I and qualitatively in phase II. In phase I, in which the agent learns to explore, we define a 2D environment (canvas) to simulate the painting environment, and each training canvas is initialised with a positive reward grid (goal) and hazard zones (which immediately end the episode) [5].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 425–437, 2022. https://doi.org/10.1007/978-3-030-82193-7_28


N. Grabaskas and Z. Wang

The initial parameters for generating the environment are based on abstract art paintings. The next step is to allow the agent to explore the environment while avoiding danger zones and minimizing the number of frames taken to reach the goal. We propose a method of rewarding the agent simply for learning, in an attempt to give it an intrinsic desire to explore the environment [1,10]. The intrinsic reward method proves successful in environments of moderate complexity. Phase II of the experiment is to generate visualizations from the agent’s training and compare which methods produce more interesting art. This was inspired by Luo’s dissertation on artistic applications of reinforcement learning [7]. In order to turn the agent training into artwork, a separate pipeline is set up. This pipeline renders images from the agent training, applies open-source painting effects, and sends the results to human raters for qualitative evaluation. Phase II is explored as a prototype design in this paper and further experiments are outlined. The paper is organized into the following sections: related work, data used, methods, metrics, results and discussion, and finally conclusion and future steps.

2 Related Work

Exploration in sparse reward environments remains a key challenge of reinforcement learning [10]. The authors of [10] propose Rewarding Impact-Driven Exploration (RIDE), a novel type of intrinsic reward which encourages the agent to take actions leading to significant changes in its learned state representation. They evaluate their method on multiple challenging procedurally generated tasks in MiniGrid. This approach is more sample efficient than existing exploration methods; the intrinsic reward does not diminish during the course of training, and it rewards the agent substantially more for interacting with objects it can control. RIDE is computed as the L2-norm of the difference in the learned state representation between consecutive states. However, to ensure the agent does not go back and forth, they discount RIDE by the number of times the state is visited. The parameters used to learn the intrinsic reward signal are used only to determine the exploration bonus and are never part of the agent’s policy. While our intrinsic reward was inspired by theirs, we use the model weights, not the embedding, to calculate the reward. We believe this better represents rewarding the agent’s learning. Similar work was done by Zhewei, Wen and Shuchang [18], who trained an AI agent to paint on a canvas to generate a painting similar to a target image. Apart from training the agent for a reward policy, they also proposed an approach to decompose the target painting into hundreds of strokes in sequence within a grid. By the end of the project, the agent is trained to be general enough to handle multiple types of images (including digits, handwriting, street views, human portraits, etc.). Another work related to artwork generation was done by Ning and his team [17]. The team focuses on a particular type of painting, stroke drawing. They applied inverse reinforcement learning to learn the reward function from the training data. Our approach differs in that we convert the agent’s movements and interaction with the environment into artwork.
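The two reward ideas contrasted above can be sketched in a few lines. The sketch below is only illustrative: the function names are hypothetical, vectors are plain Python lists, and dividing by the square root of the visit count is an assumed form of the visit-based discount, since the text only says that the reward is discounted by the number of visits.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ride_style_reward(emb_t, emb_next, visit_count):
    """Change in the state embedding, discounted by visits to the next state.

    The sqrt(visit_count) divisor is an assumption made for illustration.
    """
    if visit_count < 1:
        raise ValueError("visit_count must be at least 1")
    return l2_distance(emb_t, emb_next) / math.sqrt(visit_count)


def weight_change_reward(weights_before, weights_after):
    """Variant described in this paper: distance between flattened model
    weights before and after an update, instead of state embeddings."""
    return l2_distance(weights_before, weights_after)
```

For example, an embedding that moves from (0, 0) to (3, 4) yields a larger bonus on the first visit than on the fourth, while a model whose weights did not change receives no weight-based bonus.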

Intrinsic Rewards for Reinforcement Learning

3 Data

The dataset used is the agent’s training environment. In Zhewei’s paper [18], the team models the agent’s painting process as a sequential decision-making task. During training, the agent was rewarded based on a comparison of the current state of the canvas with the target painting. In this project, we use a different strategy. Instead of focusing on decomposing the painting into a sequential environment, we pre-build the training environment based on abstract art chosen from Surma’s Kaggle Open dataset [11]. An example can be seen in Fig. 1.

Fig. 1. This image shows a piece of abstract art chosen from Surma’s Kaggle Open Dataset.

The Kaggle dataset consists of 8,145 abstract art paintings with a resolution of 512 × 512. To get the most out of the experiment, we choose a few artworks with precise edges and repeatable patterns. A method was then built to transform the raw artwork into a 2D grid representation. The team uses a preprocessing script for resizing, aligning, setting a threshold for detecting edges, blurring, and defining the hazard zones. The interim images are then used as examples to create MiniGrid environments. The Minimalist GridWorld (MiniGrid) environment seeks to minimize software dependencies, be easy to extend, and deliver good performance for faster training [5]. The environment comes with the existing object types wall, floor, lava, door, key, ball, box and goal. This gives the agent diverse, generated environments with tasks such as getting the key to unlock the door, hazards like lava to be


avoided, and the purpose of getting to the goal. This provides a framework in which to execute our agent training. The team uses customized objects to echo the artworks’ characteristics (for example, different color schemes, shapes, patterns, and transparency). Later, the agent will explore procedurally generated environments and iterate on various policies to achieve maximum rewards. An example MiniGrid environment can be found in Fig. 2.
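As a rough illustration of how an image might be reduced to a grid with hazard zones, the sketch below block-averages a square grayscale image and marks dark cells as lava. It is not the preprocessing script used here: the cell labels, the threshold rule, and the function name are all hypothetical, and resizing, aligning, and blurring are left out.

```python
# Hypothetical sketch: reduce a grayscale image to a coarse hazard grid.
# The labels and the "dark cell = lava" rule are illustrative assumptions.

FLOOR = "floor"
LAVA = "lava"


def image_to_grid(pixels, grid_size, threshold):
    """Block-average a square image (list of rows of 0-255 values) and
    label each grid cell as lava when its mean is below the threshold."""
    side = len(pixels)
    if side % grid_size != 0:
        raise ValueError("image side must be a multiple of grid_size")
    block = side // grid_size
    grid = []
    for gy in range(grid_size):
        row = []
        for gx in range(grid_size):
            total = 0
            for y in range(gy * block, (gy + 1) * block):
                for x in range(gx * block, (gx + 1) * block):
                    total += pixels[y][x]
            mean = total / (block * block)
            row.append(LAVA if mean < threshold else FLOOR)
        grid.append(row)
    return grid
```

A 512 × 512 image could be reduced to, say, a 16 × 16 grid this way; the grid size actually used for the environments is not stated in the text.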

Fig. 2. This image shows a piece of abstract art converted to a grid environment for the agent to explore, named MiniGrid-PaintingS11Env-v1. The orange tiles represent lava and end the episode if the agent touches them; the green tile represents the goal the agent is striving to reach. An extrinsic reward is given only if the goal is reached.

4 Methods

In this section we discuss some background on how reinforcement learning works, existing learning algorithms, our evaluated intrinsic reward method, and the model architecture used.

4.1 Reinforcement Learning Background

We use a common reinforcement learning setup in which an agent interacts with an environment over frames, or discrete time steps. At each frame t, the agent receives a representation of the current state $s_t$ and selects an action $a_t$ from a set of possible actions A. The policy π is a mapping from the given state $s_t$ to an action $a_t$. Afterwards, the agent receives the next state $s_{t+1}$ and possibly a reward. This process continues until a terminal state is reached (the goal is reached or the frame limit is hit). The reward at the end of an episode is the sum of the rewards over all time steps, with a discount factor λ that decays later rewards. The goal of the agent is to maximize the expected end reward from each current state [8].

4.2 Model Policies

Advantage Actor-Critic (A2C) - In this algorithm the advantage function captures how much better an action is compared to the others at a given state, while the value function captures how good it is to be in that state. In this way an action is evaluated not only on how good it is, but also on how much better it can be. The benefit of the advantage actor-critic function is that it reduces the high variance of policy networks and stabilizes the model [6,12]. The Actor-Critic algorithm is a hybrid mechanism combining value optimization and policy optimization. More specifically, it combines the Q-learning and PG (Policy Gradient) algorithms [9]. At a high level, the resulting algorithm involves a loop alternating between:

• Actor: a Policy Gradient algorithm deciding on an action to take
• Critic: an off-policy reinforcement learning algorithm critiquing the action the actor selected and providing feedback on how to adjust. It can take advantage of efficiency tricks in Q-learning, such as memory replay.

A2C maintains a policy $\pi(s_t; \theta)$ and an estimate of the value function $V(s_t; \theta_v)$. A2C operates in the forward view and uses the same mix of time step returns to update both the policy π and the value function $V^{\pi}$. Both are updated after every $t_{max}$ actions. The update performed by the algorithm is

$$update_{loss} = \pi_{loss} - H_{\xi} \cdot H + V_{loss,\xi} \cdot V_{loss} \qquad (1)$$

where ξ is the coefficient (we use 0.01 and 0.5 respectively for entropy H and V ) [14]. We use two fully-connected neural networks for the actor and critic. The actor component outputs the agent action. And the critic outputs the value function estimate. This value function estimate replaces the reward function in policy gradient calculating the rewards only at the end of the episode. A2C Intrinsic Rewards - After backward propagation of the model using the updateloss , there is a second round of backward propagation using intrinsicreward : intrinsicreward = L2 N orm||φst+1 − φst ||

(2)

430

N. Grabaskas and Z. Wang

where φ represents the weights for all layers in the CNN [10]. This intrinsic reward is to encourage the agent to take actions leading to significant changes in its learned state representation. Attempting to mimic the sensation of learning or exploring. 4.3

Model Inputs

Input to the model (see Table 1) is the agent’s view of the grid environment (see the environment example in Fig. 2). For all experiments the view distance is set to 11, therefore the input is an array of size (3, 11, 11). Each tile is encoded as a 3-dimensional tuple: (OBJECT, COLOR, STATE) [5].

Table 1. Reinforcement model input by dimension and possible values for each dimension.

Dimension   Represented values
STATE       Open, closed, and locked
OBJECT      Wall, floor, lava, door, key, ball, box, and goal
COLOR       Red, green, blue, purple, yellow, and grey

4.4 Model Architecture

To implement A2C with deep learning, three components are needed (see Fig. 3). The first is a Convolutional Neural Network (CNN), which takes the agent observation and converts it to an embedding of size 576. The other two components are the Actor and the Critic, which have the same architecture: both take the embedding output from the CNN and have a single fully-connected layer of 64 neurons [16]. The Actor outputs one of the possible agent actions, while the Critic outputs an estimate of the value function. These components are optimized using RMSProp [13].
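To make Eqs. (1) and (2) from Sect. 4.2 concrete, the following is a minimal pure-Python sketch of the arithmetic only. It is not the Torch-AC implementation used in the experiments. The function names `update_loss` and `intrinsic_reward` and the example numbers are hypothetical; only the coefficients 0.01 and 0.5 come from the text, and we assume φ can be flattened into a plain list of floats.

```python
import math

ENTROPY_COEF = 0.01  # coefficient for the entropy H, as stated in the text
VALUE_COEF = 0.5     # coefficient for the value loss, as stated in the text


def update_loss(policy_loss, entropy, value_loss,
                entropy_coef=ENTROPY_COEF, value_coef=VALUE_COEF):
    """Eq. (1): policy loss minus weighted entropy plus weighted value loss."""
    return policy_loss - entropy_coef * entropy + value_coef * value_loss


def intrinsic_reward(phi_next, phi_curr):
    """Eq. (2): L2 norm of the difference between two flat vectors."""
    if len(phi_next) != len(phi_curr):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_next, phi_curr)))


# Hypothetical numbers, chosen only to show the arithmetic.
loss = update_loss(policy_loss=1.0, entropy=2.0, value_loss=0.5)
bonus = intrinsic_reward([3.0, 4.0], [0.0, 0.0])
```

In a deep learning framework the same two quantities would be tensors, so that each can drive its own round of backward propagation as described above.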

5 Metrics

We outline both the quantitative and qualitative metrics we use to evaluate our experiments.

5.1 Quantitative Agent Comparison

In each episode the grid is set up and the agent is given a reward for reaching the goal; this reward decays with each frame of the episode. No reward is given if the frame limit is reached or the agent touches a hazard. Quantitative metrics are used to compare agents trained with different algorithms in the same environment. Each agent is evaluated over the entire training process and its aggregate performance is used. The baseline is the normal A2C algorithm trained and evaluated on each grid environment, both basic and art inspired. We compare the intrinsic reward variation against A2C on each environment and discuss successes and shortcomings. We chose two metrics to evaluate the algorithms. Mean Reward is the first metric. Max Reward is the second: across all episodes played, the highest reward the agent received. We don’t discuss these further, but there are two other metrics one could consider, mean frames and max frames. The difficulty with frames as a metric is that lower is not always better. Given an environment with hazards, higher may be better, because the agent learned to avoid the obstacles.

Fig. 3. This figure shows a diagram of the model architecture. The CNN contains three layers which take the agent’s view as input and output an embedding of size 576. The actor and critic components each take this as input and output an action (size = 7) and a value function estimate (size = 1) respectively.

5.2 Qualitative Comparison

After training the agents, we sent each trained agent through 50 evaluation episodes for each environment. For each episode we create a time-lapse graphic representation of the environment. The agent’s movements are shown throughout the episode, with brighter spots representing where the agent spent more time. Next, we used the OpenCV xphoto oil painting effect [3] to take this image and create an abstract representation. These images were combined to create an A/B comparison (see Fig. 4). Comparisons were between the same environments and agents with the same number of training frames, again using the normal A2C agent output as the baseline and our intrinsic model as the comparison. The ordering was alternated so one source was not always on the same side, and 100 comparison images were sampled from the available 350 to be sent off for rating. These comparisons were placed in front of human raters using Mechanical Turk (MTurk), who were asked which one they find more interesting. The major challenge with the qualitative comparison is the subjectivity of what each rater finds interesting. Image ordering is varied to help reduce bias in case the rater learns to prefer one method over the other and expects the better image to be on a particular side. Each sample is placed before three raters and their ratings are combined so we are not relying on a single observer; this also helps to reduce bias. The question and possible answers are shown below:

• Question - With an A/B comparison of images, “Which image do you find more interesting?”
• Answer - A, B, Same.

Fig. 4. This figure shows two images rendered using the agent environment time lapse along with OpenCV oil painting effect. Image A is from normal A2C and image B is from A2C intrinsic. In this example the agent on the right did more exploration and created a visualization with more effect.
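The sampling and side-alternation steps described in this section can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_comparisons`, the integer pair identifiers, and the fixed seed are hypothetical, and the strict even/odd alternation is one possible way of varying the ordering; the text only states that the ordering was alternated and that 100 of 350 comparisons were sampled.

```python
import random


def build_comparisons(pair_ids, n_samples, seed=0):
    """Sample image pairs and alternate which side shows the intrinsic image."""
    rng = random.Random(seed)
    sampled = rng.sample(pair_ids, n_samples)
    comparisons = []
    for index, pair_id in enumerate(sampled):
        # Even positions put the intrinsic image on side A, odd ones on side B.
        intrinsic_side = "A" if index % 2 == 0 else "B"
        baseline_side = "B" if intrinsic_side == "A" else "A"
        comparisons.append({"pair_id": pair_id,
                            "intrinsic": intrinsic_side,
                            "baseline": baseline_side})
    return comparisons


# 350 available pairs, 100 sampled for rating (counts taken from the text).
comparisons = build_comparisons(list(range(350)), n_samples=100)
```

Recording which side held which source is what allows the A/B answers to be mapped back to the baseline or intrinsic agent after rating.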

6 Results and Discussion

In this section we cover the experiment setup and report quantitative and qualitative results. We also discuss what the results show, comparing variations of environments and training algorithms, and give a piece-by-piece analysis of the training visualization pipeline.

6.1 Experiment Setup

We combined and built upon three repositories for our intrinsic rewards and agent training environment.

1. Gym-MiniGrid [5] - contains the framework for the grid environment and the obstacles/interactions available to the agent, built on top of the popular OpenAI Gym [4].
2. Torch-AC [15] - contains the base algorithm implementations for Advantage Actor-Critic [8].
3. RL-Starter-Files [16] - contains the framework for training the agent on each environment, storing model states, and evaluating agent performance.

In the experiment, we first set up the Gym-MiniGrid environment as the base environment, which provides the playground for the agents to explore. Then the Advantage Actor-Critic (A2C) algorithm was implemented using Torch-AC. We also enhanced the default A2C algorithm by implementing the intrinsic reward when updating the agent’s parameters during each batch. A list of abstract art inspired environments was implemented in Gym-MiniGrid to train the agent. In each environment, the obstacle objects were custom generated based on either precise rules or a constraint policy. In some of the complicated grid worlds, the agent needs to complete a list of tasks (picking up a colored key or going through a linked door) to reach the goal. The agent was trained with the RL Starter repository using A2C and A2C intrinsic for 3 million frames. The agent’s convergence rate is highly correlated with the complexity and safety of the environment [2], and we found that not all agents could find the path to reach the goal by the end of the experiments.

We use two types of results, for quantitative and qualitative evaluation. For the quantitative comparison, the mean and the maximum of the agent’s returns were picked to show the general performance in each setup. For the qualitative evaluation, we first rendered an image using the time-lapsed agent’s movement on the grid and enhanced the raw image with a visual effect before sending the images for review.

6.2 Quantitative Results

The baseline agent without intrinsic rewards did better on our baseline environments DoorKey, FourRooms, and MultiRoom. These environments are less complex, with FourRooms and MultiRoom requiring navigation only with no hazards. These did not require exploration or curiosity on the part of the agent, eliminating any advantage the intrinsic agent had. While both algorithms were able to adapt to the environment, giving a reward for “learning” did not benefit the intrinsic agent, which showed a 14.79% decrease in rewards received (see Table 2). However, the intrinsic reward agent does achieve a higher maximum reward on FourRooms and MultiRoom.

Table 2. All of these agents were trained for 3M frames on each environment. The mean and max are shown for the entire training period. This method of evaluation reduces the variance that comes from judging agent performance on only a small number of final episodes.

Environment          A2C μ r_t   A2C max r_t   A2C Intrinsic μ r_t   A2C Intrinsic max r_t
DoorKey-8x8          0.0115      0.1112        0.0094                0.1068
FourRooms            0.3623      0.7733        0.2734                0.7923
MultiRoom-N2-S4      0.6738      0.8396        0.6631                0.8500
PaintingS11N5Env     0.0000      0.0000        8.08E−07              0.0190
PaintingV1S11N5Env   0.0075      0.0800        0.0065                0.0721
PaintingS9Env        0.0012      0.0786        0.2685                0.9474
PaintingS11Env       0.0012      0.0585        0.0018                0.0790

PaintingS11N5 and PaintingV1S11N5 are complex, procedurally generated environments which neither algorithm was able to train well on. Since rewards were only given for reaching the goal, the agent is not given a reward for making progress within the environment. These maze-like environments change each episode, which makes it difficult for the agent to learn to navigate. In the future we could try first training the agent on a smaller version of the maze and then expanding the size of the maze as the agent learns to navigate. Learning to navigate in a sparse-reward and complex environment is a common problem and one we tried to overcome with intrinsic rewards. Further experiments are still needed here.

PaintingS9 and PaintingS11 are less complex, singleton environments, but they still require some exploration to move around the lava and find the goal. In both of these environments the A2C agent with intrinsic reward was able to explore the environment and learn to reach the goal not only in fewer frames but also more consistently than the baseline A2C.

6.3 Qualitative Results

For our qualitative analysis, each image combination was shown to three raters and we use the average across all three raters. A baseline image vote was given a score of 0.0 each time it was selected as more interesting, and an intrinsic image vote was given a score of 1.0 each time it was selected as more interesting. A vote for “they are the same” was given 0.5. The average was taken across all 300 ratings and this gave a score of 0.503. We also computed a threshold-based result, in which an image counted for a method only if at least two of its three raters agreed. This means 97 of the 100 images received two out of three votes, with only three images receiving a neutral score. Baseline images received 48 votes and intrinsic images received 49 votes. For examples of success and failure as viewed by the raters, see Fig. 5 for intrinsic success and Fig. 6 for intrinsic failure.

The 100 samples sent off for rating came back almost completely even, 0.503 toward intrinsic artwork being more interesting. This shows there is a lot of room to improve both our rendering pipeline and the agent rewards to prioritize “good art” creation. At each step in the rendering process there is the potential for improvement. A time lapse of the environment is only one option; we could also look at a time lapse of the agent’s view, or a time lapse of the layers of the CNN or Actor/Critic components. Instead of a time lapse we could attach a theoretical brush to the agent and render brush strokes based on the agent’s movement. If the critic could evaluate how good the final artwork is, this could lead to more interesting artwork. Additionally, instead of the oil painting effect, we could try a watercolor effect, brush stroke effect, cartoon effect, or sketch effect. The color scheme could be changed to give more variety, or less for a monochromatic final image. Each of these steps could be evaluated using human raters to determine the best decision at each phase and combine all steps to create even more interesting final artwork.

Asking a human rater which image they find more interesting is very subjective, and this was clearly shown in our survey results. Perhaps if there was a target painting, then we could ask which model version was able to more closely match the target. We could also ask more pointed questions, such as which image is more colorful or which one more closely matches a specific style. Since human ratings on art are very subjective, it would also be very difficult to compare results from multiple rounds of evaluations, since each group would be different. We need to use the same group or a group with a similar makeup to increase confidence between evaluation rounds.

Fig. 5. This figure shows two generated art images. In this image the left is intrinsic and the right is baseline derived. The raters unanimously chose the left as more interesting, most likely because the agent explored more and thus created more movement within the artwork.

Fig. 6. This figure shows two generated art images. In this image the left is intrinsic and the right is baseline derived. The raters unanimously chose the right as more interesting. In this case the agent on the left created a darker image with less contrast.
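The two aggregation schemes used in this section (the mean score over all ratings and the two-out-of-three threshold) can be sketched as below. The ratings shown are hypothetical and are not the survey data. The names `mean_score` and `majority_labels` are illustrative, and treating an image with no two-vote agreement as "neutral" is our assumption about how ties were handled.

```python
from collections import Counter

# Score per vote, as described in the text.
VOTE_SCORE = {"baseline": 0.0, "same": 0.5, "intrinsic": 1.0}


def mean_score(votes_per_image):
    """Average the per-vote scores over every rating of every image."""
    scores = [VOTE_SCORE[v] for votes in votes_per_image for v in votes]
    return sum(scores) / len(scores)


def majority_labels(votes_per_image, threshold=2):
    """Label each image with the answer that reaches `threshold` votes, else 'neutral'."""
    labels = []
    for votes in votes_per_image:
        label, count = Counter(votes).most_common(1)[0]
        labels.append(label if count >= threshold else "neutral")
    return labels


# Hypothetical ratings for four image pairs, three raters each.
ratings = [
    ["intrinsic", "intrinsic", "baseline"],
    ["baseline", "baseline", "same"],
    ["intrinsic", "baseline", "same"],
    ["same", "same", "intrinsic"],
]
overall = mean_score(ratings)
labels = majority_labels(ratings)
```

A mean near 0.5, as in the reported 0.503, indicates that neither source was preferred overall.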

7 Conclusion and Future Work

Our work set out to give reinforcement learning agents a reward beyond just the explicit reward of reaching a goal, and to visualize this training as a way to generate interesting abstract artwork. This intrinsic reward incentivized model state changes to promote curiosity and learning. We evaluated a method of determining the difference between two states using the L2 norm, adding this to the loss, and performing a second round of backward propagation. We evaluated this change on the Advantage Actor-Critic algorithm with Gym-MiniGrid as the environment. We looked at multiple baseline environments: one involved a simple task of using a key to open a door and the others involved navigating through a limited number of rooms to reach the goal. Additionally, we created 4 environments inspired by abstract art: two complex procedurally generated mazes with hazards, and two singleton environments with hazards. The A2C algorithm with intrinsic rewards was able to learn the singleton environments far more effectively, mastering PaintingS9 where the normal A2C was unable to achieve even small rewards. Our experiments show the normal A2C algorithm was able to learn the baseline environments more effectively, with the intrinsic variant receiving 14.79% lower average rewards. However, the intrinsic variant did show higher maximum rewards, suggesting the rewards do influence the agent’s desire to explore and at times lead to better maximum performance. The complex procedurally generated environments were too difficult for either variant to learn. From our qualitative analysis in phase II, it is clear our prototype to create artwork from the agent’s actions has opportunities to improve at multiple points. Our rewards need to better incentivize art creation actions, our rendering pipeline to convert actions to artwork has many untested assumptions which need to be verified, and our evaluation needs more specific questions with a similar group of raters for each iteration.

The intrinsic reward showed positive training results and there are areas for future work. These include further training and analysis across more environments and for longer training periods. A stepped approach, where an agent is trained on a simple environment and then tuned on a more complex environment, could help the agent learn faster than starting with random weights on a complex environment. Additional tuning of the agent view distance and learning parameters may also help the agent explore effectively.

References

1. Al-Shedivat, M., Lee, L., Salakhutdinov, R., Xing, E.: On the complexity of exploration in goal-driven navigation. In: Relational Representation Learning Workshop (NIPS 2018), arXiv:1811.06889 (2018)
2. De Biase, A., Namgaudis, M.: Creating Safer Reward Functions for Reinforcement Learning Agents in the Gridworld. University of Gothenburg Chalmers University of Technology, Sweden (2018)
3. Bradski, G.: The OpenCV library. Dr. Dobb’s J. Softw. Tools (2000)
4. Brockman, G., et al.: OpenAI gym. In: ACL 2016, arXiv:1606.01540 (2018)
5. Chevalier-Boisvert, L.M., Willems, L., Pal, S.: Minimalistic gridworld environment for openAI gym (2018)
6. Degris, T., Pilarski, P.M., Sutton, R.S.: Model-free reinforcement learning with continuous action in practice. In: 2012 American Control Conference (ACC), pp. 2177–2182 (2012)
7. Luo, J.: Reinforcement learning for generative art. Ph.D. thesis, UC Santa Barbara (2020)
8. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. Computer Vision and Pattern Recognition (cs.CV), arXiv:1903.04411v3 (2019). Version 3
9. Mnih, V., et al.: NIPS Deep Learning Workshop 2013, arXiv:1312.5602v1 (2013)
10. Raileanu, R., Rocktäschel, T.: RIDE: rewarding impact-driven exploration for procedurally-generated environments. Machine Learning (cs.LG), arXiv:2002.12292v2 (2020). Version 2
11. (Grzegorz) Surma, G.: Abstract art images (2019)
12. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
13. Tieleman, T., Hinton, G.: Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012)
14. Wang, Z., et al.: Sample efficient actor-critic with experience replay. In: ICLR 2017, arXiv:1611.01224v2 (2016). Version 2
15. Willems, L., Karra, K.: PyTorch actor-critic deep reinforcement learning algorithms: A2C and PPO (2018)
16. Willems, L., Yuan, Y., Bahdanau, D., Chevalier-Boisvert, M.: RL starter files (2018)
17. Xie, N., Zhao, T., Tian, F., Zhang, X., Sugiyama, M.: Stroke-based stylization learning and rendering with inverse reinforcement learning. In: IJCAI 2015: Proceedings of the 24th International Conference on Artificial Intelligence, July 2015. Version 2
18. Zhou, S., Huang, Z., Heng, W.: Learning to paint with model-based deep reinforcement learning. Computer Vision and Pattern Recognition (cs.CV), arXiv:1903.04411v3 (2019)

Analysis of Divided Society at the Standpoint of In-Group and Out-Group Using Opinion Dynamics

Nozomi Okano and Akira Ishii
Tottori University, Tottori 680-8552, Japan
[email protected]

Abstract. This is a study using Ishii’s opinion dynamics to simulate a divided society. It is assumed that there is trust and distrust between people in society and that it is influenced by the mass media. We distinguished the in-group trust within each of the two divided groups from the out-group trust between the groups, and simulated the degree of social division.

Keywords: Opinion dynamics · Divided society · Conflict

1 Introduction

After the 2020 presidential election, modern American society is deeply divided. There are other examples of serious social division in world history. Moreover, even in modern times, various divisions can be seen in various countries. The worst consequence of the division of society was the American Civil War. In such a divided society, the people are divided into multiple groups, and there is no relationship of trust between the groups. In addition, the strength of the relationship of trust as an in-group within each group should also be related to the division of society. Therefore, we analyze this fragmented society using Ishii’s opinion dynamics [1–3], which deals with both trust and distrust among people in the theory of opinion dynamics. This model is named the Trust-Distrust Model [4]. In Sect. 2, we introduce the Trust-Distrust Model. In Sect. 3, we show the model setting for the social simulation of this paper. In Sect. 4, we show the results of the simulation. In Sect. 5, we review the results and discuss them. The conclusion is presented in Sect. 6.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 438–452, 2022. https://doi.org/10.1007/978-3-030-82193-7_29

2 Trust-Distrust Model

2.1 Theory of Trust-Distrust Model

The Trust-Distrust Model is based on the original bounded confidence model [5–7]. For a fixed agent i, where 1 ≤ i ≤ N, we denote the agent’s opinion at time t by I_i(t). According to Hegselmann-Krause [5], opinion formation of agent i can be described as follows.

$$I_i(t+1) = \sum_{j=1}^{N} D_{ij} I_j(t) \qquad (1)$$

This can be written in the following form.

$$\Delta I_i(t) = \sum_{j=1}^{N} D_{ij} I_j(t)\,\Delta t \qquad (2)$$

where it is assumed that D_ij ≥ 0 for all i, j in the model of Hegselmann-Krause. The coefficient D_ij is a factor that determines the speed of consensus building in the society. Based on this definition, D_ij = 0 means that the opinion of agent i is not affected by the opinion of agent j. In this theory, it is implicitly expected that the final goal of the negotiation among people is the formation of consensus. However, in real society, the formation of consensus among people is sometimes very difficult. The Trust-Distrust Model improves on this point by incorporating a sense of distrust among the people of a society [1–3]. Thus, we employ the Trust-Distrust Model in this paper. The details of the Trust-Distrust Model are described in the references [2,3]. The Trust-Distrust Model uses the following equation [2].

$$\Delta I_i(t) = c_i A(t)\,\Delta t + \sum_{j=1}^{N} D_{ij}\,\Phi(I_i(t), I_j(t))\,(I_j(t) - I_i(t))\,\Delta t \qquad (3)$$

where I_i(t) is the agent’s opinion at time t. In this model, D_ij can be negative. A negative D_ij means that agent i distrusts agent j. The value range of I_i(t) is −∞ ≤ I_i(t) ≤ +∞. In the first term on the right-hand side, A(t) is the mass media effect. This mass media term comes from the mass media term in the mathematical model for hit phenomena [8]. We assume here that D_ij is an asymmetric matrix: in general D_ij ≠ D_ji, and D_ij and D_ji can have different signs.

2.2 Two-Agents Calculation

We solve the Trust-Distrust Model with two people. Let the agents be A and B. The equations are as follows.

$$\Delta I_A(t) = c_A A(t)\,\Delta t + D_{AB}\,\Phi(I_A(t), I_B(t))\,(I_B(t) - I_A(t))\,\Delta t \qquad (4)$$

$$\Delta I_B(t) = c_B A(t)\,\Delta t + D_{BA}\,\Phi(I_B(t), I_A(t))\,(I_A(t) - I_B(t))\,\Delta t \qquad (5)$$

As you can see from Fig. 1, if the coefficients of trust between A and B, D_AB and D_BA, are positive, then A and B trust each other, and their opinions get closer over time to reach a consensus. However, if the coefficients D_AB and D_BA are negative, A and B distrust each other, and their opinions repel each other and drift apart. After a certain degree of disagreement, they ignore each other, and so their opinion trajectories become parallel. When D_AB and D_BA are positive, the model is the same as the conventional bounded confidence model [5–7], where the coefficients D_AB and D_BA correspond to the speed of reaching consensus. On the other hand, the case where D_AB and D_BA are negative is not included in the bounded confidence model and is a characteristic of the Trust-Distrust Model. Since the bounded confidence model implicitly assumes trust among all people in a society, the coefficients of trust D_ij are all positive, and it is not possible to handle dissent due to distrust.

Fig. 1. Calculation result for N = 2. (a) D_AB > 0 and D_BA > 0. (b) D_AB < 0 and D_BA < 0.
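The two-agent behavior of Eqs. (4) and (5) can be reproduced qualitatively with a short Euler integration. This is a sketch under assumptions, not the authors' code: the function Φ is not defined in this section, so we assume a smooth cut-off of the form 1/(1 + exp(β(|I_i − I_j| − b))) with hypothetical parameters β = 1 and b = 5. The initial opinions, the time step, and the trust values ±0.5 are also illustrative, and the mass media term is set to A(t) = 0.

```python
import math


def phi(opinion_i, opinion_j, beta=1.0, cutoff=5.0):
    """Assumed smooth cut-off: near 1 for close opinions, near 0 for distant ones."""
    return 1.0 / (1.0 + math.exp(beta * (abs(opinion_i - opinion_j) - cutoff)))


def simulate_two_agents(d_ab, d_ba, i_a=1.0, i_b=-1.0, dt=0.01, steps=1000):
    """Euler integration of Eqs. (4) and (5) with A(t) = 0."""
    for _ in range(steps):
        delta_a = d_ab * phi(i_a, i_b) * (i_b - i_a) * dt
        delta_b = d_ba * phi(i_b, i_a) * (i_a - i_b) * dt
        # Update both opinions simultaneously from the same time step.
        i_a, i_b = i_a + delta_a, i_b + delta_b
    return i_a, i_b


trust = simulate_two_agents(d_ab=0.5, d_ba=0.5)        # mutual trust
distrust = simulate_two_agents(d_ab=-0.5, d_ba=-0.5)   # mutual distrust
gap_trust = abs(trust[0] - trust[1])
gap_distrust = abs(distrust[0] - distrust[1])
```

With mutual trust the gap between the two opinions shrinks toward zero, while with mutual distrust it grows until the assumed cut-off suppresses further interaction, matching the description of panels (a) and (b).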

2.3 Calculation for 300 Persons

In the following, the number of people in society is set to 300. The network that connects people is a random network, and the probability of connecting nodes is 30%. The coefficient D_ij of trust or distrust from person to person is determined by a random number from −1 to 1. As is known from Ishii-Kawahata’s research [9], in a complete network, if 55% or more of the confidence coefficients D_ij among people are positive, society will reach consensus. The threshold of 55% does not change even with a random network [10]. Figure 2 shows a typical example. In Fig. 2(a), the confidence coefficients D_ij are all positive, so society has reached consensus. This is the same result as the calculation of the bounded confidence model so far. On the other hand, in Fig. 2(b), half of the confidence coefficients D_ij are positive, and the other half are negative, meaning distrust. In this case, as seen in the calculation, society does not form consensus and the distribution of opinions expands.

Fig. 2. Calculation results of opinion dynamics calculation for (a) 100% positive trust and (b) 50% positive trust. The left figures are trajectories of opinion of people in society and the right figures are the opinion distribution at time = 10. The number of people in society is set to be 300.

What can be seen in the calculation of (b) is a typical example of opinion distribution in general society. Figure 3 shows typical examples of the opinion distributions near the threshold value of 55% for the positive rate of the confidence coefficient D_ij, for 300 persons on a random network with connecting probability 30%. It can be seen that if it is more than 55%, the society will reach consensus, and if it is 55% or less, the opinion distribution tends to expand without consensus building. In the rest of this paper, we calculate the case where society is divided into two groups. If the ratio of positive confidence coefficients D_ij in each of the two groups that form society is 55% or more, consensus is formed within each group. In this case, consensus is formed on different opinions for each group. If the percentage of positive confidence coefficients D_ij in a group is 55% or less, the group does not reach consensus. Whether or not a group forms consensus is determined by the ratio of positive values of the confidence coefficient D_ij, as described above. This criterion also works effectively for the analysis below, where society is divided into two groups.


Fig. 3. Calculation results of the opinion distributions at time = 10 for (a) 57% positive trust, (b) 56% positive trust, (c) 55% positive trust and (d) 54% positive trust.
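A many-agent version of Eq. (3) on a random network can be sketched as follows. Several choices here are assumptions made for illustration: the form of Φ (a smooth cut-off with hypothetical β = 1 and b = 5, since Φ is not defined in this section), drawing the magnitude of each D_ij uniformly and its sign according to the positive ratio, and the helper names `random_trust_matrix` and `simulate`. The example uses 30 agents and a short run rather than 300 agents up to time 10, only so that the pure-Python loop finishes quickly.

```python
import math
import random


def phi(opinion_i, opinion_j, beta=1.0, cutoff=5.0):
    """Assumed smooth cut-off (not defined in this section)."""
    return 1.0 / (1.0 + math.exp(beta * (abs(opinion_i - opinion_j) - cutoff)))


def random_trust_matrix(n, positive_ratio, link_prob, rng):
    """D[i][j] is 0 without a link, else positive with probability positive_ratio."""
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < link_prob:
                sign = 1.0 if rng.random() < positive_ratio else -1.0
                matrix[i][j] = sign * rng.random()
    return matrix


def simulate(matrix, opinions, dt=0.01, steps=200):
    """Euler integration of Eq. (3) with the mass media term A(t) = 0."""
    n = len(opinions)
    for _ in range(steps):
        deltas = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                d_ij = matrix[i][j]
                if d_ij != 0.0:
                    total += (d_ij * phi(opinions[i], opinions[j])
                              * (opinions[j] - opinions[i]))
            deltas.append(total * dt)
        opinions = [o + d for o, d in zip(opinions, deltas)]
    return opinions


rng = random.Random(0)
n_agents = 30  # the paper uses 300; reduced here for speed
start = [rng.uniform(-1.0, 1.0) for _ in range(n_agents)]
trusting = random_trust_matrix(n_agents, positive_ratio=1.0, link_prob=0.3, rng=rng)
final = simulate(trusting, start)
spread_start = max(start) - min(start)
spread_final = max(final) - min(final)
```

Lowering `positive_ratio` toward the 55% threshold discussed above is the natural experiment to run with such a sketch, although a faithful reproduction would need the authors' exact Φ and parameters.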

3 Model Setting for Social Simulation

In the calculations in this paper, we assume that the number of persons is 300. For the mass media effect, we set A(t) = 0, that is, there is no mass media in this paper. The human network is assumed to be a random network with a connection probability of 30%. The mutual D_ij values between the 300 people are decided by uniform random numbers. In general, people decide which group to belong to based upon their social identity or political identification [11,12]. In deciding which group to support, they identify the group they belong to as their “in-group” and distinguish other groups as “out-groups” [13]. They strengthen a sense of unity with the members of the group they identify with and start to support the in-group issues [12].

In this paper, we perform two types of calculations as an extension of the previous works [3,14]. In the first type, as shown in Fig. 4, society is divided into two groups, A and B. Relationships within each group are in-group relationships, and relationships between the groups are out-group relationships. In the calculation, the number of people in society as a whole is 300, with 150 people belonging to group A and 150 people belonging to group B. The ratio of positive confidence coefficients D_ij is set to be T_A for group A and T_B for group B. The ratio of positive confidence coefficients D_ij between groups A and B is T_AB. In this paper, T_AB is set to zero so that every confidence coefficient D_ij between the people of group A and the people of group B is between −1 and 0.

Fig. 4. The first type model of our simulation. The society is divided into two groups A and B. The in-group trust of A is T_A and the in-group trust of B is T_B. The out-group trust between A and B is T_AB.

The second type of our simulation model is shown in Fig. 5. In this case as well, society is divided into two groups, A and B, but some people in group B trust the people in group A. They trust each other equally within group B, but some of them also trust the people in group A. This second type of simulation model was first pointed out in our previous work [3]. In the calculation, the number of people in society as a whole is 300, with 150 people belonging to group A and 150 people belonging to group B, and we set 50 of the people in group B to also trust the people in group A. This model is the same as the last model of [3], but we run many simulations with different settings. This model corresponds, for example, to Republicans in American society after the 2020 presidential election who look to compromise with the Democratic Party rather than following President Trump. Also, in Japanese society, there are examples of opposition parties that are cooperative with the ruling party.

Fig. 5. The second type model of our simulation. The society is divided into two groups A and B. The in-group trust of A is T_A and the in-group trust of B is T_B1. In group B, there is a group B2. The out-group trust between group A and group B2 is T_B2.
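The first-type setting (two groups of 150, in-group ratios T_A and T_B, out-group ratio T_AB) can be turned into a trust matrix as sketched below. The function name `group_trust_matrix`, the way magnitudes and signs are drawn, and the example values T_A = T_B = 0.8 are illustrative assumptions; the 30% connection probability, the group sizes, and T_AB = 0 follow the text.

```python
import random


def group_trust_matrix(group_of, in_group_ratio, out_group_ratio, link_prob, rng):
    """Build D so that the share of positive entries depends on the groups of i and j."""
    n = len(group_of)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or rng.random() >= link_prob:
                continue  # no self-links; unlinked pairs keep D_ij = 0
            if group_of[i] == group_of[j]:
                ratio = in_group_ratio[group_of[i]]
            else:
                ratio = out_group_ratio
            sign = 1.0 if rng.random() < ratio else -1.0
            matrix[i][j] = sign * rng.random()
    return matrix


rng = random.Random(1)
groups = ["A"] * 150 + ["B"] * 150
d = group_trust_matrix(groups, {"A": 0.8, "B": 0.8}, out_group_ratio=0.0,
                       link_prob=0.3, rng=rng)
# With T_AB = 0 no coefficient between the two groups should be positive.
cross_positive = sum(1 for i in range(300) for j in range(300)
                     if groups[i] != groups[j] and d[i][j] > 0)
```

The second-type model could be expressed in the same way by giving the 50 members of B2 their own ratio toward group A, which would require a ratio lookup keyed on the ordered pair of groups rather than a single out-group value.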

4 Results

4.1 Calculation for the First Model

First, Fig. 6 shows a case where the two groups A and B each form a consensus as an in-group. Assuming that the ratio of positive in-group confidence coefficients of A and B is 80%, consensus is formed within each group, as shown by past studies [9,10] for complete and random networks. In Fig. 6, groups A and B converge on different opinions and form consensus. In this case, the confidence coefficients between people as an out-group of A and B are all negative. In other words, if this is a society, the society is completely divided in two. An example of this is American society at the time of the American Civil War.

Fig. 6. Calculation results of opinion dynamics calculation for T_A = 0.8, T_B = 0.8 and T_AB = 0. The left figures are trajectories of opinion of people in society and the right figures are the opinion distribution at time = 10. The number of people in society is set to be 300.

Figure 7 shows the calculation when the ratios where the confidence coefficient Dij as an in-group of groups A and B is a positive value is 60%, 55%, 50%, and 40%. The confidence factor Dij between A and B as an out-group is a negative value, −1 to 0. In Fig. 8, the ratio of the confidence coefficient Dij in A and B as an in-group is a positive value is 55%, and the ratio of the confidence coefficient Dij as an out-group between A and B is 30%, 50%, 60%, 80%. If the ratio of positive values of the confidence coefficient Dij among the people who make up group A and

Divided Society

445

Fig. 7. Calculation results of opinion dynamics calculation for TAB = 0. The in-group trust is set to be (a) TA = TB = 0.6, (b) TA = TB = 0.55, (c) TA = TB = 0.5 and (d) TA = TB = 0.4. The figures are trajectories of opinion of people in society. The number of people in society is set to be 300.

group B as an out-group is finite, the trajectories of the opinions of the people A and B are mixed due to the mutual trust relationship. Especially in the case of TAB = 0.8, the strength of trust as an in-group of Group A and Group B is not enough for consensus building. However, due to the strength of trust between A and B as an out-group, society leads to consensus building. In Fig. 9, the ratio of the confidence coefficient Dij in A and B as an in-group is a positive value is 70%, and the ratio of the confidence coefficient Dij as an out-group between A and B is 0%, 30%, 50% and 70%. If the ratio of positive values of the confidence coefficient Dij among the people who make up group A and group B as an out-group is finite, the trajectories of the opinions of the people A and B are mixed due to the mutual trust relationship. If A and B have a strong relationship of trust as an in-group and A and B have a weak relationship of trust as an out-group, Group A and Group B will form an agreement for each group. However, at TAB = 0.5, the opinions of A and B people become mixed, and at TAB = 0.5, A and B agree on the same opinion. In the case of (d), regardless of whether A or B is used, the ratio of the trust


Fig. 8. Calculation results of opinion dynamics calculation for the in-group trust TA = TB = 0.55. The out-group trust is set to be (a) TAB = 0.3, (b) TAB = 0.5, (c) TAB = 0.6, and (d) TAB = 0.8. The figures are trajectories of opinion of people in society. The number of people in society is set to be 300.

coefficient Dij being a positive value is 70% across society as a whole, so consensus forms in the whole society.

4.2 Calculation for the Second Model

The second computational model describes a society that is divided into group A and group B, but in which some members of group B trust the people of group A. This would be the case, for example, at the end of the Trump administration in the United States, when some Republicans compromised with Democrats. Consider also a referendum asking whether Britain should leave the EU: if some people in Britain want a coalition with the EU, think of A as the continental European countries and B as Britain. Our model calculation corresponds to such cases, in which some of B’s people have strong trust in A. In Fig. 10, the ratio of positive values of the in-group confidence coefficient Dij within group A and group B is 70%, and the out-group trust coefficients Dij between A and B are all negative. However, the percentages


Fig. 9. Calculation results of opinion dynamics calculation for the in-group trust TA = TB = 0.7. The out-group trust is set to be (a) TAB = 0.0, (b) TAB = 0.3, (c) TAB = 0.5, and (d) TAB = 0.7. The figures are trajectories of opinion of people in society. The number of people in society is set to be 300.

of people in group B2 of B whose confidence coefficient Dij toward group A is positive are 70%, 60%, 50%, and 40%. Looking at the calculation results, when TAB2 = 0.7, the people in B2 who trust A move close to the opinions of the people in A, but when TAB2 is 0.6 or less, only a small number of people in B2 share the opinion of A. Most people in B2 simply align with their in-group B. This is because the values of the out-group confidence coefficient Dij between groups A and B are all negative. In Fig. 11, the ratio of positive values of the in-group confidence coefficient Dij within group A and group B is 50%, and the out-group trust coefficients Dij between A and B are all negative. However, the percentages of people in group B2 of B whose confidence coefficient Dij toward group A is positive are 30%, 50%, 70%, and 90%. As can be seen in the figure, in this case the ratio of positive in-group Dij within A and within B is only 50%, so neither group forms a consensus as an in-group. Therefore, society as a whole does not reach consensus. However, because of the strong trust that B2 people have


in A, when TAB2 = 0.9, B2 people form consensus with opinions close to those of A people.

Fig. 10. Calculation results of opinion dynamics calculation for the case of B2 group. The in-group trust for A is TA = 0.7. The in-group trust for B is TB = 0.7. The outgroup trust between A and B is TAB = 0.0. The calculations are for various trust value from the group B2 to the group A; (a) TB2 = 0.7, (b) TB2 = 0.6, (c) TB2 = 0.5 and (d) TB2 = 0.4. The number of people in society is set to be 300.

In Fig. 12, the percentage of positive in-group confidence coefficients Dij is 70% for A and 30% for B. Therefore, group A forms a consensus, but group B does not. However, if the B2 people have strong trust in A, they converge on the same opinion as group A. In Fig. 13, neither group A nor group B has in-group trust strong enough to reach consensus. Therefore, neither A nor B forms a consensus, nor does society as a whole. Even so, when the B2 people have strong trust in the A people, they tend to reach consensus with opinions close to those of the A people. This holds only for some of the B2 people, and only with trust as strong as TAB2 = 0.9.
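The trust settings used in Figs. 7 to 13 can be illustrated with a small sketch. The paper does not say how the coefficients are drawn, so the code below assumes that each Dij has a magnitude drawn uniformly from [0, 1) and is made positive with a probability equal to the stated ratio (TA, TB, TAB, or the B2-to-A ratio). The function names, the group sizes, and the size of the subgroup B2 are all hypothetical.

```python
import random


def draw_trust(rng, positive_ratio):
    """Draw one trust coefficient; it is positive with probability positive_ratio.

    Assumption: the magnitude is uniform in [0, 1), so values lie in (-1, 1).
    """
    magnitude = rng.random()
    return magnitude if rng.random() < positive_ratio else -magnitude


def build_divided_society(n_a, n_b, n_b2, t_a, t_b, t_ab, t_b2, seed=0):
    """Build an asymmetric trust matrix for groups A and B.

    The first n_b2 members of B form the subgroup B2, whose trust in A is
    drawn with ratio t_b2 instead of the out-group ratio t_ab.
    Setting n_b2 = 0 gives the first model (no B2 subgroup).
    """
    rng = random.Random(seed)
    n = n_a + n_b
    group = ["A"] * n_a + ["B"] * n_b
    in_b2 = [False] * n_a + [k < n_b2 for k in range(n_b)]
    trust = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-trust term
            if group[i] == group[j]:
                ratio = t_a if group[i] == "A" else t_b  # in-group trust
            elif in_b2[i]:
                ratio = t_b2  # a B2 member looking at a member of A
            else:
                ratio = t_ab  # ordinary out-group trust
            trust[i][j] = draw_trust(rng, ratio)
    return trust


# Hypothetical sizes loosely following Fig. 10(a): TA = TB = 0.7, TAB = 0.0, TB2 = 0.7.
matrix = build_divided_society(n_a=150, n_b=150, n_b2=50,
                               t_a=0.7, t_b=0.7, t_ab=0.0, t_b2=0.7)
```

Because the matrix is filled entry by entry, Dij and Dji are drawn independently, which matches the asymmetric trust assumed by the model.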

5 Discussion

In this paper, we have calculated various aspects of a divided society. As shown in Figs. 7 and 13, even if society is divided into two, if people’s trust is not


Fig. 11. Calculation results of opinion dynamics calculation for the case of B2 group. The in-group trust for A is TA = 0.5. The in-group trust for B is TB = 0.5. The outgroup trust between A and B is TAB = 0.0. The calculations are for various trust value from the group B2 to the group A; (a) TB2 = 0.3, (b) TB2 = 0.5, (c) TB2 = 0.7 and (d) TB2 = 0.9. The number of people in society is set to be 300.

strong enough to reach consensus within each in-group, neither group forms a consensus, and society as a whole does not form one either. In politics, a society that does not reach consensus as a whole in this way can be called stable, because the division of society does not become a serious conflict. This result supports the conclusion suggested by Ishii and Okano [14]. The most dangerous form of social division is the case of Fig. 6, where society divides and each divided group reaches its own consensus. This worst kind of conflict occurred in the American Civil War. In Fig. 7, society as a whole has not reached consensus, but neither has it split into two groups that each form a separate consensus. However, because there is no out-group trust between A and B, the opinions of the people of A and the opinions of the people of B remain separated. This would be close to the conflict between conservatives and liberals in many countries.


Fig. 12. Calculation results of opinion dynamics calculation for the case of B2 group. The in-group trust for A is TA = 0.7. The in-group trust for B is TB = 0.3. The outgroup trust between A and B is TAB = 0.0. The calculations are for various trust value from the group B2 to the group A; (a) TB2 = 0.1, (b) TB2 = 0.3, (c) TB2 = 0.5 and (d) TB2 = 0.7. The number of people in society is set to be 300.

We also found that if there is strong out-group trust between the groups, society as a whole can reach consensus across the factions, as shown in Fig. 8(d). In addition, as the second model, we calculated a setting in which one of the two groups in the divided society is not united. This situation is common in reality. In this second model, people in B2 hold different opinions depending on their trust in people in A, and B may not function as an in-group in the calculation. A typical example of B not functioning as a faction is shown in Fig. 12, where, because in-group trust is low, B lacks cohesion, and the people of B2 who have strong trust in A move to consensus with the people of A. This calculation would correspond to a coalition of political parties. Because Ishii’s opinion dynamics includes distrust among people, it makes possible social simulations of crisis situations such as social division.


Fig. 13. Calculation results of opinion dynamics calculation for the case of B2 group. The in-group trust for A is TA = 0.3. The in-group trust for B is TB = 0.3. The outgroup trust between A and B is TAB = 0.0. The calculations are for various trust value from the group B2 to the group A; (a) TB2 = 0.3, (b) TB2 = 0.5, (c) TB2 = 0.7 and (d) TB2 = 0.9. The number of people in society is set to be 300.

6 Conclusion

In this paper, the Trust-Distrust Model is used to calculate the division of society. We also calculated the case in which one of the two groups in a divided society is not tightly bound. The Trust-Distrust Model is a newly proposed theory, but we found that it can simulate various aspects of society with considerable flexibility.

Acknowledgments. This work was supported by JSPS KAKENHI Grant Number JP19K04881. The authors are grateful for frequent discussions with Prof. M. Nishikawa of Tsuda University, especially on the in-group and out-group problem.

References

1. Ishii, A., Kawahata, Y.: Opinion dynamics theory for analysis of consensus formation and division of opinion on the internet. In: Proceedings of The 22nd Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES2018), pp. 71–76. arXiv:1812.11845 [physics.soc-ph] (2018)
2. Ishii, A.: Opinion dynamics theory considering trust and suspicion in human relations. In: Morais, D., Carreras, A., de Almeida, A., Vetschera, R. (eds.) Group Decision and Negotiation: Behavior, Models, and Support, GDN 2019. Lecture Notes in Business Information Processing, vol. 351. Springer, Cham (2019)
3. Ishii, A., Okano, N., Nishikawa, M.: Social simulation of divided society using new opinion dynamics. Front. Phys. 9 (2021). https://doi.org/10.3389/fphy.2021.640925
4. Okano, N., Ishii, A.: Opinion dynamics on a dual network of neighbor relations and society as a whole using the Trust-Distrust model. Submitted to the Proceedings of the 23rd International Conference on Artificial Intelligence (ICAI 2021)
5. Hegselmann, R., Krause, U.: Opinion dynamics and bounded confidence models, analysis, and simulation. J. Artif. Soc. Soc. Simul. 5 (2002)
6. Deffuant, G., Neau, D., Amblard, F., Weisbuch, G.: Mixing beliefs among interacting agents. Adv. Complex Syst. 3, 87–98 (2000)
7. Weisbuch, G., Deffuant, G., Amblard, F., Nadal, J.-P.: Meet, discuss and segregate! Complexity 7(3), 55–63 (2002)
8. Ishii, A., et al.: The ‘hit’ phenomenon: a mathematical model of human dynamics interactions as a stochastic process. New J. Phys. 14, 063018 (2012). (22pp.)
9. Ishii, A., Kawahata, Y.: Opinion dynamics theory considering interpersonal relationship of trust and distrust and media effects. In: The 33rd Annual Conference of the Japanese Society for Artificial Intelligence, p. 33 JSAI2019 2F3-OS-5a-05 (2019)
10. Ishii, A., Yomura, I., Okano, N.: Opinion dynamics including both trust and distrust in human relation for various network structure. In: 2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), pp. 131–135 (2020). https://doi.org/10.1109/TAAI51410.2020.00032
11. Berelson, B.R., Lazarsfeld, P.F., McPhee, W.N.: Voting: A Study of Opinion Formation in a Presidential Campaign. The University of Chicago Press, Chicago (1954)
12. Achen, C.H., Bartels, L.M.: Democracy for Realists: Why Elections Do Not Produce Responsive Government. Princeton University Press, Princeton (2016)
13. Tajfel, H., Turner, J.C.: An integrative theory of inter-group conflict. In: Austin, W.G., Worchel, S. (eds.) The Social Psychology of Inter-group Relations, pp. 33–47. Brooks/Cole, Monterey, CA (1979)
14. Ishii, A., Okano, N.: Social simulation of a divided society using opinion dynamics. In: Proceedings of the 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, pp. 660–667 (2020)

Simulation of Intragroup Alignment Using a New Model of Opinion Dynamics

Nozomi Okano1, Hitoshi Yamamoto2, Masaru Nishikawa3, and Akira Ishii1(B)

1 Tottori University, Tottori 680-8552, Japan
[email protected]
2 Rissho University, Shinagawa, Tokyo 141-8602, Japan
3 Tsuda University, Kodaira, Tokyo 187-8577, Japan

Abstract. We performed a social simulation using the opinion dynamics theory that incorporates trust and distrust. In this paper, we simulated the invisible primary of the US presidential election. We found that a candidate who is strongly trusted by the voters has an advantage. In addition, when simulating a model with sub-leaders, which is closer to the actual invisible primary, we found that candidates who have strong support from many sub-leaders have an advantage. All of the results we obtained are consistent with the findings of previous studies on invisible primaries. Most actual US presidential election results are also consistent with this study’s findings.

Keywords: Opinion dynamics · Invisible primary · Intragroup alignment · Voter · Sub-leader

1 Introduction

In the United States, the presidential campaign officially begins in February of the year the presidential election is held. However, many months before the first primary election, the “invisible primary” begins. There is no agreement among experts as to when the “invisible primary” begins, but many consider it to be the period between Labor Day and January of the presidential election year. Candidates try to gain an advantage in the “invisible primary” to get ahead of the other candidates. At this stage, candidates send out many press releases, try to raise money, and promote themselves [1–6]. During the invisible primary, voters have to choose whom to support from among candidates of the same party. Therefore, the role of the media in influencing public opinion is significant. The news media also have to choose which candidates to cover more heavily and which not to cover. Therefore, money and media matter most during the invisible primary [7–9]. Additionally, a candidate who secures as many party leaders’ endorsements as possible is likely to win the caucuses. Such a candidate sends a signal to donors, party activists, party organizers, and rank-and-file voters that he/she is viable [10–12].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 453–463, 2022. https://doi.org/10.1007/978-3-030-82193-7_30


FiveThirtyEight collected data on how many endorsements the candidates had secured during the invisible primary [10]. FiveThirtyEight adopts a simple weighting system: 10 points for governors, 5 points for US senators, and 1 point for US representatives. They conducted a historical comparison of cumulative endorsement points in invisible primaries from 1980 to 2016. In both parties, the candidate who won a large majority of support from party leaders during the invisible primary became the presumptive nominee. However, the 2016 election was an anomaly: Donald Trump was declared the presumptive Republican nominee even though most Republican party leaders had rallied around others during the invisible primary [9,12]. Stiles et al. use a threshold model of social interaction to simulate the outcome of an invisible primary. They conducted a network analysis simulation with three candidates while changing the size of the primary electorate, and showed that frontrunners are likely to gain advantages over other candidates by the end of the invisible primary under certain conditions. They argue that three requirements must be met for a particular candidate to gain an advantage before the Iowa Caucus: “a sizeable lead in the polls at the onset of the race, an environment in which there is little information decay, and an unwavering base of support.” If any one of these requirements is missing, it becomes unclear whether a particular candidate will have an advantage. Therefore, the candidate’s campaign must use tools to inform the voters about the candidate and keep their interest [12]. However, Stiles et al. tended to focus on the role of media. Their analysis does not address how much support is needed to survive a primary election, or whether factors other than the media help candidates win an invisible primary. Therefore, we apply a newly developed opinion dynamics model to the invisible primary.
Conventional opinion dynamics treats opinion as a continuous quantity rather than as a binary choice between agreeing and disagreeing. The conventional theory that processes opinion transitions continuously is the “bounded confidence model.” These theories are aimed at consensus building [13–18]. Simulations based on the bounded confidence model assume only trust relationships between individuals. In short, the primary purpose of these opinion dynamics is consensus building in society. In reality, however, people’s opinions do not always reach consensus; opinions can divide and conflict. The conventional bounded confidence model cannot handle this problem. Recently, Ishii et al. proposed a new theory of opinion dynamics that deals with human relationships of both trust and distrust [19,20] as a simple extension of the bounded confidence model. Using this theory, Ishii and Kawahata calculated the effect of a charismatic person in the simulation [21]. Ishii and Okano [22,23] calculated the case of people who are distrusted by everyone in the society.


Section 2 presents the theory of opinion dynamics. Section 3 applies the opinion dynamics to an actual case, the United States’ invisible primaries. The results of the simulations are shown in Sect. 4. In Sect. 5, we review the results and compare them with the findings of prior studies: we confirm that candidates with solid trust from voters have an advantage, and that candidates who have strong support from many sub-leaders have an advantage. In the conclusion, we note that Trump’s victory in the 2016 Republican primary was an exception, but that most actual election results are consistent with this study’s findings. This paper uses Ishii’s new opinion dynamics to simulate the United States’ invisible primary.

2 Theory

Following Ishii and coworkers [19,20], we use the following equation of opinion dynamics:

$$\Delta I_i(t) = -\alpha I_i(t)\,\Delta t + c_i A(t)\,\Delta t + \sum_{j=1}^{N} D_{ij}\,\Phi(I_i(t), I_j(t))\,\bigl(I_j(t) - I_i(t)\bigr)\,\Delta t \tag{1}$$

Dij is the coefficient of trust of agent i in agent j. A positive Dij means that agent i trusts agent j; a negative Dij means that agent i distrusts agent j. We assume here that Dij is an asymmetric matrix: in general Dij ≠ Dji, and Dij and Dji can have different signs. The function Φ(Ii(t), Ij(t)) is a sigmoid function that works as a smooth cut-off at |Ii − Ij| = b. With this sigmoid function, we assume that if the opinions of two people are too far apart, they are not influenced by each other’s opinions.

$$\Phi(I_i, I_j) = \frac{1}{1 + \exp\bigl(\beta(|I_i - I_j| - b)\bigr)} \tag{2}$$

Moreover, because of the factor Ij(t) − Ii(t), the opinion Ii(t) is not affected by the opinion Ij(t) if the two are almost the same. The characteristic of this opinion dynamics theory is that it inherently incorporates trust and distrust between people. In this theory, trust does not turn into distrust when opinions move apart, and no matter how close their opinions are, two people who are initially set to distrust each other repel. This makes the theory convenient for simulating conflicts between people who are predisposed to distrust one another because of, for example, race, religion, or historical background.
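A minimal sketch of how Eqs. (1) and (2) could be evaluated numerically is given below. It performs one explicit time step of size Δt for all agents at once. The paper does not report values for α, β, b, or Δt, so the defaults here are placeholders chosen only for illustration, and the function names `phi` and `euler_step` are hypothetical.

```python
import math


def phi(opinion_i, opinion_j, beta=1.0, b=5.0):
    """Smooth cut-off of Eq. (2): near 1 for |Ii - Ij| << b, near 0 for |Ii - Ij| >> b."""
    exponent = beta * (abs(opinion_i - opinion_j) - b)
    if exponent > 700.0:  # guard against OverflowError in math.exp; the value is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(exponent))


def euler_step(opinions, trust, dt=0.01, alpha=0.0, media=0.0,
               response=None, beta=1.0, b=5.0):
    """Advance every opinion by one step of Eq. (1).

    opinions : list of I_i(t)
    trust    : matrix with trust[i][j] = D_ij (trust of agent i in agent j)
    media    : A(t) at the current time; response[i] is c_i
    All default parameter values are illustrative assumptions.
    """
    n = len(opinions)
    if response is None:
        response = [0.0] * n
    updated = []
    for i in range(n):
        interaction = sum(
            trust[i][j] * phi(opinions[i], opinions[j], beta, b)
            * (opinions[j] - opinions[i])
            for j in range(n)
        )
        delta = (-alpha * opinions[i] + response[i] * media + interaction) * dt
        updated.append(opinions[i] + delta)
    return updated


# Two agents who trust each other move closer; two who distrust each other move apart.
closer = euler_step([0.0, 1.0], [[0.0, 1.0], [1.0, 0.0]], dt=0.1)
apart = euler_step([0.0, 1.0], [[0.0, -1.0], [-1.0, 0.0]], dt=0.1)
```

The j = i term is left in the sum because the factor Ij − Ii makes it vanish, which mirrors the remark in the text that identical opinions do not affect each other.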

3 Simulation Model of Intragroup Alignment

In this paper, we propose to apply opinion dynamics to the invisible primaries in the United States. As the first step, consider two models, as shown in Fig. 1. Figure 1(a) schematically illustrates the case where a charismatic candidate (e.g.,


Donald Trump) is trusted by voters (constituents) by the value p. Figure 1(b) represents the case where there are Nsub sub-leaders (e.g., senators, representatives, and governors) who support the candidate. In this model, the sub-leaders are considered to have deeper trust in the candidate than the voters do; the value of their trust is q. Here, the number N of people is assumed to be not the whole population but the supporters of the political party to which the candidate belongs. Sub-leaders are people with influence who support a leader in an invisible primary. How sub-leaders act in an invisible primary is vital, for example, for the surge of a candidate who is to be selected from among many candidates in a primary election.

Fig. 1. Schematic illustration of two models for calculation. (a) is the illustration of the amount of trust from every person in society to the candidate. (b) illustrates the amount of trust from sub-leaders to the candidate and the amount of trust from every person in society to the candidate.

In the concrete calculation, the number N of the target people is 300. The coefficient of confidence Dij among people is determined by a random number in the range −1 to 1. The candidate’s initial opinion is +15, the strength of the candidate is m = 10, and the strength of the voters’ will is m = 1. The connection between people is a random network, and the probability of being connected is 50%. A scale-free network is more realistic as a network of people’s connections, but in that case, whether or not a sub-leader is located at a hub affects the calculation result. Therefore, because no measured network structure of people’s connections is available for the actual target case, we use the random network structure.
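The setup described above can be sketched as follows for the model of Fig. 1(a). Several details are assumptions made only for this illustration: the candidate is stored as an extra agent at index 0, the candidate's own trust in voters is left at zero, a link in the random network is undirected while Dij and Dji on that link are drawn independently, and the strength parameter m is omitted because the text does not say how it enters Eq. (1). The function name `build_trust_matrix` is hypothetical.

```python
import random


def build_trust_matrix(n_voters=300, p=10.0, link_prob=0.5, seed=0):
    """Return a (n_voters + 1) x (n_voters + 1) trust matrix.

    Index 0 is the candidate (an assumption of this sketch); indices
    1..n_voters are voters. trust[i][j] is the trust of agent i in agent j.
    """
    rng = random.Random(seed)
    size = n_voters + 1
    trust = [[0.0] * size for _ in range(size)]
    for i in range(1, size):
        trust[i][0] = p  # every voter trusts (or distrusts) the candidate by p
        for j in range(i + 1, size):
            if rng.random() < link_prob:  # undirected link with probability 50%
                trust[i][j] = rng.uniform(-1.0, 1.0)  # i's trust in j
                trust[j][i] = rng.uniform(-1.0, 1.0)  # j's trust in i, drawn independently
    return trust


# Negative p corresponds to the p = -5 case, positive p to the p = +10 case of Fig. 2.
backlash = build_trust_matrix(p=-5.0, seed=1)
support = build_trust_matrix(p=10.0, seed=1)
```

Using the same seed for both matrices keeps the voter-to-voter network identical, so only the trust in the candidate differs between the two cases.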

4 Results

4.1 Trust to a Candidate from Voters

First, we show the calculation results for the model of Fig. 1(a). Let p be the degree of voters’ trust in a candidate in an invisible primary. We vary p in the range −5 to 20. Figure 2 shows the results for voters’ trust in the candidate of −5 and +10. At −5, the voters react against the candidate, and no one holds an opinion near the candidate’s opinion. In other words, voters avoid the candidate because of the negative trust. Conversely, at a trust level of 10, voters hold opinions similar to the candidate’s. If the trust is positive, the approval rating is high.

Fig. 2. Calculations for positive and negative trust to a candidate from voters. The left graphs are the opinion trajectories of people in society. The blue trajectory is that of the candidate. The middle graphs are the opinion distribution at time = 10. The right graphs are the number of positive opinions and opposing opinions as a function of time. In (a), the amount of trust from every voter in society to the candidate is p = −5. In (b), the amount of trust from every voter in society to the candidate is p = 10.

Figure 3 shows the percentage of positive opinions as a function of the trust level p from voters. Since the candidate’s opinion is 15, the percentage of positive opinions is the candidate’s support percentage. The approval rating shown here is the average over three simulation runs. Because the trust coefficients Dij among voters are set by random numbers, the result fluctuates noticeably from run to run. Nevertheless, the figure shows that the approval rating is low when the voters’ trust is negative and high when it is positive.
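The approval rating described above can be computed with a few lines of code. This is only a sketch: it assumes that an opinion of exactly zero is not counted as positive and that the lists passed in contain the final opinions of the voters only; the function names are hypothetical.

```python
def positive_ratio(opinions):
    """Fraction of agents whose opinion is strictly positive."""
    return sum(1 for value in opinions if value > 0) / len(opinions)


def mean_positive_ratio(runs):
    """Average the positive-opinion ratio over several simulation runs."""
    return sum(positive_ratio(run) for run in runs) / len(runs)


# Three toy runs standing in for the three simulations that are averaged.
rating = mean_positive_ratio([[1.0, 2.0], [3.0, -1.0], [-2.0, -4.0]])
```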


Fig. 3. The calculated value of the ratio of positive opinion as a function of time. Each value is the average of three calculations.

4.2 Sub-leaders

Next, we show the calculation with sub-leaders, corresponding to Fig. 1(b). In this case, in addition to the trust p from voters, the number of sub-leaders Nsub and the value of the trust q that the sub-leaders have in the candidate are important. To compare various values of Nsub, we impose the following condition, which fixes the overall trust at 3000. Each result is the average of three simulation runs.

$$300p + qN_{sub} = 3000 \tag{3}$$

Figure 4 shows the case where Nsub = 150 and the trust from voters is −5, and the case where Nsub = 10 and the trust from voters is +5. The trust between the sub-leaders is set to 10. In both cases in Fig. 4, the overall trust is fixed at 3000. It appears that the larger the number of sub-leaders, the higher the approval rating. The calculation in Fig. 5 confirms this. The approval rating there is again the average of three simulations, and all the calculations in Fig. 5 satisfy Eq. (3). Although the values in Fig. 5 fluctuate because of the random numbers, a larger number of sub-leaders is more advantageous for obtaining a higher approval rating.
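Equation (3) ties the sub-leader trust q to the voter trust p and the number of sub-leaders. The short sketch below solves the constraint for q. The resulting q values for the two cases of Fig. 4 are implied by Eq. (3) rather than stated in the text, and the function name is hypothetical.

```python
def subleader_trust(p, n_sub, n_voters=300, total=3000.0):
    """Solve n_voters * p + q * n_sub = total (Eq. (3)) for q."""
    if n_sub <= 0:
        raise ValueError("n_sub must be positive")
    return (total - n_voters * p) / n_sub


q_many = subleader_trust(p=-5, n_sub=150)  # 30.0
q_few = subleader_trust(p=5, n_sub=10)     # 150.0
```

Under the constraint, many sub-leaders with moderate trust and a few sub-leaders with very high trust carry the same total, which is what makes the comparison across Nsub in Fig. 5 meaningful.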

5 Discussion

We applied the opinion dynamics model to the invisible primary of the United States. The simulation yielded the following two findings. (1) Candidates who have the trust of voters have an advantage. (2) When simulating a model closer to the actual invisible primary, including the sub-leaders,


Fig. 4. Calculations for positive and negative trust to a candidate from voters. The left graphs are the opinion trajectories of people in society. The blue trajectory is that of the candidate. The middle graphs are the opinion distribution at time = 10. The right graphs are the number of positive opinions and negative opinions as a function of time. In (a), the amount of trust from every voter in society to the candidate is set to be p = −5, and the amount of trust between sub-leaders is 10. The number of sub-leaders is 150. In (b), the amount of trust from every voter in society to the candidate is set to be p = +5, and the amount of trust between sub-leaders is 10. The number of sub-leaders is 10.

Fig. 5. The calculated value of the ratio of positive opinion as a function of time. The five curves correspond to the number of sub-leaders of 150, 100, 50, 10, and 5. Each value is the average of three calculations.


candidates who have strong support from many sub-leaders have an advantage. Stiles et al. focused on the role of media. However, the role of sub-leaders is also crucial for a candidate to survive an invisible primary. Except for that point, our findings are consistent with previous studies’ findings on invisible primaries [10–12]. The actual election results are also mostly consistent with the findings of this study. Only Trump in the 2016 Republican primary does not match them at all; why this is so would be an interesting question for future study. As the next step, one can consider a simulation that includes the trust that voters have in the sub-leaders who work for a candidate. Figure 6 and Fig. 7 show an example calculation using this model. In this calculation as well, the total value of the trust is set to 3000, as expressed in Eq. (4). In actual invisible primaries, sub-leaders are political insiders such as federal senators, representatives, and governors. Therefore, it would be interesting to run a simulation in which the sub-leaders are set in more detail according to the actual elections and primaries.

$$300p + qN_{sub} + 300\,p_s N_{sub} = 3000 \tag{4}$$
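As a consistency check, the parameter values listed in the caption of Fig. 7 can be substituted into Eq. (4). This assumes that the voter-to-sub-leader trust of +0.02 given in that caption is the quantity written ps in Eq. (4), with p = −2, q = 30, and Nsub = 100:

$$300(-2) + 30 \cdot 100 + 300\,(0.02)(100) = -600 + 3000 + 600 = 3000$$

Under that reading, the example of Fig. 7 satisfies the constraint exactly.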

Fig. 6. Schematic illustration of models for calculation, including the amount of trust from all voters to sub-leaders.

Social simulation using the opinion dynamics of this paper is a research method that can simulate the emergence of a leader in society and division caused by multiple leaders, and it is not limited to primary elections.


Fig. 7. Calculations for positive and negative trust to a candidate from voters. The left graphs are the opinion trajectories of people in society. The blue trajectory is that of the candidate. The middle graphs are the opinion distribution at time = 10. The right graphs are the number of positive opinions and negative opinions as a function of time. The amount of trust from every voter in society to the candidate is set to be p = −2, and the amount of trust between sub-leaders is +1. The amount of trust from the sub-leaders to the candidate is set to be 30. The number of sub-leaders is 100. The amount of trust from every voter in society to sub-leaders is set to be ps = +0.02.

6 Conclusion

In this study, as stated in the Introduction, we performed a social simulation using the opinion dynamics theory that incorporates trust and distrust. We simulated the invisible primary before the US presidential election. Applying the opinion dynamics model to the invisible primary, we obtained the result that a candidate with strong trust from the voters has an advantage. In addition, when simulating a model with sub-leaders, which is closer to the actual invisible primary, we found that candidates who have strong support from many sub-leaders have an advantage, a role of sub-leaders that preceding studies failed to notice. Overall, our findings are consistent with the findings of previous studies on invisible primaries. Although they are entirely inconsistent with Trump’s case in the 2016 Republican primary, most actual election results are consistent with this study’s findings. Therefore, it can be said that this study provides a theory that can simulate the actual invisible primary. We are aware that our research has limitations and room for improvement. The evaluation was done solely on the proposed methods. Hence, we should work on multiple approaches to validate the results and show our model’s advantages.

References 1. Flowers, J., Haynes, A., Crespin, M.: The media, the campaign, and the message. Am. J. Polit. Sci. 47(2), 259–273 (2003) 2. Haynes, A.A., Flowers, J.F., Gurian, P.: Getting the message out: candidate communication strategy during the invisible primary. Polit. Res. Q. 55, 633–652 (2002) 3. Buell, E.H., Jr.: The invisible primary. In: Mayer, W.G. (ed.) Pursuit of the White House. Chatham House Publishers, New Jersey (1996)

462

N. Okano et al.

4. Hadley, A.T.: The Invisible Primary. Prentice Hall, Englewood Cliffs (1976)
5. Darr, J.P.: Earning Iowa: local newspapers and the invisible primary. Soc. Sci. Q. 100, 320–327 (2019)
6. Kenski, K., Filer, C.R., Conway-Silva, B.A.: Lying, liars, and lies: incivility in 2016 presidential candidate and campaign tweets during the invisible primary. Am. Behav. Sci. 62, 286–299 (2018)
7. Steger, W.P.: Who wins nominations and why? An updated forecast of the Presidential Primary Vote. Polit. Res. Q. 60, 91–99 (2007)
8. Han, L.C.: Off to the (horse) races: media coverage of the ‘not-so-invisible’ invisible primary of 2007. In: Bose, M. (ed.) From Votes to Victory: Winning and Governing the White House in the Twenty-First Century, pp. 91–116. Texas A&M University Press (2011)
9. Reuning, K., Dietrich, N.: Media coverage, public interest, and support in the 2016 republican invisible primary. Perspect. Polit. 17, 326–339 (2019)
10. Bycoffe, A.: “The Endorsement Primary,” The FiveThirtyEight (2020). https://projects.fivethirtyeight.com/2016-endorsement-primary
11. Cohen, M., Karol, D., Noel, H., Zaller, J.: The Party Decides: Presidential Nominations Before and After Reform. University of Chicago Press, Chicago (2009)
12. Stiles, E.A., Swearingen, C.D., Seiter, L., Foreman, B.: Catch me if you can: using a threshold model to simulate support for presidential candidates in the invisible primary. J. Artif. Soc. Soc. Simul. 23 (2020)
13. Deffuant, G., Neau, D., Amblard, F., Weisbuch, G.: Mixing beliefs among interacting agents. Adv. Complex Syst. 3, 87–98 (2000)
14. Weisbuch, G., Deffuant, G., Amblard, F., Nadal, J.-P.: Meet, discuss and segregate! Complexity 7(3), 55–63 (2002)
15. Hegselmann, R., Krause, U.: Opinion dynamics and bounded confidence models, analysis, and simulation. J. Artif. Soc. Soc. Simul. 5 (2002)
16. Duggins, P.: A psychologically-motivated model of opinion change with applications to American politics. J. Artif. Soc. Soc. Simul. 20(1), 13 (2017)
17. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Rev. Modern Phys. 81, 591–646 (2009)
18. Sîrbu, A., Loreto, V., Servedio, V.D.P., Tria, F.: Opinion dynamics: models, extensions and external effects. In: Loreto, V., et al. (eds.) Participatory Sensing, Opinions and Collective Awareness. Understanding Complex Systems, 42 pages. Springer, Cham (2017)
19. Ishii, A., Kawahata, Y.: Opinion dynamics theory for analysis of consensus formation and division of opinion on the internet. In: Proceedings of The 22nd Asia Pacific Symposium on Intelligent and Evolutionary Systems, pp. 71–76 (2018). arXiv:1812.11845 [physics.soc-ph]
20. Ishii, A.: Opinion dynamics theory considering trust and suspicion in human relations. In: Morais, D., Carreras, A., de Almeida, A., Vetschera, R. (eds.) Group Decision and Negotiation: Behavior, Models, and Support, GDN 2019. Lecture Notes in Business Information Processing, vol. 351, pp. 193–204. Springer, Cham (2019)
21. Ishii, A., Kawahata, Y.: Opinion dynamics theory considering interpersonal relationship of trust and distrust and media effects. In: The 33rd Annual Conference of the Japanese Society for Artificial Intelligence, 33 JSAI2019 2F3-OS-5a-05 (2019)

22. Okano, N., Ishii, A.: Sociophysics approach of simulation of charismatic person and distrusted people in society using opinion dynamics. In: Sato, H., Iwanaga, S., Ishii, A. (eds.) Proceedings of the 23rd Asia-Pacific Symposium on Intelligent and Evolutionary Systems, pp. 238–252. Springer (2019)
23. Okano, N., Ishii, A.: Isolated, untrusted people in society and charismatic person using opinion dynamics. In: Proceedings of ABCSS2019 in Web Intelligence, pp. 1–6 (2019)

Random Forest Classification with MapReduce in Holonic Multiagent Systems

Michèle Cullinan and Duncan Coulter

University of Johannesburg, Johannesburg, South Africa
[email protected], [email protected]

Abstract. Multiagent systems are a dominant field of study within artificial intelligence. Holons are a special kind of agent implementation found in multiagent systems which have not received much attention in recent mainstream AI topics, such as machine learning. Their self-similar structure is both stable and coherent, and likewise consists of one or more holons. This paper studies how holonism, introduced through recursive modelling techniques, benefits multiagent systems. It also expands the scope of multiagent learning applications by proposing a new architecture for executing Decision Tree and Random Forest machine learning models that is novel in terms of both parallelism and extensibility in order to solve classification problems. The models were designed using a structure-based approach to computer programs, with a special focus on recursive structures. The results show that, when the algorithms are applied in a classification problem domain, they perform consistently with their expected behaviour.

Keywords: Holonic multiagent systems · MapReduce · Decision Trees · Random Forests

1 Introduction

The human brain has the ability to use different ways and multiple regions to learn information. This natural distributed learning mechanism allows information to be more interconnected and embedded in memory. Artificial intelligence (AI) problem-solving techniques such as machine learning (ML) are often inspired by tasks performed by the brain. Another fundamental method, specifically for problems of a distributed nature, is multiagent systems (MAS). Together they produce AI theory concentrating on group intelligent behaviours emerging from the cooperation of multiple interacting intelligent entities forming multiagent systems. Knowledge about the problem is divided and shared to develop a solution as a result of coordination, cooperation and sometimes competition. The theory of multiagent learning (MAL) is the application of MAS together with ML algorithms involving multiple agents as described in [14]. Cooperative MAL attempts to draw from multiagent theory in a spectrum of areas, including reinforcement learning, evolutionary computation, complex systems, agent modelling and robotics.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 464–483, 2022. https://doi.org/10.1007/978-3-030-82193-7_31

The technological challenges of today involve tackling complex real-world problems in dynamic and unstable environments and, according to [19], this means that MAL should depend on scalable theories so that agents can handle multiple states in continuous strategy spaces. The predominant field of study in MAL related to machine learning algorithms is reinforcement learning. More specifically, it looks at the connection between reinforcement learning and game theory [19]. Although a lot of progress has been made in understanding MAS-based reinforcement learning, this focus has narrowed the scope of MAL research [19].

Another related field of study is holonic systems (HS). Holonicity is a concept arising from biological holons, which are systems forming a multi-levelled hierarchy of semi-autonomous sub-wholes where the whole becomes more than the sum of its parts due to its emergent properties. In a biological example, the human body contains organs which are groups of cells that can be divided into even smaller parts called organelles. Each of the components has a variety of functions, none of which can function without its sub-components or without reference to the organ that it is part of [5]. MASs can become very complex due to the interactions between their constituent autonomous agents, but when agents are implemented as holons they can assist in a system becoming self-organising [2]. Principles of divide-and-conquer will be used to model separate machine learning tasks by breaking them up into smaller parts which are distributed to other holonic agents. Holarchies are also useful for defining a mixture of autonomous agents and the hierarchical organisation of those agents, which is implemented by using recursion [19].
The models will be designed using a structure-based approach to computer programs, with a special focus on recursive structures. According to [6], HS is considered a general paradigm for distributed intelligent manufacturing control, whereas MAS is regarded as a software technology that can be used to implement this type of holonic system. The coordination of a group of autonomous agents, resulting in intelligent behaviour, is supported by research in subjects like human autonomy, cooperation among communities, and learning from past experiences [6]. Research into the application of MASs has focused on manufacturing enterprises, ranging from product design to real-time control [6]. Multiagent systems have been shown to support many applications, such as information retrieval, evolutionary algorithms, constraint satisfaction in the timetabling problem, flood forecasting, land use allocation in farming, localising and tracking of objects in motion, modelling road networks in industrial plants, and adaptive network meshing and traffic simulation [2]. Agent-based and holonic system design techniques have also been beneficial in the manufacturing sectors [6].

The intersection of these MAS paradigms, methodologies and fields of research reveals a gap in the research on multiagent systems. Applications of holarchies are absent from the field of AI machine learning, most notably in the space of supervised learning. Since many distributed problems exhibit an inherent structure which may be beneficially mirrored in the relationships between groups of intelligent agent problem solvers found in multiagent systems, the exploitation of the inherently recursive nature of some learning algorithms could improve their processing adaptability in dynamic and distributed execution environments. The aim of this paper is to develop a novel and extensible learning system represented by multiagent communities and sub-communities, which will contribute to broadening the landscape of MAL research and its applications.

The following section gives an overview of related work in the above-mentioned fields of research. Discussions of multiagent systems, multiagent learning and holonic multiagent systems are provided in the Background section. The Materials and Methods section considers the research methodology and the prototype implementation of the presented model, applied to Decision Tree and Random Forest classification. The Results section describes the outcomes of the model evaluation and is followed by a Discussion section. Finally, the conclusion and future work are presented in the Conclusion section.

2 Related Work

Initially, biological modelling research in MAL mostly involved adaptive parallel computation inspired by nature, which included ant systems, flocking or herding behaviour, evolutionary computation, social learning, neural networks, and interaction and imitation learning [19]. This period was followed by studies focused on multiagent learning techniques [19], dominated by applications of MAL reinforcement learning in a game-theoretic context [19]. As this literature review shows, reinforcement learning is still the most commonly studied technique for multiagent learning.

Multiagent systems that cooperate to achieve self-organisation have been successfully applied to domains in biological modelling, manufacturing and industrial control simulation, e-commerce, networking, robotics, avionics, and flood forecasting. Agents based on self-organisation through holons have been useful in domains such as transportation scheduling, RoboCup, flexible manufacturing systems and business process coordination in virtual enterprises [5]. Holonic multiagent systems were also effective in traffic signal control [1]. The MAS applications drawing on self-organisation discussed in [2] are listed in Table 1.

In [7], an algorithm was presented for a cooperative multiagent environment in which agents create a supervised ML model that is used to decide the agents' future actions. The algorithm was applied to the traffic signals domain, learning with a P-concept probability model. In [15], two multiagent learning algorithms were applied in which multiple agents concurrently learn how to better interact with each other, extending two popular multiagent algorithms, namely cooperative co-evolution and multiagent reinforcement learning. Lenient learning was extended to multiagent deep reinforcement learning (DRL) in [13]. Two approaches were then identified for improving parallel reinforcement learning agents, applied to hysteretic Q-learning and to leniency domains [13].


Table 1. MAS applications drawing on self-organisation

Application                  Mechanism
Information Retrieval        Cooperative Information Agents
Timetabling                  Cooperation in AMAS
Flood Forecasting            Cooperation in AMAS
Land Use Allocation          Eco-problem
Localisation and Tracking    Reactive MAS
Adaptive Meshing             Holons
Traffic Simulation           Holons, Holarchies

The paper by [3] suggested techniques for centralising the training of deep multiagent RL, using the model-free deep Q-network as the baseline model and message sharing between agents. Classical RL is underpinned by a reward function that is the exclusive property of the environment and is only altered by external factors. A paper by [10] introduced a novel peer rewarding system, in which agents could deliberately influence each other's reward functions. Likelihood Quantile Networks for coordinating multiagent RL were suggested by [11] to improve performance by making each agent consider the probability that another agent is changing its exploration policy and using that information to weight the learning rate applied to samples. A new architecture was introduced by [9] that allowed holons to adapt to their environment, relying on the recursive nature of the holarchy, with an example of robot soccer players. The domains of application ranged from manufacturing systems and transport to cooperative systems and radio mesh dimensioning.

The goal of this research is to investigate new applications of holarchies in the machine learning domain, considering how holonicity can capture the self-similar properties of certain supervised learning systems. For example, a hierarchical multiagent-based ensemble Random Forest model can be expressed as a community of recursively composed smaller sub-communities consisting of binary tree data structures. The aggregate of these sub-trees produces a predictor with the emergent ability to solve classification problems.

Holonic multiagent systems (HMAS) were compared to holonic manufacturing systems in [5]. Furthering this topic, a framework for the analysis and design of HMASs was provided by [17], and the system was implemented with the Madkit platform. In HMASs, agents interact with other agents through a hierarchical structure. Agents were defined as holons playing one of four roles: StandAlone, Head, Part and Multi-Part. Bernon also showed a holonic approach to self-organisation in multiagent systems in [2]. Agents could self-organise and regulate the system's complexity by also playing one of the roles defined above. Algorithms running in imperfect and open real-world environments, for example cloud applications, benefit from self-organisation. In [9], an architecture with a recursive nature was proposed using holarchies. The solution proposed an artificial immune network as a holonic system with a reinforcement learning mechanism.

The model architecture proposed in this paper will be implemented as a holonic MapReduce model which can be mapped onto multiagent implementations and which displays suitable self-similar properties, as identified from this literature survey. The aim is to produce a model that is scalable, distributable and can easily perform dynamic execution in a cloud environment.

3 Background

There are two ways in which independent agents in multiagent systems are programmed to achieve certain goals: the agents can either cooperate with one another or compete against each other [19]. MASs are well suited to distributed computation, which adds to their complexity, and so it becomes necessary for them to have a mechanism for self-organisation. This may involve grouping to form societies, resulting in coherent coordination emerging in their behaviour [2]. Mechanisms for generating self-organisation can be broadly categorised as central control, partial control or completely decentralised control. Based on these control mechanisms, systems can be further categorised as Reactive MASs, Cooperative Information Agents, Adaptive MASs, or Holonic MASs [2]. Holarchies, which are significant to the model proposed in this paper, allow for an additional level of MAS sophistication. They are used to organise multiagent systems as recursive self-similar entities. This discussion is elaborated further in the HMAS section.

3.1 Multiagent Learning

Multiagent learning problems consider the design of algorithms involving agents that are situated in a stochastic game and must learn optimal behaviour [19]. One of the criteria for classifying MAL techniques is the learning environment. Other criteria are the agent's awareness of its environment and the use of models to learn. According to [19], the field of MAL and its classifications of MAL techniques are too limited. Most multiagent frameworks are described in terms of stochastic or Markov games [19].

Reinforcement learning (RL) and its family of techniques are the most studied in MAL. Simply defined, RL is based on the observation that rewarding the desirable behaviour of an agent and punishing undesirable behaviour will lead to a behaviour change which contributes to the agent achieving its objectives. The reinforcement feedback is usually coded as a scalar value, and the agent acts to maximise the reinforcement it expects to receive [19].

Research in MAL started in a field called artificial life, which involved adaptive parallel computation inspired by nature. Techniques explored were ant systems and steering behaviours, evolutionary computation, social learning, neural networks, and interactive and imitation learning [19]. Following that, the field of study became dominated by reinforcement learning MAL techniques. Some common types of algorithms in MAL system implementations are computational, descriptive, normative, prescriptive, cooperative or non-cooperative [19]. MAL techniques can also be classified according to the type of learning, as shown in Table 2.

Table 2. Types of MAL techniques from [19]

Type of learning        Description
Multiplied learning     Several agents learning independently of one another
Divided learning        Learning tasks are divided among one or more agents with the same known goal
Interactive learning    Agents share the work in a single learning task

There are a few learning paradigms related to the concept of MAL that have not received enough attention according to [19]. Two of them are Multistrategy learning and Parallel Inductive learning. Multistrategy learning combines two or more learning strategies into the learning system, and Parallel Inductive learning studies the exploitation of the inherent parallelism in many learning algorithms in order to scale to more complex problems. The main goal of this paper is to explore a unified approach to developing holonic multiagent learning systems. The two above-mentioned techniques were selected to validate the approach due to their hierarchically self-similar nature.

3.2 Holonic Multiagent Systems

Biological holons are systems forming a multi-levelled hierarchy of semi-autonomous sub-wholes where the whole becomes more than the sum of its parts due to its emergent properties. Multiagent systems can become very complex due to the interactions between the autonomous agents; implementing agents as holons therefore allows a system of agents to become self-organising [2]. A holon is implemented as a special case of an agent that is composed of sub-agents with the same structure. At a low level both are simple model building blocks with a flow of incoming and outgoing data. According to [6], holonic systems (HS) can be considered a general paradigm for distributed intelligent manufacturing control, whereas MAS is regarded as a software technology that can be used to implement this type of holonic system.

The paper by [5] presents a general framework for holonic multiagent systems with multiple advantages. The model preserves compatibility with standard multiagent systems by representing each agent as a holon. The complexity of a group of agents is encapsulated as a super holon, and the number of agents active in the holon does not change its communication with other agents in the system. The flexible and variable nature of cooperating agent systems allows for a design which can change at run-time [5]. According to Fischer, there is currently no existing programming construct that can be used to design such an object-oriented programming styled approach. He provides terminology and a methodology in the context of HMASs for recursive modelling with multi-agency which can handle a dynamic runtime.

3.3 Decision Trees and Random Forests

The problem-solving mechanism in supervised learning models involves a portion of the data called the training set, which is a set of attribute measurements together with their observed outcomes. This data can then be used to create a prediction model, also referred to as a learner, which can predict the outcome for unseen data. The following section discusses the Decision Tree supervised learning algorithm. Classification trees provide the building blocks for ensemble methods such as Random Forests, which are covered in more depth in the remainder of this background section.

Decision Trees. A classification tree is a supervised learning algorithm. The model is built with input data from a training dataset. Each data point in the training set consists of an input vector $X$ and its corresponding output value $y$, producing an example $(X, y)$. $X$ consists of a set of attribute values $x_i$, $i = 1, 2, \ldots, n$, and $y$ is a single observed outcome value. To make a prediction, the Decision Tree function takes an unseen example from the testing dataset and produces a “decision”, which is a single output value $y_j$ representing a class [18]. The predicted output is then compared to the actual output in order to determine the model’s accuracy.

A Decision Tree learner is implemented as a binary tree data structure. Each node represents a single input variable from the training dataset, and any incoming data are recursively partitioned along some path in the tree. The tree uses split points to perform tests on the input data. The split points are calculated from the cost function and are determined according to the minimum cost of creating a branch for that attribute. Branches terminate at a leaf node representing the output value. The process of growing a tree takes a greedy approach [8]. The splitting method calculates which example attribute, referred to as the “important” attribute, minimises the cost of creating an internal node at that position in the tree structure.

Good trees are ones in which the branches and sub-branches are short, making them easy to interpret. After adding a node, the branched-off tree that follows can be seen as a new sub-model created with a smaller training set and effectively considering fewer attributes [18]. Figure 1 shows how Decision Trees are defined by recursively partitioning the input data into localised regions and defining a sub-model in each resulting region [8]:

Fig. 1. Decision Tree created by recursive binary splitting adapted from [8]
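The recursive partitioning described above can be illustrated with a minimal sketch. The tree layout used here (a nested dictionary with hypothetical `left` and `right` keys, and bare class labels as leaves) is an assumption made for illustration only and is not taken from the paper's prototype.

```python
# Hypothetical tree layout (an assumption for illustration): an internal node is a
# dict with a feature 'index', a split 'value' and 'left'/'right' children; a leaf
# is simply the predicted class label.
tree = {
    "index": 0, "value": 2.5,
    "left": 0,
    "right": {"index": 1, "value": 1.0, "left": 1, "right": 2},
}

def predict(node, row):
    """Follow one path from the root to a leaf and return the leaf's class."""
    if not isinstance(node, dict):      # leaf reached
        return node
    branch = "left" if row[node["index"]] < node["value"] else "right"
    return predict(node[branch], row)   # recurse into the selected sub-tree
```

Each recursive call sees only the sub-tree below the chosen branch, which mirrors the idea that every split defines a smaller sub-model over a localised region of the input space.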

Random Forests. As briefly mentioned, Random Forests are an ensemble method derived from Decision Trees with hierarchical recursive properties. The advantage of Decision Trees in machine learning is that the algorithm is quite easy to understand and the results are easy to interpret. Depending on the dataset, however, they are not always the most accurate, especially where the data is high-dimensional. Decision Trees are also known to have the following disadvantages [8]:

– they do not provide very accurate predictions due to the greedy nature of the tree construction algorithm
– they are unstable and, due to their hierarchical nature, can change drastically due to small changes in the data
– they can overfit the data

In statistical terms, Decision Trees are high-variance estimators. Random Forests are a method for reducing this variance by calculating a final prediction from the average of many weaker predictions that have each been calculated over a subset of the data [8]. This technique is called bagging, or bootstrap aggregation, and involves sampling the dataset randomly with replacement to create many weak learners whose predictions are then averaged. Random Forests are good at reducing the variance because the methodology decorrelates the weak tree learners through the way in which the learning data is selected. These ensembles have shown very good predictive accuracy, although they do become more difficult to interpret [8].
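A minimal sketch of bootstrap aggregation may help to make the idea concrete. All function names are hypothetical, the weak learner is a deliberately trivial stand-in (it always predicts the majority class of its sample) rather than a real Decision Tree, and the predictions are combined by majority vote, which is the usual classification counterpart of averaging.

```python
import random
from collections import Counter

def bootstrap_sample(dataset, ratio, rng):
    """Draw round(len(dataset) * ratio) rows at random, with replacement."""
    n_rows = round(len(dataset) * ratio)
    return [rng.choice(dataset) for _ in range(n_rows)]

def train_stump(sample):
    """Stand-in weak learner: always predicts the majority class of its sample."""
    majority = Counter(row[-1] for row in sample).most_common(1)[0][0]
    return lambda row: majority

def bagging_predict(learners, row):
    """Aggregate the weak learners' predictions by majority vote."""
    votes = Counter(learner(row) for learner in learners)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
dataset = [[0.1, 0], [0.2, 0], [0.3, 0], [0.9, 1]]   # last column is the class label
samples = [bootstrap_sample(dataset, 0.5, rng) for _ in range(5)]
learners = [train_stump(sample) for sample in samples]
prediction = bagging_predict(learners, [0.15])
```

Because each learner is trained on a different random sample, the learners disagree in different places, and the vote smooths out their individual errors.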

4 Materials and Methods

The following section describes the proposed model. The problem was divided into sub-tasks and distributed to decentralised problem solvers using a MapReduce-derived approach. A prototype was implemented demonstrating the parallelism and flexibility of a holonic MAS interpretation specifically applied to supervised ML. This research attempts to test the hypothesis that holonicity in recursive modelling can be realised by a suitable self-similar multiagent learning system. Decision Trees and Random Forests were the approaches taken to explore how the model solves problems. The prototype validates the model using classification datasets.

The proposed model exploits the novel benefits arising from the intersection of holonic self-similarity in multiagent systems by leveraging the hierarchical recursive structure in Decision Tree machine learning. The algorithm works by splitting a machine learning problem into modular algorithmic components, serialising each algorithm as a string and then solving the classification problem by recursive MapReduce. The algorithm was also run in a virtual cluster environment using a container orchestration tool. The model was implemented in the Python programming language with Decision Trees and Random Forests adapted from [4].

One of the complexities of distributed computation is transferring code as data, specifically serialising the recursive inner functions found in the algorithms of Decision Trees and Random Forests. This problem was addressed in the model by using the Y-Combinator methodology, which is discussed in the following section.

4.1 Y-Combinator

In order to create a distributable and deployable model using multi-agency that can change its learning algorithm dynamically, the model components need to be serialisable. A challenge is found in using a language like Python to serialise recursive inner functions. If a recursive function is defined containing another recursive function as an inner function, then in order for the function to be serialisable it must be rewritten in combinator form and then have a Y-Combinator function applied to it [12]. Serialisation is the process of converting data from an in-memory construct into a linear sequence, for example a string. This conversion also needs to be reversible by the process of deserialisation. The aim of this paper is to explore whether complex ML algorithms, in the context of holonic multiagent systems, can be encoded as a string to be later unpacked and evaluated either on the cloud, remotely, or by some other distributed architecture.

Dill is a common Python library with methods for serialising and deserialising most Python functions. The library fails, however, to serialise recursive inner functions, as demonstrated in [12], because such a function produces circular references in its closure. A way to resolve the problem is by manually removing the recursion from the function using concepts of functional programming [12]. A function’s “free variables” are those not local to the function. Combinators are a class of functions without free variables. Functions in Python are first-class objects, which means that they can be passed into another function as a parameter. There is a function called the Fixed-Point function, defined as:

    def Y(f):
        return f(Y(f))

where $Y$ is a special combinator function applied to a function $f$. Given a function $h$ applied to $f$, which returns a new function $g$:

$$h(f) = g \tag{1}$$

From the fixed-point property, there exists a fixpoint $f'$ of $h$ where:

$$f' = Y(h) \;\Rightarrow\; h(f') = f' \tag{2}$$

Suppose a recursive function $r$ is redefined in its combinator form using $h$ as follows:

    def h(f):
        def r(...):
            ... f(...) ...
        return r

Then the function $r'$, which is the non-recursive counterpart of $r$, can be obtained from a transformation with the Fixed-Point function:

$$r' = Y(h) = r \tag{3}$$

Now, although the recursion has been removed from $r$ and $h$ is a combinator, $Y(h)$ is still recursive. It therefore becomes necessary to introduce the Y-Combinator function, which further removes recursion from the Fixed-Point combinator $Y$ by means of lambda expressions. The Y-Combinator function is defined as follows:

    def y_combinator(f):
        return (lambda x: x(x))(
            lambda y: f(lambda *args, **kwargs: y(y)(*args, **kwargs))
        )

The function defined above has the following properties [12]:

– it is a counterpart of $Y$
– it has no free variables and is a combinator

Finally, the recursive function $r$ can be converted into a serialisable function by applying the Y-Combinator to the $r$-combinator defined by $h$:

$$\mathrm{y\_combinator}(h) = r \tag{4}$$
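As a small worked instance of Eq. (4), the sketch below applies the Y-Combinator to a combinator-form factorial. The factorial function is chosen purely for illustration and does not appear in the paper; the `y_combinator` definition is repeated so that the block is self-contained.

```python
def y_combinator(f):
    # Same definition as in the text: recursion is obtained through
    # self-application of a lambda rather than through a named self-reference.
    return (lambda x: x(x))(
        lambda y: f(lambda *args, **kwargs: y(y)(*args, **kwargs))
    )

def h(f):
    # Combinator form of factorial: the recursive call goes through the
    # parameter f instead of through the function's own name.
    def r(n):
        return 1 if n <= 1 else n * f(n - 1)
    return r

factorial = y_combinator(h)
```

Here `r` never refers to `factorial` or to itself by name, so the only link back into the recursion is the argument supplied by `y_combinator`.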

In the original Decision Tree’s recursive predict function, predict references itself internally, so the name predict is classified as a free variable. In order to transform the recursive predict function into its non-recursive counterpart, the internal function reference must be replaced by another inner function as follows:

    def make_predict_combinator():
        def predict_combinator(predict):
            def inner(node, row):
                if row[node['index']] < node['value']:
                    return predict(...)
                else:
                    ...
            return inner
        return predict_combinator

The newly created version of the predict function, in combinator form, is then made serialisable and deserialisable after applying the y_combinator function to it as follows:

    predict_combinator = make_predict_combinator()
    predict_serializable = y_combinator(predict_combinator)

4.2 Decision Tree Classification

The proposed holonic MapReduce model was first evaluated on a Decision Tree classification problem. For binary classification problems, a split in the dataset separates the training data examples into two groups, one for each class. The cost function used to calculate the node split points in the dataset is the Gini index. A Gini score is a measure of the split purity, which is a function of the proportion of each class present in the resulting two groups of training data. Splitting is stopped if the size of a group reaches the minimum size allowed or the tree has grown to its maximum tree depth [4].

$$\mathrm{gini} = \sum_{\text{groups}} \Bigl(1.0 - \sum_{\text{classes}} \mathrm{proportion} \times \mathrm{proportion}\Bigr) \times (n \div N) \tag{5}$$

where $n$ is the group size and $N$ is the total number of samples.
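The Gini cost in Eq. (5) can be sketched as follows. The function name and the data layout (each row is a list whose last element is the class label) are assumptions made for illustration, not the prototype's actual interface.

```python
def gini_index(groups, classes):
    """Weighted Gini impurity of a candidate split; 0.0 means a pure split."""
    total = sum(len(group) for group in groups)          # N in Eq. (5)
    gini = 0.0
    for group in groups:
        size = len(group)                                # n in Eq. (5)
        if size == 0:
            continue                                     # an empty group adds nothing
        score = 0.0
        for label in classes:
            proportion = [row[-1] for row in group].count(label) / size
            score += proportion * proportion
        gini += (1.0 - score) * (size / total)
    return gini
```

For a two-class problem, a split that separates the classes perfectly scores 0.0, while a split that leaves both groups evenly mixed scores 0.5, so the greedy tree builder prefers the split point with the lowest score.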

The Decision Tree learner parameters for the model are defined in Table 3.

Table 3. Decision Tree learner parameters

Parameter               Value
Number of data folds    5
Maximum tree depth      6
Minimum size            2

In order to transfer the serialised model components to the ensemble orchestrator, they were encoded and stored as nested recursive structures and passed in the model parameters. The model parameters are received as a seed string which the ensemble orchestrator is responsible for unpacking and distributing across the nodes. The model parameters for holonic Decision Tree classification are presented in Table 4.

Table 4. Holonic MapReduce model parameters for Decision Tree classification

Parameter             Value
Dataset               ML dataset
Learner parameters    Data folds, tree depth, tree size, sample size, number of trees
Algorithm             name: accuracy metric
                      reduce function: serialised accuracy metric
                      algorithm:
                        name: decision tree
                        map function: serialised decision tree
                        algorithm:
                          name: predict
                          reduce function: serialised predict

4.3 Random Forest Classification

A recursive extension to MapReduce as discussed in [16] was used to implement the holonic multiagent system. Unlike the usual approach to MapReduce, large-scale recursions are implemented as iterated MapReduce jobs, which handles the problem of a recursive task not being able to restart if it fails. If recursive tasks followed a restart policy, there might be a case where no task would ever receive any input and thus never produce any output for the next task. The Random Forest algorithm was evaluated as an iterated application of encoded learner algorithms defined by bootstrap aggregation, with each step of recursion being a MapReduce job.

The Random Forest learner parameters for the model are defined in Table 5.

Table 5. Random Forest learner parameters

Parameter               Value
Number of data folds    5
Maximum tree depth      6
Minimum size            2
Sample size             0.5
Number of trees         5

Here the serialised model components were provided as a deeper nested structure, while keeping uniformity in how the model parameters are defined and passed to the ensemble orchestrator. This allows extensibility in streaming information to models deployed on the cloud. Model parameters for the holonic Random Forest classification are presented in Table 6.

Table 6. Holonic MapReduce model parameters for Random Forest classification

Parameter            Value
Dataset              ML dataset
Learner parameters   Data folds, tree depth, tree size, sample size, number of trees
Algorithm            name: accuracy metric, reduce function: serialised accuracy metric,
                     algorithm:
                       name: bootstrap, map function: serialised bootstrap,
                       algorithm:
                         name: max predict, reduce function: serialised max predict,
                         algorithm:
                           name: decision tree, map function: serialised decision tree,
                           algorithm:
                             name: predict, reduce function: serialised predict

4.4 System Components

Designing the model architecture as a nested recursive structure reduced complexity when unpacking the parameters and applying the serialised components to solve the problem. The learning problem was defined in terms of nested recursion in which each node has a level of computation but the components that go deeper down the hierarchy are less complex, so that, as the problem unpacks into its smaller components, each holon has a smaller self-contained problem to solve. The model architecture was achieved using a recursive extension to MapReduce with computations based on successive calls to a Mapper and Reducer, each component returning results to its parent holon. Figure 2 illustrates the model’s activity diagram.

Fig. 2. Holonic ensemble orchestration architecture derived from MapReduce


The model consists of major and minor components. The major components make up the aspects of the MapReduce paradigm, which is also used as an interpretation of the ensemble method. The ensemble orchestrator component is called App. This component is responsible for receiving the initial data, the ML algorithm learner parameters and a nested dictionary structure of strings containing the serialised functions that make up the required ML algorithms. The App component is responsible for unpacking the model parameters and starting the recursive evaluation of the function sequence. The next component is the Reducer, whose goal is to deserialise incoming reduce functions and apply them to the streamed data. The Mapper component deserialises map functions and applies them to the data. As is usual in a MapReduce system, the components adhere to manager-worker roles: the App component is the manager and the Reducer and Mapper components are workers. This arrangement also allows the problem-solving mechanism to be parallelised. The App, Reducer and Mapper components communicate via requests. Internally, the components also run their received functions concurrently as tasks, using the Python asynchronous model. Figure 3 shows a visualisation of the App, Mapper and Reducer cloud components.
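The interplay of the three components can be sketched in a few lines of Python. This is a toy, single-process illustration under assumptions, not the authors' system: functions are serialised here with the standard `marshal` and `base64` modules (the paper does not state its encoding), the components call each other directly instead of sending requests, and the evaluation order chosen in `app` (map, then recurse, then reduce) is a guess at one plausible semantics for the nested structure.

```python
import asyncio
import base64
import marshal
import types


def serialise(func):
    """Encode a plain function's code object as a base64 string."""
    return base64.b64encode(marshal.dumps(func.__code__)).decode("ascii")


def deserialise(text):
    """Rebuild a callable from a string produced by serialise()."""
    code = marshal.loads(base64.b64decode(text))
    return types.FunctionType(code, globals())


async def _call(fn, item):
    return fn(item)


async def mapper(fn_text, items):
    """Worker: deserialise a map function and apply it to every item."""
    fn = deserialise(fn_text)
    return await asyncio.gather(*(_call(fn, item) for item in items))


async def reducer(fn_text, values):
    """Worker: deserialise a reduce function and apply it to the values."""
    return deserialise(fn_text)(values)


async def app(spec, data):
    """Manager: unpack one level of the nested spec, then recurse."""
    if "map function" in spec:
        data = await mapper(spec["map function"], data)
    if "algorithm" in spec:
        data = await app(spec["algorithm"], data)
    if "reduce function" in spec:
        data = await reducer(spec["reduce function"], data)
    return data


def square(x):
    return x * x


def total(values):
    return sum(values)


# Hypothetical nested parameters in the style of Tables 4 and 6.
spec = {
    "name": "total",
    "reduce function": serialise(total),
    "algorithm": {"name": "square", "map function": serialise(square)},
}
```

Calling `asyncio.run(app(spec, [1, 2, 3]))` squares each element in the inner holon and sums the results in the outer one. Note that `marshal` only round-trips simple functions without closures and is tied to the Python version, so a real deployment would need a more robust encoding.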

Fig. 3. Holonic multiagent MapReduce cloud components


Figure 4 illustrates the sequence diagram of the holonic MapReduce model, showing the system's recursive MapReduce jobs.

Fig. 4. Holonic multiagent MapReduce sequence diagram

5 Results

The following plots were generated by the running system for each of the two problems. These results were compared to those obtained from running the original Python implementation in order to show that the Decision Tree and Random Forest algorithms still work after the application of the holonic multiagent architecture. The model was tested on the banknote dataset discussed in [4] to predict whether a given banknote is real based on data collected from banknote photos. It contains 1,372 rows with 5 numeric variables. It is a simple binary classification problem where the output y ∈ {0, 1}. Each example in the dataset has the following attributes [4]. All the values are continuous except the integer class value:

– variance of Wavelet Transformed image
– skewness of Wavelet Transformed image
– kurtosis of Wavelet Transformed image
– entropy of image
– class value

The fold number in the graphs represents the data subsets used in cross-validation training and testing to build the model. The bar chart in Fig. 5 shows the accuracy of the holonic MapReduce model applied to a Decision Tree learning algorithm compared to the original Decision Tree model. Both models achieve very high accuracy scores for the banknote classification dataset, with the holonic model sometimes performing slightly better.
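The per-fold accuracy reported in the plots can be illustrated with a small Python sketch. The helper names `cross_validation_split` and `accuracy_metric` and the decision to drop leftover rows are assumptions made for illustration, not details confirmed by the paper.

```python
import random


def cross_validation_split(dataset, n_folds, rng):
    """Shuffle the rows and cut them into n_folds equally sized folds.

    Rows left over after integer division are dropped (an assumption).
    """
    rows = list(dataset)
    rng.shuffle(rows)
    fold_size = len(rows) // n_folds
    return [rows[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]


def accuracy_metric(actual, predicted):
    """Percentage of predictions that match the actual class values."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0
```

With 1,372 rows and 5 folds, each fold would hold 274 rows; each fold serves once as the test set while the remaining folds train the model, which is why the folds can be evaluated independently and in parallel.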

Fig. 5. Decision Tree holonic MapReduce model accuracy

Figure 6 compares the accuracy of the holonic MapReduce Random Forest model in Fig. 6a to the original Random Forest model in Fig. 6b. The model instances were created containing 3, 5, 10, and 15 trees, respectively.

Fig. 6. (a) Holonic MapReduce Random Forest model accuracy. (b) Original Random Forest model accuracy

Figure 7 shows the mean accuracy of the holonic MapReduce Decision Tree (containing a single tree) and Random Forest (varying instances containing 3, 5, 10 and 15 trees, respectively) models compared to the original models. The holonic implementation performs slightly better than the original model except for the Random Forest instance containing 10 trees.

Fig. 7. Mean accuracy of the Decision Tree and Random Forest holonic MapReduce model

6 Discussion

The results in the previous section indicate that both the holonic MapReduce Decision Tree and Random Forest algorithms are running correctly. Applying the models to the problem of classifying data from the banknote dataset produces very high accuracy, which is expected since it is a small dataset with few attributes. Building the models using cross-validation is one of the parallelisable aspects of Decision Trees and Random Forests exploited by the model, which improves the original algorithm by making it able to scale in a distributed environment such as a multi-node compute cluster. Using the holonic MapReduce architecture as an implementation mapping onto a multiagent system makes the model flexible and easily deployable. The difference between creating the Decision Tree and the Random Forest model instance is determined only by the definition of the model parameters. The Random Forest requires a deeper nested structure of serialised algorithms than the Decision Tree, but the main model architecture remains unchanged between processing the different models.

7 Conclusion

This paper investigated the recursive nature of some machine learning algorithms that lend themselves to holonic MAS design, resulting in a flexible, reusable, and extensible MAS model for supervised learning problems. It also incorporated the Parallel Inductive learning domain in MAL, which studies the ability to scale complex learning using the inherent parallelism in certain algorithms. The model parameters were serialised and uniformly passed to the model, which simplifies complex distributed computation when streaming information to models deployed on the cloud. Large models can be computed by spreading the evaluation of the model across nodes for speed and leveraging the parallelisation found in the ensemble Random Forest algorithm. A supervised learning problem was computed as a MapReduce system by the recursive computation of serialised model components, as verified by the consistent results when applied to Decision Tree and Random Forest classification. Future work aims to incorporate software languages into the model with specific features that can be leveraged for implementing self-similar agents. The objective would be to reduce the complexity introduced in serialising and deserialising recursive algorithms so that holonicity can be applied to more complex learning approaches, such as deep learning.

References

1. Abdoos, M., Mozayani, N., Bazzan, A.: Holonic multi-agent systems for traffic signals control. Eng. Appl. Artif. Intell. 26, 1575–1587 (2013)
2. Bernon, C., Chevrier, V., Hilaire, V., Marrow, P.: Applications of self-organising multi-agent systems: an initial framework for comparison. Informatica 30, 01 (2006)
3. Bhalla, S., Subramanian, S.G., Crowley, M.: Training cooperative agents for multiagent reinforcement learning. In: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2019, Richland, SC, pp. 1826–1828. International Foundation for Autonomous Agents and Multiagent Systems (2019)
4. Brownlee, J.: How to implement bagging from scratch with Python (2019)
5. Fischer, K., Schillo, M., Siekmann, J.: Holonic multiagent systems: a foundation for the organisation of multiagent systems. In: Mařík, V., McFarlane, D., Valckenaers, P. (eds.) HoloMAS 2003. LNCS (LNAI), vol. 2744, pp. 71–80. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45185-3_7
6. Giret, A., Botti, V.: Holons and agents. J. Intell. Manuf. 15, 645–659 (2004). https://doi.org/10.1023/B:JIMS.0000037714.56201.a3
7. Goldman, C.V., Rosenschein, J.S.: Mutually supervised learning in multiagent systems. In: Weiß, G., Sen, S. (eds.) IJCAI 1995. LNCS, vol. 1042, pp. 85–96. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-60923-7_20
8. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics, Springer, New York (2001). https://doi.org/10.1007/978-0-387-21606-5
9. Hilaire, V., Koukam, A., Rodriguez, S.: An adaptative agent architecture for holonic multi-agent systems. ACM Trans. Auton. Adapt. Syst. 3(1), 2:1–2:24 (2008)
10. Lupu, A., Precup, D.: Gifting in multi-agent reinforcement learning. In: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, Richland, SC, pp. 789–797. International Foundation for Autonomous Agents and Multiagent Systems (2020)
11. Lyu, X., Amato, C.: Likelihood quantile networks for coordinating multi-agent reinforcement learning. In: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2020, Richland, SC, pp. 798–806. International Foundation for Autonomous Agents and Multiagent Systems (2020)
12. O’Regan, E.: Serialising functions in Python (2016)
13. Palmer, G., Tuyls, K., Bloembergen, D., Savani, R.: Lenient multi-agent deep reinforcement learning. CoRR, abs/1707.04402 (2017)
14. Panait, L., Luke, S.: Cooperative multi-agent learning: the state of the art. Auton. Agent. Multi-Agent Syst. 11(3), 387–434 (2005). https://doi.org/10.1007/s10458-005-2631-2
15. Panait, L., Sullivan, K., Luke, S.: Lenient learners in cooperative multiagent systems, pp. 801–803, January 2006
16. Rajaraman, A., Leskovec, J., Ullman, J.D.: Mining Massive Datasets (2014)
17. Rodriguez, S., Hilaire, V., Koukam, A.: Towards a methodological framework for holonic multi-agent systems. In: Fourth International Workshop of Engineering Societies in the Agents World, 29–31 October 2003, pp. 179–185. Imperial College London (2003)
18. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Prentice Hall Press, Upper Saddle River (2009)
19. Tuyls, K., Weiss, G.: Multiagent learning: basics, challenges, and prospects. AI Mag. 33(3), 41 (2012)

Monitoring Goal Driven Autonomy Agent’s Expectations Generated from Durative Effects

Noah Reifsnyder and Hector Munoz-Avila

Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA
{ndr217,hem4}@lehigh.edu

Abstract. One of the crucial capabilities for robust agency is self-assessment, namely, the capability of the agent to compute its own boundaries. A method of assessing these boundaries is using so-called expectations: constructs defining the boundaries of an agent’s courses of action as a function of the plan, the goals achieved by that plan, the initial state, the action model and the last action executed. In this paper we redefine four forms of expectations from the goal reasoning literature but, unlike those works, the agent reasons with durative actions. We present properties and a comparative study highlighting the trade-offs between the expectations.

Keywords: Goal reasoning · Expectations · Durative actions · Continuous domains

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 484–498, 2022. https://doi.org/10.1007/978-3-030-82193-7_32

1 Introduction

There has been an increasing interest in AI safety: creating reliable AI agents that perform within their boundaries. With the increasing sophistication of autonomous systems, unexpected situations may occur. This happens when the agent and/or the environment in which the agent is operating behaves in a way that is inconsistent with the agent’s planning knowledge. Goal-driven autonomy (GDA) agents supervise the agent’s execution of its current plan and formulate new goals when discrepancies arise between the agent’s expectations of the actions in the plan executed so far and the resulting state.

To detect discrepancies, the agent generates a set of expectations X(π, a), as a function of the next action a in the plan π to be executed. The agent can then check if this expectation is satisfied in the state s. Naively, for computing the expectation X(π, a) it would seem sufficient for the agent to simply check if the preconditions of a are satisfied in s (i.e., to define X(π, a) = “the preconditions of a”). However, an agent using expectations defined this way would not be checking the plan trajectory in any way. Similarly, we could project the state s before the action a is to be executed from the initial state s0 using π (i.e., to define X(π, a) = s). Expectations are often generated this way in the goal reasoning literature [1, 10, 13]. The issue here is that there are often variables in the state that have no bearing on π, and thus altering them will not impact the agent’s execution. Defining the expectations over the entire state would cause discrepancies in situations where they are not needed. Researchers have observed that the notion of expectations plays a key role in the resulting performance of goal reasoning agents [6, 7].

A real-world example of the use of expectations is as follows. John recently bought a new car, and it was advertised with an average of 32 miles per gallon. However, after John has driven for a bit and filled up the tank for the first time, he notices that he only got 15 miles to the gallon. This would be a cause for concern and would likely result in him taking the car back to the dealer to be checked, since it is possible that there is something wrong with the car, such as a fuel leak. This is the goal of expectations: to identify problems using given state information.

In this paper we report on our studies computing expectations when the actions are durative. This means actions have continuous effects over some segment of time. For instance, the agent may control a camera to follow a target while turning around 20° on a swivel. Performing this action requires some time to complete as a function of the rotation speed of the camera. Furthermore, while performing this action, the agent may initiate another action such as zooming out by a 1/3 zoom ratio. This action itself requires time to complete and may be initiated while the previous action is still not completed. Therefore, expectations are also a function of the time t and not of a specific action in π, since multiple actions may have been initiated. Thus, in our work, expectations are of the form X(π, t), where t is some time after π was initiated.

The following are the main contributions of this paper, centered around the notion of expectations when GDA agents perform durative actions:

– We re-define the notions of informed, regression [5] and goldilocks [18] expectations.
– We formulate properties on regression expectations providing guarantees on the success of the remaining plan when certain conditions are met.
– We provide an empirical evaluation on two very different domains and discuss trade-offs between the different forms of expectations.

The rest of the paper is organized as follows: The next section will discuss related work. Section three will go through the preliminaries, with section four defining some operators for use in our later definitions. The next three sections will define the different types of expectations we have developed for these domains. Following that we will discuss some properties of our expectations. This will be followed by our empirical evaluation and our conclusions.

2 Related Work

GDA is a goal reasoning model in which agents monitor the current plan’s execution and assess whether the observed states match the agents’ own expectations. GDA agents formulate a plan π achieving goals g from the current state s. They also formulate the expectations X(π, a) for every action a in π. As each action a in π is executed, the agent checks if its expectations X(π, a) are satisfied in the current state s. When they are not satisfied, the GDA agent follows a procedure to formulate new goals g′, a new plan π′ and new expectations X(π′, a), and the process repeats itself.


Research on GDA agents has explored different facets of the cycle, including generating explanations for the mismatch between the agent’s expectation and the state [11] and procedures to generate new goals [16]. In this discussion, we focus on work formulating the expectations. A variety of formulations for the expectations have been proposed, mostly for symbolic domains. This includes defining the expectations as [12]:

– (1) immediate expectations: checking the preconditions of the next action to execute;
– (2) state expectations: the projected state by applying the actions in the plan executed so far;
– (3) informed expectations: the cumulative effects of the actions executed so far;
– (4) goal regression expectations: starting from the goals, accumulating the conditions in the state necessary to execute the rest of the plan, building on classical work [15];
– (5) goldilocks expectations [18], combining (3) and (4).

In our work we re-define informed, regression, and goldilocks expectations when actions have durative effects.

GDA expectations have been explored for actions with numeric fluents. Intervals (x↓, x↑) are used to indicate valid values for a variable. Actions define a function f(x) indicating new values for x after the action is applied. Like their symbolic counterparts, these works assume the actions in a plan π are executed instantaneously and in sequence. Therefore, expectations are also denoted as a function X(π, a) of each action a ∈ π. In Wilson et al. [21], state expectations are defined in which the intervals are projected forward (f(x↓), f(x↑)) for each action. [17] extends these ideas to informed, regression, and goldilocks expectations. In our work, we re-define these expectations for the case when the numeric effects of the action change over time (e.g., f(x, t)) and include situations when actions in the plan π are performed concurrently. Hence, we define expectations over time, X(π, t).

Studies on expectation failures emanate from the plan execution literature [20]. For instance, [2] proposed a model to find the reason for a failure in the plan. [4] proposes a taxonomy of expectation failures as a function of the plan, planning knowledge, the resulting state and the state expectations. For example, the incorrect domain knowledge class refers to expectation failures caused by planning operators incorrectly modeling the actual operators’ behavior. [20] presents an alternative taxonomy of failures related to the execution of the SIPE HTN planner. For instance, a failure may be attributed to a condition that was held to be true at planning time but is no longer true when the plan is executed. [8] proposes execution failures when the plan doesn’t meet quality considerations. None of these works consider failures due to durative actions.

3 Preliminaries

Throughout this paper, we will use partial mappings. A partial mapping $f : A \rightharpoonup B$ indicates a function that is defined only for a subset of A. When referring to the set of variables defined in a mapping, we write $A_f$, meaning the set of variables from A defined in the mapping f. When listing the entire mapping (e.g., in examples), we will use a dictionary format to represent the partial mapping, where the keys are the variables and the values are what they map to. For example, f = {a : 1, b : 0} denotes that there are two variables in mapping f, a and b, and that their respective values, denoted f[a] and f[b], are 1 and 0.

Since we are dealing with actions that have functional effects, such as changes over a time t, we use lambda calculus to represent them. Briefly, functions are represented as tuples. The first element in the tuple is the function to be exercised; all following elements in the tuple are arguments to that function. For example, (− 3 2) represents the minus function, where 3 is the first argument and 2 is the second, i.e., 3 − 2. Therefore (− 3 2) would return 1. There can also be functions with free variables, called lambda functions. We use these to represent functions dependent on a time variable. For example, one can write f = λt.(− t 3) to represent a function with a single argument t. This function f returns its argument minus 3, thus (f 5) would return 2.

A state is a mapping $S : V \to R$ from a collection of variables V to a collection of real numbers R. Since we are dealing with real numbers, it is unrealistic to always know the exact values of variables [19]. Therefore we represent the value of a variable, e.g., at-y[r], with two mappings denoting its lower and upper bounds, e.g., (at-y[r]↓, at-y[r]↑). Actions have durative effects, meaning that for a time period t the variable is changing as a function of t. Table 2 shows an example of a state, while Table 1 shows an example operator that can be applied to this state. As exemplified in Table 1, an operator is a 4-tuple $o = (name\ parameters\ pre^a\ eff^a)$. A set of goals G is a partial mapping $G : V \rightharpoonup R$ from a collection of variables V to a collection of real numbers R. These are also represented with two mappings to denote an upper and lower bound for each variable in $V_G$.

An operator’s preconditions are a list of variable boundings. They are represented as the partial mapping $pre^a : V \rightharpoonup C$ of variables to constants (with individual variables for upper and lower bounds). For example, in the operator move north shown in Table 1, the lower bound of the variable at-y[x] (i.e., at-y[x]↓) is checked against (∗ t move-rate[x]↑). We define the set of effects from an action as a partial mapping $eff^a : V \rightharpoonup \Lambda$ from variables to lambda functions. Looking at Table 1, we can see that the effects are a list of 3-tuples. For the purpose of our calculations, we care solely about the functional changes to the state variables. Thus for all variables v ∈ e where e is the effect list, $eff^a[v] = e[v][2][0]$. For example, from Table 1, one of the effects is “at-y[x]↓ → (+ at-y[x]↓ ((f1 x) t))”. Thus $eff^a$[at-y[x]↓] would be the lambda function returned by “(f1 x)” (as defined in the operator).

Applying operators changes the values of variables as a function of time t. For example, consider applying the operator move north (Table 1) to the initial state defined in Table 2 with arguments (r, 2), indicating rover r is to move north for 2 time steps. The operator changes the rover’s fuel level fuel[r] and its y coordinate, at-y[r]. The upper bound of its fuel level, fuel[r]↑, changes from 10 to (+ 10 (∗ −t .9)); when t = 2, the execution of the operator is completed and the value of fuel[r]↑ is set to 8.2 (i.e., (+ 10 (∗ −2 .9))).
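The notation above maps naturally onto Python dictionaries and closures. The snippet below is only an illustrative sketch: the variable names (`fuel_rate`, `f4`, `fuel_upper`) and the tuple layout for lower and upper bounds are assumptions chosen to mirror Tables 1 and 2.

```python
# A partial mapping as a dictionary: only a and b are defined.
f = {"a": 1, "b": 0}

# The lambda function f = λt.(− t 3).
minus_three = lambda t: t - 3

# Bounds from Table 2, stored as (lower, upper) pairs (assumed layout).
fuel_rate = {"r": (0.9, 1.1)}
fuel = {"r": (10.0, 10.0)}

# f4 = λx.(λt.(∗ −t fuel-rate[x]↓)) from Table 1: the change applied
# to the upper bound of the fuel level after t time steps.
f4 = lambda x: (lambda t: -t * fuel_rate[x][0])


def fuel_upper(x, t):
    """Upper bound of fuel[x] after moving north for t time steps."""
    return fuel[x][1] + f4(x)(t)
```

Here `minus_three(5)` evaluates to 2, and `fuel_upper("r", 2)` evaluates to approximately 8.2, matching the worked value in the text.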


Table 1. Example of operator with a numeric fluent (Fuel).

(:operator move north
 :parameters x t
 :condition
    at-y[x]↓ → (∗ t move-rate[x]↑)
    fuel[x]↓ → (∗ t fuel-rate[x]↑)
 :effect
    [at-y[x]↓ → (+ at-y[x]↓ ((f1 x) t)),
     at-y[x]↑ → (+ at-y[x]↑ ((f2 x) t)),
     fuel[x]↓ → (+ fuel[x]↓ ((f3 x) t)),
     fuel[x]↑ → (+ fuel[x]↑ ((f4 x) t))]
 f1 = λx.(λt.(∗ t move-rate[x]↑)),
 f2 = λx.(λt.(∗ t move-rate[x]↓)),
 f3 = λx.(λt.(∗ -t fuel-rate[x]↑)),
 f4 = λx.(λt.(∗ -t fuel-rate[x]↓))

A plan π is defined as a set of time-stamped actions of the form (time, action). For example, in Table 2 the first action in π is (0, move north(r, 2)), indicating that at time 0, the action move north with the parameters r and 2 is executed. The next action is (0, move east(r, 2)). Since its starting time is also 0, the effect of this is the rover moving diagonally. We denote by $T_s : T \rightharpoonup \mathcal{P}(A)$ the partial mapping from a time t to the set of actions $T_s(t)$ starting at time t. Analogously, we also define the partial mapping $T_e : T \rightharpoonup \mathcal{P}(A)$ from times to sets of actions that end at those times ($\mathcal{P}(A)$ is the power set of A).

A planning problem is a triple $P = (S_0, A, G)$. A plan π solves a problem P if there is a sequence of states $S_0, S_1, \ldots, S_n$ such that: (1) $S_i$ yields $S_{i+1}$ in π; (2) G is satisfied in $S_n$. $S_i$ yields $S_{i+1}$ by adding $\sum_a f_a(1)$ for all actions a such that $a \in T_s(t') \cup T_e(t'')$ with $t' \le i \le t''$, with $f_a$ being the functional effect that changes a variable v in action a. That is, for each variable v, $S_{i+1}[v] = S_i[v] + \sum_a f_a[v](1)$.

A mapping of variables to values such as G is satisfied in a state S if for all $v_{\downarrow} \in V_G$, $G[v_{\downarrow}] \ge S[v_{\downarrow}]$ and for all $v_{\uparrow} \in V_G$, $G[v_{\uparrow}] \ge S[v_{\uparrow}]$. The same definition applies to an expectation X, which is also a mapping from variables to values, to be satisfied in a state S. Similarly, let $\pi_t$ be the portion of the plan π that remains to be executed at time $t \le n + 1$. That is, $\pi_t$ includes all actions in π not in $T_e(0) \cup T_e(1) \cup \ldots \cup T_e(t)$ (i.e., actions that are still under execution or whose execution has not started yet). $\pi_t$ is valid if there is a sequence of states $S_t, S_{t+1}, \ldots, S_n$ such that: (1) $S_i$ yields $S_{i+1}$ in π; (2) G is satisfied in $S_n$. In our work, we assume the state persists unmodified after the completion of the plan π, meaning $S_n = S_{n+1}$, and therefore the empty plan $\pi_{n+1} = ()$ solves $(S_{n+1}, A, G)$ whenever π solves $(S_0, A, G)$.


Table 2. Planning problem and a solution plan

(:Initial State
    {fuel : {r↓ : 10, r↑ : 10}}
    {Beacon fuel : {r↓ : 1, r↑ : 1}}
    {at-y : {r↓ : 2, r↑ : 2, Beacon1↓ : 0, Beacon1↑ : 0}}
    {at-x : {r↓ : 0, r↑ : 0, Beacon1↓ : 2, Beacon1↑ : 2}}
    {lit : {Beacon1↓ : 0, Beacon1↑ : 0}}
    {fuel-rate : {r↓ : .9, r↑ : 1.1}}
    {move-rate : {r↓ : .9, r↑ : 1.1}}
 :Actions move north, move south, move east, move west, light beacon
 :Goals {lit : {Beacon1↓ : 1, Beacon1↑ : 1}}
 :Plan π (0, move north(r, 2)), (0, move east(r, 2)), (2, light beacon(r, 1))

4 Two Basic Operations

We introduce two basic operations, $\oplus_S$ and $\ominus_P$, which are used to define precisely the different forms of expectations. Informally, $\oplus_S$ compounds lambda functions (useful for adding together effects of actions), whereas $\ominus_P$ removes a function from a compounded set (useful for removing the effect of an action as a function of t after it is completed).

We define $D = A \oplus^{t'}_{S} B$, where A are some variables (e.g., accumulated changes while compounding functions), S is the current state, t′ is the current time, and B are the effects of some action (e.g., the next action in the plan). More generally, for any partial functions A and B, any time t′, and any state S with $A : V \rightharpoonup \Lambda$, $S : V \to \Lambda$, $t' \in \mathbb{Z}^{+}$, and $B : V \rightharpoonup \Lambda$, $A \oplus^{t'}_{S} B$ is a partial mapping $D_{t'} : V \rightharpoonup \Lambda$ defined as follows:

1. if $v \in V_A \cap V_B$ where A[v] = M and B[v] = N, then $D_{t'}[v] = \lambda t.(+\ (M\ t)\ (N\ (-\ t\ t')))$.
2. if $v \in V_A - V_B$ then $D_{t'}[v] = A[v]$.
3. if $v \in V_B - V_A$ where S[v] = M and B[v] = N, then $D_{t'}[v] = \lambda t.(+\ (M\ t)\ (N\ (-\ t\ t')))$.
4. for all other variables $D_{t'}$ is undefined (i.e., $V_{D_{t'}} = V_A \cup V_B$).

Informally, $A \oplus^{t'}_{S} B$ results in a function addition (+ (A[v] t) (B[v] (− t t′))) when the variable v is defined in A and B (Case 1), or (+ (S[v] t) (B[v] (− t t′))) when v is defined in B but not A (Case 3). We are using an updated time variable (− t t′) for the functions from B since these represent the effects of the next actions to be added into the expectation set. Since we are in the middle of the plan, we need to shift the value of t to represent that the action isn’t starting at time t = 0. If the variable v is defined in A but not B, it’s assigned A[v] (Case 2). When it’s undefined in A and B, then it’s left undefined (Case 4). For example, if A, S, and B are defined as:

• A = {a : λt.(+ (∗ 2 t) 3), c : λt.(t)}
• S = {a : λt.(+ (∗ 2 t) 3), b : λt.(− (∗ 3 t) 4), c : λt.(t)}
• B = {a : λt.(∗ 2 t), b : λt.(t)}
• t′ = 3 (current time is 3)
• Then $D = A \oplus^{3}_{S} B$ = {a : λt.(− (∗ 4 t) 3), b : λt.(− (∗ 4 t) 7), c : λt.(t)}

In the resulting partial function, D[a] = λt.(− (∗ 4 t) 3) is obtained by adding the functions (A[a] t) and (B[a] (− t 3)) (i.e., Case 1). This procedure is shown in Fig. 1. D[b] and D[c] are similarly obtained from the rules defined in the ⊕ operator.

Fig. 1. Example calculation of D[a] from the ⊕ operator example

We define $D = A \ominus^{t'}_{P} B$, where A are some variables (e.g., accumulated changes while compounding functions), P are the preconditions from some action (e.g., the current action we are regressing from) and B are the effects of the action. More generally, let $A : V \rightharpoonup \Lambda$, $P : V \rightharpoonup C$, $t' \in \mathbb{Z}^{+}$, and $B : V \rightharpoonup \Lambda$; we define $A \ominus^{t'}_{P} B$ as a partial mapping $D : V \rightharpoonup \Lambda$ with:

1. if $v \in V_A - V_B$ then D[v] = A[v].
2. if $v \in V_A \cap V_B$ where A[v] = M and B[v] = N, then $D[v] = \lambda t.(-\ (M\ t)\ (N\ (-\ t\ t')))$.
3. if $v \in V_P$ then $D[v] = \lambda t.(P[v])$.
4. for all other variables D is undefined (i.e., $V_D = V_A \cup V_P$).

Informally, $A \ominus^{t'}_{P} B$ results in a new partial mapping that is defined for all variables from A and P. The new mapping takes the value A[v] if v is defined in A but not in B (Case 1). If a variable v is defined in A and B, the new mapping takes the value obtained by the subtraction (− (A[v] t) (B[v] (− t t′))) (Case 2). If a variable v is defined in P, the new mapping takes the value P[v] (Case 3). If a variable is not defined in either A or P, it is left undefined (Case 4). For example, if we have the three partial functions A, P, and B as follows:

– A = {a : λt.(+ t 3), b : λt.(− (∗ 2 t) 4)}
– P = {c : −2}
– B = {b : λt.(∗ 2 t)}
– t′ = 3 (current time is 3)
– Then $D = A \ominus^{3}_{P} B$ = {a : λt.(+ t 3), b : λt.(2), c : λt.(−2)}.
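Both operators can be sketched in Python with dictionaries of callables. The function names `oplus` and `ominus` are hypothetical, and letting a precondition in P override a value computed from A and B is an assumption about how the cases are meant to combine when a variable appears in more than one of them.

```python
def _shifted_sum(M, N, t0):
    return lambda t: M(t) + N(t - t0)


def _shifted_diff(M, N, t0):
    return lambda t: M(t) - N(t - t0)


def oplus(A, S, B, t0):
    """A (+) B at time t0, using state S for variables missing from A."""
    D = {}
    for v in set(A) | set(B):
        if v in A and v in B:        # Case 1
            D[v] = _shifted_sum(A[v], B[v], t0)
        elif v in A:                 # Case 2
            D[v] = A[v]
        else:                        # Case 3
            D[v] = _shifted_sum(S[v], B[v], t0)
    return D                         # Case 4: everything else undefined


def ominus(A, P, B, t0):
    """A (-) B at time t0, then record the preconditions P as constants."""
    D = {}
    for v in A:
        if v in B:                   # Case 2
            D[v] = _shifted_diff(A[v], B[v], t0)
        else:                        # Case 1
            D[v] = A[v]
    for v, constant in P.items():    # Case 3 (assumed to take precedence)
        D[v] = (lambda c: (lambda t: c))(constant)
    return D


# The two worked examples from the text.
A1 = {"a": lambda t: 2 * t + 3, "c": lambda t: t}
S1 = {"a": lambda t: 2 * t + 3, "b": lambda t: 3 * t - 4, "c": lambda t: t}
B1 = {"a": lambda t: 2 * t, "b": lambda t: t}
D1 = oplus(A1, S1, B1, 3)

A2 = {"a": lambda t: t + 3, "b": lambda t: 2 * t - 4}
P2 = {"c": -2}
B2 = {"b": lambda t: 2 * t}
D2 = ominus(A2, P2, B2, 3)
```

Evaluating `D1` and `D2` at any t reproduces the closed forms given in the two examples, for instance `D1["a"](t)` equals 4t − 3 and `D2["b"](t)` is the constant 2.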

5 Informed Expectations with Durative Effects

Agents using Informed Expectations check that the compounded and accumulated effects are valid in the environment. Informally, Informed Expectations accumulate a set of functions over time, extracted from the effects of all durative actions executed so far in π. They compound the functions of all active durative actions and retain all past changes made to the state by actions that have finished executing.

Formally, informed expectations are denoted as $X_{inf}(\pi, -1, t, \emptyset) = inf_t$, for some time t. Each $inf_t$ is recursively generated as follows:

1. $inf_{-1} = \emptyset$ (we start with t = −1 for bookkeeping).
2. For all t ≥ 0, $inf_t$ is generated by the following 3 steps:
   (a) $inf_t = inf_{t-1}$
   (b) for all $a \in T_s(t)$, $inf_t = inf_t \oplus^{t}_{s_{t-1}} eff^{a}$
   (c) for all $a \in T_e(t)$, $inf_t = inf_t \ominus^{t}_{\{\}} eff^{a}$

Case 1, the base case, indicates that before the first action is executed, we have no accumulated effects. The 3 steps of Case 2, the recursive case, are as follows: we start with the expectations computed up to time t − 1 (Step 2 (a)). We add the effects of actions starting at time t (Step 2 (b)). Finally, we subtract the effects of actions terminating at time t (Step 2 (c)).

Example: If we are at time t = 2 in our plan, we can calculate $inf_2$ for fuel as shown in Fig. 2. The first step is to combine $inf_1$ (the informed expectations at time 1) with $eff^{move north}$ (the effects of the move north action) using the $\ominus^{2}_{\{\}}$ operator. Line 3 of the figure shows the values of these mappings (Line 2) substituted into the operator definition; the right side of line 3 shows the simplified result. Lines 4 and 5 set up the second operator application, using the result of the previous calculation and the effects of the move east action. The left side of line 6 then shows the application of the operator, with the right side being the simplified result. The entire calculation is shown in Fig. 3, which includes all steps for all variables and follows the same calculation path as the simplified figure.
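The forward recursion for the informed expectations can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the two operators are passed in as callables, and the stand-ins used here (`trace_compound`, `trace_retract`) only record symbolically which operator was applied to which action at which time. The state and set subscripts of the operators are omitted, and the plan, the action names `a1` to `a3`, and their start and end times are hypothetical.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, int, int]   # (name, start time, end time); hypothetical encoding
Trace = Tuple[str, ...]         # symbolic record of operator applications


def informed_expectations(
    plan: List[Action],
    horizon: int,
    compound: Callable[[Trace, str, int], Trace],
    retract: Callable[[Trace, str, int], Trace],
) -> List[Trace]:
    """Return [inf_0, ..., inf_horizon] following Steps 2(a)-(c)."""
    history: List[Trace] = []
    inf: Trace = ()                      # Case 1: inf_{-1} is empty
    for t in range(horizon + 1):
        cur = inf                        # Step 2(a): start from inf_{t-1}
        for name, start, _end in plan:   # Step 2(b): actions starting at t
            if start == t:
                cur = compound(cur, name, t)
        for name, _start, end in plan:   # Step 2(c): actions ending at t
            if end == t:
                cur = retract(cur, name, t)
        history.append(cur)
        inf = cur
    return history


# Stand-in operators (assumption): they log the application instead of
# transforming functions of time as the paper's operators do.
def trace_compound(exp: Trace, name: str, t: int) -> Trace:
    return exp + (f"+{name}@{t}",)


def trace_retract(exp: Trace, name: str, t: int) -> Trace:
    return exp + (f"-{name}@{t}",)


# Hypothetical plan: a1 runs over [0, 2], a2 over [1, 2], a3 over [2, 3].
plan = [("a1", 0, 2), ("a2", 1, 2), ("a3", 2, 3)]
history = informed_expectations(plan, 3, trace_compound, trace_retract)
```

Keeping the operators as parameters separates the bookkeeping of the recursion from the operator semantics, so the same loop can be reused once concrete definitions of the two operators are plugged in.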

6 Regression Expectations

Informally, Regression Expectations accumulate a set of functions working backwards from the last action in π. They compound the inverse functions of all active durative actions as well as their preconditions, ensuring that the goals will still be met after the completion of π. Formally, let n be the time step at which π finishes its execution and 0 the time at which π starts its execution, and let t be a natural number with 0 ≤ t ≤ n + 1. We denote the regressed expectations at time t by $X_{reg}(\pi, t, n+1, G) = reg_t$. Each $reg_t$ is generated as follows:



Fig. 2. Calculation of the informed expectations for variable fuel at time step t = 2. The last operation step is left out because $eff^{light beacon}$ doesn't alter the variable fuel and thus doesn't alter the expectation set for this variable. Full expansion of all variables in the state can be seen in Fig. 3.

1. $reg_{n+1} = G$ (we start with t = n + 1 for bookkeeping).
2. For all t ≤ n, we compute $reg_t$ in three steps:
   (a) $reg_t = reg_{t+1}$
   (b) for all $a \in T_s(t)$, $reg_t = reg_t \ominus^{t}_{pre^{a}} eff^{a}$
   (c) for all $a \in T_e(t)$, $reg_t = reg_t \oplus^{t}_{s_{t+1}} eff^{a}$

Case 1 is the base case; t = n is the time at which the last action in π completes its execution. So t = n + 1 is the first time step after the completion of the plan's execution, and we expect the set of goals G to be satisfied. If the goals are unknown, then G = {}. The 3 steps of Case 2, the recursive case, are as follows: we start with the regressed expectations computed up to time t + 1 (Step 2 (a)). We subtract the effects of actions starting at time t (Step 2 (b)). Finally, we add the effects of actions ending at time t (Step 2 (c)).

Agents using Regression Expectations check that the rest of the plan can be executed and that, when it finishes, the set of goals G (if they exist) will be satisfied.

Example: If we are at time t = 2 of the plan trace π in Table 2, we can calculate the Regression Expectations $reg_2 = X_{regress}(\pi, s_2, G)$ as follows (the preconditions and effects for move east and light beacon have not been shown before):

– $reg_4 = G$ = {lit : {Beacon1↓ : 1, Beacon1↑ : 1}} (i.e., Case 1 with n = 3).
– $reg_3 = reg_4 \oplus^{3}_{\{\}} eff^{light beacon}$ = {lit : {Beacon1↓ : (+ 1 (− t 3)), Beacon1↑ : (+ 1 (− t 3))}} (i.e., Step 2 (c): light beacon ends at t = 3).
– Thus $reg_2 = reg_3 \ominus^{2}_{pre^{light beacon}} eff^{light beacon} \oplus^{2}_{\{\}} eff^{move north} \oplus^{2}_{\{\}} eff^{move east}$ (i.e., Step 2 (b): light beacon starts at t = 2, and Step 2 (c): move east and move north end at t = 2).
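The backward recursion can be sketched in the same symbolic style. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the operators are stand-in callables that only log their application, the operator subscripts (preconditions and states) are omitted, and the start times of move north and move east (0 and 1 here) are assumed, since only their end times appear in the example above.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, int, int]   # (name, start time, end time); hypothetical encoding
Trace = Tuple[str, ...]         # symbolic record of operator applications


def regression_expectations(
    plan: List[Action],
    n: int,
    goals: Trace,
    compound: Callable[[Trace, str, int], Trace],
    retract: Callable[[Trace, str, int], Trace],
) -> Dict[int, Trace]:
    """Return {t: reg_t} for 0 <= t <= n + 1, computed backwards from G."""
    reg: Dict[int, Trace] = {n + 1: goals}   # Case 1: reg_{n+1} = G
    for t in range(n, -1, -1):
        cur = reg[t + 1]                     # Step 2(a): start from reg_{t+1}
        for name, start, _end in plan:       # Step 2(b): actions starting at t
            if start == t:
                cur = retract(cur, name, t)
        for name, _start, end in plan:       # Step 2(c): actions ending at t
            if end == t:
                cur = compound(cur, name, t)
        reg[t] = cur
    return reg


# Stand-in operators (assumption): they log the application only.
def trace_compound(exp: Trace, name: str, t: int) -> Trace:
    return exp + (f"+{name}@{t}",)


def trace_retract(exp: Trace, name: str, t: int) -> Trace:
    return exp + (f"-{name}@{t}",)


# End times follow the example; start times of the two moves are assumed.
plan = [("move_north", 0, 2), ("move_east", 1, 2), ("light_beacon", 2, 3)]
reg = regression_expectations(plan, 3, ("G",), trace_compound, trace_retract)
```

With these inputs, `reg[3]` records one addition for light beacon at t = 3, and `reg[2]` extends it with one subtraction for light beacon followed by additions for move north and move east at t = 2, mirroring the order of operators in the worked example.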



Fig. 3. Expanded calculation of the informed expectations at time step t = 2.

7 Goldilocks Expectations

Goldilocks Expectations [18] combine Informed and Regression Expectations. Formally, we define Goldilocks Expectations as $X_{gold}(\pi, t, G) = gold_t$, where $gold_t = (inf_t, reg_t)$. That is, for every time t, $gold_t$ is the pair containing the Informed and Regression Expectations for that time. An agent using $X_{gold}(\pi, i, G)$ checks the overlap of the regressed and the informed intervals, [left(v′), right(v′)] ∩ [left(v″), right(v″)