Computational Intelligence: 11th International Joint Conference, IJCCI 2019, Vienna, Austria, September 17–19, 2019, Revised Selected Papers (ISBN 3030705935, 9783030705930)


English. Pages: 420 [414]. Year: 2021.


Table of contents:
Preface
Organization
Contents
Evolutionary Computation Theory and Applications
Niching-Based Feature Selection with Multi-tree Genetic Programming for Dynamic Flexible Job Shop Scheduling
1 Introduction
2 Related Work
2.1 Routing and Sequencing Rules
2.2 Feature Selection
3 Background
3.1 Problem Definition
3.2 Genetic Programming for Dynamic Job Shop Scheduling
3.3 Multi-tree Genetic Programming
3.4 Niching-GP Feature Selection
3.5 Two-Stage Genetic Programming with Feature Selection
4 Methods
4.1 Feature Selection for Multi-tree Genetic Programming
4.2 Two-Stage Multi-tree Genetic Programming
5 Experimental Setup
5.1 Scenario Generation Configuration
5.2 GP Configuration
5.3 Method Comparison
6 Experimental Results
7 Result Discussion
8 Conclusion
9 Future Work
References
Building Market Timing Strategies Using Trend Representative Testing and Computational Intelligence Metaheuristics
1 Introduction
2 Timing Buy and Sell Decisions
3 Related Work
4 Trend Representative Testing: Simulating Various Market Conditions While Training and Testing
5 Market Timing Algorithms
5.1 Individual Encoding and Measuring Fitness
5.2 Genetic Algorithms
5.3 Particle Swarm Optimization
6 Experimental Setup
7 Results
8 Conclusion
References
Hybrid Strategy Coupling EGO and CMA-ES for Structural Topology Optimization in Statics and Crashworthiness
1 Introduction
2 Problem Representation
2.1 Parametrization
2.2 Geometry Mapping
3 Optimization Problem and Constraints
4 Resolution Strategy
4.1 Optimization Algorithm
4.2 Constraint Handling Techniques
5 Test Case
5.1 Linear Elastic Case
5.2 Nonlinear Crash Case
6 Experimental Setup
7 Results
7.1 9-Variables Linear Elastic Case
7.2 15-Variables Test Cases
8 Conclusions
References
An Empirical Study on Insertion and Deletion Mutation in Cartesian Genetic Programming
1 Introduction
2 Related Work
2.1 Cartesian Genetic Programming
2.2 Advanced Mutation Techniques in Standard CGP
3 Insertion and Deletion Mutation in CGP
3.1 The Insertion Mutation Technique
3.2 The Deletion Mutation Technique
4 Experiments
4.1 Experimental Setup
4.2 Search Performance Evaluation
4.3 Fitness Range Analysis
4.4 Active Function Node Range Analysis
5 Comparison to EGGP
6 Discussion
7 Conclusion
8 Future Work
References
Handling Complexity in Some Typical Problems of Distributed Systems by Using Self-organizing Principles
1 Introduction
1.1 Self-organization
1.2 Complexity in Application Scenarios
1.3 Measurement of Complexity
2 Swarm Intelligence in Distributed Systems
2.1 Some Selected Distributed Systems’ Use-Cases
2.2 Algorithm Recommendation for Selected Use Cases
3 An Illustration: Bee Algorithm for Dynamic Load Balancing
3.1 Bee Algorithm
3.2 P2P Network Model
3.3 Convergence
4 Conclusion
References
Fuzzy Computation Theory and Applications
Markov Decision Processes with Fuzzy Risk-Sensitive Rewards: The Best Coherent Risk Measures Under Risk Averse Utilities
1 Introduction
2 Coherent Risk Measures Derived from Risk Averse Utilities
3 Fuzziness and Extended Criteria
4 Estimation of Fuzziness with Evaluation Weights and θ-mean Functions
5 Markov Decision with Risk Allocation by Coherent Risk Measures
6 Maximization of Risk-Sensitive Running Rewards Under Feasible Risk Constraints
7 Maximization of Risk-Sensitive Terminal Rewards Under Feasible Risk Constraints
8 Numerical Examples
9 Conclusion
References
Correlation Analysis Via Intuitionistic Fuzzy Modal and Aggregation Operators
1 Introduction
1.1 The Relevance for Contextualizing A-CC
1.2 Main Contribution
1.3 Paper Outline
2 Related Works
3 Preliminary
3.1 Intuitionistic Fuzzy Negations
3.2 Intuitionistic Fuzzy T-norms and T-conorms
3.3 Intuitionistic Fuzzy Modal Operators
3.4 Intuitionistic Fuzzy α-level Modal Operators
3.5 Action of Conjugate Operators on Aggregation Operators
4 Correlation from A-IFL
5 Results on Conjugate Modal Operators
6 A-CC Results on Modal Operators
7 A-CC Results on α-Level Modal Operators
8 A-CC Results on Triangular (Co)Norms and Modal Operators
9 Conclusion and Further Work
References
Fuzzy Geometric Approach to Collision Estimation Under Gaussian Noise in Human-Robot Interaction
1 Introduction
2 Gaussian Noise and the Intersection Problem
2.1 Computation of Intersections—Analytical Approach
2.2 Transformation of Gaussian Distributions
3 Inverse Solution
4 Fuzzy Solution
5 Extension to Six Inputs and Two Outputs
5.1 General Approach
5.2 Fuzzy Approach
5.3 The Energetic Approach
6 Mixed Gaussian Distributions
7 Robots and Humans in Motion
8 Simulation Results
9 Conclusions
References
Predicting Cardiovascular Death with Automatically Designed Fuzzy Logic Rule-Based Models
1 Introduction
2 Evolutionary Fuzzy Logic Rule-Based Predictive Modeling
3 Data Description
4 Experiments and Results
5 Conclusions
References
Neural Computation Theory and Applications
Neural Models to Quantify the Determinants of Truck Fuel Consumption
1 Introduction
2 Collection of Fuel Consumption and Input Factor Data
3 Extracting Statistics for Route and Driver Fuel Economy
4 Extracting Empirical Fuel Economy Models
5 Estimating Model Compensation Impact on Driver Performance Measurements
6 Extracting Statistics for Fuel Shrinkage
7 Extracting Empirical Fuel Shrinkage Models
8 Conclusions and Future Work
References
Towards a Class-Aware Information Granulation for Graph Embedding and Classification
1 Introduction
2 Embedding via Data Granulation
3 The GRALG Classification System
3.1 Extractor
3.2 Granulator
3.3 Embedder
3.4 Classifier
3.5 Training Phase
3.6 Synthesized Classification Model and Test Phase
4 Extractor and Granulation Improvements
4.1 Class-Aware Extractor
4.2 Class-Aware Granulator
4.3 Class-Aware Granulator with Uniform Q Scaling
4.4 Class-Aware Granulator with Frequency-Based Q Scaling
5 Test and Results
6 Conclusions
References
Near Optimal Solving of the (N²−1)-puzzle Using Heuristics Based on Artificial Neural Networks
1 Introduction
1.1 Contributions
2 Background
2.1 The (N²−1)-puzzle
2.2 Artificial Neural Networks
3 Related Work
4 Designing a New Heuristic
4.1 Encoding the Input and Output
4.2 Design of the Neural Networks
4.3 Training Data and Training
4.4 Resulting ANN-distance Heuristics
5 Experimental Evaluation
5.1 Evaluation on Single Estimations
5.2 Evaluation on A* Searches
5.3 Competitive Comparison Against Heuristics Presented in Other Studies
5.4 Analysis of the Behavior of A* Search with the Underlying ANN-distance Heuristic
6 Discussion and Conclusion
References
Deep Convolutional Neural Network Processing of Images for Obstacle Avoidance
1 Introduction
2 Deep Learning for Image Processing
2.1 Components of a Deep Learning System
2.2 Relevant Previous Works
3 Obstacle Avoidance Task
3.1 The Robot
3.2 The Environment
3.3 Relevant Previous Works
4 Deep Learning Applied to Obstacle Avoidance
4.1 Data Collection
4.2 Deep Learning Application
5 Results
5.1 Robot Performance in the Environment
5.2 Examining Network Weights and Activations
6 Conclusions
References
CVaR Q-Learning
1 Introduction
2 Preliminaries
2.1 Conditional Value-at-Risk
2.2 Q-Learning
2.3 Distributional Transition Operator
2.4 Problem Formulation
3 CVaR Value Iteration
3.1 Bellman Equation for CVaR
3.2 CVaR Value Iteration with Linear Interpolation
3.3 Accelerated Value Iteration for CVaR
3.4 Computing ξ
3.5 Experiments
4 CVaR Q-Learning
4.1 Estimating CVaR
4.2 Temporal Difference Updates
4.3 CVaR and Policy Improvement
4.4 CVaR Q-Learning with VaR-Based Policy Improvement
4.5 Experiments
5 Deep CVaR Q-Learning
5.1 Loss Functions
5.2 Experiments
6 Conclusion
A Proofs of Theoretical Results
A.1 Proof of Theorem 1
A.2 Proof of Theorem 2
B Other Results
B.1 CVaR Value Iteration—Linear Program
References
Rule Extraction from Neural Networks and Other Classifiers Applied to XSS Detection
1 Introduction
2 Background and Related Work
2.1 Overview of Minimising Boolean Expressions
2.2 Cross-Site Scripting
3 Methodology
3.1 Datasets
3.2 Selected Features
3.3 Training Classifiers
3.4 Classifiers and Boolean Functions
3.5 Sampling
3.6 Extracting Rules
4 Results
4.1 Neural Networks
4.2 Support Vector Machines
4.3 k-NN
4.4 Timings
4.5 Labelling via Sampling
5 Discussion
6 Conclusion
References
Introduction to Sequential Heteroscedastic Probabilistic Neural Networks
1 Introduction
2 A Review of RHPNN
3 Derivation of SHPNN Formulation
4 The SHPNN Algorithm
5 Results
6 Conclusion
References
Author Index

Studies in Computational Intelligence 922

Juan Julián Merelo · Jonathan Garibaldi · Alejandro Linares-Barranco · Kevin Warwick · Kurosh Madani Editors

Computational Intelligence 11th International Joint Conference, IJCCI 2019, Vienna, Austria, September 17–19, 2019, Revised Selected Papers

Studies in Computational Intelligence Volume 922

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/7092

Juan Julián Merelo · Jonathan Garibaldi · Alejandro Linares-Barranco · Kevin Warwick · Kurosh Madani Editors

Computational Intelligence 11th International Joint Conference, IJCCI 2019, Vienna, Austria, September 17–19, 2019, Revised Selected Papers

Editors

Juan Julián Merelo, Computer Architecture and Technology, University of Granada, Granada, Spain
Jonathan Garibaldi, Jubilee Campus, University of Nottingham, Nottingham, UK
Alejandro Linares-Barranco, ETSI Informática, Universidad de Sevilla, Sevilla, Spain
Kevin Warwick, University of Reading and Coventry University, Coventry, UK
Kurosh Madani, University of Paris-EST Créteil (UPEC), Créteil, France

ISSN 1860-949X  ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-70593-0  ISBN 978-3-030-70594-7 (eBook)
https://doi.org/10.1007/978-3-030-70594-7

© Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

The present book includes extended and revised versions of a set of selected papers from the 11th International Joint Conference on Computational Intelligence (IJCCI 2019), held in Vienna, Austria, from 17 to 19 September 2019. IJCCI 2019 received 86 paper submissions from 31 countries, of which 17% have been included in this book. The papers were selected by the event chairs, who based their decision on a number of criteria, including the classifications and comments provided by the program committee members, the session chairs' assessments, and the program chairs' global view of all papers included in the technical program. The authors of the selected papers were then invited to submit revised and extended versions of their papers containing at least 30% new material.

The purpose of IJCCI is to bring together researchers, engineers, and practitioners in the areas of Fuzzy Computation, Evolutionary Computation, and Neural Computation. IJCCI is composed of three co-located conferences, each specialized in at least one of these main knowledge areas.
The papers selected for this book contribute to the understanding of relevant trends in current research on Computational Intelligence, including Machine Learning, Deep Learning, Reinforcement Learning, Fuzzy Decision Analysis, Fuzzy Methods in Knowledge Discovery, Approximate Reasoning, Information Fusion, Fuzzy Control, Robotics, Sensors, Fuzzy Hardware, Quantum Computing, Higher-Level Artificial Neural Network Based Intelligent Systems, Neurocomputational Formulations, Image Processing and Artificial Vision, Pattern Recognition, Decision-Making, Neural Prostheses, Neural-Based Data Mining and Complex Information Processing, Learning Paradigms and Algorithms, Evolutionary Search, Metaheuristics and Genetic Algorithms, as well as applications to Industrial, Financial, Medical, Games and Entertainment Technologies, Evolutionary Robotics, Evolutionary Art and Design, Computational Economics and Finance, and Swarm/Collective Intelligence, among other fields.


We would like to thank all the authors for their contributions, and also the reviewers, who have helped to ensure the quality of this publication.

Juan Julián Merelo, Granada, Spain
Jonathan Garibaldi, Nottingham, UK
Alejandro Linares-Barranco, Sevilla, Spain
Kevin Warwick, Coventry, UK
Kurosh Madani, Créteil, France
September 2019

Organization

Conference Co-chairs

Kurosh Madani, University of Paris-EST Créteil (UPEC), France
Kevin Warwick (honorary), University of Reading and Coventry University, UK

Program Co-chairs

ECTA
Juan Julián Merelo, University of Granada, Spain

FCTA
Jonathan Garibaldi, University of Nottingham, UK

NCTA
Alejandro Linares-Barranco, ETSI Informática, Universidad de Sevilla, Spain


ECTA Program Committee

Haneen Algethami, Taif University, Saudi Arabia
Richard Allmendinger, University of Manchester, UK
Sabri Arik, Istanbul University-Cerrahpasa, Turkey
Helio Barbosa, Laboratorio Nacional de Computaçao Cientifica, Brazil
Matthieu Basseur, University of Angers, France
Benjamin Biesinger, AIT Austrian Institute of Technology, Austria
János Botzheim, Budapest University of Technology, Hungary
Alexander Brownlee, University of Stirling, UK
William Buckley, California Evolution Institute, USA
Fabio Caraffini, De Montfort University, UK
Paolo Cazzaniga, University of Bergamo, Italy
Francisco Chicano, University of Málaga, Spain
Antonio Cioppa, University of Salerno, Italy
J. Manuel Colmenar, Universidad Rey Juan Carlos, Spain
Vincenzo Conti, Kore University of Enna, Italy
Carola Doerr, CNRS and Sorbonne University Paris, France
Andres Faina, IT University of Copenhagen, Denmark
Paola Festa, University of Napoli, Italy
Francesco Fontanella, Università di Cassino e del Lazio Meridionale, Italy
Ewa Gajda-Zagórska, Institute of Science and Technology Austria, Austria
Adrien Goeffon, University of Angers, France
Pauline Haddow, The Norwegian University of Science and Technology, Norway
Lutz Hamel, University of Rhode Island, USA
Oussama Hamid, University of Nottingham, UK
Gareth Howells, University of Kent, UK
Giovanni Iacca, University of Trento, Italy
Thomas Jansen, Aberystwyth University, UK
Juan Luis Jimenez Laredo, University of Le Havre, France
Khairul Kasmiran, Universiti Putra Malaysia, Malaysia
Paul Kaufmann, Mainz University, Germany
Nawwaf Kharma, Concordia University, Canada
Ahmed Kheiri, Lancaster University, UK
Wasakorn Laesanklang, Mahidol University, Thailand
Nuno Leite, Instituto Superior de Engenharia de Lisboa, Portugal
Julio Lopera, University of Granada, Spain
Wenjian Luo, University of Science and Technology of China, China
Mohamed Arezki Mellal, M’Hamed Bougara University, Algeria
Rui Mendes, University of Minho, Portugal
Pablo Mesejo Santiago, Universidad de Granada, Spain
Mustafa Misir, Istinye University, Turkey
Rafael Nogueras, Universidad de Málaga, Spain
Pietro Oliveto, University of Sheffield, UK


Gary Parker, Connecticut College, USA
Mario Pavone, University of Catania, Italy
Paola Pellegrini, French Institute of Science and Technology for Transport, France
David A. Pelta, University of Granada, Spain
Daniel Porumbel, Conservatoire National des Arts et Métiers, Paris (CNAM), France
Jakob Puchinger, SystemX-Centrale Supélec, France
Günther Raidl, TU Wien, Austria
José Ribeiro, Instituto Politécnico de Leiria, Portugal
Mohammed Salem, University of Mustapha Stambouli Mascara, Algeria
Frédéric Saubion, University of Angers, France
Moshe Sipper, Ben-Gurion University, Israel
Andrzej Skowron, Institute of Mathematics UW, Poland
Jim Smith, The University of the West of England, UK
Dominik Sobania, Johannes Gutenberg University Mainz, Germany
Tatiana Tambouratzis, University of Piraeus, Greece
Sara Tari, Université du Littoral côte d’Opale, France
Gianluca Tempesti, The University of York, UK
Krzysztof Trojanowski, Uniwersytet Kardynała Stefana Wyszyńskiego, Poland
Tan Tse Guan, Universiti Malaysia Kelantan, Malaysia
Nadarajen Veerapen, University of Lille, France
Rafael Villanueva, Universidad Politécnica de Valencia, Spain
Yifei Wang, Georgia Institute of Technology, USA
Christine Zarges, Aberystwyth University, UK

FCTA Program Committee

Jesús Alcalá-Fdez, University of Granada, Spain
Vijayan Asari, University of Dayton, USA
Sansanee Auephanwiriyakul, Chiang Mai University, Thailand
Thomas Baeck, Leiden University, Netherlands
Mokhtar Beldjehem, University of Ottawa, Canada
Daniel Berrar, Tokyo Institute of Technology, Japan
Michal Bidlo, Brno University of Technology, Faculty of Information Technology, Czech Republic
Fernando Bobillo, University of Zaragoza, Spain
Ahmed Bufardi, Independent Researcher, Switzerland
Daniel Callegari, PUCRS Pontificia Universidade Catolica do Rio Grande do Sul, Brazil
Heloisa Camargo, UFSCar, Brazil
Rahul Caprihan, Dayalbagh Educational Institute, India
Pablo Carmona, University of Extremadura, Spain
Giovanna Castellano, University of Bari, Italy
Wen-Jer Chang, National Taiwan Ocean University, Taiwan, Republic of China


Mu-Song Chen, Da-Yeh University, Taiwan, Republic of China
France Cheong, RMIT University, Australia
Sung-Bae Cho, Yonsei University, Korea, Republic of
Amine Chohra, Paris-East University (UPEC), France
Catalina Cocianu, The Bucharest University of Economic Studies, Faculty of Cybernetics, Statistics and Informatics in Economy, Romania
Pedro Coelho, State University of Rio de Janeiro, Brazil
Martina Dankova, University of Ostrava, Czech Republic
Bijan Davvaz, Yazd University, Iran, Islamic Republic of
António Dourado, University of Coimbra, Portugal
Marc Ebner, Ernst-Moritz-Arndt-Universität Greifswald, Germany
El-Sayed El-Alfy, King Fahd University of Petroleum and Minerals, Saudi Arabia
Anna Esparcia-Alcázar, Universitat Politècnica de València, Spain
Stefka Fidanova, Bulgarian Academy of Sciences, Bulgaria
Nizami Gasilov, Baskent University, Turkey
Vladimir Golovko, Brest State Technical University, Belarus
Antonio Gonzalez, University of Granada, Spain
Sarah Greenfield, De Montfort University, UK
Hazlina Hamdan, Universiti Putra Malaysia, Malaysia
Oussama Hamid, University of Nottingham, UK
Thomas Hanne, University of Applied Arts and Sciences Northwestern Switzerland, Switzerland
Rainer Heinrich Palm, AASS, Department of Technology, Örebro University, SE-70182 Örebro, Sweden
Arturo Hernández-Aguirre, Centre for Research in Mathematics, Mexico
Christopher Hinde, Loughborough University, UK
Wladyslaw Homenda, Warsaw University of Technology, Poland
Katsuhiro Honda, Osaka Prefecture University, Japan
Wei-Chiang Hong, Jiangsu Normal University, China
Alexander Hošovský, Technical University of Kosice, Slovak Republic
Chih-Cheng Hung, Kennesaw State University, USA
Yuji Iwahori, Chubu University, Japan
Dmitry Kangin, University of Exeter, UK
Iwona Karcz-Duleba, Wroclaw University of Science Technology, Poland
Christel Kemke, University of Manitoba, Canada
Etienne Kerre, Ghent University, Belgium
Wali Khan, Kohat University of Science and Technology (KUST), Kohat, Pakistan
Ahmed Kheiri, Lancaster University, UK
DaeEun Kim, Yonsei University, Korea, Republic of
Ziad Kobti, University of Windsor, Canada
László Kóczy, Budapest University of Technology and Economics, Hungary
Mario Köppen, Kyushu Institute of Technology, Japan
Donald Kraft, Colorado Technical University, USA
Ondrej Krejcar, University of Hradec Kralove, Czech Republic
Pavel Krömer, VSB Ostrava, Czech Republic


Yau-Hwang Kuo, National Cheng Kung University, Taiwan, Republic of China
Anne Laurent, Lirmm, Montpellier University, France
Shih-Hsi Liu, California State University, Fresno, USA
Ahmad Lotfi, Nottingham Trent University, UK
Edwin Lughofer, Johannes Kepler University, Austria
Francisco Gallego Lupianez, University Complutense de Madrid, Spain
Francesco Marcelloni, University of Pisa, Italy
Francisco Martínez Álvarez, Pablo de Olavide University of Seville, Spain
Mitsuharu Matsumoto, The University of Electro-Communications, Japan
Corrado Mencar, University of Bari, Italy
Chilukuri Mohan, Syracuse University, USA
José Molina, Universidad Carlos III de Madrid, Spain
Javier Montero, Complutense University of Madrid, Spain
Pawel Myszkowski, Wroclaw University of Technology, Poland
Maria Nicoletti, Universidade Federal de São Carlos, Brazil
Vilém Novák, University of Ostrava, Czech Republic
Schütze Oliver, CINVESTAV-IPN, Mexico
Ender Özcan, University of Nottingham, UK
David A. Pelta, University of Granada, Spain
Parag Pendharkar, Pennsylvania State University, USA
Irina Perfilieva, University of Ostrava, Czech Republic
Radu-Emil Precup, Politehnica University of Timisoara, Romania
Daowen Qiu, Sun Yat-sen University, China
Amaryllis Raouzaiou, National Technical University of Athens, Greece
Joaquim Reis, ISCTE, Portugal
Olympia Roeva, Institute of Biophysics and Biomedical Engineering, Bulgarian Academy of Sciences, Bulgaria
Neil Rowe, Naval Postgraduate School, USA
Suman Roychoudhury, Tata Consultancy Services, India
Jose de Jesus Rubio, Instituto Politecnico Nacional, Mexico
Daniel Sánchez, University of Granada, Spain
Miguel Sanz-Bobi, Comillas Pontifical University, Spain
Alon Schclar, The Academic College of Tel-Aviv Yaffo, Israel
Christoph Schommer, University Luxembourg, Campus Belval, Maison du Nombre, Luxembourg
Patrick Siarry, University Paris 12 (LiSSi), France
Andrzej Skowron, Institute of Mathematics UW, Poland
Catherine Stringfellow, Midwestern State University, USA
Mu-Chun Su, National Central University, Taiwan, Republic of China
Tatiana Tambouratzis, University of Piraeus, Greece
C. Tao, National I-Lan University, Taiwan, Republic of China
Mohammad Teshnehlab, K. N. Toosi University, Iran, Islamic Republic of
Philippe Thomas, Université de Lorraine, France
Carlos Travieso-González, Universidad de Las Palmas de Gran Canaria, Spain
Tan Tse Guan, Universiti Malaysia Kelantan, Malaysia


Alexander Tulupyev, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), Russian Federation
Lucia Vacariu, Technical University of Cluj Napoca, Romania
José Luis Verdegay, University of Granada, Spain
Salvatore Vitabile, University of Palermo, Italy
Neal Wagner, Systems Technology and Research, USA
Zeshui Xu, Sichuan University, China
Chung-Hsing Yeh, Monash University, Australia
Cleber Zanchettin, Federal University of Pernambuco, Brazil

NCTA Program Committee

Miltos Alamaniotis, University of Texas at San Antonio, USA
Davide Bacciu, University of Pisa, Italy
William Buckley, California Evolution Institute, USA
Lin-Ching Chang, The Catholic University of America, USA
Domenico Ciuonzo, University of Naples Federico II, Italy
Jose de Jesus Rubio, Instituto Politecnico Nacional, Mexico
Artur Ferreira, ISEL—Instituto Superior de Engenharia de Lisboa, Portugal
Abbas Fotouhi, Cranfield University, UK
Wojciech Froelich, University of Silesia, Poland
Weinan Gao, Georgia Southern University, USA
Jonathan Garibaldi, University of Nottingham, UK
Arfan Ghani, Coventry University, UK
Stefan Glüge, ZHAW School of Life Sciences and Facility Management, Switzerland
Petr Hajek, Faculty of Economics and Administration, University of Pardubice, Czech Republic
Oussama Hamid, University of Nottingham, UK
Philipp Hoevel, University College Cork, Ireland
Magnus Johnsson, Malmö University, Sweden
Dmitry Kangin, University of Exeter, UK
Khairul Kasmiran, Universiti Putra Malaysia, Malaysia
Ajay Kaul, Shri Mata Vaishno Devi University, India
Shuai Li, Chinese University of Hong Kong, Hong Kong
Kurosh Madani, University of Paris-EST Créteil (UPEC), France
Juan Julián Merelo, University of Granada, Spain
Mark Oxley, Air Force Institute of Technology, USA
Gary Parker, Connecticut College, USA
Daniel Porumbel, Conservatoire National des Arts et Métiers, Paris (CNAM), France
Antonello Rizzi, Università di Roma “La Sapienza”, Italy
George Rudolph, Utah Valley University, USA
Andrzej Skowron, Institute of Mathematics UW, Poland
Norikazu Takahashi, Okayama University, Japan


Tatiana Tambouratzis, University of Piraeus, Greece
Philippe Thomas, Université de Lorraine, France
Lefteri Tsoukalas, Purdue University, USA
Jessica Turner, Georgia State University, USA
Jan Van Campenhout, Ghent University, Belgium
Arjen van Ooyen, VU University Amsterdam, Netherlands
Guanghui Wen, Southeast University, China

NCTA Additional Reviewers

Alessio Martino, University of Rome “La Sapienza”, Italy

Invited Speakers

Pietro S. Oliveto, University of Sheffield, UK
Vesna Sesum-Cavic, TU Vienna, Austria
Andreas Holzinger, Medical University Graz, Austria
Jonathan Garibaldi, University of Nottingham, UK


Contents

Evolutionary Computation Theory and Applications

Niching-Based Feature Selection with Multi-tree Genetic Programming for Dynamic Flexible Job Shop Scheduling . . . . . 3
Yahia Zakaria, Yassin Zakaria, Ahmed BahaaElDin, and Mayada Hadhoud

Building Market Timing Strategies Using Trend Representative Testing and Computational Intelligence Metaheuristics . . . . . 29
Ismail Mohamed and Fernando E. B. Otero

Hybrid Strategy Coupling EGO and CMA-ES for Structural Topology Optimization in Statics and Crashworthiness . . . . . 55
Elena Raponi, Mariusz Bujny, Markus Olhofer, Simonetta Boria, and Fabian Duddeck

An Empirical Study on Insertion and Deletion Mutation in Cartesian Genetic Programming . . . . . 85
Roman Kalkreuth

Handling Complexity in Some Typical Problems of Distributed Systems by Using Self-organizing Principles . . . . . 115
Vesna Šešum-Čavić

Fuzzy Computation Theory and Applications

Markov Decision Processes with Fuzzy Risk-Sensitive Rewards: The Best Coherent Risk Measures Under Risk Averse Utilities . . . . . 135
Yuji Yoshida

Correlation Analysis Via Intuitionistic Fuzzy Modal and Aggregation Operators . . . . . 163
Alex Bertei, Renata H. S. Reiser, and Luciana Foss

Fuzzy Geometric Approach to Collision Estimation Under Gaussian Noise in Human-Robot Interaction . . . . . 191
Rainer Palm and Achim J. Lilienthal

Predicting Cardiovascular Death with Automatically Designed Fuzzy Logic Rule-Based Models . . . . . 223
Christina Brester, Vladimir Stanovov, Ari Voutilainen, Tomi-Pekka Tuomainen, Eugene Semenkin, and Mikko Kolehmainen

Neural Computation Theory and Applications

Neural Models to Quantify the Determinants of Truck Fuel Consumption . . . . . 239
Alwyn J. Hoffman and Schalk Rabé

Towards a Class-Aware Information Granulation for Graph Embedding and Classification . . . . . 263
Luca Baldini, Alessio Martino, and Antonello Rizzi

Near Optimal Solving of the (N²−1)-puzzle Using Heuristics Based on Artificial Neural Networks . . . . . 291
Vojtech Cahlik and Pavel Surynek

Deep Convolutional Neural Network Processing of Images for Obstacle Avoidance . . . . . 313
Mohammad O. Khan and Gary B. Parker

CVaR Q-Learning . . . . . 333
Silvestr Stanko and Karel Macek

Rule Extraction from Neural Networks and Other Classifiers Applied to XSS Detection . . . . . 359
Fawaz A. Mereani and Jacob M. Howe

Introduction to Sequential Heteroscedastic Probabilistic Neural Networks . . . . . 387
Ali Mahmoudi, Reza Askari Moghadam, and Kurosh Madani

Author Index . . . . . 403

Evolutionary Computation Theory and Applications

Niching-Based Feature Selection with Multi-tree Genetic Programming for Dynamic Flexible Job Shop Scheduling Yahia Zakaria, Yassin Zakaria, Ahmed BahaaElDin, and Mayada Hadhoud

Abstract Genetic programming has been explored in recent works to evolve hyper-heuristics for dynamic flexible job shop scheduling. To generate optimal rules, the algorithm searches a space of trees composed of a set of terminals and operators. Since the search space grows exponentially with the size of the terminal set, it is preferable to leave out any insignificant terminals. Feature selection techniques have been employed to reduce the terminal set size without discarding any important information, and they have proven effective for enhancing search performance and efficiency for dynamic flexible job shop scheduling. In this paper, we extend our previous work by adding a modified version of the two-stage genetic programming algorithm and by comparing the different methods in a larger experimental setup. The results show that feature selection can generate better rules in most of the cases while also being more efficient in a production environment.

Keywords Feature selection · Flexible job shop scheduling · Dynamic scheduling · Genetic programming · Hyper-heuristics

Y. Zakaria (B) · A. BahaaElDin · M. Hadhoud Computer Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt e-mail: [email protected] A. BahaaElDin e-mail: [email protected] M. Hadhoud e-mail: [email protected] Y. Zakaria Computer and System Department, Electronic Research Institute Cairo, Cairo, Egypt e-mail: [email protected] © Springer Nature Switzerland AG 2021 J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_1


1 Introduction
Job Shop Scheduling (JSS) [2] is an important optimization problem due to its many practical applications in manufacturing and cloud computing. The goal of JSS is to assign job operations to machines and to order the operations in each machine queue for processing. Each job consists of an operation sequence that must be processed in order, and each operation can be processed by only a single machine. In JSS, job arrival times are known beforehand, so the scheduler can plan for the future. This assumption does not hold for many real-life applications where job arrival times are unknown. Dynamic Job Shop Scheduling (DJSS) is an extension of JSS in which the scheduler has no knowledge about future jobs. In addition, a large production environment usually contains many machines that can process the same operation to increase throughput, which adds the flexibility of allowing an operation to be processed by any member of a subset of the machines. Dynamic Flexible Job Shop Scheduling (DFJSS) is thus an extension of DJSS where each operation can have multiple machine options to select from. DFJSS schedulers are therefore also responsible for routing operations to a valid machine queue, so the scheduling process is divided into routing and sequencing. With the advances in real-time technology, a dynamic real-time scheduling system is needed: a practical DJSS scheduler requires scalability and real-time responses. To meet these requirements, dispatching rules have been used extensively for job shop scheduling [1]. Dispatching rules are heuristics that compute a priority value for each operation locally in the machine queue. In DFJSS, the scheduler can rely on a pair of rules: a routing rule (RR) to route operations to machine queues and a sequencing rule (SR) that functions similarly to dispatching rules. Although rules can be handcrafted by experts, changes in the production environment usually require the rule to be redesigned.
This consumes time and effort, so automatic rule generation is desired. A popular method for automatic rule generation is genetic programming (GP) [5]. A rule in GP is a mathematical expression encoded as a tree: leaves denote features, inner nodes denote operators, and edges connect operators to their operands. GP uses the genetic algorithm framework to evolve rule trees. Since GP is a search algorithm, it is harder for GP to find good solutions in a huge search space, so narrowing the search space without removing local/global optima is essential for faster convergence. One way to shrink the search space efficiently is feature selection, as adding more irrelevant features expands the search space exponentially. In a previous work, Niching-based Surrogate Feature Selection (NiSuFS) was modified to fit Multi-Tree GP for DFJSS [12]. This paper extends the work presented in that paper and has the following objectives:
– Propose applying the Two-Stage GP algorithm [16] to Multi-Tree GP.
– Present some modifications to the feature selection method in the original paper.
– Compare the different methods in a larger experimental setup.


The rest of the paper is organized as follows: Sect. 2 briefly reviews the related work. Section 3 presents the technical background behind the methods discussed in this paper. Section 4 explains the methods proposed in the original paper and the modifications in this extension. Section 5 details the experimental setup, and Sect. 6 shows the experimental results. Section 7 discusses the presented results. Finally, conclusions and future work are given in Sects. 8 and 9.

2 Related Work
2.1 Routing and Sequencing Rules
Dispatching rules have been applied extensively to the DJSS problem because their computational efficiency suits real-time decisions in a dynamic stochastic environment. A dispatching rule operates by calculating each operation's priority value in the machine queue, so that the highest-priority operation is selected for processing before its peers. The priority value is used differently by stochastic dispatching rules [14], where it is converted into a probability of selecting the operation before its peers. The results in that paper showed that regular dispatching rules outperform stochastic dispatching rules in their current form. Non-delay dispatching rules are applied to a machine queue whenever the machine is idle with a non-empty queue. On the other hand, active scheduling can delay the decision, leaving the machine idle. Non-delay rules have been shown to perform better than active ones in DJSS [9]. Early dispatching rules were designed manually by experts but failed to generalize, so effort shifted toward generating dispatching rules automatically. Using genetic programming for DJSS hyper-heuristic generation has been explored in recent research [9, 11]. In a modern production environment, multiple machines are available to process a single operation, which makes DFJSS an important extension of DJSS. The DFJSS problem cannot be solved with dispatching rules alone, as the scheduler is also responsible for routing each operation to a valid machine. A computationally efficient solution for DFJSS can be obtained by dividing the problem into two sub-problems: routing operations to machines (routing) and ordering the operations inside the machine queue (sequencing). For GP to address DFJSS, it has to evolve two paired rules: a routing rule and a sequencing rule. Cooperative Coevolution GP (CCGP) was proposed in [11] to solve DFJSS.
In CCGP, two separate populations, one for routing and another for sequencing, are evolved. During evaluation, CCGP pairs a representative rule, the rule with the best fitness, with the individuals of the other type. Another way for GP to evolve two rules is to use a multi-tree representation for the chromosome [13]. With this method, a single population can generate pairs of routing and sequencing rules. It was also proposed to use a tree-swapping crossover operator in multi-tree GP [13], which helps diversify the chromosomes without breaking useful functional blocks. Multi-tree GP with the non-dominated sorting genetic algorithm II (NSGA-II) and the strength Pareto evolutionary algorithm 2 (SPEA2) has been used for multi-objective DFJSS (MO-DFJSS) [15]. CCGP with NSGA-II was proposed in a recent work [17].

2.2 Feature Selection
Feature selection is an important and challenging task in machine learning [3]. It filters the feature set to keep only the relevant features. Building machine learning models on relevant feature sets prevents them from being misled: using only relevant features significantly shrinks the search space, so the model can converge faster to a better solution. There are three categories of feature selection [6]: wrapper techniques, filter techniques and embedded techniques. Embedded techniques perform feature selection within the construction of the model, which makes them faster than wrapper techniques and more accurate than filter techniques. Embedded feature selection is observed in GP, where relevant features survive with good individuals while irrelevant features tend to go extinct. However, using feature frequencies in good individuals as a metric for feature selection has two main disadvantages. First, common features might be the result of reaching a local optimum. Second, some occurrences of features might have no significance in the rule due to how the expressions are formed, such as W − W and W ÷ W. Feature contribution to fitness was proposed [7] as a measurement to rank features for selection. Feature contribution solves the problem of insignificant feature occurrences, but the GP population can still crowd around a local optimum, so GP must be run multiple times to avoid bias. This is computationally very expensive, so Niching-GP was proposed [8] to generate a population with diverse phenotypes in a single run using a clearing method. This technique was extended to DFJSS with CCGP in a Two-Stage GP algorithm [16]. Two-Stage GP uses the selected feature set to limit the options of the mutation operator and, to avoid wasting the results of the Niching-GP, its second stage resumes training from the last generation of the Niching-GP.
Feature selection was also combined with Multi-Tree GP to aggregate a subset of selected features and start from scratch with the aggregated set [12].

3 Background
3.1 Problem Definition
Dynamic Job Shop Scheduling (DJSS) is a version of the Job Shop Scheduling problem where the scheduler has no knowledge of the jobs that are yet to arrive. In DJSS, a scenario can be formulated as follows:
– A scenario S consists of a set of jobs J = {J1, J2, ..., Jn}.


– Each job Jj arrives at time ta(Jj) and has a due time td(Jj) and a weight W(Jj).
– A scenario S is processed on a set of machines M = {M1, M2, ..., Mm}.
– Each job Jj consists of a sequence of operations O(Jj) = (Oj1, Oj2, ..., Ojlj).
– Each operation Oji can only be processed by a single machine M(Oji) ∈ M, where the processing time is δ(Oji).
– Each machine can only process up to one operation at any given point in time.
– Once a machine starts processing an operation, it will not abort it until the processing is finished (non-preemptive). Therefore, if a machine Mk starts to process an operation Oji at time ts(Oji), it is ensured that the processing will end at time te(Oji) = ts(Oji) + δ(Oji).
– Job operations must be processed in the provided sequence, so in any job Jj, operation Oji can only enter a machine after operation Oj(i−1) has finished. In addition, operation Oj1 can only enter a machine after its job arrival time ta(Jj).

At any given time t, the scheduler has access to the following information:
– Pending jobs, which have already arrived (ta(Jj) ≤ t) and still have unprocessed operations (te(Ojlj) > t).
– The state of each machine, which denotes whether it is currently busy, how long it has been in that state, and which operation is currently being processed, if any. Each machine has a queue Q(Mi) containing all pending operations that are processable by the corresponding machine and are available for processing at time t.
It is common to use dispatch rules as a scheduling mechanism for DJSS. A dispatch rule can be represented as a function rd(t1..n(Oji)), where t1..n are terminals containing information about the situation. Whenever an idle machine has pending operations in its queue, the scheduler applies the dispatch rule to each operation in the queue and selects the highest-priority operation Oselected (the one with the minimum rule value) to enter the machine, as shown in Eq. 1.

Oselected = arg min rd(t1..n(Oji)), Oji ∈ Q(Mi)    (1)

Dynamic Flexible Job Shop Scheduling (DFJSS) is a relaxation of the Dynamic Job Shop Scheduling problem. Instead of limiting each operation to a single machine, operations in DFJSS can have multiple machine options with different processing times, and the scheduler must select a machine for each operation as soon as it becomes ready. To fit the DFJSS problem, the scenario formulation is modified as follows:
– Each operation Oji can only be processed by a machine Mk ∈ π(Oji) ⊆ M with |π(Oji)| ≥ 1, where the processing time on machine Mk is δ(Oji, Mk).
– The processing time of an operation depends on the machine to which the operation was assigned. Therefore, if a machine Mk starts to process an operation Oji at time ts(Oji), the processing will end at time te(Oji) = ts(Oji) + δ(Oji, Mk).


While the only responsibility of a DJSS scheduler is to dispatch operations for processing, a DFJSS scheduler has to make a pair of decisions: routing operations to machine queues and sequencing the operations within each queue. Therefore, it is common to use a pair of rules to schedule a DFJSS scenario: the routing rule rr(t1..n(Oji, Mk)) and the sequencing rule rs(t1..n(Oji, Mk)). Whenever an operation becomes available for processing, the routing rule is applied to select the machine queue Mselected to which the operation is appended, as in Eq. 2, and whenever an idle machine has pending operations in its queue, the sequencing rule is applied to select an operation Oselected to enter the machine, as shown in Eq. 3.

Mselected = arg min rr(t1..n(Oji, Mk)), Mk ∈ π(Oji)    (2)

Oselected = arg min rs(t1..n(Oji, Mk)), Oji ∈ Q(Mi)    (3)
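The two decisions in Eqs. 2 and 3 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the rule and feature functions (`route`, `sequence`, `feats`, `rr`) are hypothetical names, and rules are treated as plain functions from a feature dictionary to a priority value, with the arg-min option chosen.

```python
# Sketch (illustrative, not the paper's code): a routing/sequencing rule
# pair drives the two DFJSS decisions. The option with the LOWEST rule
# value wins, matching the arg min in Eqs. 2-3.

def route(operation, machines, routing_rule, features):
    """Pick the machine whose queue the ready operation joins (Eq. 2)."""
    return min(machines, key=lambda m: routing_rule(features(operation, m)))

def sequence(queue, machine, sequencing_rule, features):
    """Pick the next operation for an idle machine (Eq. 3)."""
    return min(queue, key=lambda o: sequencing_rule(features(o, machine)))

# Toy example: route by least processing time (hypothetical feature "PT").
proc_time = {("op1", "m1"): 5, ("op1", "m2"): 3}
feats = lambda op, m: {"PT": proc_time.get((op, m), 0)}
rr = lambda f: f["PT"]
chosen = route("op1", ["m1", "m2"], rr, feats)  # -> "m2" (lower PT)
```

The same `min(..., key=...)` shape serves both decisions; only the option set (machines vs. queued operations) and the rule differ.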

It is common to optimize the scheduler to minimize one or more target functions that denote how efficiently the resources were consumed to finish processing a given set of jobs. Two well-known target functions can be computed for each individual job:
1. Flow-time: C(Jj) = te(Ojlj) − ta(Jj), the time span the job spends pending.
2. Tardiness: T(Jj) = max(0, te(Ojlj) − td(Jj)), the time span the job spends pending beyond its due time.
To compute the target function F for a whole scenario, the results for individual jobs are usually aggregated using one of the following aggregation functions:
1. Mean: f(S) = E[F(Jj)].
2. Weighted mean: f(S) = E[W(Jj) × F(Jj)]/E[W(Jj)].
3. Maximum: f(S) = max_j F(Jj).
In the original paper [12], the experiments were run on all 6 possible combinations of the target function (2 per-job targets × 3 aggregations). However, in this extension paper we focus only on mean flow-time (mf), weighted mean flow-time (wmf) and maximum flow-time (xf). It is noteworthy that the tardiness in the original paper was not clamped at zero for early-finished jobs, to incentivize the scheduler to finish as early as possible; however, this does not follow the original definition of tardiness.
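The per-job targets and the three aggregations above can be written down directly. This is a hedged sketch with an assumed simple job record (`arrival`, `end`, `due`, `weight` are illustrative field names, not the paper's data structures).

```python
# Sketch of the Sect. 3.1 objectives over assumed job records.

def flow_time(job):            # C(J) = end of last operation - arrival
    return job["end"] - job["arrival"]

def tardiness(job):            # T(J) = max(0, end - due), clamped form
    return max(0.0, job["end"] - job["due"])

def aggregate(jobs, per_job, how="mean"):
    vals = [per_job(j) for j in jobs]
    if how == "mean":
        return sum(vals) / len(vals)
    if how == "wmean":          # weighted mean using job weights W(J)
        ws = [j["weight"] for j in jobs]
        return sum(w * v for w, v in zip(ws, vals)) / sum(ws)
    if how == "max":
        return max(vals)
    raise ValueError(how)

jobs = [{"arrival": 0, "end": 10, "due": 8, "weight": 2},
        {"arrival": 2, "end": 6, "due": 9, "weight": 1}]
mf = aggregate(jobs, flow_time, "mean")    # (10 + 4) / 2 = 7.0
xt = aggregate(jobs, tardiness, "max")     # max(2, 0) = 2.0
```

Note the sketch clamps tardiness at zero, i.e. the textbook definition rather than the unclamped variant used in the original paper.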

3.2 Genetic Programming for Dynamic Job Shop Scheduling
It is common to treat scheduler optimization for both DJSS and DFJSS as a hyper-heuristic generation task. Genetic programming (GP) is a very popular technique for evolving hyper-heuristics and has proven successful for both DJSS and DFJSS [4, 10, 11, 13]. GP is a genetic algorithm (GA) in which the chromosome is

usually a program tree. To apply GP to DJSS, the dispatch rule is represented as an expression tree, as shown in Fig. 1.

Fig. 1 Rule tree example (PT + W) × (−JT) [12]

Similar to most GA algorithms, GP applies selection, crossover and mutation to the chromosomes. We briefly explain the operators used in the implementation of this paper:
– Ramped Half-and-Half Tree Generation generates a random tree from scratch such that 50% of the time all the terminals are at the same depth, and the other 50% of the time the terminal depths are allowed to differ.
– Tournament Selection draws chromosomes from the population by randomly sampling K individuals and selecting the fittest to undergo crossover and/or mutation. The parameter K is the tournament size, and it controls the probability of selecting unfit individuals.
– One-Point Tree Crossover randomly picks a sub-tree from each parent and swaps them. To prevent bloat, a child is replaced by one of its parents if its depth exceeds a static limit [5].
– One-Point Tree Mutation randomly picks a sub-tree from the individual and replaces it with a randomly generated tree.
For DJSS, each chromosome contains one tree for the dispatch rule, and the evolution operators are used to populate successive generations until a stopping criterion is met. To extend regular GP to DFJSS, the algorithm can be modified to co-evolve two separate populations in parallel, one per rule type, which is known as Cooperative Co-evolution GP (CCGP) [11]. The other option is to modify the chromosome to hold a pair of trees [13].
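Tournament selection, as described above, admits a very small sketch. This is an illustration under the paper's minimization convention (lower fitness is better); the function name and the toy fitness table are assumptions, not the authors' code.

```python
# Sketch of tournament selection: sample K individuals uniformly (with
# replacement) and return the fittest; lower fitness value is better.
import random

def tournament_select(population, fitness, k, rng=random):
    contestants = [rng.choice(population) for _ in range(k)]
    return min(contestants, key=fitness)

random.seed(0)
pop = ["r1", "r2", "r3", "r4"]
fit = {"r1": 3.0, "r2": 1.0, "r3": 2.5, "r4": 9.0}.__getitem__
winner = tournament_select(pop, fit, k=3)
```

Raising `k` raises the chance that the population's best individual appears among the contestants, which is the selection-pressure knob mentioned in the text.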

3.3 Multi-tree Genetic Programming
To evolve a pair of rules for DFJSS, multi-tree chromosomes can be used to represent both rules as one individual. In contrast with CCGP [11], the pair of rules in a multi-tree chromosome is evaluated together instead of against a population representative. This allows multi-tree GP to learn rules that behave well only with certain partners, while all rules in CCGP must work in tandem with the other population's representative. To extend GP to multi-tree chromosomes, the operators must be modified to work with a pair of trees instead of a single tree.


A naive extension of the GP crossover operator for multi-tree GP is to mate each rule with the corresponding rule in the other parent. However, mating the two rules at the same time can easily break useful structures in the chromosomes. In other words, if one rule in the pair is good but paired with a weak partner, it has a low chance of surviving a crossover to be evaluated with other partners. To mitigate this problem, a tree-swapping crossover operator was proposed [13], as shown in Algorithm 1. To allow some rules to be passed down unmodified, the tree-swapping operator randomly picks one type of rule to mate while swapping the other. We use this operator in our methods.

Algorithm 1 Swapping Multi-Tree Crossover [13].
Input: Chromosomes C1, C2
1: Randomly pick r ∈ {0, 1} with probability 50%
2: C1[r], C2[r] ← TreeCrossover(C1[r], C2[r])
3: C1[1 − r], C2[1 − r] ← C2[1 − r], C1[1 − r]
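Algorithm 1 can be sketched as follows. Chromosomes are two-element lists (routing tree, sequencing tree); the trees are nested tuples purely for illustration, and `tree_crossover` is a caller-supplied stub standing in for the usual one-point subtree crossover.

```python
# Sketch of the swapping multi-tree crossover: mate one randomly chosen
# rule type with tree crossover, swap the other rule type wholesale.
import random

def swapping_crossover(c1, c2, tree_crossover, rng=random):
    r = rng.randrange(2)                          # rule type to mate
    c1, c2 = list(c1), list(c2)                   # copies; parents intact
    c1[r], c2[r] = tree_crossover(c1[r], c2[r])
    c1[1 - r], c2[1 - r] = c2[1 - r], c1[1 - r]   # swap the other rule
    return c1, c2

# Stub crossover that just exchanges whole trees, to show the data flow.
exchange = lambda a, b: (b, a)
random.seed(1)
p1 = [("RR", 1), ("SR", 1)]   # (routing tree, sequencing tree)
p2 = [("RR", 2), ("SR", 2)]
k1, k2 = swapping_crossover(p1, p2, exchange)
```

With the exchange stub, both children end up with all trees from one parent regardless of which rule type was mated, which makes the data flow easy to check; a real subtree crossover would recombine the mated pair instead.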

3.4 Niching-GP Feature Selection
Feature selection for DJSS was proposed [7] to enhance the quality of the rules by tightening the search space around rules that use only important features. While regular GP has embedded feature selection, it would need more generations to implicitly limit the presence of weak features. However, the original feature selection technique was inefficient, since it required multiple diverse yet highly fit rules (30 rules in [7]), so GP had to be run multiple times to avoid picking rules with the same phenotypic characteristics. To mitigate this issue, a Niching-based Surrogate Feature Selection technique (NiSuFS) was proposed [8] to generate the required diverse rule set in a single run. The technique is composed of 4 main steps:
1. Run genetic programming with clearing applied after every generation to prevent the population from converging to a tight region in the phenotypic space.
2. Use clearing to extract the best diverse set from the final population.
3. Calculate the fitness degradation of each rule after opting out each feature, then vote for features whose absence causes a significant degradation.
4. Filter features based on votes to create the selected feature subset.
Step 3 follows the assumption that a feature's significance is proportional to the degradation in the rule fitness after opting out the feature under test. We now discuss the main components of the NiSuFS algorithm.
Phenotype Similarity is essential for the clearing method. To prevent the population from collapsing to a single local optimum in the search space, the clearing method needs a way to measure the similarity between rules. While it may seem intuitive to compare rules by their tree structure, structure is not a good indicator of rule behavior, since the same expression can be represented by many different tree structures. For example, (PT − W) + (JT − JT) is equivalent to (PT + JT) − (W + JT),


PT − JT is equivalent to (−JT) + PT, and max(PT, −PT) is equivalent to PT since PT is always positive. In addition, we care more about the ordering of operations than about the values of their priorities, so a rule R is equivalent to R + R since the priorities will be in the same order. In short, we care about the rule's phenotype (behavior) rather than its genotype (structure). In order to measure a chromosome's phenotype, a set of discriminative situations is required [8]. The situations are randomly sampled from multiple simulations scheduled by a benchmark rule (the WATC rule), and situations with low discriminative power, i.e., whose option count is below a predefined threshold, are filtered out. The phenotype of a chromosome in a specific situation is the rank that the chromosome assigns to the operation picked by the original WATC rule. The phenotypes on different situations are composed into a phenotype vector, and the distance between chromosomes is the Euclidean distance between their phenotype vectors. The phenotype calculation steps are shown in Algorithm 2.

Algorithm 2 Calculate Phenotype [8].
Input: Rule r, Benchmark-Sorted Situation Set S
Output: Phenotype P
1: for i = 0 to length(S) − 1 do
2:   priorities ← ApplyRule(r, S[i])
3:   indices ← ArgSort(priorities)
4:   P[i] ← FindIndex(indices, 0)
5: end for
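Algorithm 2 can be sketched directly. Situations are encoded here as lists of per-option feature dictionaries, already sorted so the benchmark rule's pick is at index 0; this encoding is an assumption for illustration, not the paper's data layout.

```python
# Sketch of the phenotype computation: per situation, record the rank
# the rule under test assigns to the option the benchmark rule chose.

def phenotype(rule, situations):
    pheno = []
    for options in situations:           # options are benchmark-sorted
        priorities = [rule(o) for o in options]
        order = sorted(range(len(options)), key=priorities.__getitem__)
        pheno.append(order.index(0))     # rank of the benchmark's pick
    return pheno

def distance(p, q):                      # Euclidean phenotype distance
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

sit = [[{"PT": 2}, {"PT": 1}, {"PT": 3}]]  # benchmark picked index 0
rule_a = lambda f: f["PT"]                 # ranks the benchmark pick 2nd
ranks = phenotype(rule_a, sit)             # -> [1]
```

A rule that agrees with the benchmark everywhere has the all-zero phenotype, so the Euclidean distance between phenotype vectors directly measures behavioral disagreement.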

Clearing Method is a niching technique applied after each generation to prevent crowding in a tight region of the phenotype space. Before clearing, the chromosomes are sorted by fitness in ascending order (best fitness first). Then, each chromosome clears its weaker siblings that lie within a certain phenotype distance σ, keeping only the best k siblings in its region. The fitness of the cleared chromosomes is set to ∞. Although the cleared chromosomes stay in the population, they have a very low chance of surviving the tournament selection. The steps are shown in Algorithm 3.
The Best Diverse Set is picked from the last generation; the algorithm is exactly the same as the clearing method except that only the best R rule pairs are kept. The selected set should contain highly fit rules that also behave differently. This property is necessary to ensure fair voting for features in the next step.
Voting for Features is run using the best diverse set to collect the features that have a significantly positive effect on the evolved rules. A feature's significance in a rule is assumed to be proportional to the rule's fitness degradation after the feature has been set to a fixed value. If the fitness degradation exceeds a predefined threshold ε, the feature receives the rule's vote. Each rule has a different voting weight from 0 to 1


Algorithm 3 Clearing Method [8].
Input: Population P
1: SortByFitness(P)
2: for i = 0 to length(P) − 1 do
3:   if P[i].fitness = ∞: Continue
4:   size ← 1
5:   for j = i + 1 to length(P) − 1 do
6:     if Distance(P[i], P[j]) ≤ σ then
7:       if size = k then
8:         P[j].fitness ← ∞
9:       else
10:        size ← size + 1
11:      end if
12:    end if
13:  end for
14: end for
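A faithful sketch of Algorithm 3 follows. Individuals are dictionaries with illustrative `fitness` and `pheno` fields; the in-place clearing (setting fitness to infinity) mirrors the pseudocode above.

```python
# Sketch of the clearing niching step: keep at most k individuals per
# phenotype niche of radius sigma, mark the rest with infinite fitness.
import math

def clearing(pop, dist, sigma, k):
    pop.sort(key=lambda ind: ind["fitness"])          # best fitness first
    for i, leader in enumerate(pop):
        if leader["fitness"] == math.inf:
            continue                                  # already cleared
        size = 1
        for other in pop[i + 1:]:
            if dist(leader["pheno"], other["pheno"]) <= sigma:
                if size == k:
                    other["fitness"] = math.inf       # cleared
                else:
                    size += 1
    return pop

d = lambda p, q: abs(p - q)                           # 1-D toy distance
pop = [{"fitness": 1.0, "pheno": 0.0},
       {"fitness": 2.0, "pheno": 0.1},   # same niche as the leader
       {"fitness": 3.0, "pheno": 5.0}]   # its own niche
clearing(pop, d, sigma=1.0, k=1)
```

With `k = 1` and `sigma = 1.0`, the second individual falls inside the leader's niche and is cleared, while the behaviorally distant third individual keeps its fitness.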

based on its fitness. If the amount of votes received by the feature exceeds half the total weights, it is added to the selected feature subset. During degradation calculation, the feature t is set to 1 in the dispatch rule; then the degraded fitness of the rule f(r|t = 1) is measured. As shown in Eq. 4, the significance ζ(r, t) is the difference between the degraded fitness and the original fitness. The fixed feature is set to 1 since it is the multiplicative and divisive identity; although 1 is not the additive or subtractive identity, it should have minimal effect on the overall ranking. The feature selection procedure is shown in Algorithm 4.

ζ(r, t) = f(r|t = 1) − f(r)    (4)

Algorithm 4 Feature Selection.
Input: Rule Pairs R, Features T, Rule type n
Output: Selected Features T̃
1: W ← CalculateWeights(R)
2: wthreshold ← α Σ_{w∈W} w
3: T̃ ← {}
4: for t in T do
5:   v ← 0
6:   for r in R do
7:     if ζn(r, t) ≥ ε then
8:       v ← v + W[r]
9:     end if
10:  end for
11:  if v ≥ wthreshold then
12:    T̃ ← T̃ ∪ {t}
13:  end if
14: end for
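The voting loop of Algorithm 4 can be sketched as below. The significance function is supplied by the caller (here a toy lookup table); all names and numbers are illustrative.

```python
# Sketch of the feature-voting step: each diverse rule votes, with its
# fitness-based weight, for features whose removal degrades it by at
# least epsilon; features whose vote mass reaches the alpha-fraction of
# the total weight are selected.

def select_features(rules, features, weights, significance,
                    epsilon=1e-4, alpha=0.5):
    threshold = alpha * sum(weights[r] for r in rules)
    selected = set()
    for t in features:
        votes = sum(weights[r] for r in rules
                    if significance(r, t) >= epsilon)
        if votes >= threshold:
            selected.add(t)
    return selected

# Toy significance table: only feature "PT" matters to both rules.
sig = {("r1", "PT"): 0.3, ("r2", "PT"): 0.2,
       ("r1", "W"): 0.0, ("r2", "W"): -0.1}
chosen = select_features(["r1", "r2"], ["PT", "W"],
                         {"r1": 1.0, "r2": 0.5},
                         lambda r, t: sig[(r, t)])   # -> {"PT"}
```

Lowering `alpha` (as this paper does, from 0.5 to 0.25) makes selection more permissive: a feature no longer needs the support of half the total voting weight.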


3.5 Two-stage Genetic Programming with Feature Selection
Although NiSuFS is much more efficient than the original feature selection algorithm, it still requires running the GP algorithm twice to obtain a rule. To improve time efficiency, a two-stage GP algorithm was proposed [16] which reuses the last generation of the Niching-GP step as the initial population of the regular GP step. The algorithm follows these steps:
1. Run the Niching-GP for half the number of generations.
2. Select the best diverse set and apply feature selection.
3. Limit the features in the mutation operator to the selected features and run regular GP for the remaining half of the generations.
Although the initial population contains unselected features, they tend to disappear across the generations, since the mutation can only add selected features. In addition, we noticed that even if an important feature is not selected by chance, it can still survive to the last generation if elitism is applied. The two-stage GP paper [16] also contributed the extension of NiSuFS to DFJSS. The difference between Two-stage GP [16] and Multi-Tree FS-GP [12] lies in the base GP algorithm and the two-stage framework: Two-stage GP was built on CCGP, while Multi-Tree FS-GP was built on Multi-Tree GP. Otherwise, the two methods share many common steps.
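The second-stage restriction, mutation that may only grow subtrees over the selected terminal set, can be sketched as follows. The tree representation (nested tuples), the grow procedure and all parameter values are assumptions for illustration; a real one-point mutation would replace a random subtree rather than the whole tree.

```python
# Sketch of stage-two restricted mutation: new subtrees draw terminals
# only from the selected feature set, so unselected features fade out
# while elitism can still carry an accidentally dropped one forward.
import random

def grow_tree(depth, terminals, operators, rng):
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(terminals)                 # leaf
    op = rng.choice(operators)
    return (op, grow_tree(depth - 1, terminals, operators, rng),
                grow_tree(depth - 1, terminals, operators, rng))

def restricted_mutation(tree, selected_terminals, rng=random):
    # Whole-tree replacement for brevity; only the terminal set matters
    # for the point being illustrated.
    return grow_tree(3, selected_terminals, ["+", "-", "*"], rng)

def leaves(tree):
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[1]) | leaves(tree[2])

rng = random.Random(7)
mutant = restricted_mutation(("+", "PT", "MWT"), ["PT", "W"], rng)
```

Whatever the random draws, every leaf of the mutant comes from the selected set, which is exactly the guarantee the two-stage algorithm relies on.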

4 Methods
In this section, we describe the methods in the original paper [12] and then the modifications made in this extension paper.

4.1 Feature Selection for Multi-tree Genetic Programming
To support niching and feature selection for DFJSS, the NiSuFS technique was extended to work with rule pairs. First, the chromosome structure was changed to the multi-tree representation [13] and the crossover operator was replaced with the tree-swapping crossover operator [13]. Some components of NiSuFS were modified as follows:
Phenotype Similarity was modified to work with tree pairs. Two sets of situations (routing and sequencing situations) were randomly sampled from multiple simulations scheduled by a benchmark rule pair (Least-Work-in-Queue for routing and Maximum-Operation-Waiting-Time for sequencing). In this paper, we changed the benchmark rule pair to Least-Processing-Time for routing and


Maximum-Time-in-System for sequencing, since it was significantly more fit than the previous pair. The new benchmark rule pair is more comparable to the fit rules generated by GP, so we expect it to generate situations more similar to those faced by rules in late generations. The phenotype of a chromosome is the concatenation of the phenotypes of the two rules in the pair.
Voting for Features was modified to select two subsets of features: one for routing and the other for sequencing. A feature's significance to a rule is assumed to be proportional to the rule pair's fitness degradation after the feature has been set to a fixed value in the rule under test; the feature is not fixed in the other rule of the pair. In the original paper [12], the fitness of the rule pair f(r) is measured as the mean fitness on a set of reference scenarios randomly generated once before feature selection. Each rule is assigned a voting weight from 0 to 1 proportional to its fitness (a lower fitness value means a higher weight), as shown in Eq. 5. Any feature that collects votes greater than or equal to a certain ratio α of the total weights is selected.

w(r) = (max_{r'} f(r') − f(r)) / (max_{r'} f(r') − min_{r'} f(r'))    (5)

During degradation calculation, the feature t is set to 1 in either the routing rule or the sequencing rule. After that, the degraded fitness of the rule pair fn(r|t = 1) is measured, where n ∈ {routing, sequencing}. The significance ζn(r, t) for rule type n is shown in Eq. 6. Based on the significance, the feature selection algorithm is applied as shown in Algorithm 4. The fixed feature is set to 1, as in NiSuFS.

ζn(r, t) = fn(r|t = 1) − f(r)    (6)
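The min-max weight normalization of Eq. 5 is a one-liner per rule; this sketch uses an illustrative fitness table and adds a guard (an assumption, not in the paper) for the degenerate all-equal case.

```python
# Sketch of Eq. 5: min-max normalized voting weights; a lower fitness
# value (better rule) yields a weight closer to 1.

def voting_weights(fitness):
    lo, hi = min(fitness.values()), max(fitness.values())
    if hi == lo:                       # all rules equally fit (guard)
        return {r: 1.0 for r in fitness}
    return {r: (hi - f) / (hi - lo) for r, f in fitness.items()}

w = voting_weights({"r1": 100.0, "r2": 150.0, "r3": 125.0})
# best rule r1 -> 1.0, worst rule r2 -> 0.0, r3 -> 0.5
```

This also makes the issue motivating the per-scenario modification concrete: with mean-fitness weighting, the worst rule always receives weight 0 and can never contribute a vote.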

The feature selection algorithm was slightly modified in the implementation of this extension paper. Instead of using the mean fitness over multiple scenarios to calculate the weights, feature significances and votes, the process is applied to each scenario individually and the votes are then aggregated. The intuition behind this modification is that one rule pair would always have a weight of 0 and thus no effect on the voting process: the pair with the worst mean fitness, even if it performed better than some of its peers on some scenarios. This was observed in the experiments, so we decided to run the feature selection on each scenario individually. In the original paper, the scenario lengths in the experiments were short and the fitness function tended to be noisy, so some insignificant features could be selected by some diverse sets. To increase the quality of the selected features, the experiments were run multiple times and their results were aggregated; the aggregated feature set only contains the features that have been selected by more than a certain number of diverse sets. Although the experiments in this paper are run on much longer scenarios, we noticed that some noise, albeit weaker, is still present, so we ran experiments with and without aggregation for comparison.


4.2 Two-Stage Multi-tree Genetic Programming
Similar to Two-Stage GP [16], we apply the same technique using the multi-tree representation instead of CCGP. The feature selection is also modified as described above.

5 Experimental Setup This section details the experimental steps and the configuration for each step. In the original paper, the experiments were conducted 6 times; once for each objective: maximum, mean and weighted mean of tardiness and flow-time. In this extension, we had more methods to compare so we limited our scope to 3 objective functions: maximum, mean and weighted mean of flow-time since they are more commonly used in recent work. Before running the experiments, we generate and store 20 situations for each rule type (routing and sequencing) and each situation must have at least 5 options available. Therefore the dimension of the phenotype vector is 40. The same situations were used across all the experiments. The feature selection is run on 10 randomly generated scenarios and the test is run on 100 scenarios for performance comparison. All the scenarios used for situation generation, training, feature selection have the same configuration, however, the testing scenarios are 10 times larger. For each method, the experiments were run 5 times and the generated rules were used for comparison. Since the first stage of the Two-Stage GP is exactly the same as the Niching-GP and Feature Selection steps in the other methods, we reused the selected features across the other experiments. The feature aggregation aggregates the results of the 5 runs and the feature under test is kept if it appeared in 2 or more selected subsets. The significance threshold  is set to 0.0001 and the voting threshold ratio α is set to 0.25. Setting the voting threshold ratio to 0.5 as in [8, 16] led to selecting only one or two features per run. This was observed in the experiments of the original paper and the extension paper. As mentioned in the original paper, we hypothesize that fixing the feature in only one rule from the pair undermines its perceived significance. Any feature that has been selected less than 2 times are removed by the aggregation step. 
The regular Multi-tree GP step is run 15 times per objective: 5 with the full feature set, 5 with the features selected from a single diverse set and 5 with the aggregated feature subset.
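The aggregation rule described above (a feature survives if it appears in at least 2 of the 5 per-run subsets) can be sketched as follows; function names and the sample subsets are ours, not taken from the authors' code:

```python
from collections import Counter

def aggregate_features(selected_subsets, min_votes=2):
    """Keep a feature if it was selected in >= min_votes of the per-run subsets."""
    counts = Counter(f for subset in selected_subsets for f in set(subset))
    return sorted(f for f, c in counts.items() if c >= min_votes)

# e.g. five runs of feature selection for a routing rule (synthetic subsets):
runs = [["PT", "WIQ"], ["PT", "WIQ", "MWT"], ["PT"], ["WIQ", "PT"], ["PT", "TIS"]]
print(aggregate_features(runs))  # ['PT', 'WIQ']
```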

5.1 Scenario Generation Configuration
In the original paper, the scenario length was set to a very short value due to time and computational-budget constraints. Therefore, a short spike in the job arrival schedule was added to generate interesting and discriminative situations which would rarely arise in such short scenarios. In the experiments of this extension paper, the scenario length was increased 10 times for training and 100 times for testing, so the short spike was no longer needed and was removed. The scenario generation configuration is detailed in Table 1.

Table 1 Scenario generation configuration

Parameter                 | Value
#Machines                 | 10
#Jobs                     | Training: 500 (100 warmup), Testing: 5000 (1000 warmup)
#Operations per job       | U(1, 10)
#Machines per operation   | U(1, 10)
Operation processing time | U(1, 99) ∈ Z
Utilization               | 0.99
Job arrival               | Poisson process
Due time factor           | U(1, 1.3)
Job weight                | 1 (20%), 2 (60%), 4 (20%)
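A minimal sketch of how a scenario could be sampled under the Table 1 configuration. This is our illustration: the function names are hypothetical, and the Poisson arrival rate follows the usual DJSS convention (rate = utilization × machines / (mean ops per job × mean processing time)), which is an assumption rather than a detail stated in the paper:

```python
import random

def sample_job(rng, machines=10):
    """Sample one job roughly per Table 1."""
    ops = []
    for _ in range(rng.randint(1, 10)):            # #Operations per job ~ U(1, 10)
        k = rng.randint(1, 10)                     # #Machines per operation ~ U(1, 10)
        eligible = rng.sample(range(machines), k)
        ops.append({m: rng.randint(1, 99) for m in eligible})  # proc. time ~ U(1, 99)
    weight = rng.choices([1, 2, 4], weights=[0.2, 0.6, 0.2])[0]
    due_factor = rng.uniform(1.0, 1.3)             # Due time factor ~ U(1, 1.3)
    return ops, weight, due_factor

def sample_arrival_times(rng, n_jobs, machines=10, utilization=0.99):
    """Poisson arrival process; rate formula is our assumed DJSS convention."""
    rate = utilization * machines / (5.5 * 50.0)   # means of U(1,10) and U(1,99)
    t, times = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(rate)                 # i.i.d. exponential inter-arrivals
        times.append(t)
    return times
```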

5.2 GP Configuration
The Niche-GP configuration is based on a combination of the configurations in [8, 13]. The main changes in the new experiments are that we allowed initialization and mutation to generate trees with a depth of 0 (a single feature), and that we increased the number of evaluation replications from 1 to 4, since this decreased the noise in the fitness value across generations. The configuration is detailed in Table 2. The operators used in our experiments are: +, −, ×, ÷, negative, minimum and maximum. All the operators are binary except negative, which is unary. Each feature has a corresponding terminal, as shown in Table 3. The feature set is composed of the features used by [13] in addition to time-invariant versions of some features used by [8].

The Regular-GP configuration is based on the configuration in [13], with changes similar to the aforementioned changes in Niche-GP; it is also detailed in Table 2. The same operators and terminals as in Niche-GP are used in Regular-GP.

Table 2 GP configuration

Parameter                               | Niche-GP             | Regular-GP
#Generations                            | 51                   | 101
Population size                         | 512                  | 512
Selection method                        | Tournament of size 7 | Tournament of size 7
Crossover probability                   | 0.8                  | 0.8
Mutation probability                    | 0.15                 | 0.15
#Elites                                 | –                    | 32
Generated tree size                     | U(0, 2)              | U(0, 2)
Maximum tree size                       | 8                    | 8
#Simulation replications per evaluation | 4                    | 4
Clearing distance σ                     | 5                    | –
Clearing set size k                     | 1                    | –
Best diverse set size R                 | 32                   | –

Table 3 Feature set [12]

Feature | Description
NIQ     | Number of operations in machine queue
WIQ     | Current work in machine queue
MWT     | Machine waiting time
NMWT    | Median waiting time for next operation machines
NINQ    | Median operation count in next operation machines queues
WINQ    | Median work in next operation machines queues
PT      | Operation processing time
OWT     | Operation waiting time
NPT     | Next operation median processing time
WKR     | Sum of median processing time for remaining operations
JT      | Current delay after job due time
NOR     | Number of remaining operations in job
W       | Job weight
TIS     | Time spent by job in system

The fitness of the rule pair for each scenario, f(r, s), is normalized relative to the benchmark rule-pair fitness for the same scenario, f_b(s), before being aggregated into a mean fitness f̄(r). This allows the scenarios to be weighted according to the benchmark fitness, an approach applied in recent works that also used surrogate models [8]. Since the fitness values are usually high and lie in a narrow range (around 140–150 for mean and mean weighted flow-time for good rules), we subtract a lower-bound fitness f_l(s) from the rule-pair and benchmark fitness before normalization. The lower bound is calculated on a relaxed version of the problem in which we assume that each operation will be processed as soon as it is available and that it will be assigned to the machine with the least processing time; the lower-bound fitness is ensured to never exceed the fitness of any rule. The normalized fitness f̂(r, s) is calculated as shown in Eq. 7:

    \hat{f}(r, s) = \frac{f(r, s) - f_l(s)}{f_b(s) - f_l(s)}    (7)
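The normalization of Eq. 7 and the relaxed lower bound described above can be sketched as follows. This is illustrative code of ours, not the authors' implementation; a job is represented as a list of operations, each a dict mapping eligible machines to processing times, and the numbers in the example are made up:

```python
def normalized_fitness(f_rule, f_bench, f_lower):
    """Eq. 7: fitness of a rule pair on one scenario, normalized against the
    benchmark rule pair after subtracting the scenario's lower bound."""
    return (f_rule - f_lower) / (f_bench - f_lower)

def relaxed_lower_bound(jobs):
    """Mean flow-time bound for the relaxed problem: every operation starts as
    soon as it is available, on its fastest eligible machine."""
    flow_times = [sum(min(op.values()) for op in ops) for ops in jobs]
    return sum(flow_times) / len(flow_times)

# Illustrative numbers: a rule scoring 145 on a scenario where the benchmark
# scores 150 and the lower bound is 120 normalizes to (145-120)/(150-120).
print(normalized_fitness(145.0, 150.0, 120.0))  # 0.8333...
```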


5.3 Method Comparison
To compare two methods, the best rule-pairs are applied to 100 test scenarios and the mean fitness values are supplied to a Wilcoxon signed-rank test at the 5% significance level. For each method, we compared the rules generated at 50 generations and at 100 generations. In some experiments, we noticed that the best rule at generation 100 performs worse than the best rule at generation 50. The reason is the noise caused by the randomly-generated evaluation scenarios. Although we increased the evaluation replications from 1 to 4 to mitigate this issue, it still happens, albeit at a lower rate. Therefore, we compiled a list of the best rule in every generation (Hall of Fame) and selected the best rule found up to a certain generation; the rules in the hall of fame are evaluated on a set of 32 scenarios and the best one is selected. In the comparison, we therefore add 2 more rules for every experiment: the best rule up to generation 50 and the best rule up to generation 100. In the original paper, we normalized the rule-pair fitness during comparison as mentioned in the training setup. However, we noticed almost no difference in the results, so we skipped this step in the new experiments.
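The comparison step above can be sketched as below. This is our minimal stand-in for the Wilcoxon signed-rank test (normal approximation with average ranks for tied |differences|, adequate for n = 100 paired scenarios), not the authors' statistical tooling, and the sign convention assumes lower flow-time is better:

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided p-value, normal approximation; zero differences are dropped."""
    d = [x - y for x, y in zip(a, b) if x != y]
    n = len(d)
    if n == 0:
        return 1.0                                  # no evidence of a difference
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                    # average ranks over tie groups
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def significance_sign(fit_a, fit_b, alpha=0.05):
    """'(+)' if method A is significantly better (lower), '(-)' if worse,
    '' if the difference is insignificant, as in Tables 4-6."""
    if wilcoxon_signed_rank(fit_a, fit_b) > alpha:
        return ""
    return "(+)" if sum(fit_a) < sum(fit_b) else "(-)"
```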

6 Experimental Results
Figure 2 shows the results of the feature selection across the 5 experiments on each fitness function. The row number denotes the experiment and the column label denotes the feature under test. All features that were selected 2 or more times in a column became part of the aggregated set. The features included in the aggregated set are emphasized on the x-axis labels.

Tables 4, 5 and 6 show the results for each method using the fitness functions mean flow-time, mean weighted flow-time and maximum flow-time, respectively. The methods included are:

– No-FS: Multi-Tree GP without feature selection.
– FS (2-Stage): Two-Stage Multi-Tree GP.
– FS (Single): Multi-Tree GP using the features selected from a single diverse set.
– FS (Aggregate): Multi-Tree GP using the aggregated feature subset.

The columns labeled "NoFS-50" and "NoFS-100" show the results of the Wilcoxon test against the No-FS method at generation 50 and 100, respectively. The sign "(−)" means that the method is significantly worse than the results without feature selection; the sign "(+)" means that it is significantly better; no sign means the difference is insignificant (p-value > 5%).

[Fig. 2 consists of six bar charts showing, for each feature (NIQ, WIQ, MWT, NMWT, NINQ, WINQ, PT, OWT, NPT, WKR, JT, NOR, W, TIS), in how many of the 5 runs it was selected (0–5): (a) Mean Flow-time Sequencing, (b) Mean Flow-time Routing, (c) Mean Weighted Flow-time Sequencing, (d) Mean Weighted Flow-time Routing, (e) Maximum Flow-time Sequencing, (f) Maximum Flow-time Routing.]

Fig. 2 Selected features

Table 4 Mean flow-time (blank significance cells denote an insignificant difference)

Best individual in last generation:
Method          Gen   Mean        Minimum     NoFS-50   NoFS-100
No-FS           50    149.67142   144.47803   N/A       N/A
No-FS           100   149.31492   144.15217   N/A       N/A
FS (2-Stage)    50    149.94865   144.72837   (−)       (−)
FS (2-Stage)    100   150.0065    144.76404   (−)       (−)
FS (Single)     50    149.92529   144.73816   (−)       (−)
FS (Single)     100   149.88849   144.65091   (−)       (−)
FS (Aggregate)  50    149.68852   144.41286             (−)
FS (Aggregate)  100   149.76821   144.57335   (−)       (−)

Best individual in hall of fame:
Method          Gen   Mean        Minimum     NoFS-50   NoFS-100
No-FS           50    149.59403   144.39554   N/A       N/A
No-FS           100   149.20215   144.11784   N/A       N/A
FS (2-Stage)    50    149.34893   144.35214   (+)       (−)
FS (2-Stage)    100   149.17161   144.15904   (+)       (+)
FS (Single)     50    149.73558   144.64978   (−)       (−)
FS (Single)     100   149.51343   144.52434   (+)       (−)
FS (Aggregate)  50    149.09629   143.96711   (+)       (+)
FS (Aggregate)  100   148.96266   143.90923   (+)       (+)


Table 5 Mean weighted flow-time (blank significance cells denote an insignificant difference)

Best individual in last generation:
Method          Gen   Mean        Minimum     NoFS-50   NoFS-100
No-FS           50    150.1677    144.82132   N/A       N/A
No-FS           100   149.29163   144.1778    N/A       N/A
FS (2-Stage)    50    148.88719   143.81336   (+)       (+)
FS (2-Stage)    100   149.03398   143.99882   (+)       (+)
FS (Single)     50    149.61854   144.59483   (+)       (−)
FS (Single)     100   149.21598   144.1097    (+)       (+)
FS (Aggregate)  50    149.33474   144.38907   (+)       (−)
FS (Aggregate)  100   148.85585   143.81797   (+)       (+)

Best individual in hall of fame:
Method          Gen   Mean        Minimum     NoFS-50   NoFS-100
No-FS           50    149.6265    144.44911   N/A       N/A
No-FS           100   149.02588   144.10318   N/A       N/A
FS (2-Stage)    50    148.84932   143.87963   (+)       (+)
FS (2-Stage)    100   148.76956   143.55745   (+)       (+)
FS (Single)     50    149.34081   144.37938   (+)       (−)
FS (Single)     100   149.04562   143.95549   (+)
FS (Aggregate)  50    149.09945   144.04885   (+)       (−)
FS (Aggregate)  100   148.56357   143.41443   (+)       (+)

Table 6 Maximum flow-time (blank significance cells denote an insignificant difference)

Best individual in last generation:
Method          Gen   Mean        Minimum     NoFS-50   NoFS-100
No-FS           50    574.75217   509.89844   N/A       N/A
No-FS           100   576.6557    513.80547   N/A       N/A
FS (2-Stage)    50    568.25221   518.97656   (+)       (+)
FS (2-Stage)    100   575.28446   521.51172
FS (Single)     50    578.65309   527.15313
FS (Single)     100   578.27176   522.8875
FS (Aggregate)  50    567.33395   518.85898   (+)       (+)
FS (Aggregate)  100   563.07963   510.41016   (+)       (+)

Best individual in hall of fame:
Method          Gen   Mean        Minimum     NoFS-50   NoFS-100
No-FS           50    564.88001   514.15742   N/A       N/A
No-FS           100   562.39796   508.61875   N/A       N/A
FS (2-Stage)    50    565.22129   509.38125
FS (2-Stage)    100   564.61096   506.4375
FS (Single)     50    577.90963   523.27813   (−)       (−)
FS (Single)     100   578.34241   511.47344   (−)       (−)
FS (Aggregate)  50    563.76874   509.89297
FS (Aggregate)  100   563.49774   508.25469

Figures 3a, 3b, 4a, 4b, 5a and 5b show the average number of unique features used by the best rule in every generation for the different methods. Each plot shows the average of the 5 runs for each method. It is noteworthy that the Two-Stage plots start after the feature selection step, so their results begin after 50 generations of Niche-GP.

[Figs. 3–5 are line plots of the average number of unique terminals used by the best rule in each generation (generations 0–100, averaged over 5 runs) for the methods No-FS, FS(Single), FS(Aggr) and 2-Stage.]

Fig. 3 Unique terminals across mean flow-time experiments: (a) Mean Flow-time Routing, (b) Mean Flow-time Sequencing

Fig. 4 Unique terminals across mean weighted flow-time experiments: (a) Mean Weighted Flow-time Routing, (b) Mean Weighted Flow-time Sequencing

Fig. 5 Unique terminals across maximum flow-time experiments: (a) Maximum Flow-time Routing, (b) Maximum Flow-time Sequencing


7 Result Discussion
Compared to the results in the original paper, the selected features in Fig. 2 are more sparse. This is probably due to a decrease in the noise, since we use longer scenarios (500 instead of 50) and more scenarios for the feature significance test (10 instead of 5). From the figure, we can conclude the following:

– PT was selected in almost every routing and sequencing rule, which shows its great significance for performance. This was also noted in the original paper. However, three runs in the maximum flow-time experiments dropped the PT feature. This probably means that it is less significant than TIS, since the scheduler prefers to finish long-waiting jobs to minimize the maximum flow-time.
– WIQ is selected in every routing rule, since it is a significant indicator of the expected operation waiting time and aids in load balancing. However, NIQ is rarely selected, probably due to its redundancy with the more informative feature WIQ. This was also noted in the original paper.
– Unlike in the original paper, OWT is mostly irrelevant to both routing and sequencing.
– MWT is sometimes important for routing rules and never relevant to sequencing.
– JT is rarely selected, which seems intuitive, since no experiments were run using tardiness as the fitness.
– W is heavily selected by sequencing rules for mean weighted flow-time but is rarely selected otherwise, which is intuitively expected. The same was noted in the original paper.
– TIS is very relevant to sequencing for maximum flow-time only; otherwise, it is rarely selected. The same was noted in the original paper.
– NMWT was selected a few times for the sequencing rule in mean and mean weighted flow-time.
– The remaining features are mostly deemed irrelevant, as they are rarely selected.

It is noteworthy that the selected features after aggregation do not exceed 6 features for sequencing rules and 5 features for routing rules. These numbers were similar in the original paper. Since the number of selected features is always below half the full feature set, we can expect the search space to be much smaller and more efficient to explore.

In the original paper, feature selection led to significantly better results in half the cases and an insignificant difference in the remaining cases. The new experimental results, as shown in Tables 4, 5 and 6, show that, in a few cases, feature selection can lead to significantly worse results. In the original paper, we only compared FS-Aggregate against No-FS at generation 50. If we do the same on the new results, FS-Aggregate is still significantly better than No-FS in 2 out of 3 experiments and insignificantly worse in only 1 experiment. However, in some cases this could be an unfair comparison when the time efficiency of training is considered: FS-Aggregate requires 5 × 50 generations of Niche-GP plus 50 generations of Regular-GP, which is much longer than the 50 generations of Regular-GP in No-FS.


If we allow the Regular GP to run for an additional 50 generations, it can be on par with, and sometimes better than, FS-Aggregate. So we notice a trend: whenever FS-Aggregate and No-FS run for the same length (ignoring the Niche-GP steps), FS-Aggregate is significantly better (or, in 1 case, at least never significantly worse). We noticed that the feature selection methods converge faster; however, it is unknown whether this trend continues beyond 100 generations, and longer training runs are needed to confirm or reject this hypothesis. It is also noted that, in some experiments, longer training periods can lead to worse results due to noise in the evaluation function. Therefore, we added the results for the best rule across the hall of fame. This modification led to significantly better results compared to the best rule in the last generation. It is also noted that the Two-Stage Multi-Tree GP is on par with FS-Aggregate while being much more efficient; it is also significantly better than FS-Single, so we recommend it as the best of the three feature selection GP methods. Regarding computational efficiency during production, GP methods with feature selection generate rules that rely on fewer features. This trend can be seen in Figs. 3a, 4a, 4b, 5a and 5b; the only rules that do not follow it are the sequencing rules for mean flow-time in Fig. 3b. It is also noteworthy that the number of features used by Two-Stage GP decreases across generations and in most cases becomes lower than that of GP without feature selection. So we can conclude that feature selection can be used to generate better rules that are also more efficient in production, at the cost of an additional feature selection step.

8 Conclusion
This paper presented an extension to niching-based feature selection with Multi-Tree GP for DFJSS [12]. We added a modified version of Two-Stage GP [16] to the methods and presented a comparison between the different techniques in a larger experimental setup. In general, we concluded that in most cases feature selection methods can generate better rule-pairs at the cost of a longer training period due to the feature selection step. The generated rules are also more efficient during production, since they rely on fewer features. However, it was also shown that in a few cases feature selection can underperform compared to GP without feature selection.

9 Future Work
For future work, we plan to investigate the properties of the phenotype vector and the effect of the randomly-sampled situations on the generated niches in the search space. It is also worth investigating whether there are more discriminative phenotype measurement methods that could enhance the clearing results. We also plan to compare the Two-Stage CCGP method [16] against our methods. Currently, feature selection assumes the routing and sequencing rules are independent during the fitness significance measurement; it would be useful to test this assumption in future work. We also plan to test whether the results of the feature selection step are robust to changes in the parameters of the production environment.

References

1. Blackstone, J.H., Phillips, D.T., Hogg, G.L.: A state-of-the-art survey of dispatching rules for manufacturing job shop operations. Int. J. Prod. Res. 27–45 (1982)
2. Brucker, P., Schlie, R.: Job-shop scheduling with multi-purpose machines. Computing 45, 369–375 (1990)
3. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
4. Jakobovic, D., Marasovic, K.: Evolving priority scheduling heuristics with genetic programming. Appl. Soft Comput. 12, 2781–2789 (2012)
5. Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection, vol. 1. MIT Press (1992)
6. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005)
7. Mei, Y., Zhang, M., Nguyen, S.: Feature selection in evolving job shop dispatching rules with genetic programming. In: GECCO (2016)
8. Mei, Y., Nguyen, S., Xue, B., Zhang, M.: An efficient feature selection algorithm for evolving job shop scheduling rules with genetic programming. IEEE Trans. Emerg. Top. Comput. Intell. 1(5), 339–353 (2017). https://doi.org/10.1109/TETCI.2017.2743758
9. Nguyen, S., Zhang, M., Johnston, M., Tan, K.: A computational study of representations in genetic programming to evolve dispatching rules for the job shop scheduling problem. IEEE Trans. Evol. Comput. 17(5), 621–639 (2013)
10. Nguyen, S., Zhang, M., Johnston, M., Tan, K.: Automatic design of scheduling policies for dynamic multi-objective job shop scheduling via cooperative coevolution genetic programming. IEEE Trans. Evol. Comput. 18, 193–208 (2014)
11. Yska, D., Mei, Y., Zhang, M.: Genetic programming hyper-heuristic with cooperative coevolution for dynamic flexible job shop scheduling. In: Proceedings of the European Conference on Genetic Programming, pp. 306–321. Springer (2018). https://doi.org/10.1007/978-3-319-77553-1_19
12. Zakaria, Y., BahaaElDin, A., Hadhoud, M.: Applying feature selection to rule evolution for dynamic flexible job shop scheduling. In: Proceedings of the 11th International Joint Conference on Computational Intelligence - Volume 1: ECTA (IJCCI 2019), pp. 139–146. INSTICC, SciTePress (2019). https://doi.org/10.5220/0007957801390146
13. Zhang, F., Mei, Y., Zhang, M.: Genetic programming with multi-tree representation for dynamic flexible job shop scheduling. In: Australasian Joint Conference on Artificial Intelligence, pp. 472–484. Springer (2018). https://doi.org/10.1007/978-3-030-03991-2_43
14. Zhang, F., Mei, Y., Zhang, M.: Can stochastic dispatching rules evolved by genetic programming hyper-heuristics help in dynamic flexible job shop scheduling? In: 2019 IEEE Congress on Evolutionary Computation (CEC), pp. 41–48 (2019)
15. Zhang, F., Mei, Y., Zhang, M.: Evolving dispatching rules for multi-objective dynamic flexible job shop scheduling via genetic programming hyper-heuristics. In: 2019 IEEE Congress on Evolutionary Computation (CEC), pp. 1366–1373 (2019)


16. Zhang, F., Mei, Y., Zhang, M.: A two-stage genetic programming hyper-heuristic approach with feature selection for dynamic flexible job shop scheduling. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '19, pp. 347–355. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3321707.3321790
17. Zhou, Y., Yang, J., Zheng, L.: Hyper-heuristic coevolution of machine assignment and job sequencing rules for multi-objective dynamic flexible job shop scheduling. IEEE Access 7, 68–88 (2019)

Building Market Timing Strategies Using Trend Representative Testing and Computational Intelligence Metaheuristics

Ismail Mohamed and Fernando E. B. Otero

Abstract Market timing, one of the core challenges in designing successful trading strategies, is concerned with deciding when to buy or sell an asset of interest on a financial market. Market timing strategies can be built by using a collection of components or functions that process market context and return a recommendation on the course of action to take. In this chapter, we revisit the work presented in [20] on the application of Genetic Algorithms (GA) and Particle Swarm Optimization (PSO) to the issue of market timing while using a novel approach for training and testing called Trend Representative Testing. We provide more details on the process of building trend representative datasets, as well as introduce a new PSO variant with a different approach to pruning. Results show that the new pruning procedure is capable of reducing solution length while not adversely affecting the quality of the solutions in a statistically significant manner.

Keywords Particle swarm optimization · Genetic algorithms · Market timing · Technical analysis

1 Introduction
The history of trading in financial markets is a long and colorful one, with the earliest records of such activities emerging in the early 13th century. Today, trading in financial markets has evolved to include trading in securities (such as stocks and bonds), currencies, commodities and various other options and derivatives. The speed at which trading occurs has also evolved along with the instruments being traded, and it is now common for trades on financial markets to occur at infinitesimal fractions of a second. Following the events of Black Monday in 1987, the U.S. Securities and

I. Mohamed (B) · F. E. B. Otero
University of Kent, Chatham Maritime, Kent, UK
e-mail: [email protected]
F. E. B. Otero e-mail: [email protected]

© Springer Nature Switzerland AG 2021
J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_2


Exchange Commission (SEC) began introducing measures to mitigate the impact future market crashes would have on the overall economy. One of the key issues that exacerbated the effects of that particular market crash was that brokers abstained from responding to incoming buy and sell orders in an attempt to reduce their losses, which resulted in investors incurring catastrophic ones. The SEC responded by introducing the Small Order Execution System (SOES). This allowed investors, bar institutions, to use computers to submit orders on the NASDAQ exchange that were automatically matched and executed. Traders quickly figured out that they could use technology to submit orders and trade on the stock exchange, and that the conduit was widely available. With time, a new kind of trader emerged: the day trader, one who trades through the day and closes out the day with no held securities. By the mid 1990s, the SEC introduced another set of measures that allowed institutions to make use of exchanges that electronically matched and executed orders, known as Electronic Communication Networks (ECN). Although originally intended as alternative trading venues, and underestimated by the established exchanges, electronic exchanges and ECNs grew exponentially to process roughly a quarter of all trades in the US by the early 2000s. By then, the major exchanges such as NASDAQ and NYSE gave in, acknowledged the potential of electronic exchanges and, through a series of mergers and acquisitions, introduced their own [21].

The birth of electronic exchanges and ECNs also ushered in a new form of trading: algorithmic trading. Traders wishing to outsmart and beat their competitors to the market started using algorithms to embody trading strategies and submit orders directly to the exchange. The sophistication and speed of these systems grew with time; they started using computational and artificial intelligence techniques by the late 1990s and traded with massive volume at fractions of a second by the mid 2000s.

In order to design a trading system, algorithmic or otherwise, a designer has to tackle four basic issues. The first issue is the "why" behind the trading. This concerns the objectives the designer wants to achieve from their trading. These objectives are usually defined in terms of profits, exposure to risk and the length of time over which the designer wants to achieve their goals. The objectives can be constrained (e.g. "Double the initial capital, while allowing maximum losses of 25% over the next six months…") or open-ended (e.g. "Maximize profits while minimizing losses for the foreseeable future using the initial capital provided…"). Once the objectives are set, the designer has to contend with which assets or securities to trade in, or the "what" behind trading. This issue is also known as portfolio optimization. Deciding what to trade in is usually based on what best serves the objectives defined by the designer, and again can be either constrained (e.g. trading only in securities that belong to a particular sector of the market) or unconstrained. Having decided on why we are trading and what securities we are trading, the designer then has to answer "when" to actually buy or sell a given security. This issue is known as market timing. The final issue a designer has to contend with is "how". This is also known as execution optimization, and is concerned with how best to form and execute orders for buying or selling a given security, once such a decision comes, so as to best serve the objectives set out by the designer.


As this chapter is concerned with tackling the issue of market timing, let us further contemplate the implications of tackling such an issue. Market timing can be formally defined as the identification of opportunities to buy or sell a given tradable item in the market so as to best serve the financial goals of the trader [14]. A common approach to market timing is to use a collection of components or functions that process current and past market data and produce a recommendation on a course of action: buy, sell or do nothing (also known as hold). This collection of components forms the core of the market timing strategy, and the challenge presented to the designer of an algorithmic trading system is to select the most appropriate components and tune their parameters in such a manner that they best serve the designer's goals.

Designers of algorithmic trading systems increasingly employed computational intelligence techniques soon after the introduction of electronic exchanges. One such technique is Particle Swarm Optimization (PSO). In comparison to Genetic Algorithms (GA) and Genetic Programming (GP), PSO has not been as popular in the financial domain, and in particular within the market timing space. GA and GP have been used as the core metaheuristic for market timing strategies ever since the introduction of electronic exchanges in the mid 1990s, while the earliest approach that utilized PSO for market timing was introduced in 2011 [4]. Within the financial domain, PSO has seen limited adoption in the literature in comparison to GA, even though PSO has been shown to have a performance advantage according to some studies [11, 24].

This chapter serves as an extension of the work presented in [20]. There, we performed an extensive comparison between the applications of PSO and GA to market timing in terms of the quality of the solutions generated, using an expanded set of signal-generating components. We also introduced the concept of trend representative testing as a potential remedy to the limitations of step-forward testing, the incumbent method when it comes to training and testing in market timing. In this chapter, we reiterate the work presented in [20], while adding more detail to the process and methodology of trend representative testing, as well as a new PSO algorithm with a different approach to pruning. The main impetus behind introducing this new variant is to see whether it is possible to reduce solution lengths, and what effect that would have on solution quality.

The remainder of this chapter is structured as follows. In Sect. 2, we delve deeper into the issue of market timing and review a formalization of market timing first presented in [19] that considers both the selection of components and the tuning of their parameters simultaneously. In Sect. 3, we look at the application of computational intelligence metaheuristics to tackle market timing and the associated limitations of these approaches. We then take a deeper look at the concept of trend representative testing in Sect. 4. We provide a more elaborate description of the process of generating a trend representative dataset compared to [20], and discuss how it can potentially address the limitations identified in Sect. 3. This is followed by a description of the algorithms used to tackle market timing using trend representative testing in Sect. 5, where we also introduce another variant of PSO with a different take on pruning than in [20]. In Sects. 6 and 7 we discuss how we set up our experiments, present


their results and analyze the performance of the different algorithms covered in this chapter. Finally, Sect. 8 presents the conclusion and suggestions for future research.

2 Timing Buy and Sell Decisions As mentioned earlier, the issue of market timing is concerned with deciding when to take action with a given security we wish to trade. Our actions can be to either buy this security, sell it or take no action at all. Over time, distinct schools of thought have emerged on how to best time your actions on the market, and these schools of thought can be categorized into two major types: technical analysis and fundamental analysis. Technical analysis can trace its roots to the 17th and early 18th century with the work of Joseph De La Vega on the Amsterdam stock exchange and Homna Munehisa on the rice markets of Ojima in Osaka. Technical analysis considers a security’s current and historical price and volume movements, along with current demand and supply of the security in the market, in an attempt to forecast possible future price movements [14]. Charles Dow helped set the foundations of modern technical analysis with his Dow Theory, introduced in the 1920s. Modern technical analysis is based on three foundations: price discounts all relevant information regarding the security in question, prices have a tendency to move in trends and that history repeats itself. The first foundation is based on the assumption that all factors that can effectively affect the price of a security have exerted their influence by the time a trade takes place. These factors can include the psychological state of persons or entities interested in trading the security, the expectations of these entities and the forces of supply and demand amongst other factors. The first foundation further posits that it is sufficient to only consider current and previous prices of securities of interest as a reflection of all exerted influences on such securities. The second foundation assumes that prices have a tendency to move in trends based on the expectations of entities currently trading in securities of interest. 
For example, if traders expect that demand will increase for a particular security, they buy that security in order to sell it later at a higher price for profit. As more traders react to this behavior by buying into the security themselves, hoping to generate a profit in the same manner, the price of the security is driven higher and higher in a cascade. The security is then considered to be in an uptrend. Once the trend runs its course and an opposite cascade of selling occurs, the security is considered to be in a downtrend. Being able to identify the trend a security is currently in enables the trader to take the correct course of action within the confines of their strategy. The third foundation assumes that entities trading in securities have a tendency to consistently react in the same fashion when presented with particular market conditions. This phenomenon has been observed empirically across the histories of a multitude of securities in multiple markets over time. Methods of technical analysis employ functions known as indicators. Indicators process price history and produce a recommendation of whether to buy

Building Market Timing Strategies Using Trend Representative …


or sell a given security. This output recommendation is also known as a signal. Indicators usually have one or more parameters that affect their behavior, and using different values for these parameters will produce a different signal profile for the same input data. Readers interested in the various methodologies of technical analysis are referred to the works of Pring [23] and Kaufman [14] for comprehensive descriptions of a multitude of technical analysis techniques, as an exhaustive list of all the technical analysis tools available to the contemporary trader is beyond the scope of this chapter. Fundamental analysis, on the other hand, is based on the concept that a security has two prices: a fair price and its current market price [22]. Over time, the market price is expected to converge to the fair price of the security. To arrive at the fair price for a security, fundamental analysis considers the current financial state of the entity that issued the security. This includes looking at the current and previous financial and accounting records of that entity, considering current and previous management, looking at sales projections and the entity's track record of meeting those projections, earnings history, micro- and macroeconomic factors surrounding the entity, and even the current market sentiment towards the entity, among other factors. After considering those factors and arriving at a fair price, decisions to buy and sell are based on the current discrepancy between the fair price and the market price. The assumption here is that the market price will close the gap and converge with the fair price. If the fair price is less than the current market price, then the decision will be to sell, and if the opposite is true then the decision will be to buy. Although the more traditional of the two approaches, fundamental analysis is not without its caveats.
A major issue with fundamental analysis is the rate at which information from its sources is released. Sales and revenue reports are usually released on a quarterly schedule, while tax filings are only published annually. This can be problematic for strategies working on smaller time horizons, such as trading on a daily or second-by-second basis. In an effort to work around this limitation, fundamental analysis has expanded to include the emergent field of social media sentiment analysis. Sentiment analysis on social media networks is the process of mining these networks for the sentiment of their participants towards all aspects, micro or macro, that could affect a traded security. As contributions on social media occur at a much higher frequency than the publication of financial documents and reports, traders can act much faster and utilize fluctuations in sentiment to guide buy and sell decisions. In practice, traders building a strategy would employ techniques from both schools. A common approach is to use fundamental analysis for portfolio composition and technical analysis for market timing. The reasoning behind combining techniques from both schools of thought is to reduce the exposure to the risk that one or more techniques from either school might be incorrect or consume data of an untrustworthy nature,1 thus generating signals that could incur losses. In this chapter, we have chosen to work with technical analysis indicators for practical reasons regarding the availability of data and access to libraries of such indicators.

1 An example of this would be using a purely fundamental approach while trading Enron before its crash and bankruptcy in late 2001. A post-mortem investigation by the U.S. Securities and Exchange Commission (SEC) showed that the information published in the firm's financial documentation was false, leading to investments by market participants built on misleading assumptions.

I. Mohamed and F. E. B. Otero

In order to define a market timing strategy in formal terms, we can consider a market timing strategy to be a set of components. Each component t in the set processes information regarding the security in question and returns a signal: 1 for a buy recommendation, 0 for a hold recommendation and −1 for a sell recommendation. Every component also has a weight associated with it, as well as a set of unique parameters that influence its behavior. The weight associated with a component t determines the contribution of the signal generated by that particular component to the overall aggregate signal generated by the candidate solution. If the aggregate signal is positive, then the decision is to buy. If, on the other hand, the aggregate signal is negative, then the decision is to sell. Otherwise, the recommendation is to hold and take no action. This formulation can be presented as follows:

solution = {w_1 t_1, ..., w_n t_n}, ∀ t_i : {t_i1, ..., t_ix}    (1)

signal = Σ_{i=1}^{n} w_i t_i    (2)
where x denotes the number of parameters for the component at hand, w represents the weight assigned to the component at hand, t represents a single component and n is the total number of components within the solution. The weights for the components are all normalized to be between 0 and 1, and have a total sum of 1. Attempting to find the smallest possible subset of components that achieves our objectives, from the potentially endless combinations of components, weights and parameter values, results in a rich landscape of candidate solutions that generate a variety of signals for the same market conditions.
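The aggregation in Eqs. (1) and (2) can be sketched in code. The stand-in components below (`momentum` and `mean_rev`) are our own illustrative placeholders, not indicators discussed in the chapter; only the signal convention (1 buy, 0 hold, −1 sell) and the weighted sum come from the formalization above:

```python
# Sketch of the aggregate-signal formalization in Eqs. (1) and (2).
# Each component is a (weight, signal_function) pair; weights are
# assumed to be normalized to sum to 1, as stated in the text.

def aggregate_signal(components, prices):
    """Return +1 (buy), -1 (sell) or 0 (hold) from weighted components."""
    total = sum(weight * signal_fn(prices) for weight, signal_fn in components)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return 0

# Hypothetical stand-in components, each returning a signal in {-1, 0, 1}:
momentum = lambda p: 1 if p[-1] > p[0] else -1
mean_rev = lambda p: -1 if p[-1] > sum(p) / len(p) else 1

solution = [(0.7, momentum), (0.3, mean_rev)]
print(aggregate_signal(solution, [10, 11, 12, 13]))  # rising prices -> 1
```

Here the buy signal wins because the higher-weighted momentum component outvotes the mean-reversion one in the weighted sum.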

3 Related Work

In the studies by Hu et al. [11] and Soler-Dominguez et al. [24], the authors perform a comprehensive investigation of the use of computational intelligence techniques in finance. Both studies considered a large variety of computational intelligence algorithms, including swarm intelligence (ACO, Artificial Bee Colony Optimization, PSO), evolutionary algorithms (Differential Evolution, GA, GP), fuzzy systems, neural networks and stochastic local search methods (ILS, Tabu Search, GRASP, Simulated Annealing), among others. Although the study by Hu et al. was primarily focused on the role played by computational intelligence in the discovery of trading strategies, the one by Soler-Dominguez et al. was more holistic in nature and considered other applications of computational intelligence techniques as long as they


were within the realm of finance. The time span covered by both studies reaches as far back as the early 1990s and extends to recent times. Looking at the algorithms covered by both studies, it is clear that Genetic Algorithms (GA), and to a slightly lesser extent Genetic Programming (GP), are the most popular algorithms by volume of publications alone. The work by Allen and Karjalainen [2] can be seen as one of the earliest works using GA. Here, the authors build trading rules based on technical analysis indicators using a GA. The authors then benchmarked their results against a buy-and-hold strategy using out-of-sample data. Another approach is to use Genetic Algorithms (GA) to directly optimize the parameters of one or more indicators, regardless of whether these indicators are of the technical analysis variety or the fundamental analysis one. The works of Subramanian et al. [25] and de la Fuente et al. [9] can be seen as examples of this approach, with the former tackling market timing as a multiobjective optimization problem. Yet another approach is to use GA to tune the parameters of a primary algorithm that is in charge of producing trading signals, in order to improve its fitness. Algorithms in charge of producing trading signals in this paradigm include neural networks, self-organizing maps (SOM), fuzzy systems and an assortment of classification algorithms. An extensive cataloging of works using this synergistic paradigm can be seen in the study by Hu et al. [11]. More recent applications of GA to the issue of market timing can be seen in the works of Kim et al. [15] and Kampouridis and Otero [12]. In comparison to GA and GP, Particle Swarm Optimization (PSO) has not seen such extensive adoption within the space of market timing.
One possible factor explaining this difference is that GA were introduced far earlier than PSO: the 1970s versus the mid-1990s, respectively. The first application of PSO to market timing was the work by Briza and Naval [4]. Based on the work by Subramanian et al. [25], the authors optimized the weights of a collection of instances of technical analysis indicators. To generate these instances, the authors started with five technical analysis indicators and used industry-standard presets for the parameter values to generate multiple instances of each indicator. The final signal acted upon is then the aggregate signal collected from the individual weighted instances. The approach followed by the authors in this case was a multiobjective one, where they optimized two measures of financial fitness: percentage of return and Sharpe Ratio. This application was followed by two hybrid approaches, where PSO was used to optimize another signal generating algorithm. In [5], Chakravarty and Dash used PSO to optimize a neural network tasked with predicting movements in an index price. Liu et al. [17] also used PSO to optimize another signal generating algorithm, in this case a neural network that generated fuzzy rules for market timing, and the authors reported positive results. In more recent publications, Chen and Kao [6] used PSO as a secondary algorithm to optimize the parameters of a primary system capable of forecasting the prices of an index as a means of market timing. This primary system is built using a combination of fuzzy time series and support vector machines (SVM). In [16], Ladyzynski and Grzegorzewski also used PSO as a secondary algorithm to optimize


the parameters of a primary system. In this case, the primary system was tasked with the identification of price chart patterns for the purposes of market timing and relied on fuzzy logic and classification trees. The authors noted that the use of PSO in this scenario vastly improved the accuracy of their primary system and that the overall hybrid system proved promising. Wang et al. [27] used a hybrid approach combining PSO and a reward scheme to optimize two technical indicators in order to improve the Sharpe Ratio. The results showed that this approach outperformed other methods such as GARCH. In [3], Bera et al. used PSO to optimize the parameters of a single technical indicator and chose to apply their methodology to the foreign exchange market instead of the stock market. In their results, the authors noted that their system was profitable under testing. In [26], Sun and Gao used PSO in a secondary role to optimize the weights on the links of a neural network tasked with predicting the prices of securities on an exchange. The results showed that this hybrid system had an error rate of 30% in predicted prices when compared to the actual ones. In yet another approach where PSO played a secondary role, Karathanasopoulos et al. [13] used PSO to optimize the weights of a radial basis function neural network (RBF-NN) built to predict the price of the crude oil commodity. Compared to two other classical neural network models, the authors noted that their PSO-enhanced approach outperformed the others in predictive capacity. In the aforementioned publications regarding the use of PSO in the domain of market timing, we can observe a notable trend: PSO was more frequently used in a secondary role, optimizing the performance of a primary signal generating system built on other algorithms. There were three exceptions to this observation: [3, 4, 27].
In these three exceptions, PSO was used as the sole algorithm responsible for the generation of signals, either by optimizing the weights of a set of technical indicators with preset parameters or by optimizing the parameters of one or more indicators with preset weights. The only attempt we are aware of that considered the simultaneous optimization of both the selection of indicators and the tuning of their parameters is the one by Mohamed and Otero [19]. Here, the authors used PSO to optimize the parameters of six technical indicators, as well as tuning the weights of their produced signals and pruning ineffective ones. Their work was tested against four stocks and showed that using PSO was a viable approach, albeit with some caveats: the set of indicators used was limited in scope; the number of datasets used for training and testing was small; and there was no benchmark to compare the performance of the PSO against. The work proposed in this chapter addresses two key limitations identified in the literature so far: (1) it considers the optimization of both the selection of signal generating components and the values of their parameters in a simultaneous fashion; and (2) it avoids the tendency of market timing strategies to overfit to particular price movements under step-forward testing by proposing the use of trend representative testing, as discussed next.


4 Trend Representative Testing: Simulating Various Market Conditions While Training and Testing

The current incumbent method of training and testing used when building market timing strategies in the surveyed literature is a procedure known as step-forward testing [11, 14, 24]. A straightforward approach, step-forward testing starts by acquiring a stream of price data for a particular tradable asset and then arbitrarily splitting this stream into two sections: the chronologically earlier section is used for training, while the later one is used for testing. A common approach is to ensure that the training section is twice the size of the testing section in terms of the number of data points contained within each section. The main issue with step-forward testing is that, while training, the algorithm is confined to the price movements or trends observed in the training section. This means that the algorithm is only exposed to the upwards, downwards and sideways trends currently manifest in the training data, both in terms of length and intensity. This introduces the likelihood of overfitting to these particular trends; when faced with different types of trends (those with different lengths and intensities) in real-life trading, all the profits that were seen while training and testing quickly evaporate. A simple example would be an algorithm that only sees upward trends during both training and testing, and is then exposed to a downward trend in real-life trading. An example of this can be seen in Fig. 1. This shortcoming has been reported in both the studies by Hu et al. [11] and Soler-Dominguez et al. [24]. Furthermore, standard tactics to avoid overfitting, such as k-fold cross validation, are not easy to apply due to the temporal structure of the data. In order to overcome the limitations associated with step-forward testing, we propose the use of Trend Representative Testing.
The ideas behind this novel approach are based on the suggestions of domain experts in [14]. The main philosophy behind Trend Representative Testing is that by exposing an algorithm during training and testing to a variety of upwards, downwards and sideways movements, we reduce the chances of the algorithm overfitting to any single one of those trends and obtain a better estimate of the algorithm's performance in real-life trading. Our objective is then to build a library containing numerous examples of each type of trend, with various intensities and time lengths, and to define an approach for the use of this library in training and testing. Building a dataset for Trend Representative Testing is a systematic process of analyzing raw price streams, identifying usable subsections with known trends and then storing them within a library, so as to have a multitude of uptrends, downtrends and sideways movements for use in training and testing. The first step of this process is to acquire a vast amount of raw price data over an extended time frame, to improve our chances of capturing the largest variety of trends in terms of direction, intensity and length. For our library, we acquired the raw price data for all securities exchanged on the Nasdaq and NYSE markets from 1990 to 2018. Each individual price stream is then scanned for price shocks, and upon detection, the raw price stream is divided into two sub-streams known as cords: one cord


Fig. 1 An example of dividing price data for step-forward testing (Image courtesy of Yahoo! Finance). The data shown here represents daily prices for the Microsoft (MSFT) security on the NASDAQ market between 2016 and early 2019. The red line represents the point of division, with the first two years of data used for training and the final year used for testing. The data exemplifies the liability of step-forward testing to overfit, as the entirety of the data shown here represents an extended upwards trend. Algorithms trained and tested on this data are liable to perform poorly when exposed to a downtrend

representing the data occurring before the price shock and the other representing the data occurring after it. We remove price shocks because they are outlier events, characterized by a sudden change in price in response to an event. Price shocks are highly unpredictable and disruptive. Including these sudden and disruptive changes in the training data would imply that our trained market timing strategies are capable of predicting price shocks and correctly responding to them, which is not the case. Moreover, price shocks are rare, and training a strategy to exploit them would be highly impractical. Price shocks can be defined as price actions that are three times the Average True Range (ATR) within a short period of time [14]. Each generated cord is then subsampled using sliding windows of various sizes to produce strands. Each strand is then analyzed to identify the direction of the underlying trend (upwards, downwards or sideways) and its intensity using the Directional Index technical indicator [23], and finally added to the library. A visual example of this process can be seen in Fig. 2. In order to use this new dataset, we first build two sets: a training set and a testing set. A training set is composed of n triplets, where a triplet is a set of three strands: one uptrend, one downtrend and one sideways trend. A testing set is composed of a single triplet. During training, we select a triplet at random from the training set with each iteration, and that triplet is used to assess the performance of all the candidate solutions within that iteration. To measure the performance of a candidate solution, we evaluate the performance of the candidate solution at hand
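The shock-detection step that splits a price stream into cords can be sketched as follows. The chapter only specifies the three-times-ATR rule from [14]; the 14-bar ATR period, the `(high, low, close)` bar representation and the `split_into_cords` helper are our own illustrative assumptions:

```python
# Sketch of price-shock detection used to split a raw price stream into
# cords. A shock is flagged when a bar's true range exceeds three times
# the Average True Range (ATR); the 14-bar ATR window is an assumption.

def true_range(bar, prev_close):
    high, low, _ = bar
    return max(high - low, abs(high - prev_close), abs(low - prev_close))

def split_into_cords(bars, period=14, factor=3.0):
    """Split a list of (high, low, close) bars at detected price shocks."""
    cords, current, trs = [], [bars[0]], []
    for prev, bar in zip(bars, bars[1:]):
        tr = true_range(bar, prev[2])
        atr = sum(trs[-period:]) / len(trs[-period:]) if trs else tr
        if trs and tr > factor * atr:   # price shock: start a new cord
            cords.append(current)
            current = []
        current.append(bar)
        trs.append(tr)
    cords.append(current)
    return cords
```

Each resulting cord would then be subsampled with sliding windows into strands and labeled by trend direction, as described above.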


Fig. 2 A visual example of the process behind generating a Trend Representative Testing dataset (Images courtesy of Yahoo! Finance). The data shown here is the price data for the Dow Jones index between July 1987 and January 1988. In (A) we can see the raw price data, with the tall black rectangle highlighting the price shock caused by the events of Black Monday on October 19, 1987. This price shock is confirmed by the spikes in Average True Range and True Range depicted directly under the top chart. Upon identification of the price shock, the raw stream of data is divided into two streams we call cords: one before the price shock (B), and the other after the price shock (C). The cords are then subsampled using sliding windows of various sizes to produce what we call strands (D). Strands are then analyzed using the Directional Index technical indicator to identify the underlying trends and stored in the library that forms the Trend Representative Testing dataset


against each constituent strand within the triplet designated for the current iteration and report the average fitness. With each new iteration, a randomly selected triplet from the training set is designated as the current training triplet, and this process is repeated until the training criteria are met or the budget of training iterations is exhausted. For testing, the triplet from the testing set is used in a similar fashion to assess the performance of the surviving solutions from the algorithm. The idea behind trend representative testing is that we want to discourage niching or specializing in one particular trend type and instead promote the discovery of market timing strategies that fare well under various market conditions.
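The triplet-based evaluation loop just described can be sketched as follows; the fitness function, the population and the iteration budget are placeholders, and the metaheuristic's own update step (GA or PSO) is elided:

```python
import random

# Sketch of training with Trend Representative Testing: each iteration
# draws a random triplet (one uptrend, one downtrend, one sideways
# strand) and a candidate's fitness is its average performance across
# the triplet's three strands.

def evaluate(candidate, triplet, fitness_fn):
    return sum(fitness_fn(candidate, strand) for strand in triplet) / len(triplet)

def train(population, training_set, fitness_fn, iterations=100):
    best, best_fit = None, float("-inf")
    for _ in range(iterations):
        triplet = random.choice(training_set)   # new triplet each iteration
        for candidate in population:
            fit = evaluate(candidate, triplet, fitness_fn)
            if fit > best_fit:
                best, best_fit = candidate, fit
        # ... apply the metaheuristic's update step (GA or PSO) here ...
    return best
```

Testing would reuse `evaluate` with the single triplet from the testing set.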

5 Market Timing Algorithms

In this section we describe how our market timing formalization was encoded, and explain how the algorithms were adapted to use this encoding and tackle the problem of market timing.

5.1 Individual Encoding and Measuring Fitness

Let us begin this section by describing how the formalization presented in Sect. 2 is encoded into a candidate solution, and how we assess the fitness of these solutions. According to our formalization, a candidate solution is a collection of signal generating components, where each of these components has a weight and a set of parameters. For example, if we had three signal generating components such as the technical indicators Moving Average Convergence Divergence (MACD), the Relative Strength Indicator (RSI) and the Chaikin Oscillator (Chaikin), we could have a candidate solution of:

0.3 × MACD(FastPeriod = 12, SlowPeriod = 27, SmoothPeriod = 9)
+ 0.2 × RSI(Overbought = 70, Oversold = 30, Period = 14)
+ 0.5 × Chaikin(FastPeriod = 3, SlowPeriod = 10)

where the number preceding each indicator represents its weight and the values in the brackets represent the values of its parameters. A visual example can be seen in Fig. 3. As the algorithms used are all based on a population of individuals, each individual represents a candidate solution. We chose to encode the individuals as multi-tier associative arrays. The top-level binding associates an indicator identifier with a set of its parameters. The bottom-level binding associates a parameter identifier with

Building Market Timing Strategies Using Trend Representative …

41

Fig. 3 An example of an encoded candidate solution with three components. Reproduced from [20]

its value. The parameters per indicator depend on its type, but all indicators have an instance of a weight parameter. Encoding individuals as multi-tier associative arrays gives us maximum flexibility: we can use an unlimited number of signal generating components and remain agnostic of their parameters in terms of quantity and value types. Encoding the individual as a regular array, by contrast, would have confined us to a particular number of components and required the maintenance of a mapping structure to decode the array's values, both limitations that we wished to avoid at this time. The semantics of how each component is handled and how its weight and parameters are tuned are left to the individual algorithms, as will be discussed shortly. A new candidate solution is generated by instantiating components from the available catalog with random values for the parameters and adding them to the dictionary representing the individual. As for assessing the fitness of an individual, we chose to maximize the Annualized Rate of Return (AROR) generated by backtesting the individual at hand. Backtesting simulates trading based on enacting the aggregate signal produced by a solution over a preset time period for a given asset. The data used during backtesting depends on the current strand being used from the Trend Representative Testing procedure. AROR can be defined as:

AROR_simple = (E_n / E_0) × (252 / n)    (3)

where E_n is the final equity or capital, E_0 is the initial equity or capital, 252 represents the number of trading days in a typical American calendar and n is the number of days in the testing period.
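Equation (3) translates directly into code; the equity figures in the example are made up for illustration:

```python
# Sketch of the simple Annualized Rate of Return (AROR) in Eq. (3):
# the final-to-initial equity ratio, scaled by 252 trading days over
# the n-day testing period.

def aror_simple(initial_equity, final_equity, n_days):
    return (final_equity / initial_equity) * (252 / n_days)

# e.g. growing 100 to 110 over a 126-day (roughly half-year) backtest:
print(aror_simple(100.0, 110.0, 126))  # 1.1 * 2 = 2.2
```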

5.2 Genetic Algorithms

In order to adapt Genetic Algorithms (GA) to tackle market timing using our proposed formalization, we utilized a typical implementation of GA and modified the crossover and mutation operators to accommodate our individual encoding scheme. For crossover, we begin by selecting individuals using typical tournament selection, where the tournament size is a user-defined parameter. The selected individuals are then prepared by ordering the genotype by key. A


Fig. 4 An example of a crossover operation. Reproduced from [20]

crossover point is then selected at random, with the constraint that it lands between the definitions of two components, not within them. This ensures that the components in the resulting genotypes are valid, with the correct number of parameters and with parameter values within valid ranges. Using the example mentioned earlier, we can generate a crossover point either between the definitions of MACD and RSI, or between RSI and Chaikin. An example of a crossover operation can be seen in Fig. 4. To perform mutation, a random component within the individual's genotype is replaced by a fresh copy of the same type of component with random (but valid) values for its parameters. The crossover and mutation operators are used to generate new populations, generation after generation. This process is repeated until an allocated budget of generations is exhausted. Elitism is implemented via an archive that keeps track of the best performing individuals per generation, with the fittest individual in the archive being reported as the proposed solution at the end of the algorithm's run.
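Under the stated assumptions (a dictionary genotype ordered by key, plus a catalog of component factories, both our own illustrative constructions), the boundary-respecting crossover and the component-replacement mutation might be sketched as:

```python
import random

# Sketch of the GA operators on the associative-array encoding:
# crossover points fall only between components (never inside one),
# and mutation replaces a component with a freshly randomized instance.
# The component catalog below is a hypothetical stand-in.

def crossover(parent_a, parent_b):
    keys = sorted(parent_a)                   # order the genotype by key
    point = random.randint(1, len(keys) - 1)  # lands between two components
    child = {k: parent_a[k] for k in keys[:point]}
    child.update({k: parent_b[k] for k in keys[point:]})
    return child

def mutate(individual, catalog):
    key = random.choice(list(individual))
    individual[key] = catalog[key]()          # fresh random (but valid) values
    return individual

# Hypothetical component factories producing random parameter settings:
catalog = {
    "MACD": lambda: {"weight": random.random(), "fast": random.randint(5, 15)},
    "RSI":  lambda: {"weight": random.random(), "period": random.randint(5, 30)},
}
parent_a = {name: make() for name, make in catalog.items()}
parent_b = {name: make() for name, make in catalog.items()}
child = crossover(parent_a, parent_b)
```

Because the crossover point indexes whole components, every child keeps a complete, valid parameter set for each indicator it inherits.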

5.3 Particle Swarm Optimization

We now turn our attention to adapting Particle Swarm Optimization (PSO) to tackle market timing. As PSO is based on a swarm of individuals representing candidate solutions, we use the same encoding introduced earlier. In order to work with this encoding within PSO, the standard mechanics of the various PSO operators have to be modified, and these modifications are discussed next.


The basic PSO model can be seen in Algorithm 1. This model supports both l-best and g-best neighborhood structures, based on the value of the neighborhood size parameter. A neighborhood size equal to the size of the swarm implies a g-best neighborhood structure, while values less than the swarm size imply an l-best neighborhood structure. In order to accommodate the proposed encoding scheme, we pushed down the implementation of the addition, subtraction and multiplication operators required for the velocity update equation (lines 8–15) to the component level instead of the algorithm level. This allows us to be agnostic to the types of components used and their parameter values. It also removes the limitation that parameter values have to be either binary or numeric in nature, as long as there is a suitable override for the addition, subtraction and multiplication operators within the component. This opens up the possibility for the designer to consider an arbitrary number of signal generating components for their market timing strategies. We also adopted a decreasing inertia schedule, Clerc's constriction [7] and velocity scaling as measures to promote convergence within the swarm and eliminate velocity explosion.2 For the decreasing inertia schedule, the inertia of every particle is diminished with every passing iteration by an amount defined as a function of the iterations remaining. To eliminate velocity explosion, the designer is given a choice between using Clerc's constriction [7] or velocity scaling. With velocity scaling, the velocity v_ij(t + 1) is scaled down by a user-defined factor before being used to update a particle's state. Using these modifications, PSO can now use the proposed encoding to optimize both the parameters and weights of a set of components in relation to a financial fitness metric. A common problem faced by search metaheuristics, PSO included, is getting stuck in local optima.
Multiple measures have been devised since the introduction of the basic model to remedy this problem, with varying degrees of success [8]. Inspired by the work done by Abdelbar with Ant Colony Optimization (ACO) in [1], we introduce a variation of PSO that stochastically updates the velocity of its particles only when it is favorable in terms of fitness. We will refer to this variation of PSO in the remainder of this chapter as PSO_S. The modifications required for PSO_S are reviewed next. Every particle x in the swarm S represents a candidate solution. From our earlier discussion, this means that a particle's state is a collection of weighted components, where each component has its own set of parameters. A particle starts out with an instance of all the available signal generating components, each instantiated with random weights and parameter values. In contrast with the basic PSO model, the cognitive and social components of the velocity update equation are modified to be calculated as:

2 Early experiments indicated the tendency of particles to adopt ever-increasing velocity values if left unchecked, leading the particles to quickly seek the edges of the search space and move beyond them.


Algorithm 1 Basic PSO high-level pseudocode.
1:  initialize swarm S
2:  repeat
3:    for every particle x_i in S do
4:      if f(x_i) > personal_best(x_i) then
5:        personal_best(x_i) ← f(x_i)
6:      end if
7:      for every component j in particle i do
8:        bias ← α v_ij(t)
9:        cognitive ← c_1 r_1 (y_ij(t) − x_ij(t))
10:       social ← c_2 r_2 (ŷ_ij(t) − x_ij(t))
11:       v_ij(t + 1) ← bias + cognitive + social
12:       if j ∈ ℝ then
13:         x_ij(t + 1) ← x_ij(t) + v_ij(t + 1)
14:       else if j ∈ {0, 1} then
15:         Pr(x_ij(t + 1) = 1) = sigmoid(v_ij(t + 1))
16:       end if
17:     end for
18:   end for
19: until stopping criteria met
20: return fittest particle

cognitive = { yi(t) − xi(t)   if rand() < |f(yi(t))| / (|f(xi(t))| + |f(yi(t))|)
            { 0               otherwise                                             (4)

social    = { ŷi(t) − xi(t)   if rand() < |f(ŷi(t))| / (|f(xi(t))| + |f(ŷi(t))|)
            { 0               otherwise                                             (5)

where:
– x: particle
– i: current particle index
– y: personal best
– ŷ: neighborhood best
– f(x): the fitness of x
– rand(): random number between 0 and 1.
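A minimal sketch of the gating in Eqs. (4) and (5), with positions reduced to floats purely for illustration (the actual update operates on weighted components):

```python
import random

def stochastic_term(f_x, f_best, best_pos, x_pos):
    """Attraction towards a best position (personal or neighborhood) that
    fires only with probability |f(best)| / (|f(x)| + |f(best)|), following
    the reconstruction of Eqs. (4)-(5); otherwise the term contributes zero.
    Assumes f_x and f_best are not both zero."""
    p = abs(f_best) / (abs(f_x) + abs(f_best))
    if random.random() < p:
        return best_pos - x_pos
    return 0.0
```

The better the best position's fitness relative to the particle's own, the more likely the pull towards it is applied, giving the update a hill-climbing flavor.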

According to Eqs. 4 and 5, the cognitive and social components only stochastically influence the velocity update when there is an improvement in fitness, in a hill-climbing fashion.

In [19], we introduced a pruning procedure in an attempt to arrive at shorter solutions by actively removing signal generating components whose weight falls below a specific threshold at specific checkpoints during the algorithm's run. Experimentation with this pruning procedure did not prove fruitful. This was further confirmed when IRace found that the best setting for that pruning procedure is to be turned off. This suggests that the pruning procedure is too destructive: any component that falls below the pruning threshold is removed from all solutions in the swarm, and there is no mechanism to reintroduce the pruned components at a later stage. Therefore, components that do not have a good configuration at the moment, and thus do not contribute to the solution in a beneficial manner, are removed without having the opportunity to explore other configurations. Shorter solutions have the advantage of being faster to compute and easier for the user to comprehend, and so are still desirable. The challenge is to find a good balance between the quality and the size of a solution.

In another attempt to arrive at the smallest sufficient subset of components that optimizes a financial metric, we introduce a novel approach to pruning that was not present in [20]. In this approach, components with weights falling below a threshold have their weights set to zero without being physically removed from the solution, as was the case in [19]. Components with a weight of zero are effectively excluded from contributing to the aggregate signal produced by the candidate solution and can thus be disregarded. By not permanently removing them from the solutions, we allow their reintroduction in later iterations if they learn a useful configuration through the interaction of the particles in the swarm. The pruning procedure is triggered at frequent points throughout the algorithm's run; the deadline (the number of iterations that pass before it is triggered) and the threshold are user-defined parameters. This pruning procedure is added on top of PSO_S, resulting in a new variant we will refer to as PSO_P.
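The weight-zeroing pruning of PSO_P can be sketched as follows (particles reduced to plain weight lists for illustration):

```python
def prune(swarm, threshold, iteration, deadline):
    """PSO_P-style pruning sketch: every `deadline` iterations, zero out any
    component weight whose magnitude falls below `threshold`. Zeroed
    components stop contributing to the aggregate signal but remain in the
    particle's state, so later velocity updates can reintroduce them."""
    if iteration == 0 or iteration % deadline != 0:
        return swarm
    return [[0.0 if abs(w) < threshold else w for w in particle]
            for particle in swarm]
```

Contrast with the pruning of [19], which would delete the component from every particle outright and therefore could never reintroduce it.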

6 Experimental Setup

In order to evaluate the efficacy of the proposed methods in building competent market timing strategies, we tested the proposed GA, PSO, PSO_S and PSO_P algorithms. Since all four algorithms have parameters, testing was preceded by hyper-parameter optimization using the iterated racing procedure (IRace) [18]. The IRace procedure was run with a budget of 300 iterations and a survivor limit of one, in order to arrive at a single configuration for each algorithm. The results of the IRace procedure can be seen in Table 1.

Looking at the PSO variants, we can see that the variant with the smallest swarm size is PSO_P, followed by regular PSO and then PSO_S, which is almost twice the size of PSO_P. When it comes to the total number of fitness evaluations performed, regular PSO requires the fewest at 1260, followed by PSO_P and PSO_S with 2700 and 15,399 respectively. Regular PSO adopted an l-best neighborhood structure, while both PSO_S and PSO_P adopted g-best neighborhood structures. All three PSO variants favored velocity scaling over Clerc's constriction, with the two variants that use the stochastic state update procedure adopting a scaling factor of around 0.5, while regular PSO adopted a more aggressive scaling profile with a factor of about 0.9. Both regular PSO and PSO_P leaned slightly more towards the social component during velocity update, while PSO_S showed the exact opposite. In essence, after hyper-parameter optimization using IRace, we ended up with a fast-acting l-best PSO, an almost equally fast g-best PSO_P and a relatively slow PSO_S. As for GA, the hyper-parameter optimization procedure resulted in a relatively slow GA (with a population size and generation count close to PSO_S), with a high mutation rate and low crossover rate. This suggests that the solution landscape is on the rugged side, with sharp peaks and deep valleys that require radical moves to traverse.

Table 1 IRace discovered configurations for each of the algorithms tested

Parameter           PSO       PSO_S     PSO_P
Population          45        59        30
Iterations          28        261       90
Neighbors           26        59        30
c1                  2.4291    3.2761    2.908
c2                  3.4185    2.363     3.417
Clamp               Scaling   Scaling   Scaling
Scaling factor      0.8974    0.551     0.5988
Pruning             –         False     True
Pruning threshold   –         –         0.1799
Pruning deadline    –         –         15

GA: Population 53, Generations 266, Mutation probability 0.6306, Crossover probability 0.455, Tournament size 22

All algorithms are trained and tested using trend representative testing. The data used contains 30 strands, representing 10 upwards, 10 downwards and 10 sideways trends at various intensities. The 30 strands are divided into triplets, resulting in 10 distinct triplets for use in training and testing. The details of the trend dataset can be seen in Table 2; its columns give the identifier, the symbol of the source stock data, the beginning date, the ending date, the length and the trend of every strand. The data has then been split into 10 datasets, where each dataset contains one triplet reserved for testing while the remaining 9 triplets form the training set. Each step in the training and testing procedure is repeated 10 times to cater for the effects of stochasticity.

As for the signal generating components available to the algorithms, we used 63 technical analysis indicators in our experiments. These comprise an assortment of momentum indicators, oscillators, accumulation/distribution indicators, candlestick continuation pattern detectors and candlestick reversal pattern indicators. If an indicator had parameters concerned with the length of data processed, an upper limit of 45 days was set, so that we were guaranteed a minimum of 5 trading signals within a single trading year.3 Any other parameters are initialized to random values, as long as they comply with the value constraints defined

3 A typical US trading year is comprised of 252 trading days.

Table 2 Data strands used for training and testing. Reproduced from [20]

Id     Symbol  Begin date   End date     Length  Trend
BSX1   BSX     2012-10-10   2013-07-09   185     ↑
LUV1   LUV     2008-08-22   2010-05-07   430     ↔
KFY1   KFY     2007-05-16   2007-10-12   105     ↓
EXC1   EXC     2003-04-14   2003-08-20   90      ↑
LUV2   LUV     2004-12-03   2005-05-04   105     ↔
KFY2   KFY     2007-03-20   2007-09-21   130     ↓
AVNW1  AVNW    2005-07-18   2006-01-12   125     ↑
PUK1   PUK     2010-08-12   2012-04-03   415     ↔
LUV3   LUV     2008-09-02   2009-01-30   105     ↓
KFY3   KFY     2003-03-13   2003-08-04   100     ↑
EXC2   EXC     2002-10-03   2003-08-04   210     ↔
LUV4   LUV     2003-11-21   2004-04-01   90      ↓
EXC3   EXC     2003-05-12   2003-10-15   110     ↑
PUK2   PUK     2005-05-12   2006-03-13   210     ↔
MGA1   MGA     1996-02-29   1996-07-08   90      ↓
ED1    ED      1997-07-02   1997-11-20   100     ↑
EXC4   EXC     1999-08-20   2000-03-30   155     ↔
PUK3   PUK     2002-03-19   2002-07-25   90      ↓
BSX2   BSX     2009-04-22   2009-09-18   105     ↑
ED2    ED      2011-12-15   2012-05-16   105     ↔
JBLU1  JBLU    2003-05-15   2003-11-10   125     ↓
MGA2   MGA     2012-12-28   2013-10-14   200     ↑
MGA3   MGA     1995-09-19   1996-12-13   315     ↔
ATRO1  ATRO    1997-06-04   1997-11-28   125     ↓
AVNW2  AVNW    2003-03-07   2003-08-05   105     ↑
EXC5   EXC     2015-03-12   2016-09-02   375     ↔
AVNW3  AVNW    2013-06-11   2013-11-20   115     ↓
IAG1   IAG     2015-11-09   2016-08-24   200     ↑
MGA4   MGA     1995-10-17   1996-04-22   130     ↔
IAG2   IAG     2012-01-19   2012-06-04   95      ↓

by every indicator, and the best setting for these parameters in terms of performance is discovered by the algorithms as they traverse the solution landscape.
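The split into 10 datasets (each holding out one triplet for testing and training on the remaining nine) can be sketched as follows; the strand identifiers are placeholders:

```python
def triplet_splits(strand_ids):
    """Sketch of the trend representative split: 30 strands grouped into
    10 consecutive triplets (one up, one sideways, one down each); each
    fold reserves one triplet for testing and trains on the other nine."""
    triplets = [strand_ids[i:i + 3] for i in range(0, len(strand_ids), 3)]
    folds = []
    for k, test in enumerate(triplets):
        train = [s for j, t in enumerate(triplets) if j != k for s in t]
        folds.append((train, test))
    return folds
```

Each of the 10 folds is then run 10 times per algorithm to account for stochasticity.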


7 Results

Table 3 shows the minimum, median, mean and maximum fitness achieved by the four algorithms per dataset and trend. Regular PSO shows the most wins based on mean performance with 15 out of a possible 30, followed by PSO_S with 6 wins, then PSO_P with 5 wins and finally GA with the fewest wins, scoring only 4 out of a possible 30. Positive values in Table 3 indicate that profits were made on the initial investment, negative values indicate that losses were incurred and a value of zero indicates a break-even situation.

By looking at the overall averages in Table 4, we can see that all four algorithms performed considerably better on downtrends when compared to the other two trend types. We can also see that GA is the worst-performing algorithm in all of the trend types, with a PSO variant holding a slight edge over GA in all three cases. The clear difference in fitness between downtrends and the other two types of trends indicates that the algorithms generate market timing strategies that are unbalanced. Nevertheless, performing better in downtrends is positive when compared with buy-and-hold strategies, which would fail under such conditions. The issue of unbalanced performance across the various trend types can perhaps be remedied by tackling market timing as a multi-objective optimization problem, where we pursue a Pareto front that maximizes performance for all three trend types.

By observing the performance of the market timing strategies generated by all four algorithms under downtrends, uptrends and sideways movements, we obtain a better approximation of the performance of these strategies under varying market conditions, and therein lies the value of trend representative testing. With step-forward testing, we are confined to the underlying trends represented in the training data. This can easily lead to overfitting, where strategies only perform well when exposed to trends that are similar to those encountered in training and poorly otherwise.
With trend representative testing, we explicitly avoid this issue by exposing our algorithms to a variety of trends during both training and testing. Table 5 shows the rankings of the algorithms after performing the non-parametric Friedman test with Holm's post-hoc test, by trend type, on the mean results [10]. The first column shows the trend type; the second column shows the algorithm name; the third column shows the average rank, where the lower the rank the better the algorithm's performance; the fourth column shows the p-value of the statistical test when the average rank is compared to the average rank of the algorithm with the best rank (the control algorithm); and the fifth shows Holm's critical value. A statistically significant difference at the 5% level between the average rank of an algorithm and that of the control algorithm is indicated by the p-value being lower than the critical value, meaning that the control algorithm is significantly better than the algorithm in that row. The non-parametric Friedman test was chosen as it does not assume that the data is normally distributed, a requirement of the equivalent parametric tests. We can see from Table 5 that the PSO variants that employed the stochastic state update procedure (namely PSO_S and PSO_P) ranked highest across all three trend types, albeit not at a statistically significant level. This leads to two interesting obser-

Table 3 Computational results for each algorithm over the 10 datasets. The min, median, mean and max values are determined by running each algorithm 10 times on each dataset. The best result for each dataset and trend combination is shown in bold. (The full table, spanning two pages in the original, lists the min, median, mean and max fitness of GA, PSO, PSO_S and PSO_P for every test strand; the individual values are not reproduced here.)

Table 4 Overall average fitness by trend for each algorithm

Trend      GA    PSO   PSO_S  PSO_P
Downtrend  3.46  3.62  3.84   3.89
Sideways   0.61  1.01  0.74   0.76
Uptrend    0.31  0.81  0.55   0.80

Table 5 Average rankings of each algorithm according to the Friedman non-parametric test with the Holm post-hoc test over the mean performance. No statistical differences at the significance level 5% were observed

Trend      Algorithm        Ranking  p-value  Holm
Downtrend  PSO_S (control)  2.0      –        –
           PSO              2.35     0.5444   0.05
           PSO_P            2.75     0.1939   0.025
           GA               2.9      0.119    0.0167
Sideways   PSO_P (control)  2.1      –        –
           PSO              2.2      0.8625   0.05
           GA               2.7      0.2987   0.025
           PSO_S            3.0      0.119    0.0167
Uptrend    PSO_S (control)  2.2      –        –
           GA               2.2      1.0      0.05
           PSO              2.75     0.3408   0.025
           PSO_P            2.85     0.2602   0.0167
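The average ranks and Holm critical values reported in Table 5 follow the standard procedure, which can be sketched as below (ties ignored for brevity; not the authors' code):

```python
def avg_ranks(results):
    """Mean rank of each algorithm across datasets (lower is better);
    `results` maps algorithm name -> list of per-dataset fitness values,
    with higher fitness receiving rank 1 on each dataset."""
    names = list(results)
    n = len(results[names[0]])
    totals = {a: 0.0 for a in names}
    for d in range(n):
        ordered = sorted(names, key=lambda a: -results[a][d])
        for rank, a in enumerate(ordered, start=1):
            totals[a] += rank
    return {a: totals[a] / n for a in names}

def holm_critical(pvalues, alpha=0.05):
    """Holm step-down critical values for m comparisons against a control:
    the i-th smallest p-value is tested against alpha / (m - i)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    return {i: alpha / (m - r) for r, i in enumerate(order)}
```

For three comparisons this yields the 0.0167, 0.025 and 0.05 thresholds shown in the last column of Table 5.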

vations. The first of these observations is that all three PSO variants experimented with here are competitive with GA in terms of performance when it comes to the domain of market timing. PSO can also produce these competitive results at a lower cost in terms of the total number of fitness evaluations, as can be seen from the algorithm configurations in Table 1. The second observation is that PSO_P, the PSO variant with pruning, is also competitive with the PSO variants without the pruning procedure, as no statistical significance was observed in the results. Figure 5 shows a histogram of the solution lengths returned by PSO_P during testing. We can see that the majority of solution lengths were between 29 and 46 components, with only a single solution employing all 63 components. This suggests that PSO is capable of discovering shorter solutions without adversely affecting performance in a significant manner. This presents the opportunity of pursuing ever shorter solution lengths with the aim of finding the smallest sufficient subset of components that maximizes our financial metrics. By finding shorter solutions, we will be capable of producing market timing strategies that execute faster and are more easily comprehensible.
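The solution lengths summarized in Fig. 5 can be derived by counting the components that retain a non-zero weight, e.g.:

```python
from collections import Counter

def length_histogram(solutions):
    """Effective length of each solution = number of components whose
    weight is non-zero (zero-weight components are pruned out of the
    aggregate signal); returns a mapping of length -> frequency."""
    return Counter(sum(1 for w in sol if w != 0.0) for sol in solutions)
```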


Fig. 5 A histogram showing the solution lengths of the solutions from PSO P . We can clearly see that the majority of solutions range in length between 29 and 46. Only a single solution employed all 63 technical indicators

8 Conclusion

In this chapter we revisited and extended the work presented in [20] by improving the details of trend representative testing and introducing a new PSO variant (PSO_P) with a novel pruning procedure. The results show that all three PSO variants are competitive with GA in terms of performance, and one particular variant was capable of achieving such performance at a fraction of the number of fitness evaluations required. The results also show that the newly introduced PSO variant, PSO_P, was capable of producing competitive results in comparison with the other algorithms while returning solutions that are considerably shorter in length.

We suggest the following avenues of future research. First, use a more sophisticated measure of financial fitness. This would allow us to simulate hidden costs of trading such as slippage. Second, approach the problem of market timing as a multi-objective one by trying to maximize performance across the three types of trends and against multiple financial objectives. Third, pursue shorter solution lengths by considering length as one of multiple objectives in a multi-objective approach to market timing. The validity of this pursuit is based on the evidence presented in the results returned by PSO_P, where the majority of the solutions returned did not use the full set of signal generating components available and yet remained competitive in terms of performance with the solutions generated by the other algorithms. Finally, adapt more metaheuristics to tackle market timing and compare their performance against the currently proposed ones on significantly larger datasets. We could then use meta-learning to understand if and when certain metaheuristics perform significantly better than others under particular conditions, and use that information to build hybrid approaches that combine more than one metaheuristic to build strategies for market timing.

References

1. Abdelbar, A.: Stubborn ants. In: IEEE Swarm Intelligence Symposium, SIS 2008, pp. 1–5 (2008)
2. Allen, F., Karjalainen, R.: Using genetic algorithms to find technical trading rules. J. Financ. Econ. 51(2), 245–271 (1999)
3. Bera, A., Sychel, D., Sacharski, B.: Improved particle swarm optimization method for investment strategies parameters computing. J. Theor. Appl. Comput. Sci. 8(4), 45–55 (2014)
4. Briza, A.C., Naval Jr., P.C.: Stock trading system based on the multi-objective particle swarm optimization of technical indicators on end-of-day market data. Appl. Soft Comput. 11(1), 1191–1201 (2011)
5. Chakravarty, S., Dash, P.K.: A PSO based integrated functional link net and interval type-2 fuzzy logic system for predicting stock market indices. Appl. Soft Comput. 12(2), 931–941 (2012)
6. Chen, S.M., Kao, P.Y.: TAIEX forecasting based on fuzzy time series, particle swarm optimization techniques and support vector machines. Inf. Sci. 247, 62–71 (2013)
7. Clerc, M.: Think locally act locally - a framework for adaptive particle swarm optimizers. IEEE J. Evol. Comput. 29, 1951–1957 (2002)
8. Engelbrecht, A.P.: Fundamentals of Computational Swarm Intelligence. John Wiley & Sons Ltd. (2005)
9. de la Fuente, D., Garrido, A., Laviada, J., Gómez, A.: Genetic algorithms to optimise the time to make stock market investment. In: Genetic and Evolutionary Computation Conference, pp. 1857–1858 (2006)
10. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf. Sci. 180(10), 2044–2064 (2010)
11. Hu, Y., Liu, K., Zhang, X., Su, L., Ngai, E.W.T., Liu, M.: Application of evolutionary computation for rule discovery in stock algorithmic trading: a literature review. Appl. Soft Comput. 36, 534–551 (2015)
12. Kampouridis, M., Otero, F.E.: Evolving trading strategies using directional changes. Expert Syst. Appl. 73, 145–160 (2017)
13. Karathanasopoulos, A., Dunis, C., Khalil, S.: Modelling, forecasting and trading with a new sliding window approach: the crack spread example. Quant. Financ. 7688, 1–12 (2016)
14. Kaufman, P.J.: Trading Systems and Methods, 5th edn. John Wiley & Sons, Inc. (2013)
15. Kim, Y., Ahn, W., Oh, K.J., Enke, D.: An intelligent hybrid trading system for discovering trading rules for the futures market using rough sets and genetic algorithms. Appl. Soft Comput. 55, 127–140 (2017)
16. Ladyzynski, P., Grzegorzewski, P.: Particle swarm intelligence tunning of fuzzy geometric protoforms for price patterns recognition and stock trading. Expert Syst. Appl. 40(7), 2391–2397 (2013)
17. Liu, C.F., Yeh, C.Y., Lee, S.J.: Application of type-2 neuro-fuzzy modeling in stock price prediction. Appl. Soft Comput. 12(4), 1348–1358 (2012)
18. López-Ibáñez, M., Dubois-Lacoste, J., Pérez Cáceres, L., Stützle, T., Birattari, M.: The irace package: iterated racing for automatic algorithm configuration. Oper. Res. Perspect. 3, 43–58 (2016)
19. Mohamed, I., Otero, F.E.: Using particle swarms to build strategies for market timing: a comparative study. In: Swarm Intelligence: 11th International Conference, ANTS 2018, Rome, Italy, 29–31 October 2018, Proceedings, pp. 435–436. Springer International Publishing (2018)
20. Mohamed, I., Otero, F.E.B.: Using population-based metaheuristics and trend representative testing to compose strategies for market timing. In: Proceedings of the 11th International Joint Conference on Computational Intelligence - Volume 1: ECTA (IJCCI 2019), pp. 59–69. INSTICC, SciTePress (2019)
21. Patterson, S.: Dark Pools: The Rise of A.I. Trading Machines and the Looming Threat to Wall Street. Random House Business Books (2013)
22. Penman, S.H.: Financial Statement Analysis and Security Valuation. McGraw-Hill (2013)
23. Pring, M.: Technical Analysis Explained. McGraw-Hill (2002)
24. Soler-Dominguez, A., Juan, A.A., Kizys, R.: A survey on financial applications of metaheuristics. ACM Comput. Surv. 50(1), 1–23 (2017)
25. Subramanian, H., Ramamoorthy, S., Stone, P., Kuipers, B.: Designing safe, profitable automated stock trading agents using evolutionary algorithms. In: Genetic and Evolutionary Computation Conference, vol. 2, p. 1777 (2006)
26. Sun, Y., Gao, Y.: An improved hybrid algorithm based on PSO and BP for stock price forecasting. Open Cybern. Syst. J. (2015)
27. Wang, F., Yu, P.L., Cheung, D.W.: Combining technical trading rules using particle swarm optimization. Expert Syst. Appl. 41(6), 3016–3026 (2014)

Hybrid Strategy Coupling EGO and CMA-ES for Structural Topology Optimization in Statics and Crashworthiness

Elena Raponi, Mariusz Bujny, Markus Olhofer, Simonetta Boria, and Fabian Duddeck

Abstract Topology Optimization (TO) represents a relevant tool in the design of mechanical structures and, as such, it is currently used in many industrial applications. However, many TO techniques are still questionable when applied to crashworthiness optimization problems due to their complexity and the lack of gradient information. The aim of this work is to describe the Hybrid Kriging-assisted Level Set Method (HKG-LSM) and test its performance in the optimization of mechanical structures consisting of ensembles of beams subjected to both static and dynamic loads. The algorithm adopts a low-dimensional parametrization introduced by the Evolutionary Level Set Method (EA-LSM) for structural Topology Optimization and couples the Efficient Global Optimization (EGO) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to converge towards the optimum within a fixed budget of evaluations. It takes advantage of the explorative capabilities of EGO, ensuring a fast convergence at the beginning of the optimization procedure, as well as the flexibility and robustness of CMA-ES to exploit promising regions of the search space. More precisely, HKG-LSM first uses the Kriging-based method for Level Set Topology Optimization (KG-LSM) and afterwards switches to the EA-LSM using CMA-ES, whose parameters are initialized based on the previous model. Within the research, a minimum-compliance cantilever beam test case is used to validate the presented strategy at different dimensionalities, up to 15 variables. The method is then applied to a 15-variable 2D crash test case, consisting of a cylindrical pole impact on a rectangular beam fixed at both ends. Results show that HKG-LSM performs well in terms of convergence speed and hence represents a valuable option in real-world applications with limited computational resources.

E. Raponi (B) · S. Boria
School of Sciences and Technology, Department of Mathematics, University of Camerino, Camerino, Italy
e-mail: [email protected]

M. Bujny · M. Olhofer
Honda Research Institute Europe GmbH, Offenbach am Main, Germany

F. Duddeck
Department of Civil, Geo and Environmental Engineering, Technical University of Munich, Munich, Germany

© Springer Nature Switzerland AG 2021
J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_3


Keywords Topology optimization · Crashworthiness optimization · Hybrid methods · Surrogate modeling · Kriging · Evolution strategies · Level set method · Moving morphable components

1 Introduction Vehicle safety is one of the most studied research areas in automotive engineering over recent decades. In order to test mechanical components, many physical experiments are needed. These are characterized by prohibitive times and costs, and hence particular attention has to be paid to the design phase of structures. Computer-Aided Design (CAD) methods and numerical simulations with the Finite Element Method (FEM) have become a standard for the design and analysis of mechanical structures, leading to a rapid development of Structural Topology Optimization (TO) [5]. TO is a well-developed discipline that aims to generate optimal component structures through changing material distribution in a given design space under defined boundary conditions (supports, forces). It is currently applied in many engineering fields, however the most known TO approaches for crashworthiness applications might be questioned. In fact, they use simplifications that frequently do not take into account essential aspects of the crash problem, e.g. nonlinearities, numerical noise, and discontinuities of the objective functions to be optimized. Such approaches can be collected in the following categories: Equivalent Static Loads method [9, 22], Ground Structure Approaches [30], Bubble and Graph/Heuristic-based Approaches [11, 28], Hybrid Cellular Automata methods [10, 27], and State-based representation approaches [3, 4]. In this context, more general optimizers which do not require any gradient or sensitivity information to carry out the optimization procedure deserve special attention. In particular, Evolution Strategies (ESs) (e.g., the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [18]) and the Efficient Global Optimization (EGO) [13] demonstrated to be valid alternatives. 
On the one hand, ESs are optimization techniques based on the iterative creation of generations of individuals, corresponding to different points of the search space, which only need the direct evaluation of the objective function to move forward towards the optimum. Compared to other global optimization techniques, these algorithms are easy to implement and very often provide adequate solutions, showing many advantages such as simplicity, robust responses to changing circumstances and flexibility. However, they normally require thousands of calls to the high-fidelity analysis codes to locate a near optimal solution, leading to very high numerical costs, which are proportional to the problem dimensionality. On the other hand, EGO allows for replacing the direct optimization of the computationally expensive model by an iterative process composed of the creation, optimization and update of a model approximating the original function, referred to as surrogate model. This model is much cheaper to run and can be used to perform many more evaluations during the optimization process. Starting from a limited set


of points defined in the problem domain and the evaluation of the objective function on such a set, a first approximation is constructed. Afterwards, new promising points are selected and the model is updated according to them. Nonetheless, surrogatebased optimization shows some limitation with the increasing of dimensionality and iterations of the algorithm, turning out to have poorer exploitive capabilities than ESs [33]. One attempt to combine the positive aspects of EGO and CMA-ES at different stages of the optimization process was made by Raponi et al. [34], who proposed the Hybrid Kriging-assisted Level Set Method (HKG-LSM) for Structural Topology Optimization. This strategy is composed of three major building blocks: – Level Set Method (LSM) [1, 15, 29], which defines the material distribution changes based on a low-dimensional implicit parametrization of the material boundaries by means of local Level Set Functions (LSFs), also referred to as Movable Morphable Components (MMCs) [14]; – Kriging-guided Level Set Method (KG-LSM) [32, 33], which constructs a cheap approximating model according to an initial set of training points. Based on the data points and the committed error in the approximation, it updates the model iteratively and allows for obtaining a trade-off between exploration and exploitation of the domain space, which results in a fast convergence of the algorithm towards promising areas; – Evolutionary Level Set Method (EA-LSM) for crash Topology Optimization [6, 7], embedding CMA-ES, to which the proposed optimization algorithm switches when no improvement on the optimal design is found within a prescribed number of evaluations. Its main role is to investigate the detected good region and further exploit it in order to locally optimize the current design. This research is an extension of the aforementioned work [33], aiming to validate the HKG-LSM algorithm in optimization problems characterized by a higher dimensionality. 
Moreover, it applies the algorithm to a crashworthiness scenario for the first time. More precisely, HKG-LSM is here evaluated on a static cantilever beam, as well as on a dynamic pole intrusion test case, for a fixed budget of FEM evaluations. This choice is dictated by the fact that in many industrial applications, FEM evaluations are very expensive and, as such, limited. The paper is structured as follows. Section 2 describes the implicit parametrization used to represent the problem and the geometry mapping to the mechanical model. In Sect. 3, the addressed optimization problem is defined, together with its constraints, which are necessary in order to produce physically consistent and well-performing structures. In Sect. 4, the resolution strategy consisting of the combination of KG-LSM with EA-LSM is presented. Section 5 illustrates the static and dynamic test cases used to evaluate the proposed approach, while the experimental campaign and its results are discussed in Sects. 6 and 7, respectively. Conclusions are drawn at the end in Sect. 8.


E. Raponi et al.

2 Problem Representation

The HKG-LSM is designed for the topology optimization of crash structures. Therefore, it is aimed at easily handling the unpredictability, noisiness, discontinuities and lack of gradient information that characterize crashworthiness design optimization problems. Since in structural mechanics optimal topologies frequently consist of ensembles of interconnected beams, on the basis of previous studies [7, 14, 33], the mechanical structure consists of a set of MMCs whose boundaries are implicitly defined by iso-contours of an LSF.

2.1 Parametrization

The proposed parametrization was introduced in previous works [7, 33]. It is based on a global LSF $\Phi$ defined as:

$$\begin{cases} \Phi(u) > 0, & u \in \Omega, \\ \Phi(u) = 0, & u \in \partial\Omega, \\ \Phi(u) < 0, & u \in D \setminus \Omega. \end{cases} \tag{1}$$

Here, $\Omega$ denotes the region of the domain $D$ occupied by material, while $D \setminus \Omega$ is the complementary set, occupied by void. The interface between material and void is given by $\partial\Omega$. Hence, in this approach intermediate densities do not occur for any $u = (x, y)^T \in D$. A direct representation of this global LSF is difficult for complex geometries; therefore, it is constructed from a set of local LSFs given by:

$$\begin{cases} \varphi_i(u) > 0, & u \in \Omega_i, \\ \varphi_i(u) = 0, & u \in \partial\Omega_i, \\ \varphi_i(u) < 0, & u \in D \setminus \Omega_i, \end{cases} \tag{2}$$

where $\varphi_i$ is the local LSF of the $i$th component occupying the design domain subregion $\Omega_i$. The total region of the domain $D$ is hence constructed as the union of $e$ elementary components:

$$\Omega = \bigcup_{i=1}^{e} \Omega_i. \tag{3}$$

To reduce the number of parameters, we propose, inspired by Guo et al. [14], to use beams as local basis functions having the mathematical form:


Fig. 1 Structural component details [7]: a component parametrization, b corresponding local LSF, where negative values are set to zero [33]

$$\varphi_i(u) = -\left[\left(\frac{\cos\theta_i\,(x - x_{0i}) + \sin\theta_i\,(y - y_{0i})}{l_i/2}\right)^m + \left(\frac{-\sin\theta_i\,(x - x_{0i}) + \cos\theta_i\,(y - y_{0i})}{t_i/2}\right)^m - 1\right], \tag{4}$$

where u = (x, y)ᵀ is a point of the two-dimensional domain D = ℝ², and (x₀, y₀) denotes the position of the center of the component with length l and thickness t; in addition, the beam component is oriented inside the domain by the rotation angle θ, see Fig. 1a. The integer m is a method parameter to be chosen by the user; it should be even and relatively large. In this study, m = 6 is used. Varying the local LSF parameters makes it possible to move, shrink, dilate, and rotate a single beam component. Figure 1b depicts a three-dimensional plot of the ith local LSF corresponding to the definition given in Eq. (4).

2.2 Geometry Mapping

For Level Set Topology Optimization it is essential to choose an appropriate mapping from the geometry representation defined by the LSF to the mechanical, i.e. computational, model. Here, motivated by its simplicity, a density-based geometry mapping is used:

$$E(u) = \rho(u)\,E_0, \qquad 0 \le \rho(u) \le 1, \tag{5}$$

where $\rho(u)$ is the density at the point $u \in D$ and $E_0$ is the reference quantity of the elasticity tensor. With this approach, the density $\rho(u)$ at position $u$ can be obtained from the LSF $\Phi(u)$ via:

$$\rho(u) = H(\Phi(u)). \tag{6}$$


Fig. 2 Combination of local LSFs: a illustrative structural layout, b plot of the global LSF, where negative values are set to zero [33]

Note that the global LSF is taken here as the maximum of the local ones at position $u$:

$$\Phi(u) = \max\left(\varphi_1(u), \varphi_2(u), \ldots, \varphi_e(u)\right). \tag{7}$$

Here, $H(x)$ is the Heaviside function, which is defined as:

$$H(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{if } x \ge 0. \end{cases} \tag{8}$$

Figure 2 shows an example of the composition of local LSFs, resulting in the global LSF defined by Eq. (7).
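The parametrization of Eqs. (4)–(8) can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' code; the function and parameter names (`local_lsf`, `components`, etc.) are chosen here for clarity.

```python
import math

def local_lsf(x, y, comp, m=6):
    """Local level-set function of one beam component, Eq. (4).

    comp = (x0, y0, l, t, theta): center, length, thickness, rotation.
    Positive inside the beam, zero on its boundary, negative outside.
    """
    x0, y0, l, t, theta = comp
    # Rotate the point into the component's local frame.
    xr = math.cos(theta) * (x - x0) + math.sin(theta) * (y - y0)
    yr = -math.sin(theta) * (x - x0) + math.cos(theta) * (y - y0)
    return -((xr / (l / 2)) ** m + (yr / (t / 2)) ** m - 1)

def global_lsf(x, y, components):
    """Global LSF as the maximum over the local ones, Eq. (7)."""
    return max(local_lsf(x, y, c) for c in components)

def density(x, y, components):
    """Heaviside mapping of the LSF to a 0/1 density, Eqs. (6) and (8)."""
    return 1.0 if global_lsf(x, y, components) >= 0 else 0.0

# One horizontal beam centered at the origin: length 4, thickness 1.
beam = (0.0, 0.0, 4.0, 1.0, 0.0)
print(density(0.0, 0.0, [beam]))   # center of the beam -> material (1.0)
print(density(3.0, 0.0, [beam]))   # beyond the beam tip -> void (0.0)
```

Evaluating `density` over a FEM grid yields the 0/1 material field that Eq. (5) then scales into the element stiffness.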

3 Optimization Problem and Constraints

In this work, an optimization problem of the following form is considered:

$$\begin{aligned} \min_{x} \quad & f_{\mathrm{obj}}(x), \\ \text{s.t.} \quad & KU = F \;/\; r(t) = 0, \\ & g(x) \le 0, \quad x \in \mathbb{R}^k, \end{aligned} \tag{9}$$

where $f_{\mathrm{obj}}$ is the objective function that is minimized, $KU = F$ denotes the equilibrium condition for the static test case, $r(t) = 0$ represents the equilibrium condition at time $t$ when a dynamic test case is considered, and $g$ is the inequality constraint. The objective function is evaluated on the vector $x \in \mathbb{R}^k$ of design variables, which collects all the parameters defining the LSM basis functions. In the static test cases, $f_{\mathrm{obj}}$ is the compliance (i.e., the inverse of the stiffness) of a cantilever beam, while the pole intrusion is chosen as objective function in the crash scenario. The optimization is carried out while respecting the constraint $g(x) = V(x) - V_{\mathrm{req}} \le 0$, which requires the volume $V$ of the analyzed structure not to exceed


Fig. 3 Violation of the connectivity constraint regarding a the connection to the support, b the connection to the load, c the connection of the structure to itself. The distances preventing the design from fulfilling the connectivity criteria are represented by the dashed lines [33, 34]

a prescribed limit $V_{\mathrm{req}}$ equal to 50% of the design domain volume. Throughout the optimization process, disconnected structures can be generated, as illustrated in Fig. 3. Therefore, in order to obtain physically consistent structural layouts, a connectivity constraint is also considered. It ensures that the optimized structures make sense from the physical point of view, while the volume constraint is imposed in order to make the structures fulfill the industrial requirement of limited mass. When the static cantilever beam test case is analyzed, each material distribution is required to be connected to the support on the left-hand side of the domain and to the load the structure is subjected to on the right-hand side, and the structure has to be connected in itself, i.e. a material path from the support to the load has to exist. Figure 3 shows three material distributions, each violating the connectivity constraint according to the first, second and third criterion, respectively. In the dynamic pole intrusion test case, on the other hand, the connection to the load is replaced by the connection of the structure to the right-hand side support. Depending on the phase of the optimization procedure, i.e. on the sub-algorithm that is currently used, two different techniques are employed to handle these constraints; they are described in Sect. 4.2.
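As a rough illustration of the two constraints on a discretized density field, the sketch below checks the volume bound and a support-to-load material path. The grid-based BFS is a simplification of the paper's graph-based connectivity check (Fig. 4); all names are hypothetical.

```python
from collections import deque

def volume_violation(grid, v_req=0.5):
    """Volume constraint g = V - V_req (Sect. 3); feasible when g <= 0.

    grid: 2D list of 0/1 densities; V is the material volume fraction.
    """
    n = sum(len(row) for row in grid)
    v = sum(map(sum, grid)) / n
    return v - v_req

def connected(grid, start_cells, target_cells):
    """BFS over material cells: is some start cell linked to a target cell?

    Illustrates the connectivity criteria of Sect. 3 (support -> load);
    the paper itself works on a graph mapping of the beam layout (Fig. 4).
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque(c for c in start_cells if grid[c[0]][c[1]])
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if (r, c) in target_cells:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

grid = [[1, 1, 1],
        [0, 0, 0],
        [1, 0, 1]]
print(volume_violation(grid))               # 5/9 - 0.5 > 0: slightly infeasible
print(connected(grid, {(0, 0)}, {(0, 2)}))  # top row is a material path: True
print(connected(grid, {(2, 0)}, {(2, 2)}))  # bottom corners disconnected: False
```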

4 Resolution Strategy

In the following, the principles of the optimization approach proposed in this paper are described. The main idea is to realize a new hybrid strategy for the optimization of structures by sequentially combining the following two steps:

– the EGO method within a Kriging-guided Level Set Method (KG-LSM) and
– the CMA-ES approach incorporated in an Evolutionary Level Set Method (EA-LSM).

This hybridization, named here HKG-LSM, is motivated by the high efficiency of EGO in the early stage of the optimization for design space exploration via the infill procedures, and by the superior performance of CMA-ES for exploitation in the later phase (local search and convergence).


4.1 Optimization Algorithm

The optimization approach proposed here, i.e. the HKG-LSM, is especially adapted to efficiently solve structural design problems. To illustrate the method, it is first applied to linear, static cantilever beam test cases; then, in the final part of this paper, it is used for the optimization of a nonlinear, transient crash case. The first sub-algorithm of the HKG-LSM is based on a redraft of the EGO method as proposed in [2, 20], applied to a Kriging surrogate model. Then, the approach switches to the second sub-algorithm, the CMA-ES, to refine the best structure obtained by the first sub-algorithm; an outline of the developed optimization method is provided in Algorithm 1.

Data: initial data set (X, Y) with n sample points, where X = {x⁽¹⁾, x⁽²⁾, ..., x⁽ⁿ⁾}ᵀ
i := 1;
while i ≤ n do
    check feasibility of point x⁽ⁱ⁾;
    if x⁽ⁱ⁾ infeasible then
        X := X \ {x⁽ⁱ⁾};
    end
    i := i + 1;
end
y* := min(Y);  x* := x ∈ X : y(x) = y*;
t := 0;  c := 1;
while t < nmax and c < cmax do
    fit Kriging to available data (X, Y);
    find infill point: p := argmax_x EICD(x) (Eq. (25));
    update sample set: X := X ∪ {p};
    update response set: Y := Y ∪ {y(p)};
    if min(Y) < y* then
        c := 1;  y* := min(Y);  x* := x ∈ X : y(x) = y*;
    else
        c := c + 1;
    end
    t := t + 1;
end
initialize CMA-ES parameters (σ, C);
initialize parent design: x = x*;
run CMA-ES.

Algorithm 1: HKG-LSM optimization algorithm [34]. The reference algorithm for the CMA-ES can be found in the work by Hansen [17].


The procedure starts with the Design of Experiments (DoE) [13], generating samples in the design space and selecting a set of training points at which the objective function is computed using the high-fidelity model (i.e. the finite element model). For this, Optimal Latin Hypercube Sampling (OLHS) [12] is used, which leads to a training dataset X = {x⁽¹⁾, x⁽²⁾, ..., x⁽ⁿ⁾}ᵀ with observed responses y = {y⁽¹⁾, y⁽²⁾, ..., y⁽ⁿ⁾}ᵀ. Then, the feasibility w.r.t. the volume and connectivity constraints is checked for all points, and a Kriging surrogate model is constructed and iteratively updated. As soon as the maximum number of iterations without design improvement is reached, the second stage is started, i.e. the overall optimizer progresses from the Kriging-based strategy to the CMA-ES-based approach. The parent design for the latter is set to the best structure found in the first optimization stage.
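The control flow of Algorithm 1 can be sketched as follows. The Kriging infill and the CMA-ES stage are replaced by hypothetical stand-ins (a Gaussian perturbation of the current best and an optional `local_refine` callback), so this illustrates only the bookkeeping (feasibility filter, improvement counter c, switch after cmax stalls), not the authors' implementation.

```python
import random

def hkg_lsm(objective, feasible, doe, n_max=50, c_max=10, local_refine=None):
    """Skeleton of Algorithm 1 with placeholder components."""
    # Stage 0: discard infeasible DoE points (first loop of Algorithm 1).
    xs = [x for x in doe if feasible(x)]
    ys = [objective(x) for x in xs]
    y_best = min(ys)
    x_best = xs[ys.index(y_best)]

    # Stage 1: EGO-like infill loop; c counts iterations w/o improvement.
    t, c = 0, 1
    while t < n_max and c < c_max:
        # Stand-in for the EICD-maximizing infill point of Eq. (25).
        p = [xi + random.gauss(0.0, 0.1) for xi in x_best]
        yp = objective(p)
        xs.append(p); ys.append(yp)
        if yp < y_best:
            c, y_best, x_best = 1, yp, p   # reset the stall counter
        else:
            c += 1
        t += 1

    # Stage 2: switch to the exploitation stage (CMA-ES in the paper).
    if local_refine is not None:
        x_best = local_refine(x_best)
    return x_best, y_best

random.seed(0)
doe = [[random.uniform(-2, 2) for _ in range(3)] for _ in range(20)]
x_opt, y_opt = hkg_lsm(lambda v: sum(t * t for t in v), lambda v: True, doe)
print(y_opt <= min(sum(t * t for t in p) for p in doe))  # never worse than DoE
```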

4.1.1 Kriging Model

The Kriging surrogate model (for details see e.g. [2, 8, 13, 21]) is used to predict values of an unknown response function at a new point by computing a weighted average of the already known values of the function in the vicinity of that point. Starting from a set of sample points $X$ with observed responses $y$, the value of the objective function at a new location $x$ is approximated by the value from the surrogate. In Kriging, the training data are interpreted as results of a stochastic process, described by a set of random vectors of the form $Y(x) = (Y(x^{(1)}), \ldots, Y(x^{(n)}))^T$ with mean $1\mu$, where $1$ is an $n \times 1$ column vector of ones. In addition, the correlation between each pair of random variables is described via a basis function expression:

$$\mathrm{cor}\left[Y(x^{(i)}), Y(x^{(l)})\right] = \exp\left(-\sum_{j=1}^{k} \theta_j \left|x_j^{(i)} - x_j^{(l)}\right|^2\right), \tag{10}$$

where $\theta$ is a vector parameter allowing the width of the basis function to differ from variable to variable. The correlations depend on the absolute distance between the sample points and on the parameters $\theta_j$, which are estimated by maximizing the likelihood of the observed data $y$ [13]:

$$L = \frac{1}{(2\pi\sigma^2)^{n/2}\,|\Psi|^{1/2}} \exp\left(-\frac{(y - 1\mu)^T \Psi^{-1} (y - 1\mu)}{2\sigma^2}\right). \tag{11}$$

After appropriate substitutions and simplifications, the natural logarithm of Eq. (11) is considered. By setting its derivatives to zero, the maximum likelihood estimates (MLEs) of the mean $\mu$ and the variance $\sigma^2$ are obtained:

$$\hat{\mu} = \frac{1^T \Psi^{-1} y}{1^T \Psi^{-1} 1}, \qquad \hat{\sigma}^2 = \frac{(y - 1\hat{\mu})^T \Psi^{-1} (y - 1\hat{\mu})}{n}, \tag{12}$$


where $\Psi$ is the correlation matrix between the random variables. Finally, the model correlation can be utilized to predict new values based on the observed data. By augmenting the model data with a new input $x$ and the corresponding output $\hat{y}$ and maximizing the likelihood of the augmented data, the prediction of the response at the new location, i.e. the mean value $\hat{y}$, and the committed prediction error $\hat{s}^2$ are determined:

$$\hat{y}(x) = \hat{\mu} + \psi^T \Psi^{-1} (y - 1\hat{\mu}), \tag{13}$$

$$\hat{s}^2(x) = \hat{\sigma}^2 \left[1 - \psi^T \Psi^{-1} \psi + \frac{\left(1 - 1^T \Psi^{-1} \psi\right)^2}{1^T \Psi^{-1} 1}\right]. \tag{14}$$

4.1.2 Efficient Global Optimization

After the selection and construction of the surrogate model by fitting it to the initial DoE data, the optimization process starts. Here, compared to any evolutionary algorithm, the advantage of choosing EGO as surrogate technique becomes apparent (especially for non-parallelized processes): each iteration consists of only a single call to the expensive high-fidelity model, namely the proposed infill point, which guides the search towards the optimum of the optimization problem and provides new high-fidelity data for an update of the surrogate. Among the available techniques for choosing the locations of the infill points, this work uses a modified version of Expected Improvement (EI) [13], taking into account the connectivity and the volume constraints as described in Sect. 4.2 [33]. Since one goal of the infill search is to improve on the best observed value so far ($y_{\min}$), the EI criterion looks for an infill position that maximizes the expected amount of improvement, defined as $I(x) = \max(y_{\min} - Y(x), 0)$, over the value of the objective function at the current optimum:

$$x_{\mathrm{infill}} = \underset{x}{\operatorname{argmax}}\; E[I(x)]. \tag{15}$$

In turn, the EI is defined as follows:

$$E[I(x)] = \begin{cases} (y_{\min} - \hat{y}(x))\,\Phi_Y\!\left(\dfrac{y_{\min} - \hat{y}(x)}{\hat{s}(x)}\right) + \hat{s}(x)\,\phi_Y\!\left(\dfrac{y_{\min} - \hat{y}(x)}{\hat{s}(x)}\right), & \text{if } \hat{s}(x) > 0, \\ 0, & \text{if } \hat{s}(x) = 0, \end{cases} \tag{16}$$

where $\Phi_Y$ and $\phi_Y$ are the Gaussian cumulative distribution function and probability density function, respectively. Equation (16) can be seen as the sum of two terms:


– the first term, proportional to (ymin − yˆ (x)), which controls the exploitative tendency of the search criterion; – the second term, proportional to sˆ (x), whose predominance would lead to an explorative choice of the new infill point. Therefore, the EI infill criterion represents a good trade-off between exploitation and exploration.
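Given the Kriging prediction ŷ(x) and error ŝ(x) from Eqs. (13)–(14), Eq. (16) is straightforward to evaluate. A minimal sketch using only the standard library (function names are illustrative):

```python
import math

def norm_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution, via erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(y_min, y_hat, s_hat):
    """Eq. (16): EI of a candidate with Kriging prediction (y_hat, s_hat)."""
    if s_hat <= 0.0:
        return 0.0
    z = (y_min - y_hat) / s_hat
    # Exploitation term + exploration term of Eq. (16).
    return (y_min - y_hat) * norm_cdf(z) + s_hat * norm_pdf(z)

# A candidate matching the current best but with high uncertainty still
# has positive EI, driven purely by the exploration term:
print(expected_improvement(1.0, 1.0, 0.5))
```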

4.1.3 Covariance Matrix Adaptation Evolution Strategy

CMA-ES [17] is a popular algorithm for global optimization problems. The $(\mu, \lambda)$-CMA-ES relies on iteratively sampling from and updating a multivariate normal density:

$$x_k^{(g+1)} \sim N\left(x_w^{(g)},\, \left(\sigma^{(g)}\right)^2 C^{(g)}\right), \quad k = 1, \ldots, \lambda, \tag{17}$$

where $g$ is the iteration counter, $\sigma^{(g)} \in \mathbb{R}^+$ is the mutation step size, which controls the step length, and $C^{(g)} \in \mathbb{R}^{n \times n}$, where $n$ is the dimensionality of the problem, is a covariance matrix. CMA-ES is a derandomized Evolution Strategy, which adapts the covariance matrix of the normal distribution on the basis of the previous search steps [19]. This matrix represents pairwise dependencies between the problem variables, and its update during the optimization procedure is of particular importance when dealing with ill-conditioned objective functions. Regarding Eq. (17), $\lambda$ is the offspring population size and $x_w^{(g)}$ is the recombination point, computed as the weighted mean of the selected individuals:

$$x_w^{(g)} = \sum_{i=1}^{\mu} w_i\, x_{i:\lambda}^{(g)}, \tag{18}$$

with $w_i > 0$ for all $i = 1, \ldots, \mu$ and $\sum_{i=1}^{\mu} w_i = 1$. The indexing $i:\lambda$ denotes the $i$th best individual. The mutation parameters, i.e. the step size $\sigma^{(g)}$ and the covariance matrix $C^{(g)}$, are automatically tuned by the algorithm. This happens in two steps: first, the global step size undergoes an adaptation process; afterwards, $C^{(g)}$ is updated according to the evolution path, the $\mu$ weighted difference vectors of the newly selected parents, and the last recombination point [16].
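A stripped-down sketch of the sampling and weighted recombination of Eqs. (17)–(18). For illustration, the covariance matrix is fixed to the identity and the step-size adaptation is replaced by a crude geometric decay, so this is not the full CMA-ES of Hansen [17]; all names are illustrative.

```python
import math
import random

def weighted_es(objective, x0, sigma0, mu=5, lam=10, iters=60, seed=1):
    """Simplified (mu, lam)-ES with weighted recombination, Eq. (18).

    Offspring are sampled from an isotropic normal around the
    recombination point x_w (Eq. (17) with C = identity); the covariance
    and cumulative step-size adaptation of CMA-ES are omitted.
    """
    rng = random.Random(seed)
    n = len(x0)
    # Log-rank recombination weights, normalized so sum(w) = 1.
    raw = [math.log(mu + 0.5) - math.log(i + 1) for i in range(mu)]
    w = [r / sum(raw) for r in raw]
    xw, sigma = list(x0), sigma0
    for _ in range(iters):
        pop = [[xi + sigma * rng.gauss(0, 1) for xi in xw] for _ in range(lam)]
        pop.sort(key=objective)      # rank offspring; pop[i] is the i-th best
        # Eq. (18): weighted mean of the mu best individuals.
        xw = [sum(w[i] * pop[i][j] for i in range(mu)) for j in range(n)]
        sigma *= 0.95                # crude decay instead of real adaptation
    return xw

x = weighted_es(lambda v: sum(t * t for t in v), [2.0, 2.0], 1.0)
print(sum(t * t for t in x) < 1.0)  # close to the sphere optimum at 0
```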

4.1.4 Combining EGO and CMA-ES in the HKG-LSM

The main contribution of this paper is to exploit the complementarity of CMA-ES and EGO to increase the overall efficiency of the global optimization of computationally expensive functions [26]. This is motivated by the insight that EGO constructs


approximations of high-fidelity models using the values of the expensive objective function at chosen training samples and improves the surrogate by adding new points to this low-fidelity model. In contrast, CMA-ES samples new points according to a multivariate normal distribution and converges towards the optimum by means of recombination and mutation. It is therefore natural to take advantage of the different potentials that the two methods exhibit at different stages of the optimization process. This leads to an algorithm that (i) starts with a DoE, as is characteristic of surrogate-based optimization techniques, and fits an initial Kriging approximation of the high-fidelity model; (ii) continues with the optimization of the Kriging model using the adaptive infill procedures from EGO; and (iii) switches to CMA-ES for final refinement and convergence as soon as no further improvement of the objective function value is observed for a prescribed number of iterations. The overall approach proposed in this work is referred to here as the Hybrid Kriging-assisted Level Set Method for Structural Topology Optimization. It may also be used outside the context of topology optimization, e.g. for shape, size and material optimization. To realize the sequential coupling proposed by HKG-LSM, a predefined constant value, cmax, is compared to c, the counter of iterations without any improvement of the optimum. If EGO finds a better design before cmax is reached, i.e. for c ≤ cmax, the first sub-algorithm is continued and the counter is reset. Otherwise the CMA-ES is started. The cmax parameter can be chosen according to the considered problem but, as discussed later in this work, it is convenient that the transition to CMA-ES occurs once the explorative capabilities of KG-LSM have been sufficiently exploited.
At the moment of the switch between the two sub-algorithms, the mutation step size σ of the CMA-ES optimizer is taken as the exponentially weighted average (EWA) of the differences in position between the unpromising infill points, i.e. those worse than the current best, and the current best design during the last cmax iterations of the KG-LSM, rescaled by a constant scalar α. This EWA is iteratively computed over the c ≤ cmax iterations as follows:

$$v_c = \begin{cases} x_{i,1}, & \text{if } c = 1, \\ \beta\, v_{c-1} + (1 - \beta)\, x_{i,c}, & \text{if } c > 1, \end{cases} \tag{19}$$

where the coefficient $\beta$ represents the degree of weighting decrease, a constant smoothing factor between 0 and 1, chosen in this work equal to 0.1 (a lower $\beta$ discounts older observations faster), $x_{i,c}$ is the value of the $i$th component of the infill design at iteration $c$, and $v_c$ is the value of the EWA at iteration $c$. Equation (19) applies weighting factors to the positions of the infill points so that the most recent candidates have greater influence on the step-size definition. This choice of the $\sigma$ parameter aims to take into account the predominant tendency of the KG-LSM to explore/exploit the design space during the unsuccessful search for a new current optimum within the prescribed number of iterations $c_{\max}$, leading to a larger/smaller mutation step size for the ES as soon as the transition occurs.
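Equation (19) is a one-line recurrence; a small sketch for a single component (illustrative names):

```python
def ewa(values, beta=0.1):
    """Exponentially weighted average of Eq. (19), per component.

    With v_c = beta * v_{c-1} + (1 - beta) * x_c, a small beta lets the
    most recent values dominate (older ones are discounted faster).
    """
    v = values[0]                       # v_1 = x_{i,1}
    for x in values[1:]:
        v = beta * v + (1.0 - beta) * x  # Eq. (19) for c > 1
    return v

print(ewa([1.0, 2.0, 4.0]))  # ≈ 3.79, dominated by the most recent value
```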


Fig. 4 One-to-one correspondence between the beam structure and its graph representation [33]

4.2 Constraint Handling Techniques

During the DoE phase of the optimization algorithm, infeasible sample points are automatically discarded and only the feasible ones are used to construct the surrogate approximation. Different techniques, on the other hand, are used to deal with the connectivity and volume constraints during the infill procedure.

4.2.1 Expected Improvement for Connected Designs

The Expected Improvement for Connected Designs (EICD) is a variant of the standard EI, defined to ensure the connectivity of the promising candidates during the infill procedure [32, 33]. The main idea is to penalize disconnected designs according to their level of infeasibility. By means of a mapping to a two-dimensional graph representation (Fig. 4), when disconnected designs are encountered during the maximization of the EI by Differential Evolution (DE) [36], the algorithm computes a penalty $P$, which takes into account the amount of violation of the connectivity constraint:

$$P = \gamma\,(P_1 + P_2 + P_3), \tag{20}$$

where each $P_i$, for $i = 1, 2, 3$, is the minimum extra distance computed according to the 1st, 2nd or 3rd type of disconnection presented in Sect. 3, and $\gamma$ is a suitable penalty factor. This penalty is used to modify the EI criterion of Eq. (16) as follows:

$$\mathrm{EICD}(x) = \begin{cases} E[I(x)], & \text{if } x \text{ is connected}, \\ -P(x), & \text{if } x \text{ is disconnected}. \end{cases} \tag{21}$$


Here, the introduction of the penalty P is dictated by the necessity to avoid the creation of large flat areas in the landscape of the optimization problem, which might cause stagnation of the DE search. As a result, the designs infeasible according to the connectivity constraint are automatically discarded in the maximization of Eq. (21). Therefore, since the convergence speed is measured in terms of evaluations, the proposed strategy does not evaluate the objective function on disconnected designs and hence it speeds up the convergence towards the optimum.
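Equations (20)–(21) can be sketched as follows; `ei`, `penalties`, and `gamma` are illustrative placeholders for the EI value, the three extra distances, and the penalty factor:

```python
def eicd(ei, is_connected, penalties, gamma=100.0):
    """Eqs. (20)-(21): EI for connected designs, negative penalty otherwise.

    ei: EI value of the candidate; is_connected: result of the graph-based
    feasibility check (Fig. 4); penalties: (P1, P2, P3), the extra distances
    of the three disconnection types of Sect. 3. gamma is illustrative.
    """
    if is_connected:
        return ei
    return -gamma * sum(penalties)  # -P, with P = gamma * (P1 + P2 + P3)

print(eicd(0.7, True, (0.0, 0.0, 0.0)))     # connected: plain EI -> 0.7
print(eicd(0.7, False, (0.2, 0.0, 0.1)))    # disconnected: strongly negative
```

Because disconnected designs always receive a negative score graded by their violation, the DE maximizer is pushed back towards connected layouts without the flat-landscape stagnation the text describes.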

4.2.2 Constrained Expected Improvement

In this study, the Constrained Expected Improvement (CEI) [13] is used to generate points that satisfy a prescribed volume limit. At each iteration of the optimization strategy, when the Kriging surrogate is fitted on the previous training set and the EI function is computed, an approximation of the function defining the volume constraint is also constructed. Let $Y(x) = N(\hat{y}(x), \hat{s}^2(x))$ be the Kriging model for the objective function. Then, the surrogate model for the constraint is:

$$G(x) = N\left(\hat{g}(x), \hat{\sigma}^2(x)\right), \tag{22}$$

where $\hat{g}(x)$ is the prediction of the constraint function and $\hat{\sigma}^2(x)$ is the variance of the constraint model. At this stage, the probability that the design satisfies the volume constraint (Probability of Feasibility, PF) is computed as:

$$P[F(x)] = \frac{1}{\hat{\sigma}(x)\sqrt{2\pi}} \int_0^{\infty} \exp\left(-\frac{(F - \hat{g}(x))^2}{2\hat{\sigma}^2(x)}\right) dF, \tag{23}$$

where $F = G(x) - V_{\mathrm{req}}$ is the measure of feasibility. Hence, the probability that a feasible infill point improves on the best value so far can be computed by maximizing the product of EI and PF:

$$x_{\mathrm{infill}} = \underset{x}{\operatorname{argmax}}\; \mathrm{CEI}(x) = \underset{x}{\operatorname{argmax}}\; E[I(x)]\, P[F(x)], \tag{24}$$

where the product is justified by the independence of the models. In order to consider both the connectivity and volume constraints, the coupling between EICD and CEI is obtained by applying the connectivity check in the maximization of EI·PF, leading to the following combined definition:

$$\mathrm{EICD}(x) = \begin{cases} \mathrm{CEI}(x), & \text{if } x \text{ is connected}, \\ -P(x), & \text{if } x \text{ is disconnected}. \end{cases} \tag{25}$$
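A sketch of the CEI machinery of Eqs. (22)–(24). Here the probability of feasibility is evaluated in the standard closed form for a Gaussian prediction of a "$\le 0$" constraint, which is how Eq. (23) is usually computed in practice; all names are illustrative:

```python
import math

def prob_feasible(g_hat, s_g):
    """Probability of feasibility P[F(x)] for a constraint G(x) <= 0,
    given the constraint surrogate N(g_hat, s_g^2) of Eq. (22)."""
    if s_g <= 0.0:
        return 1.0 if g_hat <= 0.0 else 0.0
    return 0.5 * (1.0 + math.erf(-g_hat / (s_g * math.sqrt(2.0))))

def cei(ei, g_hat, s_g):
    """Eq. (24): constrained EI as the product EI * PF (independent models)."""
    return ei * prob_feasible(g_hat, s_g)

print(prob_feasible(0.0, 1.0))                   # on the boundary -> 0.5
print(cei(0.5, -0.2, 0.1) > cei(0.5, 0.2, 0.1))  # likely-feasible point wins
```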


4.2.3 Exterior Penalty Method

The preceding techniques are appropriate for the Kriging-based EGO. In the second part of the optimization process, when CMA-ES comes in, another method to handle the constraints has to be introduced: the Exterior Penalty Method [31], which is commonly used by optimization strategies involving evolutionary algorithms. The mechanism of penalty functions is the most uniformly applicable approach to deal with constrained designs. It works by penalizing the objective function value every time a constraint is not satisfied. The objective function therefore takes the following form:

$$f(x) = f_{\mathrm{obj}}(x) + \gamma \cdot \max(0, g(x)), \tag{26}$$

where $f_{\mathrm{obj}}$ is the objective function to be minimized in the original problem, $g$ is a nonlinear constraint and $\gamma$ denotes a penalty constant. This leads to a modification of the objective function value that automatically rejects the penalized design. The penalty is usually taken as a very large constant value, in order to ensure immediate rejection of an infeasible design. The penalized objective function is referred to as the cost function. When using surrogate approximations, adding such a large value is not the best approach to avoid infeasible designs: it would lead to severe discontinuities in the shape of the penalized fitness function, and such modifications would artificially and severely bias the surrogate approximation.
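A minimal sketch of Eq. (26), with toy stand-ins for the objective and the constraint (the real ones in the paper are the compliance/intrusion and the volume constraint of Sect. 3):

```python
def penalized_cost(f_obj, g, x, gamma=1e6):
    """Exterior penalty method, Eq. (26): cost = f_obj + gamma * max(0, g)."""
    return f_obj(x) + gamma * max(0.0, g(x))

# Toy stand-ins: quadratic objective, constraint feasible iff x <= 2.
f = lambda x: (x - 1.0) ** 2
g = lambda x: x - 2.0

print(penalized_cost(f, g, 1.5))        # feasible: plain objective, 0.25
print(penalized_cost(f, g, 3.0) > 1e6)  # infeasible: dominated by the penalty
```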

5 Test Case

First, the proposed HKG-LSM optimization strategy is evaluated on a linear elastic test case at two different dimensionalities; this case is characterized by low computational costs and is already sufficient to draw some conclusions about the potential of the approach. The test is performed on the standard cantilever beam benchmark problem, whose structure is optimized to minimize compliance. Afterwards, the algorithm is applied to a dynamic transverse bending test case, for which standard topology optimization techniques are not suitable. The previously developed KG-LSM [32, 33] and the EA-LSM using CMA-ES [7, 17] are taken as references to compare the convergence properties and the optimized designs resulting from the proposed approach.

5.1 Linear Elastic Case

The first case study is the standard cantilever beam benchmark test case, whose structure is optimized to minimize compliance. As shown in Fig. 5a, the beam is fixed to a support at the left-hand side and a static unit load is applied in the middle


Fig. 5 Cantilever beam test case: a problem definition and b CalculiX FEM mesh [33]

Table 1 Cantilever beam test case settings

Property                  Symbol  Value                    Unit
Beam mass density         ρ       7.85 · 10³               kg/m³
Young's modulus           E       2.1 · 10⁵                MPa
Poisson's ratio           ν       0.3                      –
Load                      F       1                        N
Required volume fraction  Vf      50%                      –
Mesh resolution           –       100 × 50                 –
Element type              –       Four-node shell element  –
Solver                    –       CalculiX 2.9             –

of the right-hand side. The domain dimensions are 20 × 10 mm. The LSF is here mapped to a CalculiX¹ FEM mesh (Fig. 5b), where a very low density (1% of the material density) is assigned to the areas occupied by void. The domain is discretized with 5000 four-node square shell (S4R) finite elements, arranged in a 100 × 50 grid. The test settings are shown in Table 1. In order to validate the presented method, the 6-beam diagonal layout depicted in Fig. 6 is used. This layout is employed to evaluate the method first in a 9-variable environment [34], where the x and y coordinates of the beams' barycenters, as well as their orientations, are free to vary during the optimization, and afterwards in a 15-variable test, where all the parameters defining the local LSFs are allowed to change. Symmetry of the design with respect to the horizontal symmetry axis of the domain is assumed throughout the optimization process.

¹ CalculiX is an open-source, 3D structural FEM software developed at MTU Aero Engines in Munich. CalculiX, Version 2.13, was used in this work: http://www.calculix.de/.


Fig. 6 Reference layout of 6-beams for the static cantilever beam test case [33]

Fig. 7 Initial reference 6-beams configuration for the dynamic transverse bending test case [33]

Fig. 8 Transverse bending test case: a problem definition and b LS-Dyna FEM mesh [33]

5.2 Nonlinear Crash Case

The final aim of this study is to propose an algorithm that performs well in the continuous global optimization of expensive and multimodal problems, e.g. those associated with vehicle crashworthiness. The proposed application to a crashworthiness optimization problem consists of a 6-beam, 15-variable transverse bending test case (Fig. 8), starting from the reference layout shown in Fig. 7. As in the 15-variable static test case, all the parameters characterizing the moving morphable components are allowed to vary during the optimization process. However, in this case a symmetry condition with respect to the vertical symmetry axis of the domain is imposed. More precisely, the crash experiments involve a pole impact on a beam supported at both ends, as illustrated in Fig. 8a. The optimization task is to minimize the dynamic


Table 2 Configuration of the transverse bending test case

Property                  Symbol  Value                     Unit
Beam mass density         ρ       2.7 · 10³                 kg/m³
Young's modulus           E       7.0 · 10⁴                 MPa
Poisson's ratio           ν       0.33                      –
Yield strength            Re      241.0                     MPa
Tangent modulus           Etan    70.0                      MPa
Initial pole velocity     v       20                        m/s
Pole mass                 m       11.815                    kg
Pole diameter             D       139.154                   mm
LS-Dyna termination time  tend    1.5                       ms
LS-Dyna mesh resolution   –       80 × 20                   –
Solver                    –       LS-DYNA R7.1.1            –
Element type              –       Eight-node solid element  –

intrusion of the impactor into the structure, i.e., the intrusion preceding the elastic rebound. The LSF is mapped onto a reference LS-Dyna mesh, as illustrated in Fig. 8b. It is composed of 1600 eight-node solid elements with a piecewise linear plasticity material and is fixed in the z direction throughout the optimization procedure. In particular, an elasto-plastic material defined by an isotropic von Mises plasticity model with linear hardening is used [23, 24]. In order to ensure a physically correct crash behavior, the elements in the areas occupied by void are deleted from the mesh. The material properties are shown in Table 2.

6 Experimental Setup

This section describes the experimental setup used to test the proposed HKG-LSM and the obtained results. Since this study aims to extend the experimental evaluation presented by Raponi et al. [34], where the HKG-LSM is tested on a 6-beam, 9-variable configuration under static loading conditions, here the same algorithm is validated on a 15-variable configuration in both a static and a dynamic test case. The following strategies are compared for a total budget of 500 calls to the objective function:

– HKG-LSM-bias(n)-exp: the Hybrid Kriging-guided Level Set Method, where the switch from KG-LSM to CMA-ES is performed when the best observed objective value is not updated for n iterations; the step size for CMA-ES is chosen as the exponentially weighted average of the distances between the best structure and the worse infill points;
– KG-LSM: the Kriging-guided optimization method with initial DoE selection by removing the infeasible points and the EICD to handle both the connectivity and the volume constraint during the infill procedure. It is shown for comparison purposes;
– EA-LSM: using CMA-ES and taking the reference structure in Fig. 6 as the first parent design. Here, the exterior penalty method is used to drive the search towards feasible designs. It is shown for comparison purposes.

In the 9-variable static test case, the surrogate-based algorithms start from a 300-sample DoE. Three versions of the hybrid method are analyzed, which differ in the chosen bias n (n = 10, 30, 50) guiding the switch between the sub-algorithms. In the KG-LSM optimization algorithm, θL = [1E-3] ∗ 9 and θU = [1E3] ∗ 9 are chosen as user-defined bounds for the theta parameters to be optimized through Maximum Likelihood Estimation, as explained in Sect. 4.1.1. They characterize the correlations between the approximation variables, which are chosen as squared-exponential functions. A non-elitist (μ, λ)-CMA-ES strategy is used and initialized with a parent population of μ = 5 individuals, whereas λ = 10 is the offspring population size. This means that the best individuals are reselected at each iteration of the algorithm through a recombination-plus-mutation procedure. The mutation step is characterized by a normally distributed random vector defined by the initial mean m_init, while the standard deviation σ_init is chosen in this work as the exponentially weighted average described in Sect. 4.1.4. The static and dynamic 15-variable test cases are compared for the same budget of 500 evaluations of the objective function, starting from a 600-sample and a 200-sample DoE, respectively.
This choice was dictated by the fact that, in the crash test case, it is much easier to obtain feasible structures according to the connectivity constraint than in the static one, which requires the connection of the structural layout to the point where the load is applied on the right-hand side of the domain. In both cases, the same three algorithms are compared, with the only difference that HKG-LSM is tested with a switch bias fixed to 50 and 100. By analogy with the 9-variables test case, in the KG-LSM optimization algorithm θL = [1E-3] ∗ 15 and θU = [1E3] ∗ 15 are chosen as bounds for the theta parameter to be optimized through Maximum Likelihood Estimation, and a non-elitist (6, 12)-CMA-ES strategy is used.
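The stall-based switch criterion and the exponentially weighted step-size initialization described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the exponential weighting is assumed to decay geometrically with the age of the infill points (the exact scheme is defined in Sect. 4.1.4).

```python
import numpy as np

def switch_to_cmaes(best_so_far, n):
    """HKG-LSM-bias(n) criterion: switch from KG-LSM to CMA-ES once the
    best observed objective value has not improved for n iterations.
    best_so_far holds the best value recorded after each iteration."""
    return len(best_so_far) > n and best_so_far[-1] == best_so_far[-1 - n]

def sigma_init(x_best, infill_points, decay=0.9):
    """Initial CMA-ES step size: exponentially weighted average of the
    distances between the best structure and a set of infill points,
    weighting the most recent points most (the decay factor is a
    hypothetical choice)."""
    dists = np.array([np.linalg.norm(np.asarray(p) - np.asarray(x_best))
                      for p in infill_points])
    weights = decay ** np.arange(len(dists) - 1, -1, -1)
    return float(np.sum(weights * dists) / np.sum(weights))

# A run whose best value stalls for 3 iterations triggers the switch
assert not switch_to_cmaes([5.0, 4.0, 3.0, 2.0, 1.0], n=3)
assert switch_to_cmaes([5.0, 4.0, 4.0, 4.0, 4.0], n=3)
```

The criterion only needs the running record of best-so-far values, so it adds no extra objective-function evaluations to the budget.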

7 Results

In this section, the results obtained for each of the described test cases are shown. HKG-LSM is compared with the state-of-the-art CMA-ES and the novel KG-LSM in a constrained static environment for two different problem dimensionalities (Sects. 7.1 and 7.2.1), as well as in a crash scenario (Sect. 7.2.2). Due to the non-deterministic nature of the considered optimization algorithms, 30 optimization runs with different random seeds are performed for each of them.

7.1 9-Variables Linear Elastic Case

The comparison is first drawn on the static cantilever beam benchmark test case, in a 9-variables [34] and in a 15-variables configuration. The results obtained in each case are shown below. For each strategy, the average convergence of the compliance objective function in terms of evaluations is shown in Fig. 9. Here, the objective function is normalized with respect to the reference design for the 6-beams test case defined in Sect. 5.1. Figure 9 shows that when evaluation 280 is reached, CMA-ES seems more suitable for exploiting the best design obtained so far, leading to a better performance than KG-LSM. When observing the HKG-LSM strategies, the predominance of CMA-ES is not evident anymore. In fact, the hybrid techniques take advantage of the initial rapid decrease of the objective function resulting from the Kriging-based optimization and, when no improvement is detected within a certain number of iterations, the algorithm transitions to CMA-ES. The HKG-LSM-bias50-exp provides a better convergence trend than the other hybrid techniques. Indeed, an early switch to the CMA-ES might cause an excessive exploitation of a partially optimized design, increasing its probability of converging to a local optimum. On the other hand, a late transition from KG-LSM to CMA-ES has a better chance of bringing to the ES a design that is the result of a more explorative strategy, which did not focus on localized areas of the problem domain, but rather found optimal designs by searching emptier areas, characterized by a high variance of the surrogate model. Such considerations can be validated by the following analysis. According to Fig. 10, three main groups of material distributions, each representing a different (local) optimum obtained by the HKG-LSM-bias50-exp, can be distinguished. The frequencies with which they have been observed are 20%, 20%, and 40% for the first, second, and third optimum, respectively.
On the other hand, the same local optima could be detected after inspection of the results obtained for the HKG-LSM-bias10-exp. Here, the percentages drop to 7%, 7%, and 53%, respectively. Such data demonstrate that a premature switch to the CMA-ES might give preference to local optima, leading to a worse convergence trend for the minimization of the cost function. In general, from the final structures optimized by HKG-LSM (Fig. 10), it can be deduced that such methods allow sufficient flexibility for reaching designs that are not far from the theoretical one presented by Michell [25] for a 6-beams configuration, shown in Fig. 11. In fact, at this stage, the quality of the obtained beam configurations is affected by the impossibility of varying the length and thickness parameters of the beams. To conclude, further information about the optimized cost values can be found in Fig. 12, which presents the statistics for each method after 100, 250 and 500 evaluations. In order to compare the medians reached by the different optimization strategies, the Wilcoxon rank-sum test is used.



Fig. 9 Convergence of the compliance function averaged over 30 runs for the 9-variables linear elastic test case. HKG-LSM starts from a 300-samples DoE, with selection of feasible designs. The trends for the 10, 30 and 50-iterations bias for the switch are compared. The evaluations spent for the DoE phase are not shown in the plot since they are performed once for all the methods. The KG-LSM and the CMA-ES are also shown as a reference [34]

Fig. 10 Three main topology types obtained with the HKG-LSM-bias50-exp method and the frequency with which they have been developed in 30 optimization runs. a First optimum: 20%. b Second optimum: 20%. c Third optimum: 40% [34]

At evaluation 500, the null hypothesis that data from the different optimization methods have equal medians at the 5% significance level is not rejected when comparing HKG-LSM-bias30/50-exp with CMA-ES. This means that these strategies are comparable at the end of the considered maximum range of evaluations. Moreover, when evaluating the statistics at the beginning of the optimization process (after 100 evaluations), the same null hypothesis is rejected when comparing CMA-ES with the HKG-LSM strategies. Therefore, the HKG-LSM converges much faster than CMA-ES towards the optimum, confirming the promise of surrogate-based strategies when a strict budget of evaluations is imposed.
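The median comparison described above can be reproduced with SciPy's Wilcoxon rank-sum test; the cost samples below are synthetic stand-ins generated for illustration, not the paper's data.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
# Synthetic final cost values for 30 runs of two hypothetical methods
method_a = rng.normal(0.32, 0.02, 30)   # stand-in for HKG-LSM-bias50-exp
method_b = rng.normal(0.33, 0.02, 30)   # stand-in for CMA-ES

stat, p = ranksums(method_a, method_b)
# Equal medians are not rejected at the 5% level when p >= 0.05
print(f"p-value: {p:.3f}, reject at 5% level: {p < 0.05}")
```

With 30 runs per method the test is non-parametric and makes no normality assumption, which suits cost distributions produced by stochastic optimizers.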


Fig. 11 Michell theoretical model for a 6-beams static test case [34]

7.2 15-Variables Test Cases

Since this work is an extension of the study presented by Raponi et al. [34] on a static 9-variables test case, more experiments were carried out at a higher dimensionality in both a static and a dynamic environment. Here, the optimal layout is obtained by allowing the six MMCs to vary in x-position and y-position of the barycenter, orientation, length and thickness, and is hence defined by a vector of 15 parameters due to the symmetry condition. These scenarios are described below.

7.2.1 Linear Elastic Case

Although the dimensionality of the problem is increased, the same 6-beams configuration of the 9-variables test case (Fig. 6) is also used as reference design for the 15-variables static test case. For each compared optimization strategy, the average compliance over the 30 runs in terms of evaluations is shown in Fig. 13, where the objective function is normalized with respect to the compliance measured for the design in Fig. 6. By taking advantage of previous knowledge, the hybrid technique is tested in two variants, allowing the switch from KG-LSM to EA-LSM when no improvement on the best recorded value is obtained for 50 or 100 consecutive evaluations. Moreover, the convergence trends of the surrogate-based strategies start from an average value resulting from an initial DoE of 600 samples, where only the connected designs are selected and evaluated, while the infeasible ones are automatically discarded. From Fig. 13 it can be observed that, even if the dimensionality of the problem is increased, the hybrid strategies outperform KG-LSM and are preferable to CMA-ES within the prescribed budget of evaluations. In fact, while KG-LSM shows a better performance than CMA-ES until evaluation 200, HKG-LSM-bias50-exp and HKG-LSM-bias100-exp are characterized by a faster convergence at the beginning of the optimization process and comparable cost function values just at the end. A further remark on the convergence trends concerns the different behavior shown by

Fig. 12 Statistical evaluation of the optimization methods compared for the 6-beams 9-variables cantilever beam test case: box plots for 30 runs after a 100, b 250, and c 500 evaluations [34]

the hybrid strategies with respect to the previous test case. While in the 9-variables static test case a postponed switch leads to an overall better convergence towards the optimum, in the 15-variables static test case the switch after 100 evaluations with no improvement on the optimal layout is not preferable to the one after 50 evaluations. This might be due to a less explorative behavior of the KG-LSM algorithm, which does not produce significant updates in many of the 30 runs from the beginning of the process. In this way, many runs are stuck in the same local regions regardless of


Fig. 13 Convergence of the compliance function averaged over 30 runs for the 15-variables linear elastic test case. HKG-LSM starts from a 600-samples DoE, with selection of feasible designs. The trends for a 50 and 100-iterations bias for the switch are compared. The KG-LSM and the CMA-ES trends are also shown as a reference

the moment when the switch between KG-LSM and EA-LSM occurs, and hence do not benefit from a delayed switch. Further information about the optimized cost values can be found in Fig. 16a, which presents the statistics for each method after 100, 250 and 500 evaluations. Again, the Wilcoxon rank-sum test is used. At evaluation 500, the null hypothesis that data from the different optimization methods have equal medians at the 5% significance level is not rejected when comparing HKG-LSM-bias50/100-exp with CMA-ES, while it is rejected in the comparison of the hybrid strategy with KG-LSM. At evaluations 100 and 250, the same null hypothesis is rejected when comparing HKG-LSM-bias50-exp with all the other strategies. This implies that HKG-LSM is statistically significantly better than KG-LSM, while allowing for a faster convergence towards an optimum compared to CMA-ES. Finally, although the layouts resulting from each of the compared optimization strategies are not satisfactory (pointing out the need for a larger budget of evaluations in higher-dimensional problems), at evaluation 500 almost all the designs reached by CMA-ES can be classified into three main groups of material distributions, as illustrated in Fig. 14. The frequencies with which they have been observed are 13%, 30%, and 47% for the first, second, and third optimum, respectively, amounting in total to 90% of all the designs. On the other hand, the same local optima could be detected after inspection of the results obtained for the HKG-LSM-bias50-exp, with percentages dropping to 7%, 27% and 17%, respectively, while the


Fig. 14 Three main topology types obtained with the CMA-ES method and the frequency with which they have been developed in 30 optimization runs. a First optimum: 13%. b Second optimum: 30%. c Third optimum: 47%

other designs cannot be classified into categories. From these data it can be observed that, while CMA-ES is able to reach clearly defined local optima when used as a single algorithm, it has reduced explorative capabilities when used as a sub-algorithm in the hybrid strategies. As a consequence, taking advantage of the KG-LSM exploration as much as possible at the beginning is of the utmost importance. Therefore, an optimization study on the number of evaluations preceding the switch needs to be performed.

7.2.2 Nonlinear Crash Case

A 6-beams 15-variables problem is introduced in this work to evaluate the HKG-LSM capabilities in a constrained crashworthiness environment. The same optimization strategies compared in the 15-variables static test case are analyzed here. For each strategy, the average convergence of the intrusion objective function in terms of evaluations is shown in Fig. 15. In analogy with the static case, the objective function is normalized with respect to the reference design for the 6-beams test case defined in Sect. 5.2. The convergence trends of the surrogate-based strategies start from an average value resulting from an initial DoE of 200 samples, with selection of feasible designs. From Fig. 15, it is worth noting that, due to its ability to balance exploration and exploitation, KG-LSM already outperforms CMA-ES over the whole budget of available evaluations. However, it still makes sense to test the hybridization procedure due to the different convergence curves produced by the surrogate-based method and the evolution strategy. Indeed, while KG-LSM makes early progress and then loses efficiency, CMA-ES steadily converges to the minimum as the number of calls to the objective function increases. Therefore, a switch to the second strategy can be beneficial in view of a final tuning of an optimal configuration of beams. However, such an improvement on the KG-LSM performance was not observed in the studied test case. On the other hand, looking at the box plots in Fig. 16b and at the results from the statistical testing, it can be assessed that HKG-LSM is statistically significantly better than CMA-ES and comparable to KG-LSM throughout the optimization procedure. Therefore, in this crash scenario, where the previously introduced KG-LSM already


Fig. 15 Convergence of the intrusion function averaged over 30 runs for the nonlinear crash test case. HKG-LSM starts from a 200-samples DoE, with selection of feasible designs. The trends for a 50 and 100-iterations bias for the switch are compared. The KG-LSM and the CMA-ES trends are also shown as a reference

performs satisfactorily, the hybrid strategy does not worsen the optimization outcome from the statistical point of view. In the authors' opinion, HKG-LSM would be much more efficient if an even higher-dimensional test case were considered, where a pure surrogate-based strategy would exhibit some deficiencies. Finally, similar topology types are obtained by all the surrogate-based strategies, demonstrating that HKG-LSM can be efficiently used at the early stages of the design process of mechanical structures intended to face dynamic loads. The main topologies and the frequencies with which they have been observed are shown in Table 3.

8 Conclusions

This study is an extension of the work by Raponi et al. [34]. The main aim was to evaluate the potential of the Hybrid Kriging-assisted Level-Set Method (HKG-LSM) for Topology Optimization (TO) in higher-dimensional test cases and under dynamic loading conditions. The whole research is motivated by the need for cheap procedures in the modern automotive industry, where optimization strategies have proven to be good alternatives to the classical trial-and-error method. Aimed at final applications in the crashworthiness design optimization field, where no gradient information is available, Efficient Global Optimization (EGO) with the use of surrogate models -


Fig. 16 Statistical evaluation of the optimization methods compared for the 15-variables cantilever beam test case (a) and the 15-variables transverse bending test case (b). Box plots for 30 runs after 100, 250, and 500 evaluations

Table 3 Main topology types and frequencies with which they have been observed, ordered according to decreasing structural performance, for the 15-variables dynamic test case. Out of the 30 runs, the topology classification is made by visual inspection and is based on the main characteristics determining the design (number of connections to the supports, final beam components, presence of cross sub-configurations)

Topology type    KG-LSM (%)    HKG-LSM-bias50-exp (%)    HKG-LSM-bias100-exp (%)
Type 1           23            13                        20
Type 2           3             3                         3
Type 3           30            30                        30


Kriging in this case - and Evolution Strategies (ESs), CMA-ES in particular, represent valid alternatives. Their principles are different, yet complementary: EGO constructs an approximation of the high-fidelity model by evaluating the expensive objective function on a chosen training set and improves the optimum by adding new points to the model, while CMA-ES learns and samples multi-normal laws in the space of design variables and converges towards the optimum by means of a recombination and mutation process of the individuals. Such algorithms are used by the Kriging-guided Level Set Method (KG-LSM) and the Evolutionary Level Set Method (EA-LSM) for structural TO, which are the two sub-algorithms composing the HKG-LSM optimization strategy. Starting with an initial set of training points generated by the Design of Experiments (DoE) and thanks to the surrogate-based technique, a fast convergence of the objective function towards the optimum is obtained at the beginning of the optimization process. Afterwards, CMA-ES is used to exploit some promising areas of the domain, leading to a refined design with a competitive fitness compared to its neighbors. In this research, the potential of the proposed HKG-LSM was first assessed through the study of a standard cantilever beam benchmark test case in optimization problems of different dimensionalities (up to 15 variables), where the objective function to be minimized was the compliance of the structure. Afterwards, an application to a dynamic transverse bending test case with intrusion minimization was presented to show the suitability of the method in a crashworthiness application. In the static test cases, HKG-LSM led to significant improvements of the convergence properties in the initial stages of the optimization process, compared to KG-LSM and EA-LSM. As for the dynamic test case, the considered application was characterized by an already good performance of the novel KG-LSM compared to EA-LSM.
Hence, no improvement of the convergence trend could be observed for the hybrid strategy. However, HKG-LSM turned out to be superior to EA-LSM and comparable to KG-LSM both in terms of convergence speed and generated optimal layouts. Therefore, the proposed approach can in general be a valid alternative to both the surrogate-based and the evolutionary methods taken alone in the optimization of mechanical structures. Indeed, it was characterized by a fast convergence at the beginning of the optimization process in each of the considered test cases. Since HKG-LSM is targeted at crashworthiness optimization problems, which are characterized by a limited number of available evaluations, the proposed hybrid technique is promising. Consequently, further studies on test cases of higher dimensionality are planned as future work. To this end, one option is to couple the Clement method with the novel PCA-BO [35]. Moreover, a procedure to optimize the number of evaluations preceding the switch and a deeper analysis regarding the estimation of the initial parameters for CMA-ES are needed. In fact, the fitted Kriging mean, as an approximation to the true function, can be used to initialize both the covariance matrix and the step size of CMA-ES [26]. Lastly, new methods to combine surrogates with evolution strategies in order to obtain better-performing mechanical structures can be investigated.


References

1. Allaire, G., Jouve, F., Toader, A.M.: Structural optimization using sensitivity analysis and a level-set method. J. Comput. Phys. 194(1), 363–393 (2004). https://doi.org/10.1016/j.jcp.2003.09.032
2. Arsenyev, I.: Efficient Surrogate-based Robust Design Optimization Method. Ph.D. thesis, Technische Universität München (2017)
3. Aulig, N.: Generic topology optimization based on local state features. Ph.D. thesis, Technische Universität Darmstadt, VDI Verlag, Germany (2017)
4. Aulig, N., Olhofer, M.: State-based representation for structural topology optimization and application to crashworthiness. In: 2016 IEEE Congress on Evolutionary Computation (CEC), Vancouver, Canada, pp. 1642–1649 (2016). https://doi.org/10.1109/CEC.2016.7743985
5. Bendsøe, M.P., Sigmund, O.: Topology Optimization - Theory, Methods, and Applications, 2nd edn. Springer, Berlin (2004). http://www.springer.com/cn/book/9783540429920
6. Bujny, M., Aulig, N., Olhofer, M., Duddeck, F.: Evolutionary level set method for crashworthiness topology optimization. In: VII European Congress on Computational Methods in Applied Sciences and Engineering, Crete Island, Greece (2016)
7. Bujny, M., Aulig, N., Olhofer, M., Duddeck, F.: Identification of optimal topologies for crashworthiness with the evolutionary level set method. Int. J. Crashworthiness 23(4), 395–416 (2018). https://doi.org/10.1080/13588265.2017.1331493
8. Cressie, N.: The origins of kriging. Math. Geol. 22(3), 239–252 (1990). https://doi.org/10.1007/BF00889887
9. Duddeck, F., Volz, K.: A new topology optimization approach for crashworthiness of passenger vehicles based on physically defined equivalent static loads. In: ICrash International Crashworthiness Conference, Milano, Italy (2012)
10. Duddeck, F., Hunkeler, S., Lozano, P., Wehrle, E., Zeng, D.: Topology optimization for crashworthiness of thin-walled structures under axial impact using hybrid cellular automata. Struct. Multidiscip. Optim. 54(3), 415–428 (2016). https://doi.org/10.1007/s00158-016-1445-y
11. Eschenauer, H.A., Kobelev, V.V., Schumacher, A.: Bubble method for topology and shape optimization of structures. Struct. Optim. 8(1), 42–51 (1994). https://doi.org/10.1007/BF01742933
12. Fang, K.T., Li, R., Sudjianto, A.: Design and Modeling for Computer Experiments. CRC Press (2005)
13. Forrester, A.I.J., Sóbester, A., Keane, A.J.: Engineering Design via Surrogate Modelling - A Practical Guide. John Wiley & Sons Ltd. (2008)
14. Guo, X., Zhang, W., Zhong, W.: Doing topology optimization explicitly and geometrically - a new moving morphable components based framework. J. Appl. Mech. 81(8), 081009 (2014). https://doi.org/10.1115/1.4027609
15. Haber, R., Bendsøe, M.P.: Problem formulation, solution procedures and geometric modeling: key issues in variable-topology optimization. In: 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, St. Louis, Missouri, USA (1998)
16. Hansen, N.: The CMA evolution strategy: a tutorial (2005). https://hal.inria.fr/hal-01297037, hal-01297037f
17. Hansen, N.: The CMA evolution strategy: a comparing review. In: Towards a New Evolutionary Computation. Studies in Fuzziness and Soft Computing, pp. 75–102. Springer, Berlin (2006). https://doi.org/10.1007/3-540-32494-1_4
18. Hansen, N., Ostermeier, A.: Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In: Proceedings of IEEE International Conference on Evolutionary Computation, pp. 312–317 (1996). https://doi.org/10.1109/ICEC.1996.542381
19. Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution strategies. Evol. Comput. 9(2), 159–195 (2001). https://doi.org/10.1162/106365601750190398
20. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13(4), 455–492 (1998). https://doi.org/10.1023/A:1008306431147


21. Kleijnen, J.P.C.: Kriging metamodeling in simulation: a review. Eur. J. Oper. Res. 192(3), 707–716 (2009). https://doi.org/10.1016/j.ejor.2007.10.013
22. Lee, H.A., Park, G.J.: Nonlinear dynamic response topology optimization using the equivalent static loads method. Comput. Methods Appl. Mech. Eng. 283, 956–970 (2015). https://doi.org/10.1016/j.cma.2014.10.015
23. Livermore Software Technology Corporation (LSTC), P. O. Box 712, Livermore, California 94551-0712: LS-DYNA Keyword User's Manual, Volume II - Material Models (2014). LS-DYNA R7.1
24. Livermore Software Technology Corporation (LSTC), P. O. Box 712, Livermore, California 94551-0712: LS-DYNA Theory Manual (2019)
25. Michell, A.G.M.: LVIII. The limits of economy of material in frame-structures. Philos. Mag. 8(47), 589–597 (1904). https://doi.org/10.1080/14786440409463229
26. Mohammadi, H., Riche, R.L., Touboul, E.: Making EGO and CMA-ES complementary for global optimization. In: Learning and Intelligent Optimization. Lecture Notes in Computer Science, pp. 287–292. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19084-6_29
27. Mozumder, C., Renaud, J.E., Tovar, A.: Topometry optimisation for crashworthiness design using hybrid cellular automata. Int. J. Veh. Des. 60(1–2) (2012). https://trid.trb.org/view.aspx?id=1222579
28. Ortmann, C., Schumacher, A.: Graph and heuristic based topology optimization of crash loaded structures. Struct. Multidiscip. Optim. 47(6), 839–854 (2013). https://doi.org/10.1007/s00158-012-0872-7
29. Osher, S., Sethian, J.A.: Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations. J. Comput. Phys. 79(1), 12–49 (1988). https://doi.org/10.1016/0021-9991(88)90002-2
30. Pedersen, C.B.W.: Topology optimization design of crushed 2D-frames for desired energy absorption history. Struct. Multidiscip. Optim. 25(5–6), 368–382 (2003). https://doi.org/10.1007/s00158-003-0282-y
31. Rao, S.S.: Engineering Optimization: Theory and Practice. Wiley (1996)
32. Raponi, E., Bujny, M., Olhofer, M., Aulig, N., Boria, S., Duddeck, F.: Kriging-guided level set method for crash topology optimization. In: 7th GACM Colloquium on Computational Mechanics for Young Scientists from Academia and Industry, Stuttgart, Germany (2017)
33. Raponi, E., Bujny, M., Olhofer, M., Aulig, N., Boria, S., Duddeck, F.: Kriging-assisted topology optimization of crash structures. Comput. Methods Appl. Mech. Eng. 348, 730–752 (2019). https://doi.org/10.1016/j.cma.2019.02.002
34. Raponi, E., Bujny, M., Olhofer, M., Boria, S., Duddeck, F.: Hybrid kriging-assisted level set method for structural topology optimization. In: Proceedings of the 11th International Joint Conference on Computational Intelligence (IJCCI 2019), Vienna, Austria, pp. 70–81 (2019). https://doi.org/10.5220/0008067800700081
35. Raponi, E., Wang, H., Bujny, M., Boria, S., Doerr, C.: High dimensional Bayesian optimization assisted by principal component analysis. In: Parallel Problem Solving from Nature, PPSN XVI, pp. 169–183. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58112-1_12
36. Storn, R., Price, K.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 11(4), 341–359 (1997). https://doi.org/10.1023/A:1008202821328

An Empirical Study on Insertion and Deletion Mutation in Cartesian Genetic Programming

Roman Kalkreuth

Abstract The most commonly used genetic operator in Cartesian Genetic Programming (CGP) is the genotypic point mutation. Since CGP suffers from a lack of knowledge about the possibilities and effectiveness of advanced genetic operators, the point mutation is usually the sole genetic operator when CGP is used. To improve the state of knowledge, this work is devoted to the investigation of the effectiveness of two phenotypic mutation techniques and takes another step towards the use of advanced phenotypic mutations in CGP. The functionality of the proposed mutations is inspired by biological evolution, where DNA sequences are mutated by inserting and deleting nucleotides. This behavior is adapted by activating and deactivating function nodes in the genotype. In the first place, the experimental part of this paper focuses on experiments with sets of well-known Boolean functions and symbolic regression problems; the results show an improved search performance when these phenotypic mutations are used. The observed improvement of the search performance indicates that the insertion and deletion mutation techniques are beneficial for the use of CGP. The effectiveness of both mutation techniques is underlined by a comparison to another state-of-the-art technique in the field of graph-based genetic programming. Another part of this work is devoted to the analysis and interpretation of the effects caused in fitness and phenotype space when both mutation techniques are used. For the interpretation, we analyze and compare the findings of previous work in the field of phenotypic genetic operators for CGP, which leads to new ideas about the effects of both mutation techniques on the behavior of CGP.

Keywords Cartesian genetic programming · Mutation · Phenotype · Boolean functions · Symbolic regression

R. Kalkreuth (B) Department of Computer Science, TU Dortmund University, Dortmund, Germany e-mail: [email protected] URL: http://www.cs.tu-dortmund.de © Springer Nature Switzerland AG 2021 J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_4


1 Introduction

Genetic programming (GP) can be considered a nature-inspired search heuristic which enables the automatic derivation of programs for problem-solving. First work on GP has been done by [3, 4, 7]. Later work by [13–15] significantly popularized the field of GP. GP is traditionally used with trees as program representation. Over two decades ago, Miller, Thompson, Kalganova, and Fogarty presented first publications on Cartesian Genetic Programming (CGP)—an encoding model inspired by the two-dimensional array of functional nodes connected by feed-forward wires of an FPGA device [8, 19, 20]. CGP offers a graph-based representation that, in addition to standard GP problem domains, makes it easy to apply to many graph-based applications such as electronic circuits, image processing, and neural networks. CGP has multiple advantages:

– CGP provides a mechanism for the design of simple hierarchical functions. While in many optimization systems such a mechanism has to be integrated explicitly, in CGP multiple feed-forward connections may form from the same output connection of a functional node. This behavior can be beneficial for the evolution of target functions that may benefit from repetitive structures.
– The maximal size of encoded solutions is fixed, keeping CGP free to some extent from "bloat", which can be seen as a major drawback of tree-based GP.
– CGP offers a way of passing redundant information throughout the generations. This mechanism can be seen as a sort of memory for evolutionary artifacts. The propagation and reuse of such redundant information have been found beneficial for the convergence of CGP [12].
– CGP encodes a directed acyclic graph. This enables the evolution of topologies.

In contrast to tree-based GP, for which a broad range of advanced crossover and mutation techniques have been introduced and investigated, the state of knowledge of advanced mutation techniques in CGP can be considered relatively poor.
The lack of contributions in this subfield of CGP has been the major motivation for this work. Moreover, in standard tree-based GP, the simultaneous use of multiple types of mutation has been found beneficial by Kraft et al. [16] and Angeline [1]. An initial test of two phenotypic mutations for CGP, called insertion and deletion, was presented by Kalkreuth [9]. Recently, these two new mutations for CGP were formally introduced by Kalkreuth [10] with comprehensive experiments in the Boolean function domain. However, those experiments covered only one problem domain. In this work, we extend the analysis of both mutation techniques with comprehensive experiments in the symbolic regression domain. This paper is an extended and revised version of the work of Kalkreuth [10]. The structure of the paper is as follows: Sect. 2 briefly describes CGP and surveys previous work on advanced mutation techniques in CGP. In Sect. 3 we describe the insertion and deletion mutation techniques for standard CGP. Section 4 describes our experiments and presents the experimental results. Section 5 contains a comparison of the insertion and deletion mutation in combination with CGP to another state-of-the-art method for evolving graphs. In Sect. 6, we discuss and interpret the results of our experiments. Finally, Sect. 7 gives a conclusion, and Sect. 8 outlines future work.

2 Related Work

2.1 Cartesian Genetic Programming

Cartesian Genetic Programming is a form of Genetic Programming which offers a novel graph-based representation. In contrast to tree-based GP, CGP represents a genetic program via a genotype-phenotype mapping as an indexed, acyclic, and directed graph. Originally, the structure of the graphs was a rectangular grid of Nr rows and Nc columns, but later work also focused on a representation with one row. The genes in the genotype are grouped, and each group refers to a node of the graph, except the last one, which represents the outputs of the phenotype. Each node is represented by two types of genes, which index the function number in the GP function set and the node inputs. These nodes are called function nodes and execute functions on the input values. The number of input genes depends on the maximum arity Na of the function set. The last group in the genotype represents the indices of the nodes which lead to the outputs. A backward search is used to decode the corresponding phenotype: it starts from the outputs and processes the linked nodes in the genotype. In this way, only active nodes are processed during the evaluation procedure. The number of inputs Ni, the number of outputs No, and the length of the genotype are fixed. Every candidate program is represented with Nr ∗ Nc ∗ (Na + 1) + No integers. Even though the length of the genotype is fixed for every candidate program, the length of the corresponding phenotype in CGP is variable, which can be considered a significant advantage of the CGP representation. CGP is traditionally used with a (1 + λ) evolutionary algorithm. The new population in each generation consists of the best individual of the previous population and the λ created offspring. The breeding procedure is mostly done by a point mutation that randomly changes genes in the genotype of an individual within the valid range. An example of the decoding from genotype to phenotype is illustrated in Fig. 1.
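The backward search described above can be sketched in a few lines. This is an illustrative sketch under assumed conventions (our own names; a one-row representation with nodes numbered from n_inputs upward), not the author's implementation:

```python
# Illustrative sketch: decoding a CGP genotype with a backward search.
# Assumed layout: nodes encoded as (function_gene, input_gene_1, ..., input_gene_arity),
# followed by the output genes.

def active_nodes(genotype, n_inputs, n_nodes, arity, n_outputs):
    """Return the set of active node indices, found by backward search from the outputs."""
    genes_per_node = arity + 1
    outputs = genotype[n_nodes * genes_per_node:]
    assert len(outputs) == n_outputs
    active = set()
    stack = [g for g in outputs if g >= n_inputs]  # output genes pointing at nodes
    while stack:
        node = stack.pop()
        if node in active:
            continue
        active.add(node)
        offset = (node - n_inputs) * genes_per_node
        # skip the function gene; follow the connection genes
        for conn in genotype[offset + 1: offset + 1 + arity]:
            if conn >= n_inputs:  # connection points at another node, not an input
                stack.append(conn)
    return active

# Genotype with 2 inputs, 3 nodes (arity 2), 1 output:
# node 2: fn 0, inputs (0, 1); node 3: fn 1, inputs (0, 0); node 4: fn 0, inputs (2, 1)
geno = [0, 0, 1,  1, 0, 0,  0, 2, 1,  4]
print(sorted(active_nodes(geno, n_inputs=2, n_nodes=3, arity=2, n_outputs=1)))
# node 3 is inactive: nothing reachable from output node 4 references it
```

Only the nodes returned here need to be evaluated, which is why the variable-length phenotype falls out of a fixed-length genotype.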

2.2 Advanced Mutation Techniques in Standard CGP

For an investigation of the length bias and the search limitation of CGP, a modified version of the point mutation was introduced by [5]. The modified point mutation mutates exactly one active gene. This so-called single active-gene mutation strategy (SAGMS) has been found beneficial for the search performance of CGP.

Fig. 1 Exemplification of the decoding procedure of a CGP genotype to its corresponding phenotype for a mathematical function. The nodes are represented by two types of numbers, which index the number in the function lookup table (underlined) and the inputs (non-underlined) for the node. Inactive function nodes are shown in gray color. This figure has been taken from the work of Kalkreuth [10]

The SAGMS can be seen as a form of phenotypic genetic operator, since it respects only active function genes in the genotype, which are an active part of the corresponding phenotype. Later work by [17] extended SAGMS to the so-called Biased Single-Active Mutation, which is based on the idea of analyzing the behavior of the genotype during the evolutionary process for a given set of problems. With the help of this analysis, a bias is created in order to guide the gene mutation when applied to other problems. The mutation operator was proposed for digital combinational logic circuit design. The experiments of [17] showed that the proposed mutation performed better than or equivalent to the traditional point mutation. In order to reduce the stalling effect in CGP and to improve the efficiency of the CGP algorithm, [22] introduced an Orthogonal Neighbourhood Mutation (ONM) operator. According to Ni et al., the ONM selects four loci as alleles of gene strings by chance. Afterward, a four-factor, three-level orthogonal experiment with local search is performed. The results of the experiments demonstrated that the ONM operator is able to reduce the stalling effect in CGP and to speed up convergence. Recently, Kalkreuth [10] introduced two phenotypic mutations for CGP, which adapt the functionality of the insertion and deletion mutations.
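The single active-gene mutation strategy discussed above can be sketched as follows. This is our sketch, not the code of [5]; `is_active_gene` and `random_valid_value` are hypothetical placeholders for representation-specific logic:

```python
import random

# Sketch of the single active-gene mutation strategy (SAGMS): point-mutate
# random genes until exactly one gene belonging to an active node has changed.

def sagms(genotype, is_active_gene, random_valid_value, rng=random):
    child = list(genotype)
    while True:
        pos = rng.randrange(len(child))
        new_val = random_valid_value(pos)
        if new_val == child[pos]:
            continue  # force an actual change
        child[pos] = new_val
        if is_active_gene(pos):
            return child  # stop as soon as one active gene was mutated
```

Note that inactive genes drawn along the way are mutated too; only the stopping condition is phenotypic, which is what makes the operator behave like a single phenotypic change.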

3 Insertion and Deletion Mutation in CGP

The proposed mutations for CGP are inspired by biological evolution, in which extra base pairs are inserted into a new place in the DNA or a section of DNA is deleted. Figure 2 exemplifies the insertion and deletion mutation on a DNA sequence (insertion turns the original sequence GAGA/CTCT into the mutated sequence GAGAGA/CTCTCT, while deletion shortens it to GA/CT).

Fig. 2 Deletions and insertions of nucleotides. This figure has been taken from the work of Kalkreuth [10]

Related to CGP, we adopt these so-called frameshift mutations by activating and deactivating randomly chosen function nodes. The activation and deactivation of the nodes are done by adjusting the connection genes of neighboring nodes. Both mutation techniques work similarly to the single active-gene mutation strategy: the state of exactly one function node in the genome is changed. Since these forms of mutation can elicit strong changes in the behavior of the individuals, we apply an insertion rate and a deletion rate for every offspring. On the basis of these mutation rates, the decision is made as to whether the mutations are performed on the genome of an individual. The insertion and deletion mutation techniques work independently of each other, which means that both mutations can be performed on the genome of an individual in the breeding procedure of one generation. If a minimum or maximum number of function nodes must be respected for all individuals in the population, the algorithms can be parameterized with minimum and maximum numbers. We explain both mutation techniques in detail in the following two subsections. For both mutation techniques, we determine the active and inactive function nodes of the respective individual before the mutation procedure.

3.1 The Insertion Mutation Technique

When a genome is selected for the insertion mutation, one inactive function node becomes active. If all function nodes are already active or the number of active function nodes exceeds a defined maximum, the mutation is rejected. If an individual is suitable for the insertion mutation, we randomly select one inactive function node. After the selection, we have to distinguish three cases:

1. The Selected Inactive Node Has a Following Active Function Node
In this context, the term following function node means that the node number of an active function node is greater than the node number of the randomly selected node. If the selected node has a following active function node, we copy the connection genes of the following active node to the selected inactive node. Afterward, we adjust one randomly selected connection gene of the following active node to the selected inactive node. In this way, the selected inactive node will be respected by the backward search and consequently becomes active. No further steps are required, since all other active function nodes remain active due to the copying of the connection genes.
2. The Selected Inactive Node Has a Previous Active Function Node and No Following Active Function Node
In this context, the term previous active function node means that the node number of an active function node is smaller than the node number of the randomly selected node. If the selected node has a previous active function node and no following active node, at least one output node is connected to the previous active function node. In this case, we adjust all output nodes which are connected to the previous active function node to the selected inactive node. Afterward, we adjust one connection gene of the selected node to the previous active function node. The other connection genes are randomly connected to previous active function or input nodes. In this way, the selected inactive node becomes active, and the other inactive function nodes remain inactive.
3. The Selected Inactive Node Has No Previous or Following Active Function Node
If the selected inactive node has no previous or following active function node, the individual has no active function nodes. Consequently, the output nodes are directly connected to input nodes. In this case, we adjust at least one output node to the selected inactive node. Afterward, we randomly connect the connection genes of the selected inactive node to input nodes. In this way, the selected node becomes active, and all other function nodes remain inactive.

3.2 The Deletion Mutation Technique

In contrast to the insertion mutation technique, when a genome is selected for the deletion mutation, one active node becomes inactive. If all function nodes are inactive or the number of active function nodes is smaller than a defined minimum, the mutation is rejected. If an individual is suitable for the deletion mutation, we select the first active function node of the individual. The deletion mutation procedure is then done by performing the following steps:

1. Adjust the Connection Genes of All Following Active Function Nodes
The connection genes of all following active function nodes which are connected to the selected active function node are randomly adjusted to other active function or input nodes.
2. Adjust the Output Nodes
All output nodes which are connected to the selected active function node are randomly adjusted to other active function or input nodes.

After performing the adjustment of connection genes and output nodes, the selected active function node becomes inactive.
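The rate-based, independent application of the two operators described in this section can be sketched at a schematic level. All names here are ours, not the author's implementation; `insert_node` and `delete_node` stand for the case analyses of Sects. 3.1 and 3.2, and `count_active` for the active-node bookkeeping done before the mutation procedure:

```python
import random

# Schematic sketch: insertion and deletion are applied independently to one
# offspring, each gated by its own rate and by the min/max active-node bounds.

def breed(genome, p_insert, p_delete, min_active, max_active,
          count_active, insert_node, delete_node, rng=random):
    child = list(genome)
    if rng.random() < p_insert and count_active(child) < max_active:
        insert_node(child)   # one inactive function node becomes active
    if rng.random() < p_delete and count_active(child) > min_active:
        delete_node(child)   # one active function node becomes inactive
    return child
```

Because the two rates are drawn independently, a single breeding step may insert, delete, both, or neither, exactly as described in the text.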


Fig. 3 The proposed insertion mutation technique. This figure has been taken from the work of Kalkreuth [10]

Figure 3 exemplifies the insertion technique. As visible, one inactive node is selected for activation. The connection genes in the genotype are adjusted to activate the selected function node in the phenotype. In contrast, Fig. 4 illustrates an example of a deletion mutation, in which one active node becomes inactive by adjusting the respective connection genes. In both figures, the genotype is grouped into the genes which represent the function and output nodes. Moreover, active function nodes are highlighted in solid boxes, and inactive nodes are shown in dashed boxes. The selected active or inactive nodes are highlighted in red.

Fig. 4 The proposed deletion mutation technique. This figure has been taken from the work of Kalkreuth [10]

4 Experiments

4.1 Experimental Setup

We performed experiments with Boolean function and symbolic regression problems. To evaluate the search performance of the insertion and deletion mutation techniques, we measured the number of fitness evaluations until the CGP algorithm terminated (fitness-evaluations-to-termination) and the best fitness value found after a predefined budget of fitness evaluations (best-fitness-of-run). In addition to the mean values of the measurements, we calculated the standard deviation (SD) and the standard error of the mean (SEM). We also calculated the median and the first and third quartiles. We performed 100 independent runs with different random seeds. We used the well-known (1 + 4)-CGP algorithm for all experiments. Moreover, we used the standard CGP point mutation operator in combination with the insertion and deletion mutations. We used minimizing fitness functions in all experiments, which are explained in the respective subsections. To classify the significance of our results, we used the Mann-Whitney U test. The mean values are denoted with a † if the p-value is less than the significance level 0.05 and with a ‡ if the p-value is less than the significance level 0.01, compared to the use of the point mutation as the sole genetic operator.
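The significance marking described above could be reproduced along the following lines. This is our sketch, not the author's evaluation code; for real use, `scipy.stats.mannwhitneyu` is the natural choice, while the stdlib-only version below uses the normal approximation without tie correction (adequate for 100 runs per algorithm):

```python
from math import erf, sqrt

# Mann-Whitney U test, two-sided, via the normal approximation.

def mann_whitney_p(a, b):
    """Approximate two-sided p-value comparing two samples of run results."""
    n1, n2 = len(a), len(b)
    # U = number of pairs (x, y) with x < y, counting ties as 1/2
    u = sum((x < y) + 0.5 * (x == y) for x in a for y in b)
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

def mark(mean, p):
    """Attach the significance symbols used in the result tables."""
    return f"{mean}\u2021" if p < 0.01 else f"{mean}\u2020" if p < 0.05 else str(mean)
```

For example, two clearly separated samples of run results yield a p-value below 0.01, so their mean would be printed with a ‡.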

4.2 Search Performance Evaluation

4.2.1 Boolean Functions

To evaluate the search performance of the insertion and deletion mutation techniques in the Boolean domain, we chose the five Even-Parity problems with n = 3, 4, 5, 6, and 7 Boolean inputs. The goal was to find a program that produces the value of the Boolean even-parity function of the n independent inputs. The fitness was represented by the number of fitness cases for which the candidate solution failed to generate the correct value of the even-parity function. Since former work by [23] outlined that this problem type was excessively used and investigated in the past, we also evaluated multiple-output problems such as the digital adder, multiplier, and demultiplexer. These types of problems differ markedly from the parity problems, and the 3-Bit digital multiplier has been proposed as a suitable alternative. As a result, we obtain a diverse set of problems in this problem domain. The set of benchmark problems with the corresponding numbers of inputs and outputs is shown in Table 1. To evaluate the fitness of the individuals on the multiple-output problems, we defined the fitness value of an individual as the number of bits that differ from the corresponding truth table. To find performant configurations for the insertion and deletion mutation rates, we used automated parameter tuning. The evolved configurations are shown in Table 3. We compared the (1 + 4)-CGP algorithm to our modified (1 + 4)-CGP algorithm equipped with the insertion and deletion mutation techniques. Our modified (1 + 4)-CGP is denoted as (1 + 4)-CGP-ID. The number of function nodes was set to 100 for all tested problems. Following conventional wisdom for CGP, we use a point mutation rate of 4% for the traditional (1 + 4)-CGP algorithm. The algorithm configuration of the (1 + 4)-CGP algorithm is shown in Table 2. We performed the runtime measurement on a computer with an Intel(R) Core(TM) i7 CPU 930 with 2.80 GHz and 24 GB of RAM. Table 4 presents the results of our search performance evaluation in the Boolean function domain, which shows a reduced number of generations until the termination criterion triggers for the (1 + 4)-CGP-ID algorithm. The results also show that when the (1 + 4)-CGP-ID is used on more complex Boolean function problems, the mean runtime of the algorithm is also clearly reduced. Figure 5 provides boxplots for all tested problems in the Boolean domain.

Table 1 List of Boolean function problems for the search performance evaluation. This table has been taken from the work of Kalkreuth [10]

| Problem | Number of inputs | Number of outputs |
|---|---|---|
| Parity-3 | 3 | 1 |
| Parity-4 | 4 | 1 |
| Parity-5 | 5 | 1 |
| Parity-6 | 6 | 1 |
| Parity-7 | 7 | 1 |
| Adder 1-Bit | 3 | 2 |
| Adder 2-Bit | 5 | 3 |
| Adder 3-Bit | 7 | 4 |
| Multiplier 2-Bit | 4 | 4 |
| Multiplier 3-Bit | 6 | 6 |
| Demultiplexer 3:8-Bit | 3 | 8 |
| Comparator 4×1-Bit | 4 | 18 |

Table 2 Configuration of the (1 + 4)-CGP algorithm. This table has been taken from the work of Kalkreuth [10]

| Property | Value |
|---|---|
| μ | 1 |
| λ | 4 |
| Number of nodes | 100 |
| Maximum generations | 20,000,000 |
| Function set | AND, OR, NAND, NOR |
| Point mutation rate | 4% |

Table 3 Insertion and deletion rates for the (1 + 4)-CGP-ID algorithm for the tested Boolean function problems. This table has been taken from the work of Kalkreuth [10]

| Problem | Point mutation rate (%) | Insertion rate (%) | Deletion rate (%) |
|---|---|---|---|
| Parity-3 | 2.5 | 40 | 25 |
| Parity-4 | 1.5 | 7.5 | 5 |
| Parity-5 | 1 | 8 | 2 |
| Parity-6 | 1 | 6 | 4 |
| Parity-7 | 1 | 6 | 3 |
| Adder 1-Bit | 2 | 5 | 5 |
| Adder 2-Bit | 1 | 10 | 10 |
| Adder 3-Bit | 1 | 5 | 5 |
| Multiplier 2-Bit | 2 | 5 | 5 |
| Multiplier 3-Bit | 1 | 6 | 3 |
| Demultiplexer 3:8-Bit | 2 | 10 | 10 |
| Comparator 4×1-Bit | 1 | 5 | 5 |
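The bit-difference fitness described above can be sketched as follows, assuming (our assumption, a common CGP implementation trick) that each output column of the truth table is packed into an integer, one bit per fitness case:

```python
# Fitness for multiple-output Boolean problems: the number of output bits in
# which a candidate differs from the target truth table. 0 means solved.

def boolean_fitness(candidate_outputs, target_outputs):
    """Count differing bits over all packed output columns."""
    return sum(bin(c ^ t).count("1")
               for c, t in zip(candidate_outputs, target_outputs))

# 2-output example over 4 fitness cases (4-bit packed columns):
print(boolean_fitness([0b1010, 0b1100], [0b1010, 0b1111]))  # -> 2
```

With this minimizing fitness, the termination criterion of the search performance evaluation is simply a fitness of zero.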


Table 4 Results of the search performance evaluation for the tested Boolean function problems. This table has been taken from the work of Kalkreuth [10]

| Problem | Algorithm | Mean fitness evaluations | SD | SEM | 1Q | Median | 3Q | Mean runtime (s) |
|---|---|---|---|---|---|---|---|---|
| Parity-3 | (1 + 4)-CGP | 4917 | 4926 | ±493 | 1695 | 3412 | 5598 | 0.20 |
| | (1 + 4)-CGP-ID | 2700‡ | 2173 | ±217 | 1370 | 1928 | 3358 | 0.17 |
| Parity-4 | (1 + 4)-CGP | 43895 | 43013 | ±4301 | 18125 | 29398 | 57968 | 1.78 |
| | (1 + 4)-CGP-ID | 14381‡ | 9905 | ±991 | 8948 | 11928 | 19948 | 1.04 |
| Parity-5 | (1 + 4)-CGP | 194727 | 148386 | ±14839 | 83304 | 168996 | 249993 | 12.47 |
| | (1 + 4)-CGP-ID | 45349‡ | 28257 | ±2826 | 25735 | 34622 | 53923 | 6.99 |
| Parity-6 | (1 + 4)-CGP | 746627 | 512510 | ±51250 | 371794 | 617932 | 937638 | 112.35 |
| | (1 + 4)-CGP-ID | 105331‡ | 52171 | ±5217 | 65445 | 92466 | 139067 | 38.22 |
| Parity-7 | (1 + 4)-CGP | 3074853 | 3146951 | ±314695 | 1341520 | 2231156 | 3696237 | 976.68 |
| | (1 + 4)-CGP-ID | 283856‡ | 177515 | ±17751 | 177610 | 238426 | 325776 | 181.09 |
| Adder 1-Bit | (1 + 4)-CGP | 9364 | 8002 | ±800 | 3183 | 7550 | 12413 | 0.23 |
| | (1 + 4)-CGP-ID | 8080† | 7360 | ±736 | 3448 | 5876 | 10254 | 0.23 |
| Adder 2-Bit | (1 + 4)-CGP | 274734 | 262394 | ±26239 | 113622 | 188212 | 341853 | 4.98 |
| | (1 + 4)-CGP-ID | 113744‡ | 88022 | ±8802 | 56379 | 84258 | 140745 | 3.53 |
| Adder 3-Bit | (1 + 4)-CGP | 4068492 | 3567764 | ±356776 | 1802712 | 3092538 | 4745253 | 90.93 |
| | (1 + 4)-CGP-ID | 846075‡ | 885420 | ±88542 | 373149 | 584198 | 979748 | 36.39 |
| Multiplier 2-Bit | (1 + 4)-CGP | 24645 | 33364 | ±3336 | 6499 | 14108 | 26148 | 0.48 |
| | (1 + 4)-CGP-ID | 21539‡ | 33170 | ±3317 | 6372 | 10196 | 23753 | 0.47 |
| Multiplier 3-Bit | (1 + 4)-CGP | 757523 | 522412 | ±52241 | 402333 | 685390 | 958647 | 14.60 |
| | (1 + 4)-CGP-ID | 354118‡ | 337590 | ±33759 | 142446 | 250396 | 465565 | 9.49 |
| Demultiplexer 3:8-Bit | (1 + 4)-CGP | 23432 | 13546 | ±1355 | 15258 | 19918 | 26750 | 0.60 |
| | (1 + 4)-CGP-ID | 15523‡ | 8994 | ±899 | 8954 | 13704 | 19657 | 0.53 |
| Comparator 4×1-Bit | (1 + 4)-CGP | 2628085 | 1848923 | ±184892 | 1528983 | 2056080 | 2918599 | 91.06 |
| | (1 + 4)-CGP-ID | 338019‡ | 208523 | ±20852 | 180908 | 272924 | 461282 | 14.65 |


Fig. 5 Boxplots for the results of the search performance evaluation for the tested Boolean function problems. This figure has been taken from the work of Kalkreuth [10]

4.2.2 Symbolic Regression

To evaluate the search performance in the symbolic regression domain, we chose eleven symbolic regression problems from the work of McDermott et al. [18] on better GP benchmarks. The objective functions of the problems are shown in Table 6, and the corresponding function sets are shown in Table 7. A training data set U[a, b, c] refers to c uniform random samples drawn from a to b inclusive, and E[a, b, c] refers to a grid of points evenly spaced with an interval of c, from a to b inclusive. We included the problems Keijzer-6, Nguyen-7, Pagie-1, Vladislavleva-4 and Korns-12, which have been recommended by White et al. [23] as a set of significant problems with different characteristics. The fitness of the individuals was represented by a cost function value. The cost function was defined as the sum of the absolute differences between the real function values and the values of an evaluated individual. Let T = {x_p | p = 1, ..., P} be a training dataset of P random points, f_ind(x_p) the value of an evaluated individual, and f_ref(x_p) the true function value. Let

C := Σ_{p=1}^{P} |f_ind(x_p) − f_ref(x_p)|

be the cost function. When C becomes less than 0.01, the algorithm is classified as converged. All problems were evaluated with the best-fitness-of-run method. We measured the best fitness value after a budget of 10,000 generations. Additionally, we evaluated the simpler symbolic regression problems Koza 1, 2 and 3 with the fitness-evaluations-to-termination method. For this purpose, we used a smaller function set consisting of the four basic arithmetic functions +, −, ∗ and /. We defined a maximum number of 10^6 fitness evaluations for these three experiments. We chose these three problems because an ideal solution can be found more likely on average than for the other, more complex benchmark problems, which require a higher number of fitness evaluations to find an ideal solution. To find performant configurations for the insertion and deletion mutation rates, we tuned the respective parameters manually. For all tested problems, we observed that a point mutation rate of 4% seems to be a good choice for the use of the insertion and deletion mutations. The determined configurations are shown in Table 5. Tables 8 and 9 present the results of our search performance evaluation in the symbolic regression domain, which show a better mean best fitness of run and a reduced number of fitness evaluations until the termination criterion triggers for the (1 + 4)-CGP-ID algorithm. Figures 6 and 7 provide boxplots for all tested problems of the search performance evaluation.
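The cost function C and the convergence test can be written down directly; the helper names are ours:

```python
# Cost function for symbolic regression: sum of absolute differences between
# the candidate and the reference function over the training points.

def cost(f_ind, f_ref, points):
    return sum(abs(f_ind(x) - f_ref(x)) for x in points)

def converged(f_ind, f_ref, points, threshold=0.01):
    """Convergence criterion from the text: C < 0.01."""
    return cost(f_ind, f_ref, points) < threshold

# Koza-1 target with a perfect candidate:
koza1 = lambda x: x**4 + x**3 + x**2 + x
pts = [i / 10.0 - 1.0 for i in range(20)]  # stand-in for U[-1, 1, 20]
print(converged(koza1, koza1, pts))  # -> True
```

A fixed evenly spaced grid is used here only to keep the example deterministic; the benchmarks draw the training points as specified in Table 6.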


Table 5 Insertion and deletion rates for the (1 + 4)-CGP-ID algorithm for the tested symbolic regression problems

| Problem | Point mutation rate (%) | Insertion rate (%) | Deletion rate (%) |
|---|---|---|---|
| Koza-1 | 4 | 5 | 5 |
| Koza-2 | 4 | 5 | 5 |
| Koza-3 | 4 | 5 | 2.5 |
| Nguyen-4 | 4 | 5 | 5 |
| Nguyen-5 | 4 | 5 | 5 |
| Nguyen-7 | 4 | 7.5 | 5 |
| Keijzer-6 | 4 | 10 | 5 |
| Pagie-1 | 4 | 5 | 5 |
| Vladislavleva-4 | 1 | 5 | 5 |
| Korns-12 | 1 | 10 | 5 |

Table 6 List of symbolic regression problems for the search performance evaluations

| Problem | Objective function | Vars | Training set | Function set |
|---|---|---|---|---|
| Koza-1 | x^4 + x^3 + x^2 + x | 1 | U[−1, 1, 20] | Koza |
| Koza-2 | x^5 − 2x^3 + x | 1 | U[−1, 1, 20] | Koza |
| Koza-3 | x^6 − 2x^4 + x^2 | 1 | U[−1, 1, 20] | Koza |
| Nguyen-4 | x^6 + x^5 + x^4 + x^3 + x^2 + x | 1 | U[−1, 1, 20] | Koza |
| Nguyen-5 | sin(x^2) cos(x) − 1 | 1 | U[−1, 1, 20] | Koza |
| Nguyen-6 | sin(x) + sin(x + x^2) | 1 | U[−1, 1, 20] | Koza |
| Nguyen-7 | ln(x + 1) + ln(x^2 + 1) | 1 | U[0, 2, 20] | Koza |
| Keijzer-6 | Σ_{i=1}^{x} 1/i | 1 | E[1, 50, 1] | Keijzer |
| Pagie-1 | 1/(1 + x^−4) + 1/(1 + y^−4) | 2 | E[−5, 5, 0.4] | Koza |
| Vladislavleva-4 | 10 / (5 + (x−3)^2 + (y−3)^2 + (z−3)^2 + (v−3)^2 + (w−3)^2) | 5 | U[0.05, 6.05, 1024] | Vladislavleva-A |
| Korns-12 | 2 − 2.1 cos(9.8x) sin(1.3w) | 5 | U[−50, 50, 10,000] | Korns |

Table 7 Function sets for the set of symbolic regression problems

| Name | Functions | Constants (ERC) |
|---|---|---|
| Koza | + − ∗ / sin cos e^n ln(|n|) | − |
| Keijzer | + ∗ 1/n −n √n | Constant input with a value of 1; random value from N(μ = 0, σ = 5) |
| Vladislavleva-A | + − ∗ / n^2 n^ε n + ε n · ε | − |
| Korns | + − ∗ / sin cos e^n ln(|n|) √n n^2 n^3 tan tanh | Random finite 64-bit IEEE double |


Table 8 Results of the search performance evaluation for the tested symbolic regression problems evaluated by the best-fitness-of-run method

| Problem | Algorithm | Mean best fitness of run | SD | SEM | 1Q | Median | 3Q |
|---|---|---|---|---|---|---|---|
| Koza-1 | (1 + 4)-CGP | 0.33 | 0.36 | ±0.03 | 0.13 | 0.23 | 0.38 |
| | (1 + 4)-CGP-ID | 0.18‡ | 0.15 | ±0.01 | 0.06 | 0.15 | 0.26 |
| Koza-2 | (1 + 4)-CGP | 0.16 | 0.142 | ±0.01 | 0.04 | 0.13 | 0.23 |
| | (1 + 4)-CGP-ID | 0.11‡ | 0.12 | ±0.01 | 0.03 | 0.07 | 0.17 |
| Koza-3 | (1 + 4)-CGP | 0.09 | 0.10 | ±0.01 | 0.03 | 0.06 | 0.13 |
| | (1 + 4)-CGP-ID | 0.06† | 0.05 | ±0.00 | 0.02 | 0.04 | 0.08 |
| Nguyen-4 | (1 + 4)-CGP | 0.36 | 0.30 | ±0.03 | 0.18 | 0.30 | 0.48 |
| | (1 + 4)-CGP-ID | 0.24‡ | 0.19 | ±0.02 | 0.11 | 0.18 | 0.34 |
| Nguyen-5 | (1 + 4)-CGP | 0.21 | 0.18 | ±0.02 | 0.07 | 0.15 | 0.31 |
| | (1 + 4)-CGP-ID | 0.10‡ | 0.10 | ±0.01 | 0.03 | 0.07 | 0.16 |
| Nguyen-6 | (1 + 4)-CGP | 0.32 | 0.42 | ±0.04 | 0.11 | 0.19 | 0.36 |
| | (1 + 4)-CGP-ID | 0.16‡ | 0.14 | ±0.01 | 0.08 | 0.14 | 0.23 |
| Nguyen-7 | (1 + 4)-CGP | 0.45 | 0.30 | ±0.03 | 0.24 | 0.39 | 0.67 |
| | (1 + 4)-CGP-ID | 0.28‡ | 0.20 | ±0.02 | 0.14 | 0.21 | 0.38 |
| Keijzer-6 | (1 + 4)-CGP | 2.52 | 1.61 | ±0.16 | 1.49 | 2.07 | 3.00 |
| | (1 + 4)-CGP-ID | 1.94‡ | 1.48 | ±0.15 | 1.15 | 1.44 | 2.05 |
| Pagie-1 | (1 + 4)-CGP | 106.17 | 45.75 | ±4.57 | 76.43 | 100.59 | 140.56 |
| | (1 + 4)-CGP-ID | 82.48‡ | 38.22 | ±3.82 | 55.97 | 73.78 | 97.96 |
| Vladislavleva-4 | (1 + 4)-CGP | 261.93 | 147.75 | ±14.77 | 144.63 | 163.63 | 466.96 |
| | (1 + 4)-CGP-ID | 169.82‡ | 83.21 | ±8.32 | 140.13 | 144.20 | 151.26 |
| Korns-12 | (1 + 4)-CGP | 8892.96 | 902.12 | ±90.21 | 8509.78 | 8571.14 | 8689.03 |
| | (1 + 4)-CGP-ID | 8563.38† | 126.31 | ±12.63 | 8513.62 | 8555.67 | 8584.05 |


Table 9 Results of the search performance evaluation for the tested symbolic regression problems Koza 1, 2 and 3 evaluated by the fitness-evaluations-to-termination method

| Problem | Algorithm | Mean fitness evaluations | SD | SEM | 1Q | Median | 3Q |
|---|---|---|---|---|---|---|---|
| Koza-1 | (1 + 4)-CGP | 113757 | 276255 | ±27625 | 1544 | 5084 | 28641 |
| | (1 + 4)-CGP-ID | 86313† | 63980 | ±6398 | 864 | 2408 | 15527 |
| Koza-2 | (1 + 4)-CGP | 359337 | 410241 | ±41024 | 13951 | 133352 | 822057 |
| | (1 + 4)-CGP-ID | 211194† | 313048 | ±31304 | 8127 | 46594 | 256673 |
| Koza-3 | (1 + 4)-CGP | 348474 | 405892 | ±40589 | 6743 | 109994 | 790608 |
| | (1 + 4)-CGP-ID | 217992† | 355736 | ±35573 | 3877 | 31258 | 190603 |

4.3 Fitness Range Analysis

To investigate the effects of the insertion and deletion mutation techniques, we measured the range of the fitness values of the individuals in the population. For the measurement, we defined a budget of 1000 generations for each problem and algorithm and measured the range of fitness values in each generation. At the end of each run, we averaged the measured range values. Furthermore, we performed 100 runs for each algorithm and problem and averaged the mean values of each run. To ensure generalization in our analysis, the (1 + 4)-CGP-ID algorithm was parameterized with a point mutation rate of 1%, and both the insertion and deletion mutations were parameterized with a rate of 5% for the tested Boolean function problems. For the tested symbolic regression problems, we used a point mutation rate of 4%, which is the same rate as in the search performance evaluation. We also used a general setting of a 5% insertion and deletion rate for all tested symbolic regression problems. Table 10 shows the results of the fitness range analysis for all tested Boolean function problems. As visible, the range of the fitness values of the (1 + 4)-CGP-ID is much smaller compared to the (1 + 4)-CGP. Please note that we used a minimizing fitness function in our experiments. Table 11 shows the results for all tested symbolic regression problems. There is no general pattern for the tested symbolic regression problems; however, on the majority of the tested problems, the median value is higher for the (1 + 4)-CGP-ID algorithm.
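The averaging scheme described above (per-generation ranges, averaged per run, then averaged over runs) can be sketched as:

```python
# Fitness range analysis: for each generation, record max - min fitness in the
# population; average these ranges per run; then average over all runs.

def mean_fitness_range(runs):
    """`runs` is a list of runs; each run is a list of per-generation
    populations, given as lists of fitness values."""
    per_run = [
        sum(max(pop) - min(pop) for pop in run) / len(run)
        for run in runs
    ]
    return sum(per_run) / len(per_run)

run = [[1.0, 3.0, 2.0], [0.5, 1.5, 1.0]]  # 2 generations, ranges 2.0 and 1.0
print(mean_fitness_range([run]))  # -> 1.5
```

The same scheme, applied to the number of active function nodes of the best individual instead of fitness values, yields the range analysis of Sect. 4.4.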


Fig. 6 Boxplots for the results of the search performance evaluation for the tested symbolic regression problems evaluated by the best-fitness-of-run method


Fig. 7 Boxplots for the results of the search performance evaluation for the problems Koza 1, 2 and 3 evaluated by the fitness-evaluations-to-termination method

4.4 Active Function Node Range Analysis

To measure the exploration in phenotype space with and without our proposed mutations, we analyzed the range of the number of active function nodes. We measured the number of active function nodes of the best individual in each generation and calculated the range at the end of each run. The best individual has a high fitness value, and we assume that the exploration of phenotypes with high fitness values is essential to find the global optimum. We performed 100 runs for each algorithm and allowed a budget of 10,000 fitness evaluations. Afterward, we performed the statistical evaluation on the range values for the (1 + 4)-CGP and (1 + 4)-CGP-ID algorithms. Tables 12 and 13 show the results of the function node range analysis for all tested Boolean function and symbolic regression problems. It is clearly visible that the range of active function nodes of the (1 + 4)-CGP-ID is greater compared to the (1 + 4)-CGP for the majority of our tested problems.


Table 10 Results of the fitness range analysis for the tested Boolean function problems. This table has been taken from the work of Kalkreuth [10]

| Problem | Algorithm | Mean fitness range | SD | SEM | 1Q | Median | 3Q |
|---|---|---|---|---|---|---|---|
| Parity-3 | (1 + 4)-CGP | 2.40 | 0.43 | ±0.04 | 2.06 | 2.41 | 2.74 |
| | (1 + 4)-CGP-ID | 1.50 | 0.29 | ±0.029 | 1.27 | 1.45 | 1.71 |
| Parity-4 | (1 + 4)-CGP | 3.59 | 0.99 | ±0.09 | 2.89 | 3.62 | 4.13 |
| | (1 + 4)-CGP-ID | 2.17 | 0.64 | ±0.06 | 1.75 | 2.08 | 2.67 |
| Parity-5 | (1 + 4)-CGP | 3.88 | 1.16 | ±0.11 | 3.06 | 3.74 | 4.54 |
| | (1 + 4)-CGP-ID | 2.50 | 0.81 | ±0.08 | 1.98 | 2.45 | 3.06 |
| Parity-6 | (1 + 4)-CGP | 4.11 | 1.60 | ±0.16 | 2.91 | 3.81 | 5.25 |
| | (1 + 4)-CGP-ID | 2.90 | 1.41 | ±0.14 | 2.03 | 2.60 | 3.45 |
| Parity-7 | (1 + 4)-CGP | 3.80 | 1.63 | ±0.16 | 2.63 | 3.59 | 4.7 |
| | (1 + 4)-CGP-ID | 2.68 | 1.54 | ±0.15 | 1.62 | 2.43 | 3.53 |
| Adder 1-Bit | (1 + 4)-CGP | 4.83 | 0.59 | ±0.05 | 4.46 | 4.87 | 5.25 |
| | (1 + 4)-CGP-ID | 2.61 | 0.46 | ±0.04 | 2.34 | 2.62 | 2.94 |
| Adder 2-Bit | (1 + 4)-CGP | 16.50 | 2.40 | ±0.24 | 14.53 | 16.67 | 17.86 |
| | (1 + 4)-CGP-ID | 8.69 | 1.64 | ±0.16 | 7.53 | 8.55 | 9.85 |
| Adder 3-Bit | (1 + 4)-CGP | 55.85 | 7.70 | ±0.77 | 50 | 55.97 | 61.29 |
| | (1 + 4)-CGP-ID | 29.66 | 5.82 | ±0.58 | 25.57 | 29.52 | 33.81 |
| Multiplier 2-Bit | (1 + 4)-CGP | 13.48 | 1.52 | ±0.15 | 12.40 | 13.35 | 14.41 |
| | (1 + 4)-CGP-ID | 6.73 | 0.98 | ±0.09 | 6.11 | 6.75 | 7.33 |
| Multiplier 3-Bit | (1 + 4)-CGP | 61.48 | 5.48 | ±0.55 | 58.11 | 61.25 | 64.50 |
| | (1 + 4)-CGP-ID | 28.47 | 3.22 | ±0.32 | 26.26 | 28.41 | 30.50 |
| Demultiplexer 3:8-Bit | (1 + 4)-CGP | 11.07 | 0.96 | ±0.10 | 10.42 | 10.93 | 11.53 |
| | (1 + 4)-CGP-ID | 5.13 | 0.62 | ±0.06 | 4.69 | 5.06 | 5.49 |
| Comparator 4×1-Bit | (1 + 4)-CGP | 30.38 | 2.22 | ±0.22 | 29.01 | 30.21 | 31.96 |
| | (1 + 4)-CGP-ID | 16.75 | 1.65 | ±0.16 | 15.65 | 16.59 | 17.79 |


Table 11 Results of the fitness range analysis for the tested symbolic regression problems

| Problem | Algorithm | Mean fitness range | SD | SEM | 1Q | Median | 3Q |
|---|---|---|---|---|---|---|---|
| Koza-1 | (1 + 4)-CGP | 1.00 · 10^14 | 6.15 · 10^14 | ±6.15 · 10^13 | 9.32 · 10^0 | 2.25 · 10^1 | 5.42 · 10^1 |
| | (1 + 4)-CGP-ID | 1.00 · 10^13 | 1.00 · 10^14 | ±1.00 · 10^13 | 1.5 · 10^1 | 3.1 · 10^1 | 6.8 · 10^1 |
| Koza-2 | (1 + 4)-CGP | 1.16 · 10^14 | 1.03 · 10^14 | ±1.03 · 10^13 | 6.47 · 10^0 | 1.70 · 10^1 | 4.12 · 10^1 |
| | (1 + 4)-CGP-ID | 3.14 · 10^6 | 3.08 · 10^7 | ±3.08 · 10^6 | 7.18 · 10^0 | 1.25 · 10^1 | 2.30 · 10^1 |
| Koza-3 | (1 + 4)-CGP | 2.51 · 10^13 | 1.79 · 10^14 | ±1.79 · 10^13 | 1.05 · 10^1 | 2.32 · 10^1 | 5.47 · 10^1 |
| | (1 + 4)-CGP-ID | 1.24 · 10^14 | 8.43 · 10^14 | ±8.43 · 10^13 | 1.82 · 10^1 | 4.02 · 10^1 | 1.07 · 10^2 |
| Nguyen-4 | (1 + 4)-CGP | 6.71 · 10^13 | 4.50 · 10^14 | ±4.50 · 10^13 | 1.22 · 10^1 | 2.26 · 10^1 | 3.65 · 10^1 |
| | (1 + 4)-CGP-ID | 1.68 · 10^14 | 1.68 · 10^15 | ±1.68 · 10^14 | 1.69 · 10^1 | 3.10 · 10^1 | 5.46 · 10^1 |
| Nguyen-5 | (1 + 4)-CGP | 1.51 · 10^8 | 1.51 · 10^9 | ±1.51 · 10^8 | 9.50 · 10^0 | 1.63 · 10^1 | 4.28 · 10^1 |
| | (1 + 4)-CGP-ID | 4.02 · 10^11 | 4.02 · 10^12 | ±4.02 · 10^11 | 1.64 · 10^1 | 2.38 · 10^1 | 6.81 · 10^1 |
| Nguyen-6 | (1 + 4)-CGP | 4.67 · 10^13 | 2.75 · 10^14 | ±2.74 · 10^13 | 6.66 · 10^0 | 1.68 · 10^1 | 4.02 · 10^1 |
| | (1 + 4)-CGP-ID | 1.39 · 10^13 | 1.39 · 10^14 | ±1.39 · 10^13 | 1.45 · 10^1 | 2.28 · 10^1 | 4.78 · 10^1 |
| Nguyen-7 | (1 + 4)-CGP | 9.00 · 10^13 | 8.05 · 10^14 | ±8.05 · 10^13 | 1.08 · 10^1 | 2.82 · 10^1 | 5.94 · 10^1 |
| | (1 + 4)-CGP-ID | 2.27 · 10^10 | 2.15 · 10^11 | ±2.15 · 10^10 | 1.60 · 10^1 | 2.60 · 10^1 | 7.96 · 10^1 |
| Keijzer-6 | (1 + 4)-CGP | 4.30 · 10^8 | 4.97 · 10^8 | ±4.97 · 10^7 | 1.84 · 10^2 | 2.27 · 10^3 | 1.0 · 10^9 |
| | (1 + 4)-CGP-ID | 4.60 · 10^8 | 5.01 · 10^8 | ±5.01 · 10^7 | 2.00 · 10^2 | 9.74 · 10^3 | 1.0 · 10^9 |
| Pagie-1 | (1 + 4)-CGP | 2.15 · 10^99 | 1.23 · 10^100 | ±1.23 · 10^99 | 2.21 · 10^2 | 7.62 · 10^2 | 1.91 · 10^3 |
| | (1 + 4)-CGP-ID | 1.41 · 10^99 | 9.67 · 10^99 | ±9.67 · 10^98 | 4.69 · 10^2 | 1.43 · 10^3 | 8.22 · 10^3 |
| Vladislavleva-4 | (1 + 4)-CGP | 1.38 · 10^17 | 2.64 · 10^17 | ±2.64 · 10^16 | 3.92 · 10^2 | 2.45 · 10^4 | 8.55 · 10^16 |
| | (1 + 4)-CGP-ID | 7.15 · 10^16 | 1.84 · 10^17 | ±1.85 · 10^16 | 4.17 · 10^2 | 2.55 · 10^4 | 7.41 · 10^15 |
| Korns-12 | (1 + 4)-CGP | 1.57 · 10^18 | 3.12 · 10^18 | ±3.12 · 10^17 | 1.44 · 10^3 | 9.24 · 10^3 | 6.47 · 10^17 |
| | (1 + 4)-CGP-ID | 5.34 · 10^19 | 5.09 · 10^20 | ±5.09 · 10^19 | 4.45 · 10^3 | 5.31 · 10^5 | 5.06 · 10^19 |


Table 12 Results of the active function node range analysis. This table has been taken from the work of Kalkreuth [10]

| Problem | Algorithm | Mean active function node range | SD | SEM | 1Q | Median | 3Q |
|---|---|---|---|---|---|---|---|
| Parity-3 | (1 + 4)-CGP | 33.33 | 4.54 | ±0.45 | 31 | 33 | 35 |
| | (1 + 4)-CGP-ID | 38.31 | 7.82 | ±0.78 | 34 | 39 | 43 |
| Parity-4 | (1 + 4)-CGP | 36.18 | 3.97 | ±0.39 | 33.75 | 36 | 39 |
| | (1 + 4)-CGP-ID | 50.8 | 6.04 | ±0.60 | 47 | 51 | 55 |
| Parity-5 | (1 + 4)-CGP | 35.02 | 4.13 | ±0.41 | 32.75 | 34.75 | 37 |
| | (1 + 4)-CGP-ID | 58.81 | 5.17 | ±0.52 | 56 | 59 | 62 |
| Parity-6 | (1 + 4)-CGP | 35.18 | 4.52 | ±0.45 | 32 | 34 | 37.25 |
| | (1 + 4)-CGP-ID | 52.21 | 6.73 | ±0.63 | 47 | 53.5 | 57 |
| Parity-7 | (1 + 4)-CGP | 35.21 | 4.27 | ±0.43 | 33 | 35 | 38 |
| | (1 + 4)-CGP-ID | 51.83 | 6.80 | ±0.68 | 48 | 51 | 56.25 |
| Adder 1-Bit | (1 + 4)-CGP | 38.73 | 5.67 | ±0.56 | 35 | 38 | 42 |
| | (1 + 4)-CGP-ID | 38.98 | 5.17 | ±0.52 | 36 | 39 | 42.25 |
| Adder 2-Bit | (1 + 4)-CGP | 36 | 4.04 | ±0.40 | 33 | 36 | 38.25 |
| | (1 + 4)-CGP-ID | 52.47 | 9.37 | ±0.94 | 47 | 52 | 60 |
| Adder 3-Bit | (1 + 4)-CGP | 36.82 | 4.13 | ±0.41 | 34 | 36 | 39 |
| | (1 + 4)-CGP-ID | 45.01 | 6.59 | ±0.66 | 40 | 44 | 49 |
| Multiplier 2-Bit | (1 + 4)-CGP | 36.31 | 3.81 | ±0.38 | 34 | 36 | 39 |
| | (1 + 4)-CGP-ID | 38 | 4.95 | ±0.49 | 35 | 37 | 41 |
| Multiplier 3-Bit | (1 + 4)-CGP | 36.46 | 4.74 | ±0.47 | 33 | 36 | 39 |
| | (1 + 4)-CGP-ID | 41.28 | 5.47 | ±0.55 | 37 | 40.5 | 46 |
| Demultiplexer 3:8-Bit | (1 + 4)-CGP | 35.53 | 4.16 | ±0.42 | 32.75 | 35 | 38 |
| | (1 + 4)-CGP-ID | 42.69 | 6.00 | ±0.60 | 39 | 42 | 46.25 |
| Comparator 4×1-Bit | (1 + 4)-CGP | 30.38 | 3.53 | ±0.35 | 28 | 30 | 33 |
| | (1 + 4)-CGP-ID | 32.6 | 5.40 | ±0.54 | 29 | 32 | 36 |


Table 13 Results of the active function node range analysis

| Problem | Algorithm | Mean active function node range | SD | SEM | 1Q | Median | 3Q |
|---|---|---|---|---|---|---|---|
| Koza-1 | (1 + 4)-CGP | 8.63 | 3.36 | ±0.33 | 6.75 | 8 | 11 |
| | (1 + 4)-CGP-ID | 14.16 | 4.93 | ±0.49 | 10 | 14 | 17.25 |
| Koza-2 | (1 + 4)-CGP | 10.33 | 4.20 | ±0.42 | 8 | 10 | 13 |
| | (1 + 4)-CGP-ID | 15.1 | 4.83 | ±0.48 | 11.75 | 15 | 17.5 |
| Koza-3 | (1 + 4)-CGP | 11.04 | 3.72 | ±0.37 | 8.75 | 11 | 13 |
| | (1 + 4)-CGP-ID | 15.93 | 4.87 | ±0.48 | 13 | 16 | 19.25 |
| Nguyen-4 | (1 + 4)-CGP | 9.51 | 3.62 | ±0.36 | 7 | 10 | 12 |
| | (1 + 4)-CGP-ID | 14.79 | 4.53 | ±0.45 | 12 | 15 | 18 |
| Nguyen-5 | (1 + 4)-CGP | 10.31 | 4.49 | ±0.45 | 6.75 | 9 | 13 |
| | (1 + 4)-CGP-ID | 14.75 | 4.33 | ±0.43 | 12 | 15 | 17 |
| Nguyen-6 | (1 + 4)-CGP | 9.27 | 4.02 | ±0.40 | 6 | 9 | 11 |
| | (1 + 4)-CGP-ID | 13.88 | 4.82 | ±0.48 | 10 | 14 | 17.25 |
| Nguyen-7 | (1 + 4)-CGP | 8.52 | 3.97 | ±0.39 | 5.75 | 8 | 11 |
| | (1 + 4)-CGP-ID | 13.9 | 4.70 | ±0.47 | 10.75 | 14 | 17 |
| Keijzer-6 | (1 + 4)-CGP | 11.61 | 4.75 | ±0.47 | 8 | 11 | 14.25 |
| | (1 + 4)-CGP-ID | 13.38 | 4.73 | ±0.47 | 10 | 13 | 16 |
| Pagie-1 | (1 + 4)-CGP | 12.05 | 5.37 | ±0.53 | 8 | 12 | 15 |
| | (1 + 4)-CGP-ID | 16.83 | 7.35 | ±0.73 | 11 | 15 | 22 |
| Vladislavleva-4 | (1 + 4)-CGP | 16.02 | 5.55 | ±0.55 | 13 | 16 | 20 |
| | (1 + 4)-CGP-ID | 20.93 | 7.04 | ±0.70 | 17 | 21 | 25 |
| Korns-12 | (1 + 4)-CGP | 7.14 | 3.25 | ±0.32 | 5 | 7 | 9 |
| | (1 + 4)-CGP-ID | 9.54 | 4.39 | ±0.44 | 6 | 9 | 12.25 |


5 Comparison to EGGP

We compared three advanced CGP algorithms to a recently introduced method for evolving graphs called Evolving Graphs by Graph Programming (EGGP), introduced by Atkinson et al. [2]. In their experiments, Atkinson et al. compared EGGP to standard CGP and showed that EGGP performs significantly better on the majority of the tested Boolean function problems. Consequently, we chose EGGP as the baseline for our algorithm comparison. Furthermore, since we evaluated the same set of Boolean function problems as Atkinson et al., we directly compared the results of our experiments with theirs. For our algorithm comparison, we chose the (1 + 4)-CGP-ID algorithm and also compared EGGP to a (2 + 2)-CGP algorithm with μ = 2 and λ = 2, equipped with the subgraph crossover technique [11]. Moreover, we evaluated the (2 + 2)-CGP algorithm with and without the insertion and deletion technique. In the presented results, the (2 + 2)-CGP equipped with insertion and deletion mutation is denoted as (2 + 2)-CGP-ID. We tuned important parameters such as the crossover and mutation rates empirically. Moreover, we empirically tuned the parameters μ and λ and found that a configuration of μ = λ = 2 performs best on our benchmark problems. The parameter settings for the crossover and mutation rates of the (2 + 2)-CGP and (2 + 2)-CGP-ID algorithms are shown in Table 14. We measured the number of fitness evaluations until a correct solution was found, similar to our search performance evaluation. To compare our results directly, we utilized the same evaluation method as Atkinson et al. by calculating the median value, the median absolute deviation (MAD), and the interquartile range (IQR). Table 15 shows the results of the algorithm comparison for all tested Boolean function problems. It is visible that the median values of the (2 + 2)-CGP-ID and EGGP are on the same level.
It is also visible that we achieved a lower median number of fitness evaluations with the (2 + 2)-CGP-ID algorithm on some of the tested problems. Please note that the results for EGGP have been taken directly from the work of Atkinson et al.
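The three robust statistics used in the comparison (median, MAD, IQR) can be computed as follows. The evaluation counts in the example are invented for illustration, not taken from Table 15.

```python
import statistics

def mad(values):
    """Median absolute deviation: the median of |x - median(values)|."""
    med = statistics.median(values)
    return statistics.median(abs(x - med) for x in values)

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Fitness evaluations until success over ten hypothetical runs.
evals = [1800, 2100, 1900, 2600, 3100, 1700, 2300, 2000, 2500, 2200]
print(statistics.median(evals), mad(evals), iqr(evals))  # 2150.0 300.0 650.0
```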

6 Discussion

The primary concern of our experiments was to find significant contributions of the insertion and deletion mutation technique to the search performance of CGP. The results of our experiments showed beneficial effects on a diverse set of Boolean function and symbolic regression problems. We observed a reduced number of fitness evaluations on all tested Boolean function problems when the insertion and deletion mutation techniques were in use. Moreover, we observed a better fitness value after a predefined budget of fitness evaluations on all tested symbolic regression problems. For the simpler of our tested symbolic regression problems, the number of fitness evaluations was clearly reduced. Overall, our experiments indicate


Table 14 Parametrization of the (2 + 2)-CGP and (2 + 2)-CGP-ID algorithms using subgraph crossover. This table has been taken from the work of Kalkreuth [10]

| Problem | Algorithm | Crossover rate (%) | Point mut. rate (%) | Insertion rate (%) | Deletion rate (%) |
|---|---|---|---|---|---|
| Parity-3 | (2 + 2)-CGP | 50 | 4 | – | – |
| | (2 + 2)-CGP-ID | 75 | 1 | 10 | 10 |
| Parity-4 | (2 + 2)-CGP | 75 | 4 | – | – |
| | (2 + 2)-CGP-ID | 75 | 2 | 20 | 20 |
| Parity-5 | (2 + 2)-CGP | 75 | 4 | – | – |
| | (2 + 2)-CGP-ID | 75 | 1 | 8 | 2 |
| Parity-6 | (2 + 2)-CGP | 75 | 4 | – | – |
| | (2 + 2)-CGP-ID | 50 | 1 | 6 | 3 |
| Parity-7 | (2 + 2)-CGP | 50 | 4 | – | – |
| | (2 + 2)-CGP-ID | 50 | 1 | 6 | 3 |
| Adder 1-Bit | (2 + 2)-CGP | 25 | 4 | – | – |
| | (2 + 2)-CGP-ID | 50 | 2 | 7.5 | 7.5 |
| Adder 2-Bit | (2 + 2)-CGP | 25 | 4 | – | – |
| | (2 + 2)-CGP-ID | 50 | 1 | 10 | 10 |
| Adder 3-Bit | (2 + 2)-CGP | 25 | 4 | – | – |
| | (2 + 2)-CGP-ID | 50 | 1 | 10 | 5 |
| Multiplier 2-Bit | (2 + 2)-CGP | 25 | 4 | – | – |
| | (2 + 2)-CGP-ID | 50 | 2 | 5 | 5 |
| Multiplier 3-Bit | (2 + 2)-CGP | 50 | 4 | – | – |
| | (2 + 2)-CGP-ID | 50 | 1 | 6 | 3 |
| Demultipl. 3:8-Bit | (2 + 2)-CGP | 25 | 4 | – | – |
| | (2 + 2)-CGP-ID | 75 | 2 | 10 | 10 |
| Comparator 4×1-Bit | (2 + 2)-CGP | 25 | 4 | – | – |
| | (2 + 2)-CGP-ID | 75 | 1 | 5 | 5 |


Table 15 Results of the algorithm comparison. This table has been taken from the work of Kalkreuth [10]

| Problem | Algorithm | Median | MAD | IQR |
|---|---|---|---|---|
| Parity-3 | (1 + 4)-CGP-ID | 1928 | 1578 | 2052 |
| | (2 + 2)-CGP | 2778 | 2986 | 4564 |
| | (2 + 2)-CGP-ID | 2203 | 1318 | 2098 |
| | EGGP | 2755 | 1558 | 4836 |
| Parity-4 | (1 + 4)-CGP-ID | 11920 | 6876 | 11061 |
| | (2 + 2)-CGP | 14723 | 12391 | 16432 |
| | (2 + 2)-CGP-ID | 10701 | 5333 | 8711 |
| | EGGP | 13920 | 5803 | 11629 |
| Parity-5 | (1 + 4)-CGP-ID | 34622 | 21174 | 28572 |
| | (2 + 2)-CGP | 128807 | 83201 | 105579 |
| | (2 + 2)-CGP-ID | 27821 | 14715 | 25519 |
| | EGGP | 34368 | 15190 | 30054 |
| Parity-6 | (1 + 4)-CGP-ID | 92466 | 42247 | 74034 |
| | (2 + 2)-CGP | 534039 | 505962 | 721456 |
| | (2 + 2)-CGP-ID | 69742 | 31376 | 46839 |
| | EGGP | 83053 | 33273 | 66611 |
| Parity-7 | (1 + 4)-CGP-ID | 238426 | 123789 | 149330 |
| | (2 + 2)-CGP | 1966944 | 1558881 | 2039929 |
| | (2 + 2)-CGP-ID | 172182 | 72077 | 114928 |
| | EGGP | 197575 | 61405 | 131215 |
| Adder 1-Bit | (1 + 4)-CGP-ID | 5876 | 5157 | 6906 |
| | (2 + 2)-CGP | 8950 | 8951 | 11131 |
| | (2 + 2)-CGP-ID | 4838 | 3864 | 6377 |
| | EGGP | 5723 | 3020 | 7123 |
| Adder 2-Bit | (1 + 4)-CGP-ID | 84258 | 64105 | 85338 |
| | (2 + 2)-CGP | 191683 | 146445 | 212833 |
| | (2 + 2)-CGP-ID | 60568 | 40591 | 55450 |
| | EGGP | 74633 | 32863 | 66018 |
| Adder 3-Bit | (1 + 4)-CGP-ID | 584198 | 549282 | 640965 |
| | (2 + 2)-CGP | 2991999 | 2379680 | 3438321 |
| | (2 + 2)-CGP-ID | 378685 | 259886 | 381805 |
| | EGGP | 275180 | 114838 | 298250 |
| Multiplier 2-Bit | (1 + 4)-CGP-ID | 10196 | 17576 | 17543 |
| | (2 + 2)-CGP | 17704 | 20544 | 19383 |
| | (2 + 2)-CGP-ID | 7787 | 10345 | 10164 |
| | EGGP | 14118 | 5553 | 12955 |
| Multiplier 3-Bit | (1 + 4)-CGP-ID | 250396 | 236555 | 343552 |
| | (2 + 2)-CGP | 1024142 | 777862 | 993072 |
| | (2 + 2)-CGP-ID | 166686 | 118461 | 196298 |
| | EGGP | 1241880 | 437210 | 829223 |

(continued)


Table 15 (continued)

| Problem | Algorithm | Median | MAD | IQR |
|---|---|---|---|---|
| Demultiplexer 3:8-Bit | (1 + 4)-CGP-ID | 13704 | 6736 | 10797 |
| | (2 + 2)-CGP | 21047 | 9443 | 15538 |
| | (2 + 2)-CGP-ID | 9978 | 6394 | 9554 |
| | EGGP | 16763 | 4710 | 9210 |
| Comparator 4×1-Bit | (1 + 4)-CGP-ID | 272924 | 172932 | 290674 |
| | (2 + 2)-CGP | 3207723 | 1788937 | 3045088 |
| | (2 + 2)-CGP-ID | 217799 | 122378 | 182878 |
| | EGGP | 262660 | 84248 | 174185 |

that the use of the insertion and deletion mutation is beneficial for the search performance of CGP in the Boolean function and symbolic regression domains. However, since we primarily evaluated the search performance on recommended test problems which are well known in the GP community, we think it would be helpful to investigate more symbolic regression problems for which an ideal solution is more likely to be found on average. The reason is that for this kind of problem, the search performance of the respective CGP algorithm can more often be evaluated after the whole evolutionary process has completed. Our experiments also addressed the question of in which way insertion and deletion mutation improve the search performance of CGP. Our experiments showed that the sole use of the standard CGP point mutation leads to a wide range of fitness values on our tested Boolean function problems. However, the range of the fitness values was comparatively smaller when the insertion and deletion mutation techniques were in use. In the first place, our results in the Boolean domain indicate that the sole use of the point mutation operator is comparatively more disruptive in this problem domain and can influence the search performance negatively. Moreover, our experiments showed that the breeding of new individuals is comparatively less disruptive when our proposed mutations are in use. On our tested symbolic regression problems, we observed no clear pattern: on some problems, the mean range of fitness values of the (1 + 4)-CGP-ID algorithm was smaller, and on others it was bigger. However, on the majority of the tested problems, we observed a higher median value for the (1 + 4)-CGP-ID algorithm. At the moment, we have to acknowledge that we have no explanation for the results in the symbolic regression domain. Another part of our experiments was devoted to a range analysis of the active function nodes of the best individual in the population.

The results of this experiment indicate that our proposed mutations can lead to more exploration of the phenotype space. We observed the same effect in the Boolean function as well as the symbolic regression domain. Since the best individual is of high fitness, we assume that the exploration of phenotypes in high-fitness regions of the search space is essential


for the search performance of the CGP algorithm. However, more detailed analyses have to be performed in future work before more meaningful statements can be made. Another explanation for the effectiveness of our proposed mutations may be found in the work of Goldman and Punch [6], who analyzed the evolutionary mechanisms of traditional CGP and concluded: "We found large sections of the genome were never used by any ancestor of the final solution. Furthermore, offspring almost never include active nodes that were inactive in their direct parent but active in a previous ancestor."

Moreover, regarding the actual behavior of CGP, Goldman and Punch also concluded that "CGP genomes include a surprising amount of redundant and unused nodes."

Based on this very detailed and precise analysis by Goldman and Punch, we assume that our proposed mutations cause more activity in these large and normally unused sections of the CGP genome by activating inactive function nodes. Another hypothetical explanation is based on the outcome of another work by Goldman and Punch [5]. Investigating length bias and search limitations in CGP, Goldman and Punch found that CGP has an innate parsimony pressure, which makes it very difficult to evolve individuals with a high percentage of active nodes. Based on this finding, we assume that the use of the insertion and deletion mutation techniques counteracts the observed innate parsimony pressure. One indicator for this assumption is the outcome of our active function node analysis: for the (1 + 4)-CGP-ID algorithm, we observed a wider range of active function nodes processed throughout the evolutionary runs. Our comparison with EGGP indicates that the use of our proposed mutations in combination with the subgraph crossover is beneficial for CGP. Furthermore, on some of our tested problems, we achieved a lower median value for the (2 + 2)-CGP-ID algorithm when compared to EGGP. However, for more significant and meaningful statements about the current state of EGGP and CGP, a more comprehensive study that includes different problem domains is needed. For the field of graph-based Genetic Programming, this point is of high importance because comparatively little is known about the search performance of CGP and EGGP in other problem domains. Moreover, EGGP and CGP have mostly been evaluated on Boolean function problems in the past, which has resulted in a one-sided state of knowledge. Therefore, we think that comprehensive comparative studies are needed to expand the current state of knowledge.
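The hypothesized mechanism of activating otherwise unused genome sections can be made concrete with a toy model. The following sketch is our own, deliberately minimal illustration of what "activating" and "deactivating" function nodes means in a CGP-like feed-forward genome; it is not Kalkreuth's published operator definition, and the encoding (a list of nodes with input index lists) is an assumption for illustration only.

```python
import random

def active_set(genome, output):
    """Nodes reachable from the output (negative indices are program inputs)."""
    seen, stack = set(), [output]
    while stack:
        i = stack.pop()
        if i < 0 or i in seen:
            continue
        seen.add(i)
        stack.extend(genome[i]["in"])
    return seen

def insertion(genome, output, rng):
    """Activate a random inactive node by splicing it in front of an
    active consumer whose first input it may legally take over."""
    act = active_set(genome, output)
    for n in rng.sample(range(len(genome)), len(genome)):
        if n in act:
            continue
        # feed-forward constraint: n may only read from indices below n
        consumers = [a for a in act if a > n and genome[a]["in"][0] < n]
        if consumers:
            a = rng.choice(consumers)
            genome[n]["in"] = [genome[a]["in"][0], -1]  # n inherits a's source
            genome[a]["in"][0] = n                      # a now reads through n
            return True
    return False

def deletion(genome, output, rng):
    """Deactivate a random active non-output node by routing all of its
    consumers directly to its first input."""
    candidates = [i for i in active_set(genome, output) if i != output]
    if not candidates:
        return False
    n = rng.choice(candidates)
    bypass = genome[n]["in"][0]
    for node in genome:
        node["in"] = [bypass if j == n else j for j in node["in"]]
    return True

rng = random.Random(1)
genome = [{"f": 0, "in": [-1, -2]} for _ in range(6)]
genome[4]["in"] = [0, -2]
genome[5]["in"] = [4, -1]
print(len(active_set(genome, 5)))   # 3 active nodes: {0, 4, 5}
insertion(genome, 5, rng)
print(len(active_set(genome, 5)))   # 4: one formerly inactive node spliced in
```

In this toy encoding, insertion always grows the phenotype by exactly one node and deletion always shrinks it, which mirrors the intended phenotypic effect of the operators discussed above.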
In the first place, this effort should tackle comparisons in the symbolic regression domain. For our comparison, we started by using the same set of benchmark problems as in the work of Atkinson et al. [2]. As a next step, we have to compare the search performance of EGGP and our advanced techniques on our set of symbolic regression problems. Regarding the reasons for the effectiveness of the (2 + 2)-CGP-ID algorithm, we have to acknowledge that we do not yet have an answer to the question


in which way the combination of subgraph crossover and our proposed mutations contributes to the search performance of CGP. The results of our experiments open two questions that have to be tackled in our future work. First, we have to find out in which way the (2 + 2)-CGP-ID algorithm contributes to the search performance of CGP. To gain insight into the detailed functional mechanisms of the (2 + 2)-CGP and (2 + 2)-CGP-ID algorithms, we have to understand the proposed methods in detail. As a first step, we think a separate investigation of the exploitation and exploration effects of the (2 + 2)-CGP and (2 + 2)-CGP-ID algorithms would be helpful. Second, we have to tackle the question of why small population sizes are generally successful in the Boolean domain. Since the effectiveness of the (1 + 4)-CGP in the Boolean domain is well known in the field of CGP [20, 21], our experiments with the (2 + 2)-CGP-ID algorithm underline the effectiveness of small population sizes in this problem domain. Consequently, there is a need for more insight into the conditions observed in our experiments.

7 Conclusion

In this paper, we evaluated and analyzed two recently introduced phenotypic mutation techniques in two different problem domains. The results of our experiments clearly show that our proposed methods can be beneficial for CGP and that the improvement of the search performance is not bound to only one problem domain. Since the work of Kalkreuth [10] evaluated the insertion and deletion mutation in only one problem domain, this work can be seen as an important extension. We compared CGP to another state-of-the-art method for evolving graphs and showed that advanced methods of crossover and mutation allow CGP to perform well. On the one hand, the analytic part of our experiments showed that the sole use of the point mutation operator in CGP can cause more disruptive effects in the Boolean function domain than its use in combination with our proposed mutations. However, our analytic experiments in the symbolic regression domain also showed that this observation cannot be generalized. Moreover, our experiments indicate that our proposed mutations enable a wider search in high-fitness regions of the search space. This effect has been observed on all tested benchmark problems of two different problem domains. For more significant statements about the beneficial effects of the insertion and deletion mutation, a rigorous and comprehensive study on a larger set of problems is needed, which should include more popular GP problem domains such as classification, predictive modeling, or path finding and planning.


8 Future Work

We will mainly focus on expanding our search performance evaluations by including more GP problem domains. This will also include a more comprehensive comparison between EGGP and CGP. Additionally, we will focus on more analytic experiments. These will primarily include an analysis of the exploration abilities of CGP when the proposed mutations are in use. For these experiments, we plan direct comparisons of the behavior on problems in the Boolean function domain and the symbolic regression domain. Another part of our future work is devoted to a detailed investigation of the (2 + 2)-CGP-ID algorithm with subgraph crossover and our mutations. This will also include an investigation of the way in which the subgraph crossover and our proposed mutations work together and whether there are similar functional behaviors across different problems.

References

1. Angeline, P.J.: An investigation into the sensitivity of genetic programming to the frequency of leaf selection during subtree crossover. In: Koza, J.R., Goldberg, D.E., Fogel, D.B., Riolo, R.L. (eds.) Genetic Programming 1996: Proceedings of the First Annual Conference, Stanford University, CA, USA, 28–31 July 1996, pp. 21–29. MIT Press (1996)
2. Atkinson, T., Plump, D., Stepney, S.: Evolving graphs by graph programming. In: Castelli, M., Sekanina, L., Zhang, M., Cagnoni, S., Garcia-Sanchez, P. (eds.) EuroGP 2018: Proceedings of the 21st European Conference on Genetic Programming, Parma, Italy, 4–6 April 2018. LNCS, vol. 10781, pp. 35–51. Springer (2018)
3. Cramer, N.L.: A representation for the adaptive generation of simple sequential programs. In: Proceedings of the 1st International Conference on Genetic Algorithms, Hillsdale, NJ, USA, pp. 183–187. L. Erlbaum Associates Inc. (1985)
4. Forsyth, R.: BEAGLE – a Darwinian approach to pattern recognition. Kybernetes 10(3), 159–166 (1981)
5. Goldman, B.W., Punch, W.F.: Length bias and search limitations in cartesian genetic programming. In: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, GECCO '13, pp. 933–940. ACM, New York, NY, USA (2013)
6. Goldman, B.W., Punch, W.F.: Analysis of cartesian genetic programming's evolutionary mechanisms. IEEE Trans. Evol. Comput. 19(3), 359–373 (2015)
7. Hicklin, J.: Application of the genetic algorithm to automatic program generation. Master's thesis (1986)
8. Kalganova, T., Miller, J.F.: Evolutionary approach to design multiple-valued combinational circuits. In: Proceedings of the International Conference on Applications of Computer Systems (ACS) (1997)
9. Kalkreuth, R.: Towards advanced phenotypic mutations in cartesian genetic programming. CoRR arXiv:1803.06127 (2018)
10. Kalkreuth, R.: Two new mutation techniques for cartesian genetic programming. In: Proceedings of the 11th International Joint Conference on Computational Intelligence, IJCCI 2019, Vienna, Austria, 17–19 September 2019, pp. 82–92 (2019)
11. Kalkreuth, R., Rudolph, G., Droschinsky, A.: A new subgraph crossover for cartesian genetic programming. In: Castelli, M., McDermott, J., Sekanina, L. (eds.) EuroGP 2017: Proceedings of the 20th European Conference on Genetic Programming, Amsterdam, 19–21 April 2017. LNCS, vol. 10196, pp. 294–310. Springer (2017)
12. Kaufmann, P., Platzner, M.: Advanced techniques for the creation and propagation of modules in cartesian genetic programming. In: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, GECCO '08, pp. 1219–1226. ACM, New York, NY, USA (2008)
13. Koza, J.R.: Genetic programming: a paradigm for genetically breeding populations of computer programs to solve problems. Technical Report STAN-CS-90-1314, Department of Computer Science, Stanford University, June 1990
14. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
15. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press, Cambridge (1994)
16. Kraft, D.H., Petry, F.E., Buckles, B.P., Sadasivan, T.: The use of genetic programming to build queries for information retrieval. In: Proceedings of the 1994 IEEE World Congress on Computational Intelligence, Orlando, Florida, USA, 27–29 June 1994, vol. 1, pp. 468–473. IEEE Press (1994)
17. Manfrini, F.A.L., Bernardino, H.S., Barbosa, H.J.C.: A novel efficient mutation for evolutionary design of combinational logic circuits. In: Handl, J., Hart, E., Lewis, P.R., López-Ibáñez, M., Ochoa, G., Paechter, B. (eds.) Parallel Problem Solving from Nature – PPSN XIV, pp. 665–674. Springer International Publishing, Cham (2016)
18. McDermott, J., White, D.R., Luke, S., Manzoni, L., Castelli, M., Vanneschi, L., Jaśkowski, W., Krawiec, K., Harper, R., De Jong, K., O'Reilly, U.-M.: Genetic programming needs better benchmarks. In: Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation, GECCO '12, pp. 791–798. ACM, Philadelphia (2012)
19. Miller, J.F., Thomson, P., Fogarty, T.: Designing electronic circuits using evolutionary algorithms. Arithmetic circuits: a case study (1997)
20. Miller, J.F.: An empirical study of the efficiency of learning Boolean functions using a cartesian genetic programming approach. In: Proceedings of the Genetic and Evolutionary Computation Conference, Orlando, Florida, USA, 13–17 July 1999, vol. 2, pp. 1135–1142. Morgan Kaufmann (1999)
21. Miller, J.F., Smith, S.L.: Redundancy and computational efficiency in cartesian genetic programming. IEEE Trans. Evol. Comput. 10(2), 167–174 (2006)
22. Ni, F., Li, Y., Yang, X., Xiang, J.: An orthogonal cartesian genetic programming algorithm for evolvable hardware. In: 2014 International Conference on Identification, Information and Knowledge in the Internet of Things (IIKI), pp. 220–224 (2014)
23. White, D.R., McDermott, J., Castelli, M., Manzoni, L., Goldman, B.W., Kronberger, G., Jaśkowski, W., O'Reilly, U.-M., Luke, S.: Better GP benchmarks: community survey results and proposals. Genet. Program. Evolvable Mach. 14(1), 3–29 (2013)

Handling Complexity in Some Typical Problems of Distributed Systems by Using Self-organizing Principles

Vesna Šešum-Čavić

Abstract Today’s software systems are continuously becoming more complex. Main factors that determine software complexity are huge amounts of distributed components, heterogeneity, problem size and dynamic changes of the environment. These challenges are especially emphasized in distributed software systems. To cope with unforeseen dynamics in the environment and vast number of unpredictable dependencies on participating components, employing of self-organization principles at different levels in the software architecture can be beneficial. This could help in shifting complexity from one central coordinator component to many distributed, autonomously acting software components. Swarm intelligence represents a selforganizing biological system. Therefore, swarm-inspired algorithms play an important role in the design of self-organizing software for distributed systems and enable different kinds of self-organization. This chapter is based on my keynote at IJCCI 2019 with the purpose to provide a brief overview of the significance and power of swarm intelligence in coping with some typical distributed systems’ problems as well as findings about how and in which use cases the principles of self-organization can contribute to reduce software complexity. Keywords Self-organization · Swarm intelligence · Distributed systems · Complexity

1 Introduction

Distributed systems develop rapidly and become more and more complex. They usually contain a huge number of heterogeneous and mobile nodes, so heterogeneity can be identified as one of the main challenges. When integration of multiple systems is needed, the following issues should be taken into consideration: the systems differ in their capabilities in terms of integration, have disparities in data, use and support different technologies and standards, and may run on different platforms [16].

V. Šešum-Čavić (B)
Institute of Information Systems Engineering, Faculty of Informatics, TU Wien, Argentinierstr. 8, 1040 Vienna, Austria
e-mail: [email protected]

© Springer Nature Switzerland AG 2021 J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_5


Distributed systems are forced to integrate other software systems and components that are often not reliable, exhibit bad performance, and are sometimes unavailable. Such software is typically characterized by a huge problem size in terms of the number of computers, clients, requests, and size of queries, by the autonomy and heterogeneity of participating organizations, and by dynamic changes of the environment. Therefore, their complexity1 becomes a critical issue. System complexity has been widely identified as an important problem [1, 12]. Ranganathan and Campbell [16] identify five aspects of distributed system complexity: task-structure complexity, unpredictability, size complexity, chaotic complexity, and algorithmic complexity. To cope with huge dynamics and a vast number of unpredictable dependencies on participating components, other approaches are needed. In attacking complexity, [16] proposes self-configuration and self-repair, high-level programming and interaction, and hierarchical organization of systems and concepts. Another useful approach is the implementation of autonomously acting components inspired by nature. Such components act in a dynamic, ad hoc way and adapt quickly and autonomously both to changing requirements and to dynamically evolving system states caused by the interplay and contribution of the many components towards a global goal. The unavoidable complexity cannot be eliminated, but it can be shifted. Among the well-known tools for coping with complexity (e.g., abstraction, decoupling, decomposition), a self-organizing approach represents one promising way. Certainly, self-* systems will not be able to adapt to all possible events, but they promise a good perspective for dealing with complexity. Herrmann [9] depicts the necessity of self-* mechanisms in distributed systems.

1.1 Self-organization

Researchers have experimented with different paradigms in order to achieve the main properties of self-* systems. Self-* behavior appears in systems without intervention by external directing influences (instructions from a "supervisory leader" or an order imposed on them in many different ways, such as various directives, recipes, or templates) and forms patterns through interactions among their components [3, 10]. Although a functional structure appears and is maintained spontaneously, complex systems are not arbitrarily regulated but ordered in a very organized way. This organization is not built into the system at its origin; it emerges in a sequence of self-organizing processes that include spontaneous transitions into new states of higher organizational complexity. Patterns are well-organized structures [3] and can refer to an arrangement of objects both in space (e.g., a zebra's coat) and in time (e.g., firefly flashing). A self-organizing system possesses multiple interdependent components that cooperate in self-initiated interactions [10] through which information is exchanged.

1 Note that there are no standard, generally accepted definitions of complexity.


Self-organization in a system appears at different levels (from the lowest to the highest), and each of these levels can exhibit its own self-organization. Interacting components are constantly changing their state. "Decisions", and consequently changes, are local (e.g., in an ant colony, each ant "decides" on its own which path it will choose). Also, components only interact with their immediate "neighbours". A mutual dependency implies that changes are not arbitrary: some relative states are "preferable", in the sense that they will be reinforced or stabilized (like those paths in an ant colony where there is more pheromone), while others are eliminated. The components of the lowest level produce their own emergent properties (patterns) and form the building blocks for the next higher level of organization, with different emergent properties, and this process can proceed further to higher levels in turn. Most dynamic systems are metastable, possessing many attractors as alternative stable positions. "Noise" (fluctuations) in a system allows the system to escape one basin and to enter another, leading the system to the optimal organization. The basic mechanism underlying self-organization is the variation that governs any dynamic system and allows for the exploration of different regions in a state space until the system happens to reach an attractor, a preferred position of the system. Thus, increasing variation, i.e., adding "noise" to the system, implies that the exploration of the state space will be emphasized, accelerated, and deepened. Upon reaching an attractor, the system comes to a stable state. A further exploration of new state space positions can be continued if random changes are introduced, which can cause the system to move towards a new attractor [10]. Mathematically speaking, it is possible to have several local optima, but only one global optimum.
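The attractor-escape argument can be made concrete with a small numerical experiment. This is our own illustration, not taken from the chapter: a stochastic hill-climber on a two-peak landscape, where a Metropolis-style acceptance temperature plays the role of the "noise". All function names and constants here are illustrative choices.

```python
import math
import random

def landscape(x):
    """1-D landscape with a local peak at x = -1 (height ~1.0) and the
    global peak at x = 2 (height ~2.0)."""
    return math.exp(-(x + 1) ** 2) + 2.0 * math.exp(-(x - 2) ** 2)

def reaches_global(temp, rng, steps=2000):
    """Start in the local basin; return True if the walker ever gets near
    the global peak. `temp` is the noise level."""
    x = best = -1.0
    for _ in range(steps):
        cand = x + rng.gauss(0.0, 0.1)          # small local variation
        delta = landscape(cand) - landscape(x)
        # accept improvements always; accept worse moves with Metropolis prob.
        if delta >= 0 or (temp > 0 and rng.random() < math.exp(delta / temp)):
            x = cand
        if landscape(x) > landscape(best):
            best = x
    return landscape(best) > 1.5                # i.e. entered the global basin

rng = random.Random(42)
hits_greedy = sum(reaches_global(0.0, rng) for _ in range(20))
hits_noisy = sum(reaches_global(0.5, rng) for _ in range(20))
print(f"global basin reached: {hits_greedy}/20 without noise, {hits_noisy}/20 with noise")
```

Without noise, the walker can only accept improving moves and therefore stays trapped at the local attractor; with noise, downhill moves are occasionally accepted, the basin can be escaped, and the global optimum is found in most runs.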
Self-organization mechanisms have a fully distributed character in a dynamical system, i.e., they must be distributed over all participating components. We cannot "invent" new forms of self-*: they already exist around us. However, we can learn from biologically based mechanisms and try to transfer and implement such mechanisms in software systems. Such systems have the following advantages over traditional systems: robustness, flexibility, the capability to function autonomously while demanding a minimum of supervision, and the spontaneous development of complex adaptations without the need for detailed planning. In the mapping, software agents usually play the role of particular swarm individuals (e.g., ants, bees, etc.) and "perform" the self-* actions characteristic of the respective swarm colony. All these mechanisms are characterized by a huge number of different environmental parameters influencing the behaviour of artificial swarms. Although the self-* approach is attractive and promising, and has been proven to cope with complexity, the starting question is how to determine whether or not to apply the principles of self-organization to a particular use case. First, it is necessary to discern what kind of complexity exists in a particular problem. Based on that information, a conclusion can be made about which self-* mechanisms or principles could be suitable for the case at hand. For example, if a considered problem possesses programming complexity (and additionally the system itself is rather heterogeneous, like a distributed heterogeneous system), then a high level of autonomy and decoupling is necessary to have some success in coping with this complexity.

V. Šešum-Ćavić

1.2 Complexity in Application Scenarios

The sources of complexity in the application scenarios presented in Sect. 2.1 are:

1. the amount of resources, i.e., the huge number of distributed components that must interplay in a global solution,
2. the type of resources, i.e., heterogeneity,
3. the large number of interactions of the various elements of the software,
4. the huge problem size (clients, requests, size of queries, etc.),
5. the autonomy of organizations,
6. dynamic changes of the environment.

According to these sources, different types of complexity can be discerned: point 1 is pure computational complexity; points 2 and 3 address programming complexity; point 4 refers to both computational and programming complexity; whereas points 5 and 6 are consequences of the features of a complex adaptive system. In the analysis of computational complexity [8], two well-known types appear:

• time complexity—the length of time it takes to find a solution or complete a process, as a function of the size of the input;
• space complexity—the amount of physical storage required for a system to perform a certain operation, i.e., to solve an instance of the problem, as a function of the size of the input.

Every task² can contain subtasks. The order of complexity of a task is determined by analyzing the demands of the task and breaking it down into its constituent parts [5]. Tasks vary in complexity in two ways: horizontally (involving classical information) or vertically, i.e., hierarchically (involving hierarchical information). Horizontal complexity is the amount of information, in simple quantitative terms, within a task, and consists of the number of different responses that have to be performed [5]. Hierarchical complexity refers to the number of recursions that the coordinating actions must perform on a set of primary elements. The actions at a higher order of hierarchical complexity: (a) are defined in terms of actions at the next lower order of hierarchical complexity; (b) organize and transform the lower-order actions; (c) produce organizations of lower-order actions that are qualitatively new, not arbitrary, and cannot be accomplished by those lower-order actions alone [5]. Example: Consider the action A1 of evaluating a + b and the action A2 of evaluating (a + b) + c. The horizontal complexity of A1 is smaller than the horizontal complexity of A2, since the action of addition is executed fewer times in A1 than in A2. On the other hand, because A1 differs from A2 only in how many times addition is executed, but not in the organization of the addition, both actions have the same hierarchical complexity. So, in the presented application scenarios (Sect. 2.1), the above-mentioned types of complexity can be observed: programming and computational complexity, in which both time and space complexity are present; additionally, hierarchical complexity is present.

² The notion of a task is used here as an example and could be generalized to the notion of a process.
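The a + b versus (a + b) + c example can be made concrete with a small sketch. This encoding is purely illustrative and not from [5]: expressions are nested tuples, horizontal complexity counts how often a primitive action is performed, and hierarchical complexity returns the highest order of coordinating action (the hypothetical `distribute` operator stands for a higher-order action that coordinates additions).

```python
# Illustrative sketch (assumed encoding, not from [5]): expressions as nested
# tuples, e.g. ('add', 'a', 'b') for a + b. The hypothetical 'distribute'
# operator stands for a higher-order action coordinating additions.

ORDER = {'add': 1, 'distribute': 2}

def horizontal_complexity(expr):
    """Count of primitive responses (operation applications) in the task."""
    if not isinstance(expr, tuple):
        return 0
    _, *args = expr
    return 1 + sum(horizontal_complexity(a) for a in args)

def hierarchical_complexity(expr):
    """Highest order of coordinating action; repeating a same-order action
    (more additions) does not raise it."""
    if not isinstance(expr, tuple):
        return 0
    op, *args = expr
    return max([ORDER[op]] + [hierarchical_complexity(a) for a in args])

A1 = ('add', 'a', 'b')                  # a + b
A2 = ('add', ('add', 'a', 'b'), 'c')    # (a + b) + c
assert horizontal_complexity(A1) < horizontal_complexity(A2)
assert hierarchical_complexity(A1) == hierarchical_complexity(A2)
```

Under this encoding, A2 performs more additions (higher horizontal complexity), but since both tasks only repeat the same order-1 action, their hierarchical complexity coincides, exactly as the example states.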


1.3 Measurement of Complexity

Researchers from different areas of science like biology, computer science, finance, etc., define different measures of complexity for their respective fields. Lloyd [14] presents a categorization of complexity measures by defining questions common to all problems:

1. how hard is it to describe?
2. how hard is it to create?
3. what is its degree of organization?

A general form of self-organization measurement does not exist. For example, in [4], the mechanism of "brood sorting" is used and spatial entropy is proposed as a measure of self-organization. In the selected use-cases, the measurement of self-organization, i.e., how well the individual contributors (bees, ants, …) organize themselves, is realized by means of specially constructed functions (e.g., the suitability function in Sect. 3.1). Higher values of these functions denote better self-organization in the presented systems. Computational complexity is tracked in terms of time.
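The spatial-entropy idea can be sketched in a few lines. The grid partition and Shannon entropy below are common choices for such a measure and an assumption here; [4] may differ in detail.

```python
# A hedged sketch of spatial entropy as a self-organization measure
# (grid partition and Shannon entropy are assumed choices; [4] may differ):
# lower entropy indicates stronger clustering, i.e. more organization.

import math
from collections import Counter

def spatial_entropy(positions, cell_size):
    """Shannon entropy of the distribution of items over grid cells."""
    cells = Counter((int(x // cell_size), int(y // cell_size))
                    for x, y in positions)
    total = sum(cells.values())
    return -sum((c / total) * math.log2(c / total) for c in cells.values())

scattered = [(i, j) for i in range(4) for j in range(4)]  # one item per cell
clustered = [(0.1 * k, 0.1 * k) for k in range(16)]       # piled-up items

# As brood sorting proceeds, items concentrate and the entropy drops.
assert spatial_entropy(clustered, 1.0) < spatial_entropy(scattered, 1.0)
```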

2 Swarm Intelligence in Distributed Systems

Swarm intelligence possesses distributed and autonomous properties and represents a self-organizing biological system. Every individual in the population makes local decisions and acts in a decentralized manner. The communication of "knowledge" between individuals is done without any supervisor. Therefore, swarm-inspired algorithms play an important role in the design of self-organizing software for distributed systems. They have a broad spectrum of application areas, support the optimization and robustness of highly dynamic distributed systems and fast adaptation to changes by learning from history, and enable different kinds of self-organization. For example, they provide primitives for continued execution when nodes or the network communication fail, when nodes are added or removed during execution, or even in situations when the application should be upgraded "on-the-fly" without interrupting execution.

2.1 Some Selected Distributed Systems' Use-Cases

Some typical distributed systems problems (load balancing, load clustering, information placement and retrieval in heterogeneous networks, distributed routing, peer clustering) have been successfully treated by swarm intelligence.

(1) Load Balancing can be described as finding the best possible workload (re)distribution and addresses ways to transfer excessive load from busy (overloaded) nodes to idle (under-loaded) nodes (Fig. 1). Load balancing can take place at the local node level, allocating load to several core processors of one computer, as well as at the network level, distributing the load among different nodes. Šešum-Ćavić and Kühn [18] explain for the first time how bee intelligence can be mapped to dynamic load balancing.

(2) Load Clustering deals with the clustering of work loads in a computer system. It tries to make further optimizations of the load distribution based on the content of the load items (Fig. 2). A single load item can be described as a task that consists of several attributes (e.g. a certain priority), has a payload and a dynamic life cycle, and is handled by a computer or processor. Among different clustering and classifying algorithms (K-Means, Fuzzy C-Means, Genetic K-Means, Hierarchical Clustering, K-Nearest Neighbor, Decision Trees), the usage of ant intelligence in dynamic load clustering is demonstrated in [13].

(3) Information Placement and Retrieval in Heterogeneous Networks. Šešum-Ćavić and Kühn [17] deal with data placement and retrieval in the internet (Fig. 3). Unstructured peer-to-peer overlay network technologies are combined with swarm intelligence (ant intelligence, bee intelligence and slime molds). It is proven that a good query capability with good scalability can be achieved by using swarm-based algorithms.

(4) P2P Streaming. Further, the previous use case (3) is extended to streaming in fully decentralized P2P networks [19]. It addresses the need to create a P2P application which combines video on-demand streaming and user collaboration (Fig. 4). P2P applications that support the streaming delivery method rely on hybrid approaches and, therefore, are not fully decentralized. Besides ant intelligence and bee intelligence, the lookup mechanism includes a slime mold intelligence adapted for this use case, as well as bark beetle intelligence [21], a new, simple and effective swarm-based algorithm.

(5) Distributed Routing. Šešum-Ćavić et al. [20] present the modelling of the life-cycle of cellular slime moulds and of bee behaviour based on the foraging mechanism of honey bees, in order to create fully distributed routing algorithms for unstructured P2P networks (Fig. 5). A modelling and adaptation of slime mould intelligence is done for the first time for routing in unstructured P2P networks. Bee intelligence was already applied to the routing problem in general [22]. However, in [20], another type of mapping and adaptation is proposed.

Fig. 1 Dynamic load balancing
Fig. 2 Dynamic load clustering
Fig. 3 Information retrieval
Fig. 4 P2P streaming

2.2 Algorithm Recommendation for Selected Use Cases

The selected problems numbered in Sect. 2.1 were treated using different approaches and types of algorithms (conventional and swarm-intelligent). The details of the specific adaptations and implementations can be found in [13, 17–21]. The obtained results proved the significance of using the swarm intelligence approach for complex, dynamic distributed systems problems. In this subsection, a recommendation of algorithms for the selected use cases is presented as a summary of the obtained results (Table 1).


Fig. 5 Distributed routing

3 An Illustration: Bee Algorithm for Dynamic Load Balancing

For the sake of illustration, the mapping and adaptation of the bee algorithm for dynamic load balancing [18] is briefly reviewed. As this scenario refers to unstructured P2P networks, a formalization of the P2P network model is presented. Further, some theoretical concepts of the presented bee algorithm are discussed.³

3.1 Bee Algorithm

In a honeybee colony, bees have different roles: foragers, followers, and receivers. The functioning of a bee colony relies on two main strategies: (*) navigation—searching for nectar in an unknown landscape; a forager searches for a flower with good nectar and, after finding and collecting it, returns to the hive and unloads the nectar, and (*) recruitment—a forager performs the so-called "waggle dance", i.e., it communicates its knowledge about the visited flowers (quality, distance and direction) to other bees (Fig. 6). A follower randomly chooses to follow one of the foragers and visits the

³ More details, incl. benchmarking results, can be found in [18].


Table 1 A sum-up for algorithm recommendation in selected use-cases

Load balancing (metric: absolute execution time)
– The combinations BeeAlgorithm/Sender and MinMaxAS/MinMaxAS were equally good in the chain topology
– The combinations BeeAlgorithm/Sender and MinMaxAS/RoundRobin were equally good in the ring topology
– The combinations BeeAlgorithm/BeeAlgorithm and GA/AntNet were equally good in the star topology
– The combination RoundRobin/BeeAlgorithm was the best in the full topology
– Bee algorithms play a significant role in almost every topology, as the best obtained results in each topology are based on bee algorithms used either inside subnets, or between subnets, or both

Load clustering (metric: absolute execution time)
– From the group of clustering algorithms, Hierarchical Clustering obtained the best results, whereas from the group of classification algorithms the Ant-Miner algorithm was the best
– The combination of the Hierarchical algorithm with any other, except the Genetic K-Means algorithm, leads to a good execution time. The best result was delivered by the combination of the Hierarchical and Fuzzy C-Means algorithms. Hierarchical Clustering showed the best results in a small network with only one client that supplies load. For large and more complex networks, an intelligent approach with an appropriate similarity function will help

Information placement and retrieval (metric: absolute execution time)
– The Random/AntNet algorithm is better than Random/MinMaxAS; a possible reason could be that Random/AntNet better supports dynamic processes
– The bee algorithm obtained the best results, especially on large instances

P2P streaming (metrics: absolute execution time, success rate, and average messages per node)
– The Bark Beetles algorithm, Physarum Polycephalum algorithm, Gnutella flooding, k-Walker, AntNet and Dd-slime mold algorithms are compared
– Absolute time: for a small replication rate (2%) and all network sizes, the Physarum Polycephalum algorithm outperforms the other algorithms; for a bigger replication rate (16%) and all network sizes, the Bark Beetles algorithm outperforms the other algorithms
– Average messages per node: for all replication rates and all network sizes, the Bark Beetles algorithm outperforms the other algorithms
– Success rate: for a small replication rate (2%) and network sizes of 50 and 100 nodes, the Bark Beetles algorithm has a success rate comparable to Gnutella; for a bigger replication rate (16%) and all network sizes, the Bark Beetles algorithm has a 100% success rate

Routing in P2P (metrics: data packet delivery ratio, average data packet delay, average data packet hop count, and routing overhead messages)
– The Slime Mold routing algorithm (SMNet) outperformed all other benchmarked routing algorithms (AntNet, BeeHive, Gnutella, k-Random Walker) regarding the average delivery delay of data packets with growing numbers of network nodes and data packet traffic
– The Bee routing algorithm (BeeNet) took the overall second place, right after SMNet

Fig. 6 A honeybee colony in nature [2]

"advertised" flower. Foragers and followers can change their roles in the next step of navigation. A receiver processes the nectar in the hive. A software agent plays the role of a bee and resides at a particular node. A node consists of exactly one hive and one flower in its environment. A task is one nectar unit in a flower. The following situations are possible: (*) there are several nectar units in a flower, (*) a flower is empty (in that case, it is not removed from the system). A new task can be put at any node in the network. A hive has k stationary bees (receivers) and l outgoing bees (foragers and followers). Initially, all outgoing bees are foragers.


Foragers scout for a "partner" node of the node that they belong to, i.e., a particular resource for load balancing, to get or put work load from/to it. Further, they inform and recruit followers. Thus, the main actors are foragers and followers, as receiver bees process tasks at their node and have no influence on the algorithm itself [18]. The goal is to find the best partner node (determined by means of a suitability function) by taking the best path (here defined as the shortest path). A navigation strategy determines which node will be visited next and is realized by a state transition rule [23]:

$$P_{ij}(t) = \frac{[\rho_{ij}(t)]^{\alpha} \cdot [1/d_{ij}]^{\beta}}{\sum_{j \in A_i(t)} [\rho_{ij}(t)]^{\alpha} \cdot [1/d_{ij}]^{\beta}} \tag{1}$$

where $d_{ij}$ is the heuristic distance between $i$ and $j$, $\alpha$ is a binary variable that turns the arc fitness influence on/off, $\beta$ is the parameter that controls the significance of the heuristic distance, and $\rho_{ij}(t)$ is the arc fitness from node $i$ to node $j$ at time $t$. For a forager, $\rho_{ij} = 1/k$, where $k$ is the number of neighbouring nodes of node $i$; for a follower [23]:

$$\rho_{ij}(t) = \begin{cases} \lambda & \text{if } j \in F_i(t) \\[4pt] \dfrac{1 - \lambda \cdot |A_i(t) \cap F_i(t)|}{|A_i(t)| - |A_i(t) \cap F_i(t)|} & \text{if } j \notin F_i(t) \end{cases} \qquad \forall j \in A_i(t),\ 0 \le \lambda \le 1 \tag{2}$$
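The navigation rule of Eqs. 1 and 2 can be sketched in a few lines. The function and variable names below follow the text; the distances and parameter values are assumed toy data, not taken from [18] or [23].

```python
# A sketch of the navigation rule (Eqs. 1 and 2); names follow the text,
# distances and parameters are assumed toy values.

def arc_fitness(neighbours, favoured, lam, role):
    """rho_ij: 1/k for a forager; Eq. 2 for a follower."""
    if role == 'forager':
        return {j: 1.0 / len(neighbours) for j in neighbours}
    n_fav = len(set(neighbours) & set(favoured))
    return {j: lam if j in favoured
            else (1.0 - lam * n_fav) / (len(neighbours) - n_fav)
            for j in neighbours}

def transition_probabilities(rho, dist, alpha=1, beta=1.0):
    """P_ij per Eq. 1: [rho_ij]^alpha * [1/d_ij]^beta, normalized over A_i."""
    weight = {j: (rho[j] ** alpha) * ((1.0 / dist[j]) ** beta) for j in rho}
    total = sum(weight.values())
    return {j: w / total for j, w in weight.items()}

rho = arc_fitness(['B', 'C', 'D'], favoured=['B'], lam=0.9, role='follower')
P = transition_probabilities(rho, dist={'B': 1.0, 'C': 2.0, 'D': 4.0})
assert abs(sum(P.values()) - 1.0) < 1e-9
assert P['B'] > P['C'] > P['D']   # the favoured, nearer node is most likely
```

With one favoured neighbour, the non-favoured arcs get $(1-\lambda)/(l-1)$ each, which is exactly the lower bound used later in the convergence proof (Sect. 3.3).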

where $A_i(t)$ is the set of allowed next nodes, i.e., the set of neighbouring nodes of node $i$, and $F_i(t)$ is the set of favoured next nodes recommended by the preferred path. During the recruitment, bees communicate using the following parameters: the path (distance) and the quality of the solution. Therefore, the fitness function $f_i$ for a particular bee $i$ can be derived as [18]:

$$f_i = \frac{1}{H_i}\,\delta \tag{3}$$

where $H_i$ is the number of hops on the tour and $\delta$ is the suitability function. The colony's fitness function $f_{colony}$ is the average of all fitness functions (for $n$ bees):

$$f_{colony} = \frac{1}{n} \sum_{i=1}^{n} f_i \tag{4}$$

If bee $i$ finds a highly suitable partner node, then its fitness function $f_i$ obtains a good value. After a trip, an outgoing bee determines how "good it was" by comparing its result $f_i$ with $f_{colony}$, and based on that it decides its next role [15]. Therefore, the following two situations can occur [18]:


• if a bee of an under-loaded node searches for a suitable task belonging to some overloaded node, then this bee carries the information about how complex a task the node can accept;
• if a bee of an overloaded node searches for an under-loaded node that can accept one or more tasks from this overloaded node, then it carries the information about the complexity of the tasks this overloaded node offers and compares it with the available resources of the under-loaded node that it currently visits.

In both cases, the complexity of the task should be compared with the available resources at a node [18]. For this purpose, the following notions are introduced: task complexity c, host load hl and host speed hs, where hs is relative in a heterogeneous environment, hl represents the fraction of the machine that is not available to the application, and c is the time necessary for a machine with hs = 1 to complete a task when hl = 0. We calculate the argument x = (c/hs)/(1 − hl) of the suitability function δ and define it as δ = δ(x). For example, when an under-loaded node with high resource capacities is taking work from an overloaded node, a partner node offering tasks with small complexity is not a good partner, as other nodes could perform these small tasks as well. Taking them would mean wasting available resources.
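The suitability argument x = (c/hs)/(1 − hl) and the role decision can be combined into a short sketch. The concrete shape of δ(x) and the deterministic role-switch threshold below are assumptions for illustration; the chapter fixes only the argument of δ and the comparison of f_i with f_colony.

```python
# A hedged sketch of partner estimation: x = (c/hs)/(1 - hl) per the text,
# while the shape of delta(x) and the hard role-switch threshold are
# illustrative assumptions only.

def suitability(c, hs, hl):
    """delta(x) with x = (c/hs)/(1 - hl): task complexity c, host speed hs,
    host load hl. Assumed shape: reward tasks that nearly fill the free
    capacity, treat overload (x > 1) as unsuitable."""
    x = (c / hs) / (1.0 - hl)
    return x if x <= 1.0 else 0.0

def bee_fitness(delta, hops):
    """f_i per Eq. 3: high suitability over a short path gives a fit bee."""
    return delta / hops

def next_role(f_i, f_colony):
    """After a trip, a bee compares its result with the colony average [15];
    the deterministic threshold here is an assumed simplification."""
    return 'forager' if f_i >= f_colony else 'follower'

trips = [(suitability(1.5, 2.0, 0.25), 2),   # assumed (delta, hops) trips
         (suitability(1.0, 2.0, 0.0), 4)]
fitnesses = [bee_fitness(d, h) for d, h in trips]
f_colony = sum(fitnesses) / len(fitnesses)   # Eq. 4
roles = [next_role(f, f_colony) for f in fitnesses]
assert roles == ['forager', 'follower']
```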

3.2 P2P Network Model

A formalized description of a P2P overlay network [20] is closely related to the definition of a unique identifier for each P2P node. Since P2P overlay networks operate above the physical layer, this unique identifier must not be the physical host address. Let a P2P overlay network be represented by a graph $G_{P2P} = (V_{P2P}, E_{P2P})$, where the nodes $v \in V_{P2P}$ of the graph represent nodes in a P2P network and the links $e \in E_{P2P}$ represent connections between these nodes. Nodes $v_1, v_2 \in V_{P2P}$ are neighbours if and only if $(v_1, v_2) \in E_{P2P}$. Each node $v_i \in V_{P2P}$, $i = 1, \dots, n$, $n = |V_{P2P}|$, has a physical address $y_i$ and a logical unique identifier $x_i$, which is known to all nodes; however, only neighbours are able to map the logical identifier $x$ to the physical host address $y$ and can therefore exchange packets directly:

$$m(c, x_e) = \begin{cases} y_e & \text{if } e \in neighbours(c) \\ x_e & \text{otherwise} \end{cases}$$

where $c, e \in V_{P2P}$. Intelligent swarm agents (e.g., bees) may know the physical address $y_s$ of their source node in addition to the logical identifier $x_s$, and may therefore return directly to their source.
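The mapping m(c, x_e) is a one-liner once the neighbour relation and the address table are given; the network data below is assumed for illustration.

```python
# A direct transcription of the mapping m(c, x_e): only neighbours of c can
# resolve a logical identifier to a physical address. Network data assumed.

neighbours = {'n1': {'n2'}, 'n2': {'n1', 'n3'}, 'n3': {'n2'}}
physical = {'n1': '10.0.0.1', 'n2': '10.0.0.2', 'n3': '10.0.0.3'}

def m(c, x_e):
    """Return the physical address y_e if e neighbours c, else the logical id."""
    return physical[x_e] if x_e in neighbours[c] else x_e

assert m('n1', 'n2') == '10.0.0.2'   # neighbour: resolved to physical address
assert m('n1', 'n3') == 'n3'         # not a neighbour: stays logical
```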


3.3 Convergence

References [6, 11] investigate the convergence of the Bee Colony Optimization algorithm and prove that the current best solution converges to one of the optimal solutions (with probability one) as the number of iterations increases. We provide a proof of convergence in value of the bee algorithm from Sect. 3.1. For this purpose, the pre-assumptions and formalization are taken from [7]:

1. $G_{P2P} = (V_{P2P}, E_{P2P})$ is a graph of n nodes and links between these nodes (nodes are not necessarily fully connected in the load balancing scenario);
2. the problem is $(S, f, \Phi)$, where S is the set of candidate solutions, f is the objective function, and $\Phi$ is the set of constraints that defines the set of feasible solutions; the goal is to find an optimal solution $s_{opt}$;
3. $\Theta$ is the finite set of states of the problem, $\theta = \langle v_i, v_j, \dots, v_h, \dots \rangle$, $|\theta|$ is the number of nodes in a sequence, $|\theta| \le n$; $\Theta^*$ is the set of feasible states, $\Theta^* \subseteq \Theta$; for the time being, static scenarios are considered in this theoretical explanation.

The probability rule of the navigation strategy in the construct-solution phase can be described as:

$$P(c_{h+1} = j \mid x_h) = \frac{F_{ij}(\rho)}{\sum_{j \in A_i} F_{ij}(\rho)} \tag{5}$$

where $F_{ij}$ is some non-decreasing function, $F_{ij}(z) = z^{\alpha} \eta_{ij}^{\beta}$. The recruitment phase represents the exchange of knowledge about the path length (distance, expressed as the number of hops) and the quality of the solution (measured by the suitability function $\delta$). This phase is described by Eq. 3. In the following, a new result is derived as a consequence of a similar result concerning the convergence of one group of Ant System algorithms, to which, for example, Min-Max Ant System belongs [7]. The next corollary is therefore inspired by and based on a theorem from [7] that proves convergence in value of Min-Max Ant System. The theorem says that, when a fixed positive lower bound on the pheromone trails is used, finding the optimal solution is guaranteed for this specific group of algorithms. The next proof is based on some specifics of the bee algorithm and on some general arguments that can also be found in the proof of convergence in value of Min-Max Ant System [7].

Corollary: If P(k) is the probability that the bee algorithm finds an optimal solution at least once within the first k iterations, then $\lim_{k \to +\infty} P(k) = 1$.

Proof: From Eq. 2, it follows that the arc fitness $\rho$ for a follower bee satisfies $\rho \in \{\frac{1-\lambda}{l-1}, \lambda\}$, where $\lambda$ is the probability of choosing the preferred path and $l$ is the number of neighbouring nodes of a particular node. If the case of a forager bee is added, then $\rho \in \{\frac{1-\lambda}{l-1}, \lambda, \frac{1}{l}\}$. So, for the given network, the arc fitness can take only a finite number of values, and its values stay in some closed interval $[\rho_{min}, \rho_{max}]$. The lower bound is positive and fixed for the given network. Therefore, any feasible choice


from Eq. 1 for any partial solution $x_h$ is made with the probability:

$$p_{min} \ge \frac{\rho_{min}^{\alpha}}{(n-1)\,\rho_{max}^{\alpha} + \rho_{min}^{\alpha}} \tag{6}$$

Any solution (incl. the optimal solution) can be generated with the probability:

$$\hat{p} > \left( \frac{\rho_{min}^{\alpha}}{(n-1)\,\rho_{max}^{\alpha} + \rho_{min}^{\alpha}} \right)^{m} > 0 \tag{7}$$

where $m$ is the maximum length of a sequence. From this fact, it follows that $P(k) \ge 1 - (1 - \hat{p})^k$. Hence, for every arbitrarily small $\varepsilon > 0$ there exists $k_0$ such that $P(k) \ge 1 - \varepsilon$ for all $k \ge k_0$. That means: $\lim_{k \to +\infty} P(k) = 1$.

□

The corollary explains the following. In the bee algorithm, the values that are assigned to arcs are the values of the arc fitness $\rho_{ij}$. Some of these values will be implicitly reinforced through learning by the other hive mates via the waggle dance (i.e., the recruitment process). How "strong" the recruitment of a particular bee is depends on the value of the suitability function and the path length: the higher the value of the suitability function and the shorter the path, the stronger the recruitment. The bee algorithm forces the best-so-far solution and uses an implicit maximum value $\rho_{max}$ (which is directly implied by $f_{max}$ from the best-so-far solution). Further, the value of $\lambda$ is initialized to the upper limit ($\lambda_0$), so the minimum value $\rho_{min}$ will be reached as:

(a) $\frac{1-\lambda_0}{l-1}$ in the general case,
(b) $\frac{1-\lambda_0}{n-1}$ in the case of fully connected nodes.

Also, any feasible solution can be constructed with a nonzero probability. If we assume that a connection $(i, j)$ does not have the largest probability of being chosen (i.e., $j$ does not belong to the set $F_i$), then the probability of choosing this connection is $\frac{1-\lambda}{l-1}$.
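The corollary can be illustrated numerically; every parameter value below is an assumed toy setting, not an experiment from the chapter. With a fixed positive lower bound on the arc fitness, the bound P(k) ≥ 1 − (1 − p̂)^k derived from Eqs. 6 and 7 strictly increases towards 1.

```python
# A numeric illustration of the corollary; all parameter values are assumed
# toy settings (lam = lambda_0, l neighbours, n nodes, sequence length m).

lam, l, n, alpha, m = 0.9, 4, 10, 1, 3
rho_min = (1 - lam) / (l - 1)      # smallest reachable arc fitness
rho_max = lam                      # implied by the best-so-far solution
p_min = rho_min**alpha / ((n - 1) * rho_max**alpha + rho_min**alpha)  # Eq. 6
p_hat = p_min**m                   # lower bound on generating any solution (Eq. 7)
assert p_hat > 0

def bound(k):
    """P(k) >= 1 - (1 - p_hat)^k, the guarantee used in the proof."""
    return 1 - (1 - p_hat)**k

assert bound(10) < bound(100) < bound(10**6) < 1   # monotone towards 1
```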

4 Conclusion

Extreme and rising complexity characterizes today's distributed systems. Self-* approaches represent a promising way to cope with this complexity. However, self-organization in different forms already exists around us, from nature to social and economic organizations. Especially inspiring forms of self-organization are those that can be found in nature, e.g., different types of swarm intelligence. Thus, swarm-intelligent algorithms significantly contribute to the design of self-organizing software for distributed systems. In this chapter, some typical distributed use cases, which have been successfully treated by swarm-intelligent approaches, are shortly overviewed (incl. a list of the most successful swarm-based algorithms applied in these use cases). As an illustration, one use case is selected—dynamic load balancing, and


an application of bee intelligence to this problem is reviewed. In order to better explain the behaviour of the bee algorithm, its theoretical aspects are discussed. Future work includes:

• a standardized methodology for a fair evaluation of algorithms in distributed systems use cases;
• a theoretical evaluation of swarm-intelligent algorithms, an explanation of "why specific methods work well on specific problems", and the analysis of algorithm behaviour. Although this is a challenging task, for certain metaheuristics theoretical work regarding convergence is partially established, with some encouraging results, whereas for many others no theoretical background exists.

References

1. Asprey, W., et al.: Conquer system complexity: build systems with billions of parts. In: CRA Conference on Grand Research Challenges in Computer Science and Engineering, pp. 29–33 (2002)
2. Barth, F.: Insects and Flowers: The Biology of a Partnership. Princeton University Press, Princeton (1982)
3. Camazine, S., Deneubourg, J., Franks, N.R., Sneyd, J., Theraulaz, G., Bonabeau, E.: Self-Organization in Biological Systems. Princeton University Press, Princeton (2003)
4. Casadei, M., Menezes, R., Viroli, M., Tolksdorf, R.: Self-organized over-clustering avoidance in tuple-space systems. In: IEEE Congress on Evolutionary Computation (2007)
5. Commons, M.L., Goodheart, E.A., Dawson, T.L.: Psychophysics of stage: task complexity and statistical models. In: International Objective Measurement Workshop at the Annual Conference of the American Educational Research Association (1997)
6. Davidovic, T., Teodorovic, D., Selmic, M.: Bee colony optimization part I: the algorithm overview. YUJOR 25(1), 33–56 (2015)
7. Dorigo, M., Stützle, T.: Ant Colony Optimization. MIT Press (2004)
8. Fortnow, L., Homer, S.: A short history of computational complexity. Bull. EATCS 80, 95–133 (2003)
9. Herrmann, K.: MESHMdl — a middleware for self-organization in ad hoc networks. In: 23rd International Conference on Distributed Computing Systems (2003)
10. Heylighen, F.: The science of self-organization and adaptivity. In: Kiel, L.D. (ed.) Knowledge Management, Organizational Intelligence and Learning, and Complexity. The Encyclopedia of Life Support Systems. EOLSS Publishers, Oxford (2001)
11. Jakšić-Krüger, T., Davidović, T., Teodorović, D., et al.: The bee colony optimization algorithm and its convergence. Int. J. Bio-Inspired Comput. 8(5), 340–354 (2016)
12. Kephart, J., Chess, D.: The vision of autonomic computing. IEEE Comput. 36(1), 41–50 (2003)
13. Kühn, E., Marek, A., Scheller, T., Šešum-Ćavić, V., Vögler, M.: A space-based generic pattern for self-initiative load clustering agents. In: 14th International Conference on Coordination Models and Languages (2012)
14. Lloyd, S.: Measures of complexity: a nonexhaustive list. IEEE Control Syst. (2001)
15. Nakrani, S., Tovey, C.: On honey bees and dynamic server allocation in internet hosting centers. Adapt. Behav. 12, 223–240 (2004)
16. Ranganathan, A., Campbell, R.H.: What is the complexity of a distributed computing system? Complexity 12(6), 37–45 (2007)
17. Šešum-Ćavić, V., Kühn, E.: A swarm intelligence appliance to the construction of an intelligent peer-to-peer overlay network. In: 4th International Conference on Complex, Intelligent and Software Intensive Systems (2010)
18. Šešum-Ćavić, V., Kühn, E.: Self-organized load balancing through swarm intelligence. In: Next Generation Data Technologies for Collective Computational Intelligence. Studies in Computational Intelligence, vol. 352, pp. 195–224. Springer (2011)
19. Šešum-Ćavić, V., Kühn, E., Kanev, D.: Bio-inspired search algorithms for unstructured P2P overlay networks. Swarm Evolut. Comput. 29, 73–93 (2016)
20. Šešum-Ćavić, V., Kühn, E., Zischka, S.: Swarm-inspired routing algorithms for unstructured P2P networks. Int. J. Swarm Intell. Res. 9(3) (2018)
21. Šešum-Ćavić, V., Kühn, E., Fleischhacker, L.: Efficient search and lookup in unstructured P2P overlay networks inspired by swarm intelligence. IEEE Trans. Emerg. Top. Comput. Intell. (in press)
22. Wedde, H.F., Farooq, M., Zhang, Y.: BeeHive: an efficient fault-tolerant routing algorithm inspired by honey bee behaviour. In: Ant Colony Optimization and Swarm Intelligence, pp. 83–94 (2004)
23. Wong, L.P., Low, M.Y., Chong, C.S.: A bee colony optimization for travelling salesman problem. In: 2nd Asia International Conference on Modelling & Simulation, pp. 818–823 (2008)

Fuzzy Computation Theory and Applications

Markov Decision Processes with Fuzzy Risk-Sensitive Rewards: The Best Coherent Risk Measures Under Risk Averse Utilities

Yuji Yoshida

Abstract Risk-sensitive Markov decision processes with risk constraints are discussed using the best coherent risk measures under risk averse utility. The coherent risk measures are represented as weighted average value-at-risks with the most adapted risk spectrum derived from the decision maker's risk averse utility, and the risk spectrum then inherits the risk averse property of the decision maker's utility as weighting. Risk-sensitive expected rewards are also approximated by the derived weighted average value-at-risks. By perception-based extension, Markov decision processes are formulated for fuzzy random variables. Firstly, to find feasible ranges of risk levels, a risk-minimizing problem is discussed by mathematical programming. Next, the maximization of risk-sensitive running rewards under the feasible risk constraints is discussed by dynamic programming. Meanwhile, for the maximization of risk-sensitive terminal rewards over long terms, a sufficient condition for the numerical computation of the solutions is given. A few numerical examples are given to clarify the obtained results, and several figures are shown to illustrate the details. Keywords Markov decision process · Risk-sensitive expectation · Risk constraint · Coherent risk measure · Weighted average value-at-risk · Risk averse utility · Fuzzy random variable · Perception-based extension

Y. Yoshida (B), University of Kitakyushu, 4-2-1 Kitagata, Kokuraminami, Kitakyushu 802-8577, Japan. e-mail: [email protected]
© Springer Nature Switzerland AG 2021. J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_6

1 Introduction

Risk management in decision making is studied by several approaches. One is to use risk-sensitive expected rewards, which were introduced by Howard and Matheson [9]. Risk-sensitive expectation is given by


\[
f^{-1}(E(f(\cdot))), \tag{1}
\]

where f and f^{-1} are the decision maker's utility function and its inverse function, and E(·) is an expectation. This approach estimates random events through utility functions, and it has been studied by several authors (Bäuerle and Rieder [5]). However, this criterion with non-linear utility functions f has been developed by operator theory in general. We take another approach, from the viewpoint of risk measures such as percentiles, value-at-risks, conditional value-at-risks and so on. Risk measures are among the most important concepts in economics, financial analysis, asset management, engineering and so on, and they have been improved in both practical and theoretical aspects. The variance was once used as a risk measure in decision making. Drastic declines of asset prices have been studied since Lehman's financial crisis, and value-at-risk (VaR) is widely used to estimate the risk of asset price declines in practical management (Jorion [10]). The value-at-risk is defined by percentiles at a specified probability; however, it does not have coherency. Coherent risk measures have been studied to improve the criterion of risks with worst scenarios (Artzner et al. [4]). Several improved risk measures based on value-at-risks have been proposed: for example, conditional value-at-risks, expected shortfall and entropic value-at-risk (Rockafellar and Uryasev [16], Tasche [17]). Kusuoka [12] gave a spectral representation for coherent risk measures, and Acerbi [1] and Adam et al. [2] discussed its applications to portfolio selection. Recently, Yoshida [29] derived the best coherent risk measure from the decision maker's risk averse utility function, using weighted average value-at-risks. This derived coherent risk measure inherits the risk averse property of the decision maker's utility function as a risk spectrum weighting. Yoshida [32, 33] also applied this method to portfolio selection.
This paper adopts weighted average value-at-risks to estimate risk-sensitive rewards under risk constraints, which is also a kind of extension of Yoshida [28]. This paper uses fuzzy random variables, which were introduced by Kwakernaak [13] and which have two kinds of uncertainty, i.e. randomness and fuzziness (Yoshida [18]). In this paper, randomness is used to represent the uncertainty regarding the belief degree of frequency, and fuzziness is applied to linguistic imprecision of data caused by a lack of information (Yoshida et al. [19], Yoshida [24]). We introduce coherent risk measures and risk-sensitive estimations for fuzzy random variables, using the perception-based method in Yoshida [23]. This paper also estimates fuzzy numbers and fuzzy random variables by probabilistic expectation, evaluation weights and θ-mean functions, which are characterized by possibility and necessity criteria for subjective estimation and pessimistic-optimistic indexes for subjective decision (Yoshida [20, 31]). In dynamic decision making, there are two kinds of risk-sensitive rewards, i.e. risk-sensitive running rewards and risk-sensitive terminal rewards, which have different properties. It is difficult to compare them in general; in particular, the latter is used when we cannot settle until the terminal time. Yoshida [30] discussed the maximization of risk-sensitive running rewards under risk constraints by dynamic programming. However, we cannot apply dynamic programming to the maximization of risk-sensitive terminal rewards under risk constraints because of the difference


between the forms of these rewards. This paper deals not only with risk-sensitive running rewards but also with risk-sensitive terminal rewards. We discuss the latter by mathematical programming, and we investigate a sufficient condition for obtaining the optimal solutions in long-term cases by numerical methods. In Sect. 2, we introduce value-at-risks, average value-at-risks and coherent risk measures. Coherent risk measures are represented by weighted average value-at-risks with risk spectra, based on the spectral representation in Kusuoka [12]. Following Yoshida [29], we introduce coherent risk measures with the best risk spectra under the decision maker's risk averse utility. In Sect. 3, we define weighted average value-at-risks, coherent risk measures and risk-sensitive expectations for fuzzy random variables, using perception-based approaches. In Sect. 4, we introduce estimation tools with evaluation weights and θ-mean functions in order to evaluate the randomness and fuzziness of fuzzy random variables. In Sect. 5, we formulate risk-sensitive Markov decision problems with risk allocation under risk constraints by use of coherent risk measures. Then risk-sensitive expected rewards are approximated by weighted average value-at-risks with the best risk spectrum derived from the decision maker's risk averse utility function, and the risk constraints are given by coherent risk measures which are also represented by weighted average value-at-risks. In the rest of Sect. 5, we investigate the lower bound of risk values in order to find feasible ranges of the risk constraints. In Sects. 6 and 7, we maximize risk-sensitive running rewards and risk-sensitive terminal rewards under the risk constraints using mathematical programming. We apply dynamic programming to the former, while for the latter we investigate a sufficient condition for numerical approaches in long-term cases. In Sect. 8, we give a few numerical examples to illustrate the obtained results, and several figures and tables are shown to illustrate the details.

2 Coherent Risk Measures Derived from Risk Averse Utilities

Let R = (−∞, ∞) and let P be a non-atomic probability on a sample space Ω. Let X be the family of all integrable real-valued random variables X on Ω with a continuous distribution x ↦ F_X(x) = P(X < x) for which there exists a non-empty open interval I such that F_X(·): I → (0, 1) is strictly increasing and onto. Then there exists a strictly increasing and continuous inverse function F_X^{-1}: (0, 1) → I such that F_X(·): I → (0, 1) and F_X^{-1}: (0, 1) → I are one-to-one and onto, and it holds that lim_{x↓inf I} F_X(x) = 0 and lim_{x↑sup I} F_X(x) = 1. The value-at-risk (VaR) at a risk probability p is given by the percentile of the distribution function F_X as follows:

\[
\mathrm{VaR}_p(X) =
\begin{cases}
\inf I & \text{if } p = 0,\\
\sup\{x \in I \mid F_X(x) \le p\} & \text{if } p \in (0,1),\\
\sup I & \text{if } p = 1.
\end{cases} \tag{2}
\]


Then the average value-at-risk (AVaR) at a probability p ∈ (0, 1] is given by

\[
\mathrm{AVaR}_p(X) = \frac{1}{p}\int_0^p \mathrm{VaR}_q(X)\, dq. \tag{3}
\]
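As a concrete illustration (our own sketch, not code from the paper), the percentile (2) and the tail average (3) can be estimated from samples: VaR_p as an empirical quantile, and AVaR_p by averaging quantiles over a grid on (0, p). The normal sample below is a hypothetical input.

```python
import numpy as np

def var_p(samples, p):
    """Empirical value-at-risk: the p-percentile of the distribution, cf. Eq. (2)."""
    return np.quantile(samples, p)

def avar_p(samples, p, grid=2000):
    """Empirical average value-at-risk, cf. Eq. (3): (1/p) * integral_0^p VaR_q(X) dq."""
    qs = (np.arange(grid) + 0.5) * p / grid   # midpoint rule on (0, p)
    return np.mean(np.quantile(samples, qs))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200_000)             # hypothetical N(0, 1) data
print(var_p(x, 0.05))    # close to the 5% normal quantile, about -1.645
print(avar_p(x, 0.05))   # tail average, about -2.06 for N(0, 1)
```

Note that AVaR_{0.05} lies below VaR_{0.05}: averaging the quantiles over (0, p) penalizes the far tail, which is exactly why −AVaR_p is the coherent improvement of −VaR_p.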

The following definition characterizes coherent risk measures.

Definition 2.1 (Artzner et al. [4]). A map ρ: X → R is called a coherent risk measure if it satisfies the following (i)–(iv):
(i) ρ(X) ≥ ρ(Y) for X, Y ∈ X satisfying X ≤ Y (monotonicity).
(ii) ρ(cX) = cρ(X) for X ∈ X and c ∈ R satisfying c ≥ 0 (positive homogeneity).
(iii) ρ(X + c) = ρ(X) − c for X ∈ X and c ∈ R (translation invariance).
(iv) ρ(X + Y) ≤ ρ(X) + ρ(Y) for X, Y ∈ X (sub-additivity).

It is known from Artzner et al. [4] that −AVaR_p(·) is a coherent risk measure, whereas −VaR_p(·) is not coherent because sub-additivity (iv) does not hold in general. Conditional value-at-risks and expected shortfalls are also well-known coherent risk measures (Rockafellar and Uryasev [16], Tasche [17]). The following fundamental concepts are also well known.

Definition 2.2 (Dellacherie [6], Renneberg [15], Kusuoka [12]). Let ρ: X → R.
(i) Random variables X (∈ X) and Y (∈ X) are called comonotonic if (X(ω) − X(ω'))(Y(ω) − Y(ω')) ≥ 0 holds for almost all ω, ω' ∈ Ω.
(ii) ρ is called comonotonically additive if ρ(X + Y) = ρ(X) + ρ(Y) holds for all comonotonic X, Y ∈ X.
(iii) ρ is called law invariant if ρ(X) = ρ(Y) holds for all X, Y ∈ X satisfying P(X < ·) = P(Y < ·).
(iv) ρ is called continuous if lim_{n→∞} ρ(X_n) = ρ(X) holds for {X_n} ⊂ X and X ∈ X such that lim_{n→∞} X_n = X almost surely.

Now, for a probability p ∈ (0, 1] and a non-increasing right-continuous function λ: [0, 1] → [0, ∞) satisfying ∫_0^1 λ(q) dq = 1, we define a weighted average value-at-risk with weighting λ on (0, p) by

\[
\mathrm{AVaR}^{\lambda}_p(X) = \int_0^p \mathrm{VaR}_q(X)\,\lambda(q)\, dq \Big/ \int_0^p \lambda(q)\, dq. \tag{4}
\]

Then λ is called a risk spectrum, and −AVaR^λ_p becomes a coherent risk measure. Further, Kusuoka [12] proved that coherent risk measures are represented by weighted average value-at-risks with the following spectral representation (Yoshida [29, Theorem 1]).

Fig. 1 Value-at-risk VaR_p(X)

Lemma 2.1 Let ρ: X → R be a law invariant, comonotonically additive, continuous coherent risk measure. Then there exists a risk spectrum λ such that

\[
\rho(X) = -\mathrm{AVaR}^{\lambda}_1(X) \tag{5}
\]

for X ∈ X. Further, −AVaR^λ_p is a coherent risk measure on X for p ∈ (0, 1).

Throughout this paper we use a law invariant, comonotonically additive, continuous coherent risk measure ρ, and we also deal with the case when value-at-risks are represented as

\[
\mathrm{VaR}_p(X) = E(X) + \kappa(p)\cdot\sigma(X) \tag{6}
\]

with the mean E(X) and the standard deviation σ(X) of random variables X ∈ X, where κ: (0, 1) → R is an increasing function. A sufficient condition for (6) is that the random variables X have normal distributions (Fig. 1). From (4) and (6) we have

\[
\mathrm{AVaR}^{\lambda}_p(X) = E(X) + \kappa^{\lambda}(p)\cdot\sigma(X), \tag{7}
\]

where

\[
\kappa^{\lambda}(p) = \int_0^p \kappa(q)\,\lambda(q)\, dq \Big/ \int_0^p \lambda(q)\, dq. \tag{8}
\]
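Under the normality assumption behind (6), κ(q) is the standard normal quantile Φ^{-1}(q), so (8) reduces to a one-dimensional integral that can be evaluated numerically for any risk spectrum. A minimal sketch (our illustration; the spectrum λ(q) = q^{-1/2}/2 is a hypothetical choice, non-increasing with unit integral on [0, 1]):

```python
import numpy as np
from statistics import NormalDist

phi_inv = np.vectorize(NormalDist().inv_cdf)   # standard normal quantile Phi^{-1}

def kappa_weighted(p, lam, grid=20_000):
    """kappa^lambda(p) of Eq. (8): the lambda-weighted average of
    kappa(q) = Phi^{-1}(q) over (0, p), by the midpoint rule."""
    q = (np.arange(grid) + 0.5) * p / grid
    w = lam(q)
    return np.sum(phi_inv(q) * w) / np.sum(w)

uniform = lambda q: np.ones_like(q)     # lambda = 1 recovers the plain AVaR of Eq. (3)
power   = lambda q: 0.5 * q ** (-0.5)   # hypothetical non-increasing spectrum

print(kappa_weighted(0.05, uniform))    # about -2.06: the AVaR coefficient for N(0, 1)
print(kappa_weighted(0.05, power))      # more negative: extra weight on the far tail
```

With κ^λ(p) in hand, (7) prices any X in the location-scale family as E(X) + κ^λ(p)σ(X), which is the closed form the later sections rely on.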

Let f: I → R be a C²-class risk averse utility function satisfying f′ > 0 and f″ ≤ 0 on I, where I is an open interval. For a probability p ∈ (0, 1] and a random variable X ∈ X, the non-linear risk-sensitive form

\[
f^{-1}\!\left(\frac{1}{p}\int_0^p f(\mathrm{VaR}_q(X))\, dq\right) \tag{9}
\]


is an average value-at-risk of X on the downside (0, p) under the utility f. We note that (9) reduces to (3) if f is risk-neutral, i.e. a linear increasing function. Hence we have the following lemma from Yoshida [29, Theorem 2].

Lemma 2.2 A risk spectrum λ which minimizes the distance between (9) and (4), i.e.

\[
\sup_{X\in\mathcal{X}} \left| f^{-1}\!\left(\frac{1}{p}\int_0^p f(\mathrm{VaR}_q(X))\, dq\right) - \mathrm{AVaR}^{\lambda}_p(X) \right|^2, \tag{10}
\]

for p ∈ (0, 1] is given by

\[
\lambda(p) = e^{-\int_p^1 C(q)\, dq}\, C(p) \tag{11}
\]

with a component function C which is given by

\[
C(p) = \sup_{X\in\mathcal{X}} \frac{ f(\mathrm{VaR}_p(X)) - \frac{1}{p}\int_0^p f(\mathrm{VaR}_q(X))\, dq }{ \sigma(X)\, p\, f'\!\left( f^{-1}\!\left( \frac{1}{p}\int_0^p f(\mathrm{VaR}_q(X))\, dq \right)\right) } \Bigg/ \sup_{X\in\mathcal{X}} \frac{ \mathrm{VaR}_p(X) - f^{-1}\!\left( \frac{1}{p}\int_0^p f(\mathrm{VaR}_q(X))\, dq \right) }{ \sigma(X) } \tag{12}
\]

for p ∈ (0, 1] if λ is non-increasing (Yoshida [29, Theorem 2]). For an exponential utility function f, the corresponding component function C is given concretely by (84) in Example 8.2. The component functions C for several utilities f are also investigated in Yoshida [29, Examples 1–4]. In Lemma 2.2 the coherent risk measure −AVaR^λ_p has a semi-linear property in the sense of Definition 2.1(ii)–(iii), which enables effective computation, and the risk spectrum λ inherits the risk averse property of the non-linear utility function f as a weighting on (0, p). In the sequel we also use the risk spectrum λ of Lemma 2.2 for the risk-sensitive expected rewards (1).

3 Fuzziness and Extended Criteria

A fuzzy number is represented by its membership function ã: R → [0, 1] which is normal, upper-semicontinuous and fuzzy convex and which has a compact support (Zadeh [34]). Let N be the set of all fuzzy numbers. For a fuzzy number ã ∈ N, its α-cuts are given by closed intervals ã_α = {x ∈ R | ã(x) ≥ α} = [ã_α^-, ã_α^+] for α ∈ (0, 1]. Addition and scalar multiplication for fuzzy numbers are defined via their α-cuts. For fuzzy numbers ã, b̃ ∈ N, the fuzzy max order ã ⪯ b̃ means that ã_α^± ≤ b̃_α^± for all α ∈ (0, 1]. A fuzzy-number-valued map X̃: Ω → N is called a fuzzy random variable if X̃_α^± ∈ X for all α ∈ (0, 1], where X̃_α(ω) = {x ∈ R | X̃(ω)(x) ≥ α} = [X̃_α^-(ω), X̃_α^+(ω)] for ω ∈ Ω. Let X̃ be the family of all fuzzy random variables on Ω.

Fig. 2 Perception-based estimation of a fuzzy random variable

Kruse and Meyer [11] gave the expectation of fuzzy random variables X̃ ∈ X̃ in the following perception-based definition based on Zadeh's extension principle:

\[
\tilde{E}(\tilde{X})(x) = \sup_{X\in\mathcal{X}:\, E(X)=x}\ \inf_{\omega\in\Omega} \tilde{X}(\omega)(X(\omega)) \tag{13}
\]

for x ∈ R, where E(·) is the expectation for real-valued random variables. Then the expectation Ẽ(X̃) is a fuzzy number with α-cut Ẽ(X̃)_α = [E(X̃_α^-), E(X̃_α^+)]. The α-cut of the fuzzy number (13) can generally be given by the following Aumann integral: Ẽ(X̃)_α = {E(X) | X ∈ X and X(ω) ∈ X̃_α(ω) for all ω ∈ Ω}. Puri and Ralescu [14] discussed the conditional expectation of fuzzy random variables by the Aumann integral, and López-Díaz et al. [7] studied it for statistics with fuzzy data. For a functional ψ: X → R as a general estimation of real-valued random variables, by the perception-based approach we can discuss a fuzzy extension ψ̃ of the estimation ψ, defined as follows:

\[
\tilde{\psi}(\tilde{X})(x) = \sup_{X\in\mathcal{X}:\, \psi(X)=x}\ \inf_{\omega\in\Omega} \tilde{X}(\omega)(X(\omega)), \quad x\in\mathbb{R}, \tag{14}
\]

for a fuzzy random variable X̃ ∈ X̃ (Fig. 2, Yoshida [22]). Define the risk-sensitive expectation (1) by

\[
\varphi(X) = f^{-1}(E(f(X))) \tag{15}
\]

for real-valued random variables X ∈ X. For a weighted average value-at-risk AVaR^λ_p, the risk-sensitive expectation φ and a coherent risk measure ρ, their extensions for a fuzzy random variable X̃ ∈ X̃ are also fuzzy numbers, given respectively as follows:

\[
\widetilde{\mathrm{AVaR}}^{\lambda}_p(\tilde{X})(x) = \sup_{X\in\mathcal{X}:\, \mathrm{AVaR}^{\lambda}_p(X)=x}\ \inf_{\omega\in\Omega} \tilde{X}(\omega)(X(\omega)), \tag{16}
\]
\[
\tilde{\varphi}(\tilde{X})(x) = \sup_{X\in\mathcal{X}:\, \varphi(X)=x}\ \inf_{\omega\in\Omega} \tilde{X}(\omega)(X(\omega)), \tag{17}
\]
\[
\tilde{\rho}(\tilde{X})(x) = \sup_{X\in\mathcal{X}:\, \rho(X)=x}\ \inf_{\omega\in\Omega} \tilde{X}(\omega)(X(\omega)) \tag{18}
\]

for x ∈ R. Their α-cuts are given by \widetilde{AVaR}^λ_p(X̃)_α = [AVaR^λ_p(X̃_α^-), AVaR^λ_p(X̃_α^+)], φ̃(X̃)_α = [φ(X̃_α^-), φ(X̃_α^+)] and ρ̃(X̃)_α = [ρ(X̃_α^+), ρ(X̃_α^-)], and the extended measures have the following properties similarly to Definition 2.1 (see Yoshida [23, Prop. 2.2, Theorems 2.2 and 2.3] for more general cases).

Lemma 3.1 Let X̃, Ỹ ∈ X̃ be fuzzy random variables and let p ∈ (0, 1] be a probability. Then the following (i.a)–(i.d), (ii.a)–(ii.d) and (iii.a)–(iii.d) hold.

(i.a) If X̃ ⪯ Ỹ, then \widetilde{AVaR}^λ_p(X̃) ⪯ \widetilde{AVaR}^λ_p(Ỹ).
(i.b) \widetilde{AVaR}^λ_p(ã X̃) = ã \widetilde{AVaR}^λ_p(X̃) for fuzzy numbers ã ∈ N satisfying ã ⪰ 0.
(i.c) \widetilde{AVaR}^λ_p(X̃ + ã) = \widetilde{AVaR}^λ_p(X̃) + ã for fuzzy numbers ã ∈ N.
(i.d) \widetilde{AVaR}^λ_p(X̃ + Ỹ) ⪰ \widetilde{AVaR}^λ_p(X̃) + \widetilde{AVaR}^λ_p(Ỹ).
(ii.a) If X̃ ⪯ Ỹ, then φ̃(X̃) ⪯ φ̃(Ỹ).
(ii.b) φ̃(ã X̃) = ã φ̃(X̃) for fuzzy numbers ã ∈ N satisfying ã ⪰ 0.
(ii.c) φ̃(X̃ + ã) = φ̃(X̃) + ã for fuzzy numbers ã ∈ N.
(ii.d) φ̃(X̃ + Ỹ) ⪰ φ̃(X̃) + φ̃(Ỹ).
(iii.a) If X̃ ⪯ Ỹ, then ρ̃(X̃) ⪰ ρ̃(Ỹ).
(iii.b) ρ̃(ã X̃) = ã ρ̃(X̃) for fuzzy numbers ã ∈ N satisfying ã ⪰ 0.
(iii.c) ρ̃(X̃ + ã) = ρ̃(X̃) − ã for fuzzy numbers ã ∈ N.
(iii.d) ρ̃(X̃ + Ỹ) ⪯ ρ̃(X̃) + ρ̃(Ỹ).
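Since the extensions (16)–(18) act α-cut-wise, they can be computed endpoint by endpoint. The sketch below is our illustration with hypothetical data, anticipating the triangle-type fuzzy random variable of Example 4.1: the α-cut endpoints are X ± (1 − α)c, and by translation invariance the α-cut of the extended AVaR is an interval of width exactly 2(1 − α)c.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, 100_000)   # crisp core of the fuzzy random variable (hypothetical)
c = 0.5                             # spread of the fuzzy part

def avar(samples, p, grid=2000):
    """Empirical average value-at-risk, cf. Eq. (3)."""
    qs = (np.arange(grid) + 0.5) * p / grid
    return np.mean(np.quantile(samples, qs))

def extended_avar_cut(alpha, p=0.05):
    """alpha-cut of the perception-based extension (16):
    [AVaR_p(X_alpha^-), AVaR_p(X_alpha^+)] with X_alpha^(+/-) = X +/- (1 - alpha) * c."""
    return avar(X - (1 - alpha) * c, p), avar(X + (1 - alpha) * c, p)

lo, hi = extended_avar_cut(alpha=0.2)
print(lo, hi)                       # interval of width 2 * (1 - 0.2) * c = 0.8
```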

4 Estimation of Fuzziness with Evaluation Weights and θ-Mean Functions

We need defuzzification tools to formulate the optimization problems. Defuzzification has been studied by many authors. Yoshida [20] studied an evaluation of fuzzy numbers by evaluation weights, which are induced from fuzzy measures to evaluate a confidence degree that a fuzzy number takes values in an interval. Let θ ∈ [0, 1] be a constant. The fuzziness is estimated by the evaluation weights w(α) and the following θ-mean function:


\[
[x, y]\ (\in \mathcal{I}) \mapsto \theta x + (1-\theta)y\ (\in \mathbb{R}), \tag{19}
\]

where I denotes the set of all bounded closed intervals. This scalarization is used in the estimation of fuzzy numbers to give a mean value of the interval [x, y] with a weight θ, and θ is called a pessimistic-optimistic index which indicates the pessimistic degree of the attitude in decision making ([8]). Let ã ∈ N be a fuzzy number. A mean value of the fuzzy number ã with respect to an evaluation weight w(α) and the θ-mean function, which depends only on ã and α, is given as follows ([22]):

\[
E^{\theta}(\tilde{a}) = \int_0^1 \left(\theta\, \tilde{a}_\alpha^- + (1-\theta)\, \tilde{a}_\alpha^+\right) w(\alpha)\, d\alpha \Big/ \int_0^1 w(\alpha)\, d\alpha, \tag{20}
\]

where ã_α = [ã_α^-, ã_α^+] is the α-cut of the fuzzy number ã. In (20), w(α) indicates a confidence degree that the fuzzy number ã takes values in the interval ã_α at each level α. An evaluation weight w(α) is called the possibility evaluation weight if w(α) = 1 for α ∈ [0, 1], and the necessity evaluation weight if w(α) = 1 − α for α ∈ [0, 1] (Yoshida [20, 21]). The mean E^θ has the following natural properties of addition, scalar multiplication and monotonicity regarding the fuzzy max order ⪯.

Lemma 4.1 ([20, Theorem 1]). Let θ ∈ [0, 1]. For fuzzy numbers ã, b̃ ∈ N and real numbers c, the following (i)–(iv) hold.
(i) E^θ(ã + 1_{{c}}) = E^θ(ã) + c, where 1_{·} is the characteristic function of a set.
(ii) E^θ(c ã) = c E^θ(ã) if c ≥ 0.
(iii) E^θ(ã + b̃) = E^θ(ã) + E^θ(b̃).
(iv) If ã ⪯ b̃, then E^θ(ã) ≤ E^θ(b̃) holds.

For a fuzzy random variable X̃ (∈ X̃), the mean of the expectation E(E^θ(X̃)) is a real number

\[
E(E^{\theta}(\tilde{X})) = E\!\left( \int_0^1 \left(\theta\, \tilde{X}_\alpha^- + (1-\theta)\, \tilde{X}_\alpha^+\right) w(\alpha)\, d\alpha \Big/ \int_0^1 w(\alpha)\, d\alpha \right). \tag{21}
\]

Then, from (21) and Lemma 4.1, we obtain the following results.

Lemma 4.2 ([20, Corollary 1]). Let θ ∈ [0, 1]. For a fuzzy number ã ∈ N, integrable fuzzy random variables X̃, Ỹ, an integrable real-valued random variable Z and a nonnegative real number c, the following (i)–(v) hold.
(i) E(E^θ(X̃)) = E^θ(Ẽ(X̃)).
(ii) E(E^θ(ã)) = E^θ(ã) and E(E^θ(Z)) = E(Z).
(iii) E(E^θ(c X̃)) = c E(E^θ(X̃)).
(iv) E(E^θ(X̃ + Ỹ)) = E(E^θ(X̃)) + E(E^θ(Ỹ)).
(v) If X̃ ⪯ Ỹ, then E(E^θ(X̃)) ≤ E(E^θ(Ỹ)) holds.
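The θ-mean (20) is a one-dimensional weighted integral over the level α. A sketch (our illustration with a hypothetical symmetric triangular fuzzy number), comparing the possibility weight w(α) = 1 and the necessity weight w(α) = 1 − α:

```python
import numpy as np

def theta_mean(lower, upper, w, theta, grid=10_000):
    """theta-mean of Eq. (20): the w-weighted average over alpha of
    theta * a_alpha^- + (1 - theta) * a_alpha^+."""
    alpha = (np.arange(grid) + 0.5) / grid            # midpoint rule on [0, 1]
    vals = theta * lower(alpha) + (1 - theta) * upper(alpha)
    return np.sum(vals * w(alpha)) / np.sum(w(alpha))

# hypothetical symmetric triangular fuzzy number: alpha-cuts [a - (1-alpha)c, a + (1-alpha)c]
a, c = 2.0, 1.0
lower = lambda al: a - (1 - al) * c
upper = lambda al: a + (1 - al) * c

possibility = lambda al: np.ones_like(al)             # w(alpha) = 1
necessity   = lambda al: 1.0 - al                     # w(alpha) = 1 - alpha

print(theta_mean(lower, upper, possibility, theta=0.5))  # neutral: the center a = 2.0
print(theta_mean(lower, upper, possibility, theta=1.0))  # pessimistic: a - c/2 = 1.5
print(theta_mean(lower, upper, necessity,   theta=1.0))  # pessimistic + necessity: a - 2c/3
```

The necessity weight emphasizes low levels α (wide α-cuts), so the pessimistic mean moves further from the center (a − 2c/3 instead of a − c/2), matching the interpretation of w(α) as a confidence degree.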


Let X̃_a be the family of fuzzy random variables X̃ ∈ X̃ such that {X̃_α^± | α ∈ [0, 1]} are comonotonic, i.e., there exists a real-valued random variable X ∈ X such that for each X̃_α^± (α ∈ [0, 1]) there exists a non-decreasing function h_α^±: I → R satisfying X̃_α^±(ω) = h_α^±(X(ω)) for all ω ∈ Ω, where I is the range of X.

Example 4.1 Let X ∈ X be an integrable real-valued random variable and let ã ∈ N be a fuzzy number ã(x) = max{1 − |x|/c, 0} for x ∈ R, where c is a positive number. Let X̃ ∈ X̃ be a triangle-type fuzzy random variable such that

\[
\tilde{X}(\omega)(\cdot) = 1_{\{X(\omega)\}}(\cdot) + \tilde{a}(\cdot) \tag{22}
\]

for ω ∈ Ω. Then X̃_α^±(ω) = X(ω) ± (1 − α)c = h_α^±(X(ω)) for ω ∈ Ω, where h_α^±(x) = x ± (1 − α)c for x ∈ R and α ∈ [0, 1]. Therefore X̃_α^± (α ∈ [0, 1]) are comonotonic ([21, 26]), and we obtain X̃ ∈ X̃_a.

Because AVaR_p is comonotonically additive from [21, Proposition 3(iii)], we can easily check the following proposition for the weighted average value-at-risks AVaR^λ_p, the risk-sensitive expectation φ and coherent risk measures ρ, and their extensions \widetilde{AVaR}^λ_p, φ̃ and ρ̃, in a similar way to [25, Lemma 1(ii)] and [27, Lemma 2.1] (see Yoshida [23, Prop. 2.2, Theorems 2.2 and 2.3] for more general cases).

Proposition 4.1 For θ ∈ [0, 1], it holds that

\[
E^{\theta}(\widetilde{\mathrm{AVaR}}^{\lambda}_p(\tilde{X})) = \mathrm{AVaR}^{\lambda}_p(E^{\theta}(\tilde{X})), \tag{23}
\]
\[
E^{\theta}(\tilde{\varphi}(\tilde{X})) = \varphi(E^{\theta}(\tilde{X})), \tag{24}
\]
\[
E^{1-\theta}(\tilde{\rho}(\tilde{X})) = \rho(E^{\theta}(\tilde{X})) \tag{25}
\]

for fuzzy random variables X̃ ∈ X̃_a.

Finally we introduce variances and covariances of fuzzy random variables from the viewpoint of evaluation weights and θ-mean functions. From (18), for fuzzy random variables X̃ and Ỹ, we define variances and covariances as follows:

\[
V(E^{\theta}(\tilde{X})) = E\big( (E^{\theta}(\tilde{X}) - E(E^{\theta}(\tilde{X})))^2 \big), \tag{26}
\]
\[
\mathrm{Cov}(E^{\theta}(\tilde{X}), E^{\theta}(\tilde{Y})) = E\big( (E^{\theta}(\tilde{X}) - E(E^{\theta}(\tilde{X})))(E^{\theta}(\tilde{Y}) - E(E^{\theta}(\tilde{Y}))) \big) \tag{27}
\]

for θ ∈ [0, 1], where V(·) and Cov(·, ·) denote the variance and the covariance of real-valued random variables. Then we can easily check the following lemma.


Lemma 4.3 Let θ ∈ [0, 1]. For fuzzy numbers ã, b̃ ∈ N, integrable fuzzy random variables X̃, Ỹ and a nonnegative real number c, the following (i)–(v) hold.
(i) V(E^θ(ã)) = 0.
(ii) V(E^θ(X̃ + ã)) = V(E^θ(X̃)).
(iii) V(E^θ(c X̃)) = c² V(E^θ(X̃)).
(iv) Cov(E^θ(X̃), E^θ(ã)) = Cov(E^θ(ã), E^θ(X̃)) = 0.
(v) Cov(E^θ(X̃ + ã), E^θ(Ỹ + b̃)) = Cov(E^θ(X̃), E^θ(Ỹ)).

5 Markov Decision with Risk Allocation by Coherent Risk Measures

Let the state space be the set N of fuzzy numbers and let the action space be A = {(x¹, x², ..., xⁿ) ∈ Rⁿ | x¹ + x² + ··· + xⁿ = 1 and x^i ≥ 0 (i = 1, 2, ..., n)}, where n is a positive integer. In this paper we focus on risk-sensitive expected rewards to choose alternatives consisting of n assets. Let a positive integer T be a terminal time, and let time t = 1, 2, ..., T. Let R̃_t^i (∈ X̃_a) be a fuzzy reward for asset i (= 1, 2, ..., n). Let θ ∈ [0, 1]. We denote their expectations and covariances respectively by μ_t^i = E^θ(Ẽ(R̃_t^i)) = E(E^θ(R̃_t^i)) and σ_t^{ij} = E((E^θ(R̃_t^i) − μ_t^i)(E^θ(R̃_t^j) − μ_t^j)) for i, j = 1, 2, ..., n. We give Markov policies by π = {π_t}_{t=1}^T with mappings π_t = (π_t^1, π_t^2, ..., π_t^n): Ω → A for t = 1, 2, ..., T, and then π_t is called a strategy. The reward under a strategy π_t = (π_t^1, π_t^2, ..., π_t^n) is given by

\[
\tilde{R}^{\pi}_t = \sum_{i=1}^{n} \pi^i_t \tilde{R}^i_t. \tag{28}
\]

Then the strategy π_t is chosen depending only on the current state R̃^π_{t-1}(ω) (ω ∈ Ω) at each time t. Denote the collection of all Markov policies by Π. Let p ∈ (0, 1) be a probability and let δ be a positive constant. Let f be a C²-class risk averse utility function as given in Sect. 2, and let the risk-sensitive expectation be φ(·) = f^{-1}(E(f(·))) as in (15). Let a positive constant β be a discount rate and let ρ be a coherent risk measure for the risk constraints. From Proposition 4.1, the risk value at time t is estimated as

\[
E^{1-\theta}(\tilde{\rho}(\tilde{R}^{\pi}_t)) = \rho(E^{\theta}(\tilde{R}^{\pi}_t)). \tag{29}
\]

Here there are two kinds of estimation approaches for risk-sensitive rewards: risk-sensitive running rewards and risk-sensitive terminal rewards. From Lemma 4.2 and Proposition 4.1, the total of risk-sensitive running rewards with a discount rate β is given by

\[
E^{\theta}\!\left( \sum_{t=1}^{T} \beta^{t-1} \tilde{\varphi}(\tilde{R}^{\pi}_t) \right) = \sum_{t=1}^{T} \beta^{t-1} E^{\theta}(\tilde{\varphi}(\tilde{R}^{\pi}_t)) = \sum_{t=1}^{T} \beta^{t-1} \varphi(E^{\theta}(\tilde{R}^{\pi}_t)) = \sum_{t=1}^{T} \beta^{t-1} f^{-1}(E(f(E^{\theta}(\tilde{R}^{\pi}_t)))). \tag{30}
\]

Meanwhile, the risk-sensitive terminal reward is estimated similarly as

\[
E^{\theta}\!\left(\tilde{\varphi}\!\left(\sum_{t=1}^{T}\tilde{R}^{\pi}_t\right)\right) = \varphi\!\left(\sum_{t=1}^{T} E^{\theta}(\tilde{R}^{\pi}_t)\right) = f^{-1}\!\left(E\!\left(f\!\left(\sum_{t=1}^{T} E^{\theta}(\tilde{R}^{\pi}_t)\right)\right)\right). \tag{31}
\]

We should select the risk-sensitive reward adapted to the case at hand. Risk-sensitive terminal rewards (31) are used especially when we cannot settle until the terminal time T. We discuss the optimization of risk-sensitive running rewards in Sect. 6, and we also investigate the optimization of risk-sensitive terminal rewards in Sect. 7. First, from (29) and (30), we deal with the following maximization problem for the total of risk-sensitive running rewards under risk constraints.

Problem (R1). Maximize the total of risk-sensitive expected running rewards

\[
\sum_{t=1}^{T} \beta^{t-1} \varphi(E^{\theta}(\tilde{R}^{\pi}_t)) = \sum_{t=1}^{T} \beta^{t-1} f^{-1}(E(f(E^{\theta}(\tilde{R}^{\pi}_t)))) \tag{32}
\]

with respect to strategies π_t ∈ Π under the risk constraint

\[
\rho(E^{\theta}(\tilde{R}^{\pi}_t)) \le \delta \tag{33}
\]

for time t = 1, 2, ..., T. In (33), the positive constant δ represents a risk level. From the approximation results in Lemma 2.2, we have

\[
\varphi(\cdot) = f^{-1}(E(f(\cdot))) = f^{-1}\!\left( \int_0^1 \mathrm{VaR}_q(f(\cdot))\, dq \right) \approx \mathrm{AVaR}^{\lambda}_1(\cdot) \tag{34}
\]

with a risk spectrum λ. Meanwhile, by Lemma 2.1 we take a risk spectrum ν such that

\[
\rho(\cdot) = -\mathrm{AVaR}^{\nu}_p(\cdot). \tag{35}
\]


Hence we estimate the risks on the downside (0, p). From the viewpoint of (34) and (35), this paper discusses the following optimization instead of Problem (R1).

Problem (R2). Maximize the total of risk-sensitive expected running rewards

\[
\sum_{t=1}^{T} \beta^{t-1} \mathrm{AVaR}^{\lambda}_1(E^{\theta}(\tilde{R}^{\pi}_t)) \tag{36}
\]

with respect to Markov policies π = {π_t}_{t=1}^T ∈ Π under the risk constraint

\[
-\mathrm{AVaR}^{\nu}_p(E^{\theta}(\tilde{R}^{\pi}_t)) \le \delta \tag{37}
\]

for time t = 1, 2, ..., T. In (36) and (37), the risk spectra λ and ν are different in general; however, we may select the same risk spectrum, i.e. λ = ν. From (28), the expectation and the standard deviation of the reward R̃^π_t are respectively

\[
E(E^{\theta}(\tilde{R}^{\pi}_t)) = \sum_{i=1}^{n} \pi^i_t \mu^i_t, \tag{38}
\]
\[
\sigma(E^{\theta}(\tilde{R}^{\pi}_t)) = \sqrt{ \sum_{i=1}^{n}\sum_{j=1}^{n} \pi^i_t \pi^j_t \sigma^{ij}_t }. \tag{39}
\]

Together with (7), we also have the weighted average value-at-risk

\[
\mathrm{AVaR}^{\nu}_p(E^{\theta}(\tilde{R}^{\pi}_t)) = \sum_{i=1}^{n} \pi^i_t \mu^i_t + \kappa^{\nu}(p) \sqrt{ \sum_{i=1}^{n}\sum_{j=1}^{n} \pi^i_t \pi^j_t \sigma^{ij}_t }, \tag{40}
\]

where

\[
\kappa^{\nu}(p) = \int_0^p \kappa(q)\,\nu(q)\, dq \Big/ \int_0^p \nu(q)\, dq. \tag{41}
\]

In this paper we assume κ^λ(1) ≤ 0 and κ^ν(p) < 0 (Fig. 4). Let Π_t(δ) be the collection of strategies π_t satisfying the risk constraint (37), and let Π_t = ⋃_{δ>0} Π_t(δ). In the rest of this section we investigate the lower bound of −AVaR^ν_p(E^θ(R̃^π_t)) for the feasibility of the risk constraint (37) in Problem (R2), i.e. Π_t(δ) ≠ ∅. From (40), we first discuss the following maximization problem for AVaR^ν_p(E^θ(R̃^π_t)).

Problem (P1). Maximize the weighted average value-at-risk


\[
\mathrm{AVaR}^{\nu}_p(E^{\theta}(\tilde{R}^{\pi}_t)) = \sum_{i=1}^{n} \pi^i_t \mu^i_t + \kappa^{\nu}(p) \sqrt{ \sum_{i=1}^{n}\sum_{j=1}^{n} \pi^i_t \pi^j_t \sigma^{ij}_t } \tag{42}
\]

with respect to strategies π_t = (π_t^1, π_t^2, ..., π_t^n) ∈ Π_t. Let γ ∈ R be a constant. Under the constraint

\[
E(E^{\theta}(\tilde{R}^{\pi}_t)) = \sum_{i=1}^{n} \pi^i_t \mu^i_t = \gamma, \tag{43}
\]

which comes from (38), Problem (P1) is solved by the Lagrange multiplier method, and then the corresponding value of (42) becomes

\[
\gamma + \kappa^{\nu}(p) \sqrt{ \frac{A_t \gamma^2 - 2B_t \gamma + C_t}{\Delta_t} }, \tag{44}
\]

where

\[
\mu_t = \begin{bmatrix} \mu^1_t \\ \mu^2_t \\ \vdots \\ \mu^n_t \end{bmatrix}, \quad
\Sigma_t = \begin{bmatrix} \sigma^{11}_t & \sigma^{12}_t & \cdots & \sigma^{1n}_t \\ \sigma^{21}_t & \sigma^{22}_t & \cdots & \sigma^{2n}_t \\ \vdots & \vdots & \ddots & \vdots \\ \sigma^{n1}_t & \sigma^{n2}_t & \cdots & \sigma^{nn}_t \end{bmatrix}, \quad
\mathbf{1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix},
\]

A_t = 1^⊤Σ_t^{-1}1, B_t = 1^⊤Σ_t^{-1}μ_t, C_t = μ_t^⊤Σ_t^{-1}μ_t, Δ_t = A_tC_t − B_t², and ⊤ denotes the transpose of a vector. If A_t > 0, Δ_t > 0 and κ^ν(p) < −√(Δ_t/A_t) are satisfied, we can easily check that the real-valued function (44) of γ is concave and has the maximum

\[
\frac{B_t - \sqrt{A_t \kappa^{\nu}(p)^2 - \Delta_t}}{A_t}
\quad\text{at}\quad
\gamma = \frac{B_t}{A_t} + \frac{\Delta_t}{A_t \sqrt{A_t \kappa^{\nu}(p)^2 - \Delta_t}}
\]

(Yoshida [21, 28]). Since

\[
\sup_{\pi_t \in \Pi_t} \mathrm{AVaR}^{\nu}_p(E^{\theta}(\tilde{R}^{\pi}_t)) = \sup_{\gamma} \left( \sup_{\pi_t \in \Pi_t:\ \sum_{i=1}^n \pi^i_t \mu^i_t = \gamma} \mathrm{AVaR}^{\nu}_p(E^{\theta}(\tilde{R}^{\pi}_t)) \right), \tag{45}
\]

we obtain the following analytical solutions for Problem (P1).

Theorem 5.1 Let A_t > 0, Δ_t > 0 and κ^ν(p) < −√(Δ_t/A_t). Then the following (i) and (ii) hold.

(i) The maximum weighted average value-at-risk of Problem (P1), i.e. (45), is

\[
\frac{B_t - \sqrt{A_t \kappa^{\nu}(p)^2 - \Delta_t}}{A_t} \tag{46}
\]

at the expected reward

\[
\gamma = \frac{B_t}{A_t} + \frac{\Delta_t}{A_t \sqrt{A_t \kappa^{\nu}(p)^2 - \Delta_t}}. \tag{47}
\]

An optimal strategy is given by

\[
\pi^{\circ}_t = \xi^{\circ}_t\, \Sigma_t^{-1}\mathbf{1} + \eta^{\circ}_t\, \Sigma_t^{-1}\mu_t \tag{48}
\]

if π_t^∘ ≥ 0, where ξ_t^∘ = (C_t − B_t γ)/Δ_t, η_t^∘ = (A_t γ − B_t)/Δ_t and 0 is the zero vector in Rⁿ.

(ii) If Σ_t^{-1}1 ≥ 0, Σ_t^{-1}μ_t ≥ 0 and κ^ν(p)² ≥ C_t, then the strategy (48) satisfies π_t^∘ ≥ 0.
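All quantities in Theorem 5.1 come from μ_t and Σ_t by linear algebra. A sketch with made-up data for n = 3 assets (μ, Σ and κ^ν(p) are hypothetical, not from the paper); it computes A_t, B_t, C_t, Δ_t, the maximum (46), the optimal expected reward (47) and the strategy (48):

```python
import numpy as np

mu = np.array([0.05, 0.08, 0.11])            # hypothetical expected rewards mu_t^i
Sigma = np.array([[0.04, 0.01, 0.00],        # hypothetical covariance matrix (sigma_t^{ij})
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
kappa_nu = -2.5                              # assumed kappa^nu(p) < -sqrt(Delta/A)

inv = np.linalg.inv(Sigma)
one = np.ones(len(mu))
A = one @ inv @ one                          # A_t = 1' Sigma^{-1} 1
B = one @ inv @ mu                           # B_t = 1' Sigma^{-1} mu
C = mu @ inv @ mu                            # C_t = mu' Sigma^{-1} mu
Delta = A * C - B ** 2                       # Delta_t = A_t C_t - B_t^2

root = np.sqrt(A * kappa_nu ** 2 - Delta)
max_avar = (B - root) / A                    # Eq. (46): maximum of AVaR^nu_p
gamma = B / A + Delta / (A * root)           # Eq. (47): the maximizing expected reward

xi = (C - B * gamma) / Delta                 # coefficients of Eq. (48)
eta = (A * gamma - B) / Delta
pi = xi * (inv @ one) + eta * (inv @ mu)     # optimal strategy

print(pi, pi.sum())                          # the weights sum to 1 by construction
print(max_avar, gamma)
```

One can verify directly that π·μ = γ and that the components of π sum to 1, since ξA + ηB = (AC − B²)/Δ = 1.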

6 Maximization of Risk-Sensitive Running Rewards Under Feasible Risk Constraints

Let p ∈ (0, 1) be a probability and let ν be a risk spectrum as given in Sect. 5. From Theorem 5.1, we define the lower bound of −AVaR^ν_p(E^θ(R̃^π_t)) by the following constant δ̲_t(p):

\[
\underline{\delta}_t(p) = \inf_{\pi_t\in\Pi_t} \left( -\mathrm{AVaR}^{\nu}_p(E^{\theta}(\tilde{R}^{\pi}_t)) \right) = -\frac{B_t}{A_t} + \frac{\sqrt{A_t \kappa^{\nu}(p)^2 - \Delta_t}}{A_t}. \tag{49}
\]

Thus the feasible range of risk levels δ in the risk constraint (37) is {δ | Π_t(δ) ≠ ∅} = [δ̲_t(p), ∞). Now we take a risk level δ ∈ [δ̲_t(p), ∞). Under the constraint of expected rewards Σ_{i=1}^n π_t^i μ_t^i = γ_t, similarly to (44) we have

\[
\sup_{\pi_t\in\Pi_t(\delta):\ \sum_{i=1}^n \pi^i_t\mu^i_t=\gamma_t} \mathrm{AVaR}^{\lambda}_1(E^{\theta}(\tilde{R}^{\pi}_t)) = \gamma_t + \kappa^{\lambda}(1)\sqrt{\frac{A_t\gamma_t^2 - 2B_t\gamma_t + C_t}{\Delta_t}}, \tag{50}
\]
\[
\sup_{\pi_t\in\Pi_t(\delta):\ \sum_{i=1}^n \pi^i_t\mu^i_t=\gamma_t} \mathrm{AVaR}^{\nu}_p(E^{\theta}(\tilde{R}^{\pi}_t)) = \gamma_t + \kappa^{\nu}(p)\sqrt{\frac{A_t\gamma_t^2 - 2B_t\gamma_t + C_t}{\Delta_t}}. \tag{51}
\]

Hence from (50) and (51) we consider the following maximization of risk-sensitive expectations under feasible risk constraints at time t.

Problem (P2). Maximize the risk-sensitive expectation

\[
\gamma_t + \kappa^{\lambda}(1)\sqrt{\frac{A_t\gamma_t^2 - 2B_t\gamma_t + C_t}{\Delta_t}} \tag{52}
\]

with respect to expected rewards γ_t ∈ R under the risk constraint


\[
\gamma_t + \kappa^{\nu}(p)\sqrt{\frac{A_t\gamma_t^2 - 2B_t\gamma_t + C_t}{\Delta_t}} \ge -\delta. \tag{53}
\]

By solving the quadratic inequality in γ_t, Eq. (53) is equivalent to

\[
\gamma_t \in [\gamma_t^-, \gamma_t^+], \tag{54}
\]

where

\[
\gamma_t^{\pm} = \frac{B_t\kappa^{\nu}(p)^2 + \Delta_t\delta}{A_t\kappa^{\nu}(p)^2 - \Delta_t} \mp \frac{\sqrt{\Delta_t}\,\kappa^{\nu}(p)\sqrt{A_t\delta^2 + 2B_t\delta + C_t - \kappa^{\nu}(p)^2}}{A_t\kappa^{\nu}(p)^2 - \Delta_t}. \tag{55}
\]

Hence, by maximizing the concave function (52) of γ_t within the constraint [γ_t^-, γ_t^+], we easily obtain the following results for Problem (P2).

Theorem 6.1 Let A_t > 0, Δ_t > 0, κ^ν(p) ≤ κ^λ(1) ≤ 0 and κ^ν(p) < −√(Δ_t/A_t). Then the maximum risk-sensitive estimation in Problem (P2) is

\[
\varphi^*_t = \begin{cases}
\dfrac{B_t - \sqrt{A_t\kappa^{\lambda}(1)^2 - \Delta_t}}{A_t} & \text{at an expected reward } \gamma^*_t = \dfrac{B_t}{A_t} + \dfrac{\Delta_t}{A_t\sqrt{A_t\kappa^{\lambda}(1)^2 - \Delta_t}}\\[1ex]
 & \text{if } \delta_t^+ \le \delta \text{ and } \kappa^{\lambda}(1) < -\sqrt{\Delta_t/A_t},\\[1ex]
\gamma_t^+ - \dfrac{\kappa^{\lambda}(1)}{\kappa^{\nu}(p)}\,(\delta + \gamma_t^+) & \text{at an expected reward } \gamma^*_t = \gamma_t^+ \quad \text{otherwise},
\end{cases} \tag{56}
\]

where

\[
\delta_t^+ = -\frac{B_t}{A_t} + \frac{A_t\kappa^{\lambda}(1)\kappa^{\nu}(p) - \Delta_t}{A_t\sqrt{A_t\kappa^{\lambda}(1)^2 - \Delta_t}}.
\]
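The feasible interval (54)–(55) and the maximum (56) are closed-form in A_t, B_t, C_t and Δ_t. A sketch with hypothetical numbers (all values are made up); since (52) is concave under the stated conditions, clipping the unconstrained maximizer to the feasible interval reproduces the case analysis of (56):

```python
import numpy as np

A, B, C = 35.85, 2.332, 0.172      # hypothetical A_t, B_t, C_t
Delta = A * C - B ** 2
kappa_l, kappa_n = -2.0, -2.5      # kappa^lambda(1), kappa^nu(p), kappa_n <= kappa_l <= 0
delta = 0.40                       # risk level

# Eq. (55): the feasible interval [gamma-, gamma+] of expected rewards.
den = A * kappa_n ** 2 - Delta
mid = (B * kappa_n ** 2 + Delta * delta) / den
half = np.sqrt(Delta) * kappa_n * np.sqrt(A * delta ** 2 + 2 * B * delta + C - kappa_n ** 2) / den
g_minus, g_plus = mid + half, mid - half   # the -/+ signs of Eq. (55); half < 0 here

# Eq. (52): the concave objective on that interval.
obj = lambda g: g + kappa_l * np.sqrt((A * g * g - 2 * B * g + C) / Delta)

# Unconstrained maximizer of (52), clipped to [gamma-, gamma+] (cf. Eq. (56)).
root = np.sqrt(A * kappa_l ** 2 - Delta)
g_unc = B / A + Delta / (A * root)
g_star = min(max(g_unc, g_minus), g_plus)
print(g_minus, g_plus, g_star, obj(g_star))
```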

In the rest of this section, we deal with the dynamic maximization. Problem (R2) is reduced to the following Problem (R3) by applying (50) and (54).

Problem (R3). Maximize the total of risk-sensitive expected running rewards

\[
\sum_{t=1}^{T} \beta^{t-1}\left( \gamma_t + \kappa^{\lambda}(1)\sqrt{\frac{A_t\gamma_t^2 - 2B_t\gamma_t + C_t}{\Delta_t}} \right) \tag{57}
\]

with respect to expected rewards (γ_1, γ_2, ..., γ_T) ∈ R^T under the risk constraint

\[
\gamma_t \in [\gamma_t^-, \gamma_t^+] \tag{58}
\]

for all t = 1, 2, ..., T.


Let Φ^R be the total maximum of risk-sensitive expected running rewards in Problem (R3). By applying dynamic programming to Problem (R3), we obtain the following results.

Lemma 6.1 Let {v_t} be a sequence given by the following optimality equations:

\[
v_t = \sup_{\gamma_t\in[\gamma_t^-,\gamma_t^+]} \left\{ \gamma_t + \kappa^{\lambda}(1)\sqrt{\frac{A_t\gamma_t^2 - 2B_t\gamma_t + C_t}{\Delta_t}} + \beta\, v_{t+1} \right\} \tag{59}
\]

for t = 1, 2, ..., T and v_{T+1} = 0. Then Φ^R = v_1 is the total maximum of risk-sensitive expected running rewards for Problems (R2) and (R3).

Together with Theorem 6.1, we have the following results from Lemma 6.1.

Theorem 6.2 Let A_t > 0, Δ_t > 0, κ^ν(p) ≤ κ^λ(1) ≤ 0 and κ^ν(p) < −√(Δ_t/A_t) for t = 1, 2, ..., T. Let φ*_t and γ*_t be the maximum risk-sensitive estimation and the expected reward given by (56), respectively, for t = 1, 2, ..., T.

(i)

Let {v_t} be a sequence given by the following optimality equations:

\[
v_t = \varphi^*_t + \beta\, v_{t+1} \tag{60}
\]

for t = 1, 2, ..., T and v_{T+1} = 0. Then Φ^R = v_1 is the total maximum of risk-sensitive expected running rewards in Problems (R2) and (R3).

(ii) An optimal Markov policy π* = {π*_t}_{t=1}^T ∈ Π of Problem (R2) is given by

\[
\pi^*_t = \xi^*_t\, \Sigma_t^{-1}\mathbf{1} + \eta^*_t\, \Sigma_t^{-1}\mu_t \tag{61}
\]

for t = 1, 2, ..., T if π*_t ≥ 0, where ξ*_t = (C_t − B_t γ*_t)/Δ_t and η*_t = (A_t γ*_t − B_t)/Δ_t.

Remark 6.1 (i) We can give a sufficient condition for π*_t ≥ 0 as follows: κ^λ(1)² ≥ C_t, C_t κ^ν(p)² ≥ (C_t + B_t δ)², A_t κ^ν(p)² ≥ (A_t δ + B_t)², Σ_t^{-1}1 ≥ 0 and Σ_t^{-1}μ_t ≥ 0 for t = 1, 2, ..., T. (ii) If n ≥ 3, the optimal strategy π*_t is not unique, and it is given by any strategy π*_t ∈ A satisfying (π*_t)^⊤ μ_t = γ*_t.
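The optimality equation (60) is a plain backward recursion that telescopes to the discounted sum of the per-period maxima φ*_t. A sketch with hypothetical φ*_t values (in practice they come from (56)):

```python
# Backward recursion (60): v_t = phi*_t + beta * v_{t+1}, v_{T+1} = 0, Phi^R = v_1.
phi_star = [0.066, 0.058, 0.061, 0.054]   # hypothetical per-period maxima phi*_1..phi*_T
beta = 0.95                               # discount rate

def total_running_reward(phi_star, beta):
    v = 0.0                               # v_{T+1} = 0
    for phi in reversed(phi_star):        # t = T, T-1, ..., 1
        v = phi + beta * v                # Eq. (60)
    return v                              # Phi^R = v_1

phi_R = total_running_reward(phi_star, beta)
print(phi_R)  # equals the discounted sum over t of beta^{t-1} phi*_t, i.e. the total (57)
```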

7 Maximization of Risk-Sensitive Terminal Rewards Under Feasible Risk Constraints

In this section we discuss the maximization of risk-sensitive terminal rewards under risk constraints. From (29) and (31), we deal with the following problem.

Problem (T1). Maximize the risk-sensitive terminal reward


\[
\varphi\!\left( \sum_{t=1}^{T} E^{\theta}(\tilde{R}^{\pi}_t) \right) = f^{-1}\!\left( E\!\left( f\!\left( \sum_{t=1}^{T} E^{\theta}(\tilde{R}^{\pi}_t) \right)\right)\right) \tag{62}
\]

with respect to Markov policies π = {π_t}_{t=1}^T ∈ Π under the risk constraint

\[
\rho(E^{\theta}(\tilde{R}^{\pi}_t)) \le \delta \tag{63}
\]

for time t = 1, 2, ..., T. Applying (34) and (35) to (62) and (63), we investigate the following problem instead of Problem (T1).

Problem (T2). Maximize the risk-sensitive terminal reward

\[
\mathrm{AVaR}^{\lambda}_1\!\left( \sum_{t=1}^{T} E^{\theta}(\tilde{R}^{\pi}_t) \right) \tag{64}
\]

with respect to Markov policies π = {π_t}_{t=1}^T ∈ Π under the risk constraints

\[
-\mathrm{AVaR}^{\nu}_p(E^{\theta}(\tilde{R}^{\pi}_t)) \le \delta \tag{65}
\]

for all t = 1, 2, ..., T.

Let a time space be T = {1, 2, ..., T}. In this section we assume A_t > 0, Δ_t > 0, κ^ν(p) ≤ κ^λ(1) ≤ 0 and κ^ν(p) < −√(Δ_t/A_t) for all t ∈ T. For a Markov policy π = {π_t}_{t=1}^T ∈ Π, the expectation and the standard deviation of the terminal reward Σ_{t=1}^T R̃^π_t are

\[
E\!\left( \sum_{t=1}^{T} E^{\theta}(\tilde{R}^{\pi}_t) \right) = \sum_{t=1}^{T} \sum_{i=1}^{n} \pi^i_t \mu^i_t, \tag{66}
\]
\[
\sigma\!\left( \sum_{t=1}^{T} E^{\theta}(\tilde{R}^{\pi}_t) \right) = \sqrt{ \sum_{t=1}^{T} \left( \sum_{i=1}^{n}\sum_{j=1}^{n} \pi^i_t \pi^j_t \sigma^{ij}_t \right) }. \tag{67}
\]

In a similar way to (50) and (54), under Σ_{i=1}^n π_t^i μ_t^i = γ_t, Problem (T2) is reduced to the following Problem (T3).

Problem (T3). Maximize the risk-sensitive terminal rewards Φ^T given by

Φ^T(γ_1, γ_2, ···, γ_T) = ∑_{t=1}^T γ_t + κ^λ(1) √( ∑_{t=1}^T (A_t γ_t^2 − 2B_t γ_t + C_t)/Δ_t )   (68)

with respect to expected rewards (γ_1, γ_2, ···, γ_T) ∈ R^T under risk constraint

γ_t ∈ [γ_t^-, γ_t^+]   (69)

Markov Decision Processes with Fuzzy Risk-Sensitive Rewards …

153

for all t ∈ T. Now it is difficult to solve Problem (T3) by dynamic programming because of the form (68). If T is small, we can calculate the solutions of Problem (T3) directly by multi-parameter optimization. Hence we investigate Problem (T3) when the terminal time T is not small. We have the following technical lemma, which is checked easily.

Lemma 7.1 Let t ∈ T and constants E, D ∈ R satisfying D ≥ 0. Define a function

Ψ(γ) = E + γ + κ^λ(1) √( (A_t γ^2 − 2B_t γ + C_t)/Δ_t + D )   (70)

for γ ∈ R. Then the following (i) and (ii) hold:

(i) If κ^λ(1)^2 ≤ Δ_t/A_t, then the function Ψ is non-decreasing on R.
(ii) If κ^λ(1)^2 > Δ_t/A_t, then the function Ψ is concave and it has a maximum at γ = γ^D, where

γ^D = B_t/A_t + (Δ_t/A_t) √( (1 + A_t D)/(A_t κ^λ(1)^2 − Δ_t) ).   (71)

Then we can easily obtain the following theorem using Lemma 7.1 and the first-order necessary condition for the optimal solutions of Problem (T3).

Theorem 7.1 Assume A_t > 0, Δ_t > 0, κ^ν(p) ≤ κ^λ(1) ≤ 0 and κ^ν(p) < −√(Δ_t/A_t) for all t ∈ T. Let Φ^T have a maximum at a point (γ_1^*, γ_2^*, ···, γ_T^*) ∈ ∏_{t=1}^T [γ_t^-, γ_t^+] in Problem (T3). Then the following (i) and (ii) hold:

(i) There exists a subset T^* of T = {1, 2, ..., T} for which the following (72) and (73) hold:

κ^λ(1)^2 > ∑_{t ∉ T^*} Δ_t/A_t   (72)

and the point (γ_1^*, γ_2^*, ···, γ_T^*) satisfies

γ_t^* = γ_t^+ for t ∈ T^*,  and  γ_t^* = (Δ_t θ^* + B_t)/A_t (< γ_t^+) for t ∉ T^*,   (73)

where

D^* = ∑_{t ∈ T^*} (A_t (γ_t^+)^2 − 2B_t γ_t^+ + C_t)/Δ_t  and  θ^* = √( (∑_{t ∉ T^*} 1/A_t + D^*) / (κ^λ(1)^2 − ∑_{t ∉ T^*} Δ_t/A_t) ).   (74)

(ii) An optimal Markov policy π^* = {π_t^*}_{t=1}^T ∈ Π of Problem (T2) is given by

π_t^* = ξ_t^* Σ_t^{-1} 1 + η_t^* Σ_t^{-1} μ_t   (75)

if π_t^* ≥ 0, where ξ_t^* = (C_t − B_t γ_t^*)/Δ_t and η_t^* = (A_t γ_t^* − B_t)/Δ_t with γ_t^* in (73).

Remark 7.1 If n ≥ 3, the optimal strategy is not unique and it is given by any strategy π_t^* ∈ A satisfying (π_t^*)^T μ_t = γ_t^*.

Hence T^* in (73) is the critical subset of T = {1, 2, ..., T} such that for t ∈ T^* the optimal solution γ_t^* equals the right-side boundary γ_t^+, and for t ∉ T^* the optimal solution γ_t^* = (Δ_t θ^* + B_t)/A_t is an inner point of the interval [γ_t^-, γ_t^+]. If T is small, we can obtain the optimal solution γ_t^* directly by multi-parameter computation. However, if T is large, we need to examine the solutions (73) over 2^T combinations, i.e. γ_t^* = γ_t^+ or γ_t^* = (Δ_t θ^* + B_t)/A_t for each t ∈ T, and numerical computation over these 2^T combinations would be required. Hence for large T the following theorem gives a sufficient condition under which the optimal solutions equal the right-side boundaries, i.e. (γ_1^*, γ_2^*, ···, γ_T^*) = (γ_1^+, γ_2^+, ···, γ_T^+), which is equivalent to ∂Φ^T/∂γ_t |_{γ_t = γ_t^+} ≥ 0 for all t ∈ T because the objective function Φ^T is concave on ∏_{t=1}^T [γ_t^-, γ_t^+]. The following inequality (76) is derived from Lemma 7.1 and ∂Φ^T/∂γ_t |_{γ_t = γ_t^+} ≥ 0 for all t ∈ T.

Theorem 7.2 Assume A_t > 0, Δ_t > 0, κ^ν(p) ≤ κ^λ(1) ≤ 0 and κ^ν(p) < −√(Δ_t/A_t) for all t ∈ T. If the inequality

κ^λ(1)^2 max_{t∈T} ( (A_t γ_t^+ − B_t)/Δ_t )^2 ≤ ∑_{t∈T} (A_t (γ_t^+)^2 − 2B_t γ_t^+ + C_t)/Δ_t   (76)

holds, then the optimal solution of Problem (T3) is the expected reward γ_t^* = γ_t^+ for all t ∈ T, and the maximum risk-sensitive terminal reward is Φ^T(γ_1^*, γ_2^*, ···, γ_T^*) = Φ^T(γ_1^+, γ_2^+, ···, γ_T^+).

The inequality (76) gives a sufficient condition on the terminal time T under which γ_t^* = γ_t^+ for all t ∈ T. The left-hand side of (76) is bounded above, whereas the right-hand side (≥ ∑_{t∈T} 1/A_t > 0) grows and tends to infinity as T → ∞. Therefore in actual cases we can check that (76) is satisfied for all large T.
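The closed-form maximizer (71) in Lemma 7.1(ii) can be sanity-checked numerically. The constants below are arbitrary illustrative values satisfying κ^λ(1)^2 > Δ_t/A_t, not values taken from the paper:

```python
# Numerical check of Lemma 7.1(ii): for kappa^2 > Delta/A the function
# Psi(g) = E + g + kappa*sqrt((A g^2 - 2 B g + C)/Delta + D) has its maximum at
# g_D = B/A + (Delta/A)*sqrt((1 + A D)/(A kappa^2 - Delta)), as in (71).
# All constants here are illustrative, not from the paper's example.
import math

A, B, C, D, E = 2.0, 1.0, 3.0, 0.5, 0.0
Delta = A * C - B * B            # = 5.0, so Delta/A = 2.5
kappa = -2.0                     # kappa^2 = 4 > Delta/A, so case (ii) applies

def psi(g):
    return E + g + kappa * math.sqrt((A * g * g - 2 * B * g + C) / Delta + D)

g_D = B / A + (Delta / A) * math.sqrt((1 + A * D) / (A * kappa ** 2 - Delta))

# g_D beats nearby points, and the first-order condition Psi'(g_D) = 0 holds
# up to numerical error (checked with a central difference):
h = 1e-6
assert psi(g_D) >= psi(g_D - 1e-3) and psi(g_D) >= psi(g_D + 1e-3)
assert abs((psi(g_D + h) - psi(g_D - h)) / (2 * h)) < 1e-4
```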

8 Numerical Examples

We give a few examples to understand the results in the previous sections.


Example 8.1 Let a domain I = R and let f be a risk-neutral utility function f(x) = ax + b for x ∈ R with constants a (> 0) and b (∈ R). Then its risk spectrum in Lemma 2.2 is given by λ(p) = 1. The corresponding weighted average value-at-risk (4) reduces to the average value-at-risk (3), and we have

f^{-1}(E(f(X))) = E(X) = AVaR_1(X)   (77)

for X ∈ X (Yoshida [29]).

Example 8.2 Let a domain I = R and let f be a risk-averse exponential utility function

f(x) = (1 − e^{−τx})/τ   (78)

for x ∈ R with a positive constant τ (Fig. 3). Then −f''(x)/f'(x) = τ is the degree of the decision maker's absolute risk aversion (Arrow [3]). Let X be a family of random variables X which have normal distribution functions. Define the cumulative distribution function G : R → (0, 1) of the standard normal distribution by

G(x) = (1/√(2π)) ∫_{−∞}^x e^{−z^2/2} dz   (79)

for x ∈ R with a positive constant τ (Fig. 3). Then − ff = τ is the degree of decision maker’s absolute risk aversity (Arrow [3]). Let X be a family of random variables X which have normal distribution functions. Define the cumulative distribution function G : R → (0, 1) of the standard normal distribution by 1 G(x) = √ 2π



x

z2

e− 2 dz

(79)

−∞

for x ∈ R, and define an increasing function κ : (0, 1) → R by its inverse function

κ(p) = G^{-1}(p)   (80)

Fig. 3 Utility functions f(x)


for probabilities p ∈ (0, 1). Then we have value-at-risk VaR_p(X) = μ + κ(p)·σ for X ∈ X with mean μ and standard deviation σ. Further we suppose for X ∈ X there exists a distribution

(μ, σ) (∈ R × (0, ∞)) → φ(μ) · (2^{1−n/2}/Γ(n/2)) σ^{n−1} e^{−σ^2/2},   (81)

where φ(μ) is some probability distribution, Γ(·) is the gamma function and (2^{1−n/2}/Γ(n/2)) σ^{n−1} e^{−σ^2/2} is a chi distribution with n degrees of freedom. We take a utility

f(x) = (1 − e^{−0.05x})/0.05   (82)

for x ∈ R with τ = 0.05 in (78), and by Lemma 2.2 there exists a risk spectrum λ satisfying f^{-1}(E(f(·))) ≈ AVaR_1^λ(·) in (34). Then, by Yoshida [29, Example 2], the best risk spectrum in Lemma 2.2 is given by

λ(p) = e^{−∫_p^1 C(q) dq} / C(p)   (83)

for p ∈ (0, 1] with the component function

C(p) = (1/p) · ( 1 − [ ∫_0^∞ log( (1/p) ∫_0^p e^{−τσ(κ(p)−κ(q))} dq ) σ^n e^{−σ^2/2} dσ ] / [ ∫_0^∞ ( (1/p) ∫_0^p e^{τσ(κ(p)−κ(q))} dq ) σ^n e^{−σ^2/2} dσ ] ).   (84)
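The constants κ^λ(1) and κ^ν(p) used below are spectrum-weighted averages of the standard normal quantile κ(q) = G^{-1}(q). In the risk-neutral special case of a constant spectrum (Example 8.1), the average of κ(q) over (0, p) equals the classical AVaR coefficient −G'(G^{-1}(p))/p of a standard normal; the following sketch checks this with the Python standard library (function names are ours, not the paper's):

```python
# kappa(q) = G^{-1}(q) is the standard normal quantile. With a constant
# spectrum (the risk-neutral case of Example 8.1), the weighted average of
# kappa over (0, p) is the AVaR coefficient of a standard normal, known in
# closed form as -G'(G^{-1}(p))/p. Function names are ours, not the paper's.
from statistics import NormalDist

N = NormalDist()                           # standard normal: cdf, inv_cdf, pdf

def avg_quantile(p, steps=50000):
    # midpoint rule for (1/p) * integral_0^p G^{-1}(q) dq
    h = p / steps
    return sum(N.inv_cdf((k + 0.5) * h) for k in range(steps)) * h / p

p = 0.05
z = N.inv_cdf(p)                           # VaR coefficient kappa(p)
closed_form = -N.pdf(z) / p                # AVaR coefficient

assert abs(z + 1.6449) < 1e-3              # kappa(0.05) is about -1.6449
assert abs(avg_quantile(p) - closed_form) < 1e-3
```

Non-constant spectra such as (83)-(84) replace the uniform weight by λ(q) or ν(q), which shifts these averages further into the tail.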

Figure 4 illustrates the functions κ^λ(p), which are calculated from (80), (83) and (84). Then we have κ^λ(1) = ∫_0^1 κ(q)λ(q) dq / ∫_0^1 λ(q) dq = −0.03. On the other hand, for risk measures ρ we use another utility

g(x) = 1 − e^{−x}   (85)

for x ∈ R with τ = 1 in (78). By (35) we take a risk spectrum ν such that ρ(·) = −AVaR_p^ν(·). Hence we discuss a case of risk probability 5%, i.e. p = 0.05, in the normal distribution and we focus on the downside risks on (0, 0.05). Then similarly we can calculate κ^ν(0.05) = ∫_0^{0.05} κ(q)ν(q) dq / ∫_0^{0.05} ν(q) dq = −2.29701. We give fuzzy rewards by fuzzy random variables R̃_t^i ∈ X̃_a (i = 1, 2, ..., n) as follows:

R̃_t^i(ω)(·) = 1_{{R_t^i(ω)}}(·) + ã_t^i(·)   (86)


Fig. 4 Functions κ^λ(p)

Table 1 Expectations μ_t^i with fuzzy factors c_t^i and covariances Σ_t = [σ_t^{ij}]

(a) μ_t^i and c_t^i:

         i = 1    i = 2    i = 3    i = 4
μ_t^i    0.097    0.087    0.094    0.089
c_t^i    0.009    0.007    0.008    0.006

(b) Σ_t = [σ_t^{ij}]:

σ_t^{ij}   j = 1    j = 2    j = 3    j = 4
i = 1       0.41    −0.08    −0.06     0.05
i = 2      −0.08     0.38    −0.07     0.06
i = 3      −0.06    −0.07     0.37    −0.05
i = 4       0.05     0.06    −0.05     0.34

for ω ∈ Ω, where R_t^i (∈ X) and ã_t^i is a fuzzy number ã_t^i(x) = max{1 − |x|/c_t^i, 0} for x ∈ R with a positive number c_t^i, which gives a fuzzy factor. Let n = 4 for the number of assets. Hence we put the expectations μ_t^i of rewards R_t^i with fuzzy factors c_t^i and the covariances σ_t^{ij} of rewards R_t^i by Table 1. We deal with an optimistic and possibility case, i.e. θ = 0 and w(α) = 1 for α ∈ [0, 1]. Hence we have A_t = 14.0101 > 0 and Δ_t = 0.00193328 > 0, and we can easily check κ^ν(0.05) < κ^λ(1) < −√(Δ_t/A_t) = −0.011747. From (49), the lower bound of risk levels is δ_t(p) = 0.569092. Now we take a risk level δ = 0.75 in the feasible range [0.569092, ∞).

Maximum Risk-sensitive Running Rewards: From Theorem 6.1, we obtain the maximum risk-sensitive estimation φ_t^* = 0.0891323 at the expected reward γ_t^* = 0.0978428 for Problem (R2), where π_t^* = (0.252546, 0.269506, 0.308391, 0.169556) is an optimal strategy. Hence the difference between the real expected reward γ_t^* = 0.0978428 and the maximum risk-sensitive estimation φ_t^* = 0.0891323 comes from the decision maker's risk-averse attitude. We can also choose a pessimistic and necessity case, i.e. θ = 1 and w(α) = 1 − α for α ∈ [0, 1]. Then from Table 1 we find the maximum risk-sensitive estimation and the expected reward satisfy 0.0786323 ≤ φ_t^* ≤ 0.0891323 and 0.0873428 ≤ γ_t^* ≤ 0.0978428 (Table 2).
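The constants A_t and Δ_t can be cross-checked against the raw data of Table 1 under the usual identifications A_t = 1^T Σ_t^{-1} 1, B_t = 1^T Σ_t^{-1} μ_t, C_t = μ_t^T Σ_t^{-1} μ_t and Δ_t = A_t C_t − B_t^2 (assumed by us; this excerpt does not restate them). Since the paper's A_t = 14.0101 and Δ_t = 0.00193328 additionally involve the fuzzy factors c_t^i via θ and w(α), the sketch only asserts the sign conditions the theorems rely on:

```python
# Frontier constants from Table 1, under the assumed identifications
# A = 1'S^-1 1, B = 1'S^-1 mu, C = mu'S^-1 mu, Delta = A*C - B^2. The paper's
# A_t = 14.0101 and Delta_t = 0.00193328 also involve the fuzzy adjustments
# via theta and w(alpha), so only the sign conditions are asserted here.

def solve(S, b):
    # Gauss-Jordan elimination with partial pivoting for small systems S x = b.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(S)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

Sigma = [[ 0.41, -0.08, -0.06,  0.05],     # Table 1(b)
         [-0.08,  0.38, -0.07,  0.06],
         [-0.06, -0.07,  0.37, -0.05],
         [ 0.05,  0.06, -0.05,  0.34]]
mu = [0.097, 0.087, 0.094, 0.089]          # Table 1(a)

w1 = solve(Sigma, [1.0] * 4)               # Sigma^{-1} 1
wm = solve(Sigma, mu)                      # Sigma^{-1} mu
A, B = sum(w1), sum(wm)
C = sum(m * x for m, x in zip(mu, wm))
Delta = A * C - B * B

assert A > 0        # Sigma is positive definite (diagonally dominant)
assert Delta > 0    # Cauchy-Schwarz; mu is not proportional to 1
```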


Table 2 Risk-sensitive estimation φ_t^* and expected reward γ_t^*

          Pess. & Nec.   Opti. & Poss.
γ_t^*     0.0873428      0.0978428
φ_t^*     0.0786323      0.0891323

Fig. 5 Maximum risk-sensitive expected reward φ_t^* and the expected reward γ_t^*

Figure 5 illustrates the maximum risk-sensitive estimation φ_t^* and the expected reward γ_t^* for risk levels δ in Theorem 6.1, and we see the two lines meet and connect at δ_t^+. We find φ_t^* is smaller than γ_t^* because γ_t^* represents the actual expected reward while φ_t^* incorporates the decision maker's risk aversion through his utility.

We discuss a dynamic case with an expiration date T = 20 and a discount rate β = 0.95. From Theorem 6.2, the total maximum of risk-sensitive expected running rewards for Problem (R2) is Φ^R = v_1 = 1.33284, and Fig. 6 illustrates the sequence {v_t} in Theorem 6.2, which is the sub-total maximum of risk-sensitive expected running rewards after time t.

Maximum Risk-sensitive Terminal Rewards: Next we investigate a dynamic case with a terminal time T as in Sect. 7. Hence we need to find expected rewards γ_t^* in the constraint [γ_t^-, γ_t^+] = [0.0935474, 0.0995114]. Now we have κ^λ(1)^2 = 0.0009 > Δ_t/A_t = 0.000137992 for all t = 1, 2, ..., T, and then (72), i.e. κ^λ(1)^2 > ∑_{t=1}^T Δ_t/A_t, holds for T = 1, 2, ..., 6. Finally we check solutions for each T.

(a) Case of T ≥ 4: Inequality condition (76) holds since

κ^λ(1)^2 max_{t∈T} ( (A_t γ_t^+ − B_t)/Δ_t )^2 = 0.426547 < T × 0.136777 = ∑_{t∈T} (A_t (γ_t^+)^2 − 2B_t γ_t^+ + C_t)/Δ_t.


Fig. 6 Sequences {vt } for Example 8.2 (τ = 1)

By Theorem 7.2 we get the expected reward γ_t^* = γ_t^+ = 0.0995114 for all t = 1, 2, ···, T. Then the maximum risk-sensitive terminal reward is Φ^T = 0.375855, 0.472748, 0.569891, 0.667225, ··· respectively for T = 4, 5, 6, 7, ···.

(b) Case of T = 3: From (73) and (74), we have T^* = ∅, θ^* = 20.9899 and γ_t^* = (Δ_t θ^* + B_t)/A_t = 0.0994037 < γ_t^+ = 0.0995114. Therefore we get the maximum risk-sensitive terminal reward Φ^T = 0.27932 at γ_t^* = 0.0994037 for t = 1, 2, 3.

(c) Case of T = 2: Similarly we have T^* = ∅, θ^* = 15.125 and γ_t^* = (Δ_t θ^* + B_t)/A_t = 0.0985944 < γ_t^+ = 0.0995114. Therefore we get the maximum risk-sensitive terminal reward Φ^T = 0.183576 at γ_t^* = 0.0985944 for t = 1, 2.

(d) Case of T = 1: Similarly we have T^* = ∅, θ^* = 9.67831 and γ_t^* = (Δ_t θ^* + B_t)/A_t = 0.0978428 < γ_t^+ = 0.0995114. Therefore we get the maximum risk-sensitive terminal reward Φ^T = 0.0891323 at γ_t^* = 0.0978428 for t = 1.

Figure 7 illustrates these risk-sensitive terminal rewards Φ^T for T = 1, 2, ..., 7. We observe that the risk-sensitive terminal rewards Φ^T attain their maximum at the right-side boundary γ_t^* = γ_t^+ for T ≥ 4, while for T ≤ 3 the maximum is attained at a point γ_t^* inside the interval (γ_t^-, γ_t^+).
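The total running reward Φ^R = v_1 of Theorem 6.2 used in this example (T = 20, β = 0.95) comes from the backward recursion v_t = φ_t^* + β v_{t+1}; a minimal sketch with synthetic φ_t^* values (not the example's estimates) is:

```python
# Backward recursion of Theorem 6.2: v_t = phi_t* + beta * v_{t+1}, v_{T+1} = 0,
# giving Phi^R = v_1. The phi values below are synthetic, chosen only to
# exercise the recursion; they are not the example's estimates.
def running_rewards(phi, beta):
    v, vs = 0.0, []
    for p in reversed(phi):      # t = T, T-1, ..., 1
        v = p + beta * v
        vs.append(v)
    vs.reverse()                 # vs[0] is v_1 = Phi^R
    return vs

vs = running_rewards([1.0, 1.0, 1.0], beta=0.5)
assert abs(vs[0] - 1.75) < 1e-12   # 1 + 0.5*(1 + 0.5*1)
assert abs(vs[-1] - 1.0) < 1e-12   # v_T = phi_T*
```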

9 Conclusion

Using Lemma 2.2, we can incorporate the decision maker's risk-averse attitude into coherent risk measures as weighting for average value-at-risks. As we have seen in Example 8.2, risk-sensitive estimations are approximated by weighted average risks with the best spectrum λ for the utility (82), which is given in (83) and (84),


Fig. 7 Maximum risk-sensitive terminal rewards Φ T at γt∗ (T = 1, 2, . . . , 7)

and the coherent risk measure ρ is likewise given by weighted average risks with the best spectrum ν for the utility (85) in the same manner. If we prepare the constants κ^λ(1) and κ^ν(p) once from κ, λ and ν using Fig. 4, we can calculate the risk-sensitive estimation φ and the coherent risk value ρ immediately from (34) and (35) respectively. This kind of quick risk-sensitive decision making will be applicable to high-speed computing with artificial intelligence reasoning, for example in stock trading, autonomous driving and so on.

References

1. Acerbi, C.: Spectral measures of risk: a coherent representation of subjective risk aversion. J. Bank. Financ. 26, 1505–1518 (2002)
2. Adam, A., Houkari, M., Laurent, J.-P.: Spectral risk measures and portfolio selection. J. Bank. Financ. 32, 1870–1882 (2008)
3. Arrow, K.J.: Essays in the Theory of Risk-Bearing. Markham, Chicago (1971)
4. Artzner, P., Delbaen, F., Eber, J.-M., Heath, D.: Coherent measures of risk. Math. Financ. 9, 203–228 (1999)
5. Bäuerle, N., Rieder, U.: More risk-sensitive Markov decision processes. Math. Oper. Res. 39, 105–120 (2014)
6. Dellacherie, C.: Quelques commentaires sur les prolongements de capacités. Séminaire de Probabilités 1969/1970, Strasbourg, LNM, vol. 191, pp. 77–81. Springer (1971)
7. López-Díaz, M., Gil, M.A., Ralescu, D.A.: Overview on the development of fuzzy random variables. Fuzzy Sets Syst. 147, 2546–2557 (2006)


8. Fortemps, P., Roubens, M.: Ranking and defuzzification methods based on area compensation. Fuzzy Sets Syst. 82, 319–330 (1996)
9. Howard, R., Matheson, J.: Risk-sensitive Markov decision processes. Manage. Sci. 18, 356–369 (1972)
10. Jorion, P.: Value at Risk: The New Benchmark for Managing Financial Risk. McGraw-Hill, New York (2006)
11. Kruse, R., Meyer, K.D.: Statistics with Vague Data. Riedel Publ. Co., Dordrecht (1987)
12. Kusuoka, S.: On law-invariant coherent risk measures. Adv. Math. Econ. 3, 83–95 (2001)
13. Kwakernaak, H.: Fuzzy random variables—I. Definitions and theorems. Inform. Sci. 15, 1–29 (1978)
14. Puri, M.L., Ralescu, D.A.: Fuzzy random variables. J. Math. Anal. Appl. 114, 409–422 (1986)
15. Renneberg, D.: Non Additive Measure and Integral. Kluwer Academic Publ., Dordrecht (1994)
16. Rockafellar, R.T., Uryasev, S.: Optimization of conditional value-at-risk. J. Risk 2, 21–41 (2000)
17. Tasche, D.: Expected shortfall and beyond. J. Bank. Financ. 26, 1519–1533 (2002)
18. Yoshida, Y.: The valuation of European options in uncertain environment. Europ. J. Oper. Res. 145, 221–229 (2003)
19. Yoshida, Y., Yasuda, M., Nakagami, J., Kurano, M.: A new evaluation of mean value for fuzzy numbers and its application to American put option under uncertainty. Fuzzy Sets Syst. 160, 3250–3262 (2006)
20. Yoshida, Y.: Mean values, measurement of fuzziness and variance of fuzzy random variables for fuzzy optimization. In: Proceedings of SCIS & ISIS 2006, Tokyo, pp. 2277–2282 (2006)
21. Yoshida, Y.: A risk-minimizing model under uncertainty in portfolio. In: Modeling Decisions for Artificial Intelligence—MDAI 2007, LNAI, vol. 4529, pp. 295–306. Springer (2007)
22. Yoshida, Y.: Fuzzy extension of estimations with randomness: the perception-based approach. In: IFSA 2007, LNAI, vol. 4617, pp. 381–391. Springer (2007)
23. Yoshida, Y.: Perception-based estimations of fuzzy random variables: linearity and convexity. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 16(suppl.), 71–87 (2008)
24. Yoshida, Y.: An estimation model of value-at-risk portfolio under uncertainty. Fuzzy Sets Syst. 160, 3250–3262 (2009)
25. Yoshida, Y.: A perception-based portfolio under uncertainty: minimization of average rates of falling. In: Modeling Decisions for Artificial Intelligence—MDAI 2009, LNAI, vol. 5861, pp. 149–160. Springer (2009)
26. Yoshida, Y.: An ordered weighted average with a truncation weight on intervals. In: Modeling Decisions for Artificial Intelligence—MDAI 2012, LNAI, vol. 7647, pp. 45–55. Springer (2012)
27. Yoshida, Y.: Aggregation of dynamic risk measures in financial management. In: Modeling Decisions for Artificial Intelligence—MDAI 2014, LNAI, vol. 8825, pp. 38–49. Springer (2014)
28. Yoshida, Y.: Maximization of returns under an average value-at-risk constraint in fuzzy asset management. Procedia Comput. Sci. 112, 11–20 (2017)
29. Yoshida, Y.: Coherent risk measures derived from utility functions. In: Modeling Decisions for Artificial Intelligence—MDAI 2018, LNAI, vol. 11144, pp. 15–26. Springer (2018)
30. Yoshida, Y.: Risk-sensitive Markov decision processes with risk constraints of coherent risk measures in fuzzy and stochastic environment. In: Proceedings of IJCCI 2019, pp. 269–277. Science and Technology Publication (2019)
31. Yoshida, Y.: Portfolio optimization with perception-based risk measures in dynamic fuzzy asset management. Granul. Comput. 4, 615–627 (2019)
32. Yoshida, Y.: Portfolio optimization in fuzzy asset management with coherent risk measures derived from risk averse utility. Neural Comput. Appl. 32, 10847–10857 (2020)
33. Yoshida, Y.: Dynamic risk-sensitive fuzzy asset management with coherent risk measures derived from decision maker's utility. Granul. Comput. 6, 19–35 (2021). https://doi.org/10.1007/s41066-019-00196-0
34. Zadeh, L.A.: Fuzzy sets. Inform. Control 8, 338–353 (1965)

Correlation Analysis Via Intuitionistic Fuzzy Modal and Aggregation Operators

Alex Bertei, Renata H. S. Reiser, and Luciana Foss

Abstract The measurement of the correlation between two Atanassov's intuitionistic fuzzy sets (A-IFS) plays an essential role in A-IFS theory, formalized by the correlation coefficient (A-CC) between two A-IFS. The correlation coefficient of A-IFS is one of the most applied indices, widely used in many research fields such as clustering analysis, decision making, digital image processing, medical diagnosis and also pattern recognition. This paper aims to study the A-CC between two A-IFS obtained as an image of modal operators and intuitionistic fuzzy t-norms and t-conorms. We present algebraic expressions of the correlation coefficient relationship by considering intuitionistic fuzzy t-norms, t-conorms and modal operators. The actions of necessity and possibility modal operators along with intuitionistic fuzzy t-norms and t-conorms are investigated by verifying the conditions under which the A-CC preserves the main properties related to conjugate and complement operations performed on A-IFS.

Keywords Correlation coefficient · Modal operators · Intuitionistic fuzzy sets · Fuzzy logic · Intuitionistic fuzzy t-norms and t-conorms

A. Bertei (B) · R. H. S. Reiser · L. Foss Laboratory of Ubiquitous and Parallel Systems (LUPS), Centre for Technological Development (CDTEC), Federal University of Pelotas (UFPEL), Gomes Carneiro 1, Pelotas, RS 96010610, Brazil e-mail: [email protected] R. H. S. Reiser e-mail: [email protected] L. Foss e-mail: [email protected] © Springer Nature Switzerland AG 2021 J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_7


1 Introduction

The fuzzy set (FS) theory has been widely used in many fields of modern society since it was proposed by Zadeh in [58]. The traditional fuzzy set faces a specific limit as it fails to present a comprehensive description of all information from investigated problems. In other words, in FS theory only membership values, lying in the unit interval between 0 and 1, are assigned to each element. In many situations, however, hesitation information is present and the non-membership fuzzy degree does not coincide with the complement of its membership degree value. Atanassov's Intuitionistic Fuzzy Sets (A-IFS) [2] are used to better explore the situations where there is hesitancy in the information describing the membership and non-membership fuzzy degrees of elements in FS. In this context, the concept of an A-IFS considers the information of both the membership degree and the non-membership degree, which are not complementary fuzzy values. Further, Atanassov and Gargov [3] presented the concept of Atanassov's Interval-valued Intuitionistic Fuzzy Sets (A-IVIFS), denoting the membership degree and the non-membership degree by closed subintervals of the unit interval [0, 1] and effectively extending the A-IFS's capability of handling not only uncertain but also imprecise information resulting from fuzzy system computations.

The A-IFS theoretical approach underlies a great and relevant number of studies. See, e.g., the Atanassov's intuitionistic fuzzy index of an element interpreting the hesitation between the membership and non-membership degree according to the following distinct aspects: (i) relating similarity measures in A-IFS to analyze the consensus of expert preferences in group decision making (GDM) [57]; (ii) dealing with similarity measures to indicate the similarity degrees in A-IFS [49]; and (iii) analyzing the A-IFS entropy and describing related fuzziness degrees [50].
All of them are closely connected with the Atanassov's correlation coefficient (A-CC) [7–9, 30, 31, 33, 38, 39] performed on A-IFS, which can express whole expert systems in fuzzy reasoning, mainly those applied to decision-making processes.

1.1 The Relevance for Contextualizing A-CC

It is well known that the conventional correlation analysis using probabilities and statistics is inadequate to handle uncertainty, frequently leading to failure in data modeling [37]. Moreover, correlation [11, 14, 15, 21, 27] is one of the key features of data analysis for some fuzzy sets since it is also intuitively clear, and many attempts have been undertaken to formalize it. In addition, the methodology to obtain the correlation coefficient performed on variables in A-IFS is also a challenge in research areas considering entropy, similarity and bisimilarity measures, also including the classical statistical theory.


Results from [51] limit such analysis to an essential type of correlation coefficient, namely the Pearson correlation coefficient, conceived as a strict association between two variables and defined as a linear relationship of their FS. Thus, the Pearson correlation coefficient is equal to:

(i) +1, in the case of a positive linear relationship increasing their variable values;
(ii) 0, indicating no association between the two variables;
(iii) −1, in the case of a negative (decreasing) linear relationship; and
(iv) any value between −1 and +1 in remaining cases.

The closer the coefficient is to either +1 or −1, the stronger the (positive or negative, respectively) correlation of the related variables is. The measurement of the correlation between two A-IFS plays an essential role in A-IFS theory, formalized by the correlation coefficient (A-CC) performed over two Atanassov's intuitionistic fuzzy sets. Therefore, analysing and applying an effective A-CC formula is a relevant research topic, since it constitutes one of the most applied indices widely used in many technological and theoretical research fields. See the new results from clustering analysis [33, 42], decision making [6, 14, 34, 36, 54, 56], digital image processing [23, 53], medical diagnosis [27, 52] and also pattern recognition. Chiang and Lin [15] discussed Pearson's correlation for fuzzy data based on conventional statistics and derived a formula for the correlation coefficient of fuzzy sets. Liu and Xu [32] studied the correlation coefficient of fuzzy numbers by using a mathematical programming approach based on the classical definition of correlation coefficients. Later, Hong and collaborators [26] researched the correlation coefficient of fuzzy numbers by applying T_W (the weakest t-norm) based algebraic operations. Furthermore, Gerstenkorn and Mańko [22] defined the correlation coefficient of IFSs, and Hong and Hwang [25] defined correlation coefficients in probability spaces. Mitchell [35] proposed the correlation of A-IFS by interpreting an A-IFS as an ensemble of ordinary fuzzy sets. Hung and Wu [28, 29] developed a method to calculate the correlation coefficients of A-IFS by means of "centroid".
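For reference, the classical Pearson coefficient described above can be sketched directly; the A-CC of [51] applies the same construction separately to the membership, non-membership and hesitation-margin terms:

```python
# Pearson correlation coefficient r(x, y) in [-1, +1]. The A-CC of [51] uses
# this construction on each of the three A-IFS terms (membership,
# non-membership, hesitation margin) separately.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

assert abs(pearson([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-12   # increasing linear
assert abs(pearson([1, 2, 3], [6, 4, 2]) + 1.0) < 1e-12   # decreasing linear
assert abs(pearson([1, 2, 1, 2], [5, 5, 7, 7])) < 1e-12   # no association
```

The three assertions match cases (i), (iii) and (ii) of the list above.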

1.2 Main Contribution

This article mainly focuses on intuitionistic fuzzy modal (A-IFM) operators, which were introduced by Atanassov [2] and have been systematically studied by different authors [5, 17, 19]. Some algebraic and characteristic properties of these operators were also examined and discussed [7, 8, 20, 24]. Extending the results presented in [7], this article studies the A-CC relating to level modal operators and aggregations such as intuitionistic fuzzy t-norms and t-conorms. Based on their analytical expressions, fuzzy data analysis for classification in prediction and diagnosis in GDM problems is also considered.

166

A. Bertei et al.

This paper differs from other research by considering only A-CC data values. Thus, we are mainly interested in expressing the A-CC obtained from the three following operator classes:

(i) linguistic modifiers provided by decision-makers and performed on A-IFS A and B:
    (a) necessity and possibility modal operators: □A and ♦A;
    (b) α-level modal operators: K_α A, L_α A, !A and ?A.
(ii) aggregation operators performing union and intersection:
    (a) triangular norms: AB_{T_I};
    (b) triangular conorms: AB_{S_I}.
(iii) compositions performed on the previous operator classes:
    (a) □S_I(A, B), ♦S_I(A, B); and
    (b) □T_I(A, B), ♦T_I(A, B).

Additionally, conjugate operators performed over the class of α-level modal operators are also considered, as a method to obtain other α-level modal operators preserving these main properties. The paper also discusses the dual constructions, considering the intuitionistic fuzzy extension of the standard negation.

1.3 Paper Outline

This paper is organized as follows: Sect. 2 considers the related works presenting brief comparisons performing A-CC in A-IFS. Section 3 states the foundations on Atanassov's intuitionistic fuzzy logic (A-IFL), reporting main concepts of level modal operations and including the action of automorphisms and negation operators in order to obtain conjugate and complement of A-IFS and intuitionistic fuzzy t-norms and t-conorms. Section 4 brings the main concepts of the correlation coefficient from A-IFL. Section 5 shows dual and conjugate operators which are preserved by α-level modal operators as necessity and possibility operators. In Sect. 6, this study includes the main results based on A-CC obtained by modal operators. New results in Sect. 8 show the A-CC between intuitionistic fuzzy t-(co)norms and α-level modal operators. Finally, conclusions and further work are reported in Sect. 9.

2 Related Works

In [51], new results extending previous works [46] on a correlation coefficient of A-IFS are presented. The coefficient discussed, like Pearson's coefficient between crisp sets, measures how strong a relationship between A-IFS is, and indicates if the sets


are positively or negatively correlated. It is worth stressing the point that all three terms describing A-IFS (the membership, non-membership values and hesitation margins) were taken into account, and the correlation coefficient is considered with respect to each of these three terms. Moreover, an extended numerical analysis of the well-known benchmarks of the Pima Indian Diabetes Database and Iris data is provided.

The study of the correlation between IFS obtained as the image of intuitionistic fuzzy connectives, such as t-norms and t-conorms, is presented in [39]. The proposed method considers the action of automorphisms and the class of strong fuzzy negations in order to verify the conditions under which the correlation coefficient related to such fuzzy connectives and their corresponding conjugate and dual constructions are obtained.

The notion of trapezoidal fuzzy intuitionistic fuzzy sets (TzFIFS) and related arithmetic operations are discussed in [41]. A multiattribute decision making (MADM) model was proposed based on the correlation coefficient of TzFIFS for ranking the alternatives together with weighted averaging (WA) and weighted geometric (WG) operators. These promote a new MADM model performing the correlation coefficient by ranking alternatives of TzFIFS.

In [43], a geometrical interpretation of correlation coefficients for picture fuzzy sets is proposed, as a new extension of the correlation of Atanassov's IFSs [22] considering positive, neutral, negative and also the refusal degrees related to the membership functions. The proposed weighted correlation coefficient is used to calculate the degree of correlation between the picture fuzzy sets, aiming at clustering different objects under picture fuzzy environments.

In [9], a correlation between A-IFS obtained as an image of strong intuitionistic fuzzy negations is presented considering the action of strong fuzzy negations. This approach enables the verification of the conditions under which the correlation coefficient related to such A-IFS and their corresponding conjugate constructions are obtained. Moreover, algebraic expressions of the correlation relationship are discussed by considering representable automorphisms.

The membership degree and the non-membership degree of the A-IFS are considered in [31], proposing a new approach to measuring the correlation degree of IFS. The method not only reflects the symbol attribute of an A-CC degree, but also preserves the integrity of the related A-IFS.

A new measure for calculating the correlation coefficients between A-IFS and a proof of their desirable axiomatic properties were developed in [56]. An algorithm for multi-attribute DM is presented. The algorithm first defines the intuitionistic fuzzy ideal solution (IFIS) and the intuitionistic fuzzy negative ideal solution (IFNIS). Then, based on the IFIS, the IFNIS and the developed correlation coefficient, the algorithm gives a simple and straightforward method to rank and select the given alternatives.

In [21], a novel weighted correlation coefficient formulation is proposed to measure the relationship between two Pythagorean fuzzy sets (PFS). An MCDM problem involving multiple criteria over which knowledge is incomplete and imprecise is studied in [44]. This proposal reviews and refines the TOPSIS method using correlation coefficients in order to deal with MCDM problems which have uncertainties and DMs' preferences. Intuitionistic fuzzy weighted averaging (IFWA) operators are used, aggregating each DM's opinions and evaluating


the relevance of alternatives and criteria. Then positive/negative-ideal solutions are calculated.

The theory of neutrosophic sets is a powerful technique to handle incomplete, indeterminate and inconsistent information in the real world [45]. A correlation coefficient between dynamic single-valued neutrosophic multisets (DSVNM) and a weighted correlation coefficient between DSVNM are presented in [55], measuring the correlation degrees between DSVNM, and their properties are investigated.

Some existing correlation coefficients fail to meet desirable properties in the IFS theory. In order to solve this problem, Huang [27] introduced an improved correlation coefficient of the IFS. The main properties of this correlation coefficient are then discussed, introducing a generalization of the coefficient to A-IVIFS.

The Choquet integral is used in [38] to propose an intuitionistic fuzzy Choquet integral correlation coefficient as a new extension of the existing intuitionistic fuzzy correlation coefficient. An evaluation formula of the intuitionistic fuzzy Choquet integral correlation coefficient between an alternative and the ideal alternative is proposed. The considered alternatives can be ranked and the most desirable one is selected.

The main results in [8] extend studies in [9], analyzing the A-CC obtained as the image of level modal operators. The actions of the necessity and the possibility are considered, verifying under which conditions a correlation coefficient preserves the main properties related to A-IFS. Paper [7] extends results from [8], where the action of the A-CC over necessity and possibility modal operators is considered, determining the A-CC of A-IFS obtained as the image of the !A and ?A level modal operators and discussing the main conditions under which the main properties related to such fuzzy sets are preserved by conjugate and complement operations. In addition, a simulation based on the proposed methodology using α-level modal operators is applied to a medical diagnosis analysis.

3 Preliminary

Firstly, a brief account of A-IFS is stated. Consider a non-empty and finite universe U = {x_1, …, x_n}; the unit interval [0, 1] is also denoted by U. According to [11], an Atanassov intuitionistic fuzzy set A (A-IFS) based on U is expressed as

A = {(x, μ_A(x), ν_A(x)) : x ∈ U}   (1)

whenever the membership and non-membership functions μ_A, ν_A : U → U are related by the inequality μ_A(x_i) + ν_A(x_i) ≤ 1, for all i ∈ N_n = {1, 2, …, n}. The intuitionistic fuzzy index (IFIx), or hesitancy degree, of an A-IFS A is given as

π_A(x_i) = 1 − μ_A(x_i) − ν_A(x_i).   (2)
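As an illustrative sketch (not part of the paper), an A-IFS over a finite universe can be stored as a list of (membership, non-membership) pairs, with the hesitancy of Eq. (2) computed per element; the degrees below are hypothetical.

```python
# Illustrative sketch (not from the paper): an A-IFS over a finite universe,
# stored as (membership, non-membership) pairs satisfying mu + nu <= 1.

def hesitancy(pair):
    """Intuitionistic fuzzy index pi_A(x) = 1 - mu_A(x) - nu_A(x), Eq. (2)."""
    mu, nu = pair
    assert 0.0 <= mu <= 1.0 and 0.0 <= nu <= 1.0 and mu + nu <= 1.0
    return 1.0 - mu - nu

A = [(0.6, 0.3), (0.2, 0.5), (0.9, 0.0)]    # hypothetical degrees
print([round(hesitancy(p), 2) for p in A])  # -> [0.1, 0.3, 0.1]
```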

The set of all such A-IFS is denoted by C(A). Let Ũ = {x̃_i = (x_{i1}, x_{i2}) ∈ U² : x_{i1} + x_{i2} ≤ 1} be the set of all intuitionistic fuzzy values, such that x̃_i is the pair of membership and non-membership degrees of an element x_i ∈ U, i.e. (x_{i1}, x_{i2}) = (μ_A(x_i), ν_A(x_i)). The related IFIx is given as π_A(x_i) = x_{i3} = 1 − x_{i1} − x_{i2}, for all i ∈ N_n = {1, 2, …, n}. The projections l_{Ũⁿ}, r_{Ũⁿ} : Ũⁿ → Uⁿ are given by:

l_{Ũⁿ}(x̃_1, x̃_2, …, x̃_n) = (x_{11}, x_{21}, …, x_{n1})   (3)
r_{Ũⁿ}(x̃_1, x̃_2, …, x̃_n) = (x_{12}, x_{22}, …, x_{n2})   (4)

The order relation ≤_Ũ on Ũ is defined as: x̃ ≤_Ũ ỹ ⇔ x_1 ≤ y_1 and x_2 ≥ y_2. Moreover, 0̃ = (0, 1) ≤_Ũ x̃ ≤_Ũ (1, 0) = 1̃, for all x̃ ∈ Ũ.

3.1 Intuitionistic Fuzzy Negations

Intuitionistic fuzzy negations and intuitionistic automorphisms are studied in the following; see more details in [10]. An intuitionistic fuzzy negation (A-IFN) N_I : Ũ → Ũ is a function verifying:

N_I1: N_I(0̃) = N_I(0, 1) = 1̃ and N_I(1̃) = N_I(1, 0) = 0̃;
N_I2: if x̃ ≥_Ũ ỹ then N_I(x̃) ≤_Ũ N_I(ỹ), for all x̃, ỹ ∈ Ũ.

In [13], if an A-IFN N_I also satisfies the involutive property

N_I3: N_I(N_I(x̃)) = x̃, for all x̃ ∈ Ũ,

then N_I is called a strong A-IFN. According to [18, Theorem 3.6], N_I is a strong A-IFN iff there exists a strong fuzzy negation N on U such that:

N_I(x_1, x_2) = (N(N_S(x_2)), N_S(N(x_1))).   (5)

Thus, N_I is an example of an N-representable IFN. Moreover, if N is the standard fuzzy negation (N(x) = N_S(x) = 1 − x), Eq. (5) reduces to

N_{S_I}(x̃) = N_{S_I}(x_1, x_2) = (x_2, x_1).   (6)

By [12], the complement of an IFS A w.r.t. N_I in (5) is given as

A^{N_I} = {(x, N_I(μ_A(x), ν_A(x))) : x ∈ U}.   (7)

When N = N_S in Eq. (5), the complement of an IFS A w.r.t. N_{S_I} is expressed as

\overline{A} = {(x, ν_A(x), μ_A(x)) : x ∈ U}.   (8)

The function f_{N_I} : Ũⁿ → Ũ is the N_I-dual operator of f : Ũⁿ → Ũ, given as follows:

f_{N_I}(x̃_1, …, x̃_n) = N_I(f(N_I(x̃_1), …, N_I(x̃_n))).   (9)

For further information, see [2–4].

Proposition 1 [40, Proposition 8] Let N_1 and N_2 be two fuzzy negations such that N_1 ≤ N_2. Then the function N_{12} : Ũ → Ũ defined by

N_{12}(x̃) = (N_1(N_S(x_2)), N_S(N_2(x_1)))   (10)

is an IFN.

3.2 Intuitionistic Fuzzy T-norms and T-conorms

According to [13], a function T_I : Ũ² → Ũ is called an intuitionistic fuzzy triangular norm (IFT) if it is a commutative, associative and increasing (in both arguments) operation with neutral element 1̃. A function S_I : Ũ² → Ũ is called an intuitionistic fuzzy triangular conorm (IFS) if it is a commutative, associative and increasing (in both arguments) operation with neutral element 0̃.

Proposition 2 [16] Let T (S) : U² → U be a fuzzy t-(co)norm such that T(x, y) ≤ N_S(S(N_S(x), N_S(y))). The function T_I (S_I) : Ũ² → Ũ, given by

T_I(x̃, ỹ) = (T(x_1, y_1), S(x_2, y_2))  (S_I(x̃, ỹ) = (S(x_1, y_1), T(x_2, y_2)))   (11)

for all x̃ = (x_1, x_2), ỹ = (y_1, y_2) ∈ Ũ, is an IFT (IFS). By Proposition 2, T_I (S_I) is a t-representable IFT (IFS).

Definition 1 [39, Proposition 5.4] When T_I (S_I) is a t-representable IFT (IFS) related to a pair (T, S) ((S, T)), the intuitionistic fuzzy set AB^{T_I} (AB^{S_I}) is defined as:

AB^{T_I} = {(x, T(x_1, y_1), S(x_2, y_2)) : x ∈ U},   (12)
(AB^{S_I} = {(x, S(x_1, y_1), T(x_2, y_2)) : x ∈ U})   (13)

whenever the corresponding condition holds: T(x_1, y_1) ≤ N_S(S(x_2, y_2)) (and S(x_1, y_1) ≤ N_S(T(x_2, y_2))).

Proposition 3 [39, Proposition 3.9] Let T_I (S_I) be a t-representable IFT (IFS) and N_I be a strong N-representable negation on Ũ, according to Eqs. (11) and (5),

respectively. For all x̃, ỹ ∈ Ũ, (T_I)_{N_I} ((S_I)_{N_I}) : Ũ² → Ũ is an IFT (IFS) given as follows:

(T_I)_{N_I}(x̃, ỹ) = (S_{N_S∘N}(x_1, y_1), T_{N∘N_S}(x_2, y_2)),
(S_I)_{N_I}(x̃, ỹ) = (T_{N_S∘N}(x_1, y_1), S_{N∘N_S}(x_2, y_2)).   (14)

Corollary 1 [39, Corollary 3.10] For all x̃, ỹ ∈ Ũ, when N = N_S the following holds:

(T_I)_{N_I}(x̃, ỹ) = S_I(x̃, ỹ) and (S_I)_{N_I}(x̃, ỹ) = T_I(x̃, ỹ).   (15)
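A minimal sketch of the t-representable construction of Proposition 2, assuming the standard choices T = min and S = max (which satisfy the required compatibility condition); none of the identifiers below come from the paper.

```python
# Sketch of a t-representable IFT/IFS pair (Eq. (11)), assuming T = min, S = max.

def t_norm(x, y):     # T = min (assumed choice)
    return min(x, y)

def t_conorm(x, y):   # S = max (assumed choice)
    return max(x, y)

def TI(a, b):
    """T_I((x1,x2),(y1,y2)) = (T(x1,y1), S(x2,y2))."""
    return (t_norm(a[0], b[0]), t_conorm(a[1], b[1]))

def SI(a, b):
    """S_I((x1,x2),(y1,y2)) = (S(x1,y1), T(x2,y2))."""
    return (t_conorm(a[0], b[0]), t_norm(a[1], b[1]))

a, b = (0.6, 0.3), (0.4, 0.5)
print(TI(a, b))  # -> (0.4, 0.5)
print(SI(a, b))  # -> (0.6, 0.3)
```

With min/max the resulting pairs always remain valid intuitionistic fuzzy values, since min(x1, y1) + max(x2, y2) ≤ 1 whenever both arguments satisfy the A-IFS constraint.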

3.3 Intuitionistic Fuzzy Modal Operators

Following [1], a pair of operators is considered over the IFSs, transforming an IFS into a fuzzy set (FS). These two operators are similar to the logical operators of necessity (□) and possibility (♦), and their properties resemble those of modal logic. Adverbial locutions such as "very or absolutely" and "more or less" are interpreted as linguistic modifiers, respectively named the necessity and possibility operators. The former formalizes an optimistic viewpoint and should be suitable for risk-taking decision makers; in contrast, the latter formalizes a pessimistic viewpoint, mainly related to decision makers with risk-averse behavior. Thus, such dual constructors modify the evaluation of the linguistic Boolean truth values "true" and "false" [2]. In logical deduction processes, the analytic representation of such expressions plays an important role, and the A-CC analysis is able to identify closely correlated A-IFS. Relevant properties of the dual operators are reported below.

Definition 2 [4, Def. 1.41] Let A be an A-IFS. The related □A-IFS and ♦A-IFS obtained by the necessity and possibility modal operators are, respectively, given as

□A = {(x, x_1, 1 − x_1) : x ∈ U};   (16)
♦A = {(x, 1 − x_2, x_2) : x ∈ U}.   (17)

Obviously, by Definition 2, when A is an ordinary fuzzy set then A = □A = ♦A.

Proposition 4 [4, Prop. 1.42] For every A-IFS A, the following properties are verified:

□\overline{A} = \overline{♦A},  ♦\overline{A} = \overline{□A},   (18)
□□A = □A,  □♦A = ♦A,   (19)
♦□A = □A,  ♦♦A = ♦A.   (20)

Theorem 1 [4, Theorem 1.43] For every two A-IFS A and B:

□(A ∩ B) = □A ∩ □B,   (21)
□(A ∪ B) = □A ∪ □B,   (22)
♦(A ∩ B) = ♦A ∩ ♦B,   (23)
♦(A ∪ B) = ♦A ∪ ♦B.   (24)

Corollary 2 Let A and B be A-IFS on the universe U and, for each x_i ∈ U, consider (μ_A(x), ν_A(x)) = (x_1, x_2) and (μ_B(x), ν_B(x)) = (y_1, y_2). Then:

□(AB^{S_I}) = {(x, S(x_1, y_1), 1 − S(x_1, y_1)) : x ∈ U},   (25)
♦(AB^{T_I}) = {(x, 1 − S(x_2, y_2), S(x_2, y_2)) : x ∈ U},   (26)
\overline{□(AB^{S_I})} = {(x, 1 − S(x_1, y_1), S(x_1, y_1)) : x ∈ U},   (27)
□(AB^{T_I}) = {(x, T(x_1, y_1), 1 − T(x_1, y_1)) : x ∈ U}.   (28)

Proof Straightforward from Definitions 2 and 1.
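The necessity and possibility operators of Definition 2 are directly computable; the sketch below (not from the paper) checks the idempotence-style laws of Proposition 4 numerically. Dyadic values are used so that the floating-point comparisons are exact.

```python
# Necessity (box) and possibility (diamond) of Definition 2, with numeric
# checks of Proposition 4's laws (illustrative sketch).

def nec(A):   # box A = {(mu, 1 - mu)}
    return [(mu, 1.0 - mu) for mu, _ in A]

def poss(A):  # diamond A = {(1 - nu, nu)}
    return [(1.0 - nu, nu) for _, nu in A]

A = [(0.5, 0.25), (0.125, 0.75)]       # dyadic degrees: exact float arithmetic
assert nec(nec(A)) == nec(A)           # box box A  = box A      (Eq. (19))
assert poss(poss(A)) == poss(A)        # dia dia A  = dia A      (Eq. (20))
assert nec(poss(A)) == poss(A)         # box dia A  = dia A      (Eq. (19))
assert poss(nec(A)) == nec(A)          # dia box A  = box A      (Eq. (20))
print(nec(A), poss(A))
```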

3.4 Intuitionistic Fuzzy α-level Modal Operators

Let α ∈ [0, 1]. The operators K_α, L_α : Ũ → Ũ are defined as follows:

K_α(x_1, x_2) = (max(α, x_1), min(α, x_2));   (29)
L_α(x_1, x_2) = (min(α, x_1), max(α, x_2)).   (30)

Further, their corresponding A-IFS are given in the following.

Definition 3 [4, Def. 1.99] Let A be an A-IFS and α ∈ [0, 1]. The K_α A-IFS and L_α A-IFS are respectively given as

K_α A = {(x, max(α, x_1), min(α, x_2)) : x ∈ U};   (31)
L_α A = {(x, min(α, x_1), max(α, x_2)) : x ∈ U}.   (32)

From the complementary relation establishing \overline{K_α \overline{A}} = L_α A, the components of the pair (K_α, L_α) are called N_{S_I}-dual operators. Moreover, when α = 1/2, we use the notation ! ≡ K_{1/2} and ? ≡ L_{1/2}; the related !A-IFS and ?A-IFS are given in the next definition.

Definition 4 [4, Def. 1.96] Let A be an A-IFS. The two related α-level modal operators !A-IFS and ?A-IFS are respectively given as follows:

!A = {(x, max(1/2, x_1), min(1/2, x_2)) : x ∈ U};   (33)
?A = {(x, min(1/2, x_1), max(1/2, x_2)) : x ∈ U}.   (34)

Theorem 2 [4, Theorem 1.97] Let A and B be IFSs. The following properties hold:

\overline{!\overline{A}} = ?A,  !?A = ?!A,   (35)
!(A ∩ B) = !A ∩ !B,  !(A ∪ B) = !A ∪ !B,   (36)
?(A ∩ B) = ?A ∩ ?B,  ?(A ∪ B) = ?A ∪ ?B.   (37)

Theorem 3 [4, Theorem 1.98] For every IFS A, the following properties are verified:

□!A = !□A,  □?A = ?□A,   (38)
♦!A = !♦A,  ♦?A = ?♦A.   (39)
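The α-level operators of Eqs. (29)-(30) can be sketched directly; with α = 1/2 they yield the ! and ? operators, and the commutation law !?A = ?!A of Theorem 2 can be spot-checked (illustrative code, not from the paper).

```python
# alpha-level modal operators K_alpha and L_alpha (Eqs. (29)-(30)); with
# alpha = 1/2 these are ! and ? (Definition 4). Illustrative sketch.

def K(alpha, pair):
    x1, x2 = pair
    return (max(alpha, x1), min(alpha, x2))

def L(alpha, pair):
    x1, x2 = pair
    return (min(alpha, x1), max(alpha, x2))

bang = lambda p: K(0.5, p)    # ! = K_{1/2}
query = lambda p: L(0.5, p)   # ? = L_{1/2}

for p in [(0.25, 0.5), (0.2, 0.7), (0.75, 0.25)]:
    assert bang(query(p)) == query(bang(p))   # !?A = ?!A (Theorem 2)

print(bang((0.25, 0.5)), query((0.25, 0.5)))  # -> (0.5, 0.5) (0.25, 0.5)
```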

3.5 Action of Conjugate Operators on Aggregation Operators

In [40, Def. 1], a function Φ : Ũ → Ũ is an intuitionistic fuzzy automorphism (A-IFA) on Ũ if Φ is a bijective function such that x̃ ≤_Ũ ỹ ⇔ Φ(x̃) ≤_Ũ Φ(ỹ). Aut(Ũ) denotes the set of all A-IFA, extending the notion of a fuzzy automorphism φ : U → U in Aut(U). The action of Φ : Ũ → Ũ on f_I : Ũⁿ → Ũ is the function f_I^Φ : Ũⁿ → Ũ, called the intuitionistic conjugate of f_I and defined as follows:

f_I^Φ(x̃_1, …, x̃_n) = Φ⁻¹(f_I(Φ(x̃_1), …, Φ(x̃_n))).   (40)

Now, the φ-representability of an A-IFA is reported in the proposition below.

Proposition 5 [40, Prop. 5] Let φ ∈ Aut(U). The φ-representable A-IFA Φ ∈ Aut(Ũ) is defined by the following expression:

Φ(x_1, x_2) = (φ(x_1), 1 − φ(1 − x_2)).   (41)

Proposition 6 Let T_I (S_I) be a t-representable IFT (IFS). If T = S_N (S = T_N) and T^φ = (S^φ)_N (S^φ = (T^φ)_N), then the following property is verified:

T_I^Φ(x̃, ỹ) = (T^φ(x_1, y_1), S^φ(x_2, y_2));   (42)
S_I^Φ(x̃, ỹ) = (S^φ(x_1, y_1), T^φ(x_2, y_2)).   (43)

Proof Let T = S_N and T^φ = (S^φ)_N. For all x̃, ỹ ∈ Ũ, the next result holds:

Φ⁻¹(T_I(Φ(x̃), Φ(ỹ))) = Φ⁻¹(T_I((φ(x_1), 1 − φ(1 − x_2)), (φ(y_1), 1 − φ(1 − y_2))))
 = Φ⁻¹(T(φ(x_1), φ(y_1)), S(1 − φ(1 − x_2), 1 − φ(1 − y_2)))
 = (φ⁻¹(T(φ(x_1), φ(y_1))), 1 − φ⁻¹(T(φ(1 − x_2), φ(1 − y_2))))
 = (T^φ(x_1, y_1), 1 − T^φ(1 − x_2, 1 − y_2))
 = (T^φ(x_1, y_1), (T^φ)_{N_S}(x_2, y_2)) = (T^φ(x_1, y_1), S^φ(x_2, y_2)).

Analogously, Eq. (43) can be proved.

4 Correlation from A-IFL

Using the denotation related to Eqs. (2), (3) and (4):

(μ_A(x_1), μ_A(x_2), …, μ_A(x_n)) = (x_{11}, x_{21}, …, x_{n1});
(ν_A(x_1), ν_A(x_2), …, ν_A(x_n)) = (x_{12}, x_{22}, …, x_{n2});
(π_A(x_1), π_A(x_2), …, π_A(x_n)) = (x_{13}, x_{23}, …, x_{n3});

the two corresponding classes of quasi-arithmetic means are reported below:

(i) the arithmetic means related to an A-IFS A, given as follows:

m(x_{ik}) = \frac{1}{n}\sum_{i=1}^{n} x_{ik},  for k ∈ {1, 2, 3};

(ii) the quadratic means, performed over the difference between each intuitionistic fuzzy value of an A-IFS A and the corresponding arithmetic mean of all its values:

m_2(x_{ik}) = \sum_{i=1}^{n}\Bigl(x_{ik} − \frac{1}{n}\sum_{j=1}^{n} x_{jk}\Bigr)^{2},  for k ∈ {1, 2, 3}.

Thus, the quotient between the products of values obtained by taking sums performed over such classes of quasi-arithmetic means extends the correlation coefficient definition to the Atanassov intuitionistic fuzzy approach.

Definition 5 [47] The A-CC between A and B in C(A) is given as follows:

C(A, B) = \frac{1}{3}(C_1(A, B) + C_2(A, B) + C_3(A, B))   (44)

where, for each k ∈ N_3, the component C_k(A, B) compares the k-th component sequences of A and B:

C_k(A, B) = \frac{\sum_{i=1}^{n}\Bigl(x_{ik} − \frac{1}{n}\sum_{j=1}^{n} x_{jk}\Bigr)\Bigl(y_{ik} − \frac{1}{n}\sum_{j=1}^{n} y_{jk}\Bigr)}{\sqrt{\sum_{i=1}^{n}\Bigl(x_{ik} − \frac{1}{n}\sum_{j=1}^{n} x_{jk}\Bigr)^{2}\; \sum_{i=1}^{n}\Bigl(y_{ik} − \frac{1}{n}\sum_{j=1}^{n} y_{jk}\Bigr)^{2}}},

so that C_1, C_2 and C_3 are obtained for k = 1, 2, 3, respectively. In the notation of Sect. 4, the denominator of C_k(A, B) is \sqrt{m_2(x_{ik})\, m_2(y_{ik})}, which is used in the sequel.

In [47], the correlation coefficient C(A, B) in Eq. (44) considers both factors: (i) the amount of information expressed by the membership and non-membership degrees, captured by C_1(A, B) and C_2(A, B), respectively; and (ii) the reliability of the information, expressed by the hesitation margins in C_3(A, B). Additionally, for fuzzy data these expressions only make sense for A-IFS whose component values vary, avoiding zero in the denominators. Moreover, C(A, B) fulfils the following properties: (i) C(A, B) = C(B, A); (ii) if A = B then C(A, B) = 1; (iii) −1 ≤ C(A, B) ≤ 1.

Proposition 7 [9, Prop. 1] Let N be a strong A-IFN, A and B be A-IFS and \overline{A} and \overline{B} be their corresponding complements. The following holds:

C_1(\overline{A}, \overline{B}) = C_2(A, B);   (45)
C_2(\overline{A}, \overline{B}) = C_1(A, B);   (46)
C_3(\overline{A}, \overline{B}) = C_3(A, B).   (47)

Corollary 3 [9, Corollary 1] Let N be a strong A-IFN, A and B be A-IFS and \overline{A} and \overline{B} be their corresponding complements. The following holds:

C(\overline{A}, \overline{B}) = C(A, B).   (48)
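Definition 5 transcribes directly into code: the A-CC is the mean of three Pearson-style correlations over the membership, non-membership and hesitancy sequences. The sketch below (not from the paper) assumes every component sequence is non-constant, as required by the remark above; the data are hypothetical.

```python
# Direct transcription of Definition 5 (Eq. (44)), assuming non-constant
# component sequences so every denominator is non-zero. Illustrative sketch.
from math import sqrt

def corr(a, b):
    """Pearson-style correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

def acc(A, B):
    """A-CC C(A, B) = (C1 + C2 + C3) / 3 over the three component sequences."""
    c1 = corr([p[0] for p in A], [q[0] for q in B])
    c2 = corr([p[1] for p in A], [q[1] for q in B])
    c3 = corr([1 - p[0] - p[1] for p in A], [1 - q[0] - q[1] for q in B])
    return (c1 + c2 + c3) / 3

A = [(0.6, 0.3), (0.2, 0.5), (0.4, 0.4)]   # hypothetical A-IFS data
B = [(0.5, 0.4), (0.3, 0.6), (0.45, 0.5)]
print(round(acc(A, A), 6))  # property (ii): C(A, A) = 1 -> 1.0
```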

5 Results on Conjugate Modal Operators

In this section, dual and conjugate constructions are shown to be preserved by the α-level, necessity and possibility modal operators.

Proposition 8 Consider a Φ-representable automorphism given by Eq. (41) and the □A-IFS and ♦A-IFS given by Eqs. (16) and (17), respectively. Then, for all x̃ = (x_1, x_2) ∈ Ũ,

(□)^Φ(x̃) = □x̃;   (49)
(♦)^Φ(x̃) = ♦x̃.   (50)

Proposition 9 Consider a Φ-representable automorphism given by Eq. (41) and the K_α-IFS and L_α-IFS given by Eqs. (31) and (32), respectively. For all x̃ = (x_1, x_2) ∈ Ũ, the following holds:

(K_α)^Φ(x̃) = (φ⁻¹(max(α, φ(x_1))), 1 − φ⁻¹(1 − min(α, 1 − φ(1 − x_2))));   (51)
(L_α)^Φ(x̃) = (φ⁻¹(min(α, φ(x_1))), 1 − φ⁻¹(1 − max(α, 1 − φ(1 − x_2)))).   (52)

Proof For all x̃ = (x_1, x_2) ∈ Ũ, we have that:

(K_α)^Φ(x̃) = Φ⁻¹(K_α(Φ(x_1, x_2)))  by Eq. (40)
 = Φ⁻¹(max(α, φ(x_1)), min(α, 1 − φ(1 − x_2)))  by Eq. (41)
 = (φ⁻¹(max(α, φ(x_1))), 1 − φ⁻¹(1 − min(α, 1 − φ(1 − x_2))))  by Eq. (41)

Analogously, Eq. (52) can be proved. Therefore, Proposition 9 is verified.

Corollary 4 Consider a Φ-representable automorphism given by Eq. (41) and the !A-IFS and ?A-IFS given by Eqs. (33) and (34), respectively. For all x̃ = (x_1, x_2) ∈ Ũ, the following holds:

(!)^Φ(x̃) = (φ⁻¹(max(1/2, φ(x_1))), 1 − φ⁻¹(1 − min(1/2, 1 − φ(1 − x_2))));
(?)^Φ(x̃) = (φ⁻¹(min(1/2, φ(x_1))), 1 − φ⁻¹(1 − max(1/2, 1 − φ(1 − x_2)))).

Proposition 10 Consider Φ ∈ Aut(Ũ) and the K_α-IFS and L_α-IFS given by Eqs. (31) and (32), respectively. For all x̃ = (x_1, x_2) ∈ Ũ, the following holds:

(K_α)^Φ(x̃) = N_{S_I}^Φ((L_α)^Φ(N_{S_I}^Φ(x̃))) and (L_α)^Φ(x̃) = N_{S_I}^Φ((K_α)^Φ(N_{S_I}^Φ(x̃))).

Proof For all x̃ ∈ Ũ, the results below are verified:

N_{S_I}^Φ((L_α)^Φ(N_{S_I}^Φ(x̃))) = Φ⁻¹(N_{S_I}(L_α(N_{S_I}(Φ(x_1, x_2)))))  by Eq. (40)
 = Φ⁻¹(N_{S_I}(L_α(1 − φ(1 − x_2), φ(x_1))))  by Eqs. (41) and (6)
 = Φ⁻¹(N_{S_I}(min(α, 1 − φ(1 − x_2)), max(α, φ(x_1))))  by Eq. (30)
 = Φ⁻¹(max(α, φ(x_1)), min(α, 1 − φ(1 − x_2)))  by Eq. (6)
 = (φ⁻¹(max(α, φ(x_1))), 1 − φ⁻¹(1 − min(α, 1 − φ(1 − x_2))))  by Eq. (41)
 = (K_α)^Φ(x̃), by Eq. (51).

Since N_{S_I} is a strong IFN, the other identity can be straightforwardly proved.
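Proposition 9 can be spot-checked numerically. The sketch below (not from the paper) assumes the hypothetical automorphism φ(x) = x², a strictly increasing bijection of [0, 1], and compares the conjugate Φ⁻¹ ∘ K_α ∘ Φ of Eq. (40) against the closed form of Eq. (51).

```python
# Spot-check of Proposition 9 with the assumed automorphism phi(x) = x**2.
from math import sqrt

phi = lambda x: x * x
phinv = sqrt                         # phi^{-1}

def Phi(p):                          # Eq. (41): phi-representable A-IFA
    x1, x2 = p
    return (phi(x1), 1 - phi(1 - x2))

def Phinv(p):                        # inverse of Phi
    x1, x2 = p
    return (phinv(x1), 1 - phinv(1 - x2))

def K(alpha, p):                     # Eq. (29)
    return (max(alpha, p[0]), min(alpha, p[1]))

def K_conj(alpha, p):                # Eq. (40): Phi^{-1} . K_alpha . Phi
    return Phinv(K(alpha, Phi(p)))

alpha, p = 0.5, (0.3, 0.6)
lhs = K_conj(alpha, p)
rhs = (phinv(max(alpha, phi(p[0]))),                          # Eq. (51)
       1 - phinv(1 - min(alpha, 1 - phi(1 - p[1]))))
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
print(tuple(round(v, 4) for v in lhs))  # -> (0.7071, 0.2929)
```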

6 A-CC Results on Modal Operators

This section reports the main results on the A-CC related to the A-IFS, □A-IFS and ♦A-IFS, obtained by the action of dual and conjugate operators on Ũ.

Proposition 11 [8, Proposition IV.1] Let the A-IFS and ♦A-IFS be given by Eqs. (1) and (17). The A-CC between the A-IFS and ♦A-IFS is given as

C(A, ♦A) = \frac{1}{3}(C_1(A, ♦A) + 1),   (53)

whenever the following holds:

C_1(A, ♦A) = \frac{\sum_{i=1}^{n}\bigl(x_{i1} − m(x_{i1})\bigr)\bigl(−x_{i2} + m(x_{i2})\bigr)}{\sqrt{m_2(x_{i1})\, m_2(x_{i2})}}.

Proposition 12 [8, Proposition IV.3] The correlation coefficient between an A-IFS and ♦A-IFS satisfies

C(A, \overline{♦A}) = −C(A, ♦A).   (54)

Proposition 13 [8, Proposition IV.5] Let the A-IFS, □A-IFS and ♦A-IFS be given by Eqs. (1), (16) and (17). The following holds:

C(A, □A) = C(\overline{A}, ♦\overline{A}).   (55)

Proposition 14 [8, Proposition IV.7] Let A be an IFS. The correlation between the IFS A and \overline{□A} is given as

C(A, \overline{□A}) = −C(A, □A).   (56)

Proposition 15 [8, Proposition IV.9] Let the A-IFS, □A-IFS and ♦A-IFS be given by Eqs. (1), (16) and (17), respectively. The following holds:

C(♦A, □A) = \frac{1}{3}(C_1(A, ♦A) + C_2(A, □A)).   (57)

Proposition 16 [8, Proposition IV.11] Let the □A-IFS and ♦A-IFS be given by Eqs. (16) and (17), respectively. The following holds:

C(\overline{♦A}, □A) = \frac{2}{3} C_2(\overline{♦A}, □A),   (58)

whenever the following expression holds:

C_2(\overline{♦A}, □A) = \frac{\sum_{i=1}^{n}\bigl(x_{i2} − m(x_{i2})\bigr)\bigl(x_{i1} − m(x_{i1})\bigr)}{\sqrt{m_2(x_{i2})\, m_2(x_{i1})}}.

Proposition 17 [8, Proposition IV.13] For an A-IFS A, we have that:

C(♦A, □A) = \frac{2}{3} C_1(A, ♦A).   (59)

7 A-CC Results on α-Level Modal Operators

This section studies the extension of the main results on the A-CC based on [7], related to the A-IFS, !A-IFS, ?A-IFS, □A-IFS and ♦A-IFS obtained by the action of dual and conjugate operators on Ũ. For that, consider i ∈ N_n, k ∈ N_3 and the notations below:

α_{ik} = min(1/2, x_{ik}),  β_{ik} = max(1/2, x_{ik}).

Proposition 18 The A-CC between the A-IFS A and the ?A-IFS is given as

C(A, ?A) = \frac{1}{3}(C_1(A, ?A) + C_2(A, ?A) + C_3(A, ?A))   (60)

whenever the following holds:

C_1(A, ?A) = \frac{\sum_{i=1}^{n}\bigl(x_{i1} − m(x_{i1})\bigr)\bigl(α_{i1} − m(α_{i1})\bigr)}{\sqrt{m_2(x_{i1})\, m_2(α_{i1})}};

C_2(A, ?A) = \frac{\sum_{i=1}^{n}\bigl(x_{i2} − m(x_{i2})\bigr)\bigl(β_{i2} − m(β_{i2})\bigr)}{\sqrt{m_2(x_{i2})\, m_2(β_{i2})}};

C_3(A, ?A) = (−1)\frac{\sum_{i=1}^{n}\bigl(x_{i3} − m(x_{i3})\bigr)\bigl(α_{i1} + β_{i2} − m(α_{i1} + β_{i2})\bigr)}{\sqrt{m_2(x_{i3})\, m_2(α_{i1} + β_{i2})}}.

Proof Let the A-IFS A and ?A-IFS be given by Eqs. (1) and (34), respectively. C_1(A, ?A) and C_2(A, ?A) follow from Eqs. (34) and (44). For the hesitation margins, since the IFIx of ?A is 1 − α_{i1} − β_{i2},

C_3(A, ?A) = \frac{\sum_{i=1}^{n}\bigl(x_{i3} − m(x_{i3})\bigr)\bigl(1 − α_{i1} − β_{i2} − m(1 − α_{i1} − β_{i2})\bigr)}{\sqrt{m_2(x_{i3})\, m_2(1 − α_{i1} − β_{i2})}}

and, as 1 − α_{i1} − β_{i2} − m(1 − α_{i1} − β_{i2}) = −(α_{i1} + β_{i2} − m(α_{i1} + β_{i2})), the expression above reduces to the stated one. Therefore, Proposition 18 is verified.
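The first component of Proposition 18 correlates the memberships x_{i1} with α_{i1} = min(1/2, x_{i1}); a quick numeric sketch (not from the paper, data hypothetical):

```python
# Spot-check of C1(A, ?A): correlation between memberships and their
# 1/2-truncations alpha_i1 = min(1/2, x_i1). Illustrative sketch.
from math import sqrt

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return num / sqrt(sum((x - ma) ** 2 for x in a) *
                      sum((y - mb) ** 2 for y in b))

mu = [0.8, 0.3, 0.6, 0.1]           # hypothetical membership degrees
alpha = [min(0.5, m) for m in mu]   # memberships of ?A
print(round(corr(mu, alpha), 4))
```

Because min(1/2, ·) is non-decreasing, the truncated memberships move with the originals, so C₁(A, ?A) stays positive here.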

Therefore, Proposition 18 is verified. Proposition 19 The A-CC between A-IFS and ?A-IFS is given as follows: C(A, ?A) =

 1 C1 (A, ?A) + C2 (A, ?A) + C3 (A, ?A) 3

(61)

180

A. Bertei et al.

whenever the following holds n

 xi1 −

i=1

C1 (A, ?A) = 

n

xi1 − i=1 n

C2 (A, ?A) = 

n

xi2 − n

xi3 −

2 x j1

1 n

n

1 n

n

βi2 −

αi1 −

x j2

j=1

2 x j2

n

1 n

n



n

2 x j3

j=1

n

β j2 

α j1

j=1

1 n

2

n

α j1

j=1

βi2 + αi1 −

x j3

2

j=1

n

αi1 −

j=1

n

1 n



i=1

β j2

j=1







n

i=1

j=1 1 n

βi2 −

x j1

j=1

n



C3 (A, ?A) = 

n

xi3 − i=1

1 n

1 n

i=1



j=1

 xi2 −

n

n

1 n

i=1

i=1

1 n

1 n

β j2 + α j1

j=1

 βi2 + αi1 −

i=1



n

1 n

n

2 β j2 + α j1

j=1

Proof C1 (A, ?A) and C2 (A, ?A) follows from Eqs. (1), (8), (34) and (44). And then, n



i=1

n

x j3 xi3 − n1 j=1



n

1 − β j2 − α j1 1 − βi2 − αi1 − n1 j=1



C3 (A, ?(A)) =  2  2

n n n n





1 1 x j3 1 − β j2 − α j1 xi3 − n 1 − βi2 − αi1 − n i=1

n

i=1



j=1

n

x j3 xi3 − n1 j=1

i=1

j=1



n

β j2 + α j1 βi2 + αi1 − n1 j=1



=  2  2

n n n n





x j3 β j2 + α j1 xi3 − n1 βi2 + αi1 − n1 i=1

j=1

i=1

j=1

Thus, Proposition 19 is also verified. Proposition 20 Let ?A-IFS and !A-IFS given by Eqs. (34) and (33), respectively. The following holds: C(A, !A) = C(A, ?A) Proof By Eq. (44) we have that:

(62)

Correlation Analysis Via Intuitionistic Fuzzy Modal …



n

i=1

181



n

x j1 xi1 − n1 j=1



n

β j2 βi2 − n1 j=1

C1 (A, !A) =  2  2 = C1 (A, ?A)

n n n n





x j1 β j2 xi1 − n1 βi2 − n1 i=1



n

i=1

j=1

n

x j2 xi2 − n1 j=1

C2 (A, !A) = 

n

x i=1 n



j=1

n

α j1 αi1 − n1 j=1



2  2 = C2 (A, ?A) n n n



1 x j2 α j1 αi1 − n1 i2 − n j=1



i=1

C3 (A, !A) = 

n

x

n 1

i3 − n

j=1



n

x j3 xi3 − n1 i=1 j=1

i=1

i=1

2 x j3

j=1



n

1−β j2 −α j1 1−βi2 −αi1 − n1 j=1

n



i=1

n 1

1−βi2 −αi1 − n

2 = C3 (A, ?A)

1−β j2 −α j1

j=1

Therefore, Proposition 20 is verified. Corollary 5 Let A − I F S A, ?A − I F S and !A − I F S given as Eqs. (1), (34) and (33), respectively. Based on their NS-dual constructions, the following holds: C(A, !A)

Eq.(48)

=

C(A, !A)

Eq.(35)a

=

C(A, ?A).

(63)

Proof It follows from Propositions 7 and 20.

Proposition 21 Let the A-IFS A, ?A-IFS and ♦A-IFS be given by Eqs. (1), (34) and (17), respectively. The following holds:

C(A, ♦?\overline{A}) = \frac{1}{3}(C_1(A, ♦?\overline{A}) + C_2(A, ♦?\overline{A})),   (64)

whenever the following holds:

C_1(A, ♦?\overline{A}) = (−1)\frac{\sum_{i=1}^{n}\bigl(x_{i1} − m(x_{i1})\bigr)\bigl(β_{i1} − m(β_{i1})\bigr)}{\sqrt{m_2(x_{i1})\, m_2(β_{i1})}};

C_2(A, ♦?\overline{A}) = \frac{\sum_{i=1}^{n}\bigl(x_{i2} − m(x_{i2})\bigr)\bigl(β_{i1} − m(β_{i1})\bigr)}{\sqrt{m_2(x_{i2})\, m_2(β_{i1})}}.

Proof Straightforward.

Corollary 6 Let the A-IFS, ?A-IFS and ♦A-IFS be given by Eqs. (1), (34) and (17), respectively. Then the following holds:

C(A, ♦?\overline{A}) = C(A, □♦?\overline{A}) = C(\overline{A}, \overline{♦?\overline{A}}), by Eqs. (19)b and (48);
C(A, ♦?\overline{A}) = C(A, ♦♦?\overline{A}), by Eq. (20)b.

Proof It results from Propositions 4, 7 and 21.

Proposition 22 Let A be an A-IFS. The correlation between the IFS A and the □!A-IFS is given as

C(A, □!A) = −C(A, ♦?\overline{A}).   (65)

Proof Straightforward.

Corollary 7 Let the A-IFS A, !A-IFS and ♦A-IFS be given by Eqs. (1), (33) and (17), respectively. Then the following holds:

C(A, □!A) = C(A, □□!A) = C(\overline{A}, \overline{□!A}), by Eqs. (19)a and (48);   (66)
C(A, □!A) = C(A, ♦□!A), by Eq. (20)a.   (67)

Proof It results from Propositions 4, 7 and 22.

Proposition 23 Let the ?A-IFS, !A-IFS, ♦A-IFS and □A-IFS be given by Eqs. (34), (33), (17) and (16), respectively. The following holds:

C(□?A, ♦!A) = \frac{2}{3} C_1(□?A, ♦!A),   (68)

whenever the following holds:

C_1(□?A, ♦!A) = (−1)\frac{\sum_{i=1}^{n}\bigl(α_{i1} − m(α_{i1})\bigr)\bigl(α_{i2} − m(α_{i2})\bigr)}{\sqrt{m_2(α_{i1})\, m_2(α_{i2})}}.

Proof By Eqs. (34), (33), (17), (16) and (44) we have the following results:

C_1(□?A, ♦!A) = \frac{\sum_{i=1}^{n}\bigl(α_{i1} − m(α_{i1})\bigr)\bigl(1 − α_{i2} − m(1 − α_{i2})\bigr)}{\sqrt{m_2(α_{i1})\, m_2(1 − α_{i2})}} = (−1)\frac{\sum_{i=1}^{n}\bigl(α_{i1} − m(α_{i1})\bigr)\bigl(α_{i2} − m(α_{i2})\bigr)}{\sqrt{m_2(α_{i1})\, m_2(α_{i2})}} = C_2(□?A, ♦!A).

Since C_3(□?A, ♦!A) = 0, Proposition 23 is verified.

Corollary 8 Let the ?A-IFS, !A-IFS, ♦A-IFS and □A-IFS be given by Eqs. (34), (33), (17) and (16), respectively. Then the following holds:

C(□?A, ♦!A) = C(?□A, ♦!A) = C(\overline{?□A}, \overline{♦!A}), by Eqs. (38) and (48);
C(□?A, ♦!A) = C(□□?A, ♦!A), by Eq. (19)a.

Proof It follows from Propositions 7, 4 and 23.

Proposition 24 Let the ?A-IFS, !A-IFS, ♦A-IFS and □A-IFS be given by Eqs. (34), (33), (17) and (16), respectively. The following holds:

C(\overline{□?A}, ♦!A) = −\frac{2}{3} C_1(□?A, ♦!A).   (69)

Proof It follows from Proposition 23.

Corollary 9 Let the ?A-IFS, !A-IFS, ♦A-IFS and □A-IFS be given by Eqs. (34), (33), (17) and (16), respectively. Then the following holds:

C(\overline{□?A}, \overline{♦!A}) = C(□?A, ♦!A), by Eqs. (48) and (35).

Proof It follows from Propositions 4, 7 and 17.

Proposition 25 For an A-IFS A, we have that:

C(\overline{□?A}, \overline{♦!A}) = \frac{2}{3} C_1(□?A, ♦!A).   (70)

Proof It follows from Proposition 23 and Corollary 7.

8 A-CC Results on Triangular (Co)Norms and Modal Operators

This section studies the main results on the A-CC obtained from t-representable t-(co)norms on Ũ, denoted by T_I (S_I) and given by a pair (T, S) ((S, T)) of a t-norm T and a t-conorm S, together with the necessity and possibility modal operators.

Proposition 26 Let AB^{T_I} and AB^{S_I} be the IFT (IFS) on Ũ expressed according to Eqs. (12) and (13). Then, the following holds:

C(AB^{T_I}, \overline{AB^{T_I}}) = C(AB^{T_I}, \overline{A}\,\overline{B}^{S_I}),   (71)
C(AB^{S_I}, \overline{AB^{S_I}}) = C(AB^{S_I}, \overline{A}\,\overline{B}^{T_I}).   (72)

Proof Based on the result from [9, Proposition 5.4], we have that:

C(AB^{T_I}, \overline{AB^{T_I}}) = \frac{1}{3}(2C_1(AB^{T_I}, \overline{AB^{T_I}}) + 1) = \frac{1}{3}(2C_2(AB^{T_I}, \overline{AB^{T_I}}) + 1) = C(AB^{T_I}, \overline{A}\,\overline{B}^{S_I}),   (73)
C(AB^{S_I}, \overline{AB^{S_I}}) = \frac{1}{3}(2C_1(AB^{S_I}, \overline{AB^{S_I}}) + 1) = \frac{1}{3}(2C_2(AB^{S_I}, \overline{AB^{S_I}}) + 1) = C(AB^{S_I}, \overline{A}\,\overline{B}^{T_I}).   (74)

Proposition 27 The correlation coefficient between □(AB)^{S_I} and ♦(AB)^{T_I} is given as follows:

C(□(AB)^{S_I}, ♦(AB)^{T_I}) = \frac{2}{3} C_1(□(AB)^{S_I}, ♦(AB)^{T_I})   (75)

whenever the following expression holds:

C_1(□(AB)^{S_I}, ♦(AB)^{T_I}) = (−1)\frac{\sum_{i=1}^{n}\bigl(S(x_{i1}, y_{i1}) − m(S(x_{i1}, y_{i1}))\bigr)\bigl(S(x_{i2}, y_{i2}) − m(S(x_{i2}, y_{i2}))\bigr)}{\sqrt{m_2(S(x_{i1}, y_{i1}))\, m_2(S(x_{i2}, y_{i2}))}}.

Proof Straightforward from Proposition 15.

Corollary 10 Let AB^{T_I}, AB^{S_I}, the ♦A-IFS and the □A-IFS be given by Eqs. (12), (13), (17) and (16), respectively. Then the following holds:

C(□(AB)^{S_I}, ♦(AB)^{T_I}) = C(\overline{□(AB)^{S_I}}, \overline{♦(AB)^{T_I}}), by Eq. (48);   (76)
C(□(AB)^{S_I}, ♦(AB)^{T_I}) = C(□□(AB)^{S_I}, ♦♦(AB)^{T_I}), by Eqs. (19)a and (20)b;   (77)
C(□(AB)^{S_I}, ♦(AB)^{T_I}) = C(□(AB)^{S_I}, □♦(AB)^{T_I}), by Eq. (19)b.   (78)

Proof Straightforward from Propositions 4, 7 and 27.

Proposition 28 The correlation coefficient between \overline{□(AB)^{S_I}} and ♦(AB)^{T_I} is given as

C(\overline{□(AB)^{S_I}}, ♦(AB)^{T_I}) = −C(□(AB)^{S_I}, ♦(AB)^{T_I}).   (79)

Proof Straightforward from Proposition 16 and Corollary 3.

Corollary 11 Let AB^{T_I}, AB^{S_I}, the ♦A-IFS and the □A-IFS be given by Eqs. (12), (13), (17) and (16), respectively. Then the following holds:

C(\overline{□(AB)^{S_I}}, ♦(AB)^{T_I}) = C(□(AB)^{S_I}, \overline{♦(AB)^{T_I}}), by Eq. (48);   (80)
C(\overline{□(AB)^{S_I}}, ♦(AB)^{T_I}) = C(\overline{□(AB)^{S_I}}, □♦(AB)^{T_I}), by Eq. (19)b;   (81)
C(\overline{□(AB)^{S_I}}, ♦(AB)^{T_I}) = C(\overline{□(AB)^{S_I}}, ♦♦(AB)^{T_I}), by Eq. (20)b.   (82)

Proof Straightforward from Propositions 4, 7 and 28.

Proposition 29 The correlation coefficient between AB^{T_I} and □(AB)^{T_I} is given as follows:

C(AB^{T_I}, □(AB)^{T_I}) = \frac{1}{3}(C_2(AB^{T_I}, □(AB)^{T_I}) + 1)   (83)

whenever the following expression holds:

C_2(AB^{T_I}, □(AB)^{T_I}) = (−1)\frac{\sum_{i=1}^{n}\bigl(S(x_{i2}, y_{i2}) − m(S(x_{i2}, y_{i2}))\bigr)\bigl(T(x_{i1}, y_{i1}) − m(T(x_{i1}, y_{i1}))\bigr)}{\sqrt{m_2(S(x_{i2}, y_{i2}))\, m_2(T(x_{i1}, y_{i1}))}}.

Proof Straightforward from Proposition 13.

Corollary 12 Let AB^{T_I} and □(AB^{T_I}) be given by Eqs. (12) and (28), respectively. Then the following holds:

C(AB^{T_I}, □(AB^{T_I})) = C(\overline{AB^{T_I}}, \overline{□(AB^{T_I})}), by Eq. (48);   (84)
C(AB^{T_I}, □(AB^{T_I})) = C(AB^{T_I}, □□(AB^{T_I})) = C(AB^{T_I}, ♦□(AB^{T_I})), by Eqs. (19)a and (20)a.   (85)

Proof Straightforward from Propositions 4, 7 and 29.

Proposition 30 The correlation coefficient between AB^{T_I} and ♦(AB)^{T_I} is given as

C(AB^{T_I}, ♦(AB)^{T_I}) = C(AB^{T_I}, □(AB)^{T_I}).   (86)

Proof Straightforward from Proposition 29.

Corollary 13 Let AB^{T_I} and ♦(AB^{T_I}) be expressed by Eqs. (12) and (26). Then,

C(AB^{T_I}, ♦(AB^{T_I})) = C(AB^{T_I}, □♦(AB^{T_I})) = C(AB^{T_I}, ♦♦(AB^{T_I})), by Eqs. (19)b and (20)b;   (87)
C(AB^{T_I}, ♦(AB^{T_I})) = C(\overline{AB^{T_I}}, \overline{♦(AB^{T_I})}), by Eq. (48).   (88)

Proof Straightforward from Propositions 4, 7 and 30.

Proposition 31 Let the □A-IFS and AB^{T_I} be given in Eqs. (16) and (12), and let □AB^{T_I} denote the set of Eq. (12) built from □A and B. The correlation coefficient between □AB^{T_I} and the □A-IFS is given as

C(□AB^{T_I}, □A) = \frac{1}{3}(C_1(□AB^{T_I}, □A) + C_2(□AB^{T_I}, □A))   (89)

whenever the following holds:

C_1(□AB^{T_I}, □A) = \frac{\sum_{i=1}^{n}\bigl(T(x_{i1}, y_{i1}) − m(T(x_{i1}, y_{i1}))\bigr)\bigl(x_{i1} − m(x_{i1})\bigr)}{\sqrt{m_2(T(x_{i1}, y_{i1}))\, m_2(x_{i1})}};

C_2(□AB^{T_I}, □A) = (−1)\frac{\sum_{i=1}^{n}\bigl(S(1 − x_{i1}, y_{i2}) − m(S(1 − x_{i1}, y_{i2}))\bigr)\bigl(x_{i1} − m(x_{i1})\bigr)}{\sqrt{m_2(S(1 − x_{i1}, y_{i2}))\, m_2(x_{i1})}}.

Proof By Eqs. (16), (12) and (44), the first components give C_1(□AB^{T_I}, □A) directly, while the second component of □A is 1 − x_{i1}, so that

C_2(□AB^{T_I}, □A) = \frac{\sum_{i=1}^{n}\bigl(S(1 − x_{i1}, y_{i2}) − m(S(1 − x_{i1}, y_{i2}))\bigr)\bigl((1 − x_{i1}) − m(1 − x_{i1})\bigr)}{\sqrt{m_2(S(1 − x_{i1}, y_{i2}))\, m_2(1 − x_{i1})}} = (−1)\frac{\sum_{i=1}^{n}\bigl(S(1 − x_{i1}, y_{i2}) − m(S(1 − x_{i1}, y_{i2}))\bigr)\bigl(x_{i1} − m(x_{i1})\bigr)}{\sqrt{m_2(S(1 − x_{i1}, y_{i2}))\, m_2(x_{i1})}},

since (1 − x_{i1}) − m(1 − x_{i1}) = −(x_{i1} − m(x_{i1})). The hesitation margin of □A is null, so C_3(□AB^{T_I}, □A) = 0.

Corollary 14 Let □AB^{T_I} and the □A-IFS be expressed by Eqs. (28) and (16). The following holds:

C(□AB^{T_I}, □A) = C(\overline{□AB^{T_I}}, \overline{□A}), by Eq. (48);   (90)
C(□AB^{T_I}, □A) = C(□AB^{T_I}, □□A) = C(□AB^{T_I}, ♦□A), by Eqs. (19)a and (20)a.   (91)

Proof Straightforward from Propositions 4, 7 and 31.

Proposition 32 Let the □A-IFS and AB^{T_I} be given in Eqs. (16) and (12). The correlation coefficient between □AB^{T_I} and \overline{□A} is given as

C(□AB^{T_I}, \overline{□A}) = −C(□AB^{T_I}, □A).   (92)

Proof By Eqs. (8), (16), (44) and (12), the following holds:

C_1(□AB^{T_I}, \overline{□A}) = \frac{\sum_{i=1}^{n}\bigl(T(x_{i1}, y_{i1}) − m(T(x_{i1}, y_{i1}))\bigr)\bigl((1 − x_{i1}) − m(1 − x_{i1})\bigr)}{\sqrt{m_2(T(x_{i1}, y_{i1}))\, m_2(1 − x_{i1})}} = −C_1(□AB^{T_I}, □A);

C_2(□AB^{T_I}, \overline{□A}) = \frac{\sum_{i=1}^{n}\bigl(S(1 − x_{i1}, y_{i2}) − m(S(1 − x_{i1}, y_{i2}))\bigr)\bigl(x_{i1} − m(x_{i1})\bigr)}{\sqrt{m_2(S(1 − x_{i1}, y_{i2}))\, m_2(x_{i1})}} = −C_2(□AB^{T_I}, □A).

Corollary 15 Let □AB^{T_I}, the ♦A-IFS and the □A-IFS be given by Eqs. (28), (16) and (17), respectively. Based on their N_S-dual constructions, the following holds:

C(□AB^{T_I}, \overline{□A}) = C(\overline{□AB^{T_I}}, □A), by Eq. (48);   (93)
C(□AB^{T_I}, \overline{□A}) = C(□AB^{T_I}, ♦\overline{A}) = C(□AB^{T_I}, ♦♦\overline{A}), by Eqs. (18)b and (20)b.   (94)

Proof Straightforward from Propositions 4, 7 and 32.

9 Conclusion and Further Work

In this article, analytical expressions of the A-CC were obtained for the α-level modal operators together with the necessity and possibility operators, along with intuitionistic fuzzy t-norms and t-conorms. Ongoing work investigates applications dealing with the A-CC and solving decision making problems. Further work intends to extend these studies of A-IFS to other fuzzy connectives frequently applied in decision making based on fuzzy systems, and to extend the correlation coefficient developed in [9], which is based on A-IFS, to A-IVIFS, where both the membership and non-membership values as well as the margin of hesitation are given by interval-valued intuitionistic fuzzy degrees.

References

1. Atanassov, K.T., Gargov, G.: Intuitionistic fuzzy sets. VII ITKR session, Sofia. Centr. Sci.-Techn. Library Bulg. Acad. Sci. 1697(84) (1983)
2. Atanassov, K.T.: Intuitionistic fuzzy sets. Fuzzy Sets Syst. 20(1), 87–96 (1986)
3. Atanassov, K.T., Gargov, G.: Interval valued intuitionistic fuzzy sets. Fuzzy Sets Syst. 31(3), 343–349 (1989)
4. Atanassov, K.T.: Studies in Fuzziness and Soft Computing. Physica-Verlag HD, Heidelberg, Germany (1999)
5. Atanassov, K.T.: On Intuitionistic Fuzzy Sets Theory, Studies in Fuzziness and Soft Computing, vol. 283, pp. 1–283. Springer (2012)
6. Batyrshin, I.Z.: On definition and construction of association measures. J. Intell. Fuzzy Syst. 29(6), 2319–2326 (2015)
7. Bertei, A., Reiser, R.H.S., Foss, L.: Correlation coefficient of modal level operators: an application to medical diagnosis. In: Merelo, J., Garibaldi, J., Barranco, A., Madani, K., Warwick, K. (eds.) Proceedings of the 11th International Joint Conference on Computational Intelligence (IJCCI), vol. 1, pp. 278–287. SP, Austria (2019)
8. Bertei, A., Reiser, R.H.S.: Correlation coefficient analysis performed on duality and conjugate modal-level operators. In: IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2018), pp. 1–8. IEEE, Rio de Janeiro, Brasil (2018)
9. Bertei, A., Zanotelli, R., Cardoso, W., Reiser, R., Foss, L., Bedregal, B.: Correlation coefficient analysis based on fuzzy negations and representable automorphisms. In: IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016), pp. 127–132. IEEE, Vancouver, Canada (2016)
10. Bustince, H., Burillo, P., Soria, F.: Automorphisms, negations and implication operators. Fuzzy Sets Syst. 134(2), 209–229 (2003)
11. Bustince, H., Burillo, P.: Correlation of interval-valued intuitionistic fuzzy sets. Fuzzy Sets Syst. 74(2), 237–244 (1995)
12. Bustince, H., Barrenechea, E., Mohedano, V.: Intuitionistic fuzzy implication operators—an expression and main properties. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 12(12), 387–406 (2004)
13. Bustince, H., Kacprzyk, J., Mohedano, V.: Intuitionistic fuzzy generators. Application to intuitionistic fuzzy complementation. Fuzzy Sets Syst. 114, 485–504 (2000)
14. Cheng, Y.-T., Yang, C.-C.: The application of fuzzy correlation coefficient with fuzzy interval data. Int. J. Innov. Manag. Inf. Prod. 5(3), 65–71 (2014)
15. Chiang, D.-A., Lin, N.P.: Correlation of fuzzy sets. Fuzzy Sets Syst. 102(2), 221–226 (1999)
16. Cornelis, C., Deschrijver, G., Kerre, E.: Classification of intuitionistic fuzzy implicators: an algebraic approach. In: Joint Conference on Information Sciences, pp. 105–108 (2002)
17. Çuvalcioğlu, G.: One, Two and Uni-type Operators on IFSs. In: Imprecision and Uncertainty in Information Representation and Processing: New Tools Based on Intuitionistic Fuzzy Sets and Generalized Nets. Springer (2016)
18. Deschrijver, G., Cornelis, C., Kerre, E.E.: On the representation of intuitionistic fuzzy t-norms and t-conorms. IEEE Trans. Fuzzy Syst. 12(1), 45–61 (2004)

Correlation Analysis Via Intuitionistic Fuzzy Modal …


19. Dencheva, K.: Extension of intuitionistic fuzzy modal operators. In: International IEEE Conference, pp. 21–22. IEEE (2004)
20. Font, J.M.F., Hajek, P.: On Łukasiewicz's four-valued modal logic. Stud. Log. Int. J. Symb. Log. 70(2), 157–182 (2000)
21. Garg, H.: A novel correlation coefficients between Pythagorean fuzzy sets and its applications to decision-making processes. Int. J. Intell. Syst. 31(12), 1–11 (2016)
22. Gerstenkorn, T., Mańko, J.: Correlation of intuitionistic fuzzy sets. Fuzzy Sets Syst. 44(1), 39–43 (1991)
23. Goldberger, J., Gordon, S., Greenspan, H.: An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In: IEEE International Conference on Computer Vision (ICCV 2003), pp. 487–493 (2003)
24. Hinde, C., Atanassov, K.T.: On intuitionistic fuzzy negations and intuitionistic fuzzy extended modal operators. Part 2. In: IV International IEEE Conference Intelligent Systems, pp. 13–20. IEEE, Varna, Bulgaria (2008)
25. Hong, D.H., Hwang, S.Y.: Correlation of intuitionistic fuzzy sets in probability spaces. Fuzzy Sets Syst. 75(1), 77–81 (1995)
26. Hong, D.H.: Fuzzy measures for a correlation coefficient of fuzzy numbers under TW (the weakest t-norm)-based fuzzy arithmetic operations. Inf. Sci. 176(2), 150–160 (2006)
27. Huang, H.-L., Guo, Y.: An improved correlation coefficient of intuitionistic fuzzy sets. J. Intell. Syst. (2017)
28. Hung, W.-L.: Using statistical viewpoint in developing correlation of intuitionistic fuzzy sets. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 9(4), 509–516 (2001)
29. Hung, W.-L.: Correlation of intuitionistic fuzzy sets by centroid method. Inf. Sci. 144(1), 219–225 (2002)
30. Liu, S.: Correlation and aggregation integrated MCDM with interval-valued intuitionistic fuzzy numbers. In: 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pp. 2252–2256 (2016)
31. Liu, B., Shen, Y., Mu, L., Chen, X., Chen, L.: A new correlation measure of the intuitionistic fuzzy sets. J. Intell. Fuzzy Syst. 30(2), 1019–1028 (2016)
32. Liu, S.T., Kao, C.: Fuzzy measures for correlation coefficient of fuzzy numbers. Fuzzy Sets Syst. 128(2), 267–275 (2002)
33. Meng, F., Wang, C., Chen, X., Zhang, Q.: Correlation coefficients of interval-valued hesitant fuzzy sets and their application based on the Shapley function. Int. J. Intell. Syst. 31(1), 17–43 (2016)
34. Mishra, A.R., Rani, P.: Information measures based TOPSIS method for multicriteria decision making problem in intuitionistic fuzzy environment. Iran. J. Fuzzy Syst. 14(6), 41–63 (2017)
35. Mitchell, H.: A correlation coefficient for intuitionistic fuzzy sets. Int. J. Intell. Syst. 19(5), 483–490 (2004)
36. Park, D.G., Kwun, Y.C., Park, J.H., Park, I.Y.: Correlation coefficient of interval-valued intuitionistic fuzzy sets and its application to multiple attribute group decision making problems. Math. Comput. Modell. 50(9), 1279–1293 (2009)
37. Pearson, K.: Notes on the history of correlation. Biometrika 13(1), 25–44 (1920)
38. Qu, G., Qu, W., Zhang, Z., Wang, J.: Choquet integral correlation coefficient of intuitionistic fuzzy sets and its applications. J. Intell. Fuzzy Syst. 33, 543–553 (2017)
39. Reiser, R., Visintin, L., Benítez, I., Bedregal, B.: Correlations from conjugate and dual intuitionistic fuzzy triangular norms and conorms. In: Joint IFSA World Congress and NAFIPS Annual Meeting, pp. 1394–1399. IEEE, Edmonton, Canada (2013)
40. Reiser, R.H.S., Bedregal, B.: Correlation in interval-valued Atanassov's intuitionistic fuzzy sets—conjugate and negation operators. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 25(5), 787–819 (2017)
41. Robinson, J., Amirtharaj, H.: MADM problems with correlation coefficient of trapezoidal fuzzy intuitionistic fuzzy sets. Adv. Decis. Sci. 2014, 1–11 (2014)
42. Rodríguez, R.M., Martínez, L., Torra, V., Xu, Z.S., Herrera, F.: Hesitant fuzzy sets: state of the art and future directions. Int. J. Intell. Syst. 29(6), 495–524 (2014)


A. Bertei et al.

43. Singh, P.: Correlation coefficients for picture fuzzy sets. J. Intell. Fuzzy Syst. 28(2), 591–604 (2015)
44. Solanki, R., Gulati, G., Tiwari, A., Lohani, Q.M.D.: A correlation based intuitionistic fuzzy TOPSIS method on supplier selection problem. In: IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016), pp. 2106–2112. IEEE, Vancouver, Canada (2016)
45. Smarandache, F.: A Unifying Field in Logics: Neutrosophic Logic. American Research Press (1999)
46. Szmidt, E., Kacprzyk, J.: Computational Intelligence for Knowledge-Based Systems Design. Springer, Berlin, Heidelberg (2010)
47. Szmidt, E., Kacprzyk, J.: A new approach to principal component analysis for intuitionistic fuzzy data sets. In: Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R.R. (eds.) Advances in Computational Intelligence (IPMU 2012), Communications in Computer and Information Science, vol. 298, pp. 529–538. Springer, Heidelberg (2012)
48. Szmidt, E., Kacprzyk, J.: Distances between intuitionistic fuzzy sets. Fuzzy Sets Syst. 114(3), 505–518 (2000)
49. Szmidt, E., Kacprzyk, J.: A new similarity measure for intuitionistic fuzzy sets: straightforward approaches may not work. In: 2007 IEEE International Fuzzy Systems Conference, pp. 1–6. IEEE (2007)
50. Szmidt, E., Kacprzyk, J.: Entropy for intuitionistic fuzzy sets. Fuzzy Sets Syst. 118(3), 467–477 (2001)
51. Szmidt, E., Kacprzyk, J., Bujnowski, P.: Correlation between intuitionistic fuzzy sets: some conceptual and numerical extensions. In: IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2012), pp. 1–7. IEEE, Brisbane, Australia (2012)
52. Szmidt, E., Kacprzyk, J.: A similarity measure for intuitionistic fuzzy sets and its application in supporting medical diagnostic reasoning. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) Artificial Intelligence and Soft Computing (ICAISC 2004), LNCS, pp. 388–393. Springer, Heidelberg (2004)
53. Weken, D.V.D., Nachtegael, M., Kerre, E.E.: Using similarity measures and homogeneity for the comparison of images. Image Vis. Comput. 22(9), 695–702 (2004)
54. Ye, J.: Fuzzy decision-making method based on the weighted correlation coefficient under intuitionistic fuzzy environment. Eur. J. Oper. Res. 205(1), 202–204 (2010)
55. Ye, J.: Correlation coefficient between dynamic single valued neutrosophic multisets and its multiple attribute decision-making method. Information 8(2) (2017)
56. Zhao, H., Xu, Z.: Intuitionistic fuzzy multi-attribute decision making with ideal-point-based method and correlation measure. J. Intell. Fuzzy Syst. 30(2), 747–757 (2016)
57. Zadeh, L.A.: A new measure of consensus with reciprocal preference relations: the correlation consensus degree. Knowl.-Based Syst. 107, 104–116 (2016)
58. Zadeh, L.A.: Fuzzy sets. Inf. Control 8(3), 338–353 (1965)

Fuzzy Geometric Approach to Collision Estimation Under Gaussian Noise in Human-Robot Interaction Rainer Palm and Achim J. Lilienthal

Abstract Humans and mobile robots sharing the same work areas require a high level of safety, especially at possible intersections of their trajectories. A key issue of human-robot navigation is the computation of the intersection point in the presence of noisy measurements or fuzzy information. For Gaussian distributions of the positions/orientations (inputs) of robot and human agent and their parameters, the corresponding parameters at the intersections (outputs) are computed by analytical and fuzzy methods. This is done both for the static and the dynamic case, using Kalman filters for the robot/human positions and orientations and thus for the estimation of the intersection positions. For the overdetermined case (6 inputs, 2 outputs) a so-called 'energetic' approach is used for the estimation of the point of intersection. The inverse task is also discussed: specifying the parameters of the output distributions and looking for the parameters of the input distributions. For larger standard deviations (stds), mixed Gaussian models are suggested as approximations of non-Gaussian distributions. Keywords Human-robot systems · Navigation · Gaussian noise · Kalman filters · Fuzzy modeling

R. Palm (B) · A. J. Lilienthal, AASS Department of Technology, Orebro University, SE-70182 Orebro, Sweden; e-mail: [email protected]; A. J. Lilienthal e-mail: [email protected]
© Springer Nature Switzerland AG 2021. J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_8

1 Introduction

Human operators and mobile robots in shared work areas require a high degree of system stability, safety and mutual adaptation of behavior to ensure a successful collaboration. A general discussion on robot-human cooperation is presented by [11]. Task planning, navigation and obstacle avoidance were major research activities in


recent years [3, 9, 14]. Recognition of human intentions is a core issue for reaching particular goals [4, 12, 13, 19]. The problem of intersection of trajectories between robots and humans is addressed by [2], which describes planned human-robot rendezvous at intersection areas. The computation of the intersections of the intended trajectories of robot and human requires the measurement and/or estimation of the positions and orientations of robot and human, which are subject to uncertainties and observation noise and thus affect the accuracy of the intersection estimates. Chen et al. and Luo et al. discuss a high-level control strategy for a multiple target tracking approach for robots and other agents [6, 20]. The present work, on the other hand, focuses on the one-robot one-human scenario to deepen the problem of accuracy and collision avoidance at short distances. Uncertainties of more than one degree in the orientation measurements of human and robot can lead to high uncertainties at the intersection. For safety reasons and for an effective collaboration between humans and robots, it is therefore essential to predict uncertainties at possible intersections. Positions and orientations of human and robot are nonlinearly related to the intersection coordinates but can be linearized if we consider only the linear part of the correlation between input (positions and orientations) and output (intersection coordinates) and small stds at the input [1, 16]. The problem of uncertainty in human/robot systems leads us consequently to fuzzy systems, due to the human factor involved and the issue of uncertain inputs in technical systems. We know two main directions to deal with uncertainties at system inputs:

– processing of fuzzy inputs (fuzzy sets) in fuzzy systems [5, 10, 17],
– fuzzy reasoning with probabilistic inputs [21] and the transformation of probabilistic distributions into fuzzy sets [15].
Despite the success of these approaches in fuzzy systems, these methods fail to address the practical problem of processing a probability distribution through a static nonlinear system that is described both analytically and by fuzzy models. Dealing with uncertain/fuzzy inputs in an analytical way is motivated by the prediction of future situations like collision avoidance or cooperation at specific work areas, and by the use of this information for feed-forward control actions and/or re-planning of trajectories. Our main application is the bearing task for intersections of possible trajectories of human and robot starting from different positions for the same "target" (the intersection) [18]. At first we consider the static case (human and robot standing still) in order to show the general problems and challenges. A next step is to consider robot and human in motion, where Kalman filters are used both for positions and orientations to suppress the measurement noise during motion. For larger stds at the input, we introduce mixed Gaussian distributions. In principle we address the following direct task: given the parameters of Gaussian distributions at the input of a static system (human/robot), find the corresponding distribution parameters at the output (intersection). The inverse task reads: given the output distribution parameters, find the input distribution parameters. The paper is organized as follows. Section 2 describes the analytical way to the intersection problem with Gaussian noise. Section 3 deals with the inverse problem to find the input distribution parameters for given output parameters. Section 4 describes the local linear fuzzy approximation of the nonlinear analytical calculation. In Sect. 5 the two orientation inputs are extended by another four position inputs, and a so-called "energy" approach is introduced to solve the overdetermined problem (6 inputs, 2 outputs). In Sect. 6, mixed Gaussian distributions and their contribution to the intersection problem are presented. Section 7 deals with robot and human in motion, with Kalman filters for positions and orientations included. Section 8 deals with simulations to evaluate the methods and approaches presented. Finally, Sect. 9 concludes the paper.

2 Gaussian Noise and the Intersection Problem

2.1 Computation of Intersections—Analytical Approach

Consider two linear paths $\mathbf{x}_R(t)$ and $\mathbf{x}_H(t)$ intended by a robot and a human and a possible intersection $(x_c, y_c)$. $\mathbf{x}_H = (x_H, y_H)$ and $\mathbf{x}_R = (x_R, y_R)$ are the positions of human and robot and $\varphi_H$ and $\varphi_R$ their orientation angles (see Fig. 1). From Fig. 1 we get

$$x_H = x_R + d_{RH}\cos(\varphi_R + \delta_R), \qquad y_H = y_R + d_{RH}\sin(\varphi_R + \delta_R)$$
$$x_R = x_H + d_{RH}\cos(\varphi_H + \delta_H), \qquad y_R = y_H + d_{RH}\sin(\varphi_H + \delta_H) \tag{1}$$

where the positive angles $\delta_H$ and $\delta_R$ are measured from the y coordinates counterclockwise, and $x_H$, $x_R$, $\varphi_R$, $\delta_H$, $\delta_R$, $d_{RH}$ and the angle $\gamma$ are supposed to be measurable. The orientation angle $\varphi_H$ is computed by

$$\varphi_H = \arcsin((y_H - y_R)/d_{RH}) - \delta_H + \pi \tag{2}$$

From Fig. 1, Eqs. (1) and (2) the intersection coordinates $x_c$ and $y_c$ are computed by

$$x_c = \frac{A - B}{\tan\varphi_R - \tan\varphi_H}, \qquad y_c = \frac{A\tan\varphi_H - B\tan\varphi_R}{\tan\varphi_R - \tan\varphi_H}$$
$$A = x_R\tan\varphi_R - y_R, \qquad B = x_H\tan\varphi_H - y_H \tag{3}$$

Rewriting (3) into a matrix-vector form with $\mathbf{x}_c = (x_c, y_c)^T$ and $\mathbf{x}_{RH} = (x_R, y_R, x_H, y_H)^T$ leads to


Fig. 1 Human-robot scenario: geometry, extracted from [18]

$$\mathbf{x}_c = A_{RH} \cdot \mathbf{x}_{RH} \tag{4}$$

where $\mathbf{x}_c = (x_c, y_c)^T$, $\mathbf{x}_{RH} = (x_R, y_R, x_H, y_H)^T$ and, with $G = \tan\varphi_R - \tan\varphi_H$,

$$A_{RH} = f(\varphi_R, \varphi_H) = \frac{1}{G}\begin{pmatrix} \tan\varphi_R & -1 & -\tan\varphi_H & 1 \\ \tan\varphi_R\tan\varphi_H & -\tan\varphi_H & -\tan\varphi_R\tan\varphi_H & \tan\varphi_R \end{pmatrix}$$

The orientation angle $\varphi_H$ can be determined by different means, for example from a scenario recorded by human eye tracking plus a corresponding camera picture taken from the human's position and transmitted to the robot [14]. A Takagi-Sugeno (TS) fuzzy approximation of (4) is derived by [14]

$$\mathbf{x}_c = \sum_{i,j} w_i(\varphi_R)\, w_j(\varphi_H) \cdot A_{RH\,ij} \cdot \mathbf{x}_{RH} \tag{5}$$

$w_i(\varphi_R), w_j(\varphi_H) \in [0, 1]$ are normalized membership functions with $\sum_i w_i(\varphi_R) = 1$ and $\sum_j w_j(\varphi_H) = 1$. In the following paragraph the accuracy of the computed intersection in the case of distorted orientation information is addressed.
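As a quick numerical check of (3), the intersection can be computed directly from the two measured poses. The helper below is an illustrative sketch; the function name and the test values are ours, not from the paper:

```python
import math

def intersection(xR, yR, phiR, xH, yH, phiH):
    """Intersection (xc, yc) of the two lines through (xR, yR) and (xH, yH)
    with orientation angles phiR, phiH -- Eq. (3)."""
    tR, tH = math.tan(phiR), math.tan(phiH)
    G = tR - tH              # denominator; |G| small means near-parallel paths
    A = xR * tR - yR
    B = xH * tH - yH
    return (A - B) / G, (A * tH - B * tR) / G

# Robot at the origin heading 45 deg, human at (2, 0) heading 135 deg:
xc, yc = intersection(0.0, 0.0, math.radians(45), 2.0, 0.0, math.radians(135))
print(round(xc, 6), round(yc, 6))  # -> 1.0 1.0
```

For these values the two rays are y = x and y = 2 - x, which indeed meet at (1, 1); a practical implementation would additionally guard against the near-parallel case G ≈ 0.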

2.2 Transformation of Gaussian Distributions

General Considerations. Let there be given a static nonlinear system

$$\mathbf{z} = F(\mathbf{x}) \tag{6}$$

with two inputs $\mathbf{x} = (x_1, x_2)^T$ and two outputs $\mathbf{z} = (z_1, z_2)^T$, where F denotes a nonlinear system. Furthermore, let two uncorrelated Gaussian distributed inputs $x_1$ and $x_2$ be described by the 2-dim density

$$f_{x_1,x_2} = \frac{1}{2\pi\sigma_{x_1}\sigma_{x_2}}\exp\left(-\frac{1}{2}\left(\frac{e_{x_1}^2}{\sigma_{x_1}^2} + \frac{e_{x_2}^2}{\sigma_{x_2}^2}\right)\right) \tag{7}$$

where $e_{x_1} = x_1 - \bar{x}_1$, $\bar{x}_1$ is the mean of $x_1$ and $\sigma_{x_1}$ its standard deviation, and $e_{x_2} = x_2 - \bar{x}_2$, $\bar{x}_2$ the mean of $x_2$ and $\sigma_{x_2}$ its standard deviation. The goal is to find the distribution of the output signals $z_1$ and $z_2$ and thus their stds and the correlation coefficient between them. For linear systems, Gaussian distributions are linearly transformed so that the output signals are also Gaussian distributed. However, this does not hold for nonlinear systems in general. Only if we assume the input stds small enough are the output distributions nearly Gaussian, but with correlated components, as follows:

$$f_{z_1,z_2} = \frac{1}{2\pi\sigma_{z_1}\sigma_{z_2}\sqrt{1-\rho_{z_{12}}^2}}\exp\left(-\frac{1}{2(1-\rho_{z_{12}}^2)}\left(\frac{e_{z_1}^2}{\sigma_{z_1}^2} + \frac{e_{z_2}^2}{\sigma_{z_2}^2} - \frac{2\rho_{z_{12}} e_{z_1} e_{z_2}}{\sigma_{z_1}\sigma_{z_2}}\right)\right) \tag{8}$$

where $\rho_{z_{12}}$ denotes the correlation coefficient. For the connection between (7) and (8) we use a differential approach, which is described in the next paragraph.

Differential Approach. Function F in (6) can be described by individual smooth and nonlinear static transfer functions, where $(x_1, x_2) = (\varphi_R, \varphi_H)$ and $(z_1, z_2) = (x_c, y_c)$:

$$z_1 = f_1(x_1, x_2), \qquad z_2 = f_2(x_1, x_2) \tag{9}$$

Linearization of (9) yields

$$d\mathbf{z} = \tilde{J} \cdot d\mathbf{x} \quad \text{or} \quad \mathbf{e}_z = \tilde{J} \cdot \mathbf{e}_x \tag{10}$$

with

$$\mathbf{e}_z = (e_{z_1}, e_{z_2})^T, \quad \mathbf{e}_x = (e_{x_1}, e_{x_2})^T, \quad d\mathbf{z} = (dz_1, dz_2)^T, \quad d\mathbf{x} = (dx_1, dx_2)^T \tag{11}$$

$$\tilde{J} = \begin{pmatrix} \partial f_1/\partial x_1 & \partial f_1/\partial x_2 \\ \partial f_2/\partial x_1 & \partial f_2/\partial x_2 \end{pmatrix} \tag{12}$$

Specific Approach to the Intersection. From (4) we derive the differential approach if the contributing agents change their directions of motion. To quantify the uncertainty of $\mathbf{x}_c$ for uncertain angles $\varphi_R$ and $\varphi_H$ or positions $\mathbf{x}_{RH} = (x_R, y_R, x_H, y_H)^T$ we differentiate (4) with $\mathbf{x}_{RH} = \text{const.}$:

$$\dot{\mathbf{x}}_c = \tilde{J} \cdot \dot{\boldsymbol{\varphi}}, \qquad \dot{\boldsymbol{\varphi}} = (\dot{\varphi}_R, \dot{\varphi}_H)^T, \qquad \tilde{J} = \begin{pmatrix} \tilde{J}_{11} & \tilde{J}_{12} \\ \tilde{J}_{21} & \tilde{J}_{22} \end{pmatrix} \tag{13}$$

where

$$\tilde{J}_{11} = (-\tan\varphi_H,\ 1,\ \tan\varphi_H,\ -1)\cdot\frac{\mathbf{x}_{RH}}{G^2\cos^2\varphi_R}$$
$$\tilde{J}_{12} = (\tan\varphi_R,\ -1,\ -\tan\varphi_R,\ 1)\cdot\frac{\mathbf{x}_{RH}}{G^2\cos^2\varphi_H}$$
$$\tilde{J}_{21} = \tilde{J}_{11}\cdot\tan\varphi_H, \qquad \tilde{J}_{22} = \tilde{J}_{12}\cdot\tan\varphi_R$$

Output Distribution. To compute the density $f_{z_1,z_2}$ of the output signal we invert (11) and substitute the entries of $\mathbf{e}_x$ into (7):

$$\mathbf{e}_x = J \cdot \mathbf{e}_z \tag{14}$$

with $J = \tilde{J}^{-1}$ and

$$J = \begin{pmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{pmatrix} = \begin{pmatrix} \mathbf{j}_{xz} \\ \mathbf{j}_{yz} \end{pmatrix} \tag{15}$$

where $\mathbf{j}_{xz} = (J_{11}, J_{12})$ and $\mathbf{j}_{yz} = (J_{21}, J_{22})$. The entries $J_{ij}$ are the result of the inversion of $\tilde{J}$. From this substitution we get

$$f_{x_1,x_2} = K_{x_1,x_2}\cdot\exp\left(-\frac{1}{2}\cdot\mathbf{e}_z^T\cdot(\mathbf{j}_{x_1,z}^T,\ \mathbf{j}_{x_2,z}^T)\cdot S_x^{-1}\cdot\begin{pmatrix}\mathbf{j}_{x_1,z}\\ \mathbf{j}_{x_2,z}\end{pmatrix}\cdot\mathbf{e}_z\right) \tag{16}$$

where $K_{x_1,x_2} = \frac{1}{2\pi\sigma_{x_1}\sigma_{x_2}}$ and

$$S_x^{-1} = \begin{pmatrix} \frac{1}{\sigma_{x_1}^2} & 0 \\ 0 & \frac{1}{\sigma_{x_2}^2} \end{pmatrix} \tag{17}$$

The exponent of (16) is rewritten into

$$x_{po} = -\frac{1}{2}\left(\frac{1}{\sigma_{x_1}^2}\left(e_{z_1}J_{11} + e_{z_2}J_{12}\right)^2 + \frac{1}{\sigma_{x_2}^2}\left(e_{z_1}J_{21} + e_{z_2}J_{22}\right)^2\right) \tag{18}$$

and furthermore

$$x_{po} = -\frac{1}{2}\left(e_{z_1}^2\left(\frac{J_{11}^2}{\sigma_{x_1}^2} + \frac{J_{21}^2}{\sigma_{x_2}^2}\right) + e_{z_2}^2\left(\frac{J_{12}^2}{\sigma_{x_1}^2} + \frac{J_{22}^2}{\sigma_{x_2}^2}\right) + 2 e_{z_1} e_{z_2}\left(\frac{J_{11}J_{12}}{\sigma_{x_1}^2} + \frac{J_{21}J_{22}}{\sigma_{x_2}^2}\right)\right) \tag{19}$$

Then, we compare $x_{po}$ in (19) with the exponent of the output density (8). Let

$$A = \frac{J_{11}^2}{\sigma_{x_1}^2} + \frac{J_{21}^2}{\sigma_{x_2}^2};\qquad B = \frac{J_{12}^2}{\sigma_{x_1}^2} + \frac{J_{22}^2}{\sigma_{x_2}^2};\qquad C = \frac{J_{11}J_{12}}{\sigma_{x_1}^2} + \frac{J_{21}J_{22}}{\sigma_{x_2}^2} \tag{20}$$

then this comparison yields

$$\frac{1}{(1-\rho_{z_{12}}^2)}\frac{1}{\sigma_{z_1}^2} = A;\qquad \frac{1}{(1-\rho_{z_{12}}^2)}\frac{1}{\sigma_{z_2}^2} = B;\qquad \frac{-2\rho_{z_{12}}}{(1-\rho_{z_{12}}^2)\,\sigma_{z_1}\sigma_{z_2}} = 2C \tag{21}$$

from which we finally obtain the correlation coefficient $\rho_{z_{12}}$ and the stds $\sigma_{z_1}$ and $\sigma_{z_2}$:

$$\rho_{z_{12}} = -\frac{C}{\sqrt{AB}};\qquad \frac{1}{\sigma_{z_1}^2} = A - \frac{C^2}{B};\qquad \frac{1}{\sigma_{z_2}^2} = B - \frac{C^2}{A} \tag{22}$$

Thus, once we have measured the parameters of the input distributions and know the mathematical expression for the transfer function F, we can compute the output distribution parameters directly.
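Equations (20)–(22) are easy to check numerically: for any invertible Jacobian, the route via the inverse-Jacobian entries must agree with plain linear covariance propagation of the input covariance through the Jacobian. The sketch below uses illustrative values of our own choosing:

```python
import numpy as np

# Hypothetical 2x2 Jacobian J~ and input stds (illustrative values, ours)
Jt = np.array([[1.5, -0.4],
               [0.7,  2.0]])
sx1, sx2 = 0.05, 0.08

J = np.linalg.inv(Jt)                      # J = J~^{-1}, entries J11..J22, Eq. (14)
A = J[0,0]**2/sx1**2 + J[1,0]**2/sx2**2    # Eq. (20)
B = J[0,1]**2/sx1**2 + J[1,1]**2/sx2**2
C = J[0,0]*J[0,1]/sx1**2 + J[1,0]*J[1,1]/sx2**2

rho = -C/np.sqrt(A*B)                      # Eq. (22)
sz1 = 1/np.sqrt(A - C**2/B)
sz2 = 1/np.sqrt(B - C**2/A)

# Cross-check: linear error propagation Sigma_z = J~ Sigma_x J~^T
Sz = Jt @ np.diag([sx1**2, sx2**2]) @ Jt.T
assert np.isclose(sz1, np.sqrt(Sz[0,0]))
assert np.isclose(sz2, np.sqrt(Sz[1,1]))
assert np.isclose(rho, Sz[0,1]/np.sqrt(Sz[0,0]*Sz[1,1]))
```

The agreement is exact (up to floating point) because (20)–(22) are just the inverse of the propagated covariance written out entry by entry.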

3 Inverse Solution

Up to now we discussed the problem: given the parameters of the input distributions of a nonlinear system, find the parameters of the output distributions. In practice, it might be helpful to define a specific accuracy at the intersection and look for the necessary accuracy of the input measurements. The inverse task we apply here is similar to the one discussed in Sect. 2.2. The starting point is Eq. (11). Equations (7) and (8) describe the densities of the inputs and the outputs, respectively. Then we substitute (11) into (8) and discuss the exponent $x_{poz}$ only:

$$x_{poz} = \frac{-1}{2(1-\rho_{z_{12}}^2)}\left(\mathbf{e}_x^T \tilde{J}^T S_z^{-1} \tilde{J}\,\mathbf{e}_x - \frac{2\rho_{z_{12}} e_{z_1} e_{z_2}}{\sigma_{z_1}\sigma_{z_2}}\right) \tag{23}$$

where

$$S_z^{-1} = \begin{pmatrix} \frac{1}{\sigma_{z_1}^2} & 0 \\ 0 & \frac{1}{\sigma_{z_2}^2} \end{pmatrix} \tag{24}$$

With

$$e_{z_1} e_{z_2} = (\tilde{J}_{11} e_{x_1} + \tilde{J}_{12} e_{x_2})(\tilde{J}_{21} e_{x_1} + \tilde{J}_{22} e_{x_2});$$
$$\mathbf{e}_x^T \tilde{J}^T S_z^{-1} \tilde{J}\,\mathbf{e}_x = e_{x_1}^2\left(\frac{\tilde{J}_{11}^2}{\sigma_{z_1}^2} + \frac{\tilde{J}_{21}^2}{\sigma_{z_2}^2}\right) + e_{x_2}^2\left(\frac{\tilde{J}_{12}^2}{\sigma_{z_1}^2} + \frac{\tilde{J}_{22}^2}{\sigma_{z_2}^2}\right) + 2 e_{x_1} e_{x_2}\left(\frac{\tilde{J}_{11}\tilde{J}_{12}}{\sigma_{z_1}^2} + \frac{\tilde{J}_{21}\tilde{J}_{22}}{\sigma_{z_2}^2}\right) \tag{25}$$

Renaming $x_{poz}$ into $x_{pox}$, we obtain

$$x_{pox} = -\frac{1}{2}\,\frac{1}{1-\rho_{z_{12}}^2}\Bigg(e_{x_1}^2\left(\frac{\tilde{J}_{11}^2}{\sigma_{z_1}^2} + \frac{\tilde{J}_{21}^2}{\sigma_{z_2}^2} - \frac{2\rho_{z_{12}}}{\sigma_{z_1}\sigma_{z_2}}\tilde{J}_{11}\tilde{J}_{21}\right) + e_{x_2}^2\left(\frac{\tilde{J}_{12}^2}{\sigma_{z_1}^2} + \frac{\tilde{J}_{22}^2}{\sigma_{z_2}^2} - \frac{2\rho_{z_{12}}}{\sigma_{z_1}\sigma_{z_2}}\tilde{J}_{12}\tilde{J}_{22}\right) + 2 e_{x_1} e_{x_2}\left(\frac{\tilde{J}_{11}\tilde{J}_{12}}{\sigma_{z_1}^2} + \frac{\tilde{J}_{21}\tilde{J}_{22}}{\sigma_{z_2}^2} - \frac{\rho_{z_{12}}}{\sigma_{z_1}\sigma_{z_2}}(\tilde{J}_{11}\tilde{J}_{22} + \tilde{J}_{12}\tilde{J}_{21})\right)\Bigg) \tag{26}$$

Now, comparing (26) with the exponent of (7) of the input density, we find that the mixed term in (26) should be zero. Hence we obtain the correlation coefficient and the stds of the inputs as follows:

$$\rho_{z_{12}} = \sigma_{z_1}\sigma_{z_2}\,\frac{\frac{\tilde{J}_{11}\tilde{J}_{12}}{\sigma_{z_1}^2} + \frac{\tilde{J}_{21}\tilde{J}_{22}}{\sigma_{z_2}^2}}{\tilde{J}_{11}\tilde{J}_{22} + \tilde{J}_{12}\tilde{J}_{21}} \tag{27}$$

$$\frac{1}{\sigma_{x_1}^2} = \frac{1}{1-\rho_{z_{12}}^2}\left(\frac{\tilde{J}_{11}^2}{\sigma_{z_1}^2} + \frac{\tilde{J}_{21}^2}{\sigma_{z_2}^2} - \frac{2\rho_{z_{12}}}{\sigma_{z_1}\sigma_{z_2}}\tilde{J}_{11}\tilde{J}_{21}\right) \tag{28}$$

$$\frac{1}{\sigma_{x_2}^2} = \frac{1}{1-\rho_{z_{12}}^2}\left(\frac{\tilde{J}_{12}^2}{\sigma_{z_1}^2} + \frac{\tilde{J}_{22}^2}{\sigma_{z_2}^2} - \frac{2\rho_{z_{12}}}{\sigma_{z_1}\sigma_{z_2}}\tilde{J}_{12}\tilde{J}_{22}\right) \tag{29}$$
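The inverse task (27)–(29) can be self-checked in the same spirit: forward-propagating the computed diagonal input covariance through the Jacobian has to reproduce the requested output stds and the correlation (27). Jacobian and target stds below are arbitrary illustrative values of ours:

```python
import numpy as np

# Illustrative linearized Jacobian J~ and desired output stds (ours)
Jt = np.array([[1.0, 0.3],
               [0.2, 1.2]])
sz1, sz2 = 0.1, 0.2

# Eq. (27): correlation forced by requiring uncorrelated inputs
num = Jt[0,0]*Jt[0,1]/sz1**2 + Jt[1,0]*Jt[1,1]/sz2**2
rho = sz1*sz2 * num / (Jt[0,0]*Jt[1,1] + Jt[0,1]*Jt[1,0])

# Eqs. (28), (29): required input variances
f = 1.0/(1.0 - rho**2)
inv_sx1_sq = f*(Jt[0,0]**2/sz1**2 + Jt[1,0]**2/sz2**2
                - 2*rho/(sz1*sz2)*Jt[0,0]*Jt[1,0])
inv_sx2_sq = f*(Jt[0,1]**2/sz1**2 + Jt[1,1]**2/sz2**2
                - 2*rho/(sz1*sz2)*Jt[0,1]*Jt[1,1])
sx1, sx2 = 1/np.sqrt(inv_sx1_sq), 1/np.sqrt(inv_sx2_sq)

# Forward check: the diagonal input covariance propagated through J~
# must reproduce the requested output stds and the correlation (27).
Sz = Jt @ np.diag([sx1**2, sx2**2]) @ Jt.T
assert np.isclose(np.sqrt(Sz[0,0]), sz1)
assert np.isclose(np.sqrt(Sz[1,1]), sz2)
assert np.isclose(Sz[0,1]/np.sqrt(Sz[0,0]*Sz[1,1]), rho)
```

Note that for some combinations of Jacobian and target stds the right-hand sides of (28)/(29) can become non-positive, i.e. no uncorrelated input distribution achieves the requested output accuracy.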

4 Fuzzy Solution

To avoid the high costs of an on-line computation of the output distribution, a TS-fuzzy approximation of (22) is suggested by the following rules $R_{ij}$, provided that an analytical representation (6) is available:

$$R_{ij}:\quad IF\ x_1 = X_{1i}\ AND\ x_2 = X_{2j}\quad THEN\quad \rho_{z_{12}} = -\frac{C_{ij}}{\sqrt{A_{ij}B_{ij}}}\ AND\ \frac{1}{\sigma_{z_1}^2} = A_{ij} - \frac{C_{ij}^2}{B_{ij}}\ AND\ \frac{1}{\sigma_{z_2}^2} = B_{ij} - \frac{C_{ij}^2}{A_{ij}} \tag{30}$$

where $X_{1i}$, $X_{2j}$ are fuzzy terms for $x_1$, $x_2$, and $A_{ij}$, $B_{ij}$, $C_{ij}$ are functions of predefined values $x_1 = x_{1i}$ and $x_2 = x_{2j}$. From (30) we get

$$\rho_{z_{12}} = -\sum_{ij} w_i(x_1)\, w_j(x_2)\,\frac{C_{ij}}{\sqrt{A_{ij}B_{ij}}}$$
$$\frac{1}{\sigma_{z_1}^2} = \sum_{ij} w_i(x_1)\, w_j(x_2)\left(A_{ij} - \frac{C_{ij}^2}{B_{ij}}\right)$$
$$\frac{1}{\sigma_{z_2}^2} = \sum_{ij} w_i(x_1)\, w_j(x_2)\left(B_{ij} - \frac{C_{ij}^2}{A_{ij}}\right) \tag{31}$$

$w_i(x_1) \in [0, 1]$ and $w_j(x_2) \in [0, 1]$ are weighting functions with $\sum_i w_i(x_1) = 1$ and $\sum_j w_j(x_2) = 1$.
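The rule base (30)/(31) is in effect an interpolation of precomputed consequent values A_ij, B_ij, C_ij with normalized membership weights. The sketch below illustrates the mechanism with triangular memberships and made-up grid values (all names and numbers are ours):

```python
import numpy as np

def tri_weights(grid, x):
    """Normalized triangular membership weights over a 1-D grid
    (piecewise-linear interpolation weights that sum to 1)."""
    w = np.zeros(len(grid))
    i = int(np.clip(np.searchsorted(grid, x) - 1, 0, len(grid) - 2))
    t = (x - grid[i]) / (grid[i+1] - grid[i])
    w[i], w[i+1] = 1 - t, t
    return w

# Coarse grids for x1, x2 and made-up rule consequents A_ij, B_ij, C_ij
g1 = np.array([-1.0, 0.0, 1.0])
g2 = np.array([-1.0, 0.0, 1.0])
A = np.array([[2.0, 2.5, 3.0], [2.2, 2.8, 3.2], [2.5, 3.0, 3.5]])
B = A.T.copy()
C = 0.1 * np.ones((3, 3))

x1, x2 = 0.25, -0.5
w1, w2 = tri_weights(g1, x1), tri_weights(g2, x2)
W = np.outer(w1, w2)                          # w_i(x1) * w_j(x2), sums to 1

rho       = -np.sum(W * C / np.sqrt(A * B))   # Eq. (31)
inv_sz1sq =  np.sum(W * (A - C**2 / B))
inv_sz2sq =  np.sum(W * (B - C**2 / A))
sz1, sz2  = 1/np.sqrt(inv_sz1sq), 1/np.sqrt(inv_sz2sq)
```

In a real application the grid values A_ij, B_ij, C_ij would be computed off-line from (20) at the grid points, so that the on-line work reduces to the weighted sums above.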


5 Extension to Six Inputs and Two Outputs

5.1 General Approach

In the previous section we dealt with two orientation inputs and two intersection position outputs, where the position coordinates of robot and human are assumed to be constant. Consider again the nonlinear system

$$\mathbf{x}_c = F(\mathbf{x}) \tag{32}$$

where F denotes a nonlinear system. Here we have 6 inputs $\mathbf{x} = (x_1, x_2, x_3, x_4, x_5, x_6)^T$ and 2 outputs $\mathbf{x}_c = (x_c, y_c)^T$. For the intersection problem we get $\mathbf{x} = (\varphi_R, \varphi_H, x_R, y_R, x_H, y_H)$. Let further the uncorrelated Gaussian distributed inputs $x_1 \ldots x_6$ be described by the 6-dim density

$$f_{\mathbf{x}} = \frac{1}{(2\pi)^{6/2}|S_x|^{1/2}}\exp\left(-\frac{1}{2}\,\mathbf{e}_x^T S_x^{-1}\mathbf{e}_x\right) \tag{33}$$

where $\mathbf{e}_x = (e_{x_1}, \ldots, e_{x_6})^T$, $\mathbf{e}_x = \mathbf{x} - \bar{\mathbf{x}}$, $\bar{\mathbf{x}}$ is the mean of $\mathbf{x}$, and $S_x$ is the covariance matrix

$$S_x = \begin{pmatrix} \sigma_{x_1}^2 & 0 & \dots & 0 \\ 0 & \sigma_{x_2}^2 & \dots & 0 \\ \dots & \dots & \dots & \dots \\ 0 & \dots & 0 & \sigma_{x_6}^2 \end{pmatrix}$$

The output density is again described by

$$f_{x_c,y_c} = \frac{1}{2\pi\sigma_{x_c}\sigma_{y_c}\sqrt{1-\rho^2}}\cdot\exp\left(-\frac{1}{2(1-\rho^2)}\left(\mathbf{e}_{x_c}^T S_c^{-1}\mathbf{e}_{x_c} - \frac{2\rho\, e_{x_c} e_{y_c}}{\sigma_{x_c}\sigma_{y_c}}\right)\right) \tag{34}$$

where $\rho$ is the correlation coefficient, $\mathbf{e}_{x_c} = (e_{x_c}, e_{y_c})^T$ and

$$S_c^{-1} = \begin{pmatrix} \frac{1}{\sigma_{x_c}^2} & 0 \\ 0 & \frac{1}{\sigma_{y_c}^2} \end{pmatrix} \tag{35}$$

In correspondence to (6) and (9), function F can be described by

$$x_c = f_1(\mathbf{x}), \qquad y_c = f_2(\mathbf{x}) \tag{36}$$


Furthermore, according to (13) we have

$$\mathbf{e}_{x_c} = \tilde{J} \cdot \mathbf{e}_x \tag{37}$$

with

$$\tilde{J} = \begin{pmatrix} \tilde{J}_{11} & \tilde{J}_{12} & \dots & \tilde{J}_{16} \\ \tilde{J}_{21} & \tilde{J}_{22} & \dots & \tilde{J}_{26} \end{pmatrix} \tag{38}$$

where

$$\tilde{J}_{ij} = \frac{\partial f_i}{\partial x_j}, \quad i = 1, 2,\ j = 1, \ldots, 6 \tag{39}$$

Inversion of (38) leads to

$$\mathbf{e}_x = \tilde{J}^{t} \cdot \mathbf{e}_{x_c} = J \cdot \mathbf{e}_{x_c} \tag{40}$$

with the pseudo inverse $\tilde{J}^{t}$ of $\tilde{J}$. Renaming $\tilde{J}^{t}$ into $J$ we get

$$J = \begin{pmatrix} J_{11} & J_{12} \\ \dots & \dots \\ J_{61} & J_{62} \end{pmatrix} \tag{41}$$

Substituting (37) into (33) we obtain

$$f_{x_c,y_c} = K_{x_c}\exp\left(-\frac{1}{2}\,\mathbf{e}_{x_c}^T J^T S_x^{-1} J\,\mathbf{e}_{x_c}\right) \tag{42}$$

where $K_{x_c}$ represents a normalization of the output density and

$$J_{x_c} = J^T S_x^{-1} J = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \tag{43}$$

where

$$A = \sum_{i=1}^{6}\frac{1}{\sigma_{x_i}^2}J_{i1}^2;\qquad B = \sum_{i=1}^{6}\frac{1}{\sigma_{x_i}^2}J_{i1}J_{i2};\qquad C = \sum_{i=1}^{6}\frac{1}{\sigma_{x_i}^2}J_{i1}J_{i2};\qquad D = \sum_{i=1}^{6}\frac{1}{\sigma_{x_i}^2}J_{i2}^2$$

Substitution of (43) into (42) leads, with $B = C$, to

$$f_{x_c,y_c} = K_{x_c}\exp\left(-\frac{1}{2}\left(A e_{x_c}^2 + D e_{y_c}^2 + 2C e_{x_c} e_{y_c}\right)\right) \tag{44}$$

Comparison of (44) with (34) leads with (35) to

$$\rho = -\frac{C}{\sqrt{AD}};\qquad \frac{1}{\sigma_{x_c}^2} = A - \frac{C^2}{D};\qquad \frac{1}{\sigma_{y_c}^2} = D - \frac{C^2}{A} \tag{45}$$

which is the counterpart to the 2-dim input case (22).
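The six-input chain (40)–(45) can be sketched with a Moore-Penrose pseudo-inverse; the Jacobian and the input stds below are illustrative placeholders of ours, not values from the paper:

```python
import numpy as np

# Illustrative 2x6 Jacobian J~ (outputs: xc, yc; inputs: phiR, phiH, xR, yR, xH, yH)
Jt = np.array([[2.0, -1.5, 0.6, -0.3,  0.4, 0.3],
               [1.0,  2.5, 0.2,  0.5, -0.3, 0.6]])
sx = np.array([0.02, 0.02, 0.05, 0.05, 0.05, 0.05])  # input stds

J = np.linalg.pinv(Jt)                  # 6x2 pseudo-inverse, Eq. (40)
Jxc = J.T @ np.diag(1/sx**2) @ J        # Eq. (43): [[A, B], [C, D]]
A, D, C = Jxc[0,0], Jxc[1,1], Jxc[0,1]

rho = -C/np.sqrt(A*D)                   # Eq. (45)
sxc = 1/np.sqrt(A - C**2/D)
syc = 1/np.sqrt(D - C**2/A)
```

Since the pseudo-inverse has full column rank here, J_xc is positive definite and the variances in (45) come out positive; Sect. 5.3 below explains why this least-squares route can nevertheless give inconsistent stds when orientation and position inputs are mixed.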

5.2 Fuzzy Approach

The fuzzy approach is similar to the 2-input 2-output case: the first step is to compute values $A_i$, $B_i$ and $C_i$ from (43) at predefined positions/orientations $\mathbf{x} = (x_1, x_2, x_3, x_4, x_5, x_6)_i^T$. Then we formulate fuzzy rules $R_i$ according to (30) and (31), with $i = 1 \ldots n$, where l is the number of fuzzy terms, k = 6 the number of variables, and $n = l^k$ the number of rules:

$$R_i:\quad IF\ \mathbf{x} = \mathbf{X}_i\quad THEN\quad \rho_{z_{12}} = -\frac{C_i}{\sqrt{A_iB_i}}\ AND\ \frac{1}{\sigma_{z_1}^2} = A_i - \frac{C_i^2}{B_i}\ AND\ \frac{1}{\sigma_{z_2}^2} = B_i - \frac{C_i^2}{A_i} \tag{46}$$

where $\mathbf{X}_i$ are fuzzy terms for $\mathbf{x}_i$. From this set of rules we get again (31):

$$\rho_{z_{12}} = -\sum_i w_i(\mathbf{x})\,\frac{C_i}{\sqrt{A_iB_i}};\qquad \frac{1}{\sigma_{z_1}^2} = \sum_i w_i(\mathbf{x})\left(A_i - \frac{C_i^2}{B_i}\right);\qquad \frac{1}{\sigma_{z_2}^2} = \sum_i w_i(\mathbf{x})\left(B_i - \frac{C_i^2}{A_i}\right) \tag{47}$$

$w_i(\mathbf{x}) = \prod_{l=1}^{6} w_i(x_l)$, where $w_i(x_l) \in [0, 1]$ are weighting functions with $\sum_i w_i(x_l) = 1$.

The challenge of this approach is the number of rules needed. Even for the 2-inputs 2-outputs case, the increase of the resolution for the same range of φ R and φ H yields an increase of the number of rules. Using 7 membership functions for φ R and φ H


Fig. 2 Fuzzy sectors, extracted from [14]

each, we obtain 49 rules. Doubling the number of membership functions leads to 196 rules. To avoid an "explosion" of the number of rules to be processed at the same time, a number of sub-areas, each with a small set of rules, is formulated. Then, depending on the measurements of φ_R and φ_H, an appropriate sub-area is selected and the corresponding set of rules is activated (see Fig. 2, sub-areas A_R, A_H). Thus the total number of rules could increase whereas the number of rules to be processed for a calculation would remain low. To avoid abrupt changes at the borderlines between the sub-areas, an overlap of these regions is recommended. Unfortunately, for 6 inputs one faces an exponential increase in the number of rules, associated with a very high computational burden. For l = 7 fuzzy terms for each of the k = 6 input variables we end up with n = 7^6 = 117,649 rules, which is much too high. A limitation to an appropriate number of variables at the input of a fuzzy system can be obtained either heuristically or systematically by finding the most influential input variables [8] (see Fig. 2).

5.3 The Energetic Approach

Simulations have shown that this method works well under certain conditions regarding the stds of the orientations and positions of robot and human. However, it appears that a mixture of orientation angles and robot/human positions leads to inconsistent results for the stds $\sigma_{x_c}$ and $\sigma_{y_c}$ of the intersection coordinates $\mathbf{x}_c$. The reason is the use of the pseudo inverse for the transformation $\mathbf{e}_x = \tilde{J}^{t}\cdot\mathbf{e}_{x_c}$ in (40), a least-squares approximation, which definitely leads to uncertainties. A way out is to compute the variations of the intersection position from the robot/human orientations $\mathbf{x}_{c\varphi,tot}$ and from the robot/human positions $\mathbf{x}_{cRH,tot}$ separately:

$$\sigma_{x_c\varphi,tot}^2 = \sigma_{x_c\varphi}^2 + \sigma_{y_c\varphi}^2, \qquad \sigma_{x_cRH,tot}^2 = \sigma_{x_cRH}^2 + \sigma_{y_cRH}^2 \tag{48}$$

Then both variations are summarized to

$$\sigma_{x_c tot}^2 = \sigma_{x_c\varphi,tot}^2 + \sigma_{x_cRH,tot}^2 \tag{49}$$

Here it should be observed that the variance of a noise error signal $\mathbf{e}_{x_c}$ represents the noise energy per sample [7]. Due to the law of preservation of energy, the energy of the 'input' signal $\mathbf{e}_x$ should be the same as that of the output $\mathbf{e}_{x_c}$, because the transformation $\mathbf{e}_{x_c} = \tilde{J}\cdot\mathbf{e}_x$ neither feeds energy in nor takes it away. Following this idea, it is obvious to use the resulting standard deviation

$$\sigma_{x_c tot} = \sqrt{\sigma_{x_c\varphi,tot}^2 + \sigma_{x_cRH,tot}^2} \tag{50}$$

as a measure for the uncertainty of the intersection coordinates in the case of given orientation/position uncertainties of robot and human. The input energy $E_{in}$ is composed of the energy $E_\varphi$ from the orientation angles and $E_{RH}$ from the positions of robot and human:

$$E_{in} = E_\varphi + E_{RH} \tag{51}$$

where

$$E_\varphi = r_{x_R}^2\cdot\sigma_{\varphi_R}^2 + r_{x_H}^2\cdot\sigma_{\varphi_H}^2 \tag{52}$$

with $r_{x_R}$ the distance robot-intersection and $r_{x_H}$ the distance human-intersection. Further,

$$E_{RH} = \sigma_{x_R}^2 + \sigma_{y_R}^2 + \sigma_{x_H}^2 + \sigma_{y_H}^2 \tag{53}$$

The average standard deviation of the input signal reads

$$\sigma_{in} = \sqrt{E_\varphi + E_{RH}} \tag{54}$$

Since

$$\sigma_{in} = \sigma_{x_c tot} \tag{55}$$

we know the 'output' average standard deviation just from the 'input' average standard deviation, on the condition that we know the intersection coordinate $\mathbf{x}_c$.
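The energetic bookkeeping (51)–(55) reduces to a few lines; all numbers below are placeholders of our own choosing:

```python
import math

# Illustrative uncertainties (ours): orientation stds [rad], position stds [m]
s_phiR, s_phiH = 0.02, 0.03
s_xR = s_yR = s_xH = s_yH = 0.05
r_R, r_H = 4.0, 3.0            # distances robot/human -> intersection

E_phi = r_R**2 * s_phiR**2 + r_H**2 * s_phiH**2   # Eq. (52)
E_RH  = s_xR**2 + s_yR**2 + s_xH**2 + s_yH**2     # Eq. (53)
sigma_in = math.sqrt(E_phi + E_RH)                # Eq. (54)
sigma_xc_tot = sigma_in                           # Eq. (55): energy balance
```

Note how the orientation terms are scaled by the squared distances to the intersection, which is why orientation noise dominates at long range.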

6 Mixed Gaussian Distributions

Gaussian input signals at nonlinear or fuzzy systems with large stds do not usually lead to Gaussian output signals. Therefore we approximate a distribution with a large standard deviation by several distributions with small stds. In this connection, fuzzy systems are linearized around the mean values of these distributions. The following analysis shows that the previous analytical approach and the fuzzy approximation also apply to mixed Gaussian distributions. Consider an example of a mixture of two distributions/densities $f_{xy1}$ and $f_{xy2}$:

$$f_{xy1} = \frac{1}{2\pi\sigma_{x_1}\sigma_{y_1}}\exp\left(-\frac{1}{2}\left(\frac{e_{x_1}^2}{\sigma_{x_1}^2} + \frac{e_{y_1}^2}{\sigma_{y_1}^2}\right)\right) \tag{56}$$

$$f_{xy2} = \frac{1}{2\pi\sigma_{x_2}\sigma_{y_2}}\exp\left(-\frac{1}{2}\left(\frac{e_{x_2}^2}{\sigma_{x_2}^2} + \frac{e_{y_2}^2}{\sigma_{y_2}^2}\right)\right) \tag{57}$$

that are linearly combined:

$$f_{xy} = a_1 f_{xy1} + a_2 f_{xy2} \tag{58}$$

with $a_i \ge 0$ and $\sum_i a_i = 1$, where $i = 1, 2$ and

$$e_{x_1} = x_1 - \bar{x}_1;\quad e_{x_2} = x_2 - \bar{x}_2;\quad e_{y_1} = y_1 - \bar{y}_1;\quad e_{y_2} = y_2 - \bar{y}_2$$

$\bar{x}_i$, $\bar{y}_i$ are the mean values of $x_i$, $y_i$. The partial outputs yield

$$f_{z_1,z_2}^i = \frac{1}{2\pi\sigma_{z_1}^i\sigma_{z_2}^i\sqrt{1-\rho_i^2}}\cdot\exp\left(-\frac{1}{2(1-\rho_i^2)}\left(\frac{(e_{z_1}^i)^2}{(\sigma_{z_1}^i)^2} + \frac{(e_{z_2}^i)^2}{(\sigma_{z_2}^i)^2} - \frac{2\rho_i e_{z_1}^i e_{z_2}^i}{\sigma_{z_1}^i\sigma_{z_2}^i}\right)\right) \tag{59}$$

$e_{z_1}^i = z_1 - \bar{z}_1^i$; $e_{z_2}^i = z_2 - \bar{z}_2^i$; $\rho_i$ is the correlation coefficient. From this we finally obtain the output distribution

$$f_{z_1,z_2} = \sum_{i=1}^{2} a_i f_{z_1,z_2}^i \tag{60}$$

The mixed output distribution $f_{z_1,z_2}$ is a linear combination of partial output distributions $f_{z_1,z_2}^i$ resulting from the input distributions $f_{x,y}^i$. Given the means $\bar{z}_k^i$, $k = 1, 2$, and variances $(\sigma_{z_k}^i)^2$ of the partial output distributions $f_{z_1,z_2}^i$, the mean and variance of the mixed output distribution are

$$\bar{z}_k = \sum_{i=1}^{2} a_i \bar{z}_k^i \tag{61}$$

$$\sigma_{z_k}^2 = a_1(\sigma_{z_k}^1)^2 + a_2(\sigma_{z_k}^2)^2 + a_1 a_2(\bar{z}_k^1 - \bar{z}_k^2)^2$$

from which we obtain the standard deviation $\sigma_{z_k}$ of the intersection straightforwardly.

characteristic of the trajectories (form of paths, velocities) change of intersection positions during motion filtering of the positions/orientations of robot and human change of the errors of positions/orientations during motion.

Without restriction of generality we assume robot and human to move on lines with constant velocities and orientations since every smooth trajectory can be approximated by piecewise linear trajectories with regarding constant but usually different velocities. Positions, orientations and their velocities are corrupted with system noise and measurement noise. Therefore an appropriate discrete Kalman filter method is applied. For robot/human we have 6 inputs x(k) = (x, y, φ, vx , v y , ω)T each. Starting with the system equations state equation :

x(k) = A(k − 1) · x(k − 1) + w(k − 1)

measur ement equation : y(k) = C(k) · x(k) + v(k) covariance matri x o f system noise : E(ww T ) = Q covariance matri x o f measur ement noise : E(vv T ) = R ˆ − 1) old estimation : x(k err or : e = x − xˆ

(62)

old err or covariance matri x : P(k − 1) = E(e · eT ) A—system matrix, C—output matrix, w—system noise, v—measurement noise. Next, the two steps ’prediction’ and ’correction’ of the Kalman filter algorithm follow:

Fuzzy Geometric Approach to Collision Estimation …


Prediction. This step comprises the extrapolation of the state x(k) based on the previous estimate x̂(k−1):

x*(k) = A(k−1)·x̂(k−1)   (63)

Furthermore, an extrapolation of the matrix P(k−1) follows:

P*(k) = Q(k−1) + A(k−1)·P(k−1)·A(k−1)^T   (64)

Correction. Computation of the Kalman filter gain K(k) based on P*(k):

K(k) = P*(k)·C(k)^T·(C(k)·P*(k)·C(k)^T + R(k))^{−1}   (65)

Computation of the new error covariance matrix P(k) using the Kalman filter gain K(k) and P*(k):

P(k) = (I − K(k)·C(k))·P*(k)   (66)

Finally we obtain the new estimate

x̂(k) = x*(k) + K(k)·(y(k) − C(k)·x*(k))   (67)

The discrete Kalman filter is now applied to the trajectories x(k) of the robot and the human to see how the intersection position x_c evolves. Both the noise of the robot/human positions and the noise of the orientations have an impact on the noise of the intersection; however, changes in orientation due to noise have a greater influence on the intersection than those of the positions. Therefore, an appropriate Kalman filter is definitely needed. In this connection, the combination of positions and orientation angles in a common Kalman filter is of great advantage and will be used in our analysis. The quality of the filter is evaluated by measuring the standard deviations of the noise at the intersection over different segments T1, T2, ... of the trajectories, which is necessary because of the different distances of the acting agents (robot and human) to the possible intersection of their trajectories (see Fig. 3). The structure of a trajectory of either the robot or the human is described by Eq. (62). With a discrete time step Δt = 1 the corresponding matrices read

A(k) =
[ 1 0 0 Δt 0  0
  0 1 0 0  Δt 0
  0 0 1 0  0  Δt
  0 0 0 1  0  0
  0 0 0 0  1  0
  0 0 0 0  0  1 ] ;
C(k) =
[ 1 0 0 0 0 0
  0 1 0 0 0 0
  0 0 1 0 0 0 ]   (68)


Fig. 3 Computation of the intersection during motion at different time sequences Ti

Q(1) = diag(σ_x², σ_y², σ_φ², σ_vx², σ_vy², σ_ω²);  Q(k > 1) = diag(0, 0, 0, σ_vx², σ_vy², σ_ω²)   (69)

R(k) = diag(σ_{ν,x}², σ_{ν,y}², σ_{ν,φ}²)   (70)
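The prediction/correction cycle of Eqs. (63)-(67), together with the constant-velocity matrices of Eqs. (68)-(69), can be sketched as follows. The noise magnitudes are hypothetical choices in the spirit of Example 2 below, not values from the paper.

```python
import numpy as np

dt = 1.0  # discrete time step

# constant-velocity model for the state x = (x, y, phi, vx, vy, omega)
A = np.eye(6)
A[0, 3] = A[1, 4] = A[2, 5] = dt
C = np.zeros((3, 6))
C[0, 0] = C[1, 1] = C[2, 2] = 1.0

Q = np.diag([0.0, 0.0, 0.0, 0.03**2, 0.05**2, 0.05**2])  # Q(k > 1), hypothetical
R = np.diag([0.03**2, 0.05**2, 0.05**2])                 # measurement noise, hypothetical

def kalman_step(x_hat, P, y):
    """One prediction/correction cycle, Eqs. (63)-(67)."""
    x_pred = A @ x_hat                                      # (63) state extrapolation
    P_pred = Q + A @ P @ A.T                                # (64) covariance extrapolation
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)  # (65) Kalman gain
    P_new = (np.eye(6) - K @ C) @ P_pred                    # (66) covariance update
    x_new = x_pred + K @ (y - C @ x_pred)                   # (67) state update
    return x_new, P_new

# track a straight-line trajectory disturbed by measurement noise
rng = np.random.default_rng(0)
x_true = np.array([2.0, 0.0, 1.78, 0.1, 0.3, 0.0])
x_hat, P = x_true.copy(), np.eye(6)
for k in range(30):
    x_true = A @ x_true
    y = C @ x_true + rng.normal(0.0, [0.03, 0.05, 0.05])
    x_hat, P = kalman_step(x_hat, P, y)
```

After 30 steps the filtered pose estimate stays close to the true pose, which is the effect exploited in Table 3.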

Figures 4 and 5 show the areas of intersection with the reference trajectories of robot and human (red lines) and the corresponding estimated trajectories from Eq. (67) of the Kalman filter. In addition, we see the results for the time sequences T1 (k = 1...10, red dots), T2 (k = 11...20, black dots), and T3 (k = 21...30, blue crosses). The results for the non-filtered case are quite scattered, whereas the Kalman-filtered case yields much better results for the estimated positions of the intersection.
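The closed-form mapping from the two poses to the intersection (Eq. (4) of the paper) is not reproduced in this excerpt; as a substitute, the generic two-line intersection below illustrates the geometric model, evaluated at the nominal poses of Sect. 8.

```python
import math

def intersection(x_r, y_r, phi_r, x_h, y_h, phi_h):
    """Intersection of two straight-line paths, each given by a
    position and a heading angle."""
    d = math.sin(phi_h - phi_r)
    if abs(d) < 1e-12:
        raise ValueError("parallel paths: no unique intersection")
    # distance along the robot's path to the crossing point
    t = ((x_h - x_r) * math.sin(phi_h) - (y_h - y_r) * math.cos(phi_h)) / d
    return x_r + t * math.cos(phi_r), y_r + t * math.sin(phi_r)

# nominal poses of Sect. 8: robot at (2, 0) with phi = 1.78 rad,
# human at (4, 10) with phi = 3.69 rad
xc, yc = intersection(2.0, 0.0, 1.78, 4.0, 10.0, 3.69)
```

For these poses the crossing point lies near (0.4, 7.8), consistent with the mean values reported for the mixed-Gaussian analysis below.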


Fig. 4 Intersection—no Kalman filter

Fig. 5 Intersection—with Kalman filter


Fig. 6 Sector size: 60◦ , extracted from [18]

8 Simulation Results

Gaussian Input Distributions. The simulation results show the possibility of predicting uncertainties at possible intersections by using analytical and/or fuzzy models for a static situation in which robot and human are not moving (see Fig. 1). Position and orientation of robot and human are given by x_R = (x_R, y_R) = (2, 0) m and x_H = (x_H, y_H) = (4, 10) m, with φ_R = 1.78 rad (= 102°) and φ_H = 3.69 rad (= 212°). φ_R and φ_H are corrupted with Gaussian noise with standard deviations (std) of σ_{φR} = σ_{x1} = 0.02 rad (= 1.1°). The fuzzy approach is compared with the analytical non-fuzzy approach using partitions of 60°, 30°, 15°, and 7.5° of the unit circle for the orientation angles; see Table 1 and Figs. 6, 7, 8 and 9. Notation in Table 1: σ_{z1}c means std computed, σ_{z1}m means std measured, etc. The numbers show two qualitative results:

1. Higher resolutions lead to better results.
2. The performance with regard to a comparison between measured and computed values depends on the shape of the membership functions (mf's). Lower input stds (0.02 rad) require Gaussian mf's, higher input stds (0.05 rad = 2.9°) require Gaussian bell-shaped mf's, which can be explained by different smoothing effects (see columns 4 and 5 in Table 1).

Results 1 and 2 can be explained by a comparison of the corresponding control surfaces and the measurements (black and red dots); see Figs. 10, 11, 12, 13 and 14. Figure 10 displays the control surfaces of x_c and y_c for the analytical case (4). The control surfaces of the fuzzy approximations (5) (see [14]) are depicted in Figs. 11,


Fig. 7 Sector size: 30◦ , extracted from [18]

Fig. 8 Sector size: 15◦ , extracted from [18]


Fig. 9 Sector size: 7.5◦ , extracted from [18]

12, 13 and 14. Starting from the resolution of 60° (Fig. 11) we see a very large difference compared to the analytic approach (Fig. 10), which decreases progressively down to the resolution of 7.5° (Fig. 14). This explains the large differences in stds and correlation coefficients, in particular for sector sizes 60° and 30°.

Energetic Approach. In the following we concentrate on 3 examples with the following notation (see Table 2):

stds from robot/human orientation plus position (from (45)): σxc; σyc
sum of orientation and position, computed (from (50)): σsum1
sum of orientation and position, measured: σxr; σyr; σsum2
sum of orientation and position, measured and calculated (from (40)): σtot,end
input energy (from (54)): σs,tot

Example 1. This is a special case where all input standard deviations are equal.
Positions/orientations of robot/human: xR = 2; yR = 0; xH = 4; yH = 10; φR = 100° + ν_{φ,R}; φH = 212° + ν_{φ,H}.
Input stds: σ_{φ,R} = 0.02; σ_{φ,H} = 0.02; σ_{x,R} = 0.02; σ_{y,R} = 0.02; σ_{x,H} = 0.02; σ_{y,H} = 0.02.
In this example we see a close match between the stds σxc and σyc computed from (45) and the measured stds σxr and σyr. In addition, we have a close match between σsum1, σsum2, σtot,end, and σs,tot.


Table 1 Standard deviations, fuzzy and non-fuzzy results, extracted from [18]

| Input std (mf shape) | 0.02 (GB) | 0.02 (GB) | 0.02 (GB) | 0.02 (GB) | 0.02 (Gauss) | 0.05 (GB) |
|---|---|---|---|---|---|---|
| Sector size | 60° | 30° | 15° | 7.5° | 7.5° | 7.5° |
| Non-fuzz σz1c | 0.143 | 0.140 | 0.138 | 0.125 | 0.144 | 0.366 |
| Fuzz σz1c | 0.220 | 0.184 | 0.140 | 0.126 | 0.144 | 0.367 |
| Non-fuzz σz1m | 0.160 | 0.144 | 0.138 | 0.126 | 0.142 | 0.368 |
| Fuzz σz1m | 0.555 | 0.224 | 0.061 | 0.225 | 0.164 | 0.381 |
| Non-fuzz σz2c | 0.128 | 0.132 | 0.123 | 0.114 | 0.124 | 0.303 |
| Fuzz σz2c | 0.092 | 0.087 | 0.120 | 0.112 | 0.122 | 0.299 |
| Non-fuzz σz2m | 0.134 | 0.120 | 0.123 | 0.113 | 0.129 | 0.310 |
| Fuzz σz2m | 0.599 | 0.171 | 0.034 | 0.154 | 0.139 | 0.325 |
| Non-fuzz ρz12c | 0.576 | 0.541 | 0.588 | 0.561 | 0.623 | 0.669 |
| Fuzz ρz12c | −0.263 | 0.272 | 0.478 | 0.506 | 0.592 | 0.592 |
| Non-fuzz ρz12m | 0.572 | 0.459 | 0.586 | 0.549 | 0.660 | 0.667 |
| Fuzz ρz12m | 0.380 | 0.575 | 0.990 | 0.711 | 0.635 | 0.592 |

(GB: Gaussian bell-shaped membership functions.)

Fig. 10 Control surface non-fuzzy (x_c and y_c over φ_R and φ_H), extracted from [18]

Fig. 11 Control surface fuzzy, 60°, extracted from [18]

Fig. 12 Control surface fuzzy, 30°, extracted from [18]

Fig. 13 Control surface fuzzy, 15°, extracted from [18]

Fig. 14 Control surface fuzzy, 7.5°, extracted from [18]


Table 2 Energy approach, comparisons

| Stds/examples | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| σxc | 0.1295 | 0.2191 | 0.2377 |
| σyc | 0.1190 | 0.2783 | 0.4428 |
| σsum1 | 0.1759 | 0.3543 | 0.6539 |
| σxr | 0.1297 | 0.2266 | 0.2487 |
| σyr | 0.1172 | 0.2638 | 0.6032 |
| σsum2 | 0.1748 | 0.3478 | 0.6524 |
| σtot,end | 0.1738 | 0.3433 | 0.6447 |
| σs,tot | 0.1658 | 0.3323 | 0.6084 |

Example 2. This example deals with slightly different input stds:
σ_{φ,R} = 0.03; σ_{φ,H} = 0.05; σ_{x,R} = 0.03; σ_{y,R} = 0.03; σ_{x,H} = 0.05; σ_{y,H} = 0.05.
In this example we also see a quite close coincidence between the stds σxc, σyc and σxr, σyr mentioned above.

Example 3. This example deals with larger differences between the input stds:
σ_{φ,R} = 0.03; σ_{φ,H} = 0.12; σ_{x,R} = 0.03; σ_{y,R} = 0.03; σ_{x,H} = 0.03; σ_{y,H} = 0.03.
In this example we also see a fairly close match between σsum1, σsum2, σtot,end, and σs,tot, but not between the stds σxc, σyc computed from (45) and the measured stds σxr and σyr. This result shows the advantage of the "energetic approach": the resulting standard deviation gained from the output "energy" is an average value for the intersection error that results from position/orientation errors of robot and human.

Mixed Gaussian Distributions. Due to larger uncertainties of the orientations of robot and human, we assume the input signals to be a mixture of two Gaussian distributions with the following parameters:

φ̄_{R1} = 1.779 rad (102°), σ_{φR1} = 0.02 rad
φ̄_{H1} = 3.698 rad (212°), σ_{φH1} = 0.02 rad
φ̄_{R2} = 1.762 rad (101°), σ_{φR2} = 0.03 rad
φ̄_{H2} = 3.716 rad (213°), σ_{φH2} = 0.03 rad
σ_{z1}^1 = 0.1309 rad; σ_{z2}^1 = 0.1157 rad
σ_{z1}^2 = 0.2274 rad; σ_{z2}^2 = 0.1978 rad

The following computed non-fuzzy and fuzzy (superscript F) values and measured values (superscript m) according to (61) show the correctness of the previous analysis for the analytical case.


Fig. 15 Mixed Gaussian, input, extracted from [18]

z̄_1 = 0.487; z̄_1^F = 0.413; z̄_1^m = 0.485
z̄_2 = 7.746; z̄_2^F = 7.737; z̄_2^m = 7.737
σ_{z1} = 0.222; σ_{z1}^F = 0.235; σ_{z1}^m = 0.199
σ_{z2} = 0.184; σ_{z2}^F = 0.184; σ_{z2}^m = 0.178

Figures 15 and 16 show the corresponding input and output densities, while Figs. 17 and 18 depict the scatter diagrams (cuts at certain density levels). It turns out that the fuzzy approximation is sufficiently accurate.

Robots and Humans in Motion. The following simulations show the impact of the Kalman filter on the prediction of a possible intersection between robot/human trajectories. As already mentioned in Sect. 7, we present the measured stds σxc, σyc and mean values x̄c, ȳc of the intersection coordinates, as well as the standard deviation σsum1 that follows from the "energy" value (50). We use Example 2 with the measurement-noise input stds σ_{φ,R} = 0.03; σ_{φ,H} = 0.05; σ_{x,R} = 0.03; σ_{y,R} = 0.03; σ_{x,H} = 0.05; σ_{y,H} = 0.05 along the time sequences T1 (k = 1...10), T2 (k = 11...20), and T3 (k = 21...30). These results are then compared with the non-filtered values. The results in Table 3 show much smaller stds of the intersection coordinates for the filtered case and therefore the great benefit of Kalman filtering for positions and orientation angles. We can also see a decrease of the stds for smaller distances between the intersection and the robot/human positions. Finally, Fig. 19 shows the efficiency of the Kalman filter applied to the orientation angle Φ_H of the human operator, by means of which the standard deviation can be suppressed to 50% of the non-filtered case.


Fig. 16 Mixed Gaussian, output, extracted from [18]

Fig. 17 Scatter diagram, mixed input, extracted from [18]



Fig. 18 Scatter diagram, mixed output, extracted from [18]

Table 3 Example 2, stds of 3 sequences

| Stds | T1 Kalman | T1 No Kalman | T2 Kalman | T2 No Kalman | T3 Kalman | T3 No Kalman |
|---|---|---|---|---|---|---|
| σxc | 0.0835 | 0.1755 | 0.0803 | 0.2949 | 0.0665 | 0.2106 |
| σyc | 0.2447 | 0.6840 | 0.1245 | 0.4184 | 0.1138 | 0.2208 |
| σsum1 | 0.2586 | 0.7062 | 0.1245 | 0.4184 | 0.1318 | 0.3051 |
| x̄c | −1.5597 | −1.4773 | −1.6186 | −1.6743 | −1.7138 | −1.6743 |
| ȳc | 6.5214 | 6.5113 | 6.3769 | 6.6380 | 6.6658 | 6.6380 |

9 Conclusions

This research work deals with the prediction of situations and scenarios between robots and humans in shared areas for collision avoidance, task planning, and control actions in the presence of uncertainties. The problem of computing intersections of human/robot trajectories is addressed, assuming that uncertainties of positions/orientations of human and robot are modeled by Gaussian noise. To this end, we proposed a transformation from the human/robot positions/orientations to intersection coordinates using a geometrical model and its TS fuzzy approximation. From measured 'input' uncertainties, represented by standard deviations of the positions/orientations of human and robot, the 'output' standard deviations of the intersection coordinates are calculated, whereas the nominal position/orientation


Fig. 19 Kalman filtering, orientation human

and disturbance parameters of robot and human are supposed to be known. This analysis and its fuzzy extension apply to the static and the dynamic case, provided that estimates of the positions of robot and human can be derived. The method is applied both to the 2-input/2-output case and to the 6-input/2-output case. In the dynamic case, Kalman filters are used for the estimation of robot/human positions and orientations and thus ultimately for the estimation of the intersection positions. For the overdetermined case, 6 inputs/2 outputs, we presented a so-called 'energetic' approach for the estimation of the intersection. The inverse task is the following: given the standard deviations of the intersection coordinates, find the corresponding input standard deviations of the orientations of robot and human. This problem is solved for the analytical and the fuzzy version of the 2-input case (orientations only). Large standard deviations of the orientation signals lead to the method of mixed Gaussian distributions. As a whole, the increased accuracy of human-robot pose estimation at small distances improves the system performance and human safety of human-robot collaboration, which will be used in factory workshops and for robots working in rescue operations in cooperation with human operators.

Acknowledgements This research work has been supported by the AIR project, Action and Intention Recognition in Human Interaction with Autonomous Systems.


References

1. Banelli, P.: Non-linear transformations of Gaussians and Gaussian-mixtures with implications on estimation and information theory. IEEE Trans. Inf. Theory (2013)
2. Bruce, J., Wawer, J., Vaughan, R.: Human-robot rendezvous by co-operative trajectory signals, pp. 1–2 (2015)
3. Firl, J.: Probabilistic maneuver recognition in traffic scenarios. Doctoral dissertation, KIT Karlsruhe (2014)
4. Fraichard, T., Paulin, R., Reignier, P.: Human-robot motion: taking attention into account. Research Report RR-8487 (2014)
5. Hellendoorn, H., Palm, R.: Fuzzy system technologies at Siemens R&D. Fuzzy Sets Syst. 63(3), 245–259 (1994)
6. Chen, J., Wang, C., Chou, C.: Multiple target tracking in occlusion area with interacting object models in urban environments. Robot. Auton. Syst. 103, 68–82 (2018)
7. Smith, J.O.: Signal metrics. In: Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications. Center for Computer Research in Music and Acoustics (CCRMA), Department of Music, Stanford University (2000)
8. Schäfer, J., Strimmer, K.: A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat. Appl. Genet. Mol. Biol. 4(1), Art. 32 (2005)
9. Khatib, O.: Real-time obstacle avoidance for manipulators and mobile robots. In: IEEE International Conference on Robotics and Automation, St. Louis, Missouri, pp. 500–505 (1985)
10. Foulloy, L., Galichet, S.: Fuzzy control with fuzzy inputs. IEEE Trans. Fuzzy Syst. 11(4), 437–449 (2003)
11. Hamid, O.H., Smith, N.L.: Automation, per se, is not job elimination: how artificial intelligence forwards cooperative human-machine coexistence. In: Proceedings IEEE 15th International Conference on Industrial Informatics (INDIN), pp. 899–904. IEEE, Emden, Germany (2017)
12. Palm, R., Chadalavada, R., Lilienthal, A.: Fuzzy modeling and control for intention recognition in human-robot systems. In: IJCCI (FCTA) 2016, Porto, Portugal (2016)
13. Palm, R., Iliev, B.: Learning of grasp behaviors for an artificial hand by time clustering and Takagi-Sugeno modeling. In: Proceedings FUZZ-IEEE 2006, IEEE International Conference on Fuzzy Systems. IEEE, Vancouver, BC, Canada (2006)
14. Palm, R., Lilienthal, A.: Fuzzy logic and control in human-robot systems: geometrical and kinematic considerations. In: WCCI 2018: 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 827–834. IEEE (2018)
15. Pota, M., Esposito, M., Pietro, G.D.: Transformation of probability distribution into a fuzzy set interpretable with likelihood view. In: IEEE 11th International Conference on Hybrid Intelligent Systems (HIS 2011), pp. 91–96. IEEE, Malacca, Malaysia (2011)
16. Palm, R., Driankov, D.: Tuning of scaling factors in fuzzy controllers using correlation functions. In: Proceedings FUZZ-IEEE'93. IEEE, San Francisco, California (1993)
17. Palm, R., Driankov, D.: Fuzzy inputs. Fuzzy Sets and Systems, special issue on modern fuzzy control, pp. 315–335 (1994)
18. Palm, R., Lilienthal, A.: Uncertainty and fuzzy modeling in human-robot navigation. In: Proceedings of the 11th International Joint Conference on Computational Intelligence (IJCCI 2019), pp. 296–305. SCITEPRESS, Vienna (2019)
19. Tahboub, K.A.: Intelligent human-machine interaction based on dynamic Bayesian networks probabilistic intention recognition. J. Intell. Robot. Syst. 45(1), 31–52 (2006)
20. Luo, W., Xing, J., Milan, A., Zhang, X., Liu, W., Zhao, X., Kim, T.: Multiple object tracking: a literature review. arXiv:1409.7618, pp. 1–18 (2014)
21. Yager, R., Filev, D.B.: Reasoning with probabilistic inputs. In: Proceedings of the Joint Conference of NAFIPS, IFIS and NASA, pp. 352–356. NAFIPS, San Antonio (1994)

Predicting Cardiovascular Death with Automatically Designed Fuzzy Logic Rule-Based Models Christina Brester , Vladimir Stanovov , Ari Voutilainen , Tomi-Pekka Tuomainen , Eugene Semenkin , and Mikko Kolehmainen

Abstract Predictive models are commonly used in epidemiological studies to estimate risks of illnesses. For knowledge-based models, the logic behind them is clear, whereas for automatically generated data-driven models it is not always transparent how they work. In this study, we applied an evolutionary approach to design a Fuzzy Logic Rule-based model that is easily interpretable compared to many other data-driven models. We utilized high-dimensional epidemiological data collected within the Kuopio Ischemic Heart Disease Risk Factor (KIHD) Study in 1984–1989 to train the model and predict cardiovascular death for middle-aged men by 2016. In multiple runs of 5-fold cross-validation, we evaluated the model performance and showed that it could achieve a higher true positive rate (TPR) than Random Forest and provide more stable results than Decision Tree. Also, the presented approach proved its effectiveness for high-dimensional samples: on the set of 653 predictors, we obtained 68% accuracy on average, whereas on the reduced set of 100 predictors, we could improve this result only up to 70%. Furthermore, this study introduces the most important predictor variables used in the generated Fuzzy Logic Rule-based model.

Keywords Fuzzy logic · Predictive modeling · Cardiovascular death · Evolutionary search · Population study

C. Brester (B) · M. Kolehmainen Department of Environmental and Biological Sciences, University of Eastern Finland, Yliopistonranta 1 E, 70210 Kuopio, Finland e-mail: [email protected] C. Brester · V. Stanovov · E. Semenkin Institute of Computer Science and Telecommunications, Reshetnev Siberian State University of Science and Technology, Krasnoyarsky Rabochy Avenue 31, 660037 Krasnoyarsk, Russia A. Voutilainen · T.-P. Tuomainen Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Yliopistonranta 1 C, 70210 Kuopio, Finland © Springer Nature Switzerland AG 2021 J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_9


C. Brester et al.

1 Introduction

Cardiovascular disease remains the leading cause of death worldwide [1]. However, many fatal events can be prevented if cardiovascular disease or its progression is diagnosed in advance. Therefore, epidemiologists and other scientists are making efforts to find a solution that can warn patients about cardiovascular risks in time. One of the possible preventive tools is a predictive model estimating the probability of disease occurrence in the future.

A number of models have been proposed so far to estimate cardiovascular risks. For example, Hippisley-Cox et al. [2] introduced a solution based on Cox proportional hazards models and used it to assess 10-year cardiovascular risks [2]. They predefined a list of predictors and included conventional risk factors such as age, systolic blood pressure, body mass index, total cholesterol, smoking, and so on. At the same time, an advanced paradigm called Precision Public Health claims that models and treatment should be customized, taking into account that the importance of risk factors varies from one population to another and from one individual to another [3]. It is believed that effective computational methods are able to discover unknown patterns in extensive datasets using up-to-date computing resources [4]. Therefore, data-driven models have great potential in this respect, as they are trained on data describing a particular population. In another example, random population samples from North Karelia, Kuopio province, and southwestern Finland were used to train logistic regression models estimating risks of coronary heart disease, stroke, and cardiovascular disease [5]. In that study, a limited set of risk factors was involved: age, smoking, total cholesterol, HDL-cholesterol, systolic blood pressure, diabetes, and positive family history. Although these variables are traditionally included in cardiovascular modeling, it is possible that this list of predictors lacks other relevant variables.

In this study, we utilize a high-dimensional dataset collected in the Kuopio Ischemic Heart Disease Risk Factor (KIHD) Study [6]. The data thoroughly describe a population of middle-aged men from Eastern Finland in 1984–1989. Selecting relevant risk factors from hundreds of clinical, biochemical, physiological, and socioeconomic measurements, the model is trained to predict cardiovascular death by 2016, which corresponds to a prediction horizon of about 30 years. We provide the learning algorithm with all 653 available predictors and let it choose the most relevant ones automatically.

Next, model interpretability is significant for epidemiological studies, since awareness of strong associations between risk factors and cardiovascular outcomes allows the application of preventive means and interventions. In this study, we use an evolutionary approach to train a Fuzzy Logic Rule-based predictive model that is highly interpretable. To compare the Fuzzy Rule-based model with other state-of-the-art data-driven models, we also present the results obtained with Decision Tree and Random Forest. In addition to investigating the model performance, we analyze variable importance and present a list of the twenty most important predictors based on the fuzzy rules generated and their usage.

Predicting Cardiovascular Death …


Also, the approach presented in this paper has already been tested on the same dataset when predicting coronary heart disease and its progression [7]. It has been shown that the automatically designed fuzzy rule base is compact and contains easily interpretable short rules.

This paper contains the following sections. In the Introduction, we explain our motivation and briefly describe the data and the proposed method. In Sect. 2, we give all the details of the approach developed to generate the Fuzzy Rule-based model. Next, in Sect. 3, we present the data used to train the predictive model. Section 4 contains the experimental results and discussion. Lastly, Sect. 5 includes the main conclusions.

2 Evolutionary Fuzzy Logic Rule-Based Predictive Modeling

The fuzzy set theory has found many applications in the area of machine learning, mostly for supervised learning. In combination with evolutionary computation techniques, these systems are referred to as Evolutionary Fuzzy Systems [8]. Among them, the fuzzy models consisting of a set of fuzzy rules are known as Fuzzy Rule-Based Systems (FRBS). FRBS are designed using an implementation of a genetic algorithm (GA) [9] with specific crossover and mutation operators; this class of algorithms is usually referred to as Genetic Fuzzy Systems (GFS) [10]. GAs, or possibly other evolutionary algorithms, are employed due to their ability to deal with large search spaces efficiently. Also, many algorithm versions incorporate a priori knowledge or extract it from the available data.

The flexibility of GFS allows efficient application in various areas; however, there are two main trends in fuzzy systems development. The first is usually called interpretable fuzzy systems: it focuses on creating fuzzy systems capable of building fuzzy rules which are easy to understand for human experts in the area of interest. These systems are usually relatively simple but lead to larger error values in most scenarios. The second trend is called accurate fuzzy systems: it focuses on generating more complex fuzzy systems, which are not always easy to interpret but capable of making precise predictions. Obviously, depending on the application area, one of these types of fuzzy systems should be considered, but a good tradeoff between them is usually desirable.

The Hybrid Evolutionary Fuzzy Classification Algorithm (HEFCA) used in this study was originally presented in [11] and further developed in [12]. The algorithm is based on an earlier study [13], and it implements a specific scheme to generate compact and accurate fuzzy rule bases. The HEFCA algorithm builds rules of the following form:

R_q: IF x_1 is A_{q1} and ... and x_n is A_{qn} THEN Class C_q with CF_q,   (1)


Fig. 1 Fuzzy term granulation [7]

where R_q is the q-th fuzzy rule, x_p = (x_{p1}, ..., x_{pn}) is one of the m training patterns in n-dimensional space, A_{qi} is a fuzzy set for the i-th variable, C_q is the class number, and CF_q is the rule weight. The training sample is normalized to [0, 1], and the same linear transformation coefficients are used for the test set. In the general case, this means that after the transformation some values may fall outside the [0, 1] interval; such cases are still correctly processed by the classifier. The product operator was used to calculate the membership value for each pattern. The generated fuzzy logic predictive model relies on fixed fuzzy terms for the input variables, introducing four granulations into 2, 3, 4, and 5 fuzzy terms of triangular shape and an additional "Don't Care" condition (DC), required to simplify the rules. Figure 1 shows all the fuzzy terms, which are used for each input variable at the same time. The HEFCA algorithm has previously been modified to handle missing values, so that these values are treated as a DC condition during the fuzzy inference [14].

The main HEFCA steps are as follows:

(1) Sample-based initialization;
(2) Selection (tournament or rank-based);
(3) Crossover;
(4) Mutation (3 levels);
(5) Michigan part (genetic or heuristic);
(6) Operator probability adaptation;
(7) Stopping criterion check; return to step 2 if generations remain.
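The fixed granulation (2 + 3 + 4 + 5 = 14 triangular terms plus DC) and the product-based rule membership can be sketched as follows. The term encoding used here (0 for DC, 1-14 for the triangular terms) is an assumption about the chromosome layout, not taken from the paper.

```python
def triangular_centers(n):
    """Centers of n evenly spaced triangular fuzzy terms on [0, 1]."""
    return [i / (n - 1) for i in range(n)]

def membership(x, center, width):
    """Symmetric triangular membership function."""
    return max(0.0, 1.0 - abs(x - center) / width)

# the 14 fixed terms: granulations into 2, 3, 4 and 5 triangular terms
TERMS = []
for n in (2, 3, 4, 5):
    width = 1.0 / (n - 1)
    TERMS += [(c, width) for c in triangular_centers(n)]

def rule_membership(pattern, rule):
    """Product of term memberships over a rule antecedent; 0 encodes DC."""
    mu = 1.0
    for x, term in zip(pattern, rule):
        if term == 0:          # "Don't Care": the variable is skipped
            continue
        center, width = TERMS[term - 1]
        mu *= membership(x, center, width)
    return mu
```

A rule consisting only of DC conditions matches every pattern with membership 1, which is why DC terms keep rules short and general.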

The sample-based initialization used randomly chosen instances from the training sample to generate realistic rules. In this procedure, for each variable in the rule, one of the 14 fuzzy terms is chosen with a probability proportional to the membership


function value for this particular term. After this, every term was replaced by the DC condition with a probability of 0.9. The quality of each generated rule was estimated using the confidence value:

Conf(A_q → Class k) = ( Σ_{x_p ∈ Class k} μ_{A_q}(x_p) ) / ( Σ_{p=1}^{m} μ_{A_q}(x_p) ),   (2)

where A_q is the left part of the q-th rule, k is the class number, and μ_{A_q}(x_p) is the membership value for the input pattern x_p. The class number corresponding to the newly generated rule was determined as the class having the highest confidence. The weight of each rule was estimated as:

CF_q = 2 · Conf(A_q → Class k) − 1,   (3)

so that a confidence of 1 is transformed to a weight equal to 1, and a confidence of 0.5 to zero weight. If the generated rule had a confidence lower than 0.5, the rule was generated again until a valid rule was obtained. This filtering of the rules was shown to be highly competitive in [15]. Unlike the previous works [12], in this version of HEFCA the loss value was applied in the fitness estimation. Here the fitness of an individual is calculated as follows:

Fitness_i = 5000·Loss_i/N + 5000·Error_i/N + NR_i + Len_i,   (4)

where Error_i is the number of incorrectly classified instances, NR_i is the number of rules, and Len_i is the total number of non-empty predicates in all rules of a rule base, i = 1, ..., NP, where NP is the population size and N is the sample size. Loss_i is calculated as:

Loss_i = Σ_{j=1}^{N} (1 − μ_{A_w}(x_j)·CF_w),   (5)

where w is the index of the winner rule, i.e. the rule having the largest weighted membership value when applied to an instance. Adding the loss value to the fitness makes the algorithm sensitive not only to the number of correctly or incorrectly classified instances, but also to how confident the classifier is about its final decision. The number of rules was limited by NR_max, and during the initialization step the rule base was filled with NR_max/2 rules. The linear rank selection and the tournament selection with a tournament size of 5 were applied in the algorithm. A specific crossover operator was used, in which the newly generated offspring had a random number of rules from 1 to min(|S_1| + |S_2|, NR_max), where |S_i| is the size of the i-th rule base. For the new rule base, the rules were chosen randomly from either the first or the second parent.
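The rule-quality and fitness computations of Eqs. (2)-(4) can be sketched with toy numbers (all values below are hypothetical, not taken from the KIHD experiments):

```python
def rule_confidence(memberships, labels, k):
    """Confidence of 'antecedent -> class k', Eq. (2): the share of the
    total membership mass contributed by patterns of class k."""
    total = sum(memberships)
    if total == 0.0:
        return 0.0
    return sum(m for m, y in zip(memberships, labels) if y == k) / total

def rule_weight(conf):
    """Rule weight CF, Eq. (3); rules with conf < 0.5 are regenerated."""
    return 2.0 * conf - 1.0

def fitness(error, loss, n_rules, total_len, N):
    """Rule-base fitness, Eq. (4): lower is better."""
    return 5000.0 * loss / N + 5000.0 * error / N + n_rules + total_len

# memberships of five training patterns w.r.t. one rule antecedent
mu = [0.9, 0.7, 0.1, 0.0, 0.3]
y = [1, 1, 2, 2, 1]
conf = rule_confidence(mu, y, k=1)   # ~0.95, so the rule predicts class 1
cf = rule_weight(conf)               # ~0.9
f = fitness(error=12, loss=40.0, n_rules=8, total_len=19, N=600)
```

The large factor 5000/N makes classification quality dominate, while NR_i and Len_i act as a mild complexity penalty favoring compact, short rules.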


The mutation operator changed every term number in the rule base to a randomly chosen one in the range [0, 14], including DC conditions, with three probability levels: 1/(3|S|) (low), 1/|S| (average), and 3/|S| (strong). The Michigan part was applied to the rule base after mutation. In this step the rule base was considered as a population of a genetic algorithm. The fitness value of a rule was estimated as the number of instances correctly classified with this specific rule. If two rules were identical, only one of them received a fitness value. Three types of the Michigan part were applied: adding new rules, deleting the worst rules, or replacing the worst rules with newly generated ones. The number of rules to be added, removed, or replaced was estimated as the rounded value of |S|/5, but the total number of rules was limited by NR_max. Generating new rules was performed in two ways: in the first case, new rules were generated using the same heuristic as for initialization, i.e. using the training sample instances, while in the second case they were generated with genetic operators, namely the tournament selection, the uniform crossover, and the average mutation. To choose among the variants of the presented genetic operators, the self-configuration scheme originally described in [16] was applied. A probability value was assigned to each operator and initially set to 1/z, where z is the number of operators of a particular type, for example, 3 levels of mutation, 2 types of selection, and 2 types of the Michigan part. The success of each operator type was estimated using the averaged fitness values:

AvgFit_i = ( Σ_{j=1}^{n_i} f_{ij} ) / n_i,  i = 1, 2, ..., z,   (6)

where f_ij is the fitness of the j-th offspring generated with the i-th operator type, and n_i is the number of offspring generated with the i-th operator. The operator with the highest average fitness was considered the winning operator: its probability p_i was increased by 0.5(z − 1)/(zN_g), while the probabilities of the other operators were decreased by 0.5/(zN_g), where N_g is the total number of generations. The probability of applying each operator could not be decreased below 0.05. The self-configuration procedure was applied to the two selection types, the three mutation types, and the two types of generating new rules in the Michigan part, i.e., heuristic and genetic. The training of the HEFCA algorithm was further improved by an instance selection method using the balancing strategy described in [12]. The instance selection creates a subsample of the original training sample in which the number of instances belonging to each class is as balanced as possible, to prevent the negative effects of imbalanced datasets and to speed up the search process. Each instance in the training set received a probability of being chosen that depended on how difficult it was to classify in previous adaptation periods, when the subsample changed. At the beginning, all instances received equal counter values set to U_i = 1, i = 1, …, N. After several generations, 25 in this study (the adaptation period), the whole sample is classified by the best rule base, and only instances used during the actual training, i.e. the subsample, get updated counters. If an instance i was classified correctly, then U_i = U_i + 1, otherwise U_i = 1. Next, the probabilities of all instances being
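The operator-probability adaptation around Eq. (6) can be sketched as follows. The increments come directly from the text (the winner gains 0.5(z − 1)/(zN_g), each other operator loses 0.5/(zN_g), floored at 0.05); the final renormalization and all names are assumptions of this sketch.

```python
def update_probabilities(probs, avg_fitness, n_generations, p_min=0.05):
    """probs: current probability per operator variant; avg_fitness: AvgFit_i
    for each variant in this generation (Eq. (6)). The winner's probability
    grows, the others shrink, and none may fall below p_min."""
    z = len(probs)
    winner = max(range(z), key=lambda i: avg_fitness[i])
    delta = 0.5 / (z * n_generations)
    new = []
    for i, p in enumerate(probs):
        if i == winner:
            new.append(p + delta * (z - 1))
        else:
            new.append(max(p_min, p - delta))
    # Renormalize so that the variants still form a probability distribution
    # (only needed when the p_min floor was triggered).
    s = sum(new)
    return [p / s for p in new]

z = 3                               # e.g. the three mutation levels
probs = [1.0 / z] * z               # initial probabilities 1/z
probs = update_probabilities(probs, avg_fitness=[0.4, 0.7, 0.5],
                             n_generations=500)
```

Over many generations, consistently successful operator variants accumulate probability mass while unsuccessful ones decay toward the 0.05 floor.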

Predicting Cardiovascular Death …

229

selected into the subsample for the next adaptation period are recalculated as p_i = (1 / U_i) / Σ_{j=1}^{N} (1 / U_j), so that instances with larger counters (easy to classify) are chosen with smaller probability. In addition, all the rule bases are tested on the whole training set at the end of the adaptation period so as not to lose the best solution. As mentioned above, the balancing strategy also keeps track of probabilities for each class independently and constructs the training set to be as balanced as possible. The adaptive instance selection procedure implements two main principles: the exploration of previously unknown areas of the feature space, and the use of information about classification quality to build a better separation between classes. A more detailed description of the HEFCA algorithm can be found in [12].
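The counter-based sampling can be sketched as below. This is an illustrative reconstruction (not the code of [12]), assuming counters grow while an instance keeps being classified correctly, which is the reading consistent with the selection formula above.

```python
def update_counters(counters, correct):
    """correct[i] is True if instance i was classified correctly by the best
    rule base at the end of the adaptation period; counters grow while an
    instance keeps being classified correctly and reset when it is missed."""
    return [u + 1 if ok else 1 for u, ok in zip(counters, correct)]

def selection_probabilities(counters):
    """p_i = (1/U_i) / sum_j (1/U_j): larger counters -> smaller p_i."""
    inv = [1.0 / u for u in counters]
    total = sum(inv)
    return [v / total for v in inv]

counters = [1, 1, 1, 1]                      # U_i = 1 initially
counters = update_counters(counters, [True, True, True, False])
probs = selection_probabilities(counters)    # the misclassified instance
                                             # now has the largest probability
```

Easy instances thus gradually leave the subsample, steering the search toward the instances that are still hard to separate.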

3 Data Description

To train predictive models in this study, we used the epidemiological data collected in Eastern Finland, where the population is known to have a high risk of coronary heart disease [6]. The data has been gathered within the Kuopio Ischemic Heart Disease Risk Factor (KIHD) Study, which is a large-scale ongoing epidemiological project launched in 1984. The KIHD data has been involved in hundreds of different studies so far. Most of them present associations between particular risk factors and cardiovascular outcomes [17]. In addition to primary cardiovascular outcome measures such as progression of chronic diseases (atherosclerosis) and sudden incidents (stroke), the cohort contains information about diabetes, dementia, and cancer, which have been studied as well [18]. Lastly, both cardiovascular and non-cardiovascular deaths are included in the KIHD outcomes. The use of KIHD in data-driven predictive modeling has been described in several studies, which have introduced some limitations of the data and revealed the performance of the models trained [7]. These studies have applied k-fold cross-validation to estimate the model accuracy objectively. It has been found that for many experiments, accuracy has not exceeded 70%, which is still relatively low. Another issue is related to interpreting complex data-driven models and understanding interconnections among risk factors and their influence on the output. Therefore, interpretable models such as Decision Trees and Fuzzy Logic Rule-based models are of special interest. In the current study, we used an excerpt from a baseline examination held in 1984–1989 as a set of predictor variables. This is an extensive set of clinical, physiological, and biochemical measurements augmented with information from questionnaires describing subjects' health behavior from socioeconomic, physical, and psychological perspectives.
Cardiovascular death, which was selected as the output variable, was monitored up to 2016, so that the prediction horizon equaled about 30 years. The baseline examination involved 2 682 middle-aged men (42, 48, 54, and 60 years old) whose health state in 1984–1989 was characterized with thousands


of predictor variables. About 10% of those predictors were preselected for our experiments; therefore, in the beginning, we had 950 of them. Then, this raw data was preprocessed in the following way:

(1) We excluded predictor variables with more than 5% of gaps (i.e. missing values).
(2) We excluded subjects with more than 5% of gaps.
(3) Categorical variables were transformed into binary dummies.
(4) Remaining gaps were filled with the nearest neighbor imputation method [19].
(5) To handle competing risks, subjects who died of any non-cardiovascular cause before 2016 were filtered out.
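Steps (1) and (2) above, i.e. dropping variables and then subjects with more than 5% of gaps, can be sketched on a toy table; the variable names and values below are made up, and the remaining gaps would in practice be filled by the nearest neighbor imputation of [19].

```python
def missing_fraction(values):
    """Fraction of None (missing) entries in a list of values."""
    return sum(v is None for v in values) / len(values)

def drop_sparse_columns(rows, max_gap=0.05):
    """Keep only variables whose share of gaps is at most max_gap."""
    cols = rows[0].keys()
    keep = [c for c in cols
            if missing_fraction([r[c] for r in rows]) <= max_gap]
    return [{c: r[c] for c in keep} for r in rows]

def drop_sparse_rows(rows, max_gap=0.05):
    """Keep only subjects whose share of gaps is at most max_gap."""
    return [r for r in rows
            if missing_fraction(list(r.values())) <= max_gap]

# 3 subjects x 2 hypothetical variables; 'chol' is missing for 2 of 3
# subjects (67% > 5%), so the variable is dropped before any subject is.
data = [{"sbp": 120, "chol": None},
        {"sbp": 135, "chol": None},
        {"sbp": 150, "chol": 5.2}]
data = drop_sparse_rows(drop_sparse_columns(data))
```

Dropping sparse variables first matters: removing the mostly-empty column here rescues subjects that would otherwise fail the row-wise threshold.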

This preprocessing led us to the sample of 1 797 subjects and 653 predictors: 671 subjects died from cardiovascular disease by 2016, whereas 1 126 subjects stayed alive. The next section describes experiments and introduces results of modeling using the data presented.

4 Experiments and Results

To estimate the model performance, we applied 5-fold cross-validation with stratification. Moreover, to mitigate the effect of the random division of the sample into training and test parts, we repeated the 5-fold cross-validation 50 times. Interval estimates (presented as boxplots) and point estimates (averaged values) of accuracy, true positive rate (TPR), and true negative rate (TNR) were obtained based on these multiple runs. To contrast the results of HEFCA with other state-of-the-art models and learning algorithms, we present the performance of Decision Tree, which is highly interpretable, and Random Forest, which has been shown to be an effective model that allows variable importance estimation [20, 21]. Decision Tree and Random Forest have been thoroughly investigated on the same data in a study where the performance of these models was evaluated for a maximum tree depth varying from 2 to 15 (unpublished results, 2021) [22]. For the current comparison, Decision Tree with a maximum tree depth of 3 and Random Forest with a maximum tree depth of 13 were selected as the ones with the highest performance. HEFCA was applied, first, to the whole data set with 653 predictors. Since HEFCA has its own mechanism to filter out uninformative variables, it can be used without any additional variable selection methods. Nevertheless, a lower-dimensional space of relevant predictors allows the search algorithm to design a better model more easily. Therefore, as an additional experiment, we utilized the importance of each variable estimated with Random Forest and took the 100 most important variables to train the Fuzzy Rule-based model. On the one hand, this experiment helped us to assess the effectiveness of HEFCA on high-dimensional data. It also showed how the performance increased when the data dimensionality was reduced.
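The evaluation protocol, stratified 5-fold cross-validation repeated 50 times, can be sketched in pure Python; in practice a library routine such as scikit-learn's RepeatedStratifiedKFold would typically be used, and the round-robin dealing below is only one simple way to stratify.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions preserved
    in every fold by dealing each class's shuffled indices round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

# 50 repetitions of 5-fold CV on a toy 40/60 class split.
labels = ["CVD"] * 40 + ["noCVD"] * 60
n_splits = 0
for rep in range(50):
    for train, test in stratified_kfold(labels, k=5, seed=rep):
        n_splits += 1
```

Every repetition uses a different shuffle, so the 250 resulting accuracy/TPR/TNR values sample the variability caused by the random split, which is exactly what the boxplots summarize.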
On the other hand, utilizing variable importance from Random Forest and training the Fuzzy Rule-based model on the preselected set of predictors allow the representation of


complex interconnections from the ensemble of trees in a simpler form. However, we do not focus on this point in the current report. In all the experiments conducted, the following parameters of HEFCA were used: the population size was 100, the number of generations was 500, and the maximum number of rules was 40. The performance of the Fuzzy Logic Rule-based model is demonstrated in Fig. 2. We present not only accuracy but also TPR and TNR, since the data is unbalanced and models tend to label subjects as “no CVD death” more often. This can be seen from the Random Forest performance on the test data: despite having the highest accuracy of all models, it reached the poorest TPR (a cut-off needs adjustment). From a practical point of view, TPR is much more important than TNR, since missing subjects who are at risk might have more serious consequences than checking false positive ones. In this respect, the Fuzzy Logic Rule-based model trained on the reduced set of predictor variables (100 instead of 653) achieved the best result. Next, comparing the two Fuzzy Logic Rule-based models (Fig. 2), we may note that variable preselection contributed to the model performance to some extent. However,

Fig. 2 Model performance. This figure shows accuracy, TPR and TNR reached with every model in multiple runs of cross-validation


the difference is relatively small, which means that HEFCA is rather effective for high-dimensional predictor spaces too. Another point in favor of both Fuzzy Logic Rule-based models is that the variance of TPR and TNR is smaller than that demonstrated by the Decision Tree model, which is known to be sensitive to minor changes in the data [23]. It may be argued that the average accuracy of 70%, which was achieved with the Fuzzy Logic Rule-based model on the reduced set of predictors, is still unsatisfactory. On the other hand, such a long prediction horizon complicates modeling because the values of predictor variables may change greatly over time and, consequently, the relevance of predictors diminishes. Also, the opportunity to execute variable selection automatically on the extensive set of predictors allows finding non-trivial associations between risk factors and cardiovascular outcomes. Therefore, we performed an analysis of variable usage to look closer at the most informative predictors of the study. For this purpose, we investigated the Fuzzy Logic Rule-based models trained on the whole set of 653 predictors. First, the best rule bases at the end of each run were extracted and applied to the training and test sets. The number of successful usages (correctly classified instances) and the number of unsuccessful usages of every rule were recorded, as well as the number of usages of every term of every variable. After this, the number of successful usages of every variable, i.e. the number of times the variable participated in a rule which made a correct classification, was calculated based on the success rate of every rule. The successful usage rates of every variable were averaged over all algorithm runs and sorted. The 20 variables used most successfully on the training and test datasets are shown in Figs. 3 and 4, together with their successful usage rates.
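The usage analysis described above can be sketched as follows; the rule encoding (a list of variables plus a correct-classification count per rule) and the variable names are simplifying assumptions of this sketch.

```python
from collections import defaultdict

def successful_usage(runs):
    """runs: list of rule bases, one per algorithm run; each rule is a pair
    (variables, n_correct). Every variable appearing in a rule is credited
    with that rule's number of correct classifications; totals are then
    averaged over runs and sorted from most to least successfully used."""
    totals = defaultdict(float)
    for rule_base in runs:
        for variables, n_correct in rule_base:
            for v in variables:
                totals[v] += n_correct
    return sorted(((t / len(runs), v) for v, t in totals.items()),
                  reverse=True)

# Two toy runs with hypothetical variable names.
runs = [
    [(["sbp", "smoking"], 120), (["marriage"], 40)],   # run 1
    [(["sbp"], 100), (["smoking", "marriage"], 60)],   # run 2
]
ranking = successful_usage(runs)   # sbp averages (120 + 100) / 2 = 110
```

Averaging over all 50 × 5 runs is what separates genuinely informative predictors from variables that were picked up by chance in a single run.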
The analysis of the most informative predictors was done separately for the training and test datasets. As we can see, many of the variables in Figs. 3 and 4 are the same. Nevertheless, as shown in the previous study [7], the KIHD sample is heterogeneous; therefore, the most successfully used rules on the training and test data might vary. Generally, both lists of variables look reasonable: there are predictors describing cardiovascular problems in the past, systolic blood pressure, medication taken, smoking, and dietary habits. Interestingly, both lists also include socioeconomic variables such as marriage, unemployed persons in the family, and change of employer. This stresses the complex nature of CVD, and shows that different kinds of risk factors, including socioeconomic ones, matter when predicting cardiovascular death. All in all, the presented variables are not just accidentally selected predictors; they are the most successfully used predictors in 50 independent runs of 5-fold cross-validation. Examples of the fuzzy rules generated with HEFCA are shown in Fig. 5. These rules were obtained using the whole sample of 653 predictor variables. As can be noted in Fig. 5, the rules are compact and easily interpretable. For instance, the first rule applies to those subjects who work a lot and consume little linolenic acid and not much other fat. These subjects are predicted to be at risk if the rule has the highest weight. However, not just one “winner” rule might be taken into account, but also other rules indicating “CVD death”, to analyze potential


Fig. 3 The top 20 variables included in the best fuzzy rule bases and ordered according to their successful usage on the training data

risks. Also, rules that describe the “no CVD death” group would be important as well since they reveal the ways to protect subjects from cardiovascular death.

5 Conclusions

In this study, we showed that Fuzzy Logic Rule-based models might be designed using just the collected data and applied in cardiovascular predictive modeling. We developed an evolution-based approach to iteratively improve a number of fuzzy rule bases and, at the end of training, choose the best one as the final model. To avoid overfitting, we limited the number of rules in the bases and their length, which also led to compact rules. The built-in variable selection allowed us to apply this approach straightforwardly to the high-dimensional epidemiological dataset. The data collected in Eastern Finland was utilized to train a model predicting cardiovascular death for middle-aged men with a horizon of 30 years. The generated Fuzzy Logic Rule-based model was properly validated and compared with other traditional data-driven models, in particular with Decision Tree and Random Forest. According to the experimental results, in comparison with Random Forest, the Fuzzy Logic Rule-based model showed higher performance considering the balanced


Fig. 4 The top 20 variables included in the best fuzzy rule bases and ordered according to their successful usage on the test data

TPR and TNR values. Also, this model demonstrated more stable results than Decision Tree in terms of TPR and TNR. Nevertheless, the estimated accuracy was 68% and 70% on average on the whole set of 653 predictors and on the reduced set of 100 predictors, respectively, which still needs improvement in the future. Next, we introduced the most important predictors based on successful variable usage and found that many of them were meaningful with regard to the outcome variable, i.e. cardiovascular death. Moreover, while analyzing the rules themselves we could see that they contain the whole explanation of the prediction made. This is particularly important for epidemiological studies, wherein thorough model understanding is required.


Fig. 5 Examples of the fuzzy rules generated with HEFCA on the whole set of 653 predictors. The color red means “CVD death”, the color blue denotes “no CVD death”

Acknowledgements The reported study was funded by the Russian Foundation for Basic Research, the Government of Krasnoyarsk Territory, and the Krasnoyarsk Regional Fund of Science, research project 18-41-242011 “Multi-objective design of predictive models with compact interpretable structures in epidemiology”.

References

1. World Health Organization: fact sheet ‘Cardiovascular diseases (CVDs)’, http://www.who.int/mediacentre/factsheets/fs317/en/. Last accessed 05 Jan 2020
2. Hippisley-Cox, J., Coupland, C., Brindle, P.: Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ 357, j2099 (2017)
3. Weeramanthri, T.S., Dawkins, H.J.S., Baynam, G., Bellgard, M., Gudes, O., Semmens, J.B.: Editorial: precision public health. Front Public Health 6(121) (2018). https://doi.org/10.3389/fpubh.2018.00121
4. Dolley, S.: Big Data’s role in precision public health. Front Public Health 6, 68 (2018). https://doi.org/10.3389/fpubh.2018.00068
5. Vartiainen, E., Laatikainen, T., Peltonen, M., Puska, P.M.: Predicting coronary heart disease and stroke: the FINRISK calculator. Glob Heart 11(2), 213–216 (2016)


6. Salonen, J.T.: Is there a continuing need for longitudinal epidemiologic research? The Kuopio Ischaemic Heart Disease Risk Factor Study. Ann Clin Res 20(1–2), 46–50 (1988)
7. Brester, Ch., Stanovov, V., Voutilainen, A., Tuomainen, T.-P., Semenkin, E., Kolehmainen, M.: Evolutionary fuzzy logic-based model design in predicting coronary heart disease and its progression. In: Proceedings of the 11th International Joint Conference on Computational Intelligence, vol. 1, pp. 360–366. FCTA, Vienna, Austria (2019). https://doi.org/10.5220/0008363303600366
8. Fazzolari, M., Alcala, R., Nojima, Y., Ishibuchi, H., Herrera, F.: A review of the application of multiobjective evolutionary fuzzy systems: current status and further directions. IEEE Trans. Fuzzy Syst. 21(1), 45–65 (2013)
9. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT Press (1992)
10. Herrera, F.: Genetic fuzzy systems: taxonomy, current research trends and prospects. Evol Intell 1, 27–46 (2008)
11. Stanovov, V., Semenkin, E., Semenkina, O.: Self-configuring hybrid evolutionary algorithm for fuzzy classification with active learning. In: IEEE Congress on Evolutionary Computation, CEC 2015, pp. 1823–1830 (2015)
12. Stanovov, V., Semenkin, E., Semenkina, O.: Self-configuring hybrid evolutionary algorithm for fuzzy imbalanced classification with adaptive instance selection. J Artif Intell Soft Comput Res 6(3), 173–188 (2016)
13. Ishibuchi, H., Mihara, S., Nojima, Y.: Parallel distributed hybrid fuzzy GBML models with rule set migration and training data rotation. IEEE Trans Fuzzy Syst 21(2) (2013)
14. Stanovov, V., Brester, C., Kolehmainen, M., Semenkina, O.: Why don’t you use evolutionary algorithms in big data? IOP Conf Ser: Mater Sci Eng 173(1) (2017). https://doi.org/10.1088/1757-899x/173/1/012020
15. Ishibuchi, H., Yamamoto, T.: Rule weight specification in fuzzy rule-based classification systems. IEEE Trans. Fuzzy Syst. 13(4), 428–435 (2005)
16. Semenkina, M., Semenkin, E.: Hybrid self-configuring evolutionary algorithm for automated design of fuzzy classifier. In: Tan, Y., Shi, Y., Coello, C.A.C. (eds.) Advances in Swarm Intelligence, PT1, LNCS vol. 8794, pp. 310–317 (2014)
17. Virtanen, J.K., Wu, J.H.Y., Voutilainen, S., Mursu, J., Tuomainen, T.P.: Serum n-6 polyunsaturated fatty acids and risk of death: the Kuopio Ischaemic Heart Disease Risk Factor Study. Am. J. Clin. Nutr. 107(3), 427–435 (2018). https://doi.org/10.1093/ajcn/nqx063
18. Ylilauri, M.P.T., Voutilainen, S., Lönnroos, E., Virtanen, H.E.K., Tuomainen, T.P., Salonen, J.T., Virtanen, J.K.: Associations of dietary choline intake with risk of incident dementia and with cognitive performance: the Kuopio Ischaemic Heart Disease Risk Factor Study. Am. J. Clin. Nutr. 110(6), 1416–1423 (2019). https://doi.org/10.1093/ajcn/nqz148
19. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001). https://doi.org/10.1093/bioinformatics/17.6.520
20. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
21. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/a:1010933404324
22. Brester, Ch., Voutilainen, A., Tuomainen, T.-P., Kauhanen, J., Kolehmainen, M.: Epidemiological predictive modeling: lessons learned from the Kuopio Ischemic Heart Disease Risk Factor Study (Unpublished, 2021)
23. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009)

Neural Computation Theory and Applications

Neural Models to Quantify the Determinants of Truck Fuel Consumption

Alwyn J. Hoffman and Schalk Rabé

Abstract Fuel cost is of critical importance to the profitability of road transport operators. In addition, the transport industry is a primary contributor towards harmful emissions. Earlier studies identified fuel economy and fuel shrinkage as two independent contributors towards fuel cost, and found that truck driver behaviour is an important determinant for both phenomena. We used a representative data set to extract regression and neural models for fuel economy and shrinkage, and used these models to remove the impact of factors not controlled by the driver, allowing us to measure driver performance more accurately. All models extracted demonstrated significant out-of-sample predictive ability. Neural models for fuel economy outperformed regression models. We verified the significance of compensating for factors not controlled by the driver by demonstrating large differences in driver fuel economy ranking before and after compensating for route inclination and payload. We found that supplier depot and driver are the primary factors related to shrinkage, and that a relatively small fraction of depots and drivers cause the majority of shrinkage. Compensating for non-driver factors however did not significantly change driver shrinkage ranking, in contrast to the results obtained for fuel economy.

Keywords Fuel economy · Truck driver · Performance benchmarking · Generalized regression neural network · Multilayer perceptron

1 Introduction

Road freight transport is an essential element of the global economy. This is specifically relevant in regions with limited availability of rail infrastructure [1]; for example, road transport is responsible for 76% of cargo movement in South Africa; this figure is even higher in other African countries [2]. The cost of transport in Africa is comparatively high as a fraction of the total cost of delivered goods—18% compared to a global average of less than 10% [3]. Fuel cost is the single biggest

A. J. Hoffman (B) · S. Rabé
School of Electrical, Electronic and Computer Engineering, North-West University, Potchefstroom, South Africa
e-mail: [email protected]

© Springer Nature Switzerland AG 2021
J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_10

239

240

A. J. Hoffman and S. Rabé

contributor to the cost of road transport operations, representing approximately 40% of operating costs [4]. Fuel economy is therefore a critical element to be managed by road freight transport operators to ensure continued profitability in a very competitive industry. Fuel shrinkage further adds to the fuel bill of road transport operators; Naidoo [4] estimated the average level of fuel theft at 15% or more. The contribution of the transport sector to greenhouse gas emissions has been widely researched and is estimated at around 29% of all emissions caused by human activities [5]. The transition to clean energy will be challenging for long haul freight trucks, due to the large distances covered by these vehicles. Heavy-duty vehicle GHG EPA regulations are projected to reduce CO2 emissions by about 270 million metric tons over the life of vehicles built under the EPA program, saving about 530 million barrels of oil [6]. Driver proficiency, payload and route inclinations are known to be the primary factors that influence truck fuel consumption [7–9]. Engine characteristics and driving style have also been found to play a major role [10, 11]. Another study applied a Big Data approach to large vehicle fleets driving on flat roads and at constant speeds [12], while further research investigated the use of telematics solutions to improve truck fuel consumption [13]. Various historical studies applied neural networks to model the fuel economy of trucks with the aim of finding the most accurate model for fuel economy in terms of the input factors mentioned above [14–17]. A critical aspect that has been overlooked is the accurate quantification of the driver's contribution. The behavior of the driver is the only factor that can be readily influenced to reduce emissions and fuel costs without negatively impacting the economic function fulfilled by transport.
This will however only be possible if the impact of factors like route inclinations and payload is removed before assessing the performance of the driver. In order to achieve the objective of accurate driver assessment, we will re-use the linear and nonlinear regression and neural techniques that model fuel economy for long haul freight trucks in terms of route inclination, payload and driver identity [18]. First, we will use these models to evaluate driver fuel economy performance after compensating for factors not controlled by the driver [18]. Second, we will add the extraction of fuel shrinkage models to support the management of this important contributor to effective fuel cost. The focus of this work is to investigate the hypothesis that factors not under the control of the truck driver, like route inclinations and payload differences, significantly influence the performance outcomes for truck drivers if not properly compensated for. To test this hypothesis we use regression and neural models that quantify the impact on fuel economy of factors not controlled by drivers, and remove the impact of such factors in order to arrive at a residual fuel economy and residual fuel shrinkage that is mainly determined by driver behavior. This approach should produce more reliable driver performance measures than simple averages of performance over all driver trips and should therefore enable objective assessment of driver performance. The rest of the paper is structured as follows: the collection of a representative set of fuel consumption data and the different routes covered by the available

Neural Models to Quantify the Determinants …

241

data set are described in Sect. 2. Statistical measures of fuel economy for the population as well as per route and driver are extracted in Sect. 3, to provide evidence of the need for a driver performance model. The extraction of empirical models that allow the isolation of the impact of the driver on fuel consumption is covered in Sect. 4. Section 5 estimates the impact of model compensation on driver performance measurement, while Sect. 6 extracts statistics about fuel shrinkage. Section 7 describes the extraction of fuel shrinkage models, and in Sect. 8 we conclude and make recommendations for future research.

2 Collection of Fuel Consumption and Input Factor Data

In order to develop reliable models it is necessary to generate representative fuel usage statistics on routes that include widely ranging inclinations. For this purpose we collected data over a period of two calendar years from a fleet of 468 vehicles that cover most of the major routes in Southern Africa, as displayed in Fig. 1. Incline data was extracted from Google Maps by using the route descriptions as defined by the set of GPS coordinates representing each route [20]. We identified categorical variables that have been proven to influence fuel economy (including driver ID and route) and categorized the data according to these variables. The collected data included GPS location, time and date, and the total amount of fuel used by the engine for the duration of a trip (defined as from switch-on to switch-off). We filtered out all trips with a trip distance shorter than 100 km, as very short trips have much lower fuel economy (measured in km/l) compared to the long haul trips that are the focus of this study. The dataset furthermore included the payload per trip. To identify occurrences of fuel theft we calculated the discrepancies between the amounts dispensed after each trip and the amounts of consumed fuel according to the on-board computer of the trucks for the same calendar period. As

Fig. 1 GPS crumb trail data of a typical truck from the data set


trucks were not always fully refuelled, some comparisons were invalid; we therefore removed those observations where significant negative discrepancy values were obtained.
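The trip filtering and discrepancy computation described above can be sketched as below; the field names and the threshold for a "significant" negative discrepancy are assumptions, since the paper does not state the exact cut-off.

```python
def clean_trips(trips, min_km=100.0, neg_tolerance=-5.0):
    """trips: dicts with distance_km, fuel_dispensed_l and fuel_consumed_l.
    neg_tolerance (litres) is a made-up threshold for what counts as a
    'significantly negative' discrepancy caused by partial refuelling."""
    cleaned = []
    for t in trips:
        if t["distance_km"] < min_km:
            continue                      # very short trips distort km/l
        discrepancy = t["fuel_dispensed_l"] - t["fuel_consumed_l"]
        if discrepancy < neg_tolerance:
            continue                      # invalid comparison, not shrinkage
        cleaned.append({**t, "shrinkage_l": max(discrepancy, 0.0)})
    return cleaned

# Toy trips: too short / plausible shrinkage / partially refuelled.
trips = [
    {"distance_km": 50, "fuel_dispensed_l": 30, "fuel_consumed_l": 28},
    {"distance_km": 600, "fuel_dispensed_l": 250, "fuel_consumed_l": 230},
    {"distance_km": 600, "fuel_dispensed_l": 200, "fuel_consumed_l": 230},
]
cleaned = clean_trips(trips)   # only the second trip survives
```

A positive surviving discrepancy (here 20 l) is the quantity later modelled as fuel shrinkage.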

3 Extracting Statistics for Route and Driver Fuel Economy

The statistics for the variables related to fuel economy, measured across 7,332 observations, were described in detail in earlier work [18]; we only repeat the most important results in this paper. The available data included observations for 21 different routes, most of which were frequently driven over the relevant period by a set of 331 drivers. In order to investigate the impact of route characteristics and driver behaviour, the available data set was categorized per route. Figure 2 displays the number of trips available per route as well as the average fuel economy per route, sorted from highest to lowest. It can be seen that the average fuel economy per route varies by almost a factor of two from the least to the most fuel efficient. Figure 3 displays the histogram of average fuel economy per driver across all routes. For

Fig. 2 Number of trips and average fuel economy per route [18]


Fig. 3 Histogram of average fuel economy per driver for all routes [18]

drivers the spread of averages is even wider than for routes; this may, however, be partly due to route inclination and payload variations. As could be expected, the variation in performance between drivers within a specific route is not quite as big as across all routes, as can be seen by studying the histograms of driver average fuel economy for a few individual routes in Fig. 4.

Fig. 4 Histograms of average fuel economy per driver for individual routes [18]


By first removing the impact of the route, we can quantify the potential for fuel economy improvement, should all drivers perform at the same level.

4 Extracting Empirical Fuel Economy Models

In a previous article [18] we described the extraction of linear fuel economy regression models, nonlinear regression models as well as various neural network models. This included a generalized regression neural network, which is a type of radial basis function network, as well as multi-layer perceptrons. We extracted models using all four modeling techniques (linear regression, nonlinear regression, GRNN and MLP NN) from the earliest 70% of all observations, and predicted fuel economy for the remaining 30% of observations. We first extracted models using driver, route and payload factors as inputs to allow comparison of our results with results from previous research. We selected input factors by ranking potential inputs based on the absolute value of the linear correlations between inputs and fuel economy, and only included input factors with a correlation coefficient of at least 0.1 with the model target. Once a ranked input factor had been selected, we only considered additional factors that had a correlation with already selected factors of less than 0.4, as the use of several highly correlated inputs results in unstable model parameters. The list of model parameters selected on this basis included Elevation Gain, Max RPM, Payload and Max Speed. Elevation Lost and some other factors were not selected based on their high correlations with Elevation Gain, which was selected first as it had the highest absolute correlation with fuel economy. The scatterplots of target vs output for the regression and neural models respectively, as displayed in Figs. 5 and 6, show that the model fits for the test sets are very similar to those for the training sets. This indicates that the models have good generalization capability. The neural models provide a superior fit of output to target compared to the regression models, while the GR neural network seems to be slightly superior to the MLP network.
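The input selection procedure above can be sketched as a greedy filter. This is a hedged reconstruction: the thresholds 0.1 and 0.4 come from the text, while the toy data and names are made up.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def select_inputs(candidates, target, r_target=0.1, r_mutual=0.4):
    """Rank candidates by |r| with the target, require |r| >= r_target, and
    skip candidates whose |r| with any already-selected input >= r_mutual."""
    ranked = sorted(candidates,
                    key=lambda c: abs(pearson(candidates[c], target)),
                    reverse=True)
    selected = []
    for c in ranked:
        if abs(pearson(candidates[c], target)) < r_target:
            break
        if all(abs(pearson(candidates[c], candidates[s])) < r_mutual
               for s in selected):
            selected.append(c)
    return selected

candidates = {
    "elev_gain": [1, 2, 3, 4],
    "elev_lost": [2, 4, 6, 7],    # highly correlated with elev_gain
    "payload":   [4, 1, 5, 2],
}
target = [10, 8, 6, 4]            # toy fuel economy values, km/l
chosen = select_inputs(candidates, target)
```

As in the text, elev_lost is rejected because of its high correlation with the already-selected elev_gain, while the weakly related payload survives.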
We confirmed these observations using correlation analysis. We calculated the correlations between model outputs and target variables for both the training and test sets, as displayed in Tables 1 and 2, to assess model accuracy. As expected, the models that include driver, route and payload inputs show the largest correlations between output and target. The observed relationships between fuel economy and the respective explanatory variables are strong and consistent, as most of the correlation obtained in the training set is still present in the test set. The nonlinear regression models perform slightly better than the linear regression models, while the neural models outperform the regression models for the general, the route & payload and the driver models alike. The driver behavioral model, which uses Max RPM, Max Brake and Max Speed as inputs, slightly outperforms the driver ID model, which uses driver identity as input.
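The chronological train/test evaluation described above can be sketched as follows (an ordinary least-squares stand-in for the paper's model suite; the function and variable names are illustrative assumptions):

```python
import numpy as np

def chronological_corr(X, y, train_frac=0.7):
    """Fit an ordinary least-squares model on the earliest train_frac of
    observations and report the correlation between model output and
    target on both the training set and the later, held-out test set.
    X: (n, p) array of inputs in time order; y: (n,) targets."""
    n_train = int(len(y) * train_frac)
    Xtr, Xte, ytr, yte = X[:n_train], X[n_train:], y[:n_train], y[n_train:]
    design = lambda M: np.c_[np.ones(len(M)), M]   # prepend intercept column
    coef, *_ = np.linalg.lstsq(design(Xtr), ytr, rcond=None)
    r_train = np.corrcoef(design(Xtr) @ coef, ytr)[0, 1]
    r_test = np.corrcoef(design(Xte) @ coef, yte)[0, 1]
    return r_train, r_test
```

Comparing `r_train` and `r_test` is the generalization check used throughout this section: a test correlation close to the training correlation indicates the model is not fitting noise.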

Neural Models to Quantify the Determinants …

245

Fig. 5 Scatterplots for linear and nonlinear regression targets and outputs [18]

5 Estimating Model Compensation Impact on Driver Performance Measurements

In order to measure driver performance more consistently we have to compensate for those factors over which the driver has no control. For this reason we calculated a compensated fuel economy figure for each trip by subtracting the route and payload fuel economy model output from the original fuel economy. We then added the population average fuel economy to this residual to obtain a fuel economy figure that is mostly attributable to driver behavior:

Driver fuel economy = Original fuel economy − Route & Cargo model output + Population average   (1)
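Eq. 1 amounts to a one-line vector operation over the per-trip observations (a sketch; the function name is ours):

```python
import numpy as np

def driver_fuel_economy(original_fe, route_cargo_output):
    """Eq. 1: subtract the route & cargo model output from the observed
    fuel economy and add back the population average, yielding a per-trip
    figure that is mostly attributable to driver behaviour."""
    original_fe = np.asarray(original_fe, dtype=float)
    residual = original_fe - np.asarray(route_cargo_output, dtype=float)
    return residual + original_fe.mean()
```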

Tables 3 and 4 display the correlations obtained between outputs and targets for the various route & payload models. As before we observe that neural models slightly

246

A. J. Hoffman and S. Rabé

Fig. 6 Scatterplots for GRNN and MLP neural network targets and outputs [18]

Table 1 Correlation coefficients between fuel economy model outputs and targets for the training set [18]

Inputs          LinRegr   NonLinR   GRNN    MLPNN
All Var         0,695     0,721     0,856   0,800
Route           0,627     0,660     0,740   0,735
Payload         0,174     0,184     0,221   0,257
Route&Payload   0,671     0,705     0,814   0,783
DriverBeh       0,381     0,381     0,392   0,400
DriverID        0,357     -         0,300   0,327


Table 2 Correlation coefficients between fuel economy model outputs and targets for the test set [18]

Inputs          LinRegr   NonLinR   GRNN    MLPNN
All Var         0,592     0,655     0,763   0,741
Route           0,607     0,636     0,710   0,706
Payload         0,180     0,202     0,240   0,282
Route&Payload   0,640     0,678     0,768   0,744
DriverBeh       0,139     0,159     0,315   0,341
DriverID        0,121     -         0,127   0,148

Table 3 Training set correlation coefficients between outputs and targets for models trained on the route & payload residual fuel economy [18]

Inputs      LinRegr   NonLinR   GRNN    MLPNN
DriverBeh   0,263     0,262     0,271   0,277
DriverID    0,435     0,067     0,368   0,414

Table 4 Test set correlation coefficients between outputs and targets for models trained on the route & payload residual fuel economy [18]

Inputs      LinRegr   NonLinR   GRNN    MLPNN
DriverBeh   0,024     0,044     0,173   0,200
DriverID    0,134     0,033     0,112   0,131

outperform the linear regression models, and that for the neural models a significant fraction of the correlation between output and target is retained in the test set.

To verify whether variations in performance for the same driver are reduced after compensating for the impact of route and payload, we calculated the standard deviation of uncompensated driver fuel economy averages over all drivers, and obtained a figure of 0.192 km/l. The compensated driver fuel economy in Eq. 1 above was used to calculate compensated driver averages. The standard deviation of compensated driver averages was then calculated as 0.158 km/l; as expected this is indeed lower than the figure before compensation. In Fig. 7 we compare uncompensated versus route and cargo compensated fuel economy histograms for a sample of drivers. The change in distribution is clearly visible; in cases where the average did not change much, as for driver 923, the spread became narrower as expected, due to removal of the impact of varying route inclinations and payloads.

To verify the impact of compensating for route and payload we calculated correlations between average driver fuel economy performance before and after compensation. Table 5 indicates that driver performance before and after route and payload compensation is negatively correlated. The fact that this is almost equally strong for


Fig. 7 Comparison of uncompensated and route and cargo compensated fuel economy histograms for different drivers [18]

Table 5 Correlations between compensated and uncompensated driver fuel economy performance [18]

Variable                             Train     Test
Route & cargo compensated           −0,598    −0,554
Driver, route & cargo compensated    0,007     0,240

the training and test sets provides evidence that it is not a result of model overfitting. We furthermore observe that when also removing the impact of driver ID the remaining correlation for the training set is almost zero, as the remaining model error will now have little resemblance to the original fuel economy. A small positive correlation remains for the test set as the models could not capture all variations present in the data; this is also to be expected as not all factors impacting fuel economy are present in the model (e.g. wind speed and traffic conditions).

To quantify the degree to which model compensation influences driver performance measures we calculated each driver's ranking compared to other drivers, first based on uncompensated and then based on compensated performance averages. For each driver the difference in ranking position was determined before and after model compensation; we normalized this change in ranking by division


through the total number of drivers. We then calculated the average absolute change in ranking differences over all drivers to obtain an overall figure of the degree to which ranking was impacted by performance compensation, as indicated in Eq. 2:

Ave Relative Ranking Change = (1/N) Σ_{k=1}^{N} |Ranking Change_k|   (2)

where N is the total number of drivers. This figure will be zero for no ranking changes and 0.5 for random changes to all driver rankings. To verify the consistency in driver performance over time, we first calculated the relative change in ranking between the training and test sets for both the uncompensated and compensated fuel economies, and obtained a relative ranking change of 0.27. This indicates that performance does change over time, but that it is not entirely random, with some level of consistency. We then proceeded to compare the ranking of driver performances between the case with no compensation and the case after model compensation. Tables 6 and 7 display the relative ranking changes for different compensation models for the training and test sets. The fact that the change in driver ranking before and after compensation is bigger than the difference of 0.27 observed between the training and test sets indicates that, over and above changes in performance over time, the model-based compensation results in a significant difference in driver ranking.

Table 6 Average relative change in driver performance ranking before and after compensation for the training set [18]

Inputs            LinRegr   NonLinR   GRNN    MLPNN
All Var           0,468     0,468     0,447   0,456
Route             0,477     0,469     0,465   0,466
Payload           0,493     0,489     0,495   0,494
Route & payload   0,479     0,471     0,472   0,473
DriverBeh         0,460     0,459     0,482   0,458
DriverID          0,343     0,494     0,500   0,411

Table 7 Average relative change in driver performance ranking before and after compensation for the test set [18]

Inputs            LinRegr   NonLinR   GRNN    MLPNN
All Var           0,471     0,468     0,462   0,462
Route             0,483     0,472     0,471   0,467
Payload           0,493     0,486     0,493   0,493
Route & payload   0,482     0,464     0,473   0,475
DriverBeh         0,470     0,467     0,477   0,472
DriverID          0,414     0,495     0,499   0,445
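The relative ranking change of Eq. 2 can be sketched as follows (a numpy illustration; the ranking is derived from any per-driver performance metric, and the function name is ours):

```python
import numpy as np

def avg_relative_ranking_change(metric_before, metric_after):
    """Eq. 2: rank the N drivers under both measures, normalise each
    driver's change in ranking position by N, and average the absolute
    normalised changes. 0 means no re-ranking, while values near 0.5
    indicate essentially random re-ranking."""
    rank_before = np.argsort(np.argsort(metric_before))  # positions 0..N-1
    rank_after = np.argsort(np.argsort(metric_after))
    n = len(rank_before)
    return np.abs(rank_before - rank_after).mean() / n
```

Fully reversing a ranking of four drivers, for instance, yields the maximal average relative change of 0.5.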


The scatterplot of driver rankings before and after compensation as displayed in Fig. 8 confirmed these results. The straight line in the middle of the graph may seem to indicate that a series of drivers have retained the same ranking before and after compensation. In fact, these are drivers with no trips in the test set and to whom we allocated average performance; they therefore assumed sequential positions in the ranking list. We then calculated the fraction of drivers for whom performance relative to the population average changed from positive to negative or vice versa after compensation. The total fraction of changes should be 0.5 if performance before and after model compensation is unrelated (e.g. random performance changes). In Table 8 we

Fig. 8 Comparison between driver ranking before and after compensating for route and cargo [18]

Table 8 Fraction of drivers with reverse in relative performance before and after compensation [18]

Inputs            LinRegr   NonLinR   GRNN    MLPNN
All Var           0,565     0,565     0,529   0,511
Route             0,577     0,565     0,544   0,532
Payload           0,601     0,592     0,592   0,583
Route & Payload   0,577     0,583     0,544   0,571
DriverBeh         0,565     0,553     0,577   0,562
DriverID          0,363     0,607     0,598   0,447


Table 9 Comparing different route & payload models based on difference in driver performance ranking after compensation [18]

Model type    LinRegr   NonLinR   GRNN    MLPNN
LinRegress    0,000     0,070     0,091   0,078
NonLinRegr    0,070     0,000     0,109   0,095
GRNN          0,091     0,109     0,000   0,059
MLPNN         0,078     0,095     0,059   0,000

Table 10 Comparison between route and payload models based on difference in driver performance ranking after compensation [18]

Model type
LinRegress      0,082
NonLinRegress   0,123
GRNN            0,100
MLPNN           0,121

observe that for the route and cargo model compensations the fraction of drivers with reversed relative performance is the largest. For the driver models the fraction of changes approaches 0.5, because the residues from these models are largely unrelated to driver identity and would therefore appear to be random.

We calculated the differences in average relative change in ranking between the models, to allow comparison of the impact of the different models on driver performance after correction. As these differences, displayed in Table 9, are fairly close to zero, they provide evidence that all the models largely agree in terms of the required changes in driver ranking. In those cases where a model was compared against itself a result of exactly zero was obtained. Table 10 displays the results of a similar comparison between route models and payload models that used the same modelling technique. As payload represents a smaller fraction of fuel economy changes compared to route inclination, it is less effective when used on its own. The differences are therefore slightly larger than in the previous case.
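The sign-reversal statistic reported in Table 8 can be sketched as follows (names are illustrative assumptions):

```python
import numpy as np

def reversal_fraction(perf_before, perf_after):
    """Fraction of drivers whose performance relative to the population
    average flips sign after model compensation; a value near 0.5 would
    mean the two measures are essentially unrelated."""
    dev_before = np.asarray(perf_before) - np.mean(perf_before)
    dev_after = np.asarray(perf_after) - np.mean(perf_after)
    return float(np.mean(np.sign(dev_before) != np.sign(dev_after)))
```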

6 Extracting Statistics for Fuel Shrinkage

We calculated that the average fuel discrepancy over all trips in the available set is 14.23%. Thus, 14.23% more fuel was dispensed at the refueling stations than was used according to the onboard computers installed in the vehicles. This indicates either that the onboard computers of the vehicles provide inaccurate measurements because of calibration errors, or that large quantities of fuel are stolen by drivers. According to work done by M. van der Westhuizen, the onboard equipment of the vehicles is accurate to within 1% [19]. We can therefore regard the data collected on the fuel used by the engines of the vehicles as very reliable. This indicates that the


Fig. 9 Histogram of fuel discrepancies, with inaccurate observations removed

large average fuel discrepancy is due to fuel theft or other causes resulting in fuel loss. In Fig. 9, we display a histogram of relative discrepancies expressed as a percentage of total fuel used. A positive discrepancy shows that more fuel was dispensed at the end of the trip than was used according to the onboard computer during the trip. A negative relative discrepancy indicates that more fuel was used according to the onboard computer than was dispensed into the vehicle at the end of the trip; this results from incomplete refueling of vehicles after some trips. When the discrepancy is positive, we suspect that fuel might have been stolen during that trip. We observe that the histogram is skewed to the right, i.e. in most cases fuel discrepancies tend to be positive, indicating that fuel shrinkage is occurring on a regular basis.

In order to quantify the contribution of various input factors, we categorized the data based on the following categorical variables:

• Driver
• Route
• Vehicle
• Supplier depot where refuelling took place
• Locations where cargo is loaded and unloaded (Origin and Destination)

To determine which input variables display a significant relationship with fuel discrepancy we performed Analysis of Variance (ANOVA) calculations with respect to all the above variables. As an example, we display the results of the ANOVA test applied to supplier depot in Table 11 below. We conclude that there is a significant relationship, given the small probability of achieving the F-statistic value in case of no relationship. We obtained similar results, with somewhat smaller F-values, for the other variables. The ANOVA tests therefore suggest that it will be worthwhile to extract a model for fuel discrepancy in terms of these variables. We then proceeded to calculate the average values of fuel discrepancies per category for each of the categorical variables described above, to determine whether we observe


Table 11 Fuel discrepancy ANOVA results for the depot category

significant differences in different categories. As an example, a histogram of driver averages is displayed in Fig. 10. Most drivers have an average relative discrepancy between 0 and 30%. There are also relatively many drivers with more than 70% average discrepancy. The histogram of per-depot averages, as displayed in Fig. 11, suggests that there is a high theft rate at certain depots. This confirms that it is worthwhile to pursue the modelling of the relationships between input factors and fuel shrinkage. For the other variables, the histograms were more symmetrical around zero, indicating that supplier depot and driver are most likely the primary explanatory variables for fuel shrinkage. To determine which categorization options contain a significant number of categories with fuel discrepancies that deviate significantly from the population average, we used t-statistics. The t-statistic of a sample dataset is calculated by the following formula:

t value = (µ_s − µ_p) / (SD_s / √n)   (3)

Fig. 10 Histogram of driver average relative fuel discrepancies


Fig. 11 Histogram of depot average relative fuel discrepancies

where µ_s is the sample mean, µ_p is the population mean, SD_s is the sample standard deviation and n is the sample size. We calculated t-statistics for all the categories and then sorted the categories by t-statistic. In Fig. 12, the histogram for the driver with the highest t-statistic is shown. Compared to the total population histogram, we observe that this driver causes extreme fuel discrepancies. From the histogram in Fig. 13, we can see that more than half of the entries at the Puma Filling Station depot have a fuel loss discrepancy above 40%. In order to quantify the degree to which discrepancies are concentrated within a limited fraction of categories, we then calculated the cumulative fraction of total fuel discrepancy over the entire population as a function of t-statistic, starting with the categories that displayed the smallest t-statistic (i.e. with low discrepancies). The

Fig. 12 Relative fuel loss histogram for driver with staff number of 1522


Fig. 13 Relative fuel loss histogram for Puma Filling Station depot

cumulative plot for the depot data is shown in Fig. 14; we can see that all depots with a t-statistic greater than 2.94 are responsible for 73% of the total fuel discrepancy. The cumulative plot for drivers is shown in Fig. 15. From this plot we can see that all the drivers with a t-statistic greater than 2, representing 12% of drivers, are responsible for about 27.5% of the total amount of absolute fuel discrepancy. It is thus clear that a minority of depots and drivers are responsible for a large fraction of total fuel shrinkage.

We also combined different category types where a significant fraction of losses occurs. We calculated the statistics for the combined categories to see if significant

Fig. 14 Cumulative distribution of the depot contribution to total fuel discrepancy


Fig. 15 Cumulative distribution of the driver contribution to total fuel discrepancy

differences can be observed in the combined categories compared to the individual categories. By following this approach, we can observe whether a certain combination of categories causes extreme fuel losses. As an example, when the Puma Filling Station depot is combined with driver 1522, the average fuel loss of this driver rises from 40.36 to 62.98% when only the transactions done at this depot are considered. This might be an indication that there is interaction between fuel theft at specific depots and specific drivers.

In order to perform a direct comparison between the impacts of different input factors on fuel discrepancy we implemented a linear correlation analysis. We converted the categorical variables into continuous variables in the following manner:

• For each category of each input factor (e.g. Puma Depot for category Supplier Depot) we calculated the accumulated average value of relative fuel discrepancy as a function of time, from the start of the observation period up to the end of each specific month. These averages represent the behavioural characteristics of the respective categories up to that point in time.
• For each new observation falling in a subsequent month we determined its category memberships (e.g. Supplier Depot: Puma; Driver ID: 1522, etc.). We then allocated to that observation the accumulated averages for the categories to which it belongs, as determined at the end of the previous month. Each observation thus inherits the attributes of the categories to which it belongs. As these attributes are continuous variables, they can be used as a basis for a correlation analysis with the observed outcomes.
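The two bullet points above amount to an expanding, time-lagged category average. A pandas sketch (the function and column names are illustrative assumptions, not taken from the paper's dataset):

```python
import pandas as pd

def encode_category(df, cat_col, value_col="discrepancy", month_col="month"):
    """Turn one categorical input into a continuous attribute: each
    observation inherits its category's accumulated average relative
    discrepancy computed up to the end of the PREVIOUS month, so no
    observation sees its own month's outcomes."""
    df = df.sort_values(month_col).copy()
    # cumulative per-category sums and counts up to each month
    monthly = (df.groupby([cat_col, month_col])[value_col]
                 .agg(["sum", "count"])
                 .groupby(level=0).cumsum())
    cum_mean = (monthly["sum"] / monthly["count"]).rename("attr")
    # use only data available at the end of the previous month
    cum_mean = cum_mean.groupby(level=0).shift(1)
    return df.join(cum_mean, on=[cat_col, month_col])
```

Observations in a category's first observed month receive no attribute (NaN), since no past behaviour is available for that category.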


Table 12 Correlations between input variables and fuel discrepancy

Period                 Vehicle   Driver   Supplier   Origin   Destination   Route
Total calendar range   0.0383    0.0526   0.1543     0.0328   0.0579        0.0516
First half             0.0151    0.0233   0.1361     0.0257   0.0134        0.0235
Second half            0.0716    0.0825   0.1785     0.0527   0.1039        0.0749

The correlations are displayed in Table 12. The largest correlations are obtained for Supplier Depot, which confirms our earlier results from the ANOVA analysis. As most of the other correlations are also significant, we will retain all input variables when modelling fuel discrepancy.
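A minimal sketch of the per-category screening statistic of Eq. 3 used earlier in this section (the function name is ours):

```python
import numpy as np

def category_t_value(sample, population_mean):
    """Eq. 3: t-statistic of one category's relative fuel discrepancies
    against the population mean; categories (drivers, depots, routes)
    with large positive t-values are flagged as suspicious."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    return (sample.mean() - population_mean) / (sample.std(ddof=1) / np.sqrt(n))
```

Sorting all categories of a variable by this value yields the cumulative-contribution plots of Figs. 14 and 15.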

7 Extracting Empirical Fuel Shrinkage Models

Models similar to those described in Sect. 5 above for fuel economy were extracted for fuel shrinkage. The linear regression model coefficients obtained for the various input variables are displayed in Table 13, while correlation coefficients between actual and predicted values for the linear regression and neural network models are displayed in Table 14. We observe that both models perform almost identically, and that the models maintain their prediction accuracy on the test set, indicating that the modelled relationships are not based on noise in the data. The correlations are not very high, which can be expected, as fuel shrinkage tends to be sporadic and shows much variation from one trip to the next, depending on the available opportunities to siphon fuel from the system. What is however important is that the correlation obtained for the test set is similar to that for the training set (even higher in this case), showing that the model captures behaviour that is consistent over time.

Table 13 Linear regression coefficients

Input Variable          Coefficient
Vehicle FMID            −0.013968
Driver & staff number    0.067676
Supplier                 0.23981
Loading town             0.038324
Location to town        −0.006171
Route                   −0.019472

Table 14 Linear regression correlations

Measurement                Linear regression   Neural network
Training set correlation   0.142               0.142
Test set correlation       0.190               0.186


Table 15 Correlations between actual and predicted average relative discrepancies per category

Period       Supplier   Driver   Vehicle   Origin   Destination   Route
(a) Linear regression
Training     0,511      0,344    0,292     0,069    0,321         0,114
Test         0,327      0,339    0,244     0,290    0,139         0,150
(b) Neural network
Training     0.198      0.252    0.263     0.261    0.296         0.269
Test         0.165      0.373    0.458     0.259    0.228         0.218

We then proceeded to calculate the average predicted fuel discrepancy per category, and correlated this with the actual average discrepancy per category for both the training and test sets; these results are displayed in Table 15. We observe that the cross-category average correlations are much higher than the correlations over the individual observations; we expected this as the averaging within categories removes some of the noisiness of the data. Linear regression and neural networks produced similar results. We can therefore conclude that we observe systematic fuel shrinkage related behavior that is category specific and that is sustained from the training to the test period.

Similar to the case for fuel economy, we then proceeded to remove the effect of input variables not linked to the driver, in order to determine if the prevalence of driver related shrinkage is mainly the result of other factors not controlled by the driver. We therefore extracted a model with driver removed as input variable; the correlations between actual and predicted fuel discrepancy are displayed in Table 16. It can be seen that the correlations are only a little smaller than for the models that include driver; we expected this as the driver coefficient in the previous linear regression model was small.

Table 16 Linear regression correlations using model excluding driver

Measurement                Linear regression   Neural network
Training set correlation   0.138               0.138
Test set correlation       0.176               0.178

We subtracted the predicted fuel discrepancy of the model excluding driver as input from the actual fuel discrepancy to obtain a residual fuel discrepancy, similar to the residual fuel economy of the previous sections. We then modelled this residual discrepancy using only driver as input; the results are displayed in Table 17.

Table 17 Linear regression correlations for residual discrepancy using only driver as input

Measurement                Linear regression   Neural network
Training set correlation   0.0304              0.0303
Test set correlation       0.0841              0.0840

The correlations are very small and similar for the training and test sets; this indicates


that it is not as easy for fuel shrinkage as for fuel economy to separate the influence of the driver from other variables. We then determined the driver rankings using both the original fuel discrepancy and the residual fuel discrepancy. In Table 18, we see that, contrary to the case of fuel economy, the ranking of drivers did not change much after compensating for other factors. Only about 6% of drivers switched from a positive to a negative ranking relative to other drivers. This result is confirmed by the scatterplot in Fig. 16, which shows a high degree of correspondence in driver fuel shrinkage ranking before and after compensating for other factors. We calculated the correlation between the driver rankings before and after compensation as 0.979 (this figure was practically identical for the linear regression and neural models).

Table 18 Impact of compensating for factors on driver fuel shrinkage ranking

Measurement                                                 Linear regression   Neural network
Relative discrepancy population average                     10.29               10.29
Residual relative discrepancy population average            0.472               0.464
Fraction of drivers with positive performance on
relative discrepancy                                        0.441               0.441
Fraction of drivers with positive performance on
residual relative discrepancy                               0.431               0.429
Fraction of drivers that switch between positive/
negative based on different measurement method              0.061               0.059

We therefore conclude that driver assessments in terms of involvement in fuel shrinkage are not much impacted by other

Fig. 16 Scatter plot of driver ranking before and after compensating for other factors


factors, even though this phenomenon may be more prevalent at some depots and on some routes than others.

8 Conclusions and Future Work

The objective of this paper was to determine the impact of truck drivers on truck fuel economy and shrinkage using modelling techniques. More specifically, we investigated the impact on driver performance ranking of compensating for factors not controlled by the driver. We stated the hypothesis that factors beyond the control of a truck driver have a significant impact on methods to measure driver fuel economy and shrinkage performance. The results reported in this paper provide conclusive evidence that we can accept the hypothesis in the case of fuel economy, but not in the case of fuel shrinkage.

Based on our analysis of fuel economy, we found that route inclination and payload explain a significant fraction of total observed fuel economy deviations. We observed that compensating for route and payload reduced variations between the average performance levels of different drivers. We furthermore found that there is more consistency between driver performance in the training and test sets after compensating for route and payload than before. We also found that driver fuel economy performance, measured before and after compensating for route and payload, is negatively correlated. In line with this finding, we observed large changes in driver performance ranking after compensation. Lastly we found that, for the majority of drivers, the fuel economy performance relative to the population average changes in sign after compensating for route and payload.

The above findings provide convincing evidence that the default measure currently used for driver fuel economy performance, namely the observed average performance over all completed trips, is not reliable. We therefore propose a new performance measure, based on the residual of the model that predicts fuel economy in terms of route inclinations and payload.
By adding the population average for fuel economy to this residual one can obtain a realistic fuel economy performance assessment for each driver. The analysis of fuel shrinkage identified supplier depots and driver identity as the primary explanatory variables. We observed that a minority of supplier depots and drivers cause the bulk of fuel shrinkage, suggesting that road transport operators should implement specific control measures at such depots and for such drivers. Contrary to the findings for fuel economy, we however found that compensating for factors not controlled by the driver did not result in a significant change in driver fuel shrinkage rankings. It would appear that drivers that on average displayed higher than average shrinkage levels before compensation still displayed high shrinkage levels after removing the impact of supplier depot and other factors. This confirms that a subset of drivers are systematically involved in fuel theft, and that while this theft mainly occurs around the operations of specific refuel depots, such behaviour is present at other locations as well.


Based on feedback from road transport operators and operators of truck parking facilities, we believe that fuel theft activities are not only restricted to refuel depots, but also occur in locations like truck parks, where drivers receive bribes in exchange for allowing fuel to be siphoned from their trucks. We can investigate the prevalence of this phenomenon by monitoring average fuel tank levels before and after trucks visited such locations, where no formal refuel facilities are located. Momentary samples of fuel tank levels tend to be an unreliable indication of the volume of fuel currently in the tank, due to movements in the fuel surface while driving and the high thermal expansion coefficient of diesel. By filtering out short term fluctuations and linking such measurements to temperature readings, it should however be possible to provide indicators of estimated changes in fuel tank volumes before and after suspicious events while the truck is in transit. Future work will involve the inclusion of additional input factors not related to driver behavior, like wind speed and traffic conditions, as factors to be compensated for in the fuel economy model. To improve the fuel shrinkage model we plan to analyze the unauthorized stop locations of drivers to determine if high shrinkage levels can be linked to such locations. We will also consider the use of more sophisticated neural network techniques, e.g. using recurrent neural networks to apply temporal filtering to real time measurements of fuel tank levels.

References

1. Hoffman, A.J.: The use of technology for trade corridor management in Africa. In: NEPAD Transport Summit, Sandton, Johannesburg, South Africa (2010)
2. Havenga, J.: 10th Annual State of Logistics Survey for South Africa, ISBN: 978-0-79885616-4, p. 9. Stellenbosch, South Africa (2013)
3. Stopping the flow of illegal black gold. Transp. World Afr. 12(3) (2014)
4. Naidoo, J.: Transport white paper: the South African cross-border industry. THRIP research report, North-West University, Potchefstroom, South Africa (2013)
5. United States Environmental Protection Agency: Sources of Greenhouse Gas Emissions. 26 Apr 2019. https://www.epa.gov/ghgemissions/sources-greenhouse-gas-emissions
6. United States Environmental Protection Agency: Transportation and Climate Change. 26 Apr 2019. https://www.epa.gov/transportation-air-pollution-and-climate-change/carbon-pollution-transportation
7. Weille, J.D.: Quantification of road user savings. World Bank Occasional Papers No. 2, World Bank, Washington, D.C. (1966)
8. Biggs, D.: ARFCOM—Models for Estimating Light to Heavy Vehicle Fuel Consumption. Research Report ARR 152, Australian Road Research Board, Nunawading (1988)
9. Bennett, I.G.C.: HDM-4 Fuel Consumption Modelling. Preliminary Draft Report to the International Study of Highway Development and Management Tools, University of Birmingham, Birmingham (1995)
10. Rakha, J., Wang, H.: Fuel consumption model for heavy duty diesel trucks: model development and testing. Transp. Res. Part D, 127–141 (2017)
11. Delgado, O., Clark, N., Thompson, G.: Modeling transit bus fuel consumption on the basis of cycle properties. J. Air Waste Manage. Assoc., 443–452 (2011)
12. Perrotta, F., Parry, T., Neves, L., Buckland, T., Benbow, E., Mesgarpour, M.: Verification of the HDM-4 fuel consumption model using a big data approach: a UK case study. Transp. Res. Part D, 109–118 (2019)


13. Hoffman, A.J., Van der Westhuizen, M.: An investigation into the economics of fuel management in the freight logistics industry. In: IEEE Intelligent Transportation Systems Proceedings, Qingdao, PRC (2014)
14. Zhigang, X., Tao, W., Said, E., Xiangmo, Z.: Modeling relationship between truck fuel consumption and driving behavior using data from internet of vehicles. Comput.-Aided Civil Infrastr. Eng. 33, 209–219 (2018)
15. Jian-Da, W., Jun-Ching, L.: A forecasting system for car fuel consumption using a radial basis function neural network. Expert Syst. Appl. 39, 1883–1888 (2012)
16. Siami-Irdemoosa, E., Dindarloo, S.R.: Prediction of fuel consumption of mining dump trucks: a neural networks approach. Appl. Energy 151, 77–84 (2015)
17. Jassim, H.S., Lu, W., Olofsson, T.: Assessing energy consumption and carbon dioxide emissions of off-highway trucks in earthwork operations: an artificial neural network model. J. Clean. Prod. 198, 364–380 (2018)
18. Hoffman, A.J.: Neural models for benchmarking of truck driver fuel economy performance. In: 11th International Conference on Neural Computation Theory and Applications, Vienna, Austria (2019)
19. van der Westhuizen, M.: The Characterization of Automotive Fuel Consumption in the Freight Logistics Sector. North-West University, Potchefstroom (2017)
20. Gong, L., Morikawa, T., Yamamoto, T., Sato, H.: Deriving personal trip data from GPS data: a literature review on the existing methodologies. In: The 9th International Conference on Traffic & Transportation Studies, Beijing, China (2014)

Towards a Class-Aware Information Granulation for Graph Embedding and Classification

Luca Baldini, Alessio Martino, and Antonello Rizzi

Abstract Pattern recognition in the graph domain has gained considerable attention in the last two decades, since graphs are able to describe relationships (edges) between atomic entities (nodes) which can further be equipped with attributes encoding meaningful information. In this work, we investigate a novel graph embedding procedure based on the Granular Computing paradigm. In contrast to recently developed techniques, we propose a stratified procedure for extracting suitable information granules (namely, frequent and/or meaningful subgraphs) in a class-aware fashion; that is, each class of the classification problem at hand is represented by the set of its own pivotal information granules. Computational results on several open-access datasets show performance improvements when the ground-truth class labels are also considered in the information granulation procedure. Furthermore, since the granulation procedure is based on random walks, it is also very appealing in Big Data scenarios.

Keywords Pattern recognition · Supervised learning · Granular computing · Inexact graph matching · Graph embedding · Graph-based pattern recognition · Information granulation

1 Introduction

Graphs are topological data structures able to capture relationships between interacting elements. When nodes and/or edges are equipped with suitable attributes

L. Baldini (B) · A. Rizzi
Department of Information Engineering, Electronics and Telecommunications, University of Rome "La Sapienza", Via Eudossiana 18, 00184 Rome, Italy
e-mail: [email protected]
A. Rizzi e-mail: [email protected]

A. Martino
Institute of Cognitive Sciences and Technologies (ISTC), Italian National Research Council, Via San Martino della Battaglia 44, 00185 Rome, Italy
e-mail: [email protected]

© Springer Nature Switzerland AG 2021
J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_11



(in this case, we refer to them as labelled graphs), graphs are also able to encode semantic information about the data at hand. This dual representative power makes graphs a flexible and accurate abstraction and, as such, they have been widely used to model a plethora of real-world phenomena, including biological systems [19, 25, 26, 29, 31, 33, 34], social networks [7, 68], computer vision and image processing [1, 22, 61, 63]. On the other hand, the straightforward approach in pattern recognition is to represent the input pattern as a real-valued feature vector lying in an m-dimensional vector space. This is mainly due to the relatively simple underlying math when some properties are satisfied; that is, the resulting space can easily be equipped with an adequate metric satisfying the properties of non-negativity, identity, symmetry and triangle inequality [42, 58, 69]. The same does not necessarily hold in structured domains and, for this reason, the main drawback when representing entities with graphs is the impractical, non-geometric space in which they lie. Amongst the mainstream approaches employable when dealing with structural pattern recognition [42], a natural approach consists in using an ad-hoc dissimilarity measure working directly in the input space: this allows one to reuse some of the well-known pattern recognition techniques for supervised learning, notably K-Nearest Neighbours (K-NN) [17], being the quintessential decision rule which can be equipped with a custom dissimilarity measure. As far as graphs are concerned, a straightforward custom dissimilarity measure is given by the so-called Graph Edit Distances (GEDs) [51]: this (family of) dissimilarity measures aims at quantifying the dissimilarity between two graphs as the minimum cost sequence of atomic operations (namely, substitution, deletion and insertion of nodes and/or edges) needed to transform the two graphs into one another.
An alternative strategy that has gained much attention relies on graph kernels [30]: these methods exploit the so-called kernel trick, that is, the inner product between graphs in a vector space induced by a positive (semi)definite kernel function, in order to measure the similarity between graphs. The classification task usually relies on well-known kernelized algorithms such as Support Vector Machines [16, 39]. Whilst graph kernels perform an implicit embedding by exploiting the peculiar mathematical properties of kernel functions [18, 50], a dual method (the core of this paper) relies on an explicit graph embedding [13]. In this approach, the input pattern from the structured graph domain G is mapped into an embedding space D. Clearly, the design of the mapping function φ : G → D with D ⊆ Rn is crucial in this procedure and some effort must be devoted to filling the informative and semantic gap between the two domains. There are several ways to perform explicit embedding, in either naïve or automatic fashion. Naïve approaches usually leverage feature engineering (also known as feature generation), in which the data analyst manually designs the mapping function φ by extracting numerical features from the (structured) pattern at hand and concatenating them in vector form. Notwithstanding its straightforwardness, this approach requires a deep knowledge of both the data and the problem at hand since there usually exist many numerical features that can possibly be extracted from structured patterns, yet only specific subsets are useful for solving different problems. Examples of feature engineering in the graphs domain can be found in [20, 38, 40, 45, 48]. Amongst the automatic approaches, graph neural networks emerged as a deep learning-based method for automatically learning low-


dimensional graph embeddings [13]. However, as is common in deep architectures, model interpretability is a delicate issue whose resolution is still out of reach [71]. Another automatic approach for explicit graph embedding which, at the same time, returns a fully interpretable model is based on Granular Computing [3]. This approach, based on the extraction of information granules, can be pursued in order to obtain efficient mapping functions able to reflect the information carried by the structured data into the vector space. Generally speaking, in recent years Granular Computing has been a dynamic paradigm for synthesizing advanced machine learning and structural pattern recognition systems, both under a practical (see e.g. [2, 4, 36, 37, 43, 44, 64]) and a methodological viewpoint (see e.g. [8, 14, 23, 70]). The appeal of Granular Computing-based pattern recognition systems stems from their ability to automatically design (in a data-driven manner) the mapping function, especially thanks to the symbolic histograms procedure [21]. Furthermore, these systems are human-interpretable, as the automatically-extracted information granules can provide field-experts with further knowledge about the modelled system. However, as a drawback, not only does the automatic synthesis of information granules play a key role in the embedding procedure and shall be carefully designed, but a heavy computational effort is also necessary and often, as the dataset size increases, the problem may become unfeasible, especially from the memory footprint viewpoint. This paper follows a previous work [2] on the synthesis of Granular Computing-based classification systems in the graphs domain, originally proposed in [4]. In [2], we explored a lightweight stochastic procedure for substructure extraction in order to synthesize the alphabet, i.e.
the set of information granules on top of which the embedding space is built, by taking advantage of Breadth First Search (BFS) and Depth First Search (DFS) algorithms for graph traversal. In this work, we push this study forward by also considering the ground-truth class labels for a better refinement of the alphabet synthesis. The remainder of the paper is structured as follows: in Sect. 2 we provide the reader with an overview of explicit embedding procedures for structured data via information granulation; in Sect. 3 we recall GRALG, the graph-based classification system under analysis, discussing both its original implementation [4] and the stochastic variant proposed in [2]; Sect. 4, instead, describes the refined granulation procedure at the basis of this paper; in Sect. 5 we show an exhaustive computational comparison and, finally, Sect. 6 concludes the paper, remarking on future research.

2 Embedding via Data Granulation

In the fields of Computational Intelligence and Soft Computing, Granular Computing emerged as a novel paradigm able to deal with complex systems, inspired by the human ability to unravel complex situations in environments characterized by uncertainty and limited knowledge [32, 74]. Indeed, the process of granulation can be referred to as the set of techniques that leads to the emergence of meaningful aggregated data at different levels of abstraction, known as information granules.


As a human-inspired problem-solving process, the Granular Computing approach consists in observing and considering the problem at various levels of granularity, retaining only those that are relevant for the task at hand, therefore discarding unnecessary and superfluous information and making the problem tractable for decision-making activities [56]. The importance of information granules resides in their ability to underline properties and relationships between data aggregates. These entities can be synthesized according to the so-called indistinguishability rule, that is, elements that share enough similarity, structural or functional properties can be condensed into the same group [76], with the goal of pursuing a semantic discrimination of the information residing in the data at hand [54]. Furthermore, data can be represented using different levels of 'granularity' and thus different peculiarities of the considered system can emerge [57, 67, 72, 73, 75]. The selection of the most adequate level of granularity is one of the most important issues when designing Granular Computing-based systems, being strongly influenced by the nature of the problem. Indeed, by varying the resolution at which the problem is observed, the level of abstraction varies accordingly: the higher the resolution, the lower the level of abstraction, and the finer the details that emerge. Conversely, low resolution or low granularity levels correspond to high levels of abstraction, where fewer, but more populated, information granules are likely to emerge. Different approaches can be considered in order to accomplish the process of information granulation. Notable frameworks represent information granules as mathematical entities relying on set theory: fuzzy sets, rough sets and probabilistic sets [28, 35, 52, 77].
A straightforward method for the synthesis of meaningful information granules can be found amongst unsupervised learning methods, which have been widely explored in the context of Granular Computing [53, 55, 59]. Indeed, clustering algorithms have a direct connection with the concept of 'granules-as-groups', since these methods combine similar data into the same cluster according to a notion of proximity. Notwithstanding that, clustering methods must be properly designed in order to unravel groups and regularities in a multi-perspective view according to the Granular Computing principles. Typically, three main factors may affect the resulting data partitioning from a Granular Computing viewpoint [27]:

• the (dis)similarity measure, which serves as the main function in order to determine the degree of proximity between data elements;
• the threshold of inclusion, which determines whether a given pattern can be included in a specific group (cluster) according to its degree of (dis)similarity with respect to other patterns;
• the cluster representative, which is the pivotal element that compresses the information contained in a cluster.

A typical clustering algorithm which directly relies on the aforementioned parameters is the Basic Sequential Algorithmic Scheme (BSAS) [65], which performs a so-called 'free clustering procedure', i.e. the number of clusters need not be defined a-priori as in other data clustering paradigms such as k-clustering. When the input space of the problem corresponds to the graphs domain, the above discussed clustering parameters must be carefully tuned in order to deal with the structured nature of


data involved. That is, not only must an effective (dis)similarity measure be properly defined for comparing graphs, but an adequate cluster representative must also be tailored accordingly. An effective approach involves the medoid (also known as MinSOD) as the representative element of a cluster, since its evaluation can be performed just in light of the pairwise dissimilarities between the patterns belonging to the cluster itself, overcoming the lack of algebraic structures which characterizes non-geometric spaces [46, 47, 49]. The cluster representatives from the resulting partition can be considered as symbols belonging to an alphabet A = {s1, …, sn}. These elements are retained as pivotal elements on top of which structured data will be mapped into a vector space by means of the symbolic histograms paradigm, hence leading to an explicit graph embedding. According to the symbolic histograms, a graph can be mapped into an embedded space by building an n-length integer-valued vector whose i-th component counts the number of occurrences of the symbol si belonging to the alphabet A within the graph to be embedded. The resulting embedding space is inherently endowed with well-defined distance measures, such as the Euclidean distance or the dot product, effectively enabling the application of many standard classification systems developed for geometric spaces.
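As a rough illustration of the BSAS scheme and its three design parameters, the following sketch clusters plain numbers under a user-supplied dissimilarity (the function name and the choice of the first element as representative are ours; GRALG clusters subgraphs under a GED and uses the MinSOD as representative):

```python
def bsas(patterns, dissimilarity, theta, Q):
    """Basic Sequential Algorithmic Scheme ('free clustering').

    theta: dissimilarity threshold for inclusion in the nearest cluster.
    Q: maximum number of allowed clusters (need not be reached).
    Each cluster is a list of patterns; its representative is taken here as
    the first element, for simplicity.
    """
    clusters = []
    for x in patterns:
        if not clusters:
            clusters.append([x])
            continue
        # locate the nearest cluster via its representative
        d, best = min((dissimilarity(c[0], x), i) for i, c in enumerate(clusters))
        if d <= theta:
            clusters[best].append(x)      # include in the nearest cluster
        elif len(clusters) < Q:
            clusters.append([x])          # open a new cluster
        else:
            clusters[best].append(x)      # forced inclusion: Q clusters reached
    return clusters
```

For instance, `bsas([0.0, 0.1, 5.0], lambda a, b: abs(a - b), theta=1.0, Q=10)` yields the two clusters `[[0.0, 0.1], [5.0]]`, without the number of clusters being fixed in advance.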

3 The GRALG Classification System

GRALG (GRanular computing Approach for Labelled Graphs) is a general-purpose classification system suitable for dealing with labelled graphs and based on the Granular Computing paradigm. GRALG was originally proposed in [4] and later successfully applied in the context of image classification [5, 6]. In Sects. 3.1–3.4, the four main blocks of the system are described separately, whereas in Sects. 3.5 and 3.6 we describe the way in which they cooperate in order to train the model and perform the testing phase, respectively. The discussion regards both the original implementation [4–6] and the lightweight stochastic variant proposed in [2].

3.1 Extractor

The goal of this block regards the extraction of substructures from an input set S ⊂ G. In the original GRALG implementation, this procedure used to exhaustively compute the set of possible subgraphs that can be drawn from any given graph G ∈ S. The advantage of the procedure lies in its being dependent on only one user-defined parameter o, that is, the maximum number of vertices for all subgraphs to be extracted. However, the procedure itself is strongly dependent on this parameter: the computational complexity of exhaustive extraction is (asymptotically) combinatorial, making this method unfeasible for large graphs and/or high values of o in terms of both running time and memory footprint. Specifically, the original procedure used to expand each node of a given graph to a possible subgraph of order 2,


caching the resulting substructures in memory, and then expanding and storing them iteratively until the desired maximum order o is reached. At the end of the extraction procedure, the resulting set of substructures S^g is returned.

3.1.1 Random Subgraphs Extractor Based on BFS and DFS

The novel lightweight procedure proposed in [2] randomly draws a graph G = {V, E} from S, where V and E indicate the set of nodes and edges, respectively; it then proceeds by selecting a seed node v ∈ V for a traversal strategy based on either BFS or DFS in order to extract a subgraph g = {Vg, Eg}. Both extractions (graph G from S and node v from V) are performed with uniform random distribution. Alongside o (the maximum subgraph order), a new user-defined parameter W determines the desired cardinality of S^g. The procedure can be summarized as follows:

1. Initialize S^g as an empty set
2. For each candidate subgraph order k = 1…o:
   a. Draw a random graph G = {V, E} from S
   b. Draw a random vertex v from V
   c. Traverse graph G starting from seed node v until k vertices are visited
   d. Collect the visited nodes and edges in a subgraph g
   e. Append g to S^g

3. Repeat step 2 until |S^g| = W

In step 2c, the graph traversal is performed using one of the two following well-known algorithms:

Breadth First Search: starting from a node v, BFS traverses the graph by exploring the adjacent nodes of v first and moving farther only after the neighbourhood has been totally discovered. A First-In-First-Out policy is in charge of organizing the list of neighbours of the considered vertex, in order to give priority to adjacent nodes. The algorithm can be summarized as follows:

1. Select the starting vertex v
2. Push v into a queue L(q)
3. Pop u, the first element of the queue, from L(q)
4. For each neighbour s of u, push s into L(q) if s is not marked as "visited"
5. Mark u as a "visited" vertex
6. Repeat steps 3–5 until L(q) is empty.

Depth First Search: in this strategy, a given graph is traversed starting from a seed vertex v but, unlike the BFS search, the visit follows a path of increasing length from v and backtracks only after all the vertices along the selected path have been discovered. A Last-In-First-Out policy is in charge of organizing the list of neighbours of the considered vertex, in order to visit in-depth vertices first. The steps of the algorithm are:

1. Select the starting vertex v
2. Push v onto a stack L(s)
3. Pop u, the last element, from the stack L(s)
4. For each neighbour s of u, push s onto L(s) if s is not marked as "visited"
5. Mark u as "visited"
6. Repeat steps 3–5 until L(s) is empty.

(Detailed pseudocodes for both traversal strategies can be found in the original research paper [2].)

These methods are employed to populate the set of vertices Vg and edges Eg of the subgraph g: a vertex is added to Vg as soon as it is marked as "visited", whereas an edge is added to Eg by considering the current and the last visited vertices.
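A minimal sketch of the random extractor, assuming graphs stored as adjacency lists (function names such as `random_subgraph` and `extract` are ours, not from GRALG). Switching between FIFO and LIFO popping turns the same loop into BFS or DFS:

```python
import random
from collections import deque

def random_subgraph(adj, k, strategy="bfs", rng=random):
    """Extract one subgraph of (at most) k vertices via a BFS or DFS traversal.

    adj: dict mapping each vertex to the list of its neighbours.
    Returns (V_g, E_g): visited vertices and the traversal edges collected.
    """
    v = rng.choice(sorted(adj))           # uniformly drawn seed node
    frontier = deque([v])
    visited, edges = [], []
    parent = {v: None}
    while frontier and len(visited) < k:
        # FIFO (queue) for BFS, LIFO (stack) for DFS
        u = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if u in visited:
            continue
        visited.append(u)
        if parent[u] is not None:
            edges.append((parent[u], u))  # edge between current and earlier visit
        for s in adj[u]:
            if s not in visited:
                parent.setdefault(s, u)
                frontier.append(s)
    return visited, edges

def extract(graphs, o, W, strategy="bfs", rng=random):
    """Populate S^g with W subgraphs of candidate orders 1..o (steps 1-3)."""
    S_g = []
    while len(S_g) < W:
        for k in range(1, o + 1):
            if len(S_g) == W:
                break
            G = rng.choice(graphs)        # uniformly drawn graph from S
            S_g.append(random_subgraph(G, k, strategy, rng))
    return S_g
```

Note that, as in the description above, the extraction cost is bounded by k visited vertices per subgraph, regardless of the size of the input graph.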

3.2 Granulator

The granulation process is carried out after the set of atomic entities S^g has been populated by the Extractor block as defined in Sect. 3.1.1. This block aims at building a set of relevant symbols, namely the alphabet A, by means of an unsupervised learning approach performed on the subgraph set S^g. We adopt the BSAS free clustering algorithm, which relies on two parameters, Q and θ, namely the maximum number of allowed clusters and the dissimilarity threshold for pattern inclusion in the nearest cluster, respectively. In particular, the θ parameter impacts the resolution adopted during the clustering procedure, consequently affecting the granularity level of the synthesized symbols. Thanks to a binary search method, an ensemble of partitions is generated according to different values of θ. Without loss of generality, we can consider a vector θ storing the candidate values for the binary search. For every cluster C in the resulting partitions, a cluster quality index F(C) is defined as:

F(C) = η · Φ(C) + (1 − η) · Θ(C)    (1)

where the two terms Φ(C) and Θ(C) are defined respectively as:

Φ(C) = (1 / (|C| − 1)) · Σi d(g∗, gi)    (2)

Θ(C) = 1 − |C| / |S^g_tr|    (3)

where g∗ is the MinSOD element of cluster C (see Sect. 2) and gi is the i-th pattern in the cluster. Both the clustering algorithm and Eq. (2) rely on the dissimilarity measure described in the next paragraph. The quality index defined in Eq. (1) reads


as the linear convex combination between compactness Φ(C) and cardinality Θ(C) as defined in Eqs. (2) and (3), respectively, where η ∈ [0, 1] weights the importance of the two terms. For all partitions in the ensemble, each cluster is filtered thanks to a given threshold τF, which aims at selecting only relevant clusters according to the quality index F. In this way, only well-formed clusters (i.e., compact and populated) contribute to shape the alphabet A.
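As a small numeric sketch of the quality index of Eqs. (1)–(3), with plain numbers standing in for subgraphs and an arbitrary dissimilarity `d` standing in for the GED (function names are ours):

```python
def compactness(cluster, d, medoid):
    """Compactness: average dissimilarity between the MinSOD g* and the other
    members; summing over the whole cluster is fine since d(g*, g*) = 0."""
    return sum(d(medoid, g) for g in cluster) / (len(cluster) - 1)

def cardinality(cluster, n_total):
    """Cardinality term: 1 - |C| / |S^g_tr|, small for well-populated clusters."""
    return 1.0 - len(cluster) / n_total

def quality(cluster, d, medoid, n_total, eta):
    """F(C): convex combination of the two terms, weighted by eta in [0, 1].
    Clusters are then filtered against the threshold tau_F."""
    return eta * compactness(cluster, d, medoid) \
        + (1 - eta) * cardinality(cluster, n_total)
```

For example, a cluster `[0, 1, 2]` with medoid `0`, an absolute-difference dissimilarity, 10 total patterns and η = 0.5 yields Φ = 1.5, Θ = 0.7 and F = 1.1.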

3.2.1 Dissimilarity Measure and Inexact Graph Matching

The core dissimilarity measure in GRALG is a GED, which is based on the same rationale behind other well-known edit distances, such as the Levenshtein distance between strings [15]. Given a set of edit operations defined on nodes and edges (i.e., deletion, insertion and substitution), a GED evaluates the dissimilarity between two graphs as the minimum cost set of operations needed to turn the first graph into the other. Additionally, in this work, we adopt a six-weighted GED which allows the importance of each operation on both nodes and edges to be established individually. Formally speaking, the GED between G1 and G2 can be defined as a function d : G × G → R, such that [2]:

d(G1, G2) = min_{(e1, …, ek) ∈ X(G1, G2)} Σ_{i=1}^{k} c(ei)    (4)

where X(G1, G2) is the set of prospective edit operation sequences needed to transform the two graphs into one another. From Eq. (4), the cost functions c(·) associated to each edit operation are not known a priori and their definition is of utmost importance for an adequate evaluation of the dissimilarity between graphs, being strongly dependent on the nature of the attributes on nodes and edges. Furthermore, an exact solution for the problem stated in Eq. (4) is known to have an exponential complexity with respect to the order of the graphs involved [9–12]. For this reason, a suitable heuristic that considers a suboptimal solution is mandatory in order to overcome the impracticability arising from the intrinsic computational complexity. In GRALG, we adopted a greedy approach known as node Best Match First (nBMF) [6] that can be divided into two consecutive routines called vertex nBMF and edge nBMF, respectively. The interested reader is referred to [2] for detailed pseudocodes of both routines. Let G1 = (V1, E1, Lv, Le) and G2 = (V2, E2, Lv, Le) be two fully labelled graphs with node and edge label sets Lv and Le, and let o1 = |V1|, o2 = |V2|, n1 = |E1|, n2 = |E2| be the number of nodes and edges in the two graphs, respectively. In general, G1 and G2 might have different sizes in terms of both nodes and edges and thus we consider o1 ≠ o2 and n1 ≠ n2. Additionally, let dv^πv : Lv × Lv → R and de^πe : Le × Le → R be two custom functions that enable the evaluation of dissimilarities between node and edge attributes, possibly depending on some parameters πv and πe [66]. The first procedure greedily matches the nodes in graph G1 with nodes in G2 according to dv^πv; that is, the first node from V1 is assigned to the most similar node


from V2. The selected pairs are stored in a set of matched nodes M and neglected in the next rounds of node evaluations. The vertex nBMF routine evaluates the operation costs as follows:

• according to dv^πv, each match contributes to the overall node substitution cost;
• if o1 > o2, then we count (o1 − o2) node insertions;
• if o1 < o2, then we count (o2 − o1) node deletions.

Once the set of matched nodes M is returned, the second routine (edge nBMF) takes place by matching induced edges. Specifically, by relying on the set M, the procedure checks whether an edge exists in both E1 and E2. For example, let us suppose (v1, u1) ∈ M and (v2, u2) ∈ M are two pairs of matched nodes, where v1, v2 ∈ V1 and u1, u2 ∈ V2. Let also (v1, v2) ∈ E1 be an edge of G1. Then, the procedure checks whether an edge (u1, u2) exists in G2 as well, u1 and u2 being the nodes that have been matched to v1 and v2, namely the nodes that compose the edge in G1. In general, the edit operation costs are evaluated as follows:

• if the edge exists in both E1 and E2, this counts as an edge substitution and its cost is given by the dissimilarity between edges according to de^πe;
• if the two nodes are connected in G1 only, this counts as an edge insertion;
• if the two nodes are connected in G2 only, this counts as an edge deletion.

The overall dissimilarities between nodes and edges, say dV(V1, V2) and dE(E1, E2), can be defined as:

dV(V1, V2) = w_node^sub · c_node^sub + w_node^ins · c_node^ins + w_node^del · c_node^del
dE(E1, E2) = w_edge^sub · c_edge^sub + w_edge^ins · c_edge^ins + w_edge^del · c_edge^del    (5)

where c_node^sub, c_node^ins, c_node^del, c_edge^sub, c_edge^ins, c_edge^del are the costs associated to the edit operations on nodes and edges and w_node^sub, w_node^ins, w_node^del, w_edge^sub, w_edge^ins, w_edge^del are the six weights which reflect the importance of each operation individually. In order to avoid skewness due to the different sizes of G1 and G2, the dissimilarities in Eq. (5) are normalized as follows:

d̄V(V1, V2) = dV(V1, V2) / max(o1, o2)
d̄E(E1, E2) = dE(E1, E2) / (½ · min(o1, o2) · (min(o1, o2) − 1))    (6)

and finally:

d(G1, G2) = ½ · (d̄V(V1, V2) + d̄E(E1, E2))    (7)

L. Baldini et al.

3.3 Embedder

The goal of this block is to build an embedding function φ : G → D that maps graphs from G into an n-dimensional space D ⊆ Rn. Starting from the symbols collected in the alphabet A = {s1, …, sn} returned by the granulation process, the embedding function φ_A : G → Rn is evaluated according to the symbolic histogram paradigm [21, 22]. The vectorial representation h (the symbolic histogram) of a given graph G is defined as follows:

h = φ_A(G) = [occ(s1, G), …, occ(sn, G)]    (8)

where occ : A → N counts the occurrences of the subgraphs s ∈ A in the input graph G. By employing the same GED described in Sect. 3.2.1 and a symbol-dependent threshold τj = Φ(Cj) · ε, with ε being a user-defined tolerance parameter and Cj being the cluster whose MinSOD is sj, an occurrence of sj is scored in G whenever the dissimilarity between a subgraph in G and sj is below the threshold value τj. The main drawback of the aforementioned procedure concerns the computational burden needed to extract the subgraphs that make up the graph for which the embedding is required. The original procedure in GRALG used to expand the graph G into atomic substructures up to a user-defined order, following an exhaustive extraction strategy (alike the Extractor block, see Sect. 3.1), and then used to compare all the obtained subgraphs against all the symbols in the alphabet A via the GED dissimilarity. However, such an approach is likewise unfeasible in terms of both running time and memory footprint for medium/large graph datasets; hence, following [2], a lightweight strategy has been employed in order to avoid the issues related to exhaustive graph expansion. The algorithm starts either a BFS or DFS traversal strategy² from a seed node v belonging to the node set V in order to explore the graph and extract a set of subgraphs up to a given order o. In order to limit the number of subgraphs, when a node u ∈ V is considered as a seed node, the procedure first checks whether u already appeared in one of the previously extracted subgraphs and, if so, discards it as a starting node for the traversal strategy. Eventually, this procedure returns a subgraph set S^ge tractable for the embedding phase.
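The occurrence counting behind the symbolic histogram can be sketched as follows (toy numeric "subgraphs" and an absolute-difference stand-in for the GED; the function name is ours):

```python
def symbolic_histogram(subgraphs, alphabet, d, thresholds):
    """Embed a graph via its extracted subgraph set S^ge: the j-th component
    counts the subgraphs whose dissimilarity to symbol s_j is below tau_j."""
    h = [0] * len(alphabet)
    for g in subgraphs:
        for j, s in enumerate(alphabet):
            if d(g, s) <= thresholds[j]:
                h[j] += 1                 # an occurrence of s_j is scored
    return h
```

The resulting integer-valued vector lies in a geometric space, so standard metric-based classifiers can be applied downstream.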

² The Embedder must follow the same traversal strategy as the Extractor: both of them shall use either DFS or BFS.

3.4 Classifier

This block runs a supervised classification algorithm in the vector space D ⊆ Rn in order to evaluate the effectiveness of the synthesized embedding space. For our purposes, the classification block is equipped with a K-NN decision rule: test patterns are classified according to the most frequent class amongst their respective K nearest patterns. After the classification has been completed, a suitable performance measure


can be employed to evaluate the whole system. It is worth noting that, since the classification task occurs in a metric space, the K-NN procedure is endowed with a plain Euclidean distance between vectors (i.e., symbolic histograms).
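The decision rule of the Classifier block can be sketched as a plain K-NN over the Euclidean distance between symbolic histograms (`math.dist` requires Python ≥ 3.8; the function name is ours):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, K=3):
    """Classify x by majority vote amongst its K nearest training histograms,
    measured with the plain Euclidean distance."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:K]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
```

No learning phase is required beyond storing the embedded training set, which is why K-NN is a convenient probe for the quality of the embedding space itself.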

3.5 Training Phase

The four blocks described in Sects. 3.1–3.4 constitute the atomic functions of GRALG and herein we describe how they jointly co-operate in order to synthesize a classification model. Let S ⊂ G be a dataset of graphs labelled on nodes and/or edges and let Str, Svs and Sts be three non-overlapping sets (training, validation and test set, respectively) drawn from S. The training procedure starts with the Extractor (Sect. 3.1), which expands the graphs in Str using either BFS or DFS in order to return the set of subgraphs S^g_tr, which is used as the main input for the Granulator module.

3.5.1 Optimized Alphabet Synthesis via Genetic Algorithm

The Granulator block (Sect. 3.2) depends on several parameters whose suitable values are strictly problem- and data-dependent and are hardly known a-priori. This aspect, on a more theoretical side, reflects the observation (see Sect. 1) that relying on an automatic procedure in order to find a suitable granularity for the problem at hand is of utmost importance. To this end, a genetic algorithm is in charge of automatically tuning the Granulator parameters in order to synthesize the alphabet A. The genetic code is given by the concatenation

[Q, τF, η, W, Π]    (9)

where:

• Q is the maximum number of allowed clusters for BSAS;
• τF is the threshold that retains symbols from high-quality clusters in order to form the alphabet;
• η is the trade-off parameter weighting compactness and cardinality in the cluster quality index (see Eq. (1));
• W = [w_node^sub, w_edge^sub, w_node^ins, w_edge^ins, w_node^del, w_edge^del] is the array composed of the six weights for the GED (see Sect. 3.2.1);
• Π = {πv, πe} is the set of parameters for the dissimilarity measures between nodes (dv^πv) and edges (de^πe), if applicable (see Sect. 3.2.1).

Each individual from the evolving population considers the set of subgraphs S^g_tr extracted from Str and runs several BSAS procedures with different threshold values θ. For each θ value under analysis, at most Q clusters can be discovered and the dissimilarity between graphs is evaluated using the nBMF procedure as


in Sect. 3.2.1, by considering the six weights W and the parameters Π. It is worth recalling that the latter is only applicable if the vertex and/or edge dissimilarities are parametric themselves. At the end of the clustering procedures, each cluster is evaluated thanks to the quality index (1), using the parameter η to weight the convex linear combination, and clusters whose value is greater than (or equal to) τF are discarded, so that their representatives will not form the alphabet. Once the alphabet A is synthesized, the Embedder (Sect. 3.3) extracts S^ge_tr and S^ge_vs from Str and Svs (respectively) and exploits A in order to map both the training set and the validation set towards a metric space (say Dtr and Dvs) using the same GED previously used for BSAS, along with the corresponding parameters W and Π. The Classifier is trained on Dtr and its accuracy is evaluated on Dvs; the latter serves as the fitness function for the individual itself. The following genetic operators take care of moving from one generation to the next:

• selection: roulette wheel;
• mutation: uniform within the genes' lower-upper bounds;
• crossover: uniform choice between three alternatives: two-point, one-point, uniform;
• elitism: top 10% of the current population.

At the end of the evolution, the best individual is retained, especially the W∗ and Π∗ portions of its genetic code, along with the alphabet A∗ synthesized using its genetic code.

3.5.2 Feature Selection Phase

The Granulator may produce a large set of symbols in A, resulting in a high-dimensional embedding space. In order to reduce the dimensionality of the embedding space (hence, the number of meaningful symbols), a feature selection procedure is employed. The reasoning behind this additional feature selection phase is two-fold [41]:
• enhanced model interpretability: having fewer pivotal symbols fosters the model interpretability, since there will be fewer symbols to be analyzed by field experts;
• a faster test phase: in fact, as will be stressed in Sect. 3.6, testing a new pattern will be faster, since there will be fewer symbols to match the pattern against.
This stage is also based on genetic optimization and is in charge of discarding unpromising features, reducing the number of symbols in A thanks to a binary projection mask m ∈ {0, 1}^|A|. The projection mask is the genetic code for this second optimization stage. Each individual from the evolving population projects D_tr and D_vs onto the subspace marked by the non-zero elements in m, say D'_tr and D'_vs. The classifier is trained on D'_tr and its accuracy is evaluated on D'_vs. The fitness function reads as a convex linear combination between the classifier accuracy on D'_vs and the cost μ of the mask m:

Towards a Class-Aware Information Granulation …

μ=

275

m 0 |m|

(10)

with a user-defined parameter α ∈ [0, 1] which weights performance and sparsity. At the end of the evolution, the best projection mask m is retained and used in order to return the reduced alphabet A .
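A sketch of how such a fitness could be computed follows (Python for illustration; the exact way accuracy and the cost μ are combined is not spelled out here, so the sign convention below, which rewards sparse masks, is an assumption):

```python
def mask_cost(m):
    # Sparsity cost of Eq. (10): fraction of non-zero entries in the mask.
    return sum(1 for b in m if b != 0) / len(m)

def fitness(accuracy, m, alpha):
    # Convex linear combination of validation accuracy and mask sparsity.
    # With alpha = 1 (as in the experiments of Sect. 5) sparsity is ignored.
    return alpha * accuracy + (1 - alpha) * (1 - mask_cost(m))
```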

3.6 Synthesized Classification Model and Test Phase

From the two genetic optimization procedures, the GED weights w, the dissimilarity parameters Π, the projection mask m and the reduced alphabet A are the main actors which completely characterize the classification model, hence the key components needed to classify previously unseen test data. In fact, given a set of test data S_ts:
1. the Embedder returns S^g_ts;
2. the Embedder performs the symbolic histogram embedding by matching symbols in A using the GED equipped with parameters w and Π (if applicable);
3. the K-NN exploits D_tr, namely the training set projected using the best projection mask m, in order to classify the embedded data from step #2.
The performance obtained by the classifier in the latter step marks the final performance of the overall classification system.
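The test pipeline of steps 1 to 3 can be mocked up as follows (Python for illustration; the real classifier compares symbolic histograms in the space spanned by the reduced alphabet, here replaced by plain lists, and the Euclidean metric is an assumption):

```python
from collections import Counter

def project(h, mask):
    # Step 3 preparation: keep only the histogram components marked by the
    # best projection mask found during feature selection.
    return [v for v, b in zip(h, mask) if b]

def knn_predict(D_tr, y_tr, x, k=5):
    # Plain K-NN in the embedding space (K = 5 in the experiments of Sect. 5).
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(range(len(D_tr)), key=lambda i: dist(D_tr[i], x))[:k]
    return Counter(y_tr[i] for i in nearest).most_common(1)[0][0]
```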

4 Extractor and Granulation Improvements

The key component of the system described so far is the Granulator block, which lets meaningful symbols emerge for the ensuing model synthesis. In this section, we introduce an enhanced granulation procedure that takes advantage of available information about the data not yet exploited. Indeed, in Sect. 3.2, information granules are extracted from the subgraph set S^g_tr belonging to the training set S_tr without considering the ground-truth class labels. The latter set, however, also carries the information related to the ground-truth classes that patterns belong to, which is completely neglected during the granulation process, missing the opportunity to exploit information potentially useful for this task [44]. In other words, this a-priori information about the data can be used by the Granulator to synthesize specific symbols for each of the problem-related classes and, hopefully, improve the overall quality of the alphabet [60]. To this end, the Extractor block defined in Sect. 3.1 has been revisited as well, being the core module in charge of forwarding an appropriate set of subgraphs S^g_tr, which is essential for the granulation phase.


4.1 Class-Aware Extractor

As discussed in Sect. 3, since an exhaustive expansion of all graphs into their constituent parts would be unfeasible in terms of running time and memory footprint, the cardinality of S^g_tr is kept fixed to W through the sampling method described in Sect. 3.1.1, which relies on a stochastic procedure based on uniform random pattern sampling from S_tr. As it is, this approach may not guarantee that the subgraphs' label distribution corresponds to that of the starting set S_tr. In order to overcome this problem, the subgraph extraction has been redesigned in a stratified (class-aware) manner that takes into account the frequency of the classes in the training set and, consequently, populates different target sets S^g_l for l = 1, ..., M, with M being the number of classes in the dataset. This class-aware extraction, described in detail in Algorithm 1, can be summarized as follows:
1. For each class l with l = 1, ..., M in the dataset S_tr, evaluate the absolute frequency f_l, namely the number of patterns of class l, such that Σ_{l=1}^{M} f_l = |S_tr|;
2. Let W be the user-defined desired cardinality of S^g_tr, then evaluate the number of subgraphs to be extracted for each class as N_l = (f_l / |S_tr|) · W;
3. Extract N_l subgraphs by performing the stochastic procedure from Sect. 3.1.1 on graphs belonging to the l-th class only, and collect them in S^g_l.

Algorithm 1 Enhanced Extractor
procedure ExtractRnd(graph set S = {G_1, ..., G_n} with G = {V, E}, W max size of subgraphs set S^g, o max order of extracted subgraphs)
    for class l = 1 ... M do
        S^g_l ← initially empty set of class-specific subgraphs
        evaluate the class frequency f_l
        compute the target number of subgraphs per class N_l = (f_l / |S|) · W
        while |S^g_l| ≤ N_l do
            for order k = 1 ... o do
                randomly extract a graph G from S
                if G belongs to class l then
                    randomly extract a vertex v from V
                    g = Extract(G, v, k)
                    S^g_l = S^g_l ∪ g
    return the M class-specific subgraph sets {S^g_1, ..., S^g_M}
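Steps 1 to 3 of the class-aware extraction can be sketched as follows (Python for illustration; `extract_one` is a placeholder for the BFS-based stochastic extraction of Sect. 3.1.1, and rounding of N_l is an assumption):

```python
import random

def class_aware_extract(graphs, labels, W, extract_one):
    """Stratified subgraph extraction: the number of subgraphs drawn for each
    class is proportional to the class frequency in the training set, so the
    label distribution of the subgraph set mirrors that of the original set."""
    per_class = {}
    for l in sorted(set(labels)):
        members = [g for g, y in zip(graphs, labels) if y == l]
        n_l = round(len(members) / len(graphs) * W)  # N_l = (f_l / |S_tr|) * W
        per_class[l] = [extract_one(random.choice(members)) for _ in range(n_l)]
    return per_class
```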

4.2 Class-Aware Granulator

After the extraction phase has been completed, the class-aware granulation can take place. Recall that S^g_l is the set of subgraphs extracted from the graphs of S_tr belonging to the l-th class: the Class-Aware Granulator performs M different BSAS-driven clustering ensemble procedures by considering the class-specialized sets of subgraphs


S^g_l. In this way, each instance of a clustering procedure outputs its own alphabet, say A_l, whose cardinality may differ from class to class. Since in this novel method the clustering algorithms are fed with more homogeneous subgraph sets, the compactness and cardinality of each cluster should improve, leading to an improved cluster quality index F(C), as described in Sect. 3.2. It is worth remarking that the classification system relies on a graph embedding procedure, whose reliability is strongly affected by the quality of the alphabet needed for building the symbolic histograms. When all of the M granulation stages have been completed, the alphabet A collects all symbols in the M alphabets A_l returned so far, and the synthesis moves towards the Embedder block and the symbolic histograms evaluation by means of φ^A(·). A detailed description of the proposed procedure is listed in Algorithm 2. The main drawback of the Class-Aware Granulator is an augmented complexity of the underlying model in terms of cardinality of the symbols alphabet A, as will also be confirmed by the experiments in Sect. 5. It is worth recalling that the BSAS ensemble clustering procedure employed in the Granulator block relies on two parameters, Q (maximum number of allowed clusters) and θ (vector containing different dissimilarity measure thresholds). Furthermore, in the optimization phase, the genetic algorithm described in Sect. 3.5.1 takes care of selecting a suitable value of Q in a defined range Q ∈ [1, Q_max], where Q_max is user-defined. Since the Granulator block generates an ensemble of partitions for each value in θ, the number of clusters for a single granulation can be at most O(|θ| · Q_max), where |θ| trivially depends on the depth level of the binary search employed for building the vector θ. Consequently, when the Class-Aware Granulator is employed, the cardinality |A| of the alphabet can be at most O(M · |θ| · Q_max).
This may lead to very high-dimensional embedding spaces, especially when the number of classes of the problem at hand is large, affecting also the computational time required to build the symbolic histograms. In order to face this problem, we investigate two different approaches (Sects. 4.3 and 4.4, respectively) to reduce the model complexity by bounding the maximum value Q_max according to the number of classes involved.

Algorithm 2 Class-Aware Granulator
procedure Granulate(M subgraph sets S^g_l, θ vector of thresholds, Q max number of clusters)
    A ← initially empty overall alphabet
    for all classes l = 1 ... M do
        A_l ← initially empty class-related alphabet
        partitions = clusteringEnsemble(S^g_l, θ, Q)
        for all clusters C in partitions do
            if F(C) ≤ τ_F then
                append the minSOD representative s_C to A_l
        merge A_l into A
    return A
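The granulation loop of Algorithm 2 reduces to the following sketch (Python for illustration; `cluster_ensemble` and `quality` stand in for the BSAS ensemble and the index F(C) of Eq. (1), and `min` replaces the minSOD representative computation):

```python
def class_aware_granulate(subgraph_sets, cluster_ensemble, quality, tau_F):
    # One clustering-ensemble run per class; clusters passing the quality
    # threshold contribute their representative to the class alphabet A_l,
    # and all class alphabets are merged into the overall alphabet A.
    A = []
    for S_l in subgraph_sets:  # l = 1 ... M
        clusters = cluster_ensemble(S_l)
        A_l = [min(C) for C in clusters if quality(C) <= tau_F]
        A.extend(A_l)
    return A
```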


4.3 Class-Aware Granulator with Uniform Q Scaling

The first method is a simple uniform scaling of Q_max with respect to the M classes of the classification problem at hand. That is, when the l-th granulation occurs, Q is bounded in the range Q ∈ [1, Q_max/M], for all of the M classes in the dataset. Of course, this reduces the prospective number of symbols by shrinking the range of admissible values for Q and therefore limits the cardinality of A. In plain words, [1, Q_max/M] are the genetic algorithm lower and upper bounds for the Q parameter involved in Algorithm 2.

4.4 Class-Aware Granulator with Frequency-Based Q Scaling

If on the one hand the previous approach may work well for balanced datasets, on the other hand the uniform scaling may not work properly when the classes in the dataset are not equally distributed. Indeed, recall from Sect. 4.2 that the number of subgraphs to be extracted for each class is proportional to the frequency of the class itself. For this reason, a different strategy is deployed to scale Q in a class-aware fashion as well, by considering the frequency f_l of the l-th class in the training set:
1. For each class l with l = 1 ... M in S_tr, evaluate the absolute frequency f_l such that Σ_{l=1}^{M} f_l = |S_tr|;
2. For each class l with l = 1 ... M, evaluate the scaled value Q_max^l = (f_l / |S_tr|) · Q_max;
3. Perform the Class-Aware Granulator for class l with Q ∈ [1, Q_max^l].
Therefore, each time a partition has to be built (Algorithm 2), a class-specific value Q_max^l is selected for the related Q range by the Class-Aware Granulator. It is worth noting that for this specific method, the length of the genetic code differs from what was previously described in Sect. 3.5.1. Indeed, when the frequency-based Q scaling is employed, we need to optimize M different Q values, one for each Class-Aware Granulator instance: this inevitably results in a genetic code length increased by a factor of M − 1 with respect to the former case (cf. Eq. (9)), namely:

[Q_1 ... Q_M  τ_F  η  w  Π]    (11)
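The two scaling rules of Sects. 4.3 and 4.4 amount to computing per-class upper bounds for Q (Python for illustration; clamping the bounds to at least 1 is an assumption for degenerate cases):

```python
def uniform_q_bounds(Q_max, M):
    # Sect. 4.3: the same shrunk upper bound Q_max / M for every class.
    return [max(1, Q_max // M)] * M

def frequency_q_bounds(Q_max, freqs):
    # Sect. 4.4: class-specific bounds Q_max^l = (f_l / |S_tr|) * Q_max.
    n = sum(freqs)
    return [max(1, round(f / n * Q_max)) for f in freqs]
```

With Q_max = 500 as in Sect. 5, five balanced classes get a uniform bound of 100 each, while an unbalanced split such as (400, 100) keeps a larger bound for the more frequent class.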

5 Tests and Results

For assessing the proposed improvements over both the original GRALG implementation [4] and the non-stratified stochastic variant [2], different graph datasets from the IAM repository [62] are considered (see Table 1 for the list and description).


Since graphs labelled on both nodes and edges have been considered, suitable dissimilarity measures have to be defined as well (cf. Sect. 3.2.1):

Letter: node labels are real-valued 2-dimensional vectors v of (x, y) coordinates, therefore the dissimilarity measure d_v between two given nodes, say v^(a) and v^(b), is defined as the plain Euclidean distance:

d_v(v^(a), v^(b)) = ‖v^(a) − v^(b)‖_2

Conversely, edges are not labelled. As shown in Table 1, this dataset has three variants that differ by the amount of distortion introduced into nodes and edges, denoted by Low, Medium and High.

AIDS: node labels are composed of a string value S_chem (chemical symbol), an integer N_ch (charge) and a real-valued 2-dimensional vector v of (x, y) coordinates. For any two given nodes, their dissimilarity is evaluated as:

d_v(v^(a), v^(b)) = ‖v^(a) − v^(b)‖_2 + |N_ch^(a) − N_ch^(b)| + d_s(S_chem^(a), S_chem^(b))

where d_s(S_chem^(a), S_chem^(b)) = 1 if S_chem^(a) ≠ S_chem^(b), and 0 otherwise. Conversely, the edge dissimilarity is discarded, since it is not useful for the classification task.

GREC: node labels are composed of a string (type) and a real-valued 2-dimensional vector v. The dissimilarity measure d_v between two different nodes, revisited with respect to our previous work [2], reads as:

d_v(v^(a), v^(b)) = (1 − χ) · (1/√2) ‖v^(a) − v^(b)‖_2                if type^(a) = type^(b)
                    χ + (1 − χ) · (1/√2) ‖v^(a) − v^(b)‖_2          otherwise

Edge labels are defined by an integer value freq (frequency) that defines the number of (type, angle) pairs, where, in turn, type is a string which may assume two values (namely, arc or line) and angle is a real number. Given two edges, say e^(a) and e^(b), their dissimilarity is defined as follows:

1. If freq^(a) = freq^(b) = 1:

d_e(e^(a), e^(b)) = α · d^line(angle^(a), angle^(b))   if type^(a) = type^(b) = line
                    β · d^arc(angle^(a), angle^(b))    if type^(a) = type^(b) = arc
                    γ                                  otherwise

2. If freq^(a) = freq^(b) = 2:


Table 1 Characteristics of the IAM datasets used for testing: size of training (tr), validation (vl) and test (ts) sets, number of classes (# classes), types of node and edge labels, average number of nodes and edges, and whether the dataset is uniformly distributed amongst classes (Balanced). Taken from [2]

Dataset   size (tr, vl, ts)   # classes   node labels             edge labels   Avg # nodes   Avg # edges   Balanced
Letter-L  750, 750, 750       15          R2                      none          4.7           3.1           Y
Letter-M  750, 750, 750       15          R2                      none          4.7           3.2           Y
Letter-H  750, 750, 750       15          R2                      none          4.7           4.5           Y
GREC      286, 286, 528       22          string + R2             tuple         11.5          12.2          Y
AIDS      250, 250, 1500      2           string + integer + R2   integer       15.7          16.2          N

d_e(e^(a), e^(b)) = (α/2) · d^line(angle_1^(a), angle_1^(b)) + (α/2) · d^line(angle_2^(a), angle_2^(b))   if type^(a) = type^(b) = line
                    (β/2) · d^arc(angle_1^(a), angle_1^(b)) + (β/2) · d^arc(angle_2^(a), angle_2^(b))     if type^(a) = type^(b) = arc
                    γ                                                                                     otherwise

3. If freq^(a) ≠ freq^(b):

d_e(e^(a), e^(b)) = δ

where d^line and d^arc are the module distances normalized in [−π, π] and [0, arc_max], respectively. The parameters χ, α, β, γ, δ ∈ [0, 1] compose the set Π defined in Sect. 3.5.1, which shall be optimized by the genetic algorithm.

Table 2 Number of subgraphs extracted (o = 5) by the exhaustive procedure. Taken from [2]

Dataset    |S^g_tr|   |S^g_vl|   |S^g_ts|
Letter-H   21165      20543      21435
Letter-M   8582       8489       8560
Letter-L   8193       7976       8111
GREC       27119      28581      50579
AIDS       35208      35692      220108
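As a concrete illustration of the measures above, the AIDS node dissimilarity can be sketched as follows (Python; the tuple encoding of a node as (symbol, charge, coordinates) is an assumption for illustration, the actual implementation being in C++):

```python
import math

def aids_node_dissimilarity(a, b):
    # Euclidean distance between coordinates, plus the absolute charge
    # difference, plus a 0/1 term on the chemical symbol (d_s).
    (s_a, n_a, v_a), (s_b, n_b, v_b) = a, b
    return math.dist(v_a, v_b) + abs(n_a - n_b) + (0 if s_a == s_b else 1)
```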

Towards a Class-Aware Information Granulation …

281

The implementation has been developed in C++, using the SPARE3 and Boost4 libraries. Tests have been performed on a workstation with Linux Ubuntu 18.04 and a hyperthreaded 4-core Intel i7-3770K @ 3.50 GHz equipped with 32 GB of RAM. For the sake of comparison, the number of subgraphs extracted from the training set, validation set and test set by the former exhaustive procedure [4] is reported in Table 2. As our purpose is to validate the goodness of the enhanced Granulator along with the strategies that bound the alphabet cardinality, we considered four configurations:
– Configuration 1 (Basic): no class-aware granulation deployed, as investigated in [2]
– Configuration 2 (CA): Class-Aware granulation without Q scaling (Sect. 4.2)
– Configuration 3 (CA-US): Class-Aware granulation with uniform Q scaling (Sect. 4.3)
– Configuration 4 (CA-FS): Class-Aware granulation with frequency-based Q scaling (Sect. 4.4).
In all tests, we followed the random extraction procedure defined in Algorithm 1, setting the maximum number of allowed subgraphs W equal to a given percentage of |S^g_tr| (cf. Table 2). The subgraphs for the embedding strategy are extracted by following the procedure described in Sect. 3.3. Furthermore, the traversal strategy deployed in both the extraction and embedding phases is BFS. This choice stems from the results obtained in our previous work [2], where no clear winner emerged between the BFS and DFS traversal strategies in terms of performance and running times for the considered datasets. The algorithm parameters are set as follows:
– W = 10%, 30%, 50% of |S^g_tr|
– Q_max = 500
– o = 5, the maximum order of the extracted subgraphs
– 20 individuals for the population of both genetic algorithms
– 20 generations for the first genetic algorithm (alphabet optimization)
– 50 generations for the second genetic algorithm (feature selection)
– α = 1 in the fitness function for the second genetic algorithm (no weight to sparsity)
– K = 5 for the K-NN classifier
– a tolerance value of 1.1 for the symbolic histograms evaluation
In Figs. 1, 2 and 3, we take into account three different aspects in order to compare the four strategies:
1. accuracy on the test set;
2. overall running times (including training and test phases);

3 https://sourceforge.net/projects/libspare/
4 http://www.boost.org/


Table 3 Results achieved by the original GRALG implementation (exhaustive extraction)

Dataset    Running times [min]   Accuracy on test set [%]   Alphabet size
AIDS       37.81                 99.44                      58
GREC       182.00                91.10                      477
Letter-L   9.48                  98.425                     146
Letter-M   13.21                 86.34                      298
Letter-H   48.71                 77.16                      350

Fig. 1 Accuracy comparison for the four configurations (a Basic, b CA, c CA-US, d CA-FS), for W = 10%, 30%, 50%

3. total number of symbols in the alphabet at the end of the alphabet synthesis phase.5
Due to the intrinsic randomness of the training procedures, the results presented herein have been averaged across 10 different runs. As a baseline benchmark, we refer to

5 This number refers to the alphabet before the feature selection phase, in order to have a fair comparison, free of biases from the second genetic algorithm.

Fig. 2 Running times comparison for the four configurations (a Basic, b CA, c CA-US, d CA-FS), for W = 10%, 30%, 50%

the results shown in Table 3, obtained by the original GRALG implementation [4], which is devoid of both the random Extractor and the Class-Aware Granulator and instead performs an exhaustive extraction to populate the set of subgraphs for the (non-stratified) clustering ensemble and the embedding phase. By matching Figs. 1 and 2 with Table 3, it is possible to spot the major improvements achieved in [2]: a clear-cut reduction of running times with some minor reduction in performance (accuracy on the test set), especially for the hardest problems (GREC and Letter-H). For these two datasets, the introduction of the Class-Aware strategy leads to improvements in terms of accuracy:
• for GREC, the exhaustive procedure still performs better, yet it is possible to observe a performance boost with respect to the non-stratified stochastic variant;
• for Letter-H, the Class-Aware strategy outperforms the exhaustive procedure by 4-5%.
When the Class-Aware strategy is employed, in any of its configurations, Fig. 1b-d show improved results with respect to the non-stratified version in Fig. 1a. Indeed, results for GREC and Letter-H improve by 4-5% and 8-9% in terms of accuracy, respectively, followed by Letter-M, which gains a 3-4% accuracy boost.

Fig. 3 Alphabet size comparison for the four configurations (a Basic, b CA, c CA-US, d CA-FS), for W = 10%, 30%, 50%

On the other hand, when no scaling method is considered, the Class-Aware procedure worsens the computational complexity of the classifier, up to reaching running times comparable to the original implementation, as shown in Fig. 2b. An explanation of this behaviour may be found by considering the embedding procedure in Sect. 3.3: when a graph has to be embedded in the vector space, every symbol of the alphabet A has to be matched against the set of subgraphs drawn from the graph itself, and therefore the computational complexity can grow rapidly if both the subgraphs and the alphabet become too large, as in the case discussed in Sect. 4.2, where the aforementioned set cardinality can reach O(M · |θ| · Q_max). This phenomenon is easy to see by comparing Fig. 3a with Fig. 3b, where the latter shows a clear-cut augmentation of symbols in the alphabet for almost all of the considered datasets. The results obtained when the Q scaling methods are employed show that they are effective in reducing the cardinality of the alphabet A, as we can observe by comparing Fig. 3b with Figs. 3c-3d. As a consequence, in Figs. 2c-2d the running times for the configurations with the aforementioned methods scale down comparably to the base configuration in Fig. 2a.


6 Conclusions

In this paper, we investigated an improved strategy to extract relevant substructures for graph embedding. Specifically, starting from the results obtained in [2], we considered a novel Class-Aware Granulator capable of synthesizing class-specific symbols thanks to an enhanced Extractor, which operates in a class-aware fashion as well. The class-aware (stratified) method deploys a clustering ensemble on each class-related subgraph set extracted from the training set. The hypothesis behind this method is that, by composing different alphabets whose symbols represent relevant and specific substructures for a given class, the embedding space spanned by the symbolic histograms should exhibit a better separation amongst the problem-related classes. Computational results show that this approach is effective (at least on the five considered datasets), outperforming the former version and, in some cases, the baseline implementation equipped with an exhaustive subgraph extraction. If on one hand this method shows interesting accuracy boosts, on the other hand we witnessed a rapid growth of the alphabet cardinality with respect to the number of classes of the problem at hand and, consequently, a deep deterioration of the running time, due to the fact that the granulation procedure must be repeated as many times as there are classes in the dataset. To address this problem, we investigated two different approaches for bounding the maximum number of clusters for each clustering ensemble procedure in the Class-Aware strategy: the first method tries to limit the number of resulting clusters by uniformly scaling the genetic algorithm upper bound for the maximum number of clusters, where this scaled value is common to all granulation instances, regardless of the class under analysis; the second method scales the upper bound in a class-aware fashion as well, by weighting the maximum number of clusters according to the frequency of each class.
Although the latter approach seems smarter and more suitable for unbalanced classes, the resulting genetic code grows with respect to the former case, as each granulation instance will have its own bounds, increasing the search space. The achieved results prove that both strategies are effective in synthesizing meaningful embedding spaces while, at the same time, reducing the alphabet cardinality and the computational time. The novelties introduced in our previous work [2] reinforce the concept that clustering is a promising technique for synthesizing information granules also with small datasets. Indeed, thanks to the proposed Class-Aware procedure, which smartly exploits all the available information, relevant symbols can emerge even from strongly subsampled datasets, without any remarkable penalty in terms of computational complexity. All proposed methods foster graph embedding procedures based on Granular Computing towards Big Data scenarios, where a clever use of the available information is mandatory due to redundancies and noise. Future research could be directed to further improvements in the graph embedding stage. In fact, instead of letting a single genetic algorithm synthesize the class-aware alphabets, one can employ a separate genetic algorithm for each class: in this way, GRALG can be equipped with the so-called local metric learning capability


[24], as each class will be characterized by its own GED weights and node/edge dissimilarity measure parameters.

References
1. Bai, X.: Graph-Based Methods in Computer Vision: Developments and Applications. IGI Global (2012)
2. Baldini, L., Martino, A., Rizzi, A.: Stochastic information granules extraction for graph embedding and classification. In: Proceedings of the 11th International Joint Conference on Computational Intelligence - Volume 1: NCTA (IJCCI 2019), pp. 391–402. INSTICC, SciTePress (2019). https://doi.org/10.5220/0008149403910402
3. Bargiela, A., Pedrycz, W.: Toward a theory of granular computing for human-centered information processing. IEEE Transactions on Fuzzy Systems 16(2), 320–330 (2008). https://doi.org/10.1109/TFUZZ.2007.905912
4. Bianchi, F.M., Livi, L., Rizzi, A., Sadeghian, A.: A granular computing approach to the design of optimized graph classification systems. Soft Computing 18(2), 393–412 (2014). https://doi.org/10.1007/s00500-013-1065-z
5. Bianchi, F.M., Scardapane, S., Livi, L., Uncini, A., Rizzi, A.: An interpretable graph-based image classifier. In: 2014 International Joint Conference on Neural Networks (IJCNN), pp. 2339–2346 (2014). https://doi.org/10.1109/IJCNN.2014.6889601
6. Bianchi, F.M., Scardapane, S., Rizzi, A., Uncini, A., Sadeghian, A.: Granular computing techniques for classification and semantic characterization of structured data. Cognitive Computation 8(3), 442–461 (2016). https://doi.org/10.1007/s12559-015-9369-1
7. Borgatti, S.P., Mehra, A., Brass, D.J., Labianca, G.: Network analysis in the social sciences. Science 323(5916), 892–895 (2009). https://doi.org/10.1126/science.1165821
8. Borowska, K., Stepaniuk, J.: A rough-granular approach to the imbalanced data classification problem. Applied Soft Computing 83, 105607 (2019). https://doi.org/10.1016/j.asoc.2019.105607
9. Bunke, H., Allermann, G.: Inexact graph matching for structural pattern recognition. Pattern Recognition Letters 1(4), 245–253 (1983). https://doi.org/10.1016/0167-8655(83)90033-8
10. Bunke, H.: On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters 18(8), 689–694 (1997). https://doi.org/10.1016/S0167-8655(97)00060-3
11. Bunke, H.: Graph matching: Theoretical foundations, algorithms, and applications. In: Proceedings of Vision Interface, pp. 82–88 (2000)
12. Bunke, H.: Graph-based tools for data mining and machine learning. In: Perner, P., Rosenfeld, A. (eds.) Machine Learning and Data Mining in Pattern Recognition, pp. 7–19. Springer, Berlin, Heidelberg (2003). https://doi.org/10.1007/3-540-45065-3_2
13. Cai, H., Zheng, V.W., Chang, K.: A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge & Data Engineering 30(09), 1616–1637 (2018). https://doi.org/10.1109/TKDE.2018.2807452
14. Chiaselotti, G., Ciucci, D., Gentile, T.: Simple graphs in granular computing. Information Sciences 340–341, 279–304 (2016). https://doi.org/10.1016/j.ins.2015.12.042
15. Cinti, A., Bianchi, F.M., Martino, A., Rizzi, A.: A novel algorithm for online inexact string matching and its FPGA implementation. Cognitive Computation 12(2), 369–387 (2020). https://doi.org/10.1007/s12559-019-09646-y
16. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
17. Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1), 21–27 (1967). https://doi.org/10.1109/TIT.1967.1053964
18. Cover, T.M.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers EC-14(3), 326–334 (1965). https://doi.org/10.1109/PGEC.1965.264137
19. Davidson, E.H., Rast, J.P., Oliveri, P., Ransick, A., Calestani, C., Yuh, C.H., Minokawa, T., Amore, G., Hinman, V., Arenas-Mena, C., Otim, O., Brown, C.T., Livi, C.B., Lee, P.Y., Revilla, R., Rust, A.G., Pan, Z.J., Schilstra, M.J., Clarke, P.J.C., Arnone, M.I., Rowen, L., Cameron, R.A., McClay, D.R., Hood, L., Bolouri, H.: A genomic regulatory network for development. Science 295(5560), 1669–1678 (2002). https://doi.org/10.1126/science.1069883
20. De Santis, E., Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Dissimilarity space representations and automatic feature selection for protein function prediction. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2018). https://doi.org/10.1109/IJCNN.2018.8489115
21. Del Vescovo, G., Rizzi, A.: Automatic classification of graphs by symbolic histograms. In: 2007 IEEE International Conference on Granular Computing (GRC 2007), pp. 410–416. IEEE (2007)
22. Del Vescovo, G., Rizzi, A.: Online handwriting recognition by the symbolic histograms approach. In: 2007 IEEE International Conference on Granular Computing (GRC 2007), p. 686. IEEE (2007)
23. Dey, A., Broumi, S., Son, L.H., Bakali, A., Talea, M., Smarandache, F.: A new algorithm for finding minimum spanning trees with undirected neutrosophic graphs. Granular Computing 4(1), 63–69 (2019). https://doi.org/10.1007/s41066-018-0084-7
24. Di Noia, A., Martino, A., Montanari, P., Rizzi, A.: Supervised machine learning techniques and genetic optimization for occupational diseases risk prediction. Soft Computing 24(6), 4393–4406 (2020). https://doi.org/10.1007/s00500-019-04200-2
25. Di Paola, L., De Ruvo, M., Paci, P., Santoni, D., Giuliani, A.: Protein contact networks: An emerging paradigm in chemistry. Chemical Reviews 113(3), 1598–1613 (2013). https://doi.org/10.1021/cr3002356
26. Di Paola, L., Giuliani, A.: Protein–Protein Interactions: The Structural Foundation of Life Complexity, pp. 1–12. American Cancer Society (2017). https://doi.org/10.1002/9780470015902.a0001346.pub2
27. Ding, S., Du, M., Zhu, H.: Survey on granularity clustering. Cognitive Neurodynamics 9(6), 561–572 (2015)
28. Dubois, D., Prade, H.: Bridging gaps between several forms of granular computing. Granular Computing 1(2), 115–126 (2016)
29. Gasteiger, J., Engel, T.: Chemoinformatics: A Textbook. John Wiley & Sons (2006)
30. Ghosh, S., Das, N., Gonçalves, T., Quaresma, P., Kundu, M.: The journey of graph kernels through two decades. Computer Science Review 27, 88–111 (2018)
31. Giuliani, A., Filippi, S., Bertolaso, M.: Why network approach can promote a new way of thinking in biology. Frontiers in Genetics 5, 83 (2014). https://doi.org/10.3389/fgene.2014.00083
32. Howard, N., Lieberman, H.: Brainspace: Relating neuroscience to knowledge about everyday life. Cognitive Computation 6(1), 35–44 (2014). https://doi.org/10.1007/s12559-012-9171-2
33. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., Barabási, A.L.: The large-scale organization of metabolic networks. Nature 407(6804), 651 (2000). https://doi.org/10.1038/35036627
34. Krishnan, A., Zbilut, J.P., Tomita, M., Giuliani, A.: Proteins as networks: usefulness of graph theory in protein science. Current Protein and Peptide Science 9(1), 28–38 (2008). https://doi.org/10.2174/138920308783565705
35. Lin, T.Y., Yao, Y.Y., Zadeh, L.A.: Data Mining, Rough Sets and Granular Computing, vol. 95. Physica (2013)
36. Maiorino, E., Possemato, F., Modugno, V., Rizzi, A.: Information granules filtering for inexact sequential pattern mining by evolutionary computation. In: Proceedings of the International Joint Conference on Computational Intelligence - Volume 1, pp. 104–111. IJCCI 2014, SCITEPRESS - Science and Technology Publications, Lda, Setubal, PRT (2014). https://doi.org/10.5220/0005124901040111
37. Maiorino, E., Possemato, F., Modugno, V., Rizzi, A.: Noise sensitivity of an information granules filtering procedure by genetic optimization for inexact sequential pattern mining. In: Merelo, J.J., Rosa, A., Cadenas, J.M., Dourado, A., Madani, K., Filipe, J. (eds.) Computational Intelligence, pp. 131–150. Springer International Publishing, Cham (2016). https://doi.org/10.1007/978-3-319-26393-9_9
38. Maiorino, E., Rizzi, A., Sadeghian, A., Giuliani, A.: Spectral reconstruction of protein contact networks. Physica A: Statistical Mechanics and its Applications 471, 804–817 (2017). https://doi.org/10.1016/j.physa.2016.12.046
39. Martino, A., De Santis, E., Baldini, L., Rizzi, A.: Calibration techniques for binary classification problems: A comparative analysis. In: Proceedings of the 11th International Joint Conference on Computational Intelligence - Volume 1: NCTA (IJCCI 2019), pp. 487–495. INSTICC, SciTePress (2019). https://doi.org/10.5220/0008165504870495
40. Martino, A., De Santis, E., Giuliani, A., Rizzi, A.: Modelling and recognition of protein contact networks by multiple kernel learning and dissimilarity representations. Entropy 22(7) (2020). https://doi.org/10.3390/e22070794
41. Martino, A., Frattale Mascioli, F.M., Rizzi, A.: On the optimization of embedding spaces via information granulation for pattern recognition. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2020). https://doi.org/10.1109/IJCNN48605.2020.9206830
42. Martino, A., Giuliani, A., Rizzi, A.: Granular computing techniques for bioinformatics pattern recognition problems in non-metric spaces. In: Pedrycz, W., Chen, S.M. (eds.) Computational Intelligence for Pattern Recognition, pp. 53–81. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-89629-8_3
43. Martino, A., Giuliani, A., Rizzi, A.: (Hyper)graph embedding and classification via simplicial complexes. Algorithms 12(11) (2019). https://doi.org/10.3390/a12110223
44. Martino, A., Giuliani, A., Todde, V., Bizzarri, M., Rizzi, A.: Metabolic networks classification and knowledge discovery by information granulation. Computational Biology and Chemistry, 107187 (2019).
https://doi.org/10.1016/j.compbiolchem.2019.107187 Martino, A., Maiorino, E., Giuliani, A., Giampieri, M., Rizzi, A.: Supervised approaches for function prediction of proteins contact networks from topological structure information. In: Sharma, P., Bianchi, F.M. (eds.) Image Analysis: 20th Scandinavian Conference, SCIA 2017, Tromsø, Norway, June 12–14, 2017, Proceedings, Part I, pp. 285–296. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-59126-1_24 Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Efficient approaches for solving the large-scale k-medoids problem. In: Proceedings of the 9th International Joint Conference on Computational Intelligence - Volume 1: IJCCI,. pp. 338–347. INSTICC, SciTePress (2017). https://doi.org/ 10.5220/0006515003380347 Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering. In: 2018 International Joint Conference on Neural Networks (IJCNN). pp. 1–8 (2018). https://doi.org/10.1109/IJCNN. 2018.8489101 Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Supervised approaches for protein function prediction by topological data analysis. In: 2018 International Joint Conference on Neural Networks (IJCNN). pp. 1–8 (2018). https://doi.org/10.1109/IJCNN.2018.8489307 Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Efficient approaches for solving the largescale k-medoids problem: Towards structured data. In: Sabourin, C., Merelo, J.J., Madani, K., Warwick, K. (eds.) Computational Intelligence: 9th International Joint Conference, IJCCI 2017 Funchal-Madeira, Portugal, November 1-3, 2017 Revised Selected Papers, pp. 199–219. Springer International Publishing, Cham (2019). https://doi.org/10.1007/978-3-030-164690_11 Mercer, J.: Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical transactions of the royal society of London. 
Series A, containing papers of a mathematical or physical character 209, 415–446 (1909) Neuhaus, M., Bunke, H.: Bridging the gap between graph edit distance and kernel machines, vol. 68. World Scientific (2007)

Towards a Class-Aware Information Granulation …

289

52. Pedrycz, A., Hirota, K., Pedrycz, W., Dong, F.: Granular representation and granular computing with fuzzy sets. Fuzzy Sets and Systems 203, 17–32 (2012) 53. Pedrycz, W.: Knowledge-based clustering: from data to information granules. John Wiley & Sons (2005) 54. Pedrycz, W.: Human centricity in computing with fuzzy sets: an interpretability quest for higher order granular constructs. Journal of Ambient Intelligence and Humanized Computing 1(1), 65–74 (2010) 55. Pedrycz, W.: Proximity-based clustering: a search for structural consistency in data with semantic blocks of features. IEEE Transactions on Fuzzy Systems 21(5), 978–982 (2013) 56. Pedrycz, W.: Granular computing: analysis and design of intelligent systems. CRC Press (2016) 57. Pedrycz, W., Homenda, W.: Building the fundamentals of granular computing: A principle of justifiable granularity. Applied Soft Computing 13(10), 4209–4218 (2013). https://doi.org/10. 1016/j.asoc.2013.06.017 58. Pe˛kalska, E., Duin, R.P.: The dissimilarity representation for pattern recognition: foundations and applications. World Scientific (2005) 59. Peters, G., Weber, R.: Dcc: a framework for dynamic granular clustering. Granular Computing 1(1), 1–11 (2016). https://doi.org/10.1007/s41066-015-0012-z 60. Possemato, F., Rizzi, A.: Automatic text categorization by a granular computing approach: Facing unbalanced data sets. In: The 2013 International Joint Conference on Neural Networks (IJCNN). pp. 1–8 (Aug 2013). https://doi.org/10.1109/IJCNN.2013.6707082 61. Richiardi, J., Achard, S., Bunke, H., Van De Ville, D.: Machine learning with brain graphs: predictive modeling approaches for functional imaging in systems neuroscience. IEEE Signal Processing Magazine 30(3), 58–70 (2013) 62. Riesen, K., Bunke, H.: Iam graph database repository for graph based pattern recognition and machine learning. 
In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). pp. 287–297. Springer (2008) 63. Rizzi, A., Del Vescovo, G.: Automatic image classification by a granular computing approach. In: 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing. pp. 33–38 (Sep 2006). https://doi.org/10.1109/MLSP.2006.275517 64. Rizzi, A., Del Vescovo, G., Livi, L., Frattale Mascioli, F.M.: A new granular computing approach for sequences representation and classification. In: The 2012 International Joint Conference on Neural Networks (IJCNN). pp. 1–8 (June 2012). https://doi.org/10.1109/IJCNN. 2012.6252680 65. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press, 4 edn. (2008) 66. Wang, F., Sun, J.: Survey on distance metric learning and dimensionality reduction in data mining. Data Mining and Knowledge Discovery 29(2), 534–564 (2015). https://doi.org/10. 1007/s10618-014-0356-z 67. Wang, G., Yang, J., Xu, J.: Granular computing: from granularity optimization to multigranularity joint problem solving. Granular Computing 2(3), 105–120 (2017). https://doi.org/ 10.1007/s41066-016-0032-3 68. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press, New York, USA (1994) 69. Weinshall, D., Jacobs, D.W., Gdalyahu, Y.: Classification in non-metric spaces. In: Kearns, M.J., Solla, S.A., Cohn, D.A. (eds.) Advances in Neural Information Processing Systems 11, pp. 838–846. MIT Press (1999) 70. William-West, T.O., Singh, D.: Information granulation for rough fuzzy hypergraphs. Granular Computing 3(1), 75–92 (2018). https://doi.org/10.1007/s41066-017-0057-2 71. Xu, F., Uszkoreit, H., Du, Y., Fan, W., Zhao, D., Zhu, J.: Explainable ai: A brief survey on history, research areas, approaches and challenges. In: Tang, J., Kan, M.Y., Zhao, D., Li, S., Zan, H. (eds.) 
Natural Language Processing and Chinese Computing, pp. 563–574. Springer International Publishing, Cham (2019) 72. Yang, J., Wang, G., Zhang, Q.: Knowledge distance measure in multigranulation spaces of fuzzy equivalence relations. Information Sciences 448, 18–35 (2018). https://doi.org/10.1016/ j.ins.2018.03.026

290

L. Baldini et al.

73. Yao, Y.Y.: The rise of granular computing. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) 20(3), 299–308 (2008) 74. Yao, Y.: A triarchic theory of granular computing. Granular Computing 1(2), 145–157 (2016). https://doi.org/10.1007/s41066-015-0011-0 75. Yao, Y., Zhao, L.: A measurement theory view on the granularity of partitions. Information Sciences 213, 1–13 (2012). https://doi.org/10.1016/j.ins.2012.05.021 76. Zadeh, L.A.: Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy sets and systems 90(2), 111–127 (1997) 77. Zhang, Q., Zhang, Q., Wang, G.: The uncertainty of probabilistic rough sets in multi-granulation spaces. International Journal of Approximate Reasoning 77, 38–54 (2016)

Near Optimal Solving of the (N²−1)-puzzle Using Heuristics Based on Artificial Neural Networks

Vojtech Cahlik and Pavel Surynek

Abstract In this paper we address the design of heuristics for near-optimal solving of the (N²−1)-puzzle using the A* search algorithm. The A* search algorithm explores configurations of the puzzle in the order determined by a heuristic that tries to estimate the minimum number of moves needed to reach the goal from the given configuration. To guarantee finding an optimal solution, the A* algorithm requires heuristics that estimate the number of moves from below. Common heuristics for the (N²−1)-puzzle often greatly underestimate the true number of moves in order to meet the admissibility requirement. The worse the estimation is, the more configurations the search algorithm needs to explore. We therefore relax the admissibility requirement and design a novel heuristic that tries to estimate the minimum number of moves remaining as precisely as possible, while overestimation of the true distance is permitted. Our heuristic, called ANN-distance, is based on a deep artificial neural network (ANN). We experimentally show that with a well-trained ANN-distance heuristic, whose inputs are just the positions of the tiles, we are able to achieve better accuracy of estimation than with conventional heuristics such as those derived from the Manhattan distance or pattern database heuristics. Though we cannot guarantee admissibility of ANN-distance due to possible overestimation of the true number of moves, an experimental evaluation on random 15-puzzles shows that in most cases ANN-distance calculates the true minimum distance from the goal or an estimation that is very close to it. Consequently, A* search with the ANN-distance heuristic usually finds an optimal solution or a solution that is very close to the optimum. Moreover, the underlying neural network in ANN-distance consumes much less memory than a comparable pattern database. We also show that a deep artificial neural network can be more powerful than a shallow one, and we also trained our heuristic to prefer underestimating the optimal solution cost, which pushed the solutions towards better optimality.

V. Cahlik · P. Surynek (B)
Faculty of Information Technology, Czech Technical University in Prague, Thákurova 9, 160 00 Praha 6, Prague, Czech Republic
e-mail: [email protected]
V. Cahlik
e-mail: [email protected]

© Springer Nature Switzerland AG 2021
J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_12


Keywords (N²−1)-puzzle · Heuristic design · Artificial neural networks · Deep learning · Near optimal solutions

1 Introduction

The (N²−1)-puzzle [34, 41], due to its challenging difficulty and yet simple formulation, became an important benchmark for a variety of problem solving methods and heuristic search algorithms [5, 16]. The task in the (N²−1)-puzzle is to rearrange N²−1 square tiles on a square board of size N × N by shifting them horizontally or vertically into a configuration where the tiles are ordered from 1 to N²−1 (see Fig. 1 for an illustration of the 8-puzzle). Tiles can never overlap, so there is one blank position on the board that enables one tile at a time to move; that is, a tile can be moved horizontally or vertically to the neighboring blank position in one step. It is known that finding an optimal solution of the (N²−1)-puzzle, that is, finding the shortest possible sequence of moves that reaches the goal configuration from the initial one, is an NP-hard problem [6, 26, 27]. Hence the (N²−1)-puzzle is considered a challenging benchmark problem for testing a variety of algorithms. On the other hand, the difficulty of the puzzle is not absolutely prohibitive, as the problem belongs to NP, which can be seen from the existence of algorithms that generate feasible sub-optimal solutions of polynomial length (O(N⁶), to be precise, for the (N²−1)-puzzle).

Various approaches have been adopted to address the problem using search-based and other techniques. They include heuristics built on top of pattern databases [9, 11], which are applicable as part of the A* algorithm to obtain optimal solutions. The disadvantage of optimal search-based algorithms is their limited scalability for larger N, even with advanced heuristics. So far the best known result is an optimal A*-based algorithm for the 24-puzzle by Korf [17], built on top of sophisticated pattern database heuristics. The 35-puzzle is still waiting for an optimal algorithm that can finish in reasonable time on commodity hardware.
A different stream of research solves the (N²−1)-puzzle sub-optimally using rule-based algorithms such as those of Parberry [23] or those using permutation group reasoning [18]. The advantage of these algorithms is that they are fast and can be used in an online mode. However, solutions produced by these algorithms are usually very far from the optimum, which precludes their application in practice. These algorithms are commonly used to test the existence of a solution before an optimal algorithm is started.

Various attempts have been made to combine good solution quality with fast solving by unifying the above streams of research. For example, improvements of sub-optimal solutions using macro operations, where instead of moving a single tile to its goal position, multiple tiles are taken together to form a so-called snake that moves as a single entity, were introduced in [40]. Moving tiles together consumes fewer moves than if the tiles are moved individually. Another approach is to design better tile rearrangement rules that by themselves lead to shorter solutions, as suggested in


[24]. In the average case, these rule-based algorithms can generate solutions that are much better than those produced by plain rule-based algorithms [25].

1.1 Contributions

In this paper, we present an attempt to solve the (N²−1)-puzzle near-optimally or optimally using a heuristic based on artificial neural networks (ANNs) [15], called ANN-distance. Our heuristic is intended to be integrated into the A* algorithm [14]. Particularly, our contributions are as follows:

• We design a near-admissible heuristic to be used in the A* algorithm. Unlike conventional heuristics used in A* that focus on finding optimal solutions and must therefore strictly satisfy the admissibility requirement, our heuristic only tries to make the best possible estimation, while admissibility can be occasionally violated (that is, the heuristic overestimates). Specifically, we try to directly calculate the estimation of the number of moves remaining to reach the goal configuration using an ANN. This is in contrast to admissible heuristics that estimate from below, which often leads to significant underestimations.

• We test the hypothesis that our ANN-distance heuristic does not violate the admissibility requirement in most cases and provides better estimations than heuristics based on pattern databases. That is, we have shown in our experimentation with the 15-puzzle that the A* algorithm with ANN-distance produces sub-optimal solutions only rarely. Moreover, estimations made with ANN-distance are closer to the true minimum number of moves required than those obtained from the comparable 7–8 pattern database heuristic [10].

Our research on the topic of utilizing ANNs in (N²−1)-puzzle solving was initiated as part of a bachelor's thesis [3]. This paper extends our previous work presented in [2]. Specifically, we improved the training process of the ANN-distance heuristic so that it generates better results than those presented in [2]. Additionally, we present a significantly extended set of experiments that did not fit in the original conference format.
The paper is organized as follows: we first introduce the (N²−1)-puzzle formally and put it in the context of related work. Then, we describe the design of the ANN-distance heuristic and the process of its training. Finally, we experimentally evaluate ANN-distance as part of the A* algorithm and compare it with A* using the 7–8 pattern database heuristic. All tests are done on random instances of the 15-puzzle.

2 Background

In this section we formally define the (N²−1)-puzzle and introduce artificial neural networks.


2.1 The (N²−1)-puzzle

The (N²−1)-puzzle is a classical sliding puzzle composed of a set of numbered tiles T = {1, 2, ..., N²−1} of uniform size 1 × 1. These tiles, which are sometimes also called pebbles, are placed in a non-overlapping way on a square board of size N × N. The positions on the board are numbered from 1 to N², and one of these positions always remains empty and allows the tiles to be rearranged. A configuration (placement) of tiles can be expressed as an assignment c : T → {1, 2, ..., N²}, for which it holds that c(i) ≠ c(j) whenever i ≠ j. There are various versions of the game based on the choice of N, with the 15-puzzle (in which N = 4) being the most famous of these variants.

Definition 1 The (N²−1)-puzzle is a quadruple P_{N²−1} = [N, T, c₀, c_g], where c₀ : T → {1, 2, ..., N²} is an initial configuration of tiles and c_g : T → {1, 2, ..., N²} is a goal configuration (usually the identity with c_g(t) = t for t = 1, 2, ..., N²−1). [2]

A tile can only be moved into a blank neighboring position, and only one tile can be moved at a time. The objective in the puzzle is to rearrange the tiles from the initial configuration c₀ into the desired goal configuration c_g by repeatedly moving single tiles. Since there is only one blank position, the movements of tiles naturally correspond to Left, Right, Up, and Down movements, as it is unambiguous where the movement takes place. A solution can be expressed as a sequence σ = [m₁, m₂, ..., m_l] where m_i ∈ {L, R, U, D}, with the natural meaning of L = Left, R = Right, U = Up, and D = Down, for i = 1, 2, ..., l; here l denotes the solution length (sometimes called the solution cost). We call a solution σ optimal if l is as small as possible; otherwise we say that the solution is sub-optimal. An illustration of a solution is shown in Fig. 1.
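The definition above can be made concrete with a small sketch. This is our own illustrative encoding, not the authors' code: a configuration is a tuple listing the tile at each board position, with 0 standing for the blank, and the L/R/U/D names follow the convention that a move names the direction in which a tile slides into the blank.

```python
N = 4                                   # board size for the 15-puzzle
GOAL = tuple(range(1, N * N)) + (0,)    # 0 denotes the blank position

def legal_moves(state):
    """Return the moves applicable in `state` as a {move: successor} map."""
    blank = state.index(0)
    row, col = divmod(blank, N)
    # For each move name, the index of the tile that slides into the blank
    candidates = {
        "L": blank + 1 if col < N - 1 else None,  # tile right of blank moves left
        "R": blank - 1 if col > 0 else None,      # tile left of blank moves right
        "U": blank + N if row < N - 1 else None,  # tile below blank moves up
        "D": blank - N if row > 0 else None,      # tile above blank moves down
    }
    successors = {}
    for move, tile_pos in candidates.items():
        if tile_pos is None:
            continue
        board = list(state)
        board[blank], board[tile_pos] = board[tile_pos], board[blank]
        successors[move] = tuple(board)
    return successors
```

In the goal configuration the blank sits in the bottom-right corner, so only two moves apply: the tile to its left can move right, or the tile above it can move down.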

2.2 Artificial Neural Networks

In our attempt to design a heuristic that gives very precise estimations we made use of a feed-forward artificial neural network (ANN). The ANN consists of multiple computational units called artificial neurons that perform simple real-valued computations on their input vectors x ∈ ℝⁿ as follows: y(x) = ξ(Σᵢ₌₀ⁿ wᵢxᵢ), where w ∈ ℝⁿ is a vector of weights, w₀ ∈ ℝ is a bias, and ξ is an activation function, for example a sigmoid, where λ determines the shape of the sigmoid:

    ξ(z) = 1 / (1 + e^(−λz))    (1)

Fig. 1 An illustration of an initial (N = 3, T = {1, 2, ..., 8}) and a goal configuration of the 8-puzzle, with a solution sequence and a less efficient solution sequence [2]

Neurons in an ANN are arranged in layers. Neurons in the first layer represent the input vector. Neurons in the second layer get the outputs of the neurons in the first layer as their input, and so on. At every layer, neurons are fully interconnected with the neurons of the previous layer. The outputs of the neurons in the last layer form the output vector. Usually we want an ANN to respond to given inputs in a particular way. Mathematically, the network implements a function φ : ℝⁿ → ℝᵐ, where n is the dimension of the input and m is the dimension of the output. Our assumption is that the heuristic function determining the minimum number of moves required to reach c_g from some given configuration c can be approximated using the feed-forward neural network. Hence, after encoding c_g and c as a real-valued vector and feeding it to the input of the network, we want the network to respond with the minimum number of moves at the output layer. After deciding a topology of the network, that is, how many layers and how many neurons per layer it should consist of, this is achieved through the process of learning [28, 30]. Learning usually does not alter the topology but only sets the weight vectors w and biases w₀ of the individual neurons so that for a given input xᵢ the network responds with a desired output yᵢ in the last layer. The network is trained on a data-set which contains pairs of an input xᵢ and a desired output yᵢ. If the ANN is designed well, that is, if it has the proper number of neurons per layer and if the data-set is representative enough, then the ANN can appropriately respond even to inputs that are outside the training data-set. We then say that the ANN generalizes well; this is the goal in our design as well. However, we are aware of the challenging aspect of the task, as it is unrealistic to train the network for all possible configurations c of the (N²−1)-puzzle, of which there are as many as N²! (N²!/2 solvable ones).
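The neuron computation of Eq. (1) and the layered forward pass can be sketched in a few lines of pure Python. The helper names here are our own, and a real implementation would of course use an optimized library; the bias w₀ is added separately rather than via a constant input x₀ = 1, which is equivalent.

```python
import math

def sigmoid(z, lam=1.0):
    """Eq. (1): xi(z) = 1 / (1 + exp(-lambda * z))."""
    return 1.0 / (1.0 + math.exp(-lam * z))

def neuron(x, w, w0, lam=1.0):
    """y(x) = xi(w0 + sum_i w_i * x_i); w0 is the bias term."""
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)), lam)

def forward(x, layers, lam=1.0):
    """Propagate input x through fully interconnected layers.

    `layers` is a list of layers; each layer is a list of (w, w0)
    pairs, one pair per neuron in that layer.
    """
    for layer in layers:
        x = [neuron(x, w, w0, lam) for w, w0 in layer]
    return x
```

A network with layer sizes [2, 2, 1], for instance, maps ℝ² to ℝ¹, matching the φ : ℝⁿ → ℝᵐ formulation above.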

3 Related Work

MAPF Algorithms. The (N²−1)-puzzle can be perceived as a multi-agent path finding (MAPF) problem. In the MAPF view of the (N²−1)-puzzle, the N × N board is represented as an undirected graph G = (V, E), in which vertices represent the board positions and edges connect neighboring vertices. The pebbles are then called agents and occupy the vertices. The graph models a generalized environment in which agents can move across edges. Similarly to the N × N board, there is at most one agent per vertex, and an agent can be moved into an empty neighboring position (vertex). The objective is to move the agents so that ultimately each agent occupies


its goal vertex. The (N²−1)-puzzle can be represented as a MAPF problem if each position on the board is replaced with a vertex, and if each vertex is connected with the vertices corresponding to the neighboring positions on the board. After this process is performed, a grid of size N × N with one empty vertex is obtained, in which the initial positions and goals of the agents correspond to the initial and goal configurations of the puzzle.

Two modern examples of MAPF algorithms are ICTS [32, 33] and CBS [1, 31]. On the low level, the ICTS algorithm is given a total cost of a desired solution and then searches through various distributions of the total cost among the individual agents, while the high-level mechanism iterates the overall cost from the lower bound upwards until a solvable cost is reached. On the other hand, the CBS algorithm searches a tree of possible conflicts between agents, and gradually introduces constraints on the possible paths of the agents in order to resolve these conflicts. Initially, CBS searches for individual paths as if other agents were not present in the problem instance. Paths found in this way are then verified with respect to collisions between agents. If there are collisions, constraints to avoid them are introduced into the individual path search process and the high level is branched, as each collision can usually be resolved in two ways: either the first or the second agent avoids the place and time of the collision. The downside of solving the (N²−1)-puzzle as a MAPF problem is that MAPF algorithms tend to be computationally ineffective when the vertices are densely populated by agents with only a few empty positions. Virtually all MAPF algorithms were designed for problems where the graph is typically not fully occupied by agents; these algorithms are often adaptive with respect to the density of agents.
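The grid graph G = (V, E) obtained from the board, as described above, can be built directly. This is a minimal illustration with our own helper name, not part of any MAPF library:

```python
def board_graph(n):
    """Build the undirected grid graph G = (V, E) for an n x n board.

    Vertices are the board positions 0 .. n*n-1; edges join
    horizontally or vertically neighboring positions.
    """
    vertices = list(range(n * n))
    edges = set()
    for v in vertices:
        row, col = divmod(v, n)
        if col < n - 1:
            edges.add((v, v + 1))   # horizontal neighbor
        if row < n - 1:
            edges.add((v, v + n))   # vertical neighbor
    return vertices, edges
```

For the 15-puzzle (n = 4) this yields 16 vertices and 24 edges, of which 15 vertices are occupied by agents and one is empty.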
In sparsely occupied graphs, MAPF algorithms typically do not consider all possible interactions between agents and tend to plan individual paths as independently as possible. However, in densely populated graphs these algorithms cannot benefit from their advantages, as many collisions are detected while the individual paths are planned. This is also the case with the (N²−1)-puzzle, as there is only one unoccupied vertex.

Efficient Rule-based Algorithms. Another, more specialized family of algorithms that can be used to solve the (N²−1)-puzzle are the polynomial-time rule-based algorithms like BIBOX [36, 37] and Push-and-Swap [20]. These algorithms, which were often designed specifically for the (N²−1)-puzzle problem domain, generally obtain the goal configuration by iteratively applying predefined macro-operations. However, rule-based algorithms produce solutions which are generally sub-optimal. As these algorithms are designed for practical use, the solutions generated by them are closer to the optimum than in the case of permutation group-based algorithms.

Compilation-based Algorithms. An important group of optimal solving techniques for MAPF, and consequently for the (N²−1)-puzzle, is represented by compilation-based algorithms. These algorithms reduce the task of finding a solution to MAPF to an existing formalism such as propositional satisfiability (SAT) [38, 39], constraint satisfaction (CSP) [19] or answer set programming (ASP) [8]. This approach is characterized by employing efficient off-the-shelf solvers for the target formalism, an efficiency one can hardly achieve with a custom dedicated solver.

Near Optimal Solving of the (N2 –1)-puzzle …

297

Moreover, any progress of the target formalism solver is immediately translated into progress in solving the original problem.

Algorithms based on State-space Search. Unlike in the MAPF approach, the solving of the (N²−1)-puzzle can be approached without planning the paths of single pebbles individually. In a pure state-space search approach, the configuration of the board is called a state, which can be transformed into a different state by applying an action. Together these states and actions form a solution graph, or state-space, which can then be searched until the goal configuration is found. Arguably, the most popular search algorithm for state-space search is A*. There are various algorithms for solving the (N²−1)-puzzle that make use of state-space search, some of which even use the MAPF problem definition. Optimal A*-based algorithms include ID-OD (independence detection with operator decomposition) [35], a technique which tries to make use of a relatively sparsely occupied graph by dividing the agents into independent groups which are then solved separately. In the case of algorithms of exponential time complexity, this is a useful technique, as it effectively divides the exponent by the number of independent groups.

Applications of ANNs in Solving the Puzzle. Samadi et al. [22] used an ANN-based heuristic function for estimating the optimal solution costs of random instances of the 15-puzzle. Instead of feeding the pebble positions to their neural network, they used the estimates of several pattern database heuristics as the model's inputs. To bias the estimates of the heuristic towards admissibility, a custom error function that penalized overestimates more than underestimates was used. The heuristic was then used as part of the RBFS search algorithm. The unbiased ANN heuristic resulted in search trees smaller by a factor of 10 than those generated with the 7–8 pattern database heuristic, and the generated solutions were inadmissible by about 4%.
Using a biased ANN heuristic further pushed inadmissibility down to only 0.1%, but increased the size of the generated search trees. Ernandes and Gori [21] used only the positions of the individual tiles as inputs to an ANN. This model was again used to estimate the optimal solution costs of random instances of the 15-puzzle. They used a very small network with a single hidden layer of only 15 neurons. IDA* with their heuristic then achieved an optimal solution in about 50% of cases, and compared to the Manhattan distance heuristic, the time cost was reduced by a factor of about 500. Arfaee et al. [29] developed a novel bootstrapping procedure for the creation of powerful ANN-based heuristics for solving the 24-puzzle. The process was initiated with an untrained artificial neural network, which was used as a heuristic for solving a number of 24-puzzle instances. The instances which were solved within the time limit were then used for training the next iteration of the heuristic. After a number of iterations, the heuristic was powerful enough to solve the majority of the initial set of random 24-puzzle instances. The suboptimality of the solutions was about 7%.


4 Designing a New Heuristic

The ANN-distance heuristic will be integrated as the underlying heuristic of the A* graph search algorithm. Let us first describe the basics of the A* algorithm. A* explores the space of possible configurations of the puzzle. It maintains the Open list, in which it stores candidate configurations for further exploration. The algorithm always chooses the configuration c from Open with the minimum g(c) + h(c) for the next expansion, where g(c) is the number of steps taken to reach c from c₀, and h(c) is the estimation of the number of remaining steps from c to c_g. It is generally true that with an admissible heuristic function, the closer h is (from below) to the true number of steps remaining, the fewer configurations the algorithm will expand. With an inadmissible heuristic function, the situation is more complicated. Experiments have shown that the weighted A* algorithm, in which the heuristic estimates are scaled by a weight W > 1 (and are thus inadmissible), reaches a solution faster with greater values of W, but the obtained solutions are typically further and further from the optimum. However, it is reasonable to expect that if the estimates produced by the ANN-distance heuristic are relatively accurate and do not overestimate the optimal solution cost too often, the solutions found by A* will not be too far from the optimum. Some of the ANN-distance heuristics will therefore be designed to produce as accurate estimates of the optimal solution costs as possible. Other ANN-distance heuristics will be trained to underestimate the optimal solution costs, which can be expected to result in solutions whose cost is closer to the optimum.
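The A* scheme described above can be sketched compactly. This minimal version, with our own function names and tuple encoding, uses the classical admissible Manhattan-distance heuristic on the 8-puzzle instead of ANN-distance; plugging in a learned h would change only the `h` argument.

```python
import heapq

N = 3                                   # use the 8-puzzle for a quick demo
GOAL = tuple(range(1, N * N)) + (0,)    # 0 is the blank

def successors(state):
    """Yield the states reachable by sliding one tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, N)
    for d_row, d_col in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        r, c = row + d_row, col + d_col
        if 0 <= r < N and 0 <= c < N:
            board = list(state)
            tile = r * N + c
            board[blank], board[tile] = board[tile], board[blank]
            yield tuple(board)

def manhattan(state):
    """Admissible lower bound: sum of tile distances to their goal cells."""
    total = 0
    for pos, tile in enumerate(state):
        if tile == 0:
            continue
        goal = tile - 1
        total += abs(pos // N - goal // N) + abs(pos % N - goal % N)
    return total

def astar(start, h=manhattan):
    """Plain A*: always expand the state minimizing g(c) + h(c)."""
    open_list = [(h(start), 0, start)]
    best_g = {start: 0}
    while open_list:
        f, g, state = heapq.heappop(open_list)
        if state == GOAL:
            return g                        # solution cost
        if g > best_g.get(state, float("inf")):
            continue                        # stale queue entry
        for nxt in successors(state):
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                heapq.heappush(open_list, (g + 1 + h(nxt), g + 1, nxt))
    return None
```

With the admissible Manhattan heuristic the returned cost is optimal; with an inadmissible heuristic such as ANN-distance, the same loop may return a slightly sub-optimal cost.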

4.1 Encoding the Input and Output

The input to the underlying fully-connected feed-forward ANN of the ANN-distance heuristic is the one-hot encoded positions of the tiles, including the blank. In order to one-hot encode the tile positions of a 15-puzzle instance, a 256-dimensional vector is required; this vector is the input to the ANN. The output of the network is a single neuron outputting the estimated optimal solution cost.
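The 256-dimensional input can be produced as follows. This is a sketch under our own conventions (a configuration is a tuple of tiles by position, with 0 for the blank; `one_hot_configuration` is our own name, not the authors' code): for each of the 16 tiles, a block of 16 entries marks the position it occupies.

```python
def one_hot_configuration(state, n=4):
    """One-hot encode the tile positions of an n x n board.

    For each of the n*n tiles (including the blank, here tile 0),
    a block of n*n entries marks the position it occupies, giving
    an (n*n)^2-dimensional vector: 256 entries for the 15-puzzle.
    """
    size = n * n
    vec = [0.0] * (size * size)
    for pos, tile in enumerate(state):
        vec[tile * size + pos] = 1.0
    return vec
```

Exactly 16 of the 256 entries are set to 1, one per tile.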

4.2 Design of the Neural Networks

The hyperparameters of the ANN, most importantly the layer sizes, were determined using grid search [13]. The most powerful architecture turned out to be a deep ANN with 5 hidden layers of 1024, 1024, 512, 128 and 64 neurons. This is a relatively large network, but the 15-puzzle is a complex problem domain, with some of the configurations being deceptive: there are, for example, states with only a few tiles out of place, yet with a high optimal solution cost. The ANN must be flexible enough to learn to recognize these quirks.

Near Optimal Solving of the (N2 –1)-puzzle …


Several state-of-the-art techniques for training deep ANNs were used, including the Adam optimizer, the ELU activation function, dropout regularization, and batch normalization [12]. Layer weights were initialized with He initialization [12]. The output neuron used no activation function. The learning rate was set experimentally.
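A rough sketch of the forward pass of such a network, with He-initialized weights, ELU hidden activations, and a linear output neuron. Dropout and batch normalization are omitted for brevity, and the authors built the actual network in Keras; this NumPy version only illustrates the ingredients named above.

```python
# Illustrative forward pass of the described architecture: He initialization,
# ELU hidden activations, linear output (dropout/batch norm omitted).
import numpy as np

rng = np.random.default_rng(0)
sizes = [256, 1024, 1024, 512, 128, 64, 1]  # input, 5 hidden layers, output

# He initialization: weights ~ N(0, sqrt(2 / fan_in)), biases zero
layers = [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)),
           np.zeros(n_out))
          for n_in, n_out in zip(sizes, sizes[1:])]

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def predict(x):
    """x: (batch, 256) one-hot boards -> (batch,) cost estimates."""
    h = x
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:   # hidden layers use ELU ...
            h = elu(h)
    return h[:, 0]                # ... the output neuron is linear

y = predict(np.zeros((2, 256)))
```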

4.3 Training Data and Training

The training dataset was obtained by randomly shuffling the boards, using both high and low shuffle counts. As a result, some of the boards have a high optimal solution cost of about 50, while others still have a low optimal solution cost. This is preferable, as the ANN-distance heuristic must correctly estimate optimal solution costs both when the search algorithm is far away from the solution and when it is close to it. However, preliminary tests with the pattern database (PDB) heuristic revealed that the A* algorithm spends most of its time far away from the solution. Hence we assumed that queries to the ANN-distance heuristic will follow a similar distribution of distances from the optimum, and we selected mostly hard instances into the training dataset. In total, there are 4.8 million instances in the training dataset, which is a relatively large number, but still only a tiny fraction (on the order of 10⁻⁷) of the total size of the state space of the 15-puzzle. The optimal solution costs were calculated with the IDA* algorithm with the underlying PDB 7–8 heuristic. The distribution of boards in the training dataset can be seen in Fig. 2.

Fig. 2 The distribution of the optimal solution lengths of boards in the training dataset

V. Cahlik and P. Surynek

Fig. 3 The AMSE cost of a single instance (vertical axis) for different deviations from the optimal solution cost (horizontal axis) and various values of α

The cost function used while training some networks was mean squared error (MSE). However, for some networks, a novel cost function, which we named asymmetric mean squared error (AMSE), was used. This function is basically the MSE cost function skewed to one side. The AMSE function is defined as

$$\mathrm{AMSE}(\hat{Y}, Y) = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 \, (\operatorname{sgn}(\hat{Y}_i - Y_i) + \alpha)^2, \qquad (2)$$

where n is the number of instances, Ŷ is the vector of estimated values, Y is the vector of target values, and α is a parameter between 0 and 1 which controls the skew of the function. The closer α is to 1, the more overestimations are penalized. AMSE with α = 0 is equal to the mean squared error cost function. The behavior of the AMSE cost, denoted c(di), for different estimation errors di and various values of α is shown in Fig. 3. Training was performed by gradient descent using reverse-mode autodiff, a process historically known as backpropagation [7, 12]. The Adam optimizer was used. Training was run for 40 epochs, which took about 3 h.¹ During training, the training cost kept decreasing steadily, but the cost on the validation set kept fluctuating, so the network's weights were saved after each epoch and at the end of training, and only

¹ All experiments were run on an eight-core Intel Xeon E5-2623 v4 CPU with 30 GB of RAM and an NVIDIA Quadro P4000 graphics card under Debian Linux. The ANN was implemented and trained using the Keras library [4] with the TensorFlow backend.


the weights corresponding to the lowest validation error were restored and saved as the final model.
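The AMSE cost function of Eq. (2) can be sketched directly. With α = 0 it reduces to plain MSE, and with α > 0 an overestimation costs more than an equally large underestimation, which is what biases training towards underestimating.

```python
# Direct sketch of the AMSE cost of Eq. (2); alpha = 0 recovers plain MSE.
import numpy as np

def amse(y_hat, y, alpha):
    """Asymmetric mean squared error over vectors of estimates and targets."""
    d = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(d ** 2 * (np.sign(d) + alpha) ** 2))

y_true = np.array([50.0, 40.0])
```

For example, with α = 0.8 a uniform overestimation by 2 is penalized by a factor of (1 + 0.8)² = 3.24 per instance, while an underestimation by 2 is penalized only by (−1 + 0.8)² = 0.04.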

4.4 Resulting ANN-distance Heuristics

For the detailed experimental evaluation we chose four versions of the ANN-distance heuristic: the heuristic with a deep underlying ANN trained with the MSE cost function (with the layer sizes stated in Sect. 4.2); two heuristics with deep underlying ANNs trained with the asymmetric AMSE cost function with α = 0.4 and α = 0.8; and a heuristic with a shallow ANN composed of a single hidden layer of 2752 neurons, which is the same total number of neurons as in the deep ANNs.

5 Experimental Evaluation

Our experimental evaluation focused on a competitive comparison of the ANN-distance heuristics against the admissible PDB 7–8 heuristic (the pattern database heuristic with pattern sizes 7 and 8) and its weighted variants, which are inadmissible. The tests were run on a set of 1172 instances obtained using random permutations. The distribution is shown in Fig. 4; the mean optimal solution cost of boards in the evaluation dataset is 52.8.

Fig. 4 Histogram of optimal solution costs of boards in the evaluation dataset

We focused on measuring the performance of the heuristics in two scenarios:
1. when they are used to obtain single estimations,
2. when they are used as the underlying heuristic of the A* algorithm.

5.1 Evaluation on Single Estimations

The first experiment tested how closely the PDB heuristics and the ANN-distance heuristics estimate the true distance from the goal configuration. The resulting distributions of estimations are shown in Figs. 5 and 6. The shapes of the distributions are similar to the shape of the distribution of the true optimal solution costs of boards in the evaluation dataset; the main difference is that the distributions are shifted towards higher or lower values. The PDB 7–8 heuristic produces smaller estimates than its weighted counterparts, and the distributions of the estimations of the weighted PDB 7–8 heuristics are also spread more widely than the distribution of the estimates produced by the original PDB 7–8 heuristic. The ANN-distance heuristic produces optimal solution cost estimates whose distribution is quite similar to the true distribution of the optimal solution costs, with a mean of 52.9. The ANN-distance heuristic that was trained to underestimate using the asymmetric cost function with α = 0.8 produces estimates somewhat lower than those produced by the original ANN-distance heuristic trained with the mean squared error cost function, with a mean of 48.9.

Fig. 5 PDB 7–8 heuristics: histogram of single estimations

Fig. 6 ANN-distance heuristics: histogram of single estimations

Fig. 7 PDB 7–8 heuristics: histogram of errors on single estimations

The distributions of estimation errors with respect to the true optimal solution costs are shown in Figs. 7 and 8. Clearly, both of the ANN-distance heuristics are inadmissible according to this test. The ANN-distance heuristic trained with the MSE cost function has a mean error of 0.06 and an error standard deviation of 1.42, which means that its estimations are relatively accurate. The ANN-distance heuristic

Fig. 8 ANN-distance heuristics: histogram of errors on single estimations

trained with the AMSE cost function with α = 0.8 underestimates the cost on almost all instances, with a mean error of −3.9, but is still inadmissible. The PDB 7–8 heuristic, on the other hand, strongly underestimates the optimal solution costs by default, and becomes more accurate only after being weighted by W = 1.15.

5.2 Evaluation on A* Searches

In the second experiment, the heuristics were evaluated as the underlying heuristics of the A* algorithm. No time limit was imposed on the runs of the A* algorithm, and thus solutions to all boards were found with all heuristics. The results are presented in Table 1. The meaning of the columns is as follows: the first column states the heuristic used; the second, the average cost of the solutions found; the third, the average number of nodes expanded by the searches; the fourth, the average runtime of the search; the fifth, the percentage of solutions found that were optimal; the sixth, the average suboptimality of the solutions, in percent. Average suboptimality is defined as the average cost of the solutions divided by the average optimal solution cost, minus 1, multiplied by 100 to obtain a percentage. The last column shows the average number of expanded nodes per second. The rows represent the results of the individual heuristics: the first four rows represent the pattern database heuristic and its weighted variants, the fifth row the deep ANN-distance heuristic trained with the MSE cost function, the


Table 1 Performance analysis of A* running with various heuristics

| Heuristic | Cost | Expanded | Time | %Opt | %Avg. subopt. | Expanded/s |
|---|---|---|---|---|---|---|
| PDB 7–8 | 52.8 | 62,222 | 6.946 | 100 | 0 | 8958 |
| PDB 7–8, W = 1.15 | 53.4 | 3,296 | 0.346 | 71 | 1.19 | 9526 |
| PDB 7–8, W = 1.3 | 55.2 | 1,017 | 0.107 | 38 | 4.46 | 9505 |
| PDB 7–8, W = 1.45 | 57.5 | 540 | 0.056 | 25 | 8.86 | 9643 |
| ANN | 53.7 | 355 | 2.405 | 61 | 1.62 | 148 |
| ANN, 1 h.l. | 53.7 | 1,577 | 5.988 | 60 | 1.69 | 263 |
| ANN, AMSE (α = 0.4) | 53.5 | 532 | 3.543 | 67 | 1.34 | 150 |
| ANN, AMSE (α = 0.8) | 53.5 | 2,576 | 16.812 | 69 | 1.23 | 153 |

sixth row is the shallow ANN trained with the MSE cost function, and the last two rows are the deep ANN-distance heuristics trained with the AMSE cost function. From the table, it is clear that in terms of the number of expanded nodes, the deep ANN-distance heuristic trained with the MSE cost function performs significantly better than all of the heuristics based on pattern databases: the number of expanded nodes is decreased by two orders of magnitude compared to the unweighted PDB heuristic. However, the success rate of finding the optimal solutions is also relatively low, with the deep ANN-distance heuristic leading to an optimal solution in only about 60 percent of cases. The exact numbers of searches that missed the optimum, broken down by distance from the optimum, can be found in Table 2. The suboptimality of the solutions was further decreased by the ANN-distance heuristics trained with the AMSE cost function. However, even though the heuristic trained with the AMSE cost function with α = 0.8 underestimates on almost all of its estimations, it still led to suboptimal solutions being found by the A* search algorithm in about 30% of cases. It is possible that this suboptimality mostly comes from the inconsistency of the heuristic, as the graph version of the A* algorithm only guarantees optimality when the heuristic function used is consistent. An interesting result is that the ANN-distance heuristic based on a shallow ANN led to a significantly higher number of expanded nodes than the deep ANN-distance heuristic. Even during training, the prediction cost on the training set was larger with the shallow ANN than with the deep ANN. This suggests that the use of deep learning can be beneficial even in the field of heuristic design. The biggest weakness of the ANN-distance heuristic is the speed of inference, as the pattern database heuristics are faster at making estimations by almost two


Table 2 Suboptimality of the solutions found by A*

| Heuristic | Opt. | Opt.+2 | Opt.+4 | Opt.+6 |
|---|---|---|---|---|
| PDB 7–8 | 1172 | 0 | 0 | 0 |
| PDB 7–8, W = 1.15 | 834 | 309 | 29 | 0 |
| ANN | 718 | 409 | 44 | 1 |
| ANN, 1 h.l. | 701 | 421 | 49 | 1 |
| ANN, AMSE (α = 0.4) | 786 | 357 | 29 | 0 |
| ANN, AMSE (α = 0.8) | 814 | 336 | 21 | 1 |

Fig. 9 Histogram of the number of nodes expanded during A* search (trimmed)

orders of magnitude. The shallow ANN's estimations are slightly faster than those of the deep ANNs. On the other hand, the underlying neural networks in the ANN-distance heuristics only take up about 20 MB of memory, which is very little compared to the PDB 7–8 heuristics, which take up about 4.5 GB of memory. The distributions of the number of nodes expanded by the A* searches with the deep unbiased ANN-distance heuristic and with the unweighted PDB 7–8 heuristic can be seen in Fig. 9.
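The average-suboptimality measure reported in Table 1 can be sketched as a small helper (illustrative only; function and argument names are hypothetical):

```python
# Sketch of the average-suboptimality measure from Table 1: average found
# cost divided by average optimal cost, minus 1, expressed as a percentage.
def avg_suboptimality(found_costs, optimal_costs):
    avg_found = sum(found_costs) / len(found_costs)
    avg_opt = sum(optimal_costs) / len(optimal_costs)
    return (avg_found / avg_opt - 1.0) * 100.0
```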


5.3 Competitive Comparison Against Heuristics Presented in Other Studies

The ANN-distance heuristic functions presented in this paper can be directly compared to the neural heuristics presented in two earlier studies, both of which were mentioned in Sect. 3. The first, by Ernandes and Gori from 2004 [21], created a heuristic function based on an artificial neural network, which was then evaluated using the IDA* algorithm. The second, by Samadi et al. from 2008 [22], also designed a heuristic function based on an ANN, but trained it using the outputs of other heuristic functions as features, namely the PDB 7–8 heuristic and the Manhattan distance heuristic. Samadi et al. evaluated their heuristic using the RBFS search algorithm. Neither of these studies presented the performance of their network on single estimations, and the search algorithms used differ, so a comparison against these heuristics is difficult. Nevertheless, the results are presented in Table 3. It is evident that the number of expanded nodes is significantly smaller with the unbiased ANN-distance heuristic presented in this paper than with the heuristics of Samadi et al. and Ernandes and Gori: our ANN-distance expands up to two orders of magnitude fewer nodes than the previous ANN-based heuristics. It is, however, likely that at least the heuristic presented by Ernandes and Gori would perform better in the number of expanded nodes if run as part of the A* algorithm, since IDA* performs repeated depth-first iterations and revisits large subtrees, and thus expands more nodes than necessary. As for optimality, Ernandes and Gori stated that 28.7% of their solutions were optimal, which is far worse than the roughly 60% of optimal solutions obtained with the unbiased ANN-distance heuristic presented in this paper. However, Samadi et al. stated that the suboptimality of their biased heuristic is less than 0.1%, which is significantly lower than the suboptimality of 1.23% of the biased ANN-distance heuristic trained with

Table 3 Comparison of search algorithms against heuristics presented in other papers

| Study | Algorithm | Heuristic | Avg. cost | Avg. nodes | Avg. time (s) |
|---|---|---|---|---|---|
| This | A* | ANN | 53.7 | 355 | 2.405 |
| This | A* | ANN, AMSE (α = 0.4) | 53.5 | 532 | 3.543 |
| This | A* | ANN, AMSE (α = 0.8) | 53.5 | 2,576 | 16.812 |
| [21] | IDA* | ANN | 54.5 | 24,711 | 7.38 |
| [22] | RBFS | ANN (MD+PDB) | 54.3 | 2,241 | 0.001 |
| [22] | RBFS | ANN (MD+PDB, asym. cost) | 52.6 | 16,654 | 0.021 |


the AMSE cost function with α = 0.8. Unfortunately, it is not clear whether Samadi et al. used exactly the same definition of suboptimality. The runtime comparison is included for completeness, but is not very informative across different studies; for example, the algorithm of Samadi et al. takes only a millisecond to find a solution, but their code was likely highly optimized.

5.4 Analysis of the Behavior of A* Search with the Underlying ANN-distance Heuristic

When evaluating a heuristic function, it is important to know in which parts of the state space the algorithm spends the most time. This analysis can be performed by calculating the optimal solution cost of each expanded state. The distribution of the optimal solution costs of nodes expanded by the A* algorithm can be seen in Fig. 10. The data were obtained by running, for each state expanded by the A* search with the deep ANN-distance heuristic (trained with the MSE cost function), an IDA* search with the PDB 7–8 heuristic to determine that state's optimal solution cost; the boards were obtained as random permutations. It can also be observed that the distribution is much more skewed to the left than the distribution of the optimal solution costs of states expanded by the A* algorithm with the underlying PDB 7–8 heuristic, which is presented in Fig. 11. This could mean that the heuristic estimations are very accurate for instances of high optimal

Fig. 10 Histogram of optimal solution costs of nodes expanded during A* search with the ANN-distance heuristic trained with the MSE cost function


Fig. 11 Histogram of optimal solution costs of nodes expanded during A* search with the PDB 7–8 heuristic

solution costs, so the search with the ANN-distance heuristic moves away from the initial state faster than the search with the PDB 7–8 heuristic.

6 Discussion and Conclusion

This paper reports on the use of artificial neural networks as the underlying paradigm for the design of heuristics for the (N²–1)-puzzle. We designed a heuristic called ANN-distance with an underlying deep artificial neural network that estimates, for a given configuration, the distance in terms of the minimal number of moves required to reach the goal configuration. Although the ANN-distance heuristic is not admissible, it is relatively accurate and does not significantly overestimate the true distance. As a result, A* with the ANN-distance heuristic usually yields an optimal solution. Moreover, since ANN-distance is relatively precise in its estimations, A* with the ANN-distance heuristic expands far fewer nodes than with other heuristics such as the 7–8 pattern database. When compared to a shallow artificial neural network with a single hidden layer, the ANN-distance with an underlying deep ANN gives more accurate estimations and therefore leads to a lower number of nodes expanded by the A* search. This demonstrates that deep learning can be successfully applied to heuristic design in symbolic domains like the (N²–1)-puzzle. We also biased the ANN-distance heuristic to prefer underestimation, which resulted in solutions that are closer to the optimum.


When compared to the results of other scientific papers, the ANN-distance heuristic performs significantly better in the (N²–1)-puzzle domain than any other similar previously designed heuristic. Our ANN-distance improved the number of expanded nodes over the best published heuristic functions based on artificial neural networks by an order of magnitude. The cause of the performance gain seems to be a combination of a much larger artificial neural network and training dataset, as well as the use of deep learning and other modern techniques for ANN training. An important advantage of the ANN-distance is that it consumes much less memory than a comparable pattern database heuristic. This feature enables the use of similar ANN-based heuristics even in cheap embedded systems. ANN-based heuristics similar to our ANN-distance could be used in other problem domains resembling the (N²–1)-puzzle. Our hypothesis is that it could be possible to use these heuristics even in problem domains which are mostly opaque, that is, problem domains whose inner structure is not well understood by humans. As we continue our work, we plan to move to the domain of the 24-puzzle. As it will be much more challenging to obtain a sufficient number of training instances, transfer learning could be used to leverage the ANN-distance heuristic trained on the 15-puzzle domain. Such a heuristic function could then be improved with the bootstrapping algorithm [29]. We also plan to perform further experiments with the distribution of the boards in the training dataset.

Acknowledgements This research has been supported by GAČR, the Czech Science Foundation, grant registration number 19-17966S. We would like to thank the anonymous reviewers for their valuable comments.

References

1. Boyarski, E., Felner, A., Stern, R., Sharon, G., Tolpin, D., Betzalel, O., Shimony, S.: ICBS: improved conflict-based search algorithm for multi-agent pathfinding. In: IJCAI, pp. 740–746 (2015)
2. Cahlík, V., Surynek, P.: On the design of a heuristic based on artificial neural networks for the near optimal solving of the (n²-1)-puzzle. In: Proceedings of the 11th International Joint Conference on Computational Intelligence, IJCCI 2019, Vienna, Austria, pp. 473–478 (2019)
3. Cahlík, V., Surynek, P.: Application of Artificial Neural Networks in Solving the (N²-1)-Puzzle. Bachelor's thesis, Faculty of Information Technology, Czech Technical University in Prague (2020)
4. Chollet, F., et al.: Keras. https://keras.io (2015)
5. Culberson, J.C., Schaeffer, J.: Efficiently searching the 15-puzzle. Tech. rep., Department of Computer Science, University of Alberta (1994)
6. Demaine, E.D., Rudoy, M.: A simple proof that the (n²-1)-puzzle is hard. Theor. Comput. Sci. 732, 80–84 (2018)
7. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation (1985)
8. Erdem, E., Kisa, D.G., Öztok, U., Schüller, P.: A general formal framework for pathfinding problems with multiple agents. In: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue, Washington, USA (2013)
9. Felner, A., Adler, A.: Solving the 24-puzzle with instance dependent pattern databases. In: SARA-05, pp. 248–260 (2005)
10. Felner, A., Korf, R.E., Hanan, S.: Additive pattern database heuristics. J. Artif. Intell. Res. 22, 279–318 (2004)
11. Felner, A., Korf, R.E., Meshulam, R., Holte, R.C.: Compressed pattern databases. J. Artif. Intell. Res. 30, 213–247 (2007)
12. Géron, A.: Hands-On Machine Learning with Scikit-Learn and TensorFlow, 1st edn., pp. 275–312 (2017)
13. Ghawi, R., Pfeffer, J.: Efficient hyperparameter tuning with grid search for text categorization using kNN approach with BM25 similarity. Open Computer Science 9(1), 160–180 (2019)
14. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. SSC-4(2), 100–107 (1968)
15. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall (1999)
16. Korf, R.E.: Sliding-tile puzzles and Rubik's Cube in AI research. IEEE Intelligent Systems 14, 8–12 (1999)
17. Korf, R.E., Taylor, L.A.: Finding optimal solutions to the twenty-four puzzle. In: AAAI 96/IAAI 96, Portland, Oregon, USA, vol. 2, pp. 1202–1207 (1996)
18. Kornhauser, D., Miller, G.L., Spirakis, P.G.: Coordinating pebble motion on graphs, the diameter of permutation groups, and applications. In: FOCS 1984, pp. 241–250 (1984)
19. Li, J., Harabor, D., Stuckey, P.J., Ma, H., Koenig, S.: Symmetry-breaking constraints for grid-based multi-agent path finding. In: AAAI 2019, Honolulu, Hawaii, USA, pp. 6087–6095 (2019)
20. Luna, R., Bekris, K.: Efficient and complete centralized multi-robot path planning. In: IROS, pp. 3268–3275 (2011)
21. Ernandes, M., Gori, M.: Likely-admissible and sub-symbolic heuristics (2004)
22. Samadi, M., Felner, A., Schaeffer, J.: Learning from multiple heuristics (2008)
23. Parberry, I.: A real-time algorithm for the (n²-1)-puzzle. Inf. Process. Lett. 56(1), 23–28 (1995)
24. Parberry, I.: Memory-efficient method for fast computation of short 15-puzzle solutions. IEEE Trans. Comput. Intell. AI Games 7(2), 200–203 (2015)
25. Parberry, I.: Solving the (n²-1)-puzzle with 8/3 n³ expected moves. Algorithms 8(3), 459–465 (2015)
26. Ratner, D., Warmuth, M.K.: Finding a shortest solution for the N × N extension of the 15-puzzle is intractable. In: AAAI, pp. 168–172 (1986)
27. Ratner, D., Warmuth, M.K.: The (n²-1)-puzzle and related relocation problems. J. Symb. Comput. 10(2), 111–138 (1990)
28. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
29. Arfaee, S.J., Zilles, S., Holte, R.C.: Learning heuristic functions for large state spaces (2011)
30. Schmidhuber, J.: Deep learning in neural networks: an overview. CoRR abs/1404.7828 (2014). http://arxiv.org/abs/1404.7828
31. Sharon, G., Stern, R., Felner, A., Sturtevant, N.R.: Conflict-based search for optimal multi-agent pathfinding. Artif. Intell. 219, 40–66 (2015)
32. Sharon, G., Stern, R., Goldenberg, M., Felner, A.: The increasing cost tree search for optimal multi-agent pathfinding. Artif. Intell. 195, 470–495 (2013)
33. Sharon, G., Stern, R., Goldenberg, M., Felner, A.: Pruning techniques for the increasing cost tree search for optimal multi-agent pathfinding. In: Symposium on Combinatorial Search (SoCS) (2011)
34. Slocum, J., Sonneveld, D.: The 15 Puzzle. Slocum Puzzle Foundation (2006)
35. Standley, T.: Finding optimal solutions to cooperative pathfinding problems. In: AAAI, pp. 173–178 (2010)
36. Surynek, P.: A novel approach to path planning for multiple robots in bi-connected graphs. In: ICRA 2009, pp. 3613–3619 (2009)
37. Surynek, P.: Solving abstract cooperative path-finding in densely populated environments. Computational Intelligence 30(2), 402–450 (2014)
38. Surynek, P.: Time-expanded graph-based propositional encodings for makespan-optimal solving of cooperative path finding problems. Ann. Math. Artif. Intell. 81(3–4), 329–375 (2017)
39. Surynek, P.: Unifying search-based and compilation-based approaches to multi-agent path finding through satisfiability modulo theories. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, pp. 1177–1183 (2019)
40. Surynek, P., Michalík, P.: The joint movement of pebbles in solving the (n²-1)-puzzle suboptimally and its applications in rule-based cooperative path-finding. Autonomous Agents and Multi-Agent Systems 31(3), 715–763 (2017)
41. Wilson, R.M.: Graph puzzles, homotopy, and the alternating group. Journal of Combinatorial Theory, Series B 16(1), 86–96 (1974)

Deep Convolutional Neural Network Processing of Images for Obstacle Avoidance

Mohammad O. Khan and Gary B. Parker

Abstract Deep Convolutional Neural Networks have been found to be quite successful at processing images and determining their classifications, such as distinguishing dogs from cats in a set of images. With this in mind, we implemented a deep network system for obstacle avoidance of an autonomous robot driving in a real-world lab environment. The network was first trained on the CIFAR-10 dataset using a replication of Alex Krizhevsky's network architecture. The network was then altered by replacing the final fully connected layer with a new fully connected layer with three outputs and random starting weights. The entire network was then fine-tuned using images from the actual environment that were taken by the robot's camera as it was remotely driven in the lab by a human operator. The network learned the correct responses of left, right, or straight for each of the images with a very low error rate when checked on test images. In addition, ten tests on the actual robot showed that it could successfully and consistently drive through the lab while avoiding obstacles.

1 Introduction Deep Convolutional Neural Networks are Artificial Neural Network architectures that are capable of differentiating between thousands of objects by learning from millions of images. The majority of Deep Learning systems are being used for image processing. Some of these systems are able to outperform humans in classification tasks, where one object is differentiated from another and sorted into classes (dog vs. wolf, e.g.). In this paper, we present the use of a Deep Convolutional Neural Network trained on images taken from the camera of a TurtleBot type robot to enable it to distinguish between obstacles and free space while autonomously driving in a tight

M. O. Khan · G. B. Parker (B) Connecticut College, New London, CT 06340, USA e-mail: [email protected] © Springer Nature Switzerland AG 2021 J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_13



classroom/laboratory setting. After training, the robot was able to drive autonomously while successfully avoiding obstacles.

2 Deep Learning for Image Processing

A standard feed-forward Artificial Neural Network that is fully connected between each layer could be used to classify images, but it would be computationally overwhelmed. A 100 × 100 pixel image with 3 color channels yields 100 × 100 × 3 = 30,000 inputs to the neural network. This is a very large number of inputs for a standard neural network to process and would result in limited success. The Deep Learning architecture is better equipped to handle image processing and has enjoyed significant success. Instead of sending all input values from layer to layer, deep networks are designed to take regions, or subsamples, of the inputs. For images this means that instead of sending all pixels of the entire image as inputs, different neurons only take regions of the image as inputs: full connectivity is reduced to local connectivity. Figure 1 visualizes this concept. We take an image and extract local regions of depth 3 (the color channels), along with their respective pixel values, and input them into a neuron.

2.1 Components of a Deep Learning System

A Deep Convolutional Neural Network is typically made up of a stack of layers, including convolution, pooling, and fully connected layers. A typical system has two or three sets of a convolution layer followed by a pooling layer, with a fully connected layer producing the output. In this section, we discuss these layers, plus a type of normalization layer and a commonly used activation function unit, both of which we used in our research.

Fig. 1 Examples of two local receptive fields being input into a neuron


The convolution layer passes kernels (filter windows) over the image to produce new smaller images through convolution of the filter and the image pixels. The makeup of each kernel (its coefficients) is learned in a similar way to the weights of a standard neural network as the system is exposed to a dataset made up of image and corresponding classifications. The number of intermediate images produced by this method can be determined by the programmer as they specify the number of kernels for each layer. In this way, Deep Learning takes an image and extracts local regions of depth 3 for the color channels along with their respective pixel values and inputs them into a neuron. Supposing that our local receptive fields are of size 5 × 5, this neuron takes in an input of dimensions 5 × 5 × 3 for that particular portion of the 3 color channel image. The local receptive fields can be seen as small windows that slide over our image, where the number of panes on the window is predefined. These panes help determine what features under the window we want to extract, and over time these features are better refined. These windows are commonly referred to as kernels. Depending on the type of kernel, different features of the image may be highlighted, such as blurring and sharpening. Using these kernel filters, the Deep Learning networks can develop identification of complex patterns in datasets. During learning, deep networks develop the weights or coefficients of these kernels through training on input/output sets of data. The system considers the error between the actual and desired outputs and uses this to adjust the kernel weights using an objective function. This is all done during training, so the makeup of the kernels is completely learned by the system. A general strategy is to follow a convolution layer with a pooling layer. After the convolution layer, the pooling is applied to each of the resultant images. 
The convolution layer is responsible for learning the lower level features of an image, such as edges, and the pooling layer seeks to detect higher level attributes from these features. In addition, pooling helps build invariance to local translations, so that even if the input region is slightly translated, most of the pooled output values remain the same. Along with these benefits, the image size is reduced, which dramatically cuts down on the amount of processing needed to train the higher level features of the network. Three typical methods of pooling are: (1) Max pooling: the maximum pixel value is chosen out of a region of pixels. (2) Min pooling: the minimum pixel value is chosen out of a region of pixels. (3) Average pooling: the average pixel value is computed over a region of pixels. The pooling process is similar to convolution in that a window is passed over the image. In our system we use max pooling, which helps extract dominant features, or regions with the largest values, and feeds them into subsequent layers of the network. The Rectified Linear Unit (ReLU) layer [1] has recently grown in popularity, and many researchers prefer it over the sigmoid activation function. In fact, Krizhevsky et al. were able to accelerate convergence in their training by a factor of 6 relative to the sigmoid activation function by using it. The operation is straightforward: the function takes a numerical input x and returns it if it is positive, otherwise it returns 0. This effectively eliminates negative inputs and reduces computation time, since expensive operations such as exponentiation are not needed.
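The ReLU and max pooling operations described above can be sketched as follows (NumPy, for illustration only; the 2 × 2 non-overlapping window is an illustrative choice):

```python
import numpy as np

def relu(x):
    # Returns x where positive, 0 elsewhere; no exponentiation needed.
    return np.maximum(x, 0)

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; trims edges that do not divide evenly."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))  # max over each size x size block

fm = np.array([[ 1., -2.,  3.,  0.],
               [-1.,  5., -4.,  2.],
               [ 0.,  1.,  2., -3.],
               [ 4., -1.,  0.,  6.]])
activated = relu(fm)       # negatives clamped to 0
pooled = max_pool(activated)
print(pooled)              # [[5. 3.] [4. 6.]]
```

Each pooled value is the dominant (largest) activation in its 2 × 2 region, halving each spatial dimension.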


M. O. Khan and G. B. Parker

A normalization layer adds inhibition, which amplifies the strongest signals relative to their neighbors. One such method is the Local Response Normalization [1] layer, which imitates the lateral inhibition in biological systems by allowing excited neurons to subdue neighboring neurons. The signal at a normalized neuron is amplified when its excitement stands out from that of its neighbors. This gives neurons with large activation values even more influence than without the normalization, which results in the amplification of significant features highlighted in previous layers. The final layer in many Deep Convolutional Neural Networks is the fully connected layer, which works much like an ordinary multi-layered perceptron. The outputs of the neurons in this layer are the outputs of the network; they are compared to the desired outputs for the particular image, and the error delta is used to perform learning in the network.
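A minimal sketch of across-channel Local Response Normalization follows, assuming the constants (n = 5, k = 2, α = 10⁻⁴, β = 0.75) reported by Krizhevsky et al. [1]; the paper does not state which constants this particular network used:

```python
import numpy as np

def local_response_norm(acts, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Across-channel LRN: each activation is divided by a term that grows
    with the squared activity of its n neighboring channels, so activations
    that are large relative to their neighbors are preserved while uniformly
    active regions are damped."""
    channels = acts.shape[0]
    out = np.empty_like(acts)
    for i in range(channels):
        lo, hi = max(0, i - n // 2), min(channels, i + n // 2 + 1)
        scale = (k + alpha * np.sum(acts[lo:hi] ** 2, axis=0)) ** beta
        out[i] = acts[i] / scale
    return out

acts = np.random.RandomState(0).rand(8, 4, 4)  # 8 channels of 4x4 maps
normed = local_response_norm(acts)
print(normed.shape)  # (8, 4, 4)
```

With k = 2 the divisor is always greater than 1, so every activation shrinks; what matters for the network is that activations shrink less when they dominate their local channel neighborhood.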

2.2 Relevant Previous Works The most significant research related to our application is that of Alex Krizhevsky. The dataset he used, CIFAR10 from the Canadian Institute for Advanced Research [2], is also central to our research. Krizhevsky developed this dataset for his 2009 Master’s Thesis at the University of Toronto [3]; prior to this, small images on the scale of 32 × 32 were not easily labeled for classification tasks. The CIFAR10 dataset includes 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The classes are set up to be mutually exclusive, so that categories such as automobile and truck are completely separate. In 2010, Krizhevsky developed different deep neural network models to learn classification of images in the dataset [4]. At the time he obtained the highest accuracy on this dataset, as his best system classified objects correctly at a rate of 78.9%. Since then, Mishkin and Matas have obtained 94.16% accuracy on CIFAR10 [5], Springenberg et al. have obtained 95.59% [6], and the current best performance is by Graham with an accuracy of 96.53% using max pooling [7]. Krizhevsky, with Sutskever and Hinton [1], advanced this earlier work with further Deep Learning research. They developed a neural network with 60 million parameters and 650,000 neurons and achieved a top-5 classification (1000 classes) error rate of only 15.3%, compared to a much higher second-place error rate of 26.2%. Their network had 5 convolutional layers along with a few pooling layers and 3 fully connected layers, including a final output layer of 1000 outputs.
In addition to reporting an impressive error rate on such a large classification task, this paper contributed to the discussion of the importance of depth in neural networks by noting that removing a single hidden layer raised the top-1 classification error rate by about 2%. A team of engineers and research scientists, mostly from Google [8], developed GoogLeNet, a deep network with 22 layers, and entered it in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The team won the competition with 12 times


fewer parameters than Krizhevsky’s deep network and obtained an impressive 6.66% error rate for top-5 classification. Following this pattern of improvements, He et al. [9] of Microsoft Research used a 19 layer deep neural network for the task and obtained a top-5 error rate of 4.94%. This was a landmark accomplishment, as it is purported to be the first result to beat human level performance, which is a 5.1% error rate on the ImageNet dataset.

3 Obstacle Avoidance Task The task that we hoped to complete was to use a deep neural network to learn obstacle avoidance for an autonomous vehicle driving in a tight, somewhat unordered room/office environment. To test the functionality and success of the program, the robot had to achieve the primary goal of autonomously following an approximately rectangular path in a tight environment without colliding with obstacles. Descriptions of the robot and environment are provided below.

3.1 The Robot The Deep Learning Robot from Autonomous, which is functionally the same platform as the TurtleBot, was used for this research (Fig. 2). A Kobuki mobile base allows it to rotate and move in any direction on the ground plane. In addition to the base platform, the robot includes an Asus Xtion Pro 3D depth camera, a microphone embedded in the camera, and a speaker. Most importantly, in addition to the CPU used to control the platform, the robot is equipped with an Nvidia Tegra TK1 GPU, which allows it to carry out Deep Learning computations onboard; the GPU is its main difference from regular TurtleBot hardware. While the Tegra TK1 is a powerful mobile processor, it was equipped with only 2 GB of memory. This is an issue for training deep networks, since holding too many parameters in memory causes the robot to crash, and running multiple programs at the same time has to be avoided because the robot is unstable while training due to its limited memory. The software provided for implementing Deep Learning on the GPU to speed up computation included Google TensorFlow, Torch, Theano, and Caffe, along with CUDA and cuDNN; we used Caffe. The TurtleBot framework works hand in hand with the Robot Operating System (ROS), which is used to control the robot and access all the information coming from the robot’s sensors. ROS is an “open-source, meta-operating system” which allows hardware abstraction, low-level control, and message passing between different modules and processes. The robot’s computer was set up with Ubuntu 14.04.


Fig. 2 Photograph of the Deep Learning Robot

3.2 The Environment Training and tests were completed in the Connecticut College Robotics Lab, which with its standard setup has a variety of obstacles in the room and provides a reasonably complex environment. A photo of the lab is provided in Fig. 3. The central work table is in the middle of the photo and an 8 × 8 foot colony space is in the far corner. The

Fig. 3 A photograph of the robotics lab and a visual of the environment showing lab tables (white rectangles), chairs (white circles), cabinets (amber rectangles), and colony space (brown square). The red line shows the approximate track the robot must follow to move around the lab without running into obstacles. (The visual from the environment is from Khan and Parker, 2019 [17])


lab chairs are positioned at the work tables. Also in Fig. 3 is a diagram of the lab with the top being north. The orientation differs from the photo, which was taken in the northeastern corner of the room (top right in the diagram) facing the southwestern part of the room where the colony space is located. The diagram shows the desired rectangular path of our robot, which would take it around the central work table. This long table has only 3 planes of support on the underside; otherwise there are gaps underneath the table. In the diagram, white rectangles with dark borders are lab tables. The north and south sides of the tables are solid (2 of the planes of support), whereas the east and west sides have gaps. The gap size is large enough for the robot to drive through, but chairs (white circles with dark borders) were placed in those locations. The effective radii of the chairs are larger than the circles shown because the feet of the chairs extend out further, and in most cases there is no gap for the robot to move between neighboring chairs. The dark brown rectangle (southwest corner of the lab) is the colony space, a boxed off area of the lab that may be used for other experiments. It has borders (one foot high solid walls) that the robot would need to avoid hitting. The golden rectangles (north and south walls of the lab) denote cabinets, which the robot must also avoid. The red rectangle in the middle of the figure shows the general path around the center table that the robot must follow as it avoids chairs, tables, boxes, etc. Separate runs in both clockwise and counterclockwise directions were made around this path to increase the variety of situations encountered by the robot. Figure 4 shows photos of the moveable round chairs that were used to close the gaps between tables and cabinets where we did not want the robot to go.
The chairs are supported by 5 rounded legs and have a circular stem that can extend to adjust the height and rotate 360 degrees to change the orientation of the seat. These were chosen as the main objects of interest because they are not solid: there is clearly a good amount of gap area between the legs. This allows for complexity in defining

Fig. 4 Photographs showing chairs and spacing. (Images from Khan and Parker, 2019 [17])


what an obstacle is and what an obstacle is not. Since the robot could see through the chairs to the carpet beyond, it could not simply learn to follow the color of the carpet; it was forced to recognize obstacles. As can possibly be seen in Fig. 2, the camera for the robot faces down at about 40 degrees from the vertical position. With this in mind, the environment was designed to be complex enough, in terms of objects close to the ground, to make obstacle recognition/avoidance a difficult problem. If the environment were built using only cardboard or other flat material as the main obstacle, there would be a fairly straightforward solution: apart from lighting conditions, there would not be much variety in what material needed to be avoided. By using the chairs, the environment was more natural and complex. The legs of the chairs were visually small and had significant gaps, so the robot could still see much of the carpet through the legs; therefore, just following carpet was not a viable solution. Visible gaps between the chairs, but not enough for robot passage, compounded the problem. The learning system could not just develop a control program to follow a carpeted area, but instead needed to recognize a more complex pattern from the dataset. Not only were the chairs not solid surfaces, they were typically moved by students overnight: while they might be in the same relative location, their orientations were completely different each time. This added complexity to the problem because it was not easy for a pattern to be developed with the changing orientations. Consequently, for obstacle avoidance to be successful, the deep neural network had to properly classify chairs as obstacles to be avoided. Even though we wanted sufficient complexity to ensure this was a difficult task, there were certain situations in the environment that would be impossible for the robot to solve.
In this research, we dealt with two of these situations (Fig. 5). In the first, if there is enough of a gap between two chairs, the robot may decide to go straight instead of turning away from the chairs. To solve this issue, the chairs in the environment were placed close enough together that the gap was small enough to block the robot. In the second, if the robot approaches a cabinet directly head on and gets close enough that it sees only the cabinet, it would not know which way to turn. Even for a human with limited peripheral vision, it would be impossible to know which direction to turn; there is no way to have metaknowledge about which direction contains an obstacle and which does not. To solve this issue, areas with cabinets included a cabinet door opened toward the direction the robot was supposed to avoid. These changes established rough guidelines as to the correct path for the robot, and since the contents of the cabinets added a variety of different items for the visual system to process, they also added complexity to the environment. In addition to the standard lab obstacles (chairs, cabinets, and tables) there were other obstacles to avoid in the environment (Fig. 6). A few images in the dataset included small cardboard boxes, and a good portion of the dataset included the borders of the colony space. It was important to include obstacles like these in order to confirm that the concept of obstacle avoidance was being abstracted, rather than the robot only avoiding black colored objects (the black chairs). It is also significant to note that students used the lab throughout the day and night, so conditions of the carpet changed while the dataset was being developed. For example, coins were


Fig. 5 Trouble scenarios. In the left photo, the robot has room to go between the two chairs, but will be stuck under the table, so the chairs needed to be close enough to prevent passage. In the center photo, at the point of seeing the cabinet, the robot would not know whether to go left or right, so one of the cabinet doors was opened (right photo) to avoid the situation

Fig. 6 These images show additional obstacles such as a small cardboard box and the wall of the colony space. (Images from Khan and Parker, 2019 [17])

found laid out on the ground near a turn in the path on one day. On another day, shreds of paper were at different locations on the path. Since these items only added to the diversity in what we might consider edge cases, we decided not to remove the bulk of them while building the dataset.


3.3 Relevant Previous Works The TurtleBot platform has been used in much research on obstacle detection and avoidance. The Point Cloud Library, depth information, and plane detection algorithms were used to build methods of obstacle avoidance [10]; high curvature edge detection was used to locate boundaries between the ground and objects resting on it. Other researchers using the TurtleBot platform have used deep networks to learn obstacle avoidance. Tai, Li, and Liu used depth images as the only input to the deep network for training [11]. They discretized control commands into outputs such as “go-straightforward”, “turning-half-right”, and “turning-full-right”. The depth image came from a Kinect camera with dimensions of 640 × 480 and was downsampled to 160 × 120. Three stages of processing were applied, each ordered as convolution, activation, pooling. The first convolution layer used 32 convolution kernels, each of size 5 × 5. The final layer was a fully-connected layer with outputs for each discretized movement decision. In all trials, the robot never collided with obstacles, and the accuracy obtained on the testing set after training was 80.2%. Their network was trained on only 1104 depth images, and the environment used for this dataset seems fairly straightforward: the only “obstacles” appear to be walls or pillars, and the environment was not dynamic. Tai and Liu produced a follow-up paper [12]. Instead of a real-world environment, this work was tested in a simulated environment, Gazebo, commonly used with the TurtleBot platform. Different types of corridor environments were tested and learned, and a reinforcement learning technique called Q-learning was paired with the power of Deep Learning. The robot once again used depth images, and the training was done using Caffe.
Other deep reinforcement learning research included real-world evaluation on a TurtleBot [13], dueling deep double Q networks trained to learn obstacle avoidance [14], and a fully connected NN mapping to Q-values for obstacle avoidance [15]. In continued research, Tai, Li, and Liu [16] applied a Convolutional Neural Network with several layers to process depth images in order to learn obstacle avoidance for a TurtleBot in the real world. This is very similar to our work, except that they used depth images and the obstacles were simpler, being confined to a corridor. In addition, they did not transfer learned layers to the new problem as we did, but trained entirely on the dataset for the problem to be solved. This paper updates and extends research reported previously [17]. Our research is a distinctive approach in comparison to these previous works. Research like Boucher’s does not involve higher level learning, but instead builds upon advanced expert systems that can detect differentials in the ground plane. By using Deep Learning, our research allows for pattern-based learning, which is more general and does not need to be explicitly programmed. While Tai et al. used Deep Learning, their dataset was limited, with just over 1100 images. We built our own


dataset to have over 30,000 images, increasing the effective dataset size by about 28 times. The environment for our research is more complex than just the flat surfaces of walls and columns in a corridor. Similar to Xie’s work, our learning was done on a dataset of raw monocular RGB images, which opens the door to further research with cameras that lack depth sensing. In addition, the images used in our research were dramatically smaller, which also opens the door to faster training and a speed-up in forward propagation. Also of great importance in our work, as with some of these others, is that training and testing were done not in a simulation or contrived world but in a real world environment.

4 Deep Learning Applied to Obstacle Avoidance In order to learn obstacle avoidance for the TurtleBot operating in our environment, a human operator steered the robot using remote control while the camera on the robot collected images and corresponding control signals. A Deep Convolutional Neural Network similar to that used by Alex Krizhevsky was used to replicate his work in learning to distinguish images in the CIFAR10 dataset. This trained NN was altered by removing the final fully connected layer and replacing it with a new fully connected layer that had random weights and three outputs.

4.1 Data Collection To collect the data, a user at a keyboard remotely controlled the robot over Bluetooth as it was driven around the lab following the path in both directions. The robot maintained continuous forward movement as the operator selected among three control signals: left, right, or straight. To increase the diversity of the images in the dataset, different starting points were chosen and hard scenarios, such as being close to walls, were included. A script written to control the camera processed about 10 images per second, but only images that corresponded to control inputs were saved. While no time record was kept, an estimated 1.5–2 h were spent on trial runs and collections. In initial testing we found that some edge cases were missing, so more data was added over time. The end result was 30,754 images collected and labelled. By default, the images from the Asus Xtion Pro have dimensions 640 × 480. Although this would provide great detail for training, it would take prohibitive processing power and time to get reasonable results for our task. Therefore, as can be seen in Fig. 7, we took the 640 × 480 images and downsized them to 64 × 64. This downsizing was found to provide sufficient detail to perform the task, and it yielded reasonable learning rates.
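The downsizing step can be sketched as follows; the paper does not specify the resampling method, so nearest-neighbor index selection is shown purely as one simple possibility that handles the non-integer 480-to-64 ratio:

```python
import numpy as np

def downsample(image, out_h=64, out_w=64):
    """Nearest-neighbor downsampling by index selection, which works even
    when the source dimensions are not exact multiples of the target."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return image[rows[:, None], cols]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a raw camera frame
small = downsample(frame)
print(small.shape)  # (64, 64, 3)
```

The 100-fold reduction in pixel count (307,200 to 4,096) is what makes training on 30,754 images tractable on the robot's 2 GB GPU.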


Fig. 7 Reducing the image resolution from 640 × 480 to 64 × 64. (Figure from Khan and Parker, 2019 [17])

4.2 Deep Learning Application To use Deep Learning for our task of obstacle avoidance while driving autonomously, we started by replicating Alex Krizhevsky’s network architecture and used it to solve the CIFAR10 dataset, obtaining about 74% accuracy. We took this architecture with its learned weights, removed the final fully connected layer, which had 10 outputs, and replaced it with a new fully connected layer with 3 outputs and random weights. The idea was that the lower level features detected by the network for the CIFAR10 dataset are general enough to apply to the problem of obstacle detection. There is a large difference between detecting an airplane and detecting a dog or a cat, yet Krizhevsky’s network is capable of differentiating between the two based on the same kernel weights. That suggests a large area of coverage for the type of data provided, so keeping these trained kernel weights made sense: the results should be good and the training time significantly reduced. In addition, we thought the accuracy of our system would be helped by the fact that Krizhevsky’s network was trained on 32 × 32 images while our images were 64 × 64 pixels. Figure 8 is a diagram of our complete network, split into three lines to ease visualization. The three lines also highlight the three repetitions of the layer combination of convolution, pooling, and normalization. The diagram also shows the difference between this architecture and Krizhevsky’s network: the layer labeled “ip1Tweak” replaces the layer Krizhevsky labeled “ip”, which stands for inner product. Both are fully connected, but ip has 10 outputs while ip1Tweak has 3, indicated by the value 3 above the ip1Tweak layer in the diagram. The 3 outputs correspond to the TurtleBot’s decision making in terms of the autonomous driving directions left, right, and straight.
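The layer swap at the heart of this transfer step can be sketched abstractly in NumPy (the actual work was done in Caffe; the feature count of 64 here is an illustrative assumption, not a figure from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

# The pretrained network, abstracted as a feature extractor followed by a
# final fully connected layer of shape (features, classes).
features = 64                                        # illustrative size
pretrained_final = rng.normal(size=(features, 10))   # CIFAR10: 10 outputs

# Transfer: discard the 10-output layer and attach a freshly initialized
# 3-output layer (left / straight / right). All earlier kernel weights
# would be kept as-is and fine-tuned.
new_final = rng.normal(scale=0.01, size=(features, 3))

def forward(feature_vector, final_layer):
    # Inner product ("ip") layer: features times weights.
    return feature_vector @ final_layer

feats = rng.normal(size=features)    # stand-in for the learned features
decision_scores = forward(feats, new_final)
print(decision_scores.shape)  # (3,)
```

Only the small new layer starts from random weights, which is why the fine-tuned network reaches useful accuracy after relatively few iterations.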
The network has 32 convolution kernels for the first convolution, 32 convolution kernels for the second convolution, and 64 convolution kernels for the last one. Each convolution layer is immediately followed by a pooling layer. In addition to the convolution and pooling layers, each set of layers includes a Rectified Linear Unit. Local Response


Fig. 8 The final architecture for the deep network. This is inspired by the architecture for solving the CIFAR10 dataset. The rectangles represent layers. The octagons represent data. (Figure from Khan and Parker, 2019 [17])

Normalization is also part of the network, with one instance augmenting the outputs of each of the 1st and 2nd pooling layers. Most of the hyperparameters were found through experimentation over several trials to reach the desired level of accuracy and performance. They were:
• testing iterations: 100; how many forward passes the test will carry out.
• batch size: 77; this is for batch gradient descent. Note that batch size × testing iterations covers the entire testing dataset.
• base learning rate: 0.001
• momentum: 0.9
• weight decay: 0.004
• learning rate policy: fixed
• maximum training iterations: 15,000
• testing interval: 150; testing is carried out every 150 training iterations.
The number of maximum training iterations was chosen as an estimate of the number of epochs the network would need to stabilize. Since the training batch size was 77 images, about 300 iterations were needed to cover the whole training dataset, so the maximum was set to 15,000 iterations to let the network go through about 50 epochs. The testing interval was much smaller than is usual for large networks (on the order of 1000) so that we could analyze shifts in learning in a reasonable time instead of having to wait over half an hour.
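Assuming the common Caffe-style SGD formulation (the text lists the hyperparameters but not the update rule, so this is a sketch, not the authors' exact solver), one training step with these settings looks roughly like:

```python
import numpy as np

# Hyperparameters from the text; the learning rate policy is "fixed",
# so base_lr never decays.
base_lr, momentum, weight_decay = 0.001, 0.9, 0.004

def sgd_step(w, grad, velocity):
    """One Caffe-style update: L2 weight decay is folded into the gradient,
    and momentum accumulates the (negative) update direction."""
    velocity = momentum * velocity - base_lr * (grad + weight_decay * w)
    return w + velocity, velocity

w = np.array([1.0, -2.0])          # toy weights
v = np.zeros_like(w)               # momentum buffer starts at zero
grad = np.array([0.5, 0.5])        # toy gradient from one batch of 77 images
w, v = sgd_step(w, grad, v)
print(w)  # slightly moved against the decayed gradient
```

With a batch of 77 images, roughly 300 such steps make one pass over the 23,065 training images, matching the 50-epoch budget described above.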


5 Results The dataset was split 75/25% for training/testing the final network: 23,065 images for training and 7,689 images for testing. Starting with a Krizhevsky network trained on the CIFAR10 dataset and replacing the final layer with a tweaked fully connected layer, we ran the Deep Learning Neural Network on the images generated for the obstacle avoidance problem. The network was able to obtain an accuracy of 84% after 200 iterations and 90% after 2000 iterations (Fig. 9). The maximum accuracy attained after 15,000 iterations was about 92%. This accuracy was the network’s ability to correctly choose left, right, or straight for each of the test images.
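The 75/25 split can be sketched as follows; the seed and shuffling scheme are illustrative assumptions, but the resulting counts match the 23,065/7,689 split reported above:

```python
import numpy as np

n_images = 30754
rng = np.random.default_rng(0)        # seed is illustrative
indices = rng.permutation(n_images)   # shuffle before splitting

n_train = int(n_images * 0.75)        # 23065
train_idx, test_idx = indices[:n_train], indices[n_train:]
print(len(train_idx), len(test_idx))  # 23065 7689
```

Note that the test batch size of 77 over 100 testing iterations (7,700 forward passes) is just enough to cover the 7,689 test images.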

5.1 Robot Performance in the Environment Ten different test runs were completed in the actual environment. For each of these, the robot was reversed after a full lap so that it completed the lap in both directions. The robot did, on rare occasions, slightly graze the leg of a chair or a cardboard box. It is possible that the turning angle was the cause of these minor scrapes in this tight environment: though the network made the right decision, the movement of the physical robot may have been slightly too large. This could likely be corrected with very small tweaks to the turning radii for the different decisions; in any case, it does not reflect on the performance of the deep network itself. These slight grazes did not change the trajectory of the robot, so the rare occurrences were not counted as obstacle collisions. In all ten tests, the robot was able to complete its course with no major events.
Fig. 9 The performance of the network in relation to iterations for the fine-tuned Krizhevsky network trained with over 30,000 images. The first 15,000 iterations are shown. It took about 200 iterations to get to the 84% mark and by 15,000 it was at 92% accuracy. (Figure from Khan and Parker, 2019 [17])


In addition to the tests to see if it could make it around the path ten out of ten times, we observed the robot during particular situations of interest and noted that it routinely performed the correct action. We observed that the robot was successfully able to navigate the tight corridor and move away from chair obstacles (Fig. 4) and the border of the colony space (Fig. 6), which showed that the robot learned to avoid more than just the chairs. We also noted that the scenario of the open cabinet was not a challenge for the robot (Figs. 4 and 5 right photos) and helped it determine the direction to turn to stay on the path. And, although the cardboard box was seldom included in the original training dataset, the robot clearly had pattern recognition broad enough to be able to avoid it (Fig. 6).

5.2 Examining Network Weights and Activations While it is not trivial to understand what type of logic or pattern the network has learned, we can surely peek into the network and see how the weights and activations look. We might be able to pick out a few features that seem important for the development of the network that can not only differentiate in its lower levels between 10 different categories of animals and objects, but also differentiate between obstacle and non-obstacle. For this, we will use the image shown in Fig. 7 to look at the different kernels and activations. As can be seen in the diagram in Fig. 8, our network is made up of two convolution/pooling/normalization series of layers, followed by a convolution/pooling pair of layers connected to a final fully-connected layer. We will first consider the kernels for the convolution layers. These were learned during training on the CIFAR10 dataset. We will then look at the activations that resulted at each layer when the Fig. 7 image is presented to the system. The sets of convolution layer kernels for the first, second, and third layers are shown in Fig. 10 below. We can immediately see in the kernels for the first layer that some of these kernels are edge detectors while others resemble some form of blurring or sharpening. It is harder to discern exactly what the kernels of the second (shown) and third (not shown) convolution layers are doing, but nonetheless it is important to realize that the deeper the layers get into the network the more complex the overall function becomes for differentiating between objects. This complexity allows for the general pattern recognition model that the network builds through training. The activations of the network with these kernels are of particular interest. In Figs. 11, 12, and 13 we can see the image activations in all the different layers of the network. Figure 11 shows the resultant activations in the first set of layers. 
It’s difficult to see what’s happening in the convolutional layer, but we can immediately identify that the chair is highlighted in the pooling layer, and the normalization layer augments its location even more. Another observation is that the carpet and chair are shown as contrastive in a couple of the images. The cabinet does not light up too much, however there are several activation outputs where it is next in line after the chair in terms of brightness level.


Fig. 10 The 32 kernels of the first convolution layer and the 1024 kernels of the second convolution layer. The 2048 kernels of the third convolution layer are not shown

Fig. 11 Activations are shown here, respectively, for the first convolution layer, the first pooling layer, and the first normalization layer

Fig. 12 Activations are shown here, respectively, for the second convolution layer, the second pooling layer, and the second normalization layer


The second iteration of the same types of layers (Fig. 12) does not directly reveal much about obstacle detection, but it shows that different parts of the image light up. This might mean that the network expects the object to show up in different locations of the image, which in a way makes the network translation invariant. The bottom edge of the cabinet and the edge of the table to the right of the chair are highlighted even more. At this point it seems as if the network is focusing more on objects in the vicinity of the chair than on the chair itself; in fact, in some images it looks as if the chair and the bottom edge of the cabinet are linked together. In the final convolution/pooling layers (Fig. 13) there are some fairly high level features that the network picked up. It is hard to understand what exactly these features may be, but almost all parts of the image light up in one image or another. This may mean that the network is able to focus not on a single location but on the whole image, working towards a holistic understanding of the class of object. The last layer of the network is harder to visualize as it is fully connected, but we can certainly see the probabilities of activation for each “class”, or driving decision the robot could make, by looking at the three outputs (see Fig. 14). While it is hard to determine what exact features the network is picking up, we have hypothesized about the meaning of some of the graphics by looking into the activations of the network. We know for sure that there is some contrastive definition built between the chair and the carpet, and also between the cabinet and the carpet.

Fig. 13 Activations are shown here, respectively, for the third and final convolution layer, and the third and final pooling layer


Fig. 14 A sampling of scenarios where the neural network made live decisions. The outputs of the NN are shown for each (total 1.0): left 0.02, straight 0.96, right 0.02 in the first scenario; left 0, straight 0, right 1 in the second; left 0.73, straight 0.27, right 0 in the third. The NN will have the robot go straight in the first scenario, turn right in the second, and turn left in the third. (Images from Khan and Parker, 2019 [17])

6 Conclusions

The learning system that we developed was very successful in the complex environment where it was tested, and we believe that the same method will yield good results in other environments as well. For this model, we took a staggered approach to learning. We first replicated Krizhevsky's network and used it to solve the CIFAR10 dataset. We then replaced the final fully connected layer, which had ten outputs (classes of animals/objects), with a new fully connected layer, which had three outputs (left, straight, right). This approach did not require cameras with depth perception or environments that were virtual or static, and it was more successful than those previous approaches with regard to accuracy in the corresponding environment. In addition to successfully learning the CIFAR10 dataset, the model also quickly learned the robot's operating environment dataset, with an accuracy of 84% after only 200 iterations. The tests on the actual robot were even more noteworthy. During those ten tests, with the robot completing full circuits in both directions of the path for each test, there were no collisions (some minor scrapes, but nothing that impeded the robot or even changed its trajectory). Consequently, even though the training did not result in 100% accuracy on the test dataset, the training set was diverse enough to allow the robot to avoid all obstacles in the real-world tests. These test runs also showed that the robot learned to maneuver in difficult situations, including tight corridors, areas near the colony space, and seldom-seen obstacles such as a cardboard box. It was interesting to consider the weights and activations as we peeked inside the network. The first-layer kernels revealed that they were probably involved in a combination of edge detection, blurring, and sharpening.
In the subsequent layers, the kernels were harder to interpret, but we assume they were involved in more complex functions performing general pattern recognition, which helped them detect obstacles. In looking at the activations in the first set of layers, it is possible to identify some of these obstacles, such as chairs. In the second set of layers, the network starts to focus more on the other objects in the image. The final set of layers seems to be focusing on a higher-level breakdown of the image: different parts of the image are highlighted, but the objects are no longer recognizable. While it is difficult to discern the complex function the model develops end-to-end, we know that it was successful in identifying areas of the environment to avoid and was good at determining which way to turn when confronted with an obstacle. Lastly, in observing the probability outputs of each decision in the context of specific driving scenarios, it was evident that the model not only learned the first-best decision, but also classified what it considered to be the next-best decision. In other words, some real-world cases had multiple successful or close-to-successful outcomes (driving decisions), and the model demonstrated an abstracted understanding of that concept. In further research, we will test our trained robot in environments similar to, but different from, the original lab environment. This could include other spaces in the building, such as labs, classrooms, and hallways. This will show that the collision avoidance concepts are general enough to be effective in new environments with different lighting conditions, textures, and objects. In addition, we plan to test the learning system on other robots used for other purposes, effectively decoupling the model from the robot and ensuring that model performance is not tied to one particular robot type. One possibility is an outdoor robot that can use our system to learn to stay on the walkways on campus as it moves from one location to another. It would have the general direction to go provided by GPS, but the GPS would lack the accuracy to keep it on the sidewalk. In addition, with a proper training set, the robot could learn to avoid unanticipated obstacles, including humans, on the sidewalk while working in tandem with the GPS to reach a destination.

References
1. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems (NIPS) (2012)
2. Krizhevsky, A.: CIFAR10 dataset project page: https://www.cs.toronto.edu/~kriz/cifar.html (2009)
3. Krizhevsky, A.: Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto (2009)
4. Krizhevsky, A.: Convolutional deep belief networks on CIFAR-10. Unpublished manuscript (2010)
5. Mishkin, D., Matas, J.: All you need is a good init. Proceedings of the International Conference on Learning Representations (ICLR) (2016)
6. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. Proceedings of the International Conference on Learning Representations (ICLR) (2015)
7. Graham, B.: Fractional max-pooling. CoRR, arXiv:1412.6071 (2014)
8. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp. 1–9 (2014)
9. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision (2015)


10. Boucher, S.: Obstacle detection and avoidance using TurtleBot platform and Xbox Kinect. Research Assistantship Report, Department of Computer Science, Rochester Institute of Technology (2012)
11. Tai, L., Li, S., Liu, M.: A deep-network solution towards modeless obstacle avoidance. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, pp. 2759–2764 (2016)
12. Tai, L., Liu, M.: A robot exploration strategy based on Q-learning network. Proceedings of the IEEE International Conference on Real-time Computing and Robotics (RCAR), Angkor Wat, pp. 57–62 (2016)
13. Tai, L., Paolo, G., Liu, M.: Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2017), Vancouver, BC, pp. 31–36 (2017)
14. Xie, L., Wang, S., Markham, A., Trigoni, N.: Towards monocular vision based obstacle avoidance through deep reinforcement learning. Proceedings of the RSS 2017 Workshop on New Frontiers for Deep Learning in Robotics (2017)
15. Wu, K., Esfahani, M., Yuan, S., Wang, H.: Depth-based obstacle avoidance through deep reinforcement learning. Proceedings of the 5th International Conference on Mechatronics and Robotics, pp. 102–106 (2019)
16. Tai, L., Li, S., Liu, M.: Autonomous exploration of mobile robots through deep neural networks. International Journal of Advanced Robotic Systems, July–August 2017, 1–9 (2017)
17. Khan, M., Parker, G.: Vision based indoor obstacle avoidance using a deep convolutional neural network. Proceedings of the 11th International Joint Conference on Computational Intelligence (IJCCI 2019), pp. 403–411, INSTICC, SciTePress (2019)

CVaR Q-Learning Silvestr Stanko and Karel Macek

Abstract In this paper we focus on reinforcement learning algorithms that are sensitive to risk. The notion of risk we work with is the well-known conditional value-at-risk (CVaR). We describe a faster method for computing value iteration updates for CVaR Markov decision processes (MDPs). This improvement then opens the door to a sampling version of the algorithm, which we call CVaR Q-learning. In order to allow optimizing CVaR on large state spaces, we also formulate loss functions that are later used in a deep learning context. The proposed methods are theoretically analyzed and experimentally verified.

Keywords Reinforcement learning · CVaR · Conditional value at risk · Q-learning · Deep Q-learning

1 Introduction

Reinforcement learning, as a learning-oriented extension of dynamic programming, has shown excellent results such as beating the best human players in the game of Go [34], human-level control in computer games [27], and most recently also learning from human instructions in grounded language [9]. In spite of this success, reinforcement learning still faces several serious challenges before being able to work efficiently in critical real-world environments [16]. One of the primary challenges is safety in a stochastic environment: AI safety is one of the challenges caused by the discrepancy between the environment the agent trains in and the one it is tested in. It can be addressed by robustness [22]. The relation between robustness and stochastic modeling has been studied e.g. in [10].

S. Stanko (B) · K. Macek
DHL ITS Digital Lab, Prague, Czech Republic
e-mail: [email protected]
K. Macek
e-mail: [email protected]
© Springer Nature Switzerland AG 2021
J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_14


Risk-sensitive decision making in Markov Decision Processes (MDPs) has been subject to systematic research. The works from the 1970s and 1980s focus on exponential utility [18], maximizing the mean with constrained variance [35], or the max-min criterion [12]. All these approaches neglect an explicit formulation of the risk. Even recently, there are some works of this kind. For example, [25] uses barrier functions to protect from undesired behavior. Vinitsky [41] uses adversarial populations to cover undesired scenarios. Yet another idea is to use imitation of expert behavior as a safe reference for future learning [8]. García and Fernández [15] summarize a variety of the risk-oriented objectives in reinforcement learning. To quantify the risk in a standardized way, various measures have been used, such as Value-at-Risk (VaR) [23]. To our knowledge, today's most broadly adopted measure in the finance industry is Conditional Value-at-Risk (CVaR) [11]. It also fits the recent discussion on risk in robotics [24]. The essential benefit of CVaR is that it (i) treats the risk, similarly to VaR, more explicitly than e.g. variance-based approaches, and (ii) does so more conservatively than VaR. In other words, it is a reasonable representation of the worst-case scenario. All these facts have motivated several attempts to use CVaR also in the context of reinforcement learning:

• [30] optimizes the expected value, which is the usual criterion; in addition, a CVaR constraint restricts the strategies. In some sense, this approach is similar to the application of barrier functions [25].
• Another research direction focuses on optimization of the CVaR objective by Actor-Critic and policy-gradient algorithms [39, 40].
• One of the major disadvantages of CVaR is its time-inconsistency. Various approaches have been tried to cope with that, either dealing with a time-consistent subclass of coherent measures only, restricting the hypothesis space to time-consistent policies, or formulating the CVaR objective in a way that is time-consistent [26].
• [10] uses a continuous augmented state space, but unlike [4], this continuous state is shown to have bounded error when a particular linear discretization is applied. The major drawback of [10] is the need to solve a linear programming problem at each iteration of the algorithm.

We have already published some elements of an approach to accelerate the process [36, 37]. In this paper, we provide full-length proofs of theorems, detailed algorithms, and a description of systematic testing. An important side effect is that the new way of calculation allows us to define also the sample-based version, so we speak about CVaR Q-learning, which is also the title of this paper. This paper is organized as follows: Sect. 2 introduces the notation related to Markov Decision Processes and Q-learning, and also defines CVaR and the exact problem definition. Section 3 describes the CVaR value iteration algorithm from [10] and provides an accelerated version of said algorithm. The improved algorithm is then extended in Sects. 4 and 5 to CVaR Q-learning and Deep CVaR Q-learning, respectively. Section 6 concludes the paper.


2 Preliminaries

2.1 Conditional Value-at-Risk

To describe the uncertainty, we use P(·) for the probability of an event; p(·) and p(·|·) for the probability mass function and conditional probability mass function, respectively. The cumulative distribution function is defined as F(z) = P(Z ≤ z). For a random variable Z we work with the expected value E[Z], the Value-at-Risk

VaR_α(Z) = F^{−1}(α) = inf {z | α ≤ F(z)}    (1)

with confidence level α ∈ (0, 1), and the Conditional Value-at-Risk

CVaR_α(Z) = (1/α) ∫₀^α F_Z^{−1}(β) dβ = (1/α) ∫₀^α VaR_β(Z) dβ    (2)

Note on notation: In the risk-related literature, it is common to work with losses instead of rewards. The Value-at-Risk is then defined as the 1 − α quantile. The notation we use reflects the use of reward in reinforcement learning, and this sometimes leads to the need of reformulating some definitions or theorems. While these reformulations may differ in notation, they are based on the same underlying principles.

CVaR as Optimization: [32] proved the following equality

CVaR_α(Z) = max_s { (1/α) E[(Z − s)^−] + s }    (3)

where (x)^− = min(x, 0) represents the negative part of x, and at the optimal point it holds that s* = VaR_α(Z), i.e.

CVaR_α(Z) = (1/α) E[(Z − VaR_α(Z))^−] + VaR_α(Z).    (4)

CVaR Dual Formulation: CVaR can also be expressed as

CVaR_α(Z) = min_{ξ ∈ U_CVaR(α, p(·))} E_ξ[Z]    (5)

where

U_CVaR(α, p(·)) = { ξ : ξ(z) ∈ [0, 1/α], ∫ ξ(z) p(z) dz = 1 }    (6)
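Definitions (1)–(4) can be made concrete with a short sketch (our own illustration, not from the paper; the function name var_cvar and the empirical-CDF convention are our assumptions):

```python
import math

def var_cvar(samples, alpha):
    """Empirical VaR and CVaR in the reward convention of (1)-(4) (low tail is bad)."""
    zs = sorted(samples)
    n = len(zs)
    # (1): VaR_a = inf{z : a <= F(z)} under the empirical CDF F
    var = zs[math.ceil(alpha * n) - 1]
    # (4): CVaR_a = (1/a) E[(Z - VaR_a)^-] + VaR_a, with (x)^- = min(x, 0)
    cvar = var + sum(min(z - var, 0.0) for z in zs) / (alpha * n)
    return var, cvar
```

For samples 1, ..., 10 and α = 0.2 this gives VaR = 2 and CVaR = 1.5, i.e. the mean of the worst 20% of outcomes, consistent with identity (4).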


2.2 Q-Learning

As in any other reinforcement learning setting, we work with the concept of a Markov Decision Process (MDP [6]), defined as a 5-tuple M = (X, A, r, p, γ), where X is the finite state space, A is the finite action space, r(x, a) is a bounded deterministic reward generated by being in state x and selecting action a, p(x′|x, a) is the probability of transition to a new state x′ given state x and action a, and γ ∈ [0, 1) is a discount factor. A stationary policy π is a mapping from states to probabilities of selecting each possible action, π : X × A → [0, 1]. For indexing the time, we use t = 0, 1, ..., ∞. The return is defined as the discounted sum of rewards along a trajectory over the infinite horizon, given policy π and initial state x₀:

Z^π(x₀) = Σ_{t=0}^{∞} γ^t r(x_t, a_t)

Note that the return is a random variable. The action-value function Q^π : X × A → R of policy π describes the expected return of taking action a ∈ A in state x ∈ X, then acting according to π:

Q^π(x, a) = E[Z^π(x, a)] = E[ Σ_{t=0}^{∞} γ^t r(x_t, a_t) | x₀ = x, a₀ = a, a_t ∼ π ]    (7)

Value iteration [38] is a well-known algorithm for computing the optimal action-value function, and thereby finding the optimal policy maximizing the expected return. Below we introduce the Bellman optimality operator T:

T Q(x, a) = r(x, a) + γ Σ_{x′} p(x′|x, a) max_{a′} Q(x′, a′)    (8)

The value iteration algorithm works by repeatedly applying the operator from an initial estimate Q₀ until convergence to the optimal action-value function Q. Q-learning [43] extends value iteration to cases where we don't have direct access to the transition probabilities of the environment. In that case we have to rely on direct interaction with the environment. Q-learning works by repeatedly updating the Q-value estimate according to the sampled rewards and states using a moving exponential average:

Q_{t+1}(x, a) = (1 − β_t) Q_t(x, a) + β_t [ r + γ max_{a′} Q_t(x′, a′) ],   x′ ∼ p(·|x, a)    (9)

where β_t is the learning rate at time t.


Q-learning has been successfully applied, in conjunction with deep learning, on challenging domains such as Atari games [27].
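As an illustration of update (9), the following minimal sketch runs tabular Q-learning on a toy two-state deterministic MDP of our own making (the MDP and all names are our assumptions, not from the paper); with γ = 0.9 the optimal values are Q*(s, 1) = 10 and Q*(s, 0) = 9 in both states:

```python
import random

gamma, beta = 0.9, 0.5            # discount factor and a constant learning rate
Q = [[0.0, 0.0], [0.0, 0.0]]      # Q[state][action], two states, two actions

random.seed(0)
x = 0
for _ in range(5000):
    a = random.randrange(2)        # pure exploration
    # action 0: stay, reward 0; action 1: jump to the other state, reward 1
    x2, r = (x, 0.0) if a == 0 else (1 - x, 1.0)
    # update (9): moving exponential average of the sampled Bellman target
    Q[x][a] = (1 - beta) * Q[x][a] + beta * (r + gamma * max(Q[x2]))
    x = x2
```

Because rewards and transitions here are deterministic, a constant learning rate suffices and Q converges to the fixed point of (8).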

2.3 Distributional Transition Operator

The return can be considered not only for the first state x, but also for a given first action a. Denoting it as Z(x, a), we can define it recursively as follows:

Z(x, a) =ᴰ r(x, a) + γ Z(x′, a′),   x′ ∼ p(·|x, a), a′ ∼ π, x₀ = x, a₀ = a    (10)

where =ᴰ denotes that the random variables on both sides of the equation share the same probability distribution. Analogously to policy evaluation [38, p. 90], we speak about the value distribution Z^π. We define the transition operator P^π : Z → Z as

P^π Z(x, a) =ᴰ Z(x′, a′),   x′ ∼ p(·|x, a), a′ ∼ π(·|x′)    (11)

These operators are described in more detail in [5].

2.4 Problem Formulation

Unlike for the expected value, in order to maximize CVaR as part of a reinforcement learning problem we cannot limit ourselves to stationary policies. The optimum can only be reached when working with more general, history-dependent policies, since the CVaR objective is not time-consistent [28].

Definition 1 (History-dependent Policies) Let the space of admissible histories up to time t be H_t = H_{t−1} × A × X for t ≥ 1, and H₀ = X. A generic element h_t ∈ H_t is of the form h_t = (x₀, a₀, ..., x_{t−1}, a_{t−1}, x_t). Let Π_{H,t} be the set of all history-dependent policies with the property that at each time t the distribution of the randomized control action is a function of h_t. In other words, Π_{H,t} = {π₀ : H₀ → P(A), ..., π_t : H_t → P(A)}. We also let Π_H = lim_{t→∞} Π_{H,t} be the set of all history-dependent policies.

The risk-averse objective we wish to address for a given confidence level α is

max_{π ∈ Π_H} CVaR_α(Z^π(x₀))    (12)


3 CVaR Value Iteration

Since CVaR is a time-inconsistent measure, [10] present a dynamic programming formulation on an extended state space. A value-iteration-type algorithm is then applied on this space. We repeat the key ideas of [10] below, as they form a basis for our contributions presented in later sections.

3.1 Bellman Equation for CVaR

The results of [10] heavily rely on the CVaR decomposition theorem [28]:

CVaR_α(Z^π(x)) = min_{ξ ∈ U_CVaR(α, p(·|x,π(x)))} Σ_{x′} p(x′|x, π(x)) ξ(x′) · CVaR_{ξ(x′)α}(Z^π(x′))    (13)

where the risk envelope U_CVaR coincides with the dual definition of CVaR (6). The theorem states that we can compute CVaR_α(Z^π(x)) as the minimal weighted combination of CVaR_{ξ(x′)α}(Z^π(x′)) under a probability distribution perturbed by ξ(x′). Notice that the variable ξ both appears in the sum and modifies the confidence level for each state. The decomposition theorem was extended in [10] by defining the CVaR value function C(x, y) on an augmented state space X × Y. Here Y = (0, 1] is an additional continuous state that represents the different confidence levels:

C(x, y) = max_{π ∈ Π_H} CVaR_y(Z^π(x))    (14)

Similar to standard dynamic programming, it is convenient to work with operators defined on the space of value functions. This leads to the following definition of the CVaR Bellman operator T_cvar:

T_cvar CVaR_y(Z(x)) = max_a [ r(x, a) + γ CVaR_y(P^{π*} Z(x, a)) ]    (15)

where P^{π*} denotes the transition operator (11) with an optimal policy π* for all confidence levels. [10, Lemma 3] further showed that the operator T_cvar is a contraction and also preserves the convexity of y·CVaR_y. The optimization problem (13) is convex and therefore has a unique solution. Additionally, the fixed point of this contraction is the optimal C*(x, y) = max_{π ∈ Π_H} CVaR_y(Z^π(x)) [10, Theorem 4].


Naively applying the theorem is unfortunately impossible in practice, since the state space is continuous in y. Chow [10] showed that it is possible to approximate the y·CVaR_y function as a piecewise linear function while the theoretical guarantees still hold.

3.2 CVaR Value Iteration with Linear Interpolation

Given a set of N(x) interpolation points Y(x) = {y₁, ..., y_{N(x)}}, we can approximate the y·C(x, y) function by interpolation on these points, i.e.

I_x[C](y) = y_i C(x, y_i) + ( y_{i+1} C(x, y_{i+1}) − y_i C(x, y_i) ) / ( y_{i+1} − y_i ) · (y − y_i)

where y_i = max {y′ ∈ Y(x) : y′ ≤ y}. The interpolated Bellman operator T_I is then also a contraction and has a bounded error [10, Theorem 7]:

T_I C(x, y) = max_a [ r(x, a) + γ min_{ξ ∈ U_CVaR(y, p(·|x,a))} Σ_{x′} p(x′|x, a) I_{x′}[C](y ξ(x′)) / y ]    (16)

This algorithm can be used to find an approximate global optimum in any MDP. Each iteration of (16) can be formulated as a linear program, which is the solution proposed in [10]. This approach is unfortunately quite slow and is unusable for large state spaces. In the next section we propose a faster way to compute the updates. For completeness, we present the linear program in the Appendix.
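The interpolation I_x[C] can be sketched as follows (our own helper, assuming the atoms store C(x, y_i) = CVaR_{y_i} values and that y₀ = 0 contributes the implicit knot (0, 0)):

```python
def interp_yC(ys, cs, y):
    """Piecewise-linear interpolation I_x[C] of the y*CVaR_y function.
    ys: increasing atoms in (0, 1]; cs: C(x, y_i) values at those atoms."""
    pts = [(0.0, 0.0)] + [(yi, yi * ci) for yi, ci in zip(ys, cs)]
    for (y0, v0), (y1, v1) in zip(pts, pts[1:]):
        if y0 <= y <= y1:
            # linear segment between the knots (y_i, y_i*C_i) and (y_{i+1}, y_{i+1}*C_{i+1})
            return v0 + (v1 - v0) * (y - y0) / (y1 - y0)
    raise ValueError("y must lie in (0, 1]")
```

With atoms ys = [0.5, 1.0] and cs = [−1.0, 0.0], the y·CVaR_y knots are (0, 0), (0.5, −0.5) and (1.0, 0), so the interpolated value at y = 0.25 is −0.25.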

3.3 Accelerated Value Iteration for CVaR

In order to improve the time complexity of the existing algorithm, we will describe a connection between the y·CVaR_y function and the quantile function of the underlying distribution. We will then use this connection to formulate a different way to compute the value iteration step, resulting in the first linear-time algorithm for solving CVaR MDPs.

Lemma 1 Any discrete distribution has a piecewise linear and convex y·CVaR_y function. Similarly, any piecewise linear convex function can be seen as representing a certain discrete distribution. In particular, the integral of the quantile function is the y·CVaR_y function

y CVaR_y(Z) = ∫₀^y VaR_β(Z) dβ    (17)

and the derivative of the y·CVaR_y function is the quantile function

∂/∂y [ y CVaR_y(Z) ] = VaR_y(Z)    (18)

Proof The fact that discrete distributions have a piecewise linear y·CVaR_y function has already been shown by [32]. According to definition (2) we have

y CVaR_y(Z) = y · (1/y) ∫₀^y VaR_β(Z) dβ = ∫₀^y VaR_β(Z) dβ

and by taking the derivative with respect to y, we have

∂/∂y [ y CVaR_y(Z) ] = ∂/∂y ∫₀^y VaR_β(Z) dβ = VaR_y(Z)

We now propose to use Lemma 1 to compute CVaR at each atom, utilizing our knowledge of the full distribution. The high-level steps of the computation are:

1. Transform y·CVaR_y(Z(x′)) of each reachable state x′ to a discrete probability distribution using (18).
2. Combine these into a distribution representing the full state-action distribution.
3. Compute y·CVaR_y for all atoms using (17).

See Fig. 1 for a visualization of the procedure. Note that this procedure is linear in the number of atoms and state transitions. The only potentially nonlinear step is the sorting step when mixing the distributions. However, since values are pre-sorted for each state x′, this is equivalent to a single step of the merge sort algorithm, which is also linear in the number of transitions and atoms. We show the explicit computation of the procedure for linearly interpolated atoms in Algorithm 1.

Fig. 1 Visualization of the CVaR computation for a single state and action with two transition states. Thick arrows represent the conversion between y·CVaR_y and the quantile function. Adapted from [37]


Algorithm 1 CVaR Computation via Quantile Representation.

function extractDistribution
  input: vectors C, y    # note: y_0 = C(x′, y_0) = 0
  for i ∈ {1, ..., |y|} do
    d_i = ( C(x′, y_i) − C(x′, y_{i−1}) ) / ( y_i − y_{i−1} )
  end for
  output: vector d

function extractC
  input: vectors d, p
  C_0 = 0
  for i ∈ {1, ..., |p|} do
    C_i = C_{i−1} + d_i · p_i
  end for
  output: vector C

function mixDistributions
  input: tuples (d^(1), p^(1)), ..., (d^(K), p^(K)) and vector y    # Σ_k p^(k) = 1
  for i, k ∈ {1, ..., |y|} × {1, ..., K} do
    p_i^(k) = p^(k) · ( y_i − y_{i−1} )    # weigh atom probabilities by transitions
  end for
  # join all tuples together:
  atoms = ( (d_1^(1), p_1^(1)), ..., (d_N^(1), p_N^(1)), (d_1^(2), p_1^(2)), ..., (d_N^(K), p_N^(K)) )
  sort atoms by d
  unwrap vectors d, p from the sorted tuples
  output: d, p

# Main
input: tuples (C(x_1, •), p^(1)), ..., (C(x_K, •), p^(K)) and vector y
for k ∈ {1, ..., K} do
  d^(k) = extractDistribution(C(x_k, •), y)
end for
d_mix, p_mix = mixDistributions((d^(1), p^(1)), ..., (d^(K), p^(K)), y)
C_out = extractC(d_mix, p_mix)
output: C_out
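A direct transcription of Algorithm 1 could look as follows (a sketch with our own function names; we assume each C vector holds the y_i·CVaR_{y_i} values, whose consecutive slopes recover the quantile atoms as in Lemma 1):

```python
def extract_distribution(yc, ys):
    """(18): consecutive slopes of the y*CVaR_y curve give the quantile (VaR) atoms."""
    d, py, pc = [], 0.0, 0.0
    for y, c in zip(ys, yc):
        d.append((c - pc) / (y - py))
        py, pc = y, c
    return d

def extract_c(d, p):
    """(17): integrate the sorted quantile atoms back into y*CVaR_y values."""
    out, acc = [], 0.0
    for di, pi in zip(d, p):
        acc += di * pi
        out.append(acc)
    return out

def mix_distributions(dists, trans_probs, ys):
    """Weigh each successor's atoms by its transition probability, then merge them sorted."""
    atoms = []
    for d, pk in zip(dists, trans_probs):
        py = 0.0
        for di, y in zip(d, ys):
            atoms.append((di, pk * (y - py)))
            py = y
    atoms.sort(key=lambda a: a[0])   # a single merge step in the pre-sorted case
    return [a[0] for a in atoms], [a[1] for a in atoms]
```

As a check, mixing a ±1 coin-flip successor with a deterministic-zero successor (probability 0.5 each) on uniform atoms and re-integrating reproduces the y·CVaR_y values of the mixture distribution.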

To show the correctness of this approach, we formulate it as a solution to problem (13) in the next paragraphs. Note that we skip the reward and gamma scaling for readability’s sake. Extension to the Bellman operator is trivial.

3.4 Computing ξ

We now need a way to retrieve y_{t+1} = y_t ξ*(x_{t+1}) in order to extract the optimal policy. We compute ξ* by using the following intuition: y_{t+1} represents the portion of the tail of Z(x_{t+1}) whose values are present in the computation of CVaR_{y_t}(Z(x_t)). In the continuous case, it is the probability in Z(x_{t+1}) of values less than VaR_{y_t}(Z(x_t)), as we show below [37]:

Theorem 1 Let x₁, x₂ be the only two states reachable from state x via action a in a single transition. Let the cumulative distribution functions of the states' underlying distributions Z(x₁), Z(x₂) be strictly increasing with unbounded support. Then the solution to the minimization problem (13) can be computed for i = 1, 2 by setting

ξ(x_i) = F_{Z(x_i)}( F_{Z(x,a)}^{−1}(α) ) / α    (19)

See Appendix A.1 for proof. The theorem is straightforwardly extendable to multiple states by induction.
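Equation (19) can be exercised numerically (a sketch with toy Gaussian successors of our own choosing, under which Theorem 1's assumptions of strictly increasing CDFs hold). A useful sanity check is the dual-feasibility property Σ_i p(x_i) ξ(x_i) = F_{Z(x,a)}(z*)/α = 1:

```python
import math

def norm_cdf(z, mu, sigma):
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))

# Two successor states with Gaussian return distributions (toy numbers)
mus, sigmas, probs = [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]
alpha = 0.1

def mix_cdf(z):
    # CDF of Z(x, a), the transition-weighted mixture of the successor returns
    return sum(p * norm_cdf(z, m, s) for p, m, s in zip(probs, mus, sigmas))

# F^{-1}_{Z(x,a)}(alpha) by bisection (the mixture CDF is strictly increasing)
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mix_cdf(mid) < alpha:
        lo = mid
    else:
        hi = mid
z_star = 0.5 * (lo + hi)

# Eq. (19): xi(x_i) = F_{Z(x_i)}(F^{-1}_{Z(x,a)}(alpha)) / alpha
xis = [norm_cdf(z_star, m, s) / alpha for m, s in zip(mus, sigmas)]
```

By construction Σ_i p(x_i)·ξ(x_i) = F_{Z(x,a)}(z*)/α = 1 with ξ(x_i) ∈ [0, 1/α], so ξ is a feasible point of the risk envelope (6).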

3.5 Experiments

We test the proposed algorithm on the same task as [10]. The agent's task is to navigate a rectangular grid from a starting state to a single destination, moving in its four-neighborhood. The agent is encouraged to reach the goal in the fastest way possible by receiving a negative reward of −1 each step. A set of obstacles is placed randomly on the grid, and stepping on an obstacle ends the episode while the agent receives a reward of −40. To simulate sensing and control noise, the agent has a δ = 0.05 probability of moving to a different state than intended. For our experiments, we choose a 40 × 60 grid-world and approximate the α·CVaR_α function using 21 log-spaced atoms. The learned policies on a sample grid are shown in Fig. 2 and they align with the ones reported in [10]: the lower we set α, the more risk-sensitive the agent becomes, choosing a longer but safer route towards the goal.

Fig. 2 Grid-world simulations for α = 0.1, 0.2, 0.3, and 1.0. The optimal deterministic paths from the start S to the goal G are shown together with CVaR estimates for the given α. Adapted from [37]


The code for these experiments, as well as for the other experiments discussed in this article, is available online.¹

4 CVaR Q-Learning

When we don't have full knowledge of the environment, including the transition probabilities, value iteration is no longer usable. In this case we have to rely on methods that interact with the environment directly. One such algorithm is the well-known Q-learning [43], which works by repeatedly updating the action-value estimate according to the sampled rewards and states using a moving exponential average. As our next contribution, we formulate a Q-learning algorithm for optimizing CVaR MDPs.

4.1 Estimating CVaR

Estimating CVaR for Q-learning requires finding a recursive expression whose expectation is the CVaR value. Similar methods have already been thoroughly investigated in the stochastic approximation literature by [31]. The Robbins-Monro theorem has also been applied directly to CVaR estimation by [3], who used it to formulate a recursive importance sampling procedure useful for estimating CVaR of long-tailed distributions. We will first focus on a simple example with a single state, which translates to estimating CVaR from samples we get from a single unknown distribution. In order to estimate CVaR, we need to maintain two separate estimates V and C, being our VaR and CVaR estimates respectively:

V_{t+1} = V_t + β_t ( 1 − (1/α) 1_{(V_t ≥ r)} )    (20)

C_{t+1} = (1 − β_t) C_t + β_t ( V_t + (1/α)(r − V_t)^− )    (21)

β_t represents the learning rate at time t. Notice that (20) is in fact a standard equation for quantile estimation.² Equation (21) then represents the moving exponential average of the primal CVaR definition (3). The estimations are proven to converge, given the usual requirements on the learning rate [3].

¹ https://github.com/Silvicek/cvar-algorithms.
² See e.g. [21] for more information on quantile estimation/regression.
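The recursions (20) and (21) can be sketched on samples from a standard normal distribution (our toy choice; the step-size schedule is one common Robbins-Monro style choice). For N(0, 1) and α = 0.1 the true values are VaR ≈ −1.28 and CVaR ≈ −1.76:

```python
import random

random.seed(1)
alpha = 0.1
V = C = 0.0
for t in range(1, 300001):
    r = random.gauss(0.0, 1.0)
    beta = t ** -0.7                      # decaying Robbins-Monro step size
    # (20): stochastic approximation of the alpha-quantile (VaR)
    V = V + beta * (1.0 - (1.0 / alpha) * (1.0 if V >= r else 0.0))
    # (21): exponential average of the primal CVaR expression (3)
    C = (1.0 - beta) * C + beta * (V + min(r - V, 0.0) / alpha)
```

After enough samples, V hovers around the 0.1-quantile of N(0, 1) and C around the mean of the worst 10% of outcomes.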


4.2 Temporal Difference Updates

We first define two separate values for each state, action, and atom, V, C : X × A × Y → R, where C(x, a, y) represents CVaR_y(Z(x, a)) of the distribution, similar to definition (14), and V(x, a, y) represents the VaR_y estimate, i.e. the estimate of the y-quantile of the distribution recovered from CVaR_y by Lemma 1. We formulate the temporal difference update of the CVaR Q-learning algorithm in Algorithm 2, under the assumption of linearly spaced atoms.

Algorithm 2 CVaR TD update.
1: input: x, a, x′, r
2: for each i do
3:   C(x′, y_i) = max_{a′} C(x′, a′, y_i)
4: end for
5: d = extractDistribution(C(x′, •), y)    # see Algorithm 1
6: for each i, j do
7:   V(x, a, y_i) = V(x, a, y_i) + β ( 1 − (1/y_i) 1_{(V(x,a,y_i) ≥ r + γ d_j)} )
8:   C(x, a, y_i) = (1 − β) C(x, a, y_i) + β ( V(x, a, y_i) + (1/y_i)( r + γ d_j − V(x, a, y_i) )^− )
9: end for

In the first step we construct a new CVaR (line 3), representing CVaR_y(Z(x′)), by greedily selecting actions that yield the highest CVaR for each atom. We then transform C(x′, •) into the underlying distribution d (line 5) and use it to create the target T d = r + γd. In the next step, we use the quantile values proportionally to their probabilities (in the uniform case, this means exactly once) and apply the respective VaR and CVaR update rules (lines 7, 8). If the atoms aren't uniformly spaced (log-spaced atoms are motivated by the error bounds of CVaR Value Iteration), we have to perform basic importance sampling when updating the estimates. In contrast with the uniform version, we iterate only over the atoms and perform a single update for the whole target by taking an expectation over the target distribution. This is done by replacing lines 7, 8 with

V(x, a, y_i) = V(x, a, y_i) + β E_j[ 1 − (1/y_i) 1_{(V(x,a,y_i) ≥ r + γ d_j)} ]    (22)

C(x, a, y_i) = (1 − β) C(x, a, y_i) + β E_j[ V(x, a, y_i) + (1/y_i)( r + γ d_j − V(x, a, y_i) )^− ]

The explicit computation of the expectation term for VaR would then look like

E_j[ 1 − (1/y_i) 1_{(V(x,a,y_i) ≥ r + γ d_j)} ] = Σ_j p_j ( 1 − (1/y_i) 1_{(V(x,a,y_i) ≥ r + γ d_j)} )


where p_j = y_j − y_{j−1} represents the probability of d_j. The CVaR update expectation is computed analogously:

E_j[ V(x, a, y_i) + (1/y_i)( r + γ d_j − V(x, a, y_i) )^− ] = Σ_j p_j ( V(x, a, y_i) + (1/y_i)( r + γ d_j − V(x, a, y_i) )^− )

This is a valid approach since the sample mean is equal to the mean of the original distribution. In this case we are performing the updates on batches of samples and the final expected value remains unchanged:

E[f(Z)] = Σ_i p_i E[f(Z_i)]

The above equation holds for any function f if Z is a mixture of the Z_i, so it also holds for the VaR update term 1 − (1/y_i) 1_{(V(x,a,y_i) ≥ r + γ d_j)}, where the learned distribution is a mixture of the different target distributions. We conclude the same for the CVaR update, since the expectation remains unchanged. We are in fact using more informed updates, similar to the difference between pure and batch Stochastic Gradient Descent.

4.3 CVaR and Policy Improvement

CVaR Q-learning allows us to find the CVaR_y function, but our ultimate goal is the optimal policy. Below we formulate an algorithm we call VaR-based policy improvement, which will then help us find this optimal policy. Let us now assume that we have successfully converged with distributional value iteration and have available the return distributions of some stationary policy for each state and action. Our next goal is to find a policy improvement algorithm that will monotonically increase the CVaR_α criterion for a selected α. Recall the primal definition of CVaR (3):

CVaR_α(Z) = max_s { (1/α) E[(Z − s)^−] + s }

Our goal (12) can then be rewritten as

max_π CVaR_α(Z^π) = max_π max_s { (1/α) E[(Z^π − s)^−] + s }

As mentioned earlier, the maximizing s in the primal is VaR_α(Z):

CVaR_α(Z) = max_s { (1/α) E[(Z − s)^−] + s } = (1/α) E[(Z − VaR_α(Z))^−] + VaR_α(Z)
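This identity is easy to check numerically on an empirical sample. The following self-contained snippet (ours, not from the chapter) verifies that the primal objective, maximized over a grid of s, matches the mean of the worst α-tail and is attained at VaR_α:

```python
import numpy as np

# Empirical check of the primal identity on a sorted sample:
# CVaR_a(Z) = max_s (1/a) E[(Z - s)^-] + s, attained at s = VaR_a(Z).
rng = np.random.default_rng(0)
Z = np.sort(rng.normal(size=10_000))
alpha = 0.1

def objective(s):
    return np.mean(np.minimum(Z - s, 0.0)) / alpha + s

var_alpha = np.quantile(Z, alpha)               # VaR_a(Z)
cvar_direct = Z[: int(alpha * len(Z))].mean()   # mean of the worst a-tail
grid = np.linspace(Z.min(), Z.max(), 2001)
cvar_primal = max(objective(s) for s in grid)

assert abs(cvar_primal - cvar_direct) < 1e-2
assert abs(objective(var_alpha) - cvar_direct) < 1e-3
```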

346

S. Stanko and K. Macek

The main idea of VaR-based policy improvement is the following: if we knew the value s* in advance, we could simplify the problem and maximize only

max_π CVaR_α(Z^π) = max_π (1/α) E[(Z^π − s*)^−] + s*    (23)

Given that we have access to the return distributions, we can improve the policy by simply choosing an action that maximizes CVaR_α in the first state, a_0 = arg max_a CVaR_α(Z(x_0, a)), setting s* = VaR_α(Z(x_0, a_0)), and focusing on maximization of the simpler criterion. This can be seen as coordinate ascent with the following phases:

1. Maximize (1/α) E[(Z^π(x_0) − s)^−] + s w.r.t. s while keeping π fixed. This is equivalent to computing CVaR according to the primal.
2. Maximize (1/α) E[(Z^π(x_0) − s)^−] + s w.r.t. π while keeping s fixed. This is the policy improvement step.
3. Recompute CVaR_α(Z^{π*}), where π* is the new policy.

Since our goal is to optimize the criterion of the distribution starting at x_0, we need to change the value s while traversing the MDP (where we only have access to Z(x_t)). We do this by recursively updating the s we maximize by setting s_{t+1} = (s_t − r)/γ. See Algorithm 3 for the full procedure, which we justify in the following theorem [37]:

Theorem 2 Let π be a stationary policy, α ∈ (0, 1]. By following policy π* from Algorithm 3, we improve CVaR_α(Z) in expectation:

CVaR_α(Z^π) ≤ CVaR_α(Z^{π*})

See A.2 for the proof. The ideas presented here were partially explored by [4], although not to this extent; see Remark 3.9 in [4] for details.

Algorithm 3 VaR-based policy improvement.
  a = arg max_a CVaR_α(Z(x_0, a))
  s = VaR_α(Z(x_0, a))
  Take action a, observe x, r
  while x is not terminal do
    s = (s − r)/γ
    a = arg max_a E[(Z(x, a) − s)^−]
    Take action a, observe x, r
  end while
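To make the action-selection steps concrete, here is a minimal sketch with a toy discrete return distribution. Class and method names are ours, not from the chapter, and the in-episode update s ← (s − r)/γ is noted only in a comment:

```python
import numpy as np

class DiscreteReturn:
    """Toy return distribution with equally likely, sorted atoms."""
    def __init__(self, atoms):
        self.atoms = np.sort(np.asarray(atoms, dtype=float))
    def var(self, alpha):                       # VaR_alpha
        return np.quantile(self.atoms, alpha)
    def cvar(self, alpha):                      # mean of the worst alpha-tail
        k = max(1, int(np.ceil(alpha * len(self.atoms))))
        return self.atoms[:k].mean()
    def exp_min(self, s):                       # E[(Z - s)^-]
        return np.minimum(self.atoms - s, 0.0).mean()

def initial_step(Z_x0, alpha):
    """First two lines of Algorithm 3: pick the CVaR-maximizing action at
    x_0 and set s* = VaR_alpha of its return. Z_x0 maps action -> DiscreteReturn.
    In later states one would update s = (s - r) / gamma and then pick
    arg max_a exp_min(s)."""
    a = max(Z_x0, key=lambda act: Z_x0[act].cvar(alpha))
    return a, Z_x0[a].var(alpha)

Z_x0 = {"safe": DiscreteReturn([1, 1, 1, 1]),
        "risky": DiscreteReturn([-10, 5, 5, 5])}
assert initial_step(Z_x0, 0.25)[0] == "safe"    # risk-averse choice
assert initial_step(Z_x0, 1.0)[0] == "risky"    # risk-neutral: mean 1.25 > 1
```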


4.4 CVaR Q-Learning with VaR-Based Policy Improvement

Our next goal is to use VaR-based policy improvement for extracting the optimal policy once CVaR Q-learning has converged. This would mean optimizing E[(Z_t − s)^−] in each step. A problem we encounter here is that we only have access to the discretized distributions and we cannot extract the values between the selected atoms. To solve this, we propose an approximate heuristic that uses linear interpolation to extract the VaR of a given distribution. The expression E[(Z_t − s)^−] is computed by taking the expectation of the distribution before the value s. We are therefore looking for the value y where VaR_y = s. This value is linearly interpolated from VaR_{y_{i−1}} and VaR_{y_i}, where y_i = min{y : VaR_y ≥ s}. The expectation is then taken over the extracted distribution, as this is the distribution that approximates CVaR the best. See Algorithm 4 and Fig. 3 for more intuition behind the heuristic. Algorithm 4 CVaR Q-learning policy.

  function expMinInterp(s, d, V, y)    # ≈ E[(d_a − s)^−]
    z = 0
    for i ∈ {1, ..., |y|} do
      if s < V_i then break end if
      z = z + d_i · (y_i − y_{i−1})
    end for
    p_last = (y_i − y_{i−1}) · (s − V_{i−1}) / (V_i − V_{i−1})
    z = z + d_i · p_last
    output z

  input: α, converged V, C
  x = x_0
  a = arg max_a C(x, a, α)
  s = V(x, a, α)
  while x is not terminal do
    d_a = extractDistribution(C(x, a, ·), y)  ∀a
    a = arg max_a expMinInterp(s, d_a, V(x, a, ·), y)
    Take action a, observe r, x′
    s = (s − r)/γ
    x = x′
  end while
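A Python transcription of expMinInterp follows our reading of the pseudocode; the handling of the edge cases (s below the lowest VaR estimate, or above all of them) is our own:

```python
def exp_min_interp(s, d, V, y):
    """Sketch of expMinInterp from Algorithm 4: approximate E[(d_a - s)^-]
    by accumulating probability-weighted atoms whose VaR estimate lies
    below s, linearly interpolating the mass of the straddling atom.
    The confidence level preceding y[0] is taken to be 0."""
    z, y_prev = 0.0, 0.0
    for i in range(len(y)):
        if s < V[i]:
            V_prev = V[i - 1] if i > 0 else V[i]
            if V[i] != V_prev:              # fraction of atom i's mass below s
                p_last = (y[i] - y_prev) * (s - V_prev) / (V[i] - V_prev)
                z += d[i] * p_last
            return z
        z += d[i] * (y[i] - y_prev)
        y_prev = y[i]
    return z

# Two atoms with mass 0.5 each; half of the second atom's mass lies below s=0.
print(exp_min_interp(0.0, [-2.0, 2.0], [-1.0, 1.0], [0.5, 1.0]))  # -0.5
```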

4.5 Experiments

For our CVaR Q-learning experiments, we use the same environment as in 3.5. Since the positive reward is very sparse, we chose to run the algorithm on a smaller environment of size 10 × 15. We trained the agent for 10,000 sampled episodes with learning rate β = 0.4 that dropped every 10 episodes by a factor of 0.995. The policy used was ε-greedy with ε = 0.5 and maximized expected value (α = 1). Notice


Fig. 3 Visualization of the VaR-based heuristic. Quantile function of the exact distribution (unknown to the model) is shown in red and the VaR estimates at selected α-levels are shown in green. Let’s say we now want to know y where VaR y = 0. We use linear interpolation between the nearest known VaRs, shown in orange. In this case the interpolation estimate is y = 0.5. Adapted from [37]

the high value of ε. We found that lower ε values led to overfitting the optimal expected value policy, as the agent only sparsely updated states off the optimal path. With the mentioned parameters, the agent was able to learn the optimal policies for different levels of α. See Fig. 5 for the learned policies, which were all extracted from the same CVaR value function.
Note on convexity: Unlike CVaR Value Iteration, where we maintain convexity of the yCVaR_y function with each update (given we started with convex estimates), in CVaR Q-learning we can break convexity in each update for any atom. We experience this in practice as well, as can be gauged from Fig. 4. Fortunately this fact does not break the update rule, since the targets we use to update C as well as V do not have to be ordered.

[Fig. 4: panels yCVaR, Extracted Distribution, and VaR, with per-action curves (Left, Right, Up, Down)]

Fig. 4 Learned C, V estimates for a single state after 10000 episodes. Notice the nonconvexities visible from the extracted distribution plot. Both extracted distribution and VaR functions should be nondecreasing


[Fig. 5: grid-world panels for α = 0.1, 0.3, 0.6, and 1.0; S marks the start state, G the goal]

Fig. 5 Grid-world Q-learning simulations. The optimal deterministic paths are shown together with CVaR estimates for given α. Adapted from [37]

5 Deep CVaR Q-Learning

Real-world environments, which are of utmost interest for practical applications, often have intractable state spaces. In these cases the exact Q-learning methodology fails, as we cannot store separate Q estimates for each state. It is common to use function approximation in this case, an example of which is the Deep Q-learning algorithm (DQN) [27]. Of particular interest in our case is also the distributional variant of DQN called Quantile Regression DQN (QR-DQN) [13]. In this section, we extend CVaR Q-learning to its deep Q-learning variant and show the practicality and scalability of the proposed methods.
The logic of transitioning from CVaR Q-learning to Deep CVaR Q-learning (CVaR-DQN) is similar to DQN. One difference is that in the case of CVaR-DQN we need to represent two separate values—one for V, one for C. As with DQN, we need to reformulate the updates as arguments minimizing some loss function.

5.1 Loss Functions

The loss function for V(x, a, y) is similar to the QR-DQN loss in that we wish to find quantiles of a distribution. The target distribution, however, is constructed differently—in CVaR-DQN we extract the distribution from the yCVaR_y function of the next state, T V = r + γd.


L_VaR = Σ_{i=1}^{N} E_j[ (r + γ d_j − V_i(x, a)) ( y_i − 1{V_i(x, a) ≥ r + γ d_j} ) ]    (24)

where d_j are the atoms of the extracted distribution. Constructing the CVaR loss function consists of transforming the running mean into a mean squared error, again with the transformed distribution atoms d_j:

L_CVaR = Σ_{i=1}^{N} ( E_j[ V_i(x, a) + (1/y_i) (r + γ d_j − V_i(x, a))^− − C_i(x, a) ] )²    (25)

Putting it all together, we are now able to construct the full CVaR-DQN loss function:

L = L_VaR + L_CVaR    (26)

Combining the loss functions with the full DQN algorithm, we get the full CVaR-DQN with experience replay; see Algorithm 5 in the appendix. Note that we utilize a target network C′ that is used for extraction of the target values of C, similarly to the original DQN algorithm. The network V does not need a target network since the target is constructed independently of the value V.
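A NumPy sketch of the combined loss for a single transition; the array layout and the placement of the square outside the inner expectation follow our reading of Eqs. (24)–(26):

```python
import numpy as np

def cvar_dqn_loss(V, C, r, d, y, gamma):
    """Combined loss (26) for one (x, a) pair. V, C: current VaR/CVaR
    estimates, shape (N,); d: target-distribution atoms, shape (M,);
    y: confidence levels, shape (N,)."""
    target = r + gamma * d                              # shape (M,)
    u = target[None, :] - V[:, None]                    # r + gamma d_j - V_i
    ind = (V[:, None] >= target[None, :]).astype(float)
    # Quantile-regression loss for VaR (Eq. 24)
    l_var = np.sum(np.mean(u * (y[:, None] - ind), axis=1))
    # Squared error between C_i and the running-mean CVaR target (Eq. 25)
    cvar_target = np.mean(V[:, None] + np.minimum(u, 0.0) / y[:, None], axis=1)
    l_cvar = np.sum((cvar_target - C) ** 2)
    return l_var + l_cvar

y = np.array([0.25, 0.5, 0.75, 1.0])
loss = cvar_dqn_loss(np.zeros(4), np.zeros(4), 0.0, np.array([-1.0, 1.0]), y, 0.9)
```

Both terms are non-negative (the quantile-regression penalty and a squared error), so the combined loss is bounded below by zero.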

5.2 Experiments

We applied the CVaR-DQN algorithm to environments with state spaces that would be intractable for Q-learning without approximation.

5.2.1 Ice Lake

Ice Lake is a visual environment specifically designed for risk-sensitive decision making. The agent moves in a continuous space and its goal is to reach a goal state as fast as possible. In the middle of the space is a lake of ice, which has a probability of cracking; this ends the episode and the agent receives a large negative reward. The agent has five discrete actions, namely Left, Right, Up, Down and Noop, corresponding to moving in the respective directions or no operation. Since the agent is on ice, there is a sliding element to the movement—this is mainly done to introduce time dependency and make the environment a little harder. The environment is updated thirty times per second. The agent receives a negative reward of −1 per second; the episode ends with reward 100 if it reaches the goal unharmed, or −50 if the ice breaks. The straightforward


Fig. 6 The Ice Lake environment. The agent is black and its target is green. The blue ring represents a dangerous area with a risk of breaking the ice. The grey arrow shows the optimal risk-neutral path, red the risk-averse path. Adapted from [37]

path to the goal has about a 15% chance of breaking the ice when taking the shortcut, and it is still advantageous for a risk-neutral agent to take the dangerous path. We have trained the agent in two settings: one simple, using the positions and velocities of the agent as inputs; the other with purely visual inputs, as seen in Fig. 6.
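The risk trade-off can be illustrated with back-of-the-envelope numbers. Only the reward constants (−1 per step, +100 at the goal, −50 on a crack) and the roughly 15% crack chance come from the text; the step counts below are illustrative assumptions:

```python
def path_stats(crack_prob, steps, crack_penalty=-50.0, goal_reward=100.0):
    """Expected return and worst-case outcome of a path; step counts and
    the per-path crack probability are illustrative assumptions."""
    success = goal_reward - steps        # -1 per time step, then +100
    failure = crack_penalty - steps      # ice breaks: -50
    mean = (1 - crack_prob) * success + crack_prob * failure
    worst = failure if crack_prob > 0 else success
    return mean, worst

shortcut = path_stats(crack_prob=0.15, steps=5)    # across the lake
around = path_stats(crack_prob=0.0, steps=30)      # safe detour
# A risk-neutral agent prefers the shortcut (72.5 > 70), while an agent
# judging by the worst case prefers the safe path (70 > -55).
assert shortcut[0] > around[0] and shortcut[1] < around[1]
```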

5.2.2 Network Architecture

The neural network architecture used in our experiments takes inspiration from DQN, but has a few notable differences. Namely, the output is not a single value in our case, but instead a vector representing CVaR_y or VaR_y for the different confidence levels y. The output shape is therefore |A| × N, where N is the number of atoms we want to use and |A| is the action space size. Additionally, we must work with two distinct output types, one for C and one for V. We experimented with two separate networks (one for each value) and also with a single network differing only in the last layer. As we did not find significant performance differences, we settled on the faster version with shared weights. We also used 256 units instead of 512 to ease the computation requirements and used Adam [20] as the optimization algorithm. The implementation was done in Python and the neural networks were built using TensorFlow [1] as the framework of choice for gradient descent. The code was based on OpenAI Baselines [14], an open-source DQN implementation.
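The shared-trunk layout can be sketched shape-wise in plain NumPy. The random weights are placeholders; this is not the actual TensorFlow model, only an illustration of the two |A| × N output heads:

```python
import numpy as np

def cvar_dqn_forward(obs, params, n_actions, n_atoms):
    """Shape sketch of the shared-trunk architecture: one hidden layer of
    256 units feeding two heads that output |A| x N values for V and C."""
    h = np.maximum(obs @ params["W_trunk"], 0.0)            # shared 256-unit trunk
    V = (h @ params["W_var"]).reshape(n_actions, n_atoms)   # VaR head
    C = (h @ params["W_cvar"]).reshape(n_actions, n_atoms)  # CVaR head
    return V, C

rng = np.random.default_rng(0)
obs_dim, n_actions, n_atoms = 4, 5, 8
params = {
    "W_trunk": rng.normal(size=(obs_dim, 256)) * 0.1,
    "W_var": rng.normal(size=(256, n_actions * n_atoms)) * 0.1,
    "W_cvar": rng.normal(size=(256, n_actions * n_atoms)) * 0.1,
}
V, C = cvar_dqn_forward(rng.normal(size=obs_dim), params, n_actions, n_atoms)
assert V.shape == C.shape == (5, 8)
```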

5.2.3 Parameter Tuning

The optimal CVaR value function contains information about optimal policies for any α. This hints that we should explore many more different policies in order to retrieve enough information from the environment. During our experiments, we first tested with α = 1 so as to find reasonable policies quickly. We noticed that the optimal policy with respect to expected value was found fast and other policies were quickly abandoned due to the character of ε-greedy exploration. This fact, together with the

352

S. Stanko and K. Macek

exploration-exploitation dilemma, contributes to the difficulty of learning the correct policies. After some experimentation, we settled on the following points:
• Training with a single policy is insufficient in larger environments. Instead of maximizing CVaR for α = 1 as in our CVaR Q-learning experiments, we change the value of α randomly for each episode (uniformly over (0, 1]).
• The training benefits from a higher value of ε than DQN. We settled on 0.3 as a reasonable value, with the ability to explore faster while keeping the learned trajectories exploitable.
• The random initialization used in deep learning has a detrimental effect on the initial distribution estimates, due to the way the target is constructed; this sometimes leads to the introduction of extreme values during the initial training. We found that clipping the gradient norm helps to mitigate these problems and overall helps with the stability of learning.

5.2.4 Results

With the tweaked parameters, both versions (baseline and visual) were able to converge and learned both the optimal expected value policy and the risk-sensitive policy (at once), as in Fig. 6. Examples of learned policies can be seen at https://youtu.be/KO8vWRnwbCA. Although we tested with the vanilla version of DQN, we expect that all the DQN improvements, such as experience replay [17], dueling [42], parameter noise [29] and others (combining the improvements matters, see [17]), should have a positive effect on the learning performance. Another practical improvement may be the introduction of the Huber loss, similarly to QR-DQN.

6 Conclusion

This work has presented a novel approach to using CVaR as an objective in Reinforcement Learning applications. The quantile representation trick allowed much faster computation than other state-of-the-art approaches. More specifically, our approach does not require solving a series of Linear Programs and instead finds solutions to the internal optimization problems in linear time by appealing to the underlying distributions of the CVaR function. The first benefit is a faster form of Value Iteration. Subsequently, we introduced CVaR Q-learning and expanded it to its deep learning version, which has demonstrated the scalability of the presented approach. We also presented a policy improvement algorithm that allowed us to find the optimal policy for CVaR Q-learning. CVaR Q-learning leverages the benefits of standard Q-learning, i.e. it also works in cases where knowledge of the environment is not perfect. The study of the theoretical aspects of the algorithms as well as simulations have confirmed the potential of the introduced methods: the CVaR objective is a practical framework for computing control policies that are robust with respect to both stochasticity and model perturbations. Our work enhances the current state-of-the-art methods for CVaR MDPs and improves both the practicality and scalability of the available approaches.
These promising results motivate further research. From the theoretical point of view, what we left unproven is the convergence of CVaR Q-learning. In spite of the many demonstrations in this article, end-to-end real-life simulations were not addressed. Since Q-learning works with observed data, it motivates possible applications in the area of optimal portfolio management [2, 44] or stable control of energy systems [19, 33].

A Proofs of Theoretical Results

A.1 Proof of Theorem 1

Proof Since we are interested in the minimal argument, we can ease the computation by focusing on the αCVaR_α function instead of CVaR_α. When working with two states, the equation of interest simplifies to

αCVaR_α(Z(x, a)) = min_ξ [ p_1 ξ_1 α CVaR_{ξ_1 α}(Z(x_1)) + p_2 ξ_2 α CVaR_{ξ_2 α}(Z(x_2)) ]
s.t.  p_1 ξ_1 + p_2 ξ_2 = 1,  0 ≤ ξ_1 ≤ 1/α,  0 ≤ ξ_2 ≤ 1/α

therefore, substituting ξ_2 = (1 − p_1 ξ_1)/(1 − p_1),

αCVaR_α(Z(x, a)) = min_{ξ_1} p_1 ξ_1 α CVaR_{ξ_1 α}(Z(x_1)) + (1 − p_1) · ((1 − p_1 ξ_1)/(1 − p_1)) α CVaR_{((1 − p_1 ξ_1)/(1 − p_1)) α}(Z(x_2))
= min_{ξ_1} [ p_1 ∫_0^{ξ_1 α} VaR_β(Z(x_1)) dβ + (1 − p_1) ∫_0^{((1 − p_1 ξ_1)/(1 − p_1)) α} VaR_β(Z(x_2)) dβ ]

To find the minimal argument, we take the first derivative w.r.t. ξ_1:

∂(αCVaR_α)/∂ξ_1 = p_1 α VaR_{ξ_1 α}(Z(x_1)) + (1 − p_1) α · (−p_1/(1 − p_1)) VaR_{((1 − p_1 ξ_1)/(1 − p_1)) α}(Z(x_2))
= p_1 α [ VaR_{ξ_1 α}(Z(x_1)) − VaR_{((1 − p_1 ξ_1)/(1 − p_1)) α}(Z(x_2)) ]


By setting the derivative to 0, we get

VaR_{ξ_1 α}(Z(x_1)) = VaR_{((1 − p_1 ξ_1)/(1 − p_1)) α}(Z(x_2)) = VaR_{ξ_2 α}(Z(x_2))

[7] have shown that in the case of a strictly increasing c.d.f. with unbounded support, it holds that

VaR_{ξ_1 α}(Z(x_1)) = VaR_{ξ_2 α}(Z(x_2)) = VaR_α(Z(x, a))
F⁻¹_{Z(x_1)}(ξ_1 α) = F⁻¹_{Z(x_2)}(ξ_2 α) = F⁻¹_{Z(x,a)}(α)

and we can extract the values ξ_1 α, ξ_2 α using the c.d.f.:

F⁻¹_{Z(x_1)}(ξ_1 α) = F⁻¹_{Z(x,a)}(α)
F_{Z(x_1)}( F⁻¹_{Z(x_1)}(ξ_1 α) ) = F_{Z(x_1)}( F⁻¹_{Z(x,a)}(α) )
ξ_1 α = F_{Z(x_1)}( F⁻¹_{Z(x,a)}(α) )

And similarly for ξ_2. Since the problem is convex, we have found the optimal point.

A.2 Proof of Theorem 2

Proof Let s* be a solution to max_s (1/α) E[(Z^π(x_0) − s)^−] + s. Then by optimizing (1/α) E[(Z^π − s*)^−] over π, we monotonically improve the optimization criterion CVaR_α(Z(x_0)):

CVaR_α(Z^π) = max_s (1/α) E[(Z^π − s)^−] + s
= (1/α) E[(Z^π − s*)^−] + s*
≤ max_π (1/α) E[(Z^π − s*)^−] + s*
= (1/α) E[(Z^{π*} − s*)^−] + s*
≤ max_{s′} (1/α) E[(Z^{π*} − s′)^−] + s′
= CVaR_α(Z^{π*})

When optimizing w.r.t. π we can ignore the scaling term 1/α and the constant term s* without affecting the optimal argument. We can therefore focus on the optimization of E[(Z^π(x_0) − s*)^−].




E[(Z_t − s)^−] = E[(Z_t − s) 1{Z_t ≤ s}]
= E[(r_t + γ Z_{t+1} − s) 1{Z_{t+1} ≤ (s − r_t)/γ}]
= Σ_{x_{t+1}, r_t} P(x_{t+1}, r_t | x_t, a) E[(r_t + γ Z(x_{t+1}) − s) 1{Z(x_{t+1}) ≤ (s − r_t)/γ}]
= γ Σ_{x_{t+1}, r_t} P(x_{t+1}, r_t | x_t, a) E[(Z(x_{t+1}) − (s − r_t)/γ) 1{Z(x_{t+1}) ≤ (s − r_t)/γ}]
= γ Σ_{x_{t+1}, r_t} P(x_{t+1}, r_t | x_t, a) E[(Z(x_{t+1}) − (s − r_t)/γ)^−]    (27)

where we used the definition of the return, Z_t = R_t + γ Z_{t+1}, and the fact that probability-mixture expectations can be computed as E[f(Z)] = Σ_i p_i E[f(Z_i)] for any function f. Now let's say we sampled reward r_t and state x_{t+1}; we are still trying to find a policy π* that maximizes

π* = arg max_π E[(Z(x_t) − s)^− | x_{t+1}, r_t]
= arg max_π E[(Z(x_{t+1}) − (s − r_t)/γ)^−]    (28)

where we ignored the unsampled states, since these are not a function of x_{t+1}, and the multiplicative constant γ that will not affect the maximum argument. At the starting state, we set s = s*. At each following state we select an action according to equation (28). By induction we maximize the criterion (23) in each step.


B Other Results

Algorithm 5 Deep CVaR Q-learning with experience replay.
  Initialize replay memory M
  Initialize the VaR function V with random weights θ_v
  Initialize the CVaR function C with random weights θ_c
  Initialize target CVaR function C′ with weights θ′_c = θ_c
  for each episode do
    x = x_0
    while x is not terminal do
      Choose a using a policy derived from C (ε-greedy)
      Take action a, observe r, x′
      Store transition (x, a, r, x′) in M
      x = x′
      Sample random transitions (x_j, a_j, r_j, x′_j) from M
      Build the loss function L_VaR + L_CVaR
      Perform a gradient step on L_VaR + L_CVaR w.r.t. θ_v, θ_c
      Every N_target steps set θ′_c = θ_c
    end while
  end for
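The control-flow scaffolding of Algorithm 5 (replay memory and periodic target sync) reduces to a few lines; the capacity and sync interval below are arbitrary placeholders:

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal replay buffer storing (x, a, r, x') transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
    def store(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

class TargetSync:
    """Signal a copy of theta_c into the target weights theta_c' every
    n_target gradient steps."""
    def __init__(self, n_target=500):
        self.n_target, self.steps = n_target, 0
    def due(self):
        self.steps += 1
        return self.steps % self.n_target == 0

memory, sync = ReplayMemory(), TargetSync(n_target=3)
for t in range(10):
    memory.store((t, 0, -1.0, t + 1))
batch = memory.sample(4)
assert len(batch) == 4 and all(len(tr) == 4 for tr in batch)
assert [sync.due() for _ in range(6)] == [False, False, True, False, False, True]
```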

B.1 CVaR Value Iteration—Linear Program

min_{ξ, I_{x′}}  (1/y) Σ_{x′} p(x′ | x, a) I_{x′}    (29)

s.t.  0 ≤ ξ ≤ 1/y    (30)

I_{x′} ≥ y_i C(x′, y_i) + [ (y_{i+1} C(x′, y_{i+1}) − y_i C(x′, y_i)) / (y_{i+1} − y_i) ] (y ξ(x′) − y_i)    ∀i, ∀x′    (31)

Σ_{x′} p(x′ | x, a) ξ(x′) = 1    (32)

Here I_{x′} is a slack variable for capturing the piecewise linearity.


References
1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. OSDI 16, 265–283 (2016)
2. Almahdi, S., Yang, S.Y.: An adaptive portfolio trading system: a risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown. Expert Syst. Appl. 87, 267–279 (2017)
3. Bardou, O., Frikha, N., Pagès, G.: Recursive computation of value-at-risk and conditional value-at-risk using MC and QMC. In: Monte Carlo and Quasi-Monte Carlo Methods 2008, pp. 193–208. Springer (2009)
4. Bäuerle, N., Ott, J.: Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research 74(3), 361–379 (2011)
5. Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspective on reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning, pp. 449–458. ICML'17, JMLR.org (2017)
6. Bellman, R.: A Markovian decision process. Journal of Mathematics and Mechanics, 679–684 (1957)
7. Bernard, C., Vanduffel, S.: Quantile of a mixture with application to model risk assessment. Dependence Modeling 3(1) (2015)
8. Brown, D.S., Niekum, S., Petrik, M.: Bayesian robust optimization for imitation learning (2020)
9. Chevalier-Boisvert, M., Bahdanau, D., Lahlou, S., Willems, L., Saharia, C., Huu Nguyen, T., Bengio, Y.: BabyAI: first steps towards grounded language learning with a human in the loop. arXiv e-prints 1810.08272 (Oct 2018), https://arxiv.org/abs/1810.08272
10. Chow, Y., Tamar, A., Mannor, S., Pavone, M.: Risk-sensitive and robust decision-making: a CVaR optimization approach. In: Advances in Neural Information Processing Systems, pp. 1522–1530 (2015)
11. Committee, B., et al.: Fundamental review of the trading book: a revised market risk framework. Consultative Document, October (2013)
12. Coraluppi, S.P.: Optimal control of Markov decision processes for performance and robustness (1998)
13. Dabney, W., Rowland, M., Bellemare, M.G., Munos, R.: Distributional reinforcement learning with quantile regression. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
14. Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y.: OpenAI Baselines. https://github.com/openai/baselines (2017)
15. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1), 1437–1480 (2015)
16. Hamid, O., Braun, J.: Reinforcement learning and attractor neural network models of associative learning, pp. 327–349 (2019). https://doi.org/10.1007/978-3-030-16469-0_17
17. Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., Silver, D.: Rainbow: combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298 (2017)
18. Howard, R.A., Matheson, J.E.: Risk-sensitive Markov decision processes. Manage. Sci. 18(7), 356–369 (1972)
19. Khan, M.R.B., Pasupuleti, J., Al-Fattah, J., Tahmasebi, M.: Energy management system for PV-battery microgrid based on model predictive control. Indonesian Journal of Electrical Engineering and Computer Science 15(1), 20–25 (2019)
20. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
21. Koenker, R., Hallock, K.F.: Quantile regression. Journal of Economic Perspectives 15(4), 143–156 (2001)
22. Leike, J., Martic, M., Krakovna, V., Ortega, P.A., Everitt, T., Lefrancq, A., Orseau, L., Legg, S.: AI safety gridworlds. arXiv preprint arXiv:1711.09883 (2017)
23. Macek, K.: Predictive control via lazy learning and stochastic optimization. In: Doktorandské dny 2010 – Sborník doktorandů FJFI, pp. 115–122 (November 2010)
24. Majumdar, A., Pavone, M.: How should a robot assess risk? Towards an axiomatic theory of risk in robotics. arXiv preprint arXiv:1710.11040 (2017)
25. Marvi, Z., Kiumarsi, B.: Safe reinforcement learning: a control barrier function optimization approach. International Journal of Robust and Nonlinear Control (2020)
26. Miller, C.W., Yang, I.: Optimal control of conditional value-at-risk in continuous time. SIAM J. Control. Optim. 55(2), 856–884 (2017)
27. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529 (2015)
28. Pflug, G.C., Pichler, A.: Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research 41(2), 682–699 (2016)
29. Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R.Y., Chen, X., Asfour, T., Abbeel, P., Andrychowicz, M.: Parameter space noise for exploration. arXiv preprint arXiv:1706.01905 (2017)
30. Prashanth, L.: Policy gradients for CVaR-constrained MDPs. In: International Conference on Algorithmic Learning Theory, pp. 155–169. Springer (2014)
31. Robbins, H., Monro, S.: A stochastic approximation method. The Annals of Mathematical Statistics, 400–407 (1951)
32. Rockafellar, R.T., Uryasev, S.: Optimization of conditional value-at-risk. Journal of Risk 2, 21–42 (2000)
33. Schmidt, M., Moreno, M.V., Schülke, A., Macek, K., Mařík, K., Pastor, A.G.: Optimizing legacy building operation: the evolution into data-driven predictive cyber-physical systems. Energy and Buildings 148, 257–279 (2017)
34. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354 (2017)
35. Sobel, M.J.: The variance of discounted Markov decision processes. J. Appl. Probab. 19(4), 794–802 (1982)
36. Stanko, S.: Risk-averse distributional reinforcement learning. Master's thesis, Czech Technical University (2018), https://dspace.cvut.cz/bitstream/handle/10467/76432/F3-DP-2018-Stanko-Silvestr-thesis.pdf
37. Stanko, S., Macek, K.: Risk-averse distributional reinforcement learning: a CVaR optimization approach. In: Proceedings of the 11th International Joint Conference on Computational Intelligence, IJCCI 2019, Vienna, Austria, September 17–19, 2019, pp. 412–423 (2019). https://doi.org/10.5220/0008175604120423
38. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)
39. Tamar, A., Chow, Y., Ghavamzadeh, M., Mannor, S.: Sequential decision making with coherent risk. IEEE Trans. Autom. Control 62(7), 3323–3338 (2017)
40. Tamar, A., Glassner, Y., Mannor, S.: Optimizing the CVaR via sampling. In: AAAI, pp. 2993–2999 (2015)
41. Vinitsky, E., Du, Y., Parvate, K., Jang, K., Abbeel, P., Bayen, A.: Robust reinforcement learning using adversarial populations (2020)
42. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., De Freitas, N.: Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581 (2015)
43. Watkins, C.J., Dayan, P.: Q-learning. Machine Learning 8(3–4), 279–292 (1992)
44. Yang, Q., Ye, T., Zhang, L.: A general framework of optimal investment. Available at SSRN 3136708 (2019)

Rule Extraction from Neural Networks and Other Classifiers Applied to XSS Detection

Fawaz A. Mereani and Jacob M. Howe

Abstract Explainable artificial intelligence (XAI) is concerned with creating artificial intelligence that is intelligible and interpretable by humans. Many AI techniques build classifiers, some of which result in intelligible models, some of which don’t. Rule extraction from classifiers treated as black boxes is an important topic in XAI, that aims to find rule sets that describe classifiers and that are understandable to humans. Neural networks provide one type of classifier where it is difficult to explain why the inputs map to the decision; support vector machines provide a second example of this kind. A third type of classifier, k-nearest neighbour (k-NN), gives more interpretable classifiers, but suffers from performance problems as the model is little more than a representation of the training data. This work investigates a technique to extract rules from classifiers where the underlying problem’s feature space is Boolean, without looking at the inner structure of the classifier. For such a classifier with a small feature space, a Boolean function describing it can be directly calculated, whilst for a classifier with a larger feature space, a sampling method is investigated to produce rule-based approximations to the behaviour of the underlying classifier, with varying granularity, leading to XAI. The behaviour of the technique with neural network, support vector machine, and k-NN classifiers is experimentally assessed on a dataset of cross-site scripting (XSS) attacks, and proves to give very high accuracy and precision, often comparable to the classifier being approximated. Keywords Rule extraction · Explainable AI · Neural networks · XSS

F. A. Mereani · J. M. Howe (B) City, University of London, London, UK e-mail: [email protected] F. A. Mereani e-mail: [email protected] F. A. Mereani Umm AL-Qura University, Makkah, Saudi Arabia © Springer Nature Switzerland AG 2021 J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_15



1 Introduction

Explainable Artificial Intelligence (XAI) is the term used to capture the problem of making artificial intelligence applications intelligible to humans [15]. XAI aims to "produce more explainable models, while maintaining a high level of learning performance (prediction accuracy); and enable human users to understand, appropriately trust, and effectively manage the emerging generation of artificially intelligent partners" [15]. Machine learning, especially neural networks, can produce classifiers that give high predictive accuracy, leading to excellent performance in complex tasks such as detecting objects in images [17, 45] or understanding natural language [9]. The resulting trained models (again, particularly from neural networks) are often essentially black boxes. The way in which a neural network reaches a decision from the input data is not accompanied by an explanation that can be interpreted by a user. This also applies to an extent to other machine learning techniques such as support vector machines. There is a lot of interest in being able to give an explanation for the decision making resulting from machine learning models. That might be by opening up black box models [3, 4], by developing methods that help to understand what the model has learned [25, 34], or, as is done in the current paper, by extracting rules from the networks.
In [29] the authors introduced a technique for rule extraction from neural networks (NN) in the case that the feature set is Boolean. This initial investigation begins with the observation that if the features that form the input to a neural network are all Boolean, and the output value is a Boolean, then the trained neural network precisely defines a Boolean function. It is demonstrated that if the number of features is small, then each possible input combination can be evaluated, essentially enumerating the truth table for the Boolean function represented by the neural network.
As the number of features increases, the size of the truth table rises rapidly, hence its enumeration becomes infeasible. Therefore, for neural networks defined over larger feature spaces, approximations of the encoded Boolean function were considered which use a sampling-based approach. The extracted rules are Boolean functions and, with each Boolean variable corresponding to a feature in the problem domain, these rules are clearly interpretable.
The target application is the detection of JavaScript-based cross-site scripting (XSS) attacks. In previous work, a variety of machine learning techniques were applied to determine whether a script was malicious or benign [30, 31]. The performance of the resulting classifiers was evaluated and they achieved high predictive accuracy using a large real-world dataset of scripts. It was found in this work that the most successful set of features to abstract the scripts to was Boolean valued (sixty-two features in total), hence the models are Boolean valued.
The current paper develops this work further. The approximation of classifiers which cannot be exactly described is based on a sampling of the values given by the classifier. This sampling is further explored, in terms of the size of the sample needed

Rule Extraction from Neural Networks and Other Classifiers Applied …


in order to determine the classification value, how accurate this classification is, and how the sampling determines the classification value. In [29] it was conjectured that the technique would work with any classifier with Boolean features. In earlier work, support vector machine (SVM) and k-nearest neighbour (k-NN) classifiers have proved to be particularly good at classifying XSS [30, 31]. SVMs share, to some extent, the interpretability problems that neural networks have. Whilst k-NN classifiers are fairly easy to interpret (so there is not a big XAI challenge in using them), unlike many other classifiers they grow with the size of the training data, since they are essentially a representation of that data. This leads to performance problems, and motivates rule extraction for a different reason, namely that an extracted rule would not be of the same size as the training data and should perform classification more efficiently. The aim of this paper is to perform rule extraction from classifiers treated as black boxes, with the extracted Boolean functions giving a decision making method that is more explainable to humans [15]. The contributions of this paper are as follows:
• Validation of the observation that for an entirely Boolean feature set, with binary classification, the trained classifier defines a Boolean function
• An investigation into how to use a sampling approach to approximate this Boolean function when the feature set is sufficiently large for the generation of the precise function to be infeasible
• A demonstration that the Boolean function extraction can be performed for neural network, SVM, and k-NN classifiers for XSS detection problems
• An empirical evaluation of approximate rule extraction from the XSS classifiers.
The rest of this paper is organised as follows: Sect. 2 gives background and related work on methods for extracting rules and the detection of XSS attacks in scripts.
Section 3 describes the dataset used, including how features are selected and ranked, how classifiers are trained and evaluated using this dataset, and the method used for constructing and approximating Boolean functions. Section 4 presents results related to the application of the rule extraction, and Sect. 5 discusses the results. Further discussion and concluding remarks are given in Sect. 6.

2 Background and Related Work
It has been said that the requirements of a software system, such as accuracy and ease of use, almost always pull in opposite directions. As [7] has stated, “Unfortunately, in prediction, accuracy and simplicity (interpretability) are in conflict.” The extraction of rules from a trained classifier can be an intermediate method which allows both of these requirements to be satisfied, via a relatively simple and understandable set of rules that simulates the model’s predictions (that is, the rules explain what happens inside the black box) [5, 11, 27].


F. A. Mereani and J. M. Howe

Neural networks are among the most popular classifiers from which rules are extracted. Algorithms for extracting rules from neural network classifiers may be divided into three main categories:
1. Pedagogical: This kind of method is not concerned with the internal structure of the network, but only with deriving the rules used by the network by looking at the relationships between the inputs and the outputs; it does not scrutinise the internal behaviour of the network [46, 48]. An example of this type of rule extraction can be found in [40], where rules were extracted from a multilayer medical diagnostic system by monitoring the impact on network outputs of changes to its inputs. The VIA technique [47] is another pedagogical method, in which input datasets are generated and tested in order to extract rules from a neural network trained using backpropagation. Other techniques in this category are sampling and the reverse engineering of neural networks [16]. An example of the use of samples for a pedagogical approach is given in [10], where Craven and Shavlik proposed an algorithm called TREPAN. This algorithm extracts decision trees with M-of-N splits from an ANN with one hidden layer, using the network as an “oracle” to statistically validate the correctness and significance of the generated rules. Saad and Wunsch proposed, in [39], a method termed HYPINV which relies on a network inversion technique. This method is capable of extracting the hyperplane rule learned by a multilayer perceptron in the form of conjunctions and disjunctions of hyperplanes. The present study will focus on the use of samples to extract rules from black box classifiers, hence knowing how to use samples to extract rules is essential. Huysmans et al. [12] specify methods for using samples to extract rules; additional training instances are created to act as samples for use by TREPAN.
Another method, used with ANN-DT in [41], is to create random instances that are kept near the original training instances, which in turn ensures that the generated samples are similar to the original training data.
2. Decompositional: The methods in this category are concerned with extracting the rules directly from the layers within the network. Such decompositional methods analyse network activation, the outputs of hidden layers and the associated weights [13]. An example of this type of method is found in [43, 44], where a three step algorithm was proposed for analysing and thus ‘understanding’ the neural network. Deriving rules from deep neural networks is one of the more difficult tasks in this area. Katz et al. proposed an algorithm in [22] to verify properties of neural networks with ReLU activation functions [33] by an application of the simplex method. The algorithm was modified to support ReLU constraints and termed Reluplex. Reluplex is concerned with reducing the search space, but it needs to ‘split’ on a specific ReLU constraint; note that in this method many or indeed all of the ReLUs involved can be ignored. This algorithm has been applied to the next generation of airborne collision avoidance systems for unmanned aircraft [20].


3. Eclectic: Methods of this type combine attributes derived from the two previous types. This type of rule extraction was used in [23], where a method for discovering trends within a large dataset was proposed which employed a neural network as a black box with the function of discovering knowledge; at the same time, the method examines weights by pruning and clustering the activation values of the hidden units within the network.
It is fruitful to compare the types of extraction methods in terms of their relative advantages and disadvantages. First, it can be observed that the extraction of rules using decompositional approaches is complex and requires considerable computational resources, and this is the most important constraint with regard to the use of these methods. Pedagogical approaches are generally faster because they do not attempt to analyse the weights and internal structure of the associated neural network; however, their most important disadvantage is that they are less likely to find all the rules that correctly describe the behaviour of the network. The eclectic approach is slower but more precise, because it combines the two other methodologies [2].
A decision tree is one of the most common methods of representing the rules extracted from non-rule-based classifiers, where the individual rules can be specified in the form if...then. The decision tree itself is built using these rules such that the classes (returned by the classifier) are the leaves and the branches represent the sequences of features (conditions) that lead to these classes [1]. Representing the rules in a way which is understandable by human beings is described in [6] and [18].
1. If-Then / If-Then-Else: Rules are represented using an “if” condition. The condition component is a set of conditions on input variables, followed by a “then” which indicates a class.
An example of an “if...then...else” rule is: if (a11 < x1 < a12) and (a21 < x2 < a22) then Class A else Class B. Note that most extraction algorithms create rules that contain conjunctions, and they will generally ensure that the conditional parts define separate areas in the input space, meaning that the rules are mutually exclusive; therefore, only one rule will be able to classify a new entry.
2. M-of-N: This type of representation is considered more compact than “if...then” rule sets; a class is returned only when at least M of the full set of N conditions are true.
3. Oblique Rules/Multi-surface: This type of representation uses rules that separate a feature space with planes; each side of each plane represents a particular class, which allows each data point in the space to belong to a specific class. This representation is more difficult to understand, but such rules are powerful.
4. Equation Rules: This type of rule representation is similar to oblique rules, but non-linear equations are used in the condition part. This makes it difficult to understand the extracted rules, and thus contributes little to the interpretation of the original model.


5. Fuzzy Rules: This method of representing rules is similar to that of (if...then) rules, the difference being that this representation deals with fuzzy sets and its underlying mechanism is many-valued fuzzy logic. Fuzzy rules are easy to understand because they are expressed using concepts that are readily comprehensible to the user.
In this study, the pedagogical approach combined with a sampling technique will be adopted to extract the rules from a neural network classifier. The proposed approach will focus on extracting the rules by finding the relationships between the inputs and the returned classes. The rules so extracted will be represented in the form (if...then...else), since Boolean functions act as decision trees.
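As an illustration of the first representation, an interval rule of the kind shown above can be written directly as executable code (a minimal sketch; the bound values a11..a22 are illustrative placeholders, not values from the paper):

```python
def rule_classifier(x1, x2, a11=0.2, a12=0.8, a21=0.1, a22=0.9):
    # An "if...then...else" rule of the form discussed above:
    # if (a11 < x1 < a12) and (a21 < x2 < a22) then Class A else Class B.
    if a11 < x1 < a12 and a21 < x2 < a22:
        return "A"
    return "B"
```

Because the condition defines a single region of the input space, exactly one branch fires for any input, matching the mutual exclusivity noted above.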

2.1 Overview of Minimising Boolean Expressions
To better understand the Boolean function being used, it is useful to extract the rules into a compact representation. A minimal representation of a Boolean expression is easier to understand and to write out; in addition, explanations based on such minimal forms are less prone to error. Importantly, a minimal representation can be more effective and efficient when implemented in experiments [38]. Therefore, the minimisation of Boolean expressions, to find a representation equivalent to the original expression but of minimum size, is considered here. Minimisation can be achieved in several ways, where the important factor in choosing a method is the number of variables in the expression. The commonly used methods for minimising Boolean expressions are:
1. Karnaugh Maps: This is a graphical method for minimising Boolean expressions [21], whereby the truth table of the expression is expressed as a matrix, all the complementary pairs are then eliminated, and the result is a minimised Boolean expression. This method is very effective when only small numbers of variables are involved, but it becomes unwieldy when there are large numbers of variables.
2. Tabular (Quine-McCluskey): This method is, in general, more effective than the Karnaugh map method, in particular when minimising expressions containing a large number of variables [26]. Minimising Boolean expressions using this method involves two main activities: the identification of prime implicants and the selection of essential prime implicants. Essential prime implicants are the terms that must be present in the final simplified function. The starting point is to list the minterms that define the function; the prime implicants are then found by a matching method, in which each minterm is compared with every other minterm.
When two terms differ in only one variable, that variable is removed and a new term is created which excludes it. This process is repeated for each pair of terms until no further combination is possible. The selection of essential prime implicants is achieved by creating a table containing the prime implicants. This table can then be reduced by removing the essential prime implicants, removing rows that are dominated by others, and removing columns that dominate others. These steps are repeated until no further reduction is possible. The weakness of this method is that the run time grows exponentially with the number of variables [19].
3. Reduced Ordered Binary Decision Diagrams (ROBDDs): This method imposes an order on the variables of a Boolean function, and then represents the function as a graph structure; this provides a canonical non-redundant representation of the Boolean function, given the variable ordering [8].
In this study, the tabular method will be adopted, because of its effectiveness when minimising expressions with large numbers of variables.
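The iterative combination step of the tabular method can be sketched as follows (a simplified illustration, not the implementation used in this study): implicants are written as strings over {0, 1, -}, and two implicants merge when they differ in exactly one non-dash position.

```python
def combine(a, b):
    """Merge two implicants differing in exactly one defined bit,
    e.g. '000' and '001' -> '00-'; return None if they cannot merge."""
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diff) == 1 and a[diff[0]] != '-' and b[diff[0]] != '-':
        i = diff[0]
        return a[:i] + '-' + a[i + 1:]
    return None

def prime_implicants(minterms, nvars):
    """Repeatedly combine terms; any term that never merges is prime."""
    terms = {format(m, '0{}b'.format(nvars)) for m in minterms}
    primes = set()
    while terms:
        merged, used = set(), set()
        for a in terms:
            for b in terms:
                c = combine(a, b)
                if c:
                    merged.add(c)
                    used.update({a, b})
        primes |= terms - used
        terms = merged
    return primes
```

For example, minterms {0, 1, 2, 3} over three variables collapse to the single prime implicant '0--'. This merging pass is where the exponential growth in the number of variables, noted above, arises.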

2.2 Cross-Site Scripting
Cross-Site Scripting (XSS) is a type of attack targeting web applications, ranked by OWASP as one of the top 10 attacks [35]. XSS is typically prevented from being executed through good coding practice, using sanitisation and escaping to prevent untrusted content being interpreted as code [51]. Parser-level isolation provides an alternative, confining user input data during the lifetime of the application [32]. Blacklists are viewed as easy to circumvent, and these other approaches are preferred [51]. Machine learning techniques have been applied to prevent XSS attacks. An early approach [24] evaluates ADTree, SVM, Naive Bayes, and RIPPER classifiers by tracking the symbols that appear in malicious and benign scripts, and achieved precision of up to 92%. Another approach [50] extracts features used much more frequently in malicious scripts than in benign ones, such as DOM-modifying functions and the eval function; this method achieved an accuracy of up to 94.38%. Furthermore, in [30] a number of classifiers were evaluated: SVM with linear and polynomial kernels, k-NN and Random Forest. A k-NN classifier achieved high accuracy results of up to 99.75%, with precision of up to 99.88%. Here the extracted features depend on the occurrence or otherwise of a syntactic element within a script. A neural network classifier was evaluated in [31] to prevent XSS attacks using ensemble and cascading techniques, and the results gave a very high accuracy of up to 99.80% at the base level, where the feature groups are used directly, and 99.89% at the meta level, where the features are the outputs of the base level. As well as in scripting, there is emerging interest in using neural networks to detect malware in executables; for instance, in [36] a recurrent neural network is used to detect malicious executables at execution time with 93% accuracy.


3 Methodology
This section describes the dataset used in the experiments, the approach to selecting features, the training of the classifiers, and the way in which they are evaluated. The aim of this work is to find Boolean functions as rules extracted from classifiers, which can then themselves be used as classifiers, replacing the originals. The approach to extracting a Boolean function from a classifier (neural network, support vector machine or k-nearest neighbour) is given, both for exact rule extraction and for a series of approximations to classifiers.

3.1 Datasets
The current work uses the dataset from [29]. This is primarily the dataset from [31], with the training set augmented with files from CSIC 2010 [14] (152 malicious instances and 3,971 benign instances) in order to cover more types of script and so extract more precise rules. The classifiers being trained are to determine whether or not text entered into a web application represents a cross-site script. The dataset consists of 43,218 files, of which 28,068 are labelled as benign and 15,150 are labelled as malicious. Note that 9,068 of the benign scripts are plain text from [49]. These are divided into a training set of 19,122 instances (5,150 malicious and 13,972 benign) and a holdout testing set of 24,096 instances (10,000 malicious and 14,096 benign). There is no overlap between the training and testing datasets.

3.2 Selected Features
As in [29], the starting point of this work is to abstract the input script file into the 62 features used in [31]. These are divided into two groups: alphanumeric features, a range of keywords and tokens from the target application, and non-alphanumeric features, the full set of non-alphanumeric characters. Rather than working with these features directly as in [31], here the features have been ranked using Algorithm 1 [28]. This method selects the most powerful features by sequential feature selection, minimising the deviance over feature subsets and using a chi-square test to terminate. The deviance is twice the difference between the log likelihood of the candidate model and that of the saturated model, and the inverse of the chi-square value with the appropriate degrees of freedom is used to set the termination tolerance parameter. The application of the ranking algorithm to the feature set shows that only 34 features need be used, and the ranking of these selected features in order of effectiveness is given in Table 1. The key observation is that these features are all Boolean valued, that is, each feature either occurs in the script or it does not, allowing the exploitation of this additional 0/1 valued structure.
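The 0/1-valued abstraction can be pictured as a simple occurrence check (an illustrative sketch, not the feature extractor of [31]; the token list here stands in for the 34 selected features of Table 1):

```python
def boolean_features(script, tokens):
    # Each feature is 1 if the corresponding token occurs in the script,
    # and 0 otherwise, giving a Boolean feature vector.
    return tuple(int(tok in script) for tok in tokens)
```

For example, boolean_features("alert('x')", ["alert", "eval", "<"]) gives (1, 0, 0).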


Algorithm 1. Ranking Features Algorithm.
Input: Original feature set;
Start with empty feature subset;
Feature = Sequential Feature Selection;
while (Deviance > Chi-Square) do
    Feature Subset = add Feature to selected feature subset;
    Feature = Sequential Feature Selection;
end
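The greedy loop of Algorithm 1 can be sketched in Python as follows (the deviance computation is abstracted into a caller-supplied scoring function, and the chi-square termination test is replaced by a plain tolerance, so this illustrates the control flow rather than the exact statistical procedure):

```python
def forward_select(features, deviance, tol):
    """Greedy sequential feature selection (sketch of Algorithm 1).

    `deviance(subset)` scores a candidate subset (lower is better);
    selection stops once adding the best remaining feature no longer
    reduces the deviance by more than `tol` (standing in for the
    chi-square termination test)."""
    selected = []
    current = deviance(selected)
    while True:
        best, best_dev = None, current
        for f in features:
            if f in selected:
                continue
            d = deviance(selected + [f])
            if d < best_dev:
                best, best_dev = f, d
        if best is None or current - best_dev <= tol:
            return selected
        selected.append(best)
        current = best_dev
```

Each iteration adds the single feature giving the biggest improvement, mirroring the while loop in the listing above.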

3.3 Training Classifiers
Using the features from Table 1, a series of classifiers was trained for each of: feed forward neural networks, support vector machines and k-nearest neighbour. The neural network classifiers were built with a single hidden layer of 10 hidden units, and the training function (which updates the weight and bias values) was set to “trainbr”, which minimises a combination of squared errors and weights. The support vector machines were trained using a linear kernel. The k-nearest neighbour classifier was trained with k optimised to be 4. For each of the three machine learning techniques, two classifiers were built: one using all 34 features, which is viewed as the best classifier, the one from which rules

Table 1 Selected features [29]

No.  Feature    No.  Feature
1    Alert      18   %
2    <          19   (<)
3    {          20   @
4    ?          21   Onload
5    !          22   StringfromCharCode
6    JS         23   :
7    File       24   \
8    HTTP       25   ]
9    ’          26   (
10   ;          27   ‘
11   &          28   Img
12   ,          29   >
13   Src        30   ==
14   Space      31   /
15   &#         32   Onerror
16   Eval       33   //
17   .          34   iframe


are to be extracted, and the other using the top 16 features, which will be used for comparison, evaluation and discussion.

3.4 Classifiers and Boolean Functions
Observe that a classifier each of whose input features is Boolean, and whose output is a Boolean value, is precisely equivalent to a Boolean function. Enumerating each possible input and calculating the corresponding output yields the truth table for this Boolean function. Hence, the classifier can be replaced by this Boolean function. The result is a rule based system, each of whose decisions is explainable and auditable. This is particularly attractive when the initial classifier is a neural network, where individual decisions are not explainable. The approach is also attractive when the initial classifier is an SVM, since the Boolean classifier gives more easily understood decisions. When k-NN provides the initial classifier, the classifier is already explainable, but k-NN classifiers suffer in performance, since they grow linearly with the size of the training dataset; the Boolean classifier is of a fixed size, hence may perform better when there is considerable training data. In the current study, the feature set is Boolean, therefore this approach applies. Whilst for a small number of features this rule extraction technique might be applied directly, the number of potential inputs grows exponentially with the number of features, and the problem quickly becomes infeasible. This motivates a sampling based approach.
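For a small feature set, the exact extraction described above is a direct enumeration (a minimal sketch; `classifier` here is any black-box predicate over Boolean inputs):

```python
from itertools import product

def extract_truth_table(classifier, n_features):
    # Enumerate all 2^n Boolean inputs and record the classifier's output
    # for each; the resulting table is the Boolean function the classifier
    # encodes. Only feasible for small n, since the table has 2^n rows.
    return {bits: classifier(bits)
            for bits in product([0, 1], repeat=n_features)}
```

The returned table can then serve as a drop-in replacement for the original classifier over the same Boolean inputs.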

3.5 Sampling
The key classifier in this work is the neural network trained over a feature space with 34 features. This provides an exemplar case for rule extraction where the Boolean function defined is too large to generate from the classifier. Despite this, there is motivation to find a Boolean function that can be used in place of the neural network. With varying motivation, as discussed in the previous section, this also applies to the SVM and k-NN classifiers trained over 34 features. The approach taken is to sample the classifier and to use this sample to build a Boolean function; this Boolean function then provides an approximation of the original function. The idea is to fix a number of features for which producing a Boolean function is feasible (via a truth table in this case) and to determine what value the function should take by interrogating the initial classifier with the full feature set. For example, suppose it is determined that considering 4 features will result in a truth table that can feasibly be constructed. Then the four highest ranking features (in Table 1) provide the entries for the truth table. For a row of the truth table, the values of these features are fixed, and then extended with values for the remaining 30 features to give an input to the classifier,


which is then queried and the result noted. This is done repeatedly, and from the resulting sample the most frequently occurring result becomes the entry in the truth table. Whilst the training dataset is relatively large, with 19,122 scripts, this is still sparse compared to the 2^34 possible inputs to the classifiers considered. This means that whilst the classifiers learn from the training set, the generalisation is not necessarily great enough that every input to the classifiers is equally meaningful. That is, a random sampling extending the fixed values might not give good results, since it might not match the shape of likely inputs. Indeed, this was observed in the development of approximations of neural networks in [29], with inputs holding the default value dominating. In order to counteract this, the extensions were generated from the training set, with a random selection of instances from the training set being selected (with the full 34 features), and these being used for sampling the classifiers, with the fixed features replacing the corresponding feature values. Algorithm 2 specifies the sampling method. Here, the input to the algorithm is L (an integer), the number of fixed features, Classifier, which is a trained classifier (in this case either a neural network, SVM or k-NN trained over 34 features), and Sample, which is a random selection from the training set of inputs to the neural network. A truth table, TT, for the fixed features, with undefined output values, is constructed by buildInitTruthTable. Each row of this truth table is considered in turn. The values of the row of TT are substituted into each element of Sample, leading to an input which is passed to Classifier for classification. If the result is classification as malicious then a counter for malicious instances, malicious_count, is incremented; otherwise, benign_count is incremented.
Once each element of Sample has been considered, a comparison between the two counts is made, and the output column of the truth table TT is populated with 0 if most instances are malicious, and 1 otherwise. In [29], Sample in Algorithm 2 consisted of 1024 inputs chosen at random from the training dataset. As well as repeating this experiment for neural networks, a tactic where a much smaller sample is used is investigated for all classifiers considered. The 1024 inputs are themselves randomly sampled, with just 32 samples chosen. If this gives a clear cut answer (in these experiments, if 25 or more of the samples give the same output) then this answer is used; if not, the sample size is doubled to 64, and if this again fails to give a clear cut answer, the sample is doubled to 128. Using this tactic, this work investigates successive approximations, with a varying number of fixed features: 1, 2, 4, 8, 10, 12 and 16. Fixing the size of Sample to be 32, this work will also investigate how precisely the sampling works, by two measures: firstly, by tabulating, for each row of the truth table, how many samples gave malicious as the output; secondly, by charting, for each test case, the split between samples giving malicious and samples giving benign for the rule that gave the output for the test case.


Algorithm 2. Sampling Method Algorithm.
Input: L ∈ N, Classifier, Sample;
TT = buildInitTruthTable(L);
for row in TT do
    malicious_count = 0;
    benign_count = 0;
    for s in Sample do
        input = substitute(row, s);
        result = Classifier(input);
        if result == malicious then
            malicious_count++;
        else
            benign_count++;
        end
    end
    if malicious_count > benign_count then
        TT[row] = 0;  \\ Malicious
    else
        TT[row] = 1;  \\ Benign
    end
end
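Algorithm 2 translates directly into Python (a sketch under the assumption that the L fixed features occupy the first L positions of the feature vector; 0 encodes malicious and 1 benign, as in the algorithm):

```python
from itertools import product

def sample_truth_table(L, classifier, sample):
    # For each row of the truth table over the L fixed features, substitute
    # the row into every sampled instance, query the classifier, and label
    # the row by majority vote (0 = malicious, 1 = benign).
    tt = {}
    for row in product([0, 1], repeat=L):
        malicious_count = benign_count = 0
        for s in sample:
            full_input = row + tuple(s[L:])  # fixed features replace s[:L]
            if classifier(full_input) == 'malicious':
                malicious_count += 1
            else:
                benign_count += 1
        tt[row] = 0 if malicious_count > benign_count else 1
    return tt
```

The inner loop plays the role of substitute followed by the vote counting in the listing above.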

3.6 Extracting Rules
After labelling all rows in the truth table, each row can be considered to be a rule that describes one class. To give a more succinct set of rules, the Boolean function can be minimised [42], resulting in simplified expressions. The minimised Boolean functions are then evaluated as classifiers. For minimising Boolean functions, “Logic Friday” [37] has been used, which applies the tabular method as its minimisation algorithm.

4 Results
In the experiments, MatLab 2018b was used to build the classifiers, and to find the truth tables based on these classifiers. This was done using various numbers of fixed features: 1, 2, 4, 8, 10, 12, and 16. The extracted truth tables define sets of rules acting as classifiers approximating the original classifier, and these rule sets were then reduced to a more compact representation using “Logic Friday” [37]. Results on trained classifiers and rule extraction are presented in turn for neural networks, support vector machines, and k-nearest neighbour. This is followed by results on how long it takes to perform the rule extraction, and on how clear cut the sampling method for rule extraction is.


4.1 Neural Networks
For neural networks, the first results repeat the experiments from [29] on training neural network classifiers and extracting rules; then a revised set of experiments on rule extraction is given.

4.1.1 Training Neural Network Classifiers

As in [29], a neural network classifier was trained using the full 34 features, and tested using the testing dataset. Table 2 gives the performance of this classifier. Evaluation uses the confusion matrix, leading to Accuracy, Precision, Sensitivity, and Specificity measures. This network is the one from which rules are extracted, leading to a series of approximations to it. In addition, for later comparison, a neural network classifier was trained using just the 16 highest ranked features. Table 3 gives the performance of this classifier. For this network, the Boolean function that the network defines can be precisely extracted. This results in a Boolean classifier whose performance is the same as that of the neural network. Table 4 shows the number of rules that result from constructing the truth table for the 16 features, along with the number of rules that classify scripts as benign after minimisation is applied (hence any script whose features do not match a rule for benign is malicious).

Table 2 Neural network classifier performance using 34 features [29]

Accuracy: 99.88   Precision: 99.98   Sensitivity: 99.75   Specificity: 99.98

Confusion matrix:
             Malicious   Benign
Malicious    9998        2
Benign       25          14071

Table 3 Neural network classifier performance using 16 features [29]

Accuracy: 99.78   Precision: 99.94   Sensitivity: 99.53   Specificity: 99.95

Confusion matrix:
             Malicious   Benign
Malicious    9994        6
Benign       47          14049

Table 4 Classifier rules using 16 features [29]

Class    Malicious   Benign    Minimised
Rules    41,549      23,987    2,560


Table 5 Results for rules extracted from NN using 1, 2, 4, 8, 10, 12, and 16 features [29]

Features      Accuracy   Precision   Sensitivity   Specificity
1 Feature     91.96      80.70       99.92         87.95
2 Features    91.96      80.70       99.92         87.95
4 Features    98.95      97.54       99.92         98.28
8 Features    98.13      95.62       99.87         96.98
10 Features   99.15      98.00       99.96         98.60
12 Features   99.82      99.62       99.96         99.73
16 Features   99.90      99.94       99.82         99.95

Confusion matrices (rows Malicious, Benign; columns Malicious, Benign):

Features      Malicious row   Benign row
1 Feature     8070  1930      6   14090
2 Features    8070  1930      6   14090
4 Features    9754  246       7   14089
8 Features    9562  438       12  14084
10 Features   9800  200       3   14093
12 Features   9962  38        3   14093
16 Features   9994  6         18  14078

4.1.2 Initial Rule Extraction

Rules were extracted from the neural network trained on 34 features by applying the sampling method for each row in the truth table; hence the number of extracted rules is equal to 2^n, where n is the number of fixed features in the approximation, and each row describes one rule. As in [29], 1024 samples were used for each row of the truth table. This process was repeated where the number of fixed features was 1, 2, 4, 8, 10, 12, and 16. Each of these gives an approximation to the neural network, and the purpose of this repetition is to observe the number of rules that are extracted and the accuracy of the results on the testing dataset. Table 5 gives the results of testing the rules extracted from the 34 feature neural network, approximating with 1, 2, 4, 8, 10, 12 and 16 features. Again, the evaluation is given in terms of the confusion matrix, and the Accuracy, Precision, Sensitivity and Specificity measures. Table 6 summarises the number of rules for each class using the various numbers of selected features. The final column gives the number of rules that classify the input as benign after minimisation (hence, any input not matching one of these rules is classified as malicious).


Table 6 Numbers of rules as related to numbers of selected features [29]

Features      Malicious   Benign   Minimised
1 Feature     1           1        1
2 Features    2           2        1
4 Features    7           9        3
8 Features    100         156      29
10 Features   384         640      62
12 Features   1,560       2,536    229
16 Features   39,792      25,744   2,488

Table 7 Timing of rule extraction from the NN classifier with 1024 samples [29]

Features      Interval
1 Feature     18 sec
2 Features    37 sec
4 Features    120 sec
8 Features    390 sec
10 Features   7,846 sec
12 Features   30,598 sec
16 Features   482,618 sec

Also included is Table 7, detailing the time taken to perform the rule extraction from the 34 feature neural network using 1024 samples for each row of the truth table. The point to note here is that the extraction of the 16 feature Boolean approximation of the neural network takes considerable time, more than five days of computation.

4.1.3 Revised Rule Extraction

The time taken to extract Boolean rules with the 16 features using 1024 samples for each row of the truth table motivates investigation of a smaller sample size. Initial experimentation (not included, but used to generate Fig. 1) reduced the sample size to 32. As noted in Sect. 4.5, some decisions based on this reduced number of samples are close, so a tactic is used that expands the number of samples to 64 if fewer than 25 of the 32 samples agree on one value, and again expands the number of samples to 128 if fewer than 50 of the 64 samples agree on one value. Table 8 shows the result of applying the Boolean classifiers extracted from the 34 feature neural network to the testing dataset, where the experiment was conducted using 1, 2, 4, 8, 10, 12, and 16 features. Note that the results are close to those extracted using 1024 instances as samples for decision making.
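The 32/64/128 tactic can be sketched as follows. This is an illustrative reconstruction: `sample_once` is a hypothetical callback returning the classifier's 0/1 decision for one randomly completed input, and the 25-of-32 and 50-of-64 thresholds are those quoted above.

```python
def label_row(sample_once):
    """Label one truth-table row adaptively: draw 32 samples; expand to 64
    if fewer than 25 agree on a value, and to 128 if fewer than 50 of the
    64 agree.  Returns the majority label and the number of samples used."""
    votes = sum(sample_once() for _ in range(32))
    used = 32
    if max(votes, used - votes) < 25:          # close call: expand to 64
        votes += sum(sample_once() for _ in range(32))
        used = 64
        if max(votes, used - votes) < 50:      # still close: expand to 128
            votes += sum(sample_once() for _ in range(64))
            used = 128
    return (1 if 2 * votes >= used else 0), used
```

A clear-cut row costs only 32 samples; only genuinely close rows pay for 128, which is where the speed-up over a flat 1024 samples per row comes from.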


Table 8  Extracted results from NN using 32/64/128 samples

Features      Accuracy  Precision  Sensitivity  Specificity  Confusion matrix [Malicious row; Benign row]
1 Feature     58.49     0          0            58.49        [0, 10000; 0, 14096]
2 Features    98.95     97.60      99.87        98.32        [9760, 240; 12, 14084]
4 Features    98.74     97.08      99.88        97.96        [9708, 292; 11, 14085]
8 Features    96.61     99.61      92.76        99.70        [9961, 39; 777, 13319]
10 Features   98.37     96.60      99.46        97.63        [9660, 340; 52, 14044]
12 Features   99.84     99.75      99.87        99.82        [9975, 25; 13, 14083]
16 Features   99.87     99.86      99.83        99.90        [9986, 14; 17, 14079]

Table 9  Numbers of NN rules using 32/64/128 samples

Features      Malicious  Benign  Minimised
1 Feature     0          2       1
2 Features    3          1       1
4 Features    8          8       4
8 Features    103        153     29
10 Features   382        642     84
12 Features   1,576      2,520   245
16 Features   39,861     25,675  2,766

Table 9 shows the number of Boolean rules extracted for each class, and the number of rules after minimisation.
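The minimisation step that collapses a raw truth table into a compact rule set can be reproduced with an off-the-shelf Boolean minimiser. The sketch below uses SymPy's `SOPform` (a Quine-McCluskey based simplifier) on a toy 3-feature table; it illustrates the kind of minimisation meant, and is not necessarily the tooling used by the authors (the references cite Quine-McCluskey methods and Logic Friday [19, 26, 37]).

```python
from sympy import symbols
from sympy.logic import SOPform

a, b, c = symbols('a b c')

# Toy truth table: the rows labelled 1 (e.g. malicious) of a 3-feature table.
minterms = [[0, 1, 1], [1, 0, 1], [1, 1, 0], [1, 1, 1]]

# Quine-McCluskey style minimisation to a sum-of-products rule set:
# here four table rows collapse to the three two-literal rules of the
# majority function, (a & b) | (a & c) | (b & c).
rule = SOPform([a, b, c], minterms)
print(rule)
```

Each conjunct of the minimised expression is one `if...then` rule; rules over 16 features are produced the same way, just from a much larger table.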


4.2 Support Vector Machines

This section repeats the experiments performed in the previous section with neural networks, but instead using a 34 feature SVM as the base classifier.

4.2.1 Training Support Vector Machine Classifiers

As for neural networks, two SVM classifiers are trained, the first using 34 features, which will be used as the base classifier for rule extraction, and the second using 16 features for comparison purposes. The performance of the SVM trained on the full 34 features is given in Table 10. Table 11 gives the performance of the SVM trained over 16 features. Again, the Boolean function that this classifier defines can be extracted precisely, and the number of rules that this gives can be seen in Table 12.
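For reference, the four measures reported throughout these tables can be computed from confusion matrix counts as follows. This is a generic sketch: `tp`, `fn`, `fp` and `tn` denote true/false positives and negatives with malicious as the positive class, and mapping the printed matrices onto these counts depends on the tables' row/column convention.

```python
def metrics(tp, fn, fp, tn):
    """Standard confusion-matrix measures, in percent."""
    total = tp + fn + fp + tn
    return {
        "accuracy":    100 * (tp + tn) / total,
        "precision":   100 * tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": 100 * tp / (tp + fn) if tp + fn else 0.0,
        "specificity": 100 * tn / (tn + fp),
    }
```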

4.2.2 Rule Extraction

Boolean rule extraction is performed as for neural networks. The 34 feature SVM classifier is sampled using the 32/64/128 sample tactic and the results are presented in Table 13. The number of rules extracted is given in Table 14.

Table 10  SVM classifier performance using 34 features

Accuracy  Precision  Sensitivity  Specificity  Confusion matrix [Malicious row; Benign row]
99.90     99.90      99.88        99.92        [9990, 12; 10, 14087]

Table 11  SVM classifier performance using 16 features

Accuracy  Precision  Sensitivity  Specificity  Confusion matrix [Malicious row; Benign row]
99.90     99.87      99.89        99.90        [9987, 11; 13, 14085]

Table 12  SVM classifier rules using 16 features

Class      Rules
Malicious  33,589
Benign     31,947
Minimised  8,043


Table 13  Results for rules extracted from SVM using 32/64/128 samples

Features      Accuracy  Precision  Sensitivity  Specificity  Confusion matrix [Malicious row; Benign row]
1 Feature     91.96     80.70      99.92        87.95        [8070, 1930; 6, 14090]
2 Features    91.96     80.70      99.92        87.95        [8070, 1930; 6, 14090]
4 Features    98.74     97.03      99.93        97.93        [9703, 297; 6, 14091]
8 Features    98.24     95.80      99.97        97.10        [9580, 420; 2, 14094]
10 Features   99.18     98.07      99.96        98.64        [9807, 193; 3, 14093]
12 Features   99.09     97.85      99.97        98.49        [9785, 215; 2, 14094]
16 Features   99.90     99.88      99.89        99.91        [9988, 12; 11, 14085]

Table 14  Numbers of SVM rules using 32/64/128 samples

Features      Malicious  Benign  Minimised
1 Feature     1          1       1
2 Features    2          2       1
4 Features    8          8       4
8 Features    79         177     35
10 Features   327        697     129
12 Features   1,193      2,903   448
16 Features   35,780     29,756  5,884

4.3 k-NN

Again, the same set of experiments was conducted for k-NN. Classifiers are built, then rule extraction is performed, with the resulting classifiers evaluated.

4.3.1 Training k-NN Classifiers

A k-NN classifier was trained using the same 34 features as for the NN and SVM classifiers. Here (unlike in previous work [30]) the k parameter was optimised to be 4. Table 15 shows the performance results for k-NN when using 34 features. Again, a k-NN classifier trained using 16 features was also developed, and the results of testing this can be seen in Table 16. As for the other base classifiers, the Boolean function that this classifier defines can be extracted, and the number of rules that this gives can be seen in Table 17.

4.3.2 Rule Extraction

As in the previous sections, Boolean rule extraction was performed by sampling the 34 feature k-NN classifier using the 32/64/128 sample tactic. The results are presented in Table 18. The number of rules extracted is given in Table 19.

4.4 Timings

A major motivation for investigating the number of samples required to extract good Boolean rules from other classifiers is the time taken by the rule extraction in [29]. In this section, the time taken for rule extraction using the 32/64/128 tactic is given for NN, SVM and k-NN. These timings are given in Table 20.

Table 15  k-NN classifier performance using 34 features

Accuracy  Precision  Sensitivity  Specificity  Confusion matrix [Malicious row; Benign row]
99.68     100        99.25        100          [1000, 75; 0, 14021]

Table 16  k-NN classifier performance using 16 features

Accuracy  Precision  Sensitivity  Specificity  Confusion matrix [Malicious row; Benign row]
99.83     99.99      99.61        99.99        [9999, 39; 1, 14057]

Table 17  k-NN classifier rules using 16 features

Class      Rules
Malicious  47,244
Benign     18,292
Minimised  1,598


Table 18  Results for rules extracted from k-NN using 32/64/128 samples

Features      Accuracy  Precision  Sensitivity  Specificity  Confusion matrix [Malicious row; Benign row]
1 Feature     58.49     0          0            58.49        [0, 10000; 0, 14096]
2 Features    98.95     97.60      99.87        98.32        [9760, 240; 12, 14084]
4 Features    98.74     97.08      99.88        97.96        [9708, 292; 11, 14085]
8 Features    99.48     98.25      92.78        99.44        [9925, 75; 772, 13324]
10 Features   96.50     96.91      94.78        97.77        [9691, 309; 533, 13563]
12 Features   99.72     99.79      99.54        99.85        [9979, 21; 46, 14050]
16 Features   97.06     99.95      93.43        99.96        [9995, 5; 702, 13394]

Table 19  Numbers of k-NN rules using 32/64/128 samples

Features      Malicious  Benign  Minimised
1 Feature     1          1       1
2 Features    3          1       1
4 Features    7          9       5
8 Features    151        105     32
10 Features   593        431     105
12 Features   3,334      762     94
16 Features   44,120     21,416  2,843

Table 20  Timing of rule extraction from NN/SVM/k-NN classifiers using 32/64/128 samples

Features      NN          SVM        k-NN
1 Feature     1 sec       1 sec      1 sec
2 Features    2 sec       1 sec      1 sec
4 Features    3 sec       1 sec      6 sec
8 Features    57 sec      9 sec      101 sec
10 Features   212 sec     36 sec     419 sec
12 Features   755 sec     135 sec    2,016 sec
16 Features   10,645 sec  3,720 sec  23,271 sec

4.5 Labelling via Sampling

The core of the rule extraction methods presented in this paper is sampling a classifier to give an assignment of a Boolean value to a row in a truth table. This section investigates the sampling and how clean a division is achieved. Figures 1, 2 and 3 present results on this for NN, SVM and k-NN respectively. Each figure gives two results, developed by considering rule extraction for 16 features using a fixed sample size of 32 (note that this differs from the 32/64/128 tactic used in the earlier results, although the performance of the extracted classifiers is not dramatically different). The first result (in blue) details how many rows of the truth table (with 2^16 rows) result from each number of cases; that is, Iterations gives the number of rows with a split in which Cases is the number of samples giving malicious. The second result (in orange) gives the number of testing instances that are assigned from a row of the truth table whose value was determined by a sample with Cases being the number of samples giving malicious (for example, if a test instance is determined to be malicious, Cases is the number of samples that gave malicious, say 30).

Fig. 1 Number of occurrences of the samples in a NN
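The "blue line" computation described above can be sketched as follows, over a toy feature space. The helper name is hypothetical, and `classifier` here receives only the fixed row and is responsible for its own random completion of the free features.

```python
import itertools
from collections import Counter

def split_histogram(classifier, n_features, samples=32):
    """For each truth-table row, count how many of `samples` draws the
    classifier labels malicious (Cases), then histogram the rows by
    that count."""
    hist = Counter()
    for row in itertools.product([0, 1], repeat=n_features):
        cases = sum(classifier(row) for _ in range(samples))
        hist[cases] += 1
    return hist
```

A clean-cut classifier piles all rows at Cases = 0 or Cases = samples; mass between the extremes corresponds to the split decisions discussed below.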


Fig. 2 Number of occurrences of the samples in a SVM

Fig. 3 Number of occurrences of the samples in a k-NN



5 Discussion

As has already been established in [30, 31], a variety of machine learning classifiers can be applied to the XSS detection problem, giving very good results. This is confirmed here in Tables 2, 10 and 15, which give performance results for NN, SVM and k-NN classifiers trained using the 34 highest ranked features. These results are accompanied by Tables 3, 11 and 16, giving the results when training NN, SVM and k-NN classifiers using only the 16 highest ranked features; these again give good results, if (particularly for NN) not quite as good as for the higher number of features. Each of the classifiers above defines a Boolean function, which might be used to replace the original classifier. For the 16 feature classifiers, this Boolean function can be calculated directly, but for the 34 feature classifiers this is computationally intractable. The first question posed by this paper is whether it is possible to find useful approximations to the Boolean functions described by the higher dimensional classifiers. The existence of the 16 feature Boolean classifiers suggests that it is. Table 5 repeats the experiments from [29], using the sampling based approach to find a series of Boolean classifiers using an increasing number of features to approximate the 34 feature NN classifier. The performance of the approximating Boolean classifier using 16 features matches (in fact, slightly betters) that of the neural network that it is modelling, with 99.90% accuracy and 99.94% precision, demonstrating that rule extraction has been successfully accomplished. For comparison, the extracted rule-based Boolean function classifier in Table 5 performs slightly better than those for the 16-feature neural network in Table 3. These results show that rule extraction works, but the sampling used, where 1024 samples were used to generate each row of the truth table for the Boolean function, is slow.
As can be seen in Table 7, the time taken to build the Boolean functions increases exponentially, with the best approximation, using 16 features, taking more than five days of computation. The second question posed by this paper is whether the number of samples can be reduced (speeding up the approximation process) whilst maintaining precision. This is answered by Table 8, where a small number (32) of samples is used to determine the value of a row in the truth table, increasing this (to 64, then 128 samples) only when the sample does not give a clear cut answer. The results show that the performance of the resulting Boolean classifiers is comparable to that of the Boolean classifiers extracted using a larger number of samples, hence also comparable to that of the underlying neural network when using 16 features. Table 20 shows that this approach is indeed faster, taking about three hours of computation. The third question is whether this approach works for classifiers developed using other machine learning techniques. This work considers: (i) support vector machines, which to some extent share the problem of neural networks in that the output of learning is a function which is hard to interpret; (ii) k-nearest neighbour, where the interpretability of the output is clear, but where the output model is large, being essentially a representation of the training data. Tables 13 and 18 give the results of repeating the approach to approximating neural network classifiers using instead SVM and k-NN respectively. The results confirm that the approach works for other classifiers.


The performance of the Boolean functions extracted from the 34 feature SVM is excellent, with the 16 feature Boolean classifier working as well as the underlying base SVM. The performance of the Boolean functions extracted from the 34 feature k-NN classifier, whilst over 90%, is not quite as strong as for the NN and SVM classifiers. In particular, the 16 feature Boolean function performs noticeably less well than the k-NN classifier it is approximating (and than the 12 feature Boolean function). The authors believe that this is because NN and SVM classifiers are learning functions that discriminate between the features, weighting some more heavily than others; in contrast, k-NN classifiers simply measure distance from the training data, with no feature being more important than others. When performing rule extraction, a series of approximations using an increasing number of features has been built and evaluated in Tables 8, 13 and 18, as well as Table 5. The numbers of rules both before and after minimisation are given in Tables 6, 9, 14, and 19. As might be expected, as the number of features increases, the number of rules (after minimisation) increases too, and the performance of the resulting classifiers improves. The improvements are not necessarily monotonic, but the pattern is clear. Comparing against the Boolean functions extracted from the 16 feature classifiers, the number of minimised rules is comparable (except for SVM). The fourth question addressed is how well the sampling method gives Boolean values; that is, do all the samples give the same value, or do the samples give a split decision? This is considered in Figs. 1, 2 and 3. The first result (the blue line) plots the sample split (Cases) against the number of rows of the 16 feature truth table which derive from this split. As can be seen in Fig. 2, SVM gives the cleanest split, with the value for most rows deriving from a sample in which only a handful of sample instances do not agree with the final label. Figure 1 shows a similar pattern for the neural network, although this is shallower, with more cases where the numbers of samples for each value are close to each other. Figure 3, however, gives a different pattern for k-NN, with fewer clean cut cases, hinting at why rule extraction works less well. The second result (the orange line) plots the number of tests that were labelled using a rule whose split is given by Cases. Here it can be seen why the approach works so well. For test data (which is not a chimera of a row of the truth table and a sample from the training data) almost all of the values assigned come from rules whose value comes from a clean cut sample. This applies to all three base classifiers. That is, the doubt in the sampling comes from examples which do not occur in practice. The final question is what level of explainable AI has been extracted from classifiers in the form of the Boolean rule-based systems. The approximations described in this work give classifiers whose reasoning can be described, allowing decision making to be auditable. The successive approximations show that relatively good performance can be achieved with the use of only a small number of features. That the sampling approach gives approximations with some degree of noise is illustrated across the tabled results, where anomalous cases can be found. For example, in the 8 feature case in Table 5, the introduction of feature 7, URL addresses, leads to some additional misclassifications compared to the coarser 4 feature classifier. It should also be noted that the very coarse 1 and 2 feature classifiers still give useful results, with all the 2 feature classifiers giving over 90% precision. The reason for this


result is that the highest ranking feature is the use of "Alert" within the script: a high proportion of attacks in the database use this, whilst it is rarely used in benign scripts. This first feature is very powerful. This observation (whilst not surprising to the authors) is a good illustration of XAI in action, where the rule-based system has made the explanation explicit. The best approximation still requires thousands of rules even after minimisation. It is not clear that each individual decision can be interpreted by a human user in the context of the larger number of rules. However, it should be noted that these rules are simply the flattening of a single binary decision tree over 16 features, and it seems unlikely that a smaller data structure can be used successfully as a classifier. In either view, the extracted rules mean that decision making is always auditable: the reasoning for any decision can be traced. As noted in the methodology, the current approach requires a double use of the training set, firstly to train the classifiers, and secondly to guide the sampling approach used in the approximation of the classifiers by Boolean functions. However, given the size of the Boolean functions described by the trained classifiers, some kind of guidance seems inevitable in a black box approach to approximation. The black box approach has worked, successfully extracting rules in the form of if...then...else statements that distinguish malicious and benign scripts, without delving deeper into the inner structure of the classifiers.

6 Conclusion

This paper develops the approach to rule extraction first described in [29]. It considers machine learning for classification problems where the feature space is Boolean, and the classes are also Boolean. The example in this work is the classification of JavaScript as malicious (an XSS attack) or benign, where the actual script is abstracted to a Boolean feature set. The rule extraction first finds a Boolean function, as a truth table, describing the classifier, then simplifies this. The Boolean function might be an exact description of the classifier, or an approximation built using a sampling technique. Rule extraction is successfully demonstrated for three kinds of classifier trained to detect XSS attacks: neural networks, support vector machines and k-nearest neighbour. Approximations to the full 34 feature classifiers were considered at different levels of granularity. The most precise approximations are over 16 features. This work shows how the sampling technique first suggested in [29] can be adapted to work with a reduced number of samples, producing approximations much more quickly. Using this new, faster approach to sampling, the 16 feature approximation to the original neural network gives 99.87% accuracy and 99.86% precision. These results are as good as those for the initial classifier and can be computed relatively quickly, extending and improving upon the results in [29]. Similar results are obtained for SVM, and good, though less reliable, results for k-NN.


The number of rules extracted grows with the number of features used in the approximation. As discussed in Sect. 5, this means that these rules are auditable (the reasoning for any given classification can easily be looked up and interpreted) and essentially form a 16 feature binary decision tree which, while relatively large, should be seen as an explainable classifier. Future work is to investigate how this approach might be generalised to features which are not Boolean valued, by piecewise approximation or otherwise. Alternative ways of computing the rules (perhaps using BDDs), and further approximation to give more compact rule sets, will also be explored. In conclusion, XAI principles have been followed to give a procedure that explains black box classifiers as a set of Boolean rules which can be understood by human users, leading to successful rule extraction.

References

1. Ardiansyah, S., Majid, M.A., Zain, J.M.: Knowledge of extraction from trained neural network by using decision tree. In: 2nd International Conference on Science in Information Technology (ICSITech), pp. 220–225. IEEE (2016)
2. Augasta, M.G., Kathirvalavakumar, T.: Rule extraction from neural networks—a comparative study. In: International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME-2012), pp. 404–408. IEEE (2012)
3. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE 10(7), e0130140 (2015)
4. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.R.: How to Explain Individual Classification Decisions. Journal of Machine Learning Research 11, 1803–1831 (2010)
5. Baesens, B., Martens, D., Setiono, R., Zurada, J.M.: Guest Editorial White Box Nonlinear Prediction Models. IEEE Transactions on Neural Networks 22(12), 2406–2408 (2011)
6. Bondarenko, A., Aleksejeva, L., Jumutc, V., Borisov, A.: Classification Tree Extraction from Trained Artificial Neural Networks. Procedia Computer Science 104, 556–563 (2017)
7. Breiman, L., et al.: Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science 16(3), 199–231 (2001)
8. Bryant, R.E.: Symbolic Boolean Manipulation with Ordered Binary Decision Diagrams. ACM Computing Surveys 24, 293–318 (1992)
9. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In: Empirical Methods in Natural Language Processing, pp. 1724–1734. Association for Computational Linguistics (2014)
10. Craven, M.W., Shavlik, J.W.: Using Sampling and Queries to Extract Rules from Trained Neural Networks. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 37–45. Morgan Kaufmann (1994)
11. Craven, M.W., Shavlik, J.W.: Extracting tree-structured representations of trained networks. In: Advances in Neural Information Processing Systems, pp. 24–30. MIT Press (1996)
12. Dancey, D., McLean, D.A., Bandar, Z.A.: Decision tree extraction from trained neural networks. In: Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, pp. 515–519. AAAI Press (2004)


13. Etchells, T.A., Lisboa, P.J.: Orthogonal search-based rule extraction (OSRE) for trained neural networks: a practical and efficient approach. IEEE Transactions on Neural Networks 17(2), 374–384 (2006)
14. Giménez, C.T., Villegas, A.P., Marañón, G.Á.: HTTP data set CSIC 2010. Information Security Institute of CSIC (Spanish Research National Council) (2010)
15. Gunning, D.: Explainable Artificial Intelligence (XAI). Tech. Rep. DARPA/I20, Defense Advanced Research Projects Agency (2016)
16. Hailesilassie, T.: Rule extraction algorithm for deep neural networks: A review. arXiv preprint arXiv:1610.05267 (2016)
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778. IEEE (2016)
18. Huysmans, J., Baesens, B., Vanthienen, J.: Using rule extraction to improve the comprehensibility of predictive models. Tech. Rep. KBI 0612, Katholieke Universiteit Leuven (2006)
19. Jain, T.K., Kushwaha, D.S., Misra, A.K.: Optimization of the Quine-McCluskey Method for the Minimization of the Boolean Expressions. In: Fourth International Conference on Autonomic and Autonomous Systems, pp. 165–168. IEEE Computer Society (2008)
20. Julian, K.D., Lopez, J., Brush, J.S., Owen, M.P., Kochenderfer, M.J.: Policy compression for aircraft collision avoidance systems. In: Proceedings of the 35th Digital Avionics Systems Conference, pp. 1–10. IEEE Press (2016)
21. Karnaugh, M.: The map method for synthesis of combinational logic circuits. Transactions of the American Institute of Electrical Engineers, Part I: Communication and Electronics 72, 593–599 (1953)
22. Katz, G., Barrett, C., Dill, D., Julian, K., Kochenderfer, M.: Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In: Computer Aided Verification. Lecture Notes in Computer Science, vol. 10426, pp. 97–117. Springer (2017)
23. Keedwell, E., Narayanan, A., Savic, D.: Creating rules from trained neural networks using genetic algorithms. International Journal of Computers, System and Signal 1(1), 30–42 (2000)
24. Likarish, P., Jung, E., Jo, I.: Obfuscated malicious Javascript detection using classification techniques. In: Malicious and Unwanted Software (MALWARE), pp. 47–54. IEEE (2009)
25. Mahendran, A., Vedaldi, A.: Understanding Deep Image Representations by Inverting Them. In: Computer Vision and Pattern Recognition, pp. 5188–5196. IEEE (2015)
26. Manojlovic, V.: Minimization of Switching Functions using Quine-McCluskey Method. International Journal of Computer Applications 82(4), 12–16 (2013)
27. Martens, D., Baesens, B., Van Gestel, T.: Decompositional Rule Extraction from Support Vector Machines by Active Learning. IEEE Transactions on Knowledge and Data Engineering 21(2), 178–191 (2009)
28. MathWorks: Feature selection. https://uk.mathworks.com/help/stats/feature-selection.html (2019). Accessed: 11/3/2019
29. Mereani, F.A., Howe, J.M.: Exact and Approximate Rule Extraction from Neural Networks with Boolean Features. In: Proceedings of the 11th International Joint Conference on Computational Intelligence, pp. 424–433. ScitePress (2019)
30. Mereani, F.A., Howe, J.M.: Detecting Cross-Site Scripting Attacks Using Machine Learning. In: International Conference on Advanced Intelligent Systems and Computing, vol. 723, pp. 200–210. Springer (2018)
31. Mereani, F.A., Howe, J.M.: Preventing Cross-Site Scripting Attacks by Combining Classifiers. In: Proceedings of the 10th International Joint Conference on Computational Intelligence - Volume 1, pp. 135–143. ScitePress (2018)
32. Nadji, Y., Saxena, P., Song, D.: Document Structure Integrity: A Robust Basis for Cross-site Scripting Defense. In: Network and Distributed System Security Symposium. Internet Society (2009)
33. Nair, V., Hinton, G.E.: Rectified linear units improve Restricted Boltzmann Machines. In: Proceedings of the 27th International Conference on Machine Learning, pp. 807–814. Omnipress (2010)


34. Nguyen, A., Yosinski, J., Clune, J.: Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned by Each Neuron in Deep Neural Networks. arXiv preprint arXiv:1602.03616 (2016)
35. OWASP Top 10 - 2017 rc1 (2017). https://www.owasp.org. Accessed: 7/6/2017
36. Rhode, M., Burnap, P., Jones, K.: Early Stage Malware Prediction Using Recurrent Neural Networks. Computers and Security 77, 578–594 (2017)
37. Rickmann, S.: Logic Friday (version 1.1.4) (2012). https://web.archive.org/web/20131022021257, http://www.sontrak.com/. Accessed 24 Nov 2018
38. Rudell, R.L.: Multiple-valued logic minimization for PLA synthesis. Tech. Rep. UCB/ERL M86/65, University of California, Berkeley (1986)
39. Saad, E.W., Wunsch II, D.C.: Neural network explanation using inversion. Neural Networks 20(1), 78–93 (2007)
40. Saito, K., Nakano, R.: Medical diagnostic expert system based on PDP model. In: Proceedings of IEEE International Conference on Neural Networks, vol. 1, pp. 255–262 (1988)
41. Schmitz, G.P., Aldrich, C., Gouws, F.S.: ANN-DT: an algorithm for extraction of decision trees from artificial neural networks. IEEE Transactions on Neural Networks 10(6), 1392–1401 (1999)
42. Schwender, H.: Minimization of Boolean Expressions using Matrix Algebra. Tech. Rep., Sonderforschungsbereich 475, Komplexitätsreduktion in Multivariaten Datenstrukturen, Universität Dortmund (2007)
43. Setiono, R., Liu, H.: Understanding Neural Networks via Rule Extraction. In: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, vol. 1, pp. 480–485. Morgan Kaufmann (1995)
44. Setiono, R., Liu, H.: NeuroLinear: From neural networks to oblique decision rules. Neurocomputing 17(1), 1–24 (1997)
45. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: International Conference on Learning Representations (2015)
46. Taha, I., Ghosh, J.: Three techniques for extracting rules from feedforward networks. In: Intelligent Engineering Systems Through Artificial Neural Networks, pp. 23–28. ASME Press (1996)
47. Thrun, S.B.: Extracting Provably Correct Rules from Artificial Neural Networks. Tech. Rep., University of Bonn (1993)
48. Tsukimoto, H.: Extracting rules from trained neural networks. IEEE Transactions on Neural Networks 11(2), 377–389 (2000)
49. Wang, H., Lu, Y., Zhai, C.: Latent Aspect Rating Analysis Without Aspect Keyword Supervision. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 618–626. ACM (2011)
50. Wang, W.H., Lv, Y.J., Chen, H.B., Fang, Z.L.: A Static Malicious JavaScript Detection Using SVM. In: Proceedings of the International Conference on Computer Science and Electronics Engineering, vol. 40, pp. 21–30 (2013)
51. Weinberger, J., Saxena, P., Akhawe, D., Finifter, M., Shin, R., Song, D.: A Systematic Analysis of XSS Sanitization in Web Application Frameworks. In: European Symposium on Research in Computer Security. Lecture Notes in Computer Science, vol. 6879, pp. 150–171. Springer (2011)

Introduction to Sequential Heteroscedastic Probabilistic Neural Networks Ali Mahmoudi , Reza Askari Moghadam , and Kurosh Madani

Abstract This paper is dedicated to analyzing the performance of an online classification algorithm called the sequential heteroscedastic probabilistic neural network (SHPNN). This algorithm is a variant of probabilistic neural networks (PNNs). It has the advantage of being structurally flexible, so that its structure can match the complexity of the data space. Another distinctive feature of this algorithm is that it achieves roughly the same level of accuracy as its counterparts while maintaining an acceptable speed in the training phase. Perhaps its most important quality, however, is that the SHPNN's structure contains valuable information about the underlying statistics of the data. In this paper the derivation of the SHPNN formulation is presented and discussed. The analysis includes a comparison between the performance of the SHPNN and other similar algorithms. Furthermore, the interpretability of this network is analyzed by visualizing the learning process.

Keywords Online classification · Probabilistic neural network · Sequential learning · Machine learning

A. Mahmoudi · R. A. Moghadam (B) Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran e-mail: [email protected] A. Mahmoudi e-mail: [email protected] K. Madani LISSI Lab, Senart-FB Institute of Technology, University Paris Est-Creteil (UPEC), Lieusaint, France e-mail: [email protected] © Springer Nature Switzerland AG 2021 J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 922, https://doi.org/10.1007/978-3-030-70594-7_16

1 Introduction

An important goal in the areas of human-robot interaction and cognitive robotics is to design learning algorithms that resemble the human learning process. This need arises from the fact that in such scenarios the robot has to interact with a human tutor in its learning schema, and for this whole process to be successful, the learning algorithm

387

388

A. Mahmoudi et al.

must mimic the learning process of a human. The exact mechanism responsible for human learning is still a mystery from a neuroscientific standpoint. Nevertheless, some aspects of learning in humans are known to us. One is the ability to discern different concepts in real time and to update present knowledge based on even a single observation. Another is that the human brain can reconfigure the connections between neurons to allow new ideas to be learned. Interestingly, most of the machine learning algorithms introduced in the past few years do not share these properties of human learning [3].

The branch of machine learning that works according to the above description of human learning is known as online supervised learning. Stochastic gradient descent back-propagation has the potential to be used for sequential learning [5]. Several variants of radial basis function (RBF) feedforward neural networks are designed for online learning applications; RAN [9], MRAN [16], and GAP-RBF [4] are among the most prominent algorithms implementing RBF nodes. The online variant of extreme learning machines (ELM), known as the online sequential extreme learning machine (OS-ELM), is regarded as one of the best existing algorithms for online supervised learning due to its superior accuracy and speed compared with its predecessors [7]. A variant of OS-ELM capable of learning a new class during training is the progressive learning technique (PLT) [11]. Another variant is the structure-adjustable online extreme learning machine (SAO-ELM), which can add hidden-layer nodes during training [6]. There are also several works on online supervised learning using spiking neural networks [12, 13]. In a previous work, the authors presented an algorithm called the sequential heteroscedastic probabilistic neural network (SHPNN).
This algorithm is capable of learning in real time from a sequence of data and adjusting its structure as necessary [8]. SHPNN is based on the robust heteroscedastic probabilistic neural network (RHPNN) presented by Yang and Chen [14]. That algorithm relies on the expectation maximization (EM) algorithm to obtain the maximum likelihood (ML) estimate of the model. In addition, it uses a statistical method known as the Jack-knife to overcome numerical instabilities. Applications of RHPNN range from cloud service selection to analog circuit fault detection and MEMS device fault detection [1, 10, 15].

The aim of this paper is to take a deeper dive into the details of the SHPNN formulation and to present a more detailed and comprehensive analysis of its performance on different datasets. In addition, visualization of decision making by SHPNN is addressed.

The organization of this paper is as follows. Section 2 reviews the structure and formulation of RHPNN. Section 3 elaborates the derivation of the SHPNN formulation. Section 4 summarizes the steps necessary for SHPNN to realize sequential learning. Section 5 presents the results of comparing the proposed algorithm with other similar algorithms and peeks into its inner workings. Finally, Section 6 concludes the paper.


2 A Review of RHPNN

Assume the d-dimensional pattern vector to be x ∈ R^d with its target class j ∈ {1, ..., K}, where K is the total number of classes. A classifier can be defined as a mapping g : R^d → {1, ..., K} between pattern vectors x and their corresponding class indexes g(x). With the class-conditional probability density function (PDF) of class j denoted f_j and its a priori probability denoted α_j, the Bayes classifier minimizing the misclassification error can be written as

$$g_{\text{Bayes}}(x) = \arg\max_{1 \le j \le K} \{\alpha_j f_j(x)\} \tag{1}$$

Probabilistic neural networks (PNNs) are four-layered feedforward neural networks that can realize or approximate the optimal classifier described in (1). By using a mixture of Parzen windows or Gaussian kernel functions, PNNs can estimate the class-conditional PDFs and realize the Bayes classifier in (1). The first layer receives the input patterns and is therefore called the input layer. The next layer contains K groups of nodes, where each node is a Gaussian basis function defined as

$$p_{i,j}(x) = \frac{1}{\left(2\pi \sigma_{i,j}^2\right)^{d/2}} \exp\left(-\frac{\big\|x - c_{i,j}\big\|^2}{2\sigma_{i,j}^2}\right) \tag{2}$$

In (2), σ²_{i,j} is the variance parameter and c_{i,j} ∈ R^d is the center of the basis function. With M_j the number of nodes of class j, the total number of nodes in the second layer is

$$M = \sum_{j=1}^{K} M_j \tag{3}$$

In the third layer there is one node per class, which estimates the class-conditional PDF f_j as a mixture of Gaussian kernels:

$$f_j(x) = \sum_{i=1}^{M_j} \beta_{i,j}\, p_{i,j}(x), \quad 1 \le j \le K \tag{4}$$

The β_{i,j} are positive mixing coefficients subject to the constraint

$$\sum_{i=1}^{M_j} \beta_{i,j} = 1, \quad 1 \le j \le K \tag{5}$$


The fourth and final layer of a PNN makes decisions according to the classifier in (1). Equiprobability of classes is a necessary assumption here, since the class a priori probabilities cannot be estimated solely from the training data. Therefore, the following is assumed:

$$\alpha_j = \frac{1}{K}, \quad 1 \le j \le K \tag{6}$$
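As an illustration of the PNN forward pass in Eqs. (1), (2), (4), and (6), the decision rule can be sketched in a few lines of NumPy. This is a minimal sketch under equiprobable classes, not the authors' implementation; all function and variable names are ours.

```python
import numpy as np

def gaussian_kernel(x, c, var):
    """Isotropic Gaussian basis function, Eq. (2)."""
    d = x.shape[0]
    sq_dist = np.sum((x - c) ** 2)
    return np.exp(-sq_dist / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def pnn_classify(x, centers, variances, mixings):
    """Bayes decision under equiprobable classes, Eqs. (1), (4), (6).
    centers[j], variances[j], mixings[j] hold the M_j kernels of class j."""
    scores = []
    for c_j, v_j, b_j in zip(centers, variances, mixings):
        # class-conditional PDF f_j(x) as a Gaussian mixture, Eq. (4)
        f_j = sum(b * gaussian_kernel(x, c, v) for c, v, b in zip(c_j, v_j, b_j))
        scores.append(f_j)
    return int(np.argmax(scores))  # argmax over classes, Eq. (1)
```

A pattern is simply assigned to the class whose mixture density is largest at that point; with α_j = 1/K the priors drop out of the argmax.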

The RHPNN implements the EM algorithm to determine the ML estimate of the model. Separating the training patterns into K labeled subsets, we have

$$\{x_n\}_{n=1}^{N} = \left\{\{x_{n,j}\}_{n=1}^{N_j}\right\}_{j=1}^{K} \tag{7}$$

in which

$$\sum_{j=1}^{K} N_j = N \tag{8}$$

With N being the total number of training patterns and N_j the number of training patterns with target class j, and under the equiprobability assumption, the log posterior likelihood function of the training set can be written as

$$\log L_f = \sum_{j=1}^{K} \sum_{n=1}^{N_j} \log f_j(x_{n,j}) \tag{9}$$

After forming the Lagrangian, iterative formulas for calculating the network parameters are obtained. Denoting the previous values of β_{m,i}, σ²_{m,i}, and c_{m,i} by β_{m,i}^{(k)}, (σ²_{m,i})^{(k)}, and c_{m,i}^{(k)} respectively, with 1 ≤ m ≤ M, 1 ≤ n ≤ N, and 1 ≤ i ≤ K, the weights can be written as

$$w_{m,i}^{(k)}(x_{n,i}) = \frac{\beta_{m,i}^{(k)}\, p_{m,i}^{(k)}(x_{n,i})}{\sum_{l=1}^{M_i} \beta_{l,i}^{(k)}\, p_{l,i}^{(k)}(x_{n,i})} \tag{10}$$

where

$$p_{l,i}^{(k)}(x_{n,i}) = \frac{1}{\left(2\pi (\sigma_{l,i}^2)^{(k)}\right)^{d/2}} \exp\left(-\frac{\big\|x_{n,i} - c_{l,i}^{(k)}\big\|^2}{2 (\sigma_{l,i}^2)^{(k)}}\right) \tag{11}$$

With the weights at hand, the network parameters can be updated:

$$c_{m,i}^{(k+1)} = \frac{\sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})\, x_{n,i}}{\sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})} \tag{12}$$

$$(\sigma_{m,i}^2)^{(k+1)} = \frac{\sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})\, \big\|x_{n,i} - c_{m,i}^{(k)}\big\|^2}{d \sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})} \tag{13}$$

$$\beta_{m,i}^{(k+1)} = \frac{1}{N_i} \sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i}) \tag{14}$$

To address the numerical difficulties and to make the algorithm unbiased, RHPNN applies the Jack-knife estimator to the relations derived above.
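One batch EM iteration, Eqs. (10)-(14), for the kernels of a single class can be sketched in NumPy as follows. This is a minimal sketch under our own naming; the Jack-knife refinement is omitted.

```python
import numpy as np

def em_step(X, centers, variances, mixings):
    """One EM iteration for the kernels of one class, Eqs. (10)-(14).
    X: (N_i, d) patterns of class i; centers: (M_i, d);
    variances, mixings: (M_i,)."""
    N, d = X.shape
    # E-step: kernel densities p[m, n] (Eq. 11) and responsibilities w[m, n] (Eq. 10)
    diff = X[None, :, :] - centers[:, None, :]          # (M_i, N_i, d)
    sq = np.sum(diff ** 2, axis=2)                      # squared distances to old centers
    p = np.exp(-sq / (2 * variances[:, None])) \
        / (2 * np.pi * variances[:, None]) ** (d / 2)
    w = mixings[:, None] * p
    w /= w.sum(axis=0, keepdims=True)
    # M-step: Eqs. (12)-(14); note Eq. (13) uses distances to the OLD centers
    sw = w.sum(axis=1)                                  # (M_i,)
    new_centers = (w @ X) / sw[:, None]
    new_variances = np.sum(w * sq, axis=1) / (d * sw)
    new_mixings = sw / N
    return new_centers, new_variances, new_mixings
```

By construction the new mixing coefficients sum to one, matching constraint (5).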

3 Derivation of SHPNN Formulation

As demonstrated in the previous section, RHPNN uses the expectation maximization algorithm along with the Jack-knife estimator. Although these algorithms are effective, they require the entire training data to be present during the training phase. This is at odds with online learning, in which the algorithm receives the data as a sequence and does not keep previous patterns in memory. The RHPNN algorithm therefore needs to be modified before it can be used for online learning tasks.

The role of the Jack-knife in RHPNN is to prevent numerical instabilities. These instabilities mostly arise from outliers that tend to make the variance of hidden nodes very small. This problem can be avoided by setting a lower bound on each kernel's variance; replacing the Jack-knife with such a lower bound also reduces the computational complexity of the algorithm.

It is known that the EM algorithm has a relatively fast convergence rate. Therefore, by using only the first iteration of the EM algorithm, the need for having the whole training data can be circumvented. With this strategy, SHPNN finds pseudo-optimal values for the network parameters. To modify the EM algorithm accordingly, for the (k+1)-th pattern entering the algorithm it can be written (contrary to the relations presented before, the index k now denotes the k-th pattern introduced to the algorithm, not the k-th EM iteration):

$$c_{m,i}^{(k+1)} = \frac{\sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})\, x_{n,i}}{\sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})} = \frac{\sum_{n=1}^{N_i} w_{m,i}^{(k-1)}(x_{n,i})\, x_{n,i} + w_{m,i}^{(k)}(x_{k+1})\, x_{k+1}}{\sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})} \tag{15}$$

The first term of the numerator can be derived as follows:

$$c_{m,i}^{(k)} = \frac{\sum_{n=1}^{N_i} w_{m,i}^{(k-1)}(x_{n,i})\, x_{n,i}}{\sum_{n=1}^{N_i} w_{m,i}^{(k-1)}(x_{n,i})} \;\Rightarrow\; \sum_{n=1}^{N_i} w_{m,i}^{(k-1)}(x_{n,i})\, x_{n,i} = c_{m,i}^{(k)} \sum_{n=1}^{N_i} w_{m,i}^{(k-1)}(x_{n,i}) \tag{16}$$


Now, defining $Y_{m,i}^{(k)} = \sum_{n=1}^{N_i} w_{m,i}^{(k-1)}(x_{n,i})$, (15) can be written as

$$c_{m,i}^{(k+1)} = \frac{c_{m,i}^{(k)}\, Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})\, x_{k+1}}{Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})} \tag{17}$$

As a result, to compute the new kernel centers, only the previous value of each kernel center and the value $Y_{m,i}^{(k)}$ are needed. In a similar fashion, the relations for the variances and mixing coefficients can be derived; their final forms are presented in (18) and (19):

$$(\sigma_{m,i}^2)^{(k+1)} = \frac{d\, (\sigma_{m,i}^2)^{(k)}\, Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})\, \big\|x_{k+1} - c_{m,i}^{(k)}\big\|^2}{d\left(Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})\right)} \tag{18}$$

$$\beta_{m,i}^{(k+1)} = \frac{1}{N_i + 1}\left(Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})\right) \tag{19}$$

At the end of each training step, the value $Y_{m,i}^{(k)}$ is updated to be used in the next step:

$$Y_{m,i}^{(k+1)} = Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1}) \tag{20}$$

By using the aforementioned relations one can classify patterns arriving in sequential fashion.
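The per-pattern updates (17), (18), and (20) for a single kernel can be sketched as follows, with the variance lower bound discussed above in place of the Jack-knife. This is a minimal sketch; the floor value 1e-3 and all names are our assumptions, not from the paper.

```python
import numpy as np

VAR_FLOOR = 1e-3  # lower bound replacing the Jack-knife (assumed value)

def sequential_update(x, c, var, Y, w, d):
    """Sequential update of one kernel, Eqs. (17), (18), (20).
    x: new pattern (d,); c: kernel center (d,); var: sigma^2;
    Y: running weight sum Y_{m,i}; w: responsibility of x for this kernel."""
    denom = Y + w
    c_new = (c * Y + w * x) / denom                                    # Eq. (17)
    var_new = (d * var * Y + w * np.sum((x - c) ** 2)) / (d * denom)   # Eq. (18)
    var_new = max(var_new, VAR_FLOOR)   # keep outliers from collapsing the variance
    Y_new = denom                                                      # Eq. (20)
    return c_new, var_new, Y_new
```

Only the running scalar Y and the current kernel parameters are needed; no previous patterns are stored, which is what makes the update usable online.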

4 The SHPNN Algorithm

This section elaborates the steps of the proposed algorithm for classifying patterns in an online manner. The schematic structure of the SHPNN algorithm is depicted in Fig. 1.

1. In the first step, the initial number of kernels for each class is determined, and the parameters of these kernels are initialized.

2. In the second step, the network is trained on the initial batch using the regular RHPNN algorithm. In each iteration k, the weights are first calculated:

$$w_{m,i}^{(k)}(x_{n,i}) = \frac{\beta_{m,i}^{(k)}\, p_{m,i}^{(k)}(x_{n,i})}{\sum_{l=1}^{M_i} \beta_{l,i}^{(k)}\, p_{l,i}^{(k)}(x_{n,i})} \tag{21}$$

where

$$p_{l,i}^{(k)}(x_{n,i}) = \frac{1}{\left(2\pi (\sigma_{l,i}^2)^{(k)}\right)^{d/2}} \exp\left(-\frac{\big\|x_{n,i} - c_{l,i}^{(k)}\big\|^2}{2 (\sigma_{l,i}^2)^{(k)}}\right) \tag{22}$$

Fig. 1 Schematic structure of SHPNN: an input layer, a hidden layer of Gaussian kernels, a layer of class-conditional probabilities, and an output layer

With the weights at hand, the network parameters can be updated:

$$c_{m,i}^{(k+1)} = \frac{\sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})\, x_{n,i}}{\sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})} \tag{23}$$

$$(\sigma_{m,i}^2)^{(k+1)} = \frac{\sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})\, \big\|x_{n,i} - c_{m,i}^{(k)}\big\|^2}{d \sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i})} \tag{24}$$

$$\beta_{m,i}^{(k+1)} = \frac{1}{N_i} \sum_{n=1}^{N_i} w_{m,i}^{(k)}(x_{n,i}) \tag{25}$$

Training the network on the initial batch causes the kernels to form a rough estimate of the data space.

3. In this step, the sequential learning of data commences. Since only the first iteration of the EM algorithm is used in this stage, the index k represents the k-th pattern that enters the algorithm for classification. Based on the relations derived in the previous section, the kernel parameters are updated according to the following formulas:

$$p_{m,i}(x_{k+1}) = \frac{1}{\left(2\pi (\sigma_{m,i}^2)^{(k)}\right)^{d/2}} \exp\left(-\frac{\big\|x_{k+1} - c_{m,i}^{(k)}\big\|^2}{2 (\sigma_{m,i}^2)^{(k)}}\right) \tag{26}$$

$$w_{m,i}^{(k)}(x_{k+1}) = \frac{\beta_{m,i}^{(k)}\, p_{m,i}(x_{k+1})}{\sum_{l=1}^{M_i} \beta_{l,i}^{(k)}\, p_{l,i}(x_{k+1})} \tag{27}$$

$$c_{m,i}^{(k+1)} = \frac{c_{m,i}^{(k)}\, Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})\, x_{k+1}}{Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})} \tag{28}$$

$$(\sigma_{m,i}^2)^{(k+1)} = \frac{d\, (\sigma_{m,i}^2)^{(k)}\, Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})\, \big\|x_{k+1} - c_{m,i}^{(k)}\big\|^2}{d\left(Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})\right)} \tag{29}$$

$$\beta_{m,i}^{(k+1)} = \frac{1}{N_i + 1}\left(Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1})\right) \tag{30}$$

in which

$$Y_{m,i}^{(k+1)} = Y_{m,i}^{(k)} + w_{m,i}^{(k)}(x_{k+1}) \tag{31}$$

By using these formulas there is no need to save the previous patterns in memory, and online sequential learning is therefore realized. The algorithm only needs to save the value $Y_{m,i}^{(k)} = \sum_{n=1}^{N_i} w_{m,i}(x_{n,i})$ from the previous steps. In addition, the number of patterns of each class that have entered the algorithm ($N_i$) must also be memorized.

4. If, after applying these changes, the algorithm can classify the pattern at hand correctly, it goes back to step 3 and reads the next pattern.

5. If, after applying these changes, the algorithm still cannot classify the pattern at hand correctly, the changes to the network parameters are discarded and a kernel is added to the corresponding class. The center of the added kernel is located at the position of the pattern at hand, and its variance is set to a small value (e.g. 0.01). It is also necessary to update the mixing coefficients so that their sum still amounts to unity. A good choice for the new kernel's mixing coefficient is one with an average effect on the final class-conditional probability:

$$\beta' = \frac{1}{N_i} \tag{32}$$

By choosing (32) as the value for the new kernel's mixing coefficient, the coefficients of the other kernels should be scaled:

$$\beta_{\text{new}} = \left(\frac{N_i - 1}{N_i}\right) \beta_{\text{old}} \tag{33}$$

In (33), $\beta_{\text{new}}$ is the updated value of an existing kernel's coefficient and $\beta_{\text{old}}$ its previous value. This way the following equality holds:

$$\sum_{i=1}^{M_j} \beta_{i,j} = 1, \quad 1 \le j \le K \tag{34}$$

This whole process is summarized in Fig. 2.
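The kernel-addition step with the mixing-coefficient rescaling of Eqs. (32)-(33) can be sketched as follows. The function name is ours; the 0.01 initial variance follows the example value given in the text.

```python
import numpy as np

def add_kernel(centers, variances, mixings, x, N_i, init_var=0.01):
    """Add a new kernel at the misclassified pattern x, Eqs. (32)-(33).
    centers: (M, d); variances, mixings: (M,); N_i: kernel count after addition
    as used in Eqs. (32)-(33)."""
    centers = np.vstack([centers, x])            # new center at the pattern itself
    variances = np.append(variances, init_var)   # small initial variance (e.g. 0.01)
    mixings = np.append(mixings * (N_i - 1) / N_i,  # rescale existing, Eq. (33)
                        1.0 / N_i)                  # new kernel's coefficient, Eq. (32)
    return centers, variances, mixings
```

Since the old coefficients summed to one, scaling them by (N_i - 1)/N_i and adding 1/N_i keeps the sum at unity, as required by (34).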

Fig. 2 Flowchart of the SHPNN algorithm [8]: start; initialize the kernel parameters; learn the first batch using RHPNN; introduce the data sequentially; update the kernel parameters using (26)-(31); if the pattern was classified incorrectly, add a new kernel at the location of the pattern; otherwise read the next pattern


5 Results

Three standard benchmarks were used to evaluate the performance of the proposed algorithm in classification tasks: the Iris, image segmentation, and satellite image datasets from the UC Irvine machine learning repository [2]. More detail about these datasets is presented in Table 1. The computer used to run the programs was equipped with 8 GB of RAM and an Intel Core i7-8550u processor.

The first dataset is Fisher's Iris dataset, which includes measured values of petal and sepal dimensions for three variants of the Iris plant, with 50 patterns per variant. The second dataset, image segmentation, consists of 2310 3 × 3 regions arbitrarily chosen from 7 outdoor images; 19 attributes are extracted from each region. The goal is to recognize which category each pattern belongs to: path, grass, window, sky, cement, foliage, or brick facing. The third dataset, satellite image, contains entries from the Landsat multispectral scanner. Each scene captured by this scanner actually comprises four digital images of that scene in four different spectral bands. The whole dataset is a scene of 82 × 100 pixels, and the patterns are 3 × 3 regions of this scene. The aim is to classify the central pixel of each pattern into one of 7 target categories: gray soil, damp gray soil, very damp gray soil, red soil, soil with vegetation stubble, cotton crop, and a mixture class. This dataset effectively has 6 classes, since none of the patterns belong to the mixture class.

For each dataset, the training set and test set remained the same, but the order of pattern introduction was shuffled in each run. For the Iris dataset, the algorithm starts with three kernels in each group; for the other two datasets, with five. Additionally, in each run 10 percent of the training data is used as the initial batch. The results for each dataset are averaged over 10 runs, and their comparison with some of the prominent online supervised classification algorithms is presented in Table 2. The metrics for this comparison are training accuracy, testing accuracy, and training time.

Table 1 Specification of datasets used to evaluate SHPNN

Dataset             # Attributes  # Classes  # Training data  # Testing data
Iris                4             3          120              30
Image segmentation  19            7          2100             210
Satellite image     36            7          4435             2000


The results presented in Table 2 show that the presented algorithm achieves good training and test accuracy compared with its counterparts. Based on these results, the proposed algorithm also achieves some of the best training times among online supervised learning algorithms.

One way to assess the optimization capacity of the proposed algorithm is to compare it with the regular RHPNN. To evaluate the performance of the modified EM algorithm in isolation, however, the Jack-knife estimator was removed from RHPNN. In addition, the number of hidden nodes for RHPNN was selected to match the number of nodes in SHPNN but distributed evenly among the classes. The results of this comparison are presented in Table 3. Before evaluating the two algorithms, one would expect RHPNN to outperform SHPNN in training and testing accuracy. By looking at the data in Table 3, however, it is evident that the gains in accuracy are either marginal or

Table 2 Comparing SHPNN results with other prominent counterparts [8]

Dataset             Algorithm     Training time (s)  Training accuracy  Testing accuracy  # Nodes
Iris                SNN           –                  0.872              0.861             –
                    SHPNN         0.1829             0.9617             0.9267            13.5
Image segmentation  OS-ELM (RBF)  9.9981             0.9700             0.9488            180
                    SAO-ELM       9.1644             0.9725             0.9516            191
                    MRAN          7004.5             –                  0.9330            53.1
                    SHPNN         4.4922             0.9426             0.9495            178.3
Satellite image     OS-ELM (RBF)  319.14             0.9318             0.8901            400
                    SAO-ELM       211.034            0.9480             0.9139            413
                    MRAN          2469.4             –                  0.8636            20.4
                    SHPNN         26.44              0.8867             0.8575            534.6
Table 3 Comparing the performance of RHPNN and SHPNN

Dataset             Algorithm  Training time (s)  Training accuracy  Testing accuracy
Iris                RHPNN      0.5885             0.9758             0.9800
                    SHPNN      0.1829             0.9617             0.9267
Image segmentation  RHPNN      100.8              0.9066             0.9233
                    SHPNN      4.4922             0.9426             0.9495
Satellite image     RHPNN      1037               0.9009             0.866
                    SHPNN      26.44              0.8867             0.8575


nonexistent. The reason is that SHPNN can allocate hidden-layer nodes efficiently, whereas in RHPNN there is no way to distribute the number of nodes per class initially, and setting a proper number of nodes for RHPNN requires a laborious process of hyperparameter optimization. Another important result of Table 3 is that, at almost the same level of accuracy, SHPNN trains much faster than the regular RHPNN. This advantage in training speed would only widen if the Jack-knife were not removed from RHPNN.

A good way to peek into the workings of SHPNN is to track its learning process visually. To do so, the algorithm was trained on the Iris dataset. During training, all the training data were plotted in different colors in the plane of the first two features (visualization is harder in 3 dimensions and impossible in the full 4-dimensional space of the Iris dataset). For each update of the kernel parameters during the training process, the kernels were indicated by their centers and a circle around them with a radius of 20σ (a value selected solely to increase the clarity of the figures). Figure 3 depicts six steps of the training process for the SHPNN initialized with one kernel per class.

As shown in Fig. 3, the network starts with random kernels. After only one iteration of the EM algorithm on the initial batch, the kernels are already adjusted to the underlying data to a good degree, which is a testimony to the power of the EM algorithm's first step. Afterward, the EM algorithm fine-tunes the kernel parameters until the beginning of the sequential learning phase. In this phase, with the introduction of each pattern, either the existing kernels are adjusted or a new kernel with a small variance is added. Interestingly, these added kernels stay small in terms of variance and essentially learn the more complex features of the decision boundary. This process continues until the arrival of the last pattern and the finalization of the network's structure.

Similarly, Fig. 4 visualizes the training process of SHPNN on the Iris dataset, this time initialized with 3 kernels per class. Even in this case, during training on the initial batch, one kernel in each class becomes the dominant node that captures the large-scale structure of that class, while the other two kernels shrink to capture the intricacies of the decision boundary (this does not hold for the class represented in red, since its decision boundary is rather straightforward).

One of the most important properties of the proposed algorithm is that its learning process is quite transparent compared with some of the most prominent online and batch learning algorithms. This transparency is demonstrated in Figs. 3 and 4. In addition, after the training process is complete, the structure of the algorithm provides valuable information about the underlying statistics of the training data.
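The visualization described above (training patterns in the plane of the first two features, each kernel drawn as its center plus a circle of radius 20σ) can be reproduced with a short matplotlib sketch. The function name and the per-kernel class bookkeeping are our assumptions, not the authors' code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for non-interactive rendering
import matplotlib.pyplot as plt

def plot_kernels(X, y, centers, variances, kernel_classes, radius_scale=20.0):
    """Scatter the patterns by class in the first two feature dimensions and
    overlay each kernel as a '+' at its center with a circle of radius
    radius_scale * sigma (20*sigma as in the text, chosen for clarity)."""
    fig, ax = plt.subplots()
    for j in np.unique(y):
        pts = X[y == j]
        ax.scatter(pts[:, 0], pts[:, 1], s=10, label=f"class {j}")
    for c, v in zip(centers, variances):
        r = radius_scale * np.sqrt(v)
        ax.add_patch(plt.Circle((c[0], c[1]), r, fill=False))
        ax.plot(c[0], c[1], "k+")
    ax.set_aspect("equal")
    ax.legend()
    return fig
```

Calling this after each parameter update (or each kernel addition) produces snapshots like the panels of Figs. 3 and 4.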


Fig. 3 Visualization of SHPNN training process with 1 kernel for each class initially. a Initial random positions for kernels. b After only one iteration of EM algorithm on the initial batch. c Kernels at the end of training on initial batch. d Automatic addition of first kernel during the sequential learning phase. e Kernels’ status in the middle of sequential learning phase. f Final structure of the network after the completion of the learning process


Fig. 4 Visualization of SHPNN training process with 3 kernels for each class initially. a Initial random positions for kernels. b After only one iteration of EM algorithm on the initial batch. c Kernels at the end of training on initial batch. d Automatic addition of first kernel during the sequential learning phase. e Kernels' status in the middle of sequential learning phase. f Final structure of the network after the completion of the learning process


6 Conclusion

In this paper, a novel online supervised classification algorithm was introduced, the derivation of its formulation was elaborated, and the results of its comparison with other similar algorithms were presented. The results show that in terms of accuracy this algorithm is very close to its counterparts. In terms of training time, it falls behind some of the state-of-the-art algorithms in online learning; however, it outpaces the algorithm that inspired it (RHPNN) by a huge margin while keeping the same level of accuracy, thanks to its adaptive hidden-node addition.

In the past few years, machine learning algorithms have started to take on more serious and critical tasks, such as intelligent cruise control in modern automobiles. In such systems reliability is of utmost importance, and as a result a great deal of attention has been given to developing algorithms that are interpretable by nature. The presented algorithm, despite its shortcomings, enjoys a very transparent structure that allows experts to understand its working principle and prepare for its potential errors. This interpretability also allows the network structure to represent meaningful information about the data space after the training phase is completed.

References

1. Asgary, R., Mohammadi, K., Zwolinski, M.: Using neural networks as a fault detection mechanism in MEMS devices. Microelectronics Reliability 47(1), 142–149 (2007)
2. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
3. Hamid, O.H., Braun, J.: Reinforcement learning and attractor neural network models of associative learning. In: International Joint Conference on Computational Intelligence, pp. 327–349. Springer (2017)
4. Huang, G.B., Saratchandran, P., Sundararajan, N.: An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34(6), 2284–2292 (2004)
5. LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.R.: Efficient backprop. In: Neural Networks: Tricks of the Trade, pp. 9–48. Springer (2012)
6. Li, G., Liu, M., Dong, M.: A new online learning algorithm for structure-adjustable extreme learning machine. Computers & Mathematics with Applications 60(3), 377–389 (2010)
7. Liang, N.Y., Huang, G.B., Saratchandran, P., Sundararajan, N.: A fast and accurate online sequential learning algorithm for feedforward networks. IEEE Transactions on Neural Networks 17(6), 1411–1423 (2006)
8. Mahmoudi, A., Askari Moghadam, R., Madani, K.: A sequential heteroscedastic probabilistic neural network for online classification. In: Proceedings of the 11th International Joint Conference on Computational Intelligence - Volume 1: NCTA (IJCCI 2019), pp. 449–453. INSTICC, SciTePress (2019). https://doi.org/10.5220/0008495604490453
9. Platt, J.: A resource-allocating network for function interpolation. MIT Press (1991)
10. Somu, N., MR, G.R., Kalpana, V., Kirthivasan, K., VS, S.S.: An improved robust heteroscedastic probabilistic neural network based trust prediction approach for cloud service selection. Neural Networks 108, 339–354 (2018)
11. Venkatesan, R., Er, M.J.: A novel progressive learning technique for multi-class classification. Neurocomputing 207, 310–321 (2016)
12. Wang, J., Belatreche, A., Maguire, L., McGinnity, T.M.: An online supervised learning method for spiking neural networks with adaptive structure. Neurocomputing 144, 526–536 (2014)
13. Xu, Y., Yang, J., Zhong, S.: An online supervised learning method based on gradient descent for spiking neurons. Neural Networks 93, 7–20 (2017)
14. Yang, Z.R., Chen, S.: Robust maximum likelihood training of heteroscedastic probabilistic neural networks. Neural Networks 11(4), 739–747 (1998)
15. Yang, Z.R., Zwolinski, M., Chalk, C.D., Williams, A.C.: Applying a robust heteroscedastic probabilistic neural network to analog fault detection and classification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19(1), 142–151 (2000)
16. Yingwei, L., Sundararajan, N., Saratchandran, P.: A sequential learning scheme for function approximation using minimal radial basis function neural networks. Neural Computation 9(2), 461–478 (1997)

Author Index

B
BahaaElDin, Ahmed, 3
Baldini, Luca, 263
Bertei, Alex, 163
Boria, Simonetta, 55
Brester, Christina, 223
Bujny, Mariusz, 55

C
Cahlik, Vojtech, 291

D
Duddeck, Fabian, 55

F
Foss, Luciana, 163

H
Hadhoud, Mayada, 3
Hoffman, Alwyn J., 239
Howe, Jacob M., 359

K
Kalkreuth, Roman, 85
Khan, Mohammad O., 313
Kolehmainen, Mikko, 223

L
Lilienthal, Achim J., 191

M
Macek, Karel, 333
Madani, Kurosh, 387
Mahmoudi, Ali, 387
Martino, Alessio, 263
Mereani, Fawaz A., 359
Moghadam, Reza Askari, 387
Mohamed, Ismail, 29

O
Olhofer, Markus, 55
Otero, Fernando E. B., 29

P
Palm, Rainer, 191
Parker, Gary B., 313

R
Rabé, Schalk, 239
Raponi, Elena, 55
Reiser, Renata H. S., 163
Rizzi, Antonello, 263

S
Semenkin, Eugene, 223
Šešum-Čavić, Vesna, 115
Stanko, Silvestr, 333
Stanovov, Vladimir, 223
Surynek, Pavel, 291

T
Tuomainen, Tomi-Pekka, 223

V
Voutilainen, Ari, 223

Y
Yoshida, Yuji, 135

Z
Zakaria, Yahia, 3
Zakaria, Yassin, 3