Handbook on Computer Learning and Intelligence [2 vols., 2 ed.] 9811245142, 9789811245145

The Handbook on Computer Learning and Intelligence is a second edition which aims to be a one-stop-shop for the various


English Pages 990 [1057] Year 2022


Table of contents :
Volume 1: Explainable AI and Supervised Learning
Preface
About the Editor
Acknowledgments
Contents
Introduction
Part I: Explainable AI
1. Explainable Artificial Intelligence and Computational Intelligence: Past and Present • Mojtaba Yeganejou and Scott Dick
2. Fundamentals of Fuzzy Set Theory • Fernando Gomide
3. Granular Computing • Andrzej Bargiela and Witold Pedrycz
4. Evolving Fuzzy and Neuro-Fuzzy Systems: Fundamentals, Stability, Explainability, Useability, and Applications • Edwin Lughofer
5. Incremental Fuzzy Machine Learning for Online Classification of Emotions in Games from EEG Data Streams • Daniel Leite, Volnei Frigeri Jr., and Rodrigo Medeiros
6. Causal Inference in Medicine and in Health Policy: A Summary • Wenhao Zhang, Ramin Ramezani, and Arash Naeim
Part II: Supervised Learning
7. Fuzzy Classifiers • Hamid Bouchachia
8. Kernel Models and Support Vector Machines • Denis Kolev, Mikhail Suvorov, and Dmitry Kangin
9. Evolving Connectionist Systems for Adaptive Learning and Knowledge Discovery: From Neuro-Fuzzy to Spiking, Neurogenetic, and Quantum Inspired: A Review of Principles and Applications • Nikola Kirilov Kasabov
10. Supervised Learning Using Spiking Neural Networks • Abeegithan Jeyasothy, Shirin Dora, Suresh Sundaram, and Narasimhan Sundararajan
11. Fault Detection and Diagnosis Based on an LSTM Neural Network Applied to a Level Control Pilot Plant • Emerson Vilar de Oliveira, Yuri Thomas Nunes, Mailson Ribeiro Santos, and Luiz Affonso Guedes
12. Conversational Agents: Theory and Applications • Mattias Wahde and Marco Virgolin
Index
Volume 2: Deep Learning, Intelligent Control and Evolutionary Computation
Preface
About the Editor
Acknowledgments
Contents
Introduction
Part III: Deep Learning
13. Deep Learning and Its Adversarial Robustness: A Brief Introduction • Fu Wang, Chi Zhang, Peipei Xu, and Wenjie Ruan
14. Deep Learning for Graph-Structured Data • Luca Pasa, Nicolò Navarin, and Alessandro Sperduti
15. A Critical Appraisal on Deep Neural Networks: Bridge the Gap between Deep Learning and Neuroscience via XAI • Anna-Sophie Bartle, Ziping Jiang, Richard Jiang, Ahmed Bouridane, and Somaya Almaadeed
16. Ensemble Learning • Y. Liu and Q. Zhao
17. A Multistream Deep Rule-Based Ensemble System for Aerial Image Scene Classification • Xiaowei Gu and Plamen P. Angelov
Part IV: Intelligent Control
18. Fuzzy Model-Based Control: Predictive and Adaptive Approach • Igor Škrjanc and Sašo Blažič
19. Reinforcement Learning with Applications in Autonomous Control and Game Theory • Kyriakos G. Vamvoudakis, Frank L. Lewis, and Draguna Vrabie
20. Nature-Inspired Optimal Tuning of Fuzzy Controllers • Radu-Emil Precup and Radu-Codrut David
21. Indirect Self-evolving Fuzzy Control Approaches and Their Applications • Zhao-Xu Yang and Hai-Jun Rong
Part V: Evolutionary Computation
22. Evolutionary Computation: Historical View and Basic Concepts • Carlos A. Coello Coello, Carlos Segura, and Gara Miranda
23. An Empirical Study of Algorithmic Bias • Dipankar Dasgupta and Sajib Sen
24. Collective Intelligence: A Comprehensive Review of Metaheuristic Algorithms Inspired by Animals • Fevrier Valdez
25. Fuzzy Dynamic Parameter Adaptation for Gray Wolf Optimization of Modular Granular Neural Networks Applied to Human Recognition Using the Iris Biometric Measure • Patricia Melin, Daniela Sánchez, and Oscar Castillo
26. Evaluating Inter-Task Similarity for Multifactorial Evolutionary Algorithm from Different Perspectives • Lei Zhou, Liang Feng, Min Jiang, and Kay Chen Tan
Index

HANDBOOK ON COMPUTER LEARNING AND INTELLIGENCE Volume 1: Explainable AI and Supervised Learning



Editor

Plamen Parvanov Angelov Lancaster University, UK

World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO

Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Cataloging-in-Publication Data
Names: Angelov, Plamen P., editor.
Title: Handbook on computer learning and intelligence / editor, Plamen Parvanov Angelov, Lancaster University, UK.
Other titles: Handbook on computational intelligence.
Description: New Jersey : World Scientific, [2022] | Includes bibliographical references and index. | Contents: Volume 1. Explainable AI and supervised learning -- Volume 2. Deep learning, intelligent control and evolutionary computation.
Identifiers: LCCN 2021052472 | ISBN 9789811245145 (set) | ISBN 9789811246043 (v. 1 ; hardcover) | ISBN 9789811246074 (v. 2 ; hardcover) | ISBN 9789811247323 (set ; ebook for institutions) | ISBN 9789811247330 (set ; ebook for individuals)
Subjects: LCSH: Expert systems (Computer science)--Handbooks, manuals, etc. | Neural networks (Computer science)--Handbooks, manuals, etc. | Systems engineering--Data processing--Handbooks, manuals, etc. | Intelligent control systems--Handbooks, manuals, etc. | Computational intelligence--Handbooks, manuals, etc.
Classification: LCC QA76.76.E95 H3556 2022 | DDC 006.3/3--dc23/eng/20211105
LC record available at https://lccn.loc.gov/2021052472

British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

Copyright © 2022 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

For any available supplementary material, please visit https://www.worldscientific.com/worldscibooks/10.1142/12498#t=suppl

Desk Editors: Nandha Kumar/Amanda Yun
Typeset by Stallion Press
Email: [email protected]
Printed in Singapore


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_fmatter

Preface

The Handbook aims to be a one-stop-shop for the various aspects of the broad research area of Computer Learning and Intelligence. It is organized in two volumes and five parts as follows:

Volume 1 includes two parts:
Part I Explainable AI
Part II Supervised Learning

Volume 2 has three parts:
Part III Deep Learning
Part IV Intelligent Control
Part V Evolutionary Computation

The Handbook has twenty-six chapters in total, which detail the theory, methodology and applications of Computer Learning and Intelligence. These individual contributions are authored by some of the leading experts in the respective areas and are often co-authored by a small team of associates. Together they span the spectrum of key topics outlined in the titles of the five parts. In total, over 67 authors from over 20 different countries contributed to this carefully crafted final product. This collaborative effort brought together leading researchers and scientists from the USA, Canada, Wales, Northern Ireland, England, Japan, Sweden, Italy, Spain, Austria, Slovenia, Romania, Singapore, New Zealand, Brazil, Russia, India, Mexico, Hong Kong, China, Pakistan and Qatar.

The scope of the Handbook covers the most important aspects of the topic of Computer Learning and Intelligence. Preparing, compiling and editing this Handbook was an enjoyable and inspirational experience. I hope you will also enjoy reading it, will find answers to your questions and will use this book in your everyday work.

Plamen Parvanov Angelov
Lancaster, UK
3 June 2022


About the Editor

Professor Plamen Parvanov Angelov holds a Personal Chair in Intelligent Systems and is Director of Research at the School of Computing and Communications at Lancaster University, UK. He obtained his PhD in 1993 and his DSc (Doctor of Sciences) degree in 2015, when he also became a Fellow of the IEEE. Prof. Angelov is the founding Director of the Lancaster Intelligent, Robotic and Autonomous systems (LIRA) Research Centre, which brings together over 60 faculty/academics across fifteen different departments of Lancaster University. Prof. Angelov is a Fellow of the European Laboratory for Learning and Intelligent Systems (ELLIS) and of the Institution of Engineering and Technology (IET), as well as a Governor-at-large of the International Neural Networks Society (INNS) for a third consecutive three-year term, following two consecutive terms holding the elected role of Vice President. Over the last decade, Prof. Angelov has also been a Governor-at-large of the Systems, Man and Cybernetics Society of the IEEE. Prof. Angelov founded two research groups (the Intelligent Systems Research group in 2010 and the Data Science group in 2014) and was a founding member of the Data Science Institute and of the CyberSecurity Academic Centre of Excellence at Lancaster.

Prof. Angelov has authored over 370 publications in leading journals and peer-reviewed conference proceedings, 3 granted US patents, and 3 research monographs (Wiley, 2012, and Springer, 2002 and 2018), cited over 12300 times with an h-index of 58 (according to Google Scholar). He has an active research portfolio in the area of computer learning and intelligence, and internationally recognized results in online and evolving learning and algorithms. More recently, his research has focused on explainable deep learning as well as anthropomorphic and empirical computer learning. Prof. Angelov leads numerous projects (including several multimillion ones) funded by UK research councils, EU, industry, European Space Agency, etc.

His research was recognized by the 2020 Dennis Gabor Award for “outstanding contributions to engineering applications of neural networks”, ‘The Engineer Innovation and Technology 2008 Special Award’ and ‘For outstanding Services’ (2013) by IEEE and INNS. He is also the founding co-Editor-in-Chief of Springer’s journal on Evolving Systems and Associate Editor of several leading international journals, including IEEE Transactions on Cybernetics, IEEE Transactions on Fuzzy Systems, IEEE Transactions on AI, etc. He gave over 30 dozen keynote/plenary talks at high profile conferences. Prof. Angelov was General Co-Chair of a number of high-profile IEEE conferences and is the founding Chair of the Technical Committee on Evolving Intelligent Systems, Systems, Man and Cybernetics Society of the IEEE. He also co-chairs the working group (WG) on a new standard on Explainable AI, created on his initiative within the Standards Committee of the Computational Intelligence Society of the IEEE, which he chaired from 2010 to 2012. Prof. Angelov is the founding co-Director of one of the programs funded by ELLIS, on Human-centered machine learning. He was a member of the International Program Committee of over 100 international conferences (primarily IEEE conferences).


Acknowledgments

The Editor would like to acknowledge the unwavering support and love of his wife Rositsa, his children (Mariela and Lachezar) as well as of his mother, Lilyana.


Contents

Preface
About the Editor
Acknowledgments
Introduction

Part I: Explainable AI
1. Explainable Artificial Intelligence and Computational Intelligence: Past and Present • Mojtaba Yeganejou and Scott Dick
2. Fundamentals of Fuzzy Set Theory • Fernando Gomide
3. Granular Computing • Andrzej Bargiela and Witold Pedrycz
4. Evolving Fuzzy and Neuro-Fuzzy Systems: Fundamentals, Stability, Explainability, Useability, and Applications • Edwin Lughofer
5. Incremental Fuzzy Machine Learning for Online Classification of Emotions in Games from EEG Data Streams • Daniel Leite, Volnei Frigeri Jr., and Rodrigo Medeiros
6. Causal Inference in Medicine and in Health Policy: A Summary • Wenhao Zhang, Ramin Ramezani, and Arash Naeim

Part II: Supervised Learning
7. Fuzzy Classifiers • Hamid Bouchachia
8. Kernel Models and Support Vector Machines • Denis Kolev, Mikhail Suvorov, and Dmitry Kangin
9. Evolving Connectionist Systems for Adaptive Learning and Knowledge Discovery: From Neuro-Fuzzy to Spiking, Neurogenetic, and Quantum Inspired: A Review of Principles and Applications • Nikola Kirilov Kasabov
10. Supervised Learning Using Spiking Neural Networks • Abeegithan Jeyasothy, Shirin Dora, Suresh Sundaram, and Narasimhan Sundararajan
11. Fault Detection and Diagnosis Based on an LSTM Neural Network Applied to a Level Control Pilot Plant • Emerson Vilar de Oliveira, Yuri Thomas Nunes, Mailson Ribeiro Santos, and Luiz Affonso Guedes
12. Conversational Agents: Theory and Applications • Mattias Wahde and Marco Virgolin
Index

Part III: Deep Learning
13. Deep Learning and Its Adversarial Robustness: A Brief Introduction • Fu Wang, Chi Zhang, Peipei Xu, and Wenjie Ruan
14. Deep Learning for Graph-Structured Data • Luca Pasa, Nicolò Navarin, and Alessandro Sperduti
15. A Critical Appraisal on Deep Neural Networks: Bridge the Gap between Deep Learning and Neuroscience via XAI • Anna-Sophie Bartle, Ziping Jiang, Richard Jiang, Ahmed Bouridane, and Somaya Almaadeed
16. Ensemble Learning • Y. Liu and Q. Zhao
17. A Multistream Deep Rule-Based Ensemble System for Aerial Image Scene Classification • Xiaowei Gu and Plamen P. Angelov

Part IV: Intelligent Control
18. Fuzzy Model-Based Control: Predictive and Adaptive Approach • Igor Škrjanc and Sašo Blažič
19. Reinforcement Learning with Applications in Autonomous Control and Game Theory • Kyriakos G. Vamvoudakis, Frank L. Lewis, and Draguna Vrabie
20. Nature-Inspired Optimal Tuning of Fuzzy Controllers • Radu-Emil Precup and Radu-Codrut David
21. Indirect Self-evolving Fuzzy Control Approaches and Their Applications • Zhao-Xu Yang and Hai-Jun Rong

Part V: Evolutionary Computation
22. Evolutionary Computation: Historical View and Basic Concepts • Carlos A. Coello Coello, Carlos Segura, and Gara Miranda
23. An Empirical Study of Algorithmic Bias • Dipankar Dasgupta and Sajib Sen
24. Collective Intelligence: A Comprehensive Review of Metaheuristic Algorithms Inspired by Animals • Fevrier Valdez
25. Fuzzy Dynamic Parameter Adaptation for Gray Wolf Optimization of Modular Granular Neural Networks Applied to Human Recognition Using the Iris Biometric Measure • Patricia Melin, Daniela Sánchez, and Oscar Castillo
26. Evaluating Inter-task Similarity for Multifactorial Evolutionary Algorithm from Different Perspectives • Lei Zhou, Liang Feng, Min Jiang, and Kay Chen Tan
Index


Handbook on Computer Learning and Intelligence Introduction by the Editor

You are holding in your hand the second edition of the handbook that was published six years ago as Handbook on Computational Intelligence. The research field has evolved so much during the last five years that we decided to not only update the content with new chapters but also change the title of the handbook. At the same time, the core chapters (15 out of 26) have been written and significantly updated by the same authors. The structure of the book is also largely the same, though some parts evolved to cover new phenomena better. For example, the part Deep Learning evolved from the artificial neural networks, and the part Explainable AI evolved from fuzzy logic.

The new title, Computer Learning and Intelligence, better reflects the new state of the art in the area that spans machine learning and artificial intelligence (AI), but is more application and engineering oriented rather than being theoretically abstract or statistics oriented like many other texts that also use these terms.

The term computer learning is used deliberately instead of the common term machine learning. Not only does it link with the term computational intelligence, but it also reflects more accurately the nature of the learning algorithms that are of interest. These are implemented on computing devices spanning from the Cloud to the Edge, covering personal and networked computers and other computerized tools. The term machine is more often used in a mechanical and mechatronic context, as a physical device, in contrast to the (still) electronic nature of computers. In addition, the term machine learning, the genesis of which stems from the research area of AI, is now used almost as a synonym for statistical learning. In fact, learning is much more than function approximation, parameter learning, and number fitting. Important elements of learning are reasoning, knowledge representation, expression and handling (storage, organization, retrieval), as well as decision-making and cognition.

Statistical learning is quite successful today, but it seems to oversimplify some of these aspects. In this handbook, we aim to provide all perspectives, and therefore, we combine computer learning with intelligence.


Talking about intelligence, we mentioned computational intelligence, which was the subject of the first edition of the handbook. As is well known, this term itself came around toward the end of the last century and covers areas of research that are inspired by and that “borrow” or “mimic” such forms of natural intelligence as the human brain (artificial neural networks), human reasoning (fuzzy logic and systems), and natural evolution (evolutionary systems). Intelligence is also at the core of the widely used term AI. The very idea of developing systems, devices, algorithms, and techniques that possess characteristics of intelligence and are computational (not just conceptual) dates back to the middle of the 20th century or even earlier, but only now is it becoming truly widespread, moving from the labs to real-life applications.

In this handbook, while not claiming to provide an exhaustive picture of this dynamic and fast-developing area, we provide some key directions and examples written in self-contained chapters, which are, however, organized in parts on topics such as:

• Explainable AI
• Supervised Learning
• Deep Learning
• Intelligent Control
• Evolutionary Computation

The primary goal of Computer Learning and Intelligence is to provide efficient computational solutions to existing open problems, from both theoretical and application points of view, in the understanding, representation, modeling, visualization, reasoning, decision, prediction, classification, analysis, and control of physical objects, environmental or social phenomena, etc., for which the traditional methods, techniques, and theories (primarily the so-called first-principles-based, deterministic, or probabilistic ones, often expressed as differential equations, regression, and Bayesian models, and stemming from mass and energy balance) cannot provide a valid or practically useful solution.

Another specific feature of Computer Learning and Intelligence is that it offers solutions that bear characteristics of intelligence, which is usually attributed to humans only. This has to be considered broadly rather than literally, as in the area of AI. This is, perhaps, clearer with fuzzy logic, where systems can make decisions very much like humans do. This is in stark contrast to the deterministic-type expert systems or probabilistic associative rules. One can argue that artificial neural networks, including deep learning, process the data in a manner that is similar to what the human brain does. For evolutionary computation, the argument is that the population of candidate solutions “evolves” toward the optimum in a manner similar to the way species or living organisms evolve in nature.


The handbook is composed of 2 volumes and 5 parts, which contain 26 chapters. Eleven of these chapters are new in comparison to the first edition, while the other 15 chapters are substantially improved and revised. More specifically, Volume 1 includes Part I (Explainable AI) and Part II (Supervised Learning). Volume 2 includes Part III (Deep Learning), Part IV (Intelligent Control), and Part V (Evolutionary Computation).

In Part I the readers can find six chapters on explainable AI, including:

• Explainable Artificial Intelligence and Computational Intelligence: Past and Present
  This new chapter provides a historical perspective of explainable AI. It is written by Dr. Mojtaba Yeganejou and Prof. Scott Dick from the University of Alberta, Canada.
• Fundamentals of Fuzzy Set Theory
  This chapter is a thoroughly revised version of the chapter with the same title from the first edition of the handbook. It provides a step-by-step introduction to the theory of fuzzy sets, which itself is anchored in the theory of reasoning and is one of the most prominent and well-developed forms of explainable AI (the others being decision trees and symbolic AI). It is written by one of the leading experts in this area, former president of the International Fuzzy Systems Association (IFSA), Prof. Fernando Gomide from UNICAMP, Campinas, Brazil.
• Granular Computing
  This chapter is also a thoroughly revised version from the first edition of the handbook and is written by two of the pioneers in this area, Profs. Witold Pedrycz from the University of Alberta, Canada, and Andrzej Bargiela. Granular computing became a cornerstone of the area of explainable AI, and this chapter offers a thorough review of the problems and solutions that granular computing offers.
• Evolving Fuzzy and Neuro-Fuzzy Systems: Fundamentals, Stability, Explainability, Useability, and Applications
  Since its introduction around the turn of the century by Profs. Plamen Angelov and Nikola Kasabov, the area of evolving fuzzy and neuro-fuzzy systems has been constantly developing as a form of explainable AI and machine learning. This chapter, thoroughly revised in comparison to the version in the first edition, offers a review of the problems and some of the solutions. It is authored by one of the leading experts in this area, Dr. Edwin Lughofer from Johannes Kepler University Linz, Austria.
• Incremental Fuzzy Machine Learning for Online Classification of Emotions in Games from EEG Data Streams


  This new chapter is authored by Prof. Daniel Leite and Drs. Volnei Frigeri Jr. and Rodrigo Medeiros from Adolfo Ibáñez University, Chile, and the Federal University of Minas Gerais, Brazil. It offers one specific approach to a form of explainable AI and its application to games and EEG data stream processing.
• Causal Inference in Medicine and in Health Policy: A Summary
  This new chapter is authored by Drs. Ramin Ramezani and Wenhao Zhang and Prof. Arash Naeim. It describes the very important topic of causality in AI.

Part II consists of six chapters, which cover the area of supervised learning:

• Fuzzy Classifiers
  This is a thoroughly revised and refreshed chapter that was also a part of the first edition of the handbook. It covers the area of fuzzy rule-based classifiers, which combine explainable AI and intelligence, more generally, as well as the area of machine or computer learning. It is written by Prof. Hamid Bouchachia from Bournemouth University, UK.
• Kernel Models and Support Vector Machines
  This chapter offers a very skillful review of one of the hottest topics in research and applications related to supervised computer learning. It is written by Drs. Denis Kolev, Mikhail Suvorov, and Dmitry Kangin and is a thorough revision of the version that was included in the first edition of the handbook.
• Evolving Connectionist Systems for Adaptive Learning and Knowledge Discovery: From Neuro-Fuzzy to Spiking, Neurogenetic, and Quantum Inspired: A Review of Principles and Applications
  This chapter offers a review of one of the cornerstones of computer learning and intelligence, namely the evolving connectionist systems, and is written by the pioneer in this area, Prof. Nikola Kasabov from the Auckland University of Technology, New Zealand.
• Supervised Learning Using Spiking Neural Networks
  This is a new chapter written by Drs. Abeegithan Jeyasothy from the Nanyang Technological University, Singapore; Shirin Dora from the University of Ulster, Northern Ireland, UK; Sundaram Suresh from the Indian Institute of Science; and Prof. Narasimhan Sundararajan, who recently retired from the Nanyang Technological University, Singapore. It covers the topic of supervised computer learning of a particular type of artificial neural network that holds a lot of promise: spiking neural networks.
• Fault Detection and Diagnosis Based on an LSTM Neural Network Applied to a Level Control Pilot Plant


  This is a thoroughly revised and updated chapter in comparison to the version in the first edition of the handbook. It covers an area of high industrial interest, now includes long short-term memory (LSTM) neural networks, and is authored by Drs. Emerson V. de Oliveira, Yuri Thomas Nunes, and Mailson Ribeiro Santos and Prof. Luiz Affonso Guedes from the Federal University of Rio Grande do Norte, Natal, Brazil.
• Conversational Agents: Theory and Applications
  This is a new chapter that combines intelligence with supervised computer learning. It is authored by Prof. Mattias Wahde and Dr. Marco Virgolin from the Chalmers University of Technology, Sweden.

The second volume consists of three parts. The first of these, Part III, is devoted to deep learning and consists of five chapters, four of which are new and specially written for the second edition of the handbook:

• Deep Learning and Its Adversarial Robustness: A Brief Introduction
  This new chapter, written by Drs. Fu Wang, Chi Zhang, Peipei Xu, and Wenjie Ruan from Exeter University, UK, provides a brief introduction to the hot topic of deep learning and its robustness under adversarial perturbations.
• Deep Learning for Graph-Structured Data
  This new chapter, written by leading experts in the area of deep learning, Drs. Luca Pasa and Nicolò Navarin and Prof. Alessandro Sperduti, describes methods that are specifically tailored to graph-structured data.
• A Critical Appraisal on Deep Neural Networks: Bridge the Gap between Deep Learning and Neuroscience via XAI
  This new chapter, written by Anna-Sophie Bartle from the University of Tübingen, Germany; Ziping Jiang and Richard Jiang from Lancaster University, UK; Ahmed Bouridane from Northumbria University, UK; and Somaya Almaadeed from Qatar University, provides a critical analysis of deep learning techniques, trying to bridge the divide between neuroscience and explainable AI.
• Ensemble Learning
  This is a thoroughly revised chapter written by the same authors as in the first edition (Dr. Yong Liu and Prof. Qiangfu Zhao) and covers the area of ensemble learning.


• A Multistream Deep Rule-Based Ensemble System for Aerial Image Scene Classification
  This is a new chapter contributed by Dr. Xiaowei Gu from the University of Aberystwyth in Wales, UK, and by Prof. Plamen Angelov. It provides an explainable-by-design form of deep learning, which can also take the form of a linguistic rule base and is applied to aerial image scene classification.

Part IV consists of four chapters covering the topic of intelligent control:

• Fuzzy Model-Based Control: Predictive and Adaptive Approach
  This chapter is a thorough revision of the chapter from the first edition of the handbook and is co-authored by Prof. Igor Škrjanc and Dr. Sašo Blažič from Ljubljana University, Slovenia.
• Reinforcement Learning with Applications in Autonomous Control and Game Theory
  This chapter is written by renowned world leaders in this area, Dr. Kyriakos G. Vamvoudakis from the Georgia Institute of Technology, Prof. Frank L. Lewis from the University of Texas, and Dr. Draguna Vrabie from the Pacific Northwest National Laboratory in the USA. In this chapter, thoroughly revised and updated in comparison to the version in the first edition, the authors describe reinforcement learning that is being applied not only to systems control but also to game theory.
• Nature-Inspired Optimal Tuning of Fuzzy Controllers
  This chapter is also a thorough revision from the first edition by the same authors, Prof. Radu-Emil Precup and Dr. Radu-Codrut David, both from the Politehnica University of Timisoara, Romania.
• Indirect Self-evolving Fuzzy Control Approaches and Their Applications
  This is a new chapter and is written by Drs. Zhao-Xu Yang and Hai-Jun Rong from Xi'an Jiaotong University, China.

Finally, Part V includes five chapters on evolutionary computation:

• Evolutionary Computation: Historical View and Basic Concepts
  This is a thoroughly revised chapter from the first edition of the handbook, which sets the scene with a historical and philosophical introduction of the topic. It is written by one of the leading scientists in this area, Dr. Carlos A. Coello Coello from CINVESTAV, Mexico, and co-authored by Carlos Segura from the Centre of Research in Mathematics, Mexico, and Gara Miranda from the University of La Laguna, Tenerife, Spain.


• An Empirical Study of Algorithmic Bias
This chapter is written by the same author as in the first edition, Prof. Dipankar Dasgupta from the University of Memphis, USA, and co-authored by his associate Dr. Sanjib Sen, but it offers a new topic: an empirical study of algorithmic bias.
• Collective Intelligence: A Comprehensive Review of Metaheuristic Algorithms Inspired by Animals
This chapter is a thoroughly revised version of the chapter by the same author, Dr. Fevrier Valdez from the Institute of Technology, Tijuana, Mexico, who contributed to the first edition.
• Fuzzy Dynamic Parameter Adaptation for Gray Wolf Optimization of Modular Granular Neural Networks Applied to Human Recognition Using the Iris Biometric Measure
This chapter is a thoroughly revised version of the chapters contributed by the authors to the first edition of the handbook. It is written by the leading experts in the area of fuzzy systems, Drs. Patricia Melin and Daniela Sanchez and Prof. Oscar Castillo from the Tijuana Institute of Technology, Mexico.
• Evaluating Inter-task Similarity for Multifactorial Evolutionary Algorithm from Different Perspectives
The last chapter of this part of the second volume and of the handbook is new and is contributed by Lei Zhou and Drs. Liang Feng from Chongqing University, China, and Min Jiang from Xiamen University, China, and Prof. Kay Chen Tan from City University of Hong Kong.

In conclusion, this handbook is a thoroughly revised and updated second edition and is composed with care, aiming to cover all main aspects and recent trends in the computer learning and intelligence area of research and offering solid background knowledge as well as end-point applications. It is designed to be a one-stop shop for interested readers, but by no means aims to completely replace all other sources in this dynamically evolving area of research. Enjoy reading it.

Plamen Parvanov Angelov
Editor of the Handbook
Lancaster, UK
3 June 2022


Part I

Explainable AI


© 2022 World Scientific Publishing Company
https://doi.org/10.1142/9789811247323_0001

Chapter 1

Explainable Artificial Intelligence and Computational Intelligence: Past and Present

Mojtaba Yeganejou∗ and Scott Dick†
Department of Electrical and Computer Engineering, University of Alberta, 9211 116 Street, Edmonton, AB, Canada, T6G 1H9
∗ [email protected]
† [email protected]

Explainable artificial intelligence is a very active research topic at the time of writing this chapter. What is perhaps less appreciated is that the concepts behind this topic have been intimately connected to computational intelligence for our field’s entire decades-long history. This chapter reviews that long history and demonstrates its connections to the general artificial intelligence literature. We then examine how computational intelligence contributes to modern research into explainable systems.

1.1. Introduction

At the time of this writing in 2021, artificial intelligence (AI) seems poised to fundamentally reshape modern society. One estimate finds that half of all the tasks that workers perform during their jobs in North America could be automated (although only about 5% of jobs can be totally automated). This automation is expected to raise productivity growth back above 2% annually and add between $3.5 and $5.8 trillion of value per year to the economy [1]. We have reached the point where modern algorithms (e.g., deep learning) can exceed human performance in certain domains. However, they are “black boxes,” and their decision-making process is opaque, sometimes to the point of being incomprehensible even to human experts in the area. A famous example is the 19th move in the second game of the historic 2016 Go match between Lee Sedol, a grandmaster at the game, and AlphaGo [2] built by DeepMind. “It’s a creative move,” commented Go expert Michael Redmond. “It’s something that I don’t think I’ve seen in a top player’s game” [3].


It is human nature to distrust what we do not understand, and this leads us to a profound, even existential question for the field of AI: Why should humans trust these algorithms with so many important tasks and decisions? Simply put, if human users do not trust AI algorithms, they will not be used. Unless the question of trust is resolved, all of the progress of AI in the last several decades could be utterly wasted, as humanity at large simply walks away from the “AI revolution.” Research into trusted AI has shown that providing users with interpretations and explanations of the decision-making process is a vital precondition for winning users’ trust [4, 5], and numerous studies have linked trust and explanation with each other [6]. Various forms of intelligent systems are trusted more if their recommendations are explained (see [7–10] for more details). In this sense, interpretability can be defined as the degree to which humans comprehend the process by which a decision is reached. It can also be considered the degree to which a human expert could consistently predict the model’s result [11, 12]. From a philosophical standpoint, what does it mean for an AI to be opaque or transparent? One viewpoint, originating from [13], is that opacity of an algorithm refers to a stakeholder’s lack of knowledge, at a point in time, of “epistemically relevant elements.” Zednik [14] proposes that the epistemically relevant elements for AI are any portion of the AI algorithm, its inputs, outputs, or internal calculations that could be cited in an explanation. Zednik further points out that this definition of opacity is agent-specific and that different stakeholders (such as the stakeholder groups in [15]) may have different needs for understanding those elements. Plainly, opacity is also a matter of degree, as many stakeholders do have knowledge of at least the architecture and learning algorithm of even the largest deep neural network. 
We thus see that AI opacity could be defined as the degree of a given stakeholder’s lack of knowledge, at a point in time, of all possible explanatory elements for a given AI model. Explanations, then, are a means of redressing opacity for a given stakeholder by communicating an “adequate” set of those explanatory elements. The above suggests that one route to generating explanations might be to provide a trace of the inferential processes by which the AI reaches its decisions. However, researchers as far back as the 1970s have strongly argued that this is not adequate and that instead, explanations must provide the user with insight into the problem domain and corresponding justifications for the AI’s decisions. Furthermore, this also means that merely presenting an “interpretable” AI model to the user (e.g., a rulebase) is also insufficient; creating and presenting an explanation is a major system task in itself and needs a suitable interface [16]. In the remainder of this chapter, we will refer to this capability (and the component/modules and user experience elements supporting it) as the explanation interface (EI). As Miller points out, explanation is a social act; an explainer communicates knowledge about an explanandum (the object of the explanation) to the explainee;


this knowledge is the explanans [11]. In XAI, a decision or result obtained from an AI model is the explanandum, the explainee is the user who needs more information, the explainer is the EI of the AI, and the explanans is the information (explanation) communicated to the user [17]. Thus, the EI must interact with the user to communicate an explanans. Research into EIs also shows that they are generally designed to respond to only a few question types: what, how, what if, why, and why not. What questions ask for factual statements; these are fairly simple in nature. The others involve causal reasoning (relating causes and effects). The literature on education and philosophy also views the cognitive process of explanation as a close relative of causal reasoning; explanations often refer to causation, and causal analyses have been found to have explanatory value [18–23]. A number of researchers treat explanation as the development of a causal analysis for why some event occurred or how some system operates, e.g., [24, 25]. Miller contends that causal analyses answering how, what if, why, and why not questions are best approached using contrastive explanations; a counterfactual (something that did not occur) is chosen as the explanandum, and an explanans is generated [11]. (Note that counterfactuals for what if and why not questions are specified by the user.) A further analysis of counterfactuals is conducted in [26], where four fundamental usages of counterfactuals were identified. First, one could add or remove a fact from the current world state, and reason about the outcomes. Second, one could select a “better” or “worse” world state, and reason about how these might be reached. Third, they can be used to investigate possible causes of the current world state. Finally, they can aid in apportioning blame for the current world state. 
As above, the goal of an EI is to foster user trust; this obviously implies that XAI researchers must measure the extent to which their explanations improve user trust. However, the issue of accurately measuring user trust is a familiar problem from the information sciences literature. In technology adoption studies particularly, the act of trusting consists of using the technology. This is very difficult to study in the field, and laboratory experiments are necessarily limited in their fidelity to the real world. Hence, many studies employ an antecedent to actual use (the construct of Intent to Use, measured by psychometric scales), as a proxy for use of the technology [27, 28]. However, when the subjects’ trusting behaviors are compared against their responses on the scales, the two are often inconsistent. A relatively small number of XAI papers have also begun using a similar proxy, where trust in the AI is measured via psychometric scales [29]. These too have repeatedly been found to be inconsistent with users’ actual behaviors [30, 31]. Drawing these points together, Papenmeier [32] proposes three design requirements for empirical studies of user trust in AI: the experiment must present a realistic, relatable scenario for the subjects; there must be negative consequences for the AI’s errors; and the experiment must require the


subjects to perform an action, either following the AI’s recommendation or not. Paez also points out that in many situations, the decision to trust an AI or not must be made under time pressure, which should be simulated in an experiment [33].

An EI will also be necessary to achieve other AI system objectives, including [34]:

• Verifiability: An AI model will reproduce biases in the knowledge it is provided or the data it is trained on [35]; the model may also contain errors. Models must therefore be verified as being fit for purpose. However, this is only possible when the reviewers performing the verification can understand the AI models.
• Comparative judgments: Different models might perform equally well, but via very different decision-making processes. It is quite possible that some of those processes might be socially unacceptable, even though they might perform better on some benchmarks. It is only through the EI that those different processes can be revealed and evaluated.
• Knowledge discovery: AI models often identify relationships that humans would not be able to find. However, these discoveries can only have an impact beyond that AI if they can be comprehensibly presented to human analysts (as in the KDD framework [36]).
• Legal compliance: The European Union’s General Data Protection Regulation includes a “right to explanation,” granting any data subject the right to an explanation concerning any decision made about them by an algorithm [37, 38]. We find it unlikely that an EU judge and jury would accept “the neural network said so” as an adequate explanation if data subjects challenge those decisions in court.

Numerous scholars have also examined the impact of uninterpretable AI models on social justice and equity [39–51]. Unjust outcomes from AI decisions have repeatedly been observed in fields as diverse as criminal sentencing decisions, loan and insurance approvals, and hiring practices.
If the AI in these cases provided an explanans for these decisions, these outcomes may well have been changed. Data bias is another pernicious, significant issue; social problems such as gender and racial inequality can influence a corpus of primary sources used to train an AI in unexpected ways (see [42, 49, 58, 59] for more details). Sometimes these are blatant, other times more subtle; but review of the AI by experts using an EI seems to be one of the few defenses against this form of bias. The focus of this chapter is on the history of explanations in AI systems, and in particular how computational intelligence is bound up with that history. One point we wish to emphasize at the outset is that “explainable” AI—despite the recent flurry of activity under the banner of XAI—is in fact a research topic that has been studied for over 50 years. This is a large body of work that should be studied and


appreciated, lest hard-won lessons from those earlier years be lost. We need only look to the history of independent component analysis to appreciate that this is entirely possible [52]. In fact, the importance of interpretability has been recognized from the earliest days of intelligent system research [53], and one can easily identify multiple generations of approaches to the interpretability problem in the literature. The remainder of this chapter is organized as follows. In Section 1.2, we survey the history of expert systems and machine learning (including neural networks), and how explanation systems developed in concert with them. In Section 1.3, we examine several contemporary XAI approaches in depth. In Section 1.4, we examine how soft computing contributed to XAI in the past and present. We offer a summary and conclusions in Section 1.5.

1.2. History of XAI

At the AAAI conference in 2017, a panel discussed the history of expert systems, which had seen vast enthusiasm in the 1980s, before their limitations led to disappointment and the “AI Winter” of the 1990s [54, 55]. One of the panelists, Peter Dear, was a historian of science specializing in the Scientific Revolution of the 16th and 17th centuries. In his 2006 treatise, he describes science (in its present conception) as a hybrid of two goals, intelligibility and instrumentality. Intelligibility refers to the discovery of the fundamental nature of the universe, and the accounting of how this results in the world we experience around us. Instrumentality, on the other hand, is the search for a means to affect the world around us, manipulating it for our own purposes [56]. The difference between these two goals seems strikingly similar to a divide that has plagued XAI for decades. In the AI field, intelligibility is interpreted to mean explainability; it should be possible to understand why an AI came to the conclusion it did. Instrumentality, meanwhile, is interpreted as the AI’s quantified performance (via accuracy, ROC curves, etc.). One of the panel findings about expert systems (and AI in general, past and present) is that intelligibility and instrumentality are currently not complementary to each other in our field. Indeed, they are seen as conflicting goals, for which improvements in one must come at the expense of the other (this is also known as the interpretability/accuracy tradeoff [57]). AI experts such as Randall Davis recounted their experiences supporting this viewpoint during the panel [54]. A considerable amount of the history of XAI can be seen as a vacillation between systems emphasizing intelligibility and those focused solely on instrumentality. The former are largely made up of “symbolic” approaches to AI (e.g., expert systems, theorem provers, etc.), what Nilsson refers to as “good old-fashioned AI” [55].
Such systems are commonly designed around a known base of expert knowledge, which is strongly leveraged in building solutions. Instrumentality is not neglected


Figure 1.1: History of XAI.

in these systems, but neither is it their sole objective; they are relatively transparent to the human user. The latter, on the other hand, are largely composed of “nonsymbolic” approaches such as statistical learning and neural networks, which figure prominently in pattern recognition and signal processing. These systems emphasize excellence in performing a task, but are not easily intelligible (and may in fact be unintelligible), thus requiring a post hoc accounting of their decision processes. In Figure 1.1, we summarize some highlights of the development of these two approaches to AI, and their effect on explainability research. The remainder of this section will discuss this material in depth.

1.2.1. Symbolic AI and Expert Systems

A symbol in the AI sense is a meaningful entity, in and of itself. It can also be combined into more complex aggregations with other symbols following regular rules; entities of this type were also sometimes referred to as physical symbols [58]. Symbolic processing has been at the heart of AI from the very beginning; e.g., the Logic Theorist (the first automated theorem prover, published in 1956) was a symbolic system [59]. Indeed, one of the most famous conjectures in AI, the Physical Symbol System hypothesis, holds that

A physical symbol system has the necessary and sufficient means for general intelligent action [60].

In this hypothesis, a physical symbol system is an abstract machine that operates on collections of symbols and, over time, modifies them or produces new ones.


Thus, this hypothesis asserts that symbolic processing is the necessary and sufficient condition to produce what we refer to as “intelligence” [55]. A variety of symbolic AI systems were produced following the development of the Logic Theorist (see [55] for an in-depth review). One of these, the Dendral system, became the forerunner of the expert systems field. It was originally designed to assist chemists in interpreting mass spectrometer data. In its original form, Dendral simply generated all the possible (acyclical) sub-structures of a molecule from its chemical formula and some other basic information. An expert chemist could then compare that dictionary of structures against actual mass spectrometer data to determine which structures were most likely being caught by the instrument [55]. Feigenbaum et al. suggested that this dictionary could be augmented with knowledge elicited from expert chemists to constrain the search to the most likely sub-structures. This knowledge was encoded in inference rules with an IF–THEN structure, potentially with a fairly complex antecedent expression. The resulting program was called Heuristic-Dendral [61], and is widely considered the first expert system to have been developed; its descendants are still used by chemists today. This research led Feigenbaum to advance the knowledge principle: that the problem-solving power in an expert system arises principally from the quality of the elicited knowledge it contains, and has little to do with the inference mechanisms applied to it [55]. Interestingly, the Dendral project also incorporated an explicit test of the instrumentality of its predictions. A simulation of a mass spectrometer was designed, which accepted predicted structures as its inputs. The simulation results for each structure were then compared with the actual output of a mass spectrometer; a close match indicated that the predicted structures were probably correct.
This was the first time that such a confirmatory step was included in an AI system [55]. Following on from the Dendral project, work began on an AI system that would use expert knowledge to diagnose bacterial infections and recommend treatment plans. This was the MYCIN system, which laid down several design approaches that deeply influenced subsequent research. Firstly, knowledge to be stored in the system was elicited from known experts in the field, in the form of IF–THEN rules (inspired by their successful use in Dendral) for diagnosing infections and treating them. Second, uncertainty was explicitly represented in the system (using the ad hoc construct of “certainty factors”). Third, inferences were executed by backwards chaining. Fourth, the IF–THEN rules were separated from the inference routines (unlike in Dendral), dividing the architecture into a knowledge base and an inference engine [62]. This last point would help spawn an entire industry. If the knowledge in an expert system is entirely modularized into the knowledge base, then applying the expert system to a different problem would simply require changing out the


knowledge base. The EMYCIN project [63] attempted just this, using a single inference engine and rules elicited from domain experts to construct expert systems in a range of different fields of study. By the late 1980s, a wide variety of commercial expert systems were in use, some based on EMYCIN and others not; 130 such systems are identified in [64].

Now let’s examine how these symbolic systems provide explanations to the user about a decision they reached. As discussed earlier, while it seems simple to just output the set of rules that affected the system’s decision, such traces have been found to be ineffective as an explanans; in the end, the explainee does not gain an adequate grasp of the explanandum. Expert system designers were well aware of this point; MYCIN was designed with an explanation system that could respond to “how” and “why” questions with English-language answers [55, 65]. This intuitively seems like a step forward, but what does it tell us about XAI in general? Indeed, one of the principal theoretical questions for XAI is just what constitutes a “good explanation”; a great many answers have been proposed over the years. Swartout and Moore [66] proposed five desirable characteristics for explainable AI:

(1) Fidelity: An explanans must reflect the actual processes used to reach the decision.
(2) Understandability: An explanans must be comprehensible to the system’s users.
(3) Sufficiency: The explanans must provide enough detail to justify and defend the decision.
(4) Low Overhead: The design effort expended on the EI should be proportionate to the total design effort needed for the whole AI system.
(5) Efficiency: The execution time of the AI system should not be substantially increased due to the EI.

Plainly, some of these desired characteristics are conflicting.
For instance, if our standard of sufficiency is adequate detail to withstand a lawsuit, then the explanation will be very lengthy and detailed; but it is known that lengthy, detailed explanations are less understandable [66]. High-fidelity explanations can also be poorly understandable; again, due to the complexity of the explanation. Chakraborty et al. [67] argue that interpretability is multidimensional, incorporating, e.g., explainability, accountability, fairness, transparency, and functionality. The authors also consider the need to rationalize, justify, and understand systems and users’ confidence in their results. Doshi-Velez and Kim [68] distinguish explanation from formal interpretability, and suggest fairness, privacy, reliability, robustness, causality, usability, and trust as core characteristics of XAI. Some authors divide expert systems from the 1970s and ’80s into a first and second generation, based principally on how explanations are created and communicated to the user. In the first generation, systems such as MYCIN


re-expressed the set of rules used to reach a decision as statements in natural language. However, the results (especially when a number of rules were invoked) became linguistically unnatural. Moreover, this was only a “surface” explanation; the explanation subsystem had no ability to provide deeper insight or the general principles governing the decision [66]. In second-generation systems, researchers sought to address both shortcomings. They focused on making explanations sensitive to knowledge about the user, their history and goals, and the problem domain. Some second-generation systems were also self-improving [69]. Thus, instead of merely parroting the expert system rules, the EI drew upon computational linguistics and natural language understanding (itself a field of AI) to formulate a more responsive and personalized explanans. To do so, additional information about the world model that the rules represent was also added; this may take the form of metarules (as in NEOMYCIN [70]); abstract strategies for problem-solving (as in Generic Tasks [71]), explicitly representing the system’s design in a knowledge base (as in the Explainable Expert System [72]), or by designing a second, separate knowledge base specifically for the explanation subsystem (as in the Reconstructive Explanation system [73]) [66]. The XPLAIN system [74, 75] incorporated both additional domain knowledge and system design knowledge. An interesting point about these last two systems is that an explanans might not exhibit fidelity to the AI’s decision process. This might be unintentional in Reconstructive Explanations, but the use of “white lies” to “improve” the explanans was an observed outcome in the XPLAIN system.
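
To make the first-generation approach concrete, the following is a minimal sketch of a backward-chaining rule engine that answers a "how" question by re-expressing the rules it fired. It is an illustration only, not MYCIN's actual code: the rule names, facts, and certainty values are hypothetical, the rule base is assumed to be acyclic with at least one premise per rule, and the certainty handling is simplified (minimum over the premises, scaled by the rule's factor, keeping only the strongest supporting rule).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    premises: tuple   # facts that must all hold (the IF part)
    conclusion: str   # fact established by the rule (the THEN part)
    cf: float         # certainty factor attached to the rule, in (0, 1]


# Hypothetical knowledge base, kept separate from the inference routine.
RULES = [
    Rule("R1", ("gram_negative", "rod_shaped"), "enterobacteriaceae", 0.8),
    Rule("R2", ("enterobacteriaceae", "lactose_fermenter"), "e_coli", 0.7),
]

# Hypothetical observations with the certainty the user assigns to each.
FACTS = {"gram_negative": 1.0, "rod_shaped": 0.9, "lactose_fermenter": 1.0}


def prove(goal, facts, rules):
    """Backward-chain from `goal`; return (certainty, list of (rule, certainty))."""
    if goal in facts:
        return facts[goal], []
    best_cf, best_trace = 0.0, []
    for rule in rules:
        if rule.conclusion != goal:
            continue
        # Each premise becomes a sub-goal (assumes no cyclic rules).
        results = [prove(p, facts, rules) for p in rule.premises]
        cf = rule.cf * min(c for c, _ in results)
        if cf > best_cf:
            sub_steps = [step for _, trace in results for step in trace]
            best_cf, best_trace = cf, sub_steps + [(rule, cf)]
    return best_cf, best_trace


def explain_how(goal, facts, rules):
    """Answer a 'how' question by restating the fired rules as sentences."""
    cf, trace = prove(goal, facts, rules)
    lines = [
        f"{rule.name}: IF {' AND '.join(rule.premises)} "
        f"THEN {rule.conclusion} (certainty {c:.2f})"
        for rule, c in trace
    ]
    return cf, lines


certainty, explanation = explain_how("e_coli", FACTS, RULES)
for line in explanation:
    print(line)
```

The output is exactly the kind of "surface" explanation criticized above: it lists which rules fired, but says nothing about why those rules are justified in the problem domain.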

1.2.2. Non-Symbolic Learning and XAI

Non-symbolic approaches, such as machine learning, have been a part of AI for at least as long as symbolic ones. The McCulloch–Pitts artificial neuron, often cited as the beginning of the neural networks field, was developed in 1943 [76]. The first learning rule for neural networks, due to Hebb, was proposed in 1949 [77], and Rosenblatt’s perceptron architecture was published in 1958 [78]. Separately, Samuel’s automated checkers program was enhanced with machine learning concepts, including the minimax search of a move tree and an early exploration of what would become temporal difference learning, in 1955 [79]. Statistical machine learning has also established itself as a key part of artificial intelligence. Statistical methods for pattern recognition have been a part of AI since the 1950s, principally centered around Bayesian reasoning. Later, there were also limitations observed with the certainty factors used in MYCIN. As rulebases grew larger, it became clear that the ad hoc assignment of these factors was creating inconsistencies and conflicts within a rulebase, which had to be tediously hand corrected. The addition of another rule would, furthermore, often undo the delicate


balance of the existing rules. In response, AI designers began incorporating statistical concepts (and Bayes’ rule in particular) in the design of expert systems [55]. Additional proposals for dealing with uncertainty, around this time period, included the Dempster–Shafer theory of evidence [80] and fuzzy logic [55, 81]. An important elaboration of these first methods in the 1970s was the development of graph structures that encode causal links between different events in an AI system. The so-called Bayesian networks allowed AI designers to encode their knowledge of a domain in a graph structure; with the links denoting causality, one can then simplify the use of Bayes’ rule. A Bayesian network can also be inductively learned from data. One notable point is that Bayesian networks were designed for tabular (i.e., static) data, rather than a data stream. Time-varying data, on the other hand, was the focus of the hidden Markov models first designed for the DRAGON speech-recognition system [55]. This period in the 1970s was also the first neural network “winter,” after interest in the perceptron algorithm collapsed following Minsky and Papert’s analysis that it could only ever solve linearly separable problems [82]. However, some work did continue. In particular, Fukushima’s Neocognitron architecture [83] laid the foundations for the convolutional neural networks developed by LeCun [84], which, in their various elaborations, currently dominate signal-, image-, and video-processing research. Interest in neural networks also began picking up in the late 1970s and quickened after Rumelhart et al. published their backpropagation algorithm [85], which demonstrated a general solution to the credit assignment problem that had bedeviled previous research into multilayer neural networks. At roughly the same time, Hopfield’s associative memory [86] invigorated research into recurrent neural networks [55].
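
The linear-separability limitation of the single-layer perceptron mentioned above can be illustrated with a small sketch. The code below assumes a textbook form of the perceptron learning rule with a threshold output; the function names, learning rate, and epoch limit are illustrative choices, not taken from the cited works. Trained on the logical AND function (linearly separable), the rule reaches zero errors; trained on XOR (not linearly separable), no choice of weights can classify all four points, so at least one error always remains.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if weights[0] * x[0] + weights[1] * x[1] + bias > 0 else 0


def train_perceptron(samples, epochs=100, lr=1.0):
    """Textbook perceptron rule; stops early once an epoch makes no mistakes."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            delta = target - predict(weights, bias, x)
            if delta != 0:
                # Move the decision boundary toward the misclassified point.
                weights[0] += lr * delta * x[0]
                weights[1] += lr * delta * x[1]
                bias += lr * delta
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def count_errors(samples, weights, bias):
    return sum(predict(weights, bias, x) != target for x, target in samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_errors = count_errors(AND, *train_perceptron(AND))
xor_errors = count_errors(XOR, *train_perceptron(XOR))
print("AND errors:", and_errors)
print("XOR errors:", xor_errors)
```

Adding a hidden layer removes this restriction, which is why a workable training method for multilayer networks mattered so much to the field's revival.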
The resurgence of interest in neural networks led to a flowering of new ideas. Numerous approaches to speeding up the basic backpropagation algorithm were proposed, from simple approaches such as adding momentum terms to the backpropagation weight update [87], to more complex proposals involving computing second derivatives of the error surface (e.g., conjugate gradient algorithms [88], the Levenberg–Marquardt algorithm [89]). Domain knowledge was used to initialize NNs to place them near a strong local minimum in, e.g., the KBANN system [90], while trained networks were pruned to improve their generalization performance, e.g., the Optimal Brain Surgeon algorithm [91]. Entirely different approaches to NN design, often dealing with completely different problems, were also proposed; kernel-based approaches such as radial basis function networks [92] might compete with multilayer perceptrons in classification or function-approximation tasks, but neither addresses the blind-source separation problem that independent components analysis [93] tackles. Information-theoretic learning also gives us the Infomax principle [94], while Q-learning networks [95] and Boltzmann machines [96] grow out of statistical learning theory and Markov processes [87].
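
As an illustration of the simplest speed-up mentioned above, the sketch below compares plain gradient descent with a momentum variant on a one-dimensional quadratic loss. It assumes the classical "heavy-ball" form of the update (a velocity that accumulates past gradients); the loss function, learning rate, and momentum coefficient are illustrative assumptions and not the formulation of any specific cited work.

```python
def loss(w):
    """Toy quadratic loss with its minimum at w = 0."""
    return 0.5 * w * w


def grad(w):
    """Gradient of the toy loss."""
    return w


def descend(w, lr, momentum, steps):
    """Gradient descent with a momentum term; momentum=0.0 gives the plain rule."""
    velocity = 0.0
    for _ in range(steps):
        # The new step blends the previous step with the current gradient.
        velocity = momentum * velocity - lr * grad(w)
        w += velocity
    return w


w_plain = descend(1.0, lr=0.01, momentum=0.0, steps=100)
w_momentum = descend(1.0, lr=0.01, momentum=0.9, steps=100)
print("plain:   ", loss(w_plain))
print("momentum:", loss(w_momentum))
```

With the same small learning rate and the same number of steps, the momentum run ends at a much lower loss in this toy setting, because consecutive gradients point the same way and their contributions accumulate in the velocity.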


June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Explainable Artificial Intelligence and Computational Intelligence: Past and Present


While NN algorithms generally were the state of the art in machine learning during the late 1980s and early ’90s, the black-box nature of NNs made the problem of user trust substantially worse [55]. Simultaneously, it also made the problem of generating an explanans significantly harder; there was now no knowledge base of expert rules to begin from, only an unintuitive weighted graph. Thus, XAI in this period focused on post hoc explanations, under the banner of knowledge extraction. An influential 1995 survey [97] proposed a taxonomy for this field of study based on five criteria. They were originally oriented toward rule extraction, as opposed to other representational formats, but were later generalized to capture others. The criteria were [98]:

(1) Rule Format. This originally captured the distinction between a rulebase based on Boolean propositional logic, one based on Boolean first-order logic, and one based on fuzzy logic. In the more general form, this criterion distinguishes between representational forms for the knowledge extracted from the network.

(2) Rule Quality. This is a multidimensional quantity, capturing the (out-of-sample) accuracy of the rulebase, its fidelity to the NN, its consistency, and its comprehensibility (through the proxy of counting the number of rules and rule antecedents). This is perhaps the most controversial of the five criteria, and there have been multiple alternative proposals for defining and measuring the “quality” of extracted rules (e.g., [99, 100]), to say nothing of evaluating other representations.

(3) Translucency. This criterion focuses on the granularity of the knowledge extraction algorithm. At one extreme, the decompositional level extracts rules from each individual neuron; at the other, only the input/output behavior of the whole network is studied. Approaches falling between these extremes were originally labeled “eclectic,” although later work on alternative representations suggested “compositional” instead [101].

(4) Algorithmic Complexity. As with earlier criteria, the complexity of an EI is a vital concern.

(5) Portability. This criterion is something of a misnomer, as the EIs at this time were not truly portable between one NN and another. Rather, it captures the degree to which the NN itself employed an idiosyncratic learning algorithm, or alternatively, was trained following one of the dominant approaches such as backpropagation. More generally, this criterion focuses on the assumptions underlying an extraction algorithm.

Despite investigations of other representations, in this period the dominant form of an explanans for a trained NN remained IF–THEN rules (be they based on propositional, first-order, or fuzzy logic). This is ascribed to the popularity of feedforward networks and the familiarity of rule-based expert systems [98].
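As a rough illustration of the Rule Quality criterion, the sketch below computes accuracy, fidelity, and the rule-count proxy for comprehensibility on toy data. The function names, the toy labels, the hypothetical rulebase, and the definition of fidelity as the fraction of instances on which the rulebase agrees with the NN are assumptions made for illustration; they are not taken from [97] or [98].

```python
import numpy as np

def accuracy(rule_pred, y_true):
    """Fraction of instances where the rulebase matches the true label."""
    return float(np.mean(np.asarray(rule_pred) == np.asarray(y_true)))

def fidelity(rule_pred, nn_pred):
    """Fraction of instances where the rulebase matches the NN's output
    (assumed definition of fidelity for this sketch)."""
    return float(np.mean(np.asarray(rule_pred) == np.asarray(nn_pred)))

def comprehensibility_proxy(rulebase):
    """Count rules and antecedents; rulebase is a list of
    (antecedents, consequent) pairs (hypothetical representation)."""
    n_rules = len(rulebase)
    n_antecedents = sum(len(antecedents) for antecedents, _ in rulebase)
    return n_rules, n_antecedents

# Toy data (assumed): held-out labels, NN predictions, rulebase predictions.
y_true    = [1, 0, 1, 1, 0, 0]
nn_pred   = [1, 0, 1, 0, 0, 1]
rule_pred = [1, 0, 1, 0, 0, 1]

rulebase = [(("x1 > 0.5", "x2 <= 0.2"), 1), (("x3 > 0.7",), 0)]

acc = accuracy(rule_pred, y_true)
fid = fidelity(rule_pred, nn_pred)
n_rules, n_antecedents = comprehensibility_proxy(rulebase)
print(acc, fid, n_rules, n_antecedents)
```

In this toy case the rulebase reproduces the NN perfectly while inheriting its two errors, which is one reason the two measures are reported separately.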


Furthermore, the bulk of these rule-extraction algorithms were decompositional (extracting rules for individual neurons and then pruning the resulting rule set), and produced rules expressed in Boolean logic [98, 101]. Some of the earliest rule-extraction algorithms, e.g., [102–104] (the latter two focusing on specialized NN architectures), belonged to this group. In particular, the KBANN algorithm [103] employed “M-of-N” rules, which fire if at least M out of a total of N antecedents are true. Decompositional algorithms continued to be developed, including [105–108]. Other authors, in the meantime, focused on the “pedagogical” approach: treating the NN as a black box and deriving rules by observing the NN’s overall transfer function. The first of these [109] even predates the decompositional papers, but encountered a rule explosion problem even with heuristics to limit the search space. It is extremely difficult to constrain the search space of antecedent combinations for rules without additional information from the structure of the network. As the number of rules is based on the power set of antecedent clauses, and realistic neural networks have a substantial number of input dimensions, a rule explosion results (i.e., we again run into the curse of dimensionality). One important note is that these algorithms generally induced rules that correspond to an axis-parallel hyperbox in feature space. While such structures are computationally easier to generate, they do tend to be inefficient at capturing concepts (i.e., clusters in feature space) that are not conveniently in an axis-parallel orientation [97, 98, 110]. Rule refinement is a variation of rule extraction, in which the NN is first initialized using a rulebase and then trained on a dataset from that domain. The rules then extracted from the trained NN are expected to be a more “refined” model compared to the rules used to initialize it.
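The M-of-N rule format and the power-set growth mentioned above can be made concrete with a small sketch. The function names are hypothetical and this is not the KBANN implementation; it only evaluates an M-of-N condition and counts how many candidate antecedent subsets, or equivalent conjunctive rules, would have to be considered.

```python
from math import comb

def m_of_n_fires(antecedents, m):
    """An M-of-N rule fires if at least m of the boolean antecedents are true."""
    return sum(bool(a) for a in antecedents) >= m

def count_antecedent_subsets(n):
    """Non-empty subsets of n antecedent clauses (power set minus the empty set)."""
    return 2 ** n - 1

def equivalent_conjunctions(n, m):
    """Number of m-antecedent conjunctions whose disjunction is equivalent to
    'at least m of n antecedents are true'."""
    return comb(n, m)

fires = m_of_n_fires([True, False, True], m=2)   # two of three antecedents hold
subsets_10 = count_antecedent_subsets(10)        # search space for 10 inputs
subsets_100 = count_antecedent_subsets(100)      # search space for 100 inputs
conj_3_of_5 = equivalent_conjunctions(5, 3)
print(fires, subsets_10, conj_3_of_5)
```

A single 3-of-5 rule stands in for ten conjunctive rules here, which suggests why a compact format is attractive when the number of antecedent combinations grows exponentially with the number of inputs.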
Obviously, this requires a neural network architecture that can be initialized using rules; KBANN [90] is a classic example. Others include feedforward networks [111] and the refinement of regular grammars in recurrent neural networks [112]. A number of other representations for an explanans have also been explored, although not in as great a depth. For feedforward networks, the next most important representation appears to have been decision trees [97, 113, 114]. There was also work on extracting rules covering polyhedral regions in feature space (allowing for concepts that are not axis-parallel) [115]. One variation on this work was Maire’s algorithm for backpropagating a classification region from the network’s output domain back to the input domain [116]; there is substantial conceptual similarity between this work and the more modern approach of generating saliency maps as an explanans (as will be discussed in Section 3). Sethi et al. generated decision tables as the explanans in [117]. Recurrent networks require different representations in order to capture the effect of their feedback loops (which IF–THEN rules do not represent efficiently). The dominant representation in this case appears to have been either deterministic or fuzzy finite state automata [112, 118]. We should also mention


an early work that directly treats XAI as a visualization problem [119]. Visualizations (and again, saliency maps in particular) are today a common representation of an explanans.

Returning to the five criteria from [97] discussed above, the question of algorithmic complexity in rule extraction has been investigated in more depth. A number of results show that conceptually useful algorithms in this area are NP-hard, NP-complete, or even exponentially complex. A sampling of these results includes:

• Searching for rule subsets in a decompositional approach is exponentially complex in the number of inputs to each node [120]. This is a crucial step for this class of algorithms, as the fundamental approach of decomposition is to first generate rules at the level of each neuron and then aggregate them into a more compact and understandable rulebase.
• Finding a minimal Disjunctive Normal Form expression for a trained feedforward network is NP-hard [120].
• The complexity of the query-based approach in [121] is greater than polynomial [120]. That algorithm attempted to improve upon the runtime of search-based algorithms with pedagogical translucency (black-box algorithms) by using queries.
• Determining if a perceptron is symmetrical with respect to two inputs is NP-complete [106]. This technique would allow the weights on the two inputs to be revised to their combined average, simplifying rule extraction from that node.
• Extracting M-of-N rules from single-layer NNs is NP-hard [120].

These are general results, and NNs specifically designed for rule extraction can have far more tractable complexity. For instance, the DIMPL architecture, and ensembles thereof, supported rule extraction in polynomial time by modifying the structure and activation functions of a multilayer perceptron to force all learned concepts to form axis-parallel hyperboxes [122]. A more recent variation of DIMPL that creates an explainable convolutional NN was proposed in [123].

1.2.3. The AI Winter and XAI

In the late 1980s, there was a sudden, rapid decline in the funding support for, and industrial application of, expert systems generally. The underlying cause is widely agreed to have been disillusionment with AI when it could not meet the lofty promises being made by AI researchers [124]. We can see this as the trough of a hype cycle; this is a concept advanced by Gartner Research that sees technology adoption as a five-phase process. After a technological development or discovery, the second phase is intense publicity and interest, with a significant boost in funding; proponents


often offer speculative statements on the ultimate potential of the technology during this period. The third phase is the trough of disillusionment, where the interested audience from phase 2 loses faith in the technology because it does not achieve the lofty goals being promised by its proponents. Industrial adoption stagnates or falls, and research funding is sharply reduced. Successful technologies weather this trough, and more practical and effective applications are developed, eventually winning the technology a place of acceptance in science and engineering.1 Funding for AI in the early to mid-1980s was driven by major national technology development efforts: Japan’s Fifth Generation Computing initiative, the USA’s Strategic Computing Initiative, and the UK’s Alvey project all focused on AI as a core technology for achieving their goals [55]. At the same time, major industrial investments were made in AI following Digital Equipment Corporation’s successful use of the XCON expert system in its operations. One particular aspect of this investment was the LISP Machine, a specialized computer architecture for executing programs in the LISP programming language, which was the dominant approach for programming expert systems at the time. In 1987, the LISP Machine market abruptly collapsed, as less expensive general-purpose workstations from, e.g., Sun Microsystems were able to run LISP programs faster; LISP Machine vendors were simply not able to keep pace with the advance of Moore’s law. The expert systems themselves were also proving to be brittle and extremely difficult to maintain and update; even XCON had to be retired as it did not respond well to the changing markets that Digital was addressing [124]. Simultaneously, the Fifth Generation project and the Strategic Computing Initiative were wound down for not achieving their goals [55]. Neural networks research also suffered a major slowdown, beginning several years after the LISP Machine collapse. 
The first major blow came in a 1991 PhD dissertation [125] that comprehensively examined the vanishing gradient problem (and its dual of exploding gradients). NN researchers had for years noted that backpropagation networks with more than two hidden layers performed quite poorly, even though biological neural networks had a large number of layers. It seemed that the magnitude of the error gradients at each successive layer was dropping significantly. Hochreiter [125] confirmed this observation and revealed that an exponential decay of gradient magnitudes was a consequence of the backpropagation algorithm itself. Being limited to just a shallow network of about two hidden layers was a significant constraint for the field, made critical by the emergence of the soft-margin algorithm for training support vector machines (SVMs) in 1995 [126]. From this point forward, SVMs and related kernel-based methods became the dominant

1 https://www.gartner.com/en/research/methodologies/gartner-hype-cycle


learning algorithms for pattern recognition in the late 1990s and early 2000s, with Random Forests a close competitor; see, for instance, [127, 128]. The research into explanations for expert systems and NNs also declined during their winter periods. However, there was a continued interest in explanations for kernel-based learning and SVMs in particular. The reason is simple: while the learning algorithm at the heart of the SVM is just another linear separation, the separation takes place in the induced feature space arising from a nonlinear kernel transform of the input space (commonly to a higher dimensional space in which Cover’s theorem tells us that patterns are more likely to be linearly separable). The transform is, furthermore, never explicitly computed, but implicitly represented in the pairwise distances between observations recorded in the Gram matrix [87]. Thus, a trained SVM is as much a black box as a trained NN. A review in 2010 [129] found that explanations for SVMs were still an active research topic at that time, even as the deep learning revival was already bringing the neural network and AI winters to an end and re-igniting the need for explanations of NNs [130]. Explanations for other forms of machine learning, such as probabilistic reasoning [131] and Bayesian networks [132], also remained active. Results from second-generation expert systems were also applied in agent-based systems [133, 134], explainable tutoring systems [135, 136], and case-based reasoning [137, 138].
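To make the point about the implicit kernel transform concrete, the following sketch builds a Gram matrix directly from pairwise squared distances, without ever computing the feature-space mapping. The choice of a Gaussian (RBF) kernel, the parameter name `gamma`, and the toy data are assumptions for illustration; the text above does not commit to a particular kernel.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the rows of X.
    The feature map itself is never formed; only pairwise distances are used."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)

# Three toy observations in a 2-D input space (assumed data).
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
K = rbf_gram(X, gamma=0.5)
print(K)
```

Everything a kernel machine learns is expressed through entries of a matrix like `K`, which is part of why its decisions are hard to read off in terms of the original input features.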

1.2.4. XAI Resurgence

Since the mid-2010s, a new generation of XAI systems (now actually called XAI) has emerged in response to the renewed interest in NNs and particularly deep neural networks (DNNs). A large and growing literature now exists on visualizing, comparing, and understanding the operation of DNNs [12, 139–145]. As one might expect, a driving force in these developments is (again!) the need to engender trust from AI users [4, 146–151]. Once again, the U.S. government (through the Defense Advanced Research Projects Agency [DARPA]) is funding substantial volumes of research in this field, launching their 4-year Explainable Artificial Intelligence program in 2017 [152]. Modern XAI is expected to take advantage of the huge strides made in computing power and technology, data visualization, and natural language processing to produce superior explanans. For example, argumentative dialogues are a class of EI that simply did not exist in first- and second-generation systems, but are possible now [146, 153]. Contrastive explanations also seem much more common today, be they occlusions in an image rather than saliency maps [145], model ablation [141], comparisons across datasets [140], or counter-examples [154, 155]. For all these advances, however, there seems to have been a forgetting of much of the progress made in previous generations of XAI. Pedagogical approaches seem to be the norm, with saliency maps from the original input domain presently a


dominant representation. Compositional and decompositional approaches, on the other hand, are much rarer [156]. Likewise, the goals of the current EIs often appear to be simply accounting for the current decision made by an AI (the literature refers to this as a local explanation). By contrast, global explanations account for the sum total of knowledge learned by a DNN and can help tie together individual decisions into an appreciation of the DNN’s world model [4, 73, 157]. This also assists with other possible goals for an EI: understanding future decisions, trust formation, and debugging and validating the AI, among others. Critically, an EI must respond to the needs of the stakeholders it is designed for; in other words, it must be engineered for human users [158]. Furthermore, in the modern world, users must expect to interact with AIs over an extended period of time; hence, the AI will need to be cognizant of not only the user’s goals, knowledge, and experience, but also of how these have been affected by the AI and its prior explanations [159].

1.3. Methods for XAI

We now examine modern XAI. We first examine the issue of trust in an XAI, and in particular, the formation and maintenance of user trust; after all, gaining user trust is the raison d’être of the XAI field. We then examine four current approaches to generating an explanans in depth. Our discussion covers layer-wise relevance propagation (LRP) [160], local interpretable model-agnostic explanations (LIME) [4], testing with concept activation vectors (TCAV) [161], and rule extraction with meta-features [162].

1.3.1. Trust in XAI Systems

The starting point for our discussion is necessarily to define what trust means for an AI. This is nontrivial, as trust is a vague and mercurial concept that many fields have grappled with [163]. There are thus a number of definitions of “trust” in the XAI literature, which are more or less similar to the better-studied constructs of “trust in automation” or “trust in information technology.” There is of course the classic meaning of trust as a willingness to be vulnerable to another entity [164]; as pointed out by Lyons et al. [165], there is a correlation between how vulnerable the user is to the entity (i.e., the level of risk) and how reliant they are on it [166, 167]. There also seems to be a tendency for the most automated systems to pose the greatest risk [168]. In a definition more directly aimed at autonomous systems, Lee and See define trust as an “attitude that an agent will help achieve an individual’s goals in a situation characterized by uncertainty and vulnerability” [169, 170]; this seems to be the most commonly adopted definition in the XAI literature. Other papers make use of more traditional definitions of trust, such as from Mayer et al.’s 1995 work on


organizational trust: “the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other will perform a particular action important to the trustor, irrespective of the ability to monitor or control that other party” [164]. Israelsen and Ahmed define it as “a psychological state in which an agent willingly and securely becomes vulnerable, or depends on, a trustee (e.g., another person, institution, or an autonomous intelligent agent), having taken into consideration the characteristics (e.g., benevolence, integrity, competence) of the trustee” [171]. Others take a more behaviorist slant; trust was “the extent to which a user is confident in, and willing to act on the basis of, the recommendations, actions, and decisions of an artificially intelligent decision aid” in [172]. Trust, as an attitude or belief, does not exist in a vacuum, and is not extended willy-nilly. Trust formation is, instead, a complex psychological process. The information systems literature conceives trust as being a psychological state that is influenced by a number of precursors. Researchers studying trust in some specific situations must therefore define the conceptual framework of factors that influence trust within that context. For XAI specifically, an explanans is then hypothesized to influence one or more of those antecedents to trust. One of the most commonly used frameworks in XAI research postulates ability, benevolence, and integrity (the so-called ABI model) as major factors influencing user trust in AI [164, 173] (the ABI+ variant adds the fourth factor of predictability [27]). The constructs of the technology adoption model (TAM) [28], particularly ease of use and usefulness, are also frequently used to frame XAI research [174]. Trust is also influenced by usefulness of the explanation and user satisfaction in, e.g., [174]. 
A few other models have been put forward; for instance, Ashoori and Weisz propose seven factors influencing trust in an AI; trust itself was a multidimensional construct, measured by five psychometric scales [170]. We also need to consider how humans interact with, and develop trust in, AI systems. Dennett proposed that people treat unknown entities as objects whose properties can be abstracted at various levels; he proposes three distinct ones. The physical level applies to relatively simple concrete objects that we can understand via the laws of physics. The design level is for highly complex objects (living creatures, intricate machinery, etc.) that humans simply cannot analyze in enough depth to reduce the actions to physical laws. Instead, understanding must come from an abstracted model of how the object functions. The intentional level presumes that the object has beliefs and/or desires and will deliberately act to satisfy them. Understanding is thus achieved from the comprehension of these beliefs and desires [175]. Now let us apply the above framework to AI. An AI is not driven by the laws of physics at any level beyond computing hardware; indeed, the layered architecture of modern computer systems means that physical-level understanding of a program is simply nonsensical. Similarly, the design of an AI is simply too complex to produce


a useful understanding. Thus, by default, people choose to treat it as an intelligence (as its name implies) and seek to understand and develop trust in the AI at the intentional level. They will seek to determine its beliefs and desires and extend trust based on that understanding [176]. A considerable amount of research bears out this conclusion. Participants in user studies treated autonomous systems as intentional beings in, e.g., [177–180]. People were found to respond to a robot’s social presence (“a psychological state in which virtual (para-authentic or artificial) actors are experienced as actual social actors” [181]) in [182]. It seems reasonable to say that the main goal of studies of emotion and affect in robotics is to improve user trust in the robot; the underlying assumption must therefore be that users are again treating the robot as an intentional being. Furthermore, as pointed out in [183], “humans have a tendency to over interpret the depth of their relationships with machines, and to think of them in social or moral terms where this is not warranted.” However, these findings are not universal. For instance, the results from [184] indicate that employees prefer co-workers who are real, trustworthy human beings. AI is only entrusted with low-risk, repetitive work [170, 184]. As one study participant stated, “I don’t think that an AI, no matter how good the data input into it, can make a moral decision” [170].

1.3.2. Layer-wise Relevance Propagation

In the last half-decade, saliency maps have become a preferred explanans for problem domains in which DNNs are applied to image data. A saliency map is a transform of a digital image X, which has been classified by a trained DNN, into a new image SM of equal dimensions with the property that, for every pixel X(i, j) of the original image, SM(i, j) represents how “important” that pixel is in generating the overall classification of X. In other words, SM is a heat map indicating the importance of each pixel in reaching a classification decision. For a wide variety of datasets, researchers have found that the pixels that are “important” to a DNN classifier tend to group together into patches covering significant, meaningful regions or objects in the image X. Thus, presenting the heat map to a user provides an intuitively comprehensible rationale for the DNN’s classification decision. LRP [160] is perhaps the best known of the algorithms for generating saliency maps. Consider an input image X consisting of V pixels. The transfer function f of a trained DNN maps the image X to an output y:

$$f : \mathbb{R}^V \to \mathbb{R}^1, \quad f(X) \ge 0, \quad y = f(X). \tag{1.1}$$

Without loss of generality, we will assume that y is a real-valued scalar. Given the predicted values f(X), LRP defines an ad hoc measure of “importance,” termed relevance in [160]. The idea is that, for a given input image, there is a total amount of relevance, which remains constant as the input image is transformed in each layer. The outputs of the network thus contain the same total amount of relevance that the input image contained, albeit more concentrated. LRP thus initializes the relevance at the output layer as the network output (or the sum of all outputs $\sum_j y_j$ for the multiple-output case). Relevance is then propagated to the input layer, under the constraint that the total relevance at each layer is an identical constant to that of every other layer. The basic concept is depicted in Figure 1.2.

Figure 1.2: Illustration of layer-wise relevance propagation.

In this simple dense feedforward network, each layer is fully connected to its predecessor and successor layers. Each connection is weighted, with weights being real-valued. The inputs (pixels) are assumed to be flattened rather than in their row–column configuration. We depict a network with three outputs $y_i$, which we assume will be one-hot encoded to represent three output classes. The red arrows denote messages, which combine the relevance from their source node with the weight on the connection between the source and sink nodes. Superscripts denote the source and sink layers, and subscripts denote the source and sink neurons. The total relevance at each layer is given by

$$f(x) = \sum_{d \in \text{output}} R_d^{(\text{output})} = \sum_{d \in l} R_d^{(l)} = \cdots = \sum_{d} R_d^{(1)} \tag{1.2}$$

where d denotes a neuron within the l-th layer, and $R_d$ is the relevance for that neuron. $R_d < 0$ is interpreted to mean that the neuron contributes evidence against the network’s classification, while $R_d > 0$ contributes evidence favoring it. LRP proceeds by propagating relevance values from the output layer toward the input. The relevance of any neuron, except an output neuron, is defined as the weighted sum of incoming relevances from all connected neurons in the succeeding layer, represented by messages [160]:

$$R_i^{(l)} = \sum_{k:\; i \text{ is input for neuron } k} R_{i \leftarrow k}^{(l,l+1)} \tag{1.3}$$


LRP defines the message $R_{i \leftarrow k}^{(l,l+1)}$ from neuron k to neuron i as follows [160]:

$$R_{i \leftarrow k}^{(l,l+1)} = R_k^{(l+1)} \frac{a_i w_{ik}}{\sum_h a_h w_{hk}} \tag{1.4}$$

where $R_k^{(l+1)}$ is the relevance value for neuron k at layer l + 1, $a_i$ is the activation value of neuron i, $w_{ik}$ is the weight between neuron i and neuron k, and h iterates through all the neurons of layer l. Hence, the message sent from neuron k at layer l + 1 to neuron i at layer l is a portion of the relevance of neuron k. For an output neuron, relevance can be defined using $R^{(l)} = f(x) = y$, the known network output. A point to consider is that neuron activations and weights can be zero or nearly so, and this can possibly create numerical instabilities. LRP thus defines a stabilizer constant $\epsilon$ and adds it to the messages, yielding [160]:

$$z_{ij} = a_i w_{ij} \tag{1.5}$$

$$z_j = \sum_i z_{ij} + b_j \tag{1.6}$$

$$R_{i \leftarrow j}^{(l,l+1)} = \begin{cases} \dfrac{z_{ij}}{z_j + \epsilon} \, R_j^{(l+1)} & \text{if } z_j \ge 0 \\[2ex] \dfrac{z_{ij}}{z_j - \epsilon} \, R_j^{(l+1)} & \text{if } z_j < 0 \end{cases} \tag{1.7}$$
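The propagation rules in Eqs. (1.3) and (1.5)–(1.7) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation from [160]: the function name `lrp_epsilon_layer`, the tiny two-layer ReLU network, its hand-picked weights, zero biases, and the value of `eps` are all assumptions chosen so that the arithmetic can be checked by hand.

```python
import numpy as np

def lrp_epsilon_layer(a, W, b, R_out, eps=1e-9):
    """Propagate relevance from layer l+1 back to layer l.
    a: activations of layer l, shape (n_in,)
    W: weights from layer l to layer l+1, shape (n_in, n_out)
    b: biases of layer l+1, shape (n_out,)
    R_out: relevances of layer l+1, shape (n_out,)"""
    z = a[:, None] * W                            # z_ij = a_i * w_ij      (Eq. 1.5)
    z_j = z.sum(axis=0) + b                       # z_j = sum_i z_ij + b_j (Eq. 1.6)
    denom = z_j + np.where(z_j >= 0, eps, -eps)   # stabilized denominator (Eq. 1.7)
    messages = z / denom * R_out                  # R_{i<-j}, shape (n_in, n_out)
    return messages.sum(axis=1)                   # R_i = sum_j R_{i<-j}   (Eq. 1.3)

# Toy network (assumed): 3 inputs -> 2 hidden ReLU units -> 1 output, zero biases.
x = np.array([1.0, 2.0, 0.5])
W1 = np.array([[0.5, -1.0],
               [0.25, 0.5],
               [1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[2.0],
               [-1.0]])
b2 = np.zeros(1)

h = np.maximum(0.0, x @ W1 + b1)   # hidden activations
y = h @ W2 + b2                    # network output f(x)

R_h = lrp_epsilon_layer(h, W2, b2, y)    # relevance at the hidden layer
R_x = lrp_epsilon_layer(x, W1, b1, R_h)  # relevance at the input layer
print(y, R_h, R_x, R_x.sum())
```

With zero biases and a small `eps`, the input relevances sum (up to the stabilizer) to the network output, which is the conservation property of Eq. (1.2); nonzero biases or a larger `eps` would absorb part of the relevance.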

Once the relevances for each neuron in the input layer are determined, they define the pixel values for the saliency map SM [160].

1.3.2.1. Quantitative analysis of visual interpretations

With saliency maps being one of the most common modern representations of an explanans [160, 185–188], the question naturally arises of how to measure their quality. Where a rulebase can be evaluated using predictive accuracy or complexity measures, these are not available for a saliency map. While it is possible to request human raters to evaluate different maps and one could, in theory, obtain reliable results with a large enough pool of raters, in practice such approaches are not scalable in time or resources. A quantitative, computable measure of saliency map quality is thus called for. Perturbation analysis [189] is an approach that examines the impact of particular regions of an image on the overall classification mechanism. Consider a pixel p in a saliency map generated for an image x; we assign it the saliency value $h_p$. We sort the pixels in descending order by their $h_p$ values, yielding [189]:

$$O = (r_1, r_2, \ldots, r_L) \tag{1.8}$$


where $r_i$ is the i-th largest value $h_p$ out of a total of L pixels. We then apply a sequence of random pixel perturbations (denoted Most Relevant First, MoRF) to the image x, following the ordering O [189]. This has the effect of iteratively removing the information associated with the most salient pixels from the image. We can then compare the performance of a fixed classifier on the original image versus the progressively perturbed ones to determine how swiftly the classification decays. The faster the decay, the more important the lost information is, and the better the estimated saliencies are. The sequence of perturbations is given by [189]

$$x_{\text{MoRF}}^{(0)} = x \tag{1.9}$$

$$\forall\, 1 \le k \le L : \quad x_{\text{MoRF}}^{(k)} = g\left(x_{\text{MoRF}}^{(k-1)}, r_k\right) \tag{1.10}$$

where $g(x, r)$ is a function that removes information from image x at location r, $r_k$ is the k-th entry in O, and L is the number of regions perturbed. A perturbation consists of replacing pixel $r_k$ and its immediate neighbors with random values, thus nulling out the information contained at that location. The MoRF sequences for two saliency methods can then be compared over a dataset using the area over the perturbation curve (AOPC), given by [189]:

$$\text{AOPC} = \frac{1}{L+1} \left\langle \sum_{k=0}^{L} f\left(x_{\text{MoRF}}^{(0)}\right) - f\left(x_{\text{MoRF}}^{(k)}\right) \right\rangle_{p(x)} \tag{1.11}$$

where $\langle \cdot \rangle_{p(x)}$ indicates averaging over all images in a dataset, and f is a class discriminant function (e.g., logits in the softmax layer of a trained DNN). A toy example is provided in Table 1.1. The AOPC are computed as:

$$\text{Method 1} = \frac{1}{5}\left[(25 - 20) + (25 - 10) + (25 - 2) + (25 - 0)\right] = \frac{68}{5}$$

$$\text{Method 2} = \frac{1}{5}\left[(25 - 21) + (25 - 13) + (25 - 2) + (25 - 0)\right] = \frac{64}{5}$$

Table 1.1: Illustration of a comparison between two MoRF sequences.

| Remove windows per Saliency Map 1 | Value of f(x) | Remove windows per Saliency Map 2 | Value of f(x) |
|---|---|---|---|
| Original image | 25 | Original image | 25 |
| Remove $r_1^1$ | 20 | Remove $r_1^2$ | 21 |
| Remove $r_2^1$ | 10 | Remove $r_2^2$ | 13 |
| Remove $r_3^1$ | 2 | Remove $r_3^2$ | 2 |
| Remove $r_4^1$ | 0 | Remove $r_4^2$ | 0 |
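The AOPC arithmetic for the toy example can be reproduced with a short sketch, together with a simplified MoRF loop. The helper names `aopc_single` and `morf_curve` are hypothetical; `morf_curve` perturbs single pixels only (the procedure described in the text also perturbs each pixel's immediate neighbors), and the summing "classifier" used to exercise it is a stand-in, not a trained DNN.

```python
import numpy as np

def aopc_single(f_values):
    """AOPC for one image, where f_values[k] = f(x_MoRF^(k)) for k = 0..L.
    Averaging this quantity over a dataset gives Eq. (1.11)."""
    f_values = np.asarray(f_values, dtype=float)
    L = len(f_values) - 1
    return float(np.sum(f_values[0] - f_values) / (L + 1))

def morf_curve(x, saliency, f, n_steps, rng):
    """Record f after each of n_steps single-pixel random perturbations,
    visiting pixels in descending order of saliency (simplified MoRF)."""
    x_k = x.astype(float).copy()
    order = np.argsort(-saliency, axis=None)[:n_steps]  # most relevant first
    curve = [f(x_k)]
    for flat_idx in order:
        idx = np.unravel_index(flat_idx, x.shape)
        x_k[idx] = rng.random()          # replace the pixel with a random value
        curve.append(f(x_k))
    return curve

# Values of f(x) from Table 1.1 for the two saliency methods.
method_1 = [25, 20, 10, 2, 0]
method_2 = [25, 21, 13, 2, 0]
aopc_1 = aopc_single(method_1)   # 68 / 5
aopc_2 = aopc_single(method_2)   # 64 / 5
print(aopc_1, aopc_2)

# Exercising the MoRF loop on a 2x2 image of ones with a stand-in classifier.
image = np.ones((2, 2))
saliency = np.array([[0.9, 0.1], [0.5, 0.3]])
curve = morf_curve(image, saliency, lambda img: float(img.sum()), 3,
                   np.random.default_rng(0))
```

The larger AOPC for the first sequence corresponds to a faster decay of the classifier output as the most salient regions are removed.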


We would thus conclude that Saliency Map 1 was more effective in highlighting the pixels having the greatest influence on the classifier. Some recent criticisms also highlight other weaknesses of the saliency map as an explanans. For example, a recent work has demonstrated that saliency maps generated from randomized networks are similar to those from well-trained ones [190]. Simple data processing steps, such as mean shifts, may also substantially change the saliencies, even though no new information is added [191]. Ghorbani et al. [192] showed that small changes to input images, if carefully chosen, could significantly alter the resulting saliency maps, even though the changes were invisibly small to a human observer. Wang et al. [193] found that the bias term in neurons had a substantial correlation with the neuron’s relevance; as a bias does not respond to any input, such a correlation is certainly questionable. The expressiveness of Integrated Gradients [194] and Taylor decomposition [195] has been found to be highly dependent on the chosen reference point. Adebayo et al. [196] found that some gradient-based saliency algorithms created saliency maps that seemed to come from trained networks, even when these networks contained only randomly selected weights.

1.3.3. Local Interpretable Model-Agnostic Explanations

Local interpretable model-agnostic explanations (LIME) [4] are, as the name implies, focused solely on individual predictions as the explanandum. Fundamentally, LIME constructs an auxiliary model relating each prediction to the system input features and identifies the features that most strongly influence (positively or negatively) the classification in that auxiliary model (see e.g. [197]). To illustrate the LIME representation of an explanans, consider the example of an algorithm trained for medical diagnosis in Figure 1.3. For a diagnosis of “Flu,” LIME will construct a model (from some highly interpretable class of models) that relates the observed features to the class “Flu.” The features that most strongly support the class (headache and sneezes are reported), and those which most strongly contradict it (a lack of fatigue is reported) are highlighted as such.

Figure 1.3: The AI model predicts that a patient has the flu. LIME highlights the important symptoms. Sneeze and headache support the “flu” prediction, while “no fatigue” is evidence against it.

In LIME an explanans is denoted as g ∈ G, where G is a set of models widely held to be interpretable, e.g., decision trees, linear regressions, etc. [198]. These models are made up of components (tree nodes, terms in the regression, etc.), and the domain of g is $\{0, 1\}^d$, indicating the existence of a given component. However, it is common that not all possible g ∈ G are simple enough to be easily interpretable; a decision tree may be too deep, a regression may have too many terms, etc. $\Omega(g)$ is defined as a measurement of the complexity of the model (tree depth, number of terms, etc.) [197]. Now let us denote the classifier generating the prediction to be explained as $f: \mathbb{R}^d \to \mathbb{R}$, with a real-valued dependent variable f(x) treated as the probability that instance x is assigned to a given class. Define $\pi_x(z)$ as a proximity measure between an instance z and the instance x. The loss function $L(f, g, \pi_x)$ measures the error that g makes in approximating f in the neighborhood defined by $\pi_x$. The goal of LIME is to simultaneously minimize this loss function and the complexity $\Omega(g)$ [197]:

$$\xi(x) = \underset{g \in G}{\operatorname{argmin}} \; L(f, g, \pi_x) + \Omega(g) \qquad (1.12)$$

A key goal of LIME is to perform this minimization without information concerning f, thus making LIME model-agnostic. To do so, we draw uniform samples z without replacement from the domain of f, and use the trained classifier to produce f(z). Each of these samples is then weighted by πx(z), its proximity to instance x; samples closer to x will more strongly influence the loss function L. Once a collection of these samples is obtained, the minimization above is carried out.

In [199], Mohseni et al. evaluated the explanans from the LIME algorithm by comparing it to a collection of 10 human-created explanans. Elements of the LIME explanans seemed irrelevant compared to the human explanations, leading to a low precision of the LIME explanans. In a similar experiment concerning the SHAP XAI algorithm [200], Weerts et al. studied adding explanations to AI decisions. The human users were found to simply prefer to check the class posterior probabilities, rather than engage with the SHAP explanations.
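The sample, weight, and fit procedure behind Eq. (1.12) can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: `black_box` is a hypothetical three-feature classifier (sneeze, headache, no fatigue), the exponential kernel used for πx and the ridge penalty standing in for Ω(g) are our choices, and G is restricted to linear models.

```python
import numpy as np


def black_box(X):
    """Hypothetical classifier f: probability of 'flu' from three binary features."""
    logits = 2.0 * X[:, 0] + 1.5 * X[:, 1] - 1.0 * X[:, 2] - 1.0
    return 1.0 / (1.0 + np.exp(-logits))


def lime_explain(f, x, n_samples=2000, kernel_width=1.0, ridge=1e-3, seed=0):
    """Fit a locally weighted linear surrogate g around instance x."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Perturb x by switching each feature on or off uniformly at random.
    masks = rng.integers(0, 2, size=(n_samples, d))
    Z = masks * x
    y = f(Z)  # only queries f; no access to its internals
    # Proximity pi_x(z): an exponential kernel on Euclidean distance (assumption).
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted ridge regression: the ridge term is a stand-in for Omega(g).
    A = np.hstack([np.ones((n_samples, 1)), masks])
    lhs = A.T @ (w[:, None] * A) + ridge * np.eye(d + 1)
    rhs = A.T @ (w * y)
    beta = np.linalg.solve(lhs, rhs)
    return beta[1:]  # one coefficient per feature; beta[0] is the intercept


features = ["sneeze", "headache", "no fatigue"]
coefs = lime_explain(black_box, np.ones(3))
for name, c in zip(features, coefs):
    print(f"{name}: {c:+.3f}")
```

The sign of each coefficient indicates whether the feature supports or contradicts the prediction in the neighborhood of x, mirroring the presentation in Figure 1.3.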

1.3.4. Testing with Concept Activation Vectors (TCAV) [161]

Testing with concept activation vectors (TCAV) [32] would be categorized as a compositional approach to explanation, which probes how layers in a DNN respond to concepts selected by the EI. It is thus a different approach from LRP or LIME, which seek to re-express concepts that are implicit within the DNN in a more comprehensible form. TCAV examines how a layer in a trained network would respond to a set of examples P that belong to a user-selected concept, versus a set Q of randomly selected ones that do not belong to the concept. The concepts are presented to the trained network’s inputs, and proceed through the network to the designated layer l. Denote the transfer function of the network up to and including layer l as F, and the transfer function of the remaining layers as H. For all p ∈ P and q ∈ Q, a concept activation vector (CAV) is determined by first training a linear classifier to discriminate between the classes F(p) and F(q), and then taking the gradient of the separating hypersurface between them. This CAV now points in the “direction” of the activations representing the concept.

To illustrate this concept, consider a neural network with inputs $\vec{x}$ and outputs $\vec{y}$ as in Figure 1.4. Outputs are assumed to be one-hot encoded as usual. We identify a hidden layer L in the network to be analyzed using TCAV. The transfer function of the network up to and including that layer is denoted as $F(\vec{x})$, and the transfer function for the remainder of the network is $\vec{y} = H(F(\vec{x}))$. To probe the response of L to a user-defined concept P, we present input examples p ∈ P containing that concept to the network and observe the results F(p). We also present randomly selected input examples q ∈ Q that do not contain the concept P and observe the outputs F(q). For example, in Figure 1.5, we present examples of the desired concept “circle,” contrasted with inputs that are not circles. The next step is to train a linear classifier to separate the F(p) and F(q). This produces a separating hyperplane in the activation space of that layer; the gradient of that hyperplane is of course a vector orthogonal to it. Denote that vector as the CAV (shown in red in Figure 1.5).

Figure 1.4: A neural network with one of the hidden layers selected for TCAV analysis.

With a CAV in hand, we can now inquire how class K relates to that concept. Take the directional derivative of H with respect to the CAV as [161]

$$S_{\mathrm{CAV},k,l}(x) = \lim_{\varepsilon \to 0} \frac{H(F(x) + \varepsilon\,\mathrm{CAV}) - H(F(x))}{\varepsilon} = \nabla H(F(x)) \cdot \mathrm{CAV} \qquad (1.13)$$

where x is an element of the training dataset belonging to class K, and H() is restricted to the output representing class K. $S_{\mathrm{CAV},k,l}(x)$ thus represents how sensitive class K is to the concept in P.

Figure 1.5: Determining a CAV for the concept “circle”.

Now suppose that $X_k$ represents all of the examples in the dataset that belong to class K. One can then calculate what fraction of them have a positive sensitivity S versus the concept in P; this will be the TCAV figure of merit, representing how strongly the concept influences class K [161]:

$$\mathrm{TCAVQ}_{\mathrm{CAV},k,l} = \frac{\left|\{x \in X_k : S_{\mathrm{CAV},k,l}(x) > 0\}\right|}{|X_k|} \qquad (1.14)$$

By probing the different classes and layers in a DNN with multiple concepts, we will discover which concepts most strongly influence which classes at different layers. The highly influential concepts together become the explanans for the decisions made by the DNN [161].
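Equations (1.13) and (1.14) can be sketched on a toy network. Everything concrete below is an assumption made for illustration: the random weights `W1` and `v`, the synthetic concept set `P` (random inputs shifted along the first coordinate), and the use of a least-squares linear classifier to obtain the CAV. The source only requires that some linear classifier be trained on F(p) versus F(q).

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # hypothetical weights up to layer l
v = rng.normal(size=3)        # hypothetical weights from layer l to the class-k logit


def F(x):
    """Activations of layer l for a batch of inputs x (one row per example)."""
    return np.tanh(x @ W1)


def grad_H(a):
    """Gradient of the class-k logit H_k(a) = v . tanh(a) with respect to a."""
    return v * (1.0 - np.tanh(a) ** 2)


def concept_activation_vector(acts_p, acts_q):
    """Unit normal of a least-squares linear classifier separating F(p) from F(q)."""
    A = np.vstack([acts_p, acts_q])
    A1 = np.hstack([A, np.ones((A.shape[0], 1))])  # append a bias column
    y = np.concatenate([np.ones(len(acts_p)), -np.ones(len(acts_q))])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    w = coef[:-1]  # drop the bias term
    return w / np.linalg.norm(w)


def tcav_score(X_k, cav):
    """Fraction of class-k examples with positive directional derivative, Eq. (1.14)."""
    sensitivities = grad_H(F(X_k)) @ cav  # Eq. (1.13), one value per example
    return float(np.mean(sensitivities > 0))


P = rng.normal(size=(50, 4)) + np.array([2.0, 0.0, 0.0, 0.0])  # concept examples
Q = rng.normal(size=(50, 4))                                   # random counterexamples
X_k = rng.normal(size=(200, 4))                                # examples of class k

cav = concept_activation_vector(F(P), F(Q))
score = tcav_score(X_k, cav)
print(score)
```

In a real DNN the gradient would come from automatic differentiation rather than the closed form used in `grad_H`.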

1.3.5. Meta-Features and Rule Extraction

Meta-features are abstracted feature representations, used to improve the fidelity, stability, and sometimes accuracy of an explanans. The meta-feature approach focuses on global explanations in the pedagogical approach (the DNN is a black box) [97, 201, 202]. It is particularly appropriate for sparse, high-dimensional data, as a rule-based explanation for such data will either have poor coverage [203], or a large number of narrow, highly specific rules. In neither case will the explanans be robust. Instead, we can map input-domain features to more abstract, less sparse meta-features. Clustering is a classic example, as is treating a patch of a digital image as a superpixel [161, 204]. Key characteristics of effective meta-features include [162, 205]:

• Low dimensionality
• Low sparsity
• High fidelity
• Statistical independence
• Comprehensibility


One example of the meta-feature approach is data-driven meta-features (DDMFs), in which meta-features are found using dimensionality reduction algorithms; for instance, Ramon et al. [162] employ the non-negative matrix factorization (NMF) algorithm. Consider an n × m data matrix X with n unique data items composed of m features. NMF factors X into two non-negative matrices U and W such that X ≈ UW and $\|X - UW\|^2$ is minimized with respect to U and W, with U, W ≥ 0. To evaluate DDMFs, the authors in [162] propose three measures: accuracy (in the classic sense); fidelity (the accuracy of labels from the explanans compared only to the predicted labels from the DNN); and stability (the rulebases from different training sessions of an explanans will consistently repeat the same label predictions on out-of-sample data). While the first two are quite easily computed, measuring stability required a Monte Carlo algorithm [162].
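To make the factorization concrete, the sketch below minimizes $\|X - UW\|^2$ with the classic multiplicative update rules. The solver, the rank `k`, and the random data are all assumptions for illustration; the source does not say which NMF solver is used in [162].

```python
import numpy as np


def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative n x m matrix X into U (n x k) and W (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.random((n, k)) + eps
    W = rng.random((k, m)) + eps
    errors = []
    for _ in range(n_iter):
        errors.append(np.linalg.norm(X - U @ W) ** 2)
        # Multiplicative updates keep every entry non-negative.
        U *= (X @ W.T) / (U @ W @ W.T + eps)
        W *= (U.T @ X) / (U.T @ U @ W + eps)
    errors.append(np.linalg.norm(X - U @ W) ** 2)
    return U, W, errors


X = np.random.default_rng(1).random((20, 8))  # 20 items, 8 non-negative features
U, W, errors = nmf(X, k=3)
print(errors[0], errors[-1])
```

One natural reading is that each row of U then serves as a k-dimensional meta-feature vector for the corresponding data item, with W describing how the meta-features map back to the original features.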

1.3.6. Discussion

The XAI field exists because human users (often for a good reason) do not extend the same level of trust to an AI that they would to a human. Explanations are thus a means to improve user trust in the AI; the assumption is that increased trust will then lead to increased acceptance and usage of AI systems. Experience with other types of automated or internet-based systems has borne this out. Life in the COVID-19 pandemic, for example, was very strongly influenced (most would probably say positively) by internet-based e-commerce; this is despite the fact that e-commerce essentially did not exist prior to 1995.

In this section, we reviewed four modern techniques for generating an explanans, which are all meant to be problem- and architecture-agnostic. They are intended as generalized explanation algorithms, which therefore assumes that “explanation” is an independent phenomenon. On the other hand, Miller’s review showed that “explanation” is a social act [11], and therefore dependent on context. One might therefore ask if an EI, rather than being a generalizable system component, might need to be engineered to respond to the needs of a particular stakeholder group. One example would be explanations of AI decisions in the banking sector (e.g., granting or refusing a loan), which must satisfy the government’s banking regulators. Anecdotal discussions have indicated that some regulators may want to see a mathematical formula by which these decisions are made. Contrast this with AI developers who would be more interested in the execution trace of a DNN.

1.4. XAI and Fuzzy Logic

Fuzzy sets were first introduced by Lotfi A. Zadeh at the University of California, Berkeley in 1965 [81]. One of the pioneers of modern control theory, Zadeh had for some years been considering the question of vagueness in human language and reasoning. Tolerating that vagueness appeared to be a key reason why humans could forge ahead in uncertain environments, when computer programs simply reached an error state [206]. His 1965 formalization of this idea was to extend the definition of the characteristic function of a classical set from having a codomain consisting of the binary set {0, 1} to the closed real interval [0, 1]. This new construct is the fuzzy set, defined by its membership function. Formally, a fuzzy set is A = {(x, μ_A(x)) | x ∈ X}, where A is a fuzzy subset of X, and μ_A: X → [0, 1] is the membership function (MF) for the fuzzy set A. Several set-theoretic operations (including union, complement, and intersection) were also extended to fuzzy sets [81]. Later research has shown that “fuzzy set theory” is a family of theories differentiated by the set-theoretic operations in each one; fuzzy intersections can be any member of the class of triangular norm functions, and unions can be any triangular conorm. Fuzzy set complements do not conveniently map to a specific class of functions, but they are usually chosen so that the De Morgan laws of Boolean logic will hold with the particular choice of intersection and union. Fuzzy logic is a family of many-valued logics that are isomorphic to the family of fuzzy set theories. Specifically, fuzzy logic normally denotes the class of logics having an uncountably infinite truth-value set (the real interval [0, 1] has uncountably infinite cardinality). Again, conjunctions may be any triangular norm, and disjunctions may be any triangular conorm. A survey of over 30 such logics may be found in [207]. A thorough introduction to fuzzy sets and fuzzy logic is provided in other chapters of this handbook, and we will not repeat that material here save for what is essential to our discussion.
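The relationship between intersections, unions, and complements can be checked numerically. The sketch below pairs three well-known triangular norms with their dual conorms and tests the De Morgan law under the standard complement 1 − a; the selection of pairs and all names are ours, chosen for illustration.

```python
def complement(a):
    """Standard fuzzy complement."""
    return 1.0 - a


# (t-norm for intersection, t-conorm for union)
PAIRS = {
    "minimum / maximum": (lambda a, b: min(a, b), lambda a, b: max(a, b)),
    "product / probabilistic sum": (lambda a, b: a * b, lambda a, b: a + b - a * b),
    "Lukasiewicz": (lambda a, b: max(0.0, a + b - 1.0), lambda a, b: min(1.0, a + b)),
}


def de_morgan_holds(t_norm, t_conorm, steps=20, tol=1e-9):
    """Check that NOT(a AND b) == (NOT a) OR (NOT b) on a grid over [0, 1]^2."""
    grid = [i / steps for i in range(steps + 1)]
    return all(
        abs(complement(t_norm(a, b)) - t_conorm(complement(a), complement(b))) < tol
        for a in grid
        for b in grid
    )


for name, (t, s) in PAIRS.items():
    print(name, de_morgan_holds(t, s))

# A mismatched pair (minimum with probabilistic sum) breaks the law.
mismatched = de_morgan_holds(PAIRS["minimum / maximum"][0],
                             PAIRS["product / probabilistic sum"][1])
print("mismatched pair:", mismatched)
```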
Fuzzy logic is most commonly operationalized via a fuzzy inferential system (FIS), which employs linguistic fuzzy rules (based on the linguistic variable construct discussed in Section 1.4.1) to create an input-output mapping. An FIS is thus highly transparent, in the sense of [13]; the fuzzy rules are IF–THEN structures with linguistic predicates, usually chosen to be intuitively meaningful. A second common operationalization is to hybridize neural networks with fuzzy logic. This produces a network that is more interpretable (either at the decompositional or the pedagogical level, depending on the precise hybridization performed). Hence, we see that fuzzy logic has been employed in generating explanans since the 1970s, and research into hybrid deep fuzzy networks is ongoing today.

Our review in this section begins with the first major application of an FIS to practical problems, in the area of automated control. We will then explore the similarities and differences between FIS and fuzzy expert systems, which were contemporaneous with second-generation expert systems. We then turn our attention to neuro-fuzzy and genetic fuzzy hybrid systems, and close with a discussion of current research into deep neuro-fuzzy systems.


1.4.1. Fuzzy Logic and Control Systems

By the early 1970s, research into fuzzy logic was garnering interest from multiple researchers. The early focus was mainly on the mathematical properties of fuzzy sets. Zadeh introduced a theory of state equations for describing the behavior of fuzzy systems in 1971 [208], and in 1972 began suggesting that fuzzy sets could be used for automatic control [209]. From the perspective of XAI, one of the most vital developments was the introduction of the linguistic variable (LV) construct in [210–212]. This is the use of fuzzy sets to provide a mathematical meaning to linguistic terms such as small or cold. A linguistic variable is a 5-tuple (X, U, T, S, M) where X is the name of the variable, U is its universe of discourse (the universal set that X belongs to), T is the set of linguistic terms for the variable, S is a syntactic rule for generating additional linguistic terms (S is usually a context-free grammar whose terminal symbols are the terms in set T), and M is a semantic rule that associates a fuzzy subset of U with each term generated by S [210, 212]. The grammar S is needed to accommodate linguistic hedges in a variable (e.g., very, or more or less); a term generated by S (referred to as a composite term) will normally consist of a single base term, and an arbitrary number of hedges. The semantic rule M assigns a fuzzy subset of U to the base term, and a function h: [0, 1] → [0, 1] to each hedge. The functions h are applied to the fuzzy set for the base term, in the order defined by the parse tree of the entire composite term [210, 212, 213].

As mentioned above, an FIS is composed of linguistic fuzzy rules. Specifically, these are IF–THEN rules much like in a traditional expert system, but with the antecedent predicates (and possibly the consequent predicates) defined as linguistic variables over each input (and output) dimension in the system feature space.
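The way a semantic rule evaluates a composite term can be illustrated with a small sketch. The variable (temperature), the ramp-shaped membership function for the base term cold, and the two hedge functions (squaring for very, square root for more or less, a common choice in the fuzzy literature) are all assumptions made for this example rather than definitions from the text.

```python
import math


def cold(t):
    """Hypothetical MF for the base term 'cold' on a universe of 0..40 degrees C."""
    return max(0.0, min(1.0, (20.0 - t) / 20.0))


# Each hedge is a function h: [0, 1] -> [0, 1].
HEDGES = {
    "very": lambda m: m ** 2,
    "more or less": lambda m: math.sqrt(m),
}


def evaluate(hedges, base_mf, u):
    """Membership of u in a composite term such as ['more or less', 'very'] + cold.

    Hedges are listed in reading order, so the innermost (last) hedge is
    applied to the base term first, following the parse of the term.
    """
    m = base_mf(u)
    for h in reversed(hedges):
        m = HEDGES[h](m)
    return m


print(evaluate([], cold, 10.0))                        # cold
print(evaluate(["very"], cold, 10.0))                  # very cold
print(evaluate(["more or less", "very"], cold, 10.0))  # more or less very cold
```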
The first application of the FIS to real-world problems was the design of an automated controller by Mamdani [214] (see [215, 216] as well). The first well-known commercial application was the cement kiln controller developed by Holmblad and Østergaard [217]. Pappis and Mamdani’s traffic controller [218], Flanagan’s sludge process controller [219], and the first fuzzy inference chip developed by AT&T Bell Laboratories [220] are other well-known industrial applications. A good review of this initial development can be found in [221]. Mamdani’s system rapidly prototyped a control algorithm by eliciting the knowledge of experienced human operators (in the form of if–then rules), and representing that knowledge as an FIS. When it was evaluated, the FIS algorithm (discussed in more detail below) proved to be immediately effective [214]. This has become a hallmark of fuzzy control; ongoing research and industrial experience have shown that the FIS is, in fact, a swift and intuitive means to devise a control surface (the response function of a control algorithm) of essentially arbitrary shape. Fuzzy controllers have in fact been shown to outperform classical PID controllers in many instances.


Figure 1.6: Feedback control system.

One of the most famous examples is the use of fuzzy control for a fully automated subway in the Japanese city of Sendai [222–224]. Fuzzy control algorithms are usually referred to as a fuzzy logic controller (FLC).

1.4.1.1. Mamdani’s FLC

Automated feedback control systems follow the general design in Figure 1.6. The plant is the system being controlled, which is actually performing some useful work as measured by the output vector $\vec{y}$. The controller is the system that determines and applies control inputs $\vec{u}$ to affect the plant, and thereby alter the output $\vec{y}$. The input to the controller is an error vector $\vec{e}$, which represents the deviation of the output $\vec{y}$ from a desired reference signal. In modern control theory, the plant is represented as a vector of state variables $\vec{x}$, which measure specific, important characteristics of the plant. The time evolution of those state variables, as well as the control input vector $\vec{u}$, and their effect on the plant’s outputs, are represented as differential equations [225]:

$$\frac{d\vec{x}}{dt} = F(\vec{x}(t), \vec{u}(t)) \qquad (1.15)$$

$$\vec{y}(t) = G(\vec{x}(t)) \qquad (1.16)$$

where x, u and y are (vector-valued) functions of time. Conventional controllers are thus based on a mathematical model of the plant being controlled. They work extremely well for linear plants whose state variables are easily and precisely observed. However, this clearly describes only a minority of the circumstances in which automatic control is desirable; indeed many high-value, high-risk plants exhibit significant nonlinearities and poor observability of state variables. Nonetheless, trained human operators are able to safely operate such plants. It thus becomes highly desirable to create an automatic controller that can mimic those operators’ performance. One possibility is to elicit condition–response rules from those operators, and deploy them as the controller of a plant; the FLC developed by Mamdani and Assilian [214] was the first practical example of employing fuzzy logic for this purpose, which we discuss in more detail below.

The plant in [214] is a steam engine (a highly nonlinear system) whose observed state variables are the speed error (difference from the set point), pressure error, and the derivatives of each error. The two control variables are the heat input to the boiler and the throttle opening at the input of the engine cylinder. The control unit’s goal is thus to determine the proper value of the two control variables, given the four inputs. For each input, a linguistic variable was defined over that universe of discourse. Each LV contained only base linguistic terms (no hedges); these are Positive Big (PB), Positive Medium (PM), Positive Small (PS), Negative Big (NB), Negative Medium (NM), Negative Small (NS), and Nil.

Mamdani and Assilian’s design approach followed an earlier suggestion by Gaines [226] that a general learning procedure for an intelligent controller could consist of initializing the controller using priming, coding, and training approaches, followed by reinforcement learning to optimize the controller. Priming refers to the communication of a control strategy to the controller (e.g., as a rulebase); coding refers to establishing a representational format common to the controller and plant components; and training refers to adapting the controller to take a known optimal action for selected scenarios. As discussed in [214], the FLC was intended to perform the priming function only; however, the resulting algorithm was consistently superior to the digital controller used as an experimental contrast. Thus, the additional step of optimizing via reinforcement learning was not undertaken, and the results of just the FLC were reported. In designing the FLC, Mamdani and Assilian needed to solve multiple challenges:

1. How are the IF–THEN rules represented? If linguistic variables are employed as the predicates of the rules, then two further challenges arise:
   a. How are linguistic predicates from different universes of discourse to be combined within a rule?
   b. How are observations from sampled-data sensors to be interpreted in rules with linguistic predicates?
2. How will a conclusion be inferred from the set of IF–THEN rules and sensor observations?
3. How will this conclusion (itself a linguistic value) be interpreted into a control action for the physical plant?

As we know, the FLC [214] was designed with linguistic variables as the predicates to make the control rules as intuitive as possible (i.e., transparent). The solutions to these challenges were as follows (we will discuss only a two-input, single-output system for clarity of presentation). Given input variables X, Y and control variable Z, fuzzy rules were of the form

IF x is A and y is B THEN z is C


for x ∈ X, y ∈ Y, z ∈ Z, and A, B, C being linguistic values that map to fuzzy subsets of X, Y, Z, respectively. Challenge (1a) is solved using the Cartesian product of fuzzy sets; for example, if we consider the two fuzzy sets A and B, the Cartesian product is given by [214]

$$\mu_{A \times B}(x, y) = \min(\mu_A(x), \mu_B(y)) \qquad (1.16)$$

applied to the fuzzy sets defined by the semantic rule of each linguistic variable. To solve challenge (1b), we can combine this product with a real-valued observation $(x_0, y_0)$ using Zadeh’s compositional rule of inference. The one step that is needed is to convert $(x_0, y_0)$ into a fuzzy membership value; this was done by evaluating $\mu_{A \times B}(x, y)$ at the point $(x_0, y_0)$. This leads us to the solution of challenge (2): Mamdani and Assilian combine the antecedent and consequent predicates of a fuzzy rule together using the Cartesian product again, producing a fuzzy relation R. This R, and the memberships of the observation vector $\vec{x}$, produce the output fuzzy set C′ via the compositional rule of inference. (Note that this is not equivalent to a fuzzy implication.) Thus the inference result of one rule is produced by [214]

$$\mu_{C'}(z) = \max_{x, y} \min\big(\mu_{A \times B}(x_0, y_0), \min(\mu_{A \times B}(x, y), \mu_C(z))\big) \qquad (1.17)$$

Multiple rules are combined via fuzzy union [214]:

$$\mu_{\mathrm{out}}(z) = \bigcup_i \mu_{C'_i}(z) \qquad (1.18)$$

where the index i ranges over the whole set of fuzzy rules, and the fuzzy union is implemented by the max() t-conorm. Finally, for challenge (3), the control action output was chosen to be the unique element of Z with the highest membership in the output fuzzy set, if such exists; an expected value was taken if there were multiple elements with the maximal membership [214].

Mamdani and Assilian’s work in [214] has been tremendously influential on the fuzzy systems community. Indeed, their approach to designing the FLC has, with some generalizations, become one of the de facto standards of the community. Figure 1.7 illustrates the general architecture of a modern FIS. Firstly, the IF–THEN rule format above has become the “canonical” form of fuzzy rules having linguistic consequents (we discuss alternative consequents later). Secondly, observations (assumed to be numeric quantities) undergo fuzzification as the first stage of an FIS, which converts them into fuzzy sets (commonly with only a single object-membership pair, known as a fuzzy singleton). The fuzzified observations are then passed to the inference engine, which combines them with a stored database of fuzzy linguistic rules (the rulebase) to determine an inferred conclusion. This inference engine still uses Mamdani’s inference algorithm instead of a fuzzy implication function; the algorithm is visualized in Figure 1.8. The conclusion then commonly needs to be converted back into a numeric output, a process termed defuzzification.
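The inference steps described above (min for the rule antecedent, clipping of the consequent, max across rules, and selecting the element of highest membership) can be sketched for a two-input, single-output controller. The triangular membership functions, the three-rule rulebase, and the discretized output universe `Z_GRID` are hypothetical; ties at the maximal membership are resolved here by a simple mean.

```python
def tri(a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    def mf(u):
        if u <= a or u >= c:
            return 0.0
        return (u - a) / (b - a) if u <= b else (c - u) / (c - b)
    return mf


NEG, ZERO, POS = tri(-2, -1, 0), tri(-1, 0, 1), tri(0, 1, 2)

# Each entry (A, B, C) encodes: IF x is A and y is B THEN z is C.
RULES = [(NEG, NEG, POS), (ZERO, ZERO, ZERO), (POS, POS, NEG)]
Z_GRID = [i / 10 for i in range(-20, 21)]  # discretized output universe


def mamdani(x0, y0):
    """Infer a crisp control action from the observation (x0, y0)."""
    out = []
    for z in Z_GRID:
        # Firing strength min(mu_A(x0), mu_B(y0)) clips each consequent;
        # the clipped consequents are combined across rules by max.
        out.append(max(min(min(A(x0), B(y0)), C(z)) for A, B, C in RULES))
    peak = max(out)
    if peak == 0.0:
        return None  # no rule fired for this observation
    winners = [z for z, m in zip(Z_GRID, out) if abs(m - peak) < 1e-12]
    return sum(winners) / len(winners)


print(mamdani(0.0, 0.0))
print(mamdani(1.0, 1.0))
```

With the observation (0, 0) only the second rule fires, so the output is the peak of ZERO; with (1, 1) only the third rule fires, so the output is the peak of NEG.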


Figure 1.7: Basic architecture of a fuzzy expert system. (Blocks shown: crisp input, fuzzification interface, fuzzy input, inference engine with fuzzy rulebase, fuzzy output, defuzzification interface, crisp output.)

Figure 1.8: Mamdani fuzzy inference system using min and max for T-norm and T-conorm operators [227].

Defuzzification is most commonly implemented by the centroid method, i.e., determining the z value for the centroid of the output fuzzy set via the familiar equation from physics [227]:

$$z_{\mathrm{COA}} = \frac{\int_z \mu_A(z)\, z\, dz}{\int_z \mu_A(z)\, dz} \qquad (1.19)$$

where $\mu_A(z)$ is the aggregated output MF.

While the Mamdani architecture was the first FLC published, the Takagi–Sugeno–Kang (TSK) FLC, first published in 1985 [228], is also very widely used. The principal difference between the two is that consequents in the TSK FLC are always numeric functions of the inputs, rather than fuzzy sets. A basic TSK fuzzy inference system, with linear consequent functions, is illustrated in Figure 1.9. In this system, the canonical form of the rules is IF x is A1 and y is B1, THEN $z_1 = f(x, y)$. The system output is then a sum of the output functions, weighted by the firing strength of each rule.

Considering the FLC from the perspective of XAI, it seems clear that a linguistic rulebase is more comprehensible than a differential equation in several variables!


Figure 1.9: Takagi–Sugeno fuzzy inference system using a min or product as T-norm operator [227].
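Both output stages can be sketched in a few lines: centroid defuzzification for a Mamdani system, approximated by a sum over a grid, and the firing-strength-weighted output of a TSK system with linear consequents. The membership functions, the two rules, and the coefficients (p, q, r) are hypothetical, and the TSK output is normalized by the total firing strength, which is the usual convention.

```python
def tri(a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    def mf(u):
        if u <= a or u >= c:
            return 0.0
        return (u - a) / (b - a) if u <= b else (c - u) / (c - b)
    return mf


def centroid(mf, lo, hi, n=1000):
    """Approximate the centre of area of mf over [lo, hi] on n + 1 grid points."""
    zs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    num = sum(mf(z) * z for z in zs)
    den = sum(mf(z) for z in zs)
    return num / den


LOW, HIGH = tri(-1, 0, 1), tri(0, 1, 2)

# Each rule: IF x is A and y is B THEN z = p*x + q*y + r.
TSK_RULES = [
    (LOW, LOW, (1.0, 1.0, 0.0)),
    (HIGH, HIGH, (2.0, -1.0, 1.0)),
]


def tsk(x, y, rules):
    """Weighted average of the rule outputs; assumes at least one rule fires."""
    weights = [min(A(x), B(y)) for A, B, _ in rules]
    outputs = [p * x + q * y + r for _, _, (p, q, r) in rules]
    return sum(w * z for w, z in zip(weights, outputs)) / sum(weights)


print(centroid(tri(1, 3, 5), 0.0, 6.0))
print(tsk(0.5, 0.5, TSK_RULES))
```

At (0.5, 0.5) both rules fire with strength 0.5, so the TSK output is the plain average of the two consequent values.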

Specifically, a human would be able to understand those rules without having to first learn the topic of differential equations. However, neither Mamdani nor Takagi et al. actually designed an EI for their FLC, and so the explanans for an FLC is just the fuzzy rulebase. Recalling the EI of MYCIN and other conventional expert systems, the rules of the expert system were not deemed a sufficient explanation. Nor does the literature on FLCs commonly attempt a more formal evaluation of the comprehensibility of an FLC. A further point arises when we consider the differences between the Mamdani and TSK fuzzy systems. The TSK systems have attractive numerical properties; specifically, the linear coefficients ( p, q, and r in Figure 1.9) afford additional degrees of freedom, allowing for a more precise control law—or more accurate predictions when the TSK system is used for classification or other tasks. Furthermore, design rules for building TSK controllers with guaranteed stability have long been known [229]. However, a linear function is, on its face, less transparent than a single linguistic term as produced by Mamdani controllers. This is a first example of the interpretability/accuracy trade-off that will be discussed in more depth later in this section [230].

1.4.2. Expert Systems and Fuzzy Logic

As discussed, researchers quickly realized that expert systems such as MYCIN and PROSPECTOR could be reused in different problem domains by swapping out their rulebase; the rest of the system was considered a reusable expert system shell. These shells generally employed the production system architecture, depicted in Figure 1.10. The EI and rulebase components have been previously discussed. The working memory contains all data that have been input to the system, conclusions inferred from all rules fired to this point (we can collectively call these facts), and the set of all rules whose antecedents may yet be matched to the available facts (the conflict set).

Figure 1.10: Production system architecture for expert systems [231].

The inference engine uses either a forward- or backwards-chaining approach. In forward chaining, the engine will iteratively select a rule from the conflict set whose antecedents are fully matched by available facts, and then execute any actions in the THEN part of the rule; this may include adding new facts to the working memory. The conflict set is then updated (rules might be added or deleted), and the process repeats, until the conflict set is empty [232]. In backwards chaining, the inference engine treats the possible outcomes of rules as statements that need to be proved by reasoning from the facts and rules in working memory. If a fact directly proves the statement, it is selected. Otherwise, rules whose consequents match the statement are selected, and their antecedents are added to working memory as new statements to be proven. The process continues recursively until either facts proving the entire sequence of statements are identified, or no more facts and rules that continue the chain can be found [233].

One of the key steps in the production system is identifying the conflict set; this is generally called the many-pattern–many-object matching problem. As its name implies, the problem is to take a collection of objects, and a collection of patterns, and find every object that matches any of the patterns. The brute-force approach to this problem has an exponential time complexity, and so other alternatives are needed. The Rete network algorithm is generally used in practical expert systems. The Rete network organizes the rulebase into a tree structure, based on testing single features at each node. Then, once a new fact enters the working memory, it can be quickly passed through the Rete network to arrive at a set of rules compatible with that fact. Once all antecedents in a rule have been matched to facts in the working memory, the rule can fire.
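The forward-chaining cycle can be sketched for a crisp rulebase as follows. The rule format (sets of symbolic facts), the three example rules, and the conflict-resolution strategy (fire the first ready rule) are assumptions for illustration; the loop stops as soon as no remaining rule is fully matched.

```python
def forward_chain(rules, facts):
    """Fire rules until no remaining rule has all of its antecedents matched."""
    facts = set(facts)
    fired = []
    conflict_set = list(rules)  # rules that may yet be matched
    while True:
        ready = [r for r in conflict_set if r["if"] <= facts]
        if not ready:
            break
        rule = ready[0]          # conflict resolution: first match (assumption)
        facts |= rule["then"]    # the THEN part adds new facts
        fired.append(rule["name"])
        conflict_set.remove(rule)
    return facts, fired


RULES = [
    {"name": "r2", "if": {"c"}, "then": {"d"}},
    {"name": "r1", "if": {"a", "b"}, "then": {"c"}},
    {"name": "r3", "if": {"e"}, "then": {"f"}},
]

final_facts, fired = forward_chain(RULES, {"a", "b"})
print(sorted(final_facts), fired)
```

Rule r2 becomes ready only after r1 has added the fact c, while r3 never fires because e is never asserted. This sketch rescans every remaining rule on each cycle; that scan is the step that the Rete network replaces.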
The complexity of the Rete algorithm has been shown to be polynomial in the number of rules [232, 234].

As discussed earlier, the developers of MYCIN quickly realized that their expert system needed some way to tolerate vagueness and imprecision in its inputs and rules. Their solution, ad hoc certainty factors, worked initially but was found to be extremely fragile; certainty factors throughout the rulebase might have to be manually re-tuned whenever a new rule was added [55]. It is thus unsurprising that researchers in the 1970s and 1980s began to study fuzzy expert systems as a potentially more robust means of accommodating imprecision and vagueness. Early systems included Sphinx, Diabeto, and CADIAG-2 for medical diagnostic support [235–237]. Zadeh’s 1983 article [238] described how uncertainty in a production system-based shell can be accounted for by using possibility distributions and the compositional rule of inference. He expanded on these points in his discussion of fuzzy syllogistic reasoning in [239]. An alternative approach is outlined by Negoita in [240]. Buckley et al. present a fuzzy expert system where the facts are fuzzy numbers in addition to the rulebase being fuzzy [241]; a refinement of this system is discussed in [242]. An expert system for military target classification is proposed in [243]. A fuzzy expert system based on backwards chaining is proposed in [244]; notably, this system includes an EI that can respond to why, how, and what-if questions.

The Fuzzy CLIPS shell (see footnote 2) is a modification of the CLIPS shell (for traditional forward-chaining expert systems) that incorporates fuzzy logic. Of all the fuzzy expert systems discussed, only Fuzzy CLIPS employs a Rete network for the many-pattern/many-object matching problem; the others simply use a brute-force search. However, Fuzzy CLIPS overlooks a key difference between fuzzy and traditional expert systems. In fuzzy forward rule chaining, a rule containing a given predicate in its antecedents must not fire until all rules in the conflict set that assert that predicate in their conclusions have fired [245].
This attention to ordering is needed because, unlike in Boolean logic, the proposition ((p → q) ∧ (q → s)) → (p → s) (sometimes called the chaining syllogism) is not generally a tautology in fuzzy logic (i.e., it is not a tautology in all possible fuzzy logics) [246]. Thus, ignoring the ordering constraint can lead to incorrect results. FuzzyShell is an alternative fuzzy forward-chaining shell, with a modified Rete network that does account for this ordering constraint [247].

In reviewing fuzzy expert system shells, one again finds the pattern that was present in FLC designs: a linguistic rulebase seems to improve the transparency of the expert system, but the EI seems to have been neglected. With the exception of [244], none of the reviewed systems has a way of responding to user queries or a means of constructing counterfactuals. The potential is clearly there, and causal reasoning is a topic of interest in the fuzzy community (e.g., [248, 249]). The extra step of building an EI, however, seems seldom to be taken.
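The claim about the chaining syllogism can be checked numerically. The sketch below is an illustration under assumptions of our own choosing, not the analysis in [245, 246]: conjunction is taken to be the minimum, three textbook fuzzy implication operators are compared (Kleene–Dienes, Łukasiewicz, and Gödel), and truth values are sampled on a coarse grid using exact fractions. A proposition is a tautology only if its truth value is 1 for every assignment, so the smallest value found over the grid is reported.

```python
from fractions import Fraction
from itertools import product
from typing import Callable

Implication = Callable[[Fraction, Fraction], Fraction]


def kleene_dienes(a: Fraction, b: Fraction) -> Fraction:
    return max(1 - a, b)


def lukasiewicz(a: Fraction, b: Fraction) -> Fraction:
    return min(Fraction(1), 1 - a + b)


def goedel(a: Fraction, b: Fraction) -> Fraction:
    return Fraction(1) if a <= b else b


def chaining_syllogism(imp: Implication, p: Fraction, q: Fraction, s: Fraction) -> Fraction:
    """Truth value of ((p -> q) AND (q -> s)) -> (p -> s), with AND = min."""
    premise = min(imp(p, q), imp(q, s))
    return imp(premise, imp(p, s))


# Truth values 0, 1/4, 1/2, 3/4, 1 (an arbitrary sampling choice).
GRID = [Fraction(i, 4) for i in range(5)]


def worst_case(imp: Implication) -> Fraction:
    """Smallest truth value of the syllogism over all grid assignments."""
    return min(chaining_syllogism(imp, p, q, s)
               for p, q, s in product(GRID, repeat=3))


for name, imp in [("Kleene-Dienes", kleene_dienes),
                  ("Lukasiewicz", lukasiewicz),
                  ("Goedel", goedel)]:
    print(name, worst_case(imp))
```

With these particular operator choices, the syllogism drops to a truth value of 1/2 under the Kleene–Dienes and Łukasiewicz implications (for example at p = 1, q = 1/2, s = 0 in the latter case) but stays at 1 under the Gödel implication, which is consistent with the statement that it is a tautology in some fuzzy logics and not in others.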

2 https://github.com/garydriley/FuzzyCLIPS631

1.4.3. Hybrid Fuzzy Systems

One of the signature characteristics of computational intelligence is that hybrid system designs combining its major core disciplines (neural networks, fuzzy logic, and evolutionary computing) are very commonly superior to designs using only one approach [250]. Specifically, the hybrids involving fuzzy logic are usually considered more “intuitive” or “understandable,” due to their incorporation of “interpretable” fuzzy sets. This is much the same argument that is made for fuzzy systems in general. In the next two subsections, we will review neuro-fuzzy and evolutionary fuzzy systems, respectively, and examine what kinds of explanans are being offered by these systems.

1.4.3.1. Neuro-fuzzy systems

Hybrids of neural networks and fuzzy logic were first investigated in the early 1970s, during the neural network winter. Lee and Lee designed a generalization of the McCulloch–Pitts neuron that operated with fuzzy inputs and outputs [251]. Later, Keller and Hunt designed a perceptron that created a fuzzy hyperplane instead of a crisp one as the decision boundary (an idea that presaged the soft-margin algorithm for support vector machines some 10 years later) [252]. Wang and Mendel proposed a gradient-descent algorithm for inductively learning fuzzy rulebases in [253]. Then in 1988, NASA organized the 1st Workshop on Neural Networks and Fuzzy Logic, which catalyzed developments in this area [254]. In the years since, thousands of theory and application papers have explored the design space of neuro-fuzzy systems. We can divide this design space into two major approaches: fuzzifying neural networks, and neural architectures that mimic fuzzy systems (authors often define the term neuro-fuzzy systems to only mean this latter subclass). We discuss each in turn below.

The term “fuzzy neural network” has come to mean a traditional neural network architecture that has been modified to incorporate fuzzy sets and logic in some fashion for the purpose of improving the transparency of the NN. This quite commonly takes the form of modifying individual neurons to employ fuzzy quantities in their computations. A good example is Hayashi’s fuzzified multilayer perceptron, in which neuron inputs, outputs, and weights are all fuzzy numbers (fuzzy sets with unimodal, convex membership functions, which represent a real number that is not precisely known) [255]. Another example is Pal and Mitra’s Fuzzy MLP [256], in which only the network inputs and outputs are fuzzy; internally the network is a basic MLP with backpropagation.
The network inputs are the memberships from each fuzzy set in a linguistic variable covering the actual input dimensions of a dataset, while the outputs are fuzzy class labels (permitting each object x to belong to more than one class). These two examples also illustrate a significant parallel to the knowledge extraction algorithms for shallow networks discussed in [98]; after all, the intent of those works was also to make the knowledge stored in NNs more transparent. Hayashi’s work can be seen as fuzzification at the decompositional level, while Pal and Mitra’s might be considered fuzzification at the pedagogical level. As a very general statement, fuzzy neural networks tend to be more transparent than traditional NNs, without sacrificing a great deal of accuracy.

Figure 1.11: ANFIS architecture [257].

Mimicking a fuzzy system using a neural network has a somewhat different purpose. If a 1:1 correspondence between the neural architecture and an FIS can be shown, then the knowledge in the NN is highly transparent; the problem lies in acquiring that knowledge. This class of neuro-fuzzy systems is commonly used to design and optimize an FIS, using observed data rather than knowledge elicitation from experts. Perhaps the best-known example is Jang’s ANFIS architecture [257], which is provably equivalent to a TSK fuzzy system. As shown in Figure 1.11, this NN is radically different from an MLP. Each layer of the network executes one stage of computations that together form a TSK fuzzy system. The first layer computes the membership of the numeric inputs in each fuzzy set from a linguistic variable over that dimension. The second layer combines the linguistic predicates from the first layer into rules; note that this layer is not fully connected to the first. The pattern of connections from layer 1 to layer 2 defines the rules of the fuzzy system. Layer 3 normalizes the firing strength of each rule, and then layer 4 computes a linear function of the inputs as the TSK consequent to each rule, weighted by the normalized firing strength of the rule. The outputs of layer 4 are then summed to produce the network output. There have been a wide variety of modifications and extensions proposed for ANFIS in the last three decades.
For more details on both approaches to neuro-fuzzy systems, we direct the reader to the recent review in [258].
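The layer-by-layer ANFIS computation described above can be sketched as a plain forward pass. This is an illustrative sketch under assumptions, not Jang's implementation: Gaussian membership functions and the product operator for combining predicates are choices made here for brevity, the rulebase is a full grid over two fuzzy sets per input, and all parameter values are hypothetical. No learning is shown.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple


def gaussian(x: float, center: float, width: float) -> float:
    return math.exp(-0.5 * ((x - center) / width) ** 2)


# Hypothetical premise parameters: (center, width) of each fuzzy set, per input.
PREMISE = [
    [(0.0, 1.0), (2.0, 1.0)],   # input 0: "low", "high"
    [(0.0, 1.0), (2.0, 1.0)],   # input 1: "low", "high"
]
# One rule per combination of fuzzy sets; each tuple indexes into PREMISE.
RULES = list(product(range(2), repeat=2))
# Hypothetical TSK consequents, one per rule: (input coefficients, bias).
CONSEQUENT = [
    ((1.0, 0.0), 0.0),
    ((0.0, 1.0), 1.0),
    ((0.5, 0.5), -1.0),
    ((1.0, 1.0), 0.5),
]


def anfis_forward(x: Sequence[float]) -> Tuple[float, List[float]]:
    # Layer 1: membership of each input in each of its fuzzy sets.
    mu = [[gaussian(xi, c, w) for (c, w) in sets] for xi, sets in zip(x, PREMISE)]
    # Layer 2: rule firing strengths (product of the selected memberships).
    strength = [math.prod(mu[i][k] for i, k in enumerate(rule)) for rule in RULES]
    # Layer 3: normalize the firing strengths so they sum to one.
    total = sum(strength)
    normalized = [w / total for w in strength]
    # Layer 4: linear consequent of each rule, weighted by normalized strength.
    weighted = [wn * (sum(c * xi for c, xi in zip(coef, x)) + bias)
                for wn, (coef, bias) in zip(normalized, CONSEQUENT)]
    # Output: sum of the layer 4 outputs.
    return sum(weighted), normalized


output, weights = anfis_forward([1.0, 1.0])
print(output, weights)
```

Because the connection pattern in `RULES` and the parameters in `PREMISE` and `CONSEQUENT` map one-to-one onto fuzzy rules, the network can be read back as an FIS, which is the source of the transparency discussed in the text.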

When we compare fuzzy neural networks with the FIS mimics, we again observe the interpretability/accuracy trade-off; there is a tendency for the former to be more accurate, while the latter are more transparent as they directly translate into an FIS [227]. Note, however, that the latter statement has its limits; ANFIS, for instance, treats the parameters of the layer 1 fuzzy sets as adaptive parameters to be learned. It is thus entirely possible for the fuzzy sets in a trained ANFIS to diverge from an “intuitive” meaning associated with a linguistic term they may have been assigned [250]. Furthermore, these architectures once again focus on transparency, but generally do not follow up by designing an EI.

1.4.3.2. Evolutionary fuzzy algorithms

Evolutionary algorithms are a class of iterative optimization techniques based on the metaheuristic of selection pressure. In each iteration, a population of solutions is generated, and ranked according to their “fitness” (the value of an objective function). Those with the highest fitness are used to form the next population that will be evaluated; those with low fitness are discarded. The algorithms might closely follow the ideas of mate competition and genetic reproduction; crystal formation in an annealing process; foraging behaviors of ants; and many others [250]. As with neuro-fuzzy systems, the goal of an evolutionary fuzzy algorithm (EFA) is to design an intelligent system to model a dataset with both high accuracy and interpretability. However, while EFAs have a shorter history than neuro-fuzzy systems (the first articles [259–261] describing genetic fuzzy systems appeared in 1991), they are generally accepted today as producing more accurate systems. Moreover, multiobjective genetic fuzzy systems can simultaneously optimize the accuracy and interpretability of an FIS [57].

Multiple elements of an FIS might be designed/optimized in an EFA. The rulebase, the inference engine, and the fuzzy sets of the semantic rules for each LV can all be learned in the EFA. Fundamentally, there are two different learning approaches used in an EFA: it might either be used to fine-tune an initial FIS, or alternatively the EFA can inductively learn the whole FIS from a dataset. In the former case, the initial FIS is likely hand-crafted by the system designer, possibly including knowledge elicitation from experts on the system being modeled. The rulebase thus developed will be highly interpretable. The FIS is then fine-tuned by the evolutionary algorithm to improve its accuracy on observed data. In particular, such systems often focus on fine-tuning membership function parameters, e.g., [262], although some proposals also adapt the inference engine of the FIS [57].
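The fine-tuning scenario can be illustrated with a deliberately small example. The sketch below is not taken from [262] or any cited system; it assumes a hypothetical two-rule, zero-order TSK system with Gaussian membership functions, synthetic data, and a simple elitist (mu + lambda) evolution strategy with Gaussian mutation. Only the two membership function centers are tuned; the rulebase itself is left untouched.

```python
import math
import random
from typing import List, Tuple

Params = Tuple[float, float]  # centers of the "low" and "high" fuzzy sets


def fis_output(x: float, params: Params, width: float = 1.0) -> float:
    """Two-rule zero-order TSK system: IF x is low THEN 0; IF x is high THEN 1."""
    low = math.exp(-0.5 * ((x - params[0]) / width) ** 2)
    high = math.exp(-0.5 * ((x - params[1]) / width) ** 2)
    total = low + high
    return high / total if total > 0.0 else 0.5  # guard against underflow


def mse(params: Params, data: List[Tuple[float, float]]) -> float:
    return sum((fis_output(x, params) - y) ** 2 for x, y in data) / len(data)


def mutate(params: Params, rng: random.Random, sigma: float) -> Params:
    return (params[0] + rng.gauss(0.0, sigma), params[1] + rng.gauss(0.0, sigma))


def evolve(initial: Params, data: List[Tuple[float, float]], generations: int = 40,
           mu: int = 10, lam: int = 20, sigma: float = 0.3, seed: int = 0) -> Params:
    rng = random.Random(seed)
    population = [initial] + [mutate(initial, rng, sigma) for _ in range(mu - 1)]
    for _ in range(generations):
        offspring = [mutate(rng.choice(population), rng, sigma) for _ in range(lam)]
        # (mu + lambda) selection: parents compete with offspring, so the
        # best error in the population can never increase.
        population = sorted(population + offspring, key=lambda p: mse(p, data))[:mu]
    return population[0]


# Synthetic "observed" data from a hypothetical target system with centers (1, 3).
DATA = [(0.25 * i, fis_output(0.25 * i, (1.0, 3.0))) for i in range(17)]
INITIAL = (0.0, 4.0)  # stands in for a hand-crafted starting point
TUNED = evolve(INITIAL, DATA)
print(mse(INITIAL, DATA), mse(TUNED, DATA))
```

Nothing in this loop constrains where the tuned centers end up, so a tuned "low" set may no longer match what a designer would call low; a practical system would need explicit constraints or an interpretability objective to prevent this.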
Alternatively, an FIS might be inductively learned from a dataset. Again, this can take multiple forms. One might, for example, use an EFA to select an optimal subset of rules from a larger set of rule candidates [263]; or one might directly generate the fuzzy rules. In the latter case, the linguistic variables for each predicate are usually pre-defined, and the EFA only considers the rules arising from their different combinations, e.g., [264]. Far fewer papers consider the much wider design space of learning both the linguistic variables and the rules inductively (e.g., [265]) [57]. Note that, as in the previous discussions of this section, the explanans offered by EFAs is generally just the fuzzy rulebase.
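The multiobjective setting mentioned earlier in this subsection, in which accuracy and interpretability are optimized together, rests on the notion of Pareto dominance. The sketch below illustrates that notion only; the candidate systems and their scores are hypothetical, and using the number of rules as the interpretability measure is an assumption made here for simplicity rather than a measure prescribed by [57].

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    error: float   # accuracy objective: lower is better
    n_rules: int   # crude interpretability proxy: fewer rules is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.error <= b.error and a.n_rules <= b.n_rules
    better = a.error < b.error or a.n_rules < b.n_rules
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical fuzzy systems produced by an evolutionary search.
CANDIDATES = [
    Candidate("A", 0.10, 30),
    Candidate("B", 0.12, 12),
    Candidate("C", 0.20, 5),
    Candidate("D", 0.25, 12),
    Candidate("E", 0.12, 20),
]
print([c.name for c in pareto_front(CANDIDATES)])
```

The surviving candidates represent different compromises between error and rulebase size; choosing among them is left to the system designer.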

1.4.4. Fuzzy Logic and XAI: Recent Progress

Hybridizations of fuzzy logic and DNNs received little attention until very recently. Much like the shallow neuro-fuzzy systems discussed above, deep neuro-fuzzy systems are designed to improve the transparency of a DNN, while retaining as much of the accuracy offered by a DNN as possible. As with their shallow-network counterparts, deep neuro-fuzzy systems can be divided into two groups: fuzzy DNNs and FIS mimics. One key difference with deep FIS mimics, however, is that the architectures usually create multiple, stacked FIS layers instead of one single FIS. Furthermore, we also note that deep neuro-fuzzy system architectures again do not commonly include a dedicated EI.

Fuzzy DNNs include the fuzzy deep MLP in [266] and the fuzzy restricted Boltzmann machine in [267]. The latter is stacked into a fuzzy deep belief network in [268]. Echo state networks were modified with a second reservoir employing fuzzy clustering for feature reinforcement in [269], and then stacked into a DNN. Rulebases are learned using the Wang–Mendel algorithm [253], and then stacked, in [270]. Stacked autoencoders are merged with fuzzy clustering in [271]. A deep version of the Fuzzy MLP is proposed in [272]. Similarly, fuzzy C-means clustering was used to preprocess images for stacked autoencoders in [273], and for CNNs in [274]. ResNet is fuzzified in [275]. The Softmax classification layer at the output of a DNN was replaced with a fuzzy decision tree in [276]. There are also papers that build a fuzzy DNN using extensions to type-1 fuzzy sets. The fuzzy restricted Boltzmann machine of [267] was redesigned in [277] to employ Pythagorean fuzzy sets [278], and in [279] to employ interval type-2 fuzzy sets. Pythagorean fuzzy sets were used to fuzzify stacked denoising autoencoders in [280].
FIS mimics include a hybrid of ANFIS and long short-term memory in [281], a network for mimicking the fuzzy Choquet integral in [282], and stacked TSK fuzzy systems in [283–285]. A CNN was pruned using fuzzy compression in [286]. A solver for polynomial equations with Z-number coefficients is proposed in [287]. In [288] and [289], the convolutional and deep belief components of a DNN, respectively, are employed as feature extractors prior to building a model via clustering. This is quite similar to the approach in [290, 291]: the densely connected layers at the end of a deep network are replaced with an alternative classifier built from a clustering algorithm. The latter two, however, employ the fuzzy C-means and Gustafson–Kessel fuzzy clustering algorithms, respectively.

1.4.5. Discussion

Interpretability has been a desideratum of fuzzy sets since Zadeh’s initial development of the topic, but truly became a defining characteristic with the development of the linguistic variable construct. Fuzzy control systems and fuzzy expert systems both grew out of Zadeh’s approximate reasoning framework, which is founded on LVs. The success of fuzzy control in particular seems to have provided the impetus for researchers to study hybridizations of neural networks with fuzzy logic. It is worth noting that neuro-fuzzy systems began to gain traction just as the AI Winter was setting in, while EIs for neural networks were again becoming a major concern. By exploring the limits of the interpretability/accuracy trade-off, neuro-fuzzy system research is thus an important historical facet of XAI. In recent works, researchers have returned to these roots, examining how various hybrids of neural networks and fuzzy logic fare in that trade-off. These include fuzzy deep neural networks and stacked fuzzy system mimics, paralleling earlier work on shallow neuro-fuzzy systems.

1.5. Summary and Conclusions

In this chapter, we examined current and historical approaches to XAI and their relationships with computational intelligence. We examined the motivations behind XAI, and how the historical development of symbolic and non-symbolic approaches to AI influenced research into explainability, and vice versa. We then reviewed the research into trust and AI in more depth, along with four major modern approaches to generating an explanans. Our final major topic was the historical and current research into XAI and computational intelligence, in which fuzzy sets and logic play a central role in making AI more explainable. This normally takes the form of algorithms hybridizing machine learning and fuzzy logic (e.g., neuro-fuzzy and evolutionary fuzzy systems). We found that AI systems based on fuzzy sets and logic, in particular, produce highly transparent models, through the use of the linguistic variable construct. It appears that this advantage can carry over to hybrid neuro-fuzzy systems (particularly the fuzzy system mimics) and evolutionary fuzzy systems as well. However, the design of dedicated explanation interfaces for modern AI systems seems to have received far less attention than it did during the heyday of traditional expert systems. In future work, we believe that EI design should again become a major focus for AI and machine learning in general, including in computational intelligence. In particular, recent progress in natural language processing seems to offer a ripe opportunity for creating more effective explanations.

References

[1] J. Manyika and K. Sneader, AI, automation, and the future of work: ten things to solve for, (2018). [Online]. Available at: https://www.mckinsey.com/featured-insights/future-of-work/aiautomation-and-the-future-of-work-ten-things-to-solve-for#.
[2] D. Silver et al., Mastering the game of Go with deep neural networks and tree search, Nature, vol. 529 (2016), doi: 10.1038/nature16961.
[3] C. Metz, Google’s AI wins pivotal second game in match with Go grandmaster, WIRED, Mar. 10 (2016).
[4] M. T. Ribeiro, S. Singh, and C. Guestrin, Why should I trust you?: Explaining the predictions of any classifier, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016).
[5] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission (2015), doi: 10.1145/2783258.2788613.
[6] R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman, Metrics for explainable AI: challenges and prospects, arXiv (2018).
[7] S. Antifakos, N. Kern, B. Schiele, and A. Schwaninger, Towards improving trust in context-aware systems by displaying system confidence (2005), doi: 10.1145/1085777.1085780.
[8] P. Dourish, Accounting for system behaviour: representation, reflection and resourceful action, Third Decenn. Conf. Comput. Context (CIC ’95) (1995).
[9] H. J. Suermondt and G. F. Cooper, An evaluation of explanations of probabilistic inference, Comput. Biomed. Res. (1993), doi: 10.1006/cbmr.1993.1017.
[10] J. Tullio, A. K. Dey, J. Chalecki, and J. Fogarty, How it works: a field study of non-technical users interacting with an intelligent system (2007), doi: 10.1145/1240624.1240630.
[11] T. Miller, Explanation in artificial intelligence: insights from the social sciences, Artif. Intell., vol. 267, pp. 1–38 (2019).
[12] B. Kim, R. Khanna, and O. Koyejo, Examples are not enough, learn to criticize! Criticism for interpretability (2016).
[13] P. Humphreys, The philosophical novelty of computer simulation methods, Synthese, vol. 169, no. 3, pp. 615–626 (2008).
[14] C. Zednik, Solving the black box problem: a normative framework for explainable artificial intelligence, Philos. Technol. (2019), doi: 10.1007/s13347-019-00382-7.
[15] R. Tomsett, D. Braines, D. Harborne, A. Preece, and S. Chakraborty, Interpretable to whom? A role-based model for analyzing interpretable machine learning systems, arXiv (2018).
[16] J. D. Moore and W. R. Swartout, Explanation in expert systems: a survey, ISI Res. Rep. (1988).
[17] J. A. Overton, ‘Explain’ in scientific discourse, Synthese (2013), doi: 10.1007/s11229-012-0109-8.
[18] J. Y. Halpern and J. Pearl, Causes and explanations: a structural-model approach. Part I: causes, Br. J. Philos. Sci. (2005), doi: 10.1093/bjps/axi147.
[19] J. Y. Halpern and J. Pearl, Causes and explanations: a structural-model approach. Part II: explanations, Br. J. Philos. Sci. (2005), doi: 10.1093/bjps/axi148.
[20] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann Series in Representation and Reasoning) (1988).
[21] L. R. Goldberg, The Book of Why: The New Science of Cause and Effect. Judea Pearl and Dana Mackenzie (2018). Hachette UK.
[22] J. D. Trout, Scientific explanation and the sense of understanding, Philos. Sci. (2002), doi: 10.1086/341050.
[23] T. Lombrozo, Explanation and abductive inference, in The Oxford Handbook of Thinking and Reasoning (2012).
[24] S. Sigurdsson, P. A. Philipsen, L. K. Hansen, J. Larsen, M. Gniadecka, and H. Christian Wulf, Detection of skin cancer by classification of Raman spectra, IEEE Trans. Biomed. Eng. (2004), doi: 10.1109/TBME.2004.831538.
[25] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, Quantum-chemical insights from deep tensor neural networks, Nat. Commun. (2017), doi: 10.1007/s13351-0187159-x.
[26] R. M. J. Byrne, Counterfactuals in explainable artificial intelligence (XAI): evidence from human reasoning, in IJCAI, pp. 6276–6282 (2019).
[27] G. Dietz and D. N. Den Hartog, Measuring trust inside organisations, Pers. Rev., vol. 35, no. 5, pp. 557–588 (2006).
[28] F. D. Davis, Perceived usefulness, perceived ease of use, and user acceptance of information technology, MIS Q., vol. 13, no. 3, pp. 319–340 (1989).
[29] X. Deng and X. Wang, Incremental learning of dynamic fuzzy neural networks for accurate system modeling, Fuzzy Sets Syst., vol. 160, no. 7, pp. 972–987 (2009).
[30] N. Wang, D. V. Pynadath, and S. G. Hill, Trust calibration within a human-robot team: comparing automatically generated explanations, in International Conference on Human-Robot Interaction (HRI), pp. 109–116 (2016).
[31] Z. Buçinca, P. Lin, K. Z. Gajos, and E. L. Glassman, Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems, in International Conference on Intelligent User Interfaces 2, pp. 454–464 (2020).
[32] A. Papenmeier, Trust in automated decision making: how user’s trust and perceived understanding is influenced by the quality of automatically generated explanations, University of Twente (2019).
[33] A. Paez, The pragmatic turn in explainable artificial intelligence (XAI), Minds Mach., vol. 29, pp. 441–459 (2019).
[34] W. Samek, T. Wiegand, and K.-R. Müller, Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models, arXiv Prepr. arXiv1708.08296 (2017).
[35] S. Hajian, F. Bonchi, and C. Castillo, Algorithmic bias: from discrimination discovery to fairness-aware data mining, (2016), doi: 10.1145/2939672.2945386.
[36] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, Knowledge discovery and data mining: towards a unifying framework, in International Conference on Knowledge Discovery and Data Mining, Portland, OR, United States (1996).
[37] B. Goodman and S. Flaxman, European Union regulations on algorithmic decision-making and a ‘right to explanation,’ arXiv Prepr. arXiv1606.08813 (2016).
[38] B. Goodman and S. Flaxman, European Union regulations on algorithmic decision making and a ‘right to explanation,’ AI Mag., (2017), doi: 10.1609/aimag.v38i3.2741.
[39] P. Adler et al., Auditing black-box models for indirect influence, Knowl. Inf. Syst., (2018), doi: 10.1007/s10115-017-1116-3.
[40] V. Bellotti and K. Edwards, Intelligibility and accountability: human considerations in context-aware systems, Hum.-Comput. Interact., (2001), doi: 10.1207/S15327051HCI16234_05.
[41] N. Bostrom and E. Yudkowsky, The ethics of artificial intelligence, in The Cambridge Handbook of Artificial Intelligence (2014).
[42] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, Fairness through awareness, (2012), doi: 10.1145/2090236.2090255.
[43] C. K. Fallon and L. M. Blaha, Improving automation transparency: addressing some of machine learning’s unique challenges, (2018), doi: 10.1007/978-3-319-91470-1_21.
[44] B. Hayes and J. A. Shah, Improving robot controller transparency through autonomous policy explanation, (2017), doi: 10.1145/2909824.3020233.
[45] M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth, Rawlsian fairness for machine learning, arXiv Prepr. arXiv1610.09559 (2016).
[46] M. Joseph, M. Kearns, J. Morgenstern, and A. Roth, Fairness in learning: classic and contextual bandits, arXiv, abs/1605.07139 (2016).
[47] J. A. Kroll et al., Accountable algorithms, University of Pennsylvania Law Review (2017).
[48] C. Otte, Safe and interpretable machine learning: a methodological review, Studies in Computational Intelligence, (2013), doi: 10.1007/978-3-642-32378-2_8.
[49] L. Sweeney, Discrimination in online ad delivery: Google ads, black names and white names, racial discrimination, and click advertising, Queue, (2013), doi: 10.1145/2460276.2460278.
[50] K. R. Varshney and H. Alemzadeh, On the safety of machine learning: cyber-physical systems, decision sciences, and data products, Big Data, (2017), doi: 10.1089/big.2016.0051.
[51] S. Wachter, B. Mittelstadt, and C. Russell, Counterfactual explanations without opening the black box: automated decisions and the GDPR, arXiv (2017), doi: 10.2139/ssrn.3063289.
[52] P. Comon, Independent component analysis: a new concept?, Signal Process., (1994), doi: 10.1016/0165-1684(94)90029-9.
[53] G. Montavon, W. Samek, and K. R. Müller, Methods for interpreting and understanding deep neural networks, Digital Signal Process., (2018), doi: 10.1016/j.dsp.2017.10.011.
[54] D. C. Brock, Learning from artificial intelligence’s previous awakenings: the history of expert systems, AI Mag., (2018), doi: 10.1609/aimag.v39i3.2809.
[55] N. Nilsson, The Quest for Artificial Intelligence: A History of Ideas and Achievements. New York, NY, USA: Cambridge University Press (2009).
[56] P. Dear, The Intelligibility of Nature: How Science Makes Sense of the World. Chicago, IL, USA: The University of Chicago Press (2006).
[57] M. Fazzolari, R. Alcala, Y. Nojima, H. Ishibuchi, and F. Herrera, A review of the application of multiobjective evolutionary fuzzy systems: current status and further directions, IEEE Trans. Fuzzy Syst., (2013), doi: 10.1109/TFUZZ.2012.2201338.
[58] D. Poole and A. Mackworth, Artificial Intelligence: Foundations of Computational Agents, 2nd edn. New York, NY, USA: Cambridge University Press (2017).
[59] A. Newell and H. A. Simon, The logic theory machine: a complex information processing system, Proc. IRE Trans. Inf. Theory, vol. IT-2, pp. 61–79 (1956).
[60] A. Newell and H. A. Simon, Computer science as empirical inquiry: symbols and search, Commun. ACM, vol. 19, no. 3, pp. 113–126, Sep. 1976, doi: 10.1145/1283920.1283930.
[61] R. K. Lindsay, B. G. Buchanan, E. A. Feigenbaum, and J. Lederberg, Applications of Artificial Intelligence for Organic Chemistry: The Dendral Project. New York, NY, USA: McGraw-Hill (1980).
[62] B. G. Buchanan and E. H. Shortliffe, Rule-based expert systems: the MYCIN experiments of the Stanford Heuristic Programming Project. Addison-Wesley, Reading (1984).
[63] W. van Melle, A domain-independent system that aids in constructing knowledge-based consultation programs, Stanford University (1980).
[64] E. A. Feigenbaum, P. McCorduck, and H. P. Nii, The Rise of the Expert Company: How Visionary Companies are Using Artificial Intelligence to Achieve Higher Productivity and Profits. New York, NY, USA: Times Books (1988).
[65] L. K. Hansen and L. Rieger, Interpretability in intelligent systems: a new concept?, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2019).
[66] W. R. Swartout and J. D. Moore, Explanation in second generation expert systems, in Second Generation Expert Systems, J. M. David, J. P. Krivine, and R. Simmons (eds.), Berlin, Germany: Springer (1993), pp. 543–585.
[67] S. Chakraborty et al., Interpretability of deep learning models: a survey of results (2018), doi: 10.1109/UIC-ATC.2017.8397411.
[68] F. Doshi-Velez and B. Kim, A roadmap for a rigorous science of interpretability, arXiv Prepr. arXiv1702.08608v1 (2017).
[69] J. Psotka, L. D. Massey, S. A. Mutter, and J. S. Brown, Intelligent tutoring systems: lessons learned, Psychology Press (1988).
[70] D. W. Hasling, W. J. Clancey, and G. Rennels, Strategic explanations for a diagnostic consultation system, Int. J. Man-Mach. Stud., vol. 20, no. 1, pp. 3–19 (1984).
[71] B. Chandrasekaran, Generic tasks in knowledge-based reasoning, IEEE Expert, vol. 1, no. 3, pp. 23–30 (1986).
[72] W. R. Swartout, C. L. Paris, and J. D. Moore, Design for explainable expert systems, IEEE Expert, vol. 6, no. 3, pp. 58–64 (1991).
[73] M. R. Wick and W. B. Thompson, Reconstructive expert system explanation, Artif. Intell., (1992), doi: 10.1016/0004-3702(92)90087-E.
[74] W. R. Swartout, XPLAIN: a system for creating and explaining expert consulting programs, Artif. Intell., (1983), doi: 10.1016/S0004-3702(83)80014-9.
[75] R. Neches, W. R. Swartout, and J. D. Moore, Enhanced maintenance and explanation of expert systems through explicit models of their development, IEEE Trans. Software Eng. (1985), doi: 10.1109/TSE.1985.231882.
[76] W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys., vol. 5, pp. 115–133 (1943).
[77] D. O. Hebb, The Organization of Behavior: A Neuropsychological Theory. New York, NY, USA: Wiley (1949).
[78] F. Rosenblatt, The Perceptron: a probabilistic model for information storage and organization in the brain, Psychol. Rev., vol. 65, pp. 386–408 (1958).
[79] J. J. Williams et al., AXIS: generating explanations at scale with learnersourcing and machine learning, (2016), doi: 10.1145/2876034.2876042.
[80] G. Shafer, A Mathematical Theory of Evidence. Princeton, NJ, USA: Princeton University Press (1976).
[81] L. A. Zadeh, Fuzzy sets, Inf. Control (1965), doi: 10.1016/S0019-9958(65)90241-X.
[82] M. L. Minsky and S. A. Papert, Perceptrons: An Introduction to Computational Geometry. Cambridge, MA, USA: MIT Press (1969).
[83] K. Fukushima, Neural network model for a mechanism of pattern recognition unaffected by shift in position: neocognitron, Trans. IECE, vol. J62-A, no. 10, pp. 655–658 (1979).
[84] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE (1998), doi: 10.1109/5.726791.
[85] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by backpropagating errors, Nature, vol. 323, pp. 533–536 (1986).
[86] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sci., vol. 79, no. 8, pp. 2554–2558 (1982).
[87] S. Haykin, Neural Networks and Learning Machines. Prentice Hall (2008).
[88] E. M. Johansson, F. U. Dowla, and D. M. Goodman, Back-propagation learning for multi-layer feedforward neural networks using the conjugate gradient method, Int. J. Neural Syst., vol. 2, no. 4, pp. 291–301 (1991).
[89] D. Marquardt, An algorithm for least-squares estimation of non-linear parameters, J. Soc. Indus. Appl. Math., vol. 11, no. 2, pp. 431–441 (1963).
[90] G. G. Towell, Symbolic knowledge and neural networks: insertion, refinement, and extraction, University of Wisconsin, Madison (1991).
[91] B. Hassibi and D. G. Stork, Second-order-derivatives for network pruning: optimal brain surgeon, in Advances in Neural Information Processing Systems, pp. 164–171 (1993).
page 46

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Explainable Artificial Intelligence and Computational Intelligence: Past and Present

page 47

47

[92] D. S. Broomhead and D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Syst., vol. 2, pp. 321–355 (1988). [93] A. Bell and T. J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Comput., vol. 6, pp. 1129–1159 (1996). [94] R. Linsker, Self-organization in a perceptual network, IEEE Comput., vol. 21, no. 3, pp. 105–117 (1988). [95] G. Tesauro, Temporal difference learning and TD-gammon, Commun. ACM, vol. 38, no. 3, pp. 58–68 (1995). [96] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cogn. Sci., vol. 9, no. 1, pp. 147–169 (1985). [97] R. Andrews, J. Diederich, and A. B. Tickle, Survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge-Based Syst. (1995), doi: 10.1016/09507051(96)81920-4. [98] A. B. Tickle, F. Maire, G. Bologna, R. Andrews, and J. Diederich, Lessons from past, current issues, and future research directions in extracting the knowledge embedded in artificial neural networks, Lect. Notes Artif. Intell., vol. 1778, pp. 226–239 (2000). [99] M. J. Healy, A topological semantics for rule extraction with neural networks, Conn. Sci., vol. 11, no. 1, pp. 91–113 (1999). [100] R. Krishnan, G. Sivakumar, and P. Battacharya, A search technique for rule extraction from trained neural networks, Pattern Recognit. Lett., vol. 20, pp. 273–280 (1999). [101] A. B. Tickle, R. Andrews, M. Golea, and J. Diederich, The truth will come to light: directions and challenges in extracting the knowledge embedded within trained artificial neural networks, IEEE Trans. Neural Networks (1998), doi: 10.1109/72.728352. [102] L. M. Fu, Rule learning by searching on adapted nets, in National Conference on Artificial Intelligence, pp. 590–595 (1991). [103] G. Towell and J. Shavlik, The extraction of refined rules from knowledge based neural networks, Mach. Learn., vol. 131, pp. 71–101 (1993). [104] C. McMillan, M. 
C. Mozer, and P. Smolensky, The connectionist scientist game: rule extraction and refinement in a neural network (1991). [105] R. Krishnan, A systematic method for decompositional rule extraction from neural networks, in NIPS’96 Rule Extraction From Trained Artificial Neural Networks Wkshp, pp. 38–45 (1996). [106] F. Maire, A partial order for the M-of-N rule extraction algorithm, IEEE Trans. Neural Networks, vol. 8, pp. 1542–1544 (1997). [107] K. Saito and R. Nakano, Law discovery using neural networks, in Proc. NIPS’96 Rule Extraction from Trained Artificial Neural Networks, pp. 62–69 (1996). [108] R. Setiono, Extracting rules from neural networks by pruning and hidden unit splitting, Neural Comput., vol. 9, pp. 205–225 (1997). [109] K. Saito and R. Nakano, Medical diagnostic expert system based on PDP model, in IEEE International Conference on Neural Networks1, pp. 255–262 (1988). [110] A. Bargiela and W. Pedrycz, Granular Computing: An Introduction. Boston, MA, USA: Kluwer Academic (2003). [111] R. Andrews and S. Geva, RULEX & cebp networks as the basis for a rule refinement system, in Hybrid Problems, Hybrid Solutions, J. Hallam and P. McKevitt (eds.), Amsterdam, The Netherlands: IOS Press (1995), p. 12pp. [112] C. L. Giles and C. W. Omlin, Rule refinement with recurrent neural networks, in IEEE International Conference on Neural Networks, pp. 801–806 (1993). [113] M. W. Craven, Extracting Comprehensible Models from extraction algorithms from an artificial neural network, Univ. Wisconsin (1996). [114] S. Thrun, Extracting rules from artificial neural networks with distributed representations (1995).

June 1, 2022 12:51

48

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Handbook on Computer Learning and Intelligence — Vol. I

[115] R. Setiono and L. Huan, NeuroLinear: from neural networks to oblique decision rules, Neurocomputing, vol. 17, pp. 1–24 (1997). [116] F. Maire, Rule-extraction by backpropagation of polyhedra, Neural Networks, vol. 12, pp. 717–725 (1999). [117] K. K. Sethi, D. K. Mishra, and B. Mishra, Extended taxonomy of rule extraction techniques and assessment of KDRuleEx, Int.J . Comput. Appl., vol. 50, no. 21, pp. 25–31 (2012). [118] C. L. Giles and C. W. Omlin, Rule revision with recurrent networks, IEEE Trans. Knowl. Data Eng., vol. 8, no. 1, pp. 183–188 (1996). [119] M. Humphrey, S. J. Cunningham, and I. H. Witten, Knowledge visualization techniques for machine learning, Intell. Data Anal., vol. 2, no. 4, pp. 333–347 (1998). [120] J. Diederich, Rule extraction from support vector machines: an introduction, Studies in Computational Intelligence (2008), doi: 10.1007/978-3-540-75390-2_1. [121] M. W. Craven and J. W. Shavlik, Using sampling and queries to extract rules from trained neural networks, in Machine Learning Proceedings 1994 (1994). [122] G. Bologna, A study on rule extraction from several combined neural networks, Int.J . Neural Syst., vol. 11, no. 3, pp. 247–255 (2001). [123] G. Bologna and S. Fossati, A two-step rule-extraction technique for a CNN, Electronics, vol. 9, no. 6 (2020). [124] D. Crevier, AI: The Tumultuous History of the Search for Artificial Intelligence. New York, NY, USA: BasicBooks (1993). [125] J. Hochreiter, Untersuchungen zu dynamischen neuronalen Netzen, Technische Universität Munchen (1991). [126] C. Cortes and V. Vapnik, Support-vector networks, Mach. Learning, vol. 20, pp. 273–397 (1995). [127] R. Díaz-Uriarte and S. A. de Andrés, Gene selection and classification of microarray data using random forest, BMC Bioinf., vol. 7, no. 3, 2006. [128] A. Statnikov and C. F. Aliferis, Are random forests better than support vector machines for microarray-based cancer classification?, in AMIA Annual Symposium (2007), pp. 686–690. [129] N. 
Barakat and A. P. Bradley, Rule extraction from support vector machines: a review, Neurocomputing, vol. 74, pp. 178–190 (2010). [130] D. Erhan, A. Courville, and Y. Bengio, Understanding representations learned in deep architectures, Network (2010). [131] M. J. Druzdzel and M. Henrion, Using scenarios to explain probabilistic inference, Proc. AAAI–90 Work. Explan. (1990). [132] C. Lacave and F. J. Díez, A review of explanation methods for Bayesian networks, Knowl. Eng. Rev., vol. 17, no. 2, pp. 107–127 (2002), doi: 10.1017/S026988890200019X. [133] W. L. Johnson, Agents that learn to explain themselves, AAAI-94 Proceedings (1994). [134] M. van Lent, W. Fisher, and M. Mancuso, An explainable artificial intelligence system for small-unit tactical behavior, IAAI-04 Proceedings (2004). [135] V. A. W. M. M. Aleven and K. R. Koedinger, An effective metacognitive strategy: learning by doing and explaining with a computer-based Cognitive Tutor, Cogn. Sci., (2002), doi: 10.1207/ s15516709cog2602_1. [136] H. C. Lane, M. G. Core, M. Van Lent, S. Solomon, and D. Gomboc, Explainable artificial intelligence for training and tutoring, Proc. 12th Int. Conf. Artif. Intell. Educ. (2005). [137] F. Sørmo, J. Cassens, and A. Aamodt, Explanation in case-based reasoning-perspectives and goals, Artif. Intell. Rev. (2005), doi: 10.1007/s10462-005-4607-7. [138] A. Kofod-Petersen, J. Cassens, and A. Aamodt, Explanatory capabilities in the CREEK knowledge-intensive case-based reasoner, in Proceedings of the 2008 conference on Tenth Scandinavian Conference on Artificial Intelligence: SCAI 2008, pp. 28–35 (2008).

page 48

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Explainable Artificial Intelligence and Computational Intelligence: Past and Present

page 49

49

[139] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: elevating the role of image understanding in visual question answering (2017), doi: 10.1109/ CVPR.2017.670. [140] Y. Goyal, A. Mohapatra, D. Parikh, D. Batra, and V. Tech, Interpreting visual question answering models (2016). [141] F. Sadeghi, S. K. Divvala, and A. Farhadi, VisKE: visual knowledge extraction and question answering by visual verification of relation phrases (2015), doi: 10.1109/CVPR.2015.7298752. [142] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, Grad-CAM: why did you say that? Visual explanations from deep networks via gradient-based localization, arXiv:1610.02391 (2016). [143] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, Understanding neural networks through deep visualization, arXiv Prepr. arXiv1506.06579 (2015). [144] T. Zahavy, N. Ben-Zrihem, and S. Mannor, Graying the black box: understanding DQNs, in Proceedings of The 33rd International Conference on Machine Learning, 2016, pp. 1899-1908. [145] M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, in European Conference on Computer Vision, pp. 818–833 (2014). [146] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell, Generating visual explanations (2016), doi: 10.1007/978-3-319-46493-0_1. [147] T. Kulesza, M. Burnett, W. K. Wong, and S. Stumpf, Principles of explanatory debugging to personalize interactive machine learning (2015), doi: 10.1145/2678025.2701399. [148] T. Kulesza, S. Stumpf, M. Burnett, and I. Kwan, Tell me more? The effects of mental model soundness on personalizing an intelligent agent (2012), doi: 10.1145/2207676.2207678. [149] M. Lomas, R. Chevalier, E. V. Cross, R. C. Garrett, J. Hoare, and M. Kopack, Explaining robot actions (2012), doi: 10.1145/2157689.2157748. [150] D. H. 
Park et al., Attentive explanations: justifying decisions and pointing to the evidence (extended abstract), arXiv. (2017). [151] S. Rosenthal, S. P. Selvaraj, and M. Veloso, Verbalization: narration of autonomous robot experience, in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) (2016). [152] D. Gunning and D. W. Aha, DARPA’s explainable artificial intelligence program, AI Mag., vol. Summer, pp. 44–58 (2019). [153] A. Arioua, P. Buche, and M. Croitoru, Explanatory dialogues with argumentative faculties over inconsistent knowledge bases, Expert Syst. Appl. (2017), doi: 10.1016/j.eswa.2017.03.009. [154] P. Shafto and N. D. Goodman, Teaching games: statistical sampling assumptions for learning in pedagogical situations, in Proceedings of the 30th annual conference of the Cognitive Science Society, pp. 1632–1637 (2008). [155] A. Nguyen, J. Yosinski, and J. Clune, Deep neural networks are easily fooled: high confidence predictions for unrecognizable images (2015), doi: 10.1109/CVPR.2015.7298640. [156] R. Sheh and I. Monteath, Defining explainable AI for requirements analysis, KI - Kunstl. Intelligenz (2018), doi: 10.1007/s13218-018-0559-3. [157] P. Lipton, Contrastive explanation, R. Inst. Philos. Suppl. (1990), doi: 10.1017/ s1358246100005130. [158] J. A. Glomsrud, A. Ødegårdstuen, A. L. St. Clair, and Ø. Smogeli, Trustworthy versus explainable AI in autonomous vessels, in International Seminar on Safety and Security of Autonomous Vessels (2019), p. 11. [159] F. Nothdurft, F. Richter, and W. Minker, Probabilistic human-computer trust handling, in SIGDIAL, pp. 51–59 (2014). [160] S. A. Binder, G. Montavon, F. Klauschen, K. R. Müller, and W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS One (2015), doi: 10.1371/journal.pone.0130140.

June 1, 2022 12:51

50

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Handbook on Computer Learning and Intelligence — Vol. I

[161] B. Kim et al., Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV), in Proc. ICML, pp. 2673–2682 (2018). [162] Y. Ramon, D. Martens, T. Evgeniou, and S. Praet, Metafeatures-based rule-extraction for classifiers on behavioral and textual data: a preprint, arXiv. (2020). [163] P. Beatty, I. Reay, S. Dick, and J. Miller, Consumer trust in e-commerce web sites, ACM Comput. Surv., vol. 43, no. 3, pp. 14.1–14.46 (2011). [164] R. C. Mayer, J. H. Davis, and F. D. Schoorman, An integrative model of organizational trust, Acad. Manag. Rev., vol. 20, no. 3, pp. 709–734 (1995). [165] J. Lyons, N. Ho, J. Friedman, G. Alarcon, and S. Guznov, Trust of learning systems: considerations for code, algorithms, and affordances for learning, in Human and Machine Learning, J. Zhou and F. Chen (eds.), Cham, Switzerland: Springer International Publishing AG, pp. 265–278 (2018). [166] J. D. Lee and B. D. Seppelt, Human factors in automation design, in Springer Handbook of Automation, S. Y. Nof (ed.), Berlin, Germany: Springer-Verlag, pp. 417–436 (2009). [167] J. B. Lyons and C. K. Stokes, Human–human reliance in the context of automation, Hum. Factors, vol. 54, no. 1, pp. 112–121 (2011). [168] L. Onnasch, C. D. Wickens, H. Li, and D. Manzey, Human performance consequences of stages and levels of automation: an integrated meta-analysis, Hum. Factors J. Hum. Factors Ergon. Soc., vol. 56, no. 3, pp. 476–488 (2014). [169] J. D. Lee and K. A. See, Trust in automation: designing for appropriate reliance, Hum. Factors, vol. 46, no. 1, pp. 50–80 (2004). [170] M. Ashoori and J. D. Weisz, In AI we trust? Factors that influence trustworthiness of AI-infused decision-making processes (2019). [Online]. Available at: https://arxiv.org/abs/1912.02675. [171] B. Israelsen and N. 
Ahmed, ‘Dave...I can assure you ...that it’s going to be all right ...’ A definition, case for, and survey of algorithmic assurances in human-autonomy trust relationships, ACM Comput. Surv., vol. 51, no. 6, pp. 1–37 (2019). [172] J. Drozdal et al., Trust in AutoML: exploring information needs for establishing trust in automated machine learning systems (2020). [Online]. Available at: https://arxiv.org/abs/2001. 06509. [173] E. Toreini, M. Aitken, K. Coopamootoo, K. Elliott, C. G. Zelaya, and A. van Moorsel, The relationship between trust in AI and trustworthy machine learning technologies (2020). [Online]. Available at: https://arxiv.org/abs/1912.00782. [174] S. Mohseni, N. Zarei, and E. D. Ragan, A multidisciplinary survey and framework for design and evaluation of explainable AI systems (2020). [Online]. Available at: https://arxiv.org/abs/1811.11839. [175] D. C. Dennett, The Intentional Stance. Cambridge, MA, USA: MIT Press (1996). [176] P. Cofta, Trusting the IoT: there is more to trust than trustworthiness, in IFIP International Conference on Trust Management, pp. 98–107 (2019). [177] M. Desai et al., Effects of changing reliability on trust of robot systems, in International Conference on Human-Robot Interaction (HRI), pp. 73–80 (2012). [178] A. Freedy, E. DeVisser, G. Weltman, and N. Coeyman, Measurement of trust in humanrobot collaboration, in International Symposium on Collaborative Technologies and Systems, pp. 106–114 (2007). [179] R. C. Arkin, J. Borenstein, and A. R. Wagner, Competing ethical frameworks mediated by moral emotions in HRI: motivations, background, and approach, in ICRES 2019: International Conference on Robot Ethics and Standards, London, UK (2019). [180] K. Weitz, D. Schiller, R. Schlagowski, T. Huber, and E. Andre, ‘Do you trust me?’: Increasing user-trust by integrating virtual agents in explainable AI interaction design, in International Conference on Intelligent Virtual Agents, pp. 7–9 (2019). [181] K. M. 
Lee, Presence, explicated, Commun. Theory, vol. 14, no. 1, pp. 27–50 (2004).

page 50

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Explainable Artificial Intelligence and Computational Intelligence: Past and Present

page 51

51

[182] W. A. Bainbridge, J. W. Hart, E. S. Kim, and B. Scassellati, The benefits of interactions with physically present robots over video-displayed agents, Int. J. Soc. Robot., vol. 3, pp. 41–52 (2011). [183] P. Andras et al., Trusting intelligent machines: deepening trust within socio-technical systems, IEEE Technol. Soc. Mag., vol. 37, no. 4, pp. 76–83 (2018). [184] K. Lazanyi, Readiness for artificial intelligence, in International Symposium on Intelligent Systems and Informatics (SISY), pp. 235–238 (2018). [185] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, Deconvolutional networks (2010), doi: 10.1109/CVPR.2010.5539957. [186] K. Simonyan, A. Vedaldi, and A. Zisserman, Deep inside convolutional networks: visualising image classification models and saliency maps, arXiv1312.6034 [cs] (2013). [187] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, Striving for simplicity: the all convolutional net, arXiv Prepr. arXiv1412.6806 (2014). [188] M. D. Zeiler, G. W. Taylor, and R. Fergus, Adaptive deconvolutional networks for mid and high level feature learning (2011), doi: 10.1109/ICCV.2011.6126474. [189] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. R. Müller, Evaluating the visualization of what a deep neural network has learned, IEEE Trans. Neural Networks Learn. Syst. (2017), doi: 10.1109/TNNLS.2016.2599820. [190] J. Adebayo, J. Gilmer, I. Goodfellow, and B. Kim, Local explanation methods for deep neural networks lack sensitivity to parameter values, arXiv:1810.03307 (2018). [191] P. J. Kindermans et al., The (un)reliability of saliency methods, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2019). [192] A. Ghorbani, A. Abid, and J. Zou, Interpretation of neural networks is fragile (2019), doi: 10.1609/aaai.v33i01.33013681. [193] S. Wang, T. Zhou, and J. A. 
Bilmes, Bias also matters: bias attribution for deep neural network explanation, in Proceedings of the 36th International Conference on Machine Learning, Long Beach, California (2019). [194] M. Sundararajan, A. Taly, and Q. Yan, Axiomatic attribution for deep networks, in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia (2017). [195] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K. R. Müller, Explaining nonlinear classification decisions with deep Taylor decomposition, Pattern Recognit. (2017), doi: 10.1016/j.patcog.2016.11.008. [196] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim, Sanity checks for saliency maps, NeurIPS (2018). [197] C. Szegedy et al., Going deeper with convolutions, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015). [198] F. Wang and C. Rudin, Falling rule lists, in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA (2015). [199] S. Mohseni and E. D. Ragan, A human-grounded evaluation benchmark for local explanations of machine learning, arXiv. (2018). [200] H. J. P. Weerts, W. Van Ipenburg, and M. Pechenizkiy, A human-grounded evaluation of SHAP for alert processing, arXiv. (2019). [201] D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen, Comprehensible credit scoring models using rule extraction from support vector machines, Eur. J. Oper. Res. (2007), doi: 10.1016/j.ejor.2006.04.051. [202] W. Chen, M. Zhang, Y. Zhang, and X. Duan, Exploiting meta features for dependency parsing and part-of-speech tagging, Artif. Intell. (2016), doi: 10.1016/j.artint.2015.09.002. [203] M. Sushil, S. Šuster, and W. Daelemans, Rule induction for global explanation of trained models, arXiv. (2018), doi: 10.18653/v1/w18-5411.

June 1, 2022 12:51

52

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Handbook on Computer Learning and Intelligence — Vol. I

[204] Y. Wei, M. C. Chang, Y. Ying, S. N. Lim, and S. Lyu, Explain black-box image classifications using superpixel-based interpretation (2018), doi: 10.1109/ICPR.2018.8546302. [205] D. Alvarez-Melis and T. S. Jaakkola, Towards robust interpretability with self-explaining neural networks, arXiv:1806.07538 (2018). [206] R. Seising, The Fuzzification of Systems: The Genesis of Fuzzy Set Theory and Its Initial Applications - Developments Up to the 1970s. Berlin, Germany: Springer (2007). [207] E. Turunen, Algebraic structures in fuzzy logic, Fuzzy Sets Syst., vol. 52, pp. 181–188 (1992). [208] L. A. Zadeh, R. E. Kalman, and N. DeClaris, Aspects of Network and System Theory (1971). [209] L. A. Zadeh, A rationale for fuzzy control, J . Dyn. Syst. Meas. Control. Trans. ASME (1972), doi: 10.1115/1.3426540. [210] L. A. Zadeh, Outline of a new approach to the analysis of complex systems and decision processes, IEEE Trans. Syst. Man Cybern. (1973), doi: 10.1109/TSMC.1973.5408575. [211] L. A. Zadeh, On the analysis of large-scale systems, in H. Gottinger (ed.), Systems Approaches and Environment Problems, Gottingen: Vandenhoeck and Ruprecht, pp. 23–37 (1974). [212] L. A. Zadeh, The concept of a linguistic variable and its application to approximate reasoning-I, Inf. Sci. (Ny) (1975), doi: 10.1016/0020-0255(75)90036-5. [213] L. A. Zadeh, Quantitative fuzzy semantics, Inf. Sci. (Ny), vol. 3, no. 2, pp. 159–176 (1971). [214] E. H. Mamdani and S. Assilian, An experiment in linguistic synthesis with a fuzzy logic controller, Int.J . Man. Mach. Stud. (1975), doi: 10.1016/S0020-7373(75)80002-2. [215] E. H. Mamdani, Application of fuzzy algorithms for control of simple dynamic plant, Proc. Inst. Electr. Eng. (1974), doi: 10.1049/piee.1974.0328. [216] E. H. Mamdani, Advances in the linguistic synthesis of fuzzy controllers, Int.J . Man. Mach. Stud. (1976), doi: 10.1016/S0020-7373(76)80028-4. [217] L. P. Holmblad and J. J. Ostergaard, Control of a cement kiln by fuzzy logic, in M. M. 
Gupta and E. Sanchez (eds.), Fuzzy Information and Decision Processes, Amsterdam: North-Holland, pp. 389–399 (1982), doi: 10.1016/b978-1-4832-1450-4.50039-0. [218] C. P. Pappis and E. H. Mamdani, A fuzzy logic controller for a traffic junction, IEEE Trans. Syst. Man Cybern. (1977), doi: 10.1109/TSMC.1977.4309605. [219] M. J. Flanagan, On the application of approximate reasoning to control of the activated sludge process, J . Environ. Sci. Heal. Part B Pestic. Food Contam. Agric. Wastes (1980), doi: 10.1109/JACC.1980.4232059. [220] M. Togai and H. Watanabe, A VLSI implementation of a fuzzy-inference engine: toward an expert system on a chip, Inf. Sci. (Ny) (1986), doi: 10.1016/0020-0255(86)90017-4. [221] R. M. Tong, An annotated bibliography of fuzzy control, Ind. Appl. Fuzzy Control, pp. 249–269 (1985). [222] S. Yasunobu, S. Miyamoto, and H. Ihara, Fuzzy control for automatic train operation system, IFAC Proc. (1983), doi: 10.1016/s1474-6670(17)62539-4. [223] S. Miyamoto, Predictive fuzzy control and its application to automatic train operation systems, First Int. Conf. Fuzzy Inf. Process., pp. 22–26 (1984). [224] S. Yasunobu, Automatic train operation by predictive fuzzy control, Ind. Appl. Fuzzy Control (1985). [225] P. D. Hansen, Techniques for process control, in Process/Industrial Instruments and Controls Handbook, pp. 2.30–2.56 (1999). [226] B. R. Gaines and J. H. Andreae, A learning machine in the context of the general control problem, in IFAC Congress, pp. 342–349 (1966). [227] G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Upper Saddle River, NJ: Prentice Hall PTR (1995). [228] T. Takagi and M. Sugeno, Fuzzy identification of systems and its applications to modeling and control, IEEE Trans. Syst. Man Cybern. (1985), doi: 10.1109/TSMC.1985.6313399.

page 52

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Explainable Artificial Intelligence and Computational Intelligence: Past and Present

page 53

53

[229] K. Tanaka and H. O. Wang, Fuzzy Control Systems Design and Analysis: A Linear Matrix Inequality Approach. New York, NY, USA: John Wiley & Sons, Inc. (2001). [230] O. Cordon, A historical review of evolutionary learning methods for Mamdani-type fuzzy rulebased systems: designing interpretable genetic fuzzy systems, Int.J . Approx. Reason., vol. 52, pp. 894–913 (2011). [231] J. Durkin, Expert system development tools, in The Handbook of Applied Expert Systems, J. Liebowitz (ed.), Boca Raton, FL, USA: CRC Press (1998). [232] C. L. Forgy, Rete: a fast algorithm for the many pattern/many object pattern match problem, Artif. Intell., vol. 19, pp. 17–37 (1982). [233] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th edn. Hoboken, NJ, USA: Pearson Education (2020). [234] J. Pan, G. N. DeSouza, and A. C. Kak, FuzzyShell: a large-scale expert system shell using fuzzy logic for uncertainty reasoning, IEEE Trans. Fuzzy Syst., vol. 6, no. 4, pp. 563–581 (1998). [235] M. Fieschi, M. Joubert, D. Fieschi, and M. Roux, SPHINX: a system for computer-aided diagnosis, Methods Inf. Med., vol. 21, no. 3, pp. 143–148 (1982). [236] J.-C. Buisson, H. Farreny, and H. Prade, The development of a medical expert system and the treatment of imprecision in the framework of possibility theory, Inf. Sci. (Ny), vol. 37, no. 1–3, pp. 211–226 (1985). [237] K.-P. Adlassnig, G. Kolarz, and W. Scheithauer, Present state of the medical expert system CADIAG-2, Methods Inf. Med., vol. 24, no. 1, pp. 13–20 (1985). [238] L. A. Zadeh, The role of fuzzy logic in the management of uncertainty in expert systems, Fuzzy Sets Syst., vol. 11, no. 1–3, pp. 199–227 (1983). [239] L. A. Zadeh, Syllogistic reasoning in fuzzy logic and its application to usuality and reasoning with dispositions, IEEE Trans. Syst. Man Cybern., vol. 15, no. 6, pp. 754–763 (1985). [240] C. Negoita, Expert Systems and Fuzzy Systems. Menlo Park, CA, USA: Benjamin-Cummings Pub. Co. (1985). [241] J. J. Buckley, W. 
Siler, and D. Tucker, A fuzzy expert system, Fuzzy Sets Syst., vol. 20, pp. 1–16 (1986). [242] J. J. Buckley and D. Tucker, Second generation fuzzy expert system, Fuzzy Sets Syst., vol. 31, pp. 271–284 (1989). [243] P. Bonissone, S. S. Gans, and K. S. Decker, RUM: a layered architecture for reasoning with uncertainty, in Proc. IJCAI, pp. 891–898 (1987). [244] K. S. Leung, W. S. F. Wong, and W. Lam, Applications of a novel fuzzy expert system shell, Expert Syst., vol. 6, no. 1, pp. 2–10 (1989). [245] L. O. Hall, Rule chaining in fuzzy expert systems, IEEE Trans. Fuzzy Syst., vol. 9, no. 6, pp. 822–828 (2001). [246] C. Igel and K.-H. Temme, The chaining syllogism in fuzzy logic, IEEE Trans. Fuzzy Syst., vol. 12, no. 6, pp. 849–853 (2004). [247] J. Pan, G. N. DeSouza, and A. C. Kak, FuzzyShell: a large-scale expert system shell using fuzzy logic for uncertainty reasoning, IEEE Trans. Fuzzy Syst., vol. 6, no. 4, pp. 563–581 (1998). [248] D. Dubois and H. Prade, Fuzzy relation equations and causal reasoning, Fuzzy Sets Syst., vol. 75, no. 2, pp. 119–134 (1995). [249] L. J. Mazlack, Imperfect causality, Fundam. Inform., vol. 59, no. 2–3, pp. 191–201 (2004). [250] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Upper Saddle River, New Jersey: Prentice Hall (1997). [251] S. C. Lee and E. T. Lee, Fuzzy sets and neural networks, J . Cybern., vol. 4, pp. 83–103 (1974). [252] J. M. Keller and D. J. Hunt, Incorporating fuzzy membership functions into the perceptron algorithm, IEEE Trans. Pattern Anal. Mach. Intell., vol. 7, no. 6, pp. 693–699 (1985).

June 1, 2022 12:51

54

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Handbook on Computer Learning and Intelligence — Vol. I

[253] L.-X. Wang and J. M. Mendel, Generating fuzzy rules by learning from examples, IEEE Trans. Syst. Man Cybern., vol. 22, no. 6, pp. 1414–1427 (1992). [254] H. Takagi, Fusion technology of fuzzy theory and neural networks - survey and future directions, in Proc. Int. Conf. on Fuzzy Logic and Neural Networks, pp. 13–26 (1990). [255] Y. Hayashi, J. J. Buckley, and E. Czogala, Fuzzy neural network with fuzzy signals and weights, Int.J . Intell. Syst., vol. 8, pp. 527–537 (1993). [256] S. K. Pal and S. Mitra, Multilayer perceptron, fuzzy sets, and classification, IEEE Trans. Neural Networks (1992), doi: 10.1109/72.159058. [257] J. S. R. Jang, ANFIS: adaptive-network-based fuzzy inference system, IEEE Trans. Syst. Man Cybern. (1993), doi: 10.1109/21.256541. [258] P. V. de C. Souza, Fuzzy neural networks and neuro-fuzzy networks: a review the main techniques and applications used in the literature, Appl. Soft Comput., vol. 92, p. 26 (2020). [259] M. Valenzuela-Rendón, The fuzzy classifier system: motivations and first results, in Int. Conf. Parallel Problem Solving from Nature, pp. 330–334 (1991). [260] P. Thrift, Fuzzy logic synthesis with genetic algorithms, in Int. Conf. on Genetic Algorithms, pp. 509–513 (1991). [261] D. T. Pham and D. Karaboga, Optimum design of fuzzy logic controllers using genetic algorithms, J . Syst. Eng., vol. 1, pp. 114–118 (1991). [262] A. Botta, B. Lazzerini, F. Marcelloni, and D. C. Stefanescu, Context adaptation of fuzzy systems through a multi-objective evolutionary approach based on a novel interpretability index, Soft Comput., vol. 13, no. 5, pp. 437–449 (2009). [263] H. Ishibuchi and T. Yamamoto, “Fuzzy rule selection by multi-objective genetic local search algorithms and rule evaluation measures in data mining, Fuzzy Sets Syst., vol. 141, no. 1, pp. 59–88 (2004). [264] M. Cococcioni, P. Ducange, B. Lazzerini, and F. 
Marcelloni, A Pareto-based multi-objective evolutionary approach to the identification of Mamdani fuzzy systems, Soft Comput., vol. 11, no. 11, pp. 1013–1031 (2007). [265] R. Alcala, P. Ducange, F. Herrera, B. Lazzerini, and F. Marcelloni, A multiobjective evolutionary approach to concurrently learn rule and data bases of linguistic fuzzy-rule-based systems, IEEE Trans. Fuzzy Syst., vol. 17, no. 5, pp. 1106–1122 (2009). [266] A. Sarabakha and E. Kayacan, Online deep fuzzy learning for control of nonlinear systems using expert knowledge, IEEE Trans. Fuzzy Syst. (2020), doi: 10.1109/TFUZZ.2019.2936787. [267] C. L. P. Chen, C. Y. Zhang, L. Chen, and M. Gan, Fuzzy restricted boltzmann machine for the enhancement of deep learning, IEEE Trans. Fuzzy Syst. (2015), doi: 10.1109/TFUZZ.2015. 2406889. [268] S. Feng, C. L. Philip Chen, and C. Y. Zhang, A fuzzy deep model based on fuzzy restricted Boltzmann machines for high-dimensional data classification, IEEE Trans. Fuzzy Syst. (2020), doi: 10.1109/TFUZZ.2019.2902111. [269] S. Zhang, Z. Sun, M. Wang, J. Long, Y. Bai, and C. Li, deep fuzzy echo state networks for machinery fault diagnosis, IEEE Trans. Fuzzy Syst. (2020), doi: 10.1109/TFUZZ.2019. 2914617. [270] L. X. Wang, Fast training algorithms for deep convolutional fuzzy systems with application to stock index prediction, IEEE Trans. Fuzzy Syst. (2020), doi: 10.1109/TFUZZ.2019.2930488. [271] Q. Feng, L. Chen, C. L. Philip Chen, and L. Guo, Deep fuzzy clustering: a representation learning approach, IEEE Trans. Fuzzy Syst. (2020), doi: 10.1109/TFUZZ.2020.2966173. [272] P. Hurtik, V. Molek, and J. Hula, Data preprocessing technique for neural networks based on image represented by a fuzzy function, IEEE Trans. Fuzzy Syst. (2020), doi: 10.1109/TFUZZ.2019.2911494. [273] L. Chen, W. Su, M. Wu, W. Pedrycz, and K. Hirota, A fuzzy deep neural network with sparse autoencoder for emotional intention understanding in human-robot interaction, IEEE Trans. Fuzzy Syst. 
(2020), doi: 10.1109/TFUZZ.2020.2966167.

page 54

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch01

Explainable Artificial Intelligence and Computational Intelligence: Past and Present

page 55



June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch02

© 2022 World Scientific Publishing Company
https://doi.org/10.1142/9789811247323_0002

Chapter 2

Fundamentals of Fuzzy Set Theory

Fernando Gomide

School of Electrical and Computer Engineering, University of Campinas, 13083-852 Campinas, SP, Brazil
[email protected]

The goal of this chapter is to offer a comprehensive, systematic, updated, and self-contained tutorial-like introduction to fuzzy set theory. The notions and concepts addressed here cover the spectrum that contains, we believe, the material deemed relevant for computational intelligence and intelligent systems theory and applications. It starts by reviewing the very basic idea of sets, introduces the notion of a fuzzy set, and gives the main insights and interpretations to help intuition. It proceeds with the characterization of fuzzy sets, operations and their generalizations, information granulation and its key constituents, and ends by discussing linguistic approximation and extensions of fuzzy sets.

2.1. Sets

A set is a fundamental concept in mathematics and science. Classically, a set is defined as “any multiplicity which can be thought of as one . . . any totality of definite elements which can be bound up into a whole by means of a law” or, more descriptively, “any collection into a whole M of definite and separate objects m of our intuition or our thought” [1, 2]. Intuitively, a set may be viewed as the class M of all objects m satisfying any particular property or defining condition. Alternatively, a set can be characterized by an assignment scheme to define the objects of a domain that satisfy the intended property. For instance, an indicator function or a characteristic function is a function defined on a domain X that indicates membership of an object of X in a set A on X, having the value 1 for all elements of A and the value 0 for all elements of X not in A. The domain can be either continuous or discrete. For instance, the closed interval [3, 7] constitutes a continuous and bounded domain, whereas the set N = {0, 1, 2, . . .} of natural numbers is discrete and countable, but with no bound.

In general, a characteristic function of a set A defined in X assumes the following form:

$$A(x) = \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{if } x \notin A \end{cases} \tag{2.1}$$

The empty set ∅ has a characteristic function that is identically equal to zero, ∅(x) = 0 for all x in X. The domain X itself has a characteristic function that is identically equal to one, X(x) = 1 for all x in X. Also, a singleton A = {a}, a set with only a single element, has the characteristic function A(x) = 1 if x = a and A(x) = 0 otherwise. Characteristic functions A: X → {0, 1} induce a constraint with well-defined boundaries on the elements of the domain X that can be assigned to a set A.

2.2. Fuzzy Sets

The fundamental idea of a fuzzy set is to relax the rigid boundaries of the constraints induced by characteristic functions by admitting intermediate values of class membership. The idea is to allow assignments of intermediate values between 0 and 1 to quantify our perception of how compatible the objects of a domain are with the class, with 0 meaning incompatibility (complete exclusion) and 1 meaning compatibility (complete membership). Membership values thus express the degrees to which each object of a domain is compatible with the properties distinctive to the class. Intermediate membership values mean that no natural threshold exists and that elements of a universe can be a member of a class and, at the same time, belong to other classes with different degrees. Gradual, less strict membership degrees are the essence of fuzzy sets. Formally, a fuzzy set A is described by a membership function mapping the elements of a domain X to the unit interval [0, 1] [3]:

$$A: X \to [0, 1] \tag{2.2}$$

Membership functions fully define fuzzy sets. Membership functions generalize characteristic functions in the same way as fuzzy sets generalize sets. Fuzzy sets can also be seen as sets of ordered pairs of the form (x, A(x)), where x is an object of X and A(x) is its corresponding degree of membership. For a finite domain X = {x1, x2, . . . , xn}, A can be represented by an n-dimensional vector A = (a1, a2, . . . , an) with each component ai = A(xi).

More illustratively, we may view fuzzy sets as elastic constraints imposed on the elements of a universe. Fuzzy sets deal primarily with the concepts of elasticity, graduality, or absence of sharply defined boundaries. In contrast, sets are concerned with rigid boundaries, lack of graded belongingness, and sharp binary constraints. Gradual membership means that no natural boundary exists and that some elements of the domain can, contrary to sets, belong simultaneously to different fuzzy sets with different degrees of membership. For instance, in Figure 2.1 (left), x1 = 1.5 belongs to the category of short people, and x2 = 2.0 belongs to the category of tall people, but in Figure 2.1 (right), x1 simultaneously is 0.8 short and 0.2 tall, and x2 simultaneously is 0.4 short and 0.8 tall.

Figure 2.1: Two-valued membership in characteristic functions (sets) and gradual membership represented by membership functions (fuzzy sets).
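To make the contrast concrete, the following minimal Python sketch compares a characteristic function with a membership function for the concept tall, and builds the vector representation A = (a1, . . . , an) on a finite domain. The threshold 1.8 and the breakpoints 1.5 and 2.0 are assumptions chosen for illustration; they are not read from Figure 2.1.

```python
def tall_crisp(height_m, threshold=1.8):
    """Characteristic function: 1 inside the set, 0 outside (hypothetical threshold)."""
    return 1 if height_m >= threshold else 0


def tall_fuzzy(height_m, low=1.5, high=2.0):
    """Membership function rising linearly from 0 at `low` to 1 at `high` (assumed shape)."""
    if height_m <= low:
        return 0.0
    if height_m >= high:
        return 1.0
    return (height_m - low) / (high - low)


# Finite domain X = {x1, ..., xn} and the vector A = (a1, ..., an) with ai = A(xi).
domain = [1.5, 1.75, 2.0, 2.25]
vector = [tall_fuzzy(x) for x in domain]

# Equivalent view of the fuzzy set as ordered pairs (x, A(x)).
pairs = list(zip(domain, vector))
```

The crisp version jumps from 0 to 1 at the threshold, whereas the fuzzy version assigns intermediate grades to heights between the two breakpoints.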

2.2.1. Interpretation of Fuzzy Sets

In fuzzy set theory, fuzziness has a precise meaning. Fuzziness primarily means lack of precise boundaries of a collection of objects and, as such, it is a manifestation of imprecision and a particular type of uncertainty.

First, it is worth noting that fuzziness is both conceptually and formally different from the fundamental concept of probability. In general, it is difficult to foresee the result of tossing a fair coin, as it is impossible to know for certain whether heads or tails will occur. We may, at most, say that there is a 50% chance of heads or tails, but as soon as the coin falls, uncertainty vanishes. On the contrary, when we say that a person is tall we are not being precise, and imprecision remains independently of any event. Formally, probability is a set function, a mapping whose universe is a set of subsets of a domain. In contrast, fuzzy sets are membership functions, mappings from some given universe of discourse to the unit interval.

Secondly, fuzziness, generality, and ambiguity are distinct notions. A notion is general when it applies to a multiplicity of objects and keeps only a common essential property. An ambiguous notion stands for several unrelated objects. Therefore, from this point of view, fuzziness means neither generality nor ambiguity, and applications of fuzzy sets exclude these categories. Fuzzy set theory assumes that the universe is well defined and has its elements assigned to classes by means of a numerical scale.

Applications of fuzzy sets in areas such as data analysis, reasoning under uncertainty, and decision-making suggest different interpretations of membership grades in terms of similarity, uncertainty, and preference [4, 5].

Membership value A(x) from the point of view of similarity means the degree of compatibility of an element x ∈ X with representative elements of A. This is the primary and most intuitive interpretation of a fuzzy set, one that is particularly suitable for data analysis. An example is deciding how to qualify an environment as comfortable when we know that the current temperature is 25 °C. Such quantification is a matter of degree. For instance, assuming a domain X = [0, 40] and choosing 20 °C as representative of comfortable temperature, we note that 25 °C is comfortable to the degree of 0.2 (Figure 2.2). In the example, we have adopted piecewise linearly decreasing functions of the distance between temperature values and the representative value 20 °C to determine the corresponding membership degree.

Figure 2.2: Membership function for a fuzzy set of comfortable temperature.

Now, assume that the value of a variable x is such that A(x) > 0. Then, given a value v of X, A(v) expresses the possibility that x = v, given that all that is known is that x is in A. In this situation, the membership degree of a given tentative value v to the class A reflects the degree of plausibility that this value is the same as the value of x. This idea reflects a type of uncertainty because if the membership degree is high, our confidence about the value of x may still be low, but if the degree is low, then the tentative value may be rejected as an implausible candidate. The variable labeled by the class A is uncontrollable. This allows assignment of fuzzy sets to possibility distributions as suggested in possibility theory [6]. For instance, suppose someone said he felt comfortable in an environment. In this situation, the membership degree of a given tentative temperature value, say 25 °C, reflects the degree of plausibility that this value of temperature is the same as the one under which the individual felt comfortable. Note that the actual value of the temperature is unknown, but there is no question as to whether that temperature did occur. Possibility concerns whether an event may occur and to what degree. On the contrary, probability concerns whether an event will occur.

Finally, assume that A reflects a preference on the values of a variable x in X. For instance, x can be a decision variable and fuzzy set A a flexible constraint characterizing feasible values and decision-maker preferences. In this case, A(v) denotes the grade of preference in favor of v as the value of x. This interpretation prevails in fuzzy optimization and decision analysis. For instance, we may be interested in finding a comfortable value of temperature. The membership degree of a candidate temperature value v reflects our degree of satisfaction with the particular temperature value chosen. In this situation, the choice of the value is controllable in the sense that the value being adopted depends on our choice.
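The similarity interpretation can be sketched in a few lines of Python. The function below decreases linearly with the distance from the representative value 20 °C. The spread of 6.25 °C is an assumption picked only so that 25 °C receives the degree 0.2 quoted in the text; the exact shape used in Figure 2.2 may differ.

```python
def comfortable(temp_c, representative=20.0, spread=6.25):
    """Similarity-based membership: 1 at the representative value,
    decreasing linearly with distance, 0 beyond `spread` (assumed value)."""
    distance = abs(temp_c - representative)
    return max(0.0, 1.0 - distance / spread)


degree_at_25 = comfortable(25.0)   # close to 0.2 under the assumed spread
degree_at_20 = comfortable(20.0)   # the representative value is fully comfortable
```

Under the preference interpretation the same function could be read as a flexible constraint: a decision-maker choosing a temperature v would be satisfied to the degree comfortable(v).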

2.2.2. Rationale for Membership Functions

Generally speaking, any function A: X → [0, 1] is qualified to serve as a membership function describing the corresponding fuzzy set. In practice, the form of the membership functions should reflect the environment of the problem at hand for which we construct fuzzy sets. They should mirror our perception of the concept to be modeled and used in problem solving, the level of detail we intend to capture, and the context in which the fuzzy sets are going to be used. It is essential to assess the type of fuzzy set from the standpoint of its suitability when handling the design and optimization issues. Keeping these reasons in mind, we review the most commonly used categories of membership functions. All of them are defined in the universe of real numbers, that is, X = R.

Triangular membership function: It is described by piecewise linear segments of the form

$$A(x, a, m, b) = \begin{cases} 0 & \text{if } x \le a \\ \dfrac{x - a}{m - a} & \text{if } x \in [a, m] \\ \dfrac{b - x}{b - m} & \text{if } x \in [m, b] \\ 0 & \text{if } x \ge b. \end{cases}$$

Using more concise notation, the above expression can be written down in the form A(x, a, m, b) = max{min [(x − a)/(m − a), (b − x)/(b − m)], 0} (Figure 2.3). The meaning of the parameters is straightforward: m is the modal (typical) value of the fuzzy set while a and b are the lower and upper bounds, respectively. They can be viewed as those elements of the domain that delineate the elements belonging to A with nonzero membership degrees. Triangular fuzzy sets (membership functions) are the simplest possible models of grades of membership as they are fully defined by only three parameters. The semantics of triangular fuzzy sets reflects the knowledge of the typical value of the concept and its spread. The linear change in the membership grades is the simplest possible model of membership one could think of. If the derivative of the triangular membership function is taken as a measure of sensitivity of A, then its sensitivity is constant for each of the linear segments of the fuzzy set.

Figure 2.3: Triangular membership function (plotted for a = −1, m = 2, b = 5).

Trapezoidal membership function: A piecewise linear function characterized by four parameters, a, m, n, and b, each of which defines one of the four linear parts of the membership function (Figure 2.4). It has the following form:

$$A(x) = \begin{cases} 0 & \text{if } x < a \\ \dfrac{x - a}{m - a} & \text{if } x \in [a, m] \\ 1 & \text{if } x \in [m, n] \\ \dfrac{b - x}{b - n} & \text{if } x \in [n, b] \\ 0 & \text{if } x > b \end{cases}$$

We can rewrite A using an equivalent notation as follows: A(x, a, m, n, b) = max{min [(x − a)/(m − a), 1, (b − x)/(b − n)], 0}.

Γ-membership function: This function has the form

$$A(x) = \begin{cases} 0 & \text{if } x \le a \\ 1 - e^{-k(x - a)^2} & \text{if } x > a \end{cases} \quad \text{or} \quad A(x) = \begin{cases} 0 & \text{if } x \le a \\ \dfrac{k(x - a)^2}{1 + k(x - a)^2} & \text{if } x > a, \end{cases}$$

where k > 0 (Figure 2.5).

Figure 2.4: Trapezoidal membership function (plotted for a = −2.5, m = 0, n = 2.5, b = 5.0).

Figure 2.5: Γ-membership function (plotted for a = 1, k = 5).
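The triangular, trapezoidal, and Γ-membership functions translate almost directly into code. The sketch below is one possible Python rendering, using the concise max/min forms for the piecewise linear cases. The function names are illustrative choices, and the code assumes a < m < b (respectively a < m ≤ n < b) so that no division by zero occurs.

```python
import math


def triangular(x, a, m, b):
    """max{min[(x - a)/(m - a), (b - x)/(b - m)], 0}, assuming a < m < b."""
    return max(min((x - a) / (m - a), (b - x) / (b - m)), 0.0)


def trapezoidal(x, a, m, n, b):
    """max{min[(x - a)/(m - a), 1, (b - x)/(b - n)], 0}, assuming a < m <= n < b."""
    return max(min((x - a) / (m - a), 1.0, (b - x) / (b - n)), 0.0)


def gamma_exponential(x, a, k):
    """Gamma-membership function, exponential variant: 1 - exp(-k (x - a)^2) for x > a."""
    if x <= a:
        return 0.0
    return 1.0 - math.exp(-k * (x - a) ** 2)


def gamma_rational(x, a, k):
    """Gamma-membership function, rational variant: k (x - a)^2 / (1 + k (x - a)^2) for x > a."""
    if x <= a:
        return 0.0
    t = k * (x - a) ** 2
    return t / (1.0 + t)
```

With the parameter values quoted for the plots, triangular(2, -1, 2, 5) returns the modal grade 1, and both Γ variants stay at 0 up to a and then rise monotonically toward 1.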

S-membership function: The function is expressed by

$$A(x) = \begin{cases} 0 & \text{if } x \le a \\ 2\left(\dfrac{x - a}{b - a}\right)^2 & \text{if } x \in [a, m] \\ 1 - 2\left(\dfrac{x - b}{b - a}\right)^2 & \text{if } x \in [m, b] \\ 1 & \text{if } x > b. \end{cases}$$

The point m = (a + b)/2 is the crossover point of the S-function (Figure 2.6).

Figure 2.6: S-membership function (plotted for a = −1, b = 3).

Figure 2.7: Gaussian membership function (plot legend: k = 0.5, m = 2).

Gaussian membership function: This membership function is described by the following relationship:

$$A(x, m, \sigma) = \exp\left(-\frac{(x - m)^2}{\sigma^2}\right).$$

An example of the membership function is shown in Figure 2.7. Gaussian membership functions have two important parameters. The modal value m represents the typical element of A while σ denotes the spread of A. Higher values of σ correspond to larger spreads of the fuzzy sets.

Exponential-like membership function: It has the following form (Figure 2.8):

$$A(x) = \frac{1}{1 + k(x - m)^2}, \quad k > 0. \tag{2.3}$$

The spread of the exponential-like membership function increases as the value of k gets lower.

Figure 2.8: An example of the exponential-like membership function (plotted for k = 1, m = 2).
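The smooth membership functions above can be sketched in Python in the same way. The names s_function, gaussian, and exponential_like are illustrative, and the code assumes a < b, σ ≠ 0, and k > 0.

```python
import math


def s_function(x, a, b):
    """S-membership function with crossover point m = (a + b) / 2, assuming a < b."""
    m = (a + b) / 2.0
    if x <= a:
        return 0.0
    if x <= m:
        return 2.0 * ((x - a) / (b - a)) ** 2
    if x <= b:
        return 1.0 - 2.0 * ((x - b) / (b - a)) ** 2
    return 1.0


def gaussian(x, m, sigma):
    """Gaussian membership function exp(-(x - m)^2 / sigma^2)."""
    return math.exp(-((x - m) ** 2) / sigma ** 2)


def exponential_like(x, m, k):
    """Exponential-like membership function 1 / (1 + k (x - m)^2), k > 0 (Eq. (2.3))."""
    return 1.0 / (1.0 + k * (x - m) ** 2)
```

At the crossover point the S-function takes the value 0.5, and comparing exponential_like(4, 2, 0.5) with exponential_like(4, 2, 1) shows the wider spread obtained for the lower value of k.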

2.3. Characteristics of Fuzzy Sets

Given the diversity of potentially useful and semantically sound membership functions, there are certain common characteristics or descriptors that are conceptually and operationally useful to capture the essence of fuzzy sets. We provide next a list of the descriptors commonly encountered in practice.

Normality: We say that the fuzzy set A is normal if its membership function attains 1, that is,

$$\sup_{x \in X} A(x) = 1 \tag{2.4}$$

If this property does not hold, we call the fuzzy set subnormal. An illustration of the corresponding fuzzy set is shown in Figure 2.9. The supremum (sup) in the above expression is also referred to as the height of the fuzzy set A. Thus, the fuzzy set is normal if hgt(A) = 1. The normality of A has a simple interpretation: by determining the height of the fuzzy set, we identify an element of the domain whose membership degree is the highest. The value of the height being equal to 1 states that there is at least one element in X that is fully typical with respect to A and that can be regarded as entirely compatible with the semantic category represented by A. A subnormal fuzzy set has height lower than 1, viz. hgt(A) < 1, and the degree of typicality of elements in this fuzzy set is somewhat lower (weaker), so we cannot identify any element in X that is fully compatible with the underlying concept. In practice, while forming a fuzzy set we expect its normality.

Figure 2.9: Examples of normal and subnormal fuzzy sets.

Normalization: The normalization, denoted by Norm(·), is a mechanism to convert a subnormal nonempty fuzzy set A into its normal counterpart. This is done by dividing the original membership function by its height:

$$\mathrm{Norm}(A) = \frac{A(x)}{\mathrm{hgt}(A)}. \tag{2.5}$$

While the height describes the global property of the membership grades, the following notions offer an interesting characterization of the elements of X regarding their membership degrees.

Support: The support of a fuzzy set A, Supp(A), is the set of all elements of X with nonzero membership degrees in A:

$$\mathrm{Supp}(A) = \{x \in X \mid A(x) > 0\}. \tag{2.6}$$

In other words, support identifies all elements of X that exhibit some association with the fuzzy set under consideration (by being allocated to A with nonzero membership degrees).
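For a finite domain, height, normalization, and support reduce to a few dictionary operations. The sketch below represents a fuzzy set as a Python dict mapping each element to its membership grade, which is a representation chosen here for illustration; on a finite domain the supremum is simply the maximum.

```python
def height(A):
    """hgt(A): the largest membership grade (the supremum on a finite domain)."""
    return max(A.values(), default=0.0)


def normalize(A):
    """Norm(A): divide every grade by the height (Eq. (2.5)); A must be nonempty."""
    h = height(A)
    if h == 0:
        raise ValueError("an empty fuzzy set cannot be normalized")
    return {x: grade / h for x, grade in A.items()}


def support(A):
    """Supp(A): elements with nonzero membership (Eq. (2.6)); a set, not a fuzzy set."""
    return {x for x, grade in A.items() if grade > 0}


# A subnormal fuzzy set on the hypothetical domain X = {1, 2, 3, 4}.
A = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.25}
A_normal = normalize(A)
```

Here hgt(A) = 0.5, so A is subnormal, and normalization rescales the grades so that element 3 becomes fully typical.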

Figure 2.10: Support and core of A.

Figure 2.11: Example of α-cut and of strong α-cut.

Core: The core of a fuzzy set A, Core(A), is the set of all elements of the universe that are typical of A, that is, those that come with unit membership grades:

$$\mathrm{Core}(A) = \{x \in X \mid A(x) = 1\}. \tag{2.7}$$

The support and core are related in the sense that they identify and collect elements belonging to the fuzzy set, yet at two different levels of membership. Given the character of the core and support, we note that all elements of the core of A are subsumed by the elements of the support of this fuzzy set. Note that both support and core are sets, not fuzzy sets. In Figure 2.10, they are intervals. We refer to them as the set-based characterizations of fuzzy sets. While core and support are somewhat extreme, in the sense that they identify the elements of A that exhibit the strongest and the weakest links with A, we may also be interested in characterizing sets of elements that come with some intermediate membership degrees. The notion of α-cut offers here an interesting insight into the nature of fuzzy sets.

α-cut: The α-cut of a fuzzy set A, denoted by A_α, is a set consisting of the elements of the domain whose membership values are equal to or exceed a certain threshold level α ∈ [0, 1]. Formally,

$$A_\alpha = \{x \in X \mid A(x) \ge \alpha\}.$$

A strong α-cut identifies all elements in X whose membership values strictly exceed α:

$$A_\alpha^{+} = \{x \in X \mid A(x) > \alpha\}.$$

Figure 2.11 illustrates the notion of α-cut and strong α-cut. Both support and core are limit cases of α-cuts and strong α-cuts. From α = 0 and the strong α-cut, we arrive at the concept of the support of A. The value α = 1 means that the corresponding α-cut is the core of A.

Representation theorem: Any fuzzy set can be viewed as a family of sets. This is the essence of a result known as the representation theorem. The representation theorem states that any fuzzy set A can be decomposed into a family of α-cuts,

$$A = \bigcup_{\alpha \in [0,1]} \alpha A_\alpha,$$

or, equivalently in terms of membership functions,

$$A(x) = \sup_{\alpha \in [0,1]} \alpha A_\alpha(x).$$

Convexity: We say that a fuzzy set is convex if its membership function satisfies the following condition:

$$A[\lambda x_1 + (1 - \lambda) x_2] \ge \min[A(x_1), A(x_2)], \quad \forall x_1, x_2 \in X,\ \lambda \in [0, 1]. \tag{2.8}$$

Figure 2.12: Convex fuzzy set A.

Relationship (2.8) says that, whenever we choose a point x on a line segment between x1 and x2, the point (x, A(x)) is always located above or on the line passing through the two points (x1, A(x1)) and (x2, A(x2)) (Figure 2.12). Note that the membership function is not a convex function in the conventional sense [7]. A set S is convex if, for all x1, x2 ∈ S, the point x = λx1 + (1 − λ)x2 ∈ S for all λ ∈ [0, 1]. Convexity means that any line segment identified by any two points in S is contained in S. For instance, intervals of real numbers are convex sets. Therefore, if a fuzzy set is convex, then all of its α-cuts are convex, and conversely, if a fuzzy set has all its α-cuts convex, then it is a convex fuzzy set (Figure 2.13). Thus, we may say that a fuzzy set is convex if all its α-cuts are convex.
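The set-based descriptors can be illustrated on a finite, ordered domain. In the Python sketch below a fuzzy set is a dict from elements to grades (an assumed representation), the representation theorem is checked by rebuilding A(x) as the largest α whose α-cut contains x, and convexity is tested through the α-cuts: on a sampled domain, "convex" is taken to mean that every α-cut is a contiguous run of the sorted elements, which is only a discrete analogue of an interval.

```python
def alpha_cut(A, alpha, strong=False):
    """Alpha-cut {x | A(x) >= alpha}, or strong alpha-cut {x | A(x) > alpha}."""
    if strong:
        return {x for x, grade in A.items() if grade > alpha}
    return {x for x, grade in A.items() if grade >= alpha}


def core(A):
    """Core as the alpha-cut at alpha = 1."""
    return alpha_cut(A, 1.0)


def support(A):
    """Support as the strong alpha-cut at alpha = 0."""
    return alpha_cut(A, 0.0, strong=True)


def reconstruct(A):
    """Representation theorem on a finite domain: A(x) = max over alpha of alpha * A_alpha(x)."""
    levels = set(A.values())
    return {x: max((a for a in levels if x in alpha_cut(A, a)), default=0.0) for x in A}


def is_convex(A):
    """True if every alpha-cut is a contiguous run of the sorted domain."""
    ordered = sorted(A)
    for alpha in set(A.values()):
        cut = alpha_cut(A, alpha)
        idx = [i for i, x in enumerate(ordered) if x in cut]
        if idx and idx[-1] - idx[0] + 1 != len(idx):
            return False
    return True


A = {1: 0.2, 2: 0.7, 3: 1.0, 4: 0.4}   # hypothetical unimodal fuzzy set
B = {1: 1.0, 2: 0.3, 3: 0.8}           # hypothetical fuzzy set with a dip in the middle
```

For A, every α-cut is a contiguous block around element 3, whereas the 0.8-cut of B is {1, 3}, which skips element 2.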

Figure 2.13: Convex and non-convex fuzzy sets.

Fuzzy sets can be characterized by counting their elements and using a single numeric quantity as a descriptor of the count. While in the case of sets this sounds clear, with fuzzy sets we have to consider different membership grades. In the simplest form this counting comes under the name of cardinality.

Cardinality: Given a fuzzy set A defined in a finite or countable universe X, its cardinality, denoted by Card(A), is expressed as the following sum:

$$\mathrm{Card}(A) = \sum_{x \in X} A(x) \tag{2.9}$$

or, alternatively, as the integral

$$\mathrm{Card}(A) = \int_X A(x)\,dx, \tag{2.10}$$

assuming that the integral is well-defined. We also use the alternative notation Card(A) = |A| and refer to it as a sigma count (σ-count). The cardinality of fuzzy sets is explicitly associated with the concept of granularity of information granules realized in this manner. More descriptively, the more elements of A we encounter, the higher the level of abstraction supported by A and the lower the granularity of the construct. Higher values of cardinality come with a higher level of abstraction (generalization) and lower values of granularity (specificity).

So far, we discussed properties of a single fuzzy set. Next, we look at the characterizations of relationships between two fuzzy sets.

Equality: We say that two fuzzy sets A and B defined in X are equal if and only if their membership functions are identical, that is,

$$A(x) = B(x) \quad \forall x \in X. \tag{2.11}$$
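A short Python sketch of the σ-count follows. The finite case is the plain sum of Eq. (2.9); for the integral form of Eq. (2.10), the sketch uses a midpoint rule, which is an approximation chosen here for illustration and assumes the membership function is integrable on the interval [lo, hi]. The dict representation and function names are also assumptions.

```python
def sigma_count(A):
    """Card(A) for a finite domain: the sum of all membership grades (Eq. (2.9))."""
    return sum(A.values())


def sigma_count_continuous(mu, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of mu over [lo, hi] (Eq. (2.10))."""
    width = (hi - lo) / steps
    return sum(mu(lo + (i + 0.5) * width) for i in range(steps)) * width


def equal(A, B):
    """Equality (Eq. (2.11)): identical membership grades on the same domain."""
    return A.keys() == B.keys() and all(A[x] == B[x] for x in A)


# Triangular membership function with a = -1, m = 2, b = 5 in its max/min form.
def triangle(x):
    return max(min((x + 1.0) / 3.0, (5.0 - x) / 3.0), 0.0)


finite_card = sigma_count({1: 0.25, 2: 0.5, 3: 0.25})
triangle_card = sigma_count_continuous(triangle, -1.0, 5.0)
```

The triangular example has base 6 and height 1, so its σ-count should be close to the area 3.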

Figure 2.14: Inclusion A ⊂ B.

Inclusion: Fuzzy set A is a subset of B (A is included in B), A ⊆ B, if and only if every element of A also is an element of B. This property expressed in terms of membership degrees means that the following inequality is satisfied:

$$A(x) \le B(x) \quad \forall x \in X. \tag{2.12}$$

An illustration of these two relationships in the case of sets is shown in Figure 2.14. To satisfy the relationship of inclusion, we require that the characteristic functions adhere to Eq. (2.12) for all elements of X. If the inclusion is not satisfied even for a single point of X, the inclusion property does not hold. See Pedrycz and Gomide [8] for an alternative notion of inclusion that captures the idea of the degree of inclusion.

Specificity: Often, we need to quantify how much a single element of a domain could be viewed as a representative of a fuzzy set. If this fuzzy set is a singleton, then

$$A(x) = \begin{cases} 1 & \text{if } x = x_0 \\ 0 & \text{if } x \ne x_0 \end{cases}$$

and there is no hesitation in selecting x0 as the sole representative of A. We say that A is very specific and its choice comes with no hesitation. On the other extreme, if A covers the entire domain X and has all elements with the membership grade equal to 1, the choice of only one representative of A becomes more problematic since it is not clear which element to choose. These two extreme situations are shown in Figure 2.15. Intuitively, we see that specificity is a concept that relates to the cardinality of a set. The higher the cardinality of the set (the more evident its abstraction) is, the lower its specificity. One approach to quantify the notion of specificity of a fuzzy set is as follows [9]: The specificity of a fuzzy set A defined in X, denoted by Spec(A), is a mapping from a family of normal fuzzy sets in X into nonnegative numbers such that the following conditions are satisfied (Figure 2.16).

Figure 2.15: Two extreme cases of sets with distinct levels of specificity.

Figure 2.16: Specificity of fuzzy sets: fuzzy set A1 is less specific than A2.

1. Spec(A) = 1 if there exists only one element x0 of X for which A(x0) = 1 and A(x) = 0 ∀x ≠ x0 (A is a singleton);
2. Spec(A) = 0 if A(x) = 0 ∀x ∈ X (A is the empty set ∅);
3. Spec(A1) ≤ Spec(A2) if A1 ⊃ A2.

A particular instance of specificity measure is [10]

$$\mathrm{Spec}(A) = \int_0^{\alpha_{\max}} \frac{1}{\mathrm{Card}(A_\alpha)}\, d\alpha,$$

where αmax = hgt(A). For finite domains, the integration is replaced by the sum

$$\mathrm{Spec}(A) = \sum_{i=1}^{m} \frac{1}{\mathrm{Card}(A_{\alpha_i})}\, \Delta\alpha_i,$$

where Δα_i = α_i − α_{i−1} with α_0 = 0; m stands for the number of membership grades of A.
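The inclusion test and the finite-domain specificity sum can be sketched as follows. The code assumes a dict representation of fuzzy sets, takes the levels α_1 < . . . < α_m to be the distinct positive membership grades of A, and counts the elements of each crisp α-cut for Card(A_α); these are reading choices made for this sketch.

```python
def is_subset(A, B):
    """Inclusion (Eq. (2.12)): A(x) <= B(x) for every x of the common domain."""
    return all(A[x] <= B[x] for x in A)


def specificity(A):
    """Finite-domain specificity: sum of (alpha_i - alpha_{i-1}) / Card(A_alpha_i)."""
    levels = sorted({grade for grade in A.values() if grade > 0})
    spec, previous = 0.0, 0.0
    for alpha in levels:
        cut_size = sum(1 for grade in A.values() if grade >= alpha)
        spec += (alpha - previous) / cut_size
        previous = alpha
    return spec


singleton = {"a": 1.0, "b": 0.0, "c": 0.0}
empty = {"a": 0.0, "b": 0.0, "c": 0.0}
whole = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
A1 = {"a": 1.0, "b": 0.5}   # hypothetical fuzzy set containing A2
A2 = {"a": 1.0, "b": 0.0}
```

The three conditions can be observed directly: the singleton scores 1, the empty set scores 0, and the larger set A1 is less specific than A2. The whole four-element domain scores 1/4, reflecting the hesitation in picking a single representative.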

2.4. Operations with Fuzzy Sets

As in set theory, we may combine fuzzy sets to produce new fuzzy sets. Generally, a combination must possess properties that match intuition, comply with the semantics of the intended operation, and be flexible enough to fit application requirements. Next, we provide an overview of the main operations and their generalizations, interpretations, and examples of realizations.

2.4.1. Standard Operations on Sets and Fuzzy Sets

To start, we review the familiar operations of intersection, union, and complement of set theory. Consider two sets A = {x ∈ R | 1 ≤ x ≤ 3} and B = {x ∈ R | 2 ≤ x ≤ 4}, closed intervals of the real line. Their intersection is the set A ∩ B = {x ∈ R | 2 ≤ x ≤ 3}. Figure 2.17 shows the intersection operation in terms of the characteristic functions of A and B. Looking at the values of the characteristic function of A ∩ B that result when comparing the individual values of A(x) and B(x) for each x ∈ R, we note that they correspond to the minimum between the values of A(x) and B(x). In general, given the characteristic functions of A and B, the characteristic function of their intersection A ∩ B is computed using

$$(A \cap B)(x) = \min[A(x), B(x)] \quad \forall x \in X, \tag{2.13}$$

where (A ∩ B)(x) denotes the characteristic function of the set A ∩ B.

Figure 2.17: Intersection of sets in terms of their characteristic functions.

The union of sets A and B in terms of the characteristic functions proceeds similarly. If A and B are the same intervals as above, then A ∪ B = {x ∈ R | 1 ≤ x ≤ 4}. In this case, the value of the characteristic function of the union is the maximum of corresponding values of the characteristic functions A(x) and B(x) taken pointwise (Figure 2.18).

Figure 2.18: Union of two sets in terms of their characteristic functions.

Figure 2.19: Complement of a set in terms of its characteristic function.

Therefore, given the characteristic functions of A and B, we determine the characteristic function of the union as

$$(A \cup B)(x) = \max[A(x), B(x)] \quad \forall x \in X, \tag{2.14}$$

where (A ∪ B)(x) denotes the characteristic function of the set A ∪ B. Likewise, as Figure 2.19 suggests, the complement Ā of A, expressed in terms of its characteristic function, is the one-complement of the characteristic function of A. For instance, if A = {x ∈ R | 1 ≤ x ≤ 3}, then Ā = {x ∈ R | x < 1 or x > 3}. Thus, the characteristic function of the complement of a set A is

$$\bar{A}(x) = 1 - A(x), \quad \forall x \in X. \tag{2.15}$$
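The pointwise min, max, and one-complement rules of Eqs. (2.13)–(2.15) can be sketched in Python as follows. The intervals A = [1, 3] and B = [2, 4] are sampled at the integer points 0, . . . , 5, a discretization chosen only for illustration, and sets are stored as dicts from points to characteristic values.

```python
def intersection(A, B):
    """(A ∩ B)(x) = min[A(x), B(x)] (Eq. (2.13))."""
    return {x: min(A[x], B[x]) for x in A}


def union(A, B):
    """(A ∪ B)(x) = max[A(x), B(x)] (Eq. (2.14))."""
    return {x: max(A[x], B[x]) for x in A}


def complement(A):
    """Complement: 1 - A(x) (Eq. (2.15))."""
    return {x: 1.0 - A[x] for x in A}


points = range(6)
A = {x: 1.0 if 1 <= x <= 3 else 0.0 for x in points}   # characteristic function of [1, 3]
B = {x: 1.0 if 2 <= x <= 4 else 0.0 for x in points}   # characteristic function of [2, 4]

A_and_B = intersection(A, B)   # nonzero exactly on the sampled points of [2, 3]
A_or_B = union(A, B)           # nonzero exactly on the sampled points of [1, 4]
not_A = complement(A)          # nonzero exactly on the sampled points outside [1, 3]
```

Because the functions only manipulate grades pointwise, the same code applies unchanged when the dict values are membership grades in [0, 1] rather than 0 or 1.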

Because sets are particular instances of fuzzy sets, the operations of intersection, union and complement should equally well apply to fuzzy sets. Indeed, when we use membership functions in Eq. (2.2) to (2.15), these formulae serve as standard definitions of intersection, union, and complement of fuzzy sets. Examples are shown in Figure 2.20. Standard set and fuzzy set operations fulfill the properties of Table 2.1. Figures 2.19 and 2.20 show that the laws of non-contradiction and excluded middle hold for sets, but they do not hold by fuzzy sets with the standard operations (see Table 2.2). Particularly worth noting is a violation of the non-contradiction law

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

74

b4528-v1-ch02

page 74

Handbook on Computer Learning and Intelligence — Vol. I

Figure 2.20: Standard operations on fuzzy sets.

Table 2.1: Properties of operations with sets and fuzzy sets.

1. Commutativity          A ∪ B = B ∪ A
                          A ∩ B = B ∩ A
2. Associativity          A ∪ (B ∪ C) = (A ∪ B) ∪ C
                          A ∩ (B ∩ C) = (A ∩ B) ∩ C
3. Distributivity         A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
                          A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
4. Idempotency            A ∪ A = A
                          A ∩ A = A
5. Boundary conditions    A ∪ φ = A and A ∪ X = X
                          A ∩ φ = φ and A ∩ X = A
6. Involution             the complement of Ā equals A
7. Transitivity           if A ⊂ B and B ⊂ C then A ⊂ C

Table 2.2: Non-contradiction and excluded middle for standard operations.

                          Sets          Fuzzy sets
8. Non-contradiction      A ∩ Ā = φ     A ∩ Ā ≠ φ
9. Excluded middle        A ∪ Ā = X     A ∪ Ā ≠ X

since it shows the issue of fuzziness from the point of view of the coexistence of a class and its complement, one of the main sources of fuzziness. This coexistence is impossible in set theory and means a contradiction in conventional logic. Interestingly, if we consider a particular subnormal fuzzy set A whose membership function is constant and equal to 0.5 for all elements of the universe, then from Eqs. (2.2)–(2.15) we see that A = A ∪ Ā = A ∩ Ā = Ā, a situation in which there is no way to distinguish the fuzzy set from its complement or from any fuzzy set that results from standard operations with them. The value 0.5 is a crossover point representing a balance between membership and non-membership at which we attain the highest level of fuzziness. The fuzzy set and its complement are indiscernible.
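The behavior just described can be checked numerically. The following is a minimal sketch, assuming a small discrete universe and made-up membership grades (neither appears in the text); the function names are illustrative only.

```python
# Standard fuzzy set operations on a discrete universe. Membership grades are
# stored in a dict {element: grade}; the sample grades are hypothetical.

def f_union(A, B):
    """Standard union: pointwise maximum."""
    return {x: max(A[x], B[x]) for x in A}

def f_intersection(A, B):
    """Standard intersection: pointwise minimum."""
    return {x: min(A[x], B[x]) for x in A}

def f_complement(A):
    """Standard complement: pointwise 1 - A(x)."""
    return {x: 1.0 - A[x] for x in A}

universe = [1, 2, 3, 4]
A = dict(zip(universe, [0.0, 0.25, 0.75, 1.0]))
not_A = f_complement(A)

# Non-contradiction and excluded middle fail wherever 0 < A(x) < 1.
overlap = f_intersection(A, not_A)   # not identically 0
cover = f_union(A, not_A)            # not identically 1

# The constant 0.5 fuzzy set cannot be told apart from its complement,
# nor from their standard union and intersection.
H = {x: 0.5 for x in universe}
indiscernible = (H == f_complement(H)
                 == f_union(H, f_complement(H))
                 == f_intersection(H, f_complement(H)))
```

With these grades, `overlap` is nonzero and `cover` falls below 1 at the two elements whose membership is strictly between 0 and 1, while both laws still hold at the crisp elements.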


As we will see next, there are types of operations for which non-contradiction and excluded middle are recovered. While for sets these types produce the same result as the standard operators, this is not the case with fuzzy sets. In addition, A = A ∪ Ā = A ∩ Ā = Ā does not hold for every choice of intersection, union, and complement operators.

Operations on fuzzy sets concern manipulation of their membership functions. Therefore, they are domain dependent, and different contexts may require different realizations. For instance, since operations provide ways to combine information, they can be performed differently in image processing, control, and diagnostic systems applications. When developing realizations of intersection and union of fuzzy sets, it is useful to require commutativity, associativity, monotonicity, and identity. The last requirement (identity) has different forms depending on the operation. For instance, the intersection of any fuzzy set with the domain X should return the fuzzy set itself. For the union, identity implies that the union of any fuzzy set and an empty fuzzy set returns the fuzzy set itself. Thus, in principle, any two-place operator [0, 1] × [0, 1] → [0, 1] that satisfies this collection of requirements can be regarded as a potential candidate to realize the intersection or union of fuzzy sets, with identity playing the role of boundary conditions. In general, idempotency is not strictly required, but the realizations of union and intersection could be idempotent, as are the minimum and maximum operators (min[a, a] = a and max[a, a] = a).

2.4.2. Triangular Norms and Conorms

Triangular norms and conorms constitute general forms of operations for intersection and union. While t-norms generalize the intersection of fuzzy sets, t-conorms (or s-norms) generalize the union of fuzzy sets [11]. A triangular norm is a two-place operation t: [0, 1] × [0, 1] → [0, 1] that satisfies the following properties:

1. Commutativity: a t b = b t a
2. Associativity: a t (b t c) = (a t b) t c
3. Monotonicity: if b ≤ c then a t b ≤ a t c
4. Boundary conditions: a t 1 = a and a t 0 = 0

where a, b, c ∈ [0, 1]. There is a one-to-one correspondence between the general requirements outlined above and the properties of t-norms. The first three reflect the general character of set operations. Boundary conditions stress the fact that all t-norms attain the same values at the boundaries of the unit square [0, 1] × [0, 1]. Thus, for sets, any t-norm produces the same result.


Examples of t-norms are

Minimum: a t_m b = min(a, b) = a ∧ b
Product: a t_p b = ab
Lukasiewicz: a t_l b = max(a + b − 1, 0)
Drastic product:

$$ a \, t_d \, b = \begin{cases} a & \text{if } b = 1 \\ b & \text{if } a = 1 \\ 0 & \text{otherwise} \end{cases} $$
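The four t-norms listed above are short enough to write out directly. The sketch below is illustrative only: the function names and the sample grid are assumptions, and the checks cover a finite grid rather than proving the properties.

```python
import math

def t_min(a, b):
    return min(a, b)

def t_product(a, b):
    return a * b

def t_lukasiewicz(a, b):
    return max(a + b - 1.0, 0.0)

def t_drastic(a, b):
    if b == 1.0:
        return a
    if a == 1.0:
        return b
    return 0.0

T_NORMS = [t_drastic, t_lukasiewicz, t_product, t_min]
GRID = [k / 4 for k in range(5)]  # 0, 0.25, 0.5, 0.75, 1 (hypothetical sample)

def satisfies_boundary(t):
    """Check a t 1 = a and a t 0 = 0 on the sample grid."""
    return all(math.isclose(t(a, 1.0), a, abs_tol=1e-12)
               and math.isclose(t(a, 0.0), 0.0, abs_tol=1e-12)
               for a in GRID)

def is_commutative(t):
    """Check a t b = b t a on the sample grid."""
    return all(math.isclose(t(a, b), t(b, a), abs_tol=1e-12)
               for a in GRID for b in GRID)

# Same pair of grades under each t-norm, from drastic product to minimum.
a, b = 0.5, 0.75
values = [t(a, b) for t in T_NORMS]
```

For this pair the results grow from the drastic product to the minimum, which reflects how strongly each operator penalizes partial membership.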

The minimum (t_m), product (t_p), Lukasiewicz (t_l), and drastic product (t_d) operators are shown in Figure 2.21, with examples of the intersection of the triangular fuzzy sets on X = [0, 8], A = (x, 1, 3, 6) and B = (x, 2.5, 5, 7).

Triangular conorms are functions s: [0, 1] × [0, 1] → [0, 1] that serve as generic realizations of the union operator on fuzzy sets. One can show that s: [0, 1] × [0, 1] → [0, 1] is a t-conorm if and only if there exists a t-norm, called the dual t-norm, such that for all a, b ∈ [0, 1] we have

a s b = 1 − (1 − a) t (1 − b).    (2.16)

For the corresponding dual t-norm we have

a t b = 1 − (1 − a) s (1 − b).    (2.17)

The duality expressed by Eqs. (2.16) and (2.17) can be viewed as an alternative definition of t-conorms. Duality allows us to deduce the properties of t-conorms on the basis of the analogous properties of t-norms. From Eqs. (2.16) and (2.17), we get

(1 − a) t (1 − b) = 1 − a s b
(1 − a) s (1 − b) = 1 − a t b.

These two relationships can be expressed symbolically as

$$ \overline{A} \cap \overline{B} = \overline{A \cup B}, \qquad \overline{A} \cup \overline{B} = \overline{A \cap B}, $$

which are the De Morgan laws. Commonly used t-conorms include

Maximum: a s_m b = max(a, b) = a ∨ b
Algebraic sum: a s_p b = a + b − ab
Lukasiewicz: a s_l b = min(a + b, 1)
Drastic sum:

$$ a \, s_d \, b = \begin{cases} a & \text{if } b = 0 \\ b & \text{if } a = 0 \\ 1 & \text{otherwise} \end{cases} $$
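The duality of Eq. (2.16) can be used to build a t-conorm from any t-norm and to compare the result with the explicit formulas listed above. This is a small sketch under assumed names and a made-up sample grid; it checks agreement numerically rather than proving it.

```python
import math

def t_product(a, b):
    return a * b

def t_lukasiewicz(a, b):
    return max(a + b - 1.0, 0.0)

def dual_conorm(t):
    """Build the t-conorm dual to t-norm t: a s b = 1 - (1 - a) t (1 - b)."""
    def s(a, b):
        return 1.0 - t(1.0 - a, 1.0 - b)
    return s

def s_algebraic(a, b):
    return a + b - a * b

def s_lukasiewicz(a, b):
    return min(a + b, 1.0)

GRID = [k / 4 for k in range(5)]  # hypothetical sample points in [0, 1]

def same_on_grid(f, g):
    """True if f and g agree (up to rounding) on every grid pair."""
    return all(math.isclose(f(a, b), g(a, b), abs_tol=1e-12)
               for a in GRID for b in GRID)

product_dual_ok = same_on_grid(dual_conorm(t_product), s_algebraic)
lukasiewicz_dual_ok = same_on_grid(dual_conorm(t_lukasiewicz), s_lukasiewicz)
max_from_min = dual_conorm(min)(0.25, 0.75)  # dual of the minimum is the maximum
```

The same helper applied to the drastic product would yield the drastic sum, so a single t-norm implementation is enough to obtain its union counterpart.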


Figure 2.21: The (a) min, (b) product, (c) Lukasiewicz, (d) drastic product t-norms and the intersection of fuzzy sets A and B.

The maximum (s_m), algebraic sum (s_p), Lukasiewicz (s_l), and drastic sum (s_d) operators are shown in Figure 2.22, which also includes the union of the triangular fuzzy sets on [0, 8], A = (x, 1, 3, 6) and B = (x, 2.5, 5, 7). The property A ∪ Ā = X, that is, the excluded middle, holds for the drastic sum.
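To see which of the listed t-conorms recover the excluded middle, one can evaluate a s (1 − a) over a grid of membership grades, using the standard complement. The sketch below assumes a hypothetical grid and names; a grid check is an illustration, not a proof.

```python
import math

def s_max(a, b):
    return max(a, b)

def s_algebraic(a, b):
    return a + b - a * b

def s_lukasiewicz(a, b):
    return min(a + b, 1.0)

def s_drastic(a, b):
    if b == 0.0:
        return a
    if a == 0.0:
        return b
    return 1.0

GRID = [k / 8 for k in range(9)]  # hypothetical sample of grades in [0, 1]

def excluded_middle_holds(s):
    """Check that a s (1 - a) equals 1 for every sampled grade a."""
    return all(math.isclose(s(a, 1.0 - a), 1.0, abs_tol=1e-12) for a in GRID)

results = {
    "maximum": excluded_middle_holds(s_max),
    "algebraic sum": excluded_middle_holds(s_algebraic),
    "Lukasiewicz": excluded_middle_holds(s_lukasiewicz),
    "drastic sum": excluded_middle_holds(s_drastic),
}
```

On this grid the check fails for the maximum and the algebraic sum and passes for the drastic sum; it also passes for the Lukasiewicz t-conorm, since min(a + (1 − a), 1) = 1.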


Figure 2.22: The (a) max, (b) algebraic sum, (c) Lukasiewicz, (d) drastic sum s-norms and the union of fuzzy sets A and B.

2.4.3. Triangular Norms as Categories of Logical Operators

Fuzzy propositions involve combinations of linguistic statements (or their symbolic representations) such as in


1. temperature is high and humidity is low;
2. velocity is low or noise level is high.

These sentences use the logical operations ∧ (and) and ∨ (or) to combine linguistic statements into propositions. For instance, in the first example we have a conjunction (and, ∧) of linguistic statements, while in the second there is a disjunction (or, ∨) of the statements. Given the truth values of each statement, the question is how to determine the truth value of the composite statement or, equivalently, the truth value of the proposition.

Let truth(P) = p ∈ [0, 1] be the truth value of proposition P. Thus, p = 0 means that the proposition is false, while p = 1 means that P is true. Intermediate values p ∈ (0, 1) indicate partial truth of the proposition. To compute the truth value of composite propositions of the form P ∧ Q and P ∨ Q, given the truth values p and q of their components, we have to come up with operations that transform the truth values p and q into the corresponding truth values p ∧ q and p ∨ q. To make these operations meaningful, we require that they satisfy some basic requirements. For instance, it is desirable that p ∧ q and q ∧ p (similarly, p ∨ q and q ∨ p) produce the same truth values. Likewise, we require that the truth value of (p ∧ q) ∧ r is the same as that of p ∧ (q ∧ r). In other words, the conjunction and disjunction operations are commutative and associative. In addition, when the truth value of an individual statement increases, the truth values of its combinations should not decrease. Moreover, if P is absolutely false, p = 0, then P ∧ Q should also be false no matter what the truth value of Q is. Furthermore, in this case the truth value of P ∨ Q should coincide with the truth value of Q. On the other hand, if P is absolutely true, p = 1, then the truth value of P ∧ Q should coincide with the truth value of Q, while P ∨ Q should also be true. Triangular norms and conorms are the general families of logic connectives that comply with these requirements.

Triangular norms provide a general category of logical connectives in the sense that t-norms are used to model conjunction operators while t-conorms serve as models of disjunctions. Let L = {P, Q, …} be a set of single (atomic) statements P, Q, … and truth: L → [0, 1] a function which assigns truth values p, q, … ∈ [0, 1] to each element of L. Thus, we have

truth(P and Q) ≡ truth(P ∧ Q) → p ∧ q = p t q
truth(P or Q) ≡ truth(P ∨ Q) → p ∨ q = p s q.

Table 2.3 shows examples of truth values for P, Q, P ∧ Q, and P ∨ Q when we select the min and product t-norms, and the max and algebraic sum t-conorms, respectively. For p, q ∈ {0, 1}, the results coincide with the classic interpretation of conjunction and disjunction for any choice of the triangular norm and conorm. The differences are present when p, q ∈ (0, 1).


Table 2.3: Triangular norms as generalized logical connectives.

p      q      min(p, q)   max(p, q)   pq      p + q − pq
1      1      1           1           1       1
1      0      0           1           0       1
0      1      0           1           0       1
0      0      0           0           0       0
0.2    0.5    0.2         0.5         0.1     0.6
0.5    0.8    0.5         0.8         0.4     0.9
0.8    0.7    0.7         0.8         0.56    0.94
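The entries of Table 2.3 can be recomputed directly from the (p, q) pairs. The sketch below rounds to two decimals only to suppress floating-point noise; the variable names are illustrative.

```python
# (p, q) pairs taken from Table 2.3.
pairs = [(1, 1), (1, 0), (0, 1), (0, 0), (0.2, 0.5), (0.5, 0.8), (0.8, 0.7)]

table = [
    (p, q,
     min(p, q),                   # conjunction with the min t-norm
     max(p, q),                   # disjunction with the max t-conorm
     round(p * q, 2),             # conjunction with the product t-norm
     round(p + q - p * q, 2))     # disjunction with the algebraic sum
    for p, q in pairs
]

for row in table:
    print(*row, sep="\t")
```

The first four rows reproduce the classical truth tables of conjunction and disjunction for both operator pairs; the operators differ only on the last three rows, where p and q lie strictly between 0 and 1.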

A point worth noting here concerns the interpretation of set operations in terms of logical connectives. Supported by the isomorphism between set theory and propositional two-valued logic, the intersection and union can be identified with conjunction and disjunction, respectively. This can also be realized with triangular norms viewed as general conjunctive and disjunctive connectives within the framework of multivalued logic [7, 12]. Triangular norms also play a key role in different types of fuzzy logic [13].

Given a continuous t-norm t, let us define the following ϕ operator:

a ϕ b = sup{c ∈ [0, 1] | a t c ≤ b} for all a, b ∈ [0, 1].

This operation can be interpreted as an implication induced by the t-norm, a ϕ b = a ⇒ b, and therefore it is, like implication, an inclusion. The operator ϕ generalizes the classic implication. As Table 2.4 suggests, the two-valued implication arises as a special case of the ϕ operator when a, b ∈ {0, 1}. Note that a ϕ b (a ⇒ b) returns 1 whenever a ≤ b. If we interpret these two truth values as membership degrees, we conclude that a ϕ b models a multivalued inclusion relationship.

Table 2.4: ϕ operator for binary values of its arguments.

a    b    a ⇒ b    a ϕ b
0    0    1        1
0    1    1        1
1    0    0        0
1    1    1        1
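The ϕ operator can be approximated numerically by scanning candidate values of c on a grid. The sketch below is a brute-force illustration, not an efficient implementation; the grid resolution `steps` and the function names are assumptions introduced here.

```python
def t_min(a, b):
    return min(a, b)

def t_product(a, b):
    return a * b

def t_lukasiewicz(a, b):
    return max(a + b - 1.0, 0.0)

def residuum(t, a, b, steps=1000):
    """Approximate a phi b = sup{c in [0, 1] : a t c <= b} on a uniform grid."""
    best = 0.0  # c = 0 always qualifies because a t 0 = 0 <= b
    for k in range(steps + 1):
        c = k / steps
        if t(a, c) <= b:
            best = c
    return best

# Binary arguments reproduce Table 2.4 for any of the t-norms above.
binary = {(a, b): residuum(t_min, a, b) for a in (0, 1) for b in (0, 1)}

# For graded arguments the induced implication depends on the t-norm.
phi_min = residuum(t_min, 0.8, 0.4)
phi_product = residuum(t_product, 0.8, 0.4)
phi_lukasiewicz = residuum(t_lukasiewicz, 0.8, 0.4)
```

Up to the grid resolution, the three graded results are 0.4, 0.5, and 0.6, so the choice of t-norm changes how strongly a violation of a ≤ b is penalized.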


2.4.4. Aggregation of Fuzzy Sets

Several fuzzy sets can be combined (aggregated) to provide a single fuzzy set forming the result of such an aggregation operation. For instance, when we compute the intersection and union of fuzzy sets, the result is a fuzzy set whose membership function captures the information carried by the original fuzzy sets. This fact suggests a general view of aggregation of fuzzy sets as a certain transformation performed on their membership functions. In general, we encounter a wealth of aggregation operations [14–18]. Aggregation operations are n-ary functions g: [0, 1]^n → [0, 1] with the following properties:

1. Monotonicity: g(x_1, x_2, …, x_n) ≥ g(y_1, y_2, …, y_n) if x_i > y_i for i = 1, …, n
2. Boundary conditions: g(0, 0, …, 0) = 0 and g(1, 1, …, 1) = 1

Since triangular norms and conorms are monotonic, associative, and satisfy the boundary conditions, they qualify as a class of associative aggregation operations whose neutral elements are equal to 1 and 0, respectively. The following operators constitute important alternative examples:

Averaging operations: In addition to monotonicity and the satisfaction of the boundary conditions, averaging operations are idempotent and commutative. They can be described in terms of the generalized mean [19]

$$ g(x_1, x_2, \ldots, x_n) = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^{p} \right)^{1/p}, \qquad p \in \mathbb{R},\ p \neq 0. $$

The generalized mean subsumes well-known averages and operators such as

$$
\begin{array}{lll}
p = 1: & g(x_1, x_2, \ldots, x_n) = \dfrac{1}{n} \sum_{i=1}^{n} x_i & \text{arithmetic mean} \\[2ex]
p \to 0: & g(x_1, x_2, \ldots, x_n) = \left( \prod_{i=1}^{n} x_i \right)^{1/n} & \text{geometric mean} \\[2ex]
p = -1: & g(x_1, x_2, \ldots, x_n) = \dfrac{n}{\sum_{i=1}^{n} 1/x_i} & \text{harmonic mean} \\[2ex]
p \to -\infty: & g(x_1, x_2, \ldots, x_n) = \min(x_1, x_2, \ldots, x_n) & \text{minimum} \\[1ex]
p \to \infty: & g(x_1, x_2, \ldots, x_n) = \max(x_1, x_2, \ldots, x_n) & \text{maximum}
\end{array}
$$
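The generalized mean and its special cases can be sketched as a single function. The version below assumes strictly positive membership grades (so that negative exponents and logarithms are defined) and treats p = 0 and p = ±∞ as the corresponding limits; these conventions and the function name are assumptions of this sketch.

```python
import math

def generalized_mean(xs, p):
    """Generalized mean of positive grades xs.

    p = 0 is read as the limit p -> 0 (geometric mean), and
    p = +inf / -inf as the limits giving the maximum / minimum.
    """
    n = len(xs)
    if p == 0:
        return math.exp(sum(math.log(x) for x in xs) / n)
    if math.isinf(p):
        return max(xs) if p > 0 else min(xs)
    return (sum(x ** p for x in xs) / n) ** (1.0 / p)

grades = [0.2, 0.8]  # hypothetical membership grades
arithmetic = generalized_mean(grades, 1)
geometric = generalized_mean(grades, 0)
harmonic = generalized_mean(grades, -1)
large_p = generalized_mean(grades, 50)  # moves toward the maximum as p grows
```

For these two grades the harmonic, geometric, and arithmetic means come out in increasing order, and every value stays between the minimum and the maximum of the arguments.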


The following relationship holds:

min(x_1, x_2, …, x_n) ≤ g(x_1, x_2, …, x_n) ≤ max(x_1, x_2, …, x_n).

Therefore, generalized means range over the values not covered by triangular norms and conorms.

Ordered weighted averaging operations: Ordered weighted averaging (OWA) is a weighted sum whose arguments are ordered [20]. Let w = [w_1 w_2 … w_n]^T, w_i ∈ [0, 1], be weights such that

$$ \sum_{i=1}^{n} w_i = 1. $$

Let a sequence of membership values {A(x_i)} be ordered as follows: A(x_1) ≤ A(x_2) ≤ … ≤ A(x_n). The family of ordered weighted averaging operators OWA(A, w) is defined as

$$ \mathrm{OWA}(A, w) = \sum_{i=1}^{n} w_i A(x_i). $$

By choosing certain forms of w, we can show that OWA includes several special cases of the aggregation operators mentioned before. For instance:

1. if w = [1, 0, …, 0]^T then OWA(A, w) = min(A(x_1), A(x_2), …, A(x_n));
2. if w = [0, 0, …, 1]^T then OWA(A, w) = max(A(x_1), A(x_2), …, A(x_n));
3. if w = [1/n, …, 1/n]^T then OWA(A, w) = (1/n) Σ_{i=1}^{n} A(x_i), the arithmetic mean.

Varying the values of the weights w_i results in aggregation values located between min and max,

min(A(x_1), A(x_2), …, A(x_n)) ≤ OWA(A, w) ≤ max(A(x_1), A(x_2), …, A(x_n)),

and OWA behaves as a compensatory operator, similar to the generalized mean.

Uninorms and nullnorms: Triangular norms provide one of the possible ways to aggregate membership grades. By definition, the identity elements are 1 (t-norms) and 0 (t-conorms). When used in aggregation operations, these elements do not affect the result of aggregation (that is, a t 1 = a and a s 0 = a). Uninorms generalize and unify triangular norms by allowing the identity element to be any value in the unit interval, that is, e ∈ [0, 1]. Uninorms become t-norms when e = 1 and t-conorms when e = 0. They exhibit some intermediate characteristics for all remaining values of e. Therefore, uninorms share the same properties as triangular norms with the exception of the identity [21].
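Returning to the OWA operator defined above, its three special cases are easy to reproduce. The sketch below sorts the grades in ascending order, as in the text, and validates the weights; the function name, tolerance, and sample grades are assumptions.

```python
import math

def owa(grades, weights):
    """Ordered weighted average: weights apply to grades sorted ascending."""
    if len(grades) != len(weights):
        raise ValueError("grades and weights must have the same length")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("weights must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(grades)
    return sum(w * a for w, a in zip(weights, ordered))

grades = [0.7, 0.2, 0.5]  # hypothetical membership grades, in arbitrary order

as_min = owa(grades, [1, 0, 0])          # all weight on the smallest grade
as_max = owa(grades, [0, 0, 1])          # all weight on the largest grade
as_mean = owa(grades, [1 / 3] * 3)       # equal weights: arithmetic mean
in_between = owa(grades, [0.5, 0.3, 0.2])
```

Because the weights attach to positions in the sorted sequence rather than to particular arguments, permuting `grades` leaves every result unchanged.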


A uninorm is an operation u: [0, 1] × [0, 1] → [0, 1] that satisfies the following:

1. Commutativity: a u b = b u a
2. Associativity: a u (b u c) = (a u b) u c
3. Monotonicity: if b ≤ c then a u b ≤ a u c
4. Identity: a u e = a, ∀a ∈ [0, 1]

where a, b, c ∈ [0, 1]. Examples of uninorms include the conjunctive u_c and disjunctive u_d forms. They can be obtained in terms of a t-norm t and a t-conorm s as follows:

a) If (0 u 1) = 0, then

$$ a \, u_c \, b = \begin{cases} e \left( \dfrac{a}{e} \; t \; \dfrac{b}{e} \right) & \text{if } 0 \le a, b \le e \\[2ex] e + (1 - e) \left( \dfrac{a - e}{1 - e} \; s \; \dfrac{b - e}{1 - e} \right) & \text{if } e \le a, b \le 1 \\[2ex] \min(a, b) & \text{otherwise} \end{cases} $$

b) If (0 u 1) = 1, then

$$ a \, u_d \, b = \begin{cases} e \left( \dfrac{a}{e} \; t \; \dfrac{b}{e} \right) & \text{if } 0 \le a, b \le e \\[2ex] e + (1 - e) \left( \dfrac{a - e}{1 - e} \; s \; \dfrac{b - e}{1 - e} \right) & \text{if } e \le a, b \le 1 \\[2ex] \max(a, b) & \text{otherwise} \end{cases} $$
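The two constructions above can be sketched as a factory that takes a t-norm, a t-conorm, and an identity element. The sketch restricts e to the open interval (0, 1) to avoid division by zero, and it picks the product and algebraic sum as the underlying operators purely for illustration; these choices and all names are assumptions.

```python
def t_product(a, b):
    return a * b

def s_algebraic(a, b):
    return a + b - a * b

def make_uninorm(t, s, e, conjunctive=True):
    """Build a conjunctive (min) or disjunctive (max) uninorm with identity e."""
    if not 0.0 < e < 1.0:
        raise ValueError("this sketch assumes 0 < e < 1")

    def u(a, b):
        if a <= e and b <= e:
            # Rescaled t-norm on the square [0, e] x [0, e].
            return e * t(a / e, b / e)
        if a >= e and b >= e:
            # Rescaled t-conorm on the square [e, 1] x [e, 1].
            return e + (1.0 - e) * s((a - e) / (1.0 - e), (b - e) / (1.0 - e))
        # Mixed case: one argument below e, the other above.
        return min(a, b) if conjunctive else max(a, b)

    return u

u_c = make_uninorm(t_product, s_algebraic, e=0.5, conjunctive=True)
u_d = make_uninorm(t_product, s_algebraic, e=0.5, conjunctive=False)

below = u_c(0.25, 0.25)   # both grades below e: behaves like a t-norm
above = u_c(0.75, 0.75)   # both grades above e: behaves like a t-conorm
```

With e = 0.5, two low grades reinforce each other downward and two high grades reinforce each other upward, while e itself leaves the other argument unchanged.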

2.5. Fuzzy Relations

Relations represent and quantify associations between objects. They provide a mechanism to model interactions and dependencies between variables, components, modules, etc. Fuzzy relations generalize the concept of relation in the same manner as fuzzy sets generalize the fundamental idea of sets. Fuzzy relations have applications especially in information retrieval, pattern classification, modeling and control, diagnostics, and decision-making.

2.5.1. Relations and Fuzzy Relations

A fuzzy relation generalizes the concept of relation by admitting the notion of partial association between elements of domains. Intuitively, a fuzzy relation can be seen as a multidimensional fuzzy set. For instance, if X and Y are two


domains of interest, a fuzzy relation R is any fuzzy subset of the Cartesian product of X and Y [22]. Equivalently, a fuzzy relation on X × Y is a mapping R: X × Y → [0, 1]. A membership value R(x, y) = 1 denotes that the two objects x and y are fully related. On the other hand, R(x, y) = 0 means that these elements are unrelated, while the values in between, 0 < R(x, y) < 1, indicate a partial association. For instance, if d_fs, d_nf, d_ns, and d_gf are documents whose subjects concern mainly fuzzy systems, neural fuzzy systems, neural systems, and genetic fuzzy systems, respectively, and the keywords are denoted by w_f, w_n, and w_g, then a relation R on D × W, D = {d_fs, d_nf, d_ns, d_gf} and W = {w_f, w_n, w_g}, can assume the matrix form with the following entries:

$$ R = \begin{bmatrix} 1 & 0 & 0.6 \\ 0.8 & 1 & 0 \\ 0 & 1 & 0 \\ 0.8 & 0 & 1 \end{bmatrix} $$

Since the universes are discrete, R can be represented as a 4 × 3 matrix (four documents and three keywords) whose entries are degrees of membership. For instance, R(d_fs, w_f) = 1 means that the content of document d_fs is fully compatible with the keyword w_f, whereas R(d_fs, w_n) = 0 and R(d_fs, w_g) = 0.6 indicate that d_fs does not mention neural systems but does have genetic systems as part of its content. As with relations, when X and Y are finite with Card(X) = n and Card(Y) = m, R can be arranged into an n × m matrix R = [r_ij], with r_ij ∈ [0, 1] being the degree of association between x_i and y_j.

The basic operations on fuzzy relations (union, intersection, and complement) are analogous to the corresponding operations on fuzzy sets, since fuzzy relations are fuzzy sets formed on multidimensional spaces. Their characterization and representation also mimic those of fuzzy sets.

2.5.2. Cartesian Product

One way to construct fuzzy relations is through the Cartesian product extended to fuzzy sets. The concept closely follows the one adopted for sets, since it involves tuples of points of the underlying universes, each with an associated membership degree. Given fuzzy sets A_1, A_2, …, A_n on the respective domains X_1, X_2, …, X_n, their Cartesian product A_1 × A_2 × … × A_n is a fuzzy relation R on X_1 × X_2 × … × X_n


with the following membership function:

R(x_1, x_2, …, x_n) = min{A_1(x_1), A_2(x_2), …, A_n(x_n)}, ∀x_1 ∈ X_1, ∀x_2 ∈ X_2, …, ∀x_n ∈ X_n.

In general, we can generalize the concept of this Cartesian product using t-norms:

R(x_1, x_2, …, x_n) = A_1(x_1) t A_2(x_2) t … t A_n(x_n), ∀x_1 ∈ X_1, ∀x_2 ∈ X_2, …, ∀x_n ∈ X_n.

2.5.3. Projection of Fuzzy Relations

In contrast to the Cartesian product, the idea of projection is to construct fuzzy relations on some subspaces of the original relation. If R is a fuzzy relation on X_1 × X_2 × … × X_n, its projection on X = X_i × X_j × … × X_k is a fuzzy relation R_X whose membership function is [23]

$$ R_X(x_i, x_j, \ldots, x_k) = \mathrm{Proj}_X R(x_1, x_2, \ldots, x_n) = \sup_{x_t, x_u, \ldots, x_v} R(x_1, x_2, \ldots, x_n), $$

where I = {i, j, …, k} is a subsequence of the set of indexes N = {1, 2, …, n}, and J = {t, u, …, v} is a subsequence of N such that I ∪ J = N and I ∩ J = ∅. Thus, J is the complement of I with respect to N. Notice that the above expression is computed for all values of (x_1, x_2, …, x_n) ∈ X_1 × X_2 × … × X_n. Figure 2.23 illustrates projection in the two-dimensional X × Y case.
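For finite universes, the Cartesian product and the projection reduce to a few lines, with the supremum becoming a maximum. The sketch below covers only the two-dimensional case; the element labels, grades, and function names are made up for illustration.

```python
def cartesian_product(A, B, t=min):
    """Fuzzy relation R(x, y) = A(x) t B(y); the default t-norm is the minimum."""
    return {(x, y): t(a, b) for x, a in A.items() for y, b in B.items()}

def project_on_x(R):
    """Projection on X: for each x, the largest R(x, y) over all y."""
    proj = {}
    for (x, _y), r in R.items():
        proj[x] = max(proj.get(x, 0.0), r)
    return proj

def project_on_y(R):
    """Projection on Y: for each y, the largest R(x, y) over all x."""
    proj = {}
    for (_x, y), r in R.items():
        proj[y] = max(proj.get(y, 0.0), r)
    return proj

# Hypothetical fuzzy sets on two small discrete universes.
A = {"x1": 0.3, "x2": 1.0}
B = {"y1": 0.6, "y2": 0.9}

R = cartesian_product(A, B)
R_X = project_on_x(R)
R_Y = project_on_y(R)
```

In this example the projection on X does not return A exactly, because B is not normal: the grade of `x2` is capped at the height of B, 0.9.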

Figure 2.23: Fuzzy relation R and its projections on X and Y.


Figure 2.24: Cylindrical extension of a fuzzy set.

2.5.4. Cylindrical Extension

The notion of cylindrical extension aims to expand a fuzzy set to a multidimensional relation. In this sense, cylindrical extension can be regarded as an operation complementary to the projection operation [23]. The cylindrical extension on X × Y of a fuzzy set A on X is a fuzzy relation cyl A whose membership function is equal to

cyl A(x, y) = A(x), ∀x ∈ X, ∀y ∈ Y.

Figure 2.24 shows the cylindrical extension of a triangular fuzzy set A.
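On finite universes, the cylindrical extension simply copies each grade A(x) along Y. The sketch below, with made-up elements and names, also confirms that projecting the extension back on X recovers A.

```python
def cylindrical_extension(A, Y):
    """cyl A(x, y) = A(x) for every y in the (nonempty) universe Y."""
    return {(x, y): a for x, a in A.items() for y in Y}

def project_on_x(R):
    """Projection on X: for each x, the largest R(x, y) over all y."""
    proj = {}
    for (x, _y), r in R.items():
        proj[x] = max(proj.get(x, 0.0), r)
    return proj

# Hypothetical fuzzy set on X and a three-element universe Y.
A = {"x1": 0.2, "x2": 1.0, "x3": 0.5}
Y = ["y1", "y2", "y3"]

cyl_A = cylindrical_extension(A, Y)
recovered = project_on_x(cyl_A)
```

The round trip works because every row of the extension is constant, so its maximum over y is A(x) itself.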

2.6. Linguistic Variables

One frequently deals with variables describing phenomena of physical or human systems that assume a finite, quite small number of descriptors. In contrast to the commonly used idea of numeric variables, a linguistic variable can be understood as a variable whose values are fuzzy sets. In general, linguistic variables may assume values consisting of words or sentences expressed in a certain language [24]. Formally, a linguistic variable is characterized by a quintuple ⟨X, T(X), X, G, M⟩, where the first component X is the name of the variable, T(X) is the term set of X whose elements are labels L of linguistic values of X, G is a grammar that generates the names of the linguistic values of X, and M is a semantic rule that assigns to each label L ∈ T(X) a meaning whose realization is a fuzzy set on the universe X (the third component of the quintuple) with base variable x. Figure 2.25 shows an example.


Figure 2.25: An example of the linguistic variable temperature. The figure annotates the name X (temperature), the term set T(X) with the linguistic terms L (cold, comfortable, warm), and the semantic rule M(L), which maps each term to a membership function, C(x), F(x), and W(x), on the universe X with base variable x.
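A linguistic variable such as temperature can be sketched as a small data structure that ties a name and a universe to a term set and its meanings. Everything numeric below is hypothetical: the universe [0, 40], the trapezoidal shapes, and their breakpoints are assumptions chosen for illustration, and the grammar G is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

def trapezoid(a, m, n, b):
    """Membership function: 0 outside (a, b), 1 on [m, n], linear in between."""
    def mu(x):
        if m <= x <= n:
            return 1.0
        if a < x < m:
            return (x - a) / (m - a)
        if n < x < b:
            return (b - x) / (b - n)
        return 0.0
    return mu

@dataclass
class LinguisticVariable:
    name: str                                      # name of the variable
    universe: Tuple[float, float]                  # bounds of the base variable x
    terms: Dict[str, Callable[[float], float]]     # label -> meaning (fuzzy set)

    def fuzzify(self, x):
        """Return the membership grade of x in the meaning of each term."""
        lo, hi = self.universe
        if not lo <= x <= hi:
            raise ValueError("x lies outside the universe")
        return {label: mu(x) for label, mu in self.terms.items()}

temperature = LinguisticVariable(
    name="temperature",
    universe=(0.0, 40.0),
    terms={
        "cold": trapezoid(0, 0, 10, 20),
        "comfortable": trapezoid(10, 20, 20, 30),
        "warm": trapezoid(20, 30, 40, 40),
    },
)

grades_at_15 = temperature.fuzzify(15)
```

Here the dictionary plays the role of the semantic rule M: it assigns to each label in the term set a fuzzy set on the universe.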

2.7. Granulation of Data

The notion of granulation emerges as a need to abstract and summarize data to support the processes of comprehension and decision-making. For instance, we often sample an environment for values of attributes of state variables, but we rarely process all details because of our physical and cognitive limitations. Quite often, just a reduced number of variables, attributes, and values are considered because those are the only features of interest given the task under consideration. To avoid all unnecessary and highly distracting details, we require an effective abstraction procedure. Detailed numeric information is aggregated into a format of information granules, where the granules themselves are regarded as collections of elements that are perceived as being indistinguishable, similar, close, or functionally equivalent.

There are different formalisms and concepts of information granules [25]. For instance, granules can be realized as sets (intervals), rough sets, or probability densities [26]. Typical examples of granular data are singletons and intervals. In these two special cases, we typically refer to discretization and quantization (Figure 2.26). As the specificity of granules increases, intervals become singletons, and in this limit case quantization results in a discretization process.

Fuzzy sets are examples of information granules. When talking about a family of fuzzy sets, we are typically concerned with fuzzy partitions of X. Given the nature


Figure 2.26: Discretization, quantization, and granulation. The panels show singletons S1–S4, intervals I1–I4, and fuzzy sets F1–F4 over x.

of fuzzy sets, fuzzy granulation generalizes the notion of quantization (Figure 2.26) and emphasizes the gradual nature of transitions between neighboring information granules [27]. When dealing with information granulation, we often develop a family of fuzzy sets and move on with processing that inherently uses all the elements of this family. The existing terminology refers to such collections of data granules as frames of cognition [8]. In what follows, we briefly review the concept and its main properties.

Frame of cognition: A frame of cognition results from information granulation when we encounter a finite collection of fuzzy sets (information granules) that represent the entire universe of discourse and satisfy a system of semantic constraints. The frame of cognition is a notion of particular interest in fuzzy modeling, fuzzy control, classification, and data analysis. A frame of cognition consists of several labeled, normal fuzzy sets. Each of these fuzzy sets is treated as a reference for further processing. A frame of cognition can be viewed as a codebook of conceptual entities. We may view them as a family of linguistic landmarks, say small, medium, high, etc. More formally, a frame of cognition Φ,

Φ = {A_1, A_2, …, A_m},    (2.18)

is a collection of fuzzy sets defined in the same universe X that satisfies at least two requirements, coverage and semantic soundness.

Coverage: We say that Φ covers X if any element x ∈ X is compatible with at least one fuzzy set A_i in Φ, i ∈ I = {1, 2, …, m}, meaning that it is compatible (coincides) with A_i to some nonzero degree, that is,

∀x ∈ X ∃i ∈ I: A_i(x) > 0.    (2.19)


Being stricter, we may require the satisfaction of the so-called δ-level coverage, which means that for any element of X, fuzzy sets are activated to a degree not lower than δ:

∀x ∈ X ∃i ∈ I: A_i(x) > δ,    (2.20)

where δ ∈ [0, 1]. From an application perspective, coverage assures that each element of X is represented by at least one of the elements of Φ and guarantees the absence of gaps, that is, elements of X with which no fuzzy set is compatible.

Semantic soundness: The notion of semantic soundness is more complicated and difficult to quantify. In principle, we are interested in information granules of Φ that are meaningful. While there is far more flexibility in the way in which a number of detailed requirements could be structured, we may agree upon a collection of several fundamental properties:

1. Each A_i, i ∈ I, is a unimodal and normal fuzzy set.
2. Fuzzy sets A_i, i ∈ I, are disjoint enough to assure that they are sufficiently distinct to become linguistically meaningful. This imposes a maximum degree λ of overlap between any two elements of Φ. In other words, given any x ∈ X, there is no more than one fuzzy set A_i such that A_i(x) ≥ λ, λ ∈ [0, 1].
3. The number of elements of Φ is low; following the psychological findings reported by Miller and others, we consider the number of fuzzy sets forming the frame of cognition to be maintained in the range of 7 ± 2 items.

Coverage and semantic soundness [28] are the two essential conditions that should be fulfilled by the membership functions of A_i to achieve interpretability. In particular, δ-coverage and λ-overlapping induce a minimal (δ) and maximal (λ) level of overlap between fuzzy sets (Figure 2.27). Considering the families of linguistic labels and associated fuzzy sets embraced in a frame of cognition, several characteristics are worth emphasizing.
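Coverage and overlap can be estimated for a concrete frame of cognition by sampling the universe. The sketch below is an assumption-laden illustration: the three trapezoidal fuzzy sets, the integer sample of X = [0, 40], and the function names are all hypothetical, and a sampled check can miss gaps between sample points.

```python
def trapezoid(a, m, n, b):
    """Membership function: 0 outside (a, b), 1 on [m, n], linear in between."""
    def mu(x):
        if m <= x <= n:
            return 1.0
        if a < x < m:
            return (x - a) / (m - a)
        if n < x < b:
            return (b - x) / (b - n)
        return 0.0
    return mu

def coverage_level(frame, xs):
    """Smallest, over the samples, of the highest grade reached at each x."""
    return min(max(mu(x) for mu in frame) for x in xs)

def overlap_level(frame, xs):
    """Largest, over the samples, of the second-highest grade at each x."""
    worst = 0.0
    for x in xs:
        grades = sorted((mu(x) for mu in frame), reverse=True)
        if len(grades) > 1:
            worst = max(worst, grades[1])
    return worst

# Hypothetical frame of cognition with three fuzzy sets on X = [0, 40].
frame = [trapezoid(0, 0, 10, 20), trapezoid(10, 20, 20, 30), trapezoid(20, 30, 40, 40)]
xs = range(0, 41)

delta = coverage_level(frame, xs)
second_best = overlap_level(frame, xs)
```

For this frame every sampled point is covered to a degree of at least 0.5, and no two fuzzy sets simultaneously exceed 0.5, so the overlap requirement is met for any λ above that value.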


Figure 2.27: Coverage and semantic soundness of a cognitive frame.


Figure 2.28: Two frames of cognition; Φ_1 is coarser (more general) than Φ_2.


Figure 2.29: Focus of attention; two regions of focus of attention implied by the corresponding fuzzy sets.

Specificity: We say that the frame of cognition Φ_1 is more specific than Φ_2 if all the elements of Φ_1 are more specific than the elements of Φ_2 (Figure 2.28). Here the specificity Spec(A_i) of the fuzzy sets that compose the cognition frames can be evaluated as suggested in Section 2.3. Less specific cognition frames promote granulation realized at a higher level of abstraction (generalization). Consequently, we are provided with a description that captures fewer details.

Granularity: Granularity of a frame of cognition relates to the granularity of the fuzzy sets used there. The higher the number of fuzzy sets in the frame, the finer the resulting granulation is. Therefore, the frame of cognition Φ_1 is finer than Φ_2 if |Φ_1| > |Φ_2|. If the converse holds, Φ_1 is coarser than Φ_2 (Figure 2.28).

Focus of attention: A focus of attention induced by a certain fuzzy set A = A_i in Φ is defined as a certain α-cut of this fuzzy set. By moving A along X while keeping its membership function unchanged, we can focus attention on a certain selected region of X, as shown in Figure 2.29.

Information hiding: Information hiding is closely related to the notion of focus of attention and manifests through a collection of elements that are hidden


Figure 2.30: A concept of information hiding realized by the use of trapezoidal fuzzy set A. Points in [a2 , a3 ] are made indistinguishable. The effect of information hiding is not present in case of triangular fuzzy set B.

when viewed from the standpoint of membership functions. By modifying the membership function of A = A_i in Φ, we can make the elements positioned within some region of X equivalent. For instance, consider a trapezoidal fuzzy set A on R and its 1-cut (core), the closed interval [a_2, a_3], as depicted in Figure 2.30. All points within the interval [a_2, a_3] are made indistinguishable, and through the use of this specific fuzzy set they are made equivalent. Hence, more detailed information, namely the position of a certain point falling within this interval, is hidden. In general, by increasing or decreasing the level of the α-cut, we can accomplish a so-called α-information hiding through normalization.
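The contrast between a trapezoidal and a triangular fuzzy set can be made concrete with α-cuts on a sampled universe. In the sketch below the breakpoints a_1, …, a_4 = 1, 2, 3, 4, the triangular set with modal value 2.5, and the sample points are all hypothetical.

```python
def trapezoid(a, m, n, b):
    """Membership function: 0 outside (a, b), 1 on [m, n], linear in between."""
    def mu(x):
        if m <= x <= n:
            return 1.0
        if a < x < m:
            return (x - a) / (m - a)
        if n < x < b:
            return (b - x) / (b - n)
        return 0.0
    return mu

def alpha_cut(mu, xs, alpha):
    """Sampled alpha-cut: the points of xs whose grade is at least alpha."""
    return [x for x in xs if mu(x) >= alpha]

xs = [k / 2 for k in range(11)]        # 0, 0.5, ..., 5 (hypothetical samples)
A = trapezoid(1, 2, 3, 4)              # trapezoidal set with core [2, 3]
B = trapezoid(1, 2.5, 2.5, 4)          # triangular set with modal value 2.5

core_A = alpha_cut(A, xs, 1.0)
core_B = alpha_cut(B, xs, 1.0)
half_cut_A = alpha_cut(A, xs, 0.5)

# Inside the core of A all points receive the same grade and cannot be told apart.
grades_in_core = {A(x) for x in core_A}
```

The core of the trapezoidal set contains several sample points with identical grade, which is the information hiding effect, whereas the triangular set singles out one point.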

2.8. Linguistic Approximation

In many cases we are provided with a finite family of fuzzy sets {A_1, A_2, …, A_c} (whose membership functions could have been determined earlier) with which we would like to represent a certain fuzzy set B. These fuzzy sets are referred to as a vocabulary of information granules. Furthermore, assume that we have at our disposal a finite collection of linguistic modifiers {τ_1, τ_2, …, τ_p}. There are two general categories of modifiers, realizing operations of concentration and dilution. Their semantics relates to linguistic adjectives of the form very (concentration) and more or less (dilution). Given the semantics of the A_i's and the linguistic modifiers, the aim of the representation process is to capture the essence of B. Given the nature of the objects and the ensuing processing being used here, we refer to this process as linguistic approximation.

There are many ways to attain linguistic approximation. For instance, the scheme shown in Figure 2.31 comprises two steps: the first finds the best match between B and the A_i's. The quality of matching can be expressed in terms of some distance or similarity measure. The next step refines the best match by applying one of the linguistic modifiers. The result of the linguistic approximation is B ≈ τ_i(A_j), with the indexes i and j determined by the matching mechanism.
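The two-step scheme can be sketched as follows. Several choices here are assumptions rather than part of the text: the vocabulary and its breakpoints are made up, the modifiers are realized by the common choices of squaring (very) and taking the square root (more or less), and matching uses a sampled Hamming distance.

```python
import math

def trapezoid(a, m, n, b):
    """Membership function: 0 outside (a, b), 1 on [m, n], linear in between."""
    def mu(x):
        if m <= x <= n:
            return 1.0
        if a < x < m:
            return (x - a) / (m - a)
        if n < x < b:
            return (b - x) / (b - n)
        return 0.0
    return mu

XS = list(range(0, 41))  # sampled universe (hypothetical)

VOCABULARY = {
    "cold": trapezoid(0, 0, 10, 20),
    "comfortable": trapezoid(10, 20, 20, 30),
    "warm": trapezoid(20, 30, 40, 40),
}

# Assumed realizations of concentration and dilution.
MODIFIERS = {
    "very": lambda grade: grade ** 2,
    "more or less": lambda grade: math.sqrt(grade),
}

def distance(f, g):
    """Sampled Hamming distance between two membership functions."""
    return sum(abs(f(x) - g(x)) for x in XS)

def approximate(B):
    """Return (modifier, label) such that B is close to modifier(label)."""
    # Step 1: best match between B and the vocabulary.
    label = min(VOCABULARY, key=lambda lab: distance(B, VOCABULARY[lab]))
    A = VOCABULARY[label]
    # Step 2: refine the best match with one of the modifiers.
    modifier = min(MODIFIERS,
                   key=lambda name: distance(B, lambda x: MODIFIERS[name](A(x))))
    return modifier, label

# A fuzzy set to approximate: a concentrated version of "comfortable".
B = lambda x: VOCABULARY["comfortable"](x) ** 2
result = approximate(B)
```

A fuller version might also allow the unmodified term as a candidate in the second step, so that B ≈ A_j is returned when no modifier improves the match.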


Figure 2.31: Linguistic approximation of B using a vocabulary and modifiers.


Figure 2.32: Fuzzy sets of order 1 (left), and order 2 (right).

2.9. Generalizations and Extensions of Fuzzy Sets

The essential notion of fuzzy set can be generalized into more abstract constructs such as second order fuzzy sets and interval fuzzy sets. Various construction and conceptual issues arise concerning values of membership, inducing a collection of views of granular membership grades. The simplest construct is interval-valued fuzzy sets. More refined versions of the construct produce fuzzy sets of higher type and type-2 fuzzy sets.

2.9.1. Higher Order Fuzzy Sets

To keep things simple, the fuzzy sets discussed so far can be referred to as fuzzy sets of order 1. The essence of a fuzzy set of order 2 is that it is defined over a collection of some other fuzzy sets. Figure 2.32 shows the difference between ordinary fuzzy sets, referred to in this context as fuzzy sets of order 1, and fuzzy sets of order 2. For the order 2 fuzzy set, one can use the notation A = [λ1, λ2, λ3], given that the reference fuzzy sets are A1, A2, and A3. In a similar way, fuzzy sets of higher order, say order 3 or higher, can be formed recursively. While the construct is conceptually appealing and straightforward, its applicability is an open issue. One may not be willing to expend more effort on the design of such sets unless there is a strong reason behind the usage of fuzzy sets of higher order. Moreover, nothing prevents us from building fuzzy sets of higher order based on a family of terms that are not only fuzzy sets. One might consider a family of information granules, such as sets, over which a certain fuzzy set could be formed.

2.9.2. Type-2 Fuzzy Sets

The choice of membership functions or membership degrees may raise the concern that characterizing membership degrees as single numeric values is counterintuitive, given the nature of fuzzy sets themselves. An alternative is to capture the semantics of membership using intervals of possible membership grades rather than single numeric values. This is the concept of interval-valued fuzzy sets A (Figure 2.33). The lower bound A− and the upper bound A+ of the membership grades capture the lack of uniqueness of numeric membership. Type-2 fuzzy sets are a generalization of interval-valued fuzzy sets [29]. Instead of intervals of numeric membership degrees, the memberships are fuzzy sets themselves. Consider a certain element xj of the universe of discourse. The membership of xj in A is captured by a fuzzy set formed over the unit interval. An example of a type-2 fuzzy set is illustrated in Figure 2.34.


Figure 2.33: An interval-valued fuzzy set; the lower and upper bound of membership grades.


Figure 2.34: Type-2 fuzzy set; for each element of X, there is a corresponding fuzzy set of membership grades.
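The difference between the two constructs of Figures 2.33 and 2.34 can be made concrete with a small Python sketch. The shapes used here are assumptions made purely for illustration: the bounds A− and A+ are taken to be a narrow and a wide triangle with the same peak, and the secondary membership of the type-2 set is taken to be triangular over the interval [A−(x), A+(x)], sampled on a grid of the unit interval.

```python
def tri(x, a, m, b):
    """Triangular membership grade of x for support (a, b) and peak m."""
    if x <= a or x >= b:
        return 0.0
    return (x - a) / (m - a) if x <= m else (b - x) / (b - m)

def interval_membership(x):
    """Interval-valued grade [A-(x), A+(x)] of x (hypothetical bounds)."""
    lower = tri(x, 2.0, 5.0, 8.0)  # A-: narrow triangle
    upper = tri(x, 1.0, 5.0, 9.0)  # A+: wide triangle with the same peak
    return lower, upper

def type2_membership(x, steps=10):
    """Type-2 grade of x: a fuzzy set over [0, 1], sampled on a grid.

    Maps each candidate membership grade u to a secondary grade that is
    1 at the midpoint of [A-(x), A+(x)] and falls linearly to 0 at the bounds.
    """
    lower, upper = interval_membership(x)
    mid = (lower + upper) / 2
    half = (upper - lower) / 2
    grades = {}
    for k in range(steps + 1):
        u = k / steps
        if lower <= u <= upper:
            secondary = 1.0 if half == 0 else 1.0 - abs(u - mid) / half
            grades[u] = max(0.0, secondary)  # guard against rounding below 0
    return grades

lo, up = interval_membership(3.0)
print(lo, up)
print(type2_membership(3.0))
```

An interval-valued fuzzy set is recovered from this representation by setting every secondary grade inside the interval to 1, which is the sense in which type-2 fuzzy sets generalize interval-valued ones.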


2.10. Conclusion

The chapter has summarized the fundamental notions and concepts of fuzzy set theory. The goal was to offer basic and key content of interest for computational intelligence and intelligent systems theory and applications. Currently, a number of outstanding books and journals are available to help researchers, scientists, and engineers master fuzzy set theory and the contributions it makes to the development of new approaches through hybridizations and new applications. The bibliography includes some of them. The remaining chapters of this book provide readers with a clear picture of the current state of the art in the area.

References

[1] Cantor, G. (1883). Grundlagen einer allgemeinen Mannigfaltigkeitslehre, Teubner, Leipzig.
[2] Cantor, G. (1895). Beiträge zur Begründung der transfiniten Mengenlehre, Math. Ann., 46, pp. 481–512.
[3] Zadeh, L. (1965). Fuzzy sets, Inf. Control, 8, pp. 338–353.
[4] Dubois, D. and Prade, H. (1997). The three semantics of fuzzy sets, Fuzzy Sets Syst., 2, pp. 141–150.
[5] Dubois, D. and Prade, H. (1998). An introduction to fuzzy sets, Clin. Chim. Acta, 70, pp. 3–29.
[6] Zadeh, L. (1978). Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets Syst., 3, pp. 3–28.
[7] Klir, G. and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall, Upper Saddle River, New Jersey, USA.
[8] Pedrycz, W. and Gomide, F. (2007). Fuzzy Systems Engineering: Toward Human-Centric Computing, Wiley Interscience, Hoboken, New Jersey, USA.
[9] Garmendia, L., Yager, R., Trillas, E. and Salvador, A. (2003). On t-norms based measures of specificity, Fuzzy Sets Syst., 133, pp. 237–248.
[10] Yager, R. (1983). Entropy and specificity in a mathematical theory of evidence, Int. J. Gen. Syst., 9, pp. 249–260.
[11] Klement et al., (2000). Triangular Norms. Springer, Dordrecht, Netherlands.
[12] Hajek, P. (1998). Mathematics of Fuzzy Logic, Kluwer Academic Publishers, Dordrecht Norwell, New York, London.
[13] Klement, P. and Navarra, M. (1999). A survey on different triangular norm-based fuzzy logics. Fuzzy Sets and Systems, 101, pp. 241–251.
[14] Dubois, D. and Prade, H. (1985). A review of fuzzy aggregation connectives, Inf. Sci., 36, pp. 85–121.
[15] Bouchon-Meunier, B. (1998). Aggregation and Fusion of Imperfect Information, Physica-Verlag, Heidelberg, Germany.
[16] Calvo, T., Kolesárová, A., Komorníková, M. and Mesiar, R. (2002). Aggregation operators: properties, classes and construction methods, in Aggregation Operators: New Trends and Applications, Physica-Verlag, Heidelberg, Germany.
[17] Dubois, D. and Prade, H. (2004). On the use of aggregation operations in information fusion, Fuzzy Sets Syst., 142, pp. 143–161.
[18] Beliakov et al., 2007.
[19] Dyckhoff, H. and Pedrycz, W. (1984). Generalized means as a models of compensative connectives, Fuzzy Sets Syst., 142, pp. 143–154.
[20] Yager, R. (1988). On ordered weighted averaging aggregation operations in multicriteria decision making, IEEE Trans. Syst. Man Cybern., 18, pp. 183–190.
[21] Yager, R. and Rybalov, A. (1996). Uninorm aggregation operators, Fuzzy Sets Syst., 80, pp. 111–120.
[22] Zadeh, L. (1971). Similarity relations and fuzzy orderings, Inf. Sci., 3, pp. 177–200.
[23] Zadeh, L. (1975a,b). The concept of linguistic variables and its application to approximate reasoning I, II, III, Inf. Sci., 8(9), pp. 199–251, 301–357, 43–80.
[24] Zadeh, L. (1999). From computing with numbers to computing with words: from manipulation of measurements to manipulation of perceptions, IEEE Trans. Circuits Syst., 45, pp. 105–119.
[25] Pedrycz, W., Skowron, A. and Kreinovich, V. (2008). Handbook of Granular Computing, John Wiley & Sons, Chichester, West Sussex, England.
[26] Lin, T. (2004). Granular computing: rough sets perspectives, IEEE Connect., 2, pp. 10–13.
[27] Zadeh, L. (1999). Fuzzy logic = computing with words, in Zadeh, L. and Kacprzyk, J. (eds.), Computing with Words in Information and Intelligent Systems, Physica-Verlag, Heidelberg, Germany, pp. 3–23.
[28] Oliveria, J. (1993). On optimal fuzzy systems with I/O interfaces, Proc. of the Second IEEE Int. Conf. on Fuzzy Systems, San Francisco, California, USA, pp. 34–40.
[29] Mendel, J. (2007). Type-2 fuzzy sets and systems: an overview, IEEE Comput. Intell. Mag., 2, pp. 20–29.


© 2022 World Scientific Publishing Company
https://doi.org/10.1142/9789811247323_0003

Chapter 3

Granular Computing

Andrzej Bargiela∗ and Witold Pedrycz†
∗ The University of Nottingham, Nottingham, UK
† University of Alberta, Edmonton, Canada

Research into human-centered information processing, as evidenced through the development of fuzzy sets and fuzzy logic, has brought additional insight into the transformative nature of the aggregation of inaccurate and/or fuzzy information in terms of the semantic content of data aggregates. This insight led, some 15 years ago, to the proposal of the granular computing (GrC) paradigm, which takes as its point of departure an empirical validation of the aggregated data in order to achieve the closest possible correspondence between the semantics of data aggregates and the entities in the problem domain. Indeed, it can be observed that information abstraction combined with empirical validation of abstractions has been deployed as a methodological tool in various scientific disciplines. This chapter focuses on exploring the foundations of GrC and casting it as a structured combination of algorithmic and non-algorithmic information processing that mimics the intelligent human synthesis of knowledge from imprecise and/or fuzzy information.

3.1. Introduction

Granular computing (GrC) is frequently defined in an informal way as a general computation theory for effectively using granules such as classes, clusters, subsets, groups, and intervals to build an efficient computational model for complex applications with which to process large volumes of information presented as either raw data or aggregated problem domain knowledge. Though the term GrC is relatively recent, the basic notions and principles of granular computing have appeared under different names in many related fields, such as information hiding in programming, granularity in artificial intelligence, divide and conquer in theoretical computer science, interval computing, cluster analysis, fuzzy and rough set theories, neutrosophic computing, quotient space theory, belief functions, machine learning, databases, and many others. In the past few years, we have witnessed a renewed and fast-growing interest in GrC. Granular computing has begun to play important roles in bioinformatics, e-business, security, machine learning, data mining, high-performance computing, and wireless mobile computing in terms of efficiency, effectiveness, robustness, and structured representation of uncertainty. With the vigorous research interest in the GrC paradigm [3–10, 17, 20, 21, 25–28, 33, 36–44], it is natural that there are voices calling for clarification of the distinctiveness of GrC from the underpinning constituent disciplines and from other computational paradigms proposed for large-scale/complex information processing. Recent contributions by Yao [36–39] attempt to bring together various insights into GrC from a broad spectrum of disciplines and cast the GrC framework as structured thinking at the philosophical level and structured problem solving at the practical level. In this chapter, we elaborate on our earlier proposal [11] and look at the roots of granular computing in the light of the original insight of Zadeh [40] stating, “fuzzy information granulation in an intuitive form underlies human problem solving.” We suggest that human problem solving has strong foundations in axiomatic set theory and the theory of computability and that it underlies some recent research results linking intelligence to physical computation [1, 2]. In fact, re-examining human information processing in this light brings granular computing from a domain of computation and philosophy to one of physics and set theory. The set theoretical perspective on GrC adopted in this chapter also offers a good basis for the evaluation of other human-centered computing paradigms such as the generalized constraint-based computation recently communicated by Zadeh [44].

3.2. Set Theoretical Interpretation of Granulation

The commonly accepted definition of granulation introduced in [17, 21], and [37] is:

Definition 1. Information granulation is a grouping of elements based on their indistinguishability, similarity, proximity or functionality.

This definition serves well the purpose of constructive generation of granules but does little to differentiate granulation from clustering. More importantly, however, Definition 1 implies that the nature of information granules is fully captured by their interpretation as subsets of the original dataset within the intuitive set theory of Cantor [13]. Unfortunately, an inevitable consequence of this is that the inconsistencies (paradoxes) associated with intuitive set theory, such as “cardinality of set of all sets” (Cantor) or “definition of a set that is not a member of itself” (Russell), are imported into the domain of information granulation.


In order to provide a more robust definition of information granulation, we follow the approach adopted in the development of axiomatic set theory. The key realization there was that the commonly accepted intuition, that one can form any set one wants, should be questioned. Accepting the departure point of intuitive set theory, we can say that, normally, sets are not members of themselves, i.e., normally, ∼(y ∈ y). But the axioms of intuitive set theory do not exclude the existence of “abnormal” sets, which are members of themselves. So, if we consider the set of all “normal” sets, x = {y | ∼(y ∈ y)}, the unrestricted comprehension axiom guarantees the existence of the set x: ∃x∀y(y ∈ x ⇔ ∼(y ∈ y)). If we then substitute x for y, we arrive at a contradiction: x ∈ x ⇔ ∼(x ∈ x). So, the unrestricted comprehension axiom of intuitive set theory leads to contradictions and cannot therefore serve as a foundation of set theory.

3.2.1. Zermelo–Fraenkel Axiomatization

An early attempt at overcoming the above contradiction was an axiomatic scheme developed by Ernst Zermelo and Abraham Fraenkel [45]. Their idea was to restrict the comprehension axiom schema by adopting only those instances of it that are necessary for the reconstruction of common mathematics. In other words, the standard approach of using a formula F(y) to collect the sets y having the property F leads to the generation of an object that is not a set (otherwise we arrive at a contradiction). So, looking at the problem the other way, they concluded that the contradiction constitutes a de facto proof that there are other semantical entities in addition to sets. The important observation that we can make here is that the semantical transformation of sets through the process of applying some set-forming formula applies also to the process of information granulation and, consequently, information granules should be considered semantically distinct from the granulated entities. We therefore arrive at a modified definition of information granulation as follows:

Definition 2. Information granulation is a semantically meaningful grouping of elements based on their indistinguishability, similarity, proximity, or functionality.

Continuing with the Zermelo–Fraenkel approach, we must legalize some collections of sets that are not sets. Let F(y, z1, z2, ..., zn) be a formula in the language of set theory (where z1, z2, ..., zn are optional parameters). We can say that for any values of the parameters z1, z2, ..., zn, the formula F defines a “class” A,

A = {y | F(y, z1, z2, ..., zn)},

which consists of all y's possessing the property F. Different values of z1, z2, ..., zn give rise to different classes. Consequently, the axiomatization of set theory involves the formulation of axiom schemas that represent possible instances of axioms for different classes. The following is a full set of axioms of the Zermelo–Fraenkel set theory:

Z1, Extensionality: ∀x∀y[∀z(z ∈ x ≡ z ∈ y) ⇒ x = y]
Asserts that if sets x and y have the same members, the sets are identical.

Z2, Null Set: ∃x ∼∃y(y ∈ x)
Asserts that there is an empty set (by Z1, it is unique).

Z3, Pair Set: ∀x∀y∃z∀w(w ∈ z ≡ w = x ∨ w = y)
Asserts that for any sets x and y, there exists a pair set of x and y, i.e., a set that has only x and y as members.

Z4, Unions: ∀x∃y∀z(z ∈ y ≡ ∃w(w ∈ x ∧ z ∈ w))
Asserts that for any set x, there is a set y containing every set that is a member of some member of x.

Z5, Power Set: ∀x∃y∀z(z ∈ y ≡ ∀w(w ∈ z ⇒ w ∈ x))
Asserts that for any set x, there is a set y which contains as members all those sets whose members are also elements of x, i.e., y contains all of the subsets of x.

Z6, Infinite Set: ∃x[∅ ∈ x ∧ ∀y(y ∈ x ⇒ ⋃{y, {y}} ∈ x)]
Asserts that there is a set x which contains ∅ as a member and which is such that, whenever y is a member of x, then y ∪ {y} is a member of x.


Z7, Regularity: ∀x[x ≠ ∅ ⇒ ∃y(y ∈ x ∧ ∀z(z ∈ x ⇒ ∼(z ∈ y)))]
Asserts that every set is “well-founded”, i.e., it rules out the existence of circular chains of sets as well as infinitely descending chains of sets. A member y of a set x with this property is called a “minimal” element.

Z8, Replacement Schema: ∀x∃!y F(x, y) ⇒ ∀u∃v∀r(r ∈ v ≡ ∃s(s ∈ u ∧ Fx,y[s, r]))
Asserts that given a formula F(x, y) and Fx,y[s, r] as a result of substituting s and r for x and y, every instance of the above axiom schema is an axiom. In other words, given a functional formula F and a set u, we can form a new set v by collecting all of the sets to which the members of u are uniquely related by F. It is important to note that elements of v need not be elements of u.

Z9, Separation Schema: ∀u∃v∀r(r ∈ v ≡ r ∈ u ∧ Fx[r])
Asserts that there exists a set v which has, as members, precisely the members of u that satisfy the formula F. Again, every instance of the above axiom schema is an axiom.

Unfortunately, the presence of the two axiom schemas, Z8 and Z9, implies infinite axiomatization of the Zermelo–Fraenkel (ZF) set theory. While it is fully acknowledged that the ZF set theory, and its many variants, has advanced our understanding of cardinal and ordinal numbers and has led to the proof of the property of “well-ordering” of sets (with the help of an additional “Axiom of Choice”; ZFC), the theory seems unduly complex for the purpose of a set theoretical interpretation of information granules.
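The set-forming axioms can be tried out on hereditarily finite sets, for which Python's immutable frozenset is a convenient stand-in. The sketch below is an illustration only: the function names are hypothetical, the formula F of the two schemas is passed as an ordinary Python function, and the Infinite Set axiom (Z6) cannot be realised in a finite model, so only its successor step y ∪ {y} is shown. Regularity (Z7) holds by construction, since a frozenset cannot be built so as to contain itself.

```python
from itertools import combinations

EMPTY = frozenset()  # Z2: the empty set

def pair(x, y):
    """Z3: the set whose only members are x and y."""
    return frozenset({x, y})

def union(x):
    """Z4: the set of all members of members of x."""
    return frozenset(z for w in x for z in w)

def power_set(x):
    """Z5: the set of all subsets of x."""
    items = list(x)
    return frozenset(
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    )

def successor(y):
    """The step used in Z6: y ∪ {y}, computed as the union of {y, {y}}."""
    return union(pair(y, frozenset({y})))

def replacement(u, F):
    """Z8: the image of the set u under a functional relation F."""
    return frozenset(F(s) for s in u)

def separation(u, F):
    """Z9: the members of u that satisfy the property F."""
    return frozenset(r for r in u if F(r))

one = successor(EMPTY)  # {∅}
two = successor(one)    # {∅, {∅}}
singletons = separation(power_set(two), lambda s: len(s) == 1)
image = replacement(two, successor)  # {one, two}
print(len(power_set(two)), len(singletons), two in image, two in two)
```

The last line illustrates the remark made for Z8: the set `two` belongs to the image of `two` under the successor function, although it is not an element of `two` itself.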

3.2.2. Von Neumann–Bernays–Goedel Axiomatization

A different approach to the axiomatization of set theory designed to yield the same results as ZF but with a finite number of axioms (i.e., without the reliance on axiom schemas) was proposed by von Neumann in 1920 and subsequently refined by Bernays in 1937 and Goedel in 1940 [15]. The defining aspect of von Neumann–Bernays–Goedel set theory (NBG) is the introduction of the concept of “proper class” among its objects. NBG and ZFC are very closely related and in fact NBG is a conservative extension of ZFC.


In NBG, proper classes are differentiated from sets by the fact that they do not belong to other classes. Thus, in NBG, we have

x is a set ⇔ ∃y(x ∈ y),

which can be phrased as follows: x is a set if it belongs to either a set or a class. The basic observation that can be made about NBG is that it is essentially a two-sorted theory; it involves sets (denoted here by lower-case letters) and classes (denoted by upper-case letters). Consequently, the above statement about membership assumes one of the following forms:

x ∈ y or x ∈ Y,

and statements about equality are in the form

x = y or X = Y.

Using this notation, the axioms of NBG are as follows:

N1, Class Extensionality: ∀x(x ∈ A ↔ x ∈ B) ⇒ A = B
Asserts that classes with the same elements are the same.

N2, Set Extensionality: ∀x(x ∈ a ↔ x ∈ b) ⇒ a = b
Asserts that sets with the same elements are the same.

N3, Pairing: ∀x∀y∃z∀w(w ∈ z ≡ w = x ∨ w = y)
Asserts that for any sets x and y, there exists a set {x, y} that has exactly two elements, x and y. It is worth noting that this axiom allows the definition of ordered pairs and, taken together with the Class Comprehension axiom, it allows implementation of relations on sets as classes.

N4, Union: ∀x∃y∀z(z ∈ y ≡ ∃w(w ∈ x ∧ z ∈ w))
Asserts that for any set x there exists a set which contains exactly the elements of elements of x.


N5, Power Set: ∀x∃y∀z[z ∈ y ≡ ∀w(w ∈ z ⇒ w ∈ x)]
Asserts that for any set x, there is a set which contains exactly the subsets of x.

N6, Infinite Set: ∃x[∅ ∈ x ∧ ∀y(y ∈ x ⇒ ⋃{y, {y}} ∈ x)]
Asserts that there is a set x which contains the empty set as an element and contains y ∪ {y} for each of its elements y.

N7, Regularity: ∀x[x ≠ ∅ ⇒ ∃y(y ∈ x ∧ ∀z(z ∈ x ⇒ ∼(z ∈ y)))]
Asserts that each nonempty set is disjoint from one of its elements.

N8, Limitation of Size: x is not a set ⇔ |x| = |V|
Asserts that if the cardinality of x equals the cardinality of the set theoretical universe V, x is not a set but a proper class. This axiom can be shown to be equivalent to the axioms of Regularity, Replacement, and Separation in NBG. Thus the classes that are proper in NBG are, in a very clear sense, big, while sets are small. It should be appreciated that the latter has a very profound implication for computation that processes proper classes. This is because the classes built over countable sets can be uncountable and, as such, do not satisfy the constraints of the formalism of the Universal Turing Machine.

N9, Class Comprehension Schema: Unlike in the ZF axiomatization, this schema consists of a finite set of axioms (thus giving finite axiomatization of NBG).

Axiom of Sets: For any set x, there is a class X such that x = X.

Axiom of Complement: For any class X, the complement V − X = {x | x ∉ X} is a class.

Axiom of Intersection: For any classes X and Y, the intersection X ∩ Y = {x | x ∈ X ∧ x ∈ Y} is a class.

Axiom of Products: For any classes X and Y, the class X × Y = {(x, y) | x ∈ X ∧ y ∈ Y} is a class. This axiom actually provides far more than is needed for representing relations on classes. What is actually needed is just that V × Y is a class.


Axiom of Converses: For any class X, the class Conv1(X) = {(y, x) | (x, y) ∈ X} and the class Conv2(X) = {(y, (x, z)) | (x, (y, z)) ∈ X} exist.

Axiom of Association: For any class X, the class Assoc1(X) = {((x, y), z) | (x, (y, z)) ∈ X} and the class Assoc2(X) = {(w, (x, (y, z))) | (w, ((x, y), z)) ∈ X} exist.

Axiom of Ranges: For any class X, the class Rng(X) = {y | ∃x((x, y) ∈ X)} exists.

Axiom of Membership: The class [∈] = {(x, y) | x ∈ y} exists.

Axiom of Diagonal: The class [=] = {(x, y) | x = y} exists. This axiom can be used to build a relation asserting the equality of any two of its arguments and consequently used to handle repeated variables.

With the above finite axiomatization, the NBG theory can be adopted as a set theoretical basis for granular computing. Such a formal framework prompts a powerful insight into the essence of granulation, namely that the granulation process transforms the semantics of the granulated entities, mirroring the semantical distinction between sets and classes. The semantics of granules is derived from the domain that has, in general, higher cardinality than the cardinality of the granulated sets. Although, at first, it might be a bit surprising to see that such a semantical transformation is an essential part of information granulation, in fact we can point to a common framework of many scientific disciplines which have evolved by abstracting from details inherent to the underpinning scientific discipline and developing a vocabulary of terms (proper classes) that have been verified by reference to real life (ultimately to the laws of physics). An example of the granulation of detailed information into semantically meaningful granules might be the consideration of cells and organisms in biology rather than the consideration of molecules, atoms, or sub-atomic particles when studying the physiology of living organisms.
The operation on classes in NBG is entirely consistent with the operation on sets in the intuitive set theory. The principle of abstraction implies that classes can be formed out of any statement of the predicate calculus with the membership relation. Notions of equality, pairing, and such are thus matters of definitions (a specific abstraction of a formula) and not of axioms. In NBG, a set represents a class if every element of the set is an element of the class. Consequently, there are classes that do not have representations. We therefore suggest that the advantage of adopting NBG as a set theoretical basis for granular computing is that it provides a framework within which one can discuss a hierarchy of different granulations without running the risk of inconsistency. For instance, one can denote a “large category” as a category of granules whose collection of granules and collection of morphisms can be represented by a class. A “small category” can be denoted as a category of granules contained in sets. Thus, we can speak of the “category of all small categories” (which is a “large category”) without the risk of inconsistency. A similar framework for a set theoretical representation of granulation is offered by the theory of types published by Russell in 1937 [29]. The theory assumes a linear hierarchy of types, with type 0 consisting of objects of undecided type and, for each natural number n, type n+1 objects being sets of type n objects. The conclusions that can be drawn from this framework with respect to the nature of granulation are exactly the same as those drawn from NBG.
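The two-sorted character of NBG can be loosely mirrored in code by treating a set as a finite collection that can be enumerated and a class as a bare membership condition. The Python sketch below is an analogy only, not a model of NBG: the `Class` wrapper, the stand-in universe `V`, and all function names are hypothetical, and only the Sets, Complement, Intersection, and Products axioms of the Class Comprehension group are imitated.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Class:
    """A class known only through its membership condition."""
    contains: Callable[[object], bool]

    def __contains__(self, x):
        return self.contains(x)

# Stand-in for the universe V: every object considered belongs to it.
V = Class(lambda x: True)

def from_set(s):
    """Axiom of Sets: a set s yields a class with the same elements."""
    return Class(lambda x: x in s)

def complement(X):
    """Axiom of Complement: V − X = {x | x ∉ X}."""
    return Class(lambda x: x not in X)

def intersection(X, Y):
    """Axiom of Intersection: X ∩ Y = {x | x ∈ X ∧ x ∈ Y}."""
    return Class(lambda x: x in X and x in Y)

def product(X, Y):
    """Axiom of Products: X × Y = {(x, y) | x ∈ X ∧ y ∈ Y}."""
    return Class(
        lambda p: isinstance(p, tuple)
        and len(p) == 2
        and p[0] in X
        and p[1] in Y
    )

# A class that is never enumerated, and a small set that can be.
evens = Class(lambda x: isinstance(x, int) and x % 2 == 0)
small = from_set(frozenset({1, 2, 3, 4}))

both = intersection(evens, small)
odd_like = complement(evens)
print(2 in both, 6 in both, 3 in odd_like, (2, 3) in product(evens, small))
```

Membership in a class built this way can be tested object by object, but the class itself is never listed or stored as a collection, which is the practical counterpart of the distinction between "small" sets and "big" classes drawn above.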

3.2.3. Mereology

An alternative framework for the formalization of granular computing, that of mereology, has been proposed by other researchers. The roots of mereology can be traced to the work of Edmund Husserl (1901) [16] and to the subsequent work of the Polish mathematician Stanislaw Lesniewski in the late 1920s [18, 19]. Much of this work was motivated by the same concerns about the intuitive set theory that have spurred the development of axiomatic set theories (ZF, NBG, and others) [15, 45]. Mereology replaces talk about “sets” with talk about “sums” of objects, objects being no more than the various things that make up wholes. However, such a simple replacement results in an “intuitive mereology” that is analogous to “intuitive set theory.” Such “intuitive mereology” suffers from paradoxes analogous to Russell's paradox (we can ask: if there is an object whose parts are all the objects that are not parts of themselves, is it a part of itself?). So, one has to conclude that the mere introduction of the mereological concepts of “partness” and “wholeness” is not sufficient and that mereology requires axiomatic formulation. Axiomatic formulation of mereology has been proposed as a first-order theory whose universe of discourse consists of wholes and their respective parts, collectively called objects [31, 34]. A mereological system requires at least one primitive relation, e.g., dyadic Parthood, x is a part of y, written as Pxy. Parthood is nearly always assumed to partially order the universe. An immediately defined predicate is x is a proper part of y, written as PPxy, which holds if Pxy is true and Pyx is false. An object lacking proper parts is an atom. The mereological universe consists of all objects we wish to consider and all of their proper parts. Two other predicates commonly defined in mereology are Overlap and Underlap. These are defined as follows:
– Oxy, the overlap of x and y, holds if there exists an object z such that Pzx and Pzy both hold.
– Uxy, the underlap of x and y, holds if there exists an object z such that x and y are both parts of z (Pxz and Pyz hold).


With the above predicates, axiomatic mereology defines the following axioms:

M1, Parthood is Reflexive: Asserts that an object is part of itself.
M2, Parthood is Antisymmetric: Asserts that if Pxy and Pyx both hold, then x and y are the same object.
M3, Parthood is Transitive: Asserts that if Pxy and Pyz hold, then Pxz holds.
M4, Weak Supplementation: Asserts that if PPxy holds, there exists a z such that Pzy holds but Ozx does not.
M5, Strong Supplementation: Asserts that if Pyx does not hold, there exists a z such that Pzy holds but Ozx does not.
M5a, Atomistic Supplementation: Asserts that if Pxy does not hold, then there exists an atom z such that Pzx holds but Ozy does not.
Top: Asserts that there exists a “universal object”, designated W, such that PxW holds for any x.
Bottom: Asserts that there exists an atomic “null object”, designated N, such that PNx holds for any x.
M6, Sum: Asserts that if Uxy holds, there exists a z, called the “sum of x and y,” such that the parts of z are just those objects which are parts of either x or y.
M7, Product: Asserts that if Oxy holds, there exists a z, called the “product of x and y,” such that the parts of z are just those objects which are parts of both x and y.
M8, Unrestricted Fusion: Let f be a first-order formula having one free variable. Then the fusion of all objects satisfying f exists.
M9, Atomicity: Asserts that all objects are either atoms or fusions of atoms.

It is clear that if “parthood” in mereology is taken as corresponding to “subset” in set theory, there is some analogy between the above axioms of classical extensional mereology and those of standard Zermelo–Fraenkel set theory. However, there are some philosophical and common sense objections to some of the above axioms, e.g., transitivity of parthood (M3). Also, the above set of axioms is not minimal, since it is possible to derive the Weak Supplementation axiom (M4) from the Strong Supplementation axiom (M5). Axiom M6 implies that if the universe is finite or if Top is assumed, then the universe is closed under sum. Universal closure of product and of supplementation relative to W requires Bottom. W and N are evidently the mereological equivalents of the universal and the null sets. Because sum and product are binary operations, M6 and M7 admit the sum and product of only a finite number of objects. The fusion axiom, M8, enables taking the sum of infinitely many objects. The same holds for products. If M8 holds, then W exists for infinite universes. Hence Top needs to be assumed only if the universe is infinite and M8 does not hold. It is somewhat strange that while the Top axiom (postulating W) is not controversial, the Bottom axiom (postulating N) is. Lesniewski rejected the Bottom axiom and most mereological systems follow his example. Hence, while the universe is closed under sum, the product of objects that do not overlap is typically undefined. Mereology defined in this way is equivalent to a Boolean algebra lacking a 0. Postulating N generates a mereology in which all possible products are definable, but it also transforms extensional mereology into a full Boolean algebra, i.e., one that includes a null element [34]. The full mathematical analysis of the theories of parthood is beyond the intended scope of this chapter and the reader is referred to the recent publication by Pontow and Shubert [46] in which the authors prove, by set theoretical means, that there exists a model of general extensional mereology where arbitrary summation of attributes is not possible.
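A small finite model makes these predicates and axioms tangible. In the Python sketch below, which is an illustration under stated assumptions rather than a general mereological system, the objects are all nonempty collections of three hypothetical atoms, parthood is read as inclusion, and no null object is admitted, following Lesniewski. The function names are introduced here for the example only.

```python
from itertools import combinations

ATOMS = ("a", "b", "c")  # hypothetical atoms

# Objects: every nonempty collection of atoms (no null object N).
OBJECTS = [
    frozenset(c)
    for r in range(1, len(ATOMS) + 1)
    for c in combinations(ATOMS, r)
]
W = frozenset(ATOMS)  # the universal object (Top)

def P(x, y):
    """Parthood: x is a part of y."""
    return x <= y

def PP(x, y):
    """Proper parthood: Pxy holds and Pyx does not."""
    return P(x, y) and not P(y, x)

def O(x, y):
    """Overlap: some object z is a part of both x and y."""
    return any(P(z, x) and P(z, y) for z in OBJECTS)

def U(x, y):
    """Underlap: some object z has both x and y as parts."""
    return any(P(x, z) and P(y, z) for z in OBJECTS)

def mereological_sum(x, y):
    """M6: the sum of x and y, defined when they underlap."""
    return x | y if U(x, y) else None

def mereological_product(x, y):
    """M7: the product of x and y, defined only when they overlap."""
    return x & y if O(x, y) else None

# M1 to M4 checked by brute force over the finite universe.
reflexive = all(P(x, x) for x in OBJECTS)
antisymmetric = all(
    x == y for x in OBJECTS for y in OBJECTS if P(x, y) and P(y, x)
)
transitive = all(
    P(x, z)
    for x in OBJECTS for y in OBJECTS for z in OBJECTS
    if P(x, y) and P(y, z)
)
weak_supplementation = all(
    any(P(z, y) and not O(z, x) for z in OBJECTS)
    for x in OBJECTS for y in OBJECTS
    if PP(x, y)
)
print(reflexive, antisymmetric, transitive, weak_supplementation)
print(mereological_product(frozenset("a"), frozenset("b")))
```

In this model every pair of objects underlaps because W exists, so sums are always defined, whereas the product of two distinct atoms is undefined because there is no null object to serve as its value.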
However, it is clear from the axiomatization above that the question about the existence of a universal entity containing all other entities and the question about the existence of an empty entity as part of all existing entities are answered very differently by set theory and mereology. In set theory, the existence of a universal entity is contradictory and the existence of an empty set is mandatory, while in mereology the existence of a universal entity is stipulated by the respective fusion axioms and the existence of an empty entity is denied. Moreover, it is worth noting that in mereology there is no straightforward analogue to the set theoretical is-element-of relation [46]. So, taking into account the above, we suggest the following answer to the underlying questions of this section: Why is granulation necessary? Why is the set theoretical representation of granulation appropriate?
– The concept of granulation is necessary to denote the semantical transformation of granulated entities in a way that is analogous to the semantical transformation of sets into classes in axiomatic set theory;


– Granulation interpreted in the context of axiomatic set theory is very different from clustering, since it deals with semantical transformation of data and does not limit itself to a mere grouping of similar entities;
– The set theoretical interpretation of granulation enables a consistent representation of a hierarchy of information granules.

3.3. Abstraction and Computation

Having established an argument for a semantical dimension to granulation, one may ask: how is the meaning (semantics) instilled into real-life information granules? Is the meaning instilled through an algorithmic processing of constituent entities or is it a feature that is independent of algorithmic processing? The answers to these questions are hinted at by von Neumann’s limitation of size principle, mentioned in the previous section, and are more fully informed by Turing’s theoretical model of computation. In his original paper [35], Turing defined computation as an automatic version of doing what people do when they manipulate numbers and symbols on paper. He proposed a conceptual model which included: (a) an arbitrarily long tape from which one could read as many symbols as needed (from a countable set); (b) a means to read and write those symbols; (c) a countable set of states storing information about the completed processing of symbols; and (d) a countable set of rules that govern what should be done for various combinations of input and system states. A physical instantiation of computation, envisaged by Turing, was a human operator (called a computer) who was compelled to obey the rules (a)–(d) above. There are several important implications of Turing’s definition of computation. First, the model implies that computation explores only a subset of the capabilities of human information processing. Second, the constraint that the input and output are strictly symbolic (with symbols drawn from a countable set) implies that the computer does not interact directly with the environment. These are critical limitations, meaning that Turing’s computer on its own is unable (by definition) to respond to external, physical stimuli. Consequently, it is not just wrong but essentially meaningless to speculate on the ability of Turing machines to perform human-like intelligent interaction with the real world.
To phrase it in mathematical terms, the general form of computation, formalized as a universal Turing machine (UTM), is defined as a mapping of sets that have at most cardinality $\aleph_0$ (infinite, countable) onto sets with cardinality $\aleph_0$. The practical instances of information processing, such as clustering of data, typically involve a finite number of elements in both the input and output sets and therefore represent a more manageable mapping of a finite set with cardinality $max_1$ onto another finite set with cardinality $max_2$. The hierarchy of computable clustering can therefore be represented as in Figure 3.1.


Figure 3.1: Cardinality of sets in a hierarchy of clusterings implemented on UTM.

Figure 3.2: Mapping of abstractions from the real-world domain (cardinality $\aleph_1$) onto the sets of clusters.

The functions $F_1(x_1) \to x_2$, $F_2(x_2) \to x_3$, $F_3(x_3) \to x_4$ represent, respectively, mappings of:
– an infinite (countable) input set onto an infinite (countable) output set;
– an infinite (countable) input set onto a finite output set; and
– a finite input set onto a finite output set.
The functional mappings, deployed in the process of clustering, reflect the criteria of similarity, proximity, or indistinguishability of elements in the input set and, on this basis, group them together into a separate entity to be placed in the output set. In other words, the functional mappings generate data abstractions on the basis of pre-defined criteria and consequently represent UTM computation. However, we need to understand how these criteria are selected and how they are decided to be appropriate in any specific circumstance. Clearly, there are many ways of defining similarity, proximity, or indistinguishability. Some of these definitions are likely to have a good real-world interpretation, while others may be difficult to interpret or indeed may lead to physically meaningless results. We suggest that the process of instilling the real-world interpretation into data structures generated by the functional mappings $F_1(x_1) \to x_2$, $F_2(x_2) \to x_3$, $F_3(x_3) \to x_4$ involves reference to the real world, as illustrated in Figure 3.2. This is represented as execution of “experimentation” functions $E_*(x_0)$. These functions map the


real-world domain $x_0$, which has cardinality $\aleph_1$ (infinite, continuum), onto sets $x_1$, $x_2$, $x_3$, $x_4$, respectively. At this point, it is important to underline that the experimentation functions $E_1(x_0) \to x_1$, $E_2(x_0) \to x_2$, $E_3(x_0) \to x_3$, $E_4(x_0) \to x_4$ are not computational, in the UTM sense, because their domains have cardinality $\aleph_1$. So, the process of defining the criteria for data clustering, and implicitly instilling the meaning into information granules, relies on the laws of physics and not on the mathematical model of computation. Furthermore, the results of experimentation do not depend on whether the experimenter understands or is even aware of the laws of physics. It is precisely because of this that we consider the experimentation functions as providing objective evidence.

3.4. Experimentation as a Physical Computation

Recent research [30] has demonstrated that analogue computation, in the form of recurrent analogue neural networks (RANN), can exceed the abilities of a UTM if the weights in such neural networks are allowed to take continuous rather than discrete values. While this result is significant in itself, it relies on assumptions about the continuity of parameters that are difficult to verify. So, although the brain looks remarkably like a RANN, drawing any conclusions about the hypercomputational abilities of the brain, purely on the grounds of structural similarities, leads to the same questions about the validity of the assumptions about continuity of weights. Of course, this is not to say that these assumptions are not valid; they may well be valid, but we just highlight that this has not yet been demonstrated in a conclusive way. A pragmatic approach to bridging the gap between the theoretical model of hyper-computation, as offered by RANN, and human intelligent information processing (which, by definition, is hyper-computational) has been proposed by Bains [1, 2]. Her suggestion was to reverse the original question about the hypercomputational ability of systems and to ask: if the behavior of physical systems cannot be replicated using Turing machines, how can it be replicated? The answer to this question is surprisingly simple: we can use the inherent computational ability of physical phenomena in conjunction with the numerical information processing ability of the UTM. In other words, the readiness to refine numerical computations in the light of objective evidence coming from a real-life experiment instills the ability to overcome the limitations of the Turing machine. We have advocated this approach in our earlier work [6] and have argued that the hyper-computational power of granular computing is equivalent to “keeping an open mind” in intelligent, human information processing.
In what follows, we describe the model of physical computation, as proposed in [1], and cast it in the framework of granular computing.


3.4.1. A Model of Physical Computation

We define a system under consideration as an identifiable collection of connected elements. A system is said to be embodied if it occupies a definable volume and has a collective contiguous boundary. In particular, a UTM with its collection of input/output data, states, and collection of rules, implementing some information processing algorithm, can be considered to be a system G whose physical instantiations may refer to specific I/O, processing, and storage devices as well as specific energy states. The matter, space, and energy outside the boundaries of the embodied system are collectively called the physical environment and will be denoted here by P. A sensor is any part of the system that can be changed by physical influences from the environment. Any forces, fields, energy, matter, etc. that may be impinging on the system are collectively called the sensor input ($i \in X$), even where no explicitly defined sensors exist. An actuator is any part of the system that can change the environment. Physical changes to the embodied system that manifest themselves externally (e.g., emission of energy, change of position, etc.) are collectively called the actuator output ($h \in Y$) of G. A coupled pair of sensor input $i_t$ and actuator output $h_t$ represents an instance of experimentation at time t and is denoted here as $E_t$. Since the system G, considered in this study, is a computational system (modeled by a UTM) and since the objective of the evolution of this system is to mimic human intelligent information processing, we will define $G_t$ as the computational intelligence function performed by the embodied system G. Function $G_t$ maps the input to the output at specific time instances t resolved with arbitrarily small accuracy $\delta t > 0$, so as not to preclude the possibility of a continuous physical time. We can thus formally define the computational intelligence function as:
$$G_t(i_t) \to h_{t+\delta t}.$$
In the proposed physical model of computation, we stipulate that $G_t$ causes only an immediate output in response to an immediate input. This stipulation does not prevent one from implementing some plan over time, but it implies that a controller that would be necessary to implement such a plan is part of the intelligence function. The adaptation of $G_t$ in response to the evolving input $i_t$ can be described by the computational learning function, $L_G$:
$$L_G(G_t, i_t) \to G_{t+\delta t}.$$
Considering now the impact of the system behavior on the environment, we can define the environment reaction function mapping the system output h (environment input) to the environment output i (system input):
$$P_t(h_t) \to i_{t+\delta t}.$$
The adaptation of the environment P over time can be described by the environment learning function, $L_P$:
$$L_P(P_t, h_t) \to P_{t+\delta t}.$$
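The four mappings above can be read as one step of a discrete-time loop. The following is a minimal sketch of that reading, not the model of [1] itself. The proportional rule in `System.act`, the gain-adaptation rule, and the scalar `level` state of the environment are all hypothetical choices made only so that the loop can run.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class System:
    """G_t: the computational intelligence function at time t."""
    gain: float

    def act(self, i: float) -> float:
        """G_t(i_t) -> h_{t+dt}: immediate output for an immediate input."""
        return -self.gain * i

@dataclass(frozen=True)
class Environment:
    """P_t: the environment reaction function at time t."""
    level: float  # hypothetical physical state (deviation from a set point)

    def react(self, h: float) -> float:
        """P_t(h_t) -> i_{t+dt}: sensor input caused by actuator output h."""
        return self.level + h

def learn_system(g: System, i: float, rate: float = 0.1, cap: float = 0.9) -> System:
    """L_G(G_t, i_t) -> G_{t+dt}: raise the gain while deviations persist."""
    return replace(g, gain=min(cap, g.gain + rate * abs(i)))

def learn_environment(p: Environment, h: float) -> Environment:
    """L_P(P_t, h_t) -> P_{t+dt}: the environment absorbs the actuation."""
    return replace(p, level=p.level + h)

def run_experiment(steps: int = 20) -> list:
    """Iterate the coupled system/environment loop; return the inputs i_t."""
    g, p = System(gain=0.2), Environment(level=1.0)
    i = p.level                       # initial sensor input
    history = [i]
    for _ in range(steps):
        h = g.act(i)                  # G_t(i_t)      -> h_{t+dt}
        g = learn_system(g, i)        # L_G(G_t, i_t) -> G_{t+dt}
        i = p.react(h)                # P_t(h_t)      -> i_{t+dt}
        p = learn_environment(p, h)   # L_P(P_t, h_t) -> P_{t+dt}
        history.append(i)
    return history
```

Because `System` and `Environment` are immutable, each learning function returns a new object, which keeps the distinction between $G_t$ and $G_{t+\delta t}$ (and between $P_t$ and $P_{t+\delta t}$) explicit in the code.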


Figure 3.3: Evolution of a system in an experiment with physical interaction.

3.4.2. Physical Interaction between P and G

The interaction between the system G and its physical environment P may be considered to fall into one of two classes: real interaction and virtual interaction. Real interaction is a purely physical process in which the output from the environment P is, in its entirety, forwarded as an input to the system G and, conversely, the output from G is fully utilized as input to P. Referring to the notation in Figure 3.3, real interaction is one in which $h_t = h'_t$ and $i_t = i'_t$ for all time instances t. Unfortunately, this type of interaction does not accept the limitations of the UTM, namely the processing of only a pre-defined set of symbols rather than a full spectrum of responses from the environment. Consequently, this type of interaction places too high demands on the information processing capabilities of G and, in practical terms, is limited to the interaction of physical objects as governed by the laws of physics. In other words, the intelligence function and its implementation are one and the same.

3.4.3. Virtual Interaction

An alternative mode of interaction is virtual interaction, which is mediated by the symbolic representation of information. Here we use the term symbol as it is defined in the context of the UTM: a letter or sign taken from a finite alphabet to allow distinguishability. We define $V_t$ as the virtual computational intelligence function, analogous to $G_t$ in terms of information processing, and $V'_t$ as the complementary computational intelligence function, analogous to $G_t$ in terms of communication with the physical environment. With the above definitions, we can lift some major constraints of physical interactions, with important consequences. The complementary function $V'_t$ can implement an interface to the environment, filtering real-life information input from the environment and facilitating transfer of actuator output, while the


Figure 3.4: Evolution of a system in an experiment with virtual interaction.

virtual intelligence function $V_t$ can implement UTM processing of the filtered information. This means that $i_t$ does not need to be equal to $i'_t$ and $h_t$ does not need to be equal to $h'_t$. In other words, inputs and outputs may be considered selectively rather than in their totality. The implication is that many physically distinguishable states may have the same symbolic representation at the virtual computational intelligence function level. The relationship between the two components of computational intelligence is illustrated in Figure 3.4. It should be pointed out that, typically, we think of the complementary function $V'_t$ as some mechanical or electronic device (utilizing the laws of physics in its interaction with the environment), but a broader interpretation that includes human perception, as discussed by Zadeh in [40], is entirely consistent with the above model. In this broader context, the UTM implementing the virtual computational intelligence function can be referred to as computing with perceptions or computing with words (see Figure 3.5). Another important implication of the virtual interaction model is that V and P need not have any kind of conserved relationship. This is because only the range/modality of subsets of $i'_t$ and $h'_t$ attach to V and these subsets are defined by the choice of sensor/actuator modalities. So, we can focus on the choice of modalities, within the complementary computational intelligence function, as a mechanism through which one can exercise the assignment of semantics to both inputs and


Figure 3.5: The paradigm of computing with perceptions within the framework of virtual interaction.

Figure 3.6: An instance of granular computing involving two essential components: algorithmic clustering and empirical evaluation of granules.

outputs of the virtual intelligence function. To put it informally, the complementary function is a facility for defining a “language” we choose to communicate with the real world. Of course, to make the optimal choice (one that allows undistorted perception and interaction with the physical environment), it would be necessary to have complete knowledge of the physical environment. So, by its very nature, the process of defining the semantics of inputs and outputs of the virtual intelligence function is iterative and involves evolution of our understanding of the physical environment.
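A small sketch may help fix the division of labour between the complementary function and the virtual intelligence function. Everything below is a hypothetical illustration. The temperature thresholds, the three-symbol alphabet, and the rule table are invented for the example and are not part of the model described above.

```python
import bisect

# Finite alphabet of symbols available to the virtual function V_t.
ALPHABET = ("cold", "warm", "hot")
# Hypothetical sensor cut points (degrees Celsius) chosen by the designer.
THRESHOLDS = (15.0, 25.0)
# V_t: purely symbolic processing, symbol in -> symbol out.
RULES = {"cold": "heat", "warm": "idle", "hot": "cool"}
# Actuator side of V'_t: symbol -> physical output.
ACTUATION = {"heat": 1.0, "idle": 0.0, "cool": -1.0}

def sense(i_physical: float) -> str:
    """Sensor side of V'_t: filter a continuous reading into one symbol."""
    return ALPHABET[bisect.bisect_right(THRESHOLDS, i_physical)]

def virtual_intelligence(symbol: str) -> str:
    """V_t: processing that only ever sees symbols."""
    return RULES[symbol]

def actuate(symbol: str) -> float:
    """Actuator side of V'_t: turn a symbol back into a physical output."""
    return ACTUATION[symbol]

def respond(i_physical: float) -> float:
    """Full virtual interaction: physical input -> symbols -> physical output."""
    return actuate(virtual_intelligence(sense(i_physical)))
```

Readings of 18.2 and 24.9 are physically distinguishable yet both map to the symbol `"warm"`, which is the many-to-one relation between physical states and symbols noted above. Changing `THRESHOLDS` or `ALPHABET` corresponds to choosing a different "language" for communicating with the environment.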

3.5. Granular Computation

An important conclusion from the discussion above is that the discovery of semantics of information abstraction, sometimes referred to as structured thinking, or a philosophical dimension of granular computing, can be reduced to physical experimentation. This is a very welcome development as it gives a good basis for the formalization of the granular computing paradigm. We argue here that granular computing should be defined as a structured combination of algorithmic abstraction of data and non-algorithmic, empirical verification of the semantics of these abstractions. This definition is general in that it neither prescribes the mechanism of algorithmic abstraction nor elaborates on


the techniques of experimental verification. Instead, it highlights the essence of combining computational and non-computational information processing. Such a definition has several advantages:
– it emphasizes the complementarity of the two constituent functional mappings;
– it justifies the hyper-computational nature of GrC;
– it places physics alongside set theory as the theoretical foundations of GrC;
– it helps to avoid confusion between GrC and purely algorithmic data processing while taking full advantage of the advances in algorithmic data processing.

3.6. Design of Information Granules

Building information granules constitutes a central item on the agenda of granular computing with far-reaching implications for its applications. We present a way of moving from data to numeric representatives, information granules, and their linguistic summarization. The organization of the overall scheme and the relationships among the resulting constructs are displayed in Figure 3.7.

Clustering as a design prerequisite of information granules

Along with a truly remarkable diversity of detailed algorithms and optimization mechanisms of clustering, the paradigm itself delivers a viable prerequisite for the formation of information granules (associated with the ideas and terminology of fuzzy clustering, rough clustering, and others) and applies to both numeric data and information granules. Information granules built through clustering are predominantly data-driven, viz. clusters (either in the form of fuzzy sets, sets, or rough sets) are a manifestation of a structure encountered (discovered) in the data. Numeric prototypes are formed by invoking clustering algorithms, which yield a partition matrix and a collection of the prototypes. Clustering realizes a certain process of abstraction, producing a small number of prototypes based on a large number of numeric data. Interestingly, clustering can also be completed in the

Figure 3.7: From data to information granules and linguistic summarization.


feature space. In this situation, the algorithm returns a small collection of abstracted features (groups of features) that might be referred to as meta-features. Two ways of generalization of numeric prototypes treated as key descriptors of data and manageable chunks of knowledge are considered: (i) symbolic and (ii) granular. In the symbolic generalization, one ignores the numeric values of the prototypes and regards them as sequences of integer indexes (labels). Along this line, concepts of (symbolic) stability and (symbolic) resemblance of data structures are developed. The second generalization motivates the construction of information granules (granular prototypes), which arise as a direct quest for delivering a more comprehensive representation of the data than the one delivered through numeric entities. This entails that information granules (including their associated level of abstraction) have to be prudently formed to achieve the required quality of the granular model. As a consequence, performance evaluation embraces the following sound alternatives: (i) evaluation of the representation capabilities of numeric prototypes, (ii) evaluation of the representation capabilities of granular prototypes, and (iii) evaluation of the quality of the granular model.

Evaluation of the representation capabilities of numeric prototypes

In the first situation, the representation capabilities of numeric prototypes are assessed with the aid of a so-called granulation–degranulation scheme yielding a certain reconstruction error. The essence of the scheme can be schematically portrayed as follows: x → internal representation → reconstruction. The formation of the internal representation is referred to as granulation (encoding), whereas the process of degranulation (decoding) can be viewed as an inverse mechanism to the encoding scheme.
In terms of detailed formulas, one encounters the following flow of computing:
– Encoding, leading to the degrees of activation of information granules by input x, say $A_1(x), A_2(x), \ldots, A_c(x)$, with
$$A_i(x) = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{\|x - v_i\|}{\|x - v_j\|} \right)^{2/(m-1)}}$$
In case the prototypes are developed with the use of the Fuzzy C-Means (FCM) clustering algorithm, the parameter m (> 1) stands for the fuzzification coefficient and $\|\cdot\|$ denotes the Euclidean distance.


– Degranulation (decoding), producing a reconstruction of x via the following expression:
$$\hat{x} = \frac{\sum_{i=1}^{c} A_i^m(x)\, v_i}{\sum_{i=1}^{c} A_i^m(x)}$$
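The granulation–degranulation scheme can be sketched directly from the two expressions above. The code below is an illustrative NumPy implementation under our own assumptions: the function names are hypothetical, and the handling of an input that coincides with a prototype (full membership in that prototype) is a common convention that the text does not spell out.

```python
import numpy as np

def encode(x, prototypes, m=2.0):
    """Granulation: membership degrees A_1(x), ..., A_c(x) of input x."""
    d = np.linalg.norm(prototypes - x, axis=1)        # ||x - v_i||
    zero = d == 0
    if np.any(zero):
        # Assumed convention: x sitting on a prototype belongs fully to it.
        return zero.astype(float) / np.count_nonzero(zero)
    ratio = d[:, None] / d[None, :]                   # ||x - v_i|| / ||x - v_j||
    return 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=1)

def decode(a, prototypes, m=2.0):
    """Degranulation: reconstruct x from its membership degrees."""
    w = a ** m
    return w @ prototypes / w.sum()

def reconstruction_error(data, prototypes, m=2.0):
    """Sum of squared distances between data and their reconstructions."""
    return float(sum(
        np.sum((x - decode(encode(x, prototypes, m), prototypes, m)) ** 2)
        for x in data
    ))
```

For two prototypes at (0, 0) and (2, 0), the midpoint (1, 0) receives memberships of 0.5 each and is reconstructed exactly. Points off the segment joining the prototypes are reconstructed onto it, and that displacement is what `reconstruction_error` accumulates.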

It is worth stressing that the above-stated formulas are a consequence of the underlying optimization problems. For any collection of numeric data, the reconstruction error is the sum of squared errors (distances) between the original data and their reconstructed versions.

The principle of justifiable granularity

The principle of justifiable granularity guides the construction of an information granule based on the available experimental evidence [50, 51]. In a nutshell, the resulting information granule delivers a summarization of data (viz. the available experimental evidence). The underlying rationale behind the principle is to deliver a concise and abstract characterization of the data such that (i) the produced granule is justified in light of the available experimental data and (ii) the granule comes with well-defined semantics, meaning that it can be easily interpreted and is distinguishable from the others. Formally speaking, these two intuitively appealing criteria are expressed by the criterion of coverage and the criterion of specificity. Coverage states how much data are positioned behind the constructed information granule. Put differently, coverage quantifies the extent to which the information granule is supported by the available experimental evidence. Specificity, on the other hand, is concerned with the semantics (meaning) of the information granule.

One-dimensional case

The definition of coverage and specificity requires formalization and this depends on the formal nature of the information granule to be formed. As an illustration, consider an interval form of the information granule A. In the case of intervals built on the basis of one-dimensional numeric data (evidence) $x_1, x_2, \ldots, x_N$, the coverage measure is associated with a count of the number of data embraced by A, namely
$$\mathrm{cov}(A) = \frac{1}{N}\,\mathrm{card}\{x_k \mid x_k \in A\}$$
where card{·} denotes the cardinality of the set, viz. the number (count) of elements $x_k$ belonging to A.
In essence, coverage has a visible probabilistic flavor. The specificity of A, sp(A), is regarded as a decreasing function g of the size (length) of the


information granule. If the granule is composed of a single element, sp(A) attains the highest value and returns 1. If A is included in some other information granule B, then sp(A) > sp(B). In a limit case, if A is an entire space, sp(A) returns zero. For an interval-valued information granule A = [a, b], a simple implementation of specificity with g being a linearly decreasing function comes as
$$\mathrm{sp}(A) = g(\mathrm{length}(A)) = 1 - \frac{|b - a|}{\mathrm{range}}$$

where range stands for an entire space over which intervals are defined. If we consider a fuzzy set as a formal setting for information granules, the definitions of coverage and specificity are reformulated to take into account the nature of membership functions admitting a notion of partial membership. Here we invoke the fundamental representation theorem stating that any fuzzy set can be represented as a family of its α-cuts, namely
$$A(x) = \sup_{\alpha \in [0,1]} \left[ \min(\alpha, A_\alpha(x)) \right]$$
where
$$A_\alpha(x) = \{x \mid A(x) \ge \alpha\}$$
The supremum (sup) operation is taken over all values of the threshold α. Recall that by virtue of the representation theorem, any fuzzy set is represented as a collection of sets. Having this in mind and considering coverage as a point of departure for constructs of sets (intervals), we have the following relationships:
– Coverage
$$\mathrm{cov}(A) = \int_X A(x)\,dx \,/\, N$$
where X is a space over which A is defined; moreover, one assumes that A can be integrated. The discrete version of the coverage expression comes in the form of the sum of membership degrees. If each data point is associated with some weight, the calculations of the coverage involve these values:
$$\mathrm{cov}(A) = \int_X w(x) A(x)\,dx \,\Big/ \int_X w(x)\,dx$$
– Specificity
$$\mathrm{sp}(A) = \int_0^1 \mathrm{sp}(A_\alpha)\,d\alpha$$
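The fuzzy-set versions of coverage and specificity can be evaluated numerically. The sketch below assumes a triangular membership function with parameters (a, m, b) and the linearly decreasing specificity used for intervals. Both are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def triangular(x, a, m, b):
    """Triangular membership function with support [a, b] and peak at m."""
    x = np.asarray(x, dtype=float)
    left = (x - a) / (m - a)
    right = (b - x) / (b - m)
    return np.clip(np.minimum(left, right), 0.0, 1.0)

def coverage(data, a, m, b):
    """Discrete coverage: sum of membership degrees divided by N."""
    return float(np.mean(triangular(data, a, m, b)))

def alpha_cut(alpha, a, m, b):
    """Bounds of the alpha-cut, which is an interval for a triangular set."""
    return a + alpha * (m - a), b - alpha * (b - m)

def specificity(a, m, b, value_range, steps=1000):
    """sp(A) as the integral over alpha of sp(A_alpha), by the midpoint rule."""
    alphas = (np.arange(steps) + 0.5) / steps
    lo, hi = alpha_cut(alphas, a, m, b)
    return float(np.mean(1.0 - (hi - lo) / value_range))
```

For a triangular set the α-cut has length (b − a)(1 − α), so the integral evaluates to 1 − (b − a)/(2·range). With a = 2, m = 5, b = 8 and a range of 10 this gives 0.7, which the numerical routine reproduces.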

[Figure 3.8 content: interval granules speed [60,80] km/h, speed [40,95] km/h, speed [20,100] km/h; axis labels: abstraction (coverage), specificity.]

Figure 3.8: Relationships between the abstraction (coverage) and specificity of information granules of speed: emergence of a hierarchy of interval information granules of varying specificity.

The criteria of coverage and specificity are in an obvious relationship (Figure 3.8). Suppose we are interested in forecasting the speed: the more specific the statement about the prediction of speed is, the lower the likelihood of its satisfaction (coverage). Let us introduce the following product of the criteria: V = cov(A)sp(A). It is apparent that coverage and specificity are in conflict; an increase in coverage is associated with a drop in specificity. Thus the desired solution is the one where the value of V attains its maximum. The design of the information granule is accomplished by maximizing the above product of coverage and specificity. Formally speaking, consider that an information granule is described by a vector of parameters p, so that V = V(p). The principle of justifiable granularity gives an information granule that maximizes V, $p_{opt} = \arg\max_{p} V(p)$. To maximize the index V by adjusting the parameters of the information granule, two different strategies are encountered:
(i) A two-phase development is considered. First, a numeric representative (mean, median, mode, etc.) is determined. It can be regarded as an initial representation of the data. Next, the parameters of the information granule are optimized by maximizing V. For instance, in the case of an interval, one has two bounds (a and b) to be determined. These two parameters are determined separately, viz. a and b are formed by maximizing V(a) and V(b). The data used in the maximization of V(b) involve data larger than the numeric representative. Likewise, V(a) is optimized on the basis of data lower than this representative.
(ii) A single-phase procedure is considered, in which all parameters of the information granule are determined at the same time.
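The two-phase strategy (i) can be sketched for an interval granule as follows. This is a minimal illustration under stated assumptions: the median is used as the numeric representative, the range is taken to be the span of the data, each one-sided specificity is measured from the representative to the candidate bound, and candidate bounds are restricted to observed data points. None of these choices is prescribed by the text, and the function name is hypothetical.

```python
import numpy as np

def justifiable_interval(data):
    """Two-phase construction of an interval granule [a, b] around the median."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    value_range = x[-1] - x[0]
    if value_range == 0:
        raise ValueError("data must not be constant")
    rep = float(np.median(x))          # phase 1: numeric representative

    def v_upper(b):
        """V(b) = coverage * specificity for the upper bound b >= rep."""
        cov = np.count_nonzero((x >= rep) & (x <= b)) / n
        sp = 1.0 - (b - rep) / value_range
        return cov * sp

    def v_lower(a):
        """V(a) = coverage * specificity for the lower bound a <= rep."""
        cov = np.count_nonzero((x <= rep) & (x >= a)) / n
        sp = 1.0 - (rep - a) / value_range
        return cov * sp

    # Phase 2: optimize each bound separately on its own side of rep.
    b = max(x[x >= rep], key=v_upper)
    a = max(x[x <= rep], key=v_lower)
    return float(a), rep, float(b)
```

For the data 1, 2, ..., 9, 100 the routine returns the interval [1, 9] around the median 5.5. Extending the upper bound to 100 would raise coverage only slightly while almost eliminating specificity, so the outlier is left outside the granule.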


Multi-dimensional case

The results of clustering coming in the form of numeric prototypes $v_1, v_2, \ldots, v_c$ can be further augmented by forming information granules, giving rise to so-called granular prototypes. This can be regarded as the result of the immediate usage of the principle of justifiable granularity and its algorithmic underpinning as elaborated earlier. Around the numeric prototype $v_i$, one spans an information granule $V_i$, $V_i = (v_i, \rho_i)$, whose optimal size is obtained as the result of the maximization of the well-known criterion
$$\rho_{i,opt} = \arg\max_{\rho_i} \left[ \mathrm{cov}(V_i)\,\mathrm{sp}(V_i) \right]$$
where
$$\mathrm{cov}(V_i) = \frac{1}{N}\,\mathrm{card}\{x_k \mid \|x_k - v_i\| \le n\rho_i\}$$
$$\mathrm{sp}(V_i) = 1 - \rho_i$$
assuming that we are concerned with normalized data. In the case of the FCM method, the data come with their membership grades (entries of the partition matrix). The coverage criterion is modified to reflect this. Let us introduce the following notation:
$$\Omega_i = \{x_k \mid \|x_k - v_i\| \le n\rho_i\}$$
Then the coverage is expressed in the form
$$\mathrm{cov}(V_i) = \frac{1}{N} \sum_{x_k \in \Omega_i} u_{ik}$$
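A granular prototype can be obtained from a numeric one by a simple search over the radius. The sketch below uses the count-based coverage and the specificity 1 − ρ for data normalized to the unit hypercube. It assumes that n denotes the number of features, which this passage does not state explicitly, and it replaces the maximization by a plain grid search. The function name and the grid size are hypothetical.

```python
import numpy as np

def optimal_radius(data, prototype, grid=101):
    """Grid search for the radius rho maximizing cov(V_i) * sp(V_i)."""
    data = np.asarray(data, dtype=float)
    prototype = np.asarray(prototype, dtype=float)
    n_dims = data.shape[1]                       # assumed meaning of n
    dist = np.linalg.norm(data - prototype, axis=1)
    rhos = np.linspace(0.0, 1.0, grid)
    # Coverage: fraction of data within distance n * rho of the prototype.
    cov = np.array([np.mean(dist <= n_dims * rho) for rho in rhos])
    sp = 1.0 - rhos                              # specificity for normalized data
    v = cov * sp
    best = int(np.argmax(v))
    return float(rhos[best]), float(v[best])
```

With four points at distance 0.09 from the prototype and one distant point, the search settles on ρ = 0.05, which covers the four nearby points at a specificity of 0.95. For the FCM variant, the indicator `dist <= n_dims * rho` would be multiplied by the membership grades before averaging.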

Representation aspects of granular prototypes

It is worth noting that given a collection of granular prototypes, one can conveniently assess their abilities to represent the original data (experimental evidence). The reconstruction problem, as outlined before for numeric data, can be formulated as follows: given $x_k$, complete its granulation and degranulation using the granular prototypes $V_i$, i = 1, 2, . . . , c. The detailed computing generalizes the reconstruction process completed for the numeric prototypes and, for a given x, yields a granular result $\hat{X} = (\hat{v}, \hat{\rho})$ where
$$\hat{v} = \frac{\sum_{i=1}^{c} A_i^m(x)\, v_i}{\sum_{i=1}^{c} A_i^m(x)}, \qquad \hat{\rho} = \frac{\sum_{i=1}^{c} A_i^m(x)\, \rho_i}{\sum_{i=1}^{c} A_i^m(x)}$$


The quality of reconstruction uses the coverage criterion formed with the aid of the Boolean predicate
$$T(x_k) = \begin{cases} 1, & \text{if } \|\hat{v} - x_k\| \le \hat{\rho} \\ 0, & \text{otherwise} \end{cases}$$
with $\hat{v}$ and $\hat{\rho}$ computed for $x_k$. It is worth noting that in addition to the global measure of quality of granular prototypes, one can associate with them their individual quality (taken as the product of the coverage and specificity computed in the formation of the corresponding information granule). The principle of justifiable granularity highlights an important facet of elevation of the type of information granularity: the result of capturing a number of pieces of numeric experimental evidence comes as a single abstract entity, an information granule. As various numeric data can be thought of as information granules of type 0, the result becomes a single information granule of type 1. This is a general phenomenon of elevation of the type of information granularity. The increased level of abstraction is a direct consequence of the diversity present in the originally available granules. This elevation effect is of a general nature and can be emphasized by stating that when dealing with experimental evidence composed of information granules of type n, the result becomes a single information granule of type (n+1). Some generalizations and their augmentations are reported in [53, 56]. As a way of constructing information granules, the principle of justifiable granularity exhibits a significant level of generality in two important ways. First, given the underlying requirements of coverage and specificity, different formalisms of information granules can be engaged. Second, experimental evidence could be expressed as information granules articulated in different formalisms, and on this basis, a certain information granule is formed. It is worth stressing that there is a striking difference between clustering and the principle of justifiable granularity.
First, clustering leads to the formation of at least two information granules (clusters), whereas the principle of justifiable granularity produces a single information granule. Second, when positioning clustering and the principle vis-à-vis each other, the principle of justifiable granularity can be sought as a follow-up step facilitating an augmentation of the numeric representative of the cluster (e.g., a prototype) and yielding granular prototypes where the facet of information granularity is retained. Granular probes of spatiotemporal data When coping with spatiotemporal data (say, time series of temperature recorded over in a given geographic region), a concept of spatiotemporal probes arises as an efficient vehicle to describe the data, capture their local nature, and articulate their local characteristics as well as elaborate on their abilities as modeling artifacts


(building blocks). The experimental evidence is expressed as a collection of data $z_k = z(x_k, y_k, t_k)$ with the corresponding arguments describing the location $(x, y)$, time $(t)$, and the value of the temporal data $z_k$. Here, an information granule is positioned in a three-dimensional space: (i) the space of the spatiotemporal variable $z$; (ii) the spatial position defined by the positions $(x, y)$; (iii) the temporal domain described by the time coordinate. The information granule to be constructed is spanned over some position of the space of values $z_0$, spatial location $(x_0, y_0)$, and temporal location $t_0$. With regard to the spatial location and the temporal location, we introduce some predefined level of specificity, which imposes a level of detail considered in advance. The coverage and specificity formed over the spatial and temporal domains are defined in the form

\[ \mathrm{cov}(A) = \mathrm{card}\{ z(x_k, y_k, t_k) \mid \| z(x_k, y_k, t_k) - z_0 \| < n\rho \}, \]
\[ \mathrm{sp}(A) = \max\left(0,\ 1 - \frac{|z - z_0|}{L_z}\right), \]
\[ \mathrm{sp}_x = \max\left(0,\ 1 - \frac{|x - x_0|}{L_x}\right), \]
\[ \mathrm{sp}_y = \max\left(0,\ 1 - \frac{|y - y_0|}{L_y}\right), \]
\[ \mathrm{sp}_t = \max\left(0,\ 1 - \frac{|t - t_0|}{L_t}\right), \]

where the above specificity measures are monotonically decreasing functions (linear functions in the case shown above). There are some cutoff ranges ($L_x$, $L_y$, …), which help impose a certain level of detail used in the construction of the information granule. The information granule $A$ in the space of the spatiotemporal variable is constructed as before by maximizing the product of the coverage of the data and the specificity, $\mathrm{cov}(A)\,\mathrm{sp}(A)$; however, in this situation, one also considers the specificity associated with the two other facets of the problem (spatial and temporal). In essence, $A$ is produced by maximizing the product $\mathrm{cov}(A)\,\mathrm{sp}(A)\,\mathrm{sp}_x\,\mathrm{sp}_y\,\mathrm{sp}_t$.
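The product being maximized can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names, the made-up temperature readings, and the reading of the extents $|z - z_0|$, $|x - x_0|$, $|y - y_0|$, $|t - t_0|$ as half-widths of a candidate probe are not taken from the chapter, and the coverage threshold $n\rho$ is represented by the same half-width in $z$.

```python
import numpy as np


def linear_specificity(extent, cutoff):
    """Linearly decreasing specificity, max(0, 1 - extent / cutoff)."""
    return max(0.0, 1.0 - extent / cutoff)


def probe_quality(z, z0, rz, rx, ry, rt, Lz, Lx, Ly, Lt):
    """cov(A) * sp(A) * sp_x * sp_y * sp_t for one candidate probe.

    z              : values z_k of the samples considered for the probe
    z0             : position of the granule in the space of values
    rz, rx, ry, rt : assumed extents |z - z0|, |x - x0|, |y - y0|, |t - t0|
    Lz, Lx, Ly, Lt : cutoff ranges of the linear specificity measures
    """
    z = np.asarray(z, dtype=float)
    # Coverage: number of samples whose value lies within rz of z0.
    # Assumption: rz stands in for the threshold n*rho used in the text.
    coverage = int(np.sum(np.abs(z - z0) < rz))
    return (coverage
            * linear_specificity(rz, Lz)
            * linear_specificity(rx, Lx)
            * linear_specificity(ry, Ly)
            * linear_specificity(rt, Lt))


def best_value_extent(z, z0, candidates, rx, ry, rt, Lz, Lx, Ly, Lt):
    """Pick the candidate extent in z that maximizes the product above."""
    return max(candidates,
               key=lambda rz: probe_quality(z, z0, rz, rx, ry, rt,
                                            Lz, Lx, Ly, Lt))


if __name__ == "__main__":
    temps = [20.0, 20.5, 21.0, 25.0]  # hypothetical temperature readings
    print(best_value_extent(temps, 20.5, [0.25, 1.0, 5.0],
                            rx=2.0, ry=2.0, rt=1.0,
                            Lz=10.0, Lx=10.0, Ly=10.0, Lt=4.0))
```

A very narrow extent covers few samples, while a very wide one loses specificity; the scan over candidates shows how the product settles on an intermediate extent.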

3.6.1. Granular Models

The concept

The paradigm shift implied by the engagement of information granules becomes manifested in several tangible ways including (i) a stronger dependence on data when building structure-free, user-oriented, and versatile models spanned over selected


representatives of experimental data, (ii) emergence of models at varying levels of abstraction (generality) being delivered by the specificity/generality of information granules, and (iii) building a collection of individual local models and supporting their efficient aggregation. Here, several main conceptually and algorithmically far-reaching avenues are emphasized. Notably, some of them have been studied to some extent in the past, and several open up new directions worth investigating and pursuing. In what follows, we elaborate on them in more detail, pointing out the relationships among them [52].

data → numeric models. This is a traditionally explored path that has been present in system modeling for decades. The original numeric data are used to build the model. There are a number of models, both linear and nonlinear, exploiting various design technologies, estimation techniques, and learning mechanisms associated with evaluation criteria where accuracy and interpretability are commonly exploited, with the Occam's razor principle assuming a central role. The precision of the model is an advantage; however, the realization of the model is impacted by the dimensionality of the data (making the realization of some models not feasible); questions of memorization and a lack of generalization abilities are also central to the design practices.

data → numeric prototypes. This path is associated with the concise representation of data by means of a small number of representatives (prototypes). The tasks falling within this scope are preliminary to data analytics problems. Various clustering algorithms constitute generic development vehicles with which the prototypes are built as a direct product of the grouping method.

data → numeric prototypes → symbolic prototypes. This alternative branches off to symbolic prototypes where, on purpose, we ignore the numeric details of the prototypes with the intent to deliver a qualitative view of the information granules. Along this line, concepts such as the symbolic (qualitative) stability and qualitative resemblance of structure in data are established.

data → numeric prototypes → granular prototypes. This path augments the previous one by bringing the next phase in which the numeric prototypes are enriched by their granular counterparts. The granular prototypes are built in such a way that they deliver a comprehensive description of the data. The principle of justifiable granularity helps quantify the quality of the granules as well as deliver a global view of the granular characterization of the data.

data → numeric prototypes → symbolic prototypes → qualitative modeling. The alternative envisioned here builds upon the one where symbolic prototypes are formed and subsequently used in the formation of qualitative models, viz. the models capturing qualitative dependencies among input and output variables. This coincides


with the well-known subarea of AI known as qualitative modeling; see [47], with a number of applications reported in [48, 49, 54, 55].

data → numeric prototypes → granular prototypes → granular models. This path constitutes a direct extension of the previous one when granular prototypes are sought as a collection of high-level abstract data based on which a model is being constructed. By virtue of the granular data, we refer to such models as granular models.

Construction of granular models

By granular model, we mean a model whose result is inherently an information granule as opposed to numeric results produced by a numeric model. In terms of this naming, neural networks are (nonlinear) numeric models. In contrast, granular neural networks produce outputs that are information granules (say, intervals). There are two fundamental ways of constructing granular models:

(i) Stepwise development. One starts with a numeric model developed with the use of the existing methodology and algorithms and then elevates the numeric parameters of the model to their granular counterparts following the way outlined above. This design process dwells upon the existing models and, in this way, one takes full advantage of the existing modeling practices. By the same token, one can envision that the granular model delivers a substantial augmentation of the existing models. In this sense, we can talk about granular neural networks and granular fuzzy rule-based models. In essence, the design is concerned with the transformation $a \rightarrow A = G(a)$ applied to the individual parameters, where $G$ is a certain formalism of information granulation. It is worth noticing that in the overall process, there are two performance indexes optimized: in the numeric model, one usually considers the root mean squared error (RMSE), while in the granular augmentation of the model, one invokes another performance index that takes into consideration the product of the coverage and specificity.

(ii) A single-step design. One proceeds with the development of the granular model from scratch by designing granular parameters of the model. This process uses only a single performance index that is of interest to evaluate the quality of the granular result.

One starts with a numeric (type-0) model $M(x; a)$ developed on the basis of input–output data $D = \{(x_k, \mathrm{target}_k)\}$. Then one elevates it to type-1 by optimizing a level of information granularity allocated to the numeric parameters $a$, thus making them granular. The way of transforming (elevating) a numeric entity $a$ to an information granule $A$ is expressed formally as follows: $A = G(a)$. Here, $G$ stands for a formal setting


of information granules (say, intervals, fuzzy sets, etc.) and ε denotes a level of information granularity. One can envision two among a number of ways of elevating a certain numeric parameter $a$ into its granular counterpart. If $A$ is an interval, its bounds are determined as

\[ A = [\min(a(1-\varepsilon),\ a(1+\varepsilon)),\ \max(a(1-\varepsilon),\ a(1+\varepsilon))], \quad \varepsilon \in [0, 1]. \]

Another option comes in the form

\[ A = [\min(a/(1+\varepsilon),\ a(1+\varepsilon)),\ \max(a/(1+\varepsilon),\ a(1+\varepsilon))], \quad \varepsilon \ge 0. \]

If $A$ is realized as a fuzzy set, one can regard the bounds of its support as determined in the ways outlined above. Obviously, the higher the value of ε, the broader the result and the higher the likelihood of satisfying the coverage requirement. Higher values of ε yield lower specificity of the results. Following the principle of justifiable granularity, we maximize the product of coverage and specificity by choosing a value of ε.

The performance of the granular model can be studied by analyzing the values of coverage and specificity for various values of the level of information granularity ε. Some plots of such relationships are presented in Figure 3.9. In general, as noted earlier, increasing values of ε result in higher coverage and lower values of specificity. By analyzing the pace of changes in the coverage versus the changes in the specificity, one can select a preferred value of ε as the one beyond which the coverage does not increase in a substantial way yet the specificity deteriorates significantly. One can refer to Figure 3.9, where both curves (a) and (b) help identify suitable values of ε. One can develop a global descriptor of the quality of the granular model by computing the area under the curve; the larger the area is, the better the overall quality

Figure 3.9: Characteristics of the granular models presented in the coverage-specificity coordinates. [Plot not reproduced: curves (a) and (b), with coverage and specificity axes each scaled to 1.]


of the model (quantified over all levels of information granularity) is. For instance, in Figure 3.9, the granular model (a) exhibits better performance than (b).

In the construction above, the level of information granularity allocated to all parameters of the model is the same. Different levels of information granularity can be assigned to individual parameters; an allocation of these levels could be optimized in such a way that the values of the performance index $V$ become maximized, whereas a balance of information granularity is retained. In a formal way, we consider the following optimization task:

\[ \max_{\boldsymbol{\varepsilon}} V \]

subject to the constraints

\[ \sum_{i=1}^{P} \varepsilon_i = P\varepsilon, \quad \varepsilon_i \in [0, 1] \qquad \text{or} \qquad \sum_{i=1}^{P} \varepsilon_i = P\varepsilon, \quad \varepsilon_i \ge 0, \]

where the vector of levels of information granularity is expressed as $\boldsymbol{\varepsilon} = [\varepsilon_1\ \varepsilon_2\ \ldots\ \varepsilon_P]$ and $P$ stands for the number of parameters of the model. By virtue of the nature of this optimization problem, the use of evolutionary methods could be a viable option.
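The stepwise development described in this section can be sketched in a few lines for the case of a uniform level of information granularity ε. This is a minimal illustration and not the authors' implementation: the linear model, the helper names, the toy data, and the choice of specificity as one minus the interval width divided by an assumed target range are all assumptions made for the example. The interval bounds follow the first elevation formula given in the text.

```python
import numpy as np


def elevate(a, eps):
    """Interval [min(a(1-eps), a(1+eps)), max(a(1-eps), a(1+eps))]."""
    lo, hi = a * (1.0 - eps), a * (1.0 + eps)
    return min(lo, hi), max(lo, hi)


def granular_linear_output(x, a, eps):
    """Interval output of the (assumed) linear model y = sum_i a_i * x_i."""
    lo = hi = 0.0
    for xi, ai in zip(x, a):
        a_lo, a_hi = elevate(ai, eps)
        products = (a_lo * xi, a_hi * xi)
        lo += min(products)
        hi += max(products)
    return lo, hi


def coverage_specificity(X, targets, a, eps, target_range):
    """Average coverage and (assumed linear) specificity over the data."""
    covered, spec = 0, 0.0
    for x, t in zip(X, targets):
        lo, hi = granular_linear_output(x, a, eps)
        covered += lo <= t <= hi
        spec += max(0.0, 1.0 - (hi - lo) / target_range)
    n = len(targets)
    return covered / n, spec / n


def area_under_curve(cov, spec):
    """Trapezoidal area under the specificity-versus-coverage curve."""
    order = np.argsort(cov)
    c = np.asarray(cov, dtype=float)[order]
    s = np.asarray(spec, dtype=float)[order]
    return float(np.sum(0.5 * (s[1:] + s[:-1]) * (c[1:] - c[:-1])))


if __name__ == "__main__":
    X = [[1.0], [2.0], [3.0]]      # hypothetical inputs
    targets = [2.1, 3.8, 6.0]      # hypothetical targets
    a = [2.0]                      # parameter of the numeric model
    levels = np.linspace(0.0, 1.0, 11)
    pairs = [coverage_specificity(X, targets, a, e, 4.0) for e in levels]
    best = max(zip(levels, pairs), key=lambda p: p[1][0] * p[1][1])
    print("eps maximizing coverage * specificity:", best[0])
    print("AUC:", area_under_curve(*zip(*pairs)))
```

The `min`/`max` in `elevate` matter for negative parameters, where `a(1-eps)` is the larger of the two bounds. Allocating a separate `eps` to each parameter under the sum constraint would replace the one-dimensional scan with a constrained search, for which the text suggests evolutionary methods.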

3.7. An Example of Granular Computation

We illustrate here an application of the granular computation, cast in the formalism of set theory, to a practical problem of analyzing traffic queues. A three-way intersection is represented in Figure 3.10. The three lane-occupancy detectors (inductive loops), labeled here as “east,” “west,” and “south,” provide counts of vehicles passing over them. The counts are then integrated to yield a measure of traffic queues on the corresponding approaches to the junction. A representative sample of the resulting three-dimensional time series of traffic queues is illustrated in Figure 3.11.

Figure 3.10: A three-way intersection with measured traffic queues.


Figure 3.11: A subset of 100 readings from the time series of traffic queues data.

Figure 3.12: FCM prototypes as a subset of the original measurements of traffic queues.

It is quite clear that, on its own, data depicted in Figure 3.11 reflect primarily the signaling stages of the junction. This view is reinforced if we plot the traffic queues on a two-dimensional plane and apply some clustering technique (such as FCM) to identify prototypes that are the best (in terms of the given optimality criterion) representation of data. The prototypes, denoted as small circles in Figure 3.12, indicate that the typical operation of the junction involves simultaneously increasing and simultaneously decreasing queues on the “east” and “west” junctions. This of course corresponds to “red” and “green” signaling stages. It is worth emphasizing


Figure 3.13: Granular FCM prototypes representing a class that is semantically distinct from the original point data.

here that the above prototypes can be considered a simple subset of the original numerical data since the nature of the prototypes is entirely consistent with that of the original data. Unfortunately, within this framework, the interpretation of the prototype indicating “zero” queue in both “east” and “west” directions is not very informative. In order to uncover the meaning of this prototype, we resort to a granulated view of data. Figure 3.13 represents traffic queue data that have been granulated based on the maximization of information density measure discussed in [12]. The semantics of the original readings is now changed from point representation of queues into interval (hyperbox) representation of queues. In terms of set theory, we are dealing here with a class of hyperboxes, which is semantically distinct from point data. Applying FCM clustering to granular data results in granular prototypes denoted in Figure 3.13 as rectangles with bold boundaries overlaid on the granulated data. In order to ascertain that the granulation does not distort the essential features of the data, different granulation parameters have been investigated and a representative sample of two granulations is depicted in Figure 3.13. The three FCM prototypes lying in the areas of simultaneous increase and decrease in traffic queues have identical interpretations as the corresponding prototypes in Figure 3.12. However, the central prototype highlights a physical property of traffic that was not captured by the numerical prototype. Figure 3.14 illustrates the richer interpretation of the central prototype. It is clear that the traffic queues on the western approach are increasing, while the traffic queues on the eastern approach are decreasing. This is caused by the “right turning” traffic being blocked by the oncoming traffic from the eastern junction. 
It is worth noting that the prototype is unambiguous about the absence of a symmetrical situation where an increase in traffic queues on the eastern junction would occur simultaneously with the decrease in the queues on the western junction. The fact


Figure 3.14: Granular prototype capturing traffic delays for the right-turning traffic.

that this is a three-way junction with no right turn for the traffic from the eastern junction has been captured purely from the granular interpretation of data. Note that the same cannot be said about the numerical data illustrated in Figure 3.12. As we have argued in this chapter, the essential component of granular computing is the experimental validation of the semantics of the information granules. We have conducted a planned experiment in which we placed counting devices on the entrance to the “south” link. The proportion of the vehicles entering the “south” junction (during the green stage of the “east–west” junction) to the count of vehicles on the stop line on the “west” approach represents a measure of the right turning traffic. The ratio of these numerical counts was 0.1428. Similar measurements derived from two different granulations depicted in Figure 3.13 were 0.1437 and 0.1498. We conclude therefore that the granulated data captured the essential characteristics of the right turning traffic and that, in this particular application, the granulation parameters do not affect the result to a significant degree (which is clearly a desirable property). A more extensive experimentation could involve the verification of the granular measurement of the right-turning traffic for drivers that have different driving styles in terms of acceleration and gap acceptance. Although we do not make any specific claim in this respect, it is possible that the granulation of traffic queue data would need to be parameterized with the dynamic driver behavior data. Such data could be derived by differentiating the traffic queues (measurement of the speed of change of queues) and granulating the resulting six-dimensional data.
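The numeric prototypes in this example are obtained with FCM. For readers who want to experiment with that step, the following is a minimal sketch of the standard Fuzzy C-Means iteration on point data. It is not the code used for the traffic study: the function name, the default fuzzification coefficient m = 2, the stopping rule, and the synthetic two-dimensional points are assumptions made for illustration, and the granular (hyperbox) variant discussed above is not covered.

```python
import numpy as np


def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard Fuzzy C-Means.

    Returns (V, U): prototypes of shape (c, d) and a partition matrix of
    shape (c, N) whose columns sum to 1.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)        # memberships of a point sum to 1

    for _ in range(n_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)            # prototypes
        # Distances between every prototype and every point, shape (c, N).
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)          # memberships
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break

    W = U ** m
    V = (W @ X) / W.sum(axis=1, keepdims=True)  # prototypes for the final U
    return V, U


if __name__ == "__main__":
    # Synthetic stand-in for (east, west) queue readings; not the real data.
    pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
    prototypes, memberships = fcm(pts, c=2)
    print(prototypes)
```

On well-separated groups such as the synthetic points above, the prototypes settle near the group centers; on real queue data the number of clusters `c` would have to be chosen by the analyst.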


References

[1] S. Bains, Intelligence as physical computation, AISB J., 1(3), 225–240 (2003).
[2] S. Bains, Physical computation and embodied artificial intelligence, PhD thesis, The Open University (2005).
[3] A. Bargiela and W. Pedrycz, Granular Computing: An Introduction, Kluwer Academic (2002).
[4] A. Bargiela and W. Pedrycz, From numbers to information granules: a study of unsupervised learning and feature analysis, in H. Bunke, A. Kandel (eds.), Hybrid Methods in Pattern Recognition, World Scientific (2002).
[5] A. Bargiela and W. Pedrycz, Recursive information granulation: aggregation and interpretation issues, IEEE Trans. Syst. Man Cybern. SMC-B, 33(1), 96–112 (2003).
[6] A. Bargiela, Hypercomputational characteristics of granular computing, in 1st Warsaw Int. Seminar on Intelligent Systems - WISIS 2004, Invited Lectures, Warsaw (2004).
[7] A. Bargiela, W. Pedrycz, and M. Tanaka, An inclusion/exclusion fuzzy hyperbox classifier, Int. J. Knowledge-Based Intell. Eng. Syst., 8(2), 91–98 (2004).
[8] A. Bargiela, W. Pedrycz, and K. Hirota, Granular prototyping in fuzzy clustering, IEEE Trans. Fuzzy Syst., 12(5), 697–709 (2004).
[9] A. Bargiela and W. Pedrycz, A model of granular data: a design problem with the Tchebyschev FCM, Soft Comput., 9(3), 155–163 (2005).
[10] A. Bargiela and W. Pedrycz, Granular mappings, IEEE Trans. Syst. Man Cybern. SMC-A, 35(2), 288–301 (2005).
[11] A. Bargiela and W. Pedrycz, The roots of granular computing, Proc. IEEE Granular Computing Conf., Atlanta, 741–744 (2006).
[12] A. Bargiela, I. Kosonen, M. Pursula, and E. Peytchev, Granular analysis of traffic data for turning movements estimation, Int. J. Enterp. Inf. Syst., 2(2), 13–27 (2006).
[13] G. Cantor, Über einen Satz aus der Theorie der stetigen Mannigfaltigkeiten, Göttinger Nachr., 127–135 (1879).
[14] D. Dubois, H. Prade, and R. Yager (eds.), Fuzzy Information Engineering, Wiley, New York (1997).
[15] K. Goedel, The Consistency of the Axiom of Choice and of the Generalized Continuum Hypothesis with the Axioms of Set Theory, Princeton University Press, Princeton, NJ (1940).
[16] E. Husserl, Logische Untersuchungen. Phänomenologie und Theorie der Erkenntnis, in Logical Investigations (1970).
[17] M. Inuiguchi, S. Hirano, and S. Tsumoto (eds.), Rough Set Theory and Granular Computing, Springer, Berlin (2003).
[18] S. Lesniewski, Über Funktionen, deren Felder Gruppen mit Rücksicht auf diese Funktionen sind, Fundamenta Mathematicae XIII, 319–332 (1929).
[19] S. Lesniewski, Grundzüge eines neuen Systems der Grundlagen der Mathematik, Fundamenta Mathematicae XIV, 1–81 (1929).
[20] T. Y. Lin, Granular computing on binary relations, in L. Polkowski and A. Skowron (eds.), Rough Sets in Knowledge Discovery: Methodology and Applications, Physica-Verlag, Heidelberg, 286–318 (1998).
[21] T. Y. Lin, Y. Y. Yao, and L. A. Zadeh (eds.), Data Mining, Rough Sets and Granular Computing, Physica-Verlag (2002).
[22] M. Matsuyama, How to Summarize and Analyze Small Data Sets, Kawaboku Publishers, Tokyo (1937).
[23] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic, Dordrecht (1991).
[24] Z. Pawlak, Granularity of knowledge, indiscernibility and rough sets, Proc. IEEE Conf. Fuzzy Syst., 106–110 (1998).
[25] W. Pedrycz, Fuzzy Control and Fuzzy Systems, Wiley, New York (1989).


[26] W. Pedrycz and F. Gomide, An Introduction to Fuzzy Sets, MIT Press, Cambridge, MA (1998).
[27] W. Pedrycz, M. H. Smith, and A. Bargiela, Granular clustering: a granular signature of data, Proc. 19th Int. (IEEE) Conf. NAFIPS’2000, Atlanta, 69–73 (2000).
[28] W. Pedrycz and A. Bargiela, Granular clustering: a granular signature of data, IEEE Trans. Syst. Man Cybern., 32(2), 212–224 (2002).
[29] B. Russel, New foundations for mathematical logic, Am. Math. Mon., 2 (1937).
[30] H. Siegelmann, Neural Networks and Analogue Computation: Beyond the Turing Limit, Birkhäuser, Boston, MA (1999).
[31] P. Simons, Parts: A Study in Ontology, Oxford Univ. Press (1987).
[32] A. Skowron, Toward intelligent systems: calculi of information granules, Bull. Int. Rough Set Soc., 5, 9–30 (2001).
[33] A. Skowron and J. Stepaniuk, Information granules: towards foundations of granular computing, Int. J. Intell. Syst., 16, 57–85 (2001).
[34] A. Tarski, Foundations of the geometry of solids, in Logic, Semantics, Metamathematics: Papers 1923–38, Hackett, 1984 (1956).
[35] A. Turing, On computable numbers, with an application to the Entscheidungsproblem, Proc. London Math. Soc., 42, 230–265 (1936).
[36] Y. Y. Yao and J. T. Yao, Granular computing as a basis for consistent classification problems, Proceedings PAKDD’02 Workshop on Foundations of Data Mining, 101–106 (2002).
[37] Y. Y. Yao, Granular computing, Proc. 4th Chinese National Conference on Rough Sets and Soft Computing, 31, 1–5 (2004).
[38] Y. Y. Yao, A partition model of granular computing, LNCS Trans. Rough Sets, 1, 232–253 (2004).
[39] Y. Y. Yao, Perspectives on granular computing, Proc. IEEE Conf. on Granular Computing, 1, 85–90 (2005).
[40] L. A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets Syst., 90, 111–127 (1997).
[41] L. A. Zadeh, Fuzzy sets and information granularity, in N. Gupta, R. Ragade, R. Yager (eds.), Advances in Fuzzy Set Theory and Applications, North-Holland (1979).
[42] L. A. Zadeh, Fuzzy sets, Inform. Control, 8, 338–353 (1965).
[43] L. A. Zadeh, From computing with numbers to computing with words: from manipulation of measurements to manipulation of perceptions, Int. J. Appl. Math Comput. Sci., 12(3), 307–324 (2002).
[44] L. A. Zadeh, Granular computing: the concept of generalized constraint-based computation, Private Communication.
[45] E. Zermelo, Untersuchungen ueber die Grundlagen der Mengenlehre, Math. Annalen, 65, 261–281 (1908).
[46] C. Pontow and R. Schubert, A mathematical analysis of parthood, Data Knowl. Eng., 59(1), 107–138 (2006).
[47] K. Forbus, Qualitative process theory, Artif. Intell., 24, 85–168 (1984).
[48] F. Guerrin, Qualitative reasoning about an ecological process: interpretation in hydroecology, Ecol. Model., 59, 165–201 (1991).
[49] W. Haider, J. Hu, J. Slay, B. P. Turnbull, and Y. Xie, Generating realistic intrusion detection system dataset based on fuzzy qualitative modeling, J. Network Comput. Appl., 87, 185–192 (2017).
[50] W. Pedrycz and W. Homenda, Building the fundamentals of granular computing: a principle of justifiable granularity, Appl. Soft Comput., 13, 4209–4218 (2013).
[51] W. Pedrycz, Granular Computing, CRC Press, Boca Raton, FL (2013).
[52] W. Pedrycz, Granular computing for data analytics: a manifesto of human-centric computing, IEEE/CAA J. Autom. Sin., 5, 1025–1034 (2018).


[53] D. Wang, W. Pedrycz, and Z. Li, Granular data aggregation: an adaptive principle of justifiable granularity approach, IEEE Trans. Cybern., 49, 417–426 (2019).
[54] Y. H. Wong, A. B. Rad, and Y. K. Wong, Qualitative modeling and control of dynamic systems, Eng. Appl. Artif. Intell., 10(5), 429–439 (1997).
[55] J. Žabkar, M. Možina, I. Bratko, and J. Demšar, Learning qualitative models from numerical data, Artif. Intell., 175(9–10), 1604–1619 (2011).
[56] Z. Zhongjie and H. Jian, Stabilizing the information granules formed by the principle of justifiable granularity, Inf. Sci., 503, 183–199 (2019).


© 2022 World Scientific Publishing Company
https://doi.org/10.1142/9789811247323_0004

Chapter 4

Evolving Fuzzy and Neuro-Fuzzy Systems: Fundamentals, Stability, Explainability, Useability, and Applications

Edwin Lughofer

Department of Knowledge-Based Mathematical Systems, Johannes Kepler University Linz, Altenbergerstrasse 69, A-4040 Linz, [email protected]

This chapter provides an all-round picture of the development and advances in the fields of evolving fuzzy systems (EFS) and evolving neuro-fuzzy systems (ENFS) made during the last 20 years, since their first appearance around the year 2000. Their basic difference from conventional (neuro-)fuzzy systems is that they can be learned from data on-the-fly during (fast) online processes in an incremental and mostly single-pass manner. Therefore, they represent an emerging topic in the field of soft computing and artificial intelligence for addressing modeling problems in the quickly increasing complexity of online data streaming applications and Big Data challenges, implying a shift from batch off-line model design phases (as conducted since the 1980s) to online (active) model teaching and adaptation. The focus will be on the definition of various model architectures used in the context of EFS and ENFS, on providing an overview of the basic learning concepts and learning steps in E(N)FS with a wide scope of references to E(N)FS approaches published so far (fundamentals), and on discussing advanced aspects towards improved stability, reliability, and useability (usually must-haves to guarantee robustness and user-friendliness) as well as towards an educated explainability and interpretability of the models and their outputs (usually a nice-to-have to find reasons for predictions and to offer insights into the systems’ nature). The chapter concludes with a list of real-world applications where various E(N)FS approaches have been successfully applied with satisfactory accuracy, robustness, and speed.

4.1. Introduction: Motivation

Due to the increasing complexity and permanent growth of data acquisition sites, in today’s industrial systems there is an increasing demand for fast modeling algorithms from online data streams [1]. Such algorithms ensure that models can be quickly adapted to the actual system situation and thus provide reliable outputs


any time during online real-world processes. There, changing operating conditions, environmental influences, and new unexplored system states may trigger quite dynamic behavior, causing previously trained models to become inefficient or even inaccurate [2]. In this sense, conventional static models, which are trained once in an off-line stage and are not able to adapt dynamically to the actual system states, are not adequate for coping with these demands (severe downtrends in accuracy have been observed in previous publications). A list of potential real-world application examples relying on online dynamic aspects, and thus demanding flexible modeling techniques, can be found in Tables 4.1 and 4.2.

Another challenge, which has recently become a very hot topic within the machine learning community and is given specific attention in the European framework program Horizon 2020 and in the upcoming Horizon Europe 2021–2027 program, is the processing and mining of so-called Big Data [3], usually stemming from very large databases (VLDBs) [4]. The use of Big Data takes place in many areas such as meteorology and geotechnical engineering [5], magnetic resonance imaging, credit scoring [6], connectomics [7], complex physics simulations, and biological and environmental research [8]. These data are so big (often requiring several exabytes) that they cannot be handled in a one-shot experience [9], as they can exceed the virtual memory of today’s conventional computers. Thus, standard batch modeling techniques are not applicable.

In order to tackle the aforementioned requirements, the field of evolving intelligent systems (EISs) [10] or, in a wider machine learning sense, the field of learning in non-stationary environments (LNSEs) [2] has enjoyed increasing attention in recent years. This even led to the emergence of a journal in 2010, named Evolving Systems, published by Springer (Heidelberg)¹, with 396 peer-reviewed papers published so far and an impact factor of 2.070 (in the year 2020). Both fields support learning topologies which operate in a single-pass manner and are able to update models and surrogate statistics on-the-fly and on demand. The single-pass nature and incrementality of the updates assure online and, in most cases, even real-time learning and model training capabilities. While EISs focus mainly on adaptive evolving models within the field of soft computing, LNSEs go a step further and also join incremental machine learning and data mining techniques, originally stemming from the area of “incremental heuristic search” [11]. The update in these approaches concerns both parameter adaptation and structural changes, depending on the degree of change required. The structural changes are usually enforced by evolution and pruning components, and are finally responsible for the term evolving in EISs. In this context, evolving should not be confused with evolutionary (as has sometimes happened in the past, unfortunately). Evolutionary approaches are usually applied in

1 http://www.springer.com/physics/complexity/journal/12530


the context of complex optimization problems and learn parameters and structures based on genetic operators, but they do this by using all the data in an iterative optimization procedure rather than integrating new knowledge permanently on-the-fly. Apart from dynamic adaptation requirements in industrial predictive maintenance systems [12] as well as control [13] and production environments, another important aspect of evolving models is that they provide the opportunity for making self-learning computer systems and machines. In fact, evolving models are permanently updating their knowledge and understanding of diverse complex relationships and dependencies in real-world application scenarios by integrating new system behaviors and environmental influences (which are manifested in the captured data). This is typically achieved in a fully autonomous fashion [14], and their learning follows a life-long learning context and thus is never really terminated, but lasts as long as new information arrives. In this context, they are also able to properly address the principles of continual learning and task-based incremental learning [15] in specific data stream environments, where new data chunks characterize new tasks (typically with some relation to older tasks): based on their occurrence in the streams, evolving mechanisms may autonomously evolve new model components and structures and thus expand knowledge of the models on-the-fly to account for the new tasks. Due to these aspects, they can be seen as a valuable contribution within the field of computational intelligence [16] and even in the field of artificial intelligence (AI) [17]. There are several possibilities for using an adequate model architecture within the context of an evolving system. This strongly depends on the learning problem at hand, in particular, whether it is supervised or unsupervised. 
In the case of the latter, techniques from the field of clustering, usually termed dynamic clustering [18], are a prominent choice. In the case of supervised classification and regression models, the architectures should support decision boundaries or approximation surfaces with an arbitrary nonlinearity degree. Moreover, the choice may depend on past experience with some machine learning and data mining tools; for instance, it is well-known that SVMs are usually among the top 10 performers for many classification tasks [19]; thus, they are a reasonable choice to be used in an online classification setting as well (in the form of incremental SVMs [20, 21]). Soft computing models such as neural networks [22], fuzzy systems [23] or genetic programming [24], and any hybrid concepts of these (e.g., neuro-fuzzy systems [25]) are all known to be universal approximators [26] and thus are able to resolve non-linearities implicitly contained in the systems’ behavior. Neural networks usually suffer from their black box nature, i.e., they do not allow operators and users any insight into the models extracted from the streams. Approaches for explaining model outputs have emerged in recent years; however, these basically operate on an output level rather than on an internal model component level, and it is the latter that opens up further process insights for experts (gaining new knowledge). Genetic programming

is a more promising choice in this direction; however, it may suffer from expanding into unnecessarily complex formulas with many nested functional terms (termed the bloating effect [27]), which are again hard to interpret. Fuzzy systems are specific mathematical models which build upon the concept of fuzzy logic, first introduced in 1965 by Lotfi A. Zadeh [28]. They are a very useful option to offer interpretability on a structural model component level, as they contain rules that are linguistically readable. This mimics human thinking about relationships and structural dependencies present in a system. This will become clearer in the subsequent section when mathematically defining possible architecture variants within the field of fuzzy systems. Furthermore, the reader may refer to Chapter 1 of this book, where basic concepts of fuzzy sets and systems are introduced and described in detail. Fuzzy systems have been widely used in the context of data stream mining and evolving systems during the last two decades (starting around the year 2000). They are termed evolving fuzzy systems (EFS) [29], and in combination with concepts from the field of neural networks they are termed evolving neuro-fuzzy systems (ENFS) [30]. They wed the concept of robust incremental learning of purely data-driven models (in order to obtain the highest possible precision for prediction on new samples) with aspects of interpretability and explainability of learned model components (typically, in the form of linguistically readable rules) [31, 32]. This opens up new research horizons in the context of on-the-fly feedback integration of human knowledge into the models [33]. This, in turn, may enrich evolving models with other more subjective and cognitive viewpoints and interpretation aspects than those contained in pure objective data, turning them into a hybrid AI concept [34].
The upper plot in Figure 4.1 shows the development of the number of publications in the field of evolving (neuro-)fuzzy systems over a time frame of nearly 30 years, with notable growth starting at around 2000 and a further significant increase during the “golden early emerging years” (2006–2014), followed by saturation during the last 5–6 years. By the end of 2020, there were 1,652 publications in total. Regarding the citation trends shown in the lower plot, an explosion can be seen from 2012 to 2019, where the number of citations approximately tripled, from 1,000 to 3,000 per year. This underlines the growing importance of evolving (neuro-)fuzzy systems during recent decades in research and industry, which goes hand-in-hand with the importance of evolving systems in general, as described above. In sum, a remarkable 24,547 references have been made to papers in the field of evolving (neuro-)fuzzy systems.

4.2. Architectures for EFS

The first five subsections are dedicated to architectures for regression problems, for which evolving fuzzy systems (EFS) have been used primarily, including the

Figure 4.1: Upper plot: the development of publications in the field of evolving (neuro-)fuzzy systems over the last 30 years; lower plot: the development in terms of citations, 24,547 in total.

extensions to neuro-fuzzy systems. Then, various variants of fuzzy classification model structures are discussed, having been introduced for representing decision boundaries in various forms in evolving fuzzy classifiers (EFC), ranging from the classical single-model architecture to multi-model variants and deep rule-based architectures.

4.2.1. Mamdani

Mamdani fuzzy systems [35] are the most common choice for coding expert knowledge/experience into a rule-based IF-THEN form; examples in several application scenarios can be found in [36, 37] or [38, 39]. In general, assuming $p$ input variables (features), the definition of the $i$th rule in a single-output Mamdani fuzzy system is as follows:

$$\text{Rule}_i:\ \text{IF } (x_1 \text{ IS } \mu_{i1}) \text{ AND } \ldots \text{ AND } (x_p \text{ IS } \mu_{ip}) \text{ THEN } l_i(\vec{x}) \text{ IS } \Phi_i$$

with $\Phi_i$ the consequent fuzzy set in the fuzzy partition of the output variable used in the consequent $l_i(\vec{x})$ of the $i$th rule, and $\mu_{i1}, \ldots, \mu_{ip}$ are the fuzzy sets appearing

in the rule antecedents. The rule firing degree (also called rule activation level) for a concrete input vector $\vec{x} = (x_1, \ldots, x_p)$ is then defined by

$$\mu_i(\vec{x}) = \operatorname*{T}_{j=1}^{p} \mu_{ij}(x_j), \tag{4.1}$$

with $T$ being a specific conjunction operator, denoted as t-norm [40]; most frequently, minimum or product is used, i.e.,

$$\mu_i(\vec{x}) = \min_{j=1}^{p} \mu_{ij}(x_j) \qquad \text{or} \qquad \mu_i(\vec{x}) = \prod_{j=1}^{p} \mu_{ij}(x_j), \tag{4.2}$$

with $\mu_{ij}(x_j)$ being the membership degree of $x_j$ to the $j$th fuzzy set in the $i$th rule. It may happen that $\Phi_i = \Phi_j$ for some $i \neq j$. Hence, a t-conorm [40] is applied which combines the rule firing levels of those rules having the same consequents to one output set. The most common choice for the t-conorm is the maximum operator. In this case, the consequent fuzzy set is cut at the alpha-level:

$$\alpha_i = \max_{j_i = 1, \ldots, C_i} \left( \mu_{j_i}(\vec{x}) \right) \tag{4.3}$$

with $C_i$ being the number of rules whose consequent fuzzy set is the same as that for Rule $i$, and $j_i$ the indices of these rules. This is done for the whole fuzzy rule base and the various α-cut output sets are joined to one fuzzy area employing the supremum operator. An example of such a fuzzy output area for a fuzzy output partition containing four fuzzy sets A1, A2, A3, and A4 is shown in Figure 4.2. In order to obtain a crisp output value, a defuzzification method is applied, the most commonly used being the mean of maximum (MOM) over the whole area, the center of gravity (COG), or the bisector [41], which is the vertical line that divides the whole area into two sub-regions of equal area. For concrete formulas of the defuzzification operators, please refer to [42, 43]. MOM and COG are illustrated in Figure 4.2. Due to the defuzzification process, it is quite intuitive that the inference in Mamdani fuzzy systems loses some accuracy, as an output fuzzy number is reduced to a crisp number. Therefore, they have hardly been applied within the context of online modeling and data stream mining, where the main purpose is to obtain an accurate evolving fuzzy model in the context of precise evolving fuzzy modeling. Exceptions can be found in [44], termed the SOFMLS approach, and in [45], which uses a switching architecture with Takagi–Sugeno type consequents (see below).
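The Mamdani inference chain of Eqs. (4.1)–(4.3), followed by defuzzification, can be illustrated with a small numerical sketch. It is not taken from any of the cited approaches: the triangular output partition, the discretized output grid, the example membership degrees, and all function names (`tri`, `mamdani_output`, `cog`, `mom`) are assumptions made purely for illustration, and COG and MOM are approximated on the grid.

```python
import numpy as np

def tri(y, a, b, c):
    """Triangular membership function with feet a, c and peak b (assumed shape)."""
    return np.maximum(0.0, np.minimum((y - a) / (b - a), (c - y) / (c - b)))

def mamdani_output(firing, consequent_idx, output_sets, y_grid):
    """Cut each output set at its alpha-level and join the cut sets.

    firing         : rule firing degrees mu_i(x), Eq. (4.1)
    consequent_idx : index of the output fuzzy set used by each rule
    output_sets    : list of (a, b, c) triangles partitioning the output
    """
    area = np.zeros_like(y_grid)
    for k, (a, b, c) in enumerate(output_sets):
        levels = [mu for mu, idx in zip(firing, consequent_idx) if idx == k]
        alpha = max(levels, default=0.0)           # Eq. (4.3), maximum t-conorm
        cut = np.minimum(tri(y_grid, a, b, c), alpha)
        area = np.maximum(area, cut)               # supremum over all cut sets
    return area

def cog(y_grid, area):
    """Center of gravity, approximated on the grid."""
    return float(np.sum(y_grid * area) / np.sum(area))

def mom(y_grid, area):
    """Mean of maximum, approximated on the grid."""
    return float(np.mean(y_grid[np.isclose(area, area.max())]))

# Hypothetical membership degrees mu_ij(x_j): three rules, two inputs.
memberships = np.array([[0.8, 0.6], [0.3, 0.9], [0.5, 0.2]])
firing_min = memberships.min(axis=1)     # minimum t-norm, Eq. (4.2)
firing_prod = memberships.prod(axis=1)   # product t-norm, Eq. (4.2)

y_grid = np.linspace(0.0, 10.0, 1001)
output_sets = [(0.0, 2.5, 5.0), (2.5, 5.0, 7.5), (5.0, 7.5, 10.0)]
# Rules 2 and 3 share the third output set, so their firing levels are merged.
area = mamdani_output(firing_min, [0, 2, 2], output_sets, y_grid)
y_cog, y_mom = cog(y_grid, area), mom(y_grid, area)
```

In this example MOM only looks at the plateau of the highest cut set, whereas COG is pulled toward the second active set, which shows how the two defuzzification operators can yield clearly different crisp outputs for the same fuzzy area.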

4.2.2. Takagi–Sugeno

Takagi–Sugeno (TS) fuzzy systems [46] are the most common architectural choice in evolving fuzzy systems approaches [30]. There are several reasons for this. First of all, they are widely used in many fields of real-world applications and systems

Figure 4.2: Mean of maximum (MOM) and center of gravity (COG) defuzzification for a rule consequent partition (fuzzy partition in output variable Y ) in a Mamdani fuzzy system; the shaded area indicates the joint consequents (fuzzy sets) in the active rules (applying supremum operator); the cutoff points (alpha cuts) are according to maximal membership degrees obtained from the rule antecedent parts of the active rules.

engineering [23], ranging from process control [13, 47, 48] and system identification [49–51], through condition monitoring [52, 53] and chemometric calibration tasks [54, 55], to predictive and preventive maintenance [12, 56]. Thus, their robustness and applicability in the standard batch modeling case has been practically verified for several decades. Second, they are known to be universal approximators [57], i.e., they are able to model any implicitly contained non-linearity with a sufficient degree of accuracy, while their interpretability remains intact or may even offer advantages: while the antecedent parts remain linguistic, the consequent parts can be interpreted either in a more physical sense (see [58]) or as local functional tendencies [32]; see also Section 4.5.2. Finally, parts of their architecture (the consequents) can be updated exactly by recursive procedures, as will be described in Section 4.3.1, achieving truly optimal incremental solutions.

4.2.2.1. Takagi–Sugeno standard

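A single rule in a (single output) standard TS fuzzy system is of the form

$$\text{Rule}_i:\ \text{IF } (x_1 \text{ IS } \mu_{i1}) \text{ AND } \ldots \text{ AND } (x_p \text{ IS } \mu_{ip}) \tag{4.4}$$

$$\text{THEN } l_i(\vec{x}) = w_{i0} + w_{i1} x_1 + w_{i2} x_2 + \ldots + w_{ip} x_p, \tag{4.5}$$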

where x = (x1 , . . . , x p ) is the p-dimensional input vector and μi j the fuzzy set describing the j -th antecedent of the rule. Typically, these fuzzy sets are associated

with a linguistic label. As in the case of Mamdani fuzzy systems, the AND connective is modeled in terms of a t-norm, i.e., a generalized logical conjunction [40]. The output of a TS system consisting of $C$ rules is realized by a linear combination of the outputs produced by the individual rules (through the $l_i$'s) (→ fuzzy inference), where the contribution of each rule is indicated by its normalized degree of activation $\Psi_i$, and thus

$$\hat{f}(\vec{x}) = \hat{y} = \sum_{i=1}^{C} \Psi_i(\vec{x}) \cdot l_i(\vec{x}) \quad \text{with} \quad \Psi_i(\vec{x}) = \frac{\mu_i(\vec{x})}{\sum_{j=1}^{C} \mu_j(\vec{x})}, \tag{4.6}$$

with $\mu_i(\vec{x})$ as in Eq. (4.1).

The most convenient choice for fuzzy sets in EFS and fuzzy system design in general are Gaussian functions, which lead to so-called fuzzy basis function networks [59]. Multivariate kernels based on the axis-parallel normal distribution are achieved for representing the rules' antecedent parts:

$$\mu_i(\vec{x}) = \exp\left( -\frac{1}{2} \sum_{j=1}^{p} \frac{(x_j - c_{ij})^2}{\sigma_{ij}^2} \right). \tag{4.7}$$

In this sense, the linear hyper-planes $l_i$ are connected with multivariate Gaussians to form an overall smooth function. Then, the output form in Eq. (4.6) has some synergies with Gaussian mixture models (GMMs) [60, 61], often used for clustering and pattern recognition tasks [62, 63]. The difference is that the $l_i$'s are hyper-planes instead of singleton weights and do not reflect the degree of density of the corresponding rules (as in the case of GMMs through mixing proportions), but the linear trend of the approximation/regression surface in the corresponding local parts.

4.2.2.2. Takagi–Sugeno generalized

Recently, the generalized form of Takagi–Sugeno fuzzy systems has been introduced to the evolving fuzzy systems community, with its origins in [64] and [65], and later explored and further developed in [66] and [67]. The basic principle is that it employs multidimensional normal (Gaussian) distributions in arbitrary positions for representing single rules. Thus, it overcomes the deficiency of not being able to model local correlations between input and output variables appropriately, as is the case for standard rules using the classical t-norm operators [40]; these may induce inexact approximations of real local trends [68]. An example visualizing this problem is provided in Figure 4.3: in the left image, axis-parallel rules (represented by ellipsoids) are used for modeling the partial tendencies of the regression curves which are not following the input axis direction, but are rotated to some degree; obviously, the volumes of the rules are artificially blown up and the rules do not represent the real characteristics of

Figure 4.3: Left: conventional axis-parallel rules (represented by ellipsoids) achieve an inaccurate representation of the local trends (correlations) of a nonlinear approximation problem (defined by noisy data samples); right: generalized rules (by rotation) achieve a much more accurate and compact representation (5 rules instead of 6 actually needed).

the local tendencies well, leading to information loss. In the right image, non-axis-parallel rules using general multivariate Gaussians are applied for a more accurate representation (rotated ellipsoids). The generalized fuzzy rules have been defined in [64] as

$$\text{Rule}_i:\ \text{IF } \vec{x} \text{ IS (about) } \mu_i \text{ THEN } l_i(\vec{x}) = w_{i0} + w_{i1} x_1 + w_{i2} x_2 + \ldots + w_{ip} x_p, \tag{4.8}$$

where $\mu_i$ denotes a high-dimensional kernel function, which in accordance with the basis function network spirit is given by the generalized multivariate Gaussian distribution

$$\mu_i(\vec{x}) = \exp\left( -\frac{1}{2} (\vec{x} - \vec{c}_i)^T \Sigma_i^{-1} (\vec{x} - \vec{c}_i) \right) \tag{4.9}$$

with $\vec{c}_i$ being the center and $\Sigma_i^{-1}$ the inverse covariance matrix of the $i$th rule, allowing any possible rotation and spread of the rule. In order to maintain (input/output) interpretability of the evolved TS fuzzy models for users/operators (see also Section 4.5.2), the authors in [67] foresee a projection concept to form fuzzy sets and classical rule antecedents. Interpretability relies on the angle between the principal component directions and the feature axes, which has the effect that long spread rules are more effectively projected than when using the inner contour spreads (through axis-parallel cutting points). The spread $\sigma_i$ of the projected fuzzy set is set according to

$$\sigma_i = \max_{j=1,\ldots,p} \left( \frac{r}{\sqrt{\lambda_j}} \cos(\phi(e_i, a_j)) \right), \tag{4.10}$$

with $r$ being the range of influence of one rule, usually set to 1, representing the (inner) characteristic contour/spread of the rule (as mentioned above). The center of the fuzzy set in the $i$th dimension is set to be equal to the $i$th coordinate of the rule center. $\phi(e_i, a_j)$ denotes the angle between the principal component direction represented by the eigenvector $a_j$ and the $i$th axis vector $e_i$; $\lambda_j$ denotes the eigenvalue of the $j$th principal component.

4.2.2.3. Takagi–Sugeno extended

An extended version of Takagi–Sugeno fuzzy systems in the context of evolving systems has been applied in [69]. There, instead of a hyper-plane $l_i = w_{i0} + w_{i1} x_1 + w_{i2} x_2 + \ldots + w_{ip} x_p$, the consequent function for the $i$th rule is defined as the LS-SVM model according to [70]

$$l_i(\vec{x}) = \sum_{k=1}^{N} \alpha_{ik} K(\vec{x}, \vec{x}_k) + \beta_i, \tag{4.11}$$

with $K(\cdot, \cdot)$ being a kernel function fulfilling the Mercer criterion [71] for characterizing a symmetric positive semi-definite kernel [72], $N$ the number of training samples, and $\alpha$ and $\beta$ the consequent parameters (support vectors and intercept) to learn. The advantage of these consequents is that they are supposed to provide more accuracy, as a support vector regression modeling [70] is applied to each local region. Hence, non-linearities within local regions may be better resolved. On the other hand, the consequents are more difficult to interpret. SVM modeling for consequents has also been conducted in [73] in an incremental manner, but there by using classical TS consequent functions (yielding a linear SVM modeling without kernels). Another extended version of Takagi–Sugeno fuzzy systems was proposed in [74], where the rule consequents were defined through a nonlinear mapping realized by wavelet-type functions (typically modeled through the use of the Mexican hat function [75]). Therefore, each rule consequent represents a specific area of the output space in the wavelet domain, which makes the fuzzy system compactly applicable to time series data with (many) lagged inputs (different resolutions). This has been achieved in combination with an interval output vector to form a variant of a type-2 (TS) fuzzy system; see the subsequent section.
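To make the Takagi–Sugeno variants of this section more concrete, the following sketch evaluates the axis-parallel rule activation of Eq. (4.7), the generalized activation of Eq. (4.9), and the weighted-average inference of Eq. (4.6) with hyper-plane consequents as in Eq. (4.5). All parameter values and function names are hypothetical and chosen only for illustration; no learning or rule evolution is shown.

```python
import numpy as np

def axis_parallel_activation(x, centers, sigmas):
    """Eq. (4.7): one Gaussian fuzzy set per input, combined by the product."""
    return np.exp(-0.5 * np.sum((x - centers) ** 2 / sigmas ** 2, axis=1))

def generalized_activation(x, centers, inv_covs):
    """Eq. (4.9): multivariate Gaussian with a full inverse covariance matrix."""
    d = x - centers                                    # shape (C, p)
    return np.exp(-0.5 * np.einsum("ci,cij,cj->c", d, inv_covs, d))

def ts_output(x, mu, weights):
    """Eq. (4.6): normalized activations times hyper-plane consequents."""
    psi = mu / np.sum(mu)                              # normalized activation
    consequents = weights[:, 0] + weights[:, 1:] @ x   # l_i(x), Eq. (4.5)
    return float(psi @ consequents)

# Two hypothetical rules in a two-dimensional input space.
centers = np.array([[0.0, 0.0], [2.0, 2.0]])
sigmas = np.array([[1.0, 1.0], [1.0, 0.5]])
weights = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 1.0]])  # rows: (w_i0, w_i1, w_i2)
x = np.array([1.0, 1.0])

mu_std = axis_parallel_activation(x, centers, sigmas)
# A diagonal inverse covariance matrix reproduces the axis-parallel case.
inv_covs = np.array([np.diag(1.0 / s ** 2) for s in sigmas])
mu_gen = generalized_activation(x, centers, inv_covs)
y_hat = ts_output(x, mu_std, weights)
```

With diagonal inverse covariance matrices both activation functions coincide; off-diagonal entries are what allow the rotated ellipsoids on the right of Figure 4.3.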

4.2.3. Type-2 Fuzzy Systems

Type-2 fuzzy systems were invented by Lotfi Zadeh in 1975 [76] for the purpose of modeling the uncertainty in the membership functions of the usual (type-1) fuzzy sets. The distinguishing feature of a type-2 fuzzy set $\tilde{\mu}_{ij}$ versus its type-1 counterpart

$\mu_{ij}$ is that the membership function values of $\tilde{\mu}_{ij}$ are blurred; i.e., they are no longer a single number in [0, 1], but instead a continuous range of values between 0 and 1, say $[a, b] \subseteq [0, 1]$. One can either assign the same weighting or assign a variable weighting to membership function values in $[a, b]$. When the former is done, the resulting type-2 fuzzy set is called an interval type-2 fuzzy set. When the latter is done, the resulting type-2 fuzzy set is called a general type-2 fuzzy set [77]. The $i$th rule of an interval-based type-2 fuzzy system is defined in the following way [78, 79]:

$$\text{Rule}_i:\ \text{IF } x_1 \text{ IS } \tilde{\mu}_{i1} \text{ AND } \ldots \text{ AND } x_p \text{ IS } \tilde{\mu}_{ip} \text{ THEN } l_i(\vec{x}) = \tilde{f}_i$$

with $\tilde{f}_i$ being a general type-2 uncertainty function. In the case of a Takagi–Sugeno-based consequent scheme (e.g., used in [80], the first approach of an evolving type-2 fuzzy system), the consequent function becomes

$$l_i(\vec{x}) = \tilde{w}_{i0} + \tilde{w}_{i1} x_1 + \tilde{w}_{i2} x_2 + \ldots + \tilde{w}_{ip} x_p \tag{4.12}$$

with $\tilde{w}_{ij}$ being an interval set (instead of a crisp continuous value), i.e.,

$$\tilde{w}_{ij} = [c_{ij} - s_{ij},\ c_{ij} + s_{ij}]. \tag{4.13}$$

In the case of a Mamdani-based consequent scheme (e.g., used in [81], a recent evolving approach), the consequent function becomes $l_i = \tilde{\Phi}_i$, with $\tilde{\Phi}_i$ a type-2 fuzzy set.

An enhanced approach for eliciting the final output is applied, the so-called Karnik–Mendel iterative procedure [82], where a type reduction is performed before the defuzzification process. In this procedure, the consequent values $\bar{l}_i = c_{i0} - s_{i0} + (c_{i1} - s_{i1}) x_1 + (c_{i2} - s_{i2}) x_2 + \ldots + (c_{ip} - s_{ip}) x_p$ and $\underline{l}_i = c_{i0} + s_{i0} + (c_{i1} + s_{i1}) x_1 + (c_{i2} + s_{i2}) x_2 + \ldots + (c_{ip} + s_{ip}) x_p$ are sorted in ascending order, denoted as $\bar{y}_i$ and $\underline{y}_i$ for all $i = 1, \ldots, C$. Accordingly, the membership values $\bar{\Psi}_i(\vec{x})$ and $\underline{\Psi}_i(\vec{x})$ are sorted in ascending order, denoted as $\bar{\psi}_i(\vec{x})$ and $\underline{\psi}_i(\vec{x})$. Then, the outputs $\bar{y}$ and $\underline{y}$ are computed by

$$\bar{y} = \frac{\sum_{i=1}^{L} \bar{\psi}_i(\vec{x})\, \bar{y}_i + \sum_{i=L+1}^{C} \underline{\psi}_i(\vec{x})\, \bar{y}_i}{\sum_{i=1}^{L} \bar{\psi}_i(\vec{x}) + \sum_{i=L+1}^{C} \underline{\psi}_i(\vec{x})}, \qquad \underline{y} = \frac{\sum_{i=1}^{R} \underline{\psi}_i(\vec{x})\, \underline{y}_i + \sum_{i=R+1}^{C} \bar{\psi}_i(\vec{x})\, \underline{y}_i}{\sum_{i=1}^{R} \underline{\psi}_i(\vec{x}) + \sum_{i=R+1}^{C} \bar{\psi}_i(\vec{x})}, \tag{4.14}$$

with $L$ and $R$ being positive numbers, often $L = \frac{C}{2}$ and $R = \frac{C}{2}$. Taking the average of $\bar{y}$ and $\underline{y}$ yields the final output value $y$.
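A minimal sketch of the type reduction in Eq. (4.14) with interval consequents as in Eqs. (4.12) and (4.13) is given below. It uses fixed switch points $L = R = C/2$ instead of the full iterative Karnik–Mendel search, and it assumes that the lower and upper rule activation values are reordered together with the sorted consequent values and that the barred activation is the upper bound; both are interpretations, not statements from [82]. The variable names mirror the chapter's notation, in which the barred consequent is built from $c_{ij} - s_{ij}$, and all numbers are hypothetical.

```python
import numpy as np

def interval_consequents(x, c, s):
    """Consequent values from w~_ij = [c_ij - s_ij, c_ij + s_ij], Eq. (4.13).

    Follows the formulas in the text, which implicitly assume x >= 0.
    Returns (l_bar, l_under); in the chapter's notation l_bar uses c - s.
    """
    x1 = np.concatenate(([1.0], x))         # prepend 1 for the intercept
    return (c - s) @ x1, (c + s) @ x1

def type_reduced_output(x, c, s, psi_under, psi_bar, L=None, R=None):
    """Eq. (4.14) with fixed switch points L and R (default C // 2)."""
    C = len(psi_bar)
    L = C // 2 if L is None else L
    R = C // 2 if R is None else R
    l_bar, l_under = interval_consequents(x, c, s)
    position = np.arange(C)

    order = np.argsort(l_bar)               # ascending consequent values
    w = np.where(position < L, psi_bar[order], psi_under[order])
    y_bar = np.sum(w * l_bar[order]) / np.sum(w)

    order = np.argsort(l_under)
    w = np.where(position < R, psi_under[order], psi_bar[order])
    y_under = np.sum(w * l_under[order]) / np.sum(w)
    return y_bar, y_under, 0.5 * (y_bar + y_under)

# Four hypothetical rules with one input; rows of c and s are (intercept, slope).
c = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
s = np.full_like(c, 0.1)
psi_under = np.array([0.1, 0.2, 0.3, 0.1])  # lower activation bounds
psi_bar = np.array([0.3, 0.4, 0.5, 0.3])    # upper activation bounds
y_bar, y_under, y = type_reduced_output(np.array([1.0]), c, s, psi_under, psi_bar)
```

When all spreads $s_{ij}$ are zero and lower and upper activations coincide, both end points collapse to the ordinary TS output of Eq. (4.6), which is a convenient sanity check for such an implementation.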

4.2.4. Interval-Based Fuzzy Systems with Mixed Outputs

A specific variant of an evolving fuzzy system architecture was proposed in [83] where the input fuzzy sets represent crisp intervals (thus, a specific realization of the

trapezoidal function and also termed granules [84]) and where the consequents of the fuzzy system contain a combination of interval output and hyper-plane output. The regression coefficients themselves are interval values. Hence, a rule is defined as

$$\text{Rule}_i:\ \text{IF } l_1^i \le x_1 \le L_1^i \text{ AND } \ldots \text{ AND } l_p^i \le x_p \le L_p^i \text{ THEN } l_i(\vec{x}) = \tilde{f}_i \text{ AND } u_1^i \le y \le U_1^i, \tag{4.15}$$

with $l_i(\vec{x})$ as defined in Eq. (4.12) and an additional consequent term (stating that the output $y$ lies in an interval) that allows an advanced output interpretation. In particular, this directly yields an uncertainty value possibly contained in model predictions, without requiring the derivation and design of explicit error bars, etc. Furthermore, for experts, intervals may be easier to understand than hyper-planes when gaining insights into the dependencies/relations the system contains [83]. In a modified form obtained by substituting the input intervals with conventional type-1 fuzzy sets and an additional consequent functional with a reduced argument list, this type of evolving fuzzy system has recently been successfully applied in [85] to properly address missing values in data streams; see also Section 4.6.2.

4.2.5. Neuro-Fuzzy Systems

A neuro-fuzzy system is typically understood as a specific form of layered (neural) models, where, in one or several layers, components from the research field of fuzzy sets and systems are integrated [25], e.g., fuzzy-set based activation functions. Typically, an evolving neuro-fuzzy system comprises the layers as shown in Figure 4.4: a fuzzification layer (Layer 1), a rule activation (= membership degree

Figure 4.4: Typical layered structure of an evolving neuro-fuzzy system (ENFS) with two inputs and two rules/neurons.

calculation) layer (Layer 2), a normalization layer (Layer 3), a functional consequent layer (Layer 4), and a weighted summation (= output) layer (Layer 5). In functional form, by using forward propagation of layers, the output inference of an evolving neuro-fuzzy system containing $C$ rules/neurons is defined as

$$\hat{y}(\vec{x}) = \hat{y} = \sum_{i=1}^{C} \Psi_i(\vec{x}) \cdot f_i(\vec{x}), \tag{4.16}$$

with the normalization layer

$$\Psi_i(\vec{x}) = \frac{\mu_i(\vec{x})}{\sum_{j=1}^{C} \mu_j(\vec{x})}, \tag{4.17}$$

and the rule activation layer in standard (axis-parallel) form

$$\mu_i(\vec{x}) = \operatorname*{T}_{j=1}^{p} \mu_{ij}(x_j), \tag{4.18}$$

with $p$, the number of input variables (or regressors as transformed input variables). $\mu_i \in [0, 1]$ denotes an aggregated rule/neuron membership degree over fuzzy set activation levels $\mu_{ij} \in [0, 1]$, $j = 1, \ldots, p$ (performed in the fuzzification layer), with the aggregation $T$ typically performed by using a t-norm [40], but in recent works also with some extended norms (see, e.g., [86, 87]), basically for the purpose of gaining more flexibility (and thus interpretability) of rule antecedent parts; see also Section 4.5.2. The final realization of evolving neuro-fuzzy systems is achieved through the definition of the functionals $f_1, \ldots, f_C$, also termed rule consequent functions (appearing in the functional consequent layer). In most of the current E(N)FS approaches [30, 88], consequents of the Takagi–Sugeno–Kang type [89] are employed, which are generally defined through polynomials of order $m$ (here for the $i$th rule)

$$f_i(\vec{w}_i, \vec{x}) = w_{i0} + \sum_{m_1=1}^{m} \ \sum_{j_1 \le j_2 \le \ldots \le j_{m_1} \in \{1, \ldots, p\}} w_{i, j_1 \ldots j_{m_1}}\, x_{j_1} \cdots x_{j_{m_1}}, \tag{4.19}$$

where $p$ denotes the number of input variables (comprising the dimensionality of the data set and the learning problem) and $x_1, \ldots, x_p$ the input variables. In the special case of Sugeno fuzzy systems [90], $m$ is set to 0 and thus $f_i(\vec{w}_i, \vec{x}) = w_{i0}$, whereas in the special case of Takagi–Sugeno fuzzy systems [46], $m$ becomes 1 and thus $f_i$ is a hyper-plane defined by $f_i(\vec{w}_i, \vec{x}) = w_{i0} + w_{i1} x_1 + \ldots + w_{ip} x_p$. The latter, in combination with evolving fuzzy models, resulted in the classical eTS models, as originally proposed in [91] and used by many other E(N)FS approaches [30, 88].
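The five layers of Figure 4.4 and the polynomial consequents of Eq. (4.19) can be sketched as a single forward pass. The code below assumes Gaussian fuzzy sets in the fuzzification layer and the product t-norm in Eq. (4.18); the function names, parameter values, and the ordering of the monomials in the weight vectors are illustrative choices, not part of any cited E(N)FS method.

```python
from itertools import combinations_with_replacement

import numpy as np

def polynomial_terms(x, m):
    """Monomials x_{j1}...x_{jm1} with j1 <= ... <= jm1, for m1 = 0..m."""
    terms = [1.0]                                    # intercept term for w_i0
    for m1 in range(1, m + 1):
        for idx in combinations_with_replacement(range(len(x)), m1):
            terms.append(float(np.prod(x[list(idx)])))
    return np.array(terms)

def enfs_forward(x, centers, sigmas, weights, m):
    """Forward pass through the five layers of Figure 4.4."""
    memb = np.exp(-0.5 * (x - centers) ** 2 / sigmas ** 2)  # Layer 1: fuzzification
    mu = memb.prod(axis=1)                # Layer 2: product t-norm, Eq. (4.18)
    psi = mu / mu.sum()                   # Layer 3: normalization, Eq. (4.17)
    f = weights @ polynomial_terms(x, m)  # Layer 4: consequents, Eq. (4.19)
    return float(psi @ f)                 # Layer 5: weighted sum, Eq. (4.16)

centers = np.array([[0.0, 0.0], [2.0, 2.0]])
sigmas = np.ones((2, 2))
x = np.array([1.0, 1.0])

# m = 0: Sugeno-type singleton consequents; m = 1: Takagi-Sugeno hyper-planes.
y_sugeno = enfs_forward(x, centers, sigmas, np.array([[1.0], [3.0]]), m=0)
y_ts = enfs_forward(x, centers, sigmas,
                    np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 1.0]]), m=1)
terms_order_2 = polynomial_terms(np.array([2.0, 3.0]), 2)
```

The number of consequent parameters per rule grows quickly with the order $m$, which is one practical reason why $m = 0$ and $m = 1$ dominate in streaming settings.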

Extensions of the basic layered neuro-fuzzy architecture shown in Figure 4.4 concern the following:
• The addition of one or more hidden layers for achieving a higher degree of model nonlinearity and an advanced interpretation of the knowledge contained in the input features/measurements; e.g., a recurrent layer could be integrated in order to represent connective weights between different time lags of one or more variables [92], to perform local rule feedback [93], or to achieve an auto-associative structure with cross-links between the rule antecedents [94]. Please refer to [25] for several possibilities of additional (advanced) layers.
• The integration of feature weights into the construction of the fuzzy neurons; this yields a feature weighting aspect in order to address the curse of dimensionality in a soft manner, as well as a better readability of the rules, as features with low weights assigned during the stream learning process can be seen as unimportant and thus masked out when showing the fuzzy neurons/rules to experts; see also Section 4.4.2 for details. A recent architecture addressing this issue has been published in [86].
• More complex construction of the fuzzy neurons through advanced norms operating on the activation functions (fuzzy sets). The typical connection strategy of fuzzy set activation levels is realized through the usage of classical t-norms [40], leading to pure AND-connections among rule antecedent parts. Advanced concepts allowing a mixture of ANDs and ORs with a kind of intensity activation level of both have been realized for neuro-fuzzy systems in [86, 87] through the construction of uni- and uninull-neurons based on the usage of uni-norms and related concepts [95]. An evolving neuro-fuzzy system approach with the integration of uninull-neurons has been recently published in [96].
• More complex fuzzy-set based activation functions, added in order to yield a transformation of the data to a compressed localized time and/or frequency space, such as the usage of wavelets as proposed in [97].
• The integration of recurrency in the layers in order to be able to model the dependencies of the various lags among several and within single features; this is especially important for time-series based modeling and forecasting purposes. A first attempt in this direction has been made in [98], where the concept of long short-term memory (LSTM) networks has been employed to span the recurrent layer and where type-2 fuzzy sets are used in the neurons as activation functions.

A new type of evolving neuro-fuzzy architecture, termed neo-fuzzy neuron networks, has been proposed by Silva et al. in [99] and applied in the evolving context. It relies on the idea of using a set of TS fuzzy rules for each input dimension independently and then connecting these with a conventional sum for obtaining the final model output. In this sense, the curse of dimensionality is cleverly addressed

by subspace decomposition into univariate model viewpoints. The domain of each input $i$ is thereby granulated into $m$ complementary membership functions, where $m$ can be autonomously evolved from data. In all of these architectures, a common denominator is the usage of linear output weights (also termed consequent parameters), which are linear parameters $\vec{w}_i$, $i = 1, \ldots, C$, and should be incrementally updated in a single-pass manner to best follow the local output trends over stream samples in the approximation/regression/forecasting problem. Section 4.3.1 provides an overview of variants for achieving this in a robust manner.
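The neo-fuzzy neuron idea described above can be sketched as a sum of independent univariate fuzzy models. The sketch assumes triangular membership functions on a fixed knot grid (so that they are complementary, i.e., sum to one) and zero-order consequents (one singleton weight per membership function); the evolving mechanism for the number of membership functions in [99] is not reproduced, and all names and values are hypothetical.

```python
import numpy as np

def complementary_memberships(x, knots):
    """Triangular functions on a sorted knot grid; they sum to one on
    [knots[0], knots[-1]], so at most two are active for any x."""
    x = float(np.clip(x, knots[0], knots[-1]))
    mu = np.zeros(len(knots))
    k = min(int(np.searchsorted(knots, x, side="right")) - 1, len(knots) - 2)
    t = (x - knots[k]) / (knots[k + 1] - knots[k])
    mu[k], mu[k + 1] = 1.0 - t, t
    return mu

def neo_fuzzy_output(x, knots_per_input, weights_per_input):
    """Sum of independent univariate fuzzy models, one per input dimension."""
    return sum(
        float(complementary_memberships(xi, knots) @ w)
        for xi, knots, w in zip(x, knots_per_input, weights_per_input)
    )

knots = np.array([0.0, 1.0, 2.0])
w1 = np.array([0.0, 1.0, 2.0])     # reproduces the identity on input 1
w2 = np.array([0.0, 10.0, 20.0])   # reproduces 10 * x on input 2
y = neo_fuzzy_output(np.array([0.5, 1.5]), [knots, knots], [w1, w2])
```

Because each input is handled by its own univariate model, the number of parameters grows linearly with the input dimension rather than with the number of multivariate rules.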

4.2.6. Cloud-Based Fuzzy Models

In [100], a completely new form of fuzzy rule-based systems has been proposed that is based on the concept of data clouds. Data clouds are characterized by a set of previous data samples with common properties (closeness in terms of the input–output mapping). This allows for a looser characterization of samples belonging to the same cloud than traditional fuzzy rules are able to offer. All of the architectures described above are mainly characterized through the definition of concrete membership functions, whose realization induces the final shape of the rules: for instance, when using Gaussian membership functions, rules of ellipsoidal shape are induced. Thus, these assume that the local data distributions also (nearly) follow similar shapes, at least to some extent. This means that local distribution assumptions are implicitly made when defining the membership functions or the generalized rule appearance. Data clouds can circumvent this issue as they do not assume any local data distributions in advance. Figure 4.5 demonstrates a comparison between membership functions and clouds for a simple two-dimensional example.

Figure 4.5: Left: clusters representing operation modes defined in conventional ellipsoidal shapes (as in the previous approaches above); right: clouds are just loose representations of sample densities—a new sample Z is then considered to belong to the nearest cloud according to its density [100].

Based on the clouds, the authors in [100] define a fuzzy rule for a single output model as follows:

$$\text{Rule}_i:\ \text{IF } \vec{x} \text{ IS like } \aleph_i \text{ THEN } l_i(\vec{x}) = w_{i0} + w_{i1} x_1 + w_{i2} x_2 + \ldots + w_{ip} x_p \tag{4.20}$$

In the case of a multi-output model, a whole matrix of consequent parameters is multiplied with $\vec{x}$ to form a set of consequent functions $L_i$. The fuzziness in this architecture is preserved in the sense that a particular data sample can belong to all clouds with a different degree (realized by the ‘IS like’ statement). This degree of fulfillment of the rule premise is determined by a local density $\gamma_i$, which can be calculated and updated in a single-pass manner based on the definition of a Cauchy-type function; see [100] for details. The latter just measures the local density, but does not assume any local data distribution. The overall output over all cloud-based rules is achieved by weighted averaging in the same manner as for classical TS fuzzy systems; see Section 4.2.2. A cloud-based fuzzy rule base system can be represented by a layered architecture as visualized in Figure 4.4, where the first two layers are substituted by a single layer, which calculates the local density $\gamma_i$ for a new sample to each data cloud (instead of a rule fulfillment degree based on fuzzy sets). Normalization, consequent, and output layers (Layers 3 to 5) are the same according to the weighted averaging of hyper-plane consequents given in Eq. (4.20), realized through Eq. (4.16) by substituting $\mu_i(\vec{x})$ with $\gamma_i(\vec{x})$ in the normalization Eq. (4.17).

This architecture is also used in a fuzzy classification variant termed TEDAClass, see [101], and embedded in the ALMMo approach to learning multimodel event systems [102].
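As an illustration of how a local density could be maintained in a single pass, the sketch below uses one common Cauchy-type form, $\gamma_i(\vec{x}) = 1 / (1 + \text{mean squared distance from } \vec{x} \text{ to the samples of cloud } i)$. This specific form, the `DataCloud` class, and the rule that assigns samples to clouds are assumptions made for illustration; the exact definition used in [100] may differ. The mean squared distance is obtained from two running statistics (the mean and the mean squared norm of the cloud's samples), so no past samples need to be stored.

```python
import numpy as np

class DataCloud:
    """Keeps only the running mean and mean squared norm of its samples."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.mean_sq_norm = 0.0

    def add(self, x):
        """Single-pass update of the running statistics with sample x."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.mean_sq_norm += (x @ x - self.mean_sq_norm) / self.n

    def density(self, x):
        """Cauchy-type density from the mean squared distance to all samples."""
        msd = (np.sum((x - self.mean) ** 2)
               + self.mean_sq_norm - self.mean @ self.mean)
        return 1.0 / (1.0 + msd)

def cloud_output(x, clouds, weights):
    """Weighted average of hyper-plane consequents, Eq. (4.20), with the
    densities replacing the rule activations in Eq. (4.17)."""
    gamma = np.array([cloud.density(x) for cloud in clouds])
    lam = gamma / gamma.sum()
    return float(lam @ (weights[:, 0] + weights[:, 1:] @ x))

rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 2))
cloud = DataCloud(dim=2)
for z in samples:
    cloud.add(z)

x = np.array([0.5, -0.2])
gamma = cloud.density(x)
y_hat = cloud_output(x, [cloud], np.array([[1.0, 2.0, 0.0]]))
```

The recursive density agrees with the batch value computed from all stored samples, which reflects the fact that no shape assumption about the cloud enters the calculation.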

4.2.7. Fuzzy Classifiers

Fuzzy classifiers have attracted wide interest in various applications for more than two decades, see, e.g., [103–106]. Their particular strength is their ability to model decision boundaries with an arbitrary degree of nonlinearity while maintaining interpretability in the sense of “which rules on the feature set imply which output class labels.” In a winner-takes-all concept, the decision boundary proceeds between rule pairs having different majority class labels. As rules are usually nonlinear contours in the high-dimensional space, the nonlinearity of the decision boundary is induced, with arbitrary complexity due to a possibly arbitrary number of rules. Due to their wide use for conventional off-line classification purposes, they have also been used in the context of data stream classification. This led to the emergence of so-called evolving fuzzy classifier (EFC) approaches

as a particular sub-field of evolving (neuro-)fuzzy systems, see [96, 107–111], which inherit some functionality in terms of incremental, single-pass updates of rules and parameters (see the subsequent section). In the subsequent subsections, the particular fuzzy classifier architectures used, and partially even newly developed, for EFCs are described.

4.2.7.1. Classical and Extended Single-Model

The rule in a classical fuzzy classification model architecture with singleton consequent labels, a widely studied architecture in the fuzzy system community [103, 112, 113], is defined by

$$\text{Rule}_i: \text{ IF } x_1 \text{ IS } \mu_{i1} \text{ AND } \ldots \text{ AND } x_p \text{ IS } \mu_{ip} \text{ THEN } l_i = L_i, \quad (4.21)$$

where $L_i$ is the crisp output class label of the $i$th rule from the set $\{1, \ldots, K\}$, with $K$ the number of classes. This architecture precludes the use of confidence labels in the single classes per rule. In the case of clean classification rules, when each single rule contains/covers training samples from a single class, this architecture provides an adequate resolution of the class distributions. However, in real-world problems, classes usually overlap significantly, and therefore rules containing samples from more than one class are often extracted. Thus, an extended fuzzy classification model that includes the confidence levels $conf_{i1,\ldots,K}$ of the $i$th rule in the single classes has been applied in an evolving, adaptive learning context, see, e.g., the first approaches in [107, 109, 114]:

$$\text{Rule}_i: \text{ IF } x_1 \text{ IS } \mu_{i1} \text{ AND } \ldots \text{ AND } x_p \text{ IS } \mu_{ip} \text{ THEN } l_i = [conf_{i1}, conf_{i2}, \ldots, conf_{iK}] \quad (4.22)$$

Thus, a local region represented by a rule in the form of Eq. (4.22) can better model class overlaps in the corresponding part of the feature space: for instance, three classes overlap with the support of 200, 100, and 50 samples in one single fuzzy rule; then, the confidence in Class #1 would be intuitively 0.57 according to its relative frequency (200/350), in Class #2 it would be 0.29 (100/350), and in Class #3 it would be 0.14 (50/350). A more advanced treatment of class confidence levels will be provided in Section 4.4.4 when describing options for representing reliability in class responses. In a winner-takes-all context (the most common choice in fuzzy classifiers [103]), the final classifier output $L$ will be obtained by

$$L = l_{i^*}, \quad l_{i^*} = \operatorname{argmax}_{1 \le k \le K}\, conf_{i^*k}, \quad i^* = \operatorname{argmax}_{1 \le i \le C}\, \mu_i(\vec{x}) \quad (4.23)$$
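The winner-takes-all inference of Eq. (4.23) on top of the confidence vectors of Eq. (4.22) can be illustrated with a minimal Python sketch. The function names `rule_confidences` and `winner_takes_all`, the second rule's class counts, and the membership values are hypothetical and chosen only for illustration; confidences are taken as plain relative class frequencies, as in the 200/100/50 example above.

```python
import numpy as np

def rule_confidences(class_counts):
    """Relative class frequencies per rule (rows: rules, columns: classes)."""
    counts = np.asarray(class_counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def winner_takes_all(memberships, conf):
    """Eq. (4.23): pick the most active rule i*, then its most confident class."""
    i_star = int(np.argmax(memberships))        # i* = argmax_i mu_i(x)
    label = int(np.argmax(conf[i_star])) + 1    # classes are numbered 1..K
    return label, i_star

# Rule 1 reproduces the example from the text; rule 2 is made up.
counts = [[200, 100, 50],
          [10, 90, 0]]
conf = rule_confidences(counts)

# Hypothetical membership degrees of a query sample to the two rules.
label, i_star = winner_takes_all([0.7, 0.2], conf)
print(np.round(conf[0], 2), label)
```

In this sketch the query sample is assigned to Class #1 because rule 1 is the most active rule and Class #1 has the highest confidence within it.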

Figure 4.6: (a) Classification according to the winner-takes-all concept using Eq. (4.23); (b) the decision boundary moves towards the less pure rule due to the gravitation concept applied in Eq. (4.24).

In a more enhanced (weighted) classification scheme such as that used for evolving fuzzy classifiers (EFC) in [115], the degree of purity is respected as well and integrated into the calculation of the final classification response $L$:

$$L = \operatorname{argmax}_{k=m,m^*} \left( conf_k = \frac{\mu_1(\vec{x})\, h^*_{1,k} + \mu_2(\vec{x})\, h^*_{2,k}}{\mu_1(\vec{x}) + \mu_2(\vec{x})} \right) \quad (4.24)$$

with

$$h^*_{1,k} = \frac{h_{1,k}}{h_{1,m} + h_{1,m^*}}, \quad h^*_{2,k} = \frac{h_{2,k}}{h_{2,m} + h_{2,m^*}}, \quad (4.25)$$

with $\mu_1(\vec{x})$ being the membership degree of the nearest rule (with a majority class $m$), and $\mu_2(\vec{x})$ the membership degree of the second nearest rule with a different majority class $m^* \neq m$; $h_{i,k}$ denotes the class frequency of class $k$ in rule $i$. This difference is important, as two nearby lying rules with the same majority class label do not induce a decision boundary in between them. Figure 4.6 shows an example of decision boundaries induced by Eq. (4.23) (left image) and by Eq. (4.24) (right image). Obviously, a more purified rule (i.e., having less overlap in classes; right side) is favored over that with significant overlap (left side); as the decision boundary moves away, more samples are classified to the majority class in the purified rule, which is intended to obtain a clearer, less uncertain decision boundary.

4.2.7.2. Multi-model one-versus-rest

The first variant of multi-model architecture is based on the well-known one-versus-rest classification scheme from the field of machine learning [62], and has been introduced in the fuzzy systems community, especially the evolving fuzzy systems community, in [107]. It mitigates the problem of having complex nonlinear

multi-decision boundaries in the case of multi-class classification problems, which arises for the single-model architecture, as all classes are coded into one model. This is achieved by representing $K$ binary classifiers for the $K$ different classes, each one for the purpose of discriminating one single class from the others (→ one-versus-rest). Thus, during the training cycle (batch or incremental), for the $k$th classifier all feature vectors (samples) belonging to the $k$th class are assigned a label of 1, and all samples belonging to other classes are assigned a label of 0. The nice thing is that any (single-model) classification model $D(f) = C$ or any regression model $D(f) = R$ (such as the Takagi–Sugeno variants discussed above) can be applied for one sub-model in the ensemble. Interestingly, in [107] it has been shown that, using the Takagi–Sugeno architecture for the binary classifiers by regressing on {0, 1}, the masking problem occurring in the linear regression by indicator matrix approach [116] can be avoided. At the classification stage for a new query point $\vec{x}$, the model producing the maximal model response is used as the basis for the final classification label output $L$, i.e.,

$$L = \operatorname{argmax}_{m=1,\ldots,K}\, \hat{f}_m(\vec{x}) \quad \text{in the case when } D(f) = R$$

$$L = \operatorname{argmax}_{m=1,\ldots,K}\, conf_m(\vec{x}) \quad \text{in the case when } D(f) = C \quad (4.26)$$
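The one-versus-rest decomposition and the regression case of Eq. (4.26) can be sketched as follows. This is only an illustration under simplifying assumptions: each binary sub-model is a plain linear least-squares regressor on {0, 1} targets, which stands in for the Takagi–Sugeno sub-models recommended in the text (and is exactly the indicator-matrix regression that may suffer from masking); the function names and the toy data are hypothetical.

```python
import numpy as np

def fit_one_versus_rest(X, labels, n_classes):
    """Fit K binary regressors on {0, 1} targets, one column of W per class."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])          # add intercept
    # Indicator targets: column k is 1 for samples of class k+1, else 0.
    Y = (labels[:, None] == np.arange(1, n_classes + 1)[None, :]).astype(float)
    W, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return W                                               # shape (p+1, K)

def predict_one_versus_rest(W, X):
    """Eq. (4.26), regression case: class of the maximal model response."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(A @ W, axis=1) + 1                    # classes 1..K

# Toy data: three well-separated clusters in 2-D (not collinear).
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
rng = np.random.default_rng(0)
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])
labels = np.repeat([1, 2, 3], 30)

W = fit_one_versus_rest(X, labels, 3)
pred = predict_one_versus_rest(W, centers)
print(pred)
```

The cluster centers are deliberately placed on a triangle; with collinear class centers, a linear indicator regression of this kind would be prone to the masking effect mentioned above.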

A rule-based one-versus-rest classification scheme was proposed within the context of a MIMO (Multiple Input Multiple Output) fuzzy system and applied in an evolving classification context [117]. There, a rule is defined by

$$\text{IF } \vec{x} \text{ IS (about) } \mu_i \text{ THEN } l_i = \vec{x}\, W_i, \quad (4.27)$$

where

$$W_i = \begin{bmatrix} w^1_{i0} & w^2_{i0} & \ldots & w^K_{i0} \\ w^1_{i1} & w^2_{i1} & \ldots & w^K_{i1} \\ \vdots & \vdots & \ddots & \vdots \\ w^1_{ip} & w^2_{ip} & \ldots & w^K_{ip} \end{bmatrix}$$

and $\mu_i$ as in Eq. (4.9). Thus, a complete hyper-plane for each class per rule is defined. This offers the flexibility to regress on different classes within single rules and thus to resolve class overlaps in a single region by multiple regression surfaces [117].

4.2.7.3. Multi-model all-pairs

The multi-model all-pairs (also termed as all-versus-all) classifier architecture, originally introduced in the machine learning community [118, 119], and first introduced in (evolving) fuzzy classifiers design in [110], overcomes the often occurring

imbalanced learning problems induced by the one-versus-rest classification scheme in the case of multi-class (polychotomous) problems. It is well known that imbalanced problems cause severe down-trends in classification accuracy [120]. Thus, it is beneficial to avoid imbalanced problems while still keeping the decision boundaries as easy as possible to learn. This is achieved by the all-pairs architecture, as for each class pair $(k, l)$ a separate classifier is trained, decomposing the whole learning problem into binary, less complex sub-problems. Formally, this can be expressed by a classifier $C_{k,l}$, which is induced by a training procedure $T_{k,l}$ when using (only) the class samples belonging to classes $k$ and $l$:

$$C_{k,l} \longleftarrow T_{k,l}(X_{k,l}), \quad X_{k,l} = \{\vec{x} \mid L(\vec{x}) = k \vee L(\vec{x}) = l\}, \quad (4.28)$$

with $L(\vec{x})$ being the class label associated with feature vector $\vec{x}$. This means that $C_{k,l}$ is a classifier for separating samples belonging to class $k$ from those belonging to class $l$. When classifying a new sample $\vec{x}$, each classifier outputs a confidence level $conf_{k,l}$, which denotes the degree of preference of class $k$ over class $l$ for this sample. This degree lies in [0, 1], where 0 means no preference, i.e., a crisp vote for class $l$, and 1 means full preference, i.e., a crisp vote for class $k$. This is conducted for each pair of classes and stored into a preference relation matrix $R$:

$$R = \begin{bmatrix} 0 & conf_{1,2} & conf_{1,3} & \ldots & conf_{1,K} \\ conf_{2,1} & 0 & conf_{2,3} & \ldots & conf_{2,K} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ conf_{K,1} & conf_{K,2} & conf_{K,3} & \ldots & 0 \end{bmatrix} \quad (4.29)$$

The preference relation matrix in Eq. (4.29) opens another interpretation dimension on the output level: considerations may go into partial uncertainty reasoning or preference relational structure in a fuzzy sense [121]. In the most convenient way, the final class response is often obtained by the following:

$$L = \operatorname{argmax}_{k=1,\ldots,K}(score_k) = \operatorname{argmax}_{k=1,\ldots,K} \left( \sum_{K \ge i \ge 1} conf_{k,i} \right), \quad (4.30)$$

i.e., the class with the highest score (= highest preference degree summed over all classes) is returned by the classifier. In [119, 122], it was shown that pair-wise classification is not only more accurate than the one-versus-rest technique, but also more efficient regarding computation times, see also [110], which is an important characteristic for fast stream learning problems. The reason for this is basically that the binary classification problems contain significantly fewer samples, as each sub-problem uses only a small subset of samples.
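The scoring of Eq. (4.30) on a preference relation matrix as in Eq. (4.29) can be sketched in a few lines. The pairwise confidence values below are made up, the function names are hypothetical, and the sketch additionally assumes reciprocal preferences, $conf_{l,k} = 1 - conf_{k,l}$, which the text does not state explicitly.

```python
import numpy as np

def preference_matrix(pairwise_conf, n_classes):
    """Build R of Eq. (4.29) from confidences conf[(k, l)] given for k < l."""
    R = np.zeros((n_classes, n_classes))
    for (k, l), c in pairwise_conf.items():
        R[k - 1, l - 1] = c
        R[l - 1, k - 1] = 1.0 - c   # assumption: reciprocal preference
    return R

def all_pairs_label(R):
    """Eq. (4.30): sum each row of R and return the class with maximal score."""
    scores = R.sum(axis=1)
    return int(np.argmax(scores)) + 1, scores

# Hypothetical outputs of the three pairwise classifiers for one query sample.
pairwise_conf = {(1, 2): 0.8, (1, 3): 0.3, (2, 3): 0.1}
R = preference_matrix(pairwise_conf, 3)
label, scores = all_pairs_label(R)
print(label, scores)
```

Here Class #3 obtains the highest score (0.7 + 0.9), although Class #1 clearly wins its comparison against Class #2.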

4.2.7.4. Neuro-fuzzy systems as classifiers

A further note goes to the usage of neuro-fuzzy systems as classifiers [123], which have recently been successfully applied in an evolving context in [96]. For binary classification problems, an adaptation of the architecture is not needed at all, as then a regression on {0, 1}-values is typically established. For multi-class classification problems, the architecture visualized in Figure 4.4 expands to a multi-output model, where each output represents one class. Thus, the value of this output (in [0, 1]) directly contains the certainty that a new query point falls into this particular class. Applying a softmax operator (as is commonly used in neural classifiers [62]) on the certainty vector delivers the final output class. Learning can then again be conducted using a one-versus-rest classification technique as described in Section 4.2.7.2, through a loss function on each class versus the rest separately, see [96].

4.2.7.5. Deep rule-based classifiers (DRB)

The authors in [111] proposed a deep (fuzzy) rule-based architecture for solving classification problems with a high accuracy (due to multiple layers), but at the same time offering interpretability of the implicit neurons (as realized through cloud-based rules). The principal architecture is visualized in Figure 4.7. The essential core component is that in the third layer, it contains massively parallel ensembles of IF-THEN rules, which are presented in the following form:

$$\text{Rule}_i: \text{ IF } \vec{x} \sim P_{i,1} \text{ OR } \vec{x} \sim P_{i,2} \text{ OR } \ldots \text{ OR } \vec{x} \sim P_{i,i_N} \text{ THEN } l_i = \text{Class } c_i, \quad (4.31)$$

where $\sim$ denotes similarity, which can be a fuzzy degree of satisfaction, a density-based value (as in the cloud-based fuzzy systems), or a typicality [124]. $P_{i,j}$ denotes

Figure 4.7: Architecture of a deep rule-based classifier [111].

the $j$th prototype of the $i$th class (having $i_N$ prototypes in total), usually estimated through a (weighted) average of the samples belonging to one prototype. It is thus remarkable that the rule in Eq. (4.31) uses OR-connections to describe a single class through a set of possible prototypes (learned from the data). Such a realization is not possible in any of the aforementioned architectures relying on classical t-norm concepts for connecting the rule antecedents. The final decision-making for new query samples is achieved through advanced decision-making mechanisms based on confidence scores (apart from the “winner-takes-all” principle). The latter are elicited by a Gaussian kernel function through the membership degree of a new sample to each prototype in a single rule and by taking the maximal value, see [111] for details.

4.2.7.6. Model ensembles

A single model may indeed have its charm regarding a fast update/learning performance, but may suffer in predictive performance in the case of higher noise levels and dimensionality in the data. This is because in the case of noisy data samples, single regression fits, especially when they are of non-linear type, may often provide unstable solutions. That is, the regression fits tend to over-fit by fitting the noise more than the real underlying functional approximation trends [125], increasing the variance error much more than reducing the bias error; see the detailed discussion of this issue in [116]. Ensembles of models can typically increase the robustness in performance in such cases, as is known from the (classical) batch machine learning community [126, 127]. Furthermore, ensembles are potential candidates to properly deal with drifting data distributions in a natural way (especially by evolving new ensemble members upon drift alarms, etc. [128]). Thus, combining the idea of ensembling with E(N)FS, leading to a kind of online ensemble of E(N)FS with both increased robustness and flexibility, has recently emerged as an important topic in the evolving fuzzy systems community, see, e.g., [129–131]. An ensemble of fuzzy systems can be defined by $F_1 \cup F_2 \cup \ldots \cup F_m$, where each $F$ stands for one so-called ensemble member or base learner, and whose output is generally defined as the following:

$$\hat{F} = \operatorname{Agg}_{i=1}^{m}\, w_i \hat{F}_i, \quad (4.32)$$

with $\hat{F}_i$ being the predicted output from ensemble member $F_i$, $w_i$ its weight (impact in the final prediction), and Agg a specific aggregation operator, whose choice depends on the ensembling concept used: e.g., in the case of bagging it is simply an average over all members’ outputs [132] (thus, $w_i = 1/m$), or in the case of boosting it is a specific weighted aggregation function [133], see also Section 4.3.5.

Each ensemble member thereby denotes a single fuzzy system with any architecture as discussed above.
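A minimal sketch of the aggregation in Eq. (4.32) is given below, under the assumption that Agg is a weighted sum; with no weights given, it falls back to the bagging-style average $w_i = 1/m$. The function name `aggregate`, the member outputs, and the weights are hypothetical; a boosting scheme would supply its own weights and possibly a different aggregation operator.

```python
import numpy as np

def aggregate(member_outputs, weights=None):
    """Weighted-sum aggregation of member predictions, cf. Eq. (4.32).

    With weights=None, every member gets w_i = 1/m (plain averaging).
    """
    outputs = np.asarray(member_outputs, dtype=float)
    m = outputs.shape[0]
    if weights is None:
        weights = np.full(m, 1.0 / m)
    weights = np.asarray(weights, dtype=float)
    return float(weights @ outputs)

# Hypothetical predictions of three ensemble members for one query sample.
member_outputs = [1.0, 2.0, 6.0]

bagged = aggregate(member_outputs)                      # simple average
weighted = aggregate(member_outputs, [0.5, 0.3, 0.2])   # member-specific impact
print(bagged, weighted)
```

In the weighted call, the third member's comparatively large prediction is damped because it carries the smallest weight.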

4.3. Fundamentals of Learning Algorithms

Data streams are one of the fundamental reasons for the necessity of applying evolving, adaptive models in general and evolving fuzzy systems in particular. This is simply because streams are theoretically an infinite sequence of samples, which cannot be processed at once within a batch process, not even on modern computers with high virtual memory capacities. Data streams may not necessarily arise online from permanent recordings, measurements, or sample gatherings, but can also arise due to a block- or sample-wise loading of batch data sites, e.g., in the case of very large databases (VLDB)² or in the case of big data problems [134]; in this context, they are also often referred to as pseudo-streams. In particular, a data stream (or pseudo-stream) is characterized by the following properties [1]:

• The data samples or data blocks are continuously arriving online over time. The frequency depends on the frequency of the measurement recording process.
• The data samples are arriving in a specific order, over which the system has no control.
• Data streams are usually not bounded in size; i.e., a data stream is alive as long as some interfaces, devices, or components at the system are switched on and are collecting data.
• Once a data sample/block is processed, it is usually discarded immediately afterwards.

Changes in the process such as new operation modes, system states, varying environmental influences, etc. usually also implicitly affect the data stream in such a way that, for instance, drifts or shifts may arise (see Section 4.4.1), or new regions in the feature/system variable space are explored (knowledge expansion). Formally, a stream can be defined as an infinite sequence of samples $(\vec{x}_1, \vec{y}_1), (\vec{x}_2, \vec{y}_2), (\vec{x}_3, \vec{y}_3), \ldots$, where $\vec{x}$ denotes the vector containing all input features (variables) and $\vec{y}$ the output variables that should be predicted.

In the case of unsupervised learning problems, $\vec{y}$ disappears; note, however, that in the context of fuzzy systems, only supervised regression and classification problems are studied. Often $\vec{y} = y$, i.e., single output systems are encouraged, especially as it is often possible to decompose a MIMO (multiple input multiple output) system into single independent MISO (multiple input single output) systems (e.g., when the outputs are independent).

² http://en.wikipedia.org/wiki/Very_large_database

Handling streams for modeling tasks in an appropriate way requires the usage of incremental learning algorithms, which are deduced from the concept of incremental heuristic search [11]. These algorithms possess the property to build and learn models in a step-wise manner rather than with a whole data set at once. From a formal mathematical point of view, an incremental model update $I$ of the former model $f_N$ (estimated from the $N$ initial samples) is defined by the following:

$$f_{N+m} = I\big(f_N, (\vec{x}_{N+1,\ldots,N+m}, \vec{y}_{N+1,\ldots,N+m})\big). \quad (4.33)$$

So, the incremental model update is done by just taking the new $m$ samples and the old model, but not using any prior data. Here, the whole model may also include some additional auxiliary statistics, which need to be updated synchronously with the “real” model. If $m = 1$, we speak about incremental learning in sample mode or sample-wise incremental learning, otherwise about incremental learning in chunk mode or chunk-wise incremental learning. If the output vector starts to be missing in the data stream samples, but a supervised model has already been trained before, which is then updated with the stream samples either in an unsupervised manner or by using its own predictions, then we speak about semi-supervised (online) learning [135]. Two update modes in the incremental learning process are distinguished:

(1) Update of the model parameters: in this case, a fixed number of parameters $\Phi_N = \{\phi_1, \ldots, \phi_l\}_N$ of the original model $f_N$ is updated by the incremental learning process, and the outcome is a new parameter setting $\Phi_{N+m}$ with the same number of parameters, i.e., $|\Phi_{N+m}| = |\Phi_N|$. Here, we also speak about a model adaptation or a model refinement with new data.
(2) Update of the whole model structure: this case leads to the evolving learning concept, as the number of parameters may change and also the number of structural components may change automatically (e.g., rules are added or pruned in the case of fuzzy systems) according to the characteristics of the new data samples $\vec{x}_{N+1,\ldots,N+m}$. This means that usually (but not necessarily) $|\Phi_{N+m}| \neq |\Phi_N|$ and $C_{N+m} \neq C_N$, with $C$ being the number of structural components. The update of the whole model structure may also include an update of the input structure, i.e., input variables/features may be exchanged during the incremental learning process; see also Section 4.4.2.

An important aspect in incremental learning algorithms is the so-called plasticity–stability dilemma [136], which describes the problem of finding an appropriate tradeoff between flexible model updates and structural convergence. This strongly depends on the nature of the stream: in some cases a more intense update is required than in others (drifting versus life-long concepts in the stream, see also Section 4.4.1). If an algorithm converges to an optimal solution or at least

to the same solution as the hypothetical batch solution (obtained by using all data up to a certain sample at once), it is called a recursive algorithm. An initial batch mode training step with an initial set of training samples is, whenever possible, usually preferable over incremental learning from scratch, i.e., building up the model sample by sample from the beginning. This is because within a batch mode phase, it is possible to carry out validation procedures (such as cross-validation [137] or bootstrapping [138]) in connection with search techniques for finding an optimal set of parameters for the learning algorithm to achieve a good generalization quality (nearly every E(N)FS approach has such learning parameters to tune). The obtained parameters are usually reliable start parameters for the incremental learning algorithm to evolve the model further. When performing incremental learning from scratch, the parameters have to be set to some blind default values, which may not necessarily be appropriate for the given data stream mining problem. In a pure online learning setting, however, incremental learning from scratch is indispensable. Then, the default start parameters of the learning engines often need to be parameterized. Thus, it is essential that the algorithms require as few parameters as possible (see Section 4.3.6), or alternatively embed algorithms for adaptive changes of the learning parameters.
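The update scheme of Eq. (4.33) and the notion of a recursive algorithm can be illustrated with a deliberately simple toy "model": the per-feature mean and variance of the stream, updated chunk-wise without revisiting old samples. This is only a sketch; the choice of model, the function name `incremental_update`, and the chunk size are assumptions made for illustration and do not correspond to a specific E(N)FS method.

```python
import numpy as np

def incremental_update(model, chunk):
    """One step f_{N+m} = I(f_N, chunk), using only the old model and new data.

    model = (n, mean, m2), where m2 is the per-feature sum of squared
    deviations from the mean over all n samples seen so far.
    """
    n_old, mean_old, m2_old = model
    chunk = np.asarray(chunk, dtype=float)
    m = chunk.shape[0]
    mean_chunk = chunk.mean(axis=0)
    m2_chunk = ((chunk - mean_chunk) ** 2).sum(axis=0)

    n_new = n_old + m
    delta = mean_chunk - mean_old
    mean_new = mean_old + delta * m / n_new
    # Combine within-chunk scatter with the shift between old and chunk mean.
    m2_new = m2_old + m2_chunk + delta ** 2 * n_old * m / n_new
    return n_new, mean_new, m2_new

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 3))

model = (0, np.zeros(3), np.zeros(3))          # learning from scratch
for start in range(0, 100, 7):                 # chunk mode with m = 7
    model = incremental_update(model, data[start:start + 7])

n, mean, m2 = model
print(np.allclose(mean, data.mean(axis=0)), np.allclose(m2 / n, data.var(axis=0)))
```

Because the chunk-wise result coincides with the batch statistics computed over all 100 samples at once, this update is recursive in the sense defined above; setting the chunk size to 1 gives the sample-wise mode.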

4.3.1. Recursive Learning of Linear Parameters

For learning the linear consequent (output weight) parameters $\vec{w}$, which are present in most of the fuzzy system architectures discussed in the previous section, the currently available EFS techniques typically rely on the optimization of the least squares error criterion, which is defined as the squared deviation between observed outputs $y(1), \ldots, y(N)$ and predicted model outputs $\hat{y}(1), \ldots, \hat{y}(N)$:

$$\min_{\vec{w}} J = \|y - \hat{y}\|_{L_2} = \sum_{k=1}^{N} (y(k) - \hat{y}(k))^2. \quad (4.34)$$

As $\hat{y}(k)$ is typically elicited by a linearly weighted sum of rule/neuron activation levels (see Eq. (4.16)) with weights $\vec{w}$, this problem can be written as a classical linear least squares problem with a weighted regression matrix containing the global regressors $\vec{r}(k)$ and rule/neuron activation levels (in the last layer) as weights. Its exact definition depends on the chosen consequent functions (see the previous section): e.g., in the case of TS fuzzy models with hyper-planes, the regressor in the $k$th sample (denoting the $k$th row in the regression matrix) is given by the following:

$$\vec{r}(k) = \big[\Psi_1(\vec{x}(k)) \;\; x_1(k)\Psi_1(\vec{x}(k)) \;\; \ldots \;\; x_p(k)\Psi_1(\vec{x}(k)) \;\; \ldots \;\; \Psi_C(\vec{x}(k)) \;\; x_1(k)\Psi_C(\vec{x}(k)) \;\; \ldots \;\; x_p(k)\Psi_C(\vec{x}(k))\big] \quad (4.35)$$

with $C$ being the current number of rules. In the case of higher order polynomials, the polynomial terms are also integrated in Eq. (4.35). For this problem, no matter how the regressors are constructed, it is well known that a recursive solution exists that converges to the optimal one within each incremental learning step; see [139] and also [29] (Chapter 2) for its detailed derivation in the context of evolving TS fuzzy systems. However, the problem with this global learning approach is that it does not offer any flexibility regarding rule evolution and pruning, as these cause a change in the size of the regressors and thus a dynamic change in the dimensionality of the recursive learning problem, which leads to a disturbance of the parameters in the other rules and to a loss of optimality. Therefore, the authors in [107] emphasize the usage of a local learning approach that learns and updates the consequent parameters for each rule separately. Adding or deleting a rule, therefore, does not affect the convergence of the parameters of all other rules; thus, optimality in the least squares sense is preserved. The local learning approach leads to a weighted least squares formulation, which for the $i$th rule is given by the following:

$$\min_{\vec{w}_i} J_i = \sum_{k=1}^{N} \lambda^{N-k}\, \Psi_i(\vec{x}(k))\, \big(y(k) - f_i(\vec{w}_i, \vec{x}(k))\big)^2. \quad (4.36)$$

$y(k)$ denotes the observed (measured) target value in sample $k$ and $f_i(\vec{w}_i, \vec{x}(k))$ the estimated target value through the consequent function $f_i$ of the $i$th rule, which contains the linear consequent/output weight parameters $\vec{w}_i = [w_{i0}, w_{i1}, \ldots, w_{ip}]$ (with $p$ being the number of parameters according to the length of the regressor vector). The symbol $C$ denotes the number of rules, $N$ is the number of samples (seen so far), and $\lambda$ is an optional forgetting factor in [0, 1], which leads to an exponential decrease of weights over past samples. The weights $\Psi_i(\vec{x})$ denote the rule/neuron membership degree/activation level of a sample $\vec{x}$ to the $i$th rule and guarantee a (locally) stable inference for new query samples [29]. The weights are essential, as samples lying far away from an existing rule/neuron will receive lower weights in the estimation of the consequent parameters than those lying closer to it, which assures reliable local optimization and convergence; see [29], Chapter 2 for a detailed discussion. This local LS error functional Eq. (4.36) turned out to be more or less the common denominator in current E(N)FS approaches [30]. The problem can be seen as a classical weighted least squares problem, where the weighting matrix is a diagonal matrix containing the basis function values $\Psi_i$ for each input sample. Again, an exact recursive formulation can be derived (see [29], Chapter 2), which is termed recursive fuzzily weighted least squares (RFWLS). As RFWLS is so fundamental and is used in most of the EFS approaches [30], we explicitly state the update formulas (from the $k$th to the $(k+1)$st sample):

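$$\vec{w}_i(k+1) = \vec{w}_i(k) + \gamma(k)\big(y(k+1) - \vec{r}^T(k+1)\,\vec{w}_i(k)\big), \quad (4.37)$$

$$\gamma(k) = \frac{P_i(k)\,\vec{r}(k+1)}{\dfrac{\lambda}{\Psi_i(\vec{x}(k+1))} + \vec{r}^T(k+1)\,P_i(k)\,\vec{r}(k+1)}, \quad (4.38)$$

$$P_i(k+1) = \frac{1}{\lambda}\big(I - \gamma(k)\,\vec{r}^T(k+1)\big)P_i(k), \quad (4.39)$$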

with $P_i(k) = (R_i(k)^T Q_i(k) R_i(k))^{-1}$ being the inverse weighted Hessian matrix and $\vec{r}(k+1)$ the regressor vector in the new, $(k+1)$st sample (its construction depends on the consequent function, see above). The (normalized) membership degree $\Psi_i(\vec{x}(k+1))$ of a new sample to the $i$th rule is integrated as fuzzy weight in the denominator in Eq. (4.38), and assures local convergence, as only consequent parameters of the most active rules are significantly updated and parameters of rules lying far away from the current sample (and thus yielding a low membership value) remain nearly unaffected (which is desired), see [29], Chapter 2 for a detailed analysis. Please note that convergence to global optimality in each single sample update is guaranteed, as Eq. (4.36) is a convex, parabolic function, and Eq. (4.37) denotes a Gauss–Newton step, which converges within one iteration in the case of parabolas [139]. This convergence to optimality is guaranteed as long as there is no structural change in the rules’ antecedents. However, due to rule center movements or resettings in the incremental learning phase (see Section 4.3.4), this is usually not the case. Therefore, a kind of sub-optimality is caused, whose degree of deviation from the real optimum has been bounded for some EFS approaches such as FLEXFIS [140], PANFIS [141], or SOFMLS [44]; several extensions of and also some alternative approaches to RFWLS have been proposed, see Section 4.3.1.1. Whenever a new rule is evolved by a rule evolution criterion (see also below), the parameters and the inverse weighted Hessian matrix (required for an exact update) have to be initialized. In [139], it is recommended to set $\vec{w}$ to $\vec{0}$ and $P_i$ to $\alpha I$ with $\alpha$ big enough. However, this is for the purpose of a global modeling approach starting with a faded out regression surface over the whole input space. In local learning, the other rules defining other parts of the feature space remain untouched.
Thus, setting the hyper-plane of the new rule that may appear somewhere in between the other rules to 0 would lead to an undesired muting of one local region and to discontinuities in the online predictions [142]. Thus, it is more beneficial to inherit the parameter vector and the inverse weighted Hessian matrix from the most nearby lying rule [142].

4.3.1.1. Improving the robustness of recursive linear consequent parameters learning

Several extensions to the conventional RFWLS estimator were proposed in the literature during recent years, especially in order to address the sub-optimality

problem, to dampen the effect of outliers on the update of the parameters, to decrease the curse of dimensionality in the case of large regressor vectors, and also to address noise in the input and not only in the output, as is the case for RFWLS due to the optimization of vertical distances between predicted and observed target values. These extensions of RFWLS are as follows:

• In PANFIS [141], an additional constant $\alpha$ is inserted, which acts like a binary function and fosters the asymptotic convergence of the system error and of the weight vector being adapted. In other words, the constant $\alpha$ is in charge of regulating the current belief in the weight vectors $\vec{w}_i$ and depends on the approximation and estimation errors. It is 1 whenever the approximation error is bigger than the system error, and 0 otherwise. Thus, adaptation takes place in the first case and not in the second case (which may have advantages in terms of flexibility and computation time). A similar concept is used in the improved version of SOFNN, see [143].
• Generalized version of RFWLS (termed FWGRLS) as used in GENEFIS [66]: this exploits the generalized RLS as derived in [144] and adapts it to the local learning context in order to gain its benefits as discussed above. The basic difference to RFWLS is that it adds a weight decay regularization term to the least squares problem formulation in order to punish more complex models. In a final simplification step, it ends up with formulas similar to Eqs. (4.37) to (4.39), but with the difference of subtracting the term $\alpha P_i(k+1) \nabla\phi(\vec{w}_i(k))$ in Eq. (4.37), with $\alpha$ a regularization parameter and $\phi$ the weight decay function: one of the most popular ones in the literature is the quadratic function defined as $\phi(\vec{w}_i(k)) = \frac{1}{2}\|\vec{w}_i(k)\|^2$; thus $\nabla\phi = \vec{w}_i(k)$.
• Modified recursive least squares algorithm as proposed in [44] to train both parameters and structures of the model with the support of linearization and Lyapunov functions [145]. It is remarkable that in this approach, the evolution or pruning of rules does not harm the convergence of the parameters, as these change only the dimension of the inverse Hessian and the weight matrices. Based on this approach, the bound on the identification error could be made even smaller in [146] with the use of a specifically designed dead-zone recursive least squares algorithm.
• Extended weighted recursive least squares (EWRLS), which also takes into account when two rules are merged and then applies different formulas to assure optimality in the least squares sense despite structural changes due to the merging process [147].
• Multi-innovation RFWLS as suggested in [148]: this variant integrates both past error estimators (also termed innovation vectors) and membership values on some recent past samples into one single parameter update cycle. This is achieved

page 160

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

Evolving Fuzzy and Neuro-Fuzzy Systems

b4528-v1-ch04

page 161

161

through an expansion of RFWLS from the vectorized formulation to the matrix case. This achieves a compensation of structural changes and thus of sub-optimal solutions over an innovation length v (the size of the innovation vector). This finally results in optimality of the parameters with respect to the past v samples. • Recursive correntropy was suggested in [149] and [150] for recursive updates within a global learning approach, and adopted to the local learning case in [148]: this variant integrates an exponential error term in the objective function formulation according to the discrete variant of the correntropy between predicted −

|y(k)− fi (w  i , x )(k)|

2φ 2 serving as outlier fading and measured output values, i.e., ( xk ) = e factor; this gives larger errors lower weights during optimization, as the aim is to maximize the corr-entropy function [151]. • Recursive weighted total least squares (RWTLS) as introduced in [148]: it is an extension of the LS objective function formulation by integrating the sample reconstruction error in the local LS functional to properly account for both, noise in the inputs and output(s) (whereas RFWLS only respects noise in the output(s)). The mathematical derivation leads to a solution space formed by the weighted mean of input regressors (with weights given by the rule membership values) and the smallest eigenvalue of a weighted and mean-centered regression matrix in the total least squares sense (containing all regressor values plus the target value y). It can be recursively updated through the usage of generalized inverse iteration [152], where the (important) noise covariance matrix is autonomously and incrementally estimated from data through a recursive polynomial Kalman smoother with a particular state transition matrix.

Another possibility to achieve better local optimality of the consequent parameters is presented in [153], which allows iterative re-optimization cycles on a batch of the most recent data samples. This is done in conjunction with an iterative optimization of the nonlinear parameters (which reflect any possible structural changes); thus, sub-optimality can be completely avoided. On the other hand, it requires re-learning phases and thus does not meet the spirit of a truly recursive, single-pass algorithm, unlike all the extensions discussed above. Finally, RFWLS and all its related algorithms can also be applied (i) to binary fuzzy classifiers by performing a regression on {0, 1}, and (ii) to multiclass classifiers based on the one-versus-rest classification scheme (see Section 4.2.7.2) by conducting an indicator-based recursive learning for each class separately, see [107].
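The weight decay variant of the recursive update described in the list above can be illustrated with a short sketch. It is not the formulation of [66]: it assumes that Eqs. (4.37) to (4.39) take the standard form of a membership-weighted recursive least squares step with forgetting factor, and the names `fwgrls_step`, `psi`, `forget`, and `alpha` are chosen here for illustration only. With `alpha = 0` the step reduces to a plain weighted recursive least squares update.

```python
import numpy as np

def fwgrls_step(w, P, r, y, psi, alpha=0.0, forget=1.0):
    """One weighted RLS step for a single rule, with optional quadratic weight decay.

    w: consequent parameters of the rule, P: inverse Hessian of the rule,
    r: regressor vector, y: target, psi: rule membership degree (> 0),
    alpha: weight decay strength, forget: forgetting factor (assumed names).
    """
    Pr = P @ r
    gain = Pr / (forget / psi + r @ Pr)          # Kalman-type gain vector
    P_new = (P - np.outer(gain, Pr)) / forget    # updated inverse Hessian (P symmetric)
    grad_decay = w                               # gradient of 0.5 * ||w||^2
    w_new = w + gain * (y - r @ w) - alpha * (P_new @ grad_decay)
    return w_new, P_new

# Synthetic stream for one rule: regressors [x1, x2, 1] with membership degrees psi.
rng = np.random.default_rng(0)
R = np.column_stack([rng.normal(size=(200, 2)), np.ones(200)])
y = R @ np.array([1.5, -2.0, 0.5]) + 0.01 * rng.normal(size=200)
psi = rng.uniform(0.1, 1.0, size=200)

w, P = np.zeros(3), 1e4 * np.eye(3)              # zero weights, large initial P
for r_k, y_k, psi_k in zip(R, y, psi):
    w, P = fwgrls_step(w, P, r_k, y_k, psi_k)
```

Starting from a zero weight vector and a large multiple of the identity for `P` follows the initialization the chapter recommends for newly evolved rules; a positive `alpha` additionally pulls the weights toward zero in every step.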

4.3.2. Recursive Learning of Nonlinear Parameters

Nonlinear parameters occur in every model architecture as defined throughout Section 4.2.2, mainly in the fuzzy sets included in the rules’ antecedent parts, except for the extended version of TS fuzzy systems (Section 4.2.2.3), where they also appear in the consequents. Often, the parameters in the fuzzy sets define their centers c and characteristic spreads σ, but the parameters may appear in a different form; for instance, in the case of sigmoid functions they define the slope and the point of gradient change. Thus, we generally refer to a set of nonlinear parameters as $\Phi$.

The incremental update of nonlinear parameters is necessary in order to adjust and move the fuzzy sets, and the rules composed of these sets, to the actual distribution of the data so that they remain well placed. An example is provided in Figure 4.8, where the initial data cloud (circles) in the upper left part slightly changes its position due to new data samples (rectangles). Leaving the original rule (marked with a dashed ellipsoid as induced by multivariate Gaussians) untouched would cause a misplacement of the rule. Thus, it is beneficial to adjust the rule center and its spread according to the new samples (solid ellipsoid). The figure also shows the case in which new stream samples appear far away from any existing rule (rectangles in the middle right part). Such samples should induce the evolution of a new rule in order to compactly represent the new knowledge contained in the new cloud (which will be handled in the subsequent section); updating older rules instead would induce inaccurately blown-up rules, containing a (useless) mixture of local data representations.

Figure 4.8: Example of rule update (left-hand side) versus rule evolution (right-hand side) based on new incoming data samples (rectangular markers).

One possibility to update the nonlinear parameters in EFS is, similarly to the consequent parameters, to apply a numerical incremental optimization procedure. Relying on the least squares optimization problem as in the case of recursive linear parameter updates, its formulation in dependence of the nonlinear parameters $\Phi$ becomes

$$\min_{\Phi;\,\vec{w}} J(\Phi, \vec{w}) = \sum_{k=1}^{N} \| y_k - \hat{y}(\Phi, \vec{w}) \|_{L_2}. \qquad (4.40)$$

In the case of TS fuzzy systems, for instance, $\hat{y}(\Phi) = \sum_{i=1}^{C} l_i(\vec{x}) \Psi_i(\Phi(\vec{x}))$. Then, the linear consequent parameters $\vec{w}$ need to be optimized synchronously with the nonlinear parameters (which is why $\vec{w}$ enters Eq. (4.40) as an optional argument) in order to guarantee an optimal solution. This can be done either in an alternating nested procedure, i.e., by first performing an optimization step for the nonlinear parameters (see below) and then optimizing the linear ones, e.g., by Eq. (4.37), or within one joint update formula (e.g., when using one Jacobian matrix on all parameters). Equation (4.40) is still a free optimization problem; thus, any numerical, gradient-based, or Hessian-based technique for which a stable incremental algorithm can be developed is a good choice: this is the case for steepest descent, the Gauss–Newton method, and Levenberg–Marquardt. Interestingly, a common parameter update formula can be deduced for all three variants [154]:

$$\Phi(k+1) = \Phi(k) + \mu(k) P(k)^{-1} \psi(\vec{x}(k), \Phi)\, e(\vec{x}(k), \Phi), \qquad (4.41)$$

where $\psi(\vec{x}(k), \Phi) = \frac{\partial y}{\partial \Phi}(\vec{x}(k))$ is the partial derivative of the current model y with respect to each nonlinear parameter, evaluated at the current input sample $\vec{x}(k)$; $e(\vec{x}(k), \Phi) = y_k - \hat{y}_k$ is the residual in the kth sample; and $\mu(k)P(k)^{-1}$ is the learning gain with P(k) an approximation of the Hessian matrix, which is realized in different ways:

• For the steepest descent algorithm, P(k) = I; thus, the update depends only on the current first-order derivative vectors. Furthermore, $\mu(k) = \mu / \|\vec{r}(k)\|^2$, with $\vec{r}$ the regression vector.

• For the Gauss–Newton algorithm, μ(k) = 1 − λ and P(k) = (1 − λ)H(k), with H(k) the Hessian matrix, which can be approximated by $Jac^T(k)\,Jac(k)$, with Jac the Jacobian matrix (including the derivatives w.r.t. all parameters in all rules for all samples up to the kth), or by $Jac^T(k)\,\mathrm{diag}(\Psi_i(\vec{x}(k)))\,Jac(k)$ in the case of the weighted version for local learning (see also Section 4.3.1). Note that the Jacobian matrix reduces to the regression matrix R in the case of linear parameters, as the derivatives are the original input variables. In addition to updating the parameters according to Eq. (4.41), the matrix P needs to be updated, which is done by

$$P(k) = \lambda P(k-1) + (1-\lambda)\,\psi(\vec{x}(k), \Phi)\,\psi(\vec{x}(k), \Phi)^T. \qquad (4.42)$$

• For the Levenberg–Marquardt algorithm, P(k) = (1 − λ)H(k) + αI, with H(k) as in the case of Gauss–Newton and again μ(k) = 1 − λ. The update of the matrix P is done by

$$P(k) = \lambda P(k-1) + (1-\lambda)\left(\psi(\vec{x}(k), \Phi)\,\psi(\vec{x}(k), \Phi)^T + \alpha I\right). \qquad (4.43)$$

Using the matrix inversion lemma [155] and some reformulation operations to avoid a matrix inversion in each step ($P(k)^{-1}$ is required in Eq. (4.41)) leads to the well-known recursive Gauss–Newton approach, which is used, e.g., in [69] for recursively updating the kernel widths in the consequents and also for fine-tuning the regularization parameter. It also results in the recursive least squares approach in the case of linear parameters (formulas for the local learning variant in Eq. (4.37)). In the case of the recursive Levenberg–Marquardt (RLM) algorithm, a more complex reformulation is required to approximate the update formulas for $P(k)^{-1}$ directly (without an intermediate inversion step). This leads to the recursive equations as successfully used in the EFP method by Wang and Vrbanek [156] for updating centers and spreads in Gaussian fuzzy sets (multivariate Gaussian rules).
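To make the common update of Eq. (4.41) concrete, the following sketch applies the Gauss–Newton variant with the P update of Eq. (4.42) to a single one-dimensional Gaussian membership function, whose center c and spread σ play the role of the nonlinear parameters. It is an illustration under assumptions rather than the implementation of any cited method: the model, the initial values, the forgetting factor `lam`, the identity matrix as initial P, and the direct linear solve (used here instead of the inversion-lemma reformulation mentioned above) are all chosen for this example only.

```python
import numpy as np

def gaussian(x, c, sigma):
    """Model output: a one-dimensional Gaussian membership function."""
    return np.exp(-0.5 * ((x - c) / sigma) ** 2)

def gradient(x, c, sigma):
    """psi: derivative of the model output w.r.t. the parameters (c, sigma)."""
    out = gaussian(x, c, sigma)
    return np.array([out * (x - c) / sigma**2, out * (x - c) ** 2 / sigma**3])

def recursive_gauss_newton(xs, ys, params, lam=0.99):
    """Sample-wise update following Eqs. (4.41) and (4.42) with mu(k) = 1 - lam."""
    params = np.asarray(params, dtype=float).copy()
    P = np.eye(params.size)                      # initial Hessian approximation (assumed)
    for x, y in zip(xs, ys):
        psi = gradient(x, *params)
        err = y - gaussian(x, *params)           # residual e = y_k - y_hat_k
        P = lam * P + (1.0 - lam) * np.outer(psi, psi)           # Eq. (4.42)
        params += (1.0 - lam) * np.linalg.solve(P, psi) * err    # Eq. (4.41)
    return params

rng = np.random.default_rng(1)
xs = rng.uniform(-1.0, 3.0, size=5000)
ys = gaussian(xs, 1.0, 0.5)                      # noise-free targets, true (c, sigma) = (1.0, 0.5)
c_hat, sigma_hat = recursive_gauss_newton(xs, ys, params=[0.7, 0.7])
```

Replacing the `P` update by Eq. (4.43) would give the Levenberg–Marquardt variant, and fixing `P` to the identity would give steepest descent; solving the linear system in every step is affordable here only because there are two parameters.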

4.3.3. Learning of Consequents in EFC

The most common choice in EFC design for consequent learning is simply to use class majority voting for each rule separately. This can be achieved by incrementally counting the number of samples $h_{ik}$ falling into each class k and rule i (the rule that is the nearest one to the current sample in the data stream process). The class with the majority count, $k^* = \operatorname{argmax}_{k=1,\dots,K} h_{ik}$, is the consequent class of the corresponding (ith) rule in the case of the classical single-model architecture in Eq. (4.21). The confidences in each class per rule can be obtained by the relative frequencies among all classes, $conf_{ik} = h_{ik} / \sum_{k=1}^{K} h_{ik}$, in the case of the extended architecture in Eq. (4.22). For all-pairs classifiers, the same strategy can be applied within each single binary classifier. For the one-versus-rest architecture, RFWLS and all of its robust extensions can be applied to each class separately by using an indicator-based RFWLS scheme [96]. An enhanced confidence calculation scheme will be handled under the scope of reliability in Section 4.4.4.
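The counting scheme described above can be sketched in a few lines. The class `RuleClassCounter` and its method names are hypothetical, and assigning each sample to the rule with the closest center in the Euclidean sense is an assumption made for this example; the actual winner criterion depends on the EFC approach.

```python
import numpy as np

class RuleClassCounter:
    """Per-rule class counts h[i, k] for majority-vote consequents (illustrative helper)."""

    def __init__(self, n_rules, n_classes):
        self.h = np.zeros((n_rules, n_classes), dtype=int)

    def update(self, centers, x, label):
        """Assign the sample to its nearest rule center and count its class label."""
        nearest = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
        self.h[nearest, label] += 1
        return nearest

    def consequent_classes(self):
        """Majority class k* per rule (single-model architecture)."""
        return self.h.argmax(axis=1)

    def confidences(self):
        """Relative class frequencies per rule; rows of unseen rules stay at zero."""
        totals = self.h.sum(axis=1, keepdims=True)
        return np.divide(self.h, totals, out=np.zeros(self.h.shape), where=totals > 0)

centers = np.array([[0.0, 0.0], [5.0, 5.0]])     # two rule centers (made-up values)
counter = RuleClassCounter(n_rules=2, n_classes=3)
stream = [([0.1, -0.2], 0), ([0.3, 0.1], 0), ([-0.2, 0.4], 2), ([5.1, 4.8], 1)]
for x, label in stream:
    counter.update(centers, np.array(x), label)
```

Because only integer counts are stored, the update is single-pass and needs no past samples; the confidences are recomputed from the counts whenever they are requested.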

4.3.4. Evolving Mechanisms

A fundamental property of evolving systems, which distinguishes them from adaptive systems, is that they possess the capability to change their structure during online, incremental learning cycles; adaptive systems are only able to update their parameters as described in the two preceding subsections, which typically induces only a movement and/or a reshaping of model components. The evolving technology, however, addresses the dynamic expansion and contraction of the model structure and thus of the knowledge contained in the model, which typically induces the addition or deletion of whole model components. Therefore, most of the EFS approaches foresee the following fundamental concepts (or at least a significant subset of them):

• Rule evolution: It addresses the problem of when and how to evolve new rules on-the-fly and on demand → knowledge expansion.
• Rule pruning: It addresses the problem of when and how to prune rules in the rule base on-the-fly and on demand → knowledge contraction.
• Rule merging: It addresses the problem of when and how to merge two or more rules once they overlap and thus contain redundant information → model simplification.
• Rule splitting: It addresses the problem of when and how to split rules (into two) that contain an inhomogeneous data representation and thus contribute inaccurately to the model output → knowledge refinement.

The first issue guarantees the inclusion of new system states, operation modes, and process characteristics in the models to enrich their knowledge and expand them to so far unexplored regions of the feature space. The second issue guarantees that a rule base cannot grow forever and become extremely large; hence, it is responsible for reasonable computation times and compactness of the rule base, which may be beneficial for interpretability reasons, see Section 4.5.2. In addition, it is a helpful engine for preventing model over-fitting. The third issue avoids unnecessary complexity due to redundant information contained in two or more rules, which can happen because of data cloud/cluster fusion effects. The fourth issue avoids inaccurate, blown-up rules due to data cloud/cluster delamination effects, which often happen because of a permanent adaptation of rules during gradually drifting states. Figure 4.9 visualizes the induction of the four cases due to the characteristics of new incoming data samples (rectangular markers): old rules are shown by ellipsoidal solid lines, and updated and new ones by dashed and dotted lines.

Figure 4.9: Four cases of data stream sample appearances requiring evolving concepts: from rule evolution (Case 1) via rule pruning (Case 2) and merging (Case 3) to rule splitting (Case 4).

The current state of the art in EFS is that these concepts are handled in many different ways in different approaches; see [30] for a recent larger collection of published approaches. Due to the space limitations of this book chapter, it is not possible to describe all the various options for rule evolution, pruning, merging, and splitting anchored in the various EFS approaches. Therefore, in the following subsections, we outline the most important directions, which enjoy some common understanding and usage in various approaches.

4.3.4.1. Rule evolution

One concept that is widely used for autonomous rule evolution is incremental, evolving clustering (see [157] for a survey of methods), which searches for an optimal grouping of the data into several clusters, ideally following the natural distribution of data clouds, in an incremental and evolving manner (i.e., new clusters are automatically added on demand from stream samples). In particular, the aim is that similar samples contribute to the (formation of the) same cluster while different samples fall into different clusters [158]. When using prototype-based clustering techniques emphasizing clusters with convex shapes (e.g., ellipsoids), each evolved cluster can be directly associated with a rule representing a compact local region in the feature space. The fuzzy sets composing the rule antecedents can then be obtained by projection onto the axes; see [67] for various projection possibilities. Similar considerations can be made when performing clustering in the principal component space, as proposed in [159]. When clusters with arbitrary shapes are evolved from the stream, also termed data clouds as proposed and used in evolving cloud-based models [100] (see Section 4.2.6), the centers of the clouds can be associated with the centers of the rules, and the ranges of influence of the rules are estimated by empirical density estimators, which are also directly used as rule membership degrees and are thus termed empirical membership functions [160]. Such empirical functions have been successfully embedded in the autonomous learning multimodel systems approach (ALMMo) [102], where the concepts of eccentricity and typicality are used to form clouds of arbitrary shape [124]; remarkably, these two concepts do not need any learning parameters and thus can be used in a full plug-and-play manner.
In order to check new samples for cluster compatibility versus incompatibility (based on which it can be decided whether a new cluster, i.e., a new rule, needs to be evolved or not), various compatibility concepts are applied, which differ between the approaches:


• Some use distance-oriented criteria (e.g., DENFIS [161], FLEXFIS [140], eFuMo [162, 163], ePFM [164], or InFuR [165]), where the distance of new incoming samples to existing clusters/rules is checked in order to decide whether sufficient novelty is present; if so, a new rule is evolved.

• Some use density-based criteria (e.g., eTS [91] and its extension eTS+ [166] with several successors as summarized in [14], or the approach in [167]), which mostly rely on potential concepts to estimate whether new samples fall into a newly arising dense region where no cluster/rule has been defined so far, thus showing a higher density than the prototypes already generated. Similar concepts are adopted in cloud-based approaches [101, 102], where it is additionally checked whether a new data cloud possibly overlaps with existing clouds; if so, the already available prototype (or focal point) is replaced by the new sample. Different inner norms can be used when recursively calculating the density [168], which induces different types of distance measures in the Cauchy-type density calculation.

• Some others use statistically oriented criteria (e.g., ENFM [169], GS-EFS [67] or SAFIS [170] and its extensions [171, 172], PANFIS [141], GENEFIS [66]), especially to check the significance of the present rules as well as prediction intervals and tolerance regions surrounding the rules, or to perform goodness-of-fit tests based on statistical criteria (e.g., F-statistics) for candidate splits. In the latter case, the leaves are replaced with a new subtree, inducing an expansion of the hierarchical fuzzy model (as, e.g., in [173] or in the incremental LOLIMOT (Local Linear Model Tree) approach proposed in [174]).

• Some also use fuzzy-set-theoretic concepts such as coverage subject to ε-completeness (e.g., SOFNN [143, 175] or SCFNN [176]).

• Some rely on Yager’s participatory learning concept [177], comparing the arousal index and a specific compatibility measure with thresholds, as performed in ePL [178, 179] and eMG [64].

• Granular evolving methods, such as IBeM [83], FBeM [65, 180], and eGNN [181, 182], consider a maximum expansion region (a hyper-rectangle) around information granules. Granules and expansion regions are time-varying and may contract and enlarge independently for different attributes based on the data stream, the frequency of activation of granules and rules, and the size of the rule base.

The compatibility concept directly affects the rule evolution criterion, often being a threshold (e.g., a maximal allowed distance) or a statistical control limit (e.g., a prediction interval or tolerance region), which decides whether a new rule should be evolved or not. Distance-based criteria may be more prone to outliers than density-based and statistical-based ones; on the other hand, the latter ones are usually more lazy until new rules are evolved (e.g., a significant new dense area is required before a new rule is evolved there).

All of the aforementioned rule evolution criteria have the appealing characteristic that they are unsupervised; thus, no current values of the target variables are required. This makes them directly applicable to systems with significant latency and time lags for target recordings [183]. For supervised learning without any latency, evolution criteria can also be based on the (accumulated) error between the measured and estimated outputs. A high error may indicate a high bias of the model, i.e., insufficient flexibility to generalize well to new data; hence, its flexibility (degree of nonlinearity) needs to be increased by evolving new rules. This is taken into account, e.g., in the following methods: GD-FNN [184], SAFIS [170], SCFNN [176], AHLTNM [185], eFuMo [163], and others. It has the advantage that the actual model performance is checked to determine whether the model can (still) generalize to the new data or not; this may even be the case when extrapolation on new samples is already high, depending on the extrapolation behavior of the model (often influenced by its degree of nonlinearity), an issue which is not respected by any unsupervised criterion. An example of this aspect is visualized in Figure 4.10, where rectangular new samples fall into the extrapolation region of the rules but do not induce a high error (as they are close to the model surface shown as solid line); on the other hand, circular new samples also falling into the extrapolation region are far away from the model surface, thus inducing a high error → evolution required (the evolved model is shown as dotted line).

Figure 4.10: Rule evolution due to high model error on new samples (circular filled dots) → evolved model in dotted line; rectangular new samples do not require an expansion of the original model (solid line), as the model error on them is low.

It may also be recommended to consider multiple criteria and different conditions for cluster/rule evolution. This is, for instance, performed by NeuroFAST [186] or eFuMo [163]. In eFuMo, the concept of delay of evolving mechanisms is introduced, which is an interval in which evolving mechanisms are not enabled. Only adaptation of centers and model parameters is conducted during such time intervals. The delay of evolving mechanisms takes place after a change in the structure of the system performed by any evolving mechanism, so that the model has a certain period to adapt to the new structure.

Whenever a new rule C + 1 is evolved, its center is usually set to the current data sample, $\vec{c}_{C+1} = \vec{x}$, and its initial spread is typically set in such a way that minimal coverage with neighboring rules is achieved, often also to assure ε-completeness (the concrete formula is typically approach-dependent). The consequent parameters $\vec{w}_{C+1}$ are either set to those of the nearest rule (as originally suggested in [91]) or to a vector of zeros, according to the considerations regarding good parameter convergence made in [29] (Chapter 2). The inverse Hessian matrix of the new rule, $P_{C+1}$, is set to αI, with α a big value and I the identity matrix, again according to the considerations regarding good parameter convergence made in [29] (Chapter 2).

Fundamental and quite common to many incremental clustering approaches is the update of the centers $\vec{c}$ defining the cluster prototype, given by

$$\vec{c}(N+1) = \frac{N\,\vec{c}(N) + \vec{x}(N+1)}{N+1} \qquad (4.44)$$

and the update of the inverse covariance matrix $\Sigma^{-1}$ defining the shape of ellipsoidal clusters (applicable in the case when Gaussian functions are applied for rule representation), given by

$$\Sigma^{-1}(N+1) = \frac{\Sigma^{-1}(N)}{1-\alpha} - \frac{\alpha}{1-\alpha}\,\frac{(\Sigma^{-1}(N)(\vec{x}-\vec{c}))(\Sigma^{-1}(N)(\vec{x}-\vec{c}))^T}{1 + \alpha\,((\vec{x}-\vec{c})^T \Sigma^{-1}(N)(\vec{x}-\vec{c}))}, \qquad (4.45)$$

with $\alpha = \frac{1}{N+1}$ and N the number of samples seen so far. Usually, various clusters/rules are updated, each with its own inverse covariance matrix $\Sigma_i^{-1}$; thus, the symbol $N = k_i$ then represents the number of samples “seen by the corresponding cluster so far,” i.e., falling into the corresponding cluster so far (denoted as the support of the cluster (rule)). Note that these updates are adiabatic in the sense that they deliver the same solution as the batch off-line estimations.
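Eqs. (4.44) and (4.45) can be checked numerically with a short sketch. The code is an illustration under two assumptions that the text does not spell out: the center used in Eq. (4.45) is taken to be the one before it is updated by the new sample, and the batch reference is the covariance normalized by N (not N − 1). The function name `update_cluster` is hypothetical.

```python
import numpy as np

def update_cluster(center, inv_cov, n, x):
    """Update center (Eq. 4.44) and inverse covariance (Eq. 4.45) with one sample.

    center, inv_cov: statistics of the n samples seen so far; x: new sample.
    """
    alpha = 1.0 / (n + 1)
    d = x - center                               # deviation from the old center (assumption)
    v = inv_cov @ d
    new_inv_cov = inv_cov / (1.0 - alpha) - (alpha / (1.0 - alpha)) * np.outer(v, v) / (
        1.0 + alpha * (d @ v)
    )
    new_center = (n * center + x) / (n + 1)
    return new_center, new_inv_cov, n + 1

rng = np.random.default_rng(2)
mixing = np.array([[1.0, 0.3, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 0.5]])
X = rng.normal(size=(300, 3)) @ mixing           # correlated synthetic samples

# Initialize from a small batch, then process the remaining samples one by one.
n = 10
center = X[:n].mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X[:n].T, bias=True))
for x in X[n:]:
    center, inv_cov, n = update_cluster(center, inv_cov, n, x)

batch_inv_cov = np.linalg.inv(np.cov(X.T, bias=True))
```

Under these assumptions the recursively updated `center` and `inv_cov` agree with the batch estimates up to rounding, and no matrix inversion is needed after the initialization.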

4.3.4.2. Rule pruning

Mechanisms for pruning rules are convenient to delete old or inactive ones, which are no longer valid. These mechanisms are of utmost importance to assure a fast computation speed in model updates—with the ultimate aim to meet real-time demands; that is, the model update is finished with a new sample before the next one arrives. Furthermore, for explanation and interpretability aspects (see Section 4.5), it is beneficial to achieve compact rule bases and networks.


In general, it may happen that a rule is created in a part of the input–output space where there are just a few representative samples, which seems to indicate new knowledge that should be covered by the model (through the evolution of a new rule, see above). At the time of creation, this may have been triggered by errors in measurements or by a change of the system behavior. Later, however, this change may not be confirmed, such that the rule remains insignificant (compared to other rules; see the second case in Figure 4.9) and/or remains untouched for a longer time (also termed an idle rule). The mechanisms to remove rules are mainly based on the following principles:

• The age of the rule as proposed in the xTS approach [187], which is defined as the ratio between the accumulated time of past samples falling into the rule and the current time; rules are removed according to the ratio between the support set and the overall number of samples and their age.
• The size of the support set of the cluster, also termed rule utility as proposed in the eTS+ approach [166], which is elicited by the ratio between the number of cluster activations and the time since the rule was evolved and added to the model. A similar concept was used in the ALMMo approach [102].
• The contribution of the rule to the output error, as performed in SAFIS [170], GAP-RBF [188], and GD-FNN [184].
• The combination of the age and the total number of activations, as performed in IBeM [83], FBeM [65], and eGNN [181, 182].

Successful concepts were further proposed (i) in SOFNN [175], which is based on the sensitivity of the model output error according to the change of local model parameters (inspired by optimal brain surgeon concepts [189]); (ii) in PANFIS [141], where rules are removed when they are inconsequential in terms of contributing to past outputs and possible future estimations; (iii) in eFuMo [163] as an extension of the age concepts in eTS+ [166] based on the ratio between the support set and the age of the cluster; (iv) in the granular approach proposed in [190] as the concept of the half-life of a rule (or granule).

4.3.4.3. Rule merging

Most of the E(N)FS approaches that apply an adaptation of the rule contours, e.g., by recursive adaptation of the nonlinear antecedent parameters, are equipped with a merging strategy for rules. Whenever rules move together due to the nature of data stream samples falling in between them, they may inevitably become overlapping; see the third case in Figure 4.9 for an example: new samples marked by filled rectangles seem to open up a new region in the feature space (knowledge expansion), calling for the evolution of a new rule; later, however, when new samples (marked by non-filled rectangles) appear in between the two rules, they induce a movement of the two rules toward each other, which finally become overlapping and/or represent one homogeneous data cloud. This is also termed the (sequential, stream-based) data cluster fusion effect. Various criteria have been suggested to identify such occurrences and to eliminate them.

In [191], a generic concept has been defined for recognizing such overlapping situations on the fuzzy set and rule level. It is applicable to most of the conventional EFS techniques, as it relies on a geometric criterion employing a rule inclusion metric that is independent of the evolving learning engine. This has been expanded in [67] to the case of generalized, adjacent rules in the feature space, where it is checked whether the same trends in the rule antecedents as well as consequents appear, thus indicating a kind of joint homogeneity between the rules. This in turn avoids information and accuracy loss of the model. Generic rule merging formulas have been established in [67, 191]:

$$\vec{c}_{new} = \frac{\vec{c}_l k_l + \vec{c}_k k_k}{k_l + k_k},$$

$$\Sigma_{new}^{-1} = \frac{k_l \Sigma_l^{-1} + k_k \Sigma_k^{-1}}{k_l + k_k} + \frac{k_l k_k}{k_l + k_k}\,\mathrm{diag}\!\left(\frac{1}{(\vec{c}_{new} - \vec{c}_i)^T(\vec{c}_{new} - \vec{c}_i)}\right) .\!*\, I, \qquad (4.46)$$

$$k_{new} = k_l + k_k,$$

with indices k and l denoting the two rules to be merged, $\vec{c}_{new}$ the new (merged) center, and $\Sigma_{new}^{-1}$ its inverse covariance matrix. $(\vec{c}_{new} - \vec{c}_i)$ is a row vector with $i = \operatorname{argmin}_{k,l}(k_k, k_l)$, i.e., the index of the less supported rule (with support $k_i$), and diag denotes the vector of diagonal entries; the division in the numerator (1/…) (the rank-1 modification term for increasing robustness of the update) is done component-wise for each element of the matrix $(\vec{c}_{new} - \vec{c}_i)^T(\vec{c}_{new} - \vec{c}_i)$.

eFuMo [163] expands the concepts in [67] by also taking into account the prediction of the local models, where three conditions for merging are defined: angle condition, correlation condition, and distance ratio condition. An instantaneous similarity measure is introduced in FBeM [65] for multidimensional trapezoidal fuzzy sets, say $A^{i_1}$ and $A^{i_2}$, as

$$S(A^{i_1}, A^{i_2}) = 1 - \frac{1}{4n}\sum_{j=1}^{n}\left(|l_j^{i_1} - l_j^{i_2}| + |\lambda_j^{i_1} - \lambda_j^{i_2}| + |\Lambda_j^{i_1} - \Lambda_j^{i_2}| + |L_j^{i_1} - L_j^{i_2}|\right), \qquad (4.47)$$

where $A^i = (l^i, \lambda^i, \Lambda^i, L^i)$ is an n-dimensional trapezoid. Such a measure is more discriminative than, for example, the distance between centers of neighboring clusters, and its calculation is fast. In ENFM [169], two clusters are merged when the membership of the first cluster to the second cluster center is greater than a predefined threshold, and vice versa. In SOFNN [175] and extensions [143, 192], clusters are merged when they exhibit the same centers; in the extensions of SOFNN [143, 192], similarity is calculated on the fuzzy set level based on the Jaccard index, and when it is higher than a threshold, two fuzzy sets are merged: once the fuzzy sets for all premise parts in two (or more) rules coincide (as a result of merging), these rules can be directly replaced by a single rule. Various further similarity measures for fuzzy sets and rules can be found in [193, 194], most of which can be adapted to the online single-pass case in the context of E(N)FS, as they are fast to calculate and do not need historic data samples.

Furthermore, merging of clusters not only provides a less redundant (overlapping) representation of the local data distributions, but also keeps E(N)FS, in general, more compact and thus more interpretable and faster to adapt. In particular, distinguishability and simplicity, two important interpretability criteria (see Section 4.5.2), are addressed and often preserved by homogeneous rule merging.

4.3.4.4. Rule splitting

Rule splitting becomes necessary whenever rules become blown up and/or their (local) errors increase significantly over time, which can happen (1) due to cluster delamination effects, where at the beginning it seems that only one cluster (rule) is required to properly represent a local data distribution, but later two heterogeneous clouds crystallize out in the local region; (2) due to (permanent) gradually drifting patterns whose intensity is not strong enough to induce the evolution of a new cluster (rule); or (3) due to inappropriately set (too pessimistic) cluster evolution thresholds, such that too many local clouds are fused into one joint cluster (rule). Such a threshold (usually present in all ENFS methods) is often difficult to set in advance for a new stream modeling problem. The first two occurrences are visualized in Figure 4.11. Several techniques for splitting rules properly in an incremental single-pass context have been proposed. Basically, they use the concepts of local rule errors and volumes of rules as criteria to decide when to best perform a split. In NeuroFAST [186], clusters are split according to their mean square error (MSE). The algorithm calculates the error in each P step and splits the rule and the local model with the greatest error. The mechanism of splitting in eFuMo [163] is based on the relative estimation error, which is accumulated in a certain time interval. The error is calculated for each sample that falls in one of the existing clusters. The initialization of the resulting clusters is based on the eigenvectors of

Figure 4.11: Upper: blown-up cluster due to gradual drifts always updating the cluster a bit (no cluster evolution is triggered); lower: showing a cluster delamination effect due to samples originally appearing as one cluster (left), but when new samples fall into this cluster, two heterogeneous data clouds turn out (right).

the cluster covariance matrix, as in [195] and in [196] where non-linear system identification by evolving Gustafson–Kessel fuzzy clustering and supervised local model network learning for a drug absorption spectra process is presented. An efficient (fast) incremental rule splitting procedure in the context of generalized evolving fuzzy systems is presented in [197], especially for the purpose of splitting blown-up rules with high local estimation errors over the past iterations and extraordinarily high volumes compared to other rules (measured in terms of statistical process control strategies [1]). In this sense, it can autonomously compensate for those drifts which cannot be automatically detected (e.g., slowly gradual ones), see Section 4.4.1. The splitting is conducted along the first principal component of the covariance matrix (the longest direction span of a rule ellipsoid in the multi-dimensional space), by dividing a rule $i$ to be split into two equal halves:

$$\vec{c}_i(\text{split1}) = \vec{c}_i + \vec{a}_i\frac{\sqrt{\lambda_i}}{2}, \qquad \vec{c}_i(\text{split2}) = \vec{c}_i - \vec{a}_i\frac{\sqrt{\lambda_i}}{2} \tag{4.48}$$

with $\vec{c}_i(\text{split1})$ and $\vec{c}_i(\text{split2})$ being the centers of the two split rules; $\lambda_i$ corresponds to the largest eigenvalue of the covariance matrix of rule $i$, i.e., $\lambda_i = \max(\Lambda)$, and $\vec{a}_i$ to the corresponding eigenvector, both of which can be obtained through classical

eigendecomposition of the covariance matrix $\Sigma_i$. The latter is obtained for the two split rules as follows:

$$\Sigma_i(\text{split1}) = \Sigma_i(\text{split2}) = A \Lambda^{*} A^{T}, \qquad \Lambda^{*}_{jj} = \begin{cases} \Lambda_{jj} & j \neq 1 \\ \dfrac{\Lambda_{jj}}{4} & j = 1 \end{cases} \tag{4.49}$$

where $\Lambda_{11}$ is assumed to be the entry for the largest eigenvalue and $A$ is a matrix containing the eigenvectors of the original covariance matrix $\Sigma_i$ (stored in columns). An error criterion with an adaptive threshold, which is based on the number of rules compared to an upper allowed maximal number of rules (e.g., parameterized by a user), is also used in AHLTNM [185]. This approach has the appealing aspect that an undesired rule explosion, which would make an evolved rule base totally uninterpretable, can be avoided. The split in [185] operates on halving hyper-boxes, which geometrically represent rules embedding trapezoidal fuzzy sets.
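The split in Eqs. (4.48) and (4.49) can be illustrated with a short Python sketch. It is not the implementation from [197]. As a simplifying assumption, it finds the largest eigenpair by power iteration instead of a full eigendecomposition, and it uses the fact that dividing only the largest eigenvalue by four is the same as subtracting $\frac{3}{4}\lambda_i \vec{a}_i\vec{a}_i^T$ from the covariance matrix. The function names `dominant_eigenpair` and `split_rule` are hypothetical.

```python
import math

def dominant_eigenpair(cov, iters=200):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration).

    Assumes a non-zero covariance matrix whose largest eigenvalue is distinct,
    and a start vector that is not orthogonal to the dominant eigenvector.
    """
    n = len(cov)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(cov[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T cov v gives the eigenvalue for the unit vector v.
    lam = sum(v[r] * cov[r][c] * v[c] for r in range(n) for c in range(n))
    return lam, v

def split_rule(center, cov):
    """Split one rule into two halves along its first principal component."""
    lam, a = dominant_eigenpair(cov)
    shift = math.sqrt(lam) / 2.0                      # offset used in Eq. (4.48)
    c1 = [c + shift * ai for c, ai in zip(center, a)]
    c2 = [c - shift * ai for c, ai in zip(center, a)]
    n = len(cov)
    # Eq. (4.49): only the largest eigenvalue shrinks to lam / 4, which equals
    # subtracting 0.75 * lam * a a^T from the original covariance matrix.
    new_cov = [[cov[r][c] - 0.75 * lam * a[r] * a[c] for c in range(n)]
               for r in range(n)]
    return c1, c2, new_cov

# Axis-aligned rule with variance 4 along the first axis and 1 along the second.
c1, c2, new_cov = split_rule([0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
```

Both split rules share `new_cov`; in the example, the two centers move one unit apart from the original center along the first axis, and the variance along that axis drops from 4 to 1.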

4.3.5. Evolving Model Ensembles

During the last 2–3 years, a few approaches have emerged which do not deal with the update and evolution of a single ENFS model, but with a whole ensemble of ENFS models; see Section 4.2.7.6 for the motivation to build up model ensembles (diversity, robustness in terms of low and noisy samples, etc.). The approaches basically differ in three aspects: (i) how and with which samples the single base learners are evolved in order to assure diversity among them; (ii) when and how ensemble members are evolved and pruned; and (iii) how the members are aggregated to produce the final prediction/classification output, i.e., the definition of the aggregation operator Agg in Eq. (4.32) and the learning of its included (weight) parameters. One way to address model ensembling is through the concept of bagging [132], where sample bootstraps, also termed bags, are generated from the whole data set by drawing with replacement. The probability that a sample appears at least once in a bootstrap is 0.632, but it can appear 0 or multiple times according to the probabilities given by a binomial distribution. It is well known from the literature that generating models from various bags (also termed ensemble members) increases sample significance and robustness against higher noise levels when the members are combined to an overall output; the same can thus be expected for (neuro-)fuzzy systems learned on various bags extracted from the data stream. The approach in [198] (termed OB-EFS) proposes an online bagging variant of evolving fuzzy systems for regression problems, using generalized smart EFS as

base members [67] and where the bags are simulated through a statistically founded online sampling based on the Poisson distribution, which exactly simulates the off-line drawing procedure with replacement. It embeds an autonomous pruning of ensemble members in a soft manner by down-weighing the prediction of members with extraordinarily high accumulated error trends in a weighted averaging scheme. In this sense, it automatically emphasizes those members that better fit the current data situation in the stream. Furthermore, it integrates an automatic evolution of members in the case of explicitly detected drifts, based on the Hoeffding inequality [199] applied to accumulated past performance indicators. An online bagging approach for stream classification problems has recently been proposed in [200], which uses the eClass0 [201] and the ALMMo-0 [160] (which is based on 0-order AnYa-type fuzzy classifiers [100], see also Section 4.2.6) approaches as base members, and dynamically extracts a subset of members based on various diversity statistics measures in order to form various ensembles, which are then combined to an overall output by an advanced combination module. The dynamically changing ensemble members make this approach highly flexible in addressing arising drifts (see also Section 4.4.1). Another ensemble approach for stream classification problems has previously been proposed in [130], which is based on an online boosting concept, as (single-model) evolving fuzzy classifiers are trained based on different subsets of features, thus acting as evolving weak classifiers (as typically also carried out in boosting concepts [202]). Their predictions are amalgamated through weights of the base members, which are increased or decreased according to their classification correctness on recent stream samples: this mimics the weights as assigned in the classical (off-line) AdaBoost approach [133]. 
The base members are evolved based on the parsimonious evolving fuzzy classifiers (pClass) concept proposed in [117], which employs one-versus-rest classification (see Section 4.2.7.2) in combination with TS fuzzy models for establishing the {0, 1}-regression models based on indicator entries. Similarly to OB-EFS, it employs the Hoeffding inequality to decide when to evolve a new member. A remarkable online (trained) boosting variant for regression problems has been introduced in [129], which employs the interval-based granular architecture [65] as discussed in Section 4.2.4 for the base members and which learns weights of the base members used in advanced weighted aggregation operators [203] from the data. Thus, it can be seen as a trained combiner of ensemble members [204], which typically may outperform fixed combiners, because they can adapt to the data in a supervised context (to best explain the real target values). Another key issue of this approach is that it is able to optimize meta-parameters (steering rule evolution and merging) by a multi-objective evolutionary approach [205], which integrates four objectives based on statistical measures and model complexity. A variant/extension

of this technique is proposed in [206] with the usage of empirical data analysis for forming evolving data clouds as base learners, which are represented in a cloud-based fuzzy system. Finally, another variant for creating model ensembles on-the-fly is the concept of sequential ensembling as proposed in [131]. This introduces a completely different point of view on a data stream by not permanently updating an ensemble of existing models, but by generating a new fuzzy system on each new incoming data chunk. The combination of the fuzzy systems over chunks is then called a sequential ensemble, which in fact is an ensemble of fuzzy systems rather than of evolving fuzzy systems. This is because each fuzzy system is trained from scratch with a classical batch training approach. The key idea, thereby, is how to combine the predictions of the members for new data chunks in order to react flexibly to drifts. This is achieved through five different concepts respecting changes in the input data distributions as well as target concepts. A central feature is that it can spontaneously react to cyclic drifts by autonomously reactivating older members (trained on previous chunks).
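The Poisson-based simulation of bootstrapping mentioned for online bagging earlier in this subsection can be sketched as follows. In batch bagging, the probability that a given sample lands in a bag of size $N$ drawn with replacement is $1 - (1 - 1/N)^N$, which tends to $1 - 1/e \approx 0.632$ for large $N$; drawing an update count from a Poisson distribution with mean 1 for every member and every incoming sample reproduces this in a single pass. The sketch below is an illustration under these assumptions, not the OB-EFS code from [198]: the sampler (Knuth's multiplication method) and the helper `updates_per_member` are hypothetical names, and the base learners themselves are left out.

```python
import math
import random

def poisson_draw(rng, mean=1.0):
    """Draw one Poisson-distributed integer using Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def updates_per_member(n_members, rng):
    """How often each ensemble member would be updated with the current sample.

    A count of 0 means the sample is not in that member's (simulated) bag.
    """
    return [poisson_draw(rng) for _ in range(n_members)]

rng = random.Random(0)
counts = [poisson_draw(rng) for _ in range(20000)]
in_bag = sum(1 for k in counts if k >= 1) / len(counts)
print(round(in_bag, 3))  # empirical in-bag fraction, to be compared with 1 - 1/e
```

In a streaming loop, each member would simply be updated `k` times with the incoming sample, where `k` is its entry in `updates_per_member(...)`.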

4.3.6. EFS Approaches and Learning Steps

As can be noted from Figure 4.1, the total number of publications in the field of evolving neuro-fuzzy systems is 1,652 (status at the end of year 2020). A rough check and count showed that approximately 500–600 different EFS approaches containing different structural and parameter learning algorithms have been published so far (with the most significant explosion during the last 10 years). Hence, it is simply impossible to provide a full list of all approaches with a description of their basic characteristics and properties within this chapter. We thus refer to the recent survey in [30], which discusses several methods in more detail with respect to their parametrization efforts and the types of rule evolution and antecedent learning concepts they address. We also refer to a previous version of this chapter published in [88], where 32 “core approaches” are listed in tables along four pages and their basic properties compared against each other. These core approaches were basically those that performed early pioneering work in E(N)FS during the 2000s up to 2013, i.e., during times when the community was much smaller (about 25–30 key persons). Furthermore, many approaches (more than 100 in total) are referenced in this chapter during the summarization of the core learning concepts in E(N)FS in the previous (sub)sections and also further below; in this context, several pioneering works are addressed that proposed essentially novel learning concepts beyond the state of the art at the time of their publication. Moreover, when coming to the useability and past applications of E(N)FS in

Section 4.6, we provide a larger list of E(N)FS approaches in Tables 4.1 and 4.2, along with an indication of the real-world scenarios in which they have been successfully applied. In order to give readers of this chapter a basic impression of what an evolving approach looks like and how it operates incrementally, typically in single-pass mode, on incoming samples, we demonstrate a pseudo-code skeleton in Algorithm 4.1. This shows, more or less, a common denominator regarding basic learning steps on a

Algorithm 4.1. Basic Learning Steps in an E(N)FS Approach

Input: Current model EFS with C rules.
(1) Load new data sample $\vec{x}$.
(2) Pre-process data sample (normalization, transformation to reduced space, feature selection, …).
(3) If rule-base in EFS is empty (C = 0),
    (a) Initialize the first rule, typically with its center set to the data sample, $\vec{c}_{C+1} = \vec{x}$, and its spread $\sigma_{C+1}$ or $\Sigma_{C+1}$ set to a minimal coverage value; return.
(4) Else, perform the following steps (5-9):
(5) Check if rule evolution criteria are fulfilled
    (a) If yes, evolve a new rule (Section 4.3.4.1) and perform Step (3)(a); set C = C + 1; return.
    (b) If no, proceed with the next step.
(6) Update antecedent parts of (some or all) rules (Sections 4.3.4 or 4.3.2).
(7) Check if the rule pruning and/or merging criteria of updated rules are fulfilled
    (a) If yes, prune the rule or merge the corresponding rule(s) to one rule (Sections 4.3.4.2 and 4.3.4.3); set C = C − 1.
    (b) If no, proceed with the next step.
(8) Check if the rule splitting criteria of the updated rules are fulfilled
    (a) If yes, split the corresponding rule(s) into two rules (Section 4.3.4.4); overwrite the rule with one split rule and append the other split rule as the (C + 1)st rule; set C = C + 1.
(9) In the case of neuro-fuzzy approaches: construct neurons for the updated rules using internal norms.
(10) Update consequent parameters of (some or all) rules (Sections 4.3.3 for classification and 4.3.1 for regression).
Output: Updated model EFS.

single sample as used in E(N)FS approaches based on single models. One comment refers to the update of antecedents and consequents (Steps 6 and 10); some approaches may only update those of some rules (e.g., the rule corresponding to the winning cluster), others may always update those of all rules. The former may have some advantages regarding preventing the unlearning effect in parts where actual samples do not fall [29], and the latter typically achieves significance and thus reliability in the rule parameters faster (as more samples are used for updating). A further comment goes to Step 2, which, depending on the approach, may contain several techniques for pre-processing a new sample. For instance, several methods support a transformation and/or feature selection step (see Section 4.4.2) in order to reduce curse of dimensionality in the case of multi-dimensional large-scale streams (Big Data); this, in turn, decreases the effect of over-fitting and improves robustness of the model updates. Methods which are not scale-invariant (such as methods based on Euclidean distances in the cluster space) require a normalization of the stream sample in order to avoid the (undesired) domination of features with larger ranges and/or higher medians than other features [207], which typically leads to the formation of wrongly placed rules. Please note that all the methods for updating the consequent parameters described in Sections 4.3.1 and 4.3.1.1 are scale-invariant (so, the demand for normalization solely depends on the criteria used for updating the antecedent space). 
Furthermore, Steps 7 (pruning and/or merging) and 8 (splitting) may appear in a different order or may not even be used in an EFS approach at all, leaving only the evolution step in Step 5. This step, however, is necessary to make an incremental learning approach actually evolving; without it, the approach would be simply adaptive (updating only the parameters without evolving its structure), as is the case in several adaptive methods [49] already proposed during the 1990s, e.g., within the scope of adaptive control [208].
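The step order of Algorithm 4.1 can be made concrete with a deliberately small Python sketch. It is a toy under stated assumptions, not any of the published E(N)FS methods: the class name `ToyEFS`, the Euclidean distance threshold `evolve_dist` as evolution criterion, the incremental-mean center update, and the zero-order consequents (running mean of the target) are all choices made only for illustration. Pruning, merging, splitting, and neuron construction (Steps 7 to 9) are omitted, and only the winning rule is updated.

```python
import math

class ToyEFS:
    """Toy single-pass learner following the step order of Algorithm 4.1."""

    def __init__(self, evolve_dist=1.0):
        self.evolve_dist = evolve_dist   # hypothetical rule evolution threshold
        self.centers = []                # rule centers c_i
        self.supports = []               # number of samples k_i per rule
        self.consequents = []            # zero-order consequents (target means)

    def _add_rule(self, x, y):
        self.centers.append(list(x))
        self.supports.append(1)
        self.consequents.append(float(y))

    def _winner(self, x):
        dists = [math.dist(x, c) for c in self.centers]
        win = min(range(len(dists)), key=dists.__getitem__)
        return win, dists[win]

    def learn_one(self, x, y):
        # Steps 1-2: the sample is assumed to be loaded and pre-processed already.
        if not self.centers:                       # Step 3: empty rule base
            self._add_rule(x, y)
            return
        win, dist = self._winner(x)
        if dist > self.evolve_dist:                # Step 5: evolve a new rule
            self._add_rule(x, y)
            return
        # Step 6: update the antecedent of the winning rule (incremental mean).
        self.supports[win] += 1
        k = self.supports[win]
        self.centers[win] = [c + (xi - c) / k
                             for c, xi in zip(self.centers[win], x)]
        # Steps 7-9 are intentionally left out of this sketch.
        # Step 10: update the consequent of the winning rule.
        self.consequents[win] += (y - self.consequents[win]) / k

    def predict(self, x):
        win, _ = self._winner(x)
        return self.consequents[win]

model = ToyEFS(evolve_dist=1.0)
for x, y in [([0.0, 0.0], 1.0), ([0.2, 0.0], 1.0), ([5.0, 5.0], 3.0), ([5.2, 5.0], 3.0)]:
    model.learn_one(x, y)
print(len(model.centers))  # number of rules evolved from the four samples
```

Removing the Step 5 branch would turn this into a purely adaptive learner, which mirrors the distinction between adaptive and evolving approaches drawn in the text.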

4.4. Stability and Reliability

Two important issues when learning from data streams are (i) the assurance of stability during the incremental learning process and (ii) the investigation of the reliability of model outputs in terms of predictions, forecasts, quantifications, classification statements, etc. These usually lead to an enhanced robustness of the evolved models. Stability is usually guaranteed by all properly designed E(N)FS approaches, as long as the data streams from which the models are learnt appear in a quite “smooth, expected” fashion. However, specific occasions such as drifts and shifts [209] or high noise levels may appear in these, which require a specific handling within the learning engine of the EFS approaches in order to assure model stability by avoiding catastrophic forgetting and/or diverging parameters and structures.

Another problem is dedicated to high-dimensional streams, usually stemming from large-scale time-series [210] embedded in the data, and mostly causing a curse of dimensionality effect, which leads to over-fitting and thus to a decrease of their parameter stability and finally to significant downtrends in model accuracy. A few EFS approaches have embedded an online dimensionality reduction procedure so far in order to diminish this effect, see the discussion in Section 4.4.2. Although high noise levels can be automatically handled by all the consequent learning techniques described in Section 4.3.1 as well as by the antecedent learning engines discussed in Sections 4.3.2 and 4.3.4, reliability aspects in terms of calculating certainty in model outputs by respecting the noise and data uncertainty can be an additional valuable issue in order to understand model outputs better and to perform an adequate handling of them by experts/operators. Drift handling is included in several basic EFS methods in a straightforward manner, that is, by using forgetting factors in consequent learning, as described in Section 4.3.1. This section is dedicated to a summary of developments in stability, reliability, and robustness of EFS achieved so far, which can be also generically used in combination with most of the EFS approaches published so far.

4.4.1. Drift Handling in Streams

In data-stream mining environments, concept drift means that the statistical properties of either the input or target variables change over time in unforeseen ways. In particular, drifts denote changes either in the underlying data distribution (input space drift) or in the underlying relationship between model inputs and targets (target concept drift); see [209, 211] for recent surveys on drifting concepts. Figure 4.12 demonstrates two input space drifts in (a) and two target concept drifts in (b). In Figure 4.12(a), the left case indicates a drift in the mean of the distribution of the circular class, which leads to significant class overlap; the right case shows

Figure 4.12: Two drift concepts: (a) typical drifts in the input space (increasing class overlaps or moving to unexplored space); (b) typical drifts in the target concept (left: gradual, around the decision boundary, right: abrupt, class overlaid by other classes).

the same but moving into an unexplored region. In Figure 4.12(b), the left case shows a gradual drift where samples increasingly overlap around the decision boundary; the right case shows an abrupt drift where samples from a new class fall into the same region as samples from an older class. From a practical point of view, such drifts can happen because the system behavior and environmental conditions change dynamically during the online learning process, which makes the relations and concepts contained in the old data samples obsolete. They occur frequently in non-stationary environments [2] nowadays and thus require specific treatment. Such drifting situations are in contrast to new operation modes or system states, which should be integrated into the models with the same weight as the older ones in order to extend the models (knowledge expansion, see Section 4.3.4), while keeping the relations in states seen before untouched (being still valid and thus necessary for future predictions). Drifts, however, usually mean that the older learned relations (as structural parts of a model) are no longer valid and should thus be incrementally out-weighed and forgotten over time [209]. This is especially necessary when class overlaps increase, as shown in three out of four cases in Figure 4.12. A smooth forgetting concept for consequent learning employing the idea of exponential forgetting [208] is used in most of the current EFS approaches [30, 88]. The strategy in all of these is to integrate a forgetting factor $\lambda \in [0, 1]$ for strengthening the influence of newer samples in the Kalman gain $\gamma$; see Eq. (4.38). Figure 4.13 shows the weight impact of samples obtained for different forgetting factor values.

Figure 4.13: Smooth forgetting strategies achieving different weights for past samples; compared to a sliding window with fixed width inducing a complete forgetting of older samples.
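The kind of weight profiles contrasted in Figure 4.13 can be reproduced with a few lines of Python. The sketch assumes the standard exponentially weighted least-squares interpretation of a forgetting factor, in which a sample that is $a$ steps old receives the weight $\lambda^{a}$; the function names and the window width are hypothetical and not taken from the chapter.

```python
def exponential_weights(n_samples, lam):
    """Weights of samples 0 (oldest) .. n_samples-1 (newest) under smooth forgetting."""
    return [lam ** (n_samples - 1 - j) for j in range(n_samples)]

def sliding_window_weights(n_samples, width):
    """Weights under a sliding window: the last `width` samples count fully, all others not at all."""
    return [1.0 if j >= n_samples - width else 0.0 for j in range(n_samples)]

smooth = exponential_weights(5, 0.5)    # weights decay gradually with sample age
window = sliding_window_weights(5, 2)   # weights drop abruptly to zero
print(smooth)
print(window)
```

A forgetting factor of 1 weights all samples equally (no forgetting), whereas smaller values shorten the effective memory; the sliding window, in contrast, discards older samples abruptly.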

Figure 4.13 also compares smooth forgetting with a standard sliding window technique, which weights all samples up to a certain point of time in the past equally, but completely forgets all earlier ones; this is a non-smooth forgetting that may induce catastrophic forgetting [212]. The effect of this forgetting factor integration is that changes of the target concept in regression problems can be tracked properly; i.e., a movement and shaping of the current regression surface towards the new output behavior is enforced, see [213]. Regarding drift handling in the antecedent part in order to address changes in the (local) data distributions of the input space (as shown by way of example in the upper plot of Figure 4.11), some techniques may be used, such as a reactivation of rule centers and spreads from a converged situation by an increase of the learning gain on new incoming samples. This was first conducted for EFS in the pioneering work in [213] in combination with the two approaches eTS [91] and FLEXFIS [140], based on the concept of rule ages (being able to measure the intensity of drifts according to changes in the second derivative). It has the effect that rules are released from their stuck, converged positions to cover the new data cloud appearances of the drifted situation well, while forgetting samples from the older clouds (occurring before the drift took place). In eFuMo [163], the forgetting in antecedent learning is integrated as the degree of the weighted movement of the rule centers $\vec{c}_i$ towards a new data sample $\vec{x}_{N+1}$:

$$\vec{c}_i(N+1) = \vec{c}_i(N) + \Delta\vec{c}_i(N+1) \quad \text{with} \quad \Delta\vec{c}_i(N+1) = \frac{\mu_i(\vec{x}_{N+1})^{\eta}\,\left(\vec{x}_{N+1} - \vec{c}_i(N)\right)}{s_i(N+1)} \tag{4.50}$$

with $s_i(N+1) = \lambda s_i(N) + \mu_i(\vec{x}_{N+1})^{\eta}$ being the sum of the past memberships $\mu_i(\vec{x}_j)$, $j = 1, \ldots, N$, $\eta$ the fuzziness degree also used as a parameter in fuzzy C-means, and $\lambda$ the forgetting factor. Forgetting is also integrated in the inverse covariance matrix and determinant update defining the shape of the clusters. Handling of local drifts, which are drifts that may appear with different intensities in different parts of the feature space (thus affecting different rules with varying intensity), is considered in [214]. The idea of this approach is that different forgetting factors are used for different rules instead of a global factor. This steers the local level of flexibility of the model. Local forgetting factors are adapted according to the local drift intensity (elicited by a modified variant of the Page–Hinkley test [215]) and the contribution of the rules to previous model errors. Such dynamic adaptation of the forgetting factor, according to the current drift intensity, has the appealing effect that during non-drift phases the conventional life-long adaptation of the fuzzy model takes place. This, in turn, increases sample significance and thus stability of the learned parameters and structure during non-drift phases; a permanent forgetting during these may end up with decreased performance, as could be verified in [214].
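The center update with forgetting in Eq. (4.50) translates almost literally into code. The following Python sketch is only a transcription of that formula under the notation given above; the function name, the argument names, and the default values for $\eta$ and $\lambda$ are hypothetical, and the membership degree $\mu_i(\vec{x}_{N+1})$ is assumed to be computed elsewhere.

```python
def center_update_with_forgetting(center, s_prev, x_new, membership,
                                  eta=2.0, lam=0.99):
    """One step of Eq. (4.50) for a single rule.

    center     -- current rule center c_i(N)
    s_prev     -- accumulated (discounted) membership sum s_i(N)
    x_new      -- new sample x_{N+1}
    membership -- membership degree of x_new in rule i
    eta        -- fuzziness degree (assumed default)
    lam        -- forgetting factor (assumed default)
    """
    s_new = lam * s_prev + membership ** eta          # s_i(N+1)
    gain = membership ** eta / s_new                  # step size towards x_new
    new_center = [c + gain * (xi - c) for c, xi in zip(center, x_new)]
    return new_center, s_new

# Without forgetting (lam = 1) the center moves half-way towards the sample here;
# with lam = 0.5 the discounted sum is smaller and the step is larger.
no_forget, _ = center_update_with_forgetting([0.0, 0.0], 1.0, [2.0, 0.0], 1.0, lam=1.0)
forget, _ = center_update_with_forgetting([0.0, 0.0], 1.0, [2.0, 0.0], 1.0, lam=0.5)
```

Smaller values of `lam` discount the accumulated memberships more strongly, so recent samples pull the center further, which is the intended reactivation effect in drifting phases.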

The distinction between target concept drifts and input space drifts could be achieved in [216] by using a semi-supervised and a fully unsupervised drift indicator. It, thus, can operate on scarcely labeled (selected by online active learning methods, see Section 4.6.1), and also fully unlabeled samples. It uses evolving fuzzy classifiers in the extended single-model architecture form with EFC-SM as the learning engine, which embeds generalized rules [217]. Another form of drift is the so-called cyclic drift, where changes in the (input/target) data distribution may happen at a certain point of time, but, later, older distributions are revisited. ENFS approaches to deal with such drift cases were addressed in [74, 218] using type-2 recurrent neuro-fuzzy systems, termed as eT2RFNN. The idea is to prevent re-learning of older local distributions from scratch and thus to increase the early significance of the rules. Furthermore, EFS ensemble techniques as discussed in Section 4.3.5 automatically embed mechanisms to address cyclic drifts early and properly, by activating and/or over-weighing those members (previously learned) in the calculation of final predictions that best meet the data distribution in the current data chunks. Whenever a drift cannot be explicitly detected or it implicitly triggers the evolution of a new rule/neuron, a posteriori drift compensation [197] is a promising option in order to (back-)improve the accuracy of the rules. This can be achieved through incremental rule splitting techniques as discussed in Section 4.3.4.4. Furthermore, such rule splitting concepts are typically parameter-free, which abandons the usage and tuning of a forgetting factor, thus decreasing parametrization effort.

4.4.2. On-line Curse of Dimensionality Reduction

High dimensionality of the data stream mining and modeling problem becomes apparent whenever a larger variety of features and/or system variables are recorded, e.g., in multi-sensor networks [219], which characterize the dependencies and interrelations contained in the system/problem to be modeled. For models that include localized granular components, as is the case for evolving neuro-fuzzy systems, it is well known that the curse of dimensionality is very severe when a high number of variables are used as model inputs [23, 84]. This is basically because in high-dimensional spaces one can no longer speak of locality (on which these types of models rely), as all samples move to the edges of the joint feature space; see [116] (Chapter 1) for a detailed analysis of this problem. Therefore, the reduction of the dimensionality of the learning problem is highly desired. In a data stream sense, to ensure an appropriate reaction to the system dynamics, the feature reduction should be conducted online and be open to changes at any time. One possibility is to track the importance levels of features over time and

to cut out those that are unnecessary, as has been done in connection with EFS for regression problems in [66, 166] and for classification problems in a first attempt in [109]. In [166], the contribution of the features in the consequents of the rules is measured in terms of their gradients in the hyper-planes; those features whose contribution over all rules is negligible can be discarded. This has been extended in [66] by also integrating the contribution of the features in the antecedent space (regarding their significance in the rules' premise parts). A further approach has been suggested in [220], where an online global crisp feature selection was extended to a local variant, in which a separate feature (importance) list is incrementally updated for each rule. This was useful to achieve more flexibility due to local feature selection characteristics, as features may become important to different degrees in different parts of the feature space. However, features that are unimportant at an earlier point of time may become important at a later stage (feature reactivation). This means that crisply cutting out some features with the usage of online feature selection and ranking approaches such as [221, 222] can fail to represent the recent feature dynamics appropriately. Without a re-training cycle, which, however, slows down the process and requires additional sliding window parameters, this would lead to discontinuities in the learning process, as parameters and rule structures have been learned on different feature spaces before. An approach that addresses input structure changes incrementally on-the-fly is presented in [223] for classification problems using classical single model and one-versus-rest based multi-model architectures, see Section 4.2.7.2 (in connection with the FLEXFIS-Class learning engine). It operates on a global basis; hence, features are seen as either important or unimportant for the whole model. The basic idea is that feature weights $\lambda_1, \ldots, \lambda_p \in [0, 1]$ for the $p$ features included in the learning problem are calculated based on a stable separability criterion [224] (deduced from Fisher’s separability criterion [225] as used in (incremental) discriminant analysis [226, 227]):

$$J = \mathrm{trace}(S_w^{-1} S_b) \tag{4.51}$$

with Sw the within scatter matrix modeled by the sum of the covariance matrices for each class, and Sb the between scatter matrix modeled by the sum of the degree of mean shift between classes. The criterion in Eq. (4.51) is applied (i) dimension-wise to see the impact of each feature separately—note that in this case, it reduces to a ratio of two variances—and (ii) for the remaining p − 1 feature subspace in order to gain the quality of separation when excluding each feature. The latter approach is termed as LOFO (leave-one-feature-out) approach and has also been successfully applied in an evolving neuro-fuzzy approach for extracting (interpretable) rules in the context of heart sounds identification [228]. The former has been successfully used recently

in an evolving neuro-fuzzy systems approach termed ENFS-Uni0 [96] for rule length reduction and thus for increasing the readability of rules. In both cases, p criteria $J_1, \ldots, J_p$ according to Eq. (4.51) are obtained. For normalization to [0, 1], the feature weights are finally defined by the following:

$$\lambda_j = 1 - \frac{J_j - \min_{j=1,\ldots,p}(J_j)}{\max_{j=1,\ldots,p}(J_j) - \min_{j=1,\ldots,p}(J_j)}. \tag{4.52}$$
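The LOFO variant of Eqs. (4.51) and (4.52) can be sketched in batch form as follows. This is a minimal illustration under stated assumptions, not the incremental implementation of [223]: the scatter matrices are computed from all samples at once rather than recursively, $S_b$ is assumed to be the sum of class-size weighted outer products of the class mean shifts, and the function names (`separability`, `lofo_weights`) are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def _solve(a: Matrix, b: Matrix) -> Matrix:
    """Return a^{-1} b via Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("within-class scatter matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def separability(X: Sequence[Sequence[float]], y: Sequence[int],
                 cols: Sequence[int]) -> float:
    """J = trace(Sw^{-1} Sb), Eq. (4.51), restricted to the features in cols."""
    d = len(cols)
    overall = [sum(x[c] for x in X) / len(X) for c in cols]
    sw = [[0.0] * d for _ in range(d)]
    sb = [[0.0] * d for _ in range(d)]
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        mu = [sum(r[c] for r in rows) / len(rows) for c in cols]
        for a in range(d):
            for b in range(d):
                # Sw: sum of the (biased) class covariance matrices.
                sw[a][b] += sum((r[cols[a]] - mu[a]) * (r[cols[b]] - mu[b])
                                for r in rows) / len(rows)
                # Sb (assumed form): class-size weighted mean shift.
                sb[a][b] += len(rows) * (mu[a] - overall[a]) * (mu[b] - overall[b])
    m = _solve(sw, sb)
    return sum(m[i][i] for i in range(d))


def lofo_weights(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Feature weights by Eq. (4.52); J_j is computed with feature j left out."""
    p = len(X[0])
    J = [separability(X, y, [c for c in range(p) if c != j]) for j in range(p)]
    lo, hi = min(J), max(J)
    if hi == lo:  # all features equally (un)important
        return [1.0] * p
    return [1.0 - (Jj - lo) / (hi - lo) for Jj in J]
```

A feature whose removal destroys the class separation yields the smallest $J_j$ and therefore receives the weight 1, whereas the feature whose removal hurts least receives the weight 0.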

To be applicable in online learning environments, the weights are updated in incremental mode by updating the within-scatter and between-scatter matrices using the recursive covariance matrix formula [226]. This achieves a smooth change of feature weights (= feature importance levels) over time with new incoming samples. Hence, this approach is also denoted as smooth and soft online dimension reduction; the term softness comes from the decreased weights instead of a crisp deletion. Down-weighted features then play a marginal role during the learning process, e.g., the rule evolution criterion relies more on the highly weighted features. This, in turn, reduces curse-of-dimensionality effects when adapting the models.

The feature weights concept has also been employed in the context of data stream regression problems in [67], there with the usage of generalized rules as defined in Section 4.2.2.2. The feature weights are calculated by a combination of future-based expected statistical contributions of the features in all rules and their influences in the rules' hyper-planes (measured in terms of gradients). A re-scaled Mahalanobis distance measure was developed to integrate weights in the rule evolution condition in a consistent and monotonic fashion, that is, lower weights always induce lower distances. Figure 4.14 visualizes the effect of the unimportance of Feature #2

Figure 4.14: No rule evolution suggested although sample lies outside the tolerance region of a rule, whenever Feature X2 becomes unimportant; thus, the distance of a new sample to the ellipsoid shrank by the re-scaled Mahalanobis distance [67].


on the rule evolution criterion respecting the re-scaled Mahalanobis distance of a sample to a rule ellipsoid (as shaped in the case of multivariate Gaussians).

Another possibility for achieving a smooth input structure change was proposed in [229] for regression problems with the usage of partial least squares (PLS). PLS transforms the input space into latent variables (LVs) by maximizing the covariance structure between inputs and the target [230]. The coordinate axes are turned into a position (yielding latent variables as weighted linear combinations of the original ones) that allows a better explanation of the complexity of the modeling problem. Typically, a lower number of LVs is needed to achieve an accurate regression. Scores on this lower number of LVs (projected samples) can then be directly used as inputs in the evolving fuzzy models. LVs are updated incrementally with new incoming samples based on the concept of Krylov sequences in combination with Gram–Schmidt orthogonalization, see [229]. The LVs internally update the importance of the features in the loadings. Previous works in [142, 159, 231, 232] also perform an incremental update of the LV space for evolving models, but use unsupervised principal component analysis (PCA) [233].
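To illustrate what "maximizing the covariance structure between inputs and the target" means, the following sketch computes the first latent variable of a single-target PLS in batch mode. It is an assumption-laden illustration only: it uses the standard result that the first PLS weight vector is proportional to $X^T y$ on centered data, and it does not implement the incremental Krylov/Gram–Schmidt update of [229]. The function name `first_latent_variable` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def first_latent_variable(
    X: Sequence[Sequence[float]], y: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (weights, scores) of the first PLS latent variable."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Direction of maximal covariance with the target: w proportional to Xc^T yc.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        raise ValueError("inputs have zero covariance with the target")
    w = [v / norm for v in w]
    # Scores: centered samples projected onto the latent direction.
    scores = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    return w, scores
```

The entries of the weight vector play the role of feature importance levels in the latent direction: an input with zero covariance to the target obtains a zero weight.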

4.4.3. Convergence Analysis and Assurance

Another important criterion when applying EFS is some sort of convergence of the parameters included in the fuzzy systems over time to optimality in the objective function sense. This accounts for the stability aspect in the stability–plasticity dilemma [234], which is important in the life-long learning context. When using the recursive (fuzzily weighted) least squares approach, as is done by many of the current techniques, convergence to optimality is guaranteed as long as there is no structural change [139], i.e., no neurons or rules are evolved and/or moved. FLEXFIS [140] takes care of this issue and can achieve sub-optimality in the LS sense subject to a constant due to a decreasing learning gain in the cluster partitions.

Another and even more direct way to guarantee stability is to show convergence of the model error itself, mostly by providing an upper bound on the (development of the) model error over time. This could be accomplished (i) in the PANFIS approach [141] by the usage of an extended recursive LS method, which integrates a binary multiplication factor in the inverse Hessian matrix update that is set to 1 when the approximation error is bigger than the system error, and (ii) in SOFMLS [44] by applying a modified least squares algorithm to train both parameters and structures with the support of linearization and Lyapunov functions [145]. It is remarkable that in this approach, the evolution or pruning of rules does not harm the convergence of the parameters, as these only change the dimension of the inverse Hessian and the weight matrices. Based on the SOFMLS approach, the bound on


the identification error could be made even smaller in [146] with the usage of a specifically designed dead-zone recursive least squares algorithm. SEIT2-FNN [80] employs a heuristic-based approach by resetting the inverse Hessian matrix to q · I, with I the identity matrix, after several update iterations of the consequent parameters. This resetting operation keeps the inverse Hessian matrix bounded and helps to avoid divergence of parameters. In [180], an evolving controller approach was designed based on a neuro-fuzzy model architecture, and a stability theorem was proven using bounded inputs from linear matrix inequalities and parallel distributed computing concepts. This guarantees robust parameter solutions that become convergent over time and avoid severe disturbances in control actions.

Advanced convergence considerations have also been made in [235], where the stability of AnYa type EFS [100] has been proven through the Lyapunov theory, which shows that the average identification error converges to a small neighborhood of zero. An extension of this approach has been proposed in [236] through the maximization of the cross-correntropy for avoiding convergence problems due to the learning size in the gradient-based learning and to approach steady-state convergence performance even in the case of non-Gaussian noise. For consequent parameter learning, a better convergence can also be achieved by using the recursive (weighted) multi-innovation least squares approach as discussed in Section 4.3.1.1; the degree of convergence and thus the robustness of the solution depends on the innovation length, which, on the other hand, slows down the update process. A good tradeoff value is 10, as shown in [148] for various data sets. Further improvements for robust consequent learning with respect to convergence are extensively discussed throughout Section 4.3.1.1, specifically addressing a proper handling of various noise and drift aspects in real-world data streams.

Finally, it should be stressed that convergence to a real (global) optimal joint antecedent–consequent parameter solution can only be assured when performing re-optimization steps from time to time on both the linear consequent and the nonlinear antecedent parameters (based on recent past samples). This has recently been achieved successfully in [153].
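The resetting heuristic mentioned above for SEIT2-FNN [80] can be illustrated with a small recursive least squares learner for one rule consequent. This is a sketch under assumptions and not the algorithm of [80]: the gain uses the common fuzzily weighted form with 1/Ψ in the denominator, and the names `ResettingRLS`, `q`, and `reset_every` as well as their default values are hypothetical (the source only states that the reset happens "after several update iterations").

```python
from typing import List


class ResettingRLS:
    """Recursive (weighted) least squares with periodic reset of P to q * I."""

    def __init__(self, dim: int, q: float = 1000.0, reset_every: int = 50):
        self.dim = dim
        self.q = q
        self.reset_every = reset_every
        self.w = [0.0] * dim                 # linear consequent parameters
        self.P = self._scaled_identity()     # inverse Hessian matrix
        self.updates = 0

    def _scaled_identity(self) -> List[List[float]]:
        return [[self.q if i == j else 0.0 for j in range(self.dim)]
                for i in range(self.dim)]

    def update(self, r: List[float], y: float, psi: float = 1.0) -> None:
        """One update with regressor r, target y and rule membership psi > 0."""
        n = self.dim
        Pr = [sum(self.P[i][j] * r[j] for j in range(n)) for i in range(n)]
        denom = 1.0 / psi + sum(r[i] * Pr[i] for i in range(n))
        gain = [v / denom for v in Pr]
        err = y - sum(wi * ri for wi, ri in zip(self.w, r))
        self.w = [wi + g * err for wi, g in zip(self.w, gain)]
        # P <- P - gain * (r^T P); r^T P equals Pr^T because P is symmetric.
        self.P = [[self.P[i][j] - gain[i] * Pr[j] for j in range(n)]
                  for i in range(n)]
        self.updates += 1
        if self.updates % self.reset_every == 0:
            # Heuristic reset: keeps P bounded and the learner adaptive.
            self.P = self._scaled_identity()
```

A possible usage is to feed regressors of the form `[x, 1.0]` for a one-dimensional linear consequent; the parameter vector is kept across resets, so only the inverse Hessian matrix is re-initialized.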

4.4.4. Reliability and Uncertainty

Reliability deals with the expected quality of model predictions in terms of classification statements, quantification outputs, or forecasted values in time series, as a prerequisite to explainability (see the subsequent section). Reliability points to the trustworthiness of the predictions for current query points, which may be basically influenced by two factors:

• The quality of the training data or data stream samples seen so far


• The nature and localization of the current query point for which a prediction should be obtained

The trustworthiness/certainty of the predictions may support/influence the users/operators during a final decision finding; for instance, a query assigned to a class may cause a different user reaction when the trustworthiness of this decision is low, compared to when it is high.

The first factor basically concerns the noise level in measurement data. Noise can occur in various forms, such as white noise, Gaussian noise, or chaotic noise [237], and can affect input variables, targets, or both. Noise may also cover uncertainties in users' annotations in classification problems: a user with a lower experience level may cause more inconsistencies in the class labels, causing overlapping classes and finally increasing conflict cases (see below); similar situations may arise when several users annotate samples on the same system, but have different opinions in borderline cases.

The second factor concerns the position of the current query point with respect to the definition space of the model. A model can show a perfect trend with little uncertainty within the range of the training samples, but a new query point may appear far away from all training samples seen so far. This yields a severe degree of extrapolation when conducting the model inference process and furthermore a low reliability in making a prediction for the query. In a classification context, a query point may also fall in a highly overlapped region or close to the decision boundary between two classes. The first problem is denoted as ignorance, the second as conflict [115, 121]. A visualization of these two occasions is shown in Figure 4.15 (a) (for conflict) and (b) (for ignorance). The conflict case is due to a sample falling between two classes and the ignorance case due to a query point falling outside

Figure 4.15: (a) Two conflict cases: the query point falls in between two distinct classes or within the overlap region of two classes; (b) ignorance case, as the query point lies significantly away from the training samples, thus increasing the variability of the version space [121]. In both figures, rules modeling the two separate classes are shown as ellipsoids and the decision boundaries are indicated by solid lines.


the range of training samples, whose classes are indeed linearly separable, but by several possible decision boundaries (also termed the variability of the version space [121]).

4.4.4.1. Addressing uncertainty for classification problems

In the classification context, conflict and ignorance can be reflected and represented by means of fuzzy classifiers in a quite natural way [238]. These concepts have been tackled in the field of evolving fuzzy classifiers (EFC), see [110], where multiple binary classifiers are trained in an all-pairs context for obtaining simpler decision boundaries in multi-class classification problems (see Section 4.2.7.3). On a single rule level, the confidence in a prediction can be obtained by the confidence in the different classes coded into the consequents of the rules having the form of Eq. (4.22). This provides a perfect conflict level (close to 0.5 → high conflict, close to 1 → low conflict) in case of overlapping classes within a single rule. If a query point $\vec{x}$ falls between rules with different majority classes (different maximal confidence levels), then the extended weighted classification scheme in Eq. (4.24) is required to represent a conflict level properly. If the confidence in the final output class L, $conf_L$, is close to 0.5, conflict is high; when it is close to 1, conflict is low. In case of the all-pairs architecture, Eq. (4.24) can be used to represent conflict levels in each of the internal binary classifiers. Furthermore, an overall conflict level on the final classifier output is obtained by [110]:

$$conflict_{deg} = \frac{score_k}{score_k + score_l} \tag{4.53}$$

with k and l being the classes with the two highest scores. The ignorance criterion can be resolved in a quite natural way, represented by a degree of extrapolation, and thus by

$$Ign_{deg} = 1 - \max_{i=1,\ldots,C} \mu_i(\vec{x}), \tag{4.54}$$

with C being the number of rules currently contained in the evolved fuzzy classifier. In fact, the degree of extrapolation is a good indicator of the degree of ignorance, but not necessarily a sufficient one, see [115] for an extended analysis and further concepts. However, integrating the ignorance levels into the preference relation scheme of all-pairs evolving fuzzy classifiers according to Eq. (4.29) for obtaining the final classification statement helped to boost the accuracies of the classifier significantly [110]. This is because the classifiers that show a strongly extrapolated situation in the current query are down-weighted towards 0, and thus masked out, in the scoring process. This led to an out-performance of incremental machine learning classifiers


Figure 4.16: (a): High uncertainty due to extrapolation (right part) and sample insignificance (middle part, only two samples support local region); (b): high uncertainty due to high noise level over the whole definition range.

from the MOA framework³ [239] on several large-scale problems. The overall ignorance level of an all-pairs classifier is the minimal ignorance degree calculated by Eq. (4.54) over all binary classifiers.

4.4.4.2. Addressing Uncertainty for Regression Problems

In the regression context, the estimation of parameters through RFWLS and modifications (see Section 4.3.1) can usually deal with high noise levels in order to find a non-overfitting trend of the approximation surface. This is especially the case with advanced strategies (total least squares, recursive correntropy, etc.) as discussed in Section 4.3.1.1 and also, in detail, in a recent paper [148]. However, RFWLS and its extensions are not able to represent uncertainty levels of the model predictions. These can be modeled by so-called error bars or confidence intervals [240]. Figure 4.16 shows the uncertainty of approximation functions (shown as solid lines obtained by regression on the dotted data samples) as modeled by error bars (shown as dashed lines surrounding the solid lines): left (a) for a case with increasing extrapolation on the right-hand side (the error bars open up and thus the uncertainty in predictions becomes higher), right (b) for a case with a high noise level contained in the data (leading to wide error bars and thus to a high uncertainty). The pioneering approach for achieving uncertainty in evolving fuzzy modeling for regression problems was proposed in [54]. The approach is deduced from

3 http://moa.cms.waikato.ac.nz/


statistical noise and quantile estimation theory [241]. The idea is to find a lower and an upper fuzzy function for representing a confidence interval, i.e.,

$$\underline{f}(x_k) \le f(x_k) \le \overline{f}(x_k) \quad \forall k \in \{1, \ldots, N\}, \tag{4.55}$$

with N being the number of data samples seen so far. The main requirement is to define the band to be as narrow as possible while containing a certain percentage of the data. Its solution is based on the calculation of the expected covariance of the residual between the model output and new data in local regions as modeled by a linear hyper-plane. The following formula for the local error bars (in the j-th rule) was obtained in [54] after several mathematical derivations:

$$\overline{\underline{f}}_j(x_k) = \Psi_j(x_k)\, l_j(x_k) \pm t_{\alpha,\Sigma(N)-deg}\, \hat{\sigma} \sqrt{(x_k \Psi_j(x_k))^T P_j (\Psi_j(x_k) x_k)}, \tag{4.56}$$

where $t_{\alpha,\Sigma(N)-deg}$ stands for the percentile of the t-distribution for a 100(1 − 2α) percentage confidence interval (default α = 0.025) with Σ(N) − deg degrees of freedom, and $P_j$ denotes the inverse Hessian matrix. deg denotes the degrees of freedom in a local model, and Σ(N) the sum of membership degrees over the past samples. The symbol $\hat{\sigma}$ is the variance of model errors and the first term denotes the prediction of the j-th local rule. The error bars for all local rules are averaged with the normalized rule membership weights Ψ (similarly as in the conventional fuzzy systems inference process, see Section 4.2.2.1) to form an overall error bar in addition to the output prediction provided by the fuzzy system. The term after ± provides the output uncertainty for the current query sample $x_k$.

Another approach to address uncertainty in model outputs is proposed in [65, 83], where fuzzy rule consequents are represented by two terms: a linguistic one, containing a fuzzy set (typically of trapezoidal nature), and a functional one, as in the case of TS fuzzy systems. The linguistic term offers a direct fuzzy output, which, according to the widths of the learned fuzzy sets, may reflect more or less uncertainty in the active rules (i.e., those rules which have non-zero or at least ε membership degree). A granular prediction is given by the convex hull of those sets which belong to active rules. The width of the convex hull can be interpreted as a confidence interval and provided as the final model output uncertainty.

4.4.4.3. Parameter uncertainty

Apart from model uncertainty, parameter uncertainty can be an important aspect when deciding whether the model is stable and robust. Especially in the cases of insufficient or poorly distributed data, parameter uncertainty typically increases. Parameter uncertainty in EFS has been handled in [242, 243] in terms of the use of the Fisher information matrix [244], with the help of some key measures extracted from it. In [242], parameter uncertainty is used for guiding the design of experiments


process in an online incremental manner, see also Section 4.6.1.3. In [243], it is used for guiding online active sample selection, see also Section 4.6.1.

4.5. Explainability and Interpretability

This section is dedicated to two topics of growing importance during recent years in the machine learning and soft computing community, namely explainability and interpretability. While explainability offers an expert or a user a kind of model-based insight and an associated reasoning why certain predictions have been made by the model, interpretability goes a step further and tries to offer understandable aspects about the evolved model, such as embedded partial relations between features that may offer new insights and further knowledge about the (constitutions of the) process for experts/users.

Fuzzy systems indeed offer an architecture that is per se interpretable and explainable (as also discussed in the introduction); however, when being trained from data, they may become at least a (dark) gray box model or even nearly non-interpretable [51]. The situation is even worse when applying evolving concepts (as discussed throughout Section 4.3) for incrementally updating the fuzzy systems from streams, because these (apart from some pruning and merging concepts) basically try to model the data distributions and implicitly contained relations/dependencies as accurately as possible. This is also termed precise modeling in the literature [245]. Precise modeling was the main focus of the EFS community during the last two decades. However, several developments and concepts that address the interpretability and explainability of EFS arose in recent years. These are described in the following subsections.

4.5.1. Explainability

Explainability of artificial intelligence (AI) has become a huge topic during the past years [246], especially in the field of machine learning and deep learning models [247], which is also reflected in new important objectives of the European community research programme “Horizon Europe” (aiming to last until 2027). This is because AI models primarily operate as black boxes when being trained from data; thus, human experts do not know where the predictions come from, why they have been made, etc. However, for users it is often important that predictions are well understood in order to be able to accept the model's decisions, but also to be able to gain new insights into the process and relations. The explanation of model decisions may also become an essential aspect to stimulate human feedback, especially when reasons are provided for the decisions [248] and the features most influencing the decisions are highlighted [249]. Then, when the model seems to be wrong at the human users' first glance, by looking at the explained reason and


induced features, they may be persuaded to change their opinion or they may be confirmed in their first intuition. In the latter case, they can directly associate the rule leading to the reason shown and thus may change or even discard it. This in turn means that the model is enriched with knowledge provided by the human users and thus can benefit from it, becoming more accurate and more precise in certain regimes and parts of the system.

In evolving neuro-fuzzy systems, explainability is already automatically granted to some extent, especially due to the granular rule-based structure (see Section 4.2) and whenever the input features have an interpretable meaning for the experts. For instance, a reason for a certain decision could be explained by showing the most active rules (in readable IF-THEN form) during the inference and production of a decision. However, due to the primary focus on precise modeling aspects, the explainability may be diminished. The authors in [250] provided several aspects in order to improve the explainability of evolving fuzzy classifiers (in single model architecture as presented in Section 4.2.7.1). These comprise the following concepts:

• The reason for model decisions in linguistic form: the most active fuzzy rule(s) for current query instances is (are) prepared in transparent form and shown to the humans as reason.
• The certainty of model decisions in relation to the final output and possible alternative suggestions. This indicates the degree of ambiguity, i.e., the “clearness” of a model output.
• The feature importance levels for the current model decision: (i) to reduce the length of the rules (= reasons) to show only the most essential premises and thus to increase their readability; (ii) to get a better feeling about which features and corresponding conditions in the rules' antecedents strongly influenced the decision.
• The coverage degree of the current instance to be predicted/classified: this is another form of certainty, which tells the human how novel the content in the current sample is. In the case of high novelty content (equivalent to a low coverage), the human may pay additional attention to providing feedback, which may in some cases even be highly recommended (e.g., to encourage the definition of new fault types, new classes, or new operation modes).

In [250], the final reasoning why a certain decision has been made is based on geometric considerations; an example for a two-dimensional classification problem is indicated in Figure 4.17. There, the rules have ellipsoidal contours that result from multi-dimensional Gaussian kernels for achieving arbitrarily rotated positions (generalized rules as defined in Section 4.2.2.2). The fuzzy sets shown along the axes (HIGH, MEDIUM, LOW) can be obtained by projection of the higher-dimensional rule contours onto the single axes, see, e.g., [67].
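The explanation elements listed above (reason, certainty, coverage) can be assembled for a single query as in the following sketch. It is not the implementation of [250]: rules are assumed to be axis-parallel Gaussians instead of generalized (rotated) ones, the class scores are assumed to be membership-weighted averages of the rule confidences, certainty is expressed by the conflict degree of Eq. (4.53), and low coverage by the ignorance degree of Eq. (4.54). The names `Rule`, `explain`, and `active_threshold` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    name: str
    center: List[float]
    spread: List[float]   # per-feature widths (assumed axis-parallel)
    conf: List[float]     # class confidences in the rule consequent


def membership(rule: Rule, x: List[float]) -> float:
    z = sum(((xi - c) / s) ** 2 for xi, c, s in zip(x, rule.center, rule.spread))
    return math.exp(-0.5 * z)


def explain(rules: List[Rule], x: List[float],
            active_threshold: float = 0.1) -> Dict:
    mu = [membership(r, x) for r in rules]
    total = sum(mu)
    n_classes = len(rules[0].conf)
    if total > 0.0:
        # Assumed scoring: membership-weighted average of rule confidences.
        scores = [sum(m * r.conf[c] for m, r in zip(mu, rules)) / total
                  for c in range(n_classes)]
    else:
        scores = [1.0 / n_classes] * n_classes
    ranked = sorted(range(n_classes), key=lambda c: scores[c], reverse=True)
    k, l = ranked[0], ranked[1]          # classes with the two highest scores
    denom = scores[k] + scores[l]
    order = sorted(range(len(rules)), key=lambda i: mu[i], reverse=True)
    active = [rules[i].name for i in order if mu[i] >= active_threshold]
    if not active:                        # novel sample: show the closest rule
        active = [rules[order[0]].name]
    return {
        "predicted_class": k,
        "reasons": active,                                            # rules to show
        "conflict_deg": scores[k] / denom if denom > 0.0 else 0.5,    # Eq. (4.53)
        "ignorance_deg": 1.0 - max(mu),                               # Eq. (4.54)
    }
```

A conflict degree near 0.5 signals an ambiguous decision, while an ignorance degree near 1 signals high novelty content, in which case only the closest rule is returned as the reason.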


Figure 4.17: A geometric interpretation of fuzzy classification rules as extracted from data samples containing three classes (their contours shown as dotted ellipsoids).

Also shown in Fig. 4.17 are four query point cases whose classification interpretation can be seen as follows:

• Query 1 (circle, shaded): In the case of this sample, only one rule (Rule 1) fires significantly, and it is, therefore, sufficient to show this rule to the human (operator/user) (in IF-THEN form) as a reason for the final model output. In our example, Rule 1 reads as follows:
Rule 1: IF Width is MEDIUM AND Grey Level is MEDIUM, THEN Class #1 with ($conf_1$ = 0.98, $conf_2$ = 0.02, $conf_3$ = 0.0).
Thus, the reason why Class #1 has been returned is that the width of the object is MEDIUM (around 35 units) and its (average) gray level is MEDIUM (i.e., somewhere around 127 when assuming a range of [0, 255]).
• Query 2 (rectangle, shaded): Rule 3 fires most actively, but Rule 1 also fires significantly, so the induced linguistic IF-THEN forms of both are to be shown to the human, as the classification response will depend mainly on the weighted scoring of these two rules.
• Query 3 (circle, non-shaded): this query point lies in between Rules 1, 2, and 3, but it is significantly closer to Rules 1 and 3; thus, their induced linguistic


Figure 4.18: Explainable AnYa fuzzy rules extracted from the hand-written digits data base MNIST, as also used in DRB classifiers [111].

IF-THEN forms are to be shown to the human; the novelty content of this query is moderate.
• Query 4 (rectangle, non-shaded): this query point indicates high novelty content, as it lies far away from all extracted rules; since Rule 1 is by far the closest, its induced linguistic IF-THEN form will be shown to the human. This type of uncertainty is not covered by the weighted output scoring, as it remains on the “safe” side of the decision boundary, shown as a dotted line. Thus, this uncertainty is separately handled by the concept of coverage.

In [250], these explanation concepts for evolving fuzzy classifier decisions were successfully integrated within an annotation GUI. They significantly helped the operators to achieve a more consistent and homogeneous labeling of images (showing the surface of micro-fluidic chips) into ten different fault classes, and this over a longer time frame, than when not using any explanations.

Another form of advanced explanation capability has been offered by the approach proposed in [111] under the scope of deep rule-based classifiers, which internally use massively parallel fuzzy rule bases. Therein, each rule is formed by a connection of ORs between antecedents to represent a certain class; the single antecedents are realized through the representation of prototypes for the consequent class. An example for the application scenario of handwritten digit recognition in images (based on the MNIST data base) is provided in Figure 4.18. It can be seen that, for instance, for the digit 2, various prototypical handwritten images are the representatives for this class (as connected by an OR in the respective rule); in this case, they have been extracted autonomously from the data base based on an evolving density-based approach for clouds as demonstrated in [100]. The prototypes can then be directly associated with the centers of the clouds (incrementally updated from the data stream). For a new sample image, when its output class from the DRB classifier is 2, a direct explanation of the reason for this output can be provided by simply showing the prototype(s) with the highest (cloud-based) activation level(s), i.e., those that are the most similar to the new sample image. In this sense, a “similarity-based” output explanation on image sample level (in a context understandable to users) can be achieved.


4.5.2. Interpretability

Improved transparency and interpretability of the evolved models may be useful in several real-world applications where the operators and experts intend to gain a deeper understanding of the interrelations and dependencies in the system. This may enrich their knowledge and enable them to interpret the characteristics of the system on a deeper level. Concrete examples are decision support systems or classification systems, where the description of classes by certain linguistic dependencies among features may provide answers to important questions (e.g., for providing the health state (= one class) of a patient [251]) and support the experts/users in taking appropriate actions. Another example is the substitution of expensive hardware with soft sensors, referred to as eSensors in an evolving context [252, 253]: the model has to be linguistically or physically understandable, reliable, and plausible to an expert before it can replace the hardware. In other cases, it is beneficial to provide further insights into the control behavior [51, 254].

Interpretability, apart from pure complexity reduction, has been addressed little in the evolving systems community so far. A position paper published in the Information Sciences journal [32] summarizes the achievements in EFS, provides avenues for new concepts as well as concrete formulas and algorithms, and points out open issues as important future challenges. The basic common understanding is that complexity reduction, as a key prerequisite for compact and therefore transparent models, is handled in most of the EFS approaches (under the pruning and merging aspect discussed in Sections 4.3.4.2 and 4.3.4.3), whereas other important criteria (known to be important from conventional batch off-line design of fuzzy systems [245, 255]) have been only loosely handled in EFS so far. These criteria include the following:

• Type of construction of rules/neurons
• Distinguishability and simplicity
• Consistency
• Coverage and completeness
• Feature importance levels
• Rule importance levels
• Interpretation of consequents
• Knowledge expansion

Rule construction is basically achieved through the usage of specific norms for combining the fuzzy sets in the rule antecedent parts. Thereby, the realization of the connections between the rule antecedent parts is essential in order to represent a rule properly (modeling the data distribution best with the highest possible interpretability, etc.). For instance, in Rule 1 in the previous subsection, the realization is achieved through a classical AND-connection, which is typically


realized by a t-norm [40]. Almost all of the EFS approaches published so far use a t-norm to produce a canonical conjunctive normal form of rules (only ANDconnections between singleton antecedents). An exception is a recent published approach in [96], termed as ENFS-Uni0, where uni-nullneurons are constructed based on n-uninorms [256]. Depending on the value of the neutral element in the n-uninorms (which can be learned from the data), more an AND or more an OR is represented in the corresponding rule antecedent connections to a certain degree. These neurons can be, thus, transferred to linguistically readable rules achieved through partial AND/OR-connections, allowing an extended interpretation aspect—e.g., a classification rule’s core part can be an ellipsoid (achieved by AND) with some wider spread along a certain feature (as connected by an OR to some degree)—see also [86] for a detailed explanation. Another exception is the deep rule-based evolving classifier approach proposed in [111], which internally processes a massively parallel rule base, where each rule is of 0-order AnYa-type form [100] (see also Section 4.2.6), connecting the antecedent part of the internal rules (measured in terms of similarity degrees to prototypes) with OR-connections. This yields a good flexibility to represent single classes by various exemplar realizations contained in the data (see also previous section). Distinguishability and simplicity are handled under the scope of complexity reduction, where the difference between these two lies in the occurrence of the degree of overlap of rules and fuzzy sets: redundant rules and fuzzy sets are highly over-lapping and therefore indistinguishable and thus should be merged, whereas obsolete rules or close rules showing similar approximation/classification trends belong to an unnecessary complexity which can be simplified (due to pruning). 
Merging and pruning are embedded in most of the E(N)FS approaches published so far (see Section 4.3.4.3); hence, this interpretability criterion is well covered. Figure 4.19 visualizes an example of a fuzzy partition extracted in the context of house price estimation [257] when conducting native precise modeling (a) and when conducting some simplification steps through merging and pruning during the model update (b). Only in the latter case (b) is it possible to assign linguistic labels to the fuzzy sets and hence to achieve interpretation quality.

Consistency addresses the problem of assuring that no contradictory rules, i.e., rules that possess similar antecedents but dissimilar consequents, are present in the system. This can be achieved by merging redundant rules, i.e., those which are similar in their antecedents, with the usage of the participatory learning concept (introduced by Yager [177]). An appropriate merging of the linear parameter vectors $\vec{w}$ is given by [191]:

$$\vec{w}_{new} = \vec{w}_{R_1} + \alpha \cdot Cons(R_1, R_2) \cdot (\vec{w}_{R_2} - \vec{w}_{R_1}), \tag{4.57}$$

where $\alpha = k_{R_2}/(k_{R_1} + k_{R_2})$ represents the basic learning rate with $k_{R_1}$ the support of the more supported rule $R_1$ and $Cons(R_1, R_2)$ the compatibility measure between

June 1, 2022 12:51


Evolving Fuzzy and Neuro-Fuzzy Systems


Figure 4.19: (a) An uninterpretable fuzzy partition for an input variable in house pricing; (b) the same partition obtained when merging and pruning of rules and sets are carried out during the incremental learning phase, which makes the assignment of linguistic labels possible.

the two rules within the participatory learning context. The latter is measured by a consistency degree between the antecedents and consequents of the two rules. It relies on the exponential proportion between rule antecedent similarity degree (overlap) and rule consequent similarity degree.

Coverage and completeness refer to the problem of a well-defined coverage of the input space by the rule base. Formally, ε-completeness requires that for each new incoming data sample there exists at least one rule to which this sample has a membership degree of at least ε. For instance, SOFNN [143] and SAFIS [170] take care of this issue by evolving new rules when this condition is not met (any longer). Other techniques such as [258] always generate new rules in order to guarantee a minimal coverage of ε over the input space. An alternative and more profound option for data stream regression problems is offered in [32], which proposes integrating a punishment term for ε-completeness into the least squares optimization problem. Incremental optimization techniques based on gradients of the extended optimization function may be applied in order to approach, but not necessarily fully assure, ε-completeness. On the other hand, the joint optimization guarantees a reasonable tradeoff between model accuracy and model coverage of the feature space.
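As a rough illustration of two of the criteria above, the following sketch implements the consequent merging of Eq. (4.57) and a check for ε-completeness. It is a minimal sketch under stated assumptions: the compatibility value `cons` is passed in as a number in [0, 1] rather than computed, rule membership is modeled with axis-parallel Gaussian sets combined by the product t-norm, and all function names are hypothetical.

```python
import numpy as np

def merge_consequents(w_r1, w_r2, k_r1, k_r2, cons):
    """Merge two linear consequent vectors following Eq. (4.57).

    k_r1, k_r2: supports of the rules (R1 is the more supported one).
    cons: compatibility Cons(R1, R2) in [0, 1], assumed to be given.
    """
    w_r1 = np.asarray(w_r1, dtype=float)
    w_r2 = np.asarray(w_r2, dtype=float)
    alpha = k_r2 / (k_r1 + k_r2)  # basic learning rate
    return w_r1 + alpha * cons * (w_r2 - w_r1)

def rule_memberships(x, centers, spreads):
    """Membership of x to each rule (assumption: Gaussian sets, product t-norm)."""
    z = (np.asarray(x, dtype=float) - centers) / spreads
    return np.exp(-0.5 * np.sum(z ** 2, axis=1))

def is_eps_complete(x, centers, spreads, eps):
    """True if at least one rule covers x with a membership degree >= eps."""
    return bool(np.max(rule_memberships(x, centers, spreads)) >= eps)

# Two rules in a two-dimensional input space (illustrative values only).
centers = np.array([[0.0, 0.0], [4.0, 4.0]])
spreads = np.array([[1.0, 1.0], [1.0, 1.0]])

merged = merge_consequents([1.0, 2.0], [3.0, 6.0], k_r1=30, k_r2=10, cons=0.5)
covered = is_eps_complete([0.5, 0.0], centers, spreads, eps=0.1)
uncovered = is_eps_complete([2.0, 2.0], centers, spreads, eps=0.1)
```

With these values, the merged vector moves one eighth of the way from the first consequent towards the second, and the sample halfway between the two rule centers violates ε-completeness, which is the situation in which approaches such as SOFNN or SAFIS would evolve a new rule.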


Feature importance levels are an outcome of the online dimensionality reduction concept discussed in Section 4.4.2. With their usage, it is possible to obtain an interpretation at the input structure level, i.e., which features are more important and which ones are less important. Furthermore, they may also lead to a reduction of the rule lengths, thus increasing rule transparency and readability, as features with weights smaller than a small threshold ε have a very low impact on the final model output and can therefore be eliminated from the rule antecedent and consequent parts when showing the rule base to experts/operators.

Rule importance levels could serve as an essential interpretation component, as they tend to show the importance of each rule in the rule base. Furthermore, rule weights may be used for a smooth rule reduction during learning procedures, as rules with low weights can be seen as unimportant and may be pruned or even re-activated at a later stage in an online learning process (soft rule pruning mechanism). This strategy may be beneficial when starting with an expert-based system, where originally all rules are fully interpretable (as designed by experts/users); however, some may turn out to be superfluous over time for the modeling problem at hand. Furthermore, rule weights can be used to handle inconsistent rules in a rule base (see, e.g., [259, 260]), thus serving as another possibility to tackle the problem of consistency (see above). The usage of rule weights and their updates during incremental learning phases has, to the best of our knowledge, not been studied so far in the evolving fuzzy community. 
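The rule-length reduction through feature importance levels mentioned above can be illustrated with a small sketch. The function name, the feature names, the linguistic labels, and the threshold value are hypothetical; the sketch only shows how low-weight features could be left out when a rule is displayed to an expert.

```python
def readable_antecedent(features, labels, weights, eps=0.1):
    """Build an IF-part that omits features whose weight is below eps."""
    kept = [
        f"{name} is {label}"
        for name, label, weight in zip(features, labels, weights)
        if weight >= eps
    ]
    if not kept:
        # All features fall below the threshold: nothing left to display.
        return "IF (any input)"
    return "IF " + " AND ".join(kept)

rule_text = readable_antecedent(
    ["temperature", "pressure", "humidity"],
    ["HIGH", "LOW", "MEDIUM"],
    [0.9, 0.03, 0.4],
)
```

Here the pressure term is dropped from the displayed rule because its weight lies below the threshold, while the underlying model is left untouched.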
Interpretation of consequents is assured by nature in the case of classifiers with consequent class labels plus confidence levels. In the case of TS fuzzy systems, it is assured as soon as local learning of rule consequents is used [29, 107] (Chapter 2) and [32] (as done by most of the E(N)FS approaches [30]); see also Section 4.3.1. Then, a snuggling of the partial linear trends along the real approximation surface is guaranteed, thus providing information on which parts of the feature space the model will react to, and in which way and intensity (gradients of features). Such trends can be used for an intrinsic influence analysis of input variables on the output (in a global or local manner) and thus also contribute to some sort of explainability, as discussed in the previous subsection.

Knowledge expansion refers to the automatic integration of new knowledge arising during the online process on demand and on-the-fly, also in order to expand the interpretation range of the models. It is handled by all conventional EFS approaches through rule evolution and/or splitting; see Sections 4.3.4.1 and 4.3.4.4.

4.5.2.1. Visual interpretability over time

Visual interpretability over time refers to an interesting add-on to linguistic interpretability (as discussed above), namely to the representation of a model and its changes in a visual form. In the context of evolving modeling from streams, such


Figure 4.20: An example of a rule chain system as occurred for a specific data stream: each rule is arranged in a separate row, its progress over time shown horizontally; each circular marker denotes a data chunk. Note that in the second chunk one rule dies (is pruned), whereas in the fourth chunk a new rule is evolved and in the sixth chunk a big rule (the third one) is split into two; the second rule remains in more or less idle mode during the last three chunks (as not being changed at all).

a technique could be especially useful if models evolve quickly, since monitoring a visual representation might then be easier than following a frequently changing linguistic description. Under this scope, alternative interpretability criteria that are more dedicated to the temporal development of the evolving fuzzy model may become interesting, for instance, a trajectory of rule centers showing their movement over time, or trace paths showing birth, growth, pruning, and merging of rules. The first attempts in this direction have been conducted in [261], employing the concept of rule chains. These have been significantly extended in [262] by setting up a visualization framework with a full-fledged graphical user interface (GUI), integrating various similarity, coverage, and overlap measures as well as specific techniques for an appropriate, easily graspable representation of high-dimensional rule antecedents and consequents. Internally, it uses the FLEXFIS approach [140] as an incremental learning engine for evolving a fuzzy rule base. An example of a typical rule chain system is provided in Figure 4.20. The size of the ellipsoids indicates the similarity of the rules' antecedent parts between two consecutive time instances (marked as "circles"), and the rotation degree of the small lines indicates the similarity of the rules' consequent parts in terms of their angle between two consecutive time instances: a high angle indicates a larger change between two time instances, and thus the rule has been intensively updated during the last chunk. This may provide the human with an idea about the dynamics of the system. Also, when several rules die out or are born within a few chunks, the human may gain some insight into the system dynamics, which may stimulate her/him to react properly. Furthermore, rules that fluctuate highly, changing their antecedents and consequents considerably in a back-and-forth manner, should be treated with care and may even be removed by a human.


4.6. Implementation Issues, Useability, and Applications

In order to increase the useability of evolving (neuro-)fuzzy systems, several issues are discussed in this section: the reduction of sample annotation and measurement efforts through active learning and a proper online design of experiments, a proper handling of missing data, the decrease of computational resources, and online evaluation measures for supervising modeling performance. At the end of this section, a list of real-world applications making use of evolving (neuro-)fuzzy systems will be discussed.

4.6.1. Reducing Sample Annotation and Measurement Efforts

Most of the aforementioned ENFS methods require supervised samples in order to guide the incremental and evolving learning mechanisms in the right direction and to maintain their predictive performance. This is especially true for the recursive update of consequent parameters and input/output product-space clustering. Alternatively, predictions may be used by the update mechanisms to reinforce the model. However, erroneous and imprecise predictions may spread, sum up over time, and deteriorate model performance [2].

The problem in today's industrial systems with increasing complexity is that target values may be costly or even impossible to obtain and measure. For instance, in decision support and classification systems, ground truth labels of samples (from a training set or historic data base) have to be gathered by experts or operators to establish reliable and accurate classifiers, which typically requires time-intensive annotation and labeling cycles [250, 263]. Within a data stream mining process, this problem becomes even more apparent, as experts have to provide ground truth feedback quickly to meet real-time demands. Therefore, it is important to decrease the number of samples for model update using sample selection techniques: annotation feedback or measurements are required only for those samples that are expected to maintain or increase accuracy. This task can be addressed by active learning [264], a technique where the learner has control over which samples are used to update the models [265]. However, conventional active learning approaches operate fully in batch mode by iterating multiple times over a data base, which makes them inapplicable for (faster) stream processing demands [266, 267].

4.6.1.1. Single-pass active learning for classification problems

To select the most appropriate samples from data streams, single-pass active learning (SP-AL), also termed online active learning, has been proposed for evolving fuzzy classifiers in [115], in combination with the classical single model architecture


(see Section 4.2.7.1) and with multi-model architecture employing the all-pairs concept (see Section 4.2.7.3). It relies on the concepts of conflict and ignorance [121]. The former addresses the degree of uncertainty in the classifier decision in terms of the class overlapping degree in the local region where a sample falls and in terms of the closeness of the sample to the decision boundary. The latter addresses the degree of novelty in the sample; see also Section 4.4.4. A variant of SP-AL is provided in [268, 269], where a meta-cognitive evolving scheme that relies on what-to-learn, when-to-learn, and how-to-learn concepts is proposed. The what-to-learn aspect is handled by a sample deletion strategy, i.e., a sample is not used for model updates when the knowledge in the sample is similar to that of the model. Meta-cognitive learning has been further extended in [270] to regression problems (with the use of a fuzzy neural network architecture) and in [271] with the integration of a budget-based selection strategy. Such budget-based learning was demonstrated to be of great practical usability, as it explicitly constrains sample selection to a maximal allowed percentage. An overview and detailed comparison of several online active learning methods in combination with EFS, and also in combination with other machine learning model architectures, can be found in the survey [272].

4.6.1.2. Single-pass active learning for regression problems

In the case of regression problems, permanent measurements of the targets can also be costly, e.g., in chemical or manufacturing systems that require manual checking of product quality. Therefore, online active learning for regression has been proposed in [243] for evolving generalized fuzzy systems (see Section 4.2.2.2), which relies on: (i) the novelty content of a sample (ignorance) with respect to the expected extrapolation behavior of the model; (ii) the predicted output uncertainty measured in terms of local error bars (see Section 4.4.4); and (iii) the reduction of parameter uncertainty measured by the change in the E-optimality of the Fisher information matrix [244]. In this sense, it can be seen as an uncertainty-based sampling technique, where the most uncertain samples for current model predictions are selected in order to "sharpen" or even expand the model's approximation surface. An immediate impression of uncertainty-based sampling can be gained by inspecting Figure 4.16(a): samples lying in the leftmost region of the input (x-axis) can be safely predicted and thus would not be selected, while incoming samples lying in the rightmost region require the actual target values much more, as their prediction is unsafe due to extrapolation (no training samples occurred there before). This approach has been extended in [270] with meta-cognitive learning concepts (inspired by human cognition) and with some recurrency aspects for interval-type fuzzy systems (see Section 4.2.4).
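To make criterion (iii) more tangible, the sketch below computes how much a candidate sample would change the E-optimality value of a Fisher information matrix. It is a minimal sketch, not the procedure of [243]: it assumes a linear-in-parameters model with unit noise variance, so that the Fisher information matrix is the sum of the outer products of the regressor vectors, and it takes the E-optimality value to be the smallest eigenvalue of that matrix. The function names are hypothetical.

```python
import numpy as np

def e_optimality(fisher):
    """E-optimality value: smallest eigenvalue of the symmetric Fisher matrix."""
    return float(np.linalg.eigvalsh(fisher)[0])  # eigvalsh sorts ascending

def e_optimality_gain(fisher, regressor):
    """Change in E-optimality if the sample with this regressor were added."""
    r = np.asarray(regressor, dtype=float)
    return e_optimality(fisher + np.outer(r, r)) - e_optimality(fisher)

# Information collected so far: well excited along axis 0, weakly along axis 1.
fisher = np.diag([4.0, 1.0])

gain_weak_direction = e_optimality_gain(fisher, [0.0, 1.0])
gain_strong_direction = e_optimality_gain(fisher, [1.0, 0.0])
```

A sample exciting the weakly covered direction raises the smallest eigenvalue, whereas a sample along the already well-covered direction leaves it unchanged, so a selection rule built on this gain would prefer the former.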


4.6.1.3. On-line design of experiments (DoE)

In an extended way of thinking about the sample gathering process, it is not only a matter of deciding whether targets should be measured/labeled for available samples, but also which samples in the input space should be gathered at all. The central questions are: should the model expand its knowledge or increase the significance of its parameters? If so, where in the input space should it best do so? To answer these questions, techniques from the field of design of experiments (DoE) [273, 274] are required. These typically rely on several criteria in order to check in which regions of the feature space the model produces its highest errors, is most uncertain about its predictions (e.g., due to sample insignificance), or is most sensitive in its outputs (e.g., due to high, changing gradients). In this context, a pioneering single-pass online method for E(N)FS has been proposed in [242]. It relies on a combination of a pseudo Monte-Carlo sampling algorithm (PM-CSA) [275] and a maximin optimization criterion based on uniformly generated samples that satisfy a membership degree criterion for the worst local model (worst in terms of the normalized sum of the squared error on past samples). Currently, to the best of our knowledge, this is still the only method available for E(N)FS.
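The general idea of combining uniformly generated candidates, a membership criterion for the worst local model, and a maximin criterion can be sketched as follows. This is an illustrative simplification, not the method of [242]: plain uniform random sampling stands in for the pseudo Monte-Carlo step, the worst local model is assumed to be described by a Gaussian membership function with given `center` and `spread`, and the function name and all parameter values are hypothetical.

```python
import numpy as np

def propose_sample(past_inputs, center, spread, lower, upper,
                   min_membership=0.3, n_candidates=500, seed=0):
    """Propose the next input location to measure (maximin over candidates)."""
    rng = np.random.default_rng(seed)
    past = np.asarray(past_inputs, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    # Uniformly generated candidates inside the input range.
    candidates = rng.uniform(lower, upper, size=(n_candidates, lower.size))

    # Keep candidates that belong sufficiently to the worst local model
    # (assumption: Gaussian membership with product t-norm).
    z = (candidates - np.asarray(center)) / np.asarray(spread)
    membership = np.exp(-0.5 * np.sum(z ** 2, axis=1))
    candidates = candidates[membership >= min_membership]
    if candidates.shape[0] == 0:
        return None

    # Maximin: choose the candidate whose nearest past sample is farthest away.
    dists = np.linalg.norm(candidates[:, None, :] - past[None, :, :], axis=2)
    return candidates[np.argmax(dists.min(axis=1))]

past_inputs = [[0.0, 0.0], [0.2, 0.1]]
proposal = propose_sample(past_inputs, center=[0.0, 0.0], spread=[1.0, 1.0],
                          lower=[-2.0, -2.0], upper=[2.0, 2.0])
```

The proposed location stays within the region of the worst local model but away from the samples already gathered there, which is where an additional measurement is expected to be most informative.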

4.6.2. Handling of Missing Data and Imbalanced Problems

When streams are measured in online systems, single data samples to be processed and used for model updates may be incomplete; that is, they do not carry the full information, as one or more input values are simply missing. Such incompleteness of the data, also known as missing data, may arise due to incomplete observations, data transfer problems, malfunction of sensors or devices, or incomplete information obtained from experts or in public surveys, among others [276]. The likelihood that missing data appear in samples increases with the dimensionality of the learning problem. Consider a multi-sensor network collecting measurements from 1000 channels: it is then not unrealistic that at least one channel (= 0.1%) is not permanently well measured. However, discarding the whole 1000-dimensional sample and simply not using it for updates when 99.9% of the data is still available is very wasteful. Therefore, data imputation and estimation methods have been suggested in the literature; see [277] for a survey on classification problems. These perform an imputation of missing values based on maximum likelihood and parameter estimations or based on the dependencies between input variables, but are basically dedicated to batch off-line learning problem settings. In combination with evolving (fuzzy) systems, a pioneering approach has been recently proposed in [85]. It exploits


the interpretable open structure of an extended fuzzy granular system, where the consequent is composed of a Mamdani part (fuzzy set) and a Takagi–Sugeno part (functional consequent; a similar concept as defined in the interval-based granular systems, see Section 4.2.4), but with an additional reduced functional consequent part, where the input with the missing value is excluded. The approach uses the midpoint of the input membership functions related to the most active rule for missing value imputation. Currently, to the best of our knowledge, this is still the only method available for E(N)FS.

Imbalanced classification problems are characterized by a high imbalance in the number of samples belonging to different classes. Typically, when the ratio between the majority and the minority class exceeds ten, one speaks of an imbalanced problem [120]. In such problems, classes with a lower number of samples are typically overwhelmed by classes with a higher number. In the extreme case, classes with a lower number of samples are completely masked out in the classifiers and thus never returned as model outputs [116] (hence, no classification accuracy is achieved on them at all). To the best of our knowledge, evolving (neuro-)fuzzy systems approaches that handle such unpleasant effects have not been addressed so far (and are thus still an open problem). An exception is the approach in [217], which integrates new classes on the fly into the fuzzy classifiers in a single-pass manner and, thereby, uses the all-pairs architecture as defined in Section 4.2.7.3. Due to the all-pairs structure, it is possible that a new class (formed by only a couple of samples) becomes "significant" (and thus "returnable") in the classifier much earlier than in a global (single) model. This is because smaller classifiers are trained between all class pairs (wherein the new class may become a "winner" during prediction much more easily) and their outputs are amalgamated within a balanced relational concept.
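Returning to the missing-data case, the midpoint-based imputation described above can be sketched as follows. This is a simplified illustration rather than the approach of [85]: each rule antecedent is assumed to be an interval granule `[lower, upper]` per input, rule activation is computed on the observed inputs only with a triangular membership around the midpoint (an assumption made for brevity), and the function name is hypothetical.

```python
import numpy as np

def impute_missing(x, lower, upper):
    """Fill NaN entries of x with the midpoints of the most active rule.

    lower, upper: arrays of shape (n_rules, n_inputs) with the granule bounds.
    Returns the completed sample and the index of the most active rule.
    """
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    mid = 0.5 * (lower + upper)
    width = upper - lower

    # Triangular membership around the midpoint, on observed inputs only.
    deviation = np.abs(x[observed] - mid[:, observed])
    mu = np.clip(1.0 - deviation / width[:, observed], 0.0, 1.0)
    activation = mu.prod(axis=1)

    best = int(np.argmax(activation))
    filled = x.copy()
    filled[~observed] = mid[best, ~observed]
    return filled, best

# Two rules over three inputs (illustrative bounds only).
lower = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
upper = np.array([[2.0, 2.0, 2.0], [9.0, 7.0, 7.0]])

filled, best_rule = impute_missing([6.5, np.nan, 6.0], lower, upper)
```

In this example the second rule is the most active one on the two observed inputs, so the missing second input is replaced by the midpoint of that rule's granule and the completed sample can still be used for a model update.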

4.6.3. On Improving Computational Demands

When dealing with online data stream processing problems, the computational demands required for updating the fuzzy systems are usually essential criteria for deciding whether or not to install and use the component. They can be a knock-out criterion, especially in real-time systems, where the update is expected to terminate in real time, i.e., the update cycle should be finished before the next sample is loaded. In addition, the model predictions should be in line with the real-time demands, but these are known to be very fast in the case of a (neuro-)fuzzy inference scheme (significantly below milliseconds) [23, 42, 207]. An extensive evaluation and especially a comparison of computational demands for a large variety of EFS approaches over a large variety of learning problems with different number of classes, different dimensionality, etc.,


is unrealistic. A loose attempt in this direction has been made by Komijani et al. in [69], where they classify various EFS approaches in terms of computation speed into three categories: low, medium, and high.

Interestingly, the update of consequent/output parameters more or less follows a complexity of $O(Cp^2)$, with $p$ the dimensionality of the feature/regressor space and $C$ the number of rules, when local learning is used, and a complexity of $O((Cp)^2)$ when global learning is applied. The quadratic terms $p^2$ resp. $(Cp)^2$ are due to the multiplication of the inverse Hessian with the actual regressor vector in Eq. (4.38), whose sizes are $(p+1) \times (p+1)$ and $p+1$ in the case of local learning (storing the consequent parameters of one rule) resp. $(C(p+1)) \times (C(p+1))$ and $C(p+1)$ in the case of global learning (storing the consequent parameters of all rules). This computational complexity of $O(Cp^2)$ also holds for all extensions of RFWLS, as discussed in Section 4.3.1.1, with the exception of multi-innovation RLS, which requires $O(Cvp^2)$ complexity, with $v$ the innovation length.

Regarding antecedent learning, rule evolution, splitting, and pruning, most of the EFS approaches proposed in the literature try to stay within at most cubic complexity in terms of the number of rules plus the number of inputs. This may guarantee some sort of smooth termination in an online process, but it is not a necessary prerequisite and has to be inspected for the particular learning problem at hand.

However, some general remarks on the improvement of computational demands can be given. First of all, the reduction of unnecessary complexity, such as merging of redundant overlapping rules and pruning of obsolete rules (as discussed in Section 4.3.4), is always beneficial for speeding up the learning process. This also ensures that fuzzy systems do not grow forever and are thus restricted in their expansion and virtual memory requests. In this sense, a newly designed E(N)FS method should always be equipped with merging and pruning techniques. Second, some fast version of incremental optimization techniques could be adapted to fuzzy systems estimation; for instance, there exists a fast RLS algorithm for recursively estimating linear parameters in nearly linear time ($O(N \log N)$), but at the cost of some stability and convergence; see [278] or [279]. Another possibility for decreasing the computation time for learning is the application of single-pass online active learning to select only a subset of samples, based on which the model will be updated; see also Section 4.6.1.
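To see where the quadratic cost per rule comes from, consider one recursive weighted least-squares step for the consequent of a single rule. The sketch below uses the textbook form of the recursion; whether it matches Eq. (4.38) term by term is an assumption, and the function name, the forgetting factor `lam`, and the rule activation `psi` are illustrative names. Both marked operations touch all entries of the inverse Hessian `P`, which gives a cost quadratic in the regressor length per rule and hence $O(Cp^2)$ over $C$ rules.

```python
import numpy as np

def rfwls_step(w, P, r, y, psi, lam=1.0):
    """One recursive (fuzzily) weighted least-squares step for a single rule.

    w: consequent parameters, P: inverse Hessian, r: regressor vector,
    y: target, psi: rule activation used as sample weight, lam: forgetting factor.
    """
    if psi <= 0.0:
        return w, P  # the rule is not activated, nothing to update
    Pr = P @ r                                  # matrix-vector product, quadratic cost
    gain = Pr / (lam / psi + r @ Pr)
    w_new = w + gain * (y - r @ w)
    P_new = (P - np.outer(gain, Pr)) / lam      # rank-one update, quadratic cost
    return w_new, P_new

# Recover a known linear relation from a noise-free stream (illustrative data).
rng = np.random.default_rng(0)
w_true = np.array([1.0, 2.0, -3.0])
w, P = np.zeros(3), 1e4 * np.eye(3)
for _ in range(200):
    x = rng.uniform(-1.0, 1.0, size=2)
    r = np.append(1.0, x)                       # regressor [1, x1, x2]
    w, P = rfwls_step(w, P, r, r @ w_true, psi=1.0)
```

In local learning, such a step is carried out once per rule with that rule's own `w` and `P`; in global learning, a single step of the same form operates on the stacked parameter vector of all rules, which explains the $O((Cp)^2)$ cost.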

4.6.4. Evaluation Measures

Evaluation measures may serve as indicators of the actual state of the evolved fuzzy systems, pointing to their accuracy and trustworthiness in predictions. In a data streaming context, the temporal behavior of such measures plays an essential role


in order to track the model development over time resp. to realize down-trends in accuracy at an early stage (e.g., caused by drifts), and to react appropriately (e.g., conducting a re-modeling phase, changes at the system setup, etc.). Furthermore, the evaluation measures are indispensable during the development phase of EFS approaches. In literature dealing with incremental and data-stream problems [280], three variants of measuring the (progress of) model performance are suggested:

• Interleaved-test-and-then-train.
• Periodic hold out test.
• Fully-train-and-then-test.

Interleaved-test-and-then-train, also termed as accumulated one-step ahead error/accuracy, is based on the idea of measuring model performance in one-step ahead cycles, i.e., based on one newly loaded sample only. In particular, the following steps are carried out:

(1) Load a new sample (the $N$th).
(2) Predict its target $\hat{y}$ using the current evolved fuzzy systems.
(3) Compare the prediction $\hat{y}$ with the true target value $y$ and update the performance measure $pm$:

$$pm(y, \hat{y})(N) \leftarrow upd(pm(y, \hat{y})(N-1)) \tag{4.58}$$

(4) Update the evolved fuzzy system (arbitrary approach).
(5) Erase sample and go to Step 1.

This is a rather optimistic approach, assuming that the target response is immediately available for each new sample after its prediction. Often, it may be delayed [183, 281], postponing the update of the model performance to a later stage. Furthermore, in the case of single sample updates its prediction horizon is minimal, making it difficult to really provide a clear distinction between training and test data, and hence weakening their independence. In this sense, this variant is sometimes too optimistic, under-estimating the true model error. On the other hand, all training samples are also used for testing; thus, it is quite practicable for small streams. In the case of delayed target measurements, the sample-wise interleaved-test-and-then-train scenario can also be extended to a chunk-wise variant, with the chunk-size depending on the duration of the expected delay [131]. Then, all samples from each chunk are predicted (→ more pessimistic multi-step ahead prediction), and, once their target values are available, the model is updated. This has been handled, e.g., in [131] with sequential ensemble methods.

The periodic holdout procedure can "look ahead" to collect a batch of examples from the stream for use as test examples. In particular, it uses each odd data block for learning and updating the model and each even data block for eliciting the model


performance on this latest block; thereby, the data block sizes may be different for training and testing, and may vary depending on the actual stream characteristics. In this sense, a lower number of samples is used for model updating/tuning than in the case of the interleaved-test-and-then-train procedure. In experimental test designs, where the streams are finite, it is thus more practicable for longer streams. On the other hand, this method would be preferable in scenarios with concept drift, as it would measure a model's ability to adapt to the latest trends in the data, whereas in the interleaved-test-and-then-train procedure all the data seen so far is reflected in the current accuracy/error, which thus becomes less flexible over time. Forgetting may be integrated into Eq. (4.58), but this requires an additional tuning parameter (e.g., a window size in case of sliding windows). The following steps are carried out in a periodic holdout process:

(1) Load a new data block $X_N = x_{N \cdot m+1}, \ldots, x_{N \cdot m+m}$ containing $m$ samples.
(2) If $N$ is odd:
    (a) Predict the target values $\hat{y}_{N \cdot m+1}, \ldots, \hat{y}_{N \cdot m+m}$ using the current evolved fuzzy system.
    (b) Compare the predictions $\hat{y}_{N \cdot m+1}, \ldots, \hat{y}_{N \cdot m+m}$ with the true target values $y_{N \cdot m+1}, \ldots, y_{N \cdot m+m}$, and calculate the performance measure (one or more of Eq. (4.59) to Eq. (4.64)) using $y$ and $\hat{y}$.
    (c) Erase block and go to Step 1.
(3) Else ($N$ is even):
    (a) Update the evolved fuzzy system (arbitrary approach) with all samples in the buffer using the real target values.
    (b) Erase block and go to Step 1.

Last but not least, an alternative to the online evaluation procedure is to evolve the model on a certain training set and then evaluate it on an independent test set, termed fully-train-and-then-test. This procedure is most heavily used in many applications of EFS, especially in nonlinear system identification and forecasting problems, as summarized in Tables 4.1 and 4.2. 
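The two online evaluation procedures described above can be sketched in a few lines. The sketch is illustrative only: a running-mean predictor stands in for an evolving fuzzy system, the mean absolute error serves as an example performance measure, and the block index `N` is counted from 0 (an assumption), so that the first block is used for updating and the second one for testing. All class and function names are hypothetical.

```python
class RunningMeanModel:
    """Stand-in incremental regressor: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n, self.mean = 0, 0.0

    def predict(self, x):
        return self.mean

    def update(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


def interleaved_test_then_train(model, stream):
    """Predict each sample first, update the error measure, then update the model."""
    mae, n = 0.0, 0
    for x, y in stream:
        y_hat = model.predict(x)
        n += 1
        mae += (abs(y - y_hat) - mae) / n  # running mean absolute error
        model.update(x, y)
    return mae


def periodic_holdout(model, stream, m):
    """Blocks of m samples: odd block index -> test only, even index -> update only."""
    block_errors, block, n_block = [], [], 0
    for sample in stream:
        block.append(sample)
        if len(block) < m:
            continue  # block not complete yet
        if n_block % 2 == 1:
            errors = [abs(y - model.predict(x)) for x, y in block]
            block_errors.append(sum(errors) / m)
        else:
            for x, y in block:
                model.update(x, y)
        block, n_block = [], n_block + 1  # erase block, load the next one
    return block_errors


stream = [(i, 2.0) for i in range(8)]  # constant target, illustrative only
accumulated_mae = interleaved_test_then_train(RunningMeanModel(), stream)
holdout_maes = periodic_holdout(RunningMeanModel(), stream, m=2)
```

On this toy stream, the accumulated one-step-ahead error still carries the very first (uninformed) prediction, whereas the holdout errors only reflect the model state at the time of each test block. This mirrors the remark above that the interleaved variant reflects all data seen so far.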
The fully-train-and-then-test procedure extends the prediction horizon of the interleaved-test-and-then-train procedure by the size of the test set, but only does this on one occasion (at the end of learning). Therefore, it is not useable in drift cases or under severe changes during online incremental learning processes and should only be used during development and experimental phases.

Regarding appropriate model performance measures, the most convenient choice in classification problems is the fraction of correctly classified samples (accuracy). At time instance $N$ (processing the $N$th sample), the update of the performance measure as in Eq. (4.58) is then conducted by

$$Acc(N) = \frac{Acc(N-1) \cdot (N-1) + I(\hat{y}, y)}{N} \tag{4.59}$$


with $Acc(0) = 0$ and $I$ the indicator function, i.e., $I(a, b) = 1$ whenever $a = b$, otherwise $I(a, b) = 0$; $\hat{y}$ is the predicted class label and $y$ the real one. It can be used in the same manner for eliciting the accuracy on whole data blocks as in the periodic hold out case. Another important measure is the so-called confusion matrix [282], which is defined as

$$C = \begin{bmatrix} N_{11} & N_{12} & \ldots & N_{1K} \\ N_{21} & N_{22} & \ldots & N_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ N_{K1} & N_{K2} & \ldots & N_{KK} \end{bmatrix} \tag{4.60}$$

with $K$ the current number of classes, where the diagonal elements $N_{jj}$ denote the number of class $j$ samples that have been correctly classified as class $j$ samples and the off-diagonal element $N_{ij}$ ($i \neq j$) denotes the number of class $i$ samples which are wrongly classified to class $j$. These can be simply updated by counting. Several well-known measures for performance evaluation in classification problems can then be extracted from the confusion matrix, such as different variants of (balanced) accuracies (e.g., the standard accuracy is simply the trace of the confusion matrix divided by the sum of all its elements), the false alarm and missed-detection rates in the case of binary problems, or several kappa statistics.

The confusion matrix assumes fully labeled samples or at least sparsely labeled samples according to active learning based selections. In some cases, data streams appear in a fully unsupervised manner. In such cases, an unsupervised calculation of the classifier accuracy can be conducted with concepts as discussed in [283] (for the off-line case) and in [216] (for the online single-pass case). The basic idea is based on the assumption that the uncertainty estimates of the two most likely classes correspond to a bimodal density distribution. Such uncertainty estimates are available for evolving fuzzy classifiers according to the rule-coded and/or calculated confidence degrees based on reliability concepts; see Section 4.4.4. 
The bimodal density distributions can be assumed to be two separate Gaussians, especially in the case of binary classification tasks, as the confidence in the output class is one minus the confidence in the other class, and vice versa. A typical example is visualized in Figure 4.21, where the misclassification rate (= 1 − accuracy) is the area under the overlap region (normalized to the whole area covered by both Gaussians). Thus, the more these distributions overlap, the more inaccurate the classifier is assumed to be. The histograms forming the distributions can simply be updated for streams by counting the samples falling into each bin. Furthermore, one may often be interested not only in how many samples are misclassified, but also in how certainly they are classified (either correctly or

June 1, 2022 12:51

208

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch04

Handbook on Computer Learning and Intelligence — Vol. I

Figure 4.21: Classifier output certainty distribution estimated from the discrete stream sampling case in histogram form, the solid black lines denote the fitted Gaussian mixture model, the vertical black line marker the cutting point and the red shaded area the expected misclassification intensity.

wrongly). For instance, a high classification accuracy with a lot of certain classifier outputs may have a higher value than with a lot of uncertain outputs. Furthermore, a high uncertainty degree in the statements points to a high degree of conflict (compare with Eq. (4.24) and Figure 4.6), i.e., a lot of samples falling into class overlap regions. Therefore, a measure telling the average certainty degree over samples classified so far is of great interest—a widely used measure is provided in [284]   N K 1  1  |conf k (i) − yk (i)| , (4.61) 1− Cert_deg(N ) = N i=1 K k=1 with N being the number of samples seen so far, and K the number of classes; yk (i) = 1 if k is the class the current sample i belongs to, and yk (i) = 0 otherwise, and con f k (i) the certainty level in class k when classifying sample i, which can be calculated by Eq. (4.24), for instance. In the case of regression problems, the most common choices are the root mean squared error (RMSE), the mean absolute error (MAE), or the average percentual deviation (APE). Their updates are achieved by the following:   1  ∗ (N − 1) ∗ (RMSE(N − 1)2 ) + (y − yˆ )2 , (4.62) RMSE(N ) = N  1  ∗ (N − 1) ∗ MAE(N − 1) + |y − yˆ | , (4.63) MAE(N ) = N

page 208

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch04

Evolving Fuzzy and Neuro-Fuzzy Systems

(a)

page 209

209

(b)

Figure 4.22: (a): typical accumulated accuracy increasing over time in the case of active learning variants (reducing the number of samples used for updating while still maintaining a high accuracy); (b): typical evolution of the number of rules over time.

  |y − yˆ | 1 ∗ (N − 1) ∗ APE(N − 1) + APE(N ) = N range(y)

(4.64)

with y being the observed value and $\hat{y}$ the predicted value of the current sample. Their batch calculation for whole data blocks in a periodic hold-out test follows the standard definitions, which can be taken from standard references such as Wikipedia. Instead of calculating the concrete error values, the observed versus predicted curves over time and their correlation plots are often shown. This gives an even better impression of the circumstances under which, and the points in time at which, the model behaves in a particular way. A systematic error shift can also be recognized in this way.

Apart from accuracy criteria, other measures rely on the complexity of the models. In most EFS application cases (refer to Tables 4.1 and 4.2), the development of the number of rules over time is plotted as a two-dimensional function. Sometimes, the number of fuzzy sets is also reported as an additional criterion that depends on the rule lengths. These mostly depend on the dimensionality of the learning problem. In the case of embedded feature selection, there may be a big drop in the number of fuzzy sets once some features are discarded or out-weighted (compare with Section 4.4.2).

Figure 4.22 shows a typical accumulated accuracy curve over time in the left image (using an all-pairs EFC including an active learning option with different amounts of samples selected for updating), and a typical development of the number of rules in the right one: at the start, rules are evolved and accumulated; later, some rules turn out to be superfluous and hence are back-pruned and merged. This guarantees anytime flexibility. The accumulated accuracy trend lines shown in Figure 4.22(a) are elicited by applying Eq. (4.59) to each new incoming sample (for which the real class is available) and storing all the updated accuracies in a vector.
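As a minimal sketch, the recursive error updates in Eqs. (4.62)–(4.64) and a running version of the certainty degree in Eq. (4.61) might be implemented as follows. The class and attribute names are hypothetical. Two assumptions are made that are not stated in the text: range(y) is taken to be known in advance (e.g., from expert knowledge about the target), and the certainty degree is updated as a running mean, which is simply Eq. (4.61) rewritten recursively rather than a form given in [284].

```python
import math


class RecursiveRegressionErrors:
    """Single-pass updates of RMSE, MAE and APE, following Eqs. (4.62)-(4.64)."""

    def __init__(self, y_range):
        if y_range <= 0:
            raise ValueError("y_range must be positive")
        self.y_range = y_range  # assumption: range of the target known beforehand
        self.n = 0
        self.rmse = 0.0
        self.mae = 0.0
        self.ape = 0.0

    def update(self, y, y_hat):
        self.n += 1
        n = self.n
        err = abs(y - y_hat)
        # Eq. (4.62): the squared previous RMSE is re-weighted by (N - 1).
        self.rmse = math.sqrt(((n - 1) * self.rmse ** 2 + err ** 2) / n)
        # Eq. (4.63)
        self.mae = ((n - 1) * self.mae + err) / n
        # Eq. (4.64): absolute error normalized by the target range.
        self.ape = ((n - 1) * self.ape + err / self.y_range) / n


class RecursiveCertainty:
    """Running mean of the per-sample certainty term inside Eq. (4.61)."""

    def __init__(self):
        self.n = 0
        self.cert_deg = 0.0

    def update(self, conf, true_class):
        # conf: list of K confidence levels; true_class: index of the real class.
        k = len(conf)
        deviation = sum(
            abs(c - (1.0 if idx == true_class else 0.0))
            for idx, c in enumerate(conf)
        ) / k
        self.n += 1
        self.cert_deg = ((self.n - 1) * self.cert_deg + (1.0 - deviation)) / self.n
```

After each sample, the attributes hold exactly the values a batch computation over all samples seen so far would give, so they can be appended to a vector to obtain trend lines analogous to the accumulated accuracy in Figure 4.22(a).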


4.6.5. Real-World Applications of EFS: An Overview

Due to space restrictions, a complete description of the application scenarios in which EFS have been successfully implemented and used so far is simply impossible. Thus, we restrict ourselves to a compact summary within Tables 4.1 and 4.2, showing application types and classes and indicating which EFS approaches have been used for which application type. In all of these cases, E(N)FS helped to increase the automatization capability, to improve the performance of the models, and finally to increase the useability of the whole systems; in some cases, no models had been (or could be) applied before at all due to a too complex or even impossible physical description of the whole process.

Table 4.1: Application classes in which EFS approaches have been successfully applied (Part 1).

| Application type/class (alphabetically) | EFS approaches (+ refs) | Comment |
| --- | --- | --- |
| Active learning / human–machine interaction | FLEXFIS-Class [250], EFC-SM and EFC-AP [115], FLEXFIS-PLS [285], GS-EFS [243], meta-cognitive methods McIT2FIS [269] and RIVMcSFNN [270] | Reducing the annotation effort and measurement costs in industrial processes |
| Adaptive online control | Evolving PID and MRC controllers RECCo [286], eFuMo [162, 163], rGK [287], self-evolving NFC [288], adaptive controller in [289], robust evolving controller [180], evolving cloud-based controller [254], parsimonious controller [290], myoelectric-based control [291] | Design of fuzzy controllers which can be updated and evolved on-the-fly |
| Chemometric modeling and process control | FLEXFIS++ [55, 142]; the approach in [54] | The application of EFS onto processes in chemical industry (high-dim. NIR spectra) |
| EEG signals classification and processing | eTS [292], epSNNr [293], MSAFIS [294] | Time-series modeling with the inclusion of time delays |
| Evolving smart sensors (eSensors) | eTS+ [252] (gas industry), eTS+ [253, 295] (chemical process industry), PANFIS [141] (NOx emissions) | Evolving predictive and forecasting models in order to substitute cost-intensive hardware sensors |
| Forecasting and prediction (general) | AHLTNM [185] (daily temp.), eT2FIS [81] (traffic flow), eFPT [296] (Statlog from UCI), eFT [173] and eMG [64] (short-term electricity load), FLEXFIS+ [257] and GENEFIS [66] (house prices), incremental LoLiMoT [174] (max. cyl. pressure), rGK [287] (sales prediction), ePFM [164], the approach in [297], EFS-SLAT [298] and others | Various successful implementations of EFS for forecasting and prediction |
| Financial domains | eT2FIS [81], evolving granular systems [83, 299], ePL [300], PANFIS [141], SOFNN [301], ALMMo [102], IT2-GSETSK [258] | Time-series modeling with the inclusion of time delays |
| Health-care and bioinformatics applications | EFuNN [302], evolving granular systems [181] (Parkinson monitoring), TEDAClass [101], DRB classifier [111] (hand-written digits), MSAFIS [294] (EEG signals), ALMMo [303] (heart sounds) | Various classification problems in the health-care domain |

Table 4.2: Application classes in which EFS approaches have been successfully applied (Part 2).

| Application type/class (alphabetically) | EFS approaches (+ refs) | Comment |
| --- | --- | --- |
| Identification of dynamic benchmark problems | DENFIS [161], eT2FIS [81], eTS+ [166], GS-EFS [67], SAFIS [171], SEIT2FNN [80], SOFNN [301], IRSFNN [304] | Mackey–Glass, Box–Jenkins, chaotic time series prediction, etc. |
| On-line design of experiments | SUHICLUST combined with EFS [242] | Actively steering sample gathering in the input space |
| On-line fault detection and condition monitoring | eMG for classification [179], FLEXFIS++ [305], rGK [287], evolving cloud-based classifier [201], pEnsemble+ [306], evolving cloud-based model [307], EFuNN [308] | EFS applied as SysID models for extracting residuals and classifying faults |
| On-line monitoring | eTS+ [252] (gas industry) | Supervision of system behaviors |
| Predictive maintenance, prognostics and process optimization | IPLS-GEFS [229, 309], EBeTS [310] | Multi-stage prediction of arising problems, RUL estimations and suggestions for proper reaction |
| Robotics | eTS+ [311], evolving granular models [312] | Self-localization, autonomous navigation |
| Time-series modeling and forecasting | DENFIS [313], ENFM [169], eTS-LS-SVM [69] (sun spot), ENFS-OL [314], Ensemble of evolving data clouds [206] | Local modeling of multiple time-series versus instance-based learning |
| User behavior identification and analysis | eClass and eTS [315, 316], eTS+ [317], FPA [318], approach in [319], RSEFNN [320] | Analysis of the user’s behaviors in multi-agent systems, on computers, indoor environments, car driving, etc. |
| Video processing | eTS, eTS+ [187, 321] | Real-time object identification, obstacles tracking and novelty detection |
| Visual quality control and online surface inspection | EFC-SM and EFC-AP [217, 322], pClass [117] | Image classification tasks based on feature vectors |
| Web mining | eClass0 [323], eT2Class (evolving type-2) [324] | High-speed mining in web articles |

4.7. Open Issues and Future Directions

Even though, based on all the facets discussed throughout the first six sections of this chapter, the reader may get a good impression of the methodological richness and the range of applicability of current state-of-the-art evolving neuro-fuzzy systems approaches, there are still some open problems that have not been handled with sufficient care and the necessary detail so far; in particular:

• Profound robustness in learning and convergence assurance: as indicated in Section 4.3.1.1, a lot of approaches have been suggested for improving the robustness of consequent parameter learning; in fact, several bounds exist on the achievable model/system error, but convergence to the real optimal solution in an incremental, single-pass manner is still not fully guaranteed; it is only achievable through re-iterative optimization jointly with the antecedent space, which, however, does not meet the single-pass spirit and may significantly slow down the learning process. Furthermore, hardly any approaches exist to guarantee stability or provide bounds on the sub-optimality of the antecedent space (and the nonlinear parameters contained therein, typically rule/neuron centers and spreads). Recursive optimization of nonlinear parameters has indeed been discussed in Section 4.3.2, but it is still in its infancy and has only been loosely used so far; most of the ENFS approaches employ incremental, evolving clustering techniques that typically operate in a more heuristic manner for learning the antecedent parameters and structures.
• Evolving model ensembles: in the batch learning community there exists a wide range of ensemble techniques for the purpose of improving the robustness of single models (especially in the case of high noise with unknown distribution), which can be basically grouped into independent ensemble learning, sequential ensemble learning, and simultaneous ensemble learning; see also Chapter 16 of this book for a demonstration of ensemble learning paradigms and a statistical analysis of the bias–variance–covariance error decomposition problem therein. A few evolving ensemble approaches (around five in total) emerged during the last two years in order to take advantage of the higher robustness of ensembles also for stream modeling purposes, which are discussed in Section 4.3.5; several important ensembling concepts, known to be successful in the batch learning case (e.g., negative correlation learning), have not been handled so far for the evolving case. In addition, it is still an open challenge when best to evolve new ensemble members and when to prune older ones.
• Uncertainty in model outputs: in fact, uncertainty in the inputs is naturally handled by ENFS due to the learning and autonomous evolution of fuzzy sets and rules (describing uncertainty through their levels of granulation and ranges of influence, see Section 4.2); however, uncertainty in the outputs for regression problems has been more or less just loosely addressed by two concepts: interval-based modeling as described in Section 4.2.4 and error bars as described in Section 4.4.4.2. More statistically founded concepts such as probability estimates (for the predictions) as achieved by Gaussian process models or support vector machines (when applying Gaussian kernels), or whole output likelihood distributions as achieved by particle filtering techniques, have not been studied so far.
• Drifts versus anomalies: Section 4.4.1 provides several concepts proposed during the last years for handling drifts in streams adequately by out-weighting older samples, adding new rules, splitting rules, etc. However, a distinction between drifts, i.e., regular dynamic changes due to process dynamics and nonstationary environments that should be integrated into the models, and anomalies indicating possible arising problems, which should not be integrated into the models, has not actually been studied so far. It may be a precarious issue because how different drifts and anomalies are, and in which pattern(s) they appear (as changes) in data samples, typically depends on their type; hence, a generic concept for this distinction would be highly challenging.
• Explainability and interpretability of ENFS: almost all of the ENFS approaches developed so far rely on precise modeling aspects; that is, they intend to achieve the highest possible accuracy for predicting new data samples with the best possible number of structural components (rules and neurons).
Little has been investigated so far in the direction of at least maintaining or even improving the interpretability and explainability of model components and outputs (which are in principle offered by ENFS due to their IF-THEN rule-based structure, as discussed in Section 4.2). Some first concepts are available, as demonstrated in Section 4.5, but they are present more on a discussion- and position-oriented basis rather than being implemented in and tested with concrete ENFS algorithms (with a few exceptions as noted there).
• Active learning and online design of experiments with ENFS: single-pass active learning in combination with ENFS for reducing target labeling/measurement efforts has been investigated in several approaches in the literature for both classification and regression problems, as demonstrated in Section 4.6.1; some challenges regarding budget-based selection and especially structural-based (rather than sampling-based) selection are still open. Furthermore, complete online guidance on the measurement setup (in the form of online DoE) in order to select parts of the input space to measure next has been studied so far only in [242].
• Applications of ENFS in the context of continual learning: due to their evolving and incremental learning properties, ENFS would be able to properly address the principles of continual learning and task-based incremental learning [15] in specific data stream environments, where new data chunks characterize new tasks (typically with some relation to older tasks); applications in this direction, together with some methodological expansions for improving their performance on new tasks, have not been carried out so far.
• Applications of ENFS in predictive maintenance and prescriptive analytics: two hot fields of application with strong and wide objectives in the Horizon 2020 and the new Horizon Europe 2021–2027 framework programmes, where system dynamics aspects may often play a major role [12], but where ENFS have hardly been applied so far; see Table 4.2 for three approaches in total.

Acknowledgments

The author acknowledges the support by the “LCM—K2 Center for Symbiotic Mechatronics” within the framework of the Austrian COMET-K2 program and the Austrian Science Fund (FWF): contract number P32272-N38, acronym IL–EFS.

References

[1] J. Gama, Knowledge Discovery from Data Streams (Chapman & Hall/CRC, Boca Raton, Florida, 2010).
[2] M. Sayed-Mouchaweh and E. Lughofer, Learning in Non-Stationary Environments: Methods and Applications (Springer, New York, 2012).
[3] J. Dean, Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners (John Wiley and Sons, Hoboken, New Jersey, 2012).
[4] G. Blokdyk, Very large database: A Complete Guide (CreateSpace Independent Publishing Platform, 2018).
[5] W. Zhang, MARS Applications in Geotechnical Engineering Systems: Multi-Dimension with Big Data (Science Press, Beijing, 2019).
[6] W. Härdle, H.-S. Lu and X. Shen, Handbook of Big Data Analytics (Springer Verlag, Cham, Switzerland, 2018).
[7] M. Xia and Y. He, Functional connectomics from a ‘big data’ perspective, NeuroImage, 160, 152–167 (2017).
[8] O. Reichmann, M. Jones and M. Schildhauer, Challenges and opportunities of open data in ecology, Science, 331(6018), 703–705 (2011).
[9] X. Wu, X. Zhu, G. Wu and W. Ding, Data mining with big data, IEEE Trans. Knowl. Data Eng., 26(1), 97–107 (2014).
[10] P. Angelov, D. Filev and N. Kasabov, Evolving Intelligent Systems: Methodology and Applications (John Wiley & Sons, New York, 2010).
[11] S. Koenig, M. Likhachev, Y. Liu and D. Furcy, Incremental heuristic search in artificial intelligence, Artif. Intell. Mag., 25(2), 99–112 (2004).


[12] E. Lughofer and M. Sayed-Mouchaweh, Predictive Maintenance in Dynamic Systems: Advanced Methods, Decision Support Tools and Real-World Applications (Springer, New York, 2019).
[13] G. Karer and I. Skrjanc, Predictive Approaches to Control of Complex Systems (Springer Verlag, Berlin, Heidelberg, 2013).
[14] P. Angelov, Autonomous Learning Systems: From Data Streams to Knowledge in Real-Time (John Wiley & Sons, New York, 2012).
[15] M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh and T. Tuytelaars, A continual learning survey: defying forgetting in classification tasks, IEEE Trans. Pattern Anal. Mach. Intell., doi: 10.1109/TPAMI.2021.3057446 (2021).
[16] P. Angelov and N. Kasabov, Evolving computational intelligence systems. In Proceedings of the 1st International Workshop on Genetic Fuzzy Systems, Granada, Spain, pp. 76–82 (2005).
[17] A. Joshi, Machine Learning and Artificial Intelligence (Springer Nature, Switzerland, 2020).
[18] O. Nasraoui and C.-E. B. N’Cir, Clustering Methods for Big Data Analytics: Techniques, Toolboxes and Applications (Unsupervised and Semi-Supervised Learning) (Springer Verlag, Cham, Switzerland, 2019).
[19] X. Wu, V. Kumar, J. Quinlan, J. Gosh, Q. Yang, H. Motoda, G. MacLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand and D. Steinberg, Top 10 algorithms in data mining, Knowl. Inf. Syst., 14(1), 1–37 (2006).
[20] Z. Liang and Y. Li, Incremental support vector machine learning in the primal and applications, Neurocomputing, 72(10–12), 2249–2258 (2009).
[21] P. Laskov, C. Gehl, S. Krüger and K. Müller, Incremental support vector learning: analysis, implementation and applications, J. Mach. Learn. Res., 7, 1909–1936 (2006).
[22] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edn. (Prentice Hall Inc., Upper Saddle River, New Jersey, 1999).
[23] W. Pedrycz and F. Gomide, Fuzzy Systems Engineering: Toward Human-Centric Computing (John Wiley & Sons, Hoboken, New Jersey, 2007).
[24] M. Affenzeller, S. Winkler, S. Wagner and A. Beham, Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications (Chapman & Hall, Boca Raton, Florida, 2009).
[25] P. de Campos Souza, Fuzzy neural networks and neuro-fuzzy networks: a review the main techniques and applications used in the literature, Appl. Soft Comput., 92, 106275 (2020).
[26] V. Balas, J. Fodor and A. Varkonyi-Koczy, Soft Computing based Modeling in Intelligent Systems (Springer, Berlin, Heidelberg, 2009).
[27] L. Trujillo, E. Naredo and Y. Martinez, Preliminary study of bloat in genetic programming with behavior-based search. In M. Emmerich, et al. (eds.), EVOLVE - A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation IV (Advances in Intelligent Systems and Computing), vol. 227, Springer, Berlin Heidelberg, Germany, pp. 293–304 (2013).
[28] L. Zadeh, Fuzzy sets, Inf. Control, 8(3), 338–353 (1965).
[29] E. Lughofer, Evolving Fuzzy Systems: Methodologies, Advanced Concepts and Applications (Springer, Berlin Heidelberg, 2011).
[30] I. Skrjanc, J. Iglesias, A. Sanchis, E. Lughofer and F. Gomide, Evolving fuzzy and neuro-fuzzy approaches in clustering, regression, identification, and classification: a survey, Inf. Sci., 490, 344–368 (2019).
[31] P. Angelov and X. Gu, Empirical Approach to Machine Learning (Springer Nature, Switzerland, 2019).
[32] E. Lughofer, On-line assurance of interpretability criteria in evolving fuzzy systems: achievements, new concepts and open issues, Inf. Sci., 251, 22–46 (2013).


[33] E. Lughofer, Model explanation and interpretation concepts for stimulating advanced human-machine interaction with ‘expert-in-the-loop’. In J. Zhou and F. Chen (eds.), Human and Machine Learning: Visible, Explainable, Trustworthy and Transparent, Springer, New York, pp. 177–221 (2018).
[34] S. Bhattacharyya, V. Snasel, D. Gupta and A. Khanna, Hybrid Computational Intelligence: Challenges and Applications (Hybrid Computational Intelligence for Pattern Analysis and Understanding) (Academic Press, London, 2020).
[35] E. Mamdani, Application of fuzzy logic to approximate reasoning using linguistic systems, Fuzzy Sets Syst., 26(12), 1182–1191 (1977).
[36] K. Kumar, S. Deep, S. Suthar, M. Dastidar and T. Sreekrishnan, Application of fuzzy inference system (FIS) coupled with Mamdani’s method in modelling and optimization of process parameters for biotreatment of real textile wastewater, Desalin. Water Treat., 57(21), 9690–9697 (2016).
[37] E. Pourjavad and A. Shahin, The application of Mamdani fuzzy inference system in evaluating green supply chain management performance, Int. J. Fuzzy Syst., 20, 901–912 (2018).
[38] A. Geramian, M. Mehregan, N. Mokhtarzadeh and M. Hemmati, Fuzzy inference system application for failure analyzing in automobile industry, Int. J. Qual. Reliab. Manage., 34(9), 1493–1507 (2017).
[39] A. Reveiz and C. Len, Operational risk management using a fuzzy logic inference system, J. Financ. Transform., 30, 141–153 (2010).
[40] E. Klement, R. Mesiar and E. Pap, Triangular Norms (Kluwer Academic Publishers, Dordrecht Norwell New York London, 2000).
[41] W. Leekwijck and E. Kerre, Defuzzification: criteria and classification, Fuzzy Sets Syst., 108(2), 159–178 (1999).
[42] A. Piegat, Fuzzy Modeling and Control (Physica Verlag, Springer Verlag Company, Heidelberg, New York, 2001).
[43] H. Nguyen, M. Sugeno, R. Tong and R. Yager, Theoretical Aspects of Fuzzy Control (John Wiley & Sons, New York, 1995).
[44] J. Rubio, SOFMLS: online self-organizing fuzzy modified least square network, IEEE Trans. Fuzzy Syst., 17(6), 1296–1309 (2009).
[45] W. Ho, W. Tung and C. Quek, An evolving Mamdani–Takagi–Sugeno based neural-fuzzy inference system with improved interpretability-accuracy. In Proceedings of the WCCI 2010 IEEE World Congress of Computational Intelligence, Barcelona, pp. 682–689 (2010).
[46] T. Takagi and M. Sugeno, Fuzzy identification of systems and its applications to modeling and control, IEEE Trans. Syst. Man Cybern., 15(1), 116–132 (1985).
[47] R. Babuska, Fuzzy Modeling for Control (Kluwer Academic Publishers, Norwell, Massachusetts, 1998).
[48] A. Banzaouia and A. Hajjaji, Advanced Takagi–Sugeno Fuzzy Systems: Delay and Saturation (Springer, Cham Heidelberg, 2016).
[49] R. Qi, G. Tao and B. Jiang, Fuzzy System Identification and Adaptive Control (Springer Nature, Switzerland, 2019).
[50] J. Abonyi, Fuzzy Model Identification for Control (Birkhäuser, Boston, USA, 2003).
[51] O. Nelles, Nonlinear System Identification (Springer, Berlin, 2001).
[52] Q. Ren, S. Achiche, K. Jemielniak and P. Bigras, An enhanced adaptive neural fuzzy tool condition monitoring for turning process. In Proceedings of the International IEEE Conference on Fuzzy Systems (FUZZ-IEEE) 2016, Vancouver, Canada (2016).
[53] F. Serdio, E. Lughofer, A.-C. Zavoianu, K. Pichler, M. Pichler, T. Buchegger and H. Efendic, Improved fault detection employing hybrid memetic fuzzy modeling and adaptive filters, Appl. Soft Comput., 51, 60–82 (2017).


[54] I. Skrjanc, Confidence interval of fuzzy models: an example using a waste-water treatment plant, Chemom. Intell. Lab. Syst., 96, 182–187 (2009).
[55] R. Nikzad-Langerodi, E. Lughofer, C. Cernuda, T. Reischer, W. Kantner, M. Pawliczek and M. Brandstetter, Calibration model maintenance in melamine resin production: integrating drift detection, smart sample selection and model adaptation, Anal. Chim. Acta, 1013, 1–12 (featured article) (2018).
[56] B. Thumati, M. Feinstein and J. Sarangapani, A model-based fault detection and prognostics scheme for Takagi–Sugeno fuzzy systems, IEEE Trans. Fuzzy Syst., 22(4), 736–748 (2014).
[57] Y.-Q. Zhang, Constructive granular systems with universal approximation and fast knowledge discovery, IEEE Trans. Fuzzy Syst., 13(1), 48–57 (2005).
[58] L. Herrera, H. Pomares, I. Rojas, O. Valenzuela and A. Prieto, TaSe, a Taylor series-based fuzzy system model that combines interpretability and accuracy, Fuzzy Sets Syst., 153(3), 403–427 (2005).
[59] L. Wang and J. Mendel, Fuzzy basis functions, universal approximation and orthogonal least-squares learning, IEEE Trans. Neural Networks, 3(5), 807–814 (1992).
[60] N. E. Day, Estimating the components of a mixture of normal distributions, Biometrika, 56, 463–474 (1969).
[61] H. Sun and S. Wang, Measuring the component overlapping in the Gaussian mixture model, Data Min. Knowl. Discovery, 23, 479–502 (2011).
[62] C. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2007).
[63] R. Duda, P. Hart and D. Stork, Pattern Classification, 2nd edn. (Wiley-Interscience (John Wiley & Sons), Southern Gate, Chichester, West Sussex, England, 2000).
[64] A. Lemos, W. Caminhas and F. Gomide, Multivariable Gaussian evolving fuzzy modeling system, IEEE Trans. Fuzzy Syst., 19(1), 91–104 (2011).
[65] D. Leite, R. Ballini, P. Costa and F. Gomide, Evolving fuzzy granular modeling from nonstationary fuzzy data streams, Evolving Syst., 3(2), 65–79 (2012).
[66] M. Pratama, S. Anavatti and E. Lughofer, GENEFIS: towards an effective localist network, IEEE Trans. Fuzzy Syst., 22(3), 547–562 (2014).
[67] E. Lughofer, C. Cernuda, S. Kindermann and M. Pratama, Generalized smart evolving fuzzy systems, Evolving Syst., 6(4), 269–292 (2015).
[68] J. Abonyi, R. Babuska and F. Szeifert, Modified Gath-Geva fuzzy clustering for identification of Takagi–Sugeno fuzzy models, IEEE Trans. Syst. Man Cybern. Part B: Cybern., 32(5), 612–621 (2002).
[69] M. Komijani, C. Lucas, B. Araabi and A. Kalhor, Introducing evolving Takagi–Sugeno method based on local least squares support vector machine models, Evolving Syst., 3(2), 81–93 (2012).
[70] A. Smola and B. Schölkopf, A tutorial on support vector regression, Stat. Comput., 14, 199–222 (2004).
[71] J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Philos. Trans. R. Soc. London, Ser. A, 209, 441–458 (1909).
[72] A. Zaanen, Linear Analysis (North Holland Publishing Co., 1960).
[73] W. Cheng and C. Juang, An incremental support vector machine-trained TS-type fuzzy system for online classification problems, Fuzzy Sets Syst., 163(1), 24–44 (2011).
[74] M. Pratama, J. Lu, E. Lughofer, G. Zhang and M. Er, Incremental learning of concept drift using evolving type-2 recurrent fuzzy neural network, IEEE Trans. Fuzzy Syst., 25(5), 1175–1192 (2017).
[75] R. Abiyev and O. Kaynak, Fuzzy wavelet neural networks for identification and control of dynamic plants: a novel structure and a comparative study, IEEE Trans. Ind. Electron., 55(8), 3133–3140 (2008).


[76] L. Zadeh, The concept of a linguistic variable and its application to approximate reasoning, Inf. Sci., 8(3), 199–249 (1975).
[77] J. Mendel and R. John, Type-2 fuzzy sets made simple, IEEE Trans. Fuzzy Syst., 10(2), 117–127 (2002).
[78] Q. Liang and J. Mendel, Interval type-2 fuzzy logic systems: theory and design, IEEE Trans. Fuzzy Syst., 8(5), 535–550 (2000).
[79] J. Mendel, Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions (Prentice Hall, Upper Saddle River, 2001).
[80] C. Juang and Y. Tsao, A self-evolving interval type-2 fuzzy neural network with on-line structure and parameter learning, IEEE Trans. Fuzzy Syst., 16(6), 1411–1424 (2008).
[81] S. Tung, C. Quek and C. Guan, eT2FIS: an evolving type-2 neural fuzzy inference system, Inf. Sci., 220, 124–148 (2013).
[82] N. Karnik and J. Mendel, Centroid of a type-2 fuzzy set, Inf. Sci., 132(1–4), 195–220 (2001).
[83] D. Leite, P. Costa and F. Gomide, Interval approach for evolving granular system modeling. In M. Sayed-Mouchaweh and E. Lughofer (eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, pp. 271–300 (2012).
[84] W. Pedrycz, A. Skowron and V. Kreinovich, Handbook of Granular Computing (John Wiley & Sons, Chichester, West Sussex, England, 2008).
[85] C. Garcia, D. Leite and I. Skrjanc, Incremental missing-data imputation for evolving fuzzy granular prediction, IEEE Trans. Fuzzy Syst., 28(10), 2348–2361 (2020).
[86] P. de Campos Souza and E. Lughofer, An advanced interpretable fuzzy neural network model based on uni-nullneuron constructed from n-uninorms, Fuzzy Sets Syst., on-line and in press (https://doi.org/10.1016/j.fss.2020.11.019) (2021).
[87] P. de Campos Souza, Pruning fuzzy neural networks based on unineuron for problems of classification of patterns, J. Intell. Fuzzy Syst., 35(1), 2597–2605 (2018).
[88] E. Lughofer, Evolving fuzzy systems: fundamentals, reliability, interpretability and useability. In P. Angelov (ed.), Handbook of Computational Intelligence, World Scientific, New York, pp. 67–135 (2016).
[89] M. Sugeno and G. Kang, Structure identification of fuzzy model, Fuzzy Sets Syst., 26, 15–33 (1988).
[90] M. Sugeno, Industrial Applications of Fuzzy Control (Elsevier Science, Amsterdam, 1985).
[91] P. Angelov and D. Filev, An approach to online identification of Takagi–Sugeno fuzzy models, IEEE Trans. Syst. Man Cybern. Part B: Cybern., 34(1), 484–498 (2004).
[92] E.-H. Kim, S.-K. Oh and W. Pedrycz, Reinforced hybrid interval fuzzy neural networks architecture: design and analysis, Neurocomputing, 303, 20–36 (2018).
[93] C. Juang, Y.-Y. Lin and C.-C. Tu, A recurrent self-evolving fuzzy neural network with local feedbacks and its application to dynamic system processing, Fuzzy Sets Syst., 161(19), 2552–2568 (2010).
[94] Y. Bodyanskiy and O. Vynokurova, Fast medical diagnostics using autoassociative neuro-fuzzy memory, Int. J. Comput., 16(1), 34–40 (2017).
[95] T. Calvo, B. D. Baets and J. Fodor, The functional equations of Frank and Alsina for uninorms and nullnorms, Fuzzy Sets Syst., 120, 385–394 (2001).
[96] P. de Campos Souza and E. Lughofer, An evolving neuro-fuzzy system based on uni-nullneurons with advanced interpretability capabilities, Neurocomputing, 451, 231–251 (2021).
[97] Y. Bodyanskiy and O. Vynokurova, Hybrid adaptive wavelet-neuro-fuzzy system for chaotic time series identification, Inf. Sci., 220, 170–179 (2013).
[98] W. Yuan and L. Chao, Online evolving interval type-2 intuitionistic fuzzy LSTM-neural networks for regression problems, IEEE Access, 7, 35544 (2019).
[99] A. M. Silva, W. Caminhas, A. Lemos and F. Gomide, A fast learning algorithm for evolving neo-fuzzy neuron, Appl. Soft Comput., 14B, 194–209 (2014).


[100] P. Angelov and R. Yager, A new type of simplified fuzzy rule-based system, Int. J. Gen. Syst., 41(2), 163–185 (2012).
[101] D. Kangin, P. Angelov and J. Iglesias, Autonomously evolving classifier TEDAClass, Inf. Sci., 366, 1–11 (2016).
[102] P. Angelov, X. Gu and J. Principe, Autonomous learning multimodel systems from data streams, IEEE Trans. Fuzzy Syst., 26(4), 2213–2224 (2018).
[103] L. Kuncheva, Fuzzy Classifier Design (Physica-Verlag, Heidelberg, 2000).
[104] Y. Y. T. Nakashima, G. Schaefer and H. Ishibuchi, A weighted fuzzy classifier and its application to image processing tasks, Fuzzy Sets Syst., 158(3), 284–294 (2006).
[105] F. Rehm, F. Klawonn and R. Kruse, Visualization of fuzzy classifiers, Int. J. Uncertainty Fuzziness Knowledge-Based Syst., 15(5), 615–624 (2007).
[106] C.-F. Juang and P.-H. Wang, An interval type-2 neural fuzzy classifier learned through soft margin minimization and its human posture classification application, IEEE Trans. Fuzzy Syst., 23(5), 1474–1487 (2015).
[107] P. Angelov, E. Lughofer and X. Zhou, Evolving fuzzy classifiers using different model architectures, Fuzzy Sets Syst., 159(23), 3160–3182 (2008).
[108] P. Angelov and X. Zhou, Evolving fuzzy-rule-based classifiers from data streams, IEEE Trans. Fuzzy Syst., 16(6), 1462–1475 (2008).
[109] A. Bouchachia and R. Mittermeir, Towards incremental fuzzy classifiers, Soft Comput., 11(2), 193–207 (2006).
[110] E. Lughofer and O. Buchtala, Reliable all-pairs evolving fuzzy classifiers, IEEE Trans. Fuzzy Syst., 21(4), 625–641 (2013).
[111] P. Angelov and X. Gu, Deep rule-based classifier with human-level performance and characteristics, Inf. Sci., 463–464, 196–213 (2018).
[112] D. Nauck and R. Kruse, NEFCLASS-X: a soft computing tool to build readable fuzzy classifiers, BT Technol. J., 16(3), 180–190 (1998).
[113] H. Ishibuchi and T. Nakashima, Effect of rule weights in fuzzy rule-based classification systems, IEEE Trans. Fuzzy Syst., 9(4), 506–515 (2001).
[114] A. Bouchachia, Incremental induction of classification fuzzy rules. In IEEE Workshop on Evolving and Self-Developing Intelligent Systems (ESDIS) 2009, Nashville, USA, pp. 32–39 (2009).
[115] E. Lughofer, Single-pass active learning with conflict and ignorance, Evolving Syst., 3(4), 251–271 (2012).
[116] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edn. (Springer, New York Berlin Heidelberg, 2009).
[117] M. Pratama, S. Anavatti, M. Er and E. Lughofer, pClass: an effective classifier for streaming examples, IEEE Trans. Fuzzy Syst., 23(2), 369–386 (2015).
[118] E. Allwein, R. Schapire and Y. Singer, Reducing multiclass to binary: a unifying approach for margin classifiers, J. Mach. Learn. Res., 1, 113–141 (2001).
[119] J. Fürnkranz, Round robin classification, J. Mach. Learn. Res., 2, 721–747 (2002).
[120] H. He and E. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng., 21(9), 1263–1284 (2009).
[121] E. Hüllermeier and K. Brinker, Learning valued preference structures for solving classification problems, Fuzzy Sets Syst., 159(18), 2337–2352 (2008).
[122] J. Fürnkranz, Round robin rule learning. In Proceedings of the International Conference on Machine Learning (ICML 2001), Williamstown, MA, pp. 146–153 (2001).
[123] D. Nauck and R. Kruse, A neuro-fuzzy method to learn fuzzy classification rules from data, Fuzzy Sets Syst., 89(3), 277–288 (1997).
[124] P. Angelov, X. Gu and J. Principe, A generalized methodology for data analysis, IEEE Trans. Cybern., 48(10), 2981–2993 (2018).


[125] C. Schaffer, Overfitting avoidance as bias, Mach. Learn., 10(2), 153–178 (1993).
[126] Z. Zhou, Ensemble Methods: Foundations and Algorithms (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, Boca Raton, Florida, 2012).
[127] P. Brazdil, C. Giraud-Carrier, C. Soares and R. Vilalta, Metalearning (Springer, Berlin Heidelberg, 2009).
[128] P. Sidhu and M. Bhatia, An online ensembles approach for handling concept drift in data streams: diversified online ensembles detection, Int. J. Mach. Learn. Cybern., 6(6), 883–909 (2015).
[129] D. Leite and I. Skrjanc, Ensemble of evolving optimal granular experts, OWA aggregation, and time series prediction, Inf. Sci., 504, 95–112 (2019).
[130] M. Pratama, W. Pedrycz and E. Lughofer, Evolving ensemble fuzzy classifier, IEEE Trans. Fuzzy Syst., 26(5), 2552–2567 (2018).
[131] E. Lughofer and M. Pratama, Online sequential ensembling of fuzzy systems. In Proceedings of the IEEE Evolving and Adaptive Intelligent Systems Conference (EAIS) 2020, Bari, Italy (2020).
[132] L. Breiman, Bagging predictors, Mach. Learn., 24(2), 123–140 (1996).
[133] M. Collins, R. Schapire and Y. Singer, Logistic regression, AdaBoost and Bregman distances, Mach. Learn., 48(1–3), 253–285 (2002).
[134] T. White, Hadoop: The Definitive Guide (O’Reilly Media, 2012).
[135] O. Chapelle, B. Schoelkopf and A. Zien, Semi-Supervised Learning (MIT Press, Cambridge, MA, 2006).
[136] W. Abraham and A. Robins, Memory retention – the synaptic stability versus plasticity dilemma, Trends Neurosci., 28(2), 73–78 (2005).
[137] M. Stone, Cross-validatory choice and assessment of statistical predictions, J. R. Stat. Soc., 36(1), 111–147 (1974).
[138] B. Efron and R. Tibshirani, An Introduction to the Bootstrap (Chapman and Hall/CRC, 1993).
[139] L. Ljung, System Identification: Theory for the User (Prentice Hall PTR, Prentice Hall Inc., Upper Saddle River, New Jersey, 1999).
[140] E. Lughofer, FLEXFIS: a robust incremental learning approach for evolving TS fuzzy models, IEEE Trans. Fuzzy Syst., 16(6), 1393–1410 (2008).
[141] M. Pratama, S. Anavatti, P. Angelov and E. Lughofer, PANFIS: a novel incremental learning machine, IEEE Trans. Neural Networks Learn. Syst., 25(1), 55–68 (2014).
[142] C. Cernuda, E. Lughofer, L. Suppan, T. Röder, R. Schmuck, P. Hintenaus, W. Märzinger and J. Kasberger, Evolving chemometric models for predicting dynamic process parameters in viscose production, Anal. Chim. Acta, 725, 22–38 (2012).
[143] G. Leng, X.-J. Zeng and J. Keane, An improved approach of self-organising fuzzy neural network based on similarity measures, Evolving Syst., 3(1), 19–30 (2012).
[144] Y. Xu, K. Wong and C. Leung, Generalized recursive least square to the training of neural network, IEEE Trans. Neural Networks, 17(1) (2006).
[145] J. LaSalle and S. Lefschetz, Stability by Lyapunov’s Direct Method: With Applications (Academic Press, New York, 1961).
[146] J. Rubio, Stability analysis for an online evolving neuro-fuzzy recurrent network. In P. Angelov, D. Filev and N. Kasabov (eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, pp. 173–200 (2010).
[147] D. Ge and X. Zeng, Learning evolving T-S fuzzy systems with both local and global accuracy—a local online optimization approach, Appl. Soft Comput., 68, 795–810 (2018).
[148] E. Lughofer, Improving the robustness of recursive consequent parameters learning in evolving neuro-fuzzy systems, Inf. Sci., 545, 555–574 (2021).
[149] R. Bao, H. Rong, P. Angelov, B. Chen and P. Wong, Correntropy-based evolving fuzzy neural system, IEEE Trans. Fuzzy Syst., 26(3), 1324–1338 (2018).


[150] H. Rong, Z.-X. Yang and P. Wong, Robust and noise-insensitive recursive maximum correntropy-based evolving fuzzy system, IEEE Trans. Fuzzy Syst., 28(9), 2277–2284 (2019).
[151] B. Chen, X. Liu, H. Zhao and J. C. Principe, Maximum correntropy Kalman filter, Automatica, 76, 70–77 (2017).
[152] G. Golub and C. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins University Press, Baltimore, Maryland, 1996).
[153] X. Gu and P. Angelov, Self-boosting first-order autonomous learning neuro-fuzzy systems, Appl. Soft Comput., 77, 118–134 (2019).
[154] L. Ngia and J. Sjöberg, Efficient training of neural nets for nonlinear adaptive filtering using a recursive Levenberg-Marquardt algorithm, IEEE Trans. Signal Process., 48(7), 1915–1926 (2000).
[155] J. Sherman and W. Morrison, Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix, Ann. Math. Stat., 20, 621 (1949).
[156] W. Wang and J. Vrbanek, An evolving fuzzy predictor for industrial applications, IEEE Trans. Fuzzy Syst., 16(6), 1439–1449 (2008).
[157] A. Bouchachia, Evolving clustering: an asset for evolving systems, IEEE SMC Newsl., 36 (2011).
[158] G. Gan, C. Ma and J. Wu, Data Clustering: Theory, Algorithms, and Applications (ASA-SIAM Series on Statistics and Applied Probability) (Society for Industrial & Applied Mathematics, USA, 2007).
[159] G. Klancar and I. Skrjanc, Evolving principal component clustering with a low runtime complexity for LRF data mapping, Appl. Soft Comput., 35, 349–358 (2015).
[160] P. Angelov and X. Gu, Autonomous learning multi-model classifier of 0 order (ALMMO-0). In Proceedings of the Evolving and Intelligent Systems (EAIS) Conference 2017, Ljubljana, Slovenia, IEEE Press (2017).
[161] N. K. Kasabov and Q. Song, DENFIS: dynamic evolving neural-fuzzy inference system and its application for time-series prediction, IEEE Trans. Fuzzy Syst., 10(2), 144–154 (2002).
[162] A. Zdesar, D. Dovzan and I. Skrjanc, Self-tuning of 2 DOF control based on evolving fuzzy model, Appl. Soft Comput., 19, 403–418 (2014).
[163] D. Dovzan, V. Logar and I. Skrjanc, Implementation of an evolving fuzzy model (eFuMo) in a monitoring system for a waste-water treatment process, IEEE Trans. Fuzzy Syst., 23(5), 1761–1776 (2015).
[164] L. Maciel, R. Ballini and F. Gomide, Evolving possibilistic fuzzy modeling for realized volatility forecasting with jumps, IEEE Trans. Fuzzy Syst., 25(2), 302–314 (2017).
[165] S. Blazic and I. Skrjanc, Incremental fuzzy c-regression clustering from streaming data for local-model-network identification, IEEE Trans. Fuzzy Syst., 28(4), 758–767 (2020).
[166] P. Angelov, Evolving Takagi–Sugeno fuzzy systems from streaming data, eTS+. In P. Angelov, D. Filev and N. Kasabov (eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, pp. 21–50 (2010).
[167] A. Almaksour and E. Anquetil, Improving premise structure in evolving Takagi–Sugeno neurofuzzy classifiers, Evolving Syst., 2, 25–33 (2011).
[168] I. Skrjanc, S. Blazic, E. Lughofer and D. Dovzan, Inner matrix norms in evolving Cauchy possibilistic clustering for classification and regression from data streams, Inf. Sci., 478, 540–563 (2019).
[169] H. Soleimani, C. Lucas and B. Araabi, Recursive Gath–Geva clustering as a basis for evolving neuro-fuzzy modeling, Evolving Syst., 1(1), 59–71 (2010).
[170] H.-J. Rong, N. Sundararajan, G.-B. Huang and P. Saratchandran, Sequential adaptive fuzzy inference system (SAFIS) for nonlinear system identification and prediction, Fuzzy Sets Syst., 157(9), 1260–1275 (2006).


[171] H.-J. Rong, Sequential adaptive fuzzy inference system for function approximation problems. In M. Sayed-Mouchaweh and E. Lughofer (eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York (2012).
[172] H.-J. Rong, N. Sundararajan, G.-B. Huang and G.-S. Zhao, Extended sequential adaptive fuzzy inference system for classification problems, Evolving Syst., 2(2), 71–82 (2011).
[173] A. Lemos, W. Caminhas and F. Gomide, Fuzzy evolving linear regression trees, Evolving Syst., 2(1), 1–14 (2011).
[174] C. Hametner and S. Jakubek, Local model network identification for online engine modelling, Inf. Sci., 220, 210–225 (2013).
[175] G. Leng, T. McGinnity and G. Prasad, An approach for on-line extraction of fuzzy rules using a self-organising fuzzy neural network, Fuzzy Sets Syst., 150(2), 211–243 (2005).
[176] F.-J. Lin, C.-H. Lin and P.-H. Shen, Self-constructing fuzzy neural network speed controller for permanent-magnet synchronous motor drive, IEEE Trans. Fuzzy Syst., 9(5), 751–759 (2001).
[177] R. R. Yager, A model of participatory learning, IEEE Trans. Syst. Man Cybern., 20(5), 1229–1234 (1990).
[178] E. Lima, M. Hell, R. Ballini and F. Gomide, Evolving fuzzy modeling using participatory learning. In P. Angelov, D. Filev and N. Kasabov (eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, pp. 67–86 (2010).
[179] A. Lemos, W. Caminhas and F. Gomide, Adaptive fault detection and diagnosis using an evolving fuzzy classifier, Inf. Sci., 220, 64–85 (2013).
[180] D. Leite, R. Palhares, C. S. Campos and F. Gomide, Evolving granular fuzzy model-based control of nonlinear dynamic systems, IEEE Trans. Fuzzy Syst., 23(4), 923–938 (2015).
[181] D. Leite, P. Costa and F. Gomide, Evolving granular neural networks from fuzzy data streams, Neural Networks, 38, 1–16 (2013).
[182] D. Leite, M. Santana, A. Borges and F. Gomide, Fuzzy granular neural network for incremental modeling of nonlinear chaotic systems. In Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) 2016, pp. 64–71 (2016).
[183] G. Marrs, M. Black and R. Hickey, The use of time stamps in handling latency and concept drift in online learning, Evolving Syst., 3(2), 203–220 (2012).
[184] S. Wu, M. Er and Y. Gao, A fast approach for automatic generation of fuzzy rules by generalized dynamic fuzzy neural networks, IEEE Trans. Fuzzy Syst., 9(4), 578–594 (2001).
[185] A. Kalhor, B. Araabi and C. Lucas, An online predictor model as adaptive habitually linear and transiently nonlinear model, Evolving Syst., 1(1), 29–41 (2010).
[186] S. Tzafestas and K. Zikidis, NeuroFAST: on-line neuro-fuzzy ART-based structure and parameter learning TSK model, IEEE Trans. Syst. Man Cybern. Part B, 31(5), 797–802 (2001).
[187] P. Angelov, P. Sadeghi-Tehran and R. Ramezani, An approach to automatic real-time novelty detection, object identification, and tracking in video streams based on recursive density estimation and evolving Takagi–Sugeno fuzzy systems, Int. J. Intell. Syst., 26(3), 189–205 (2011).
[188] G. Huang, P. Saratchandran and N. Sundararajan, A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation, IEEE Trans. Neural Networks, 16(1), 57–67 (2005).
[189] B. Hassibi and D. Stork, Second-order derivatives for network pruning: optimal brain surgeon. In S. Hanson, J. Cowan and C. Giles (eds.), Advances in Neural Information Processing, vol. 5, Morgan Kaufmann, Los Altos, CA, pp. 164–171 (1993).
[190] D. Leite, P. Costa and F. Gomide, Granular approach for evolving system modeling. In Proceedings of the International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Lecture Notes in Artificial Intelligence, pp. 340–349 (2010).


[191] E. Lughofer, J.-L. Bouchot and A. Shaker, On-line elimination of local redundancies in evolving fuzzy systems, Evolving Syst., 2(3), 165–187 (2011).
[192] G. Leng, A hybrid learning algorithm with a similarity-based pruning strategy for self-adaptive neuro-fuzzy systems, Appl. Soft Comput., 9(4), 1354–1366 (2009).
[193] C. Mencar, G. Castellano and A. Fanelli, Distinguishability quantification of fuzzy sets, Inf. Sci., 177, 130–149 (2007).
[194] V. V. Cross and T. A. Sudkamp, Similarity and Compatibility in Fuzzy Set Theory: Assessment and Applications (Springer, Physica, Heidelberg New York, 2010).
[195] B. Hartmann, O. Banfer, O. Nelles, A. Sodja, L. Teslic and I. Skrjanc, Supervised hierarchical clustering in fuzzy model identification, IEEE Trans. Fuzzy Syst., 19(6), 1163–1176 (2011).
[196] L. Teslic, B. Hartmann, O. Nelles and I. Skrjanc, Nonlinear system identification by Gustafson–Kessel fuzzy clustering and supervised local model network learning for the drug absorption spectra process, IEEE Trans. Neural Networks, 22(12), 1941–1951 (2011).
[197] E. Lughofer, M. Pratama and I. Skrjanc, Incremental rule splitting in generalized evolving fuzzy systems for autonomous drift compensation, IEEE Trans. Fuzzy Syst., 26(4), 1854–1865 (2018).
[198] E. Lughofer, M. Pratama and I. Skrjanc, Online bagging of evolving fuzzy systems, Inf. Sci., 570, 16–33 (2021).
[199] W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Am. Stat. Assoc., 58(301), 13–30 (1963).
[200] J. Iglesias, M. Sesmero, E. Lopez, A. Ledezma and A. Sanchis, A novel evolving ensemble approach based on bootstrapping and dynamic selection of diverse learners. In Proceedings of the Evolving and Adaptive Intelligent Systems Conference (EAIS) 2020, Bari, Italy (2020).
[201] B. Costa, P. Angelov and L. Guedes, Fully unsupervised fault detection and identification based on recursive density estimation and self-evolving cloud-based classifier, Neurocomputing, 150A, 289–303 (2015).
[202] M. Islam, X. Yao, S. Nirjon, M. Islam and K. Murase, Bagging and boosting negatively correlated neural networks, IEEE Trans. Syst. Man Cybern. Part B: Cybern., 38(3), 771–784 (2008).
[203] R. Yager, On ordered weighted averaging aggregation operators in multicriteria decision making, IEEE Trans. Syst. Man Cybern., 18, 183–190 (1988).
[204] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms (Wiley-Interscience (John Wiley & Sons), Southern Gate, Chichester, West Sussex, England, 2004).
[205] C. C. Coello and G. Lamont, Applications of Multi-Objective Evolutionary Algorithms (World Scientific, Singapore, 2004).
[206] E. Soares, P. Costa, B. Costa and D. Leite, Ensemble of evolving data clouds and fuzzy models for weather time series prediction, Appl. Soft Comput., 64, 445–453 (2018).
[207] R. Kruse, J. Gebhardt and R. Palm, Fuzzy Systems in Computer Science (Verlag Vieweg, Wiesbaden, 1994).
[208] K. Åström and B. Wittenmark, Adaptive Control, 2nd edn. (Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1994).
[209] I. Khamassi, M. Sayed-Mouchaweh, M. Hammami and K. Ghedira, Discussion and review on evolving data streams and concept drift adapting, Evolving Syst., 9(1), 1–23 (2017).
[210] G. Box, G. Jenkins and G. Reinsel, Time Series Analysis, Forecasting and Control (Prentice Hall, Englewood Cliffs, New Jersey, 1994).
[211] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy and A. Bouchachia, A survey on concept drift adaptation, ACM Comput. Surv., 46(4), 44 (2014).
[212] O.-M. Moe-Helgesen and H. Stranden, Catastrophic forgetting in neural networks. Technical report, Norwegian University of Science and Technology, Trondheim, Norway (2005).
[213] E. Lughofer and P. Angelov, Handling drifts and shifts in on-line data streams with evolving fuzzy systems, Appl. Soft Comput., 11(2), 2057–2068 (2011).


[214] A. Shaker and E. Lughofer, Self-adaptive and local strategies for a smooth treatment of drifts in data streams, Evolving Syst., 5(4), 239–257 (2014).
[215] H. Mouss, D. Mouss, N. Mouss and L. Sefouhi, Test of Page-Hinkley, an approach for fault detection in an agro-alimentary production system. In Proceedings of the Asian Control Conference, vol. 2, pp. 815–818 (2004).
[216] E. Lughofer, E. Weigl, W. Heidl, C. Eitzinger and T. Radauer, Recognizing input space and target concept drifts with scarcely labelled and unlabelled instances, Inf. Sci., 355–356, 127–151 (2016).
[217] E. Lughofer, E. Weigl, W. Heidl, C. Eitzinger and T. Radauer, Integrating new classes on the fly in evolving fuzzy classifier designs and its application in visual inspection, Appl. Soft Comput., 35, 558–582 (2015).
[218] M. Pratama, J. Lu, E. Lughofer, G. Zhang and S. Anavatti, Scaffolding type-2 classifier for incremental learning under concept drifts, Neurocomputing, 191, 304–329 (2016).
[219] C.-Y. Chong and S. Kumar, Sensor networks: evolution, opportunities and challenges, Proc. IEEE, 91(8), 1247–1256 (2003).
[220] S. Alizadeh, A. Kalhor, H. Jamalabadi, B. Araabi and M. Ahmadabadi, Online local input selection through evolving heterogeneous fuzzy inference system, IEEE Trans. Fuzzy Syst., 24(6), 1364–1377 (2016).
[221] V. Bolon-Canedo, N. Sanchez-Marono and A. Alonso-Betanzos, Feature Selection for High-Dimensional Data, Artificial Intelligence: Foundations, Theory, and Algorithms (Springer, Heidelberg Berlin, 2016).
[222] U. Stanczyk and L. Jain, Feature Selection for Data and Pattern Recognition (Springer, Heidelberg, New York, London, 2016).
[223] E. Lughofer, On-line incremental feature weighting in evolving fuzzy classifiers, Fuzzy Sets Syst., 163(1), 1–23 (2011).
[224] J. Dy and C. Brodley, Feature selection for unsupervised learning, J. Mach. Learn. Res., 5, 845–889 (2004).
[225] Y. Xu, J.-Y. Yang and Z. Jin, A novel method for Fisher discriminant analysis, Pattern Recognit., 37(2), 381–384 (2004).
[226] M. Hisada, S. Ozawa, K. Zhang and N. Kasabov, Incremental linear discriminant analysis for evolving feature spaces in multitask pattern recognition problems, Evolving Syst., 1(1), 17–27 (2010).
[227] S. Pang, S. Ozawa and N. Kasabov, Incremental linear discriminant analysis for classification of data streams, IEEE Trans. Syst. Man Cybern. Part B: Cybern., 35(5), 905–914 (2005).
[228] P. de Campos Souza and E. Lughofer, Identification of heart sounds with an interpretable evolving fuzzy neural network, Sensors, 20(6477) (2020).
[229] E. Lughofer, A.-C. Zavoianu, R. Pollak, M. Pratama, P. Meyer-Heye, H. Zörrer, C. Eitzinger, J. Haim and T. Radauer, Self-adaptive evolving forecast models with incremental PLS space updating for on-line prediction of micro-fluidic chip quality, Eng. Appl. Artif. Intell., 68, 131–151 (2018).
[230] M. Haenlein and A. Kaplan, A beginner’s guide to partial least squares (PLS) analysis, Understanding Stat., 3(4), 283–297 (2004).
[231] N. Kasabov, D. Zhang and P. Pang, Incremental learning in autonomous systems: evolving connectionist systems for on-line image and speech recognition. In Proceedings of IEEE Workshop on Advanced Robotics and its Social Impacts, 2005, Hsinchu, Taiwan, pp. 120–125 (2005).
[232] D. P. Filev and F. Tseng, Novelty detection based machine health prognostics. In Proc. of the 2006 International Symposium on Evolving Fuzzy Systems, Lake District, UK, pp. 193–199 (2006).
[233] I. Jolliffe, Principal Component Analysis (Springer Verlag, Berlin Heidelberg New York, 2002).


[234] F. Hamker, RBF learning in a non-stationary environment: the stability-plasticity dilemma. In R. Howlett and L. Jain (eds.), Radial Basis Function Networks 1: Recent Developments in Theory and Applications, Physica-Verlag, Heidelberg, New York, pp. 219–251 (2001).
[235] H.-J. Rong, P. Angelov, X. Gu and J.-M. Bai, Stability of evolving fuzzy systems based on data clouds, IEEE Trans. Fuzzy Syst., 26(5), 2774–2784 (2018).
[236] H.-J. Rong, Z.-X. Yang and P. Wong, Robust and noise-insensitive recursive maximum correntropy-based evolving fuzzy system, IEEE Trans. Fuzzy Syst., 28(9), 2277–2284 (2020).
[237] V. Stojanovic and N. Nedic, Identification of time-varying OE models in presence of non-Gaussian noise: application to pneumatic servo drives, Int. J. Robust Nonlinear Control, 26(18), 3974–3995 (2016).
[238] J. Hühn and E. Hüllermeier, FR3: a fuzzy rule learner for inducing reliable classifiers, IEEE Trans. Fuzzy Syst., 17(1), 138–149 (2009).
[239] A. Bifet, G. Holmes, R. Kirkby and B. Pfahringer, MOA: massive online analysis, J. Mach. Learn. Res., 11, 1601–1604 (2010).
[240] M. Smithson, Confidence Intervals (SAGE University Paper (Series: Quantitative Applications in the Social Sciences), Thousand Oaks, California, 2003).
[241] K. Tschumitschew and F. Klawonn, Incremental statistical measures. In M. Sayed-Mouchaweh and E. Lughofer (eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, pp. 21–55 (2012).
[242] I. Skrjanc, Evolving fuzzy-model-based design of experiments with supervised hierarchical clustering, IEEE Trans. Fuzzy Syst., 23(4), 861–871 (2015).
[243] E. Lughofer and M. Pratama, On-line active learning in data stream regression using uncertainty sampling based on evolving generalized fuzzy models, IEEE Trans. Fuzzy Syst., 26(1), 292–309 (2018).
[244] B. Frieden and R. Gatenby, Exploratory Data Analysis Using Fisher Information (Springer Verlag, New York, 2007).
[245] J. Casillas, O. Cordon, F. Herrera and L. Magdalena, Interpretability Issues in Fuzzy Modeling (Springer Verlag, Berlin Heidelberg, 2003).
[246] F. Dosilovic, M. Brcic and N. Hlupic, Explainable artificial intelligence: a survey. In Proceedings of the 41st International Convention Proceedings, MIPRO 2018, Opatija, Croatia, pp. 210–215 (2018).
[247] W. Samek and G. Montavon, Explainable AI: Interpreting, Explaining and Visualizing Deep Learning (Springer Nature, Switzerland, 2019).
[248] Z. Bosnić, J. Demšar, G. Kešpret, P. Rodrigues, J. Gama and I. Kononenko, Enhancing data stream predictions with reliability estimators and explanation, Eng. Appl. Artif. Intell., 34, 178–192 (2014).
[249] E. Strumbelj and I. Kononenko, Explaining prediction models and individual predictions with feature contributions, Knowl. Inf. Syst., 41(3), 647–665 (2014).
[250] E. Lughofer, R. Richter, U. Neissl, W. Heidl, C. Eitzinger and T. Radauer, Explaining classifier decisions linguistically for stimulating and improving operators labeling behavior, Inf. Sci., 420, 16–36 (2017).
[251] T. Wetter, Medical decision support systems. In Computer Science, Medical Data Analysis, Springer, Berlin/Heidelberg, pp. 458–466 (2006).
[252] J. Macias-Hernandez and P. Angelov, Applications of evolving intelligent systems to the oil and gas industry. In P. Angelov, D. Filev and N. Kasabov (eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, pp. 401–421 (2010).
[253] P. Angelov and A. Kordon, Evolving inferential sensors in the chemical process industry. In P. Angelov, D. Filev and N. Kasabov (eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, pp. 313–336 (2010).


[254] G. Andonovski, P. Angelov, S. Blazic and I. Skrjanc, A practical implementation of robust evolving cloud-based controller with normalized data space for heat-exchanger plant, Appl. Soft Comput., 48, 29–38 (2016).
[255] M. Gacto, R. Alcala and F. Herrera, Interpretability of linguistic fuzzy rule-based systems: an overview of interpretability measures, Inf. Sci., 181(20), 4340–4360 (2011).
[256] P. Akella, Structure of n-uninorms, Fuzzy Sets Syst., 158, 1631–1651 (2007).
[257] E. Lughofer, B. Trawinski, K. Trawinski, O. Kempa and T. Lasota, On employing fuzzy modeling algorithms for the valuation of residential premises, Inf. Sci., 181(23), 5123–5142 (2011).
[258] M. Ashrafi, D. Prasad and C. Quek, IT2-GSETSK: an evolving interval type-II TSK fuzzy neural system for online modeling of noisy data, Neurocomputing, 407, 1–11 (2020).
[259] D. Nauck, Adaptive rule weights in neuro-fuzzy systems, Neural Comput. Appl., 9, 60–70 (2000).
[260] A. Riid and E. Rüstern, Adaptability, interpretability and rule weights in fuzzy rule-based systems, Inf. Sci., 257, 301–312 (2014).
[261] S. Henzgen, M. Strickert and E. Hüllermeier, Rule chains for visualizing evolving fuzzy rule-based systems. In Advances in Intelligent Systems and Computing, vol. 226, Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013, Springer, Cambridge, MA, pp. 279–288 (2013).
[262] S. Henzgen, M. Strickert and E. Hüllermeier, Visualization of evolving fuzzy rule-based systems, Evolving Syst., 5(3), 175–191 (2014).
[263] D. Cruz-Sandoval, J. Beltran and M. Garcia-Constantino, Semi-automated data labeling for activity recognition in pervasive healthcare, Sensors, 19(14), 3035 (2019).
[264] B. Settles, Active Learning (Morgan & Claypool Publishers, 2012).
[265] D. Cohn, L. Atlas and R. Ladner, Improving generalization with active learning, Mach. Learn., 15(2), 201–221 (1994).
[266] N. Rubens, M. Elahi, M. Sugiyama and D. Kaplan, Active learning in recommender systems. In F. Ricci, L. Rokach and B. Shapira (eds.), Recommender Systems Handbook, 2nd edn., Springer Verlag, New York (2016).
[267] M. Sugiyama and S. Nakajima, Pool-based active learning in approximate linear regression, Mach. Learn., 75(3), 249–274 (2009).
[268] K. Subramanian, A. K. Das, S. Sundaram and S. Ramasamy, A meta-cognitive interval type-2 fuzzy inference system and its projection based learning algorithm, Evolving Syst., 5(4), 219–230 (2014).
[269] K. Subramanian, S. Suresh and N. Sundararajan, A metacognitive neuro-fuzzy inference system (MCFIS) for sequential classification problems, IEEE Trans. Fuzzy Syst., 21(6), 1080–1095 (2013).
[270] M. Pratama, E. Lughofer, M. Er, S. Anavatti and C. Lim, Data driven modelling based on recurrent interval-valued metacognitive scaffolding fuzzy neural network, Neurocomputing, 262, 4–27 (2017).
[271] M. Pratama, S. Anavatti and J. Lu, Recurrent classifier based on an incremental meta-cognitive scaffolding algorithm, IEEE Trans. Fuzzy Syst., 23(6), 2048–2066 (2015).
[272] E. Lughofer, On-line active learning: a new paradigm to improve practical useability of data stream modeling methods, Inf. Sci., 415–416, 356–376 (2017).
[273] G. Franceschini and S. Macchietto, Model-based design of experiments for parameter precision: state of the art, Chem. Eng. Sci., 63(19), 4846–4872 (2008).
[274] S. García, A. Fernandez, J. Luengo and F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Inf. Sci., 180, 2044–2064 (2010).


[275] B. Hartmann, J. Moll, O. Nelles and C.-P. Fritzen, Hierarchical local model trees for design of experiments in the framework of ultrasonic structural health monitoring. In Proceedings of the IEEE International Conference on Control Applications, pp. 1163–1170 (2011).
[276] T. Pigott, A review of methods for missing data, Educ. Res. Eval., 7(4), 353–383 (2001).
[277] J. Luengo, S. García and F. Herrera, On the choice of the best imputation methods for missing values considering three groups of classification methods, Knowl. Inf. Syst., 32, 77–108 (2012).
[278] S. L. Gay, Dynamically regularized fast recursive least squares with application to echo cancellation. In Proceedings of the IEEE International Conference on Acoustic, Speech and Signal Processing, Atlanta, Georgia, pp. 957–960 (1996).
[279] R. Merched and A. Sayed, Fast RLS Laguerre adaptive filtering. In Proceedings of the Allerton Conference on Communication, Control and Computing, Allerton, IL, pp. 338–347 (1999).
[280] A. Bifet and R. Kirkby, Data stream mining—a practical approach. Technical report, Department of Computer Sciences, University of Waikato, New Zealand (2011).
[281] K. Subramanian, R. Savita and S. Suresh, A meta-cognitive interval type-2 fuzzy inference system classifier and its projection based learning algorithm. In Proceedings of the IEEE EAIS 2013 Workshop (SSCI 2013 Conference), Singapore, pp. 48–55 (2013).
[282] V. Stehman, Selecting and interpreting measures of thematic classification accuracy, Remote Sens. Environ., 62(1), 77–89 (1997).
[283] P. Donmez, G. Lebanon and K. Balasubramanian, Unsupervised supervised learning I: estimating classification and regression errors without labels, J. Mach. Learn. Res., 11, 1323–1351 (2010).
[284] N. Amor, S. Benferhat and Z. Elouedi, Qualitative classification and evaluation in possibilistic decision trees. In Proceedings of the FUZZ-IEEE Conference, Budapest, Hungary, pp. 653–657 (2004).
[285] C. Cernuda, E. Lughofer, G. Mayr, T. Röder, P. Hintenaus, W. Märzinger and J. Kasberger, Incremental and decremental active learning for optimized self-adaptive calibration in viscose production, Chemom. Intell. Lab. Syst., 138, 14–29 (2014).
[286] P. Angelov and I. Skrjanc, Robust evolving cloud-based controller for a hydraulic plant. In Proceedings of the 2013 IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS), Singapore, pp. 1–8 (2013).
[287] D. Dovzan, V. Logar and I. Skrjanc, Solving the sales prediction problem with fuzzy evolving methods. In WCCI 2012 IEEE World Congress on Computational Intelligence, Brisbane, Australia (2012).
[288] A. Cara, L. Herrera, H. Pomares and I. Rojas, New online self-evolving neuro fuzzy controller based on the TaSe-NF model, Inf. Sci., 220, 226–243 (2013).
[289] H.-J. Rong, S. Han and G.-S. Zhao, Adaptive fuzzy control of aircraft wing-rock motion, Appl. Soft Comput., 14, 181–193 (2014).
[290] M. Ferdaus, M. Pratama, S. G. Anavatti, M. A. Garratt and E. Lughofer, PAC: a novel self-adaptive neuro-fuzzy controller for micro aerial vehicles, Inf. Sci., 512, 481–505 (2020).
[291] R. Precup, T. Teban, A. Albu, A. Borlea, I. A. Zamfirache and E. M. Petriu, Evolving fuzzy models for prosthetic hand myoelectric-based control, IEEE Trans. Instrum. Meas., 69(7), 4625–4636 (2020).
[292] C. Xydeas, P. Angelov, S. Chiao and M. Reoulas, Advances in EEG signals classification via dependant HMM models and evolving fuzzy classifiers, International Journal on Computers in Biology and Medicine, special issue on Intelligent Technologies for Bio-Informatics and Medicine, 36(10), 1064–1083 (2006).
[293] N. Nuntalid, K. Dhoble and N. Kasabov, EEG classification with BSA spike encoding algorithm and evolving probabilistic spiking neural network. In Neural Information Processing, LNCS, vol. 7062, Springer Verlag, Berlin Heidelberg, pp. 451–460 (2011).

June 1, 2022 12:51

228

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch04

Handbook on Computer Learning and Intelligence — Vol. I

[294] J. Rubio and A. Bouchachia, MSAFIS: an evolving fuzzy inference system, Soft Comput., 21, 2357–2366 (2015). [295] P. Angelov and A. Kordon, Adaptive inferential sensors based on evolving fuzzy models: an industrial case study, IEEE Trans. Syst. Man Cybern. Part B: Cybern., 40(2), 529–539 (2010). [296] A. Shaker, R. Senge and E. Hüllermeier, Evolving fuzzy patterns trees for binary classification on data streams, Inf. Sci., 220, 34–45 (2013). [297] P. V. de Campos Souza, A. J. Guimaraes, V. S. Araujo, T. S. Rezende and V. J. S. Araujo, Incremental regularized data density-based clustering neural networks to aid in the construction of effort forecasting systems in software development, Appl. Intell., 49, 1–14 (2019). [298] D. Ge and X.-J. Zeng, Learning data streams online—an evolving fuzzy system approach with self-learning/adaptive thresholds, Inf. Sci., 507, 172–184 (2020). [299] D. Leite, F. Gomide, R. Ballini and P. Costa, Fuzzy granular evolving modeling for time series prediction. In Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 2794–2801 (2011). [300] L. Maciel, A. Lemos, F. Gomide and R. Ballini, Evolving fuzzy systems for pricing fixed income options, Evolving Syst., 3(1), 5–18 (2012). [301] G. Prasad, G. Leng, T. McGuinnity and D. Coyle, Online identification of self-organizing fuzzy neural networks for modeling time-varying complex systems. In P. Angelov, D. Filev and N. Kasabov (eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, pp. 201–228 (2010). [302] N. Kasabov, Evolving Connectionist Systems: Methods and Applications in Bioinformatics, Brain Study and Intelligent Machines (Springer Verlag, London, 2002). [303] E. Soares, P. Angelov and X. Gu, Autonomous learning multiple-model zero-order classifier for heart sound classification, Appl. Soft Comput., 94, 106449 (2020). [304] Y.-Y. Lin, J.-Y. Chang and C.-T. 
Lin, Identification and prediction of dynamic systems using an interactively recurrent self-evolving fuzzy neural network, IEEE Trans. Neural Networks Learn. Syst., 24(2), 310–321 (2012). [305] E. Lughofer, C. Eitzinger and C. Guardiola, On-line quality control with flexible evolving fuzzy systems. In M. Sayed-Mouchaweh and E. Lughofer (eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, pp. 375–406 (2012). [306] M. Pratama, E. Dimla, T. Tjahjowidodo, W. Pedrycz and E. Lughofer, Online tool condition monitoring based on parsimonious ensemble+, IEEE Trans. Cybern., 50(2), 664–677 (2020). [307] G. Andonovski, S. Blazic and I. Skrjanc, Evolving fuzzy model for fault detection and fault identification of dynamic processes. In E. Lughofer and M. Sayed-Mouchaweh (eds.), Predictive Maintenance in Dynamic Systems, Springer, New York, pp. 269–285 (2019). [308] S. Silva, P. Costa, M. Santana and D. Leite, Evolving neuro-fuzzy network for real-time high impedance fault detection and classification, Neural Comput. Appl., 32, 7597–7610 (2020). [309] E. Lughofer, A. Zavoianu, R. Pollak, M. Pratama, P. Meyer-Heye, H. Zörrer, C. Eitzinger and T. Radauer, Autonomous supervision and optimization of product quality in a multi-stage manufacturing process based on self-adaptive prediction models, J. Process Control, 76, 27–45 (2019). [310] M. Camargos, I. Bessa, M. DAngelo, L. Cosme and R. Palhares, Data-driven prognostics of rolling element bearings using a novel error based evolving Takagi–Sugeno fuzzy model, Appl. Soft Comput., 96, 106628 (2020). [311] X. Zhou and P. Angelov, Autonomous visual self-localization in completely unknown environment using evolving fuzzy rule-based classifier. In 2007 IEEE International Conference on Computational Intelligence Application for Defense and Security, Honolulu, Hawaii, USA, pp. 131–138 (2007). [312] D. Leite and F. Gomide, Evolving linguistic fuzzy models from data streams. 
In Combining Experimentation and Theory, Springer, Berlin, Heidelberg, pp. 209–223 (2012).

page 228

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

Evolving Fuzzy and Neuro-Fuzzy Systems

b4528-v1-ch04

page 229

229

[313] H. Widiputra, R. Pears and N. Kasabov, Dynamic learning of multiple time se ries in a nonstationary environment. In M. Sayed-Mouchaweh and E. Lughofer (eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, pp. 303–348 (2012). [314] O. Tyshchenko and A. Deineko, An evolving neuro-fuzzy system with online learning/selflearning, Int. J. Mod. Comput. Sci., 2, 1–7 (2015). [315] J. Iglesias, P. Angelov, A. Ledezma and A. Sanchis, Creating evolving user behavior profiles automatically, IEEE Trans. Knowl. Data Eng., 24(5), 854–867 (2012). [316] J. Iglesias, P. Angelov, A. Ledezma and A. Sanchis, Evolving classification of agent’s behaviors: a general approach, Evolving Syst., 1(3), 161–172 (2010). [317] J. Andreu and P. Angelov, Towards generic human activity recognition for ubiquitous applications, J. Ambient Intell. Hum. Comput., 4, 155–156 (2013). [318] L. Wang, H.-B. Ji and Y. Jin, Fuzzy passive–aggressive classification: a robust and efficient algorithm for online classification problems, Inf. Sci., 220, 46–63 (2013). [319] I. Skrjanc, G. Andonovski, A. Ledezma, O. Sipele, J. A. Iglesias and A. Sanchis, Evolving cloud-based system for the recognition of drivers’ actions, Expert Syst. Appl., 99(1), 231–238 (2018). [320] Y.-T. Liu, Y.-Y. Lin, S.-L. Wu, C.-H. Chuang and C.-T. Lin, Brain dynamics in predicting driving fatigue using a recurrent self-evolving fuzzy neural network, IEEE Trans. Neural Networks Learn. Syst., 27(2), 347–360 (2015). [321] X. Zhou and P. Angelov, Real-time joint landmark recognition and classifier generation by an evolving fuzzy system. In Proceedings of FUZZ-IEEE 2006, Vancouver, Canada, pp. 1205–1212 (2006). [322] E. Weigl, W. Heidl, E. Lughofer, C. Eitzinger and T. Radauer, On improving performance of surface inspection systems by on-line active learning and flexible classifier updates, Mach. Vision Appl., 27(1), 103–127 (2016). [323] J. Iglesias, A. Tiemblo, A. Ledezma and A. 
Sanchis, Web news mining in an evolving framework, Inf. Fusion, 28, 90–98 (2016). [324] C. Zain, M. Pratama, E. Lughofer and S. Anavatti, Evolving type-2 web news mining, Appl. Soft Comput., 54, 200–220 (2017).

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

230

b4528-v1-ch04

Handbook on Computer Learning and Intelligence — Vol. I

Appendix

Symbols

$N$: Number of training samples (seen so far)
$p$: Dimensionality of the learning problem = the number of input features/variables
$\vec{x}$: One (current) data sample, containing $p$ input variables
$\vec{x}(k)$ or $\vec{x}_k$: The $k$th data sample in a sequence of samples
$\{x_1, \ldots, x_p\}$: Input variables/features
$y(k)$ or $y$: Measured output/target value for regression problems in the $k$th or current sample
$\hat{y}(k)$ or $\hat{y}$: Estimated output/target value in the $k$th or current sample
$\hat{y}_i(k)$ or $\hat{y}_i$: Estimated output/target value from the $i$th rule in the $k$th or current sample
$m$: Polynomial degree of the consequent function; or number of ensemble members
$r$, $reg$: Regressors
$X$ or $R$: Regression matrix
$R_i$: The $i$th rule, neuron
$C$: Number of rules, neurons, clusters in the evolving model
$P_i(k)$: Inverse Hessian matrix of the $i$th rule
$Q_i(k)$: Weight matrix for the $i$th rule
$P(k)$: Hessian matrix
$Jac$: Jacobian matrix
$\Phi$: Set of parameters (in an ENFS)
$\Phi_{lin}$: Set of linear parameters (in an ENFS)
$\Phi_{nonlin}$: Set of nonlinear parameters (in an ENFS)
$c_i$: Center vector of the $i$th rule, cluster
$\sigma_i$: Ranges of influence/widths vector of the $i$th rule, cluster
$\Sigma_i$: Covariance matrix of the $i$th rule
$win$: Indicates the index of the winning rule, cluster (nearest to sample $\vec{x}$)
$\{\mu_{i1}, \ldots, \mu_{ip}\}$: Antecedent fuzzy sets of the $i$th rule (for $p$ inputs)
$\mu_i(\vec{x})$: Membership/activation degree of the $i$th rule for the current sample $\vec{x}$
$\Psi_i(\vec{x})$: Normalized membership degree of the $i$th rule for the current sample $\vec{x}$
$l_i$: Consequent function, singleton (class label) in the $i$th rule
$\vec{w}_i = \{w_{i0}, \ldots, w_{ip}\}$: Linear weights in the consequent function of the $i$th rule
$L_r$: Real class label of a data sample in classification problems
$L$: Output class label for classification problems (from a fuzzy classifier)
$L_i$: Output class label of the $i$th rule (neuron)
$K$: Number of classes in classification problems
$conf_{ij}$: Confidence of the $i$th rule in the $j$th class
$conf_i$: Confidence of the $i$th rule to its output class label
$conf$: Overall confidence of the fuzzy classifier to its output class label
$J$: Objective functional (to be optimized)
$e(\cdot, \cdot)$: Residual between first and second argument
$\lambda$: Forgetting factor
$\phi$: Angle between two vectors; weight decay function
$a$: Eigenvector
$A$: Matrix containing all eigenvectors
$e$: Axis vector
$\gamma_i$: Local density of the $i$th rule
$P_{i,j}$: $j$th prototype of the $i$th class
$F_i$: $i$th ensemble member

Abbreviations

AHLTNM: Adaptive habitually linear and transiently nonlinear model
AI: Artificial intelligence
ALMMo: Autonomous learning multimodel systems
AnYa: Angelov and Yager system
APE: Average percentual deviation
COG: Center of gravity
DENFIS: Dynamic evolving neural-fuzzy inference system
DoE: Design of experiments
DRB: Deep rule-based classifiers
EBeTS: Error-based evolving Takagi–Sugeno fuzzy model
eClass: Evolving classifier
EFC: Evolving fuzzy classifiers
EFC-AP: Evolving fuzzy classifier with all-pairs architecture
EFC-MM: Evolving fuzzy classifier with multi-model architecture
EFC-SM: Evolving fuzzy classifier with single model architecture

EFP: Evolving fuzzy predictor
eFPT: Evolving fuzzy pattern trees
EFS: Evolving fuzzy systems
EFS-SLAT: Evolving fuzzy systems with self-learning/adaptive thresholds
eFT: Evolving fuzzy trees
eFuMO: Evolving fuzzy model
EFuNN: Evolving fuzzy neural network
eGNN: Evolving granular neural networks
EIS: Evolving intelligent systems
eMG: Evolving multivariate Gaussian
ENFS: Evolving neuro-fuzzy systems
ENFS-OL: Evolving neuro-fuzzy system with online learning/self-learning
ENFS-Uni0: Evolving neuro-fuzzy systems based on uni-null neurons
EPFM: Evolving possibilistic fuzzy modeling
ePL: Evolving participatory learning
epSNNr: Evolving probabilistic spiking neural network
eTS: Evolving Takagi–Sugeno
eTS-LS-SVM: Evolving TS model based on least squares SVM
eT2Class: Evolving type-2 fuzzy classifier
eT2FIS: Evolving type-2 neural fuzzy inference system
eT2RFNN: Evolving type-2 recurrent fuzzy neural network
EWRLS: Extended weighted recursive least squares
FBeM: Fuzzy set based evolving modeling
FLEXFIS: Flexible fuzzy inference systems
FPA: Fuzzy passive-aggressive classification
FWGRLS: Fuzzily weighted generalized recursive least squares
GAP-RBF: Growing and pruning radial basis function
GD-FNN: Genetic dynamic fuzzy neural network
GENEFIS: Generic evolving neuro-fuzzy inference system
GMMs: Gaussian mixture models
GS-EFS: Generalized smart evolving fuzzy systems
IBeM: Interval-based evolving modeling
InFuR: Incremental fuzzy C-regression clustering
IPLS-GEFS: Incremental PLS with generalized evolving fuzzy systems
IRSFNN: Interactively recurrent self-evolving fuzzy neural network
IT2-GSETSK: Evolving interval type-II generic self-evolving TSK neural system

LNSE: Learning in non-stationary environments
LOFO: Leave-one-feature-out approach
LOLIMOT: Local linear model tree
LS: Least squares
LS_SVM: Least squares support vector machine
LV: Latent variable
MAE: Mean absolute error
McIT2FIS: Metacognitive neuro-fuzzy inference system
MIMO: Multiple input multiple output
MNIST: Hand-written digit data base
MOM: Mean of maximum
MSAFIS: Modified sequential adaptive fuzzy inference system
NeuroFAST: Neuro function activity structure and technology
NFC: Neuro-fuzzy controller
OB-EFS: Online bagging of evolving fuzzy systems
PANFIS: Parsimonious network based on fuzzy inference system
PCA: Principal component analysis
pClass: Parsimonious evolving fuzzy classifier
pEnsemble: Parsimonious ensemble fuzzy classifier
PID: Proportional-integral-derivative
PLS: Partial least squares
PM-CSA: Pseudo Monte-Carlo sampling algorithm
RECCo: Robust evolving cloud-based controller
RFWLS: Recursive fuzzily weighted least squares
rGK: Recursive Gustafson–Kessel
RSEFNN: Recurrent self-evolving fuzzy neural network
RIVMcSFNN: Recurrent interval-valued metacognitive scaffolding fuzzy neural network
RLM: Recursive Levenberg–Marquardt
RMSE: Root mean squared error
RUL: Remaining useful life
RWTLS: Recursive weighted total least squares
SAFIS: Sequential adaptive fuzzy inference system
SCFNN: Self-constructing fuzzy neural network
SEIT2-FNN: Self-evolving interval type-2 fuzzy neural network
SOFMLS: Online self-organizing fuzzy modified least-squares network

SOFNN: Self-organizing fuzzy neural network
SP-AL: Single-pass active learning
SUHICLUST: Supervised hierarchical clustering
SVMs: Support vector machines
TEDA: Typicality and eccentricity based data analytics
TEDAClass: Typicality and eccentricity based data analytics classifier
TS: Takagi–Sugeno
TSK: Takagi–Sugeno–Kang
VLDB: Very large data base
WLS: Weighted least squares


© 2022 World Scientific Publishing Company
https://doi.org/10.1142/9789811247323_0005

Chapter 5

Incremental Fuzzy Machine Learning for Online Classification of Emotions in Games from EEG Data Streams

Daniel Leite∗,†, Volnei Frigeri Jr.‡, and Rodrigo Medeiros‡

∗Faculty of Engineering and Science, Adolfo Ibáñez University, Santiago, Chile
†[email protected]
‡{vjunior.frigeri, rodrigo.nmedeiros19}@gmail.com

Emotions play an important role in human–computer interaction and decision-making processes. Identifying the human emotional state has become necessary for more realistic and interactive machines and computer systems. The greatest challenge is the availability of high-performance online learning algorithms that can effectively handle individual differences and nonstationarities from physiological data streams, i.e., algorithms that self-customize to a new user with no subject-specific calibration data. We describe an evolving Gaussian fuzzy classification (eGFC) approach, which is supported by an online semi-supervised learning algorithm to recognize emotion patterns from electroencephalogram (EEG) data streams. We present a method to extract and select features from bands of the Fourier spectra of EEG data. The data are provided by 28 individuals playing the games Train Sim World, Unravel, Slender: The Arrival, and Goat Simulator for 20 minutes each (a public dataset). According to the arousal–valence system, different emotions prevail in each game (boredom, calmness, horror, and joy). We analyze 14 electrodes/brain regions individually, as well as the effect of time window lengths, bands, and dimensionality reduction on the accuracy of eGFC. The eGFC model is user-independent and learns its structure and parameters from scratch. We conclude that both brain hemispheres may assist classification, especially electrodes on the frontal (AF3–AF4) area, followed by the occipital (O1–O2) and temporal (T7–T8) areas. We observed that patterns may eventually be found in any frequency band; however, the alpha (8–13 Hz), delta (1–4 Hz), and theta (4–8 Hz) bands, in this order, are more monotonically correlated with the emotion classes. eGFC has been shown to be effective for real-time learning from a large volume of EEG data. It reaches 72.20% accuracy using, on average, a compact 6-rule structure, with 10-second time windows and an average processing time of 1.8 ms/sample, on a highly stochastic, time-varying 4-class classification problem.


5.1. Introduction

Machine recognition of human emotions has been a topic of increasing scientific, clinical, and technological interest [1, 2]. Emotions can be communicated through (i) non-physiological data, such as facial expressions, gestures, body language, eye blink count, and voice tones; and (ii) physiological data, by using the electroencephalogram (EEG), electromyogram (EMG), and electrocardiogram (ECG) [3, 4]. Computer vision and audition, physiological signals, and brain–computer interfaces (BCI) are typical approaches to emotion recognition. The implications of non-physiological and physiological data analysis spread across a variety of domains in which artificial intelligence has offered decision-making support; control of mechatronic systems, software, and virtual agents; and realism, efficiency, and interaction [1].

EEG feature extraction has been a primary research issue in the field, since learning algorithms to determine classification models and class boundaries are applied over the resulting feature space. Any feature extraction and time windowing method; kernel machine; aggregation function; Fourier, wavelet, or alternative transform; energy or entropy equation; and so on, combined or not, can be applied over raw multi-channel EEG data to obtain an optimal representation, i.e., a representation that further facilitates learning a classification hypothesis. As a consequence, finding such an optimal representation is like finding a needle in a haystack.

An additional issue concerns generalization across individuals. Highly discriminative features and robust classifiers depend on the characteristics of the experimental group, inter- and intra-subject variability, and the classification task itself. For example, different effective features can be found in recordings from different subjects. Each subject has its own set of premises and uncertainties.

Although several feature extraction methods have been evaluated within intelligent systems and BCI studies, their performance remains, in general, unsatisfactory [5]. A total of 4,464 time-frequency and dynamical features are investigated in [6], based on the benchmark SEED and DEAP EEG datasets. A discussion is presented on how controversial literature results have been in relation to the more effective channels, features, and frequency bands, and to cross-subject recognition issues. L1-norm penalty-based feature elimination is performed to achieve a more compact feature set, with Hjorth parameters and the bilateral temporal lobes considered appealing to analyze time-varying EEG data. Although high-frequency (beta and gamma) bands have occasionally been pointed out as more effective to differentiate emotions, all bands may contribute to classification [6–8], especially because the brain does not process emotions in isolation, but together with other cognitive processes.

A survey on evolutionary methods to systematically explore EEG features is given in [9]. A comparison among empirical mode decomposition, Daubechies-4


wavelet, and entropy-based extraction methods is provided in [10]. Principal component analysis (PCA) and Fisher projection have also been used for feature extraction [11]. A drawback of many extraction methods based on a combination of features concerns the loss of domain information, namely, the specific channels, brain regions, and frequency bands are unknown after the combination. Domain information may be important for the purpose of model interpretability and brain response understanding.

Classifiers of EEG data are typically based on support vector machines (SVMs) using different kernels, the k-nearest neighbor (kNN) algorithm, linear discriminant analysis (LDA), the Naive Bayes (NB) algorithm, and the multi-layer perceptron (MLP) neural network; see [1, 12] for a summary. Deep learning has also been applied to affective computing from physiological data. A deep neural network is used in [13] to classify the states of relaxation, anxiety, excitement, and fun by means of skin conductance and blood volume pulse signals, with performance similar to that of shallow methods. A deep belief network (DBN), i.e., a probabilistic generative model with deep architecture, to classify positive, neutral, and negative emotions is described in [11]. Selection of electrodes and frequency bands is performed through the weight distributions obtained from the trained DBN, with differential asymmetries between the left and right brain hemispheres being relevant features. A dynamical-graph convolutional neural network (DG-CNN) [3], trained by error back-propagation, learns an adjacency matrix related to different EEG channels and outperforms DBN, transductive SVM, transfer component analysis (TCA), and other methods on the SEED and DREAMER benchmark EEG datasets.

Fuzzy methods to deal with uncertainties in emotion identification, especially using facial images and speech data, have recently been proposed. A two-stage fuzzy fusion strategy combined with a deep CNN (TSFF-CNN) is presented in [14]. Facial expression images and speech modalities are aggregated for a final classification decision. The method manages the ambiguity and uncertainty of emotional state information. The TSFF-CNN has outperformed non-fuzzy deep learning methods. In [15], a fuzzy multi-class SVM method uses features from the biorthogonal wavelet transform applied over images of facial expressions to detect emotions, namely, happiness, sadness, surprise, anger, disgust, and fear. An adaptive neuro-fuzzy inference system (ANFIS) that combines facial expression and EEG features has been shown to be superior to single-source-based classifiers in [16]. In this case, a 3D fuzzy extraction method for content analysis and a 3D fuzzy tensor to extract emotional features from raw EEG data are used. Extracted features are fed to ANFIS, which, in turn, identifies the valence status stimulated by watching a movie clip [16].

A method called multi-scale inherent fuzzy entropy (I-FuzzyEn), which uses empirical mode decomposition and fuzzy membership functions, is proposed to


evaluate the complexity of EEG data [17]. Complexity estimates are useful as a bio-signature of health conditions. FuzzyEn has been shown to be more robust than analogous non-fuzzy methods in the sense of root mean square deviation. The online weighted adaptation regularization for regression (OwARR) algorithm [18] aims to estimate driver drowsiness from EEG data. OwARR uses fuzzy sets to perform part of the regularization procedure. Some offline training steps are needed to select domains. The idea is to reduce subject-specific calibration based on the EEG data. An ensemble of models using swarm-optimized fuzzy integrals for motor imagery recognition and robotic arm control is addressed in [19]. The ensemble uses the Sugeno or Choquet fuzzy integral supported by particle swarm optimization (PSO) to represent variations between individuals. The ensemble identifies the user's mental representation of movements from EEG patterns. In [5], a multimodal fuzzy fusion BCI for motor imagery recognition, also using the Choquet and Sugeno integrals, is given. Fuzzy aggregation aims to compensate for posterior probabilities in classification due to different EEG frequencies and multiple classifiers. Despite the above studies, the literature on fuzzy methods applied to EEG data analysis and BCI systems remains scarce overall.

All offline classifier-design methods mentioned above implicitly assume that the data sources are governed by stationary distributions, since the classifiers are expected to keep their training performance during online operation using a fixed structure and parameters. Nonetheless, EEG data streams very often change, gradually and abruptly, due to movement artifacts, electrode and channel configuration, noise, and environmental conditions. At the individual level, factors such as fatigue, inattention, and stress also affect pre-designed classifiers in an uncertain way. The non-stationary nature of physiological signals and BCI systems leads the accuracy of both user-dependent and user-independent classifiers to degrade (the former being more susceptible to performance degradation) unless the classifiers are updated on the fly to represent the changes.

The present study deals with online learning and real-time emotion classification from EEG data streams generated by individuals playing computer games. We briefly present a recently proposed evolving Gaussian fuzzy classification (eGFC) approach [20], which is supported by an incremental semi-supervised online learning algorithm. eGFC is an instance of an evolving granular system [21, 22], which, in turn, is a general-purpose online learning framework, i.e., a family of algorithms and methods to autonomously construct classifiers, regressors, predictors, and controllers in which any aspect of a problem may assume a non-pointwise (e.g., interval, fuzzy, rough, statistical) uncertain characterization, including data, parameters, features, learning equations, and covering regions [23–25]. In particular, eGFC has been successfully applied to real-time anomaly detection in data centers [26] and to power quality disturbance classification in smart grids [20].


We use the public EEG dataset by [27], which contains time-indexed samples obtained from individuals exposed to visual and auditory stimuli according to a protocol. Brain activity is recorded by the 14 channels of the Emotiv EPOC+ EEG device. We pre-process raw data by means of time windowing and filtering procedures to remove artifacts and undesirable frequencies. We choose to extract and analyze ten features per EEG channel, namely, the maximum and mean values from each of the delta (1–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), beta (13–30 Hz), and gamma (30–64 Hz) bands, which gives 140 features in total. Then, a unique user-independent eGFC model is developed from scratch based on sequential data from 28 consecutive players. Notice that user-dependent classification models are generally more accurate, but less generalizable. We choose a more generalizable eGFC model, which is the realistic approach in a multi-user environment. After recognizing a predominant emotion according to the arousal–valence system, i.e., boredom, calmness, horror, or joy, which are related to specific computer games, feedback can be implemented in a real or virtual environment to promote a higher level of realism and interactivity. The feedback step is out of the scope of this study.
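The band-feature extraction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the use of the amplitude spectrum (rather than power), the 128 Hz sampling rate, the half-open band intervals, and the names `band_features` and `feature_vector` are all assumptions made for illustration only. The band edges are the ones listed in the text.

```python
import numpy as np

# Band edges in Hz, as listed in the text.
BANDS = {
    "delta": (1.0, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
    "gamma": (30.0, 64.0),
}


def band_features(window, fs):
    """Return {band: (max, mean)} of the amplitude spectrum of one channel.

    window: 1-D array with the samples of one time window of one channel.
    fs: sampling rate in Hz (a hypothetical value is used in the example).
    """
    window = np.asarray(window, dtype=float)
    amplitude = np.abs(np.fft.rfft(window)) / window.size
    freqs = np.fft.rfftfreq(window.size, d=1.0 / fs)
    features = {}
    for name, (low, high) in BANDS.items():
        # Half-open interval, so a bin on a shared edge is counted only once.
        mask = (freqs >= low) & (freqs < high)
        features[name] = (amplitude[mask].max(), amplitude[mask].mean())
    return features


def feature_vector(windows, fs):
    """Stack the 10 band features of every channel into one flat vector.

    windows: array of shape (n_channels, n_samples).
    """
    vector = []
    for channel in np.asarray(windows, dtype=float):
        for maximum, mean in band_features(channel, fs).values():
            vector.extend([maximum, mean])
    return np.array(vector)


# Synthetic example: 14 channels, one 10-second window, assumed 128 Hz rate.
fs = 128
t = np.arange(10 * fs) / fs
rng = np.random.default_rng(0)
windows = np.sin(2 * np.pi * 10.0 * t) + 0.1 * rng.standard_normal((14, t.size))
x = feature_vector(windows, fs)
print(x.shape)
```

With 14 channels and 5 bands, the resulting vector has 14 × 5 × 2 = 140 entries, which matches the feature count given in the text. The synthetic 10 Hz component falls in the alpha band, so that band carries the largest maximum value in every channel.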
The contributions of this paper in relation to the literature of the area are as follows:

• A new, user-independent, time-varying Gaussian fuzzy classifier, eGFC, supplied with an online incremental learning algorithm to deal with uncertainties and nonstationarities of physiological EEG data streams, in particular, data streams due to visual and auditory stimuli, with a focus on emotion recognition in computer games;
• Distinguishing eGFC characteristics in relation to the literature of the area, which include: (i) real-time operation using adaptive granularity and an interpretable fuzzy rule-based structure; (ii) no need to store data samples or to know the task and the number of classes a priori; (iii) the ability to incorporate new spatiotemporal patterns on the fly, without human intervention;
• An analysis of the effects of time window lengths, brain regions, frequency bands, and dimensionality reduction on the classification performance.

5.2. Brain–Computer Interface

A BCI consists of electrodes and other bio-sensors, high-performance hardware, and a computing framework for pattern recognition, prediction, and decision-making support. An EEG-based BCI allows reading, analyzing, differentiating, and interpreting physiological signals reflected in brain waves [28]. A user may interact with computers, mechatronics, and even with other humans, from thoughts, movements, and environmental perceptions, which activate specific parts of the brain. A communication path among the elements of interest is needed [29].


Typical EEG-BCI applications target solutions that promote human–machine interaction [30]. The applications differ in (i) the device used (invasive/non-invasive, in relation to the human body), (ii) the stimulus (exogenous/endogenous), (iii) the data processing (synchronous/asynchronous), and (iv) the objective (active, passive, reactive) [4]. Non-invasive approaches prevail. Despite the lower spatial resolution provided by a set of passive electrodes placed on the scalp, and the interference manifested as noise and data distortions, they are preferred because brain implants are too risky [31]. Moreover, the electrodes work synchronously, and the analysis of endogenous stimuli predominates, i.e., stimuli dependent on the cognitive response to sensory stimuli, e.g., visual and auditory. The present study considers this context.

5.2.1. Evoked Potential and the 10–20 System

The location of electrodes on the scalp follows the international 10–20 system so that setups are replicable. The position depends on two references: the nasion, in the superior part of the nose, and the inion, on the back of the skull [1, 28]. A letter identifies the lobe: F stands for frontal, T for temporal, P for parietal, and O for occipital. Even and odd numbers refer to positions on the right and left brain hemispheres, respectively. The 10–20 system is employed in many commercial EEG devices and is widely used in scientific research. Figure 5.1 shows the ideal position of the electrodes; P3 and P4 are ground (zero volts) points. The 14 points in green are those measured by the Emotiv EPOC+ EEG device. The five blank spots (and other additional spots) can also be considered. For example, the Open BCI EEG device considers all the 19 spots shown in the figure.

Evoked potential is an electrical potential recorded by the EEG after the presentation of a stimulus. In computer games, we consider auditory evoked potentials (AEP), elicited through earphones, and visual evoked potentials (VEP), elicited by non-stationary patterns presented on a monitor.

5.2.2. The Arousal–Valence System and Games

The dimensional approach, or the arousal–valence system [32], considers human emotions in a continuous form, along the circumference of a circle. On the left side of the valence dimension are negative emotions, denoted by N, e.g., “angry,” “bored,” and related adjectives, whereas on the right side are positive emotions, P, e.g., “happy,” “calm,” and related adjectives. The upper part of the arousal dimension characterizes excessive emotional arousal or behavioral expression, while the lower part designates apathy. We assign a number from 1 to 4 to the arousal–valence quadrants (see Figure 5.2), according to the 4-class classification problem we have. A coarser or more detailed classification problem can be formulated.
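As a small illustration of the 4-class formulation, the quadrant of a point in the arousal–valence plane can be mapped to a class number. The sketch below is only an assumption-laden example: the class numbering (1 = bored, 2 = calm, 3 = angry, 4 = happy) is the one used later in Section 5.4.1, the placement of bored/calm in the lower half and angry/happy in the upper half follows the conventional reading of the model, and the function name and the zero-centred coordinates are hypothetical.

```python
def emotion_class(valence, arousal):
    """Map a point of the arousal-valence plane to a class in {1, 2, 3, 4}.

    Assumptions: both coordinates are centred on 0 (negative = left/lower
    half); points exactly on an axis are assigned to the positive side.
    """
    if arousal < 0:
        # Lower half of the circle: low arousal.
        return 1 if valence < 0 else 2  # 1 = bored, 2 = calm
    # Upper half of the circle: high arousal.
    return 3 if valence < 0 else 4      # 3 = angry, 4 = happy
```

A finer partition of the circle (for instance eight sectors) would give the more detailed classification problem mentioned above.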


Figure 5.1: The international 10–20 system to place electrodes on the scalp.

Figure 5.2: The arousal-valence model of emotions.


Spectator and actor emotions while playing computer games are different in terms of strength [33]. While spectators are impressed with the game development, their emotions tend to be weaker and are categorized in the lower quadrants of the arousal–valence circle. On the contrary, as actors effectively change the outcome of the game, their mental activity is higher. Actors cognitively process images, build histories mentally, and evaluate characters and situations. At first, actor emotions are not prone to any quadrant. In emotion recognition, computer games have been widely used [4, 34], mainly because the brains of actors are subject to various kinds of emotions and cognitive processes.

5.3. eGFC: Evolving Gaussian Fuzzy Classifier This section summarizes eGFC, a recently proposed semi-supervised evolving classifier [20]. Although eGFC handles partially labeled data, we assume that a label becomes available some time steps after the class estimate. Therefore, we perform a single supervised learning step using the corresponding input–output pair when the output is available. eGFC uses Gaussian membership functions to cover the data space with fuzzy granules. New data samples are associated with fuzzy rules and class labels. Granules are scattered in the data space to represent local information. eGFC global response comes from the aggregation of local models. A recursive algorithm constructs its rule base, and updates local models to deal with non-stationarities. eGFC addresses additional issues such as unlimited amounts of data, and scalability [21, 26]. A local fuzzy model is created if a sample is sufficiently different from the current knowledge. The learning algorithm expands, reduces, removes, and merges granules. Rules are reviewed according to inter-granular relations. eGFC provides nonlinear, non-stationary, and smooth discrimination boundaries among classes [20]. Before describing the eGFC structure and algorithm, the following section briefly summarizes some related online fuzzy classification methods.

5.3.1. Some Related Online Learning Methods

A previous fuzzy rule-based method for data stream classification, called eClass, is described in [35]. It comes in two architectures: eClass0, which keeps class labels as rule consequents, and eClass1, a first-order Takagi–Sugeno architecture. Incremental supervised algorithms are given to learn patterns from scratch. The structure of an eClass0 rule is related to that of eGFC [20] in the consequent sense, i.e., a class label is memorized. However, their learning procedures are completely different, primarily because eGFC learns from partially labeled data. While eClass0 uses focal points as rule antecedents, which are in fact input data vectors, eGFC uses Gaussian membership functions. eClass has shown encouraging results on intrusion detection,
and on benchmark problems from the UCI Machine Learning Repository, such as Sonar, Credit Card, Ten-Digit, and Ionosphere. FLEXFIS-Class [36] is another example of a learning method to build models of eClass0 type on the fly. Similarly, its learning equations are fully distinct from those in [35]. New concepts, such as rule confidence and covariance matrix reset, are introduced in [36]. Parsimonious classifier, pClass [37], is a further example, which incorporates ideas such as online feature weighting and datum significance. A dynamic incremental semi-supervised version of the fuzzy C-means clustering (DISSFCM) algorithm is given in [38]. The method handles streaming data chunks containing partially labeled instances. The total number of clusters is variable by applying a splitting procedure for lower-quality clusters. Incremental semisupervision as a means to guide the clustering process is discussed. This issue is also addressed in [39], where a fraction of labeled instances is processed by the proposed online micro-cluster-based algorithm. Reliable micro-clusters are updated to represent concept drifts faster. A merging procedure for Gaussian evolving clustering (eGAUSS+) is presented in [40]. The criterion for merging is based on the sum of the volumes of two ellipsoidal clusters that enclose a minimal number of data instances, and the expected volume of the new cluster, which results from merging. Merging has a significant impact on any evolving classifier [22, 40]. A semi-supervised variety of evolving neuro-granular network (eGNN) for numerical data stream classification is given in [41]. The granular network evolves fuzzy hyper-boxes and utilizes null-norm–based neurons to classify data. The Rotation of the Twin Gaussians experiment—a synthetic concept drift—and the Arising of a New Class experiment—a concept shift—have shown the robustness of the granular neural network to non-stationarities. 
Deep learning methods for partially labeled data-stream classification have also been considered [42, 43]. Some potential problems arise naturally in deeper models since if a model is deep, training all parameters and layers to effectively represent concept changes tends to be a slower process, which requires a larger amount of data related to the new concept.

5.3.2. Gaussian Functions and eGFC Rule Structure

Learning in eGFC does not require initial rules. Rules are created and dynamically updated depending on the behavior of the system over time. When a data sample is available, a decision to either add a rule to the model structure or update the parameters of a rule is made. An eGFC rule, say $R^i$, is

IF ($x_1$ is $A_1^i$) AND ... AND ($x_n$ is $A_n^i$), THEN ($y$ is $C^i$)

in which $x_j$, $j = 1, \ldots, n$, are features, and $y$ is a class. The data stream is denoted by $(x, y)^h$, $h = 1, \ldots$. Moreover, $A_j^i$, $j = 1, \ldots, n$; $i = 1, \ldots, c$, are Gaussian membership functions built and updated incrementally; $C^i$ is the class label of the $i$-th rule. Rules $R^i$, $i = 1, \ldots, c$, set up a zero-order Takagi–Sugeno rule base. The next sections give a semi-supervised way of constructing an eGFC model online, from scratch. The number of rules, $c$, is variable, which is a notable characteristic of the approach since guesses on how many data partitions exist are unnecessary [21, 22]. In other words, eGFC rules are added, updated, merged, or removed on demand, i.e., the number of data partitions is automatically chosen by the learning algorithm based on the behavior of the data stream.

A normal Gaussian membership function, $A_j^i = G(\mu_j^i, \sigma_j^i)$, has height 1 [44]. It is characterized by the modal value $\mu_j^i$ and dispersion $\sigma_j^i$. Characteristics that make Gaussians appropriate include: ease of learning, as modal values and dispersions can be updated straightforwardly from a data stream; infinite support, i.e., since the data are unknown a priori, the support of Gaussians extends to the whole domain; and smooth class boundaries, produced by fuzzy granules obtained by the cylindrical extension and minimum T-norm aggregation of one-dimensional Gaussians [44].

5.3.3. Adding Fuzzy Rules to the Evolving Classifier

Rules are created and evolved as data are available. A new granule, $\gamma^{c+1}$, and rule, $R^{c+1}$, are created if none of the existing rules $\{R^1, \ldots, R^c\}$ are sufficiently activated by $x^h$. Let $\rho^h \in [0, 1]$ be an adaptive threshold. If

$$T\left(A_1^i(x_1^h), \ldots, A_n^i(x_n^h)\right) \le \rho^h, \quad \forall i,\; i = 1, \ldots, c, \tag{5.1}$$

in which $T$ is any triangular norm, then the eGFC structure is expanded. The minimum (Gödel) T-norm is used. If $\rho^h$ is set to 0, then eGFC is structurally stable. If $\rho^h$ is equal to 1, eGFC creates a rule for each new sample, which is not practical. Structural and parametric adaptability are balanced for intermediate values (stability–plasticity tradeoff) [25]. $\rho^h$ regulates how large granules can be. Different choices result in different granular perspectives of the same problem. Section 5.3.5 gives an incremental procedure to update $\rho^h$. A new $\gamma^{c+1}$ is initially represented by membership functions, $A_j^{c+1}$ $\forall j$, with

$$\mu_j^{c+1} = x_j^h, \tag{5.2}$$

and

$$\sigma_j^{c+1} = 1/(2\pi). \tag{5.3}$$

We call (5.3) the Stigler approach to standard Gaussian functions, or the maximum approach [24, 45]. The intuition is to start big, and let the dispersions gradually shrink when new samples activate the same granule. This strategy is appealing for a compact model structure.
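The creation test (5.1) and the initialization (5.2)–(5.3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the membership form exp(−(x − μ)²/(2σ²)) is the usual height-1 Gaussian and is assumed here, and the dictionary layout and all function names are hypothetical.

```python
import math

STIGLER_SIGMA = 1.0 / (2.0 * math.pi)  # initial dispersion, Eq. (5.3)


def gaussian(x, mu, sigma):
    """Height-1 Gaussian membership degree (assumed functional form)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def activation(granule, x):
    """Minimum T-norm over the per-feature membership degrees."""
    return min(
        gaussian(xj, mu, s)
        for xj, mu, s in zip(x, granule["mu"], granule["sigma"])
    )


def needs_new_granule(granules, x, rho):
    """Condition (5.1): True when no existing rule is activated above rho."""
    return all(activation(g, x) <= rho for g in granules)


def new_granule(x, label=None):
    """Eqs. (5.2)-(5.3): centre on the sample, start with the Stigler dispersion.

    The label stays None until a class becomes available.
    """
    return {"mu": list(x), "sigma": [STIGLER_SIGMA] * len(x), "label": label}
```

With `rho = 0.1`, a sample at the centre of a granule has activation 1 and never triggers creation, whereas a sample far from every centre does.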


In general, the class $C^{c+1}$ of the rule $R^{c+1}$ is initially undefined, i.e., the $(c+1)$-th rule remains unlabeled until a label is provided. If the corresponding output, $y^h$, associated with $x^h$, becomes available, then

$$C^{c+1} = y^h. \tag{5.4}$$

Otherwise, the first labeled sample that arrives after the h-th time step, and activates the rule R c+1 according to (5.1), is used to define its class. If a labeled sample activates a rule that is already labeled, but the sample label is different from that of the rule, then a new (partially overlapped) granule and a rule are created to represent new information. Partially overlapped Gaussian granules, tagged with different labels, tend to have their dispersions reduced over time by the parameter adaptation procedure described in Section 5.3.4. The modal values of the Gaussian granules may also drift, if convenient, to a more suitable decision boundary. With this initial rule parameterization, preference is given to the design of granules balanced along its dimensions, rather than to granules with unbalanced geometry. eGFC realizes the principle of justifiable information granularity [46, 47], but allows the Gaussians to find more appropriate places and dispersions through updating mechanisms.

5.3.4. Incremental Parameter Adaptation

Updating the eGFC model consists in: contracting or expanding the Gaussians $A_j^{i^*}$, $j = 1, \ldots, n$, of the most active granule, $\gamma^{i^*}$, considering labeled and unlabeled samples; moving granules toward regions of relatively dense population; and tagging rules when labeled data are available. Adaptation aims to develop more specific local models [24, 48] and provide better coverage to new data. A rule $R^i$ is a candidate to be updated if it is sufficiently activated by an unlabeled sample, $x^h$, according to

$$\min\left(A_1^i(x_1^h), \ldots, A_n^i(x_n^h)\right) > \rho^h. \tag{5.5}$$

Geometrically, $x^h$ belongs to a region highly influenced by $\gamma^i$. Only the most active rule, $R^{i^*}$, is chosen for adaptation if two or more rules reach the $\rho^h$ level for the unlabeled $x^h$. For a labeled sample, i.e., for pairs $(x, y)^h$, the class of the most active rule $R^{i^*}$, if defined, must match $y^h$. Otherwise, the second-most active rule among those that reached the $\rho^h$ level is chosen for adaptation, and so on. If none of the rules are apt, then a new one is created (Section 5.3.3).


To include $x^h$ in $R^{i^*}$, the learning algorithm updates the modal values and dispersions of the corresponding membership functions $A_j^{i^*}$, $j = 1, \ldots, n$, from

$$\mu_j^{i^*}(\text{new}) = \frac{(\varpi^{i^*} - 1)\,\mu_j^{i^*}(\text{old}) + x_j^h}{\varpi^{i^*}}, \tag{5.6}$$

and

$$\sigma_j^{i^*}(\text{new}) = \left( \frac{\varpi^{i^*} - 1}{\varpi^{i^*}} \left(\sigma_j^{i^*}(\text{old})\right)^2 + \frac{1}{\varpi^{i^*}} \left(x_j^h - \mu_j^{i^*}(\text{old})\right)^2 \right)^{1/2}, \tag{5.7}$$

in which $\varpi^{i^*}$ is the number of times the $i^*$-th rule was chosen to be updated. Notice that (5.6)–(5.7) are recursive and, therefore, do not require data storage. As $\sigma^{i^*}$ defines a convex influence region around $\mu^{i^*}$, very large and very small values may induce, respectively, a unique or too many granules per class. An approach is to keep $\sigma_j^{i^*}$ between a minimum value, $1/(4\pi)$, and the Stigler limit, $1/(2\pi)$ [24].
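The recursive updates (5.6)–(5.7), together with the suggested bounds on the dispersion, might be coded as below. The sketch assumes that the counter already equals 1 when the granule is created, so that the creating sample is not discarded at the first update; this convention, the function name, and the argument layout are assumptions made for illustration.

```python
import math

SIGMA_MIN = 1.0 / (4.0 * math.pi)  # lower bound on the dispersion
SIGMA_MAX = 1.0 / (2.0 * math.pi)  # Stigler limit


def update_granule(mu, sigma, count, x):
    """Apply Eqs. (5.6)-(5.7) to one granule and return the new parameters.

    mu, sigma: per-feature modal values and dispersions (lists of floats).
    count: number of samples absorbed so far (assumed >= 1).
    """
    count += 1  # the granule is chosen once more
    new_mu, new_sigma = [], []
    for m, s, xj in zip(mu, sigma, x):
        new_mu.append(((count - 1) * m + xj) / count)
        variance = ((count - 1) / count) * s ** 2 + (xj - m) ** 2 / count
        # Keep the dispersion within [1/(4*pi), 1/(2*pi)].
        new_sigma.append(min(max(math.sqrt(variance), SIGMA_MIN), SIGMA_MAX))
    return new_mu, new_sigma, count
```

Only the previous parameters and the counter are needed, which is what makes the update suitable for unbounded streams.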

5.3.5. Time-Varying Granularity

Let the activation threshold, $\rho^h$, be time-varying, similar to [23]. The threshold assumes values in the unit interval according to the overall average dispersion

$$\sigma_{avg}^h = \frac{1}{cn} \sum_{i=1}^{c} \sum_{j=1}^{n} \sigma_j^{ih}, \tag{5.8}$$

in which $c$ and $n$ are the number of rules and features so that

$$\rho(\text{new}) = \frac{\sigma_{avg}^h}{\sigma_{avg}^{h-1}}\, \rho(\text{old}). \tag{5.9}$$

Given xh , rule activation levels are compared to ρ h to decide between parametric and structural changes of the model. In general, eGFC starts learning from scratch and without knowledge of the data properties. Practice suggests ρ 0 = 0.1 as starting value. The threshold tends to converge to a proper value after some time steps if the classifier structure and parameters reach a level of maturity and stability. Nonstationarities guide ρ h to values that better reflect the dynamic behavior of the current environment. A time-varying ρ h avoids assumptions about how often the data stream changes.
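A possible coding of (5.8)–(5.9) is given next. The explicit clipping of the threshold to the unit interval is a safeguard added here and is an assumption, as are the function names and the list-of-lists layout of the dispersions.

```python
def average_dispersion(sigmas):
    """Eq. (5.8): mean of all dispersions.

    sigmas: list with one entry per rule, each a list with one
    dispersion per feature (c rules, n features).
    """
    c, n = len(sigmas), len(sigmas[0])
    return sum(sum(row) for row in sigmas) / (c * n)


def update_rho(rho_old, sigma_avg_new, sigma_avg_old):
    """Eq. (5.9): scale rho by the ratio of average dispersions."""
    rho = rho_old * sigma_avg_new / sigma_avg_old
    return min(max(rho, 0.0), 1.0)  # assumed safeguard: stay in [0, 1]
```

When the granules shrink on average, the ratio is below 1 and the threshold decreases; when they grow, the threshold increases.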

5.3.6. Merging Similar Rules

Similarity between two granules with the same class label may be high enough to form a unique granule that inherits the information assigned to the merged granules.


Analysis of inter-granular relations requires a distance measure between Gaussian objects. Let

$$d(\gamma^{i_1}, \gamma^{i_2}) = \frac{1}{n} \sum_{j=1}^{n} \left( |\mu_j^{i_1} - \mu_j^{i_2}| + \sigma_j^{i_1} + \sigma_j^{i_2} - 2\sqrt{\sigma_j^{i_1} \sigma_j^{i_2}} \right) \tag{5.10}$$

be the distance between $\gamma^{i_1}$ and $\gamma^{i_2}$. This measure considers Gaussians and the specificity of information, which is, in turn, inversely related to the Gaussians' dispersions [49]. For example, if the dispersions, $\sigma_j^{i_1}$ and $\sigma_j^{i_2}$, differ from one another, rather than being equal, the distance between the underlying Gaussians is larger. eGFC may merge the pair of granules that presents the smallest value of $d(\cdot)$ for all pairs of granules. The underlying granules must be either unlabeled or tagged with the same class label. The merging decision is based on a threshold, $\Delta$, or expert judgment regarding the suitability of combining granules to have a more compact model. For data within the unit hypercube, we use $\Delta = 0.1$ as the default, which means that the candidate granules should be quite similar and, in fact, carry quite similar information. A new granule, say $\gamma^i$, which results from $\gamma^{i_1}$ and $\gamma^{i_2}$, is built by Gaussians with modal values

$$\mu_j^i = \frac{\dfrac{\sigma_j^{i_1}}{\sigma_j^{i_2}}\, \mu_j^{i_1} + \dfrac{\sigma_j^{i_2}}{\sigma_j^{i_1}}\, \mu_j^{i_2}}{\dfrac{\sigma_j^{i_1}}{\sigma_j^{i_2}} + \dfrac{\sigma_j^{i_2}}{\sigma_j^{i_1}}}, \quad j = 1, \ldots, n, \tag{5.11}$$

and dispersion

$$\sigma_j^i = \sigma_j^{i_1} + \sigma_j^{i_2}, \quad j = 1, \ldots, n. \tag{5.12}$$

These relations take into consideration the uncertainty ratio of the original granules to define the appropriate location and size of the resulting granule. Merging granules minimizes redundancy [21, 22, 40, 50].
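The distance (5.10) and the merging relations (5.11)–(5.12) can be illustrated as follows. The sketch reads the distance with a square root over the product of dispersions, so that the dispersion term vanishes when both dispersions are equal, and it treats two granules as compatible when their labels are identical (including both being unlabeled). These readings, the dictionary layout, and the function names are assumptions.

```python
import math


def distance(g1, g2):
    """Eq. (5.10): average per-feature distance between two Gaussian granules."""
    total = 0.0
    for m1, m2, s1, s2 in zip(g1["mu"], g2["mu"], g1["sigma"], g2["sigma"]):
        total += abs(m1 - m2) + s1 + s2 - 2.0 * math.sqrt(s1 * s2)
    return total / len(g1["mu"])


def mergeable(g1, g2, delta=0.1):
    """Same label (or both unlabeled) and distance not above the threshold."""
    return g1["label"] == g2["label"] and distance(g1, g2) <= delta


def merge(g1, g2):
    """Eqs. (5.11)-(5.12): weighted modal values and summed dispersions."""
    mu, sigma = [], []
    for m1, m2, s1, s2 in zip(g1["mu"], g2["mu"], g1["sigma"], g2["sigma"]):
        w1, w2 = s1 / s2, s2 / s1  # uncertainty ratios of the two granules
        mu.append((w1 * m1 + w2 * m2) / (w1 + w2))
        sigma.append(s1 + s2)
    return {"mu": mu, "sigma": sigma, "label": g1["label"]}
```

For two granules with equal dispersions the weights coincide, and the merged modal value is the midpoint of the original ones.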

5.3.7. Removing Rules

A rule is removed from the eGFC model if it is inconsistent with the current environment. In other words, if a rule is not activated for a number of time steps, say $h_r$, then it is deleted from the rule base. However, if a class is rare or seasonal behaviors are envisioned, then it may be the case to set $h_r = \infty$, and keep the inactive rules. Additionally, in anomaly detection problems, unusual information granules may be more important than highly operative granules and therefore should not be removed. Removing granules and rules periodically helps keep the knowledge base updated in some applications.
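A removal step based on the inactivity horizon could look like the sketch below. The `last_active` field (time step of the most recent activation) and the strict inequality at the boundary are assumptions made for illustration.

```python
def remove_inactive(granules, h, h_r=200):
    """Keep only granules activated within the last h_r time steps.

    granules: list of dicts with a hypothetical 'last_active' field.
    h: current time step. Passing h_r = float('inf') keeps every granule.
    """
    return [g for g in granules if h - g["last_active"] < h_r]
```

Setting `h_r=float("inf")` reproduces the choice suggested for rare classes or seasonal behavior, since no granule is ever dropped.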


Alternative procedures for removing local models exist. In [51], clusters are removed based on the size of their support sets. In [52], a rule is removed if it does not contribute to minimizing the estimation error. A combination of age and number of local activations is considered in [53]. Minimal allowed distances between cluster centers are suggested in [54].

5.3.8. Semi-Supervised Learning from Data Streams

The semi-supervised learning algorithm to construct and update an eGFC model along its lifespan is as follows.

eGFC: Online Semi-Supervised Learning

Initial number of rules, $c = 0$;
Initial hyper-parameters, $\rho^0 = \Delta = 0.1$, $h_r = 200$;
Read input data sample $x^h$, $h = 1$;
Create granule $\gamma^{c+1}$ (Eqs. (5.2)–(5.3)), unknown class $C^{c+1}$;
FOR $h = 2, \ldots$ DO
    Read $x^h$;
    Calculate rules' activation degree (Eq. (5.1));
    Determine the most active rule $R^{i^*}$;
    Provide estimated class $C^{i^*}$;
    // Model adaptation
    IF $T(A_1^i(x_1^h), \ldots, A_n^i(x_n^h)) \le \rho^h$ $\forall i$, $i = 1, \ldots, c$
        IF actual label $y^h$ is available
            Create labeled granule $\gamma^{c+1}$ (Eqs. (5.2)–(5.4));
        ELSE
            Create unlabeled granule $\gamma^{c+1}$ (Eqs. (5.2)–(5.3));
        END
    ELSE
        IF actual label $y^h$ is available
            Update the most active granule $\gamma^{i^*}$ with $C^{i^*} = y^h$ (Eqs. (5.6)–(5.7));
            Tag unlabeled active granules;
        ELSE
            Update the most active $\gamma^{i^*}$ (Eqs. (5.6)–(5.7));
        END
    END
    Update the $\rho$-level (Eqs. (5.8)–(5.9));
    Delete inactive rules based on $h_r$;
    Merge granules based on $\Delta$ (Eqs. (5.10)–(5.12));
END
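To make the control flow of the algorithm concrete, a deliberately reduced Python sketch follows. It covers only granule creation, adaptation, and tagging inside a test-then-train loop; the threshold is kept fixed, and rule removal and merging are left out. The class name `TinyEGFC`, the Gaussian form exp(−(x − μ)²/(2σ²)), the counter starting at 1, and the toy stream are all assumptions, so this should be read as an illustration rather than as the reference implementation.

```python
import math

SIGMA_MAX = 1.0 / (2.0 * math.pi)  # Stigler dispersion, Eq. (5.3)
SIGMA_MIN = 1.0 / (4.0 * math.pi)  # lower bound from Section 5.3.4


class TinyEGFC:
    """Reduced eGFC sketch: fixed rho, no rule removal, no merging."""

    def __init__(self, rho=0.1):
        self.rho = rho
        self.rules = []  # each rule: dict with mu, sigma, count, label

    def activation(self, rule, x):
        # Minimum T-norm of the per-feature Gaussian memberships.
        return min(
            math.exp(-((xj - m) ** 2) / (2.0 * s ** 2))
            for xj, m, s in zip(x, rule["mu"], rule["sigma"])
        )

    def predict(self, x):
        # Class of the most active rule (None if unlabeled or no rules yet).
        if not self.rules:
            return None
        return max(self.rules, key=lambda r: self.activation(r, x))["label"]

    def learn(self, x, y=None):
        # Rules activated above rho, most active first (Eq. (5.5)).
        active = [r for r in self.rules if self.activation(r, x) > self.rho]
        active.sort(key=lambda r: self.activation(r, x), reverse=True)
        if y is not None:
            for rule in active:  # tag unlabeled active granules
                if rule["label"] is None:
                    rule["label"] = y
            active = [r for r in active if r["label"] == y]
        if active:
            self._update(active[0], x)
        else:  # no suitable rule: create a granule (Eqs. (5.2)-(5.4))
            self.rules.append({"mu": list(x), "sigma": [SIGMA_MAX] * len(x),
                               "count": 1, "label": y})

    def _update(self, rule, x):
        # Recursive updates of Eqs. (5.6)-(5.7) with bounded dispersion.
        rule["count"] += 1
        k = rule["count"]
        for j, xj in enumerate(x):
            m, s = rule["mu"][j], rule["sigma"][j]
            rule["mu"][j] = ((k - 1) * m + xj) / k
            s_new = math.sqrt((k - 1) / k * s ** 2 + (xj - m) ** 2 / k)
            rule["sigma"][j] = min(max(s_new, SIGMA_MIN), SIGMA_MAX)


# Interleaved test-then-train on a toy two-class stream (made-up data).
model = TinyEGFC()
stream = [([0.10, 0.10], 1), ([0.12, 0.10], 1),
          ([0.90, 0.90], 2), ([0.88, 0.90], 2)]
for x, y in stream:
    estimate = model.predict(x)  # estimate first ...
    model.learn(x, y)            # ... then learn once the label arrives
```

Calling `model.learn(x)` without a label exercises the unlabeled branch: the most active rule above the threshold is adapted, or an unlabeled granule is created.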


5.4. Methodology

We address the problem of classifying emotions from EEG data streams. We describe a feature extraction method, evaluation measures, and experiments based on individual EEG channels and the multivariable BCI system.

5.4.1. About the Dataset The incremental learning method and the eGFC classifier are evaluated from EEG data streams produced by individuals playing 4 computer games. A game motivates predominantly a particular emotion, according to the 4 quadrants of the arousal– valence system. The aim of the classifier is to assign a class of emotions to new samples. The classes are: bored (Class 1), calm (Class 2), angry (Class 3), and happy (Class 4). The raw data were provided by [27]. The data were produced by 28 individuals (experimental group), from 20 to 27 years old, male and female, including students and faculty members of the School of Technology of Firat University, Turkey. Individuals are healthy, with no disease history. Each individual played each game for 5 minutes (20 minutes in total) using the Emotiv EPOC+ EEG device and earphones. The male–female (M-F) order of players is FMMMFFMMMMMFMMMMFFFFFMMMMMMM A single user-independent eGFC model is developed for the experimental group aiming at reducing individual uncertainty, and enhancing the reliability and generalizability of the system performance. Brain activity is recorded from 14 electrodes. The electrodes are placed on the scalp according to the 10–20 system, namely, at Af3, Af4, F3, F4, F7, F8, Fc5, Fc6, T7, T8, P7, P8, O1, and O2 (Figure 5.1). The sampling frequency is 128 Hz. Each individual produces 38,400 samples per game and 153,600 samples in total. The experiments were conducted in a relatively dark and quiet room. A laptop with a 15.6 inch screen and a 16 GB graphics card was used to run the games. Maximum screen resolution and high-quality graphic rendering were chosen. The individuals were exposed to a 10-minute period of relaxation before the experiment. They repeatedly kept their eyes open and closed in 10-second intervals. None of the individuals had ever played the games before. The games were played in the same order, namely, Train Sim World, Unravel, Slender: The Arrival, and Goat Simulator. 
See Figure 5.3 for an illustration. According to a survey, the predominant classes of emotion for each game are, respectively, boredom (Class 1), calmness (Class 2), angriness/nervousness (Class 3), and happiness (Class 4). The classes do not portray a general opinion.


Figure 5.3: Examples of interfaces of the games Train Sim World, Unravel, Slender: The Arrival, and Goat Simulator, used to generate the EEG data streams.

5.4.2. Feature Extraction and Experiments

A fifth-order sinc filter is applied to the original raw EEG data to suppress the interference due to head, arms, and hands movement [27]. Subsequently, feature extraction is performed. We generate 10 features from each of the 14 EEG channels. They are the maximum and mean values of five bands of the Fourier spectrum. The bands are known as delta (1–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), beta (13–30 Hz), and gamma (30–64 Hz). If 5-minute time windows over the original data are used to construct the frequency spectrum and extract a processed sample to be fed to eGFC, then each individual produces four samples, one per game, i.e., one per class or predominant emotion. Therefore, the 28 participants generate 112 processed samples. We also evaluated 1-minute, 30-second, and 10-second time windows, which yielded 560, 1120, and 3360 processed samples, respectively. Examples of spectra using 30-second time windows over data streams from the frontal electrodes Af3, Af4, F3, and F4 are illustrated in Figure 5.4. Notice that there is a higher level of energy in the delta, theta, and alpha bands, and that the maximum and mean values per band can be easily obtained to compose a 10-feature processed sample.

We performed two experiments. First, the data stream of individual electrodes is analyzed (univariate time-series analysis). 10-feature samples, $x^h = [x_1 \ldots x_{10}]$, $h = 1, \ldots$, are fed to the eGFC model, which evolves from scratch as soon as the data samples arise. We perform the interleaved test-then-train approach, i.e., an eGFC estimate is given; then the true class $C \in \{1, 2, 3, 4\}$ of $x^h$ becomes available, and the pair $(x, y)$ is used for a supervised learning step.
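The per-channel feature extraction described above can be approximated with NumPy as follows. The sketch skips the sinc pre-filter, uses the amplitude of the one-sided FFT as the "Fourier spectrum," and treats band edges as half-open intervals; these choices, like the function names, are assumptions and not details taken from the original study.

```python
import numpy as np

FS = 128  # sampling frequency in Hz (Section 5.4.1)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 64)}


def split_windows(signal, seconds, fs=FS):
    """Cut a 1-D recording into consecutive non-overlapping windows."""
    step = seconds * fs
    return [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]


def band_features(window, fs=FS):
    """Return 10 features: (max, mean) of the amplitude spectrum per band."""
    spectrum = np.abs(np.fft.rfft(window)) / len(window)
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    feats = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)  # half-open band [lo, hi)
        feats.extend([spectrum[mask].max(), spectrum[mask].mean()])
    return np.array(feats)
```

A 5-minute recording at 128 Hz has 38,400 samples, so `split_windows` yields 5, 10, or 30 windows for 1-minute, 30-second, or 10-second windows, in line with the 560, 1120, and 3360 processed samples obtained from 28 participants and 4 games.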


Figure 5.4: Examples of the spectra and bands obtained from the data streams generated by frontal EEG electrodes considering a time window of 30 seconds.

Classification accuracy, $Acc \in [0, 1]$, is obtained recursively from

$$Acc(\text{new}) = \frac{h-1}{h}\, Acc(\text{old}) + \frac{1}{h}\, \tau, \tag{5.13}$$

in which $\tau = 1$ if the estimate is correct, i.e., $\hat{C}^h = C^h$; and $\tau = 0$, otherwise. A random (coin flipping) classifier has an expected Acc of 0.25 (25%) in the 4-class problem. Higher values indicate the level of learning from the data, which are quite noisy. From the investigation of individual electrodes, we expect to identify more discriminative areas of the brain.
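The recursion (5.13) is a running mean of the correctness indicator and needs no stored history. A minimal sketch, with a hypothetical function name and made-up outcomes:

```python
def update_accuracy(acc_old, h, correct):
    """Eq. (5.13): running accuracy after the h-th sample (h starts at 1)."""
    tau = 1.0 if correct else 0.0
    return (h - 1) / h * acc_old + tau / h


# Three correct estimates out of four give an accuracy of 0.75.
acc = 0.0
for h, correct in enumerate([True, False, True, True], start=1):
    acc = update_accuracy(acc, h, correct)
```

The initial value of `acc` is irrelevant because it is multiplied by zero at h = 1.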


A measure of model compactness is the average number of fuzzy granules over time,

$$c_{avg}(\text{new}) = \frac{h-1}{h}\, c_{avg}(\text{old}) + \frac{1}{h}\, c^h. \tag{5.14}$$

Second, we consider the global 140-feature problem (multivariate time-series analysis), in which 10 features are extracted from each of the 14 electrodes. Thus, $x^h = [x_1 \ldots x_{140}]$, $h = 1, \ldots$, are initially considered as inputs of the eGFC model. The classification model self-develops on the fly based on the interleaved test-then-train strategy. Dimension reduction is performed using the non-parametric Spearman’s correlation-based score method by [49]. Essentially, a feature is better ranked if it is more correlated with a class, and less correlated with the rest of the features. The Leave n-Feature Out strategy, with n = 5, is used to gradually eliminate lower-ranked features.

5.5. Results and Discussions

5.5.1. Individual Electrode Experiment

We analyze EEG channels separately, as an attempt to uncover more promising regions of the brain to classify emotions in games. Default hyper-parameters are used to initialize eGFC. Input samples contain 10 entries. Table 5.1 shows the accuracy and compactness of eGFC models for different time window lengths.

We notice from Table 5.1 that the average accuracy for 5-minute windows, 21.1%, gives no indication that learning has happened, since it is below the 25% expected from random guessing. This suggests that the filtering effect on extracting features from 5-minute windows suppresses crucial details to distinguish patterns related to emotions. Emotions tend to be sustained for shorter periods. Accuracies greater than 25% arise for smaller windows. The average eGFC accuracy for 1-minute windows, around 40.5%, is significant, especially given the limitation of scanning the data from a single electrode (brain region) at a time. As the window length reduces, accuracy increases while the fuzzy model structure becomes more compact. This is a result of the availability of a larger number of processed samples (extracted from shorter time windows), and the ability of the online learning algorithm to lead the structure and parameters of the eGFC models to a more stable setup after merging the Gaussian granules that approach each other. The difference between the average accuracy of the 30-second (44.0%) and 10-second (45.2%) time windows is small, which suggests saturation around 45–46%. Time windows shorter than 10 seconds are therefore unlikely to help.

Table 5.1: eGFC classification results for individual electrodes.

5-minute time window
       Left hemisphere              Right hemisphere
Ch     Acc(%)   cavg        Ch     Acc(%)   cavg
Af3    18.8     24.8        Af4    19.6     22.9
F3     23.2     21.3        F4     20.5     21.2
F7     17.0     21.2        F8     25.0     19.1
Fc5    24.1     22.5        Fc6    20.5     19.0
T7     22.3     20.2        T8     24.1     21.2
P7     18.8     21.7        P8     18.8     21.7
O1     21.4     21.9        O2     21.4     21.9
Avg.   20.8     21.9        Avg.   21.4     21.0

1-minute time window
       Left hemisphere              Right hemisphere
Ch     Acc(%)   cavg        Ch     Acc(%)   cavg
Af3    43.4     17.3        Af4    41.6     13.1
F3     37.0     14.3        F4     39.6     15.5
F7     41.3     14.5        F8     31.6     10.3
Fc5    38.4     16.1        Fc6    41.8     18.6
T7     43.9     14.4        T8     50.4     15.9
P7     37.9     17.4        P8     33.8     15.7
O1     45.0     16.3        O2     40.5     16.6
Avg.   41.0     15.8        Avg.   39.9     15.1

30-second time window
       Left hemisphere              Right hemisphere
Ch     Acc(%)   cavg        Ch     Acc(%)   cavg
Af3    51.3     11.7        Af4    43.9     8.9
F3     40.5     10.3        F4     42.9     8.8
F7     44.2     10.3        F8     37.5     6.7
Fc5    41.3     10.8        Fc6    44.7     11.6
T7     40.4     10.2        T8     49.6     11.3
P7     46.3     9.4         P8     42.7     10.3
O1     45.9     10.1        O2     44.7     12.8
Avg.   44.3     10.4        Avg.   43.7     10.1

10-second time window
       Left hemisphere              Right hemisphere
Ch     Acc(%)   cavg        Ch     Acc(%)   cavg
Af3    52.6     6.3         Af4    44.0     5.5
F3     41.8     6.1         F4     43.8     5.9
F7     46.3     6.1         F8     40.2     4.8
Fc5    40.5     5.7         Fc6    46.8     6.8
T7     40.4     5.7         T8     51.5     6.2
P7     46.9     6.7         P8     45.2     6.5
O1     47.7     6.6         O2     45.2     7.0
Avg.   45.2     6.2         Avg.   45.2     6.1

Asymmetries between the accuracy of models evolved for the left and right brain hemispheres (direct pairs of electrodes) can be noticed. Although the right hemisphere is known to deal with emotional interpretation, and to be responsible for creativity and intuition, logical interpretation of situations (typical of the left hemisphere) is also present, as players seek reasons to justify decisions. Therefore, we notice the emergence of patterns in both hemispheres. In general, with an emphasis on the 30-second and 10-second windows, the pair of electrodes Af3–Af4, in the frontal lobe, gives the best pattern recognition results, especially Af3, in the left hemisphere. The frontal lobe is the largest of the four major lobes of the brain. It is covered by the frontal cortex. The frontal cortex includes the premotor cortex and the primary motor cortex, which controls voluntary movements of specific body parts. Therefore, patterns that arise from the Af3–Af4 data streams may be strongly related to the brain commands to move hands and arms, which are indirectly related to the emotions of the players. For a spectator, as opposed to a game actor, a classification model based solely on Af3–Af4 may have diminished accuracy. The pair Af3–Af4 is closely followed by the pairs O1–O2 (occipital lobe) and T7–T8 (temporal lobe), which is somewhat consistent with the results in [6–8]. The temporal lobe is useful for audition, language processing, declarative memory, and visual and emotional perception. The occipital lobe contains the primary visual cortex and areas of visual association. Neighbor electrodes, P7 and P8, also offer relevant information for classification. According to [55], EEG asymmetry is related to approaching and withdrawal emotions, with approaching trends reflected in left-frontal activity and withdrawal trends reflected in relative right-frontal activity. The similar average accuracy between the brain hemispheres, reported in Table 5.1, portrays a mixture of approaching and withdrawal emotional patterns, typical of game playing. In [13], differential and rational asymmetries are considered to be input features of classifiers to assist recognition.
Further discussions on asymmetries require more specific steady-state experiments and are therefore out of the scope of this study.

5.5.2. Multivariate Experiment

The multivariate 140-feature data stream is fed to eGFC. We use 10-second time windows over 20-minute recordings per individual, the best average setup found in the previous experiment. Features are ranked based on Spearman’s correlation score [49]. We use {channel, band, attribute} as notation to order the features from the most to the least important. Results are shown in Table 5.2. The bands are denoted by Greek letters for short. We observe from Table 5.2 a prevalence of alpha-band features in the first positions, followed by delta features and then theta features. In other words, these bands are the most monotonically correlated with the games and their classes. To provide quantitative experimental evidence, we sum the monotonic correlations between a band and the classes of emotions (the sum of the 14 items that correspond to the 14 EEG channels). Figure 5.5 shows the


Table 5.2: Ranking of features.

# Rank {channel, band, attribute}
1:{F4,α,max}      2:{FC6,α,max}     3:{AF3,α,max}     4:{F7,α,max}
5:{O2,α,avg}      6:{T8,α,max}      7:{AF4,α,max}     8:{O2,δ,max}
9:{F3,θ,max}      10:{F3,δ,max}     11:{T8,δ,max}     12:{F3,δ,avg}
13:{T7,α,max}     14:{O2,α,max}     15:{P7,δ,avg}     16:{FC5,α,max}
17:{O2,θ,avg}     18:{P8,α,max}     19:{T7,δ,avg}     20:{T8,δ,avg}
21:{P7,α,max}     22:{T8,θ,max}     23:{F3,α,max}     24:{T8,θ,avg}
25:{F8,α,max}     26:{F3,θ,avg}     27:{FC6,α,avg}    28:{O1,α,max}
29:{T7,θ,avg}     30:{T7,δ,max}     31:{P7,δ,max}     32:{T8,α,avg}
33:{AF3,α,avg}    34:{AF3,θ,max}    35:{FC5,δ,max}    36:{F4,α,avg}
37:{AF3,θ,avg}    38:{AF4,α,avg}    39:{F7,θ,avg}     40:{F7,θ,max}
41:{O2,γ,max}     42:{AF4,δ,avg}    43:{AF4,θ,max}    44:{AF4,θ,avg}
45:{T7,β,max}     46:{F8,γ,avg}     47:{F3,α,avg}     48:{FC6,θ,avg}
49:{F8,β,avg}     50:{AF4,δ,max}    51:{P7,θ,avg}     52:{T7,θ,max}
53:{P8,δ,max}     54:{O2,α,avg}     55:{AF3,β,avg}    56:{AF3,β,max}
57:{F7,δ,avg}     58:{AF3,δ,avg}    59:{P8,δ,avg}     60:{FC6,δ,max}
61:{FC6,θ,max}    62:{F7,α,avg}     63:{F4,θ,max}     64:{FC5,β,avg}
65:{F4,β,avg}     66:{O1,γ,avg}     67:{FC5,γ,avg}    68:{F4,γ,max}
69:{F8,α,avg}     70:{O2,θ,max}     71:{F7,δ,max}     72:{T8,γ,avg}
73:{T7,γ,avg}     74:{P8,θ,max}     75:{T8,γ,max}     76:{FC5,θ,max}
77:{O2,β,avg}     78:{O1,α,avg}     79:{F3,β,avg}     80:{P8,β,max}
81:{T7,γ,max}     82:{F8,γ,max}     83:{FC5,θ,avg}    84:{P7,α,avg}
85:{O1,β,avg}     86:{P7,θ,max}     87:{O1,δ,avg}     88:{FC5,γ,max}
89:{T8,β,max}     90:{AF3,γ,max}    91:{FC6,δ,avg}    92:{F7,γ,avg}
93:{FC5,α,avg}    94:{O1,θ,avg}     95:{O1,θ,max}     96:{AF4,β,avg}
97:{FC5,δ,avg}    98:{F4,β,max}     99:{F4,δ,max}     100:{F8,β,max}
101:{O2,β,max}    102:{FC6,β,max}   103:{F7,γ,max}    104:{P8,β,avg}
105:{FC6,β,avg}   106:{F8,θ,avg}    107:{AF3,γ,avg}   108:{O1,β,max}
109:{F8,δ,avg}    110:{AF3,δ,max}   111:{P7,β,max}    112:{T7,α,avg}
113:{F4,θ,avg}    114:{P8,θ,avg}    115:{FC6,γ,max}   116:{F4,γ,avg}
117:{F7,β,max}    118:{FC6,γ,avg}   119:{P8,α,avg}    120:{F8,θ,max}
121:{F4,δ,avg}    122:{F3,γ,avg}    123:{F8,δ,max}    124:{T8,β,avg}
125:{AF4,β,max}   126:{P7,γ,avg}    127:{F3,γ,max}    128:{P8,γ,max}
129:{O1,γ,max}    130:{O1,δ,max}    131:{AF4,γ,max}   132:{O2,γ,avg}
133:{T7,β,avg}    134:{FC5,β,max}   135:{P7,γ,max}    136:{P7,β,avg}
137:{F7,β,avg}    138:{F3,β,max}    139:{AF4,γ,avg}   140:{P8,γ,avg}

result in dark yellow, and its decomposition per brain hemisphere. The global values are 2.2779, 1.7478, 1.3759, 0.6639, and 0.6564 for the alpha, delta, theta, beta, and gamma bands, respectively. Finally, we leave out five features at a time, starting from the lowest ranked, to evaluate single user-independent eGFC models. Each eGFC model delineates classification boundaries in a different multi-dimensional space. Table 5.3 shows the eGFC overall results for the multi-channel data streams. A quad-core laptop (i7-8550U, 1.80 GHz, 8 GB RAM, Windows 10 OS) is used.
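The ranking and the per-band sums described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`average_ranks`, `spearman`, `band_scores`), the toy feature values, and the choice to sum absolute correlations are all assumptions made for the example.

```python
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def band_scores(features, labels):
    """Sum |Spearman| per band (assumption: absolute values are summed)."""
    totals = defaultdict(float)
    for (_channel, band, _attribute), values in features.items():
        totals[band] += abs(spearman(values, labels))
    return dict(totals)


# Hypothetical toy data: 8 samples, class labels 1..4, two features.
labels = [1, 1, 2, 2, 3, 3, 4, 4]
features = {
    ("F4", "alpha", "max"): [0.1, 0.2, 0.4, 0.5, 0.7, 0.9, 1.2, 1.3],
    ("F4", "gamma", "max"): [0.5, 0.1, 0.9, 0.2, 0.3, 0.8, 0.4, 0.6],
}

# Rank features from the most to the least monotonically correlated.
ranking = sorted(features,
                 key=lambda f: abs(spearman(features[f], labels)),
                 reverse=True)
scores = band_scores(features, labels)
```

With the real data, `features` would hold the 140 {channel, band, attribute} streams, and `scores` would play the role of the global band values reported in the text.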


Figure 5.5: Spearman (monotonic) correlation between frequency bands and classes per brain hemisphere.

From Table 5.3, we verify that spatio-temporal patterns of brain activity are found in all bands of the Fourier spectrum, viz., the delta, theta, alpha, beta, and gamma bands, as even lower-ranked features positively affect the eGFC estimates. Features from any EEG channel may eventually assist the classifier decision. The greatest eGFC accuracy, namely 72.20%, is obtained using all features extracted from the frequency spectrum. In this case, a stream of 3,360 samples is processed in 6.032 seconds (1.8 ms per sample, on average). As a sample is produced every 10 s (10-second time windows) and processed in 1.8 ms, eGFC has ample margin to operate in real time with much larger rule bases and Big Data streams, such as streams generated by additional electrodes and by other physiological and non-physiological means, e.g., biosensors and facial images. The average number of fuzzy rules, about 5.5, along the learning steps is similar for all the streams analyzed (see Table 5.3). An example of eGFC structural evolution (best case, 72.20%) is shown in Figure 5.6. Notice that the emphasis of the model and algorithm is to keep a high accuracy during online operation at the price of occasionally merging and deleting rules. If a higher level of model memory is desired, i.e., a greater number of fuzzy rules is wanted, then the default eGFC hyper-parameters can be changed by turning off the rule-removing threshold, h_r = ∞, and setting the merging parameter to a lower value. Figure 5.6 also shows that the effect of male–female shifts due to the consecutive use of the EEG device by different subjects of the experimental group does not necessarily imply a need for completely new granules and rules by the eGFC


Table 5.3: eGFC classification results of online emotion recognition.

# Features   Acc (%)   c_avg   CPU Time (s)
140          72.20     5.52    6.032
135          71.10     5.70    5.800
130          71.46     5.68    5.642
125          71.25     5.75    5.624
120          70.89     5.79    5.401
115          70.92     5.68    5.193
110          70.68     5.73    4.883
105          70.68     5.71    4.676
100          70.24     5.78    4.366
95           70.12     5.69    4.290
90           69.91     5.66    3.983
85           66.31     5.53    3.673
80           65.09     5.45    3.534
75           65.48     5.26    3.348
70           64.08     5.26    2.950
65           63.42     5.37    2.775
60           62.56     5.33    2.684
55           60.95     5.33    2.437
50           60.57     5.57    2.253
45           60.39     5.53    2.135
40           54.13     5.14    1.841
35           52.29     5.28    1.591
30           52.52     5.54    1.511
25           50.95     5.61    1.380
20           51.42     5.58    1.093
15           49.28     5.57    0.923
10           48.63     5.58    0.739

model. Parametric adaptation of Gaussian granules is often enough to accommodate slightly different behavior. For example, the shifts from the sixth (female) to the seventh (male) players and from the seventh to the eighth (male) players did not require new rules. Nevertheless, if a higher level of memory is allowed, the rules for both genders can be maintained separately. Figure 5.7 illustrates the confusion matrix of the best eGFC model. We notice a relative balance of wrong estimates per class and confusion of estimates in all directions. As Table 5.3 shows a snapshot of the eGFC performance in the last time step, h = 3360, in Figure 5.8 we provide the evolution of the accuracy index over time, i.e., from h = 1 to h = 3360, considering the eGFC model structure shown in Figure 5.6. The idea is to emphasize the online learning capability of the classifier. Notice that eGFC starts learning from scratch, with zero accuracy. After about 100 time steps, the estimation accuracy reaches a value of around 81.90%. The


Figure 5.6: Structural evolution of the user-independent eGFC model from data streams of 28 players.

Figure 5.7: Confusion matrix of the best eGFC model, Acc = 72.20%.

information from multiple users is captured by updating the model parameters and often the model architecture. The mean accuracy along the time steps is kept around 75.08% ± 3.32% in spite of the nonstationary nature of the EEG data generated by multiple users playing games of different styles. Therefore, the eGFC model is user independent. User-dependent eGFC models are generally more accurate, such


Figure 5.8: Accuracy of the user-independent eGFC model over time.

as the 81.90%-accurate model developed for the first female user. In general, the empirical results obtained with evolving fuzzy intelligence, using EEG streams as the sole physiological source of data, are encouraging for emotion pattern recognition.
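The real-time argument of this section rests on simple arithmetic, which can be checked with the figures reported for the 140-feature model. The sketch below only restates those figures; the variable names are illustrative, and the stream length is recomputed under the assumption of one 20-minute recording per player.

```python
# Figures reported for the best (140-feature) eGFC model.
n_samples = 3360        # samples in the stream
cpu_time_s = 6.032      # total processing time, in seconds
window_s = 10.0         # one sample is produced every 10 seconds

# Stream length implied by 28 players x 20 minutes x 10-second windows.
expected_samples = 28 * 20 * 60 // 10

# Average processing time per sample, in milliseconds.
per_sample_ms = 1000 * cpu_time_s / n_samples

# How many times faster processing is than sample production.
headroom = window_s / (cpu_time_s / n_samples)

print(f"{per_sample_ms:.1f} ms per sample, headroom factor about {headroom:.0f}")
```

The large headroom factor is what leaves room for bigger rule bases or additional data sources before the 10-second budget per sample is approached.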

5.6. Conclusion

We have evaluated an online incremental learning method that constructs and updates a fuzzy classifier from data streams. The classifier, called eGFC, is applied to real-time recognition of emotions in computer games. A group of 28 people play Train Sim World, Unravel, Slender: The Arrival, and Goat Simulator (visual and auditory stimuli) for 20 minutes each, using an EEG device. An emotion prevails in each game, i.e., boredom, calmness, angriness, or happiness, which agrees with the arousal–valence system. A set of 140 features is extracted from the frequency spectrum produced by temporal data from 14 electrodes located in different brain regions. We analyzed the delta, theta, alpha, beta, and gamma bands, from 1 to 64 Hz. We examined the contribution of single electrodes to emotion recognition, and the effect of time window lengths and dimensionality reduction on the overall accuracy of the eGFC models. We conclude that: (i) electrodes on both brain hemispheres assist the recognition of emotions expressed through a variety of spatio-temporal patterns; (ii) the frontal (Af3–Af4) area, followed by the occipital (O1–O2) and temporal (T7–T8) areas, are, in this order, slightly more discerning than the others; (iii) although patterns may be found in any frequency band, the alpha (8–13 Hz) band, followed by the delta (1–4 Hz) and theta (4–8 Hz) bands, is more monotonically correlated with the emotion classes; (iv) the eGFC incremental learning method is able to process 140-feature samples using 1.8 ms per sample, on average. Thus, eGFC is suitable for real-time applications considering EEG and other sources of data simultaneously; (v) a greater number of features covering the entire spectrum and the use of 10-second time windows lead eGFC to its best performance, i.e., to 72.20% accuracy using


4 to 12 fuzzy granules and rules along the time steps. Such an eGFC model is general, i.e., it is user independent. We notice that a user-dependent eGFC model—based on the first female player—has been shown to be about 81.90% accurate, as expected due to a more restrictive experimental setting and model customization. Further studies on customized models for specific users are needed—as this capability of evolving granular methods is promising. In the future, wavelet transforms shall be evaluated with a focus on specific bands. A deep neural network, playing the role of feature extractor in high-dimension spaces, will be connected to eGFC for real-time emotion recognition. We also intend to evaluate an ensemble of customized evolving classifiers for benchmark EEG datasets.

Acknowledgement

This work received support from the Serrapilheira Institute (Serra 1812-26777) and from the National Council for Scientific and Technological Development (CNPq 308464/2020-6).

References

[1] M. Alarcao and M. J. Fonseca, Emotions recognition using EEG signals: a survey, IEEE Trans. Affect. Comput., 10(3), pp. 374–393 (2019).
[2] Y. Liu, O. Sourina and M. K. Nguyen, Real-time EEG-based emotion recognition and its applications, Trans. Comput. Sci. XII (LNCS), 6670, pp. 256–277 (2011).
[3] T. Song, W. Zheng, P. Song and Z. Cui, EEG emotion recognition using dynamical graph convolutional neural networks, IEEE Trans. Affect. Comput., 11(3), pp. 532–541 (2020).
[4] G. Vasiljevic and L. Miranda, Brain-computer interface games based on consumer-grade EEG devices: a systematic literature review, Int. J. Hum.-Comput. Interact., 36(2), pp. 105–142 (2020).
[5] L.-W. Ko, Y.-C. Lu, H. Bustince, Y.-C. Chang, Y. Chang, J. Fernandez, Y.-K. Wang, J. A. Sanz, G. P. Dimuro and C.-T. Lin, Multimodal fuzzy fusion for enhancing the motor-imagery-based brain computer interface, IEEE Comput. Intell. Mag., 14(1), pp. 96–106 (2019).
[6] X. Li, D. Song, P. Zhang, Y. Zhang, Y. Hou and B. Hu, Exploring EEG features in cross-subject emotion recognition, Front. Neurosci., 12, pp. 162–176 (2018), doi: 10.3389/fnins.2018.00162.
[7] M. Soleymani, M. Pantic and T. Pun, Multimodal emotion recognition in response to videos, IEEE Trans. Affect. Comput., 3(2), pp. 211–223 (2012).
[8] W.-L. Zheng, J.-Y. Zhu and B.-L. Lu, Identifying stable patterns over time for emotion recognition from EEG, IEEE Trans. Affect. Comput., 10(3), pp. 417–429 (2019).
[9] B. Nakisa, M. Rastgoo, D. Tjondronegoro and V. Chandran, Evolutionary computation algorithms for feature selection of EEG-based emotion recognition using mobile sensors, Expert Syst. Appl., 93(1), pp. 143–155 (2018).
[10] N. Zhuang, Y. Zeng, L. Tong, C. Zhang, H. Zhang and B. Yan, Emotion recognition from EEG signals using multidimensional information in EMD domain, BioMed Res. Int., 2017, pp. 1–9 (2017), doi: 10.1155/2017/8317357.
[11] W.-L. Zheng and B.-L. Lu, Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks, IEEE Trans. Auton. Ment. Dev., 7(3), pp. 162–175 (2015).
[12] X. Gu, Z. Cao, A. Jolfaei, P. Xu, D. Wu, T.-P. Jung and C.-T. Lin, EEG-based brain-computer interfaces (BCIs): a survey of recent studies on signal sensing technologies and computational intelligence approaches and their applications, arXiv:2001.11337, p. 22 (2020).


[13] H. P. Martinez, Y. Bengio and G. N. Yannakakis, Learning deep physiological models of affect, IEEE Comput. Intell. Mag., 8(2), pp. 20–33 (2013).
[14] M. Wu, W. Su, L. Chen, W. Pedrycz and K. Hirota, Two-stage fuzzy fusion based-convolution neural network for dynamic emotion recognition, IEEE Trans. Affect. Comput., pp. 1–1 (2020), doi: 10.1109/TAFFC.2020.2966440.
[15] Y. Zhang, Z. Yang, H. Lu, X. Zhou, P. Phillips, Q. Liu and S. Wang, Facial emotion recognition based on biorthogonal wavelet entropy, fuzzy support vector machine, and stratified cross validation, IEEE Access, 4, pp. 8375–8385 (2016).
[16] G. Lee, M. Kwon, S. Kavuri and M. Lee, Emotion recognition based on 3D fuzzy visual and EEG features in movie clips, Neurocomputing, 144, pp. 560–568 (2014), doi: 10.1016/j.neucom.2014.04.008.
[17] Z. Cao and C.-T. Lin, Inherent fuzzy entropy for the improvement of EEG complexity evaluation, IEEE Trans. Fuzzy Syst., 26(2), pp. 1032–1035 (2017).
[18] D. Wu, V. J. Lawhern, S. Gordon, B. J. Lance and C.-T. Lin, Driver drowsiness estimation from EEG signals using online weighted adaptation regularization for regression (OwARR), IEEE Trans. Fuzzy Syst., 25(6), pp. 1522–1535 (2016a).
[19] S.-L. Wu, Y.-T. Liu, T.-Y. Hsieh, Y.-Y. Lin, C.-Y. Chen, C.-H. Chuang and C.-T. Lin, Fuzzy integral with particle swarm optimization for a motor-imagery-based brain-computer interface, IEEE Trans. Fuzzy Syst., 25(1), pp. 21–28 (2016b).
[20] D. Leite, L. Decker, M. Santana and P. Souza, EGFC: evolving Gaussian fuzzy classifier from never-ending semi-supervised data streams – with application to power quality disturbance detection and classification, in IEEE World Congress Computational Intelligence (WCCI-FUZZ-IEEE), pp. 1–8 (2020).
[21] D. Leite, Evolving granular systems, Ph.D. thesis, School of Electrical and Computer Eng., University of Campinas (UNICAMP) (2012).
[22] I. Skrjanc, J. A. Iglesias, A. Sanchis, D. Leite, E. Lughofer and F. Gomide, Evolving fuzzy and neuro-fuzzy approaches in clustering, regression, identification, and classification: a survey, Inf. Sci., 490, pp. 344–368 (2019).
[23] C. Garcia, D. Leite and I. Skrjanc, Incremental missing-data imputation for evolving fuzzy granular prediction, IEEE Trans. Fuzzy Syst., 28(10), pp. 2348–2362 (2020).
[24] D. Leite, G. Andonovski, I. Skrjanc and F. Gomide, Optimal rule-based granular systems from data streams, IEEE Trans. Fuzzy Syst., 28(3), pp. 583–596 (2020).
[25] D. Leite, P. Costa and F. Gomide, Evolving granular neural networks from fuzzy data streams, Neural Networks, 38, pp. 1–16 (2013).
[26] L. Decker, D. Leite, L. Giommi and D. Bonacorsi, Real-time anomaly detection in data centers for log-based predictive maintenance using an evolving fuzzy-rule-based approach, in IEEE World Congress on Computational Intelligence (WCCI-FUZZ-IEEE), p. 8 (2020).
[27] T. B. Alakus, M. Gonen and I. Turkoglu, Database for an emotion recognition system based on EEG signals and various computer games – GAMEEMO, Biomed. Signal Process. Control, 60, p. 101951 (12p) (2020).
[28] L. Alonso and J. Gomez-Gil, Brain computer interfaces, a review, Sensors, 12(2), pp. 1211–1279 (2012).
[29] M. Fatourechi, A. Bashashati, R. K. Ward and G. E. Birch, EMG and EOG artifacts in brain computer interface systems: a survey, Clin. Neurophysiol., 118(3), pp. 480–494 (2007).
[30] A. Ferreira, et al., A survey of interactive systems based on brain-computer interfaces, JIS, 4(1), pp. 3–13 (2013).
[31] C.-M. Kim, G.-H. Kang and E.-S. Kim, A study on the generation method of visual-auditory feedback for BCI rhythm game, J. Korea Game Soc., 13(6), pp. 15–26 (2013).
[32] J. Russell, A circumplex model of affect, J. Pers. Soc. Psychol., 39(6), pp. 1161–1178 (1980).


[33] J. Frome, Eight ways videogames generate emotion, DiGRA Conf., pp. 831–835 (2007).
[34] L. Fedwa, M. Eid and A. Saddik, An overview of serious games, Int. J. Comput. Game Technol., 2014, p. 15 (2014).
[35] P. Angelov and X. Zhou, Evolving fuzzy-rule-based classifiers from data streams, IEEE Trans. Fuzzy Syst., 16(6), pp. 1462–1475 (2008).
[36] P. Angelov, E. Lughofer and X. Zhou, Evolving fuzzy classifiers using different model architectures, Fuzzy Sets Syst., 159(23), pp. 3160–3182 (2008).
[37] M. Pratama, S. Anavatti, M. Er and E. Lughofer, pClass: an effective classifier for streaming examples, IEEE Trans. Fuzzy Syst., 23(2), pp. 369–386 (2015).
[38] G. Casalino, G. Castellano and C. Mencar, Data stream classification by dynamic incremental semi-supervised fuzzy clustering, Int. J. Artif. Intell. Tools, 28(8), p. 1960009 (26p) (2019).
[39] S. U. Din, J. Shao, J. Kumar, W. Ali, J. Liu and Y. Ye, Online reliable semi-supervised learning on evolving data streams, Inf. Sci., 525, pp. 153–171 (2020).
[40] I. Skrjanc, Cluster-volume-based merging approach for incrementally evolving fuzzy Gaussian clustering: eGAUSS+, IEEE Trans. Fuzzy Syst., 28(9), pp. 2222–2231 (2020).
[41] D. Leite, P. Costa and F. Gomide, Evolving granular neural network for semi-supervised data stream classification, in International Joint Conference on Neural Networks (IJCNN), p. 8p (2010).
[42] Y. Li, Y. Wang, Q. Liu, C. Bi, X. Jiang and S. Sun, Incremental semi-supervised learning on streaming data, Pattern Recognit., 88, pp. 383–396 (2019).
[43] J. Read, F. Perez-Cruz and A. Bifet, Deep learning in partially-labeled data streams, in 30th Annual ACM Symposium on Applied Computing, pp. 954–959 (2015).
[44] W. Pedrycz and F. Gomide, Fuzzy Systems Engineering: Toward Human-Centric Computing, 1st edn. Wiley: Hoboken, New Jersey (2007).
[45] S. M. Stigler, A modest proposal: a new standard for the normal, Am. Stat., 36(2), pp. 137–138 (1982).
[46] W. Pedrycz and W. Homenda, Building the fundamentals of granular computing: a principle of justifiable granularity, Appl. Soft Comput., 13(10), pp. 4209–4218 (2013).
[47] X. Wang, W. Pedrycz, A. Gacek and X. Liu, From numeric data to information granules: a design through clustering and the principle of justifiable granularity, Knowl.-Based Syst., 101, pp. 100–113 (2016).
[48] R. Yager, Measures of specificity over continuous spaces under similarity relations, Fuzzy Sets Syst., 159, pp. 2193–2210 (2008).
[49] E. Soares, P. Costa, B. Costa and D. Leite, Ensemble of evolving data clouds and fuzzy models for weather time series prediction, Appl. Soft Comput., 64, pp. 445–453 (2018).
[50] E. Lughofer, J.-L. Bouchot and A. Shaker, On-line elimination of local redundancies in evolving fuzzy systems, Evolving Syst., 2(3), pp. 165–187 (2011).
[51] P. Angelov, Evolving Takagi-Sugeno fuzzy systems from streaming data, eTS+, in P. Angelov, D. Filev and N. Kasabov (eds.), Evolving Intelligent Systems: Methodology and Applications. John Wiley & Sons, pp. 21–50 (2010).
[52] H. J. Rong, N. Sundararajan, G. B. Huang and P. Saratchandran, Sequential adaptive fuzzy inference system (SAFIS) for nonlinear system identification and prediction, Fuzzy Sets Syst., 157(9), pp. 1260–1275 (2006).
[53] N. Kasabov, Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning, IEEE Trans. Syst. Man Cybern. Part B: Cybern., 31(6), pp. 902–918 (2001).
[54] B. H. Soleimani, C. Lucas and B. N. Araabi, Recursive Gath-Geva clustering as a basis for evolving neuro-fuzzy modeling, Evolving Syst., 1(1), pp. 59–71 (2010).
[55] R. J. Davidson, Anterior cerebral asymmetry and the nature of emotion, Brain Cogn., 20(1), pp. 125–151 (1992).


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0006

Chapter 6

Causal Inference in Medicine and in Health Policy: A Summary

Wenhao Zhang, Ramin Ramezani, and Arash Naeim
Center for Smart Health, University of California, Los Angeles

A data science task can be viewed as making sense of the data or testing a hypothesis about it. The conclusions inferred from the data can greatly guide us in making informed decisions. Big Data have enabled us to carry out countless prediction tasks in conjunction with machine learning, such as identifying high-risk patients suffering from a certain disease and taking preventive measures. However, healthcare practitioners are not content with mere predictions; they are also interested in the cause–effect relation between input features and clinical outcomes. Understanding such relations will help doctors to treat patients and reduce risk effectively. Causality is typically identified by randomized controlled trials. Often, such trials are not feasible, and scientists and researchers turn to observational studies and attempt to draw inferences from them. However, observational studies may also be affected by selection and/or confounding biases that can result in wrong causal conclusions. In this chapter, we highlight some of the drawbacks that may arise in traditional machine learning and statistical approaches to analyzing observational data, particularly in the healthcare data analytics domain. We discuss causal inference and ways to discover cause–effect relations from observational studies in the healthcare domain. Moreover, we demonstrate the application of causal inference in tackling some common machine learning issues, such as missing data and model transportability. Finally, we discuss the possibility of integrating reinforcement learning with causality as a way to counter confounding bias.

6.1. What Is Causality and Why Does It Matter?

In this section, we introduce the concept of causality and discuss why causal reasoning is of paramount importance in analyzing data that arise in various domains such as healthcare and social sciences.


6.1.1. What Is Causality?

The truism “Correlation does not imply causation” is well known and generally acknowledged [2, 3]. The question is how to define “causality.”

6.1.2. David Hume’s Definition

The eighteenth-century philosopher David Hume defined causation in the language of the counterfactual: A is a cause of B if (1) B is always observed to follow A, and (2) if A had not been, B had never existed [4]. In the former case, A is a sufficient cause of B; in the latter case, A is a necessary cause of B [5]. When both conditions are satisfied, we can safely say that A causes B (necessary-and-sufficient causation). For example, sunrise causes a rooster’s crow. This cause–effect relation cannot be described the other way around: if the rooster is sick and does not crow, the sunrise still occurs. The rooster’s crow is therefore not a necessary cause of sunrise. Hence, the rooster’s crow is an effect rather than the cause of sunrise.
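The two conditions can be made concrete with a small check over a log of observations. This is only a rough sketch: it tests observed regularities in a hypothetical log, which is weaker than the counterfactual statement in condition (2), and the function names and the data are invented for illustration.

```python
def is_sufficient(log):
    """True if B occurred on every occasion on which A occurred."""
    return all(b for a, b in log if a)


def is_necessary(log):
    """True if B never occurred on an occasion on which A was absent."""
    return all(not b for a, b in log if not a)


# Hypothetical log of (sunrise, rooster_crows) observations.
# Third entry: the rooster is sick. Fourth entry: night, no sunrise.
observations = [(True, True), (True, True), (True, False), (False, False)]

# Candidate direction 1: A = sunrise, B = crow.
sunrise_then_crow = observations
# Candidate direction 2: A = crow, B = sunrise.
crow_then_sunrise = [(crow, sunrise) for sunrise, crow in observations]

# The crow never happens without a sunrise in this log ...
print(is_necessary(sunrise_then_crow))   # True
# ... but the sunrise does happen without a crow (sick rooster).
print(is_necessary(crow_then_sunrise))   # False
# Sunrise is not strictly sufficient here either, because of the sick day.
print(is_sufficient(sunrise_then_crow))  # False
```

The second result mirrors the argument in the text: because the sun rises even when the rooster is silent, the crow cannot be a necessary cause of sunrise.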

6.1.3. Causality in Medical Research

In medical research, the logical description of causality is considered too rigorous and is occasionally not applicable. For example, smoking does not always lead to lung cancer. Causality in medical literature is often expressed in probabilistic terms [6, 49]. A type of chemotherapy treatment might increase the likelihood of survival of a patient diagnosed with cancer, but does not guarantee it. Therefore, we express our beliefs about the uncertainty of the real world in the language of probability. One main reason for probabilistic thinking is that we can easily quantify our beliefs in numeric values and build probabilistic models to explain the cause, given our observation. In clinical diagnosis, doctors often seek the most plausible hypothesis (disease) that explains the evidence (symptom). Assume that a doctor observes a certain symptom S and has two explanations for this symptom, disease A or disease B. If the doctor can quantify his or her belief as conditional probabilities, i.e., Prob(disease A | Symptom S) and Prob(disease B | Symptom S) (the probability that disease A or B is present, given that symptom S is observed), then the doctor can choose the explanation that has the larger conditional probability.
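One way to obtain such conditional probabilities is Bayes' rule, which combines how common each disease is with how often it produces the symptom. The sketch below uses invented numbers purely for illustration; the priors, the likelihoods, and the extra "neither" hypothesis are assumptions, not clinical values.

```python
# Hypothetical prior probabilities of each explanation.
prior = {"disease A": 0.01, "disease B": 0.05, "neither": 0.94}

# Hypothetical probability of observing symptom S under each explanation.
likelihood = {"disease A": 0.90, "disease B": 0.30, "neither": 0.02}

# Bayes' rule: Prob(H | S) = Prob(S | H) * Prob(H) / Prob(S).
evidence = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}

# The doctor compares the two candidate diseases and keeps the more plausible.
candidates = ("disease A", "disease B")
best = max(candidates, key=posterior.get)

for h in candidates:
    print(f"Prob({h} | S) = {posterior[h]:.3f}")
print("Most plausible explanation:", best)
```

In this toy setting disease B wins even though disease A produces the symptom more reliably, because disease B is assumed to be five times more common.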

6.1.4. The Dilemma: Potential Outcome Framework

When we study the causal effect of a new treatment, we are interested in how the disease responds when we “intervene” upon it. For example, a patient is likely to recover from cancer when receiving a new type of chemotherapy treatment.


To measure the causal effect on this particular patient, we shall compare the outcome of treatment to the outcome of no treatment. However, it is not possible to observe the two potential outcomes of the same patient at once, because this comparison is made using two parallel universes we imagine: (1) a universe where the patient is treated with the new chemotherapy and (2) the other where he or she is not treated. There is always one universe missing. This dilemma is known as the “fundamental problem of causal inference.”
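The missing universe can be illustrated with a toy data structure. In the sketch below, the patients, their potential outcomes, and the class and function names are all hypothetical; in real data only the `observed` rows would exist, and the "true" unit-level effects could never be computed.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    name: str
    y_if_treated: int    # potential outcome with chemotherapy (1 = recovery)
    y_if_untreated: int  # potential outcome without chemotherapy
    treated: bool        # the universe that actually happened


def observe(p):
    """Return (outcome if treated, outcome if untreated) as seen in real data.

    The outcome of the universe that did not happen is unknown (None).
    """
    if p.treated:
        return (p.y_if_treated, None)
    return (None, p.y_if_untreated)


# Simulated patients: only a simulation lets us write down both outcomes.
patients = [
    Patient("P1", 1, 0, True),
    Patient("P2", 1, 1, False),
    Patient("P3", 0, 0, True),
    Patient("P4", 1, 0, False),
]

# Unit-level causal effects, available here only because the data are made up.
true_unit_effects = [p.y_if_treated - p.y_if_untreated for p in patients]

# What an analyst would actually see: one potential outcome per patient.
observed = [observe(p) for p in patients]
for p, row in zip(patients, observed):
    print(p.name, row)
```

Every observed row contains exactly one missing entry, so the difference between the two potential outcomes of the same patient cannot be formed from the data alone.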

6.2. Why Causal Inference

A data science task can be viewed as making sense of the data or testing a hypothesis about it. The conclusions inferred from data can greatly guide us to make informed decisions. Big Data have enabled us to carry out countless prediction tasks in conjunction with machine learning. However, there exists a large gap between highly accurate predictions and decision making. For example, an interesting study [1] reports that there is a “surprisingly powerful correlation” (ρ = 0.79, p < 0.0001) between chocolate consumption and the number of Nobel laureates in a country (Figure 6.1). Policy makers might nevertheless hesitate to promote chocolate consumption as a way of obtaining more Nobel prizes. The developed Western countries where people eat more chocolate are more likely to have better education systems, and chocolate consumption has no direct impact on the number of Nobel laureates. As such, intervening on chocolate cannot possibly lead us to the desired outcome. In Section 6.3, we will explore more examples of spurious correlations explained by confounders (in statistics, a confounder is a variable that impacts a dependent variable as well as an independent variable at the same time, causing a spurious correlation [67]) and how to use causal inference to gauge the real causal effect between variables under such circumstances. In predictive tasks, an understanding of causality, the mechanism through which an outcome is produced, will yield more descriptive models with better performance. In machine learning, for instance, one underlying assumption is generally that training and testing datasets have identical or at least comparable characteristics/distributions. This assumption is often violated in practice. For example, an activity recognition model built on a training cohort of highly active participants might perform poorly if it is applied to a cohort of bedridden elderly patients. In this example, the variables age and mobility are the causes that explain the difference between the two datasets. Therefore, understanding causality between various features and outcomes is an integral part of a robust and generalized machine learning model. Often, statistical models that rely upon pure correlations perform well under static conditions where the characteristics of the dataset are invariant. Once the context changes, such correlations may no longer exist. On the contrary, relying on the causal relations between variables can produce models less prone to change


Figure 6.1: A spurious correlation between the chocolate consumption and the number of Nobel laureates by countries [1].

with context. In Section 4, we will discuss the external validity and transportability of machine learning models. In many real-world data analytics tasks, in addition to relying on statistical relations among data elements, it is essential for machine learning practitioners to ask questions surrounding “causal intervention” and “counterfactual reasoning,” such as “What would Y be if I do X?” (causal intervention) or “Would the outcome change if I had acted differently?” (counterfactual). Suppose that one wants to find the effects of wine consumption on heart disease. We cannot realistically run a randomized controlled trial that asks people to drink, say, one glass of wine every night and forces them to comply for a decade. In such scenarios, we normally resort to observational studies, which may eventually highlight associations between wine consumption and reduced risk of heart disease. The simple reaction would be to intervene and promote wine consumption. However, causal reasoning suggests that we think twice about whether the link between wine and reduced heart disease is a cause–effect relation or whether the association is confounded by other factors; for example, people who drink wine may have more money and can buy better-quality food, or have a better quality of life in general. Counterfactual reasoning, on the other hand, often answers questions in retrospect,


Causal Inference in Medicine and in Health Policy: A Summary


such as “Would the outcome change if I had acted differently?” Imagine a doctor who is treating a patient with kidney stones. The doctor is left with two choices: a conventional treatment that includes open surgical procedures and a new treatment that only involves making a small puncture in the kidney. Each treatment may result in certain complications, raising the following question: “Would the outcome be different if the other treatment had been given to this patient?” The “if” statement is a counterfactual condition, a scenario that never happened. The doctor cannot go back in time and give a different treatment to the same patient under the exact same conditions. So, it behooves the doctor to think about counterfactual questions in advance. Counterfactual reasoning enables us to contemplate alternative options in decision-making to possibly avoid undesired outcomes. By understanding causality, we will be able to answer questions related to interventions or counterfactuals, concepts we aim to cover in the following sections.

6.3. Randomized Controlled Trials

Due to the above-mentioned dilemma, a unit-level causal effect cannot be directly observed, as the potential outcomes for an individual subject cannot all be observed in a single universe. Randomized controlled trials (RCTs) enable us to gauge the population-level causal effect by comparing the outcomes of two groups under different treatments, while other factors are kept identical. The population-level causal effect can then be expressed in mathematical terms as the average causal effect (ACE). For instance,

ACE = |Prob(Recovery | Treatment = Chemotherapy) − Prob(Recovery | Treatment = Placebo)|

where ACE is also referred to as the average treatment effect (ATE). In a randomized controlled trial, treatment and placebo are assigned randomly to groups that have the same characteristics (e.g., demographic factors). The mechanism is to “disassociate variables of interest (e.g., treatment, outcome) from other factors that would otherwise affect them both” [5]. Another factor that might greatly bias our estimation of the causal effect is the century-old problem of “finding confounders” [7–9]. The randomized controlled trial was first introduced by James Lind in 1747 to identify a treatment for scurvy and was then popularized by Ronald A. Fisher in the early 20th century. It is currently well acknowledged and considered the gold standard for identifying the true causal effect without distortions introduced by confounding. However, randomized controlled trials are not always feasible in clinical studies due to ethical or practical concerns. For example, in a smoking-cancer medical study, researchers would have to conduct a randomized controlled trial to investigate whether smoking in fact leads to cancer. In such a trial, researchers would randomly assign


participants to an experiment group where people are required to smoke and a control group where smoking is not allowed. This study design ensures that smoking behavior is the only variable that differs between the groups, so that no other variables (i.e., confounders) bias the results. In contrast, an observational study where we merely follow and observe the outcomes of smokers and non-smokers is highly susceptible to confounders and can reach misleading conclusions. Therefore, the better study design would be an RCT; however, it is perceived as highly unethical to ask participants to smoke in a clinical trial. Typically, randomized controlled trials are designed and performed in a laboratory setting where researchers have full control over the experiment. In many real-world studies, data are collected from observations, and researchers cannot intervene on or randomize the independent variables. This highlights the need for a different toolkit to perform causal reasoning in such scenarios. In Section 6.10, we will discuss how to gauge the true causal effect from observational studies that might be contaminated with confounders.
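The contrast between a randomized and an observational design can be illustrated with a small simulation. The sketch below is only a toy model: the hidden confounder `healthy`, all probabilities, and the function name `estimated_ace` are assumptions made for illustration and do not come from any study cited in this chapter. The true effect of treatment on recovery is set to 0.2 by construction.

```python
import random

def estimated_ace(n, randomized, seed=0):
    """Estimate Prob(Recovery | treated) - Prob(Recovery | control) on simulated data."""
    rng = random.Random(seed)
    recovered = {True: 0, False: 0}   # recoveries per arm (key: treated?)
    total = {True: 0, False: 0}       # participants per arm
    for _ in range(n):
        healthy = rng.random() < 0.5              # hypothetical hidden confounder
        if randomized:
            treated = rng.random() < 0.5          # coin flip, independent of health
        else:
            # Observational setting: healthier people take the treatment more often.
            treated = rng.random() < (0.8 if healthy else 0.2)
        # Assumed outcome model: treatment adds 0.2, good health adds 0.4.
        p_recover = 0.3 + 0.2 * treated + 0.4 * healthy
        recovered[treated] += rng.random() < p_recover
        total[treated] += 1
    return recovered[True] / total[True] - recovered[False] / total[False]

ace_rct = estimated_ace(200_000, randomized=True)    # close to the true effect, 0.2
ace_obs = estimated_ace(200_000, randomized=False)   # inflated by confounding, about 0.44
print(round(ace_rct, 2), round(ace_obs, 2))
```

Under these assumptions, randomization recovers the effect built into the outcome model, whereas the naive observational difference roughly doubles it because treated participants are also healthier on average.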

6.4. Preliminaries: Structural Causal Models, Causal Graphs, and Intervention with Do-Calculus

In this section, we introduce three fundamental components of causal reasoning: the structural causal model (SCM), directed acyclic graphs (DAGs), and intervention with do-calculus.

6.5. Structural Causal Model

The structural causal model (SCM), M, was proposed by Pearl et al. [10, 11] to formally describe the interactions of variables in a system. An SCM is a 4-tuple $M = \langle U, V, F, P(u) \rangle$ where

(1) $U$ is a set of background (exogenous) variables that are determined by factors outside the model.
(2) $V = \{V_1, \ldots, V_n\}$ is a set of endogenous variables that are determined by variables within the model.
(3) $F$ is a set of functions $\{f_1, \ldots, f_n\}$ where each $f_i$ is a mapping from $U_i \cup PA_i$ to $V_i$, where $U_i \subseteq U$ and $PA_i$ (PA is short for “parents”) is a set of causes of $V_i$. In other words, $f_i$ assigns a value to the corresponding $V_i \in V$, $v_i \leftarrow f_i(pa_i, u_i)$, for $i = 1, \ldots, n$.
(4) $P(u)$ is a probability function defined over the domain of $U$.

Note that there are two sets of variables in an SCM, namely, exogenous variables, $U$, and endogenous variables, $V$. Exogenous variables are determined outside of the model and are not explained (or caused) by any variables inside our model.


Therefore, we generally assume a certain probability distribution $P(u)$ to describe the external factors. The values of endogenous variables, on the other hand, are assigned by both exogenous variables and endogenous variables. These causal mappings are captured by a set of non-parametric functions $F = \{f_1, \ldots, f_n\}$. Each $f_i$ can be a linear or non-linear function, so it can represent all sorts of causal relations. The value assignments of endogenous variables are also referred to as the data generation process (DGP), where nature assigns the values of endogenous variables. Let us consider a toy example: in a smoking-lung cancer study, we can observe and measure the treatment variable smoking (S) and the outcome variable lung cancer (L). Suppose these two factors are endogenous variables. There might be some unmeasured factors that interact with the existing endogenous variables, e.g., genotype (G). Then the SCM can be instantiated as

$$U = \{G\}, \quad V = \{S, L\}, \quad F = \{f_S, f_L\}, \tag{6.1}$$

$$f_S : S \leftarrow f_S(G), \tag{6.2}$$

$$f_L : L \leftarrow f_L(S, G). \tag{6.3}$$

This SCM states that both genotype and smoking are direct causes of lung cancer. A certain genotype is responsible for nicotine dependence and hence explains the smoking behavior [50, 51]. However, no variable in this model explains the variable genotype, so G is an exogenous variable.
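The structural assignments in Eqs. (6.1)–(6.3) can be written down almost literally as code. The following is a minimal sketch, not the authors' model: the probabilities inside `f_S` and `f_L`, the extra random noise in each assignment, and the names `sample_scm` and `lung_cancer_rate` are all illustrative assumptions. The optional argument `do_smoking` overrides the assignment for S, which corresponds to an intervention.

```python
import random

def sample_scm(n, do_smoking=None, seed=0):
    """Draw n samples (g, s, l) from a toy smoking/lung-cancer SCM."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = rng.random() < 0.3                      # exogenous genotype G ~ P(u) (assumed)
        if do_smoking is None:
            s = rng.random() < (0.6 if g else 0.2)  # S <- f_S(G), with assumed noise
        else:
            s = bool(do_smoking)                    # intervention: S no longer listens to G
        l = rng.random() < 0.05 + 0.15 * s + 0.10 * g   # L <- f_L(S, G), with assumed noise
        rows.append((g, s, l))
    return rows

def lung_cancer_rate(rows):
    """Fraction of samples with L = 1."""
    return sum(l for _, _, l in rows) / len(rows)

observed = sample_scm(100_000)
forced_smoking = sample_scm(100_000, do_smoking=1)
forced_abstinence = sample_scm(100_000, do_smoking=0)
print(lung_cancer_rate(forced_smoking), lung_cancer_rate(forced_abstinence))
```

With these assumed numbers, the lung cancer rate is about 0.23 when everyone is forced to smoke and about 0.08 when nobody smokes; the genotype distribution is identical in both runs because G is exogenous.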

6.6. Directed Acyclic Graph

Every SCM is associated with a directed acyclic graph (DAG). The vertices in the graph are the variables under study, and the edges represent the causal mechanisms and processes. For instance, if variable X is the direct cause of variable Y, then there is a directed edge from X to Y in the graph. The previous SCM can be visualized as follows:

Figure 6.2: Graphical representation of the SCM in Section 6.5. The square node denotes the exogenous variable, and the round nodes denote the endogenous variables. Each directed edge represents a causal mechanism.


Note that the graphical representation encodes the causal relations in Eqs. (6.1)–(6.3) in a rather intuitive way. In the next section, we will show the strengths of the graphical representation when we need to study the independence relations among variables.

6.7. Intervention with Do-Calculus, do(·)

Do-calculus was developed by Pearl [12, 13] to gauge the effects of causal interventions. In the smoking-lung cancer example, the likelihood of getting lung cancer when smoking is enforced can be expressed as the probability Prob(L = 1 | do(S = 1)), which describes the causal effect identified in a randomized controlled trial. L and S are Bernoulli random variables that take only two values: 0 and 1. L = 1 denotes getting lung cancer, and L = 0 represents the observation of no lung cancer. S = 1 means that smoking is observed, whereas S = 0 means that no smoking is observed. In other words, this post-intervention distribution represents the probability of getting lung cancer (L = 1) when we intervene upon the data generation process by deliberately forcing the participant to smoke, i.e., do(S = 1). A post-intervention probability distribution is any probability term that contains the do-notation do(·). This post-intervention distribution is different from the conditional probability in an observational study, Prob(L = 1 | S = 1), which only represents the likelihood of the outcome (L = 1) when we observe that someone smokes. This conditional probability does not entail the true causal effect, as it might be biased by confounders. We will discuss the confounding issue in the next section. To recall, randomized controlled trials might be impractical or even unethical to conduct at times. For example, we cannot force participants to smoke in a randomized controlled experiment in order to find the effect of an intervention (do(S = 1)). Do-calculus prompts us to ask the following question instead: Is it possible to estimate the post-intervention distribution Prob(L | do(S)) from an observational study? If we can express Prob(L | do(S)) in terms of the conditional probability Prob(L | S) estimated in the observational study, then we can gauge the causal effect without performing randomized controlled trials.

6.7.1. Do-Calculus Algebra

Here we introduce the algebraic procedure of do-calculus that allows us to bridge the gap in probability estimation between observational studies and randomized controlled trials. The goal of do-calculus is to reduce a post-intervention distribution that contains the do(·) operator to a set of do(·)-free probability distributions. The complete mathematical proof of do-calculus can be found in [10, 14].


Rule 1. Prob(Y | do(X), Z, W) = Prob(Y | do(X), Z): If we observe that the variable W is irrelevant to Y (possibly conditional on the other variable Z), then the probability distribution of Y will not change.

Rule 2. Prob(Y | do(X), Z) = Prob(Y | X, Z): If Z is a set of variables blocking all backdoor paths from X to Y, then Prob(Y | do(X), Z) is equivalent to Prob(Y | X, Z). The backdoor path will be explained shortly.

Rule 3. Prob(Y | do(X)) = Prob(Y): We can remove do(X) from Prob(Y | do(X)) in any case where there are no causal paths from X to Y.

If it is not feasible to express the post-intervention distribution Prob(L | do(S)) in terms of do-notation-free conditional probabilities (e.g., Prob(L | S)) using the aforementioned rules, then randomized controlled trials are necessary to gauge the true causality.

6.7.2. Backdoor Path and D-Separation

In rule 2, a backdoor path refers to any path between cause and effect that has an arrow pointing into the cause in a directed acyclic graph (or causal graph). For example, the backdoor path between smoking and lung cancer in Figure 6.2 is “smoking ← genotype → lung cancer.” How do we know whether a backdoor path is blocked or not? Pearl, in his book [5], introduced the concept of d-separation, which tells us how to block a backdoor path. Please refer to [15] for the complete mathematical proof.

(1) In a chain junction, A → B → C, conditioning on B prevents information about A from getting to C, or vice versa.
(2) In a fork or confounding junction, A ← B → C, conditioning on B prevents information about A from getting to C, or vice versa.
(3) In a collider, A → B ← C, exactly the opposite rules hold. The path between A and C is blocked when not conditioning on B. If we condition on B, then the path is unblocked.

Bear in mind that if every path between A and C is blocked, A and C are considered independent of each other. In Figure 6.2, conditioning on genotype blocks the backdoor path between smoking and lung cancer. Here, conditioning on genotype means that we only consider a specific genotype in our analysis. Blocking the backdoor path between the cause and the effect prevents a spurious correlation between them in an observational study. Please refer to the next section for more details on confounding bias.
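The three junction rules above can be applied mechanically along a single path. The sketch below is a simplified illustration under stated assumptions: the function name `path_is_blocked` and the path encoding (a list of node names plus a list of "->" or "<-" arrows between consecutive nodes) are hypothetical, and the check deliberately ignores the refinement that conditioning on a descendant of a collider also unblocks that collider.

```python
def path_is_blocked(nodes, arrows, conditioned):
    """Return True if the path is blocked given the conditioning set.

    nodes:       node names along the path, e.g. ["A", "B", "C"]
    arrows:      arrows between consecutive nodes, each "->" or "<-"
    conditioned: set of node names we condition on
    """
    for i in range(1, len(nodes) - 1):
        into_left, into_right = arrows[i - 1], arrows[i]
        # A collider has both neighbouring arrows pointing into the middle node.
        is_collider = into_left == "->" and into_right == "<-"
        is_conditioned = nodes[i] in conditioned
        if is_collider and not is_conditioned:
            return True   # rule (3): an unconditioned collider blocks the path
        if not is_collider and is_conditioned:
            return True   # rules (1) and (2): a conditioned chain or fork node blocks
    return False

# Backdoor path of Figure 6.2: smoking <- genotype -> lung cancer
backdoor = (["smoking", "genotype", "lung cancer"], ["<-", "->"])
print(path_is_blocked(*backdoor, conditioned=set()))          # open without adjustment
print(path_is_blocked(*backdoor, conditioned={"genotype"}))   # blocked after adjustment
```

Keeping the rule for colliders separate from the rule for chains and forks mirrors the list above: the same act of conditioning closes one kind of junction and opens the other.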

6.8. What Is the Difference Between Prob(Y = y|X = x) and Prob(Y = y|do(X = x))?

In [19], Pearl et al. explain the difference between the two distributions as follows: “In notation, we distinguish between cases where a variable X takes a value x


naturally and cases where we fix X = x by denoting the latter do(X = x). So Prob(Y = y|X = x) is the probability that Y = y conditional on finding X = x, while Prob(Y = y|do(X = x)) is the probability that Y = y when we intervene to make X = x. In the distributional terminology, Prob(Y = y|X = x) reflects the population distribution of Y among individuals whose X value is x. On the other hand, Prob(Y = y|do(X = x)) represents the population distribution of Y if everyone in the population had their X value fixed at x.” This can be better understood with a thought experiment. Imagine that we study the association between barometer readings and weather conditions. We can express this association in terms of Prob(Barometer|Weather) or Prob(Weather|Barometer). Notice that correlations can be defined in both directions. However, causal relations are generally unidirectional. Prob(Weather = rainy|Barometer = low) represents the probability of the weather being rainy when we see that the barometer reading is low. Prob(Weather = rainy|do(Barometer = low)) describes the likelihood of the weather being rainy after we manually set the barometer reading to low. Our common sense tells us that manually setting the barometer to low would not affect the weather conditions; hence, this post-intervention probability simply equals the marginal probability Prob(Weather = rainy) and carries no information about the barometer, whereas the observational probability Prob(Weather = rainy|Barometer = low) may be much higher.

6.9. From Bayesian Networks to Structural Causal Models

Some readers may raise the question: “What is the connection between structural causal models and Bayesian networks, which also aim to interpret causality from data using DAGs?” The Bayesian network (also known as a belief network) was introduced by Pearl [52] in 1985 as his early attempt at causal inference. A classic example¹ of a Bayesian network is shown in Figure 6.3. The nodes in a Bayesian network represent the variables of interest, the edges between linked variables denote their dependencies, and the strengths of such dependencies are quantified by conditional probabilities. The directed edges in this simple network (Figure 6.3) encode the following causal assumptions:

(1) Grass wet is true if the Sprinkler is true or Rain is true.
(2) Rain is the direct cause of Sprinkler, as the latter is usually off on a rainy day to save water.

We can use this probabilistic model to reason about the likelihood of a cause given that an effect is observed; e.g., the likelihood of a rainy day if we observe that the sprinkler is on is Prob(Rain = True|Sprinkler = True) = 0.4, as shown by the conditional probability tables in Figure 6.3.

1 The source of this example is at https://en.wikipedia.org/wiki/Bayesian_network.


Figure 6.3: A simple example of a Bayesian network with conditional probability tables.

However, “a Bayesian network is literally nothing more than a compact representation of a huge probability table. The arrows mean only that the probabilities of child nodes are related to the values of parent nodes by a certain formula (the conditional probability tables)” [5]. In contrast, the arrows in a structural causal model describe the underlying data generation process between linked variables (i.e., cause and effect) using a function mapping instead of conditional probability tables. If we construct an SCM on the same example, the DAG remains unchanged, but the interpretations of the edges are different. For example, the edge Rain → Sprinkler indicates the function Sprinkler ← f(Rain), which dictates how the effect (Sprinkler) would respond if we wiggle the cause (Rain). Note that Sprinkler is the effect and Rain is the cause, and the absence of the arrow Sprinkler → Rain in the DAG says that there is no such function Rain ← f(Sprinkler). Suppose that we would like to answer an interventional question: “What is the likelihood of a rainy day if we manually turn on the sprinkler, Prob(Rain = True|do(Sprinkler = True))?” It is natural to choose SCMs for such questions: since we know from the causal graph that Sprinkler is not a cause of Rain, rule 3 of the do-calculus algebra (Section 6.7.1) permits us to reduce Prob(Rain = True|do(Sprinkler = True)) to Prob(Rain = True). That is, the status of Sprinkler has no impact on Rain. However, a Bayesian network is not equipped to answer such interventional and counterfactual questions. The conditional probability Prob(Rain = True|Sprinkler = True) = 0.4 only says that an association between Sprinkler and Rain exists. Therefore, the ability to emulate interventions is one of the advantages of SCMs over Bayesian networks [5, 53].
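The difference between observing and setting the sprinkler can be made concrete with a two-variable version of the network (Rain → Sprinkler). The sketch below uses made-up probabilities that are not the values of the conditional probability tables in Figure 6.3; the names `P_RAIN` and `P_SPRINKLER_ON_GIVEN_RAIN` are likewise hypothetical. The intervention is emulated by dropping the factor Prob(Sprinkler|Rain) from the joint distribution, which corresponds to deleting the edge Rain → Sprinkler.

```python
# Hypothetical parameters for illustration only (not taken from Figure 6.3).
P_RAIN = {True: 0.3, False: 0.7}                    # Prob(Rain)
P_SPRINKLER_ON_GIVEN_RAIN = {True: 0.1, False: 0.5} # Prob(Sprinkler = True | Rain)

def p_rain_given_sprinkler_on():
    """Observational query Prob(Rain = True | Sprinkler = True) via Bayes' rule."""
    joint = {r: P_RAIN[r] * P_SPRINKLER_ON_GIVEN_RAIN[r] for r in (True, False)}
    return joint[True] / (joint[True] + joint[False])

def p_rain_do_sprinkler_on():
    """Interventional query Prob(Rain = True | do(Sprinkler = True)).

    Forcing the sprinkler on removes the edge Rain -> Sprinkler, so the factor
    Prob(Sprinkler | Rain) is replaced by 1 and Rain keeps its own distribution.
    """
    joint = {r: P_RAIN[r] * 1.0 for r in (True, False)}
    return joint[True] / (joint[True] + joint[False])

print(round(p_rain_given_sprinkler_on(), 3))  # seeing the sprinkler on changes our belief about rain
print(round(p_rain_do_sprinkler_on(), 3))     # switching it on ourselves does not
```

Under these assumed numbers, the observational query differs from Prob(Rain = True), while the interventional query equals it, which is the content of rule 3 for this graph.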


However, the Bayesian network is an integral part of the development of causal inference frameworks as it is an early attempt to marry causality to graphical models. All the probabilistic properties (e.g., local Markov property) of Bayesian networks are also valid in SCMs [5, 15, 53]. Meanwhile, Bayesian networks also impact causal discovery research, which focuses on the identification of causal structures from data through computational algorithms [54].

6.10. Simpson Paradox and Confounding Variables

6.10.1. Spurious Correlations Introduced by Confounder

The famous phrase “correlation does not imply causation” suggests that an observed correlation between variables A and B does not automatically entail causation between A and B. Spurious correlations between two variables may be explained by a confounder. For example, consider the following case (Figure 6.4), where a spurious correlation between yellow fingernails and lung cancer is observed. One cannot simply claim that people who have yellow fingernails have a higher risk of lung cancer, as neither is the cause of the other. Confounding is a causal concept and cannot be expressed in terms of statistical correlation [16, 17]. Another interesting study [1] reported that there is a “surprisingly powerful correlation” (rho = 0.79, p < 0.0001) between chocolate consumption and the number of Nobel laureates in a country. It is hard to believe that there is any direct causal relation between these two variables. This correlation might again be introduced by a confounder (e.g., the advanced educational system in developed countries) that is not included in the study. Pearl argues that [10]: “one cannot substantiate causal claims from association alone, even at the population level. Behind every causal conclusion there must lie some causal assumption that is not testable in an observational study.”

Figure 6.4: Smoking is a common cause and confounder for yellow fingernails and lung cancer. A spurious correlation may be observed between two groups who have yellow fingernails and lung cancer because of the third variable smoking.


Table 6.1: Kidney stone treatment. The fraction numbers indicate the number of success cases over the total size of the group. Treatment B is more effective than treatment A at the overall population level. But the trend is reversed in subpopulations.

Stone size (share of all patients)    Treatment A         Treatment B
Small stones (357/700 = 0.51)         81/87 = 0.93        234/270 = 0.87
Large stones (343/700 = 0.49)         192/263 = 0.73      55/80 = 0.69
Overall                               273/350 = 0.78      289/350 = 0.83

6.10.2. Simpson Paradox Example: Kidney Stone

A confounder (a causal concept) may not only introduce spurious correlations but also generate misleading results. Table 6.1 shows a real-life medical study [18] that compares the effectiveness of two treatments for kidney stones. Treatment A includes all open surgical procedures, while treatment B is percutaneous nephrolithotomy (which involves making only a small puncture or punctures in the kidney). The two treatments are assigned to two groups of the same size (i.e., 350 patients each). The fraction numbers indicate the number of success cases over the total size of the group. If we consider the overall effectiveness of the two treatments, treatment A (success rate = 273/350 = 0.78) is inferior to treatment B (success rate = 289/350 = 0.83). At this moment, we may think that treatment B has a higher chance of cure. However, if we compare the treatments by the size of the kidney stones, we discover that treatment A is clearly better in both groups, patients with small stones and those with large stones. Why is the trend at the population level reversed when we analyze treatments in subpopulations? If we inspect the table with more caution, we realize that treatments are assigned by severity, i.e., people with large stones are more likely to be treated with method A, while most of those with small stones are assigned method B. Therefore, severity (the size of the stone) is a confounder that affects both the recovery and the treatment, as shown in Figure 6.5. Ideally, we are interested in the pure causal relation “treatment X → recovery” without any unwanted effect from other variables (e.g., the confounder severity). We de-confound the causal relation “treatment X → recovery” by intervening on the variable Treatment and forcing its value to be either A or B. By fixing the treatment, we remove the effect of the variable Severity on the variable Treatment. Note that the causal edge “Severity → Treatment” is absent in the mutilated graphical model shown in Figure 6.6. Since Severity does not affect


Figure 6.5: Observational study that has a confounder, severity.

Figure 6.6: We simulate the intervention in the form of a mutilated graphical model. The causal effect Prob(Recovery|do(Treatment)) is equal to the conditional probability Prob(Recovery|Treatment) in this mutilated graphical model.

Treatment and Recovery at the same time after the intervention, it is no longer a confounder. Intuitively, we are interested in understanding what the recovery rate, Prob(Recovery|do(Treatment = A)), will be if we use treatment A on all patients. Similarly, what is the recovery rate, Prob(Recovery|do(Treatment = B)), if we use treatment B only? If the former is larger, then treatment A is more effective; otherwise, treatment B has a higher chance of cure. The notation do(X = x) is a do-expression that fixes the value of X at x. Note that the probability Prob(Recovery|do(Treatment)) marginalizes away the effect of severity: Prob(Recovery|do(Treatment)) = Prob(Recovery|do(Treatment), Severity = small) Prob(Severity = small) + Prob(Recovery|do(Treatment), Severity = large) Prob(Severity = large). Essentially, we are computing the causal effects of “treatment A → recovery” and “treatment B → recovery”:


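$$
\begin{aligned}
\mathrm{Prob}(R = 1 \mid do(T = A))
&= \sum_{s \in \{\text{small},\,\text{large}\}} \mathrm{Prob}(R = 1, S = s \mid do(T = A)) && \text{(law of total probability)} \\
&= \sum_{s \in \{\text{small},\,\text{large}\}} \mathrm{Prob}(R = 1 \mid S = s, do(T = A))\,\mathrm{Prob}(S = s \mid do(T = A)) && \text{(definition of conditional probability)} \\
&= \sum_{s \in \{\text{small},\,\text{large}\}} \mathrm{Prob}(R = 1 \mid S = s, do(T = A))\,\mathrm{Prob}(S = s) && \text{(rule 3 in do-calculus, see Section 6.7.1)} \\
&= \sum_{s \in \{\text{small},\,\text{large}\}} \mathrm{Prob}(R = 1 \mid S = s, T = A)\,\mathrm{Prob}(S = s) && \text{(rule 2 in do-calculus, see Section 6.7.1)} \\
&= \mathrm{Prob}(R = 1 \mid S = \text{small}, T = A)\,\mathrm{Prob}(S = \text{small}) + \mathrm{Prob}(R = 1 \mid S = \text{large}, T = A)\,\mathrm{Prob}(S = \text{large}) \\
&= 0.93 \times 0.51 + 0.73 \times 0.49 = 0.832
\end{aligned}
\tag{3.1}
$$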

Similarly, we can compute

$$
\begin{aligned}
\mathrm{Prob}(R = 1 \mid do(T = B))
&= \sum_{s \in \{\text{small},\,\text{large}\}} \mathrm{Prob}(R = 1, S = s \mid do(T = B)) && \text{(law of total probability)} \\
&= \sum_{s \in \{\text{small},\,\text{large}\}} \mathrm{Prob}(R = 1 \mid S = s, do(T = B))\,\mathrm{Prob}(S = s \mid do(T = B)) && \text{(definition of conditional probability)} \\
&= \sum_{s \in \{\text{small},\,\text{large}\}} \mathrm{Prob}(R = 1 \mid S = s, do(T = B))\,\mathrm{Prob}(S = s) && \text{(rule 3 in do-calculus, see Section 6.7.1)} \\
&= \sum_{s \in \{\text{small},\,\text{large}\}} \mathrm{Prob}(R = 1 \mid S = s, T = B)\,\mathrm{Prob}(S = s) && \text{(rule 2 in do-calculus, see Section 6.7.1)} \\
&= \mathrm{Prob}(R = 1 \mid S = \text{small}, T = B)\,\mathrm{Prob}(S = \text{small}) + \mathrm{Prob}(R = 1 \mid S = \text{large}, T = B)\,\mathrm{Prob}(S = \text{large}) \\
&= 0.87 \times 0.51 + 0.69 \times 0.49 = 0.782
\end{aligned}
\tag{3.2}
$$

Now we know the causal effects: Prob(Recovery|do(Treatment = A)) = 0.832 and Prob(Recovery|do(Treatment = B)) = 0.782. Treatment A is clearly more effective than treatment B. The results also align with our common sense that open surgery (treatment A) is expected to be more effective. A more informative interpretation is that the difference between the two causal effects is the difference in the fraction of the population that would recover if everyone were assigned treatment A rather than treatment B. Recall that we reach the opposite conclusion if we read the “effectiveness” at the population level in Table 6.1.
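The reversal can be reproduced directly from the counts in Table 6.1. The sketch below is an illustrative recomputation; the dictionary `COUNTS` and the function names are assumptions introduced here. Because it uses the unrounded fractions rather than the two-decimal values used in the hand calculation, the adjusted rates come out as approximately 0.833 and 0.779, slightly different from 0.832 and 0.782 but leading to the same ranking.

```python
# (successes, group size) per treatment and stone size, copied from Table 6.1.
COUNTS = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}
SIZES = ("small", "large")
N_TOTAL = sum(total for _, total in COUNTS.values())  # 700 patients

def naive_rate(treatment):
    """Prob(R = 1 | T = treatment): pools both stone sizes, ignoring the confounder."""
    successes = sum(COUNTS[(treatment, s)][0] for s in SIZES)
    total = sum(COUNTS[(treatment, s)][1] for s in SIZES)
    return successes / total

def adjusted_rate(treatment):
    """Prob(R = 1 | do(T = treatment)) via adjustment over stone size."""
    rate = 0.0
    for s in SIZES:
        successes, total = COUNTS[(treatment, s)]
        p_size = (COUNTS[("A", s)][1] + COUNTS[("B", s)][1]) / N_TOTAL  # Prob(S = s)
        rate += (successes / total) * p_size
    return rate

for t in ("A", "B"):
    print(t, round(naive_rate(t), 3), round(adjusted_rate(t), 3))
```

The naive rates favor treatment B, while the adjusted rates favor treatment A, which is the pattern described in the text.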

6.11. How to Estimate the Causal Effect Using Intervention?

The “interventionist” interpretation of causal effect is often described as the magnitude by which the outcome Y changes given a unit change in the treatment T. For example, if we are interested in the effectiveness of a medication in the population, we would set up an experimental study as follows: we administer the drug uniformly to the entire population, do(T = 1), and compare the recovery rate Prob(Y = 1|do(T = 1)) to what we obtain under the opposite context Prob(Y = 1|do(T = 0)), where we keep everyone from using the drug in a parallel universe, do(T = 0). Mathematically, we estimate the difference known as the ACE (defined in Section 6.3),

$$ACE = \mathrm{Prob}(Y = 1 \mid do(T = 1)) - \mathrm{Prob}(Y = 1 \mid do(T = 0)) \tag{3.3}$$

“A more informal interpretation of ACE here is that it is simply the difference in the fraction of the population that would recover if everyone took the drug compared to when no one takes the drug” [19]. The question is how to estimate the intervention distribution with the do operator, Prob(Y|do(T)). We can utilize the following theorem to do so.

Theorem 1. The causal effect rule. Given a graph G in which a set of variables PA are designated as the parents of X, the causal effect of X on Y is given by

$$\mathrm{Prob}(Y = y \mid do(X = x)) = \sum_{z} \mathrm{Prob}(Y = y \mid X = x, PA = z)\,\mathrm{Prob}(PA = z) \tag{3.4}$$


If we multiply and divide each term on the right-hand side by the probability Prob(X = x|PA = z), we get a more convenient form:

$$\mathrm{Prob}(Y = y \mid do(X = x)) = \sum_{z} \frac{\mathrm{Prob}(X = x, Y = y, PA = z)}{\mathrm{Prob}(X = x \mid PA = z)} \tag{3.5}$$

Now the computation of Prob(Y|do(T)) is reduced to the estimation of the joint probability distribution Prob(X, Y, PA) and the conditional distribution Prob(X|PA), both of which can be computed directly from the corresponding observational dataset. Please refer to [66] for details on probability distribution estimation.
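As a rough illustration of how the causal effect rule could be applied to individual-level observational records, the sketch below estimates Prob(Y = y|do(X = x)) as the sum over z of Prob(x, y, z) / Prob(x|z), using simple frequency counts. The function name `causal_effect`, the record layout (x, y, z), and the reconstruction of records from the counts in Table 6.1 are assumptions made for this example; a single adjustment variable z stands in for the parent set PA.

```python
from collections import Counter

def causal_effect(records, x, y):
    """Estimate Prob(Y = y | do(X = x)) from (x, y, z) records by frequency counts."""
    records = list(records)
    n = len(records)
    count_xyz = Counter(records)
    count_xz = Counter((xi, zi) for xi, _, zi in records)
    count_z = Counter(zi for _, _, zi in records)
    effect = 0.0
    for z in count_z:
        if count_xz[(x, z)] == 0:
            # Without any unit receiving x in stratum z, Prob(x | z) = 0 and the
            # ratio is undefined (the positivity assumption is violated).
            raise ValueError(f"no records with X={x!r} in stratum Z={z!r}")
        p_xyz = count_xyz[(x, y, z)] / n                # Prob(X = x, Y = y, PA = z)
        p_x_given_z = count_xz[(x, z)] / count_z[z]     # Prob(X = x | PA = z)
        effect += p_xyz / p_x_given_z
    return effect

# Individual-level records rebuilt from Table 6.1: (treatment, recovered, stone size).
records = (
    [("A", 1, "small")] * 81 + [("A", 0, "small")] * 6
    + [("A", 1, "large")] * 192 + [("A", 0, "large")] * 71
    + [("B", 1, "small")] * 234 + [("B", 0, "small")] * 36
    + [("B", 1, "large")] * 55 + [("B", 0, "large")] * 25
)
print(round(causal_effect(records, "A", 1), 3), round(causal_effect(records, "B", 1), 3))
```

The explicit check for empty strata is a design choice: the ratio form divides by Prob(X = x|PA = z), so the estimate is only defined when every stratum contains at least some units with the treatment value of interest.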

6.12. External Validity and Transportability of Machine Learning Models

In the era of Big Data, we diligently and consistently collect heterogeneous data from various studies, for example, data collected under different experimental conditions, from different underlying populations or locations, or even with different sampling procedures. In short, the collected data are messy and rarely serve our inferential goal directly. Our data analysis should account for these factors. “The process of translating the results of a study from one setting to another is fundamental to science. In fact, scientific progress would grind to a halt were it not for the ability to generalize results from laboratory experiments to the real world” [5]. We first need to take a closer look at heterogeneous datasets.

6.13. How to Describe the Characteristics of Heterogeneous Datasets?

Big Data empowers us to conduct a wide spectrum of studies and to investigate the analytical results. We are normally inclined to incorporate or transfer such results to a new study. This naturally raises the question: Under what circumstances can we transfer the existing knowledge to new studies that are under different conditions? Before we come up with “licenses” of algorithms that permit such transfer, it is crucial to understand how the new dataset in the target study differs from the ones in existing studies. Bareinboim [21] summarizes the differences in heterogeneous datasets over the four dimensions shown in Figure 6.7, which are certainly not enough to enumerate all possibilities in practice, but more dimensions can be added in future research. In Figure 6.7,

d1) datasets may vary in the study population;
d2) datasets may vary in the study design. For instance, the study in Los Angeles is an experimental study under laboratory setting, while the study in New York is an observational study in the real world;


Figure 6.7: Heterogeneous datasets can vary in the dimensions (d1, d2, d3, d4) shown above. Suppose we are interested in the causal effect X → Y in a study carried out in Texas and we have the same causal effect studied in Los Angeles and New York. This table exemplifies the potential differences between the datasets [20].

d3) datasets may vary in the collection process. For instance, dataset 1 may suffer from selection bias on the variable age; for example, if the subjects recruited in study 1 are contacted only using landlines, millennials are probably excluded from the study as they prefer mobile phones;
d4) studies might also take measurements on different sets of variables.

6.14. Selection Bias

Selection bias is caused by preferential exclusion of data samples [22]. It is a major obstacle in validating statistical results, and it can hardly be detected in either experimental or observational studies.

6.14.1. COVID-19 Example

During the COVID-19 pandemic crisis, a statewide study reported that 21.2% of New York City residents had been infected with COVID-19 [23]. The study tested 3,000 New York residents statewide at grocery and big-box stores for antibodies that indicate whether someone has had the virus. Kozyrkov [24] argues that the study might be contaminated with selection bias. The hypothesis is that the cohort in the study is largely skewed toward people who are high-risk-takers and have had the virus. A large portion of the overall population may consist of people who are risk-averse and cautiously stay home; these people were excluded from the research samples. Therefore, the reported infection rate (i.e., 21.2%) is likely to be inflated. Here, we try to investigate a more generic approach to spot selection bias issues.


Causal inference requires us to make certain plausible assumptions when we analyze data. Datasets are not always complete; that is, they do not always tell the whole story. The results of analyses based on data alone (e.g., spurious correlations) can often be very misleading. You may recall the example of smokers who may develop lung cancer and have yellow fingernails. If we find this association (lung cancer and yellow fingernails) to be strong, we may come to the false conclusion that one causes the other. Back to our COVID-19 story, what causal assumptions can we make in the antibody testing study? Consider Figure 6.8, in which each of the following assumptions represents an edge: (1) We know that the antibody will be discovered if we do the related test. (2) People who have had COVID-19 and survived would generate an antibody for that virus. (3) Risk-taking people are more likely to go outdoors and participate in the testing study. (4) In order to highlight the difference between the sample cohort and the overall population in the graph, Bareinboim [21, 22] proposed a hypothetical variable S (standing for “selection”). The variable bounded in the square in Figure 6.8 stands for the characteristics by which the two populations differ. You can also think of S as the inclusion/exclusion criteria. The variable S has incoming edges from the variables “risk taking” and “carried virus.” This means that the sample cohort and the overall population differ in these two aspects.

Figure 6.8: A graphical model that illustrates the selection bias scenario. The variable S (square shaped) is a difference-producing variable, which is a hypothetical variable that points to the characteristic by which the two populations differ.


Now we have encoded our assumptions into a transparent and straightforward diagram. You may wonder why we go through all the hassle of preparing this graphical diagram. What value does it add to our identification of selection bias, or even to the debiasing procedure? Let us begin with the question of how it helps us find a common principle for identifying selection bias.

Identify selection bias with a causal graph

First, a couple of quick tips for identifying selection bias: (1) find any collider variable on a backdoor path between the cause variable and the effect variable; (2) selection bias occurs when your data collection process conditions on these collider variables [22]. Note that these tips are sufficient but not necessary conditions for finding selection bias. By conditioning we mean considering only some of the values a variable can take, not all of them. The backdoor path in tip (1) refers to any path between the cause variable (COVID-19 test) and the effect variable (antibody %) with an arrow pointing into the cause variable ([5], page 158). In our example, the only backdoor path is “COVID-19 test ← risk taking → S ← carried virus → antibody %.” Spurious correlation will be removed if we block every backdoor path. We observe that this backdoor path has three basic components: (1) a fork “COVID-19 test ← risk taking → S”; (2) a collider “risk taking → S ← carried virus”; (3) another fork “S ← carried virus → antibody %.” The backdoor path is blocked if we do not condition on the variable S in the collider (recall d-separation). If we condition on S (e.g., set S = {high risk, have had virus}), the backdoor path is opened and spurious correlation is introduced into the antibody testing results. In general, if we condition on a collider on a backdoor path between cause and effect, and thereby open that path, we encounter the selection bias issue. With this conclusion, we can quickly identify whether a study has a selection bias issue given its causal graph. The identification procedure can also be automated when the graph is complex, so we can offload this judgement to algorithms and computers. Hopefully, you are convinced at this point that using a graphical model is a more generic and automatable way of identifying selection bias.

Unbiased estimation with do-calculus

Now we are interested in estimating the causal effect “COVID-19 test → antibody %,” that is, P(Antibody | do(test)). In other words, the causal effect represents the likelihood of antibody discovery if we test everyone in the population. Readers may wonder at this point: the do-calculus makes sense, but how would we compute with it and remove the do-operator? Recall the algebra of do-calculus introduced in Section 2.3.1. Let us compute P(Antibody | do(test)) as shown below:

$$
\begin{aligned}
P(\text{Antibody} \mid do(\text{test}))
&= P(\text{Antibody} \mid do(\text{test}), \{\}) \\
&= P(\text{Antibody} \mid \text{test}) \\
&= \sum_{i \in \{\text{high},\, \text{low}\}} \; \sum_{j \in \{\text{true},\, \text{false}\}} P(\text{Antibody}, \text{risk} = i, \text{virus} = j \mid \text{test}) \\
&= \sum_{i \in \{\text{high},\, \text{low}\}} \; \sum_{j \in \{\text{true},\, \text{false}\}} P(\text{Antibody} \mid \text{test}, \text{risk} = i, \text{virus} = j)\, P(\text{risk} = i, \text{virus} = j \mid \text{test})
\end{aligned}
\tag{4.1}
$$

The first step conditions on nothing, i.e., on the empty set {}. The second step holds because the backdoor path in Figure 6.8 is blocked naturally if we condition on nothing; according to rule b) of do-calculus, we can then safely remove the do notation. The third step applies the law of total probability, and the last step factorizes each joint term.

The last step of the equation shows the four probability terms measured in the study required to have an unbiased estimation. If we assume a closed world (risk={high, low}, virus={true, false}), it means that we need to measure every stratified group. Now we have seen that do-calculus can help us identify what pieces we need in order to recover the unbiased estimation.
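The collider argument behind this example can be checked by exact enumeration. The sketch below is only an illustration: the marginal probabilities and the selection rule `p_selected` are hypothetical numbers chosen for the demonstration, not values from the study. It shows that “risk taking” and “carried virus” are independent in the overall population but become dependent once we condition on the collider S.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginals, chosen only for illustration.
P_RISK_HIGH = Fraction(1, 2)   # P(risk taking = high)
P_VIRUS = Fraction(3, 10)      # P(carried virus = true)


def p_selected(risk_high: bool, virus: bool) -> Fraction:
    """Hypothetical P(S = 1 | risk, virus): chance of being in the sample."""
    if risk_high and virus:
        return Fraction(9, 10)
    if risk_high or virus:
        return Fraction(1, 2)
    return Fraction(1, 10)


def joint():
    """Yield (risk_high, virus, selected, probability) for every outcome."""
    for risk_high, virus, selected in product([True, False], repeat=3):
        p = P_RISK_HIGH if risk_high else 1 - P_RISK_HIGH
        p *= P_VIRUS if virus else 1 - P_VIRUS   # risk and virus are independent
        ps = p_selected(risk_high, virus)
        p *= ps if selected else 1 - ps
        yield risk_high, virus, selected, p


def p_virus_given(risk_high: bool, selected=None) -> Fraction:
    """P(virus | risk) or, if `selected` is given, P(virus | risk, S)."""
    num = den = Fraction(0)
    for r, v, s, p in joint():
        if r != risk_high or (selected is not None and s != selected):
            continue
        den += p
        if v:
            num += p
    return num / den


# Without conditioning on S, risk carries no information about the virus.
print(p_virus_given(True), p_virus_given(False))
# Conditioning on S = 1 (being sampled) makes the two variables dependent.
print(p_virus_given(True, True), p_virus_given(False, True))
```

Under these assumed numbers, the two unconditional probabilities coincide while the two probabilities among sampled subjects differ, which is the spurious association that opens the backdoor path.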

6.15. Model Transportability with Data Fusion Transferring the learned knowledge across different studies is crucial to scientific endeavors. Consider a new treatment that has shown effectiveness for a disease in an experimental/laboratory setting. We are interested in extrapolating the effectiveness of this treatment in a real-world setting. Assume that the characteristics of the cohort in the lab setting are different from the overall population in the real world, e.g., age, income, etc. Direct transfer of existing findings to a new setting will result in biased inference/estimation of the effectiveness of the drug. Certainly, we can recruit a new cohort that is representative of the overall population and examine whether the effectiveness of this treatment is consistent. However, if we could use the laboratory findings to infer our estimation goal in the real world, this would reduce the cost of repetitive data collection and model development. Let us consider a toy example of how causal reasoning could help with data fusion.


Figure 6.9: A toy example shows how to transfer the existing inference in a lab setting to another population where the difference is the age, denoted by a hypothetical difference variable S (in yellow).

6.15.1. A Toy Example of Data Fusion with Causal Reasoning Imagine that we developed a new treatment for a disease, and we estimated the effectiveness of the new drug for each age group in a randomized controlled experimental setting. Let P(Recovery | do(Treatment), Age) denote the drug effect for each age group. We wish to generalize the lab results to a different population. Assume that the study cohort in the lab setting and the target population are different, and that the differences are explained by a variable S as shown in Figure 6.9. Meanwhile, we assume that the causal effects for each specific age group are invariant across populations. We are interested in gauging the drug effect in the target population ($S = s^*$), and the query can be expressed as $P(\text{Recovery} = \text{True} \mid do(\text{Treatment} = \text{True}), S = s^*)$. Then the query can be solved as follows:

$$
\begin{aligned}
\text{Query} &= P(\text{Recovery} \mid do(\text{Treatment}), S = s^*) \\
&= \sum_{age} P(\text{Recovery} \mid do(\text{Treatment}), S = s^*, \text{Age} = age)\, P(\text{Age} = age \mid do(\text{Treatment}), S = s^*) \\
&= \sum_{age} P^*(\text{Recovery} \mid do(\text{Treatment}), \text{Age} = age)\, P^*(\text{Age} = age)
\end{aligned}
\tag{4.2}
$$

In the last step of the equation, $P^*(\text{Recovery} \mid do(\text{Treatment}), \text{Age} = age)$ is the effect we discovered through various experimental studies, and it is invariant across study cohorts. Hence, $P^*(\text{Recovery} \mid do(\text{Treatment}), \text{Age} = age) = P(\text{Recovery} \mid do(\text{Treatment}), \text{Age} = age)$. $P^*(\text{Age} = age)$ is the age distribution in the new population, and is a shorthand for $P(\text{Age} = age \mid S = s^*)$; the do-operator can be dropped from this factor because the treatment is assumed to have no effect on age. We realize that to answer the query, we just need to compute the summation of the experimental findings weighted by the age distribution in the target population.


For example, assume that we have discovered the effectiveness of the new treatment in different age groups through some experimental studies. The effectiveness is expressed as follows:
• P(Recovery | do(Treatment), Age < 10) = 0.1
• P(Recovery | do(Treatment), Age = 10 ∼ 20) = 0.2
• P(Recovery | do(Treatment), Age = 20 ∼ 30) = 0.3
• P(Recovery | do(Treatment), Age = 30 ∼ 40) = 0.4
• P(Recovery | do(Treatment), Age = 40 ∼ 50) = 0.5

Meanwhile, the age distribution in our target population is as follows:
• group 1: Age < 10, P*(Age < 10) = 1/10
• group 2: Age = 10 ∼ 20, P*(10 ≤ Age < 20) = 2/10
• group 3: Age = 20 ∼ 30, P*(20 ≤ Age < 30) = 4/10
• group 4: Age = 30 ∼ 40, P*(30 ≤ Age < 40) = 2/10
• group 5: Age = 40 ∼ 50, P*(40 ≤ Age < 50) = 1/10

According to equation (4.2), the effectiveness in the target population should be computed as

$$
\begin{aligned}
P(\text{recovery} \mid do(\text{treatment}), S = s^*)
&= \sum_{age} P(\text{recovery} \mid do(\text{treatment}), age)\, P^*(age) \\
&= 0.1 \cdot \tfrac{1}{10} + 0.2 \cdot \tfrac{2}{10} + 0.3 \cdot \tfrac{4}{10} + 0.4 \cdot \tfrac{2}{10} + 0.5 \cdot \tfrac{1}{10} = 0.3
\end{aligned}
\tag{4.3}
$$
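The weighted sum in equation (4.3) is easy to reproduce programmatically. The following is a minimal sketch using the numbers of the toy example; the function name `transport` and the group labels are illustrative choices, not part of the original method description.

```python
from fractions import Fraction

# Age-specific effects P(Recovery | do(Treatment), Age) from the experiment.
effect_by_age = {
    "<10": Fraction(1, 10),
    "10-20": Fraction(2, 10),
    "20-30": Fraction(3, 10),
    "30-40": Fraction(4, 10),
    "40-50": Fraction(5, 10),
}

# Age distribution P*(Age) in the target population.
target_age_dist = {
    "<10": Fraction(1, 10),
    "10-20": Fraction(2, 10),
    "20-30": Fraction(4, 10),
    "30-40": Fraction(2, 10),
    "40-50": Fraction(1, 10),
}


def transport(effect, weights):
    """Reweight stratum-specific effects by the target population's strata."""
    if set(effect) != set(weights):
        raise ValueError("effects and weights must cover the same strata")
    if sum(weights.values()) != 1:
        raise ValueError("stratum weights must sum to 1")
    return sum(effect[group] * weights[group] for group in effect)


target_effect = transport(effect_by_age, target_age_dist)
print(float(target_effect))  # 0.3
```

Exact fractions are used so that the result is not affected by floating-point rounding; the same function can be reused with any other target age distribution over the same groups.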

6.16. Missing Data Missing data occur when the collected values are incomplete for certain observed variables. Missingness might be introduced in a study for various reasons: for example, sensors stop working because they run out of battery, data collection is done improperly by researchers, respondents refuse to answer survey questions that may reveal private information (e.g., income, disability), and many more. The missing data issue is inevitable in many scenarios. Typically, building machine learning predictors or statistical models with missing data may expose the resulting models to the following risks: (a) the partially observed data may bias the inference models, and the study outcomes may deviate substantially from the true values; (b) the reduced sample size may lose the statistical power to provide any informative insights; (c) missing data, appearing as a technical impediment, might also cause severe degradation of predictive performance, as most machine learning models assume that datasets are complete when making inferences. Extensive research has been dedicated to the problem of missing data. Listwise deletion and mean value substitution are commonly used in dealing with this issue because of their simplicity. However, these naive methods fail to account for the relationships between the missing data and the observed data. Thus, the interpolated values usually deviate from the real values by a large margin. Rubin et al. introduced the concept of the missing data mechanism, which is widely adopted in the literature [27]. This mechanism classifies missing data into three categories:
• Missing completely at random (MCAR): The observed data are randomly drawn from the complete data. In other words, missingness is unrelated to other variables or to the variable itself. For example, in Figure 6.10, job performance rating is a partially observed variable, while the variable IQ is complete without any missingness. The MCAR column shows that the missing ratings are independent of IQ values.

Figure 6.10: Example source [28]. Job performance rating is a partially observed variable, and variable IQ is a completely observed variable without any missingness. The second column shows the complete ratings. The 3rd/4th/5th columns show the observed ratings under MCAR/MAR/MNAR conditions, respectively.


• Missing at random (MAR): The missing values of the partially observed variable depend on other measured variables. The MAR column in Figure 6.10 shows that the missing ratings are associated with low IQs.
• Missing not at random (MNAR): MNAR includes the scenarios in which data are neither MCAR nor MAR. For example, in Figure 6.10, the MNAR column shows that the missingness of the ratings depends on the ratings themselves, i.e., low job performance ratings (ratings < 9) are missing.

Pearl et al. demonstrate that a theoretical performance guarantee (e.g., convergence and unbiasedness) exists for inference with data that are MCAR and MAR [29, 30]. In other words, we can still have bias-free estimations despite missing data. In Figure 6.10, assume that Y is the random variable that represents the job performance ratings. The expectations of the job performance ratings under the complete, MCAR, and MNAR columns are $E_{\text{Complete}}[Y] = 10.35$, $E_{\text{MCAR}}[Y] = 10.60$, and $E_{\text{MNAR}}[Y] = 11.40$, respectively. It can be easily verified that $\text{Bias}_{\text{MCAR}} = |E_{\text{Complete}}[Y] - E_{\text{MCAR}}[Y]| = 0.25$ is less than $\text{Bias}_{\text{MNAR}} = |E_{\text{Complete}}[Y] - E_{\text{MNAR}}[Y]| = 1.05$. As the size of the dataset grows, $E_{\text{MCAR}}[Y]$ will converge to the real expectation value $E_{\text{Complete}}[Y]$, i.e., $\text{Bias}_{\text{MCAR}} \to 0$. However, since the MNAR mechanism dictates that low ratings (Y < 9) are inherently missing from our observations, we cannot have a bias-free estimation, regardless of the sample size. We can see that the observed data are governed by the missingness mechanism. Therefore, the missing data issue is inherently a causal inference problem [30, 31]. Most statistical techniques proposed in the literature for handling missing data assume that data are MCAR or MAR [32–35]. For example, the expectation maximum likelihood algorithm is generally considered superior to other conventional methods (e.g., listwise or pairwise deletion) when the data are MCAR or MAR [28]. Moreover, it provides a theoretical guarantee (e.g., unbiasedness or convergence) [36] under MCAR or MAR assumptions. However, when it comes to MNAR, estimations with the conventional statistical techniques will be mostly biased. Mohan et al. report that we can achieve unbiased estimation in the MNAR scenario under certain constraints using causal methods [29–31].
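The contrast between MCAR and MNAR bias can be illustrated with a small simulation. This is a sketch under stated assumptions: the list `RATINGS` is a hypothetical set of ratings (not the values of Figure 6.10), the MCAR mechanism keeps each record with probability 0.5, and the MNAR mechanism is the self-masking rule from the text (ratings below 9 are never observed).

```python
import random
from statistics import mean

# Hypothetical job performance ratings; NOT the data of Figure 6.10.
RATINGS = [7, 8, 8, 8, 9, 9, 9, 10, 10, 10,
           11, 11, 12, 12, 12, 13, 13, 14, 15, 16]


def mcar_observed(values, keep_prob, rng):
    """MCAR: every value is kept with the same probability, whatever it is."""
    return [v for v in values if rng.random() < keep_prob]


def mnar_observed(values, threshold=9):
    """MNAR (self-masking): values below the threshold are never recorded."""
    return [v for v in values if v >= threshold]


rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.choice(RATINGS) for _ in range(100_000)]

complete_mean = mean(population)
bias_mcar = abs(complete_mean - mean(mcar_observed(population, 0.5, rng)))
bias_mnar = abs(complete_mean - mean(mnar_observed(population)))

print(f"bias under MCAR: {bias_mcar:.3f}")
print(f"bias under MNAR: {bias_mnar:.3f}")
```

With a large sample, the MCAR bias shrinks toward zero, whereas the MNAR bias stays near the gap between the mean of all ratings and the mean of ratings of at least 9, no matter how many records are collected.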

6.17. Causal Perspective of Missing Data In this section, we briefly explore and discuss the causal approaches proposed by Mohan and Pearl for dealing with missing data [29–31]. First, we introduce the causal graph representation for missing data: missing graphs (m-graphs for short). Then we introduce recoverability, a formal definition of unbiased estimation with missing data. Next, we discuss under what conditions we can achieve recoverability. Finally, we identify the unsolved problems with data that are MNAR.


Figure 6.11: m-graphs for data that are (a) MCAR, (b) MAR, (c, d) MNAR; hollow and solid circles denote partially and fully observed variables, respectively [29].

6.17.1. Preliminary on Missing Graphs Let $G(\mathbb{V}, E)$ be the causal graph (a DAG) where $\mathbb{V} = V \cup U \cup V^* \cup R$. $V$ denotes the set of observable nodes, which represent observed variables in our data. These observable nodes can be further grouped into fully observable nodes, $V_{obs}$, and partially observable nodes, $V_{mis}$. Hence, $V = V_{obs} \cup V_{mis}$. $V_{obs}$ denotes the set of variables that have complete values, whereas $V_{mis}$ denotes the set of variables that are missing at least one data record. Each partially observed variable $V_i \in V_{mis}$ has two auxiliary variables $R_{v_i}$ and $V_i^*$, where $V_i^*$ is a proxy variable that is actually observed, and $R_{v_i}$ denotes the causal mechanism responsible for the missingness of $V_i^*$:

$$
v_i^* = f(r_{v_i}, v_i) =
\begin{cases}
v_i, & \text{if } r_{v_i} = 0 \\
\text{missing}, & \text{if } r_{v_i} = 1
\end{cases}
\tag{5.1}
$$

In missing graphs, $R_{v_i}$ can be deemed a switch that dictates the missingness of its proxy variable, $V_i^*$. For example, in Figure 6.11(a), X is a fully observable node which has no auxiliary variables. The partially observable node Y is associated with $Y^*$ and $R_y$. $Y^*$ is the proxy variable that we actually observe on Y, and $R_y$ masks the values of Y by its underlying missingness mechanism (e.g., MCAR, MAR, MNAR). E is the set of edges in m-graphs, and U is the set of unobserved variables (latent variables). The toy example in Figure 6.11 does not involve any latent variable. We can cast the classification of missingness mechanisms (e.g., MCAR, MAR, MNAR) onto m-graphs as depicted in Figure 6.11.
• Figure 6.11(a) shows the MCAR case where $R_y \perp\!\!\!\perp X$ (this (conditional) independence is read directly off the causal graph using the d-separation technique). The on–off status of $R_y$ is solely determined by a coin toss. Bear in mind that the absence of an edge between two vertices in a causal graph is a strong constraint which indicates that there is no direct relation between them. The criterion to decide if an m-graph represents MCAR is $R \perp\!\!\!\perp (V_{obs} \cup V_{mis} \cup U)$.


• Figure 6.11(b) shows the MAR case where $R_y \perp\!\!\!\perp Y \mid X$. $R_y$ depends on the fully observed variable X. The criterion for deciding if an m-graph represents MAR is $R \perp\!\!\!\perp (V_{mis} \cup U) \mid V_{obs}$.
• Figure 6.11(c,d) shows the MNAR cases where neither of the aforementioned criteria holds.

It is a clear advantage that we can directly read the missingness mechanism from the m-graphs using d-separation (see footnote 3) without conducting any statistical test.
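For simple m-graphs, the three criteria above can be read off from the parents of the missingness nodes. The sketch below is a simplification, not a general d-separation routine: it assumes that the R nodes have no children other than their proxy variables and no edges among themselves, in which case R is d-separated from a variable set exactly when none of its parents lies in (or is left unconditioned from) that set. The function name and the third example graph (a self-masking case with Y → R_y) are hypothetical.

```python
def classify_mechanism(r_parents, fully_observed):
    """Classify an m-graph as MCAR, MAR, or MNAR from the parents of R nodes.

    r_parents: dict mapping each missingness node (e.g. "R_y") to the set of
        its parent variables in the m-graph.
    fully_observed: set of variables without missing values (V_obs).

    Simplifying assumption: R nodes have no children besides their proxies
    and no edges between each other, so inspecting parents is sufficient.
    """
    all_parents = set().union(*r_parents.values())
    if not all_parents:
        return "MCAR"   # R is independent of V_obs, V_mis, and U
    if all_parents <= set(fully_observed):
        return "MAR"    # conditioning on V_obs blocks every path into R
    return "MNAR"       # some parent is partially observed or latent


# Figure 6.11(a): R_y has no parents.
print(classify_mechanism({"R_y": set()}, {"X"}))
# Figure 6.11(b): R_y depends on the fully observed X.
print(classify_mechanism({"R_y": {"X"}}, {"X"}))
# Hypothetical self-masking graph: R_y depends on the partially observed Y.
print(classify_mechanism({"R_y": {"Y"}}, {"X"}))
```

For graphs that violate the stated assumptions, a full d-separation test on the m-graph is needed instead of this parent-based shortcut.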

6.18. Recoverability Before we can discuss under what conditions we can achieve bias-free estimation, we shall first introduce the definition of recoverability.

Definition 1. Recoverability [29]. Given an m-graph G and a target query relation Q defined on the variables in V, Q is said to be recoverable in G if there exists an algorithm that produces a consistent estimate of Q for every dataset D such that P(D) is (1) compatible with G and (2) strictly positive over complete cases, i.e., $P(V_{obs}, V_{mis}, R = 0) > 0$.

Corollary 1 [29]. A query relation Q is recoverable in G if and only if Q can be expressed in terms of the probability P(O), where $O = \{R, V^*, V_{obs}\}$ is the set of observable variables in G.

6.18.1. Recoverability When Data are MCAR Example 1. Taken from [29]. “Let X be the treatment and Y be the outcome as depicted in the m-graph in Figure 6.11(a). Let it be the case that we accidentally delete the values of Y for a handful of samples, hence $Y \in V_m$. Can we recover P(X, Y)?” Yes, P(X, Y) under MCAR is recoverable. We know that $R_y \perp\!\!\!\perp (X, Y)$ holds in Figure 6.11(a). Thus, $P(X, Y) = P(X, Y \mid R_y) = P(X, Y \mid R_y = 0)$. When $R_y = 0$, we can safely replace Y with $Y^*$, so that $P(X, Y) = P(X, Y^* \mid R_y = 0)$. Note that P(X, Y) has been expressed in terms of a probability, $P(X, Y^* \mid R_y = 0)$, that we can compute using observational data. Hence, we can recover P(X, Y) with no bias.

Footnote 3: Watch this video about d-separation: https://www.youtube.com/watch?v=yDs_q6jKHb0.

6.18.2. Recoverability When Data are MAR Example 2. Taken from [29]. “Let X be the treatment and Y be the outcome as depicted in the m-graph in Figure 6.11(b). Let it be the case that some patients who underwent treatment are not likely to report the outcome, hence $X \to R_y$. Can we recover P(X, Y)?” Yes, P(X, Y) under MAR is recoverable. We know that $R_y \perp\!\!\!\perp Y \mid X$ holds in Figure 6.11(b). Thus, $P(X, Y) = P(Y \mid X)P(X) = P(Y \mid X, R_y)P(X) = P(Y \mid X, R_y = 0)P(X)$. When $R_y = 0$, we can safely replace Y with $Y^*$, so that $P(X, Y) = P(Y^* \mid X, R_y = 0)P(X)$. Note that P(X, Y) has been expressed in terms of a probability, $P(Y^* \mid X, R_y = 0)P(X)$, that we can compute using observational data. Hence, we can recover P(X, Y) with no bias.
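The MAR recovery formula can be verified by exact enumeration on a toy model. In the sketch below, all probabilities are hypothetical numbers chosen for illustration; the model follows the structure of Figure 6.11(b) as described in the text (X is fully observed, and the missingness of Y depends only on X). It compares the naive complete-case estimate of P(Y = 1) with the estimate recovered through $P(Y^* \mid X, R_y = 0)P(X)$.

```python
from fractions import Fraction as F

# Hypothetical ground truth, for illustration only.
p_x = {0: F(1, 2), 1: F(1, 2)}                 # P(X = x)
p_y1_given_x = {0: F(1, 5), 1: F(3, 5)}        # P(Y = 1 | X = x)
p_missing_given_x = {0: F(1, 10), 1: F(1, 2)}  # P(R_y = 1 | X = x), MAR


def rows():
    """Yield (x, y, r_y, probability) over the full joint distribution."""
    for x in (0, 1):
        for y in (0, 1):
            for r in (0, 1):
                p = p_x[x]
                p *= p_y1_given_x[x] if y == 1 else 1 - p_y1_given_x[x]
                p *= p_missing_given_x[x] if r == 1 else 1 - p_missing_given_x[x]
                yield x, y, r, p


def prob(pred):
    """Probability of the event described by pred(x, y, r_y)."""
    return sum((p for x, y, r, p in rows() if pred(x, y, r)), F(0))


# Ground truth, unavailable in practice because Y is partially missing.
truth = prob(lambda x, y, r: y == 1)

# Complete-case analysis: uses only records with R_y = 0, ignoring X.
complete_case = prob(lambda x, y, r: y == 1 and r == 0) / prob(lambda x, y, r: r == 0)

# Recovered estimate: sum over x of P(Y* = 1 | X = x, R_y = 0) P(X = x).
# Only rows with R_y = 0 are used for Y, so Y* equals Y there.
recovered = sum(
    prob(lambda x, y, r, x0=x0: x == x0 and y == 1 and r == 0)
    / prob(lambda x, y, r, x0=x0: x == x0 and r == 0)
    * prob(lambda x, y, r, x0=x0: x == x0)
    for x0 in (0, 1)
)

print(truth, complete_case, recovered)
```

Under these assumed numbers, the complete-case estimate differs from the true value, while the recovered estimate matches it exactly, as the derivation above predicts.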

6.18.3. Recoverability When Data are MNAR Example 3. Taken from [29]. “Figure 6.11(d) depicts a study where (i) some units who underwent treatment (X = 1) did not report the outcome Y, and (ii) we accidentally deleted the values of treatment for a handful of cases. Thus we have missing values for both X and Y which renders the dataset MNAR. Can we recover P(X, Y)?” Yes, P(X, Y) in (d) is recoverable. We know that $X \perp\!\!\!\perp R_x$ and $(R_x \cup R_y) \perp\!\!\!\perp Y \mid X$ hold in Figure 6.11(d). Thus, $P(X, Y) = P(Y \mid X)P(X) = P(Y \mid X, R_y = 0, R_x = 0)P(X \mid R_x = 0) = P(Y^* \mid X^*, R_y = 0, R_x = 0)P(X^* \mid R_x = 0)$. Note that P(X, Y) has been expressed in terms of a probability, $P(Y^* \mid X^*, R_y = 0, R_x = 0)P(X^* \mid R_x = 0)$, that we can compute using observational data. Hence, we can recover P(X, Y) with no bias. According to the original paper [29], P(X, Y) is not recoverable in Figure 6.11(c). Mohan et al. provide a theorem (see Theorem 1 in [29]) which states a sufficient condition for recoverability.

6.19. Testability In Figure 6.11(b), we assume that the missingness mechanism $R_y$ is causally affected by X; hence the arrow pointing from X to $R_y$. The question naturally arises: “Is our assumption/model compatible with our data?” Mohan et al. propose an approach to test the plausibility of missing graphs against the observed dataset [37].

6.19.1. Testability of Conditional Independence (CI) in M-Graphs Definition 2 [37]. Let $X \cup Y \cup Z \subseteq V_o \cup V_m \cup R$ and $X \cap Y \cap Z = \emptyset$. $X \perp\!\!\!\perp Y \mid Z$ is testable if there exists a dataset D governed by a distribution $P(V_o, V^*, R)$ such that $X \perp\!\!\!\perp Y \mid Z$ is refuted in all underlying distributions $P(V_o, V_m, R)$ compatible with the distribution $P(V_o, V^*, R)$.


In other words, if the CIs can be expressed in terms of observable variables exclusively, then these CIs are deemed testable.

Theorem 2. Let $X, Y, Z \subset V_o \cup V_m \cup R$ and $X \cap Y \cap Z = \emptyset$. The conditional independence statement $S: X \perp\!\!\!\perp Y \mid Z$ is directly testable if all the following conditions hold:
1. $Y \not\subseteq (R_{X_m} \cup R_{Z_m})$. In words, Y should contain at least one element that is not in $R_{X_m} \cup R_{Z_m}$.
2. $R_{X_m} \subseteq X \cup Y \cup Z$. In words, the missingness mechanisms of all partially observed variables in X are contained in $X \cup Y \cup Z$.
3. $R_{Z_m} \cup R_{Y_m} \subseteq Z \cup Y$. In words, the missingness mechanisms of all partially observed variables in Y and Z are contained in $Y \cup Z$.

6.19.2. Testability of CIs Comprising Only Substantive Variables As for the CIs that only include substantive variables (e.g., Figure 6.11(b)), it is fairly easy to see that X ⊥ ⊥ Y is testable when X, Y ∈ Vo .

6.20. Missing Data from the Causal Perspective Given an incomplete dataset, our first step is to postulate a model based on causal assumptions of the underlying data generation process. Secondly, we need to determine whether the data reject the postulated model by identifiable testable implications of that model. The last step is to determine, from the postulated model, whether any method exists that produces consistent estimates of the queries of interest.

6.21. Augmented Machine Learning with Causal Inference Despite the rising popularity of causal inference research, the route from machine learning to artificial general intelligence still remains uncharted. Strong artificial intelligence aims to generate artificial agents with the same level of intelligence as human beings. The capability of thinking causally is integral for achieving this goal [38]. In this section, we try to describe how to augment machine learning models with causal inference.

6.22. Reinforcement Learning with Causality Smart agents not only passively observe the world but are also expected to actively interact with their surroundings and to shape the world. Let us consider a case study in recommender systems where we augment a reinforcement learning (RL) agent with causality.


Figure 6.12: A graphical illustration of the interaction between a reinforcement learning agent and its surrounding environment.

A reinforcement learning agent typically interacts with its surrounding environment as follows: it chooses one of the actions available in the observed state and collects the reward that results from that action. The goal of a reinforcement learning agent is to maximize the cumulative reward by choosing the optimal action in every interaction with the environment. Reinforcement learning (RL) has been widely adopted in recommender systems such as those of the Amazon or eBay platforms, in which items are suggested to users based on what has been learned from their past purchases. Let us assume that these e-commerce recommender systems are implemented as reinforcement learning agents. Given a collection of recommended items, a reward is recorded when a purchase transaction is completed. To maximize the total revenue, the items presented on the recommendation page need to be thoughtfully chosen and placed. Meanwhile, the recommender system learns the preferences of the customers by observing their behaviors. In this scenario, the actions shown in Figure 6.12 are the items recommended by the system, and the states are the customer preferences learned by the system. In the following section, we introduce an instance of the reinforcement learning paradigm known as the multi-armed bandit. We discuss issues that may arise in multi-armed bandit models, such as the exploration and exploitation dilemma and the confounding issue. Finally, we describe how Thompson sampling and causal Thompson sampling could help resolve such shortcomings.

6.22.1. Multi-Armed Bandit Model The multi-armed bandit model is inspired by imagining a gambler sitting in front of a row of slot machines and deciding which machine to play with the goal of maximizing his or her total return. Let us say we have N slot machines. Each slot machine gives a positive reward with probability p, or a negative reward with probability 1 − p, where p may differ from machine to machine. When a gambler just starts the game, he has no knowledge of those slot machines. As he plays a few rounds, he “learns” which slot machine is more rewarding (i.e., the reward distribution of these actions). For example, if there are five slot machines with reward probabilities of 0.1, 0.2, 0.3, 0.4, and 0.5, a gambler who knew these values would always choose the last machine in order to have the maximum return.


Figure 6.13: A row of slot machines in Las Vegas.

However, the gambler does not have this reward information at the beginning of the game; he or she has to pull and try each slot machine and estimate its reward. Multi-armed bandit models have been widely adopted in recommender systems. For example, an online video recommender system (e.g., YouTube) would pick a set of videos based on the user’s watch history and expect a high click-through rate from the user. In this scenario, the recommender system is the gambler and the videos are the slot machines. The reward of each video is the user click.

6.22.2. Exploration and Exploitation Dilemma Before we have the complete reward distribution of our actions, we face the problem of which action to take based on our partial observations. If we take a greedy approach and always choose the action with the maximum reward observed so far, this strategy might not be optimal. Let us understand it with the previous slot machine example, where the true reward probabilities are 0.1, 0.2, 0.3, 0.4, 0.5. Imagine that the gambler has just sat down and pulled only a few arms; his reward estimates for these machines might be 0.5, 0.0, 0.0, 0.0, 0.0. This information may convince him to play the first arm even though the likelihood of reward for this machine is the lowest of all. The rest of the machines would never have the chance to be explored, as their reward estimates are 0. Therefore, we need to be careful not to get stuck exploiting a specific action and forget to explore the unknowns. Over the years, there have been many proposals to handle this dilemma: (1) the ε-greedy algorithm, (2) the upper confidence bound algorithm, (3) the Thompson sampling algorithm, and many other variants. We will get back to this dilemma shortly.
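The trap described above, and the way ε-greedy escapes it, can be sketched in a few lines. The code below is an illustrative sketch rather than a reference implementation: rewards are simplified to 1 (win) or 0 (loss), unexplored arms start with an estimate of 0, ties are broken in favor of the first arm, and the values of `epsilon` and the number of rounds are arbitrary choices.

```python
import random


def epsilon_greedy(reward_probs, rounds, epsilon, rng):
    """Play a Bernoulli bandit with the epsilon-greedy rule.

    With probability epsilon a random arm is explored; otherwise the arm
    with the highest estimated win rate so far is exploited.
    """
    n_arms = len(reward_probs)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            # Unexplored arms get an estimate of 0.0 (a simplification).
            estimates = [w / n if n else 0.0 for w, n in zip(wins, pulls)]
            # max() returns the first arm among ties.
            arm = max(range(n_arms), key=estimates.__getitem__)  # exploit
        reward = 1 if rng.random() < reward_probs[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return pulls, wins


probs = [0.1, 0.2, 0.3, 0.4, 0.5]

# Pure greedy (epsilon = 0) never leaves the first, least rewarding arm.
greedy_pulls, _ = epsilon_greedy(probs, 5000, 0.0, random.Random(0))
print(greedy_pulls)

# A small amount of exploration lets every arm be tried.
explore_pulls, _ = epsilon_greedy(probs, 5000, 0.1, random.Random(0))
print(explore_pulls)
```

With these tie-breaking and initialization choices, the purely greedy player spends all rounds on the first machine, while the ε-greedy player samples every machine and can therefore correct its early estimates.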


Figure 6.14: A graphical illustration of confounding bias in reinforcement learning.

6.23. Confounding Issue in Reinforcement Learning How can we view reinforcement learning through a causal lens? Recall that all actions in reinforcement learning are associated with certain rewards. If we see the actions as treatments, we can view the reward as the causal effect of each action taken. In Section 6.3, we learned that confounders may bias the causal effect estimation in scenarios such as the one depicted in Figure 6.14. A biased reward estimate would lead to a suboptimal policy. Therefore, it is essential to debias the reward estimation in the presence of confounders. Bareinboim et al. [39] propose a causal Thompson sampling technique, which draws its strength from “counterfactual reasoning.” Specifically, the causal Thompson sampling algorithm enables reinforcement learning agents to compare the reward resulting from the action that the agent is about to take with the rewards of the alternative actions that are overlooked.

6.23.1. Thompson Sampling In multi-armed bandit problems [40], the goal of maximizing the reward can be fulfilled in different ways. For instance, in our slot machine example, the gambler wants to estimate the true reward for each slot machine to make optimal choices and to maximize his profit in the end. One way is for the gambler to estimate and update the probability of winning at each machine based on his plays. The gambler would choose the slot machine with the maximum likelihood of winning in the next round. In the Thompson sampling algorithm, however, maximizing the reward, unlike the greedy way, is done by balancing exploitation and exploration. The Thompson algorithm associates each machine with a beta probability distribution, which is essentially the distribution of success versus failure in slot machine events. In each turn, the algorithm will sample from associated beta distributions to generate the


reward estimations and then choose to play the machine with the highest estimated reward. The beta distribution is parameterized by α and β through the beta function

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)} \qquad (6.1)$$

where Γ is the Gamma function. Let us inspect the properties of the beta distribution with the following examples to understand why it is a good candidate for the reward estimation in our slot machine example. In Figure 6.15(c), we can see that if we sample from the beta distribution with α = β = 5, the estimated reward would most likely be around 0.5. When α = 5 and β = 2, the peak moves toward the right and the sampled values would be around 0.8 in most cases (see Figure 6.15(a)).

Figure 6.15: Beta distribution with varying α and β, which can be interpreted as the number of successes and failures in the Thompson sampling algorithm. Estimated reward likely to be around (a) 0.8, (b) 0.2, (c) 0.5.
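The peak locations quoted for Figure 6.15 can be checked with the closed-form mode of the beta distribution, (α − 1)/(α + β − 2), which holds for α, β > 1. The following is a minimal sketch; the helper names are hypothetical, and the pair (α, β) = (2, 5) for panel (b) is an assumption, since the text does not state the parameters used for that panel.

```python
def beta_mode(alpha: float, beta: float) -> float:
    """Location of the peak of Beta(alpha, beta); valid for alpha, beta > 1."""
    if alpha <= 1 or beta <= 1:
        raise ValueError("the interior-mode formula requires alpha > 1 and beta > 1")
    return (alpha - 1) / (alpha + beta - 2)


def beta_mean(alpha: float, beta: float) -> float:
    """Expected value of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


# Panels (a) and (c) use the parameters given in the text;
# (2, 5) for panel (b) is an assumed, illustrative choice.
panels = {"a": (5, 2), "b": (2, 5), "c": (5, 5)}
peaks = {name: beta_mode(a, b) for name, (a, b) in panels.items()}
means = {name: beta_mean(a, b) for name, (a, b) in panels.items()}

for name in panels:
    print(name, panels[name], "peak:", peaks[name], "mean:", round(means[name], 3))
```

Note that the peak (mode) and the mean differ for the skewed cases: for α = 5 and β = 2 the peak is at 0.8 while the mean is 5/7.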


If we treat α and β as the number of successes and failures, respectively, in our casino game, then a larger value of α indicates a higher likelihood of payout, whereas a larger value of β indicates a lower likelihood of payout. In the Thompson sampling algorithm, these beta distributions are updated by maintaining the current number of successes (α) and failures (β) for each machine. As we can see from Figure 6.15(b), a machine with a low estimated payout can still produce a large sampled reward, albeit with small probability. This gives the seemingly unrewarding machines an opportunity to be chosen. Therefore, the Thompson sampling algorithm addresses the exploration and exploitation dilemma to a certain degree [40].
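The procedure described in this subsection can be sketched for slot machines with binary payouts. This is an illustrative sketch rather than the implementation from [40]: the function name, the payout probabilities, and the number of rounds are hypothetical, and both counts start at 1 (a uniform prior) as an assumption so that the beta parameters are always positive.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling on simulated slot machines."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    # Assumption: start from Beta(1, 1), i.e. one pseudo-success and one pseudo-failure.
    alpha = [1] * n_arms  # successes per machine
    beta = [1] * n_arms   # failures per machine
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Sample one reward estimate per machine from its beta distribution.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Play the machine with the highest sampled estimate.
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Update the success/failure counts of the machine just played.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return alpha, beta, pulls


# Hypothetical payout probabilities, unknown to the gambler.
alpha, beta, pulls = thompson_sampling([0.3, 0.5, 0.7], n_rounds=2000)
print("pulls per machine:", pulls)
```

Because every machine is chosen whenever its sample happens to be the largest, poorly performing machines are still tried occasionally, while most plays concentrate on the machine whose counts indicate the highest payout.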

6.23.2. Causal Thompson Sampling

In [39], Bareinboim et al. argue that the reward estimation in typical reinforcement learning may be biased in the presence of unobserved confounders. Consequently, conventional Thompson sampling may choose suboptimal machines and skip the truly optimal ones. The authors propose an augmented Thompson sampling with a counterfactual reasoning approach, which allows us to reason about a scenario that has not happened. Imagine that a gambler sits in front of two slot machines, where M1 has a higher payout than M2. Say that the gambler is less confident in the reward estimation for the first machine than for the second one. His natural predilection would be to choose the second slot machine (M2), as he “believes” that the expected reward from the second machine is higher. Recall that unobserved confounders exist and may mislead the gambler into believing that the second machine is more rewarding, whereas the first machine is actually the optimal choice. Suppose the gambler can answer two questions: “Given that I believe M2 is better, what would the payout be if I played M2?” (intuition), and “Given that I believe M2 is better, what would the payout be if I acted differently?” (counter-intuition). If he can estimate the rewards of both intuition and counter-intuition, he may rectify his misbelief in the presence of confounders and choose the optimal machine. Interested readers can refer to [39] for more details. Experimental results in [39] show that causal Thompson sampling resolves the confounding issue on a toy example dataset. However, this work has not been validated on a realistic dataset, and it is still not clear how to handle the confounding issue in practice. This therefore remains an open research question.
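To make the difference between the intuition and counter-intuition quantities concrete, the sketch below computes them exactly for a made-up two-machine example with one unobserved binary confounder U. It is not the algorithm or the payout table from [39]; the variable names, the probabilities, and the assumption that the gambler's intuition is fully determined by U are all hypothetical.

```python
# P(U = u): distribution of the unobserved confounder (made-up values).
P_U = {0: 0.5, 1: 0.5}
# PAYOUT[u][arm] = P(win | U = u, arm is played) (made-up values).
PAYOUT = {0: [0.1, 0.5], 1: [0.4, 0.2]}


def intuition(u):
    """Arm the gambler is naturally inclined to play; assumed to be driven by U."""
    return u


def counterfactual(arm, intuited):
    """E[Y_arm | intuition = intuited]: payout of `arm` in rounds where intuition says `intuited`."""
    matching = [u for u in P_U if intuition(u) == intuited]
    weight = sum(P_U[u] for u in matching)
    return sum(P_U[u] * PAYOUT[u][arm] for u in matching) / weight


def observational(arm):
    """E[Y | X = arm]: payout seen when the gambler simply follows intuition."""
    return counterfactual(arm, arm)


def interventional(arm):
    """E[Y | do(X = arm)]: payout when the arm is forced regardless of U."""
    return sum(P_U[u] * PAYOUT[u][arm] for u in P_U)


for arm in (0, 1):
    other = 1 - arm
    print(
        f"intuition={arm}: play it -> {counterfactual(arm, arm):.2f}, "
        f"play the other -> {counterfactual(other, arm):.2f}, "
        f"do(arm={arm}) -> {interventional(arm):.2f}"
    )
```

In this toy setting, following intuition pays 0.1 or 0.2, whereas acting against it pays 0.5 or 0.4. The comparison is only possible because the agent conditions on its own intended action, which carries information about the confounder.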

6.24. How to Discover the Causal Relations from Data?

In the previous sections, we have seen a number of examples where causal diagrams are essential for causal reasoning. For example, we used the causal graphs in Section 6.3 to demystify the confounding bias (e.g., Simpson’s paradox).


Figure 6.16: A causal graph used for causal discovery illustration.

The causal graph in Figure 6.4 was constructed after careful inspection of the dataset with the help of domain experts. Some readers may wonder whether the construction of these causal structures can be automated with minimal input and intervention from experts. In this section, we introduce the concept of causal discovery, which identifies these causal relations from an observational dataset via computational algorithms [55]. To recapitulate, a causal diagram encodes our assumptions about the causal story behind a dataset. These causal assumptions govern the data generation process. Therefore, the dataset should exhibit certain statistical properties that agree with the causal diagram. In particular, if we believe that a causal diagram G might have generated a dataset D, the variables that are independent of each other in G should also test as independent in dataset D. For example, the causal diagram in Figure 6.16 indicates that X and W are conditionally independent given Z, as “X → Z → W” forms a chain structure (refer to the d-separation in Section 2.3.2). If a dataset is generated by the causal diagram in Figure 6.16, the data should suggest that X and W are independent conditional on Z, written as X ⊥⊥ W | Z. Conversely, we can refute this causal diagram when the data suggest that the conditional independence (e.g., X ⊥⊥ W | Z) does not hold. Provided that the dataset D matches every conditional independence relation suggested by the causal diagram G, we say that the causal model G is plausible for the dataset D. The task of causal discovery is to search through all the possible causal graphs and find the most plausible candidate.

The mainstream causal discovery methods can be categorized into two groups: constraint-based methods and score-based methods. Constraint-based approaches exploit conditional independence tests to search for the causal structure. Typical algorithms include the PC algorithm [56] and the FCI algorithm [57].
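The conditional independence implied by the chain “X → Z → W” can be illustrated on simulated data. The sketch below assumes a linear model with Gaussian noise, for which a vanishing partial correlation corresponds to conditional independence; the coefficients, sample size, and helper names are hypothetical, and a real constraint-based method would use a proper statistical test rather than inspecting the raw values.

```python
import math
import random


def corr(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(a, b, c):
    """Correlation of a and b after removing the linear effect of c."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# Simulate the chain X -> Z -> W (assumed linear-Gaussian mechanism).
rng = random.Random(0)
n = 20000
X = [rng.gauss(0, 1) for _ in range(n)]
Z = [x + rng.gauss(0, 1) for x in X]
W = [z + rng.gauss(0, 1) for z in Z]

r_xw = corr(X, W)                   # clearly non-zero: X and W are dependent
r_xw_given_z = partial_corr(X, W, Z)  # close to zero: independent given Z
print(f"corr(X, W) = {r_xw:.3f}, partial corr(X, W | Z) = {r_xw_given_z:.3f}")
```

X and W are strongly correlated marginally, but the dependence disappears once Z is accounted for, which is the pattern the causal diagram predicts.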
The PC algorithm (named after its authors, Peter and Clark) starts the search with a fully connected graph as shown in Figure 6.17(b). It repeatedly eliminates the edges between two variables if they are independent in the data, as shown


Figure 6.17: A simple example that illustrates the PC algorithm. (a) The true causal structure. (b) The PC algorithm starts with a fully connected graph. (c) It removes the edges if the data suggest that the linked variables are independent, e.g., X and Y are independent. (d) It decides the causal directions using d-separation.

in Figure 6.17(c). For example, a statistical test checks whether Prob(X, Y) = Prob(X) × Prob(Y) holds. If the equation holds in the data, we conclude that X and Y are independent. Finally, the PC algorithm decides the arrow directions using d-separation (Figure 6.17(d)). In particular, the algorithm finds the v-structure “X → Z ← Y” when the data suggest that X and Y, which are marginally independent, become dependent when conditioning on Z. This procedure is performed in two steps: (1) verify that Prob(X, Y) = Prob(X) × Prob(Y); (2) verify that Prob(X, Y | Z) ≠ Prob(X | Z) × Prob(Y | Z). The first step indicates that X and Y are independent when not conditioning on Z. The second step says that X and Y become dependent conditional on Z. Moreover, the PC algorithm identifies the chain structure “X → Z → W” when the data suggest that X and W are independent conditional on Z, that is, Prob(X, W | Z) = Prob(X | Z) × Prob(W | Z). The chain structure “Y → Z → W” is identified in a similar way.

The score-based approach for causal discovery aims to optimize a predefined score function (e.g., the Bayesian information criterion) that evaluates the fitness of a candidate causal graph against the given dataset. A common score-based method is the fast greedy equivalence search (FGES) [58]. This method starts the search with an empty graph and iteratively adds a directed edge to the existing structure if doing so increases the fitness of the current graph. This process continues until the fitness score no longer improves. Next, the algorithm begins removing an edge if the


deletion improves the score. The edge removal stops when the score cannot be improved further. Then the resulting graph is a plausible causal structure from the dataset [59]. However, causal structure learning with a high-dimensional dataset has been deemed challenging since the search space of directed acyclic graphs scales super-exponentially with the number of nodes. Generally, the search problem of identifying DAGs from data has been considered as NP-hard [60, 61]. This challenge greatly handicaps the wide usage of the aforementioned causal discovery methods. There exist ongoing efforts to make the search more efficient by relaxing certain assumptions [63], adopting parallelized computation methodology [64], and incorporating deep learning frameworks [65]. A detailed discussion on causal discovery is beyond the scope of this chapter. Those interested in learning more about causal discovery should refer to [58–65].
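The super-exponential growth of the search space can be made tangible by counting labeled directed acyclic graphs. The sketch below uses a recurrence commonly attributed to Robinson, a(n) = Σ_{k=1..n} (−1)^{k+1} C(n, k) 2^{k(n−k)} a(n−k) with a(0) = 1; the function name is hypothetical, and the recurrence is brought in here for illustration rather than taken from the chapter.

```python
from math import comb


def count_dags(n_nodes: int) -> int:
    """Number of DAGs on n_nodes labeled nodes, via Robinson's recurrence."""
    counts = [1]  # a(0) = 1: the empty graph
    for n in range(1, n_nodes + 1):
        total = 0
        for k in range(1, n + 1):
            # k = number of nodes with no incoming edge (inclusion-exclusion).
            total += (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * counts[n - k]
        counts.append(total)
    return counts[n_nodes]


for n in range(1, 11):
    print(n, count_dags(n))
```

Already at five variables there are 29,281 candidate graphs, and the count grows far faster than exponentially with each added node, which is why exhaustive search is impractical for high-dimensional datasets.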

6.25. Conclusions

In this chapter, we explored the strengths of causal reasoning when facing problems such as confounding bias, model transportability, and learning from missing data. We presented some examples to demonstrate that purely data-driven or correlation-based statistical analysis may generate misleading conclusions. We argued the need to consider causality in our models to support critical clinical decision-making. Machine learning has been widely employed in various healthcare applications, with recently increased efforts to augment machine learning models with causality to improve interpretability [41, 42] and predictive fairness [45, 46] and to avoid bias [43, 44]. Model interpretability can be enhanced through the identification of the cause–effect relation between the model input and outcome: we can observe how the model outcome responds to interventions upon the inputs. For example, powerful machine learning models can be built for early detection of type 2 diabetes mellitus using a collection of features such as age, weight, HDL cholesterol, and triglycerides [47]. However, healthcare practitioners are not content with mere predictions; they are also interested in the variables upon which an intervention will effectively help reduce the risk of the disease. Understanding causality is crucial to answering such questions. We also showed how causality can address confounding bias and selection bias in data analyses. The literature shows that causal inference can be adopted in deep learning modeling to reduce selection bias in recommender systems [43, 44]. Model fairness aims to protect people in minority groups or historically disadvantaged groups from discriminatory decisions produced by AI. Causal inference can also ensure model fairness against such social discrimination [45]. In addition to the attempts and


progress made in this field, there remains much low-hanging fruit in combining causal inference with machine learning methods. We hope that this brief introduction to causal inference will inspire more readers to take an interest in this research area.

References

[1] F. H. Messerli. Chocolate consumption, cognitive function, and Nobel laureates. N. Engl. J. Med., 367(16):1562–1564 (2012).
[2] D. E. Geer Jr. Correlation is not causation. IEEE Secur. Privacy, 9(2):93–94 (2011).
[3] K. E. Havens. Correlation is not causation: a case study of fisheries, trophic state and acidity in Florida (USA) lakes. Environ. Pollut., 106(1):1–4 (1999).
[4] D. Hume. An enquiry concerning human understanding, in Seven Masterpieces of Philosophy, Routledge, pp. 191–284 (2016).
[5] J. Pearl and D. Mackenzie. Beyond adjustment: the conquest of mount intervention, in The Book of Why: The New Science of Cause and Effect, Basic Books, p. 234 (2018).
[6] A. Morabia. Epidemiological causality. Hist. Philos. Life Sci., 27(3–4):365–379 (2005).
[7] S. A. Julious and M. A. Mullee. Confounding and Simpson’s paradox. BMJ, 309(6967):1480–1481 (1994).
[8] P. W. Holland and D. B. Rubin. On Lord’s paradox, in H. Wainer and S. Messick (eds.), Principals of Modern Psychological Measurement, Hillsdale, NJ: Lawrence Erlbaum, pp. 3–25 (1983).
[9] S. Greenland and H. Morgenstern. Confounding in health research. Annu. Rev. Public Health, 22(1):189–212 (2001).
[10] J. Pearl et al. Causal inference in statistics: an overview. Stat. Surv., 3:96–146 (2009).
[11] P. Hünermund and E. Bareinboim. Causal inference and data-fusion in econometrics. arXiv preprint arXiv:1912.09104 (2019).
[12] J. Pearl. The do-calculus revisited. arXiv preprint arXiv:1210.4852 (2012).
[13] R. R. Tucci. Introduction to Judea Pearl’s do-calculus. arXiv preprint arXiv:1305.5506 (2013).
[14] Y. Huang and M. Valtorta. Pearl’s calculus of intervention is complete. arXiv preprint arXiv:1206.6831 (2012).
[15] D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks. Networks, 20(5):507–534 (1990).
[16] T. J. VanderWeele and I. Shpitser. On the definition of a confounder. Ann. Stat., 41(1):196–220 (2013).
[17] S. Greenland, J. M. Robins, and J. Pearl. Confounding and collapsibility in causal inference. Stat. Sci., 14(1):29–46 (1999).
[18] C. R. Charig, D. R. Webb, S. R. Payne, and J. E. Wickham. Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. Br. Med. J. (Clin. Res. Ed.), 292(6524):879–882 (1986).
[19] J. Pearl, M. Glymour, and N. P. Jewell. The effects of interventions, in Causal Inference in Statistics: A Primer, John Wiley & Sons, pp. 53–55 (2016).
[20] E. Bareinboim. Causal data science. https://www.youtube.com/watch?v=dUsokjG4DHc (2019).
[21] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proc. Natl. Acad. Sci., 113(27):7345–7352 (2016).
[22] E. Bareinboim and J. Pearl. Controlling selection bias in causal inference, in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, pp. 100–108 (2012).
[23] J. Berke. A statewide antibody study estimates that 21% of New York City residents have had the coronavirus, Cuomo says (2020).
[24] C. Kozyrkov. Were 21% of New York City residents really infected with the novel coronavirus? (2020).


[25] W. Zhang, W. Bao, X.-Y. Liu, K. Yang, Q. Lin, H. Wen, and R. Ramezani. A causal perspective to unbiased conversion rate estimation on data missing not at random. arXiv preprint arXiv:1910.09337 (2019).
[26] A. B. Ryder, A. V. Wilkinson, M. K. McHugh, K. Saunders, S. Kachroo, A. D’Amelio, M. Bondy, and C. J. Etzel. The advantage of imputation of missing income data to evaluate the association between income and self-reported health status (SRH) in a Mexican American cohort study. J. Immigr. Minor. Health, 13(6):1099–1109 (2011).
[27] D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592 (1976).
[28] C. K. Enders. Applied Missing Data Analysis. Guilford Press (2010).
[29] K. Mohan, J. Pearl, and J. Tian. Graphical models for inference with missing data, in C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems, pp. 1277–1285 (2013).
[30] J. Pearl and K. Mohan. Recoverability and testability of missing data: introduction and summary of results. Available at SSRN 2343873 (2013).
[31] K. Mohan and J. Pearl. Missing data from a causal perspective, in Workshop on Advanced Methodologies for Bayesian Networks, Springer, pp. 184–195 (2015).
[32] J. Yoon, J. Jordon, and M. Van der Schaar. GAIN: missing data imputation using generative adversarial nets. arXiv preprint arXiv:1806.02920 (2018).
[33] S. van Buuren and K. Groothuis-Oudshoorn. mice: multivariate imputation by chained equations in R. J. Stat. Software, 45:1–68 (2010).
[34] Y. Deng, C. Chang, M. S. Ido, and Q. Long. Multiple imputation for general missing data patterns in the presence of high-dimensional data. Sci. Rep., 6:21689 (2016).
[35] J. L. Schafer and J. W. Graham. Missing data: our view of the state of the art. Psychol. Methods, 7(2):147 (2002).
[36] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data, vol. 793. John Wiley & Sons (2019).
[37] J. Pearl. On the testability of causal models with latent and instrumental variables, in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., pp. 435–443 (1995).
[38] M. Ford. Architects of Intelligence: The Truth About AI from the People Building It. Packt Publishing Ltd (2018).
[39] E. Bareinboim, A. Forney, and J. Pearl. Bandits with unobserved confounders: a causal approach, in Advances in Neural Information Processing Systems, pp. 1342–1350 (2015).
[40] D. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038 (2017).
[41] J. Kim and J. Canny. Interpretable learning for self-driving cars by visualizing causal attention, in Proceedings of the IEEE International Conference on Computer Vision, pp. 2942–2950 (2017).
[42] R. Moraffah, M. Karami, R. Guo, A. Raglin, and H. Liu. Causal interpretability for machine learning-problems, methods and evaluation. ACM SIGKDD Explor. Newsl., 22(1):18–33 (2020).
[43] W. Zhang, W. Bao, X.-Y. Liu, K. Yang, Q. Lin, H. Wen, and R. Ramezani. Large-scale causal approaches to debiasing post-click conversion rate estimation with multi-task learning, in Proceedings of the Web Conference 2020, pp. 2775–2781 (2020).
[44] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims. Recommendations as treatments: debiasing learning and evaluation, in International Conference on Machine Learning, pp. 1670–1679. PMLR (2016).
[45] M. J. Kusner, J. R. Loftus, C. Russell, and R. Silva. Counterfactual fairness. arXiv preprint arXiv:1703.06856 (2017).
[46] J. R. Loftus, C. Russell, M. J. Kusner, and R. Silva. Causal reasoning for algorithmic fairness. arXiv preprint arXiv:1805.05859 (2018).
[47] L. Kopitar, P. Kocbek, L. Cilar, A. Sheikh, and G. Stiglic. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci. Rep., 10(1):1–12 (2020).


[48] H. C. S. Thom. A note on the gamma distribution. Monthly Weather Review, 86(4):117–122 (1958).
[49] M. Parascandola. Causes, risks, and probabilities: probabilistic concepts of causation in chronic disease epidemiology. Preventive Med., 53(4–5):232–234 (2011).
[50] Z. Verde et al. ‘Smoking genes’: a genetic association study. PLoS One, 6(10):e26668 (2011).
[51] J. MacKillop et al. The role of genetics in nicotine dependence: mapping the pathways from genome to syndrome. Curr. Cardiovasc. Risk Rep., 4(6):446–453 (2010).
[52] J. Pearl. Bayesian networks: a model of self-activated memory for evidential reasoning, in Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, CA, USA (1985).
[53] J. Pearl. Bayesian networks, causal inference and knowledge discovery. UCLA Cognitive Systems Laboratory, Technical Report (2001).
[54] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery, in Computation, Causation, and Discovery, 141–166 (1999).
[55] C. Glymour, K. Zhang, and P. Spirtes. Review of causal discovery methods based on graphical models. Front. Genet., 10:524 (2019).
[56] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, 2nd edn. Cambridge, MA: MIT Press (2001).
[57] P. Spirtes et al. Constructing Bayesian network models of gene expression networks from microarray data. Carnegie Mellon University. Journal contribution. https://doi.org/10.1184/R1/6491291.v1 (2000).
[58] J. Ramsey et al. A million variables and more: the Fast Greedy Equivalence Search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. Int. J. Data Sci. Anal., 3(2):121–129 (2017).
[59] D. M. Chickering. Optimal structure identification with greedy search. J. Mach. Learn. Res., 3:507–554 (2002).
[60] D. M. Chickering. Learning Bayesian networks is NP-complete, in Learning from Data. Springer, New York, NY, pp. 121–130 (1996).
[61] D. M. Chickering, D. Heckerman, and C. Meek. Large-sample learning of Bayesian networks is NP-hard. J. Mach. Learn. Res., 5:1287–1330 (2004).
[62] Z. Hao et al. Causal discovery on high dimensional data. Appl. Intell., 42(3):594–607 (2015).
[63] T. Claassen, J. Mooij, and T. Heskes. Learning sparse causal models is not NP-hard. arXiv preprint arXiv:1309.6824 (2013).
[64] T. D. Le et al. A fast PC algorithm for high dimensional causal discovery with multi-core PCs. IEEE/ACM Trans. Comput. Biol. Bioinf., 16(5):1483–1495 (2016).
[65] I. Ng et al. A graph autoencoder approach to causal structure learning. arXiv preprint arXiv:1911.07420 (2019).
[66] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA (2012).
[67] T. J. VanderWeele and I. Shpitser. On the definition of a confounder. Ann. Stat., 41(1):196 (2013).


Part II

Supervised Learning


© 2022 World Scientific Publishing Company
https://doi.org/10.1142/9789811247323_0007

Chapter 7

Fuzzy Classifiers

Hamid Bouchachia

Department of Computing & Informatics, Faculty of Science and Technology, Bournemouth University, UK
[email protected]

The present chapter discusses fuzzy classification with a focus on rule-based classification systems (FRS). The chapter consists of two parts. The first part deals with type-1 fuzzy rule-based classification systems (FRCS). It introduces the steps of building such systems before giving an overview of their quality in terms of performance, completeness, consistency, compactness, and transparency/interpretability. The second part discusses incremental and online FRCS, providing insight into both online type-1 and type-2 FRCS at the structural and functional levels.

7.1. Introduction

Over recent years, fuzzy rule-based classification systems have emerged as a new class of attractive classifiers due to their transparency and interpretability. Motivated by these characteristics, fuzzy classifiers have been used in various applications such as smart homes [7, 9], image classification [52], medical applications [51], and pattern recognition [54]. Classification rules are simple, consisting of two parts: the premises (conditions) and the consequents, which correspond to class labels, as shown in the following:

Rule 1: If x1 is small then Class 1
Rule 2: If x1 is large then Class 2
Rule 3: If x1 is medium and x2 is very small then Class 1
Rule 4: If x2 is very large then Class 2


Figure 7.1: Two-dimensional illustrative example of specifying the antecedent part of a fuzzy if–then rule.

where xi corresponds to (the definition domain of) the input features (i.e., the dimensions of the input space). As an example, from Figure 7.1, one can derive the rule: If x1 is large then Class 3. Rules may be associated with degrees of confidence that explain how well the rule covers a particular input region:

Rule 5: If x2 is small then Class 1 with confidence 0.8.

Graphically, the rules partition the space into regions. Ideally, each region is covered by one rule as shown in Figure 7.1. Basically, there are two main approaches for designing fuzzy rule-based classifiers:

• Human expertise: The rules are explicitly proposed by the human expert. Usually, no tuning is required and the rules are used for predicting the classes of the input using certain inference steps (see Section 7.3).
• Machine generated: The standard way of building rule-based classifiers is to apply an automatic process that consists of certain steps: partitioning of the input space, finding the fuzzy sets of the rules’ premises, and associating class labels as consequents. To predict the label of an input, an inference process is applied. Usually additional steps are involved, especially for optimizing the rule base [20].

Usually, fuzzy classifiers come in the form of explicit if–then classification rules as was illustrated earlier. However, rules can also be encoded in neural networks, resulting in neuro-fuzzy architectures [31]. Moreover, different computational intelligence techniques have been used to develop fuzzy classifiers: evolutionary algorithms [20], rough sets [48], ant colony systems [23], immune systems [2],


particle swarm optimization [43], petri nets [10], etc. These computational approaches are used for both the design and optimization of the classifier’s rules.
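To make the rule format from the beginning of this section concrete, Rules 1 to 4 can be evaluated on a numeric input as sketched below. Everything beyond the rules themselves is an assumption made for illustration: the features are taken to be scaled to [0, 1], the linguistic terms are modeled by hypothetical triangular membership functions, the minimum is used for "and", and the predicted class is that of the single rule with the highest firing strength.

```python
def triangular(x, left, peak, right):
    """Triangular membership function; returns a degree in [0, 1]."""
    if x <= left or x >= right:
        # Shoulder sets (left == peak or right == peak) reach 1 at the boundary.
        return 1.0 if x == peak else 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets (left, peak, right) on features scaled to [0, 1].
FUZZY_SETS = {
    "very small": (0.0, 0.0, 0.25),
    "small": (0.0, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "large": (0.5, 1.0, 1.0),
    "very large": (0.75, 1.0, 1.0),
}

# Rules 1 to 4 from the text: (premise, class label).
RULES = [
    ({"x1": "small"}, 1),
    ({"x1": "large"}, 2),
    ({"x1": "medium", "x2": "very small"}, 1),
    ({"x2": "very large"}, 2),
]


def firing_strength(premise, pattern):
    """Degree to which a pattern satisfies a premise; 'and' is modeled by min (assumption)."""
    return min(triangular(pattern[var], *FUZZY_SETS[term]) for var, term in premise.items())


def classify(pattern):
    """Return (label, strength) of the rule that fires most strongly (assumed inference)."""
    strength, label = max((firing_strength(p, pattern), c) for p, c in RULES)
    return label, strength


print(classify({"x1": 0.1, "x2": 0.5}))   # dominated by Rule 1
print(classify({"x1": 0.5, "x2": 0.05}))  # dominated by Rule 3
```

The rule base stays readable as data, which is the transparency argument made above; other inference schemes (for example, aggregating the strengths of all rules per class) would only change `classify`.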

7.2. Pattern Classification

The problem of pattern classification can be formally defined as follows. Let $T = \{(x_i, y_i)\}_{i=1,\ldots,N}$ be a set of training patterns. Each $x_i = (x_{i1}, \ldots, x_{id}) \in X$¹ is a d-dimensional vector and $y_i \in Y = \{y_1, \ldots, y_C\}$, where Y denotes a discrete set of classes (i.e., labels). A classifier is inherently characterized by a decision function $f : X \to Y$ that is used to predict the class label $y_i$ for each pattern $x_i$. The decision function is learned from the training patterns T, drawn i.i.d. at random according to an unknown distribution over the space $X \times Y$. An accurate classifier has a very small generalization error (i.e., it generalizes well to unseen patterns). In general, a loss (cost) is associated with errors, meaning that some misclassification errors are “worse” than others. Such a loss is expressed formally as a function $\ell(x_i, y_i, f(x_i))$ that can take different forms (e.g., 0-1, hinge, squared hinge, etc.) [18]. Ideally, the classifier (i.e., decision function) should minimize the expected error, which is estimated on the training patterns by the empirical risk

$$E(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, y_i, f(x_i)).$$

Hence, the aim is to learn, among the set of all candidate functions, a function f that minimizes the error E(f). The empirical risk may be regularized by adding constraints (terms) that restrain the function space, penalize the number of parameters and the model complexity, and avoid overfitting (which occurs when the classifier does not generalize well to unseen data). The regularized risk is then $R(f) = E(f) + \lambda\,\Omega(f)$, where $\Omega(f)$ is the regularization term and $\lambda$ is its weight.

¹ $x_{ij}$ will be written as $x_j$ (fuzzy variables) when not referring to a specific input vector $x_i$.

Classifiers can be either linear or nonlinear. A linear classifier (i.e., the decision function is linear and the classes are linearly separable) has the form

$$y_i = g(w \cdot x_i) = g\Big(\sum_{j=1}^{d} w_j x_{ij}\Big),$$

where w is a weight vector computed by learning from the training patterns. The function g outputs the class label. Clearly, the argument of the decision function, $f = \sum_{j=1}^{d} w_j x_{ij}$, is a linear combination of the input and the weight vectors.

Linear classifiers are of two types, discriminative or generative. Discriminative linear classifiers try to minimize R (or its simplest version E) without necessarily caring about the way the (training) patterns are generated. Examples of discriminative linear classifiers are the perceptron, support vector machines (SVM), logistic regression, and linear discriminant analysis (LDA). Generative classifiers such as naive Bayes classifiers and hidden Markov models aim at learning a model of the joint probability p(x, y) over the data X and the label Y. They aim at inferring the class-conditional densities p(x|y) and the priors p(y). The prediction decision is made by exploiting Bayes’ rule to compute p(y|x) and then by selecting the label with the highest probability. The discriminative classifiers, on the other hand, aim at computing p(y|x) directly from the pairs $(x_i, y_i)_{i=1,\ldots,N}$. Examples of generative linear classifiers are probabilistic linear discriminant analysis (LDA) and the naive Bayes classifier.

For nonlinear classifiers, the decision function is not linear and can be quadratic, exponential, etc. The most straightforward formulation is the generalized linear classifier, which nonlinearly maps the input space onto another space ($X \subset \mathbb{R}^d \to Z \subset \mathbb{R}^K$) where the patterns are separable. Thus, $x \in \mathbb{R}^d \mapsto z \in \mathbb{R}^K$, $z = [f_1(x), \ldots, f_K(x)]^T$, where $f_k$ is a nonlinear function (e.g., log-sigmoid, tan-sigmoid, etc.). For the ith sample $x_i$, the decision function is then of the form

$$y_i = g\Big(\sum_{k=1}^{K} w_k f_k(x_i)\Big).$$

There are many nonlinear classification algorithms of different types, such as generalized linear classifiers, deep networks, multilayer perceptrons, polynomial classifiers, radial basis function networks, nonlinear SVMs (kernels), k-nearest neighbor, and rule-based classifiers (e.g., decision trees, fuzzy rule-based classifiers, different hybrid neuro-genetic-fuzzy classifiers) [18].

Moreover, we can distinguish between two categories of classifiers: symbolic and sub-symbolic classifiers. The symbolic classifiers are those that learn the decision function in the form of rules:

$$\text{If } x_1 \text{ is } A \wedge x_2 \text{ is } B \wedge \cdots \wedge x_d \text{ is } Z \text{ then class is } C. \qquad (7.1)$$

Here, ∧ is the logic operator “and” as one of the “t-norm” operators that can be used. The advantage of this category of classifiers is the transparency and interpretability of their behavior. The sub-symbolic classifiers are those that do not operate on symbols and could be seen as a black box.
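The empirical and regularized risks defined in this section can be computed for a small linear classifier as sketched below. The sketch makes several assumptions that are not in the text: labels are binary and coded as −1/+1, the loss functions take the label and the linear score w·x as arguments, the regularizer is the squared norm of w, and the training patterns and weights are made-up values.

```python
def linear_score(w, x):
    """Argument of the decision function: the dot product w . x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def zero_one_loss(y, score):
    """0-1 loss for labels in {-1, +1}: 1 when the sign of the score disagrees with y."""
    return 0.0 if y * score > 0 else 1.0


def hinge_loss(y, score):
    """Hinge loss: also penalizes correct predictions that fall inside the margin."""
    return max(0.0, 1.0 - y * score)


def empirical_risk(w, patterns, loss):
    """E(f) = (1/N) * sum of the loss over the training patterns."""
    return sum(loss(y, linear_score(w, x)) for x, y in patterns) / len(patterns)


def regularized_risk(w, patterns, loss, lam):
    """R(f) = E(f) + lam * Omega(f), with Omega(f) = ||w||^2 as an assumed regularizer."""
    return empirical_risk(w, patterns, loss) + lam * sum(wj * wj for wj in w)


# Hypothetical training set T of (pattern, label) pairs and a candidate weight vector.
T = [((1.0, 2.0), +1), ((2.0, 0.5), +1), ((-1.0, -1.5), -1), ((0.5, -2.0), -1)]
w = (0.5, 0.5)

print("0-1 risk:        ", empirical_risk(w, T, zero_one_loss))
print("hinge risk:      ", empirical_risk(w, T, hinge_loss))
print("regularized risk:", regularized_risk(w, T, hinge_loss, lam=0.1))
```

Here all four patterns are classified correctly, so the 0-1 risk is zero, while the hinge risk is positive because one pattern lies inside the margin. This illustrates how the choice of loss changes what the learner is asked to minimize.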

7.3. Fuzzy Rule-Based Classification Systems

Fuzzy rule-based systems represent knowledge in the form of fuzzy “IF-THEN” rules. These rules can be either encoded by human experts or extracted from raw data by an inductive learning process. Generically, a fuzzy rule has the form

$$R_r \equiv \text{If } x_1 \text{ is } A_{r,1} \wedge \ldots \wedge x_d \text{ is } A_{r,d} \text{ then } y_r, \qquad (7.2)$$

where $\{x_i\}_{i=1\ldots d}$ are fuzzy linguistic input variables, $A_{r,i}$ are antecedent linguistic terms in the form of fuzzy sets that characterize the input space, and $y_r$ is the output. Each linguistic term (fuzzy set) is associated with a membership function that defines how each point in the input space formed by the input variables ($\{x_i\}_{i=1\ldots d}$) is mapped

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch07

Fuzzy Classifiers

page 309

309

to a membership value (along with the linguistic variable) between 0 and 1. If the membership values are single values from [0,1], then the fuzzy sets are called type-1 fuzzy sets and the fuzzy rule/system is called a type-1 fuzzy rule/system. The form of rule expressed by Eq. 7.2 can take different variations. The most well-known ones are the Mamdani type (known as linguistic fuzzy rules), Takagi– Sugeno type (known as functional fuzzy rules), and classification type. The former two have been introduced in the context of fuzzy control. Specifically, a linguistic fuzzy rule (Mamdani type) has the following structure: If x1 is Ar,1 ∧ . . . ∧ xd is Ar,d then y1 is Br,1 ∧ . . . ∧ ym is Br,m ,

(7.3)

where $x_i$ are fuzzy linguistic input variables, $y_j$ are fuzzy linguistic output variables, and $A_{r,i}$ and $B_{r,j}$ are linguistic terms in the form of fuzzy sets that characterize $x_i$ and $y_j$. Mamdani’s model is known for its transparency since all of the rule’s terms are linguistic terms.

Functional fuzzy rules differ from linguistic fuzzy rules at the consequent level. The consequent of a functional fuzzy rule is a combination of the inputs, as shown in Eq. (7.4):

If $x_1$ is $A_{r,1}$ $\wedge \cdots \wedge$ $x_d$ is $A_{r,d}$ then $\begin{cases} y_{r,1} = f_{r,1}(x) \\ \quad\cdots \\ y_{r,m} = f_{r,m}(x) \end{cases}$   (7.4)

A rule consists of $d$ input fuzzy variables ($x_1, x_2, \ldots, x_d$) and $m$ output variables ($y_1, y_2, \ldots, y_m$) such that $y_{r,j} = f_{r,j}(x)$. The most popular form of $f$ is the polynomial form: $y_{r,j} = f_{r,j}(x) = b^j_{r,1} x_1 + \cdots + b^j_{r,d} x_d + b^j_{r,0}$, where the $b^j_{r,i}$ denote the $d + 1$ consequent parameters of the $j$th output. These parameters are usually determined using iterative optimization methods. Functional FRS are used to approximate nonlinear systems by a set of linear systems.

In this chapter, we focus on classification systems where a rule looks as follows:

If $x_1$ is $A_{r,1}$ $\wedge \cdots \wedge$ $x_d$ is $A_{r,d}$ then $y_r$ is $C_r$ with $\tau_r$,   (7.5)

where $C_r$ and $\tau_r$ indicate, respectively, the class label and the certainty factor associated with rule $r$. A certainty factor represents the confidence degree of assigning the input, $x_i = [x_{i1}, \ldots, x_{id}]^t$, to a class $C_r$, i.e., how well the rule covers the space of class $C_r$. For a rule $r$, it is expressed as follows [29]:

$\tau_r = \dfrac{\sum_{x_i \in C_r} \mu_{A_r}(x_i)}{\sum_{i=1}^{N} \mu_{A_r}(x_i)}$,   (7.6)


where $\mu_{A_r}(x_i)$ is obtained as follows:

$\mu_{A_r}(x_i) = \prod_{p=1}^{d} \mu_{A_{r,p}}(x_{ip})$.   (7.7)
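The rule activation of Eq. (7.7) and the certainty factor of Eq. (7.6) can be sketched as follows. This is only an illustration under stated assumptions: the fuzzy sets are taken to be Gaussian (the chapter does not fix a shape at this point), and the toy data and the `(center, sigma)` rule encoding are hypothetical.

```python
import math

def gaussian(x, center, sigma):
    """Gaussian membership function (an assumed shape for the fuzzy sets A_{r,p})."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def activation(x, antecedent):
    """Eq. (7.7): product of the per-feature memberships of pattern x."""
    mu = 1.0
    for x_p, (center, sigma) in zip(x, antecedent):
        mu *= gaussian(x_p, center, sigma)
    return mu

def certainty_factor(antecedent, rule_class, patterns, labels):
    """Eq. (7.6): activation mass of the rule's class over the total activation mass."""
    total = sum(activation(x, antecedent) for x in patterns)
    in_class = sum(activation(x, antecedent)
                   for x, y in zip(patterns, labels) if y == rule_class)
    return in_class / total if total > 0.0 else 0.0

# Toy data (hypothetical): two features, two classes.
PATTERNS = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
LABELS = [1, 1, 2, 2]
RULE_1 = [(0.15, 0.2), (0.15, 0.2)]  # one (center, sigma) pair per feature
```

With this data, `certainty_factor(RULE_1, 1, PATTERNS, LABELS)` is very close to 1, because the rule fires almost exclusively on patterns of class 1. Replacing the product in `activation` with `min` would give another t-norm.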

Equation (7.7) uses the product, but any t-norm can be used instead: $\mu_{A_r}(x_i) = T(\mu_{A_{r,1}}(x_{i1}), \ldots, \mu_{A_{r,d}}(x_{id}))$. Moreover, the rule in Eq. (7.5) could be generalized to multi-class consequents to account for class distribution and, in particular, overlap:

If $x_1$ is $A_{r,1}$ $\wedge \cdots \wedge$ $x_d$ is $A_{r,d}$ then $y_r$ is $C_1$ with $\tau_1$, $\ldots$, $C_K$ with $\tau_K$.   (7.8)

This formulation was adopted in particular in [8], where the $\tau_j$ correspond to the proportion of patterns of each class $j$ in the region covered by the rule. Note also that the rule’s form in Eq. (7.5) could be seen as a special case of the functional fuzzy rule model, in particular, the zero-order model, where the consequent is a singleton. Furthermore, there exists another type of fuzzy classification system based on multidimensional fuzzy sets, where rules are expressed in the form [7]

$R_r \equiv$ If $x$ is $K_i$ then $y_r$ is $C_j$,   (7.9)

where $K_i$ is a cluster and $x = [x_1, \ldots, x_d]^t$. The rule means that if the sample $x$ is CLOSE to $K_i$, then the label of $x$ should be that of class $C_j$. Classes in the consequent part of the rules can be determined following two options:

• The generation of the fuzzy sets (partitions) is done using supervised clustering. That is, each partition will be labeled and will emanate from one class. In this case, certainty factors associated with the rules are unnecessary [7–9].
• The generation of the fuzzy sets is done using (unsupervised) clustering, so that partitions can contain patterns from different classes at the same time. In this case, a special process is required to determine the rule consequents. Following the procedure described in [27], the consequent is found as

$y_r = \operatorname{argmax}_{j=1,\ldots,C} \{v_j\}$,   (7.10)

where $v_j$ is given as

$v_j = \sum_{x_k \in C_j} \mu_r(x_k) = \sum_{x_k \in C_j} \left( \prod_{p=1}^{d} \mu_{A_{r,p}}(x_{kp}) \right)$.   (7.11)
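The consequent selection of Eqs. (7.10) and (7.11) amounts to a weighted vote: each training pattern votes for its own class with a weight equal to its activation of the rule. A minimal sketch, assuming the activations $\mu_r(x_k)$ have already been computed (the function name and the input layout are hypothetical):

```python
from collections import defaultdict

def rule_consequent(activations, labels):
    """Eqs. (7.10)-(7.11): pick the class with the largest summed activation.

    activations[k] is mu_r(x_k) for the rule r; labels[k] is the class of x_k.
    """
    votes = defaultdict(float)
    for mu, label in zip(activations, labels):
        votes[label] += mu  # v_j accumulates mu_r(x_k) over the patterns of class j
    best = max(votes, key=votes.get)
    return best, dict(votes)
```

For example, `rule_consequent([0.9, 0.7, 0.2, 0.1], ["a", "a", "b", "b"])` selects class `"a"`. Ties are broken here by first occurrence, which is a simplification of this sketch rather than something the text prescribes.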


Figure 7.2: Structure of a fuzzy classifier.

Given a test pattern, $x_k$, entering the system, the output of the fuzzy classifier with respect to this pattern is the winner class, $w$, referred to in the consequent of the rule with the highest activation degree. The winning class is computed as follows:

$y(x_k) = \operatorname{argmax}_{j=1,\ldots,C} \{w_j(x_k)\}$,   (7.12)

where $w_j$ is given as

$w_j(x_k) = \max\{\mu_r(x_k) * \tau_r \mid y_r = j\}$,   (7.13)

where $\tau_r$ is expressed by Eq. (7.6). If there is more than one winner (taking the same maximum value $w_j$), then the pattern is considered unclassifiable/ambiguous.

As shown in Figure 7.2, a general fuzzy rule-based system consists mainly of four components: the knowledge base, the inference engine, the fuzzifier, and the defuzzifier, which are briefly described below. Note that fuzzy classifiers generally do not include the defuzzification step, since a fuzzy output is already indicative.

1. The knowledge base consists of a rule base that holds a collection of fuzzy IF–THEN rules and a database that includes the membership functions defining the fuzzy sets used by the fuzzy rule system.
2. The inference engine maps the input to the rules and computes an aggregated fuzzy output of the system according to an inference method. The common inference method is the Max-Min composition [40], where either the rules are first combined with the input before being aggregated using some operator (Mamdani, Gödel, etc.) to produce a final output, or the rules are combined first before acting on the input vector. Typical inference in classification systems is rather simple, as explained earlier. It consists of sequentially computing the following degrees:
   (a) Activation, i.e., the product/t-norm of the antecedent part; see Eq. (7.7).
   (b) Association, i.e., combine the confidence factors with the activation degree to determine the association degree of the input with the classes; see Eq. (7.13).


   (c) Classification, i.e., choose the class with the highest association degree; see Eq. (7.12).
3. Fuzzification transforms the crisp input into fuzzy input by matching the input to the linguistic values in the rules’ antecedents. Such values are represented as membership functions (e.g., triangular, trapezoidal, and Gaussian) and stored in the database. These functions specify a partitioning of the input and possibly the output space. The number of fuzzy partitions determines that of the rules. There are three types of partitioning: grid, tree, and scatter. The latter is the most common one, since it finds the regions covering the data using clustering. Different clustering algorithms have been used [57]. Known algorithms are Gustafson–Kessel and Gath–Geva [1], mountain [60], subtractive [3], fuzzy hierarchical clustering [55], fuzzy c-means [6], mixture models [9], and the leader-like algorithm [49].
4. The defuzzifier transforms the aggregated fuzzy output generated by the inference engine into a crisp output, because very often a crisp output of the system is required for facilitating decision making. The known defuzzification methods are center of area, height-center of area, max criterion, first of maxima, and middle of maxima. However, the most popular method is center of area [44]. Recall that fuzzy classification systems do not necessarily involve defuzzification.

Algorithm 7.1 illustrates the main steps for generating rules and predicting the class label of a new input.

Algorithm 7.1: Generic steps of a fuzzy classifier.
Generating:
1: Given a dataset, generate partitions (i.e., clusters, granules) using any type of partitioning (grid, tree, or scatter).
2: Transform the granules into fuzzy patches (fuzzy memberships), making sure that all of the input space is covered by the fuzzy patches (i.e., the fuzzy membership functions should overlap).
3: For each granule, generate a class label as consequent following Eqs. (7.10) and (7.11).
4: Compute the certainty factors of the rules (Eq. (7.6)), produce the rules Premise–Consequent–CF, and store them in a rule base.
Predicting:
1: Given an input, compute the firing degree with respect to each rule by applying Eq. (7.7).
2: Compute the winning class by applying Eqs. (7.12) and (7.13).
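The prediction part of Algorithm 7.1 can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: it assumes Gaussian membership functions, a non-empty rule base stored as hypothetical `(antecedent, class, certainty factor)` triples, and it returns `None` for the ambiguous case in which several classes share the maximum association degree.

```python
import math

def gaussian(x, center, sigma):
    """Gaussian membership function (assumed shape of the fuzzy sets)."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def firing_degree(x, antecedent):
    """Eq. (7.7) with the product t-norm."""
    return math.prod(gaussian(x_p, c, s) for x_p, (c, s) in zip(x, antecedent))

def predict(x, rule_base):
    """Eqs. (7.12)-(7.13): winner class, or None when the maximum is tied."""
    association = {}
    for antecedent, label, tau in rule_base:
        w = firing_degree(x, antecedent) * tau
        association[label] = max(association.get(label, 0.0), w)
    best = max(association.values())
    winners = [label for label, w in association.items() if w == best]
    return winners[0] if len(winners) == 1 else None

# Hypothetical rule base: ([(center, sigma) per feature], class, certainty factor).
RULE_BASE = [
    ([(0.2, 0.15), (0.2, 0.15)], "C1", 0.95),
    ([(0.8, 0.15), (0.8, 0.15)], "C2", 0.90),
    ([(0.5, 0.10), (0.8, 0.15)], "C2", 0.60),
]
```

A call such as `predict((0.25, 0.2), RULE_BASE)` returns `"C1"`, since the first rule dominates near that point. No defuzzification step appears, in line with the remark that fuzzy classifiers generally do not need one.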


7.4. Quality Issues of Fuzzy Rule-Based Systems

In general, knowledge maintenance entails the activities of knowledge validation, optimization, and evolution. The validation process aims at checking the consistency of knowledge by repairing/removing contradictions. Optimization seeks the compactness of knowledge by alleviating all types of redundancy and rendering the knowledge base transparent. Finally, evolution of knowledge is about incremental learning of knowledge without jeopardizing the old knowledge, allowing a continuous and incremental update of the knowledge.

This section aims at discussing the optimization issues of rule-based systems. Optimization in this context deals primarily with the transparency of the rule base. Given the symbolic representation of rules, the goal is to describe the classifier with a small number of concise rules relying on transparent and interpretable fuzzy sets intelligible by human experts. Using data to automatically build the classifier is a process that does not always result in a transparent rule base [32]. Hence, there is a need to optimize the number and the structure of the rules while keeping the classifier accuracy as high as possible (more discussion on this tradeoff will follow below). Note that beyond classification, the Mamdani model is geared toward more interpretability; hence the name linguistic fuzzy modeling. The Takagi–Sugeno model is geared toward accuracy; hence the name precise fuzzy modeling or functional fuzzy model [21].

Overall, the quality of the rule base (knowledge) can be evaluated using various criteria: performance, comprehensibility, and completeness, which are explained in the following subsections.

7.4.1. Performance

A classifier, independently of its computational model, is judged on its performance. That is, how well does the classifier perform on unseen novel data? Measuring performance is about assessing the quality of decision-making. Accuracy is one of the most widely used measures. It quantifies the extent to which the system's decisions agree with those of a human expert. Thus, accuracy measures the ratio of correct decisions: Accuracy = Correct/Test, where Correct is the number of patterns correctly classified (true positive + true negative decisions) and Test is the total number of patterns presented to the classifier. However, accuracy as the sole measure for evaluating the performance of the classifier might be misleading in some situations, such as the case of imbalanced classes. Other measures can be used, like precision (positive predictive value), false positive rate (false alarm rate), true positive rate (sensitivity), and ROC curves [19]. It is, therefore, recommended to use multiple performance measures to objectively evaluate the classifier.
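The warning about imbalanced classes can be made concrete with a small sketch. The data below is hypothetical: 95 negative and 5 positive patterns, and a degenerate classifier that always answers "neg". The helper names are assumptions of this example.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one designated positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def ratio(num, den):
    """Safe division: undefined ratios are reported as 0.0 in this sketch."""
    return num / den if den else 0.0

def measures(y_true, y_pred, positive):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),  # Correct / Test
        "precision": ratio(tp, tp + fp),                # positive predictive value
        "tpr": ratio(tp, tp + fn),                      # sensitivity
        "fpr": ratio(fp, fp + tn),                      # false alarm rate
    }

# Imbalanced toy set: the classifier never predicts the rare class.
Y_TRUE = ["neg"] * 95 + ["pos"] * 5
Y_PRED = ["neg"] * 100
```

Here `measures(Y_TRUE, Y_PRED, "pos")` reports an accuracy of 0.95 together with a true positive rate of 0.0: the classifier looks accurate while missing every positive pattern, which is why several measures should be reported together.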


7.4.2. Completeness

Knowledge is complete if, for making a decision, all the needed knowledge elements are available. Following the architecture portrayed in Figure 7.2, the system should have all ingredients needed to classify a pattern. The other aspect of completeness is the coverage of the discourse representation space. That is, all input variables (dimensions) to be considered should be fully covered by the fuzzy sets using the frame of cognition (FOC) [40], which stipulates that a fuzzy set (patch) along a dimension must satisfy normality, typicality, full membership, convexity, and overlap.
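Coverage along one dimension can be checked numerically. The following is a rough sketch under assumptions not taken from the chapter: triangular membership functions encoded as hypothetical `(a, b, c)` triples, and a uniform grid over the domain; a point counts as uncovered when its best membership does not exceed `eps`.

```python
def triangular(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x == b:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0

def coverage_gaps(partition, lo, hi, steps=100, eps=0.0):
    """Return the grid points of [lo, hi] whose best membership is <= eps."""
    gaps = []
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        if max(triangular(x, *abc) for abc in partition) <= eps:
            gaps.append(x)
    return gaps

# Overlapping partition (small, medium, large) versus one with a hole in the middle.
OVERLAPPING = [(-0.5, 0.0, 0.5), (0.0, 0.5, 1.0), (0.5, 1.0, 1.5)]
WITH_HOLE = [(0.0, 0.2, 0.4), (0.6, 0.8, 1.0)]
```

`coverage_gaps(OVERLAPPING, 0.0, 1.0)` returns an empty list, whereas `WITH_HOLE` leaves the region between 0.4 and 0.6 (and the domain boundaries) uncovered. Raising `eps` enforces a minimum degree of overlap rather than bare coverage.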

7.4.3. Consistency

Consistency is a key issue in knowledge engineering and is considered an important part of the rule base comprehensibility assessment [17, 45, 59]. In the absence of consistency, the knowledge is without value and its use leads to contradictory decisions. Inconsistency results from conflicting knowledge elements. For instance, two rules are in conflict if they have identical antecedents but different consequents. In the case of fuzzy rule-based classification, there is no risk of inconsistency, since each rule corresponds to a given region of the data space. Moreover, even if an overlap between antecedents of various rules exists, the output for a given data point is unambiguously computed using the confidence factors related to each rule in the knowledge base.
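The conflict condition mentioned above (identical antecedents, different consequents) is straightforward to test on a symbolic rule base. A minimal sketch, assuming rules are stored as hypothetical `(antecedent, consequent)` pairs whose antecedent is a tuple of linguistic terms:

```python
from collections import defaultdict

def conflicting_rules(rules):
    """Group rules by antecedent and return the antecedents mapped to several classes."""
    by_antecedent = defaultdict(set)
    for antecedent, consequent in rules:
        by_antecedent[tuple(antecedent)].add(consequent)
    return {a: c for a, c in by_antecedent.items() if len(c) > 1}

# Hypothetical rule base over two variables with terms S, M, L.
RULES = [
    (("S", "M"), "C1"),
    (("S", "M"), "C2"),  # same antecedent as the rule above, different class
    (("L", "M"), "C2"),
]
```

`conflicting_rules(RULES)` returns `{("S", "M"): {"C1", "C2"}}`. This exact-match test only covers crisp identity of antecedents; overlapping but non-identical fuzzy antecedents are handled through the certainty factors, as described in the text.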

7.4.4. Compactness

Compactness is about the conciseness and the ease of understanding and reasoning about knowledge. Systems built on a symbolic ground, such as rule-based systems, are easy to understand, and it is easy to track how and why they reach a particular decision. To reinforce this characteristic, the goal of system design is to reduce, as much as possible, the number of rules to make the system’s behavior more transparent. Thus, a small number of rules and a small number of conditions in the antecedent of rules ensure high compactness of the system’s rule base.

To reduce the complexity of the rule base and consequently to get rid of redundancy and strengthen compactness, the optimization procedure can consist of a certain number of steps [8]. All these steps are based on similarity measures. There are a number of measures based on set-theoretic, proximity, interval, logic, linguistic approximation, and fuzzy-valued approaches [58]. In the following, the optimization steps are described.


• Redundant partitions: they are discovered by computing the similarity of the fuzzy sets describing these partitions to the universe. A fuzzy set $A^r_{ij}$ is removed if

$\mathrm{Sim}(A^r_{ij}, U) > \epsilon$,   (7.14)

where $\epsilon \in (0, 1)$ indicates a threshold (a required level of similarity) and $U$ indicates the universe, which is defined as follows: $\forall x_k$, $\mu_U(x_{ki}) = 1$. That is, any pattern has full membership in the universe of discourse.

• Merging of fuzzy partitions: two fuzzy partitions are merged if their similarity exceeds a certain threshold:

$\mathrm{Sim}(A^r_{ij}, A^r_{ik}) > \tau$,   (7.15)

where $A^r_{ij}$ and $A^r_{ik}$ are the $j$th and $k$th partitions of feature $i$ in rule $r$.

• Removal of weakly firing rules: this consists of identifying rules whose output is always close to 0:

$\mu_{r_i} < \beta$.   (7.16)

• Removal of redundant rules: there is redundancy if the similarity (e.g., overlap) between the antecedents of the rules is high, exceeding some threshold $\delta$. The similarity of the antecedents of two rules $r$ and $p$ is given as

$\mathrm{sim}(A^r, A^p) = \min_{i=1,\ldots,n} \{\mathrm{Sim}(A^r_i, A^p_i)\}$,   (7.17)

where the antecedents $A^r$ and $A^p$ are given by the sets of fuzzy partitions representing the $n$ features, $A^r = \langle A^r_1, \ldots, A^r_n \rangle$ and $A^p = \langle A^p_1, \ldots, A^p_n \rangle$. That is, similar rules (i.e., having similar antecedents and the same consequent) are merged. In doing so, if some rules have the same consequent, the antecedents of those rules can be connected. However, this may result in a conflict if the antecedents are not the same. One can, however, rely on the following rules of thumb:

— If, for some set of rules with the same consequent, a variable takes all forms (it belongs to all forms of fuzzy set, e.g., small, medium, large), then such a variable can be removed from the antecedent of that set of rules. In other words, this set of rules is independent of the variable that takes all possible values. This variable then corresponds to “don’t care.” For instance, if we consider Table 7.1, rules 1, 3, and 4 are about class C1 and the input variable x2 takes all possible linguistic values. The optimization will replace these rules with a generic rule as shown in Table 7.2.

Table 7.1: Rules of the rule base.

Rule   x1   x2   x3   x4   Class
1      S    S    S    S    1
2      M    S    M    M    2
3      S    M    S    S    1
4      S    L    S    S    1
5      M    S    L    M    2

Table 7.2: Combining rules 1, 3, and 4.

Rule   x1   x2   x3   x4   Class
1      S    S    S    S    1
3      S    M    S    S    1
4      S    L    S    S    1
1’     S    -    S    S    1

Table 7.3: Combining rules 2 and 5.

Rule   x1   x2   x3        x4   Class
2      M    S    M         M    2
5      M    S    L         M    2
2’     M    S    (M ∨ L)   M    2

— If, for some set of rules with the same consequent, a variable takes a subset of all possible forms (e.g., small and medium), then the antecedents of such rules can be combined by OR-ing the values corresponding to that variable. For instance, if we consider Table 7.1, rules 2 and 5 are about class C2 such that the variable x3 takes the linguistic values medium and large, while the other variables take the same values. The two rules can be replaced with one generic rule as shown in Table 7.3.
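The two rules of thumb can be sketched on the rule base of Table 7.1. This is an illustrative simplification rather than the procedure of [8]: it assumes a non-empty rule base with crisp linguistic labels S, M, and L, merges same-class rules that differ in exactly one variable, writes "-" for a don't-care variable, and does not attempt to merge already merged rules again. The function name and the rule encoding are hypothetical.

```python
TERMS = ("S", "M", "L")

# Table 7.1 as ((x1, x2, x3, x4), class) pairs.
TABLE_7_1 = [
    (("S", "S", "S", "S"), 1),
    (("M", "S", "M", "M"), 2),
    (("S", "M", "S", "S"), 1),
    (("S", "L", "S", "S"), 1),
    (("M", "S", "L", "M"), 2),
]

def merge_rules(rules, terms=TERMS):
    """Merge same-class rules that agree on all variables but one."""
    remaining = list(rules)
    result = []
    n_vars = len(rules[0][0])
    for i in range(n_vars):
        # Group the rules by class and by the values of all variables except x_i.
        groups = {}
        for antecedent, label in remaining:
            key = (label, antecedent[:i], antecedent[i + 1:])
            groups.setdefault(key, []).append(antecedent[i])
        remaining = []
        for (label, head, tail), values in groups.items():
            if len(values) == 1:
                remaining.append((head + (values[0],) + tail, label))
                continue
            present = [t for t in terms if t in values]
            # All terms present: "don't care"; otherwise OR the observed terms.
            merged = "-" if len(present) == len(terms) else " or ".join(present)
            result.append((head + (merged,) + tail, label))
    return result + remaining
```

Applied to `TABLE_7_1`, the sketch yields the rule `(S, -, S, S)` for class 1 and `(M, S, M or L, M)` for class 2, which correspond to rule 1’ of Table 7.2 and rule 2’ of Table 7.3.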

7.4.5. Feature selection

To enhance the transparency and the compactness of the rules, it is important that the if-part of the rules does not involve many features. Low classification performance generally results from non-informative features. The conventional way to get rid of these features is to apply feature selection methods. Basically, there exist three classes of feature selection methods [25]:


1. Filters: The idea is to filter out features that have small potential to predict the outputs. The selection is done as a pre-processing step ahead of the classification task. Filters refer to statistical selection methods such as principal component analysis, LDA, and singular value decomposition. For instance, Tikk et al. [53] describe a feature ranking algorithm for fuzzy modeling aiming at higher transparency of the rule base. Relying on interclass separability as in LDA and using the backward selection method, the features are sequentially selected. Vanhoucke et al. [56] used a number of measures that rely on mutual information to design a highly transparent classifier and particularly to select the features deemed to be the most informative. A similar approach has been taken in [50], using mutual information-based feature selection for optimizing a fuzzy rule-based system. Lee et al. [33] applied fuzzy entropy (FE) in the context of fuzzy classification to achieve low complexity and high classification efficiency. First, FE was used to partition the pattern space into non-overlapping regions. Then, it was applied to select relevant features using the standard forward selection or backward elimination.
2. Wrappers: Select features that optimize the accuracy of a chosen classifier. Wrappers largely depend on the classifier to judge how well feature subsets classify the training samples. For instance, in [12], the authors use a fuzzy wrapper and a fuzzy C4.5 decision tree to identify discriminative features. Wrappers are not very popular in the area of fuzzy rule-based systems. Moreover, many references claim their methods to be wrappers, while they are actually embedded.
3. Embedded: Embedded methods perform variable selection in the process of training and are usually specific to given learning machines. In [13], multi-objective genetic algorithms are used to design a fuzzy rule-based classifier while selecting the relevant features. The aim is to optimize the precision of the classifier. A similar approach is suggested in [11], where the T-S model is considered. In the same vein, in [46], the selection of features is done while a fuzzy rule classifier is being optimized using the Choquet integral.

The embedded methods dominate the other two categories because optimizing a rule-based system to obtain highly interpretable rules naturally involves selecting features.
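As an illustration of the filter category, the sketch below ranks features by a simple per-feature Fisher-style score (between-class scatter over within-class scatter). This is a generic criterion chosen for the example, in the spirit of the interclass separability mentioned above; it is not the algorithm of any of the cited works, and the function names and toy data are hypothetical.

```python
from statistics import fmean, pvariance

def fisher_scores(patterns, labels):
    """Per-feature ratio of between-class to within-class scatter (higher is better)."""
    classes = sorted(set(labels))
    scores = []
    for p in range(len(patterns[0])):
        column = [x[p] for x in patterns]
        overall = fmean(column)
        between = within = 0.0
        for c in classes:
            values = [v for v, y in zip(column, labels) if y == c]
            between += len(values) * (fmean(values) - overall) ** 2
            within += len(values) * pvariance(values)
        scores.append(between / within if within > 0.0 else float("inf"))
    return scores

def rank_features(patterns, labels):
    """Feature indices sorted from most to least discriminative."""
    scores = fisher_scores(patterns, labels)
    return sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
PATTERNS = [(0.1, 0.5), (0.2, 0.1), (0.15, 0.9), (0.9, 0.4), (0.8, 0.2), (0.85, 0.8)]
LABELS = [0, 0, 0, 1, 1, 1]
```

On this data `rank_features(PATTERNS, LABELS)` places feature 0 first. Keeping only the top-ranked features before rule generation shortens the rule antecedents, which is the transparency benefit discussed in this subsection.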

7.5. Unified View of Rule-Based Optimization

Different taxonomies related to interpretability of fuzzy systems in general have been suggested:

• Mencar et al. [14] proposed a taxonomy in terms of interpretability constraints at the following levels: the fuzzy sets, the universe of discourse, the fuzzy granules, the rules, the fuzzy models, and the learning algorithms.


• Zhou and Gan [61] proposed a taxonomy in terms of low-level interpretability and high-level interpretability. Low-level interpretability is associated with the fuzzy set level to capture the semantics, while high-level interpretability is associated with the rule level.
• Gacto et al. [21] proposed a taxonomy inspired by the second one and distinguished between complexity-based interpretability, equivalent to the high-level interpretability, and semantics-based interpretability, equivalent to the low-level interpretability of the previous taxonomy.

A straightforward way to deal with interpretability issues in a unified way, by considering both transparency and performance at the same time, is to use optimization methods. We can use either meta-heuristics (evolutionary methods) or special-purpose designed methods. In the following, some studies are briefly reported.

Ishibuchi et al. [26] propose a genetic algorithm for rule selection in classification problems, considering the following two objectives: to maximize the number of correctly classified training patterns and to minimize the number of selected rules. This reduces the complexity of the model, thanks to the reduction in the number of rules and the use of don’t-care conditions in the antecedent part of the rule. Ishibuchi et al. [27, 28] present a multi-objective evolutionary algorithm (MOEA) for classification problems with three objectives: maximizing the number of correctly classified patterns, minimizing the number of rules, and minimizing the number of antecedent conditions. Narukawa et al. [38] rely on NSGA-II to optimize the rule base by eliminating redundant rules using multi-objective optimization that aims at increasing the accuracy while minimizing the number of rules and premises.

Additional studies using evolutionary algorithms, in particular multi-objective evolutionary algorithms, can be found in a recent survey by Fazzoli et al. [20].
In comparison to the previously mentioned studies, others have used special-purpose optimization methods. For instance, Mikut et al. [37] used decision trees to generate rules before an optimization process is applied. The latter consists of feature selection using an entropy-based method and iteratively applying a search-based formula that combines accuracy and transparency to find the best configuration of the rule base.

Nauck [39] introduced a formula that combines, as a product, three components: complexity (expressed as the proportion of the number of classes to the total number of variables used in the rules), coverage (the average extent to which the domain of each variable is covered by the actual partitions of the variable), and partition complexity (quantified for each variable as inversely proportional to the number of partitions associated with that variable).

Guillaume and Charnomordic [24] devised a distance-based formula to decide whether two partitions can be merged. The formula relies on the intra-distance (called internal distance) of a fuzzy set for a given variable and the inter-distance (called external distance) of fuzzy sets for a variable, similar to that used for computing clusters. Any pairs of fuzzy sets that minimize the combination of these two measures over the set of data points will be merged.

Valente de Oliveira et al. [15, 16] used backpropagation to optimize a performance index that consists of three constraining terms: accuracy, coverage, and distinguishability of fuzzy sets.

7.6. Incremental and Online Fuzzy Rule-Based Classification Systems

Traditional fuzzy classification systems are designed in batch (offline) mode, that is, by using the complete training data at once. Thus, offline development of fuzzy classification systems assumes that the process of rule induction is done in a one-shot experiment, such that the learning phase and the deployment phase are two sequential and independent stages. For stationary processes, this is sufficient, but if, for instance, the rule system’s performance deteriorates due to a change of the data distribution or a change of the operating conditions, the system needs to be re-designed from scratch. Many offline approaches simply perform “adaptive tuning,” that is, they permanently re-estimate the parameters of the computed model. However, it is quite often necessary to adapt the structure of the rule base. In general, for time-dependent and complex non-stationary processes, efficient techniques for updating the induced models are needed. Such techniques must be able to adapt the current model using only the new data. They have to be equipped with mechanisms to react to gradual or abrupt changes. The adaptation of the model (i.e., rules) should accommodate any information brought in by the new data and reconcile this with the existing rules.

Online development of fuzzy classification systems [9], on the other hand, enables both learning and deployment to happen concurrently. In this context, rule learning takes place over long periods of time, and is inherently open-ended [8]. The aim is to ensure that the system remains amenable to refinement as long as data continues to arrive. Moreover, online systems can deal both with applications starved of data (e.g., experiments that are expensive and slow to produce data, as in some chemical and biological applications) and with applications that are data intensive [5, 7].
Generally, online systems face the challenge of accurately estimating the statistical characteristics of data in the future. In non-stationary environments, the challenge becomes even more important, since the FRS’s behavior may need to change drastically over time due to concept drift [7, 22]. The aim of online learning is to ensure continuous adaptation. Ideally, the system retains only the learned model (e.g., only the rules) and uses that model as the basis for future learning steps. As new data arrive, new rules may be created and existing ones may be modified, allowing the system to evolve over time.

Online and incremental fuzzy rule systems have been recently introduced in a number of studies involving control [3], diagnostics [35], and pattern classification [4, 8, 34, 35, 49]. Type-1 fuzzy systems are currently quite established, since they not only operate online, but also consider related advanced concepts such as concept drift and online feature selection.

For instance, in [8], an integrated approach called FRCS was proposed. To accommodate incremental rule learning, appropriate mechanisms are applied at all steps: (1) incremental supervised clustering to generate the rule antecedents in a progressive manner; (2) online and systematic update of fuzzy partitions; and (3) incremental feature selection using an incremental version of Fisher’s interclass separability criterion to dynamically select features in an online manner.

In [7], a fuzzy rule-based system for online classification is proposed. Relying on fuzzy min–max neural networks, the paper explains how fuzzy rules can be continuously generated online to meet the requirements of non-stationary dynamic environments, where data arrives over long periods of time. The classifier is sensitive to drift. It is able to detect data drift [22] using different measures and react to it. An outline of the proposed algorithm is given in Algorithm 7.2. IFCS consists of three steps:

a) Initial one-shot experiment training: Available data is used to obtain an initial model of the IFCS.
b) Training over time before saturation: Given a saturation training level, incoming data is used to further adjust the model.
c) Correction after training saturation: Beyond the saturation level, incoming data is used to observe the evolution of classification performance, which allows the classifier to be corrected if necessary.
In [9], a growing type-2 fuzzy classifier (GT2FC) for online fuzzy rule learning from real-time data streams is presented. To accommodate dynamic change, GT2FC relies on a new semi-supervised online learning algorithm called 2G2M (Growing Gaussian Mixture Model). In particular, 2G2M is used to generate the type-2 fuzzy membership functions to build the type-2 fuzzy rules. GT2FC is designed to accommodate data online and to reconcile labeled and unlabeled data using self-learning. Moreover, GT2FC maintains low complexity of the rule base using online optimization and feature selection mechanisms. Type-2 fuzzy classification is very suitable for dealing with applications where input is prone to faults and noise. Thus, GT2FC offers the advantage of dealing with uncertainty in addition to self-adaptation in an online manner.


Algorithm 7.2: Steps of the incremental fuzzy classification system (IFCS).
 1: if Initial=true then
 2:    M ← Train_Classifier()
 3:    E ← Test_Classifier(,M) // Just for the sake of observation
 4: end if
 5: i ← 0
 6: while true do
 7:    i ← i+1
 8:    Read
 9:    if IsLabeled(Label)=true then
10:       if Saturation_Training(i)=false then
11:          M ← Train_Classifier(,M)
12:          If Input falls in a hyperbox with Flabel, then Flabel ← Label
13:       else
14:          Err ← Test_Classifier(,M)
15:          Cumulated_Err ← Cumulated_Err+Err
16:          if Detect_Drift(Cumulated_Err)=true then
17:             M ← Reset(Cumulated_Err,M)
18:          else
19:             M ← Update_Classifier(,M)
20:             If Input falls in a hyperbox with Flabel, then Flabel ← Label
21:          end if
22:       end if
23:    else
24:       Flabel ← Predict_Label(Input,M)
25:       M ← Update_Classifier(,M)
26:    end if
27: end while

Figure 7.3: Type-2 fuzzy sets.

Figure 7.4: Type-2 fuzzy logic system.

Before proceeding further with type-2 fuzzy systems, it is important to define type-2 fuzzy sets. In contrast to the membership function of a type-1 fuzzy set, which takes single values from [0,1], the membership of a type-2 set is uncertain (i.e., fuzzy) and takes an interval (or a more general form) rather than a specific value. Specifically, at the operational level, type-2 fuzzy rule systems (T2 FRS) differ from type-1 fuzzy rule systems (T1 FRS) in the type of fuzzy sets and the operations applied on these sets. T2 fuzzy sets are equipped mainly with two newly introduced operators called the meet, $\sqcap$, and the join, $\sqcup$, which correspond to the fuzzy intersection and fuzzy union. As shown in Figure 7.4, T2 FRS at the structural level is similar to T1 FRS but contains an additional module, the type-reducer. In a classification type-2 fuzzy rule system, the fuzzy rules for a C-class pattern classification problem with d input variables can be formulated as

$R_j \equiv$ if $x_1$ is $\tilde{A}_{j,1}$ $\wedge \ldots \wedge$ $x_d$ is $\tilde{A}_{j,d}$ then $C_i$,   (7.18)

where $x = [x_1, \dots, x_d]^t$ such that each $x_i$ is an input variable and $\tilde{A}_{j,i}$ is the corresponding fuzzy term in the form of a type-2 fuzzy set. We may associate these fuzzy sets with linguistic labels to enhance interpretability. $C_i$ is a consequent class, and $j = 1, \dots, N$ indexes the fuzzy rules, $N$ being their number. The inference engine computes the output of type-2 fuzzy sets by combining the rules. Specifically, the meet operator is used to connect the type-2 fuzzy propositions in the antecedent. The degree of activation of the $j$th rule using the $d$ input variables is computed as

$$\tilde{\beta}_j(x) = \sqcap_{q=1}^{d}\, \mu_{\tilde{A}_{j,q}}(x_q), \quad j = 1, \dots, N. \tag{7.19}$$


The meet operation that replaces the fuzzy “and” in T1 FRS is given as follows:

$$\tilde{A} \sqcap \tilde{B} = \int_{u \in J_x} \int_{w \in J_x} f_x(u) * g_x(w) / (u \wedge w), \quad x \in X. \tag{7.20}$$

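If we use the interval singleton T2 FRS, the meet is given for a given input $x$ by the firing set, i.e.,

$$\tilde{\beta}_j(x) = \left[\underline{\beta}_j(x), \overline{\beta}_j(x)\right] = \left[\underline{\mu}_{\tilde{A}_{j,1}}(x_1) * \dots * \underline{\mu}_{\tilde{A}_{j,d}}(x_d),\; \overline{\mu}_{\tilde{A}_{j,1}}(x_1) * \dots * \overline{\mu}_{\tilde{A}_{j,d}}(x_d)\right]. \tag{7.21}$$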

In [9], the Gaussian membership function is adopted and it is given as follows:

$$\mu_A(x) = \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right) = N(m, \sigma; x), \tag{7.22}$$

where $m$ and $\sigma$ are the mean and the standard deviation of the function, respectively. To generate the lower and upper membership functions, the authors used concentration and dilation hedges to generate the footprint of Gaussians with uncertain deviation, as shown in Figure 7.5. These are given as follows:

$$\overline{\mu}_{\tilde{A}}(x) = \mu_{DIL(A)}(x) = \mu_A(x)^{1/2} \tag{7.23}$$

$$\underline{\mu}_{\tilde{A}}(x) = \mu_{CON(A)}(x) = \mu_A(x)^{2} \tag{7.24}$$

Figure 7.5: Uncertain deviation.


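According to Eq. (7.22), we obtain the following expressions:

$$\overline{\mu}_{\tilde{A}}(x) = N\left(m, \sqrt{2}\,\sigma; x\right) \tag{7.25}$$

$$\underline{\mu}_{\tilde{A}}(x) = N\left(m, \frac{\sigma}{\sqrt{2}}; x\right) \tag{7.26}$$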
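As a quick numerical check of Eqs. (7.22)–(7.26), the following minimal sketch builds the lower and upper membership values from the concentration and dilation hedges and compares them with Gaussians of rescaled deviation. The function names `gaussian` and `footprint` and the sample values are illustrative assumptions, not taken from [9].

```python
import math


def gaussian(x, m, sigma):
    """Type-1 Gaussian membership N(m, sigma; x) of Eq. (7.22)."""
    return math.exp(-((x - m) ** 2) / (2.0 * sigma ** 2))


def footprint(x, m, sigma):
    """Return (lower, upper) membership via concentration and dilation."""
    mu = gaussian(x, m, sigma)
    lower = mu ** 2      # CON(A), Eq. (7.24)
    upper = mu ** 0.5    # DIL(A), Eq. (7.23)
    return lower, upper


x, m, sigma = 1.0, 0.0, 2.0   # arbitrary illustrative values
lower, upper = footprint(x, m, sigma)

# Eqs. (7.25)-(7.26): the same bounds from rescaled standard deviations.
upper_rescaled = gaussian(x, m, math.sqrt(2.0) * sigma)
lower_rescaled = gaussian(x, m, sigma / math.sqrt(2.0))
print(lower, gaussian(x, m, sigma), upper)
```

Because a membership value lies in [0,1], squaring can only lower it and taking the square root can only raise it, so the type-1 Gaussian always lies between the two bounds.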

The type-reducer is responsible for converting the T2 FSs obtained as output of the inference engine into T1 FSs. Type-reduction for FRS (linguistic and functional models) was proposed by Karnik and Mendel [30, 36]. It is not adopted in our case, since the rules’ consequent represents the label of a class. Traditionally, in type-1 fuzzy classification systems, the output of the classifier is determined by the rule that has the highest degree of activation:

$$\text{Winner} = C_{j^*}, \quad j^* = \operatorname*{argmax}_{1 \le j \le N} \beta_j, \tag{7.27}$$

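where $\beta_j$ is the firing degree of the $j$th rule. In type-2 fuzzy classification systems, we have an interval $\tilde{\beta}_j = [\underline{\beta}_j, \overline{\beta}_j]$ as defined in Eq. (7.21). Therefore, we compute the winning class by considering the center of the interval $[\underline{\beta}_j(x), \overline{\beta}_j(x)]$, that is

$$\text{Winner} = C_{j^*}, \tag{7.28}$$

where

$$j^* = \operatorname*{argmax}_{1 \le j \le N} \frac{1}{2}\left(\underline{\beta}_j(x) + \overline{\beta}_j(x)\right). \tag{7.29}$$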
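The decision rule of Eqs. (7.28)–(7.29) can be illustrated with a short sketch. It assumes the product as the t-norm `*` in Eq. (7.21), which is one common choice and is not stated in the text; the rule data and the names `firing_interval` and `winning_class` are hypothetical.

```python
import math


def firing_interval(lower_memberships, upper_memberships):
    """Interval firing strength of one rule, using the product t-norm
    (an assumption; Eq. (7.21) leaves the t-norm generic)."""
    return math.prod(lower_memberships), math.prod(upper_memberships)


def winning_class(rules):
    """rules: list of (class_label, lower_memberships, upper_memberships).
    Returns the class of the rule whose interval center is largest."""
    def center(rule):
        _, lower, upper = rule
        lo, hi = firing_interval(lower, upper)
        return 0.5 * (lo + hi)

    return max(rules, key=center)[0]


rules = [
    ("C1", [0.5, 0.4], [0.9, 0.8]),   # interval [0.20, 0.72], center 0.46
    ("C2", [0.6, 0.5], [0.7, 0.6]),   # interval [0.30, 0.42], center 0.36
]
print(winning_class(rules))  # prints C1
```

In this toy case the second rule has the larger lower bound, yet the first rule wins because the rule compares interval centers rather than either bound alone.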

For the sake of illustration, an excerpt of the generated rules is shown in Table 7.4.

Table 7.4: Fuzzy rules for D2.

| Rule | Antecedent | C |
|---|---|---|
| 1 | x1 IN N(–17.60, [0.968, 1.93]) ∧ (x2 IN N(–2.31, [1.42, 2.85]) ∨ x2 IN N(–9.45, [2.04, 4.08])) | 2 |
| 2 | (x1 IN N(–5.27, [2.62, 5.24]) ∨ x1 IN N(–13.10, [1.44, 2.89])) ∧ x2 IN N(1.61, [1.01, 2.02]) | 2 |
| 3 | x1 IN N(–13.10, [1.44, 2.89]) ∧ (x2 IN N(1.61, [1.01, 2.02]) ∨ x2 IN N(–15.12, [0.98, 1.96])) | 2 |
| 4 | x1 IN N(5.47, [1.28, 2.56]) ∧ (x2 IN N(7.78, [1.19, 2.39]) ∨ x2 IN N(–6.10, [0.97, 1.94])) | 1 |
| 5 | (x1 IN N(–5.27, [2.62, 5.24]) ∨ x1 IN N(5.470, [1.28, 2.56]) ∨ x1 IN N(9.79, [1.89, 3.78])) ∧ x2 IN N(–5.30, [1.91, 3.83]) | 1 |
| 6 | x1 IN N(2.69, [0.93, 1.87]) ∧ x2 IN N(–9.45, [2.04, 4.08]) | 1 |
| 7 | (x1 IN N(0.28, [0.97, 1.93]) ∨ x1 IN N(–17.60, [0.96, 1.93])) ∧ x2 IN N(–2.31, [1.42, 2.85]) | 2 |


7.7. Conclusion

This chapter briefly presents fuzzy rule-based classifiers. The working cycle for both type-1 and type-2 fuzzy classification systems has been described. Because the primary requirement of such systems is interpretability, quality issues have been discussed at length, including various approaches with some illustrative studies. Toward the end, the chapter also introduces incremental and online learning of fuzzy classifiers. Fuzzy classifiers and rule-based classifiers in general are suited for human interpretability. Complex applications and automatic generation of clusters could produce classifiers with a complex rule base. Therefore, the role of engineering transparency in such rule-based systems is crucial. It is worth noting that transparency has recently become a major topic in machine learning, and it looks like the fuzzy community is ahead by many years. While type-1 fuzzy classifiers have been extensively studied in the past in different contexts, type-2 fuzzy classifiers have not gained as much popularity as their predecessors. It is expected that this category of fuzzy classifiers will continue to be studied in the future, especially with regard to transparency, interpretability, and online generation [41, 42, 47] and deployment.

References

[1] J. Abonyi, R. Babuska and F. Szeifert. Modified Gath-Geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models. IEEE Trans. Syst. Man Cybern. Part B: Cybern., 32(5):612–621 (2002).
[2] B. Alatas and E. Akin. Mining fuzzy classification rules using an artificial immune system with boosting. In Lecture Notes in Computer Science, vol. 3631, pp. 283–293 (2005).
[3] P. Angelov. An approach for fuzzy rule-base adaptation using on-line clustering. Int. J. Approximate Reasoning, 35(3):275–289 (2004).
[4] P. Angelov, E. Lughofer and X. Zhou. Evolving fuzzy classifiers using different model architectures. Fuzzy Sets Syst., 159:3160–3182 (2008).
[5] O. Arandjelovic and R. Cipolla. Incremental learning of temporally coherent Gaussian mixture models. In Proceedings of the 16th British Machine Vision Conference, pp. 759–768 (2005).
[6] A. Bouchachia. Incremental rule learning using incremental clustering. In Proc. of the 10th conf. on Information Processing and Management of Uncertainty in Knowledge-Based Systems, vol. 3, pp. 2085–2092 (2004).
[7] A. Bouchachia. Fuzzy classification in dynamic environments. Soft Comput., 15(5):1009–1022 (2011).
[8] A. Bouchachia and R. Mittermeir. Towards incremental fuzzy classifiers. Soft Comput., 11(2):193–207 (2007).
[9] A. Bouchachia and C. Vanaret. GT2FC: an online growing interval type-2 self-learning fuzzy classifier. IEEE Trans. Fuzzy Syst., 22(4):999–1018 (2014).
[10] X. Chen, D. Jin and Z. Li. Fuzzy petri nets for rule-based pattern classification. In Communications, Circuits and Systems and West Sino Expositions, IEEE 2002 International Conference on, vol. 2, pp. 1218–1222 (2002).


[11] Y.-C. Chen, N. R. Pal and I-F Chung. An integrated mechanism for feature selection and fuzzy rule extraction for classification. IEEE Trans. Fuzzy Syst., 20(4):683–698 (2012).
[12] M. Cintra and H. Camargo. Feature subset selection for fuzzy classification methods. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Methods, vol. 80 of Communications in Computer and Information Science, pp. 318–327, Springer (2010).
[13] O. Cordon, M. del Jesus, F. Herrera, L. Magdalena and P. Villar. A multiobjective genetic learning process for joint feature selection and granularity and context learning in fuzzy rule-based classification systems. In Interpretability Issues in Fuzzy Modeling, pp. 79–99, Springer-Verlag (2003).
[14] C. Mencar and A. Fanelli. Interpretability constraints for fuzzy information granulation. Inf. Sci., 178(24):4585–4618 (2008).
[15] J. de Oliveira. Semantic constraints for membership function optimization. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans, 29(1):128–138 (1999).
[16] J. de Oliveira. Towards neuro-linguistic modeling: constraints for optimization of membership functions. Fuzzy Sets Syst., 106:357–380 (1999).
[17] D. Dubois, H. Prade and L. Ughetto. Checking the coherence and redundancy of fuzzy knowledge bases. IEEE Trans. Fuzzy Syst., 5(3):398–417 (1997).
[18] R. Duda, P. Hart and D. Stork. Pattern Classification. New York, Wiley (2001).
[19] T. Fawcett. An introduction to ROC analysis. Pattern Recogn. Lett., 27(8):861–874 (2006).
[20] M. Fazzolari, R. Alcalá, Y. Nojima, H. Ishibuchi and F. Herrera. A review of the application of multiobjective evolutionary fuzzy systems: current status and further directions. IEEE Trans. Fuzzy Syst., 21(1):45–65 (2013).
[21] M. Gacto, R. Alcalá and F. Herrera. Interpretability of linguistic fuzzy rule-based systems: an overview of interpretability measures. Inf. Sci., 181(20):4340–4360 (2011).
[22] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy and A. Bouchachia. A survey on concept drift adaptation. ACM Comput. Surv., 46(4):1–44 (2013).
[23] M. Ganji and M. Abadeh. A fuzzy classification system based on ant colony optimization for diabetes disease diagnosis. Expert Syst. Appl., 38(12):14650–14659 (2011).
[24] S. Guillaume and B. Charnomordic. A new method for inducing a set of interpretable fuzzy partitions and fuzzy inference systems from data. In Interpretability Issues in Fuzzy Modeling, pp. 148–175, Springer-Verlag (2003).
[25] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157–1182 (2003).
[26] H. Ishibuchi, T. Murata and I. Türkşen. Single-objective and two-objective genetic algorithms for selecting linguistic rules for pattern classification problems. Fuzzy Sets Syst., 89(2):135–150 (1997).
[27] H. Ishibuchi and Y. Nojima. Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int. J. Approx. Reasoning, 44(1):4–31 (2007).
[28] H. Ishibuchi and T. Yamamoto. Fuzzy rule selection by multi-objective genetic local search algorithms and rule evaluation measures in data mining. Fuzzy Sets Syst., 141(1):59–88 (2004).
[29] H. Ishibuchi and T. Yamamoto. Rule weight specification in fuzzy rule-based classification systems. IEEE Trans. Fuzzy Syst., 13(4):428–435 (2005).
[30] N. Karnik and J. Mendel. Operations on type-2 fuzzy sets. Fuzzy Sets Syst., 122(2):327–348 (2001).
[31] N. Kasabov. Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, MIT Press (1996).
[32] L. Kuncheva. How good are fuzzy if-then classifiers? IEEE Trans. Syst. Man Cybern. Part B: Cybern., 30(4):501–509 (2000).


[33] H.-M. Lee, C-M. Chen, J.-M. Chen and Y.-L. Jou. An efficient fuzzy classifier with feature selection based on fuzzy entropy. IEEE Trans. Syst. Man Cybern., 31:426–432 (2001).
[34] D. Leite, L. Decker, M. Santana and P. Souza. EGFC: evolving Gaussian fuzzy classifier from never-ending semi-supervised data streams with application to power quality disturbance detection and classification. In 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–9 (2020).
[35] E. Lughofer. Evolving Fuzzy Systems – Methodologies, Advanced Concepts and Applications (Studies in Fuzziness and Soft Computing), Springer (2011).
[36] J. M. Mendel. On KM algorithms for solving type-2 fuzzy set problems. IEEE Trans. Fuzzy Syst., 21(3):426–446 (2013).
[37] R. Mikut, J. Jäkel and L. Gröll. Interpretability issues in data-based learning of fuzzy systems. Fuzzy Sets Syst., 150(2):179–197 (2005).
[38] K. Narukawa, Y. Nojima and H. Ishibuchi. Modification of evolutionary multiobjective optimization algorithms for multiobjective design of fuzzy rule-based classification systems. In The 14th IEEE International Conference on Fuzzy Systems, pp. 809–814 (2005).
[39] D. Nauck. Measuring interpretability in rule-based classification systems. In The 12th IEEE International Conference on Fuzzy Systems, vol. 1, pp. 196–201 (2003).
[40] W. Pedrycz and F. Gomide. Introduction to Fuzzy sets: Analysis and Design, MIT Press (1998).
[41] M. Pratama, J. Lu, E. Lughofer, G. Zhang and S. Anavatti. Scaffolding type-2 classifier for incremental learning under concept drifts. Neurocomputing, 191:304–329 (2016).
[42] M. Pratama, J. Lu and G. Zhang. Evolving type-2 fuzzy classifier. IEEE Trans. Fuzzy Syst., 24(3):574–589 (2016).
[43] C. Rani and S. N. Deepa. Design of optimal fuzzy classifier system using particle swarm optimization. In Innovative Computing Technologies (ICICT), 2010 International Conference on, pp. 1–6 (2010).
[44] J. Saade and H. Diab. Defuzzification techniques for fuzzy controllers. IEEE Trans. Syst. Man Cybern. Part B: Cybern., 30(1):223–229 (2000).
[45] H. Scarpelli, W. Pedrycz and F. Gomide. Quantification of inconsistencies in fuzzy knowledge bases. In The 2nd European Congress on Intelligent Techniques and Soft Computing, pp. 1456–1461 (1994).
[46] E. Schmitt, V. Bombardier and L. Wendling. Improving fuzzy rule classifier by extracting suitable features from capacities with respect to the Choquet integral. IEEE Trans. Syst. Man Cybern. Part B: Cybern., 38(5):1195–1206 (2008).
[47] H. Shahparast and E. G. Mansoori. Developing an online general type-2 fuzzy classifier using evolving type-1 rules. Int. J. Approximate Reasoning, 113:336–353 (2019).
[48] Q. Shen and A. Chouchoulas. A rough-fuzzy approach for generating classification rules. Pattern Recognit., 35(11):2425–2438 (2002).
[49] E. Soares, P. Angelov, B. Costa and M. Castro. Actively semi-supervised deep rule-based classifier applied to adverse driving scenarios. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2019).
[50] L. Sánchez, M. R. Suárez, J. R. Villar and I. Couso. Mutual information-based feature selection and partition design in fuzzy rule-based classifiers from vague data. Int. J. Approximate Reasoning, 49(3):607–622 (2008).
[51] W. Tan, C. Foo and T. Chua. Type-2 fuzzy system for ECG arrhythmic classification. In FUZZ-IEEE, pp. 1–6 (2007).
[52] S. R. Thiruvenkadam, S. Arcot and Y. Chen. A PDE based method for fuzzy classification of medical images. In The 2006 IEEE International Conference on Image Processing, pp. 1805–1808 (2006).
[53] D. Tikk, T. Gedeon and K. Wong. A feature ranking algorithm for fuzzy modelling problems. In Interpretability Issues in Fuzzy Modeling, pp. 176–192, Springer-Verlag (2003).


[54] R. Toscano and P. Lyonnet. Diagnosis of the industrial systems by fuzzy classification. ISA Trans., 42(2):327–335 (2003).
[55] G. Tsekouras, H. Sarimveis, E. Kavakli and G. Bafas. A hierarchical fuzzy-clustering approach to fuzzy modeling. Fuzzy Sets Syst., 150(2):245–266 (2005).
[56] V. Vanhoucke and R. Silipo. Interpretability in multidimensional classification. In Interpretability Issues in Fuzzy Modeling, pp. 193–217, Springer-Verlag (2003).
[57] H. Vernieuwe, B. De Baets and N. Verhoest. Comparison of clustering algorithms in the identification of Takagi-Sugeno models: a hydrological case study. Fuzzy Sets Syst., 157(21):2876–2896 (2006).
[58] D. Wu and J. Mendel. A vector similarity measure for linguistic approximation: interval type-2 and type-1 fuzzy sets. Inf. Sci., 178(2):381–402 (2008).
[59] R. Yager and H. Larsen. On discovering partial inconsistencies in validating uncertain knowledge bases by reflecting on the input. IEEE Trans. Syst. Man Cybern. Part B: Cybern., 21(4):790–801 (1991).
[60] R. R. Yager and D. P. Filev. Approximate clustering via the mountain method. IEEE Trans. Syst. Man Cybern., 24(8):1279–1284 (1994).
[61] S. Zhou and J. Gan. Low-level interpretability and high-level interpretability: a unified view of data-driven interpretable fuzzy system modelling. Fuzzy Sets Syst., 159(23):3091–3131 (2008).


© 2022 World Scientific Publishing Company, https://doi.org/10.1142/9789811247323_0008

Chapter 8

Kernel Models and Support Vector Machines

Denis Kolev∗, Mikhail Suvorov∗, and Dmitry Kangin†

∗Faculty of Computational Mathematics and Cybernetics, Moscow State University, 119991, Moscow, Russia
†Faculty of Informatics and Control Systems, Bauman Moscow State Technical University, 105005, Moscow, Russia
{dkolev,msuvorov}@msdk-research.com, [email protected]

8.1. Introduction

Nowadays, the amount of information produced by different sources is steadily growing. For this reason, statistical data analysis and pattern recognition algorithms have become increasingly pervasive and highly demanded [1, 2]. The most studied and widely applied approaches are based on parametrically linear models. They combine learning and evaluation simplicity with a well-grounded theoretical basis. On the other hand, linear methods impose strong restrictions on the data structure model, significantly reducing performance in some cases. Nonlinear methods provide more flexible models, but they usually suffer from very complex, time-demanding learning procedures and possible overfitting. In this chapter, we discuss a framework for the parametrically linear methods that determine nonlinear models: namely, kernel methods. These methods preserve the theoretical basis of linear models, while providing more flexibility at the same time. This framework extends the existing machine learning methods.


Support vector machines (SVMs) play a significant role in the scope of linear and kernel methods. For several reasons, SVMs are one of the most popular approaches for regression and classification. The approach is theoretically well-grounded and well studied, which provides the capability for adequate adaptation of the method. It is known that SVMs are universal approximators under some specific conditions. Also, due to model sparseness (described in due course), SVMs are widely used for the processing of streaming/large data. SVM classifiers have been rated as one of the best-performing types of machine learning algorithms [3]. A number of problem-specific techniques are based on “SVM-driven” ideas. In this chapter, we discuss the SVM and its extensions, focusing on their relationship with kernel methods. The rest of the chapter is structured as follows. In the second section, we discuss the theoretical basis of kernel methods and offer illustrative examples. In the third section, we list the main kernel machines and explain their principles. The fourth section focuses on the SVM and its extensions.

8.2. Basic Concepts and Theory

Here, we present a brief overview of the basic concepts and theorems about kernels and kernel models. We will use the following notation hereafter, unless stated otherwise.

$X \subset \mathbb{R}^n$: set of objects/vectors, Universum.
$n$: dimensionality of the object space.
$Y \subset \mathbb{R}^m$: set of responses.
$m$: dimensionality of responses.
$X_N = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$: sample, where $\mathbf{x}_i \in X$ is a sample object/vector and $y_i \in Y$ is the response to $\mathbf{x}_i$, $i = 1, \dots, N$.
$H$: Hilbert space of functions or reproducing kernel Hilbert space.
$K(\mathbf{x}, \mathbf{t})$: kernel function defined on $X \times X$.

8.2.1. Regression and Classification Problems

In many real-world problems, the classical approach of building a mathematical model of the relationship between inputs and outputs turns out to be unsound. Complicated dependencies, inputs of complex structures, and noise are all reasons that hinder mathematical modeling. In fact, developing a strict mathematical model is often infeasible or comes with strong performance and application constraints, which may limit model scalability. The alternative consists of information models that allow for the restoration of rather complex input–output relationships, as well as dealing with noise and outliers. In this chapter, we focus on regression and classification problems [1] and the use of kernel approaches to solve them.


The inputs and outputs of real-world problems have diverse and complicated structures. However, in many cases, they can be quantified. Hereafter we will consider sets of data objects as subsets of $\mathbb{R}^n$. Furthermore, we state the problem of recognition as follows: Consider a subset $X \subseteq \mathbb{R}^n$. We refer to its elements as objects. The coordinates $x_i$ of an object $\mathbf{x} \in X$ are called features and are marked by a Roman index $i$. Each object $\mathbf{x}$ is associated with a vector $\mathbf{y} \in Y \subseteq \mathbb{R}^m$, which is regarded as a characteristic of objects in $X$. We assume that there exists a law $f$ that governs the correspondence between objects and their characteristics. So, $f$ can be regarded as a map $X \to Y$ and

$$\mathbf{y} = f(\mathbf{x}), \quad \mathbf{x} \in X, \ \mathbf{y} \in Y.$$

We have only some assumptions about this law/mapping and a training set. The concept of a training set is critical in recognition theory because it is the main source for obtaining $f$ or, at least, its approximation $\hat{f}$. Usually, we assume that the mapping $\hat{f}$ belongs to a certain family of maps $F \subseteq \{g : X \to Y\}$. Then, the training set is used to choose one mapping from this family. More formally, a training set refers to a set $X_N$ of pairs $(\mathbf{x}_i, \mathbf{y}_i)$:

$$X_N = \{(\mathbf{x}_i, \mathbf{y}_i) \mid \mathbf{y}_i = f(\mathbf{x}_i), \ i = 1, \dots, N\}.$$

On this basis, the learning algorithm can be regarded as a mapping $\mu$, so that

$$\hat{f} = \mu(X_N), \quad \hat{f} \in F.$$

When $Y$ is a finite set such that $Y = \{L_1, \dots, L_R\}$, the problem is known as a classification problem. All elements $L_r$ in $Y$ are called labels. When $Y \subseteq \mathbb{R}$ and $Y$ contains infinitely many elements, the problem is known as a regression problem.

8.2.2. Support Vector Machines: Classification Problem and Dividing Hyperplanes

To illustrate the recognition problem discussed above, we give a simple example of a classification problem inspired by Vapnik and Chervonenkis [4, 5]. Let $X = \mathbb{R}^2$ and $Y = \{-1, 1\}$. Then, objects can be regarded as two-dimensional points represented by either red squares ($y = -1$) or green circles ($y = 1$). Following Vapnik and Chervonenkis, we initially consider the simple case of a linearly separable training set, as presented in Figure 8.1. Given that the points of different classes in the training set are linearly separable, the goal is to find a hyperplane $H$ that divides $X$ into two parts: $H_+$ and $H_-$. A hyperplane is determined by its normal vector $\mathbf{w}$ and an intercept $\rho$. A vector $\mathbf{x}$ belongs to the hyperplane if and only if

$$\langle \mathbf{w}, \mathbf{x} \rangle - \rho = 0.$$

Figure 8.1: Linearly separable case.

The hyperplane can be chosen in different ways. Vapnik and Chervonenkis proposed a procedure for selecting an optimal hyperplane. They formalized optimality using the notion of a separating band, which is the space between two hyperplanes parallel to $H$ that contains no points of the training set $X_N$. They showed that the size of this band equals (for proof, refer to Section 8.4.1)

$$\frac{2}{\|\mathbf{w}\|}.$$

Then, they stated the problem of quadratic programming as follows:

$$\|\mathbf{w}\|^2 \to \min_{\mathbf{w}}$$

$$\text{subject to } y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle - \rho) \ge 1, \quad i = 1, \dots, N.$$

This can be generalized to the case of a linearly nonseparable training set by introducing so-called slack variables $\xi_i \ge 0$, which allow the constraint to be violated:

$$y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle - \rho) \ge 1 - \xi_i.$$


So, the final version of the problem becomes

$$\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \to \min_{\mathbf{w}, \xi}$$

$$\text{subject to } y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle - \rho) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N.$$

Formally speaking, in the case of a linearly nonseparable sample, the set of optimization problem constraints defines an empty set. The introduction of slack variables prevents such a situation, allowing misclassifications, which are penalized by the functional under optimization.
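To make the role of the slack variables concrete, the sketch below evaluates the soft-margin objective for a fixed hyperplane on a tiny hand-made data set. For a given $\mathbf{w}$ and $\rho$, the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle - \rho))$. The data, the value of `C`, and the function names are illustrative assumptions; no optimization is performed.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_margin_objective(w, rho, X, y, C):
    """Return (objective, slacks) for the hyperplane <w, x> - rho = 0.
    Each slack is the smallest value satisfying the relaxed constraint."""
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) - rho)) for xi, yi in zip(X, y)]
    return dot(w, w) + C * sum(slacks), slacks


def band_width(w):
    """Width 2 / ||w|| of the separating band."""
    return 2.0 / math.sqrt(dot(w, w))


# Toy sample: the third point lies on the wrong side of the hyperplane.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, -1]
objective, slacks = soft_margin_objective((1.0, 0.0), 0.0, X, y, C=2.0)
print(objective, slacks, band_width((1.0, 0.0)))
```

Only the misclassified point receives a nonzero slack, and the constant `C` sets how strongly that violation is weighed against the width of the band.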

8.2.3. SVM Solver: Toward Nonlinearity and Polynomial Kernel

In this section, we do not investigate the main properties of the problem solution; we instead focus on the techniques that introduce nonlinearity into the model, describing briefly the background theory and the most vital theorems. Consider the solution to the optimization problem stated by a two-class SVM classifier. Earlier, we drew attention to a convex quadratic programming problem that has a single solution. Usually, the SVM model is investigated from the standpoint of the dual optimization problem [6], which is formulated as follows:

$$-\frac{1}{2} \alpha Q \alpha^T + \sum_{i=1}^{N} \alpha_i \to \max_{\alpha}$$

$$\text{subject to } 0 \le \alpha_i \le C, \ \forall i = 1, \dots, N, \quad \sum_{i=1}^{N} \alpha_i y_i = 0.$$

Here, the $N$-dimensional vector $\alpha \in \mathbb{R}^N$ represents the vector of Lagrange multipliers corresponding to each constraint in the primal problem. The matrix $Q \in \mathbb{R}^{N \times N}$ is a matrix of “class-corresponding” inner products with elements $Q_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle y_i y_j$. According to the differential Karush–Kuhn–Tucker conditions of the primal problem (which will be described in due course), the normal vector of the resulting hyperplane can be expressed as

$$\mathbf{w} = \sum_{i=1}^{N} y_i \alpha_i \mathbf{x}_i.$$

Therefore, the class label of an arbitrary vector $\mathbf{x}$ can be evaluated as

$$y(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle - \rho\right).$$
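The equivalence between the primal form of the classifier (through $\mathbf{w}$) and its dual form (through inner products only) can be checked numerically. In the sketch below the multipliers are hand-picked so that $\sum_i \alpha_i y_i = 0$; they are not the output of a solver. The function names and the convention that a zero score maps to $+1$ are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(v):
    # Assumption: ties (v == 0) are assigned to the positive class.
    return 1 if v >= 0 else -1


def normal_vector(alpha, X, y):
    """w = sum_i y_i * alpha_i * x_i."""
    dim = len(X[0])
    return [sum(a * yi * xi[k] for a, xi, yi in zip(alpha, X, y))
            for k in range(dim)]


def classify_primal(w, rho, x):
    return sign(dot(w, x) - rho)


def classify_dual(alpha, X, y, rho, x):
    """Uses only inner products between training vectors and x."""
    return sign(sum(a * yi * dot(xi, x) for a, xi, yi in zip(alpha, X, y)) - rho)


X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = [0.25, 0.25]   # hand-picked; satisfies sum(alpha_i * y_i) = 0
rho = 0.0
w = normal_vector(alpha, X, y)

for x in [(2.0, 0.0), (-3.0, 1.0), (0.1, 0.2)]:
    print(x, classify_primal(w, rho, x), classify_dual(alpha, X, y, rho, x))
```

The dual form never needs $\mathbf{w}$ explicitly, which is the property that makes it possible to swap the inner product for another two-argument function.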

Evidently, the decision rule depends only on the values of the Lagrange multipliers and on the inner products between the vectors from the learning set and the classified vector. This information is sufficient for performing classification. The SVM classification model discussed here is linear, which means that it performs adequately only when the classes in the learning set (and in the Universum) are linearly separable. However, this limitation can be overcome. In particular, many specially developed techniques can be applied, including feature engineering, mixture of experts, and others. There is a special approach for the SVM model (as well as for a wide range of other machine learning methods) that retains the overall model structure of the linear case but allows the fitting of nonlinear separation surfaces. This approach is referred to as “mapping to linearization space” or just the “kernel trick” [7]. Returning to the dual optimization problem and the final discrimination function, we note that the optimization problem depends only on pairwise inner products in the learning set. The final classification rule also depends only on the inner products between vectors in the learning set and the classified vector. This observation leads to the described approach. The idea is to replace the standard inner product with some two-argument nonlinear function [8], so that

$$\langle \mathbf{x}, \mathbf{t} \rangle \to K(\mathbf{x}, \mathbf{t}), \quad Q_{ij} = K(\mathbf{x}_i, \mathbf{x}_j) y_i y_j,$$

$$y(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) - \rho\right).$$

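A simple example can be provided. Suppose that the replacement is $\langle \mathbf{x}, \mathbf{t} \rangle \to (\langle \mathbf{x}, \mathbf{t} \rangle + 1)^2$. Then, the decision rule is of a second-order polynomial type, i.e.,

$$y(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle^2 + 2 \sum_{i=1}^{N} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle + \sum_{i=1}^{N} \alpha_i y_i - \rho\right).$$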
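For two-dimensional inputs, the replacement $(\langle \mathbf{x}, \mathbf{t} \rangle + 1)^2$ coincides with an ordinary inner product after an explicit mapping into a six-dimensional space. The sketch below checks this identity numerically; the map `phi` is one standard choice written out for illustration and is not given in the text.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, t):
    """Second-order polynomial replacement (<x, t> + 1)^2."""
    return (dot(x, t) + 1.0) ** 2


def phi(x):
    """Explicit feature map for 2-D inputs such that
    <phi(x), phi(t)> = (<x, t> + 1)^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)


x, t = (1.0, 2.0), (3.0, -1.0)
print(poly2_kernel(x, t), dot(phi(x), phi(t)))
```

Expanding the product term by term gives $x_1^2 t_1^2 + x_2^2 t_2^2 + 2 x_1 x_2 t_1 t_2 + 2 x_1 t_1 + 2 x_2 t_2 + 1$, which is exactly $(x_1 t_1 + x_2 t_2 + 1)^2$, so a linear classifier in the mapped space is a second-order polynomial classifier in the original space.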


However, two important questions should be answered to ensure the proper use of such an approach:
1) What function $K(\mathbf{x}, \mathbf{t})$ can be used to replace the inner product?
2) What is the set of possible discrimination surfaces that can be fitted using a given $K(\mathbf{x}, \mathbf{t})$?

8.2.4. Kernel Function Discussion

The kernel trick can be defined in different ways. Consider a function $\Phi$ that maps the data from the initial space $\mathbb{R}^n$ to some Hilbert space $H$, $\Phi: \mathbb{R}^n \to H$. The inner product in the space $H$ is denoted by $\langle \cdot, \cdot \rangle_H$. In the general case, the space $H$ can be infinite-dimensional. A classification model is learned in the space $H$ using the mapped data as a learning set, $\Phi(X_N) = \{\Phi(\mathbf{x}_1), \dots, \Phi(\mathbf{x}_N)\}$. It is supposed that for all $\mathbf{x}, \mathbf{t} \in \mathbb{R}^n$, the inner product for the mapped vectors is evaluated as $\langle \Phi(\mathbf{x}), \Phi(\mathbf{t}) \rangle_H = K(\mathbf{x}, \mathbf{t})$, where $K(\mathbf{x}, \mathbf{t})$ is a symmetric two-argument function, in general nonlinear, defined for vectors in $\mathbb{R}^n$. $\Phi(\mathbf{x})$ can be regarded as a new feature map, which means that the kernel trick is similar to feature engineering. Therefore, the final classification model can be expressed as

$$y(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}) \rangle_H - \rho\right) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) - \rho\right).$$

The kernel trick assists in building a linear classification model in a different space, resulting in a generally nonlinear classification model in the initial data space. For this reason, the function $K(\mathbf{x}, \mathbf{t})$ must be an inner product of $\Phi(\mathbf{x})$ and $\Phi(\mathbf{t})$ in some Hilbert space for all $\mathbf{x}, \mathbf{t} \in \mathbb{R}^n$. Such functions are usually referred to as “kernels” (this is not a strict definition) and there are many examples of such functions:
• Linear kernel $K(\mathbf{x}, \mathbf{t}) = \langle \mathbf{x}, \mathbf{t} \rangle$,
• Polynomial kernel $K(\mathbf{x}, \mathbf{t}) = \langle \mathbf{x}, \mathbf{t} \rangle^d$ or $K(\mathbf{x}, \mathbf{t}) = (\langle \mathbf{x}, \mathbf{t} \rangle + \beta)^d$,
• Exponential kernel $K(\mathbf{x}, \mathbf{t}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{t}\|^2}{2\sigma^2}\right)$,
• Student-type RBF kernel $K(\mathbf{x}, \mathbf{t}) = (\beta + \gamma \|\mathbf{x} - \mathbf{t}\|^2)^{-\alpha}$, $\beta > 0$, $\gamma > 0$, $\alpha > 0$.

Regarding the breadth of the set of discriminative functions that can be learned using a given kernel function, it is essential to point out that the classification rule is based on a linear combination of kernel functions with one fixed input argument. Therefore, the set of functions that can be obtained is limited by the linear spectrum (span) of the kernel:

$$\operatorname{span}(K) = \left\{ f: \mathbb{R}^n \to \mathbb{R} \;\middle|\; \exists\, \mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^n : f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}, \mathbf{x}_i) \right\}.$$
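The kernels listed above, and an element of the span of a kernel, can be written down directly. The following is a minimal sketch: the default parameter values, the centers, the coefficients, and the helper name `span_element` are arbitrary choices made for illustration.

```python
import math


def dot(x, t):
    return sum(a * b for a, b in zip(x, t))


def sq_dist(x, t):
    return sum((a - b) ** 2 for a, b in zip(x, t))


def linear_kernel(x, t):
    return dot(x, t)


def polynomial_kernel(x, t, d=2, beta=1.0):
    return (dot(x, t) + beta) ** d


def exponential_kernel(x, t, sigma=1.0):
    return math.exp(-sq_dist(x, t) / (2.0 * sigma ** 2))


def student_kernel(x, t, alpha=1.0, beta=1.0, gamma=1.0):
    return (beta + gamma * sq_dist(x, t)) ** (-alpha)


def span_element(kernel, centers, alphas):
    """Build f(x) = sum_i alpha_i * K(x, x_i) for fixed centers x_i."""
    def f(x):
        return sum(a * kernel(x, c) for a, c in zip(alphas, centers))
    return f


f = span_element(exponential_kernel, [(0.0, 0.0), (1.0, 0.0)], [1.0, -0.5])
print(f((0.0, 0.0)))  # equals 1.0 - 0.5 * exp(-0.5)
```

Every decision function produced by a kernel SVM has this form, with the centers drawn from the training set and the coefficients given by $\alpha_i y_i$.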

We further investigate the underlying theory of the kernel trick in detail, including the criteria for a function to be a kernel and the importance of the spectrum of kernels. We conclude Section 8.2 with the representer theorem, showing the significance of kernel theory for machine learning.

8.2.5. Kernel Definition via Matrix

The above discussion leads us to the following definition of a kernel function.

Definition 8.1 ([9]). A function $K: X \times X \to \mathbb{R}$ is known as a positive definite kernel if for any finite subset $X_N = \{\mathbf{x}_1, \dots, \mathbf{x}_N\} \subset X$ the corresponding matrix

$$K(X_N, X_N) = \begin{bmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}$$

is symmetric, i.e., $K(X_N, X_N) = K(X_N, X_N)^T$, and non-negative definite. The latter means that for any $\mathbf{t} \in \mathbb{R}^N$,

$$\langle K(X_N, X_N)\mathbf{t}, \mathbf{t} \rangle \ge 0.$$

It follows directly from the definition that, for all $\mathbf{x}, \mathbf{t} \in X$,

$$K(\mathbf{x}, \mathbf{x}) \ge 0, \quad K(\mathbf{x}, \mathbf{t}) = K(\mathbf{t}, \mathbf{x}).$$

It should be pointed out that this fact provides an important link to the kernel SVM: a non-negative definite matrix $K(X_N, X_N)$ ensures the concavity of the dual problem, which is essential for optimization.
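The two conditions of Definition 8.1 can be probed numerically on a sample of points. The sketch below builds the kernel matrix for the exponential kernel, checks symmetry, and evaluates the quadratic form $\langle K\mathbf{t}, \mathbf{t} \rangle$ for random vectors $\mathbf{t}$. Such a random test can only reveal a violation; it is not a proof of non-negative definiteness. The sample size, random seed, and tolerance are arbitrary assumptions.

```python
import math
import random


def exponential_kernel(x, t, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, t))
    return math.exp(-sq / (2.0 * sigma ** 2))


def gram_matrix(kernel, points):
    """Matrix K(X_N, X_N) with entries K(x_i, x_j)."""
    return [[kernel(a, b) for b in points] for a in points]


def is_symmetric(K, tol=1e-12):
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol
               for i in range(n) for j in range(n))


def quadratic_form(K, t):
    """<K t, t> = sum_ij t_i K_ij t_j."""
    n = len(K)
    return sum(t[i] * K[i][j] * t[j] for i in range(n) for j in range(n))


random.seed(0)
points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(6)]
K = gram_matrix(exponential_kernel, points)
trials = [[random.gauss(0, 1) for _ in points] for _ in range(200)]
min_form = min(quadratic_form(K, t) for t in trials)
print(is_symmetric(K), min_form)
```

A small negative tolerance is advisable when testing `min_form`, since floating-point rounding can push a theoretically non-negative value slightly below zero.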

8.2.6. Reproducing Kernel (RKHS) via Linear Operator

Up to this point, we have introduced the definition of a positive definite kernel. Now we consider one of the most important entities in the theory of reproducing kernels: the reproducing kernel Hilbert space (RKHS) [10]. Let $X$ be a non-empty subset of $\mathbb{R}^n$, and let $H$ be a Hilbert space of functions over $X$, $f : X \to \mathbb{R}$.

Definition 8.2. A function $K : X \times X \to \mathbb{R}$ is known as a reproducing kernel of $H$ if and only if
1. $\forall x \in X$: $K(\cdot, x) \in H$,
2. $\forall f \in H$: $\langle f, K(\cdot, x) \rangle_H = f(x)$.

The function $K(\cdot, x)$ is called a reproducing kernel for point $x$. Here, $\langle \cdot, \cdot \rangle_H$ is an arbitrarily chosen function that satisfies the definition of an inner product. From the definition, it follows that the reproducing kernel is a symmetric function:

$$K(x, t) = \langle K(\cdot, t), K(\cdot, x) \rangle_H = \{\text{property of the real-valued inner product}\} = \langle K(\cdot, x), K(\cdot, t) \rangle_H = K(t, x).$$

The reproducing property may be defined in a different way. Consider a functional $L_x : H \to \mathbb{R}$ such that $\forall f \in H$: $L_x f = f(x)$. The “evaluation functional” $L_x$ can be defined for any $x \in X$. It is easy to show that $L_x$ is a linear functional. Using this notation, we introduce another definition of an RKHS.

Definition 8.3. A Hilbert space of functions $H$ is an RKHS if and only if for every $x \in X$ the evaluation functional $L_x$ is bounded. That is to say,

$$\forall x \in X \;\; \exists C_x : \forall f \in H : \; |L_x f| \le C_x \|f\|_H = C_x \sqrt{\langle f, f \rangle_H}.$$

This means that if two functions $f$ and $g$ are close in terms of the RKHS norm, they are close at every point:

$$|f(x) - g(x)| = |L_x(f - g)| \le C_x \|f - g\|_H.$$

Now, we can state a theorem that links the two definitions.

Theorem 8.1. A Hilbert space $H$ is an RKHS $\Leftrightarrow$ it has a reproducing kernel.

According to the Riesz–Fréchet theorem, for every bounded linear functional $A : H \to \mathbb{R}$ there exists an element $\theta \in H$ such that $\forall f \in H$: $A f = \langle f, \theta \rangle_H$. In our case, $A \equiv L_x$. Therefore, one direction of the theorem ($\Rightarrow$) is a special case of the Riesz–Fréchet theorem. The other direction ($\Leftarrow$) can be proved using the Cauchy–Schwarz inequality.

8.2.7. Moore–Aronszajn Correspondence Theorem

The problem of selecting reproducing kernels must also be addressed. The Moore–Aronszajn theorem builds a correspondence between positive definite kernels and RKHSs.

Theorem 8.2 (Moore–Aronszajn [10]). Suppose $K : X \times X \to \mathbb{R}$ is a positive definite kernel. Then there exists a unique Hilbert space $H$ of functions over $X$ for which $K$ is a reproducing kernel.

Therefore, for each positive definite kernel, there is a unique corresponding RKHS.

8.2.8. Mercer’s Theorem

We further investigate the properties of positive definite kernels in the RKHS. For this purpose, we introduce a strictly positive Borel measure $\mu$ on $X$. We define $L^2_\mu(X)$ as the linear space of functions that satisfy

$$f \in L^2_\mu(X) : \int_X |f(x)|^2 \, d\mu(x) < \infty.$$

A symmetric positive definite kernel $K : X \times X \to \mathbb{R}$ that is square-integrable,

$$\int_X \int_X K^2(x, t) \, d\mu(x) \, d\mu(t) < \infty,$$

induces [11] a linear operator $L_K : L^2_\mu(X) \to L^2_\mu(X)$, which is defined by

$$L_K f(x) = \int_X K(x, t) f(t) \, d\mu(t).$$

It possesses a countable system of eigenfunctions $\{\phi_k\}_{k=1}^{\infty}$ forming an orthonormal basis of $L^2_\mu(X)$, with corresponding eigenvalues $\{\lambda_k\}_{k=1}^{\infty}$.

Theorem 8.3 (Mercer [12]). Let $X \subset \mathbb{R}^n$ be closed, let $\mu$ be a strictly positive Borel measure on $X$ (meaning that the measure of every non-empty open subset of $X$ is positive), and let $K$ be a square-integrable continuous function on $X \times X$ that is a positive definite kernel. Then

$$K(x, t) = \sum_{k=1}^{\infty} \lambda_k \phi_k(x) \phi_k(t).$$

This series converges absolutely for each pair $(x, t) \in X \times X$ and uniformly on each compact subset of $X$.

This theorem enables the building of a map $\Phi : X \to \ell^2$, where $\ell^2$ is the Hilbert space of square-summable sequences, such that $\Phi(x)_k = \sqrt{\lambda_k}\, \phi_k(x)$. It follows that the kernel $K$ can be regarded as a scalar product in $\ell^2$:

$$K(x, t) = \langle \Phi(x), \Phi(t) \rangle_{\ell^2} = \sum_{k=1}^{\infty} \Phi(x)_k \Phi(t)_k = \sum_{k=1}^{\infty} \lambda_k \phi_k(x) \phi_k(t).$$

As the kernel $K$ is supposed to be continuous, the feature map $\Phi$ is also continuous. Returning to the SVM example, Mercer’s theorem explains the correctness of using positive definite kernels. Any square-integrable positive definite kernel can be considered an inner product between the maps of two vectors in some Hilbert space (namely $\ell^2$), replacing the inner product in the initial space $\mathbb{R}^n$. $\Phi(x)$ can be regarded as a feature map to the transformed feature space, $\Phi : X \to \ell^2$. Thus, the SVM builds a linear classifier in the $\ell^2$ space.

8.2.9. Polynomial Kernel Example

To illustrate the theoretical points raised so far, we return to the example of a polynomial kernel. This will build a link between the RKHS reproducing kernel, the positive definite kernel, and the kernel trick. Consider the problem of fitting a linear regression in a two-dimensional space. For the training set $X_N = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $x_i \in \mathbb{R}^2$, $y_i \in \mathbb{R}$, the goal is to recover the parameter $w$ so that for all $i = 1, \ldots, N$ the output value of the parametric function $f(x_i, w)$ is close to $y_i$. The word “close” can be understood as a (sub)optimal solution of any regular empirical error measure, such as the
least-squares or SVM-type measure. For the current example, we assume that least-squares optimization is used as the learning procedure, i.e.,

$$\sum_{i=1}^{N} (f(x_i, w) - y_i)^2 \to \min_{w}.$$

We consider the function $f(x, w)$ to be linear:

$$f(x, w) = \langle x, w \rangle = \sum_{i=1}^{2} x_i w_i, \quad w \in \mathbb{R}^2,$$

where, in this sum, $x_i$ and $w_i$ denote the $i$-th coordinates of $x$ and $w$. Suppose that the kernel trick is performed, meaning that the regression is transformed into a second-order polynomial one as follows:

$$\langle x, t \rangle_{\mathbb{R}^2} \to K(x, t) = \langle x, t \rangle_{\mathbb{R}^2}^2 = \langle \Phi(x), \Phi(t) \rangle_H.$$

Then, the feature map $\Phi(x)$ is defined as

$$\Phi : x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix} \in \mathbb{R}^3.$$

Thus, $H$ may be understood as $\mathbb{R}^3$. The inner product in this space is evaluated as

$$\langle x, t \rangle_{\mathbb{R}^2}^2 = \langle \Phi(x), \Phi(t) \rangle_{\mathbb{R}^3} = x_1^2 t_1^2 + x_2^2 t_2^2 + 2 x_1 x_2 t_1 t_2.$$

In this case, the parameter $w \in \mathbb{R}^3$ grows in dimensionality. However, the function model still depends linearly on $w$, which usually simplifies the learning procedure. The model function is

$$f(x, w) = \sum_{i=1}^{3} w_i \Phi(x)_i = w_1 x_1^2 + w_2 x_2^2 + w_3 \sqrt{2}\, x_1 x_2,$$

where $\Phi(x)_i$ denotes the $i$-th coordinate of the transformed feature map. If we consider $f(x, w)$ as a member of the Hilbert space of second-order polynomials over $\mathbb{R}^2$, $P^2(\mathbb{R}^2)$, then every function in $P^2(\mathbb{R}^2)$ can be specified by its three coefficients, i.e.,

$$g(u) = a u_1^2 + b u_2^2 + c u_1 u_2 \longleftrightarrow \begin{bmatrix} a \\ b \\ c/\sqrt{2} \end{bmatrix}.$$

Here, $\longleftrightarrow$ denotes a schematic correspondence of the polynomial to vectors from $\mathbb{R}^3$. We define the inner product in $P^2(\mathbb{R}^2)$ as the sum of pairwise multiplications
of the corresponding coefficients,

$$\langle g_1, g_2 \rangle_{P^2(\mathbb{R}^2)} = a_1 a_2 + b_1 b_2 + \frac{c_1 c_2}{2},$$

where $g_i(u) = a_i u_1^2 + b_i u_2^2 + c_i u_1 u_2 \in P^2(\mathbb{R}^2)$. One can see that the polynomial

$$K(u, x) = x_1^2 u_1^2 + x_2^2 u_2^2 + 2 x_1 x_2 u_1 u_2 \longleftrightarrow \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix}, \quad K(\cdot, x) \in P^2(\mathbb{R}^2) \;\; \forall x \in \mathbb{R}^2,$$

is a reproducing kernel of $P^2(\mathbb{R}^2)$. Indeed, $\forall g \in P^2(\mathbb{R}^2)$,

$$\langle g, K(\cdot, x) \rangle_{P^2(\mathbb{R}^2)} = a x_1^2 + b x_2^2 + \frac{2 x_1 x_2\, c}{2} = g(x).$$

Therefore, the mapping $\Phi : \mathbb{R}^2 \to \mathbb{R}^3$ is equivalent to the mapping $\Psi : \mathbb{R}^2 \to P^2(\mathbb{R}^2)$, where $\Psi(x) = K(\cdot, x)$. The inner product between the data mapped by $\Psi$ is evaluated as $\langle \Psi(x), \Psi(t) \rangle_{P^2(\mathbb{R}^2)}$. The mapping $\Psi$ is equivalent to the mapping $\Phi$ because

$$K(x, t) = \langle \Phi(x), \Phi(t) \rangle_{\mathbb{R}^3} = \langle \Psi(x), \Psi(t) \rangle_{P^2(\mathbb{R}^2)} \quad \forall x, t \in \mathbb{R}^2.$$

Therefore, the regression learning procedure is reduced to a search for the polynomial

$$\pi \in P^2(\mathbb{R}^2), \quad \pi \longleftrightarrow \begin{bmatrix} \pi_1 \\ \pi_2 \\ \pi_3/\sqrt{2} \end{bmatrix},$$

such that for all $i = 1, \ldots, N$ the value $\langle \pi, \Psi(x_i) \rangle_{P^2(\mathbb{R}^2)} = \pi_1 x_{i1}^2 + \pi_2 x_{i2}^2 + \pi_3 x_{i1} x_{i2}$ is close to $y_i$. Because $\pi$ is defined by three coefficients, the function model is linear with respect to its parameters. The final optimization (“learning”) procedure is

$$\sum_{i=1}^{N} (\pi_1 x_{i1}^2 + \pi_2 x_{i2}^2 + \pi_3 x_{i1} x_{i2} - y_i)^2 \to \min_{\pi},$$

where $x_{ij}$ corresponds to the $j$-th entry of the $i$-th learning sample.

It is possible to generalize the result provided here. Every kernel trick is equivalent to defining a new type of feature map, where $K$ is a reproducing kernel of the linearization space. However, the fitted parameters are not always included
in the RKHS inner product in a linear way, which may cause problems during optimization. For instance, if we map the data onto the space of all continuous functions, the learning procedure becomes equivalent to the variational optimization over all continuous functions. However, we will further show that this problem can be overcome in most cases and that it is possible to leave the model linear with respect to the optimized parameters.
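The polynomial kernel example of this subsection can be verified numerically. The sketch below checks that $\langle x, t \rangle^2$ equals the inner product of the explicit feature maps and then fits the three coefficients by least squares on synthetic, noise-free data. The data, the coefficient values in `true_w`, and the function names are assumptions made for illustration.

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, t) = <x, t>^2 in two dimensions
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

def poly_kernel(x, t):
    return float(np.dot(x, t)) ** 2

rng = np.random.default_rng(0)
x, t = rng.normal(size=2), rng.normal(size=2)
kernel_value = poly_kernel(x, t)            # computed in R^2
feature_value = float(phi(x) @ phi(t))      # computed in R^3

# Least squares in the feature space: the model is linear in w
X = rng.normal(size=(50, 2))
true_w = np.array([1.0, -2.0, 0.5])         # hypothetical coefficients
F = np.array([phi(row) for row in X])       # N x 3 design matrix
y = F @ true_w
w_hat, *_ = np.linalg.lstsq(F, y, rcond=None)
```

Because the model is linear in `w`, an ordinary least-squares solver is enough even though the fitted function is quadratic in the original coordinates.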

8.2.10. RKHS Approximation

The next question we address is whether an arbitrary dividing surface can be built using kernel $K$. Consider an RKHS $H$, a Hilbert space of real-valued functions over $X \subseteq \mathbb{R}^n$, where $K$ is a reproducing kernel. If $X$ is a compact subset, we can form the space of kernel sections

$$K(X) = \overline{\operatorname{span}}\{K_t, \; t \in X\},$$

which is the closure of the linear span of the functions $K_t(\cdot) = K(\cdot, t)$. It is clear that $K(X) \subseteq H$. However, if we aim to build an arbitrary function (specifying some dividing surface), we need the opposite embedding $H \subseteq K(X)$. In other words, for any positive number $\varepsilon$ and any function $f \in H$, there should exist a function $g \in K(X)$ such that $f$ is well approximated by $g$. That is to say,

$$\|f - g\|_H \le \varepsilon.$$

Returning to the RKHS, it can be shown that $K(X)$ is dense in $H$. This means that any function from the RKHS can be approximated to any given precision $\varepsilon$ by a linear combination of reproducing kernels. Formally speaking [13],

$$\forall \varepsilon > 0, \; \forall f \in H \;\; \exists N > 0, \; x_1, \ldots, x_N \in X, \; \alpha_1, \ldots, \alpha_N \in \mathbb{R} : \quad \left\| f - \sum_{i=1}^{N} \alpha_i K(\cdot, x_i) \right\|_H \le \varepsilon.$$

Thus, if we return to the SVM model, we can claim that the breadth of possible decision function forms is limited by the RKHS of the selected kernel. In practice, most kernels are parametric functions. Thus, it is reasonable to analyze how the corresponding RKHS depends on kernel parameters. Consider a family of parameterized positive definite kernel functions

$$\mathcal{K} = \{K(\cdot, \cdot \mid \theta) : X \times X \to \mathbb{R} \mid \theta \in \Theta\},$$

where $\Theta$ is a set of acceptable values for $\theta$. As an example of such a family, we can take a Gaussian kernel with different variance values. A natural question that arises is whether the corresponding RKHS depends on kernel parameters. In general, the answer is yes, but this is not true in all cases. For instance, a Gaussian
kernel with different variances corresponds to the RKHS of all continuous real-valued functions, which is independent of variance parameters. At the same time, polynomial kernels with different degrees correspond to different RKHSs (spaces of polynomials with different degrees).

8.2.11. Examples of Universal Kernels

Using the notation from the previous subsection, we introduce a property known as the universality of the kernel [13].

Definition 8.4. A positive definite kernel $K : X \times X \to \mathbb{R}$, $X \subseteq \mathbb{R}^n$, is called universal if it is a reproducing kernel of the RKHS $R(X)$, which is the space of all continuous real-valued functions defined on $X$.

Using the previous result, we can state that any continuous real-valued function $f \in R(X)$ can be approximated with any given precision by a function $g \in K(X)$. Therefore, the question that must be addressed is which kernels possess this property of universality. A number of universality criteria exist, but in this work, we present just a few examples of universal kernels and some sufficient conditions for universality.

Definition 8.5. A kernel $K : X \times X \to \mathbb{R}$ is strictly positive definite if it is a positive definite kernel and for any finite subset $X_N = \{x_1, \ldots, x_N\} \subset X$, the corresponding matrix

$$K(X_N, X_N) = \begin{bmatrix} K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & \ddots & \vdots \\ K(x_N, x_1) & \cdots & K(x_N, x_N) \end{bmatrix}$$

is strictly positive definite.

Theorem 8.4. If a positive definite kernel is strictly positive definite, it is a universal kernel.

The following positive definite kernels are examples of universal kernels:

• Exponential kernel: $K(x, t) = \exp\left(-\frac{\|x - t\|^2}{2\sigma^2}\right)$,
• Survival kernel: $K(x, t) = \prod_{i=1}^{n} \min(x_i, t_i)$,
• Student-type RBF kernel: $K(x, t) = (\beta + \gamma \|x - t\|^2)^{-\alpha}$, $\beta > 0$, $\gamma > 0$, $\alpha > 0$.

8.2.12. Why Span? Representer Theorem

In this section, we present a very strong and very general result that is essential for kernel learning and that extends the possibility of applying the kernel trick to diverse machine learning models.

Theorem 8.5 (Representer theorem [14]). Let $X$ be a non-empty set, and suppose that $K$ is a positive definite kernel over $X$ and a reproducing kernel of an RKHS $H$ of functions over $X$. Consider a training sample $X_N = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $x_i \in X$, $y_i \in \mathbb{R}$, $i = 1, \ldots, N$, and a strictly monotonically increasing function $g : [0, +\infty) \to \mathbb{R}$. Then, for any arbitrary empirical risk function $E : (X \times \mathbb{R}^2)^N \to \mathbb{R}$, if $f^*$ is a solution of the optimization problem

$$E((x_1, y_1, f(x_1)), \ldots, (x_N, y_N, f(x_N))) + g(\|f\|_H) \to \min_{f \in H},$$

then $f^*$ has the structure

$$f^*(\cdot) = \sum_{i=1}^{N} \alpha_i K(\cdot, x_i).$$

The summand $g(\|f\|_H)$ in the functional under optimization corresponds to the regularization term that is usually added for noise-stability purposes. The theorem is significant because it presents a parametrically linear model of a generally nonlinear function, which is an optimal solution for a very wide range of machine learning problems. However, the theorem does not present the exact kernel form. It just states that such a reproducing kernel exists. Therefore, the theorem simplifies but does not remove the “model selection” problem.

8.3. Kernel Models and Applications

In this part of the chapter, we discuss kernel models, their properties, and their application to regression, density estimation, dimensionality reduction, filtering, and clustering. Kernels provide an elegant approach for incorporating nonlinearity into linear models, dramatically broadening the area of application for SVMs, linear regression, PCA, k-means, and other model families.

8.3.1. Kernel Density Estimation

One of the common problems in data mining is that of probability density estimation. It aims to recover an unknown probability distribution function based on some given data points that are supposed to be generated from the target distribution.

Formally, let $X_N = \{x_1, \ldots, x_N\}$ be an i.i.d. sample in $\mathbb{R}^n$ drawn from some unknown distribution $p(x)$. The aim is to build an estimate $\hat{p}(x)$ of this distribution. In general, $\hat{p}(x)$ should be a consistent estimate at every point. Such problems appear as sub-tasks in many different practical applications, including classification, clustering, and others. Kernel models are widely used for the estimation of probability density functions (pdf).

Kernel density estimation is closely related to the histogram density estimation approach, also referred to as the histogram method. For one-dimensional data, the estimator is calculated as

$$\hat{p}_h(x) = \frac{1}{N h} \sum_{i=1}^{N} I\left[x - \frac{h}{2} < x_i < x + \frac{h}{2}\right],$$

where $h$ is the bandwidth parameter (or window size) and $I$ denotes the indicator function, i.e.,

$$I[c] = \begin{cases} 1, & \text{if } c \text{ is true}, \\ 0, & \text{if } c \text{ is false}. \end{cases}$$

The estimate is sensitive to the bandwidth parameter, which may cause over-fitting or under-fitting. Examples of estimates for different bandwidths are shown in Figure 8.2. The original pdf is a solid red line, while its estimates are blue dashed lines. A bandwidth that is too small ($h = 0.1$) results in a very rough approximation, whereas a bandwidth that is too large ($h = 2$) leads to over-smoothing. A bandwidth of $h = 0.56$ seems to be a compromise between overly small and overly large windows.

However, in the case of small learning datasets, the histogram approach leads to unacceptable solutions. For this reason, the histogram function is often replaced by a non-negative symmetric smooth function $K(x)$, normalized to 1, i.e., $\int K(x)\,dx = 1$. Then, the estimate becomes

$$\hat{p}_h(x) = \frac{1}{N h} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i),$$

where $K_h(x) = \frac{1}{h} K\left(\frac{x}{h}\right)$ is usually referred to as a scaled kernel. This approach is known as the Parzen–Rosenblatt method. Notice that histogram estimation is a special case of the Parzen–Rosenblatt method [15, 16]. Such a modification results in a smooth function. An example that illustrates the difference between regular histogram windows and Gaussian kernel estimation is shown in Figure 8.3.
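Both estimators of this subsection fit in a few lines of NumPy. The following is a minimal sketch for one-dimensional data; the function names, the Gaussian choice of $K$, and the synthetic sample are assumptions made for illustration.

```python
import numpy as np

def histogram_window_estimate(x, sample, h):
    # p_h(x) = 1/(N h) * #{i : x - h/2 < x_i < x + h/2}
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    inside = np.abs(x[:, None] - sample[None, :]) < h / 2.0
    return inside.sum(axis=1) / (len(sample) * h)

def gaussian_kernel(u):
    # Non-negative, symmetric, integrates to 1
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def parzen_rosenblatt_estimate(x, sample, h, kernel=gaussian_kernel):
    # p_h(x) = 1/(N h) * sum_i K((x - x_i) / h)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    u = (x[:, None] - sample[None, :]) / h
    return kernel(u).sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=200)                  # synthetic data, assumed N(0, 1)
grid = np.linspace(-6.0, 6.0, 1201)
p_hat = parzen_rosenblatt_estimate(grid, sample, h=0.5)
area = float(np.sum(p_hat) * (grid[1] - grid[0]))   # should be close to 1
```

Passing a rectangular window as `kernel` recovers the histogram estimator, which is the sense in which the histogram method is a special case.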

Figure 8.2: Kernel density estimation. (Panels: h = 0.1; h = 0.56 (Silverman rule); h = 2. Legend: pdf, estimate.)

As Figure 8.2 shows, density estimation relies on effective bandwidth parameter selection. Several approaches exist [17–21], but for one-dimensional data practically all of them are based on the result that the window size decreases as $O(N^{-1/5})$. The optimal bandwidth for a kernel $K$ estimating a one-dimensional density $p$ can be determined by minimizing the asymptotic mean integrated squared error (AMISE):

$$h^* = \frac{R(K)^{1/5}}{s(K)^{2/5}\, R\left(\frac{\partial^2 p}{\partial x^2}\right)^{1/5} N^{1/5}},$$

where

$$R(f) = \int f^2(x)\,dx, \qquad s(f) = \int x^2 f(x)\,dx.$$

However, such an estimate cannot be applied directly because it requires prior knowledge about the estimated distribution (in particular, the density $p(x)$). For some distributions, this bandwidth estimate is known. For Gaussian distributions, the following value, known as Silverman’s rule of thumb [7], is commonly used:

$$h^* = \left(\frac{4\hat{\sigma}^5}{3N}\right)^{1/5}.$$

It can be obtained from the asymptotic formula by direct substitution of the parametric variance with its estimate $\hat{\sigma}^2$.

Figure 8.3: Kernel density estimation from 10 points; h = 0.9. (Legend: pdf, histogram estimate, Gaussian kernel estimate.)
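Silverman’s rule is a one-line computation. The sketch below assumes one-dimensional data and uses the unbiased sample standard deviation as $\hat{\sigma}$; that choice and the function name are assumptions, not prescribed by the text.

```python
import numpy as np

def silverman_bandwidth(sample):
    # h* = (4 sigma_hat^5 / (3 N))^(1/5), i.e. about 1.06 * sigma_hat * N^(-1/5)
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    sigma_hat = sample.std(ddof=1)   # assumption: unbiased estimate of sigma
    return float((4.0 * sigma_hat ** 5 / (3.0 * n)) ** 0.2)

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)       # synthetic data for illustration
h_star = silverman_bandwidth(sample)
```

The resulting value can be passed as the bandwidth `h` of any kernel density estimator; it shrinks as $N^{-1/5}$, in line with the asymptotic result above.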

8.3.2. The Nadaraya–Watson Kernel Regression

In 1964, Nadaraya [22] and Watson [23] proposed a method for kernel regression based on a non-parametric regression technique. Given a sample of pairs $\{(x_i, y_i)\}_{i=1}^{N}$, we seek the expected value of $y$ for any value of $x$. Assuming that the pairs $(x_i, y_i)$ come from some distribution $f$ on $X \times Y$, where $x_i \in X$, $y_i \in Y \subseteq \mathbb{R}$, the regression problem is formulated in terms of finding the conditional expectation $\bar{y}$:

$$\bar{y} = E(y \mid x) = \int y f(y \mid x)\,dy = \frac{\int y f(x, y)\,dy}{f(x)}.$$

Kernel density estimation helps us to estimate the joint distribution $f(x, y)$ of two random variables $x$ and $y$ (which can, in fact, be multivariate). This can be expressed as follows:

$$\hat{f}(x, y) = \frac{1}{N h_x h_y} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h_x}\right) K\left(\frac{y - y_i}{h_y}\right).$$

Kernel density estimation also assists in estimating the marginal distribution $f(x)$ as follows:

$$\hat{f}(x) = \frac{1}{N h_x} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h_x}\right).$$

In the preceding equations, $h_x$ and $h_y$ are the corresponding bandwidths of the kernels. In turn, these estimators are used to calculate the conditional expectation

$$\bar{y} = \frac{\int y \hat{f}(x, y)\,dy}{\hat{f}(x)} = \frac{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h_x}\right) y_i}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h_x}\right)},$$

assuming that $\int y K\left(\frac{y - y_i}{h_y}\right) dy = h_y y_i$.

Bandwidth selection relies on the same approaches that were discussed in the previous section. Figure 8.4 illustrates the dependence of the Nadaraya–Watson estimator on the bandwidth parameter $h_x$. Again, an overly large bandwidth leads to over-smoothing, while an overly small bandwidth makes estimates excessively sensitive to outliers.

Regarding kernel regression, it is important to draw attention to a similar approach known as locally weighted regression (LWR) [24]. LWR also exploits kernels but in a different way, namely by weighting examples by a kernel function $K$.

Figure 8.4: Nadaraya–Watson regression. (Legend: function, estimate with h = 0.1, estimate with h = 0.2, sample points.)

Linear regression weights $w$ then become a function of the data, i.e.,

$$w(x) = \arg\min_{w} \sum_{i=1}^{N} K(x, x_i) (\langle w, x_i \rangle - y_i)^2$$

and

$$y(x) = \langle w(x), x \rangle.$$

The linear regression function $\langle w(x), x \rangle$ may be replaced with an arbitrary $f(x, w)$, resulting in more complex models. It has been shown that for uniformly (or regularly) distributed data, LWR is equivalent to kernel regression, while for non-uniform (irregular) data, LWR outperforms kernel regression [24–26].
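The Nadaraya–Watson estimator and locally weighted regression can be compared side by side. The sketch below handles scalar inputs with a Gaussian weighting function. The intercept column added in the LWR fit is an assumption made here so that a local line, rather than a line through the origin, is fitted; the formulas above use $\langle w, x_i \rangle$ without an explicit bias term.

```python
import numpy as np

def gaussian_weights(x, xs, h):
    # Unnormalized Gaussian kernel values K((x - x_i) / h)
    return np.exp(-0.5 * ((x - xs) / h) ** 2)

def nadaraya_watson(x, xs, ys, h):
    # y(x) = sum_i K_i y_i / sum_i K_i
    k = gaussian_weights(x, xs, h)
    return float(np.dot(k, ys) / np.sum(k))

def locally_weighted_regression(x, xs, ys, h):
    # w(x) = argmin_w sum_i K(x, x_i) (<w, [x_i, 1]> - y_i)^2, evaluated at x
    k = gaussian_weights(x, xs, h)
    A = np.column_stack([xs, np.ones_like(xs)])   # intercept column: assumption
    sw = np.sqrt(k)                               # weighted least squares via row scaling
    w, *_ = np.linalg.lstsq(A * sw[:, None], ys * sw, rcond=None)
    return float(w[0] * x + w[1])

xs = np.linspace(-1.0, 1.0, 21)
ys_linear = 2.0 * xs + 1.0                        # synthetic noise-free data
nw_at_zero = nadaraya_watson(0.0, xs, ys_linear, h=0.2)
lwr_at_03 = locally_weighted_regression(0.3, xs, ys_linear, h=0.2)
```

On exactly linear data the local linear fit reproduces the line at any query point, whereas the Nadaraya–Watson estimate is a locally weighted average and is exact here only at points where the weights are symmetric around the query.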

8.3.3. Kernel PCA

Another example of kernel methods is a kernel extension of principal component analysis (PCA). PCA is a feature aggregation technique that aims to find a projection onto an optimal linear subspace, which happens to be the span of eigenvectors of the data matrix. Let $X_N = (x_1, \ldots, x_N) \in \mathbb{R}^{n \times N}$, $x_i \in \mathbb{R}^n$, $i \in \{1, \ldots, N\}$, be a data matrix. PCA finds its decomposition into two matrices of lower rank, $X_N \approx GW$. The columns of $G$ represent an orthonormal basis of a linear $L$-dimensional subspace. In this subspace, $W = (w_1, \ldots, w_N) \in \mathbb{R}^{L \times N}$, where $w_i$ is a representation of the vector $x_i$. The subspace and the orthogonal projection are chosen so that

$$\|X_N - GW\|^2 \to \min_{G, W},$$

which minimizes the reconstruction error. Alternatively, PCA can be seen as a variance maximization problem, so that the projection explains (or keeps) as much variance of the original sample as possible [1].

Initially linear, the PCA approach can be extended easily to nonlinear cases [27]. The solution of PCA is based (e.g., [1]) on the eigenvalue decomposition of the second moment (or covariance) matrix $C = X_N X_N^T = \sum_{i=1}^{N} x_i x_i^T$, which finds eigenvectors $g$ and eigenvalues $\lambda$:

$$C g = X_N X_N^T g = \lambda g.$$

The projection of an arbitrary vector $x$ onto an $L$-dimensional subspace is defined as $\operatorname{proj}(x) = G^T x$. For any kernel $K$, there exists a nonlinear map $\Phi$, and we can rewrite the covariance matrix as

$$C_\Phi = \sum_{i=1}^{N} \Phi(x_i) \Phi(x_i)^T$$

to find its eigenvalue decomposition. This gives us the first $L$ eigenvectors $g_1, \ldots, g_L$ such that $C_\Phi g_l = \lambda_l g_l$. Based on the definition of $C_\Phi$, its eigenvectors are linear combinations of $\Phi(x_i)$, i.e.,

$$g_l = \sum_{i=1}^{N} \alpha_{il} \Phi(x_i).$$

The vectors $\alpha_l$ can be calculated by eigenvalue decomposition of the pairwise inner product matrix $K = K[X_N, X_N]$, $K_{ij} = K(x_i, x_j)$, namely $K \alpha_l = \lambda_l \alpha_l$. By normalizing the eigenvectors so that $\langle g_l, g_l \rangle_H = 1$, we obtain a rule for finding the nonlinear projection of any data vector $x$ on each basis vector:

$$\langle g_l, \Phi(x) \rangle_H = \sum_{i=1}^{N} \alpha_{il} \langle \Phi(x_i), \Phi(x) \rangle_H = \sum_{i=1}^{N} \alpha_{il} K(x_i, x),$$

which yields the $l$-th coordinate of the vector $x$ in the subspace of projection. It is worth noting that we do not have to know the explicit form of the map $\Phi$, just the corresponding kernel $K$, which is used to calculate inner products.

The expression for $C_\Phi$ is tricky because, in the general case, the RKHS corresponding to the mapping $\Phi$ is infinite-dimensional. Thus, the transpose operator is not well defined. Hence, $C_\Phi$ should be understood as a linear operator:

$$C_\Phi z = \sum_{i=1}^{N} \Phi(x_i) \langle \Phi(x_i), z \rangle_H, \quad \forall z \in H.$$

Kernel PCA enables nonlinear dependences inside the data to be identified. This property is illustrated in Figure 8.5. In the original feature space, two-dimensional data points form three concentric circles that are not linearly separable. Introducing Gaussian or polynomial kernels into PCA allows a linearly separable dataset to be obtained.

Figure 8.5: Kernel PCA with different kernels. (Panels: Feature space; Gaussian kernel; Polynomial kernel.)
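The kernel PCA procedure described above can be sketched as follows. The code follows the text literally, that is, it decomposes the uncentered matrix $K$; practical implementations usually center the data in the feature space first, which is not shown. Samples are stored as rows here (the text stores them as columns), and the Gaussian kernel, function names, and synthetic data are assumptions for illustration.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # Matrix of K(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)); rows are samples
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kernel_pca_fit(X, n_components, sigma=1.0):
    # Solve K alpha_l = lambda_l alpha_l and keep the leading eigenvectors
    K = gaussian_gram(X, X, sigma)
    eigvals, eigvecs = np.linalg.eigh(K)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[order], eigvecs[:, order]
    # Normalize so that <g_l, g_l>_H = alpha_l^T K alpha_l = 1
    return alpha / np.sqrt(lam)

def kernel_pca_project(X_train, alpha, X_new, sigma=1.0):
    # l-th coordinate of x: sum_i alpha_il K(x_i, x)
    return gaussian_gram(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                        # synthetic data
alpha = kernel_pca_fit(X, n_components=3)
Z = kernel_pca_project(X, alpha, X)                 # 30 x 3 projected coordinates
```

Only kernel evaluations appear in the code; the map $\Phi$ is never formed, which is the point made in the text.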

8.3.4. Kernel Linear Discriminant Analysis

Linear discriminant analysis (LDA) focuses on a classification problem. Namely, for a given learning set $X_N = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $x_i \in \mathbb{R}^n$, $y_i \in Y$, where $Y = \{L_1, \ldots, L_R\}$ is a set of class labels, the aim is to recover the labeling function $f : \mathbb{R}^n \to Y$. The approach followed in LDA involves finding the normal vector $w$ of the “optimal” (in terms of the given quality function) dividing plane in the data space. The projection of the data on the resulting vector gives a good separation of classes.

For two-class classification problems (i.e., $R = 2$), the optimal vector is determined as a solution of the following problem:

$$\frac{w^T S_B w}{w^T S_W w} \to \max_{w \in \mathbb{R}^n},$$

where

$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \in \mathbb{R}^{n \times n}$$

is the between-class covariance matrix;

$$S_W = \sum_{r=1}^{2} \sum_{i : y_i = L_r} (\mu_r - x_i)(\mu_r - x_i)^T \in \mathbb{R}^{n \times n}$$

is the within-class covariance matrix; and $\mu_r$ is the arithmetic mean of the sample vectors of the class $L_r$,

$$\mu_r = \frac{1}{N_r} \sum_{i : y_i = L_r} x_i,$$

where $N_r$ is the number of samples in class $L_r$. It can be shown that the resulting vector satisfies

$$w \propto S_W^{-1} (\mu_1 - \mu_2).$$

Similar to the previous examples, the kernel trick can be applied to the given model [28]. Suppose there is a mapping between the initial data space and some Hilbert space, $\Phi : \mathbb{R}^n \to H$, where $\langle \Phi(x), \Phi(t) \rangle_H = K(x, t)$. In this Hilbert space, the sample class means are calculated as

$$\mu_r^\Phi = \frac{1}{N_r} \sum_{i : y_i = L_r} \Phi(x_i).$$

Therefore, the inner product of the mean with the map of an arbitrary vector $x \in \mathbb{R}^n$ is

$$\langle \Phi(x), \mu_r^\Phi \rangle_H = \frac{1}{N_r} \sum_{i : y_i = L_r} \langle \Phi(x), \Phi(x_i) \rangle_H = \frac{1}{N_r} \sum_{i : y_i = L_r} K(x, x_i).$$

It is more complex to define the covariance matrix in the mapped feature space because it defines a linear operator in $H$. However, we can determine the linear
operator by its value over the vectors from the span of $\Phi$. For example, the Hilbert space within-class covariance matrix is defined as

$$S_W^\Phi = \sum_{r=1}^{2} \sum_{i : y_i = L_r} (\Phi(x_i) - \mu_r^\Phi)(\Phi(x_i) - \mu_r^\Phi)^T.$$

Suppose

$$w^\Phi = \sum_{i=1}^{N} \alpha_i \Phi(x_i);$$

then

$$S_W^\Phi w^\Phi = \sum_{r=1}^{2} \sum_{j : y_j = L_r} \langle \Phi(x_j) - \mu_r^\Phi, w^\Phi \rangle_H \, (\Phi(x_j) - \mu_r^\Phi).$$

The coefficient $\langle \Phi(x_j) - \mu_r^\Phi, w^\Phi \rangle_H$ can be computed easily using the properties of the inner product:

$$\langle \Phi(x_j) - \mu_r^\Phi, w^\Phi \rangle_H = \sum_{i=1}^{N} \alpha_i \left( K(x_j, x_i) - \langle \Phi(x_i), \mu_r^\Phi \rangle_H \right).$$

The result is a weighted sum of elements of $H$. Similarly, the second-order function $\langle w^\Phi, S_W^\Phi w^\Phi \rangle_H$ can be calculated. The optimal values of $\alpha_i$, $i = 1, \ldots, N$, are still to be defined. Therefore, the resulting kernel LDA optimization problem is to find the maximum

$$\frac{\langle w^\Phi, S_B^\Phi w^\Phi \rangle_H}{\langle w^\Phi, S_W^\Phi w^\Phi \rangle_H} \to \max_{\alpha_i \in \mathbb{R},\; i = 1, \ldots, N}, \quad \text{where} \quad w^\Phi = \sum_{i=1}^{N} \alpha_i \Phi(x_i).$$

The optimal solution of the stated kernel LDA problem is

$$\alpha \propto Q^{-1} (M_1 - M_2),$$

where $M_r \in \mathbb{R}^N$,

$$(M_r)_i = \frac{1}{N_r} \sum_{j : y_j = L_r} K(x_j, x_i), \qquad Q = \sum_{r=1}^{2} K_r (I - E_{N_r}) K_r^T,$$

and $E_{N_r} \in \mathbb{R}^{N_r \times N_r}$ is a matrix with all elements equal to $\frac{1}{N_r}$, $K_r \in \mathbb{R}^{N \times N_r}$, $(K_r)_{ij} = K(x_i, x_j^r)$, where $x_j^r$ is the $j$-th vector in the $r$-th class.

As noted previously, a projection of the data on the optimal vector leads to the optimal separation. In terms of kernel LDA, the projection of $\Phi(x)$ on the vector $w^\Phi$ is defined by the inner product as follows:

$$\langle w^\Phi, \Phi(x) \rangle_H = \sum_{i=1}^{N} \alpha_i K(x, x_i).$$

The separating property of the kernel LDA method is illustrated in Figure 8.6.

Figure 8.6: Kernel LDA projection example. Distribution of data in the initial space (left) and the result of the projection to the optimal vector in the RKHS (right). (Panel titles: Data in the initial space; Projection to kernel discriminant hyperplane. Right-panel axes: inner product value, sample count.)
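The closed-form solution $\alpha \propto Q^{-1}(M_1 - M_2)$ can be sketched as below for two classes labelled 0 and 1. Since $Q$ is usually singular or badly conditioned, the sketch adds a small ridge term `reg` before solving; this regularization, the Gaussian kernel, the function names, and the synthetic data are assumptions and are not part of the derivation above.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # Matrix of K(a_i, b_j) for a Gaussian kernel; rows are samples
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kernel_lda_fit(X, y, sigma=1.0, reg=1e-3):
    # alpha proportional to Q^{-1} (M_1 - M_2), labels y in {0, 1}
    N = len(X)
    K = gaussian_gram(X, X, sigma)
    Q = np.zeros((N, N))
    M = []
    for r in (0, 1):
        K_r = K[:, y == r]                         # N x N_r block
        N_r = K_r.shape[1]
        M.append(K_r.mean(axis=1))                 # (M_r)_i = 1/N_r sum_j K(x_j, x_i)
        E = np.full((N_r, N_r), 1.0 / N_r)
        Q += K_r @ (np.eye(N_r) - E) @ K_r.T
    # Ridge term: an assumption to keep the linear system well posed
    return np.linalg.solve(Q + reg * np.eye(N), M[0] - M[1])

def kernel_lda_project(X_train, alpha, X_new, sigma=1.0):
    # <w, Phi(x)>_H = sum_i alpha_i K(x, x_i)
    return gaussian_gram(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
inner = 0.3 * rng.normal(size=(40, 2))             # class 0: blob at the origin
angles = rng.uniform(0.0, 2.0 * np.pi, size=40)
outer = 3.0 * np.column_stack([np.cos(angles), np.sin(angles)])  # class 1: ring
X = np.vstack([inner, outer])
y = np.array([0] * 40 + [1] * 40)
alpha = kernel_lda_fit(X, y)
proj = kernel_lda_project(X, alpha, X)
```

A histogram of `proj` split by class gives the kind of one-dimensional view shown in the right panel of Figure 8.6; how well the classes separate depends on the kernel width and on `reg`.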

8.3.5. Kernelization of an Arbitrary Linear Model

Any linear model can be “kernelized” by replacing the inner products $\langle x, t \rangle$ with the kernel values $K(x, t)$. Here, we illustrate the idea using the example of the ridge regression model. Recall the classical ridge regression with the target function

$$\sum_{i=1}^{N} (y_i - \langle w, x_i \rangle)^2 + \lambda \|w\|^2 \to \min_{w}.$$

Here, the pairs $(x_i, y_i)$ belong to the training sample $X_N$. This optimization problem can be solved analytically. With the help of linear algebra, we get

$$w = \left( \sum_{i=1}^{N} x_i x_i^T + \lambda I \right)^{-1} \left( \sum_{i=1}^{N} y_i x_i \right).$$

June 1, 2022 12:51

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch08

Kernel Models and Support Vector Machines

page 355

355

Now, with an appropriate kernel, we perform the mapping $x_i \to \Phi(x_i)$. Therefore,

$$\sum_{i=1}^{N} (y_i - \langle w, \Phi(x_i) \rangle_H)^2 + \lambda \|w\|^2 \to \min_{w \in H}.$$

Assuming that $w = \sum_{i=1}^{N} \alpha_i \Phi(x_i)$, we can reformulate the optimization problem as

$$\sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{N} \alpha_j \langle \Phi(x_j), \Phi(x_i) \rangle_H \right)^2 + \lambda \alpha^T K(X_N, X_N) \alpha \to \min_{\alpha \in \mathbb{R}^N}.$$

Here, we denote the matrix of pairwise inner products of the training sample as $K(X_N, X_N)$, $[K(X_N, X_N)]_{ij} = K(x_i, x_j)$. Then, the optimal solution of the stated problem is given by

$$\alpha = (K(X_N, X_N) + \lambda I)^{-1} \bar{y},$$

where $\bar{y} = (y_1, \ldots, y_N)^T$ is the vector of training outputs. This result can be interpreted as a special case of the representer theorem. Again, to calculate the $y$ value for an arbitrary $x$, the exact form of the mapping $\Phi$ is not needed; we only have to know the corresponding kernel function:

$$y = \langle w, \Phi(x) \rangle = \alpha^T K(X_N, x) = \sum_{i=1}^{N} \alpha_i K(x_i, x).$$

The vector $K(X_N, x)$ consists of the inner products of the training sample objects and the point of evaluation. Figure 8.7 shows an example of kernel ridge regression for the following model:

$$y = \frac{\sin \pi x}{\pi x} + \varepsilon,$$

where $\varepsilon \sim N(0, s^2)$ is normally distributed noise with zero mean and variance $s^2$ imposed on the signal. Kernel ridge regression depends significantly on the selection of the kernel and its parameters. In the case of a Gaussian kernel [2, 4], overly large values of the spread $\sigma^2$ lead to an over-smoothed solution, while an overly small $\sigma^2$ results in over-fitting. These effects are shown in Figure 8.8. The same holds for the regularization parameter $\lambda$: increasing it leads to smoothing, while decreasing it produces a very sensitive model, as shown in Figure 8.9. To tune the parameters of the kernel ridge regression, a cross-validation technique may be applied [29].
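Kernel ridge regression for the model above takes only a few lines. The sketch below uses a Gaussian kernel on scalar inputs with $\sigma^2 = 1$ and $\lambda = 0.5$, the values quoted for Figure 8.7; the sample size, noise level, input range, and function names are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_gram_1d(a, b, sigma2=1.0):
    # K(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for scalar inputs
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * sigma2))

def kernel_ridge_fit(x_train, y_train, lam=0.5, sigma2=1.0):
    # alpha = (K(X_N, X_N) + lambda I)^{-1} y
    K = gaussian_gram_1d(x_train, x_train, sigma2)
    return np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def kernel_ridge_predict(x_train, alpha, x_new, sigma2=1.0):
    # y(x) = sum_i alpha_i K(x_i, x)
    return gaussian_gram_1d(x_new, x_train, sigma2) @ alpha

rng = np.random.default_rng(0)
x_train = rng.uniform(-4.0, 8.0, size=60)
# np.sinc(x) computes sin(pi x) / (pi x); the noise scale 0.05 is an assumption
y_train = np.sinc(x_train) + rng.normal(scale=0.05, size=60)
alpha = kernel_ridge_fit(x_train, y_train, lam=0.5, sigma2=1.0)
x_grid = np.linspace(-4.0, 8.0, 200)
y_grid = kernel_ridge_predict(x_train, alpha, x_grid)
```

Re-running the fit with other values of `sigma2` and `lam` reproduces the qualitative effects described for Figures 8.8 and 8.9; in practice both would be chosen by cross-validation.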

Figure 8.7: Kernel ridge regression with a Gaussian kernel, $\sigma^2 = 1$, $\lambda = 0.5$. (Plot legend: function; Gaussian kernel estimate, $\sigma^2 = 1$, $\lambda = 0.5$; sample points.)

Figure 8.8: Kernel ridge regression with a Gaussian kernel and different $\sigma^2$. (Plot legend: function; Gaussian kernel estimates for $\sigma^2 = 0.125$ and $\sigma^2 = 9$, both with $\lambda = 0.5$; sample points.)

8.3.6. Adaptive Kernel Filtering

The proposed kernel models are applicable to adaptive filter design [30]. Adaptive kernel filtering admits online learning of nonlinear filters. The learning procedure is computationally cheap because the optimization is convex. Consider a system that takes as input streaming data vectors $x_i \in \mathbb{R}^n$, which are drawn independently from some unknown probability distribution, and produces output $f(x_i)$, aiming to predict a corresponding outcome $y_i$. One possible representation of such a system is shown in Figure 8.10.

Figure 8.9: Kernel ridge regression with a Gaussian kernel and different $\lambda$. (Plot legend: function; Gaussian kernel estimates for $\lambda = 0.001$ and $\lambda = 2$, both with $\sigma^2 = 1$; sample points.)

Figure 8.10: Adaptive system. (Block diagram with labels: $x_i$, adaptive system, $f(x_i)$, $y_i$, $e_i$, adaptation mechanism.)

Therefore, the learning procedure is performed in an online way. In the case of a linear model, the output can be represented as an inner product:
\[
f(x_i) = \langle x_i, w \rangle .
\]
Learning can be performed using any online fitting procedure, such as stochastic gradient descent [30], recursive least squares [30], and others. However, such an approach limits the resulting filter output to a linear combination of the data vector components (i.e., a linear function). This limitation can be overcome using the kernel trick. Hereafter we use the least-squares functional, which is defined as follows:
\[
R(X_N, f) = \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 .
\]


Consider stochastic gradient descent as a learning algorithm. Suppose that the target function is an inner product in the Hilbert space $H$:
\[
f(x) = \langle \Phi(x), w \rangle_H ,
\]
where $\Phi : \mathbb{R}^n \to H$ is a map from the data space to the Hilbert space. Similar to the previous examples, we assume that
\[
\forall x, t \in \mathbb{R}^n : \quad \langle \Phi(x), \Phi(t) \rangle_H = K(x, t) .
\]
At each new learning pair $(x_i, y_i)$, the parameter vector $w_{i-1}$ is updated according to the gradient of $\left( \langle \Phi(x_i), w_{i-1} \rangle_H - y_i \right)^2$:
\[
w_i = w_{i-1} - \nu \left( \langle \Phi(x_i), w_{i-1} \rangle_H - y_i \right) \Phi(x_i) = w_{i-1} + \alpha_i \Phi(x_i) .
\]
Here, $\nu$ is the gradient descent step size. Therefore, the resulting function $f$ can be expressed as
\[
f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) .
\]
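The update rule above can be sketched as a short online procedure in which each incoming pair contributes one coefficient $\alpha_i = -\nu (f_{i-1}(x_i) - y_i)$ and one kernel center $x_i$. The sketch below is a hedged illustration for scalar inputs; the Gaussian kernel form, its width, the step size, and the synthetic data stream are assumptions made for this example only.

```python
import math

def gaussian_kernel(x, t, sigma2=0.5):
    """Scalar Gaussian kernel exp(-(x - t)^2 / (2 * sigma2)) (assumed form)."""
    return math.exp(-(x - t) ** 2 / (2.0 * sigma2))

def evaluate(x, centers, alphas, kernel=gaussian_kernel):
    """f(x) = sum_i alpha_i K(x, x_i)."""
    return sum(a * kernel(x, c) for a, c in zip(alphas, centers))

def kernel_lms(stream, nu, kernel=gaussian_kernel):
    """One pass of kernelized stochastic gradient descent over (x, y) pairs."""
    centers, alphas = [], []
    for x, y in stream:
        prediction = evaluate(x, centers, alphas, kernel)  # <Phi(x_i), w_{i-1}>
        alphas.append(-nu * (prediction - y))              # alpha_i from the update rule
        centers.append(x)                                  # w_i = w_{i-1} + alpha_i Phi(x_i)
    return centers, alphas

# Synthetic, noise-free stream of samples of sin(x) on [-3, 3).
xs = [-3.0 + 6.0 * ((i * 0.6180339887) % 1.0) for i in range(300)]
stream = [(x, math.sin(x)) for x in xs]
centers, alphas = kernel_lms(stream, nu=0.2)

# Compare the learned filter with the always-zero predictor on a test grid.
grid = [-2.5 + 0.1 * i for i in range(51)]
mse = sum((evaluate(x, centers, alphas) - math.sin(x)) ** 2 for x in grid) / len(grid)
baseline = sum(math.sin(x) ** 2 for x in grid) / len(grid)
```

Note that the number of stored centers grows with every sample, which is the usual cost of this simple form of the algorithm.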

The overall algorithm is presented in Table 8.1. The selection of the parameter ν usually influences convergence and, more specifically, convergence speed. The required condition for convergence is ν< ⇒ ν
0 is the similarity between $x_i$ and $x_j$. These relationships can be represented in the form of the similarity graph $G = (V, E)$, where the vertex $v_i \in V$ corresponds to $x_i$ and the edge $e_{ij} \in E$ is weighted by $s_{ij}$. This graph can be fully connected (i.e., all $s_{ij} > 0$) or each vertex can be connected to its $k$ nearest neighbors only (i.e., all other edges have $s_{ij} = 0$), resulting in a sparse matrix [36]. The problem of clustering then can be reformulated in terms of graph partitioning. Furthermore, the edges between different groups of the partition are expected to have low weights. Following graph theory, we denote the degree of a vertex $v_i$ as
\[
d_i = \sum_{j=1}^{N} s_{ij} .
\]

The degree matrix $D$ is a diagonal matrix with $d_1, \ldots, d_N$ on the diagonal. Spectral clustering algorithms are based on eigenvalue decomposition of the graph Laplacian matrix. This can be calculated in several ways:

1) The unnormalized graph Laplacian matrix [38, 39] is defined as $L = D - S$.
2) The symmetric normalized graph Laplacian matrix [40] is defined as $L = D^{-1/2} (D - S) D^{-1/2} = I - D^{-1/2} S D^{-1/2}$.
3) The random walk normalized graph Laplacian matrix [40] is defined as $L = D^{-1} (D - S) = I - D^{-1} S$.
4) In [35], the authors define the graph Laplacian matrix as $L = D^{-1/2} S D^{-1/2}$, which has the same eigenvectors as the symmetric normalized graph Laplacian matrix and eigenvalues $\lambda_i$ corresponding to $1 - \lambda_i$ of the latter.

After calculating the graph Laplacian matrix of the data to be clustered, eigenvalue decomposition is performed. Ordinary decomposition may be replaced with a generalized eigenvalue decomposition, as in the spectral clustering procedure described by Shi and Malik [41]. Eigenvectors, which are vector columns corresponding to the $k$ smallest eigenvalues (for the definition in [35], these are the $k$ largest eigenvalues), form the matrix $V \in \mathbb{R}^{N \times k}$ containing new descriptions of the original objects as vector rows.
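As a minimal sketch of the embedding step, the following code builds the symmetric normalized graph Laplacian from a similarity matrix and extracts the eigenvectors of its $k$ smallest eigenvalues. The function name, the Gaussian similarity, and the six one-dimensional toy points are assumptions introduced for illustration; for two groups, the sign of the second eigenvector is used here as a simple stand-in for a full clustering of the rows.

```python
import numpy as np

def spectral_embedding(S, k):
    """Eigenpairs of L = I - D^{-1/2} S D^{-1/2} for the k smallest eigenvalues."""
    d = S.sum(axis=1)                          # vertex degrees d_i
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigh returns ascending eigenvalues
    return eigvals[:k], eigvecs[:, :k]         # rows of eigvecs[:, :k] describe the objects

# Two well-separated groups on the real line with a fully connected Gaussian similarity.
x = np.array([0.0, 0.2, 0.4, 3.0, 3.2, 3.4])
S = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)

vals, V = spectral_embedding(S, k=2)
labels = (V[:, 1] > 0).astype(int)             # sign of the second eigenvector splits the groups
```

Which of the two groups receives label 1 depends on the arbitrary sign of the eigenvector returned by the solver.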


Finally, these vector rows are clustered, for instance using the simple k-means algorithm. Ng, Jordan, and Weiss [35] suggest normalizing the vector rows to have norm 1. Spectral clustering can easily be implemented and solved using standard linear algebra methods. Despite the lack of theoretical proof, experiments show that spectral clustering algorithms can be applied to clusters with arbitrary shapes [36]. Intuitively, this is because spectral clustering explores the spectrum of the similarity matrix. Thus, an appropriate similarity metric will lead to a reasonably good result. Two examples of spectral clustering performance on datasets with complicated structures are given in Figure 8.12.

Figure 8.12: Spectral clustering results on different datasets.

At the end of this section, we show that spectral clustering is related to the kernel k-means,
\[
\sum_{j=1}^{k} \sum_{x \in \pi_j} w(x) \left\| \Phi(x) - \mu_j \right\|^2 \to \min_{\pi_j, \mu_j} ,
\]
where $\pi = \{\pi_j\}_{j=1}^{k}$ is a partitioning of data into clusters $\pi_j$. For a fixed partition, cluster centers are calculated similarly to standard k-means,
\[
\mu_j = \arg\min_{\mu} \sum_{x \in \pi_j} w(x) \left\| \Phi(x) - \mu \right\|^2 ,
\]
and $x \in X_N$. The function $w(x)$ assigns some nonnegative weight to each object $x$. The mapping $\Phi$ is determined by some kernel function $K$. Note that there is no need to compute this mapping because $\|\cdot\|$ depends only on the inner products computed by $K$. Normalized spectral clustering, which is closely connected to the normalized graph cut problem, can be expressed as a matrix trace maximization problem [34] as follows:
\[
\operatorname{trace}\left( V^T L V \right) \to \max_{V \in \mathbb{R}^{N \times k},\; V^T V = I} ,
\]
where $L = D^{-1/2} S D^{-1/2}$. Denoting the diagonal matrix formed by the weights $w(x_i)$, $i = 1, \ldots, N$, as $W$, the weighted kernel k-means optimization problem takes the form
\[
\operatorname{trace}\left( V^T W^{1/2} K(X_N, X_N) W^{1/2} V \right) \to \max_{V \in \mathbb{R}^{N \times k},\; V^T V = I} ,
\]
which is close to that in normalized spectral clustering.

8.3.8. Gaussian Processes and Regression and Classification Methods

A Gaussian process (GP) is a generalization of the Gaussian probability distribution [66]. The main difference is that a probability distribution describes random variables, which are scalars or vectors, while a stochastic process describes the properties of functions. In this section, we will not introduce GPs in a mathematically rigorous way; this would take a lot of space and is not strictly related to the chapter's topic. Instead, we will describe the main properties of this model, show its relation to kernel methods, and detail its application to regression and classification tasks.

Definition 8.6. A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP defines a probability distribution on a set of functions $F = \{ f : X \to \mathbb{R} \}$, where $X$ is the parameter space of the functions in $F$. This distribution is defined by its mean and covariance functions as follows:
\[
E f(x) = m(x), \qquad k(x, x') = E\left[ (f(x) - m(x))^T \left( f(x') - m(x') \right) \right] .
\]
In this case, we write that for each $x \in X$, the value of $f(x)$ is distributed as
\[
f(x) \sim GP(m(x), k(x, x)) = N(m(x), k(x, x)) .
\]
Here, $N(\mu, \sigma^2)$ represents a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Following the definition of GP, we consider a distribution of the values of the function $f$ on a set $X_N = (x_1, \ldots, x_N)$, denoted by
\[
f_N = \left( f(x_1), \ldots, f(x_i), \ldots, f(x_N) \right) ,
\]


which is a real vector-valued random variable. Therefore, following the definition,
\[
f_N \sim N\left( M(X_N), K(X_N, X_N) \right) ,
\]
where $M(X_N) \in \mathbb{R}^N$ and $K(X_N, X_N) \in \mathbb{R}^{N \times N}$ are multi-dimensional applications of the functions $m(x)$ and $k(x, x')$ to the dataset $X_N$, correspondingly. As $K(X_N, X_N)$ is a covariance matrix and is, therefore, positive-definite for any finite set $X_N$, $k(x, x')$ is a positive definite kernel. It is important to stress that the random variable here is not $x \in X$. In fact, it is the function $f$. Furthermore, the distribution over $f$ is constructed by defining the distribution over the values of the function for each finite set of arguments. Based on the properties stated above, we consider the application of the model for regression and classification problems.

A regression problem for the case of GP is formalized as a computation of the parameters of the conditional distribution. Suppose we have a defined pair $m(x)$ and $k(x, x')$, as well as a sample of pairs $D = \{(x_i, y_i)\}_{i=1}^{N}$. Now we would like to assess the distribution of $f(x^*)$ for a given $x^*$. For this, we consider the conditional distribution $p(f^* \mid f_N = y_N, x^*)$, i.e., its mean and covariance, as follows:
\[
E(f^* \mid f_N = y_N, x^*) = m(x^*) + K(x^*, X_N) K(X_N, X_N)^{-1} \left( y_N - M(X_N) \right) ,
\]
\[
D(f^* \mid f_N = y_N, x^*) = k(x^*, x^*) - K(x^*, X_N) K(X_N, X_N)^{-1} K(x^*, X_N)^T .
\]
Here, $f^*$ is a shorter notation for the random variable $f(x^*)$. One can see that the form of the expected value matches the result obtained in Section 8.3.5 and the form defined by the representer theorem (see Theorem 5). Indeed, assuming $m(x) \equiv 0$ and defining $\alpha = K(X_N, X_N)^{-1} y_N$, we obtain the following expected value:
\[
E(f^* \mid f_N = y_N, x^*) = \sum_{i=1}^{N} \alpha_i k(x_i, x^*) .
\]
The difference between the results presented in Section 8.3.5 and those shown here is caused by the regularization term introduced before. It is possible to obtain the same result by assuming that
\[
\forall i \in 1, \ldots, N : \quad y_i = f(x_i) + \varepsilon_i , \qquad f(x) \sim GP(m(x), k(x, x')) , \qquad \varepsilon_i \sim N(0, \sigma^2) .
\]
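The conditional mean and covariance can be sketched as follows for scalar inputs and a zero mean function. The Gaussian covariance function, the variable names, and the toy data are assumptions made for this example. Under the noise model $y_i = f(x_i) + \varepsilon_i$, the matrix $K(X_N, X_N)$ is replaced by $K(X_N, X_N) + \sigma^2 I$, so the noise variance plays the role of the ridge parameter $\lambda$ of Section 8.3.5; the `noise_var` argument below reflects that.

```python
import numpy as np

def k_gauss(a, b, length2=1.0):
    """Covariance matrix with entries exp(-(a_i - b_j)^2 / (2 * length2)) (assumed kernel)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length2))

def gp_posterior(X, y, X_star, noise_var=0.0):
    """Posterior mean and covariance of f(X_star) for a zero-mean GP prior."""
    K = k_gauss(X, X) + noise_var * np.eye(len(X))
    K_star = k_gauss(X_star, X)                       # K(x*, X_N), one row per test point
    mean = K_star @ np.linalg.solve(K, y)             # K(x*, X_N) K^{-1} y_N
    cov = k_gauss(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

X = np.array([-2.0, 0.0, 1.5])
y = np.array([0.5, -1.0, 2.0])
X_star = np.array([-1.0, 0.0, 3.0])                   # the middle test point is a training point

mean, cov = gp_posterior(X, y, X_star)                       # noise-free conditioning
mean_noisy, _ = gp_posterior(X, y, X_star, noise_var=0.5)    # equals kernel ridge with lambda = 0.5
```

In the noise-free case the posterior mean interpolates the training targets and the posterior variance vanishes at the training inputs.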


The above-mentioned assumption, $m(x) \equiv 0$, is a common choice for the mean function. Additionally, it does not limit model flexibility if we are using universal kernels. At the same time, it is possible to introduce further flexibility into the model in a different way, namely, by assuming that $m(x)$ is a parametric function $m(x, \beta)$. In this case, the parameters $\beta$ can be adjusted using the maximum likelihood method for a given sample $D$.

A classification problem for GP has a less straightforward formulation. In this section, we consider a two-class classification problem with a probabilistic formulation. For a linear classifier, the standard approach to define class probability is
\[
p(y = +1 \mid x) = \frac{1}{1 + e^{-\langle x, w \rangle}} = \sigma(\langle x, w \rangle) .
\]
For nonlinear cases, it is possible to replace $\langle x, w \rangle$ with a nonlinear function $f(x)$, so that the positive class probability would be expressed as $\sigma(f(x))$. Next, the behavior of $f(x)$ is modeled using GP, which is adjusted using the data sample $D$. Here, $f(x)$ plays the role of a "hidden variable" in that we do not observe the values of the function itself and we are not particularly interested in its exact values; instead, we are observing the class labels that depend on the function. For model derivation, we can reuse notations from the "regression" subsection with the main difference that $y_N \in \{+1, -1\}^N$. The predictive distribution of the values of this function for a specific input $x^*$ can be expressed as
\[
p(f^* \mid D, x^*) = \int p(f^* \mid X_N, f_N, x^*) \, p(f_N \mid X_N, y_N) \, d f_N .
\]
It is possible to obtain the class-prediction probability as
\[
p(y^* = +1 \mid D, x^*) = \int p(y^* = +1 \mid f^*) \, p(f^* \mid D, x^*) \, d f^* = \int \sigma(f^*) \, p(f^* \mid D, x^*) \, d f^* .
\]
The probability distribution $p(f_N \mid X_N, y_N)$ is non-Gaussian, and direct computation of $p(f^* \mid D, x^*)$ is not possible in closed form. However, several approaches have been proposed [?] involving different types of probabilistic approximation, including Laplace approximation or expectation propagation. In this section, we consider Laplace approximation in detail. Laplace's method uses a Gaussian approximation $q(f_N \mid X_N, y_N)$ of the non-Gaussian distribution $p(f_N \mid X_N, y_N)$, so that the derivation of $p(f^* \mid D, x^*)$ becomes tractable. It performs a second-order Taylor expansion of $\log p(f_N \mid X_N, y_N)$ around its maximum, defining a Gaussian distribution as


follows:
\[
q(f_N \mid X_N, y_N) = N(\hat{f}_N, A^{-1}) ,
\]
\[
\hat{f}_N = \arg\max_{f_N} \log p(f_N \mid X_N, y_N) , \qquad A = - \nabla \nabla \log p(f_N \mid X_N, y_N) \Big|_{f_N = \hat{f}_N} .
\]
The process of finding the values of $\hat{f}_N$ and $A^{-1}$ can be interpreted as the model training process. It is possible to obtain these values using Newton's method. After performing the approximation, it is possible to derive the mean $E_q(f^* \mid D, x^*)$ and variance $D_q(f^* \mid D, x^*)$. This is because the convolution of Gaussian distributions can be expressed analytically. The integral defining $p(y^* = +1 \mid D, x^*)$ is still intractable and is evaluated approximately. However, if the problem does not require computation of the probability of a class and requires only classification to be performed, it is possible to use the following rule:
\[
p(y^* = +1 \mid D, x^*) > \frac{1}{2} \iff E_q(f^* \mid D, x^*) > 0 .
\]
It is important to note that $E_q(f^* \mid D, x^*)$ is expressed in the form of a linear combination of kernels, which is related to the representer theorem.
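A possible sketch of the training step is given below, assuming a zero-mean GP prior, the logistic likelihood $p(y_i = +1 \mid f_i) = \sigma(f_i)$, a Gaussian covariance function, and a fixed number of Newton iterations; all names and the toy data are illustrative assumptions. The Newton step is written as $f \leftarrow K (I + W K)^{-1} (W f + g)$, where $g$ is the gradient of the log-likelihood and $W$ its negative Hessian, which avoids inverting $K$. At the mode, $K^{-1} \hat{f}_N = g$, so the predictive mean is the kernel combination $\sum_i g_i k(x_i, x^*)$.

```python
import numpy as np

def k_gauss(a, b, length2=1.0):
    """Covariance matrix with entries exp(-(a_i - b_j)^2 / (2 * length2)) (assumed kernel)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, y, n_iter=20):
    """Newton iterations for the mode f_hat of p(f_N | X_N, y_N), labels y in {-1, +1}."""
    t = (y + 1.0) / 2.0                       # labels mapped to {0, 1}
    f = np.zeros(len(y))
    for _ in range(n_iter):
        pi = sigmoid(f)
        grad = t - pi                         # gradient of log p(y_N | f_N)
        W = np.diag(pi * (1.0 - pi))          # negative Hessian of log p(y_N | f_N)
        f = K @ np.linalg.solve(np.eye(len(y)) + W @ K, W @ f + grad)
    return f

X = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
K = k_gauss(X, X)

f_hat = laplace_mode(K, y)
grad_hat = (y + 1.0) / 2.0 - sigmoid(f_hat)   # equals K^{-1} f_hat at the mode

x_star = np.array([-1.2, 1.7])
mean_star = k_gauss(x_star, X) @ grad_hat     # E_q(f* | D, x*) as a kernel combination
predicted = np.where(mean_star > 0, 1, -1)    # decision rule E_q(f* | D, x*) > 0
```

The sketch stops after the mean; the predictive variance and the approximate class probability would require the matrix $A$ as well.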

8.4. Support Vector Machines

Support vector machines (SVMs) are among the most pervasive techniques for solving data mining problems, including classification, regression, anomaly detection, and metric learning. Initially invented as a linear classification problem solver in the mid-1960s by Soviet scientists Vapnik and Chervonenkis [5], the SVM was further developed into a framework of statistical methods with common optimization approaches. The SVM plays a significant role among kernel machines due to the "sparsity" of the method, which matters because the size of the learning set is the main limitation for algorithms of this type. The next part of this chapter is organized as follows: we first describe the SVM classifier [4] in more detail and specify its relation to kernel machines, after which we introduce several types of SVM for binary classification, regression, and multiclass classification; finally, we highlight some approaches for anomaly detection, incremental SVM learning, and metric fitting.

8.4.1. Support Vector Classifier

As noted in the first chapter, the main idea behind an SVM is to build a maximal margin classifier. Suppose that we have a labeled sample $X_N = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $x_i \in \mathbb{R}^n$, $y_i \in \{1, -1\}$, and we need to solve a two-class classification problem. Assume that linear classification is considered, which means that the algorithm's decision rule can be expressed as
\[
f(x) = \operatorname{sign}(\langle w, x \rangle - \rho) ,
\]
where $w \in \mathbb{R}^n$ and $\rho \in \mathbb{R}$. Therefore, the problem can be conceptualized as one involving finding $w$ and $\rho$, or defining a hyperplane that separates the objects of different classes. It is also supposed that the sample set is linearly separable, which means that
\[
\exists w \in \mathbb{R}^n, \ \exists \rho \in \mathbb{R} : \quad \sum_{i=1}^{N} I\left[ (\langle w, x_i \rangle - \rho) y_i \le 0 \right] = 0 .
\]
That is to say, all the objects are classified correctly. The margin function is defined as follows:
\[
M(x_i, y_i) = (\langle w, x_i \rangle - \rho) y_i \
\begin{cases}
\le 0, & \text{if } f(x_i) \ne y_i , \\
> 0, & \text{if } f(x_i) = y_i .
\end{cases}
\]
It is important to note that there may be infinitely many ways to define the hyperplane, as illustrated in Figure 8.13. In this figure, two possible separation lines for a two-dimensional classification problem are shown. Vapnik et al. [42] suggested a "maximum margin" approach wherein the hyperplane is placed as far away as possible from the closest objects of both classes. So, a linear two-class SVM finds a "hyper-band" between two classes with maximal width.

Figure 8.13: Different ways to define a hyperplane for the dataset.

Consider the maximal bandwidth problem in more detail. The distance between a given point $x_0 \in \mathbb{R}^n$ and a plane $P(w, \rho)$, as defined by $\langle x, w \rangle - \rho = 0$, can be calculated as follows:
\[
d(P(w, \rho), x_0) = \frac{|\langle x_0, w \rangle - \rho|}{\|w\|} .
\]
To find the best separating band, we maximize the distance from the closest data point in the learning set while considering class separation constraints. Therefore, the optimization problem for SVMs is stated as
\[
\max_{w, \rho} \min_{x_i \in X_N} \frac{|\langle x_i, w \rangle - \rho|}{\|w\|}
\]
w.r.t.
\[
y_i (\langle x_i, w \rangle - \rho) > 0, \quad \forall (x_i, y_i) \in X_N .
\]
Note that for all $\alpha > 0$, the terms $\alpha w$ and $\alpha \rho$ define the same separating hyperplane. Therefore, we can strengthen the problem's constraints:
\[
\forall (x_i, y_i) \in X_N : \quad (\langle w, x_i \rangle - \rho) y_i \ge 1 ,
\]
which is to say that for the objects closest to the hyperplane, the inequality turns into equality. It is possible to show that there should be at least two closest objects, one from each class, at the same distance from the hyperplane. Suppose $P(w^*, \rho^*) \equiv P^*$ is optimal. Let the closest vector from the class "–1" be denoted as $x_{-1}$, and let the one from the class "+1" be denoted as $x_{+1}$. Without loss of generality, let us suppose that
\[
d(P(w^*, \rho^*), x_{+1}) < d(P(w^*, \rho^*), x_{-1}) .
\]
Let $\varepsilon = d(P^*, x_{-1}) - d(P^*, x_{+1})$, and consider the following plane (with $(w^*, \rho^*)$ scaled so that $\|w^*\| = 1$, which does not change the hyperplane):
\[
P\left( w^*, \rho^* - \frac{\varepsilon}{2} \right) = P_\varepsilon^* .
\]
This hyperplane fits the learning set constraints (all the vectors are correctly classified), but the distance of each "–1" class sample is reduced by $\frac{\varepsilon}{2}$, and the distance of each "+1" class sample is enlarged by $\frac{\varepsilon}{2}$ compared to $P^*$. It is obvious that the object that is the closest to the hyperplane $P_\varepsilon^*$ is still $x_{+1}$, but
\[
d(P^*, x_{+1}) < d(P^*, x_{+1}) + \frac{\varepsilon}{2} = d(P_\varepsilon^*, x_{+1})
\]
contradicts the "maximum distance" statement. Thus, the objects from classes "+1" and "–1" that are closest to the hyperplane are equidistant from it.


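Therefore, the separation bandwidth can be expressed as follows:
\[
h = \left\langle x_{+1} - x_{-1}, \frac{w}{\|w\|} \right\rangle = \frac{(\langle x_{+1}, w \rangle - \rho) - (\langle x_{-1}, w \rangle - \rho)}{\|w\|} = \frac{2}{\|w\|} .
\]
This value is to be maximized, which is to say that
\[
\frac{2}{\|w\|} \to \max_{w} .
\]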
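A small numerical illustration of the rescaling argument: for a separating hyperplane, dividing $(w, \rho)$ by the smallest margin makes the closest objects satisfy $(\langle w, x_i \rangle - \rho) y_i = 1$, and the width of the band is then $2 / \|w\|$. The helper names and the four sample points below are hypothetical and chosen so that the arithmetic is easy to follow.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def distance(w, rho, x):
    """Distance from x to the hyperplane <x, w> - rho = 0."""
    return abs(dot(x, w) - rho) / math.sqrt(dot(w, w))

def canonical_form(w, rho, sample):
    """Rescale (w, rho) so that the smallest margin (<w, x_i> - rho) y_i equals 1."""
    m = min(y * (dot(w, x) - rho) for x, y in sample)
    if m <= 0:
        raise ValueError("the hyperplane does not separate the sample")
    return [wi / m for wi in w], rho / m

# Hypothetical two-dimensional sample separated by the line x_1 = 1.
sample = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((0.0, 0.0), -1), ((-1.0, 2.0), -1)]
w_c, rho_c = canonical_form((3.0, 0.0), 3.0, sample)

width = 2.0 / math.sqrt(dot(w_c, w_c))      # separation bandwidth h = 2 / ||w||
closest_plus = distance(w_c, rho_c, (2.0, 0.0))
closest_minus = distance(w_c, rho_c, (0.0, 0.0))
```

Here the two closest objects are equidistant from the hyperplane, and their distances add up to the bandwidth.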

The formulation is easily converted into the standard SVM optimization problem as follows:
\[
\frac{1}{2} \|w\|^2 \to \min_{w, \rho}
\]
w.r.t.
\[
y_i (\langle w, x_i \rangle - \rho) \ge 1, \quad i = 1, \ldots, N .
\]
However, it does not allow solving the classification problem for linearly nonseparable classes. From a mathematical perspective, this is because the set defined by the constraints $y_i (\langle w, x_i \rangle - \rho) \ge 1$ may be empty. To overcome this issue, the so-called "slack variables" $\xi_i$ are introduced:
\[
\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \to \min_{w, \rho, \xi}
\]
w.r.t.
\[
y_i (\langle w, x_i \rangle - \rho) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N .
\]
In this case, the set of constraints defines a non-empty set (because for all $w, \rho$ there exist slack variables $\xi_i$ that are large enough for the constraints to be met). The problem is defined as one of quadratic programming optimization. Usually, it is solved in its dual formulation. If we write down the Lagrangian and derive the Karush–Kuhn–Tucker (KKT) conditions, we obtain the dual optimization problem as follows:
\[
- \frac{1}{2} \alpha^T Q \alpha + \sum_{i=1}^{N} \alpha_i \to \max_{\alpha}
\]
w.r.t.
\[
\begin{cases}
0 \le \alpha_i \le C \quad \forall i = 1, \ldots, N , \\
\sum_{i=1}^{N} \alpha_i y_i = 0 ,
\end{cases}
\]
where $\alpha \in \mathbb{R}^N$ is a vector of Lagrangian multipliers and $Q_{ij} = y_i y_j \langle x_i, x_j \rangle$. KKT conditions play an essential role in solving optimization problems and explaining


the properties of the approach. We focus on the two main consequences of KKT conditions for the SVM. The first KKT condition leads to an explicit representation of the vector $w$,
\[
w = \sum_{i=1}^{N} y_i \alpha_i x_i ,
\]
which can be interpreted as a special case of the representer theorem with the Lagrangian regarded as a regularized loss function. One of the key properties of the SVM is that most $\alpha_i$ will be reduced to zero. Only a relatively small subset of samples has corresponding non-zero alpha entries. Such samples are called support vectors. A large number of zero entries in $\alpha$ can be explained by the second consequence of KKT conditions. It can be shown that vectors in the learning set can be split into three categories:
\[
\begin{array}{llll}
M(x_i, y_i) > 1, & \alpha_i = 0, & \xi_i = 0 : & \text{C-group}, \\
M(x_i, y_i) = 1, & 0 < \alpha_i < C, & \xi_i = 0 : & \text{M-group}, \\
M(x_i, y_i) < 1, & \alpha_i = C, & \xi_i > 0 : & \text{E-group},
\end{array}
\]
where $M(x_i, y_i) = (\langle w, x_i \rangle - \rho) y_i$. The C-group (correct) defines a subset of vectors from $X_N$ lying beyond the boundaries of the separation band, which are thus well separated. The M-(margin) and E-(error) groups are the subsets of the vectors lying on the border and within the separation band, respectively. These groups are related to less separable vectors (the E-group can contain misclassified vectors). Therefore, all the vectors that lie beyond the boundaries of the separation band (C-group) have zero alpha entries. Considering that the algorithm aims to minimize the sum of $\xi_i$, the number of vectors in the M- and E-groups should be relatively small. Otherwise, this could be a sign of a poorly fitted classifier.

Another explanation for the sparsity of the SVM can be referred to as the "geometrical" explanation. Constraints of the primal optimization problem define a set, limited by a hyper-polyline (possibly a hyper-polygon) in the space of $w$. Each vector $x_i$ in the sample determines a linear function in the space of $w$,
\[
f_i(w, \rho) = y_i \langle x_i, w \rangle - y_i \rho - 1 ,
\]
defining a half-space, $f_i(w, \rho) \ge 0$, but not every linear function is involved in forming the broken hyperplane, which represents the border of the set $\{ w, \rho \mid f_i(w, \rho) \ge 0 \ \forall i = 1, \ldots, N \}$. An example with a sample set of five vectors is illustrated in Figure 8.14. It is assumed that each hyperplane defines a half-space above it. Only three lines of five construct the border polyline.
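The three categories can be illustrated by computing the margin $M(x_i, y_i)$ and the smallest feasible slack $\xi_i = \max(0, 1 - M(x_i, y_i))$ for a given hyperplane. In the sketch below the hyperplane, the sample, the tolerance, and the penalty value are hypothetical; the hyperplane is fixed by hand rather than obtained by solving the optimization problem, so the grouping only shows the geometric partition that the KKT conditions induce at an optimum.

```python
def margin(w, rho, x, y):
    """M(x, y) = (<w, x> - rho) * y."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) - rho)

def kkt_group(m, tol=1e-9):
    """C: beyond the band (alpha = 0); M: on its border; E: inside it or misclassified."""
    if m > 1.0 + tol:
        return "C"
    if m < 1.0 - tol:
        return "E"
    return "M"

# Hypothetical hyperplane x_1 = 1 in canonical scaling and a small labeled sample.
w, rho = (1.0, 0.0), 1.0
sample = [((3.0, 0.0), 1), ((2.0, 1.0), 1), ((1.5, 0.0), 1),
          ((0.0, 0.0), -1), ((2.5, 0.0), -1), ((-2.0, 1.0), -1)]

margins = [margin(w, rho, x, y) for x, y in sample]
slacks = [max(0.0, 1.0 - m) for m in margins]      # smallest xi_i meeting the constraints
groups = [kkt_group(m) for m in margins]

C_penalty = 1.0                                    # illustrative value of C
objective = 0.5 * sum(wi * wi for wi in w) + C_penalty * sum(slacks)
```

Only the vectors in the M- and E-groups would carry non-zero $\alpha_i$, which is the sparsity discussed above.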

Figure 8.14: Active and non-active constraints. (Plot in the $(w_1, w_2)$ plane; legend: constraint set border, active limitations, non-active limitations.)

8.4.2. Examples of the Kernel Trick

Let us now consider an application of kernels to an SVM more carefully. The so-called "kernel trick" [43] is a powerful tool that introduces nonlinearity into a model, allowing the separation of data that cannot be separated linearly. In a linear SVM, we have
\[
\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \to \min_{w, \xi, \rho}
\]
w.r.t.
\[
y_i (\langle w, x_i \rangle - \rho) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N .
\]
Slack variables $\xi_i$ make the problem consistent even for linearly nonseparable cases, but it is still possible for a linear hyperplane to have poor discriminative ability, with classes being separated by a nonlinear surface. We may hope that with a map $\Phi : X \to H$, the problem will become linearly separable in the RKHS $H$. The idea of choosing an RKHS is that the target function and conditions in the SVM depend only on inner products. This means that they can be calculated in $H$ using the corresponding kernel function $K$ without evaluating the map $\Phi$,
\[
K(x, t) = \langle \Phi(x), \Phi(t) \rangle_H .
\]
In other words, we are trying to find such a feature space $H$ with a corresponding reproducing kernel $K$ so that the dividing function $f$ in the input space can be represented as a linear combination:
\[
f(x) = \sum_{i=1}^{N} \hat{\alpha}_i K(x, x_i) .
\]


The dual form of the linear SVM provides an elegant solution for determining the linear coefficients $\hat{\alpha}_i$. Actually, from KKT conditions, we have
\[
w = \sum_{i=1}^{N} \alpha_i y_i x_i .
\]
The image of $w$ in the Hilbert space can be written as
\[
w_H = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i) .
\]
Then the inner product $\langle w, x \rangle$, which determines a linear separating plane in the initial space, is replaced by a nonlinear separating surface determined by the following function:
\[
f(x) = \langle w_H, \Phi(x) \rangle_H = \sum_{i=1}^{N} \alpha_i y_i \langle \Phi(x), \Phi(x_i) \rangle_H = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) .
\]
Then, the following quadratic program is to be solved instead of the initial one:
\[
\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i, j = 1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \to \max_{\alpha}
\]
w.r.t.
\[
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0 .
\]
The computational efficiency of a kernel SVM arises from a relatively small number of non-zero $\alpha_i$, which correspond to support vectors. It dramatically reduces the number of kernel function calculations that are required.

Example of a polynomial kernel SVM. Here, we illustrate the technique of a kernel SVM using two examples. The first one, shown in Figure 8.15, illustrates the ability of the polynomial kernel $K(x, t) = (\langle x, t \rangle + 1)^2$ to restore a quadratic discriminant line. True class labels of the data are marked with a different color and shape, and the quadratic discriminant line of the kernel SVM is given in red. The filled points are support vectors.
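For the polynomial kernel $K(x, t) = (\langle x, t \rangle + 1)^2$ in two dimensions, the feature map can be written out explicitly, which makes it possible to check numerically that the kernel expansion $\sum_i \alpha_i y_i K(x, x_i)$ equals the inner product $\langle w_H, \Phi(x) \rangle_H$. The sketch below does this; the support vectors, labels, and multipliers are made-up numbers for illustration and were not obtained by solving the quadratic program.

```python
import math

def poly_kernel(x, t):
    """K(x, t) = (<x, t> + 1)^2 for two-dimensional inputs."""
    return (x[0] * t[0] + x[1] * t[1] + 1.0) ** 2

def feature_map(x):
    """Explicit six-dimensional map Phi with <Phi(x), Phi(t)> = (<x, t> + 1)^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2)

def decision_function(x, support_vectors, labels, alphas):
    """f(x) = sum_i alpha_i y_i K(x, x_i), evaluated through the kernel only."""
    return sum(a * y * poly_kernel(x, sv)
               for a, y, sv in zip(alphas, labels, support_vectors))

# Hypothetical support vectors, labels, and multipliers (sum_i alpha_i y_i = 0).
support_vectors = [(1.0, 0.5), (-0.5, 2.0), (0.0, -1.0)]
labels = [1, -1, 1]
alphas = [0.3, 0.5, 0.2]

# The same function computed through the explicit image w_H = sum_i alpha_i y_i Phi(x_i).
w_H = [sum(a * y * feature_map(sv)[d]
           for a, y, sv in zip(alphas, labels, support_vectors)) for d in range(6)]

x = (0.7, -1.3)
f_kernel = decision_function(x, support_vectors, labels, alphas)
f_explicit = sum(wd * pd for wd, pd in zip(w_H, feature_map(x)))
```

The kernel route needs one kernel evaluation per support vector and never forms the six-dimensional vectors, which is the point of the kernel trick.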

Figure 8.15: Example of a polynomial kernel SVM.

Figure 8.16: Example of an RBF SVM.

The second example, shown in Figure 8.16, illustrates the performance of a radial basis function (RBF) SVM on data originating from the mixture of Gaussians. Once again, the true class labels of the data are marked with a different color and shape, while the discriminant line of RBF SVM is shown in red. Again, filled points represent support vectors.


8.4.3. ν-SVM and C-SVM

As mentioned before, we can state the optimization problem to find an optimal margin in the form
\[
\frac{1}{2} \|w\|^2 \to \min_{w, \rho} , \qquad y_i (\langle w, x_i \rangle - \rho) \ge 1, \quad i = 1, \ldots, N .
\]
This statement, however, does not allow data to be misclassified, while the dataset is not guaranteed to be linearly separable in the proposed model. For this purpose, we relax the constraints to allow data to be misclassified. To do this, we introduce slack variables, which impose a penalty for misclassification, as follows:
\[
\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \to \min_{w, \rho, \xi}
\]
w.r.t.
\[
y_i (\langle w, x_i \rangle - \rho) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N ,
\]
where $C$ is a parameter that controls the value of the misclassification penalty. The constraints imposed on each dataset element are often referred to as soft margin constraints, in contrast to hard margin constraints. In turn, we can write the Lagrangian of the problem as follows:
\[
L(w, \rho, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left( y_i (\langle w, x_i \rangle - \rho) - 1 + \xi_i \right) - \sum_{i=1}^{N} \beta_i \xi_i ,
\]
where $\alpha_i$ and $\beta_i$ are Lagrangian multipliers with a corresponding set of KKT conditions:
\[
\alpha_i \left( y_i (\langle w, x_i \rangle - \rho) - 1 + \xi_i \right) = 0, \qquad \beta_i \xi_i = 0 ,
\]
\[
y_i (\langle w, x_i \rangle - \rho) - 1 + \xi_i \ge 0, \qquad \alpha_i \ge 0, \quad \beta_i \ge 0, \quad \xi_i \ge 0 .
\]
Differentiating the Lagrangian, we obtain
\[
\frac{\partial L(w, \rho, \xi, \alpha, \beta)}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{N} \alpha_i y_i x_i ,
\]


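\[
\frac{\partial L(w, \rho, \xi, \alpha, \beta)}{\partial \rho} = 0 \ \Rightarrow \ \sum_{i=1}^{N} \alpha_i y_i = 0 , \qquad
\frac{\partial L(w, \rho, \xi, \alpha, \beta)}{\partial \xi_i} = 0 \ \Rightarrow \ C - \beta_i = \alpha_i ,
\]
and then switch to the dual problem as follows:
\[
\tilde{L}(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \to \max_{\alpha}
\]
w.r.t.
\[
0 \le \alpha_i \le C, \quad i = 1, \ldots, N , \qquad \sum_{i=1}^{N} \alpha_i y_i = 0 .
\]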

This approach is referred to as C-SVM [44]. The dual representation gives rise to an alternative way of handling misclassification, the ν-SVM [45]. For this approach, the primal problem is stated as follows:
\[
\frac{1}{2} \|w\|^2 - \nu \gamma + \frac{1}{N} \sum_{i=1}^{N} \xi_i \to \min_{w, \rho, \gamma, \xi}
\]
w.r.t.
\[
y_i (\langle w, x_i \rangle - \rho) \ge \gamma - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N .
\]
The corresponding dual problem is the following:
\[
\tilde{L}(\alpha) = - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \to \max_{\alpha}
\]
w.r.t.
\[
0 \le \alpha_i \le \frac{1}{N}, \quad i = 1, \ldots, N , \qquad \sum_{i=1}^{N} \alpha_i y_i = 0 , \qquad \sum_{i=1}^{N} \alpha_i \ge \nu .
\]
Here, the parameter $\nu$ gives a lower bound for the fraction of support vectors, as well as an upper bound for the fraction of margin errors in the case of $\gamma > 0$. The first statement can be justified by noting that the maximum contribution of each vector to the last inequality, $\sum_{i=1}^{N} \alpha_i \ge \nu$, is $\frac{1}{N}$. For this reason, the overall count of support vectors cannot be lower than $\nu N$. The last statement can be proved using


the KKT conditions of the primal problem [45]. It is possible to prove that this formulation of the task is equivalent to C-SVM with $C = \frac{1}{N \gamma}$, if $\gamma > 0$.

8.4.4. ε-SVR and ν-SVR

SVMs are not restricted only to classification problems; they can also address regression problems. This family of methods is referred to as support vector regression (SVR). Given a learning set $X_N = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $y_i \in \mathbb{R}$, we aim to find a regression function $\hat{f}(x)$ that approximates an unknown target function $f : X \to \mathbb{R}$. The regression function has the following form: $\hat{f}(x) = \langle w, x \rangle - \rho$. We state the problem as a minimization of the regularized error function
\[
E_1(\hat{f}, w) = \frac{1}{2} \sum_{i=1}^{N} E\{ \hat{f}(x_i) - y_i \} + \frac{\lambda \|w\|^2}{2} , \quad \lambda \ge 0 .
\]
Here, we do not use a standard error function
\[
E\{ \hat{f}(x_i) - y_i \} = \left( \hat{f}(x_i) - y_i \right)^2 ,
\]
as we aim to obtain a sparse solution. Instead, we use
\[
E\{ \hat{f}(x_i) - y_i \} =
\begin{cases}
\left| \hat{f}(x_i) - y_i \right| - \varepsilon, & \text{if } \left| \hat{f}(x_i) - y_i \right| > \varepsilon , \\
0, & \text{otherwise}.
\end{cases}
\]
Again, the statement is augmented by slack variables, which penalize the exit from $\left| \hat{f}(x_i) - y_i \right| < \varepsilon$. Introducing slack variables, we convert the problem into the form
\[
E(\xi^1, \xi^2, w) = C \sum_{i=1}^{N} \left( \xi_i^1 + \xi_i^2 \right) + \frac{\|w\|^2}{2} \to \min_{\xi^1, \xi^2, w} ,
\]
\[
\hat{f}(x_i) - y_i \ge -\varepsilon - \xi_i^1, \quad \xi_i^1 \ge 0 \quad \forall i \in 1, \ldots, N ,
\]
\[
\hat{f}(x_i) - y_i \le \varepsilon + \xi_i^2, \quad \xi_i^2 \ge 0 \quad \forall i \in 1, \ldots, N .
\]
Here, $C = \frac{1}{2} \lambda^{-1}$.
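The relation between the ε-insensitive error function and the two slack variables can be checked on a few residuals $\hat{f}(x_i) - y_i$. In the sketch below the function names, the value $\varepsilon = 0.5$, and the residuals are illustrative assumptions; for each residual, the smallest slacks satisfying the two constraints are computed, and their sum reproduces the ε-insensitive error.

```python
def eps_insensitive(residual, eps):
    """|residual| - eps if |residual| > eps, else 0."""
    return max(0.0, abs(residual) - eps)

def minimal_slacks(residual, eps):
    """Smallest (xi1, xi2) >= 0 with -eps - xi1 <= residual <= eps + xi2."""
    xi1 = max(0.0, -residual - eps)   # active when the model under-predicts by more than eps
    xi2 = max(0.0, residual - eps)    # active when the model over-predicts by more than eps
    return xi1, xi2

eps = 0.5
residuals = [-2.0, -0.25, 0.0, 0.5, 1.5]          # hypothetical values of f_hat(x_i) - y_i

losses = [eps_insensitive(r, eps) for r in residuals]
slacks = [minimal_slacks(r, eps) for r in residuals]
total_slack = sum(xi1 + xi2 for xi1, xi2 in slacks)
```

Residuals inside the ε-tube contribute nothing to the objective, which is the source of sparsity in the resulting solution.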


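As before, we introduce a Lagrangian to convert the problem into dual representation as follows:

$$L(w, \xi^1, \xi^2, \beta^1, \beta^2, \alpha^1, \alpha^2, \rho) = C \sum_{i=1}^{N} (\xi_i^1 + \xi_i^2) + \frac{\|w\|^2}{2} - \sum_{i=1}^{N} (\beta_i^1 \xi_i^1 + \beta_i^2 \xi_i^2) - \sum_{i=1}^{N} \alpha_i^1 (\varepsilon + \xi_i^1 + \hat f(x_i) - y_i) - \sum_{i=1}^{N} \alpha_i^2 (\varepsilon + \xi_i^2 - \hat f(x_i) + y_i).$$

Differentiating, we obtain

$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} x_i (\alpha_i^1 - \alpha_i^2),$$

$$\frac{\partial L}{\partial \xi_i^1} = 0 \Rightarrow C = \alpha_i^1 + \beta_i^1,$$

$$\frac{\partial L}{\partial \xi_i^2} = 0 \Rightarrow C = \alpha_i^2 + \beta_i^2,$$

$$\frac{\partial L}{\partial \rho} = 0 \Rightarrow \sum_{i=1}^{N} (\alpha_i^1 - \alpha_i^2) = 0.$$

Finally, using the KKT conditions, we can obtain the final expression of the dual problem:

$$\tilde L(\alpha^1, \alpha^2) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i^1 - \alpha_i^2)(\alpha_j^1 - \alpha_j^2) \langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{N} (\alpha_i^1 + \alpha_i^2) + \sum_{i=1}^{N} (\alpha_i^1 - \alpha_i^2) y_i \to \max_{\alpha^1, \alpha^2}$$

$$\alpha_i^1 \ge 0, \quad \alpha_i^1 \le C, \quad \alpha_i^2 \ge 0, \quad \alpha_i^2 \le C.$$

It should also be mentioned that because the Lagrangian is dependent on $x_i$ in the form of scalar products, one can apply a standard kernel trick. The described approach is referred to as ε-SVR [46].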


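Similar to the ν-SVM, we can define the ν-SVR method [45]. The primal problem is formulated as follows:

$$E(\xi^1, \xi^2, w, \varepsilon) = \frac{C}{N} \sum_{i=1}^{N} (\xi_i^1 + \xi_i^2) + C \varepsilon \nu + \frac{\|w\|^2}{2} \to \min_{\xi^1, \xi^2, w, \varepsilon}$$

$$\hat f(x_i) - y_i \ge -\varepsilon - \xi_i^1, \quad \xi_i^1 \ge 0 \quad \forall i \in 1, \dots, N,$$

$$\hat f(x_i) - y_i \le \varepsilon + \xi_i^2, \quad \xi_i^2 \ge 0 \quad \forall i \in 1, \dots, N,$$

$$\varepsilon \ge 0.$$

The dual problem, in this case, is written as

$$L(\alpha^1, \alpha^2) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i^1 - \alpha_i^2)(\alpha_j^1 - \alpha_j^2) \langle x_i, x_j \rangle + \sum_{i=1}^{N} (\alpha_i^1 - \alpha_i^2) y_i \to \max_{\alpha^1, \alpha^2},$$

$$0 \le \alpha_i^1 \le \frac{C}{N}, \quad 0 \le \alpha_i^2 \le \frac{C}{N}, \quad \sum_{i=1}^{N} (\alpha_i^1 - \alpha_i^2) = 0, \quad \sum_{i=1}^{N} (\alpha_i^1 + \alpha_i^2) \le \nu C.$$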

ε-SVR regression for the function $y = x^2$ is shown in Figures 8.17, 8.18, and 8.19. We compare the models for different counts of learning points. The support vectors are marked by black boxes around learning points. The Gaussian radial basis kernel is chosen for this example. The predictive performance improves as the number of points increases. Here, we can see that sparsity of the solution is exposed again. In Figure 8.19, we have 30 learning points, only 18 of which are support vectors.

Figure 8.17: ε-SVR regression of the function $y = x^2$; four learning points. (Legend: training, function, test.)


Figure 8.18: ε-SVR regression of the function $y = x^2$; 10 learning points. (Legend: training, function, test.)

Figure 8.19: ε-SVR regression of the function $y = x^2$; 30 learning points. (Legend: training, function, test.)

8.4.5. Multiclass SVM

A multiclass SVM classifier is assumed to assign labels to the input objects, where the set of labels is finite and can consist of two (or, generally, more) elements. Formally, the classifier is trained over a sample set $X^N = \{(x_1, y_1), \dots, (x_N, y_N)\}$, $x_i \in \mathbb{R}^n$, $y_i \in \{L_1, \dots, L_R\}$, $R \ge 2$. The SVM classification model was initially adapted for two-class problems, and there is no common approach for solving multiclass classification problems. We present three main approaches for multiclass SVM adaptation.

• One-versus-all approach. For each of the R classes, a two-class SVM classifier is trained to separate the given class from the other classes. If we denote the model separating r-th class as

$$y_r(x) = \mathrm{sign}(f_r(x)) = \mathrm{sign}\left( \sum_{i=1}^{N} y_i^r \alpha_i^r K(x, x_i) + \rho_r \right), \qquad y_i^r = \begin{cases} 1, & \text{if } y_i = L_r, \\ -1, & \text{otherwise,} \end{cases}$$

the classification of the input vector x is performed using a winner-takes-all strategy:

$$y(x) = L_{r^*}, \quad r^* = \arg\max_{r = 1, \dots, R} [f_r(x)].$$

The proposed method requires training R classifiers. However, this architecture suffers from the so-called "imbalanced learning problem." This arises when the training set for one of the classes has significantly fewer samples than all the rest in sum.

• One-versus-one strategy. For each unordered pair of classes $(L_r, L_{r'})$, a two-class SVM classifier is trained, which separates objects in class $L_r$ from objects in class $L_{r'}$, disregarding the other classes. Let us denote such pairwise classifiers as follows: $y_{rr'}(x)$, $y_{rr'} : \mathbb{R}^n \to \{L_r, L_{r'}\}$. To classify a new incoming object x, the following procedure is performed. First, we need to calculate how many times each class occurs as a winner in pair classification: $n_r(x) = |\{L_{r'} \mid y_{rr'}(x) = L_r\}|$. Then, a class with maximal "pairwise wins" is selected: $y(x) = L_{r^*}$, $r^* = \arg\max_{r = 1, \dots, R} [n_r(x)]$. This approach requires $\frac{R(R-1)}{2}$ classifiers to be trained, which is computationally complex for large values of R. This architecture can be exceptionally useful to tackle the imbalanced learning problem, where the samples from all the classes but one have a clear majority over the remaining class. To address this, a weighted voting procedure can be applied with some pre-defined preference matrix with confidence weights for different pairs of classifiers [47]. A comparison of different strategies is given in [48].

• Directed acyclic graph (DAG) classification [49]. This approach is similar to the one-versus-one strategy because a classifier should be fitted for each pair of classes. However, the final decision of the input label is produced not by applying all $\frac{R(R-1)}{2}$ classifiers, but just R − 1. For classification, a decision tree is produced so that each class corresponds to a single leaf. Each node of the graph is linked to one of the $\frac{R(R-1)}{2}$ classifiers, where each outgoing edge is linked to the corresponding classifier's result. One of the nodes is selected as a root, and starting from the root, the classification process descends to one of the leaves. An example of DAG for four classes is given in Figure 8.20. The main advantage of DAG is its learning simplicity compared to the one-versus-one strategy. However, the classification results are usually not stable because only a subset of the classifiers is engaged in decision-making.

Figure 8.20: Directed acyclic graph classification. (Decision nodes: "1 vs 4?", "1 vs 3?", "2 vs 4?", "1 vs 2?", "2 vs 3?", "3 vs 4?"; leaves: class 1, class 2, class 3, class 4.)
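The difference between one-versus-one voting and DAG evaluation can be sketched as follows. The pairwise classifier used here (a nearest-centre rule on scalar inputs) is only a hypothetical stand-in for trained two-class SVMs, and the elimination order in `dag` (compare the first and last remaining classes, drop the loser) is an assumed node layout rather than the one prescribed by [49].

```python
from itertools import combinations


def make_pairwise(centers):
    """Hypothetical stand-in for trained pairwise SVMs: the classifier for
    the pair (r, s) returns the class whose centre is closer to x."""
    def classify(r, s, x):
        return r if abs(x - centers[r]) <= abs(x - centers[s]) else s
    return classify


def one_vs_one(x, labels, classify):
    """Evaluate all R(R-1)/2 pairwise classifiers and return the class
    with the largest number of pairwise wins."""
    wins = {r: 0 for r in labels}
    for r, s in combinations(labels, 2):
        wins[classify(r, s, x)] += 1
    return max(labels, key=lambda r: wins[r])


def dag(x, labels, classify):
    """Evaluate only R-1 classifiers: each test removes one candidate."""
    remaining = list(labels)
    evaluations = 0
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = classify(first, last, x)
        evaluations += 1
        if winner == first:
            remaining.pop()     # the last candidate is eliminated
        else:
            remaining.pop(0)    # the first candidate is eliminated
    return remaining[0], evaluations


labels = [1, 2, 3, 4]
classify = make_pairwise({1: -3.0, 2: -1.0, 3: 1.0, 4: 3.0})
ovo_label = one_vs_one(0.8, labels, classify)
dag_label, dag_evaluations = dag(0.8, labels, classify)
```

In this toy setting both strategies return the same label, but the DAG reaches it after three classifier evaluations instead of six.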

8.4.6. One-Class SVM

A special case of classification problems is one-class classification. This classification problem aims to distinguish between "normal" data and outliers. Such problems emerge when the training set consists only of examples of one class, under the assumption that other classes exist. One class may be well structured and described, while the other may be unstructured and unclear. This problem can be viewed as a type of probability distribution estimation or as a form of anomaly detection.


Schölkopf et al. [50] showed that an SVM-based approach is applicable to this class of problems. The idea of a one-class SVM is to map the data onto the feature space corresponding to an appropriate kernel and to separate them from the origin with maximum margin. The quadratic program implementing this approach is given as follows:

$$\frac{1}{2} \|w\|^2 + \frac{1}{\nu N} \sum_{i=1}^{N} \xi_i - \rho \to \min_{w, \xi, \rho},$$

w.r.t. $\langle w, \Phi(x_i) \rangle \ge \rho - \xi_i$, $\xi_i \ge 0$, $i = 1, \dots, N$.

The decision function $f(x) = \mathrm{sign}(\langle w, \Phi(x) \rangle - \rho)$ describes the desired class (i.e., probability distribution or membership function, normal state opposite to anomalies), which is positive for most examples in the training set. To transfer to the dual form of the problem, we express the weights via KKT conditions:

$$w = \sum_{i=1}^{N} \alpha_i \Phi(x_i), \qquad 0 \le \alpha_i \le \frac{1}{\nu N}, \qquad \sum_{i=1}^{N} \alpha_i = 1.$$

In turn, the decision function can be rewritten as

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i K(x, x_i) - \rho \right).$$

Those objects with $\alpha_i > 0$ are support vectors. It can be proven that for any $x_i$ with $0 < \alpha_i < \frac{1}{\nu N}$, it holds that $\langle w, \Phi(x_i) \rangle = \rho$ at the optimum and

$$\rho = \langle w, \Phi(x_i) \rangle = \sum_{j=1}^{N} \alpha_j K(x_i, x_j).$$


Finally, the dual problem is given as

$$\frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j K(x_i, x_j) \to \min_{\alpha},$$

w.r.t. $0 \le \alpha_i \le \frac{1}{\nu N}$, $\sum_{i=1}^{N} \alpha_i = 1$.
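To illustrate how the dual variables are used once they are available, the sketch below recovers ρ from a margin support vector and evaluates the kernelized decision function. The Gaussian kernel width `gamma`, the four training points, and the uniform values of `alphas` are hand-picked assumptions for illustration; they are not produced here by solving the dual problem.

```python
import math


def rbf(x, t, gamma=0.5):
    """Gaussian radial basis kernel (gamma is an assumed width parameter)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, t)))


def offset(alphas, points, nu, kernel=rbf):
    """rho = sum_j alpha_j K(x_i, x_j) for any x_i with 0 < alpha_i < 1/(nu N)."""
    upper = 1.0 / (nu * len(points))
    for a_i, x_i in zip(alphas, points):
        if 0.0 < a_i < upper:
            return sum(a_j * kernel(x_i, x_j) for a_j, x_j in zip(alphas, points))
    raise ValueError("no margin support vector found")


def decide(x, alphas, points, rho, kernel=rbf):
    """f(x) = sign(sum_i alpha_i K(x, x_i) - rho); +1 means 'normal'."""
    score = sum(a * kernel(x, p) for a, p in zip(alphas, points)) - rho
    return 1 if score >= 0 else -1


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
alphas = [0.25, 0.25, 0.25, 0.25]   # satisfies sum(alpha) = 1 and the box constraint
nu = 0.5                            # upper bound 1/(nu N) = 0.5
rho = offset(alphas, points, nu)

inside = decide((0.5, 0.5), alphas, points, rho)
far_away = decide((5.0, 5.0), alphas, points, rho)
```

A point in the middle of the training set receives a positive decision, while a distant point is flagged as an outlier.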

The parameter ν is of special interest as it possesses the following theoretically proven ν-property. If the solution to a one-class SVM problem satisfies $\rho \ne 0$, then

(1) ν is an upper bound on the fraction of outliers in the learning set (vectors $x_i$ for which $\alpha_i = \frac{1}{\nu N}$);
(2) ν is a lower bound on the fraction of SVs in the learning set (vectors $x_i$ for which $\alpha_i > 0$);
(3) If, additionally, data are generated independently from the distribution, which does not contain discrete components, and the kernel is analytic and non-constant, then with probability 1, asymptotically ν equals both the fraction of SVs and the fraction of outliers.

Figure 8.21 shows the contour lines provided by a one-class SVM learned on a set of two-dimensional points from a Gaussian mixture

$$p(x) = 0.5\, p(x \mid \mu_1, \Sigma_1) + 0.5\, p(x \mid \mu_2, \Sigma_2),$$

where $\mu_1 = (3, 3)^T$, $\Sigma_1$ is an identity matrix, $\mu_2 = (-3, -3)^T$, and

$$\Sigma_2 = \begin{pmatrix} 1 & -0.5 \\ -0.5 & 2 \end{pmatrix}.$$

Figure 8.21: Density estimation by a one-class SVM.

8.4.7. Incremental SVM Learning

Most SVM-type models admit incremental learning [51]. This means that the model can be fitted sample-per-sample or in batch mode, but the model requires the storage of all the previous learning sets. However, such algorithms could be useful in cases where the learning data are not provided at one moment but are split into batches that arrive sequentially (e.g., in streaming data analysis). In this part, we state the algorithm for incremental SVM learning (i.e., adding one more sample to the learning set for the SVM classifier).

Suppose that we have a two-class C-SVM classifier defined by its parameters (w, ρ), and estimated for the kernel $K(x, t)$ over the learning sample $X^N$. Assume that we need to add the new sample $(x_{N+1}, y_{N+1})$ to the dataset. Therefore, we need to assign a corresponding value $\alpha_{N+1}$ and to update all previous values $\alpha_i$, $i = 1, \dots, N$ and ρ. These parameters should be changed so that KKT conditions are met, which is a necessary and sufficient condition for optimality. Before discussing the algorithm itself, we should refer once again to the KKT conditions for the two-class SVM, defining three groups of samples from $X^N$ and the constraints to be met.

(1) Margin group (M-group, $M \subset X^N$): subset of vectors $x_m$ at the border of the dividing band; the corresponding KKT condition for $x_m$ is

$$M(x_m) = \sum_{i=1}^{N} \alpha_i y_m y_i K(x_i, x_m) + \rho y_m = 1 \iff \alpha_m \in (0, C).$$

(2) Correct group (C-group, $C \subset X^N$): subset of vectors $x_c$ correctly classified by the model,

$$M(x_c) = \sum_{i=1}^{N} \alpha_i y_c y_i K(x_i, x_c) + \rho y_c > 1 \iff \alpha_c = 0.$$

(3) Error group (E-group, $E \subset X^N$): subset of vectors $x_e$ lying within the dividing band or even misclassified,

$$M(x_e) = \sum_{i=1}^{N} \alpha_i y_e y_i K(x_i, x_e) + \rho y_e < 1 \iff \alpha_e = C.$$


A further KKT condition that should be considered is

$$\sum_{i=1}^{N} \alpha_i y_i = 0.$$
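The margin function and the three groups defined above can be sketched in a few lines. The function names, the tolerance `tol`, the scalar linear kernel, and the two-point model are illustrative assumptions; the model values were chosen by hand so that both training points lie exactly on the margin.

```python
def margin(x, y, alphas, ys, xs, rho, kernel):
    """M(x) = sum_i alpha_i * y * y_i * K(x_i, x) + rho * y."""
    s = sum(a * y_i * kernel(x_i, x) for a, y_i, x_i in zip(alphas, ys, xs))
    return y * (s + rho)


def group(alpha, C, tol=1e-9):
    """Assign a training vector to the C-, E- or M-group from its alpha."""
    if alpha <= tol:
        return "C"          # correctly classified, outside the band
    if alpha >= C - tol:
        return "E"          # inside the band or misclassified
    return "M"              # exactly on the margin


def linear(a, b):
    """Linear kernel for scalar inputs."""
    return a * b


# Hand-made two-point model: w = sum_i alpha_i y_i x_i = 1, rho = 0.
xs, ys = [-1.0, 1.0], [-1, 1]
alphas, rho, C = [0.5, 0.5], 0.0, 1.0

groups = [group(a, C) for a in alphas]
balance = sum(a * y for a, y in zip(alphas, ys))    # equality constraint

# A candidate sample enters with alpha = 0; only its margin matters.
far_margin = margin(3.0, 1, alphas, ys, xs, rho, linear)    # KKT already met
near_margin = margin(0.5, 1, alphas, ys, xs, rho, linear)   # KKT violated
```

A candidate whose margin exceeds 1 can be added to the C-group without changing the model, whereas a margin of at most 1 triggers an update of the existing coefficients.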

Here, we should mention that KKT conditions are necessary and sufficient for a point to be a solution to the dual optimization problem. Initially, we assign $\alpha_{N+1} = 0$, leaving the rest of the values $\alpha \in \mathbb{R}^N$ and $\rho \in \mathbb{R}$ unchanged. If margin $M(x_{N+1}) > 1$ (see C-group), then all KKT conditions are met and, therefore, learning is finished. If margin $M(x_{N+1}) \le 1$, then KKT conditions are violated. In this case, we increase $\alpha_{N+1}$ so that the following hold:

(i) Margins of all points from M-group stay equal to 1, $\forall x_m \in M$: $M(x_m) = 1$;
(ii) $\sum_{i=1}^{N} \alpha_i y_i + \alpha_{N+1} y_{N+1} = 0$;
(iii) Margins of all points from C-group stay above 1, $\forall x_c \in C$: $M(x_c) > 1$;
(iv) Margins of all points from E-group stay below 1, $\forall x_e \in E$: $M(x_e) < 1$.

Therefore, during the change of all the parameters (α, ρ), only $\alpha_{N+1}$ violates the KKT conditions. We denote the i-th vector from the C-, E-, and M-groups $x_i^c$, $x_i^e$, $x_i^m$ with labels $y_i^c$, $y_i^e$, $y_i^m$, respectively.

Consider the first two limitations. Assume that at the previous iteration, we have the model $\alpha_{old}$, $\rho_{old}$. Suppose that $\alpha^m_{old} \in \mathbb{R}^{|M|}$ is a sub-vector of $\alpha_{old}$ related to the vectors from M-group. Once again, as the new learning data $x_{N+1}$ arrive, we have to update all previous parameters of the classifier and calculate $\alpha_{N+1}$ iteratively. First, we assign $\alpha_{N+1} = 0$, which can violate KKT conditions (as mentioned before). Therefore, we must change the α values. First, we fix the values of α for C-group to 0 and for E-group to C. Hence, we change only $\alpha^m$ values. If we increase $\alpha_{N+1}$ by $\Delta\alpha_{N+1} > 0$, we can calculate the corresponding $\Delta\alpha^m$, $\Delta\rho$ while holding i and ii.

$$F^m \begin{bmatrix} \alpha^m_{old} + \Delta\alpha^m \\ \rho + \Delta\rho \end{bmatrix} + k^m_{N+1} (\alpha_{N+1} + \Delta\alpha_{N+1}) = 1,$$

where

$$F^m = \begin{bmatrix} (y^m)^T & 0 \\ Q^{mm} & y^m \end{bmatrix} \in \mathbb{R}^{(|M|+1) \times (|M|+1)}, \qquad y^m = \left( y_1^m, \dots, y_i^m, \dots, y_{|M|}^m \right)^T,$$

$$k^m_{N+1} = \begin{bmatrix} y_{N+1} \\ y_{N+1} y_1^m \langle x_1^m, x_{N+1} \rangle \\ \dots \\ y_{N+1} y_i^m \langle x_i^m, x_{N+1} \rangle \\ \dots \\ y_{N+1} y_{|M|}^m \langle x_{|M|}^m, x_{N+1} \rangle \end{bmatrix} \in \mathbb{R}^{|M|+1},$$

$$Q^{mm}_{ij} = y_i^m y_j^m \langle x_i^m, x_j^m \rangle, \quad i, j = 1, \dots, |M|.$$

$Q^{mm}$ is a matrix of inner products of the objects in the M-group. Therefore, considering that conditions i and ii are met for $\alpha_{old}$, $\rho_{old}$ and $\alpha_{N+1}$, it is possible to derive the following:

$$\begin{bmatrix} \Delta\alpha^m \\ \Delta\rho \end{bmatrix} = -(F^m)^{-1} k^m_{N+1} \Delta\alpha_{N+1}.$$

However, considering conditions i to iv, we can claim that $\Delta\alpha_{N+1}$ is bounded, and we increase it until one of the following conditions is met:

(1) For one of the data vectors in the C-group, the value of $M(x_i^c)$ reaches 1. Then this data vector is moved to the M-group with the corresponding $\alpha_i^c$ set to 0, $\alpha_{N+1}$ is increased by the corresponding $\Delta\alpha_{N+1}$, and the procedure continues with the new structure of the data subsets.
(2) For one of the data vectors in the E-group, the value of $M(x_i^e)$ reaches 1. Then this data vector is moved to the M-group with the corresponding $\alpha_i^e$ set to C, $\alpha_{N+1}$ is increased by the corresponding $\Delta\alpha_{N+1}$, and the procedure continues with the new structure of the data subsets.
(3) For one of the data vectors in the M-group, the value of $\alpha_i^m$ reaches $\alpha_i^m = C$. Then this data vector is moved to the E-group with the corresponding $\alpha_i^m$ set to C, $\alpha_{N+1}$ is increased by the corresponding $\Delta\alpha_{N+1}$, and the procedure continues with the new structure of the data subsets.
(4) For one of the data vectors in the M-group, the value of $\alpha_i^m$ reaches 0. Then this data vector is moved to the C-group with the corresponding $\alpha_i^m$ set to 0, $\alpha_{N+1}$ is increased by the corresponding $\Delta\alpha_{N+1}$, and the procedure continues with the new structure of the data subsets.
(5) $\alpha_{N+1} = C$. Then the new data vector is moved to the E-group, and the algorithm terminates.
(6) $M(x_{N+1}) = 1$. Then the new data vector is moved to the M-group with the current value of $\alpha_{N+1}$ and the algorithm terminates.

If conditions 1 to 4 are met, then the structure of the E, C, and M groups should be re-arranged. This means that the vector $\alpha^m$ should be increased by $\Delta\alpha^m$ and augmented/reduced by the corresponding α of the migrating vector. Matrix $F^m$ should be recalculated. $\alpha_{N+1}$ is increased by $\Delta\alpha_{N+1}$. Then, a new iteration of the described procedure is performed taking into account the new structures of the sets C, E, and M. Iterations are repeated until KKT conditions are met, which corresponds to condition 5 or 6. For each condition from 1 to 4, an effective computational technique exists that allows the inverse matrix $(F^m)^{-1}$ for the M-group to be calculated iteratively using the Woodbury formula. Moreover, it is usually the case that the M-group is relatively small.

8.4.8. Deep Learning with SVM

Due to the structure of the SVM and the explicit minimization of classifier complexity, the SVM has an attractive property of high generalization capability compared to many other machine learning methods. At the same time, it largely depends on the input features, which can be noisy, very complex (e.g., having spatial correlations), or dangerously large (due to the curse of dimensionality). Although the kernel trick can be regarded as a feature extraction method, it is a natural intention to couple the generalization power of the SVM with some powerful feature selection or extraction techniques.

Artificial neural networks (ANNs) are one of the most popular machine learning methods that can be successfully applied to a large number of classification and regression tasks. The very structure of ANNs allows dividing them into two logical parts: feature extractor and decision rule. For example, in a neural network such as AlexNet [62], convolutional layers extract features out of pixel intensities (i.e., serve as a feature extractor), while the top fully connected layers do the decision making (i.e., act as a decision rule). This interpretation of the ANN architecture makes it easy to come to the idea of replacing the top decision-making layers with another classifier. Indeed, many researchers substitute the head of a trained network with an SVM classifier and train it while keeping the network features fixed [64, 65]. Alternatively, unsupervised ANN models such as autoencoders can be used to generate features for the SVM classifier. The approach is attractive and helps decrease error and increase accuracy for many classification problems. One may notice, however, that in this case, feature extraction is separated from classifier optimization, which is the opposite of the


original ANN architecture wherein the features are learned to minimize the error function explicitly. In [63], the authors demonstrated the feasibility of adding an SVM head to an ANN feature extractor in such a way as to train the features with regard to the SVM target directly. Let us denote the outputs of the last feature extraction layer as $x_n$, $n = 1, \dots, N$, where n is an input image index. Then, the SVM optimization takes the form

$$\min_w \; w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n y_n, 0)$$

for the hinge loss or

$$\min_w \; w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n y_n, 0)^2$$

for the squared hinge loss. The latter variant is differentiable and penalizes errors more strongly. Now, to facilitate the standard backpropagation algorithm, we need to ensure that the SVM loss function is differentiable with respect to the features of the last layer. Fortunately, this can be done easily. If we denote the loss function as $L(w)$, then

$$\frac{\partial L(w)}{\partial x_n} = -C y_n w \times \mathbb{I}\left( w^T x_n y_n < 1 \right)$$

in the case of hinge loss and

$$\frac{\partial L(w)}{\partial x_n} = -2 C y_n w \times \max(1 - w^T x_n y_n, 0)$$

in the case of squared hinge loss. At this point, feature extraction is incorporated directly into the SVM objective, and the features are learned to minimize SVM loss. This leads to better performance on tasks such as facial expression recognition, MNIST, and CIFAR-10.

Some recent works have also shown connections between kernel learning and ANNs. This includes studies such as [61], which show the equivalence of ANNs of infinite width to Gaussian processes. Following these results, it is clear that kernel methods have provided important insights into the generalization capacities of ANNs, particularly by studying their training in functional spaces rather than finite-dimensional vector spaces.
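The two gradients with respect to the features can be checked numerically. The sketch below implements the per-sample data terms of both losses and compares the analytic gradients given above with central finite differences; the helper names and the numeric values of `w`, `x`, `y`, and `C` are assumptions made for illustration only.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def hinge(w, x, y, C):
    """Per-sample hinge term C * max(1 - w^T x y, 0)."""
    return C * max(1.0 - y * dot(w, x), 0.0)


def hinge_grad_x(w, x, y, C):
    """-C y w if the margin is violated (w^T x y < 1), zero otherwise."""
    active = 1.0 if y * dot(w, x) < 1.0 else 0.0
    return [-C * y * w_k * active for w_k in w]


def sq_hinge(w, x, y, C):
    """Per-sample squared hinge term C * max(1 - w^T x y, 0)^2."""
    return C * max(1.0 - y * dot(w, x), 0.0) ** 2


def sq_hinge_grad_x(w, x, y, C):
    """-2 C y w max(1 - w^T x y, 0)."""
    m = max(1.0 - y * dot(w, x), 0.0)
    return [-2.0 * C * y * w_k * m for w_k in w]


def numeric_grad(loss, w, x, y, C, h=1e-6):
    """Central finite differences with respect to the features x."""
    grad = []
    for k in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[k] += h
        x_minus[k] -= h
        grad.append((loss(w, x_plus, y, C) - loss(w, x_minus, y, C)) / (2 * h))
    return grad


w, x, y, C = [0.5, -1.0], [0.2, 0.3], 1, 2.0   # here w^T x y = -0.2 < 1
g_hinge = hinge_grad_x(w, x, y, C)
g_sq = sq_hinge_grad_x(w, x, y, C)
```

These per-sample gradients are the quantities that backpropagation passes from the SVM head to the last feature extraction layer.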


8.4.9. Metric Learning

At first glance, the problem of metric learning seems to be unrelated to the issue of SVM applications. However, the SVM approach can be fruitful for learning a metric. The problem of metric learning involves finding a metric that pulls objects of the same class closer and pushes objects from different classes as far away as possible from one another. It is common to find this metric in a family of Mahalanobis metrics parameterized by a positive semidefinite matrix $M \in \mathbb{R}^{n \times n}$,

$$\rho_M(x, t) = (x - t)^T M (x - t), \quad M = M^T, \quad M \succeq 0.$$

Although the Mahalanobis metric has a probabilistic origin (i.e., in that it is used to measure the distance between two random vectors), it can be seen as a Euclidean distance in a linearly transformed space. In other words, for any Mahalanobis metric $\rho_M$, there exists a linear map specified by matrix $L \in \mathbb{R}^{k \times n}$ so that

$$\rho_M(x, t) = (x - t)^T M (x - t) = (x - t)^T L^T L (x - t) = \|L(x - t)\|^2.$$

Starting from this point, most metric learning methods seek to find the best (under some conditions) linear map L. It may be desired that in the new space for each object, its distance to all neighbors of the same class is smaller than its distance to any neighbors of different classes. This leads to the following constraints on the Mahalanobis metric $\rho_M$:

$$\min_{\substack{x_j \in N(x_i) \\ y_j \ne y_i}} \rho_M(x_i, x_j) \ge \max_{\substack{x_j \in N(x_i) \\ y_j = y_i}} \rho_M(x_i, x_j) + 1, \quad \forall (x_i, y_i) \in X,$$

where $N(x_i)$ is the neighborhood of $x_i$ determined by Euclidean distance. Finding the Mahalanobis metric with the smallest possible Frobenius norm results in an MLSVM model [52]:

$$\frac{\lambda}{2} \|M\|_F^2 + \frac{1}{N} \sum_{i=1}^{N} \xi_i \to \min_{M \succeq 0, \; \xi \ge 0}$$

w.r.t.

$$\min_{\substack{x_j \in N(x_i) \\ y_j \ne y_i}} \rho_M(x_i, x_j) \ge \max_{\substack{x_j \in N(x_i) \\ y_j = y_i}} \rho_M(x_i, x_j) + 1 - \xi_i, \quad \xi_i \ge 0.$$

Note that $\rho_M$ depends linearly on the matrix M. This leads to the quadratic programming problem linking MLSVM to the SVM approach. Slack variables $\xi_i$ are


introduced, as in an SVM, to obtain the soft margin constraints. This optimization problem can be solved using a modified Pegasos method [52] without transforming it into a dual form.

Other approaches for metric learning involve finding a linear transformation L such that the transformed target space neighborhood $N(x_i)$ consists of objects of the same class as $x_i$ only. Objects $x_j$ in this neighborhood are called target neighbors and denoted as $j \rightsquigarrow i$. Notice that this motivation leads to better performance of the kNN algorithm in the transformed space. In the original space, some objects of different classes may be closer (in terms of Euclidean distance) to $x_i$ than target neighbors. These objects are impostors $x_k$ and are defined as follows:

$$\|L(x_i - x_k)\|^2 \le \|L(x_i - x_j)\|^2 + 1, \quad j \rightsquigarrow i, \quad y_i \ne y_k.$$

The loss function in this LMNN model [53] differs from the one in the MLSVM. It is represented as a sum of two terms, one pulling target neighbors closer together and the other pushing impostors apart. Denoting

$$y_{ij} = \begin{cases} 1, & y_i = y_j, \\ 0, & y_i \ne y_j, \end{cases}$$

we obtain the following optimization problem:

$$(1 - \mu) \sum_{i, \, j \rightsquigarrow i} \|L(x_i - x_j)\|^2 + \mu \sum_{i, \, j \rightsquigarrow i} \sum_{k} (1 - y_{ik}) \left[ 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_k)\|^2 \right]_+ \to \min_L.$$

This optimization problem can be converted into an SVM-like form as follows:

$$(1 - \mu) \sum_{i, \, j \rightsquigarrow i} \|L(x_i - x_j)\|^2 + \mu \sum_{i, \, j \rightsquigarrow i} \sum_{k} (1 - y_{ik}) \xi_{ijk} \to \min_{L, \xi}$$

w.r.t.

$$\|L(x_i - x_k)\|^2 - \|L(x_i - x_j)\|^2 \ge 1 - \xi_{ijk}, \quad \forall i, \; j \rightsquigarrow i, \; k : y_i \ne y_k,$$

$$\xi_{ijk} \ge 0, \quad M = L^T L \succeq 0.$$


Notice that all functions in both models depend on the input objects only through inner products. This allows the kernel trick and nonlinearity to be incorporated into metric learning.
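The identity $\rho_M(x, t) = \|L(x - t)\|^2$ with $M = L^T L$ and the impostor condition can be illustrated with a small sketch. The matrix `L` and the points below are arbitrary assumed values, and the helper names are hypothetical.

```python
def gram(L):
    """M = L^T L for a k x n matrix L given as a list of rows."""
    n = len(L[0])
    return [[sum(row[i] * row[j] for row in L) for j in range(n)]
            for i in range(n)]


def rho_M(M, x, t):
    """Mahalanobis form (x - t)^T M (x - t)."""
    d = [a - b for a, b in zip(x, t)]
    return sum(d[i] * M[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))


def rho_L(L, x, t):
    """Squared Euclidean distance after the linear map: ||L (x - t)||^2."""
    d = [a - b for a, b in zip(x, t)]
    z = [sum(l_k * d_k for l_k, d_k in zip(row, d)) for row in L]
    return sum(c * c for c in z)


def is_impostor(L, x_i, x_j, x_k):
    """x_k (of another class) invades the unit margin around the target
    neighbor x_j of x_i in the transformed space."""
    return rho_L(L, x_i, x_k) <= rho_L(L, x_i, x_j) + 1.0


L = [[2.0, 0.0],
     [1.0, 1.0]]
M = gram(L)                       # [[5, 1], [1, 1]]
x, t = [1.0, 2.0], [0.0, 0.0]
d_M = rho_M(M, x, t)
d_L = rho_L(L, x, t)
```

Both computations give the same distance, which is why learning M and learning L are interchangeable up to the constraint $M \succeq 0$.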

8.4.10. Structured SVM

Here, we expand the classification problem, relaxing the restrictions for the labels set in the way described in [43]. As usual, we consider a sample set $X \subseteq \mathbb{R}^n$ and associate each of its elements with some labels from a finite set Y. In the algorithms we referred to before, it is assumed that label $y \in Y \subseteq \mathbb{R}$, but here, we abandon this constraint and allow labels to be any arbitrary objects (e.g., strings or graphs), which are often referred to as "structured." As before, we define a map $f : X \to Y$ and a training set $X^N = \{(x_1, y_1), \dots, (x_N, y_N)\}$, and we are trying to learn a map $\hat f : X \to Y$ to give an approximation of f. To tackle a problem with such complex structured labels, we also define a discriminant function

$$F(x, y \mid w) : X \times Y \to \mathbb{R},$$

where w is a vector of parameters. This function yields the maximum for the most "suitable" pairs of x and y, and so it should represent the map $\hat f$ in the form

$$\hat f(x \mid w) = \arg\max_{y \in Y} F(x, y \mid w).$$

To use common optimization schemes, we restrict the function $F(x, y \mid w)$ to depend linearly on the parameters

$$F(x, y \mid w) = \langle w, \Psi(x, y) \rangle.$$

Here, Ψ is a function, dependent on the data samples, referred to as the "joint feature map," $\Psi : X \times Y \to \mathbb{R}^K$, $w \in \mathbb{R}^K$. If, for example, $Y = \{1, \dots, R\} \subset \mathbb{N}$ and

$$\Psi(x, y) = \{[y = 1] x, [y = 2] x, \dots, [y = R] x\} \in \mathbb{R}^{R \times n},$$

then the method performs as a multiclass SVM. The maximization problem can be reformulated [54] as a maximum margin problem:

$$\frac{1}{2} \|w\|^2 \to \min_w$$


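with respect to constraints defining that the maximum is reached only on the given labeling $\hat f$. These constraints can be written for all $x_i$, $y_i = f(x_i)$ as follows:

$$\max_{y \in Y \setminus y_i} \langle w, \Psi(x_i, y) \rangle < \langle w, \Psi(x_i, y_i) \rangle.$$

Importantly, these constraints ensure that we have an exact labeling. Although they are nonlinear, they can be replaced with $N(|Y| - 1)$ linear constraints as follows:

$$\forall (x_i, y_i) \in X^N, \quad \forall y \in Y \setminus y_i : \quad \langle w, \delta\Psi(x_i, y) \rangle > 0,$$

where $\delta\Psi(x_i, y) = \Psi(x_i, y_i) - \Psi(x_i, y)$. As in an ordinary SVM, w can be rescaled so that it satisfies the following conditions:

$$\forall x_i, \; y_i, \; y \in Y \setminus y_i : \quad \langle w, \delta\Psi(x_i, y) \rangle \ge 1.$$

To allow classifier errors on the dataset, slack variables $\xi_i$ are introduced:

$$\frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i \to \min_{w, \xi}$$

w.r.t. $\forall x_i, \; y_i, \; y \in Y \setminus y_i$: $\langle w, \delta\Psi(x_i, y) \rangle \ge 1 - \xi_i$, $\xi_i \ge 0$, $1 \le i \le N$.

The dual version of the problem can be derived by introducing Lagrangian multipliers:

$$L(w, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \beta_i \xi_i - \sum_{i=1}^{N} \sum_{y \in Y \setminus y_i} \alpha_{iy} \left( \langle w, \delta\Psi(x_i, y) \rangle - 1 + \xi_i \right),$$

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \sum_{y \in Y \setminus y_i} \alpha_{iy} \, \delta\Psi(x_i, y) = 0,$$

$$\frac{\partial L}{\partial \xi_i} = \frac{C}{N} - \beta_i - \sum_{y \in Y \setminus y_i} \alpha_{iy} = 0.$$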


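The kernel trick can be applied as usual, and the dual program should be written as follows:

$$\sum_{i=1}^{N} \sum_{y \in Y \setminus y_i} \alpha_{iy} - \frac{1}{2} \sum_{i=1}^{N} \sum_{y \in Y \setminus y_i} \sum_{j=1}^{N} \sum_{y' \in Y \setminus y_j} \alpha_{iy} \alpha_{jy'} K((x_i, y), (x_j, y')) \to \max_{\alpha},$$

$$\alpha_{iy} \ge 0, \quad i = 1, \dots, N, \; y \in Y \setminus y_i, \qquad \sum_{y \in Y \setminus y_i} \alpha_{iy} \le \frac{C}{N} \quad \forall i = 1, \dots, N.$$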
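For the multiclass joint feature map introduced earlier in this section, prediction and the margin constraints $\langle w, \delta\Psi(x_i, y) \rangle \ge 1$ can be sketched as follows. The weight vector `w`, the inputs, and the helper names are assumed values chosen for illustration; labels are numbered from 1 to R as in the text.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def joint_feature_map(x, y, R):
    """Psi(x, y): x copied into the y-th block (1-based), zeros elsewhere."""
    psi = [0.0] * (R * len(x))
    start = (y - 1) * len(x)
    psi[start:start + len(x)] = list(x)
    return psi


def predict(w, x, R):
    """f(x | w) = argmax_y <w, Psi(x, y)>."""
    return max(range(1, R + 1),
               key=lambda y: dot(w, joint_feature_map(x, y, R)))


def violated_labels(w, x, y_true, R):
    """Labels y != y_true whose constraint <w, Psi(x, y_true) - Psi(x, y)> >= 1
    is not satisfied (these would need a positive slack)."""
    reference = dot(w, joint_feature_map(x, y_true, R))
    return [y for y in range(1, R + 1)
            if y != y_true
            and reference - dot(w, joint_feature_map(x, y, R)) < 1.0]


R = 3
w = [1.0, 0.0,     # block scoring label 1
     0.0, 1.0,     # block scoring label 2
     -1.0, -1.0]   # block scoring label 3

label = predict(w, [2.0, 0.5], R)
no_violation = violated_labels(w, [2.0, 0.5], 1, R)
violation = violated_labels(w, [1.0, 0.8], 1, R)
```

With this map, each block of w plays the role of one class-specific weight vector, so the argmax reduces to the winner-takes-all rule of a multiclass SVM.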

Several alternative formulations of the problem exist, which differ in terms of the way slack variables are defined, but they are not ideologically different. These alternatives can be found in [43].

The critical problem with the described method arises from the large number of constraints, which leads to the exponential complexity of the optimization algorithm with respect to the "dimensionality" of the space Y. To avoid this problem, the cutting plane relaxation technique is used, which provides polynomial complexity but considers only some of the constraints, namely those giving the most severe restrictions. That is, the optimization is run within a loop in which the constraint list is extended until no constraint is violated by more than some prespecified value ε.

It is possible to use the structured SVM approach to solve various tasks once a suitable problem-specific joint feature map Ψ is chosen, taking contributions from both input and output. Given some problem-specific function Ψ, the structured SVM can be used for image segmentation [55] and object detection problems [56], protein alignment structure identification [57], and other applications.

For the case of natural grammar analysis, we have a grammar that incorporates grammar rules $G_i$. String parsing can be viewed as a process of grammar rule application. The hierarchy of rule applications is organized into a tree. The function Ψ then represents the histogram of grammar rules application event frequency [58]. Then, the structure $y \in Y$, which is found by optimization of the function $F(x, y \mid w)$, represents a parse tree prediction.

Let us consider in more detail the problem of image segmentation as a structured SVM application. For each of the pixels in an image I, we should assign an integer value from $Y = \{1, \dots, K\}$, which is referred to as a segment label. Let us define a graph (V, E) with a set of vertices V, each corresponding to one pixel of the image I, and a set of edges E, giving a neighborhood relation. The assignment of the segment labels can be done within the conditional random field (CRF) framework [58]. An energy, or cost, functional is considered, which assigns a penalty to each element label and to the joint assignment of labels to neighbors.


Alternatively, it can be interpreted as a graph labeling problem for the graph (V, E), expressed as follows:

$$E_w(Y) = \sum_{v_i \in V} U_i(v_i, y_i) + \sum_{(v_i, v_j) \in E} D_{ij}(v_i, v_j, y_i, y_j) \to \min_y.$$

Optimization problems of this kind are usually solved using algorithms such as graph-cuts or α-expansion [59, 60]. Here U is referred to as unary potential and D denotes pairwise potential. However, there remains the problem of defining these potentials. Suppose that we have a learning set of images of the same size with the corresponding "ground truth" segmentations, namely:

$$\{(I_1, Y_1), (I_2, Y_2), (I_3, Y_3), \dots, (I_N, Y_N)\}, \quad I_i \in \mathbb{R}^{n \times m}, \quad Y_i \in \{1, \dots, K\}^{n \times m}.$$

A structured SVM enables us to define this as an optimization problem, which is aimed at finding a functional that enables the images to be segmented as closely as possible to the "ground truth." For each of the pixels $p_{ik}$ in the image $I_k$, we associate some feature vector $x_{ik} \in \mathbb{R}^l$, where l is the dimensionality of the feature vector. Here, $x_{ik}$ can be either a SIFT descriptor or simply a color value (e.g., RGB), or any other feature. First, let us define the unary potential for the image $I_k$ as follows:

$$U_{ik}(v_i, y_i) = \langle w_{y_i}, x_{ik} \rangle, \quad w_{y_i} \in \mathbb{R}^l.$$

It is worth mentioning here that the graph is the same for all images since they have the same size. Let us also define the following:

$$\psi_{ki}(y_i) = \left( \mathbb{I}(y_i = 1) x_{ik}^T, \dots, \mathbb{I}(y_i = K) x_{ik}^T \right)^T.$$

It follows that if we denote $w_U = (w_1, w_2, \dots, w_K)^T$, then

$$U_{ik}(v_i, y_i) = \langle w_U, \psi_{ki}(y_i) \rangle = \langle w_{y_i}, x_{ik} \rangle.$$

Second, define the pairwise potential as

$$D^k_{ij}(v_i, v_j, y_i, y_j) = w_{y_i, y_j} \in \mathbb{R}.$$

Let us also define

$$\psi_{ij}(y_i, y_j) = \left( \mathbb{I}(y_i = 1, y_j = 1), \dots, \mathbb{I}(y_i = p, y_j = r), \dots, \mathbb{I}(y_i = K, y_j = K) \right).$$

In addition, suppose $w_V = (w_{ij})_{(i, j) \in \{1, \dots, K\}^2}$. Then

$$D^k_{ij}(v_i, v_j, y_i, y_j) = \langle w_V, \psi_{ij}(y_i, y_j) \rangle = w_{y_i, y_j}.$$


We can represent the energy function to be optimized for the image $I_k$ in the following way:

$$E_w^k(Y) = \sum_{v_i \in V} U_i^k(v_i, y_i) + \sum_{(v_i, v_j) \in E} D_{ij}^k(v_i, v_j, y_i, y_j) = \sum_{v_i \in V} \langle w_U, \psi_i^k(y_i) \rangle + \sum_{(v_i, v_j) \in E} \langle w_V, \psi_{ij}(y_i, y_j) \rangle.$$

Therefore, $E_w^k(Y)$ is a linear function of $w \equiv (w_U, w_V)$. If we define

$$\Psi(I_k, Y) = (\Psi_D(I_k, Y), \Psi_V(Y)) = \left(\sum_{v_i \in V} \psi_i^k(y_i), \sum_{(v_i, v_j) \in E} \psi_{ij}(y_i, y_j)\right),$$

then $E_w^k(Y) = \langle w, \Psi(I_k, Y) \rangle$. We can state a structured SVM problem for the weights $w$, with $-\Psi(I_k, Y)$ as a joint feature map. Note that, for each constraint, evaluating the maximum operator amounts to solving the optimization problem for the energy function $E_w^k(Y)$, which can be difficult.
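The linearity of the energy in the weights can be checked numerically. The following is a minimal sketch in plain Python, not the chapter's implementation: labels are zero-based (0 to K-1 rather than 1 to K), the toy features, edges, and weights are invented for illustration, and the exhaustive search merely stands in for graph cuts or α-expansion on a three-pixel example.

```python
from itertools import product

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def joint_feature_map(X, edges, Y, K):
    """Psi(I_k, Y): summed unary blocks followed by pairwise label counts."""
    l = len(X[0])
    psi_unary = [0.0] * (K * l)
    for x, y in zip(X, Y):
        for t, value in enumerate(x):
            psi_unary[y * l + t] += value      # block y receives x_i^k
    psi_pair = [0.0] * (K * K)
    for i, j in edges:
        psi_pair[Y[i] * K + Y[j]] += 1.0       # indicator of the label pair
    return psi_unary + psi_pair

def energy(w_unary, w_pair, X, edges, Y):
    """Energy as the sum of unary and pairwise potentials."""
    unary = sum(dot(w_unary[y], x) for x, y in zip(X, Y))
    pairwise = sum(w_pair[Y[i]][Y[j]] for i, j in edges)
    return unary + pairwise

def flatten_weights(w_unary, w_pair):
    """Stack w_U and w_V into one vector w, matching joint_feature_map."""
    return [v for row in w_unary for v in row] + [v for row in w_pair for v in row]

def brute_force_labeling(w_unary, w_pair, X, edges, K):
    """Exhaustive minimization; feasible only for tiny graphs."""
    return min(product(range(K), repeat=len(X)),
               key=lambda Y: energy(w_unary, w_pair, X, edges, Y))

# Hypothetical toy problem: 3 pixels in a chain, 2 labels, 2-D features.
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
edges = [(0, 1), (1, 2)]
K = 2
w_unary = [[0.1, -0.2], [0.3, 0.4]]
w_pair = [[0.0, 1.0], [1.0, 0.0]]   # penalizes neighbors with different labels

Y = (0, 1, 1)
direct = energy(w_unary, w_pair, X, edges, Y)
linear = dot(flatten_weights(w_unary, w_pair), joint_feature_map(X, edges, Y, K))
print(direct, linear)
print(brute_force_labeling(w_unary, w_pair, X, edges, K))
```

Both printed energies agree up to floating-point rounding, which is the property that lets the segmentation weights be learned with a structured SVM.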

8.5. Conclusion

Kernel methods serve as powerful and theoretically well-founded tools for a broad range of machine learning problems. The problems we address in this chapter are those of classification, clustering, and regression. While the first two are aimed at finding a discriminant function that divides a dataset into groups, the last seeks to estimate a function that governs the relationships between variables. The key question is how accurately these functions can be approximated using the information contained in the training set. Here, we rely on the assumption that such functions can be taken from a reproducing kernel Hilbert space (RKHS) with a corresponding kernel and evaluated on the given training set. Some kernels, known as universal kernels, can approximate any continuous function on a compact domain to arbitrary accuracy. Mercer's theorem represents kernels as inner products in an RKHS. All of these developments pave the way for the "kernel trick," which enables a linear model's inner products to be substituted with kernel function values. The representer theorem yields a model that is linear in its parameters yet generally nonlinear in its inputs, and that is an optimal solution for a wide range of machine learning problems.

Nowadays, the use of kernel techniques in machine learning is pervasive. In fact, any linear model built on inner products can be kernelized, as in kernel filtering and kernel linear discriminant analysis, neither of which is "linear" at all. We have highlighted the use of kernels in the Nadaraya–Watson regression and density estimation problems. To reveal nonlinear dependencies in data, kernels can also be introduced into principal component analysis. Kernelizing k-means leads to spectral clustering, which links kernel methods with graph theory.

Kernels provide an extension for the widely used and deservedly praised support vector machine (SVM) approach. For linear kernels, the SVM searches for the best discriminant hyperplane, the one that divides two classes with the largest possible margin between them. Replacing the inner products in an SVM model with appropriate kernel values results in an implicit mapping of the input vectors to an RKHS, where they can be linearly separated (given a successful kernel choice). In the input space, this results in a nonlinear dividing surface.

The SVM itself has many interesting extensions. Its model allows control over the misclassification rate, which is implicit in C-SVM and explicit in ν-SVM, where the parameter ν gives an upper bound for the fraction of margin errors. An SVM can be extended to multiclass problems with one-versus-one or one-versus-all approaches. It can also be modified to solve regression tasks, resulting in ε-SVR and ν-SVR, or to estimate densities, resulting in a one-class SVM. Following the tendency toward online learning methods, SVM models can be incrementally updated. The SVM's idea of maximizing the margin between classes is also applicable in adjacent areas, including metric learning. Finally, supervised learning can be extended from the commonly used real-valued labels to composite, structured objects, including graphs, and this is supported by structured SVM models.
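The "kernel trick" summarized in this conclusion can be illustrated with a small numerical check. The sketch below assumes the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen only because its explicit feature map is short enough to write down; the function names are illustrative and are not taken from the chapter.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = <x, z>^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def poly2_features(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)

# Kernel value computed directly in the 2-D input space.
kernel_value = poly2_kernel(x, z)

# The same quantity as an ordinary inner product in the 3-D feature space.
feature_value = sum(a * b for a, b in zip(poly2_features(x), poly2_features(z)))

print(kernel_value, feature_value)
```

The two values coincide up to rounding, so an algorithm written purely in terms of inner products can use the kernel value and never form the feature vectors explicitly.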

References

[1] C. M. Bishop (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[2] T. Hastie, R. Tibshirani, and J. Friedman (2003). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag New York.
[3] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand, and D. Steinberg (2008). Top 10 algorithms in data mining. Knowl. Inf. Syst., 14, pp. 1–37.
[4] V. N. Vapnik (1995). The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA.
[5] V. Vapnik and A. Chervonenkis (1964). A note on one-class of perceptrons. Autom. Remote Control, 25, pp. 112–120.
[6] S. Boyd and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press, New York, NY, USA.
[7] B. W. Silverman (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
[8] B. Schölkopf, C. J. C. Burges, and A. J. Smola (eds.) (1999). Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, USA.
[9] M. Cuturi (2010). Positive definite kernels in machine learning. arXiv:0911.5367v2 [stat.ML].


[10] N. Aronszajn (1950). Theory of reproducing kernels. Trans. Am. Math. Soc., 68(3), pp. 337–404.
[11] H. Q. Minh, P. Niyogi, and Y. Yao (2006). Mercer's theorem, feature maps, and smoothing. COLT, pp. 154–168.
[12] J. Mercer (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. R. Soc. London, 209, pp. 415–446.
[13] C. A. Micchelli, Y. Xu, and H. Zhang (2006). Universal kernels. J. Mach. Learn. Res., 6, pp. 2651–2667.
[14] B. Schölkopf, R. Herbrich, and A. J. Smola (2001). A generalized representer theorem. Computational Learning Theory. Lecture Notes in Computer Science, pp. 416–426.
[15] M. Rosenblatt (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Stat., 27(3), pp. 832–837.
[16] E. Parzen (1962). On estimation of a probability density function and mode. Ann. Math. Stat., 33(3), pp. 1065–1076.
[17] B. U. Park and J. S. Marron (1990). Comparison of data-driven bandwidth selectors. J. Am. Stat. Assoc., 85(409), pp. 66–72.
[18] B. U. Park and B. A. Turlach (1992). Practical performance of several data driven bandwidth selectors (with discussion). Comput. Stat., 7, pp. 251–270.
[19] S. J. Sheather (1992). The performance of six popular bandwidth selection methods on some real datasets (with discussion). Comput. Stat., 7, pp. 225–281.
[20] S. J. Sheather and M. C. Jones (1991). A reliable data-based bandwidth-selection method for kernel density estimation. J. R. Stat. Soc., 53B, pp. 683–690.
[21] M. C. Jones, J. S. Marron, and S. J. Sheather (1996). A brief survey of bandwidth selection for density estimation. J. Am. Stat. Assoc., 91(433), pp. 401–407.
[22] E. A. Nadaraya (1964). On estimating regression. Theory Probab. Appl., 9(1), pp. 141–142.
[23] G. S. Watson (1964). Smooth regression analysis. Sankhya, 26(15), pp. 175–184.
[24] C. G. Atkeson, A. W. Moore, and S. Schaal (1997). Locally weighted learning. Artif. Intell. Rev., 11, pp. 11–73.
[25] H.-G. Müller (1987). Weighted local regression and kernel methods for nonparametric curve fitting. J. Am. Stat. Assoc., 82(397), pp. 231–238.
[26] M. C. Jones, S. J. Davies, and B. U. Park (1994). Versions of kernel-type regression estimators. J. Am. Stat. Assoc., 89(427), pp. 825–832.
[27] B. Schölkopf, A. J. Smola, and K.-R. Müller (1999). Kernel principal component analysis, in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola (eds.), MIT Press, Cambridge, MA, pp. 327–352.
[28] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller (1999). Fisher discriminant analysis with kernels, in Neural Networks for Signal Processing IX, Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas (eds.), IEEE Press, Piscataway, NJ, pp. 41–48.
[29] P. Exterkate, P. J. F. Groenen, C. Heij, and D. van Dijk (2013). Nonlinear forecasting with many predictors using kernel ridge regression. CREATES Research Papers 2013-16, School of Economics and Management, University of Aarhus.
[30] M. H. Hayes (1996). Statistical Digital Signal Processing and Modeling, 1st ed. John Wiley & Sons, Inc., New York, NY, USA.
[31] L. Bottou (1998). Online algorithms and stochastic approximations, in Online Learning and Neural Networks, D. Saad (ed.), Cambridge University Press, Cambridge, UK.
[32] W. Liu, J. C. Príncipe, and S. Haykin (2010). Appendix B: approximate linear dependency and system stability, in Kernel Adaptive Filtering: A Comprehensive Introduction, J. C. Príncipe, W. Liu, and S. Haykin (eds.), John Wiley & Sons, Inc., Hoboken, NJ, USA.
[33] U. von Luxburg (2007). A tutorial on spectral clustering. Stat. Comput., 17(4), pp. 395–416.
[34] I. S. Dhillon, Y. Guan, and B. Kulis (2004). Kernel k-means: spectral clustering and normalized cuts, in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), pp. 551–556.


[35] A. Y. Ng, M. I. Jordan, and Y. Weiss (2001). On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst., 14, pp. 849–856.
[36] U. von Luxburg (2007). A tutorial on spectral clustering. Stat. Comput., 17(4), pp. 395–416.
[37] S. Balakrishnan, M. Xu, A. Krishnamurthy, and A. Singh (2011). Noise thresholds for spectral clustering. NIPS, pp. 954–962.
[38] B. Mohar (1991). The Laplacian spectrum of graphs, in Graph Theory, Combinatorics, and Applications, Y. Alavi, G. Chartrand, O. R. Oellermann, and A. J. Schwenk (eds.), vol. 2, Wiley, pp. 871–898.
[39] B. Mohar (1997). Some applications of Laplace eigenvalues of graphs, in Graph Symmetry: Algebraic Methods and Applications, G. Hahn and G. Sabidussi (eds.), NATO ASI Series, Springer Netherlands, pp. 225–275.
[40] F. Chung (1997). Spectral graph theory. Washington: Conference Board of the Mathematical Sciences.
[41] J. Shi and J. Malik (2000). Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8), pp. 888–905.
[42] B. E. Boser, I. M. Guyon, and V. N. Vapnik (1992). A training algorithm for optimal margin classifiers, in Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT'92), pp. 144–152.
[43] M. Aizerman, E. Braverman, and L. Rozonoer (1964). Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control, 25, pp. 821–837.
[44] C. Cortes and V. Vapnik (1995). Support-vector networks. Mach. Learn., 20, pp. 273–297.
[45] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett (2000). New support vector algorithms. Neural Comput., 12, pp. 1207–1245.
[46] H. Drucker, C. J. C. Burges, L. Kaufman, A. J. Smola, and V. N. Vapnik (1997). Support vector regression machines, in Proceedings of the 9th International Conference on Neural Information Processing Systems (NIPS'96), pp. 155–161.
[47] E. Hüllermeier and K. Brinker (2008). Learning valued preference structures for solving classification problems. Fuzzy Sets Syst., pp. 2337–2352.
[48] E. Lughofer and O. Buchtala (2013). Reliable all-pairs evolving fuzzy classifiers. IEEE Trans. Fuzzy Syst., 21(4), pp. 625–641.
[49] J. C. Platt, N. Cristianini, and J. Shawe-Taylor (2000). Large margin DAGs for multiclass classification, in Proceedings of Neural Information Processing Systems, NIPS'99, pp. 547–553.
[50] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson (2001). Estimating the support of a high-dimensional distribution. Neural Comput., 13(7), pp. 1443–1472.
[51] G. Cauwenberghs and T. Poggio (2001). Incremental and decremental support vector machine learning, in Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS'00), pp. 388–394.
[52] N. Nguyen and Y. Guo (2008). Metric learning: a support vector approach, in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Springer-Verlag Berlin Heidelberg, pp. 125–136.
[53] K. Q. Weinberger and L. K. Saul (2009). Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10, pp. 207–244.
[54] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun (2005). Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., pp. 1453–1484.
[55] A. Lucchi, Y. Li, K. Smith, and P. Fua (2012). Structured image segmentation using kernelized features. Comput. Vision: ECCV, pp. 400–413.
[56] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun (2004). Support vector machine learning for interdependent and structured output spaces, in 21st International Conference on Machine Learning (ICML), ACM, Banff, Alberta, Canada, p. 104.


[57] T. Joachims, T. Hofmann, Y. Yue, and C.-N. Yu (2009). Predicting structured objects with support vector machines. Research Highlight, Commun. ACM, 52(11), pp. 97–104.
[58] M. B. Blaschko and C. H. Lampert (2008). Learning to localize objects with structured output regression. Comput. Vision: ECCV, pp. 2–15.
[59] Y. Boykov, O. Veksler, and R. Zabih (2001). Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell., 23(11), pp. 1222–1239.
[60] Y. Boykov and V. Kolmogorov (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell., 26(9), pp. 1124–1137.
[61] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018). Deep neural networks as Gaussian processes, in International Conference on Learning Representations, arXiv:1711.00165.
[62] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012). ImageNet classification with deep convolutional neural networks, in Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12), pp. 1097–1105.
[63] Y. Tang (2015). Deep learning using linear support vector machines. arXiv:1306.0239.
[64] F. J. Huang and Y. LeCun (2006). Large-scale learning with SVM and convolutional nets for generic object categorization, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), pp. 284–291.
[65] S. A. Wagner (2016). SAR ATR by a combination of convolutional neural network and support vector machines. IEEE Trans. Aerosp. Electron. Syst., 52(6), pp. 2861–2872.
[66] C. E. Rasmussen and C. K. I. Williams (2005). Gaussian Processes for Machine Learning. MIT Press.
[67] M. Gönen and E. Alpaydın (2011). Multiple kernel learning algorithms. J. Mach. Learn. Res., 12, pp. 2211–2268.
[68] S. Qiu and T. Lane (2009). A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction. IEEE/ACM Trans. Comput. Biol. Bioinf., 6(2), pp. 190–199.
[69] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan (2004). Multiple kernel learning, conic duality, and the SMO algorithm, in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
[70] R. E. Schapire (2013). Explaining AdaBoost, in Empirical Inference. Springer, pp. 37–52.


© 2022 World Scientific Publishing Company
https://doi.org/10.1142/9789811247323_0009

Chapter 9

Evolving Connectionist Systems for Adaptive Learning and Knowledge Discovery: From Neuro-Fuzzy to Spiking, Neurogenetic, and Quantum Inspired: A Review of Principles and Applications

Nikola Kirilov Kasabov∗
Auckland University of Technology, Auckland, New Zealand
∗ [email protected]

This chapter follows the development of a class of neural networks called evolving connectionist systems (ECOS). ECOS combine the adaptive/evolving learning ability of neural networks with the approximate reasoning and linguistically meaningful explanation features of symbolic representation, such as fuzzy rules. This chapter covers hybrid expert systems, evolving neuro-fuzzy systems, evolving spiking neural networks, neurogenetic systems, and quantum-inspired systems, all discussed from the point of view of their adaptability and model interpretability. The chapter covers both the methods and their numerous applications for data modeling, predictive systems, data mining, and pattern recognition across the application areas of engineering, medicine and health, neuroinformatics, bioinformatics, adaptive robotics, etc.

9.1. Early Neuro-Fuzzy Hybrid Systems

The human brain uniquely combines low-level learning in the neurons and the connections between them with higher-level abstraction leading to knowledge discovery. This is the ultimate inspiration for the development of the evolving connectionist systems described in this chapter.

∗ https://academics.aut.ac.nz/nkasabov; University of Ulster, https://pure.ulster.ac.uk/en/persons/nik-kasabov; University of Auckland, https://unidirectory.auckland.ac.nz/profile/nkas001


[Figure 9.1 diagram labels: Fuzzified data; Rules extraction module (neural network); Current price, Yesterday's price (crisp values); Neural network; Political situation, Economic situation (fuzzy values); Trading rules (fuzzy); Predicted price (crisp value); Fuzzy rule based decision; Decision (buy/sell/hold) (fuzzy & crisp values).]

Figure 9.1: A hybrid NN-fuzzy rule-based expert system for financial decision support [5].

In the past 50 years or so, several seminal works in the areas of neural networks [1, 2] and fuzzy systems [3, 4] opened a new field of information science: the creation of new types of hybrid systems that combine the learning ability of neural networks at a lower level of information processing with the reasoning and explanation ability of fuzzy rule-based systems at a higher level. An exemplar system is shown in Figure 9.1, where, at a lower level, a neural network (NN) module predicts the level of a stock index and, at a higher level, a fuzzy reasoning module combines the predicted values with some macroeconomic variables, using the following type of fuzzy rule [5]:

IF <…> AND <…> THEN <…>    (9.1)
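To make the form of rule (9.1) concrete, the following sketch evaluates one rule of the IF ... AND ... THEN type. Everything specific in it is an assumption made for illustration: the two antecedents ("predicted change is rising", "economic situation is good"), the ramp-shaped membership functions and their breakpoints, and the use of the minimum as the fuzzy AND are not taken from the system in [5].

```python
def ramp_up(x, low, high):
    """Membership degree rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def buy_rule_strength(predicted_change, economy_score):
    """Firing strength of a hypothetical rule:
    IF <predicted change is rising> AND <economic situation is good> THEN <buy>.
    """
    # Crisp NN output (e.g., predicted relative price change) is fuzzified here.
    rising = ramp_up(predicted_change, 0.0, 0.05)
    # Hypothetical score in [0, 1] summarizing the economic situation.
    good_economy = ramp_up(economy_score, 0.3, 0.8)
    # The minimum is one common choice for the fuzzy AND.
    return min(rising, good_economy)

print(buy_rule_strength(0.025, 0.8))
```

The returned degree can then be compared with the strengths of competing rules (for example, for "sell" and "hold") to reach a decision.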

These fuzzy expert systems continued the development of the hybrid NN-rule-based expert systems that used crisp propositional and fuzzy rules [6–8]. The integration of neural networks and fuzzy systems into expert systems attracted many researchers to this theme. The low-level integration of fuzzy rules into a single neuron model and into larger neural network structures, tightly coupling learning and fuzzy reasoning rules in connectionist structures, was initiated by Professor Takeshi Yamakawa and other Japanese scientists and promoted at a series of IIZUKA conferences in Japan [3, 9]. Many models of fuzzy neural networks were developed based on these principles [5, 10–15].


9.2. Evolving Connectionist Systems

9.2.1. Principles of Evolving Connectionist Systems

The evolving neuro-fuzzy systems developed these ideas further: instead of training a fixed connectionist structure, the structure and its functionality evolve from incoming data, often in an online, one-pass learning mode. This is the case with evolving connectionist systems (ECOS) [12–14, 16–18]. ECOS are modular connectionist-based systems that evolve their structure and functionality in a continuous, self-organized, online, adaptive, interactive way from incoming information [12]. They can process both data and knowledge in a supervised and/or an unsupervised way. ECOS learn local models from data by clustering the data and associating a local output function with each cluster represented in a connectionist structure. They can incrementally learn single data items or chunks of data and also incrementally change their input features [17, 19]. Elements of ECOS have been proposed as part of the classical NN models, such as SOM, RBF, FuzzyARTMap, growing neural gas, neuro-fuzzy systems, and RAN (for a review, see [17]). Other ECOS models, along with their applications, have been reported in [20, 21].

The principle of ECOS is based on local learning: neurons are allocated as centers of data clusters, and the system creates local models in these clusters. Fuzzy clustering, as a means to create local knowledge-based systems, was stimulated by the pioneering work of Bezdek, Yager, and Filev [22, 23]. To summarize, the following are the main principles of ECOS as stated by Kasabov [12]:

1. Fast learning from a large amount of data, e.g., using "one-pass" training, starting with little prior knowledge;
2. Adaptation in a real-time and an online mode, where new data is accommodated as it comes, based on local learning;
3. "Open," evolving structure, where new input variables (relevant to the task), new outputs (e.g., classes), new connections and neurons are added/evolved "on the fly";
4. Both data learning and knowledge representation are facilitated in a comprehensive and flexible way, e.g., supervised learning, unsupervised learning, evolving clustering, "sleep" learning, forgetting/pruning, and fuzzy rule insertion and extraction;
5. Active interaction with other ECOSs and with the environment in a multi-modal fashion;
6. Representing both space and time in their different scales, e.g., clusters of data, short- and long-term memory, age of data, forgetting, etc.;
7. A system's self-evaluation in terms of behavior, global error and success, and related knowledge representation.

In 1998, Walter Freeman, who attended the ICONIP conference, commented on the proposed ECOS concepts: "Throw the 'chemicals' and let the system grow." The development of ECOS as a trend in neural networks and computational intelligence, which started in 1998 [12], has continued: many improved or new computational methods that use the ECOS principles have been developed, along with many applications.
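The local-learning principle stated in this section (neurons allocated as cluster centers, one-pass training, structure growing "on the fly") can be sketched in a few lines. The code below is a deliberately simplified illustration and not the ECM, EFuNN, or DENFIS algorithm: the distance threshold, the fixed learning rate, and the function name are assumptions introduced here.

```python
import math

def evolve_clusters(samples, threshold, learning_rate=0.1):
    """One-pass evolving clustering (simplified sketch).

    Each sample is seen once. It either adapts the nearest existing
    center or, if no center is closer than `threshold`, creates a new one.
    """
    centers = []
    for x in samples:
        if centers:
            dists = [math.dist(x, c) for c in centers]
            j = min(range(len(centers)), key=dists.__getitem__)
        if not centers or dists[j] >= threshold:
            centers.append(list(x))           # structure grows: new node
        else:
            c = centers[j]
            for d in range(len(c)):           # move the winner toward x
                c[d] += learning_rate * (x[d] - c[d])
    return centers

# Hypothetical data stream with two well-separated groups.
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (0.0, 0.1)]
centers = evolve_clusters(stream, threshold=1.0)
print(len(centers), centers)
```

Because the number of centers is not fixed in advance, a sample far from all existing centers extends the model instead of distorting it, which is the behavior the "open, evolving structure" principle describes.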

9.2.2. Neuro-Fuzzy ECOS: EFuNN and DENFIS

Here, we briefly illustrate the concepts of ECOS with two implementations: EFuNN [16] and DENFIS [18]. Examples of EFuNN and DENFIS are shown in Figures 9.2 and 9.3, respectively. In ECOS, clusters of data are created based on the similarity between data samples either in the input space (as in some ECOS models, e.g., the dynamic neuro-fuzzy inference system DENFIS) or in both the input and output spaces (as in, e.g., the EFuNN models). Samples (examples) whose distance to an existing node (cluster center, rule node) is less than a certain threshold are allocated to the same cluster. Samples that do not fit into existing clusters form new clusters. Cluster centers are continuously adjusted according to new data samples, and new clusters are created incrementally.

Figure 9.2: An example of an EFuNN model. Source: Kasabov [16].

Figure 9.3: An example of a DENFIS model. Source: Marshall et al. [17] and Kasabov [24, 25].

ECOS learn from data and automatically create or update a local fuzzy model/function, e.g.,

IF <…> THEN <…>,    (9.2)

where $F_i$ can be a fuzzy value, a logistic or linear regression function (Figure 9.3), or an NN model [17, 18].

A special development of ECOS is transductive reasoning and personalized modeling. Instead of building a set of local models (e.g., prototypes) to cover the whole problem space and then using these models to classify/predict any new input vector, in transductive modeling a new model is created for every new input vector, based on selected nearest-neighbor vectors from the available data. Such ECOS models are NFI and TWNFI [26]. In TWNFI, for every new input vector, the neighborhood of closest data vectors is optimized using both the distance between the new vector and the neighboring ones and the weighted importance of the input variables, so that the error of the model is minimized in the neighborhood area [27].
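The transductive idea, a fresh local model for every query, can be sketched as follows. This is a minimal illustration only and is not NFI or TWNFI: the local "model" is an inverse-distance-weighted average of the k nearest stored outputs, and the function name, the value of k, and the toy data are assumptions introduced here.

```python
import math

def transductive_predict(query, data, k=3):
    """Predict the output for `query` from its k nearest stored samples.

    `data` is a list of (input_vector, output) pairs. A new local model
    is built for each query instead of one global model for all inputs.
    """
    # Select the neighborhood of the query.
    neighbours = sorted(data, key=lambda pair: math.dist(query, pair[0]))[:k]
    # Closer neighbors get larger weights; the small constant avoids
    # division by zero when the query coincides with a stored sample.
    weights = [1.0 / (math.dist(query, x) + 1e-9) for x, _ in neighbours]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, neighbours))
    return weighted_sum / sum(weights)

# Hypothetical one-dimensional data set.
data = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((10.0,), 10.0)]
print(transductive_predict((0.5,), data, k=2))
```

TWNFI additionally weights the input variables and optimizes the neighborhood to minimize the local error; neither step is attempted in this sketch.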

9.2.3. Methods that Use Some ECOS Principles

Among the methods that use or have been inspired by the ECOS principles are:

— Evolving self-organized maps (ESOM) [28];
— Evolving clustering methods (ECM and ECMC) [18];
— Incremental feature learning in ECOS [29];
— Online ECOS optimization [30, 31];
— Evolving Takagi–Sugeno fuzzy model based on switching to neighboring models [32];
— Clustering and co-evolution to construct neural network ensembles [33].

Further methods that use or have been inspired by the ECOS principles include the following (publications are available from www.ieeexplore.ieee.org, Google Scholar, and Scopus):

— Online ECOS optimization, developed by Zeke Chan et al.;
— Assessment of EFuNN accuracy for pattern recognition using data with different statistical distributions, developed by Ronei Marcos de Moraes et al.;
— Recursive clustering based on a Gustafson–Kessel algorithm, by D. Dovžan and I. Škrjanc;
— Using a map-based encoding to evolve plastic neural networks, by P. Tonelli and J. Mouret;
— Evolving Takagi–Sugeno fuzzy model based on switching to neighbouring models, by A. Kalhor, B. N. Araabi, and C. Lucas;
— A soft computing based approach for modelling of chaotic time series, by J. Vajpai and J. B. Arun;
— Uni-norm based evolving neural networks and approximation capabilities, by F. Bordignon and F. Gomide;
— Machine learning within an unbalanced distributed environment research notes, by H. J. Prueller;
— FLEXFIS: a robust incremental learning approach for evolving Takagi–Sugeno fuzzy models, by E. D. Lughofer;
— Evolving fuzzy classifiers using different model architectures, by P. Angelov, E. Lughofer, and X. Zhou [15];
— RSPOP: Rough set–based pseudo outer-product fuzzy rule identification algorithm, by K. K. Ang and C. Quek;
— SOFMLS: online self-organizing fuzzy modified least-squares network, by J. de Jesús Rubio;
— Finding features for real-time premature ventricular contraction detection using a fuzzy neural network system, by J. S. Lim;
— Evolving fuzzy rule-based classifiers, by P. Angelov, X. Zhou, and F. Klawonn;
— A novel generic Hebbian ordering-based fuzzy rule base reduction approach to Mamdani neuro-fuzzy system, by F. Liu, C. Quek, and G. S. Ng;


— Implementation of fuzzy cognitive maps based on fuzzy neural network and application in prediction of time series, by H. Song, C. Miao, W. Roel, and Z. Shen;
— Development of an adaptive neuro-fuzzy classifier using linguistic hedges, by B. Cetisli;
— A meta-cognitive sequential learning algorithm for neuro-fuzzy inference system, by K. Subramanian and S. Suresh;
— Meta-cognitive RBF network and its projection based learning algorithm for classification problems, by G. S. Babu and S. Suresh;
— SaFIN: A self-adaptive fuzzy inference network, by S. W. Tung, C. Quek, and C. Guan;
— A sequential learning algorithm for meta-cognitive neuro-fuzzy inference system for classification problems, by S. Suresh and K. Subramanian;
— Architecture for development of adaptive online prediction models, by P. Kadlec and B. Gabrys;
— Clustering and co-evolution to construct neural network ensembles, by F. L. Minku and T. B. Ludermir;
— Algorithms for real-time clustering and generation of rules from data, by D. Filev and P. Angelov;
— SAKM: Self-adaptive kernel machine: A kernel-based algorithm for online clustering, by H. Amadou Boubacar, S. Lecoeuche, and S. Maouche;
— A BCM theory of meta-plasticity for online self-reorganizing fuzzy-associative learning, by J. Tan and C. Quek;
— Evolutionary strategies and genetic algorithms for dynamic parameter optimization of evolving fuzzy neural networks, by F. L. Minku and T. B. Ludermir;
— Incremental learning and model selection for radial basis function network through sleep learning, by K. Yamauchi and J. Hayami;
— Interval-based evolving modeling, by D. F. Leite, P. Costa, and F. Gomide;
— Evolving granular classification neural networks, by D. F. Leite, P. Costa, and F. Gomide;
— Stability analysis for an online evolving neuro-fuzzy recurrent network, by J. de Jesús Rubio;
— A TSK fuzzy inference algorithm for online identification, by K. Kim, E. J. Whang, C. W. Park, E. Kim, and M. Park;
— Design of experiments in neuro-fuzzy systems, by C. Zanchettin, L. L. Minku, and T. B. Ludermir;
— EFuNNs ensembles construction using a clustering method and a coevolutionary genetic algorithm, by F. L. Minku and T. B. Ludermir;
— eT2FIS: An evolving type-2 neural fuzzy inference system, by S. W. Tung, C. Quek, and C. Guan;


— Designing radial basis function networks for classification using differential evolution, by B. O'Hora, J. Perera, and A. Brabazon;
— A meta-cognitive neuro-fuzzy inference system (McFIS) for sequential classification problems, by K. Subramanian, S. Sundaram, and N. Sundararajan;
— An evolving fuzzy neural network based on the mapping of similarities, by J. A. M. Hernández and F. G. Castañeda;
— Incremental learning by heterogeneous bagging ensemble, by Q. L. Zhao, Y. H. Jiang, and M. Xu;
— Fuzzy associative conjuncted maps network, by H. Goh, J. H. Lim, and C. Quek;
— EFuNN ensembles construction using CONE with multi-objective GA, by F. L. Minku and T. B. Ludermir;
— A framework for designing a fuzzy rule-based classifier, by J. Guzaitis, A. Verikas, and A. Gelzinis;
— Pruning with replacement and automatic distance metric detection in limited general regression neural networks, by K. Yamauchi, 2009;
— Optimal incremental learning under covariate shift, by K. Yamauchi;
— Online ensemble learning in the presence of concept drift, by L. L. Minku, 2011 (www.theses.bham.ac.uk);
— Design of linguistically interpretable fuzzy rule-based classifiers, by H. Ishibuchi, Y. Kaisho, and Y. Nojima;
— A compensatory neurofuzzy system with online constructing and parameter learning, by M. F. Han, C. T. Lin, and J. Y. Chang;
— Incremental learning and model selection under virtual concept drifting environments, by K. Yamauchi;
— Active learning in nonstationary environments, by R. Capo, K. B. Dyer, and R. Polikar;
— Real time knowledge acquisition based on unsupervised learning of evolving neural models, by G. Vachkov;
— Flexible neuro-fuzzy systems, by L. Rutkowski and K. Cpalka;
— A self-adaptive neural fuzzy network with group-based symbiotic evolution and its prediction applications, by C. J. Lin and Y. J. Xu;
— A self-organizing fuzzy neural network based on a growing-and-pruning algorithm, by H. Han and J. Qiao;
— Online elimination of local redundancies in evolving fuzzy systems, by E. Lughofer, J. L. Bouchot, and A. Shaker;
— Backpropagation-free learning method for correlated fuzzy neural networks, by A. Salimi-Badr and M. M. Ebadzadeh;
— A projection-based split-and-merge clustering algorithm, by M. Cheng, T. Ma, and Y. Liu;


— Granular modelling, by M. Tayyab, B. Belaton, and M. Anbar;
— Permutation entropy to detect synchronisation, by Z. Shahriari and M. Small;
— Evolving connectionist systems: Characterisation, simplification, formalisation, explanation and optimisation, by M. J. Watts (2004, PhD thesis; https://ourarchive.otago.ac.nz);

to mention only a few of the further developed ECOS methods.

9.2.4. ECOS-Based Applications

Based on the ECOS concepts and methods, sustained engineering applications have been developed. Some of them are included in the list below and presented in detail in [19]:
— Discovery of diagnostic markers for early detection of bladder cancer, colorectal cancer and other types of cancer based on EFuNN (Pacific Edge Biotechnology Ltd, https://www.pacificedgedx.com) [21, 34];
— Medical diagnosis of renal function evaluation using DENFIS [24];
— Risk analysis and discovery of evolving economic clusters in Europe, by N. Kasabov, L. Erzegovesi, M. Fedrizzi, and A. Beber;
— Adaptive robot control system based on ECOS [35];
— Personalized modeling systems (www.knowledgeengineering.ai);
— Monthly electricity demand forecasting based on a weighted evolving fuzzy neural network approach, by P. C. Chang, C. Y. Fan, and J. J. Lin;
— Decision making for cognitive radio equipment, by W. Jouini, C. Moy, and J. Palicot;
— An incremental learning structure using granular computing and model fusion with application to materials processing, by G. Panoutsos and M. Mahfouf;
— Evolving fuzzy systems for data streams, by R. D. Baruah and P. Angelov;
— Handwritten digits classification, by G. S. Ng, S. Erdogan, D. Shi, and A. Wahab;
— Online time series prediction system with EFuNN-T, by X. Wang;
— Comparative analysis of the two fuzzy neural systems ANFIS and EFuNN for the classification of handwritten digits, by T. Murali, N. Sriskanthan, and G. S. Ng;
— Online identification of evolving Takagi–Sugeno–Kang fuzzy models for crane systems, by R. E. Precup, H. I. Filip, M. B. Rădac, E. M. Petriu, and S. Preitl;
— Modelling ozone levels in an arid region: a dynamically evolving soft computing approach, by S. M. Rahman, A. N. Khondaker, and R. A. Khan;
— A software agent framework to overcome malicious host threats and uncontrolled agent clones, by Sujitha, G. Annie, and T. Amudha;
— Comparing evaluation methods based on neural networks for a virtual reality simulator for medical training, by R. M. de Moraes and L. S. Machado;


— eFSM: A novel online neural-fuzzy semantic memory model, by W. L. Tung and C. Quek;
— Stock trading using RSPOP: A novel rough set-based neuro-fuzzy approach, by K. K. Ang and C. Quek;
— A stable online clustering fuzzy neural network for nonlinear system identification, by J. J. Rubio and J. Pacheco;
— Evolving granular neural networks from fuzzy data streams, by D. Leite, P. Costa, and F. Gomide;
— Neural networks for QoS network management, by R. del-Hoyo-Alonso and P. Fernández-de-Alarcón;
— Adaptive online co-ordination of ubiquitous computing devices with multiple objectives and constraints, by E. Tawil and H. Hagras;
— Advances in classification of EEG signals via evolving fuzzy classifiers and dependent multiple HMMs, by C. Xydeas, P. Angelov, S. Y. Chiao, and M. Reoullas;
— Online data-driven fuzzy clustering with applications to real-time robotic tracking, by P. X. Liu and M. Q. H. Meng;
— Evolving fuzzy systems for pricing fixed income options, by L. Maciel, A. Lemos, F. Gomide, and R. Ballini;
— Combustion engine modelling using an evolving local model network, by C. Hametner and S. Jakubek;
— Intelligent information systems for online analysis and modelling of biological data, by M. J. Middlemiss (2001, PhD thesis, www.otago.ourarchive.ac.nz);
— Evolving neurocomputing systems for horticulture applications, by B. J. Woodford;
— Neuro-fuzzy system for post-dialysis urea rebound prediction, by A. T. Azar, A. H. Kandil, and K. M. Wahba;
— ARPOP: An appetitive reward-based pseudo-outer-product neural fuzzy inference system inspired from the operant conditioning of feeding behavior in Aplysia, by E. Y. Cheu, C. Quek, and S. K. Ng, 2012;
— Artificial ventilation modeling using neuro-fuzzy hybrid system, by F. Liu, G. S. Ng, C. Quek, and T. F. Loh;
— A data fusion method applied in an autonomous robot, by Y. Qingmei and S. Jianmin;
— Evolving fuzzy neural networks applied to odor recognition, by C. Zanchettin and T. B. Ludermir;
— A reduced rule-based localist network for data comprehension, by R. J. Oentaryo and M. Pasquier;
— Faster self-organizing fuzzy neural network training and a hyperparameter analysis for a brain–computer interface, by D. Coyle, G. Prasad, and T. M. McGinnity;


— Adaptive anomaly detection with evolving connectionist systems, by Y. Liao, V. R. Vemuri, and A. Pasos;
— Creating evolving user behavior profiles automatically, by J. A. Iglesias, P. Angelov, and A. Ledezma;
— Autonomous visual self-localization in completely unknown environment using evolving fuzzy rule-based classifier, by X. Zhou and P. Angelov;
— Online training evaluation in virtual reality simulators using evolving fuzzy neural networks, by L. S. Machado and R. M. Moraes;
— Predictive functional control based on an adaptive fuzzy model of a hybrid semibatch reactor, by D. Dovžan and I. Škrjanc;
— An online adaptive condition-based maintenance method for mechanical systems, by F. Wu, T. Wang, and J. Lee;
— Driving profile modeling and recognition based on soft computing approach, by A. Wahab, C. Quek, and C. K. Tan;
— Human action recognition using meta-cognitive neuro-fuzzy inference system, by K. Subramanian and S. Suresh;
— Hybrid neural systems for pattern recognition in artificial noses, by C. Zanchettin and T. B. Ludermir;
— A novel brain-inspired neural cognitive approach to SARS thermal image analysis, by C. Quek, W. Irawan, and G. S. Ng;
— Intrinsic and extrinsic implementation of a bio-inspired hardware system, by B. Glackin, L. P. Maguire, and T. M. McGinnity;
— Financial volatility trading using a self-organising neural-fuzzy semantic network and option straddle-based approach, by W. L. Tung and C. Quek;
— A Bluetooth routing protocol using evolving fuzzy neural networks, by C. J. Huang, W. K. Lai, S. Y. Hsiao, and H. Y. Liu;
— QoS provisioning by EFuNNs-based handoff planning in cellular MPLS networks, by B. S. Ghahfarokhi and N. Movahhedinia;
— Hybrid active learning for reducing the annotation effort of operators in classification systems, by E. Lughofer;
— Classification of machine operations based on growing neural models and fuzzy decision, by G. Vachkov;
— Computational intelligence tools for next generation quality of service management, by R. del-Hoyo, B. Martín-del-Brío, and N. Medrano;
— Solving the sales prediction problem with fuzzy evolving methods, by D. Dovžan, V. Logar, and I. Škrjanc;
— Predicting environmental extreme events in Algeria using DENFIS, by Heddam et al. [36];
— ICMPv6-based DoS and DDoS attacks detection using machine learning techniques, open challenges, and blockchain applicability, by M. Tayyab, B. Belaton, and M. Anbar;


— Prognostics of rolling element bearings, by M. O. Camargos, R. M. Palhares, I. Bessa, and L. B. Cosme;
— Estimating dissolved gas and tail-water in water dams, by S. Heddam and O. Kisi;
— The implementation of univariable scheme-based air temperature for solar radiation prediction: New development of dynamic evolving neural-fuzzy inference system model, by O. Kisi, S. Heddam, and Z. M. Yaseen [37];
— Multiple time series prediction [38].

While the ECOS methods presented above use the McCulloch and Pitts model of a neuron, the further developed evolving spiking neural network (eSNN) architectures use a spiking neuron model while following the same or similar ECOS principles and applications.

9.3. Evolving Spiking Neural Networks

9.3.1. Main Principles, Methods, and Examples of eSNN

A single biological neuron and its associated synapses form a complex information processing machine that involves short-term information processing, long-term information storage, and evolutionary information stored as genes in the nucleus of the neuron. A spiking neuron model assumes that input information is represented as trains of spikes over time. When sufficient input information has accumulated in the membrane of the neuron, the neuron’s post-synaptic potential exceeds a threshold and the neuron emits a spike at its axon (Figure 9.4).

Some of the state-of-the-art models of a spiking neuron include early models by Hodgkin and Huxley [39] and more recent models by Maass, Gerstner, Kistler, Izhikevich, and others, e.g., Spike Response Models (SRM); the Integrate-and-Fire Model (IFM) (Figure 9.4); Izhikevich models; adaptive IFM; and probabilistic IFM (for details, see [6, 7, 40–42]).

Based on the ECOS principles, an evolving spiking neural network architecture (eSNN) was proposed [19, 43]. It was initially designed as a visual pattern recognition system. The first eSNNs were based on Thorpe’s neural model [44],

Figure 9.4: The structure of the LIFM.


Figure 9.5: A reservoir-based eSNN for spatio-temporal pattern classification.

in which the importance of early spikes (after the onset of a certain stimulus) is boosted, an approach called rank–order coding and learning. Synaptic plasticity is realized through a fast, supervised, one-pass learning algorithm. Different eSNN models were developed, including the following:
— Reservoir-based eSNN for spatio- and spectro-temporal pattern recognition, shown in Figure 9.5 (following the main principles from Verstraeten et al. [45]);
— Dynamic eSNN (deSNN) [46], a model that uses both rank–order and time-based STDP learning [47] to account for spatio-temporal data;
and many more (for references, see [48]).

Extracting fuzzy rules from an eSNN makes it not only an efficient learning model but also a knowledge-based model. A method was proposed in [49] and is illustrated in Figures 9.6 and 9.7. Based on the connection weights (W) between the receptive field layer (L1) and the class output neuron layer (L2), the following fuzzy rules are extracted:

IF (input variable v is SMALL) THEN class C_i; IF (v is LARGE) THEN class C_j. (9.3)
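The rule extraction step can be illustrated with a minimal Python sketch. It assumes (this is not stated in the text) that the rule for each class is read from the receptive field with the largest connection weight to that class's output neuron, and that the receptive fields are ordered from the smallest to the largest input value. The function name `extract_rule`, the linguistic labels, and the weight values are hypothetical; the actual procedure in [49] may differ.

```python
# Hypothetical linguistic labels for 6 receptive fields, ordered from the
# smallest to the largest centre (an assumption made for illustration).
LABELS = ["VERY SMALL", "SMALL", "MEDIUM SMALL", "MEDIUM LARGE", "LARGE", "VERY LARGE"]


def extract_rule(variable, class_name, weights, labels=LABELS):
    """Return an IF-THEN rule built from the strongest receptive-field weight."""
    if len(weights) != len(labels):
        raise ValueError("one weight per receptive field is required")
    # Index of the receptive field with the largest weight to this class neuron.
    strongest = max(range(len(weights)), key=lambda k: weights[k])
    return f"IF ({variable} is {labels[strongest]}) THEN class {class_name}"


# Illustrative (made-up) weights from the 6 receptive fields of variable v
# to the output neurons of classes Ci and Cj.
w_ci = [0.2, 0.9, 0.5, 0.1, 0.0, 0.0]
w_cj = [0.0, 0.1, 0.2, 0.4, 0.8, 0.3]

rule_ci = extract_rule("v", "Ci", w_ci)
rule_cj = extract_rule("v", "Cj", w_cj)
print(rule_ci)
print(rule_cj)
```

With these made-up weights the strongest connections fall on the second and fifth receptive fields, so the two printed rules have the same shape as those in Eq. (9.3).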

In principle, eSNN use spike information representation, spiking neuron models, and SNN learning and encoding rules, and their structure evolves to capture spatio-temporal relationships from data. The eSNN has been further developed into computational neurogenetic models, as discussed below.
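As an illustration of rank-order learning in a threshold neuron, the following Python sketch may help. The text does not give the learning formula; the sketch assumes a commonly used formulation in which the synapse that receives the spike of rank r gets the weight mod^r, and the firing threshold is a fraction c of the maximum post-synaptic potential (PSP). The names `MOD`, `C`, `train_neuron`, and `psp`, as well as all numeric values, are hypothetical.

```python
MOD = 0.5   # assumed modulation factor, 0 < MOD < 1
C = 0.7     # assumed fraction of the maximum PSP used as firing threshold


def rank_order(spike_times):
    """Map each input index to its rank (0 = earliest spike)."""
    order = sorted(range(len(spike_times)), key=lambda i: spike_times[i])
    ranks = [0] * len(spike_times)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def psp(weights, spike_times):
    """Total post-synaptic potential accumulated for an input pattern."""
    ranks = rank_order(spike_times)
    return sum(w * MOD ** r for w, r in zip(weights, ranks))


def train_neuron(spike_times):
    """One-pass learning: create an output neuron from a single sample."""
    ranks = rank_order(spike_times)
    weights = [MOD ** r for r in ranks]      # earlier spikes get larger weights
    threshold = C * psp(weights, spike_times)
    return weights, threshold


# Made-up first-spike times (ms) of 4 input neurons for one training sample.
sample = [1.0, 3.0, 2.0, 4.0]
weights, threshold = train_neuron(sample)

same_order = [0.5, 2.5, 1.5, 3.5]       # same firing order as the sample
reversed_order = [4.0, 2.0, 3.0, 1.0]   # opposite firing order

fires_same = psp(weights, same_order) >= threshold
fires_reversed = psp(weights, reversed_order) >= threshold
print(fires_same, fires_reversed)
```

Because only the order of the spikes matters in this sketch, a pattern with the same firing order reaches the threshold, while the reversed order does not. Evolving the structure (adding a new output neuron or merging it with a similar one) is left out for brevity.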

9.3.2. Quantum-Inspired Optimization of eSNN

eSNN have several parameters that need to be optimized for optimal performance. Several successful methods have been proposed for this purpose, among them the


Figure 9.6: A simplified structure of an eSNN for 2-class classification showing only one input variable using 6 receptive fields to convert the input values into spike trains.

Figure 9.7: The connection weights of the connections to class Ci and Cj output neurons respectively are interpreted as fuzzy rules Eq. (9.3).

quantum-inspired evolutionary algorithm (QiEA) [50] and the quantum-inspired particle swarm optimization method (QiPSO) [51]. Quantum-inspired optimization methods use the principle of superposition of states to represent and optimize features (input variables) and parameters of the eSNN [19]. Features and parameters are represented as qubits that are in a superposition of 1 (selected), with probability α, and 0 (not selected), with probability β. When the model has to be calculated, the quantum bits “collapse” into 1 or 0.
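The collapse of the qubit representation into a concrete feature mask can be sketched in a few lines of Python. This is only an illustration of the sampling step, not the QiEA or QiPSO update rules from [50, 51]; it assumes that β = 1 − α, and the function name `collapse` and the probability values are hypothetical.

```python
import random


def collapse(prob_selected, rng):
    """Collapse each qubit to 1 with probability alpha, otherwise to 0."""
    return [1 if rng.random() < alpha else 0 for alpha in prob_selected]


rng = random.Random(42)  # fixed seed so that the sketch is repeatable

# Hypothetical probabilities alpha of being selected for 5 features;
# beta = 1 - alpha is the probability of not being selected (assumption).
alphas = [1.0, 0.0, 0.5, 0.9, 0.1]

mask = collapse(alphas, rng)   # one candidate feature subset
print(mask)

# Repeated collapses approximate alpha for each feature.
n = 2000
counts = [0] * len(alphas)
for _ in range(n):
    for k, bit in enumerate(collapse(alphas, rng)):
        counts[k] += bit
frequencies = [cnt / n for cnt in counts]
print(frequencies)
```

In an optimization loop, each collapsed mask would be used to build and evaluate an eSNN, and the probabilities would then be shifted towards the masks that performed best.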

9.3.3. Some Applications Based on eSNN

Numerous applications based on the different eSNN models have been reported, among them:
— Online data modelling with eSNN [52];
— Advanced spiking neural network technologies for neurorehabilitation [53];
— Object movement recognition (EU FP7 Marie Curie EvoSpike project 2011–2012, INI/ETH/UZH, https://cordis.europa.eu/project/id/272006);


— Multimodal audio and visual information processing [43];
— Ecological data modelling and prediction of the establishment of invasive species [54];
— Multi-model eSNN for the prediction of pollution in London areas [55];
— Integrated brain data analysis [56];
— Predictive modelling method and case study on personalized stroke occurrence prediction [57];
— Financial time series prediction [38];
— Other applications [48].

9.4. Computational Neuro-Genetic Models (CNGM) Based on eSNN

9.4.1. Main Principles

A neurogenetic model of a neuron is proposed in [19] and studied in [58]. It utilizes information about how some proteins and genes affect the spiking activities of a neuron, such as fast excitation, fast inhibition, slow excitation, and slow inhibition. An important part of the model is a dynamic gene/protein regulatory network (GRN) model of the dynamic interactions between genes/proteins over time that affect the spiking activity of the neuron (Figure 9.8). New types of neuro-genetic fuzzy rules can be extracted from such CNGM in the form of IF AND THEN .

(9.4)

This type of eSNN is further developed into a comprehensive SNN framework called NeuCube for spatio-/spectro-temporal data modeling and analysis as presented in the next section.

Figure 9.8: A schematic diagram of a CNGM framework, consisting of a GRN as part of an eSNN [58].
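The idea of a GRN whose state influences a spiking neuron parameter can be illustrated with a small Python sketch. The text does not give the equations of the GRN model, so the sketch assumes, purely for illustration, a discrete-time update in which the next expression level of each gene is a sigmoid of the weighted sum of the current levels, and one gene scales the firing threshold of a neuron. The interaction matrix `W`, the coupling to the threshold, and all values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def grn_step(expression, interaction):
    """One time step: g_i(t+1) = sigmoid(sum_j W[i][j] * g_j(t))."""
    return [
        sigmoid(sum(w_ij * g_j for w_ij, g_j in zip(row, expression)))
        for row in interaction
    ]


# Hypothetical 3-gene network: W[i][j] > 0 means gene j activates gene i,
# W[i][j] < 0 means gene j inhibits gene i.
W = [
    [0.0, 1.5, -1.0],
    [-2.0, 0.0, 0.5],
    [1.0, 1.0, 0.0],
]
g = [0.5, 0.5, 0.5]       # initial normalised expression levels

base_threshold = 1.0      # hypothetical spiking-neuron parameter
trajectory = []
for t in range(10):
    g = grn_step(g, W)
    # Assumed coupling: the level of gene 0 scales the firing threshold.
    threshold = base_threshold * (0.5 + g[0])
    trajectory.append((list(g), threshold))

for levels, thr in trajectory[:3]:
    print([round(x, 3) for x in levels], round(thr, 3))
```

In a full CNGM the interaction matrix itself would be a target of optimization, so that the resulting spiking activity matches observed data.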


Figure 9.9: A block diagram of the NeuCube architecture (from [56]).

9.4.2. The NeuCube Architecture

The latest development of neurogenetic systems is NeuCube [56], which was initially designed for spatio-temporal brain data modeling and was subsequently used for climate data modeling, stroke occurrence prediction, and other applications [59]. The NeuCube framework is depicted in Figure 9.9. It consists of the following functional parts (modules):
— Input information encoding module;
— 3D SNN reservoir module (SNNr);
— Output (e.g., classification) module;
— Gene regulatory network (GRN) module (optional);
— Optimization module.

The input module transforms input data into trains of spikes. Spatio-temporal data (such as EEG, fMRI, or climate data) is entered into the main module, the 3D SNN cube (SNNc). Input data is entered into pre-designated, spatially located areas of the SNNc that correspond to the spatial locations where the data was originally collected (if such locations exist). Learning in the SNN is performed in two stages:
— Unsupervised training, where spatio-temporal data is entered into relevant areas of the SNNc over time. Unsupervised learning is performed to modify the initially set connection weights. The SNNc learns to activate the same groups of spiking neurons when similar input stimuli are presented, which is also known as a polychronization effect [7].


— Supervised training of the spiking neurons in the output classification module, where the same data that was used for unsupervised training is propagated again through the trained SNNc, and the output neurons are trained to classify the spatio-temporal spiking pattern of the SNNc into pre-defined classes (or output spike sequences). As a special case, all neurons from the SNN are connected to every output neuron. Feedback connections from output neurons to neurons in the SNN can be created for reinforcement learning. Different SNN methods can be used to learn and classify spiking patterns from the SNNr, including the deSNN [60] and SPAN models [61]. The latter is suitable for generating spike trains in response to certain patterns of activity of the SNNc.

Memory in the NeuCube architecture is represented as a combination of the following three mutually interacting types of memory:
— Short-term memory, represented as changes of the PSP and temporary changes of synaptic efficacy;
— Long-term memory, represented as a stable establishment of synaptic efficacy through long-term potentiation (LTP) and long-term depression (LTD);
— Genetic memory, represented as a genetic code.

In NeuCube, similar activation patterns (called “polychronous waves”) can be generated in the SNNc with recurrent connections to represent short-term memory. When STDP learning is used, connection weights change to form LTP or LTD, which constitute long-term memory. Results obtained with NeuCube suggest that the architecture can be explored for learning long (spatio-)temporal patterns and used as an associative memory. Once data is learned, the SNNc retains the connections as a long-term memory. Since the SNNc learns functional pathways of spiking activities represented as structural pathways of connections, when only a small initial part of the input data is entered, the SNNc will “synfire” and “chain-fire” learned connection pathways to reproduce learned functional pathways. Thus, a NeuCube can be used as an associative memory and as a predictive system.
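Two of the steps described in this section, encoding input data into spikes and changing connection weights through STDP, can be sketched in Python as follows. The text does not specify the encoding algorithm or the STDP parameters; the sketch assumes a simple threshold-based encoding of signal changes and a pair-based STDP rule with exponential time windows. The function names and all parameter values (`threshold`, `a_plus`, `a_minus`, `tau`) are hypothetical and are not taken from the NeuCube software.

```python
import math


def encode_spikes(signal, threshold):
    """Emit +1/-1 when the change between consecutive samples exceeds the threshold."""
    spikes = []
    for prev, cur in zip(signal, signal[1:]):
        diff = cur - prev
        spikes.append(1 if diff > threshold else -1 if diff < -threshold else 0)
    return spikes


def stdp_delta(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP: weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:      # pre-synaptic spike precedes post-synaptic spike: LTP
        return a_plus * math.exp(-dt / tau)
    if dt < 0:      # post-synaptic spike precedes pre-synaptic spike: LTD
        return -a_minus * math.exp(dt / tau)
    return 0.0


signal = [0.0, 0.1, 0.6, 0.7, 0.2, 0.2]   # made-up samples of one input channel
spikes = encode_spikes(signal, threshold=0.3)
print(spikes)

w = 0.5                                            # initial connection weight
w_ltp = w + stdp_delta(t_pre=10.0, t_post=15.0)    # potentiated weight
w_ltd = w + stdp_delta(t_pre=15.0, t_post=10.0)    # depressed weight
print(w_ltp > w, w_ltd < w)
```

In this sketch, accumulated positive changes play the role of LTP and accumulated negative changes the role of LTD, which is how the long-term memory described above would build up over many spike pairs.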

9.4.3. Applications of NeuCube

Current applications of the NeuCube include (see [62, 63]):
• Brain data modeling and analysis, such as:
  – Modeling EEG, fMRI, DTI data;
  – Personalized brain data modeling [64];
  – Sleep data modeling;


  – Predicting brain re-wiring through mindfulness [65];
  – Emotion recognition [66];
• Audio/visual signals:
  – Speech, sound, and music recognition;
  – Video of moving object recognition;
• Multisensory streaming data modelling:
  – Prediction of events from temporal climate data (stroke) [67];
  – Hazardous environmental event prediction;
  – Predicting risk of earthquakes [62];
  – Predicting flooding in Malaysia, by M. Othman et al.;
  – Predicting pollution in London area [55];
  – Predicting extreme weather from satellite images;
  – Predicting traffic flow [68];
  – Odor recognition [69].

The above applications are only a few of the current projects.

9.4.4. Neuromorphic Implementations

The different types of eSNN and neurogenetic systems, especially the NeuCube architecture, are suitable for implementation as neuromorphic hardware systems for embedded applications. For this purpose, both digital (e.g., [70]) and analogue (e.g., [71]) realizations can be used.

9.5. Conclusion

This chapter briefly presents the methods of ECOS. ECOS facilitate adaptive learning and knowledge discovery from complex data. The integrating principles of ECOS are derived from neural networks, fuzzy systems, evolutionary computation, quantum computing, and brain information processing. ECOS methods also include eSNN, deSNN, NeuCube, and neurogenetic and quantum-inspired methods. ECOS applications are manifold, but perhaps most welcome in the environmental and health sciences, where the diagnostic phenomena are chaotic in nature and the data sets are massive and often incomplete. In the field of sustainability science, whether in analyzing issues related to sustainable resource utilization, countering global environmental issues, or assuring the continuity of life (particularly human life) on earth, the speed of transformation is most rapid. Massive data sets with the characteristics just described need to be analyzed, virtually in real time, for prognoses to be made


and solutions to the issues sought at a heightened level of urgency. In this sense, evolving connectionist systems for adaptive learning and knowledge discovery can make a great contribution to the methodologies of intelligent evolving systems.

Acknowledgment

The work on this chapter is supported by the Knowledge Engineering and Discovery Research Institute (KEDRI, http://www.kedri.aut.ac.nz). Some of the methods, techniques, and applications are publicly available from https://www.kedri.aut.ac.nz; https://www.theneucom.com; https://www.kedri.aut.ac.nz/neucube/; https://neucube.io. I would like to acknowledge the work of the many authors who developed methods and applications of ECOS related to the material presented here. I have not been able to reference all, or even a part, of their work. Many of them have contributed greatly to the foundations of ECOS, e.g., S. Amari, T. Yamakawa, W. Freeman, J. Taylor, S. Grossberg, P. Angelov, D. Filev, J. Pratt, S. Ozawa, T. Ludermir, Z.G. Hou, C. Alippi, P. Pang, Lughofer, B. Gabrys, M. Watts, Dovzan and Skrjanc, F. Gomide, Ang and Quek, Rubio, Aznarte, Cetisli, Subramanian and Suresh, Babu and Suresh, Tung, Boubacar, Yamauchi, Hayashi, L. Goh, Ishibuchi, Hwang and Song, Capo, Vachkov, Rutkowski and Cpalka, Han and Qiao, Zanchettin, O’Hora, Hernández and Castañeda, Guzaitis, and many more. I would like to acknowledge the editor of this volume, Plamen Angelov, for his incredible energy and efforts to put all these chapters together.

References

[1] Amari, S. (1967). A theory of adaptive pattern classifiers, IEEE Trans. Electron. Comput., 16, 299–307.
[2] Amari, S. (1990). Mathematical foundations of neurocomputing, Proc. IEEE, 78, 1143–1163.
[3] Zadeh, L. A. (1965). Fuzzy sets, Inf. Control, 8, 338–353.
[4] Zadeh, L. A. (1988). Fuzzy logic, IEEE Comput., 21, 83–93.
[5] Kasabov, N. (1996). Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, Cambridge, Massachusetts, MIT Press, 550p.
[6] Hopfield, J. (1995). Pattern recognition computation using action potential timing for stimulus representation, Nature, 376, 33–36.
[7] Izhikevich, E. M. (2004). Which model to use for cortical spiking neurons? IEEE Trans. Neural Networks, 15(5), 1063–1070.
[8] Kasabov, N. and Shishkov, S. (1993). A connectionist production system with partial match and its use for approximate reasoning, Connect. Sci., 5(3/4), 275–305.
[9] Yamakawa, T., Uchino, E., Miki, T. and Kusanagi, H. (1992). A neo fuzzy neuron and its application to system identification and prediction of the system behaviour. Proc. of the 2nd International Conference on Fuzzy Logic & Neural Networks, Iizuka, Japan, pp. 477–483.
[10] Furuhashi, T., Hasegawa, T., Horikawa, S. and Uchikawa, Y. (1993). An adaptive fuzzy controller using fuzzy neural networks, Proceedings of Fifth IFSA World Congress, pp. 769–772.
[11] Lin and Lee, 1996.


[12] Kasabov, N. (1998). Evolving fuzzy neural networks: algorithms, applications and biological motivation. In: Yamakawa, T. and Matsumoto, G. (eds.) Methodologies for the Conception, Design and Application of Soft Computing, World Scientific, pp. 271–274.
[13] Kasabov, N., Kim, J. S., Watts, M. and Gray, A. (1997). FuNN/2: a fuzzy neural network architecture for adaptive learning and knowledge acquisition, Inf. Sci., 101(3–4), 155–175.
[14] Angelov (2002).
[15] Angelov, P., Filev, D. P. and Kasabov, N. (eds.) (2010). Evolving Intelligent Systems: Methodology and Applications, IEEE Press and Wiley.
[16] Kasabov, N. (2001). Evolving fuzzy neural networks for on-line supervised/unsupervised, knowledge-based learning, IEEE Trans. Syst. Man Cybern. Part B: Cybern., 31(6), 902–918.
[17] Kasabov (2003). Evolving Connectionist Systems, Springer.
[18] Kasabov, N. and Song, Q. (2002). DENFIS: dynamic, evolving neural-fuzzy inference systems and its application for time-series prediction, IEEE Trans. Fuzzy Syst., 10, 144–154.
[19] Kasabov, N. (2007). Evolving Connectionist Systems: The Knowledge Engineering Approach, Springer (1st ed., 2003).
[20] Watts, M. (2009). A decade of Kasabov’s evolving connectionist systems: a review, IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev., 39(3), 253–269.
[21] Futschik, M. and Kasabov, N. (2002). Fuzzy clustering in gene expression data analysis, Proc. of the World Congress of Computational Intelligence WCCI’2002, Hawaii, IEEE Press.
[22] Bezdek, J. (ed.) (1987). Analysis of Fuzzy Information, vols. 1, 2, 3, CRC Press, Boca Raton, Florida.
[23] Yager, R. R. and Filev, D. (1994). Generation of fuzzy rules by mountain clustering, J. Intell. Fuzzy Syst., 2, 209–219.
[24] Marshall, M. R., et al. (2005). Evolving connectionist system versus algebraic formulae for prediction of renal function from serum creatinine, Kidney Int., 6, 1944–1954.
[25] Kasabov, N. (2002). Evolving Connectionist Systems: Methods and Applications in Bioinformatics, Brain Study and Intelligent Machines, Springer Verlag, London, New York, Heidelberg.
[26] Song, Q. and Kasabov, N. (2006). TWNFI: a transductive neuro-fuzzy inference system with weighted data normalisation for personalised modelling, Neural Networks, 19(10), 1591–1596.
[27] Kasabov, N. and Hu, Y. (2010). Integrated optimisation method for personalised modelling and case study applications, Int. J. Funct. Inf. Personal. Med., 3(3), 236–256.
[28] Deng, D. and Kasabov, N. (2003). On-line pattern analysis by evolving self-organising maps, Neurocomputing, 51, 87–103.
[29] Ozawa, S., Pang, S., and Kasabov, N. (2010). On-line feature extraction for evolving intelligent systems, in: P. Angelov, D. Filev, and N. Kasabov (eds.) Evolving Intelligent Systems, IEEE Press and Wiley, (7), 151–172.
[30] Minku, F. L. and Ludermir, T. B. (2005). Evolutionary strategies and genetic algorithms for dynamic parameter optimisation of evolving fuzzy neural networks. In: Proceedings of IEEE Congress on Evolutionary Computation (CEC), Edinburgh, September, vol. 3, pp. 1951–1958.
[31] Chan, Z. and Kasabov, N. (2004). Evolutionary computation for on-line and off-line parameter tuning of evolving fuzzy neural networks, Int. J. of Computational Intelligence and Applications, Imperial College Press, vol. 4, No. 3, 309–319.
[32] Angelov, P. and Filev, D. (2004). An approach to on-line identification of evolving Takagi–Sugeno models, IEEE Trans. Syst. Man Cybern. Part B, 34(1), 484–498.
[33] Minku, F. L. and Ludermir, T. B. (2006). EFuNNs ensembles construction using a clustering method and a coevolutionary genetic algorithm. In: Proceedings of the IEEE Congress on Evolutionary Computation, Vancouver, July 16–21, IEEE Press, Washington, DC, pp. 5548–5555.
[34] Futschik, M., Jeffs, A., Pattison, S., Kasabov, N., Sullivan, M., Merrie, A. and Reeve, A. (2002). Gene expression profiling of metastatic and non-metastatic colorectal cancer cell-lines, Genome Lett., 1(1), 1–9.


[35] Huang, L., et al. (2008). Evolving connectionist system based role allocation for robotic soccer, Int. J. Adv. Rob. Syst., 5(1), 59–62.
[36] Heddam, S., Watts, M. J., Houichi, L., Djemili, L. and Sebbar, A. (2018). Evolving connectionist systems (ECoSs): a new approach for modelling daily reference evapotranspiration (ET0), Environ. Monit. Assess., 190, 216, https://doi.org/10.1007/s10661-018-6903-0.
[37] Kisi, O., Heddam, S. and Yaseen, Z. M. (2019). The implementation of univariable scheme-based air temperature for solar radiation prediction: new development of dynamic evolving neural-fuzzy inference system model, Appl. Energy, 241, 184–195.
[38] Widiputra, H., Pears, R. and Kasabov, N. (2011). Multiple time-series prediction through multiple time-series relationships profiling and clustered recurring trends. In: Huang, J. Z., Cao, L. and Srivastava, J. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol. 6635, Springer, Berlin, Heidelberg, pp. 161–172.
[39] Hodgkin, A. L. and Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve, J. Physiol., 117, 500–544.
[40] Hebb, D. (1949). The Organization of Behavior, John Wiley and Sons, New York.
[41] Gerstner, W. (1995). Time structure of the activity of neural network models, Phys. Rev., 51, 738–758.
[42] Kasabov (2010). To spike or not to spike: A probabilistic spiking neural model, Neural Networks, 23(1), 16–19.
[43] Wysoski, S., Benuskova, L. and Kasabov, N. (2010). Evolving spiking neural networks for audiovisual information processing, Neural Networks, 23(7), 819–835.
[44] Thorpe, S., Delorme, A., et al. (2001). Spike-based strategies for rapid processing, Neural Networks, 14(6–7), 715–725.
[45] Verstraeten, D., Schrauwen, B., D’Haene, M. and Stroobandt, D. (2007). An experimental unification of reservoir computing methods, Neural Networks, 20(3), 391–403.
[46] Kasabov, N., et al. (2013a). Dynamic evolving spiking neural networks for on-line spatio- and spectro-temporal pattern recognition, Neural Networks, 41, 188–201.
[47] Song, S., Miller, K., Abbott, L., et al. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity, Nat. Neurosci., 3, 919–926.
[48] Schliebs, S. and Kasabov, N. (2013). Evolving spiking neural network: a survey, Evol. Syst., 4, 87–98.
[49] Soltic, S. and Kasabov, N. (2010). Knowledge extraction from evolving spiking neural networks with rank order population coding, Int. J. Neural Syst., 20(6), 437–445.
[50] Defoin-Platel, M., Schliebs, S. and Kasabov, N. (2009). Quantum-inspired evolutionary algorithm: a multi-model EDA, IEEE Trans. Evol. Comput., 13(6), 1218–1232.
[51] Nuzly, H., Kasabov, N. and Shamsuddin, S. (2010). Probabilistic evolving spiking neural network optimization using dynamic quantum inspired particle swarm optimization, Proc. ICONIP 2010, Part I, LNCS, vol. 6443.
[52] Lobo, J. L., Del Ser, J., Bifet, A. and Kasabov, N. (2020). Spiking neural networks and online learning: an overview and perspectives, Neural Networks, 121, 88–100, https://www.sciencedirect.com/science/article/pii/S0893608019302655?via%3Dihub.
[53] Chen, Y., Hu, J., Kasabov, N., Hou, Z. and Cheng, L. (2013). NeuCubeRehab: a pilot study for EEG classification in rehabilitation practice based on spiking neural networks, Proc. ICONIP 2013, Springer, Berlin, Heidelberg, vol. 8228, pp. 70–77.
[54] Schliebs, S., et al. (2009). Integrated feature and parameter optimization for evolving spiking neural networks: exploring heterogeneous probabilistic models, Neural Networks, 22, 623–632.
[55] Maciag, P. S., Kasabov, N., Kryszkiewicz, M. and Bembenik, R. (2019). Air pollution prediction with clustering-based ensemble of evolving spiking neural networks and a case study on London area, Environ. Modell. Software, 118, 262–280, https://www.sciencedirect.com/science/article/pii/S1364815218307448?dgcid=author.

June 1, 2022 12:51

420

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch09

Handbook on Computer Learning and Intelligence — Vol. I

[56] Kasabov, N. (2014). NeuCube: a spiking neural network architecture for mapping, learning and understanding of spatio-temporal brain data, Neural Networks, 52, 62–76, http://dx.doi.org/10. 1016/j.neunet.2014.01.006. [57] Kasabov, N., Liang, L., Krishnamurthi, R., Feigin, V., Othman, M., Hou, Z. and Parmar, P. (2014). Evolving spiking neural networks for personalised modelling of spatio-temporal data and early prediction of events: a case study on stroke, Neurocomputing, 134, 269–279. [58] Benuskova, L. and Kasabov, N. (2007). Computational Neuro-genetic Modelling, Springer, New York. [59] Kasabov et al. (2015). Evolving spatio-temporal data machines based on the NeuCube neuromorphic framework: Design methodology and selected applications, Neural Networks, 78, 1–14 (2016). http://dx.doi.org/10.1016/j.neunet.2015.09.011. [60] Kasabov et al. (2013b). Dynamic evolving spiking neural networks for on-line spatio- and spectro-temporal pattern recognition, Neural Networks, 41, 188–201. [61] Mohemmed, A., Schliebs, S., Matsuda, S. and Kasabov, N. (2013). Evolving spike pattern association neurons and neural networks, Neurocomputing, 107, 3–10. [62] Kasabov, N., Scott, N., Tu, E., Marks, S., Sengupta, N., Capecci, E., Othman, M., Doborjeh, M., Murli, N., Hartono, R., Espinosa-Ramos, J., Zhou, L., Alvi, F., Wang, G., Taylor, D., Feigin, V., Gulyaev, S., Mahmoudh, M., Hou, Z.-G. and Yang, J. (2016). Design methodology and selected applications of evolving spatio-temporal data machines in the NeuCube neuromorphic framework, Neural Networks, 78, 1–14. [63] Kasabov, N. (2018). Time-Space, Spiking Neural Networks and Brain-Inspired Artificial Intelligence, Springer. [64] Doborjeh, M., Kasabov, N., Doborjeh, Z., Enayatollahi, R., Tu, E., Gandomi, A. H. (2019). Personalised modelling with spiking neural networks integrating temporal and static information, Neural Networks, 119, 162–177. [65] Doborjeh, Z., Doborjeh, M., Taylor, T., Kasabov, N., Wang, G. Y., Siegert, R. 
and Sumich, A. (2019). Spiking neural network modelling approach reveals how mindfulness training rewires the brain, Sci. Rep., 9, 6367, https://www.nature.com/articles/s41598-019-42863-x. [66] Tan, C., Sarlija, M. and Kasabov, N. (2021). NeuroSense: short-term emotion recognition and understanding based on spiking neural network modelling of spatio-temporal EEG patterns, Neurocomputing, 434, 137–148, https://authors.elsevier.com/sd/article/S0925-2312(20)32010-5. [67] Kasabov, N., Liang, L., Krishnamurthi, R., Feigin, V., Othman, M., Hou, Z. and Parmar, P. (2014). Evolving spiking neural networks for personalised modelling of spatio-temporal data and early prediction of events: a case study on stroke, Neurocomputing, 134, 269–279. [68] Laña, I., Lobo, J. L., Capecci, E., Del Ser, J. and Kasabov, N. (2019). Adaptive long-term traffic state estimation with evolving spiking neural networks, Transp. Res. Part C: Emerging Technol., 101, 126–144, https://doi.org/10.1016/j.trc.2019.02.011. [69] Vanarse, A., Espinosa-Ramos, J. I., Osseiran, A., Rassau, A. and Kasabov, N. (2020). Application of a brain-inspired spiking neural network architecture to odour data classification, Sensors, 20, 2756. [70] Furber, S. (2012). To build a brain, IEEE Spectr., 49(8), 39–41. [71] Indiveri, G., Linares-Barranco, B., Hamilton, T., Van Schaik, A., Etienne-Cummings, R., Delbruck, T., Liu, S., Dudek, P., Hafliger, P., Renaud, S., et al. (2011). Neuromorphic silicon neuron circuits, Front. Neurosci., 5, 1–23. [72] Doborjeh, Z., Kasabov, N., Doborjeh, M. and Sumich, A. (2018). Modelling peri-perceptual brain processes in a deep learning spiking neural network architecture, Sci. Rep., 8, 8912, doi:10.1038/s41598-018-27169-8; https://www.nature.com/articles/s41598-018-27169-8. [73] Ibad, T., Kadir, S. J. A. and Ab Aziz, N. B. (2020). Evolving spiking neural network: a comprehensive survey of its variants and their results, J. Theor. Appl. Inf. Technol., 98(24), 4061–4081.


[74] Kasabov, N. (1994). Connectionist fuzzy production systems. In: Ralescu, A. L. (eds.) Fuzzy Logic in Artificial Intelligence. FLAI 1993. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol. 847, Springer, Berlin, Heidelberg, pp. 114–128. [75] Kasabov, N. (1995). Hybrid connectionist fuzzy production systems: towards building comprehensive AI, Intell. Autom. Soft Comput., 1(4), 351–360. [76] Kasabov, N. (1993). Hybrid connectionist production systems, J. Syst. Eng., 3(1), 15–21. [77] Kasabov, N. (1991). Incorporating neural networks into production systems and a practical approach towards the realisation of fuzzy expert systems, Comput. Sci. Inf., 21(2), 26–34 (1991). [78] Kasabov, N., Schliebs, R. and Kojima, H. (2011). Probabilistic computational neurogenetic framework: from modelling cognitive systems to Alzheimer’s disease, IEEE Trans. Auton. Ment. Dev., 3(4), 300–311. [79] Kasabov, N. (ed.) (2014). Springer Handbook of Bio-/Neuroinformatics, Springer. [80] Kumarasinghe, K., Kasabov, N. and Taylor, D. (2021). Brain-inspired spiking neural networks for decoding and understanding muscle activity and kinematics from electroencephalography signals during hand movements, Sci. Rep., 11, 2486, https://doi.org/10.1038/s41598-021-81805-4. [81] Kumarasinghe, K., Kasabov, N. and Taylor, D. (2020). Deep learning and deep knowledge representation in spiking neural networks for brain-computer interfaces, Neural Networks, 121, 169–185, https://doi.org/10.1016/j.neunet.2019.08.029. [82] Schliebs, R. (2005). Basal forebrain cholinergic dysfunction in Alzheimer’s disease — interrelationship with β-amyloid, inflammation and neurotrophin signaling, Neurochem. Res., 30, 895–908.



© 2022 World Scientific Publishing Company
https://doi.org/10.1142/9789811247323_0010

Chapter 10

Supervised Learning Using Spiking Neural Networks

Abeegithan Jeyasothy∗,¶, Shirin Dora†,‖, Suresh Sundaram‡,∗∗ and Narasimhan Sundararajan§,††

∗ School of Computer Science and Engineering, Nanyang Technological University, Singapore
† Department of Computer Science, Loughborough University, United Kingdom
‡ Department of Aerospace Engineering, Indian Institute of Science, Bangalore, India
§ School of Electrical and Electronic Engineering, Nanyang Technological University (Retired Professor), Singapore

¶ [email protected] ‖ [email protected] ∗∗ [email protected] †† [email protected]

Spiking neural networks (SNNs), often termed the third generation of neural networks, are inspired by the information processing mechanisms employed by biological neurons in the brain. The higher computational power and energy efficiency of SNNs have led to the development of many learning algorithms to train SNNs for classification problems using a supervised learning paradigm. This chapter begins by providing a brief introduction to spiking neural networks along with a review of existing supervised learning algorithms for SNNs. It then highlights three learning algorithms recently developed by the authors that exploit the biological proximity of SNNs for training. Performance evaluation of the proposed algorithms, along with a comparison with some other important existing learning algorithms for SNNs, is presented based on multiple problems from the UCI machine learning repository. The comparison highlights the better performance achieved by the recently developed learning algorithms. Finally, this chapter presents possible future directions of research in the area of SNNs.

10.1. Introduction

Advances in machine learning have often been driven by new findings in neuroscience. This is also true for spiking neural networks (SNNs), which represent and process information using mechanisms that are inspired by biological neurons. SNNs employ spiking neurons whose inputs and outputs are binary events


in time called spikes. The binary nature of spikes renders SNNs extremely energy-efficient compared to artificial neural networks (ANNs), whose inputs and outputs are real values. Furthermore, it has been shown that spiking neurons are computationally more powerful than sigmoidal neurons [1]. Implementation of SNNs in chips has resulted in an active and growing area of research referred to as neuromorphic computing. These aspects of SNNs have led to a surge in research on the development of learning methods for SNNs.

The discrete nature of spikes renders the response of spiking neurons non-differentiable. As a result, learning algorithms for ANNs cannot be directly used to train SNNs. Therefore, most of the current learning approaches for SNNs were developed by adapting existing learning algorithms for ANNs or by building upon the learning mechanisms observed in biology.

In this chapter, we focus on recently developed supervised learning schemes for training spiking neural networks to handle classification problems. We first provide a brief overview of existing learning algorithms for SNNs. We then describe in detail three of our recently developed learning algorithms for SNNs that exploit the current understanding of various biological learning phenomena. The three algorithms consist of two batch learning algorithms and one online learning algorithm. The two batch learning algorithms are (i) the two-stage margin maximization spiking neural network (TMM-SNN) and (ii) the synaptic efficacy function-based leaky-integrate-and-fire neuron (SEFRON); the online learning algorithm is (iii) the online meta-neuron based learning algorithm (OMLA). These learning algorithms have been chosen as they are based on diverse biological phenomena. TMM-SNN and SEFRON employ local learning rules to update the synaptic weights in the network, as is the case with biological learning mechanisms. TMM-SNN directly uses the potentials contributed by presynaptic neurons to update the weights, while SEFRON employs a modified spike-time-dependent plasticity (STDP) [2] rule for learning. TMM-SNN employs constant weights, whereas SEFRON uses a time-varying weight model. Unlike TMM-SNN and SEFRON, OMLA is inspired by the concept of tripartite synapses [3] in the brain, which allows OMLA to consider both local and global information in the network for learning. Similar to TMM-SNN, OMLA also uses a constant weight model.

The chapter begins by providing the necessary background for understanding and simulating SNNs in Section 10.2. Section 10.3 provides an overview of existing learning algorithms for SNNs. Section 10.4 describes the three different learning algorithms for classification using SNNs. Section 10.5 presents the results of performance evaluation of these learning algorithms for SNNs. Finally, Section 10.6 summarizes the chapter and provides important future directions for further research in this area.


10.2. Spiking Neural Networks

Spiking neural networks (SNNs) are used in many applications. However, we are limiting our presentation here to cover only the usage of SNNs in a supervised learning environment. In SNNs, an additional dimension, time, is included in the computations. The inputs and outputs for spiking neurons can be described using a binary vector, where the length of the vector is the input time interval, 1's in the vector represent the occurrence of a spike, and 0's in the vector represent the absence of a spike. The position of 1's in the vector indicates the time of the spike. Figure 10.1 shows a generic architecture of a typical spiking neural network. The design of an SNN architecture is very similar to that of any other neural network. However, the main differences are the temporal input nature of a spiking neuron and the operation of a spiking neuron.

In a nutshell, the fundamental functioning of any spiking neuron can be described as follows: For an input spike, either a positive or a negative postsynaptic potential is generated. Postsynaptic potentials from all the input spikes are accumulated in a spiking neuron. An output spike is generated when the accumulated potential crosses a threshold. After an output spike, the postsynaptic potential of a spiking neuron resets and the neuron starts accumulating postsynaptic potentials again after a refractory period.

Using SNNs for real-world dataset problems is not very straightforward, as most of the real-world datasets consist of analog values. Hence the datasets have to be encoded into spike patterns for use with SNNs. This chapter focuses on setting the context for using SNNs in a supervised learning environment for real-world datasets. Accordingly, this section is divided into four subsections. First, the different encoding mechanisms to convert real-valued inputs into spike patterns are presented. In the second subsection, the concept of a supervised learning environment for SNNs is defined and the mechanisms for decoding the SNN outputs to work with supervised learning algorithms are presented. In the next subsection, different types of spiking

Figure 10.1: A generic architecture of a spiking neural network.


neuron models are presented. Finally, the different weight models used in a spiking neural network are highlighted.
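The basic operation of a spiking neuron outlined at the start of this section (accumulate postsynaptic potentials, fire on a threshold crossing, reset, then wait out a refractory period) can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not part of the chapter: every input spike adds an instantaneous, non-decaying contribution equal to a hypothetical weight, and the names simulate_neuron, threshold, and refractory_steps are ours.

```python
def simulate_neuron(spike_trains, weights, threshold=1.0, refractory_steps=2):
    """Accumulate weighted input spikes and emit output spikes.

    spike_trains: list of binary lists (one per input), each of length T,
                  where a 1 at position t means "spike at time step t".
    weights: one weight per input (positive or negative contribution).
    Returns a binary output spike vector of length T.
    """
    T = len(spike_trains[0])
    potential = 0.0          # accumulated postsynaptic potential
    refractory_left = 0      # remaining refractory time steps
    output = [0] * T
    for t in range(T):
        if refractory_left > 0:
            # Inputs are ignored during the refractory period.
            refractory_left -= 1
            continue
        potential += sum(w * train[t] for w, train in zip(weights, spike_trains))
        if potential >= threshold:
            output[t] = 1                    # output spike
            potential = 0.0                  # reset
            refractory_left = refractory_steps
    return output


inputs = [
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
]
out = simulate_neuron(inputs, weights=[0.6, 0.5])
spike_times = [t for t, s in enumerate(out) if s == 1]
print(spike_times)
```

With these assumed weights the neuron needs one spike from each input to cross the threshold, and the two input spikes arriving at time step 3 fall inside the refractory period and are ignored.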

10.2.1. Encoding Methods

For a real-valued input x, where x ∈ (0, 1), an encoding scheme is needed to convert the real-valued data into a spike pattern. Generally, there are two types of encoding schemes, viz., rate encoding and temporal encoding. The two types differ significantly [4]: the frequency of spikes carries the information in rate encoding, whereas the time of spikes carries the information in temporal encoding. In rate encoding, the spike vector for an input is generated from the spike frequency assigned to the input value. In temporal encoding, the spike vector is generated by a set of functions called receptive field neurons (RF neurons) that mimic the functioning of biological sensory neurons.

10.2.1.1. Rate encoding

In rate encoding, the real value of the input is used as the frequency of occurrence of the spikes in the time interval T. The Poisson distribution is a popular choice in rate coding for generating spikes at random times. For example, if there are 300 time steps in the time interval T, then for a real-valued input of 0.25, a random 75 (= 300 × 0.25) time steps will have a spike and the remaining 225 time steps will not have any spikes. A generic rate encoding scheme is given by

\[ R(x) = s, \quad \text{where } s = \{s^1, s^2, \ldots, s^f\}. \tag{10.1} \]

Here f ∈ [0, T] is the total number of spikes generated for input x, where f = T · x. Figure 10.2 shows the generation of a spike pattern for a real-valued input x = 0.2 using rate encoding. Ten time steps are used in the spike interval. Hence two (f = 10 × 0.2) spikes are generated for an input x = 0.2. The timings of the two spikes are $s^1$ and $s^2$.

10.2.1.2. Temporal encoding

The basic principle of temporal encoding is that the real-valued input is mapped to a receptive field (RF) neuron (function) and the output of the RF neuron is used as the firing strength to determine the time of the spike. Based on the type and number of

Figure 10.2: Generating a spike pattern for real-valued input data (x = 0.2) using the rate encoding scheme.


RF neurons used in the encoding scheme, temporal encoding can be categorized into three major groups: direct encoding, population encoding, and polynomial encoding.

Direct Encoding: In direct encoding, one RF neuron is used, whereas in population encoding and polynomial encoding, multiple RF neurons are used in mapping a single real-valued input. Generally, a linear function is used for direct encoding. A general direct encoding equation is given by φ = x, where φ is the firing strength, which is used to compute the time of the spike (s) as

\[ s = T \cdot (1 - \phi) \tag{10.2} \]

Population Encoding: Population encoding is the most commonly employed encoding technique for evaluating spiking neural networks on problems with real-valued features [5]. The difference between population encoding and polynomial encoding is that in population encoding the same closed functions with evenly spaced centers are used as RF neurons, whereas in polynomial encoding, different functions (e.g., Chebyshev polynomials) are used as RF neurons. The similarity between the two is that the input data are projected into a higher-dimensional space $\mathbb{R}^q$ by the RF neurons (where q is the number of RF neurons used in the encoding scheme). In population encoding, for a given input x, each RF neuron produces a firing strength $\phi_r$, where r ∈ {1, …, q} denotes the rth RF neuron (q = 1 for direct encoding). The firing strength $\phi_r$ determines the presynaptic firing time $s_r = T \cdot (1 - \phi_r)$ for the rth RF neuron. Generally, Gaussian functions are used as the receptive field neurons. However, other types of closed functions can also be used, e.g., squared cosine functions. The center ($\mu_r$) and width ($\sigma_r$) of the rth Gaussian receptive field used for encoding an input feature having the range $[I_{\min}, I_{\max}]$ are initialized as

\[ \mu_r = I_{\min} + \frac{(2r - 3)}{2}\,\frac{(I_{\max} - I_{\min})}{q - 2}, \quad r \in \{1, \cdots, q\} \tag{10.3} \]

\[ \sigma_r = \frac{1}{\gamma}\,\frac{(I_{\max} - I_{\min})}{q - 2}, \quad r \in \{1, \cdots, q\}. \tag{10.4} \]

Based on the center and width of the receptive fields from Eqs. (10.3) and (10.4), respectively, the firing strength ($\phi_r$) of the rth receptive field for x is given as

\[ \phi_r = \exp\left(-\frac{(x - \mu_r)^2}{2\sigma_r^2}\right). \tag{10.5} \]
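As a rough illustration, Eqs. (10.3) to (10.5), together with the spike-time rule of Eq. (10.2), can be evaluated as in the following Python sketch. The function name population_encode and the default values are hypothetical; in particular, the text does not give a value for γ, so γ = 1.5 is an assumption made only for this example, and q > 2 is required by the denominators of Eqs. (10.3) and (10.4).

```python
import math


def population_encode(x, q=4, T=10.0, i_min=0.0, i_max=1.0, gamma=1.5):
    """Return (firing_strengths, spike_times) of q Gaussian RF neurons for input x.

    gamma = 1.5 is an assumed value; q must be greater than 2.
    """
    strengths, times = [], []
    for r in range(1, q + 1):
        mu = i_min + (2 * r - 3) / 2 * (i_max - i_min) / (q - 2)   # Eq. (10.3)
        sigma = (1 / gamma) * (i_max - i_min) / (q - 2)            # Eq. (10.4)
        phi = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))        # Eq. (10.5)
        strengths.append(phi)
        times.append(T * (1 - phi))                                # Eq. (10.2)
    return strengths, times


strengths, times = population_encode(0.2)
for r, (phi, s) in enumerate(zip(strengths, times), start=1):
    print(f"RF neuron {r}: firing strength {phi:.3f}, spike time {s:.2f}")
```

For x = 0.2 and q = 4, the second RF neuron (center 0.25) has the highest firing strength and therefore fires earliest, while the fourth (center 1.25) has a firing strength close to zero and fires close to T. The exact spike times depend on the assumed γ and on any rounding to discrete time steps, so they need not coincide with the values drawn in the chapter's figure.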


Figure 10.3: Generating spike patterns for real-valued input (x = 0.2) using the population encoding scheme.

For firing strength ($\phi_r$), spike time $s_r$ is calculated using Eq. (10.2). A generic temporal encoding scheme is denoted by P(x) = s, where s is given by

\[ s = \{s_1, s_2, \ldots, s_q\} \tag{10.6} \]

where $s \in [0, T]^q$. In direct encoding and population encoding, a higher RF neuron firing strength ($\phi_r$ closer to 1) corresponds to an early presynaptic firing ($s_r$ closer to 0) and a lower RF neuron firing strength ($\phi_r$ closer to 0) corresponds to a late presynaptic firing ($s_r$ closer to T). Figure 10.3 shows the generation of a spike pattern for a real-valued input x = 0.2 using population encoding. Four RF neurons are used; hence the dimension of the input is increased four times. It can be observed that the 4th RF neuron has 0 firing strength and the 2nd RF neuron has the highest firing strength. Therefore, the spike times are $s_4 = 10$ and $s_2 = 2$, respectively.

Figures 10.2 and 10.3 illustrate the different encoding schemes. The same real-valued input x = 0.2 is used in all the encoding schemes. Ten time steps are used in the spike interval. The generated spike pattern is significantly different for each encoding scheme. In Figure 10.2, the spikes occur at two random time steps within the interval, and the frequency of the spikes carries the information about the real-valued input. In Figure 10.3, the time of each spike is determined by the firing strengths of the RF neurons, and the time of the spike carries the information about the given real-valued input.
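For comparison with the temporal schemes, the rate encoding of Eq. (10.1) and the direct encoding of Eq. (10.2) can be sketched as follows. This sketch follows the chapter's fixed-count example (f = T · x spikes placed at random time steps) and is not a Poisson spike generator; the function names, the rounding of f to an integer, and the optional seeded generator rng are assumptions made for illustration.

```python
import random


def rate_encode(x, T=10, rng=None):
    """Return sorted spike time steps s^1..s^f for x in [0, 1], with f = round(T * x)."""
    rng = rng or random.Random()
    f = round(T * x)                       # number of spikes, Eq. (10.1)
    return sorted(rng.sample(range(T), f))  # f distinct random time steps


def to_spike_vector(spike_times, T=10):
    """Binary vector of length T with 1s at the spike time steps."""
    vector = [0] * T
    for s in spike_times:
        vector[s] = 1
    return vector


def direct_encode(x, T=10):
    """Single spike time from Eq. (10.2) with firing strength phi = x."""
    return T * (1 - x)


spikes = rate_encode(0.2, T=10, rng=random.Random(0))
vector = to_spike_vector(spikes, T=10)
print(spikes, vector)          # two spikes at random positions
print(direct_encode(0.2))      # one spike whose time encodes x
```

The rate-encoded pattern changes from call to call unless the generator is seeded, whereas the spike count stays fixed at f; the direct-encoded spike time is deterministic.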

10.2.2. Overall Problem Definition: Some Preliminaries

Converting a real-valued input to a spike pattern is a pre-processing step. After that, the supervised learning problem can be formulated for temporal inputs (spikes). We have unified the notations for spike times to avoid the differences in the notation


of the different encoding mechanisms. We refer to Figure 10.1 for defining the notations in an SNN. For a spike input, we indicate the index of the attribute by a subscript and the order of the spike by a superscript, e.g., $s_i^k$ means the spike time of the kth spike of the ith input neuron. Here $i \in [0, N_i]$, where $N_i$ is the total number of attributes, and $k \in [0, f_i]$, where $f_i$ is the total number of spikes for the ith input neuron. For a temporal input s, where $s \in [0, T]^{N_i}$, an SNN classifier SNN(s) produces an output y,

\[ y = \{y_1, y_2, \ldots, y_{N_j}\}, \qquad y_j = \{\hat{t}_j^1, \hat{t}_j^2, \ldots, \hat{t}_j^{f_j}\} \tag{10.7} \]

Here $N_j$ is the total number of output classes and $y_j$ is the vector of postsynaptic firing times of the jth output neuron, with $y_j \in [0, T + \delta T]^{f_j}$. $f_j$ is the total number of spikes for the jth output neuron, T is the presynaptic spike time interval limit, and δT is a small additional time after T that allows late postsynaptic spikes. $\hat{t}_j^k$ is the kth output spike of the jth output neuron. We use s with a layer index for spikes in the input and hidden layers; $\hat{t}$ is used only for spikes in the output layer. The same notations will be followed in the remaining sections. The weight $w_{ih}$ means the weight between the ith neuron in the current layer and the hth neuron in the next layer.

In a supervised learning environment, the error signal is determined at the output layer. The error signal is then used in the supervised learning algorithm to update the weights. Different decoding schemes and coded class labels are used, depending on the encoding scheme used to convert the real-valued inputs.

10.2.2.1. A typical classification problem

A typical classification problem requires the SNN to predict a class label for a given input spike pattern. In an analog neural network, the predicted class label is determined by the output neuron that has the highest response value. A similar analogy is used in an SNN to predict the class label. However, a decoding method has to be used to determine the output spiking neuron that has the highest response for the given input spikes. Decoding methods are used to estimate the predicted class for a given input spike pattern based on the response of the output neurons in the network. The decoding method for an input pattern is determined by the encoding method used in the pre-processing step. The two encoding methods carry the input information in different formats: rate encoding carries the input information in the frequency of the spikes and temporal encoding carries the input information in the time of the spikes. For inputs generated from the rate-encoding scheme and


temporal-encoding scheme, a frequency-based decoding scheme and a time-based decoding scheme are used, respectively.

Rate-based decoding scheme

For inputs generated from a rate-encoding scheme, the frequency (or total count) of the input spikes will be high for higher input values and low otherwise. Similarly, the response of an output neuron is determined by the frequency of its output spikes. Therefore, the output class label for an input spike pattern is determined by the output neuron that fires the most spikes, and the class label $\hat{c}$ is predicted as

\[ \hat{c} = \operatorname*{argmax}_{j}\,(f_j) \tag{10.8} \]

Coded class labels are assigned to align with the decoding methods. For a rate decoding scheme, coded class labels may have higher frequency values for the desired class label and lower frequency values for the other class labels. For an input spike pattern belonging to the class $c_k$, a generic coded class label is determined by

\[ \hat{f}_j = \begin{cases} F_d & \text{if } j = c_k \\ F_d - F_m & \text{otherwise} \end{cases} \tag{10.9} \]

Here, $\hat{f}_j$ is the coded class label for the jth output neuron, $F_d$ is the desired frequency of the correct class label, and $F_m$ is the margin frequency, which reduces the frequency of the spikes for undesired output classes.

Temporal decoding scheme

For inputs generated from a temporal encoding scheme, the spike time of the input is closer to 0 for a higher input value and closer to T otherwise. Hence the response of an output neuron is determined by the times of its output spikes. The output class label for an input spike pattern is determined by the output neuron that fires first, and the class label $\hat{c}$ is predicted as

\[ \hat{c} = \operatorname*{argmin}_{j}\,(y^1), \tag{10.10} \]

where $y^1$ indicates the vector of the spike times of the 1st spike of all the output neurons. For a temporal decoding scheme, coded class labels may have low spike time values for the desired class label and high spike time values for the other class labels. For an input spike pattern belonging to the class $c_k$, a generic coded class label is determined by

\[ \hat{y}_j = \begin{cases} \hat{t}_d & \text{if } j = c_k \\ \hat{t}_d + \hat{t}_m & \text{otherwise} \end{cases} \tag{10.11} \]


Here, $\hat{y}_j$ is the coded class label for the jth output neuron, $\hat{t}_d$ is the desired spike time of the correct class label, and $\hat{t}_m$ is the margin spike time, which delays the firing of the spikes for undesired output classes.
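The two decoding rules of Eqs. (10.8) and (10.10) and the coded class labels of Eqs. (10.9) and (10.11) can be sketched in Python as follows. The function names are hypothetical, classes are indexed from 0 for convenience, ties are resolved in favour of the lowest class index, and an output neuron that never fires is treated as firing at infinity; these choices are assumptions of the sketch and are not specified in the text.

```python
def rate_decode(output_spikes):
    """Eq. (10.8): predicted class is the output neuron with the most spikes."""
    counts = [len(times) for times in output_spikes]
    return counts.index(max(counts))


def temporal_decode(output_spikes):
    """Eq. (10.10): predicted class is the output neuron that fires first."""
    first = [min(times) if times else float("inf") for times in output_spikes]
    return first.index(min(first))


def rate_coded_label(true_class, n_classes, F_d, F_m):
    """Eq. (10.9): desired spike counts for every output neuron."""
    return [F_d if j == true_class else F_d - F_m for j in range(n_classes)]


def temporal_coded_label(true_class, n_classes, t_d, t_m):
    """Eq. (10.11): desired first-spike times for every output neuron."""
    return [t_d if j == true_class else t_d + t_m for j in range(n_classes)]


# Output spike times y_j of three output neurons for one input pattern.
y = [[4.0, 7.5], [2.5], [3.0, 5.0, 9.0]]
print(rate_decode(y))       # neuron with the largest spike count
print(temporal_decode(y))   # neuron with the earliest first spike
print(rate_coded_label(1, 3, F_d=5, F_m=3))
print(temporal_coded_label(1, 3, t_d=2.0, t_m=1.5))
```

The same output pattern is assigned to different classes by the two rules, which is why the decoding scheme has to match the encoding scheme used in pre-processing.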

10.2.3. Spiking Neuron Models

In a biological neuron, information from one neuron is passed to another neuron in the form of an action potential. In SNNs, a spike is a simplified representation of an action potential in a biological neuron. Figure 10.4 shows the different phases of an action potential in a biological neuron. In the absence of any stimulus, the neuron is in the resting state, and the membrane potential of the neuron in this state is termed the resting potential. In the presence of an external stimulus (input to the neuron), the neuron starts depolarizing, i.e., the membrane potential of the neuron starts growing. This phase is termed the depolarization phase. The membrane potential rises sharply once the potential reaches a threshold value, and an action potential is said to have occurred when the membrane potential reaches the peak value. After generating an action potential, the membrane potential of the neuron decays abruptly (repolarization phase) and undershoots the resting potential. This phase is termed the refractory period. In this period, the neuron exhibits a subdued response to any external stimulus and its membrane potential slowly returns to the resting potential.

It has been shown in the literature that the strength of the stimulus only determines the occurrence or non-occurrence of an action potential and not the amplitude of the action potential. In SNNs, a spiking neuron mimics the generation of an action

Figure 10.4: Different phases of an action potential (picture from https://en.wikipedia.org/wiki/Action_potential).


potential in a biological neuron. A spike in SNNs is a binary event in time and it represents the occurrence or non-occurrence of an action potential. A spiking neuron model reproduces the functional characteristics of generating an action potential in a biological neuron. The functioning of a biologically plausible spiking neuron is accurately described by the Hodgkin–Huxley model [6]. It reproduces all the stages of an action potential in a biological neuron. However, the high computational requirements of the Hodgkin–Huxley model render it unsuitable for implementation in a supervised learning environment. To overcome these issues, several mathematically tractable spiking neuron models, e.g., the Izhikevich neuron model [7], the leaky integrate-and-fire model, and the spike response model, have been developed, which reproduce the different aspects of the Hodgkin–Huxley model to varying degrees of accuracy. Generally, the leaky integrate-and-fire (LIF) model [8, 9] and the spike response model (SRM) [10, 11] are the commonly used spiking neuron models in a supervised learning environment.

In a supervised learning environment, error signals are used to update the network parameters so that the desired outputs can be achieved. In the SNN context, the network parameters are associated with the postsynaptic potentials. In the LIF model and the SRM, the postsynaptic potentials can be broken down into two components, a weight and a function, so that the weight can be updated during the supervised learning phase.

10.2.3.1. Leaky integrate-and-fire (LIF) model

The leaky integrate-and-fire (LIF) neuron model is a highly simplified version of the Hodgkin–Huxley model. It reproduces the depolarization phase of an action potential, but the repolarization phase and refractory period are not implicitly reproduced by the LIF neuron model. Instead, the membrane potential of the neuron is explicitly reset after generating a spike to model the repolarization phase, and an absolute refractory period is employed, during which the membrane potential of the neuron is fixed at the resting potential. The LIF neuron model employs an electrical circuit with a resistance (R) and a capacitance (C) to emulate the membrane potential of a neuron. The electrical circuit is shown in Figure 10.5. The current I(t) in the figure represents the presynaptic input current at time t. The current causes the voltage to build up across the capacitor, which represents the membrane potential (v(t)) of a neuron. Once the voltage across the capacitor crosses a certain threshold value, the neuron generates a spike and the potential of the neuron is reset to the resting potential. This neuron model is termed the leaky integrate-and-fire model because, in the absence of any presynaptic current, the voltage across the capacitor decays to zero as the capacitor discharges through the resistance of the RC circuit.
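The behaviour described above can be illustrated with a short simulation. The sketch below integrates the RC dynamics $\tau_m\,dv/dt = R\,I(t) - v(t)$, with $\tau_m = RC$ (Eq. (10.13) of this section), using a forward-Euler step, and adds the explicit reset and absolute refractory period by hand. The function name simulate_lif and all parameter values are hypothetical, and the potential is measured relative to a resting potential of zero; this is a sketch under these assumptions, not the implementation used in the chapter.

```python
def simulate_lif(current, dt=0.1, R=1.0, C=10.0, v_rest=0.0, v_thresh=1.0, t_ref=2.0):
    """Forward-Euler simulation of a leaky integrate-and-fire neuron.

    current: list of input current values I(t), one per time step of length dt.
    Returns (membrane potential trace, list of spike times).
    """
    tau_m = R * C                 # time constant of the RC circuit
    v = v_rest
    refractory_left = 0.0
    trace, spike_times = [], []
    for step, i_t in enumerate(current):
        if refractory_left > 0:
            # Absolute refractory period: potential held at rest.
            v = v_rest
            refractory_left -= dt
        else:
            v += dt * (R * i_t - v) / tau_m      # leaky integration
            if v >= v_thresh:
                spike_times.append(step * dt)    # spike generated
                v = v_rest                       # explicit reset
                refractory_left = t_ref
        trace.append(v)
    return trace, spike_times


# Constant supra-threshold current for 50 time units, then no input.
current = [1.5] * 500 + [0.0] * 200
trace, spikes = simulate_lif(current)
print(len(spikes), spikes[:3])

# A weaker current settles at R * I = 0.5, below the threshold, so no spike occurs.
_, no_spikes = simulate_lif([0.5] * 1000)
print(no_spikes)
```

With the constant current of 1.5 the potential repeatedly charges towards R · I = 1.5, crosses the threshold, and is reset, so the neuron fires regularly; once the current is removed the potential simply leaks back towards zero.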


Figure 10.5: An RC circuit used to model a leaky integrate-and-fire neuron.

The dynamics of the RC circuit in Figure 10.5 are described as follows:

\[ I(t) = \frac{v(t)}{R} + C\,\frac{dv}{dt} \tag{10.12} \]

The input current I(t) has two components, viz., the current through the resistance R and the current that charges the capacitor C. Here, $v(t)/R$ is the component of current that goes through the resistance R and $C\,dv/dt$ is the component of current that charges the capacitor. Multiplying Eq. (10.12) by R, the membrane potential of the neuron is given as

\[ \tau_m\,\frac{dv}{dt} = R\,I(t) - v(t) \tag{10.13} \]

where $\tau_m = RC$ is the time constant of the circuit. Let us assume that the simulation begins at $t = t_0$ and $v(t_0) = v_r$, where $v_r$ is the resting potential. Based on this initial condition, the solution of Eq. (10.13) is given as

\[ v(t) = v_r \exp\left(-\frac{t - t_0}{\tau_m}\right) + \frac{1}{C}\int_0^{t - t_0} \exp\left(-\frac{s}{\tau_m}\right) I(t - s)\,ds. \tag{10.14} \]

A simplification of the leaky integrate-and-fire neuron that does not allow any leakage of current at the capacitor is termed an integrate-and-fire neuron.

10.2.3.2. Spike Response Model (SRM)

In comparison to the leaky integrate-and-fire neuron model, the spike response model (SRM) is capable of modeling an action potential more accurately as it implicitly models the refractory period. Assuming that the last spike by the postsynaptic neuron was generated at $\hat{t}$, the SRM describes the membrane potential $v(t)$ at time $t$ as

$$v(t) = \eta(t - \hat{t}) + \int_0^{\infty} \kappa(t - \hat{t}, s)\, I(t - s)\,ds \tag{10.15}$$

where the functions $\eta(\cdot)$ and $\kappa(\cdot)$ are termed kernel functions. The function $\eta(\cdot)$ determines the shape of an action potential and models the refractory period of the spiking neuron. The function $\kappa(\cdot)$ determines the response of the neuron to an input current $I(t)$. It is used to model the difference between the effects of an input current on the membrane potential of a neuron during and after the refractory period. During the refractory period, a given input current results in a smaller change in the membrane potential of the neuron. After the refractory period, the same input current produces a normal change in the membrane potential.

Besides the $\eta(\cdot)$ and $\kappa(\cdot)$ functions, the SRM employs a dynamic threshold function, represented by $\theta(t - \hat{t})$. The threshold of the neuron is at its maximum immediately after a spike and slowly decays to its normal value. By comparing Eqs. (10.14) and (10.15), it can be observed that the LIF model is a special case of the SRM. A generic SRM equation for the membrane potential $v(t)$ that is suitable for a supervised learning environment can be rewritten as

$$v_h(t) = \underbrace{\sum_{i=1}^{N_i}\sum_{k=1}^{f_i} w_{ih}\,\epsilon\!\left(t - s_i^k\right)}_{\text{postsynaptic potentials}} + \underbrace{\sum_{k=1}^{f_h} \nu\!\left(t - s_h^k\right)}_{\text{refractory response}} + \text{External inputs}. \tag{10.16}$$

Equation (10.16) can be directly used in a supervised learning environment. The postsynaptic potential term in Eq. (10.16) is similar to that of a LIF model. In Eq. (10.16), $v_h(t)$ is the accumulated membrane potential of the $h$th output neuron, and $w_{ih}$ is the weight between the $i$th input neuron and the $h$th output neuron. More details on different weight models are covered in the next subsection. $\epsilon(t)$ is the spike response function and $\nu(t)$ is the refractory response function; both are kernels that model the shape of the corresponding response. $\nu(t)$ is a negative kernel that takes the $h$th output neuron's spikes $s_h^k$ as its inputs. It ensures that the accumulated potential $v_h(t)$ is reset and starts to accumulate postsynaptic potentials again after every output spike $s_h^k$. External inputs are any additional inputs and they are generally not used in the learning framework. The next output spike $s_h^{f_h+1}$ is produced as

$$s_h^{f_h+1} = \{\, t \mid v_h(t) = \theta_h \,\} \tag{10.17}$$

where $\theta_h$ is the firing threshold of the $h$th output neuron.
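As a rough illustration of Eqs. (10.16) and (10.17), the sketch below evaluates the membrane potential of one output neuron on a discrete time grid and records a spike whenever the potential reaches the threshold. The kernel shapes are assumptions chosen for illustration (the text does not fix them at this point): an alpha-shaped spike response `epsilon` and a negative, exponentially decaying refractory kernel `nu`. External inputs are omitted, and the crossing condition is tested as `>=` because the grid rarely hits the threshold exactly.

```python
import numpy as np

def epsilon(s, tau=3.0):
    """Assumed spike response kernel: (s/tau)*exp(1 - s/tau) for s > 0, else 0."""
    s = np.asarray(s, dtype=float)
    return np.where(s > 0, (s / tau) * np.exp(1 - s / tau), 0.0)

def nu(s, theta=1.0, tau_r=5.0):
    """Assumed refractory kernel: negative and decaying for s > 0, else 0."""
    s = np.asarray(s, dtype=float)
    return np.where(s > 0, -theta * np.exp(-s / tau_r), 0.0)

def srm_output_spikes(input_spikes, weights, theta=1.0, t_end=50.0, dt=0.1):
    """Return the output spike times of one neuron following Eqs. (10.16)-(10.17).

    input_spikes: list of spike-time lists, one per input neuron.
    weights: one weight w_ih per input neuron.
    """
    out = []
    for t in np.arange(0.0, t_end, dt):
        # Postsynaptic potentials: weighted sum of spike responses over all input spikes
        psp = sum(w * epsilon(t - np.asarray(s, dtype=float)).sum()
                  for w, s in zip(weights, input_spikes))
        # Refractory response: one negative kernel per earlier output spike
        refr = nu(t - np.asarray(out), theta).sum() if out else 0.0
        if psp + refr >= theta:
            out.append(float(t))
    return out

spikes = srm_output_spikes([[1.0], [2.0]], weights=[0.8, 0.6])
```

Lowering both weights to 0.3 keeps the summed potential below the threshold, so no output spike is produced in that case.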

10.2.4. Synapse/Weight Models

Input spike potentials (or spike responses) are amplified by either positive or negative factors to produce excitatory postsynaptic potentials or inhibitory postsynaptic potentials, respectively. The amplification factor is the weight of the connection. Weights are the learnable parameters in an SNN. Weight models can be categorized into three major groups, viz., constant weight models, dynamic weight models, and time-varying weight models. The main difference between constant weight models and the other two is that a constant weight model has no temporal elements, whereas the other two do. Additional temporal elements in the weight models improve the computational power of the spiking neuron. The difference between a dynamic weight model and a time-varying weight model is that the dynamic weight model is reactive to the input spikes and changes the weight based on the spike frequency, whereas a time-varying weight is a learnable function of time for input spikes.

10.2.4.1. Constant Weight Model

A constant weight model imitates the long-term plasticity observed in biological neurons [12, 13]. Long-term plasticity is associated with the number of physical vesicle release sites on the presynaptic (input) side of a neural connection. It is one of the possible biological mechanisms that explicitly determine the synaptic strength (weight). In long-term plasticity, the synaptic strength (weight) of a synapse (connection) stays nearly constant. This is the inspiration for constant weight models in spiking neural networks (or any artificial neural networks). Constant weight models are widely used in analog neural networks and in the early developments in SNNs.

10.2.4.2. Dynamic Weight Model

A dynamic weight model imitates the short-term plasticity observed in biological neurons [14, 15]. Short-term plasticity controls the quanta of vesicle release per impinging spike; the release process is a product of two factors: facilitation and depression. The facilitation and depression processes are coupled through their dependence on calcium concentration, and their interplay dictates synaptic strength and dynamics in response to the variability of impinging spikes. Similar to this biological phenomenon, the dynamic weight models in SNNs regulate the weight of a connection based on the incoming spikes. Generally, dynamic weight models are suitable for the rate encoding scheme, where the input spikes have different frequencies. For the temporal encoding scheme, dynamic weight models act like constant weight models.

10.2.4.3. Time-Varying Weight Models

Time-varying weight models are inspired by the gamma-aminobutyric acid (GABA)-switch phenomenon observed in the biological synapse [16, 17]. This phenomenon is observed during the development of an infant's brain and during other physiological changes. During a GABA-switch, the synapse (connection) changes from excitatory (positive weight) to inhibitory (negative weight), or vice versa. Inspired by this phenomenon, a time-varying weight model allows both positive and negative weights in the same connection. A time-varying weight is a learnable function of time. This is one advantage of time-varying weights over constant or dynamic weights, as the other two weight models can have only a positive or a negative weight in one connection.

10.3. A Review of Learning Algorithms for SNNs

The most important part of a supervised learning environment is the learning algorithm used to train the SNN. Applying the learning algorithms developed for analog neural networks to SNNs is not straightforward. Most learning algorithms for analog neural networks use the derivatives of the output activation functions (gradient-based learning algorithms). However, outputs in SNNs are binary events in time, and the discontinuity in membrane potential makes it very challenging to determine the derivative of the output activation function. These challenges in adapting gradient-based learning algorithms opened up the exploration of directly adapting learning phenomena observed in biological neurons. This section is divided into three parts. First, an overview of gradient-based learning algorithms for SNNs is presented. Second, an overview of biologically inspired learning algorithms for SNNs is presented. Finally, an overview of other algorithms being used in SNNs is presented.

10.3.1. Gradient-Based Learning Algorithms

There have been many attempts to adapt gradient-based learning algorithms to train SNN classifiers [5, 18]. However, gradient-based learning algorithms in SNNs are severely limited by the discontinuous nature of the spiking activity. Initially, in SpikeProp [5], the gradient-based learning algorithm was adapted for temporally encoded inputs. In SpikeProp, the discontinuity in membrane potential during the firing of an output spike is approximated with a linear function around the firing time. This approximation makes it possible to determine the derivative of the output activation function. However, the non-firing of an output neuron (silent neuron or dead neuron) was a challenge for this approach. Several improvements have been proposed to improve the performance of SpikeProp [19, 20]. The initial SpikeProp worked with the temporal encoding scheme with a single spike. Later, it was also extended to work with multiple spike inputs [21]. The above-mentioned methods have been developed for shallow network architectures. In [22], temporally encoded inputs are transformed into an exponential term to work with deep architectures. Generally, rate coding is one of the widely used encoding methods for deep architectures [23].


10.3.2. Biologically Inspired Learning Algorithms

The functionality of a spiking neuron is inspired by the functionality of a biological neuron. This allowed researchers to adapt the phenomena observed in the brain to develop algorithms to train SNNs. Spike-time-dependent plasticity (STDP) [2] is the most common biological phenomenon adapted for SNNs. Other than the STDP rule, a meta-neuron learning rule inspired by the regulation of neural connection weights by astrocyte cells is also used in SNNs.

10.3.2.1. STDP-based learning algorithms

Spike-time-dependent plasticity (STDP) is one of the common phenomena observed in the brain. In STDP, the change in weight is computed from the difference between the time of the presynaptic spike and the time of the postsynaptic spike. Figure 10.6 shows the standard STDP rule. In the STDP rule, the closer the time of the input spike is to the time of the output spike, the higher its influence on the change in weight. Whether the influence is positive or negative is determined by the time of the input spike with respect to the output spike. If an input spike fires before the output spike, then the input spike will have a positive influence on the weight change; otherwise a negative weight change will occur. The weight change in the STDP rule can be described as

$$\Delta w(\delta s) = \begin{cases} +A_+ \exp(-\delta s/\tau_+) & \text{if } \delta s \ge 0 \\ -A_- \exp(\delta s/\tau_-) & \text{if } \delta s < 0 \end{cases} \tag{10.18}$$

Figure 10.6: The standard STDP rule. The x axis shows the time difference between post- and presynaptic spikes. The red cross shows the strengthening of the connection (positive weight change) and the blue triangle shows the weakening of the connection (negative weight change). (Image from https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004566.)


where $\delta s$ = postsynaptic spike time − presynaptic spike time and $\Delta w(\delta s)$ is the change in weight calculated for $\delta s$. $(A_+, A_-)$ and $(\tau_+, \tau_-)$ are the amplitudes and time constants of the positive and negative learning windows, respectively. The STDP rule has laid the foundation for many SNN learning algorithms. In ReSuMe [24], the STDP and anti-STDP rules are used in a supervised learning environment. Improvements to the ReSuMe learning algorithm have been made by combining it with delay learning [25] and by adding a dynamic firing threshold [26]. In [27], the ReSuMe algorithm has been extended to a multilayer architecture. The STDP rule is also used in a normalized form [28] or with the Bienenstock–Cooper–Munro rule [29] to overcome its local learning nature.

10.3.2.2. Meta-neuron-based learning algorithm

Another biological phenomenon observed in the brain is the regulation of synaptic connections by astrocyte cells. An astrocyte cell can have up to 100,000 connections with other cells, which gives it access to nonlocal information. This allows astrocyte cells to modulate the weight of each connection using nonlocal information. This phenomenon was adapted in a supervised learning algorithm as a meta-neuron-based learning algorithm [30]. An additional neuron called a meta-neuron is used in the learning algorithm to mimic the functioning of the astrocyte cells.
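To make the STDP learning window of Eq. (10.18) concrete, the following minimal sketch evaluates the weight change for a given spike-time difference, taken as postsynaptic minus presynaptic spike time. The amplitudes and time constants used as defaults are illustrative assumptions, not values from the chapter, and `stdp_delta_w` is a hypothetical helper name.

```python
import numpy as np

def stdp_delta_w(delta_s, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change of Eq. (10.18) for delta_s = t_post - t_pre (illustrative constants)."""
    delta_s = np.asarray(delta_s, dtype=float)
    potentiation = a_plus * np.exp(-delta_s / tau_plus)    # branch for delta_s >= 0
    depression = -a_minus * np.exp(delta_s / tau_minus)    # branch for delta_s < 0
    return np.where(delta_s >= 0, potentiation, depression)

# Presynaptic spike 5 ms before and 5 ms after the postsynaptic spike
dw_before = stdp_delta_w(5.0)    # positive: the connection is strengthened
dw_after = stdp_delta_w(-5.0)    # negative: the connection is weakened
```

Because both branches decay exponentially with the magnitude of the time difference, spike pairs that are far apart in time change the weight only marginally.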

10.3.3. Rank Order Learning

Many learning algorithms for SNNs utilize the concept of rank order learning (ROL) [31], which is inspired by the fast processing times exhibited by neurons in the visual areas of the brain. The fast processing times indicate that very few spikes are required for the information to reach the brain from the retina. Based on this idea, rank order learning employs the order of the first spikes generated by a group of presynaptic neurons. To better understand ROL, consider an example with four presynaptic neurons and a single postsynaptic neuron as shown in Figure 10.7. Based on the temporal order of the presynaptic spikes, the ranks of the presynaptic neurons are estimated to be 3, 1, 2, and 0, respectively. Using these ranks, the weights of the connections between the four presynaptic neurons and the postsynaptic neuron are initialized as $\lambda^3$, $\lambda^1$, $\lambda^2$, and $\lambda^0$. Here, $\lambda$ is the modulation factor and is always set to a value in the interval [0, 1].

Figure 10.7: An illustrative example of rank order learning using the spikes generated by four presynaptic neurons and a single postsynaptic neuron.

Several variants of rank order learning have been proposed in the literature to improve its performance on real-world problems [32–39]. Rank order learning is used to develop an online learning algorithm for a four-layered SNN [32]. The network employs contrast-sensitive neuronal maps and orientation-selective neuronal maps in the second and third layers, respectively. The first layer in the network is used to present input spike patterns to the network and the last layer is used to determine the predicted class label. The number of neuronal maps in the last layer is not fixed a priori and is determined by the learning algorithm during training. The weights for neurons in the last layer are learnt in a supervised manner to maximize the classification performance. Later, the authors proposed a modified architecture with a single output neuron [34]. Wysoski et al. further extended this architecture to simultaneously handle both auditory and visual inputs [33]. The online learning algorithm mentioned above employs rank order learning, which utilizes only the first spike generated by each neuron for learning. The information in subsequent spikes is not used in learning. Kasabov et al. proposed a dynamic evolving SNN to exploit the information present in the spikes generated after the first spike [35]. The dynamic evolving SNN uses the first spike to initialize the weights, and subsequent spikes are used to fine-tune the initial weights.
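The rank-based weight initialization from the four-neuron example earlier in this subsection can be sketched in a few lines. The first-spike times below are made-up values chosen only so that the resulting ranks are 3, 1, 2, and 0; the modulation factor of 0.5 and the helper name `rank_order_weights` are likewise assumptions for illustration.

```python
import numpy as np

def rank_order_weights(first_spike_times, lam=0.5):
    """Rank presynaptic neurons by first-spike time (earliest = rank 0) and
    initialize each weight as lam ** rank."""
    times = np.asarray(first_spike_times, dtype=float)
    ranks = np.argsort(np.argsort(times))   # position of each neuron in the firing order
    return ranks, lam ** ranks

# Hypothetical first-spike times (ms) of four presynaptic neurons
ranks, weights = rank_order_weights([9.0, 3.0, 5.0, 1.0], lam=0.5)
# ranks   -> [3, 1, 2, 0]
# weights -> [0.125, 0.5, 0.25, 1.0]
```

The earliest-firing neuron receives the largest weight, and later neurons are attenuated geometrically by the modulation factor.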

10.3.4. Other Learning Algorithms

Many learning algorithms for SNNs use the postsynaptic potential of a neuron for learning, as it continuously provides an accurate estimate of a spiking neuron's state at any given time. In [40], the authors developed a learning approach that utilizes the difference between the potential and the threshold for learning. In [41], the authors overcame the discontinuous nature of spikes by convolving the input and output spike patterns with a mathematical function to convert them into continuous analog signals. In the next section, we describe three learning algorithms, namely TMM-SNN, OMLA, and SEFRON.

10.4. Classification Using SNNs

10.4.1. A Two-Stage Margin Maximization Spiking Neural Network

In this section, we describe a learning algorithm for a three-layered SNN that maximizes the interspike interval between spikes generated by neurons that are associated with different classes. The network is trained in two stages and is, thus, referred to as a two-stage margin maximization spiking neural network (TMM-SNN). The first stage of the learning algorithm is termed the structure learning stage and the second stage is termed the output weights learning stage. During the structure learning stage, the learning algorithm evolves the number of neurons in the hidden layer of the network and updates the weights of the connections between the input neurons and the hidden layer neurons. For updating weights in this stage, TMM-SNN uses the normalized membrane potential (NMP) learning rule, which relies only on the locally available information on a synapse. To estimate the change in the weight of a particular synapse, the NMP learning rule utilizes the current weight of the synapse and the proportion of the unweighted postsynaptic membrane potential that is contributed by the presynaptic neuron.

10.4.1.1. Architecture of TMM-SNN

TMM-SNN utilizes a network with three layers. The input layer is used to present a spike pattern, $(\mathbf{x}, c)$, to the network. Here, $\mathbf{x} = [x_1, \ldots, x_i, \ldots, x_m]^T$ is an $m$-dimensional spike pattern that belongs to class $c$, and $x_i = \{t_i^1, \ldots, t_i^g, \ldots, t_i^{G_i}\}$ is a spike train where $t_i^g$ denotes the time of the $g$th spike generated by the $i$th input neuron. Each input spike pattern is presented to the network for a duration of 600 milliseconds with a time step of 1 millisecond. The spike patterns presented through the input layer drive the activity in the hidden neurons and the output neurons in the network. The responses of the output neurons are used to determine the predicted class for a given spike pattern based on the time to first spike decoding method. The output layer consists of $C$ neurons, where $C$ is the total number of classes. Without loss of generality, the description in this section and in Sections 10.4.1.2, 10.4.1.3, and 10.4.1.4 assumes a network with $K$ hidden neurons. Figure 10.8 shows the architecture of a TMM-SNN with $K$ hidden neurons.


Figure 10.8: Architecture of TMM-SNN.


The spikes generated by the input neurons affect the potential of the hidden neurons, which generate a spike when their potential crosses a threshold value. The potential contributed by the $i$th input neuron at time $t$ is given by

$$v_i(t) = \sum_{g} \epsilon\!\left(t - t_i^g\right) \tag{10.19}$$

where $\epsilon(t - t_i^g)$ is the potential contributed by the $g$th spike at time $t$. TMM-SNN uses the spike response model to estimate the value of $\epsilon(\cdot)$, which is given by

$$\epsilon(s) = \frac{s}{\tau}\exp\!\left(1 - \frac{s}{\tau}\right) \tag{10.20}$$

where $\tau$ is the time constant of the neuron, which is set to 300 milliseconds for all simulations involving TMM-SNN. Based on the potential contributed by the input neurons, the potential ($v_k^{(1)}$) of the $k$th hidden neuron at time $t$ is given by

$$v_k^{(1)}(t) = \sum_{i} w_{ik}^{(1)}\, v_i(t) \tag{10.21}$$

where $w_{ik}^{(1)}$ is the weight of the connection between the $i$th input neuron and the $k$th hidden neuron. The hidden neuron generates a spike every time its potential crosses its threshold, which is denoted by $\theta_k^{(1)}$. After the neuron generates a spike, its potential is reset to zero and its potential due to subsequent input spikes is computed using Eq. (10.21). The threshold of all hidden neurons is determined by the learning algorithm during training. The spike train generated by the $k$th hidden neuron is denoted by $\hat{h}_k = \{\hat{t}_k^1, \ldots, \hat{t}_k^g, \ldots, \hat{t}_k^{G_k^{(1)}}\}$, where $\hat{t}_k^g$ denotes the $g$th spike generated by the $k$th hidden neuron and $G_k^{(1)}$ represents the number of spikes generated by the $k$th hidden neuron. Similar to hidden neurons, output neurons are also modeled using the spike response model. Therefore, the potential contributed by the $k$th hidden neuron to the output neurons at time $t$ is given by

$$v_k^{(2)}(t) = \sum_{g \in \{1, \ldots, G_k^{(1)}\}} \epsilon\!\left(t - \hat{t}_k^g\right) \tag{10.22}$$

Based on the potential contributed by the hidden neurons, the potential of the $j$th output neuron at time $t$ is given by

$$v_j^{(2)}(t) = \sum_{k} w_{kj}^{(2)}\, v_k^{(2)}(t) \tag{10.23}$$

where $w_{kj}^{(2)}$ is the weight of the connection between the $k$th hidden neuron and the $j$th output neuron. The output neuron generates a spike every time its potential crosses the threshold, which is given by $\theta_j^{(2)}$. The threshold for all output neurons is set to one. The spike train generated by the output neurons is denoted by $\hat{y}_j = \{\hat{t}_j^1, \ldots, \hat{t}_j^g, \ldots, \hat{t}_j^{G_j^{(2)}}\}$, where $\hat{t}_j^g$ denotes the $g$th spike generated by the $j$th output neuron. The superscripts in $G_k^{(1)}$ and $G_j^{(2)}$ are used to differentiate between the number of spikes generated by neurons in the hidden layer and in the output layer, respectively. Given the response of the output neurons, the predicted class ($\hat{c}$) is determined using the time to first spike decoding method, given by

$$\hat{c} = \underset{j}{\arg\min}\ \hat{t}_j^{1} \tag{10.24}$$
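The quantities in Eqs. (10.19) to (10.24) can be tied together in a small sketch. It computes the spike response, the unweighted potential of a presynaptic neuron, the first time a weighted potential reaches a threshold, and the time to first spike decoding. The helper names are hypothetical, the kernel is taken as zero for non-positive arguments (an assumption that makes the response causal), and only first spikes are computed, so the reset after a spike is not modeled here.

```python
import numpy as np

TAU = 300.0   # ms, time constant stated in the text
T = 600       # ms, presentation window with a 1 ms step, as stated in the text

def spike_response(s, tau=TAU):
    """Eq. (10.20); taken as 0 for s <= 0 (assumption)."""
    s = np.asarray(s, dtype=float)
    return np.where(s > 0, (s / tau) * np.exp(1 - s / tau), 0.0)

def unweighted_potential(spike_train, t):
    """Eq. (10.19) / (10.22): summed spike responses of one presynaptic neuron at time t."""
    return spike_response(t - np.asarray(spike_train, dtype=float)).sum()

def first_spike_time(weights, spike_trains, theta):
    """First time step at which the weighted potential of Eq. (10.21) / (10.23)
    reaches theta; returns inf if the neuron stays silent."""
    for t in range(T + 1):
        v = sum(w * unweighted_potential(s, t) for w, s in zip(weights, spike_trains))
        if v >= theta:
            return float(t)
    return float("inf")

def predict_class(first_spike_times):
    """Eq. (10.24): the class of the output neuron that fires first."""
    return int(np.argmin(first_spike_times))

# One presynaptic spike at t = 0 ms; a larger weight makes the neuron fire earlier
t_weak = first_spike_time([1.0], [[0.0]], theta=0.5)
t_strong = first_spike_time([2.0], [[0.0]], theta=0.5)
```

Because the kernel peaks at a value of one when its argument equals the time constant, a single input spike with unit weight can never drive the potential above a threshold larger than one.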

10.4.1.2. Normalized membrane potential learning rule

The NMP learning rule updates the weight of a connection between two neurons based on the information present in the current input spike pattern and the knowledge acquired by the network from previously learnt spike patterns. It estimates the information present in the current input spike pattern based on the fraction of the unweighted membrane potential that is contributed by the presynaptic neuron to the postsynaptic neuron. The current weight of a synapse is used as an estimate of the existing knowledge of the network. Next, the NMP learning rule is described for the $i$th neuron in the input layer and the $k$th neuron in the hidden layer of the TMM-SNN (see Figure 10.8). The normalized membrane potential ($\eta_{ik}$) contributed by the $i$th presynaptic neuron to the $k$th postsynaptic neuron at time $t$ is given by

$$\eta_{ik}(t) = \frac{v_i(t)}{\sum_{p} v_p(t)}, \quad i \in \{1, \ldots, m\}. \tag{10.25}$$
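As a small numerical illustration of Eq. (10.25), the sketch below normalizes a vector of unweighted potentials so that each entry gives the fraction contributed by one presynaptic neuron. The potential values are made up, the function name is hypothetical, and returning zeros when no input has contributed any potential is an assumption, since the text does not cover that case.

```python
import numpy as np

def normalized_membrane_potential(v):
    """Eq. (10.25): fraction of the unweighted potential contributed by each input."""
    v = np.asarray(v, dtype=float)
    total = v.sum()
    # Assumption: with no input activity at all, every fraction is taken as zero
    return v / total if total > 0 else np.zeros_like(v)

# Hypothetical unweighted potentials of four input neurons at some time t
eta = normalized_membrane_potential([0.9, 0.6, 0.0, 0.5])
# eta -> [0.45, 0.30, 0.00, 0.25]
```

The fractions sum to one, which puts them on a scale that can be compared directly with the weights of the synapses.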

Using only $\eta_{ik}$ to update the weight of a synapse may result in a loss of the knowledge acquired by the network from previously learnt spike patterns. To avoid this problem, the NMP learning rule updates the weight of a synapse based on the difference between $\eta_{ik}$ and the current weight of the synapse. Furthermore, it only updates the weight of a synapse when $\eta_{ik}$ is greater than the current weight of the synapse, and, as a result, synapses with higher weights are not updated. This ensures that the weights of input neurons that contributed significantly toward the membrane potential of the $k$th hidden neuron for the past spike patterns are not altered, thereby protecting the past knowledge stored in the network. Since TMM-SNN uses the time to first spike decoding method to determine the predicted class, the NMP learning rule is always used at the time of the first spike ($\hat{t}_k^1$) generated by the $k$th postsynaptic (hidden) neuron. To summarize, according to the NMP learning rule, the change in the weight ($\Delta w_{ik}^{(1)}$) of a synapse between the $i$th input neuron and the $k$th hidden neuron is given by

$$\Delta w_{ik}^{(1)} = \begin{cases} \alpha\left(\eta_{ik}(\hat{t}_k^1) - w_{ik}^{(1)}\right) & w_{ik}^{(1)} < \eta_{ik}(\hat{t}_k^1) \\ 0 & w_{ik}^{(1)} \ge \eta_{ik}(\hat{t}_k^1) \end{cases}, \quad i \in \{1, \ldots, m\} \tag{10.26}$$

where $\alpha$ is the learning rate and is set to 0.01 for all experiments with TMM-SNN.

10.4.1.3. Structure learning stage

In this subsection, the structure learning stage of the learning algorithm for TMM-SNN is described. In this stage, the learning algorithm uses two different strategies for evolving the number of hidden neurons and for learning the weights of the connections between the input neurons and the hidden neurons in the network. It uses a neuron addition strategy for adding hidden neurons to the network in line with the complexity of the classification problem. When a new neuron is added to the network, the class label of the current input spike pattern is used as the associated class for the newly added neuron. It uses the margin maximization strategy for structure learning to update the weights between the input neurons and the hidden neurons such that the interclass margins estimated from the responses of the hidden neurons are maximized. The learning algorithm uses both learning strategies during the first epoch of this stage. After the first epoch, the number of hidden neurons is fixed and only the margin maximization strategy for structure learning is used to update the weights between the input neurons and the hidden neurons in the network. A particular learning strategy in this stage is selected based on the times of the first spikes generated by the winner neurons associated with the class of the current input spike pattern and with any other class. The winner neuron (CC) associated with the same class as the current input spike pattern is given by

$$\mathrm{CC} = \underset{k,\ \bar{c}_k = c}{\arg\min}\ \hat{t}_k^{1} \tag{10.27}$$

where $\bar{c}_k$ represents the associated class for the $k$th hidden neuron and $c$ denotes the class label for the current input spike pattern. Similarly, the winner neuron (MC) associated with a different class is given by

$$\mathrm{MC} = \underset{k,\ \bar{c}_k \neq c}{\arg\min}\ \hat{t}_k^{1}. \tag{10.28}$$

Next, the different learning strategies used in the structure learning stage will be described, assuming a network that has already evolved to $K$ hidden neurons.

(1) Neuron addition strategy: The learning algorithm uses this strategy for those spike patterns that contain information that has not been acquired by the network from previously learned spike patterns. It utilizes the time of the first spike generated by the neuron CC to identify those spike patterns that can be learned accurately by using this strategy. Specifically, a new neuron is added to the network whenever $\hat{t}_{\mathrm{CC}}^{1}$ is greater than a threshold $T_a$. The associated class for the newly added neuron is assigned using the class label for the current input spike pattern. The criterion for the neuron addition strategy can be expressed as

$$\text{If } \hat{t}_{\mathrm{CC}}^{1} > T_a \text{ then add a hidden neuron to the network.} \tag{10.29}$$

Here, $T_a$ is set to a value in the interval $[0, T]$, which can be mathematically expressed as $T_a = \alpha_a T$, where $\alpha_a \in [0, 1]$ is termed the addition threshold. An appropriate range for setting $\alpha_a$ is [0.4, 0.55]. Refer to [42] for more details regarding the impact of $\alpha_a$ on the performance of TMM-SNN. To accurately capture the information present in the current input spike pattern, the weights and threshold of the newly added neuron are initialized such that it fires precisely at $T/3$. For this purpose, the weight of the connection between the $i$th input neuron and the newly added neuron is initialized using the normalized membrane potential (Eq. (10.25)) at $T/3$ for the current input spike pattern, given by

$$w_{i(K+1)}^{(1)} = \eta_{i(K+1)}(T/3), \quad i \in \{1, \ldots, m\} \tag{10.30}$$

Based on this initial value of the weights, the threshold of the newly added neuron is initialized as

$$\theta_{K+1} = \sum_{i} w_{i(K+1)}^{(1)}\, v_i(T/3). \tag{10.31}$$

(2) Margin maximization strategy for structure learning: The learning algorithm uses this strategy to update the weights of hidden neurons when the interclass margin is lower than a certain threshold ($T_m$) for a given spike pattern. Here, the interclass margin is computed as

$$\Upsilon^{(1)} = \hat{t}_{\mathrm{MC}}^{(1)} - \hat{t}_{\mathrm{CC}}^{(1)}. \tag{10.32}$$

Based on the definition of the interclass margin, the criterion for this strategy is given by

$$\text{If } \Upsilon^{(1)} < T_m \text{ then update the weights of hidden neurons.} \tag{10.33}$$

$T_m$ is set to a value in the interval $[0, (T - T/3)]$. A value higher than $(T - T/3)$ is not recommended as the parameters for a newly added neuron are initialized such that it fires at $T/3$. An intuitive method to choose $T_m$ is by setting its value based on the following equation:

$$T_m = \alpha_m \left(T - \frac{T}{3}\right) \tag{10.34}$$


where $\alpha_m$ is termed the margin threshold and is always set to a value in the interval [0, 1]. See [42] for more details. It may be observed from Eq. (10.32) that the interclass margin depends only on the spike times of the neurons CC and MC. Therefore, the learning algorithm maximizes the margin by only updating the weights of the neurons CC and MC. The updated values of the weights of these neurons are computed using the NMP learning rule (Eq. (10.26)) and are given by

$$\begin{cases} w_{i(\mathrm{CC})}^{(1)} = w_{i(\mathrm{CC})}^{(1)} + \Delta w_{i(\mathrm{CC})}^{(1)} \\ w_{i(\mathrm{MC})}^{(1)} = w_{i(\mathrm{MC})}^{(1)} - \Delta w_{i(\mathrm{MC})}^{(1)} \end{cases}, \quad i \in \{1, \ldots, m\}. \tag{10.35}$$

The update equation described above forces the neuron CC to fire sooner and delays the spike generated by the neuron MC, thereby resulting in a higher interclass margin. The learning strategy for maximizing the margin is used for multiple epochs during the structure learning stage until the average interclass margin across all training samples converges to a maximum value. To summarize, the pseudo-code for the structure learning stage of TMM-SNN is given in Algorithm 10.1.

Algorithm 10.1 Pseudo-code for the structure learning stage

Evolving loop (Epoch 1)
    for each training spike pattern do
        if $\hat{t}_{\mathrm{CC}}^{1} > T_a$ then
            Add a neuron
Iterative loop (Epoch 1 until convergence)
    for each training spike pattern do
        if $(\hat{t}_{\mathrm{MC}}^{1} - \hat{t}_{\mathrm{CC}}^{1}) < T_m$ then
            Update weights of the neurons CC and MC
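The two strategies of the structure learning stage can be sketched as a single per-pattern step. This is a simplified illustration under stated assumptions, not the authors' implementation: the network is held in a plain dictionary `net` (a hypothetical layout), only first spikes of hidden neurons are computed, the margin update is skipped when either winner neuron stays silent, and within one step the margin strategy is tried only when no neuron was added. Setting `grow=False` corresponds to the epochs after the first one, where the number of hidden neurons is fixed.

```python
import numpy as np

TAU, T = 300.0, 600  # time constant and presentation window in ms (from the text)

def eps(s):
    """Spike response of Eq. (10.20); taken as 0 for s <= 0 (assumption)."""
    s = np.asarray(s, dtype=float)
    return np.where(s > 0, (s / TAU) * np.exp(1 - s / TAU), 0.0)

def unweighted(x, t):
    """v_i(t) of Eq. (10.19) for every input neuron; x is a list of spike-time lists."""
    return np.array([eps(t - np.asarray(train, dtype=float)).sum() for train in x])

def nmp(x, t):
    """Normalized membrane potential of Eq. (10.25)."""
    v = unweighted(x, t)
    return v / v.sum() if v.sum() > 0 else np.zeros_like(v)

def first_spike(w, theta, x):
    """First time step at which the weighted potential reaches theta (inf if never)."""
    for t in range(1, T + 1):
        if w @ unweighted(x, t) >= theta:
            return float(t)
    return np.inf

def nmp_delta(w, x, t_fire, alpha=0.01):
    """Eq. (10.26): nonzero only where the current weight is below eta."""
    eta = nmp(x, t_fire)
    return np.where(w < eta, alpha * (eta - w), 0.0)

def structure_step(net, x, c, alpha_a=0.5, alpha_m=0.5, grow=True):
    """Process one spike pattern x of class c; returns 'added', 'updated' or 'none'."""
    Ta, Tm = alpha_a * T, alpha_m * (T - T / 3)              # Ta = alpha_a*T, Eq. (10.34)
    times = np.array([first_spike(w, th, x) for w, th in zip(net["W"], net["theta"])])
    cls = np.array(net["cls"], dtype=int)
    same, other = np.flatnonzero(cls == c), np.flatnonzero(cls != c)
    t_cc = times[same].min() if same.size else np.inf
    if grow and t_cc > Ta:                                   # Eq. (10.29)
        w_new = nmp(x, T / 3)                                # Eq. (10.30)
        net["W"].append(w_new)
        net["theta"].append(float(w_new @ unweighted(x, T / 3)))  # Eq. (10.31)
        net["cls"].append(c)
        return "added"
    if not (same.size and other.size):
        return "none"
    cc = same[np.argmin(times[same])]                        # Eq. (10.27)
    mc = other[np.argmin(times[other])]                      # Eq. (10.28)
    if not (np.isfinite(times[cc]) and np.isfinite(times[mc])):
        return "none"                                        # assumption: skip silent winners
    if times[mc] - times[cc] < Tm:                           # Eqs. (10.32)-(10.33)
        net["W"][cc] = net["W"][cc] + nmp_delta(net["W"][cc], x, times[cc])  # Eq. (10.35)
        net["W"][mc] = net["W"][mc] - nmp_delta(net["W"][mc], x, times[mc])
        return "updated"
    return "none"

net = {"W": [], "theta": [], "cls": []}
status = structure_step(net, [[10.0], [50.0], [90.0]], c=0)   # empty network: a neuron is added
```

Calling `structure_step` for every training pattern once with `grow=True` and then repeatedly with `grow=False` mirrors the evolving and iterative loops of Algorithm 10.1; the convergence check on the average margin is left out of this sketch.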

10.4.1.4. Output weights learning stage

The structure learning stage helps determine the appropriate number of hidden neurons required for a particular classification problem. At the end of the structure learning stage, the parameters for the hidden neurons are fixed and the connections between the hidden neurons and the output neurons are randomly initialized. In this stage, the learning algorithm updates the weights of the connections between the hidden neurons and the output neurons to maximize the interclass margins based on the responses of the output neurons. For this purpose, the winner neurons associated with the same class as the current input spike pattern (CC) and a different class (MC) are estimated according to the times of the spikes generated by the output neurons, given as

$$\mathrm{CC} = \arg_j\,(j = c) \tag{10.36}$$

$$\mathrm{MC} = \underset{j,\ j \neq c}{\arg\min}\ t_j^{(2)}. \tag{10.37}$$

Based on the times of the spikes generated by the output neurons CC and MC, the interclass margin is computed as

$$\Upsilon^{(2)} = t_{\mathrm{MC}}^{(2)} - t_{\mathrm{CC}}^{(2)}. \tag{10.38}$$

In this stage, the learning algorithm only updates the weights of the output neurons CC and MC because only these neurons determine the interclass margin (Eq. (10.38)). Furthermore, only the connections from those hidden neurons that are associated with the same class as the class label for the current input spike pattern are updated. The required change in the weight is determined based on the unweighted membrane potential contributed by the hidden neuron at the time of the first spike generated by the output neuron. Based on this, the change in weight for the connection between the $k$th hidden neuron and the $j$th output neuron is given by

$$\Delta w_{kj}^{(2)} = \begin{cases} \alpha\, v_k^{(2)}\!\left(t_j^{(2)}\right) & \bar{c}_k = c \\ 0 & \bar{c}_k \neq c \end{cases} \tag{10.39}$$

Based on Eq. (10.39), the updated weights for the output neurons CC and MC are given by

$$\begin{cases} w_{k(\mathrm{CC})}^{(2)} = w_{k(\mathrm{CC})}^{(2)} + \Delta w_{k(\mathrm{CC})}^{(2)} \\ w_{k(\mathrm{MC})}^{(2)} = w_{k(\mathrm{MC})}^{(2)} - \Delta w_{k(\mathrm{MC})}^{(2)} \end{cases}, \quad k \in \{1, \ldots, K\} \tag{10.40}$$

The update rule in Eq. (10.40) forces the output neuron associated with the correct class to fire sooner and delays the spikes generated by the neuron associated with the other class (MC). This results in an increase in the average margin based on the responses of the output neurons. To summarize, the pseudo-code for the output weights learning stage of TMM-SNN is given in Algorithm 10.2.

Algorithm 10.2 Pseudo-code for the output weights learning stage

Iterative loop (Epoch 1 until convergence)
    for each training spike pattern do
        if $(\hat{t}_{\mathrm{MC}}^{2} - \hat{t}_{\mathrm{CC}}^{2}) < T_m$ then
            Update weights of the neurons CC and MC
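One pass of the inner step of Algorithm 10.2 can be sketched as follows. The sketch takes the hidden-layer spike trains as given, keeps the output weights in a matrix `W2` of shape (hidden neurons, classes), and applies Eqs. (10.36) to (10.40). Several details are assumptions made to keep the example self-contained: the helper names, the default value of the margin threshold, and the convention that an output neuron that never fires is assigned the spike time $T$.

```python
import numpy as np

TAU, T = 300.0, 600  # time constant and presentation window in ms (from the text)

def eps(s):
    """Spike response of Eq. (10.20); taken as 0 for s <= 0 (assumption)."""
    s = np.asarray(s, dtype=float)
    return np.where(s > 0, (s / TAU) * np.exp(1 - s / TAU), 0.0)

def hidden_potentials(hidden_spikes, t):
    """v_k^(2)(t) of Eq. (10.22) for every hidden neuron."""
    return np.array([eps(t - np.asarray(tr, dtype=float)).sum() for tr in hidden_spikes])

def output_first_spikes(W2, hidden_spikes, theta=1.0):
    """First spike time of every output neuron; the threshold is one, as in the text."""
    times = np.full(W2.shape[1], float(T))   # assumption: silent neurons get time T
    fired = np.zeros(W2.shape[1], dtype=bool)
    for t in range(1, T + 1):
        v = hidden_potentials(hidden_spikes, t) @ W2       # Eq. (10.23)
        new = (v >= theta) & ~fired
        times[new], fired[new] = t, True
    return times

def output_stage_step(W2, hidden_spikes, hidden_cls, c, Tm=200.0, alpha=0.01):
    """Update W2 in place for one pattern of class c; returns True if weights changed."""
    times = output_first_spikes(W2, hidden_spikes)
    others = np.array([j for j in range(W2.shape[1]) if j != c])
    cc, mc = c, others[np.argmin(times[others])]           # Eqs. (10.36)-(10.37)
    if times[mc] - times[cc] >= Tm:                        # margin of Eq. (10.38)
        return False
    # Only hidden neurons associated with class c contribute, Eq. (10.39)
    same_class = (np.asarray(hidden_cls) == c).astype(float)
    W2[:, cc] += alpha * same_class * hidden_potentials(hidden_spikes, times[cc])  # Eq. (10.40)
    W2[:, mc] -= alpha * same_class * hidden_potentials(hidden_spikes, times[mc])
    return True

# Two hidden neurons (classes 0 and 1), two output neurons, pattern of class 0
W2 = np.array([[1.2, 1.1], [1.1, 1.2]])
changed = output_stage_step(W2, [[100.0], [120.0]], hidden_cls=[0, 1], c=0)
```

Repeating the step for the same pattern strengthens the connection from the class-0 hidden neuron to output neuron 0 and weakens its connection to output neuron 1, so the gap between the two first-spike times grows.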


10.4.2. Online Meta-Neuron-Based Learning Algorithm

In this section, we describe the online meta-neuron-based learning algorithm for a two-layered SNN with a meta-neuron. The architecture of the SNN with a meta-neuron is inspired by the concept of heterosynaptic plasticity that occurs on a tripartite synapse [3] in the brain. A tripartite synapse is a junction between a presynaptic neuron, a postsynaptic neuron, and an astrocyte cell. A single astrocyte cell can be simultaneously connected to multiple tripartite synapses [43], which allows this cell to modulate plasticity on a given synapse based on nonlocal information present in other synapses. Similar to an astrocyte cell, the meta-neuron is also connected to all presynaptic neurons and can access all the information transmitted in the two-layered network. Based on both the local and global information available to the meta-neuron, a meta-neuron-based learning rule has been developed that updates weights in the network to produce precise shifts in the spike times of output neurons. The meta-neuron-based learning rule is then used to develop an online meta-neuron-based learning algorithm (OMLA) that can be used to learn from input spike patterns in one shot. In the next two sections, we provide a description of the architecture for OMLA and its learning algorithm, respectively.

10.4.2.1. Spiking neural network with a meta-neuron

The architecture of the SNN with a meta-neuron consists of two layers with an arbitrary number of neurons in each layer. For ease of understanding, however, let us consider a network with m presynaptic neurons and a single postsynaptic neuron. The presynaptic neurons in the network are also connected to a meta-neuron, which allows this neuron to access the information present in spikes generated by the presynaptic neurons. Furthermore, the meta-neuron has access to the global information in the network in the form of synaptic weights between pre- and postsynaptic neurons. Figure 10.9 shows the architecture of the SNN with a meta-neuron.

Figure 10.9: Architecture of the two-layered SNN with a meta-neuron.

The first layer in the SNN is used to present m-dimensional spike patterns, $x = \{x_1, \ldots, x_i, \ldots, x_m\}$, to the network. Each spike pattern is presented to the network for a duration of time $T$, which is termed the simulation interval of a single spike pattern. Here, $x_i = \{t_i^{(1)}, \ldots, t_i^{(g)}, \ldots, t_i^{(G_i)}\}$ is a spike train with $G_i$ spikes, which are generated by the ith presynaptic neuron. The unweighted postsynaptic potential (PSP) at time $t$ resulting from the gth spike generated by the ith presynaptic neuron is denoted by $\epsilon(t - t_i^{(g)})$, which has been modeled using the spike response function [5] given by

$$\epsilon(s) = \frac{s}{\tau} \exp\left(1 - \frac{s}{\tau}\right) \tag{10.41}$$

where $\tau$ is the time constant of the neuron and is set to 3 ms for all simulations involving the SNN with a meta-neuron. Depending on the PSP contributed by the presynaptic neurons, the PSP ($v$) of the postsynaptic neuron at time $t$ is computed as

$$v(t) = \sum_{i} \sum_{g} w_i\, \epsilon\big(t - t_i^{(g)}\big) \tag{10.42}$$

where $w_i$ is the weight of the connection between the ith presynaptic neuron and the postsynaptic neuron. The postsynaptic neuron generates a spike whenever its potential crosses a certain threshold, which is denoted by $\theta$. After generating a spike, the potential of the neuron is reset to zero. The output of the postsynaptic neuron is denoted by $\hat{y} = \{\hat{t}^{(1)}, \ldots, \hat{t}^{(f)}, \ldots\}$.

The aim of the learning algorithm for the SNN with a meta-neuron is to accurately capture the relationship between the input spike patterns and the desired output spike patterns. For this purpose, the weights of the postsynaptic neuron are updated individually for each output spike generated by this postsynaptic neuron. Suppose that $t^{(f)}$ represents the desired spike time of the fth spike generated by the postsynaptic neuron; then the weights of the neuron are updated such that it spikes precisely at $t^{(f)}$. The postsynaptic neuron will generate a spike at $t^{(f)}$ when its potential is equal to $\theta$ at $t^{(f)}$. Therefore, with $v^{(f)}$ denoting the potential at $t^{(f)}$, the weights of the postsynaptic neuron should be updated such that the resulting change in potential ($\Delta v^{(f)}$) is equal to

$$\Delta v^{(f)} = \theta - v^{(f)}. \tag{10.43}$$
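As a small illustration of Eqs. (10.41)-(10.43), the following Python sketch evaluates the spike response function, the resulting PSP of the postsynaptic neuron, and the change in potential required to reach the threshold. It assumes that the kernel is zero for non-positive arguments (a spike has no effect before it occurs) and it does not model the reset after an output spike; the function names are illustrative.

```python
import math

TAU_MS = 3.0  # time constant stated in the text for the SNN with a meta-neuron


def spike_response(s, tau=TAU_MS):
    """Eq. (10.41); taken as zero for s <= 0 (assumption: causal kernel)."""
    if s <= 0.0:
        return 0.0
    return (s / tau) * math.exp(1.0 - s / tau)


def psp(t, spike_trains, weights, tau=TAU_MS):
    """Eq. (10.42): weighted sum of kernel responses over neurons and spikes."""
    return sum(
        w_i * sum(spike_response(t - t_g, tau) for t_g in train)
        for w_i, train in zip(weights, spike_trains)
    )


def required_change(theta, v_at_target):
    """Eq. (10.43): change in potential needed to reach the threshold."""
    return theta - v_at_target


# Two presynaptic neurons; the spike at 5.0 ms lies after t = 4.0 ms
# and therefore does not contribute.
trains = [[1.0], [1.0, 5.0]]
weights = [0.5, 2.0]
v = psp(4.0, trains, weights)
delta_v = required_change(3.0, v)
```

The kernel peaks at s = τ with value 1, so in this toy example each spike at 1.0 ms contributes its full weight to the potential at 4.0 ms.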

By definition, only the presynaptic spikes in the interval $\Gamma = [\hat{t}^{(f-1)}, t^{(f)}]$ will have an impact on the time of the fth output spike generated by the postsynaptic neuron. Therefore, the weights of the postsynaptic neuron are updated such that

$$\sum_{i} \sum_{t_i^{(g)} \in \Gamma} \Delta w_i\, \epsilon\big(t^{(f)} - t_i^{(g)}\big) = \Delta v^{(f)} \tag{10.44}$$

where $\Delta w_i$ represents the change in weight of the connection between the ith presynaptic neuron and the postsynaptic neuron. For a particular output spike, the value of $\Delta w_i$ is estimated based on the contribution of the ith presynaptic neuron toward $\Delta v^{(f)}$. For this purpose, the meta-neuron uses both the local and global information in the network to estimate a weight sensitivity modulation factor ($M_i^{(f)}$) for each connection between the pre- and postsynaptic neurons. It is used to estimate the proportion of $\Delta v^{(f)}$ that is contributed by the ith presynaptic neuron.

Based on $\Delta v^{(f)}$ (Eq. (10.43)) and $M_i^{(f)}$, the weight of the connection between the ith presynaptic neuron and the postsynaptic neuron is updated as

$$\Delta w_i = M_i^{(f)}\, \frac{\Delta v^{(f)}}{\sum_{t_i^{(g)} \in \Gamma} \epsilon\big(t^{(f)} - t_i^{(g)}\big)}, \quad i \in \{1, \ldots, m\}. \tag{10.45}$$

Here, $M_i^{(f)}$ is computed by the meta-neuron as the proportion of its PSP at $t^{(f)}$ that is contributed by the ith presynaptic neuron, given by

$$M_i^{(f)} = \frac{\sum_{t_i^{(g)} \in \Gamma} z_i^{(f)}\, \epsilon\big(t^{(f)} - t_i^{(g)}\big)}{\sum_{i} \sum_{t_i^{(g)} \in \Gamma} z_i^{(f)}\, \epsilon\big(t^{(f)} - t_i^{(g)}\big)}, \quad i \in \{1, \ldots, m\} \tag{10.46}$$

where $z_i^{(f)}$ is the weight of the connection between the ith presynaptic neuron and the meta-neuron. The meta-neuron estimates its weights by comparing the information present in a given input spike pattern with the knowledge stored in the corresponding synaptic weights. While updating the weights due to a given output spike, the weight of the connection between the ith presynaptic neuron and the meta-neuron is given by

$$z_i^{(f)} = \begin{cases} u_i^{(f)}\big(t^{(f)}\big) - w_i & \text{if } u_i^{(f)}\big(t^{(f)}\big) > w_i \\ 0 & \text{otherwise.} \end{cases} \tag{10.47}$$

Here, $u_i^{(f)}$ represents the information present in the input spike pattern generated by the ith presynaptic neuron. It is termed the normalized PSP contributed by the ith presynaptic neuron at $t^{(f)}$, given by

$$u_i^{(f)}\big(t^{(f)}\big) = \frac{\sum_{t_i^{(g)} \in \Gamma} \epsilon\big(t^{(f)} - t_i^{(g)}\big)}{\sum_{i} \sum_{t_i^{(g)} \in \Gamma} \epsilon\big(t^{(f)} - t_i^{(g)}\big)}, \quad i \in \{1, \ldots, m\}. \tag{10.48}$$
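The meta-neuron-based learning rule of Eqs. (10.43)-(10.48) can be sketched for a single desired output spike as follows. This is a minimal illustration, not the authors' implementation: it assumes that the kernel is zero for non-positive arguments, that the potential at the target time is built only from presynaptic spikes inside the window between the previous output spike and the target time (because the potential is reset after each output spike), and that an all-zero update is returned when no connection qualifies. All function and variable names are hypothetical.

```python
import math


def spike_response(s, tau=3.0):
    """Kernel of Eq. (10.41), taken as zero for s <= 0 (assumption)."""
    return (s / tau) * math.exp(1.0 - s / tau) if s > 0.0 else 0.0


def meta_neuron_update(spike_trains, weights, t_target, t_prev, theta, tau=3.0):
    """Return (M, dw) for one desired output spike at t_target.

    Only presynaptic spikes in the window [t_prev, t_target] are used,
    where t_prev is the time of the previous output spike.
    """
    # Summed kernel response of each presynaptic neuron inside the window.
    k = [
        sum(spike_response(t_target - t, tau)
            for t in train if t_prev <= t <= t_target)
        for train in spike_trains
    ]
    m = len(weights)
    total = sum(k)
    if total == 0.0:
        return [0.0] * m, [0.0] * m  # no presynaptic activity in the window
    u = [k_i / total for k_i in k]                               # Eq. (10.48)
    z = [u_i - w_i if u_i > w_i else 0.0
         for u_i, w_i in zip(u, weights)]                        # Eq. (10.47)
    meta_psp = sum(z_i * k_i for z_i, k_i in zip(z, k))
    if meta_psp == 0.0:
        return [0.0] * m, [0.0] * m  # no connection selected for update
    M = [z_i * k_i / meta_psp for z_i, k_i in zip(z, k)]         # Eq. (10.46)
    v_f = sum(w_i * k_i for w_i, k_i in zip(weights, k))         # PSP at t_target
    dv = theta - v_f                                             # Eq. (10.43)
    dw = [M_i * dv / k_i if k_i > 0.0 else 0.0
          for M_i, k_i in zip(M, k)]                             # Eq. (10.45)
    return M, dw


# Toy usage: three presynaptic neurons, one spike each.
trains = [[1.0], [2.0], [2.5]]
w = [0.1, 0.9, 0.2]
M, dw = meta_neuron_update(trains, w, t_target=4.0, t_prev=0.0, theta=1.0)
```

In this example the second connection already has a weight larger than its normalized PSP, so its meta-neuron weight, its modulation factor, and its weight change are all zero, while the remaining changes move the potential at the target time exactly to the threshold.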

It is evident from Eq. (10.47) that the weight between a given presynaptic neuron and the meta-neuron is set to zero when the normalized PSP induced by that presynaptic neuron for the current input spike pattern at $t^{(f)}$ is not greater than $w_i$. This implies that the weight sensitivity modulation factor will also be zero for these presynaptic neurons, and hence, their weights will not be updated. This selective update mechanism of the meta-neuron-based learning rule is similar to the selective plasticity exhibited by astrocyte cells in the brain.

The general idea behind computing $M_i^{(f)}$ using Eq. (10.46) is to induce a higher change in the weights of the neurons that contributed a higher PSP to the postsynaptic neuron, and vice versa. Additionally, this also ensures that $\sum_i M_i^{(f)} = 1$, which guarantees that the change in weights will alter the PSP of the postsynaptic neuron at $t^{(f)}$ by $\Delta v^{(f)}$, thereby making sure that it fires precisely at $t^{(f)}$.

The meta-neuron-based learning rule (Eq. (10.45)) is a general learning rule that can be used to perform a one-shot weight update in a network such that the spike times of the postsynaptic neuron are precisely shifted. This renders the meta-neuron-based learning rule suitable for online learning.

10.4.2.2. Online meta-neuron-based learning algorithm

The online meta-neuron-based learning algorithm (OMLA) is developed for the two-layered SNN described in Section 10.4.2.1. It learns from the input spike patterns in an online manner, i.e., each spike pattern is presented to the network only once. The predicted class for a given spike pattern is determined based on the output neuron that spikes first. For this purpose, the learning algorithm updates the synaptic weights in the network to ensure that the output neuron associated with the correct class generates the first spike at the target firing time ($T_{ID}$). $T_{ID}$ is set to a fixed time instant in the simulation interval. For the other class neurons, the target firing time is set to $T + \delta$, which implies that the other class neurons should not generate any spike within the simulation interval.

OMLA utilizes the meta-neuron-based learning rule to update the weights and evolve the network in an online manner. For every spike pattern, it chooses one of three learning strategies, namely a neuron addition strategy, a delete spike pattern strategy, and a parameter update strategy. The appropriate learning strategy is chosen based on the information present in the network and the knowledge present in the input spike pattern. To learn effectively in an online manner, OMLA stores past spike patterns that were learned by adding a neuron to the network (neuron addition strategy). While adding neurons for subsequent spike patterns, these pseudo-inputs (spike patterns in the meta-neuron memory) are used by the learning algorithm for better approximation of the past knowledge stored in the network.

Without loss of generality, it is assumed that the network has $K$ output neurons, which were added to the network while learning the spike patterns $x_1, \ldots, x_h, \ldots, x_K$. At this stage, the meta-neuron's memory consists of these $K$ spike patterns that were used to add the $K$ neurons. The learning algorithm is described below for the current input spike pattern $x$ from class $c$ that is presented to the network for learning. The description below uses CC (correct class) to represent the output neuron from class $c$ that has minimum latency. Suppose that $c_j$ represents the class associated with the jth output neuron; then CC is given by

$$\mathrm{CC} = \operatorname*{argmin}_{j,\; c_j = c} \hat{t}_j^{(1)}. \tag{10.49}$$

Similarly, MC is used to represent the neuron from any other class that has minimum latency and is given by

$$\mathrm{MC} = \operatorname*{argmin}_{j,\; c_j \neq c} \hat{t}_j^{(1)}. \tag{10.50}$$
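The selection of CC and MC in Eqs. (10.49) and (10.50), together with the first-spike decision used for prediction, can be sketched in a few lines of Python. The sketch assumes that several output neurons may share a class and that at least one neuron exists for the correct class and for some other class; the function names are illustrative.

```python
def select_cc_mc(first_spike_times, neuron_class, c):
    """Return the indices (CC, MC) of Eqs. (10.49) and (10.50).

    first_spike_times[j] is the first-spike time of output neuron j and
    neuron_class[j] is the class associated with that neuron.
    """
    same = [j for j, c_j in enumerate(neuron_class) if c_j == c]
    other = [j for j, c_j in enumerate(neuron_class) if c_j != c]
    cc = min(same, key=lambda j: first_spike_times[j])
    mc = min(other, key=lambda j: first_spike_times[j])
    return cc, mc


def predicted_class(first_spike_times, neuron_class):
    """The predicted class is that of the output neuron that spikes first."""
    first = min(range(len(neuron_class)), key=lambda j: first_spike_times[j])
    return neuron_class[first]


# Four output neurons: two for class 0 and two for class 1.
times = [7.0, 4.0, 5.0, 9.0]
classes = [0, 0, 1, 1]
cc, mc = select_cc_mc(times, classes, c=0)
label = predicted_class(times, classes)
```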

Since the learning algorithm only uses the first spike to determine the predicted class, the following description represents $\hat{t}_j^{(1)}$ and $t_j^{(1)}$ as $\hat{t}_j$ and $t_j$, respectively. Similarly, the normalized PSP contributed by the ith presynaptic neuron is represented by $u_i$ instead of $u_i^{(1)}$. Next, the different learning strategies in OMLA are described in detail.

• Neuron addition strategy: In this strategy, a new neuron is added to the network to learn the information present in an input spike pattern that contains a significant amount of new information. For this purpose, the learning algorithm considers the interval between $T_{ID}$ and $\hat{t}_{CC}$. A high value of this interval ($T_{ID}$ […]

[…] $T_n$ then
    Add a neuron
else if ($\hat{t}_{CC} \le T_d$) and ($\hat{t}_{MC} - \hat{t}_{CC} \ge T_m$) then
    Delete the spike pattern
else
    if $\hat{t}_{CC} > T_d$ then
        Update the CC neuron
    if ($\hat{t}_{MC} - \hat{t}_{CC}$) < $T_m$ then
        Update the MC neuron

and output). The normalized form of STDP is used to train the classifier. Training an SNN with time-varying weight models is not exactly the same as training an SNN with constant weight models. The weight update (single value) determined for a given input pattern must be embedded in a function before updating the time-varying weights.

10.4.3.1. Architecture of SEFRON

SEFRON has two layers. Real-valued inputs are converted into spike patterns using some encoding scheme. In SEFRON, the first layer is considered the spike input layer and the second layer is considered the spike output layer. The input and output neurons are connected by the time-varying weights. The architecture of SEFRON with population encoding is shown in Figure 10.10.

In a SEFRON classifier, let $\theta$ be the firing threshold. The postsynaptic potential $v(t)$ increases or decreases for every presynaptic spike. The output neuron fires a postsynaptic spike at $\hat{t}$ when the postsynaptic potential $v(t)$ crosses its firing threshold $\theta$:

$$\hat{t} = \{t \mid v(t) = \theta\} \tag{10.65}$$

Figure 10.10: Architecture of multiclass SEFRON with population encoding.

Here, the postsynaptic potential $v(t)$ of the output neuron is determined as

$$v(t) = \sum_{i \in \{1, \ldots, m\}} \sum_{r \in \{1, \ldots, q\}} w_i^r\big(s_i^r\big) \cdot \epsilon\big(t - s_i^r\big) \cdot H\big(t - s_i^r\big) \tag{10.66}$$

where $H(t)$ represents a Heaviside step function and $\epsilon(t)$ is the spike response function, given below [28]:

$$\epsilon(t) = \frac{t}{\tau} \exp\left(1 - \frac{t}{\tau}\right) \tag{10.67}$$

The term $w_i^r(s_i^r)$ is the momentary weight and is obtained by sampling the time-varying weight function $w_i^r(t)$ at the instant $s_i^r$, as shown in Figure 10.10. $w_i^r(t)$ is the time-varying weight between the rth RF neuron of the ith input feature and the output neuron. The time-varying function $w_i^r(t)$ is also defined on the interval $[0, T]$ ms.

10.4.3.2. Learning rule to train SEFRON

In SEFRON's learning rule, the error between the 'actual' postsynaptic spike time ($\hat{t}_a$) and the 'reference' postsynaptic spike time ($\hat{t}_{rf}$) is minimized using a modified STDP rule. The steps of SEFRON's learning rule are briefly explained below. First, the STDP rule is used to determine the fractional contribution of presynaptic spikes, $u_i^r(\hat{t})$, for a given postsynaptic spike time $\hat{t}$. Next, $u_i^r(\hat{t})$ is used to determine the postsynaptic potential due to the fractional contribution, $V_{STDP}(\hat{t})$. Next, the error function is defined as the difference in the ratio of $\theta$ to $V_{STDP}(\hat{t})$ at the reference and actual postsynaptic spike times. Finally, the error is embedded in a Gaussian distribution function centered at the current presynaptic spike time to update the time-varying weight.


The general STDP learning rule in Eq. (10.18) is used in a normalized form to compute the fractional contribution $u_i^r(\hat{t})$ of the presynaptic spike $s_i^r$ for a given postsynaptic spike $\hat{t}$. The fractional contribution $u_i^r(\hat{t})$ is calculated as

$$u_i^r(\hat{t}) = \begin{cases} \dfrac{\delta w\big(\hat{t} - s_i^r\big)}{\sum_{i=1}^{m} \sum_{r=1}^{q} \delta w\big(\hat{t} - s_i^r\big)} & \text{for all } s_i^r \le \hat{t} \\ 0 & \text{for all } s_i^r > \hat{t} \end{cases} \tag{10.68}$$

For classification problems, only the first postsynaptic spike is important. Therefore, the presynaptic spikes fired after the first postsynaptic spike are ignored. Hence, $A_-$ is assumed to be 0 in SEFRON's learning rule. The term $u_i^r(\hat{t})$ is independent of the variable $A_+$ and depends only on the plasticity window $\tau_+$. The sum of $u_i^r(\hat{t})$ over all presynaptic spikes is equal to 1:

$$\sum_{i=1}^{m} \sum_{r=1}^{q} u_i^r(\hat{t}) = 1 \tag{10.69}$$
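The normalization in Eqs. (10.68) and (10.69) can be checked numerically with the short Python sketch below. Eq. (10.18) is not reproduced in this section, so the sketch assumes the common exponential STDP window for potentiation, δw(Δt) = A₊ exp(−Δt/τ₊) for Δt ≥ 0; under that assumption A₊ cancels in the ratio and is omitted. The function name and the nested-list layout `spikes[i][r]` are illustrative.

```python
import math


def fractional_contributions(spikes, t_post, tau_plus):
    """Return u[i][r] following Eq. (10.68).

    spikes[i][r] is the presynaptic spike time s_i^r. An exponential STDP
    window is assumed for the potentiation side; spikes that occur after
    t_post receive zero contribution (A_minus = 0).
    """
    raw = [
        [math.exp(-(t_post - s) / tau_plus) if s <= t_post else 0.0 for s in row]
        for row in spikes
    ]
    total = sum(sum(row) for row in raw)
    if total == 0.0:
        return raw  # no presynaptic spike before t_post
    return [[x / total for x in row] for row in raw]


# Two input features with two RF neurons each; the spike at 6.0 ms
# occurs after the postsynaptic spike at 4.0 ms and is ignored.
u = fractional_contributions([[1.0, 6.0], [2.0, 3.0]], t_post=4.0, tau_plus=1.0)
```

The spike closest to the postsynaptic spike (at 3.0 ms) receives the largest fractional contribution, which matches the qualitative behaviour described in the text.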

The fractional contribution indicates the influence of a presynaptic spike in firing a postsynaptic spike. A higher value of $u_i^r(\hat{t})$ indicates that the presynaptic spike at $s_i^r$ has more influence in firing an output spike at $\hat{t}$. This is characteristic of the STDP learning rule, where the presynaptic spikes closer to the postsynaptic spike receive a higher weight update.

$V_{STDP}(\hat{t})$ is determined using $u_i^r(\hat{t})$. It is interpreted as the postsynaptic potential at $\hat{t}$ if the fractional contribution from the STDP rule, $u_i^r(\hat{t})$, is used instead of the weight $w_i^r(s_i^r)$. $V_{STDP}(\hat{t})$ is determined by replacing $w_i^r(s_i^r)$ in Eq. (10.66) with $u_i^r(\hat{t})$:

$$V_{STDP}(\hat{t}) = \sum_{i=1}^{m} \sum_{r=1}^{q} u_i^r(\hat{t}) \cdot \epsilon\big(\hat{t} - s_i^r\big) \cdot H\big(\hat{t} - s_i^r\big) \tag{10.70}$$

It must be noted that $V_{STDP}(\hat{t})$ is not always equal to the firing threshold $\theta$ (except in the ideal case). The variations in the input data and the weight $w_i^r(s_i^r)$ may cause the difference. The overall strength ($\gamma$) is used to adjust the difference between $V_{STDP}(\hat{t})$ and $\theta$. $\gamma$ is the overall strength required by all the connections to make the firing possible at time $\hat{t}$ if $u_i^r(\hat{t})$ is used to determine the postsynaptic potential:

$$\theta = \gamma \cdot V_{STDP}(\hat{t}) = \gamma \cdot \sum_{i=1}^{m} \sum_{r=1}^{q} u_i^r(\hat{t}) \cdot \epsilon\big(\hat{t} - s_i^r\big) \cdot H\big(\hat{t} - s_i^r\big) \tag{10.71}$$

where the overall strength $\gamma$ is calculated as

$$\gamma = \frac{\theta}{V_{STDP}(\hat{t})}. \tag{10.72}$$


In supervised learning, a reference signal is used to move the actual output toward the desired value. In SEFRON, the reference signal and the actual output are the reference postsynaptic spike time ($\hat{t}_{rf}$) and the actual postsynaptic spike time ($\hat{t}_a$), respectively. The error between $\hat{t}_{rf}$ and $\hat{t}_a$ is used to update the time-varying weights in SEFRON. It is defined using the overall strength required to fire a postsynaptic spike at $\hat{t}_{rf}$ and at $\hat{t}_a$:

$$err = \gamma_{rf} - \gamma_a = \frac{\theta}{V_{STDP}(\hat{t}_{rf})} - \frac{\theta}{V_{STDP}(\hat{t}_a)} \tag{10.73}$$

where $\gamma_{rf}$ and $\gamma_a$ are the overall strengths required to fire a postsynaptic spike at the reference and actual postsynaptic spike times, respectively. This error is used to determine the weight update. It can be noted by comparing Eqs. (10.66) and (10.71) that the momentary weight is equal to the product of the overall strength and the fractional contribution in the ideal case ($w_i^r(s_i^r) = \gamma \cdot u_i^r(\hat{t})$). Hence, multiplying the error $err$ by the fractional contribution for the reference postsynaptic spike time, $u_i^r(\hat{t}_{rf})$, ensures that the weight is updated in a direction toward the ideal momentary weight. This indicates that subsequently the actual postsynaptic spike time ($\hat{t}_a$) also moves toward the reference postsynaptic spike time ($\hat{t}_{rf}$).

Two steps are involved in determining the time-varying weight update. First, the single-value weight update is determined using $err$. Next, that single value is embedded in a time-varying function to determine the time-varying weight update. Let $\Delta w_i^r(s_i^r)$ be the single-value weight update. It is determined by multiplying $err$ by the fractional contribution $u_i^r(\hat{t}_{rf})$ and the learning rate:

$$\Delta w_i^r\big(s_i^r\big) = \lambda \cdot u_i^r\big(\hat{t}_{rf}\big) \cdot err, \tag{10.74}$$

where $\lambda$ is the learning rate and is usually set to a small value.

The single-value weight update $\Delta w_i^r(s_i^r)$ is computed for a given input pattern. Next, $\Delta w_i^r(s_i^r)$ is embedded in a time-varying function. The time-varying function should ensure that the weight update is similar for other input patterns that are similar to the current pattern. Hence, a Gaussian function is used to embed the weight update. The weight update $\Delta w_i^r(s_i^r)$ at the presynaptic spike time $s_i^r$ is embedded in a time-varying function $g_i^r(t)$ as

$$g_i^r(t) = \Delta w_i^r\big(s_i^r\big) \cdot \exp\left(\frac{-\big(t - s_i^r\big)^2}{2\sigma^2}\right), \tag{10.75}$$

where $\sigma$ is the efficacy update range. $\sigma$ determines the temporal neighborhood affected by the weight update. A smaller value of $\sigma$ updates a smaller temporal neighborhood. Hence, more variation in the data distribution can be captured. On the other hand, an infinite value of $\sigma$ updates all the weights in the interval equally and subsequently results in a constant function. This constant function is similar to a constant weight model.

The time-varying weight $w_i^r(t)$ is updated by adding $g_i^r(t)$ to $w_i^r(t)$:

$$w_i^r(t)_{new} = w_i^r(t)_{old} + g_i^r(t). \tag{10.76}$$
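One update step of the learning rule in Eqs. (10.68)-(10.76) can be sketched in Python as follows. This is only an illustration under several assumptions: each time-varying weight is stored as samples on a discrete time grid, every connection carries a single presynaptic spike so that the indices (i, r) are flattened into one index n, the STDP window is exponential (Eq. (10.18) is not shown in this section), and at least one presynaptic spike precedes both the reference and the actual spike time. All names are hypothetical.

```python
import math


def kernel(s, tau):
    """Spike response of Eq. (10.67); zero for s <= 0, which also covers H(.)."""
    return (s / tau) * math.exp(1.0 - s / tau) if s > 0.0 else 0.0


def fractional_contributions(spikes, t_post, tau_plus):
    """Eq. (10.68) for a flat list of presynaptic spike times."""
    raw = [math.exp(-(t_post - s) / tau_plus) if s <= t_post else 0.0
           for s in spikes]
    total = sum(raw)
    return [x / total for x in raw]


def v_stdp(spikes, u, t_post, tau):
    """Eq. (10.70): potential obtained when u replaces the momentary weights."""
    return sum(u_n * kernel(t_post - s, tau) for u_n, s in zip(u, spikes))


def sefron_update(w, grid, spikes, t_ref, t_act, theta, tau, tau_plus,
                  lam, sigma):
    """Update the sampled weights w[n][g] in place and return the error."""
    u_ref = fractional_contributions(spikes, t_ref, tau_plus)
    u_act = fractional_contributions(spikes, t_act, tau_plus)
    gamma_ref = theta / v_stdp(spikes, u_ref, t_ref, tau)    # Eq. (10.72)
    gamma_act = theta / v_stdp(spikes, u_act, t_act, tau)    # Eq. (10.72)
    err = gamma_ref - gamma_act                              # Eq. (10.73)
    for n, s in enumerate(spikes):
        dw = lam * u_ref[n] * err                            # Eq. (10.74)
        for g, t in enumerate(grid):
            # Gaussian embedding, Eq. (10.75), added to the weight, Eq. (10.76).
            w[n][g] += dw * math.exp(-((t - s) ** 2) / (2.0 * sigma ** 2))
    return err


# Toy usage: two connections, weights sampled every 1 ms on [0, 6] ms.
grid = [float(t) for t in range(7)]
w = [[0.0] * len(grid) for _ in range(2)]
err = sefron_update(w, grid, spikes=[1.0, 2.0], t_ref=3.0, t_act=5.0,
                    theta=1.0, tau=3.0, tau_plus=2.0, lam=0.5, sigma=1.0)
```

In this toy case the reference spike time is earlier than the actual one and the error is positive, so the weights around the presynaptic spike times are increased. The largest increase goes to the connection whose spike is closer to the reference time.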

It can be noted that the updated $w_i^r(t)_{new}$ may have both positive and negative values. This imitates the GABA-switch phenomenon observed in a biological neuron.

10.4.3.3. SEFRON for the classification problem

Only the first actual postsynaptic spike $\hat{t}$ is used for the classification task and the subsequent postsynaptic spikes are ignored. If no postsynaptic spike is fired, the firing time of the postsynaptic spike is taken as the end of the simulation time. Let $\hat{t}_b$ be the classification boundary, referred to as the boundary spike time. The output class label for an input pattern is determined by the sign of the difference between $\hat{t}$ and $\hat{t}_b$. If the postsynaptic spike is fired before $\hat{t}_b$, the pattern belongs to class 1; otherwise, it belongs to class 2.

Initialization: In SEFRON, the first sample from the training dataset is used to initialize $w_i^r(t)$ and $\theta$. Let $\hat{t}_{d1}$ and $\hat{t}_{d2}$ be the desired postsynaptic spike times for class 1 and class 2, respectively. The overall strength $\gamma$ required to make the firing possible is assumed to be 1 for the first sample. Hence, $\theta$ is set to the value of $V_{STDP}(\hat{t}_d)$ using Eq. (10.71):

$$\theta = \sum_{i=1}^{m} \sum_{r=1}^{q} u_i^r\big(\hat{t}_d\big) \cdot \epsilon\big(\hat{t}_d - s_i^r\big) \cdot H\big(\hat{t}_d - s_i^r\big). \tag{10.77}$$

Here, $\hat{t}_d$ is the desired firing time of the first sample; $\hat{t}_d$ can be either $\hat{t}_{d1}$ or $\hat{t}_{d2}$. It can be noted from Eqs. (10.65) and (10.66) that the postsynaptic potential at $\hat{t}_d$ must be equal to the firing threshold for the first sample to make the postsynaptic firing possible:

$$\theta = v\big(\hat{t}_d\big) = \sum_{i=1}^{m} \sum_{r=1}^{q} w_i^r\big(s_i^r\big) \cdot \epsilon\big(\hat{t}_d - s_i^r\big) \cdot H\big(\hat{t}_d - s_i^r\big). \tag{10.78}$$

It can be noted by comparing Eqs. (10.77) and (10.78) that the initial momentary weight $w_i^r(s_i^r)$ must be equal to $u_i^r(\hat{t}_d)$. Using this similarity, the time-varying weight is initialized by embedding $u_i^r(\hat{t}_d)$ in a Gaussian distribution function:

$$w_i^r(t)_{initial} = u_i^r\big(\hat{t}_d\big) \cdot \exp\left(\frac{-\big(t - s_i^r\big)^2}{2\sigma^2}\right). \tag{10.79}$$


During training, only the reference postsynaptic spike time needs to be determined to update the weights. The desired postsynaptic spike time is used as the reference postsynaptic spike time: $\hat{t}_{rf} = \hat{t}_{d1}$ for samples from class 1 and $\hat{t}_{rf} = \hat{t}_{d2}$ for samples from class 2.
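The initialization in Eqs. (10.77)-(10.79) and the boundary-based class decision can be sketched in Python as follows. As before, this is a hedged illustration rather than the published implementation: the time-varying weights are stored as samples on a discrete grid, one presynaptic spike per connection is assumed, the STDP window is assumed to be exponential, and the function names are hypothetical.

```python
import math


def kernel(s, tau):
    """Spike response of Eq. (10.67); zero for s <= 0, which also covers H(.)."""
    return (s / tau) * math.exp(1.0 - s / tau) if s > 0.0 else 0.0


def fractional_contributions(spikes, t_post, tau_plus):
    """Eq. (10.68) for a flat list of presynaptic spike times."""
    raw = [math.exp(-(t_post - s) / tau_plus) if s <= t_post else 0.0
           for s in spikes]
    total = sum(raw)
    return [x / total for x in raw]


def initialise(spikes, grid, t_desired, tau, tau_plus, sigma):
    """Return (theta, w) from the first training sample.

    theta follows Eq. (10.77) and w[n][g] samples Eq. (10.79) on the grid.
    """
    u = fractional_contributions(spikes, t_desired, tau_plus)
    theta = sum(u_n * kernel(t_desired - s, tau) for u_n, s in zip(u, spikes))
    w = [
        [u_n * math.exp(-((t - s) ** 2) / (2.0 * sigma ** 2)) for t in grid]
        for u_n, s in zip(u, spikes)
    ]
    return theta, w


def classify(t_post, t_boundary):
    """Class 1 if the first output spike precedes the boundary, else class 2."""
    return 1 if t_post < t_boundary else 2


# Toy usage: first sample with spikes at 1 ms and 2 ms, desired firing at 3 ms.
grid = [float(t) for t in range(5)]
theta, w = initialise([1.0, 2.0], grid, t_desired=3.0, tau=3.0,
                      tau_plus=2.0, sigma=1.0)
```

Because each Gaussian equals 1 at its own presynaptic spike time, the momentary weights sampled there equal the fractional contributions, and the potential at the desired firing time equals the threshold, which is the condition expressed by Eq. (10.78).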

10.5. Performance Evaluation

In this section, the performance of SpikeProp, SWAT, SEFRON, TMM-SNN, and OMLA is evaluated and compared using 10 benchmark datasets from the UCI machine learning repository [44]. The results for all the algorithms were evaluated on the same training and testing splits. Table 10.1 shows details of the different datasets. All the experiments were conducted in a MATLAB 2019a environment using a 64-bit Windows 10 operating system. Table 10.2 shows the results of the performance evaluation for all the learning algorithms used for the comparison.

Table 10.1: Description of the datasets used for validation.

| Dataset | No. Features | No. Classes | No. Samples (Training) | No. Samples (Testing) |
|---|---|---|---|---|
| Iris | 4 | 3 | 75 | 75 |
| Wine | 13 | 3 | 60 | 118 |
| Acoustic Emission | 5 | 4 | 62 | 137 |
| Breast Cancer | 9 | 2 | 350 | 333 |
| Echocardiogram | 10 | 2 | 66 | 65 |
| Mammogram | 9 | 2 | 80 | 11 |
| Liver | 6 | 2 | 170 | 175 |
| PIMA | 8 | 2 | 384 | 384 |
| Ionosphere | 34 | 2 | 175 | 176 |
| Hepatitis | 19 | 2 | 78 | 77 |

Table 10.2: Performance comparison on UCI datasets.

| Dataset | Method | Architecture | Training Accuracy (%) | Testing Accuracy (%) |
|---|---|---|---|---|
| Iris | SpikeProp | 4-25-10-3 | 97.2(1.9) | 96.7(1.6) |
| Iris | SWAT | 4-24-312-3 | 96.7(1.4) | 92.4(1.7) |
| Iris | TMM-SNN | 4-24-(4-7)-3 | 97.5(0.8) | 97.2(1.0) |
| Iris | OMLA | 4-24-(5-7) | 97.9(0.7) | 97.9(0.7) |
| Wine | SpikeProp | 13-79-10-3 | 99.2(1.2) | 96.8(1.6) |
| Wine | SWAT | 13-78-1014-3 | 98.6(1.1) | 92.3(2.4) |
| Wine | TMM-SNN | 13-78-3-3 | 100(0) | 97.5(0.8) |
| Wine | OMLA | 13-78-(3-6) | 98.5(1.0) | 97.9(0.7) |
| Acoustic Emission | SpikeProp | 5-31-10-4 | 98.5(1.7) | 97.2(3.5) |
| Acoustic Emission | SWAT | 5-30-390-4 | 93.1(2.3) | 91.5(2.3) |
| Acoustic Emission | TMM-SNN | 5-30-(4-7)-4 | 97.6(1.3) | 97.5(0.7) |
| Acoustic Emission | OMLA | 5-30-(4-9) | 99.2(0.8) | 98.3(1.0) |
| Breast Cancer | SpikeProp | 9-55-15-2 | 97.3(0.6) | 97.2(0.6) |
| Breast Cancer | SWAT | 9-54-702-2 | 96.5(0.5) | 95.8(1.0) |
| Breast Cancer | TMM-SNN | 9-54-(2-8)-2 | 97.4(0.3) | 97.2(0.5) |
| Breast Cancer | OMLA | 9-54-2 | 97.4(0.4) | 97.8(0.4) |
| Breast Cancer | SEFRON | 9-55-1 | 98.3(0.8) | 96.4(0.7) |
| Echocardiogram | SpikeProp | 10-61-10-2 | 86.6(2.5) | 84.5(3.0) |
| Echocardiogram | SWAT | 10-60-780-2 | 90.6(1.8) | 81.8(2.8) |
| Echocardiogram | TMM-SNN | 10-60-(2-3)-2 | 86.5(2.1) | 85.4(1.7) |
| Echocardiogram | OMLA | 10-60-(6-10) | 89.6(1.5) | 86.3(0.9) |
| Echocardiogram | SEFRON | 10-61-1 | 88.6(3.9) | 86.5(3.3) |
| Mammogram | SpikeProp | 9-55-10-2 | 82.8(4.7) | 81.8(6.1) |
| Mammogram | SWAT | 9-54-702-2 | 82.6(2.1) | 78.2(12.3) |
| Mammogram | TMM-SNN | 9-54-(5-7)-2 | 87.2(4.4) | 84.9(8.6) |
| Mammogram | OMLA | 9-54-(14-20) | 88.6(2.0) | 85.5(4.1) |
| Mammogram | SEFRON | 9-55-1 | 92.8(5) | 82.7(10) |
| Liver | SpikeProp | 6-37-15-2 | 71.5(5.2) | 65.1(4.7) |
| Liver | SWAT | 6-36-468-2 | 74.8(2.1) | 60.9(3.2) |
| Liver | TMM-SNN | 6-36-(5-8)-2 | 74.2(3.5) | 70.4(2.0) |
| Liver | OMLA | 6-36-(12-15) | 69.9(2.3) | 67.7(1.8) |
| Liver | SEFRON | 6-37-1 | 91.5(5.4) | 67.7(1.3) |
| PIMA | SpikeProp | 8-49-20-2 | 78.6(2.5) | 76.2(1.8) |
| PIMA | SWAT | 8-48-702-2 | 77.0(2.1) | 72.1(1.8) |
| PIMA | TMM-SNN | 8-48-(5-14)-2 | 79.7(2.3) | 78.1(1.7) |
| PIMA | OMLA | 8-48-20 | 78.6(1.7) | 77.9(1.0) |
| PIMA | SEFRON | 8-49-1 | 78.4(2.5) | 77.3(1.3) |
| Ionosphere | SpikeProp | 34-205-25-2 | 89.0(7.9) | 86.5(7.2) |
| Ionosphere | SWAT | 34-204-2652-2 | 86.5(6.7) | 90.0(2.3) |
| Ionosphere | TMM-SNN | 34-204-(23-34)-2 | 98.7(0.4) | 92.4(1.8) |
| Ionosphere | OMLA | 34-204-(19-25) | 94.0(1.7) | 93.5(0.5) |
| Ionosphere | SEFRON | 34-205-1 | 97.0(2.5) | 88.9(1.7) |
| Hepatitis | SpikeProp | 19-115-15-2 | 87.8(5.0) | 83.5(2.5) |
| Hepatitis | SWAT | 19-114-1482-2 | 86.0(2.1) | 83.1(2.2) |
| Hepatitis | TMM-SNN | 19-114-(3-9)-2 | 91.2(2.5) | 86.6(2.2) |
| Hepatitis | OMLA | 19-114-12 | 89.8(2.9) | 87.4(2.4) |
| Hepatitis | SEFRON | 19-115-1 | 94.6(3.5) | 82.7(3.3) |

From Table 10.2, it can be observed that all the learning algorithms except SWAT have similar performances for simple multiclass classification problems such as Iris, Wine, and Acoustic Emission. SWAT performs 5–7% lower than the best performing algorithm. Similarly, a comparable performance is obtained for all the learning algorithms excluding SWAT on simple binary problems such as Breast Cancer and Echocardiogram. The performance of SWAT is 2–4% lower than that of the best performing algorithm.

For low-dimensional binary classification problems with high interclass overlap such as Mammogram and PIMA, the evolving learning algorithms TMM-SNN and OMLA achieve the best performance due to their evolving nature. The performance of SEFRON and SpikeProp is similar to one another and is 2–4% lower than the performance of the best performing algorithm. SWAT has the highest error rate among all the learning algorithms. Its performance is 6–7% lower than that of the best performing algorithm, which indicates that it is not able to accurately identify the discriminative features of classes with high interclass overlap.

For the low-dimensional binary classification problem of Liver, TMM-SNN is the best performing algorithm. SEFRON, OMLA, and SpikeProp have similar performances, which are 3–5% lower than the performance of TMM-SNN. SWAT has the highest error rate. Its performance is 10% lower than the performance of TMM-SNN.

The performances of the different learning algorithms have also been studied for high-dimensional datasets such as Ionosphere and Hepatitis. For both of these datasets, the best performance was achieved using TMM-SNN and OMLA, whose performances were comparable to one another. Furthermore, in the case of the Hepatitis dataset, all the other algorithms had similar performances, which are 3–5% lower than that of the best performing learning algorithm. However, for the Ionosphere dataset, the performance of the other algorithms exhibited higher variability. In this case, the performances of SEFRON and SWAT are similar to one another and are 2–4% lower than the performance of the best performing algorithm. SpikeProp has the highest error rate among all the learning algorithms for the Ionosphere dataset. Its performance is 6–7% lower than the performances of TMM-SNN and OMLA.

Overall, it can be observed that the evolving learning algorithms of TMM-SNN and OMLA allow them to closely approximate the relationship between the input spike patterns and the desired outputs for the datasets used in the evaluation. SEFRON and SpikeProp are the next-best-performing algorithms in most cases, whereas SWAT has the highest error rate for most problems.


10.6. Summary and Future Directions

Recently, SNNs have gained a lot of traction due to their biological plausibility and energy efficiency. This chapter focuses on the fundamentals of SNNs in a supervised learning framework to provide an effective starting point for new researchers. Compared to traditional deep neural networks, deep spiking neural networks are at a nascent stage. In a deep neural network, the errors associated with output neurons are backpropagated to the input layer to update network parameters so as to minimize the errors. Biologically inspired learning rules for SNNs provide a mechanism for learning in shallow networks, which alleviates the need for backpropagation. These approaches include SNNs that contain multiple layers but in which only a single layer has learned parameters [28, 30]. Other layers are used for encoding, feature extraction, or other pre-processing steps [45, 46].

To enable backpropagation of errors in deep SNNs, biological learning rules have been combined with gradient descent [47, 48]. However, using gradient descent in SNNs is a challenging problem due to the discontinuity in the response of spiking neurons at the time of a spike. To overcome this issue, researchers have developed techniques that replace the output of a spiking neuron with an approximate continuous function [49, 50] or use a proxy to estimate the derivative [51, 52]. Furthermore, silent neurons in SNNs resulting from weight updates during training lead to a sparser forward propagation, which impedes the computation of gradients. This could lead to a failure of the training process. This is usually tackled by proper weight initialization or by converting a pre-trained deep neural network to an SNN [53, 54].

Despite the above challenges, SNNs have been used for many applications. A review of different SNNs for image classification problems is presented in [23]. The inherent temporal input and output in SNNs make them well suited for spatiotemporal processing, thereby increasing their usage in new application areas. This capability of SNNs has been utilized in automatic speech recognition (ASR), where their performance is close to that of deep neural networks [55–57]. Studies on time series forecasting using SNNs have also shown promising results [58–60]. Several studies have also used SNNs for reinforcement learning [61, 62], but further research is required in this direction. Recently, many techniques have been developed for the explainability of machine learning tools, but this is yet to be explored in SNNs. The biological proximity of SNNs might help improve our understanding of the brain.

Another interesting direction involving SNNs is their hardware realization, which has been accelerated by the availability of neuromorphic chips [63, 64] that enable realization of arbitrary SNN architectures in hardware. These chips enable faster deployment of SNNs for edge computing due to their lower energy needs and offer tremendous benefits for areas such as autonomous driving and industrial robotics.

References [1] W. Maass, Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons, Institute of Theoretical Computer Science, Technische Universität Graz, Austria, Technical Report (1999). [2] H. Markram, W. Gerstner and P. J. Sjöström, Spike-Timing-Dependent Plasticity: A Comprehensive Overview, (Frontiers Media SA (2012)). [3] A. Araque, V. Parpura, R. P. Sanzgiri and P. G. Haydon, Tripartite synapses: glia, the unacknowledged partner, Trends Neurosci., 22(5), 208–215 (1999). [4] J. Gautrais and S. Thorpe, Rate coding versus temporal order coding: a theoretical approach, Biosystems, 48(1–3), 57–65 (1998). [5] S. M. Bohte, J. N. Kok and H. La Poutré, Error-backpropagation in temporally encoded networks of spiking neurons, Neurocomputing, 48, 17–37 (2002). [6] A. L. Hodgkin and A. F. Huxley, A quantitative description of membrane current, and its application to conduction, and excitation in nerve, J. Physiol., 117(4), 500–544 (1952). [7] E. M. Izhikevich, Simple model of spiking neurons, IEEE Trans. Neural Networks, 14(6), 1569–1572 (2003). [8] R. B. Stein, A theoretical analysis of neuronal variability, Biophys. J., 5(2), 173–194 (1965). [9] R. B. Stein, Some models of neuronal variability, Biophys. J., 7(1), 37–68 (1967). [10] W. M. Kistler, W. Gerstner and J. Hemmen, Reduction of the Hodgkin-Huxley equations to a single-variable threshold model, Neural Comput., 9(5), 1015–1045 (1997). [11] W. Gerstner, Time structure of the activity in neural network models, Phys. Rev. E, 51(1), 738–758 (1995). [12] T. V. P. Bliss and T. Lømo, Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path, J. Physiol., 232(2), 331–356 (1973). [13] D. O. Hebb, The Organization of Behavior: A Neuropsychological Theory, vol. 63, John Wiley & Sons, New York (1949). [14] M. Tsodyks and H. 
Markram, The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability, PNAS, 94(2), 719–723 (1997). [15] J. S. Dittman, A. C. Kreitzer and W. G. Regehr, Interplay between facilitation, depression, and residual calcium at three presynaptic terminals, J. Neurosci., 20(4), 1374–1385 (2000). [16] K. Ganguly, A. F. Schinder, S. T. Wong and M. M. Poo, GABA itself promotes the developmental switch of neuronal GABAergic responses from excitation to inhibition, Cell, 105(4), 521–532 (2001). [17] S. W. Lee, Y. B. Kim, J. S. Kim, W. B. Kim, Y. S. Kim, H. C. Han, C. S. Colwell, Y. W. Cho and Y. I. Kim, GABAergic inhibition is weakened or converted into excitation in the oxytocin, and vasopressin neurons of the lactating rat, Mol. Brain, 8(1), 1–9 (2015). [18] X. Lin, X. Wang and Z. Hao, Supervised learning in multilayer spiking neural networks with inner products of spike trains, Neurocomputing, 237, 59–70 (2017). [19] S. B. Shrestha and Q. Song, Adaptive learning rate of SpikeProp based on weight convergence analysis, Neural Networks, 63, 185–198 (2015). [20] B. Schrauwen and J. Van Campenhout, Extending SpikeProp. In International Joint Conference on Neural Networks, vol. 1, IEEE, pp. 471–475 (2004). [21] Y. Xu, X. Zeng, L. Han and J. Yang, A supervised multi-spike learning algorithm based on gradient descent for spiking neural networks, Neural Networks, 43, 99–113 (2013). [22] H. Mostafa, Supervised learning based on temporal coding in spiking neural networks, IEEE Trans. Neural Networks Learn. Syst., 29(7), 3227–3235 (2018). [23] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier and A. Maida, Deep learning in spiking neural networks, Neural Networks, 111, 47–63 (2019). [24] F. Ponulak and A. Kasiński, Supervised learning in spiking neural networks with ReSuMe: sequence learning, classification and spike shifting, Neural Comput., 22(2), 467–510 (2010).

[25] A. Taherkhani, A. Belatreche, Y. Li and L. P. Maguire, Multi-DL-ReSuMe: multiple neurons delay learning remote supervised method. In 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 1–7 (2015). [26] M. Zhang, H. Qu, X. Xie and J. Kurths, Supervised learning in spiking neural networks with noise-threshold, Neurocomputing, 219, 333–349 (2017). [27] I. Sporea and A. Grüning, Supervised learning in multilayer spiking neural networks, Neural Comput., 25(2), 473–509 (2013). [28] A. Jeyasothy, S. Sundaram and N. Sundararajan, SEFRON: a new spiking neuron model with time-varying synaptic efficacy function for pattern classification, IEEE Trans. Neural Networks Learn. Syst., 30(4), 1231–1240 (2019). [29] J. J. Wade, L. J. Mcdaid, J. A. Santos and H. M. Sayers, SWAT: a spiking neural network training algorithm for classification problems, IEEE Trans. Neural Networks, 21(11), 1817–1830 (2010). [30] S. Dora, S. Suresh and N. Sundararajan, Online meta-neuron based learning algorithm for a spiking neural classifier, Inf. Sci., 414, 19–32 (2017). [31] S. Thorpe and J. Gautrais, Rank order coding. In Computational Neuroscience: Trends in Research, Springer US, pp. 113–118 (1998). [32] S. G. Wysoski, L. Benuskova and N. Kasabov, On-line learning with structural adaptation in a network of spiking neurons for visual pattern recognition. In S. D. Kollias, A. Stafylopatis, W. Duch and E. Oja (eds.) Artificial Neural Networks — ICANN 2006. ICANN 2006. Lecture Notes in Computer Science, vol. 4131, Springer, Berlin, Heidelberg, pp. 61–70 (2006). [33] S. G. Wysoski, L. Benuskova and N. Kasabov, Evolving spiking neural networks for audiovisual information processing, Neural Networks, 23(7), 819–835 (2010). [34] S. G. Wysoski, L. Benuskova and N. Kasabov, Fast, and adaptive network of spiking neurons for multi-view visual pattern recognition, Neurocomputing, 71(13), 2563–2575 (2008). [35] N. Kasabov, K. Dhoble, N. Nuntalid and G. 
Indiveri, Dynamic evolving spiking neural networks for on-line spatio-, and spectro-temporal pattern recognition, Neural Networks, 41, 188–201 (2013). [36] S. Dora, K. Subramanian, S. Suresh and N. Sundararajan, Development of a self-regulating evolving spiking neural network for classification problem, Neurocomputing, 171, 1216–1229 (2016). [37] S. Dora, S. Ramasamy and S. Sundaram, A basis coupled evolving spiking neural network with afferent input neurons. In International Joint Conference on Neural Networks, pp. 1–8 (2013). [38] S. Dora, S. Suresh and N. Sundararajan, A sequential learning algorithm for a Minimal Spiking Neural Network (MSNN) classifier. In International Joint Conference on Neural Networks, pp. 2415–2421 (2014). [39] S. Dora, S. Suresh and N. Sundararajan, A sequential learning algorithm for a spiking neural classifier, Appl. Soft Comput., 36, 255–268 (2015). [40] M. Zhang, H. Qu, A. Belatreche, Y. Chen and Z. Yi, A highly effective, and robust membrane potential-driven supervised learning method for spiking neurons, IEEE Trans. Neural Networks Learn. Syst., 30(1), 123–137 (2019). [41] A. Mohemmed, S. Schliebs, S. Matsuda and N. Kasabov, Span: spike pattern association neuron for learning spatio-temporal spike patterns, Int. J. Neural Syst., 22(04), 1250012, 2012. [42] S. Dora, S. Sundaram and N. Sundararajan, An interclass margin maximization learning algorithm for evolving spiking neural network, IEEE Trans. Cybern., pp. 1–11 (2018). [43] M. M. Halassa, T. Fellin, H. Takano, J.-H. Dong and P. G. Haydon, Synaptic islands defined by the territory of a single astrocyte, J. Neurosci., 27(24), 6473–6477 (2007). [44] D. Dua and C. Graff, UCI machine learning repository (2017). [45] M. Beyeler, N. D. Dutt and J. L. Krichmar, Categorization, and decision-making in a neurobiologically plausible spiking network using a STDP-like learning rule, Neural Networks, 48, 109–124 (2013).

[46] A. Tavanaei, T. Masquelier and A. S. Maida, Acquisition of visual features through probabilistic spike-timing-dependent plasticity. In International Joint Conference on Neural Networks, pp. 307–314 (2016). [47] A. Tavanaei and A. Maida, BP-STDP: approximating backpropagation using spike timing dependent plasticity, Neurocomputing, 330, 39–47 (2019). [48] Y. Bengio, T. Mesnard, A. Fischer, S. Zhang and Y. Wu, STDP-compatible approximation of backpropagation in an energy-based model, Neural Comput., 29(3), 555–577 (2017). [49] J. H. Lee, T. Delbruck and M. Pfeiffer, Training deep spiking neural networks using backpropagation, Front. Neurosci., 10, 508 (2016). [50] E. O. Neftci, C. Augustine, S. Paul and G. Detorakis, Event-driven random back-propagation: enabling neuromorphic deep learning machines, Front. Neurosci., 11, 324, 2017. [51] S. B. Shrestha and G. Orchard, Slayer: spike layer error reassignment in time, Adv. Neural Inf. Process. Syst., 31, 1412–1421 (2018). [52] E. O. Neftci, H. Mostafa and F. Zenke, Surrogate gradient learning in spiking neural networks, IEEE Signal Process. Mag., 36, 61–63 (2019). [53] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer and S.-C. Liu, Conversion of continuous-valued deep networks to efficient event-driven networks for image classification, Front. Neurosci., 11, 682 (2017). [54] D. Neil, M. Pfeiffer and S.-C. Liu, Learning to be efficient: algorithms for training low-latency, low-compute deep spiking neural networks. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, pp. 293–298 (2016). [55] Z. Pan, Y. Chua, J. Wu, M. Zhang, H. Li and E. Ambikairajah, An efficient, and perceptually motivated auditory neural encoding, and decoding algorithm for spiking neural networks, Front. Neurosci., 13 (2019). [56] J. Wu, E. Yılmaz, M. Zhang, H. Li and K. C. Tan, Deep spiking neural networks for large vocabulary automatic speech recognition, Front. Neurosci., 14, 199 (2020). [57] M. Dong, X. Huang and B. 
Xu, Unsupervised speech recognition through spike-timing-dependent plasticity in a convolutional spiking neural network, PLoS One, 13(11), e0204596 (2018). [58] G. Sun, T. Chen, Z. Wei, Y. Sun, H. Zang and S. Chen, A carbon price forecasting model based on variational mode decomposition and spiking neural networks, Energies, 9(1), 54 (2016). [59] S. Kulkarni, S. P. Simon and K. Sundareswaran, A spiking neural network (SNN) forecast engine for short-term electrical load forecasting, Appl. Soft Comput., 13(8), 3628–3635 (2013). [60] D. Reid, A. J. Hussain and H. Tawfik, Financial time series prediction using spiking neural networks, PLoS One, 9(8), e103656 (2014). [61] F. Zhao, Y. Zeng and B. Xu, A brain-inspired decision-making spiking neural network, and its application in unmanned aerial vehicle, Front. Neurorob., 12, 56 (2018). [62] N. Frémaux, H. Sprekeler and W. Gerstner, Reinforcement learning using a continuous time actor-critic framework with spiking neurons, PLoS Comput. Biol., 9(4), e1003024 (2013). [63] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al., Loihi: a neuromorphic manycore processor with on-chip learning, IEEE Micro, 38(1), 82–99 (2018). [64] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam, et al., TrueNorth: design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 34(10), 1537–1557 (2015).

© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0011

Chapter 11

Fault Detection and Diagnosis Based on an LSTM Neural Network Applied to a Level Control Pilot Plant

Emerson Vilar de Oliveira, Yuri Thomas Nunes, Mailson Ribeiro Santos, and Luiz Affonso Guedes

Federal University of Rio Grande do Norte (UFRN), Department of Computer Engineering and Automation (DCA)

This chapter presents a fault detection and diagnosis (FDD) approach for industrial processes that is based on recurrent neural networks. More specifically, a Long Short-Term Memory (LSTM) neural network architecture is used. A set of experiments using real pilot-scale plant data is performed to analyze the performance of this FDD approach. To evaluate the robustness of the proposed approach, special attention is given to the influence of the LSTM neural network hyper-parameters on the accuracy of fault detection and diagnosis. For comparison, the same FDD approach using the traditional multilayer perceptron (MLP) neural network architecture is also evaluated. The obtained results indicate that the LSTM-based approach performs better than the MLP-based approach under the same conditions.

11.1. Introduction

The demand for improved operational safety of industrial processes has led to the development and application of several fault detection and diagnosis (FDD) approaches [1–3]. FDD solutions are considered to be critical assets for the safe operation of industrial processes. For example, they play an important role in preventing accidents in industry. Thus, it is necessary to carefully evaluate and validate these solutions before implementation in real industrial processes. Therefore, it is quite common to adopt simulation and implementation strategies in small-scale prototypes to validate FDD strategies.

Regarding simulation-based benchmarks for FDD, the Tennessee Eastman Process (TEP) benchmark is widely used in the literature [4–7]. Eastman Chemical Company developed this benchmark to provide a simulator for a real chemical process [8]. The TEP is mainly applied in the evaluation of process control and monitoring methods. Another well-known benchmark for FDD is DAMADICS (Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems), which provides simulated data from faulted control valves in a water evaporation process [9].

In contrast, few works in the literature use fault data from real processes to validate FDD strategies. This may be due to the difficulty of safely obtaining or generating fault scenarios in real processes. Thus, small-scale processes (called pilot processes here) are considered to be a good trade-off between simulation and real industrial processes in the FDD context [10, 11].

Concerning the approaches used to carry out FDD, techniques based on machine learning (ML) are the most prominent in the literature [12–15]. Among these ML-based approaches, those that use neural networks stand out [16–23]. However, a great number of these works use the classical multilayer perceptron (MLP) neural networks.

This chapter aims to evaluate the performance of an FDD approach based on a recurrent neural network. The neural network adopted for the FDD strategy has a Long Short-Term Memory (LSTM) architecture. In addition, the neural network is applied to data acquired from a pilot process. The pilot process is a level control system with two coupled tanks. We pay particular attention to the influence of the neural network hyper-parameters on the performance of the FDD system. For comparison purposes, we also evaluate the performance of an MLP-based approach. Accuracy and confusion matrices are used to evaluate the performance of both neural network approaches.

The rest of this chapter is organized as follows. The main concepts of FDD are presented in Section 11.2, and those of MLP and LSTM neural networks in Section 11.3. The proposed architecture for the FDD is presented in Section 11.4. The experimental setup is presented in Section 11.5. The results of the performance evaluation of the LSTM-based and MLP-based FDD approaches are presented in Section 11.6. Finally, our conclusions are outlined in Section 11.7.

11.2. Fault Detection and Diagnosis

In the last few decades, FDD techniques applied to the monitoring of industrial process operations have received a great deal of attention. In addition, the number of studies has increased recently, demonstrating the relevance of this area of research [24]. This interest is mainly due to the growing complexity of industrial process operations, which must meet market requirements, increase production, improve continuity, and achieve process reliability under stricter environmental and safety restrictions [11]. For example, human operator faults cause about 70% to 90% of accidents in an industrial environment [25, 26].

Fault detection and fault classification are FDD tasks. While fault detection determines whether a fault has occurred, fault classification determines the type of fault that has occurred. The type of fault is useful information for process monitoring [27].

FDD methods can be classified into three groups: quantitative model-based methods, qualitative model-based methods, and data-based methods [25]. Model-based methods are generally developed based on physical knowledge of the process. In quantitative models, process operation knowledge is expressed as mathematical functions that map the system's inputs to outputs. Figure 11.1 shows the general structure of a quantitative model. In contrast, in qualitative models, these relationships are expressed as qualitative functions that are centered on different process units. Figure 11.2 shows the general structure of a qualitative model. Meanwhile, data-based approaches use only historical data of the process variables from industrial processes in operation [25]. Figure 11.3 shows the general structure of a data-based model for the FDD problem. Figure 11.4 presents a tree-based classification of FDD methods. It is important to note that the approach adopted in this chapter can be classified as quantitative data-based, using neural networks for FDD.

Figure 11.1: General structure of a quantitative model for FDD. (Blocks: Controller, Plant, Model, FDD System; signals: u, y, residue, fault information.)

Figure 11.2: General structure of a qualitative model for FDD. (Blocks: Controller, Plant, Discrepancy Detector, Qualitative Database, FDD System; signals: u, y, discrepancies, fault information.)

Figure 11.3: General structure of a data-based model for the FDD problem. (Blocks: Controller, Plant, Sensors, Process database, Features Extractor, Static/Expert Analyzer, FDD System.)

Figure 11.4: Classification of fault detection and diagnosis methods [25].

One of the main concepts for diagnosing faults in the operation of a process is the definition of what a fault is. A fault can be expressed in general terms as a deviation of the process behavior from the range of acceptable values for a set of variables or calculated parameters [28]. Figure 11.5 shows a controlled process system and indicates the different sources of faults that are present in it. Based on Figure 11.5, faults or malfunctions can be grouped into three classes: parameter faults, structural faults, and faults in actuators and sensors. Parameter faults are observed when a disturbance enters the process through one or more external variables, such as a change in the concentration value of the reagent in the supply of a reactor. Structural faults refer to changes in the process itself that occur due to a serious equipment fault, such as a broken or leaking pipe. This type of malfunction results in a change in the flow of information between several variables. Finally, faults in actuators and sensors can occur due to a constant component fault or a positive or negative bias. A fault in one of these instruments can cause the plant's state variables to deviate beyond the limits acceptable by the process, which can cause a serious malfunction [25].

Figure 11.5: A general diagnostic framework [25]. (Fault sources indicated: controller malfunction, process disturbance, actuator faults, structural faults, sensor faults.)
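The residue signal that feeds the FDD system in the quantitative structure of Figure 11.1 can be illustrated with a minimal Python sketch. The constant model output, the injected sensor bias, the fixed threshold, and the function names are all assumptions made for illustration; they are not taken from the pilot plant or from the neural-network approach used in this chapter.

```python
def residuals(y_measured, y_model):
    """Residue: difference between the plant output and the model output."""
    return [ym - yh for ym, yh in zip(y_measured, y_model)]

def detect_fault(residue, threshold):
    """Flag each sample whose absolute residue exceeds a fixed threshold."""
    return [abs(r) > threshold for r in residue]

# Hypothetical data: the model predicts a constant level of 10.0 and the
# sensor develops a positive bias of 2.0 from the sixth sample onwards.
y_model = [10.0] * 10
y_measured = [10.0] * 5 + [12.0] * 5

flags = detect_fault(residuals(y_measured, y_model), threshold=0.5)
print(flags)
```

A fixed threshold only covers the detection task; deciding which type of fault occurred (classification) needs more information than a single residue magnitude.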

11.3. Artificial Neural Networks

This section outlines the concept of a neural network. First, non-recurrent neural network architectures are briefly discussed. Second, the concepts behind recurrent neural networks are described. Finally, the LSTM network is presented in detail.

11.3.1. Multilayer Perceptron Neural Networks

The behavior of the human brain inspired the concept of the artificial neural network. The human brain has billions of processing structures called neurons. Each of these neurons makes connections to hundreds of other neurons via synapses. Neurons consist of a body (soma), axon, and dendrites. The dendrites receive stimuli from other neurons. The neuron sends an action potential through the axon towards the synapses. This happens by a chemical reaction. The synapses then spread the information to the connected dendrites. A network of neurons made up of billions of these structures can quickly process complex information [29, 30].

The perceptron is the simplest form of an artificial neural network and is composed of only one neuron. The perceptron works by adjusting weights and a bias to model a function that separates two linearly separable classes. This structure can classify the inputs into two groups [29]. A layer of perceptrons is necessary to classify three or more classes. However, these classes must still be linearly separable for the perceptron layer to classify them correctly. Perceptrons are interesting objects for the study of neural networks. However, most real-world problems do not obey the linearity criterion that limits them. Due to these limitations, perceptrons are ineffective in solving this kind of problem. A single perceptron is not able to model complex functions.

MLPs are a more robust structure that is capable of modeling real-world problems. An MLP is a network composed of more than one layer of chained perceptrons. Figure 11.6 shows an abstract representation of the architecture of an MLP neural network. In Figure 11.6, each circle (or node) represents a perceptron (or neuron). The arrows at the beginning are the inputs, while the arrows at the end are the outputs. The internal arrows indicate the transfer of information between the layers. Each MLP neuron performs a linear operation on the inputs. It starts with a multiplication of the neuron inputs by the layer weights. All of these multiplication results are then summed. A bias is added to this sum to complete the linear operation. Equation (11.1) shows the mathematical computation applied to an MLP neuron.

$$y = \sum_{i=1}^{n} w_i x_i + b \tag{11.1}$$

Here, $x_i$ denotes the $i$th input value, $w_i$ denotes the layer weight associated with that input, $b$ is the bias value, and $y$ is the neuron output. Figure 11.7 shows an abstract MLP neuron representation that is associated with a generic activation function. Here, $x_1, x_2, \ldots, x_n$ are the neuron inputs and $b$ is the bias value, as presented in Eq. (11.1). The filled circle represents the multiplication of each input by its respective weight. The summation is represented by the uppercase Greek letter sigma ($\Sigma$). The filled box is a generic activation function representation. Finally, $y_k$ is the output of the $k$th neuron in a layer.

Figure 11.6: Abstract MLP neural network architecture representation.

Figure 11.7: Abstract MLP neuron representation.
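The computation of Eq. (11.1), followed by the generic activation function of Figure 11.7, can be sketched in a few lines of Python. The function names, the choice of a sigmoid as the activation, and the numerical values are illustrative assumptions rather than anything prescribed by the chapter.

```python
import math

def sigmoid(z):
    """One possible activation function (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b, activation=None):
    """Eq. (11.1): weighted sum of the inputs plus bias, then optional activation."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(s) if activation is not None else s

# Hypothetical inputs, weights, and bias for a neuron with n = 3 inputs.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.5

y_linear = neuron_output(x, w, b)                      # linear part only
y_active = neuron_output(x, w, b, activation=sigmoid)  # with activation
print(y_linear, y_active)
```

Here the linear part evaluates to 0.5 - 2.0 + 0.75 + 0.5 = -0.25, and the sigmoid maps it into the interval (0, 1).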

All of the information in an MLP flows through the perceptrons from the input layer to the output layer. This kind of network is called a feed-forward network. Depending on the network's size, the number of associated weights can be considerable. These weights need to be adjusted during network training to fit the problem as desired. The backpropagation algorithm is the most common method to adjust the weight values. This algorithm initializes the weights of each neuron with random and usually small values. As mentioned, all of the information flows from the input to the output of the network. The backpropagation algorithm propagates errors back through the layers, from the end to the beginning. These errors indicate how much to adjust the weights. This iteration continues until the outputs are close to the desired values, which means that the error is low. Given a differentiable activation function, the partial derivatives of the output error (the difference between the network output and the desired value) with respect to the weights are calculated. This operation generates an error gradient, which is calculated for each neuron $k$ in an MLP-based neural network [29].
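The backward flow of errors described above can be sketched for a very small network. The following Python example assumes one hidden layer of two sigmoid neurons, a single sigmoid output neuron, a squared-error loss, and plain gradient descent; the starting weights, learning rate, and training pair are hypothetical values chosen only to keep the sketch short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """Forward pass: inputs -> hidden layer -> single output neuron."""
    h = [sigmoid(sum(w * xj for w, xj in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hk for w, hk in zip(w2, h)) + b2)
    return h, y

def loss(y, d):
    """Squared error between the network output y and the desired value d."""
    return 0.5 * (y - d) ** 2

def gradients(x, d, W1, b1, w2, b2):
    """Backpropagate the output error to get the gradient for every weight."""
    h, y = forward(x, W1, b1, w2, b2)
    delta_out = (y - d) * y * (1.0 - y)            # error at the output neuron
    delta_hid = [delta_out * w * hk * (1.0 - hk)   # error returned to the hidden layer
                 for w, hk in zip(w2, h)]
    g_w2 = [delta_out * hk for hk in h]
    g_W1 = [[dk * xj for xj in x] for dk in delta_hid]
    return g_W1, delta_hid, g_w2, delta_out

def train(x, d, W1, b1, w2, b2, lr=0.5, steps=200):
    """Repeat: forward pass, backpropagate the error, adjust the weights."""
    for _ in range(steps):
        g_W1, g_b1, g_w2, g_b2 = gradients(x, d, W1, b1, w2, b2)
        W1 = [[w - lr * g for w, g in zip(row, g_row)]
              for row, g_row in zip(W1, g_W1)]
        b1 = [b - lr * g for b, g in zip(b1, g_b1)]
        w2 = [w - lr * g for w, g in zip(w2, g_w2)]
        b2 = b2 - lr * g_b2
    return W1, b1, w2, b2

# Hypothetical starting weights and a single training pair (x, d).
x, d = [1.0, 0.5], 1.0
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]
w2, b2 = [0.5, -0.5], 0.0

error_before = loss(forward(x, W1, b1, w2, b2)[1], d)
trained = train(x, d, W1, b1, w2, b2)
error_after = loss(forward(x, *trained)[1], d)
print(error_before, error_after)
```

In practice, training runs over many samples and the weights start from random values; fixed starting weights are used here only so that the sketch is reproducible.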

11.3.2. Recurrent Neural Networks

Feed-forward neural networks, such as the MLP, do not have cycles in their structure, because information is transported straight through the layers, from the beginning to the end of the network. Thus, these networks are not able to weigh new information against the information already present. This means that feed-forward networks have no memory. Unlike conventional MLPs, recurrent neural networks (RNNs) use information obtained from feedback cycles. These explicit feedback cycles can happen through one or more neurons. Usually, internal states are responsible for the information forwarded over the network. These states change according to the new inputs. Even with the changes, the information from past states retains some relevance. Consequently, the RNN features a structure that represents a memory. Figure 11.8 shows two representations of common RNN architectures. In Figure 11.8(a), it is possible to see explicit feedback from each neuron output to every other neuron in the layer. In addition, it is possible to observe that there is no feedback from a neuron to itself. Meanwhile, Figure 11.8(b) shows an RNN structure with explicit feedback from each neuron only to itself. In a multi-layer recurrent network, the feedback can be from one neuron to every other neuron within a layer or from the last layer to the first [30]. The memory feature of the RNN is helpful for problems that depend not only on the current input but also on the time dependency of the data [29].
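The memory effect of a feedback cycle can be illustrated with a single neuron that feeds its own delayed output back to itself, in the spirit of Figure 11.8(b). The update rule below (a hyperbolic tangent applied to the weighted input plus the weighted previous state) and all parameter values are assumptions chosen for illustration; the chapter does not prescribe a specific RNN equation.

```python
import math

def rnn_self_feedback(inputs, w_in, w_rec, b=0.0, h0=0.0):
    """Run one neuron with self-feedback over a sequence of scalar inputs."""
    h = h0
    states = []
    for x in inputs:
        # The previous state h plays the role of the delayed (Z^-1) output.
        h = math.tanh(w_in * x + w_rec * h + b)
        states.append(h)
    return states

# An impulse at the first time step followed by zero input.
impulse = [1.0, 0.0, 0.0, 0.0]

with_memory = rnn_self_feedback(impulse, w_in=1.0, w_rec=0.9)
without_memory = rnn_self_feedback(impulse, w_in=1.0, w_rec=0.0)
print(with_memory)
print(without_memory)
```

With the feedback weight set to zero, the neuron behaves like a feed-forward unit and its output returns to zero as soon as the input does; with a nonzero feedback weight, the effect of the impulse persists and decays gradually over the following time steps.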

Figure 11.8: Abstract RNN representations. (a) RNN architecture with no self-feedback. (b) RNN architecture with only self-feedback. In both panels, the blocks labeled $Z^{-1}$ are delay operators.

11.3.3. Long Short-Term Memory Recurrent Neural Networks

LSTM is a type of RNN. It was originally proposed by the authors of [31] to mitigate the problems resulting from explicit information feedback. The feedback associated with the backpropagation algorithm can bring considerable difficulties. These kinds of behaviors are usually present in common RNNs [31]. Among these problems, the most well-known are vanishing gradients and gradient blow-up. Vanishing gradients can make the network lose its fitting capacity. It can then take an unworkable amount of time to assimilate the necessary time dependence, and the result may still not work as expected. Gradient blow-up, in turn, may generate an undesirable oscillation of the neural weight values [31]. To overcome these problems, LSTM neurons have internal states and functions. These internal functions work with the neuron input to produce the new state values. The value of the internal state of the neurons is updated at each algorithm time step. This sophisticated structure can keep information from past inputs for longer, while keeping the recent entries relevant.

Figure 11.9 represents an LSTM neuron. In Figure 11.9, it is possible to see the LSTM inputs and outputs (states), activation functions, and gates. At a specific time $t$, the LSTM has three inputs: the information source $x_t$ and the two recurrent states from the previous time instant, $h_{t-1}$ and $c_{t-1}$. These states are the hidden state and the cell state, respectively. The cell outputs are the current cell state $c_t$ and hidden state $h_t$. The output $y_t$ is equal to the hidden state $h_t$. If the cell belongs to the last network layer, then the hidden state $h_t$ is the final output $y_t$. If the cell belongs to an internal layer, then $h_t$ is only passed forward to the next layer. Moreover, in Figure 11.9, the components are highlighted as Forget Gate, Input Gate, Cell Gate, Cell Update, and Output Gate, whose features are described in the following subsections.

11.3.3.1. Forget gate

This component of the LSTM neuron is responsible for forgetting information retained from past states, if necessary. Figure 11.10 shows the forget gate in the LSTM cell.

Figure 11.9: LSTM cell representation. (The diagram shows the inputs $x_t$, $h_{t-1}$, and $c_{t-1}$; the forget gate $f_t$, input gate $i_t$, cell gate $g_t$, output gate $o_t$, and cell update; and the outputs $c_t$, $h_t$, and $y_t$. The legend identifies the copy, Hadamard product, sum, hyperbolic tangent, and sigmoid operations.)

Figure 11.10: Forget gate localization in an LSTM cell representation.

From Figure 11.10, it is possible to observe that two of the cell inputs, $h_{t-1}$ and $x_t$, are joined and then pass through the Sigmoid activation function $\phi$. Equation (11.2) shows the operation performed by the forget gate [32]:

$$f_t = \phi(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}), \tag{11.2}$$

where $x_t$ is the neuron input at the current time $t$, and $h_{t-1}$ is the hidden state at the previous time instant. The matrix $W_{if} \in \mathbb{R}^{H \times X}$ and the vector $b_{if} \in \mathbb{R}^{H}$ are the weights and bias applied to the input in the forget gate (input forget, "if"). The matrix $W_{hf} \in \mathbb{R}^{H \times H}$ and the vector $b_{hf} \in \mathbb{R}^{H}$ are the weights and bias applied to the hidden state in the forget gate (hidden forget, "hf"). The variable $H$ is the number of neurons in the network and $X$ is the number of inputs. Finally, $\phi$ is the Sigmoid activation function and $f_t$ is the forget gate output.

11.3.3.2. Input gate and cell gate

The input gate and the cell gate decide how much of the new input information is inserted into the network. Figure 11.11 shows the input gate on the left and the cell gate on the right of the LSTM cell. The initial combination of the inputs $h_{t-1}$ and $x_t$ passes through two different activation functions: a Sigmoid for the input gate and a hyperbolic tangent for the cell gate.

Figure 11.11: Input gate and cell gate localization in an LSTM cell representation.


Equations (11.3) and (11.4) show the respective operations performed by the input and cell gates [32]:

$$i_t = \phi(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}), \tag{11.3}$$

$$g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \tag{11.4}$$

where $x_t$ is the neuron input at the current time $t$, and $h_{t-1}$ is the hidden state at the previous time instant. For the input gate, the matrix $W_{ii} \in \mathbb{R}^{H \times X}$ and the vector $b_{ii} \in \mathbb{R}^{H}$ are the weights and bias applied to the input (input input, "ii"). The matrix $W_{hi} \in \mathbb{R}^{H \times H}$ and the vector $b_{hi} \in \mathbb{R}^{H}$ are the weights and bias applied to the hidden state (hidden input, "hi"). $\phi$ is the Sigmoid activation function, and $i_t$ is the input gate output. For the cell gate, the matrix $W_{ig} \in \mathbb{R}^{H \times X}$ and the vector $b_{ig} \in \mathbb{R}^{H}$ are the weights and bias applied to the input (input cell, "ig"¹). The matrix $W_{hg} \in \mathbb{R}^{H \times H}$ and the vector $b_{hg} \in \mathbb{R}^{H}$ are the weights and bias applied to the hidden state (hidden cell, "hg"). The activation function of the cell gate is the hyperbolic tangent, represented by tanh, and $g_t$ is the cell gate output. In both equations, $H$ is the number of neurons in the network and $X$ is the number of inputs.

11.3.3.3. Cell update and cell state

Cell update is the set of operations performed to calculate the new cell state value. The cell state is the internal layer state that propagates through the neurons. Its function is to carry the information resulting from the internal operations within the same layer of neurons. Figure 11.12 shows the cell update in the LSTM cell. From Figure 11.12, it is possible to see the element-wise (Hadamard) multiplication between the cell state at the previous instant, $c_{t-1}$, and the output of the forget gate, $f_t$. This value is added to the result of the element-wise multiplication between the input gate output $i_t$ and the cell gate output $g_t$. Equation (11.5) shows the operation performed in the cell update [32]:

$$c_t = f_t * c_{t-1} + i_t * g_t \tag{11.5}$$

The $*$ symbol denotes the Hadamard multiplication. The result of these operations is the current cell state value $c_t$, which is one of the inputs to the next LSTM cell.

11.3.3.4. Output gate and hidden state

The output gate processes the part of the information that will serve as the output of the network, $y_t$. The hidden state is the element that carries information from previous moments through the network layers.

¹ In this case, the letter g is used to avoid confusion with the letter c used for the cell state.

Figure 11.12: Cell update localization in an LSTM cell representation.

Figure 11.13 shows the output gate in the LSTM neuron. The output gate processes the information coming from the initial combination $(x_t, h_{t-1})$ through a Sigmoid activation function. The hidden state results from the element-wise multiplication of the output gate output $o_t$ and the cell update output $c_t$ after the latter passes through a hyperbolic tangent activation function. Equations (11.6) and (11.7) show the operations performed in the output gate and how the current hidden state value is obtained [32]:

$$o_t = \phi(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}), \tag{11.6}$$

$$h_t = o_t * \tanh(c_t) \tag{11.7}$$

where $x_t$ is the neuron input at the current time $t$, and $h_{t-1}$ is the hidden state at the previous time instant. The matrix $W_{io} \in \mathbb{R}^{H \times X}$ and the vector $b_{io} \in \mathbb{R}^{H}$ are the weights and bias applied to the input in the output gate (input output, "io"). The matrix $W_{ho} \in \mathbb{R}^{H \times H}$ and the vector $b_{ho} \in \mathbb{R}^{H}$ are the weights and bias applied to the hidden state in the output gate (hidden output, "ho"). The variable $H$ is the number of neurons in the network and $X$ is the number of inputs. $\phi$ is the Sigmoid activation function and $o_t$ is the output gate output. The hyperbolic tangent activation function is represented by tanh. As mentioned, if the neuron belongs to the last network layer, then the current $h_t$ is the network output $y_t$.
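The gate equations (11.2) to (11.7) can be combined into a single time step. The following is a minimal sketch in plain Python, not the implementation used in the chapter: the function names (`sigmoid`, `affine`, `lstm_step`) and the layout of `params` are assumptions made for illustration, and all weights in the example are set to zero so that the result can be checked by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_in, b_in, W_hid, b_hid, x, h):
    """Return W_in x + b_in + W_hid h + b_hid as a list of length H."""
    return [
        sum(w * v for w, v in zip(W_in[r], x)) + b_in[r]
        + sum(w * v for w, v in zip(W_hid[r], h)) + b_hid[r]
        for r in range(len(b_in))
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step following Eqs. (11.2)-(11.7).

    params maps each gate name ("f", "i", "g", "o") to a tuple
    (W_input, b_input, W_hidden, b_hidden); this layout is hypothetical.
    """
    pre = {k: affine(*params[k], x, h_prev) for k in ("f", "i", "g", "o")}
    f = [sigmoid(z) for z in pre["f"]]      # forget gate, Eq. (11.2)
    i = [sigmoid(z) for z in pre["i"]]      # input gate, Eq. (11.3)
    g = [math.tanh(z) for z in pre["g"]]    # cell gate, Eq. (11.4)
    o = [sigmoid(z) for z in pre["o"]]      # output gate, Eq. (11.6)
    # Cell update, Eq. (11.5): element-wise products and sum.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Hidden state, Eq. (11.7).
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Example with H = 2 neurons and X = 3 inputs, all weights and biases zero.
H, X = 2, 3
zero_gate = ([[0.0] * X for _ in range(H)], [0.0] * H,
             [[0.0] * H for _ in range(H)], [0.0] * H)
params = {k: zero_gate for k in ("f", "i", "g", "o")}
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], params)
```

With zero weights every Sigmoid gate outputs 0.5 and the cell gate outputs 0, so the new cell state is half of the previous one and the hidden state is 0.5 tanh(c_t). In practice a framework such as PyTorch would compute all gates with batched matrix operations.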

Figure 11.13: Output gate localization in an LSTM cell representation.

In the original LSTM proposal, there is a truncation factor associated with the gradient calculation. As research on LSTM advanced and new variations were proposed, this factor changed. Some works show better convergence in network training and better storage of information when these gradients are calculated in full [33]. This modification characterizes the Vanilla LSTM, which is popular among researchers and is an often-used variation [32].

11.3.3.5. Batch normalization

This approach uses the methodology originally proposed in [34], which uses a batch-normalized LSTM to classify the faults generated by the industrial process simulator TEP [8]. As mentioned, it uses an LSTM neural network combined with a Batch Normalization (BN) procedure. BN aims to improve the training iterations of the model. The procedure standardizes the activations using mean and standard deviation estimates in each network layer. Assuming a neural network with $K$ layers (indexed by $k$) and a mini-batch of size $M > 1$ (instances indexed by $m$), each mini-batch instance $m$ is composed of the activation values from the respective layer $k$. Thus, the BN methodology calculates the mean and standard deviation for each mini-batch instance activation. Then, the mean and variance of the whole layer are estimated [35]. These estimates are $\hat{E}(h)$ and $\widehat{\mathrm{Var}}(h)$, respectively.

According to the authors [34], their main contribution is to apply BN to the recurrent term $W_{hi} h_{t-1}$. They assert that normalizing the recurrent LSTM term instead of the $W_{ii} x_t$ term improves the LSTM's memory and convergence. Equation (11.8) defines the batch normalization operation:

$$\mathrm{BN}(h; \eta, \rho) = \rho + \eta \odot \frac{h - \hat{E}(h)}{\sqrt{\widehat{\mathrm{Var}}(h) + \epsilon}} \tag{11.8}$$

where $h \in \mathbb{R}^{H}$ is the vector to be normalized, taken over a mini-batch of size $M > 1$. In the proposed methodology, the recurrent terms $W_{hi} h_{t-1}$, with $W_{hi} \in \mathbb{R}^{H \times H}$, are the mini-batch instances. The symbols $\rho \in \mathbb{R}^{H}$ and $\eta \in \mathbb{R}^{H}$ are learnable parameter vectors that set the mean and standard deviation of the normalized features, $\epsilon$ is a constant regularization parameter for numerical stability, and $\odot$ is the element-wise multiplication. Following the original notation [34], $\rho = 0$ to avoid unnecessary redundancy and over-fitting, which simplifies the BN operation to $\mathrm{BN}(h; \eta)$.
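As an illustration of Eq. (11.8), the sketch below normalizes each feature over a small mini-batch in plain Python. It is only a sketch under stated assumptions: the function name `batch_norm`, the default `eps = 1e-5`, and the use of the biased (divide by $M$) variance estimate are choices made here, not details confirmed by the chapter.

```python
import math

def batch_norm(batch, eta, rho=None, eps=1e-5):
    """Apply BN(h; eta, rho) of Eq. (11.8) feature-wise over a mini-batch.

    batch: list of M vectors of length H (for example, the recurrent terms
    of M sequences); eta, rho: length-H parameter vectors.
    """
    M, H = len(batch), len(batch[0])
    rho = rho if rho is not None else [0.0] * H   # rho = 0, as in the text
    # Mini-batch estimates of the mean and (biased) variance per feature.
    mean = [sum(row[j] for row in batch) / M for j in range(H)]
    var = [sum((row[j] - mean[j]) ** 2 for row in batch) / M for j in range(H)]
    return [
        [rho[j] + eta[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps)
         for j in range(H)]
        for row in batch
    ]

# Two instances (M = 2) with H = 2 features on very different scales.
batch = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_norm(batch, eta=[1.0, 1.0])
```

After normalization both features take values close to -1 and +1, regardless of their original scale, which is the effect BN relies on to stabilize training.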

11.4. LSTM-Based Approach to FDD

In contrast to the trend present in the literature, the approach presented in [34] does not combine two techniques, such as "feature extraction" + "classification", to solve the FDD problem. Instead, the model directly receives multivariable fault data and returns as output the pertinence scores for each fault class. The inputs feed the BN-LSTM, which applies the normalization to the recurrent hidden-state terms. The BN-LSTM output then feeds a fully connected layer. This last step resizes the network output, and a LogSoftmax function is then applied to transform the network output into log-probabilities of pertinence to each class. To update the network weights, we chose the Adam optimization algorithm [36]. As the loss function to be minimized, we chose the Negative Log-Likelihood Loss. The source code implementing the proposed methodology is available online.² The code is written in Python using the PyTorch machine learning framework.

² https://github.com/haitaozhao/LSTM_fault_detection.

Figure 11.14 presents a block diagram of the LSTM-based FDD approach adopted in this study. The Fault Classifier module corresponds to an LSTM/MLP neural network. This module receives the sensor data at the current instant and $J - 1$ delayed samples ($Z^{J-1}$) for each variable, represented by $x_t$ and $x_{t-1}, \ldots, x_{t-J+1}$, respectively. These data are the basis of the fault classification task. The classifier output indicates the detected faults, which are represented by $y_1, y_2, \ldots, y_C$. Figure 11.15 shows the processing steps of the Fault Classifier module, from the window sample generation to the LogSoftMax layer output, as described earlier in this section.
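The window sample generation that feeds the classifier can be sketched as follows. This is a hedged illustration in plain Python: `make_windows` is a hypothetical helper, and the synthetic `samples` list stands in for the recorded process vectors.

```python
def make_windows(samples, J):
    """Return every run of J consecutive samples, oldest first.

    samples: list of per-instant measurement vectors x_t.
    Each window holds x_(t-J+1), ..., x_(t-1), x_t.
    """
    if J < 1 or J > len(samples):
        raise ValueError("J must be between 1 and len(samples)")
    return [samples[t - J + 1 : t + 1] for t in range(J - 1, len(samples))]

# Synthetic stand-in for 600 recorded samples of a three-variable vector.
samples = [(float(t), 0.0, 0.0) for t in range(600)]
windows = make_windows(samples, J=7)
print(len(windows))  # 594 windows of 7 samples each
```

A record of N samples yields N - (J - 1) windows, because the first J - 1 instants do not yet have enough history to fill a window.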


Figure 11.14: Representation diagram of the fault classification methodology.

Figure 11.15: Flowchart describing the Fault Classifier module process. (Blocks: window of J inputs, LSTM/MLP layer, batch normalization, LogSoftMax; training model: NLLLoss and Adam weight adjustments.)
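The last two blocks of the flowchart, LogSoftMax and the negative log-likelihood loss, can be written out in a few lines. The sketch below uses plain Python rather than PyTorch; the function names and the example scores are assumptions chosen for illustration.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of a list of class scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the true class index."""
    return -log_probs[target]

# Hypothetical output of the fully connected layer for three fault classes.
scores = [2.0, 0.5, -1.0]
log_probs = log_softmax(scores)
predicted = max(range(len(scores)), key=lambda k: log_probs[k])
loss_if_correct = nll_loss(log_probs, target=0)
loss_if_wrong = nll_loss(log_probs, target=2)
```

The exponentials of `log_probs` sum to one, the predicted class is the one with the largest log-probability, and the loss is small when that class is the true one and large otherwise. An optimizer such as Adam then adjusts the weights to reduce this loss.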

11.4.1. The Pilot Plant: PiPE

The level control pilot plant was developed by De Lorenzo of Brazil and named PiPE, an acronym for Pilot Plant Experiment. It was specially designed to study process control. It has four process variables: pressure, temperature, flow rate, and level, which are defined by the input vector $S = (u, t, f, y)$ [37]. Figure 11.16 shows a photo of the pilot plant and Figure 11.17 shows its schematic diagram.

Figure 11.16: Photograph of the pilot plant [38].

Figure 11.17: Schematic diagram of the pilot plant [38].

Regarding instruments and equipment, the pilot plant is composed of a panel with a programmable logic controller (PLC) and an electrical system for plant control; two pressurized vessels, one made of acrylic and the other of stainless steel, named T1 and T2, respectively; a centrifugal recirculation pump controlled by a frequency inverter; a heating and heat exchange system; two valves, named V1 and V2; and sensors for temperature, pressure, flow rate, and level [38]. In addition, the pilot plant has a transducer that converts physical quantities into electrical signals, which are processed by the PLC; a terminal bus; and a supervisory control and data acquisition (SCADA) system for parameter configuration and process operation visualization.

The pilot plant works with the two tanks, T1 and T2, connected by a piping system that allows a flow between them. The fluid flows from T1 to T2 by gravity because T1 is at a higher level than T2 in relation to the ground. From T2 to T1, the liquid flows under the pressure generated by the centrifugal pump.

The pilot plant was originally designed for process control. To carry out the fault detection and diagnosis studies proposed in this chapter, a fault set was introduced. The produced faults were grouped into three fault groups: actuator, sensor, and structural. Each fault group has different intensity levels for the same type of fault. Table 11.1 shows the faults with their respective descriptions. The actuator fault group has six faults associated with the levels of pump off-set: +2%, +4%, +8%, −2%, −4%, and −8%. The sensor fault group also has six faults associated with the levels of off-set. In the structural fault group, there are two levels of drain opening, which simulate a physical leak in tank T1. In addition to the tank leaks, there are four levels of stuck valve for V1 and three for V2.

Table 11.1: Set of generated faults.

| Group | Fault | Description |
|---|---|---|
| Actuator | F1 | +2% off-set |
| | F2 | +4% off-set |
| | F3 | +8% off-set |
| | F4 | −2% off-set |
| | F5 | −4% off-set |
| | F6 | −8% off-set |
| Sensor | F7 | +2% off-set |
| | F8 | +4% off-set |
| | F9 | +8% off-set |
| | F10 | −2% off-set |
| | F11 | −4% off-set |
| | F12 | −8% off-set |
| Structural | F13 | 100% of tank leak |
| | F14 | 66% of tank leak |
| | F15 | 30% of stuck valve V1 |
| | F16 | 50% of stuck valve V1 |
| | F17 | 85% of stuck valve V1 |
| | F18 | 100% of stuck valve V1 |
| | F19 | 25% of stuck valve V2 |
| | F20 | 50% of stuck valve V2 |
| | F21 | 75% of stuck valve V2 |


Figure 11.18: Fault F5 scenario.

Here, the data are used only as a reference for the pilot plant control. To implement this control, we used a multistage fuzzy controller developed with the software JFuzZ [39]. The communication between the SCADA system and the PLC was implemented with the OPC (OLE for Process Control) protocol [40], which collects the level of tank T1 ($y$), the reference ($r$, the set-point defined by the user), and the pressure ($u$, the pump control signal) to compose the vector $x = (r, y, u)$. The sample time used in all experiments was 100 ms.

Figure 11.18 shows a fault scenario associated with the occurrence of an F5-type fault. In this scenario, the pressure variable is strongly affected, while the level variable is only slightly impacted. Figure 11.19 shows a fault scenario associated with the occurrence of an F8-type fault. In this scenario, the level and pump pressure variables change considerably. Furthermore, there is a highly noisy region at the end of the fault occurrence, a behavior that is not present in the previous scenario. Figure 11.20 shows a fault scenario associated with the occurrence of an F18-type fault. This scenario is more similar to the one presented in Figure 11.18.

11.5. Experimental Setup

This section presents the setup of the experiments. We explain the criteria used to select the fault scenarios that compose the data for network training and validation. We also present the values selected for the network hyper-parameter variation, which aims to evaluate the impact of the hyper-parameters on the FDD system. Finally, some training details are given.


Figure 11.19: Fault F8 scenario.

Figure 11.20: Fault F18 scenario.

11.5.1. Fault Selection for Training and Validation

To train and validate the LSTM neural network models, a pair of faults was selected in each fault group. The least dissimilar fault pair in each fault group was chosen to compose the training and validation sets. The dissimilarity criterion that we used was the mean of the Euclidean distance between the variables operating under different faults. The Euclidean distance for two variables is defined by Eq. (11.9), where $v^{i}_{Fa}$ and $v^{i}_{Fb}$ correspond to the $i$th process sample when subjected to fault $a$ and fault $b$, respectively, and $N$ is the number of samples of each fault considered in the analysis. The larger the distance value, the more the two faults differ from each other.


Table 11.2: Fault selection.

| Subgroup | Combination | Dissimilarity |
|---|---|---|
| Actuator Positive (+2%, +4%, +8%) | F01 ↔ F02 | 61.30 |
| | F01 ↔ F03 | 46.45 |
| | F02 ↔ F03 | 51.89 |
| Actuator Negative (−2%, −4%, −8%) | F04 ↔ F05 | 82.57 |
| | F04 ↔ F06 | 114.28 |
| | F05 ↔ F06 | 156.03 |
| Sensor Positive (+2%, +4%, +8%) | F07 ↔ F08 | 75.75 |
| | F07 ↔ F09 | 427.42 |
| | F08 ↔ F09 | 356.74 |
| Sensor Negative (−2%, −4%, −8%) | F10 ↔ F11 | 72.98 |
| | F10 ↔ F12 | 282.43 |
| | F11 ↔ F12 | 236.93 |
| Structural Tank leakage | F13 ↔ F14 | 122.48 |
| Structural V1 (30%, 50%, 84%, 100%) | F15 ↔ F16 | 226.43 |
| | F15 ↔ F17 | 428.27 |
| | F15 ↔ F18 | 436.50 |
| | F16 ↔ F17 | 212.47 |
| | F16 ↔ F18 | 219.92 |
| | F17 ↔ F18 | 50.41 |
| Structural V2 (25%, 50%, 75%) | F19 ↔ F20 | 41.15 |
| | F19 ↔ F21 | 75.15 |
| | F20 ↔ F21 | 83.50 |

$$D\left(v_{Fa}, v_{Fb}\right) = \sqrt{\sum_{i=1}^{N} \left(v^{i}_{Fa} - v^{i}_{Fb}\right)^{2}}. \tag{11.9}$$
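The dissimilarity criterion, the Euclidean distance of Eq. (11.9) averaged over the process variables, can be sketched as below. The function names and the tiny example records are hypothetical; real records would hold the 600 samples per fault and variable.

```python
import math

def euclidean_distance(v_a, v_b):
    """Eq. (11.9): distance between two equally long sample sequences."""
    if len(v_a) != len(v_b):
        raise ValueError("both sequences must have the same number of samples")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_a, v_b)))

def mean_dissimilarity(fault_a, fault_b):
    """Average the distance over the variables recorded for two faults.

    fault_a, fault_b: dicts mapping a variable name to its list of samples.
    """
    names = sorted(fault_a)
    total = sum(euclidean_distance(fault_a[n], fault_b[n]) for n in names)
    return total / len(names)

# Toy records with two variables (level y and pump signal u), N = 4 samples.
fault_a = {"y": [0.0, 0.0, 0.0, 0.0], "u": [1.0, 1.0, 1.0, 1.0]}
fault_b = {"y": [3.0, 4.0, 0.0, 0.0], "u": [1.0, 1.0, 1.0, 1.0]}
print(mean_dissimilarity(fault_a, fault_b))  # (5.0 + 0.0) / 2 = 2.5
```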

Table 11.2 summarizes the fault selection procedure. The subgroup column shows a new grouping of the faults into subgroups based on conceptual similarities. The remaining columns list the dissimilarity metric for each pair combination in each subgroup. For both the dissimilarity tests and the subsequent evaluations, 600 samples of each fault were used. This value was chosen because of the sample files in the fault datasets. The number of input samples for the model varies according to the input size. Equation (11.10) gives the number of model input samples, where 600 is the total number of samples of each fault, $J$ is the input size, and $Q_e$ is the number of inputs to the model:

$$Q_e = 600 - (J - 1). \tag{11.10}$$

Then, based on the least dissimilarity criterion shown in Table 11.2, the faults selected for training are F01, F04, F07, F10, F13, F17, and F19. The ones that form the validation fault group are faults F03, F05, F08, F11, F14, F18, and F20, respectively.
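The selection rule can be reproduced directly from the dissimilarity values of Table 11.2. The sketch below picks the pair with the smallest value in each subgroup; the dictionary layout and function name are assumptions, and assigning the first fault of each pair to training and the second to validation is inferred from the lists given in the text.

```python
# Dissimilarity values transcribed from Table 11.2.
TABLE_11_2 = {
    "Actuator Positive": {("F01", "F02"): 61.30, ("F01", "F03"): 46.45,
                          ("F02", "F03"): 51.89},
    "Actuator Negative": {("F04", "F05"): 82.57, ("F04", "F06"): 114.28,
                          ("F05", "F06"): 156.03},
    # The third Sensor Positive pair is assumed to be F08-F09.
    "Sensor Positive": {("F07", "F08"): 75.75, ("F07", "F09"): 427.42,
                        ("F08", "F09"): 356.74},
    "Sensor Negative": {("F10", "F11"): 72.98, ("F10", "F12"): 282.43,
                        ("F11", "F12"): 236.93},
    "Structural Tank leakage": {("F13", "F14"): 122.48},
    "Structural V1": {("F15", "F16"): 226.43, ("F15", "F17"): 428.27,
                      ("F15", "F18"): 436.50, ("F16", "F17"): 212.47,
                      ("F16", "F18"): 219.92, ("F17", "F18"): 50.41},
    "Structural V2": {("F19", "F20"): 41.15, ("F19", "F21"): 75.15,
                      ("F20", "F21"): 83.50},
}

def least_dissimilar_pairs(table):
    """Return the pair with the smallest dissimilarity in each subgroup."""
    return {sub: min(pairs, key=pairs.get) for sub, pairs in table.items()}

selected = least_dissimilar_pairs(TABLE_11_2)
training = [a for a, _ in selected.values()]
validation = [b for _, b in selected.values()]
print(training)    # ['F01', 'F04', 'F07', 'F10', 'F13', 'F17', 'F19']
print(validation)  # ['F03', 'F05', 'F08', 'F11', 'F14', 'F18', 'F20']
```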


11.5.2. LSTM Neural Network Architecture Selection

To carry out the performance evaluation of the LSTM-based FDD approach, a set of hyper-parameters was considered. The selected hyper-parameters were the number of network inputs, $J$, and the number of neurons in the LSTM neural network layer, HS. The selected values for $J$ and HS were the following:

• J = 3, 5, 7, 10.
• HS = 20, 30, 40, 50, 60.

The choice of these values was based on the behavior of the selected fault set. For each value of $J$, all values of HS were tested, resulting in a total of 20 different combinations of hyper-parameter values. The performance metric chosen for model evaluation was the accuracy (correctness) in the fault classification task. For each combination, the experiment consists of training 100 models with the same hyper-parameters and extracting the performance metric obtained by each of them. At the end of the experiment, each architecture has 100 accuracy values, obtained from models trained for 50 epochs each. It should be noted that this exhaustive validation test over various combinations of hyper-parameter values ($J$ and HS) aims to reduce the effect of randomness in the LSTM/MLP neural network training process.
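The size of this experiment follows from the hyper-parameter grid. The sketch below enumerates the grid and applies Eq. (11.10) to each input size; the constant and function names are illustrative assumptions, and the training itself is not shown.

```python
from itertools import product
from statistics import mean, median

J_VALUES = (3, 5, 7, 10)           # input window sizes
HS_VALUES = (20, 30, 40, 50, 60)   # LSTM layer sizes
RUNS_PER_CONFIG = 100              # models trained per (J, HS) pair
SAMPLES_PER_FAULT = 600

# Every (J, HS) combination evaluated in the experiment.
grid = list(product(J_VALUES, HS_VALUES))
total_runs = len(grid) * RUNS_PER_CONFIG

# Eq. (11.10): number of model inputs obtained from each fault record.
windows_per_fault = {J: SAMPLES_PER_FAULT - (J - 1) for J in J_VALUES}

def summarize(accuracies):
    """Summary statistics for the accuracy values of one configuration."""
    return {"mean": mean(accuracies), "median": median(accuracies),
            "max": max(accuracies)}

print(len(grid), total_runs)  # 20 combinations, 2000 trained models
print(windows_per_fault)      # {3: 598, 5: 596, 7: 594, 10: 591}
```

For J = 7 each fault contributes 594 windows, which is the per-fault sample count that appears in the confusion matrix discussion.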

11.6. Results

This section presents the main results obtained using the LSTM-based FDD approach. Here, the influence of the LSTM neural network hyper-parameter values on the solution performance is evaluated and discussed. For better interpretation, the results are presented using box-plots and a confusion matrix indicating the accuracy values obtained in the fault classification tasks. To better analyze the performance of the LSTM-based FDD approach, the same FDD solution using an MLP neural network was also implemented and evaluated. The results obtained with the two network approaches are presented in this section.

11.6.1. Results Using the LSTM-based FDD Approach

Figure 11.21(a,b) presents box-plots of the accuracy distribution of the validation group when varying the network input size ($J$) and the LSTM layer size (HS), respectively.³

³ The box-plots presented in Figure 11.21(a,b) highlight the following information: (i) the horizontal lines at the ends are the maximum and minimum values; (ii) the top line of the box delimits the third quartile and the bottom line the first quartile, which together span the inter-quartile distance of the distribution; (iii) the line inside the box is the median of the distribution; (iv) the filled point inside the box is the mean; and (v) the unfilled points are the outliers.

Figure 11.21: Box-plot of the accuracy values for varying hyper-parameters. (a) Accuracy for varying input size (J). (b) Accuracy for varying hidden layer size (HS).
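The quantities drawn in such a box-plot can be computed with the Python standard library. The sketch below is illustrative only: the accuracy values are made up, and the 1.5 × inter-quartile-distance rule used to flag outliers is a common plotting convention assumed here, not something the chapter specifies.

```python
import statistics

def box_plot_summary(values, whisker=1.5):
    """Return the quartiles, mean and outliers of a list of accuracies."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                      # inter-quartile distance
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return {
        "q1": q1, "median": med, "q3": q3,
        "mean": statistics.mean(values),
        "outliers": [v for v in values if v < low or v > high],
    }

# Hypothetical accuracies from repeated trainings of one configuration.
accuracies = [0.60, 0.88, 0.89, 0.90, 0.90, 0.91, 0.92, 0.93, 0.93]
summary = box_plot_summary(accuracies)
```

In this made-up sample the single low run (0.60) is flagged as an outlier below the box and pulls the mean under the median, the same pattern the text describes for the more consistent configurations.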

From Figure 11.21(a), it is possible to observe that the scenarios with input sizes of 7 and 10 ($J = 7$ and $J = 10$) show more consistent behavior in their distributions, considering all possible network sizes (HS). Their mean and median accuracy values are much closer to each other than in the remaining scenarios, and the inter-quartile distance is smaller, which indicates a smaller variation in the accuracy value. In addition, their outliers lie below the box, which means that low accuracy values are uncommon. In Figure 11.21(b), the highlight is the hidden size of 20, whose distribution has characteristics similar to those observed in the evaluation of the input size.

Figure 11.22 shows the behavior of all combinations of the LSTM neural network hyper-parameters, separated by the value of $J$. In the graphics of this figure, a reference line at an accuracy value of 0.95 is given as a visual guide.

Figure 11.22: Box-plot of accuracy values combining the hyper-parameters (panels (a) to (d)).

To carry out a more detailed performance evaluation, the hyper-parameter values of the LSTM neural network were set to $J = 7$ and HS = 20. This hyper-parameter configuration presents accuracy values with average (0.88), median (0.90), variance (0.05), and maximum (0.93) close to the 0.95 reference. In addition to these accuracy values, this hyper-parameter combination has a lower computational complexity than combinations such as ($J = 10$, HS = 30) and ($J = 10$, HS = 50), which both have average accuracy values around 0.89. A confusion matrix was then used to evaluate the classification accuracy obtained for each fault group. This experiment assesses the generalization capacity of the model when it is trained on a specific fault and validated against another fault with similar behavior in the same subgroup.

Figure 11.23 shows the accuracy confusion matrix obtained for the fault classification task using the LSTM neural network configuration given previously.⁴ In this case, given a confusion matrix $M_C$, its elements $M_C(i, j)$ represent how many samples of the $j$-th fault the model classified as the $i$-th fault. Thus, it is possible to note that faults F08 and F11 produce neither false positives nor false negatives. The faults F03 and F05 also obtained good accuracy values for the fault classification task (95.62% and 97.91%, respectively). In contrast, the faults F14, F18, and F20 had lower accuracy values (85.69%, 88.55%, and 72.73%) and the highest error rates. For example, fault F18 was erroneously classified as fault F20 in 123 samples out of 594. In general, the evaluated model presented a final accuracy value of 91.49%, which is considered quite adequate.

⁴ The row index of the matrix refers to the label predicted by the neural network. The column index of the matrix refers to the actual label of the data. The integer number in the labeled cells is the count of samples of the column label classified as the row label. The percentage in the labeled cells is the percentage of the number of samples in this cell over the total samples. The last row presents the percentage of correctly classified samples of the column label. The last column represents the percentage of correct predictions of the row label. The element in the last row and last column is the overall accuracy of the model.


Figure 11.23: Confusion matrix using test and train data defined in Section 11.5.1 with the LSTM classifier.
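The layout described for the confusion matrix, with rows indexed by the predicted label and columns by the actual label, can be sketched as follows. The helper names and the eight toy samples are assumptions for illustration and are unrelated to the counts reported in the figure.

```python
def confusion_matrix(actual, predicted, labels):
    """Count samples with rows = predicted label, columns = actual label."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix

def per_class_accuracy(matrix):
    """Fraction of each actual class (column) that was classified correctly."""
    n = len(matrix)
    result = []
    for j in range(n):
        column_total = sum(matrix[i][j] for i in range(n))
        result.append(matrix[j][j] / column_total if column_total else 0.0)
    return result

def overall_accuracy(matrix):
    """Correctly classified samples (diagonal) over all samples."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[k][k] for k in range(len(matrix))) / total

labels = ["F18", "F20"]
actual = ["F18"] * 4 + ["F20"] * 4
predicted = ["F18", "F18", "F18", "F20", "F20", "F20", "F20", "F20"]
matrix = confusion_matrix(actual, predicted, labels)
print(matrix)                      # [[3, 0], [1, 4]]
print(per_class_accuracy(matrix))  # [0.75, 1.0]
print(overall_accuracy(matrix))    # 0.875
```

Here one F18 sample is classified as F20, so it appears in row F20, column F18, and lowers the accuracy of the F18 column without affecting the F20 column.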

Regarding the pilot plant, the results corroborate that some fault groups are easier to classify. For instance, faults within the actuator and sensor groups achieve higher accuracy and fewer misclassifications. Although the structural group has a higher misclassification count, it still presents good accuracy.

11.6.2. Results Using MLP-Based FDD Approach

Figure 11.24(a,b) presents box-plots of the accuracy distribution of the validation group when varying the network input size ($J$) and the MLP layer size (HS), respectively. From Figure 11.24(a), it is possible to see that all of the input size scenarios show an undesirable behavior in their distributions, considering all possible network sizes (HS). The mean and median accuracy values are lower, close to 0.6. The inter-quartile distance is also small, which indicates little variation in the accuracy around this value. The outliers lie above the box, which means that high accuracy values are uncommon. The boxes in Figure 11.24(b) have characteristics similar to those observed for the input size variation.

Figure 11.25 shows the behavior of all combinations of the MLP neural network hyper-parameters, separated by the value of $J$. In the graphics of this figure, a reference line at an accuracy value of 0.75 (0.2 less than in the LSTM analysis) is presented as a visual guide.

Figure 11.24: Box-plot of the accuracy values for varying hyper-parameters. (a) Accuracy for varying input size (J). (b) Accuracy for varying hidden layer size (HS).

Figure 11.25: Box-plot of accuracy values combining the hyper-parameters (panels (a) to (d)).

For a more detailed performance evaluation, the hyper-parameter values of the MLP network were kept the same as those of the LSTM ($J = 7$ and HS = 20). This hyper-parameter configuration presents accuracy values with average (0.63) and median (0.63). These values are considerably lower than those obtained with the LSTM network. As done previously, the confusion matrix in Figure 11.26 shows the classification accuracy obtained for each fault group.

Figure 11.26: Confusion matrix using test and train data defined in Section 11.5.1 with the MLP classifier.

Figure 11.26 shows the accuracy confusion matrix obtained for the fault classification task using the MLP neural network with the configuration given above. In this case, it is possible to note that the faults F08 and F11 produce neither false positives nor false negatives, as in the LSTM results. The fault F20 obtained a reasonable accuracy value for the fault classification task (89.56%). The faults F03 and F14 obtained low accuracy values (76.6% and 55.89%, respectively), although above those of the remaining faults. The faults F05 and F18 had the lowest accuracy values (12.46% and 12.63%) and the highest error rates. For example, fault F18 was erroneously classified as fault F05 in 351 samples out of 594. In general, the evaluated MLP model presented a final accuracy value of 63.88%, which is low when compared with that obtained with the LSTM neural network.

11.7. Conclusion

In this chapter, the performance of the LSTM-based FDD approach was analyzed and compared to that of the same approach using an MLP layer. The use of pilot plant data is a notable aspect of this study. The pilot plant has the characteristics found in industrial systems, such as industrial instrumentation (sensors and actuators) and inherent nonlinearity. By reducing the gap to a real industrial process, the FDD approach addressed in this chapter is evaluated in scenarios that are more realistic than simulation.

The performance evaluation of the LSTM-based FDD approach shows that it has an excellent performance for fault detection and a good performance for classification tasks. It is, however, relevant to highlight that although fault classification is a more complex task than fault detection, fault detection alone is good enough in many real industrial process operations. Among the groups of faults, the actuator faults were the easiest to classify.

Another point to be highlighted is the distribution of the accuracy values with respect to the variation of the input size and the size of the LSTM network used. The observed change in accuracy indicates the need to find a specific combination of hyper-parameters to obtain a model with higher overall accuracy. Our future work will analyze neural network-based approaches with incremental learning techniques, which will enable new fault patterns to be learned without losing the previously acquired fault knowledge.

June 1, 2022 12:52

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch11

© 2022 World Scientific Publishing Company. https://doi.org/10.1142/9789811247323_0012

Chapter 12

Conversational Agents: Theory and Applications

Mattias Wahde∗ and Marco Virgolin†

Department of Mechanics and Maritime Sciences, Chalmers University of Technology, 412 96 Göteborg, Sweden

∗[email protected], †[email protected]

In this chapter, we provide a review of conversational agents (CAs), discussing chatbots, intended for casual conversation with a user, as well as task-oriented agents that generally engage in discussions intended to reach one or several specific goals, often (but not always) within a specific domain. We also consider the concept of embodied conversational agents, briefly reviewing aspects such as character animation and speech processing. The many different approaches to representing dialogue in CAs are discussed in some detail, along with methods for evaluating such agents, emphasizing the important topics of accountability and interpretability. A brief historical overview is given, followed by an extensive overview of various applications, especially in the fields of health and education. We end the chapter by discussing the benefits and potential risks regarding the societal impact of current and future CA technology.

12.1. Introduction

Conversational agents (CAs), also known as intelligent virtual agents (IVAs), are computer programs designed for natural conversation with human users, either involving informal chatting, in which case the system is usually referred to as a chatbot, or with the aim of providing the user with relevant information related to a specific task (such as a flight reservation), in which case the system is called a task-oriented agent. Some CAs use only text-based input and output, whereas others involve more complex input and output modalities (for instance, speech). There are also so-called embodied conversational agents (ECAs) that are typically equipped with an animated visual representation (face or body) on-screen. Arguably the most
important part of CAs is the manner in which they represent and handle dialogue. Here, too, there is a wide array of possibilities, ranging from simple template-matching systems to highly complex representations based on deep neural networks (DNNs).

Among the driving forces behind the current, rapid development of this field is the need for sophisticated CAs in various applications, for example, in healthcare, education, and customer service; see also Section 12.7. In terms of research, the development is also, in part, driven by high-profile competitions such as the Loebner prize and the Alexa prize, which are briefly considered in Sections 12.5 and 12.6.

Giving a full description of the rapidly developing field of CAs, with all of the facets introduced above, is a daunting task. Here, we will attempt to cover both theory and applications, but we will mostly focus on the representation and implementation of dialogue capabilities (rather than, say, issues related to speech recognition, embodiment, etc.), with emphasis on task-oriented agents, particularly applications of such systems. We strive to give a bird’s-eye view of the many different approaches that are, or have been, used for developing the capabilities of CAs. Furthermore, we discuss the ethical implications of CA technology, especially aspects related to interpretability and accountability. The interested reader may also wish to consult other reviews on CAs and their applications [1–5].

12.2. Taxonomy

The taxonomy used here primarily distinguishes between, on the one hand, chatbots and, on the other, task-oriented agents. Chatbots are systems that (ideally) can maintain a casual dialogue with a user on a wide range of topics, but are generally neither equipped nor expected to provide precise information. For example, many (though not all) chatbots may give different answers if asked the same specific question several times (such as “Where were you born?”). Thus, a chatbot is primarily useful for conversations on everyday topics, such as restaurant preferences, movie reviews, sport discussions, and so on, where the actual content of the conversation is perhaps less relevant than the interaction itself: a well-designed chatbot can provide stimulating interactions for a human user, but would be lost if the user requires a detailed, specific answer to questions such as “Is there any direct flight to London on Saturday morning?”.

Task-oriented agents, on the other hand, are ultimately intended to provide clear, relevant, and definitive answers to specific queries, a process that often involves a considerable amount of database queries (either offline or via the internet) and data processing that, with some generosity, can be referred to as cognitive processing.

This might be a good point to clear up some confusion surrounding the taxonomy
of CAs, where many authors refer to all such systems as chatbots. In our view, this is incorrect. As their name implies, chatbots do just that: chat, while task-oriented agents are normally used for more serious and complex tasks.

While these two categories form the basis for a classification of CAs, many other taxonomic aspects could be considered as well, for example, the input and output modalities used (e.g., text, speech, etc.), the applicability of the agent (general-purpose or domain-specific), whether or not the agent is able to self-learn, and so on [4]. One such aspect concerns the representation used, where one frequently sees a division into rule-based systems and systems based on artificial intelligence (AI), by which is commonly meant systems based on deep neural networks (DNNs) [6]. This, too, is an unfortunate classification in our view; first of all, the field of AI is much wider in scope than what the current focus on DNNs might suggest: for example, rule-based systems, as understood in this context, have played, and continue to play, a very important part in the field. Moreover, even though many so-called rule-based agents are indeed based on handcrafted rules, there is nothing preventing the use of machine learning methods (a subfield of AI), such as stochastic optimization methods [7] or reinforcement learning [8], in such systems. The name rule-based is itself a bit unfortunate, since a rule is a very generic concept that could, at least in principle, even be applied to the individual components of DNNs.

Based on these considerations, we will here suggest an alternative dichotomy, which will be used alongside the other classification given above. This alternative classification divides CAs (whether they are chatbots or task-oriented agents) into the two categories interpretable systems and black box systems.
We hasten to add that interpretable does not imply a lack of complexity: in principle, the behavior of an interpretable CA could be every bit as sophisticated as that of a black box (DNN) CA. The difference is that the former are based on transparent structures consisting of what one might call interpretable primitives, i.e., human-understandable components such as IF–THEN–ELSE rules, sorting functions, or AIML-based statements (see Section 12.4.1.1), which directly carry out high-level operations without the need to rely on the concerted action of a huge number of parameters as in a DNN. As discussed in Sections 12.5.3 and 12.8, this is a distinction that, at least for task-oriented agents, is likely to become increasingly important in the near future: CAs are starting to be applied in high-stakes decisions, where interpretability is a crucial aspect [9], not least in view of the recently proposed legislation (both in the EU and the USA) related to a user’s right to an explanation or motivation, for example, in cases where artificial systems are involved in bank credit decisions or clinical decision-making [10]. In such cases, the artificial system must be able to explain its reasoning or at least carry out its processing in a manner that can be understood by a human operator.


12.3. Input and Output Modalities

Textual processing, which is the focus of the sections below, is arguably the core functionality of CAs. However, natural human-to-human conversation also involves aspects such as speech, facial expressions, gaze, body posture, and gestures, all of which convey subtle but important nuances of interaction beyond the mere exchange of factual information. In other words, interactions between humans are multimodal. Aspects of multimodality are considered in the field of embodied conversational agents (ECAs; also known as interface agents) that involve an interface for natural interaction with the user. This interface is often in the form of an animated face (or body) shown on-screen, but can also be a physical realization in the form of a social robot [16–18]. Figure 12.1 gives some examples; the three left panels show agents with a virtual (on-screen) interface, whereas the rightmost panel shows the social robot iCAT. Here, we will focus on virtual interfaces rather than physical implementations such as robots.

12.3.1. Nonverbal Interaction

Humans tend to anthropomorphize systems that have a life-like shape, such as an animated face on-screen [19]. Thus, a user interacting with an ECA may experience better rapport with the agent than would be the case for an interaction with an agent lacking a visual representation, and the embodiment also increases (in many, though not all, cases) the user’s propensity to trust the agent, as found by Loveys et al. [20]. Some studies show that certain aspects of an ECA, such as presenting a smiling face or showing a sense of humor, improve the experience of interacting with the agent [21, 22], even though the user’s personality is also a relevant factor when assessing the quality of interaction. For example, a recent study by ter Stal et al. on ECAs in the context of eHealth showed that users tend to prefer ECAs that are similar to themselves regarding age and sex (gender) [23]. Other studies show similar results regarding personality traits such as extroversion and

Figure 12.1: Three examples of ECAs and one social robot (right). From left to right: Gabby [11, 12], SimSensei [13], Obadiah [14], and iCAT [15] (Royal Philips / Philips Company Archives). All images are reproduced with kind permission from the respective copyright holders.

introversion [24, 25]. Additional relevant cues for human–agent interaction include gaze [26, 27], body posture [28], and gestures [29, 30], as well as the interaction between these expressive modalities [28]. Another point to note is that human likeness is not a prerequisite for establishing rapport between a human user and an ECA [31]. In fact, to some degree, one may argue that the opposite holds: Mori’s uncanny valley hypothesis [32] states essentially that artificial systems that attempt to mimic human features in great detail, when not fully succeeding in doing so, tend to be perceived as eerie and repulsive. This hypothesis is supported by numerous studies in (humanoid) robotics, but also in the context of virtual agents [33]. What matters from the point of view of user–agent rapport seems instead to be that the appearance of an ECA should match the requirements of its task [34], and that the ECA should make use of several interaction modalities [31]. Conveying emotions is an important purpose of facial expressions. Human emotions are generally described in terms of a set of basic emotions. Ekman [35] defined six basic emotions, namely anger, disgust, fear, happiness, sadness, and surprise, a set that has later been expanded by others to cover additional basic emotions as well as linear combinations thereof [36]. Emotions can be mapped to facial expressions via the facial action coding system (FACS) [37, 38] that, in turn, relies on so-called action units (AUs), which are directly related to movements of facial muscles (as well as movements of the head and the eyes). For example, happiness can be modelled as the combination of AU6 (cheek raiser) and AU12 (lip corner puller). Emotion recognition can also be carried out using black box systems such as DNNs [39], trained end to end: these systems learn to predict the emotion from the image of a face, without the need to manually specify how intermediate steps should be carried out [40]. 
State-of-the-art facial emotion recognition systems typically achieve 80 to 95% accuracy on benchmark datasets [41]. Emotions are also conveyed in body language [42], albeit with greater cultural variation than in the case of facial expressions. ECAs equipped with cameras (especially depth cameras [43]) can be made to recognize body postures, gestures, and facial expressions (conveying emotional states), and to react accordingly. Gesture recognition, especially in the case of hand gestures, helps to achieve a more natural interaction between the user and the ECA [44]. Furthermore, multimodal systems can exploit the combination of audio and visual cues to recognize emotions [45, 46].
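The mapping from action units to emotions described above can be written as an interpretable lookup, in contrast to an end-to-end DNN. The sketch below is only illustrative: the happiness prototype (AU6 + AU12) is the one given in the text, whereas the sadness and surprise prototypes are commonly quoted combinations that are assumed here and are not taken from this chapter. The name `match_emotion` is likewise hypothetical.

```python
from typing import Iterable, Optional

# Prototype AU combinations. Only "happiness" comes from the text above;
# the other two entries are commonly quoted prototypes, assumed for illustration.
EMOTION_PROTOTYPES = {
    "happiness": frozenset({6, 12}),       # cheek raiser + lip corner puller
    "sadness": frozenset({1, 4, 15}),      # assumption, not from the chapter
    "surprise": frozenset({1, 2, 5, 26}),  # assumption, not from the chapter
}


def match_emotion(active_aus: Iterable[int]) -> Optional[str]:
    """Return the emotion whose prototype is fully contained in the active AUs.

    If several prototypes match, the one with the most AUs (the most
    specific one) is preferred. Returns None if no prototype matches.
    """
    active = set(active_aus)
    matches = [name for name, aus in EMOTION_PROTOTYPES.items() if aus <= active]
    if not matches:
        return None
    return max(matches, key=lambda name: len(EMOTION_PROTOTYPES[name]))


if __name__ == "__main__":
    print(match_emotion([6, 12]))  # AU6 + AU12, the combination named in the text
    print(match_emotion([12]))     # a lip corner puller alone matches no prototype
```

Because every decision can be traced to a named rule, a lookup of this kind falls on the interpretable side of the dichotomy introduced in Section 12.2, at the cost of ignoring AU intensities and timing.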

12.3.2. Character Animation

Methods for character animation can broadly be divided into three categories. In procedural animation, the poses of an animated body or face are parameterized, using underlying skeletal and muscular models (rigs) [47] coupled with
inverse kinematics. In procedural facial animation [48], every unit of sound (phoneme) is mapped to a corresponding facial expression (a viseme). Animations are then built in real time, either by concatenation or by blending, a process that allows full control over the animated system but is also computationally demanding and may struggle to generate life-like animations. In data-driven animation, the movement sequences are built by stitching together poses (or facial expressions) from a large database. Many different techniques have been defined, ranging from simple linear interpolations to methods that involve DNNs [49, 50]. Finally, motion capture-based animation and performance-based animation methods [51] map the body or facial movements of a human performer onto a virtual, animated character. This technique typically generates very realistic animations, but is laborious and costly. Moreover, the range of movements of the virtual character is limited by the range of expressions generated by the human performer [52].

Several frameworks have been developed for animating ECAs. An example is the SmartBody approach of Thiebaux et al. [53] that makes use of the behavior markup language (BML) [54] to generate animation sequences that synchronize facial motions and speech.

12.3.3. Speech Processing

Speech processing is another important aspect of an ECA or, more generally, of a spoken dialogue system (SDS).¹ CAs handle incoming information in a step involving natural language understanding (NLU; see also Figure 12.4). In cases where speech (rather than just text) is used as an input modality, NLU is preceded by automatic speech recognition (ASR). The ASR step may provide a more natural type of interaction from the user’s point of view, but it increases the complexity of the implementation considerably, partly because of the inherent difficulty in reliably recognizing speech and partly because spoken language tends to contain aspects that are not generally present in textual input (e.g., repetition of certain words, the use of interjections such as “uh,” “er,” and so on, as well as noise on various levels). The performance of ASR can be measured in different ways: a common measure is the word error rate (WER), defined as

$$\mathrm{WER} = \frac{S + I + D}{N}, \tag{12.1}$$

where N is the number of words in the reference sentence (the ground truth), and S, I, and D are, respectively, the number of (word) substitutions, insertions, and deletions required to go from the hypothesis generated by the ASR system to the reference sentence. It should be noted, however, that the WER does not always represent the true ASR performance accurately, as it does not capture the semantic content of an input sentence. For example, in some cases, several words can be removed from a sentence without changing its meaning whereas in other cases, a difference of a single word, e.g., omitting the word not, may completely change the meaning. Still, the WER is useful in relative performance assessments of different ASR methods.

¹ An SDS is essentially a CA that processes speech (rather than text) as its input and output.

With the advent of deep learning, the accuracy of ASR has increased considerably. State-of-the-art methods for ASR, where DNNs are used for both phoneme recognition and word decoding, generally outperform [55] earlier systems that were typically based on hidden Markov models (HMMs). For DNNs, WERs of 5% or less (down from around 10–15% in the early 2010s) have been reported [56, 57], implying performance on a par with, or even exceeding, human performance. However, it should be noted that these low WERs generally pertain to low-noise speech in benchmark datasets, such as the LibriSpeech dataset, which is based on audio books [58]. In practical applications with noisy speech signals, different speaker accents, and so on, ASR performance can be considerably worse [59].

Modern text-to-speech (TTS) synthesis, which also relies on deep learning to a great degree, generally reaches a high level of naturalness [60, 61], as measured (subjectively) using the five-step mean opinion score (MOS) scale [62]. The requirements on speech output vary between different systems. For many ECAs, speech output must not only transfer information in a clear manner but also convey emotional states that, for example, match the facial expression of the ECA. It has been demonstrated that the pitch of an ECA voice influences the agent’s perceived level of trustworthiness [63]. Similar results have been shown for ECAs that speak with a happy voice [64].
Finally, with the concept of augmented reality, or the still more ambitious concept of mixed reality, even more sophisticated levels of embodiment are made possible [65], where the embodied agent can be superimposed onto the real world for an immersive user experience. This approach has recently gained further traction [66] with the advent of wearable augmented reality devices such as Microsoft’s HoloLens and Google Glass.
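The word error rate of Equation (12.1), introduced earlier in this section, can be evaluated with a standard word-level edit-distance (Levenshtein) computation, since the minimum total number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference is exactly that distance. The following is a minimal sketch, assuming whitespace tokenization and case-sensitive word matching (simplifications that the text does not specify); the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + I + D) / N using word-level edit distance.

    N is the number of words in the reference. Tokenization is a plain
    whitespace split, which is an assumption made for this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words; curr is the row for i reference words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # one more edit: reference word has no counterpart
                curr[j - 1] + 1,                  # one more edit: hypothesis word has no counterpart
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("I did not go", "I did go"))
```

For the reference “I did not go” and the hypothesis “I did go”, a single word is missing from four, so the function returns 0.25. The meaning of the sentence is reversed even though the WER is modest, which illustrates the caveat that the WER does not capture semantic content. Note also that the value can exceed 1 when the hypothesis contains many spurious words.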

12.4. Dialogue Representation

A defining characteristic of any CA is the manner in which it handles dialogue. In this section, dialogue representation is reviewed for both chatbots and task-oriented agents, taking into consideration the fact that chatbots handle dialogue in a different and more direct manner than task-oriented agents.


12.4.1. Chatbots

Chatbots can be categorized into three main groups by the manner in which they generate their responses. Referring to the alternative dichotomy introduced in Section 12.2, the first two types, namely pattern-based chatbots and information-retrieval chatbots, would fall into the category of interpretable systems, whereas the third type, generative chatbots, are either of the interpretable kind or the black box variety, depending on the implementation used; those that make strong use of DNNs would clearly fall into the black box category.

12.4.1.1. Pattern-based chatbots

The very first CA, namely Weizenbaum’s ELIZA [67], belongs to a category that can be referred to as pattern-based chatbots.² This chatbot, released in 1966, was meant to emulate a psychotherapist who attempts to gently guide a patient along a path of introspective discussions, often by transforming and reflecting statements made by the user. ELIZA operates by matching the user’s input to a set of patterns and then applying cleverly designed rules to formulate its response. Thus, as a simple example, if a user says “I’m depressed,” ELIZA could use the template rule (I’m 1) → (I’m sorry to hear that you are 1), where 1 indicates a single word, to return “I’m sorry to hear that you are depressed.” In other cases, where no direct match can be found, ELIZA typically just urges the user to continue. For example, the input “I could not go to the party” can be followed by “That is interesting. Please continue” or “Tell me more.” ELIZA also ranks patterns, such that, when several options match the user’s input, it will produce a response based on the highest-ranked (least generic) pattern. Furthermore, it features a rudimentary short-term memory, allowing it to return to earlier parts of a conversation to some degree.

More modern pattern-based chatbots, some of which are discussed in Section 12.6, are often based on the Artificial Intelligence Markup Language (AIML) [68], an XML-like language developed specifically for defining template-matching rules for use in chatbots. Here, each pattern (user input) is associated with an output (referred to as a template). A simple example of an AIML specification is

<category>
  <pattern>I LIKE *</pattern>
  <template>I like <star/> as well</template>
</category>

² This type of chatbot is also commonly called rule-based. However, one might raise the same objections to that generic term as expressed in Section 12.2, and therefore the more descriptive term pattern-based will be used instead.


With this specification, if a user enters “I like tennis,” the chatbot would respond by saying “I like tennis as well.” AIML has a wide range of additional features, for example, those that allow it to store, and later retrieve, variables (e.g., the user’s name). A chatbot based on AIML can also select random responses from a set of candidates, so as to make the dialogue more life-like. Moreover, AIML has a procedure for making redirections from one pattern to another, thus greatly simplifying the development of chatbots. For example, with the specification

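<category>
  <pattern>WHAT IS YOUR NAME</pattern>
  <template>My name is Alice</template>
</category>
<category>
  <pattern>WHAT ARE YOU CALLED</pattern>
  <template><srai>WHAT IS YOUR NAME</srai></template>
</category>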

the agent would respond “My name is Alice” if asked “What is your name” or “What are you called.” As can be seen in the specification, the second pattern redirects to the first, using the so-called symbolic reduction in artificial intelligence (srai) tag.

12.4.1.2. Information-retrieval chatbots
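A minimal sketch of a retrieval chatbot of the kind treated in this section may help fix ideas before the details. It assumes a toy three-pair dialogue corpus, whitespace tokenization, and no stemming or lemmatization (all hypothetical simplifications): each stored sentence is turned into a TF-IDF vector, with the IDF computed as in Equation (12.2); the user input is matched to the stored sentence with the highest cosine similarity, as in Equation (12.3); and the stored reply to that sentence is returned. All function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (stored sentence, stored reply) pairs.
CORPUS = [
    ("do you like tennis", "yes i play every weekend"),
    ("what is your favorite movie", "i enjoy old science fiction films"),
    ("where is a good restaurant", "there is a nice italian place nearby"),
]


def tokenize(sentence):
    """Lower-case and split on whitespace (no stemming; an assumption)."""
    return sentence.lower().split()


def build_idf(sentences):
    """IDF = log(M / M_w): M sentences in total, M_w of them containing w."""
    m = len(sentences)
    doc_freq = Counter()
    for s in sentences:
        doc_freq.update(set(tokenize(s)))
    return {w: math.log(m / m_w) for w, m_w in doc_freq.items()}


def tfidf(sentence, idf):
    """Sparse TF-IDF vector; words outside the corpus vocabulary are ignored."""
    words = tokenize(sentence)
    counts = Counter(words)
    return {w: (c / len(words)) * idf[w] for w, c in counts.items() if w in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def respond(user_input, corpus=CORPUS):
    """Return the stored reply to the stored sentence most similar to the input."""
    sentences = [s for s, _ in corpus]
    idf = build_idf(sentences)
    query = tfidf(user_input, idf)
    scores = [cosine(query, tfidf(s, idf)) for s in sentences]
    best = max(range(len(corpus)), key=scores.__getitem__)
    return corpus[best][1]


if __name__ == "__main__":
    print(respond("I like tennis"))
```

In a realistic system the IDF vector and the corpus vectors would be computed once rather than on every call, and the corpus would contain far more than three exchanges.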

Chatbots of this category generate their responses by selecting a suitable sentence from a (very) large dialogue corpus, i.e., a database of stored conversations. Simplifying somewhat, the basic approach is as follows: The agent (1) receives a user input U; (2) finds the most similar sentence S (see below) in the associated dialogue corpus; and (3) responds with the sentence R that, in the dialogue corpus, was given in response to the sentence S.

Similarity between sentences can be defined in different ways. Typically, sentences are encoded in the form of numerical vectors, called sentence embeddings, such that numerical measures of similarity can be applied. A common approach is to use TF-IDF [69] (short for term frequency–inverse document frequency), wherein each sentence is represented by a vector whose length equals the number of words³ (N) available in the dictionary used. The TF vector is simply the frequency of occurrence of each word, normalized by the number of words in the sentence. Taken alone, this measure tends to give too much emphasis to common words such as “the,” “is,” and “we,” which do not really say very much about the content of the sentence.

³ Usually after stemming and lemmatization, i.e., converting words to their basic form so that, for example, cats is represented as cat, better as good, and so on.

The IDF for a given word w is typically defined as

$\mathrm{IDF} = \log \frac{M}{M_w},$ (12.2)

where M is the total number of sentences (or, more generally, documents) in the corpus and $M_w$ is the number of sentences that contain w. Thus, IDF will give high values to words that are infrequent in the dialogue corpus and therefore likely to carry more relevant information about the contents of a sentence than more frequent words. The IDF values can be placed in a vector of length N that, for a fixed dialogue corpus, can be computed once and for all. Next, given the input sentence U, a vector q(U) can be formed by first generating the corresponding TF vector and then performing a Hadamard (element-wise) product with the IDF vector. The vector q(U) can then be compared with the corresponding vector $q(\sigma_i)$ for all sentences $\sigma_i$ in the corpus, using the vector (dot) product, and then extracting the most similar sentence S by computing the maximum cosine similarity as

$S = \operatorname{argmax}_{i} \frac{q(U) \cdot q(\sigma_i)}{\|q(U)\| \, \|q(\sigma_i)\|}.$ (12.3)

The chatbot’s response is then given as the response R to S in the dialogue corpus. An alternative approach is to directly retrieve the response that best matches the user’s input [70].

Some limitations of the TF-IDF approach are that it does not take into account (1) the order in which the words appear in a sentence (which of course can have a large effect on the semantic content of the sentence) and (2) the fact that many words have synonyms. Consider, for instance, the two words mystery and conundrum. A sentence containing one of those words would have a nonzero term frequency for that word, and 0 for the other, so that the two words together give zero contribution to the cosine similarity in Equation (12.3). Such problems can be addressed using word embeddings, which are discussed below. Another limitation is that TF-IDF, in its basic form, does not consider the context of the conversation. Context handling can be included by, for example, considering earlier sentences in the conversation [71].

To overcome the limitations of TF-IDF, Yan et al.
proposed an information-retrieval chatbot where a DNN is used to consider context from previous exchanges, and rank a list of responses retrieved by TF-IDF from the least to most plausible [72]. Other works on information-retrieval chatbots via DNNs abandon the use of TF-IDF altogether, and focus on how to improve the modeling of contextual information processing [73–75].

An aspect that is common to different types of DNN-based chatbots (of any kind), and incidentally also task-oriented agents, is the use of word embeddings. Similar to sentence embeddings, an embedding $e_w$ of a word w is a vectorial representation of w, i.e., $e_w \in \mathbb{R}^d$. DNNs rely on word embeddings in order to represent and process words and can actually learn the embeddings during their training process [76–79]. Typically, this is done by incorporating an N × d real-valued embedding matrix within the structure of the DNN as a first layer, where N is the number of possible words in the vocabulary (or, for long words, word chunks) and d is the (predetermined) size of each word embedding. The embedding matrix is initialized at random. Whenever the network takes some words $w_i, w_j, w_k, \ldots$ as the input, only the corresponding rows of the embedding matrix, i.e., the embeddings $e_{w_i}, e_{w_j}, e_{w_k}, \ldots$, become active and pass information forward to the next layers. During the training process, the parameters of the DNN, which include the entries of the embedding matrix and hence those of each $e_{w_i}$, are updated. Ultimately, each word embedding will capture a particular meaning. For example, Mikolov et al. [78] famously showcased that arithmetic operations upon DNN-learned word embeddings can be meaningful:

$e_{\mathrm{King}} - e_{\mathrm{Man}} + e_{\mathrm{Woman}} \approx e_{\mathrm{Queen}}.$ (12.4)

Figure 12.2: Projection obtained with t-SNE [80] of almost forty 300-dimensional word embeddings from word2vec [81]. The projection mostly preserves the embedding proximity, as similar words are clustered together. A notable exception is suspected_purse_snatcher, which is in the center of the plot and far from robber and robbery_suspect, in the bottom right. Note the presence of misspelled words from natural text, such as telvision and televsion.
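As a minimal sketch of the two ideas above, the following Python snippet treats an embedding matrix as a lookup table (each word selects one row) and then applies the arithmetic of Equation (12.4), followed by a nearest-neighbor search based on cosine similarity. The four 2-dimensional vectors are hand-picked toy values rather than learned embeddings, and the names `vocab`, `E`, `embed`, `cosine`, and `nearest` are illustrative assumptions.

```python
import math

# Toy vocabulary with hand-picked 2-d vectors (assumption: real embeddings
# are learned during training and usually have hundreds of dimensions).
vocab = ["king", "queen", "man", "woman"]
E = [
    [1.0, 1.0],   # king:  royal, male
    [1.0, -1.0],  # queen: royal, female
    [0.0, 1.0],   # man:   not royal, male
    [0.0, -1.0],  # woman: not royal, female
]
index = {word: row for row, word in enumerate(vocab)}


def embed(word):
    """Embedding lookup: select the row of E that belongs to the word."""
    return E[index[word]]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(vector):
    """Vocabulary word whose embedding is most similar to the vector."""
    return max(vocab, key=lambda word: cosine(vector, embed(word)))


# e_king - e_man + e_woman, computed element by element.
analogy = [k - m + w for k, m, w in
           zip(embed("king"), embed("man"), embed("woman"))]
result = nearest(analogy)
```

With these toy values the arithmetic lands exactly on the vector assigned to "queen"; with learned embeddings the match is only approximate, which is why a nearest-neighbor search is used.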

As an additional example, Figure 12.2 shows a word cloud obtained by applying dimensionality reduction to a few dozen 300-dimensional word embeddings. As can be seen, embeddings of words with similar syntax or semantics are clustered together. We remark that there are other ways than training DNNs to build word embeddings, resulting in word embeddings that carry information of another nature. Some of these methods are focused, e.g., on statistical occurrences of words


within document classes [82, 83], or on word–word co-occurrences [84, 85]. Moreover, word embeddings can be used as numerical word representations in other systems, e.g., in evolutionary algorithms, to obtain human-interpretable mathematical expressions representing word manipulations [86].

12.4.1.3. Generative chatbots

While chatbots in the two previous categories rely on existing utterances, either in the form of pre-specified patterns or retrieved from a dialogue corpus, generative chatbots instead generate their responses using statistical models, called generative models or, specifically when one intends to model probability distributions over language (e.g., in the form of what words are likely to appear after or in between some other words), language models. Currently, this field is dominated by black box systems realized by DNNs trained on large amounts of data. An important neural network model that is used at the heart of several generative chatbots is the sequence to sequence model (seq2seq) [86]. Figure 12.3 shows a simplified representation of seq2seq. Given a sequence of general tokens (in this case, words), seq2seq returns a new sequence of tokens, typically by making use of recurrent neural networks. Generally speaking, these networks operate by taking as input, one by one, the tokens that constitute a given sequence; processing each token; and producing an output that, crucially, depends on both the token processing and the outputs produced so far. Depending on the task, the training process of these systems can be based on the feedback of interacting users (a feedback that can be as simple as just indicating whether answers are good or bad) or on reference output text, such as known translations, answers to questions, or phrases where some words are masked and need to be guessed.

[Figure 12.3 diagram: the encoder (RNNenc) is unrolled over the input tokens “Where is the library ?” with embeddings e1 to e5; the decoder (RNNdec), with embeddings e'1 to e'4, produces the output “At Hörsalsvägen 2”.]

Figure 12.3: Simplified representation of seq2seq, unfolded over time. Each token of the input sentence “Where is the library?” is transformed into an embedding and fed to the recurrent neural network called the encoder (RNNenc). This information is passed to the recurrent neural network responsible for providing the output sentence, called the decoder (RNNdec). Note that each output token is routed back in the decoder to predict the next output token. Special tags mark the start and end of the sentence.
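The control flow of Figure 12.3 can be sketched in a few lines of Python. This is only a structural illustration under strong assumptions: the weights are random and untrained (so the reply is meaningless), a plain tanh recurrence stands in for the more advanced recurrent cells used in practice, and the tag names `<s>` and `</s>` as well as all function names are hypothetical.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
VOCAB = ["<s>", "</s>", "where", "is", "the", "library", "?"]
D = 4  # embedding and hidden-state size (toy value)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


EMB = rand_matrix(len(VOCAB), D)                     # one embedding per token
W_ENC, U_ENC = rand_matrix(D, D), rand_matrix(D, D)  # encoder weights
W_DEC, U_DEC = rand_matrix(D, D), rand_matrix(D, D)  # decoder weights
W_OUT = rand_matrix(len(VOCAB), D)                   # hidden state -> scores


def rnn_step(w, u, x, h):
    """One recurrent step: new state from input x and previous state h."""
    return [math.tanh(a + b) for a, b in zip(matvec(w, x), matvec(u, h))]


def encode(tokens):
    h = [0.0] * D
    for tok in tokens:  # consume the input one token at a time
        h = rnn_step(W_ENC, U_ENC, EMB[VOCAB.index(tok)], h)
    return h


def decode(h, max_len=5):
    out, tok = [], "<s>"
    for _ in range(max_len):
        h = rnn_step(W_DEC, U_DEC, EMB[VOCAB.index(tok)], h)
        scores = matvec(W_OUT, h)
        tok = VOCAB[scores.index(max(scores))]  # greedy choice
        if tok == "</s>":
            break
        out.append(tok)  # the chosen token is fed back at the next step
    return out


reply = decode(encode(["where", "is", "the", "library", "?"]))
```

The point of the sketch is the data flow: the encoder compresses the input into a state, and the decoder feeds each of its own outputs back as the next input until it emits the end tag or reaches a length limit.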


One of the first uses of seq2seq to realize (part of the conversational capability of) chatbots can be attributed to Vinyals and Le [88], who showed that even though seq2seq was originally proposed for machine translation, it could be trained to converse when provided with conversation corpora (e.g., movie subtitles [89]). Today, seq2seq is a core component in several state-of-the-art generative chatbots: MILABOT [90] from the Montreal Institute for Learning Algorithms, Google’s Meena [91], and Microsoft’s XiaoIce⁴ [92].

Another recent and popular component of DNNs useful for generating language is the transformer [93, 94]. Like recurrent neural networks, the transformer is designed to process sequential data, but it is different in that it does not need to parse the sequence of tokens in a particular order. Rather, the transformer leverages, at multiple stages, an information retrieval-inspired mechanism known as self-attention. This mechanism makes it possible to consider context from any other position in the sentence.⁵ Broadly speaking, self-attention is capable of scoring, for each token (or better, for each embedding), how much contextual focus should be put on the other tokens (embeddings), no matter their relative position. For example, in the phrase “Maria left her phone at home,” the token “her” will have a large attention score for “Maria,” while “home” will have large attention scores for “left” and “at.” In practice, self-attention is realized by matrix operations, and the weights of the matrices are optimized during the training process of the DNN, just like any other component.

Transformers owe much of their popularity in the context of natural language processing to BERT (Bidirectional Encoder Representations from Transformers), proposed by Devlin et al. [79]. At the time of its introduction, BERT outperformed other contenders on popular natural language processing benchmark tasks of different nature [97], at times by a considerable margin.
Before tuning the DNN that constitutes BERT to the task at hand (e.g., language translation, question answering, or sentiment analysis), pre-training over large text corpora is carried out to make the system learn what words are likely to appear next to others. Specifically, this is done by (1) collecting large corpora of text; (2) automatically masking out some tokens (here, words); and (3) optimizing the model to infer back the masked tokens. As mentioned above, when trained enough, the self-attention mechanism is capable of gathering information on the most important context for the missing tokens. For example, when the blanks need to be filled in “Maria ___ ___ phone at home,” a well-tuned self-attention will point to the preceding token “Maria” to help infer “her,” and point to “at” and “home” to infer “left.” In particular, when trained to guess the token at the end of the sequence, transformer-based DNNs are proficient at generating language. At inference time, they can take the user’s utterance as the initial input.

⁴ Note that XiaoIce is a large system that includes many other functionalities than chatting.
⁵ The concept of (self-)attention is not a transformer-exclusive mechanism; in fact, it was first proposed to improve recurrent neural networks [95, 96].

Today’s state-of-the-art DNNs for language modeling use a mix of (advanced) recurrent neural networks and transformers. Notable examples include ALBERT [98], XLNet [99], and the models by OpenAI, GPT-2 and GPT-3 [100, 101]. GPT-3 in particular contains 175 billion parameters, was trained on hundreds of billions of words, and is capable of chatting quite well even in so-called few- or zero-shot learning settings, i.e., when a very limited number of examples or no examples at all, respectively, are used to tune the model to chatting. For the readers interested in delving into the details of DNNs for language modeling, we refer to the recent surveys by Young et al. [102] and Otter et al. [103].
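To make the remark that self-attention "is realized by matrix operations" more concrete, the sketch below implements a single head of scaled dot-product self-attention, the form commonly used in transformers, in plain Python. It is a simplification under stated assumptions: the projection matrices are fixed identity matrices rather than learned weights, there is one head and no masking, and the names `matmul`, `softmax`, and `self_attention` are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """x holds one embedding per token (n x d). Returns (output, weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # one row per token
    return matmul(weights, v), weights


# Three toy 2-d token embeddings; identity projections are an assumption
# standing in for matrices that would normally be learned.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

Row i of `weights` states how much token i attends to every token in the sequence, regardless of position; here the first token attends more to itself and to the third token than to the second, because their embeddings overlap.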

12.4.2. Task-Oriented Agents

As mentioned in Section 12.2, the role of task-oriented agents is to engage in fact-based conversations on specific topics, and they must thus be able to give precise, meaningful, and consistent answers to the user’s questions. Here, we will review such agents, following the alternative dichotomy from Section 12.2, where agents are classified as either interpretable systems or black box systems. The use of this dichotomy is particularly important here, since task-oriented agents are responsible for reaching concrete goals and may be deployed in sensitive settings (for example, medical or financial applications; see also Section 12.7) where transparency and accountability are essential.

Figure 12.4 shows the so-called pipeline model, in which the various stages of processing are shown as separate boxes. In this model, the first and last steps, namely ASR and speech synthesis, can be considered to be peripheral, as many agents operate in a completely text-based fashion. Turning to the core elements, shown in green in Figure 12.4, one can view the processing as taking place in three stages: First, in a step involving natural language understanding (NLU) the CA must identify precisely what the user said, or rather what the user meant; hence the notion of intent detection. Next, in a step that one may refer to as cognitive processing [104], the agent executes its dialogue policy, i.e., undertakes the actions needed to construct the knowledge required for the response, making use of a knowledge base in the form of an offline database, or via the internet or cloud-based services. A key factor here is to keep track of context, also known as dialogue state. Finally, the CA formulates the answer as a sentence in the process of natural language generation (NLG).

Figure 12.4: The pipeline model for CAs. The first and last steps (ASR and speech synthesis; see also Section 12.3) can, to some degree, be considered as external from the core components, shown in green. It should be noted that there are different versions of the pipeline model, with slightly different names for the various components, but essentially following the structure shown here.

While the pipeline model offers a clear, schematic description of a task-oriented agent, it should be noted that not all such agents fall neatly into this model. Moreover, in the currently popular black box systems, the distinction between the stages tends to become blurred, as research efforts push toward requiring less and less human specification of how the processing should take place; see also Subsection 12.4.2.2.

12.4.2.1. Interpretable dialogue systems

The simplest examples of interpretable, task-oriented agents are the so-called finite-state systems [105], where the dialogue follows a rigid, predefined structure, such that the user is typically required to provide information in a given sequence. For example, if the CA is dealing with a hotel reservation, it might first ask the user “Which city are you going to?” then “Which day will you arrive?” and so on, requiring an answer to each question in sequence. In such a system the dialogue initiative rests entirely with the CA, and the user has no alternative but to answer the questions in the order specified by the agent. In addition, finite-state systems can be complemented with the capability of handling universal commands (such as “Start over”), which the user might give at any point in the conversation. Even with such features included, however, CAs based on the finite-state approach are generally too rigid for any but the most basic kinds of dialogues; for example, in the case of a natural conversation regarding a reservation (flight, hotel, restaurant, etc.), the user may provide the required information in different orders, sometimes even conveying several parts of the request in a single sentence (e.g., “I want to book a flight to Paris on the 3rd of December”).

Such aspects of dialogue are handled, to a certain degree, in frame-based (or slot-filling) systems, such as GUS [106], where the dialogue is handled in a more flexible manner that allows mixed initiative, i.e., a situation where the control over the dialogue moves back and forth between the agent and the user. Here, the agent maintains a so-called frame containing several slots that it attempts to fill by interacting with the user, who may provide the information in any order (sometimes also filling more than one slot with a single sentence). Figure 12.5 shows (some of the) slots contained in a frame used in a flight-reservation system.

Slot              Question
DEPARTURE CITY    Where are you traveling from?
ARRIVAL CITY      Where are you going to?
DEPARTURE DATE    What day would you like to travel?
DEPARTURE TIME    What time would you like to leave?

Figure 12.5: An example of a frame for a flight-reservation system. The frame contains a set of slots, each with an associated question. Note that only some of the required slots are shown in this figure.

In a single-domain frame-based system, the agent’s first step is to determine the intent of the user. However, in the more common multi-domain systems, this step is preceded by a domain determination step where the agent identifies the overall topic of discourse [105]. Intent detection can be implemented in the form of explicit, hand-written rules, or with the use of more sophisticated semantic grammars that define a set of rules (involving phrases) that form a compact representation of many different variations of a sentence, allowing the CA to understand different user inputs with similar meaning. As a simple (and partial) example, consider a CA that handles car rentals. Here, a suitable semantic grammar may contain the rules shown in Figure 12.6, so that the sentences “Is there any car available for less than 100 dollars per day?” and “Do you have a car priced below 300 a week?” would both be identified as instances of QUERY–VEHICLE–PRICELIMIT–PRICE–TIMERANGE. Semantic grammars can be built in a hierarchical fashion (see, for example, how NUMBER is part of PRICE in the example just described), and can be tailored to represent much more complex features than in this simple example [107].

Rule
QUERY → {is there, do you have}
VEHICLE → {a car (available), any car (available)}
PRICELIMIT → {for less than, under, (priced) below}
PRICE → {NUMBER (dollars)}
NUMBER → {50, 100, 150, 200, 300, 500}
TIMERANGE → {a day, per day, a week, per week}

Figure 12.6: An example of a simple semantic grammar for (some parts of) a car rental dialogue.

The output of frame-based systems (i.e., the NLG part in Figure 12.4) is typically defined using a set of predefined templates. Continuing with the car rental example, such a template could take the form “Yes, we have a ___ available at ___ dollars per day,” where the two placeholders would then be replaced by the appropriate values for the situation at hand.
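The semantic grammar of Figure 12.6 and the template-based output can be sketched with regular expressions, as below. This is one possible encoding rather than the method of any particular system: each rule becomes a regular-expression fragment, optional words in parentheses become optional groups, and the placeholder names `vehicle` and `price` in the template, like the helper names `parse` and `respond`, are assumptions made for illustration.

```python
import re

# One regular-expression fragment per rule of Figure 12.6.
NUMBER = r"(?P<number>50|100|150|200|300|500)"
GRAMMAR = {
    "QUERY": r"(?:is there|do you have)",
    "VEHICLE": r"(?:a car|any car)(?: available)?",
    "PRICELIMIT": r"(?:for less than|under|(?:priced )?below)",
    "PRICE": NUMBER + r"(?: dollars)?",  # NUMBER is nested inside PRICE
    "TIMERANGE": r"(?P<period>a day|per day|a week|per week)",
}
ORDER = ["QUERY", "VEHICLE", "PRICELIMIT", "PRICE", "TIMERANGE"]
PATTERN = re.compile(
    r"\s+".join(GRAMMAR[name] for name in ORDER) + r"\s*\?*\s*$"
)


def parse(sentence):
    """Return the extracted price limit and time range, or None."""
    match = PATTERN.match(sentence.lower().strip())
    if match is None:
        return None
    return {"price": int(match.group("number")),
            "period": match.group("period")}


def respond(vehicle, price):
    """Template-based NLG: fill the two placeholders of the template."""
    template = "Yes, we have a {vehicle} available at {price} dollars per day."
    return template.format(vehicle=vehicle, price=price)


first = parse("Is there any car available for less than 100 dollars per day?")
second = parse("Do you have a car priced below 300 a week?")
unknown = parse("What is the weather like?")
answer = respond("car", 80)
```

Both example sentences from the text match the same rule sequence and differ only in the extracted values, while a sentence outside the grammar yields `None`, which a real agent would have to handle with a fallback.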


The frame-based approach works well in situations where the user is expected to provide several pieces of information on a given topic. However, many dialogues involve other aspects such as, for example, frequently switching back and forth between different domains (contexts). Natural dialogue requires a more advanced and capable representation than that offered by frame-based systems and typically involves identification of dialogue acts as well as dialogue state tracking. Dialogue acts [108] provide a form of categorization of an utterance, and represent an important first step in understanding the utterance [109]. For example, an utterance of the form “Show me all available flights to London” is an instance of the COMMAND dialogue act, whereas the possible response “OK, which day do you want to travel?” is an instance of CLARIFY, and so on. Dialogue act recognition has been approached in many different ways [110], reaching an accuracy of around 80–90%, depending on the dataset used. Dialogue state tracking, which is essentially the ability of the CA to keep track of context, has also been implemented in many ways, for example using a set of state variables (as in MIT’s JUPITER system [111]) or the more general concept of information state that underlies the TrindiKit and other dialogue systems [112]. Dialogue state tracking has also been the subject of a series of competitions; see the work by Williams et al. [113] for a review.

For the remainder of this section, the systems under consideration reach a level of complexity such that their interpretability is arguably diminished, yet they still make use of high-level, interpretable primitives and are most definitely far from being complete black boxes. Thus, going beyond finite-state and frame-based systems, several methods have been developed that incorporate more sophisticated models of the user’s beliefs and goals. These methods can generally be called model-based methods [114] (even though this broad category of methods also goes under other names, and involves somewhat confusing overlaps between the different approaches). Specifically, in plan-based methods [115, 116], such as Ravenclaw [117], which are also known as belief-desire-intention (BDI) models, the CA explicitly models the goals of the conversation and attempts to guide the user toward those goals. Agent-based methods [118] constitute another specific example. They involve aspects of both the plan-based and information state approaches, and extend those notions to model dialogue as a form of cooperative problem-solving between agents (the user and the CA), where each agent is capable of reasoning about its own beliefs and goals but also those of the other agent. Thus, for example, this approach makes it possible for the CA not only to respond to a question literally, but also to provide additional information in anticipation of the user’s goals [119].

Even though there is no fundamental obstacle to using a data-driven, machine learning approach to systems of the kind described above, these methods are often referred to as handcrafted as this is the manner in which they have most often


been built. Moreover, these methods generally maintain a single, deterministic description of the current state of the dialogue. There are also methods that model dialogue in a probabilistic, statistical way and which can be seen as extensions of the information-state approach mentioned above. The first steps toward such systems model dialogue as a Markov decision process (MDP) [120], in which there is a set of states (S), a set of actions (A), and a matrix of transition probabilities (P) that models the probability of moving from one given state to another when taking a specific action a ∈ A. Any action a resulting in state s is associated with an immediate reward r(s, a). There is also a dialogue policy π , specifying the action that the agent should take in any given situation. The policy is optimized using reinforcement learning so as to maximize the expected reward. Levin et al. [121] provide an accessible introduction to this approach. The MDP approach, which assumes that the state s is observable, is unable to deal with the inevitable uncertainty of dialogue, whereby the user’s goal, the user’s last dialogue act, and the dialogue history, which together define the dialogue state, cannot be known with absolute certainty [122]. Therefore, systems that take into account the inherent uncertainty in conversation (including ASR uncertainty) have been proposed in the general framework of partially observable Markov decision processes (POMDPs). Rather than representing a single state, a POMDP defines a distribution over all possible dialogue states, referred to as the belief state. A difficult problem with MDPs and, to an even greater degree, POMDPs, is the fact that their representation of dialogue states is very complex. This, in turn, results in huge state spaces and therefore computational intractability. Much work on POMDPs, therefore, involves various ways of coping with such problems [123]. Young et al. provide a review of POMDPs for dialogue [124].
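The notion of a belief state can be illustrated with the standard POMDP belief update, in which the new belief in a state is proportional to the probability of the observation in that state times the probability of having reached it. The sketch below is a toy example under stated assumptions: the hidden state is only the user's destination, the goal does not change between turns, the observation model depends on the state alone, and all numbers and names (`update_belief`, `TRANSITION`, `OBSERVATION`) are made up for illustration.

```python
def update_belief(belief, action, observation, transition, observation_model):
    """Belief update: b'(s') is proportional to
    O(o | s') * sum over s of P(s' | s, a) * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        predicted = sum(transition[action][s][s_next] * belief[s]
                        for s in states)
        unnormalized[s_next] = observation_model[s_next][observation] * predicted
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


# Hidden state: the user's destination. Asking does not change the goal.
TRANSITION = {
    "ask_city": {
        "paris": {"paris": 1.0, "london": 0.0},
        "london": {"paris": 0.0, "london": 1.0},
    }
}
# Noisy speech recognition: probability of hearing each word per true goal.
OBSERVATION = {
    "paris": {"heard_paris": 0.8, "heard_london": 0.2},
    "london": {"heard_paris": 0.3, "heard_london": 0.7},
}

prior = {"paris": 0.5, "london": 0.5}
posterior = update_belief(prior, "ask_city", "heard_paris",
                          TRANSITION, OBSERVATION)
```

After hearing "paris" once, the agent still keeps some probability on "london" instead of committing to a single state, which is the difference from the MDP setting. The sketch also hints at the tractability problem mentioned above, since the update loops over all pairs of states.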

12.4.2.2. Black box dialogue systems

Black box task-oriented agents are essentially dominated by the field of DNNs. For these CAs, many of the aspects that were described before for DNN-based chatbots still apply. For example, DNN-based task-oriented agents also represent words with embeddings, and rely on DNN architectures such as recurrent or convolutional ones [125], as well as transformer-based ones [126, 127]. There are other types of DNNs used for task-oriented systems, such as memory networks [128, 129] and graph convolutional networks [130, 131]. However, delving into their explanation is beyond the scope of this chapter. Normally, the type of DNN used depends on what stage of the agent the DNN is responsible for. In fact, black box systems can often be framed in terms of the pipeline model (Figure 12.4), where NLU, NLG, dialogue policy, etc., are modeled somewhat separately. Even so, the DNNs responsible for realizing a black box


task-oriented agent are connected to, and dependent on, one another, so as to allow the flow of information that makes end-to-end information processing possible. An emerging trend in the field, as we mentioned before, is to make the agent less and less dependent on human design, and instead increase the responsibility of the DNNs to automatically learn how to solve the task from examples and interaction feedback. Wen et al. [125] proposed one of the first DNN-based task-oriented agents trained end-to-end, namely a CA for restaurant information retrieval. To collect the specific data needed for training the system, the authors used a Wizard-of-Oz approach [132], scaled to a crowd-sourcing setting, to gather a large number of examples. In particular, one group of Amazon Mechanical Turk [133] workers was instructed to act as normal users asking for information about restaurants and food. The other group, i.e., the wizards of Oz, was asked to play the role of a perfect agent by replying to all inquiries using a table of information that was provided beforehand. Alternatively, since training from examples requires large corpora to be collected beforehand, reinforcement learning can be used [134]. Some works attempt to improve the way these black box systems interface with the knowledge base. Typically, the DNN that is responsible for the dialogue policy chooses the information retrieval query deemed to be most appropriate. Dhingra et al. [135] have shown, however, that one can include mechanisms to change the lookup operation so as to use soft queries that provide multiple degrees of truth, ultimately making it possible to obtain a richer signal that is useful to improve training. Madotto et al. [136] instead looked at improving the scalability of the systems to query large knowledge bases, by essentially having the DNNs assimilate information from the knowledge base within the network parameters at training time.
This improves information retrieval speed because querying the external knowledge base normally takes longer than inferring the same information from within a DNN. However, this also means that some degree of re-training is needed when the information in the knowledge base is changed. There are many more works on DNN-based task-oriented agents that deal with different aspects for improvement. For instance, a problem is that a DNN responsible for NLG can learn to rely on specific information retrieved from the knowledge base in order to generate meaningful responses, to such an extent that changes to the knowledge base can lead to unstable NLG. This problem is being tackled in ongoing research aimed at decoupling NLG from the knowledge base [137, 138]. Another interesting aspect concerns incorporating training data from multiple domains, for situations where the data that are specific to the task at hand are scarce [139]. Finally, we refer to the works by Bordes et al. [129], Budzianowski et al. [140], and Rajendran et al. [141] as examples where datasets useful for training and benchmarking this type of CA are presented.


Figure 12.7: Schematic view of the different levels at which CAs can be evaluated, mirroring the organization of Section 12.5.

12.5. Evaluation of Conversational Agents

Evaluation of CAs is very much an open topic of research. Current systems are still far from achieving seamless, human-like conversation capabilities [142, 143]. Understanding how best to measure such capabilities represents an important endeavor to advance the state of the art. There are many different evaluation criteria, aimed at different levels of the human-agent interaction. First, this section gives a brief introduction to low-level metrics used to evaluate general language processing. Next, moving on to a higher level of abstraction, evaluation metrics that address the quality of interaction with CAs, in a more general setting, and in more detail for the case of ECAs, are considered. Last but not least, evaluation systems that delve into broad implications and ethics are described. Figure 12.7 summarizes these different aspects of evaluation.

12.5.1. Low-level Language Processing Evaluation Metrics

On a low level, when developing the capabilities of a CA to process utterances, retrieve information, and generate language, traditional metrics for general pattern recognition like precision and recall represent useful metrics for evaluation. For example, for a CA employed in a psychiatric facility, one may wish the agent (or a component thereof) to assess whether signs of depression transpire from the interaction with the patient: In such a case, maximizing recall is important. In practice, traditional metrics such as precision and recall are employed across several types of benchmark problems, e.g., to evaluate sentiment analysis [144], paraphrase identification [145], question answering [146], reading comprehension [49], and also gender bias assessment [147]. The general language understanding evaluation (GLUE) benchmark [97], and its successor SuperGLUE [148], provide a carefully picked collection of datasets and tools for evaluating natural language processing systems, and maintain a leaderboard of top-scoring systems.
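As a brief illustration of the two metrics, the sketch below computes precision and recall for a binary detector. The labels are invented toy data loosely modeled on the depression-screening example (1 means that signs are present), and the function name `precision_recall` is an assumption.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: three true cases among eight conversations; the detector
# flags five conversations, including all three true cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

In this toy case no true case is missed (recall 1.0) at the price of two false alarms (precision 0.6), which is the kind of trade-off that may be acceptable when missing a case is costly.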


Alongside traditional scoring metrics, the field of natural language processing has produced additional metrics that are useful to evaluate CAs in that they focus on words and n-grams (sequences of words) and are thus more directly related to language. A relevant example of this is the bilingual evaluation understudy (BLEU) scoring metric. Albeit originally developed to evaluate the quality of machine translation [149], BLEU has since been adopted in many other tasks, several of which are of relevance for CAs, e.g., summarization and question answering. For the sake of brevity, we will only describe how BLEU works at a high level. In short, BLEU can be seen as related to precision and is computed by considering how many n-grams in a candidate sentence (e.g., produced by a CA) match n-grams in reference sentences (e.g., high-quality sentences produced by humans). The n-grams considered are normally of length 1 to 4. The precisions obtained for the different n-gram lengths are combined through a geometric mean, with uniform weights in the original formulation. In addition, BLEU includes a penalty term to penalize candidate sentences that are shorter than reference sentences. We refer the reader interested in the details of BLEU to the seminal work by Papineni et al. [149]. A metric similar to BLEU, but with the difference that it focuses on recall rather than precision, is recall-oriented understudy for gisting evaluation (ROUGE), which is reviewed in the seminal works by Lin et al. [150, 151].
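A simplified, sentence-level version of BLEU can be sketched as follows. This is not a reference implementation: it assumes pre-tokenized input, applies no smoothing (so any n-gram length without a match gives a score of 0), combines the n-gram precisions with a uniformly weighted geometric mean, and uses the reference length closest to the candidate length for the brevity penalty. The names `ngrams` and `simple_bleu` are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU without smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Highest count of each n-gram in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(count, max_ref[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty, based on the reference length closest to the candidate.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    penalty = 1.0 if c > r else math.exp(1 - r / c)
    return penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
perfect = simple_bleu(reference, [reference])
short = simple_bleu("the cat".split(), [reference], max_n=2)
unrelated = simple_bleu("a dog barks loudly".split(), [reference])
```

A candidate identical to the reference scores 1.0, whereas the two-word candidate has perfect n-gram precision but is penalized for brevity, so its score drops to exp(1 - 6/2).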

12.5.2. Evaluating the Quality of the Interaction

The evaluation of the quality of interaction requires human judgment. In this context, researchers attempt to find the aspects that define a good interaction with an agent in order to obtain a better understanding of how CAs can be built successfully. This section is divided into one general part on conversational capabilities and one part on embodiment.

12.5.2.1. General CAs

The work by Adiwardana et al. of Google Brain proposes to categorize human evaluation into sensibleness and specificity [91]. Evaluating sensibleness means to assess whether the agent is able to give responses that make sense in context. Evaluating specificity, instead, involves determining whether the agent is effectively capable of answering with specific information, as opposed to providing vague and empty responses, a strategy often taken to avoid making mistakes. Interestingly, the authors found that the average of sensibleness and specificity reported by users correlates well with perplexity, a low-level metric for language models that represents the degree of uncertainty in token prediction. The Alexa Prize competition represents an important stage for the evaluation of CAs [152]. To rank the contestants, Venkatesh et al. defined several aspects, some of which require human assessment, and some of which are automatic [153].

For example, one aspect is conversational user experience, which is collected as the overall rating (on a scale from 1 to 5) from users that were given the task of interacting with the CAs during the competition. Another aspect, coherence, was set to be the number of sensible responses over the total number of responses, with humans annotating sensible answers. Other aspects were evaluated with simple metrics, e.g., engagement as the number of utterances, and topical breadth and topical depth as the number of topics touched in the conversation (identified with a vocabulary of keywords) and the number of consecutive utterances on a certain topic, respectively. Venkatesh et al. also attempted to find a correlation between human judgment and automatic metrics, but only small correlations were found.

The recent survey by Chaves and Gerosa [154] reports on further research endeavors to elucidate the aspects that account for the perception of successful and engaging interactions with CAs. These aspects include, for example, conscientiousness: how aware of and attentive to the context the CA appears to be; communicability: how transparent the CA’s interaction capabilities are, e.g., for the user to know how to best query the agent [155]; damage control: how capable the CA is to recover from failure, handle unknown concepts, etc. [156]. Some apparent qualities can actually be perceived to be negative, when excessive. CAs that take too much initiative constitute a clear example, as they can be perceived as intrusive or frustrating, or even give the feeling of attempting to exercise control over the user [157, 158].
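
The simple automatic aspects described above for the Alexa Prize (coherence, engagement, topical breadth, and topical depth) can be sketched in a few lines of Python. The keyword vocabulary, the toy conversation, and the choice of taking topical depth as the longest run of consecutive utterances on one topic are all assumptions made for illustration; they are not the definitions used in [153].

```python
from itertools import groupby

# Hypothetical keyword vocabulary used to identify topics.
TOPIC_KEYWORDS = {
    "sports": {"football", "match", "goal"},
    "music": {"song", "band", "concert"},
}


def detect_topic(utterance, topic_keywords=TOPIC_KEYWORDS):
    """Return the first topic whose keywords occur in the utterance, else None."""
    words = set(utterance.lower().split())
    for topic, keywords in topic_keywords.items():
        if words & keywords:
            return topic
    return None


def conversation_metrics(utterances, sensible_flags):
    """Compute the simple per-conversation aspects.

    utterances: all utterances of the conversation, in order.
    sensible_flags: one human annotation (True/False) per CA response.
    """
    topics = [detect_topic(u) for u in utterances]
    # Lengths of runs of consecutive utterances that share a topic.
    runs = [len(list(g)) for topic, g in groupby(topics) if topic is not None]
    return {
        "coherence": sum(sensible_flags) / len(sensible_flags),
        "engagement": len(utterances),
        "topical_breadth": len({t for t in topics if t is not None}),
        "topical_depth": max(runs, default=0),
    }


utterances = [
    "did you see the football match",
    "yes what a goal",
    "I prefer a good concert",
    "which band do you like",
]
sensible_flags = [True, False]  # annotations for the two CA responses
metrics = conversation_metrics(utterances, sensible_flags)
```
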
Aspects of the perception of the interaction are collected using standardized questionnaires (involving, for example, the Likert scale [159]), which can be relatively general and can be applied beyond CAs, such as the system usability scale (SUS) [160], or specific to the human–CA interaction, including particular focuses such as evaluation of speech recognition, as done by the subjective assessment of speech system interfaces (SASSI) [161]. A notable organizational framework of user feedback that applies well to task-oriented CAs is the paradigm for dialogue system evaluation (PARADISE) [162]. PARADISE links the overall human-provided rating about the quality of interaction to the agent’s likelihood of success in completing a task and the cost of carrying out the interaction. Task success is based on predeciding what the most important building blocks of the interaction are, in the form of attribute–value pairs (e.g., an attribute could be departing station, and the respective values could be Amsterdam, Berlin, Paris, etc.). With this organization of the information, confusion matrices can be created, useful for determining whether the CA returns the correct values for interactions regarding certain attributes. The cost of the interactions, i.e., how much effort the user needs to invest to reach the goal, consists of counting how many utterances are required in total, and how many are needed to correct a misunderstanding. Finally, a scoring metric is obtained by a linear regression of the perceived satisfaction as a function of task success statistics and interaction costs.
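
The last two steps of PARADISE described above can be illustrated with a small Python sketch: summarizing a confusion matrix over attribute values, and regressing user satisfaction on task success and interaction costs. Several simplifications are assumed here: Cohen's kappa is used as one possible summary of the confusion matrix, the predictors are not normalized, the regression is solved with a hand-written least-squares routine, and all numbers are invented toy data. The exact formulation is given in [162].

```python
def kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: returned)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_agree = sum(confusion[i][i] for i in range(size)) / total
    p_chance = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(size)
    )
    return (p_agree - p_chance) / (1 - p_chance)


def fit_linear(X, y):
    """Ordinary least squares via the normal equations.

    Returns [intercept, w1, w2, ...]; assumes the predictors are not collinear.
    """
    rows = [[1.0, *x] for x in X]
    k = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y).
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


# Toy confusion matrix for one attribute with two possible values.
task_success = kappa([[20, 5], [10, 15]])

# Toy dialogues: (task success, total utterances, correction utterances).
X = [(0.9, 10, 0), (0.5, 14, 2), (0.7, 8, 1), (0.3, 20, 4), (0.8, 12, 1)]
satisfaction = [4.7, 2.7, 3.9, 1.2, 4.0]  # invented user ratings
weights = fit_linear(X, satisfaction)
```

With these toy data, the fitted weight for task success is positive and the weights for the two cost measures are negative, matching the intuition that satisfaction grows with success and shrinks with effort.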

Recently, in an effort to provide consistent evaluations and to automate human involvement, Sedoc et al. proposed ChatEval [163], a web application aimed at standardizing comparisons across CAs. To use ChatEval, one uploads (the description and) the responses of a CA to a provided set of prompts. As soon as they are uploaded, the responses are evaluated automatically with low-level metrics such as average response length, BLEU score, and cosine similarity over sentence embeddings between the agent’s response and the ground-truth response [164]. Optionally, by paying a fee one can make ChatEval automatically start a human-annotation task on Amazon Mechanical Turk, where users evaluate the responses of the agent in a standardized way.

12.5.2.2. Evaluation of embodiment

When embodiment comes into play, alongside aspects that are general to the interaction with CAs such as conversational capabilities in both chat and task-oriented settings, or handling of intentional malevolent probing, ECAs can be evaluated in terms of visual look and appearance personalization options [165]. Another aspect of evaluation considers whether the (E)CA exhibits (interesting) personality traits. ECAs are arguably at an advantage in this context compared to non-embodied CAs, because they can enrich their communication with body language and facial expressions. Since ECAs can leverage nonverbal interaction, they can engage the user more deeply at an emotional level. In this setting, emotional ECAs can be evaluated by comparison with non-emotional ones [166], in terms of aspects such as likeability, warmth, enjoyment, trust, naturalness of emotional reactions, believability, and so on.

Evaluation of embodiment can also be carried out in terms of opposition with the absence of embodiment, or in terms of physical embodiment vs. virtual embodiment, the latter type of comparison being more popular. In one such study [167], it was found that people evaluate (at different levels) the interaction with a physical ECA (specifically, Sony’s Aibo [168]) and a prototype robot (by Samsung, called April) more positively than the interaction with a virtual version of the same ECA. A similar study has been carried out using a physical and virtual ECA to interact during chess games [169]. At the same time, other studies have instead recorded that physical ECAs are no better than virtual ECAs. For example, it was found that whether a task-oriented ECA is physical or virtual results in no statistically significant differences in two very different tasks: persuading the user to consume healthier food and helping to solve the Towers of Hanoi puzzle [170].
Last but not least, as mentioned before in Section 12.3, an excess of embodiment (particularly so when embodiment attempts to be too human-like) can actually worsen the quality of the perceived interaction. Aside from the uncanny valley hypothesis [32, 171], another valid reason
for this is that more sophisticated embodiments can lead to bigger expectations on conversational capabilities that leave the user more disappointed when those expectations are not met [172]. Much uncertainty about the benefits of using physical or virtual ECAs is still present, since both the ECAs involved and the methods used for their evaluations vary widely in the literature, from adopting questionnaires on different aspects of the interaction to measuring how long it takes to complete a puzzle. For further reading on the evaluation of ECAs, we refer to Granström and House [173] and Weiss et al. [174].

12.5.3. Evaluating Societal Implications

With CAs becoming ever more advanced and widespread, it is becoming increasingly clear that their societal impact must be evaluated carefully before they are deployed. In fact, failures of CAs from a societal and ethical standpoint have already been recorded. Microsoft’s chatbot Tay for Twitter is an infamous example [175]: The agent’s capability to learn and mimic user behavior was exploited by Twitter trolls, who taught the agent to produce offensive tweets, forcing a shutdown of the agent. Moreover, it is not hard to imagine that a lack of regulations and evaluation criteria for the safe deployment of CAs in sensitive societal sectors can lead to a number of unwanted consequences. While some general regulations might exist [10], for many societal sectors the evaluation tools required to ensure safe and ethical use of CAs are still largely missing.

Healthcare is one among several sectors where the adoption of CAs could bring many benefits, e.g., by helping patients to access care during a pandemic [176] or supporting pediatric cancer patients who must remain in isolation during part of their treatment [177]. However, many risks exist as well [178]. For example, safeguarding patient privacy is crucial: it must be ensured that a CA does not leak sensitive patient information. Ethical and safety implications regarding the circumstances and conditions under which a CA is allowed to operate in a clinic must also be carefully assessed.
Examples of questions in this domain are: “In what context is it ethical to have patients interacting with a CA rather than a human?” “Is informing about health complications a context where an agent should be allowed to operate?” “Is it safe to use CAs for patients who suffer from psychiatric disorders?” “Would it be acceptable to introduce disparities in care whereby wealthier patients can immediately access human doctors, while the others must converse with agents first?” Evaluation tools to answer these questions have yet to be developed. Last but not least, it is essential to note that the use of black box CAs comes with additional risks, especially in cases where such CAs are involved in high-stakes decisions, where one can instead argue in favor of more transparent and accountable
systems [9]. The field of interpretable AI involves methods for developing such inherently transparent systems. The closely related, and perhaps more general, notion of explainable AI (XAI) [179, 180] considers (post hoc) approaches for attempting to explain the decisions taken by black box systems [9] as well as methods for improving their level of explainability [181]. With transparent systems originating from interpretable AI, a sufficient understanding of the system’s functionality to guarantee safe use can generally be achieved. For black box CAs, however, one should be careful when applying methods for explaining their decision-making to the specific case of evaluating their safety. This is so since explanations for black box systems can, in general, only be specific (also referred to as local [179]) to the particular event under study. In other words, given two events that are deemed to be similar and for which it is expected that the CA should react similarly, this might not happen [9]. Furthermore, the possibility of understanding the workings of agent components derived using machine learning is key to spotting potential biases hidden in the (massive amounts of) data [179, 180] used when training such components. With black box systems, it is simply harder to pinpoint potential damaging biases before the system has been released and has already caused damage [9, 182].

12.6. Notable Conversational Agents

To some degree, this section can be considered a historical review of CAs, but it should be observed that the set of CAs that have gained widespread attention is very much dominated by chatbots rather than task-oriented agents: even though some task-oriented systems are mentioned toward the end of the section, for a complete historical review the reader is also advised to consider the various systems for task-oriented dialogue presented in Subsection 12.4.2. Moreover, this section is focused on dialogue capabilities rather than the considerable and concurrent advances in, say, speech recognition and the visual representation of CAs (embodiment).

The origin of CAs can be traced back to the dawn of the computer age when, in 1950, Turing reflected on the question of whether a machine can think [183]. Noting that this question is very difficult to answer, Turing introduced the imitation game, which today goes under the name the Turing test. In the imitation game, a human interrogator (C) interacts with two other participants, a human (A) and a machine (B). The participants are not visible to each other, so that the only way that C can interact with A and B is via textual conversation. B is said to have passed the test if C is unable to determine that it is indeed the machine. Passing the Turing test has been an important goal in CA research (see also below) even though, as a measure of machine intelligence, the test itself is also controversial; objections include, for example, suggestions that the test does not
cover all aspects of intelligence, or that successful imitation does not (necessarily) imply actual, conscious thinking.⁶ As computer technology became more advanced, the following decade saw the introduction of the first CA, namely the chatbot ELIZA [67] in 1966; see also Subsection 12.4.1.1. The most famous incarnation of ELIZA, called DOCTOR, was implemented with the intention of imitating a Rogerian psychoanalyst. Despite its relative simplicity compared to modern CAs, ELIZA was very successful; some users reportedly engaged in deep conversations with ELIZA, believing that it was an actual human, or at least acting as though they held such a belief. In 1972, ELIZA was followed by the PARRY chatbot that, to a great degree, represents its opposite: instead of representing a medical professional, PARRY was written to imitate a paranoid schizophrenic and was used as a training tool for psychiatrists to learn how to communicate with patients suffering from this affliction. The semblance of intelligence exhibited by both ELIZA and PARRY can be derived, to a great degree, from their ability to deflect the user’s input, without actually giving a concrete, definitive reply to the user. Predictably, ELIZA has actually met PARRY in several conversations [184], which took place in 1972 over the ARPANet, the predecessor to the internet. At this point, it is relevant to mention the Loebner prize, which was instituted in the early 1990s. In the associated competition, which has been an annual event since 1991, the entrants participate in a Turing test. A 100,000 USD one-time prize (yet to be awarded) is offered for a CA that can interact with (human) evaluators in such a way that it is deemed indistinguishable from a real human in terms of its conversational abilities. There are also several smaller prizes.
The Loebner competition is controversial and has been criticized in several different ways, for example on the grounds that it tends to favor deception rather than true intelligence. Nevertheless, the list of successful entrants does offer some insight into the progress of CA development. A.L.I.C.E. [185], a somewhat contrived abbreviation of Artificial Linguistic Internet Computer Entity, was a pattern-based chatbot built using AIML (see Section 12.4.1.1). Its first version was released in 1995, followed by additional versions a few years later. Versions of A.L.I.C.E. won the Loebner competition several times in the early 2000s. Two other notable chatbots are Jabberwacky and its successor Cleverbot. Jabberwacky was implemented in the 1980s, and was released on the internet in 1997,

6 It should be noted, in fairness, that Turing did address many of those objections in his paper, and that he did not intend his test to be a measure of machine consciousness.

whereas Cleverbot appeared in 2006. These chatbots are based on information retrieval [186] and improve their capabilities (over time) automatically from conversation logs: every interaction between Cleverbot and a human is stored, and can then be accessed by the chatbot in future conversations. Versions of Jabberwacky won the Loebner competition in 2005 and 2006. Cleverbot rose to fame when, in 2011, it participated in a Turing test (different from the Loebner competition) in India, and was deemed 59.3% human, which can be compared with the average score of 63.3% for human participants. Another chatbot for which similar performance has been reported is Eugene Goostman. In two Turing tests [187], one in 2012 marking the centenary of the birth of Alan Turing and one in 2014, organized in Cambridge on the occasion of the 60th anniversary of his death, this chatbot managed to convince 29% (2012) and 33% (2014) of the evaluators that it was human. Claims regarding the intellectual capabilities of these chatbots have also been widely criticized. Another recent AIML-based chatbot is Mitsuku [188], which features interactive learning and can perhaps be considered a representative of the current state-of-the-art in pattern-based chatbot technology. Mitsuku has won the Loebner competition multiple times (more than any other participant) in the 2010s. Implemented in a scripting language called ChatScript [189], the chatbot Rose and its two predecessors Suzette and Rosette have also won the Loebner competition in the 2010s. The 2010s also saw the introduction of CAs used as personal assistants on mobile devices, starting with the launch of Siri (for Apple’s iPhone) in 2011 and rapidly followed by Google Now in 2012, as well as Amazon’s Alexa and Microsoft’s Cortana a few years later.⁷ These CAs, which also generally feature advanced speech recognition, combine chatbot functionality (for everyday conversation) with task-oriented capabilities (for answering specific queries).
In this context, it is relevant to mention again the Alexa prize [152]; introduced in 2016, this contest does not focus on the Turing test. Instead, the aim is to generate a CA that is able to converse “coherently and engagingly” on a wide range of current topics, with a grand challenge involving a 20-minute conversation of that kind [190]. In 2020, Google Brain presented Meena [91], a new DNN-based generative chatbot the architecture of which was optimized with evolutionary algorithms. With human evaluation recorded in percentage and based on the average of sensibleness and specificity (see Section 12.5.2), Meena was found to outperform chatbots such as XiaoIce, DialoGPT, Mitsuku, and Cleverbot, with a gain of 50% compared to XiaoIce and 25% compared to Cleverbot (the best after Meena).

7 SmarterChild, a CA released in the early 2000s on instant messenger networks, can perhaps be seen as a precursor to Siri.

12.7. Applications

This section provides a review of recent applications of CAs, focusing mainly on task-oriented agents. We have attempted to organize applications into a few main subtopics, followed by a final subsection on other applications. However, we make no claims regarding the completeness of the survey below; in this rapidly evolving field, new interesting applications appear continuously. The description below is centered on results from scientific research. There are also several commercial products that are relevant, for example general-purpose CAs such as Apple’s Siri, Amazon’s Alexa, and Microsoft’s Cortana, and CA development tools such as Google’s DialogFlow, Microsoft’s LUIS, IBM’s Watson, and Amazon’s Lex. However, these products will not be considered further here.

12.7.1. Health and Well-Being

Arguably one of the most promising application areas, health and well-being is a central topic of much CA research [3, 191]. This application area is not without controversy, however, due to ethical considerations such as safety and respect for privacy [178]. With the continuing rise in human life expectancy, especially in developed countries, the fraction of elderly people in the population is expected to rise dramatically over the coming decades, an undoubtedly positive development but also one that will exacerbate an already strained situation for the healthcare systems in many countries. Thus, a very active area of research is the study of CAs (and, especially, ECAs) for some (noncritical) tasks in elderly care. Examples include CAs that monitor medicine intake [192], interact with patients regarding their state of health (e.g., during cancer treatment) [193], or provide assistance and companionship [194–196], a case where social robots, such as Paro by Shibata et al. [197] (see also Hung et al. [198]), also play a role.

Mental health problems affect a large portion of the worldwide population. Combined with a shortage of psychiatrists, especially in low-income regions, these diseases present a major challenge [199]. CAs are increasingly being applied in the treatment of mental health problems [199–201], generally with positive results, even though most studies presented so far have been rather preliminary and exploratory in nature, rarely involving full, randomized controlled trials. Moreover, issues beyond the CA functionality and performance, such as legal and ethical considerations, must also be addressed [178, 200] carefully and thoroughly before CAs can be applied fully in the treatment of mental health problems (or, indeed, in any form of healthcare). Several tasks can be envisioned for CAs applied to mental health problems.
A prominent example is intervention in cases of anxiety and depression, using CAs such as Woebot [202] that applies cognitive behavioral therapy (CBT).

Another important application is for the treatment of post-traumatic stress disorder (PTSD), where the anonymity offered by a CA interaction (as opposed to an interaction with a human interviewer) may offer benefits [203]. CAs can also be used in lifestyle coaching, discouraging harmful practices such as smoking [204] or drug abuse [205], and promoting healthy habits and lifestyle choices involving, for example, diet or physical exercise [206]. An example of such a CA is Ally, presented by Kramer et al. [207], which was shown to increase physical activity measured using step counts. A related application area is self-management of chronic diseases, such as asthma and diabetes. In the case of asthma, the kBot CA was used for helping young asthma patients manage their condition [208]. Similarly, Gong et al. [209] proposed Laura, which was tested in a randomized controlled trial and was shown to provide significant benefits. Another important application, where spoken dialogue is crucial, is to provide assistance to people who are visually impaired, with the aim of improving social inclusion as well as accessibility to various services [210, 211]. CAs may also play an important role in assisting medical professionals, for example, in scheduling appointments, in providing information to patients, as well as in dictation. A survey of 100 physicians showed that a large majority believe that CAs could be useful in administrative tasks, but less so in clinical tasks [212]. Still, some CAs, such as Mandy by Ni et al. [213], aim to provide assistance to physicians in primary care, for example, in basic interaction with patients regarding their condition.

12.7.2. Education

Education is another field where CAs have been applied widely. Most of the educational applications reported below rely on interpretable CAs (e.g., based on AIML [214]); however, black box CAs have also been investigated (e.g., using seq2seq [215]). A natural application of CAs in the education domain is to provide a scalable approach to knowledge dissemination. Massive open online courses are a clear example, since one-to-one interaction between teacher and student is not feasible in these cases. In such settings, the course quality can be improved by setting up CAs that can handle requests from hundreds or thousands of students at the same time [216]. For instance, Bayne et al. proposed Teacherbot [217]. This pattern-based CA was created to provide teaching support over Twitter for a massive open online course that enrolled tens of thousands of students. Similarly, a CA in the form of a messaging app was created to provide teaching assistance for a large course at the Hong Kong University of Science and Technology [218]. Even though CAs can improve the dissemination of education, their effectiveness in terms of learning outcomes and engagement capability strongly depends
on the specific way in which they are implemented, and the context in which they are proposed. For example, AutoTutor was found to compare favorably when contrasted with classic textbook reading in the study of physics [219]. On the other hand, the AIML-based CA Piagetbot (a successor of earlier work, Freudbot [220]) was not more successful than textbook reading when administered to psychology students [221]. More recently, Winkler et al. proposed Sara [222], a CA that interacts with students during video lectures on programming. Interestingly, equipping Sara with voice recognition and structuring the dialogue into sub-dialogues to induce scaffolding were found to be key mechanisms to improve knowledge retention and the ability to transfer concepts. Beyond providing an active teaching support, CAs have also been used to improve the way education is conducted, e.g., as an interface for course evaluation [223], and related logistics aspects, e.g., by providing information about the campus [224–226]. Linking back to and connecting with the previous section on health and well-being, educational CAs have also been applied to provide learning support to visually impaired adults [227] and deaf children [228]. Moreover, a particularly sensitive context of care where CAs are being investigated is the education of children with autism. For example, an AIML-based virtual ECA called Thinking Head was adopted to teach children with autism (6 to 15 years old) about conversational skills and dealing with bullying [229]. Similarly, the ECA Rachel [230] was used to teach emotional behavior. More recently, ECAs have also been used to help teenagers with autism learn about nonverbal social interaction [231, 232]. Alongside speech recognition, these ECAs include facial features tracking in order to provide cues on smiling, maintaining eye contact, and other nonverbal aspects of social interaction. For more reading on this topic, we refer to the works by Johnson et al.
[233], Kerry et al. [234], Roos [214], Veletsianos and Russell [235], and Winkler and Söllner [216].

12.7.3. Business Applications

The rise of e-commerce has already transformed the business landscape and it is a process that, in all likelihood, will continue for many years to come. Currently, e-commerce is growing at a rate of 15–20% per year worldwide, and the total value of e-commerce transactions is more than 4 trillion US dollars. Alongside this trend, many companies are also deploying CAs in their sales and customer service. It has been estimated that CA technology will be able to answer up to 80% of users’ questions, and provide savings worth billions of dollars [236]. Research in this field is often centered on customer satisfaction. For example, Chung et al. [237] considered this issue in relation to customer service CAs for
luxury brands, whereas Feine et al. [238] studied how the customer experience can be assessed using sentiment analysis, concluding that automated sentiment analysis can act as a proxy for direct customer feedback. An important finding that must be taken into account when developing customer service CAs is that customers generally prefer systems that provide a quick and efficient solution to their problem [239], and that embodiment does not always improve users’ perception of the interaction [172]. Another useful design principle is to use a tiered approach, where a CA can refer to a human customer service agent in cases where it is unable to fulfill the customer’s request [240]. As in the case of other applications, the development of CAs for business applications is facilitated by the advent of relevant datasets, in this case involving customer service conversations [241].

12.7.4. Tourism and Culture

CAs are becoming an increasingly popular means to improve and promote tourism. In this context, a verbal interaction with a CA can be useful to provide guidance at the airport [242], assist in booking recommendations [243, 244], and adapt tour routes [245]. To improve portability, oftentimes these CAs are implemented as text-based mobile apps [245–247], sometimes making use of social media platforms [248]. Several CAs have been designed specifically to provide guidance and entertainment in museums and exhibits. Gustafson et al. built August, an ECA consisting of a floating virtual head, and showcased it at the Stockholm Cultural Centre [249]; Cassell et al. set up MACK, a robot-like ECA, at the MIT Media Lab [250]; Traum et al. proposed Ada and Grace, two more ECAs, employed at the Museum of Science of Boston [251]; and Kopp et al. studied how the visitors of the Heinz Nixdorf Museums Forum interacted with Max, a full-body human-like virtual ECA [252]. Seven years after the introduction of Max, Pfeiffer et al. compiled a paper listing the lessons learned from using the ECA [253]. Beyond queries about the venue, tourists were found to ask about Max’s personal background (e.g., “Where are you from?”) and physical attributes (e.g., “How tall are you?”), but also often to insult and provoke the agent, overall testing the human-likeness of the CA’s reactions. For the interested reader, the short paper by Schaffer et al. [254] describes key steps toward the development of a technical solution to deploy CAs in different museums. While the previous works were targeted at a specific museum or venue, CulturERICA can converse about cross-museum, European cultural heritage [255]. As in the case of tourism, many museum-guide CAs are implemented as mobile apps, or as social media messaging platform accounts, for mobility [256, 257]. For the reader interested in a similar application, namely the use of CAs in libraries, we refer to the works by Rubin et al.
[258] and Vincze [259].

June 1, 2022 12:52

528

Handbook on Computer Learning and Intelligence - Vol. I…– 9.61in x 6.69in

b4528-v1-ch12

Handbook on Computer Learning and Intelligence — Vol. I

12.7.5. Other Applications

There are many more, widely different applications where CAs can be beneficial. For example, CAs are being investigated to aid legal information access. Applications of this kind include legal guidance for couple separation and child custody [260], as well as immigration and banking [261]. DNN-based CAs are also being investigated to summarize salient information from legal documents in order to speed up the legal process [262].

Research is also being conducted regarding the role of CAs in vehicles, and especially so in the realm of self-driving cars [263]. Since vehicles are becoming increasingly autonomous, CAs can act as driver assistants in several ways, from improving passenger comfort by chatting about a wide range of activities, possibly unrelated to the ride (e.g., meetings, events), to assessing whether the driver is fit to drive, and even commanding the driver to take back vehicle control in case of danger [264, 265].

Games are another application of relevance for CAs. Beyond entertainment, games are often built to teach and train (so-called serious games): in this setting, CAs are used to improve engagement in disparate applications, e.g., guiding children in games about healthy dietary habits [266], teaching teenagers about privacy [267], and training business people as well as police personnel to react appropriately to conflicts and emergencies [268]. Researchers have also investigated the effect of using CAs to help gaming communities grow and bond, by having the community members converse with a CA in a video game streaming platform [269]. Furthermore, CAs themselves represent a gamification element that can be of interest for the media: the BBC is exploring the use of CAs as a less formal and more fun way of reaching a younger audience [270].

Finally, CAs are often proposed across different domains to act as recommender systems [271]. Other than in business and customer service applications, recommender system-like CAs have been developed to recommend music tracks [272], solutions for sustainable energy management [273], and also food recipes: Foodie Fooderson, for example, converses with the user to learn his or her preferences and then uses this information to recommend recipes that are healthy and tasty at the same time [274].

12.8. Future Directions

Over the next few years, many new CAs will be deployed, as much by necessity as by choice. For example, the trend toward online retailing is an important factor driving the development of CAs for customer service; similarly, the ongoing demographic shift in which an increasing fraction of the population becomes elderly implies a need for technological approaches in healthcare, including, but not limited to, CAs; furthermore, the advent of self-driving vehicles will also favor the development of new types of CAs, for example, those that operate essentially as butlers during the trip, providing information and entertainment.

The further development of CAs may also lead to a strong shift in the manner in which we interact with AI systems, such that current technology (e.g., smartphones) may be replaced by immersive technologies based on augmented or virtual reality. Just as smartphones have made interaction with screens more natural (touch as opposed to keyboards and smart pencils), we can expect that advanced, human-like CAs will lead to the development of new devices, where looking at a screen becomes secondary, and natural conversation is the primary mode of interaction.

While the current strong focus on black box systems is likely to persist for some time, we also predict that the widespread use of CA technology, especially for task-oriented agents operating in areas that involve high-stakes decision-making as well as issues related to privacy, where accountability and interpretability are crucial, will eventually force a shift toward more transparent CA technologies, in which the decision-making can be followed, explained and, when needed, corrected. This shift will be driven by not only ethical considerations but also legal ones [275]. In the case of chatbots (e.g., Tay; see Subsection 12.5.3), we have already witnessed that irresponsible use of black box systems comes with risks, as these systems are trained on large corpora of (dialogue) data that may be ridden with inherent biases toward the current majority view, something that could put minority groups at a further disadvantage [276]. This raises an interesting point since one of the supposed advantages of black box systems is that they reduce the need for handcrafting, yet they may instead require more effort to be spent on manually curating the datasets on which they rely.

As always, technology itself is neither good nor evil, but can be used in either way; there is a growing fear that CAs might be used unethically, for example, in gathering private conversational data for use in, say, mass surveillance and control. Moreover, as technology already exists for generating so-called deep fakes (such as fake text, speech, or videos, wrongfully attributed to a specific person or group [277–281]), CA technology could exacerbate the problem by allowing the development and deployment of legions of fake personas that are indistinguishable from real persons and that may swarm social networks with malicious intent. Even in cases where intentions are good, the use of CAs may be controversial. For example, affective agents may promote unhealthy or unethical bonding between humans and machines. The research community and policymakers have a strong collective responsibility to prevent unethical use of CA technology. Tools must be developed for evaluating not only the functionality but also the safety and societal impact of CAs. As mentioned in Subsection 12.5.3, this is a nascent field where much work still remains to be done.

CAs will most likely become a game-changing technology that can offer many benefits, provided that the issues just mentioned are carefully considered. This technology may radically change the manner in which we interact with machines, and will hopefully be developed in an inclusive manner allowing, for example, disadvantaged groups the same access to online and digital services as everyone else. Because of the transformative nature of this technology, making specific long-term predictions is very difficult. Thus, we end this chapter by simply quoting Alan Turing [183]: “We can only see a short distance ahead, but we can see plenty there that needs to be done.”

References

[1] J. Lester, K. Branting and B. Mott, Conversational agents, The Practical Handbook of Internet Computing, pp. 220–240 (2004).
[2] J. Masche and N.-T. Le, A review of technologies for conversational systems. In International Conference on Computer Science, Applied Mathematics, and Applications, pp. 212–225. Springer (2017).
[3] L. Laranjo, A. G. Dunn, H. L. Tong, A. B. Kocaballi, J. Chen, R. Bashir, D. Surian, B. Gallego, F. Magrabi, A. Y. Lau, et al., Conversational agents in healthcare: a systematic review, J. Am. Med. Inf. Assoc., 25(9), 1248–1258 (2018).
[4] S. Diederich, A. B. Brendel and L. M. Kolbe, Towards a taxonomy of platforms for conversational agent design. In 14th International Conference on Wirtschaftsinformatik (WI2019), pp. 1100–1114 (2019).
[5] R. Bavaresco, D. Silveira, E. Reis, J. Barbosa, R. Righi, C. Costa, R. Antunes, M. Gomes, C. Gatti, M. Vanzin, et al., Conversational agents in business: a systematic literature review and future research directions, Comput. Sci. Rev., 36, 100239 (2020).
[6] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature, 521(7553), 436–444 (2015).
[7] M. Wahde, Biologically Inspired Optimization Methods: An Introduction (WIT Press, 2008).
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, 2018).
[9] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nat. Mach. Intell., 1(5), 206–215 (2019).
[10] B. Goodman and S. Flaxman, European Union regulations on algorithmic decision-making and a “right to explanation”, AI Mag., 38(3), 50–57 (2017).
[11] P. M. Gardiner, K. D. McCue, L. M. Negash, T. Cheng, L. F. White, L. Yinusa-Nyahkoon, B. W. Jack and T. W. Bickmore, Engaging women with an embodied conversational agent to deliver mindfulness and lifestyle recommendations: a feasibility randomized control trial, Patient Educ. Couns., 100(9), 1720–1729 (2017).
[12] B. W. Jack, T. Bickmore, L. Yinusa-Nyahkoon, M. Reichert, C. Julce, N. Sidduri, J. Martin-Howard, Z. Zhang, E. Woodhams, J. Fernandez, et al., Improving the health of young African American women in the preconception period using health information technology: a randomised controlled trial, Lancet Digital Health, 2(9), e475–e485 (2020).
[13] D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, A. Gainer, K. Georgila, J. Gratch, A. Hartholt, M. Lhommet, et al., SimSensei Kiosk: a virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pp. 1061–1068 (2014).
[14] M. Ochs, C. Pelachaud and G. McKeown, A user-perception based approach to create smiling embodied conversational agents, ACM Trans. Interact. Intell. Syst., 7(1), 1–33 (2017).
[15] A. van Breemen, X. Yan and B. Meerbeek, iCat: an animated user-interface robot with personality. In Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 143–144 (2005).
[16] I. Leite, C. Martinho and A. Paiva, Social robots for long-term interaction: a survey, Int. J. Social Rob., 5(2), 291–308 (2013).


[17] T. Belpaeme, J. Kennedy, A. Ramachandran, B. Scassellati and F. Tanaka, Social robots for education: a review, Sci. Rob., 3(21), eaat5954 (2018).
[18] S. M. Anzalone, S. Boucenna, S. Ivaldi and M. Chetouani, Evaluating the engagement with social robots, Int. J. Social Rob., 7(4), 465–478 (2015).
[19] N. Epley, A. Waytz, S. Akalis and J. T. Cacioppo, When we need a human: motivational determinants of anthropomorphism, Social Cognit., 26(2), 143–155 (2008).
[20] K. Loveys, G. Sebaratnam, M. Sagar and E. Broadbent, The effect of design features on relationship quality with embodied conversational agents: a systematic review, Int. J. Social Rob., pp. 1–20 (2020).
[21] C. Creed and R. Beale, User interactions with an affective nutritional coach, Interact. Comput., 24(5), 339–350 (2012).
[22] P. Kulms, S. Kopp and N. C. Krämer, Let’s be serious and have a laugh: can humor support cooperation with a virtual agent? In International Conference on Intelligent Virtual Agents, pp. 250–259. Springer (2014).
[23] S. ter Stal, M. Tabak, H. op den Akker, T. Beinema and H. Hermens, Who do you prefer? The effect of age, gender, and role on users’ first impressions of embodied conversational agents in eHealth, Int. J. Hum.-Comput. Interact., 36(9), 881–892 (2020).
[24] A. M. von der Pütten, N. C. Krämer and J. Gratch, How our personality shapes our interactions with virtual characters-implications for research and development. In International Conference on Intelligent Virtual Agents, pp. 208–221. Springer (2010).
[25] A. Cerekovic, O. Aran and D. Gatica-Perez, How do you like your virtual agent?: Human-agent interaction experience through nonverbal features and personality traits. In International Workshop on Human Behavior Understanding, pp. 1–15. Springer (2014).
[26] K. Ruhland, C. E. Peters, S. Andrist, J. B. Badler, N. I. Badler, M. Gleicher, B. Mutlu and R. McDonnell, A review of eye gaze in virtual agents, social robotics, and HCI: behaviour generation, user interaction, and perception. In Computer Graphics Forum, vol. 34, pp. 299–326 (2015).
[27] T. Pejsa, S. Andrist, M. Gleicher and B. Mutlu, Gaze and attention management for embodied conversational agents, ACM Trans. Interact. Intell. Syst., 5(1), 1–34 (2015).
[28] L. Marschner, S. Pannasch, J. Schulz and S.-T. Graupner, Social communication with virtual agents: the effects of body and gaze direction on attention and emotional responding in human observers, Int. J. Psychophysiol., 97(2), 85–92 (2015).
[29] C. Pelachaud, Studies on gesture expressivity for a virtual agent, Speech Commun., 51(7), 630–639 (2009).
[30] A. Sadeghipour and S. Kopp, Embodied gesture processing: motor-based integration of perception and action in social artificial agents, Cognit. Comput., 3(3), 419–435 (2011).
[31] K. Bergmann, F. Eyssel and S. Kopp, A second chance to make a first impression? How appearance and nonverbal behavior affect perceived warmth and competence of virtual agents over time. In International Conference on Intelligent Virtual Agents, pp. 126–138. Springer (2012).
[32] M. Mori, K. F. MacDorman and N. Kageki, The uncanny valley [from the field], IEEE Rob. Autom. Mag., 19(2), 98–100 (2012).
[33] A. Tinwell, M. Grimshaw, D. A. Nabi and A. Williams, Facial expression of emotion and perception of the uncanny valley in virtual characters, Comput. Hum. Behav., 27(2), 741–749 (2011).
[34] K. Keeling, S. Beatty, P. McGoldrick and L. Macaulay, Face value? Customer views of appropriate formats for embodied conversational agents (ECAs) in online retailing. In Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004, pp. 1–10. IEEE (2004).
[35] P. Ekman, Basic emotions. In Handbook of Cognition and Emotion, pp. 45–60. John Wiley & Sons (1999).


[36] C. M. Whissell, The dictionary of affect in language. In The Measurement of Emotions, pp. 113–131. Elsevier (1989).
[37] C.-H. Hjortsjö, Man’s Face and Mimic Language (Studentlitteratur, 1969).
[38] P. Ekman and W. V. Friesen, Facial Action Coding Systems (Consulting Psychologists Press, 1978).
[39] S. Li and W. Deng, Deep facial expression recognition: a survey, IEEE Trans. Affective Comput. (2020), doi: 10.1109/TAFFC.2020.2981446.
[40] B. C. Ko, A brief review of facial emotion recognition based on visual information, Sensors, 18(2), 401 (2018).
[41] D. Mehta, M. F. H. Siddiqui and A. Y. Javaid, Facial emotion recognition: a survey and real-world user experiences in mixed reality, Sensors, 18(2), 416 (2018).
[42] M. Coulson, Attributing emotion to static body postures: recognition accuracy, confusions, and viewpoint dependence, J. Nonverbal Behav., 28(2), 117–139 (2004).
[43] T. D’Orazio, R. Marani, V. Renò and G. Cicirelli, Recent trends in gesture recognition: how depth data has improved classical approaches, Image Vis. Comput., 52, 56–72 (2016).
[44] S. S. Rautaray and A. Agrawal, Vision based hand gesture recognition for human computer interaction: a survey, Artif. Intell. Rev., 43(1), 1–54 (2015).
[45] S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michalski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-Lewandowski, et al., EmoNets: multimodal deep learning approaches for emotion recognition in video, J. Multimodal User Interfaces, 10(2), 99–111 (2016).
[46] H. M. Fayek, M. Lech and L. Cavedon, Evaluating deep learning architectures for speech emotion recognition, Neural Networks, 92, 60–68 (2017).
[47] M. Gillies and B. Spanlang, Comparing and evaluating real time character engines for virtual environments, Presence: Teleop. Virt. Environ., 19(2), 95–117 (2010).
[48] J. Serra, O. Cetinaslan, S. Ravikumar, V. Orvalho and D. Cosker, Easy generation of facial animation using motion graphs. In Computer Graphics Forum, vol. 37, pp. 97–111. Wiley Online Library (2018).
[49] S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh and B. Van Durme, ReCoRD: bridging the gap between human and machine commonsense reading comprehension, arXiv preprint arXiv:1810.12885 (2018).
[50] K. Lee, S. Lee and J. Lee, Interactive character animation by learning multi-objective control, ACM Trans. Graphics, 37(6), 1–10 (2018).
[51] T. Weise, S. Bouaziz, H. Li and M. Pauly, Realtime performance-based facial animation, ACM Trans. Graphics, 30(4), 1–10 (2011).
[52] P. Edwards, C. Landreth, E. Fiume and K. Singh, Jali: an animator-centric viseme model for expressive lip synchronization, ACM Trans. Graphics, 35(4), 1–11 (2016).
[53] M. Thiebaux, S. Marsella, A. N. Marshall and M. Kallmann, Smartbody: behavior realization for embodied conversational agents. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, vol. 1, pp. 151–158 (2008).
[54] S. Kopp, B. Krenn, S. Marsella, A. N. Marshall, C. Pelachaud, H. Pirker, K. R. Thorisson and H. Vilhjalmsson, Towards a common framework for multimodal generation: the behavior markup language. In International Workshop on Intelligent Virtual Agents, pp. 205–217. Springer (2006).
[55] V. Këpuska and G. Bohouta, Comparing speech recognition systems (Microsoft API, Google API, and CMU Sphinx), Int. J. Eng. Res. Appl., 7(03), 20–24 (2017).
[56] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al., State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4774–4778. IEEE (2018).


[57] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk and Q. V. Le, Specaugment: a simple data augmentation method for automatic speech recognition, arXiv preprint arXiv:1904.08779 (2019).
[58] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5206–5210 (2015).
[59] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., Deep speech 2: end-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182 (2016).
[60] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg, et al., Parallel wavenet: fast high-fidelity speech synthesis. In International Conference on Machine Learning, pp. 3918–3926 (2018).
[61] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo and K. Simonyan, High fidelity speech synthesis with adversarial networks, arXiv preprint arXiv:1909.11646 (2019).
[62] R. C. Streijl, S. Winkler and D. S. Hands, Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives, Multimedia Syst., 22(2), 213–227 (2016).
[63] A. C. Elkins and D. C. Derrick, The sound of trust: voice as a measurement of trust during interactions with embodied conversational agents, Group Decis. Negot., 22(5), 897–913 (2013).
[64] I. Torre, J. Goslin and L. White, If your device could smile: people trust happy-sounding artificial agents more, Comput. Hum. Behav., 105, 106215 (2020).
[65] M. Anabuki, H. Kakuta, H. Yamamoto and H. Tamura, Welbo: an embodied conversational agent living in mixed reality space. In CHI’00 Extended Abstracts on Human Factors in Computing Systems, pp. 10–11 (2000).
[66] I. Wang, J. Smith and J. Ruiz, Exploring virtual agents for augmented reality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2019).
[67] J. Weizenbaum, Eliza–a computer program for the study of natural language communication between man and machine, Commun. ACM, 9(1), 36–45 (1966).
[68] R. Wallace, The elements of AIML style, Alice AI Foundation, 139 (2003).
[69] G. Salton and C. Buckley, Term-weighting approaches in automatic text retrieval, Inf. Process. Manage., 24(5), 513–523 (1988).
[70] A. Ritter, C. Cherry and W. B. Dolan, Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 583–593 (2011).
[71] X. Zhao, C. Tao, W. Wu, C. Xu, D. Zhao and R. Yan, A document-grounded matching network for response selection in retrieval-based chatbots, arXiv preprint arXiv:1906.04362 (2019).
[72] R. Yan, Y. Song and H. Wu, Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 55–64 (2016).
[73] R. Lowe, N. Pow, I. Serban and J. Pineau, The Ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 285–294, Prague, Czech Republic (2015). Association for Computational Linguistics.
[74] X. Zhou, D. Dong, H. Wu, S. Zhao, D. Yu, H. Tian, X. Liu and R. Yan, Multi-view response selection for human-computer conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 372–381 (2016).
[75] Y. Wu, W. Wu, C. Xing, C. Xu, Z. Li and M. Zhou, A sequential matching framework for multiturn response selection in retrieval-based chatbots, Comput. Ling., 45(1), 163–197 (2019).


[76] Y. Bengio, R. Ducharme, P. Vincent and C. Jauvin, A neural probabilistic language model, J. Mach. Learn. Res., 3, 1137–1155 (2003).
[77] H. Schwenk, Continuous space language models, Comput. Speech Lang., 21(3), 492–518 (2007).
[78] T. Mikolov, W.-t. Yih and G. Zweig, Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751 (2013).
[79] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota (2019). Association for Computational Linguistics.
[80] L. v. d. Maaten and G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res., 9, 2579–2605 (2008).
[81] T. Mikolov, K. Chen, G. Corrado and J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[82] S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester and R. Harshman, Using latent semantic analysis to improve access to textual information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 281–285 (1988).
[83] T. K. Landauer, P. W. Foltz and D. Laham, An introduction to latent semantic analysis, Discourse Processes, 25(2–3), 259–284 (1998).
[84] K. Lund and C. Burgess, Producing high-dimensional semantic spaces from lexical cooccurrence, Behav. Res. Methods Instrum. Comput., 28(2), 203–208 (1996).
[85] R. Collobert, Word embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Citeseer (2014).
[86] L. Manzoni, D. Jakobovic, L. Mariot, S. Picek and M. Castelli, Towards an evolutionary-based approach for natural language processing. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO ’20, pp. 985–993, New York, NY, USA (2020). Association for Computing Machinery. ISBN 9781450371285.
[87] I. Sutskever, O. Vinyals and Q. V. Le, Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pp. 3104–3112 (2014).
[88] O. Vinyals and Q. Le, A neural conversational model, arXiv preprint arXiv:1506.05869 (2015).
[89] J. Tiedemann, News from OPUS–a collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, vol. 5, pp. 237–248 (2009).
[90] I. V. Serban, C. Sankar, M. Germain, S. Zhang, Z. Lin, S. Subramanian, T. Kim, M. Pieper, S. Chandar, N. R. Ke, et al., A deep reinforcement learning chatbot, arXiv preprint arXiv:1709.02349 (2017).
[91] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al., Towards a human-like open-domain chatbot, arXiv preprint arXiv:2001.09977 (2020).
[92] L. Zhou, J. Gao, D. Li and H.-Y. Shum, The design and implementation of XiaoIce, an empathetic social chatbot, Comput. Ling., 46(1), 53–93 (2020).
[93] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin, Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, pp. 5998–6008 (2017).
[94] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer, et al., Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020).
[95] D. Bahdanau, K. Cho and Y. Bengio, Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015 (2015).
[96] M.-T. Luong, H. Pham and C. D. Manning, Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421 (2015).
[97] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy and S. Bowman, GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355, Brussels, Belgium (2018). Association for Computational Linguistics.
[98] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma and R. Soricut, ALBERT: a lite BERT for self-supervised learning of language representations, arXiv preprint arXiv:1909.11942 (2019).
[99] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov and Q. V. Le, XLNet: generalized autoregressive pretraining for language understanding. In Proceedings of the Advances in Neural Information Processing Systems, pp. 5753–5763 (2019).
[100] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei and I. Sutskever, Language models are unsupervised multitask learners, OpenAI Blog, 1(8), 9 (2019).
[101] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[102] T. Young, D. Hazarika, S. Poria and E. Cambria, Recent trends in deep learning based natural language processing, IEEE Comput. Intell. Mag., 13(3), 55–75 (2018).
[103] D. W. Otter, J. R. Medina and J. K. Kalita, A survey of the usages of deep learning for natural language processing, IEEE Trans. Neural Networks Learn. Syst., 32(2), 604–624 (2020).
[104] M. Wahde, A dialogue manager for task-oriented agents based on dialogue building-blocks and generic cognitive processing. In 2019 IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA), pp. 1–8 (2019).
[105] D. Jurafsky and J. H. Martin, Speech and Language Processing, 2nd edn. (Prentice-Hall, 2009).
[106] D. G. Bobrow, R. M. Kaplan, M. Kay, D. A. Norman, H. Thompson and T. Winograd, GUS, a frame-driven dialog system, Artif. Intell., 8(2), 155–173 (1977).
[107] W. Ward, S. Issar, X. Huang, H.-W. Hon, M.-Y. Hwang, S. Young, M. Matessa, F.-H. Liu and R. M. Stern, Speech understanding in open tasks. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23–26 (1992).
[108] H. Bunt, Context and dialogue control, Think Q., 3(1), 19–31 (1994).
[109] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema and M. Meteer, Dialogue act modeling for automatic tagging and recognition of conversational speech, Comput. Ling., 26(3), 339–373 (2000).
[110] Z. Chen, R. Yang, Z. Zhao, D. Cai and X. He, Dialogue act recognition via CRF-attentive structured network. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 225–234 (2018).
[111] V. Zue, S. Seneff, J. R. Glass, J. Polifroni, C. Pao, T. J. Hazen and L. Hetherington, JUPITER: a telephone-based conversational interface for weather information, IEEE Trans. Speech Audio Process., 8(1), 85–96 (2000).
[112] S. Larsson and D. R. Traum, Information state and dialogue management in the TRINDI dialogue move engine toolkit, Nat. Lang. Eng., 6(3–4), 323–340 (2000).
[113] J. D. Williams, A. Raux and M. Henderson, The dialog state tracking challenge series: a review, Dialogue Discourse, 7(3), 4–33 (2016).
[114] J.-G. Harms, P. Kucherbaev, A. Bozzon and G.-J. Houben, Approaches for dialog management in conversational agents, IEEE Internet Comput., 23(2), 13–22 (2018).


[115] P. R. Cohen and C. R. Perrault, Elements of a plan-based theory of speech acts, Cognit. Sci., 3(3), 177–212 (1979).
[116] J. Allen, G. Ferguson and A. Stent, An architecture for more realistic conversational systems. In Proceedings of the 6th International Conference on Intelligent User Interfaces, pp. 1–8 (2001).
[117] D. Bohus and A. I. Rudnicky, The RavenClaw dialog management framework: architecture and systems, Comput. Speech Lang., 23(3), 332–361 (2009).
[118] N. Blaylock, Towards tractable agent-based dialogue. PhD thesis, University of Rochester, Rochester, New York (2005).
[119] M. F. McTear, Spoken dialogue technology: enabling the conversational user interface, ACM Comput. Surv., 34(1), 90–169 (2002).
[120] E. Levin, R. Pieraccini and W. Eckert, Using Markov decision process for learning dialogue strategies. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), vol. 1, pp. 201–204 (1998).
[121] E. Levin, R. Pieraccini and W. Eckert, A stochastic model of human-machine interaction for learning dialog strategies, IEEE Trans. Speech Audio Process., 8(1), 11–23 (2000).
[122] S. Young, J. Schatzmann, K. Weilhammer and H. Ye, The hidden information state approach to dialog management. In 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing-ICASSP’07, vol. 4, pp. IV-149–IV-152 (2007).
[123] S. Young, M. Gašić, S. Keizer, F. Mairesse, J. Schatzmann, B. Thomson and K. Yu, The hidden information state model: a practical framework for POMDP-based spoken dialogue management, Comput. Speech Lang., 24(2), 150–174 (2010).
[124] S. Young, M. Gašić, B. Thomson and J. D. Williams, POMDP-based statistical spoken dialog systems: a review, Proc. IEEE, 101(5), 1160–1179 (2013).
[125] T.-H. Wen, D. Vandyke, N. Mrkšić, M. Gašić, L. M. Rojas-Barahona, P.-H. Su, S. Ultes and S. Young, A network-based end-to-end trainable task-oriented dialogue system, pp. 438–449 (2017).
[126] P. Budzianowski and I. Vulić, Hello, it’s GPT-2–How can I help you? Towards the use of pretrained language models for task-oriented dialogue systems, arXiv preprint arXiv:1907.05774 (2019).
[127] Y. Gou, Y. Lei and L. Liu, Contextualize knowledge bases with transformer for end-to-end task-oriented dialogue systems, arXiv preprint arXiv:2010.05740v3 (2020).
[128] J. Weston, S. Chopra and A. Bordes, Memory networks, CoRR abs/1410.3916 (2015).
[129] A. Bordes, Y.-L. Boureau and J. Weston, Learning end-to-end goal-oriented dialog. In International Conference of Learning Representations (ICLR) (2017).
[130] T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (2017).
[131] S. Banerjee and M. M. Khapra, Graph convolutional network with sequential attention for goal-oriented dialogue systems, Trans. Assoc. Comput. Linguist., 7, 485–500 (2019).
[132] J. F. Kelley, An iterative design methodology for user-friendly natural language office information applications, ACM Trans. Inf. Syst., 2(1), 26–41 (1984).
[133] K. Crowston, Amazon mechanical turk: a research tool for organizations and information systems scholars. In eds. A. Bhattacherjee and B. Fitzgerald, Shaping the Future of ICT Research. Methods and Approaches, pp. 210–221, Berlin, Heidelberg (2012). Springer Berlin Heidelberg.
[134] B. Liu, G. Tur, D. Hakkani-Tur, P. Shah and L. Heck, End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. In NIPS Workshop on Conversational AI (2017).
[135] B. Dhingra, L. Li, X. Li, J. Gao, Y.-N. Chen, F. Ahmed and L. Deng, Towards end-to-end reinforcement learning of dialogue agents for information access, pp. 484–495 (2017).


June 1, 2022 12:52


Conversational Agents: Theory and Applications

















June 1, 2022 13:2

Handbook on Computer Learning and Intelligence – 9.61in x 6.69in

b4528-v1-index

© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_bmatter

Index

A
A.L.I.C.E, 522
ABI model, 19
abrupt ones, 319
abstraction, 90, 97, 104, 108–109, 114, 115–116, 121, 123
accumulated accuracy, 207, 209
accuracy, 7, 13, 22, 27–28, 39–41, 313, 318, 488, 494
accuracy trade-off, 35, 42
accuracy values, 492
action potential, 431, 433
action units (AUs), 501
actions, 292
activation degree, 311
activation function, 473–474, 477
active learning, 200
actuator, 491
actuator faults, 494
Adam algorithm, 481
adaptation, 319
adaptive kernel filtering, 356
adaptive systems, 164
adaptive tuning, 319
agent-based methods, 513
aggregation, 81–82
aggregation of ensemble member outputs, 154
Alexa, 523–524
Alexa prize, 517, 523
algorithmic, 97, 108, 114–115, 120
all-pairs classifications, 152
ambiguity, 59
annotation and labeling cycles, 200
annoying difficulties, 474
antecedent, 322
antecedent linguistic terms, 308
anti-STDP, 438
antibody testing study, 281
architecture, 472
artificial intelligence (AI), 135, 499
artificial intelligence markup language (AIML), 504–505, 522–523, 526
artificial neural network (ANN), 472
astrocyte cell, 438, 447
augmented reality, 503
automatic speech recognition (ASR), 502–503, 510, 514
average causal effect (ACE), 267
average percentual deviation, 208
averaging accuracy, 489
axiomatic set theory, 98–99, 107–108
axiomatization, 99–101, 104, 107
axioms, 99–104, 106–107
axis parallel rules, 140–141
axon, 472

B
backdoor path, 271, 282–283
backpropagation, 319
backpropagation algorithm, 474
backward elimination, 317
bagging, 154, 174
base learner, 154
base variable, 86
basic emotions, 501
basic learning steps, 177
batch normalization operation, 480–481
batch normalized, 480
Bayes, 12
Bayesian, 11–12, 17
Bayesian information criteria, 298
Bayesian network, 272–273
behavior, 485, 488–489
belief-desire-intention (BDI) models, 513

belief-state, 514
beliefs, 19–20
beta distribution, 295
better convergence, 480
bias, 6, 24
bias value, 473
bias-free estimations, 287
bidirectional encoder representations from transformers (BERT), 509
Bienenstock–Cooper–Munro rule, 438
big data, 134, 155
bilingual evaluation understudy (BLEU), 517, 519
biologically inspired learning algorithms, 436–437
black box, 3, 13–15, 17
black box dialogue systems, 514
black box systems, 499, 510–511, 514–515, 521, 529
blow-up, 475
blown-up cluster, 173
box-plots, 488

C
α-cut, 67, 138
C-SVM, 373
CA, 498–500, 502–504, 510–522, 525–529
cardinality, 69, 70
Cartesian product, 84–85
causal discovery, 297
causal effect rule, 268, 278
causal inference, 263, 274, 281
causal intervention, 266
causal mechanism, 269
causal models, 272
causal reasoning, 5, 37
causal Thompson sampling, 296
causation, 264
cell gate, 477–478
cell inputs, 477
cell state, 475, 478
cell update, 478–479
center of area, 312
center of gravity, 139
certainty factor, 309
certainty of model decisions, 192
chain junction, 271
chain structure, 297
chained perceptrons, 472
characteristic contour/spread, 142
characteristic functions, 58
chatbot, 497–499, 503–506, 508–509, 514, 521, 523, 529
ChatEval, 519
Choquet integral, 317
chunk-wise incremental learning, 156
class distribution, 310
class label, 305, 307, 312
class score, 152
classical fuzzy classification model, 149
classification accuracy, 489, 493
classification output explanation, 193
classification rules, 305
classification tasks, 494
classifier output certainty distribution, 208
clean classification rules, 149
Cleverbot, 522–523
cloud-based fuzzy system, 147
cloud-based rules, 148
cluster, 310, 312
cluster center update, 169
cluster delamination effect, 172–173
cluster fusion effect, 171
clusters with arbitrary shapes, 166
codebook, 88
cognitive processing, 511
collider, 271
compactness, 305, 314–313, 316
compatibility concept, 166
complement, 72–73, 84
completeness, 197, 305, 313–314
complex information, 472
complexity, 318
compositional rule of inference, 33, 37
comprehensibility, 313
comprehensibility assessment, 314
computational demands in EFS, 203
Computational Neuro-Genetic Models (CNGM), 413
concept, 11–12, 14–15, 18, 21, 25–27
concept drift, 179, 319
conditional independence, 290–291
conditional probability, 270, 276
conditions, 305
confidence, 306
confidence degree, 309
confidence in a prediction, 188
confidence intervals, 189
confidence levels, 149
conflict, 187, 201, 315

confounder, 265, 275–276, 294
confounding, 274
confounding bias, 263, 299
confounding junction, 271
confusion matrix, 207, 488–489, 493
conjunction, 79–80
conorms, 81
consequent, 322
consequent l_i(x) of the ith rule, 137
consequent learning in EFCs, 164
consequent/output weight parameters, 158
consequents, 305
consistency, 196, 305, 314
consistent behavior, 489
constant weight model, 435
constraint-based methods, 297
construction of fuzzy rules, 146
continual learning, 135
continuous adaptation, 319
contradiction, 74
contrastive, 5, 17
control, 19, 30–33, 35, 42
convergence, 481
convergence analysis and ensurance, 185
convergence to global optimality, 159
convergence to optimality, 185
conversational agents, 497, 516
conversational user experience, 518
convexity, 68
core, 67
correlation, 264
Cortana, 523–524
cosine similarity, 506, 519
counter-intuition, 296
counterfactual reasoning, 266–267
counterfactuals, 5, 37
coupled tanks, 468
coverage, 88–89, 197
coverage degree, 192
COVID-19, 280
crisp input, 312
current cell state value, 478
current input, 474
current time, 478–479
curse of dimensionality, 182
cyclic drift, 182
cylindrical extension, 86

D
20 different combinations, 488
d-separation, 271, 289, 297
Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems (DAMADICS), 468
data, 485
data clouds, 147, 166
data drift, 320
data fusion, 283–284
data generation process, 269
data stream sample appearances, 165
data streams, 155
data-based approaches, 469
data-driven animation, 502
De Morgan laws, 76
de-confound, 275
dead-zone recursive least square, 160
decision boundary, 148
decision function, 307
decision tree, 317–318
decrease randomness, 488
deep learning, 3, 17, 503
deep learning with SVM, 386
deep neural networks, 498–499
deep rule-based classifiers, 153, 194
defuzzification, 33–34, 138, 311
degree of purity, 150
delay learning, 438
delay of evolving mechanisms, 169
delete spike pattern strategy, 452
dendrites, 472
dynamic evolving neural-fuzzy inference system (DENFIS), 402–403
depolarization phase, 431–432
descriptive models, 265
design, 307
desired value, 474
diagnosing faults, 469
dialogue act, 514–513
dialogue act recognition, 513
dialogue corpus, 505–506, 508
dialogue policy, 511, 514–515
dialogue state, 511
dialogue state tracking, 513
different intensity levels, 484
direct encoding, 427
directed acyclic graph (DAG), 268–269, 273
discourse representation space, 314
discriminative, 307

disjunction, 79–80
disjunctive connectives, 80
dissimilarity, 487
distance value, 486
distinguishability, 196
DNNs, 502–504, 506–510, 514–515, 523, 528
do-calculus, 270, 276, 282
do-expression, 276
do-notation, 270
do-operator, 283
domain, 58
domain determination, 512
drift, 178, 320
drift compensation, 182
drift handling, 179
drift handling in the antecedent part, 181
DTI, 415
dual t-norm, 76
duality, 76
dynamic behavior, 134
dynamic change, 320
dynamic clustering, 135
dynamic firing threshold, 438
dynamic threshold function, 434
dynamic weight model, 435

E
easier classified, 491
electroencephalogram (EEG), 235, 236–240, 249–252, 254, 256, 258–261, 415
efficiency, 317
EFS Approaches, 176
evolving fuzzy neural network (EFuNN), 402
element-wise multiplication, 478–479, 481
ELIZA, 504, 522
embedded, 317
embodied conversational agents (ECAs), 497, 500–503, 516, 519–520, 524, 526–527
empirical, 97, 114
empirical membership functions, 166
empirical risk, 307
generality, 59
enhanced (weighted) classification scheme, 150
ensemble member, 154, 174
entropy-base, 318
equality, 69
erroneously classified, 490, 493
error, 474
error bars, 189
error gradient, 474
error rates, 490, 493
Euclidean distance, 486
Eugene Goostman, 523
evaluation measures, 204
evaluation measures for EFS, 204
evolution, 313
evolution thresholds, 172
evolutionary algorithms, 318
evolutionary fuzzy, 40, 42
evolving concepts, 165
evolving connectionist systems (ECOS), 401, 407, 416
evolving fuzzy classifier (EFC) approaches, 148
evolving intelligent systems, 134
evolving mechanisms, 135
evolving model ensembles, 174
evolving neuro-fuzzy systems, 136
evolving spiking neural networks (eSNN), 410–411
evolving systems, 164
excellent performance, 494
excitatory postsynaptic potentials, 434
excluded middle, 73, 75
exhaustive validation, 488
expected error, 307
expected quality of model predictions, 186
expert systems, 7–10, 12–13, 15–17, 29, 35–37, 42
explainability, 191
explainable AI (XAI), 3, 5–8, 10–11, 13, 15, 17–19, 25, 28, 30, 34, 41–42, 521
explanans, 5–6, 10–11, 13–15, 17–20, 22, 24–25, 27–29, 35, 38, 41–42
explanation, 4–7, 10–11, 18–19, 25, 27–28, 35, 42
explanation of model decisions, 191
explicit feedback, 474
exploration and exploitation dilemma, 293
exponential-like membership function, 65
extended fuzzy classification model, 149
extended Takagi–Sugeno fuzzy systems, 142
extended weighted recursive least squares, 160
external validity, 279
extrapolation behavior, 168

F
facial action coding system (FACS), 501
facial animation, 502
fairness, 10
false negatives, 490, 493

false positive rate, 313
false positives, 490, 493
fast greedy equivalence search, 298
fault class, 481
fault classification task, 469, 481, 488, 493
fault classifier, 481
fault datasets, 487
fault detection, 469, 494
fault detection and diagnosis, 467, 484
fault group, 486, 491, 493
fault knowledge, 494
fault patterns, 494
fault scenario, 485
fault selection, 487
fault set, 484, 488
faults, 470, 484
faults in actuators, 471
FDD approach, 493
feature importance levels, 184, 192, 198
feature ranking, 317
feature reactivation, 183
feature selection, 316, 318
feature weights, 183
features, 18, 24, 27–28, 36
feed-forward, 474
feedback cycles, 474
feedback to itself, 474
fidelity, 5, 10–11, 13, 27–28
filters, 317
final accuracy, 493
finite-state systems, 511
firing degree, 312, 324
first of maxima, 312
fuzzy inferential system (FIS), 29–30, 33, 39–41
fitting capacity, 475
flow of information, 471
fMRI, 415
focus of attention, 90
footprint, 323
forget gate, 475, 477, 479
forgetting, 181
forgetting factor, 158
forward selection, 317
four RF neurons, 428
frame of cognition, 88–90, 314
frame-based (or slot-filling) systems, 511–512
frequency-based decoding, 430
full model update, 178
fully connected layer, 481

fully-train-and-then-test, 206
fuzzification, 312
fuzzification layer, 145
fuzziness, 59
fuzzy basis function networks, 140
fuzzy classification, 305
fuzzy classifiers, 148, 305–306, 325
fuzzy control, 309
fuzzy entropy, 317
fuzzy granulation, 88
fuzzy granules, 317
fuzzy inference, 140
fuzzy input, 312
fuzzy intersection, 322
fuzzy logic, 13, 28–31, 35, 37–38, 41–42, 97
fuzzy memberships, 312
fuzzy neural network, 38–40
fuzzy propositions, 78
fuzzy relation, 84
fuzzy relations, 83–85
fuzzy set, 28–30, 33–34, 38, 40–42, 58, 91, 97, 115, 125, 306, 317
fuzzy set similarity measure, 171
fuzzy systems, 136
fuzzy union, 322
fuzzy wrapper, 317

G
ε-greedy algorithm, 293
games, 235, 238–240, 242, 249–250, 252, 254, 258–259, 261–262
gates, 475
Gauss–Newton algorithm, 163
Gaussian, 312
Gaussian function, 140
Gaussian membership function, 64, 323
Gaussian process, 362
Gaussian receptive field, 427
gene regulatory network, 414
general language understanding evaluation (GLUE) benchmark, 516
general type-2 fuzzy set, 143
generalization error, 307
generalized inverse iteration, 161
generalized mean, 81
generalized multivariate Gaussian distribution, 141
generalized RFWLS, 160
generalized rules, 141
generalized Takagi–Sugeno fuzzy systems, 140

generative, 307
generative chatbots, 504, 508
generative models, 508
genetic algorithm, 318
global learning, 158
good accuracy values, 490
Google Now, 523
GPT2, 510
GPT3, 510
gradient calculation, 480
gradient-based learning algorithm, 436
gradients, 481
gradual changes, 319
gradual drift, 172
granular computing, 97–98, 104–105, 110, 114
granularity, 90
granulation, 87
granules, 97–99, 101, 104, 108, 110, 114–116, 118–125, 129, 144
graphical models, 274
gravitation concept, 150
granular computing (GrC), 97–98, 115
grey box model, 191
grid, 312
growing type-2 fuzzy classifier (GT2FC), 320
GUS, 511

H
Hadamard multiplication, 478
half-life of a rule, 170
handcrafted rules, 499
height, 66
height-center of area, 312
heterogeneous datasets, 279–280
hidden Markov models (HMMs), 503
hidden state, 475, 477–478
hidden state value, 479
high accuracy, 491
high-dimensional dataset, 299
high-level interpretability, 318
higher general accuracy, 494
higher misclassification, 491
higher order fuzzy sets, 92
historical data, 469
Hodgkin–Huxley, 432
homogeneous rule merging, 172
human brain, 472
human operator faults, 469
human problem solving, 98
human-centered, 97–98
hyper-parameter configuration, 467, 489, 492
hyper-parameters, 488, 491
hyper-parameters’ impact, 485
hyperbolic activation function, 479
hyperbolic tangent, 478
hyperbolic tangent activation function, 477

I
idle rule, 170
IF–THEN, 9, 13–14, 29–30, 32–33, 306, 308
ignorance, 187, 201
imbalanced classification problems, 203
imitation game, 521
implementation issues, 200
implication, 80
implying a shift, 133
inclusion, 70
inclusion/exclusion criteria, 281
incremental, 305, 319–320, 325
incremental feature selection, 320
incremental learning, 134, 156, 313
incremental learning from scratch, 157
incremental learning techniques, 494
incremental machine learning, 134
incremental optimization, 162
incremental partial least squares, 185
incremental rule splitting procedure, 173
incremental supervised clustering, 320
incremental SVM, 383
incremental, evolving clustering, 166
industrial environment, 469
industrial process, 467–468, 494
industrial process simulator, 480
industrial systems, 494
inference, 306
inference engine, 10, 33, 36, 40, 311, 322
information, 5, 9, 11–12, 14, 18–19, 23–25
information granulation, 88
information granules, 87
information hiding, 90
information loss, 141
information state, 513
information-retrieval chatbot, 504–506
inhibitory postsynaptic potentials, 434
initial batch mode training step, 157
innovation vectors, 160
input gate, 477–478
input size, 487, 489, 491, 494
input space drifts, 179
input values, 473

inputs, 472, 478
integration of feature weights, 146
intelligent, 97, 108, 110–111
intelligent virtual agents (IVAs), 497
intent detection, 511
inter-distance, 319
inter-quartile, 491
inter-quartile distance, 489
interface agents, 500
interleaved-test-and-then-train, 205
internal functions, 475
internal layer, 475, 478
internal operations, 478
internal state, 474–475
internal/hidden layer, 479
interpretability, 195, 299, 305, 308, 313, 317–318, 322, 325, 498, 529
interpretable, 313
interpretable AI, 521
interpretable dialogue systems, 511
interpretable primitives, 499, 513
interpretable systems, 499, 510
interpretation, 80
interpretation of consequents, 198
intersection, 72, 84
interval type-2 fuzzy set, 143
interval-based fuzzy systems, 143
interval-valued fuzzy, 93
intervals, 87
intervention, 276, 278
intra-distance, 318
intuition, 296
intuitive set theory, 98–99, 104–105
inverse covariance matrix update, 169
inverse weighted Hessian matrix, 159
iterations, 480
iterative re-optimization, 161
Izhikevich neuron, 432

J
Jabberwacky, 522–523
joint homogeneity between the rules, 171

K
kernel, 335
kernel density estimation, 344
kernel linear discriminant analysis, 351
kernel PCA, 349
kernel trick, 335

knowledge, 4–7, 9–13, 18, 30, 40, 97, 114, 116, 313
knowledge base, 311, 314, 511, 515
knowledge contraction, 164
knowledge expansion, 164, 198
knowledge extraction, 13, 39
knowledge refinement, 165
knowledge simplification, 165

L
language model, 508
last network layer, 475
law of total probability, 276
layer, 472–473
layer size, 488, 491
layer weight, 472–473
layered model architecture, 144
linear discriminant analysis (LDA), 317
leaky integrate-and-fire (LIF) model, 432
LIF neuron, 432–433
learning algorithms, 317
learning engine for E(N)FS, 177
learning in non-stationary environments, 134
least squares error, 157
lesser variation, 491
level control pilot plant, 482
level variable, 485
levels of off-set, 484
levels of pump off-set, 484
Levenberg–Marquardt algorithm, 163
Likert scale, 518
line with an accuracy, 489
linear classifiers, 308
linear discriminant analysis, 351
linear operation, 472
linearly separable, 472
linguistic approximation, 91
linguistic input variables, 308–309
linguistic interpretability, 195
linguistic labels, 322
linguistic modifiers, 91
linguistic output variables, 309
linguistic statements, 78–79
linguistic values, 312
linguistic variable, 86
liquid flows, 484
listwise deletion, 286
local data distributions, 147
local density, 148
local drifts, 181

local learning, 158
locally weighted regression, 348
Loebner prize, 522
logic, 80
logical operations, 79
LogSoftmax, 481
long short-term memory, 467–468
long-term plasticity, 435
loss, 307
loss function, 481
low accuracy, 489, 493
low misclassification, 491
low value, 493
low-level interpretability, 318
lower, 323
lower accuracy values, 490
lower computational complexity, 489
lowest accuracy, 493

M
-membership function, 62
machine learning, 235, 243, 285, 468
machine translation, 509
main contribution, 480
malfunctions, 470
Mamdani, 309
Mamdani fuzzy systems, 137
margin maximization strategy for structure learning, 443–444
Markov decision process (MDP), 514
massively parallel fuzzy rule bases, 194
mathematical operations, 478
matrices, 478–479
max criterion, 312
max-min composition, 311
mean, 480, 491
mean absolute error, 208
mean of maximum, 139
mean opinion score (MOS) scale, 503
mean value substitutions, 286
median accuracy, 491
Meena, 509, 523
membership degrees, 58
membership function, 58, 311, 312, 321
membrane potential, 431–432, 434
memory, 474, 481
Mercer’s theorem, 338
meta-cognitive learning, 201
meta-heuristics, 318
meta-neuron, 447, 449
meta-neuron based learning rule, 447
meta-neuron learning rule, 437
meta-neuron-based learning algorithm, 438
meta-neuron-based learning rule, 450
metric learning, 388
middle of maxima, 312
MIMO fuzzy systems, 151
min–max, 320
mini-batch, 480–481
missing at random (MAR), 287–289
missing completely at random (MCAR), 286, 288–289
missing data, 285
missing data handling, 202
missing data mechanism, 286
missing graphs, 288
missing not at random (MNAR), 287–288, 290
missingness mechanism, 289, 291
Mitsuku, 523
mixed initiative, 511
mixed reality, 503
model architecture, 135
model classified, 490
model complexity, 307
model ensembles, 154
model evaluation, 488
model fairness, 299
model parameter adaptation, 156
model transportability, 283
model-based methods, 513
modified recursive least squares, 160
Moore–Aronszajn theorem, 338
moral, 20
motion capture-based animation and performance-based animation methods, 502
multi-armed bandit, 294
multi-armed Bandit model, 292
multi-class, 310
multi-innovation RFWLS, 160
multi-model all-pairs fuzzy classifier, 151
multi-model one-versus-rest fuzzy classifier, 150
multi-objective, 317
multi-objective evolutionary algorithm, 318
multi-variable faults, 481
multiclass SVM, 378
multidimensional, 310
multidimensional fuzzy set, 83
multilayer perceptron, 467–468
multistage fuzzy controller, 485

multivalued logic, 80
mutual information, 317

N
n-grams, 517
Nadaraya–Watson Kernel Regression, 347
natural language generation (NLG), 511–512, 514–515
natural language processing, 517
natural language understanding (NLU), 502, 511, 514
negative log-likelihood loss, 481
neo-fuzzy neuron network, 146
network input size, 488
network inputs, 488
network sizes, 489, 491
network training, 474, 480, 485
NeuCube, 414
neural network, 468, 472, 481, 488
neural network hyper-parameters, 468
neural network training, 488
neural networks, 8, 11–12, 14, 16–17, 29, 38, 42, 135, 469
neural weight values, 475
neuro-fuzzy, 29, 38–42
neuro-fuzzy classifiers, 153
neuro-fuzzy hybrid systems, 399
neuro-fuzzy systems, 38, 144
neuro-genetic fuzzy rules, 413
neurogenetic model of a neuron, 413
neuromorphic implementations, 416
neuron addition strategy, 443, 450–451
neuron input, 473, 477
neuron output, 473
neurons, 472, 474, 478–479
new input information, 477
NMP learning rule, 440, 442
noise, 187
noising region, 485
non-contradiction, 73, 75
non-informative features, 316
non-recurrent, 472
non-stationary dynamic environments, 320
non-stationary environments, 319
non-stationary processes, 319
nonlinear classifiers, 308
nonlinearity, 494
normal fuzzy set, 89
normalization, 66
normalization layer, 145

normalized degree of activation, 140
normalized membrane potential (NMP) learning rule, 440, 442
nullnorms, 82
number of inputs, 487
number of neurons, 488
number of samples, 486

O
odor recognition, 416
on-line design of experiments, 202
one-class SVM, 380
one-versus-rest classification, 150
online, 305, 319, 320
online active learning, 200
online bagging of EFS, 174
online boosting concept, 175
online classification, 320
online data modelling, 412
online data streams, 133
online dimensionality reduction, 182
online ensemble of E(N)FS, 154
online feature reduction/selection, 182
online feature weighting, 184
online generation, 325
online learning, 235, 238, 242, 252, 257, 319, 325
online meta-neuron based learning algorithm (OMLA), 424, 447, 450, 459, 461
online missing data imputation, 202
online systems, 319
operator, 75, 321
optimization, 307, 313–314, 411
ordered weighted averaging, 82
outliers, 489
output, 472, 478
output error, 474
output gate, 478–479
output layer, 474
output uncertainty, 190
output weights learning stage, 445–446
over-fitting, 154, 481
overlap, 314

P
paradigm for dialogue system evaluation (PARADISE), 518
parallelized computation methodology, 299
parameter faults, 470
parameter uncertainty, 190

parameter update formula, 163
parameter update strategy, 453
PARRY, 522
partial AND/OR-connections, 196
partial model update, 178
partially observable Markov decision processes (POMDPs), 514
partially observed data, 285
partition, 306, 312, 315, 318
Parzen–Rosenblatt method, 345
past instant, 478
past time, 475
patches, 312
pattern classification, 307
pattern recognition, 8, 11, 17, 239, 254
pattern-based chatbot, 504, 522
PC algorithm, 298
perceptron, 472, 474
performance, 305, 313
performance measure, 313
performance metrics, 488
periodic holdout procedure, 205
perplexity, 517
personal assistants, 523
personalized brain data modeling, 415
physical computation, 98, 110–111
physical leak, 484
pilot plant, 482–485
pilot plant data, 493
pilot plant experiment, 482
pilot process, 468
pilot-scale plant, 467
pipeline model, 510–511
pixels, 20–24
plan-based methods, 513
plant control, 482
plasticity–stability dilemma, 156
polychronization effect, 414
polynomial encoding, 427
polynomials, 145, 309
population encoding, 427
possibility, 60
post-intervention probability distribution, 270
postsynaptic potentials, 425
potential outcome framework, 264
precise evolving fuzzy modeling, 138
precise modeling, 191
precision, 313, 317, 516–517
predicting pollution in London, 416
predicting risk of earthquakes, 416
predictive maintenance, 135
preference, 59–61
preference relation matrix, 152
premises, 305
previous time instant, 478–479
probability, 59
probability densities, 87
procedural animation, 501
process control, 482, 484
process monitoring, 469
process operation visualization, 484
process operations, 469, 494
process variables, 482
product, 311
production system, 35–36
programmable logic controller, 482
projection, 85
projection concept, 141
proposition, 79–80
prototype-based clustering techniques, 166
prototypes, 115–116, 120–121, 123
proxy variable, 288
pseudo-streams, 155
Python, 481
PyTorch, 481

Q
qualitative functions, 469
qualitative model-based, 469
quality, 313, 325
quantitative model-based, 469
quantization, 88
quantum-inspired, 411
quantum-inspired evolutionary algorithm, 412
quantum-inspired particle swarm optimization, 412
quite adequate, 490

R
randomized controlled trials, 266–267
rank order learning, 438
rate encoding, 426, 430
rate-based decoding scheme, 430
re-scaled Mahalanobis distance, 184
real industrial processes, 467
real processes, 468
real-time adaptation, 169
real-time data streams, 320
real-world application examples, 134
real-world Applications of EFS, 210

real-world problems, 472
realistic scenarios, 494
reason for model decisions, 192
reasonable accuracy, 493
recall, 516, 517
recall-oriented understudy for gisting evaluation (ROUGE), 517
recent entries, 475
receptive field (RF) neuron, 426
recommender systems, 299, 528
recoverability, 289–290
recurrent hidden states, 481
recurrent layers, 146
recurrent neural networks, 467, 472, 474, 508–510
recurrent states, 475
recurrent term, 480–481
recursive algorithm, 157
recursive correntropy, 161
recursive fuzzily weighted least squares, 158
recursive Gauss–Newton approach, 163
recursive learning of linear parameters, 157, 161
recursive Levenberg–Marquardt (RLM) algorithm, 164
recursive weighted total least squares, 161
reducing sample annotation and measurement, 200
redundancy, 313
redundant partitions, 315
reference, 489
refinement, 319
refractory period, 425, 431–432, 434
regressor, 157
regressor vector, 159
regularization parameter, 481
reinforcement learning, 291, 514, 515
reinforcement learning agent, 292
relations, 83
relevance, 18, 20–22, 24
reliability, 186
repolarization phase, 431–432
representation theorem, 68
representer theorem, 344
reproducing kernel, 337
reproducing kernel Hilbert space (RKHS), 337
residual, 163
resting potential, 431
results, 493
ReSuMe, 438

rete network, 36–37
reward, 292
RF neuron, 427–428
robustness, 154, 159
ROC curves, 313
root mean squared error, 208
rough sets, 87
rule activation layer, 145
Rule age, 170, 181
rule chains, 199
rule consequent functions, 145
rule construction, 195
rule evolution, 162, 164, 166
rule evolution criterion, 167
rule extraction, 14–15, 27
rule firing degree, 138
rule importance levels, 198
rule inclusion metric, 171
rule learning, 319
rule merging, 165, 170
rule parameters initialization, 159
rule pruning, 164, 169
rule significance, 170
rule splitting, 165, 172
rule update, 162
rule utility, 170
rule-based classification, 305
rule-based classifiers, 306, 325
rule-based one-versus-rest classification, 151
rule-based systems, 314, 325, 499
rule-extraction, 14
rulebase, 4, 11, 13–14, 22, 32–37, 40–41
rules, 306

S
ν-SVR, 373, 375
ε-SVR, 375
S-membership function, 63
s-norms, 75
saliency maps, 14–15, 17, 22, 24
same layer, 478
sample selection, 200
sample time, 485
sample-wise incremental learning, 156
samples, 487
saturation, 320
supervisory control and data acquisition (SCADA), 485
scale-invariance (of learning steps), 178
scatter, 312

score-based methods, 297
selection bias, 280, 282, 299
self-adaptation, 320
self-attention, 509
self-learning, 320
self-learning computer systems, 135
semantic grammars, 512
semantic soundness, 89
semi-supervised online learning, 156, 320
sensibleness, 517
sensitivity, 313
sensor group, 491
sensors, 285, 471
sentence embeddings, 505
sentences, 79
separability criterion, 183
sequence to sequence model (seq2seq), 508–509
sequential, 319
sequential ensemble, 176
set of variables, 470
set operations, 80
set theoretical, 98, 101, 103–105, 107–108
set theory, 80, 98–101, 106–107, 115, 126, 128
setup of the experiments, 485
shifts, 178
short-term plasticity, 435
sigma, 473
sigmoid activation function, 477–479
signal, 484
similarity, 59–60, 315
similarity measures, 314
simplicity, 196
Simpson paradox, 274–275
single-pass active learning, 200
singleton, 87, 323
SIRI, 523–524
six faults, 484
size, 487
small-scale processes, 468
small-scale prototypes, 468
smooth and soft online dimension reduction, 184
smooth forgetting, 180
social robot, 500, 524
soft computing models, 135
soft rule pruning mechanism, 198
sources of faults, 470
spatio-temporal pattern classification, 411
specific combination, 494
specificity, 69–70, 90, 517
spectral clustering, 359
speech synthesis, 510
spike response model, 432–433
spike-time-dependent plasticity, 437
SpikeProp, 436, 459, 461
spoken dialogue system (SDS), 502
spurious correlations, 265, 274
stability, 178
standard deviation, 480
standard operations, 72–73
state, 292
state values, 475
static models, 134
stationary processes, 319
statistical, 8, 11–12, 27
steepest descent algorithm, 163
storage of information, 480
straightforward, 474
struck, 484
structural, 272
structural causal model (SCM), 268, 273
structural faults, 470
structure evolution, 156
structure learning, 443
structure learning stage, 443, 445
structured SVM, 390
structures, 472
sub-optimality, 159
subgroups, 487
subjective assessment of speech system interfaces (SASSI), 518
subnormal, 65
SuperGLUE, 516
superpixel, 27
supervisory control and data acquisition, 484
support, 66
support of the cluster (rule), 169
support vector machines (SVMs), 330, 365
SWAT, 459, 461
symbolic reduction in artificial intelligence (srai) tag, 505
symbols, 30
synapses, 472
synaptic efficacy function-based leaky-integrate-and-fire neuron (SEFRON), 424, 453–459, 461
synaptic efficacy function-based neuron, 453
system usability scale (SUS), 518


T t-conorm, 75–76, 138 t-norm, 75, 138, 310–311 t-SNE, 507 Takagi–Sugeno, 309 Takagi–Sugeno (TS) fuzzy systems, 138 tank leaks, 484 target concept drifts, 179 task-oriented agent, 497, 515 task-oriented agents, 498–499, 503, 506, 510–511, 514–515, 521, 524, 529 taxonomy, 317 Tay, 520, 529 technology adoption model, 19 temporal decoding, 426, 430 temporal encoding, 426, 435–436 Tennessee Eastman process, 468 term frequency–inverse document frequency (TF–IDF), 505–506 term set, 86 testability, 290 text-to-speech synthesis, 503 textual processing, 500 Thompson sampling, 294, 296 Thompson sampling algorithm, 293 three fault groups, 484 time dependence, 474–475 time step, 475 time-based decoding, 430 time-dependent, 319 time-varying weight model, 435 token, 508–509 trade-off, 40 train and validate, 486 training, 487–488 training and validation sets, 486 transfer of information, 472 transformer, 509–510 transparency, 305, 308, 313, 316, 318, 325 transparent, 313–314 transportability, 279 trapezoidal, 312 trapezoidal membership function, 62 tree, 312 triangular conorms, 76 triangular fuzzy sets (membership functions), 61 triangular norms, 75, 80–82, 312 tripartite synapse, 447


true negatives, 313 true optimal incremental solutions, 139 true positive rate, 313 true positives, 313 truncation factor, 480 trust, 4–6, 10, 13, 17–20, 28 truth value, 79 Turing test, 521–523 two tanks, 484 two techniques, 481 two-stage margin maximization spiking neural network (TMM-SNN), 424, 439–444, 459, 461 type-1, 305, 325 type-1 fuzzy classification, 324 type-1 fuzzy rule-based classification systems, 305 type-2, 305 type-2 fuzzy classification, 320, 324–325 type-2 fuzzy classifier, 320 type-2 fuzzy membership, 320 type-2 fuzzy propositions, 322 type-2 fuzzy rule, 320–322 type-2 fuzzy set, 93, 322 type-2 fuzzy systems, 142, 321 type-reducer, 322, 324 U unbiased estimation, 282–283 uncanny valley hypothesis, 501, 519 uncertainty, 59–60, 320–321 uncertainty due to extrapolation, 189 uncertainty due to high noise level, 189 uncertainty estimates, 207 uncertainty in evolving fuzzy modeling, 189 uncertainty in predictions, 187 uncertainty-based sampling, 201 understandable rulebase, 10, 15, 38 undesirable behavior, 491 undesirable oscillation, 475 unimodal, 89 uninorm, 82–83 union, 72–73, 80, 84 universal approximators, 139 universal kernels, 343 universe, 86 universe of discourse, 315, 317 unlabeled data, 320 unlearning effect, 178


unnecessary complexity, 204 unobserved confounders, 296 unsupervised calculation of accuracy, 207 upper confidence bound algorithm, 293 upper membership functions, 323 useability of evolving (neuro-)fuzzy systems, 200 V validation, 313, 485 validation fault group, 487 vanilla LSTM, 480 vanishing gradients, 475 variability of the version space, 187 variable, 485 variables change, 485 variation, 489 virtual reality, 529 visual guide, 489 visual interpretability, 198

W weight decay regularization, 160 weight sensitivity modulation factor, 450 weighted least-squares, 158 weights, 472, 474, 478 weights and bias, 477 winner class, 311 winner-takes-all classification, 149 winning class, 312, 324 winter, 7, 12, 15, 17, 38, 42 word embedding, 506–507 word error rate (WER), 502–503 word2vec, 507 wrappers, 317 X XiaoIce, 509, 523 Z zero-order model, 310


HANDBOOK ON COMPUTER LEARNING AND INTELLIGENCE Volume 2: Deep Learning, Intelligent Control and Evolutionary Computation


Editor

Plamen Parvanov Angelov Lancaster University, UK

World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO

Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Cataloging-in-Publication Data
Names: Angelov, Plamen P., editor.
Title: Handbook on computer learning and intelligence / editor, Plamen Parvanov Angelov, Lancaster University, UK.
Other titles: Handbook on computational intelligence.
Description: New Jersey : World Scientific, [2022] | Includes bibliographical references and index. | Contents: Volume 1. Explainable AI and supervised learning - Volume 2. Deep learning, intelligent control and evolutionary computation.
Identifiers: LCCN 2021052472 | ISBN 9789811245145 (set) | ISBN 9789811246043 (v. 1 ; hardcover) | ISBN 9789811246074 (v. 2 ; hardcover) | ISBN 9789811247323 (set ; ebook for institutions) | ISBN 9789811247330 (set ; ebook for individuals)
Subjects: LCSH: Expert systems (Computer science)--Handbooks, manuals, etc. | Neural networks (Computer science)--Handbooks, manuals, etc. | Systems engineering--Data processing--Handbooks, manuals, etc. | Intelligent control systems--Handbooks, manuals, etc. | Computational intelligence--Handbooks, manuals, etc.
Classification: LCC QA76.76.E95 H3556 2022 | DDC 006.3/3--dc23/eng/20211105
LC record available at https://lccn.loc.gov/2021052472

British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

Copyright © 2022 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

For any available supplementary material, please visit https://www.worldscientific.com/worldscibooks/10.1142/12498#t=suppl

Desk Editors: Nandha Kumar/Amanda Yun
Typeset by Stallion Press
Email: [email protected]
Printed in Singapore


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_fmatter

Preface

The Handbook aims to be a one-stop-shop for the various aspects of the broad research area of Computer Learning and Intelligence. It is organized in two volumes and five parts as follows:

Volume 1 includes two parts:
Part I Explainable AI
Part II Supervised Learning

Volume 2 has three parts:
Part III Deep Learning
Part IV Intelligent Control
Part V Evolutionary Computation

The Handbook has twenty-six chapters in total, which detail the theory, methodology and applications of Computer Learning and Intelligence. These individual contributions are authored by some of the leading experts in the respective areas and often are co-authored by a small team of associates. They offer a cocktail of individual contributions that span the spectrum of the key topics as outlined in the titles of the five parts. In total, over 67 authors from over 20 different countries contributed to this carefully crafted final product. This collaborative effort brought together leading researchers and scientists from the USA, Canada, Wales, Northern Ireland, England, Japan, Sweden, Italy, Spain, Austria, Slovenia, Romania, Singapore, New Zealand, Brazil, Russia, India, Mexico, Hong Kong, China, Pakistan and Qatar.

The scope of the Handbook covers the most important aspects of the topic of Computer Learning and Intelligence. Preparing, compiling and editing this Handbook was an enjoyable and inspirational experience. I hope you will also enjoy reading it, will find answers to your questions and will use this book in your everyday work.

Plamen Parvanov Angelov
Lancaster, UK
3 June 2022


About the Editor

Professor Plamen Parvanov Angelov holds a Personal Chair in Intelligent Systems and is Director of Research at the School of Computing and Communications at Lancaster University, UK. He obtained his PhD in 1993 and his DSc (Doctor of Sciences) degree in 2015, when he also became a Fellow of the IEEE. Prof. Angelov is the founding Director of the Lancaster Intelligent, Robotic and Autonomous systems (LIRA) Research Centre, which brings together over 60 faculty/academics across fifteen different departments of Lancaster University. Prof. Angelov is a Fellow of the European Laboratory for Learning and Intelligent Systems (ELLIS) and of the Institution of Engineering and Technology (IET), as well as a Governor-at-large of the International Neural Networks Society (INNS) for a third consecutive three-year term, following two consecutive terms holding the elected role of Vice President. In the last decade, Prof. Angelov has also been a Governor-at-large of the Systems, Man and Cybernetics Society of the IEEE. Prof. Angelov founded two research groups (the Intelligent Systems Research group in 2010 and the Data Science group in 2014) and was a founding member of the Data Science Institute and of the CyberSecurity Academic Centre of Excellence at Lancaster.

Prof. Angelov has published over 370 publications in leading journals and peer-reviewed conference proceedings, 3 granted US patents, and 3 research monographs (by Wiley, 2012 and Springer, 2002 and 2018), cited over 12,300 times with an h-index of 58 (according to Google Scholar). He has an active research portfolio in the area of computer learning and intelligence, and internationally recognized results in online and evolving learning and algorithms. More recently, his research has focused on explainable deep learning as well as anthropomorphic and empirical computer learning. Prof. Angelov leads numerous projects (including several multimillion ones) funded by UK research councils, EU, industry, European Space Agency, etc.

His research was recognized by the 2020 Dennis Gabor Award for “outstanding contributions to engineering applications of neural networks”, ‘The Engineer Innovation and Technology 2008 Special Award’ and ‘For outstanding Services’ (2013) by IEEE and INNS. He is also the founding co-Editor-in-Chief of Springer’s journal on Evolving Systems and Associate Editor of several leading international journals, including IEEE Transactions on Cybernetics, IEEE Transactions on Fuzzy Systems, IEEE Transactions on AI, etc. He has given over 30 dozen keynote/plenary talks at high-profile conferences. Prof. Angelov was General Co-Chair of a number of high-profile IEEE conferences and is the founding Chair of the Technical Committee on Evolving Intelligent Systems of the Systems, Man and Cybernetics Society of the IEEE. He is also co-chairing the working group (WG) on a new standard on Explainable AI, created by his initiative within the Standards Committee of the Computational Intelligence Society of the IEEE, which he chaired from 2010 to 2012. Prof. Angelov is the founding co-Director of one of the funded programs by ELLIS on Human-centered machine learning. He was a member of the International Program Committee of over 100 international conferences (primarily IEEE conferences).


Acknowledgments

The Editor would like to acknowledge the unwavering support and love of his wife Rositsa, his children (Mariela and Lachezar) as well as of his mother, Lilyana.


Contents

Preface
About the Editor
Acknowledgments
Introduction

Part I: Explainable AI
1. Explainable Artificial Intelligence and Computational Intelligence: Past and Present • Mojtaba Yeganejou and Scott Dick
2. Fundamentals of Fuzzy Set Theory • Fernando Gomide
3. Granular Computing • Andrzej Bargiela and Witold Pedrycz
4. Evolving Fuzzy and Neuro-Fuzzy Systems: Fundamentals, Stability, Explainability, Useability, and Applications • Edwin Lughofer
5. Incremental Fuzzy Machine Learning for Online Classification of Emotions in Games from EEG Data Streams • Daniel Leite, Volnei Frigeri Jr., and Rodrigo Medeiros
6. Causal Inference in Medicine and in Health Policy: A Summary • Wenhao Zhang, Ramin Ramezani, and Arash Naeim

Part II: Supervised Learning
7. Fuzzy Classifiers • Hamid Bouchachia
8. Kernel Models and Support Vector Machines • Denis Kolev, Mikhail Suvorov, and Dmitry Kangin
9. Evolving Connectionist Systems for Adaptive Learning and Knowledge Discovery: From Neuro-Fuzzy to Spiking, Neurogenetic, and Quantum Inspired: A Review of Principles and Applications • Nikola Kirilov Kasabov
10. Supervised Learning Using Spiking Neural Networks • Abeegithan Jeyasothy, Shirin Dora, Suresh Sundaram, and Narasimhan Sundararajan
11. Fault Detection and Diagnosis Based on an LSTM Neural Network Applied to a Level Control Pilot Plant • Emerson Vilar de Oliveira, Yuri Thomas Nunes, Mailson Ribeiro Santos, and Luiz Affonso Guedes
12. Conversational Agents: Theory and Applications • Mattias Wahde and Marco Virgolin
Index

Part III: Deep Learning
13. Deep Learning and Its Adversarial Robustness: A Brief Introduction • Fu Wang, Chi Zhang, Peipei Xu, and Wenjie Ruan
14. Deep Learning for Graph-Structured Data • Luca Pasa, Nicolò Navarin, and Alessandro Sperduti
15. A Critical Appraisal on Deep Neural Networks: Bridge the Gap between Deep Learning and Neuroscience via XAI • Anna-Sophie Bartle, Ziping Jiang, Richard Jiang, Ahmed Bouridane, and Somaya Almaadeed
16. Ensemble Learning • Y. Liu and Q. Zhao
17. A Multistream Deep Rule-Based Ensemble System for Aerial Image Scene Classification • Xiaowei Gu and Plamen P. Angelov

Part IV: Intelligent Control
18. Fuzzy Model-Based Control: Predictive and Adaptive Approach • Igor Škrjanc and Sašo Blažič
19. Reinforcement Learning with Applications in Autonomous Control and Game Theory • Kyriakos G. Vamvoudakis, Frank L. Lewis, and Draguna Vrabie
20. Nature-Inspired Optimal Tuning of Fuzzy Controllers • Radu-Emil Precup and Radu-Codrut David
21. Indirect Self-evolving Fuzzy Control Approaches and Their Applications • Zhao-Xu Yang and Hai-Jun Rong

Part V: Evolutionary Computation
22. Evolutionary Computation: Historical View and Basic Concepts • Carlos A. Coello Coello, Carlos Segura, and Gara Miranda
23. An Empirical Study of Algorithmic Bias • Dipankar Dasgupta and Sajib Sen
24. Collective Intelligence: A Comprehensive Review of Metaheuristic Algorithms Inspired by Animals • Fevrier Valdez
25. Fuzzy Dynamic Parameter Adaptation for Gray Wolf Optimization of Modular Granular Neural Networks Applied to Human Recognition Using the Iris Biometric Measure • Patricia Melin, Daniela Sánchez, and Oscar Castillo
26. Evaluating Inter-Task Similarity for Multifactorial Evolutionary Algorithm from Different Perspectives • Lei Zhou, Liang Feng, Min Jiang, and Kay Chen Tan
Index


Handbook on Computer Learning and Intelligence Introduction by the Editor

You are holding in your hand the second edition of the handbook that was published six years ago as Handbook on Computational Intelligence. The research field has evolved so much during the last five years that we decided to not only update the content with new chapters but also change the title of the handbook. At the same time, the core chapters (15 out of 26) have been written and significantly updated by the same authors. The structure of the book is also largely the same, though some parts evolved to cover new phenomena better. For example, the part Deep Learning evolved from the artificial neural networks, and the part Explainable AI evolved from fuzzy logic. The new title, Computer Learning and Intelligence, better reflects the new state-of-the-art in the area that spans machine learning and artificial intelligence (AI), but is more application and engineering oriented rather than being theoretically abstract or statistics oriented like many other texts that also use these terms. The term computer learning is used instead of the common term machine learning deliberately. Not only does it link with the term computational intelligence but it also reflects more accurately the nature of learning algorithms that are of interest. These are implemented on computing devices spanning from the Cloud to the Edge, covering personal and networked computers and other computerized tools. The term machine is more often used in a mechanical and mechatronic context as a physical device contrasting the (still) electronic nature of the computers. In addition, the term machine learning, the genesis of which stems from the research area of AI, is now used almost as a synonym for statistical learning. In fact, learning is much more than function approximation, parameter learning, and number fitting. An important element of learning is reasoning, knowledge representation, expression and handling (storage, organization, retrieval), as well as decision-making and cognition. 
Statistical learning is quite successful today, but it seems to oversimplify some of these aspects. In this handbook, we aim to provide all perspectives, and therefore, we combine computer learning with intelligence.


Talking about intelligence, we mentioned computational intelligence, which was the subject of the first edition of the handbook. As is well known, this term itself came around toward the end of the last century and covers areas of research that are inspired by, and that “borrow” from or “mimic”, such forms of natural intelligence as the human brain (artificial neural networks), human reasoning (fuzzy logic and systems), and natural evolution (evolutionary systems). Intelligence is also at the core of the widely used term AI. The very idea of developing systems, devices, algorithms, and techniques that possess characteristics of intelligence and are computational (not just conceptual) dates back to the middle of the 20th century or even earlier, but only now is it becoming truly widespread, moving from the labs to real-life applications. In this handbook, while not claiming to provide an exhaustive picture of this dynamic and fast-developing area, we provide some key directions and examples written in self-contained chapters, which are, however, organized in parts on topics such as:

• Explainable AI
• Supervised Learning
• Deep Learning
• Intelligent Control
• Evolutionary Computation

The primary goal of Computer Learning and Intelligence is to provide efficient computational solutions to the existing open problems from theoretical and application points of view with regard to the understanding, representation, modeling, visualization, reasoning, decision, prediction, classification, analysis, and control of physical objects, environmental or social phenomena, etc., to which the traditional methods, techniques, and theories (primarily so-called first principles based, deterministic, or probabilistic, often expressed as differential equations, regression, and Bayesian models, and stemming from mass and energy balance) cannot provide a valid or useful/practical solution. Another specific feature of Computer Learning and Intelligence is that it offers solutions that bear characteristics of intelligence, which is usually attributed to humans only. This has to be considered broadly rather than literally, as is the case for the area of AI. This is, perhaps, clearer with fuzzy logic, where systems can make decisions very much like humans do. This is in stark contrast to the deterministic-type expert systems or probabilistic associative rules. One can argue that artificial neural networks, including deep learning, process the data in a manner that is similar to what the human brain does. For evolutionary computation, the argument is that the population of candidate solutions “evolves” toward the optimum in a manner similar to the way species or living organisms evolve in nature.


The handbook is composed of 2 volumes and 5 parts, which contain 26 chapters. Eleven of these chapters are new in comparison to the first edition, while the other 15 chapters are substantially improved and revised. More specifically, Volume 1 includes Part I (Explainable AI) and Part II (Supervised Learning). Volume 2 includes Part III (Deep Learning), Part IV (Intelligent Control), and Part V (Evolutionary Computation).

In Part I the readers can find six chapters on explainable AI, including:

• Explainable Artificial Intelligence and Computational Intelligence: Past and Present
This new chapter provides a historical perspective of explainable AI. It is written by Dr. Mojtaba Yeganejou and Prof. Scott Dick from the University of Alberta, Canada.

• Fundamentals of Fuzzy Set Theory
This chapter is a thoroughly revised version of the chapter with the same title from the first edition of the handbook. It provides a step-by-step introduction to the theory of fuzzy sets, which itself is anchored in the theory of reasoning and is one of the most prominent and well-developed forms of explainable AI (the others being decision trees and symbolic AI). It is written by one of the leading experts in this area, former president of the International Fuzzy Systems Association (IFSA), Prof. Fernando Gomide from UNICAMP, Campinas, Brazil.

• Granular Computing
This chapter is also a thoroughly revised version from the first edition of the handbook and is written by two of the pioneers in this area, Profs. Witold Pedrycz from the University of Alberta, Canada, and Andrzej Bargiela. Granular computing became a cornerstone of the area of explainable AI, and this chapter offers a thorough review of the problems and solutions that granular computing offers.

• Evolving Fuzzy and Neuro-Fuzzy Systems: Fundamentals, Stability, Explainability, Useability, and Applications
Since its introduction around the turn of the century by Profs. Plamen Angelov and Nikola Kasabov, the area of evolving fuzzy and neuro-fuzzy systems has been constantly developing as a form of explainable AI and machine learning. This chapter, thoroughly revised in comparison to the version in the first edition, offers a review of the problems and some of the solutions. It is authored by one of the leading experts in this area, Dr. Edwin Lughofer from Johannes Kepler University Linz, Austria.

• Incremental Fuzzy Machine Learning for Online Classification of Emotions in Games from EEG Data Streams


This new chapter is authored by Prof. Daniel Leite and Drs. Volnei Frigeri Jr. and Rodrigo Medeiros from Adolfo Ibáñez University, Chile, and the Federal University of Minas Gerais, Brazil. It offers one specific approach to a form of explainable AI and its application to games and EEG data stream processing.

• Causal Inference in Medicine and in Health Policy: A Summary
This new chapter is authored by Drs. Ramin Ramezani and Wenhao Zhang and Prof. Arash Naeim. It describes the very important topic of causality in AI.

Part II consists of six chapters, which cover the area of supervised learning:

• Fuzzy Classifiers
This is a thoroughly revised and refreshed chapter that was also a part of the first edition of the handbook. It covers the area of fuzzy rule-based classifiers, which combine explainable AI and intelligence, more generally, as well as the area of machine or computer learning. It is written by Prof. Hamid Bouchachia from Bournemouth University, UK.

• Kernel Models and Support Vector Machines
This chapter offers a very skillful review of one of the hottest topics in research and applications related to supervised computer learning. It is written by Drs. Denis Kolev, Mikhail Suvorov, and Dmitry Kangin and is a thorough revision of the version that was included in the first edition of the handbook.

• Evolving Connectionist Systems for Adaptive Learning and Knowledge Discovery: From Neuro-Fuzzy to Spiking, Neurogenetic, and Quantum Inspired: A Review of Principles and Applications
This chapter offers a review of one of the cornerstones of computer learning and intelligence, namely the evolving connectionist systems, and is written by the pioneer in this area, Prof. Nikola Kasabov from the Auckland University of Technology, New Zealand.

• Supervised Learning Using Spiking Neural Networks
This is a new chapter written by Drs. Abeegithan Jeyasothy from the Nanyang Technological University, Singapore; Shirin Dora from the University of Ulster, Northern Ireland, UK; Suresh Sundaram from the Indian Institute of Science; and Prof. Narasimhan Sundararajan, who recently retired from the Nanyang Technological University, Singapore. It covers the topic of supervised computer learning of a particular type of artificial neural network that holds a lot of promise: spiking neural networks.

• Fault Detection and Diagnosis Based on an LSTM Neural Network Applied to a Level Control Pilot Plant


This is a thoroughly revised and updated chapter in comparison to the version in the first edition of the handbook. It covers an area of high industrial interest and now includes the long short-term memory type of neural networks. It is authored by Drs. Emerson V. de Oliveira, Yuri Thomas Nunes, and Mailson Ribeiro Santos and Prof. Luiz Affonso Guedes from the Federal University of Rio Grande do Norte, Natal, Brazil.

• Conversational Agents: Theory and Applications
This is a new chapter that combines intelligence with supervised computer learning. It is authored by Prof. Mattias Wahde and Dr. Marco Virgolin from Chalmers University, Sweden.

The second volume consists of three parts. The first of these, Part III, is devoted to deep learning and consists of five chapters, four of which are new and specially written for the second edition of the handbook:

• Deep Learning and Its Adversarial Robustness: A Brief Introduction
This new chapter, written by Drs. Fu Wang, Chi Zhang, Peipei Xu, and Wenjie Ruan from Exeter University, UK, provides a brief introduction to the hot topic of deep learning to understand its robustness under adversarial perturbations.

• Deep Learning for Graph-Structured Data
This new chapter, written by leading experts in the area of deep learning, Drs. Luca Pasa and Nicolò Navarin and Prof. Alessandro Sperduti, describes methods that are specifically tailored to graph-structured data.

• A Critical Appraisal on Deep Neural Networks: Bridge the Gap between Deep Learning and Neuroscience via XAI
This new chapter, written by Anna-Sophie Bartle from the University of Tübingen, Germany; Ziping Jiang and Richard Jiang from Lancaster University, UK; Ahmed Bouridane from Northumbria University, UK; and Somaya Almaadeed from Qatar University, provides a critical analysis of deep learning techniques, trying to bridge the divide between neuroscience and explainable AI.

• Ensemble Learning
This is a thoroughly revised chapter written by the same authors as in the first edition (Dr. Yong Liu and Prof. Qiangfu Zhao) and covers the area of ensemble learning.


• A Multistream Deep Rule-Based Ensemble System for Aerial Image Scene Classification
This is a new chapter contributed by Dr. Xiaowei Gu from the University of Aberystwyth in Wales, UK, and by Prof. Plamen Angelov. It provides an explainable-by-design form of deep learning, which can also take the form of a linguistic rule base and is applied to aerial image scene classification.

Part IV consists of four chapters covering the topic of intelligent control:

• Fuzzy Model-Based Control: Predictive and Adaptive Approach
This chapter is a thorough revision of the chapter from the first edition of the handbook and is co-authored by Prof. Igor Škrjanc and Dr. Sašo Blažič from Ljubljana University, Slovenia.

• Reinforcement Learning with Applications in Autonomous Control and Game Theory
This chapter is written by renowned world leaders in this area, Dr. Kyriakos G. Vamvoudakis from the Georgia Institute of Technology, Prof. Frank L. Lewis from the University of Texas, and Dr. Draguna Vrabie from the Pacific Northwest National Laboratory in the USA. In this chapter, thoroughly revised and updated in comparison to the version in the first edition, the authors describe reinforcement learning that is being applied not only to systems control but also to game theory.

• Nature-Inspired Optimal Tuning of Fuzzy Controllers
This chapter is also a thorough revision from the first edition by the same authors, Prof. Radu-Emil Precup and Dr. Radu-Codrut David, both from the Politehnica University of Timisoara, Romania.

• Indirect Self-evolving Fuzzy Control Approaches and Their Applications
This is a new chapter and is written by Drs. Zhao-Xu Yang and Hai-Jun Rong from Xi'an Jiaotong University, China.

Finally, Part V includes five chapters on evolutionary computation:

• Evolutionary Computation: Historical View and Basic Concepts
This is a thoroughly revised chapter from the first edition of the handbook, which sets the scene with a historical and philosophical introduction of the topic. It is written by one of the leading scientists in this area, Dr. Carlos A. Coello Coello from CINVESTAV, Mexico, and co-authored by Carlos Segura from the Centre of Research in Mathematics, Mexico, and Gara Miranda from the University of La Laguna, Tenerife, Spain.


• An Empirical Study of Algorithmic Bias
This chapter is written by the same author as in the first edition, Prof. Dipankar Dasgupta from the University of Memphis, USA, and co-authored by his associate Dr. Sanjib Sen, but it offers a new topic: an empirical study of algorithmic bias.

• Collective Intelligence: A Comprehensive Review of Metaheuristic Algorithms Inspired by Animals
This chapter is a thoroughly revised version of the chapter by the same author, Dr. Fevrier Valdez from the Institute of Technology, Tijuana, Mexico, who contributed to the first edition.

• Fuzzy Dynamic Parameter Adaptation for Gray Wolf Optimization of Modular Granular Neural Networks Applied to Human Recognition Using the Iris Biometric Measure
This chapter is a thoroughly revised version of the chapters contributed by the authors to the first edition of the handbook. It is written by the leading experts in the area of fuzzy systems, Drs. Patricia Melin and Daniela Sanchez and Prof. Oscar Castillo from the Tijuana Institute of Technology, Mexico.

• Evaluating Inter-task Similarity for Multifactorial Evolutionary Algorithm from Different Perspectives
The last chapter of this part of the second volume and of the handbook is new and is contributed by Lei Zhou and Drs. Liang Feng from Chongqing University, China, and Min Jiang from Xiamen University, China, and Prof. Kay Chen Tan from City University of Hong Kong.

In conclusion, this handbook is a thoroughly revised and updated second edition and is composed with care, aiming to cover all main aspects and recent trends in the computer learning and intelligence area of research and offering solid background knowledge as well as end-point applications. It is designed to be a one-stop shop for interested readers, but by no means aims to completely replace all other sources in this dynamically evolving area of research. Enjoy reading it.

Plamen Parvanov Angelov
Editor of the Handbook
Lancaster, UK
3 June 2022


Part III

Deep Learning


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0013

Chapter 13

Deep Learning and Its Adversarial Robustness: A Brief Introduction

Fu Wang, Chi Zhang, Peipei Xu, and Wenjie Ruan
College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK

Deep learning, one of the most remarkable techniques in computational intelligence, has become increasingly popular and powerful in recent years. In this chapter, we first revisit the history of deep learning and then introduce two typical deep learning models: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). After that, we present how deep learning models are trained and introduce currently popular deep learning libraries and frameworks. We then focus primarily on a newly emerged research direction in deep learning: adversarial robustness. Finally, we show some applications and point out some challenges of deep learning. This chapter cannot exhaustively cover every aspect of deep learning. Instead, it gives a short introduction to deep learning and its adversarial robustness, and provides a taste of what deep learning is, how to train a neural network, why deep learning is vulnerable to adversarial attacks, and how to evaluate its robustness.

13.1. Introduction

As a branch of computational intelligence, deep learning, or deep neural networks (DNNs), has achieved great success in many applications such as image analysis [1], speech recognition [2], and text understanding [3]. Deep learning has demonstrated superior capability in approximating and reducing large, complex datasets into highly accurate predictive and transformational outputs [4].

The origin of deep learning can be traced back to 1943, when Walter Pitts and Warren McCulloch built a mathematical model, the M-P model, to mimic the biological neural networks inside the human brain [5]. The M-P model is the earliest and most basic unit of artificial neural networks. It is an over-simplification of the biological brain yet successfully simulates the fundamental function of a real neuron cell in our brain [5]. As shown in Figure 13.1, a neuron receives signals from other neurons and compares the sum of inputs with a threshold. The output of this neuron is then calculated based on a defined activation function.

Figure 13.1: An M-P model.

In 1958, Frank Rosenblatt proposed the perceptron, one of the simplest artificial neural networks [6]. Figure 13.2 shows the structure of a perceptron. A single-layer perceptron has a very limited learning capacity and cannot even solve the XOR problem [7]. A multilayer perceptron, however, exhibits better learning performance by stacking multiple hidden layers, as shown in Figure 13.2(b). Such a structure is still commonly seen in modern deep neural networks under a new name, the fully connected layer, since the neurons in each layer connect to every neuron in the next layer. Kunihiko Fukushima, known as the inventor of convolutional neural networks, proposed a type of neural network called the neocognitron [8]. It enables the model to easily learn visual patterns by adopting a hierarchical and multilayered structure.

Figure 13.2: Illustrations of (a) a single-layer perceptron and (b) a three-layer perceptron.

In the development of deep learning, there were several technical breakthroughs that significantly boosted its advances [9]. The first notable breakthrough was the backpropagation (BP) algorithm, now a standard method for training neural networks, which computes the gradient of the loss function with respect to the weights in the network by the chain rule. The invention of BP can be traced back to 1970, when Seppo Linnainmaa presented a FORTRAN code for backpropagation in his master’s thesis [10]. In 1985, the concept of BP was applied to neural networks for the first time [11]. Later on, Yann LeCun demonstrated the first practical application of BP, in handwritten zip code recognition [12].

Another significant breakthrough in deep learning is the development of graphics processing units (GPUs), which has improved the computational speed by thousands of times since 1999 [13]. Deep learning has also shown continuously improving performance when more data are used for training. For this reason, Fei-Fei Li launched ImageNet in 2009 [1], an open-source database containing more than 14 million labeled images. With the computational power of GPUs and the well-maintained ImageNet database, deep learning started to demonstrate remarkable performance on many challenging tasks [4]. A notable example is AlexNet [14], which won several international image competitions and significantly improved the best reported performance on multiple image databases [14]. Since then, deep learning has become one of the most important realms in computational intelligence. A more comprehensive introduction to the development of deep learning can be found in [15–17].
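To make the threshold unit and the XOR limitation mentioned in this section concrete, the following is a minimal Python sketch. The function names, the use of signed weights, and the particular weights and thresholds are illustrative assumptions, not taken from [5–7]. A single threshold unit realizes AND, OR, and NAND, while XOR is obtained only by stacking two layers of such units.

```python
def mp_neuron(inputs, weights, threshold):
    """Threshold unit: output 1 when the weighted input sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Single units (weights and thresholds chosen by hand for illustration).
def and_gate(a, b):
    return mp_neuron([a, b], [1, 1], threshold=2)

def or_gate(a, b):
    return mp_neuron([a, b], [1, 1], threshold=1)

def nand_gate(a, b):
    return mp_neuron([a, b], [-1, -1], threshold=-1)

def xor_gate(a, b):
    # Two layers: a hidden layer (OR, NAND) followed by an output unit (AND).
    return and_gate(or_gate(a, b), nand_gate(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_gate(a, b))
```

No single choice of weights and threshold in `mp_neuron` reproduces the XOR truth table, because XOR is not linearly separable; the hidden layer is what makes it representable.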

13.2. Structures of Deep Learning

In this section, we first detail the activation functions that are widely applied in deep neural networks. Then, we introduce two typical deep learning structures: convolutional neural networks and recurrent neural networks.

13.2.1. Activation Functions

Activation functions are a key component of deep neural networks, empowering models with a superior capacity to learn nonlinear and complex representations. Typical activation functions used in deep neural networks include Sigmoid, Hyperbolic Tangent (Tanh), and Rectified Linear Unit (ReLU). The Sigmoid activation function was pervasively adopted at the early stage of neural networks and still plays an important role in deep learning models [18].


Figure 13.3: Illustration of the (a) Sigmoid and (b) Tanh activation functions.

Its mathematical form and derivative can be written as

$$
f_{\mathrm{sig}}(x) = \frac{1}{1 + e^{-x}}, \qquad
\frac{d f_{\mathrm{sig}}(x)}{dx} = f_{\mathrm{sig}}(x)\bigl(1 - f_{\mathrm{sig}}(x)\bigr).
\tag{13.1}
$$

As we can see, the derivative of the Sigmoid function is very easy to compute, and its output range lies in (0, 1). As Figure 13.3(a) shows, the output range of Sigmoid is not zero-centered, which might slow down the training progress [18]. The Hyperbolic Tangent activation function, on the other hand, stretches the output range of the Sigmoid function and makes it zero-centered. The Tanh function and its derivative are given by

$$
f_{\tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\frac{d f_{\tanh}(x)}{dx} = 1 - f_{\tanh}^{2}(x).
\tag{13.2}
$$
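As a quick check of Eqs. (13.1) and (13.2), the sketch below implements both functions and their closed-form derivatives in plain Python and compares them with a central finite difference. The helper `numeric_grad` and the test points are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # Eq. (13.1)

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x):
    return 1.0 - tanh(x) ** 2          # Eq. (13.2)

def numeric_grad(f, x, eps=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

for x in (-2.0, 0.0, 3.0):
    assert abs(sigmoid_grad(x) - numeric_grad(sigmoid, x)) < 1e-6
    assert abs(tanh_grad(x) - numeric_grad(tanh, x)) < 1e-6

print(sigmoid_grad(0.0), tanh_grad(0.0))  # 0.25 1.0
```

At the origin the slope of Tanh is four times that of Sigmoid, and both derivatives shrink towards zero as the magnitude of the input grows.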

As Figure 13.3 shows, the gradient of Tanh is steeper than that of Sigmoid, so deep neural networks with Tanh activation functions usually converge faster in training than those using Sigmoid activation. A common disadvantage of Sigmoid and Tanh is that they both suffer from the vanishing gradient issue. That is, both functions saturate for inputs of large magnitude, where their derivatives are very close to zero; as these small factors are multiplied across layers in backpropagation, the gradient can become very close to zero, preventing the effective update of the model parameters.

The most popular activation functions adopted by deep learning are the Rectified Linear Unit (ReLU) and its variants (visualized in Figure 13.4). ReLU is extremely simple and computationally friendly and can be written as

$$
f_{\mathrm{ReLU}}(x) = \max(0, x), \qquad
\frac{d f_{\mathrm{ReLU}}(x)}{dx} =
\begin{cases}
1, & x \ge 0, \\
0, & x < 0.
\end{cases}
\tag{13.3}
$$

ReLU follows the biological features of neuron cells, which only fire if the inputs exceed a certain threshold. Note that ReLU is non-differentiable at zero (Eq. (13.3) assigns the value one there by convention), and all positive inputs have the same non-zero derivative, one, which could prevent the vanishing gradient problem in backpropagation. The drawback of ReLU activation is known as the dead ReLU problem, i.e., some neurons may never activate and only output zero. One way to address this issue is to give a small slope to negative inputs, which leads to the parametric ReLU activation, denoted by P-ReLU. Its mathematical form is

$$
f_{\mathrm{P\text{-}ReLU}}(x) = \max(ax, x), \qquad
\frac{d f_{\mathrm{P\text{-}ReLU}}(x)}{dx} =
\begin{cases}
1, & x \ge 0, \\
a, & x < 0,
\end{cases}
\tag{13.4}
$$

where a is a pre-defined small parameter, e.g., 0.01 (a variant commonly called Leaky ReLU), or a trainable parameter. Theoretically, P-ReLU inherits ReLU’s benefits and does not have the dead ReLU problem. But there is no substantial evidence that P-ReLU is always better than ReLU. Another variant of ReLU is called the Exponential Linear Unit (ELU), and its mathematical form is

$$
f_{\mathrm{ELU}}(x) =
\begin{cases}
x, & x \ge 0, \\
a(e^{x} - 1), & x < 0,
\end{cases}
\qquad
\frac{d f_{\mathrm{ELU}}(x)}{dx} =
\begin{cases}
1, & x \ge 0, \\
f_{\mathrm{ELU}}(x) + a, & x < 0.
\end{cases}
\tag{13.5}
$$

Compared to ReLU, ELU can utilize negative inputs and is almost zero-centered. Although empirical studies show that ELU can marginally improve the performance of deep neural networks on classification tasks, it is still unconfirmed whether it can fully surpass ReLU. Figure 13.4 shows the visualization of ReLU, P-ReLU, and ELU.

Multiple hidden layers with nonlinear activation functions form the core idea of deep neural networks. Deep learning utilizes various types of layers to extract features from data automatically. In the next section, we introduce convolutional neural networks, one of the deep learning structures that have achieved human-level performance on many image classification tasks [1].
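As a recap of Eqs. (13.3) to (13.5), the following plain-Python sketch implements ReLU, P-ReLU, and ELU together with their derivatives. The default values a = 0.01 for P-ReLU and a = 1.0 for ELU are illustrative assumptions, and the derivative at zero follows the convention of the equations above.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x >= 0 else 0.0

def prelu(x, a=0.01):
    # max(ax, x) matches Eq. (13.4) when a < 1.
    return max(a * x, x)

def prelu_grad(x, a=0.01):
    return 1.0 if x >= 0 else a

def elu(x, a=1.0):
    return x if x >= 0 else a * (math.exp(x) - 1.0)

def elu_grad(x, a=1.0):
    # For x < 0 the derivative a * exp(x) equals f_ELU(x) + a, as in Eq. (13.5).
    return 1.0 if x >= 0 else elu(x, a) + a

for x in (-3.0, -0.5, 0.0, 2.0):
    print(x, relu(x), prelu(x), elu(x), relu_grad(x), prelu_grad(x), elu_grad(x))
```

For negative inputs the ReLU derivative is exactly zero, which is the source of the dead ReLU problem, whereas P-ReLU and ELU keep a small non-zero derivative there.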


Figure 13.4: Illustration of different ReLU activation functions, panels (a) to (c).

Figure 13.5: An example of a convolutional neural network, comprising a convolutional layer with ReLU activation, a max-pooling layer, and two fully connected layers with ReLU activation.

13.2.2. Structures of Convolutional Neural Networks

Convolutional neural networks (CNNs) are widely applied to many computer vision tasks, including image classification, semantic segmentation, and object detection [19]. The connectivity patterns between hidden neurons in CNNs were inspired by the animal visual cortex [20–22]. Such connections based on shared weights allow CNNs to encode visual patterns and empower the network to succeed in image-focused tasks while reducing the number of learnable parameters. Typically, three types of layers are widely adopted by CNNs, i.e., convolutional layers, pooling layers, and fully connected layers. Figure 13.5 shows a simple CNN architecture for MNIST classification [23]. There are four basic components in this DNN architecture.

• Input Layer: It is the layer that takes in input data such as images.
• Convolutional Layer: It is the layer that adopts convolutional kernels to transform the image data into features at different levels. Convolutional layers also utilize activation functions, such as Sigmoid and ReLU, which are used to produce the output of each convolutional kernel.
• Pooling Layer: It is the layer that essentially performs a downsampling operation on its input, with the aim of reducing the number of parameters in the network. The pooling operation is normally max pooling or average pooling.
• Fully Connected Layer: It is the same as a layer in a standard neural network, with the aim of making the final classification or prediction.

The core idea of CNNs is to utilize a series of layer-by-layer transformations, such as convolution and downsampling, to learn the representations of the input data for various classification or regression tasks. Now we explain the convolutional layer and pooling layer, detailing their structures and operations.

13.2.2.1. Convolutional layer

The power of the convolutional layer lies in its ability to capture the spatial and temporal dependencies in the input data (e.g., images) through convolutional kernels. A kernel usually has small dimensions (e.g., 3×3, 5×5) and is convolved with the entire input across its spatial dimensions. As shown in Figure 13.6, the output of a convolutional layer for a single kernel is usually a 2D feature map. Since every convolutional kernel produces a feature map, all feature maps are finally stacked together to form the full output of the convolutional layer. There are three key hyper-parameters that decide the convolutional operation, namely, depth, stride, and zero-padding.

• Depth describes the 3rd (e.g., for gray images) or 4th dimension (e.g., for RGB images) of the output of the convolutional layer. It essentially measures the number of kernels applied in this convolutional layer. As a result, a small depth can significantly reduce the number of hidden neurons, which could also compromise the representation learning capability of the neural network.
• Stride is the size of the step the convolutional kernel moves each time on the input data. The stride is usually 1, meaning the kernel slides one pixel at a time. As the stride increases, the kernel slides over the input with a larger interval, and thus there is less overlap between the receptive fields.
• Zero-padding surrounds the input with zeros so that the feature map does not shrink. In addition to keeping the spatial size constant after convolution, padding also improves performance by preserving information at the border.

Figure 13.6: An example of the convolutional operation: after zero-padding, the kernel and the receptive field are first multiplied element-wise, the products are then summed up, and the bias is added. The kernel traverses the whole input to produce all the elements of the output matrix.

It is important to understand the hyper-parameters in convolutional layers. To visualize the convolutional operation and the impact of depth, stride, and zero-padding, you can refer to the website https://cs231n.github.io/convolutional-networks/#conv, which provides a detailed explanation and a few useful visualizations.

13.2.2.2. Pooling layer

The addition of a pooling layer after the convolutional layer is a common pattern within a convolutional neural network that may be repeated one or more times. The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and summarizing the features lying within the region covered by the filter. Pooling layers reduce the dimensions of the feature maps, which in turn reduces the number of learnable parameters in subsequent layers. The pooling layer also summarizes the features present in a region of the feature map generated by a convolutional layer, so further operations are performed on summarized features instead of the precisely positioned features generated by the convolutional layer. This makes the model more robust to variations in the position of the features in the input image. Typically, max pooling and average pooling are widely used in CNNs.

• Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after the max-pooling layer is a feature map containing the most prominent features of the previous feature map.
• Average pooling computes the average of the elements from the region of the feature map covered by the filter. Thus, average pooling gives the average of the features in a patch.

13.2.2.3. Stacking together

Finally, we can stack convolutional layers and pooling layers by repeating them multiple times, followed by one or a few fully connected layers. In this way, we can build a typical CNN architecture, as shown in Figures 13.5 and 13.7. We can see that, compared to a traditional neural network that only contains fully connected layers, adopting multiple convolutional and pooling layers enables the network to learn complex visual features and further boosts the classification performance.

Figure 13.7: An example of a more complicated convolutional neural network: it contains multiple convolutional layers and max-pooling layers.

13.2.2.4. Summary

We introduced the basic structures of convolutional neural networks in this section. Since AlexNet achieved huge success in the ImageNet competition, CNNs have been predominantly applied in the computer vision area and have significantly improved the state-of-the-art performance on many computer vision tasks. Meanwhile, the development of CNNs has made deep learning models more practical in solving real-world problems. For example, face recognition has been integrated into smartphones to identify their owners and secure sensitive operations, including identity verification and payment. Beyond vision tasks, convolutional layers serve as a fundamental building block that can automatically extract local features, so they are essential components in a wide range of deep learning models, such as WaveNet [24] for speech synthesis and TIMAM [25] for text-to-image matching tasks.
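The convolution and pooling operations described in Sections 13.2.2.1 and 13.2.2.2 can be sketched in plain Python as follows. This is a minimal single-channel illustration, not an efficient implementation: the function names, the square-kernel restriction, and the 4×4 example input are assumptions made here, and the kernel is applied without flipping (cross-correlation), as is common in deep learning libraries.

```python
def conv2d(image, kernel, stride=1, padding=0, bias=0.0):
    """Slide one square kernel over a 2D input with zero-padding and a given stride."""
    h, w, k = len(image), len(image[0]), len(kernel)
    # Zero-padding: surround the input with `padding` rows and columns of zeros.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = bias  # element-wise products are summed, then the bias is added
            for u in range(k):
                for v in range(k):
                    acc += kernel[u][v] * padded[i * stride + u][j * stride + v]
            row.append(acc)
        out.append(row)
    return out

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Summarize each size x size region by its maximum or its average."""
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            patch = [fmap[i + u][j + v] for u in range(size) for v in range(size)]
            row.append(max(patch) if mode == "max" else sum(patch) / len(patch))
        out.append(row)
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
identity = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]  # copies the centre pixel, which makes results easy to check

same = conv2d(image, identity, stride=1, padding=1)     # 4x4: padding keeps the size
valid = conv2d(image, identity, stride=1, padding=0)    # 2x2: the map shrinks
strided = conv2d(image, identity, stride=2, padding=1)  # 2x2: larger steps
print(valid, strided)
print(pool2d(same, mode="max"), pool2d(same, mode="avg"))
```

The output size follows (input + 2 × padding − kernel) // stride + 1 along each dimension, which is why a padding of 1 with a 3×3 kernel and stride 1 preserves the 4×4 size.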

13.2.3. Structures of Recurrent Neural Networks

Recurrent neural networks (RNNs) refer to neural networks with loops and gates, which specialize in sequential processing. With further in-depth research on neural networks, scientists found that although feed-forward neural networks, such as the Multilayer Perceptron (MLP) and CNNs, can handle large grids of values well, they are insufficient for processing sequential data. When modeling human neuronal connectivity, scientists observed that human brain neurons are not one-way connected. Instead, they are connected in a dense web, namely a recurrent network. Inspired by such observations, the notion of RNNs was first built by David Everett Rumelhart in 1986 [11]. In the same year, Michael I. Jordan built a simple RNN structure, the Jordan network, which directly uses the output of the network for feedback [26]. In 1990, Jeffrey L. Elman improved the structure by feeding back the output of the internal state, i.e., the output of the hidden layers [27]. He also trained the network through backpropagation, which produced the RNN model we use today. However, these models face problems such as vanishing and exploding gradients when the networks and data volumes are large. Such problems were not alleviated until the invention of long short-term memory (LSTM) networks by Hochreiter and Schmidhuber in 1997 [28]. The LSTM structure, which includes gates and loops and allows the processing of entire sequences, is widely applied in speech recognition, natural language processing, etc. Based on the LSTM model, Kyunghyun Cho et al. developed gated recurrent units (GRU) [29], which resemble simplified LSTM units. There are also other RNN techniques, such as bidirectional RNNs, which show good performance when combined with LSTM or GRU.

In this section, we introduce the basic structure of RNNs, analyze their working principles by unrolling the loop, and illustrate the unique structures of LSTM and GRU. We also demonstrate the specific applications of RNNs based on their structures and discuss the challenges of this technique.

13.2.3.1. Vanilla RNN and unfolding

Let’s start by introducing a vanilla RNN, which has only one input node x, one hidden unit h, and one output node o, with only one loop. As shown in Figure 13.8, where the weights are listed beside the connections, we initialize the hidden state at the very beginning (t = 0) to 0 and use no bias. Given an input sequence x = 0.5, 1, −1, −2 at time steps 1 to 4, the values of the hidden units and outputs are as shown in the table in Figure 13.8.

Figure 13.8: Simple structure of RNN with three layers and ReLU activation.

Assuming that the input sequence of an RNN is $x = x_1, \ldots, x_n$, that the output of the RNN is $o = o_1, \ldots, o_n$, and that the hidden state is $h = h_0, \ldots, h_n$, we present a more general form of RNN as follows:

$$
h_t = \delta(U x_t + W h_{t-1} + b_h), \qquad
o_t = \varphi(V h_t + b_y),
\tag{13.6}
$$

where $1 \le t \le n$, and $\delta$ and $\varphi$ are the activation functions in nodes h and o.

Nowadays, most analyses of RNNs tend to open the loop structure through an unfolding process. An RNN with a fixed input length can be transformed into an equivalent feed-forward network, which not only provides a more straightforward view but also makes techniques designed for feed-forward networks applicable to RNNs. Figure 13.9 illustrates how a simple RNN unit is transformed into a feed-forward model. This equivalent feed-forward model shares its parameters with the original RNN, while the number of layers grows to the length of the input sequence.

However, for RNNs with only the loop structure, the information at the very beginning may have little influence on the final output when the sequence is too long. This is where the long short-term memory (LSTM) structure comes in: it was proposed to address such problems and performs well when processing long sequential data.

13.2.3.2. Long short-term memory

Unlike standard RNNs, which have only loops, the LSTM network has its own specific structure, the LSTM cell. In the LSTM unit shown in Figure 13.10, the data flow is controlled by an input gate $i_t$, an output gate $o_t$, and a forget gate $f_t$. These gates regulate not only the input and output information between cells but also the cell states themselves. They are realized by combining linear transformations with activation functions, such as Sigmoid $\delta$ and Tanh.

Figure 13.9: Transforming a vanilla RNN into a feed-forward model.
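The unfolding illustrated in Figure 13.9 can be sketched in a few lines of Python for a scalar RNN of the form of Eq. (13.6). The weights U, W, and V below are hypothetical (they are not the weights of Figure 13.8, which are only given in the figure), and ReLU for the hidden unit with an identity output activation is likewise an assumption. The input sequence is the one used in the text.

```python
def relu(x):
    return max(0.0, x)

def rnn_step(x_t, h_prev, U, W, V, b_h=0.0, b_y=0.0):
    """One time step of Eq. (13.6) with scalar weights."""
    h_t = relu(U * x_t + W * h_prev + b_h)
    o_t = V * h_t + b_y          # identity output activation (assumption)
    return h_t, o_t

def run_rnn(xs, U, W, V, h0=0.0):
    """Recurrent form: the same step is applied in a loop."""
    h, outputs = h0, []
    for x_t in xs:
        h, o = rnn_step(x_t, h, U, W, V)
        outputs.append(o)
    return outputs

def unrolled_four_steps(xs, U, W, V):
    """Unfolded form: four stacked layers that share the same U, W, V."""
    h1 = relu(U * xs[0])                 # h0 = 0, so the recurrent term vanishes
    h2 = relu(U * xs[1] + W * h1)
    h3 = relu(U * xs[2] + W * h2)
    h4 = relu(U * xs[3] + W * h3)
    return [V * h for h in (h1, h2, h3, h4)]

xs = [0.5, 1.0, -1.0, -2.0]              # input sequence from the text
U, W, V = 1.0, 0.5, 2.0                  # hypothetical weights
print(run_rnn(xs, U, W, V))              # [1.0, 2.5, 0.0, 0.0]
print(unrolled_four_steps(xs, U, W, V))  # same values
```

Both forms compute identical outputs; the unfolded version simply makes explicit that the depth of the equivalent feed-forward network equals the sequence length.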


Figure 13.10: An LSTM unit.

The mathematical form of an LSTM unit can be formulated as

$$
\begin{aligned}
f_t &= \delta(W_f x_t + U_f h_{t-1} + b_f), \\
i_t &= \delta(W_i x_t + U_i h_{t-1} + b_i), \\
o_t &= \delta(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t, \\
h_t &= o_t \circ \tanh(c_t),
\end{aligned}
\tag{13.7}
$$

where x, h, c, and t represent input, hidden state, cell state, and time point, respectively. W and U denote weight matrices, b denotes a bias, and $\circ$ is the element-wise product. The forget gate $f_t$ transforms the input $x_t$ and the hidden state $h_{t-1}$. It selects useful information in $c_{t-1}$ and abandons the rest. The input gate $i_t$ controls what values should be added to the cell state, and the vector $\tilde{c}_t$ provides all the possible values for addition. $i_t$ and $\tilde{c}_t$ work together to update the cell state. When we have the new cell state $c_t$, we use tanh to scale its value to the range −1 to 1. The output gate $o_t$ plays the role of a filter, which selects the information from the scaled cell state and passes it to the hidden state $h_t$ that will soon be output.

Given the specific cell and gate structure, we do not need to modify the whole sequence of information. Through the selecting and forgetting procedure, we can modify only a part of the data, which improves the accuracy and efficiency of processing long sequential data. There are also different variants of LSTM. For example, LSTM with peephole connections, where gates can directly access cell state information, was quite popular. In the next part, we discuss a simplified version of LSTM, i.e., the gated recurrent unit, which is more commonly used at present.
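One update of the LSTM unit in Eq. (13.7) can be sketched in plain Python with list-based vectors and matrices. The parameter layout (a dictionary mapping the gate names "f", "i", "o", "c" to a tuple (W, U, b)) and the helper `zero_params` are hypothetical choices made for this illustration, and all-zero parameters are used only because they make the result easy to verify by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, b):
    """Compute W x + U h + b, where W and U are lists of rows."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update following Eq. (13.7); p[name] = (W, U, b)."""
    def gate(name, act):
        W, U, b = p[name]
        return [act(z) for z in affine(W, x_t, U, h_prev, b)]

    f_t = gate("f", sigmoid)          # forget gate
    i_t = gate("i", sigmoid)          # input gate
    o_t = gate("o", sigmoid)          # output gate
    c_tilde = gate("c", math.tanh)    # candidate values for the cell state
    c_t = [f * c + i * ct for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def zero_params(n_in, n_hid, names):
    """Hypothetical helper: all-zero (W, U, b) for each named gate."""
    return {
        name: ([[0.0] * n_in for _ in range(n_hid)],
               [[0.0] * n_hid for _ in range(n_hid)],
               [0.0] * n_hid)
        for name in names
    }

params = zero_params(n_in=1, n_hid=2, names="fioc")
h, c = lstm_step([1.0], [0.0, 0.0], [2.0, -2.0], params)
print(h, c)
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the cell state is halved and the hidden state is 0.5 · tanh of the new cell state. Giving the forget gate a large positive bias instead would keep the previous cell state almost unchanged.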


13.2.3.3. Gated recurrent unit

Compared with LSTM, the gated recurrent unit (GRU) contains only two gates, i.e., the update gate and the reset gate; the output gate is omitted. This structure was first presented by Kyunghyun Cho et al. in 2014 to deal with the vanishing gradient problem when processing long sequential data with fewer gates. The structure of GRU is shown in Figure 13.11. Unlike LSTM with its cell state, GRU has only the hidden state $h_t$ for transmission between units and output delivery. Therefore, a GRU has fewer parameters to learn during the training process and shows better performance on smaller datasets.

Figure 13.11: A gated recurrent unit.

Equation (13.8) presents the mathematical form of GRU, in which $x_t$ represents the input data, W and U are weight matrices, and b indicates the bias. To generate the update gate $z_t$, we first make a linear transformation of both the input $x_t$ and the hidden state $h_{t-1}$ from the last time point, and then apply a nonlinear transformation by the sigmoid function. The resulting value $z_t$ helps decide which data to pass to the next time steps. The reset gate $r_t$ has a structure similar to that of the update gate, only with different weight parameters $W_r$, $U_r$ and bias $b_r$. The reset gate selects useful information from the hidden state $h_{t-1}$ at the last time point and forgets the irrelevant information. Combined with a linear transformation and tanh activation, it delivers the candidate hidden state $\tilde{h}_t$, which may also be referred to as the current memory content, where useful information from the past is kept. The last step is to update the hidden state and output $h_t$. We use the update gate $z_t$ to take element-wise products with both $h_{t-1}$ and $\tilde{h}_t$ and deliver the new hidden state $h_t$, which can be written as

$$
\begin{aligned}
z_t &= \delta(W_z x_t + U_z h_{t-1} + b_z), \\
r_t &= \delta(W_r x_t + U_r h_{t-1} + b_r), \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h), \\
h_t &= (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t.
\end{aligned}
\tag{13.8}
$$
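A single GRU update following Eq. (13.8) can be sketched in the same list-based style. As before, the parameter layout (a dictionary mapping "z", "r", "h" to a tuple (W, U, b)), the helper `zero_params`, and the all-zero parameter values are assumptions chosen so that the output can be verified by hand.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, U, h, b):
    """Compute W x + U h + b, where W and U are lists of rows."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def gru_step(x_t, h_prev, p):
    """One GRU update following Eq. (13.8); p[name] = (W, U, b)."""
    Wz, Uz, bz = p["z"]
    Wr, Ur, br = p["r"]
    Wh, Uh, bh = p["h"]
    z_t = [sigmoid(v) for v in affine(Wz, x_t, Uz, h_prev, bz)]    # update gate
    r_t = [sigmoid(v) for v in affine(Wr, x_t, Ur, h_prev, br)]    # reset gate
    reset_h = [r * h for r, h in zip(r_t, h_prev)]                 # r_t ∘ h_{t-1}
    h_tilde = [math.tanh(v) for v in affine(Wh, x_t, Uh, reset_h, bh)]
    return [(1.0 - z) * h + z * ht for z, h, ht in zip(z_t, h_prev, h_tilde)]

def zero_params(n_in, n_hid, names):
    """Hypothetical helper: all-zero (W, U, b) for each named block."""
    return {
        name: ([[0.0] * n_in for _ in range(n_hid)],
               [[0.0] * n_hid for _ in range(n_hid)],
               [0.0] * n_hid)
        for name in names
    }

params = zero_params(n_in=1, n_hid=2, names="zrh")
h_new = gru_step([1.0], [2.0, -4.0], params)
print(h_new)  # [1.0, -2.0]
```

With all-zero parameters the update gate equals 0.5 and the candidate state is 0, so the new hidden state is half of the previous one. The code also makes the parameter saving visible: Eq. (13.8) uses three (W, U, b) triples, compared with four in Eq. (13.7).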


The GRU model is quite popular in polyphonic music modeling, speech signal modeling, handwriting recognition, etc. Because of its relatively few parameters, it is more efficient in the training process than LSTM, especially when dealing with smaller datasets. However, it still shows limitations in machine translation and language recognition scenarios and may not wholly outperform LSTM.

13.2.3.4. Application in natural language processing

Unlike CNNs, which commonly appear in computer vision tasks, RNNs have specific loop structures and a memory function that lead to different application scenarios. The previous inputs of an RNN influence its current output. Therefore, RNNs are widely used in processing sequential data and solving temporal problems; many natural language processing (NLP) tasks fall into this area, e.g., machine translation, named-entity recognition, sentiment analysis, etc. Bengio et al. [30] made the first attempt at applying deep learning to NLP tasks in 2003, and Collobert et al. demonstrated that multiple-layer neural networks may be a better solution compared to traditional statistical NLP features. The real revolution started with word embedding in 2013, when the word2vec toolbox [31] showed that pre-trained word vectors can improve model performance on most downstream tasks. At that time, word embedding could not handle the polysemy of human language properly, until Peters et al. [3] proposed a deep contextualized word representation, also known as embeddings from language models (ELMo), which provides an elegant solution to the polysemy problem. Note that ELMo uses BiLSTM, or bidirectional LSTM, as the feature extractor [32]. Many studies have since demonstrated that the ability of the Transformer to extract features is far stronger than that of the LSTM. ELMo and the Transformer had a profound impact on subsequent research and inspired a large number of studies, such as GPT [33] and BERT [34].

13.2.3.5. Summary

In this section, we first introduced the basic concept of recurrent neural networks and the working principle of vanilla RNNs and then presented two representative models, LSTM and GRU, which are capable of handling long sequential data. Moreover, the gate structure allows modifying only small parts of the data during long sequential data processing, avoiding the vanishing gradient situation. Apart from these two variants, there are various other RNN models for different application scenarios, such as the bidirectional RNN, deep RNN, continuous-time recurrent neural network, and so on. Although classic RNNs are no longer the best option in NLP, the various structures of RNNs still enable their use in a wide range of sequence processing applications.


Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in


Deep Learning and Its Adversarial Robustness: A Brief Introduction


13.3. Training Deep Learning Models

A great number of parameters gives deep neural networks their remarkable representation capability, but the difficulty of training neural networks grows exponentially with their scale. In this section, we briefly discuss the development of training methods and tricks.

13.3.1. Gradient Descent and Backpropagation

From an optimization perspective, training a network $f_\theta$ is normally formulated as a minimization problem,

$$\min_{\theta} L(\theta), \tag{13.9}$$

where we wish to minimize the expected risk, i.e., the expectation of a given loss function $\ell$ over all examples drawn i.i.d. from a distribution $D$, which can be written as

$$L_D(\theta) = \mathbb{E}_{(x,y)\sim D}\, \ell(f_\theta(x), y). \tag{13.10}$$

However, in practice, we can only access a finite training set $S = \{(x_i, y_i)\}_{i=1}^{n}$, so the corresponding empirical risk or training loss is written as

$$L_S(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i). \tag{13.11}$$

Assume that $\ell$ is continuous and differentiable. The landscape of the loss is like a mountain area in a high-dimensional space, where the dimension depends on the number of parameters in the trained model. Our task can be viewed as searching for the lowest valley in the mountains. Imagine that we start at a random location in the mountains. All we can do is look locally and climb down the hill step by step until we reach a local minimum point. This iterative process is achieved by gradient descent (GD), which can be written as

$$\theta_{k+1} = \theta_k - \eta_k \nabla L_S(\theta_k), \tag{13.12}$$

where $\nabla L_S(\theta_k)$ is the gradient of the training loss with respect to $\theta$, whose negative gives the direction to climb down, and $\eta_k$ is the step size, which is also known as the learning rate. When we use GD to train a deep learning model, the gradient $\nabla L_S(\theta)$ is computed from the last layer to the first layer via the chain rule. Therefore, the gradients of the model's weights are obtained in reverse order, layer by layer, and this process is known as backpropagation. As one of the landmark findings in supervised learning, backpropagation is a foundation of deep learning.
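As a minimal sketch of (13.11) and (13.12), the following NumPy code trains a small two-layer network with full-batch gradient descent and computes the gradient by hand, from the last layer back to the first. The toy data, the network size, the squared loss, the constant learning rate, and the names `forward`, `backward`, and `loss` are all assumptions made for illustration, not part of the method described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration): y = sin(x) on [-2, 2].
X = np.linspace(-2.0, 2.0, 64).reshape(-1, 1)
Y = np.sin(X)

# Parameters theta of f_theta(x) = W2 tanh(W1 x + b1) + b2.
theta = {
    "W1": rng.normal(0.0, 0.5, size=(1, 8)),
    "b1": np.zeros(8),
    "W2": rng.normal(0.0, 0.5, size=(8, 1)),
    "b2": np.zeros(1),
}

def forward(theta, X):
    """Return the prediction and the hidden activation needed by backward()."""
    H = np.tanh(X @ theta["W1"] + theta["b1"])
    return H @ theta["W2"] + theta["b2"], H

def loss(theta, X, Y):
    """Empirical risk L_S(theta) with the squared loss, as in (13.11)."""
    pred, _ = forward(theta, X)
    return float(np.mean((pred - Y) ** 2))

def backward(theta, X, Y):
    """Gradient of L_S via the chain rule, from the last layer to the first."""
    pred, H = forward(theta, X)
    n = X.shape[0]
    d_pred = 2.0 * (pred - Y) / n             # dL/d(pred)
    grads = {"W2": H.T @ d_pred, "b2": d_pred.sum(axis=0)}
    d_H = d_pred @ theta["W2"].T              # propagate to the hidden layer
    d_Z = d_H * (1.0 - H ** 2)                # derivative of tanh
    grads["W1"] = X.T @ d_Z
    grads["b1"] = d_Z.sum(axis=0)
    return grads

eta = 0.05                                    # constant learning rate eta_k
initial_loss = loss(theta, X, Y)
for k in range(500):
    grads = backward(theta, X, Y)
    for name in theta:                        # theta_{k+1} = theta_k - eta * gradient
        theta[name] = theta[name] - eta * grads[name]
final_loss = loss(theta, X, Y)
```

Comparing `initial_loss` with `final_loss` shows whether the descent made progress; in practice the hand-written `backward` would be replaced by the automatic differentiation of a deep learning framework.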


13.3.2. Stochastic Gradient Descent and Mini-Batching

Note that in GD, we update the weights only after the model has seen the whole training set, which is feasible yet not efficient. In other words, GD heads in the steepest direction at each step, while we only need to climb down in roughly the right direction. Following this intuition, stochastic gradient descent (SGD) takes the gradient of the loss for a single training example as an estimate of the actual gradient and conducts gradient descent via backpropagation based on this stochastic estimate, as shown in Figure 13.12(b). By doing so, SGD speeds up GD from two perspectives. First, there might be repeated examples in the training set. For instance, if we copy an example ten times to build a training set and train a model on it, GD is ten times slower than SGD. Second, even when there are no redundant examples in the training set, SGD does not require the model to go over all examples before each update, so it updates the weights much more frequently. Although the stochastic estimate of the actual gradient introduces random noise into gradient descent, training actually benefits from the noise. At the early training stage, the model has barely learned any information from the training set. Therefore, the noise is relatively small, and an SGD step is basically as good as a GD step. Furthermore, the noise in SGD steps can prevent convergence to a bad local minimum in the later stages. On the downside, SGD updates the weights for every example, and sometimes the training effects of two extremely different examples may even cancel out. Hence, SGD takes more steps than GD to converge to an ideal point and cannot make the best use of the parallel computation offered by the graphics processing unit (GPU). A promising way to address these drawbacks is to calculate the estimate over a subset $B$ of the training set, namely a mini-batch.
An intuitive way to partition the training set into mini-batches is to randomly shuffle the training set $S$ and divide it into several parts as uniformly as possible. Let $S = \{B_1, B_2, \ldots, B_m\}$ represent the partition result. For any $B_j \in S$, the corresponding loss is written as

$$L_{B_j}(\theta) = \frac{1}{|B_j|} \sum_{i=1}^{|B_j|} \ell(f_\theta(x_i), y_i), \tag{13.13}$$

where the sum runs over the examples in $B_j$ and $|\cdot|$ returns the size of the input batch, namely the batch size. Although the estimate is calculated on multiple examples, the time consumption is usually similar to that of a regular SGD step when employing a GPU with enough memory. By doing so, the number of backpropagation passes per epoch is significantly reduced, and the training effects of any two mini-batches can hardly cancel each other out. Figure 13.12 shows the differences among GD, SGD, and SGD with mini-batching.
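The shuffle-and-partition scheme and the mini-batch loss (13.13) can be sketched as follows. This is only an illustration on a toy linear-regression problem; the data, the squared loss, the learning rate, and the helper names `batch_grad` and `run_sgd` are assumptions. Setting the batch size to $n$ recovers GD, and setting it to 1 recovers plain SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression problem (assumed for illustration).
n, d = 1000, 5
X = rng.normal(size=(n, d))
true_w = np.arange(1.0, d + 1.0)
y = X @ true_w + 0.1 * rng.normal(size=n)

def batch_grad(w, idx):
    """Gradient of the mean squared loss over the examples indexed by idx."""
    residual = X[idx] @ w - y[idx]
    return 2.0 * X[idx].T @ residual / len(idx)

def run_sgd(batch_size, epochs=20, eta=0.05):
    """Train for a fixed number of epochs and count the weight updates."""
    w = np.zeros(d)
    updates = 0
    for _ in range(epochs):
        order = rng.permutation(n)                # shuffle S once per epoch
        for start in range(0, n, batch_size):     # partition into B_1, ..., B_m
            idx = order[start:start + batch_size]
            w = w - eta * batch_grad(w, idx)      # one step on the loss (13.13)
            updates += 1
    return w, updates

w_gd, steps_gd = run_sgd(batch_size=n)            # full batch: one update per epoch
w_mb, steps_mb = run_sgd(batch_size=32)           # mini-batches: many updates per epoch
err_gd = float(np.linalg.norm(w_gd - true_w))
err_mb = float(np.linalg.norm(w_mb - true_w))
```

With the same number of passes over the data, the mini-batch run performs many more updates than the full-batch run, which is the efficiency argument made above.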


Figure 13.12: An illustration of the pipelines of (a) gradient descent (GD), (b) stochastic gradient descent (SGD), and (c) SGD with mini-batching.

13.3.3. Momentum and Adaptive Methods

Although SGD with mini-batching outperforms regular SGD when training deep learning models under most scenarios, adding a momentum term can make it even better. The idea of this improvement is similar to the concept of momentum in physics. If SGD is a man trying to climb down a hill step by step, SGD with momentum can be imagined as a ball rolling in the same area. One mathematical form of momentum can be written as

$$p_0 = 0, \quad p_{k+1} = \beta p_k + \nabla L(\theta_k), \quad 0 \le \beta \le 1,$$
$$\theta_{k+1} = \theta_k - \eta_k p_{k+1}. \tag{13.14}$$

The momentum term can be viewed as a weighted combination of the current gradient and the momentum term from the previous step, where $\beta$ is the weight factor. If we unroll (13.14) at any specific step, it can be seen that the gradients of previous steps have accumulated in the current momentum term, where older gradients decay exponentially according to $\beta$ and more recent gradients have a larger impact at present. This exponential decay is known as the exponential moving average, which we will see again later. Goh [35] provided a vivid visualization of the impact of momentum, where SGD with momentum makes modest moves at each step and eventually finds a better solution than SGD. The insight behind the improvement is that the historical gradient information can smooth the noise in SGD steps and dampen the oscillations. Besides momentum, another intuitive way to enhance SGD is to add adaptivity to the training process. In the above methods, we assume that all


parameters in the trained model are updated via a global learning rate, while adaptive methods aim to adjust the learning rate for each parameter automatically during training. The earliest attempt in this research direction is the adaptive gradient (AdaGrad) method. AdaGrad starts training with SGD at a global learning rate. For a specific parameter, AdaGrad decreases the corresponding learning rate according to the parameter's gradient values, which are calculated via backpropagation. The update process can be written as

$$s_0 = 0, \quad s_{k+1} = s_k + \nabla L(\theta_k)^2, \quad \theta_{k+1} = \theta_k - \frac{\eta_k}{\sqrt{s_{k+1}} + \epsilon} \nabla L(\theta_k), \tag{13.15}$$

where the square, square root, and division are applied element-wise and $\epsilon$ is added to avoid division by zero. The learning rate of a specific parameter will be enlarged if the corresponding item in $s_{k+1}$ is smaller than 1; otherwise it will be shrunk. Note that in AdaGrad, all previous gradient information is treated equally, so most parameters' learning rates decrease during training due to the accumulation of historical gradients. One problem with this adapting strategy is that the training effect of AdaGrad at the later training stage can hardly be maintained because of the small learning rates. The root mean square propagation (RMSProp) method addresses this by taking a weighted average of the historical accumulation term and the square of the current gradient, which can be formalized as

$$v_0 = 0, \quad v_{k+1} = m v_k + (1 - m) \nabla L(\theta_k)^2, \quad \theta_{k+1} = \theta_k - \frac{\eta_k}{\sqrt{v_{k+1}} + \epsilon} \nabla L(\theta_k), \tag{13.16}$$

where $m \in [0, 1)$ is introduced to achieve an exponential moving average of past squared gradients. Compared to AdaGrad, newer gradients have larger weights in RMSProp, which prevents the learning rate from continuously decreasing over training. The Adam method combines RMSProp and momentum and is the most widely used method at present. An Adam step requires calculating the exponential moving averages of the momentum and of the square of past gradients, which are given by

$$v_0 = 0, \quad p_0 = 0,$$
$$p_{k+1} = \beta_1 p_k + (1 - \beta_1) \nabla L(\theta_k), \quad v_{k+1} = \beta_2 v_k + (1 - \beta_2) \nabla L(\theta_k)^2. \tag{13.17}$$


Then the parameters are updated via

$$\theta_{k+1} = \theta_k - \eta_k \frac{p_{k+1}}{\sqrt{v_{k+1}} + \epsilon}. \tag{13.18}$$

Although Adam normally performs better than AdaGrad and RMSProp in practice, it has two momentum parameters that need to be tuned and uses more memory, because two moving averages must be stored for every parameter. On the other hand, our understanding of its theory is still poor. From the practical perspective, Adam is necessary for training some networks for NLP tasks, while both SGD with momentum and Adam generally work well for training CNN models. However, using Adam to minimize the training loss may lead to worse generalization errors than SGD, especially in image problems.
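To compare the update rules (13.14) to (13.18) side by side, the sketch below applies each of them to the same ill-conditioned quadratic. The toy loss, the starting point, the hyperparameter values, and the function names are assumptions chosen for illustration. The Adam step follows (13.17) and (13.18) as written above; widely used implementations additionally apply a bias correction to the two moving averages, which is omitted here.

```python
import numpy as np

CURVATURE = np.array([1.0, 25.0])     # toy loss L(theta) = 0.5 * sum(c * theta^2)

def grad(theta):
    return CURVATURE * theta

def loss(theta):
    return 0.5 * float(np.sum(CURVATURE * theta ** 2))

def momentum_step(theta, state, eta=0.01, beta=0.9):
    state["p"] = beta * state["p"] + grad(theta)                   # (13.14)
    return theta - eta * state["p"]

def adagrad_step(theta, state, eta=0.5, eps=1e-8):
    g = grad(theta)
    state["s"] = state["s"] + g ** 2                               # (13.15)
    return theta - eta / (np.sqrt(state["s"]) + eps) * g

def rmsprop_step(theta, state, eta=0.01, m=0.9, eps=1e-8):
    g = grad(theta)
    state["v"] = m * state["v"] + (1.0 - m) * g ** 2               # (13.16)
    return theta - eta / (np.sqrt(state["v"]) + eps) * g

def adam_step(theta, state, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(theta)
    state["p"] = beta1 * state["p"] + (1.0 - beta1) * g            # (13.17)
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * g ** 2
    return theta - eta * state["p"] / (np.sqrt(state["v"]) + eps)  # (13.18)

def run(step, steps=500):
    """Apply one update rule repeatedly from a fixed start and return the final loss."""
    theta = np.array([2.0, 1.0])
    state = {"p": np.zeros(2), "s": np.zeros(2), "v": np.zeros(2)}
    for _ in range(steps):
        theta = step(theta, state)
    return loss(theta)

final_losses = {
    "momentum": run(momentum_step),
    "adagrad": run(adagrad_step),
    "rmsprop": run(rmsprop_step),
    "adam": run(adam_step),
}
```

Each rule keeps its own state (`p`, `s`, or `v`) per parameter, which makes the extra memory cost of the adaptive methods visible: Adam stores two arrays of the same shape as the parameters.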

13.3.4. Challenges in Training

Apart from the high demand for computational resources, training deep neural networks faces two main challenges: overfitting and gradient vanishing. Overfitting is well studied in machine learning and normally occurs when the trained model is too complex or the training set is not large enough. However, in the scope of deep learning, because we do not really understand why deep learning works, there are no methods or mathematical theorems that can tell us how to quantify a specific model's capability or how many parameters and layers a model needs to handle a specific task. Nevertheless, existing studies show that increasing the model's size in a proper way is the most likely way to increase the model's performance. Therefore, the complexity of neural networks keeps growing. In the previous sections, we described how neural networks are trained in supervised learning by default. Under this scenario, we need a great amount of labeled data to train our models. However, generating such data depends heavily on human participation, which is much slower than the evolution of neural network architectures. Thus, overfitting becomes an inevitable problem, and all we can do is mitigate it. As mentioned in Section 13.2.1, multiple nesting of Sigmoid and Tanh may drive the gradient close to zero and cause gradient vanishing. Before ReLU came out, Hinton et al. [36] proposed an unsupervised layer-wise training approach, which trains only one layer at a time to avoid gradient vanishing. For a given hidden layer, this method takes the output of the previous layer as input to conduct unsupervised training and then repeats the same process on the next layer. After all the layers are pre-trained, gradient descent is used to fine-tune the model's weights. In fact, this pre-training with fine-tuning approach first partitions the model's weights and then finds a locally optimal setting for each group.
The global training starts from the combination of these pre-trained weights, which makes the training process much more efficient. Due to the advancement of computer hardware, this method is rarely


Figure 13.13: Illustration of applying dropout in a multiple-layer perceptron: (a) the original network; (b) the network with some hidden units discarded.

used in training small and medium-size neural networks now. But this paradigm is still very useful when training very large language models.

Dropout is an interesting approach to prevent overfitting. Suppose we have a multiple-layer perceptron as shown in Figure 13.13(a). Applying dropout during training means that hidden units are randomly discarded with a given probability when performing forward propagation. A discarded unit in Figure 13.13(b) simply outputs zero no matter what its inputs are, and its weights remain unchanged at that iteration. Because every hidden unit might be deactivated, dropout introduces random noise into the output of hidden layers. Correspondingly, units in the following layer are forced to find useful patterns and make predictions independently because no specific dimension of their inputs is reliable. Note that dropout is only applied in the training process, and a sub-model that inherits parameters from the original model is sampled at each training iteration. At inference, the full model is used, which effectively integrates all these trained sub-models to give predictions and thus reflects the idea of ensemble learning.

Batch normalization is one of the recent milestone works in deep learning that can mitigate both gradient vanishing and overfitting. It is applied as a so-called batch normalization layer and plays an important role in improving the training and convergence speed. Neural networks are supposed to learn the distribution of the training data. If the distribution keeps changing, learning will be difficult. Unfortunately, for a specific hidden layer, the distribution of its previous layer's output does change after each weight update if the activation function is not zero-centered. Ioffe et al. [37] used the term internal covariate shift to describe this phenomenon. A straightforward idea to address the internal covariate shift is to normalize the output of each layer of the neural network.
Suppose that we normalize those outputs to zero mean and unit variance. However, when the output of every layer is forced to follow the same standardized distribution, the information carried by the original scale and offset of the features is lost, and the neural network can hardly learn useful features of the training data. Therefore, it is unreasonable to directly normalize each layer.


To address this, batch normalization introduces two trainable parameters to translate and scale the normalized results to retain information. Generally, these parameters, together with the normalization statistics, are calculated on the training set and directly used to process data at testing. Although we still do not fully understand why batch normalization is helpful, including it in our neural networks allows training to be more aggressive. Specifically, batch normalization reduces sensitivity to weight initialization and makes the networks easier to optimize, so we can use larger learning rates to speed up the training process. On the other hand, the statistics used for normalization can be viewed as a source of extra noise because each mini-batch is randomly sampled. As in dropout and SGD, extra noise enables better generalization and prevents overfitting. However, there might be a difference between the distributions of the training data and the test data, and the pre-calculated normalization cannot suit the test data perfectly, which leads to inconsistency in the model's performance at training and testing.

As shown in Figure 13.14, batch normalization works on the batch dimension. We can naturally conduct normalization along different dimensions, which motivates the layer, instance, and group normalization methods. The first two methods are straightforward, while the intuition behind group normalization is relatively hard to grasp. Recall that there are many convolution kernels in each layer of a CNN model. The features learned by these kernels are not completely independent. Some of them should have the same distribution and thus can be grouped. In the visual system, there are many physical factors that can cause grouping, such as frequency, shape, brightness, texture, etc. Similarly, in the biological visual system, the responses of neuron cells would also be normalized.

So far, we have only briefly described a few tricks and methods that can make training neural networks easier.
Deep learning is a very active field, and there are scores of well-developed methods we have not mentioned. If readers are interested in more, they are advised to further study the existing literature according to specific needs.
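Returning to dropout as described earlier in this subsection, the following sketch shows how hidden units can be discarded during training and kept at inference. It uses the common "inverted" scaling, in which the kept activations are divided by the keep probability during training so that nothing needs to be rescaled at inference; this scaling convention, the function name `dropout`, and the all-ones activations are assumptions for illustration rather than details taken from the text.

```python
import numpy as np

def dropout(h, drop_prob, training, rng):
    """Inverted dropout on hidden activations h."""
    if not training or drop_prob == 0.0:
        return h                                   # inference: every unit is used
    keep_prob = 1.0 - drop_prob
    mask = rng.random(h.shape) < keep_prob         # True means the unit is kept
    return h * mask / keep_prob                    # rescale to keep the expectation

rng = np.random.default_rng(0)
h = np.ones((10000, 8))                            # hypothetical hidden-layer output
h_train = dropout(h, drop_prob=0.5, training=True, rng=rng)
h_test = dropout(h, drop_prob=0.5, training=False, rng=rng)
dropped_fraction = float(np.mean(h_train == 0.0))
```

A fresh mask is drawn on every call, so each training iteration effectively samples a different sub-model that shares its parameters with the full network.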

Figure 13.14: Illustration of different types of normalization layers [38]. C, N, and H, W are shorthands for channels, instances in a batch, and spatial location.
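The normalization variants in Figure 13.14 differ only in the axes over which the mean and variance are computed. The sketch below makes this explicit for a tensor of shape (N, C, H, W). It covers the training-time computation only; the running statistics that batch normalization keeps for use at testing are left out, and the function names and the value of `eps` are assumptions for illustration.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x to zero mean and unit variance over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, gamma, beta):
    """One mean/variance per channel, followed by the trainable scale and shift."""
    return gamma * normalize(x, axes=(0, 2, 3)) + beta

def layer_norm(x):
    """One mean/variance per instance."""
    return normalize(x, axes=(1, 2, 3))

def instance_norm(x):
    """One mean/variance per instance and channel."""
    return normalize(x, axes=(2, 3))

def group_norm(x, groups):
    """One mean/variance per instance and group of channels."""
    n, c, h, w = x.shape
    grouped = x.reshape(n, groups, c // groups, h, w)
    return normalize(grouped, axes=(2, 3, 4)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))   # hypothetical feature maps
gamma = np.array([1.0, 2.0, 0.5, 1.5]).reshape(1, 4, 1, 1)
beta = np.array([0.0, 1.0, -1.0, 0.5]).reshape(1, 4, 1, 1)
bn = batch_norm(x, gamma, beta)
```

Group normalization sits between the two extremes: with one group it coincides with layer normalization, and with as many groups as channels it coincides with instance normalization.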


13.3.5. Frameworks

At the end of this section, we want to give a quick review of deep learning frameworks that greatly boost the development of this area. Although deep learning benefits a lot from advanced GPUs and other hardware, it is not easy for most users and researchers who are not majoring in computer science to make use of such improvements. Deep learning frameworks enable people to implement and customize neural networks in their projects. Theano was the first framework that focused on deep learning study. It was developed by Mila, headed by Yoshua Bengio, back in 2008. Caffe is also an early open-source deep learning framework, which was started by Yangqing Jia during his PhD at UC Berkeley. At roughly the same period, Microsoft Research Asia presented a framework called Minerva, which may be the first one that introduced the concept of the computational graph into deep learning implementation. In a computational graph, all components, e.g., input examples, model weights, and even operations, are treated as graph nodes. During forward propagation, the data flow from the input layer to the output layer, where computation is done at each graph node. Meanwhile, backpropagation can directly utilize the intermediate results kept by the computational graph to avoid repeated calculations when updating model weights. Published by Google Brain, TensorFlow further develops the computational graph and is now the most popular framework in the industry. It includes most deep learning algorithm models and is supported by an active open-source community. Keras is an advanced framework on top of other frameworks. It provides a unified API to utilize different frameworks and is probably the best option for new programmers. The latest version of TensorFlow also adopts Keras's API to simplify its usage and is now more friendly for beginners. Meanwhile, in the research area, PyTorch is also a popular option. This framework originated from Torch, a Lua-based machine learning library, and was improved by Facebook. PyTorch dynamically builds the corresponding computational graph when performing deep learning tasks, allowing users to customize models and verify their ideas easily. There are many more deep learning frameworks available now, such as MXNet, PaddlePaddle, MegEngine, MindSpore, and so on, which benefit a wide range of users. Meanwhile, it can be seen on GitHub that a growing number of third-party libraries have been developed to enhance these frameworks. Overall, the rapid development of deep learning is inseparable from these open-source frameworks. There is no such thing as the best deep learning framework, and we should appreciate those developers, engineers, and scientists who have contributed to such projects.
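The idea of a computational graph that stores intermediate results for backpropagation can be illustrated in a few lines of plain Python. The `Node` class and the `backward` function below are hypothetical and handle only scalar addition and multiplication; they do not reproduce the API of any of the frameworks mentioned above.

```python
class Node:
    """A node in a tiny scalar computational graph (hypothetical, for illustration)."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # result stored during the forward pass
        self.parents = parents          # nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one entry per parent
        self.grad = 0.0                 # d(output)/d(self), filled in by backward()

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))


def backward(output):
    """Reverse-mode differentiation: visit nodes from the output back to the inputs."""
    order, seen = [], set()

    def visit(node):                    # depth-first topological sort
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local   # chain rule with stored values


# Forward pass for y = w * x + b, followed by the backward pass.
w, x, b = Node(3.0), Node(2.0), Node(1.0)
y = w * x + b
backward(y)
```

Because the graph is built while the expression `w * x + b` is evaluated, this sketch follows the dynamic style described for PyTorch; a static-graph framework would instead define the graph first and run it afterwards.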

13.4. The Robustness of Deep Learning

Given the prospect of broad deployment of DNNs in a wide range of applications, it becomes critical that a deep learning model obtain satisfactory yet trustworthy


performance. Unfortunately, a large number of recent research studies have demonstrated that deep learning models are actually too fragile or extremely nonrobust to be applied in high-stakes applications. Such vulnerability has raised concerns about the safety and trustworthiness of modern deep learning systems, especially those used in safety-critical tasks. Therefore, enhancing the robustness of deep learning models to adversarial attacks is an important topic. Although a significant amount of research has recently emerged to tackle this challenge [39], it remains an open problem for researchers in computational intelligence. In this section, we first describe the progress in discovering and understanding such a threat and then introduce how to quantify a deep learning model's robustness via verification techniques. The mainstream defense methods will be summarized at the end of this section.

13.4.1. Adversarial Threat

Due to the lack of understanding of the basic properties behind these sophisticated deep learning models, adversarial examples, i.e., a special kind of input that is indistinguishable from natural examples but can completely fool the model, have become the major security concern. Adversarial attacks, the procedures that generate adversarial examples, are conceptually similar to the so-called evasion attacks in machine learning [40]. The fact that DNNs are vulnerable to such attacks was first identified by Szegedy et al. back in 2013 [41]. They found that well-trained image classifiers can be misled by trivial perturbations $\delta$ generated by maximizing the loss function, which can be formulated as

$$\max_{\delta}\ \ell(f_\theta(x + \delta), y), \quad \text{s.t. } \|\delta\| \le \epsilon,$$

where $\|\cdot\|$ is a pre-defined metric and $\epsilon$ is the adversarial budget. Later, Goodfellow et al. proposed the fast gradient sign method (FGSM) [42], which is one of the earliest adversarial attack methods. Along this direction, both the basic iterative method [43] and projected gradient descent (PGD) [44] conduct multiple FGSM steps to search for strong adversarial examples. PGD also takes advantage of random initialization and restarts, so it performs much better in practice and is adopted by recent works as a robustness testing method. Besides, PGD is not only the most popular adversarial attack at present but also plays an essential role in adversarial training, a promising defense technique that we will discuss later. Generally, adversarial attacks can be divided into two types depending on the attacker's goals. Non-targeted attacks want to fool a given classifier into producing wrong predictions, and they do not care about what these predictions exactly are.


In classification tasks, the goal of non-targeted attacks can be described as $f_\theta(x + \delta) \ne f_\theta(x)$, where $f_\theta(x) = y$. One step further, targeted attacks aim to manipulate classifiers and make them give a specific prediction, i.e., $f_\theta(x + \delta) = y_t$, where $f_\theta(x) = y$ and $y \ne y_t$. Conducting targeted attacks is more difficult than conducting non-targeted attacks, and they are more threatening for real-world applications. Since non-targeted attacks have a higher success rate, they are commonly used to test the robustness of deep learning models under experimental environments. On the other hand, adversarial attacks are also classified by the ability of the attackers. FGSM and PGD are both white-box attacks, which assume that adversaries can access all information about their victims, especially the gradients of the model during backpropagation. In contrast, conducting black-box attacks is much more difficult and complex than conducting white-box attacks. In this case, the details of deep learning models are hidden from attackers, but they can pretend to be regular users, who can obtain the model's predictions with respect to input examples. Based on the query results, attackers can either fine-tune a surrogate model to produce transferable adversarial examples or estimate the gradient information to generate malicious perturbations as in white-box scenarios. Because of the inherent high dimensionality of deep learning tasks, attackers usually need thousands of query results to generate one usable adversarial example [45]. The significant number of queries required by black-box attacks makes them computationally expensive and hard to achieve in real-world tasks. Thus, making full use of the information obtained in the query process to produce efficient adversarial attacks is still an open problem. In addition to additive perturbations, conducting small spatial transformations such as translation and rotation can also generate adversarial examples [46]. This indicates that the $l_p$ norm is not an ideal metric of adversarial threat. Besides, the universal adversarial perturbation, i.e., a single perturbation that works on most input examples simultaneously and causes misclassification, is also an interesting and practical technique. Along this direction, we found that combining spatial and additive perturbations can significantly boost the performance of universal adversarial perturbations [47]. Note that adversarial attacks can be easily generalized to a wide range of deep learning tasks beyond classification. As shown in Figure 13.15, our previous work [48] provides an example of how to attack an object detection system via PGD with a limited number of perturbed pixels. Deep learning models being deceived by adversarial examples under white-box scenarios is almost inevitable at the current state, and existing defense methods can only mitigate the security issues but cannot address them adequately. The existence


Figure 13.15: Generating an adversarial example toward an object detection system with a limited number of perturbed pixels [48]. Here, SmoothGrad finds the sensitive pixels within an input example, and Half-Neighbor refinement selects the actual perturbed area for PGD attack accordingly.

of adversarial examples demonstrates that small perturbations of the input (normally under $l_p$ metrics) are capable of producing a large distortion at the final layers [41]. Therefore, in addition to security, enabling adversarial robustness would also help us to better understand the fundamental properties of deep learning and computational intelligence.
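To make the white-box attacks of this subsection concrete, the sketch below runs FGSM and PGD against a tiny logistic-regression classifier under an $l_\infty$ budget. The victim model and its weights, the cross-entropy loss, the step size, the number of iterations, and the function names are all assumptions for illustration; attacks on a real DNN would obtain the input gradient through backpropagation instead of the closed form used here.

```python
import numpy as np

# A fixed binary logistic-regression "victim" (hypothetical weights).
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def predict_proba(x):
    """Probability that x belongs to class 1."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def loss_grad_x(x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    return (predict_proba(x) - y) * w

def fgsm(x, y, eps):
    """A single signed-gradient step of size eps (l_inf budget)."""
    return x + eps * np.sign(loss_grad_x(x, y))

def pgd(x, y, eps, step=0.05, iters=20, rng=None):
    """Repeated signed-gradient steps, projected back onto the eps-ball around x."""
    if rng is None:
        rng = np.random.default_rng(0)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)   # random initialization
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(loss_grad_x(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)       # projection step
    return x_adv

x = np.array([0.4, -0.3, 0.2])     # clean input with true label y = 1
y = 1.0
x_fgsm = fgsm(x, y, eps=0.4)
x_pgd = pgd(x, y, eps=0.4)
```

Both attacks here are non-targeted: they only increase the loss of the true label. For a linear model the two attacks reach the same corner of the $l_\infty$ ball, so the benefit of PGD's multiple steps only appears for nonlinear models.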

13.4.2. Robustness Verification

Although DNNs have achieved significant success in many domains, there are serious concerns when applying them to real-world safety-critical systems, such as self-driving cars and medical diagnosis systems. Even though adversarial attacks can evaluate the robustness of DNNs by crafting adversarial examples, these approaches can only falsify robustness claims yet cannot verify them because there is no theoretical guarantee provided on their results. Safety verification can provide a provable guarantee, i.e., when using verification tools to check DNN models, if there is no counterexample (adversarial example), such verification techniques can provide a robustness guarantee. To better explain verification techniques for adversarial robustness, this section explains the relationship between verification and the adversarial threat, then defines verification properties, and further presents various verification techniques. To verify that the outputs of DNNs are invariant against a small perturbation, we introduce the maximal safe norm ball for local robustness evaluation, as shown in Figure 13.16. Given a genuine image $x_0$, we use the $l_p$ norm ball to define the perturbation against the original image, i.e., $\|x - x_0\|_p \le r$. When the norm ball is small and far away from the decision boundary, the DNN gives an invariant output, whereas an intersection with the decision boundary appears on enlarging the norm ball, which means that the perturbation is large enough to change the output.


Figure 13.16: Visualizing the decision boundary and different safety bounds.

In this process, a critical situation exists when the norm ball is tangent to the boundary, which gives the maximal safe norm ball with radius $r_{\max}$. No matter what kind of adversarial attack is used, the model is safe and robust as long as the perturbation is within the range of $r_{\max}$. Otherwise, the model is unsafe because there exists at least one adversarial example. Ideally, we only need to determine the maximal safe radius $r_{\max}$ for the robustness analysis. However, in practice, $r_{\max}$ is not easy to calculate; we thus estimate the upper bound $r_2$ and the lower bound $r_1$ of $r_{\max}$, corresponding to the adversarial threat and robustness verification, respectively.

13.4.2.1. Robustness Properties

Here, we summarize four fundamental definitions that support different kinds of verification techniques.

Definition 1 (Local Robustness Property). Given a neural network $f: \mathbb{R}^n \to \mathbb{R}^m$ and an input $x_0$, the local robustness of $f$ on $x_0$ is defined as

$$\mathrm{Robust}(f, r) \triangleq \exists\, l \in \{1, \ldots, m\},\ \forall x \text{ with } \|x - x_0\|_p \le r,\ \forall j \in \{1, \ldots, m\}:\ f_l(x) \ge f_j(x). \tag{13.19}$$

For targeted local robustness of $f$ on $x_0$ and label $l$, it is defined as

$$\mathrm{Robust}_l(f, r) \triangleq \forall x \text{ with } \|x - x_0\|_p \le r,\ \forall j \in \{1, \ldots, m\}:\ f_l(x) \ge f_j(x). \tag{13.20}$$

From Definition 1, local robustness means that the decision of a DNN does not change within the norm ball of radius $r$. That is, there exists a label $l$ such that, for all inputs $x$ in the ball and all other labels $j$, the DNN believes that $x$ is more likely to be in class $l$ than in any class $j$.
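For a general DNN, checking Definition 1 requires the verification techniques discussed later. For a linear classifier under the $l_2$ norm, however, the maximal safe radius has a closed form, namely the distance from $x_0$ to the nearest decision hyperplane, which gives a small worked instance of the definition. The model weights and the function names below are assumptions for illustration, and the closed form does not carry over to nonlinear networks.

```python
import numpy as np

# Hypothetical 3-class linear classifier f(x) = W x + b.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.0])

def logits(x):
    return W @ x + b

def max_safe_radius_l2(x0):
    """Predicted label at x0 and the l2 distance to the nearest decision boundary."""
    f = logits(x0)
    l = int(np.argmax(f))
    radii = [(f[l] - f[j]) / np.linalg.norm(W[l] - W[j])
             for j in range(len(f)) if j != l]
    return l, float(min(radii))

def is_locally_robust(x0, r):
    """Definition 1 for this special case: the label is constant on the l2 ball."""
    _, r_max = max_safe_radius_l2(x0)
    return r <= r_max

x0 = np.array([2.0, 0.5])
label, r_max = max_safe_radius_l2(x0)

# Moving slightly farther than r_max toward the nearest boundary flips the label.
direction = (W[0] - W[1]) / np.linalg.norm(W[0] - W[1])
x_adv = x0 - (r_max + 0.01) * direction
```

In the terminology above, an attack that finds `x_adv` certifies an upper bound on $r_{\max}$, while a verifier certifies a lower bound; for this linear model the two bounds meet.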


To compute the set of outputs with respect to a predefined input space, we have the following definition of output reachability.

Definition 2 (Output Reachability Property). Given a neural network $f$ and an input region $\|x - x_0\|_p \le r$, the reachable set of $f$ and $r$ is a set $\mathrm{Reach}(f, r)$ such that

$$\mathrm{Reach}(f, r) \triangleq \{ f(x) \mid \|x - x_0\|_p \le r \}. \tag{13.21}$$

Because the region defined by $r$ covers a large or infinite number of points within the input space, there is no realistic algorithm to check their classifications one by one. Together with the highly nonlinear (or black-box) nature of the neural network $f$, this makes the output reachability problem highly non-trivial. Based on this, the verification of output reachability is to determine whether all inputs $x$ within distance $r$ map onto a given output set $Y$, and whether all outputs in $Y$ have a corresponding input $x$ within $r$, that is, to check whether the equation $\mathrm{Reach}(f, r) = Y$ is satisfied or not. Normally, the largest or smallest values of a specific dimension of the set $\mathrm{Reach}(f, r)$ are much more useful in reachability problem analysis.

Definition 3 (Interval Property). Given a neural network $f: \mathbb{R}^n \to \mathbb{R}^m$, an input $x_0$, and $\|x - x_0\|_p \le r$, the interval property of $f$ and $r$ is a convex set $\mathrm{Interval}(f, r)$ such that

$$\mathrm{Interval}(f, r) \supseteq \{ f(x) \mid \|x - x_0\|_p \le r \}. \tag{13.22}$$

If $\mathrm{Interval}(f, r)$ is the smallest convex set containing $\{ f(x) \mid \|x - x_0\|_p \le r \}$, we call it the convex hull, which means that $\mathrm{Interval}(f, r)$ is the closest convex set to $\{ f(x) \mid \|x - x_0\|_p \le r \}$. Unlike the output reachability property, the interval property computes a convex over-approximation of the output reachable set. Similarly, for $f$, $r$, and an output region $Y$, we can define the verification of the interval property as $Y \supseteq \{ f(x) \mid \|x - x_0\|_p \le r \}$, which means that all inputs in $r$ are mapped into $Y$. The computing process has to check whether the given convex set $Y$ is an interval satisfying (13.22). As with the output reachability property, there exist useful and simpler problems, e.g., determining whether a given real number $d$ is a valid upper bound for a specific dimension of $\{ f(x) \mid \|x - x_0\|_p \le r \}$.

Definition 4 (Lipschitzian Property). Given a neural network $f : \mathbb{R}^n \to \mathbb{R}^m$, an input $x_0$, and $\eta(x_0, l_p, r) = \{ x \mid \|x - x_0\|_p \le r \}$,

$$\mathrm{Lips}(f, \eta, l_p) \equiv \sup_{x \in \eta} \frac{|f(x) - f(x_0)|}{\|x - x_0\|_p}. \tag{13.23}$$
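Definition 4 can be illustrated numerically. The sketch below uses a hypothetical helper, `lips_lower_estimate`, that samples points in $\eta(x_0, l_p, r)$ and records the largest observed ratio from Eq. (13.23). It assumes the Euclidean norm for the numerator, and because it only samples, it yields a lower estimate of the supremum rather than a verified bound.

```python
import numpy as np

def lips_lower_estimate(f, x0, r, p=2, n_samples=5000, seed=0):
    """Largest sampled value of |f(x) - f(x0)| / ||x - x0||_p over the ball.

    This is only a lower estimate of the supremum in the definition; the
    Euclidean norm in the numerator is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    fx0 = f(x0)
    best = 0.0
    for _ in range(n_samples):
        d = rng.normal(size=x0.shape)
        # Rescale the random direction so that 0 < ||d||_p <= r.
        d = d * rng.uniform(1e-6, r) / np.linalg.norm(d, ord=p)
        ratio = np.linalg.norm(f(x0 + d) - fx0) / np.linalg.norm(d, ord=p)
        best = max(best, ratio)
    return best

A = np.diag([3.0, 1.0])  # hypothetical linear "network"
est = lips_lower_estimate(lambda x: A @ x, np.zeros(2), r=1.0)
print(est)
```

For the linear map $x \mapsto Ax$ with $A = \mathrm{diag}(3, 1)$, the true value under the $l_2$ norm is 3, and the sampled estimate approaches it from below.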


Inspired by Lipschitz optimization [49], $\mathrm{Lips}(f, \eta, l_p)$ is a Lipschitzian metric of $f$, $\eta$, and $l_p$ that quantifies the rate of change of the neural network within $\eta$. Intuitively, the greatest rate of change under this metric is the best Lipschitz constant. Given the Lipschitzian metric and a real value $d \in \mathbb{R}$, the verification of the Lipschitzian property is to determine whether $\mathrm{Lips}(f, \eta, l_p) \le d$. Usually, the value of this metric is not easy to compute, and $d$ serves as an upper bound for it.

There are some relationships between the definitions of the above four properties. For example, when the set $\mathrm{Reach}(f, r)$ is computed exactly, we can easily evaluate whether it is included in another convex set $\mathrm{Interval}(f, r)$. Similarly, based on the Lipschitzian metric $\mathrm{Lips}(f, \eta, l_p)$, we can evaluate and verify the interval property. Please refer to our papers [39, 50] for more details.

13.4.2.2. Verification Approaches

Current verification techniques mainly include constraint solving, search-based approaches, optimization, and over-approximation. To make these techniques efficient, one category can also be combined with another. The most significant aspect of verification techniques is that they provide a guarantee for the result. Based on the type of guarantee, verification methods can be divided into four categories: methods with a deterministic guarantee, a one-side guarantee, a guarantee with converging bounds, and a statistical guarantee.

Methods with a deterministic guarantee state exactly whether a property holds. They transform the verification problem into a set of constraints and then solve them with a constraint solver. As a result, the solver returns a deterministic answer to a query, i.e., either satisfiable or unsatisfiable. Representative solvers used in this area are Boolean satisfiability (SAT) solvers, satisfiability modulo theories (SMT) solvers, linear programming (LP) solvers, and mixed-integer linear programming (MILP) solvers. The SAT method determines whether, given a Boolean formula, there exists an assignment to the Boolean variables under which the formula evaluates to true. Its prerequisite is to abstract the network into a set of Boolean combinations, an example of which is given in [51]. The SMT method goes one step further and determines the satisfiability of logical formulas with respect to combinations of background theories expressed in classical first-order logic with equality. Classical SMT methods are Reluplex [52] and Planet [53]. The LP method optimizes a linear objective function subject to linear equality or inequality constraints. When some of the variables are required to be integers, the LP problem becomes a MILP problem. For instance, a hidden layer $z_{i+1} = \mathrm{ReLU}(W_i z_i + b_i)$ can be described with the following MILP conditions:

$$W_i z_i + b_i \le z_{i+1} \le W_i z_i + b_i + M t_{i+1}, \qquad 0 \le z_{i+1} \le M (1 - t_{i+1}),$$

where each entry of $t_{i+1}$ has value 0 or 1, indicating whether the corresponding neuron is activated ($t_{i+1} = 0$ forces $z_{i+1} = W_i z_i + b_i$) or not ($t_{i+1} = 1$ forces $z_{i+1} = 0$), and $M > 0$ is a large constant that can be treated as $\infty$. The details of encoding a neural network as a MILP problem can be found in [54]. To improve verification efficiency, extensions have been made to the simple MILP method, leading to successful variants such as [55] and Sherlock [56].

Approaches with a one-side guarantee compute only an approximated bound, which is sufficient to claim that a property holds. Because these methods provide only a bounded estimate but do so with high efficiency, they are applicable to the analysis of large models. Some approaches adopt abstract interpretation, using abstract domains, e.g., boxes, zonotopes, and polyhedra, to over-approximate the computation on a set of inputs. Its application has been explored in a few approaches, including AI2 [57] and [58], which can verify the interval property but cannot verify the reachability property. Because the non-convexity of neural networks is partly responsible for the difficulty of verification, methods such as [59] compute a convex outer over-approximation of the set of reachable activations and apply convex optimization. In ReluVal [60], interval arithmetic is leveraged to compute rigorous bounds on the DNN outputs, i.e., the interval property. The key idea of interval analysis is that, given the ranges of the operands, an over-estimated range of the output can be computed using only the lower and upper bounds of the operands. In addition, one-side guarantee approaches include [61], which deals with output reachability estimation, and FastLin/FastLip [62], which use linear approximation for ReLU networks.
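The interval-analysis idea described above can be sketched in a few lines. The following is a minimal illustration of plain interval arithmetic through affine and ReLU layers for an $l_\infty$ input region; it is not the symbolic-interval method of ReluVal, and the function names, the two-layer network, and its weights are hypothetical choices made for the example.

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W x + b."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    new_lo = W_pos @ lo + W_neg @ hi + b
    new_hi = W_pos @ hi + W_neg @ lo + b
    return new_lo, new_hi

def interval_bounds(layers, x0, r):
    """Bounds on the output over the l_inf ball of radius r around x0.

    layers is a list of (W, b); ReLU follows every layer except the last.
    """
    lo, hi = x0 - r, x0 + r
    for i, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, W, b)
        if i < len(layers) - 1:
            # ReLU is monotone, so it can be applied to both bounds directly.
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi

def forward(layers, x):
    """Ordinary forward pass, used to compare samples against the bounds."""
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

# Hypothetical two-layer ReLU network.
layers = [
    (np.array([[1.0, -1.0], [1.0, 1.0]]), np.zeros(2)),
    (np.array([[1.0, -1.0]]), np.zeros(1)),
]
x0, r = np.zeros(2), 1.0
lo, hi = interval_bounds(layers, x0, r)
print(lo, hi)
```

Every output `forward(layers, x)` with $\|x - x_0\|_\infty \le r$ lies inside `[lo, hi]`, so the box is a valid, though in general loose, instance of $\mathrm{Interval}(f, r)$.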
Approaches aiming to achieve statistical guarantees usually make claims such as that a property is satisfied, or that a value is a lower bound of another value, with a certain probability. Typical methods are CLEVER [63], which works with the Lipschitz property, and [64], which proposes robustness estimates that measure the frequency and the severity of adversarial examples.

Approaches that compute converging bounds can be applied to large-scale real-world systems, and they can work with both the output reachability property and the interval property. Huang et al. [65] made a layer-by-layer analysis of the network, guaranteeing that a misclassification is found if it exists. DeepGame [66] reduces the verification problem to a two-player turn-based game, studying two variants of point-wise robustness.

Unfortunately, these verification algorithms have some weaknesses. Algorithms based on constraint solvers, such as SMT/SAT and MILP/LP, need to encode the entire network layer by layer, which limits them to small-scale or quasi-linear neural networks. Although approaches based on linear approximation, convex optimization, and abstract interpretation can handle nonlinear activation functions, they are essentially over-approximation algorithms, i.e., they generally compute a lower bound of the maximal safe radius $r_{\max}$. Besides, these algorithms depend on the specific structure of the network to design different abstract interpretations or linear approximations. Hence, they cannot provide a generic framework for different neural networks.

To circumvent such shortcomings, approaches based on global optimization have been developed. DeepGo [50] first proves that most neural network layers are Lipschitz continuous and then uses Lipschitz optimization to solve the problem. Since the only requirement is Lipschitz continuity, a number of studies have extended this method to various neural network structures. DeepTRE [67] focuses on the Hamming distance, proposing an approach that iteratively generates lower and upper bounds on the network’s robustness. Its tensor-based approach returns intermediate bounds and allows for efficient GPU computation. Furthermore, we also proposed a generic framework, DeepQuant [68], to quantify the safety risks of networks based on a Mesh Adaptive Direct Search algorithm.

In summary, the maximal safe radius $r_{\max}$ provides the local robustness information of the neural network. Adversarial attacks focus on estimating the upper bound of the radius and searching for adversarial examples outside the upper bound of the safe norm ball, while verification studies the lower bound of $r_{\max}$, ensuring that inputs inside this norm ball are safe. Both analysis methods are concerned only with evaluating existing neural networks. How to further improve the robustness of neural networks is discussed in the section on adversarial defense.
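To give a flavor of how Lipschitz continuity yields guaranteed bounds, the sketch below brackets the minimum of a one-dimensional function on a uniform grid. It is a simplified, non-adaptive illustration in the spirit of Lipschitz optimization, not the DeepGo algorithm itself; the function name `lipschitz_min_bounds`, the test function, and the constant `K` are assumptions made for the example. If $K$ is a valid Lipschitz constant, then on each grid interval $[a, b]$ the function cannot fall below $(g(a) + g(b))/2 - K(b - a)/2$.

```python
import numpy as np

def lipschitz_min_bounds(g, a, b, K, n_points=201):
    """Lower and upper bounds on the minimum of g over [a, b].

    Assumes g is Lipschitz continuous on [a, b] with constant K.
    upper: the smallest sampled value (attained by an actual input).
    lower: the smallest value g could take between neighbouring samples
           without violating the Lipschitz condition.
    """
    xs = np.linspace(a, b, n_points)
    ys = np.array([g(x) for x in xs])
    upper = ys.min()
    widths = np.diff(xs)
    lower = np.min((ys[:-1] + ys[1:]) / 2.0 - K * widths / 2.0)
    return lower, upper

# sin has Lipschitz constant 1, and its minimum on [0, 2*pi] is -1.
lower, upper = lipschitz_min_bounds(np.sin, 0.0, 2.0 * np.pi, K=1.0)
print(lower, upper)
```

The two bounds enclose the true minimum, and the gap between them shrinks as the grid is refined, which mirrors the converging lower and upper bounds discussed above.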

13.4.3. Defense

While verification aims to provide theoretical guarantees on the robustness and reliability of DNNs, defense methods attempt to enhance such properties from a more practical perspective. Generally, defense strategies can be divided into four categories: pre-processing, model enhancement, adversarial training, and certified defenses.

Specifically, example pre-processing methods are probably the most straightforward defense. In this strategy, input examples are processed via methods such as median smoothing, bit depth compression, and JPEG compression [69] before being fed to the deep learning system. Pre-processing can reduce the impact of perturbations while keeping enough information within the processed examples for the downstream tasks. Because most kinds of adversarial perturbation are essentially additive noise, pre-processing defenses are highly effective in vision and audio applications. Besides, a pre-processing defense does not require re-training the protected models and can be deployed directly. However, if the attackers are aware of the pre-processing procedure, they can correspondingly update the generation of adversarial examples to break or bypass the defense. This means that pre-processing defenses are not reliable in white-box or “gray-box” (attackers have limited knowledge about their target model) environments.

In parallel with pre-processing, model enhancement focuses on the deep learning models themselves. The core of this type of method is to improve robustness by modifying the model architecture, activation functions, and loss functions. A typical defense along this direction is defensive distillation [70], which follows the teacher–student model [71]. Defensive distillation first trains a regular network (teacher) to generate soft labels, which are used instead of the one-hot encoding of the true labels (hard labels) to train a student model. This defense significantly boosted the model’s robustness on several benchmarks, but it was soon broken by the C&W attack proposed by Carlini and Wagner [72]. In addition to processing perturbed examples correctly, model enhancement sometimes achieves the defense effect via adversarial detection, i.e., identifying adversarial examples and refusing to give corresponding outputs. Although many model enhancement methods eventually failed [73], this is still a valuable and active research direction.

The idea of adversarial training appeared with the FGSM attack [42] and has been significantly generalized by later works [44]. In these pioneering works, adversarial examples were regarded as an augmentation of the original training dataset, and adversarial training required both original examples and adversarial examples to train a robust model. Madry et al. [44] formalized adversarial training as a min–max optimization problem:

$$\min_{\theta}\ \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\delta \in B} L\big(F_{\theta}(x + \delta)\big) \Big]. \tag{13.24}$$

Here, in the inner maximization problem, adversarial examples are generated by maximizing the classification loss, while the outer minimization updates the model parameters by minimizing the loss on the adversarial examples from the inner maximization. Equation (13.24) provides a principled connection between the robustness of neural networks and adversarial attacks. This indicates that a qualified adversarial attack method could help train robust models. At the current stage, PGD adversarial training is still one of the most effective defenses [73, 74], although quite a few incremental works aim to improve on it. Nevertheless, all these PGD-based adversarial training methods take multiple backpropagations to generate adversarial perturbations, which significantly raises the computational cost. As a result, conducting adversarial training, especially on large-scale networks, is prohibitively time-consuming.

On the way to improving its efficiency, recent studies have made two interesting findings. On the one hand, Wong et al. [75] demonstrated that high-quality or strong adversarial examples might not be necessary for successful adversarial training. They empirically showed that FGSM with uniform random initialization could adversarially train models with considerable robustness more efficiently than PGD adversarial training. On the other hand, YOPO adversarial training [76] simplifies the backpropagation used to update the training adversarial examples, i.e., it treats the first layer and the other layers separately during training and generates adversarial examples with respect to the first layer instead of the whole model. This successful attempt suggests that we might need to adversarially train only the first few layers to produce robust deep learning models. Moreover, inspired by recent studies, our work [77] reveals the connection between the lower bound of the Lipschitz constant of a given network and the magnitude of its partial derivative with respect to adversarial examples. Supported by this theoretical finding, we utilize the gradient’s magnitude to quantify the effectiveness of adversarial training and determine when to adjust the training procedure.

In addition to robustness, adversarial training also benefits deep learning research in terms of interpretability [78] and data augmentation. The technique is receiving increasing attention in applications beyond security, such as text-to-image tasks [25, 79] and natural language processing [80]. Therefore, we believe that adversarial training will remain an interesting and promising research topic in the next few years.

The previous defenses are all heuristic strategies and cannot provide theoretical guarantees. Like the defensive distillation method, most of them are likely to be broken by emerging adversarial attacks.
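Returning to the min–max formulation of adversarial training, the loop below sketches how the inner maximization and outer minimization alternate when the inner problem is approximated by a single FGSM step from a uniformly random start. It is a toy illustration under stated assumptions: a NumPy logistic regression stands in for $F_\theta$, the data are synthetic, and the step size `1.25 * eps` and all other hyper-parameters are choices made for this sketch rather than values taken from the cited works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_delta(w, b, X, Y, eps, alpha, rng):
    """Approximate the inner maximization with one FGSM step.

    Starts from a uniformly random point in the l_inf ball of radius eps,
    takes one signed-gradient step of size alpha, and clips back to the ball.
    """
    delta = rng.uniform(-eps, eps, size=X.shape)
    p = sigmoid((X + delta) @ w + b)
    grad_x = (p - Y)[:, None] * w[None, :]  # gradient of the loss w.r.t. the input
    return np.clip(delta + alpha * np.sign(grad_x), -eps, eps)

def adversarial_train(X, Y, eps=0.1, lr=0.1, epochs=200, seed=0):
    """Outer minimization: gradient descent on the loss at perturbed inputs."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    alpha = 1.25 * eps  # step size: an assumption made for this sketch
    for _ in range(epochs):
        X_adv = X + fgsm_delta(w, b, X, Y, eps, alpha, rng)
        p = sigmoid(X_adv @ w + b)
        w = w - lr * X_adv.T @ (p - Y) / len(Y)
        b = b - lr * np.mean(p - Y)
    return w, b

# Toy data (hypothetical): two well-separated Gaussian blobs.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(-1.0, 0.2, size=(100, 2)),
               data_rng.normal(+1.0, 0.2, size=(100, 2))])
Y = np.concatenate([np.zeros(100), np.ones(100)])
w, b = adversarial_train(X, Y)

# For a linear model, the worst-case l_inf perturbation of radius eps lowers
# the signed margin by exactly eps * ||w||_1, so robustness can be checked.
margin = (2.0 * Y - 1.0) * (X @ w + b) - 0.1 * np.abs(w).sum()
print("robust accuracy:", np.mean(margin > 0))
```

Replacing the single FGSM step with several projected gradient steps would give a PGD-style inner loop, at the cost of one extra gradient computation per step.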
To end the “arms race” between attacks and heuristic defenses, certified defenses aim to theoretically guarantee that the outputs of deep neural networks do not change within a certain disturbance range around the inputs. Like some verification techniques, certified defenses come with strong theoretical guarantees. However, they are usually computationally expensive and difficult to extend to large-scale networks or high-dimensional datasets [50]. A promising method in this direction is randomized smoothing [81], which introduces random noise to smooth the learned feature during training. This technique demonstrates that it is feasible to use randomized perturbation to improve the trained model’s robustness concretely. However, it can only work and provide guarantees under the $l_2$ constraint. Thus, generalizing it to other threat models is an urgent research need.

Overall, defending against adversarial attacks is still a work in progress, and we are far from solving this problem. We believe that preventing adversarial threats and interpreting DNNs are two sides of the same coin, and any useful finding on one side would boost our progress on the other. Although no current defense strategy is perfect and each might only work temporarily, we would gain a better understanding of the fundamental mechanism of deep learning models if we keep digging in this direction.
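In its commonly described form, randomized smoothing predicts with a majority vote of a base classifier under Gaussian input noise. The sketch below shows only that voting step and rests on assumptions: `base_classifier` is a hypothetical linear model standing in for a trained network, `smoothed_predict` and its parameters are names invented for this example, and the computation of a certified $l_2$ radius is omitted.

```python
import numpy as np

def base_classifier(x):
    """Hypothetical 2-class linear model standing in for a trained network."""
    W = np.array([[1.0, -1.0], [-1.0, 1.0]])
    b = np.array([0.5, -0.5])
    return W @ x + b

def smoothed_predict(f, x, sigma=0.25, n_samples=1000, seed=0):
    """Majority vote of f over Gaussian perturbations of x.

    Returns the most frequent label and its empirical vote share.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=sigma, size=(n_samples,) + x.shape)
    labels = np.array([int(np.argmax(f(x + e))) for e in noise])
    counts = np.bincount(labels)
    top = int(np.argmax(counts))
    return top, counts[top] / n_samples

label, share = smoothed_predict(base_classifier, np.array([1.0, 0.0]))
print(label, share)
```

A vote share close to 1 indicates that the prediction is stable under the sampled noise, whereas a share near 0.5 in this two-class setting signals an input close to the decision boundary.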

13.5. Further Reading

After more than 10 years of exploration, deep learning is still a jungle full of secrets and challenges. This chapter provides only a brief yet general introduction, and we are certainly not experts in most sub-areas of this gigantic and fast-developing research topic. So, at the end of this chapter, we provide a simple overview of two emerging deep learning techniques and of existing challenges.

13.5.1. Emerging Deep Learning Models

One of the most noticeable threads is generative adversarial networks (GANs) [82]. Beyond learning from data, GAN models can generate new information by utilizing learned features and patterns, which gives birth to many exciting applications. For instance, convolutional neural networks can decouple the semantic content and art style of images and rebuild images with the same semantics but a different style [83]. Coloring grayscale images without manual adjustment used to be a difficult task for computers, but it can now be done by a deep learning algorithm called deep colorization [84]. In the audio processing area, Google AI utilized the GAN model to generate high-fidelity raw audio waveforms [85], and the latest deep learning model [86] can even decompose audio into fundamental components, such as loudness, pitch, timbre, and frequency, and manipulate them independently. Various GANs could start new revolutions in many applications; for example, Photoshop, one of the most popular graphics editors, includes a neural network-based toolbox to help users advance their art. Applications such as deepfake and deepnude also raise concerns about how to prevent such powerful techniques from being abused.

As we mentioned before, Transformer [32] has greatly boosted the raw performance of DNNs on natural language processing tasks. The most famous deep learning models built upon Transformer are GPT [33] and BERT [34]. The latest version of GPT in 2020, namely GPT-3, was one of the largest deep neural networks at the time, and it can generate high-quality text that sometimes cannot be distinguished from text written by a real person. BERT is one of the most popular methods and has been widely used across the NLP industry. One difference between these two models is that GPT processes text in only one direction, while BERT is a bidirectional model. Therefore, BERT usually performs better but is ill-suited to text generation tasks, which are GPT’s specialty.
Nowadays, the application of Transformer is no longer limited to the NLP area. Recent studies [87] show that this powerful architecture is capable of vision tasks.
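The difference in direction between GPT-style and BERT-style models can be pictured through the attention mask. The sketch below is a simplified illustration under assumptions: the function `attention_weights` is hypothetical, the score matrix is filled with zeros, and real models add learned projections, multiple heads, and other components. With a causal mask, position $i$ attends only to positions $0, \dots, i$; without it, every position attends to the whole sequence.

```python
import numpy as np

def attention_weights(scores, causal):
    """Row-wise softmax over attention scores, optionally with a causal mask."""
    n = scores.shape[0]
    if causal:
        # Position i may only attend to positions 0..i (lower triangle).
        allowed = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # placeholder scores for a 4-token sequence
unidirectional = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)
print(unidirectional)
print(bidirectional)
```

The causal variant never uses tokens to the right of the current position, which is what makes left-to-right text generation possible, while the unmasked variant uses context from both sides.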


13.5.2. Challenges

Despite the exciting new technologies and applications, the difficulties associated with deep learning have not changed significantly. In addition to adversarial threats and the high demand for computational power, the lack of interpretability is also a big challenge in deep learning. A modern deep neural network usually has billions of weights, an intricate architecture, and many hyper-parameters such as the learning rate and dropout rate. We do not yet fully understand the inner mechanism of these models, especially why and how a deep learning model makes certain decisions. Currently, the training of deep learning models is still largely based on trial and error and lacks a principled way to guide it. Take the LeNet-5 model as an example: it has two convolutional layers and two fully connected layers [88]. We know that if we train it properly on the MNIST dataset, it can achieve approximately 99% classification accuracy on the test set. However, without experiments, we cannot know how the accuracy changes after adding one more convolutional layer to the model, because we do not have a mathematical tool to formulate the representational capability of deep neural network models. Without such a theoretical guide, designing neural network architectures for a specific task is challenging. This leads us to neural architecture search (NAS), which aims to find the best network architecture under limited computing resources in an automatic way. However, at the current stage, research on understanding deep learning is still in its infancy.

The lack of labeled data is also a barrier, which significantly limits the applicability of deep learning in scenarios where only limited data are available. Training a deep learning model to a satisfactory performance usually requires a huge amount of data.
For example, ImageNet, the dataset that has significantly accelerated the development of deep learning, contains over 14 million labeled images. While being data-hungry is not a problem for consumer applications where large amounts of data are easily available, copious amounts of labeled training data are rarely available in most industrial applications. So, how to improve the applicability of deep learning models with limited training data is still one of the key challenges today. Emerging techniques such as transfer learning, meta learning, and GANs show some promise with regard to overcoming this challenge.

References

[1] O. Russakovsky, J. Deng, H. Su, et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vision, 115(3), 211–252 (2015).
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., Deep speech 2: end-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, pp. 173–182. PMLR (2016).


[3] M. E. Peters, M. Neumann, M. Iyyer, et al., Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).
[4] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521(7553), 436–444 (2015).
[5] W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys., 5(4), 115–133 (1943).
[6] F. Rosenblatt and S. Papert, The perceptron, A perceiving and recognizing automaton, Cornell Aeronautical Laboratory Report. pp. 85–460 (1957).
[7] M. Minsky and S. A. Papert, Perceptrons: An Introduction to Computational Geometry (MIT Press, 2017).
[8] K. Fukushima, A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybern., 36, 193–202 (1980).
[9] T. Gonsalves, The summers and winters of artificial intelligence. In Advanced Methodologies and Technologies in Artificial Intelligence, Computer Simulation, and Human-Computer Interaction, pp. 168–179. IGI Global (2019).
[10] S. Linnainmaa, The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors, Master’s Thesis (in Finnish), Univ. Helsinki. pp. 6–7 (1970).
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, 323(6088), 533–536 (1986).
[12] Y. LeCun, B. Boser, J. S. Denker, et al., Backpropagation applied to handwritten zip code recognition, Neural Comput., 1(4), 541–551 (1989).
[13] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, GPU computing, Proc. IEEE, 96(5), 879–899 (2008).
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, Commun. ACM, 60(6), 84–90 (2017).
[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, Cambridge, 2016).
[16] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Networks, 61, 85–117 (2015).
[17] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning (2020). https://d2l.ai.
[18] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, Activation functions: comparison of trends in practice and research for deep learning, arXiv preprint arXiv:1811.03378 (2018).
[19] W. Zhang, et al., Shift-invariant pattern recognition neural network and its optical architecture. In Annual Conference of the Japan Society of Applied Physics (1988).
[20] D. H. Hubel and T. N. Wiesel, Receptive fields and functional architecture of monkey striate cortex, J. Physiol., 195(1), 215–243 (1968).
[21] K. Fukushima and S. Miyake, Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pp. 267–285. Springer (1982).
[22] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda, Subject independent facial expression recognition with robust face detection using a convolutional neural network, Neural Networks, 16(5–6), 555–559 (2003).
[23] K. O’Shea and R. Nash, An introduction to convolutional neural networks, arXiv preprint arXiv:1511.08458 (2015).
[24] A. V. D. Oord, S. Dieleman, H. Zen, et al., Wavenet: a generative model for raw audio, arXiv preprint arXiv:1609.03499 (2016).
[25] N. Sarafianos, X. Xu, and I. A. Kakadiaris, Adversarial representation learning for text-to-image matching. In International Conference on Computer Vision (2019).
[26] M. I. Jordan, Serial order: a parallel distributed processing approach. In Advances in Psychology, vol. 121, pp. 471–495. Elsevier (1997).


[27] J. L. Elman, Finding structure in time, Cognit. Sci., 14(2), 179–211 (1990).
[28] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Comput., 9(8), 1735–1780 (1997).
[29] K. Cho, B. van Merrienboer, Ç. Gülçehre, et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (2014).
[30] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, A neural probabilistic language model, J. Mach. Learn. Res., 3, 1137–1155 (2003).
[31] T. Mikolov, I. Sutskever, K. Chen, et al., Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems (2013).
[32] A. Vaswani, N. Shazeer, N. Parmar, et al., Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (2017).
[33] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving language understanding by generative pre-training (2018).
[34] J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2019).
[35] G. Goh, Why momentum really works, Distill (2017). doi: 10.23915/distill.00006. URL http://distill.pub/2017/momentum.
[36] G. E. Hinton, S. Osindero, and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput., 18(7), 1527–1554 (2006).
[37] S. Ioffe and C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).
[38] Y. Wu and K. He, Group normalization. In European Conference on Computer Vision (2018).
[39] X. Huang, D. Kroening, W. Ruan, et al., A survey of safety and trustworthiness of deep neural networks: verification, testing, adversarial attack and defence, and interpretability, Comput. Sci. Rev., 37, 100270 (2020).
[40] B. Biggio and F. Roli, Wild patterns: ten years after the rise of adversarial machine learning, Pattern Recognit., 84, 317–331 (2018).
[41] C. Szegedy, W. Zaremba, I. Sutskever, et al., Intriguing properties of neural networks. In International Conference on Learning Representations (2014).
[42] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and harnessing adversarial examples. In International Conference on Learning Representations (2015).
[43] A. Kurakin, I. J. Goodfellow, and S. Bengio, Adversarial examples in the physical world. In International Conference on Learning Representations, Workshop (2017).
[44] A. Madry, A. Makelov, L. Schmidt, et al., Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (2018).
[45] J. Uesato, B. O’Donoghue, P. Kohli, et al., Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning (2018).
[46] C. Xiao, J.-Y. Zhu, B. Li, et al., Spatially transformed adversarial examples. In International Conference on Learning Representations (2018).
[47] Y. Zhang, W. Ruan, F. Wang, et al., Generalizing universal adversarial attacks beyond additive perturbations. In IEEE International Conference on Data Mining (2020).
[48] Y. Zhang, F. Wang, and W. Ruan, Fooling object detectors: adversarial attacks by half-neighbor masks, arXiv preprint arXiv:2101.00989 (2021).
[49] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, Lipschitzian optimization without the lipschitz constant, J. Optim. Theory Appl., 79(1), 157–181 (1993).
[50] W. Ruan, X. Huang, and M. Kwiatkowska, Reachability analysis of deep neural networks with provable guarantees. In International Joint Conference on Artificial Intelligence (2018).


[51] L. Pulina and A. Tacchella, An abstraction-refinement approach to verification of artificial neural networks. In International Conference on Computer Aided Verification (2010).
[52] G. Katz, C. Barrett, D. L. Dill, et al., Reluplex: an efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification (2017).
[53] R. Ehlers, Formal verification of piece-wise linear feed-forward neural networks. In International Symposium on Automated Technology for Verification and Analysis (2017).
[54] A. Lomuscio and L. Maganti, An approach to reachability analysis for feed-forward ReLU neural networks, arXiv preprint arXiv:1706.07351 (2017).
[55] C.-H. Cheng, G. Nührenberg, and H. Ruess, Maximum resilience of artificial neural networks. In International Symposium on Automated Technology for Verification and Analysis (2017).
[56] S. Dutta, S. Jha, S. Sankaranarayanan, et al., Output range analysis for deep neural networks, arXiv preprint arXiv:1709.09130 (2018).
[57] T. Gehr, M. Mirman, D. Drachsler-Cohen, et al., AI2: safety and robustness certification of neural networks with abstract interpretation. In IEEE Symposium on Security and Privacy (2018).
[58] M. Mirman, T. Gehr, and M. Vechev, Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning (2018).
[59] E. Wong and Z. Kolter, Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning (2018).
[60] S. Wang, K. Pei, J. Whitehouse, et al., Formal security analysis of neural networks using symbolic intervals. In USENIX Security Symposium. USENIX Association (2018).
[61] W. Xiang, H.-D. Tran, and T. T. Johnson, Output reachable set estimation and verification for multi-layer neural networks, IEEE Trans. Neural Networks Learn. Syst., 29, 5777–5783 (2018).
[62] L. Weng, H. Zhang, H. Chen, et al., Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning (2018).
[63] T.-W. Weng, H. Zhang, P.-Y. Chen, et al., Evaluating the robustness of neural networks: an extreme value theory approach, arXiv preprint arXiv:1801.10578 (2018).
[64] O. Bastani, Y. Ioannou, L. Lampropoulos, et al., Measuring neural net robustness with constraints, arXiv preprint arXiv:1605.07262 (2016).
[65] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, Safety verification of deep neural networks. In International Conference on Computer Aided Verification (2017).
[66] M. Wu, M. Wicker, W. Ruan, et al., A game-based approximate verification of deep neural networks with provable guarantees, Theor. Comput. Sci., 807, 298–329 (2020).
[67] W. Ruan, M. Wu, Y. Sun, et al., Global robustness evaluation of deep neural networks with provable guarantees for the Hamming distance. In International Joint Conference on Artificial Intelligence (2019).
[68] P. Xu, W. Ruan, and X. Huang, Towards the quantification of safety risks in deep neural networks, arXiv preprint arXiv:2009.06114 (2020).
[69] R. Shin and D. Song, JPEG-resistant adversarial images. In Annual Network and Distributed System Security Symposium (2017).
[70] N. Papernot, P. D. McDaniel, X. Wu, et al., Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy (2016).
[71] G. E. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 (2015).
[72] N. Carlini and D. A. Wagner, Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (2017).
[73] A. Athalye, N. Carlini, and D. A. Wagner, Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In International Conference on Machine Learning (2018).
[74] L. Rice, E. Wong, and J. Z. Kolter, Overfitting in adversarially robust deep learning. In International Conference on Machine Learning (2021).

June 1, 2022 13:18


Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch13

Handbook on Computer Learning and Intelligence — Vol. II

[75] E. Wong, L. Rice, and J. Z. Kolter, Fast is better than free: revisiting adversarial training. In International Conference on Learning Representations (2020). [76] D. Zhang, T. Zhang, Y. Lu, et al., You only propagate once: accelerating adversarial training via maximal principle. In Proceedings of the Advances in Neural Information Processing Systems (2019). [77] F. Wang, Y. Zhang, Y. Zheng, and W. Ruan, Gradient-guided dynamic efficient adversarial training, arXiv preprint arXiv:2103.03076 (2021). [78] T. Zhang and Z. Zhu, Interpreting adversarially trained convolutional neural networks. In International Conference on Machine Learning (2019). [79] T. Chen, Y. Liao, C. Chuang, et al., Show, adapt and tell: adversarial training of cross-domain image captioner. In International Conference on Computer Vision (2019). [80] D. Wang, C. Gong, and Q. Liu, Improving neural language modeling via adversarial training. In International Conference on Machine Learning (2019). [81] J. M. Cohen, E. Rosenfeld, and J. Z. Kolter, Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (2019). [82] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (2014). [83] L. A. Gatys, A. S. Ecker, and M. Bethge, Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (2016). [84] Z. Cheng, Q. Yang, and B. Sheng, Deep colorization. In International Conference on Computer Aided Verification (2015). [85] J. Engel, K. K. Agrawal, S. Chen, et al., Gansynth: adversarial neural audio synthesis, arXiv preprint arXiv:1902.08710 (2019). [86] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, DDSP: differentiable digital signal processing, arXiv preprint arXiv:2001.04643 (2020). [87] A. Dosovitskiy, L. Beyer, A. 
Kolesnikov, et al., An image is worth 16 x 16 words: transformers for image recognition at scale. In International Conference on Learning Representations (2021). [88] Y. LeCun, C. Cortes, and C. Burges, MNIST handwritten digit database, AT&T Labs, 2 (2010).


b4528-v2-ch14

© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0014

Chapter 14

Deep Learning for Graph-Structured Data

Luca Pasa∗, Nicolò Navarin†, and Alessandro Sperduti‡

Department of Mathematics, University of Padua
Via Trieste, 63 - 35121 Padova, Italy

∗[email protected] †[email protected] ‡[email protected]

In this chapter, we discuss the application of deep learning techniques to input data that exhibit a graph structure. We consider both the case in which the input is a single, huge graph (e.g., a social network), where we are interested in predicting the properties of single nodes (e.g., users), and the case in which the dataset is composed of many small graphs where we want to predict the properties of whole graphs (e.g., molecule property prediction). We discuss the main components required to define such neural architectures and their alternative definitions in the literature. Finally, we present experimental results comparing the main graph neural networks in the literature.

14.1. Introduction

In this chapter, we consider whether it is possible to define neural network architectures that deal with structured input data. The amount and variety of data generated and stored in modern information processing systems is constantly increasing, as is the use of machine learning (ML) approaches, especially very successful deep learning (DL) solutions, to extract knowledge from them. Traditional ML approaches have been developed assuming data to be encoded into feature vectors; however, many important real-world applications generate data that are naturally represented by more complex structures, such as graphs. Graphs are particularly suited to represent the relations (arcs) between the components (nodes) constituting an entity. For instance, in social network data, single data “points” (i.e., users) are closely inter-related, and not explicitly representing such dependencies would most likely lead to an information loss. Because of this, ML techniques able to directly process structured data have gained more and more attention since the first developments, such as recursive neural networks [1, 2], proposed in the second half of the 1990s. In the last few years, there has been a burst of interest in developing DL models for graph domains or, more generally, for non-Euclidean domains, an area commonly referred to as geometric deep learning [3]. In this chapter, we focus on deep learning for graphs.

14.1.1. Why Graph Neural Networks?

The analysis of structured data is not new to the machine learning community. After the first proposals in the field of neural networks [1, 2, 4], in the 2000s, kernel methods for structured data became the dominant approach to dealing with such kinds of data. Kernel methods have been defined for trees [5], graphs [6–8], and graphs with continuous attributes [9, 10]. Kernel methods have been quite successful on the many tasks where small- to medium-sized datasets were available. However, the kernel approach suffers from two main drawbacks:

• lack of scalability to large datasets (with some exceptions);
• scarce ability to deal with real-valued attributes.

For more details on kernel methods for structured data, please refer to the surveys by Giannis et al. [11] or Nils et al. [12].

In the beginning of the 2010s, following the success of deep neural networks in many application domains, in particular of convolutional networks for images [13], the research community became interested in the design of deep neural network models for graphs, hoping for a leap in performance like the one observed for images.

The first scientific papers in this research area laid down the basic concepts that are still in use. The core idea is to learn a representation of a node in the graph that is conditioned on the representations of neighboring nodes. While this approach can be easily implemented in trees or DAGs, where the presence of directed arcs allows the updating of the nodes in a reverse topological order, in general graphs the presence of cycles is an obstacle to its implementation. A practical solution identified by early researchers [14, 15] consists in using an iterative or multilayer approach, in which each node representation depends on those of its neighbors at the previous iteration or layer. A few years later, essentially the same idea was derived and implemented starting from spectral graph theory (see Section 14.4).

The ability to learn a representation of a node in a graph is instrumental to the design of predictors in two main settings, namely the prediction of a property of a single node in a (large) graph, or of a property of a graph as a whole.


14.1.2. The Two Main Settings: Node Classification vs. Graph Classification

There are two main problem settings that can arise when dealing with structured data.

• Predictions over nodes in a network. In this setting, the dataset is composed of a single (possibly disconnected) large graph. Each example is a node in the graph, and the learning tasks are defined as predictions over the nodes. An example in this setting is the prediction of properties of a social network user based on his or her connections.
• Predictions over graphs. In this case, each example is composed of a whole graph, and the learning tasks are predictions of properties of the whole graphs. An example is the prediction of toxicity in humans of chemical compounds represented by their molecular graph.

Figure 14.1 shows the difference between the two settings. From a technical point of view, both settings require us to define (deep) neural architectures able to learn node representations that are invariant to graph isomorphism, which is usually obtained by resorting to a graph convolution. However, the second setting requires additional components, as will be discussed in Section 14.7, for learning an isomorphism-invariant representation for each individual graph.

14.1.3. Graph Neural Networks Basics

As mentioned before, the main problem faced by graph neural networks is to compute a sound and meaningful representation for graph nodes, i.e.,

• it is invariant to the way the graph is represented (e.g., in which order nodes and arcs of a graph are presented);
• it incorporates structural information that is relevant for the prediction task.

[Figure 14.1. Panel (a): Node-level predictions. Panel (b): Graph-level predictions.]

Figure 14.1: Examples of tasks on graphs: (a) node regression task, e.g., prediction of the age of users in a social network; (b) graph classification task, e.g., molecule classification. The values inside the boxes are the targets we aim to predict. Note that in (a) there is a target for each node, while in (b) we have a single target per graph.


The first requirement (soundness) is justified by the fact that isomorphic graphs should induce the same function over their nodes. This requirement is so important that even the representations of whole graphs should comply with it; otherwise, the same graph represented in different ways may return different predictions, which is of course not desirable. The second requirement (meaningfulness) is obvious: learned representations should preserve only the information necessary to best perform the task at hand. Both requirements can be reasonably satisfied in a neural network by using a convolution operator, defined on graphs, supported by a mechanism of communication (message passing) between neighboring nodes.

The general idea of graph convolution starts from a parallel between graphs and images. In convolutional networks for images, the hidden representation inherits the shape of the input image (with the exception of pixels close to the border, which, if no padding is used, imply a slight dimension reduction). Let us focus on the first convolutional layer. For each entry in the hidden representation, the convolution layer computes a representation that depends on the corresponding input pixel and on the neighboring ones (depending on the size of the filter). In practice, each neighboring position has an associated weight. This definition of convolution layer stems from the direct application of the convolution operator. On graphs, there is no straightforward corresponding convolution operator. As we will see later, it is possible either to define graph neural networks directly in the graph domain (see Section 14.3), or to resort to the graph spectral domain (see Section 14.4).

14.2. Notation and Definitions

In the following, we use italic letters to refer to variables, bold lowercase letters to refer to vectors, and bold uppercase letters to refer to matrices. The elements of a matrix $\mathbf{A}$ are referred to as $a_{ij}$ (and similarly for vectors). We use uppercase letters to refer to sets or tuples. In this chapter, we deal with problems and tasks that involve the concept of graph. Let $G = (V, E, \mathbf{X})$ be a graph, where $V = \{v_0, \ldots, v_{n-1}\}$ denotes the set of vertices (or nodes) of the graph, $E \subseteq V \times V$ is the set of edges, and $\mathbf{X} \in \mathbb{R}^{n \times s}$ is a multivariate signal on the graph nodes with the $i$-th row $\mathbf{x}_i$ representing the attributes of $v_i$. We define $\mathbf{A} \in \mathbb{R}^{n \times n}$ as the adjacency matrix of the graph, with elements $a_{ij} = 1 \iff (v_i, v_j) \in E$. With $\mathcal{N}(v)$ we denote the set of nodes adjacent to node $v$.
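To make the notation concrete, the following minimal sketch builds the adjacency matrix and the neighborhood of a node for a small graph. The function names `adjacency_matrix` and `neighbors`, the `undirected` flag, and the toy graph are illustrative assumptions, not part of the chapter.

```python
import numpy as np

def adjacency_matrix(n, edges, undirected=True):
    """Build the n x n adjacency matrix A with a_ij = 1 iff (v_i, v_j) is in E."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = 1.0
        if undirected:
            A[j, i] = 1.0
    return A

def neighbors(A, v):
    """Return N(v): the indices of the nodes adjacent to node v."""
    return set(np.flatnonzero(A[v]).tolist())

# Hypothetical toy graph: a path 0 - 1 - 2 plus an isolated node 3.
A = adjacency_matrix(4, [(0, 1), (1, 2)])
X = np.arange(8, dtype=float).reshape(4, 2)  # n = 4 nodes, s = 2 attributes each
```

Here row `X[i]` plays the role of the attribute vector of node $v_i$, and `neighbors(A, 1)` returns the set of nodes adjacent to $v_1$.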

14.3. Early Models of Graph Neural Network

In this section, we review the early definitions of graph neural networks that have been proposed in the literature. In the following definitions, for the sake of simplicity, we ignore the bias terms. While reading this section, it is important to keep in mind that a graph convolution operator can be applied to perform both graph classification and node classification (see Section 14.7). Therefore, while some of the models below have been applied to graph classification rather than node classification, the basic node relabeling or graph convolution components are the same for both tasks.

The first definition of a neural network for structured data, including graphs, was proposed by Sperduti and Starita in 1997 [2]. The authors proposed the generalized recursive neuron, a generalization to structures of a recurrent neuron, which is able to build a map from a domain of structures to the set of reals. Using generalized recursive neurons, the authors showed that it is possible to formalize several learning algorithms and models able to deal with structures as generalizations of the supervised networks developed for sequences. The core idea is to define a neural architecture that is modeled according to the graph topology. Thanks to weight sharing, the same set of neurons is applied to each vertex in the graph, producing an output representation that is based on the information associated with the vertex and on the representations generated for its neighbors. Although the paper proposes a learning algorithm based on recurrent backpropagation [16, 17] for general graphs with cycles, it reports no experimental assessment of the algorithm.

Later, an approach for general graphs, based on standard backpropagation, was proposed by Micheli [15] for graph predictions, and by Scarselli et al. [14] for node prediction. Several years later, the approach followed by Micheli was independently (re)proposed in [18] under the name graph convolution.

14.3.1. Recursive Graph Neural Networks

Scarselli et al. [14] proposed a recursive neural network exploiting the following general form of convolution:

$$\mathbf{h}_v^{t+1} = \sum_{u \in \mathcal{N}(v)} f(\mathbf{h}_u^{t}, \mathbf{x}_v, \mathbf{x}_u), \tag{14.1}$$

where $t \geq 0$ is the time of the recurrence, $\mathbf{h}_v^{t+1}$ is the representation of node $v$ at timestep $t + 1$, and $f$ is a neural network whose parameters have to be learned and are shared among all the vertices. The recurrent system is defined as a contraction mapping, and thus it is guaranteed to converge to a fixed point $\mathbf{h}^{*}$. It can be noticed that $\mathbf{h}^{*}$ is sound since: (i) the terms $f(\mathbf{h}_u^{t}, \mathbf{x}_v, \mathbf{x}_u)$ do not change with a change in the graph representation; (ii) the terms just mentioned are aggregated by a commutative operator (i.e., the sum), and so $\mathbf{h}_v^{t+1}$ ($t \geq 0$) does not depend on the order of presentation of the vertices. This idea was later re-branded as neural message passing, the formal definition of which was proposed by Gilmer et al. [19].
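The soundness argument for Eq. (14.1) can be checked numerically. The sketch below performs one update step and verifies that relabeling the nodes only permutes the output rows. The particular $f$ (a single tanh layer built by the hypothetical helper `make_f`) and the random graph are assumptions made for illustration; the sketch does not enforce the contraction property required for convergence.

```python
import numpy as np

def message_passing_step(A, H, X, f):
    """One step of Eq. (14.1): h_v <- sum over u in N(v) of f(h_u, x_v, x_u)."""
    n = A.shape[0]
    H_new = np.zeros_like(H)
    for v in range(n):
        for u in np.flatnonzero(A[v]):
            H_new[v] += f(H[u], X[v], X[u])
    return H_new

def make_f(W_h, W_v, W_u):
    # Hypothetical choice of f: a single tanh layer with weights shared by all nodes.
    return lambda h_u, x_v, x_u: np.tanh(W_h @ h_u + W_v @ x_v + W_u @ x_u)

rng = np.random.default_rng(0)
n, s, d = 5, 3, 4
A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = (A + A.T).astype(float)          # random undirected graph
X = rng.normal(size=(n, s))          # node attributes
H = rng.normal(size=(n, d))          # node states at time t
f = make_f(rng.normal(size=(d, d)), rng.normal(size=(d, s)), rng.normal(size=(d, s)))

perm = rng.permutation(n)            # a relabeling of the nodes
out = message_passing_step(A, H, X, f)
out_perm = message_passing_step(A[np.ix_(perm, perm)], H[perm], X[perm], f)
```

Because the aggregation is a sum, `out_perm` equals `out[perm]` up to floating-point error: presenting the same graph with a different node order does not change the representation assigned to each node.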


14.3.2. Feedforward Graph Neural Networks

In the same year, Micheli [15] proposed the neural network for graphs (NN4G) model. NN4G is based on a graph convolution that is defined as:

$$\mathbf{h}_v^{(1)} = \sigma\left(\bar{\mathbf{W}}^{(1)} \mathbf{x}_v\right), \tag{14.2}$$

$$\mathbf{h}_v^{(i+1)} = \sigma\left(\bar{\mathbf{W}}^{(i+1)} \mathbf{x}_v + \sum_{k=1}^{i} \hat{\mathbf{W}}^{(i+1,k)} \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{(k)}\right), \quad i > 0, \tag{14.3}$$

where $1 \leq v \leq n$ is the vertex index, $\hat{\mathbf{W}}^{(l,m)}$ are the weights on the connections from layer $m$ to layer $l$, $\bar{\mathbf{W}}^{(i+1)}$ are weights transforming the input representations for the $(i + 1)$-th layer, and $\sigma$ is a nonlinear activation function applied element-wise. Note that in this formulation, skip connections are present, to the $(i + 1)$-th layer, from layer 1 to layer $i$. Moreover, soundness of $\mathbf{h}_v^{(i+1)}$ for $i \geq 1$ is again guaranteed by the commutativity of the sum over the neighbors (i.e., $\sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{(k)}$).

14.4. Derivation of Spectral Graph Convolutions

The derivation of the spectral graph convolution operator originates from graph spectral filtering [18]. Let us fix a graph $G$. Let $x : V \to \mathbb{R}$ be a signal on the nodes $V$ of the graph $G$, i.e., a function that associates a real value with each node of $V$. Since the number of nodes in $G$ is fixed (i.e., $n$) and the set $V$ can be arbitrarily but consistently ordered¹, we can naturally represent every signal as a vector $\mathbf{x} \in \mathbb{R}^n$, which from now on we will refer to as the signal. In order to set up a convolutional network on $G$, we need the notion of a convolution $*_G$ between a signal $\mathbf{x}$ and a filter signal $\mathbf{f}$. However, as we do not have a natural description of translation on graphs, it is not obvious how to define the convolution directly in the graph domain. This operation is therefore usually defined in the spectral domain of the graph, using an analogy with classical Fourier analysis, in which the convolution of two signals is computed as the pointwise product of their Fourier transforms. For this reason, we start by providing the definition of the graph Fourier transform [20].

¹ The set $V$ is organized in an arbitrary order, for instance, when representing the graph with the corresponding adjacency matrix.

Let $\mathbf{L}$ be the (normalized) graph Laplacian, defined as

$$\mathbf{L} = \mathbf{I}_n - \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}, \tag{14.4}$$

where $\mathbf{I}_n$ is the $n \times n$ identity matrix, and $\mathbf{D}$ is the degree matrix with entries given as

$$d_{ij} = \begin{cases} \sum_{k=0}^{n-1} a_{ik}, & \text{if } i = j \\ 0, & \text{otherwise.} \end{cases} \tag{14.5}$$

Since $\mathbf{L}$ is real, symmetric, and positive semi-definite, we can compute its eigendecomposition as

$$\mathbf{L} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top}, \tag{14.6}$$

where $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_0, \ldots, \lambda_{n-1})$ is a diagonal matrix with the ordered eigenvalues of $\mathbf{L}$ as diagonal entries, and the orthonormal matrix $\mathbf{U}$ contains the corresponding eigenvectors $\{\mathbf{u}_0, \ldots, \mathbf{u}_{n-1}\}$ of $\mathbf{L}$ as columns. In many classical settings of Fourier analysis, such as the Euclidean space or the torus, the Fourier transform can be defined in terms of the eigenvalues and eigenvectors of the Laplace operator. In analogy, we now consider the eigenvectors $\{\mathbf{u}_0, \ldots, \mathbf{u}_{n-1}\}$ as the Fourier basis on the graph $G$, and the eigenvalues $\{\lambda_0, \ldots, \lambda_{n-1}\}$ as the corresponding graph frequencies. In particular, going back to our spatial signal $\mathbf{x}$, we can define its graph Fourier transform as

$$\hat{\mathbf{x}} = \mathbf{U}^{\top} \mathbf{x}, \tag{14.7}$$

and its inverse graph Fourier transform as

$$\mathbf{x} = \mathbf{U} \hat{\mathbf{x}}. \tag{14.8}$$

The entries $\hat{x}_i = \mathbf{x} \cdot \mathbf{u}_i$ are the frequency components or coefficients of the signal $\mathbf{x}$ with respect to the basis function $\mathbf{u}_i$ and associated with the graph frequency $\lambda_i$. For this reason, $\hat{\mathbf{x}}$ can also be regarded as a distribution on the spectral domain of the graph, i.e., to each basis function $\mathbf{u}_i$ with frequency $\lambda_i$ a corresponding coefficient $\hat{x}_i$ is associated.

Using the graph Fourier transform to switch between spatial and spectral domains, we are now ready to define the graph convolution between a filter $\mathbf{f}$ and a signal $\mathbf{x}$ as

$$\mathbf{f} *_G \mathbf{x} = \mathbf{U}\left(\hat{\mathbf{f}} \odot \hat{\mathbf{x}}\right) = \mathbf{U}\left(\left(\mathbf{U}^{\top} \mathbf{f}\right) \odot \left(\mathbf{U}^{\top} \mathbf{x}\right)\right), \tag{14.9}$$

where $\hat{\mathbf{f}} \odot \hat{\mathbf{x}} = (\hat{f}_0 \hat{x}_0, \ldots, \hat{f}_{n-1} \hat{x}_{n-1})$ denotes the component-wise Hadamard product of the two vectors $\hat{\mathbf{x}}$ and $\hat{\mathbf{f}}$. For graph convolutional networks, it is easier to design the filters $\mathbf{f}$ in the spectral domain as a distribution $\hat{\mathbf{f}}$, and then define the filter $\mathbf{f}$ on the graph as $\mathbf{f} = \mathbf{U} \hat{\mathbf{f}}$. According to Eq. (14.9), for a given $\hat{\mathbf{f}}$ the application of the convolutional filter $\mathbf{f}$ to a signal $\mathbf{x}$ is given as

$$\mathbf{f} *_G \mathbf{x} = \mathbf{U}\left(\left(\mathbf{U}^{\top} \mathbf{U} \hat{\mathbf{f}}\right) \odot \left(\mathbf{U}^{\top} \mathbf{x}\right)\right) = \mathbf{U}\left(\hat{\mathbf{f}} \odot \mathbf{U}^{\top} \mathbf{x}\right). \tag{14.10}$$

The Hadamard product $\hat{\mathbf{f}} \odot \hat{\mathbf{x}}$ can be formulated in matrix–vector notation as $\hat{\mathbf{f}} \odot \hat{\mathbf{x}} = \hat{\mathbf{F}} \hat{\mathbf{x}}$ by applying the diagonal matrix $\hat{\mathbf{F}} = \mathrm{diag}(\hat{\mathbf{f}})$, given by

$$(\hat{\mathbf{F}})_{ij} = \mathrm{diag}(\hat{\mathbf{f}})_{ij} = \begin{cases} \hat{f}_i, & \text{if } i = j \\ 0, & \text{otherwise,} \end{cases}$$

to the vector $\hat{\mathbf{x}}$. According to Eq. (14.10), we therefore obtain

$$\mathbf{f} *_G \mathbf{x} = \mathbf{U} \hat{\mathbf{F}} \mathbf{U}^{\top} \mathbf{x}. \tag{14.11}$$
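The chain from the Laplacian of Eq. (14.4) to the convolution of Eqs. (14.9) and (14.11) can be reproduced on a small graph. The sketch below is a minimal illustration: the 4-node cycle, the signal, and the spectral filter `f_hat = exp(-lam)` are arbitrary assumptions, and `normalized_laplacian` assumes a graph without isolated nodes.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I_n - D^{-1/2} A D^{-1/2} (Eq. 14.4); assumes no isolated nodes."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def graph_fourier_basis(L):
    """Eigendecomposition L = U diag(lam) U^T, eigenvalues in ascending order."""
    lam, U = np.linalg.eigh(L)
    return lam, U

# Hypothetical toy graph: a cycle on 4 nodes.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)
lam, U = graph_fourier_basis(L)

x = np.array([1.0, -2.0, 0.5, 3.0])
x_hat = U.T @ x                       # graph Fourier transform, Eq. (14.7)
x_rec = U @ x_hat                     # inverse transform, Eq. (14.8)

f_hat = np.exp(-lam)                  # an arbitrary spectral filter (assumption)
f = U @ f_hat                         # the same filter in the node domain
conv = U @ ((U.T @ f) * (U.T @ x))    # graph convolution, Eq. (14.9)
```

Since `U` is orthonormal, `x_rec` recovers `x`, and `conv` coincides with `U @ np.diag(f_hat) @ U.T @ x`, the matrix form of Eq. (14.11).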

We can design the diagonal matrix $\hat{\mathbf{F}}$ and, thus, the spectral filter $\mathbf{f}$ in various ways. The simplest way would be to define $\mathbf{f}_{\boldsymbol{\theta}}$ as a non-parametric filter, i.e., use $\hat{\mathbf{F}}_{\boldsymbol{\theta}} = \mathrm{diag}(\boldsymbol{\theta})$, where $\boldsymbol{\theta} = (\theta_0, \ldots, \theta_{n-1})$ is a completely free vector of filter parameters that can be learned by the neural network. However, such a filter grows in size with the data, and it is not well suited for learning. A simple alternative is to use a polynomial parametrization based on powers of the spectral matrix $\boldsymbol{\Lambda}$ [21] for the filter, such as

$$\hat{\mathbf{F}}_{\boldsymbol{\theta}} = \sum_{i=0}^{k} \theta_i \boldsymbol{\Lambda}^{i}. \tag{14.12}$$

This filter has $k + 1$ parameters $\{\theta_0, \ldots, \theta_k\}$ to learn, and it is spatially $k$-localized on the graph. One of the main advantages of this filter is that we can formulate it explicitly in the graph domain. Recalling the eigendecomposition $\mathbf{L} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top}$ of the graph Laplacian, Eqs. (14.11) and (14.12) when combined give

$$\mathbf{f}_{\boldsymbol{\theta}} *_G \mathbf{x} = \mathbf{U} \hat{\mathbf{F}}_{\boldsymbol{\theta}} \mathbf{U}^{\top} \mathbf{x} = \sum_{i=0}^{k} \theta_i \mathbf{U} \boldsymbol{\Lambda}^{i} \mathbf{U}^{\top} \mathbf{x} = \sum_{i=0}^{k} \theta_i \left(\mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top}\right)^{i} \mathbf{x} = \sum_{i=0}^{k} \theta_i \mathbf{L}^{i} \mathbf{x}. \tag{14.13}$$

Note that the computation of the eigendecomposition of the graph Laplacian $\mathbf{L}$ (whose cost is of the order $O(n^3)$) is feasible only for relatively small graphs (with some thousands of nodes at most). However, real-world problems involve graphs with hundreds of thousands or even millions of nodes: in these cases, the computation of the eigendecomposition of $\mathbf{L}$ is prohibitive, and a filter of the form of Eq. (14.13) has clear advantages compared to a spectral filter given in the form of Eq. (14.11).
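The equivalence stated in Eq. (14.13) can be verified directly. The sketch below applies the same polynomial filter once through the eigendecomposition and once through repeated products with $\mathbf{L}$; the function names, the random graph, and the coefficient values are illustrative assumptions.

```python
import numpy as np

def poly_filter_spectral(L, theta, x):
    """U (sum_i theta_i Lambda^i) U^T x: requires the O(n^3) eigendecomposition."""
    lam, U = np.linalg.eigh(L)
    F_hat = sum(t * lam**i for i, t in enumerate(theta))   # diagonal of F_hat
    return U @ (F_hat * (U.T @ x))

def poly_filter_spatial(L, theta, x):
    """sum_i theta_i L^i x, using only repeated matrix-vector products."""
    out = np.zeros_like(x)
    L_pow_x = x.copy()                 # L^0 x
    for t in theta:
        out += t * L_pow_x
        L_pow_x = L @ L_pow_x          # L^{i+1} x
    return out

rng = np.random.default_rng(0)
n = 6
A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = A + A.T + np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
A = (A > 0).astype(float)             # random graph plus a cycle: no isolated nodes
d = A.sum(axis=1)
L = np.eye(n) - A / np.sqrt(np.outer(d, d))
theta = np.array([0.5, -1.0, 0.25])   # k = 2, i.e., k + 1 = 3 parameters
x = rng.normal(size=n)
```

Both functions return the same vector up to floating-point error, but only the second avoids the eigendecomposition; with a sparse $\mathbf{L}$, each product costs time proportional to the number of edges.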


Applying the convolution defined in Eq. (14.13) to a multivariate signal $\mathbf{X} \in \mathbb{R}^{n \times s}$ and using $m$ filters, we obtain the following definition for a single graph convolutional layer:

$$\mathbf{H} = \sum_{i=0}^{k} \mathbf{L}^{i} \mathbf{X} \boldsymbol{\Theta}^{(i)}, \tag{14.14}$$

where $\boldsymbol{\Theta}^{(i)} \in \mathbb{R}^{s \times m}$. This layer can be directly applied in a single-layer neural network to compute the task predictions, i.e.,

$$\mathbf{Y} = \sigma\left(\sum_{i=0}^{k} \mathbf{L}^{i} \mathbf{X} \boldsymbol{\Theta}^{(i)}\right), \tag{14.15}$$

where $\sigma$ is an appropriate activation function for the output task, e.g., the softmax for a classification problem with $m$ classes.

14.5. Graph Convolutions in Literature

In this section, we present in detail some of the most widely adopted graph convolutions that have been proposed in the literature.

14.5.1. Chebyshev Graph Convolution

The parametrization of the polynomial filter defined in Eq. (14.12) is given in the monomial basis. Alternatively, Defferrard et al. [18] proposed to use Chebyshev polynomials as a polynomial basis. In general, the usage of a Chebyshev basis improves the stability in the case of numerical approximations. The Chebyshev polynomials are defined by the recurrence $T^{(0)}(x) = 1$, $T^{(1)}(x) = x$, and, for $k > 1$, $T^{(k)}(x) = 2x\,T^{(k-1)}(x) - T^{(k-2)}(x)$. The filter is then defined as

$$\hat{\mathbf{F}}_{\boldsymbol{\theta}} = \sum_{i=0}^{k} \theta_i T^{(i)}(\tilde{\boldsymbol{\Lambda}}), \tag{14.16}$$

where $\tilde{\boldsymbol{\Lambda}} = \frac{2}{\lambda_{\max}} \boldsymbol{\Lambda} - \mathbf{I}_n$ is the diagonal matrix of scaled eigenvalues of the graph Laplacian. The resulting convolution is then:

$$\mathbf{f}_{\boldsymbol{\theta}} *_G \mathbf{x} = \sum_{i=0}^{k} \theta_i T^{(i)}(\tilde{\mathbf{L}}) \mathbf{x}, \tag{14.17}$$

where $\tilde{\mathbf{L}} = \frac{2}{\lambda_{\max}} \mathbf{L} - \mathbf{I}_n$.
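As a sketch of how Eq. (14.17) can be evaluated without forming matrix powers, the code below applies the three-term recurrence directly to vectors. The function name `cheb_conv`, the default `lam_max = 2.0` (an upper bound on the spectrum of the normalized Laplacian), and the toy graph are assumptions; the spectral reference uses NumPy's `chebval` to evaluate the Chebyshev series on the scaled eigenvalues.

```python
import numpy as np

def cheb_conv(L, theta, x, lam_max=2.0):
    """Eq. (14.17): sum_i theta_i T^(i)(L_tilde) x via the three-term recurrence."""
    n = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(n)
    t_prev, t_curr = x, L_tilde @ x            # T^(0) x and T^(1) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for th in theta[2:]:
        t_prev, t_curr = t_curr, 2.0 * L_tilde @ t_curr - t_prev
        out = out + th * t_curr
    return out

# Hypothetical toy graph: a triangle with one pendant node.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
L = np.eye(4) - A / np.sqrt(np.outer(d, d))
theta = np.array([0.3, -0.7, 0.2, 0.5])        # k = 3
x = np.array([1.0, 2.0, -1.0, 0.5])
y = cheb_conv(L, theta, x)

# Reference: the same filter applied in the spectral domain, Eq. (14.16),
# with lam_max = 2 so that the scaled eigenvalues are lam - 1.
lam, U = np.linalg.eigh(L)
y_ref = U @ (np.polynomial.chebyshev.chebval(lam - 1.0, theta) * (U.T @ x))
```

The two results agree up to floating-point error, while `cheb_conv` needs only $k$ matrix-vector products.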


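14.5.2. Graph Convolutional Networks

Kipf and Welling [22] proposed to fix the order $k = 1$ in Eq. (14.17) to obtain a linear first-order filter for each graph convolutional layer in a neural network. Additionally, they approximated $\lambda_{\max} \approx 2$, so that $\tilde{\mathbf{L}} = \mathbf{L} - \mathbf{I}_n$, and tied the two parameters by fixing $\theta = \theta_0 = -\theta_1$. They also suggested stacking these simple convolutions to obtain larger receptive fields (i.e., neighbors reachable with an increasing number of hops) in order to improve the discriminatory power of the resulting network. Specifically, the resulting convolution operator is defined as

$$\mathbf{f}_{\theta} *_G \mathbf{x} = \theta\left(\mathbf{I}_n + \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}\right)\mathbf{x} = \theta\left(2\mathbf{I}_n - \mathbf{L}\right)\mathbf{x}. \tag{14.18}$$

A renormalization trick to limit the eigenvalues of the resulting matrix is also introduced: $\mathbf{I}_n + \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ is replaced by $\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_n$ and $\tilde{d}_{ii} = \sum_{j=0}^{n-1} \tilde{a}_{ij}$. In this way, the spectral filter $\mathbf{f}_{\theta}$ is built not on the spectral decomposition of the graph Laplacian $\mathbf{L}$ but on the eigendecomposition of the perturbed operator $\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$.

Applying this convolution operator to a multivariate signal $\mathbf{X} \in \mathbb{R}^{n \times s}$ and using $m$ filters, we obtain the following definition for a single graph convolutional layer:

$$\mathbf{H} = \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \boldsymbol{\Theta}, \tag{14.19}$$

where $\boldsymbol{\Theta} \in \mathbb{R}^{s \times m}$. This convolutional operation has complexity $O(|E| m s)$. To obtain a Graph Convolutional Network (GCN), several graph convolutional layers are stacked and interleaved by a nonlinear activation function, typically a ReLU. If $\mathbf{H}^{(0)} = \mathbf{X}$, then, based on the single-layer convolution in Eq. (14.19), we obtain the following recursive definition for the $k$-th graph convolutional layer:

$$\mathbf{H}^{(k)} = \mathrm{ReLU}\left(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(k-1)} \boldsymbol{\Theta}\right). \tag{14.20}$$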
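A dense NumPy sketch of Eqs. (14.19) and (14.20) follows. It is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, each layer is given its own weight matrix (passed as the list `thetas`), and dense matrices are used for brevity, whereas the $O(|E| m s)$ cost quoted above requires a sparse adjacency matrix.

```python
import numpy as np

def renormalized_adjacency(A):
    """S = D~^{-1/2} (A + I_n) D~^{-1/2}, i.e., the renormalization trick."""
    A_tilde = A + np.eye(A.shape[0])
    d_tilde = A_tilde.sum(axis=1)          # always >= 1 thanks to the self-loops
    return A_tilde / np.sqrt(np.outer(d_tilde, d_tilde))

def gcn_forward(A, X, thetas):
    """Stack of layers H^(k) = ReLU(S H^(k-1) Theta_k), with H^(0) = X."""
    S = renormalized_adjacency(A)
    H = X
    for Theta in thetas:
        H = np.maximum(S @ H @ Theta, 0.0)
    return H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))                                    # n = 4, s = 3
thetas = [rng.normal(size=(3, 8)), rng.normal(size=(8, 2))]    # two layers
H = gcn_forward(A, X, thetas)
eigvals = np.linalg.eigvalsh(renormalized_adjacency(A))
```

The eigenvalues of the renormalized operator lie in $(-1, 1]$, which is what keeps repeated application of the layer numerically well behaved.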

14.5.3. GraphConv

In 2019, Morris et al. [23] investigated GNNs from a theoretical point of view. Specifically, they studied the relationship between GNNs and the one-dimensional Weisfeiler–Lehman graph isomorphism heuristic (1-WL). 1-WL is an iterative algorithm that computes a graph invariant for labeled graphs, which at each iteration produces a coloring for the nodes of the graph. The output of an iteration depends on the coloring from the previous one. At iteration 0, the algorithm uses as initial coloring the label of the node. The aim of 1-WL is to test whether two graphs $G$ and $H$ are isomorphic. After applying the algorithm until convergence on both graphs, if they show different color distributions, the 1-WL test concludes that the graphs are not isomorphic. If the coloring is the same, the graphs may or may not be isomorphic. In fact, this algorithm is not able to distinguish all non-isomorphic graphs (like all graph invariants), but it is still a powerful heuristic, which can successfully test isomorphism for a broad class of graphs. A more detailed introduction to 1-WL is reported in Section 14.6.

As a result of this theoretical study, Morris et al. [23] defined the GraphConv operator inspired by the Weisfeiler–Lehman graph invariant; it is defined as follows:

$$\mathbf{H}^{(i+1)} = \mathbf{H}^{(i)} \bar{\mathbf{W}}^{(i)} + \mathbf{A} \mathbf{H}^{(i)} \hat{\mathbf{W}}^{(i)}, \tag{14.21}$$

where $\mathbf{H}^{(0)} = \mathbf{X}$, and $\bar{\mathbf{W}}^{(i)}$ and $\hat{\mathbf{W}}^{(i)}$ are two weight matrices. It is interesting to note that this definition of convolution is very similar to NN4G (cf. Eq. (14.3)).
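The 1-WL procedure described in this section can be sketched as a color refinement loop. The code below is an illustrative sketch: the function names, the representation of a graph as a pair `(adj, labels)`, and the use of a color palette shared between the two graphs (so that their color histograms are comparable) are assumptions of this example.

```python
from collections import Counter

def wl_histograms(graphs, n_iters):
    """Run n_iters rounds of 1-WL on several graphs with a shared color palette.

    Each graph is a pair (adj, labels): adj maps a node to its neighbor list,
    labels maps a node to its initial label.
    """
    colorings = [dict(labels) for _, labels in graphs]
    for _ in range(n_iters):
        palette = {}                      # signature -> compressed color id
        new_colorings = []
        for (adj, _), colors in zip(graphs, colorings):
            new = {}
            for v in adj:
                # New color = old color plus the sorted multiset of neighbor colors.
                sig = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                new[v] = palette.setdefault(sig, len(palette))
            new_colorings.append(new)
        colorings = new_colorings
    return [Counter(c.values()) for c in colorings]

def maybe_isomorphic(g, h):
    """False: certainly non-isomorphic. True: 1-WL cannot tell the graphs apart."""
    n_iters = len(g[0]) + len(h[0])       # enough rounds to reach a stable coloring
    hist_g, hist_h = wl_histograms([g, h], n_iters)
    return hist_g == hist_h

uniform = {i: 0 for i in range(6)}
cycle6 = ({i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}, uniform)
two_triangles = ({0: [1, 2], 1: [0, 2], 2: [0, 1],
                  3: [4, 5], 4: [3, 5], 5: [3, 4]}, uniform)
path4 = ({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, {i: 0 for i in range(4)})
star4 = ({0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}, {i: 0 for i in range(4)})
```

The pair `cycle6` and `two_triangles` is a standard example of non-isomorphic graphs that 1-WL cannot distinguish (both are 2-regular with uniform labels), whereas `path4` and `star4` are separated after the first round.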

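14.5.4. Graph Convolutions Exploiting Personalized PageRank

Klicpera et al. [24] proposed a graph convolution that exploits Personalized PageRank. Let $f(\cdot)$ define a two-layer feedforward neural network. The PPNP layer is defined as

$$\mathbf{H} = \alpha\left(\mathbf{I}_n - (1 - \alpha)\tilde{\mathbf{A}}\right)^{-1} f(\mathbf{X}), \tag{14.22}$$

where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_n$. In the original formulation [24], $\tilde{\mathbf{A}}$ enters Eq. (14.22) in its symmetrically normalized form $\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$, whose eigenvalues are bounded by 1 in absolute value, so that the inverse exists for $0 < \alpha \leq 1$. Such a filter preserves locality due to the properties of Personalized PageRank. The same paper proposed an approximation, referred to as APPNP, which is derived from a truncated power iteration and avoids the expensive computation of the matrix inverse. It is implemented as a multilayer network where the $(l + 1)$-th layer is defined as

$$\mathbf{H}^{(l+1)} = (1 - \alpha)\tilde{\mathbf{S}} \mathbf{H}^{(l)} + \alpha \mathbf{H}^{(0)}, \tag{14.23}$$

where $\mathbf{H}^{(0)} = f(\mathbf{X})$ and $\tilde{\mathbf{S}}$ is the renormalized adjacency matrix adopted in GCN, i.e., $\tilde{\mathbf{S}} = \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$.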
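The relation between the exact layer and its truncated approximation can be checked numerically. The sketch below assumes that the same renormalized matrix `S` is used in both the closed form and the iteration of Eq. (14.23); the function names, the toy graph, and the random matrix `H0` standing in for $f(\mathbf{X})$ are illustrative assumptions.

```python
import numpy as np

def ppnp(S, H0, alpha):
    """Closed-form propagation: alpha (I_n - (1 - alpha) S)^{-1} H0."""
    n = S.shape[0]
    return alpha * np.linalg.solve(np.eye(n) - (1.0 - alpha) * S, H0)

def appnp(S, H0, alpha, n_layers):
    """Truncated power iteration of Eq. (14.23), starting from H^(0) = H0."""
    H = H0
    for _ in range(n_layers):
        H = (1.0 - alpha) * S @ H + alpha * H0
    return H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d_tilde = A_tilde.sum(axis=1)
S = A_tilde / np.sqrt(np.outer(d_tilde, d_tilde))   # renormalized adjacency
H0 = rng.normal(size=(4, 2))                        # stands in for f(X)
exact = ppnp(S, H0, alpha=0.1)
approx = appnp(S, H0, alpha=0.1, n_layers=200)
```

Each iteration shrinks the distance to the fixed point by a factor of at most $1 - \alpha$, so `approx` converges to `exact` as the number of layers grows; in practice far fewer layers are used, trading accuracy of the approximation for cost.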

14.5.5. GCNII

While the GCN presented in Section 14.5.2 is very popular, one of its main drawbacks is that it is not suited to building deep networks. In fact, its authors proposed to use just two graph convolutional layers. This is a common problem of many graph convolutions, known as the over-smoothing problem [25]. In [26], an extension of GCN was proposed. The authors aimed at the definition of a graph convolution that can be adopted to build truly deep graph neural networks. To pursue this goal, they enhanced the original GCN formulation in Eq. (14.20) with Initial residual and Identity mapping connections (thus the acronym GCNII). The $(l + 1)$-th layer of GCNII is defined as

$$\mathbf{H}^{(l+1)} = \sigma\left(\left((1 - \alpha_l)\tilde{\mathbf{S}} \mathbf{H}^{(l)} + \alpha_l \mathbf{H}^{(0)}\right)\left((1 - \beta_l)\mathbf{I}_n + \beta_l \mathbf{W}^{(l)}\right)\right), \tag{14.24}$$

where $\alpha_l$ and $\beta_l$ are hyper-parameters. The authors propose to fix $\alpha_l = 0.1$ or $0.2$ for each $l$, and $\beta_l = \lambda / l$, where $\lambda$ is a single hyper-parameter to tune.

14.5.6. Graph Attention Networks

Graph Attention Networks (GATs) [27] exploit a different convolution operator based on masked self-attention. The attention mechanism was introduced by Bahdanau et al. [28] with the goal of enhancing the performance of the encoder–decoder architecture on neural network-based machine translation tasks. The basic idea behind the proposed attention mechanism is to allow the model to selectively focus on valuable parts of the input, and hence learn the associations between them. In GAT, the attention mechanism is developed by replacing the adjacency matrix in the convolution with a matrix of attention weights

$$\mathbf{H}^{(i+1)} = \sigma\left(\mathbf{B}^{(i+1)} \mathbf{H}^{(i)}\right), \tag{14.25}$$

where $0 \leq i < l$, with $l$ the number of layers, $\mathbf{H}^{(0)} = \mathbf{X}$, and the $(u, v)$-th element of $\mathbf{B}^{(i+1)}$ is defined as

$$b_{u,v}^{(i+1)} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{w}^{\top}\left[\mathbf{W}\mathbf{h}_u^{(i)} \,\|\, \mathbf{W}\mathbf{h}_v^{(i)}\right]\right)\right)}{\sum_{k \in \mathcal{N}(u)} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{w}^{\top}\left[\mathbf{W}\mathbf{h}_u^{(i)} \,\|\, \mathbf{W}\mathbf{h}_k^{(i)}\right]\right)\right)}, \tag{14.26}$$

if $(u, v) \in E$, and 0 otherwise. The vector $\mathbf{w}$ and the matrix $\mathbf{W}$ are learnable parameters. The authors propose to use multihead attention to stabilize the training. While it may be harder to train, GAT allows the neighbors of a node to be weighted differently; thus it is a very expressive graph convolution.
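The attention matrix of Eq. (14.26) amounts to a softmax over the neighbors of each node. The single-head sketch below is illustrative only: the function names, the LeakyReLU slope of 0.2, and the toy graph are assumptions, neighborhoods are taken exactly as given by `A` (no self-loops are added), and every node is assumed to have at least one neighbor.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_attention(A, H, W, w):
    """Attention matrix B of Eq. (14.26); row u is a softmax over N(u)."""
    Z = H @ W.T                                  # row u holds W h_u
    d = Z.shape[1]
    # w^T [W h_u || W h_v] splits into a term for u plus a term for v.
    scores = leaky_relu((Z @ w[:d])[:, None] + (Z @ w[d:])[None, :])
    scores = np.where(A > 0, scores, -np.inf)    # masking: keep only edges
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    expd = np.exp(scores)
    return expd / expd.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))          # current node representations
W = rng.normal(size=(5, 3))          # shared linear map
w = rng.normal(size=10)              # attention vector, length 2 * 5
B = gat_attention(A, H, W, w)
H_next = np.tanh(B @ H)              # Eq. (14.25) with sigma = tanh (assumption)
```

Each row of `B` sums to one and is zero outside the neighborhood of the corresponding node, so `B` acts as a learned, data-dependent replacement for the adjacency matrix.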

14.5.7. GraphSAGE

Another interesting proposal for the convolution over the node neighborhood is GraphSAGE [29], where the aggregation over the neighborhoods is performed by using sum, mean or max-pooling operators, followed by a linear projection in order to update the node representation. In addition, the approach exploits a particular neighbor sampling scheme: for each vertex v, the method uniformly samples a fixed-size set of neighbors U(v), instead of using the full neighborhood N(v). Using a fixed-size subset of neighbors keeps the computational footprint of each batch invariant to the actual node degree distribution. Formally, the k-th layer of the GraphSAGE convolution is defined as follows:

$$\mathbf{h}^{(k)}_{U(v)} = A^{(k)}\big(\{\mathbf{h}_u^{(k-1)}, \forall u \in U(v)\}\big), \tag{14.27}$$

$$\mathbf{h}_v^{(k)} = \sigma\big(\mathbf{W}^{(k)} [\mathbf{h}_v^{(k-1)}, \mathbf{h}^{(k)}_{U(v)}]\big), \tag{14.28}$$


where A(k) is the aggregation function used at the k-th layer. In particular, the authors proposed to use different strategies to implement A, such as a mean aggregation function, a pooling aggregator, or even a more complex aggregator based on an LSTM architecture.
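The two steps in Eqs. (14.27) and (14.28) can be sketched as follows, using a mean aggregator and ReLU. This is only an illustration under stated assumptions: neighbors are sampled with replacement when a node has fewer neighbors than the requested sample size, every node has at least one neighbor, and all names and toy values are hypothetical.

```python
import numpy as np

def sample_neighbors(neighbors, v, size, rng):
    """Uniformly sample a fixed-size set U(v).
    Assumption: sample with replacement when deg(v) < size."""
    nbrs = neighbors[v]
    return rng.choice(nbrs, size=size, replace=len(nbrs) < size)

def graphsage_layer(H, neighbors, W, size, rng):
    """Eqs. (14.27)-(14.28) with a mean aggregator and ReLU."""
    out = []
    for v in range(H.shape[0]):
        U = sample_neighbors(neighbors, v, size, rng)
        h_U = H[U].mean(axis=0)                        # Eq. (14.27), A = mean
        out.append(W @ np.concatenate([H[v], h_U]))    # Eq. (14.28)
    return np.maximum(np.array(out), 0.0)

# Toy graph given as adjacency lists.
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(5, 6))     # maps the 3 + 3 concatenated features to 5
H1 = graphsage_layer(H, neighbors, W, size=2, rng=rng)
print(H1.shape)
```

Each node touches exactly `size` neighbor vectors regardless of its degree, which is the property that keeps the cost of a batch independent of the degree distribution.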

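14.5.8. Gated Graph Sequence Neural Networks

Li et al. [30] extended the recurrent graph neural network proposed by Scarselli et al. in 2009 [14]. They proposed removing the constraint that the recurrent system be a contraction mapping, and implemented this idea by adopting recurrent neural networks to define the recurrence. Specifically, the authors adopted the gated recurrent unit (GRU). The recurrent convolution operator is defined as follows:

$$
\begin{aligned}
\mathbf{h}_v^{(1)} &= [\mathbf{x}_v, \mathbf{0}],\\
\mathbf{a}_v^{(t)} &= \mathbf{A}_v\,[\mathbf{h}_u^{(t-1)}, \forall u \in V],\\
\mathbf{z}_v^{t} &= \sigma\big(\mathbf{W}_z \mathbf{a}_v^{(t)} + \mathbf{U}_z \mathbf{h}_v^{(t-1)}\big),\\
\mathbf{r}_v^{t} &= \sigma\big(\mathbf{W}_r \mathbf{a}_v^{(t)} + \mathbf{U}_r \mathbf{h}_v^{(t-1)}\big),\\
\mathbf{c}_v^{t} &= \tanh\big(\mathbf{W} \mathbf{a}_v^{(t)} + \mathbf{U}(\mathbf{r}_v^{t} \odot \mathbf{h}_v^{(t-1)})\big),\\
\mathbf{h}_v^{(t)} &= (1 - \mathbf{z}_v^{t}) \odot \mathbf{h}_v^{(t-1)} + \mathbf{z}_v^{t} \odot \mathbf{c}_v^{t},
\end{aligned}
$$

where $\mathbf{A}_v$ is row v of the adjacency matrix A, and ⊙ denotes the element-wise product. Note that the propagation of the node embeddings through the edges is performed by computing $\mathbf{a}_v^{(t)}$, while the other equations describe the GRU unit structure.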
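A compact way to read the recurrence above is as one matrix-form update applied to all nodes at once. The sketch below assumes that the gates combine with the states through element-wise products, as in a standard GRU, and uses random matrices as stand-ins for the learned parameters; the function names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(A, H, Wz, Uz, Wr, Ur, W, U):
    """One recurrent update for all nodes; row v of H is h_v^{(t-1)}."""
    M = A @ H                               # row v is a_v^{(t)}
    Z = sigmoid(M @ Wz.T + H @ Uz.T)        # update gate
    R = sigmoid(M @ Wr.T + H @ Ur.T)        # reset gate
    C = np.tanh(M @ W.T + (R * H) @ U.T)    # candidate state
    return (1.0 - Z) * H + Z * C

# Path graph 0-1-2 with 2 input features, padded to a hidden size of 4.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hidden = 4
H = np.concatenate([X, np.zeros((3, hidden - 2))], axis=1)   # h_v^{(1)} = [x_v, 0]

rng = np.random.default_rng(0)
params = [rng.normal(scale=0.5, size=(hidden, hidden)) for _ in range(6)]
for _ in range(3):                          # three propagation steps
    H = ggnn_step(A, H, *params)
print(H.shape)
```

The same six parameter matrices are reused at every step, which is what distinguishes this recurrent operator from a stack of distinct convolutional layers.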

14.5.9. Other Convolutions

The convolutions presented up to now are the most common and interesting. Moreover, the structure of these operators highlights many elements that the majority of the GCs developed in the last few years have in common. In the literature, many convolutions were developed using some of the discussed operators as a starting point. For instance, DGCNN [31] adopts a graph convolution very similar to GCN [22]. Specifically, it adopts a slightly different propagation scheme for vertices’ representations, based on the random-walk graph Laplacian. A more straightforward approach to defining convolutions on graphs is PATCHY-SAN (PSCN) [32]. This approach is inspired by how convolutions are defined over images. It consists in selecting a fixed number of vertices from each graph and exploiting a canonical ordering on graph vertices. For each vertex, it defines a fixed-size neighborhood, exploiting the same vertex ordering. It requires


the vertices of each input graph to be in a canonical ordering, and computing such an ordering is as hard as the graph isomorphism problem (for which no polynomial-time algorithm is known). The Funnel GCNN (FGCNN) model [33] aims to enhance the gradient propagation using a simple aggregation function and LeakyReLU activation functions. Hinging on the similarity of the adopted graph convolutional operator, that is, the GraphConv (see Section 14.5.3), to the Weisfeiler–Lehman (WL) Subtree Kernel [34], it introduces a loss term for the output of each convolutional layer to guide the network to reconstruct the corresponding explicit WL features. Moreover, the number of filters used at each convolutional layer is based on a measure of the WL kernel complexity. Bianchi et al. [35] developed a convolutional layer based on auto-regressive moving average (ARMA) filters. Compared to polynomial filters, this type of filter can handle higher-order neighborhoods. Moreover, ARMA filters present a flexible frequency response. The resulting nonlinear graph filter enhances the modeling capability compared to common GNNs that exploit convolutional layers based on polynomial filters. Recently, geometric graph convolutional networks (Geom-GCN) [36] were proposed. The model exploits a geometric aggregation scheme that is permutation-invariant and consists of three modules: node embedding, structural neighborhood, and bi-level aggregation. As for the node embedding component, the authors proposed using three different ad-hoc embedding methods to compute node embeddings that preserve specific topology patterns. The structural neighborhood is defined as the concatenation of the neighborhood in the graph space and the neighborhood in the latent space. Specifically, the neighborhood of a node v in the latent space is the set of nodes whose distance from v is less than a predefined value.
To compute the distance, a particular function is defined over two latent space node representations, and it is based on a similarity measure between nodes in the latent space. The bi-level aggregation exploits two aggregation functions used in defining the graph neural network layer. These two functions guarantee permutation invariance for a graph, and they are able to effectively extract structural information of nodes in neighborhoods.

14.5.10. Beyond the Message Passing

Some recent works in the literature exploit the idea of extending graph convolution layers to increase the receptive field size, without increasing the depth of the model. The basic idea underpinning these methods is to consider the case in which the graph convolution can be expressed as a polynomial of the powers of a transformation T(·) of the adjacency matrix. The models based on this idea are able to simultaneously and directly consider all topological receptive fields up to k-hops, just like the ones that


are obtained by a stack of graph convolutional layers of depth k, without incurring the typical limitations related to the complex interactions among the parameters of the GC layers. Formally, the idea is to define a representation as built from the contribution of all topological receptive fields up to k-hops as

$$\mathbf{H} = f\big(T(\mathbf{A})^{0}\mathbf{X},\, T(\mathbf{A})^{1}\mathbf{X},\, \ldots,\, T(\mathbf{A})^{k}\mathbf{X}\big), \tag{14.29}$$

where T (·) is a transformation of the adjacency matrix (e.g., the Laplacian matrix), and f is a function that aggregates and transforms the various components obtained from the powers of T (A), for instance the concatenation, the summation, or even something more complex, such as a multilayer perceptron. The f function can be defined as a parametric function, depending on a set of parameters θ whose values can be estimated from data (e.g., when f involves an MLP). In the following, we discuss some recent works that define instances of this general architectural component. One of the first methods that exploited this intuition is presented by Atwood and Towsley [37]. In their work, the authors propose two architectures designed to perform node classification and graph classification (we will discuss in depth the distinction between the two tasks in Section 14.7). Specifically, Atwood and Towsley proposed to exploit the power series of the probability transition matrix, multiplied (using the Hadamard product) by the inputs. Moreover, the model developed to perform graph classification exploits the summation as f function. Similarly, the model proposed by Defferrard et al. in 2016 [18] exploits the Chebyshev polynomials and sums the obtained representations over k. Another interesting approach is that proposed by Tran et al. [38], where the authors consider larger receptive fields compared to standard graph convolutions. They focus on a convolution definition based on the shortest paths, instead of the standard random walks obtained by exponentiation of the adjacency matrix. Wu et al. introduced a simplification of the graph convolution operator, dubbed Simple Graph Convolution (SGC) [39]. The proposed model is based on the idea that perhaps the nonlinear operator introduced by GCNs is not essential. 
The authors propose to stack several linear GC operators, since stacking multiple GC layers has an important effect on the location of the learned filters (after k GC layers, the hidden representation of a vertex considers information coming from the vertices up to distance k). Formally, the model is defined as follows:

$$\mathbf{Y} = \mathrm{softmax}(\mathbf{S}^{k}\mathbf{X}),$$

where $\mathbf{S} = \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$, $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$, and $\tilde{\mathbf{D}}$ is the degree matrix of $\tilde{\mathbf{A}}$. Notice that, in this case, f selects just the k-th power of the diffusion operator. An extension of this work, dubbed Linear Graph Convolution (LGC), was proposed by Navarin et al. [21], where the authors introduce skip connections from each layer to the last one, which is a merge layer implementing the sum operator, followed by a softmax activation:

$$\mathbf{Y} = \mathrm{softmax}\Big(\sum_{i=0}^{k} \alpha_i \mathbf{L}^{i}\mathbf{X}\Big).$$
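The general scheme of Eq. (14.29) and the two linear variants above can be sketched in a few lines of NumPy. The sketch assumes T(A) is the symmetrically normalized adjacency with self-loops, uses the same operator as a stand-in for L in the LGC-style sum, and picks arbitrary illustrative coefficients; all function names are hypothetical.

```python
import numpy as np

def sym_normalized(A):
    """S = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def multi_hop_features(T, X, k):
    """Return [T^0 X, T^1 X, ..., T^k X] using k matrix products."""
    feats = [X]
    for _ in range(k):
        feats.append(T @ feats[-1])
    return feats

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
S = sym_normalized(A)
feats = multi_hop_features(S, X, k=2)

H_concat = np.concatenate(feats, axis=1)     # f = concatenation
Y_sgc = softmax(feats[-1])                   # f keeps only the k-th power
alphas = [0.5, 0.3, 0.2]                     # illustrative coefficients
Y_lgc = softmax(sum(a * F for a, F in zip(alphas, feats)))  # f = weighted sum
```

Since no nonlinearity sits between the powers, the list of features can be computed once before training, which is the main practical appeal of these models.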


Liao et al. in 2019 [40] proposed to construct a deep graph convolutional network, exploiting particular localized polynomial filters based on the Lanczos algorithm, which leverages multiscale information. In the same year, Chen et al. [41] suggested to replace the neighbor aggregation function by graph-augmented features combining node degree features and multiscale graph-propagated features. Basically, the proposed model concatenates the node degree with the power series of the normalized adjacency matrix. The proposed model aggregates the graph-augmented features of each vertex and projects each of these subsets by using an MLP. Luan et al. [42] introduced two deep GCNs that rely on Krylov blocks. The first one exploits a GC layer, named snowball, which concatenates multiscale features incrementally, resulting in a densely connected graph network. The architecture stacks several layers, and exploits nonlinear activation functions. The second model, called Truncated Krylov, concatenates multiscale features in each layer. In this model, the topological features from all levels are mixed together. A similar approach is proposed by Rossi et al. [43]. The authors proposed an alternative method, named SIGN, to scale GNNs to very large graphs. This method uses, as a building block, the set of exponentiations of linear diffusion operators. In this building block, every exponentiation of the diffusion operator is linearly projected by a learnable matrix. Moreover, a nonlinear function is applied to the concatenation of the diffusion operators. Very recently, a model dubbed deep adaptive graph neural network was introduced [44], to learn node representations by adaptively incorporating information from large receptive fields. The model exploits an MLP network for node feature transformation. Then, it constructs a multiscale representation leveraging on the computed node feature transformations and the exponentiation of the adjacency matrix. 
This representation is obtained by stacking the various adjacency matrix exponentiations (thus obtaining a three-dimensional tensor). Similarly to the GCNs that rely on Krylov blocks [42], also in this case the model projects the obtained multiscale representation using multiplication by a weights matrix, obtaining a representation where the topological features from all levels are mixed together. Moreover, this projection also uses (trainable) retention scores. These scores measure how much information on the corresponding representations derived by different propagation layers should be retained to generate the final representation for each node in order to adaptively balance the information from local and global neighborhoods.

14.5.11. Relational Data

In some applications, edges are typed, i.e., edges come with associated labels. This is the case, for instance, with knowledge graphs, where the edges specify the


relationship between two concepts. This kind of data is thus commonly referred to as relational data. Graph neural networks can be easily extended to deal with such data [45]. Relational Graph Convolutional Networks are defined for a node v as

$$\mathbf{h}_v^{(i+1)} = \sigma\Big(\mathbf{W}^{(i)} \mathbf{h}_v^{(i)} + \sum_{r \in R} \sum_{u \in N_r(v)} \frac{1}{|N_r(v)|} \mathbf{W}_r^{(i)} \mathbf{h}_u^{(i)}\Big), \tag{14.30}$$

where R is the set of possible relations, $N_r(v)$ is the set of neighbors of v that are connected to it via an edge of type r, and $\mathbf{h}_v^{(0)} = \mathbf{x}_v$. When the set of possible relations is large, the authors suggest using regularization to constrain $\mathbf{W}_r^{(i)}$. Simonovsky and Komodakis [46] proposed a similar approach, an extension of the model proposed by Duvenaud et al. [47] (see Section 14.3.2), that computes the sum over the neighbors of a vertex using weights conditioned on the edge labels.
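To make the per-relation normalization in Eq. (14.30) concrete, here is a minimal sketch with ReLU as σ. The edge-list format, the relation names, and the identity weight matrices are all illustrative assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np

def rgcn_layer(H, edges, W_self, W_rel):
    """Eq. (14.30) with ReLU. `edges[r]` lists (v, u) pairs meaning u is in N_r(v);
    `W_rel[r]` is the weight matrix of relation r."""
    n = H.shape[0]
    out = H @ W_self.T                               # self-connection term
    for r, pairs in edges.items():
        A_r = np.zeros((n, n))
        for v, u in pairs:
            A_r[v, u] = 1.0
        deg = A_r.sum(axis=1, keepdims=True)
        A_r = A_r / np.maximum(deg, 1.0)             # 1 / |N_r(v)|, rows without edges stay 0
        out = out + A_r @ H @ W_rel[r].T
    return np.maximum(out, 0.0)

# Three nodes, two hypothetical relation types.
H = np.array([[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
edges = {"cites": [(0, 1), (0, 2)], "authored_by": [(1, 2)]}
I2 = np.eye(2)
out = rgcn_layer(H, edges, I2, {"cites": I2, "authored_by": I2})
print(out)
```

With identity weights, node 0 ends up with its own vector plus the mean of its two "cites" neighbors, and node 1 with its own vector plus its single "authored_by" neighbor.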

14.6. Expressive Power of Graph Neural Networks

When dealing with graphs, the problem of determining whether two graphs are just different representations of the same one is known as the graph isomorphism problem [48], for which there are no known polynomial-time algorithms. This means that, for any graph neural network, there exist infinitely many non-isomorphic graphs that the model cannot distinguish, i.e., that will be mapped to the same representation. This is a central issue for graph neural networks, since the more indistinguishable graphs there are, the more limited is the class of functions the network can represent (i.e., the network can only represent functions that associate the same output to all the graphs that are mapped to the same internal representation). The expressiveness of graph neural networks has been recently studied [49]. The authors showed that most graph neural networks are at most as powerful as the one-dimensional Weisfeiler–Lehman graph invariant (see Section 14.5.3), depending on the adopted neighbor aggregation function. They proved that some GNN variants, such as the GCN described in Section 14.5.2, or GraphSAGE described in Section 14.5.7, do not achieve this expressiveness, and proposed the graph isomorphism network (GIN) model, which was proven to be as expressive as the Weisfeiler–Lehman test. In GIN, the aggregation over node neighbors is implemented using an MLP; therefore, the resulting GC formulation is the following (where $\mathbf{h}_v^{(0)} = \mathbf{x}_v$):

$$\mathbf{h}_v^{(k)} = \mathrm{MLP}^{(k)}\Big((1 + \epsilon^{(k)})\,\mathbf{h}_v^{(k-1)} + \sum_{u \in N(v)} \mathbf{h}_u^{(k-1)}\Big). \tag{14.31}$$
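The GIN update of Eq. (14.31) is short enough to sketch directly. In the sketch, `eps` plays the role of the scalar added to 1 in the equation, the MLP has one hidden ReLU layer, and the random matrices stand in for learned weights; these choices and all names are assumptions made for illustration. The final lines check that relabeling the nodes only permutes the rows of the output.

```python
import numpy as np

def mlp(X, W1, b1, W2, b2):
    """Two-layer perceptron with a ReLU hidden layer."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def gin_layer(A, H, eps, params):
    """Eq. (14.31) for all nodes at once: A @ H sums the neighbor representations."""
    return mlp((1.0 + eps) * H + A @ H, *params)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
params = (rng.normal(size=(3, 8)), np.zeros(8),
          rng.normal(size=(8, 3)), np.zeros(3))
out = gin_layer(A, H, 0.1, params)

# Relabel the nodes with a permutation matrix P and recompute.
P = np.eye(4)[[2, 0, 3, 1]]
out_perm = gin_layer(P @ A @ P.T, P @ H, 0.1, params)
print(np.allclose(out_perm, P @ out))
```

The sum over neighbors (rather than a mean or max) is the design choice that the expressiveness argument relies on, since a sum can distinguish multisets that a mean or max would merge.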


Figure 14.2: Example of two non-isomorphic graphs (Graph 1 and Graph 2) that cannot be distinguished by the 1-dimensional Weisfeiler–Lehman graph invariant test. Note that, in this simple example, all nodes have the same label.

14.6.1. More Expressive Graph Neural Networks

While the 1-dimensional Weisfeiler–Lehman graph invariant test is a fast and generally effective graph isomorphism test, there are many (even small) pairs of graphs it cannot distinguish. Figure 14.2 reports a pair of graphs that are not isomorphic but that cannot be distinguished by the 1-WL test. One possibility for defining more expressive graph neural networks is to base their computation on more powerful graph invariants, such as the k-dimensional Weisfeiler–Lehman (k-WL) tests, which base the iterative coloring procedure of WL on graphlets of size k. k-WL tests are increasingly powerful for all k > 2 (k = 1 and k = 2 have the same discriminative power). Some recent works followed this direction. Maron et al. [50] proposed a hierarchy of increasingly expressive graph neural networks, where the k-th order network is as expressive as the k-dimensional WL. A practical implementation of a network as expressive as the three-dimensional WL is then proposed. The main drawback of the proposed approach is its computational complexity, which is quadratic in the number of nodes of the input graphs (even for graphs with sparse connectivity). Moreover, such expressive networks tend to easily overfit the training data. Morris et al. [51] proposed a more efficient alternative, defining a revisited, local version of the k-WL that exploits the graph sparsity. They proposed a neural network architecture based on such a method and showed that it improves the generalization capability of the network compared to other alternatives in the literature. Again, the main problem of the approach is that computing the k-order node tuples on which the WL test is based has computational complexity $O(n^k)$.
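The limitation of the 1-WL test can be reproduced with a few lines of plain Python. The sketch below runs color refinement jointly on several unlabeled graphs so that color identifiers are comparable across them. The example graphs (a hexagon, two disjoint triangles, and a path) are a standard textbook choice and are an assumption here; they are not necessarily the graphs drawn in Figure 14.2.

```python
from collections import Counter

def wl_histograms(graphs, rounds):
    """Run 1-WL color refinement jointly on several graphs ({node: [neighbors]}).
    All nodes start with the same color; returns one color histogram per graph."""
    colors = [{v: 0 for v in g} for g in graphs]
    for _ in range(rounds):
        # Signature of a node: its own color plus the sorted colors of its neighbors.
        sigs = [{v: (c[v], tuple(sorted(c[u] for u in g[v]))) for v in g}
                for g, c in zip(graphs, colors)]
        # Shared palette, so equal signatures get equal colors in every graph.
        palette = {s: i for i, s in
                   enumerate(sorted({s for d in sigs for s in d.values()}))}
        colors = [{v: palette[d[v]] for v in d} for d in sigs]
    return [Counter(c.values()) for c in colors]

hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

h_hex, h_tri, h_path = wl_histograms([hexagon, two_triangles, path], rounds=6)
print(h_hex == h_tri)    # True: 1-WL cannot tell these two graphs apart
print(h_hex == h_path)   # False: the path has degree-1 end nodes
```

Every node in the hexagon and in the two triangles has exactly two neighbors with the same color, so refinement never splits the color classes, even though one graph is connected and the other is not.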

14.7. Prediction Tasks for Graphs

When considering graph-level prediction tasks, the node-level representations computed by the graph convolutional operators need to be aggregated in order to obtain a single (fixed-size) representation of the graph. Thus, one of the main


problems to solve in this scenario is how to transform a variable number of node-level representations into a single graph-level one. Formally, a general GNN model for graph classification (or regression) is built according to the equations described in the following. First, d graph convolution layers are stacked

$$\mathbf{H}^{(i)} = \sigma\big(\mathrm{GC}(\mathbf{H}^{(i-1)}, G)\big), \tag{14.32}$$

where σ(·) is an element-wise nonlinear activation function, GC(·, ·) is a graph convolution operator, $\mathbf{h}_v^{(i)}$ (the v-th row of $\mathbf{H}^{(i)}$) is the representation of node v at the i-th graph convolution layer, 1 ≤ i ≤ d, and $\mathbf{h}_v^{(0)} = \mathbf{x}_v$. Then, an aggregation function is applied

$$\mathbf{h}^{S} = \mathrm{aggr}\big(\{\mathbf{h}_v^{(i)} \mid v \in V,\, 1 \le i \le d\}\big), \tag{14.33}$$

where aggr(·) is the aggregator function. Note that the aggregation may depend on all the hidden representations computed by the different GC layers and not just the last one. $\mathbf{h}^{S}$ is the fixed-size graph-level representation. Subsequently, the readout(·) function (usually implemented by a multilayer perceptron) applies some nonlinear transformation on $\mathbf{h}^{S}$. Finally, an output layer (e.g., the LogSoftMax for a classification problem) is applied.
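The pipeline of Eqs. (14.32) and (14.33) can be sketched as follows. The graph convolution used here is a deliberately simple stand-in (propagate over the adjacency with self-loops, then project), tanh plays the role of σ, and aggr averages the node representations of each layer and concatenates the averages; each of these is an illustrative assumption, as are the names and toy graphs.

```python
import numpy as np

def gc(H, A_hat, W):
    """Stand-in for GC(H, G): propagate over A_hat, then project with W."""
    return A_hat @ H @ W

def graph_representation(A, X, weights):
    """Stack GC layers (Eq. 14.32), then aggregate all layers (Eq. 14.33)."""
    A_hat = A + np.eye(A.shape[0])
    H, pooled = X, []
    for W in weights:
        H = np.tanh(gc(H, A_hat, W))
        pooled.append(H.mean(axis=0))     # average over the nodes of layer i
    return np.concatenate(pooled)         # size does not depend on |V|

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]

A_small = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]], dtype=float)   # path on 3 nodes
A_big = np.ones((5, 5)) - np.eye(5)            # complete graph on 5 nodes
h_small = graph_representation(A_small, rng.normal(size=(3, 3)), weights)
h_big = graph_representation(A_big, rng.normal(size=(5, 3)), weights)
print(h_small.shape, h_big.shape)
```

Both graphs map to vectors of the same length (the sum of the layer widths), so a single readout network can be applied on top regardless of the number of nodes.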

14.7.1. Aggregation Function and Readout

As we highlighted above, the GC can be used to perform either graph classification or node classification. Therefore, the main additional component required by graph classification GNNs is the aggregator function. An effective and efficient graph-level representation should be, as much as possible, invariant to different isomorphic representations of the input graph, thus letting the learning procedure focus only on the property prediction task, with no need to worry about the way the input graph is represented. The simplest aggregation operators adopted in the literature are linear, namely the average and the sum of vertex representations. Operators of this kind have been used in many GNN models. For instance, NN4G [15] (described in Section 14.3.2) computes, for each graph, the average graph vertex representation for each hidden layer, and concatenates them. Other approaches consider only the last graph convolution layer to compute such an average [37]. A more complex approach exploits multilayer perceptrons to transform node representations before a sum aggregator is applied [19]. In the last few years, several more complex (nonlinear) techniques to perform aggregation have been proposed. One of these is SortPooling [31], which is a nonlinear pooling operator, used in conjunction with concatenation to obtain an aggregation operator. The idea is to select a predetermined number of vertex


embeddings using a sorting function, and to concatenate them, obtaining a graph-level representation of fixed size. Note, however, that this representation ignores some of the nodes of the graph, thus losing potentially important information. In many cases, using such simple aggregators inevitably results in a loss of information due to the mix of numerical values they introduce. A much better approach, from a conceptual point of view, would be to consider all the representations of nodes of a graph as a (multi)set, and the aggregation function to learn as a function defined on these (multi)sets. To address this setting, the DeepSets [52] network has been recently proposed. The idea is to design neural networks that take sets as the input. A DeepSet maps the elements of the input set in a high-dimensional space via a learned φ(·) function, usually implemented as a multilayer perceptron. It then aggregates node representations by summing them, and finally it applies the readout, i.e., the ρ(·) function (another MLP), to map the graph-level representation to the output of the task at hand. Formally, a DeepSets network can be defined as follows:

$$\mathit{sf}(X) = \rho\Big(\sum_{x_i \in X} \phi(x_i)\Big), \tag{14.34}$$

for some ρ(·) and φ(·) functions, if $\mathcal{X}$, the set from which the elements of X are drawn, is countable. Navarin et al. [53] proposed a graph aggregation scheme based on DeepSets implementing the φ(·) function as a multilayer perceptron. An interesting property of this approach is that it has been proven that any function $\mathit{sf}(X)$ over a set X satisfying the following two properties: (1) variable number of elements in the input, i.e., each input is a set $X = \{x_1, \ldots, x_m\}$ with $x_i$ belonging to some set $\mathcal{X}$ (typically a vector space) and m > 0; (2) permutation invariance; is decomposable in the form of Eq. (14.34). Moreover, under some assumptions, DeepSets are universal approximators of functions over countable sets, or uncountable sets with a fixed size. Therefore, they are potentially very expressive from a functional point of view. Recently, Pasa et al. [54] proposed to extend this approach by implementing φ(·) by exploiting self-organizing maps (SOMs) to map the node representations in the space defined by the activations of the SOM neurons. The resulting representation embeds the information about the similarity between the various inputs. In fact, similar input structures will be mapped in similar output representations (i.e., node embeddings). Using a fully unsupervised mapping for the φ(·) function may lead to the loss of task-related information. To avoid this issue, the proposed method makes the φ(·) mapping supervised by stacking, after the SOM, a graph convolution layer


that can be trained via supervised learning, which allows topological information to be better incorporated in the mapping.
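A DeepSets-style aggregation in the form of Eq. (14.34) can be sketched with two small perceptrons. The sketch assumes that both φ and ρ are bias-free two-layer MLPs with random stand-in weights; the function names and sizes are hypothetical. It checks the two properties discussed above: the output does not depend on the order of the node representations, and inputs of different sizes produce outputs of the same size.

```python
import numpy as np

def mlp(X, W1, W2):
    """Bias-free two-layer perceptron with a ReLU hidden layer."""
    return np.maximum(X @ W1, 0.0) @ W2

def deepsets_readout(H, phi, rho):
    """Eq. (14.34): apply phi to every element, sum, then apply rho."""
    summed = mlp(H, *phi).sum(axis=0, keepdims=True)
    return mlp(summed, *rho)[0]

rng = np.random.default_rng(0)
phi = (rng.normal(size=(3, 8)), rng.normal(size=(8, 8)))
rho = (rng.normal(size=(8, 8)), rng.normal(size=(8, 2)))

H = rng.normal(size=(5, 3))                      # five node representations
y = deepsets_readout(H, phi, rho)
y_shuffled = deepsets_readout(H[rng.permutation(5)], phi, rho)
y_smaller = deepsets_readout(H[:2], phi, rho)    # a graph with two nodes
print(np.allclose(y, y_shuffled), y.shape, y_smaller.shape)
```

The sum is the only step that mixes elements, so any order-dependence would have to come from φ or ρ, and neither of them sees more than one vector at a time.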

14.7.2. Graph Pooling

Pooling operators have been defined and used in convolutional neural networks (CNNs) for image processing. Their aim is to improve the performance of the network by introducing some degree of local invariance to transformations of the image, such as scale, translation, or rotation transformations. Replicating these operators in the context of graphs turns out to be very complex for different reasons. First, in the GNN setting, compared to standard CNNs, graphs contain no natural notion of spatial locality because of the complex topological structure. Moreover, unlike image data, in the graph context the number and degree of nodes are not fixed, thus making it even more complex to define a general graph pooling operator. In order to address these issues, different types of graph pooling operators have been proposed in the last two years. In general, graph pooling operators are defined so as to learn a clustering of nodes that uncovers the hierarchical topology given by the underlying sub-graphs. In a GNN architecture, the graph pooling operators can be inserted after one or more GC layers. One of the first graph pooling operators to be proposed is DiffPool [55]. DiffPool computes a node clustering by learning a cluster assignment matrix $S^{(l)}$ over the node embedding computed after l GC layers. Let us denote the adjacency matrix at the l-th layer as $A^{(l)}$ and the node embedding matrix as $Z^{(l)}$. DiffPool computes $A^{(l+1)}$ and $X^{(l+1)}$, which are, respectively, the new adjacency matrix and the new matrix of embeddings for each of the nodes/clusters in the graph. $X^{(l+1)}$ is computed by aggregating the node embeddings $Z^{(l)}$ according to the cluster assignments $S^{(l)}$, generating embeddings for each of the clusters:

$$X^{(l+1)} = S^{(l)\top} Z^{(l)}.$$

Similarly, the new adjacency matrix $A^{(l+1)}$ represents the connections between clusters, and is computed as

$$A^{(l+1)} = S^{(l)\top} A^{(l)} S^{(l)}.$$

The main limitation of DiffPool is the computation of the soft clustering assignments, since during the early phases of training, a dense assignment matrix must be stored in memory. Asymptotically, this incurs a quadratic storage complexity over the number of graph vertices, making the application of DiffPool to large graphs infeasible. To address this complexity issue, a sparse version of graph pooling has been proposed [56], where a differentiable graph coarsening method is used to reduce the size of the graph in an adaptive manner within a graph neural network pipeline.


The sparsity of this method is achieved by applying a node dropping policy that drops a fixed number of nodes from the original graph. The nodes to drop are selected based on a projection score computed for each node. This method reduces the memory storage complexity, which becomes linear in the number of edges and nodes of the graph. Another recently proposed graph pooling method is MinCutPool [57]. The method is based on the k-way normalized minCUT problem, that is, the task of partitioning V into k disjoint subsets by removing the minimum number of edges. The idea is to learn the parameters in a MinCutPool layer by minimizing the minCUT objective, which can be jointly optimized with a task-specific loss. The method, similarly to DiffPool, uses a cluster assignment matrix. This matrix is computed by solving a relaxed continuous formulation of the minCUT optimization problem that can be solved in polynomial time and guarantees a near-optimal solution.
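The DiffPool coarsening step described earlier in this section can be sketched in NumPy. The sketch assumes that the assignment matrix S has one row per node and one column per cluster, so that the cluster embeddings and the coarsened adjacency are obtained as products with S transposed. In DiffPool the assignment scores come from a learned GNN; here they are fixed by hand (a hypothetical, nearly hard assignment) so that the result is easy to verify.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def diffpool(A, Z, S_logits):
    """Coarsen a graph: returns (A_new, X_new) for the clusters.
    Assumption: S is (nodes x clusters), a row-wise softmax of S_logits."""
    S = softmax_rows(S_logits)
    return S.T @ A @ S, S.T @ Z

# Path graph 0-1-2-3, with nodes {0, 1} and {2, 3} assigned to two clusters.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
S_logits = 50.0 * np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

A_new, X_new = diffpool(A, Z, S_logits)
print(np.round(A_new, 3))
print(np.round(X_new, 3))
```

Note that S is dense even when A is sparse, which is the source of the quadratic memory cost mentioned above; the diagonal of the coarsened adjacency counts the edges that fall inside each cluster.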

14.8. Experimental Comparison

In this section, we review some of the most interesting results obtained by GNN models in different tasks. First, we outline the main applications of GNNs and the datasets commonly used as benchmarks to evaluate the strengths and the weaknesses of the models. Then, we discuss a selection of the results obtained by models that achieved notable results.

14.8.1. Applications and Benchmark Datasets

Graph structured data are ubiquitous in nature; therefore, there is a wide range of possible real-world applications of graph neural networks. However, it is important to take into account that in the past some problems were usually modeled using flat or sequential representations, instead of using graphs. This is due to the fact that for some tasks involving graph structured data, considering the topological information is not crucial, and using a simpler model may bring benefits in terms of performance. In this section, we list and discuss the most interesting tasks to which GNNs have been successfully applied in the last few years. One of the most prominent areas in which GNNs have been applied with success is Cheminformatics. In fact, many chemical compound datasets are often used to benchmark new models. In Cheminformatics, different tasks can be considered. In general, the input is the molecular structure, while the output varies based on the specific task. The majority of tasks in benchmark datasets model graph classification or regression problems. An example is the quantitative structure–property relationship (QSPR) analysis, where the aim is to predict one or more chemical properties (e.g., toxicity, solubility) of the input chemical compound. Other tasks in this field include finding structural similarities among compounds,


drug side-effect identification, and drug discovery. The datasets related to these applications, commonly used as benchmarks, are MUTAG [58], PTC [59], NCI1 [60], PROTEINS [61], D&D [62], and ENZYMES [61]. The first two datasets contain chemical compounds represented by their molecular graphs, where each node is labeled with an atom type, and the edges represent bonds between them. MUTAG contains aromatic and hetero-aromatic nitro compounds, and the task is to predict their mutagenic effect on a bacterium. PTC contains chemical compounds and the task is to predict their carcinogenicity for male rats. In NCI1, the graphs represent anti-cancer screens for non-small cell lung cancer. The last three datasets, PROTEINS, D&D, and ENZYMES, contain graphs that represent proteins. Each node corresponds to an amino acid, and an edge connects two of them if they are less than 6 Å apart. In particular, ENZYMES, unlike the other reported datasets (that model binary classification problems), allows testing the models on multiclass classification over six classes. Another interesting field where GNNs have achieved good results is social network analysis. A social network is modeled as a graph where nodes represent users, and edges the relationships between them (e.g., friendship, co-authorship). Usually, the tasks on social graphs concern node or graph classification. As for node classification, three major datasets in the literature are usually employed to assess the performance of GNNs: Cora [63], Citeseer [64], and PubMed [65]. Each dataset is represented as a single graph. Nodes represent documents, and node features are sparse bag-of-words feature vectors. Specifically, the task is to classify the research topics of papers. Each node represents a scientific publication described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary.
As for graph classification on social datasets, the most popular benchmarks are COLLAB, IMDB-BINARY (IMDB-B), IMDB-MULTI (IMDB-M), REDDIT-BINARY, and REDDIT-MULTI [66]. In COLLAB, each graph represents a collaboration network of the corresponding researcher with other researchers from three fields of physics. The task consists in predicting the physics field the researcher belongs to. IMDB-B and IMDB-M are composed of graphs derived from actors/actresses who have acted in different movies on IMDB, together with the movie genre information. Each graph has a target that represents the movie genre. IMDB-B models a binary classification task, while IMDB-M contains graphs that belong to three different classes. Unlike the bioinformatics datasets, the nodes contained in the social datasets do not have any associated label. REDDIT-BINARY and REDDIT-MULTI contain graphs that represent online discussion threads where nodes represent users, and an edge represents the fact that one of the two users responded to the comment of the other user. In REDDIT-BINARY, four popular subreddits are considered. Two of them are question/answer-based subreddits, while the other two are discussion-based subreddits. A graph is

June 1, 2022 13:18

608

Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch14

Handbook on Computer Learning and Intelligence — Vol. II

labeled according to whether it belongs to a question/answer-based community or a discussion-based community. By contrast, in REDDIT-MULTI, there are five subreddits involved and the graphs are labeled with their corresponding subreddits.

14.8.2. Experimental Setup and Validation Process

In the last few years, many new graph neural network models have been proposed. However, despite the theoretical advancements reached by the latest contributions in the field, comparing the results of the various proposed methods turns out to be hard. Indeed, many different methods (not all correct) were used to perform validation, and thus to select the models' hyper-parameters. In a recently published paper [67], the authors highlighted and discussed the importance of the validation strategy, especially when dealing with graph neural networks. Moreover, they showed experimentally that the experimental settings of many papers are ambiguous or not reproducible. Another important issue is the correct usage of data splits for model selection versus model assessment. A fair method to evaluate a model requires two distinct phases: model selection on the validation set, and model assessment on the test set. Unfortunately, in many works, in particular on graph classification, some of the hyper-parameters are selected by considering the best results directly on the test set, which is clearly an incorrect procedure. This is partly because some datasets have a limited number of samples, in particular in the validation set, so that a hyper-parameter selection performed on the validation set is extremely unstable. However, performing model selection and model assessment incorrectly can lead to overly optimistic and biased estimates of the true predictive performance of a model. Another aspect that should be considered when comparing two or more models is the amount of information used in the input and output encoding. In fact, it is common practice in the literature to augment node descriptors with structural features. For example, some models add the degree (and some clustering coefficients) to each node feature vector [55], or add a one-hot representation of node degrees [49]. 
Obviously, this encoding difference makes it even harder to compare the experimental results published in the literature. Good experimental practices suggest that all models should be consistently compared using the same input representations, and using the same validation strategy. For this reason, the results reported in the following section are limited to models that have been evaluated using the same, fair methodology.
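The separation between model selection and model assessment discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact protocol of [67]: the outer k-fold split, the inner hold-out fraction, and the `train_and_score` callable (assumed to train a model with a given configuration on one index set and return its accuracy on another) are all hypothetical choices.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def evaluate_fairly(n_samples, configs, train_and_score, k=10, val_fraction=0.1, seed=0):
    """Select hyper-parameters on validation data only, then assess on test data."""
    folds = k_fold_indices(n_samples, k, seed)
    test_scores = []
    for i, test_idx in enumerate(folds):
        # Everything outside the current test fold is available for development.
        dev_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        n_val = max(1, int(len(dev_idx) * val_fraction))
        val_idx, train_idx = dev_idx[:n_val], dev_idx[n_val:]
        # Model selection: the test fold is never consulted here.
        best = max(configs, key=lambda cfg: train_and_score(cfg, train_idx, val_idx))
        # Model assessment: the selected configuration is scored once on the test fold.
        test_scores.append(train_and_score(best, train_idx, test_idx))
    return sum(test_scores) / len(test_scores), test_scores
```

The point of the design is that the test fold appears only in the last call of each iteration, so the reported mean is not biased by the hyper-parameter search.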

14.8.3. Experimental Results

In this section, we report and discuss the most meaningful results obtained by some of the models discussed in the previous sections. Table 14.1 reports the results on classification tasks over bioinformatics benchmark datasets, while Table 14.2 reports the results on classification tasks over social network datasets. Finally, Table 14.3 reports the most interesting results in the literature on the task of node classification.

Table 14.1: Accuracy comparison among several state-of-the-art models on graph classification tasks

Model\Dataset    PTC            NCI1           PROTEINS       D&D            ENZYMES
PSCN [32]        60.00 ± 4.82   76.34 ± 1.68   75.00 ± 2.51   76.27 ± 2.64   -
FGCNN [53]       58.82 ± 1.80   81.50 ± 0.39   74.57 ± 0.80   77.47 ± 0.86   -
SOM-GCNN [54]    62.24 ± 1.7    83.30 ± 0.45   75.22 ± 0.61   78.10 ± 0.60   50.01 ± 2.9
DGCNN [67]       -              76.4 ± 1.7     72.9 ± 3.5     76.6 ± 4.3     38.9 ± 5.7
GIN [67]         -              80.0 ± 1.4     73.3 ± 4.0     75.3 ± 2.9     59.6 ± 4.5
DIFFPOOL [67]    -              76.9 ± 1.9     73.7 ± 3.5     75.0 ± 3.5     59.5 ± 5.6
GraphSAGE [67]   -              76.0 ± 1.8     73.0 ± 4.5     72.9 ± 2.0     58.2 ± 6.0

Table 14.2: Accuracy comparison among several state-of-the-art models on graph classification tasks considering social network datasets

Model\Dataset    COLLAB         IMDB-B         IMDB-M         RED.-B         RED.-5K
PSCN [32]        72.60 ± 2.15   71.00 ± 2.29   45.23 ± 2.84   86.30 ± 1.58   49.10 ± 0.70
DGCNN [67]       57.4 ± 1.9     53.3 ± 5.0     38.6 ± 2.2     72.1 ± 7.8     35.1 ± 1.4
GIN [67]         75.9 ± 1.9     66.8 ± 3.9     42.2 ± 4.6     87.0 ± 4.4     53.8 ± 5.9
DIFFPOOL [67]    67.7 ± 1.9     68.3 ± 6.1     45.1 ± 3.2     76.6 ± 2.4     34.6 ± 2.0
GraphSAGE [67]   71.6 ± 1.5     69.9 ± 4.6     47.2 ± 3.6     86.1 ± 2.0     49.9 ± 1.7

Table 14.3: Accuracy comparison among several state-of-the-art models on node classification tasks considering social network datasets

Model\Dataset    Citeseer      Cora          PubMed
Cheby [68]       70.2 ± 1.0    81.4 ± 0.7    78.4 ± 0.4
GCN [68]         71.1 ± 0.7    81.5 ± 0.6    79.0 ± 0.6
GAT [68]         70.8 ± 0.5    83.1 ± 0.4    78.5 ± 0.3
SGC [68]         71.3 ± 0.2    81.7 ± 0.1    78.9 ± 0.1
ARMA [68]        72.3 ± 1.1    82.8 ± 0.6    78.8 ± 0.3
LGC [21]         72.9 ± 0.3    85.0 ± 0.3    80.2 ± 0.4
APPNP [68]       71.8 ± 0.5    83.3 ± 0.5    80.1 ± 0.2
GCNII [26]       73.4          85.5          80.3

The results reported in the tables show that no single method obtains the best performance on all the considered datasets/tasks. This behavior suggests that each of the considered datasets and tasks stresses different aspects of the graph convolution operators. Regarding graph classification on the bioinformatics datasets, the best results are obtained by SOM-GCNN on all datasets except ENZYMES, where GIN obtains the best result and DIFFPOOL achieves a very similar accuracy. These results emphasize the importance of using an expressive aggregation and readout methodology. Indeed, SOM-GCNN, GIN, and DIFFPOOL introduce novel expressive aggregation methodologies. The social network results in Table 14.2 show that GIN also performs very well on these datasets, since it obtains the best results on three of them: COLLAB and the two versions of REDDIT (binary and multiclass). These datasets have a significantly higher number of edges and nodes per graph than the two IMDB datasets, where PSCN obtains a better accuracy than GIN. In the node classification results (Table 14.3), GCNII shows the best performance on all datasets. In Cora, the accuracy obtained by GCNII is very close to the one obtained by LGC. In Citeseer, GCNII is about 0.5 percentage points better than LGC, which achieved the second-highest result. In PubMed, the accuracy values of GCNII, LGC, and APPNP are very close; indeed, the performances of the three methods are less than one standard deviation apart.

14.9. Open-Source Libraries

The growth in popularity of GNN models poses new challenges for the existing computing libraries and frameworks aimed at developing DNN models. In fact, the particular representations of the nodes and of the diffusion operators make it necessary to handle the sparsity of GNN operations efficiently on widespread GPU hardware. Moreover, considering the enormous size of some graphs (e.g., social network datasets), it has become crucial to support scaling the computations to large-scale graphs and multiple GPUs. For these reasons, some new libraries that extend and adapt the functionality of widely adopted DNN frameworks, such as PyTorch [69] and TensorFlow [70], have been recently developed. These libraries provide high-level interfaces and methods to implement and develop multiple GNN variants. In the following sections, we review three commonly adopted libraries built on top of PyTorch and TensorFlow to develop GNN models: PyTorch Geometric, Deep Graph Library, and Spektral.

14.9.1. PyTorch Geometric

PyTorch Geometric (PyG) [68] is a widespread library that is built upon PyTorch. PyG includes various methods for deep learning on graphs and other irregular structures, from a variety of published papers. The library makes it easy to develop GNN architectures thanks to an interface that implements message passing. This interface provides methods to define the message, update the node embeddings, and perform neighborhood aggregation and combination. Moreover, multiple pooling operations are provided. As for data management, the library implements a mini-batch loader that works both for many small graphs and for a single giant graph. Finally, the library implements loaders and management functions for the most common benchmark datasets. PyG exploits dedicated GPU scatter and gather kernels to handle sparsity, instead of using sparse matrix multiplication kernels. The scatter operations make it possible to operate on all edges and nodes in parallel, thus significantly accelerating GNN processing.
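The gather/scatter idea mentioned above can be sketched in plain Python. This is not PyG code and does not use its API: the function name and the toy graph are hypothetical, and the sequential loop only stands in for what a GPU kernel would do in parallel over all edges. The graph is assumed to be stored as a pair of index lists (source and target of each directed edge).

```python
def scatter_add_aggregate(node_features, edge_index):
    """Sum, for every node, the feature vectors of its in-neighbours.

    node_features: list of equal-length feature lists, one per node.
    edge_index: pair (sources, targets) of index lists, one entry per directed edge.
    """
    sources, targets = edge_index
    dim = len(node_features[0])
    aggregated = [[0.0] * dim for _ in node_features]
    for src, dst in zip(sources, targets):
        message = node_features[src]          # gather: read the source node's features
        for k in range(dim):
            aggregated[dst][k] += message[k]  # scatter-add: accumulate at the target node
    return aggregated


# Undirected path 0 - 1 - 2, stored as directed edges in both directions.
features = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
edges = ([0, 1, 1, 2], [1, 0, 2, 1])
result = scatter_add_aggregate(features, edges)
print(result)  # [[0.0, 2.0], [4.0, 3.0], [0.0, 2.0]]
```

Because each edge is processed independently and only the accumulation at the target needs coordination, the work grows with the number of edges rather than with the square of the number of nodes, which is the reason such kernels cope well with sparse graphs.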

14.9.2. Deep Graph Library

Deep Graph Library (DGL) [71] is a framework-agnostic library. In fact, it can use PyTorch, TensorFlow, or MXNet [72] as backends. Similarly to PyG, DGL provides a message passing interface. The interface exposes three main methods: the message function to define the message; the reduce function that allows each node to access the received messages and perform aggregation; and the update function that operates on the aggregation results. The update function is typically used to combine the aggregation with the original features of a node and to compute the updated node embedding. To improve the performance of graph processing, DGL adopts advanced optimization techniques such as kernel fusion, multithread and multiprocess acceleration, and automatic sparse format tuning. In particular, it leverages specialized kernels for GPUs or TPUs that improve the parallelization of matrix multiplications between both dense and sparse matrices. The parallelization scheme is chosen by the library using heuristics that consider multiple factors, including the input graph. Thanks to these optimizations, DGL shows better performance than other popular GNN frameworks, in terms of both computational time and memory consumption.
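The division of labour between the message, reduce, and update functions can be sketched as below. This is a framework-free illustration, not DGL's actual API: the helper `message_passing_step`, the dictionary-based graph, and the scalar node features are assumptions made for brevity, and every node is assumed to have at least one in-neighbour.

```python
def message_passing_step(graph, features, message_fn, reduce_fn, update_fn):
    """Apply one round of message passing.

    graph: dict mapping each node to the list of its in-neighbours.
    features: dict mapping each node to its current feature value.
    """
    new_features = {}
    for node, neighbours in graph.items():
        # Message: one value per incoming edge, computed from both endpoints.
        mailbox = [message_fn(features[src], features[node]) for src in neighbours]
        # Reduce: the node aggregates everything it received.
        aggregated = reduce_fn(mailbox)
        # Update: combine the aggregation with the node's own features.
        new_features[node] = update_fn(features[node], aggregated)
    return new_features


graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
features = {"a": 1.0, "b": 2.0, "c": 4.0}
updated = message_passing_step(
    graph,
    features,
    message_fn=lambda src_h, dst_h: src_h,                  # send the source feature
    reduce_fn=lambda mailbox: sum(mailbox) / len(mailbox),  # mean of received messages
    update_fn=lambda h, agg: 0.5 * h + 0.5 * agg,           # mix old and aggregated
)
print(updated)  # {'a': 2.0, 'b': 1.5, 'c': 2.5}
```

Swapping the three callables (for example, a sum or max in place of the mean) changes the layer without touching the traversal logic, which is what makes this interface convenient for expressing many GNN variants.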

14.9.3. Spektral

Very recently, a new library dubbed Spektral [73] has been introduced. The Spektral library is based on the Keras API and TensorFlow 2. The library provides the essential building blocks for creating graph neural networks. Moreover, it implements some of the most common GC layers as Keras layers. This makes it possible to use them to build Keras models, which thus inherit the most important features of Keras, such as the training loop, callbacks, distributed training, and automatic support for GPUs and TPUs. At the time of writing, Spektral implements 15 different GCN layers, based on the message passing paradigm. Furthermore, the library provides several graph pooling layers and global graph pooling methods. Comparing Spektral with PyG and DGL, it is important to notice that Spektral is developed specifically for the TensorFlow ecosystem. In terms of computational performance, Spektral's implementation of GC layers is comparable to that of PyG.

14.10. Conclusions and Open Problems

In this chapter, we introduced and discussed graph neural networks (GNNs). We first introduced the main building block of GNNs: the graph convolution (GC). We discussed how this component is exploited by graph neural networks to compute a (hopefully) sound and meaningful representation for graph nodes, and how GC operators evolved from the first definitions to the GC layers recently proposed in the literature. The experimental results obtained using GNNs show the benefits of using this type of model when dealing with structured data. Despite the good results obtained in several graph-based applications, there are several open problems and challenges. One of the main open problems is scalability. Several models that obtain very good results on benchmark datasets turn out not to scale to large-scale settings. Developing models capable of dealing with very large graphs is crucial, since several interesting tasks require dealing with huge amounts of data (e.g., social networks). In addition, from a theoretical perspective, graph neural networks offer very interesting and challenging open problems. Indeed, the expressiveness of many graph neural networks is rather limited. As we discussed, recent works showed that GNNs are at most as powerful as the one-dimensional Weisfeiler–Lehman graph invariant test. Only recently, a few models that try to overcome this limit have been introduced. The encouraging results highlight that more research has to be carried out in this direction. From the architectural point of view, GNNs belong to the deep learning framework, but most of the models in the literature use a very limited number of stacked convolutional layers. Increasing the number of layers leads to some known issues, such as complex gradient propagation through the layers and the over-smoothing problem [25]. Exploiting deeper models would allow richer representations of graph nodes to be obtained; therefore, finding a model that allows deeper GNNs to be developed could help to further improve state-of-the-art performance.
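The expressiveness limit mentioned above can be made concrete with a small sketch of one-dimensional Weisfeiler–Lehman colour refinement. The function name, the dictionary-based graph encoding, and the example graphs are illustrative assumptions. The graphs are refined jointly so that colour identifiers are comparable across them; two graphs with different colour histograms are certainly non-isomorphic, while equal histograms leave the question open.

```python
from collections import Counter


def wl_histograms(graphs, rounds=3):
    """Run 1-WL colour refinement jointly on several graphs.

    graphs: list of dicts mapping each node to the list of its neighbours.
    Returns one Counter of final colours per graph.
    """
    colors = [{v: 0 for v in g} for g in graphs]  # uniform initial colouring
    for _ in range(rounds):
        # A node's signature is its own colour plus the multiset of neighbour colours.
        signatures = [
            {v: (c[v], tuple(sorted(c[u] for u in g[v]))) for v in g}
            for g, c in zip(graphs, colors)
        ]
        # Shared palette: identical signatures get identical colours in every graph.
        palette = {}
        for sig_map in signatures:
            for sig in sig_map.values():
                palette.setdefault(sig, len(palette))
        colors = [{v: palette[s[v]] for v in s} for s in signatures]
    return [Counter(c.values()) for c in colors]


hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

hists = wl_histograms([hexagon, two_triangles, path])
print(hists[0] == hists[1])  # True: 1-WL cannot separate a 6-cycle from two triangles
print(hists[0] == hists[2])  # False: the path is told apart by its degree-1 endpoints
```

The first pair is a standard example: every node has degree two in both graphs, so refinement never splits the initial colour class, and a GNN bounded by 1-WL assigns both graphs the same representation.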

References

[1] A. Sperduti, D. Majidi, and A. Starita, Extended Cascade-Correlation for syntactic and structural pattern recognition. In eds. P. Perner, P. S. Wang, and A. Rosenfeld, Advances in Structural and Syntactical Pattern Recognition, 6th International Workshop, SSPR ’96, Leipzig, Germany, August 20–23, 1996, Proceedings, vol. 1121, Lecture Notes in Computer Science, pp. 90–99. Springer (1996). doi: 10.1007/3-540-61577-6_10. URL https://doi.org/10.1007/3-540-61577-6_10.
[2] A. Sperduti and A. Starita, Supervised neural networks for the classification of structures, IEEE Trans. Neural Networks, 8(3), 714–735 (1997).
[3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, Geometric deep learning: going beyond Euclidean data, IEEE Signal Process. Mag., 34(4), 18–42 (2017). ISSN 1053-5888.
[4] P. Frasconi, M. Gori, and A. Sperduti, A general framework for adaptive processing of data structures, IEEE Trans. Neural Networks, 9(5), 768–786 (1998). doi: 10.1109/72.712151. URL https://doi.org/10.1109/72.712151.
[5] M. Collins and N. Duffy, Convolution kernels for natural language. In Proceedings of the Advances in Neural Information Processing Systems, vol. 14, pp. 625–632 (2001).
[6] R. I. Kondor and J. Lafferty, Diffusion kernels on graphs and other discrete structures. In ICML (2002).
[7] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, Weisfeiler-Lehman graph kernels, JMLR, 12, 2539–2561 (2011).
[8] G. Da San Martino, N. Navarin, and A. Sperduti, Ordered decompositional DAG kernels enhancements, Neurocomputing, 192, 92–103 (2016).
[9] A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, and K. M. Borgwardt, Scalable kernels for graphs with continuous attributes. In Neural Information Processing Systems (NIPS) 2013, pp. 216–224 (2013).

[10] G. Da San Martino, N. Navarin, and A. Sperduti, Tree-based kernel for graphs with continuous attributes, IEEE Trans. Neural Networks Learn. Syst., 29(7), 3270–3276 (2018). ISSN 2162-237X.
[11] G. Nikolentzos, G. Siglidis, and M. Vazirgiannis, Graph kernels: a survey, CoRR, abs/1904.12218 (2019). URL http://arxiv.org/abs/1904.12218.
[12] N. M. Kriege, F. D. Johansson, and C. Morris, A survey on graph kernels, Appl. Network Sci., 5(1), 6 (2020). doi: 10.1007/s41109-019-0195-3. URL https://doi.org/10.1007/s41109-019-0195-3.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks. In eds. P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3–6, 2012, Lake Tahoe, Nevada, United States, pp. 1106–1114 (2012). URL https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html.
[14] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, The graph neural network model, IEEE Trans. Neural Networks, 20(1), 61–80 (2009).
[15] A. Micheli, Neural network for graphs: a contextual constructive approach, IEEE Trans. Neural Networks, 20(3), 498–511 (2009).
[16] L. B. Almeida, A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In eds. M. Caudill and C. Butler, Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, pp. 609–618 (1987).
[17] F. J. Pineda, Generalization of back-propagation to recurrent neural networks, Phys. Rev. Lett., 59(19), 2229–2232 (1987). ISSN 1079-7114. doi: 10.1103/PhysRevLett.59.2229. URL https://link.aps.org/doi/10.1103/PhysRevLett.59.2229.
[18] M. Defferrard, X. Bresson, and P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering. In Neural Information Processing Systems (NIPS), pp. 3844–3852 (2016).
[19] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pp. 1263–1272 (2017).
[20] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Process. Mag., 30(3), 83–98 (2013).
[21] N. Navarin, W. Erb, L. Pasa, and A. Sperduti, Linear graph convolutional networks. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 151–156 (2020).
[22] T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks. In ICLR, pp. 1–14 (2017).
[23] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe, Weisfeiler and Leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4602–4609 (2019).
[24] J. Klicpera, A. Bojchevski, and S. Günnemann, Predict then propagate: graph neural networks meet personalized pagerank. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net (2019).
[25] K. Oono and T. Suzuki, Graph neural networks exponentially lose expressive power for node classification. In ICLR (2020).
[26] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li, Simple and deep graph convolutional networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, vol. 119, Proceedings of Machine Learning Research, pp. 1725–1735. PMLR (2020). URL http://proceedings.mlr.press/v119/chen20v.html.

[27] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, Graph attention networks. In ICLR (2018).
[28] D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate. In eds. Y. Bengio and Y. LeCun, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (2015). URL http://arxiv.org/abs/1409.0473.
[29] W. Hamilton, Z. Ying, and J. Leskovec, Inductive representation learning on large graphs. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1024–1034 (2017).
[30] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, Gated graph sequence neural networks. In ICLR (2016).
[31] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence (2018).
[32] M. Niepert, M. Ahmed, and K. Kutzkov, Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pp. 2014–2023 (2016).
[33] N. Navarin, D. V. Tran, and A. Sperduti, Learning kernel-based embeddings in graph neural networks. In eds. G. D. Giacomo, A. Catalá, B. Dilkina, M. Milano, S. Barro, A. Bugarín, and J. Lang, ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August–8 September 2020, Santiago de Compostela, Spain - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), vol. 325, Frontiers in Artificial Intelligence and Applications, pp. 1387–1394. IOS Press (2020). doi: 10.3233/FAIA200243. URL https://doi.org/10.3233/FAIA200243.
[34] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, Weisfeiler-Lehman graph kernels, J. Mach. Learn. Res., 12, 2539–2561 (2011).
[35] F. M. Bianchi, D. Grattarola, C. Alippi, and L. Livi, Graph neural networks with convolutional ARMA filters, arXiv preprint (2019).
[36] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang, Geom-GCN: geometric graph convolutional networks. In International Conference on Learning Representations (2019).
[37] J. Atwood and D. Towsley, Diffusion-convolutional neural networks. In Neural Information Processing Systems (NIPS), pp. 1993–2001 (2016).
[38] D. V. Tran, N. Navarin, and A. Sperduti, On filter size in graph convolutional networks. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1534–1541. IEEE (2018).
[39] F. Wu, T. Zhang, A. H. de Souza, C. Fifty, T. Yu, and K. Q. Weinberger, Simplifying graph convolutional networks. In ICML (2019).
[40] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel, LanczosNet: multi-scale deep graph convolutional networks. In 7th International Conference on Learning Representations, ICLR 2019 (2019).
[41] T. Chen, S. Bian, and Y. Sun, Are powerful graph neural nets necessary? A dissection on graph classification, arXiv preprint arXiv:1905.04579 (2019).
[42] S. Luan, M. Zhao, X.-W. Chang, and D. Precup, Break the ceiling: stronger multiscale deep graph convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems, pp. 10945–10955 (2019).
[43] E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, and F. Monti, SIGN: scalable inception graph neural networks, arXiv preprint arXiv:2004.11198 (2020).
[44] M. Liu, H. Gao, and S. Ji, Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 338–348 (2020).
[45] M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, Modeling relational data with graph convolutional networks. In eds. A. Gangemi, R. Navigli, M. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, and M. Alam, The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings, vol. 10843, Lecture Notes in Computer Science, pp. 593–607. Springer (2018). doi: 10.1007/978-3-319-93417-4_38. URL https://doi.org/10.1007/978-3-319-93417-4_38.
[46] M. Simonovsky and N. Komodakis, Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR (2017).
[47] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, Convolutional networks on graphs for learning molecular fingerprints. In Neural Information Processing Systems (NIPS), pp. 2215–2223, Montreal, Canada (2015).
[48] R. C. Read and D. G. Corneil, The graph isomorphism disease, J. Graph Theory, 1(4), 339–363 (1977).
[49] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net (2019). URL https://openreview.net/forum?id=ryGs6iA5Km.
[50] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman, Provably powerful graph networks. In Proceedings of the Advances in Neural Information Processing Systems, vol. 32 (2019).
[51] C. Morris, G. Rattan, and P. Mutzel, Weisfeiler and Leman go sparse: towards scalable higher-order graph embeddings. In NeurIPS (2019).
[52] M. Zaheer, S. Kottur, and S. Ravanbakhsh, Deep sets. In Neural Information Processing Systems (NIPS), pp. 3391–3401 (2017).
[53] N. Navarin, D. V. Tran, and A. Sperduti, Universal readout for graph convolutional neural networks. In International Joint Conference on Neural Networks, IJCNN 2019, Budapest, Hungary, July 14–19, 2019, pp. 1–7. IEEE (2019). doi: 10.1109/IJCNN.2019.8852103. URL https://doi.org/10.1109/IJCNN.2019.8852103.
[54] L. Pasa, N. Navarin, and A. Sperduti, SOM-based aggregation for graph convolutional neural networks, Neural Comput. Appl. (2020). URL https://doi.org/10.1007/s00521-020-05484-4.
[55] R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec, Hierarchical graph representation learning with differentiable pooling. In Neural Information Processing Systems (NIPS) (2018).
[56] C. Cangea, P. Veličković, N. Jovanović, T. Kipf, and P. Liò, Towards sparse hierarchical graph classifiers. In NIPS Relational Representation Learning Workshop (2018).
[57] F. M. Bianchi, D. Grattarola, and C. Alippi, Spectral clustering with graph neural networks for graph pooling. In International Conference on Machine Learning, pp. 874–883. PMLR (2020).
[58] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity, J. Med. Chem., 34(2), 786–797 (1991). ISSN 0022-2623.
[59] C. Helma, R. D. King, S. Kramer, and A. Srinivasan, The predictive toxicology challenge 2000–2001, Bioinformatics, 17(1), 107–108 (2001).
[60] N. Wale, I. A. Watson, and G. Karypis, Comparison of descriptor spaces for chemical compound retrieval and classification, Knowl. Inf. Syst., 14(3), 347–375 (2008).
[61] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel, Protein function prediction via graph kernels, Bioinformatics, 21(Suppl 1), 47–56 (2005).
[62] P. D. Dobson and A. J. Doig, Distinguishing enzyme structures from non-enzymes without alignments, J. Mol. Biol., 330(4), 771–783 (2003).
[63] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, Automating the construction of internet portals with machine learning, Inf. Retrieval, 3(2), 127–163 (2000). ISSN 1386-4564. doi: 10.1023/A:1009953814988.
[64] C. L. Giles, K. D. Bollacker, and S. Lawrence, CiteSeer: an automatic citation indexing system. In Proceedings of the ACM International Conference on Digital Libraries, pp. 89–98 (1998).

[65] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, Collective classification in network data, AI Mag., 29(3), 93–93 (2008).
[66] P. Yanardag and S. Vishwanathan, A structural smoothing framework for robust graph comparison, Proceedings of the Advances in Neural Information Processing Systems, 28, 2134–2142 (2015).
[67] F. Errica, M. Podda, D. Bacciu, and A. Micheli, A fair comparison of graph neural networks for graph classification. In International Conference on Learning Representations (2020).
[68] M. Fey and J. E. Lenssen, Fast graph representation learning with PyTorch Geometric. In ICLR 2019 Workshop on Representation Learning on Graphs and Manifolds (2019).
[69] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: an imperative style, high-performance deep learning library. In eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019).
[70] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016).
[71] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang, Deep graph library: a graph-centric, highly-performant package for graph neural networks, arXiv preprint arXiv:1909.01315 (2019).
[72] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint arXiv:1512.01274 (2015).
[73] D. Grattarola and C. Alippi, Graph neural networks in TensorFlow and Keras with Spektral, arXiv preprint arXiv:2006.12138 (2020).

© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0015

Chapter 15

A Critical Appraisal on Deep Neural Networks: Bridge the Gap between Deep Learning and Neuroscience via XAI

Anna-Sophie Bartle∗, Ziping Jiang∗, Richard Jiang∗, Ahmed Bouridane†, and Somaya Almaadeed‡

∗LIRA Center, Lancaster University, Lancaster LA1 4YW, UK
†Computer and Information Sciences, Northumbria University, NE1 8ST, UK
‡Computer Sciences, Qatar University, Doha, Qatar

Starting in the early 1940s, artificial intelligence (AI) has come a long way, and today, AI is a powerful research area with many possibilities. Deep neural networks (DNNs) are part of AI and consist of several layers: the input layers, the so-called hidden layers, and the output layers. The input layers receive data; the data are then converted into computable variables (i.e., vectors) and are passed on to the hidden layers, where they are computed. Each data point (neuron) is connected to another data point within a different layer that passes information back and forth. By adjusting the weights and biases at each hidden layer (over several iterations between those layers), such a network maps the input to the output, thereby generalizing (learning) its knowledge. At the end, the deep neural network should have enough input to predict results for specific tasks successfully. The history of DNNs, and of neural networks in general, is closely related to neuroscience, as the motivation of AI is to teach human intelligence to a machine. Thus, it is possible to use the knowledge of the human brain to develop algorithms that can simulate the human brain. This is performed with DNNs. The brain is considered an electrical network that sets off electrical impulses. During this process, information is carried from one synapse to another, just as it is within neural networks. However, AI systems should be used carefully, which means that the researcher should always be capable of understanding the system he or she created, an issue discussed within explainable AI and DNNs.


June 1, 2022 13:18


Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch15

Handbook on Computer Learning and Intelligence—Vol. II

15.1. Introduction

Artificial intelligence (AI) is a widely used term in today’s society. However, defining AI is not an easy task, as it is a broad research area that includes various approaches and techniques and that is applied within different fields such as natural language processing (NLP), machine vision, multi-agent systems, and so on. AI can, nevertheless, be described as the intelligence of machines. Such an intelligent system is described as one that “processes information in order to do something purposeful” and that is built by humans to think and act like a human being [1]. How human-like a system thinks and acts is assessed using Alan Turing’s so-called Turing test; a system passes the test if a user cannot tell whether he or she is interacting with a human or a computer [2]. Hence, the goal of AI is to construct systems that exhibit human-like intelligence. Thus, throughout this chapter, artificial intelligence is understood as “the ability to do the right thing at the right moment by studying and developing computational artefacts that exhibit some facets of intelligent behavior” [1].

To understand current trends in AI, this chapter starts with a short summary of the history of AI and neural networks. It then continues with an introduction to deep neural networks, thereby explaining the term deep learning. Since DNNs draw on neuroscience, the third part of this chapter briefly discusses the mutual connections and possibilities between neuroscience and DNNs. The fourth part continues with the notion of explainable AI (XAI) and DNNs, demonstrating some of the difficulties such smart systems face. This is followed by a critical review of DNNs. The survey ends with a discussion of possible future trends in AI and summarizes the main points of the chapter in a brief conclusion.

15.2. History of Artificial Intelligence and Neural Networks

Artificial intelligence is all about mechanically replicating the human way of thinking and acting. The idea, like many other big discoveries, came from classical philosophers and is, as such, still a controversial topic in philosophy [3]. The history of artificial intelligence has been structured into four seasons: spring, summer, fall, and winter [4]. Its roots are often traced back to the early 1940s, when Isaac Asimov published the short story Runaround, which features a robot developed by the fictional engineers Gregory Powell and Mike Donovan [4]. The robot in the story reacts to specific orders given by humans while following the Three Laws of Robotics proposed by Asimov: not to harm a human being, to obey the given orders, and not to put its own existence at risk. At around the same time, Alan Turing, still one of the most influential mathematicians in the testing of machine intelligence, created a machine that was able to decipher the Enigma code used by the German military in World War II, a task that had until then been considered impossible. In 1950, Alan Turing published his still


used and famous Turing test, which considers whether an artificial system can be called intelligent or not; as soon as an artificial system cannot be distinguished from a human, it can be considered intelligent. The academic discipline of AI was founded in 1956, during the Dartmouth Summer Research Project on AI at Dartmouth College, by Marvin Minsky and John McCarthy, marking the beginning of the AI spring. The season lasted about 20 years, during which AI blossomed and was significantly successful. One of the most impressive works of that time was the dialogue system ELIZA, written by Joseph Weizenbaum between 1964 and 1966; the system demonstrated the possibilities of artificial systems processing natural language. In the 1970s, Marvin Minsky stated that in less than a decade, it would be possible to create artificial machines that are as intelligent as the average human being. However, due to the limited computing capacity available at the time and the use of only basic algorithms, progress in AI stalled, and important funding from big companies and governments was lost, marking the beginning of the AI winter, which again lasted for about two decades [4]. ELIZA illustrates why early strategies and implementations of AI failed to replicate human intelligence within artificial systems. The system lacks initiative and knowledge, since ELIZA has no real-world knowledge or common sense. Moreover, such dialogue systems lack memory capacity and struggle with adopting the correct grammar of a language.

A different approach had begun with the introduction of artificial neural networks (ANNs) by Warren McCulloch and Walter Pitts in 1943 [5, 6], who defined the so-called threshold logic. Their model divided the field into two approaches, one focusing on biological processes and the other on neural networks as computational models, which eventually led to the implementation of finite-state automata.
Moreover, the Hebbian learning theory, developed during the 1940s by Donald Hebb, also pointed to the possibility of creating ANNs. Such ANNs should be able to replicate the processes of neurons in the human brain, since the brain itself is an electrical network that sets off electrical impulses, by which information is carried from one neuron to another across synapses. Frank Rosenblatt, moreover, showed that it was possible for perceptrons (a type of neural network) “to learn and make decisions” based on the input data [4]. In 1975, the backpropagation algorithm was introduced by Werbos. The algorithm trains a multi-layer network consisting of various hidden layers. The input data are used for training, namely for modifying the weights of the connections between the nodes of the network; each node represents some aspect of the input. During this “learning” process, the errors made by the network are propagated backward and distributed among all the nodes, so that they are progressively reduced [4]. In 1997, AI celebrated a breakthrough with IBM’s Deep Blue, a chess-playing program that was able to beat the world’s best chess player, Garry Kasparov, using collections of rules that follow


simple “if–then” principles. Such rule-based systems can be applied to logical-mathematical problems; however, they would fail in tasks such as facial recognition [4]. In 2015, ANNs returned to prominence in the form of AlphaGo, developed by Google’s DeepMind, which was able to beat one of the world’s best Go players; Go is a board game whose number of possible moves exceeds that of chess by far. As we can see, AI and ANNs have come a long way, and the details above are only a brief summary of their history and development.

15.3. Rise of Deep Neural Networks

Deep neural networks, or rather deep learning, have become popular over the last couple of years as they demonstrated one breakthrough after another, be it in image recognition and automatic translation or in beating the world champion in the board game Go. The definition of deep learning adopted here is that “it uses specialized algorithms to model and understand complex structures and relationships among data and datasets” using multiple layers [3]. The era of deep learning started with ImageNet, an image database created by Fei-Fei Li [7] in 2009. ImageNet set the milestone for image and facial recognition processes that are, for example, used to unlock smartphones. By 2010, many AI researchers were developing their own deep learning systems, trying to train the best and most accurate DNNs. In 2014, Ian Goodfellow introduced a deep learning architecture called the generative adversarial network (GAN), which aims to generate synthetic data. A GAN pairs a generator, which produces synthetic data, with a discriminator, which classifies data as real or synthetic. The discriminator becomes better at classifying input data by “learning” from its errors on synthetic data, which in turn allows the generator to adjust its weights and produce more realistic data to fool the discriminator [7, 8].

As we can see in Figure 15.1, simple neural networks do not look that different from DNNs, as both consist of input, hidden, and output layers connected to one another. The input layer receives input data such as words or pixels, which are converted into calculable variables (i.e., vectors) and then passed on to one or more hidden layers that consist of so-called nodes (neurons) connected to one another via synapses. The various hidden layers draw information from the received data to produce a result via the output layer [3].
Whereas some researchers suggest that the difference between simple neural networks and deep neural networks is the increased number of hidden layers within DNNs (multi-layer networks), others suggest that there is no difference as both have neurons and synapses that are interconnected and share the same functionality—using weights and bias and an activation function to change and optimize the next layer’s outcome [7].
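The shared functionality described above (weights, a bias, and an activation function feeding the next layer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the layer sizes, the random initial weights, the sigmoid activation, and the helper names `make_layer` and `forward` are hypothetical choices, not taken from [3] or [7]. The same `forward` routine serves the "simple" and the "deep" network; only the number of hidden layers differs.

```python
import math
import random


def sigmoid(z):
    """Logistic activation, one common (assumed) choice."""
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(x, layers):
    """Pass the input vector x through every layer in turn."""
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


rng = random.Random(0)
x = [0.5, -1.2, 3.0]  # an input already converted into a vector

# "Simple" network: one hidden layer (3 -> 4 -> 2).
shallow = [make_layer(3, 4, rng), make_layer(4, 2, rng)]
# "Deep" network: three hidden layers (3 -> 4 -> 4 -> 4 -> 2).
deep = [make_layer(3, 4, rng), make_layer(4, 4, rng),
        make_layer(4, 4, rng), make_layer(4, 2, rng)]

shallow_out = forward(x, shallow)
deep_out = forward(x, deep)
print(shallow_out, deep_out)
```

Both networks are untrained here, so the printed numbers carry no meaning; the point is only that depth changes the length of the `layers` list, not the mechanism.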


Figure 15.1: Left: Simple neural network. Right: Deep neural network.

Deep learning networks need to iterate over large datasets (input) to create reliable solutions, as has been seen successfully in a wide range of applications [9–12]. The input data contain labels (supervised learning), which enable the network to assign each input data point to the category to which it belongs. By adjusting the weights of the connections between the neurons through backpropagation, the network reduces the error function, mapping more and more input data to the correct output and thereby producing more reliable results. Various studies have reported that the more hidden layers are available, the better the outcome is. The word deep comes from the many hidden layers used to produce better results. However, having too many hidden layers may lead to overgeneralizing the data and to suboptimal solutions, since the assignment of overgeneralized input data to certain groups might no longer be clear [3].

Since the amount of available data has grown rapidly, training DNNs with huge amounts of data is possible. DNNs rely heavily on large datasets, on which they need to be trained to provide reliable predictions and results. In this respect, humans are far better at learning rules and classifying objects into categories, because humans need only a few examples to understand the concept behind them. Therefore, deep learning can help with problems for which a huge amount of data is available, but it should be avoided if this is not the case [3]. Moreover, just like any other technique used in AI, deep learning has its limitations. DNNs can handle closed, well-defined problems. However, for problems with a hierarchical structure (such as embedded sentences), sentences with ambiguous meaning, or problems that require one to draw open-ended inferences, DNNs are of little use because no straightforward solutions are available.
This is so because deep learning is “hermeneutic” [3] since DNNs are “being self-contained and isolated from other, potentially useful knowledge” [3], though


there exists a large variety of combined methods, such as deep kernel methods, deep Gaussian processes, Bayesian drop-out techniques, etc. For straightforward problems, DNNs are capable of learning the connections between input and output layers. Nevertheless, DNNs offer a great opportunity to solve certain problems and to make some things easier and more efficient (e.g., facial recognition to unlock one’s smartphone).
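The supervised training procedure described earlier in this section (labelled inputs, backpropagation, a shrinking error function) can be illustrated with a deliberately tiny network. The sketch below is an assumption-laden toy, not a method from the cited works: the XOR data, the single hidden layer of three sigmoid units, the squared-error loss, full-batch gradient descent, and all hyperparameters are chosen purely for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Labelled training data (supervised learning): the XOR function.
DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]


def train(hidden=3, epochs=5000, lr=0.5, seed=1):
    """Full-batch gradient descent; returns the mean error per epoch."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    n = len(DATA)
    losses = []
    for _ in range(epochs):
        g_w1 = [[0.0, 0.0] for _ in range(hidden)]
        g_b1 = [0.0] * hidden
        g_w2 = [0.0] * hidden
        g_b2 = 0.0
        loss = 0.0
        for x, y in DATA:
            # Forward pass: input -> hidden layer -> output.
            h = [sigmoid(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
                 for j in range(hidden)]
            out = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + b2)
            loss += 0.5 * (out - y) ** 2
            # Backward pass: propagate the error from the output to the hidden layer.
            d_out = (out - y) * out * (1.0 - out)
            g_b2 += d_out
            for j in range(hidden):
                d_h = d_out * w2[j] * h[j] * (1.0 - h[j])
                g_w2[j] += d_out * h[j]
                g_b1[j] += d_h
                for i in range(2):
                    g_w1[j][i] += d_h * x[i]
        losses.append(loss / n)
        # Adjust weights and biases against the averaged gradient.
        b2 -= lr * g_b2 / n
        for j in range(hidden):
            w2[j] -= lr * g_w2[j] / n
            b1[j] -= lr * g_b1[j] / n
            for i in range(2):
                w1[j][i] -= lr * g_w1[j][i] / n
    return losses


losses = train()
print(f"mean error: first epoch {losses[0]:.4f}, last epoch {losses[-1]:.4f}")
```

With only four examples this says nothing about the data requirements of real DNNs; it only makes the weight-adjustment loop concrete.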

15.4. Neuroscience and Deep Neural Networks

The origins of deep neural networks date back to the 1940s, when it was found that the brain sends electrical impulses from neuron to neuron across synapses, thereby transferring and processing information from one node to the next to produce a result. Since then, ANNs and DNNs have been built and developed, often using insights and findings from the field of neuroscience to refine already existing algorithms. Therefore, it might be possible to improve DNNs further by better understanding the biological brain (be it that of humans or of other animals) to create even better and more human-like intelligent systems. However, it is not only AI that can draw inferences from neuroscience; the reverse also holds, since AI can be used to recognize and detect, for example, certain illnesses whose signs would otherwise hardly be recognized as such [13].

Using neuroscience to develop and improve DNNs is an obvious step, since the main goal of AI is to teach artificial systems to act like, and be as intelligent as, humans. In early AI, human intelligence was, for example, thought to consist of symbolic representations that need to be manipulated to produce a learning effect. However, the idea of reducing the brain and its functions to symbolic representations was found to be too simple compared to the complex interactions necessary to solve certain tasks in the real world. Therefore, neural networks were developed within AI research, implementing algorithms that simulate information processing through dynamic and interconnected neurons. These networks are fine-tuned by adjusting the weights of the connections between the neurons of neighboring layers, together with their other parameters, thereby reducing the errors made by the system and making its outcome more reliable.
Since the successful implementation of such networks showed promising results, especially in terms of simulating human behavior and intelligence, research on the functionality of the human brain (neuroscience) was, and still is, taken more and more into consideration. After all, the idea of such ANNs came from neuroscience, which provided the first insights into the human brain [13] and made it possible in the first place to develop such deep learning systems and their underlying algorithms for AI.


Reinforcement learning, moreover, played an important role in the development of deep learning. Reinforcement learning is about maximizing the reward by learning how to act to get a positive result, thereby learning to avoid actions that lead to negative outcomes. This can be compared to the learning process within the human brain—correct learning trials lead to an increase in dopamine neuron activation, successfully pushing the learning process [14]. Another interesting and rather important aspect within the human brain is the memory capacity. Human intelligence has the ability to maintain information for quite some time, making it possible to further process such information at a later stage (as it is needed for embedded sentences, for example, in which the main clause is interrupted by a subordinate clause). However, over the years, AI systems have improved at maintaining and storing information thanks to the recurrent neural networks and long short-term memory (LSTM) networks that have been developed. Nevertheless, more research needs to be done to make such networks successfully solve rather complex tasks such as commonsense understanding—understanding the real world when interacting with a problem [13]. Thus, understanding the human brain could be a prerequisite to creating even better and more robust intelligent machines as it might provide and inspire new types of algorithms by understanding the functionality of the living brain. This can be done through “analysis, visualization, causal manipulation, and hypothesis-driven experiments” within neuroscience [13]. The expansion of interest from AI to neuroscience might be as useful as the impact neuroscience has on AI, with slightly different approaches and perspectives. AI, for example, uses various algorithms and architectures to imitate the human brain, so as to reproduce the underlying calculations the brain does when processing information. 
Therefore, AI might give useful insights into the working processes of the brain, thereby connecting the computational level (the purpose of the computational process) with the algorithmic level, which comprises the processes and computations that are necessary. AI not only provides various possibilities to analyze data coming from functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) but also helps to provide information on the computational and algorithmic processes underlying information processing within the brain. Based on the information provided by AI, neuroscientists and psychologists could adapt their research accordingly. A close cooperation between both disciplines could, thus, benefit both parties: AI researchers could build better and more reliable intelligent systems by understanding the functions of the brain, whereas neuroscientists could learn more about the possible mechanisms underlying the brain and find new research approaches (e.g., treatment options for various illnesses). Neuroscientists, for example, found that the above-mentioned dopamine neuron activation during the learning process


within the human brain resembles the algorithm found in temporal difference (TD) learning [15]. TD learning is a method used in reinforcement learning, which dynamically adjusts its learning “predictions to match later, more accurate, predictions” [15]. Therefore, a close cooperation might be beneficial for both research areas.
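To make the idea of adjusting predictions "to match later, more accurate, predictions" concrete, the following sketch applies a tabular TD(0) update to a small random walk. Everything specific here is an assumption for illustration and is not taken from [15]: the seven-state chain, the reward of 1 at the right end, the step size `alpha`, and the function name `td0_random_walk`. The quantity `td_error` plays the role of the reward prediction error that the dopamine signal is compared with.

```python
import random


def td0_random_walk(episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate state values on a 7-state chain with tabular TD(0)."""
    rng = random.Random(seed)
    n_states = 7                   # states 0..6; 0 and 6 are terminal
    values = [0.0] * n_states      # current predictions of future reward
    for _ in range(episodes):
        state = 3                  # every episode starts in the middle
        while state not in (0, 6):
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == 6 else 0.0
            # Prediction error: later estimate minus current estimate.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values


values = td0_random_walk()
print([round(v, 2) for v in values])
```

States closer to the rewarded end should end up with higher estimated values, because each prediction is repeatedly pulled toward the prediction made one step later.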

15.5. Explainable AI and DNNs

Because AI systems become less and less interpretable as they become more and more complex, explainable AI (XAI) is a research initiative that draws attention to the need for intelligent machines, including DNNs, to remain explainable, that is, for their underlying mechanisms and algorithms to be describable. This is an important task [16], as it would mean that scenes depicted in science-fiction movies, in which super-intelligent systems take over the world, could not happen: there would always be enough knowledge and know-how to stop such artificial systems and limit their possible errors, rather than merely being lucky that a system happened to produce intelligent solutions. Moreover, XAI wants to move beyond current AI systems (e.g., DNNs), as such machines still need concrete training sets to succeed in one specific task. The problem is that there is still no way of making AI systems think in a contextual, abstract way. This, in turn, narrows down the usefulness of such machines, as they can only be used for the narrow, specific task for which they have been trained; the system knows how to compute the data without having an ethical understanding of what the actual task is about. XAI, thus, tries to draw attention to the ethical problems AI systems cause by questioning whether those systems should be given such a huge responsibility while they lack important knowledge. XAI therefore also discusses whether the development of AI systems should be stopped in order to prevent artificial systems from taking over the world [16, 17].

The problem of not having sufficient knowledge and expertise to explain the underlying functions and computations of DNNs is illustrated in Figure 15.2, which shows many techniques used within AI in relation to how well they can be explained.
The purple techniques (neural networks and deep learning) are the current trend and are implemented the most when building current AI systems. However, it seems that the higher the accuracy an approach obtains, the less explainable it is. This means that it is unknown why the algorithm chooses one result over another. This not only causes ethical problems, as already mentioned above, but also limits the developmental possibilities that such systems could have if they


Figure 15.2: Techniques used in AI and the extent to which they are understood and can be explained (retrieved from [16]).

could be fully understood. Understanding is one of the most important steps for moving beyond and creating even better systems [16]. For various fields, AI is helpful and should be used, for example, in diagnosing medical conditions. However, the results given by intelligent machines should always be interpretable and explainable. Thus, for example, a doctor should not solely rely on the machine’s diagnosis, but should also be able to ask questions to the machine, which in turn should be able to provide understandable answers. To achieve this, it is necessary for the machine to understand and explain its given results. Therefore, the system should be capable of processing abstract information, which still needs to be developed further in the future in order for humans to understand the underlying information processes within the system [17]. Moreover, one important aspect of XAI concerns the predictions made by AI systems and whether such predictions are reliable. The goal is to comprehend a system’s underlying computational processes that are used to make such predictions to create algorithms that make even better and more reliable predictions. Therefore, we need to understand the rationale of such systems (similar to understanding animal behavior in order to react to their behavior). There are two ways for humans to understand the underlying processes of such AI systems, namely the deep approach and the black-box approach [17]. Within the deep approach, the researcher tries to understand the different blocks making up the system completely (e.g., looking at the parameters of perceptrons, trying to understand what activates and changes them, etc.) [17]; this, however, is hard to accomplish as it requires expert knowledge to draw generalizations out of such systems. The other approach is called the black-box approach, in which the system is treated like a black box in which no knowledge about its contents is present. In the

June 1, 2022 13:18

628

Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch15

Handbook on Computer Learning and Intelligence—Vol. II

Figure 15.3: Today’s AI systems raise many questions since they are not explainable. XAI tries to raise awareness that moving beyond the explainable is a risk that should be avoided by first trying to understand the implemented model and its actions before moving to even more complex tasks. Understanding systems first will lead to even better systems and is safer, as AI systems cannot take over the decision-making process while humans still understand what is going on.

approach, the researcher tries to understand the behavior of the system by manipulating its input and interpreting the resulting outputs [17]. With the knowledge acquired in this way, machines can be trained specifically to do what they had already been doing, but now with the knowledge, comprehension, and intention of the programmer. This leads to better, more robust, and more secure systems, as it is all about understanding what the system is doing before moving on to even more complex algorithms. Figure 15.3 captures the intentions of XAI very well.
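The black-box approach can be illustrated with a simple perturbation probe: change one input at a time and record how much the output moves. The sketch is hypothetical throughout. The function `black_box` is a stand-in for an opaque model (its coefficients are invented for the example), and `sensitivity` is one of many possible probing schemes, not a method prescribed in [17].

```python
def black_box(features):
    """Stand-in for an opaque model; the probe below never looks inside it."""
    x0, x1, x2 = features
    return 3.0 * x0 - 0.5 * x1 + 0.0 * x2


def sensitivity(model, features, delta=1.0):
    """Perturb each input in turn and measure the change in the output."""
    baseline = model(features)
    scores = []
    for i in range(len(features)):
        perturbed = list(features)
        perturbed[i] += delta
        scores.append(abs(model(perturbed) - baseline))
    return scores


scores = sensitivity(black_box, [1.0, 2.0, 3.0])
most_influential = max(range(len(scores)), key=lambda i: scores[i])
print(scores, most_influential)  # [3.0, 0.5, 0.0] 0
```

The probe ranks the first input as most influential and the third as irrelevant without any access to the model's parameters, which is exactly the kind of knowledge the black-box approach aims to recover. For a nonlinear model the scores would depend on the chosen baseline and on `delta`.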

15.6. Critical Review on Deep Neural Networks

Even though intelligent machines have exceeded human performance in some well-defined logical/mathematical problems (for example AlphaGo, a recently developed DNN-based system that won against the world champion in the board game Go), DNNs have not reached the level of human intelligence in other areas such as natural language processing. DNNs, as already discussed, consist of several layers, and generally speaking, the more layers a DNN has, the harder it is to train [3, 18]. Moreover, systems such as self-driving cars and speech recognition play important roles in our daily life and will continue to grow in importance over the next couple of years [12]. Most users of such systems are not interested in AI and ANNs/DNNs as such and, thus, are frequently not interested in the underlying processes within such AI systems, since their focus lies mainly on the capabilities those systems offer. Nevertheless, this section gives a brief


summary of the different architectures that can be implemented within deep learning approaches and techniques. All architectures have in common that they are made up of “multi-layered artificial neural architectures that can learn complex, non-linear functional mappings, given sufficient computational resources and training data” [20]. Since such a network reads and processes a huge amount of data and recognizes patterns automatically, it reduces manual work, giving more or less sufficient outcomes.

As already mentioned, DNNs consist of input, hidden, and output layers, each consisting of interconnected, functional neurons that compute and carry information, making it possible to map input data to the output. The number of hidden layers is important, since the neurons of each hidden layer are connected to the results of the previous hidden layer, and predictions are made based on that outcome. Learning is done by adjusting weights and biases according to certain learning rules that eventually capture some generalizations in the data. Within each hidden layer, the system learns more about the data and tries to find rules that can be generalized for the specific task the network needs to fulfill. The input consists of training examples, e.g., words (of sentences) or pixels (of pictures), which are converted into vectors or certain types of embeddings and to which synaptic weights are assigned. The dot product between each input vector and its assigned weights is calculated, and the results of these calculations are added.
Depending on the dimensionality of the input vectors and weights, the output for each input [20] is an m-dimensional vector and the summation becomes “an m-dimensional hyperplane.” The neural network “learns an approximation to the function that produced each of the outputs from its corresponding input” with the approximation being “the partition of the input space into samples that minimizes the error function between the output of the ANN given its training inputs and the training outputs” [20]. That such an approximation exists is guaranteed by the universal approximation theorem, which states that “any … mapping between input and output vectors can be approximated to with arbitrary accuracy with an ANN provided that it has a sufficient number of neurons in a sufficient number of layers with a specific activation function” [20, 21]. Thus, all parts of an artificial neuron, including the activation and error functions, are necessary to build a working network that is able to optimize the connections between input and output pairs, thereby minimizing the errors made by the system. The success of DNNs in particular also rests on newer and more powerful GPU machines, on improved regularization techniques that are getting better at avoiding overfitting, on adaptive learning rates, and, not least, on the availability of large training datasets containing high-quality labels [20, 21].
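The computation performed by a single artificial neuron, as described in words above, can be written compactly. The notation is an assumption introduced here for illustration and does not come from [20] or [21]: $\mathbf{x}$ is the input vector, $\mathbf{w}$ the assigned weights, $b$ the bias, $\varphi$ the activation function, and $E$ a squared-error function over $N$ training pairs (other error functions are equally possible).

```latex
% Illustrative notation (not from the chapter):
% x = input vector, w = weights, b = bias, \varphi = activation function,
% a^{(k)} = network output for training input k, y^{(k)} = training output k.
\begin{align}
  z &= \mathbf{w}^{\top}\mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b, \\
  a &= \varphi(z), \\
  E &= \frac{1}{N}\sum_{k=1}^{N} \bigl(a^{(k)} - y^{(k)}\bigr)^{2}.
\end{align}
% Worked example with x = (1, 2), w = (0.5, -0.25), b = 0.1:
%   z = 0.5 \cdot 1 + (-0.25) \cdot 2 + 0.1 = 0.1
```

Training then amounts to choosing $\mathbf{w}$ and $b$ for every neuron so that $E$ becomes as small as possible on the training pairs.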


Throughout the years, various deep learning architectures have been developed. The architectures can be distinguished by design; however, their advantages are not really comparable, since most DNNs are chosen depending on their performance in specific tasks. Convolutional neural networks (CNNs), for example, are mostly used for computer vision tasks, whereas recurrent neural networks (RNNs) are mostly used for sequence and time series modeling tasks [20].

One of the most basic architectures of deep learning is the so-called deep feed-forward network, in which the connections between the neurons only move forward: they run in one direction from the input through all the hidden layers of the network, capturing nonlinear relations in the data [21]. Due to its simple implementation, it is an often-used architecture. It is trained with backpropagation, which first assigns random weights to the connections and then, throughout the training process, fine-tunes those weights, thereby reducing the system’s error rate [20].

Unlike the feed-forward network, which moves in one direction, recurrent neural networks (RNNs) form a cycle during processing in which a subset of the outputs of a given hidden layer becomes part of the input at the next step, thereby “forming a feedback loop” [21]. RNNs are, therefore, able to maintain some of the information received at previous steps and are often used to tackle practical problems such as speech recognition or video classification, in which the video is analyzed on a frame-by-frame basis. Given a phrase of three words, the network would need three hidden layers in its unrolled form, each analyzing one of the words and maintaining information about it. Long short-term memory (LSTM) is a specific type of RNN that was designed to maintain even more information and to capture long-range dependencies more reliably.
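The feedback loop of a plain recurrent network, and the three-word example above, can be sketched as follows. The two-dimensional word vectors, the weight matrices, and the function name `rnn_step` are hypothetical values chosen only to make the recurrence visible; the tanh update is the textbook form of a simple RNN and is assumed here rather than taken from [21].

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One recurrent step: combine the current input with the previous hidden state."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_xh[j], x))
            + sum(w * hi for w, hi in zip(w_hh[j], h_prev))
            + b[j]
        )
        for j in range(len(b))
    ]


# Hypothetical 2-d embeddings for a three-word phrase.
phrase = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_xh = [[0.5, -0.3], [0.8, 0.2]]   # input-to-hidden weights (assumed)
w_hh = [[0.1, 0.4], [-0.2, 0.3]]   # hidden-to-hidden weights: the feedback loop
b = [0.0, 0.0]

h = [0.0, 0.0]                     # empty memory before the first word
states = []
for x in phrase:                   # one unrolled step per word, same weights each time
    h = rnn_step(x, h, w_xh, w_hh, b)
    states.append(h)

print(states)
```

The hidden state after the third word depends on all three words, which is the sense in which the network "maintains" earlier information. An LSTM replaces the single tanh update used here with gated blocks.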
As such, the input does not move from layer to layer but from block to block, each block consisting of different layers, namely the input, forget, and output gates, in which a logistic function keeps track of the information in the input data [21]. Hence, LSTM models are used in tasks such as part-of-speech tagging in NLP or speech recognition.

Another kind of deep learning network is the restricted Boltzmann machine (RBM), which is a stochastic neural network. Due to its ability to learn from labeled as well as unlabeled data (supervised vs. unsupervised learning), it is an often-used tool. The network was developed by Paul Smolensky in 1986; however, it became popular only in 2002, when the algorithm was reintroduced by Hinton [20]. RBMs are often used in feature learning, which enables systems to automatically classify data based on feature detection [20, 21].

Another popular and widely used deep network is the convolutional neural network (CNN), which contains convolutional layers and multi-layer perceptrons (usually consisting of three or more layers), calculating a result by applying a convolution operation

on the input data and passing the outcome to the following layer [20]. Through this additional calculation, it allows even deeper neural networks while using only a few parameters (filters). Moreover, CNNs are capable of successfully breaking down the characteristics of, e.g., images, and learning these representations throughout the different layers. Therefore, CNNs are often used in image and facial recognition as well as in NLP applications such as semantic parsing [21].

The last DNN mentioned here is the sequence-to-sequence model, which consists of two different recurrent neural networks: the encoder, which encodes the input data, and the decoder, which is responsible for producing the output from the information provided by the encoder. It is a rather new model that still needs to be developed further, as it is used for difficult and highly complex tasks such as question-answering systems, machine translation, and chatbots [20, 21].

There are many more DNNs that fulfill various influential tasks; this section mentioned only a few architectures to demonstrate some of the differences among them and to give examples of how such DNNs are implemented and what they are used for.

From the viewpoint of neuroscience, deep learning as the state-of-the-art AI technique has its roots in the emulation of the human brain. To make DNNs explainable, an ultimate goal is to find a way to match human intelligence and build a human-made "brain" or at least, at a functionally higher level, to map the deep architectures to the layered information processing units in the brain. However, there are still important differences between the features of current mainstream DNNs and the human brain. First, a human brain is more like an analogue circuit without the ability to store high-precision parameters. Secondly, neurons in the human brain are randomly interconnected, in contrast to the carefully "handcrafted" architectures of the current mainstream DNNs. Such a randomly interconnected architecture [18, 19], known as Turing's type-B machine (unorganized machine), could lead to a generalized AI.

From the viewpoint of neuroscience, XAI can help bridge the gap between deep learning and neuroscience in a mutually beneficial way. On the one hand, XAI helps build an abstract DNN model that is more easily understood by humankind [22, 23]; on the other, XAI models [16, 17, 26] derived from DNNs can also help in understanding the mechanisms of the human brain [24–27].

15.7. Discussion and Conclusion

In this chapter, we came to the conclusion that human intelligence does not solely consist of logical/mathematical intelligence but also of “linguistic, spatial, musical kinaesthetic, interpersonal, intrapersonal, naturalist, and existential” intelligence [1]. Due to its multi-dimensionality, it is difficult to include all these abilities of a

human brain into one single system; hence, most AI systems only include parts of the above-mentioned aspects of intelligence, focusing, for example, on games such as chess or Go. The history of AI goes back to the early 1940s, when AI was already a broadly discussed topic that was eventually implemented and converted into algorithms computable by computers. AI tries to replicate human intelligence by simulating brain functions. However, implementing the human brain in some kind of artificial environment is no easy task; the human brain is too complex an organ to be understood simply by looking at it. Rather, it takes many years to capture even small fractions of the knowledge the brain has to offer, and it will take even more years to discover the parts of the brain that have been hidden so far. Because the underlying computations of the brain can give insights into its functionality and vice versa, neuroscientists and AI researchers should work together; both parties would benefit from such cooperation. However, future work should not solely focus on human intelligence. Rather, it should also consider animals or plants, since they also have powerful abilities that should somehow be captured. Focusing solely on human intelligence might blind us to new ways of understanding and creating powerful intelligent systems [1].

Even though intelligent systems offer great possibilities, the research initiatives on explainable AI and DNNs warn against giving such systems too much power without understanding the computational process underlying them. They argue that a system should first be understood before trying to develop even newer and better algorithms that would outperform the old ones. Not understanding the concepts could have a negative impact, since such systems could develop in such a way, as described in science-fiction movies, that they would overpower the human world, with human beings having no power to stop them because they cannot understand the systems' underlying logic. Thus, before developing newer and better algorithms that are even less explainable than DNNs, older systems should be fully understood. This not only makes the systems easier to understand but also enables the programmer to create even better and more human-like algorithms, since understanding processes of the brain can provide useful information.

Moreover, since machines are taking over the decision-making process in many daily situations, one should ask what kind of suggestions such systems give. Intelligent machines still mostly lack the ability to process abstract information or anything that requires real-world knowledge. Thus, we should be careful about putting too much trust in such machines; rather, we should keep questioning the outcomes and predictions they produce. Furthermore, such systems have difficulty not only with abstract information but also with drawing inferences from hierarchical structures (e.g., embedded sentences)

or commonsense understanding. Future work could focus on how to handle such problems. Moreover, AI could be used to build algorithms that optimize daily life for small farmers by giving insights into current weather conditions, planting situations, the condition of their soil, and so on, thus making farming more attractive and profitable, even for small farmers. Moreover, because cultures are becoming increasingly diverse, working on better translation services might support inclusion and cross-cultural communication, providing better chances, especially for younger immigrants, to receive education [1].

In conclusion, artificial intelligence and deep neural networks have come a long way and will most likely be a big topic in the coming years. AI offers great possibilities; however, such intelligent systems should always be used with caution, keeping in mind the consequences of using them. AI comes with a great deal of knowledge and can simplify certain things; however, failing to understand how such intelligent systems work, especially when control over them is lost, can be dangerous.

References

[1] V. Dignum, Responsible Artificial Intelligence: How to Develop and Use AI in a Responsible Way, Springer Nature Switzerland (2019).
[2] A. M. Turing, Computing machinery and intelligence, Mind, 59, pp. 433–460 (1950).
[3] G. Marcus, Deep learning: a critical appraisal, New York University, pp. 1–27 (2017).
[4] M. Haenlein and A. Kaplan, A brief history of artificial intelligence, California Management Review, 61(4), pp. 5–15 (2019).
[5] Wikipedia, History of artificial intelligence, in Wikipedia, retrieved from https://en.wikipedia.org/wiki/History_of_artificial_intelligence.
[6] Wikipedia, History of artificial neural networks, in Wikipedia, retrieved from https://en.wikipedia.org/wiki/History_of_artificial_neural_networks.
[7] T. Greene, 2010–2019: the rise of deep learning, in the next web, (2020), https://thenextweb.com/artificialintelligence/2020/01/02/2010-2019-the-rise-of-deep-learning/, accessed 5th of January 2020.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in neural information processing systems, pp. 2672–2680 (2014).
[9] C. Chiang, C. Barnes, P. Angelov, and R. Jiang, Deep learning based automated forest health diagnosis from aerial images, IEEE Access (2020).
[10] G. Storey, R. Jiang, A. Bouridane, and C. T. Li, 3DPalsyNet: a facial palsy grading and motion recognition framework using fully 3D convolutional neural networks, IEEE Access (2019).
[11] Z. Jiang, P. L. Chazot, M. E. Celebi, D. Crookes, and R. Jiang, Social behavioral phenotyping of drosophila with a 2D-3D hybrid CNN framework, IEEE Access (2019).
[12] G. Storey, A. Bouridane, and R. Jiang, Integrated deep model for face detection and landmark localisation from ‘in the wild’ images, IEEE Access (2018).
[13] D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick, Neuroscience-inspired artificial intelligence, Neuron, 95, pp. 145–258 (2017).
[14] J. R. Hollerman and W. Schultz, Dopamine neurons report an error in the temporal prediction of reward during learning, Nat. Neurosci., 1(4), pp. 304–309 (1998).

[15] Wikipedia, Temporal difference learning, in Wikipedia—the Free Encyclopedia, retrieved from https://en.wikipedia.org/wiki/Temporal_difference_learning
[16] M. Lukianoff, Explainable artificial intelligence (XAI) is on DARPA’s agenda—why you should pay attention, in towards data science, https://towardsdatascience.com/explainable-artificialintelligence-xai-is-on-darpas-agenda-why-you-should-pay-attention-b63afcf284b5 (2019).
[17] J. Torres, Explainable AI: the next frontier in human-machine harmony, Towards Data Science, https://towardsdatascience.com/explainable-ai-the-next-frontier-in-human-machineharmony-a3ba5b58a399 (2019).
[18] R. Jiang and D. Crookes, Shallow unorganized neural networks using smart neuron model for visual perception, IEEE Access, 7, pp. 152701–152714 (2019).
[19] C. S. Webster, Alan Turing’s unorganized machines and artificial neural networks: his remarkable early work and future possibilities, Evol. Intell., 5, pp. 35–43 (2012).
[20] S. Saptarshi, S. Basak, P. Saikia, S. Paul, V. Tsalavoutis, F. D. Atiah, V. Ravi, and R. A. Peters, A review of deep learning with special emphasis on architectures, applications and recent trends, Knowledge-Based Syst., 194, p. 105596 (2020).
[21] A. Shreshtha and A. Mahmood, Review of deep learning algorithms and architectures, IEEE Access, 7, pp. 53040–53065 (2019).
[22] J. E. T. Taylor and G. W. Taylor, Artificial cognition: how experimental psychology can help generate explainable artificial intelligence, Psychon. Bull. Rev., pp. 6276–6282 (2020).
[23] R. M. Byrne, Counterfactuals in explainable artificial intelligence (XAI): evidence from human reasoning, in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), 1, pp. 6276–6282 (2019).
[24] M.-A. T. Vu, T. Adalı, D. Ba, G. Buzsáki, D. Carlson, K. Heller, C. Liston, C. Rudin, V. Sohal, A. Widge, H. Mayberg, G. Sapiro, and K. A. Dzirasa, A shared vision for machine learning in neuroscience, J. Neurosci., 18, pp. 1601–1607 (2018).
[25] J. M. Fellous, G. Sapiro, A. Rossi, H. Mayberg, and M. Ferrante, Explainable artificial intelligence for neuroscience: behavioral neurostimulation, Front. Neurosci., 13, p. 1346 (2019).
[26] R. Evans and E. Grefenstette, Learning explanatory rules from noisy data, J. Artif. Intell. Res., 2, pp. 1–64 (2017).
[27] A. M. Zador, A critique of pure learning and what artificial neural networks can learn from animal brains, Nat. Commun., 10(3770), pp. 1–7 (2019).

© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0016

Chapter 16

Ensemble Learning

Y. Liu∗ and Q. Zhao†

School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu, Fukushima 965-8580, Japan

∗[email protected]
†[email protected]

An ensemble system adopts the divide-and-conquer strategy. Instead of using a single learning model to solve a given task, the ensemble system combines a set of learning models that learn to subdivide the task, and thereby solve it more efficiently. This chapter introduces ensemble learning in training a set of cooperative learning models from the view of bias–variance–covariance trade-off in supervised learning. Three types of ensemble learning methods are discussed based on how interactions are performed among the individual learning models in their learning. As an example of ensemble learning, negative correlation learning is described, and analyzed by estimating bias, variance, and covariance on a regression task with different noise conditions. Finally, ensemble systems with awareness are briefly introduced.

16.1. Bias–Variance Trade-Off in Supervised Learning

There are two components in a general supervised learning model. The first component is a probability space (E, Pr), where each elementary event in the event set E is associated with two random variables: the input pattern x and the desired output y. For simplicity, y is a scalar in this chapter, and x is a vector in $\mathbb{R}^p$. All of the events in E follow the probability distribution Pr. The second component is a learning machine that can be implemented by a set of functions F(x, w), w ∈ W, where W is a space of real-valued vectors. Supervised learning searches for the function F(x, w) that minimizes the expected squared error,

$$R(w) = E\left[(F(x, w) - y)^2\right] \tag{16.1}$$

where E represents the expectation value over the probability space (E, Pr). Because the probability distribution Pr is normally unknown in practice, only a training set D = {(x(1), y(1)), ..., (x(N), y(N))} is available, where the data in the training set is supposed to follow the unknown probability distribution Pr. The expected squared error is therefore minimized on the training set D:

\begin{align}
R(w) &= E_D\left[(F(x, w) - y)^2\right] \tag{16.2}\\
&= \sum_{i=1}^{N} E\left[(F(x(i), w) - y(i))^2\right] \tag{16.3}
\end{align}

The learned function F(x, w) depends on the training set D. Different training sets D might lead to different F(x, w), so the function F(x, w) is written as F(x, D) to show its explicit dependence on the training set D. Suppose that all the training sets D of a given size N were available, each consisting of N data points selected independently according to the probability distribution Pr. All the training sets D of size N form a new probability space $(E^{(N)}, Pr^{(N)})$. Let $E_D$ denote the expectation over $(E^{(N)}, Pr^{(N)})$, and let E with no subscript denote the expectation over (E, Pr). With respect to the training set D, the mean-squared error separates as follows [1]:

\begin{align}
E_D\left[(F(x, D) - y)^2\right] &= E_D\left[F(x, D)^2\right] - 2y\,E_D[F(x, D)] + y^2 \notag\\
&= \left(E_D[F(x, D)] - y\right)^2 + E_D\left[F(x, D)^2\right] - \left(E_D[F(x, D)]\right)^2 \notag\\
&= \left(E_D[F(x, D)] - y\right)^2 + \mathrm{var}_D(F(x, D)) \tag{16.4}
\end{align}

where the variance $\mathrm{var}_D(F(x, D))$ can be represented as

\begin{align}
\mathrm{var}_D(F(x, D)) &= E_D\left[\left(F(x, D) - E_D[F(x, D)]\right)^2\right] \notag\\
&= E_D\left[F(x, D)^2\right] - \left(E_D[F(x, D)]\right)^2 \tag{16.5}
\end{align}

The term $(E_D[F(x, D)] - y)^2$ in Eq. (16.4) is the (squared) bias of the approximating function F(x, D), which measures how much the average function value at x deviates from y. The other term, $\mathrm{var}_D(F(x, D))$, is the variance of the approximating function F(x, D), which measures how much the function values at x change from one training set to another. As Eq. (16.4) shows, the expected mean-squared error consists of these two terms, bias and variance. Because neither bias nor variance can be negative, both of them must be small for the approximating function F(x, D) to achieve good performance.
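The separation in Eq. (16.4) can be checked numerically. The following is a minimal sketch, not taken from the chapter: the target function y = x^2, the constant-mean learner, and all sizes are assumptions chosen only for illustration. It draws many training sets, records F(x, D) at one fixed x, and splits the empirical squared error into bias and variance.

```python
import random

def decompose(predictions, y):
    """Split the empirical squared error at one input x into bias and variance.

    predictions: values F(x, D_k) obtained from K different training sets D_k.
    y: the desired output at x.
    """
    k = len(predictions)
    mean_pred = sum(predictions) / k
    mse = sum((f - y) ** 2 for f in predictions) / k
    bias_sq = (mean_pred - y) ** 2
    variance = sum((f - mean_pred) ** 2 for f in predictions) / k
    return mse, bias_sq, variance

def constant_model(train_set):
    """A deliberately simple learner (assumption): predict the mean training target."""
    return sum(y for _, y in train_set) / len(train_set)

rng = random.Random(0)
x0 = 0.9
y0 = x0 ** 2  # hypothetical target function y = x^2, evaluated at the query point

predictions = []
for _ in range(200):  # 200 training sets of 20 noisy patterns each
    xs = [rng.random() for _ in range(20)]
    train = [(x, x ** 2 + rng.gauss(0.0, 0.1)) for x in xs]
    predictions.append(constant_model(train))

mse, bias_sq, variance = decompose(predictions, y0)
print(mse, bias_sq, variance)
```

Because the learner here is far too simple for the target, most of the error should appear in the bias term, which matches the discussion of overly simple functions.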

On the one hand, the obtained function F(x, D) would not be able to capture some aspects of the data if it were too simple. It might either overestimate or underestimate most of (x, y). Both cases would lead to large bias. On the other hand, if the function F(x, D) were too complex, it would be capable of implementing numerous solutions that are consistent with the training data. However, for data different from the training data, a wide range of values of F(x, D) might appear as the training set D varies, so that its variance would become large. The complexity of F(x, D) that best matches all the data is hard to determine if only a small set of data points is available. In practice, there is usually a trade-off between bias and variance [1]. Decreasing bias by introducing more parameters often tends to increase variance. Reducing the number of parameters to lower the variance often tends to increase bias.

16.2. Bias–Variance–Covariance Trade-Off in Ensemble Learning

Rather than applying a single function F(x, D) on D, an ensemble system combines a set of functions $F_i(x, D)$. One way of combining them is to use their simple average as output:

$$F(x, D) = \frac{1}{M}\sum_{i=1}^{M} F_i(x, D) \tag{16.6}$$

where M is the number of individual approximating functions in the ensemble. The expected mean-squared error of the ensemble can be represented by individual function outputs:

$$E_D\left[(F(x, D) - y)^2\right] = E_D\left[\left(\frac{1}{M}\sum_{i=1}^{M} F_i(x, D) - y\right)^2\right] \tag{16.7}$$

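Based on Eq. (16.4), the following decomposition can be derived:

\begin{align}
E_D\left[\left(\frac{1}{M}\sum_{i=1}^{M} F_i(x, D) - y\right)^2\right] &= \left(E_D\left[\frac{1}{M}\sum_{i=1}^{M} F_i(x, D)\right] - y\right)^2 \notag\\
&\quad + \mathrm{var}_D\left(\frac{1}{M}\sum_{i=1}^{M} F_i(x, D)\right) \tag{16.8}
\end{align}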



where the two terms on the right side of Eq. (16.8) are the bias and the variance of the ensemble, respectively. The second term of the ensemble variance may be

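further divided into two terms:

\begin{align}
\mathrm{var}_D\left(\frac{1}{M}\sum_{i=1}^{M} F_i(x, D)\right)
&= E_D\left[\left(\frac{1}{M}\sum_{i=1}^{M} F_i(x, D) - E_D\left[\frac{1}{M}\sum_{i=1}^{M} F_i(x, D)\right]\right)^2\right] \notag\\
&= E_D\left[\frac{1}{M^2}\left(\sum_{i=1}^{M}\left(F_i(x, D) - E_D[F_i(x, D)]\right)\right)^2\right] \notag\\
&= E_D\left[\frac{1}{M^2}\left(\sum_{i=1}^{M}\left(F_i(x, D) - E_D[F_i(x, D)]\right)\right) \times \left(\sum_{j=1}^{M}\left(F_j(x, D) - E_D[F_j(x, D)]\right)\right)\right] \notag\\
&= E_D\left[\frac{1}{M^2}\sum_{i=1}^{M}\left(F_i(x, D) - E_D[F_i(x, D)]\right)^2\right] \notag\\
&\quad + E_D\left[\frac{1}{M^2}\sum_{i=1}^{M}\sum_{j=1,\, j\neq i}^{M}\left(F_i(x, D) - E_D[F_i(x, D)]\right) \times \left(F_j(x, D) - E_D[F_j(x, D)]\right)\right] \tag{16.9}
\end{align}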

where the first term is the averaged variance of the individual functions and the second is the averaged covariance among the different functions in the ensemble. Corresponding to the bias–variance trade-off for a single function, there is a bias–variance–covariance trade-off in an ensemble. If all the individual functions $F_i$ were strongly positively correlated, the variance of the ensemble would be reduced very little. If the individual functions $F_i$ were uncorrelated, the averaged covariance among them would be zero, while the variance term would decay at a rate of 1/M. Both experimental and theoretical results have suggested that when individual functions in an ensemble are unbiased, negatively correlated functions lead to the most effective ensemble [2]. The ensemble is less effective when its individual functions are uncorrelated, and there might hardly be any gain from combining positively correlated functions in an ensemble.
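To make the two terms of Eq. (16.9) concrete, the sketch below evaluates them for a hypothetical covariance matrix in which every function has the same variance and every pair has the same correlation. The equicorrelated structure, the helper names, and the numbers are assumptions for illustration only.

```python
def ensemble_variance_terms(cov):
    """Return the variance and covariance terms of Eq. (16.9).

    cov[i][j] is the covariance between F_i and F_j over training sets.
    """
    m = len(cov)
    var_term = sum(cov[i][i] for i in range(m)) / m ** 2
    cov_term = sum(cov[i][j] for i in range(m) for j in range(m) if i != j) / m ** 2
    return var_term, cov_term

def equicorrelated(m, sigma_sq, rho):
    """Hypothetical covariance matrix: common variance sigma_sq, pairwise correlation rho."""
    return [[sigma_sq if i == j else rho * sigma_sq for j in range(m)] for i in range(m)]

M, SIGMA_SQ = 8, 1.0
# rho must stay above -1/(M - 1) for the matrix to be a valid covariance matrix.
for rho in (0.9, 0.0, -0.1):
    v, c = ensemble_variance_terms(equicorrelated(M, SIGMA_SQ, rho))
    print(f"rho={rho:+.1f}  variance term={v:.4f}  covariance term={c:.4f}  total={v + c:.4f}")
```

Under these assumptions the variance term is the same in all three cases, and only the covariance term changes: it dominates for strongly positive correlation, vanishes for uncorrelated functions, and becomes negative for negatively correlated ones.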

16.3. Three Types of Ensemble Learning Methods

Based on how interactions are performed in learning, three types of ensemble learning methods are discussed for the learning models implemented by neural networks. It should be pointed out that these ensemble learning methods are also applicable to other learning models.

16.3.1. Independent Ensemble Learning

It is clear that there is no advantage in combining a set of identical neural networks in an ensemble learning system. The individual neural networks have to be different, and cooperative as well. Data sampling is a common approach to training individual neural networks to be different, in which cross-validation [3, 4] and bootstrapping [5] have been widely used. In its original form, cross-validation is a method for estimating prediction error [6]. It can also be used for sampling data for training a set of networks by splitting the data into m roughly equal-sized sets and training each network on a different set independently. As indicated by Meir [7], for a small dataset with noise, such data splitting might reduce the correlation among the m trained neural networks more drastically than training them on the whole data. If a large number of individual neural networks are to be trained, splitting the given data without overlap would make each set too small to train a neural network. Therefore, data re-sampling such as bootstrap [5] is helpful if the available data samples are limited. Bootstrap was introduced for estimating the standard error of a statistic [5]. Breiman [8] applied the idea of bootstrap to bagging predictors. In bagging predictors, a training set containing N patterns is perturbed by sampling with replacement N times from the original dataset. The perturbed dataset may contain repeats. This procedure can be repeated to generate a number of different, overlapping datasets.
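The two sampling schemes described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and a list of integers stands in for the (x, y) training patterns.

```python
import random

def bootstrap_sets(data, n_sets, seed=0):
    """Draw n_sets bootstrap replicates, each with len(data) patterns sampled with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(n_sets)]

def disjoint_splits(data, m, seed=0):
    """Shuffle the data and split it into m roughly equal-sized, non-overlapping sets."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::m] for i in range(m)]

patterns = list(range(10))               # stand-in for (x, y) training patterns
bags = bootstrap_sets(patterns, n_sets=3)
folds = disjoint_splits(patterns, m=3)
print(bags[0])    # same size as the original set, may contain repeats
print(folds)      # every pattern appears in exactly one fold
```

Each bootstrap replicate keeps the original size N, so the individual networks are not starved of data, whereas the disjoint splits shrink as m grows.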

16.3.2. Sequential Ensemble Learning

It is uncertain how different the trained neural networks will be after independent ensemble learning, since there is no interaction in their learning. Some neural networks produced by independent ensemble learning might still be positively correlated and contribute little to the combined ensemble system. Sequential ensemble learning trains a set of neural networks in a particular order by re-sampling the data based on what the earlier trained neural networks have learned. Boosting is one example of sequential ensemble learning. The boosting algorithm was originally proposed by Schapire [9]. Schapire proved that it is theoretically possible to convert a weak learning algorithm that performs only slightly better than random guessing into one that achieves arbitrary accuracy. The proof presented by Schapire [9] is constructive. The construction uses filtering to modify the distribution of examples in such a way as to force the weak learning algorithm to focus on the harder-to-learn parts of the distribution. Boosting trains a set of learning machines sequentially on the datasets that have been filtered by the previously trained learning machines [9]. The original boosting procedure is as follows [10]. The first machine is trained with N1 patterns randomly

chosen from the available training data. After the first machine has been trained, a second training set with N1 patterns is randomly selected such that the first machine has a 50% error rate on it. That is, 50% of the patterns in the second training set are misclassified by the first machine. Once the second machine is trained on the second training set, the third set of training patterns is filtered through the first and second machines. A pattern is added to the third training set if the first two machines disagree on it. This data filtering process continues until a total of N1 patterns have been selected. The third machine is then trained on the third training set. During testing on the ensemble with these three trained machines, each new pattern is classified using the following voting scheme. If the first two machines agree on the new pattern, their answer is taken as the output for the new pattern. Otherwise, the new pattern is labeled by the third machine.
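The filtering step for the third machine and the final voting scheme can be sketched as below. The sketch assumes, as in the boosting-by-filtering procedure, that the third training set collects the patterns on which the first two machines disagree. The three threshold classifiers and the data stream are hypothetical stand-ins for trained machines and training patterns.

```python
def third_training_set(candidates, machine1, machine2, n1):
    """Collect up to n1 patterns on which the first two machines disagree."""
    selected = []
    for x, y in candidates:
        if machine1(x) != machine2(x):
            selected.append((x, y))
            if len(selected) == n1:
                break
    return selected

def boosted_vote(x, machine1, machine2, machine3):
    """If the first two machines agree, use their answer; otherwise defer to the third."""
    a, b = machine1(x), machine2(x)
    return a if a == b else machine3(x)

# Hypothetical threshold classifiers on a scalar input (not from the chapter).
m1 = lambda x: int(x > 0.4)
m2 = lambda x: int(x > 0.6)
m3 = lambda x: int(x > 0.5)

# Hypothetical stream of (input, label) patterns with true threshold 0.5.
stream = [(i / 10, int(i / 10 > 0.5)) for i in range(11)]
third_set = third_training_set(stream, m1, m2, n1=2)
votes = [boosted_vote(x, m1, m2, m3) for x, _ in stream]
print(third_set)
print(votes)
```

In this toy setting the first two machines only disagree near the decision boundary, so the third machine is consulted exactly where the problem is hardest.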

16.3.3. Simultaneous Ensemble Learning

Both independent ensemble learning and sequential ensemble learning follow a two-stage design process of training individual neural networks first and combining them thereafter. The direct interactions among the individual neural networks cannot be exploited until the integration stage. Feedback from the integration might not be fully used in training the existing individual neural networks in the ensemble or any network newly added to the ensemble. In order to inject the feedback from the integration into the learning of each neural network, simultaneous ensemble learning trains a set of neural networks interactively. Negative correlation learning [11–13] and the mixtures-of-experts architectures [14, 15] are two examples of simultaneous ensemble learning.

The mixtures-of-experts architecture trains multiple networks through competitive and associative learning [14, 15]. It applies the principle of divide and conquer, solving a complex problem by decomposing it into a set of simpler subproblems. It is assumed that the data can be adequately summarized by a collection of functions, each of which is defined over a local region of the input space. The mixtures-of-experts architecture adaptively partitions the input space into possibly overlapping regions and allocates different networks to summarize the data located in different regions. The architecture consists of two layers of networks: a gating network and a number of expert networks. All expert networks are allowed to look at the input and make their best guess. The gating network uses the normalized exponential transformation to weight the outputs of the expert networks and provide an overall best guess. All the parameter adjustments in the expert networks and the gating network are performed simultaneously. Although the mixtures-of-experts architecture can produce biased individual networks whose

estimates are negatively correlated [16], it did not directly address the issue of the bias–variance–covariance trade-off.
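The way the gating network weights the experts can be illustrated with a short sketch. It covers only the combination step on a single input, not training; the expert outputs and gating activations are hypothetical numbers, and the function names are not from the chapter.

```python
import math

def softmax(scores):
    """Normalized exponential transformation, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_output(expert_outputs, gating_scores):
    """Weight each expert's guess by its gating probability and sum the results."""
    gates = softmax(gating_scores)
    return sum(g * f for g, f in zip(gates, expert_outputs))

experts = [0.2, 0.8, 0.5]     # hypothetical expert outputs for one input
scores = [1.0, 3.0, 0.0]      # hypothetical gating-network activations for the same input
print(softmax(scores))
print(mixture_output(experts, scores))
```

Because the gates are positive and sum to one, the overall guess is always a convex combination of the expert outputs, with the largest weight on the expert the gating network favors for that input.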

16.4. Statistical Analysis of Negative Correlation Learning

16.4.1. Negative Correlation Learning

Given the training set D = {(x(1), y(1)), ..., (x(N), y(N))}, the output of an ensemble with M neural networks can be defined by a simple average among the outputs from all individual neural networks on the nth training pattern x(n):

$$F(n) = \frac{1}{M}\sum_{i=1}^{M} F_i(n) \tag{16.10}$$

where $F_i(n)$ is the output of neural network i on x(n). Negative correlation learning trains all individual neural networks in the ensemble simultaneously and interactively by introducing a correlation penalty term into the error function for each neural network [11–13]. The error function $E_i$ of the ith neural network on the training set is given by

\begin{align}
E_i &= \frac{1}{N}\sum_{n=1}^{N} E_i(n) \notag\\
&= \frac{1}{N}\sum_{n=1}^{N}\left[\frac{1}{2}\left(F_i(n) - y(n)\right)^2 + \lambda p_i(n)\right] \tag{16.11}
\end{align}

where $E_i(n)$ is the error function of neural network i at the presentation of the nth training pattern. On the right side of Eq. (16.11), the first term is the mean-squared error of individual network i, while the second term $p_i$ is a correlation penalty function. The purpose of minimizing $p_i$ is to directly correlate each individual's error negatively with the errors of the rest of the ensemble. The penalty function $p_i$ can be chosen as

$$p_i(n) = \left(F_i(n) - F(n)\right)\sum_{j \neq i}\left(F_j(n) - F(n)\right) \tag{16.12}$$

The parameter λ is used to adjust the strength of the penalty. The partial derivative of $E_i$ with respect to the output of network i on the nth training pattern is

\begin{align}
\frac{\partial E_i(n)}{\partial F_i(n)} &= F_i(n) - y(n) + \lambda\frac{\partial p_i(n)}{\partial F_i(n)} \notag\\
&= F_i(n) - y(n) + \lambda\sum_{j \neq i}\left(F_j(n) - F(n)\right) \notag\\
&= F_i(n) - y(n) + \lambda\sum_{j=1}^{M}\left(F_j(n) - F(n)\right) - \lambda\left(F_i(n) - F(n)\right) \notag\\
&= F_i(n) - y(n) - \lambda\left(F_i(n) - F(n)\right) \notag\\
&= (1 - \lambda)\left(F_i(n) - y(n)\right) + \lambda\left(F(n) - y(n)\right) \tag{16.13}
\end{align}

where the output of the ensemble F(n) is supposed to have a constant value with respect to $F_i(n)$. Backpropagation can be used for weight adjustments in the mode of pattern-by-pattern updating in negative correlation learning. Therefore, weight updating of all the individual networks can be performed simultaneously using Eq. (16.13) after the presentation of each training pattern. One complete presentation of the entire training set during the learning process is called an epoch. From Eqs. (16.11), (16.12), and (16.13), it can be seen that negative correlation learning has the following learning characteristics:

(1) During the training process, all the individual neural networks interact with each other through their penalty terms in the error functions. Not only does each $F_i$ minimize the difference between $F_i(n)$ and y(n), but also the difference between $F_i(n)$ and F(n).

(2) At λ = 0, no correlation penalty is enforced, so all the individual neural networks are trained independently.

(3) At λ = 1, Eq. (16.13) becomes

$$\frac{\partial E_i(n)}{\partial F_i(n)} = F(n) - y(n) \tag{16.14}$$

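The error of the ensemble for the nth training pattern can be represented as

$$E_{ensemble} = \frac{1}{2}\left(\frac{1}{M}\sum_{i=1}^{M} F_i(n) - y(n)\right)^2 \tag{16.15}$$

The partial derivative of $E_{ensemble}$ with respect to $F_i$ on the nth training pattern is

\begin{align}
\frac{\partial E_{ensemble}}{\partial F_i(n)} &= \frac{1}{M}\left(\frac{1}{M}\sum_{i=1}^{M} F_i(n) - y(n)\right) \notag\\
&= \frac{1}{M}\left(F(n) - y(n)\right) \tag{16.16}
\end{align}

Therefore, the following relation exists:

$$\frac{\partial E_i(n)}{\partial F_i(n)} = M \times \frac{\partial E_{ensemble}}{\partial F_i(n)} \tag{16.17}$$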

It shows that minimization of the error function of the ensemble is achieved by minimizing the error functions of the individual networks. From this point of view, negative correlation learning provides a novel way to decompose the learning task of the ensemble into a number of subtasks for different individual networks [17–22].
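The relation between the individual and ensemble gradients can be checked on a single training pattern. The sketch below simply evaluates the closed forms of Eqs. (16.13) and (16.16); the network outputs and target are hypothetical numbers, and the function names are not from the chapter.

```python
def ncl_gradients(outputs, y, lam):
    """Eq. (16.13): dE_i(n)/dF_i(n) for every network i on one pattern."""
    f_bar = sum(outputs) / len(outputs)          # ensemble output F(n), Eq. (16.10)
    return [(1 - lam) * (f_i - y) + lam * (f_bar - y) for f_i in outputs]

def ensemble_gradient(outputs, y):
    """Eq. (16.16): dE_ensemble/dF_i(n), which is the same for every network i."""
    m = len(outputs)
    return (sum(outputs) / m - y) / m

outputs = [0.3, 0.7, 0.1, 0.5]   # hypothetical outputs F_i(n) of M = 4 networks
target = 0.6                     # hypothetical desired output y(n)

independent = ncl_gradients(outputs, target, lam=0.0)   # each network sees only its own error
coupled = ncl_gradients(outputs, target, lam=1.0)       # each network sees the ensemble error
print(independent)
print(coupled, len(outputs) * ensemble_gradient(outputs, target))
```

With λ = 0 the gradients reduce to the individual errors, and with λ = 1 every network receives the same gradient, equal to M times the ensemble gradient, as stated in Eq. (16.17).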

16.4.2. Simulation Setup

The following regression task has been used to estimate the bias of mixture-of-experts architectures and the variance and covariance of experts' weighted outputs [16]:

$$f(x) = \frac{1}{13}\left[10\sin(\pi x_1 x_2) + 20\left(x_3 - \frac{1}{2}\right)^2 + 10x_4 + 5x_5\right] - 1 \tag{16.18}$$

where $x = [x_1, \ldots, x_5]$ is an input vector whose components lie between zero and one. The value of f(x) lies in the interval [−1, 1]. Twenty-five training sets, $(x^{(k)}(l), y^{(k)}(l))$, l = 1, ..., L, L = 500, k = 1, ..., K, K = 25, were created at random. The reason for creating 25 training sets is that the same setting was used to analyze the mixture-of-experts [16]. Each set consisted of 500 input–output patterns in which the components of the input vectors were independently sampled from a uniform distribution over the interval (0,1). In the noise-free condition, the target outputs were not corrupted by noise; in the small noise condition, the target outputs were created by adding noise sampled from a Gaussian distribution with a mean of zero and a variance of $\sigma^2 = 0.1$ to the function f(x); in the large noise condition, the target outputs were created by adding noise sampled from a Gaussian distribution with a mean of zero and a variance of $\sigma^2 = 0.2$ to the function f(x). A testing set of 1024 input–output patterns, (t(n), d(n)), n = 1, ..., N, N = 1024, was also generated. For this set, the components of the input vectors were independently sampled from a uniform distribution over the interval (0,1), and the target outputs were not corrupted by noise in all three conditions.

The ensemble architecture used in negative correlation learning consists of eight 3-layer feedforward neural networks, each of which has one hidden layer with five hidden nodes. Each hidden node is defined by the logistic function:

$$\varphi(y) = \frac{1}{1 + \exp(-y)} \tag{16.19}$$

The output layer is defined as a linear combination of the outputs from the hidden nodes. For each estimation of bias, variance, and covariance of an ensemble, 25 simulations were conducted. In each simulation, the ensemble was trained on a different training set from the same initial weights distributed inside a small range so that different simulations of an ensemble yielded different performances solely due to the use of different training sets. Such simulation setup follows the suggestions from Jacobs [16].
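The data generation and the forward pass of the ensemble described above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the function names, the random seed, the presence of bias terms, the weight range of ±0.1, and the use of a simple average to combine the networks are choices made here and are not specified in this section.

```python
import numpy as np

def target(x):
    """Eq. (16.18); x has shape (n, 5) with components in (0, 1)."""
    x1, x2, x3, x4, x5 = x.T
    inner = 10 * np.sin(np.pi * x1 * x2) + 20 * (x3 - 0.5) ** 2 + 10 * x4 + 5 * x5
    return inner / 13.0 - 1.0

def make_set(rng, n, noise_var=0.0):
    """Sample n uniform inputs; optionally add zero-mean Gaussian noise to targets."""
    x = rng.uniform(0.0, 1.0, size=(n, 5))
    y = target(x)
    if noise_var > 0.0:
        y = y + rng.normal(0.0, np.sqrt(noise_var), size=n)
    return x, y

def logistic(y):
    """Eq. (16.19)."""
    return 1.0 / (1.0 + np.exp(-y))

def ensemble_output(x, W1, b1, w2, b2):
    """Forward pass of M one-hidden-layer networks with linear outputs.

    W1: (M, 5, H), b1: (M, H), w2: (M, H), b2: (M,).
    Returns the individual outputs (M, n) and their simple average (n,).
    Bias terms and simple averaging are assumptions of this sketch.
    """
    hidden = logistic(np.einsum("nd,mdh->mnh", x, W1) + b1[:, None, :])
    F_i = np.einsum("mnh,mh->mn", hidden, w2) + b2[:, None]
    return F_i, F_i.mean(axis=0)

rng = np.random.default_rng(0)           # arbitrary seed
K, L, N, M, H = 25, 500, 1024, 8, 5
train_sets = [make_set(rng, L, noise_var=0.1) for _ in range(K)]  # small noise
test_x, test_d = make_set(rng, N)        # test targets are never corrupted

# Hypothetical "small range" for the shared initial weights.
W1 = rng.uniform(-0.1, 0.1, size=(M, 5, H))
b1 = rng.uniform(-0.1, 0.1, size=(M, H))
w2 = rng.uniform(-0.1, 0.1, size=(M, H))
b2 = rng.uniform(-0.1, 0.1, size=M)
F_i, F = ensemble_output(test_x, W1, b1, w2, b2)
```

Setting `noise_var` to 0.0 or 0.2 gives the noise-free and large noise conditions, respectively. Training is not shown; the same initial weights would be reused for each of the 25 training sets.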


16.4.3. Measurement of Bias, Variance, and Covariance

The equations for measuring bias, variance, and covariance can be derived from Eqs. (16.8) and (16.9). The average outputs of the ensemble and the individual network $i$ on the $n$th pattern in the testing set, $(\mathbf{t}(n), d(n))$, $n = 1, \ldots, N$, are denoted, respectively, by $\bar{F}(\mathbf{t}(n))$ and $\bar{F}_i(\mathbf{t}(n))$, which are given by

$$\bar{F}(\mathbf{t}(n)) = \frac{1}{K}\sum_{k=1}^{K} F^{(k)}(\mathbf{t}(n)) \tag{16.20}$$

and

$$\bar{F}_i(\mathbf{t}(n)) = \frac{1}{K}\sum_{k=1}^{K} F_i^{(k)}(\mathbf{t}(n)) \tag{16.21}$$

where $F^{(k)}(\mathbf{t}(n))$ and $F_i^{(k)}(\mathbf{t}(n))$ are the outputs of the ensemble and the individual network $i$ on the $n$th pattern in the testing set from the $k$th simulation, respectively, and $K = 25$ is the number of simulations. The integrated bias $E_{bias}$, integrated variance $E_{var}$, and integrated covariance $E_{cov}$ of the ensemble are defined by, respectively,

$$E_{bias} = \frac{1}{N}\sum_{n=1}^{N}\left(\bar{F}(\mathbf{t}(n)) - d(n)\right)^2 \tag{16.22}$$

$$E_{var} = \sum_{i=1}^{M}\frac{1}{N}\sum_{n=1}^{N}\frac{1}{K}\sum_{k=1}^{K}\frac{1}{M^2}\left(F_i^{(k)}(\mathbf{t}(n)) - \bar{F}_i(\mathbf{t}(n))\right)^2 \tag{16.23}$$

and

$$E_{cov} = \sum_{i=1}^{M}\sum_{j=1,\, j\neq i}^{M}\frac{1}{N}\sum_{n=1}^{N}\frac{1}{K}\sum_{k=1}^{K}\frac{1}{M^2}\left(F_i^{(k)}(\mathbf{t}(n)) - \bar{F}_i(\mathbf{t}(n))\right)\left(F_j^{(k)}(\mathbf{t}(n)) - \bar{F}_j(\mathbf{t}(n))\right) \tag{16.24}$$

We may also define the integrated mean-squared error (MSE) $E_{mse}$ on the testing set as

$$E_{mse} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{K}\sum_{k=1}^{K}\left(F^{(k)}(\mathbf{t}(n)) - d(n)\right)^2 \tag{16.25}$$

The integrated mean-squared error $E_{train}$ on the training set is given by

$$E_{train} = \frac{1}{L}\sum_{l=1}^{L}\frac{1}{K}\sum_{k=1}^{K}\left(F^{(k)}(\mathbf{x}^{(k)}(l)) - y^{(k)}(l)\right)^2 \tag{16.26}$$

It is clear that the following equality holds:

$$E_{mse} = E_{bias} + E_{var} + E_{cov} \tag{16.27}$$
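The measurements in Eqs. (16.20) to (16.25) can be computed directly from an array of individual network outputs. The Python sketch below assumes the ensemble output is the simple average of its M networks and that the outputs are stored in an array of shape (K, M, N); the function name `decompose` and the synthetic data in the usage lines are illustrative and do not come from the study.

```python
import numpy as np

def decompose(F_ind, d):
    """Return (E_bias, E_var, E_cov, E_mse) for a simple-average ensemble.

    F_ind: array (K, M, N) holding F_i^(k)(t(n)) for simulation k, network i,
           test pattern n.
    d:     array (N,) of test targets d(n).
    """
    K, M, N = F_ind.shape
    F_ens = F_ind.mean(axis=1)        # F^(k)(t(n)), shape (K, N)
    F_bar = F_ens.mean(axis=0)        # Eq. (16.20), shape (N,)
    F_bar_i = F_ind.mean(axis=0)      # Eq. (16.21), shape (M, N)
    dev = F_ind - F_bar_i             # deviations from the per-network mean

    e_bias = np.mean((F_bar - d) ** 2)                       # Eq. (16.22)
    e_var = np.sum(np.mean(dev ** 2, axis=(0, 2))) / M ** 2  # Eq. (16.23)
    # Sum over all (i, j) pairs, then remove the i == j terms.
    all_pairs = np.mean(dev.sum(axis=1) ** 2) / M ** 2
    e_cov = all_pairs - e_var                                # Eq. (16.24)
    e_mse = np.mean((F_ens - d) ** 2)                        # Eq. (16.25)
    return e_bias, e_var, e_cov, e_mse

# Synthetic outputs, used only to exercise the function.
rng = np.random.default_rng(0)
F_ind = rng.normal(size=(25, 8, 1024))
d = rng.normal(size=1024)
e_bias, e_var, e_cov, e_mse = decompose(F_ind, d)
assert np.isclose(e_mse, e_bias + e_var + e_cov)             # Eq. (16.27)
```

The final assertion holds for any input because the cross term between the bias and the deviation of the ensemble output from its mean averages to zero over the K simulations.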


16.4.4. Measuring Bias, Variance, and Covariance for Different Strength Parameters

This section investigates the dependence of $E_{bias}$, $E_{var}$, and $E_{cov}$ on the strength parameter λ in Eq. (16.11) for negative correlation learning.

16.4.4.1. Results in the noise-free condition

The results of negative correlation learning in the noise-free condition for the different values of λ at epoch 2000 are given in Table 16.1. The results suggest that both $E_{bias}$ and $E_{mse}$ decreased with increasing λ, while $E_{var}$ increased and $E_{cov}$ decreased as λ increased. It is interesting that negative correlation learning controls not only $E_{var}$ and $E_{cov}$, but also the $E_{bias}$ of the ensemble. Compared with independent training (i.e., λ = 0.0 in negative correlation learning), negative correlation learning created larger variance, but the sum of the variance and covariance was smaller because of the negative covariance. At the same time, negative correlation learning reduced the bias of the ensemble by more than an order of magnitude (from 1.31 × 10^−3 at λ = 0 to 9.11 × 10^−5 at λ = 1).

Table 16.1: The results of negative correlation learning in the noise-free condition for different λ values at epoch 2000

λ      E_bias         E_var          E_cov           E_mse          E_train
0.0    1.31 × 10^−3   1.51 × 10^−4   1.40 × 10^−4    1.60 × 10^−3   1.25 × 10^−3
0.25   1.07 × 10^−3   1.78 × 10^−4   7.96 × 10^−5    1.33 × 10^−3   1.02 × 10^−3
0.50   8.67 × 10^−4   2.50 × 10^−4   −2.60 × 10^−5   1.09 × 10^−3   8.14 × 10^−4
0.75   5.21 × 10^−4   4.04 × 10^−4   −2.51 × 10^−4   6.74 × 10^−4   4.88 × 10^−4
1.0    9.11 × 10^−5   0.274853       −0.274746       1.98 × 10^−4   1.14 × 10^−4

In order to find out why $E_{bias}$ decreased with increasing λ, the concept of the capability of a trained ensemble is introduced. The capability of a trained ensemble is measured by its ability to produce the correct input–output mapping on the training set used, specifically by its integrated mean-squared error $E_{train}$ on the training set. The smaller $E_{train}$ is, the larger the capability of the trained ensemble. Table 16.1 shows the error $E_{train}$ for different λ values. A smaller $E_{train}$ was obtained with a larger value of λ, which suggests that the strength parameter λ plays a role in deciding the capability of the ensemble: increasing λ increased the capability of the ensemble. When λ = 0, the capability of the trained ensemble was not large enough to produce the correct input–output mapping, so the ensemble had a relatively large $E_{bias}$. When λ = 1, the trained ensemble had the smallest $E_{train}$ and the highest capability of producing the correct input–output mapping, so it had the smallest $E_{bias}$.

From Eq. (16.27), the integrated MSE is the sum of three terms: the integrated bias, variance, and covariance. Therefore, negative correlation learning provides control of bias, variance, and covariance through the choice of λ to achieve good performance. For this regression task in the noise-free condition and with the ensemble architecture used, the bias–variance–covariance trade-off was optimal for λ = 1.0 in the sense of minimizing the MSE.

Figures 16.1–16.5 show the learning curves of $E_{bias}$, $E_{var}$, $E_{cov}$, $E_{mse}$, and $E_{train}$ for negative correlation learning (λ = 1.0) and independent training (i.e., λ = 0 in negative correlation learning). For negative correlation learning, $E_{bias}$, $E_{mse}$, and $E_{train}$ decreased very quickly at the early training stage and continued to decrease steadily; $E_{var}$ increased quickly at the early training stage and then decreased slightly, while $E_{cov}$ decreased quickly at the early training stage and continued to decrease steadily. It is important to note that the integrated covariance became negative during training in negative correlation learning. In general, negative correlation learning produced faster convergence in terms of minimizing the MSE.

Figure 16.1: Comparison between negative correlation learning (λ = 1.0) and independent training (i.e., λ = 0 in negative correlation learning) on $E_{bias}$. The vertical axis is the value of $E_{bias}$ and the horizontal axis is the number of training epochs.


Figure 16.2: Comparison between negative correlation learning (λ = 1.0) and independent training (i.e., λ = 0 in negative correlation learning) on $E_{var}$. The vertical axis is the value of $E_{var}$ and the horizontal axis is the number of training epochs.

Figure 16.3: Comparison between negative correlation learning (λ = 1.0) and independent training (i.e., λ = 0 in negative correlation learning) on $E_{cov}$. The vertical axis is the value of $E_{cov}$ and the horizontal axis is the number of training epochs.


Figure 16.4: Comparison between negative correlation learning (λ = 1.0) and independent training (i.e., λ = 0 in negative correlation learning) on $E_{mse}$. The vertical axis is the value of $E_{mse}$ and the horizontal axis is the number of training epochs.

Figure 16.5: Comparison between negative correlation learning (λ = 1.0) and independent training (i.e., λ = 0 in negative correlation learning) on $E_{train}$. The vertical axis is the value of $E_{train}$ and the horizontal axis is the number of training epochs.


16.4.4.2. Results in the noise conditions

Tables 16.2 and 16.3 compare the performance of negative correlation learning for different strength parameters in the small noise (variance $\sigma^2 = 0.1$) and large noise (variance $\sigma^2 = 0.2$) conditions. The results show the same trends for $E_{bias}$, $E_{var}$, and $E_{cov}$ in the noise-free and noise conditions. That is, $E_{bias}$ decreased with increasing λ, $E_{var}$ increased as λ increased, and $E_{cov}$ decreased as λ increased. However, $E_{mse}$ first decreased and then increased with increasing λ.

In order to find out why $E_{mse}$ showed different trends in the noise-free and noise conditions, the integrated mean-squared error $E_{train}$ on the training set is also shown in Tables 16.2 and 16.3. When λ = 0, the trained ensemble had a relatively large $E_{train}$. This indicated that the capability of the trained ensemble was not large enough to produce the correct input–output mapping (i.e., it was underfitting) for this regression task. When λ = 1, the ensemble learned too many specific input–output relations (i.e., it was overfitting); it might memorize the training data and therefore be less able to generalize between similar input–output patterns. Although overfitting was not observed for the ensemble used in the noise-free condition, too large a capability of the ensemble would lead to overfitting in both noise-free and noise conditions because of the ill-posedness of any finite training set. Choosing a proper value of λ is important, and also problem dependent. For the noise conditions used for this regression task and the ensemble architecture used, the bias–variance–covariance trade-off was optimal for λ = 0.5 among the tested values of λ in the sense of minimizing the MSE on the testing set.

Table 16.2: The results of negative correlation learning in the small noise condition for different λ values at epoch 2000

λ      E_bias   E_var    E_cov     E_var + E_cov   E_mse    E_train
0.0    0.0046   0.0018   0.0074    0.0092          0.0138   0.0962
0.25   0.0041   0.0020   0.0068    0.0088          0.0129   0.0940
0.5    0.0038   0.0027   0.0059    0.0086          0.0124   0.0915
0.75   0.0029   0.0046   0.0051    0.0097          0.0126   0.0873
1.0    0.0023   0.2107   −0.1840   0.0267          0.0290   0.0778

Table 16.3: The results of negative correlation learning in the large noise condition for different λ values at epoch 2000

λ      E_bias   E_var    E_cov     E_var + E_cov   E_mse    E_train
0.0    0.0083   0.0031   0.0135    0.0166          0.0249   0.1895
0.25   0.0074   0.0035   0.0126    0.0161          0.0235   0.1863
0.5    0.0061   0.0048   0.0120    0.0168          0.0229   0.1813
0.75   0.0051   0.0087   0.0109    0.0196          0.0247   0.1721
1.0    0.0041   0.3270   −0.2679   0.0591          0.0632   0.1512

16.4.4.3. Comparisons with other work

Table 16.4 compares the results of negative correlation learning with λ = 0.5 and λ = 1 to those produced by the mixture-of-experts architecture [16]. In the small noise case, the integrated MSE of the mixture-of-experts architecture was about 0.018, while the integrated MSE of negative correlation learning with λ = 0.5 was 0.012. In the large noise case, the integrated MSE achieved by the mixture-of-experts architecture was about 0.038, compared with 0.023 for negative correlation learning with λ = 0.5. Although both the mixture-of-experts architecture and negative correlation learning tend to create negatively correlated networks, negative correlation learning can achieve good performance by controlling bias, variance, and covariance through the choice of λ.

Table 16.4: Comparison between negative correlation learning (NCL) and the mixture-of-experts (ME) architecture

Noise condition   Method             E_bias   E_var   E_cov    E_mse
Small             NCL with λ = 0.5   0.004    0.003   0.006    0.012
Small             NCL with λ = 1     0.002    0.211   −0.184   0.029
Small             ME                 0.008    0.030   −0.020   0.018
Large             NCL with λ = 0.5   0.006    0.005   0.012    0.023
Large             NCL with λ = 1     0.004    0.327   −0.268   0.063
Large             ME                 0.013    0.065   −0.040   0.038
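The choice of λ reported for the noise conditions can be re-derived from the tabulated values. The Python sketch below copies the rows of Tables 16.2 and 16.3, checks Eq. (16.27) up to rounding, and returns the λ with the smallest integrated test MSE. The function name `check_and_select` and the tolerance are illustrative choices, not part of the original study.

```python
# Rows copied from Tables 16.2 and 16.3: (lambda, E_bias, E_var, E_cov, E_mse)
small_noise = [
    (0.0, 0.0046, 0.0018, 0.0074, 0.0138),
    (0.25, 0.0041, 0.0020, 0.0068, 0.0129),
    (0.5, 0.0038, 0.0027, 0.0059, 0.0124),
    (0.75, 0.0029, 0.0046, 0.0051, 0.0126),
    (1.0, 0.0023, 0.2107, -0.1840, 0.0290),
]
large_noise = [
    (0.0, 0.0083, 0.0031, 0.0135, 0.0249),
    (0.25, 0.0074, 0.0035, 0.0126, 0.0235),
    (0.5, 0.0061, 0.0048, 0.0120, 0.0229),
    (0.75, 0.0051, 0.0087, 0.0109, 0.0247),
    (1.0, 0.0041, 0.3270, -0.2679, 0.0632),
]

def check_and_select(rows, tol=2e-4):
    """Verify E_mse = E_bias + E_var + E_cov per row; return the best lambda."""
    for lam, bias, var, cov, mse in rows:
        # The tolerance allows for rounding of the published values.
        assert abs(bias + var + cov - mse) <= tol, lam
    return min(rows, key=lambda row: row[4])[0]

best_small = check_and_select(small_noise)
best_large = check_and_select(large_noise)
```

Both calls select λ = 0.5, in agreement with the discussion of the noise conditions. At λ = 1 the variance and covariance terms are two orders of magnitude larger than the others and nearly cancel, which is why their sum, rather than either term alone, is the informative quantity.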

16.4.5. Correlations among the Individual Networks

This section compares the correlations among the individual networks and the biases of the individual networks created by negative correlation learning and independent training, respectively. The correlation between network $i$ and network $j$ is given by

$$Cor(i, j) = \frac{\sum_{n=1}^{N}\sum_{k=1}^{K}\left(F_i^{(k)}(\mathbf{t}(n)) - \bar{F}_i(\mathbf{t}(n))\right)\left(F_j^{(k)}(\mathbf{t}(n)) - \bar{F}_j(\mathbf{t}(n))\right)}{\sqrt{\sum_{n=1}^{N}\sum_{k=1}^{K}\left(F_i^{(k)}(\mathbf{t}(n)) - \bar{F}_i(\mathbf{t}(n))\right)^2 \sum_{n=1}^{N}\sum_{k=1}^{K}\left(F_j^{(k)}(\mathbf{t}(n)) - \bar{F}_j(\mathbf{t}(n))\right)^2}} \tag{16.28}$$

and the bias of network $i$ is given by

$$Bias(i) = \frac{1}{N}\frac{1}{K}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(F_i^{(k)}(\mathbf{t}(n)) - d(n)\right)^2 \tag{16.29}$$

where $F_i^{(k)}(\mathbf{t}(n))$ is the output of network $i$ on the $n$th pattern in the testing set, $(\mathbf{t}(n), d(n))$, $n = 1, \ldots, N$, from the $k$th simulation, and $\bar{F}_i(\mathbf{t}(n))$ represents the average output of network $i$ on the $n$th pattern in the testing set.

16.4.5.1. Results in the noise-free condition

In order to observe the effect of the correlation penalty terms, Table 16.5 shows the correlations among the individual networks trained by independent training (i.e., λ = 0 in negative correlation learning) and negative correlation learning with λ = 0.5 and λ = 1 in the noise-free condition. There are $\binom{8}{2} = 28$ correlations among different pairs of networks. For negative correlation learning with λ = 0.5, 18 of these correlations had negative values. For negative correlation learning with λ = 1, the number of negative correlations increased to 20. The results suggest that negative correlation learning leads to negatively correlated networks. In contrast, for independent training without the correlation penalty terms, 27 correlations were positive and only one was negative. Because every individual network learns the same task in independent training, the correlations among them are generally positive. In negative correlation learning, each individual network learns different parts or aspects of the training data so that the problem of correlated errors can be removed or alleviated.

Table 16.5: The correlations among the individual networks trained by negative correlation learning with different λ values in the noise-free condition

λ = 0
Cor(1, 2) = 0.042     Cor(1, 3) = −0.031    Cor(1, 4) = 0.129     Cor(1, 5) = 0.0289
Cor(1, 6) = 0.251     Cor(1, 7) = 0.226     Cor(1, 8) = 0.213     Cor(2, 3) = 0.053
Cor(2, 4) = 0.078     Cor(2, 5) = 0.043     Cor(2, 6) = 0.031     Cor(2, 7) = 0.046
Cor(2, 8) = 0.165     Cor(3, 4) = 0.233     Cor(3, 5) = 0.208     Cor(3, 6) = 0.151
Cor(3, 7) = 0.065     Cor(3, 8) = 0.155     Cor(4, 5) = 0.087     Cor(4, 6) = 0.203
Cor(4, 7) = 0.100     Cor(4, 8) = 0.104     Cor(5, 6) = 0.086     Cor(5, 7) = 0.199
Cor(5, 8) = 0.222     Cor(6, 7) = 0.326     Cor(6, 8) = 0.126     Cor(7, 8) = 0.253

λ = 0.5
Cor(1, 2) = −0.023    Cor(1, 3) = −0.070    Cor(1, 4) = 0.114     Cor(1, 5) = −0.049
Cor(1, 6) = −0.002    Cor(1, 7) = −0.129    Cor(1, 8) = −0.078    Cor(2, 3) = 0.006
Cor(2, 4) = −0.0004   Cor(2, 5) = −0.138    Cor(2, 6) = −0.071    Cor(2, 7) = −0.138
Cor(2, 8) = 0.042     Cor(3, 4) = −0.021    Cor(3, 5) = −0.062    Cor(3, 6) = −0.047
Cor(3, 7) = −0.006    Cor(3, 8) = 0.0200    Cor(4, 5) = 0.052     Cor(4, 6) = −0.081
Cor(4, 7) = 0.071     Cor(4, 8) = −0.140    Cor(5, 6) = 0.085     Cor(5, 7) = 0.157
Cor(5, 8) = −0.091    Cor(6, 7) = 0.188     Cor(6, 8) = 0.018     Cor(7, 8) = −0.019

λ = 1
Cor(1, 2) = −0.368    Cor(1, 3) = 0.221     Cor(1, 4) = −0.177    Cor(1, 5) = 0.038
Cor(1, 6) = −0.383    Cor(1, 7) = 0.096     Cor(1, 8) = −0.288    Cor(2, 3) = −0.131
Cor(2, 4) = −0.244    Cor(2, 5) = −0.161    Cor(2, 6) = −0.087    Cor(2, 7) = −0.364
Cor(2, 8) = −0.272    Cor(3, 4) = −0.203    Cor(3, 5) = −0.277    Cor(3, 6) = −0.335
Cor(3, 7) = −0.248    Cor(3, 8) = −0.158    Cor(4, 5) = 0.068     Cor(4, 6) = −0.159
Cor(4, 7) = −0.152    Cor(4, 8) = 0.053     Cor(5, 6) = −0.056    Cor(5, 7) = −0.039
Cor(5, 8) = 0.122     Cor(6, 7) = −0.026    Cor(6, 8) = 0.078     Cor(7, 8) = 0.185

Table 16.6 shows the biases of the individual networks trained by independent training (i.e., λ = 0.0 in negative correlation learning) and negative correlation learning with λ = 0.5 and λ = 1. In independent training, all the individual networks are trained to implement the whole learning task. In contrast, in negative correlation learning, different individual networks are trained to implement subtasks of the whole learning task, so the individual networks created by negative correlation learning had larger bias values than those created by independent training. However, as shown in Eq. (16.17), the ensemble in negative correlation learning is trained to implement the whole learning task. This is the reason why the ensemble formed by combining these individual networks had a smaller integrated bias, as shown in Tables 16.1, 16.2, and 16.3.

Table 16.6: The biases of the individual networks trained by negative correlation learning with different λ values in the noise-free condition

λ = 0
Bias(1) = 0.0024    Bias(2) = 0.0041    Bias(3) = 0.0032    Bias(4) = 0.0022
Bias(5) = 0.0030    Bias(6) = 0.0025    Bias(7) = 0.0027    Bias(8) = 0.0020

λ = 0.5
Bias(1) = 0.0031    Bias(2) = 0.0045    Bias(3) = 0.0054    Bias(4) = 0.0020
Bias(5) = 0.0040    Bias(6) = 0.0025    Bias(7) = 0.0031    Bias(8) = 0.0024

λ = 1
Bias(1) = 8.0774    Bias(2) = 12.1147   Bias(3) = 3.3489    Bias(4) = 5.5513
Bias(5) = 1.0458    Bias(6) = 8.7573    Bias(7) = 2.3400    Bias(8) = 2.6317

16.4.5.2. Results in the noise conditions

Tables 16.7 and 16.9 show the correlations among the individual networks trained by independent training (i.e., λ = 0 in negative correlation learning) and negative correlation learning with λ = 0.5 and λ = 1 in the small and large noise conditions. Tables 16.8 and 16.10 show the biases of these networks. The values of the correlations among the individual networks had relatively larger positive values when λ was 0. They reduced to relatively smaller positive values when λ was increased to 0.5. Most of them became negative values when λ was further increased to 1. Overall, the results indicated that the neural networks trained by negative correlation learning tended to be negatively correlated neural networks in both the noise-free and noise conditions. On the other hand, the biases of the individual networks trained by negative correlation learning were larger than those of the individual networks trained by independent training. Nevertheless, the ensembles created by negative correlation learning had smaller integrated biases.
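The pairwise correlations and individual biases reported in these tables follow Eqs. (16.28) and (16.29). A minimal Python sketch is given below, assuming the network outputs are stored in an array of shape (K, M, N), network indices are 0-based, and the denominator of Eq. (16.28) carries the usual square root so that the correlation lies in [−1, 1]. The function names are illustrative.

```python
import numpy as np

def correlation(F_ind, i, j):
    """Eq. (16.28) for networks i and j (0-based); F_ind has shape (K, M, N)."""
    dev = F_ind - F_ind.mean(axis=0)      # subtract the average over simulations
    num = np.sum(dev[:, i, :] * dev[:, j, :])
    den = np.sqrt(np.sum(dev[:, i, :] ** 2) * np.sum(dev[:, j, :] ** 2))
    return num / den

def network_bias(F_ind, d, i):
    """Eq. (16.29): mean squared deviation of network i from the targets d."""
    return np.mean((F_ind[:, i, :] - d) ** 2)

# Synthetic outputs in which network 1 mirrors network 0 (illustration only).
rng = np.random.default_rng(0)
F_ind = rng.normal(size=(25, 8, 1024))
F_ind[:, 1, :] = -F_ind[:, 0, :]
d = np.zeros(1024)
c01 = correlation(F_ind, 0, 1)            # perfectly anti-correlated pair
pairs = [(i, j) for i in range(8) for j in range(i + 1, 8)]
n_negative = sum(correlation(F_ind, i, j) < 0 for i, j in pairs)
```

Iterating over the 28 pairs, as in the last two lines, is how the count of negative correlations for a given λ would be obtained from real simulation outputs.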


Table 16.7: The correlations among the individual networks trained by negative correlation learning with different λ values in the small noise condition

λ = 0
Cor(1, 2) = 0.685     Cor(1, 3) = 0.593     Cor(1, 4) = 0.703     Cor(1, 5) = 0.655
Cor(1, 6) = 0.451     Cor(1, 7) = 0.705     Cor(1, 8) = 0.658     Cor(2, 3) = 0.579
Cor(2, 4) = 0.568     Cor(2, 5) = 0.645     Cor(2, 6) = 0.462     Cor(2, 7) = 0.694
Cor(2, 8) = 0.737     Cor(3, 4) = 0.708     Cor(3, 5) = 0.560     Cor(3, 6) = 0.498
Cor(3, 7) = 0.551     Cor(3, 8) = 0.518     Cor(4, 5) = 0.601     Cor(4, 6) = 0.558
Cor(4, 7) = 0.599     Cor(4, 8) = 0.571     Cor(5, 6) = 0.425     Cor(5, 7) = 0.731
Cor(5, 8) = 0.610     Cor(6, 7) = 0.443     Cor(6, 8) = 0.396     Cor(7, 8) = 0.737

λ = 0.5
Cor(1, 2) = 0.385     Cor(1, 3) = 0.356     Cor(1, 4) = 0.280     Cor(1, 5) = 0.351
Cor(1, 6) = 0.325     Cor(1, 7) = 0.402     Cor(1, 8) = 0.341     Cor(2, 3) = 0.302
Cor(2, 4) = 0.289     Cor(2, 5) = 0.355     Cor(2, 6) = 0.188     Cor(2, 7) = 0.304
Cor(2, 8) = 0.392     Cor(3, 4) = 0.404     Cor(3, 5) = 0.294     Cor(3, 6) = 0.255
Cor(3, 7) = 0.273     Cor(3, 8) = 0.276     Cor(4, 5) = 0.249     Cor(4, 6) = 0.310
Cor(4, 7) = 0.287     Cor(4, 8) = 0.263     Cor(5, 6) = 0.276     Cor(5, 7) = 0.351
Cor(5, 8) = 0.327     Cor(6, 7) = 0.302     Cor(6, 8) = 0.185     Cor(7, 8) = 0.314

λ = 1
Cor(1, 2) = −0.141    Cor(1, 3) = −0.136    Cor(1, 4) = −0.177    Cor(1, 5) = −0.172
Cor(1, 6) = −0.126    Cor(1, 7) = −0.141    Cor(1, 8) = −0.035    Cor(2, 3) = −0.099
Cor(2, 4) = −0.162    Cor(2, 5) = −0.148    Cor(2, 6) = 0.047     Cor(2, 7) = −0.126
Cor(2, 8) = 0.038     Cor(3, 4) = −0.227    Cor(3, 5) = −0.128    Cor(3, 6) = −0.037
Cor(3, 7) = −0.067    Cor(3, 8) = −0.087    Cor(4, 5) = −0.176    Cor(4, 6) = −0.241
Cor(4, 7) = −0.081    Cor(4, 8) = −0.037    Cor(5, 6) = −0.031    Cor(5, 7) = −0.084
Cor(5, 8) = −0.274    Cor(6, 7) = −0.052    Cor(6, 8) = 0.004     Cor(7, 8) = −0.007

Table 16.8: The biases of the individual networks trained by negative correlation learning with different λ values in the small noise condition

λ = 0
Bias(1) = 0.0177    Bias(2) = 0.0170    Bias(3) = 0.0212    Bias(4) = 0.0193
Bias(5) = 0.0185    Bias(6) = 0.0228    Bias(7) = 0.0183    Bias(8) = 0.0178

λ = 0.5
Bias(1) = 0.0249    Bias(2) = 0.0244    Bias(3) = 0.0287    Bias(4) = 0.0284
Bias(5) = 0.0250    Bias(6) = 0.0274    Bias(7) = 0.0260    Bias(8) = 0.0264

λ = 1
Bias(1) = 2.7812    Bias(2) = 4.3145    Bias(3) = 2.0255    Bias(4) = 3.7067
Bias(5) = 2.0343    Bias(6) = 1.2497    Bias(7) = 2.4406    Bias(8) = 1.2872

16.4.6. Measuring Bias, Variance, and Covariance for Different Ensemble Sizes

This section investigates the dependence of $E_{bias}$, $E_{var}$, and $E_{cov}$ on the ensemble size, i.e., the number of individual networks in the ensemble. The ensemble architectures used in negative correlation learning are composed of 4, 8, 16, and 32 individual networks, respectively. All individual networks have five hidden nodes.


Table 16.9: The correlations among the individual networks trained by negative correlation learning with different λ values in the large noise condition

λ = 0
Cor(1, 2) = 0.654     Cor(1, 3) = 0.615     Cor(1, 4) = 0.657     Cor(1, 5) = 0.735
Cor(1, 6) = 0.512     Cor(1, 7) = 0.656     Cor(1, 8) = 0.584     Cor(2, 3) = 0.582
Cor(2, 4) = 0.540     Cor(2, 5) = 0.647     Cor(2, 6) = 0.477     Cor(2, 7) = 0.706
Cor(2, 8) = 0.665     Cor(3, 4) = 0.714     Cor(3, 5) = 0.667     Cor(3, 6) = 0.492
Cor(3, 7) = 0.639     Cor(3, 8) = 0.605     Cor(4, 5) = 0.626     Cor(4, 6) = 0.520
Cor(4, 7) = 0.565     Cor(4, 8) = 0.577     Cor(5, 6) = 0.528     Cor(5, 7) = 0.737
Cor(5, 8) = 0.622     Cor(6, 7) = 0.574     Cor(6, 8) = 0.446     Cor(7, 8) = 0.684

λ = 0.5
Cor(1, 2) = 0.320     Cor(1, 3) = 0.376     Cor(1, 4) = 0.392     Cor(1, 5) = 0.342
Cor(1, 6) = 0.239     Cor(1, 7) = 0.388     Cor(1, 8) = 0.329     Cor(2, 3) = 0.375
Cor(2, 4) = 0.316     Cor(2, 5) = 0.327     Cor(2, 6) = 0.229     Cor(2, 7) = 0.244
Cor(2, 8) = 0.361     Cor(3, 4) = 0.431     Cor(3, 5) = 0.397     Cor(3, 6) = 0.290
Cor(3, 7) = 0.311     Cor(3, 8) = 0.390     Cor(4, 5) = 0.288     Cor(4, 6) = 0.370
Cor(4, 7) = 0.318     Cor(4, 8) = 0.297     Cor(5, 6) = 0.349     Cor(5, 7) = 0.346
Cor(5, 8) = 0.351     Cor(6, 7) = 0.373     Cor(6, 8) = 0.240     Cor(7, 8) = 0.333

λ = 1
Cor(1, 2) = −0.128    Cor(1, 3) = −0.197    Cor(1, 4) = −0.134    Cor(1, 5) = −0.149
Cor(1, 6) = −0.188    Cor(1, 7) = −0.0002   Cor(1, 8) = −0.034    Cor(2, 3) = −0.114
Cor(2, 4) = −0.186    Cor(2, 5) = −0.084    Cor(2, 6) = −0.080    Cor(2, 7) = −0.132
Cor(2, 8) = −0.082    Cor(3, 4) = −0.189    Cor(3, 5) = −0.198    Cor(3, 6) = −0.058
Cor(3, 7) = −0.150    Cor(3, 8) = −0.054    Cor(4, 5) = −0.102    Cor(4, 6) = −0.028
Cor(4, 7) = −0.132    Cor(4, 8) = −0.198    Cor(5, 6) = −0.041    Cor(5, 7) = 0.009
Cor(5, 8) = −0.007    Cor(6, 7) = −0.051    Cor(6, 8) = −0.084    Cor(7, 8) = −0.030

Table 16.10: The biases of the individual networks trained by negative correlation learning with different λ values in the large noise condition

λ = 0
Bias(1) = 0.0325    Bias(2) = 0.0308    Bias(3) = 0.0360    Bias(4) = 0.0359
Bias(5) = 0.0349    Bias(6) = 0.0361    Bias(7) = 0.0323    Bias(8) = 0.0343

λ = 0.5
Bias(1) = 0.0480    Bias(2) = 0.0495    Bias(3) = 0.0474    Bias(4) = 0.0472
Bias(5) = 0.0468    Bias(6) = 0.0438    Bias(7) = 0.0421    Bias(8) = 0.0501

λ = 1
Bias(1) = 3.0875    Bias(2) = 5.2799    Bias(3) = 3.7562    Bias(4) = 4.9059
Bias(5) = 2.590     Bias(6) = 2.0779    Bias(7) = 3.3165    Bias(8) = 2.1227

The results of both negative correlation learning and independent training are given in Figures 16.6–16.9 for the different ensemble sizes and different λ values in the noise-free condition at epoch 2000. For independent training (i.e., λ = 0), $E_{train}$ did not change much when the ensemble size was varied, which indicates that the ensembles of different sizes trained by independent training had similar capabilities. $E_{bias}$ decreased slightly when the ensemble size M was increased from 4 to 16, and then increased slightly when M was changed to 32. $E_{var}$ seemed to decay as 1/M, while $E_{cov}$ increased with increasing ensemble size M. It was observed that the ensembles with 16 and 32 individual neural networks performed slightly better than the other two ensembles with four and eight individual networks.

Figure 16.6: Comparison of negative correlation learning for different ensemble sizes and different values of λ (0, 0.5, and 1) in the noise-free condition. The vertical axis is the value of $E_{bias}$ on the testing set at epoch 2000 and the horizontal axis is the ensemble size.

Figure 16.7: Comparison of negative correlation learning for different ensemble sizes and different values of λ (0, 0.5, and 1) in the noise-free condition. The vertical axis is the sum of $E_{var}$ and $E_{cov}$ on the testing set at epoch 2000 and the horizontal axis is the ensemble size.

Figure 16.8: Comparison of negative correlation learning for different ensemble sizes and different values of λ (0, 0.5, and 1) in the noise-free condition. The vertical axis is the value of $E_{mse}$ on the testing set at epoch 2000 and the horizontal axis is the ensemble size.

Figure 16.9: Comparison of negative correlation learning for different ensemble sizes and different values of λ (0, 0.5, and 1) in the noise-free condition. The vertical axis is the value of $E_{train}$ on the training set at epoch 2000 and the horizontal axis is the ensemble size.

For negative correlation learning with λ = 0.5, $E_{bias}$ dropped quickly when M was increased from 4 to 16, and then increased slightly when M became 32. $E_{var}$ appeared to decay approximately as 1/M, while $E_{cov}$ had negative values for M = 4 and M = 8, and increased to small positive values for M = 16 and M = 32. Both $E_{mse}$ and $E_{train}$ decreased with increasing ensemble size.

For negative correlation learning with λ = 1, smaller $E_{bias}$ and larger $E_{var}$ were obtained compared with those in negative correlation learning with λ = 0.5. Because of the negative values of $E_{cov}$, the sum of $E_{var}$ and $E_{cov}$ was also smaller. For this regression task in the noise-free condition and with the same ensemble, negative correlation learning with λ = 1 performed better than negative correlation learning with λ = 0.5, and the latter performed better than independent training. It should be noted that the capability of the ensemble trained by negative correlation learning was larger than that of the same ensemble trained by independent training: negative correlation learning obtained a much smaller $E_{train}$ than independent training. It is worth pointing out that a larger ensemble size does not necessarily improve the performance of the ensemble. In fact, overfitting was observed for the ensemble with 32 individual networks trained by negative correlation learning with λ = 1.
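The approximate 1/M decay of the integrated variance can be illustrated numerically. If the M networks have deviations of equal variance v, Eq. (16.23) sums M terms of size v/M², which gives v/M. The Python sketch below uses synthetic, independent Gaussian outputs rather than trained networks; the standard deviation of 0.1, the seed, and the function name are arbitrary choices for illustration.

```python
import numpy as np

def integrated_variance(dev):
    """Eq. (16.23) from deviations dev[k, i, n] = F_i^(k)(t(n)) - mean over k."""
    M = dev.shape[1]
    return np.sum(np.mean(dev ** 2, axis=(0, 2))) / M ** 2

rng = np.random.default_rng(0)
K, N, sigma = 25, 1024, 0.1
results = {}
for M in (4, 8, 16, 32):
    out = rng.normal(0.0, sigma, size=(K, M, N))   # synthetic network outputs
    dev = out - out.mean(axis=0)                   # centre over the K simulations
    # Centring over K samples shrinks the expected variance by (K - 1) / K.
    expected = sigma ** 2 * (K - 1) / K / M
    results[M] = (integrated_variance(dev), expected)
```

Each entry of `results` pairs the measured value with the v(K − 1)/(KM) prediction. For trained networks the per-network variances are not identical and may change with M, so the decay is only approximate, as the text notes.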

16.5. Ensemble Learning with Awareness

Learning systems have become large, heterogeneous, uncertain, and dynamic. They also need to meet conflicting requirements, such as flexibility, performance, resource usage, and reliability. One key problem is how to resolve these requirements dynamically at run time when the learning problems involve dynamic and uncertain scenarios. For such problems, learning is no longer a one-time optimization problem. Instead, learning systems should be capable of dynamically managing different learning goals at run time. In this section, ideas of aware learning systems are introduced. It has been observed that learning systems can better manage trade-offs between goals at run time if they are aware of their own state, behavior, and performance [23, 24].

16.5.1. Awareness System

An awareness system consists of a number of aware units where each aware unit could be mathematically described as [23, 24]

$$y = R_3(R_2(R_1(x))) \tag{16.30}$$


where R1 is a receptor, R2 is a reactor, and R3 is a relater. The input x and the output y are usually represented as real vectors. Each element of x can come from a physical sensor, a software sensor, or a lower level aware unit. Each element of y is a concept to be aware of or some information to be used by a higher level aware unit. The function of the receptor R1 is to receive data x from the environment and conduct the data preprocessing of filtering out irrelevant noises, enhancing the signals, and normalizing or standardizing the inputs. The input x from different kinds of sensors can therefore be treated in the same way after R1 . The function of the reactor R2 is to react to the output of R1 , and extract or select important features. The function of the relater R3 is to detect certain events, make proper decisions based on the features provided by R2 , and relate the detected events to other aware units. An aware unit has two operation modes, working mode and learning mode. In the working mode, data flow forward from lower-level aware units to higher-level aware units. In the learning mode, data flow backward by sending feedback to the receptor through the relater and reactor. System parameters can therefore be adjusted based on the feedback. When the contexts are complex and dynamically changing, the system must be able to learn and become more and more aware autonomously. An aware unit is certainly not enough to cope with such complex contexts although one aware unit itself can be used as an aware system. It is desirable to use an aware unit as a subsystem to form a larger ensemble system with other aware units.
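The working mode of a single aware unit, Eq. (16.30), is a composition of three stages. The Python sketch below shows only that forward composition; the class name and the three stage functions (standardization, keeping the two largest values, and thresholding) are hypothetical stand-ins chosen for illustration, and the learning mode with backward feedback is not modelled.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

Vector = np.ndarray

@dataclass
class AwareUnit:
    receptor: Callable[[Vector], Vector]  # R1: preprocessing of raw inputs
    reactor: Callable[[Vector], Vector]   # R2: feature extraction or selection
    relater: Callable[[Vector], Vector]   # R3: event detection and decision

    def __call__(self, x: Vector) -> Vector:
        # Working mode: y = R3(R2(R1(x))), Eq. (16.30)
        return self.relater(self.reactor(self.receptor(x)))

# Hypothetical stage functions, for illustration only.
def standardize(x):
    return (x - x.mean()) / (x.std() + 1e-12)

def top_two(x):
    return np.sort(x)[-2:]                # keep the two largest features

def threshold(x):
    return (x > 1.0).astype(float)        # 1.0 marks a detected event

unit = AwareUnit(standardize, top_two, threshold)
y = unit(np.array([0.0, 0.0, 0.0, 4.0]))
```

Because the output of one unit is itself a real vector, it can serve as (part of) the input of a higher-level unit, which is how several aware units would be stacked into a larger ensemble system.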

16.5.2. Three-Level Awareness

The term “awareness computing” was originally used in the context of computer-supported cooperative work, ubiquitous computing, social networks, and others [23, 24]. Researchers have considered it a process for acquiring and distributing context information about what is happening, what has happened, and what is going to happen in the environment of concern. Three levels of awareness can be introduced into a system.

At the low level of awareness, the system has neither awareness nor decision-making ability. It merely presents the context so that human users can become aware of any useful information it contains. This low level of awareness appears in most monitoring systems for traffic control, nuclear power plants, and public facilities. These systems run smoothly during routine cooperative work. In emergencies, however, human users might fail to detect possible dangers because of the limited processing capacity of the human brain, even if the background information is provided seamlessly.

At the middle level of awareness, a system can be aware of the importance or urgency of different context patterns so that critical information can be provided to


the human user or to other systems in a more noticeable, comprehensible, and/or visible way. This can significantly enhance the ability of human users to detect danger or seize an opportunity. Many decision-support systems developed so far have this level of awareness, although their developers did not set out to build awareness into them.

At the high level of awareness, a system is aware of the meaning of the context patterns and can make corresponding decisions for the user. Such systems have been used in smart homes and smart offices, in which the server may be aware of human users’ actions, locations, and behaviors and can provide suitable services based on different context patterns. For example, the system can switch lights on or off to maintain the best lighting conditions in a smart office, or provide software/hardware resources to meet the requirements of a user in cloud computing.

Even if systems with these levels of awareness are indeed smart, they are not intelligent. The events to be aware of are normally predefined in such smart systems, and the decisions are programmed using manually defined rules. For example, many context-aware systems developed for mobile computing are implemented with smart sensors connected to a server through a base station.

Awareness could be built automatically through correlation maps between the input and the output. When the output can be derived directly from the input, finding the correlation map is simple. In the real world, however, such a correlation map can be hard to build. In practice, there may be many unforeseen outputs that are not registered in the system, so the system should be able to detect possible new outputs. Meanwhile, the current input may be a contributing factor for many outputs. The system should therefore be able to modify the correlation map dynamically so that the scope of possible outputs can be narrowed down as more information is acquired. This is a kind of dynamic optimization problem that could be solved effectively by evolutionary algorithms. A population of search agents could be defined to work together during a search. To speed up the search, the agents may share knowledge through imitation or by acquiring appropriate knowledge [25]. It is expected that awareness systems will stimulate the development of smart systems and help create real intelligence in the near future.

References

[1] S. Geman, E. Bienenstock, and E. Doursat, Neural networks and the bias/variance dilemma, Neural Comput., 4, 1–58 (1992).
[2] R. T. Clemen and R. L. Winkler, Limits for the precision and value of information from dependent sources, Oper. Res., 33, 427–442 (1985).
[3] A. Krogh and J. Vedelsby, Neural network ensembles, cross validation, and active learning. In eds. G. Tesauro, D. S. Touretzky, and T. K. Leen, Advances in Neural Information Processing Systems 7, pp. 231–238. The MIT Press (1995).


[4] E. A. Bender, Mathematical Methods in Artificial Intelligence (IEEE Computer Society Press, Los Alamitos, CA, 1996).
[5] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap (Chapman & Hall, 1993).
[6] M. Stone, Cross-validatory choice and assessment of statistical predictions, J. R. Stat. Soc., 36, 111–147 (1974).
[7] R. Meir, Bias, variance, and the combination of least squares estimators. In eds. G. Tesauro, D. S. Touretzky, and T. K. Leen, Advances in Neural Information Processing Systems 7, pp. 295–302. The MIT Press (1995).
[8] L. Breiman, Bagging predictors, Mach. Learn., 24, 123–140 (1996).
[9] R. E. Schapire, The strength of weak learnability, Mach. Learn., 5, 197–227 (1990).
[10] H. Drucker, C. Cortes, L. D. Jackel, Y. LeCun, and V. Vapnik, Boosting and other ensemble methods, Neural Comput., 6, 1289–1301 (1994).
[11] Y. Liu and X. Yao, Simultaneous training of negatively correlated neural networks in an ensemble, IEEE Trans. Syst. Man Cybern. Part B: Cybern., 29(6), 716–725 (1999).
[12] Y. Liu and X. Yao, Ensemble learning via negative correlation, Neural Networks, 12(10), 1399–1404 (1999).
[13] Y. Liu, X. Yao, and T. Higuchi, Evolutionary ensembles with negative correlation learning, IEEE Trans. Evol. Comput., 4(4), 380–725 (2000).
[14] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, Adaptive mixtures of local experts, Neural Comput., 3, 79–87 (1991).
[15] M. I. Jordan and R. A. Jacobs, Hierarchical mixtures-of-experts and the EM algorithm, Neural Comput., 6, 181–214 (1994).
[16] R. A. Jacobs, Bias/variance analyses of mixture-of-experts architectures, Neural Comput., 9, 369–383 (1997).
[17] Y. Liu and X. Yao, A cooperative ensemble learning system. In Proc. of the 1998 IEEE International Joint Conference on Neural Networks (IJCNN’98), pp. 2202–2207. IEEE Press, Piscataway, NJ, USA (1998).
[18] Y. Liu and X. Yao, Towards designing neural network ensembles by evolution. In Parallel Problem Solving from Nature—PPSN V: Proc. of the Fifth International Conference on Parallel Problem Solving from Nature, vol. 1498, Lecture Notes in Computer Science, pp. 623–632. Springer-Verlag, Berlin (1998).
[19] Y. Liu, Awareness learning for balancing performance and diversity in neural network ensembles. In Proc. of the International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, pp. 113–120. Springer, Cham (2019).
[20] Y. Liu, Learning targets for building cooperation awareness in ensemble learning. In Proc. of 2018 9th International Conference on Awareness Science and Technology (iCAST), pp. 16–19. IEEE (2018).
[21] Y. Liu, Combining two negatively correlated learning signals in a committee machine. In Proc. of 2018 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), pp. 1–4. IEEE (2018).
[22] Y. Liu, Bidirectional negative correlation learning. In Proc. of International Symposium on Intelligence Computation and Applications, pp. 84–92. Springer, Singapore (2017).
[23] Q. Zhao, Reasoning with awareness and for awareness: realizing an aware system using an expert system, IEEE Syst. Man Cybern. Mag., 3(2), 35–38 (2017).
[24] Q. Zhao, Making aware systems interpretable. In Proc. of 2016 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 881–887. IEEE (2016).
[25] I. Chowdhury, K. Su, and Q. Zhao, MS-NET: modular selective network round robin based modular neural network architecture with limited redundancy, Int. J. Mach. Learn. Cybern., 12, 763–781 (2021). https://doi.org/10.1007/s13042-020-01201-8.


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0017

Chapter 17

A Multistream Deep Rule-Based Ensemble System for Aerial Image Scene Classification

Xiaowei Gu∗,‡,¶ and Plamen P. Angelov†,§

∗ Department of Computer Science, Aberystwyth University, Aberystwyth, Ceredigion, SY23 3DB, UK
† School of Computing and Communications, Lancaster University, Bailrigg, Lancaster, LA1 4WA, UK
‡ [email protected]
§ [email protected]

Aerial scene classification is the key task for automated aerial image understanding and information extraction, but is highly challenging due to the great complexity and real-world uncertainties exhibited by such images. To perform precise aerial scene classification, in this research, a multistream deep rule-based ensemble system is proposed. The proposed ensemble system consists of three deep rule-based systems that are trained simultaneously on the same data. The three ensemble components employ ResNet50, DenseNet121, and InceptionV3 as their respective feature descriptors because of the state-of-the-art performances the three networks have demonstrated on aerial scene classification. The three networks are fine-tuned on aerial images to further enhance their discriminative and descriptive abilities. Thanks to its prototype-based nature, the proposed approach is able to self-organize a transparent ensemble predictive model with prototypes learned from training images and perform highly explainable joint decision-making on testing images with greater precision. Numerical examples based on both benchmark aerial image sets and satellite sensor images demonstrated the efficacy of the proposed approach, showing its great potential in solving real-world problems.

¶ Corresponding author.

17.1. Introduction

Remotely sensed aerial images are data sources of great importance for Earth observation and are fundamental for a wide variety of real-world applications,


such as environmental monitoring and urban planning [1]. Rapid development of sensing techniques has led to significant growth in the volume of aerial images acquired over the past decade, enabling much more detailed analysis of the Earth’s surface. As a key task in aerial image understanding, aerial scene classification aims to automatically assign distinct land-use class labels to aerial images based on their semantic contents. However, it is widely recognized as a challenging task due to the huge volume, high intraclass diversity, low interclass variation, and complex geometrical structures and spatial patterns of these remotely sensed images. Recently, aerial scene classification has received considerable attention and is now an actively researched topic in the fields of machine learning and remote sensing [2].

Deep neural network (DNN)-based approaches have gained huge popularity in the remote sensing community because of their impressive performance on many benchmark problems for aerial scene classification [3]. Thanks to their capability of automatically learning informative representations of raw input images with multiple levels of abstraction [4], DNNs are currently considered to be one of the state-of-the-art approaches in the remote sensing domain [5]. For example, F. Zhang et al. [6] proposed a gradient boosting random ensemble framework composed of multiple DNNs for aerial scene classification. Zhang et al. [7] introduced a multiscale dense network that utilizes multiscale information in the network structure for hyperspectral remote sensing image classification. Zhang et al. [8] combined a convolutional neural network (CNN) and a capsule network into a new deep architecture, achieving competitive classification performance on challenging public benchmark remote sensing image datasets. Liu et al. [9] used a Siamese CNN model to learn discriminative feature representations from aerial images for scene classification.

However, DNNs are widely known as “black box” models with high system complexity and little transparency. The training process is computationally expensive and requires huge amounts of labeled training data to construct a well-performing model. Their decision-making and reasoning processes are not explainable to humans. Most importantly, DNNs are fragile in the face of real-world uncertainties, as they can fail easily when handling data with unfamiliar patterns. These deficiencies hinder the performance of DNNs in real-world applications [10].

An alternative way to make use of the powerful representation learning ability of DNNs is to directly use off-the-shelf pretrained DNNs as feature descriptors. Many works [11–13] have reported good performance on aerial scene classification using mainstream classifiers, e.g., k-nearest neighbor [14], support vector machine [15], and random forest [16], with high-level semantic features extracted from aerial images by DNNs pretrained on natural image datasets such as ImageNet [17]. Popular pretrained DNNs for aerial scene classification include


VGGNet [18], InceptionNet [19, 20], ResNet [21], DenseNet [22], etc. Although this avoids the computationally expensive and time-consuming training process of DNNs, the lack of explainability remains unsolved because the model transparency of mainstream classifiers is often very low on high-dimensional, complex problems such as image classification.

The deep rule-based (DRB) classifier [10, 23] is a recently introduced generic approach for image classification that integrates a zero-order prototype-based fuzzy rule-based system with a multilayer image-processing architecture. As an attractive alternative to DNNs, DRB has demonstrated strong performance across a variety of image classification problems [24–26] and, at the same time, offers high-level model transparency and explainability thanks to its prototype-based nature. Despite its competitive results on aerial scene classification problems, DRB sometimes misclassifies aerial images into wrong land-use categories due to the high intraclass diversity and low interclass variation. The main reason is that the pretrained DNNs employed by DRB were originally trained on natural images and therefore fail to capture the most distinctive characteristics of different land-use categories in aerial images.

To address this problem, one feasible solution is to fine-tune the pretrained DNNs for aerial scene classification so that more descriptive high-level semantic features can be extracted from aerial images through DNN activation [27]. Furthermore, due to the highly complex nature of aerial images, a single fine-tuned DNN may not be able to provide sufficiently good performance. Since different DNNs have their own strengths and weaknesses, a more promising approach is to employ multiple fine-tuned DNNs together so that images of different land-use categories can be discriminated with great precision despite the presence of intraclass diversity and interclass similarity.

Following this principle, in this chapter, a novel multistream deep rule-based ensemble (MSDRBE) system is proposed for aerial scene classification. The proposed framework consists of three DRB systems trained simultaneously. Each DRB system self-organizes a set of prototype-based IF...THEN rules from training images for classification in a fully autonomous, objective, and transparent manner. To ensure the diversity of ensemble components, the three DRB systems use ResNet50, DenseNet121, and InceptionV3 as their respective feature descriptors. The three employed DNNs are state-of-the-art models for aerial scene classification. To further enhance their descriptive abilities, they have been fine-tuned on benchmark aerial imagery in advance. During decision-making, the land-use categories of unlabeled aerial images are determined jointly by the scores of confidence produced by the three DRB systems within the ensemble framework. Numerical examples on benchmark aerial image datasets show that MSDRBE is able to achieve top performance on aerial scene classification, outperforming the state-of-the-art approaches.


17.2. Preliminaries

In this section, technical details of DRB are summarized to make this chapter self-contained [10, 23]. The general architecture of DRB is shown in Figure 17.1. A standard DRB consists of the following four components [10, 23]:

(1) Pre-processing module: The main purposes of this module are (i) preparing input images for feature extraction and (ii) augmenting the input images to improve the generalization ability. It is therefore usually composed of several sub-layers with different functionalities, e.g., normalization, rotation, scaling, and segmentation.

(2) Feature descriptor: This component converts each image, $\mathbf{I}$, to a more meaningful and descriptive feature vector, denoted as $\mathbf{x}$:

$$\mathbf{x} \leftarrow \frac{F(\mathbf{I})}{\|F(\mathbf{I})\|}, \tag{17.1}$$

where $F(\cdot)$ represents the feature extraction process and $\|\cdot\|$ is the $L_2$ norm. DRB may employ different types of commonly used feature descriptors for feature extraction, depending on the nature of the problem. An ensemble of feature descriptors can further be considered for extracting more discriminative representations from images [28].

Figure 17.1: Architecture of DRB.


(3) Massively parallel rule base: This component is the “learning engine” of DRB, composed of C massively parallel IF...THEN fuzzy rules identified from training images in a self-organizing and human-interpretable manner. Each IF...THEN rule, denoted as $R_n$ (n = 1, 2, ..., C), is identified separately from training images of a particular class, and is formulated as follows [23, 29]:

$$R_n: \ \text{IF } (\mathbf{I} \sim \mathbf{P}_{n,1}) \ \text{OR } (\mathbf{I} \sim \mathbf{P}_{n,2}) \ \text{OR} \cdots \text{OR } (\mathbf{I} \sim \mathbf{P}_{n,N_n}) \ \text{THEN } (\mathbf{I} \text{ belongs to class } n), \tag{17.2}$$

where C is the number of classes, $\mathbf{I}$ stands for a particular image, $\mathbf{P}_{n,j}$ is the jth prototype of the nth class, and $N_n$ is the number of prototypes identified from images of the nth class.

(4) Decision maker: This is the final component of DRB, which assigns class labels to unlabeled images based on their visual similarities to the prototypes of different classes.

The system identification and validation processes of DRB are described in the next two subsections.

17.2.1. Identification Process

Since the massively parallel fuzzy rules of DRB are learned separately from labeled training images of different classes, only the identification process of the nth rule, $R_n$, is presented here. The same principle applies to the other massively parallel fuzzy rules within the system.

Identification process of $R_n$ [10, 23]

Step 0. For each newly arrived training image of the nth class, denoted as $\mathbf{I}_{n,k}$, its feature vector $\mathbf{x}_{n,k}$ is first extracted by the feature descriptor.

Step 1. If $\mathbf{I}_{n,k}$ is the very first observed image of the nth class, namely, k = 1, it initializes $R_n$ with the global meta-parameters set by

$$N_n \leftarrow 1; \quad \boldsymbol{\mu}_n \leftarrow \mathbf{x}_{n,k}, \tag{17.3}$$

where $\mathbf{x}_{n,k}$ is the corresponding feature vector of $\mathbf{I}_{n,k}$ and $\boldsymbol{\mu}_n$ denotes the global mean of the feature vectors of the observed labeled training images of the nth class. $\mathbf{I}_{n,k}$ itself becomes the first prototype of $R_n$, denoted as $\mathbf{P}_{n,N_n}$, and the local parameters


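associated with $\mathbf{P}_{n,N_n}$ are initialized by

$$\mathbf{P}_{n,N_n} \leftarrow \mathbf{I}_{n,k}; \quad \mathbf{p}_{n,N_n} \leftarrow \mathbf{x}_{n,k}; \quad S_{n,N_n} \leftarrow 1; \quad r_{n,N_n} \leftarrow r_o, \tag{17.4}$$

where $S_{n,N_n}$ is the number of images associated with $\mathbf{P}_{n,N_n}$; $\mathbf{p}_{n,N_n}$ is the feature vector of $\mathbf{P}_{n,N_n}$; $r_{n,N_n}$ is the radius of the area of influence around $\mathbf{P}_{n,N_n}$; and $r_o$ is a small value, $r_o = \sqrt{2(1 - \cos(\pi/6))}$ [23]. Then, $R_n$ is initialized as follows:

$$R_n: \ \text{IF } (\mathbf{I} \sim \mathbf{P}_{n,N_n}) \ \text{THEN } (\mathbf{I} \text{ belongs to class } n), \tag{17.5}$$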

and the identification process returns to Step 0 for the next image. If $\mathbf{I}_{n,k}$ is not the very first image, namely, k > 1, the global mean is updated by

$$\boldsymbol{\mu}_n \leftarrow \frac{(k-1)\boldsymbol{\mu}_n + \mathbf{x}_{n,k}}{k}, \tag{17.6}$$

and the process enters Step 2.

Step 2. Data density values at $\mathbf{I}_{n,k}$ and $\mathbf{P}_{n,j}$ (j = 1, 2, ..., $N_n$) are calculated as follows [30]:

$$D_n(\mathbf{Z}) = \frac{1}{1 + \dfrac{\|\mathbf{z} - \boldsymbol{\mu}_n\|^2}{1 - \|\boldsymbol{\mu}_n\|^2}}, \tag{17.7}$$

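where $\mathbf{Z} = \mathbf{I}_{n,k}, \mathbf{P}_{n,1}, \mathbf{P}_{n,2}, \ldots, \mathbf{P}_{n,N_n}$ and $\mathbf{z} = \mathbf{x}_{n,k}, \mathbf{p}_{n,1}, \mathbf{p}_{n,2}, \ldots, \mathbf{p}_{n,N_n}$. The nearest prototype to $\mathbf{I}_{n,k}$, denoted as $\mathbf{P}_{n,j^*}$, is identified based on their similarity using the following equation:

$$j^* = \arg\min_{j=1,2,\ldots,N_n} \left( \|\mathbf{x}_{n,k} - \mathbf{p}_{n,j}\| \right). \tag{17.8}$$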

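Then, the process enters Step 3.

Step 3. In this step, Condition 1 is first examined to see whether $\mathbf{I}_{n,k}$ has the potential to become a new prototype:

$$\text{Condition 1: IF } \left( D_n(\mathbf{I}_{n,k}) > \max_{j=1,2,\ldots,N_n} D_n(\mathbf{P}_{n,j}) \right) \text{ OR } \left( D_n(\mathbf{I}_{n,k}) < \min_{j=1,2,\ldots,N_n} D_n(\mathbf{P}_{n,j}) \right) \text{ OR } \left( \|\mathbf{x}_{n,k} - \mathbf{p}_{n,j^*}\| > r_{n,j^*} \right) \text{ THEN } (\mathbf{I}_{n,k} \text{ becomes a new prototype}) \tag{17.9}$$

If Condition 1 is satisfied, $\mathbf{I}_{n,k}$ is recognized as a new prototype ($N_n \leftarrow N_n + 1$) and added to $R_n$ in the form of Eq. (17.2). Local meta-parameters of the new prototype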


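are set by Eq. (17.4). Otherwise (if Condition 1 is not met), $\mathbf{I}_{n,k}$ is used to update the meta-parameters of $\mathbf{P}_{n,j^*}$ as follows:

$$\mathbf{p}_{n,j^*} \leftarrow \frac{S_{n,j^*}\,\mathbf{p}_{n,j^*} + \mathbf{x}_{n,k}}{S_{n,j^*} + 1}; \quad S_{n,j^*} \leftarrow S_{n,j^*} + 1; \quad r_{n,j^*} = \sqrt{\frac{1}{2}\left( r_{n,j^*}^2 + \left(1 - \|\mathbf{p}_{n,j^*}\|^2\right) \right)}. \tag{17.10}$$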

Then, the identification process goes back to Step 0 if new images are available.
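Steps 0 to 3 for a single class can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' MATLAB implementation: the class name `ClassRule`, the helper functions, and the small constant `EPS` that guards the density denominator are all assumptions introduced here. The sketch also assumes that Condition 1 holds when the density of the new sample lies above the highest or below the lowest prototype density, or when the sample lies farther than the radius from its nearest prototype, and that the radius update takes the square root of half the sum of the squared radius and 1 − ||p||². Feature vectors are assumed to be unit-norm lists of floats, as produced by Eq. (17.1).

```python
import math

R0 = math.sqrt(2.0 * (1.0 - math.cos(math.pi / 6.0)))  # initial radius r_o
EPS = 1e-12  # assumed numerical guard, not part of the original method


def sq_norm(v):
    return sum(a * a for a in v)


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def density(z, mu):
    """Data density of Eq. (17.7) for unit-norm feature vectors."""
    return 1.0 / (1.0 + sq_dist(z, mu) / max(1.0 - sq_norm(mu), EPS))


class ClassRule:
    """Prototypes and meta-parameters behind one rule R_n."""

    def __init__(self):
        self.k = 0            # number of images observed so far
        self.mu = None        # global mean of the feature vectors
        self.prototypes = []  # prototype feature vectors p_{n,j}
        self.support = []     # S_{n,j}: images associated with each prototype
        self.radius = []      # r_{n,j}: radius of each area of influence

    def _add_prototype(self, x):
        self.prototypes.append(list(x))
        self.support.append(1)
        self.radius.append(R0)

    def learn(self, x):
        """Process the feature vector of one training image."""
        self.k += 1
        if self.k == 1:  # Step 1: the first image initializes the rule
            self.mu = list(x)
            self._add_prototype(x)
            return
        k = self.k
        self.mu = [((k - 1) * m + v) / k for m, v in zip(self.mu, x)]
        # Step 2: densities and nearest prototype
        d_x = density(x, self.mu)
        d_p = [density(p, self.mu) for p in self.prototypes]
        j = min(range(len(self.prototypes)),
                key=lambda i: sq_dist(x, self.prototypes[i]))
        # Step 3: Condition 1 (assumed form, see lead-in)
        too_far = math.sqrt(sq_dist(x, self.prototypes[j])) > self.radius[j]
        if d_x > max(d_p) or d_x < min(d_p) or too_far:
            self._add_prototype(x)
            return
        # Otherwise update the nearest prototype
        s = self.support[j]
        p = [(s * a + b) / (s + 1) for a, b in zip(self.prototypes[j], x)]
        self.prototypes[j] = p
        self.support[j] = s + 1
        self.radius[j] = math.sqrt(
            0.5 * (self.radius[j] ** 2 + (1.0 - sq_norm(p))))


rule = ClassRule()
for angle in (0.0, math.pi / 2.0, -0.3):  # three unit vectors in 2-D
    rule.learn([math.cos(angle), math.sin(angle)])
```

In this toy run the second vector is far from the first and becomes a new prototype, while the third lies within the radius of the first prototype and has an intermediate density, so it is merged into that prototype instead.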

17.2.2. Validation Process

During the validation process, given an unlabeled image, $\mathbf{I}$, each of the C massively parallel IF...THEN fuzzy rules within DRB generates a score of confidence, denoted as $\lambda_n(\mathbf{I})$ (n = 1, 2, ..., C). $\lambda_n(\mathbf{I})$ is calculated based on the visual similarity between $\mathbf{I}$ and the prototypes identified from labeled training images of the corresponding class [10, 23]:

$$\lambda_n(\mathbf{I}) = \max_{j=1,2,\ldots,N_n} \left( e^{-\|\mathbf{x} - \mathbf{p}_{n,j}\|^2} \right). \tag{17.11}$$

Based on the obtained scores of confidence, $\lambda_1(\mathbf{I}), \lambda_2(\mathbf{I}), \ldots, \lambda_C(\mathbf{I})$ (C scores in total), the class label of $\mathbf{I}$ is determined using the “winner takes all” principle [10, 23]:

$$\text{label} \leftarrow \text{class}_{n^*}; \quad n^* = \arg\max_{n=1,2,\ldots,C} \left( \lambda_n(\mathbf{I}) \right). \tag{17.12}$$
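The validation step can be sketched in a few lines of Python. The function names, the dictionary-based rule base, and the toy two-dimensional prototypes with land-use labels below are hypothetical and serve only to illustrate Eqs. (17.11) and (17.12); real prototypes would be high-dimensional unit-norm feature vectors.

```python
import math


def confidence(x, prototypes):
    """Eq. (17.11): largest exp(-||x - p||^2) over the prototypes of a class."""
    return max(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)))
        for p in prototypes
    )


def classify(x, rule_base):
    """Eq. (17.12): winner-takes-all over the per-class confidence scores.

    rule_base maps each class label to its list of prototype vectors.
    """
    scores = {label: confidence(x, protos)
              for label, protos in rule_base.items()}
    return max(scores, key=scores.get), scores


# Toy rule base with made-up prototypes (illustration only).
rule_base = {
    "forest": [[1.0, 0.0]],
    "river": [[0.0, 1.0], [0.6, 0.8]],
}
label, scores = classify([0.8, 0.6], rule_base)
```

Because each score depends on a single nearest prototype, the prototype that produced the winning score can be shown to a user as the reason for the decision.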

17.3. The Proposed Framework

The architecture of the proposed MSDRBE framework is presented in Figure 17.2. MSDRBE is composed of three DRB systems, namely, (i) DRB-R, (ii) DRB-D, and (iii) DRB-I. The pre-processing modules employed by the three DRB systems have exactly the same structure, which is shown in Figure 17.3. Each pre-processing module consists of the following three sub-layers:

(1) Re-scaling layer: This layer re-scales input images to the size of 248 × 248 pixels [31]. Note that the original spatial resolution of images may be changed by re-scaling.

(2) Segmentation layer: This layer crops five sub-images of the same size, 224 × 224 pixels, from the central area and four corners of each image for data augmentation [32].

(3) Flipping layer: This layer creates a mirror image of each image segment by flipping it horizontally.


Figure 17.2: Architecture of MSDRBE.

Figure 17.3: Architecture of the pre-processing module.


Thus, a total of $M_o = 10$ new sub-images (five cropped segments and their five mirror images) are created from each input image $\mathbf{I}$ by the pre-processing module. The segmentation and flipping layers effectively improve the generalization ability of DRB, allowing the system to capture more semantic information from the input images [28].

DRB-R, DRB-D, and DRB-I employ the fine-tuned ResNet50 [21], DenseNet121 [22], and InceptionV3 [20] as their respective high-level feature descriptors. MSDRBE employs the three DNNs for feature extraction because they have demonstrated state-of-the-art performance on aerial scene classification with fewer parameters than alternative DNNs [33, 34]. For example, ResNet50 has about 25 million parameters, DenseNet121 has only about 8 million parameters, and InceptionV3 has about 24 million parameters. Since each DNN has its own focus and can provide a unique view of the images, using the three DNNs together in the proposed MSDRBE can effectively enhance its capability of image understanding. Technical details of the three DNNs used by MSDRBE are summarized briefly as follows:

(1) ResNet50 [21]: This network is one of the most widely used DNNs for object detection and classification and has 50 layers. It addresses the gradient degradation problem by reformulating the layers to learn a residual mapping, which correspondingly improves the network capacity and performance. Some researchers further suggest that ResNet50 is able to outperform other DNNs trained on the ImageNet dataset [17] on aerial scene classification [33].

(2) DenseNet121 [22]: This DNN is a variant of the popular DenseNet, which uses dense connections to address the gradient degradation problem. As its name suggests, DenseNet121 has 121 layers. Each convolutional layer within the network is connected to all subsequent layers and, thus, each convolutional layer receives feature maps from all previous layers as inputs. The dense connections maximize the information flow between different layers, enabling DenseNet121 to achieve better performance with fewer parameters.

(3) InceptionV3 [20]: InceptionV3 is a popular DNN with 159 layers from the InceptionNet family. The design principle of InceptionNet is to increase both the depth and width of the network for performance improvement. Compared with its previous versions, namely, InceptionV1 and InceptionV2, InceptionV3 introduces the idea of factorization by decomposing large convolutions into smaller convolutions, which effectively reduces the number of parameters without decreasing the system efficacy. InceptionV3 offers state-of-the-art performance on image classification and is widely used for transfer learning.


Figure 17.4: Transfer learning framework used by MSDRBE.

As mentioned in Section 17.1, transfer learning is leveraged to fine-tune ResNet50, DenseNet121, and InceptionV3 to make them better suited to aerial scene classification. Transfer learning is an effective method for training a DNN using limited data without overfitting. Its main aim is to utilize previously learned knowledge to solve new problems better and faster [34, 35]. The transfer learning framework used by MSDRBE is depicted in Figure 17.4. In this research, the final softmax prediction layers of the three DNNs are replaced with two fully connected layers, each with 1024 rectified linear units (ReLUs), and the rest of the original networks are frozen to accelerate the fine-tuning process. Then, the modified DNNs are fine-tuned on an aerial scene classification problem [5]. The parameters of the trainable layers are updated so that more discriminative high-level semantic features can be extracted from aerial images by the three DNNs.

During the system identification and learning processes of MSDRBE, the three fine-tuned high-level feature descriptors summarize each input image into a 1024 × 1 dimensional discriminative representation based on the feature vectors extracted from the 10 sub-images produced by the pre-processing module:

$$\mathbf{x} \leftarrow \frac{F(\mathbf{I})}{\|F(\mathbf{I})\|}; \quad F(\mathbf{I}) = \sum_{j=1}^{M_o} DN(\mathbf{I}(j)), \tag{17.13}$$

where $\mathbf{I}(j)$ represents the jth sub-image of $\mathbf{I}$ and $DN(\mathbf{I}(j))$ represents the 1024 × 1 dimensional feature vector extracted from $\mathbf{I}(j)$ by the high-level DNN-based feature descriptor, which can be ResNet50, DenseNet121, or InceptionV3.
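The pre-processing and aggregation pipeline (five crops, their mirror images, and the normalized sum of Eq. (17.13)) can be sketched as follows. This is a toy illustration under stated assumptions: images are nested lists of numbers that have already been re-scaled, `toy_extractor` is a hypothetical stand-in for a fine-tuned DNN that returns three values instead of 1024, and the crop size is passed as a parameter so that a small example can be used in place of 248 × 248 inputs with 224 × 224 crops.

```python
import math


def crop(img, top, left, size):
    """Cut a size x size window out of an image stored as a list of rows."""
    return [row[left:left + size] for row in img[top:top + size]]


def flip_horizontal(img):
    """Mirror an image left to right."""
    return [row[::-1] for row in img]


def ten_crops(img, size):
    """Centre and four corner crops, followed by their mirror images."""
    h, w = len(img), len(img[0])
    offsets = [
        ((h - size) // 2, (w - size) // 2),  # centre
        (0, 0),                              # top-left
        (0, w - size),                       # top-right
        (h - size, 0),                       # bottom-left
        (h - size, w - size),                # bottom-right
    ]
    crops = [crop(img, top, left, size) for top, left in offsets]
    return crops + [flip_horizontal(c) for c in crops]


def describe(img, extractor, size):
    """Eq. (17.13): sum the per-crop features, then scale to unit L2 norm."""
    feats = [extractor(sub) for sub in ten_crops(img, size)]
    total = [sum(column) for column in zip(*feats)]
    norm = math.sqrt(sum(v * v for v in total))
    return [v / norm for v in total]


def toy_extractor(sub):
    """Hypothetical stand-in for a DNN feature descriptor (3 features)."""
    flat = [v for row in sub for v in row]
    return [sum(flat) / len(flat), max(flat), 1.0]


image = [[r * 6 + c for c in range(6)] for r in range(6)]  # toy 6 x 6 image
sub_images = ten_crops(image, 4)
x = describe(image, toy_extractor, 4)
```

Since the summed vector is normalized afterwards, summing and averaging the ten per-crop feature vectors lead to the same final representation.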


The identification process of each DRB within the proposed MSDRBE is conducted separately, following exactly the same process as described in Section 17.2.1.

During the validation process, MSDRBE adopts the weighted voting mechanism for determining the class labels of unlabeled images. For each unlabeled image, $\mathbf{I}$, the three ensemble components of MSDRBE, namely, DRB-R, DRB-D, and DRB-I, produce three sets of scores of confidence using Eq. (17.11) (see Section 17.2.2). The three sets of scores of confidence (one set per ensemble component, C scores per set) are then collected by the joint decision maker and integrated into the overall scores of confidence as follows:

$$\hat{\lambda}_n(\mathbf{I}) = \frac{\lambda_n^R(\mathbf{I}) + \lambda_n^D(\mathbf{I}) + \lambda_n^I(\mathbf{I})}{3}, \tag{17.14}$$

where n = 1, 2, ..., C; $\lambda_n^R(\mathbf{I})$, $\lambda_n^D(\mathbf{I})$, and $\lambda_n^I(\mathbf{I})$ are the scores of confidence produced by DRB-R, DRB-D, and DRB-I, respectively. The class label of $\mathbf{I}$ is finally determined using the “winner takes all” principle:

$$\text{label} \leftarrow \text{class}_{n^*}; \quad n^* = \arg\max_{n=1,2,\ldots,C} \left( \hat{\lambda}_n(\mathbf{I}) \right). \tag{17.15}$$
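The joint decision of Eqs. (17.14) and (17.15) amounts to averaging the per-class scores and taking the maximum. A minimal Python sketch is given below; the function names, class labels, and score values are made up for illustration and do not come from the experiments in this chapter.

```python
def fuse(score_sets):
    """Eq. (17.14): average the per-class scores of the ensemble members.

    score_sets is a list of dictionaries, one per DRB system, each mapping
    a class label to that system's score of confidence.
    """
    labels = score_sets[0].keys()
    return {c: sum(s[c] for s in score_sets) / len(score_sets)
            for c in labels}


def decide(score_sets):
    """Eq. (17.15): winner-takes-all on the fused scores."""
    fused = fuse(score_sets)
    return max(fused, key=fused.get), fused


# Hypothetical scores from DRB-R, DRB-D, and DRB-I for one image.
drb_r = {"airport": 0.9, "harbor": 0.5}
drb_d = {"airport": 0.4, "harbor": 0.7}
drb_i = {"airport": 0.5, "harbor": 0.9}
label, fused = decide([drb_r, drb_d, drb_i])
```

In this example DRB-R alone would choose "airport", but the averaged scores favor "harbor", which shows how the other two streams can overrule a single component.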

17.4. Numerical Experiments and Evaluation

In this section, numerical experiments on benchmark aerial scene classification problems are conducted to evaluate the performance of the proposed MSDRBE approach. The DRB and MSDRBE algorithms were developed on the MATLAB R2018a platform. Numerical experiments were conducted on a Windows 10 laptop with a dual-core i7 CPU (2.60 GHz × 2) and 16 GB of RAM. The three high-level DNN-based feature descriptors employed by MSDRBE, namely, ResNet50, DenseNet121, and InceptionV3, were implemented using the Keras module of TensorFlow. DNN fine-tuning was conducted on a Linux server with two NVIDIA GP100GL GPUs.

17.4.1. Dataset Description

Four widely used benchmark datasets for aerial scene classification, namely, NWPU45 [5], UCMerced [36], WHU-RS19 [37], and RSSCN7 [38], were used in the numerical experiments conducted in this research. Key information on the four datasets is summarized in Table 17.1.

Table 17.1: Key information of four aerial image sets

| Dataset | Images | Categories | Images per category | Image size | Spatial resolution (m) |
|---|---|---|---|---|---|
| NWPU45 | 31,500 | 45 | 700 | 256×256 | 0.2–30 |
| UCMerced | 2100 | 21 | 100 | 256×256 | 0.3 |
| WHU-RS19 | 950 | 19 | 50 | 600×600 | up to 0.5 |
| RSSCN7 | 2800 | 7 | 400 | 400×400 | - |

The NWPU45 dataset [5] was utilized for fine-tuning the three DNNs employed by MSDRBE, namely, ResNet50, DenseNet121, and InceptionV3. This dataset contains 31,500 aerial images of 45 different land-use categories, 700 images per category. All images have a uniform size of 256 × 256 pixels. The NWPU45 dataset is currently one of the largest datasets for aerial scene classification and covers most of the commonly seen land-use categories; thus, it is very suitable for fine-tuning the DNNs. Example images of this dataset are given in Figure 17.5. To improve the generalization ability of the fine-tuned networks, each image of the NWPU45 dataset was re-scaled to the size of 248 × 248 pixels, and five new images were cropped out from the central area and four corners of the original image. In this way, the dataset was augmented to five times its original size. During the fine-tuning process, 80% of the images of the augmented dataset were randomly selected to form the training set and the remaining 20% were used as the validation set.

The performance of MSDRBE was evaluated on the UCMerced [36], WHU-RS19 [37], and RSSCN7 [38] datasets. The UCMerced dataset [36] consists of 21 land-use categories; each category has 100 images with the size of 256 × 256 pixels. The spatial resolution of the images is 0.3 m per pixel. The WHU-RS19 dataset [37] contains 950 aerial images with the size of 600 × 600 pixels. The 950 images are evenly distributed in 19 land-use categories. The spatial resolution is up to 0.5 m per pixel. The RSSCN7 dataset [38] has 2800 aerial images evenly distributed in seven land-use categories with the image size of 400 × 400 pixels. Images of each category are obtained at four different scales with 100 images per scale. Example images of the three datasets are illustrated in Figures 17.6–17.8, respectively.

During the experiments for performance evaluation, each dataset was randomly split into labeled training and unlabeled testing sets with two different ratios, following the commonly used experimental protocols [2]. For the UCMerced dataset, the splitting ratios between training images and testing images per category were set to 5:5 and 8:2. For the WHU-RS19 dataset, the splitting ratios were set to 4:6 and 6:4. For the RSSCN7 dataset, the splitting ratios were set to 2:8 and 5:5.
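The five-crop augmentation and the per-category random split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the text does not give the crop size, so the 224-pixel value below is an assumption (chosen to match the DNN input size reported in Section 17.4.2), and the function names, the random seed, and the rounding of the split point are hypothetical.

```python
import random
from typing import Hashable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def five_crop_boxes(height: int, width: int, crop: int) -> List[Box]:
    """Boxes for the four corner crops and the centre crop of an image."""
    if not 0 < crop <= min(height, width):
        raise ValueError("crop must be positive and fit inside the image")
    tops_lefts = [
        (0, 0),                                       # top-left corner
        (0, width - crop),                            # top-right corner
        (height - crop, 0),                           # bottom-left corner
        (height - crop, width - crop),                # bottom-right corner
        ((height - crop) // 2, (width - crop) // 2),  # centre
    ]
    return [(t, l, t + crop, l + crop) for t, l in tops_lefts]


def split_per_category(labels: Sequence[Hashable], train_fraction: float,
                       seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split sample indices into train/test sets, category by category."""
    rng = random.Random(seed)
    by_category = {}
    for index, label in enumerate(labels):
        by_category.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_category.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Crop boxes for a 248 x 248 image; the 224-pixel crop size is an assumption.
boxes = five_crop_boxes(248, 248, crop=224)

# UCMerced-like toy labels: 21 categories x 100 images, split 8:2 per category.
toy_labels = [c for c in range(21) for _ in range(100)]
train_idx, test_idx = split_per_category(toy_labels, train_fraction=0.8)
```

Each box can be applied to an image array as `image[top:bottom, left:right]`; splitting per category keeps the stated ratio within every land-use class rather than only on average.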

Figure 17.5: Example images of the NWPU45 dataset (one per category: Airplane, Airport, Baseball Diamond, Basketball Court, Beach, Bridge, Chaparral, Church, Circular Farmland, Cloud, Commercial Area, Dense Residential, Desert, Forest, Freeway, Golf Course, Ground Track Field, Harbor, Industrial Area, Intersection, Island, Lake, Meadow, Medium Residential, Mobile Home Park, Mountain, Overpass, Palace, Parking Lot, Railway, Railway Station, Rectangular Farmland, River, Roundabout, Runway, Sea Ice, Ship, Snowberg, Sparse Residential, Stadium, Storage Tank, Tennis Court, Terrace, Thermal Power Station, Wetland).

Figure 17.6: Example images of the UCMerced dataset (one per category: Agricultural, Airplane, Baseball Diamond, Beach, Buildings, Chaparral, Dense Residential, Forest, Freeway, Golf Course, Harbor, Intersection, Medium Residential, Mobile Home Park, Overpass, Parking Lot, River, Runway, Sparse Residential, Storage Tanks, Tennis Court).

Figure 17.7: Example images of the WHU-RS19 dataset (one per category: Airport, Beach, Bridge, Commercial, Desert, Farmland, Football Field, Forest, Industrial, Meadow, Mountain, Park, Parking, Pond, Port, Railway Station, Residential, River, Viaduct).

Figure 17.8: Example images of the RSSCN7 dataset (one per category: Grass, Field, Industry, Residential, River and Lake, Forest, Parking).

17.4.2. DNN Implementation

As described in Section 17.3, the pretrained ResNet50, DenseNet121, and InceptionV3 are employed as the feature descriptors for the three ensemble component DRB systems of MSDRBE. To facilitate feature extraction, the final predicting layers of the three DNNs were replaced with two fully connected layers, each consisting of 1024 ReLUs with a dropout rate of 0.3, and the remaining parts of the DNNs were frozen. The input size of the three DNNs was adjusted to 224 × 224 pixels. During the fine-tuning process, a 45-way softmax layer was added to the three networks (because the NWPU45 dataset has a total of 45 classes). Note that this softmax layer was removed after fine-tuning, and the 1024 × 1 dimensional activations from the second fully connected layer were used as the feature vectors of images. The adaptive moment estimation (Adam) algorithm [39] was used as the optimizer for updating the parameters of the final two fully connected layers of the DNNs by minimizing the categorical cross-entropy loss function. The setting for data augmentation is listed in Table 17.2 [40]. The batch size was set to 10 and the learning rate was set to $10^{-5}$. The three networks were fine-tuned for 25 epochs to avoid over-fitting, and the corresponding accuracy and loss during training and validation are shown in Figures 17.9–17.11, respectively. It can be observed from the three figures that both the training and validation accuracy of ResNet50, DenseNet121, and InceptionV3 converged to nearly 100% after 25 training epochs, showing that the knowledge learned from the ImageNet dataset [17] has been transferred successfully to the aerial scene classification tasks.

Table 17.2: Setting of data augmentation

| Setting | Value |
|---|---|
| Horizontal flipping | True |
| Vertical flipping | True |
| Rotation range | 22 |
| Width shifting range | 0.19 |
| Height shifting range | 0.18 |
| Fill mode | nearest |
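The fine-tuning setup described in this subsection can be sketched with the Keras API as follows. This is a hedged sketch, not the authors' implementation: the use of global average pooling on the backbone output, the placement of the dropout layers, the layer names, and the function name `build_finetune_model` are assumptions. Network-specific input preprocessing and the training loop are omitted. The augmentation dictionary restates Table 17.2 using the keyword names of Keras' `ImageDataGenerator`.

```python
from typing import Optional

# Table 17.2 expressed with the keyword names of Keras' ImageDataGenerator.
AUGMENTATION = {
    "horizontal_flip": True,
    "vertical_flip": True,
    "rotation_range": 22,
    "width_shift_range": 0.19,
    "height_shift_range": 0.18,
    "fill_mode": "nearest",
}

# Hyperparameters stated in the text.
NUM_CLASSES = 45
FEATURE_DIM = 1024
DROPOUT_RATE = 0.3
LEARNING_RATE = 1e-5
BATCH_SIZE = 10
EPOCHS = 25
INPUT_SHAPE = (224, 224, 3)


def build_finetune_model(backbone_name: str = "ResNet50",
                         weights: Optional[str] = "imagenet"):
    """Return (classifier, feature_extractor) for one backbone.

    backbone_name may be "ResNet50", "DenseNet121" or "InceptionV3".
    """
    # Imported lazily so the constants above can be inspected without TensorFlow.
    import tensorflow as tf

    backbone_cls = getattr(tf.keras.applications, backbone_name)
    # Global average pooling of the backbone output is an assumption.
    backbone = backbone_cls(include_top=False, weights=weights,
                            input_shape=INPUT_SHAPE, pooling="avg")
    backbone.trainable = False  # the pretrained layers stay frozen

    x = tf.keras.layers.Dense(FEATURE_DIM, activation="relu", name="fc1")(backbone.output)
    x = tf.keras.layers.Dropout(DROPOUT_RATE)(x)
    features = tf.keras.layers.Dense(FEATURE_DIM, activation="relu", name="fc2")(x)
    x = tf.keras.layers.Dropout(DROPOUT_RATE)(features)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax", name="softmax45")(x)

    classifier = tf.keras.Model(backbone.input, outputs)
    classifier.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
                       loss="categorical_crossentropy", metrics=["accuracy"])
    # After fine-tuning, the softmax layer is dropped and the fc2 activations
    # (1024 values per image) serve as the feature vector.
    feature_extractor = tf.keras.Model(backbone.input, features)
    return classifier, feature_extractor
```

Because the feature extractor shares its layers with the classifier, fitting the classifier on NWPU45 also updates the weights used for feature extraction; no separate copy is needed.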


Figure 17.9: Accuracy and loss during fine-tuning of ResNet50.

Figure 17.10: Accuracy and loss during fine-tuning of DenseNet121.

Figure 17.11: Accuracy and loss during fine-tuning of InceptionV3.

17.4.3. Results and Comparison

Experimental results obtained by the proposed MSDRBE framework on the UCMerced dataset are reported in Table 17.3, where the average classification accuracy and the standard deviation over 25 Monte Carlo experiments are presented. To validate the effectiveness of the proposed ensemble framework, the classification performances of DRB-R, DRB-D, and DRB-I are given in the same table. The state-of-the-art results of existing works are also presented in Table 17.3 for benchmark comparison.

Following the same protocol, the classification performance of MSDRBE, DRB-R, DRB-D, and DRB-I on the WHU-RS19 dataset under the two training–testing split ratios given in Section 17.4.1 is tabulated in Table 17.4. In addition, the state-of-the-art approaches in the literature are also listed for comparison.

As in the previous two numerical examples in this subsection, the classification accuracy rates obtained by MSDRBE and its ensemble components, as well as the results obtained by the state-of-the-art approaches, on the RSSCN7 dataset under two different experimental settings are reported in Table 17.5.

Table 17.3: Numerical results on UCMerced and comparison with the state-of-the-art (columns give the percentage of training images per category)

| Algorithm | 50% | 80% |
|---|---|---|
| MSDRBE | 0.9718 ± 0.0046 | 0.9791 ± 0.0062 |
| DRB-R | 0.9534 ± 0.0050 | 0.9635 ± 0.0073 |
| DRB-D | 0.9510 ± 0.0063 | 0.9620 ± 0.0092 |
| DRB-I | 0.9432 ± 0.0071 | 0.9503 ± 0.0078 |
| GCLBP [41] | - | 0.9000 ± 0.0210 |
| MS-CLBP+FV [42] | - | 0.9300 ± 0.0120 |
| MTJSLRC [43] | - | 0.9107 ± 0.0067 |
| CaffeNet [2] | 0.9396 ± 0.0067 | 0.9502 ± 0.0081 |
| VGG-VD-16 [2] | 0.9414 ± 0.0069 | 0.9521 ± 0.0120 |
| GoogLeNet [2] | 0.9270 ± 0.0060 | 0.9431 ± 0.0089 |
| SalM3LBP-CLM [44] | 0.9421 ± 0.0075 | 0.9575 ± 0.0080 |
| SalM3LBP [44] | 0.8997 ± 0.0085 | 0.9314 ± 0.0100 |
| salCLM(eSIF) [44] | 0.9293 ± 0.0092 | 0.9452 ± 0.0079 |
| TEX-Net-LF [45] | 0.9589 ± 0.0037 | 0.9662 ± 0.0049 |
| VGG-16-CapsNet [8] | 0.9533 ± 0.0018 | 0.9881 ± 0.0022 |
| ARCNet-VGG [46] | 0.9681 ± 0.0014 | 0.9912 ± 0.0040 |
| SeRBIA [47] | 0.9636 ± 0.0041 | 0.9786 ± 0.0099 |
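The "mean ± standard deviation" entries reported for the Monte Carlo experiments can be produced from per-run accuracies as sketched below. The helper name and the example accuracies are hypothetical, and the chapter does not state whether the sample or the population standard deviation was used, so the choice of the sample form here is an assumption.

```python
from statistics import mean, stdev
from typing import Sequence


def summarize_runs(accuracies: Sequence[float]) -> str:
    """Format repeated-run accuracies as 'mean ± std' with four decimals."""
    # stdev() is the sample standard deviation (an assumption; see the lead-in).
    return f"{mean(accuracies):.4f} ± {stdev(accuracies):.4f}"


# Hypothetical accuracies from four random splits (the chapter uses 25 per setting).
summary = summarize_runs([0.97, 0.98, 0.96, 0.97])
```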

Table 17.4: Numerical results on WHU-RS19 and comparison with the state-of-the-art (columns give the percentage of training images per category)

| Algorithm | 40% | 60% |
|---|---|---|
| MSDRBE | 0.9821 ± 0.0040 | 0.9854 ± 0.0058 |
| DRB-R | 0.9540 ± 0.0059 | 0.9619 ± 0.0081 |
| DRB-D | 0.9594 ± 0.0047 | 0.9695 ± 0.0070 |
| DRB-I | 0.9490 ± 0.0060 | 0.9544 ± 0.0097 |
| GCLBP [41] | - | 0.9100 ± 0.0150 |
| MS-CLBP+FV [42] | - | 0.9432 ± 0.0120 |
| MTJSLRC [43] | - | 0.9174 ± 0.0114 |
| CaffeNet [2] | 0.9511 ± 0.0120 | 0.9624 ± 0.0056 |
| VGG-VD-16 [2] | 0.9544 ± 0.0060 | 0.9605 ± 0.0091 |
| GoogLeNet [2] | 0.9312 ± 0.0082 | 0.9471 ± 0.0133 |
| SalM3LBP-CLM [44] | 0.9535 ± 0.0076 | 0.9638 ± 0.0082 |
| SalM3LBP [44] | 0.8974 ± 0.0184 | 0.9258 ± 0.0089 |
| salCLM(eSIF) [44] | 0.9381 ± 0.0091 | 0.9592 ± 0.0095 |
| TEX-Net-LF [45] | 0.9761 ± 0.0036 | 0.9800 ± 0.0046 |
| ARCNet-VGG [46] | 0.9750 ± 0.0049 | 0.9975 ± 0.0025 |
| SeRBIA [47] | 0.9757 ± 0.0060 | 0.9802 ± 0.0054 |

Table 17.5: Numerical results on RSSCN7 and comparison with the state-of-the-art (columns give the percentage of training images per category)

| Algorithm | 20% | 50% |
|---|---|---|
| MSDRBE | 0.9132 ± 0.0052 | 0.9308 ± 0.0052 |
| DRB-R | 0.8828 ± 0.0058 | 0.9064 ± 0.0060 |
| DRB-D | 0.8624 ± 0.0051 | 0.8873 ± 0.0073 |
| DRB-I | 0.8515 ± 0.0080 | 0.8709 ± 0.0071 |
| CaffeNet [2] | 0.8557 ± 0.0095 | 0.8825 ± 0.0062 |
| VGG-VD-16 [2] | 0.8398 ± 0.0087 | 0.8718 ± 0.0094 |
| GoogLeNet [2] | 0.8255 ± 0.0111 | 0.8584 ± 0.0092 |
| TEX-Net-LF [45] | 0.8861 ± 0.0046 | 0.9125 ± 0.0058 |
| SeRBIA [47] | 0.9485 ± 0.0037 | 0.9619 ± 0.0057 |

17.4.4. Discussion

Tables 17.3–17.5 show that MSDRBE outperforms most of the state-of-the-art approaches across the three aerial scene classification problems. The classification accuracy rate obtained by MSDRBE on the UCMerced dataset is 97.18% if 50% of the images are used for training and 97.91% if 80% of the images are used for training. MSDRBE achieves 98.21% and 98.54% classification accuracy on the WHU-RS19 dataset given 40% and 60% of the images for training, respectively. For the RSSCN7 dataset, the classification accuracy rates of MSDRBE on the testing images are 91.32% and 93.08% under the two training–testing split ratios, respectively. The classification accuracy comparison between MSDRBE and its three ensemble components, namely, DRB-R, DRB-D, and DRB-I, further justifies the efficacy of the proposed ensemble framework.

However, it can be observed from the performance comparison that ARCNet-VGG [46] achieves slightly higher classification accuracy than MSDRBE on the UCMerced dataset with 80% of the images for training and on the WHU-RS19 dataset with 60% of the images for training (VGG-16-CapsNet [8] also exceeds MSDRBE in the former setting; see Table 17.3). The main reason is that, although both MSDRBE and ARCNet-VGG utilized the transfer learning technique, ARCNet-VGG directly used the UCMerced and WHU-RS19 datasets for knowledge transferring. Meanwhile, MSDRBE used an alternative aerial image set, namely, NWPU45, for fine-tuning the DNNs, and its performance was then evaluated on other datasets. In this way, MSDRBE actually performed knowledge transferring twice during the experiments. The first knowledge transfer happened during the fine-tuning process on the NWPU45 dataset, and the second happened during the feature extraction process on the UCMerced, WHU-RS19, and RSSCN7 datasets. The significant variability in the geometrical structures and spatial patterns exhibited by images of different aerial image sets inevitably has an adverse influence on the performance of MSDRBE. Conversely, since ARCNet-VGG was fine-tuned on the UCMerced and WHU-RS19 datasets directly, it performed less well when fewer training images were available. Therefore, it was outperformed by MSDRBE in the other two cases (UCMerced with 50% and WHU-RS19 with 40% of the images for training).

SeRBIA was able to surpass MSDRBE on the RSSCN7 dataset thanks to its unique chunk-by-chunk semi-supervised learning mechanism. Although it uses only the pre-trained VGGNet and AlexNet [48] for feature extraction, SeRBIA [47] is capable of self-learning from unlabeled images and self-expanding its knowledge base for more precise classification. On the other hand, MSDRBE managed to outperform SeRBIA on the UCMerced and WHU-RS19 datasets thanks to the highly discriminative features extracted by its three fine-tuned DNNs. Thus, one may expect that MSDRBE can produce even better classification results if it adopts semi-supervised learning techniques similar to those used by SeRBIA. Likewise, one could expect SeRBIA to perform much better on aerial scene classification tasks if the fine-tuned DNNs are employed as its feature descriptors.

17.5. Applications to Large-Scale Satellite Sensor Images

In this section, numerical examples based on large-scale satellite sensor images are presented to demonstrate the applicability of MSDRBE in real-world scenarios, following an experimental protocol similar to the one used for SeRBIA [47].

Figure 17.12: Examples of massively parallel IF...THEN rules. One rule is shown per land-use category, each of the form IF (I ~ prototype) OR (I ~ prototype) OR … OR (I ~ prototype) THEN (category), where every prototype is an image (not reproduced here); for example, IF (I ~ prototype) OR … OR (I ~ prototype) THEN (Agricultural).

To perform automatic image analysis, the UCMerced dataset was first used to prime MSDRBE. The UCMerced dataset contains 21 commonly seen land-use categories and, in total, 2100 images with the same spatial resolution (0.3 m). This dataset is very suitable for training MSDRBE thanks to its relatively small image size, simple structures, and limited variety of semantic content within a single image. From the images of this dataset, each ensemble component of MSDRBE self-organized 21 massively parallel IF...THEN rules. Examples of the IF...THEN rules identified by MSDRBE are given in Figure 17.12.

Twelve satellite sensor images of different areas of the UK were downloaded from Google Earth (Google Inc.) with spatial resolutions varying from 0.3 m to 30 m and the same image size of 600 × 800 pixels. The 12 images are given in Figures 17.13–17.24; they exhibit great variations in geometric structure and semantic content.

Figure 17.13: Classification result on satellite sensor image 1.

Figure 17.14: Classification result on satellite sensor image 2.

During the numerical examples, each satellite sensor image, denoted as $I$, was first segmented by a grid with a cell size of 200 × 200 pixels into 3 × 4 non-overlapping sub-regions. Then, MSDRBE read these image segments one by one and produced a set of overall scores of confidence for each image segment, denoted as $\hat{\lambda}_1(I_k), \hat{\lambda}_2(I_k), \ldots, \hat{\lambda}_C(I_k)$, using Eq. (17.14), where $C = 21$ and $I_k$ stands for the $k$th segment of $I$.
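The grid segmentation step can be sketched as follows; the function name and the (row, column) keying are illustrative choices, not taken from the chapter.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def grid_boxes(height: int, width: int, cell: int = 200) -> Dict[Tuple[int, int], Box]:
    """Map (row, column) grid positions to non-overlapping cell boxes."""
    if height % cell or width % cell:
        raise ValueError("image size must be a multiple of the cell size")
    return {
        (r, c): (r * cell, c * cell, (r + 1) * cell, (c + 1) * cell)
        for r in range(height // cell)
        for c in range(width // cell)
    }


# A 600 x 800 image yields 3 rows x 4 columns of 200 x 200 segments.
segments = grid_boxes(600, 800)
```

Each box can be applied to an image array as `image[top:bottom, left:right]` to obtain one segment.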

Figure 17.15: Classification result on satellite sensor image 3.

By using Condition 2, the joint decision maker identified one or multiple land-use categories that share similar high-level semantic features with $I_k$ [47]:

$$\text{Condition 2:}\quad \text{IF}\ \Big(\varphi\,\hat{\lambda}_n(I_k) > \max_{j=1,2,\ldots,C} \hat{\lambda}_j(I_k)\Big)\ \text{THEN}\ \big(I_k \text{ possesses semantic features of } \text{category}_n\big), \tag{17.16}$$

where $\varphi = 1.1$ in this research.
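Condition 2 amounts to keeping every category whose score, scaled by the factor 1.1, exceeds the best score. A minimal sketch, assuming the overall scores of one segment are given as a list of positive numbers (the function name and example values are hypothetical):

```python
from typing import List, Sequence


def relevant_categories(scores: Sequence[float], phi: float = 1.1) -> List[int]:
    """Indices n with phi * scores[n] > max(scores), as in Condition 2 (Eq. 17.16)."""
    top = max(scores)
    return [n for n, s in enumerate(scores) if phi * s > top]


# Hypothetical overall scores for five categories of one segment.
selected = relevant_categories([0.50, 0.48, 0.30, 0.46, 0.10])
```

With positive scores the best-scoring category always satisfies the condition, so at least one category is selected per segment.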

Figure 17.16: Classification result on satellite sensor image 4.

Figure 17.17: Classification result on satellite sensor image 5.

Assuming that there were $L_k$ land-use categories satisfying Condition 2, denoted as $\text{category}_1^*, \text{category}_2^*, \ldots, \text{category}_{L_k}^*$, the respective likelihoods of the $L_k$ most relevant land-use categories associated with $I_k$ can be calculated as [47]

$$\Lambda_n^* = \frac{\tilde{\lambda}_n^*(I_k)}{\sum_{j=1}^{L_k} \tilde{\lambda}_j^*(I_k)}, \tag{17.17}$$

where $\Lambda_n^*$ is the likelihood of $\text{category}_n^*$, and $\tilde{\lambda}_n^*(I_k)$ is the score of confidence corresponding to $\text{category}_n^*$, standardized by the mean $\upsilon_k$ and standard deviation $\delta_k$ of the $C$ overall scores of confidence calculated on $I_k$ (namely, $\hat{\lambda}_1(I_k), \hat{\lambda}_2(I_k), \ldots, \hat{\lambda}_C(I_k)$):

$$\tilde{\lambda}_n^*(I_k) = \frac{\hat{\lambda}_n^*(I_k) - \upsilon_k}{\delta_k}.$$

MSDRBE completed the analysis of $I$ once all its image segments had been processed. The analysis results for the 12 satellite sensor images are also presented in Figures 17.13–17.24 in the form of a 3 × 4 table with white and yellow backgrounds. In this research, the maximum value of $L_k$ was set to 5 for visual clarity [47].
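The likelihood computation of Eq. (17.17) can be sketched as follows. Several details are assumptions rather than statements from the chapter: the population form of the standard deviation, the way the cap of five categories is applied (keeping the highest-scoring ones), the guard against a non-positive normalizer, and all names and example values.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def category_likelihoods(scores: Sequence[float], phi: float = 1.1,
                         max_categories: int = 5) -> Dict[int, float]:
    """Likelihoods of the categories that satisfy Condition 2 for one segment."""
    top = max(scores)
    selected = [n for n, s in enumerate(scores) if phi * s > top]
    # Keep at most max_categories, preferring the highest scores (an assumption).
    selected = sorted(selected, key=lambda n: scores[n], reverse=True)[:max_categories]

    mu, sigma = mean(scores), pstdev(scores)  # statistics over all C scores
    if sigma == 0:
        raise ValueError("all scores are identical; standardization is undefined")
    standardized = {n: (scores[n] - mu) / sigma for n in selected}

    total = sum(standardized.values())
    if total <= 0:
        raise ValueError("standardized scores do not sum to a positive value")
    return {n: z / total for n, z in standardized.items()}


# Hypothetical overall scores for five categories of one segment.
likelihoods = category_likelihoods([0.50, 0.48, 0.30, 0.46, 0.10])
```

Since every selected score is divided by the same standard deviation, that factor cancels in the ratio; only the offsets from the mean determine the likelihoods.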

Figure 17.18: Classification result on satellite sensor image 6.

From Figures 17.13–17.24 one can see that MSDRBE is able to classify the local regions of large-scale satellite images into one or multiple land-use categories with very high precision thanks to its ensemble structure and the fine-tuned DNN-based feature descriptors. In most cases, it can also accurately describe the respective likelihoods of the most relevant land-use categories associated with the sub-regions of these images. The results show the great potential of MSDRBE for analyzing large-scale satellite sensor images automatically.

However, it is also noticeable that MSDRBE made completely incorrect categorizations in some rare cases, such as the sub-region "A2" in image 12, which should be "Mobile Home Park" instead of "Harbor." Due to the high similarity amongst high-level semantic features shared by different land-use categories, MSDRBE may confuse land-use categories such as "Baseball Diamond" and "Meadow," "Storage Tank" and "Buildings," "Tennis Court" and "Buildings," and "Mobile Home Park" and "Dense Residential." MSDRBE may also sometimes ignore other land-use categories if it assigns a certain sub-region to "Intersection," even though high-level semantic features of other land-use categories also appear in the same sub-region. This is due to the high similarity of high-level semantic features between images of the land-use category "Intersection" and images of other categories such as "Dense Residential" and "Medium Residential." Nevertheless, such issues can be addressed by making a preselection on benchmark datasets and removing the less representative images [47].

Figure 17.19: Classification result on satellite sensor image 7.

Figure 17.20: Classification result on satellite sensor image 8.

Figure 17.21: Classification result on satellite sensor image 9.

Figure 17.22: Classification result on satellite sensor image 10.

Figure 17.23: Classification result on satellite sensor image 11.

Figure 17.24: Classification result on satellite sensor image 12.

17.6. Conclusion

In this research, a novel MSDRBE system is introduced for aerial scene classification. The proposed approach integrates three DRB systems to form an ensemble, and each DRB system employs a different DNN as its feature descriptor. Instead of relying only on weights pretrained on natural images, the three DNNs used by MSDRBE are fine-tuned on aerial images to improve their capability to capture the more discriminative high-level semantic features from aerial images, allowing the DRB systems to self-organize a set of massively parallel IF...THEN rules composed of highly representative prototypes. A joint decision-making mechanism based on weighted voting is further adopted to maximize the prediction precision of the proposed approach. Numerical examples show the strong performance of the proposed MSDRBE, which surpasses most of the state-of-the-art approaches on a variety of benchmark aerial scene classification problems. Applications to satellite sensor images further demonstrate the strong ability of MSDRBE in handling real-world uncertainties and solving real-world problems.

As for future work, there are several considerations. Firstly, a semi-supervised learning mechanism could be introduced to MSDRBE, enabling it to autonomously learn from unlabeled images and self-expand its knowledge base for more precise predictions. Secondly, a high-quality, large-scale aerial image set composed of images with different levels of scale, illumination, and resolution, covering most of the commonly seen land-use categories, needs to be constructed in order to train a more effective MSDRBE for satellite sensor image analysis. Lastly, more creative joint decision-making mechanisms could be developed to further enhance the prediction precision, for example, by giving different weights to different ensemble components based on their classification accuracy rates on training images.

References

[1] D. Hou, Z. Miao, H. Xing, and H. Wu, Two novel benchmark datasets from ArcGIS and Bing world imagery for remote sensing image retrieval, Int. J. Remote Sens., 42(1), 240–258 (2021). ISSN 0143-1161.
[2] G. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, and L. Zhang, AID: a benchmark dataset for performance evaluation of aerial scene classification, IEEE Trans. Geosci. Remote Sens., 55(7), 3965–3981 (2017). ISSN 0196-2892.
[3] X. Zhu, D. Tuia, L. Mou, G. Xia, L. Zhang, F. Xu, and F. Fraundorfer, Deep learning in remote sensing: a comprehensive review and list of resources, IEEE Geosci. Remote Sens. Mag., 5(4), 8–36 (2017). ISSN 2473-2397.
[4] Y. Li, H. Zhang, X. Xue, Y. Jiang, and Q. Shen, Deep learning for remote sensing image classification: a survey, WIREs Data Min. Knowl. Discov., 8(6), e1264 (2018). ISSN 1942-4795.
[5] G. Cheng, J. Han, and X. Lu, Remote sensing image scene classification: benchmark and state of the art, Proc. IEEE, 105(10), 1865–1883 (2017). ISSN 0018-9219.
[6] F. Zhang, B. Du, and L. Zhang, Scene classification via a gradient boosting random convolutional network framework, IEEE Trans. Geosci. Remote Sens., 54(3), 1793–1802 (2016). ISSN 0196-2892.
[7] C. Zhang, G. Li, and S. Du, Multi-scale dense networks for hyperspectral remote sensing image classification, IEEE Trans. Geosci. Remote Sens., 57(11), 9201–9222 (2019). ISSN 0196-2892.
[8] W. Zhang, P. Tang, and L. Zhao, Remote sensing image scene classification using CNN-CapsNet, Remote Sens., 11(5), 494 (2019). ISSN 2072-4292.
[9] X. Liu, Y. Zhou, J. Zhao, R. Yao, B. Liu, and Y. Zheng, Siamese convolutional neural networks for remote sensing scene classification, IEEE Geosci. Remote Sens. Lett., 16(8), 1200–1204 (2019). ISSN 1545-598X.
[10] X. Gu, P. P. Angelov, C. Zhang, and P. M. Atkinson, A massively parallel deep rule-based ensemble classifier for remote sensing scenes, IEEE Geosci. Remote Sens. Lett., 15(3), 345–349 (2018). ISSN 1545-598X.
[11] Q. Weng, Z. Mao, J. Lin, and W. Guo, Land-use classification via extreme learning classifier based on deep convolutional features, IEEE Geosci. Remote Sens. Lett., 14(5), 704–708 (2017). ISSN 1545-598X.
[12] G. Cheng, Z. Li, X. Yao, L. Guo, and Z. Wei, Remote sensing image scene classification using bag of convolutional features, IEEE Geosci. Remote Sens. Lett., 14(10), 1735–1739 (2017). ISSN 1545-598X.
[13] A. F. K. Guy, T. Akram, B. Laurent, S. Rameez, M. Mbom, and N. Muhammad, A deep heterogeneous feature fusion approach for automatic land-use classification, Inf. Sci. (Ny), 467, 199–218 (2018). ISSN 0020-0255.
[14] N. S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, Am. Stat., 46(3), 175–185 (1992). ISSN 0003-1305.
[15] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge: Cambridge University Press, 2000).
[16] L. Breiman, Random forests, Mach. Learn., 45(1), 5–32 (2001). ISSN 0885-6125.
[17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and F. Li, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis., 115(3), 211–252 (2015). ISSN 0920-5691.
[18] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, pp. 1–14, San Diego, USA (2015).
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, Boston, USA (2015).


[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, Las Vegas, USA (2016).
[21] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, USA (2016).
[22] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, Hawaii, USA (2017).
[23] P. P. Angelov and X. Gu, Deep rule-based classifier with human-level performance and characteristics, Inf. Sci. (Ny), 463–464, 196–213 (2018). ISSN 0020-0255.
[24] X. Gu and P. P. Angelov, A deep rule-based approach for satellite scene image analysis. In IEEE International Conference on Systems, Man and Cybernetics, pp. 2778–2783, Miyazaki, Japan (2018).
[25] D. C. R. Novitasari, D. Wahyuni, M. Munir, I. Hidayati, F. M. Amin, and K. Oktafianto, Automatic detection of breast cancer in mammographic image using the histogram oriented gradient (HOG) descriptor and deep rule based (DRB) classifier method. In International Conference on Advanced Mechatronics, Intelligent Manufacture and Industrial Automation, pp. 185–190, Batu-Malang, Indonesia (2018).
[26] A. B. Sargano, X. Gu, P. Angelov, and Z. Habib, Human action recognition using deep rule-based classifier, Multimed. Tools Appl., 79, 30653–30667 (2020). ISSN 1380-7501.
[27] R. P. de Lima and K. Marfurt, Convolutional neural network for remote-sensing scene classification: transfer learning analysis, Remote Sens., 12(1), 86 (2020). ISSN 2072-4292.
[28] X. Gu and P. Angelov, Deep rule-based aerial scene classifier using high-level ensemble feature descriptor. In International Joint Conference on Neural Networks, pp. 1–7, Budapest, Hungary (2019).
[29] X. Gu and P. P. Angelov, Self-organising fuzzy logic classifier, Inf. Sci. (Ny), 447, 36–51 (2018). ISSN 0020-0255.
[30] P. P. Angelov, X. Gu, and J. C. Principe, A generalized methodology for data analysis, IEEE Trans. Cybern., 48(10), 2981–2993 (2018). ISSN 2168-2267.
[31] M. I. Lakhal, H. Cevikalp, S. Escalera, and F. Ofli, Recurrent neural networks for remote sensing image classification, IET Comput. Vis., 12(7), 1040–1045 (2018). ISSN 1751-9632.
[32] F. Hu, G. Xia, J. Hu, and L. Zhang, Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery, Remote Sens., 7(11), 14680–14707 (2015). ISSN 2072-4292.
[33] W. Li, Z. Wang, Y. Wang, J. Wu, J. Wang, Y. Jia, and G. Gui, Classification of high-spatial-resolution remote sensing scenes method using transfer learning and deep convolutional neural network, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 13, 1986–1995 (2020). ISSN 1939-1404.
[34] B. Cui, X. Chen, and Y. Lu, Semantic segmentation of remote sensing images using transfer learning and deep convolutional neural network with dense connection, IEEE Access, 8, 116744–11675 (2020). ISSN 2169-3536.
[35] Z. Huang, Z. Pan, and B. Lei, Transfer learning with deep convolutional neural network for SAR target classification with limited labeled data, Remote Sens., 9(9), 1–21 (2017). ISSN 2072-4292.
[36] Y. Yang and S. Newsam, Bag-of-visual-words and spatial extensions for land-use classification. In International Conference on Advances in Geographic Information Systems, pp. 270–279, San Jose, USA (2010).
[37] G. Xia, W. Yang, J. Delon, Y. Gousseau, H. Sun, and H. Maitre, Structural high-resolution satellite image indexing. In ISPRS, TC VII Symposium Part A: 100 Years ISPRS - Advancing Remote Sensing Science, pp. 298–303, Vienna, Austria (2010).


[38] Q. Zou, L. Ni, T. Zhang, and Q. Wang, Deep learning based feature selection for remote sensing scene classification, IEEE Geosci. Remote Sens. Lett., 12(11), 2321–2325 (2015). ISSN 1545– 598X. [39] D. P. Kingma and J. L. Ba, Adam: a method for stochastic optimization. In International Conference on Learning Representations, pp. 1–15, San Diego, USA (2015). [40] A. Bahri, S. G. Majelan, S. Mohammadi, M. Noori, and K. Mohammadi, Remote sensing image classification via improved cross-entropy loss and transfer learning strategy based on deep convolutional neural networks, IEEE Geosci. Remote Sens. Lett., 17(6), 1087–1091 (2020). ISSN 1545–598X. [41] C. Chen, L. Zhou, J. Guo, W. Li, H. Su, and F. Guo, Gabor-filtering-based completed local binary patterns for land-use scene classification. In IEEE international conference on Multimedia Big Data, pp. 324–329, Beijing, China (2015). [42] L. Huang, C. Chen, W. Li, and Q. Du, Remote sensing image scene classification using multiscale completed local binary patterns and fisher vectors, Remote Sens., 8(6), 1–17 (2016). ISSN 2072–4292. [43] K. Qi, W. Liu, C. Yang, Q. Guan, and H. Wu, Multi-task joint sparse and low-rank representation for the scene classification of high-resolution remote sensing image, Remote Sens., 9(1), 10 (2017). ISSN 2072–4292. [44] X. Bian, C. Chen, L. Tian, and Q. Du, Fusing local and global features for high-resolution scene classification, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 10(6), 2889–2901 (2017). ISSN 1939–1404. [45] R. M. Anwer, F. S. Khan, J. van de Weijer, M. Molinier, and J. Laaksonen, Binary patterns encoded convolutional neural networks for texture recognition and remote sensing scene classification, ISPRS J. Photogramm. Remote Sens., 138, 74–85 (2018). ISSN 0924-2716. [46] Q. Wang, S. Member, S. Liu, and J. Chanussot, Scene classification with recurrent attention of VHR remote sensing images, IEEE Trans. Geosci. Remote Sens., 57(2), 1155–1167 (2019). ISSN 0196–2892. [47] X. 
Gu, P. P. Angelov, C. Zhang, and P. M. Atkinson, A semi-supervised deep rule-based approach for complex satellite sensor image analysis, IEEE Trans. Pattern Anal. Mach. Intell. (2021). ISSN 0182–8828. [48] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105, Lake Tahoe, USA (2012).


Part IV

Intelligent Control


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0018

Chapter 18

Fuzzy Model-Based Control: Predictive and Adaptive Approach

Igor Škrjanc∗ and Sašo Blažič†
Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, 1000 Ljubljana, Slovenia
∗[email protected], †[email protected]

This chapter deals with fuzzy model-based control and focuses on two approaches: a predictive one and an adaptive one. Both are based on the Takagi–Sugeno model form, which possesses the property of universal approximation of an arbitrary smooth nonlinear function and can therefore be used as a model to predict the future behavior of the plant. Despite many successful implementations of Mamdani fuzzy model-based approaches, it soon became clear that these approaches lack systematic ways to analyze the stability, performance, and robustness of the control system, as well as a systematic way of tuning the controller parameters to adjust the performance. Takagi–Sugeno fuzzy models, on the other hand, enable a more compact description of the nonlinear system and a rigorous treatment of stability and robustness. Perhaps their most important feature is that they can easily be combined with different linear control algorithms to cope with demanding nonlinear control problems.

18.1. Introduction

In this chapter, we face the problem of controlling a nonlinear plant. Classical linear approaches treat the plant as linear, and a linear controller is designed to meet the control objectives. Unfortunately, such an approach results in acceptable control performance only if the nonlinearity is not too strong. This statement is far from rigorous in its definitions of “acceptable” and “strong.” The fact is, however, that as the control performance requirements increase, the problems with


the nonlinearity also become much more apparent. So, there is a clear need to cope with the control of nonlinear plants.

The problem of controlling nonlinear plants has received a great deal of attention in the past. A natural solution is to use a nonlinear controller that tries to “cancel” the nonlinearity in some sense, or that at least improves the performance of the controlled system over a wide operating range with respect to the performance of an “optimal” linear controller. Since a controller is just a nonlinear dynamic system that maps controller inputs to controller actions, we need to somehow describe this nonlinear mapping and implement it in the controlled system. In the case of a finite-dimensional system, one possibility is to represent the controller in the state–space form, where four nonlinear mappings are needed: state-to-state mapping, input-to-state mapping, state-to-output mapping, and direct input-to-output mapping. The Stone–Weierstrass theorem guarantees that all these mappings can be approximated arbitrarily well by basis functions. A great many approximators of nonlinear functions have been proposed in the literature to solve the problem of nonlinear-system control. Some of the most popular ones are piecewise linear functions, fuzzy models, artificial neural networks, splines, and wavelets.

In this chapter, we focus on fuzzy controllers. Several excellent books cover various aspects of fuzzy control [1–4]. Following the seminal paper of Zadeh [5] that introduced fuzzy set theory, the fuzzy logic approach was soon introduced into controllers [6]. In these early days of fuzzy control, the controller was usually designed using Zadeh’s notion of linguistic variables and fuzzy algorithms. It was often claimed that the approach with linguistic variables, linguistic values, and linguistic rules is “parameterless,” and that the controllers are therefore extremely easy to tune based on expert knowledge that is easily transformable into the rule database. Many successful applications of fuzzy controllers have shown their ability to control nonlinear plants. But it soon became clear that the Mamdani fuzzy model control approach lacks systematic ways to analyze control system stability, performance, and robustness.

The Takagi–Sugeno fuzzy model [7] enables a more compact description of the fuzzy system. Moreover, it enables a rigorous treatment of stability and robustness in the form of linear matrix inequalities [3]. Several control algorithms originally developed for linear systems can be adapted so that they can be combined with Takagi–Sugeno fuzzy models. Thus, the number of control approaches based on a Takagi–Sugeno model proposed in the literature in the past three decades is huge. In this chapter, we show how to design predictive and adaptive controllers for certain classes of nonlinear systems where the plant model is given in the form of a Takagi–Sugeno fuzzy model. The proposed algorithms are not developed in an ad hoc manner, but with the stability of the overall system in mind. This is why the stability analysis of the algorithms complements the algorithms themselves.


18.2. Takagi–Sugeno Fuzzy Model

A typical fuzzy model [7] is given in the form of rules

$$
R_j:\ \text{if } x_{p1} \text{ is } A_{1,k_1} \text{ and } x_{p2} \text{ is } A_{2,k_2} \text{ and } \ldots \text{ and } x_{pq} \text{ is } A_{q,k_q} \text{ then } y = \phi_j(\mathbf{x}), \tag{18.1}
$$

with $j = 1, \ldots, m$, $k_1 = 1, \ldots, f_1$, $k_2 = 1, \ldots, f_2$, $\ldots$, $k_q = 1, \ldots, f_q$.

The $q$-element vector $\mathbf{x}_p^T = [x_{p1}, \ldots, x_{pq}]$ denotes the input, or the variables in the premise, and the variable $y$ is the output of the model. With each premise variable $x_{pi}$ ($i = 1, \ldots, q$), $f_i$ fuzzy sets ($A_{i,1}, \ldots, A_{i,f_i}$) are connected, and each fuzzy set $A_{i,k_i}$ ($k_i = 1, \ldots, f_i$) is associated with a real-valued function $\mu_{A_{i,k_i}}(x_{pi}): \mathbb{R} \to [0, 1]$ that produces the membership grade of the variable $x_{pi}$ with respect to the fuzzy set $A_{i,k_i}$. To make the list of fuzzy rules complete, all possible combinations of fuzzy sets are given in Eq. (18.1), yielding the number of fuzzy rules $m = f_1 \times f_2 \times \cdots \times f_q$. The variables $x_{pi}$ are not the only inputs of the fuzzy system. Implicitly, the $n$-element vector $\mathbf{x}^T = [x_1, \ldots, x_n]$ also represents the input to the system. It is usually referred to as the consequence vector. The functions $\phi_j(\cdot)$ can be arbitrary smooth functions in general, although linear or affine functions are usually used.

The system in Eq. (18.1) can be described in closed form if the intersection of fuzzy sets is previously defined. The generalized form of the intersection is the so-called triangular norm (T-norm). In our case, the latter was chosen as the algebraic product, yielding the output of the fuzzy system:

$$
\hat{y} = \frac{\sum_{k_1=1}^{f_1} \sum_{k_2=1}^{f_2} \cdots \sum_{k_q=1}^{f_q} \mu_{A_{1,k_1}}(x_{p1})\, \mu_{A_{2,k_2}}(x_{p2}) \cdots \mu_{A_{q,k_q}}(x_{pq})\, \phi_j(\mathbf{x})}{\sum_{k_1=1}^{f_1} \sum_{k_2=1}^{f_2} \cdots \sum_{k_q=1}^{f_q} \mu_{A_{1,k_1}}(x_{p1})\, \mu_{A_{2,k_2}}(x_{p2}) \cdots \mu_{A_{q,k_q}}(x_{pq})}. \tag{18.2}
$$

It has to be noted that a slight abuse of notation is used in Eq. (18.2), since $j$ is not explicitly defined as a running index. From Eq. (18.1), it is evident that each $j$ corresponds to a specific combination of the indexes $k_i$, $i = 1, \ldots, q$.

To simplify Eq. (18.2), a partition of unity is considered, where the functions $\beta_j(\mathbf{x}_p)$ defined by

$$
\beta_j(\mathbf{x}_p) = \frac{\mu_{A_{1,k_1}}(x_{p1})\, \mu_{A_{2,k_2}}(x_{p2}) \cdots \mu_{A_{q,k_q}}(x_{pq})}{\sum_{k_1=1}^{f_1} \sum_{k_2=1}^{f_2} \cdots \sum_{k_q=1}^{f_q} \mu_{A_{1,k_1}}(x_{p1})\, \mu_{A_{2,k_2}}(x_{p2}) \cdots \mu_{A_{q,k_q}}(x_{pq})}, \quad j = 1, \ldots, m, \tag{18.3}
$$

give information about the fulfillment of the respective fuzzy rule in the normalized form. It is obvious that $\sum_{j=1}^{m} \beta_j(\mathbf{x}_p) = 1$ irrespective of $\mathbf{x}_p$, as long as the denominator of $\beta_j(\mathbf{x}_p)$ is not equal to zero (that can be easily prevented by stretching


the membership functions over the whole potential area of $\mathbf{x}_p$). Combining Eqs. (18.2) and (18.3) and changing the summation over $k_i$ to a summation over $j$, we arrive at the following equation:

$$
\hat{y} = \sum_{j=1}^{m} \beta_j(\mathbf{x}_p)\, \phi_j(\mathbf{x}). \tag{18.4}
$$

This class of fuzzy models has the form of a linear model in which $\{\beta_j\}$ acts as a set of basis functions. The use of membership functions in the input space with overlapping receptive fields provides interpolation and extrapolation. Very often, the output value is defined as a linear combination of the consequence states:

$$
\phi_j(\mathbf{x}) = \boldsymbol{\theta}_j^T \mathbf{x}, \quad j = 1, \ldots, m, \quad \boldsymbol{\theta}_j^T = [\theta_{j1}, \ldots, \theta_{jn}]. \tag{18.5}
$$

If a Takagi–Sugeno model of the 0-th order is chosen, $\phi_j(\mathbf{x}) = \theta_{j0}$, and in the case of the first-order model, the consequent is $\phi_j(\mathbf{x}) = \theta_{j0} + \boldsymbol{\theta}_j^T \mathbf{x}$. Both cases can be treated by the model in Eq. (18.5) by adding 1 to the vector $\mathbf{x}$ and augmenting the vector $\boldsymbol{\theta}_j$ with $\theta_{j0}$. To simplify the notation, only the model in Eq. (18.5) will be treated in the rest of the chapter. If the matrix of the coefficients for the whole set of rules is written as $\boldsymbol{\Theta}^T = [\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_m]$ and the vector of membership values as $\boldsymbol{\beta}^T(\mathbf{x}_p) = [\beta_1(\mathbf{x}_p), \ldots, \beta_m(\mathbf{x}_p)]$, then Eq. (18.4) can be rewritten in the matrix form as

$$
\hat{y} = \boldsymbol{\beta}^T(\mathbf{x}_p)\, \boldsymbol{\Theta}\, \mathbf{x}. \tag{18.6}
$$
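The evaluation of Eqs. (18.3) and (18.6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian membership functions, their centers and widths, the placeholder matrix `Theta`, and all function names are hypothetical and are not taken from the chapter.

```python
import itertools
import numpy as np

def gaussian(x, centers, sigmas):
    """Gaussian membership grades of a scalar x for several fuzzy sets (assumed shape)."""
    return np.exp(-0.5 * ((x - np.asarray(centers)) / np.asarray(sigmas)) ** 2)

def normalized_fulfilment(x_p, centers, sigmas):
    """beta_j(x_p) of Eq. (18.3) for all m = f_1 * ... * f_q rules."""
    # grades[i][k] is the membership grade of x_p[i] in the fuzzy set A_{i,k}
    grades = [gaussian(x_p[i], centers[i], sigmas[i]) for i in range(len(x_p))]
    # algebraic-product T-norm for every combination (k_1, ..., k_q), one rule each
    firing = np.array([np.prod(combo) for combo in itertools.product(*grades)])
    return firing / firing.sum()

def ts_output(x_p, x, Theta, centers, sigmas):
    """Model output of Eq. (18.6); row j of Theta holds theta_j^T."""
    return normalized_fulfilment(x_p, centers, sigmas) @ Theta @ x

# Hypothetical example: q = 2 premise variables with f_1 = 2 and f_2 = 3 sets,
# hence m = 6 rules; the consequence vector is x = [x_1, x_2, 1].
centers = [[0.0, 1.0], [-1.0, 0.0, 1.0]]
sigmas = [[0.5, 0.5], [0.6, 0.6, 0.6]]
Theta = np.arange(18, dtype=float).reshape(6, 3)  # placeholder parameters
x_p = np.array([0.3, -0.2])
x = np.array([0.3, -0.2, 1.0])

beta = normalized_fulfilment(x_p, centers, sigmas)
y_hat = ts_output(x_p, x, Theta, centers, sigmas)
```

Because the normalized degrees of fulfilment sum to one, a model whose rules all share the same consequent reduces to that single affine function, which is a convenient sanity check for an implementation.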

The fuzzy model in the form given in Eq. (18.6) is referred to as the affine Takagi–Sugeno model and can be used to approximate any continuous function that maps the compact set $C \subset \mathbb{R}^d$ ($d$ is the dimension of the input space) to $\mathbb{R}$ with any desired degree of accuracy [8–10]. The generality can be proven by the Stone–Weierstrass theorem [11], which indicates that any continuous function can be approximated by a fuzzy basis function expansion [12].

When identifying the fuzzy model, there are several parameters that can be tuned. One possibility is to identify only the parameters in the rule consequents and leave the antecedent-part parameters untouched. If the position of the membership functions is good (the input space of interest is completely covered and the density of membership functions is higher where the nonlinearity is stronger), then a good model can be obtained by identifying only the consequents. The price for this is that any existing prior knowledge has to be introduced in the design of the membership functions. If, however, we do not know anything about the controlled system, we can use evolving systems techniques, where the process of identification changes not only the consequent parameters but also the antecedent parameters [13–17].
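When the membership functions are kept fixed, the model in Eq. (18.6) is linear in the consequent parameters, so they can be estimated by ordinary least squares. The sketch below shows one common way of doing this (a global least-squares fit); it is not claimed to be the identification procedure used by the authors, and the data, membership functions, and function names are hypothetical.

```python
import numpy as np

def normalized_gaussians(u, centers, sigma):
    """Normalized degrees of fulfilment for a scalar premise variable (assumed Gaussian sets)."""
    mu = np.exp(-0.5 * ((u[:, None] - centers[None, :]) / sigma) ** 2)
    return mu / mu.sum(axis=1, keepdims=True)

def identify_consequents(beta, X, y):
    """Least-squares estimate of Theta (m x n) from y(k) = beta(k)^T Theta x(k)."""
    N, m = beta.shape
    n = X.shape[1]
    # regressor row k: [beta_1(k) x(k)^T, ..., beta_m(k) x(k)^T]
    Psi = (beta[:, :, None] * X[:, None, :]).reshape(N, m * n)
    theta, *_ = np.linalg.lstsq(Psi, y, rcond=None)
    return theta.reshape(m, n)

# Hypothetical data generated by a known five-rule model with x = [u, 1]
rng = np.random.default_rng(0)
u = rng.uniform(-3.0, 3.0, size=200)
centers = np.linspace(-3.0, 3.0, 5)
beta = normalized_gaussians(u, centers, sigma=0.8)
X = np.column_stack([u, np.ones_like(u)])
Theta_true = rng.normal(size=(5, 2))
y = np.einsum("kj,jn,kn->k", beta, Theta_true, X)

Theta_hat = identify_consequents(beta, X, y)
y_fit = np.einsum("kj,jn,kn->k", beta, Theta_hat, X)
```

With noise-free data generated by a model of the same structure, the fitted output reproduces the data up to rounding; with measured data, the quality of the fit depends on how well the fixed membership functions cover the nonlinearity.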


18.3. Fuzzy Model-Based Predictive Control

The fundamental methods that are essentially based on the principle of predictive control are generalized predictive control [18], model algorithmic control [19], predictive functional control [20], dynamic matrix control [21], extended prediction self-adaptive control [22], and extended horizon adaptive control [23]. All these methods have been developed for linear process models. The principle is based on predicting the process model output and calculating the control signal that brings the process output to the reference trajectory so as to minimize the difference between the reference and the output signal, either over a certain interval between two prediction horizons or at a certain horizon, called the coincidence horizon. The control signal can be found by means of optimization, or it can be calculated using an explicit control law formula [24, 25].

The nature of processes is inherently nonlinear, and this implies the use of nonlinear approaches in predictive control schemes. Here, we can distinguish between two main groups of approaches: the first group is based on nonlinear mathematical models of the process in any form and convex optimization [26], while the second group relies on the approximation of the nonlinear process dynamics with nonlinear approximators such as neural networks [10, 27], piecewise-linear models [28], Volterra and Wiener models [29], multimodels and multivariables [30, 31], and fuzzy models [32, 33]. The advantage of the latter approaches is the possibility of stating the control law in explicit analytical form. In some highly nonlinear cases, the use of nonlinear model-based predictive control can be easily justified. Introducing a nonlinear model into the predictive control problem, however, increases the complexity significantly. An overview of different nonlinear predictive control approaches is given in [24, 25].

When applying model-based predictive control with a Takagi–Sugeno fuzzy model, the choice of the fuzzy sets and the corresponding membership functions is always important. Many existing clustering techniques can be used in the identification phase to make this task easier. There exist many fuzzy model-based predictive algorithms [34–36] that put significant stress on the algorithm that properly arranges the membership functions.

The basic idea of model-based predictive control is to predict the future behavior of the process over a certain horizon using the dynamic model and to obtain the control actions that minimize a certain criterion. Traditionally, the control problem is formally stated as an optimization problem, where the goal is to obtain control actions on a relatively short control horizon by optimizing the behavior of the controlled system over a larger prediction horizon. One of the main properties of most predictive controllers is that they utilize the so-called receding horizon approach. This means that even though the control signal is obtained on a larger interval in


the current point of time, only the first sample is used, and the whole optimization routine is repeated in each sampling instant.

Model-based predictive control relies heavily on the quality of the prediction model. When the plant is nonlinear, the model is also nonlinear. As is well known, a fuzzy model possesses the property of universal approximation of an arbitrary smooth nonlinear function and can be used as a prediction model of a nonlinear plant. Many approaches originally developed for the control of linear systems can be adapted to include a fuzzy model-based predictor. In this chapter, the fuzzy model is incorporated into the predictive functional control (PFC) approach. This combination provides the means of controlling nonlinear systems. Since no explicit optimization is used, the approach is very suitable for implementation on industrial hardware. The fuzzy model-based predictive control (FMBPC) algorithm is presented in the state–space form [37]. The approach is an extension of the predictive functional algorithm [20] to nonlinear systems. This approach has been used successfully in many different practical implementations [38–40]. The proposed algorithm easily copes with nonminimum-phase and time-delayed dynamics. The approach can be combined with a clustering technique to obtain an FMBPC algorithm in which the membership functions are not fixed a priori, but this is beyond the scope of this work.

18.3.1. The Development of the Control Algorithm

In the case of FMBPC, the prediction of the plant output is given by its fuzzy model in the state–space domain. This is why the approach in the proposed form is limited to open-loop stable plants. With some modifications, the algorithm can also be made applicable to unstable plants. The problem of delays in the plant is circumvented by constructing an auxiliary variable that serves as the output the plant would have if no delay were present. The so-called undelayed model of the plant is introduced for that purpose. It is obtained by “removing” the delays from the original (“delayed”) model and converting it to the state–space description:

$$
\begin{aligned}
\mathbf{x}_m^0(k+1) &= \bar{\mathbf{A}}_m \mathbf{x}_m^0(k) + \bar{\mathbf{B}}_m u(k) + \bar{\mathbf{R}}_m, \\
y_m^0(k) &= \bar{\mathbf{C}}_m \mathbf{x}_m^0(k),
\end{aligned} \tag{18.7}
$$

where $y_m^0(k)$ models the “undelayed” output of the plant. The behavior of the closed-loop system is defined by the reference trajectory, which is given in the form of the reference model. The control goal is to determine the future control action so that the predicted output value coincides with the reference trajectory. The time difference between the coincidence point and the current time is


called the coincidence horizon. It is denoted by $H$. The prediction is calculated under the assumption of constant future manipulated variables ($u(k) = u(k+1) = \ldots = u(k+H-1)$), i.e., the mean level control assumption is used. The $H$-step-ahead prediction of the “undelayed” plant output is then obtained from Eq. (18.7):

$$
y_m^0(k+H) = \bar{\mathbf{C}}_m \left[ \bar{\mathbf{A}}_m^H \mathbf{x}_m^0(k) + \left( \bar{\mathbf{A}}_m^H - \mathbf{I} \right) \left( \bar{\mathbf{A}}_m - \mathbf{I} \right)^{-1} \left( \bar{\mathbf{B}}_m u(k) + \bar{\mathbf{R}}_m \right) \right]. \tag{18.8}
$$

The reference model is given by the first-order difference equation

$$
y_r(k+1) = a_r y_r(k) + b_r w(k), \tag{18.9}
$$

where $w$ stands for the reference signal. The reference model parameters should be chosen so that the reference model gain is unity. This is accomplished by fulfilling the following equation:

$$
(1 - a_r)^{-1} b_r = 1. \tag{18.10}
$$
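The reference model of Eqs. (18.9) and (18.10) is simple enough to sketch directly. The code below assumes, in addition to the text, that the reference signal $w$ stays constant over the horizon, which gives the closed-form value $y_r(k+H) = a_r^H y_r(k) + (1 - a_r^H) w$; this closed form and the function names are illustrative assumptions rather than formulas quoted from the chapter.

```python
def unity_gain_b(a_r):
    """b_r that satisfies Eq. (18.10), i.e., unity steady-state gain."""
    return 1.0 - a_r

def simulate_reference(a_r, w_sequence, y0=0.0):
    """Iterate Eq. (18.9) over a sequence of reference values."""
    b_r = unity_gain_b(a_r)
    y = [y0]
    for w_k in w_sequence:
        y.append(a_r * y[-1] + b_r * w_k)
    return y

def predict_reference(a_r, y_r, w, H):
    """y_r(k + H) for a reference w that is assumed constant over the horizon."""
    return a_r ** H * y_r + (1.0 - a_r ** H) * w

# Hypothetical values: a_r = 0.9, constant reference w = 2, horizon H = 5
trajectory = simulate_reference(0.9, [2.0] * 5, y0=0.5)
y_r_H = predict_reference(0.9, 0.5, 2.0, 5)
```

A value of $a_r$ closer to 1 gives a slower reference trajectory and therefore a less aggressive controller, while $|a_r| < 1$ keeps the reference model stable.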

The main goal of the proposed algorithm is to find the control law that enables the “undelayed” plant output $y_p^0(k)$ to track the reference trajectory. In each time instant, the control signal is calculated so that the output is forced to reach the reference trajectory after $H$ time samples ($y_p^0(k+H) = y_r(k+H)$). The idea of FMBPC is introduced through the equivalence of the objective increment vector $\Delta_p$ and the model output increment vector $\Delta_m$:

$$
\Delta_p = \Delta_m. \tag{18.11}
$$

The former is defined as the difference between the predicted reference signal $y_r(k+H)$ and the actual output of the “undelayed” plant $y_p^0(k)$:

$$
\Delta_p = y_r(k+H) - y_p^0(k). \tag{18.12}
$$

The variable $y_p^0(k)$ cannot be measured directly. Rather, it is estimated using the available signals:

$$
y_p^0(k) = y_p(k) - y_m(k) + y_m^0(k). \tag{18.13}
$$

It can be seen that the delay in the plant is compensated by the difference between the outputs of the “undelayed” and the “delayed” model. When a perfect model of the plant is available, the first two terms on the right-hand side of Eq. (18.13) cancel, and the result is actually the output of the “undelayed” plant model. If this is not the case, only an approximation is obtained. The model output increment vector $\Delta_m$ is


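defined by the following formula:

$$
\Delta_m = y_m^0(k+H) - y_m^0(k). \tag{18.14}
$$

The following is obtained from Eq. (18.11) by using Eqs. (18.12) and (18.14) and introducing Eq. (18.8):

$$
u(k) = g_0^{-1} \left( y_r(k+H) - y_p^0(k) + y_m^0(k) - \bar{\mathbf{C}}_m \bar{\mathbf{A}}_m^H \mathbf{x}_m^0(k) - \bar{\mathbf{C}}_m \left( \bar{\mathbf{A}}_m^H - \mathbf{I} \right) \left( \bar{\mathbf{A}}_m - \mathbf{I} \right)^{-1} \bar{\mathbf{R}}_m \right), \tag{18.15}
$$

where $g_0$ stands for

$$
g_0 = \bar{\mathbf{C}}_m \left( \bar{\mathbf{A}}_m^H - \mathbf{I} \right) \left( \bar{\mathbf{A}}_m - \mathbf{I} \right)^{-1} \bar{\mathbf{B}}_m. \tag{18.16}
$$

The control law of FMBPC in analytical form is finally obtained by introducing Eq. (18.13) into Eq. (18.15):

$$
u(k) = g_0^{-1} \left( y_r(k+H) - y_p(k) + y_m(k) - \bar{\mathbf{C}}_m \bar{\mathbf{A}}_m^H \mathbf{x}_m^0(k) - \bar{\mathbf{C}}_m \left( \bar{\mathbf{A}}_m^H - \mathbf{I} \right) \left( \bar{\mathbf{A}}_m - \mathbf{I} \right)^{-1} \bar{\mathbf{R}}_m \right). \tag{18.17}
$$

In the following, it will be shown that the realizability of the control law in Eq. (18.17) relies heavily on the relation between the coincidence horizon $H$ and the relative degree of the plant $\rho$. In the case of discrete-time systems, the relative degree is directly related to the pure time delay of the system transfer function. If the system is described in the state–space form, any form can be used in general, but the analysis is much simplified in the case of certain canonical descriptions. If the system is described in the controllable canonical form in each fuzzy domain, then the matrices $\bar{\mathbf{A}}_m$, $\bar{\mathbf{B}}_m$, and $\bar{\mathbf{C}}_m$ of the fuzzy model in Eq. (18.36) also take the controllable canonical form:

$$
\bar{\mathbf{A}}_m = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -\bar{a}_n & -\bar{a}_{n-1} & -\bar{a}_{n-2} & \cdots & -\bar{a}_1 \end{bmatrix}, \quad
\bar{\mathbf{B}}_m = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}, \quad
\bar{\mathbf{C}}_m = \begin{bmatrix} \bar{b}_n & \cdots & \bar{b}_\rho & 0 & \cdots & 0 \end{bmatrix}, \quad
\bar{\mathbf{R}}_m = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ \bar{r} \end{bmatrix}, \tag{18.18}
$$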


where $\bar{a}_j = \sum_{i=1}^{m} \beta_i a_{ji}$, $j = 1, \ldots, n$, $\bar{b}_j = \sum_{i=1}^{m} \beta_i b_{ji}$, $j = \rho, \ldots, n$, and $\bar{r} = \sum_{i=1}^{m} \beta_i (r_i / b_i)$, and where the parameters $a_{ji}$, $b_{ji}$, and $r_i$ are state–space model parameters defined as in Eq. (18.33). Note that the state–space system with the matrices from Eq. (18.18) has a relative degree $\rho$, which is reflected in the form of the matrix $\bar{\mathbf{C}}_m$: its last $(\rho - 1)$ elements are equal to 0, while $\bar{b}_\rho \neq 0$.

Prop 1. If the coincidence horizon $H$ is lower than the plant relative degree $\rho$ ($H < \rho$), then the control law in Eq. (18.17) is not applicable.

Proof. By taking into account the form of the matrices in Eq. (18.18), it can easily be seen that

$$
\left( \bar{\mathbf{A}}_m^H - \mathbf{I} \right) \left( \bar{\mathbf{A}}_m - \mathbf{I} \right)^{-1} \bar{\mathbf{B}}_m = \begin{bmatrix} 0 & \cdots & 0 & 1 & \ast & \cdots & \ast \end{bmatrix}^T, \tag{18.19}
$$

i.e., the first $(n - H)$ elements of the vector are zeros, then there is the element 1, followed by $(H - 1)$ arbitrary elements (denoted by $\ast$). It then follows from Eqs. (18.16) and (18.19) that $g_0 = 0$ if $\rho > H$, and consequently the control law cannot be implemented. $\square$

The closed-loop system analysis makes sense only in the case of a non-singular control law. Consequently, the choice of $H$ is confined to the interval $[\rho, \infty)$.
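The control law in Eq. (18.17) and the singularity described in Prop 1 can be checked numerically. The following sketch builds the matrices of Eq. (18.18) for a frozen set of averaged parameters and evaluates $g_0$ and $u(k)$; the third-order model, its numerical values, and all function names are hypothetical, and the code assumes that $\bar{\mathbf{A}}_m$ has no eigenvalue at 1 so that the inverse in Eq. (18.16) exists.

```python
import numpy as np

def canonical_model(a_bar, b_bar, r_bar):
    """Matrices of Eq. (18.18); a_bar = [a_1..a_n], b_bar = [b_rho..b_n]."""
    n = len(a_bar)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)                # shifted identity
    A[-1, :] = -np.asarray(a_bar)[::-1]       # last row: -a_n ... -a_1
    B = np.zeros(n)
    B[-1] = 1.0
    C = np.zeros(n)
    C[:len(b_bar)] = np.asarray(b_bar)[::-1]  # b_n ... b_rho 0 ... 0
    R = np.zeros(n)
    R[-1] = r_bar
    return A, B, C, R

def horizon_matrix(A, H):
    """(A^H - I)(A - I)^{-1}; requires that A has no eigenvalue at 1."""
    I = np.eye(A.shape[0])
    return (np.linalg.matrix_power(A, H) - I) @ np.linalg.inv(A - I)

def g0(A, B, C, H):
    """Scalar gain of Eq. (18.16)."""
    return C @ horizon_matrix(A, H) @ B

def fmbpc_control(A, B, C, R, H, x_m0, y_r_H, y_p, y_m, tol=1e-9):
    """Control law of Eq. (18.17) for one sampling instant."""
    g = g0(A, B, C, H)
    if abs(g) < tol:
        raise ValueError("g0 = 0: choose H >= relative degree (Prop 1)")
    free = C @ np.linalg.matrix_power(A, H) @ x_m0 + C @ horizon_matrix(A, H) @ R
    return (y_r_H - y_p + y_m - free) / g

# Hypothetical frozen model: n = 3, rho = 2, open-loop poles 0.4, 0.5, 0.6
A, B, C, R = canonical_model([-1.5, 0.74, -0.12], [1.0, 0.5], 0.2)
x_m0 = np.array([0.1, -0.3, 0.2])
y_m = C @ x_m0            # perfect, delay-free model, so y_p = y_m
u = fmbpc_control(A, B, C, R, H=2, x_m0=x_m0, y_r_H=1.0, y_p=y_m, y_m=y_m)
```

Holding this $u$ constant for $H = 2$ samples drives the model output exactly to the requested value, whereas calling the function with $H = 1 < \rho$ raises the error because $g_0$ vanishes. In the fuzzy case, the averaged parameters change with $\boldsymbol{\beta}$, so the matrices would be rebuilt in every sampling instant.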

18.3.2. Stability Analysis

The stability analysis of the proposed predictive control can be performed using the linear matrix inequality (LMI) approach proposed in [27] and [41], or it can be done using the frozen-time theory [42, 43], which discusses the relation between a nonlinear dynamical system and the associated linear time-varying system. In our stability study, we assume that the frozen-time system given in Eq. (18.36) is a perfect model of the plant, i.e., $y_p(k) = y_m(k)$ for each $k$. Next, it is assumed that there is no external input to the closed-loop system ($w = 0$), an assumption often made when analyzing the stability of a closed-loop system. Even if there is an external signal, it is important that it is bounded. This is assured by selecting a stable reference model, i.e., $|a_r| < 1$. The results of the stability analysis are also qualitatively the same if the system operates in the presence of bounded disturbances and noise.


Note that the last term in the parentheses of the control law in Eq. (18.17) is equal to $g_0 \bar{r}$. This is obtained using Eqs. (18.16) and (18.18). Taking this into account and considering the above assumptions, the control law in Eq. (18.17) can be simplified to

$$
u(k) = g_0^{-1} \left( -\bar{\mathbf{C}}_m \bar{\mathbf{A}}_m^H \mathbf{x}_m^0(k) \right) - \bar{r}. \tag{18.20}
$$

Inserting the simplified control law in Eq. (18.20) into the model of the “undelayed” plant in Eq. (18.7), we obtain

$$
\mathbf{x}_m^0(k+1) = \left( \bar{\mathbf{A}}_m - \bar{\mathbf{B}}_m g_0^{-1} \bar{\mathbf{C}}_m \bar{\mathbf{A}}_m^H \right) \mathbf{x}_m^0(k). \tag{18.21}
$$

The closed-loop state transition matrix is defined as

$$
\mathbf{A}_c = \bar{\mathbf{A}}_m - \bar{\mathbf{B}}_m g_0^{-1} \bar{\mathbf{C}}_m \bar{\mathbf{A}}_m^H. \tag{18.22}
$$

If the system is in the controllable canonical form, the second term on the right-hand side of Eq. (18.22) has non-zero elements only in the last row of the matrix, and consequently, $\mathbf{A}_c$ is also in the Frobenius form. An interesting form of the matrix is obtained in the case $H = \rho$. If $H = \rho$, it can easily be shown that $g_0 = \bar{b}_\rho$ and that $\mathbf{A}_c$ takes the following form:

$$
\mathbf{A}_c = \begin{bmatrix}
0 & 1 & 0 & 0 & \cdots & 0 \\
0 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \ddots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 \\
0 & \cdots & 0 & -\dfrac{\bar{b}_n}{\bar{b}_\rho} & \cdots & -\dfrac{\bar{b}_{\rho+1}}{\bar{b}_\rho}
\end{bmatrix}. \tag{18.23}
$$

The corresponding characteristic equation of the system is

$$
z^{\rho} \left( z^{n-\rho} + \frac{\bar{b}_{\rho+1}}{\bar{b}_\rho} z^{n-\rho-1} + \frac{\bar{b}_{\rho+2}}{\bar{b}_\rho} z^{n-\rho-2} + \ldots + \frac{\bar{b}_{n-1}}{\bar{b}_\rho} z + \frac{\bar{b}_n}{\bar{b}_\rho} \right) = 0. \tag{18.24}
$$

The solutions of this equation are the closed-loop system poles: $\rho$ poles lie in the origin of the $z$-plane, while the other $(n - \rho)$ poles lie in the roots of the polynomial $\bar{b}_\rho z^{n-\rho} + \bar{b}_{\rho+1} z^{n-\rho-1} + \ldots + \bar{b}_{n-1} z + \bar{b}_n$. These results can be summarized in the following proposition:

Prop 2. When the coincidence horizon is equal to the relative degree of the model ($H = \rho$), then $(n - \rho)$ closed-loop poles tend to the open-loop plant zeros, while the remaining $\rho$ poles go to the origin of the $z$-plane.

The proposition states that the closed-loop system is stable for $H = \rho$ if the plant is minimum phase. When this is not the case, the closed-loop system would


become unstable if $H$ is chosen to be equal to $\rho$. In such a case, the coincidence horizon should be larger. The next proposition deals with the choice of a very large coincidence horizon.

Prop 3. When the coincidence horizon tends to infinity ($H \to \infty$) and the open-loop plant is stable, the closed-loop system poles tend to the open-loop plant poles.

Proof. The proposition can be proven easily. In the case of stable plants, $\bar{\mathbf{A}}_m$ is a stable (Schur) matrix, i.e., all its eigenvalues lie inside the unit circle, so it always satisfies

$$
\lim_{H \to \infty} \bar{\mathbf{A}}_m^H = \mathbf{0}. \tag{18.25}
$$

Combining Eqs. (18.22) and (18.25), we arrive at the final result

$$
\lim_{H \to \infty} \mathbf{A}_c = \bar{\mathbf{A}}_m. \tag{18.26}
$$

$\square$

The three propositions provide some design guidelines for choosing the coincidence horizon $H$. If $H < \rho$, the control law is singular and thus not applicable. If $H = \rho$, the closed-loop poles go to the open-loop zeros, i.e., a high-gain controller is being used. If $H$ is very large, the closed-loop poles go to the open-loop poles, i.e., a low-gain controller is being used and the system is almost open-loop. If the plant is stable, the closed-loop system can be made stable by choosing a coincidence horizon that is large enough.
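The two limiting cases of Props 2 and 3 can be illustrated numerically by computing the eigenvalues of $\mathbf{A}_c$ from Eq. (18.22) for a small and for a large coincidence horizon. The third-order model below is a hypothetical example chosen only for this sketch (open-loop poles 0.4, 0.5, and 0.6, one open-loop zero at $z = -0.5$, relative degree $\rho = 2$).

```python
import numpy as np

def closed_loop_matrix(A, B, C, H):
    """A_c of Eq. (18.22) for a single-input, single-output model."""
    I = np.eye(A.shape[0])
    A_H = np.linalg.matrix_power(A, H)
    g0 = C @ (A_H - I) @ np.linalg.inv(A - I) @ B   # Eq. (18.16)
    return A - np.outer(B, C @ A_H) / g0

# Hypothetical model in controllable canonical form (n = 3, rho = 2)
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.12, -0.74, 1.5]])
B = np.array([0.0, 0.0, 1.0])
C = np.array([0.5, 1.0, 0.0])

# H = rho: rho poles at the origin, the remaining one at the open-loop zero
poles_H_rho = np.linalg.eigvals(closed_loop_matrix(A, B, C, H=2))
# very large H: the poles approach the open-loop poles
poles_H_large = np.linalg.eigvals(closed_loop_matrix(A, B, C, H=200))
```

Intermediate values of $H$ move the poles between these two extremes, which is why the coincidence horizon acts as the main tuning parameter of the controller.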

18.3.3. Practical Example: Continuous Stirred-Tank Reactor

The simulated continuous stirred-tank reactor (CSTR) process consists of an irreversible, exothermic reaction, A → B, in a constant-volume reactor cooled by a single coolant stream, which can be modeled by the following equations [44]:

$$
\dot{C}_A^0 = \frac{q}{V} \left( C_{A0} - C_A^0 \right) - k_0 C_A^0 \exp\left( \frac{-E}{RT} \right), \tag{18.27}
$$

$$
\dot{T} = \frac{q}{V} (T_0 - T) - \frac{k_0 \Delta H}{\rho C_p} C_A^0 \exp\left( \frac{-E}{RT} \right) + \frac{\rho_c C_{pc}}{\rho C_p V}\, q_c \left[ 1 - \exp\left( -\frac{hA}{q_c \rho_c C_{pc}} \right) \right] (T_{c0} - T). \tag{18.28}
$$

The actual concentration $C_A^0$ is measured with a time delay $t_d = 0.5$ min, so the measured concentration is

$$
C_A(t) = C_A^0(t - t_d). \tag{18.29}
$$

The objective is to control the concentration of A ($C_A$) by manipulating the coolant flow rate $q_c$. This model is a modified version of the first tank of a two-tank CSTR example from [45]. In the original model, the time delay was zero.


Table 18.1: Nominal CSTR parameter values

Physical quantity                  Symbol               Value
Measured product concentration     $C_A$                0.1 mol/l
Reactor temperature                $T$                  438.54 K
Coolant flow rate                  $q_c$                103.41 l min⁻¹
Process flow rate                  $q$                  100 l min⁻¹
Feed concentration                 $C_{A0}$             1 mol/l
Feed temperature                   $T_0$                350 K
Inlet coolant temperature          $T_{c0}$             350 K
CSTR volume                        $V$                  100 l
Heat transfer term                 $hA$                 7 × 10⁵ cal min⁻¹ K⁻¹
Reaction rate constant             $k_0$                7.2 × 10¹⁰ min⁻¹
Activation energy term             $E/R$                1 × 10⁴ K
Heat of reaction                   $\Delta H$           −2 × 10⁵ cal mol⁻¹
Liquid densities                   $\rho$, $\rho_c$     1 × 10³ g/l
Specific heats                     $C_p$, $C_{pc}$      1 cal g⁻¹ K⁻¹
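As a plausibility check of Eqs. (18.27) and (18.28) with the nominal values of Table 18.1, the right-hand sides of the two differential equations can be evaluated at the nominal operating point, where both derivatives should be close to zero. The sketch below does this in plain Python; the function and dictionary names are hypothetical, the measurement delay of Eq. (18.29) is not modeled, and time is expressed in minutes.

```python
import math

# Nominal parameter values from Table 18.1 (units as given there)
PARAMS = {
    "q": 100.0, "C_A0": 1.0, "T0": 350.0, "Tc0": 350.0, "V": 100.0,
    "hA": 7e5, "k0": 7.2e10, "E_over_R": 1e4, "dH": -2e5,
    "rho": 1e3, "rho_c": 1e3, "Cp": 1.0, "Cpc": 1.0,
}

def cstr_rhs(C_A, T, q_c, p=PARAMS):
    """Right-hand sides of Eqs. (18.27) and (18.28) for the undelayed concentration."""
    k = p["k0"] * math.exp(-p["E_over_R"] / T)          # Arrhenius rate
    dC_A = p["q"] / p["V"] * (p["C_A0"] - C_A) - k * C_A
    cooling = (p["rho_c"] * p["Cpc"] / (p["rho"] * p["Cp"] * p["V"]) * q_c
               * (1.0 - math.exp(-p["hA"] / (q_c * p["rho_c"] * p["Cpc"])))
               * (p["Tc0"] - T))
    dT = (p["q"] / p["V"] * (p["T0"] - T)
          - p["dH"] / (p["rho"] * p["Cp"]) * k * C_A
          + cooling)
    return dC_A, dT

# Derivatives at the nominal operating point of Table 18.1
dC_A_nominal, dT_nominal = cstr_rhs(C_A=0.1, T=438.54, q_c=103.41)
```

Both derivatives are small compared with the individual terms that add up to them, which is consistent with the tabulated point being an equilibrium of the model up to the rounding of the listed values.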

The symbol $q_c$ represents the coolant flow rate (manipulated variable), and the other symbols represent constant parameters whose values are defined in Table 18.1. The process dynamics are nonlinear due to the Arrhenius rate expression that describes the dependence of the reaction rate constant on the temperature ($T$). This is why the CSTR exhibits some operational and control problems. The reactor presents multiplicity behavior with respect to the coolant flow rate $q_c$, i.e., if the coolant flow rate $q_c \in (11.1\ \text{l/min}, 119.7\ \text{l/min})$, there are three equilibrium concentrations $C_A$. Stable equilibrium points are obtained in the following cases:

• $q_c > 11.1$ l/min ⇒ stable equilibrium point $0.92\ \text{mol/l} < C_A < 1\ \text{mol/l}$, and
• $q_c < 111.8$ l/min ⇒ stable equilibrium point $C_A < 0.14$ mol/l (the point where $q_c \approx 111.8$ l/min is a Hopf bifurcation point).

If $q_c \in (11.1\ \text{l/min}, 119.7\ \text{l/min})$, there is also at least one unstable point for the measured product concentration $C_A$. From the above facts, one can see that the CSTR exhibits quite complex dynamics. In our application, we are interested in the operation in the stable operating point given by $q_c = 103.41$ l min⁻¹ and $C_A = 0.1$ mol/l.

18.3.3.1. Fuzzy identification of the continuous stirred-tank reactor

From the description of the plant, it can be seen that two variables are available for measurement: the measured product concentration $C_A$ and the reactor temperature $T$. For the purpose of control it is certainly beneficial to make use of both, although it is not necessary to feed back the reactor temperature if one wants to control the product concentration. In our case, a simple discrete compensator was added to the measured reactor temperature output:

\[ q_{c_{ff}} = K_{ff}\,[T(k) - T(k-1)], \tag{18.30} \]

where $K_{ff}$ was chosen to be 3 and the sampling time was $T_s = 0.1$ min. This compensator is a kind of D-controller that does not affect the equilibrium points of the system (the static curve remains the same), but it does affect their stability to some extent. In our case, the Hopf bifurcation point moved from $(q_c, C_A) = (111.8\ \text{l/min},\ 0.14\ \text{mol/l})$ to $(q_c, C_A) = (116.2\ \text{l/min},\ 0.179\ \text{mol/l})$. This means that the stability interval for the product concentration $C_A$ expanded from $(0,\ 0.14\ \text{mol/l})$ to $(0,\ 0.179\ \text{mol/l})$.

The proposed FMBPC will be tested on the compensated plant, so a fuzzy model of the compensated plant is needed. The plant was identified in the form of a discrete second-order model with the premise defined as $x_p^T = [C_A(k)]$ and the consequence vector as $x^T = [C_A(k),\ C_A(k-1),\ q_c(k - T_{D_m}),\ 1]$. The functions $\phi_j(\cdot)$ can in general be arbitrary smooth functions, although linear or affine functions are usually used. Because of the strong nonlinearity, a structure with six rules and equidistantly spaced Gaussian membership functions was chosen. The normalized membership functions are shown in Figure 18.1.

Figure 18.1: The membership functions. [Plot of the normalized degrees of fulfilment $\beta_1, \ldots, \beta_6$ versus $C_A$ [mol/l], shown over the range 0.04 to 0.16 mol/l.]
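The normalized degrees of fulfilment in Figure 18.1 can be sketched in a few lines. This is only an illustration: the text states that six equidistant Gaussian membership functions were used, but the centres and the width below are assumptions, since the chapter does not list them.

```python
import numpy as np

def normalized_memberships(c_a, centers, sigma):
    """Return the normalized degrees of fulfilment beta_1..beta_n for a scalar C_A."""
    mu = np.exp(-0.5 * ((c_a - centers) / sigma) ** 2)  # Gaussian membership degrees
    return mu / mu.sum()                                # normalize so the betas sum to 1

# ASSUMPTION: six centres equally spaced over 0.06-0.16 mol/l, width equal to the spacing.
centers = np.linspace(0.06, 0.16, 6)
sigma = centers[1] - centers[0]

beta = normalized_memberships(0.10, centers, sigma)
```

With these assumed centres, the third rule dominates at the nominal concentration of 0.1 mol/l, and its two neighbours contribute equally.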


The structure of the fuzzy model is the following:

\[ R_j:\ \text{if } C_A(k) \text{ is } A_j \text{ then } C_A(k+1) = -a_{1j} C_A(k) - a_{2j} C_A(k-1) + b_{1j}\, q_c(k - T_{D_m}) + r_j, \quad j = 1, \ldots, 6. \tag{18.31} \]

The parameters of the fuzzy form in Eq. (18.31) were estimated using the least-squares algorithm, where the data were preprocessed using QR factorization [46]. The estimated parameters can be written as vectors $a_1^T = [a_{11}, \ldots, a_{16}]$, $a_2^T = [a_{21}, \ldots, a_{26}]$, $b_1^T = [b_{11}, \ldots, b_{16}]$, and $r^T = [r_1, \ldots, r_6]$. The estimated parameters in the case of the CSTR are as follows:

\[ \begin{aligned}
a_1^T &= [-1.3462,\ -1.4506,\ -1.5681,\ -1.7114,\ -1.8111,\ -1.9157] \\
a_2^T &= [0.4298,\ 0.5262,\ 0.6437,\ 0.7689,\ 0.8592,\ 0.9485] \\
b_1^T &= [1.8124,\ 2.1795,\ 2.7762,\ 2.6703,\ 2.8716,\ 2.8500] \cdot 10^{-4} \\
r^T &= [-1.1089,\ -1.5115,\ -2.1101,\ -2.1917,\ -2.5270,\ -2.7301] \cdot 10^{-2}
\end{aligned} \tag{18.32} \]

and $T_{D_m} = 5$. After the parameters were estimated, the TS fuzzy model in Eq. (18.31) was transformed into state-space form to simplify the derivation of the control law:

\[ x_m(k+1) = \sum_i \beta_i(x_p(k)) \left( A_{m_i} x_m(k) + B_{m_i} u(k - T_{D_m}) + R_{m_i} \right) \tag{18.33} \]

\[ y_m(k) = \sum_i \beta_i(x_p(k))\, C_{m_i} x_m(k) \tag{18.34} \]

\[ A_{m_i} = \begin{bmatrix} 0 & 1 \\ -a_{2i} & -a_{1i} \end{bmatrix}, \quad B_{m_i} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad C_{m_i} = \begin{bmatrix} b_{1i} & 0 \end{bmatrix}, \quad R_{m_i} = \begin{bmatrix} 0 \\ r_i / b_{1i} \end{bmatrix}, \tag{18.35} \]

where the measured process output concentration $C_A$ is denoted by $y_m$ and the input flow $q_c$ by $u$. The frozen-time theory [42, 43] establishes a relation between the nonlinear dynamical system and the associated linear time-varying system. The theory yields the following fuzzy model:

\[ \begin{aligned}
x_m(k+1) &= \bar{A}_m x_m(k) + \bar{B}_m u(k - T_{D_m}) + \bar{R}_m \\
y_m(k) &= \bar{C}_m x_m(k),
\end{aligned} \tag{18.36} \]

where

\[ \begin{aligned}
\bar{A}_m &= \sum_i \beta_i(x_p(k))\, A_{m_i}, &
\bar{B}_m &= \sum_i \beta_i(x_p(k))\, B_{m_i}, \\
\bar{C}_m &= \sum_i \beta_i(x_p(k))\, C_{m_i}, &
\bar{R}_m &= \sum_i \beta_i(x_p(k))\, R_{m_i}.
\end{aligned} \tag{18.37} \]
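The identified model can be exercised numerically. The sketch below uses the parameter values of Eq. (18.32) to form the one-step prediction of Eq. (18.31) and the blended matrices of Eq. (18.37). The function names and the membership vector `beta` are hypothetical, chosen only for illustration; in the chapter, the memberships come from the Gaussian functions of Figure 18.1.

```python
import numpy as np

# Estimated consequent parameters from Eq. (18.32), one entry per rule.
a1 = np.array([-1.3462, -1.4506, -1.5681, -1.7114, -1.8111, -1.9157])
a2 = np.array([0.4298, 0.5262, 0.6437, 0.7689, 0.8592, 0.9485])
b1 = np.array([1.8124, 2.1795, 2.7762, 2.6703, 2.8716, 2.8500]) * 1e-4
r = np.array([-1.1089, -1.5115, -2.1101, -2.1917, -2.5270, -2.7301]) * 1e-2

def predict_next(beta, c_k, c_km1, qc_delayed):
    """One-step prediction of C_A(k+1): beta-weighted sum of the consequents of Eq. (18.31)."""
    local = -a1 * c_k - a2 * c_km1 + b1 * qc_delayed + r  # prediction of each local model
    return float(beta @ local)

def frozen_time_matrices(beta):
    """Blend the local matrices of Eq. (18.35) with the weights beta, as in Eq. (18.37)."""
    A = np.array([[0.0, 1.0], [-(beta @ a2), -(beta @ a1)]])
    B = np.array([[0.0], [1.0]])
    C = np.array([[beta @ b1, 0.0]])
    R = np.array([[0.0], [beta @ (r / b1)]])
    return A, B, C, R

# HYPOTHETICAL membership vector: rules 3 and 4 equally active.
beta = np.array([0.0, 0.0, 0.5, 0.5, 0.0, 0.0])

# Prediction at the nominal operating point (C_A = 0.1 mol/l, q_c = 103.41 l/min).
c_next = predict_next(beta, 0.1, 0.1, 103.41)
A, B, C, R = frozen_time_matrices(beta)
```

At the nominal operating point the prediction stays within about $10^{-4}$ mol/l of 0.1 mol/l, which is a quick consistency check of the parameters in Eq. (18.32) against the nominal values in Table 18.1.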

18.3.3.2. Simulation results

The reference tracking ability and the disturbance rejection capability of the FMBPC control algorithm were tested on a simulated CSTR plant. The FMBPC was compared to a conventional PI controller. In the first experiment, the control system was tested for tracking a reference signal that changed the operating point from the nominal value ($C_A = 0.1$ mol/l) to larger concentration values and back, and then to smaller concentration values and back. The proposed FMBPC used the following design parameters: $H = 9$ and $a_r = 0.96$. The parameters of the PI controller were obtained by minimizing the following criterion:

\[ C_{PI} = \sum_{k=0}^{400} \left( y_r(k) - y_{PI}(k) \right)^2, \tag{18.38} \]

where $y_r(k)$ is the reference model output depicted in Figure 18.2 and $y_{PI}(k)$ is the controlled output in the case of PI control. This means that the parameters of the PI controller were optimized to obtain the best tracking of the reference model output for the case treated in the first experiment. The optimal parameters were $K_P = 64.6454$ l$^2$ mol$^{-1}$ min$^{-1}$ and $T_i = 0.6721$ min. Figure 18.2 also shows the manipulated and controlled variables for the two approaches. In the lower part of the figure, the set-point is depicted with the dashed line, the reference model output with the magenta dotted line, the FMBPC response with the blue thick solid line, and the PI response with the red thin solid line. The upper part of the figure shows the two control signals.

Figure 18.2: The performance of the FMBPC and the PI control in the case of reference trajectory tracking. [Upper panel: control signals $u_{FMBPC}$ and $u_{PI}$ versus $t$; lower panel: $w$, $y_r$, $y_{FMBPC}$, and $y_{PI}$ versus $t$.]

The performance criteria obtained in the experiment are the following:

\[ C_{PI} = \sum_{k=0}^{400} \left( y_r(k) - y_{PI}(k) \right)^2 = 0.0165 \tag{18.39} \]

\[ C_{FMBPC} = \sum_{k=0}^{400} \left( y_r(k) - y_{FMBPC}(k) \right)^2 = 0.0061. \tag{18.40} \]
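The criterion of Eq. (18.38) is a plain sum of squared deviations from the reference-model output. A minimal sketch follows; the toy signals are hypothetical and are not the chapter's simulation data, while the two constants at the end are the values reported in Eqs. (18.39) and (18.40).

```python
import numpy as np

def tracking_criterion(y_ref, y):
    """Sum of squared deviations from the reference-model output, as in Eq. (18.38)."""
    y_ref = np.asarray(y_ref, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((y_ref - y) ** 2))

# HYPOTHETICAL toy signals, only to show the call.
y_ref = np.array([0.10, 0.11, 0.12])
y = np.array([0.10, 0.10, 0.12])
toy = tracking_criterion(y_ref, y)  # a single deviation of 0.01 contributes 0.01**2

# Ratio of the reported criteria: C_FMBPC / C_PI for reference tracking.
ratio_tracking = 0.0061 / 0.0165
```

For the reported values, the FMBPC criterion is a little over one third of the PI criterion in the tracking experiment.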

The disturbance rejection performance was tested with the same controllers, set to the same design parameters as in the first experiment. In the simulation experiment, a step-like positive input disturbance of 3 l/min appeared and later disappeared. After some time, a step-like negative input disturbance of −3 l/min appeared and later disappeared. The results of the experiment are shown in Figure 18.3, where the signals are depicted by the same line types as in Figure 18.2. Performance criteria similar to those for reference tracking can be calculated:

\[ C_{PI} = \sum_{k=0}^{400} \left( y_r(k) - y_{PI}(k) \right)^2 = 0.0076 \tag{18.41} \]

\[ C_{FMBPC} = \sum_{k=0}^{400} \left( y_r(k) - y_{FMBPC}(k) \right)^2 = 0.0036 \tag{18.42} \]

The simulation results show that better performance criteria are obtained with the FMBPC control in both control modes: the trajectory tracking mode and the disturbance rejection mode. This is expected, because the PI controller assumes linear process dynamics, while the FMBPC controller takes the plant nonlinearity into account through the fuzzy model of the plant. The proposed approach is very easy to implement and provides high control performance.

Figure 18.3: The control performance of the FMBPC and the PI control in the case of disturbance rejection. [Upper panel: control signals $u_{FMBPC}$ and $u_{PI}$ versus $t$; lower panel: $w$, $y_{FMBPC}$, and $y_{PI}$ versus $t$.]

18.4. Direct Fuzzy Model Reference Adaptive Control

We have already established that fuzzy controllers are capable of controlling nonlinear plants. If the model of the plant is not only nonlinear but also unknown or poorly known, the solution becomes considerably more difficult. Nevertheless, several approaches exist to solve the problem. One possibility is to apply adaptive control. Adaptive control schemes designed for linear systems do not produce good results on such plants: although the adaptive parameters try to track the “true” local linear parameters of the current operating point, this tracking lags behind each operating-point change. To overcome this problem, adaptive control was extended in the 1980s and 1990s to time-varying and nonlinear plants [47].

It is also possible to introduce some sort of adaptation into the fuzzy controller. The first attempts at constructing a fuzzy adaptive controller can be traced back to [48], where the so-called linguistic self-organizing controllers were introduced. Many approaches were later presented in which a fuzzy model of the plant was constructed online, followed by adjustment of the control parameters [49]. The main drawback of these schemes was that their stability was not treated rigorously.


The universal approximation theorem [50] provided a theoretical background for new fuzzy direct and indirect adaptive controllers [50–52] whose stability was proven using Lyapunov theory. Robust adaptive control was proposed to overcome the problem of disturbances and unmodeled dynamics [53]. Similar solutions have also been used in adaptive fuzzy and neural controllers: modifications such as projection [54], dead zone [55], leakage [56], and adaptive fuzzy backstepping control [57] have been included in the adaptive law to prevent instability due to the reconstruction error.

This section treats the control of a practically very important class of plants that, in our opinion, occur quite often in the process industries. The class consists of nonlinear systems of arbitrary order for which the control law is based on a first-order nonlinear approximation. The dynamics not included in the first-order approximation are referred to as parasitic dynamics. The parasitic dynamics are treated explicitly in the development of the adaptive law to prevent the modeling error from growing unbounded. The class of plants also includes bounded disturbances.

The choice of a simple nominal model results in very simple control and adaptive laws. The control law is similar to the one proposed in [58, 59], but an extra term is added in this work, where an adaptive law with leakage is presented [60]. It will be shown that the proposed adaptive law is a natural way to cope with parasitic dynamics. The boundedness of the estimated parameters, the tracking error, and all the signals in the system will be proved, provided that the leakage parameter $\sigma$ satisfies certain conditions. This means that the proposed adaptive law ensures the global stability of the system. A very important property of the proposed approach is that it can be used in the consequent part of Takagi–Sugeno-based control. The approach enables easy implementation in control systems with evolving antecedent parts [13–17]. This combination results in high-performance and robust control of nonlinear and slowly time-varying systems [61–64].

18.4.1. The Class of Nonlinear Plants

Our goal is to design control for a class of plants that includes nonlinear time-invariant systems whose model behaves similarly to a first-order system at low frequencies (the frequency response is not defined for nonlinear systems, so frequencies are meant here in a broader sense). If the plant were a first-order system (without parasitic dynamics), it could be described by a fuzzy model in the form of if-then rules:

\[ \text{if } z_1 \text{ is } A_{i_a} \text{ and } z_2 \text{ is } B_{i_b}, \text{ then } \dot{y}_p = -a_i y_p + b_i u + c_i, \quad i_a = 1, \ldots, n_a, \quad i_b = 1, \ldots, n_b, \quad i = 1, \ldots, k, \tag{18.43} \]


where $u$ and $y_p$ are the input and the output of the plant, respectively, $A_{i_a}$ and $B_{i_b}$ are fuzzy membership functions, and $a_i$, $b_i$, and $c_i$ are the plant parameters in the $i$th domain. Note the $c_i$ term in the consequent. Such an additive term is obtained if a nonlinear system is linearized around an operating point, and it changes with the operating point. The term $c_i$ is new compared to the model used in [58, 59]. The antecedent variables that define the domain in which the system is currently situated are denoted by $z_1$ and $z_2$ (there can be only one such variable, or more than two, but this does not affect the approach described here). There are $n_a$ and $n_b$ membership functions for the first and the second antecedent variables, respectively. The product $k = n_a \times n_b$ defines the number of fuzzy rules. The membership functions have to cover the whole operating area of the system. The output of the Takagi–Sugeno model is then given by the following equation:

\[ \dot{y}_p = \frac{\sum_{i=1}^{k} \beta_i^0(x_p)\,(-a_i y_p + b_i u + c_i)}{\sum_{i=1}^{k} \beta_i^0(x_p)}, \tag{18.44} \]

where $x_p$ represents the vector of antecedent variables $z_i$ (in the case of the fuzzy model given by Eq. (18.43), $x_p = [z_1\ z_2]^T$). The degree of fulfillment $\beta_i^0(x_p)$ is obtained using the T-norm, which in this case is a simple algebraic product of membership functions:

\[ \beta_i^0(x_p) = T\!\left(\mu_{A_{i_a}}(z_1),\ \mu_{B_{i_b}}(z_2)\right) = \mu_{A_{i_a}}(z_1) \cdot \mu_{B_{i_b}}(z_2), \tag{18.45} \]

where $\mu_{A_{i_a}}(z_1)$ and $\mu_{B_{i_b}}(z_2)$ stand for the membership degrees of $z_1$ and $z_2$ in the antecedent of the corresponding fuzzy rule. The degrees of fulfillment for the whole set of fuzzy rules can be written in a compact form as

\[ \boldsymbol{\beta}^0 = \begin{bmatrix} \beta_1^0 & \beta_2^0 & \ldots & \beta_k^0 \end{bmatrix}^T \in \mathbb{R}^k, \tag{18.46} \]

or in a more convenient, normalized form

\[ \boldsymbol{\beta} = \frac{\boldsymbol{\beta}^0}{\sum_{i=1}^{k} \beta_i^0} \in \mathbb{R}^k. \tag{18.47} \]

From Eqs. (18.44) and (18.47), the first-order plant can be modeled in fuzzy form as

\[ \dot{y}_p = -(\boldsymbol{\beta}^T \mathbf{a})\, y_p + (\boldsymbol{\beta}^T \mathbf{b})\, u + (\boldsymbol{\beta}^T \mathbf{c}), \tag{18.48} \]

where $\mathbf{a} = [a_1\ a_2\ \ldots\ a_k]^T$, $\mathbf{b} = [b_1\ b_2\ \ldots\ b_k]^T$, and $\mathbf{c} = [c_1\ c_2\ \ldots\ c_k]^T$ are vectors of unknown plant parameters in the respective domains ($\mathbf{a}, \mathbf{b}, \mathbf{c} \in \mathbb{R}^k$).

Assuming that the controlled system is of the first order is quite a strong simplification. Parasitic dynamics and disturbances are therefore included in the model of the plant. The first-order fuzzy model is generalized by adding stable factor plant perturbations and disturbances, which results in the following model [58]:

\[ \dot{y}_p(t) = -(\boldsymbol{\beta}^T(t)\mathbf{a})\, y_p(t) + (\boldsymbol{\beta}^T(t)\mathbf{b})\, u(t) + (\boldsymbol{\beta}^T(t)\mathbf{c}) - \Delta_y(p)\, y_p(t) + \Delta_u(p)\, u(t) + d(t), \tag{18.49} \]

where $p$ is the differential operator $d/dt$, $\Delta_y(p)$ and $\Delta_u(p)$ are stable strictly proper linear operators, and $d$ is a bounded signal due to disturbances [58]. Equation (18.49) represents the class of plants to be controlled by the approach proposed in the following sections. The control is designed based on the model given by Eq. (18.48), while the robustness properties of the algorithm prevent instability due to parasitic dynamics and disturbances.
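Equations (18.45) to (18.48) translate directly into array operations. The following sketch is illustrative only: the membership degrees and the parameter vectors are hypothetical values, and the function names are not from the chapter.

```python
import numpy as np

def normalized_fulfilment(mu_a, mu_b):
    """Eqs. (18.45)-(18.47): product T-norm over all rule pairs, then normalization."""
    beta0 = np.outer(mu_a, mu_b).ravel()  # beta0_i = mu_A(z1) * mu_B(z2), with k = n_a * n_b rules
    return beta0 / beta0.sum()

def fuzzy_first_order_rhs(beta, a, b, c, y_p, u):
    """Eq. (18.48): time derivative of y_p for the nominal (parasitic-free) fuzzy model."""
    return -(beta @ a) * y_p + (beta @ b) * u + (beta @ c)

# HYPOTHETICAL membership degrees for n_a = 2 and n_b = 3 membership functions.
mu_a = np.array([0.2, 0.8])
mu_b = np.array([0.5, 0.5, 0.0])
beta = normalized_fulfilment(mu_a, mu_b)

# HYPOTHETICAL plant parameters, identical in all k = 6 domains.
a = np.full(6, 2.0)
b = np.full(6, 1.0)
c = np.zeros(6)
dy = fuzzy_first_order_rhs(beta, a, b, c, y_p=1.0, u=0.5)
```

Because the parameters are the same in every domain here, the fuzzy model collapses to the linear model $\dot{y}_p = -2 y_p + u$, which makes the result easy to check by hand.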

18.4.2. The proposed fuzzy adaptive control algorithm

A fuzzy model reference adaptive control is proposed to achieve tracking control for the class of plants described in the previous section. The control goal is that the plant output follows the output $y_m$ of the reference model. The latter is defined by a first-order linear system $G_m(p)$:

\[ y_m(t) = G_m(p)\, w(t) = \frac{b_m}{p + a_m}\, w(t), \tag{18.50} \]

where $w(t)$ is the reference signal, while $b_m$ and $a_m$ are constants that define the desired behavior of the closed-loop system. The tracking error

\[ \varepsilon(t) = y_p(t) - y_m(t) \tag{18.51} \]

therefore represents a measure of the control quality. To solve the control problem, simple control and adaptive laws are proposed in the following subsections.

18.4.2.1. Control law

The control law is very similar to the one proposed in [58] and [59]:

\[ u(t) = \left(\boldsymbol{\beta}^T(t)\,\hat{\mathbf{f}}(t)\right) w(t) - \left(\boldsymbol{\beta}^T(t)\,\hat{\mathbf{q}}(t)\right) y_p(t) + \left(\boldsymbol{\beta}^T(t)\,\hat{\mathbf{r}}(t)\right), \tag{18.52} \]

where $\hat{\mathbf{f}}(t) \in \mathbb{R}^k$, $\hat{\mathbf{q}}(t) \in \mathbb{R}^k$, and $\hat{\mathbf{r}}(t) \in \mathbb{R}^k$ are the control gain vectors to be determined by the adaptive law. This control law is obtained by generalizing the model reference adaptive control algorithm for the first-order linear plant to the fuzzy case. The control law also includes a third term that is new compared to the one in [59]. It is used to compensate for the $(\boldsymbol{\beta}^T \mathbf{c})$ term in Eq. (18.49).
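The control law of Eq. (18.52) is a membership-weighted blend of three local gains. A minimal sketch, with hypothetical gain values and a hypothetical function name:

```python
import numpy as np

def control_law(beta, f_hat, q_hat, r_hat, w, y_p):
    """Eq. (18.52): u = (beta^T f_hat) w - (beta^T q_hat) y_p + (beta^T r_hat)."""
    return (beta @ f_hat) * w - (beta @ q_hat) * y_p + (beta @ r_hat)

# HYPOTHETICAL values for a two-rule controller.
beta = np.array([0.25, 0.75])   # normalized degrees of fulfilment
f_hat = np.array([2.0, 2.0])    # feedforward gains
q_hat = np.array([1.0, 1.0])    # feedback gains
r_hat = np.array([0.4, 0.0])    # offset gains compensating the beta^T c term

u = control_law(beta, f_hat, q_hat, r_hat, w=1.0, y_p=0.5)
```

Only the offset gains differ between the two rules in this example, so the blend affects only the third term: $u = 2 \cdot 1 - 1 \cdot 0.5 + 0.25 \cdot 0.4 = 1.6$.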


18.4.2.2. Adaptive law

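The adaptive law proposed in this chapter is based on the adaptive law from [58]. The $e_1$-modification was used in the leakage term in [58]. An alternative approach was proposed in [59]. Here, we follow the basic idea from [59] while also adding a new adaptive law for $\hat{r}_i$:

\[ \begin{aligned}
\dot{\hat{f}}_i &= -\gamma_{fi}\, b_{sign}\, \varepsilon w \beta_i - \gamma_{fi}\, \sigma w^2 \beta_i^2\, (\hat{f}_i - \hat{f}_i^*) \\
\dot{\hat{q}}_i &= \gamma_{qi}\, b_{sign}\, \varepsilon y_p \beta_i - \gamma_{qi}\, \sigma y_p^2 \beta_i^2\, (\hat{q}_i - \hat{q}_i^*) \\
\dot{\hat{r}}_i &= -\gamma_{ri}\, b_{sign}\, \varepsilon \beta_i - \gamma_{ri}\, \sigma \beta_i^2\, (\hat{r}_i - \hat{r}_i^*)
\end{aligned} \qquad i = 1, 2, \ldots, k, \tag{18.53} \]

where $\gamma_{fi}$, $\gamma_{qi}$, and $\gamma_{ri}$ are positive scalars referred to as adaptive gains, $\sigma > 0$ is the parameter of the leakage term, $\hat{f}_i^*$, $\hat{q}_i^*$, and $\hat{r}_i^*$ are the a priori estimates of the control gains $\hat{f}_i$, $\hat{q}_i$, and $\hat{r}_i$, respectively, and $b_{sign}$ is defined as follows:

\[ b_{sign} = \begin{cases} 1 & b_1 > 0,\ b_2 > 0,\ \ldots,\ b_k > 0 \\ -1 & b_1 < 0,\ b_2 < 0,\ \ldots,\ b_k < 0 \end{cases}. \tag{18.54} \]

If the elements of the vector $\mathbf{b}$ do not all have the same sign, the plant is not controllable for some $\boldsymbol{\beta}$ ($\boldsymbol{\beta}^T \mathbf{b}$ is equal to 0 for this $\boldsymbol{\beta}$), and no control signal has any effect.

It is possible to rewrite the adaptive law in Eq. (18.53) in compact form if the control gain vectors $\hat{\mathbf{f}}$, $\hat{\mathbf{q}}$, and $\hat{\mathbf{r}}$ are defined as

\[ \begin{aligned}
\hat{\mathbf{f}}^T &= \begin{bmatrix} \hat{f}_1 & \hat{f}_2 & \ldots & \hat{f}_k \end{bmatrix} \\
\hat{\mathbf{q}}^T &= \begin{bmatrix} \hat{q}_1 & \hat{q}_2 & \ldots & \hat{q}_k \end{bmatrix} \\
\hat{\mathbf{r}}^T &= \begin{bmatrix} \hat{r}_1 & \hat{r}_2 & \ldots & \hat{r}_k \end{bmatrix}
\end{aligned} \tag{18.55} \]

Then the adaptive law in Eq. (18.53) takes the following form:

\[ \begin{aligned}
\dot{\hat{\mathbf{f}}} &= -\boldsymbol{\Gamma}_f\, b_{sign}\, \varepsilon w \boldsymbol{\beta} - \boldsymbol{\Gamma}_f\, \sigma w^2\, \mathrm{diag}(\boldsymbol{\beta})\, \mathrm{diag}(\boldsymbol{\beta})\, (\hat{\mathbf{f}} - \hat{\mathbf{f}}^*) \\
\dot{\hat{\mathbf{q}}} &= \boldsymbol{\Gamma}_q\, b_{sign}\, \varepsilon y_p \boldsymbol{\beta} - \boldsymbol{\Gamma}_q\, \sigma y_p^2\, \mathrm{diag}(\boldsymbol{\beta})\, \mathrm{diag}(\boldsymbol{\beta})\, (\hat{\mathbf{q}} - \hat{\mathbf{q}}^*) \\
\dot{\hat{\mathbf{r}}} &= -\boldsymbol{\Gamma}_r\, b_{sign}\, \varepsilon \boldsymbol{\beta} - \boldsymbol{\Gamma}_r\, \sigma\, \mathrm{diag}(\boldsymbol{\beta})\, \mathrm{diag}(\boldsymbol{\beta})\, (\hat{\mathbf{r}} - \hat{\mathbf{r}}^*)
\end{aligned}, \tag{18.56} \]

where $\boldsymbol{\Gamma}_f \in \mathbb{R}^{k \times k}$, $\boldsymbol{\Gamma}_q \in \mathbb{R}^{k \times k}$, and $\boldsymbol{\Gamma}_r \in \mathbb{R}^{k \times k}$ are positive definite matrices, $\mathrm{diag}(\mathbf{x}) \in \mathbb{R}^{k \times k}$ is a diagonal matrix with the elements of the vector $\mathbf{x}$ on the main diagonal, and $\hat{\mathbf{f}}^* \in \mathbb{R}^k$, $\hat{\mathbf{q}}^* \in \mathbb{R}^k$, and $\hat{\mathbf{r}}^* \in \mathbb{R}^k$ are the a priori estimates of the control gain vectors.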
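A discrete-time sketch of the element-wise adaptive law in Eq. (18.53) is given below. The forward-Euler integration, the step size `dt`, the scalar adaptive gains, and all numerical values are assumptions made for illustration; the chapter states the law in continuous time only.

```python
import numpy as np

def adaptive_law_step(f_hat, q_hat, r_hat, beta, eps, w, y_p, b_sign,
                      gamma_f, gamma_q, gamma_r, sigma,
                      f_prior, q_prior, r_prior, dt):
    """One forward-Euler step of Eq. (18.53) (Euler integration is an assumption)."""
    # Gradient-like term plus leakage term pulling the gain towards its a priori estimate.
    df = -gamma_f * b_sign * eps * w * beta - gamma_f * sigma * w**2 * beta**2 * (f_hat - f_prior)
    dq = gamma_q * b_sign * eps * y_p * beta - gamma_q * sigma * y_p**2 * beta**2 * (q_hat - q_prior)
    dr = -gamma_r * b_sign * eps * beta - gamma_r * sigma * beta**2 * (r_hat - r_prior)
    return f_hat + dt * df, q_hat + dt * dq, r_hat + dt * dr

# HYPOTHETICAL two-rule example in which only the first rule is active.
k = 2
beta = np.array([1.0, 0.0])
zeros = np.zeros(k)
f_new, q_new, r_new = adaptive_law_step(
    zeros, zeros, zeros, beta, eps=0.1, w=1.0, y_p=0.5, b_sign=1.0,
    gamma_f=1.0, gamma_q=1.0, gamma_r=1.0, sigma=0.1,
    f_prior=zeros, q_prior=zeros, r_prior=zeros, dt=0.01)
```

Both terms of each update are multiplied by $\beta_i$ or $\beta_i^2$, so the gains of a rule with $\beta_i = 0$ are left unchanged in that step.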


18.4.2.3. The sketch of the stability proof

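The reference model in Eq. (18.50) can be rewritten in the following form:

\[ \dot{y}_m = -a_m y_m + b_m w. \tag{18.57} \]

By subtracting Eq. (18.57) from Eq. (18.49) and inserting the control law in Eq. (18.52), the following tracking-error model is obtained:

\[ \begin{aligned}
\dot{\varepsilon} = {} & -a_m \varepsilon + \left[ (\boldsymbol{\beta}^T \mathbf{b})(\boldsymbol{\beta}^T \hat{\mathbf{f}}) - b_m \right] w - \left[ (\boldsymbol{\beta}^T \mathbf{b})(\boldsymbol{\beta}^T \hat{\mathbf{q}}) + (\boldsymbol{\beta}^T \mathbf{a}) - a_m \right] y_p \\
& + \left[ (\boldsymbol{\beta}^T \mathbf{b})(\boldsymbol{\beta}^T \hat{\mathbf{r}}) + (\boldsymbol{\beta}^T \mathbf{c}) \right] + \Delta_u(p)\, u - \Delta_y(p)\, y_p + d.
\end{aligned} \tag{18.58} \]

Now we assume that there exist constant control parameters $\mathbf{f}^*$, $\mathbf{q}^*$, and $\mathbf{r}^*$ that stabilize the closed-loop system. This is a mild assumption, and it is always fulfilled unless the unmodeled dynamics are unacceptably high. These parameters are only needed in the stability analysis and can be chosen to make the “difference” between the closed-loop system and the reference model smaller in some sense (the definition of this “difference” is not important for the analysis). The parameters $\mathbf{f}^*$, $\mathbf{q}^*$, and $\mathbf{r}^*$ are sometimes called the “true” parameters because they result in perfect tracking in the absence of unmodeled dynamics and disturbances. The parameter errors are defined as

\[ \begin{aligned}
\tilde{\mathbf{f}} &= \hat{\mathbf{f}} - \mathbf{f}^* \\
\tilde{\mathbf{q}} &= \hat{\mathbf{q}} - \mathbf{q}^* \\
\tilde{\mathbf{r}} &= \hat{\mathbf{r}} - \mathbf{r}^*
\end{aligned} \tag{18.59} \]

The expressions in the square brackets in Eq. (18.58) can be rewritten similarly as in [58]:

\[ \begin{aligned}
(\boldsymbol{\beta}^T \mathbf{b})(\boldsymbol{\beta}^T \hat{\mathbf{f}}) - b_m &= b_{sign}\, \boldsymbol{\beta}^T \tilde{\mathbf{f}} + \eta_f = b_{sign} \sum_{i=1}^{k} \beta_i \tilde{f}_i + \eta_f \\
(\boldsymbol{\beta}^T \mathbf{b})(\boldsymbol{\beta}^T \hat{\mathbf{q}}) + (\boldsymbol{\beta}^T \mathbf{a}) - a_m &= b_{sign}\, \boldsymbol{\beta}^T \tilde{\mathbf{q}} + \eta_q = b_{sign} \sum_{i=1}^{k} \beta_i \tilde{q}_i + \eta_q \\
(\boldsymbol{\beta}^T \mathbf{b})(\boldsymbol{\beta}^T \hat{\mathbf{r}}) + (\boldsymbol{\beta}^T \mathbf{c}) &= b_{sign}\, \boldsymbol{\beta}^T \tilde{\mathbf{r}} + \eta_r = b_{sign} \sum_{i=1}^{k} \beta_i \tilde{r}_i + \eta_r
\end{aligned}, \tag{18.60} \]

where the bounded residuals $\eta_f(t)$, $\eta_q(t)$, and $\eta_r(t)$ are introduced (the boundedness can be shown simply; see also [58]). The following Lyapunov function has been proposed for the proof of stability:

\[ V = \frac{1}{2}\varepsilon^2 + \frac{1}{2}\sum_{i=1}^{k} \gamma_{fi}^{-1} \tilde{f}_i^2 + \frac{1}{2}\sum_{i=1}^{k} \gamma_{qi}^{-1} \tilde{q}_i^2 + \frac{1}{2}\sum_{i=1}^{k} \gamma_{ri}^{-1} \tilde{r}_i^2. \tag{18.61} \]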

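Calculating the derivative of the Lyapunov function along the solution of the system in Eq. (18.58) and taking into account Eq. (18.60) and the adaptive laws in Eq. (18.53), we obtain

\[ \begin{aligned}
\dot{V} = {} & \varepsilon \dot{\varepsilon} + \sum_{i=1}^{k} \gamma_{fi}^{-1} \tilde{f}_i \dot{\hat{f}}_i + \sum_{i=1}^{k} \gamma_{qi}^{-1} \tilde{q}_i \dot{\hat{q}}_i + \sum_{i=1}^{k} \gamma_{ri}^{-1} \tilde{r}_i \dot{\hat{r}}_i \\
= {} & -a_m \varepsilon^2 + \eta_f w \varepsilon - \eta_q y_p \varepsilon + \eta_r \varepsilon + \varepsilon \Delta_u(p) u - \varepsilon \Delta_y(p) y_p + \varepsilon d \\
& - \sum_{i=1}^{k} \sigma w^2 \beta_i^2 (\hat{f}_i - \hat{f}_i^*) \tilde{f}_i - \sum_{i=1}^{k} \sigma y_p^2 \beta_i^2 (\hat{q}_i - \hat{q}_i^*) \tilde{q}_i - \sum_{i=1}^{k} \sigma \beta_i^2 (\hat{r}_i - \hat{r}_i^*) \tilde{r}_i.
\end{aligned} \tag{18.62} \]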
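The step from the first to the second line of Eq. (18.62) relies on a cancellation between the error dynamics and the adaptive law. As a sketch, the terms involving $\tilde{f}_i$ can be worked out as follows, assuming that $\mathbf{f}^*$ is constant so that $\dot{\tilde{f}}_i = \dot{\hat{f}}_i$; the $\tilde{q}_i$ and $\tilde{r}_i$ terms follow the same pattern with the signs given in Eqs. (18.53) and (18.58).

```latex
\begin{aligned}
\varepsilon \Big[ b_{sign} \sum_{i=1}^{k} \beta_i \tilde{f}_i + \eta_f \Big] w
  + \sum_{i=1}^{k} \gamma_{fi}^{-1} \tilde{f}_i \dot{\hat{f}}_i
&= b_{sign}\, \varepsilon w \sum_{i=1}^{k} \beta_i \tilde{f}_i + \eta_f w \varepsilon
   - b_{sign}\, \varepsilon w \sum_{i=1}^{k} \beta_i \tilde{f}_i
   - \sum_{i=1}^{k} \sigma w^2 \beta_i^2 (\hat{f}_i - \hat{f}_i^*) \tilde{f}_i \\
&= \eta_f w \varepsilon - \sum_{i=1}^{k} \sigma w^2 \beta_i^2 (\hat{f}_i - \hat{f}_i^*) \tilde{f}_i .
\end{aligned}
```

The sign-indefinite products of $\varepsilon$ with the parameter errors cancel exactly, which leaves only the residual term and the leakage term.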

In principle, the first term on the right-hand side of Eq. (18.62) is used to compensate for the next six terms, while the last three terms prevent parameter drift. The second to seventh terms are products of the tracking error $\varepsilon(t)$ and the components of a combined error $E(t)$ defined as

\[ E(t) = \eta_f(t) w(t) - \eta_q(t) y_p(t) + \eta_r(t) + \Delta_u(p) u(t) - \Delta_y(p) y_p(t) + d(t). \tag{18.63} \]

Equation (18.62) can be rewritten as

\[ \dot{V} = -a_m \left( \varepsilon^2 - \frac{E \varepsilon}{a_m} \right) - \sum_{i=1}^{k} \sigma w^2 \beta_i^2 (\hat{f}_i - \hat{f}_i^*) \tilde{f}_i - \sum_{i=1}^{k} \sigma y_p^2 \beta_i^2 (\hat{q}_i - \hat{q}_i^*) \tilde{q}_i - \sum_{i=1}^{k} \sigma \beta_i^2 (\hat{r}_i - \hat{r}_i^*) \tilde{r}_i. \tag{18.64} \]

The first term on the right-hand side of Eq. (18.64) becomes negative if $|\varepsilon| > |E| / a_m$. If the combined error were a priori bounded, the boundedness of the tracking error $\varepsilon$ would be more or less proven. The problem is that $E(t)$ includes not only bounded signals ($w(t)$, $\eta_f(t)$, $\eta_q(t)$, $\eta_r(t)$, $d(t)$) but also signals whose boundedness is yet to be proven ($u(t)$, $y_p(t)$). If the system becomes unstable, the plant output $y_p(t)$ becomes unbounded and, consequently, the same applies to the control input $u(t)$. If $y_p(t)$ is bounded, it is easy to see from the control law that $u(t)$ is also bounded. Unboundedness of $y_p(t)$ is prevented by the leakage terms in the adaptive law. The last three terms in Eq. (18.64), which are due to the leakage, contain three similar expressions of the following form:

\[ (\hat{f}_i(t) - \hat{f}_i^*)\, \tilde{f}_i(t) = (\hat{f}_i(t) - \hat{f}_i^*)(\hat{f}_i(t) - f_i^*). \tag{18.65} \]

It is simple to see that this expression is positive if either $\hat{f}_i > \max\{\hat{f}_i^*, f_i^*\}$ or $\hat{f}_i < \min\{\hat{f}_i^*, f_i^*\}$. The same reasoning applies to $\hat{q}_i$ and $\hat{r}_i$. This means that the last three terms in Eq. (18.64) become negative if the estimated parameters are large (or small) enough. The novelty of the proposed adaptive law with respect to the one in [58] is in the quadratic terms with $y_p$ and $w$ in the leakage. These terms are used to help cancel the contribution of $\varepsilon E$ in Eq. (18.64):

\[ \varepsilon E = \varepsilon \eta_f w - \varepsilon \eta_q y_p + \varepsilon \eta_r + \varepsilon \Delta_u(p) u - \varepsilon \Delta_y(p) y_p + \varepsilon d. \tag{18.66} \]

Since $\varepsilon(t)$ is the difference between $y_p(t)$ and $y_m(t)$ and the latter is bounded, $\varepsilon = O(y_p)$ when $y_p$ tends to infinity. By analyzing the control law and taking into account the stability of the parasitic dynamics $\Delta_u(s)$ and $\Delta_y(s)$, the following can be concluded:

\[ u = O(y_p), \quad \Delta_u(p) u = O(y_p) \ \Rightarrow\ \varepsilon E = O(y_p^2). \tag{18.67} \]

The third term on the right-hand side of Eq. (18.64) is $-(\hat{q}_i - \hat{q}_i^*)\, \tilde{q}_i\, O(y_p^2)$, which means that the “gain” $(\hat{q}_i - \hat{q}_i^*)\, \tilde{q}_i$ with respect to $y_p^2$ of the negative contributions to $\dot{V}$ can always become greater (as a result of adaptation) than the fixed gain of the quadratic terms with $y_p$ in Eq. (18.66). The growth of the estimated parameters is also problematic because these parameters are control gains, and high gains can induce instability in combination with parasitic dynamics. Consequently, $\sigma$ has to be large enough to prevent this type of instability. Note that stabilization in the presence of parasitic dynamics is achieved without the explicit dynamic normalization that was used in [58].

The stability analysis of a similar adaptive law for linear systems was treated in [65], where it was proven that all the signals in the system are bounded and that the tracking error converges to a residual set whose size depends on the modeling error, provided that the leakage parameter $\sigma$ is chosen large enough with respect to the norm of the parasitic dynamics. In the approach proposed in this chapter, the “modeling error” is $E(t)$ from Eq. (18.63), and therefore the size of the residual set depends on the norms of the transfer functions $\|\Delta_u\|$ and $\|\Delta_y\|$, the size of the disturbance $d$, and the size of the bounded residuals $\eta_f(t)$, $\eta_q(t)$, and $\eta_r(t)$.

Only the adaptation of the consequent part of the fuzzy rules is treated in this chapter. The stability of the system is guaranteed for any (fixed) shape of the membership functions in the antecedent part. This means that the approach is very easy to combine with existing evolving approaches for the antecedent part. If the membership functions are slowly evolving, these changes introduce another term into $\dot{V}$ that can be shown not to be larger than $O(y_p^2)$. This means that the system stability is preserved by the robustness properties of the adaptive laws. If, however, fast changes in the membership functions occur, a rigorous stability analysis would have to be performed.

18.4.3. Simulation Example: Three-Tank System

A simulation example is given to illustrate the proposed approach. A simulated plant was chosen because it is easier to reproduce the same operating conditions in simulation than on a real plant. The simulated test plant consisted of three water tanks. The schematic representation of the plant is given in Figure 18.4. The control objective was to maintain the water level in the third tank by changing the inflow into the first tank. When modeling the plant, it was assumed that the flow through a valve is proportional to the square root of the pressure difference across the valve. The mass conservation equations for the three tanks are

\[ \begin{aligned}
S_1 \dot{h}_1 &= \phi_{in} - k_1\, \mathrm{sign}(h_1 - h_2) \sqrt{|h_1 - h_2|} \\
S_2 \dot{h}_2 &= k_1\, \mathrm{sign}(h_1 - h_2) \sqrt{|h_1 - h_2|} - k_2\, \mathrm{sign}(h_2 - h_3) \sqrt{|h_2 - h_3|} \\
S_3 \dot{h}_3 &= k_2\, \mathrm{sign}(h_2 - h_3) \sqrt{|h_2 - h_3|} - k_3\, \mathrm{sign}(h_3) \sqrt{|h_3|}
\end{aligned}, \tag{18.68} \]

where $\phi_{in}$ is the volume inflow into the first tank, $h_1$, $h_2$, and $h_3$ are the water levels in the three tanks, $S_1$, $S_2$, and $S_3$ are the cross-sectional areas of the tanks, and $k_1$, $k_2$, and $k_3$ are the coefficients of the valves. The following values were chosen for the parameters of the system:

\[ \begin{aligned}
S_1 = S_2 = S_3 &= 2 \cdot 10^{-2}\ \text{m}^2 \\
k_1 = k_2 = k_3 &= 2 \cdot 10^{-4}\ \text{m}^{5/2}\,\text{s}^{-1}
\end{aligned}. \tag{18.69} \]

Figure 18.4: Schematic representation of the plant. [Three tanks in series with levels $h_1(t)$, $h_2(t)$, $h_3(t)$ and cross-sections $S_1$, $S_2$, $S_3$; inflow $\phi_{in}(t)$; valves $V_1$ ($k_1$), $V_2$ ($k_2$), $V_3$ ($k_3$) carrying the flows $\phi_1(t)$, $\phi_2(t)$, and $\phi_{out}(t)$.]
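The three-tank model of Eqs. (18.68) and (18.69) is easy to simulate. The sketch below uses a fixed-step forward-Euler integrator, which is an assumption (the chapter does not say how the plant was integrated), and the function names, step size, and initial levels are hypothetical.

```python
import numpy as np

S = 2e-2   # tank cross-sections S1 = S2 = S3 [m^2], Eq. (18.69)
K = 2e-4   # valve coefficients k1 = k2 = k3 [m^(5/2) s^-1], Eq. (18.69)

def valve_flow(k, dh):
    """Flow through a valve: k * sign(dh) * sqrt(|dh|)."""
    return k * np.sign(dh) * np.sqrt(np.abs(dh))

def three_tank_rhs(h, phi_in):
    """Right-hand side of Eq. (18.68): returns [dh1/dt, dh2/dt, dh3/dt]."""
    h1, h2, h3 = h
    q12 = valve_flow(K, h1 - h2)    # flow from tank 1 to tank 2
    q23 = valve_flow(K, h2 - h3)    # flow from tank 2 to tank 3
    q_out = valve_flow(K, h3)       # outflow from tank 3
    return np.array([phi_in - q12, q12 - q23, q23 - q_out]) / S

def simulate(h0, phi_in, dt, n_steps):
    """Forward-Euler integration (ASSUMPTION: fixed step, constant inflow)."""
    h = np.array(h0, dtype=float)
    for _ in range(n_steps):
        h = h + dt * three_tank_rhs(h, phi_in)
    return h

phi_nominal = 8e-5                           # nominal inflow [m^3 s^-1]
h_expected = np.array([0.48, 0.32, 0.16])    # levels that balance the nominal inflow
residual = three_tank_rhs(h_expected, phi_nominal)

# HYPOTHETICAL initial levels; 20000 steps of 1 s.
h_final = simulate([0.4, 0.3, 0.2], phi_nominal, dt=1.0, n_steps=20000)
```

At the nominal inflow, each valve must pass $8 \cdot 10^{-5}$ m$^3$ s$^{-1}$, which requires a level difference of $(8 \cdot 10^{-5} / 2 \cdot 10^{-4})^2 = 0.16$ m across every valve; hence the levels 0.48 m, 0.32 m, and 0.16 m.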


The nominal value of the inflow $\phi_{in}$ was set to $8 \cdot 10^{-5}$ m$^3$ s$^{-1}$, resulting in steady-state values of 0.48 m, 0.32 m, and 0.16 m for $h_1$, $h_2$, and $h_3$, respectively. In the following, $u$ and $y_p$ denote the deviations of $\phi_{in}$ and $h_3$, respectively, from the operating point. Analysis of the plant shows that it is nonlinear. It has to be pointed out that the parasitic dynamics are also nonlinear, not just the dominant part, as was assumed in deriving the control algorithm. This means that this example also tests the ability of the proposed control to cope with nonlinear parasitic dynamics. The coefficients of the linearized system in different operating points depend on $u$, $h_1$, $h_2$, and $h_3$, even though only $y_p$ is used as the antecedent variable $z_1$. This is again a violation of the basic assumptions, but it still produces fairly good results.

The proposed control algorithm was compared to a classical model reference adaptive control (MRAC) with $e_1$-modification. The adaptive gains $\gamma_{fi}$, $\gamma_{qi}$, and $\gamma_{ri}$ in the case of the proposed approach were the same as $\gamma_f$, $\gamma_q$, and $\gamma_r$, respectively, in the case of MRAC. The reference signal was chosen as a periodic piecewise constant function that covered quite a wide area around the operating point (±50% of the nominal value); it was the same in all cases. There were 11 triangular fuzzy membership functions (the fuzzification variable was $y_p$); these were distributed evenly across the interval $[-0.1, 0.1]$. As already mentioned, the antecedent part was not evolved in this work. The control input signal $u$ was saturated to the interval $[-8 \cdot 10^{-5},\ 8 \cdot 10^{-5}]$. No prior knowledge of the estimated parameters was available to us, so the initial parameter estimates were 0 for all examples. The design objective is that the output of the plant follows the output of the reference model $0.01/(s + 0.01)$.
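The antecedent partition and the input saturation used in this example can be sketched as follows. The text specifies 11 evenly distributed triangular membership functions on $[-0.1, 0.1]$ and the saturation limits; the assumption here is that neighbouring triangles overlap so that the memberships sum to one, and that the edge functions hold their value outside the interval. The function names are hypothetical.

```python
import numpy as np

def triangular_memberships(y, centers):
    """Evenly spaced triangular membership functions forming a partition of unity."""
    width = centers[1] - centers[0]
    y = np.clip(y, centers[0], centers[-1])   # ASSUMPTION: hold the edge membership outside the interval
    return np.clip(1.0 - np.abs(y - centers) / width, 0.0, 1.0)

def saturate(u, u_max=8e-5):
    """Limit the control input to the interval [-u_max, u_max]."""
    return float(np.clip(u, -u_max, u_max))

centers = np.linspace(-0.1, 0.1, 11)          # 11 centres, spacing 0.02
beta = triangular_memberships(0.03, centers)  # y_p halfway between the centres 0.02 and 0.04
u_sat = saturate(1e-3)
```

With this partition at most two rules are active at any time, and the memberships are already normalized, so no division as in Eq. (18.47) is needed.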
The results of the experiment with the classical MRAC controller with e1 -modification are shown in Figure 18.5. We used the following design parameters: γ f = 10−4 , γq = 2·10−4 , γr = 10−6 ,

σ = 0.1. Figures 18.6 and 18.7 show the results of the proposed approach; the former shows a period of system responses after the adaptation has settled, and the ˆ q, latter depicts time plots of the estimated parameters. Since f, ˆ and rˆ are vectors, all elements of the vectors are depicted. Note that every change in the reference signal results in a sudden increase in tracking error ε (up to 0.01). This is due to the fact that zero tracking of the reference model with relative degree 1 is not possible if the plant has relative degree 3. The experiments show that the performance of the proposed approach is better than the performance of the MRAC controller for linear plant which is expected due to nonlinearity of the plant. Very good results were obtained in the case of the proposed approach even though that the parasitic dynamics are nonlinear and linearized parameters depend not only on the antecedent variable y p but also on

page 724

June 1, 2022 13:18

Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch18

Fuzzy Model-Based Control: Predictive and Adaptive Approach

page 725

725

yp , ym , w

0.1 0 −0.1 4

4.01

4.02 t

4.03

4.02 t

4.03

4.02 t

4.03

4.04 6

x 10

0.01 ε

0 −0.01 4

4.01

6

x 10

−5

u

5

4.04

x 10

0 −5 4

4.01

4.04 6

x 10

Figure 18.5: The MRAC controller — time plots of the reference signal and outputs of the plant and the reference model (upper figure), time plot of tracking error (middle figure), and time plot of the control signal (lower figure).

yp , ym , w

0.1 0 −0.1 4

4.01

4.02 t

4.03

4.02 t

4.03

4.02 t

4.03

4.04 6

x 10

0.01 ε

0 −0.01 4

4.01

6

−5

u

5

4.04 x 10

x 10

0 −5 4

4.01

4.04 6

x 10

Figure 18.6: The proposed approach — time plots of the reference signal and outputs of the plant and the reference model (upper figure), time plot of tracking error (middle figure), and time plot of the control signal (lower figure).

June 1, 2022 13:18

Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

726

b4528-v2-ch18

page 726

Handbook on Computer Learning and Intelligence — Vol. II −4

Figure 18.7: The proposed approach. Time plots of the control gains (panels $\hat{f}$, $\hat{q}$, and $\hat{r}$). Horizontal axis: t, from 0 to 4 (×10^6).

others. The spikes on ε in Figure 18.6 are a consequence of the fact that the plant of relative degree 3 is forced to follow the reference model of relative degree 1. These spikes are inevitable no matter which controller is used. The drawback of the proposed approach is relatively slow convergence, since the parameters are only adapted when the corresponding membership is non-zero. This drawback can be overcome by using classical MRAC in the beginning, when there are no parameter estimates or the estimates are bad. When the system approaches the desired behavior, the adaptation can switch to the proposed one by initializing all elements of the vectors $\hat{f}$, $\hat{q}$, and $\hat{r}$ with the estimated scalar parameters from the classical MRAC.

18.5. Conclusion

This chapter presents two approaches to the control of nonlinear systems. We chose these two solutions because, on the one hand, they are easy to tune and easy to implement and, on the other, they guarantee stability under some assumptions. Both approaches deal only with the rule consequents and are easy to extend to variants with an evolving antecedent part.


References

[1] W. Pedrycz, Fuzzy Control and Fuzzy Systems (Research Studies Press, 1993).
[2] K. Passino and S. Yurkovich, Fuzzy Control (Addison-Wesley, 1998).
[3] K. Tanaka and H. O. Wang, Fuzzy Control Systems Design and Analysis: A Linear Matrix Inequality Approach (John Wiley & Sons, Inc., New York, NY, USA, 2002).
[4] R. Babuska, Fuzzy Modeling for Control (Kluwer Academic Publishers, 1998).
[5] L. A. Zadeh, Outline of a new approach to the analysis of complex systems and decision processes, IEEE Trans. Syst. Man Cybern., SMC-3(1), 28–44 (1973).
[6] E. Mamdani, Application of fuzzy algorithms for control of simple dynamic plant, Proc. Inst. Electr. Eng., 121(12), 1585–1588 (1974).
[7] T. Takagi and M. Sugeno, Fuzzy identification of systems and its applications to modelling and control, IEEE Trans. Syst. Man Cybern., 15, 116–132 (1985).
[8] B. Kosko, Fuzzy systems as universal approximators, IEEE Trans. Comput., 43(11), 1329–1333 (1994).
[9] H. G. Ying, Necessary conditions for some typical fuzzy systems as universal approximators, Automatica, 33, 1333–1338 (1997).
[10] L.-X. Wang and J. M. Mendel, Fuzzy basis functions, universal approximation, and orthogonal least-squares learning, IEEE Trans. Neural Networks, 3(5), 807–814 (1992).
[11] R. R. Goldberg, Methods of Real Analysis (John Wiley and Sons, 1976).
[12] C.-H. Lin, SISO nonlinear system identification using a fuzzy-neural hybrid system, Int. J. Neural Syst., 8(3), 325–337 (1997).
[13] P. Angelov, R. Buswell, J. A. Wright, and D. Loveday, Evolving rule-based control. In EUNITE Symposium, pp. 36–41, Tenerife, Spain (2001).
[14] P. Angelov and D. P. Filev, An approach to online identification of Takagi-Sugeno fuzzy models, IEEE Trans. Syst. Man Cybern. B: Cybern., 34(1), 484–498 (2004).
[15] A. B. Cara, Z. Lendek, R. Babuska, H. Pomares, and I. Rojas, Online self-organizing adaptive fuzzy controller: application to a nonlinear servo system. In Fuzzy Systems (FUZZ), 2010 IEEE International Conference on, pp. 1–8 (2010). ISBN 1098-7584. doi: 10.1109/FUZZY.2010.5584027.
[16] P. Angelov, P. Sadeghi-Tehran, and R. Ramezani, An approach to automatic real-time novelty detection, object identification, and tracking in video streams based on recursive density estimation and evolving Takagi-Sugeno fuzzy systems, Int. J. Intell. Syst., 26(3), 189–205 (2011).
[17] P. Sadeghi-Tehran, A. B. Cara, P. Angelov, H. Pomares, I. Rojas, and A. Prieto, Self-evolving parameter-free rule-based controller. In IEEE, Proc. 2012 World Congress on Computational Intelligence, WCCI-2012, pp. 754–761. IEEE (2012). ISBN 978-1-4673-1489-3.
[18] D. W. Clarke, C. Mohtadi, and P. S. Tuffs, Generalized predictive control–part 1, part 2, Automatica, 24, 137–160 (1987).
[19] J. Richalet, A. Rault, J. L. Testud, and J. Papon, Model predictive heuristic control: applications to industrial processes, Automatica, 14, 413–428 (1978).
[20] J. Richalet, Industrial application of model based predictive control, Automatica, 29(5), 1251–1274 (1993).
[21] C. R. Cutler and B. L. Ramaker, Dynamic matrix control - a computer control algorithm. In Proceedings of the ACC, San Francisco (1980).
[22] R. M. C. De Keyser, P. G. A. Van de Valde, and F. A. G. Dumortier, A comparative study of self-adaptive long-range predictive control methods, Automatica, 24(2), 149–163 (1988).
[23] B. E. Ydstie, Extended horizon adaptive control. In IFAC World Congress (1985).
[24] B. W. Bequette, Nonlinear control of chemical processes: a review, Ind. Eng. Chem. Res., 30, 1391–1413 (1991).
[25] M. A. Henson, Nonlinear model predictive control: current status and future directions, Comput. Chem. Eng., 23, 187–202 (1998).
[26] J. L. Figueroa, Piecewise linear models in model predictive control, Lat. Am. Appl. Res., 31(4), 309–315 (2001).
[27] H. O. Wang, K. Tanaka, and M. F. Griffin, An approach to fuzzy control of nonlinear systems: stability and design issues, IEEE Trans. Fuzzy Syst., 4(1), 14–23 (1996).
[28] M. S. Padin and J. L. Figueroa, Use of CPWL approximations in the design of a numerical nonlinear regulator, IEEE Trans. Autom. Control, 45(6), 1175–1180 (2000).
[29] F. J. Doyle, T. A. Ogunnaike, and R. K. Pearson, Nonlinear model-based control using second-order Volterra models, Automatica, 31, 697–714 (1995).
[30] N. Li, S. Li, and Y. Xi, Multi-model predictive control based on the Takagi-Sugeno fuzzy models: a case study, Inf. Sci. Inf. Comput. Sci., 165(3–4), 247–263 (2004).
[31] J. A. Roubos, S. Mollov, R. Babuska, and H. B. Verbruggen, Fuzzy model-based predictive control using Takagi-Sugeno models, Int. J. Approximate Reasoning, 22(1–2), 3–30 (1999).
[32] J. Abonyi, L. Nagy, and F. Szeifert, Fuzzy model-based predictive control by instantaneous linearization, Fuzzy Sets Syst., 120(1), 109–122 (2001).
[33] I. Škrjanc and D. Matko, Predictive functional control based on fuzzy model for heat-exchanger pilot plant, IEEE Trans. Fuzzy Syst., 8(6), 705–712 (2000).
[34] D. Andone and A. Hossu, Predictive control based on fuzzy model for steam generator. In Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1245–1250 (2004).
[35] J.-H. Kim and U.-Y. Huh, Fuzzy model based predictive control. In Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 405–409 (1998).
[36] H.-R. Sun, P. Han, and S.-M. Jiao, A predictive control strategy based on fuzzy system. In Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, pp. 549–552 (2004).
[37] S. Blažič and I. Škrjanc, Design and stability analysis of fuzzy model-based predictive control–a case study, J. Intell. Rob. Syst., 49(3), 279–292 (2007).
[38] D. Dovžan and I. Škrjanc, Control of mineral wool thickness using predictive functional control, Rob. Comput. Integr. Manuf., 28(3), 344–350 (2012).
[39] I. Škrjanc and G. Klančar, A comparison of continuous and discrete tracking-error model-based predictive control for mobile robots, Rob. Auton. Syst., 87, 177–187 (2017).
[40] R. Baždarić, D. Vončina, and I. Škrjanc, Comparison of novel approaches to the predictive control of a DC-DC boost converter, based on heuristics, Energies, 11(12), 3300 (2018).
[41] K. Tanaka, T. Ikeda, and H. O. Wang, Robust stabilization of a class of uncertain nonlinear systems via fuzzy control: quadratic stabilizability, H∞ control theory, and linear matrix inequalities, IEEE Trans. Fuzzy Syst., 4(1), 1–13 (1996).
[42] D. J. Leith and W. E. Leithead, Gain-scheduled and nonlinear systems: dynamics analysis by velocity-based linearization families, Int. J. Control, 70(2), 289–317 (1998).
[43] D. J. Leith and W. E. Leithead, Analytical framework for blended model systems using local linear models, Int. J. Control, 72(7–8), 605–619 (1999).
[44] J. D. Morningred, B. E. Paden, and D. A. Mellichamp, An adaptive nonlinear predictive controller, Chem. Eng. Sci., 47, 755–762 (1992).
[45] M. A. Henson and D. E. Seborg, Input-output linearization of general processes, AIChE J., 36, 1753 (1990).
[46] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing (Prentice Hall, 1999).
[47] M. Krstić, I. Kanellakopoulos, and P. Kokotović, Nonlinear and Adaptive Control Design (John Wiley and Sons, 1995).
[48] T. J. Procyk and E. H. Mamdani, A linguistic self-organizing process controller, Automatica, 15, 15–30 (1979).
[49] J. R. Layne and K. M. Passino, Fuzzy model reference learning control for cargo ship steering, IEEE Control Syst. Mag., 13, 23–34 (1993).
[50] L. X. Wang and J. M. Mendel, Fuzzy basis functions, universal approximation, and orthogonal least-squares learning, IEEE Trans. Neural Networks, 3, 807–881 (1992).
[51] Y. Tang, N. Zhang, and Y. Li, Stable fuzzy adaptive control for a class of nonlinear systems, Fuzzy Sets Syst., 104, 279–288 (1999).
[52] H. Pomares, I. Rojas, J. González, F. Rojas, M. Damas, and F. J. Fernández, A two-stage approach to self-learning direct fuzzy controllers, Int. J. Approximate Reasoning, 29(3), 267–289 (2002). ISSN 0888-613X.
[53] P. A. Ioannou and J. Sun, Robust Adaptive Control (Prentice-Hall, 1996).
[54] S. Tong, T. Wang, and J. T. Tang, Fuzzy adaptive output tracking control of nonlinear systems, Fuzzy Sets Syst., 111, 169–182 (2000).
[55] K.-M. Koo, Stable adaptive fuzzy controller with time varying dead-zone, Fuzzy Sets Syst., 121, 161–168 (2001).
[56] S. Ge and J. Wang, Robust adaptive neural control for a class of perturbed strict feedback nonlinear systems, IEEE Trans. Neural Networks, 13(6), 1409–1419 (2002).
[57] S. Tong and Y. Li, Adaptive fuzzy output feedback tracking backstepping control of strict-feedback nonlinear systems with unknown dead zones, IEEE Trans. Fuzzy Syst., 20(1), 168–180 (2012).
[58] S. Blažič, I. Škrjanc, and D. Matko, Globally stable direct fuzzy model reference adaptive control, Fuzzy Sets Syst., 139(1), 3–33 (2003).
[59] S. Blažič, I. Škrjanc, and D. Matko, A new fuzzy adaptive law with leakage. In 2012 IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS), pp. 47–50 (2012).
[60] S. Blažič, I. Škrjanc, and D. Matko, A robust fuzzy adaptive law for evolving control systems, Evolving Syst., pp. 1–8 (2013). doi: 10.1007/s12530-013-9084-7.
[61] G. Andonovski, P. Angelov, S. Blažič, and I. Škrjanc, A practical implementation of robust evolving cloud-based controller with normalized data space for heat-exchanger plant, Appl. Soft Comput., 48, 29–38 (2016).
[62] G. Andonovski, A. Bayas, D. Saez, S. Blažič, and I. Škrjanc, Robust evolving cloud-based control for the distributed solar collector field. In 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1570–1577 (2016). doi: 10.1109/FUZZ-IEEE.2016.7737877.
[63] G. Andonovski, P. Angelov, S. Blažič, and I. Škrjanc, Robust evolving cloud-based controller (RECCo). In 2017 Evolving and Adaptive Intelligent Systems (EAIS), pp. 1–6 (2017).
[64] S. Blažič and A. Zdešar, An implementation of an evolving fuzzy controller. In 2017 Evolving and Adaptive Intelligent Systems (EAIS), pp. 1–7 (2017).
[65] S. Blažič, I. Škrjanc, and D. Matko, Adaptive law with a new leakage term, IET Control Theory Appl., 4(9), 1533–1542 (2010).


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0019

Chapter 19

Reinforcement Learning with Applications in Autonomous Control and Game Theory

Kyriakos G. Vamvoudakis∗,§, Frank L. Lewis†,¶, and Draguna Vrabie‡

∗ Daniel Guggenheim School of Aerospace Engineering, Georgia Institute of Technology, GA, 30332, USA
† University of Texas at Arlington Research Institute, University of Texas, Arlington, TX 76118, USA
‡ Pacific Northwest National Laboratory, WA 99352, USA
§ [email protected]
¶ [email protected]
[email protected]

This book chapter showcases how ideas from control systems engineering, game theory, and computational intelligence have been combined to obtain a new class of control and game-theoretic techniques. Since reinforcement learning involves modifying the control policy based on responses from an unknown environment, it appears at first sight to be closely related to adaptive control. Moreover, reinforcement learning methods allow learning of optimal and game-theoretic solutions relative to prescribed cost/payoff metrics by measuring data in real time along system trajectories; hence, they are also related to optimal feedback control. This chapter is an exposition of the research results that clarify these relations.

19.1. Background

Adaptive control has been one of the most widespread techniques for feedback control of modern engineered systems since the 1960s. Its effective and reliable use in aerospace systems, the process industry, vehicle control, communications, and elsewhere is firmly established. Optimal control is equally widespread since it enables the design of feedback controllers with the purpose of minimizing energy, fuel consumption, performance time, or other premium quantities. Optimal controllers are normally designed offline by solving the Hamilton–Jacobi–Bellman (HJB) equations, for example, the Riccati equation, using the


complete knowledge of the system dynamics. Optimal control policies for nonlinear systems can be obtained by solving nonlinear HJB equations, which usually cannot be solved analytically. By contrast, adaptive controllers learn online to control unknown systems using data measured in real time along the system trajectories. Adaptive controllers are generally not optimal in the sense of minimizing user-prescribed performance functions. Indirect adaptive controllers have been designed that use system identification techniques to first identify the system parameters, and then use the obtained model to solve optimal design equations [1]. It has been shown that adaptive controllers may satisfy certain inverse optimality conditions [2]. The design of adaptive controllers that learn online, in real time, the solutions to user-prescribed optimal control problems can be studied by means of reinforcement learning (RL) and adaptive dynamic programming techniques. This design problem forms the subject of this chapter, which is a condensation of two papers that appeared in the IEEE Control Systems Magazine [3, 4].

The real-time learning of optimal policies for unknown systems occurs in nature. The limits within which living organisms can survive are often quite narrow and the resources available to most species are meager. Therefore, most organisms occurring in nature act in an optimal fashion to conserve resources while achieving their goals. Living organisms learn by acting on their environment, observing and evaluating the resulting reward stimulus, and adjusting their actions accordingly to improve the reward. Inspired by natural learning mechanisms, that is, those occurring in biological organisms including animals and humans, RL techniques for machine learning and artificial intelligence have been developed and used mainly in the computational intelligence community [5–9] for autonomy [10].
The purpose of this chapter is to give an exposition of the usefulness of RL techniques for designing feedback policies for engineered systems, where optimal actions may be driven by objectives such as minimum fuel, minimum energy, minimum risk, maximum reward, and so on. We specifically focus on a family of techniques known as approximate or adaptive dynamic programming (ADP) [11–16, 19]. We show that RL ideas can be used to design a family of adaptive control algorithms that converge in real time to optimal control solutions while using data measured along the system trajectories. Such techniques may be referred to as optimal adaptive control. The use of RL techniques provides optimal control solutions for linear or nonlinear systems using adaptive control techniques. In effect, RL methods allow the solution of HJB equations, or for linear quadratic design, the Riccati equation, online and without knowing the full system dynamics. In machine learning, RL [5, 7, 8] refers to the set of optimization problems that involve an actor or an agent that interacts with its environment and modifies its actions, or control policies, based on stimuli received in response to its actions.


RL is based on evaluative information from the environment and could be called action-based learning. RL implies a cause-and-effect relationship between actions and reward or punishment. It implies goal-directed behavior at least insofar as the agent has an understanding of reward versus lack of reward or punishment. The idea behind RL is that of modifying actions or behavior policy in order to reduce the prediction error between anticipated future rewards and actual performance as computed based on observed rewards. The RL algorithms are constructed on the idea that effective control decisions must be remembered, by means of a reinforcement signal, such that they become more likely to be used a second time. Although the idea originates from experimental animal learning, where it is observed that the dopamine neurotransmitter acts as a reinforcement informational signal that favors learning at the level of the neuron [20, 21], RL is strongly connected, from a theoretical point of view, with adaptive control and optimal control methods.

Although RL algorithms have been widely used to solve optimal regulation problems, relatively few research efforts have considered solving the optimal tracking control problem for discrete-time [22] and continuous-time systems [24, 25]. Moreover, existing methods require exact knowledge of the system dynamics a priori while finding the feedforward part of the control input using the dynamic inversion concept. In order to attain the required knowledge of the system dynamics, in [24], a plant model was first identified and then an RL-based optimal tracking controller was synthesized using the identified model.

One class of RL methods is based on the actor-critic structure shown in Figure 19.1, where an actor component applies an action, or control policy, to the environment, and a critic component assesses the value of that action.
Based on this assessment of the value, one of several schemes can then be used to modify or improve the action, or control policy, in the sense that the new policy yields a value that is improved relative to the previous value. The learning mechanism supported by the actor–critic structure implies two steps, namely, policy evaluation, executed by the critic, followed by policy improvement, performed by the actor. The policy evaluation step is performed by observing from the environment the results of applying current actions. It is of interest to study RL systems that have an actor–critic structure wherein the critic assesses the value of current policies based on some sort of optimality criteria [7, 17, 19, 26]. It is worth noting that the actor–critic structure is similar to the one encountered in adaptive control. However, adaptive control schemes make use of system identification mechanisms to first obtain a model of the system dynamics and then use this model to guide adaptation of the parameters of the controller/actor. In contrast, in the optimal RL algorithm case the learning process is moved to a higher level, having no longer as object of interest the details of a


Figure 19.1: RL with an actor-critic structure. This structure provides methods for learning optimal control solutions online based on data measured along the system trajectories.

system’s dynamics, but a performance index that quantifies how close to optimality the closed loop control system operates. In this scheme, RL is a means of learning optimal behaviors by observing the real-time responses from the environment to non-optimal control policies. Two new research monographs on this subject are [27, 28]. The intention of this book chapter is to present the main ideas and algorithms of RL and approximate dynamic programming and to show how they are applied in the design of optimal adaptive feedback controllers. A few connections are also shown between RL and decision and control of systems distributed over communication graphs, including shortest path problems, relaxation methods, and cooperative control. In fact, we show that RL provides connections between optimal control, adaptive control, and cooperative control and decisions on finite graphs. This book chapter presents an expository development of ideas from RL and ADP and their applications in automatic control systems. Surveys of ADP are given in [29–41].

19.2. Markov Decision Processes and Stochasticity

A framework for studying RL is provided by Markov decision processes (MDP). In fact, dynamical decision problems can be cast into the framework of MDP. Included are feedback control systems for human-engineered systems, feedback regulation mechanisms for population balance and survival of species [42], the balance of power between nations, decision-making in multi-player games, and economic mechanisms for the regulation of global financial markets. Therefore, we provide the development of MDP here.


Consider the MDP $(X, U, P, R)$, where $X$ is a set of states and $U$ is a set of actions or controls. The transition probabilities $P : X \times U \times X \to [0, 1]$ give, for each state $x \in X$ and action $u \in U$, the conditional probability $P^{u}_{x,x'} = \Pr\{x' \mid x, u\}$ of transitioning to state $x' \in X$, given that the MDP is in state $x$ and takes action $u$. The cost function $R : X \times U \times X \to \mathbb{R}$ gives the expected immediate cost $R^{u}_{x,x'}$ paid after transition to state $x' \in X$, given that the MDP starts in state $x \in X$ and takes action $u \in U$. The Markov property refers to the fact that the transition probabilities $P^{u}_{x,x'}$ depend only on the current state $x$ and not on the history of how the MDP attained that state.

The basic problem for MDP is to find a mapping $\pi : X \times U \to [0, 1]$ that gives, for each state $x$ and action $u$, the conditional probability $\pi(x, u) = \Pr\{u \mid x\}$ of taking action $u$, given that the MDP is in state $x$. Such a mapping is termed a closed-loop control or action strategy or policy. The strategy or policy is called stochastic or mixed if there is a non-zero probability of selecting more than one control when in state $x$. We can view mixed strategies as probability distribution vectors having as component $i$ the probability of selecting the $i$th control action while in state $x \in X$. If the mapping $\pi : X \times U \to [0, 1]$ admits only one control with probability 1 in every state $x$, then the mapping is called a deterministic policy. Then $\pi(x, u) = \Pr\{u \mid x\}$ corresponds to a function mapping states into controls, $\mu(x) : X \to U$.
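The definitions above can be made concrete with a small tabular representation. The following is a minimal sketch, not taken from the chapter: the two-state, two-action MDP, its numbers, and the helper names `is_valid_transition_model`, `is_deterministic`, and `to_control_law` are all illustrative assumptions.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
# P[u][x][x2] = Pr{x2 | x, u};  R[u][x][x2] = expected cost of that transition.
P = [
    [[0.9, 0.1], [0.2, 0.8]],   # action u = 0
    [[0.5, 0.5], [0.6, 0.4]],   # action u = 1
]
R = [
    [[1.0, 2.0], [0.0, 3.0]],   # action u = 0
    [[2.0, 0.5], [1.0, 1.0]],   # action u = 1
]


def is_valid_transition_model(P, tol=1e-12):
    """Check that every row P[u][x] is a probability distribution over next states."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for per_action in P
        for row in per_action
    )


def is_deterministic(pi, tol=1e-12):
    """A policy pi[x][u] = Pr{u | x} is deterministic if each state picks one control."""
    return all(max(row) >= 1.0 - tol for row in pi)


def to_control_law(pi):
    """Turn a deterministic policy pi[x][u] into the map mu(x): X -> U (as a list)."""
    return [row.index(max(row)) for row in pi]


mixed_pi = [[0.5, 0.5], [0.1, 0.9]]          # stochastic (mixed) policy
deterministic_pi = [[1.0, 0.0], [0.0, 1.0]]  # one control per state
mu = to_control_law(deterministic_pi)        # control chosen in each state
```

Storing the policy as one probability row per state mirrors the view of mixed strategies as probability distribution vectors; a deterministic policy is then simply the special case in which each row is one-hot.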

19.2.1. Optimal Sequential Decision Problems

Dynamical systems evolve causally through time. Therefore, we consider sequential decision problems and impose a discrete stage index $k$ such that the MDP takes an action and changes states at non-negative integer stage values $k$. The stages may correspond to time or, more generally, to sequences of events. We refer to the stage value as the time. Denote the state values and actions at time $k$ by $x_k$, $u_k$. The MDP evolves in discrete time. It is often desirable for human-engineered systems to be optimal in terms of conserving resources such as cost, time, fuel, and energy. Thus, the notion of optimality should be captured in selecting control policies for MDP. Define, therefore, a stage cost at time $k$ by $r_k \equiv r_k(x_k, u_k, x_{k+1})$. Then $R^{u}_{x,x'} = E\{r_k \mid x_k = x, u_k = u, x_{k+1} = x'\}$, with $E\{\cdot\}$ being the expected value operator. Define a performance index as the sum of future costs over the time interval $[k, k + T]$, $T > 0$,

$$J_{k,T} = \sum_{i=0}^{T} \gamma^{i} r_{k+i} \equiv \sum_{i=k}^{k+T} \gamma^{i-k} r_i, \tag{19.1}$$

where $0 \le \gamma < 1$ is a discount factor that reduces the weight of costs incurred further in the future. Usage of MDP in the fields of computational intelligence and economics usually considers $r_k$ as a reward incurred at time $k$, also known as utility, and $J_{k,T}$ as a


discounted return, also known as strategic reward. We refer instead to stage costs and discounted future costs to be consistent with objectives in the control of dynamical systems. For convenience we call $r_k$ the utility.

Consider that an agent selects a control policy $\pi_k(x_k, u_k)$ and uses it at each stage $k$ of the MDP. We are primarily interested in stationary policies, where the conditional probabilities $\pi_k(x_k, u_k)$ are independent of $k$. Then $\pi_k(x, u) \equiv \pi(x, u) = \Pr\{u \mid x\}$, $\forall k$. Non-stationary deterministic policies have the form $\pi = \{\mu_0, \mu_1, \ldots\}$, where each entry is a function $\mu_k(x) : X \to U$; $k = 0, 1, \ldots$. Stationary deterministic policies are independent of time so that $\pi = \{\mu, \mu, \ldots\}$.

Select a fixed stationary policy $\pi(x, u) = \Pr\{u \mid x\}$. Then the “closed-loop” MDP reduces to a Markov chain with state space $X$. That is, the transition probabilities between states are fixed with no further freedom of choice of actions. The transition probabilities of this Markov chain are given by

$$p_{x,x'} \equiv P^{\pi}_{x,x'} = \sum_{u} \Pr\{x' \mid x, u\} \Pr\{u \mid x\} = \sum_{u} \pi(x, u) P^{u}_{x,x'}, \tag{19.2}$$

where the Chapman–Kolmogorov identity is used [46]. A Markov chain is ergodic if all the states are positive recurrent and aperiodic [46]. Under the assumption that the Markov chain corresponding to each policy, with transition probabilities given as in (19.2), is ergodic, it can be shown that every MDP has a stationary deterministic optimal policy [31, 47]. Then, for a given policy, there exists a stationary distribution $p_{\pi}(x)$ over $X$ that gives the steady-state probability that the Markov chain is in state $x$.

The value of a policy is defined as the conditional expected value of future cost when starting in state $x$ at time $k$ and following the policy $\pi(x, u)$ thereafter,

$$V_k^{\pi}(x) = E_{\pi}\{J_{k,T} \mid x_k = x\} = E_{\pi}\Big\{ \sum_{i=k}^{k+T} \gamma^{i-k} r_i \,\Big|\, x_k = x \Big\}, \tag{19.3}$$

where $V^{\pi}(x)$ is known as the value function for policy $\pi(x, u)$. The main objective of MDP is to determine a policy $\pi(x, u)$ to minimize the expected future cost,

$$\pi^{*}(x, u) = \arg\min_{\pi} V_k^{\pi}(x) = \arg\min_{\pi} E_{\pi}\Big\{ \sum_{i=k}^{k+T} \gamma^{i-k} r_i \,\Big|\, x_k = x \Big\}. \tag{19.4}$$

This policy is termed the optimal policy, and the corresponding optimal value is given as

$$V_k^{*}(x) = \min_{\pi} V_k^{\pi}(x) = \min_{\pi} E_{\pi}\Big\{ \sum_{i=k}^{k+T} \gamma^{i-k} r_i \,\Big|\, x_k = x \Big\}. \tag{19.5}$$
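As an illustration of how a fixed stationary policy reduces the MDP to a Markov chain, the sketch below evaluates (19.2) for a small tabular example and approximates the stationary distribution by repeated multiplication. The MDP numbers and the function names `closed_loop_chain` and `stationary_distribution` are assumptions made for this example; the power iteration is one simple way to approximate the steady state of an ergodic chain, not a method prescribed by the chapter.

```python
# Hypothetical MDP data (illustrative assumption): P[u][x][x2] = Pr{x2 | x, u}.
P = [
    [[0.9, 0.1], [0.2, 0.8]],   # action u = 0
    [[0.5, 0.5], [0.6, 0.4]],   # action u = 1
]
pi = [[0.5, 0.5], [0.1, 0.9]]   # stationary policy, pi[x][u] = Pr{u | x}


def closed_loop_chain(P, pi):
    """Eq. (19.2): p[x][x2] = sum_u pi[x][u] * P[u][x][x2]."""
    n_states, n_actions = len(pi), len(P)
    return [
        [sum(pi[x][u] * P[u][x][x2] for u in range(n_actions))
         for x2 in range(n_states)]
        for x in range(n_states)
    ]


def stationary_distribution(p, iterations=1000):
    """Approximate the steady-state distribution by iterating d <- d p.

    The iteration converges when the chain is ergodic, which is assumed here.
    """
    n = len(p)
    d = [1.0 / n] * n
    for _ in range(iterations):
        d = [sum(d[x] * p[x][x2] for x in range(n)) for x2 in range(n)]
    return d


p = closed_loop_chain(P, pi)      # rows are [0.7, 0.3] and [0.56, 0.44]
d = stationary_distribution(p)    # steady-state probabilities of the two states
```

For this example the steady state can be checked by hand: the balance condition $0.3\,d_0 = 0.56\,d_1$ together with $d_0 + d_1 = 1$ gives $d_0 = 0.56/0.86$.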


In computational intelligence and economics, where the focus is on utilities and rewards, the objective is instead to maximize the expected performance index.
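The two conventions are equivalent up to a sign change. The short sketch below evaluates the discounted sum of (19.1) on two made-up cost sequences and checks that minimizing the discounted cost selects the same candidate as maximizing the discounted reward when the reward is defined as the negative cost. The sequences, the discount factor, and the name `discounted_sum` are illustrative assumptions.

```python
def discounted_sum(stage_values, gamma):
    """Eq. (19.1): sum over i of gamma**i * r_{k+i} for a recorded sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(stage_values))


# Hypothetical stage costs observed along two candidate trajectories.
candidates = {
    "a": [1.0, 1.0, 1.0],
    "b": [0.0, 2.0, 2.0],
}
gamma = 0.5

J_a = discounted_sum(candidates["a"], gamma)   # 1 + 0.5 + 0.25
J_b = discounted_sum(candidates["b"], gamma)   # 0 + 1.0 + 0.5

# Cost convention: pick the candidate with the smallest discounted cost.
best_by_cost = min(candidates, key=lambda name: discounted_sum(candidates[name], gamma))

# Reward convention (reward = -cost): pick the largest discounted reward.
best_by_reward = max(
    candidates,
    key=lambda name: discounted_sum([-c for c in candidates[name]], gamma),
)
```

Because the discounted sum is linear in the stage values, negating every stage cost negates the index, so the minimizer of one is the maximizer of the other.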

19.2.2. A Backward Recursion for the Value

By using the Chapman–Kolmogorov identity and the Markov property, we can write the value of policy $\pi(x, u)$ as

$$V_k^{\pi}(x) = \sum_{u} \pi(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_{k+1}^{\pi}(x') \big]. \tag{19.6}$$

This equation provides a backwards recursion for the value at time $k$ in terms of the value at time $k + 1$.
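A minimal sketch of the recursion (19.6) for a tabular MDP is given below. It assumes a fixed final stage after which the value is zero; this terminal condition, the example numbers, and the function name `backward_policy_evaluation` are assumptions made for illustration.

```python
# Hypothetical MDP data (illustrative assumption).
# P[u][x][x2] = Pr{x2 | x, u};  R[u][x][x2] = expected transition cost.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
]
R = [
    [[1.0, 2.0], [0.0, 3.0]],
    [[2.0, 0.5], [1.0, 1.0]],
]
pi = [[1.0, 0.0], [0.0, 1.0]]   # fixed policy, pi[x][u] = Pr{u | x}


def backward_policy_evaluation(P, R, pi, gamma, n_stages):
    """Apply eq. (19.6) backwards in time for a fixed policy.

    Returns a list `values` where values[k][x] is the value of state x at stage k.
    Assumption: the value after the last stage is zero.
    """
    n_states, n_actions = len(pi), len(P)
    v_next = [0.0] * n_states
    values = []
    for _ in range(n_stages):
        v = [
            sum(
                pi[x][u] * sum(
                    P[u][x][x2] * (R[u][x][x2] + gamma * v_next[x2])
                    for x2 in range(n_states)
                )
                for u in range(n_actions)
            )
            for x in range(n_states)
        ]
        values.append(v)
        v_next = v
    values.reverse()   # computed last stage first; reorder so index = stage
    return values


values = backward_policy_evaluation(P, R, pi, gamma=0.9, n_stages=2)
```

In this example the last stage holds the expected one-step costs, 1.1 and 1.0, and the earlier stage adds the discounted value of the successor states, giving 2.081 and 1.954.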

19.2.3. Dynamic Programming

The optimal cost can be written as

$$V_k^{*}(x) = \min_{\pi} V_k^{\pi}(x) = \min_{\pi} \sum_{u} \pi(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_{k+1}^{\pi}(x') \big]. \tag{19.7}$$

Bellman’s optimality principle [32] states, “An optimal policy has the property that no matter what the previous control actions have been, the remaining controls constitute an optimal policy with regard to the state resulting from those previous controls.” Therefore, we can write

$$V_k^{*}(x) = \min_{\pi} V_k^{\pi}(x) = \min_{\pi} \sum_{u} \pi(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_{k+1}^{*}(x') \big]. \tag{19.8}$$

Suppose an arbitrary control $u$ is now applied at time $k$ and the optimal policy is applied from time $k + 1$ on. Then, Bellman’s optimality principle says that the optimal control policy at time $k$ is given by

$$\pi^{*}(x, u) = \arg\min_{\pi} \sum_{u} \pi(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_{k+1}^{*}(x') \big]. \tag{19.9}$$

Under the assumption that the Markov chain corresponding to each policy, with transition probabilities given as in (19.2), is ergodic, every MDP has a stationary deterministic optimal policy. Then, we can equivalently minimize the conditional


expectation over all actions $u$ in state $x$. Therefore,

$$V_k^{*}(x) = \min_{u} \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_{k+1}^{*}(x') \big], \tag{19.10}$$

$$u_k^{*} = \arg\min_{u} \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_{k+1}^{*}(x') \big]. \tag{19.11}$$

The backwards recursion (19.10) forms the basis for dynamic programming (DP), which gives offline methods for working backwards in time to determine optimal policies [5]. DP is an offline procedure [48] for finding the optimal value and optimal policies that requires knowledge of the complete system dynamics in the form of transition probabilities $P^{u}_{x,x'} = \Pr\{x' \mid x, u\}$ and expected costs $R^{u}_{x,x'} = E\{r_k \mid x_k = x, u_k = u, x_{k+1} = x'\}$.
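The recursion (19.10) and the minimizing control (19.11) can be sketched for a tabular MDP as follows. As in any DP computation, the full model `P` and `R` must be known in advance. The example numbers, the zero terminal value, and the function name `dynamic_programming` are assumptions made for illustration.

```python
# Hypothetical MDP data (illustrative assumption).
# P[u][x][x2] = Pr{x2 | x, u};  R[u][x][x2] = expected transition cost.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
]
R = [
    [[1.0, 2.0], [0.0, 3.0]],
    [[2.0, 0.5], [1.0, 1.0]],
]


def dynamic_programming(P, R, gamma, n_stages):
    """Backward recursion of eqs. (19.10)-(19.11) over a finite number of stages.

    Returns (values, controls): values[k][x] is the optimal cost-to-go and
    controls[k][x] the minimizing control at stage k in state x.
    Assumption: the value after the last stage is zero.
    """
    n_actions, n_states = len(P), len(P[0])
    v_next = [0.0] * n_states
    values, controls = [], []
    for _ in range(n_stages):
        # cost[x][u]: expected cost of applying u in x, then acting optimally.
        cost = [
            [
                sum(P[u][x][x2] * (R[u][x][x2] + gamma * v_next[x2])
                    for x2 in range(n_states))
                for u in range(n_actions)
            ]
            for x in range(n_states)
        ]
        v = [min(row) for row in cost]              # eq. (19.10)
        mu = [row.index(min(row)) for row in cost]  # eq. (19.11)
        values.append(v)
        controls.append(mu)
        v_next = v
    values.reverse()     # computed last stage first; reorder so index = stage
    controls.reverse()
    return values, controls


values, controls = dynamic_programming(P, R, gamma=0.9, n_stages=2)
```

The sketch works strictly backwards from the final stage, which is why it cannot be run online along a trajectory; that limitation is the motivation for the forward-in-time methods developed next.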

19.2.4. Bellman Equation and Bellman Optimality Equation

Dynamic programming is a backwards-in-time method for finding the optimal value and policy. By contrast, RL is concerned with finding optimal policies based on causal experience by executing sequential decisions that improve control actions based on the observed results of using a current policy. This procedure requires the derivation of methods for finding optimal values and optimal policies that can be executed forward in time. The key to this is the Bellman equation, which we now develop. References for this subsection include [7, 8, 19, 30]. To derive forward-in-time methods for finding optimal values and optimal policies, set now the time horizon $T$ to $+\infty$ and define the infinite-horizon cost,

$$J_k = \sum_{i=0}^{\infty} \gamma^{i} r_{k+i} = \sum_{i=k}^{\infty} \gamma^{i-k} r_i. \tag{19.12}$$

The associated infinite-horizon value function for policy $\pi(x, u)$ is

$$V^{\pi}(x) = E_{\pi}\Big\{ \sum_{i=k}^{\infty} \gamma^{i-k} r_i \,\Big|\, x_k = x \Big\}. \tag{19.13}$$

By using (19.6) with $T = \infty$ it is seen that the value function for policy $\pi(x, u)$ satisfies the Bellman equation,

$$V^{\pi}(x) = \sum_{u} \pi(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{\pi}(x') \big]. \tag{19.14}$$

The key to deriving this equation is that the same value function appears on both sides, which is due to the fact that the infinite-horizon cost is used. Therefore, the Bellman equation (19.14) can be interpreted as a consistency equation that must be satisfied by the value function at each time stage. It expresses a relation between the current value of being in state $x$ and the value of being in next state $x'$ given that

page 738

June 1, 2022 13:18

Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch19

Reinforcement Learning with Applications in Autonomous Control and Game Theory

page 739

739

policy π(x, u) is used. The solution to the Bellman equation is the value given by the infinite sum in (19.13). The Bellman equation (19.14) is the starting point for developing a family of reinforcement learning algorithms for finding optimal policies by using causal experiences received stage-wise forward in time. The Bellman optimality equation (19.8) involves the minimum operator, and so does not contain any specific policy π(x, u). Its solution relies on knowing the dynamics in the form of transition probabilities. By contrast, the form of the Bellman equation is simpler than that of the optimality equation and is easier to solve. The solution to the Bellman equation yields the value function of a specific policy π(x, u). As such, the Bellman equation is well suited to the actor–critic method of RL shown in Figure 19.1. It is shown subsequently that the Bellman equation provides methods for implementing the critic in Figure 19.1, which is responsible for evaluating the performance of the specific current policy. Two key ingredients remain to be put in place. First, it is shown that methods known as policy iteration and value iteration use the Bellman equation to solve optimal control problems forward in time. Second, by approximating the value function in (19.14) by a parametric structure, these methods can be implemented online using standard adaptive control system identification algorithms such as recursive least-squares. In the context of using the Bellman equation (19.14) for RL, V π (x) may be   u u considered a predicted performance u π(x, u) x  Px,x  R x,x  , the observed onestep reward and V π (x  ) a current estimate of future behavior. Such notions can be capitalized on in the subsequent discussion of temporal different learning, which uses them to apply adaptive control algorithms that can learn optimal behavior online in real-time applications. 
Given that the MDP is finite and has $N$ states, the Bellman equation (19.14) is a system of $N$ simultaneous linear equations for the value $V^{\pi}(x)$ of being in each state $x$ given the current policy $\pi(x, u)$. The optimal value satisfies

\[ V^{*}(x) = \min_{\pi} V^{\pi}(x) = \min_{\pi} \sum_{u} \pi(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{\pi}(x') \big], \tag{19.15} \]

while Bellman's optimality principle gives the Bellman optimality equation,

\[ V^{*}(x) = \min_{\pi} V^{\pi}(x) = \min_{\pi} \sum_{u} \pi(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{*}(x') \big]. \tag{19.16} \]

Now under the assumption of ergodicity on the Markov chains corresponding to each action/policy, the Bellman optimality equation (19.16) can be written as

\[ V^{*}(x) = \min_{u} \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{*}(x') \big], \tag{19.17} \]

which is known as the HJB equation in control systems. The optimal control is given by

\[ u^{*} = \arg\min_{u} \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{*}(x') \big]. \tag{19.18} \]

Equations (19.17) and (19.18) can be written in the context of the feedback control of dynamical systems. In the linear quadratic regulator (LQR) case, the Bellman equation (19.14) becomes a Lyapunov equation and the optimality equation (19.16) becomes the well-known algebraic Riccati equation. We shall see, in the subsequent subsection, two different algorithmic methods to solve the Bellman equation (19.14).
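As noted earlier in this subsection, for a finite MDP the Bellman equation (19.14) is a system of $N$ linear equations in the $N$ unknown values. The following is a minimal sketch of that observation, not a method taken from the chapter: the 3-state, 2-action MDP, the arrays `P`, `R`, `pi`, and the function name `evaluate_policy` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (illustrative numbers only).
# P[u, x, y] = Pr{x' = y | x, u};  R[u, x, y] = expected cost of that transition.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [0.0, 0.0, 0.0]],
    [[3.0, 3.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
])
gamma = 0.9
pi = np.full((3, 2), 0.5)   # pi[x, u]: a uniform random policy (assumption)


def evaluate_policy(P, R, pi, gamma):
    """Solve (19.14) as the linear system (I - gamma * P_pi) V = r_pi."""
    # P_pi[x, y] = sum_u pi[x, u] * P[u, x, y]
    P_pi = np.einsum("xu,uxy->xy", pi, P)
    # r_pi[x] = sum_u pi[x, u] * sum_y P[u, x, y] * R[u, x, y]
    r_pi = np.einsum("xu,uxy,uxy->x", pi, P, R)
    return np.linalg.solve(np.eye(P.shape[1]) - gamma * P_pi, r_pi)


V = evaluate_policy(P, R, pi, gamma)
```

For $0 < \gamma < 1$ the matrix $I - \gamma P_{\pi}$ is nonsingular, so the solve is well posed; for large $N$ an iterative evaluation is usually preferred over a direct solve.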

19.2.5. Policy Iteration and Value Iteration

Given a current policy $\pi(x, u)$, its value (19.13) can be determined by solving (19.14). This procedure is known as policy evaluation. Moreover, given the value for some policy $\pi(x, u)$, we can always use it to find another policy that is at least no worse. This step is known as policy improvement. Specifically, suppose that $V^{\pi}(x)$ satisfies (19.14). Then define a new policy $\pi'(x, u)$ by

\[ \pi'(x, u) = \arg\min_{\pi} \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{\pi}(x') \big] \tag{19.19} \]

and it can be shown that the value is monotonically nonincreasing, that is, $V^{\pi'}(x) \le V^{\pi}(x)$ [7, 31]. The policy (19.19) is said to be greedy with respect to the value function. In the special case that $V^{\pi'}(x) = V^{\pi}(x)$ in (19.19), the value and the policy satisfy Eqs. (19.17) and (19.18). In computational intelligence, “greedy” refers to quantities determined by optimizing over short or one-step horizons, without regard to potential impacts far into the future.

The policy iteration and value iteration algorithms are built from two steps, a policy evaluation step and a policy improvement step. The policy evaluation step can be written as

\[ V^{\pi}(x) = \sum_{u} \pi(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{\pi}(x') \big], \quad \forall x \in S \subseteq X, \tag{19.20} \]

and the policy improvement step as

\[ \pi'(x, u) = \arg\min_{\pi} \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{\pi}(x') \big], \quad \forall x \in S \subseteq X. \tag{19.21} \]

Here, S is a suitably selected subspace of the state space, to be discussed later. We call an application of (19.20) followed by an application of (19.21) one step.

This terminology is in contrast to the decision time stage $k$ defined above. At each step of such algorithms, we obtain a policy that is no worse than the previous policy. Therefore, it is not difficult to prove convergence under fairly mild conditions to the optimal value and optimal policy. Most such proofs are based on the Banach Fixed Point Theorem. Note that (19.17) is a fixed point equation for $V^{*}(\cdot)$. Then the two equations (19.20) and (19.21) define an associated map that can be shown under mild conditions to be a contraction map [8, 31, 49] which converges to the solution of (19.17).

A large family of algorithms is available that implement the policy evaluation and policy improvement procedures in different ways, or interleave them differently, or select the subspace $S \subseteq X$ in different ways, to determine the optimal value and optimal policy. The relevance of this discussion for feedback control systems is that these two procedures can be implemented for dynamical systems online in real time by observing the data measured along the system trajectories. The result is a family of adaptive control algorithms that converge to optimal control solutions. Such algorithms are of the actor–critic class of RL systems, as shown in Figure 19.1. There, a critic agent evaluates the current control policy using methods based on (19.20). After this evaluation is completed, the action is updated by an actor agent based on (19.21).

One method of RL for using (19.20) and (19.21) to find the optimal value and policy is called policy iteration and is described next.

Algorithm 19.1: Policy Iteration
1: procedure
2:   Given an initial admissible policy $\pi_0$
3:   while $\| V^{\pi^{(j)}} - V^{\pi^{(j-1)}} \| \ge \epsilon_{ac}$ do
4:     Solve for the value $V_j(x)$ using Bellman's equation
         \[ V_j(x) = \sum_{u} \pi_j(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_j(x') \big], \quad \forall x \in X \]
5:     Update the control policy $\pi_{j+1}$ using
         \[ \pi_{j+1}(x, u) = \arg\min_{\pi} \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_j(x') \big], \quad \forall x \in X \]
6:     $j := j + 1$
7:   end while
8: end procedure
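The two steps of Algorithm 19.1 can be sketched in a few lines for a small finite MDP. This is only an illustrative sketch under stated assumptions: the MDP arrays `P` and `R`, the restriction to deterministic policies (stored as one action index per state), and the names `policy_iteration`, `eps_ac`, and `max_steps` are hypothetical and not part of the chapter.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (illustrative numbers only).
# P[u, x, y] = Pr{x' = y | x, u};  R[u, x, y] = expected cost of that transition.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],
])
R = np.array([
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [0.0, 0.0, 0.0]],
    [[3.0, 3.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
])
gamma = 0.9


def policy_iteration(P, R, gamma, eps_ac=1e-10, max_steps=100):
    """Algorithm 19.1 restricted to deterministic policies (policy[x] = action)."""
    n_states = P.shape[1]
    states = np.arange(n_states)
    cost = (P * R).sum(axis=2)               # cost[u, x]: expected one-step cost
    policy = np.zeros(n_states, dtype=int)   # initial policy: action 0 everywhere
    V_prev = np.full(n_states, np.inf)
    for _ in range(max_steps):
        # Step 4, policy evaluation: solve the N linear Bellman equations exactly.
        P_pi = P[policy, states]             # row x is P[policy[x], x, :]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, cost[policy, states])
        # Step 5, policy improvement: act greedily with respect to V.
        policy = np.argmin(cost + gamma * (P @ V), axis=0)
        if np.max(np.abs(V - V_prev)) < eps_ac:
            break
        V_prev = V
    return V, policy


V_opt, pi_opt = policy_iteration(P, R, gamma)
```

The returned value can be checked against the Bellman optimality equation (19.17): at convergence, `V_opt` equals the minimum over actions of the one-step lookahead cost.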

In Algorithm 19.1, $\epsilon_{ac}$ is a small number used to terminate the algorithm when two consecutive value functions differ by less than $\epsilon_{ac}$. At each step $j$ the policy iteration algorithm determines the solution of the Bellman equation to compute the value $V_j(x)$ of using the current policy $\pi_j(x, u)$.

This value corresponds to the infinite sum (19.13) for the current policy. Then the policy is improved. The steps are continued until there is no change in the value or the policy. Note that $j$ is not the time or stage index $k$, but a policy iteration step index. As detailed in the next sections, policy iteration can be implemented for dynamical systems online in real time by observing data measured along the system trajectories. Data for multiple times $k$ are needed to solve the Bellman equation at each step $j$.

The policy iteration algorithm must be suitably initialized to converge. The initial policy $\pi_0(x, u)$ and value $V_0$ must be selected so that $V_1 \le V_0$. Then, for finite Markov chains with $N$ states, policy iteration converges in a finite number of steps, less than or equal to $N$, because there are only a finite number of policies [31]. If the MDP is finite and has $N$ states, then the policy evaluation equation is a system of $N$ simultaneous linear equations, one for each state. Instead of directly solving the Bellman equation, it can be solved by an iterative policy evaluation procedure.

A second method for using (19.20) and (19.21) in RL is value iteration.

Algorithm 19.2: Value Iteration
1: procedure
2:   Given an initial policy $\pi_0$
3:   while $\| V^{\pi^{(j)}} - V^{\pi^{(j-1)}} \| \ge \epsilon_{ac}$ do
4:     Solve for the value $V_{j+1}(x)$ using Bellman's equation
         \[ V_{j+1}(x) = \sum_{u} \pi_j(x, u) \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_j(x') \big], \quad \forall x \in S_j \subseteq X \]
5:     Update the policy $\pi_{j+1}$ using
         \[ \pi_{j+1}(x, u) = \arg\min_{\pi} \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V_j(x') \big], \quad \forall x \in S_j \subseteq X \]
6:     $j := j + 1$
7:   end while
8: end procedure

In Algorithm 19.2, $\epsilon_{ac}$ is a small number used to terminate the algorithm when two consecutive value functions differ by less than $\epsilon_{ac}$. In subsequent sections, we show how to implement value iteration for dynamical systems online in real time by observing data measured along the system trajectories. Data for multiple times $k$ are needed to solve the Bellman equation for each step $j$. Standard value iteration takes the update set as $S_j = X$ for all $j$. That is, the value and policy are updated for all states simultaneously. Asynchronous value iteration methods perform the updates on only a subset of the states at each step. In the extreme case, updates can be performed on only one state at each step.
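A minimal sketch of standard (synchronous) value iteration with $S_j = X$ follows. It is written under assumptions that are not in the chapter: the MDP arrays `P` and `R` are hypothetical, policies are deterministic, and the policy improvement uses the freshly updated value, which is one common reading of Algorithm 19.2. The names `value_iteration`, `eps_ac`, and `max_steps` are illustrative.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (illustrative numbers only).
# P[u, x, y] = Pr{x' = y | x, u};  R[u, x, y] = expected cost of that transition.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],
])
R = np.array([
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [0.0, 0.0, 0.0]],
    [[3.0, 3.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
])
gamma = 0.9


def value_iteration(P, R, gamma, eps_ac=1e-10, max_steps=10_000):
    """Synchronous value iteration: one Bellman recursion per step, all states."""
    n_states = P.shape[1]
    states = np.arange(n_states)
    cost = (P * R).sum(axis=2)               # cost[u, x]: expected one-step cost
    policy = np.zeros(n_states, dtype=int)   # any initial policy is acceptable here
    V = np.zeros(n_states)
    for _ in range(max_steps):
        # Value update: a single application of the recursion under the current policy.
        V_next = cost[policy, states] + gamma * (P[policy, states] @ V)
        # Policy improvement: greedy with respect to the updated value.
        policy = np.argmin(cost + gamma * (P @ V_next), axis=0)
        converged = np.max(np.abs(V_next - V)) < eps_ac
        V = V_next
        if converged:
            break
    return V, policy


V_vi, pi_vi = value_iteration(P, R, gamma)
```

Unlike the policy iteration sketch, no linear system is solved; each step costs one matrix-vector product per action, at the price of many more steps before the stopping test is met.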

It is shown in [31] that standard value iteration, which has $S_j = X, \forall j$, converges for finite MDP for all initial conditions when the discount factor satisfies $0 < \gamma < 1$. When $S_j = X, \forall j$ and $\gamma = 1$, an absorbing state is added and a “properness” assumption is needed to guarantee convergence to the optimal value. When a single state is selected for value and policy updates at each step, the algorithm converges, for all choices of initial value, to the optimal cost and policy if each state is selected for update infinitely often. More universal algorithms result if the value update is performed multiple times for different choices of $S_j$ prior to a policy improvement. Then, it is required that the updates are performed infinitely often for each state, and a monotonicity assumption must be satisfied by the initial starting value.

Considering (19.16) as a fixed point equation, value iteration is based on the associated iterative map, which can be shown under certain conditions to be a contraction map. In contrast to policy iteration, which converges under certain conditions in a finite number of steps, value iteration usually takes an infinite number of steps to converge [31]. Consider a finite MDP and consider the transition probability graph having probabilities (19.2) for the Markov chain corresponding to an optimal policy $\pi^{*}(x, u)$. If this graph is acyclic for some $\pi^{*}(x, u)$, then value iteration converges in at most $N$ steps when initialized with a large value.

Having in mind the dynamic programming equation (19.6) and examining the value iteration, the value update $V_j(x')$ can be interpreted as an approximation or estimate for the future stage cost-to-go from the future state $x'$. Those algorithms wherein the future cost estimates are themselves costs or values for some policy are called roll-out algorithms in [31]. Such policies are forward looking and self-correcting. It is shown that these methods can be used to derive algorithms for receding horizon control [50].

MDP, policy iteration, and value iteration are closely tied to optimal and adaptive control.

19.2.6. Generalized Policy Iteration

In policy iteration the system of linear equations (see Algorithm 19.1) is completely solved at each step to compute the value (19.13) of using the current policy $\pi_j(x, u)$. This solution can be accomplished by running iterations of the policy evaluation step until convergence. By contrast, in value iteration one takes only one iteration in the value update step. Generalized policy iteration algorithms make several iterations in their value update step. Usually, policy iteration converges to the optimal value in fewer steps $j$, since it does more work in solving equations at each step. On the other hand, value iteration is the easiest to implement, as it takes only one iteration of a recursion. Generalized policy iteration provides a suitable compromise between computational complexity and convergence speed. Generalized policy iteration is a special case of the value iteration algorithm given above, where we select $S_j = X, \forall j$, and perform the value update multiple times before each policy update.
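The compromise described above can be sketched by exposing the number of value-update sweeps as a parameter: one sweep behaves like value iteration, and many sweeps approach policy iteration. The code below is a hedged illustration, not the chapter's algorithm: the MDP arrays, the parameter `sweeps`, and the choice of a large constant initial value (one simple way to meet the monotonicity condition mentioned earlier) are all assumptions.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (illustrative numbers only).
# P[u, x, y] = Pr{x' = y | x, u};  R[u, x, y] = expected cost of that transition.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],
])
R = np.array([
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [0.0, 0.0, 0.0]],
    [[3.0, 3.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
])
gamma = 0.9


def generalized_policy_iteration(P, R, gamma, sweeps=5, eps_ac=1e-10,
                                 max_steps=10_000):
    """Several value-update sweeps under a fixed policy, then one improvement."""
    n_states = P.shape[1]
    states = np.arange(n_states)
    cost = (P * R).sum(axis=2)               # cost[u, x]: expected one-step cost
    policy = np.zeros(n_states, dtype=int)
    # Start from an upper bound on any discounted cost (assumed initialization),
    # so that every value update can only decrease the estimate.
    V = np.full(n_states, cost.max() / (1.0 - gamma))
    for _ in range(max_steps):
        V_old = V
        P_pi, c_pi = P[policy, states], cost[policy, states]
        for _ in range(sweeps):              # partial policy evaluation
            V = c_pi + gamma * (P_pi @ V)
        policy = np.argmin(cost + gamma * (P @ V), axis=0)
        if np.max(np.abs(V - V_old)) < eps_ac:
            break
    return V, policy


V_one, _ = generalized_policy_iteration(P, R, gamma, sweeps=1)
V_many, pi_many = generalized_policy_iteration(P, R, gamma, sweeps=20)
```

Both settings should reach the same optimal value; what changes is how the work is split between evaluation sweeps and improvement steps.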

19.2.7. Q-Learning

The conditional expected value in (19.10) at iteration $k$ is

\[ Q^{*}_k(x, u) = \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{*}_{k+1}(x') \big] = E_{\pi}\{ r_k + \gamma V^{*}_{k+1}(x') \mid x_k = x,\ u_k = u \} \tag{19.22} \]

and is known as the optimal Q function [51, 52]. In other words, the Q function is equal to the expected return for taking an arbitrary action $u$ at time $k$ in state $x$ and thereafter following an optimal policy. The Q function is a function of the current state $x$ and the action $u$. In terms of the Q function, the Bellman optimality equation has the particularly simple form

\[ V^{*}_k(x) = \min_{u} Q^{*}_k(x, u), \tag{19.23} \]

and

\[ u^{*}_k = \arg\min_{u} Q^{*}_k(x, u). \tag{19.24} \]

Given some fixed policy $\pi(x, u)$, define the Q function for that policy as

\[ Q^{\pi}_k(x, u) = E_{\pi}\{ r_k + \gamma V^{\pi}_{k+1}(x') \mid x_k = x,\ u_k = u \} = \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{\pi}_{k+1}(x') \big], \tag{19.25} \]

where this function is equal to the expected return for taking an arbitrary action $u$ at time $k$ in state $x$ and thereafter following the existing policy $\pi(x, u)$. Note that $V^{\pi}_k(x) = Q^{\pi}_k(x, \pi(x, u))$; hence (19.25) can be written as the backward recursion in the Q function,

\[ Q^{\pi}_k(x, u) = \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma Q^{\pi}_{k+1}(x', \pi(x', u')) \big]. \tag{19.26} \]
The Q function is a function of two arguments, the current state $x$ and the action $u$. By contrast, the value function is a function of the state alone. For finite MDP, the Q function can be stored as a 2D lookup table with one entry for each state/action pair. Note that direct minimization in (19.8) and (19.9) requires knowledge of the state transition probabilities, which correspond to the system dynamics, and costs. By contrast, the minimization in (19.23) and (19.24) requires knowledge only of the Q function and not of the system dynamics.

The utility of the Q function is twofold. First, it contains information about control actions in every state. As such, the best control in each state can be selected using (19.24) by knowing only the Q function. Second, the Q function can be estimated online in real time directly from data observed along the system trajectories, without knowing the system dynamics information, that is, the transition probabilities. Later, we shall see how this is accomplished.

The infinite-horizon Q function for a prescribed fixed policy is given by

\[ Q^{\pi}(x, u) = \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma V^{\pi}(x') \big], \tag{19.27} \]

which also satisfies the Bellman equation. Note that for a fixed policy $\pi(x, u)$

\[ V^{\pi}(x) = Q^{\pi}(x, \pi(x, u)), \tag{19.28} \]

whence according to (19.27) the Q function satisfies the Bellman equation,

\[ Q^{\pi}(x, u) = \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma Q^{\pi}(x', \pi(x', u')) \big]. \tag{19.29} \]

The Bellman optimality equation for the Q function is given by

\[ Q^{*}(x, u) = \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma \min_{u'} Q^{*}(x', u') \big]. \tag{19.30} \]

It is interesting to compare (19.17) and (19.30), where the minimum operator and the expected value operator are reversed. The algorithms defined in the previous subsections, namely policy iteration and value iteration, are straightforward to implement in terms of the Q function (19.25), as follows.

Algorithm 19.3: Policy Iteration Using Q Function
1: procedure
2:   Given an initial admissible policy $\pi_0$
3:   while $\| Q^{\pi^{(j)}} - Q^{\pi^{(j-1)}} \| \ge \epsilon_{ac}$ do
4:     Solve for $Q_j(x, u)$ using
         \[ Q_j(x, u) = \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma Q_j(x', \pi_j(x', u')) \big], \quad \forall x \in X \]
5:     Update the policy $\pi_{j+1}$ using
         \[ \pi_{j+1}(x, u) = \arg\min_{\pi} Q_j(x, u), \quad \forall x \in X \]
6:     $j := j + 1$
7:   end while
8: end procedure

Algorithm 19.4: Value Iteration Using Q Function
1: procedure
2:   Given an initial policy $\pi_0$
3:   while $\| Q^{\pi^{(j)}} - Q^{\pi^{(j-1)}} \| \ge \epsilon_{ac}$ do
4:     Solve for the value $Q_{j+1}(x, u)$ using Bellman's equation
         \[ Q_{j+1}(x, u) = \sum_{x'} P^{u}_{x,x'} \big[ R^{u}_{x,x'} + \gamma Q_j(x', \pi_j(x', u')) \big], \quad \forall x \in S_j \subseteq X \]
5:     Update the policy $\pi_{j+1}$ using
         \[ \pi_{j+1}(x, u) = \arg\min_{\pi} Q_j(x, u), \quad \forall x \in S_j \subseteq X \]
6:     $j := j + 1$
7:   end while
8: end procedure

In Algorithms 19.3 and 19.4, $\epsilon_{ac}$ is a small number used to terminate the algorithm when two consecutive Q functions differ by less than $\epsilon_{ac}$. As we show, the utility of the Q function is that these algorithms can be implemented online in real time, without knowing the system dynamics, by measuring data along the system trajectories. They yield optimal adaptive control algorithms, that is, adaptive control algorithms that converge online to optimal control solutions.
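To illustrate the lookup-table view of the Q function, the sketch below iterates the Bellman optimality equation (19.30) on a table `Q[x, u]` and then selects actions with (19.24) from the table alone. It is an assumption-laden illustration: the MDP arrays `P` and `R` are hypothetical, and this model-based iteration still uses the transition probabilities to fill the table; only the final action selection is free of them.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (illustrative numbers only).
# P[u, x, y] = Pr{x' = y | x, u};  R[u, x, y] = expected cost of that transition.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],
])
R = np.array([
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [0.0, 0.0, 0.0]],
    [[3.0, 3.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
])
gamma = 0.9


def q_value_iteration(P, R, gamma, eps_ac=1e-10, max_steps=10_000):
    """Iterate (19.30) on a 2D lookup table Q[x, u]."""
    cost = (P * R).sum(axis=2)                   # cost[u, x]
    Q = np.zeros((P.shape[1], P.shape[0]))       # one entry per state/action pair
    for _ in range(max_steps):
        V = Q.min(axis=1)                        # (19.23): V(x) = min_u Q(x, u)
        Q_next = (cost + gamma * (P @ V)).T      # right-hand side of (19.30)
        converged = np.max(np.abs(Q_next - Q)) < eps_ac
        Q = Q_next
        if converged:
            break
    return Q


Q_opt = q_value_iteration(P, R, gamma)
greedy_action = Q_opt.argmin(axis=1)   # (19.24): needs only the table, not P or R
```

The last line is the point of the example: once the table is available, choosing the control in each state is a row-wise minimum and involves no model of the dynamics.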

19.3. Methods for Implementing Policy Iteration and Value Iteration

Different methods are available for performing the value and policy updates for policy iteration and value iteration [7, 8, 31]. The three main methods are exact computation, Monte Carlo methods, and temporal difference learning. The last two methods can be implemented without knowledge of the system dynamics. Temporal difference learning is the means by which optimal adaptive control algorithms can be derived for dynamical systems.

Policy iteration requires a solution at each step of the Bellman equation for the value update. For a finite MDP with $N$ states, this is a set of linear equations in $N$ unknowns, namely, the values of each state. Value iteration requires performing the one-step recursive update at each step for the value update. Both of these can be accomplished exactly if we know the transition probabilities $P^{u}_{x,x'}$ and costs $R^{u}_{x,x'}$ of the MDP, which corresponds to knowing full system dynamics information. Likewise, the policy improvements can be explicitly computed if the dynamics are known.

Monte Carlo learning is based on the definition (19.13) for the value function, and uses repeated measurements of data to approximate the expected value.

The expected values are approximated by averaging repeated results along sample paths. An assumption on the ergodicity of the Markov chain with transition probabilities (19.2) for the given policy being evaluated is implicit. This assumption is suitable for episodic tasks, with experience divided into episodes [7], namely, processes that start in an initial state and run until termination, and are then restarted at a new initial state. For finite MDP, Monte Carlo methods converge to the true value function if all states are visited infinitely often. Therefore, in order to ensure accurate approximations of value functions, the episode sample paths must go through all the states x ∈ X many times. This issue is called the problem of maintaining exploration. Several methods are available to ensure this, one of which is to use “exploring starts”, in which every state has nonzero probability of being selected as the initial state of an episode. Monte Carlo techniques are useful for dynamic systems control because the episode sample paths can be interpreted as system trajectories beginning in a prescribed initial state. However, no updates to the value function estimate or the control policy are made until after an episode terminates. In fact, Monte Carlo learning methods are closely related to repetitive or iterative learning control [53]. They do not learn in real time along a trajectory, but learn as trajectories are repeated.
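The averaging idea behind Monte Carlo evaluation, together with exploring starts, can be sketched for a small episodic chain. Everything concrete here is an assumption made for illustration: the closed-loop transition matrix `P_pi` (the chain obtained after fixing some policy), the per-state `cost`, the terminal state index, the seed, and the function names.

```python
import numpy as np

# Hypothetical closed-loop Markov chain under a fixed policy (assumed numbers).
# State 2 is an absorbing terminal state with zero cost, so episodes end there.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.2, 0.0, 0.8],
                 [0.0, 0.0, 1.0]])
cost = np.array([1.0, 2.0, 0.0])
TERMINAL = 2
gamma = 0.9


def sample_return(start, rng):
    """Run one episode from `start` and return its discounted cost, as in (19.12)."""
    x, total, discount = start, 0.0, 1.0
    while x != TERMINAL:
        total += discount * cost[x]
        discount *= gamma
        x = rng.choice(3, p=P_pi[x])
    return total


def monte_carlo_evaluation(n_episodes=10_000, seed=0):
    """Average sampled returns per start state; nothing is updated mid-episode."""
    rng = np.random.default_rng(seed)
    totals, counts = np.zeros(3), np.zeros(3)
    for _ in range(n_episodes):
        start = rng.integers(0, TERMINAL)   # exploring start: state 0 or 1
        totals[start] += sample_return(start, rng)
        counts[start] += 1
    return totals / np.maximum(counts, 1)


V_mc = monte_carlo_evaluation()
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, cost)
```

Comparing `V_mc` with `V_exact` shows the sample averages approaching the solution of the Bellman equation; the estimate for each state is only available after whole episodes have been completed, which is the limitation noted above.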

19.4. Temporal Difference Learning and Its Application to Feedback Control

It has been shown that the temporal difference method [7] for solving the Bellman equations leads to a family of optimal adaptive controllers, that is, adaptive controllers that learn online the solutions to optimal control problems without knowing the full system dynamics. Temporal difference learning is true online RL, wherein control actions are improved in real time by estimating their value functions by observing data measured along the system trajectories.

Policy iteration requires a solution at each step of $N$ linear equations. Value iteration requires performing a recursion at each step. Temporal difference RL methods are based on the Bellman equation without using any systems dynamics knowledge, but using data observed along a single trajectory of the system. Therefore, temporal difference learning is applicable for feedback control applications. Temporal difference updates the value at each time step as observations of data are made along a trajectory. Periodically, the new value is used to update the policy. Temporal difference methods are related to adaptive control in that they adjust values and actions online in real time along system trajectories.

Temporal difference methods can be considered stochastic approximation techniques whereby the Bellman equation (19.14) is replaced by its evaluation along a single sample path of the MDP. Then, the Bellman equation becomes a deterministic equation that allows the definition of a temporal difference error. Equation (19.6) is used to write the Bellman equation (19.14) for the infinite-horizon value (19.13). An alternative form of the Bellman equation is

\[ V^{\pi}(x_k) = E_{\pi}\{ r_k \mid x_k \} + \gamma E_{\pi}\{ V^{\pi}(x_{k+1}) \mid x_k \}, \tag{19.31} \]

which is the basis for temporal difference learning. Temporal difference RL uses one sample path, namely the current system trajectory, to update the value. Then, (19.31) is replaced by the deterministic Bellman's equation

\[ V^{\pi}(x_k) = r_k + \gamma V^{\pi}(x_{k+1}), \tag{19.32} \]

which holds for each observed data experience set $(x_k, x_{k+1}, r_k)$ at each time step $k$. This dataset consists of the current state $x_k$, the observed cost $r_k$, and the next state $x_{k+1}$. Hence we can define the temporal difference error as

\[ e_k = -V^{\pi}(x_k) + r_k + \gamma V^{\pi}(x_{k+1}), \tag{19.33} \]

and the value estimate is updated to make the temporal difference error small. In the context of temporal difference learning, the interpretation of the Bellman equation is shown in Figure 19.2, where $V^{\pi}(x_k)$ may be considered as a predicted performance or value, $r_k$ as the observed one-step reward, and $\gamma V^{\pi}(x_{k+1})$ as a current estimate of the future value. The Bellman equation can be interpreted as a consistency equation that holds if the current estimate for the predicted value $V^{\pi}(x_k)$ is correct. Temporal difference methods update the predicted value estimate $\hat{V}^{\pi}(x_k)$ to make the temporal difference error small. The idea, based on stochastic approximation, is that if we use the deterministic version of Bellman's equation repeatedly in policy iteration or value iteration, then on average these algorithms converge toward the solution of the stochastic Bellman equation.

Figure 19.2: Temporal difference interpretation of the Bellman equation. The figure shows how the use of the Bellman equation captures the action, observation, evaluation, and improvement mechanisms of RL.
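The temporal difference error (19.33) leads directly to a simple tabular update: move the value estimate of the current state a fraction of the way toward making the error zero. The sketch below uses a deterministic 3-state cycle so that the fixed point can be checked exactly; the chain, the step size `alpha`, and the function name are assumptions for illustration. In a stochastic MDP the same update would typically use a smaller or decaying step size and would converge only on average.

```python
import numpy as np

# Hypothetical deterministic closed-loop system: x -> next_state[x], cost[x] per step.
next_state = np.array([1, 2, 0])
cost = np.array([1.0, 0.0, 2.0])
gamma = 0.9


def td0_evaluation(n_steps=5000, alpha=0.5):
    """Tabular TD(0): update the value estimate online along a single trajectory."""
    V_hat = np.zeros(3)
    x = 0
    for _ in range(n_steps):
        x_next = next_state[x]
        r = cost[x]
        # Temporal difference error (19.33) for the observed data (x, x_next, r).
        e = -V_hat[x] + r + gamma * V_hat[x_next]
        V_hat[x] += alpha * e     # adjust the estimate to shrink the error
        x = x_next
    return V_hat


V_td = td0_evaluation()

# Reference: exact solution of the deterministic Bellman equation (19.32).
P_cl = np.zeros((3, 3))
P_cl[np.arange(3), next_state] = 1.0
V_ref = np.linalg.solve(np.eye(3) - gamma * P_cl, cost)
```

Each update uses only the observed triple of current state, cost, and next state; no transition model appears inside `td0_evaluation`.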

19.5. Optimal Adaptive Control for Discrete-Time Deterministic Systems

A family of optimal adaptive control algorithms can now be developed for dynamical systems. These algorithms determine the solutions to the Hamilton–Jacobi (HJ) design equations online in real time without knowing the system drift dynamics. In the LQR case, this means that they solve the Riccati equation online without knowing the system $A$ matrix. Physical analysis of dynamical systems using Lagrangian mechanics or Hamiltonian mechanics produces system descriptions in terms of nonlinear ordinary differential equations. Discretization yields nonlinear difference equations. Most research in RL is conducted for systems that operate in discrete time [7, 26, 47]. Therefore, we cover discrete-time dynamical systems first and then continuous-time systems.

Temporal difference learning is a stochastic approximation technique based on the deterministic Bellman's equation (19.32). Therefore, we lose little by considering deterministic systems here. Consider a class of discrete-time systems described by deterministic nonlinear dynamics in the affine state-space difference equation form,

\[ x_{k+1} = f(x_k) + g(x_k) u_k, \quad k = 0, 1, \ldots \tag{19.34} \]

with state $x_k \in \mathbb{R}^n$ and control input $u_k \in \mathbb{R}^m$. We use this form because its analysis is convenient. The following development can be generalized to the sampled-data form $x_{k+1} = F(x_k, u_k)$. A deterministic control policy is defined as a function from state space to control space, $h(\cdot) : \mathbb{R}^n \to \mathbb{R}^m$. That is, for every state $x_k$, the policy defines a control action,

\[ u_k = h(x_k), \tag{19.35} \]

which is a feedback controller. The value function with a deterministic cost function can be written as

\[ V^{h}(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i) = \sum_{i=k}^{\infty} \gamma^{i-k} \big( Q(x_i) + u_i^{T} R u_i \big), \tag{19.36} \]

where $0 < \gamma \le 1$ is a discount factor, $Q(x_k) > 0$, $R > 0$, and $u_k = h(x_k)$ is a prescribed feedback control policy. That is, the stage cost is

\[ r(x_k, u_k) = Q(x_k) + u_k^{T} R u_k, \tag{19.37} \]

which is taken as quadratic in $u_k$ to simplify the developments but can be any positive definite function of the control. We assume that the system is stabilizable on some set $\Omega \subseteq \mathbb{R}^n$; that is, there exists a control policy $u_k = h(x_k)$ such that the closed-loop system $x_{k+1} = f(x_k) + g(x_k) h(x_k)$ is asymptotically stable on $\Omega$. For the deterministic value (19.36), the optimal value is given by Bellman's optimality equation,

\[ V^{*}(x_k) = \min_{h(\cdot)} \big( r(x_k, h(x_k)) + \gamma V^{*}(x_{k+1}) \big), \tag{19.38} \]

which is just the discrete-time HJB equation. The optimal policy is then given as

\[ h^{*}(x_k) = \arg\min_{h(\cdot)} \big( r(x_k, h(x_k)) + \gamma V^{*}(x_{k+1}) \big). \tag{19.39} \]

In this setup, the deterministic Bellman's equation (19.32) is

\[ V^{h}(x_k) = r(x_k, u_k) + \gamma V^{h}(x_{k+1}) = Q(x_k) + u_k^{T} R u_k + \gamma V^{h}(x_{k+1}), \quad V^{h}(0) = 0, \tag{19.40} \]

which is a difference equation equivalent of the value (19.36). That is, instead of evaluating the infinite sum (19.36), the difference equation (19.40) can be solved with boundary condition $V^{h}(0) = 0$ to obtain the value of using a current policy $u_k = h(x_k)$. The discrete-time Hamiltonian function can be defined as

\[ H(x_k, h(x_k), \Delta V_k) = r(x_k, h(x_k)) + \gamma V^{h}(x_{k+1}) - V^{h}(x_k), \tag{19.41} \]

where $\Delta V_k \equiv \gamma V^{h}(x_{k+1}) - V^{h}(x_k)$ is the forward difference operator. The Hamiltonian function captures the energy content along the trajectories of a system as reflected in the desired optimal performance. In fact, the Hamiltonian is the temporal difference error (19.33). The Bellman equation requires that the Hamiltonian be equal to zero for the value associated with a prescribed policy. For the discrete-time linear quadratic regulator case, we have

\[ x_{k+1} = A x_k + B u_k, \tag{19.42} \]

\[ V^{h}(x_k) = \frac{1}{2} \sum_{i=k}^{\infty} \gamma^{i-k} \big( x_i^{T} Q x_i + u_i^{T} R u_i \big), \tag{19.43} \]

and the Bellman equation can be easily derived.
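As a worked illustration of the LQR case, assume a linear state feedback $u_k = -K x_k$ and a quadratic value $V^{h}(x) = \tfrac{1}{2} x^{T} P x$. Substituting both into the Bellman equation (19.40) with the cost in (19.43) gives the discounted Lyapunov equation $P = Q + K^{T} R K + \gamma (A - BK)^{T} P (A - BK)$. The sketch below solves it by fixed-point iteration and compares the result with a direct evaluation of the sum (19.43). The matrices, the gain `K`, and the function names are hypothetical values chosen only so that the closed loop is stable.

```python
import numpy as np

# Hypothetical second-order system and stabilizing gain (assumed numbers).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
K = np.array([[1.0, 1.5]])
gamma = 0.95


def lqr_policy_value(A, B, Q, R, K, gamma, tol=1e-10, max_iter=10_000):
    """Fixed-point iteration on P = Q + K'RK + gamma * (A-BK)' P (A-BK)."""
    A_cl = A - B @ K
    stage = Q + K.T @ R @ K
    P = np.zeros_like(Q)
    for _ in range(max_iter):
        P_next = stage + gamma * A_cl.T @ P @ A_cl
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    raise RuntimeError("no convergence; the policy may not be stabilizing")


def discounted_cost(x0, n_terms=2000):
    """Evaluate the sum (19.43) along the closed-loop trajectory from x0."""
    x, total, discount = x0, 0.0, 1.0
    for _ in range(n_terms):
        u = -K @ x
        total += discount * 0.5 * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
        discount *= gamma
    return total


P_h = lqr_policy_value(A, B, Q, R, K, gamma)
x0 = np.array([1.0, -0.5])
value_from_P = 0.5 * x0 @ P_h @ x0
value_from_sum = discounted_cost(x0)
```

The two numbers agree up to truncation and tolerance, which is the sense in which the difference equation (19.40) replaces the infinite sum.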

19.5.1. Policy Iteration and Value Iteration for Discrete-Time Deterministic Dynamical Systems

Two forms of RL can be used based on policy iteration and value iteration.

Algorithm 19.5: Policy Iteration Using Temporal Difference Learning
1: procedure
2:   Given an initial admissible policy $h_0(x_k)$
3:   while $\| V^{h^{(j)}} - V^{h^{(j-1)}} \| \ge \epsilon_{ac}$ do
4:     Solve for $V_{j+1}(x_k)$ using
         \[ V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1}) \]
5:     Update the policy $h_{j+1}$ using
         \[ h_{j+1}(x_k) = -\frac{\gamma}{2} R^{-1} g^{T}(x_k) \nabla V_{j+1}(x_{k+1}) \]
6:     $j := j + 1$
7:   end while
8: end procedure

Algorithm 19.6: Value Iteration Using Temporal Difference Learning
1: procedure
2:   Given an initial policy $h_0(x_k)$
3:   while $\| V^{h^{(j)}} - V^{h^{(j-1)}} \| \ge \epsilon_{ac}$ do
4:     Solve for $V_{j+1}(x_k)$ using
         \[ V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1}) \]
5:     Update the policy $h_{j+1}$ using
         \[ h_{j+1}(x_k) = -\frac{\gamma}{2} R^{-1} g^{T}(x_k) \nabla V_{j+1}(x_{k+1}) \]
6:     $j := j + 1$
7:   end while
8: end procedure

In Algorithms 19.5 and 19.6, $\epsilon_{ac}$ is a small number used to terminate the algorithm when two consecutive value functions differ by less than $\epsilon_{ac}$, and $\nabla V(x) = \partial V(x) / \partial x$. Note that in value iteration, we can select any initial control policy, not necessarily admissible or stabilizing. The policy iteration and value iteration algorithms just described are offline design methods that require knowledge of the discrete-time dynamics. By contrast, we next desire to determine online methods for implementing policy iteration and value iteration that do not require full dynamics information.

Policy iteration and value iteration can be implemented for finite MDP by storing and updating lookup tables. The key to implementing policy iteration and value iteration online for dynamical systems with infinite state and action spaces is to approximate the value function by a suitable approximator structure in terms of unknown parameters. Then, the unknown parameters are tuned online exactly as in system identification. This idea of value function approximation (VFA) is used by Werbos [26, 33] and is called approximate dynamic programming (ADP) or adaptive dynamic programming. The approach is used by Bertsekas and Tsitsiklis [31] and is called neuro-dynamic programming [8, 19, 31].

For nonlinear systems (19.34), the value function contains higher order nonlinearities. We therefore assume that the Bellman equation (19.40) has a local smooth solution [57]. Then, according to the Weierstrass higher-order approximation theorem, there exists a dense basis set $\{\phi_i(x)\}$ such that

\[ V(x) = W^{T} \phi(x) + \epsilon_L(x), \tag{19.44} \]

where $\phi(x) = [\phi_1(x)\ \phi_2(x)\ \ldots\ \phi_L(x)] : \mathbb{R}^n \to \mathbb{R}^L$ and $\epsilon_L(x)$ converges uniformly to zero as the number of basis functions $L \to \infty$. Note that in the LQR case, the weight vector consists of the elements of the symmetric Riccati matrix $P$. We are now in a position to present several adaptive control algorithms based on temporal difference RL that converge online to the optimal control solution.

Algorithm 19.7: Optimal Adaptive Control Using Policy Iteration
1: procedure
2: Given an initial admissible policy $h_0(x_k)$
3: while No convergence do
4: Determine the least-squares solution $W_{j+1}$ using
$$W_{j+1}^T \left(\phi(x_k) - \gamma \phi(x_{k+1})\right) = r(x_k, h_j(x_k))$$
5: Update the policy $h_{j+1}$ using
$$h_{j+1}(x_k) = -\frac{\gamma}{2} R^{-1} g^T(x_k) \nabla\phi^T(x_{k+1}) W_{j+1}$$
6: $j := j + 1$
7: end while
8: end procedure

Algorithm 19.8: Optimal Adaptive Control Using Value Iteration
1: procedure
2: Given an initial policy $h_0(x_k)$
3: while No convergence do
4: Determine the least-squares solution $W_{j+1}$ using
$$W_{j+1}^T \phi(x_k) = r(x_k, h_j(x_k)) + \gamma W_j^T \phi(x_{k+1})$$
5: Update the policy $h_{j+1}$ using
$$h_{j+1}(x_k) = -\frac{\gamma}{2} R^{-1} g^T(x_k) \nabla\phi^T(x_{k+1}) W_{j+1}$$
6: $j := j + 1$
7: end while
8: end procedure
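The policy iteration loop of Algorithm 19.7 can be illustrated on a scalar linear quadratic problem. The sketch below is only an illustration under stated assumptions: the plant $x_{k+1} = a x_k + b u_k$, the cost $r = q x^2 + R u^2$, the one-weight critic $V(x) = w x^2$, the probing states, and all function names are hypothetical choices that do not come from the chapter. The critic step uses only the observed tuples $(x_k, x_{k+1}, r_k)$, while the policy step is written in closed form for this scalar model and therefore still uses $a$ and $b$.

```python
def critic_least_squares(samples, gamma):
    """Fit w in V(x) = w*x**2 from tuples (x_k, x_next, r_k), as in step 4."""
    num = den = 0.0
    for x, x_next, r in samples:
        psi = x * x - gamma * x_next * x_next  # phi(x_k) - gamma*phi(x_{k+1})
        num += psi * r
        den += psi * psi
    return num / den


def policy_iteration_vfa(a, b, q, R, gamma, K0, iters=20):
    """Sketch of Algorithm 19.7 for x_{k+1} = a*x_k + b*u_k with u = -K*x."""
    K, w = K0, 0.0
    for _ in range(iters):
        samples = []
        for x in (1.0, -2.0, 0.5):  # hypothetical probing states
            u = -K * x
            samples.append((x, a * x + b * u, q * x * x + R * u * u))
        w = critic_least_squares(samples, gamma)
        # Step 5 in closed form for this scalar model (this step needs a and b).
        K = gamma * a * b * w / (R + gamma * b * b * w)
    return w, K


# Unstable open-loop plant (a = 1.2) with an admissible initial gain K0 = 1.0.
w, K = policy_iteration_vfa(a=1.2, b=1.0, q=1.0, R=1.0, gamma=1.0, K0=1.0)
print(w, K)
```

For these values the undiscounted scalar Riccati equation reduces to $w^2 - 1.44\,w - 1 = 0$, so the learned weight can be checked against its positive root.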


19.5.2. Introducing a Second “Actor” Neural Network

Using value function approximation (VFA) allows standard system identification techniques to be used to find the value function parameters that approximately solve the Bellman equation. The approximator structure just described, which is used to approximate the value function, is known as the critic neural network, as it determines the value of using the current policy. Using VFA, the policy iteration RL algorithm solves the Bellman equation during the value update portion of each iteration step $j$ by observing only the dataset $(x_k, x_{k+1}, r(x_k, h_j(x_k)))$ at each time along the system trajectory. In the case of value iteration, VFA is used to perform a value update. The critic network solves the Bellman equation using the observed data without knowing the system dynamics. However, note that in the linear quadratic case (systems of the form $x_{k+1} = A x_k + B u_k$ with quadratic costs), the policy update is given by

$$K_{j+1} = -\left(B^T P_{j+1} B + R\right)^{-1} B^T P_{j+1} A, \tag{19.45}$$

which requires full knowledge of the dynamics. This problem can be solved by introducing a second neural network for the control policy, known as the actor neural network [29, 33, 43]. Therefore, consider a parametric approximator structure for the control action

$$u_k = h(x_k) = U^T \sigma(x_k) \tag{19.46}$$

with $\sigma(x) : \mathbb{R}^n \to \mathbb{R}^M$ being a vector of $M$ activation functions and $U \in \mathbb{R}^{M \times m}$ being a matrix of weights or unknown parameters. Note that in the LQR case the basis set can be taken as the state vector. After convergence of the critic neural network parameters to $W_{j+1}$ in policy iteration or value iteration, the policy update must be performed. To this end, we can use a gradient descent method for tuning the actor weights $U$, such as

$$U_{j+1}^{i+1} = U_{j+1}^{i} - \beta \sigma(x_k) \left(2R (U_{j+1}^{i})^T \sigma(x_k) + \gamma g^T(x_k) \nabla\phi^T(x_{k+1}) W_{j+1}\right)^T, \tag{19.47}$$

where $\beta \in \mathbb{R}^+$ is a tuning parameter. The tuning index $i$ can be incremented with the time index $k$. Note that the tuning of the actor neural network requires observations at each time $k$ of the dataset $x_k, x_{k+1}$, that is, the current state and the next state. However, as per formulation (19.46), the actor neural network yields the control $u_k$ at time $k$ in terms of the state $x_k$ at time $k$. The next state $x_{k+1}$ is not needed in (19.46). Thus, after (19.47) converges, (19.46) is a legitimate feedback controller. Note also that, in the LQR case, the actor neural network (19.46) embodies the feedback gain computation (19.45). Equation (19.45) contains the state internal dynamics $A$, but (19.46) does not. Therefore, the $A$ matrix is not needed to compute the feedback control.


The reason is that the actor neural network learns information about A in its weights, since xk , xk+1 are used in its tuning. Finally, note that only the input function g(·) or, in the LQR case, the B matrix, is needed in (19.47) to tune the actor neural network. Thus, introducing a second actor neural network completely avoids the need for knowledge of the state drift dynamics f (·), or A in the LQR case.
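The actor update (19.47) can be sketched for a scalar LQR problem, where the actor is $u = U x$ (basis $\sigma(x) = x$) and the critic is $V(x) = w x^2$. This is a minimal illustration under assumptions that are not in the chapter: the plant `plant_step`, the probing states, the step size, and the function name `tune_actor` are all hypothetical. The point it tries to show is that the update reads the observed next state rather than the drift parameter.

```python
def tune_actor(plant_step, b, R, gamma, w, U0=0.0, beta=0.1, steps=300):
    """Gradient-descent actor tuning in the spirit of (19.47), scalar case."""
    U = U0
    states = [1.0, -1.0, 0.5]  # hypothetical probing states
    for i in range(steps):
        x = states[i % len(states)]
        x_next = plant_step(x, U * x)   # observed next state; 'a' is never read
        grad_phi_next = 2.0 * x_next    # gradient of phi(x) = x**2 at x_{k+1}
        U -= beta * x * (2.0 * R * U * x + gamma * b * grad_phi_next * w)
    return U


# Hypothetical plant x_{k+1} = 1.2*x_k + u_k, hidden behind a function call.
def plant(x, u):
    return 1.2 * x + 1.0 * u


# Critic weight: positive root of w**2 - 1.44*w - 1 = 0 (scalar Riccati solution).
w = (1.44 + (1.44 ** 2 + 4.0) ** 0.5) / 2.0
U = tune_actor(plant, b=1.0, R=1.0, gamma=1.0, w=w)
print(U)
```

Under these assumptions the fixed point of the update is $U = -\gamma a b w / (R + \gamma b^2 w)$, which is the scalar form of (19.45), even though `tune_actor` only receives $b$ and the observed data.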

19.6. Actor-Critic Structures for Optimal Adaptive Control

The implementation of RL using two neural networks, one as a critic and the other as an actor, yields the actor-critic RL structure shown in Figure 19.1. In this control system, the critic and the actor are tuned online using the observed data $(x_k, x_{k+1}, r(x_k, h_j(x_k)))$ along the system trajectory. The critic and the actor are tuned sequentially in both the policy iteration and the value iteration algorithms. That is, the weights of one neural network are held constant while the weights of the other are tuned until convergence. This procedure is repeated until both neural networks have converged. Thus, the controller learns the optimal control policy online. This procedure amounts to an online adaptive optimal control system wherein the value function parameters are tuned online and the convergence is to the optimal value and control. The convergence of value iteration using two neural networks for the discrete-time nonlinear system (19.34) is proved in [43]. According to RL principles, the optimal adaptive control structure requires two loops, a critic and an actor, and operates at multiple timescales. A fast loop implements the control action inner loop, a slower timescale operates in the critic loop, and a third timescale operates to update the policy. Several actor-critic structures for optimal adaptive control have been proposed in [54–56].

19.6.1. Q-learning for Optimal Adaptive Control

It has just been shown how to implement an optimal adaptive controller using RL that only requires knowledge of the system input function $g(x_k)$. The Q-learning RL method gives an adaptive control algorithm that converges online to the optimal control solution for completely unknown systems. That is, it solves the Bellman equation (19.40) and the HJB equation (19.38) online in real time by using data measured along the system trajectories, without any knowledge of the dynamics $f(x_k)$, $g(x_k)$. Writing the Q function Bellman equation (19.29) along a sample path gives

$$Q^\pi(x_k, u_k) = r(x_k, u_k) + \gamma Q^\pi(x_{k+1}, h(x_{k+1})), \tag{19.48}$$

which defines a temporal difference error,

$$e_k = -Q^\pi(x_k, u_k) + r(x_k, u_k) + \gamma Q^\pi(x_{k+1}, h(x_{k+1})). \tag{19.49}$$


The Q function is updated using the algorithm

$$Q_k(x_k, u_k) = Q_{k-1}(x_k, u_k) + \alpha_k \left[ r(x_k, u_k) + \gamma \min_u Q_{k-1}(x_{k+1}, u) - Q_{k-1}(x_k, u_k) \right], \tag{19.50}$$

with $\alpha_k$ being a tuning parameter. This algorithm was developed for finite MDPs, and its convergence was proved by Watkins [51] using stochastic approximation (SA) methods. The Q-learning algorithm (19.50) is similar to SA methods of adaptive control or parameter estimation used in control systems. Let us now derive methods for Q-learning for dynamical systems that yield adaptive control algorithms that converge to optimal control solutions. Policy iteration and value iteration algorithms can be given using the Q function. A Q-learning algorithm is easily developed for discrete-time dynamical systems using Q function approximation [9, 26, 33, 58]. Assume therefore that, for nonlinear systems, the Q function is parameterized as

$$Q(x, u) = W^T \phi(z), \tag{19.51}$$

where $z_k \equiv [x_k^T\ u_k^T]^T$, $W$ is the unknown parameter vector, and $\phi(z)$ is the basis set vector. Substituting the Q function approximation into the temporal difference error (19.49) yields

$$e_k = -W^T \phi(z_k) + r(x_k, u_k) + \gamma W^T \phi(z_{k+1}), \tag{19.52}$$

on which either policy iteration or value iteration algorithms can be based. Considering the policy iteration algorithm defined before, the Q function evaluation step is

$$W_{j+1}^T \left(\phi(z_k) - \gamma \phi(z_{k+1})\right) = r(x_k, h_j(x_k)), \tag{19.53}$$

and the policy improvement step is

$$h_{j+1}(x_k) = \arg\min_u \left(W_{j+1}^T \phi(x_k, u)\right), \quad \forall x \in X. \tag{19.54}$$

Now, using value iteration, the Q-learning update is given by

$$W_{j+1}^T \phi(z_k) = r(x_k, h_j(x_k)) + \gamma W_j^T \phi(z_{k+1}), \tag{19.55}$$

and (19.54). Note that these equations do not require knowledge of the dynamics.
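As a rough illustration of (19.53) and (19.54), the following sketch runs Q-function policy iteration on a scalar linear quadratic problem with the quadratic basis $\phi(z) = [x^2,\ xu,\ u^2]$. Everything specific here is an assumption made for the example and not part of the chapter: the plant hidden behind `plant_step`, the sample grid, the helper `solve3`, and the function names. The learner only calls `plant_step` and never reads the plant parameters.

```python
def solve3(A, y):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    n = 3
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def phi(x, u):
    """Quadratic basis for Q(x, u) = W[0]*x^2 + W[1]*x*u + W[2]*u^2."""
    return [x * x, x * u, u * u]


def q_policy_iteration(plant_step, q, R, gamma, K0, iters=15):
    K = K0
    # Exploratory state-action pairs (hypothetical grid) so all of W is identifiable.
    samples = [(x, u) for x in (-1.0, 0.5, 1.0) for u in (-1.0, 0.3, 1.0)]
    for _ in range(iters):
        A = [[0.0] * 3 for _ in range(3)]
        y = [0.0] * 3
        for x, u in samples:
            xn = plant_step(x, u)   # measured next state
            un = -K * xn            # next action from the current policy h_j
            d = [p0 - gamma * p1 for p0, p1 in zip(phi(x, u), phi(xn, un))]
            r = q * x * x + R * u * u
            for i in range(3):      # accumulate normal equations for (19.53)
                y[i] += d[i] * r
                for j in range(3):
                    A[i][j] += d[i] * d[j]
        W = solve3(A, y)
        K = W[1] / (2.0 * W[2])     # (19.54): argmin over u gives u = -K*x
    return W, K


def plant(x, u):
    return 1.2 * x + 1.0 * u        # unknown to the learner


W, K = q_policy_iteration(plant, q=1.0, R=1.0, gamma=1.0, K0=1.0)
print(W, K)
```

The policy improvement line uses only the learned weights, which is the sense in which the Q-function form needs no model. For this plant the gain can be compared with the Riccati-based value $a b p / (R + b^2 p)$, where $p$ is the positive root of $p^2 - 1.44\,p - 1 = 0$.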

19.6.2. Approximate Dynamic Programming Using Output Feedback

The methods discussed above have relied on full state variable feedback. Little work has been done on applying RL to feedback control using output feedback. This corresponds to partially observable Markov processes. The design of an ADP controller that uses only output feedback is given in [59].


19.7. Integral RL for Optimal Adaptive Control of Deterministic Continuous-Time Systems

RL is considerably more difficult for continuous-time systems [60] than for discrete-time systems, and fewer results are available. See [61] for the development of an offline policy iteration method for continuous-time systems. A method known as integral RL (IRL) [27–29, 44] allows RL to be used to formulate online optimal adaptive control methods for continuous-time systems. These methods find solutions to optimal HJ design equations and Riccati equations online in real time without knowing the system drift dynamics $f(x)$, or in the LQR case without knowing the $A$ matrix. Consider the continuous-time nonlinear dynamical system

$$\dot{x} = f(x) + g(x)u, \quad t \ge 0, \tag{19.56}$$

where $x \in \mathbb{R}^n$ is the state and $u \in \mathbb{R}^m$ is the control input. Assume that $x = 0$ is an equilibrium point, i.e., $f(0) = 0$, and that $f(x) + g(x)u$ is Lipschitz on a set $\Omega \subseteq \mathbb{R}^n$ that contains the origin. We assume that the system is stabilizable on $\Omega$, that is, there exists a continuous control input function $u(t)$ such that the closed-loop system is asymptotically stable on $\Omega$. Define a performance measure or cost function that has the value associated with the feedback control policy $u = \mu(x)$ given by

$$V^\mu(x(t)) = \int_t^\infty r(x(\tau), u(\tau))\, d\tau, \tag{19.57}$$

where $r(x, u) = Q(x) + u^T R u$ is the utility function, $Q(x) \succeq 0$ for all $x$, $x = 0 \Rightarrow Q(x) = 0$, and $R = R^T \succ 0$. For the continuous-time LQR, we have

$$\dot{x} = Ax + Bu, \tag{19.58}$$

$$V^\mu(x(t)) = \frac{1}{2} \int_t^\infty \left(x^T Q x + u^T R u\right) d\tau. \tag{19.59}$$

If the cost is smooth, then an infinitesimal equivalent to (19.57) can be found by Leibniz's formula to be

$$0 = r(x, \mu(x)) + (\nabla V^\mu)^T \left(f(x) + g(x)\mu(x)\right), \quad V^\mu(0) = 0, \tag{19.60}$$

where $\nabla V^\mu$ denotes the column gradient vector with respect to $x$. This is the continuous-time Bellman equation. This equation is defined based on the continuous-time Hamiltonian function,

$$H(x, \mu(x), \nabla V^\mu) = r(x, \mu(x)) + (\nabla V^\mu)^T \left(f(x) + g(x)\mu(x)\right) \tag{19.61}$$


and the HJB equation is given by

$$0 = \min_\mu H(x, \mu(x), \nabla V^*) \tag{19.62}$$

with an optimal control,

$$\mu^* = \arg\min_\mu H(x, \mu(x), \nabla V^*). \tag{19.63}$$

We now see the problem with continuous-time systems immediately. Comparing the continuous-time Hamiltonian (19.61) to the discrete-time Hamiltonian (19.41), it is seen that (19.61) contains the full system dynamics $f(x) + g(x)u$, while (19.41) does not. What this means is that the continuous-time Bellman equation (19.60) cannot be used as a basis for RL unless the full dynamics are known. RL methods based on (19.60) can be developed [49]. They have limited use for adaptive control purposes because the system dynamics must be known, state derivatives must be measured, or integration over an infinite horizon is required. In another approach, Euler's method can be used to discretize the continuous-time Bellman equation (19.60) (see [62]). Noting that

$$0 = r(x, \mu(x)) + (\nabla V^\mu)^T \left(f(x) + g(x)\mu(x)\right) = r(x, \mu(x)) + \dot{V}^\mu, \tag{19.64}$$

we use Euler's forward method to discretize this partial differential equation (PDE) in order to obtain

$$0 = r(x, \mu(x)) + \frac{V^\mu(x_{k+1}) - V^\mu(x_k)}{T} \equiv \frac{r_S(x_k, u_k)}{T} + \frac{V^\mu(x_{k+1}) - V^\mu(x_k)}{T}, \tag{19.65}$$

where $T > 0$ is a sample period, i.e., $t = kT$ and $r_S(x_k, u_k) = T r(x_k, u_k)$. Now, note that the discretized continuous-time Bellman equation (19.65) has the same form as the discrete-time Bellman equation (19.40). Therefore, all the RL methods just described for discrete-time systems can be applied. However, this equation is an approximation only. An alternative exact method for continuous-time RL is given in [29, 44]. That method is termed integral RL (IRL). Note that the cost (19.57) can be written in the integral reinforcement form,

$$V^\mu(x(t)) = \int_t^{t+T} r(x(\tau), u(\tau))\, d\tau + V^\mu(x(t+T)), \tag{19.66}$$


for some $T > 0$. This equation is exactly in the form of the discrete-time Bellman equation (19.40). According to Bellman's principle, the optimal value is given in terms of this construction as [5]

$$V^*(x(t)) = \min_{\bar{u}(t:t+T)} \left\{ \int_t^{t+T} r(x(\tau), u(\tau))\, d\tau + V^*(x(t+T)) \right\},$$

where $\bar{u}(t : t+T) = \{u(\tau) : t \le \tau < t+T\}$, and the optimal control is given by

$$\mu^*(x(t)) = \arg\min_{\bar{u}(t:t+T)} \left\{ \int_t^{t+T} r(x(\tau), u(\tau))\, d\tau + V^*(x(t+T)) \right\}.$$

It is shown in [44] that the Bellman equation (19.60) is equivalent to the integral reinforcement form (19.66). That is, the positive definite solution of both is the value (19.57) of the policy $u = \mu(x)$. The integral reinforcement form (19.66) serves as a Bellman equation for continuous-time systems, and is a fixed point equation. Therefore, the temporal difference error for continuous-time systems can be defined as

$$e(t : t+T) = \int_t^{t+T} r(x(\tau), u(\tau))\, d\tau + V^\mu(x(t+T)) - V^\mu(x(t)), \tag{19.67}$$

which is an equation that does not involve the system dynamics. Now, it is easy to formulate policy iteration and value iteration for continuous-time systems. The following algorithms are termed integral RL for continuous-time systems [29, 44]. Both algorithms give optimal adaptive controllers for continuous-time systems, that is, adaptive control algorithms that converge to optimal control solutions.

Algorithm 19.9: IRL Optimal Adaptive Control Using Policy Iteration
1: procedure
2: Given an initial admissible policy $\mu_0(x)$
3: while $\| V^{\mu_j} - V^{\mu_{j-1}} \| \ge \epsilon_{ac}$ do
4: Solve for $V_{j+1}$ using
$$V_{j+1}(x(t)) = \int_t^{t+T} r(x(s), \mu_j(x(s)))\, ds + V_{j+1}(x(t+T)), \quad V_{j+1}(0) = 0$$
5: Update the policy $\mu_{j+1}$ using
$$\mu_{j+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \nabla V_{j+1}$$
6: $j := j + 1$
7: end while
8: end procedure


Algorithm 19.10: IRL Optimal Adaptive Control Using Value Iteration
1: procedure
2: Given an initial policy $\mu_0(x)$
3: while $\| V^{\mu_j} - V^{\mu_{j-1}} \| \ge \epsilon_{ac}$ do
4: Solve for $V_{j+1}$ using
$$V_{j+1}(x(t)) = \int_t^{t+T} r(x(s), \mu_j(x(s)))\, ds + V_j(x(t+T))$$
5: Update the policy $\mu_{j+1}$ using
$$\mu_{j+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \nabla V_{j+1}$$
6: $j := j + 1$
7: end while
8: end procedure

In Algorithms 19.9 and 19.10, $\epsilon_{ac}$ is a small number used to terminate the algorithm when two consecutive value functions differ by less than $\epsilon_{ac}$. Note that neither algorithm requires knowledge of the system drift dynamics function $f(x)$. That is, they work for partially unknown systems. Convergence of IRL policy iteration has been proved in [44]. An IRL algorithm for nonlinear systems has been proposed in [45].
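The IRL policy iteration of Algorithm 19.9 can be sketched on a scalar continuous-time problem. The example below rests on several assumptions that are not from the chapter: the plant $\dot{x} = a x + b u$, the utility $r = q x^2 + R u^2$, a one-weight critic $V(x) = w x^2$, an RK4 simulator standing in for the physical plant, and hypothetical function names. The learning steps use only the measured triple $(x(t), x(t+T), \rho)$ and the input gain $b$, while the drift parameter $a$ appears only inside the simulator.

```python
def simulate_interval(a, b, q, R, K, x0, T, dt=1e-3):
    """Stand-in for the plant: integrate x' = a*x + b*u and rho' = r(x, u)
    over one interval of length T with u = -K*x, using classical RK4."""
    def deriv(x):
        u = -K * x
        return a * x + b * u, q * x * x + R * u * u

    x, rho = x0, 0.0
    for _ in range(int(round(T / dt))):
        k1x, k1r = deriv(x)
        k2x, k2r = deriv(x + 0.5 * dt * k1x)
        k3x, k3r = deriv(x + 0.5 * dt * k2x)
        k4x, k4r = deriv(x + dt * k3x)
        x += dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        rho += dt * (k1r + 2.0 * k2r + 2.0 * k3r + k4r) / 6.0
    return x, rho


def irl_policy_iteration(a, b, q, R, K0, T=0.1, iters=10):
    """Sketch of Algorithm 19.9 with V(x) = w*x**2 and policy u = -K*x."""
    K, w = K0, 0.0
    for _ in range(iters):
        x_T, rho = simulate_interval(a, b, q, R, K, 1.0, T)  # measured data
        # Policy evaluation: w*(x(t)^2 - x(t+T)^2) = integral reinforcement.
        w = rho / (1.0 - x_T * x_T)
        # Policy update u = -(1/2) R^{-1} g^T grad V = -(b*w/R)*x uses only b.
        K = b * w / R
    return w, K


w, K = irl_policy_iteration(a=1.0, b=1.0, q=1.0, R=1.0, K0=2.0)
print(w, K)
```

For these values the scalar Riccati equation $2 a w + q - b^2 w^2 / R = 0$ has the positive root $w = 1 + \sqrt{2}$, which gives a reference value for the learned weight.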

19.7.1. Online Implementation of IRL: A Hybrid Optimal Adaptive Controller

Both of these IRL algorithms can be implemented online by RL techniques using value function approximation $V(x) = W^T \phi(x)$ in a critic approximator network. Using VFA in the policy iteration algorithm just defined yields

$$W_{j+1}^T \left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} r(x(s), \mu_j(x(s)))\, ds. \tag{19.68}$$

Using VFA in the value iteration algorithm just defined yields

$$W_{j+1}^T \phi(x(t)) = \int_t^{t+T} r(x(s), \mu_j(x(s)))\, ds + W_j^T \phi(x(t+T)). \tag{19.69}$$

Then, RLS or batch least-squares is used to update the value function parameters in these equations. On convergence of the value parameters, the action is updated. IRL provides an optimal adaptive controller, that is, an adaptive controller that measures data along the system trajectories and converges to optimal control solutions. Note that only the system input coupling dynamics g(x) is needed to implement these algorithms. The drift dynamics f (x) is not needed. The implementation of IRL optimal adaptive control is shown in Figure 19.3. The time is incremented at each iteration by the period T . The RL time interval T need


not be the same at each iteration. $T$ can be changed depending on how long it takes to receive meaningful information from the observations and is not a sample period in the standard meaning. The measured data at each time increment are $(x(t), x(t+T), \rho(t : t+T))$, where

$$\rho(t : t+T) = \int_t^{t+T} r(x(\tau), u(\tau))\, d\tau \tag{19.70}$$

is the integral reinforcement measured at each time interval. The integral reinforcement can be computed in real time by introducing an integrator $\dot{\rho} = r(x(t), u(t))$ as shown in Figure 19.3. That is, the integral reinforcement $\rho(t)$ is added as an extra continuous-time state. It functions as the memory or controller dynamics. The remainder of the controller is a sampled-data controller. Note that the control policy $\mu(t)$ is updated periodically after the critic weights have converged to the solution of (19.68) or (19.69). Therefore, the policy is piecewise constant in time. On the other hand, the control varies continuously with the state between each policy update. It is seen that IRL for continuous-time systems is in fact a hybrid continuous-time/discrete-time adaptive controller that converges to the optimal control solution in real time without knowing the drift dynamics $f(x)$. The optimal adaptive controller has multiple control loops and several timescales. The inner control action loop operates in continuous time. Data are sampled at intervals of length $T$. The critic network operates at a slower timescale depending on how long it takes to solve (19.68) or (19.69). Because the policy update for continuous-time systems does not involve the drift dynamics $f(x)$, no actor neural network is needed in IRL. Only a critic neural network is needed for VFA.

Figure 19.3: Hybrid optimal adaptive controller based on integral RL. It shows the two-timescale hybrid nature of the IRL controller. The integral reinforcement signal ρ(t) is added as an extra state and functions as the memory of the controller. The Critic runs on a slow timescale and learns the value of using the current control policy. When the Critic converges, the Actor control policy is updated to obtain an improved value.


19.8. Synchronous Optimal Adaptive Control for Continuous-Time Systems

The IRL controller just given tunes the critic neural network to determine the value while holding the control policy fixed. Then, a policy update is performed. Now we develop an adaptive controller that has two neural networks, one for value function approximation and the other for control approximation. We could call these the critic neural network and actor neural network. The two neural networks are tuned simultaneously, that is, synchronously in time. This procedure is more nearly in line with accepted practice in adaptive control. Though this synchronous controller does require knowledge of the dynamics, it converges to the approximate local solutions of the HJB equation and the Bellman equation online, yet does not require explicitly solving either one. The HJB is usually impossible to solve for nonlinear systems, except for special cases. Based on the continuous-time Hamiltonian (19.61) and the stationarity condition $0 = \partial H(x, u, \nabla V^\mu)/\partial u$, a policy iteration algorithm for continuous-time systems could be written based on the policy evaluation step,

$$0 = H(x, \mu_j(x), \nabla V_{j+1}) \equiv r(x, \mu_j(x)) + (\nabla V_{j+1})^T \left(f(x) + g(x)\mu_j(x)\right), \quad V_{j+1}(0) = 0 \tag{19.71}$$

and the policy improvement step

$$\mu_{j+1} = \arg\min_\mu H(x, \mu, \nabla V_{j+1}). \tag{19.72}$$

Unfortunately, (19.71) is a nonlinear partial differential equation and cannot usually be solved analytically. However, this policy iteration algorithm provides the structure needed to develop another adaptive control algorithm that can be implemented online using measured data along the trajectories and converges to the optimal control. Specifically, select a value function approximation (VFA), or critic neural network, structure as

$$V(x) = W_1^T \phi(x) \tag{19.73}$$

and a control action approximation structure or actor neural network as

$$u(x) = -\frac{1}{2} R^{-1} g^T(x) \nabla\phi^T W_2. \tag{19.74}$$

These approximators could be, for instance, two neural networks with unknown parameters or weights $W_1$, $W_2$ and $\phi(x)$ the basis set or activation functions of the first neural network. The structure of the second action neural network comes from


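the policy iteration algorithm. Then, tuning the neural network weights as

$$\dot{W}_1 = -\alpha_1 \frac{\sigma}{(\sigma^T \sigma + 1)^2} \left[\sigma^T W_1 + Q(x) + u^T R u\right], \tag{19.75}$$

and

$$\dot{W}_2 = -\alpha_2 \left\{ \left(F_1 W_2 - \mathbf{1}^T \bar{\sigma}^T W_1\right) - \frac{1}{4} D(x) W_2 m^T(x) W_1 \right\}, \tag{19.76}$$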

guarantees stability as well as convergence to the optimal value and control [63]. In these parameter estimation algorithms, $\alpha_1, \alpha_2$ are positive tuning parameters, $F_1 \succ 0$ is a matrix picked appropriately to guarantee stability, $\mathbf{1}$ is a row vector of ones with appropriate dimensions, $D(x) = \nabla\phi(x) g(x) R^{-1} g^T(x) \nabla\phi^T(x)$, $\sigma = \nabla\phi\,(f + gu)$, $\bar{\sigma} = \sigma/(\sigma^T \sigma + 1)$, and $m(x) = \sigma/(\sigma^T \sigma + 1)^2$. A persistence of excitation condition on $\bar{\sigma}(t)$ is needed to achieve convergence to the optimal value. This synchronous policy iteration controller is an adaptive control algorithm that requires full knowledge of the system dynamics $f(x)$, $g(x)$, yet converges to the optimal control solution. That is, it solves, locally approximately, the HJB equation, which is usually intractable for nonlinear systems. In the continuous-time LQR case, it solves the ARE using data measured along the trajectories and knowledge of $A$, $B$. The utility of this algorithm is that it can approximately solve the HJB equation for nonlinear systems using data measured along the system trajectories in real time. The HJB is usually impossible to solve for nonlinear systems, except in some special cases. The VFA tuning algorithm for $W_1$ is based on gradient descent, while the control action tuning algorithm is a form of backpropagation [33] that is, however, also tuned by the VFA weights $W_1$. This adaptive structure is similar to the actor-critic RL structure in Figure 19.1. However, in contrast to IRL, this algorithm is a continuous-time optimal adaptive controller with two parameter estimators tuned simultaneously, that is, synchronously and continuously in time. An algorithm that combines the advantages of the synchronous policy iteration algorithm and IRL has been proposed in [64]. The authors in [65] have proposed a model-free IRL algorithm, while the authors in [66] have used neural network-based identifiers for a model-free synchronous policy iteration.
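The normalized gradient-descent critic law (19.75) can be illustrated in isolation by holding the policy fixed. The sketch below is not the full synchronous controller: it omits the actor law (19.76), assumes a scalar plant $\dot{x} = a x + b u$ with $u = -Kx$, $Q(x) = q x^2$, and critic $V(x) = w_1 x^2$, and replaces the persistence of excitation condition with hypothetical periodic state resets. The weight dynamics are integrated with a simple forward Euler step, and all names are illustrative.

```python
def critic_tuning(a, b, q, R, K, alpha=5.0, dt=0.01, steps=5000):
    """Euler-integrate the critic law (19.75) for V(x) = w1*x**2, u = -K*x."""
    w1 = 0.0
    states = (1.0, -1.5, 0.7)  # hypothetical resets standing in for excitation
    for i in range(steps):
        x = states[i % len(states)]
        u = -K * x
        sigma = 2.0 * x * (a * x + b * u)  # grad(phi) * (f + g*u), model-based
        residual = sigma * w1 + q * x * x + R * u * u  # Bellman residual
        w1 += dt * (-alpha * sigma / (sigma * sigma + 1.0) ** 2 * residual)
    return w1


w1 = critic_tuning(a=1.0, b=1.0, q=1.0, R=1.0, K=2.0)
print(w1)
```

For a fixed stabilizing gain the Bellman residual vanishes at $w_1 = (q + R K^2) / (2(bK - a))$, which equals 2.5 for the values used here. Note that `sigma` is computed from the model, consistent with the statement that this controller requires knowledge of the dynamics.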

19.9. Using RL for Multi-Player Nash, Stackelberg, and Graphical Games

Game theory provides an ideal environment to study multi-player decision and control problems, and offers a wide range of challenging and engaging problems. Game theory has been successful in modeling strategic behavior, where the outcome for each player depends on that player's own actions and on the actions of all the other players.


Every player chooses a control to minimize, independently of the others, their own performance objective. Multi-player cooperative games rely on solving coupled HJ equations, which in the linear quadratic case reduce to coupled algebraic Riccati equations. RL techniques have been applied to design adaptive controllers that converge to the solutions of two-player zero-sum games in [28, 67], of multi-player non-zero-sum games in [68], of graphical games in [69–71], and of Stackelberg games in [72, 73]. In these cases, the adaptive control structure has multiple loops, with action networks and critic networks for each player. Below we sketch how to derive RL frameworks for zero-sum and non-zero-sum games. Graphical and Stackelberg games can be derived in a similar manner by following the appropriate citations.

19.9.1. Zero-Sum Games

Consider the nonlinear time-invariant system given by

$$\dot{x}(t) = f(x(t)) + g(x(t))u(t) + k(x(t))d(t), \quad t \ge 0, \tag{19.77}$$

where $x \in \mathbb{R}^n$ is a measurable state vector, $u(t) \in \mathbb{R}^m$ is the control input, $d(t) \in \mathbb{R}^q$ is the adversarial input, $f(x) \in \mathbb{R}^n$ is the drift dynamics, $g(x) \in \mathbb{R}^{n \times m}$ is the input dynamics, and $k(x) \in \mathbb{R}^{n \times q}$ is the adversarial input dynamics. It is assumed that $f(0) = 0$ and $f(x) + g(x)u + k(x)d$ is locally Lipschitz and that the system is stabilizable. The cost functional index to be optimized is defined as

$$J(x(0), u, d) = \int_0^\infty r(x, u, d)\, dt \equiv \int_0^\infty \left(Q(x) + u^T R u - \gamma^2 \|d\|^2\right) dt,$$

where $Q(x) \succeq 0$, $R = R^T \succ 0$, and $\gamma \ge \gamma^* \ge 0$, with $\gamma^*$ being the smallest $\gamma$ such that the system is stabilized. The value function with feedback control and adversarial policies can be defined as

$$V(x(t), u, d) = \int_t^\infty r(x, u, d)\, d\tau \equiv \int_t^\infty \left(Q(x) + u^T R u - \gamma^2 \|d\|^2\right) d\tau, \quad \forall x, u, d.$$

When the value is finite, a differential equivalent to this is the Bellman equation,

$$r(x, u, d) + \frac{\partial V}{\partial x}^T \left(f(x) + g(x)u + k(x)d\right) = 0, \quad V(0) = 0,$$

and the Hamiltonian is given by

$$H\left(x, u, d, \frac{\partial V}{\partial x}^T\right) = r(x, u, d) + \frac{\partial V}{\partial x}^T \left(f(x) + g(x)u + k(x)d\right).$$


Define the two-player zero-sum differential game as

$$V^*(x(0)) = \min_u \max_d J(x(0), u, d),$$

subject to (19.77). It is worth noting that $u$ is the minimizing player while $d$ is the maximizing one. To solve this zero-sum game, one needs to solve the following HJ-Isaacs (HJI) equation,

$$r(x, u^*, d^*) + \frac{\partial V^*}{\partial x}^T \left(f(x) + g(x)u^* + k(x)d^*\right) = 0.$$

Given a solution $V^* \ge 0 : \mathbb{R}^n \to \mathbb{R}$ to this equation, one has

$$u^* = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V^*}{\partial x},$$

$$d^* = \frac{1}{2\gamma^2} k^T(x) \frac{\partial V^*}{\partial x}.$$

19.9.1.1. Approximate solution

Solving the HJI equation analytically is extremely difficult or even impossible. The following PI algorithm can be used to approximate the HJI solution by iterating on the Bellman equation. It terminates when two consecutive value functions no longer differ (with respect to a suitable norm of the error).

Algorithm 19.11: PI for Regulation in Zero-Sum Games ($H_\infty$ Control)
1: procedure
2: Given an admissible policy $u_0$
3: for $j = 0, 1, \ldots$ given $u_j$
4: for $i = 0, 1, \ldots$ set $d^0 = 0$ to solve for the value $V_j^i(x)$ using Bellman's equation
$$Q(x) + \frac{\partial V_j^i}{\partial x}^T \left(f(x) + g(x)u_j + k(x)d^i\right) + u_j^T R u_j - \gamma^2 \|d^i\|^2 = 0, \quad V_j^i(0) = 0,$$
$$d^{i+1} = \frac{1}{2\gamma^2} k^T(x) \frac{\partial V_j^i}{\partial x},$$
on convergence, set $V_{j+1}(x) = V_j^i(x)$
5: Update the control policy $u_{j+1}$ using
$$u_{j+1} = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_{j+1}}{\partial x}.$$
6: Go to 3
7: end procedure
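The nested loops of Algorithm 19.11 can be traced on a scalar linear system, where each Bellman equation has a closed-form solution. The sketch below assumes, purely for illustration, the plant $\dot{x} = a x + b u + k d$, $Q(x) = q x^2$, value $V(x) = p x^2$, and linear policies $u = -Kx$, $d = Lx$. These choices and the function name are not taken from the chapter, and `gamma` denotes the attenuation level of this section rather than a discount factor.

```python
def zero_sum_pi(a, b, k, q, R, gamma, K0=0.0, outer=30, inner=50):
    """Sketch of Algorithm 19.11 for x' = a*x + b*u + k*d with V(x) = p*x**2."""
    K, p, L = K0, 0.0, 0.0
    for _ in range(outer):          # step 3: loop over control policies u_j
        L = 0.0                     # step 4: start from d^0 = 0
        for _ in range(inner):
            a_cl = a - b * K + k * L
            # Bellman equation: q + 2*p*a_cl + R*K^2 - gamma^2*L^2 = 0.
            p = (q + R * K * K - gamma ** 2 * L * L) / (-2.0 * a_cl)
            # Disturbance update d = (1/(2*gamma^2)) * k * dV/dx = L*x.
            L = k * p / gamma ** 2
        K = b * p / R               # step 5: u = -(1/2) R^{-1} g^T dV/dx = -K*x
    return p, K, L


# Stable open-loop plant (a = -1), so K0 = 0 is an admissible initial policy.
p, K, L = zero_sum_pi(a=-1.0, b=1.0, k=1.0, q=1.0, R=1.0, gamma=2.0)
print(p, K, L)
```

Substituting the converged policies into the Bellman equation gives the scalar HJI equation $q + 2 a p - (b^2/R - k^2/\gamma^2)\,p^2 = 0$, which for these values reads $0.75\,p^2 + 2p - 1 = 0$ and can be used to check the result. Like the algorithm itself, this sketch is offline and uses full knowledge of $a$, $b$, and $k$.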


The PI Algorithm 19.11 is an offline algorithm and requires complete knowledge of the system dynamics. In the following, it is desired to design an online PI algorithm that simultaneously updates the value function and policies and does not require knowledge of the internal dynamics.

19.9.1.2. Approximate solution of zero-sum games using integral RL

In [27, 74], an equivalent formulation of the Bellman equation that does not involve the dynamics is found to be

$$V(x(t-T)) = \int_{t-T}^{t} r(x(\tau), u(\tau), d(\tau))\, d\tau + V(x(t)), \tag{19.78}$$

for any time $t \ge 0$ and time interval $T > 0$. This equation is called the integral RL (IRL) Bellman equation. It is assumed that there exist weights $W_1$ such that the value function $V^*(x)$ is approximated as

$$V^*(x) = W_1^T \phi_1(x) + \epsilon(x), \quad \forall x,$$

where $\phi_1(x) : \mathbb{R}^n \to \mathbb{R}^N$ is the basis function vector, $N$ is the number of basis functions, and $\epsilon(x)$ is the approximation error. The optimal weights $W_1$ are not known and, as before, the current critic weights are used,

$$\hat{V}(x) = \hat{W}_c^T \phi_1(x), \quad \forall x, \tag{19.79}$$

where $\hat{W}_c$ are the current estimated values of the ideal critic approximator weights $W_1$. Now, using (19.79) in (19.78), the approximate IRL equation is given as

$$\Delta\phi_1(t)^T \hat{W}_c + \int_{t-T}^{t} \left(Q(x) + \hat{u}^T R \hat{u} - \gamma^2 \|\hat{d}\|^2\right) d\tau = e_1,$$

where $\Delta\phi_1(t) = \phi_1(t) - \phi_1(t-T)$ and $e_1 \in \mathbb{R}$ is the approximation (residual) error after using the current critic approximator weights, and $\hat{u}$, $\hat{d}$ are given by

$$\hat{u}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial \phi_1}{\partial x}^T \hat{W}_u,$$

and

$$\hat{d}(x) = \frac{1}{2\gamma^2} k^T(x) \frac{\partial \phi_1}{\partial x}^T \hat{W}_d,$$

respectively, where $\hat{W}_u$ and $\hat{W}_d$ are the current estimated values of the optimal actor and disturbance policy weights, respectively.


Now it is desired to select $\hat{W}_c$ such that $e_1$ is minimized. Hence, after using a normalized gradient descent, one has

$$\dot{\hat{W}}_c = -\alpha \frac{\Delta\phi_1(t)}{\left(\Delta\phi_1(t)^T \Delta\phi_1(t) + 1\right)^2} \left( \Delta\phi_1(t)^T \hat{W}_c + \int_{t-T}^{t} \left(Q(x) + \hat{u}^T R \hat{u} - \gamma^2 \|\hat{d}\|^2\right) d\tau \right).$$

The weights for the actor $\hat{W}_u$ need to be picked in order to guarantee closed-loop stability. Hence, one has

$$\dot{\hat{W}}_u = -\alpha_u \left\{ \left(F_2 \hat{W}_u - \mathbf{1}^T T \Delta\phi_1(t)^T \hat{W}_c\right) - \frac{1}{4} \left( \frac{\partial \phi_1}{\partial x} g(x) R^{-1} g(x)^T \frac{\partial \phi_1}{\partial x}^T \right) \hat{W}_u \frac{\Delta\phi_1(t)^T}{\left(\Delta\phi_1(t)^T \Delta\phi_1(t) + 1\right)^2} \hat{W}_c \right\},$$

where $\alpha_u \in \mathbb{R}^+$ is a tuning gain and $F_2 \succ 0$ is a user-defined positive definite matrix picked appropriately for stability. Similarly,

$$\dot{\hat{W}}_d = -\alpha_d \left\{ \left(F_3 \hat{W}_d - \mathbf{1}^T T \Delta\phi_1(t)^T \hat{W}_c\right) + \frac{1}{4\gamma^2} \left( \frac{\partial \phi_1}{\partial x} k(x) k(x)^T \frac{\partial \phi_1}{\partial x}^T \right) \hat{W}_d \frac{\Delta\phi_1(t)^T}{\left(\Delta\phi_1(t)^T \Delta\phi_1(t) + 1\right)^2} \hat{W}_c \right\},$$

where $\alpha_d \in \mathbb{R}^+$ is a tuning gain and $F_3 \succ 0$ is a user-defined positive definite matrix picked appropriately for stability. Note that all derivations here are performed under the assumption that the value functions are smooth. If the smoothness assumption is not satisfied, then the theory of viscosity solutions can be used to solve the HJ equations. An approach different from the integral RL algorithm is developed in [66]; it uses a system identifier along with RL to avoid requiring knowledge of the internal dynamics. The concurrent learning technique can also be employed to speed up the convergence of RL algorithms [75, 76].

19.9.2. Non-Zero-Sum Nash Games

This subsection considers $N$ players playing a non-zero-sum dynamic game. The $N$ agents (decision makers) can be non-cooperative and can also cooperate in teams. Each of the agents has access to the full state of the system. In this section, non-cooperative


solution concepts will be considered. RL is used to find the Nash equilibrium of non-cooperative games online in real time using measured data. Consider the N-player nonlinear time-invariant differential game,

$$\dot{x}(t) = f(x(t)) + \sum_{j=1}^{N} g_j(x(t)) u_j(t),$$

where $x \in \mathbb{R}^n$ is a measurable state vector, $u_j(t) \in \mathbb{R}^{m_j}$ are the control inputs, $f(x) \in \mathbb{R}^n$ is the drift dynamics, and $g_j(x) \in \mathbb{R}^{n \times m_j}$ is the input dynamics. It is assumed that $f(0) = 0$, that $f(x) + \sum_{j=1}^{N} g_j(x) u_j$ is locally Lipschitz, and that the system is stabilizable. The cost functional associated with each player is given by

$$J_i(x(0), u_1, u_2, \ldots, u_N) = \int_0^\infty r_i(x, u_1, \ldots, u_N)\, dt = \int_0^\infty \Big(Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j\Big) dt, \quad \forall i \in \mathcal{N},$$

where $Q_i(\cdot) \succeq 0$ is generally nonlinear, $R_{ii} \succ 0$, $\forall i \in \mathcal{N}$, and $R_{ij} \succeq 0$, $\forall j \neq i \in \mathcal{N}$, are symmetric matrices, and $\mathcal{N} := \{1, 2, \ldots, N\}$. The value can be defined as

$$V_i(x, u_1, u_2, \ldots, u_N) = \int_t^\infty r_i(x, u_1, \ldots, u_N)\, d\tau, \quad \forall i \in \mathcal{N},\ \forall x, u_1, u_2, \ldots, u_N.$$

Differential equivalents to each value function are given by the following Bellman equation,

$$r_i(x, u_1, \ldots, u_N) + \frac{\partial V_i}{\partial x}^T \Big(f(x) + \sum_{j=1}^{N} g_j(x) u_j\Big) = 0, \quad V_i(0) = 0, \quad \forall i \in \mathcal{N}.$$

The Hamiltonian functions are defined as

$$H_i\Big(x, u_1, \ldots, u_N, \frac{\partial V_i}{\partial x}^T\Big) = r_i(x, u_1, \ldots, u_N) + \frac{\partial V_i}{\partial x}^T \Big(f(x) + \sum_{j=1}^{N} g_j(x) u_j\Big), \quad \forall i \in \mathcal{N}.$$

According to the stationarity conditions, the associated feedback control policies are given by

$$u_i^\star = \arg\min_{u_i} H_i\Big(x, u_1, \ldots, u_N, \frac{\partial V_i}{\partial x}^T\Big) = -\frac{1}{2} R_{ii}^{-1} g_i^T(x) \frac{\partial V_i}{\partial x}, \quad \forall i \in \mathcal{N}.$$
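As a brief worked step (a sketch that assumes only the quadratic cost structure and the symmetry of $R_{ii}$ stated above), the stationarity condition can be written out by differentiating the Hamiltonian with respect to $u_i$. Only the term $u_i^T R_{ii} u_i$ of $r_i$ and the term $g_i(x) u_i$ of the dynamics depend on $u_i$:

```latex
\begin{align*}
\frac{\partial H_i}{\partial u_i}
  &= 2 R_{ii} u_i + g_i^T(x)\,\frac{\partial V_i}{\partial x} = 0
  \;\;\Longrightarrow\;\;
  u_i = -\tfrac{1}{2} R_{ii}^{-1} g_i^T(x)\,\frac{\partial V_i}{\partial x}, \\
\frac{\partial^2 H_i}{\partial u_i^2}
  &= 2 R_{ii} \succ 0 .
\end{align*}
```

The positive definite second derivative indicates that the stationary point is a minimum of $H_i$ with respect to $u_i$.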


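After substituting the feedback control policies into the Hamiltonian, one has the coupled HJ equations,

$$0 = Q_i(x) + \frac{1}{4}\sum_{j=1}^{N} \frac{\partial V_j}{\partial x}^T g_j(x) R_{jj}^{-T} R_{ij} R_{jj}^{-1} g_j^T(x) \frac{\partial V_j}{\partial x} + \frac{\partial V_i}{\partial x}^T \Big(f(x) - \frac{1}{2}\sum_{j=1}^{N} g_j(x) R_{jj}^{-1} g_j^T(x) \frac{\partial V_j}{\partial x}\Big), \quad \forall i \in \mathcal{N}. \tag{19.80}$$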

19.9.2.1. Approximate solution for non-zero sum games

A PI algorithm is now developed by iterating on the Bellman equation to approximate the solution to the coupled HJ equations, where $\epsilon_{iac}$, $\forall i \in \mathcal{N}$, is a small number used to terminate the algorithm when two consecutive value functions differ by less than $\epsilon_{iac}$.

Algorithm 19.12: PI for Regulation in Non-Zero-Sum Games
1: procedure
2: Given N-tuple of admissible policies $\mu_i^k(0)$, $\forall i \in \mathcal{N}$
3: while $\|V_i^{\mu(k)} - V_i^{\mu(k-1)}\| \geq \epsilon_{iac}$, $\forall i \in \mathcal{N}$ do
4: Solve for the N-tuple of costs $V_i^k(x)$ using the coupled Bellman equations

$$Q_i(x) + \frac{\partial V_i^k}{\partial x}^T \Big(f(x) + \sum_{j=1}^{N} g_j(x) \mu_j^k\Big) + \mu_i^{kT} R_{ii} \mu_i^k + \sum_{j=1}^{N} \mu_j^{kT} R_{ij} \mu_j^k = 0, \quad V^{\mu_i^k}(0) = 0.$$

5: Update the N-tuple of control policies $\mu_i^{k+1}$, $\forall i \in \mathcal{N}$ using

$$\mu_i^{k+1} = -\frac{1}{2} R_{ii}^{-1} g_i^T(x) \frac{\partial V_i^k}{\partial x},$$

6: k := k + 1.
7: end while
8: end procedure
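The policy iteration above can be illustrated on the simplest possible case. The following is a minimal sketch, not the chapter's implementation: it assumes a scalar linear system $\dot{x} = a x + \sum_j b_j u_j$ with quadratic costs $r_i = q_i x^2 + \sum_j r_{ij} u_j^2$ (the cost functional defined at the start of this subsection, with each $R_{ij}$ counted once), linear policies $u_j = -k_j x$, and values $V_i(x) = p_i x^2$. The function name and all numerical values are hypothetical.

```python
def policy_iteration_scalar_game(a, b, q, r, k0, tol=1e-10, max_iter=200):
    """PI for a scalar N-player LQ game x' = a*x + sum_j b[j]*u_j, u_j = -k[j]*x.

    Assumes V_i(x) = p[i]*x**2 and r[i][j] playing the role of R_ij.
    k0 must be admissible, i.e., the closed loop a - sum_j b[j]*k0[j] is stable.
    """
    n = len(b)
    k = list(k0)
    p_prev = [float("inf")] * n
    for _ in range(max_iter):
        a_cl = a - sum(b[j] * k[j] for j in range(n))
        if a_cl >= 0:
            raise ValueError("policies are not admissible (unstable closed loop)")
        # Policy evaluation (Bellman equation divided by x^2):
        #   q_i + sum_j r_ij * k_j^2 + 2 * p_i * a_cl = 0
        p = [(q[i] + sum(r[i][j] * k[j] ** 2 for j in range(n))) / (-2.0 * a_cl)
             for i in range(n)]
        # Policy improvement: u_i = -(1/2) R_ii^{-1} g_i^T dV_i/dx = -(b_i p_i / r_ii) x
        k = [b[i] * p[i] / r[i][i] for i in range(n)]
        # Terminate when two consecutive value functions are close enough.
        if max(abs(p[i] - p_prev[i]) for i in range(n)) < tol:
            return p, k
        p_prev = p
    raise RuntimeError("policy iteration did not converge")


if __name__ == "__main__":
    p, k = policy_iteration_scalar_game(
        a=-1.0, b=[1.0, 1.0], q=[1.0, 1.0],
        r=[[1.0, 0.5], [0.5, 1.0]], k0=[0.0, 0.0])
    print(p, k)
```

For this symmetric two-player example the coupled HJ equations (19.80) reduce to $2.5 p^2 + 2 p - 1 = 0$, so the iteration can be checked against the positive root $p = (-2 + \sqrt{14})/5$.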

19.9.2.2. Approximate solution of non-zero sum games using integral RL

It is obvious that (19.80) requires complete knowledge of the system dynamics. In [27], an equivalent formulation of the coupled IRL Bellman equation that does not involve the dynamics is given as

$$V_i(x(t - T)) = \int_{t-T}^{t} r_i(x(\tau), u_1(\tau), \ldots, u_N(\tau))\, d\tau + V_i(x(t)), \quad \forall i \in \mathcal{N},$$

for any time $t \geq 0$ and time interval $T > 0$.
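The IRL Bellman equation can be checked numerically on a toy case. The sketch below is illustrative only and assumes a scalar closed-loop system $\dot{x} = a_{cl} x$ with running cost $r = c x^2$, for which $V(x) = p x^2$ with $p = c / (-2 a_{cl})$. The closed-loop pole is used solely to generate the "measured" trajectory and the reference value; the residual itself uses only trajectory samples. The function name and parameters are hypothetical.

```python
import math


def irl_bellman_residual(a_cl, c, x_start, T, steps=2000):
    """Residual V(x(t-T)) - [integral of r over [t-T, t] + V(x(t))].

    Assumes x' = a_cl * x (a_cl < 0), r = c * x**2 and V(x) = p * x**2.
    """
    p = c / (-2.0 * a_cl)  # solves c*x^2 + 2*p*a_cl*x^2 = 0
    h = T / steps
    # Samples of the trajectory over one reinforcement interval of length T.
    xs = [x_start * math.exp(a_cl * i * h) for i in range(steps + 1)]
    # Trapezoidal rule for the integral reinforcement.
    integral = h * (sum(c * x * x for x in xs)
                    - 0.5 * c * (xs[0] ** 2 + xs[-1] ** 2))
    return p * xs[0] ** 2 - (integral + p * xs[-1] ** 2)


if __name__ == "__main__":
    print(irl_bellman_residual(a_cl=-1.5, c=2.0, x_start=1.2, T=0.4))
```

The residual is zero up to the quadrature error, which reflects the point made in the text: once the value function is known, the equation holds along measured data without reference to $f(x)$.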


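Assume, as before, that there exist constant weights $W_i$ such that the value functions $V_i$ are approximated on a compact set $\Omega$ as

$$V_i(x) = W_i^T \phi_i(x) + \epsilon_i(x), \quad \forall x,\ \forall i \in \mathcal{N},$$

where $\phi_i(x) : \mathbb{R}^n \to \mathbb{R}^{K_i}$, $\forall i \in \mathcal{N}$, are the basis function vectors, $K_i$, $\forall i \in \mathcal{N}$, is the number of basis functions, and $\epsilon_i(x)$ is the approximation error. From the approximation literature, the basis functions can be selected as sigmoids, tanh, polynomials, etc. Assuming current weight estimates $\hat{W}_{ic}$, $\forall i \in \mathcal{N}$, the outputs of the critic are given by

$$\hat{V}_i(x) = \hat{W}_{ic}^T \phi_i(x), \quad \forall x,\ i \in \mathcal{N}.$$

By using a procedure similar to that used for zero-sum games, the update laws can be rewritten as

$$\dot{\hat{W}}_{ic} = -\alpha_i \frac{\Delta\phi_i(t)}{\left(\Delta\phi_i(t)^T \Delta\phi_i(t) + 1\right)^2}\left(\Delta\phi_i(t)^T \hat{W}_{ic} + \int_{t-T}^{t}\Big(Q_i(x) + \hat{u}_i^T R_{ii} \hat{u}_i + \sum_{j=1}^{N} \hat{u}_j^T R_{ij} \hat{u}_j\Big) d\tau\right), \quad \forall i \in \mathcal{N},$$

and

$$\dot{\hat{W}}_{iu} = -\alpha_{iu}\left\{\left(F_i \hat{W}_{iu} - \mathbf{1}^T \Delta\phi_i(t)^T \hat{W}_{ic}\right) - \frac{1}{4}\sum_{j=1}^{N}\left(\frac{\partial\phi_i}{\partial x} g_i(x) R_{ii}^{-T} R_{ij} R_{ii}^{-1} g_i^T(x) \frac{\partial\phi_i}{\partial x}^T\right)\hat{W}_{iu} \frac{\Delta\phi_i(t)^T}{\left(\Delta\phi_i(t)^T \Delta\phi_i(t) + 1\right)^2}\hat{W}_{jc}\right\}, \quad \forall i \in \mathcal{N},$$

respectively, with $\Delta\phi_i(t) := \phi_i(t) - \phi_i(t - T)$, and $F_i \succ 0$ is picked appropriately for stability. A model-free learning algorithm for non-zero-sum Nash games has been investigated in [77]. System identification is used in [78] to avoid requiring complete knowledge of the system dynamics.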
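The structure of the critic update law (a gradient step on the integral Bellman error, normalized by $(\Delta\phi^T \Delta\phi + 1)^2$) can be sketched for a single player with a fixed policy. This is a hedged illustration under strong simplifying assumptions, not the chapter's algorithm: the basis is the single function $\phi(x) = x^2$, the closed loop is $\dot{x} = a_{cl} x$, the running cost is $c x^2$, the continuous-time law is discretized with a forward Euler step, and the integral reinforcement is computed in closed form as a stand-in for measured data. All names and gains are hypothetical.

```python
import math


def critic_step(w, dphi, integral_cost, alpha, dt):
    """One Euler step of the normalized gradient-descent critic law (scalar basis)."""
    error = dphi * w + integral_cost  # integral Bellman error
    return w - dt * alpha * dphi / (dphi * dphi + 1.0) ** 2 * error


def learn_critic(a_cl=-1.0, c=1.0, T=0.5, alpha=10.0, dt=0.1, sweeps=500):
    """Estimate the critic weight w in V(x) ~ w * x**2 from interval data."""
    w = 0.0
    starts = [0.5, 1.0, 1.5, 2.0]  # different initial states provide excitation
    for _ in range(sweeps):
        for x0 in starts:
            x1 = x0 * math.exp(a_cl * T)          # state at the end of the interval
            dphi = x1 ** 2 - x0 ** 2              # Delta phi = phi(t) - phi(t - T)
            # Integral of c*x^2 over the interval (stand-in for measured cost).
            integral_cost = c * x0 ** 2 * (1.0 - math.exp(2.0 * a_cl * T)) / (-2.0 * a_cl)
            w = critic_step(w, dphi, integral_cost, alpha, dt)
    return w


if __name__ == "__main__":
    print(learn_critic())  # approaches c / (-2 * a_cl)
```

The normalization keeps the effective step size bounded regardless of the magnitude of $\Delta\phi$, which is why the same gain works for both small and large initial states in this sketch.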

19.10. Conclusion

This book chapter uses computational intelligence techniques to bring together adaptive control and optimal control. Adaptive controllers do not generally converge to optimal solutions, and optimal controllers are designed offline using full dynamics


information by solving matrix design equations. The book chapter shows that methods from RL can be used to design new types of adaptive controllers that converge to optimal control and game-theoretic solutions online in real time by measuring data along the system trajectories. These are called optimal adaptive controllers, and they have multi-loop, multi-timescale structures that come from reinforcement learning methods of policy iteration and value iteration. These controllers learn the solutions to HJ design equations, such as the Riccati equation, online without knowing the full dynamical model of the system. The readers are also directed to the works [34–39], and [40] for other RL-inspired control approaches.

Acknowledgements

This chapter is based upon work supported by the NSF under grant numbers CPS-1851588, CPS-2038589, SATC-1801611, and S&AS-1849198, the ARO under grant number W911NF1910270, and the Minerva Research Initiative under grant number N00014-18-1-2160.

References [1] K. J. Astrom, and B. Wittenmark, Adaptive Control, Addison-Wesley (1995). [2] P. Ioannou and B. Fidan, Adaptive Control Tutorial, SIAM Press, Philadelphia (2006). [3] F. L. Lewis, D. Vrabie, K. G. Vamvoudakis, RL and feedback control: using natural decision methods to design optimal adaptive controllers, IEEE Control Syst. Mag., vol. 32, no. 6, pp. 76–105 (2012). [4] K. G. Vamvoudakis, H. Modares, B. Kiumarsi, F. L. Lewis, Game theory-based control system algorithms with real-time RL, IEEE Control Syst. Mag., vol. 37, no. 1, pp. 33–52 (2017). [5] F. L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, 3rd edn., John Wiley, New York (2012). [6] Z.-H. Li and M. Krstic, Optimal design of adaptive tracking controllers for nonlinear systems, Automatica, vol. 33, no. 8, pp. 1459–1473 (1997). [7] R. S. Sutton and A. G. Barto, RL: An Introduction, MIT Press, Cambridge, Massachusetts (1998). [8] W. B. Powell, Approximate Dynamic Programming, 2nd edn., Wiley, Hoboken (2011). [9] P. J. Werbos, A menu of designs for RL over time, in Neural Networks for Control, eds. W. T. Miller, R. S. Sutton, P. J. Werbos, MIT Press, Cambridge, pp. 67–95 (1991). [10] K. G. Vamvoudakis, P. J. Antsaklis, W. E. Dixon, J. P. Hespanha, F. L. Lewis, H. Modares, and B. Kiumarsi, Autonomy and machine intelligence in complex systems: a tutorial, Proc. American Control Conference (ACC), pp. 5062–5079 (2015). [11] F. L. Lewis, G. Lendaris, and D. Liu, Special issue on approximate dynamic programming and RL for feedback control, IEEE Trans. Syst. Man Cybern. Part B, vol. 38, no. 4 (2008). [12] F. Y. Wang, H. Zhang, and D. Liu, Adaptive dynamic programming: an introduction, IEEE Comput. Intell. Mag., pp. 39–47 (2009). [13] S. N. Balakrishnan, J. Ding, and F. L. Lewis, Issues on stability of ADP feedback controllers for dynamical systems, IEEE Trans. Syst. Man Cybern. Part B, vol. 38, no. 4, pp. 913–917, special issue on ADP/RL, invited survey paper (2008). [14] F. L. Lewis and D. 
Liu, RL and Approximate Dynamic Programming for Feedback Control, IEEE Press Series on Computational Intelligence (2013).


[15] J. Si, A. Barto, W. Powell, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming, IEEE Press, USA (2004). [16] D. Prokhorov (ed.), Computational Intelligence in Automotive Applications, Springer (2008). [17] X. Cao, Stochastic Learning and Optimization, Springer-Verlag, Berlin (2007). [18] J. M. Mendel and R. W. MacLaren, RL control and pattern recognition systems, in Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications, eds. J. M. Mendel and K. S. Fu, Academic Press, New York, pp. 287–318 (1970). [19] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, RL and Dynamic Programming Using Function Approximators, CRC Press, Boca Raton (2009). [20] W. Schultz, Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioral ecology, Curr. Opin. Neurobiol., vol. 14, pp. 139–147 (2004). [21] K. Doya, H. Kimura, and M. Kawato, Neural mechanisms for learning and control, IEEE Control Syst. Mag., pp. 42–54 (2001). [22] H. Zhang, Q. Wei, and Y. Luo, A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm, IEEE Trans. Syst. Man Cybern. Part B: Cybern., vol. 38, pp. 937–942 (2008). [23] D. Wang, D. Liu, and Q. Wei, Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach, Neurocomputing, vol. 78, pp. 14–22 (2012). [24] H. Zhang, L. Cui, X. Zhang, and X. Luo, Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method, IEEE Trans. Neural Networks, vol. 22, pp. 2226–2236 (2011). [25] T. Dierks and S. Jagannathan, Optimal control of affine nonlinear continuous-time systems, Proc. Am. Control Conf., pp. 1568–1573 (2010). [26] P. J. Werbos, Approximate dynamic programming for real-time control and neural modeling, in Handbook of Intelligent Control, eds. D. 
A. White and D. A. Sofge, Van Nostrand Reinhold, New York (1992). [27] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by RL Principles, Control Engineering Series, IET Press (2012). [28] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Springer (2013). [29] D. Vrabie and F. L. Lewis, Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems, Neural Networks, vol. 22, no. 3, pp. 237–246 (2009). [30] A. G. Barto, R. S. Sutton, and C. Anderson, Neuron-like adaptive elements that can solve difficult learning control problems, IEEE Trans. Syst. Man Cybern., pp. 834–846 (1983). [31] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, MA (1996). [32] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ (1957). [33] P. J. Werbos, Neural networks for control and system identification, Proc. IEEE Conf. Decision and Control (1989). [34] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, Optimal and autonomous control using RL: a survey, IEEE Trans. Neural Networks Learn. Syst., vol. 29, no. 6, pp. 2042–2062 (2018). [35] K. G. Vamvoudakis and N.-M. T. Kokolakis, Synchronous RL-Based Control for Cognitive Autonomy, Foundations and Trends in Systems and Control, vol. 8, no. 1–2, pp. 1–175 (2020). [36] Z.-P. Jiang, T. Bian, and W. Gao, Learning-Based Control: A Tutorial and Some Recent Results, Foundations and Trends in Systems and Control, vol. 8, no. 3, pp. 176–284 (2020). [37] R. Kamalapurkar, P. Walters, J. Rosenfeld, and W. Dixon, RL for Optimal Feedback Control, Springer (2018).


[38] D. Liu, Q. Wei, D. Wang, X. Yang, and H. Li, Adaptive Dynamic Programming with Application in Optimal Control, Springer (2017). [39] K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever (eds.), Handbook of RL and Control, to appear, Springer (2021). [40] K. G. Vamvoudakis and S. Jagannathan (eds.), Control of Complex Systems: Theory and Applications, Elsevier (2016). [41] Y. Jiang and Z.-P. Jiang, Robust Adaptive Dynamic Programming, John Wiley & Sons (2017). [42] D. Prokhorov and D. Wunsch, Adaptive critic designs, IEEE Trans. Neural Networks, vol. 8, no. 5, pp. 997–1007 (1997). [43] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof, IEEE Trans. Syst. Man Cybern. Part B, vol. 38, no. 4, pp. 943–949, special issue on ADP/RL (2008). [44] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, Adaptive optimal control for continuous-time linear systems based on policy iteration, Automatica, vol. 45, pp. 477–484 (2009). [45] D. Liu, X. Yang, and H. Li, Adaptive optimal control for a class of continuous-time affine nonlinear systems with unknown internal dynamics, Neural Comput. Appl., vol. 23, pp. 1843– 1850 (2013). [46] A. Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill (2002). [47] R. M. Wheeler and K. S. Narendra, Decentralized learning in finite Markov chains, IEEE Trans. Automatic Control, vol. 31, no. 6 (1986). [48] R. A. Howard, Dynamic Programming and Markov Processes, MIT Press, Cambridge, MA (1960). [49] P. Mehta and S. Meyn, Q-learning and Pontryagin’s minimum principle, Proc. IEEE Conf. Decision and Control, pp. 3598–3605 (2009). [50] H. Zhang, J. Huang, and F. L. Lewis, Algorithm and stability of ATC receding horizon control, Proc. IEEE Symp. ADPRL, Nashville, pp. 28–35 (2009). [51] C. Watkins, Learning from delayed rewards, Ph.D. Thesis, Cambridge University, Cambridge, England (1989). [52] C. J. C. H. Watkins and P. 
Dayan, Q-learning, Mach. Learn., vol. 8, pp. 279–292 (1992). [53] K. L. Moore, Iterative Learning Control for Deterministic Systems, Springer-Verlag, London (1993). [54] S. Ferrari and R. F. Stengel, An adaptive critic global controller, Proc. Am. Control Conf., pp. 2665–2670 (2002). [55] D. Prokhorov, R. A. Santiago, and D. C. Wunsch II, Adaptive critic designs: a case study for neurocontrol, Neural Networks, vol. 8, no. 9, pp. 1367–1372 (1995). [56] T. Hanselmann, L. Noakes, and A. Zaknich, Continuous-time adaptive critics, IEEE Trans. Neural Networks, vol. 18, no. 3, pp. 631–647 (2007). [57] A. J. Van der Schaft, L2-gain analysis of nonlinear systems and nonlinear state feedback H8 control, IEEE Trans. Automatic Control, vol. 37, no. 6, pp. 770–784 (1992). [58] S. Bradtke, B. Ydstie, and A. Barto, Adaptive linear quadratic control using policy iteration, Proc. Am. Control Conf., Baltimore, pp. 3475–3479 (1994). [59] F. L. Lewis and K. G. Vamvoudakis, RL for partially observable dynamic processes: adaptive dynamic programming using measured output data, IEEE Trans. Syst. Man Cybern. Part B, vol. 41, no. 1, pp. 14–25 (2011). [60] K. Doya, RL in continuous time and space, Neural Comput., vol. 12, pp. 219–245 (2000). [61] M. Abu-Khalaf, F. L. Lewis, and J. Huang, Policy iterations on the Hamilton-Jacobi-Isaacs equation for state feedback control with input saturation, IEEE Trans. Automatic Control, vol. 51, no. 12, pp. 1989–1995 (2006).


[62] L. C. Baird, RL in continuous time: advantage updating, Proc. Int. Conf. on Neural Networks, Orlando, FL, pp. 2448–2453 (1994). [63] K. G. Vamvoudakis and F. L. Lewis, Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica, vol. 46, no. 5, pp. 878–888 (2010). [64] K. G. Vamvoudakis, D. Vrabie, and F. L. Lewis, Online learning algorithm for optimal control with integral RL, Int. J. Robust Nonlinear Control, vol. 24, no. 17, pp. 2686–2710 (2014). [65] Y. Jiang and Z. P. Jiang, Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics, Automatica, vol. 48, pp. 2699–2704 (2012). [66] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems, Automatica, vol. 49, no. 1, pp. 82–92 (2013). [67] D. Vrabie and F. L. Lewis, Adaptive dynamic programming for online solution of a zero-sum differential game, J. Control Theory Appl., vol. 9, no. 3, pp. 353–360 (2011). [68] K. G. Vamvoudakis and F. L. Lewis, Multi-player non-zero sum games: online adaptive learning solution of coupled Hamilton-Jacobi equations, Automatica, vol. 47, pp. 1556–1569 (2011). [69] K. G. Vamvoudakis, F. L. Lewis, and G. R. Hudas, Multi-agent differential graphical games: online adaptive learning solution for synchronization with optimality, Automatica, vol. 48, no. 8, pp. 1598–1611 (2012). [70] M. I. Abouheaf, F. L. Lewis, K. G. Vamvoudakis, S. Haesaert, and R. Babuska, Multi-agent discrete-time graphical games and RL solutions, Automatica, vol. 50, no. 12, pp. 3038–3053 (2014). [71] K. G. Vamvoudakis and J. P. Hespanha, Cooperative Q-learning for rejection of persistent adversarial inputs in networked linear quadratic systems, IEEE Trans. Autom. Control, vol. 63, no. 4, pp. 1018–1031 (2018). [72] K. G. Vamvoudakis, F. L. Lewis, M. Johnson, and W. E. 
Dixon, Online learning algorithm for stackelberg games in problems with hierarchy, Proc. 51st IEEE Conf. on Decision and Control, Maui, HI, pp. 1883–1889 (2012). [73] K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, Open-loop stackelberg learning solution for hierarchical control problems, Int. J. Adapt. Control Signal Process., vol. 33, no. 2, pp. 285–299 (2019). [74] K. G. Vamvoudakis, D. Vrabie, and F. L. Lewis, Online learning algorithm for zero-sum games with integral reinforcement learning, J. Artif. Intell. Soft Comput. Res., vol. 1, no. 4, pp. 315–332 (2011). [75] R. Kamalapurkar, P. Walters, and W. E. Dixon, Model-based RL for approximate optimal regulation, Automatica, vol. 64, pp. 94–104 (2016). [76] R. Kamalapurkar, J. Klotz, and W. E. Dixon, Concurrent learning-based online approximate feedback Nash equilibrium solution of n-player nonzero-sum differential games, IEEE/CAA J. Autom. Sin., vol. 1, no. 3, p. 239–247 (2014). [77] K. G. Vamvoudakis, Non-zero sum Nash Q-learning for unknown deterministic continuous-time linear systems, Automatica, vol. 61, pp. 274–281 (2015). [78] M. Johnson, R. Kamalapurkar, S. Bhasin, and W. E. Dixon, Approximate n-player nonzerosum game solution for an uncertain continuous nonlinear system, IEEE Trans. Neural Network Learn. Syst., vol. 26, no. 8, pp. 1645–1658 (2015).


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0020

Chapter 20

Nature-Inspired Optimal Tuning of Fuzzy Controllers

Radu-Emil Precup and Radu-Codrut David

Politehnica University of Timisoara, Department of Automation and Applied Informatics, Bd. V. Parvan 2, 300223 Timisoara, Romania

This chapter deals with the nature-inspired optimal tuning of fuzzy controllers. In the first part, the structures of low-cost proportional-integral (PI) and proportional-integral-derivative fuzzy controllers in their Mamdani and Takagi–Sugeno forms are discussed. In the second part, the operation principles of recent nature-inspired optimization algorithms are described, namely the gray wolf optimizer (GWO), whale optimization algorithm (WOA), slime mold algorithm (SMA), hybrid particle swarm optimization–gravitational search algorithm (PSOGSA), and hybrid gray wolf optimizer–particle swarm optimization (GWOPSO) algorithm. The information feedback model F1 is then introduced in these algorithms, which leads to the GWOF1, WOAF1, SMAF1, PSOGSAF1, and GWOPSOF1 algorithms. In the third part, these nature-inspired algorithms are applied to the optimal tuning of Takagi–Sugeno PI-fuzzy controllers for nonlinear servo systems, minimizing objective functions defined as the sum of times multiplied by absolute errors.

20.1. Introduction

As pointed out by Precup and David [1], most fuzzy control applications use fuzzy controllers (FCs) for direct feedback control or place them at a lower level in hierarchical control system structures. However, fuzzy control can also be used at the supervisory level, for example, in adaptive control system structures [2]. Nowadays, fuzzy control is no longer used only to directly express the knowledge on the process or, in other words, to carry out model-free fuzzy control. A fuzzy controller can be tuned from a fuzzy model obtained by system identification techniques, and thus it can be viewed in the framework of model-based fuzzy control.


However, using as little information as possible on the process model, together with systematic tuning based on the knowledge gained from experiments conducted on the real process, belongs to the popular framework of data-driven or model-free control. Important techniques in this regard, both classical and recent, which focus on the iterative experiment-based update of controller parameters and have produced successful results, are iterative feedback tuning (IFT) [3, 4], model-free adaptive control (MFAC) [5, 6], simultaneous perturbation stochastic approximation [7, 8], correlation-based tuning [9, 10], frequency domain tuning [11, 12], iterative regression tuning [13], and adaptive online IFT [14]. Widely used non-iterative techniques are model-free control (MFC) [15, 16], virtual reference feedback tuning [17, 18], active disturbance rejection control (ADRC) [19, 20], data-driven predictive control [21, 22], unfalsified control [23, 24], data-driven inversion-based control [25, 26], and an approach that makes no a priori assumption of persistency of excitation on the system input but instead studies equivalent conditions on the given data, based on which different analysis and control problems can be solved [27].

The advantages of fuzzy control and of data-driven or model-free control are merged by combining these techniques. These combinations include H∞ fuzzy control [28], fault-tolerant fuzzy control [29], parameterized data-driven fuzzy control [30], data-driven interpretable fuzzy control [31], MFC merged with fuzzy control [32, 33], MFAC merged with fuzzy control [34], ADRC mixed with fuzzy control [35, 36], fuzzy logic-based adaptive ADRC [37], and distending function-based data-driven arithmetic fuzzy control [38].

The following categories of FCs are most often used:

• Mamdani FCs, referred to also as linguistic FCs, with either fuzzy consequents, i.e., type-1 fuzzy systems [39], or singleton consequents, i.e., type-2 fuzzy systems. These FCs are usually used as direct closed-loop controllers.
• Takagi–Sugeno FCs, referred to also as type-3 fuzzy systems [39], with mainly affine consequents. These FCs are typically used as supervisory controllers.

As shown by Precup and David [1, 40], a systematic way to meet the performance specifications of fuzzy control systems is the optimal tuning of FCs in terms of optimization problems whose variables are the parameters of the FCs. Nature-inspired algorithms can solve these optimization problems to meet the performance specifications if they are expressed by adequately defined objective functions and constraints. The motivation for nature-inspired optimization algorithms in combination with fuzzy control and fuzzy modeling concerns their ability to cope with non-convex or non-differentiable objective functions due to controller structures and nonlinearities, and to process complexity, which can lead to multiobjective optimization problems.


A selection of recent and representative nature-inspired algorithms applied to the optimal parameter tuning of FCs includes the adaptive weight genetic algorithm (GA) for gear shifting [41], GA-based multi-objective optimization for electric vehicle powertrains [42], GA for hybrid power systems [43], engines [44], energy management in hybrid vehicles [45], wellhead back pressure [46], and micro-unmanned helicopters [47], the particle swarm optimization (PSO) algorithm with compensating coefficient of inertia weight factor for filter time constant adaptation in hybrid energy storage systems [48], the set-based PSO algorithm with adaptive weights for optimal path planning of unmanned aerial vehicles [49], the PSO algorithm for zinc production [50], inverted pendulums [51], and servo systems [40], the hybrid PSO–artificial bee colony algorithm for frequency regulation in microgrids [52], the imperialist competitive algorithm for human immunodeficiency [53], gray wolf optimizer (GWO) algorithms for sun-tracker systems [54] and servo systems [55], PSO, cuckoo search, and differential evolution (DE) for gantry crane systems [56], the whale optimization algorithm (WOA) for vibration control of steel structures [57] and servo systems [58, 59], the grasshopper optimization algorithm for load frequency [60], DE for electro-hydraulic servo systems [20], the gravitational search algorithm (GSA) and charged system search (CSS) for servo systems [40], the multiverse optimizer for fuzzy parameter tuning in structural motion systems [61], and the slime mold algorithm (SMA) for servo systems [62].

This chapter is the second edition of an earlier work by Precup and David [1], which dealt with nature-inspired optimization of fuzzy controllers and fuzzy models. The current chapter focuses on nature-inspired optimization of fuzzy controllers; fuzzy models are no longer treated. The same optimization problems are considered. However, this chapter does not offer fuzzy controllers with a reduced process parametric sensitivity: the output sensitivity function is no longer included in the objective function (equal to the sum of times multiplied by absolute errors) or, in other words, its weighting parameter in the objective function can be considered to be zero. The same process is considered in both chapters, namely a family of nonlinear servo systems. The same controllers are also considered in both chapters, i.e., Takagi–Sugeno proportional-integral (PI)-fuzzy controllers. CSS algorithms were treated in the study conducted by Precup and David [1], while this chapter treats more recent algorithms, i.e., GWO, WOA, SMA, hybrid PSOGSA, and hybrid GWOPSO algorithm.

The current chapter exploits a recent and successful improvement of nature-inspired optimization algorithms, which modifies their structure by including information feedback models according to the general formulation given in an earlier study [63]. Six types of information feedback models are proposed by Wang and Tan [63], with individuals from previous iterations being selected in either a fixed or random manner and


further used in the update process of the algorithms. The PSO and representative algorithms combined with information feedback models are discussed by Wang and Tan [63], the combination with NSGA-III is treated by Gu and Wang [64] and with multi-objective evolutionary algorithms based on decomposition by Zhang et al. [65]. The description of the introduction of the information feedback model F1 in these algorithms, which leads to GWOF1, WOAF1, and SMAF1, has already been presented by Precup et al. [62] but is connected to a different objective function and emphasized in this chapter, and the presentation of PSOGSAF1 and GWOPSOF1 algorithms is also included. The next section gives insights into the design and tuning of Mamdani and Takagi–Sugeno FCs with dynamics focusing on PI-fuzzy controllers. Section 3 discusses the GWO, WOA, SMA, PSOGSA, and GWOPSO algorithm-based optimal tuning of FCs for nonlinear servo systems. Details on the introduction of the information feedback model F1 in nature-inspired optimization algorithms are also provided, which are aimed at the GWOF1, WOAF1, SMAF1, PSOGSAF1, and GWOPSOF1 algorithm-based optimal tuning of FCs for nonlinear servo systems. The conclusions are outlined in Section 4.

20.2. Structures and Design Methodology of Fuzzy Controllers

This section is based on the information given by Precup and David in an earlier study [1]. FCs without dynamics offer nonlinear input–output static maps, which can be modified by the proper tuning of all parameters in the FC structure (the shapes and scaling factors of membership functions, the rule base, and the defuzzification method). Dynamics of integral and/or derivative type can be introduced in the FC structure on either the inputs or the outputs of the FC, resulting in two versions of PI-fuzzy controllers [66]:

• the proportional-integral-fuzzy controller with integration of controller output (PI-FC-OI) and
• the proportional-integral-fuzzy controller with integration of controller input (PI-FC-II).

According to Sugeno and Kóczy [39, 67], both versions of PI-fuzzy controllers, with the structures given in Figure 20.1, are type-2 fuzzy systems.

Figure 20.1: Structures of proportional-integral-fuzzy controllers [1].


As shown in Figure 20.1, the two PI-fuzzy controllers are built around the two inputs-single output fuzzy controller (TISO-FC), which is a nonlinear subsystem without dynamics. In addition, the input variables are also scheduling variables. The presentation is focused, as in Precup and David [1], on PI-fuzzy controllers because they can be extended relatively easily to proportional-integral-derivative (PID)-fuzzy controllers and transformed into particular forms of proportional-derivative (PD)-fuzzy controllers. The structures presented in Figure 20.1 can also be extended to two-degrees-of-freedom (2-DOF) fuzzy controllers, proposed in earlier studies by Precup and Preitl [68, 69] as fuzzy controllers with non-homogenous dynamics with respect to the input channels, and next developed in servo system and electrical drives applications [70, 71]. The dynamics are inserted in PI-FC-OI by the numerical differentiation of the control error $e(t_d)$, leading to the increment of control error $\Delta e(t_d) = e(t_d) - e(t_d - 1)$, and the numerical integration of the increment of control signal $\Delta u(t_d) = u(t_d) - u(t_d - 1)$, which gives the control signal $u(t_d)$. The variable $t_d$ in Figure 20.1 is the current discrete time index, and $q$ indicates the forward time shift operator. The dynamics are inserted in PI-FC-II by the numerical integration of $e(t_d)$, producing the integral of control error $e_I(t_d) = e_I(t_d - 1) + e(t_d)$. It is assumed that the nonlinear scaling factors of the input and output variables specific to TISO-FC are inserted in the controlled process. As pointed out by Precup and David [1], the fuzzification in TISO-FC that belongs to PI-FC-OI is conducted in terms of the input (and also scheduling) and output membership functions illustrated in Figure 20.2 for Mamdani PI-fuzzy controllers.
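The way dynamics are wrapped around the static TISO-FC block in PI-FC-OI can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the class name, the zero initial conditions $e(-1) = 0$ and $u(-1) = 0$, and the linear stand-in used in place of an actual fuzzy block are all assumptions made here for the sake of a runnable example.

```python
class PIFuzzyControllerOI:
    """PI-fuzzy controller with integration of the controller output (PI-FC-OI)."""

    def __init__(self, tiso_fc):
        self.tiso_fc = tiso_fc  # static map (e, delta_e) -> delta_u, no dynamics
        self.e_prev = 0.0       # assumed initial condition e(-1) = 0
        self.u_prev = 0.0       # assumed initial condition u(-1) = 0

    def step(self, e):
        delta_e = e - self.e_prev           # numerical differentiation of e
        delta_u = self.tiso_fc(e, delta_e)  # nonlinear block without dynamics
        u = self.u_prev + delta_u           # numerical integration of delta_u
        self.e_prev, self.u_prev = e, u
        return u


def linear_tiso(e, delta_e, k_p=2.0, mu=0.5):
    """Hypothetical stand-in for the TISO-FC: a linear incremental PI law."""
    return k_p * (delta_e + mu * e)


if __name__ == "__main__":
    controller = PIFuzzyControllerOI(linear_tiso)
    print([controller.step(e) for e in (1.0, 1.0, 0.5)])
```

With the linear stand-in, the wrapper reduces to a conventional discrete-time incremental PI controller, which is a convenient sanity check before a nonlinear fuzzy map is substituted for `linear_tiso`.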
The fuzzification in TISO-FC that belongs to PI-FC-II is performed in terms of the same membership functions as those specific to PI-FC-OI, but the input variable $e_I(t_d)$ is used instead of $\Delta e(t_d)$, and the output variable $u(t_d)$ is used instead of $\Delta u(t_d)$. The relatively simple shapes of the membership functions given in Figure 20.2, reflected in a small number of parameters, contribute to the cost-effective implementation of FCs. Other distributions of membership

Figure 20.2: Input and output membership functions of Mamdani PI-fuzzy controllers with integration on controller output [1].

June 1, 2022 13:18


functions can modify the FC nonlinearities in a desired way. The Takagi–Sugeno PI-fuzzy controllers make use of only the input membership functions illustrated in Figure 20.2.

Both versions of the Mamdani PI-fuzzy controllers, namely PI-FC-OI and PI-FC-II, employ Mamdani's MAX-MIN compositional rule of inference assisted by the rule base exemplified as follows for PI-FC-OI [1]:

Rule 1: IF $e(t_d)$ IS N AND $\Delta e(t_d)$ IS P THEN $\Delta u(t_d)$ IS ZE,
Rule 2: IF $e(t_d)$ IS ZE AND $\Delta e(t_d)$ IS P THEN $\Delta u(t_d)$ IS PS,
Rule 3: IF $e(t_d)$ IS P AND $\Delta e(t_d)$ IS P THEN $\Delta u(t_d)$ IS PB,
Rule 4: IF $e(t_d)$ IS N AND $\Delta e(t_d)$ IS ZE THEN $\Delta u(t_d)$ IS NS,
Rule 5: IF $e(t_d)$ IS ZE AND $\Delta e(t_d)$ IS ZE THEN $\Delta u(t_d)$ IS ZE,
Rule 6: IF $e(t_d)$ IS P AND $\Delta e(t_d)$ IS ZE THEN $\Delta u(t_d)$ IS PS,
Rule 7: IF $e(t_d)$ IS N AND $\Delta e(t_d)$ IS N THEN $\Delta u(t_d)$ IS NB,
Rule 8: IF $e(t_d)$ IS ZE AND $\Delta e(t_d)$ IS N THEN $\Delta u(t_d)$ IS NS,
Rule 9: IF $e(t_d)$ IS P AND $\Delta e(t_d)$ IS N THEN $\Delta u(t_d)$ IS ZE,

(20.1)

and the center of gravity method for singletons is used in the defuzzification module of the FCs. The inference engines of both versions of Takagi–Sugeno PI-fuzzy controllers are based on the SUM and PROD operators assisted by the rule base, which is exemplified here for PI-FC-OI [1]:

Rule 1: IF $e(t_d)$ IS N AND $\Delta e(t_d)$ IS P THEN $\Delta u(t_d) = K_P^1[\Delta e(t_d) + \mu^1 e(t_d)]$,
Rule 2: IF $e(t_d)$ IS ZE AND $\Delta e(t_d)$ IS P THEN $\Delta u(t_d) = K_P^2[\Delta e(t_d) + \mu^2 e(t_d)]$,
Rule 3: IF $e(t_d)$ IS P AND $\Delta e(t_d)$ IS P THEN $\Delta u(t_d) = K_P^3[\Delta e(t_d) + \mu^3 e(t_d)]$,
Rule 4: IF $e(t_d)$ IS N AND $\Delta e(t_d)$ IS ZE THEN $\Delta u(t_d) = K_P^4[\Delta e(t_d) + \mu^4 e(t_d)]$,
Rule 5: IF $e(t_d)$ IS ZE AND $\Delta e(t_d)$ IS ZE THEN $\Delta u(t_d) = K_P^5[\Delta e(t_d) + \mu^5 e(t_d)]$,
Rule 6: IF $e(t_d)$ IS P AND $\Delta e(t_d)$ IS ZE THEN $\Delta u(t_d) = K_P^6[\Delta e(t_d) + \mu^6 e(t_d)]$,
Rule 7: IF $e(t_d)$ IS N AND $\Delta e(t_d)$ IS N THEN $\Delta u(t_d) = K_P^7[\Delta e(t_d) + \mu^7 e(t_d)]$,
Rule 8: IF $e(t_d)$ IS ZE AND $\Delta e(t_d)$ IS N THEN $\Delta u(t_d) = K_P^8[\Delta e(t_d) + \mu^8 e(t_d)]$,
Rule 9: IF $e(t_d)$ IS P AND $\Delta e(t_d)$ IS N THEN $\Delta u(t_d) = K_P^9[\Delta e(t_d) + \mu^9 e(t_d)]$,

(20.2)


and the weighted average method is applied in the defuzzification module of the FCs. Using the PI-fuzzy controller structures described above, the rule bases given in (20.1) and (20.2) make these controllers behave as bumpless interpolators between separately designed PI controllers. The maximum number of such controllers is nine, and the following conditions ensure the interpolation between only two separately designed PI controllers in the case of Takagi–Sugeno PI-fuzzy controllers [1]:

$$K_P^1 = K_P^2 = K_P^4 = K_P^5 = K_P^6 = K_P^8 = K_P^9 = K_P, \quad K_P^3 = K_P^7 = \eta K_P, \quad \mu^1 = \mu^2 = \mu^3 = \mu^4 = \mu^5 = \mu^6 = \mu^7 = \mu^8 = \mu^9 = \mu. \tag{20.3}$$

The unified design methodology of the FCs described in this section consists of the following steps for Mamdani and Takagi–Sugeno PI-fuzzy controllers considered in their PI-FC-OI versions [1]:

Step 1. The continuous-time design of the linear PI controller with the transfer function $C(s)$

$$C(s) = \frac{k_c(1 + T_i s)}{s} = k_C\left(1 + \frac{1}{T_i s}\right), \quad k_C = k_c T_i \tag{20.4}$$

is performed, and it leads to the controller gain $k_c$ (or $k_C$, depending on the expression of the PI controller transfer function) and the integral time constant $T_i$.

Step 2. The sampling period $T_s$ is set according to the requirements of quasi-continuous digital control. Tustin's method is then applied to discretize the continuous-time linear PI controller, and the recurrent equation of the incremental digital PI controller is

$$\Delta u(t_d) = K_P[\Delta e(t_d) + \mu e(t_d)], \tag{20.5}$$

where the expressions of the parameters $K_P$ and $\mu$ that appear in Eqs. (20.3) and (20.5) are, as demonstrated earlier by Precup and David [1],

$$K_P = k_c(T_i - 0.5T_s), \quad \mu = \frac{T_s}{T_i - 0.5T_s}. \tag{20.6}$$

The parameter η, with typical values 0 < η < 1, is introduced in Eq. (20.3) to alleviate the overshoot of the fuzzy control system when both inputs have the same sign. These PI-fuzzy controllers can also be applied to the control of nonminimum phase systems with right half-plane zeros, where such rule bases produce the alleviation of the downshoot as well.


Step 3. The modal equivalence principle [72] is applied to map the linear controller parameters onto the PI-fuzzy controller ones. The application of this principle to the Takagi–Sugeno PI-FC-OI results in the tuning condition

$$B_{\Delta e} = \mu B_e, \tag{20.7}$$

and the application to the Mamdani PI-FC-OI leads to the tuning conditions [1]

$$B_{\Delta e} = \mu B_e, \quad B_{\Delta u} = K_P \mu B_e. \tag{20.8}$$

The tuning conditions for the Mamdani PI-FC-II are [66]

$$B_{e_I} = \frac{1}{\mu} B_e, \quad B_u = K_P B_e, \tag{20.9}$$

and the tuning condition for the Takagi–Sugeno PI-FC-II is [1]

$$B_{e_I} = \frac{1}{\mu} B_e. \tag{20.10}$$
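Steps 1 to 3 can be illustrated for the Takagi–Sugeno PI-FC-OI with a minimal Python sketch. It computes $K_P$ and $\mu$ from (20.6), applies the tuning condition (20.7) and the gain conditions (20.3), and evaluates the rule base (20.2) with PROD inference and weighted-average defuzzification. The membership function shapes below (N, ZE, P forming a partition of unity with a single width parameter) are an assumption made for illustration, since Figure 20.2 is not reproduced in the text; the function names and the numerical values in the usage note are likewise hypothetical.

```python
def discretize_pi(k_c, T_i, T_s):
    """Parameters of the incremental digital PI controller, Eq. (20.6)."""
    K_P = k_c * (T_i - 0.5 * T_s)
    mu = T_s / (T_i - 0.5 * T_s)
    return K_P, mu


def memberships(x, B):
    """Degrees of N, ZE, P for input x (assumed shapes, they sum to 1)."""
    n = min(max(-x / B, 0.0), 1.0)
    p = min(max(x / B, 0.0), 1.0)
    ze = max(1.0 - abs(x) / B, 0.0)
    return {"N": n, "ZE": ze, "P": p}


# Antecedents (term of e, term of delta-e) of Rules 1..9 in (20.2).
RULES = [("N", "P"), ("ZE", "P"), ("P", "P"),
         ("N", "ZE"), ("ZE", "ZE"), ("P", "ZE"),
         ("N", "N"), ("ZE", "N"), ("P", "N")]


def ts_pi_fc_oi(e, de, K_P, mu, eta, B_e):
    """Increment of control signal produced by the Takagi-Sugeno TISO-FC."""
    B_de = mu * B_e                                   # tuning condition (20.7)
    m_e, m_de = memberships(e, B_e), memberships(de, B_de)
    num = den = 0.0
    for idx, (term_e, term_de) in enumerate(RULES, start=1):
        gain = eta * K_P if idx in (3, 7) else K_P    # conditions (20.3)
        w = m_e[term_e] * m_de[term_de]               # PROD operator
        num += w * gain * (de + mu * e)               # rule consequent in (20.2)
        den += w
    return num / den                                  # weighted average


def control_sequence(errors, K_P, mu, eta, B_e):
    """Run PI-FC-OI over an error sequence, integrating the output increments."""
    u, e_prev, out = 0.0, 0.0, []
    for e in errors:
        u += ts_pi_fc_oi(e, e - e_prev, K_P, mu, eta, B_e)
        e_prev = e
        out.append(u)
    return out
```

With $\eta = 1$ all nine rule consequents coincide, so the sketch reduces to the linear incremental PI controller in (20.5); values $0 < \eta < 1$ only reduce the gain in the regions where both inputs have the same sign (Rules 3 and 7).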

The value of the parameter $B_e$ in Eqs. (20.7)–(20.10) must be set by the designer. This can be carried out according to the designer's experience, and the stability analysis of the fuzzy control system can be conducted in this regard. However, the systematic tuning of the parameter $B_e$ is presented in the following sections: it is obtained by solving optimization problems, as this parameter is one of the elements of the vector variables of the objective functions. The PD-fuzzy controllers are designed using the methodology given above for PI-fuzzy controllers. The design is carried out in terms of the PD-fuzzy controller structure, which is actually the PI-FC-OI structure given in Figure 20.1 but with the output integrator dropped. The PID-fuzzy controller structure is presented in Figure 20.3, where the TISO-FC$^{PD}$ block and the TISO-FC$^{PI}$ block are identical to TISO-FC given in Figure 20.1, $u^{PD}(t_d)$ is the control signal of the PD controller, and $\Delta u^{PI}(t_d)$ is the increment of control signal calculated by the PI controller. The fuzzification uses the input membership functions exemplified in Figure 20.3 for the Takagi–Sugeno PID-fuzzy controller, with $g \in \{PI, PD\}$. Figure 20.3 also shows that this PID-fuzzy controller

Figure 20.3: Structure and input membership functions of Takagi–Sugeno PID-fuzzy controller [1].


structure consists of a parallel connection of two blocks, namely the proportional-derivative-fuzzy controller (PD-FC) and the proportional-integral-fuzzy controller (PI-FC). Setting the sampling period $T_s$, Tustin's method leads to the discrete-time PID controller with the following transfer function $C(q^{-1})$ and parameters [1]:

$$C(q^{-1}) = \frac{\rho_0 + \rho_1 q^{-1} + \rho_2 q^{-2}}{1 - q^{-1}}, \quad \rho_0 = k_c\left(1 + \frac{T_d}{T_s}\right), \quad \rho_1 = -k_c\left(1 + 2\frac{T_d}{T_s} - \frac{T_s}{T_i}\right), \quad \rho_2 = k_c\frac{T_d}{T_s}. \tag{20.12}$$

The transfer function in (20.12) is expressed next as a parallel connection of discrete-time PD and PI controllers with the superscripts PD and PI, respectively, associated to Figure 20.3 [1]:

$$C(q^{-1}) = K_P^{PD}(1 - q^{-1} + \mu^{PD}) - \frac{K_P^{PI}(1 - q^{-1} + \mu^{PI})}{1 - q^{-1}}, \quad K_P^{PD} = \rho_2, \quad \mu^{PD} = -\frac{\rho_1}{\rho_2}, \quad K_P^{PI} = \rho_2 - \rho_1 - \rho_0, \quad \mu^{PI} = \frac{2\rho_2}{\rho_2 - \rho_1 - \rho_0}. \tag{20.13}$$

Therefore, the analogy of (20.13) and the transfer function associated to (20.5) fully justifies carrying out the separate design of PI-FC and PD-FC outlined in Figure 20.3 in terms of the design methodology given above. As pointed out by Precup and David [1], similar controller structures can be formulated in the framework of state feedback control systems. As Precup et al. pointed out [73], the general approach to deal with such fuzzy control systems makes use of Takagi–Sugeno–Kang fuzzy models of the process and carries out the stability analysis, producing linear matrix inequalities (LMIs) in terms of the parallel distributed compensation (PDC) approach. The PDC approach is important as it states that the dynamics of each local subsystem in the rule consequents of the Takagi–Sugeno–Kang fuzzy models of the process is controlled separately using the eigenvalue analysis [74, 75]. Recent results on LMI-based stability analysis include the relaxation of stability conditions producing fuzzy adaptive compensation control [76], nonlinear consequents [77], optimal filters with memory [78], application to chaotic systems [79], non-fragile control design [80], event-triggered control [81], indirect adaptive control [82], event-triggered fuzzy sliding mode control [83], the negative absolute eigenvalue approach [84], and the Lyapunov–Krasovskii functionals-based approach [85]. The effects of various parameters of the fuzzy models are discussed in relation to non-quadratic Lyapunov function-based approaches such as the membership-function-dependent analysis [86, 87], non-quadratic stabilization of uncertain systems [88],


piecewise continuous and smooth functions, piecewise continuous exact fuzzy models, general polynomial approaches [89, 90], sum-of-squares-based polynomial membership functions [91, 92], integral structure-based Lyapunov functions [93], the subspace-based improved sector nonlinearity approach [94], and interpolation function-based approaches [95]. Representative applications of different fuzzy control systems and algorithms include predictive functional control [96], evolving controllers [97, 98], robust adaptive fuzzy control [99], tensor product-based model transformation [100], signatures [101], and fuzzy-neural adaptive terminal iterative learning control [102]. Challenging processes subjected to fuzzy control are generally demonstrated by Precup et al. and Dzitac et al. [103, 104], and they also deal with gathering of robots [105], hybrid electric vehicles [106], vehicle navigation [107], traffic management [108], telemanipulation [109, 110], non-affine nonlinear systems [111], suspension systems [112], robotic motion control [113], machining processes [114], fault detection [115], turbojet engines [116], and medical applications [117, 118]. The extensions of fuzzy controllers to type-2 fuzzy controllers provide additional performance improvement because of their extended flexibility, similar to 2-DOF fuzzy controllers. The tuning of type-2 fuzzy controllers is mainly based on optimization algorithms, and recent algorithms used for this purpose include the firefly algorithm [119], several algorithms in the framework of multi-objective optimization [120], and chemical optimization [121], with a useful overview conducted by Castillo and Melin [122]. Significant real-world applications deal with virtual disk cloning [123], spacecraft and aerial robots [124, 125], quadrotors [126], unmanned aerial vehicles [127], induction motor drives [128], event-triggered control [129, 130], supervisory control [131], and networked control systems [132].

20.3. GWO, WOA, SMA, PSOGSA, and GWOPSO Algorithm-Based Optimal Tuning of Fuzzy Controllers for Nonlinear Servo Systems

This section is organized as follows: The recent nature-inspired optimization algorithms are first described. The optimization problem is next defined with a focus on the Takagi–Sugeno PI-fuzzy controller dedicated to the position control of a family of nonlinear servo systems, and the optimal tuning methodology is presented along with the mapping of nature-inspired optimization algorithms onto the optimization problem. The final subsection gives an example of validation on servo system lab equipment, and real-time experimental results are included. All algorithms presented in this section belong to the category of swarm intelligence algorithms in the general class of nature-inspired optimization algorithms. They make use of a total number of $N$ agents, and each agent is assigned to a position


vector $\mathbf{X}_i(k)$

$$\mathbf{X}_i(k) = [x_i^1(k) \; \cdots \; x_i^f(k) \; \cdots \; x_i^q(k)]^T \in D_s \subset \mathbb{R}^q, \quad i = 1 \ldots N, \tag{20.14}$$

where [35] $x_i^f(k)$ is the position of the $i$th agent in the $f$th dimension, $f = 1 \ldots q$, $k$ is the index of the current iteration, $k = 1 \ldots k_{\max}$, $k_{\max}$ is the maximum number of iterations, and $T$ indicates matrix transposition. The expression of the search domain $D_s$ is

$$D_s = [l^1, u^1] \times \cdots \times [l^f, u^f] \times \cdots \times [l^q, u^q] \subset \mathbb{R}^q, \tag{20.15}$$

where $l^f$ are the lower bounds, $u^f$ are the upper bounds, $f = 1 \ldots q$, and

$$x_i^f(k) \in [l^f, u^f], \quad i = 1 \ldots N, \quad f = 1 \ldots q. \tag{20.16}$$

The significance of the agents is different for each algorithm described in the next sub-sections. Another common feature of all algorithms described below is that they are stopped after the maximum number of iterations $k_{\max}$ is reached.

20.3.1. Gray Wolf Optimizer

The agents in GWO algorithms are gray wolves. The operating mechanism of GWO starts with the random initialization of the population of agents that comprise the wolf pack [133] and continues with the exploration stage, which models the search for the prey. During this stage, the positions of the top three agents, namely the alpha ($\alpha$), beta ($\beta$), and delta ($\delta$) agents, dictate the search pattern; these first three best vector solutions obtained at each iteration are

$$\mathbf{X}_l(k) = [x_l^1(k) \; \cdots \; x_l^f(k) \; \cdots \; x_l^q(k)]^T, \quad l \in \{\alpha, \beta, \delta\}, \tag{20.17}$$

and they (namely $\mathbf{X}_\alpha(k)$, $\mathbf{X}_\beta(k)$, and $\mathbf{X}_\delta(k)$) are obtained in terms of

$$\begin{gathered} J(\mathbf{X}_\alpha(k)) = \min_{i=1 \ldots N}\{J(\mathbf{X}_i(k)) \mid \mathbf{X}_i(k) \in D_s\}, \\ J(\mathbf{X}_\beta(k)) = \min_{i=1 \ldots N}\{J(\mathbf{X}_i(k)) \mid \mathbf{X}_i(k) \in D_s \setminus \{\mathbf{X}_\alpha(k)\}\}, \\ J(\mathbf{X}_\delta(k)) = \min_{i=1 \ldots N}\{J(\mathbf{X}_i(k)) \mid \mathbf{X}_i(k) \in D_s \setminus \{\mathbf{X}_\alpha(k), \mathbf{X}_\beta(k)\}\}, \\ J(\mathbf{X}_\alpha(k)) < J(\mathbf{X}_\beta(k)) < J(\mathbf{X}_\delta(k)), \end{gathered} \tag{20.18}$$

where $J: \mathbb{R}^q \to \mathbb{R}_+$ is the objective function in a minimization-type optimization problem.


The following search coefficients are defined:

$$a_l^f(k) = a(k)(2r_{1l}^f - 1), \quad c_l^f(k) = 2r_{2l}^f, \quad l \in \{\alpha, \beta, \delta\}, \tag{20.19}$$

where $r_{1l}^f \in [0, 1]$ and $r_{2l}^f \in [0, 1]$ are uniformly distributed random numbers, $f = 1 \ldots q$, and the coefficients $a(k)$ are linearly decreased from 2 to 0 during the search process:

$$a(k) = 2\left(1 - \frac{k - 1}{k_{\max} - 1}\right), \quad f = 1 \ldots q. \tag{20.20}$$

The approximate distances between the current solution and the alpha, beta, and delta solutions, i.e., $d_\alpha^{if}(k)$, $d_\beta^{if}(k)$, and $d_\delta^{if}(k)$, respectively, are [134]

$$d_l^{if}(k) = |c_l^f(k) x_l^f(k) - x_i^f(k)|, \quad i = 1 \ldots N, \quad l \in \{\alpha, \beta, \delta\}, \tag{20.21}$$

and the elements of the updated alpha, beta, and delta solutions are computed in terms of [134]

$$x_l^f(k + 1) = x_l^f(k) - a_l^f(k) d_l^{if}(k), \quad f = 1 \ldots q, \quad i = 1 \ldots N, \quad l \in \{\alpha, \beta, \delta\}, \tag{20.22}$$

and they lead to the updated expressions of the agents' positions obtained as the arithmetic mean of the updated alpha, beta, and delta agents [134]

$$x_i^f(k + 1) = \frac{x_\alpha^f(k + 1) + x_\beta^f(k + 1) + x_\delta^f(k + 1)}{3}, \quad f = 1 \ldots q, \quad i = 1 \ldots N, \tag{20.23}$$

with the following vector form of Eq. (20.23) [134]:

$$\mathbf{X}_i(k + 1) = \frac{\mathbf{X}_\alpha(k + 1) + \mathbf{X}_\beta(k + 1) + \mathbf{X}_\delta(k + 1)}{3}, \quad i = 1 \ldots N. \tag{20.24}$$
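The GWO iteration defined by (20.17)–(20.24) can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used by the authors: the function name `gwo`, the default values of `N` and `k_max`, the clipping of the positions to the search domain (one simple way to enforce (20.16)), and the bookkeeping of the best solution found so far are assumptions added here. The sketch needs at least three agents and `k_max` of at least 2.

```python
import random


def gwo(J, lower, upper, N=20, k_max=100, seed=0):
    """Minimize J over the box D_s = [lower, upper] with a basic GWO loop."""
    rng = random.Random(seed)
    q = len(lower)
    # Random initialization of the pack inside the search domain (20.15).
    X = [[rng.uniform(lower[f], upper[f]) for f in range(q)] for _ in range(N)]
    best = list(min(X, key=J))
    for k in range(1, k_max + 1):
        # Alpha, beta, and delta agents: the three best solutions (20.18).
        leaders = [list(x) for x in sorted(X, key=J)[:3]]
        if J(leaders[0]) < J(best):
            best = list(leaders[0])          # assumed best-so-far bookkeeping
        a = 2.0 * (1.0 - (k - 1) / (k_max - 1))                  # (20.20)
        for i in range(N):
            for f in range(q):
                acc = 0.0
                for x_l in leaders:
                    a_l = a * (2.0 * rng.random() - 1.0)         # (20.19)
                    c_l = 2.0 * rng.random()
                    d = abs(c_l * x_l[f] - X[i][f])              # (20.21)
                    acc += x_l[f] - a_l * d                      # (20.22)
                # Arithmetic mean (20.23), clipped to satisfy (20.16).
                X[i][f] = min(max(acc / 3.0, lower[f]), upper[f])
    final = min(X, key=J)
    return list(final) if J(final) < J(best) else best
```

A call such as `gwo(lambda x: sum(v * v for v in x), [-5.0] * 3, [5.0] * 3)` returns a three-element position vector inside the box; for the tuning problem treated later in this section, `J` would instead wrap an evaluation of the fuzzy control system behavior.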

20.3.2. Whale Optimization Algorithm

The agents in WOA are whales. The operating mechanism of WOA starts again with the random initialization of the population of agents that represent the set of whales [135]. A set of search-coefficient vectors $\mathbf{a}_i(k) \in \mathbb{R}^q$ and $\mathbf{c}_i(k) \in \mathbb{R}^q$ is introduced as follows at each iteration $k$ in order to model the next stage of WOA, namely the


prey encircling:

$$\mathbf{a}_i(k) = [a_i^1(k) \; \cdots \; a_i^f(k) \; \cdots \; a_i^q(k)]^T \in \mathbb{R}^q, \quad \mathbf{c}_i(k) = [c_i^1(k) \; \cdots \; c_i^f(k) \; \cdots \; c_i^q(k)]^T \in \mathbb{R}^q, \quad i = 1 \ldots N, \tag{20.25}$$

and the elements of the coefficient vectors are defined as

$$a_i^f(k) = a(k)(2r_1^f - 1), \quad c_i^f(k) = 2r_2^f, \tag{20.26}$$

with the coefficients $a(k)$ linearly decreased from 2 to 0 during the search process according to (20.20), where $r_1^f \in [0, 1]$ and $r_2^f \in [0, 1]$ are uniformly distributed random numbers, $f = 1 \ldots q$. Using the notation $\mathbf{X}_{best}(k)$ for the best agent's position vector in the population at iteration $k$, which fulfills

$$\mathbf{X}_{best}(k) = [x_{best}^1(k) \; \cdots \; x_{best}^f(k) \; \cdots \; x_{best}^q(k)]^T \in D_s \subset \mathbb{R}^q, \quad J(\mathbf{X}_{best}(k)) = \min_{i=1 \ldots N} J(\mathbf{X}_i(k)), \tag{20.27}$$

the position update equation in WOA is [58, 59]

$$\begin{gathered} \mathbf{X}_i(k + 1) = \begin{cases} \mathbf{X}_{best}(k) - \mathbf{f}_i(k), & \text{if C1}, \\ \mathbf{X}_{rand}(k) - \mathbf{g}_i(k), & \text{if C2}, \\ \mathbf{h}_i(k) e^{bl} \cos(2\pi l) + \mathbf{X}_{best}(k), & \text{otherwise}, \end{cases} \\ \text{C1}: p < 0.5 \text{ and } |a_i^f(k)| \le 1 \; \forall f = 1 \ldots q, \\ \text{C2}: p < 0.5 \text{ and } |a_i^f(k)| > 1 \; \forall f = 1 \ldots q, \quad i = 1 \ldots N, \end{gathered} \tag{20.28}$$

where $p \in [0, 1]$ is a random number, the parameter $b = \text{const}$ defines the shape of the logarithmic spiral of the whales' path, set to $b = 1$ for simplifying the shape, $l \in [-1, 1]$ is a uniformly distributed random number, $rand$ is an arbitrary agent index, the associated position vector is $\mathbf{X}_{rand}(k)$

$$\mathbf{X}_{rand}(k) = [x_{rand}^1(k) \; \cdots \; x_{rand}^f(k) \; \cdots \; x_{rand}^q(k)]^T \in D_s \subset \mathbb{R}^q, \tag{20.29}$$

and $\mathbf{f}_i(k)$, $\mathbf{g}_i(k)$ are distance vectors with the expressions

$$\begin{aligned} \mathbf{f}_i(k) &= [a_i^1(k)|x_i^1(k) - x_{best}^1(k)| \; \cdots \; a_i^f(k)|x_i^f(k) - x_{best}^f(k)| \; \cdots \; a_i^q(k)|x_i^q(k) - x_{best}^q(k)|]^T \in \mathbb{R}^q, \\ \mathbf{g}_i(k) &= [a_i^1(k)|x_i^1(k) - x_{rand}^1(k)| \; \cdots \; a_i^f(k)|x_i^f(k) - x_{rand}^f(k)| \; \cdots \; a_i^q(k)|x_i^q(k) - x_{rand}^q(k)|]^T \in \mathbb{R}^q, \quad i = 1 \ldots N. \end{aligned} \tag{20.30}$$
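The position update (20.28) with the distance vectors in (20.30) can be sketched in Python as shown below. Two points are assumptions rather than statements of the text: the vector $\mathbf{h}_i(k)$ is not defined above, so the sketch takes its elements as $|x_{best}^f(k) - x_i^f(k)|$, which is the choice made in the standard WOA formulation; and the positions are clipped to the search domain, with the best solution found so far stored separately. The function name `woa` and its default arguments are hypothetical.

```python
import math
import random


def woa(J, lower, upper, N=20, k_max=100, b=1.0, seed=0):
    """Minimize J over the box [lower, upper]; returns (best, history)."""
    rng = random.Random(seed)
    q = len(lower)
    X = [[rng.uniform(lower[f], upper[f]) for f in range(q)] for _ in range(N)]
    best = list(min(X, key=J))
    history = [J(best)]                      # best-so-far objective values
    for k in range(1, k_max + 1):
        a = 2.0 * (1.0 - (k - 1) / (k_max - 1))                      # (20.20)
        x_best = list(min(X, key=J))                                 # (20.27)
        new_X = []
        for x in X:
            A = [a * (2.0 * rng.random() - 1.0) for _ in range(q)]   # (20.26)
            p = rng.random()
            if p < 0.5 and all(abs(v) <= 1.0 for v in A):            # C1
                y = [x_best[f] - A[f] * abs(x[f] - x_best[f]) for f in range(q)]
            elif p < 0.5 and all(abs(v) > 1.0 for v in A):           # C2
                x_r = X[rng.randrange(N)]
                y = [x_r[f] - A[f] * abs(x[f] - x_r[f]) for f in range(q)]
            else:                                                    # spiral path
                l = rng.uniform(-1.0, 1.0)
                s = math.exp(b * l) * math.cos(2.0 * math.pi * l)
                # Assumed h_i(k): element-wise distance to the best agent.
                y = [abs(x_best[f] - x[f]) * s + x_best[f] for f in range(q)]
            new_X.append([min(max(y[f], lower[f]), upper[f]) for f in range(q)])
        X = new_X
        candidate = min(X, key=J)
        if J(candidate) < J(best):
            best = list(candidate)
        history.append(J(best))
    return best, history
```

Because the best-so-far solution is stored separately, the returned `history` is non-increasing by construction, which gives a simple sanity check when the sketch is adapted to other objective functions.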


20.3.3. Slime Mold Algorithm

The agents in SMA are elements of the slime mold. The operating mechanism of SMA starts with the random initialization of the population of agents that represent the slime mold [136]. Using the notation $S_i(k)$ for the fitness (i.e., the value of the objective function) of the $i$th agent with the position vector $\mathbf{X}_i(k)$, the population is ranked in the ascending order of the fitness function values, leading to two sets of position vectors: $Set_{FH}$, which is the ranked set of agents in the first half of the population, and $Set_{SH}$, which is the ranked set of agents in the second half of the population, in accordance with Precup et al. [62]:

$$\begin{gathered} \{\mathbf{X}_1(k), \mathbf{X}_2(k), \ldots, \mathbf{X}_H(k), \mathbf{X}_{H+1}(k), \ldots, \mathbf{X}_N(k)\} = Set_{FH} \cup Set_{SH}, \\ Set_{FH} = \{\mathbf{X}_1(k), \mathbf{X}_2(k), \ldots, \mathbf{X}_H(k)\}, \quad Set_{SH} = \{\mathbf{X}_{H+1}(k), \ldots, \mathbf{X}_N(k)\}, \\ S_1(k) \le S_2(k) \le \cdots \le S_H(k) \le S_{H+1}(k) \le \cdots \le S_N(k), \end{gathered} \tag{20.31}$$

where $H = \lfloor N/2 \rfloor$ is the integer part of $N/2$. The vector $\mathbf{W}_i(k)$ of weights of slime mold consists of the elements (weights) $w_i^f(k)$, $f = 1 \ldots q$ [62]:

$$\begin{gathered} \mathbf{W}_i(k) = [w_i^1(k) \; \cdots \; w_i^f(k) \; \cdots \; w_i^q(k)]^T, \quad i = 1 \ldots N, \\ w_i^f(k) = \begin{cases} 1 + r_i^f \log(g_i(k)), & \text{if } \mathbf{X}_i(k) \in Set_{FH}, \\ 1 - r_i^f \log(g_i(k)), & \text{otherwise}, \end{cases} \\ g_i(k) = \frac{S_b(k) - S_i(k)}{S_b(k) - S_w(k) + \varepsilon} + 1, \quad i = 1 \ldots N, \quad f = 1 \ldots q, \end{gathered} \tag{20.32}$$

where $\log$ is the notation for the decimal logarithm, $r_i^f \in [0, 1]$ are random numbers, $\varepsilon = \text{const} > 0$ is relatively small to avoid a zero denominator in Eq. (20.32), and $S_b(k)$ and $S_w(k)$ are the best and worst fitness, respectively, obtained at the current iteration $k$ in terms of Precup et al. [62]:

$$S_b(k) = \min_{i=1 \ldots N} S_i(k) = S_1(k), \quad S_w(k) = \max_{i=1 \ldots N} S_i(k) = S_N(k). \tag{20.33}$$

The uniformly distributed random numbers $v_a^f$ and $v_b^f$ are next introduced:

$$v_a^f \in [-a, a], \quad v_b^f \in [-b, b], \quad f = 1 \ldots q, \quad a = \operatorname{arctanh}\left(1 - \frac{k}{k_{\max}}\right), \quad b = 1 - \frac{k}{k_{\max}}. \tag{20.34}$$


Using the notation $\mathbf{X}_b(k)$ for the best agent, representing the solution obtained so far (in all iterations until $k$), with the elements $x_b^f(k)$, $f = 1 \ldots q$ [62],

$$\mathbf{X}_b(k) = [x_b^1(k) \; \cdots \; x_b^f(k) \; \cdots \; x_b^q(k)]^T \in D_s \subset \mathbb{R}^q, \tag{20.35}$$

and computing two agents, $\mathbf{X}_A(k)$ and $\mathbf{X}_B(k)$, which are randomly selected in the population ($A$ and $B$ are random integer indices, $A, B = 1 \ldots N$),

$$\mathbf{X}_A(k) = [x_A^1(k) \; \cdots \; x_A^f(k) \; \cdots \; x_A^q(k)]^T \in D_s \subset \mathbb{R}^q, \quad \mathbf{X}_B(k) = [x_B^1(k) \; \cdots \; x_B^f(k) \; \cdots \; x_B^q(k)]^T \in D_s \subset \mathbb{R}^q, \tag{20.36}$$

the position update equation is [136]

$$x_i^f(k + 1) = \begin{cases} r^f(u^f - l^f) + l^f, & \text{if } r^f < z, \\ x_b^f(k) + v_a^f[w_i^f(k) x_A^f(k) - x_B^f(k)], & \text{if } r^f \ge z \text{ and } r_i^f < p_i, \\ v_b^f x_i^f(k), & \text{otherwise}, \end{cases} \quad i = 1 \ldots N, \tag{20.37}$$

where $r^f \in [0, 1]$ are random numbers, $z \in [0, 0.1]$ is a constant parameter, the parameters $p_i$ are computed in accordance with [136]

$$p_i = \tanh|S_i(k) - S_{\min}(k)|, \quad i = 1 \ldots N, \tag{20.38}$$

and $S_{\min}(k)$ is the best fitness obtained so far (in all iterations until $k$):

$$S_{\min}(k) = \min_{i=1 \ldots N, \; k_b = 0 \ldots k} S_i(k_b). \tag{20.39}$$

20.3.4. Hybrid Particle Swarm Optimization–Gravitational Search Algorithm

The hybridization of nature-inspired optimization algorithms is a solution to overcome certain shortcomings observed during the use of classical algorithms. The hybridization of PSO and GSA aims to obtain an improved search by combining the ability of social thinking specific to PSO and the local search capability of GSA. The agents in PSOGSA are swarm particles with masses considered in the gravitational field. As shown by Precup and David [40], the operating mechanism of PSOGSA starts with the random initialization of the population of agents. The agent velocity vector $\mathbf{V}_i(k)$, where

$$\mathbf{V}_i(k) = [v_i^1(k) \; \cdots \; v_i^f(k) \; \cdots \; v_i^q(k)]^T \in \mathbb{R}^q, \quad i = 1 \ldots N, \tag{20.40}$$


is used to express and also explain the operating principle of PSOGSA. Let $\mathbf{P}_{g,Best}(k)$ be the best swarm position vector

$$\mathbf{P}_{g,Best}(k) = [p_g^1(k) \; \cdots \; p_g^f(k) \; \cdots \; p_g^q(k)]^T \in \mathbb{R}^q, \tag{20.41}$$

which is updated in terms of [40]

$$\mathbf{P}_{g,Best}(k) = \mathbf{X}_i(k) \quad \text{if } J(\mathbf{X}_i(k)) < J(\mathbf{P}_{g,Best}(k)). \tag{20.42}$$

The velocity and position update equations are [40]

$$\begin{gathered} v_i^f(k + 1) = \begin{cases} w(k) v_i^f(k) + c_1 r_1 [p_g^f(k) - x_i^f(k)] + c_2 r_2 a_i^f(k), & \text{if } m_i(k) > 0, \\ w(k) v_i^f(k) + c_1 r_1 [p_g^f(k) - x_i^f(k)], & \text{otherwise}, \end{cases} \\ x_i^f(k + 1) = x_i^f(k) + v_i^f(k + 1), \quad f = 1 \ldots q, \quad i = 1 \ldots N, \end{gathered} \tag{20.43}$$

where $r_1, r_2 \in [0, 1]$ are uniformly distributed random variables, $c_1, c_2 \ge 0$ are weighting factors, the parameter $w(k)$ is the inertia weight, and $a_i^f(k)$ is the acceleration specific to GSA [137]. The parameter $w(k)$ specific to PSO is set in accordance with the recommendations given in the seminal papers on PSO [138, 139]. The following dependence expression of $w(k)$ is used in [40]:

$$w(k) = w_{\max} - k\,\frac{w_{\max} - w_{\min}}{k_{\max}}, \tag{20.44}$$

where the upper bound $w_{\max}$ and the lower bound $w_{\min}$ are imposed on $w(k)$ to limit the particles' movement in the search domain during the search process.

20.3.5. Hybrid Gray Wolf Optimizer–Particle Swarm Optimization

The hybridization of GWO and PSO targets the improvement of the exploitation stage of GWO by adding the ability of social thinking specific to PSO. The agents in GWOPSO are viewed as both gray wolves and swarm particles. As shown by David et al. [140], the operating mechanism of GWOPSO starts with the random initialization of the population of agents. The approximate distances between the current solution and the alpha, beta, and delta solutions are then modified with respect to (20.21) [140], and the following relation is


used:

$$d_l^{if}(k) = |c_l^f(k) x_l^f(k) - \omega_{PSO}\, x_i^f(k)|, \quad i = 1 \ldots N, \quad l \in \{\alpha, \beta, \delta\}, \quad f = 1 \ldots q, \tag{20.45}$$

where the inertia weight $\omega_{PSO}$ specific to PSO is expressed as

$$\omega_{PSO} = 0.5(1 + r_\omega), \tag{20.46}$$

where $r_\omega \in [0, 1]$ is a uniformly distributed random variable. The elements of the updated alpha, beta, and delta solutions are computed in terms of (20.22). However, the updated expressions of the agents' positions are no longer obtained as the arithmetic mean of the updated alpha, beta, and delta agents given in (20.23), and the following modified version of (20.23), derived from [140], is used:

$$x_i^f(k + 1) = x_i^f(k) + \omega_{PSO}\{c_\alpha^f(k) r_\alpha [x_\alpha^f(k + 1) - x_i^f(k)] + c_\beta^f(k) r_\beta [x_\beta^f(k + 1) - x_i^f(k)] + c_\delta^f(k) r_\delta [x_\delta^f(k + 1) - x_i^f(k)]\}, \quad f = 1 \ldots q, \quad i = 1 \ldots N, \tag{20.47}$$

where $r_\alpha, r_\beta, r_\delta \in [0, 1]$ are uniformly distributed random variables.

20.3.6. Nature-Inspired Optimization Algorithms with Information Feedback Model F1

Using the notation given in (20.14) for the position vector $\mathbf{X}_i(k)$ of the $i$th agent at iteration $k$ of a certain nature-inspired optimization algorithm without information feedback model (namely, the algorithms presented in the previous sub-sections), the notation $S_i(k)$ for its fitness, which is also equal to the objective function value $J$, and introducing the notation $\mathbf{Z}_i(k)$ for the position vector of the $i$th agent at iteration $k$ of the nature-inspired optimization algorithm with information feedback model F1 [62]

$$\mathbf{Z}_i(k) = [z_i^1(k) \; \cdots \; z_i^f(k) \; \cdots \; z_i^q(k)]^T \in D_s, \quad i = 1 \ldots N, \tag{20.48}$$

where $z_i^f(k)$ is the position of the $i$th agent in the $f$th dimension, $f = 1 \ldots q$, the information feedback model F1 is characterized by the recurrent equation [63–65]

$$\mathbf{Z}_i(k + 1) = \alpha_{F1} \mathbf{X}_i(k) + \beta_{F1} \mathbf{Z}_i(k). \tag{20.49}$$


The expressions of the weighting factors $\alpha_{F1}$ and $\beta_{F1}$, which satisfy

$$\alpha_{F1} + \beta_{F1} = 1, \tag{20.50}$$

are [63–65]

$$\alpha_{F1} = \frac{S_i(k)}{S_i(k) + S_i(k + 1)}, \quad \beta_{F1} = \frac{S_i(k + 1)}{S_i(k) + S_i(k + 1)}. \tag{20.51}$$

Equations (20.49)–(20.51) are integrated in the GWO, WOA, SMA, PSOGSA, and GWOPSO algorithms, and they lead to the GWOF1, WOAF1, SMAF1, PSOGSAF1, and GWOPSOF1 algorithms, respectively. Three out of these five algorithms are described in [62] and applied to the optimal tuning of fuzzy controllers.
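The blending step in (20.49)–(20.51) is simple enough to be written as a single helper. The sketch below is illustrative only: the function name `feedback_f1` and its argument names are hypothetical, and the guard for a zero sum of fitness values is an added assumption (the fitness is non-negative here because $J$ maps to $\mathbb{R}_+$).

```python
def feedback_f1(x, z, s_k, s_k1):
    """Blend position x of the base algorithm with the fed-back position z.

    s_k and s_k1 stand for the fitness values S_i(k) and S_i(k + 1) in (20.51).
    """
    total = s_k + s_k1
    if total == 0.0:
        # Assumed guard: both fitness values are zero, so weight both equally.
        alpha = beta = 0.5
    else:
        alpha = s_k / total          # alpha_F1 in (20.51)
        beta = s_k1 / total          # beta_F1 in (20.51), alpha + beta = 1
    return [alpha * x_f + beta * z_f for x_f, z_f in zip(x, z)]   # (20.49)
```

For example, with `s_k = 1.0` and `s_k1 = 3.0` the weights are 0.25 and 0.75, so `feedback_f1([1.0, 2.0], [3.0, 4.0], 1.0, 3.0)` gives `[2.5, 3.5]`.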

20.3.7. Problem Setting and Optimal Tuning Methodology

The process is characterized by the following nonlinear continuous-time time-invariant single input-single output (SISO) state–space model, which defines a general class of nonlinear servo systems [1]:

$$\begin{gathered} m(t) = \begin{cases} -1, & \text{if } u(t) \le -u_b, \\ \dfrac{u(t) + u_c}{u_b - u_c}, & \text{if } -u_b < u(t) < -u_c, \\ 0, & \text{if } -u_c \le u(t) \le u_a, \\ \dfrac{u(t) - u_a}{u_b - u_a}, & \text{if } u_a < u(t) < u_b, \\ 1, & \text{if } u(t) \ge u_b, \end{cases} \\ \dot{\mathbf{x}}_P(t) = \begin{bmatrix} 0 & 1 \\ 0 & -\dfrac{1}{T} \end{bmatrix} \mathbf{x}_P(t) + \begin{bmatrix} 0 \\ \dfrac{k_P}{T} \end{bmatrix} m(t) + \begin{bmatrix} 1 \\ 0 \end{bmatrix} d(t), \\ y(t) = [1 \;\; 0]\, \mathbf{x}_P(t), \end{gathered} \tag{20.52}$$

where $k_P$ is the process gain, $T$ is the small time constant, the control signal $u$ is a pulse width modulated duty cycle, and $m$ is the output of the saturation and dead zone static nonlinearity specific to the actuator. The dead zone and saturation nonlinearity is given in the first equation in (20.52), and it is characterized by the parameters $u_a$, $u_b$, and $u_c$, with $0 < u_a < u_b$ and $0 < u_c < u_b$. The state–space


model in (20.52) includes the actuator and measuring element dynamics. The state vector $\mathbf{x}_P(t)$ is expressed as follows in (angular) position applications [1]:

$$\mathbf{x}_P(t) = [x_{P,1}(t) \;\; x_{P,2}(t)]^T = [\alpha(t) \;\; \omega(t)]^T, \tag{20.53}$$

where $\alpha(t)$ is the angular position, $\omega(t)$ is the angular speed, and the controlled output is $y(t) = x_{P,1}(t) = \alpha(t)$. The nonlinearity in (20.52) is neglected in the following simplified model of the process expressed as the transfer function $P(s)$, defined in zero initial conditions, with $u$ as input and $y$ as output:

$$P(s) = \frac{k_{EP}}{s(1 + Ts)}, \tag{20.54}$$

where the equivalent process gain $k_{EP}$ is

$$k_{EP} = \begin{cases} \dfrac{k_P}{u_b - u_c}, & \text{if } -u_b < u(t) < -u_c, \\ \dfrac{k_P}{u_b - u_a}, & \text{if } u_a < u(t) < u_b. \end{cases} \tag{20.55}$$
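The process model in (20.52) can be simulated with a few lines of Python, which is convenient when the objective function has to be evaluated on a model instead of the real equipment. The sketch below is only one possible approach: it assumes forward Euler integration with step `T_s`, zero initial conditions, and a zero disturbance $d(t)$; the function names are hypothetical.

```python
def actuator(u, u_a, u_b, u_c):
    """Dead zone and saturation nonlinearity m(u), first equation in (20.52)."""
    if u <= -u_b:
        return -1.0
    if u < -u_c:
        return (u + u_c) / (u_b - u_c)
    if u <= u_a:
        return 0.0
    if u < u_b:
        return (u - u_a) / (u_b - u_a)
    return 1.0


def simulate_servo(u_seq, k_P, T, T_s, u_a, u_b, u_c):
    """Angular position samples for a control sequence (assumed d(t) = 0)."""
    alpha, omega, y = 0.0, 0.0, []
    for u in u_seq:
        m = actuator(u, u_a, u_b, u_c)
        y.append(alpha)                          # y(t) = alpha(t)
        # Forward Euler step of the state equations in (20.52).
        alpha_next = alpha + T_s * omega
        omega_next = omega + T_s * (-omega + k_P * m) / T
        alpha, omega = alpha_next, omega_next
    return y
```

With the parameter values reported for the laboratory equipment in Section 20.3.8 ($u_a = u_c = 0.15$, $u_b = 1$, $k_P = 140$, $T = 0.92$ s, $T_s = 0.01$ s), a constant input inside the dead zone, e.g., $u = 0.1$, leaves the position at zero, whereas $u = 1$ drives the angular speed toward $k_P$.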

The optimization problem that ensures the optimal tuning of FCs is

$$\rho^* = \arg\min_{\rho \in D_\rho} J(\rho), \quad J(\rho) = \sum_{k=0}^{\infty} (k|e(k)|), \tag{20.56}$$

where $J$ is the objective function, $\rho$ is the vector variable of the objective function and also the parameter vector of the FC, $\rho^*$ is the optimal value of the vector $\rho$, and $D_\rho$ is the feasible domain of $\rho$. Precup and David [1] have recommended that the stability of the fuzzy control system should be taken into consideration when setting the domain $D_\rho$. PI controllers with the transfer function given in (20.4) can deal with the processes modeled in (20.54). The extended symmetrical optimum (ESO) method [141, 142] is successfully employed to tune these controllers such that they guarantee a desired tradeoff among the performance specifications (i.e., maximum values of control system performance indices) imposed on the control system using a single tuning parameter, $\beta$, with the recommended values $1 < \beta < 20$. The diagrams presented in Figure 20.4 are useful to set the value of the tuning parameter $\beta$ and, therefore, to achieve the tradeoff among the linear control system performance indices expressed as percent overshoot $\sigma_1$ [%], normalized settling time $\hat{t}_s = t_s/T$, normalized rise time $\hat{t}_r = t_r/T$, and phase margin $\phi_m$.


Figure 20.4: Linear control system performance indices versus the parameter β specific to the ESO method for the step-type modification of the reference input [1].

The PI tuning conditions specific to the ESO method are [141, 142]

$$k_c = \frac{1}{\beta\sqrt{\beta}\, k_{EP} T^2}, \quad T_i = \beta T, \quad k_C = \frac{1}{\sqrt{\beta}\, k_{EP} T}. \tag{20.57}$$

A simple version of set-point filter that ensures an improvement in the performance of the linear control system by canceling a zero in the closed-loop transfer function with respect to the reference input is [141, 142]

$$F(s) = \frac{1}{1 + \beta T s}. \tag{20.58}$$
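The tuning conditions in (20.57) translate directly into code. The following sketch uses a hypothetical function name and, in the usage note, a value of $\beta$ picked arbitrarily from the recommended range; it also makes it easy to check that the two gains are related by $k_C = k_c T_i$, as required by (20.4).

```python
import math


def eso_pi_tuning(beta, k_EP, T):
    """PI parameters (k_c, T_i, k_C) given by the ESO conditions (20.57)."""
    k_c = 1.0 / (beta * math.sqrt(beta) * k_EP * T ** 2)
    T_i = beta * T
    k_C = 1.0 / (math.sqrt(beta) * k_EP * T)
    return k_c, T_i, k_C
```

For instance, `eso_pi_tuning(4.0, 140.0, 0.92)` uses the process parameters identified in Section 20.3.8 and returns $T_i = 3.68$ s; the time constant of the set-point filter in (20.58) is then $\beta T = T_i$.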

The PI-fuzzy controllers are tuned starting with the linear PI controllers so as to ensure the further improvement of the control system performance indices for the nonlinear process modeled in (20.52). As shown by Precup and David [1], the presentation and the results will be focused on the Takagi–Sugeno PI-FC-OI, but the presentation for the Mamdani PI-FC-OI and for both versions of PI-FC-II is straightforward. The controller parameter vector, which will be determined by the optimal tuning methodology so as to solve the optimization problem defined in (20.56), is [1]

$$\rho = [\rho_1 \;\; \rho_2 \;\; \rho_3]^T = [\beta \;\; B_e \;\; \eta]^T, \tag{20.59}$$

and the application of the optimal tuning methodology described as follows will give the optimal parameter vector $\rho^*$ [1]

$$\rho^* = [\rho_1^* \;\; \rho_2^* \;\; \rho_3^*]^T = [\beta^* \;\; B_e^* \;\; \eta^*]^T. \tag{20.60}$$


The optimization algorithms are mapped onto the optimization problem defined in (20.56) in terms of

$$\begin{gathered} q = 3, \quad D_s = D_\rho, \\ \mathbf{X}_i(k) = \rho, \quad i = 1 \ldots N, \quad \mathbf{Z}_i(k) = \rho, \quad i = 1 \ldots N, \quad S_i(k) = J, \quad i = 1 \ldots N, \\ \mathbf{X}_b(k_{\max}) = \rho^*, \quad \mathbf{X}_i(k_{\max}) = \rho^*, \quad \mathbf{X}_b(k_{\max}) = \rho^*, \quad \mathbf{Z}_b(k_{\max}) = \rho^*, \end{gathered} \tag{20.61}$$

where $\mathbf{Z}_b$ in SMAF1 corresponds to $\mathbf{X}_b$ in SMA. The optimal tuning methodology consists of the following steps:

Step 1. The sampling period $T_s$ is set in accordance with the requirements of quasi-continuous digital control.

Step 2. The feasible domain $D_\rho$ is set to include all constraints imposed on the elements of $\rho$.

Step 3. One of the nature-inspired optimization algorithms solves the optimization problem defined in (20.56) using (20.61), resulting in the optimal parameter vector $\rho^*$ in (20.60) and three of the optimal parameters of the Takagi–Sugeno PI-fuzzy controller, namely $\beta^*$, $B_e^*$, and $\eta^*$.

Step 4. The optimal parameter $B_{\Delta e}$ of the Takagi–Sugeno PI-fuzzy controller, with the notation $B_{\Delta e}^*$, results from (20.7), with the optimal parameters replaced in (20.6). The set-point filter with the transfer function defined in (20.58) is used in the fuzzy control system but with $\beta^*$ instead of $\beta$.

20.3.8. Application of Optimal Tuning Methodology

The optimal tuning methodology is applied in this sub-section to a Takagi–Sugeno PI-FC-OI dedicated to the position control of a laboratory servo system in relation to the optimization problem defined in (20.56). The experimental setup is illustrated in Figure 20.5. An optical encoder is used for the measurement of the angle and a tacho-generator for the measurement of the angular speed. The speed can also be estimated from the angle measurements. The PWM signals proportional to the control signal are produced by the actuator in the power interface. The main features of the experimental setup are [1] rated amplitude of 24 V, rated current of 3.1 A, rated torque of 15 N cm, rated speed of 3000 rpm, and weight of inertial load of 2.03 kg.

Figure 20.5: Servo system experimental setup in the Intelligent Control Systems Laboratory of the Politehnica University of Timisoara [1].

The values of the parameters of the process model given in (20.52) and (20.54), obtained by a least squares algorithm, are ua = 0.15, ub = 1, uc = 0.15, kP = kEP = 140, and T = 0.92 s. The sampling period was set to Ts = 0.01 s in step 1 of the optimal tuning methodology. The vector variable ρ belongs to the feasible domain (or the search domain) Dρ set in step 2 as

$$D_\rho = \{\beta \mid 3 \le \beta \le 17\} \times \{B_e \mid 20 \le B_e \le 38\} \times \{\eta \mid 0.25 \le \eta \le 0.75\}. \tag{20.62}$$
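The mapping in (20.61) and the box domain in (20.62) can be sketched in Python as a generic population-based search loop. This is only a skeleton under stated assumptions: the update rule is a plain random perturbation around the best agent, used as a placeholder and not as any of GWO, WOA, SMA, PSOGSA, or GWOPSO, and toy_objective is a hypothetical stand-in for J in (20.56), which in practice is evaluated by simulation or experiment on the fuzzy control system. All names are introduced here for illustration.

```python
import random

# Feasible domain D_rho from (20.62): bounds for (beta, B_e, eta).
BOUNDS = [(3.0, 17.0), (20.0, 38.0), (0.25, 0.75)]


def clip_to_domain(rho, bounds=BOUNDS):
    """Project a candidate parameter vector onto the box D_rho."""
    return [min(max(x, lo), hi) for x, (lo, hi) in zip(rho, bounds)]


def tune(objective, n_agents, k_max, bounds=BOUNDS, seed=0):
    """Search skeleton: each agent position is a vector rho, its fitness is J(rho).

    Placeholder update rule (NOT a nature-inspired algorithm from the chapter):
    agents are resampled around the best solution with a shrinking step.
    """
    rng = random.Random(seed)
    agents = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_agents)]
    best = min(agents, key=objective)
    best_j = objective(best)
    for k in range(k_max):
        shrink = 1.0 - k / k_max  # step size decays with the iteration index
        for i in range(n_agents):
            step = [shrink * 0.1 * (hi - lo) * rng.gauss(0.0, 1.0)
                    for lo, hi in bounds]
            candidate = clip_to_domain([b + s for b, s in zip(best, step)], bounds)
            agents[i] = candidate
            j = objective(candidate)
            if j < best_j:  # keep the best solution found so far
                best, best_j = candidate, j
    return best, best_j


def toy_objective(rho):
    """Hypothetical quadratic stand-in for J in (20.56); illustration only."""
    beta, b_e, eta = rho
    return (beta - 5.0) ** 2 + (b_e - 30.0) ** 2 + (eta - 0.5) ** 2


if __name__ == "__main__":
    rho_star, j_min = tune(toy_objective, n_agents=20, k_max=20)
    print(rho_star, j_min)
```

Clipping every candidate to the box keeps the search inside Dρ at all iterations, which is one simple way to enforce the constraints set in step 2; other constraint-handling schemes are equally possible.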

The dynamic regimes considered in step 3 are characterized by the r = 40 rad step-type modification of the set-point and zero initial conditions. The upper limit of the sum in (20.56) was set to 2000 instead of ∞. With Ts = 0.01 s, this gives a time horizon of 20 s for the evaluation of the objective function by simulation or experiments conducted on the fuzzy control system. The fixed parameters of the nature-inspired optimization algorithms applied in step 3 of the optimal tuning methodology are z = 0.03 and ε = 0.001 for SMA. The parameters of PSOGSA are as follows: an exponential decrease law of the gravitational constant was used, with the initial value g0 = 100 and the exponent ζ = 8.5; the parameter ε = 10⁻⁴ avoids possible divisions by zero in the GSA part; the weighting parameters are c1 = c2 = 0.3 in the PSO part; and the inertia weight parameters wmax = 0.9 and wmin = 0.5 ensure a good balance between exploration and exploitation. The other algorithms do not work with fixed parameters. The same number of agents, N = 20, and the same maximum number of iterations, kmax = 20, were used for all algorithms in the comparison.
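The numerical settings above can be illustrated with a short Python sketch. The time horizon is simply the number of samples multiplied by Ts. The two schedules are assumptions, since their exact expressions are not restated in this sub-section: the gravitational constant is taken in the commonly used GSA form g(k) = g0 exp(−ζ k/kmax), and the inertia weight is assumed to decrease linearly from wmax to wmin. The function names are hypothetical.

```python
import math


def time_horizon(n_samples, ts):
    """Length of the evaluation horizon in seconds."""
    return n_samples * ts


def gravitational_constant(k, k_max, g0=100.0, zeta=8.5):
    """Assumed exponential decrease law: g(k) = g0 * exp(-zeta * k / k_max)."""
    return g0 * math.exp(-zeta * k / k_max)


def inertia_weight(k, k_max, w_max=0.9, w_min=0.5):
    """Assumed linear decrease from w_max at k = 0 to w_min at k = k_max."""
    return w_max - (w_max - w_min) * k / k_max


if __name__ == "__main__":
    print(time_horizon(2000, 0.01))  # horizon for 2000 samples at Ts = 0.01 s
    for k in (0, 10, 20):
        print(k, gravitational_constant(k, 20), inertia_weight(k, 20))
```

Under these assumed schedules, both quantities start at their largest values, which favors exploration in the early iterations, and decay toward the end of the run, which favors exploitation.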

Table 20.1: Controller parameter and minimum objective function values after the application of the nature-inspired optimization algorithms.

| Algorithm | Be∗ | BΔe∗ | η∗ | β∗ | kc∗ | Ti∗ | Jmin = J(ρ∗) |
|---|---|---|---|---|---|---|---|
| GWO | 21.1473 | 0.0447 | 0.6395 | 5.141 | 0.0034 | 4.7297 | 201,200.5 |
| WOA | 31.2222 | 0.0659 | 0.6673 | 5.1638 | 0.0034 | 4.7507 | 173,844.3 |
| SMA | 27.5679 | 0.0601 | 0.6315 | 5.1613 | 0.0034 | 4.7484 | 168,243.7 |
| PSOGSA | 34.9617 | 0.077 | 0.7367 | 4.03 | 0.0035 | 4.5507 | 192,835.4 |
| GWOPSO | 32.6942 | 0.0685 | 0.7219 | 5.1943 | 0.0034 | 4.7787 | 220,176.1 |
| GWOF1 | 35.748 | 0.0753 | 0.6833 | 5.1647 | 0.0034 | 4.7515 | 163,129.8 |
| WOAF1 | 35.6415 | 0.084 | 0.6192 | 4.773 | 0.0036 | 4.3912 | 168,119.9 |
| SMAF1 | 37.9981 | 0.0809 | 0.7493 | 5.1112 | 0.0034 | 4.7023 | 155,117.2 |
| PSOGSAF1 | 33.393 | 0.0701 | 0.6804 | 5.1903 | 0.0034 | 4.7751 | 170,003.3 |
| GWOPSOF1 | 29.8815 | 0.0619 | 0.5706 | 5.3303 | 0.034 | 4.9039 | 214,907.6 |
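The Jmin column of Table 20.1 lends itself to a simple check. The Python sketch below, with hypothetical helper names, ranks the algorithms by their minimum objective function value and computes the relative reduction obtained by adding the information feedback model F1 to each base algorithm. The numbers are copied from the table; nothing else is assumed about the algorithms.

```python
# J_min values copied from Table 20.1.
J_MIN = {
    "GWO": 201200.5, "WOA": 173844.3, "SMA": 168243.7,
    "PSOGSA": 192835.4, "GWOPSO": 220176.1,
    "GWOF1": 163129.8, "WOAF1": 168119.9, "SMAF1": 155117.2,
    "PSOGSAF1": 170003.3, "GWOPSOF1": 214907.6,
}


def ranking(j_min):
    """Algorithm names sorted from the smallest to the largest J_min."""
    return sorted(j_min, key=j_min.get)


def f1_reduction_percent(j_min):
    """Relative reduction of J_min (in percent) when F1 is added to each base algorithm."""
    reductions = {}
    for name, value in j_min.items():
        if name.endswith("F1"):
            continue  # only base algorithms are used as the reference
        reductions[name] = 100.0 * (value - j_min[name + "F1"]) / value
    return reductions


if __name__ == "__main__":
    print(ranking(J_MIN))
    for name, pct in f1_reduction_percent(J_MIN).items():
        print(f"{name}: {pct:.2f} % reduction with F1")
```

The ranking starts with SMAF1 and GWOF1, and the reduction is positive for all five base algorithms.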

Moreover, to ensure a fair comparison of the algorithms and to reduce the effects of their random parameters, all results are given as average values taken over the best five runs of each algorithm. The optimal controller parameters and the minimum objective function values are summarized in Table 20.1. The optimal parameter values of the linear part (i.e., the initial PI controller) are included as well. Several conclusions can be drawn from the results in Table 20.1. First, the values of the tuning parameters obtained after running the algorithms are close. Second, the best algorithm in terms of the smallest value of the minimum objective function is SMAF1, followed by GWOF1. Third, as shown by Precup et al. [62], the introduction of the information feedback model F1 is generally beneficial, as it reduces the minimum value of the objective function for all algorithms. Fourth, the hybridizations lead to performance improvement for PSOGSA but not for GWOPSO in the conditions of the application and objective function considered in this chapter. Two typical fuzzy control system responses are presented in Figure 20.6 in terms of controlled output and control signal versus time, considering the “average” controller after the first iteration of GWO (with the parameters Be = 33.2784, η = 0.5268, and BΔe = 0.0676) and the “average” controller after 20 iterations of GWO (with the parameters given in the first row of Table 20.1). Figure 20.6 also highlights the improvement of the overshoot and settling time after the application of GWO. Both plots in Figure 20.6 also highlight the effects of the process nonlinearity in the initial and final parts of the transients. As specified by Precup and David [1], the comparison of nature-inspired optimization algorithms can be carried out by means of several performance indices.

Figure 20.6: Real-time experimental results expressed as fuzzy control system responses y and u after one iteration of GWO (dotted line) and after 20 iterations of GWO (continuous line).

Three such indices are the convergence speed, defined as the number of evaluations of the objective function until its minimum value is found, the number of evaluations per iteration, and the standard deviation with respect to the average of the solutions. However, in all cases the results of the comparisons should be presented, as done in this section, in terms of average values over several runs of the algorithms to alleviate the impact of the random parameters specific to all nature-inspired optimization algorithms. Nonparametric statistical tests could also be applied [143–145]. Complexity analyses of the algorithms could be conducted as well.
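Two of the indices mentioned above can be computed from logged run data, as in the Python sketch below. The helper names are hypothetical, the use of the population standard deviation is an assumption (the sample standard deviation would be an equally reasonable choice), and the numbers in the usage lines are made-up placeholders, not experimental results.

```python
import statistics


def best_runs_summary(final_costs, n_best=5):
    """Mean and population standard deviation of the n_best lowest final costs."""
    best = sorted(final_costs)[:n_best]
    return statistics.mean(best), statistics.pstdev(best)


def evaluations_to_minimum(cost_history):
    """1-based index of the evaluation at which the minimum cost is first reached."""
    return cost_history.index(min(cost_history)) + 1


if __name__ == "__main__":
    # Placeholder final costs of seven independent runs (illustrative only).
    runs = [5.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0]
    print(best_runs_summary(runs))
    # Placeholder sequence of objective values within one run (illustrative only).
    print(evaluations_to_minimum([9.0, 7.0, 7.0, 3.0, 5.0, 3.0]))
```

Averaging over the best few runs, as done for Table 20.1, reduces the influence of unlucky random initializations, while the standard deviation indicates how repeatable the outcome is.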

20.4. Conclusions and Outlook

This chapter is a revised version of the one by Precup and David [1], and it has presented aspects concerning the nature-inspired optimal tuning of fuzzy controllers. The presentation has focused on Takagi–Sugeno and Mamdani PI-fuzzy controllers, and it shows that the presented design and tuning methodologies can be applied relatively easily to PD-fuzzy controllers and PID-fuzzy controllers. Once the optimization problems are defined, the nature-inspired optimization algorithms can be applied to compute the optimal values of the parameters of fuzzy controllers. The chapter exemplifies several such algorithms, namely GWO, WOA, SMA, hybrid PSOGSA, hybrid GWOPSO, and the algorithms referred to as GWOF1, WOAF1, SMAF1, PSOGSAF1, and GWOPSOF1, obtained by introducing the information feedback model F1 into the former ones. The results can be extended with no major difficulties to other algorithms such as CSS [1].

Other nature-inspired optimization algorithms are possible candidates for the optimal tuning of fuzzy controllers; representative examples are population extremal optimization algorithms applied to continuous optimization and nonlinear controller optimal tuning problems [144–146], modified PSO applied to smart grids [147, 148], water cycle algorithms applied to the traveling salesman problem [149], the bat algorithm applied to medical goods distribution [150], the differential evolution algorithm applied to fuzzy job-shop scheduling [151], NSGA-III with adaptive mutation applied to Big Data optimization [152], processing of crossover operations in NSGA-III for large-scale optimization [153], conflict monitoring optimization [154], and parallelized multiple swarm artificial bee colony algorithms for constrained optimization problems [155]. Other representative results are related to the optimal tuning of fuzzy control [156] and the inclusion of fuzzy logic in nature-inspired optimization algorithms [157, 158], including their adaptation [159].

Summing up, we again advise, in line with Precup and David [1], reducing the number of parameters tuned by the nature-inspired optimization algorithms in order to keep the dimension of the search space reasonable and thus allow an efficient search for the optimal solution. The major shortcoming of nature-inspired optimization algorithms in the optimal tuning of controllers is the large number of evaluations of the objective function. This can be addressed by including gradient information that can be taken from data-driven approaches.

20.5. Acknowledgments

This work was supported by a grant from the Romanian Ministry of Education and Research, CNCS–UEFISCDI, project number PN-III-P4-ID-PCE-2020-0269, within PNCDI III.

References

[1] Precup, R.-E. and David, R.-C. (2016). Nature-inspired optimization of fuzzy controllers and fuzzy models. In Angelov, P. P. (ed.), Handbook on Computational Intelligence, Vol. 2: Evolutionary Computation, Hybrid Systems, and Applications. Singapore: World Scientific, pp. 697–729.
[2] Precup, R.-E. and Hellendoorn, H. (2011). A survey on industrial applications of fuzzy control. Comput. Ind., 62, pp. 213–226.
[3] Hjalmarsson, H. (2002). Iterative feedback tuning: an overview. Int. J. Adapt. Control Signal Proc., 16(5), pp. 373–395.

[4] Jung, H., Jeon, K., Kang, J.-G. and Oh, S. (2020). Iterative feedback tuning of cascade control of two-inertia system. IEEE Control Syst. Lett., 5(3), pp. 785–790.
[5] Hou, Z.-S. and Wang, Z. (2013). From model-based control to data-driven control: survey, classification and perspective. Inf. Sci., 235, pp. 3–35.
[6] Yu, W., Wang, R., Bu, X.-H. and Hou, Z.-S. (2020b). Model free adaptive control for a class of nonlinear systems with fading measurements. J. Franklin Inst., 357(12), pp. 7743–7760.
[7] Spall, J. C. and Cristion, J. A. (1998). Model-free control of nonlinear stochastic systems with discrete-time measurements. IEEE Trans. Autom. Control, 43(9), pp. 1198–1210.
[8] Zamanipour, M. (2020). A novelty in Blahut-Arimoto type algorithms: optimal control over noisy communication channels. IEEE Trans. Veh. Technol., 69(6), pp. 6348–6358.
[9] Karimi, A., Miskovic, L. and Bonvin, D. (2004). Iterative correlation-based controller tuning. Int. J. Adapt. Control Signal Proc., 18(8), pp. 645–664.
[10] Sato, T., Kusakabe, T., Himi, K., Araki, N. and Konishi, Y. (2020). Ripple-free data-driven dual-rate controller using lifting technique: application to a physical rotation system. IEEE Trans. Control Syst. Technol., doi: 10.1109/TCST.2020.2988613.
[11] Kammer, L. C., Bitmead, R. R. and Bartlett, P. L. (2000). Direct iterative tuning via spectral analysis. Automatica, 36(9), pp. 1301–1307.
[12] da Silva Moreira, J., Acioli Júnior, G. and Rezende Barros, G. (2018). Time and frequency domain data-driven PID iterative tuning. IFAC-PapersOnLine, 51(15), pp. 1056–1061.
[13] Halmevaara, K. and Hyötyniemi, H. (2006). Data-based parameter optimization of dynamic simulation models. In Proc. 47th Conf. Simul. Model. (SIMS 2006). Helsinki, Finland, pp. 68–73.
[14] McDaid, A. J., Aw, K. C., Haemmerle, E. and Xie, S. Q. (2012). Control of IPMC actuators for microfluidics with adaptive “online” iterative feedback tuning. IEEE/ASME Trans. Mechatron., 17(4), pp. 789–797.
[15] Fliess, M. and Join, C. (2013). Model-free control. Int. J. Control, 86(12), pp. 2228–2252.
[16] Fliess, M. and Join, C. (2020). Machine learning and control engineering: the model-free case. In Proc. Fut. Technol. Conf. 2020 (FTC 2020). Vancouver, BC, Canada, pp. 1–20.
[17] Campi, M. C., Lecchini, A. and Savaresi, S. M. (2002). Virtual reference feedback tuning: a direct method for the design of feedback controllers. Automatica, 38(8), pp. 1337–1346.
[18] Formentin, S., Campi, M. C., Caré, A. and Savaresi, S. M. (2019). Deterministic continuous-time Virtual Reference Feedback Tuning (VRFT) with application to PID design. Syst. Control Lett., 127, pp. 25–34.
[19] Gao, Z. (2006). Active disturbance rejection control: a paradigm shift in feedback control system design. In Proc. 2006 Amer. Control Conf. (ACC 2006). Minneapolis, MN, USA, pp. 2399–2405.
[20] Roman, R.-C., Precup, R.-E. and Petriu, E. M. (2021). Hybrid data-driven fuzzy active disturbance rejection control for tower crane systems. Eur. J. Control, 58, pp. 373–387.
[21] Kadali, R., Huang, B. and Rossiter, A. (2003). A data driven subspace approach to predictive controller design. Control Eng. Pract., 11(3), pp. 261–278.
[22] Lucchini, A., Formentin, S., Corno, M., Piga, D. and Savaresi, S. M. (2020). Torque vectoring for high-performance electric vehicles: a data-driven MPC approach. IEEE Control Syst. Lett., 4(3), pp. 725–730.
[23] Safonov, M. G. and Tsao, T.-C. (1997). The unfalsified control concept and learning. IEEE Trans. Autom. Control, 42(6), pp. 843–847.
[24] Jiang, P., Cheng, Y.-Q., Wang, X.-N. and Feng, Z. (2016). Unfalsified visual servoing for simultaneous object recognition and pose tracking. IEEE Trans. Cybern., 46(12), pp. 3032–3046.
[25] Novara, C., Formentin, S., Savaresi, S. M. and Milanese, M. (2015). A data-driven approach to nonlinear braking control. In Proc. 54th IEEE Conf. Dec. Control (IEEE CDC 2015). Osaka, Japan, pp. 1–6.

[26] Galluppi, O., Formentin, S., Novara, C. and Savaresi, S. M. (2019). Multivariable D2-IBC and application to vehicle stability control. ASME J. Dyn. Syst. Meas. Control, 141(10), pp. 1–12.
[27] Van Waarde, H. J., Eising, J., Trentelman, H. L. and Camlibel, M. K. (2020). Data informativity: a new perspective on data-driven analysis and control. IEEE Trans. Autom. Control, 65(11), pp. 4753–4768.
[28] Wu, H.-N., Wang, J.-W. and Li, H.-X. (2012). Design of distributed H∞ fuzzy controllers with constraint for nonlinear hyperbolic PDE systems. Automatica, 48(10), pp. 2535–2543.
[29] Simani, S., Alvisi, S. and Venturini, M. (2015). Data-driven design of a fault tolerant fuzzy controller for a simulated hydroelectric system. IFAC-PapersOnLine, 48(21), pp. 1090–1095.
[30] Kamesh, R. and Rani, K. Y. (2016). Parameterized data-driven fuzzy model based optimal control of a semi-batch reactor. ISA Trans., 64, pp. 418–430.
[31] Juang, C.-F. and Chang, Y.-C. (2016). Data-driven interpretable fuzzy controller design through multi-objective genetic algorithm. In Proc. 2016 IEEE Int. Conf. Syst. Man. Cybern. (SMC 2016). Budapest, Hungary, pp. 2403–2408.
[32] Roman, R.-C., Precup, R.-E. and David, R.-C. (2018). Second order intelligent proportional-integral fuzzy control of twin rotor aerodynamic systems. Proc. Comput. Sci., 139, pp. 372–380.
[33] Precup, R.-E., Roman, R.-C., Hedrea, E.-L., Petriu, E. M. and Bojan-Dragos, C.-A. (2021a). Data-driven model-free sliding mode and fuzzy control with experimental validation. Int. J. Comput. Communic. Control, 16(1), 4076.
[34] Roman, R.-C., Precup, R.-E., Bojan-Dragos, C.-A. and Szedlak-Stinean, A.-I. (2019a). Combined model-free adaptive control with fuzzy component by virtual reference feedback tuning for tower crane systems. Proc. Comput. Sci., 162, pp. 267–274.
[35] Roman, R.-C., Precup, R.-E., Petriu, E. M. and Dragan, F. (2019b). Combination of data-driven active disturbance rejection and Takagi-Sugeno fuzzy control with experimental validation on tower crane systems. Energies, 12(8), pp. 1–19.
[36] Roman, R.-C., Precup, R.-E. and Petriu, E. M. (2021). Hybrid data-driven fuzzy active disturbance rejection control for tower crane systems. Eur. J. Control, 58, pp. 373–387.
[37] Touhami, M., Hazzab, A., Mokhtari, F. and Sicard, P. (2019). Active disturbance rejection controller with ADRC-fuzzy for MAS control. Electr. Electron. Autom. (EEA), 67(2), pp. 89–97.
[38] Dombi, J. and Hussain, A. (2019). Data-driven arithmetic fuzzy control using the distending function. In Ahram, T., Taiar, R., Colson, S. and Choplin, A. (eds.), IHIET: Human Interaction and Emerging Technologies, Advances in Intelligent Systems and Computing, Vol. 1018. Cham: Springer, pp. 215–221.
[39] Sugeno, M. (1999). On stability of fuzzy systems expressed by fuzzy rules with singleton consequents. IEEE Trans. Fuzzy Syst., 7(2), pp. 201–224.
[40] Precup, R.-E. and David, R.-C. (2019). Nature-Inspired Optimization Algorithms for Fuzzy Controlled Servo Systems. Oxford: Butterworth-Heinemann, Elsevier.
[41] Eckert, J. J., Corrêa de Alkmin Silva, L., Dedini, F. G. and Corrêa, F. C. (2019a). Electric vehicle powertrain and fuzzy control multi-objective optimization, considering dual hybrid energy storage systems. IEEE Trans. Veh. Technol., 69(4), pp. 3773–3782.
[42] Eckert, J. J., Santiciolli, F. M., Yamashita, R. Y., Corrêa, F. C., Silva, L. C. A. and Dedini, F. G. (2019b). Fuzzy gear shifting control optimisation to improve vehicle performance, fuel consumption and engine emissions. IET Control Theory Appl., 13(16), pp. 2658–2669.
[43] Abadlia, I., Hassaine, L., Beddar, A., Abdoune, F. and Bengourina, M. R. (2020). Adaptive fuzzy control with an optimization by using genetic algorithms for grid connected a hybrid photovoltaic-hydrogen generation system. Int. J. Hydr. Energy, 45(43), pp. 22589–22599.
[44] Masoumi, A. P., Tavakolpour-Saleh, A. R. and Rahideh, A. (2020). Applying a genetic-fuzzy control scheme to an active free piston Stirling engine: design and experiment. Appl. Energy, 268, 115045.

[45] Fu, Z.-M., Zhu, L.-L., Tao, F.-Z., Si, P.-J. and Sun, L.-F. (2020). Optimization based energy management strategy for fuel cell/battery/ultracapacitor hybrid vehicle considering fuel economy and fuel cell lifespan. Int. J. Hydr. Energy, 45(15), pp. 8875–8886.
[46] Liang, H.-B., Zou, J.-L., Zuo, K. and Khan, M. J. (2020). An improved genetic algorithm optimization fuzzy controller applied to the wellhead back pressure control system. Mech. Syst. Signal Process., 142, 106708.
[47] Hu, Y.-P., Yang, Y.-P., Li, S. and Zhou, Y.-M. (2020). Fuzzy controller design of micro-unmanned helicopter relying on improved genetic optimization algorithm. Aerosp. Sci. Technol., 98, 105685.
[48] Wu, T.-Z., Yu, W.-S. and Guo, L.-X. (2019). A study on use of hybrid energy storage system along with variable filter time constant to smooth DC power fluctuation in microgrid. IEEE Access, 7, pp. 175377–175385.
[49] Wai, R.-J. and Prasetia, A. S. (2019). Adaptive neural network control and optimal path planning of UAV surveillance system with energy consumption prediction. IEEE Access, 7, pp. 126137–126153.
[50] Xie, S.-W., Xie, Y.-F., Li, F.-B., Jiang, Z.-H. and Gui, W.-H. (2019). Hybrid fuzzy control for the goethite process in zinc production plant combining type-1 and type-2 fuzzy logics. Neurocomputing, 366, pp. 170–177.
[51] Girgis, M. E. and Badr, R. I. (2021). Optimal fractional-order adaptive fuzzy control on inverted pendulum model. Int. J. Dynam. Control, 9(1), pp. 288–298.
[52] Mohammadzadeh, A. and Kayacan, E. (2020). A novel fractional-order type-2 fuzzy control method for online frequency regulation in AC microgrid. Eng. Appl. Artif. Intell., 90, 103483.
[53] Reisi, N. A., Hadipour Lakmesari, S., Mahmoodabadi, M. J. and Hadipour, S. (2019). Optimum fuzzy control of human immunodeficiency virus type1 using an imperialist competitive algorithm. Inf. Med. Unlocked, 16, 100241.
[54] Tripathi, S., Shrivastava, A. and Jana, K. C. (2020). Self-tuning fuzzy controller for sun-tracker system using Gray Wolf Optimization (GWO) technique. ISA Trans., 101, pp. 50–59.
[55] Precup, R.-E., David, R.-C., Petriu, E. M., Szedlak-Stinean, A.-I. and Bojan-Dragos, C.-A. (2016). Grey wolf optimizer-based approach to the tuning of PI-fuzzy controllers with a reduced process parametric sensitivity. IFAC-PapersOnLine, 49(5), pp. 55–60.
[56] Solihin, M. I., Chuan, C. Y. and Astuti, W. (2020). Optimization of fuzzy logic controller parameters using modern meta-heuristic algorithm for Gantry Crane System (GCS). Mater. Today Proc., 29, pp. 168–172.
[57] Azizi, M., Ejlali, R. G., Ghasemi, S. A. M. and Talatahari, S. (2019a). Upgraded whale optimization algorithm for fuzzy logic based vibration control of nonlinear steel structure. Eng. Struct., 19, pp. 53–70.
[58] David, R.-C., Precup, R.-E., Preitl, S., Petriu, E. M., Szedlak-Stinean, A.-I. and Roman, R.-C. (2020a). Whale optimization algorithm-based tuning of low-cost fuzzy controllers with reduced parametric sensitivity. In Proc. 28th Medit. Conf. Control Autom. (MED 2020). Saint-Raphael, France, pp. 440–445.
[59] David, R.-C., Precup, R.-E., Preitl, S., Szedlak-Stinean, A.-I., Roman, R.-C. and Petriu, E. M. (2020c). Design of low-cost fuzzy controllers with reduced parametric sensitivity based on whale optimization algorithm. In Proc. 2020 IEEE Int. Conf. Fuzzy Syst. (FUZZ-IEEE 2020). Glasgow, UK, pp. 1–6.
[60] Nosratabadi, S. M., Bornapour, M. and Gharaei, M. A. (2019). Grasshopper optimization algorithm for optimal load frequency control considering predictive functional modified PID controller in restructured multi-resource multi-area power system with redox flow battery units. Control Eng. Pract., 89, pp. 204–227.
[61] Azizi, M., Ghasemi, S. A. M., Ejlali, R. G. and Talatahari, S. (2019b). Optimal tuning of fuzzy parameters for structural motion control using multiverse optimizer. Struct. Des. Tall Spec. Build., 28(13), e1652.

[62] Precup, R.-E., David, R.-C., Roman, R.-C., Petriu, E. M. and Szedlak-Stinean, A.-I. (2021b). Slime mould algorithm-based tuning of cost-effective fuzzy controllers for servo systems. Int. J. Comput. Intell. Syst., 14(1), pp. 1042–1052.
[63] Wang, G.-G. and Tan, Y. (2019). Improving metaheuristic algorithms with information feedback models. IEEE Trans. Cybern., 49(2), pp. 542–555.
[64] Gu, Z.-M. and Wang, G.-G. (2010). Improving NSGA-III algorithms with information feedback models for large-scale many-objective optimization. Future Gener. Comput. Syst., 107, pp. 49–69.
[65] Zhang, Y., Wang, G.-G., Li, K.-Q., Yeh, W.-C., Jian, M.-W. and Dong, J.-Y. (2020a). Enhancing MOEA/D with information feedback models for large-scale many-objective optimization. Inf. Sci., 522, pp. 1–16.
[66] Precup, R.-E. and Preitl, S. (1999a). Fuzzy Controllers. Timisoara: Editura Orizonturi Universitare.
[67] Kóczy, L. T. (1996). Fuzzy IF-THEN rule models and their transformation into one another. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans, 26(5), pp. 621–637.
[68] Precup, R.-E. and Preitl, S. (1999b). Development of some fuzzy controllers with nonhomogenous dynamics with respect to the input channels meant for a class of systems. In Proc. 1999 Eur. Control Conf. (ECC 1999). Karlsruhe, Germany, pp. 61–66.
[69] Precup, R.-E. and Preitl, S. (2003). Development of fuzzy controllers with non-homogeneous dynamics for integral-type plants. Electr. Eng., 85(3), pp. 155–168.
[70] Precup, R.-E., Preitl, S., Petriu, E. M., Tar, J. K., Tomescu, M. L. and Pozna, C. (2009). Generic two-degree-of-freedom linear and fuzzy controllers for integral processes. J. Franklin Inst., 346(10), pp. 980–1003.
[71] Preitl, S., Stinean, A.-I., Precup, R.-E., Preitl, Z., Petriu, E. M., Dragos, C.-A. and Radac, M.-B. (2012). Controller design methods for driving systems based on extensions of symmetrical optimum method with DC and BLDC motor applications. IFAC Proc. Vol., 45(3), pp. 264–269.
[72] Galichet, S. and Foulloy, L. (1995). Fuzzy controllers: synthesis and equivalences. IEEE Trans. Fuzzy Syst., 3(2), pp. 140–148.
[73] Precup, R.-E., Preitl, S., Petriu, E. M., Roman, R.-C., Bojan-Dragos, C.-A., Hedrea, E.-L. and Szedlak-Stînean, A.-I. (2020a). A center manifold theory-based approach to the stability analysis of state feedback Takagi-Sugeno-Kang fuzzy control systems. Facta Univ. Ser. Mech. Eng., 18(2), pp. 189–204.
[74] Wang, H. O., Tanaka, K. and Griffin, M. F. (1996). An approach to fuzzy control of nonlinear systems: stability and design issues. IEEE Trans. Fuzzy Syst., 4(1), pp. 14–23.
[75] Tanaka, K. and Wang, H. O. (2001). Fuzzy Control Systems Design and Analysis: A Linear Matrix Inequality Approach. New York: John Wiley & Sons.
[76] Wang, Z.-H., Liu, Z., Chen, C. L. P. and Zhang, Y. (2019). Fuzzy adaptive compensation control of uncertain stochastic nonlinear systems with actuator failures and input hysteresis. IEEE Trans. Cybern., 49(1), pp. 2–13.
[77] Moodi, H., Farrokhi, M., Guerra, T.-M. and Lauber, J. (2019). On stabilization conditions for T-S systems with nonlinear consequent parts. Int. J. Fuzzy Syst., 21(1), pp. 84–94.
[78] Frezzatto, L., Lacerda, M. J., Oliveira, R. C. L. F. and Peres, P. L. D. (2019). H2 and H∞ fuzzy filters with memory for Takagi-Sugeno discrete-time systems. Fuzzy Sets Syst., 371, pp. 78–95.
[79] Gunasekaran, N. and Joo, Y. H. (2019). Stochastic sampled-data controller for T-S fuzzy chaotic systems and its applications. IET Control Theory Appl., 13(12), pp. 1834–1843.
[80] Sakthivel, R., Mohanapriya, S., Kaviarasan, B., Ren, Y. and Anthoni, S. M. (2020). Non-fragile control design and state estimation for vehicle dynamics subject to input delay and actuator faults. IET Control Theory Appl., 14(1), pp. 134–144.
[81] Liu, D., Yang, G.-H. and Er, M. J. (2020). Event-triggered control for T-S fuzzy systems under asynchronous network communications. IEEE Trans. Fuzzy Syst., 28(2), pp. 390–399.

[82] Shamloo, N. F., Kalat, A. A. and Chisci, L. (2020). Indirect adaptive fuzzy control of nonlinear descriptor systems. Eur. J. Control, 51, pp. 30–38.
[83] Jiang, B.-P., Karimi, H. R., Kao, Y.-G. and Gao, C.-C. (2020). Takagi-Sugeno model based event-triggered fuzzy sliding-mode control of networked control systems with semi-Markovian switchings. IEEE Trans. Fuzzy Syst., 28(4), pp. 673–683.
[84] Gandhi, R. V. and Adhyaru, D. M. (2020). Takagi-Sugeno fuzzy regulator design for nonlinear and unstable systems using negative absolute eigenvalue approach. IEEE/CAA J. Autom. Sin., 7(2), pp. 482–493.
[85] Xia, Y., Wang, J., Meng, B. and Chen, X.-Y. (2020). Further results on fuzzy sampled-data stabilization of chaotic nonlinear systems. Appl. Math. Comput., 379, 125225.
[86] Lam, H.-K. (2018). A review on stability analysis of continuous-time fuzzy-model-based control systems: from membership-function-independent to membership-function-dependent analysis. Eng. Appl. Artif. Intell., 67, pp. 390–408.
[87] Yang, X.-Z., Lam, H.-K. and Wu, L.-G. (2019). Membership-dependent stability conditions for type-1 and interval type-2 T-S fuzzy systems. Fuzzy Sets Syst., 356, pp. 44–62.
[88] Pang, B., Liu, X., Jin, Q. and Zhang, W. (2016). Exponentially stable guaranteed cost control for continuous and discrete-time Takagi-Sugeno fuzzy systems. Neurocomputing, 205(1), pp. 210–221.
[89] Li, G.-L., Peng, C., Fei, M.-R. and Tian, Y.-C. (2020a). Local stability conditions for T-S fuzzy time-delay systems using a homogeneous polynomial approach. Fuzzy Sets Syst., 385, pp. 111–126.
[90] Xiao, B., Lam, H.-K., Yu, Y. and Li, Y.-D. (2020). Sampled-data output-feedback tracking control for interval type-2 polynomial fuzzy systems. IEEE Trans. Fuzzy Syst., 28(3), pp. 424–433.
[91] Yu, G.-R., Huang, Y.-C. and Cheng, C.-Y. (2018). Sum-of-squares-based robust H∞ controller design for discrete-time polynomial fuzzy systems. J. Franklin Inst., 355(1), pp. 177–196.
[92] Zhao, Y.-X., He, Y.-X., Feng, Z.-G., Shi, P. and Du, X. (2019). Relaxed sum-of-squares based stabilization conditions for polynomial fuzzy-model-based control systems. IEEE Trans. Fuzzy Syst., 27(9), pp. 1767–1778.
[93] Yoneyama, J. (2017). New conditions for stability and stabilization of Takagi-Sugeno fuzzy systems. In Proc. 2017 Asian Control Conf. (ASCC 2017). Gold Coast, Australia, pp. 2154–2159.
[94] Robles, R., Sala, A., Bernal, M. and González, T. (2017). Subspace-based Takagi-Sugeno modeling for improved LMI performance. IEEE Trans. Fuzzy Syst., 25(4), pp. 754–767.
[95] Meda-Campaña, J. A., Grande-Meza, A., de Jesús Rubio, J., Tapia-Herrera, R., Hernández-Cortés, T., Curtidor-López, A. V., Páramo-Carranza, L. A. and Cázares-Ramírez, I. O. (2018). Design of stabilizers and observers for a class of multivariable TS fuzzy models on the basis of new interpolation function. IEEE Trans. Fuzzy Syst., 26(5), pp. 2649–2662.
[96] Škrjanc, I. and Blažič, S. (2005). Predictive functional control based on fuzzy model: design and stability study. J. Intell. Robot. Syst., 43(2–4), pp. 283–299.
[97] Angelov, P., Škrjanc, I. and Blažič, S. (2013). Robust evolving cloud-based controller for a hydraulic plant. In Proc. 2013 IEEE Conf. Evol. Adapt. Intell. Syst. (EAIS 2013). Singapore, pp. 1–8.
[98] Costa, B., Škrjanc, I., Blažič, S. and Angelov, P. (2013). A practical implementation of self-evolving cloud-based control of a pilot plant. In Proc. 2013 IEEE Int. Conf. Cybern. (CYBCONF 2013). Lausanne, Switzerland, pp. 7–12.
[99] Blažič, S., Škrjanc, I. and Matko, D. (2014). A robust fuzzy adaptive law for evolving control systems. Evol. Syst., 5(1), pp. 3–10.
[100] Baranyi, P. (2004). TP model transformation as a way to LMI-based controller design. IEEE Trans. Ind. Electron., 51(2), pp. 387–400.









© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0021

Chapter 21

Indirect Self-evolving Fuzzy Control Approaches and Their Applications

Zhao-Xu Yang and Hai-Jun Rong

State Key Laboratory for Strength and Vibration of Mechanical Structures, Shaanxi Key Laboratory of Environment and Control for Flight Vehicle, School of Aerospace Engineering, Xi’an Jiaotong University, Xi’an, 710049, P.R. China

21.1. Introduction

Intelligent control, which takes inspiration from the autonomous learning, reasoning, and decision-making of living beings, is particularly valuable when a system must be kept under control without the help of prior knowledge. The classical approaches to intelligent control include, but are not limited to, fuzzy inference systems (FIS), artificial neural networks (ANN), genetic algorithms (GA), particle swarm optimization (PSO), and support vector machines (SVM). Intelligent control systems, especially the novel self-evolving fuzzy control systems with structure evolution and parameter adjustment, have been widely applied in engineering and industry.

21.1.1. Self-evolving Fuzzy Model

It is well known that fuzzy inference systems (FIS) can closely approximate any nonlinear input–output mapping by means of a series of if–then rules [1]. The design of an FIS involves two major tasks: structure identification and parameter adjustment. Structure identification determines the input–output space partition, the antecedent and consequent variables of the if–then rules, the number of such rules, and the initial positions of the membership functions. Parameter adjustment then determines the parameters of the fuzzy model structure obtained in the previous step [2].


In most real applications, not all fuzzy rules contribute significantly to the system performance during the entire time period. A fuzzy rule may be active initially, but later contribute little to the system output. For this reason, insignificant fuzzy rules have to be removed during learning to obtain a compact fuzzy system structure. Using the ideas of adding and pruning hidden neurons to form a minimal RBF network in [3], a hierarchical online self-organizing learning algorithm for dynamic fuzzy neural networks (DFNN) was proposed in [4]. Another online self-organizing fuzzy neural network (SOFNN), proposed by Leng et al. [5], also includes a pruning method, which uses the optimal brain surgeon (OBS) approach to determine the importance of each rule. However, in these two algorithms, the pruning criteria need all the data received so far. Hence, they are not strictly sequential and require additional memory for storing all the past data. An online sequential learning algorithm using recursive least-square error (RLSE) was presented for a fixed fuzzy system structure in [6]; it is referred to as the Online Sequential Fuzzy Extreme Learning Machine (OS-Fuzzy-ELM). In the OS-Fuzzy-ELM algorithm, the parameters of the fuzzy membership functions are selected randomly and the consequent parameters of the fuzzy rules are calculated analytically using least squares. For sequential learning, the consequent parameters are updated using RLSE, and the training data can be presented one-by-one or chunk-by-chunk. Simulation results on function approximation and classification problems show that OS-Fuzzy-ELM produces better generalization performance with a smaller computation time. However, the structure of the fuzzy system is determined before learning and, once determined, does not change during the learning process.
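The sequential update of consequent parameters mentioned above can be illustrated with a generic recursive least-squares step. The following is a minimal sketch for a scalar output, not the OS-Fuzzy-ELM implementation of [6]: the function names, the initial value `p0`, and the toy data are assumptions made for illustration only, and the regressor `h` stands in for the normalized rule activations.

```python
def rls_update(theta, P, h, y):
    """One recursive least-squares step for a scalar output.

    theta: current parameter estimate (list of floats)
    P:     current inverse-correlation matrix (list of lists, symmetric)
    h:     regressor vector for this sample
    y:     observed target for this sample
    """
    n = len(theta)
    Ph = [sum(P[i][j] * h[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(h[i] * Ph[i] for i in range(n))
    gain = [Ph[i] / denom for i in range(n)]
    err = y - sum(h[i] * theta[i] for i in range(n))
    theta = [theta[i] + gain[i] * err for i in range(n)]
    # P <- P - (P h)(P h)^T / denom, valid because P is symmetric
    P = [[P[i][j] - Ph[i] * Ph[j] / denom for j in range(n)] for i in range(n)]
    return theta, P


def fit_sequentially(samples, n_params, p0=1e6):
    """Process samples one-by-one, starting from theta = 0 and P = p0 * I."""
    theta = [0.0] * n_params
    P = [[p0 if i == j else 0.0 for j in range(n_params)] for i in range(n_params)]
    for h, y in samples:
        theta, P = rls_update(theta, P, h, y)
    return theta


# Toy data (assumed for illustration): y = 1 + 2x with regressor h = [1, x]
samples = [([1.0, float(x)], 1.0 + 2.0 * x) for x in range(10)]
theta = fit_sequentially(samples, n_params=2)
```

No past samples are stored: each step uses only the current sample and the running pair `(theta, P)`, which is what makes the scheme strictly sequential.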
To improve the performance of the above algorithms in nonlinear learning problems, the difficulties described above need to be overcome. Self-evolving fuzzy models with both structure evolution and parameter adjustment are applied to reconstruct the nonlinear uncertainties. The structure evolution approach evolves the model structure and determines the initial values of the antecedent and consequent parameters in an incremental manner. Initially, the rule base of the self-evolving fuzzy model is empty; rules are then generated by the structure learning algorithm, whose criteria automatically govern the recruiting and pruning of fuzzy rules. In this way, a compact rule base can be built. Thus, the self-evolving fuzzy control schemes based on the proposed fuzzy models are able to achieve a controller structure of appropriate size while approximating the nonlinear uncertainties. In the rest of this chapter, two typical self-evolving fuzzy models with improvements in both accuracy and speed, referred to as the extended sequential adaptive fuzzy inference system (ESAFIS) [7] and the meta-cognitive fuzzy inference system (McFIS) [8], are described and then used to design the fuzzy controller.

21.1.2. Self-Evolving Fuzzy Control System

Adaptive identification and control of systems with uncertain time-varying parameters and unknown dynamics have been the focus of numerous studies. Among the various adaptive control techniques, the feedback linearization method and the backstepping method are effective control strategies and have been widely applied to uncertain nonlinear systems [9–11]. Conventional control approaches require some a priori information about the nonlinear system. However, the nonlinear input–output relationship of such systems is generally hard to understand, so their exact mathematical models are difficult to describe in analytical form. To overcome this problem, some researchers [12] have turned to universal function approximators as a means of explicitly identifying complex nonlinear systems. Fuzzy models are powerful tools for the control of complex uncertain nonlinear systems without exact mathematical models, and they have therefore been widely employed in the field of nonlinear system control [13]. A number of investigations have addressed the fuzzy control of nonlinear systems by merging fuzzy models with adaptive control techniques. Depending on the fuzzy model structure, both linearly and nonlinearly parameterized models are obtained. In most control schemes [14–16], only the consequent parameters of the fuzzy rules are modified, according to update rules obtained from a Lyapunov function, while the antecedent membership function parameters are assigned based on the designer’s experience. Clearly, as the membership function parameters vary, the membership functions take on different shapes. This helps to capture fast-varying dynamics, reduce approximation errors, and enhance control performance.
Thus, some nonlinearly parameterized controllers have been developed [17, 18] in which both the fuzzy membership function parameters and the consequent parameters are updated. Although higher control performance can be achieved compared with linearly parameterized controllers, a large number of tunable parameters are required, which results in high computational complexity. All the above-mentioned control methods address the parameter adjustment of the fuzzy-neural model without considering structure learning. Nonlinear systems generally show dynamic changes due to nonstationary environments and external disturbances, which result in drifts and shifts of system states or in new operating conditions and modes. From a practical point of view, there is no guarantee that a fixed controller structure performs satisfactorily in online applications when the environment or the controlled object changes. Thus, some fuzzy-neural models with self-adaptation capability have been developed so that both the model parameters and the model structure are updated to adapt rapidly to a variety of unexpected dynamic changes. Moreover, in nonstationary environments it is generally difficult to obtain a complete, uniformly distributed learning data set that covers all the system characteristics. In this case, dynamic learning must be able to identify useless data and delete them during the learning process.

Aircraft wing-rock motion and a thrust active magnetic bearing system, as application examples, suffer from inherent nonlinear uncertainties and fast time-varying characteristics. Considering these concerns, the main motivation of this work is to generate a self-adaptable controller structure and evolve it in the online mode. The controller can start with no predefined fuzzy rules and does not need any offline pretraining or explicit model of the controlled system. The structure and parameters of the controller are adaptively adjusted according to the information extracted from nonstationary and uncertain data streams. To this end, self-evolving control approaches are proposed for uncertain nonlinear systems that are hard to formulate mathematically due to nonstationary parameters and uncertainties. Recently, self-evolving fuzzy control approaches that combine a control technique with self-evolving fuzzy models have been employed to design controllers for nonlinear systems with completely unknown functions. A key element in these approaches is that the self-evolving fuzzy models, being universal approximators, are used to approximate the unknown dynamics.

21.2. Adaptive Fuzzy Control of Aircraft Wing-Rock Motion

Wing rock is a highly nonlinear aerodynamic phenomenon in which limit-cycle roll oscillations are experienced by aircraft at high angles of attack [19]. Wing-rock motion is a concern because it may cause a loss of stability in the lateral/directional mode due to the large amplitudes and high frequencies of the rolling oscillations, thus degrading the maneuverability, tracking accuracy, and operational safety of high-performance fighters. To understand the underlying mechanism of wing-rock motion, research has been conducted on the motion of slender delta wings, and several theoretical models that describe the nonlinear rolling motion have been developed using simple differential equations [20, 21]. During the past decades, many control techniques have been developed [22–25]. Based on the differential models, the control strategies used to suppress wing-rock motion include nonlinear optimal feedback control [23], adaptive control based on feedback linearization [24], and robust H∞ control [25]. Although these methods are successful in controlling wing-rock motion, they share the key assumption that the system nonlinearities are known a priori, which generally does not hold in the real world. The reason is that the aerodynamic parameters governing wing rock are inadequately understood, and thus an accurate mathematical model of the wing-rock motion cannot be obtained. This presents significant difficulties for controller design using conventional control techniques. In the proposed adaptive fuzzy control scheme, the fuzzy controller is built by employing ESAFIS to approximate the nonlinear dynamics of the wing-rock motion. However, unlike the original ESAFIS algorithm, where recursive least-square error (RLSE) is used to adjust the consequent parameters, here they are updated by stable adaptive laws derived from a Lyapunov function, which is an effective way to guarantee the stability of the closed-loop system. Moreover, a sliding controller is incorporated and works together with the fuzzy controller to offset the modeling errors of the ESAFIS.

21.2.1. Problem Formulation

The nonlinear wing-rock model for an 80° slender delta wing developed by Nayfeh et al. [21] is considered in this study. Its dynamics are described by the following differential equation:

$$\ddot{\phi} = \left(\frac{\rho U_\infty^2 S b}{2 I_{xx}}\right) C_l - D\dot{\phi} + u, \tag{21.1}$$

where $\phi$ is the roll angle, $\rho$ is the density of air, $U_\infty$ is the freestream velocity, $S$ is the wing reference area, $b$ is the chord, $I_{xx}$ is the mass moment of inertia, $D$ is the damping coefficient, fixed as 0.0001, $u$ is the control input, and $C_l$ is the roll moment coefficient, given by

$$C_l = a_1\phi + a_2\dot{\phi} + a_3\dot{\phi}^3 + a_4\phi^2\dot{\phi} + a_5\phi\dot{\phi}^2. \tag{21.2}$$

The aerodynamic parameters $a_1$ to $a_5$ are nonlinear functions of the angle of attack $\alpha$ and are presented in Table 21.1.

Table 21.1: Coefficients of rolling moment with angle of attack α.

α (deg)   a1         a2         a3         a4         a5
15.0      −0.01026   −0.02117   −0.14181    0.99735   −0.83478
21.5      −0.04207   −0.01456    0.04714   −0.18583    0.24234
22.5      −0.04681    0.01966    0.05671   −0.22691    0.59065
25.0      −0.05686    0.03254    0.07334   −0.35970    1.46810


Substituting Eq. (21.2) into Eq. (21.1), the wing-rock dynamics can be rewritten as

$$\ddot{\phi} + \omega^2\phi = \mu_1\dot{\phi} + b_1\dot{\phi}^3 + \mu_2\phi^2\dot{\phi} + b_2\phi\dot{\phi}^2 + u. \tag{21.3}$$

The coefficients in the above equation are given by the following relations:

$$\omega^2 = -Ca_1,\quad \mu_1 = Ca_2 - D,\quad \mu_2 = Ca_4,\quad b_1 = Ca_3,\quad b_2 = Ca_5, \tag{21.4}$$

where $C = \rho U_\infty^2 S b/(2I_{xx})$ is a fixed constant equal to 0.354. The values of the coefficients in Eq. (21.3) can be obtained for any angle of attack by interpolation. In [21], the authors pointed out that the angle at which $\mu_1$ is zero corresponds to the onset of wing rock, observed at “19–20 deg”. This can be verified by observing the properties of the open-loop system with $u = 0$. Figures 21.1(a) and (b) present the time histories of the calculated roll angle and roll rate given by Eq. (21.3) for $\alpha = 15^\circ$.

Figure 21.1: The time process of roll angle and its derivative for α = 15°. (a) Roll angle φ (degree) versus time (s). (b) Derivative of roll angle, i.e., roll rate φ′ (degree/s), versus time (s).

By choosing the state variables $x_1 = \phi$ and $x_2 = \dot{\phi}$, the state equations of the wing-rock dynamics become

$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = f(x_1, x_2) + u, \tag{21.5}$$

where $f(x_1, x_2) = -\omega^2\phi + \mu_1\dot{\phi} + b_1\dot{\phi}^3 + \mu_2\phi^2\dot{\phi} + b_2\phi\dot{\phi}^2$ is assumed to be a bounded real continuous nonlinear function. Defining $x = [x_1, x_2]^T$, Eq. (21.5) can be further written in the following state–space form:

$$\dot{x} = Ax + bf(x) + bu, \tag{21.6}$$

where

$$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \qquad b = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{21.7}$$
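The open-loop model of Eqs. (21.3)–(21.5) can be sketched numerically as follows. This is an illustrative sketch only: the text allows any interpolation method, so the piecewise-linear interpolation used here is an assumption, as are the function names, the fourth-order Runge–Kutta integrator, the step size, and the initial state. The tabulated values are copied as given in Table 21.1, and the units of the state are not checked against [21].

```python
C = 0.354    # constant C in Eq. (21.4)
D = 0.0001   # damping coefficient in Eq. (21.1)

# Rows of Table 21.1: alpha (deg) -> (a1, a2, a3, a4, a5)
TABLE = [
    (15.0, (-0.01026, -0.02117, -0.14181, 0.99735, -0.83478)),
    (21.5, (-0.04207, -0.01456, 0.04714, -0.18583, 0.24234)),
    (22.5, (-0.04681, 0.01966, 0.05671, -0.22691, 0.59065)),
    (25.0, (-0.05686, 0.03254, 0.07334, -0.35970, 1.46810)),
]


def aero_params(alpha):
    """Piecewise-linear interpolation of a1..a5 (an assumed choice of method)."""
    if not TABLE[0][0] <= alpha <= TABLE[-1][0]:
        raise ValueError("alpha outside the tabulated range")
    for (al, lo), (ar, hi) in zip(TABLE, TABLE[1:]):
        if al <= alpha <= ar:
            w = (alpha - al) / (ar - al)
            return tuple((1 - w) * p + w * q for p, q in zip(lo, hi))


def coefficients(alpha):
    """Coefficients of Eq. (21.3) computed from Eq. (21.4)."""
    a1, a2, a3, a4, a5 = aero_params(alpha)
    return {"omega2": -C * a1, "mu1": C * a2 - D, "mu2": C * a4,
            "b1": C * a3, "b2": C * a5}


def f(x1, x2, c):
    """Nonlinearity f(x1, x2) of Eq. (21.5)."""
    return (-c["omega2"] * x1 + c["mu1"] * x2 + c["b1"] * x2 ** 3
            + c["mu2"] * x1 ** 2 * x2 + c["b2"] * x1 * x2 ** 2)


def rk4_step(x1, x2, u, c, dt):
    """One Runge-Kutta step of Eq. (21.5) with the input u held constant."""
    def deriv(s1, s2):
        return s2, f(s1, s2, c) + u
    k1 = deriv(x1, x2)
    k2 = deriv(x1 + 0.5 * dt * k1[0], x2 + 0.5 * dt * k1[1])
    k3 = deriv(x1 + 0.5 * dt * k2[0], x2 + 0.5 * dt * k2[1])
    k4 = deriv(x1 + dt * k3[0], x2 + dt * k3[1])
    return (x1 + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            x2 + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))


# Open-loop run (u = 0) from a hypothetical initial state
c15 = coefficients(15.0)
state = (0.1, 0.0)
for _ in range(100):
    state = rk4_step(state[0], state[1], 0.0, c15, 0.01)
```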

Define the output tracking error $e = \phi_d - \phi$, the reference output vector $\boldsymbol{\phi}_d = [\phi_d, \dot{\phi}_d]^T$, and the tracking error vector $\mathbf{e} = [e, \dot{e}]^T = [e_1, e_2]^T$. The control law $u$ is designed such that the state $\boldsymbol{\phi} = [\phi, \dot{\phi}]^T$ tracks the desired command $\boldsymbol{\phi}_d$ as closely as possible. According to the feedback linearization method [26] and the system dynamics given by Eq. (21.6), the desired control input $u^*$ is obtained as

$$u^* = \ddot{\phi}_d - f(x) + k_1\dot{e} + k_2 e, \tag{21.8}$$

where $k_1$ and $k_2$ are real numbers. Substituting Eq. (21.8) into Eq. (21.6) yields

$$\ddot{e} + k_1\dot{e} + k_2 e = 0. \tag{21.9}$$

If $k_1$ and $k_2$ are chosen such that all the roots of the polynomial $s^2 + k_1 s + k_2 = 0$ lie in the open left half-plane, the tracking error converges to zero. Equation (21.9) thus shows that tracking of the reference command is achieved asymptotically from any initial conditions, i.e., $\lim_{t\to\infty}|e| = 0$. However, if the motion dynamics $f(x)$ is perturbed or unknown, this control law is difficult to implement in practical applications. To circumvent this problem, an indirect adaptive fuzzy control strategy based on ESAFIS is designed. Before presenting the fuzzy control scheme, a brief description of ESAFIS is given in the following section.
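The ideal control law of Eq. (21.8) can be checked numerically when $f$ is assumed to be known exactly. In the sketch below, the coefficients are those obtained from Eq. (21.4) for the α = 15° row of Table 21.1; the sinusoidal reference, the gains $k_1 = k_2 = 4$ (a double root at $s = -2$), the initial state, and the Runge–Kutta integrator are all hypothetical choices made for illustration, not values taken from the chapter.

```python
import math

# Coefficients of Eq. (21.3) for alpha = 15 deg, from Eq. (21.4)
# with C = 0.354 and D = 0.0001
OMEGA2, MU1, B1, MU2, B2 = (0.00363204, -0.00759418, -0.05020074,
                            0.3530619, -0.29551212)


def f(phi, dphi):
    """Wing-rock nonlinearity of Eq. (21.5), assumed perfectly known here."""
    return (-OMEGA2 * phi + MU1 * dphi + B1 * dphi ** 3
            + MU2 * phi ** 2 * dphi + B2 * phi * dphi ** 2)


def reference(t):
    """Hypothetical command phi_d = sin(t) with its first two derivatives."""
    return math.sin(t), math.cos(t), -math.sin(t)


def ideal_control(t, phi, dphi, k1, k2):
    """Feedback-linearizing input u* of Eq. (21.8)."""
    phi_d, dphi_d, ddphi_d = reference(t)
    e, de = phi_d - phi, dphi_d - dphi
    return ddphi_d - f(phi, dphi) + k1 * de + k2 * e


def closed_loop(t, phi, dphi, k1, k2):
    """Right-hand side of Eq. (21.5) with u = u*."""
    u = ideal_control(t, phi, dphi, k1, k2)
    return dphi, f(phi, dphi) + u


def simulate(phi0, dphi0, k1, k2, t_end, dt=0.01):
    """Integrate the closed loop with a fourth-order Runge-Kutta scheme."""
    t, phi, dphi = 0.0, phi0, dphi0
    for _ in range(round(t_end / dt)):
        a = closed_loop(t, phi, dphi, k1, k2)
        b = closed_loop(t + dt / 2, phi + dt / 2 * a[0], dphi + dt / 2 * a[1], k1, k2)
        c = closed_loop(t + dt / 2, phi + dt / 2 * b[0], dphi + dt / 2 * b[1], k1, k2)
        d = closed_loop(t + dt, phi + dt * c[0], dphi + dt * c[1], k1, k2)
        phi += dt / 6 * (a[0] + 2 * b[0] + 2 * c[0] + d[0])
        dphi += dt / 6 * (a[1] + 2 * b[1] + 2 * c[1] + d[1])
        t += dt
    return t, phi, dphi


t_final, phi_final, _ = simulate(0.5, 0.0, 4.0, 4.0, 10.0)
final_error = reference(t_final)[0] - phi_final
```

Because the cancellation of $f$ is exact in this sketch, the error obeys the linear dynamics of Eq. (21.9) regardless of the reference. Once $f$ is only approximated, a residual term appears on the right-hand side of Eq. (21.9), which is the situation the adaptive fuzzy scheme is meant to handle.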

21.2.2. Extended Sequential Adaptive Fuzzy Inference System

21.2.2.1. Architecture of ESAFIS

ESAFIS [7] is an extended version of the SAFIS [27] algorithm with improved accuracy and speed. Unlike SAFIS, ESAFIS can implement both zero-order and first-order TS fuzzy models, which are commonly described as follows:

Rule $k$: If ($x_1$ is $A_{1k}$) $\cdots$ ($x_{N_x}$ is $A_{N_x k}$), then ($\hat{y}_1$ is $a_{k1}$) $\cdots$ ($\hat{y}_{N_y}$ is $a_{kN_y}$),

where $A_{ik}$ ($i = 1, \ldots, N_x$) denotes the fuzzy set, namely the linguistic value, associated with the $i$th input variable $x_i$ in rule $k$, $N_x$ is the dimension of the input vector $x$ ($x = [x_1, \ldots, x_{N_x}]$), $N_h$ is the number of fuzzy rules, and $N_y$ is the dimension of the output vector $\hat{y}$ ($\hat{y} = [\hat{y}_1, \ldots, \hat{y}_{N_y}]$). In ESAFIS, the number of fuzzy rules $N_h$ varies: initially there is no fuzzy rule, and rules are then added and removed during learning. For the zero-order TS fuzzy model, the consequence $a_{kj}$ ($k = 1, 2, \ldots, N_h$; $j = 1, 2, \ldots, N_y$) is a constant, while in the first-order TS fuzzy model, a linear combination of the input variables, that is, $a_{kj} = q_{k0,j} + q_{k1,j}x_1 + \cdots + q_{kN_x,j}x_{N_x}$, is used as the consequence. Much research has shown that the first-order TS fuzzy model performs better in terms of system structure and learning ability than the zero-order TS fuzzy model [7, 28]. Thus, ESAFIS with the first-order TS fuzzy model is mainly studied here to construct the control law. The structure of ESAFIS, illustrated in Figure 21.2, consists of five layers that realize the above fuzzy rule model.

Figure 21.2: The structure of ESAFIS.

Layer 1: Each node represents an input variable and directly transmits the input signal to layer 2.

Layer 2: Each node represents the membership value of an input variable. ESAFIS utilizes the functional equivalence between an RBF network and an FIS. Thus, the antecedent part (if part) of its fuzzy rules is realized by the Gaussian functions of the RBF network, as in many existing studies [29, 30]. The membership value of the fuzzy set $A_{ik}(x_i)$ for the $i$th input variable $x_i$ in the $k$th Gaussian function is given by

$$A_{ik}(x_i) = \exp\left(-\frac{(x_i - \mu_{ik})^2}{\sigma_k^2}\right), \quad k = 1, 2, \ldots, N_h, \tag{21.10}$$


Indirect Self-Evolving Fuzzy Control Approaches and Their Applications


where $N_h$ is the number of Gaussian functions, $\mu_{ik}$ is the center of the $k$th Gaussian function for the $i$th input variable, and $\sigma_k$ is the width of the $k$th Gaussian function. In ESAFIS, the widths of all the input variables in the $k$th Gaussian function are the same.

Layer 3: Each node in this layer represents the "IF" part of the if–then rules obtained by the sum–product composition, and the total number of such rules is $N_h$. The firing strength (if part) of the $k$th rule is given by

\[ R_k(\mathbf{x}) = \prod_{i=1}^{N_x} A_{ik}(x_i) = \exp\left(-\sum_{i=1}^{N_x} \frac{(x_i - \mu_{ik})^2}{\sigma_k^2}\right) = \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_k\|^2}{\sigma_k^2}\right). \tag{21.11} \]

Layer 4: The nodes in this layer are called normalized nodes; their number equals the number of nodes in the third layer. The $k$th normalized node is given by

\[ \bar{R}_k = \frac{R_k(\mathbf{x})}{\sum_{k=1}^{N_h} R_k(\mathbf{x})}. \tag{21.12} \]

Layer 5: Each node in this layer corresponds to an output variable, which is given by the weighted sum of the outputs of the normalized rules. The system output is calculated as

\[ \hat{\mathbf{y}} = \frac{\sum_{k=1}^{N_h} \mathbf{a}_k R_k(\mathbf{x})}{\sum_{k=1}^{N_h} R_k(\mathbf{x})}, \tag{21.13} \]

and the consequence is

\[ \mathbf{a}_k = \mathbf{x}_e^T \mathbf{q}_k, \tag{21.14} \]

where $\mathbf{x}_e$ is the extended input vector obtained by appending the input vector $\mathbf{x}$ to 1, that is, $[1, \mathbf{x}]^T$; $\mathbf{q}_k$ is the parameter matrix for the $k$th fuzzy rule and is given by

\[ \mathbf{q}_k = \begin{bmatrix} q_{k0,1} & \cdots & q_{k0,N_y} \\ \vdots & \cdots & \vdots \\ q_{kN_x,1} & \cdots & q_{kN_x,N_y} \end{bmatrix}_{(N_x+1)\times N_y}. \tag{21.15} \]

The system output denoted by Eq. (21.13) can be rewritten as

\[ \hat{\mathbf{y}} = \mathbf{H}\mathbf{Q}, \tag{21.16} \]

where $\mathbf{Q}$ is the consequent parameter matrix collecting the parameters of all $N_h$ fuzzy rules and $\mathbf{H}$ is the vector representing the normalized firing strengths of the rules. The two parameters


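are given by

\[ \mathbf{Q} = \begin{bmatrix} q_{10,1} & \cdots & q_{10,N_y} \\ \vdots & \ddots & \vdots \\ q_{1N_x,1} & \cdots & q_{1N_x,N_y} \\ \vdots & \cdots & \vdots \\ q_{N_h 0,1} & \cdots & q_{N_h 0,N_y} \\ \vdots & \cdots & \vdots \\ q_{N_h N_x,1} & \cdots & q_{N_h N_x,N_y} \end{bmatrix}_{N_h(N_x+1)\times N_y} \tag{21.17} \]

and

\[ \mathbf{H} = \left[ \mathbf{x}_e^T \frac{R_1(\mathbf{x})}{\sum_{k=1}^{N_h} R_k(\mathbf{x})}, \; \ldots, \; \mathbf{x}_e^T \frac{R_{N_h}(\mathbf{x})}{\sum_{k=1}^{N_h} R_k(\mathbf{x})} \right]_{1\times N_h(N_x+1)}. \tag{21.18} \]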
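The five layers above can be sketched in a few lines of plain Python. This is a minimal illustration of Eqs. (21.11), (21.13), and (21.14), not the authors' implementation; the function names `firing_strengths` and `esafis_output`, and the list-of-rows layout of each `q[k]`, are assumptions made here for readability.

```python
import math

def firing_strengths(x, centers, widths):
    """Layer 3: R_k(x) = exp(-||x - mu_k||^2 / sigma_k^2), Eq. (21.11)."""
    strengths = []
    for mu, sigma in zip(centers, widths):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
        strengths.append(math.exp(-sq_dist / sigma ** 2))
    return strengths

def esafis_output(x, centers, widths, q):
    """Layers 4-5: normalized weighted sum of first-order consequents.

    q[k] is the (N_x + 1) x N_y consequent matrix of rule k, stored as a
    list of rows (an assumed layout, chosen only for this sketch).
    """
    r = firing_strengths(x, centers, widths)
    total = sum(r)                      # assumed nonzero (at least one rule fires)
    xe = [1.0] + list(x)                # extended input [1, x]
    n_out = len(q[0][0])
    y = [0.0] * n_out
    for rk, qk in zip(r, q):
        for j in range(n_out):
            a_kj = sum(xe[i] * qk[i][j] for i in range(len(xe)))  # a_k = xe^T q_k
            y[j] += (rk / total) * a_kj
    return y

# Two rules with constant consequents 1 and 3; the input is equidistant
# from both centers, so each rule contributes with weight 0.5.
y = esafis_output([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0],
                  [[[1.0], [0.0], [0.0]], [[3.0], [0.0], [0.0]]])
```

Computing the output rule by rule, as here, is equivalent to forming $\mathbf{H}$ and $\mathbf{Q}$ explicitly and multiplying them as in Eq. (21.16).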

21.2.3. Learning Algorithm of ESAFIS

The learning algorithm of ESAFIS involves the online addition and removal of fuzzy rules and the adjustment of the consequent parameters. SAFIS requires the input distribution as a priori knowledge; to avoid this difficulty, ESAFIS uses the concept of the modified influence of a fuzzy rule to add and remove rules during learning. The modified influence represents the statistical contribution of the rule to the output over a limited number of samples. Suppose that the input observations are generated randomly and arrive with the same sampling distribution; the modified influence $E_{\mathrm{minf}}(k)$ can then be estimated from the $M$ most recently received training samples, provided the memory factor $M$ is reasonably large. It is given by

\[ E_{\mathrm{minf}}(k) = \sum_{l=n-M+1}^{n} \frac{R_k(\mathbf{x}_l)\,\|\mathbf{x}_{le}^T \mathbf{q}_k\|}{M \sum_{k=1}^{N_h} R_k(\mathbf{x}_l)}. \tag{21.19} \]
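As a hedged illustration of Eq. (21.19), the sketch below estimates the modified influence of rule $k$ from a window of recent inputs. The function name `modified_influence` and the data layout (each `q[k]` stored as a list of rows, the window passed as a plain list) are assumptions made for this example, and the Euclidean norm is assumed for the consequent magnitude.

```python
import math

def modified_influence(k, recent_inputs, centers, widths, q):
    """Estimate E_minf(k) over the M most recent inputs, Eq. (21.19)."""
    M = len(recent_inputs)
    total = 0.0
    for x in recent_inputs:
        # Firing strength of every rule for this sample, Eq. (21.11).
        r = [math.exp(-sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) / s ** 2)
             for mu, s in zip(centers, widths)]
        xe = [1.0] + list(x)                       # extended input [1, x]
        a_k = [sum(xe[i] * q[k][i][j] for i in range(len(xe)))
               for j in range(len(q[k][0]))]       # consequent xe^T q_k
        norm_a = math.sqrt(sum(v * v for v in a_k))
        total += r[k] * norm_a / sum(r)
    return total / M

# With a single rule the normalized firing strength is always 1, so the
# influence reduces to the mean magnitude of the consequent (here 2).
e_minf = modified_influence(0, [[0.0], [0.3], [1.0]], [[0.0]], [1.0],
                            [[[2.0], [0.0]]])
```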

The sequential learning process of ESAFIS is briefly summarized below; for details, see [7]. ESAFIS begins with no fuzzy rules. As the observations $(\mathbf{x}_n, \mathbf{y}_n)$ ($n$ is the time index) are received sequentially during learning, fuzzy rules are added or deleted. A new fuzzy rule $N_h + 1$ is added if the following two criteria are satisfied, viz., the distance criterion and the modified influence of the newly added fuzzy rule $N_h + 1$:

\[ \begin{cases} \|\mathbf{x}_n - \boldsymbol{\mu}_{nr}\| > \epsilon_n \\[4pt] E_{\mathrm{minf}}(N_h + 1) = \|\mathbf{e}_n\| \displaystyle\sum_{l=n-M+1}^{n} \frac{R_{N_h+1}(\mathbf{x}_l)}{M \sum_{k=1}^{N_h+1} R_k(\mathbf{x}_l)} > e_g \end{cases}, \tag{21.20} \]


where $\epsilon_n$ and $e_g$ are thresholds to be selected appropriately, $\mathbf{x}_n$ is the latest input data, and $\boldsymbol{\mu}_{nr}$ is the center of the fuzzy rule nearest to $\mathbf{x}_n$. $e_g$ is the growing threshold and is chosen according to the desired learning accuracy. $\mathbf{e}_n = \mathbf{y}_n - \hat{\mathbf{y}}_n$, where $\mathbf{y}_n$ is the true value and $\hat{\mathbf{y}}_n$ is the approximated value. $\epsilon_n$ is the distance threshold, which decays exponentially and is given as

\[ \epsilon_n = \max\{\epsilon_{\max} \times \gamma^n, \; \epsilon_{\min}\}, \tag{21.21} \]

where $\epsilon_{\max}$ and $\epsilon_{\min}$ are the largest and the smallest lengths of interest and $\gamma$ is the decay constant. Initially, the threshold equals the largest length of interest in the input space, which allows a few fuzzy rules to learn the system coarsely; it then decreases exponentially to the smallest length of interest, which allows more fuzzy rules to learn the system finely. When a new fuzzy rule $N_h + 1$ is added, its antecedent and consequent parameters are assigned as follows:

\[ \begin{cases} q_{(N_h+1)0,j} = e_{nj}, & j = 1, 2, \ldots, N_y \\ q_{(N_h+1)i,j} = 0, & i = 1, 2, \ldots, N_x \\ \boldsymbol{\mu}_{N_h+1} = \mathbf{x}_n \\ \sigma_{N_h+1} = \kappa \|\mathbf{x}_n - \boldsymbol{\mu}_{nr}\| \end{cases}, \tag{21.22} \]

where $e_{nj}$ is the $j$th element of the system output error vector $\mathbf{e}_n$ ($= \mathbf{y}_n - \hat{\mathbf{y}}_n$) and $\kappa$ is an overlap factor that determines the overlap of fuzzy rules in the input space. When no new rule is added, the consequent parameters of all existing rules ($\mathbf{Q}$) are modified based on an RLSE method. However, when this method is applied to controller design, problems arise concerning the stability of the overall scheme and the convergence of the approximation error. Thus, stable parameter tuning laws are derived here for ESAFIS from Lyapunov stability theory, in place of the RLSE method, to guarantee the stability of the overall system. These laws are introduced in the following sections. After parameter adjustment, the influence of the existing rules on the system output changes, and thus all rules need to be checked for possible pruning. If the modified influence of a rule is less than a certain pruning threshold $e_p$, the rule is insignificant to the output and should be removed. Given a properly selected pruning threshold $e_p$, the $k$th fuzzy rule is removed if $E_{\mathrm{minf}}(k) < e_p$. When the $k$th rule is removed, the number of fuzzy rules is reduced to $N_h - 1$, and all the parameters related to this rule, including the center $\boldsymbol{\mu}_k$, width $\sigma_k$, and consequence $\mathbf{q}_k$, are removed from the related matrices to suit the reduced system. The dynamical learning capability of ESAFIS can be used as the basis for learning in adaptive control. Generally, adaptive control approaches can be classified


into two categories, namely, indirect and direct adaptive control schemes. In direct adaptive control, the control law u ∗ is approximated directly without explicitly attempting to determine the model of the dynamics f (x). Unlike the direct adaptive control strategy, in the indirect control strategy, ESAFIS is used to identify the nonlinear motion dynamics f (x) and then the fuzzy control law is implemented using the identified model. To fully depict the potential of ESAFIS in the design of the adaptive controller, in the following section, we focus on showing how to achieve stable indirect adaptive control for the wing-rock problem when ESAFIS is used as an online approximator.
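The rule growing and pruning steps of Section 21.2.3 can be sketched as follows. This is an illustrative reading of Eqs. (21.20) to (21.22) and of the pruning test $E_{\mathrm{minf}}(k) < e_p$, not the reference implementation; the function names are hypothetical, the default values are those quoted later in Section 21.2.5, and the helpers assume that at least one rule already exists (how the very first rule is created is not covered by this sketch).

```python
import math

def distance_threshold(n, eps_max=1.0, eps_min=0.1, gamma=0.99):
    """epsilon_n = max(eps_max * gamma**n, eps_min), Eq. (21.21)."""
    return max(eps_max * gamma ** n, eps_min)

def nearest_distance(x, centers):
    """Distance from x to the nearest rule center mu_nr."""
    return min(math.dist(x, mu) for mu in centers)

def growth_criteria_met(x, centers, n, influence_new, e_g=0.01):
    """Both criteria of Eq. (21.20); influence_new is E_minf(N_h + 1)."""
    return (nearest_distance(x, centers) > distance_threshold(n)
            and influence_new > e_g)

def new_rule(x, err, centers, kappa=5.0):
    """Parameters of the added rule, Eq. (21.22).

    The bias row of q is the current output error, all slopes are zero,
    the center is the current input, and the width is kappa times the
    distance to the nearest existing center.
    """
    sigma = kappa * nearest_distance(x, centers)
    q = [list(err)] + [[0.0] * len(err) for _ in x]
    return list(x), sigma, q

def rules_to_keep(influences, e_p=0.001):
    """Indices of rules whose modified influence is not below e_p."""
    return [k for k, e in enumerate(influences) if e >= e_p]

mu, sigma, q = new_rule([1.0, 0.0], [0.5], [[0.0, 0.0]])
```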

21.2.4. Design Procedure of Indirect Adaptive Fuzzy Control Scheme

The proposed indirect adaptive fuzzy control scheme is illustrated in Figure 21.3, where two controllers are incorporated, viz., a fuzzy controller $u_f$ and a sliding controller $u_s$. The fuzzy control term $u_f$ is constructed using ESAFIS to learn the system dynamics $f(\mathbf{x})$ and is given as

\[ u_f = \ddot{\phi}_d - \hat{f}(\mathbf{x}) + k_1 \dot{e} + k_2 e, \tag{21.23} \]

where $\hat{f}(\mathbf{x})$ is the approximated value from ESAFIS and is represented as

\[ \hat{f}(\mathbf{x}) = \mathbf{H}\mathbf{Q}_I. \tag{21.24} \]

According to the universal approximation theorem of fuzzy systems [31], there exists an optimal fuzzy system estimator to approximate the nonlinear dynamic

Figure 21.3: The proposed indirect adaptive fuzzy control scheme.


function $f(\mathbf{x})$ in Eq. (21.6) such that

\[ f(\mathbf{x}) = \mathbf{H}\mathbf{Q}_I^* + \varepsilon_I, \tag{21.25} \]

where $\varepsilon_I$ is the approximation error and $\mathbf{Q}_I^*$ denotes the optimal parameters, defined as follows:

\[ \mathbf{Q}_I^* = \arg\min_{\mathbf{Q}_I} \left[ \sup_{\mathbf{x} \in M_x} \left\| f(\mathbf{x}) - \hat{f}(\mathbf{x}) \right\| \right], \tag{21.26} \]

where $\|\cdot\|$ represents the two-norm of a vector, and $M_x$ is the predefined compact set of the input vector $\mathbf{x}$. According to approximation theory, the inherent approximation error $\varepsilon_I$ can be made arbitrarily small by increasing the number of fuzzy rules. In ESAFIS, the fuzzy system starts with no fuzzy rules. Fuzzy rules are dynamically recruited according to their significance to system performance and the complexity of the mapped system. Thus, it is reasonable to assume that $\varepsilon_I$ is bounded by the constant $\bar{\varepsilon}_I$, i.e., $|\varepsilon_I| \le \bar{\varepsilon}_I$. To compensate for approximation errors in representing the actual nonlinear dynamics by ESAFIS with ideal parameter values, a sliding mode controller $u_s$ is used to augment the fuzzy controller above, and thus the overall control law is

\[ u = u_f + u_s. \tag{21.27} \]

On the basis of Eqs. (21.23), (21.24), (21.25), and (21.27), the tracking error of Eq. (21.6) becomes

\[ \dot{\mathbf{e}} = \boldsymbol{\Lambda}\mathbf{e} + \mathbf{b}\left[\mathbf{H}\tilde{\mathbf{Q}}_I - \varepsilon_I - u_s\right], \tag{21.28} \]

where $\tilde{\mathbf{Q}}_I$ ($= \mathbf{Q}_I - \mathbf{Q}_I^*$) is the parameter error with dimension $3N_h \times 1$ and $\mathbf{H}$ is the normalized firing strength with dimension $1 \times 3N_h$. $\boldsymbol{\Lambda}$ represents the $2 \times 2$ controllable canonical form matrix

\[ \boldsymbol{\Lambda} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ -k_2 & -k_1 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ -k_2 & -k_1 \end{bmatrix}. \tag{21.29} \]

During the modeling process of ESAFIS, the fuzzy rules and antecedent parameters are determined by the learning algorithm described above. In place of the RLSE method, the parameter $\mathbf{Q}_I$ is updated here using the following stable adaptive laws, which ensure the stability of the entire system.

Theorem 21.1. Consider the wing-rock dynamics represented by Eq. (21.5). If the indirect fuzzy control law is designed as Eq. (21.23), the parameter update law


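and the sliding mode control law are designed as in Eq. (21.30), the stability of the proposed indirect fuzzy control system can be assured:

\[ \dot{\mathbf{Q}}_I^T = -\eta_I \mathbf{e}^T \mathbf{P} \mathbf{b} \mathbf{H}, \qquad u_s = \bar{\varepsilon}_I \, \mathrm{sgn}(\mathbf{e}^T \mathbf{P} \mathbf{b}), \tag{21.30} \]

where

\[ \mathrm{sgn}(\mathbf{e}^T \mathbf{P} \mathbf{b}) = \begin{cases} 1, & \mathbf{e}^T \mathbf{P} \mathbf{b} > 0 \\ -1, & \mathbf{e}^T \mathbf{P} \mathbf{b} < 0. \end{cases} \tag{21.31} \]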

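Proof. Consider the following Lyapunov function candidate:

\[ V = \frac{1}{2}\mathbf{e}^T \mathbf{P} \mathbf{e} + \frac{1}{2\eta_I} \tilde{\mathbf{Q}}_I^T \tilde{\mathbf{Q}}_I, \tag{21.32} \]

where $\mathbf{P}$ is a symmetric positive definite matrix and $\eta_I$ is a positive constant that appears in the adaptation law and is referred to as the learning rate. Using Eq. (21.28), the derivative of the Lyapunov function is given as

\[ \begin{aligned} \dot{V} &= \frac{1}{2}\left[\dot{\mathbf{e}}^T \mathbf{P} \mathbf{e} + \mathbf{e}^T \mathbf{P} \dot{\mathbf{e}}\right] + \frac{1}{\eta_I} \dot{\tilde{\mathbf{Q}}}_I^T \tilde{\mathbf{Q}}_I \\ &= \frac{1}{2}\mathbf{e}^T\left(\boldsymbol{\Lambda}^T \mathbf{P} + \mathbf{P}\boldsymbol{\Lambda}\right)\mathbf{e} + \left(\mathbf{e}^T \mathbf{P} \mathbf{b} \mathbf{H} + \frac{1}{\eta_I} \dot{\tilde{\mathbf{Q}}}_I^T\right)\tilde{\mathbf{Q}}_I - \mathbf{e}^T \mathbf{P} \mathbf{b}\,\varepsilon_I - \mathbf{e}^T \mathbf{P} \mathbf{b}\,u_s \\ &= -\frac{1}{2}\mathbf{e}^T \mathbf{J} \mathbf{e} + \left(\mathbf{e}^T \mathbf{P} \mathbf{b} \mathbf{H} + \frac{1}{\eta_I} \dot{\tilde{\mathbf{Q}}}_I^T\right)\tilde{\mathbf{Q}}_I - \mathbf{e}^T \mathbf{P} \mathbf{b}\,\varepsilon_I - \mathbf{e}^T \mathbf{P} \mathbf{b}\,u_s, \end{aligned} \tag{21.33} \]

where $\mathbf{J} = -(\boldsymbol{\Lambda}^T \mathbf{P} + \mathbf{P}\boldsymbol{\Lambda})$. $\mathbf{J}$ is a symmetric positive definite matrix selected by the user. Since $\mathbf{Q}_I^*$ is constant, $\dot{\tilde{\mathbf{Q}}}_I = \dot{\mathbf{Q}}_I$, so the update law in Eq. (21.30) cancels the term multiplying $\tilde{\mathbf{Q}}_I$, and Eq. (21.33) becomes

\[ \dot{V} = -\frac{1}{2}\mathbf{e}^T \mathbf{J} \mathbf{e} - \mathbf{e}^T \mathbf{P} \mathbf{b}\,\varepsilon_I - \mathbf{e}^T \mathbf{P} \mathbf{b}\,\bar{\varepsilon}_I \,\mathrm{sgn}(\mathbf{e}^T \mathbf{P} \mathbf{b}). \tag{21.34} \]

Equation (21.34) further satisfies the following condition:

\[ \dot{V} \le -\frac{1}{2}\mathbf{e}^T \mathbf{J} \mathbf{e} + |\mathbf{e}^T \mathbf{P} \mathbf{b}|\,|\varepsilon_I| - \mathbf{e}^T \mathbf{P} \mathbf{b}\,\bar{\varepsilon}_I \,\mathrm{sgn}(\mathbf{e}^T \mathbf{P} \mathbf{b}). \tag{21.35} \]

Since $|\mathbf{e}^T \mathbf{P} \mathbf{b}| = (\mathbf{e}^T \mathbf{P} \mathbf{b})\,\mathrm{sgn}(\mathbf{e}^T \mathbf{P} \mathbf{b})$, Eq. (21.35) becomes

\[ \dot{V} \le -\frac{1}{2}\mathbf{e}^T \mathbf{J} \mathbf{e} + |\mathbf{e}^T \mathbf{P} \mathbf{b}|\left(|\varepsilon_I| - \bar{\varepsilon}_I\right). \tag{21.36} \]

Notice that

\[ |\mathbf{e}^T \mathbf{P} \mathbf{b}|\left(|\varepsilon_I| - \bar{\varepsilon}_I\right) \le 0. \tag{21.37} \]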


So,

\[ \dot{V} \le -\frac{1}{2}\mathbf{e}^T \mathbf{J} \mathbf{e} \le 0. \tag{21.38} \]

The above equation shows that $\dot{V}$ is negative semidefinite, and thus the overall control scheme is guaranteed to be stable. Using Barbalat's lemma [32, 33], it can be seen that $\mathbf{e} \to 0$ as $t \to \infty$. Thus, the tracking error of the system will converge to zero. $\square$
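To make the quantities in Theorem 21.1 concrete, the sketch below solves the Lyapunov equation $\boldsymbol{\Lambda}^T \mathbf{P} + \mathbf{P}\boldsymbol{\Lambda} = -\mathbf{J}$ in closed form for the $2 \times 2$ matrix of Eq. (21.29) and performs one discrete step of the laws in Eq. (21.30). Several things here are assumptions rather than statements from the text: $\mathbf{b} = [0, 1]^T$, the choice $\mathbf{J} = \mathbf{I}$ in the usage lines, the forward-Euler discretization of the update law, the convention $\mathrm{sgn}(0) = 0$, and the numerical values of `eps_bar`, `eta`, and `dt`.

```python
def solve_lyapunov_2x2(k1, k2, J):
    """Solve Lambda^T P + P Lambda = -J for Lambda = [[0, 1], [-k2, -k1]].

    J must be symmetric; the three scalar equations are solved directly.
    """
    p12 = J[0][0] / (2.0 * k2)               # (1,1) entry: -2*k2*p12 = -J11
    p22 = (p12 + J[1][1] / 2.0) / k1         # (2,2) entry: 2*(p12 - k1*p22) = -J22
    p11 = k1 * p12 + k2 * p22 - J[0][1]      # off-diagonal entry
    return [[p11, p12], [p12, p22]]

def sgn(v):
    """Sign function of Eq. (21.31); sgn(0) = 0 is an assumed convention."""
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def control_step(e, H, Q, P, eps_bar, eta, dt):
    """One Euler step of the adaptation law and the sliding term, Eq. (21.30).

    e = [e, e_dot]; H and Q are flat lists of equal length.
    """
    s = e[0] * P[0][1] + e[1] * P[1][1]      # e^T P b with b = [0, 1]^T (assumed)
    u_s = eps_bar * sgn(s)
    Q_new = [qi - eta * s * hi * dt for qi, hi in zip(Q, H)]
    return u_s, Q_new

# Gains k1 = 2, k2 = 1 as in Section 21.2.5; J = I is an illustrative choice.
P = solve_lyapunov_2x2(2.0, 1.0, [[1.0, 0.0], [0.0, 1.0]])
u_s, Q_new = control_step([1.0, 0.0], [0.5, 0.5], [0.0, 0.0], P,
                          eps_bar=0.1, eta=2.0, dt=0.01)
```

In a continuous-time implementation the discontinuous `sgn` term is a common source of chattering, which is why the sliding gain is tied to the error bound $\bar{\varepsilon}_I$ rather than chosen arbitrarily large.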

21.2.5. Performance Evaluation

In this section, the proposed indirect adaptive fuzzy control scheme is evaluated by controlling the wing-rock motion under $\alpha = 15^\circ$. The direct adaptive fuzzy control scheme using ESAFIS is also considered for comparison. Following the guidelines described in [7, 27], the ESAFIS parameters for both the indirect and the direct adaptive control schemes are chosen as $\gamma = 0.99$, $\epsilon_{\max} = 1.0$, $\epsilon_{\min} = 0.1$, $\kappa = 5.0$, $e_g = 0.01$, $e_p = 0.001$, and $M = 40$. To stabilize the system, the coefficients in Eq. (21.9) are chosen as $k_1 = 2$ and $k_2 = 1$. Two initial simulation conditions, viz., the small initial condition ($\phi(0) = 11.46$ degree, $\dot{\phi}(0) = 2.865$ degree/s) and the large initial condition ($\phi(0) = 103.14$ degree, $\dot{\phi}(0) = 57.3$ degree/s), are used to investigate the effectiveness of the proposed control scheme. The reference trajectory vector is chosen as $(\phi_d, \dot{\phi}_d)^T = (0, 0)^T$. In addition, control schemes based on the RBF network [34] and the Takagi–Sugeno (TS) fuzzy system [35] are applied to the wing-rock system for comparison. The simulation results of the roll angle $\phi$ and the roll rate $\dot{\phi}$ from the indirect control schemes are illustrated in Figures 21.4 and 21.5, where the small and large

[Figure 21.4 plots: roll angle φ (degree) versus time (s) for Indirect (ESAFIS), Direct (ESAFIS), RBF, and the reference φd. Panels: (a) φ(0) = 11.46 degree; (b) φ(0) = 103.14 degree.]

Figure 21.4: Roll angle for α = 15◦ under different initial states.


(a) φ′ (0)=2.865 degree/s

(b) φ′ (0)=57.3 degree/s

Figure 21.5: Roll rate for α = 15◦ under different initial states.

initial conditions are included. For comparison, the simulation results from the RBF network and the zero reference signals are also presented in the two figures. The results show that, in the proposed adaptive fuzzy control scheme, the difference between the system output and the reference output is large at the beginning, but the system remains stable and the system output asymptotically converges to the desired trajectory. Convergence takes about 5 s regardless of the initial condition. This is much faster than in the uncontrolled condition and also greatly decreases the maneuverability burden caused by the undesired rolling motion. Compared with the RBF network, the proposed fuzzy control scheme achieves faster error convergence and better transient results under both the small and the large initial conditions. In addition, the simulation results from a similar work using the TS fuzzy system with a fixed system structure [35] are listed here to evaluate the performance of the proposed control scheme using ESAFIS. Figure 21.6 depicts the roll angle and roll rate achieved in the indirect control scheme under the small initial condition. From the two panels, one can see that the TS fuzzy system converges faster than ESAFIS. However, ESAFIS has smaller roll rate errors than the TS fuzzy system during the initial learning process. Table 21.2 gives an overview of the comparison results from the three algorithms based on the root mean square error (RMSE) between the reference and the actual values, as done in [30, 36, 37]. The $E_{\phi}$ and $E_{\dot{\phi}}$ shown in the table represent the RMSEs of the roll angle and roll rate, while $E_{\mathrm{ave}}$ is the average of both. It can be seen that the RMSEs of ESAFIS are lower than those of the RBF network and the TS fuzzy system.
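The error measures used in the comparison can be written down directly. The sketch below is a plain illustration of the RMSE and of the average $E_{\mathrm{ave}}$ described above; the function names `rmse` and `average_rmse` are chosen here and do not come from the source.

```python
import math

def rmse(reference, actual):
    """Root mean square error between two equally long sequences."""
    n = len(reference)
    return math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, actual)) / n)

def average_rmse(e_phi, e_phi_dot):
    """E_ave: the mean of the roll-angle and roll-rate RMSEs."""
    return 0.5 * (e_phi + e_phi_dot)

# Consistency check against the ESAFIS row for the small initial condition:
# the mean of 2.85 and 1.90 is 2.375, reported as 2.38 in Table 21.2.
e_ave = average_rmse(2.85, 1.90)
```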
Using the ESAFIS learning algorithm, the fuzzy rules are generated automatically and their time evolution for small and large initial conditions over the whole learning process is depicted in Figures 21.7(a) and (b). One can clearly note that in


(a) φ (0)=11.46 degree


(b) φ′ (0)=2.865 degree/s

Figure 21.6: Roll angle and roll rate for α = 15◦ under small initial states in the indirect control scheme.

Table 21.2: Comparison results with different algorithms for α = 15◦.

Algorithm    Small initial condition          Large initial condition
             E_φ      E_φ˙     E_ave          E_φ      E_φ˙      E_ave
ESAFIS       2.85     1.90     2.38           21.27    20.48     20.88
TS           6.08     20.29    13.19          74.64    312.19    193.42
RBF          15.34    34.58    24.96          52.06    80.48     66.27

[Figure 21.7 plots: number of rules versus time (s) for Indirect (ESAFIS), Direct (ESAFIS), and RBF. Panels: (a) (φ, φ˙)T = (11.46, 2.865)T; (b) (φ, φ˙)T = (103.14, 57.3)T.]

Figure 21.7: Number of rules/neurons for α = 15◦ under different initial states.


Table 21.3: Influence of number of rules on tracking performance for α = 15◦.

Parameter e_g    No. of Rules    E_φ     E_φ˙    E_ave
0.0001           3               2.85    1.90    2.38
0.001            3               2.85    1.90    2.38
0.01             2               2.85    1.90    2.38

[Figure 21.8 plots: norm of consequent parameters versus time (s) for Indirect (ESAFIS) and Direct (ESAFIS). Panels: (a) (φ, φ˙)T = (11.46, 2.865)T; (b) (φ, φ˙)T = (103.14, 57.3)T.]

Figure 21.8: Norm of consequent parameters for α = 15◦ under different initial states.

this simulation study, the number of fuzzy rules required by ESAFIS is smaller than the number of neurons required by the RBF network and the number of rules required by the TS fuzzy system. Thus, compared with the RBF network and the TS fuzzy system, ESAFIS produces a more compact system structure and also avoids determining the fuzzy rules by trial and error before learning. To illustrate the influence of the number of rules on the tracking performance, simulations are run with different numbers of rules obtained by varying the parameters. Table 21.3 gives the tracking RMSEs under different numbers of rules as the parameter $e_g$ varies. The table shows that the number of rules varies little as $e_g$ changes over the range [0.0001, 0.01], and that the tracking performance remains almost the same as the number of rules changes. This suggests that the proposed control schemes are robust to the number of rules and the design parameters. Figures 21.8(a) and (b) illustrate that the parameters $\mathbf{Q}_I$ and $\mathbf{Q}_D$ for the small and large initial conditions are bounded throughout the control process. This guarantees the boundedness of the control signals under the two initial conditions, which is depicted in Figure 21.9. The figure further shows that the control efforts for the small and large initial conditions are smooth, without high-frequency chattering, during the whole control process.


[Figure 21.9 plots: control input u versus time (s) for Indirect (ESAFIS) and Direct (ESAFIS). Panels: (a) (φ, φ˙)T = (11.46, 2.865)T; (b) (φ, φ˙)T = (103.14, 57.3)T.]

Figure 21.9: Control input u for α = 15◦ under different initial states.

21.3. Meta-cognitive Fuzzy-Neural Control of the Thrust Active Magnetic Bearing System

Thrust active magnetic bearing (TAMB) systems can suspend the rotor and move it to predefined positions by means of a controlled electromagnetic force, at high rotational speed and without mechanical friction. Thus, they are widely used in industrial applications, e.g., reaction wheels for artificial satellites, gas turbine engines, turbomolecular pumps, blood pumps, etc. Moreover, compared with conventional ball bearings or sliding bearings, the TAMB system offers many practical advantages, such as low noise, long service life, no need for lubrication, low energy consumption, and light mechanical wear. In most real-world applications of the TAMB system, the controlled rotor must be positioned and moved precisely to cope with different operating demands and environments. Since TAMBs are open-loop unstable systems, the design of an active control system is essential for the position and tracking control of the rotor. In practice, the TAMB system suffers from inherent nonlinearities in its electromechanical and motion dynamics [39], which have to be considered during the controller design process. An indirect adaptive meta-cognitive fuzzy-neural control approach is proposed to control the rotor position of a nonlinear TAMB system with modeling uncertainties and external disturbances. In the current literature [14, 15, 17, 18, 40–42], the parameters in all the rules need to be adjusted, and thus the number of adjustable parameters depends on the number of input states and the number of fuzzy rules. In general, the number of fuzzy rules needs to be large enough to improve the approximation accuracy. This imposes a large online computational burden and results in a slow response time for real-time control applications.
In the proposed meta-cognitive fuzzy-neural control approach, this problem can be


alleviated by only adjusting the parameters of the rule nearest to the current input. The size of the adjustable parameters is independent of the size of the fuzzy rules and much lower than those of the existing work. Moreover, to guarantee the stability of the controlled system, projection-type adaptation laws for the parameters of the nearest rule are determined from the Lyapunov function. The zero convergence of tracking errors can be achieved according to the stability proof.

21.3.1. Problem Formulation

The considered TAMB system [43, 44] is given by

\[ \ddot{z}(t) = a(\mathbf{z}; t)\dot{z} + b(\mathbf{z}; t) z + g(\mathbf{z}; t) u(t) + d(\mathbf{z}; t), \tag{21.39} \]

where $a(\mathbf{z}; t) = -(c(t)/m)$, $b(\mathbf{z}; t) = Q_z(t)/m$, $g(\mathbf{z}; t) = Q_i(t)/m$, $d(\mathbf{z}; t) = f_{dz}(t)/m$, and $\mathbf{z} = [z \;\; \dot{z}]$. $f_{dz}(t)$ denotes system disturbances and is set as a combination of a small sinusoidal term $0.01\sin(2t) + 0.01$ and a random value in the range $[-0.001, 0.001]$. $u(t)$ represents the control current, $m$ is the mass of the rotor, and $z$ and $\dot{z}$ represent the axial position and velocity of the rotor, respectively. $c(t)$ is the friction coefficient, and $Q_z(t)$ and $Q_i(t)$ are the displacement and current stiffness parameters, respectively. These parameters vary with time in real applications. For the simulation studies, they are assumed to suffer from the following perturbations: $c(t) = c_0 (1 + 0.3\sin(2t))$, $Q_z(t) = Q_{z0} (1 + 0.2\sin(t))$, $Q_i(t) = Q_{i0} (1 + 0.3\cos(t))$. The TAMB is a typical second-order dynamic system that can always be represented within a general class of uncertain systems of order $n$:

\[ x^{(n)} = f(\mathbf{x}) + g(\mathbf{x}) u + d(\mathbf{x}), \tag{21.40} \]

where $u$ represents the control input. $\mathbf{x}$ ($= [x, \dot{x}, \ldots, x^{(n-1)}]^T \in \mathbb{R}^n$) denotes the system state and is assumed to be measurable. $f(\mathbf{x})$ and $g(\mathbf{x})$ are assumed to be unknown and bounded real continuous functions. $d(\mathbf{x})$ represents the unknown but bounded modeling errors and external disturbances. Let $x_1 = x$, $x_2 = \dot{x} = \dot{x}_1$, $x_3 = \ddot{x} = \dot{x}_2$, $\cdots$, and $x_n = x^{(n-1)} = \dot{x}_{n-1}$; then Eq. (21.40) becomes

\[ \begin{cases} \dot{x}_i = x_{i+1}, & i = 1, 2, \ldots, n-1 \\ \dot{x}_n = f(\mathbf{x}) + g(\mathbf{x}) u + d(\mathbf{x}) \end{cases} \tag{21.41} \]


where $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n$. Define the tracking error $\mathbf{e} = \mathbf{x} - \mathbf{x}_d$, with the reference command $\mathbf{x}_d = [x_{1d}, x_{2d}, \ldots, x_{nd}]^T$ and the tracking error vector $\mathbf{e} = [e_1, e_2, \ldots, e_n]^T$. The control objective of the proposed controller $u$ is to force the state $\mathbf{x}$ to follow the specified desired trajectory $\mathbf{x}_d$ as closely as possible.

Assumption 1: The reference signals and their continuous first derivatives $\mathbf{x}_d$, $\dot{\mathbf{x}}_d$ are bounded and available.

Assumption 2: $g(\mathbf{x})$ is invertible for all $\mathbf{x} \in \mathbb{R}^n$ and satisfies $g(\mathbf{x}) > \gamma_0$, with $\gamma_0 > 0$, $\gamma_0 \in \mathbb{R}$.

As pointed out in [45], Assumption 2 is a strong restriction for many physical systems and rarely holds over the whole space. The assumption can be relaxed to hold in a compact subset, as in [45]. The proposed control approach is applicable if Assumption 2 is satisfied only in a compact subset. The adaptive backstepping technique is introduced here for developing the nonlinear controller. With backstepping, the control input $u$ is constructed step by step based on a Lyapunov function to guarantee global stability. First, $x_{2d}$ to $x_{nd}$ are treated as fictitious control signals. Each $x_{(i+1)d}$ ($i = 1, 2, \ldots, n-1$) is designed with the purpose of reducing the error $e_i = x_i - x_{id}$ in the previous design stage. The derivative of the error $e_i$ in the $i$th step is equal to

\[ \dot{e}_i = \dot{x}_i - \dot{x}_{id} = x_{i+1} - \dot{x}_{id} = x_{(i+1)d} + e_{i+1} - \dot{x}_{id}. \tag{21.42} \]

The virtual control $x_{(i+1)d}$ is defined as

\[ x_{(i+1)d} = -k_i e_i + \dot{x}_{id} - e_{i-1}, \tag{21.43} \]

where $k_i$ is a positive constant. After the fictitious controller $x_{nd}$ is designed, the ideal controller $u^*$ is designed to force the error $e_n = x_n - x_{nd}$ to be as small as possible. The ideal controller $u^*$ is chosen as

\[ u^* = g^{-1}(\mathbf{x})\left(-f(\mathbf{x}) - d(\mathbf{x}) - k_n e_n + \dot{x}_{nd} - e_{n-1}\right). \tag{21.44} \]

A Lyapunov function $V_1 = \sum_{i=1}^{n} \frac{1}{2} e_i^2$ is defined and then, according to Eqs. (21.42)–(21.44), the derivative of $V_1$ is

\[ \dot{V}_1 = -\sum_{i=1}^{n} k_i e_i^2 \le 0. \tag{21.45} \]
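For the second-order case $n = 2$, which is the one relevant to the TAMB, the step from Eqs. (21.42)–(21.44) to Eq. (21.45) can be written out explicitly. The derivation below assumes that the term $e_{i-1}$ in Eq. (21.43) is absent for $i = 1$ (there is no $e_0$), which is the usual backstepping convention but is not stated in the text.

```latex
\begin{aligned}
\dot{e}_1 &= x_{2d} + e_2 - \dot{x}_{1d} = -k_1 e_1 + e_2, \\
\dot{e}_2 &= f(\mathbf{x}) + g(\mathbf{x})u^* + d(\mathbf{x}) - \dot{x}_{2d} = -k_2 e_2 - e_1, \\
\dot{V}_1 &= e_1\dot{e}_1 + e_2\dot{e}_2
           = -k_1 e_1^2 + e_1 e_2 - k_2 e_2^2 - e_1 e_2
           = -k_1 e_1^2 - k_2 e_2^2 .
\end{aligned}
```

The cross terms $\pm e_1 e_2$ cancel precisely because of the $-e_{i-1}$ term carried into each subsequent stage.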

By properly choosing ki > 0 (i = 1, 2, . . . , n), the stability of the system can be ensured and the tracking error will converge to zero. However, the exact


mathematical models of $f(\mathbf{x})$, $g(\mathbf{x})$, and $d(\mathbf{x})$ are hard to obtain, which makes the ideal controller $u^*$ in (21.44) unavailable. To solve this problem, an indirect meta-cognitive fuzzy-neural control $u_f$ is built using the meta-cognitive fuzzy-neural model to approximate $f(\mathbf{x}) + d(\mathbf{x})$ ($= f_d(\mathbf{x})$) and $g(\mathbf{x})$ separately and is given as

\[ u_f = \hat{g}^{-1}(\mathbf{x})\left[-\hat{f}_d(\mathbf{x}) - k_n e_n + \dot{x}_{nd} - e_{n-1}\right], \tag{21.46} \]

where $\hat{f}_d(\mathbf{x})$ and $\hat{g}(\mathbf{x})$ are the approximation models of the unknown nonlinear functions $f_d(\mathbf{x})$ and $g(\mathbf{x})$. Approximation errors arise in Eq. (21.46) when the meta-cognitive fuzzy-neural model is used to approximate $f_d(\mathbf{x})$ and $g(\mathbf{x})$. A sliding mode compensator $u_s$ is applied here to offset the approximation errors. The overall control input is designed as

\[ u = u_f + u_s. \tag{21.47} \]

With the control law $u$ in Eq. (21.47), the derivative of the error $e_n$, which does not affect the errors $e_i$ ($i = 1, 2, \ldots, n-1$), becomes

\[ \dot{e}_n = f_d(\mathbf{x}) - \hat{f}_d(\mathbf{x}) + \left(g(\mathbf{x}) - \hat{g}(\mathbf{x})\right)u_f + g(\mathbf{x})u_s - k_n e_n - e_{n-1}. \tag{21.48} \]

In the following section, the proposed indirect meta-cognitive fuzzy-neural control approach is described in detail. Before this, the meta-cognitive fuzzy-neural model used to approximate the functions $f_d(\mathbf{x})$ and $g(\mathbf{x})$ is presented.
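As a rough illustration of Eqs. (21.39) to (21.44) for $n = 2$, the sketch below simulates the ideal law $u^*$ on the TAMB model with forward-Euler integration. The nominal values `M_ROTOR`, `C0`, `QZ0`, `QI0`, the initial state, the gains, the step size, and the constant reference $x_{1d} = 0$ are all hypothetical choices made for this example; none of them comes from the source. The ideal law also assumes that $f$, $g$, and $d$ are known exactly, which the text notes is not the case in practice, so this is a baseline sketch rather than the proposed controller.

```python
import math
import random

# Hypothetical nominal values, chosen only for illustration (not from the source).
M_ROTOR, C0, QZ0, QI0 = 1.0, 1.0, 1.0, 1.0

def tamb_terms(t, z, z_dot, rng):
    """Return f(x), g(x), d(x) of Eq. (21.40) for the TAMB model, Eq. (21.39)."""
    c = C0 * (1 + 0.3 * math.sin(2 * t))       # friction coefficient c(t)
    qz = QZ0 * (1 + 0.2 * math.sin(t))         # displacement stiffness Q_z(t)
    qi = QI0 * (1 + 0.3 * math.cos(t))         # current stiffness Q_i(t)
    f_dz = 0.01 * math.sin(2 * t) + 0.01 + rng.uniform(-0.001, 0.001)
    f = -(c / M_ROTOR) * z_dot + (qz / M_ROTOR) * z
    return f, qi / M_ROTOR, f_dz / M_ROTOR

def simulate_ideal(k1=5.0, k2=5.0, dt=1e-3, t_end=10.0, seed=0):
    """Euler simulation of the ideal law u*, Eq. (21.44), with x_1d = 0."""
    rng = random.Random(seed)
    x1, x2, t = 0.5, 0.0, 0.0                  # hypothetical initial state
    for _ in range(int(t_end / dt)):
        f, g, d = tamb_terms(t, x1, x2, rng)
        e1 = x1                                # e_1 = x_1 - x_1d, x_1d = 0
        x2d = -k1 * e1                         # Eq. (21.43), i = 1, no e_0 term
        e2 = x2 - x2d
        x2d_dot = -k1 * x2                     # d/dt(-k1 * x1) = -k1 * x2
        u = (-f - d - k2 * e2 + x2d_dot - e1) / g
        x1, x2 = x1 + dt * x2, x2 + dt * (f + g * u + d)
        t += dt
    return x1, x2

final_position, final_velocity = simulate_ideal()
```

Because $u^*$ cancels $f$, $g$, and $d$ exactly, the closed loop reduces to the linear error dynamics $\dot{e}_1 = -k_1 e_1 + e_2$, $\dot{e}_2 = -k_2 e_2 - e_1$, so both states decay toward zero whatever plant values are used.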

21.3.2. Meta-cognitive Fuzzy-Neural Model

21.3.2.1. Model structure

To design the controller $u_f$, the meta-cognitive fuzzy-neural model applies the following fuzzy rule to approximate $f_d(\mathbf{x})$ and $g(\mathbf{x})$: Rule $k$: If $x_1$ is $A_{1k}$ and $\cdots$ and $x_n$ is $A_{nk}$, then $f_d(\mathbf{x})$ is $a_{fk}$ and $g(\mathbf{x})$ is $a_{gk}$, where $A_{ik}$ ($i = 1, \ldots, n$) represents the fuzzy sets related to the $i$th input variable $x_i$ in rule $k$, and $n$ is the size of the input $\mathbf{x}$ ($\mathbf{x} = [x_1, \ldots, x_n]^T$). The terms $f_d(\mathbf{x})$ and $g(\mathbf{x})$ constitute the outputs of the fuzzy-neural model. The consequences $a_{fk}$ and $a_{gk}$ are constants. A five-layer structure, as depicted in Figure 21.10, is implemented to realize the fuzzy-neural model. The input layer represents the input states. The membership layer calculates the membership value of each input variable from Gaussian functions. The rule layer gives the firing strength of the rules computed from the sum-product composition. The normalization layer calculates the normalized values of the rules. The output layer represents the system outputs,


Figure 21.10: The structure of meta-cognitive fuzzy-neural model.

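which is given with $N$ rules as

\[ f_d(\mathbf{x}) = \frac{\sum_{k=1}^{N} a_{fk} R_k(\mathbf{x})}{\sum_{k=1}^{N} R_k(\mathbf{x})} = \mathbf{a}_f \bar{\mathbf{R}}(\mathbf{x}), \tag{21.49} \]

\[ g(\mathbf{x}) = \frac{\sum_{k=1}^{N} a_{gk} R_k(\mathbf{x})}{\sum_{k=1}^{N} R_k(\mathbf{x})} = \mathbf{a}_g \bar{\mathbf{R}}(\mathbf{x}), \tag{21.50} \]

where $\mathbf{a}_f = [a_{f1}, a_{f2}, \ldots, a_{fN}] \in \mathbb{R}^{1\times N}$, $\mathbf{a}_g = [a_{g1}, a_{g2}, \ldots, a_{gN}] \in \mathbb{R}^{1\times N}$, and $\bar{\mathbf{R}}(\mathbf{x}) = \left[R_1(\mathbf{x})/\sum_{k=1}^{N} R_k(\mathbf{x}), \ldots, R_N(\mathbf{x})/\sum_{k=1}^{N} R_k(\mathbf{x})\right]^T \in \mathbb{R}^{N\times 1}$ are the normalization values of the rules. $R_k$ is the firing strength of the $k$th rule and is given as

\[ R_k(\mathbf{x}) = \prod_{i=1}^{n} A_{ik}(x_i) = \exp\left(-\sum_{i=1}^{n} \frac{(x_i - \mu_{ik})^2}{\sigma_k^2}\right) = \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_k\|^2}{\sigma_k^2}\right), \tag{21.51} \]

where $\boldsymbol{\mu}_k$ and $\sigma_k$ are the center and width of the $k$th Gaussian membership function.

21.3.2.2. Meta-cognitive learning mechanism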

The McFIS [8] carries out its sequential learning process by invoking the data deleting strategy, data learning strategy, and data reserving strategy. According to [8] in the data reserving strategy, data with lower knowledge will be kept for later learning but this is hard to realize in real-time controller design, where all the data are sampled sequentially and only one new observation is required anytime.


In our study, the knowledge in the reserved data is regarded as similar to that in the existing data, so the reserved data are handled by the deletion strategy and removed. Unlike [8], the proposed meta-cognitive fuzzy-neural model therefore uses only two self-regulatory strategies, data learning and data deletion, which help perform the real-time control task efficiently. How the two strategies are implemented in the fuzzy-neural model is described below.

21.3.2.3. Data learning strategy

The data learning process of the meta-cognitive fuzzy-neural model involves online rule evolution and the parameter update of the rule nearest to the current input data. The rule evolution strategies in the meta-cognitive fuzzy-neural model are the same as those of McFIS, but the parameter learning differs from [8] in order to suit real-time stable controller design. As in [8], the rules are determined incrementally according to the following conditions: $e > E_a$ and $\psi < E_s$. $\psi$ is the spherical potential, which indicates the novelty of an input data $\mathbf{x}$, and is given as

\[ \psi = \left| -\frac{2}{N} \sum_{k=1}^{N} R(\mathbf{x}, \boldsymbol{\mu}_k) \right|. \tag{21.52} \]

$E_s$ and $E_a$ are the novelty and adding thresholds. $E_a$ is self-regulated as

\[ E_a = \delta E_a + (1 - \delta) e, \tag{21.53} \]

where $e$ denotes the tracking error and $\delta$ is the slope factor. The self-regulatory adding threshold $E_a$ aims to capture the global knowledge initially and then fine-tune the system at a later stage. After a new fuzzy rule (the $(N+1)$th rule) is added, its parameters are initialized as

\[ \begin{aligned} \boldsymbol{\mu}_{N+1} &= \mathbf{x} \\ \sigma_{N+1} &= \kappa \min_{\forall k} \|\mathbf{x} - \boldsymbol{\mu}_k\|, \quad k = 1, \ldots, N \\ a_{f,N+1} &= a_{g,N+1} = 0 \end{aligned} \tag{21.54} \]

where $\kappa$ is the overlap parameter between the rule antecedents. When $e \ge E_l$, the rule parameters are modified. Similarly, the threshold $E_l$ is self-regulated with the tracking error as

\[ E_l = \delta E_l + (1 - \delta) e. \tag{21.55} \]

In the meta-cognitive fuzzy-neural model, the parameters of the nearest rule are modified according to the stable rules determined using the Lyapunov function in order to ensure the stability of the controlled system. This is described next.


Indirect Self-Evolving Fuzzy Control Approaches and Their Applications


During learning, the contribution of a rule to the output may become low. In this case, the insignificant rule should be removed from the rule base to avoid overfitting. The contribution of the $k$th rule is given by

$$\beta_k = \bar{R}_k \max_i | e_i a_{fk} + e_i a_{gk} |, \quad i = 1, \ldots, n. \tag{21.56}$$

The $k$th rule is pruned if its contribution falls below a threshold value $E_p$ for $N_w$ consecutive inputs.

21.3.2.4. Data deletion strategy

If the data do not satisfy the conditions of the data learning strategy, they carry no new knowledge and are deleted without being learned. This avoids over-learning of similar data and decreases the computational burden.
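The two strategies can be summarized as a per-sample decision. The following is a minimal sketch, not the authors' implementation: it assumes a Gaussian firing strength $R(x, \mu_k) = \exp(-\|x - \mu_k\|^2 / \sigma_k^2)$, uses the magnitude of the tracking error, and updates each threshold only when its condition fires. The function names, the dictionary layout, and the default value of `delta` are hypothetical. Rule pruning via Eq. (21.56) and the stable parameter update itself are left out.

```python
import math

def firing_strength(x, mu, sigma):
    """Gaussian firing strength exp(-||x - mu||^2 / sigma^2) (assumed form)."""
    dist2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-dist2 / sigma ** 2)

def spherical_potential(x, rules):
    """psi = |-(2/N) * sum_k R(x, mu_k)|, following Eq. (21.52)."""
    total = sum(firing_strength(x, r["mu"], r["sigma"]) for r in rules)
    return abs(-2.0 / len(rules) * total)

def meta_cognitive_step(x, e, rules, th, kappa=0.8, delta=0.99):
    """Decide what to do with one sample: 'add', 'learn' or 'delete'.

    rules: non-empty list of dicts with keys mu, sigma, a_f, a_g.
    th:    dict with thresholds E_a (adding), E_l (learning), E_s (novelty).
    Thresholds are self-regulated in place (assumption: only when triggered).
    """
    e = abs(e)  # assumption: thresholds are compared with the error magnitude
    psi = spherical_potential(x, rules)
    if e > th["E_a"] and psi < th["E_s"]:
        # Eq. (21.54): centre at x, width from the nearest centre, zero consequents
        nearest = min(math.dist(x, r["mu"]) for r in rules)
        rules.append({"mu": list(x), "sigma": kappa * nearest, "a_f": 0.0, "a_g": 0.0})
        th["E_a"] = delta * th["E_a"] + (1 - delta) * e  # Eq. (21.53)
        return "add"
    if e >= th["E_l"]:
        th["E_l"] = delta * th["E_l"] + (1 - delta) * e  # Eq. (21.55)
        return "learn"  # caller updates the nearest rule with the stable laws
    return "delete"     # sample carries no new knowledge
```

A sample far from all rule centres with a large error adds a rule, a sample with a moderate error triggers a parameter update of the nearest rule, and everything else is discarded.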

21.3.3. Meta-cognitive Fuzzy-Neural Control Approach

21.3.3.1. System errors

In the designed meta-cognitive fuzzy-neural controller $u_f$, the meta-cognitive fuzzy-neural model is used to reconstruct the functions $f_d(x)$ and $g(x)$. Based on the universal approximation property of a fuzzy-neural system, an optimal model exists that learns the terms $f_d(x)$ and $g(x)$ such that

$$f_d(x) = a_f^{*}\,\bar{R}(x; \mu^{*}, \sigma^{*}) + \varepsilon_f(x), \qquad g(x) = a_g^{*}\,\bar{R}(x; \mu^{*}, \sigma^{*}) + \varepsilon_g(x), \tag{21.57}$$

where $\varepsilon_f(x)$ and $\varepsilon_g(x)$ represent approximation errors; $a_f^{*}$, $a_g^{*}$, $\mu^{*}$, and $\sigma^{*}$ are the optimal parameter vectors of $a_f$, $a_g$, $\mu$, and $\sigma$, respectively. To construct $f_d(x)$ and $g(x)$, the optimal values $a_f^{*}$, $a_g^{*}$, $\mu^{*}$, and $\sigma^{*}$ need to be known but are unavailable. The online estimates $\hat{a}_f$, $\hat{a}_g$, $\hat{\mu}$, and $\hat{\sigma}$ are used to replace the optimal $a_f^{*}$, $a_g^{*}$, $\mu^{*}$, and $\sigma^{*}$. Thus, the estimated functions $\hat{f}_d(x)$ and $\hat{g}(x)$ are used to approximate the unknown functions $f_d(x)$ and $g(x)$, and are given as

$$\hat{f}_d(x) = \hat{a}_f\,\bar{R}(x; \hat{\mu}, \hat{\sigma}), \qquad \hat{g}(x) = \hat{a}_g\,\bar{R}(x; \hat{\mu}, \hat{\sigma}). \tag{21.58}$$

Using the “winner rule” strategy [8], the optimal parameter vectors comprise two parts. One part represents the optimal parameters of the active nearest rule, which are denoted as $a_{f,nr}^{*}$, $a_{g,nr}^{*}$, $\mu_{nr}^{*}$, and $\sigma_{nr}^{*}$. The other part consists of the optimal parameters of the fixed rules, which are denoted as $a_{fs}^{*}$, $a_{gs}^{*}$, $\mu_s^{*}$, and $\sigma_s^{*}$ ($1 \le s \le N$, $s \ne nr$), thereby representing rules that have not been activated. Hence,

$$a_f^{*} = \begin{bmatrix} a_{f,nr}^{*} & a_{fs}^{*} \end{bmatrix}, \qquad a_g^{*} = \begin{bmatrix} a_{g,nr}^{*} & a_{gs}^{*} \end{bmatrix}, \qquad \bar{R}(x; \mu^{*}, \sigma^{*}) = \begin{bmatrix} \bar{R}(x; \mu_{nr}^{*}, \sigma_{nr}^{*}) \\ \bar{R}(x; \mu_s^{*}, \sigma_s^{*}) \end{bmatrix}, \tag{21.59}$$

where $\bar{R}(x; \mu_{nr}^{*}, \sigma_{nr}^{*})$ and $\bar{R}(x; \mu_s^{*}, \sigma_s^{*})$ represent the normalized firing strengths of the active nearest rule and the fixed rules, respectively. Similarly, the estimates $\hat{a}_f$, $\hat{a}_g$, $\hat{\mu}$, and $\hat{\sigma}$ are divided into $\hat{a}_{f,nr}$, $\hat{a}_{g,nr}$, $\hat{\mu}_{nr}$, $\hat{\sigma}_{nr}$, $\hat{a}_{fs}$, $\hat{a}_{gs}$, $\hat{\mu}_s$, and $\hat{\sigma}_s$. The approximation errors between the actual functions and the estimated functions are given as

$$\Delta_f(x) = f_d(x) - \hat{f}_d(x) = a_{f,nr}^{*}\,\bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*}) - \hat{a}_{f,nr}\,\bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) + \varepsilon_f(x), \tag{21.60}$$

$$\Delta_g(x) = g(x) - \hat{g}(x) = a_{g,nr}^{*}\,\bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*}) - \hat{a}_{g,nr}\,\bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) + \varepsilon_g(x). \tag{21.61}$$

From Eq. (21.48), the approximation errors ($\Delta_f(x)$, $\Delta_g(x)$) between the actual functions ($f_d$, $g$) and the estimated functions ($\hat{f}_d$, $\hat{g}$) need to be established to calculate the derivative of the error $e_n$. How this is accomplished is shown by the following theorem.

Theorem 1: The parameter estimation errors of the nearest rule are defined as

$$\tilde{a}_{f,nr} = a_{f,nr}^{*} - \hat{a}_{f,nr}, \quad \tilde{a}_{g,nr} = a_{g,nr}^{*} - \hat{a}_{g,nr}, \quad \tilde{\mu}_{nr} = \mu_{nr}^{*} - \hat{\mu}_{nr}, \quad \tilde{\sigma}_{nr} = \sigma_{nr}^{*} - \hat{\sigma}_{nr}. \tag{21.62}$$

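The function approximation errors $\Delta_f(x)$ and $\Delta_g(x)$ can be expressed as

$$\Delta_f(x) = \tilde{a}_{f,nr}\left\{ \bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) - \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\hat{\mu}_{nr} - \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\hat{\sigma}_{nr} \right\} + \hat{a}_{f,nr}\left\{ \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\mu}_{nr} + \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\sigma}_{nr} \right\} + d_f, \tag{21.63}$$

$$\Delta_g(x) = \tilde{a}_{g,nr}\left\{ \bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) - \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\hat{\mu}_{nr} - \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\hat{\sigma}_{nr} \right\} + \hat{a}_{g,nr}\left\{ \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\mu}_{nr} + \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\sigma}_{nr} \right\} + d_g, \tag{21.64}$$

where

$$\bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) = \frac{R_{nr} \sum_{s=1, s \ne nr}^{N} R_s}{\left( R_{nr} + \sum_{s=1, s \ne nr}^{N} R_s \right)^{2}} \, \frac{2 (x - \mu_{nr})^{T}}{\sigma_{nr}^{2}} \in \mathbb{R}^{1 \times n}$$

and

$$\bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) = \frac{R_{nr} \sum_{s=1, s \ne nr}^{N} R_s}{\left( R_{nr} + \sum_{s=1, s \ne nr}^{N} R_s \right)^{2}} \, \frac{2 \| x - \mu_{nr} \|^{2}}{\sigma_{nr}^{3}} \in \mathbb{R}$$

are the derivatives of $\bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*})$ with respect to $\mu_{nr}^{*}$ and $\sigma_{nr}^{*}$ at ($\hat{\mu}_{nr}$, $\hat{\sigma}_{nr}$), respectively. $d_f$ and $d_g$ are the residual terms.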
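The closed-form derivatives stated in Theorem 1 can be checked numerically. The sketch below is illustrative only: it assumes the unnormalized firing strength is the Gaussian $R_k = \exp(-\|x - \mu_k\|^2 / \sigma_k^2)$, which is the form implied by the factors $2(x - \mu_{nr})^T / \sigma_{nr}^2$ and $2\|x - \mu_{nr}\|^2 / \sigma_{nr}^3$, and that normalization divides by the sum over all $N$ rules. The function names are hypothetical.

```python
import numpy as np

def strengths(x, mus, sigmas):
    """Unnormalized firing strengths R_k = exp(-||x - mu_k||^2 / sigma_k^2) (assumed form)."""
    d2 = np.sum((x - mus) ** 2, axis=1)
    return np.exp(-d2 / sigmas ** 2)

def normalized_nr(x, mus, sigmas, nr):
    """Normalized firing strength of rule nr: R_nr / sum_k R_k."""
    r = strengths(x, mus, sigmas)
    return r[nr] / r.sum()

def nr_derivatives(x, mus, sigmas, nr):
    """Derivatives of the normalized strength of rule nr w.r.t. mu_nr and sigma_nr."""
    r = strengths(x, mus, sigmas)
    r_nr = r[nr]
    r_rest = r.sum() - r_nr                      # sum over s != nr
    common = r_nr * r_rest / (r_nr + r_rest) ** 2
    d_mu = common * 2.0 * (x - mus[nr]) / sigmas[nr] ** 2                  # n entries
    d_sigma = common * 2.0 * np.sum((x - mus[nr]) ** 2) / sigmas[nr] ** 3  # scalar
    return d_mu, d_sigma
```

Comparing `nr_derivatives` against central finite differences of `normalized_nr` is a quick way to confirm that a hand-derived gradient matches the chosen membership function before it is used in an adaptation law.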

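Proof. The derivations of the approximation errors $\Delta_f(x)$ and $\Delta_g(x)$ in Eqs. (21.63)–(21.64) are similar; thus, for simplicity, only the derivation of $\Delta_f(x)$ is described below. The approximation error $\Delta_f(x)$ can be written as

$$\begin{aligned} \Delta_f(x) &= a_{f,nr}^{*}\,\bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*}) - a_{f,nr}^{*}\,\bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) + a_{f,nr}^{*}\,\bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) - \hat{a}_{f,nr}\,\bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) + \varepsilon_f(x) \\ &= a_{f,nr}^{*}\,\tilde{\bar{R}}_{nr} + \tilde{a}_{f,nr}\,\hat{\bar{R}}_{nr} + \varepsilon_f(x) \\ &= a_{f,nr}^{*}\,\tilde{\bar{R}}_{nr} - \hat{a}_{f,nr}\,\tilde{\bar{R}}_{nr} + \hat{a}_{f,nr}\,\tilde{\bar{R}}_{nr} + \tilde{a}_{f,nr}\,\hat{\bar{R}}_{nr} + \varepsilon_f(x) \\ &= \tilde{a}_{f,nr}\,\tilde{\bar{R}}_{nr} + \hat{a}_{f,nr}\,\tilde{\bar{R}}_{nr} + \tilde{a}_{f,nr}\,\hat{\bar{R}}_{nr} + \varepsilon_f(x), \end{aligned} \tag{21.65}$$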

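where $\tilde{\bar{R}}_{nr} = \bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*}) - \bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})$ and $\hat{\bar{R}}_{nr} = \bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})$. In order to deal with $\tilde{\bar{R}}_{nr}$, the Taylor series expansion of $\bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*})$ is taken about $\mu_{nr}^{*} = \hat{\mu}_{nr}$ and $\sigma_{nr}^{*} = \hat{\sigma}_{nr}$ such that

$$\bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*}) = \bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) + \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})(\mu_{nr}^{*} - \hat{\mu}_{nr}) + \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})(\sigma_{nr}^{*} - \hat{\sigma}_{nr}) + o(x; \tilde{\mu}_{nr}, \tilde{\sigma}_{nr}), \tag{21.66}$$

where $o(x; \tilde{\mu}_{nr}, \tilde{\sigma}_{nr})$ denotes the sum of the high-order terms in the Taylor series expansion, and $\bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) \in \mathbb{R}^{1 \times n}$ and $\bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) \in \mathbb{R}$ are the derivatives of $\bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*})$ with respect to $\mu_{nr}^{*}$ and $\sigma_{nr}^{*}$ at ($\hat{\mu}_{nr}$, $\hat{\sigma}_{nr}$). They are expressed as

$$\bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) = \left. \frac{\partial \bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*})}{\partial \mu_{nr}^{*}} \right|_{\mu_{nr}^{*} = \hat{\mu}_{nr},\ \sigma_{nr}^{*} = \hat{\sigma}_{nr}} = \frac{R_{nr} \sum_{s=1, s \ne nr}^{N} R_s}{\left( R_{nr} + \sum_{s=1, s \ne nr}^{N} R_s \right)^{2}} \, \frac{2 (x - \mu_{nr})^{T}}{\sigma_{nr}^{2}} \in \mathbb{R}^{1 \times n}, \tag{21.67}$$

$$\bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) = \left. \frac{\partial \bar{R}_{nr}(x; \mu_{nr}^{*}, \sigma_{nr}^{*})}{\partial \sigma_{nr}^{*}} \right|_{\mu_{nr}^{*} = \hat{\mu}_{nr},\ \sigma_{nr}^{*} = \hat{\sigma}_{nr}} = \frac{R_{nr} \sum_{s=1, s \ne nr}^{N} R_s}{\left( R_{nr} + \sum_{s=1, s \ne nr}^{N} R_s \right)^{2}} \, \frac{2 \| x - \mu_{nr} \|^{2}}{\sigma_{nr}^{3}} \in \mathbb{R}. \tag{21.68}$$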

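Eq. (21.66) can then be expressed as

$$\tilde{\bar{R}}_{nr} = \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\mu}_{nr} + \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\sigma}_{nr} + o(x; \tilde{\mu}_{nr}, \tilde{\sigma}_{nr}). \tag{21.69}$$

Substituting Eq. (21.69) into Eq. (21.65) yields

$$\begin{aligned} \Delta_f(x) ={} & \tilde{a}_{f,nr}\left\{ \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\mu}_{nr} + \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\sigma}_{nr} + o(x; \tilde{\mu}_{nr}, \tilde{\sigma}_{nr}) \right\} \\ & + \hat{a}_{f,nr}\left\{ \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\mu}_{nr} + \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\sigma}_{nr} + o(x; \tilde{\mu}_{nr}, \tilde{\sigma}_{nr}) \right\} \\ & + \tilde{a}_{f,nr}\,\bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) + \varepsilon_f(x). \end{aligned} \tag{21.70}$$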

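According to the definitions of $\tilde{\mu}_{nr}$, $\tilde{\sigma}_{nr}$, and $a_{f,nr}^{*}$, Eq. (21.70) is expressed as

$$\begin{aligned} \Delta_f(x) ={} & \tilde{a}_{f,nr}\,\bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})(\mu_{nr}^{*} - \hat{\mu}_{nr}) + \tilde{a}_{f,nr}\,\bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})(\sigma_{nr}^{*} - \hat{\sigma}_{nr}) \\ & + \hat{a}_{f,nr}\,\bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\mu}_{nr} + \hat{a}_{f,nr}\,\bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\sigma}_{nr} \\ & + \tilde{a}_{f,nr}\,\bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) + a_{f,nr}^{*}\,o(x; \tilde{\mu}_{nr}, \tilde{\sigma}_{nr}) + \varepsilon_f(x) \\ ={} & \tilde{a}_{f,nr}\left\{ \bar{R}_{nr}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr}) - \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\hat{\mu}_{nr} - \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\hat{\sigma}_{nr} \right\} \\ & + \hat{a}_{f,nr}\left\{ \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\mu}_{nr} + \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\tilde{\sigma}_{nr} \right\} + d_f, \end{aligned} \tag{21.71}$$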

where

$$d_f = \tilde{a}_{f,nr}\left\{ \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\mu_{nr}^{*} + \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})\sigma_{nr}^{*} \right\} + a_{f,nr}^{*}\,o(x; \tilde{\mu}_{nr}, \tilde{\sigma}_{nr}) + \varepsilon_f(x)$$

is the residual approximation error. The proof can be extended to the approximation error $\Delta_g(x)$ (it is omitted here to save space).

Based on Eqs. (21.42)–(21.43), (21.48), and (21.63)–(21.64), the system error dynamics are

$$\begin{aligned} \dot{e}_i ={} & -k_i e_i - e_{i-1} + e_{i+1}, \quad i = 1, 2, \ldots, n-1, \\ \dot{e}_n ={} & \tilde{a}_{f,nr}\left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) + \hat{a}_{f,nr}\left( \hat{\bar{R}}'_{nr,\mu}\tilde{\mu}_{nr} + \hat{\bar{R}}'_{nr,\sigma}\tilde{\sigma}_{nr} \right) + d_f \\ & + \left[ \tilde{a}_{g,nr}\left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) + \hat{a}_{g,nr}\left( \hat{\bar{R}}'_{nr,\mu}\tilde{\mu}_{nr} + \hat{\bar{R}}'_{nr,\sigma}\tilde{\sigma}_{nr} \right) + d_g \right] u_f \\ & + g(x) u_s - k_n e_n - e_{n-1}, \end{aligned} \tag{21.72}$$

where $\hat{\bar{R}}'_{nr,\mu} = \bar{R}'_{nr,\mu}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})$ and $\hat{\bar{R}}'_{nr,\sigma} = \bar{R}'_{nr,\sigma}(x; \hat{\mu}_{nr}, \hat{\sigma}_{nr})$. Now, our focus is on the error Eq. (21.72) and the determination of the adaptive laws for $\hat{a}_{f,nr}$, $\hat{a}_{g,nr}$, $\hat{\mu}_{nr}$, and $\hat{\sigma}_{nr}$ such that all signals remain bounded and the errors $e_i$ ($i = 1, 2, \ldots, n$) converge to zero.

21.3.3.2. Controller structure and stability analysis

Since the approximation errors $d_f$ and $d_g$ exist, the system stability cannot be confirmed. In this study, a sliding mode compensator $u_s$ is constructed to handle the approximation errors so that system stability is guaranteed. This study assumes that the approximation errors $d_f$ and $d_g$ are bounded, i.e., $|d_f| \le \bar{D}_f$ and $|d_g| \le \bar{D}_g$. In practice, the error bounds $\bar{D}_f$ and $\bar{D}_g$ are not easy to measure. Therefore, their estimated values ($\hat{D}_f$, $\hat{D}_g$) are used to calculate the sliding mode compensator $u_s$ and are updated based on the bound estimation laws. The compensator is given as

$$u_s = -\mathrm{sgn}(e_n)\,\frac{\hat{D}_f + \hat{D}_g |u_f|}{\gamma_0}. \tag{21.73}$$
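As a minimal sketch of Eq. (21.73), the compensator can be evaluated as below. The function and argument names are hypothetical, and $\gamma_0$ is simply treated as a given positive constant. Its interpretation as a lower bound on $g(x)$ is an assumption here, suggested by the later use of Assumption 2 in the stability proof.

```python
import math

def sliding_mode_compensator(e_n, u_f, d_hat_f, d_hat_g, gamma_0):
    """u_s = -sgn(e_n) * (D_f_hat + D_g_hat * |u_f|) / gamma_0, as in Eq. (21.73)."""
    if e_n == 0.0:
        return 0.0  # sgn(0) = 0, so the compensator is inactive on the surface
    sign = math.copysign(1.0, e_n)
    return -sign * (d_hat_f + d_hat_g * abs(u_f)) / gamma_0
```

The magnitude of the compensator grows with the estimated bounds and with $|u_f|$, and its sign always opposes $e_n$. In a sampled implementation the discontinuous `sgn` is known to cause chattering; whether and how it is smoothed is a design choice outside this sketch.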

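The parameters of the nearest rule ($\hat{a}_{f,nr}$, $\hat{a}_{g,nr}$, $\hat{\mu}_{nr}$, $\hat{\sigma}_{nr}$) are modified according to Eqs. (21.74)–(21.77), given as

$$\dot{\hat{a}}_{f,nr} = \begin{cases} 0 & \text{if } \hat{a}_{f,nr} \notin (B_f^{\min}, B_f^{\max}) \text{ and } e_n \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) (\hat{a}_{f,nr} - \hat{a}_{f,nr}^{c}) > 0, \\ \gamma_f e_n \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) & \text{otherwise,} \end{cases} \tag{21.74}$$

$$\dot{\hat{a}}_{g,nr} = \begin{cases} 0 & \text{if } \hat{a}_{g,nr} \notin (B_g^{\min}, B_g^{\max}) \text{ and } e_n \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) u_f (\hat{a}_{g,nr} - \hat{a}_{g,nr}^{c}) > 0, \\ \gamma_g e_n \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) u_f & \text{otherwise,} \end{cases} \tag{21.75}$$

$$\dot{\hat{\mu}}_{nr,j} = \begin{cases} 0 & \text{if } \hat{\mu}_{nr,j} \notin (B_{\mu,j}^{\min}, B_{\mu,j}^{\max}) \text{ and } e_n (\hat{a}_{f,nr} + \hat{a}_{g,nr} u_f)\,\hat{\bar{R}}'_{nr,\mu,j}\,(\hat{\mu}_{nr,j} - \hat{\mu}_{nr,j}^{c}) > 0, \\ \gamma_\mu e_n (\hat{a}_{f,nr} + \hat{a}_{g,nr} u_f)\,\hat{\bar{R}}'_{nr,\mu,j} & \text{otherwise,} \end{cases} \quad j = 1, 2, \ldots, n, \tag{21.76}$$

$$\dot{\hat{\sigma}}_{nr} = \begin{cases} 0 & \text{if } \hat{\sigma}_{nr} \notin (B_\sigma^{\min}, B_\sigma^{\max}) \text{ and } e_n (\hat{a}_{f,nr} + \hat{a}_{g,nr} u_f)\,\hat{\bar{R}}'_{nr,\sigma}\,(\hat{\sigma}_{nr} - \hat{\sigma}_{nr}^{c}) > 0, \\ \gamma_\sigma e_n (\hat{a}_{f,nr} + \hat{a}_{g,nr} u_f)\,\hat{\bar{R}}'_{nr,\sigma} & \text{otherwise.} \end{cases} \tag{21.77}$$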
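All four laws share the same projection structure: the update is frozen when the estimate has left its admissible interval and the unconstrained update would push it further from an interior point. The sketch below illustrates that structure for a scalar parameter. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the explicit Euler step is an assumed discretization of the continuous-time law.

```python
def projected_rate(theta, theta_c, signal, gamma, lower, upper):
    """Right-hand side of a projection-type law such as Eq. (21.74).

    theta:   current estimate (e.g. a_f,nr)
    theta_c: any point inside the admissible range (lower, upper)
    signal:  term multiplying the learning rate, e.g.
             e_n * (R_nr - R'_mu @ mu_nr - R'_sigma * sigma_nr) for a_f,nr
    """
    outside = not (lower < theta < upper)
    if outside and signal * (theta - theta_c) > 0:
        return 0.0           # freeze: the update would move away from theta_c
    return gamma * signal    # unconstrained gradient-type update

def euler_step(theta, theta_c, signal, gamma, lower, upper, dt):
    """One explicit Euler step of the law (discretization is an assumption)."""
    return theta + dt * projected_rate(theta, theta_c, signal, gamma, lower, upper)
```

For example, with bounds $(-1, 1)$, an estimate sitting at the upper bound is frozen while the signal is positive and is released as soon as the signal changes sign, so the estimate can always return to the interior.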

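The error bounds ($\hat{D}_f$, $\hat{D}_g$) are updated based on Eqs. (21.78)–(21.79), given as

$$\dot{\hat{D}}_f = \begin{cases} \gamma_{df} |e_n| & \text{if } \hat{D}_f \le B_{df}, \\ 0 & \text{otherwise,} \end{cases} \tag{21.78}$$

$$\dot{\hat{D}}_g = \begin{cases} \gamma_{dg} |e_n| |u_f| & \text{if } \hat{D}_g \le B_{dg}, \\ 0 & \text{otherwise.} \end{cases} \tag{21.79}$$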

In these equations, $\gamma_f$, $\gamma_g$, $\gamma_\mu$, $\gamma_\sigma$, $\gamma_{df}$, and $\gamma_{dg}$ are the positive learning rates. In order to prevent the updated parameters from drifting to infinity over time, the projection algorithm [46] is applied by limiting their lower and upper bounds $[B_\bullet^{\min}, B_\bullet^{\max}]$, $\bullet = \{f, g, \mu, \sigma\}$. $[B_{\mu,j}^{\min}, B_{\mu,j}^{\max}]$ is the $j$th element of the vector $[B_\mu^{\min}, B_\mu^{\max}]$. $\hat{a}_{f,nr}^{c} \in [B_f^{\min}, B_f^{\max}]$, $\hat{a}_{g,nr}^{c} \in [B_g^{\min}, B_g^{\max}]$, $\hat{\mu}_{nr,j}^{c} \in [B_{\mu,j}^{\min}, B_{\mu,j}^{\max}]$, and $\hat{\sigma}_{nr}^{c} \in [B_\sigma^{\min}, B_\sigma^{\max}]$ are any points in the acceptable ranges. $[0, B_{df}]$ and $[0, B_{dg}]$ are the error-bound ranges.

The following theorem shows the stability property of the proposed meta-cognitive fuzzy-neural control approach.

Theorem 2: Consider the system (Eq. (21.41)) with the fictitious control laws Eq. (21.43) and control law (21.47), where the meta-cognitive fuzzy-neural control $u_f$ is given by Eq. (21.46), and the sliding mode compensator $u_s$ is given by Eq. (21.73). If the adaptation laws of the fuzzy-neural model parameters and approximation error bounds are designed as in Eqs. (21.74)–(21.79), the convergence of the tracking errors and the stability of the proposed meta-cognitive fuzzy-neural control approach can be guaranteed.


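Proof. Define a Lyapunov function candidate as

$$V = \frac{1}{2} \sum_{i=1}^{n} e_i^{2} + \frac{1}{2\gamma_f} \tilde{a}_{f,nr}^{2} + \frac{1}{2\gamma_g} \tilde{a}_{g,nr}^{2} + \frac{1}{2\gamma_\sigma} \tilde{\sigma}_{nr}^{2} + \frac{1}{2\gamma_\mu} \tilde{\mu}_{nr}^{T} \tilde{\mu}_{nr} + \frac{1}{2\gamma_{df}} \tilde{D}_f^{2} + \frac{1}{2\gamma_{dg}} \tilde{D}_g^{2}, \tag{21.80}$$

where $\tilde{D}_f = \bar{D}_f - \hat{D}_f$ and $\tilde{D}_g = \bar{D}_g - \hat{D}_g$.

Differentiating the Lyapunov function (Eq. (21.80)) and using Eq. (21.72) yields

$$\begin{aligned} \dot{V} ={} & \sum_{i=1}^{n} -k_i e_i^{2} + e_n \tilde{a}_{f,nr} \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) + e_n \tilde{a}_{g,nr} \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) u_f \\ & + e_n (\hat{a}_{f,nr} + \hat{a}_{g,nr} u_f)\,\hat{\bar{R}}'_{nr,\mu}\tilde{\mu}_{nr} + e_n (\hat{a}_{f,nr} + \hat{a}_{g,nr} u_f)\,\hat{\bar{R}}'_{nr,\sigma}\tilde{\sigma}_{nr} + e_n d_f + e_n d_g u_f + e_n g(x) u_s \\ & - \frac{1}{\gamma_f} \tilde{a}_{f,nr} \dot{\hat{a}}_{f,nr} - \frac{1}{\gamma_g} \tilde{a}_{g,nr} \dot{\hat{a}}_{g,nr} - \frac{1}{\gamma_\sigma} \tilde{\sigma}_{nr} \dot{\hat{\sigma}}_{nr} - \frac{1}{\gamma_\mu} \dot{\hat{\mu}}_{nr}^{T} \tilde{\mu}_{nr} - \frac{1}{\gamma_{df}} \dot{\hat{D}}_f \tilde{D}_f - \frac{1}{\gamma_{dg}} \dot{\hat{D}}_g \tilde{D}_g, \end{aligned} \tag{21.81}$$

where $\dot{\tilde{a}}_{f,nr} = -\dot{\hat{a}}_{f,nr}$, $\dot{\tilde{a}}_{g,nr} = -\dot{\hat{a}}_{g,nr}$, $\dot{\tilde{\sigma}}_{nr} = -\dot{\hat{\sigma}}_{nr}$, $\dot{\tilde{\mu}}_{nr} = -\dot{\hat{\mu}}_{nr}$, $\dot{\tilde{D}}_f = -\dot{\hat{D}}_f$, and $\dot{\tilde{D}}_g = -\dot{\hat{D}}_g$ are utilized. Let

$$\begin{aligned} V_f &= e_n \tilde{a}_{f,nr} \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) - \frac{1}{\gamma_f} \tilde{a}_{f,nr} \dot{\hat{a}}_{f,nr}, \\ V_g &= e_n \tilde{a}_{g,nr} \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) u_f - \frac{1}{\gamma_g} \tilde{a}_{g,nr} \dot{\hat{a}}_{g,nr}, \\ V_\mu &= e_n (\hat{a}_{f,nr} + \hat{a}_{g,nr} u_f)\,\hat{\bar{R}}'_{nr,\mu}\tilde{\mu}_{nr} - \frac{1}{\gamma_\mu} \dot{\hat{\mu}}_{nr}^{T} \tilde{\mu}_{nr} = \sum_{j=1}^{n} \left[ e_n (\hat{a}_{f,nr} + \hat{a}_{g,nr} u_f)\,\hat{\bar{R}}'_{nr,\mu,j}\,\tilde{\mu}_{nr,j} - \frac{1}{\gamma_\mu} \dot{\hat{\mu}}_{nr,j}\,\tilde{\mu}_{nr,j} \right], \\ V_\sigma &= e_n (\hat{a}_{f,nr} + \hat{a}_{g,nr} u_f)\,\hat{\bar{R}}'_{nr,\sigma}\tilde{\sigma}_{nr} - \frac{1}{\gamma_\sigma} \tilde{\sigma}_{nr} \dot{\hat{\sigma}}_{nr}. \end{aligned}$$

Eq. (21.81) can be rewritten as

$$\dot{V} = \sum_{i=1}^{n} -k_i e_i^{2} + V_f + V_g + V_\mu + V_\sigma + e_n d_f + e_n d_g u_f + e_n g(x) u_s - \frac{1}{\gamma_{df}} \dot{\hat{D}}_f \tilde{D}_f - \frac{1}{\gamma_{dg}} \dot{\hat{D}}_g \tilde{D}_g. \tag{21.82}$$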

When the adaptation law for the parameter $\hat{a}_{f,nr}$ is designed as Eq. (21.74) and the first condition is satisfied,

$$V_f = e_n \tilde{a}_{f,nr} \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) = -e_n \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) (\hat{a}_{f,nr} - a_{f,nr}^{*}). \tag{21.83}$$

Since $a_{f,nr}^{*} \in (B_f^{\min}, B_f^{\max})$, the condition $-e_n \left( \hat{\bar{R}}_{nr} - \hat{\bar{R}}'_{nr,\mu}\hat{\mu}_{nr} - \hat{\bar{R}}'_{nr,\sigma}\hat{\sigma}_{nr} \right) (\hat{a}_{f,nr} - a_{f,nr}^{*}) < 0$ holds. For the second condition of Eq. (21.74), $V_f = 0$. As a result, it can be concluded that $V_f \le 0$. If the adaptation laws for the parameters $\hat{a}_{g,nr}$, $\hat{\mu}_{nr,j}$, and $\hat{\sigma}_{nr}$ are designed as Eqs. (21.75)–(21.77), one can conclude that $V_g \le 0$, $V_\mu \le 0$, and $V_\sigma \le 0$. Consequently, with the design of $u_s$ in Eq. (21.73), Eq. (21.82) can be represented as

$$\begin{aligned} \dot{V} ={} & \sum_{i=1}^{n} -k_i e_i^{2} + V_f + V_g + V_\mu + V_\sigma + e_n d_f + e_n d_g u_f - e_n g(x)\,\mathrm{sgn}(e_n)\,\frac{\hat{D}_f + \hat{D}_g |u_f|}{\gamma_0} - \frac{1}{\gamma_{df}} \dot{\hat{D}}_f \tilde{D}_f - \frac{1}{\gamma_{dg}} \dot{\hat{D}}_g \tilde{D}_g \\ \le{} & \sum_{i=1}^{n} -k_i e_i^{2} + |e_n| |d_f| + |e_n| |d_g| |u_f| - e_n g(x)\,\mathrm{sgn}(e_n)\,\frac{\hat{D}_f + \hat{D}_g |u_f|}{\gamma_0} - \frac{1}{\gamma_{df}} \dot{\hat{D}}_f \tilde{D}_f - \frac{1}{\gamma_{dg}} \dot{\hat{D}}_g \tilde{D}_g. \end{aligned} \tag{21.84}$$

Considering that $|d_f| \le \bar{D}_f$ and $|d_g| \le \bar{D}_g$ and with Assumption 2, Eq. (21.84) becomes

$$\begin{aligned} \dot{V} \le{} & \sum_{i=1}^{n} -k_i e_i^{2} + |e_n| \bar{D}_f + |e_n| |u_f| \bar{D}_g - |e_n| (\hat{D}_f + \hat{D}_g |u_f|) - \frac{1}{\gamma_{df}} \dot{\hat{D}}_f \tilde{D}_f - \frac{1}{\gamma_{dg}} \dot{\hat{D}}_g \tilde{D}_g \\ ={} & \sum_{i=1}^{n} -k_i e_i^{2} + |e_n| (\bar{D}_f - \hat{D}_f) + |e_n| |u_f| (\bar{D}_g - \hat{D}_g) - \frac{1}{\gamma_{df}} \dot{\hat{D}}_f \tilde{D}_f - \frac{1}{\gamma_{dg}} \dot{\hat{D}}_g \tilde{D}_g. \end{aligned} \tag{21.85}$$

Let

$$V_{df} = |e_n| (\bar{D}_f - \hat{D}_f) - \frac{1}{\gamma_{df}} \dot{\hat{D}}_f \tilde{D}_f, \qquad V_{dg} = |e_n| |u_f| (\bar{D}_g - \hat{D}_g) - \frac{1}{\gamma_{dg}} \dot{\hat{D}}_g \tilde{D}_g.$$

When the adaptation law for the error bound $\hat{D}_f$ is designed as Eq. (21.78) and the first condition is satisfied, $V_{df} = 0$. Using the second condition of Eq. (21.78) yields $V_{df} = |e_n| (\bar{D}_f - \hat{D}_f) \le 0$ with $\hat{D}_f \ge B_{df} > \bar{D}_f$. Thus, it can be concluded that $V_{df} \le 0$. Similarly, with the adaptation law (21.79) for the error bound $\hat{D}_g$, one can obtain $V_{dg} \le 0$. In this case, the derivative of $V$ satisfies

$$\dot{V} \le \sum_{i=1}^{n} -k_i e_i^{2} \le 0. \tag{21.86}$$

Equations (21.80) and (21.86) show that $V \ge 0$ and $\dot{V} \le 0$, respectively. Thus,

$$V(e_i(t), \tilde{a}_{f,nr}(t), \tilde{a}_{g,nr}(t), \tilde{\mu}_{nr}(t), \tilde{\sigma}_{nr}(t), \tilde{D}_f(t), \tilde{D}_g(t)) \le V(e_i(0), \tilde{a}_{f,nr}(0), \tilde{a}_{g,nr}(0), \tilde{\mu}_{nr}(0), \tilde{\sigma}_{nr}(0), \tilde{D}_f(0), \tilde{D}_g(0)). \tag{21.87}$$

This implies that $e_i$ ($i = 1, 2, \ldots, n$), $\tilde{a}_{f,nr}$, $\tilde{a}_{g,nr}$, $\tilde{\mu}_{nr}$, $\tilde{\sigma}_{nr}$, $\tilde{D}_f$, and $\tilde{D}_g$ are bounded. Define the function

$$V_c(t) \equiv \sum_{i=1}^{n} k_i e_i^{2} \le -\dot{V}(e_i(t), \tilde{a}_{f,nr}(t), \tilde{a}_{g,nr}(t), \tilde{\mu}_{nr}(t), \tilde{\sigma}_{nr}(t), \tilde{D}_f(t), \tilde{D}_g(t)). \tag{21.88}$$

Integrating (21.88) with respect to time shows that

$$\int_0^t V_c(\tau)\,d\tau \le V(e_i(0), \tilde{a}_{f,nr}(0), \tilde{a}_{g,nr}(0), \tilde{\mu}_{nr}(0), \tilde{\sigma}_{nr}(0), \tilde{D}_f(0), \tilde{D}_g(0)) - V(e_i(t), \tilde{a}_{f,nr}(t), \tilde{a}_{g,nr}(t), \tilde{\mu}_{nr}(t), \tilde{\sigma}_{nr}(t), \tilde{D}_f(t), \tilde{D}_g(t)). \tag{21.89}$$

Since $V(e_i(0), \tilde{a}_{f,nr}(0), \tilde{a}_{g,nr}(0), \tilde{\mu}_{nr}(0), \tilde{\sigma}_{nr}(0), \tilde{D}_f(0), \tilde{D}_g(0))$ is bounded and $V(e_i(t), \tilde{a}_{f,nr}(t), \tilde{a}_{g,nr}(t), \tilde{\mu}_{nr}(t), \tilde{\sigma}_{nr}(t), \tilde{D}_f(t), \tilde{D}_g(t))$ is nonincreasing and bounded, the following result can be obtained:

$$\lim_{t \to \infty} \int_0^t V_c(\tau)\,d\tau < \infty. \tag{21.90}$$

Since $\dot{V}_c(t)$ is bounded, it can be concluded that $\lim_{t \to \infty} V_c(t) = 0$ according to Barbalat's lemma [26], which implies that $e_i \to 0$, $i = 1, 2, \ldots, n$, as $t \to \infty$. Therefore, the proposed meta-cognitive fuzzy-neural control approach is asymptotically stable and the tracking errors of the system will converge to zero.

Remark: The overall architecture of the proposed meta-cognitive fuzzy-neural control approach is shown in Figure 21.11. The control input $u$ is recursively constructed via the virtual control laws $x_{2d}$ to $x_{nd}$ and comprises two parts, viz., the sliding mode compensator $u_s$ and the meta-cognitive fuzzy-neural control input $u_f$. $u_s$, with the adaptive error-bound estimation laws, is designed to attenuate the effects of the approximation error terms $d_f$ and $d_g$. $u_f$ is built using the meta-cognitive fuzzy-neural model to approximate the unknown functions $f_d(x)$ and $g(x)$. Different from the existing work [40–42], the proposed control method can perform both data deletion and data learning concurrently online. Data with similar information content are deleted and not employed to design the controller, which avoids over-learning and improves control performance. Data learning is capable of determining the system structure and modifying parameters simultaneously. All of the rules are automatically produced online and good initial values of all the rules are located, from which subsequent parameter tuning is implemented.

21.3.4. Control Performance

The desired trajectory is $z_d(t) = 0.005 \sin(5t)$ and the initial states are $(z, \dot{z}) = (-0.003, 0)^{T}$. The parameters of the TAMB system are set as $m = 7.35$ kg, $c_0 = 3.5 \times 10^{-4}$, $Q_{i0} = 212.82$ N/A, and $Q_{z0} = -7.3 \times 10^{5}$ N/m. The parameters associated with the AMcFBC are selected as $E_a = 0.008$, $E_l = 0.002$, $\kappa = 0.8$, $E_s = 0.2$, $\gamma_\mu = \gamma_\sigma = \gamma_f = \gamma_g = 100$, and $\gamma_{df} = \gamma_{dg} = 0.02$.


Figure 21.11: Structure of the proposed meta-cognitive fuzzy-neural control approach.

The simulation results of the TAMB control system using the proposed control approach are shown in Figure 21.12. For the sake of comparison, the results of the neural network control (NC) scheme with a fixed structure are included. The proposed meta-cognitive fuzzy-neural controller achieves faster tracking responses, since the meta-cognitive fuzzy-neural model can rapidly learn the system dynamics with its meta-cognitive learning capability. Figure 21.13(a) gives the spherical potential and the novelty threshold during the whole learning process. The tracking error ($e$) and the self-regulatory thresholds ($E_a$, $E_l$) related to the meta-cognitive learning mechanism are depicted in Figure 21.13(b). Figure 21.14(a) presents the rule evolution process of the proposed meta-cognitive fuzzy-neural controller, where a new rule is added to the model if $\psi < E_s$ and $e > E_a$. This occurs mainly during the initial learning process in order to learn the system dynamics rapidly; insignificant rules are then pruned after the meta-cognitive fuzzy-neural model has captured the system dynamics. The local rule pruning around 0.8 s is also shown in the figure for clarity. In this simulation study, the meta-cognitive fuzzy-neural controller generates a total of four fuzzy rules online, as shown in Figure 21.14(a), while the NC requires eight hidden neurons determined by trial and error. When $e > E_l$, as shown in Figure 21.13(b), the data are used for updating the parameters of the nearest rule according to the stable adaptation laws Eqs. (21.74)–(21.79). If the data do not satisfy the conditions for adding or pruning rules or for updating parameters, they are removed from the learning process. These results indicate that the proposed meta-cognitive fuzzy-neural control approach is able to adapt the structure of the fuzzy-neural system and its corresponding parameters in real time, even in a changing environment. Figure 21.14(b) depicts the control signal and shows that the control inputs are bounded over the whole control phase as a result of the projection-type learning laws.

Figure 21.12: Trajectories of the rotor position and velocity. (a) Rotor position; (b) rotor velocity.

Figure 21.13: Spherical potential and tracking error with self-regulatory thresholds. (a) Spherical potential; (b) tracking error.

Figure 21.14: Rule evolving process and control input. (a) Rule evolvement; (b) control input.

21.4. Conclusions

In this chapter, two kinds of indirect self-evolving fuzzy control systems utilizing ESAFIS and the meta-cognitive fuzzy-neural model are developed for the suppression of the wing-rock motion and rotor position control of a nonlinear TAMB system, respectively. The self-evolving fuzzy models are used to directly approximate the system dynamics and then the controller is constructed based on the approximation function. In the rule evolvement, the fuzzy rules are dynamically recruited and pruned such that a system structure of an appropriate size is autonomously achieved. When the learning criterion of parameter adjustment is satisfied, the parameters are updated via the projection-type update laws determined based on the Lyapunov function theory. This further guarantees the stability of the controlled system in the ultimate bounded sense. Simulation results demonstrate the feasibility and performance of the proposed controllers.

References

[1] Jang, J.-S. R., Sun, C.-T., and Mizutani, E. (1997). Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence (Prentice-Hall, Upper Saddle River, New Jersey).
[2] Mitra, S. and Hayashi, Y. (2000). Neuro-fuzzy rule generation: survey in soft computing framework, IEEE Trans. Neural Networks, 11(3), pp. 748–768.
[3] Lu, Y., Sundararajan, N., and Saratchandran, P. (1997). A sequential learning scheme for function approximation using minimal radial basis function neural networks, Neural Comput., 9(2), pp. 461–478.
[4] Wu, S. and Er, M. J. (2000). Dynamic fuzzy neural networks - a novel approach to function approximation, IEEE Trans. Syst. Man Cybern. Part B Cybern., 30(2), pp. 358–364.
[5] Leng, G., McGinnity, T. M., and Prasad, G. (2005). An approach for on-line extraction of fuzzy rules using a self-organising fuzzy neural network, Fuzzy Sets Syst., 150(2), pp. 211–243.
[6] Rong, H.-J., Huang, G.-B., Sundararajan, N., and Saratchandran, P. (2009). Online sequential fuzzy extreme learning machine for function approximation and classification problems, IEEE Trans. Syst. Man Cybern. Part B Cybern., 39(4), pp. 1067–1072.
[7] Rong, H.-J., Sundararajan, N., Huang, G.-B., and Zhao, G.-S. (2011). Extended sequential adaptive fuzzy inference system for classification problems, Evolving Syst., 2, pp. 71–82.
[8] Subramanian, K. and Suresh, S. (2012). A meta-cognitive sequential learning algorithm for neuro-fuzzy inference system, Appl. Soft Comput., 12, pp. 3603–3614.
[9] Xu, B., Sun, F., Liu, H., and Ren, J. (2012). Adaptive Kriging controller design for hypersonic flight vehicle via back-stepping, IET Control Theory Appl., 6(4), pp. 487–497.
[10] Pashilkar, A. A., Sundararajan, N., and Saratchandran, P. (2006). Adaptive back-stepping neural controller for reconfigurable flight control systems, IEEE Trans. Control Syst. Technol., 14(3), pp. 553–561.
[11] van Soest, W. R., Chu, Q. P., and Mulder, J. A. (2006). Combined feedback linearization and constrained model predictive control for entry flight, J. Guid. Control Dyn., 29(2), pp. 427–434.
[12] Li, G., Wen, C., Zheng, W. X., and Chen, Y. (2011). Identification of a class of nonlinear autoregressive models with exogenous inputs based on kernel machines, IEEE Trans. Signal Process., 59(5), pp. 2146–2159.
[13] Takagi, T. and Sugeno, M. (1985). Fuzzy identification of systems and its applications to modelling and control, IEEE Trans. Syst. Man Cybern., SMC-15(1), pp. 116–132.
[14] Zou, A.-M., Hou, Z.-G., and Tan, M. (2008). Adaptive control of a class of nonlinear pure-feedback systems using fuzzy backstepping approach, IEEE Trans. Fuzzy Syst., 16(4), pp. 886–897.
[15] Lin, F.-J., Shieh, P.-H., and Chou, P.-H. (2008). Robust adaptive backstepping motion control of linear ultrasonic motors using fuzzy neural network, IEEE Trans. Fuzzy Syst., 16(3), pp. 676–692.
[16] Wen, J. (2011). Adaptive fuzzy controller for a class of strict-feedback nonaffine nonlinear system, in 2011 9th IEEE International Conference on Control and Automation (ICCA) (IEEE), pp. 1255–1260.
[17] Lin, C.-M. and Li, H.-Y. (2012). TSK fuzzy CMAC-based robust adaptive backstepping control for uncertain nonlinear systems, IEEE Trans. Fuzzy Syst., 20(6), pp. 1147–1154.
[18] Chen, W., Jiao, L., Li, R., and Li, J. (2010). Adaptive backstepping fuzzy control for nonlinearly parameterized systems with periodic disturbances, IEEE Trans. Fuzzy Syst., 18(4), pp. 674–685.
[19] Sreenatha, A. G., Patki, M. V., and Joshi, S. V. (2000). Fuzzy logic control for wing-rock phenomenon, Mech. Res. Commun., 27(3), pp. 359–364.
[20] Elzebda, J. M., Nayfeh, A. H., and Mook, D. T. (1989). Development of an analytical model of wing rock for slender delta wings, J. Aircr., 26(8), pp. 737–743.
[21] Nayfeh, A. H., Elzebda, J. M., and Mook, D. T. (1989). Analytical study of the subsonic wing-rock phenomenon for slender delta wings, J. Aircr., 26(9), pp. 805–809.
[22] Škrjanc, I. (2007). A decomposed-model predictive functional control approach to air-vehicle pitch-angle control, J. Intell. Rob. Syst., 48(1), pp. 115–127.
[23] Luo, J. and Lan, C. E. (1993). Control of wing-rock motion of slender delta wings, J. Guid. Control Dyn., 16(2), pp. 225–231.
[24] Monahemi, M. M. and Krstic, M. (1996). Control of wing rock motion using adaptive feedback linearization, J. Guid. Control Dyn., 19(4), pp. 905–912.
[25] Shue, S.-P., Agarwal, R. K., and Shi, P. (2000). Nonlinear H∞ method for control of wing rock motions, J. Guid. Control Dyn., 23(1), pp. 60–68.
[26] Sundararajan, N., Saratchandran, P., and Yan, L. (2001). Fully Tuned Radial Basis Function Neural Networks for Flight Control (Kluwer Academic Publishers, Boston).
[27] Rong, H.-J., Sundararajan, N., Huang, G.-B., and Saratchandran, P. (2006). Sequential Adaptive Fuzzy Inference System (SAFIS) for nonlinear system identification and prediction, Fuzzy Sets Syst., 157(9), pp. 1260–1275.
[28] Wai, R.-J. and Yang, Z.-W. (2008). Adaptive fuzzy neural network control design via a T-S fuzzy model for a robot manipulator including actuator dynamics, IEEE Trans. Syst. Man Cybern. Part B Cybern., 38(5), pp. 1326–1346.
[29] de Jesús Rubio, J. (2009). SOFMLS: online self-organizing fuzzy modified least-squares network, IEEE Trans. Fuzzy Syst., 17(6), pp. 1296–1309.
[30] Lughofer, E. and Angelov, P. (2011). Handling drifts and shifts in on-line data streams with evolving fuzzy systems, Appl. Soft Comput., 11(2), pp. 2057–2068.
[31] Wang, L. (1994). Adaptive Fuzzy Systems and Control: Design and Stability Analysis (PTR Prentice Hall).
[32] Wang, L.-X. (1993). Stable adaptive fuzzy control of nonlinear systems, IEEE Trans. Fuzzy Syst., 1(2), pp. 146–155.
[33] Han, H., Su, C.-Y., and Stepanenko, Y. (2001). Adaptive control of a class of nonlinear systems with nonlinearly parameterized fuzzy approximators, IEEE Trans. Fuzzy Syst., 9(2), pp. 315–323.
[34] Singh, S. N., Yim, W., and Wells, W. R. (1995). Direct adaptive and neural control of wing-rock motion of slender delta wings, J. Guid. Control Dyn., 18(1), pp. 25–30.
[35] Passino, K. M. (2005). Biomimicry for Optimization, Control, and Automation (Springer Science and Business Media).
[36] de Jesús Rubio, J., Vázquez, D. M., and Pacheco, J. (2010). Backpropagation to train an evolving radial basis function neural network, Evolving Syst., 1(3), pp. 173–180.
[37] Lemos, A., Caminhas, W., and Gomide, F. (2011). Fuzzy evolving linear regression trees, Evolving Syst., 2(1), pp. 1–14.
[38] Lughofer, E., Bouchot, J.-L., and Shaker, A. (2011). On-line elimination of local redundancies in evolving fuzzy systems, Evolving Syst., 2(3), pp. 165–187.
[39] Ji, J., Hansen, C. H., and Zander, A. C. (2008). Nonlinear dynamics of magnetic bearing systems, Int. J. Mater. Syst. Struct., 19(12), pp. 1471–1491.
[40] Tong, S., Wang, T., Li, Y., and Chen, B. (2013). A combined backstepping and stochastic small-gain approach to robust adaptive fuzzy output feedback control, IEEE Trans. Fuzzy Syst., 21(2), pp. 314–327.
[41] Chang, Y.-H. and Chan, W.-S. (2014). Adaptive dynamic surface control for uncertain nonlinear systems with interval type-2 fuzzy neural networks, IEEE Trans. Cybern., 44(2), pp. 293–304.
[42] Wen, J. and Jiang, C. (2011). Adaptive fuzzy controller for a class of strict-feedback nonaffine nonlinear systems, J. Syst. Eng. Electron., 22(6), pp. 967–974.
[43] Du, H., Zhang, N., Ji, J., and Gao, W. (2010). Robust fuzzy control of an active magnetic bearing subject to voltage saturation, IEEE Trans. Control Syst. Technol., 18(1), pp. 164–169.
[44] Lin, F.-J., Chen, S.-Y., and Huang, M.-S. (2010). Tracking control of thrust active magnetic bearing system via Hermite polynomial-based recurrent neural network, IET Electr. Power Appl., 4(9), pp. 701–714.
[45] Ge, S. S., Hang, C. C., and Zhang, T. (1999). A direct adaptive controller for dynamic systems with a class of nonlinear parameterizations, Automatica, 35, pp. 741–747.
[46] Spooner, J. T. and Passino, K. M. (1996). Stable adaptive control using fuzzy systems and neural networks, IEEE Trans. Fuzzy Syst., 4(3), pp. 339–359.


Part V

Evolutionary Computation


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0022

Chapter 22

Evolutionary Computation: Historical View and Basic Concepts

Carlos A. Coello Coello∗,x, Carlos Segura†,¶, and Gara Miranda‡

∗ Departamento de Computación (Evolutionary Computation Group), CINVESTAV-IPN, México D.F., Mexico
† Centro de Investigación en Matemáticas, Área Computación, Callejón Jalisco s/n, Mineral de Valenciana, Guanajuato, Mexico
‡ Dpto. Estadística, I. O. y Computación, Universidad de La Laguna, Santa Cruz de Tenerife, Spain

x [email protected] ¶ [email protected]  [email protected]

22.1. Introduction

In most real-world problems, achieving optimal (or at least good) solutions results in saved resources, time, and expenses. Problems in which optimal solutions are desired are known as optimization problems. In general, optimization algorithms search among the set of possible solutions (the search space), evaluating the candidates, perhaps also analyzing their feasibility, and choosing from among them the final solution or solutions to the problem. Another perspective is that an optimization problem consists in finding the values of those variables that minimize or maximize the objective functions while satisfying any existing constraints. Optimization problems involve variables that somehow define the search space. Constraints, however, are not mandatory: many unconstrained problems have been formulated and studied in the literature. As for the objective function, most optimization problems have been formulated on the basis of a single objective function. However, in many problems, it is actually necessary to optimize several different objectives at once [1]. Usually, such objectives are mutually competitive or conflicting, and values of variables that optimize one objective may be far from optimal for the others. Thus, in multi-objective optimization, there is no single and unique optimal solution, but rather a set of optimal solutions, none of them better than the others.

Optimization problems can be simple or complex depending on the number of objectives, the solution space, the number of decision variables, the constraints specified, etc. Some optimization problems that satisfy certain properties are relatively easy to solve [2]. However, many optimization problems are quite difficult to solve; in fact, many of them fit into the family of NP-hard problems [3]. Assuming that P ≠ NP, there is no approach that guarantees that the optimal solution to these problems is achieved in polynomial time. In the same way, optimization algorithms can be very simple or complex depending on whether they are able to give exact or approximate solutions, whether they are less or more efficient, or even whether they need to be specifically designed for a particular problem or not. Moreover, algorithms can also be classified by the amount of completion time required in comparison to their input size. Some problems may have multiple related algorithms of differing complexity, while other problems might have no algorithms or, at least, no known efficient algorithms.

Taking the above into account, when deciding which algorithmic technique to apply, much depends on the properties of the problem to be solved, but also on the expected results: the quality of the solution, the efficiency of the approach, the flexibility and generality of the method, etc. Since many optimization algorithms have been proposed, with different purposes and from different perspectives, there is no single criterion that can be used to classify them. We can find different categories of algorithms in the literature, depending on the features related to their internal implementation, their general behavior scheme, the field of application, complexity, etc. For the purposes of this chapter, the classification presented in Figure 22.1 provides a meaningful starting point.
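The idea that a multi-objective problem has a set of optimal solutions, none of them better than the others, can be made concrete with a dominance check. The following is a minimal Python sketch, assuming that every objective is to be minimized; the function names `dominates` and `non_dominated` and the sample objective vectors are illustrative assumptions and do not come from the chapter.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors (cost, time); both objectives are minimized.
candidates = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 3.0), (5.0, 5.0)]
front = non_dominated(candidates)
```

Here `(3.0, 8.0)` is discarded because `(2.0, 7.0)` is better in both objectives, while the surviving points trade one objective against the other, so none of them can be ranked above the rest without extra preference information.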
In general, most solution techniques involve some form of exploration of the set of feasible solutions. Exact approaches are usually based on an enumerative,

Figure 22.1: Classification of optimization methods.


exhaustive, or brute-force search that yields the optimal solution to the problem. However, in most problems dealing with real or large instances, the search space may be far too large, or there may not even exist a convenient way to enumerate it. In such situations, it is common to resort to some kind of approximate method. These approaches do not generally ensure optimal solutions, but some of them at least ensure a certain level of solution quality, e.g., approximation algorithms. Ad hoc heuristics are another kind of approximate method that incorporate information about the problem at hand in order to decide which candidate solution should be tested next or how the next candidate can be produced [4]. Since they are based on specific knowledge of the problem, such ad hoc methods have the drawback of being problem dependent. That is, a wide variety of heuristics can be specifically and successfully designed to optimize a given problem, but unfortunately, they cannot be directly extrapolated to another problem. The specific mechanisms or decisions that work well in one case may not work at all for related problems that share common features, e.g., problems belonging to the same family. As a result, such procedures are unable to adapt to particular problem instances or to extend to different problems. Restarting and randomization strategies, as well as combinations of simple heuristics, offer only partial and largely unsatisfactory answers to these issues [5]. Metaheuristics appear as a class of modern heuristics whose main goal is to address these open challenges [6, 7]. A metaheuristic is an approximate method for solving a very general class of problems. It combines objective functions and heuristics in an abstract and hopefully efficient way, usually without requiring a deep insight into their structure. 
Metaheuristics have been defined as master strategies to guide and modify other heuristics to produce solutions beyond those normally identified by ad hoc heuristics. Compared to exact search methods, metaheuristics are not able to ensure a systematic exploration of the entire solution space, so they cannot guarantee that optimal solutions will be achieved. Instead, with the aim of increasing efficiency, they seek to examine only the parts where “good” solutions may be found. In most metaheuristics, a single solution, or a set of them, is maintained as active solutions and the search procedures are based on movements within the search space. New candidate solutions are built mainly in two ways: from scratch or as an evolution of current solutions. In many cases, these new candidate solutions are generated by means of a neighborhood definition, although different strategies are also used. At each process iteration, the metaheuristic evaluates different kinds of moves, some of which are selected to update the current solutions as determined by certain criteria (objective value, feasibility, statistical measures, etc.). Well-designed metaheuristics try to avoid being trapped in local optima or to repeatedly cycle through solutions that were already visited. It is also important


to provide a reasonable assurance that the search does not overlook promising regions. A wide variety of metaheuristics have been proposed in the literature [8]. In many cases, these criteria or combinations of moves are applied stochastically, using statistics obtained from samples of the search space, or are based on a model of some natural phenomenon or physical process [9, 10]. Another interesting feature that arises when analyzing the internal operation of a metaheuristic involves the number of candidate solutions or states that are being managed at a time. Trajectory-based metaheuristics maintain a single current state at any given instant, which is replaced by a new one at each generation. In contrast, population-based metaheuristics maintain a current pool with several candidate states instead of a single current state.

As described before, in its search for better solutions and greater quality, research in the field of optimization has focused on the design of metaheuristics as general-purpose techniques to guide the construction of solutions, with the aim of improving the resulting quality. In recent decades, research in this field has led to the creation of new hybrid algorithms, combining different concepts from various fields such as genetics, biology, artificial intelligence, mathematics, physics, and neurology, among others [11]. In this context, evolutionary computation (EC) is a research area within computer science that studies the properties of a set of algorithms, usually called evolutionary algorithms (EAs), that draw their inspiration from natural evolution. Initially, several algorithms intended to better understand the population dynamics present in evolution were designed [12]. Although the design of the EC schemes is based on drawing inspiration from natural evolution, a faithful modeling of biological processes is not usually incorporated.
Since EAs are usually overly simplistic versions of their biological counterparts, using EAs to model population dynamics is not too widely accepted by the evolutionary biology community [13]. Instead, the advances achieved in recent decades have shown that the main strength of EC is that it can be successfully applied to numerous practical problems that appear in several areas such as process control, machine learning, and function optimization [14, 15]. For this reason, EC should be seen as a general and robust evolution-inspired framework devoted to problem-solving. Although it is not easy to classify the kinds of problems that can be solved with EC, some taxonomies have been proposed. For instance, Everett distinguished between two main uses of EAs [16]: optimizing the performance of operating systems, and the testing and fitting of quantitative models. Another view was proposed by Eiben and Smith [17] by providing an analogy between problems and systems. In working systems, the three main components are: inputs, outputs, and the internal model connecting these two. They distinguished between three kinds of problems that can be solved with EC depending on which of the three components


is unknown. In any case, the important fact is that EC has been successfully used in a huge number of applications such as numerical optimization, design of expert systems, training of neural networks, data mining, and many others. In general, EC might be potentially applied to any problem where we can identify a set of candidate solutions and a quality level associated with each of them. This chapter is devoted to presenting the history and philosophy behind the use of EC, as well as introducing the reader to the design of EAs. In Section 22.2, a brief review of natural evolution theories is provided. In addition, some of the main concepts from the fields of evolution and genetics that have inspired developments in EC are introduced. Then, Section 22.3 presents the history and philosophy of the first EC approaches ever developed. Some of the latest developments, as well as the current unifying view of EC, are summarized in Section 22.4. Section 22.5 offers some observations on the design of EAs and describes some of the most popular components that have been used when tackling optimization problems with EC. Finally, some concluding remarks are given in Section 22.6.
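The distinction drawn earlier in this introduction between trajectory-based and population-based metaheuristics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than any specific algorithm from the literature reviewed here: the toy objective (counting ones in a bit string), the bit-flip move, the keep-the-best-half survivor rule, and all function names and parameter values are hypothetical choices made only for the example.

```python
import random


def fitness(bits):
    """Toy quality measure (assumption): the number of ones in the bit string."""
    return sum(bits)


def mutate(bits, rng, rate):
    """Neighborhood move: flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def trajectory_search(n_bits, steps, rng):
    """Maintain a single current state; replace it when a neighbor is at least as good."""
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    for _ in range(steps):
        neighbor = mutate(current, rng, 1.0 / n_bits)
        if fitness(neighbor) >= fitness(current):
            current = neighbor
    return current


def population_search(n_bits, pop_size, generations, rng):
    """Maintain a pool of states; the best half survives and produces mutated copies."""
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        offspring = [mutate(p, rng, 1.0 / n_bits) for p in survivors]
        pop = survivors + offspring
    return max(pop, key=fitness)


rng = random.Random(0)  # fixed seed so that runs are repeatable
best_single = trajectory_search(n_bits=20, steps=200, rng=rng)
best_pool = population_search(n_bits=20, pop_size=10, generations=40, rng=rng)
```

Both routines need only what the text requires of a problem, namely a way to represent candidate solutions and a quality level for each; they differ in how many candidates are kept alive at a time.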

22.2. The Fundamentals of Evolution

The term evolution is defined by the Oxford Dictionary as "the process by which different kinds of living organisms are thought to have developed and diversified from earlier forms during the history of the Earth." A much more general definition of this term is "the gradual development of something, especially from a simple to a more complex form." The word evolution originated from the Latin word evolutio, which means unrolling, from the verb evolvere. Early meanings were related to physical movement, first recorded in describing a tactical “wheeling” maneuver in the realignment of troops or ships. Current senses stem from the notion of “opening out” and “unfolding,” giving rise to a general sense of “development.” The term appeared a couple of centuries before Darwin wrote On the Origin of Species. In fact, Darwin did not even use the word evolution in his book until the last line:

There is grandeur in this view of life, with its several powers, having been originally breathed by the Creator into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.

22.2.1. A Historical Review

During the eighteenth century, a group of researchers, called naturalists, managed to gather a great deal of information on the flora and fauna in many different areas of our planet. In an attempt to organize and classify this remarkable amount of information, Carl Linnaeus (1707–1778) proposed a set of rules to assign genus and species labels to all known living beings. His taxonomy, called Systema Naturae [18], focused solely on the morphological properties of living beings to define the classification. Since this classification criterion was purely morphological, to Linnaeus the term species identified distinct groups with no relation of origin. This perspective, called fixity of species, considered that each species was created as it was, and individuals did not experience changes over time. Linnaeus began his study believing this concept, but later rejected it after observing interbreeding between various species. Due to his extensive work in the field, he is considered the founding father of our current taxonomic system.

The accumulation of information provided by naturalists and the progress achieved in the taxonomies led to the adoption of new approaches, different from the fixity of species, and based on the fact that some species came from other species. This idea required defining a new classification that reflected the relationships among organisms. It was called natural classification. Although Georges-Louis Leclerc, Comte de Buffon (1707–1788), was the first to question Linnaeus’s fixity of species, the first to propose a hypothesis on how one species could come from another was Jean-Baptiste Pierre Antoine de Monet, Chevalier de Lamarck (1744–1829). Lamarck, in his Zoological Philosophy [19], presented a systematic description for the evolution of living beings. For Lamarck, species develop as a result of their reaction and adaptation to the environment. These changes, therefore, must be gradual and will occur over long periods of time. Lamarck believed that certain organs are strengthened by the use that animals make of them, mainly due to the specific nature of their environment. Other organs, in contrast, are atrophied and eventually eliminated because they fall into disuse.
For Lamarck, nature developed in the following way: circumstances create a need, that need creates habits, habits produce the changes resulting from the use or disuse of the organ and the means of nature are responsible for setting these modifications. Lamarck believed that these physiological changes acquired over the life of an organism could be transmitted to the offspring. This hypothesis is known as the inheritance of acquired characteristics and is also commonly referred to as Lamarckism. We can say, therefore, that Lamarck was the first to formulate a strictly evolutionary hypothesis, although the word “evolution” was at that time reserved for the development of the embryo, so his proposal was referred to as transformism. Despite Lamarck’s hypothesis, there was no experimental evidence for the existence of mechanisms by which individuals could transmit the alleged improvements acquired over the course of their lives. In fact, Lamarckism is now considered an obsolete theory. The principles governing the transformation of the individual characteristics, which are now commonly accepted by science, were first established by Darwin and Wallace. The principles governing the transmission or inheritance of these characteristics were first established by Mendel. In 1858, Charles Robert Darwin (1809–1882) and Alfred Russel Wallace (1823–1913) gave a presentation at the Linnean Society of London on a theory of


the evolution of species by means of natural selection [20]. One year later, Darwin published On the Origin of Species [21], a work in which he provided a detailed explanation of his theory supported by numerous experiments and observations of nature. Darwin’s theory, usually known as Darwinism, is based on a set of related ideas which try to explain different features of biological reality:

• Living beings are not static; they change continuously: some new organisms are created and some die out. This process of change is gradual, slow, and continuous, with no abrupt or discontinuous steps.
• Species diversify by adapting to different environments or ways of life, thus branching out. The implication of this phenomenon is that all species are somehow related, to varying degrees, and ultimately all species have a single origin in one common and remote ancestor. That is, all living organisms can be traced back to a single and common origin of life.
• Natural selection is the key to the system. It is conceived as a result of two factors: the inherited natural variability of the individuals of a species, and the selection through survival in the struggle for life, i.e., the fittest individuals, those born with favorable spontaneous modifications better suited to the environment, are more likely to survive, reproduce, and leave offspring with these advantages. This implies that each slight variation, if useful, is preserved.

Although Darwin knew that there should be a mechanism for transmitting these characteristics from parents to offspring, he was unable to discover it. It was Gregor Johann Mendel (1822–1884) who suggested a number of hypotheses that would set the basic underlying principles of heredity. Mendel’s research [22] focused on plants (peas) and especially on individual features, all unequivocally different from each other and having the peculiarity of not being expressed in a graduated form, i.e., the feature is either present or not present. This research led Mendel to three important conclusions:

• The inheritance of each trait is determined by “units” or “factors” that are passed on to descendants unchanged. These units are now called genes.
• An individual inherits one such unit from each parent for each trait.
• A trait may not show up in an individual but can still be passed on to the next generation.

In this sense, the term phenotype refers to the set of physical or observable characteristics of an organism, while the term genotype refers to the individual’s complete collection of genes. The difference between genotype and phenotype is that the genotype can be distinguished by looking at the deoxyribonucleic acid (DNA) of the cells and the phenotype can be established from observing the external appearance of an organism.


Mendel’s laws laid the foundation of modern genetics and abolished the idea that characteristics are transmitted from parents to children through bodily fluids so that, once mixed, they cannot be separated, thus causing the offspring to have characteristics that mix those of the parents. This theory is called pangenesis and is mainly based on observations such as how crossing plants with red flowers with plants with white flowers produces plants with pink flowers. Starting from Mendel’s advances in genetics and the concept of natural selection, the modern evolutionary synthesis was established in the 1930s and 1940s [23–25]. Basically, this theory [26] served as a link between the unit of evolution (“the gene”) and the mechanism of evolution (“the selection”), i.e., gradual changes and natural selection in populations are the primary mechanisms of evolution. According to this theory, genetic variation in populations arises mainly by chance through mutation (alteration or change in the genetic information, i.e., the genotype, of a living being) and recombination (the mixing of the chromosomes produced at meiosis). Finally, note that there are also theories that relate learning and evolution. This is the case of the Baldwin effect [27], which was proposed by James Mark Baldwin. This theory has also inspired some advances in EC.
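As a rough illustration of how the two sources of variation named above, mutation and recombination, are typically mimicked in software, the following Python sketch applies them to string genotypes. It is a simplified analogy under stated assumptions and not a model of the biological processes: the one-point cut, the four-letter alphabet, and the function names `one_point_crossover` and `point_mutation` are hypothetical choices made only for this example.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Recombination: exchange the tails of two equal-length genotypes at a random cut."""
    cut = rng.randint(1, len(parent_a) - 1)  # both sides of the cut are non-empty
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2


def point_mutation(genotype, rng, alphabet="ACGT"):
    """Mutation: replace the symbol at one random position with a different symbol."""
    position = rng.randrange(len(genotype))
    choices = [s for s in alphabet if s != genotype[position]]
    return genotype[:position] + rng.choice(choices) + genotype[position + 1:]


rng = random.Random(42)  # fixed seed so that runs are repeatable
child_1, child_2 = one_point_crossover("AAAAAAAA", "TTTTTTTT", rng)
mutant = point_mutation(child_1, rng)
```

Recombination only rearranges material already present in the parents, whereas the mutation step can introduce a symbol that neither parent carried at that position.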

22.2.2. Main Concepts

In the above description, some important definitions related to evolution and genetics were introduced. This section is devoted to describing some other terms that have been used in some popular books and papers on EC. A complete list of the terms contained in the papers on EC would be too extensive to be included. For this reason, we have selected the most broadly used terms:

• DNA (deoxyribonucleic acid): nucleic acid that consists of two long chains of nucleotides twisted into a double helix and joined by hydrogen bonds between the complementary bases adenine and thymine or cytosine and guanine. It is the main constituent of the chromosome and carries the genes as segments along its strands.
• Chromosome: structure within the nucleus of eukaryotic cells that bears the genetic material as a threadlike linear strand of DNA.
• Gene: as seen in the previous section, Mendel regarded “genes” as inherited factors that determine the external features of living beings. In modern genetics, a “gene” is defined as a sequence of DNA that occupies a specific location on a chromosome.
• Diploid cell: a cell that contains two sets of paired chromosomes. For example, humans have 2 sets of 23 chromosomes, for a total of 46 chromosomes. Exact replicas of diploid cells are generated through a process called mitosis.
• Haploid cell: a cell that contains only one complete set of chromosomes. The genesis of a haploid cell can occur by the meiosis of diploid cells or by the mitosis of haploid cells. One haploid cell will merge with another haploid cell at fertilization, i.e., sperm and ova (also known as “gametes”).
• Locus: the specific position of the chromosome where a gene or other DNA sequence is located.
• Genetic linkage: the tendency of genes that are located proximal to each other on a chromosome to be inherited together.
• Alleles: variant or alternative forms of a gene that are located at the same position (locus) on a chromosome.
• Genotype: genetic information of an organism. This information, contained in the chromosomes of the organism, may or may not be manifested or observed in the individual.
• Phenotype: a property observed in an organism, such as morphology, development, or behavior. It is the expression of the genotype in relation to a particular environment.
• Epistasis: the type of interaction between genes located at different loci on the same chromosome consisting of one gene masking or suppressing the expression of the other.
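In EC, several of the terms listed above are reused by analogy: an encoded candidate solution plays the role of the genotype, each position in the encoding is a locus, the values allowed at a position are its alleles, and the decoded solution is the phenotype. The Python sketch below shows one common convention, a bit-string genotype decoded into a real number. It is only an illustrative assumption; the function name `decode`, the eight-bit length, and the interval are hypothetical, and many other encodings are in use.

```python
def decode(genotype, low, high):
    """Map a bit-string genotype to a real-valued phenotype in [low, high]."""
    as_int = int("".join(str(bit) for bit in genotype), 2)
    max_int = 2 ** len(genotype) - 1
    return low + (high - low) * as_int / max_int


# Eight loci; the alleles at each locus are 0 or 1 (assumed encoding).
genotype = [1, 0, 0, 0, 0, 0, 0, 0]
phenotype = decode(genotype, low=-1.0, high=1.0)
```

Under this mapping, changing the allele at the first locus moves the phenotype much further than changing the allele at the last one, which is one reason the choice of genotype-to-phenotype mapping matters when designing an EA.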

22.3. History of Evolutionary Computation

EC is a field of research with a fascinating but complex history. In its origins, several independent researchers proposed numerous approaches with the common feature of using natural evolution to inspire their implementations. Some of the developments were also inspired by other related fields such as artificial life [28] and other areas of artificial intelligence. EC had many independent beginnings, so different terms have been used to refer to concepts that are broadly similar. Given the similarities of the different schemes proposed, the term EC was coined in the early 1990s in an effort to unify the field of evolution-inspired algorithms [29]. Determining the merits of the various authors is not easy, which is why several different versions of the history of EC have been told [30]. In this section, we review the origins of EC, focusing on the most important milestones.

Many of the bases of EC were developed in the 1960s and 1970s. During that period, three different popular schemes inspired by natural evolution were devised: evolutionary programming (EP), evolution strategies (ESs), and genetic algorithms (GAs). These schemes were inspired by the same ideas but some details, such as the nature of the representation and the operators applied, were different. In addition, there were several earlier attempts that highly influenced the advances in this area. Some of these proposals can be considered the first evolutionary algorithms (EAs) ever designed. In other cases, the schemes were substantially different from what is currently known as an EA. In any case, they are historically important because they had a significant impact on the subsequent studies carried out in the field. This section reviews the origins along with the three aforementioned types of EAs.

22.3.1. Origins

Wright and Cannon [31, 32] were probably the first researchers to influence the subsequent development of EC. They viewed natural evolution as a learning process where the genetic information of species is continuously changed through a trial-and-error mechanism. In this way, evolution can be regarded as a method whose aim is to maximize the adaptiveness of species to their environment. Wright went further and introduced the concept of fitness landscape to relate genotypes and fitness values. This concept has been extended and is widely used today to study the performance of EAs [33]. In addition, he developed one of the first studies where selection and mutation were linked, concluding that a proper balance between them is required to proceed successfully. These ideas were extended by Campbell [34]. He claimed that a blind-variation-and-selective-survival process is one of the main underlying principles of evolution, and hypothesized that “in all processes leading to expansions of knowledge, a blind-variation-and-selective-survival process is involved.” This is one of the principles behind universal Darwinism, a proposal that extends the applicability of Darwin’s theory to other areas beyond natural evolution.

Turing can also be considered an EC pioneer. In [35], Turing recognized a connection between machine learning and evolution. Specifically, Turing claimed that in order to develop an intelligent computer that passes the famous Turing test, a learning scheme based on evolution might be used. In this regard, the structure of the evolved machine is related to hereditary materials, changes made to the machine are related to mutations, and the use of an experimenter to evaluate the generated machines is related to natural selection. In addition, Turing had previously suggested that these kinds of methods might be used to train a set of networks that are akin to current neural networks [36].
Turing used the term “genetical search” to refer to these kinds of schemes. However, this paper was not published until 1968 [37] because his supervisor considered it a “schoolboy essay” [38]. By the time of its publication, other EAs had already appeared and the term “genetical search” was never adopted by the EC community. Some important advances were made in the late 1950s and early 1960s, coinciding with the period in which electronic digital computers became more readily available. In this period, several analyses of the population dynamics appearing in evolution were carried out. A survey of the application of digital


computers in this field was presented in [12]. Crosby identified three kinds of schemes:

• Methods based on studying the mathematical formulations that emerge in the deterministic analysis of population dynamics. These schemes usually assume an infinite population size and analyze the distributions of genes under different circumstances. An example of this type of scheme is the one developed by Lewontin [39], where the interactions between selection and linkage are analyzed.
• Approaches that simulate evolution but do not explicitly represent a population [40]. In these methods, the frequencies of the different potential genotypes are stored for each generation, resulting in a scheme that does not scale with the number of genes. In addition, a faithful mathematical representation of the evolutionary system is required.
• Schemes that simulate evolution by explicitly maintaining a representation of the population. Fraser and some of his colleagues pioneered these kinds of methods and analyzed this approach in a set of papers published over the course of a decade [41, 42]. The main conclusions drawn from these studies were gathered in their seminal book [43]. From its inception, this model was widely accepted, and several researchers soon adopted it [44, 45]. In fact, even though there are several implementation details that differ from those used in current EAs, several authors regard the works of Fraser as representing the first invention of a GA [29]. As for the implementation details, the main difference with respect to GAs is that individuals are diploid instead of haploid. However, in our opinion, the key difference is not the implementation but the philosophy behind the use of the scheme. Fraser used his algorithms as a way to analyze population dynamics and not to solve problems, which is the typical current application of EAs.

There were other authors who proposed schemes resembling current EAs.
Among them, the works of Friedman, Box, and Friedberg are particularly important and cited often. Friedman [46] hypothesized on the automatic design of control circuits by using a scheme inspired by natural evolution that was termed “selective feedback.” His proposals were very different from current ones. For instance, there was no notion of population and/or generations. Moreover, Friedman did not implement his method, and some of his assumptions seem to be overly optimistic [30]. In any event, his research can be considered the first efforts in the field of evolvable hardware. Box proposed a technique called “Evolutionary Operation” (EVOP) [47] to optimize the management processes of the chemical industry. His technique essentially used a single parent to generate multiple offspring by modifying a few production parameters at a time. Then, one individual was selected to survive to the next generation by considering certain statistical evidence. The system was


not autonomous. Specifically, the variation and selection processes required human intervention, so it can be considered the first interactive EA.

The most important contribution of Friedberg et al. to the field of EC was their attempt to automatically generate computer programs using an evolutionary scheme [48, 49]. In these papers, Friedberg et al. did not identify any relationship between their proposal and natural evolution. However, in subsequent publications by Friedberg's coauthors, such a relationship was explicitly identified [50]. In their first attempts, the aim was to automatically generate small programs with some simple functionalities that rely on a tailor-made instruction set. In order to direct the search to promising regions, problem-dependent information was considered by defining a variation scheme specifically tailored to the problem at hand. Even with this addition, their success was limited. Nevertheless, some of the analyses they carried out were particularly important for the subsequent achievements in EC. Among their contributions, some of the most important are the following:
• The idea of dividing candidate solutions into classes is a precursor of the notions of intrinsic parallelism and schema, proposed years later by Holland [51].
• Several instruction sets were tested, showing, for the first time, the influence of genotype-to-phenotype mapping.
• A credit assignment algorithm was used to measure the influence of the single instructions. This idea is closely related to the work of Holland on GAs and classifier systems.

There are several more papers that did not elicit much attention when they were first published, but that proposed techniques very similar to others that were later reinvented. For instance, the work by Reed, Toombs, and Barricelli [52] provides several innovations closely related to self-adaptation, crossover, and coevolution (see footnote 1).

Footnote 1: Some preliminary work on coevolution was started in [57].

22.3.2. Evolutionary Programming

Evolutionary programming (EP) was devised by Fogel [53] while he was engaged in basic research on artificial intelligence for the National Science Foundation. At the time, most attempts to generate intelligent behavior used man as a model [54]. However, Fogel realized that since evolution had been able to create humans and other intelligent creatures, a model mimicking evolution might be successful. Note that, as previously described, some research into this topic had already been published by then. The basic ideas adopted by Fogel are similar: use variation and selection to evolve candidate solutions better adapted to the given goals. However, Fogel was not aware of the existing research and his proposals differed substantially


from them, meaning that his models can be considered a reinvention of evolution-based schemes.

Philosophically, the coding structures utilized in EP are an abstraction of the phenotype of different species [55]. As a result, the encoding of candidate solutions can be freely adapted to better support the requirements of the problems at hand. Since there is no sexual communication between different species, this view of candidate solutions also justifies the lack of recombination operators in EP. Thus, the non-application of recombination operators is due to a conceptual rather than a technical view. The lack of recombination, the freedom to adapt the encoding, and the use of a probabilistic survivor selection operator (not used in the initial versions of EP) might be considered the main features that distinguished EP from other EC paradigms [56].

In his first designs, Fogel reasoned that one of the main features that characterize intelligent behavior is the capacity to predict one’s environment, coupled with a translation of said predictions into an adequate response so as to accomplish a given objective. A finite state transducer is a finite state machine (FSM) that allows a sequence of input symbols to be transformed into a sequence of output symbols. Depending on the states and given transitions, they can be used to model different situations and cause different transformations. Thus, Fogel viewed FSMs as a proper mechanism for dealing with the problem of generating intelligent behavior. In his initial experiments, the aim was to develop an FSM capable of predicting the behavior of an environment. The environment was considered to be a sequence of symbols belonging to a given input alphabet. The aim was to develop an FSM that, given the symbols previously generated by the environment, could predict the next symbol to emerge from the environment.

The initial EP proposal operates as follows. First, a population with N random FSMs is created. 
Then, each member of the population is mutated, creating as many offspring as parents. Five mutation modes are considered: add a state, delete a state, change the next state of a transition, change the output symbol of a transition, or change the initial state. Mutation operators are chosen randomly—other ways were also tested—and in some cases the offspring are subjected to more than one mutation. Finally, the offspring are evaluated and the best N FSMs are selected to survive. FSMs are evaluated in light of their capacity to correctly predict the next symbols in known sequences. Initially, the fitness is calculated by considering a small number of symbols, but as the evolution progresses, more symbols are attached to the training set. In the first experiments carried out involving EP, different prediction tasks were tested: periodic sequences of numbers, sequences with noise, non-stationary environments, etc. Subsequently, more difficult tasks were considered. Among other applications, EP was used to tackle pattern recognition, classification, and control


system design. All this research resulted in the publication of the first book on EC [58]. EP was also used to evolve strategies for gaming [59]. This work is particularly important because it is one of the first applications of coevolution. It is also important to remark that in most of the initial studies on EP, the computational results were limited because of the modest computational power available at the time of its inception. Most of these initial experiments were recapitulated and extended in a later period [60], providing a much deeper understanding of EP.

In the 1970s, most of the research into EP was conducted under the guidance of Dearholt. One of the main contributions of his work was the application of EP to practical problems. EP was applied to pattern recognition in regular expressions [61] and handwritten characters [62] and to classify different types of electrocardiograms [63]. These works incorporated several algorithmic novelties. Among them, the most influential were the use of several simultaneous mutation operators and the dynamic adaptation of the probabilities associated with the different mutation schemes.

Starting in the 1980s, EP diversified by using other arbitrary representations of candidate solutions in order to address different problems. The number of applications that have been addressed with EP is huge. In [60], EP was used to solve routing problems by considering a permutation-based encoding. In order to tackle the generation of computer programs, tree-based encoding was used in [64]. Real-valued vectors were used to deal with continuous optimization problems by [65] and to train neural networks by [66]. During this period, the feeling was that by designing problem-specific representations with operators specifically tailored to face a given problem, more efficient searches might be performed [54]. 
In fact, most EP practitioners argued that, by designing intelligent mutation schemes, the use of recombination operators might be avoided [64, 67].

Among the aforementioned topics, the application of EP to continuous optimization problems saw a large expansion in the 1990s. Over this period, great efforts were made to develop self-adaptive schemes. In self-adaptation, some of the parameters required by the algorithm are bound to each individual and evolved with the original variables. The variables of the problem are called object parameters, while those newly attached are called strategy parameters. Initial proposals adapted the scale factor of Gaussian mutation [68]. In more advanced variants of EP, several mutation operators are considered simultaneously [65]. Since self-adaptive schemes surpassed more traditional EP variants, self-adaptation was also adopted by practitioners addressing problems where real encoding was not used, becoming a common mechanism in EP [69].

Self-adaptive EP for continuous optimization presents several similarities with ESs. One of the most important differences is that EP does not include recombination. In addition, some implementation details were also different in the first variants devised [17]. For instance, the order in which the object and


strategy parameters were subjected to mutation was different in the two schemes, though these differences vanished over time. Moreover, it has been shown that there are cases where mutation alone can outperform schemes with recombination and vice versa [70]. There is also evidence that indicates that it is promising to adapt the parameters associated with crossover [71], which further blurs the boundaries between EP and ESs. Since self-adaptation for continuous optimization was first proposed in ESs, the history of and advances in self-adaptation for continuous optimization are presented in the section devoted to ESs.
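
The initial EP procedure described earlier in this section (random FSMs, one mutated offspring per parent, survival of the best N) can be sketched as follows. This is a minimal illustration and not Fogel's implementation: the binary alphabet, the dictionary-based machine layout, the cap on the number of states, the pooling of parents and offspring for survivor selection, and all function names are assumptions made for the example. The gradual growth of the training sequence is omitted.

```python
import random

ALPHABET = (0, 1)  # assumed binary input/output alphabet


def random_fsm(n_states, rng):
    # Each state maps every input symbol to (next_state, output_symbol).
    trans = [{s: (rng.randrange(n_states), rng.choice(ALPHABET)) for s in ALPHABET}
             for _ in range(n_states)]
    return {"start": rng.randrange(n_states), "trans": trans}


def fitness(fsm, seq):
    # Count how often the symbol emitted on seq[t] equals the next symbol seq[t + 1].
    state, hits = fsm["start"], 0
    for cur, nxt in zip(seq, seq[1:]):
        state, out = fsm["trans"][state][cur]
        hits += (out == nxt)
    return hits


def mutate(fsm, rng, max_states=8):
    # Copy the parent, then apply one of the five mutation modes chosen at random.
    trans = [dict(row) for row in fsm["trans"]]
    start, n = fsm["start"], len(trans)
    mode = rng.choice(["add", "delete", "next", "output", "start"])
    if mode == "add" and n < max_states:
        trans.append({s: (rng.randrange(n + 1), rng.choice(ALPHABET)) for s in ALPHABET})
    elif mode == "delete" and n > 1:
        dead = rng.randrange(n)
        del trans[dead]

        def fix(target):
            # Redirect arrows into the deleted state; shift higher indices down.
            if target == dead:
                return rng.randrange(n - 1)
            return target - 1 if target > dead else target

        trans = [{s: (fix(t), o) for s, (t, o) in row.items()} for row in trans]
        start = fix(start)
    elif mode == "next":
        q, s = rng.randrange(n), rng.choice(ALPHABET)
        trans[q][s] = (rng.randrange(n), trans[q][s][1])
    elif mode == "output":
        q, s = rng.randrange(n), rng.choice(ALPHABET)
        trans[q][s] = (trans[q][s][0], rng.choice(ALPHABET))
    else:
        # "start" mode, and the fallback when add/delete is not applicable.
        start = rng.randrange(n)
    return {"start": start, "trans": trans}


def evolve(seq, pop_size=20, generations=200, seed=0):
    rng = random.Random(seed)
    pop = [random_fsm(3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [mutate(p, rng) for p in pop]  # as many offspring as parents
        # Assumption: the best N are taken from parents and offspring together.
        pool = pop + offspring
        pop = sorted(pool, key=lambda m: fitness(m, seq), reverse=True)[:pop_size]
    return pop[0]


best = evolve([0, 1, 1] * 10)  # periodic environment
```

Because survivors are drawn from the pooled parents and offspring in this sketch, the best prediction score never decreases from one generation to the next.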

22.3.3. Evolution Strategies

In the mid-1960s, three students at the Technical University of Berlin (Bienert, Schwefel, and Rechenberg) were studying practical problems that arise in fluid mechanics and some related fields. Their desire was to build robots that could autonomously solve engineering problems [30]. They formulated these engineering tasks as optimization problems and developed autonomous machines that were automatically modified using certain rules. They also made some initial attempts at using traditional optimization schemes. However, since the problems were noisy and multimodal, their schemes failed to provide promising solutions. In order to avoid these drawbacks, they decided to apply mutation and selection methods analogous to natural evolution. Rechenberg published the first report on ESs by applying these ideas to minimize the total drag of a body in a wind tunnel [72]. Subsequently, other problems such as the design of pipes and efficient flashing nozzles were addressed [30].

Philosophically, the coding structures utilized in ESs are an abstraction of the phenotype of different individuals [55]. This is why the encoding of candidate solutions can be freely adapted to better support the requirements of the problems at hand. In addition, recombination among individuals is allowed in spite of not being implemented in the initial variants of ESs.

In the first problems tested, a set of quantitative discrete variables had to be adjusted. The initial proposal was very similar to the current (1+1)-ES and works as follows. First, a random solution is created and considered to be the current solution. Then, the current solution is mutated by slightly changing each variable. Specifically, mutation is implemented by emulating a Galton pin-board, so binomial distributions are used. The principle behind the design of this operator is to apply small mutations more frequently than large mutations, as is the case in nature. 
The current solution is replaced by the mutant only if the mutant solution is at least as good. This process is repeated until the stopping criterion is reached. Empirical evidence revealed the promising behavior of these kinds of schemes with respect to more traditional optimization methods.
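
As a rough illustration of the loop just described, the following sketch runs a (1+1)-style scheme on integer variables with a binomially distributed mutation. The number of pins, the iteration budget, the sphere test function, and the fixed starting point are assumptions chosen for the example; they are not taken from the original experiments.

```python
import random


def binomial_step(rng, n_pins=4):
    # Galton pin-board: every pin deflects the ball half a unit left or right,
    # so the total displacement is a centred binomial variable in {-2, ..., 2}
    # and small steps are more likely than large ones.
    return sum(rng.choice((-1, 1)) for _ in range(n_pins)) // 2


def one_plus_one_es(f, x0, iterations=2000, seed=0):
    rng = random.Random(seed)
    current, f_cur = list(x0), f(x0)
    for _ in range(iterations):
        mutant = [xi + binomial_step(rng) for xi in current]
        f_mut = f(mutant)
        if f_mut <= f_cur:  # minimization: keep the mutant if it is at least as good
            current, f_cur = mutant, f_mut
    return current, f_cur


def sphere(x):
    return sum(xi * xi for xi in x)


start = [12, -7, 30]
best, value = one_plus_one_es(sphere, start)
```

Accepting mutants that are merely as good as the current solution, as the text specifies, lets the search drift across plateaus instead of stalling on them.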


In 1965, Schwefel first implemented an ES on a general-purpose computer [73]. In this case, ESs were mainly applied to continuous optimization. The most important contribution of the new variant was the incorporation of Gaussian mutation with zero mean, which is still the most typical distribution used nowadays. The mutation operator could be controlled by tuning the standard deviation (or step size) of the Gaussian distribution. Studies on the step size resulted in the first theoretical analyses of ESs.

As previously discussed, initial ES variants only kept a single solution at a time. However, in the late 1960s, the use of populations with several individuals was introduced. The first population-based scheme tested is now known as (μ+1)-ES. This scheme uses a population with μ individuals. These individuals are used to create a new individual through recombination and mutation. Then, the best μ individuals from the μ+1 individuals are selected to survive. This scheme resembles the steady state selection that was popularized much later in the field of GAs [74].

ESs were extended further to support any number of parents and offspring. In addition, two different selection schemes and a notation that is still in use were proposed. The first selection comprises the (μ+λ)-ES. In this case, λ offspring are created from a population with μ individuals. Then, the best μ individuals from the union of parents and offspring are selected to survive. Schemes that consider the second kind of selection are known as (μ, λ)-ES. In this case, the best μ individuals from the λ offspring are selected to survive.

When ESs were first developed, the papers were written in German, meaning that they were accessible to a very small community. However, in 1981, Schwefel published the first book on ESs in English [75]. Since then, several researchers have adopted ESs and the number of studies in this area has grown enormously. 
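
The difference between the two survivor selection schemes introduced above can be made concrete with a small sketch. The function names and the toy one-dimensional individuals are hypothetical, and minimization of the objective f is assumed.

```python
def plus_selection(parents, offspring, mu, f):
    # (mu + lambda): the best mu of parents and offspring together survive.
    return sorted(parents + offspring, key=f)[:mu]


def comma_selection(parents, offspring, mu, f):
    # (mu, lambda): parents are always discarded, so lambda >= mu is required.
    if len(offspring) < mu:
        raise ValueError("comma selection needs at least mu offspring")
    return sorted(offspring, key=f)[:mu]


parents = [0.1, 2.0]          # toy individuals
offspring = [0.5, 1.0, 3.0]
plus_survivors = plus_selection(parents, offspring, 2, abs)
comma_survivors = comma_selection(parents, offspring, 2, abs)
```

With these toy values, plus selection retains the parent 0.1, whereas comma selection loses it because only offspring are eligible to survive.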
Schwefel’s book focuses on the use of ESs for continuous optimization, which is in fact the field where ESs have been most successful. In his book, Schwefel discusses, among other topics, the use of different kinds of recombination and the adaptation of the Gaussian distributions adopted for mutation. Regarding recombination, some controversies and misunderstandings have appeared. The use of adaptive mutation is one of the distinguishing features of ESs, and the history of and advances in both topics are described herein. Most of the papers reviewed in this chapter involve continuous single-objective optimization problems. However, it is important to note that the application of ESs has not been limited to these kinds of problems. For instance, multi-objective [76], mixed-integer [77], and constrained problems [78] have also been addressed.

22.3.3.1. Recombination

Initially, four different recombination schemes were proposed by combining two different properties [75]. First, two choices were given for the number of


parents taking part in creating an offspring: bisexual or multisexual. In bisexual recombination (also known as local), two parents are selected and they are used to recombine each parameter value. In the case of multisexual recombination (also known as global), different pairs of parents are selected for each parameter value. Thus, more than two individuals can potentially participate in the creation of each offspring. In addition, two different ways of combining two parameter values were proposed. In the discrete case, one of the values is selected randomly. In the intermediary case, the mean value of both parameters is calculated.

In the above recombination schemes, the number of parents taking part in each recombination cannot be specified. Rechenberg proposed a generalization that removed this restriction [79]. Specifically, a new parameter, ρ, is used to specify the number of parents taking part in the creation of each individual. The new schemes were called (μ/ρ, λ) and (μ/ρ+λ). Further extensions of the intermediary schemes were developed by weighting the contributions of the different individuals taking part in the recombination [80]. Different ways of assigning weights have been extensively tested. These schemes were not proper generalizations of the initial schemes, in the sense that the intermediary case calculates the mean of ρ individuals, and not the mean of two individuals selected from among a larger set, as was the case in the initial schemes. A more powerful generalization that allowed for both the original operators and those proposed by Rechenberg to be implemented was devised [81]. Thus, using the notation proposed in this last paper is preferable.

The situation became even more complicated because of a misinterpretation of the initial crossover schemes. 
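
The four initial schemes (local or global parent choice, combined with discrete or intermediary value combination) can be sketched as below. The function name and its flags are hypothetical, and the global variant follows the original definition given above, in which a fresh pair of parents is drawn for every parameter.

```python
import random


def recombine(population, rng, global_=False, intermediary=False):
    n = len(population[0])
    pair = rng.sample(population, 2)          # local: one pair for the whole offspring
    child = []
    for j in range(n):
        if global_:
            pair = rng.sample(population, 2)  # global: a new pair for each parameter
        a, b = pair[0][j], pair[1][j]
        # Intermediary: mean of both values. Discrete: one value chosen at random.
        child.append((a + b) / 2 if intermediary else rng.choice((a, b)))
    return child


rng = random.Random(0)
population = [[0.0, 0.0, 0.0], [2.0, 4.0, 6.0], [8.0, 8.0, 8.0]]
local_discrete = recombine(population, rng)
global_intermediary = recombine(population, rng, global_=True, intermediary=True)
```

In the global variants, up to 2n distinct individuals can contribute to a single offspring with n parameters, which is what allows more than two parents to participate.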
In the survey presented in [82], the authors showed a formula indicating that in global recombination, one parent is chosen and held fixed while the other parent is randomly chosen anew for each component (see footnote 2). This new form of recombination was adopted by many authors over the following years, so when using global recombination, the exact definition must be carefully given.

22.3.3.2. Adaptive mutation

One of the main features that characterize ESs is their mutation operator. In the case of continuous optimization, mutation is usually based on Gaussian distributions with zero mean. In most cases, individuals are disturbed by adding a random vector generated from a multivariate Gaussian distribution, i.e., the candidate solutions x_i are perturbed using Eq. (22.1). In this equation, C represents a covariance matrix.

$$x_i = x_i + N_i(0, C) \tag{22.1}$$

Footnote 2: To the best of our knowledge, [82] was the first paper where this definition was given.


From the inception of ESs, it was clear that the features of C might have a significant effect on search performance. As a result, several methods that adapt C during the run have been developed. Many ES variants can be categorized into one of the following groups depending on how C is adapted [83]:
• Schemes where the surfaces of equal probability density are hyper-spheres. In these cases, the only free adaptive parameter is the global step size or standard deviation.
• Schemes where the surfaces of equal probability density are axis-parallel hyper-ellipsoids. In these cases, the most typical approach is to have as many free adaptive parameters as there are dimensions.
• Schemes where the surfaces of equal probability density are arbitrarily oriented hyper-ellipsoids. In these cases, any positive-definite matrix can be used, resulting in $(n^2 + n)/2$ free adaptive parameters.

The first adaptive ES considered the global step size as the only free parameter. These schemes originated from the studies presented in [84], which analyzed the convergence properties of ESs depending on the relative frequency of successful mutations. These analyses led to the first adaptive scheme [75], where the global step size is adjusted by an online procedure with the aim of producing successful mutations with a ratio equal to 1/5. This rule is generally referred to as the “1/5 success rule of Rechenberg.” However, this rule is not general, so it does not work properly for many functions, which is why self-adaptive ESs were proposed. In the first of such variants, the only strategy parameter was the global step size, and this strategy parameter was subjected to variation using a log-normal distributed mutation.

In the case of schemes where the surfaces of equal probability are axis-parallel hyper-ellipsoids, a larger number of methods have been devised. 
A direct extension of previous methods considers the self-adaptation of n strategy parameters, where n is the number of dimensions of the optimization problem at hand. In such a case, each strategy parameter represents the variance in each dimension. Another popular ES variant is the derandomized scheme [85]. Ostermeier et al. realized that in the original self-adaptive ES, the interaction of the random elements can lead to a drop in performance. The scheme is based on reducing the number of random decisions made by ESs. Another interesting extension is termed accumulation [86]. In these variants, the adaptation is performed considering information extracted from the best mutations carried out during the whole run, and not only in the last generation. Note that these last variants favor adaptation instead of self-adaptation. By adapting the whole covariance matrix, correlations can be induced among the mutations performed involving the different parameters, thus making ESs more suitable for dealing with non-separable functions. The first adaptive scheme where the whole matrix is adapted was presented in [75]. In this case, the coordinate system


in which the step size control takes place, as well as the step sizes themselves, are self-adapted. In general, this method has not been very successful because it requires very large populations to operate properly. A very efficient and popular scheme is covariance matrix adaptation (CMA-ES) [83]. In this scheme, derandomization and accumulation are integrated in an adaptive scheme whose aim is to build covariance matrices that favor the creation of mutation vectors similar to those that were successful in previous generations of the run. This scheme is very complex and the practitioner must specify a large number of parameters. This has led to simpler schemes being devised. For instance, the covariance matrix self-adaptation scheme (CMSA) [87] reduces the number of parameters required by combining adaptation and self-adaptation. Many other, less popular schemes that are capable of adapting the whole covariance matrix have been devised. The reader is referred to [88] for a broader review of these kinds of schemes.

Finally, it is important to note that several schemes that favor some directions without adapting a full covariance matrix have been devised. For instance, in [89], the main descent direction is adapted. The main advantage of these kinds of schemes is the reduced complexity in terms of time and space. Since the time required for matrix computation is usually negligible in comparison to the function evaluation phase, these schemes have not been very popular. Still, they might be useful, especially for large-scale problems, where the cost of computing the matrices is much higher. In addition, several other adaptive and self-adaptive variants that do not fit into the previous categorization have been devised. For instance, some schemes that allow the use of Gaussian distributions with a non-zero mean have been proposed [90]. In addition, some methods not based on Gaussian mutation have been designed [91]. 
In this chapter, we have briefly described the fundamentals of different methods. The reader is referred to [88, 92, 93] for discussions of recent advances and implementation details for several ES variants.
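
To make the simplest adaptive scheme discussed in this section concrete, the sketch below combines a (1+1)-ES using the isotropic case of Eq. (22.1), C = σ²I, with the 1/5 success rule. The observation window, the adjustment factor c = 0.85, the iteration budget, and the sphere test function are assumed values chosen for illustration; the text does not prescribe them.

```python
import random


def es_one_fifth(f, x0, sigma=1.0, iterations=3000, window=20, c=0.85, seed=0):
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    successes = 0
    for t in range(1, iterations + 1):
        # Eq. (22.1) with C = sigma^2 * I: independent Gaussian noise per variable.
        y = [xi + rng.gauss(0.0, sigma) for xi in x]
        fy = f(y)
        if fy < fx:                 # a successful mutation (minimization)
            x, fx = y, fy
            successes += 1
        if t % window == 0:
            rate = successes / window
            if rate > 0.2:          # too many successes: take larger steps
                sigma /= c
            elif rate < 0.2:        # too few successes: take smaller steps
                sigma *= c
            successes = 0
    return x, fx, sigma


def sphere(x):
    return sum(xi * xi for xi in x)


start = [5.0, -3.0, 4.0]
x_best, f_best, sigma_final = es_one_fifth(sphere, start)
```

Only one scalar, the global step size, is adapted here, so the mutation distribution keeps hyper-spherical surfaces of equal probability density throughout the run.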

22.3.4. Genetic Algorithms

Holland identified the relationship between the adaptation process that appears in natural evolution and optimization [51], and conjectured that it might have critical roles in several other fields, such as learning, control, and/or mathematics. He proposed an algorithmic framework inspired by these ideas, which was initially called the Genetic Plan and was subsequently renamed the Genetic Algorithm. GAs are based on progressively modifying a set of structures with the aim of adapting them to a given environment. The specific way of making these modifications is inspired by genetics and evolution. His initial aim was to formally study the phenomenon of adaptation. However, GAs were soon used as problem solvers. Specifically, GAs have been successfully applied to numerous applications, such


as machine learning, control, and search and function optimization. A very popular use of GAs appeared in the development of learning classifier systems, which is an approach to machine learning based on the autonomous development of rule-based systems capable of generalization and inductive learning. The application of GAs as function optimizers was also soon analyzed [94], defining metrics, benchmark problems, and methodologies to perform comparisons. This is probably the field where GAs have been most extensively and successfully applied. The evolutionary schemes devised by Holland are very similar to those previously designed by Bremermann [95] and Bledsoe [96]. In fact, “by 1962 there was nothing in Bremermann’s algorithm that would distinguish it from what later became known as genetic algorithms” [97]. Moreover, they pioneered several ideas that were developed much later, such as the use of multisexual recombination, the use of real-encoding evolutionary approaches, and the adaptation of mutation rates. Moreover, since Bremermann’s algorithm was very similar to the schemes described earlier by Fraser, some authors regard GAs as having been reinvented at least three times [29]. Philosophically, the coding structures utilized in GAs are an abstraction of the genotype of different individuals [55]. Specifically, each trial solution is usually coded as a vector, termed chromosome, and each element is denoted with the term gene. A candidate solution is created by assigning values to the genes. The set of values that can be assigned are termed alleles. In the first variants of GAs, there was an emphasis on using binary representations—although using other encodings is plausible—so even when continuous optimization problems were addressed, the preferred encoding was binary. Goldberg claimed that “GAs work with a coding of the parameter set, not the parameters themselves” [98], which is a distinguishing feature of GAs. 
This results in the requirement to define a mapping between genotype and phenotype. The working principles of GAs are somewhat similar to those of EP and ESs. However, unlike other schemes, in most early forms of GAs, recombination was emphasized over mutation. The basic operation of one of the first GAs—denoted as canonical GA or simple GA—is as follows. First, a set of random candidate solutions is created. Then, the performance of each individual is evaluated and a fitness value is assigned to each individual. This fitness value is a measure of each individual’s quality, and in the initial GA variants, it had to be a positive value. Based on the fitness values, a temporary population is created by applying selection. This selection process was also called reproduction in some of the first works on GAs [98]. The best-performing individuals are selected with larger probabilities and might be included several times in the temporary population. Finally, in order to create the population of the next generation, a set of variation operators is applied. The most popular operators are crossover and mutation, but other operators such as inversion,


segregation, and translocation have also been proposed [98]. In most of the early research on GAs, more attention was paid to the crossover operator. The reason is that crossover was believed to be the major source of the power behind GAs, while the role of mutation was to prevent the loss of diversity [13]. This first variant was soon extended by Holland’s students. Some of these extensions include the development of game-playing strategies, the use of diploid representations, and the use of adaptive parameters [99]. Among Holland’s many contributions, the schema theorem was one of the most important. Holland developed the concepts of schemata and hyperplanes to explain the working principles of GAs. A schema is basically a template that allows grouping candidate solutions. In a problem where candidate solutions are represented with l genes, a schema consists of l symbols. Each symbol can be an allele of the corresponding gene or “*,” which is the “don’t care” symbol. Each schema designates the subset of candidate solutions in which the corresponding representations match every position in the schema that is different from “*.” For instance, the candidate solutions “0 0 1 1” and “0 1 0 1” belong to the schema “0 * * 1.” Note that the candidate solutions belonging to a schema form a hyperplane in the space of solutions. Holland introduced the concept of intrinsic parallelism—subsequently renamed as implicit parallelism—to describe the fact that the evaluation of a candidate solution provides valuable information on many different schemata. He hypothesized that GAs operate by better sampling schemata with larger mean fitness values. The concept of schema was also used to justify the use of binary encoding. Specifically, Holland showed that by using binary encoding the number of schemata sampled in each evaluation can be maximized. 
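
The notion of a schema as a template can be illustrated with a short sketch. The helper names are hypothetical. The second function relies on the standard observation, not spelled out in the text, that a candidate with l genes is an instance of 2^l schemata, since each position can either keep its allele or become "*".

```python
from itertools import product


def matches(schema, candidate):
    # A candidate belongs to a schema if it agrees on every position that is not "*".
    return len(schema) == len(candidate) and all(
        s == "*" or s == c for s, c in zip(schema, candidate))


def schemata_of(candidate):
    # Enumerate every schema that the candidate is an instance of.
    return ["".join(c if keep else "*" for c, keep in zip(candidate, mask))
            for mask in product((True, False), repeat=len(candidate))]


in_first = matches("0**1", "0011")   # True, as in the example from the text
in_second = matches("0**1", "0101")  # True
outside = matches("0**1", "1011")    # False: the first gene differs
count = len(schemata_of("0011"))     # 16 schemata for l = 4
```

A single fitness evaluation of "0011" therefore yields a sample for 16 different schemata at once, which is the intuition behind implicit parallelism.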
However, the use of binary encoding was not extensively tested in his preliminary works, and its application has been a source of controversy. One of the first problems detected with the use of binary encoding is that, depending on the length of the vector, precision or efficiency might be sacrificed. It is not always easy to strike a proper balance, so dynamic parameter encoding methods have also been proposed [100]. Goldberg weakened the binary-encoding requirement by claiming that “the user should select the smallest alphabet that permits a natural expression of the problem” [98]. A strict interpretation of this means that the requirement for binary alphabets can be dropped [99]. In fact, several authors have been able to obtain better results by using representations different from the binary encoding, such as floating-point representations. For instance, Michalewicz carried out an extensive comparison between binary and floating-point representations [101]. He concluded that GAs operating with floating-point representations are faster and more consistent and provide a higher precision. Subsequently, Fogel and Ghozeil demonstrated that there are equivalences between any bijective representation [102]. Thus, for GAs with bijective representations,

June 1, 2022 13:18

872

Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch22

Handbook on Computer Learning and Intelligence — Vol. II

interactions that take place between the representation and other components are keys to their success. Holland used the above concepts to develop the schema theorem, which provides lower bounds to the change in sampling rate for a single hyperplane from one generation to the next. For many years, the schema theorem was used to explain the working operation of GAs. However, several researchers have pointed out some weaknesses that call into question some of the implications of the theorem [103]. Some of the most well-known weaknesses pointed out by several researchers are the following. First, the schema theorem only provides lower bounds for the next generation and not for the whole run. Second, the schema theorem does not fully explain the behavior of GAs in problems that present large degrees of inconsistencies in terms of the bit values preferred [104]. Finally, considering the typical population sizes used by most practitioners, the number of samples belonging to high-order schemata—those with few “*”—is very low, so in these cases the theorem is not very accurate. In general, the controversies are not with the formulae, but with the implications that can be derived from them. The reader is referred to [105] for a more detailed discussion of this topic. Finally, we should note that other theoretical studies not based on the concept of schema have also been proposed to explain the behavior of GAs. In some cases, the model is simplified by assuming infinitely large population sizes [106]. In other cases, the Markov model has been used to analyze simple GAs with finite population sizes [107]. Due to the complexity of analyzing the mathematics behind GAs, there is a large amount of work focused on the practical use of GAs. For instance, the initial GA framework has been extended by considering several selection and variation operators. Regarding the former, note that, initially, selection was only used to choose the individuals that were subjected to variation. 
This operator is now known as parent selection or mating selection [17]. However, in most current GAs, a survivor selection or environmental selection is also carried out. The aim of this last selection, also called replacement, is to choose which individuals will survive to the next generation. In its most general form, survivor selection can operate on the union of parents and offspring. The number of selection operators defined in recent decades is very large, ranging from deterministic to stochastic schemes. Parent selection is usually carried out with stochastic operators. Some of the most well-known stochastic selectors are fitness proportional selection, ranking selection, and tournament selection. In contrast, survivor selection is usually based on deterministic schemes. In the case of the survivor selection schemes, not only is the fitness used, but the age of individuals might also be considered. Some of the most well-known schemes are age-based replacement and replace-worst. Also of note is the fact that elitism is included in most current GAs. Elitist selectors ensure that the best individual from among parents and offspring is always selected to survive.

In these kinds of schemes, and under some assumptions, asymptotic convergence is ensured [108] and, in general, there is empirical evidence of its advantages. A topic closely related to replacement is the steady state model [74]. In the steady state model, after the creation of one or maybe a few offspring, the replacement phase is executed. Note that this model is closely related to the generation gap concept previously defined in [94]. The generation gap was introduced into GAs to permit overlapping populations. In the same way, several crossover schemes have been proposed in the literature. The first GAs proposed by Holland operated with one-point crossover, whose operation is justified in light of the schema theorem. Specifically, Goldberg [98] claimed that in GAs with one-point crossover, short, low-order, and highly fit schemata, which received the name of building blocks (see footnote 3), are sampled, recombined, and resampled to form strings of potentially higher fitness. However, this hypothesis has been very controversial. One of the basic features of one-point crossover is that bits from a single parent that are close together in the encoding are usually inherited together, i.e., it suffers from positional bias. This idea was borrowed in part from the biological concept of coadapted alleles. However, several authors have questioned the importance of this property [105]. In fact, Goldberg claimed that since many encoding decisions are arbitrary, it is not clear whether a GA operating in this way might obtain the desired improvements [98]. The inversion operator might in some way alleviate this problem by allowing the reordering of genes so that linkages between arbitrary positions can be created. Another popular attempt to identify linkages was carried out in the messy GA [110], which explicitly manipulates schemata and allows linking non-adjacent positions. More advanced linkage learning mechanisms have been described more recently [111].
In addition, given the lack of a theory that fully justifies the use of one-point crossover, several different crossover operators, as well as alternative hypotheses to explain the effectiveness of crossover, have been proposed. Among these hypotheses, some of the most widely accepted consider crossover as a macro-mutation or as an adaptive mutation [112]. Regarding alternative definitions of crossover operators, it has been shown empirically that, depending on the problem, operators different from one-point crossover might be preferred. For instance, Syswerda demonstrated the good behavior of uniform crossover with a set of benchmark problems [113]. In the case of binary encoding, some of the most well-known crossover operators are the two-point crossover, multipoint crossover, segmented crossover, and uniform crossover [17]. Note that some multisexual crossover operators have also been provided. In fact, well before the popularization of GAs, Bremermann had proposed their use [30]. For other representations, a large set of crossover operators has been devised [17].

As mentioned earlier, several components and/or parameters in GAs have to be tuned. The first GA variants lacked procedures to automatically adapt these parameters and components. However, several adaptive GAs have also been proposed [114]. In these schemes, some of the parameters and/or components are adapted by using the feedback obtained during the search. In addition, most current GAs do not operate on binary strings, relying instead on encodings that fit naturally with the problem at hand. As a result, several variation operators specific to different chromosome representations have been defined. The reader is referred to [17] for a review of some of these operators.

3 The term building block has also been used with slightly different definitions [109].

22.4. A Unified View of Evolutionary Computation

The three main branches of EC (EP, GAs, and ESs) developed quite independently of each other over the course of about 25 years, during which several conferences and workshops devoted to specific types of EAs were held, such as the Evolutionary Programming Conference. However, in the early 1990s, it was clear to some researchers that these approaches have several similarities, and that findings in one kind of approach might also be useful for other types of EAs. Since these researchers considered that the topics covered by the conferences at the time were too narrow, they decided to organize a workshop called “Parallel Problem Solving From Nature” that would accept papers on any type of EA, as well as papers based on other metaphors of nature. This resulted in a growth in the number of interactions and collaborations among practitioners of the various EC paradigms, and as a result, the different schemes began to merge naturally. This unification process continued at the Fourth International Conference on Genetic Algorithms, where the creators of EP, GAs, and ESs met. At this conference, the terms EC and EA were proposed and it was decided to use these terms as the common denominators of their approaches. Basically, an EA was defined as any approach toward solving problems that mimics evolutionary principles (see footnote 4). Similar efforts were undertaken at other conferences, such as the Evolutionary Programming Conference. Finally, these unifying efforts resulted in the establishment in 1993 of the journal Evolutionary Computation, published by MIT Press. Furthermore, the original conferences were soon replaced by more general ones, such as the Genetic and Evolutionary Computation Conference (see footnote 5) and the IEEE Congress on Evolutionary Computation. In addition, other journals specifically devoted to EC have appeared, such as the IEEE Transactions on Evolutionary Computation.

4 See http://ls11-www.cs.uni-dortmund.de/rudolph/ppsn
5 The conference originally known as “International Conference on Genetic Algorithms” was renamed as “Genetic and Evolutionary Computation Conference” in 1999.

It is worth noting that since the different types of EAs were developed in an independent way, many ideas have been reinvented and explored more than once. In addition, there was no clear definition for each kind of scheme, so the boundaries between EC paradigms became blurred. Thus, for many contemporary EAs, it is very difficult and even unfair to claim that they belong to one or another type of EA. For instance, a scheme that adopts proportional parent selection, self-adaptation, and stochastic replacement merges ideas that were originally proposed in each of the initial types of EAs. For this reason, when combining ideas that originated with practitioners of the different EC paradigms, which is very typical, it is preferable to use the generic terms EC and EA. In recent decades, several papers comparing the differences and similarities of the different types of EAs have appeared [115]. In addition, some authors have proposed unifying and general frameworks that allow for the implementation of any of these schemes [116, 117]. In fact, note that with a pseudocode as simple as the one shown in Algorithm 22.1, any of the original schemes can be implemented. Although this simple pseudocode does not fit with every contemporary EA, it does capture the main essence of EC by combining population, random variation, and selection. Two interesting books where the unifying view is used are [17, 117].

The last two decades have seen an impressive growth in the number of EC practitioners. Thus, the amount of research that has been conducted in the area is huge. For instance, EAs have been applied to several kinds of optimization problems, such as constrained [118] and multi-objective [1, 119] optimization problems. Also, several efforts have been made to address the problem of premature convergence [120].
Studies conducted on this topic include the use of specific selection schemes such as fitness sharing [121], crowding schemes [94], restarting mechanisms [122], and the application of multi-objective concepts to tackle single-objective problems [123]. Other highly active topics include the design of parallel EAs [124] and memetic algorithms [125] (related to Lamarck’s hypothesis and the Baldwin effect, which were discussed earlier), which allow hybridizing EAs with more traditional optimization schemes. Note that for most of the topics that have been studied over the past decades, there were some preliminary works that were developed in the 1960s. For instance, the island-based model [126] was proposed in [127], while some preliminary works on hybrid models were presented as early as 1967 [128]. The aforementioned works represent only a very small cross-section of the topics that have been covered in recent years. It is beyond the scope of this chapter to present an extensive review of current research, so readers are referred to some of the latest papers published in some of the most popular conferences and journals in the field.

It is also remarkable that while considerable efforts have been made to unify the different EC paradigms, some new terms for referring to specific classes of EAs have also appeared in recent decades. However, in these last cases, their distinguishing features are clear, so there is no ambiguity in the use of such terms. For instance, Differential Evolution [129] is a special type of EA where the mutation is guided by the differences appearing in the current population. Another popular variant is Genetic Programming [130], which focuses on the evolution of computer programs. Finally, we would like to state that while there has been considerable research into EC in the last decade, the number of topics that have yet to be addressed and further explored in the field is huge. Moreover, in recent years, there has been a remarkable increase in the number of proposals based on alternative nature-inspired phenomena [10]. In some cases, they have been merged with evolutionary schemes. Thus, similarly to the period where synergies were obtained by combining ideas that arose within different evolutionary paradigms, the interactions among practitioners of different nature-inspired algorithms might be beneficial for advancing this field.

Algorithm 22.1: Pseudocode of an Evolutionary Algorithm: A Unified View
    Initialization: Generate an initial population with N individuals
    Evaluation: Evaluate every individual in the population
    while (not fulfilling the stopping criterion) do
        Mating selection: Select the parents to generate the offspring
        Variation: Apply variation operators to the mating pool to create a child population
        Evaluation: Evaluate the child population
        Survivor selection: Select individuals for the next generation
    end while
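As a rough illustration of the unified view, the steps of Algorithm 22.1 can be sketched in Python. Everything specific here is an assumption made for the example and not part of the chapter: bit-string individuals, k-way tournament mating selection, one-point crossover followed by bit-flip mutation, a fixed generation budget as the stopping criterion, and truncation of the union of parents and offspring as an elitist survivor selection. The names `evolve` and `tournament` are hypothetical.

```python
import random


def tournament(population, scores, k, rng):
    """Mating selection: return the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]


def evolve(fitness, n_genes, pop_size=20, generations=50, k=2, seed=0):
    """Maximize `fitness` over bit strings following the steps of Algorithm 22.1."""
    rng = random.Random(seed)
    p_mut = 1.0 / n_genes  # assumed per-gene mutation rate

    # Initialization and evaluation of the initial population.
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]

    for _ in range(generations):  # stopping criterion: fixed budget (assumption)
        # Mating selection and variation produce the child population.
        children = []
        while len(children) < pop_size:
            a = tournament(population, scores, k, rng)
            b = tournament(population, scores, k, rng)
            cut = rng.randint(1, n_genes - 1)           # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit flip
            children.append(child)

        # Evaluation of the child population.
        child_scores = [fitness(c) for c in children]

        # Survivor selection on the union of parents and offspring (elitist).
        merged = sorted(
            zip(scores + child_scores, population + children),
            key=lambda pair: pair[0],
            reverse=True,
        )[:pop_size]
        scores = [s for s, _ in merged]
        population = [ind for _, ind in merged]

    best = max(range(pop_size), key=lambda i: scores[i])
    return population[best], scores[best]


if __name__ == "__main__":
    # OneMax (count of ones) is used only as a toy objective.
    individual, score = evolve(sum, n_genes=16)
    print(individual, score)
```

Swapping the selection or variation functions for other operators changes the flavor of the algorithm without touching the loop, which is the point of the unified pseudocode.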

22.5. Design of Evolutionary Algorithms

EC is a general framework that can be applied to a large number of practical applications. Among them, the use of EC to tackle optimization problems is probably the most popular one [17]. The generality of EC and the large number of different components that have been devised imply that when facing new problems, several design decisions must be made. These design decisions have an important impact on the overall performance [131], meaning that they must be made very carefully. In particular, in order to draw proper conclusions, statistical comparisons are a crucial step in the design process [132]. Additionally, theoretical analyses support the proper design of EAs [133]. Thus, while EAs are usually referred to as general solvers, most successful EAs do not treat problems as black-box functions. Instead, information on the problem is used to alter the definition of the different components of the EA [134]. This section discusses some of the main decisions involved in the design of EAs and describes some of the most well-known components that have been used to handle optimization problems.

One of the first decisions to make when applying EAs is the way in which individuals are represented. The representation of a solution involves selecting a data structure to encode solutions [135] and, probably more importantly, a way to transform this encoding (genotype) into the phenotype [136]. Note that, in direct representations, there is no transformation between genotype and phenotype, whereas in cases where such a transformation is required, the representation is said to be indirect. While in many cases a direct representation is effective, an indirect representation allows for the introduction of problem-specific knowledge and the use of standard genetic operators and, in some cases, facilitates the treatment of constraints, among other benefits. Several efforts to facilitate the process of designing, selecting, and comparing representations have been made. The recommendations provided by Goldberg [137], [138], [139], and [140] are interesting but too general, and in some cases there is a lack of theory behind these recommendations. A more formal framework is given by [136], which introduces some important features that should be taken into account, such as redundancy, scaling, and locality. The aim of this last framework is to reduce the black art behind the selection of proper representations.
However, since analyses and recommendations are based on simplifications of EAs, there is usually a need to resort to trial-and-error mechanisms, which are combined with the analyses of these recommendations and features. Finally, it is important to note that the representation chosen influences other design decisions. For instance, variation operators are applied to the genotype, so their design depends on the representation.

In order to illustrate the representation process, let us consider a simple optimization problem such as the minimization of the function f(x, y, z) (Eq. 22.2), where each variable is an integer number in the range [0, 15].

\[ f(x, y, z) = (x - 1)^2 + (y - 2)^2 + (z - 7)^2 - 2 \tag{22.2} \]

Considering some of the first GAs designed, an alternative is to represent each number in base 2 with 4 binary digits. In such a case, individuals consist of a sequence of 12 binary digits or genes (based on the GAs nomenclature). Another possible choice—which is more widespread nowadays—is to use a gene for each variable, where each gene can have any value between 0 and 15. Figure 22.2 shows some candidate solutions considering both the binary and integer representations.
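The two encodings can be related by a short sketch. The helper names `decode` and `encode` are hypothetical, and the example assumes the 4-bits-per-variable layout described above, with the most significant bit first.

```python
def f(x, y, z):
    """Objective function of Eq. (22.2)."""
    return (x - 1) ** 2 + (y - 2) ** 2 + (z - 7) ** 2 - 2


def decode(bits, bits_per_var=4):
    """Map a binary genotype (list of 0/1 genes) to a tuple of integers."""
    if len(bits) % bits_per_var != 0:
        raise ValueError("genotype length must be a multiple of bits_per_var")
    return tuple(
        int("".join(str(b) for b in bits[i:i + bits_per_var]), 2)
        for i in range(0, len(bits), bits_per_var)
    )


def encode(values, bits_per_var=4):
    """Map a tuple of integers to the corresponding binary genotype."""
    return [int(b) for v in values for b in format(v, f"0{bits_per_var}b")]


if __name__ == "__main__":
    genotype = [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1]  # 12 binary genes
    phenotype = decode(genotype)                      # three integer variables
    print(phenotype, f(*phenotype))
    print(encode((7, 0, 11)))
```

With the integer representation the genotype is already the tuple (x, y, z), so no decoding step is needed before evaluating f.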

Values of variables       Binary representation      Integer representation
x = 5, y = 8, z = 3       0 1 0 1 1 0 0 0 0 0 1 1    5 8 3
x = 7, y = 0, z = 11      0 1 1 1 0 0 0 0 1 0 1 1    7 0 11

Figure 22.2: Individuals with binary and integer encoding.

Each cell represents a gene of the chromosome. Note that for this case, the base-2 encoding is an example of an indirect representation, whereas the second encoding is a direct representation. Another example of an indirect representation appeared at the Second Edition of the Wind Farm Layout Optimization Competition [141]. The purpose of this contest was to design optimizers to select the positions of turbines. Several of the top-ranked contestants noted that most high-quality solutions aligned several of the turbines in specific directions. Thus, instead of directly optimizing the positions, problem-dependent information (the relationship among directions and positions) can be used to design an indirect encoding, which focuses the search on promising regions of the search space. Consequently, instead of looking for thousands of positions, just a few parameters had to be optimized.

Another important component of EAs is the fitness function. The role of the fitness function is to assign a quality measure to each individual. Note that some problems might involve several requirements and objectives. In order to deal with these multiple aims, they might be combined into the fitness function to generate a global notion of quality. A taxonomy regarding typical kinds of requirements and methods to combine them is discussed in [142]. In simple EAs, the fitness function is used mainly in the selection stages. However, in more complex algorithms, such as memetic algorithms, other components also take the fitness function into account. The term fitness function is usually associated with maximization [17], but it has caused some controversy because in some papers minimization fitness functions are employed. Changing minimization into maximization or vice versa in the implementation of EAs is in many cases trivial, but when connecting selection operators and fitness functions, both components should agree on the optimization direction.
Moreover, as discussed later, some selection operators make sense only when the fitness function is maximized and every candidate solution attains a positive fitness. Thus, when designing the fitness function, the way in which it will be used must be taken into account.

In order to illustrate the definition of a fitness function, let us consider an even simpler optimization problem, such as the minimization of the function f(x) (Eq. 22.3), where x is a real number in the range (0, 15). In this case, a plausible fitness function for the minimization of f(x) that guarantees that the fitness value of any candidate solution is positive is given by f_1 (Eq. 22.4). Note that 191 is larger than any value of the function f(x) in the range considered. Usually, such a bound is not known exactly, but any value greater than 191 might also be used if positive values are desired. However, depending on the other components, especially the selection operators, this specific constant might affect the performance of the optimization process. Alternatively, a fitness function that might produce negative values is given by the simpler function f_2(x) = -f(x). In this last case, estimating the maximum value of the function is not a requirement. However, not every selection operator can be used with f_2.

\[ f(x) = (x - 1)^2 - 5 \tag{22.3} \]

\[ f_1(x) = -f(x) + 191 \tag{22.4} \]
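The two candidate fitness functions can be written down directly. This is a minimal sketch; the `offset` parameter is an assumption added for illustration so that constants other than 191 can be tried.

```python
def f(x):
    """Objective to be minimized, Eq. (22.3)."""
    return (x - 1) ** 2 - 5


def f1(x, offset=191.0):
    """Maximization fitness of Eq. (22.4); positive for every x in (0, 15)."""
    return -f(x) + offset


def f2(x):
    """Simpler maximization fitness; it may take negative values."""
    return -f(x)


if __name__ == "__main__":
    for x in (2, 5, 12):
        print(x, f(x), f1(x), f2(x))
```

Both functions rank candidate solutions identically, so rank-based operators behave the same with either one; the difference only matters for operators that use the fitness magnitudes.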

Regarding the task of designing a fitness function, this is a straightforward process in some cases because it is just equal to (or a simple transformation of) the objective function. In these cases, the objective function is said to be self-sufficient [6]. For instance, in the case of the Traveling Salesman Problem, most effective optimizers just consider the minimization of the distance traveled, so the fitness function is viewed as the inverse of the distance. However, in other cases, this task is more complex. For instance, let us consider the Satisfiability Problem. In this case, each solution is mapped in the original problem to a 0 or 1 value, and the aim is to find a solution whose value is 1. Using only the binary value is not enough to guide the optimization process, so there is a need to define a more complex fitness function.

The definition of the fitness function heavily impacts the expected performance of EAs. The analysis of the fitness landscape is complex, especially for combinatorial optimization, but it is useful in identifying drawbacks of the proposed fitness function [143]. Recent advances such as local optima networks [144] facilitate the analyses, especially when dealing with high-dimensional problems. As an example, in the case of maximizing the nonlinearity of Boolean functions, the direct use of nonlinearity as the fitness function causes large plateaus that hinder the proper performance of EAs. Extending the fitness function by including additional information from the Walsh–Hadamard transform provides a better guide that results in improved performance [145, 146]. Finally, note that in some cases, the fitness function is adapted online once certain issues with the optimization process are detected [147].

An additional issue that might emerge with the design of fitness functions is that they might be too computationally expensive. This would result in just a few generations being evolved, with potentially poor results.
Surrogate models can be used to alleviate this difficulty [148]. The main idea behind surrogate models is to approximate the fitness function with a less expensive process than the one initially designed. Different ways of building surrogate models have been proposed. Problem-dependent models rely on ad hoc knowledge of the problem and on the fact that in order to distinguish between very poor and high-quality solutions, very accurate functions are not required. Thus, functions with different trade-offs between accuracy and computational cost might be designed and applied at different stages of the optimization process [149]. In contrast, problem-independent models usually rely on machine learning methods and other statistical procedures [150, 151]. Note that while the most typical use of surrogate models is to reduce the computational cost by providing inexpensive fitness functions, they have also been applied to redesign other components of an EA [148].

The selection operator is another important component that has a large impact on the overall performance of EAs [152]. One of the aims of selection is to focus the search on the most promising regions. Some of the first selectors designed made their decisions considering only the individuals’ fitness values. However, other factors, such as age or diversity, are usually considered in state-of-the-art EAs [153]. Selection operators are applied in two different stages: parent selection and replacement. In both cases, stochastic and deterministic operators can be used, though the most popular choice is to apply stochastic operators in the parent selection stage and deterministic operators in the replacement phase [17]. A very popular stochastic selection operator is the fitness proportional operator. In the fitness proportional selection, the total fitness F is first calculated as the sum of the fitness values of the individuals in the current population. Then, the selection probability p_i of an individual I_i is calculated using Eq. (22.5), where fit represents the fitness function.
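A minimal sketch of fitness proportional selection follows, taking the selection probability to be each individual's fitness divided by the total fitness F. The function names are hypothetical. The fitness values 195, 180, and 75 correspond to f_1 of Eq. (22.4) evaluated at x = 2, 5, and 12, and the second call uses a larger constant (1000 instead of 191) purely to show how the probabilities react to a shift.

```python
import random


def selection_probabilities(fitness_values):
    """Return p_i = fit(I_i) / F for every individual; requires positive fitness."""
    if any(v <= 0 for v in fitness_values):
        raise ValueError("fitness proportional selection needs positive fitness values")
    total = sum(fitness_values)  # total fitness F
    return [v / total for v in fitness_values]


def proportional_select(population, fitness_values, n, rng=random):
    """Draw n parents (with replacement) with probability proportional to fitness."""
    weights = selection_probabilities(fitness_values)
    return rng.choices(population, weights=weights, k=n)


if __name__ == "__main__":
    population = [2, 5, 12]
    print(selection_probabilities([195, 180, 75]))
    # Same individuals, fitness shifted by a constant (offset 1000 instead of 191).
    print(selection_probabilities([1004, 989, 884]))
    print(proportional_select(population, [195, 180, 75], n=5, rng=random.Random(0)))
```

The check for positive values reflects the requirement that this operator is only meaningful when every fitness value is positive.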
In order to apply this selection scheme, the fitness function of every individual must be positive. Let us assume that we are using the fitness function given in Eq. (22.4) and that our current population consists of three individuals with x = 2, x = 5, and x = 12. Table 22.1 shows the fitness value and the selection probability associated with each individual. One of the weaknesses of this operator is that it is susceptible to function transposition, which means that its behavior changes if the fitness function of every individual is transposed by adding a constant value. This is illustrated in Figure 22.3. In this figure, a population consisting of individuals A, B, and C is used. Two fitness functions are considered: the first one (f_1) is shown in Eq. (22.4), while the second one (f_2) is generated by replacing the value 191 in Eq. (22.4) with the value 1000 (here f_2 denotes this shifted function, not the function -f(x) considered earlier). Figure 22.3 shows the fitness values of the different individuals, as well as the selection probabilities associated with the individuals when f_1 and f_2 are applied. Since the selection scheme is susceptible to function transposition, these probabilities differ, as can be seen in the pie charts. In addition, note that with f_2, the selection pressure is very low, i.e., all individuals are selected with very similar probabilities. The reason is that when the differences between the fitness values are small with respect to their absolute values, the selection pressure vanishes. This last effect is another of the known weaknesses that can affect fitness proportional selection.

\[ p_i = \frac{\mathit{fit}(I_i)}{F} \tag{22.5} \]

Table 22.1: Fitness values and selection probabilities induced by the fitness proportional selection
Individual    Fitness    Selection prob.
x = 2         195        0.43
x = 5         180        0.40
x = 12        75         0.17

Figure 22.3: Effect of transposition on fitness proportional selection. (Plot of fitness value against x for the fitness functions f_1 and f_2, with the individuals marked as A1, B1, C1 and A2, B2, C2.)

Another quite popular stochastic selection operator is the tournament selection. In this case, each time an individual must be selected, there is a competition between k randomly selected individuals, with the best one being chosen. Tournament selection has very different features from fitness proportional selection. For instance,
it is invariant to transposition of the fitness function. However, this does not mean that it is superior to fitness proportional selection. For instance, when facing some fitness landscapes, fitness proportional selection can induce a larger focus on exploitation than tournament selection. Depending on the problem at hand, on the stopping criterion, and on the other components applied, this might be desired or counterproductive, so it is the task of the designer to analyze the interactions between the selectors and the fitness landscape.

The variation phase is another quite important component that is usually adapted to the problem at hand. Among the different operators, the crossover and mutation operators are the most popular, and a large number of different choices have been devised for each of them. For instance, the authors of [17] present several operators that can be applied with binary, integer, floating-point, and permutation representations. In order to better illustrate the working operation of crossover, two popular crossover operators that can be applied with binary, integer, and floating-point representations are presented: the one-point and the uniform crossover. One-point crossover (Figure 22.4) operates by choosing a random gene, and then splitting both parents at this point and creating the two children by exchanging the tails. Alternatively, uniform crossover (Figure 22.5) works by treating each gene independently, and the parent associated with each gene is selected randomly. This might be implemented by generating a string with L random variables from a uniform distribution over [0, 1], where L is the number of genes. In each position, if the value is below 0.5, the gene is inherited from the first parent; otherwise, it is inherited from the second parent. The second offspring is created using the inverse mapping. It is important to note that the crossover schemes discussed induce different linkages between the genes.
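The two operators just described can be sketched as follows. This is an illustrative implementation under the assumption that chromosomes are Python lists of equal length; the function names and the optional `rng` argument are hypothetical.

```python
import random


def one_point_crossover(parent1, parent2, rng=random):
    """Split both parents at a random cut point and exchange the tails."""
    if len(parent1) != len(parent2) or len(parent1) < 2:
        raise ValueError("parents must have the same length (at least 2 genes)")
    cut = rng.randint(1, len(parent1) - 1)  # both children get genes from both parents
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2


def uniform_crossover(parent1, parent2, rng=random):
    """Choose the source parent of each gene independently with probability 0.5."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    draws = [rng.random() for _ in parent1]  # L uniform random values in [0, 1)
    child1 = [a if u < 0.5 else b for a, b, u in zip(parent1, parent2, draws)]
    # The second offspring uses the inverse mapping of the same draws.
    child2 = [b if u < 0.5 else a for a, b, u in zip(parent1, parent2, draws)]
    return child1, child2


if __name__ == "__main__":
    p1 = [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    p2 = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
    print(one_point_crossover(p1, p2, random.Random(1)))
    print(uniform_crossover(p1, p2, random.Random(1)))
```

In the one-point version, adjacent genes stay together unless the cut falls between them, whereas in the uniform version every gene is inherited independently of its neighbors.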
For instance, in the one-point crossover, genes that are close in the chromosome are usually inherited together, while this is not the case in the uniform crossover. Depending on the meaning of each gene and on its position in the chromosome, introducing this kind of linkage might make sense or not. Thus, when selecting the crossover operator, the representation of the individuals must be taken into account.

Figure 22.4: Operation of one-point crossover (rows shown: Parent 1, Parent 2, Offspring 1, Offspring 2).

Figure 22.5: Operation of uniform crossover (rows shown: Parent 1, Parent 2, Uniform [0,1] random variables, Offspring 1, Offspring 2).

Finally, operators that try to learn a proper linkage between genes have been devised [111]. These kinds of operators have yielded significant benefits in several cases. Note also that in many cases, problem-dependent crossover and mutation operators are designed. Properly designed crossover operators detect characteristics that are important for the problem and that are present in the parents. They then create offspring by combining these features. For instance, in the Graph Coloring Problem, the authors noted that the important feature is not the specific color assigned to each node, but the groups of nodes sharing the same color. Thus, ad hoc operators that inherit these kinds of features have been designed [154]. Additionally, ad hoc operators might establish a linkage between genes by considering problem-dependent knowledge. For instance, in the Frequency Assignment Problem, antennas that interact strongly with one another are inherited and/or mutated together with a larger probability than unrelated antennas [155]. Finally, in the case of crossover operators, it is also important to distinguish between the parent-centric and mean-centric operators because of the impact that this feature has on the dynamics of the population [156]. Properly designed operators that adapt this feature to the needs of the problem at hand have excelled [157].

Note that while we have discussed different components involved in the design of EAs independently, there are important interactions among them. Thus, regarding their performance, it is not possible to make categorical assertions. Depending on the components used, different balances between exploitation and exploration can be induced. It is therefore very important to recognize how to modify this balance, since it is one of the keys to success. In some cases, there is no explicit mechanism for managing the trade-off between exploration and exploitation.
Note that due to the way EAs work, the diversity maintained in the population is one way to control this trade-off [158]. In fact, many authors regard the proper management of diversity as one of the cornerstones of proper performance [120]. Thus, several explicit strategies for managing diversity have been proposed [159]. In some ways, this can be viewed as another component of EAs, but it is not a completely independent component because it is highly coupled to some of the components already discussed. For instance, in [120], diversity management methods are classified as selection-based, population-based, crossover/mutation-based, fitness-based, and replacement-based, depending on the sort of component that is modified. Additionally, some implicit mechanisms that alter the degree of diversity, such as multi-objectivization, have been devised [123].

It is also important to note that for many problems, trajectory-based strategies are usually considered more powerful than EAs in terms of their intensification capabilities. As a result, several state-of-the-art optimizers are hybrids that incorporate trajectory-based strategies and/or other mechanisms to promote intensification [6, 160]. Thus, knowing the different ways of hybridizing algorithms is an important aspect of mastering the design of EAs.

Finally, given the difficulties in understanding and theoretically analyzing the interactions that appear in EAs, several components and parameterizations are typically tested during the design phase. By analyzing the performance of different components and the reasons for their good or poor performance, some alternative designs might be tested until a good enough scheme is obtained. Moreover, note that it is not just the components that have to be selected. A set of parameters must also be adapted to the problem at hand, and it is important that this is done systematically [114, 161]. In this regard, two kinds of methodologies have been devised [162]. Parameter tuning [163] can be thought of as searching for good parameters and components before the algorithm is executed, whereas parameter control strategies [164] alter the components and parameters during the run.
This can be used, for instance, with the aim of applying an ensemble of variation operators whose probabilities and other internal parameters are changed during the optimization process [165, 166]. Table 22.2 summarizes some of the most important aspects to define when designing and running EAs. Note that in addition to all these aspects, we also have to consider the internal parameters associated with many of the operators mentioned above. We would like to remark that in the design of proper EAs, it is very important to ascertain the features and implications of using different components and parameter values and recognize the drawbacks that each of them can circumvent. This is because of the impossibility of testing every parameter and component combination, given how time-consuming and computationally intensive this task is. Thus, designing efficient EAs for new problems is a complex task that requires knowledge in several areas that has been developed in recent decades through both experimental and theoretical studies. Generic tools for developing metaheuristics can facilitate this entire process [167–170].
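To make the idea of parameter control more concrete, the following sketch adapts the probabilities of an ensemble of variation operators during the run. It uses probability matching, a simple credit-assignment rule chosen here purely for illustration; it is not claimed to be the scheme used in [165, 166]. The class name `OperatorSelector` and the parameters `p_min` and `decay` are assumptions introduced for this example.

```python
import random
from typing import Dict, List


class OperatorSelector:
    """Probability matching over a set of named variation operators."""

    def __init__(self, names: List[str], p_min: float = 0.05,
                 decay: float = 0.8) -> None:
        if not names or p_min * len(names) >= 1.0:
            raise ValueError("need at least one operator and p_min * K < 1")
        self.names = list(names)
        self.p_min = p_min    # floor so that no operator is ever discarded
        self.decay = decay    # weight given to past rewards
        self.quality: Dict[str, float] = {n: 1.0 for n in self.names}

    def probabilities(self) -> Dict[str, float]:
        """Selection probabilities proportional to estimated quality."""
        k = len(self.names)
        total = sum(self.quality.values())
        if total <= 0.0:
            # No operator has earned any credit: fall back to uniform.
            return {n: 1.0 / k for n in self.names}
        free = 1.0 - k * self.p_min
        return {n: self.p_min + free * self.quality[n] / total
                for n in self.names}

    def choose(self, rng: random.Random) -> str:
        """Sample one operator name according to current probabilities."""
        probs = self.probabilities()
        weights = [probs[n] for n in self.names]
        return rng.choices(self.names, weights=weights, k=1)[0]

    def reward(self, name: str, improvement: float) -> None:
        """Update the quality estimate with a non-negative reward."""
        gain = max(improvement, 0.0)
        self.quality[name] = (self.decay * self.quality[name]
                              + (1.0 - self.decay) * gain)
```

In an EA loop, `choose` would be called before creating each offspring and `reward` afterwards, for example with the fitness improvement of the child over its parents. By contrast, parameter tuning would fix the operator probabilities before the run and leave them unchanged.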


Table 22.2: Some aspects to consider when designing EAs

Component or aspect    Related decisions
-------------------    -----------------
Individual             Representation
                       Evaluation (fitness function)
Population             Size
                       Population sizing scheme
                       Initialization
Evolution              Mutation operator
                       Crossover operator
                       Repair operator
                       Intensification operator
                       Probabilities for applying operators
                       Parent selection
                       Replacement scheme
                       Diversity management strategy
                       Hybridization
Stopping criterion     Number of evaluations of individuals
                       Generations
                       Time
                       Given signs of stagnation
                       A certain solution quality is achieved
Parameterization       Parameter control
                       Parameter tuning
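As a rough illustration of how several of the decisions in Table 22.2 fit together, the sketch below implements a very small EA. Everything specific in it is an assumption made for the example: the OneMax toy fitness function, the binary representation, binary tournament selection, uniform crossover, bit-flip mutation, elitist truncation replacement, and the default parameter values. It is a sketch for orientation, not a recommended design.

```python
import random
from typing import List, Optional

Genome = List[int]


def onemax(genome: Genome) -> int:
    """Evaluation: toy fitness function that counts the ones."""
    return sum(genome)


def tournament(pop: List[Genome], fits: List[int],
               rng: random.Random, k: int = 2) -> Genome:
    """Parent selection: best of k individuals drawn without replacement."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: fits[i])]


def run_ea(n_genes: int = 30, pop_size: int = 20,
           max_generations: int = 200, p_crossover: float = 0.9,
           p_mutation: Optional[float] = None, seed: int = 0) -> Genome:
    rng = random.Random(seed)
    p_mut = 1.0 / n_genes if p_mutation is None else p_mutation

    # Population: fixed size, uniform random initialization.
    pop = [[rng.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    fits = [onemax(g) for g in pop]

    for _ in range(max_generations):      # stopping criterion: generations
        if max(fits) == n_genes:          # stopping criterion: quality
            break
        offspring: List[Genome] = []
        while len(offspring) < pop_size:
            a = tournament(pop, fits, rng)
            b = tournament(pop, fits, rng)
            if rng.random() < p_crossover:
                # Uniform crossover producing a single child.
                child = [x if rng.random() < 0.5 else y
                         for x, y in zip(a, b)]
            else:
                child = list(a)
            # Bit-flip mutation applied gene by gene.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            offspring.append(child)
        # Replacement: keep the best pop_size of parents plus offspring.
        merged = sorted(pop + offspring, key=onemax, reverse=True)
        pop = merged[:pop_size]
        fits = [onemax(g) for g in pop]

    return max(pop, key=onemax)
```

Calling `run_ea(seed=1)` returns the best genome found. The example has no repair operator, intensification step, diversity management strategy, or parameter control; each would be an additional design decision in the sense of Table 22.2.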

22.6. Concluding Remarks

EC is a very active area of research that studies the design and application of a set of algorithms that draw their inspiration from natural evolution. The first studies that influenced the development of EC date back to at least the 1930s. However, it was during the 1960s that the roots of EC were established. During the first stages of EC, several independent researchers devised different schemes that were applied to a diverse range of applications. Three different types of schemes were developed: EP, ESs, and GAs. Although each type of scheme had its own features, further developments made it clear that these schemes had a lot in common, so they began to merge in a natural way. In the 1990s, the terms EC and EA were proposed with the aim of unifying the terminology, and they were soon widely adopted by the community. In fact, it is now very difficult to claim that a given state-of-the-art EA belongs to any of the previous classes because they usually combine components from several types of schemes. The number of studies that have been carried out in the last decades is vast. These studies have yielded significant advances in EC; as a result, the number of problems that have been addressed with EAs has grown enormously.


One of the main handicaps of EC is that mastering EAs is a very complex task. One of the reasons is that despite the huge amount of research conducted in recent years, the discipline cannot yet be considered fully mature. For instance, we are still in a period where different researchers have opposing views with respect to several aspects of EAs, and, therefore, much more research is still required to fully understand EAs and the large number of different components that have been devised by various authors. Thus, designing EAs to tackle new problems is a complex task that involves trial-and-error processes and in-depth analyses in order to understand the interactions and implications of the different components proposed for the problem at hand. Fortunately, many tools that facilitate the use of EAs are now available and recent years have seen significant efforts made to facilitate the application of EAs. In the near future, we expect this area to grow even more as it continues to develop on its path to maturity.

Acknowledgments

The first author gratefully acknowledges financial support from CONACyT grant no. 2016-01-1920 (Investigación en Fronteras de la Ciencia 2016) and from a project from the 2018 SEP-Cinvestav Fund (application no. 4). The second author acknowledges the financial support from CONACyT through “Ciencia Básica” project no. 285599 and from the “Laboratorio de Supercómputo del Bajio” with project no. 300832. The third author acknowledges financial support from the Spanish Ministry of Economy, Industry and Competitiveness as part of the program “I+D+i Orientada a los Retos de la Sociedad” (contract number TIN2016-78410-R).

References

[1] Coello Coello, C. A., Lamont, G. B. and Van Veldhuizen, D. A. (2007). Evolutionary Algorithms for Solving Multi-Objective Problems, 2nd edn. (Springer, New York), ISBN 978-0-387-33254-3.
[2] Nocedal, J. and Wright, S. J. (2006). Numerical Optimization, 2nd edn. (Springer, New York, NY, USA).
[3] Brassard, G. and Bratley, P. (1996). Fundamentals of Algorithmics (Prentice-Hall, New Jersey).
[4] Weise, T. (2008). Global Optimization Algorithms: Theory and Application (http://www.it-weise.de/).
[5] Glover, F. and Kochenberger, G. A. (eds.) (2003). Handbook of Metaheuristics (Kluwer Academic Publishers), ISBN 1-4020-7263-5.
[6] Talbi, E.-G. (2009). Metaheuristics: From Design to Implementation (John Wiley & Sons), ISBN 978-0470278581.
[7] Gendreau, M. and Potvin, J.-Y. (eds.) (2019). Handbook of Metaheuristics, 3rd edn. (Springer, New York, NY, USA).
[8] Hussain, K., Salleh, M. N. M., Cheng, S. and Shi, Y. (2019). Metaheuristic research: a comprehensive survey, Artif. Intell. Rev., 52, pp. 2191–2233.
[9] Blum, C. and Roli, A. (2003). Metaheuristics in combinatorial optimization: overview and conceptual comparison, ACM Comput. Surv., 35(3), pp. 268–308.


[10] Del Ser, J., Osaba, E., Molina, D., Yang, X.-S., Salcedo-Sanz, S., Camacho, D., Das, S., Suganthan, P. N., Coello Coello, C. A. and Herrera, F. (2019). Bio-inspired computation: where we stand and what’s next, Swarm Evol. Comput., 48, pp. 220–250.
[11] Molina, D., LaTorre, A. and Herrera, F. (2018). An insight into bio-inspired and evolutionary algorithms for global optimization: review, analysis, and lessons learnt over a decade of competitions, Cognit. Comput., 10, pp. 517–544.
[12] Crosby, J. L. (1967). Computers in the study of evolution, Sci. Prog. Oxford, 55, pp. 279–292.
[13] Mitchell, M. (1998). An Introduction to Genetic Algorithms (MIT Press, Cambridge, MA, USA).
[14] Darwish, A., Hassanien, A. E. and Das, S. (2020). A survey of swarm and evolutionary computing approaches for deep learning, Artif. Intell. Rev., 53, pp. 1767–1812.
[15] Mirjalili, S., Faris, H. and Aljarah, I. (2020). Evolutionary Machine Learning Techniques (Springer).
[16] Everett, J. E. (2000). Model building, model testing and model fitting, in The Practical Handbook of Genetic Algorithms: Applications, 2nd edn. (CRC Press, Inc., Boca Raton, FL, USA), pp. 40–68.
[17] Eiben, A. E. and Smith, J. E. (2003). Introduction to Evolutionary Computing, Natural Computing Series (Springer).
[18] Linnaeus, C. (1735). Systema naturae, sive regna tria naturae systematice proposita per secundum classes, ordines, genera, & species, cum characteribus, differentiis, synonymis, locis (Theodorum Haak).
[19] de Monet de Lamarck, J. B. P. A. (1809). Philosophie zoologique: ou Exposition des considérations relative à l’histoire naturelle des animaux (Chez Dentu).
[20] Darwin, C. and Wallace, A. (1858). On the tendency of species to form varieties; and on the perpetuation of varieties and species by natural means of selection, Zool. J. Linn. Soc., 3, pp. 46–50.
[21] Darwin, C. (1859). On the Origin of Species by Means of Natural Selection, or the Preservation of Favored Races in the Struggle for Life (Murray).
[22] Mendel, G. (1865). Experiments in plant hybridization, in Brünn Natural History Society, pp. 3–47.
[23] Fisher, R. (1930). The Genetical Theory of Natural Selection (Oxford University Press), ISBN 0-19-850440-3.
[24] Wright, S. (1931). Evolution in mendelian populations, Genetics, 16, pp. 97–159.
[25] Haldane, J. B. S. (1932). The Causes of Evolution (Longman, Green and Co.), ISBN 0-691-02442-1.
[26] Huxley, J. S. (1942). Evolution: The Modern Synthesis (Allen and Unwin).
[27] Baldwin, J. (1896). A new factor in evolution, Am. Nat., 30, pp. 441–451.
[28] Conrad, M. and Pattee, H. H. (1970). Evolution experiments with an artificial ecosystem, J. Theor. Biol., 28(3), pp. 393–409.
[29] Fogel, D. B. (1995). Evolutionary Computation: Toward a New Philosophy of Machine Intelligence (IEEE Press, Piscataway, NJ, USA).
[30] Fogel, D. B. (1998). Evolutionary Computation: The Fossil Record (Wiley-IEEE Press).
[31] Wright, S. (1932). The roles of mutation, inbreeding, crossbreeding and selection in evolution, in VI International Congress of Genetics, Vol. 1, pp. 356–366.
[32] Cannon, W. B. (1932). The Wisdom of the Body (W. W. Norton & Company, Inc.).
[33] Richter, H. (2014). Fitness landscapes: from evolutionary biology to evolutionary computation, in H. Richter and A. Engelbrecht (eds.), Recent Advances in the Theory and Application of Fitness Landscapes, Emergence, Complexity and Computation, Vol. 6 (Springer Berlin Heidelberg), pp. 3–31.
[34] Campbell, D. T. (1960). Blind variation and selective survival as a general strategy in knowledge-processes, in M. C. Yovits and S. Cameron (eds.), Self-Organizing Systems (Pergamon Press, New York), pp. 205–231.
[35] Turing, A. M. (1950). Computing machinery and intelligence, Mind, 59, pp. 433–460.
[36] Turing, A. M. (1948). Intelligent Machinery, Report, National Physical Laboratory, Teddington, UK.
[37] Evans, C. R. and Robertson, A. D. J. (eds.) (1968). Cybernetics: Key Papers (University Park Press, Baltimore, MD, USA).
[38] Burgin, M. and Eberbach, E. (2013). Recursively generated evolutionary Turing machines and evolutionary automata, in X.-S. Yang (ed.), Artificial Intelligence, Evolutionary Computing and Metaheuristics, Studies in Computational Intelligence, Vol. 427 (Springer Berlin Heidelberg), pp. 201–230.
[39] Lewontin, R. C. (1964). The interaction of selection and linkage. I. General considerations; heterotic models, Genetics, 49(1), pp. 49–67.
[40] Crosby, J. L. (1960). The use of electronic computation in the study of random fluctuations in rapidly evolving populations, Philos. Trans. R. Soc. London, Ser. B, 242(697), pp. 550–572.
[41] Fraser, A. S. (1957). Simulation of genetic systems by automatic digital computers. I. Introduction, Aust. J. Biol. Sci., 10, pp. 484–491.
[42] Fraser, A. S. and Burnell, D. (1967). Simulation of genetic systems XII. Models of inversion polymorphism, Genetics, 57, pp. 267–282.
[43] Fraser, A. S. and Burnell, D. (1970). Computer Models in Genetics (McGraw-Hill, NY).
[44] Gill, J. L. (1965). Effects of finite size on selection advance in simulated genetic populations, Aust. J. Biol. Sci., 18, pp. 599–617.
[45] Young, S. S. (1966). Computer simulation of directional selection in large populations. I. The programme, the additive and the dominance models, Genetics, 53, pp. 189–205.
[46] Friedman, G. J. (1956). Selective feedback computers for engineering synthesis and nervous system analogy, Master’s thesis, University of California, Los Angeles.
[47] Box, G. E. P. (1957). Evolutionary operation: a method for increasing industrial productivity, Appl. Stat., 6(2), pp. 81–101.
[48] Friedberg, R. M. (1958). A learning machine: part I, IBM J. Res. Dev., 2(1), pp. 2–13.
[49] Friedberg, R. M., Dunham, B. and North, J. H. (1959). A learning machine: part II, IBM J. Res. Dev., 3(3), pp. 282–287.
[50] Dunham, B., Fridshal, D., Fridshal, R. and North, J. H. (1963). Design by natural selection, Synthese, 15(1), pp. 254–259.
[51] Holland, J. H. (1975). Adaptation in Natural and Artificial Systems (University of Michigan Press, Ann Arbor, MI, USA).
[52] Reed, J., Toombs, R. and Barricelli, N. A. (1967). Simulation of biological evolution and machine learning: I. Selection of self-reproducing numeric patterns by data processing machines, effects of hereditary control, mutation type and crossing, J. Theor. Biol., 17(3), pp. 319–342.
[53] Fogel, L. J. (1962). Autonomous automata, Ind. Res., 4, pp. 14–19.
[54] Fogel, L. J. (1999). Intelligence through Simulated Evolution: Forty Years of Evolutionary Programming, Wiley Series on Intelligent Systems (Wiley).
[55] Fogel, D. B. (1994). An introduction to simulated evolutionary optimization, IEEE Trans. Neural Networks, 5(1), pp. 3–14.
[56] Fogel, D. B. and Chellapilla, K. (1998). Revisiting evolutionary programming, in Proc. SPIE, Vol. 3390, pp. 2–11.
[57] Barricelli, N. A. (1962). Numerical testing of evolution theories, Acta Biotheor., 16(1–2), pp. 69–98.
[58] Fogel, L. J., Owens, A. J. and Walsh, M. J. (1966). Artificial Intelligence through Simulated Evolution (John Wiley & Sons).
[59] Fogel, L. J. and Burgin, G. H. (1969). Competitive goal-seeking through evolutionary programming, Tech. Rep. Contract AF 19(628)-5927, Air Force Cambridge Research Labs.


[60] Fogel, L. J. and Fogel, D. B. (1986). Artificial intelligence through evolutionary programming, Tech. Rep. PO-9-X56-1102C-1, U.S. Army Research Institute, San Diego, CA.
[61] Lyle, M. (1972). An investigation into scoring techniques in evolutionary programming, Master’s thesis, New Mexico State University, Las Cruces, New Mexico, USA.
[62] Cornett, F. N. (1972). An application of evolutionary programming to pattern recognition, Master’s thesis, New Mexico State University, Las Cruces, New Mexico, USA.
[63] Dearholt, D. W. (1976). Some experiments on generalization using evolving automata, in 9th Hawaii International Conference on System Sciences (Western Periodicals, Honolulu), pp. 131–133.
[64] Chellapilla, K. (1997). Evolving computer programs without subtree crossover, IEEE Trans. Evol. Comput., 1(3), pp. 209–216.
[65] Yao, X., Liu, Y. and Lin, G. (1999). Evolutionary programming made faster, IEEE Trans. Evol. Comput., 3(2), pp. 82–102.
[66] Porto, V. W. and Fogel, D. B. (1995). Alternative neural network training methods [active sonar processing], IEEE Expert, 10(3), pp. 16–22.
[67] Fogel, D. B. and Atmar, J. W. (1990). Comparing genetic operators with Gaussian mutations in simulated evolutionary processes using linear systems, Biol. Cybern., 63(2), pp. 111–114.
[68] Fogel, D. B., Fogel, L. J. and Atmar, J. W. (1991). Meta-evolutionary programming, in Asilomar Conference in Signals Systems and Computers, 1991, pp. 540–545.
[69] Fogel, L. J., Angeline, P. J. and Fogel, D. B. (1995). An evolutionary programming approach to self-adaptation on finite state machines, in Annual Conference on Evolutionary Programming IV, pp. 355–365.
[70] Richter, J. N., Wright, A. and Paxton, J. (2008). Ignoble trails – where crossover is provably harmful, in G. Rudolph, T. Jansen, S. Lucas, C. Poloni and N. Beume (eds.), Parallel Problem Solving from Nature PPSN X, Lecture Notes in Computer Science, Vol. 5199 (Springer Berlin Heidelberg), pp. 92–101.
[71] Jain, A. and Fogel, D. B. (2000). Case studies in applying fitness distributions in evolutionary algorithms. II. Comparing the improvements from crossover and Gaussian mutation on simple neural networks, in 2000 IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks, pp. 91–97.
[72] Rechenberg, I. (1965). Cybernetic solution path of an experimental problem, Tech. Rep. Library Translation 1122, Royal Aircraft Establishment.
[73] Schwefel, H.-P. (1965). Kybernetische Evolution als Strategie der experimentellen Forschung in der Strömungstechnik, Dipl.-Ing. Thesis, Technical University of Berlin, Hermann Föttinger-Institute for Hydrodynamics.
[74] Whitley, D. and Kauth, J. (1988). GENITOR: A Different Genetic Algorithm (Colorado State University, Department of Computer Science).
[75] Schwefel, H.-P. (1981). Numerical Optimization of Computer Models (John Wiley & Sons, Inc., New York, NY, USA).
[76] Igel, C., Hansen, N. and Roth, S. (2007). Covariance matrix adaptation for multi-objective optimization, Evol. Comput., 15(1), pp. 1–28.
[77] Li, R., Emmerich, M. T. M., Eggermont, J., Bäck, T., Schütz, M., Dijkstra, J. and Reiber, J. H. C. (2013). Mixed integer evolution strategies for parameter optimization, Evol. Comput., 21(1), pp. 29–64.
[78] Mezura-Montes, E. and Coello Coello, C. A. (2005). A simple multimembered evolution strategy to solve constrained optimization problems, IEEE Trans. Evol. Comput., 9(1), pp. 1–17.
[79] Rechenberg, I. (1978). Evolutionsstrategien, in B. Schneider and U. Ranft (eds.), Simulationsmethoden in der Medizin und Biologie (Springer-Verlag, Berlin, Germany), pp. 83–114.
[80] Bäck, T. and Schwefel, H.-P. (1993). An overview of evolutionary algorithms for parameter optimization, Evol. Comput., 1(1), pp. 1–23.


[81] Eiben, A. E. and Bäck, T. (1997). Empirical investigation of multiparent recombination operators in evolution strategies, Evol. Comput., 5(3), pp. 347–365.
[82] Bäck, T., Hoffmeister, F. and Schwefel, H.-P. (1991). A survey of evolution strategies, in Proceedings of the Fourth International Conference on Genetic Algorithms (Morgan Kaufmann), pp. 2–9.
[83] Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies, Evol. Comput., 9(2), pp. 159–195.
[84] Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution (Frommann-Holzboog, Stuttgart).
[85] Ostermeier, A., Gawelczyk, A. and Hansen, N. (1994a). A derandomized approach to self-adaptation of evolution strategies, Evol. Comput., 2(4), pp. 369–380.
[86] Ostermeier, A., Gawelczyk, A. and Hansen, N. (1994b). Step-size adaptation based on nonlocal use of selection information, in Y. Davidor, H.-P. Schwefel and R. Männer (eds.), Parallel Problem Solving from Nature PPSN III, Lecture Notes in Computer Science, Vol. 866 (Springer Berlin Heidelberg), pp. 189–198.
[87] Beyer, H.-G. and Sendhoff, B. (2008). Covariance matrix adaptation revisited – the CMSA evolution strategy, in G. Rudolph, T. Jansen, S. Lucas, C. Poloni and N. Beume (eds.), Parallel Problem Solving from Nature PPSN X, Lecture Notes in Computer Science, Vol. 5199 (Springer Berlin Heidelberg), pp. 123–132.
[88] Rudolph, G. (2012). Evolutionary strategies, in G. Rozenberg, T. Bäck and J. Kok (eds.), Handbook of Natural Computing (Springer Berlin Heidelberg), pp. 673–698.
[89] Poland, J. and Zell, A. (2001). Main vector adaptation: a CMA variant with linear time and space complexity, in L. Spector, E. D. Goodman, A. Wu, W. B. Langdon, H.-M. Voigt, M. Gen, S. Sen, M. Dorigo, S. Pezeshk, M. H. Garzon and E. Burke (eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001) (Morgan Kaufmann, San Francisco, California, USA), pp. 1050–1055.
[90] Ostermeier, A. (1992). An evolution strategy with momentum adaptation of the random number distribution, in R. Männer and B. Manderick (eds.), Parallel Problem Solving from Nature 2, PPSN-II (Elsevier, Brussels, Belgium), pp. 199–208.
[91] Yao, X. and Liu, Y. (1997). Fast evolution strategies, in P. Angeline, R. Reynolds, J. McDonnell and R. Eberhart (eds.), Evolutionary Programming VI, Lecture Notes in Computer Science, Vol. 1213 (Springer Berlin Heidelberg), pp. 149–161.
[92] Bäck, T., Foussette, C. and Krause, P. (2013). Contemporary Evolution Strategies, Natural Computing Series (Springer).
[93] Li, Z., Lin, X., Zhang, Q. and Liu, H. (2020). Evolution strategies for continuous optimization: a survey of the state-of-the-art, Swarm Evol. Comput., 56, p. 100694.
[94] De Jong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive systems, Ph.D. thesis, University of Michigan, Ann Arbor, Michigan, USA.
[95] Bremermann, H. J. (1962). Optimization through evolution and recombination, in M. C. Yovits, G. T. Jacobi and G. D. Goldstine (eds.), Proceedings of the Conference on Self-Organizing Systems (Spartan Books, Washington, DC), pp. 93–106.
[96] Bledsoe, W. (1961). The use of biological concepts in the analytical study of “systems”, Tech. Rep. Technical report of talk presented to ORSA-TIMS national meeting held in San Francisco, California, The University of Texas at Austin, USA.
[97] Fogel, D. B. and Anderson, R. W. (2000). Revisiting Bremermann’s genetic algorithm. I. Simultaneous mutation of all parameters, in 2000 IEEE Congress on Evolutionary Computation, Vol. 2, pp. 1204–1209.
[98] Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning, Artificial Intelligence (Addison-Wesley).
[99] Bäck, T., Fogel, D. B. and Michalewicz, Z. (1997). Handbook of Evolutionary Computation, 1st edn. (IOP Publishing Ltd., Bristol, UK).


[100] Schraudolph, N. and Belew, R. (1992). Dynamic parameter encoding for genetic algorithms, Mach. Learn., 9(1), pp. 9–21.
[101] Michalewicz, Z. (1994). Genetic Algorithms + Data Structures = Evolution Programs (Springer-Verlag New York, Inc., New York, NY, USA).
[102] Fogel, D. B. and Ghozeil, A. (1997). A note on representations and variation operators, IEEE Trans. Evol. Comput., 1(2), pp. 159–161.
[103] Altenberg, L. (1995). The Schema Theorem and Price’s Theorem, in D. Whitley and M. D. Vose (eds.), Foundations of Genetic Algorithms 3 (Morgan Kaufmann, San Francisco), pp. 23–49.
[104] Heckendorn, R. B., Whitley, D. and Rana, S. (1996). Nonlinearity, hyperplane ranking and the simple genetic algorithm, in R. K. Belew and M. D. Vose (eds.), Foundations of Genetic Algorithms 4, pp. 181–201.
[105] Whitley, D. and Sutton, A. (2012). Genetic algorithms: a survey of models and methods, in G. Rozenberg, T. Bäck and J. Kok (eds.), Handbook of Natural Computing (Springer Berlin Heidelberg), pp. 637–671.
[106] Vose, M. D. and Liepins, G. (1991). Punctuated equilibria in genetic search, Complex Syst., 5, pp. 31–44.
[107] Nix, A. and Vose, M. D. (1992). Modeling genetic algorithms with Markov chains, Ann. Math. Artif. Intell., 5(1), pp. 79–88.
[108] Eiben, A. E., Aarts, E. H. L. and Hee, K. M. (1991). Global convergence of genetic algorithms: a Markov chain analysis, in H.-P. Schwefel and R. Männer (eds.), Parallel Problem Solving from Nature, Lecture Notes in Computer Science, Vol. 496 (Springer Berlin Heidelberg), pp. 3–12.
[109] Radcliffe, N. J. (1997). Schema processing, in T. Bäck, D. B. Fogel and Z. Michalewicz (eds.), Handbook of Evolutionary Computation (Institute of Physics Publishing and Oxford University Press, Bristol, New York), pp. B2.5:1–10.
[110] Goldberg, D. E., Korb, B. and Deb, K. (1989). Messy genetic algorithms: motivation, analysis, and first results, Complex Syst., 3(5), pp. 493–530.
[111] Chen, Y. and Lim, M. H. (2008). Linkage in Evolutionary Computation, Studies in Computational Intelligence (Springer).
[112] Sivanandam, S. N. and Deepa, S. N. (2007). Introduction to Genetic Algorithms (Springer).
[113] Syswerda, G. (1989). Uniform crossover in genetic algorithms, in Proceedings of the 3rd International Conference on Genetic Algorithms (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA), pp. 2–9.
[114] Lobo, F. G., Lima, C. F. and Michalewicz, Z. (2007). Parameter Setting in Evolutionary Algorithms, 1st edn. (Springer Publishing Company, Incorporated).
[115] Bäck, T., Rudolph, G. and Schwefel, H.-P. (1993). Evolutionary programming and evolution strategies: similarities and differences, in D. B. Fogel and J. W. Atmar (eds.), Second Annual Conference on Evolutionary Programming (San Diego, CA), pp. 11–22.
[116] Bäck, T. (1996). Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms (Oxford University Press, Oxford, UK).
[117] De Jong, K. A. (2006). Evolutionary Computation: A Unified Approach (MIT Press).
[118] Mezura-Montes, E. (2009). Constraint-Handling in Evolutionary Optimization, 1st edn. (Springer).
[119] Miguel Antonio, L. and Coello Coello, C. A. (2018). Coevolutionary multiobjective evolutionary algorithms: survey of the state-of-the-art, IEEE Trans. Evol. Comput., 22(6), pp. 851–865.
[120] Črepinšek, M., Liu, S.-H. and Mernik, M. (2013). Exploration and exploitation in evolutionary algorithms: a survey, ACM Comput. Surv., 45(3), pp. 35:1–35:33.
[121] Goldberg, D. E. and Richardson, J. (1987). Genetic algorithms with sharing for multimodal function optimization, in Proceedings of the Second International Conference on Genetic Algorithms and their Application (L. Erlbaum Associates Inc., Hillsdale, NJ, USA), pp. 41–49.


[122] Eshelman, L. (1991). The CHC adaptive search algorithm: how to have safe search when engaging in nontraditional genetic recombination, Foundations of Genetic Algorithms, pp. 265–283.
[123] Segura, C., Coello, C., Miranda, G. and León, C. (2016). Using multi-objective evolutionary algorithms for single-objective constrained and unconstrained optimization, Ann. Oper. Res., 240, pp. 217–250.
[124] Alba, E. (2005). Parallel Metaheuristics: A New Class of Algorithms (John Wiley & Sons, NJ, USA).
[125] Moscato, P. (1989). On evolution, search, optimization, genetic algorithms and martial arts: towards memetic algorithms, Tech. Rep. C3P Report 826, California Institute of Technology.
[126] Ma, H., Shen, S., Yu, M., Yang, Z., Fei, M. and Zhou, H. (2019). Multi-population techniques in nature inspired optimization algorithms: a comprehensive survey, Swarm Evol. Comput., 44, pp. 365–387.
[127] Bossert, W. (1967). Mathematical optimization: are there abstract limits on natural selection? in P. S. Moorehead and M. M. Kaplan (eds.), Mathematical Challenges to the Neo-Darwinian Interpretation of Evolution (The Wistar Institute Press, Philadelphia, PA), pp. 35–46.
[128] Kaufman, H. (1967). An experimental investigation of process identification by competitive evolution, IEEE Trans. Syst. Sci. Cybern., 3(1), pp. 11–16.
[129] Storn, R. and Price, K. (1995). Differential evolution: a simple and efficient adaptive scheme for global optimization over continuous spaces, Tech. Rep. TR95012, International Computer Science Institute, Berkeley.
[130] Koza, J. R. (1990). Genetic programming: a paradigm for genetically breeding populations of computer programs to solve problems, Tech. Rep., Stanford University, Stanford, California, USA.
[131] Rothlauf, F. (2011). Design of Modern Heuristics: Principles and Application, 1st edn. (Springer Publishing Company, Incorporated), ISBN 3540729615.
[132] Carrasco, J., García, S., Rueda, M., Das, S. and Herrera, F. (2020). Recent trends in the use of statistical tests for comparing swarm and evolutionary computing algorithms: practical guidelines and a critical review, Swarm Evol. Comput., 54, p. 100665.
[133] Doerr, B. and Neumann, F. (eds.) (2019). Theory of Evolutionary Computation (Springer International Publishing), ISBN 978-3-03-029413-7.
[134] Grefenstette, J. (1987). Incorporating problem specific knowledge into genetic algorithms, in L. Davis (ed.), Genetic Algorithms and Simulated Annealing, pp. 42–60.
[135] Ashlock, D., McGuinness, C. and Ashlock, W. (2012). Representation in evolutionary computation, in J. Liu, C. Alippi, B. Bouchon-Meunier, G. W. Greenwood and H. A. Abbass (eds.), Advances in Computational Intelligence: IEEE World Congress on Computational Intelligence, WCCI 2012, Brisbane, Australia, June 10–15, 2012. Plenary/Invited Lectures (Springer Berlin Heidelberg, Berlin, Heidelberg), ISBN 978-3-642-30687-7, pp. 77–97.
[136] Rothlauf, F. (2006). Representations for Genetic and Evolutionary Algorithms (Springer-Verlag, Berlin, Heidelberg), ISBN 354025059X.
[137] Goldberg, D. E. (1990). Real-coded genetic algorithms, virtual alphabets, and blocking, Complex Syst., 5, pp. 139–167.
[138] Radcliffe, N. J. (1992). Non-linear genetic representations, in R. Männer and B. Manderick (eds.), Parallel Problem Solving from Nature 2, PPSN-II, Brussels, Belgium, September 28–30, 1992 (Elsevier), pp. 261–270.
[139] Palmer, C. (1994). An approach to a problem in network design using genetic algorithms, unpublished PhD thesis, Polytechnic University, Troy, NY.
[140] Ronald, S. (1997). Robust encodings in genetic algorithms: a survey of encoding issues, in Proceedings of 1997 IEEE International Conference on Evolutionary Computation (ICEC ’97), pp. 43–48.

page 892

June 1, 2022 13:18

Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch22

Evolutionary Computation: Historical View and Basic Concepts

page 893

893

[141] Wilson, D., Rodrigues, S., Segura, C., Loshchilov, I., Hutter, F., Buenfil, G. L., Kheiri, A., Keedwell, E., Ocampo-Pineda, M., zcan, E., Pea, S. I. V., Goldman, B., Rionda, S. B., HernndezAguirre, A., Veeramachaneni, K. and Cussat-Blanc, S. (2018). Evolutionary computation for wind farm layout optimization, Renewable Energy, 126, pp. 681–691. [142] Wilkerson, J. L. and Tauritz, D. R. (2011). A guide for fitness function design, in Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO ’11 (Association for Computing Machinery, New York, NY, USA), ISBN 9781450306904, p. 123124. [143] Reeves, C. R. (2000). Fitness landscapes and evolutionary algorithms, in C. Fonlupt, J.-K. Hao, E. Lutton, M. Schoenauer and E. Ronald (eds.), Artificial Evolution (Springer Berlin Heidelberg, Berlin, Heidelberg), ISBN 978-3-540-44908-9, pp. 3–20. [144] Adair, J., Ochoa, G. and Malan, K. M. (2019). Local optima networks for continuous fitness landscapes, in Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO ’19 (Association for Computing Machinery, New York, NY, USA), ISBN 9781450367486, p. 14071414. [145] Clark, J. A., Jacob, J. L. and Stepney, S. (2004). Searching for cost functions, in Proceedings of the 2004 Congress on Evolutionary Computation (IEEE Cat. No. 04TH8753), Vol. 2, pp. 1517– 1524. [146] López-López, I., Gómez, G. S., Segura, C., Oliva, D. and Rojas, O. (2020). Metaheuristics in the optimization of cryptographic Boolean functions, Entropy, 22(9), pp. 1–25. [147] Majig, M. and Fukushima, M. (2008). Adaptive fitness function for evolutionary algorithm and its applications, in International Conference on Informatics Education and Research for Knowledge-Circulating Society (icks 2008), pp. 119–124. [148] Shi, L. and Rasheed, K. (2010). A survey of fitness approximation methods applied in evolutionary algorithms, in Tenne Y. and Goh C. K. 
(eds.), Computational Intelligence in Expensive Optimization Problems. Adaptation Learning and Optimization, Vol. 2. (Springer Berlin Heidelberg, Berlin, Heidelberg), ISBN 978-3-642-10701-6, pp. 3–28. [149] Jin, Y. (2011). Surrogate-assisted evolutionary computation: recent advances and future challenges, Swarm Evol. Comput., 1(2), pp. 61–70. [150] Jin, Y. (2005). A comprehensive survey of fitness approximation in evolutionary computation, Soft Comput., 9(1), p. 312. [151] Jin, Y., Wang, H., Chugh, T., Guo, D. and Miettinen, K. (2019). Data-driven evolutionary optimization: an overview and case studies, IEEE Trans. Evol. Comput., 23(3), pp. 442–458. [152] Blickle, T. and Thiele, L. (1996). A comparison of selection schemes used in evolutionary algorithms, Evol. Comput., 4(4), pp. 361–394. [153] Segura, C., Coello, C., Segredo, E., Miranda, G. and León, C. (2013). Improving the diversity preservation of multi-objective approaches used for single-objective optimization, in 2013 IEEE Congress on Evolutionary Computation (CEC), pp. 3198–3205. [154] Galinier, P. and Hao, J.-K. (1999). Hybrid evolutionary algorithms for graph coloring, J. Comb. Optim., 3, pp. 379–397. [155] Segura, C., Hernndez-Aguirre, A., Luna, F. and Alba, E. (2017). Improving diversity in evolutionary algorithms: new best solutions for frequency assignment, IEEE Trans. Evol. Comput., 21(4), pp. 539–553. [156] Deb, K., Anand, A. and Joshi, D. (2002). A computationally efficient evolutionary algorithm for real-parameter optimization, Evol. Comput., 10(4), pp. 371–395. [157] Jain, H. and Deb, K. (2011). Parent to mean-centric self-adaptation in sbx operator for realparameter optimization, in B. K. Panigrahi, P. N. Suganthan, S. Das and S. C. Satapathy (eds.), Swarm, Evolutionary, and Memetic Computing (Springer Berlin Heidelberg, Berlin, Heidelberg), ISBN 978-3-642-27172-4, pp. 299–306.

June 1, 2022 13:18

894

Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch22

Handbook on Computer Learning and Intelligence — Vol. II

[158] Arabas, J. and Opara, K. (2020). Population diversity of nonelitist evolutionary algorithms in the exploration phase, IEEE Trans. Evol. Comput., 24(6), pp. 1050–1062. [159] Pandey, H. M., Chaudhary, A. and Mehrotra, D. (2014). A comparative review of approaches to prevent premature convergence in GA, Appl. Soft Comput., 24, pp. 1047–1077. [160] Neri, F., Cotta, C. and Moscato, P. (2011). Handbook of Memetic Algorithms (Springer Publishing Company, Incorporated), ISBN 3642232469. [161] López-Ibáñez, M., Dubois-Lacoste, J., Pérez Cáceres, L., Stützle, T. and Birattari, M. (2016). The irace package: iterated racing for automatic algorithm configuration, Oper. Res. Perspect., 3, pp. 43–58. [162] Eiben, A. E., Hinterding, R. and Michalewicz, Z. (1999). Parameter control in evolutionary algorithms, IEEE Trans. Evol. Comput., 3(2), pp. 124–141. [163] Huang, C., Li, Y. and Yao, X. (2020). A survey of automatic parameter tuning methods for metaheuristics, IEEE Trans. Evol. Comput., 24(2), pp. 201– 216. [164] Gomes Pereira de Lacerda, M., de Araujo Pessoa, L. F., Buarque de Lima Neto, F., Ludermir, T. B. and Kuchen, H. (2021). A systematic literature review on general parameter control for evolutionary and swarm-based algorithms, Swarm Evol. Comput., 60, p. 100777. [165] Mallipeddi, R., Suganthan, P., Pan, Q. and Tasgetiren, M. (2011). Differential evolution algorithm with ensemble of parameters and mutation strategies, Appl. Soft Comput., 11(2), pp. 1679–1696, the Impact of Soft Computing for the Progress of Artificial Intelligence. [166] Wu, G., Mallipeddi, R. and Suganthan, P. N. (2019). Ensemble strategies for population-based optimization algorithms a survey, Swarm Evol. Comput., 44, pp. 695–711. [167] Luke, S., Panait, L., Balan, G. and Et (2007). ECJ: a java-based evolutionary computation research system, http://cs.gmu.edu/_eclab/projects/ecj/. [168] León, C., Miranda, G. and Segura, C. (2009). 
Metco: a parallel plugin-based framework for multi-objective optimization, Int. J. Artif. Intell. Tools, 18(4), pp. 569–588. [169] Liefooghe, A., Jourdan, L. and Talbi, E.-G. (2011). A software framework based on a conceptual unified model for evolutionary multiobjective optimization: paradiseomoeo, Eur. J. Oper. Res., 209(2), pp. 104–112. [170] Durillo, J. J. and Nebro, A. J. (2011). jmetal: a java framework for multi-objective optimization, Adv. Eng. Software, 42, pp. 760–771.


© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0023

Chapter 23

An Empirical Study of Algorithmic Bias

Dipankar Dasgupta and Sajib Sen

Department of Computer Science, University of Memphis, Memphis, USA

In all goal-oriented selection activities, a certain level of bias is unavoidable and may even be desirable for efficient AI-based decision support systems. However, a fair, independent comparison of all eligible entities is essential to alleviate explicit bias in a competitive marketplace. For example, when a customer searches online for a good or a service, the underlying algorithm is expected to return fair results by searching all available entities in the stated category. A biased search, by contrast, can run a narrow or collaborative query that ignores competing outcomes, so that customers pay more or receive lower-quality products or services for the money they spend. This chapter describes algorithmic biases in different contexts with real-life case studies, examples, and scenarios, and it provides best practices to detect and remove algorithmic bias.

23.1. Introduction

Because of the massive digitization of data and systems, decision-making processes in the private and public sectors are largely being driven by machines. Recent advances in technologies such as artificial intelligence (AI) and data science are converting physical systems into automated decision-making bots [1, 2]. Beyond suggesting products to buy or movies to watch, algorithms are also used in very high-stakes decisions such as loan applications [3], dating [4], and hiring [5, 6]. The benefits of algorithmic decisions are clear [2]. Unlike a human, an algorithm or a machine does not get bored or tired [7–9], and it can weigh far more factors than a human can [10–14]. On the other hand, like humans, decision support systems are vulnerable to biases of varying magnitude that translate into biased or unfair decisions [15, 16]. As technologies can also be used adversarially to fool the system, AI is disrupting many areas of the economy. Moreover, as government sectors embrace automated systems to improve their accuracy, AI has also been claimed to objectify democracy [1].

Although Narayanan [17] catalogs many definitions of bias and fairness, in the context of a decision support system, fairness is the absence of any collusion among different groups and of any favoritism or prejudice based on the inherent or derived characteristics of an individual or group. A canonical example of a biased algorithm with serious consequences is the software tool COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), which predicts the risk that a person will commit another crime. Judges in the United States used COMPAS predictions to decide whether to release an offender or keep him or her in prison. An investigation of the software found that it assigned higher risk scores to African Americans than to Caucasians with the same profile [18]. Similar outcomes, with differing impact, are found in other families of automated systems. For example, AI-based judges of a beauty pageant were found to be biased against darker-skinned contestants [19], and facial recognition software in digital cameras overpredicts blinking for Asians [20]. These types of biases originate from neglected or inherent biases in the data or in the model output. Whereas bias that originates in data can be removed by pre-processing, in-processing, and post-processing algorithms [21], tacit collusion produced by an algorithm is hard to detect.

Biased technology can also turn users into newly minted economists. Consider the controversial “price surge” case of the ride-sharing service Uber [22]. On one New Year’s Eve, its fares rose to six to seven times the normal rate. When asked for an explanation, Uber’s own marketing pointed to the decision of its proprietary black-box (dynamic pricing) algorithm. The company claimed that the algorithm automatically raises prices in response to market signals in order to encourage more drivers onto the road and thereby increase supply. On top of that, Uber’s CEO claimed that the company is not responsible for the price surge; it is the market that sets the price [23].

Treating an automated intelligent system as a black box has produced not only short-term effects such as Uber’s “price surge” case but also serious consequences such as the COMPAS case. Establishing trust both in a model’s predictions and in the model itself is therefore an important problem once the model has been deployed for decision making. Machine learning should not be applied on blind faith to terrorism detection or medical diagnosis [24], and the division of responsibility between human and machine needs to be considered, as the outcomes can be catastrophic. Involving an automated model or system in such important real-life scenarios requires trust. To convince a user to use and trust a deployed AI system, two different (but interrelated) notions of trust must be satisfied: (1) trusting a model’s prediction and (2) trusting a model. Both require that a user understand the model’s behavior when it is deployed in the wild.


The explainability of an AI system is not only required from the consumer and developer perspectives; it also helps law enforcers to establish violations of antitrust law. Our study offers a descriptive and normative account of algorithmic bias/collusion and its critically important implications for computational science, economics, and law. Our contributions are as follows:

(i) First, we paint a descriptive and computationally explainable picture of the change in commerce that is being brought about by algorithm-driven automated decision support systems.
(ii) Second, taking that snapshot as a base, we strive to identify the overall spectrum of consequences for consumer welfare and antitrust law.
(iii) Finally, we provide a case study that uses the explainable AI concept from the literature to walk through an automated decision support system, and we recommend best practices to overcome these issues.

A shorter version of our study was presented at a COMPSAC 2020 workshop [25]. Here we elaborate on that study with more data and examples as well as a more detailed explanation, and we provide additional case studies to support our arguments. The chapter is organized as follows: Section 23.2 provides a classification of different biases, Section 23.3 gives examples of data bias, and Section 23.4 illustrates algorithmic bias/collusion with examples. Sections 23.5 and 23.6 detail categories of algorithmic bias, followed by the models used to detect bias in different case studies. The final section concludes with the outcome of our empirical study.

23.2. Classification of Bias

The definition of bias depends on context, and bias can exist in many forms. Suresh and Guttag [26] pointed out different sources of bias in end-to-end machine learning projects. In [27], the authors provided a comprehensive list of biases arising from data origin, collection, and processing in a machine learning pipeline. In this chapter, the types of bias are drawn mainly from these two papers, along with other existing articles. The circular relationship among these biases and their categorization in [28] also provide a base for the chapter.

(1) Historical Bias: Historical bias is bias that already exists in the world; it can seep into the data generation process even with perfect sampling and feature selection [26]. An example is an image search for CEOs in 2018: the results showed more men than women because only 5% of Fortune 500 CEOs were women in 2018.


(2) Representation Bias: Bias generated by how people define a population, sample from it, and represent it [26]. The bias of the ImageNet dataset toward Western countries (shown in Figure 23.2 below) is an example of representation bias.
(3) Evaluation Bias: This kind of bias arises when an inappropriate or disproportionate benchmark is chosen to evaluate a model [26]. For example, choosing Adience and IJB-A as benchmarks for a facial recognition system will bias the model evaluation, as these benchmarks are skewed with respect to skin color and gender [29].
(4) Aggregation Bias: This type of bias originates when false inferences are drawn about a subgroup from observations of a different group, and the false conclusion then affects the decision support system [26]. For example, a decision support system for predicting diabetes can exhibit aggregation bias because its main feature, the HbA1c level, varies widely among diabetes patients across ethnicities and genders.
(5) Population Bias: Population bias arises when the demographics, statistics, representatives, and user characteristics of the dataset differ from those of the target population [27]. For example, social media data may show that women are more likely to use Instagram and Facebook, whereas men are more likely to use Twitter.
(6) Behavioral Bias: Bias arising from differences in user behavior across contexts, platforms, or datasets [27].
(7) Content Production Bias: Content production bias arises when users generate content that differs structurally, lexically, semantically, and syntactically according to user characteristics such as age or gender [27].
(8) Linking Bias: This type of bias arises when network attributes collected from user activities, interactions, or connections do not represent the true behavior of the users [27].
(9) Temporal Bias: This type of bias originates from differences in user behavior, as well as in the population, over time [27].
(10) Popularity Bias: Popularity bias is one of the main biases in recommendation systems. Popular items tend to come first in search results and in recommendations, and they are therefore the items that users go on to review. However, this popularity can also be inflated by fake reviews or social bots, which go unchecked.
(11) Algorithmic Bias: Algorithmic bias is bias that is not present in the input data and is added purely by an algorithm [30].
(12) Presentation Bias: Bias generated by how information is presented [30]. For example, depending on how content is presented on the Web, the data generated from user clicks tends to be biased against content that is not presented well.


Figure 23.1: Bias definitions in the data, algorithm, and user interaction feedback loop are placed on their most appropriate arrows [28].

(13) Ranking Bias: This type of bias is prevalent in search engine optimization models. Ranking popular pages at the top results in their being viewed more than others, which creates the bias.
(14) Social Bias: Social bias arises when a model’s judgment is affected by other people’s actions or content.
(15) Emergent Bias: This type of bias emerges from interaction with real users.

Because a machine learning model depends on data, a feedback loop arises in which the predicted outcome of the model affects the future data that will be used for subsequent model training. As a result, the types of bias listed above are intertwined with each other. Mehrabi et al. [28] group these definitions and types into three categories and place them on the arrows of the machine learning feedback loop (Figure 23.1).
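The feedback loop described above can be illustrated with a small simulation. The following is a toy sketch, not a model taken from this chapter or from [28]: it assumes a hypothetical recommender that shows items in proportion to their accumulated clicks, and users who click whatever is shown. The function name `simulate_feedback_loop`, the starting counts, and the number of rounds are all illustrative assumptions.

```python
import random


def simulate_feedback_loop(initial_clicks, rounds, seed=0):
    """Toy rich-get-richer loop: exposure follows past clicks, clicks follow exposure.

    All items are assumed to be equally good; the only difference between
    them is the click count they start with (hypothetical data).
    """
    rng = random.Random(seed)
    clicks = list(initial_clicks)
    for _ in range(rounds):
        # The "model" shows one item with probability proportional to its
        # click count so far (the popularity signal learned from past data).
        shown = rng.choices(range(len(clicks)), weights=clicks, k=1)[0]
        # The user clicks the item that was shown; this becomes new training data.
        clicks[shown] += 1
    return clicks


start = [5, 1, 1, 1]  # item 0 has a small, arbitrary head start
final = simulate_feedback_loop(start, rounds=1000)
total = sum(final)
for item, count in enumerate(final):
    print(f"item {item}: {count} clicks ({count / total:.1%} of all clicks)")
```

Because the items are identical by construction, any gap in the final counts comes from the initial head start and from chance being fed back through the loop, not from item quality.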

23.3. Data Bias Examples

As mentioned in the previous section, there are multiple ways in which data can be biased. Among them, biases caused by unbalanced diversity and representation in data are the most prevalent. Buolamwini and Gebru [29] showed that datasets such as IJB-A and Adience mainly contain light-skinned subjects. These datasets, which are presented as benchmarks for facial recognition systems, contain 79.6% and 86.2% light-skinned subjects, respectively (Figure 23.3). Besides unbalanced data, biases can also seep into the way we use and analyze the data. Buolamwini and Gebru [29] also showed that dividing data only into male and female is not enough: if we consider the data in only these two main groups, underlying biases against dark-skinned females remain hidden. The authors suggested subdividing the data into four categories by gender (e.g., male and female) and by skin color (e.g., light and dark).

Figure 23.2: Fraction of each country, represented by their two-letter ISO codes, in Open Images and ImageNet image datasets. In both datasets, the US and Great Britain represent the top locations [31].

Figure 23.3: Geographical distribution representation for countries in the Open Images dataset. In their sample, almost one third of the data was US-based, and 60% of the data was from the six most represented countries across North America and Europe [31].

Chen et al. [32] pointed out the effects of data bias in medical fields, where artificial intelligence systems do not treat all patients equally. The problem of data that is unbalanced or skewed toward a certain group is not confined to the medical field; it also affects certain benchmark datasets, which subsequently bias the downstream applications that are evaluated on them. For example, the popular benchmark datasets ImageNet and Open Images have been found to contain representation bias, and incorporating geodiversity has been advised when creating such datasets [31].
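The suggestion in [29] to split data by gender and skin color together, rather than by gender alone, can be sketched as follows. This is a minimal illustration on invented records, not the authors' evaluation code or data; the helper `error_rate_by_group` and the field names `gender`, `skin`, `predicted`, and `actual` are assumptions made for the example.

```python
from collections import defaultdict


def error_rate_by_group(records, keys):
    """Return {group: error rate}, grouping records by the given attribute keys."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += 1
        if rec["predicted"] != rec["actual"]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical evaluation records, invented purely for illustration.
records = (
    [{"gender": "male", "skin": "light", "predicted": 1, "actual": 1}] * 10
    + [{"gender": "male", "skin": "dark", "predicted": 1, "actual": 1}] * 9
    + [{"gender": "male", "skin": "dark", "predicted": 0, "actual": 1}] * 1
    + [{"gender": "female", "skin": "light", "predicted": 1, "actual": 1}] * 9
    + [{"gender": "female", "skin": "light", "predicted": 0, "actual": 1}] * 1
    + [{"gender": "female", "skin": "dark", "predicted": 1, "actual": 1}] * 6
    + [{"gender": "female", "skin": "dark", "predicted": 0, "actual": 1}] * 4
)

by_gender = error_rate_by_group(records, ["gender"])
by_gender_and_skin = error_rate_by_group(records, ["gender", "skin"])

print("by gender:", by_gender)
print("by gender and skin:", by_gender_and_skin)
```

In this made-up data the gender-only split reports a 25% error rate for women, which hides the fact that the error rate is 10% for light-skinned women and 40% for dark-skinned women. The four-way split makes that gap visible.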

23.4. Algorithmic Bias/Collusion Examples

Because of the benefits brought by big data and big analytics, businesses (and governments) today rely heavily on them. As technology advances and the cost of storing and analyzing data drops, companies are investing in “smart” and “self-learning” machines for faster prediction, planning, trade, and logistic support. Their interest is also supported by advances in artificial intelligence (AI). Naturally, these developments raise many challenging ethical and legal questions. As the relationship between man and machine deepens, questions about human control (or the lack of it) over machines and about accountability for machine activities have multiplied. Although these questions have long attracted interest, few would have envisioned the day when these advances (and the ethical and legal hurdles they raise) would become an antitrust issue and lead to consumer harm.

Price fixing is illegal, and in the United States executives found to be involved can be sent to prison [33]. So, it made the news in 2015 when the U.S. Department of Justice (DOJ) charged a price-fixing cartel involved in selling posters through the Amazon marketplace. According to the DOJ, David Topkins and his co-conspirator agreed to use a specific pricing algorithm that collected competitors’ pricing information for specific posters sold online and then applied the sellers’ pricing rules [34]. The goal of using the automated algorithm was to coordinate their respective price changes.

A skeptical reader might wonder: Can robo-sellers really raise prices? The simple answer is that they already have. In 2011, one could find a 20-year-old biology textbook on fruit flies offered on Amazon for the surprising price of $23 million [35]. This price was set by the interaction of two different sellers’ algorithms. The algorithm of the first seller automatically set the price of its book at 1.27059 times the price of the second seller’s book. The second seller’s algorithm, in turn, set the price of its book at 0.9983 times the price of the first book. Because the two equations x = 1.27059 × y and y = 0.9983 × x cannot both hold for positive prices (the product of the two multipliers, about 1.268, is greater than 1), the interaction became an upward spiral in which each algorithm’s price hike was answered by a price hike from the other. This spiral resulted in both books costing millions of dollars from April 8 to April 18, 2011 [35].

At first glance, the fruit-fly textbook example looks like an accident produced by the interacting rules rather than the result of conscious anti-competitive intent. By contrast, suspicion about Uber’s “price surge” algorithm, mentioned earlier, raises the question of whether it was designed to exploit the consumer. Events of this kind echo calls for “algorithmic neutrality” to prevent socially and economically harmful misuse. The textbook example strengthens our argument: whether or not their developers intend it, robo-sellers can automatically settle on pricing that eventually raises the price consumers pay. In this case, price fixing played the role of algorithmic bias.
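The upward spiral produced by the two repricing rules can be reproduced in a few lines. The sketch below uses the two multipliers quoted above from [35]; the starting price of 100.0, the update order, and the function names `simulate_price_spiral` and `rounds_until` are assumptions made for illustration, not details of the sellers' actual software.

```python
def simulate_price_spiral(start_price, rounds, k_first=1.27059, k_second=0.9983):
    """Alternate the two repricing rules and record both prices after each round."""
    second = start_price
    history = []
    for _ in range(rounds):
        first = k_first * second   # seller 1 reprices relative to seller 2
        second = k_second * first  # seller 2 reprices relative to seller 1
        history.append((first, second))
    return history


def rounds_until(start_price, threshold, k_first=1.27059, k_second=0.9983):
    """Count repricing rounds until seller 1's price exceeds the threshold."""
    if k_first * k_second <= 1.0:
        raise ValueError("prices only grow without bound when k_first * k_second > 1")
    second = start_price
    rounds = 0
    while True:
        rounds += 1
        first = k_first * second
        if first > threshold:
            return rounds
        second = k_second * first


# Hypothetical starting price; each full round multiplies both prices
# by k_first * k_second, which is roughly 1.268.
for round_no, (first, second) in enumerate(simulate_price_spiral(100.0, rounds=5), start=1):
    print(f"round {round_no}: seller 1 = {first:,.2f}, seller 2 = {second:,.2f}")

print("rounds to pass 1,000,000:", rounds_until(100.0, 1_000_000))
```

Each round compounds the previous one, so the growth is geometric: neither rule looks unreasonable in isolation, yet together they have no finite resting point.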

23.5. Algorithm Bias/Collusion Categories

Despite the considerable benefits of algorithms, a budding competition policy literature raises concerns that algorithmic bias may lead to consumer harm. One of the main theories of harm concerns the possibility that algorithmic decisions lead to collusive outcomes, in which consumers pay higher or unfair prices compared with a competitive market. The following subsections discuss these theories of harm in turn.

23.5.1. Messenger (The Classic Digital Cartel)

This scenario is the digital equivalent of the smoke-filled room agreement. Here, algorithms are used intentionally to monitor, implement, and police cartels. In this category, humans agree to collude, and algorithms execute the collusion, acting as a messenger or intermediate medium for it (Figure 23.4). An example of the “messenger” scenario is the Trod Ltd./GB eye Ltd. case of the UK Competition and Markets Authority (CMA), which, like the David Topkins case in the U.S., involved two parties that agreed on a classic horizontal price-fixing cartel using an algorithm. The two parties agreed to use automated repricing software to monitor and adjust their prices so that neither of them could undercut the other [37]. As there was a clear anti-competitive agreement between humans, with the algorithm acting as the messenger, the CMA was able to establish a violation of antitrust law.

Figure 23.4: Explicit coordination implemented or facilitated by algorithms [36].

From an economic perspective, companies use algorithms as a “messenger” for explicitly collusive agreements for several reasons:

(1) The two parties can easily detect and respond immediately to any deviations from the agreement.
(2) The probability of accidental deviation is reduced.
(3) An algorithm reduces agency slack.

From a practical perspective, users of algorithms deployed for competition (such as pricing algorithms) should be wary of sharing sensitive information about their algorithms, whether in public or with their competitors. Such information would allow others to draw conclusions about the algorithm’s behavior or about how prices are or will be calculated. In other words, the algorithm may function as a “messenger” for competitively sensitive information.
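The observation that an agreement leaves neither party able to undercut the other suggests one simple screening signal for an auditor: how often one seller's price is actually below the other's. The sketch below is a hypothetical illustration of that idea only; the function `undercut_rate`, the tolerance, and the price series are invented, and it is not a method described in this chapter or used by any enforcement agency.

```python
def undercut_rate(prices_a, prices_b, tolerance=0.01):
    """Share of periods in which the two sellers' prices differ by more than `tolerance`."""
    if len(prices_a) != len(prices_b) or not prices_a:
        raise ValueError("price series must be non-empty and of equal length")
    undercuts = sum(1 for a, b in zip(prices_a, prices_b) if abs(a - b) > tolerance)
    return undercuts / len(prices_a)


# Invented price histories for two pairs of sellers of the same product.
competitive_a = [10.0, 9.5, 9.5, 9.0, 9.2]
competitive_b = [10.0, 10.0, 9.3, 9.3, 8.9]
aligned_a = [10.0, 10.5, 10.5, 11.0, 11.0]
aligned_b = [10.0, 10.5, 10.5, 11.0, 11.0]

print("competitive pair:", undercut_rate(competitive_a, competitive_b))
print("aligned pair:", undercut_rate(aligned_a, aligned_b))
```

A rate near zero over a long period is only a prompt for closer inspection, since identical prices can also arise without any agreement between the sellers.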

23.5.2. Hub-and-Spoke

Unlike in the “messenger” scenario, here the algorithm is not used to carry out an explicit agreement; rather, it is the use of the same algorithm by competitors that creates bias (possibly inadvertently) by fixing outcomes or prices. As shown in Figure 23.5, if multiple competitors (such as firm A and firm B) use the same optimization algorithm (e.g., Algorithm X), they may react in a similar way to external events (e.g., changes in demand or cost in a market). On top of that, if the competitors are aware, or able to infer, that they are using the same optimization algorithm (Algorithm X), each firm can more easily interpret its competitors’ market behavior, which eventually leads to implicit collusion as well as a biased environment.

(a) common pricing algorithms

(b) common intermediary

Figure 23.5: “Tacit” coordination [36].

On a related spectrum of algorithmic bias, individual firms might find themselves facing cartel allegations without having intended to take part in a cartel. In this scenario, different individuals (the spokes) use the optimization algorithm of the same third-party service provider (the hub) to determine outcomes and their reactions to market changes (shown in Figure 23.5). The recent Eturas case is an example of a hub-and-spoke agreement in the online world. In the Eturas case, the administrator of a Lithuanian online travel booking system sent an electronic notice to all its travel agents about a new technical restriction that capped the discount rate. According to the Court of Justice of the EU, travel agents who were aware of that notice and acted accordingly were presumed to have taken part in an online cartel agreement (see Figure 23.6). Thus, when firms independently sign up to use a platform’s algorithm, knowing that other competitors are using the same algorithm and that the algorithm fixes prices at a certain level, they can be held to have engaged in classic “hub-and-spoke” behavior, which eventually creates horizontal collusion.

Figure 23.6: Hub-and-spoke scenario of Eturas online booking platform with two agents [38].

Another example of the “hub-and-spoke” scenario is the use of advanced analytics software from the Danish company a2i Systems to determine petrol prices (see Figure 23.7), both by about 700 petrol stations (25% of the Danish retail fuel market) and in Rotterdam, the Netherlands. Although in this scenario none of the parties working with a2i was deemed to be involved in an anti-competitive agreement, the group using the algorithm achieved a margin about 5% (millions of euros) higher than normal, as reported by the Wall Street Journal [39]. When multiple players use the same third-party algorithm or analytics, data points, and values, the likelihood of tacit alignment increases. Although a2i claims that its advanced analytics merely helped companies to avoid price wars or eliminate them altogether, and although there was no intention to collude, the unilateral use of a decision-making algorithm softened competition.
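The claim that competitors using the same algorithm react alike to external events can be made concrete with a toy pricing rule. In the sketch below, `algorithm_x` is a hypothetical stand-in for the shared third-party algorithm, and the markup, sensitivity, costs, and demand values are invented; real pricing software is far more complex.

```python
def algorithm_x(unit_cost, demand_index, base_markup=0.20, demand_sensitivity=0.50):
    """Hypothetical third-party pricing rule: cost plus a demand-dependent markup."""
    return unit_cost * (1.0 + base_markup + demand_sensitivity * (demand_index - 1.0))


def relative_change(price_before, price_after):
    """Fractional price change, e.g. 0.10 for a 10% increase."""
    return price_after / price_before - 1.0


# Two firms with different (invented) unit costs face the same demand shock.
demand_before, demand_after = 1.0, 1.3
firm_costs = {"firm_a": 1.10, "firm_b": 1.25}

changes = {
    firm: relative_change(
        algorithm_x(cost, demand_before), algorithm_x(cost, demand_after)
    )
    for firm, cost in firm_costs.items()
}

for firm, change in changes.items():
    print(f"{firm}: price change after demand shock = {change:.1%}")
```

Under these assumed parameters both firms raise their prices by the same percentage even though their costs differ and they never communicate, because the shared rule maps the same market signal to the same proportional response.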


An Empirical Study of Algorithmic Bias


Figure 23.7: An example of gas stations changing prices using one third-party algorithm [39].

23.5.3. Tacit Collusion on Steroids (Predictable Agent)

This category mainly points out why computational speed along with market transparency has become a curse. Assume a market where all individuals or firms adopt their own optimization algorithm, access their rival’s real-time algorithmic output (e.g., pricing information), and adjust their own algorithm within seconds


or even in real time. This scenario eventually creates a perfect breeding ground for tacit collusion. On the one hand, a market economy requires transparency for competition to exist among competitors; on the other hand, transparency combined with computational speed works against it. If a firm or individual increases its price (e.g., the outcome of an optimization algorithm), its rival’s system will respond in real time, which diminishes the risk of enough customers moving to other sellers of that homogeneous product. The same applies when a firm decreases its price, so ultimately there is no competitive gain or incentive in offering a discount. Eventually, all competitors in the market reach a “supra-competitive” equilibrium (i.e., a price higher than the price that would exist in a competitive environment). On top of that, monitoring a competitor’s output/pricing and reacting to it (i.e., conscious parallelism) is not unlawful under existing competition law, because there is no explicit agreement or any trace of one (see Figures 23.8 and 23.9). Conscious parallelism can be facilitated and stabilized to the extent that (i) the rival’s output is predictable, and (ii) through repeated action, competitors can decode each other’s algorithmic output [40]. A relevant example is what happened to the textbook The Making of a Fly on the Amazon marketplace in 2011 [35]. Conscious parallelism (one algorithm reacting to the output of another and vice versa) between two different sellers led the price of this textbook to rise from $106.23

Figure 23.8: Use of pricing algorithms leading to tacit collusion [38].


Figure 23.9: Tacit coordination without any agreement between firms, with algorithms responding quickly to market changes [36].

to $23 million. The episode appears to reflect a lack of “sanity checks” within the algorithms rather than any anti-competitive intent. Nevertheless, it demonstrates how algorithmic decisions can lead to unintended outcomes when there is no human intervention.
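The feedback loop behind this kind of runaway price can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reconstruction of the sellers' actual code: the function name `simulate_repricers`, the multipliers 0.998 and 1.27, and the 500-dollar cap are all hypothetical. Seller A prices slightly below seller B, seller B prices at a markup over seller A, and neither rule has a ceiling unless a sanity check is supplied.

```python
def simulate_repricers(start_price, undercut=0.998, markup=1.27,
                       rounds=60, price_cap=None):
    """Two rule-based re-pricers reacting to each other (hypothetical rules)."""
    price_a = price_b = start_price
    history = [(price_a, price_b)]
    for _ in range(rounds):
        # Seller A: stay just below the rival's current price.
        price_a = undercut * price_b
        # Seller B: price at a fixed markup over the rival's price.
        price_b = markup * price_a
        # Optional "sanity check": never exceed a human-chosen ceiling.
        if price_cap is not None:
            price_a = min(price_a, price_cap)
            price_b = min(price_b, price_cap)
        history.append((price_a, price_b))
    return history


unchecked = simulate_repricers(106.23)                 # no safeguard
capped = simulate_repricers(106.23, price_cap=500.0)   # with safeguard
```

With these assumed multipliers, each round multiplies seller B's price by about 1.27, so a start price of $106.23 passes $23 million after roughly fifty rounds, while the capped run never leaves the ceiling. Neither rule mentions the rival's intent; the escalation comes purely from each algorithm reacting to the other's output.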

23.5.4. Artificial Intelligence and Digital Eye

This category discusses the situation where the exchange of information between algorithms is not part of a human plan, but the developers/programmers have (unintentionally) overlooked the safeguards required to prevent information exchange. The important difference from the previous category (tacit collusion on steroids) is that this algorithm is not explicitly designed to collude tacitly, but does so through its advanced self-learning capabilities. The similarity with the previous category is that, like the “conscious parallelism” model, this model is hard to detect under existing competition law enforcement. These algorithms not only maintain coordination but also generate it by themselves. This category matches self-learning AI models such as AlphaZero, which has been used to beat a human chess master. The model follows the “Win-Continue Lose-Reverse” rule and adapts itself to the environment. With revenue as its goal, the model makes an incremental change to the outcome (i.e., the price). If the change increases revenue, it continues in that direction; if not, it reverses direction. By making small changes, the model learns the market demand and environment while requiring very limited computational resources and data. One real-life example may be Amazon’s machine learning repricing algorithm [36]. Information collected by the CMA [36] explains that the simple algorithmic re-pricer takes the seller’s past pricing data, the price data of competing firms, and the stock levels of rivals to determine the optimum price. This algorithm also considers rivals’ publicly available pricing information and customer feedback. Whereas their simple re-pricers often offer the lowest price among competitors, their


Figure 23.10: Reinforcement/automated learning between intelligent machines [38].

machine learning re-pricer optimizes output toward a specific goal, such as “meeting a sales target” or taking a specific share of “buy box” sales [36].
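The “Win-Continue Lose-Reverse” rule described in this section can be sketched as follows. This is a minimal illustration, assuming a hypothetical revenue curve `revenue(p) = p * (10 - p)` that the pricing rule can only query, not inspect; the function name `wclr_price`, the start price, and the step size are likewise illustrative choices rather than anything taken from a real re-pricer.

```python
def revenue(price):
    """Hypothetical market response: demand falls linearly with price."""
    return price * max(0.0, 10.0 - price)


def wclr_price(revenue_fn, price=1.0, step=0.05, rounds=200):
    """Win-Continue Lose-Reverse: keep moving while revenue improves."""
    direction = 1                      # +1 raises the price, -1 lowers it
    last_revenue = revenue_fn(price)
    for _ in range(rounds):
        price += direction * step      # small incremental change
        current = revenue_fn(price)
        if current < last_revenue:     # "lose": reverse direction
            direction = -direction
        last_revenue = current         # "win": direction is kept
    return price


final_price = wclr_price(revenue)
```

Under this assumed revenue curve the rule climbs from 1.0 toward the revenue-maximizing price of 5.0 and then oscillates within one step of it. It needs only the last observed revenue, which reflects the point above that such a model requires very little data or computation.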

23.6. Software Agents for Collusion

Calvano et al. [41] provided an interesting experimental analysis in which two software agents (Q-learning pricing algorithms) systematically learn to collude by themselves. The authors showed that although the collusion is partial, it is enforced by punishment if there is any deviation. The two learning algorithms use trial-and-error strategies to learn by themselves, with no prior knowledge of the environment. They arrive at collusion leaving no trace, without communicating with each other, and without any instruction to collude. The research makes it apparent that software agents, in the absence of human intention, can create revenue for their owner by violating competition law. Figure 23.11 shows how two agents punish each other after a deviation at period 1. When agent 1 (red) lowers its price from 1.8 to 1.6 at period 1, its profit jumps to 1.2 (monopoly profit) and agent 2 moves to the Nash price. After period 1, agent 2 responds by lowering its price close to that of agent 1, which penalizes agent 1’s profit. At period 2, both agents reach a temporary equilibrium and each algorithm changes its behavior incrementally. From periods 2 to 9, both agents change their prices in small increments, and at period 10 they return to the pre-shock situation. Figure 23.11 describes


Figure 23.11: Price impulse response, prices (top) and profit (bottom), of two agents after deviation [41].

how software agents behave even when there is a deviation in the market environment and eventually maintain supra-competitive profit and price levels.
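The general setup of two independent Q-learning pricing agents can be sketched as below. This is a simplified illustration and not the experimental design of [41]: the price grid, the linear demand in `profit`, the learning rate, the discount factor, and the exploration schedule are all assumptions made for the sketch. Each agent observes only the pair of prices from the previous period (the state) and its own profit, and never communicates with the other.

```python
import math
import random

PRICES = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]   # hypothetical discrete price grid
COST = 1.0                                 # hypothetical marginal cost


def profit(own_price, rival_price):
    """Toy linear demand (an assumption, not the demand model of [41])."""
    demand = max(0.0, 2.0 - 2.0 * own_price + rival_price)
    return (own_price - COST) * demand


def train(episodes=20000, alpha=0.15, gamma=0.95, decay=1e-3, seed=0):
    """Train two independent Q-learners; returns Q-tables and last state."""
    rng = random.Random(seed)
    n = len(PRICES)
    # q[agent][state][action]; state = indices of last period's two prices.
    q = [{(i, j): [0.0] * n for i in range(n) for j in range(n)}
         for _ in range(2)]
    state = (rng.randrange(n), rng.randrange(n))
    for t in range(episodes):
        epsilon = math.exp(-decay * t)          # exploration fades over time
        actions = []
        for k in range(2):
            if rng.random() < epsilon:          # trial and error
                actions.append(rng.randrange(n))
            else:                               # exploit learned values
                row = q[k][state]
                actions.append(row.index(max(row)))
        next_state = (actions[0], actions[1])
        for k in range(2):
            reward = profit(PRICES[actions[k]], PRICES[actions[1 - k]])
            target = reward + gamma * max(q[k][next_state])
            old = q[k][state][actions[k]]
            q[k][state][actions[k]] = old + alpha * (target - old)
        state = next_state
    return q, state


q_tables, last_state = train(episodes=2000)
final_prices = [PRICES[i] for i in last_state]
```

Whether the two agents settle above the competitive price depends on the parameters and the number of periods, so the sketch makes no claim about the outcome; it only shows that nothing in the code instructs the agents to coordinate.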

23.7. Model Explanation to Detect Untrustworthy Algorithm

By now, it is clear that algorithms can be biased in different forms. Bias can be intentional, unintentional, or a mix of both. Some algorithmic biases are detectable, and some are implicit. Some biases generate short-term benefits and harm, whereas others cause long-term consumer harm. With the pace of technological advancement, consumer welfare is being compromised. To ensure consumer welfare and trust, and to reduce anti-competitive intent, the idea of an algorithm as a “black box” needs to change. Human responsibility for algorithmic action, as well as the human–machine relationship, needs to be spelled out. Moreover, effective and understandable human interaction is important for assessing trust in machine learning systems. We do not want outcomes such as Uber’s “price surge,” where consumers are kept in the dark when explanations are expected. We want to trust systems in which every action can be explained with reasons and facts. Ribeiro et al. [42] proposed several models to explain a “black-box” algorithm as a white box. Their approach focuses mainly on (i) trusting an algorithm’s


prediction, (ii) trusting the algorithm, and (iii) improving the model. In this chapter, we focus only on the local interpretable model-agnostic explanation (LIME) model to explain our algorithm. In a naïve explanation, LIME works as follows:

(A) Permute the data (create a fake dataset) based on each observation.
(B) Calculate the distance (similarity score) between the permuted data and the original observations.
(C) Make predictions on the new dataset using the “black-box” machine-learned model.
(D) Choose the m features from the permuted data that best describe the black-box algorithm.
(E) Fit a simple model to the permuted data with the m features, using the similarity scores as weights.
(F) Use the feature weights from the simple model to explain the local behavior of the complex “black-box” model.

Examples shown in Figure 23.12 explain how explainability helps to trust a prediction as well as a model. For the sake of theoretical explanation, we need to define certain concepts and variables. Assume that $x \in \mathbb{R}^d$ is the original representation of an instance (predicted output) being explained, and that $x' \in \{0, 1\}^{d'}$ denotes a binary vector for its interpretable representation. Let $G$ be a class of potentially interpretable models, such as decision trees, and let $g \in G$ be an explanation model that acts over the presence or absence of interpretable components. Because not every $g \in G$ is simple enough to be interpretable, let $\Omega(g)$ be a measure of the complexity (as opposed to the interpretability) of $g$, for example, the depth of a decision tree. Now, let the model being explained be denoted by $f$, where $f(x)$ denotes the probability that the instance $x$ belongs to a certain class, and let $\pi_x(z)$ be a proximity measure between a perturbed instance $z$ and $x$. We also define $L(f, g, \pi_x)$ as a measure of how unfaithful $g$ is in approximating $f$ in the locality defined by $\pi_x$. To ensure the model’s interpretability and local fidelity, we wish

Figure 23.12: The LIME model explains the symptoms in the patient’s history that led to the prediction of “flu”. The model explains that sneezing and headache contribute to the prediction, while the absence of fatigue is evidence against it. This explanation helps the doctor to trust the model’s prediction [42].


Figure 23.13: Explaining a prediction of classifiers to determine the class “Christianity” or “Atheism” based on text in documents. The bar chart shows the importance of the words that determine the prediction; these words are highlighted in the document. Blue indicates “Atheism” and orange indicates “Christianity” [42].

to minimize $L(f, g, \pi_x)$ while keeping the complexity measure $\Omega(g)$ low enough to be interpretable:

$$\xi(x) = \operatorname*{argmin}_{g \in G} \; L(f, g, \pi_x) + \Omega(g) \tag{23.1}$$

From the theoretical explanation and the examples mentioned above, we wish to detect and reduce algorithmic bias by explaining the prediction (outcome) and improving the “black-box” algorithm from the programmers’ point of view.
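Steps (A) to (F) can be sketched for a small binary instance as follows. This is a simplified stand-in written from the description above, not the LIME library or its API: the names `lime_explain`, `solve`, and `black_box`, the exponential kernel, and the kernel width are assumptions, and step (D), feature selection, is skipped because the toy instance has only three features. The local loss $L(f, g, \pi_x)$ is taken to be proximity-weighted squared error, and $g$ is a linear model.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a * beta = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def lime_explain(f, x, num_samples=500, kernel_width=0.75, seed=0):
    """Return one local weight per feature of the binary instance x."""
    rng = random.Random(seed)
    d = len(x)
    rows, targets, weights = [], [], []
    for _ in range(num_samples):
        # (A) perturb: randomly switch off features present in x
        z = [xi if rng.random() < 0.5 else 0 for xi in x]
        # (B) proximity pi_x(z): exponential kernel on fraction changed
        dist = sum(u != v for u, v in zip(x, z)) / d
        weights.append(math.exp(-(dist ** 2) / kernel_width ** 2))
        # (C) query the black box on the perturbed sample
        targets.append(f(z))
        rows.append([1.0] + [float(v) for v in z])   # intercept + features
    # (E) weighted least squares: (X^T W X) beta = X^T W y
    n = d + 1
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(n)] for i in range(n)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, targets))
            for i in range(n)]
    for i in range(n):
        xtwx[i][i] += 1e-6           # tiny ridge term keeps the system solvable
    beta = solve(xtwx, xtwy)
    return beta[1:]                  # (F) feature weights, intercept dropped


def black_box(z):
    """Hypothetical 'flu' scorer over [sneeze, headache, no_fatigue]."""
    return 0.2 + 0.5 * z[0] + 0.3 * z[1] - 0.4 * z[2]


explanation = lime_explain(black_box, [1, 1, 1])
```

Because the hypothetical black box here is itself linear, the recovered weights are close to 0.5, 0.3, and -0.4: the first two features support the prediction and the third counts against it, which mirrors the reading of Figure 23.12.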

23.8. Case Study 1: Detecting Bias by Explaining the Black Box

A simple search and optimization problem is used in this case study to replicate the idea of a biased search or a biased optimized result. We choose the popular standard Knuth Miles dataset (i.e., 128 representative U.S. cities with positions and populations) from the Stanford GraphBase [43]. Each vertex in the graph is a city and each edge is an undirected path between two cities. We wanted to replicate an optimal route search algorithm (e.g., the Google Maps algorithm) for the U.S. highway road network, given the starting and ending points. For the sake of simplicity, in our experiment we filtered out edges between cities that are longer than 700 miles. The goal of the algorithm is to provide the optimal (e.g., least distance traveled) route to reach a destination. A sample search from San Jose, CA, to Tampa, FL, is provided in Figure 23.14. This search result provides a route with a distance of 3049 miles. A similar search on standard Google Maps provides a route with a distance of 2797 miles. From a consumer’s point of view, a consumer needs to be able to trust the algorithm’s outcome as well as the algorithm we are proposing. To prove that the

Figure 23.14: Search result of an optimum route [San Jose, CA => San Bernardino, CA => Tucson, AZ => Roswell, NM => Waco, TX => Vicksburg, MS => Valdosta, GA => Tampa, FL].


Figure 23.15: This figure shows how a biased and an unbiased search result may vary. Unbiased search result: [San Jose, CA => San Bernardino, CA => Tucson, AZ => Roswell, NM => Waco, TX => Vicksburg, MS => Valdosta, GA => Tampa, FL]; Biased search result: [San Jose, CA => San Bernardino, CA => Tucson, AZ => Roswell, NM => Wichita, KS => Saint Louis, MO => Tuscaloosa, AL => Tampa, FL].

LIME model works in this case, let us assume a biased search in which the algorithm receives an incentive when it routes past certain gas stations on the map. In that case, the algorithm may provide a route (shown in Figure 23.15 as the orange partial route) that turns out to be sub-optimal. This kind of prediction or outcome is harmful for the consumer but a gain for the algorithm as well as for its owner. The search result (listed in Appendix I) shows a route from the starting node (“San Jose, CA”) to the destination node (“Tampa, FL”), where each tuple indicates a neighboring node’s name and distance. To explain the algorithm’s output using the LIME model, let our optimization algorithm $g$ be explainable. The complexity measure $\Omega(g)$ for the explainable model $g$ is the choice of the best neighboring node at each step, starting from the start node, to reach the optimum route to the destination. In our case, $f(x)$ denotes the probability of $x$ (the optimal route) belonging to a certain class (i.e., the optimum route class). We further use $\pi_x(z)$ as a proximity measure between an instance $z$ (any node that is not part of the optimum result) and $x$ (the nodes that are part of the optimum result) to define a locality around $x$. Finally, let $L(f, g, \pi_x)$ be a measure of how unfaithful $g$ is in approximating $f$ in the locality defined by $\pi_x$. So, to explain the model, we need to minimize $L(f, g, \pi_x)$ and $\Omega(g)$. In our case study, a user might ask why a certain node chose a certain neighboring node instead of a random neighboring node. For example, from the starting node (San Jose, CA) the optimal result follows the next best node, “San Bernardino, CA,” at 411 miles instead of choosing “San Francisco, CA,” at 47 miles. An explanation for this


result can be that, among all probable data points (nodes) for the node “San Jose, CA,” the probability of choosing “San Bernardino, CA,” is high given that our destination of interest lies to the south-east. Even though the “San Francisco, CA” node is closer, it lies in the opposite direction to our area of interest. A similar explanation holds for every chosen optimal node in our search result. Now consider the biased result: the “Roswell, NM” node chooses the “Wichita, KS” node instead of “Waco, TX”. Even though this result is optimized by the “black-box” algorithm, an explanation of it may give the consumer an intuition that the search result is biased. For example, in our first explanation, we chose the optimum neighboring node toward the south-east because our destination is in the south-east, so selecting a node in that direction makes sense. The “Wichita, KS” node, however, lies more to the north than the “Waco, TX” node with respect to “Roswell, NM.” If the algorithm’s output contains a couple of such choices, the consumer can easily recognize the result as biased. In terms of the theoretical explanation, the probability of the “Roswell, NM” node choosing “Wichita, KS” is lower than that of choosing “Waco, TX,” given that the destination is “Tampa, FL.”
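How an incentive can pull a route search away from the shortest path can be sketched on a tiny abstract graph. Everything here is hypothetical: the four-node graph, its distances, the function name `find_route`, and the idea of modelling the incentive as a discount on the cost of entering a sponsored node. The search is plain Dijkstra, which is not necessarily the algorithm used in this case study.

```python
import heapq


def find_route(graph, start, goal, incentive=None):
    """Dijkstra on a cost that may be discounted for sponsored nodes.

    Returns the chosen path and its true length in miles.
    """
    incentive = incentive or {}
    # Queue entries: (cost seen by the algorithm, true miles, node, path)
    queue = [(0, 0, start, [start])]
    settled = set()
    while queue:
        cost, miles, node, path = heapq.heappop(queue)
        if node == goal:
            return path, miles
        if node in settled:
            continue
        settled.add(node)
        for neighbor, dist in graph.get(node, {}).items():
            if neighbor not in settled:
                # The bias: entering a sponsored node looks cheaper than it is.
                step = max(0, dist - incentive.get(neighbor, 0))
                heapq.heappush(
                    queue,
                    (cost + step, miles + dist, neighbor, path + [neighbor]),
                )
    return None, float("inf")


# Hypothetical toy network: S -> A -> T is shorter than S -> B -> T.
toy_graph = {"S": {"A": 4, "B": 5}, "A": {"T": 4}, "B": {"T": 6}}

fair_path, fair_miles = find_route(toy_graph, "S", "T")
biased_path, biased_miles = find_route(toy_graph, "S", "T", incentive={"B": 4})
```

In this toy network the unbiased search returns S, A, T at 8 miles, while the incentivized search returns S, B, T at 11 miles. Both outputs look like ordinary routes, so a consumer would need an explanation of each hop, or a benchmark, to notice the difference.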

23.9. Case Study 2: Detecting Bias Based on Node Coverage Rate

To demonstrate the implicit bias that can exist in any algorithm (including AI-driven algorithms), we consider a coverage estimation problem. The idea is to show that with some parameter changes, the search (of $N$ nodes) can be reduced to $n$ nodes (where $n < N$), producing sub-optimal results. Such a reduction in search space, while beneficial from the business point of view, may be detrimental to customers.

Hypothetical Scenario: To illustrate the concept mentioned above, we chose a popular optimization problem called the “0/1-knapsack” problem. The problem is defined below. Given a set of $n$ items numbered from 1 to $n$, each with a weight $w_i$ and a value $v_i$, along with a maximum weight capacity $W$, the objective is to

$$\text{Maximize} \quad \sum_{i=1}^{n} v_i x_i, \tag{23.2}$$

$$\text{Subject to} \quad \sum_{i=1}^{n} w_i x_i \le W, \tag{23.3}$$

where $v_i$ is the value of item $i$, $w_i$ is the weight of item $i$, and $x_i \in \{0, 1\}$. Based on the above knapsack problem, we emulated an illustrative test case to demonstrate algorithmic bias. Assume that there are three objects ($n = 3$) with associated values = [4, 2, 10] and weights = [8, 1, 4], where $v_i$ and $w_i$ represent the value and the weight of the $i$th object, respectively. Moreover, a bag


having capacity $W = 10$ needs to be filled with the objects, keeping the total object value as high as possible within the weight capacity. The ideal solution without any internal bias gives a total gain of 12 by choosing items 2 and 3 (i.e., 2 + 10 = 12). The same problem yields a different result when we forcibly include the first item (i.e., introduce a bias) and then add whichever items come after it while maintaining the constraint. That biased algorithm gives a total gain of 6 by choosing items 1 and 2 (i.e., 4 + 2 = 6). In both situations, the result stays within the total capacity $W = 10$, but the outputs differ. This shows that an ideal algorithm can be biased as a programmer intends, and by this process some entity or object can gain an unfair advantage by giving an incentive to the service provider. From the perspective of the user/service receiver, on the other hand, the result appears to be the ideal solution that can be achieved. This toy problem does not reveal any difference in node coverage between the biased and unbiased algorithms, only a different gain. To extend the node coverage estimation problem, we applied similar problems to a larger dataset. In that case, instead of a deterministic algorithm, four metaheuristic algorithms were used to replicate the idea of input data, a “black-box” algorithm, and a predictive output (i.e., the optimal and non-optimal classes).

Node coverage for biased and unbiased algorithms: Our hypothesis is that a biased algorithm will always have less node coverage than an unbiased one. To test our hypothesis, we chose four different search and optimization metaheuristic algorithms: the genetic algorithm, hill climbing with random walk, simulated annealing, and tabu search. We also generated a dataset with random item values and weights and a random threshold. This dataset was then used for the four algorithms (see Table 23.1).

1. Genetic algorithm: It is a metaheuristic algorithm that is generally used for search and optimization problems. This algorithm is biologically inspired by natural selection, crossover, and mutation. Some popular applications of the genetic algorithm in the field of artificial intelligence are given in [44, 46, 47].
2. Hill climbing with random walk: This is an iterative algorithm that starts with an arbitrary solution to a problem. At each iteration, if there is a better solution, an incremental change is made to the solution, until no further improvement can be found.
3. Simulated annealing: It is also a metaheuristic algorithm used in search and optimization problems. This algorithm is used in a discrete search space. It is a good fit for problem spaces where finding an approximate global optimum is more important than finding a precise local optimum.
4. Tabu search: It is a special kind of metaheuristic algorithm used for search and optimization. This algorithm uses memory, which keeps track of visited solutions.


Table 23.1: Sample results of the four algorithms run on a 100-node dataset

| Techniques | Nodes | Iter. | Avg Coverage | Optimal Coverage | Optimal Value/Gain |
|---|---|---|---|---|---|
| Genetic algorithm | 100 | 200 | 57% | 59% | 15278 |
| Hill climbing with random walk | 100 | 200 | 55% | 57% | 17802 |
| Simulated annealing | 100 | 200 | 59% | 60% | 15485 |
| Tabu search | 100 | 200 | 47% | 54% | 17607 |

Figure 23.16: Regression plot of the optimal node coverage of an unbiased algorithm with an increasing number of nodes. Every datapoint on the y-axis represents values identified by each algorithm.

In tabu search, a potential solution that has been visited within a short period of time is marked as tabu.

The parameter changes we were interested in are:

(1) Changing the number of nodes or data points for each of the four algorithms, for example, using different datasets with 50, 100, or 200 nodes.
(2) Applying the same dataset to different algorithms, for example, running four different algorithms on a dataset of 100 nodes.

Table 23.1 shows a sample result achieved with 100 nodes using the four algorithms separately. The same approach was then applied to a biased situation (following the concept of the toy scenario) for each algorithm. From our results, we found that it is hard to determine whether an algorithm is biased by simply looking at its node coverage across different search spaces (see Figures 23.16 and 23.17), but a comparison between the two results shows an approximately constant difference across different search spaces (see Figure 23.18). From this case study, we get the intuition that when the “black-box” algorithm can be tested against a benchmark algorithm, comparing the


Figure 23.17: Regression plot of the optimal node coverage of a biased algorithm against an increasing number of nodes. Every datapoint on the y-axis represents values identified by each algorithm.

Figure 23.18: Differences in the coverage rate for biased and unbiased algorithms. The y-axis represents the absolute difference in node coverage rate, with an increasing number of nodes on the x-axis.

results of these two algorithms across different search spaces may reveal the bias of the “black-box” algorithm or may simply raise the question of how the result should be explained.
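The toy knapsack instance from this case study (values [4, 2, 10], weights [8, 1, 4], capacity 10) can be checked directly. The sketch below is one possible reading of the two procedures, under stated assumptions: `unbiased_best` searches all item subsets exhaustively, and `biased_fill` takes the first item and then every later item that still fits, which is how the biased variant is described in the text. Both function names are illustrative, and items are indexed from 0 in the code.

```python
from itertools import combinations

VALUES = [4, 2, 10]
WEIGHTS = [8, 1, 4]
CAPACITY = 10


def unbiased_best(values, weights, capacity):
    """Exhaustive search over every subset that fits in the bag."""
    best_gain, best_items = 0, ()
    for k in range(len(values) + 1):
        for subset in combinations(range(len(values)), k):
            if sum(weights[i] for i in subset) > capacity:
                continue                      # violates constraint (23.3)
            gain = sum(values[i] for i in subset)
            if gain > best_gain:
                best_gain, best_items = gain, subset
    return best_gain, best_items


def biased_fill(values, weights, capacity):
    """Biased variant: take items in listed order whenever they still fit."""
    chosen, load = [], 0
    for i in range(len(values)):
        if load + weights[i] <= capacity:
            chosen.append(i)
            load += weights[i]
    return sum(values[i] for i in chosen), tuple(chosen)


fair_gain, fair_items = unbiased_best(VALUES, WEIGHTS, CAPACITY)
biased_gain, biased_items = biased_fill(VALUES, WEIGHTS, CAPACITY)
```

The exhaustive search selects the second and third items for a gain of 12, while the biased fill selects the first and second items for a gain of 6. Both answers satisfy the capacity constraint, which is why the bias is invisible unless the result is compared against a benchmark.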

23.10. Engineers’ and Researchers’ Role in Mitigating Bias

Bias in AI-based systems can arise from the data or from the algorithms and their parameter tuning. In particular, bias can be injected when framing the problem, collecting and preparing the data, and interpreting the data and manipulating decisions. While there is no defined solution for measuring an algorithm’s fairness, some approaches can help detect and mitigate hidden algorithmic biases regardless of the data quality. According to the chief trust officer at a Vienna-based synthetic-data startup [47], tech companies need to be accountable for AI systems’ decisions. It is also true that


AI provides an excellent opportunity for data scientists, businesses, and engineers to explore new application areas. The officer invites engineers to work hard to mitigate bias in datasets, in algorithms, and eventually in society. This opens a new paradigm for engineers in how they should design their AI applications. It should be noted that users of an AI system may unknowingly introduce bias through how, where, and when they use it. Hence, developers need to perform extra validation of their system design to mitigate such user bias. There is no one-size-fits-all system for mitigating bias in AI, and bias can be introduced at any point while the algorithm is learning. However, according to ongoing research, the most promising and logical starting point is when the model parses data. Researchers across different organizations are working to develop systems and algorithms to mitigate bias in AI. Toolkits such as Aequitas measure bias in datasets; Themis-ml reduces bias in data by using a bias mitigation algorithm. Researchers at IBM developed AI Fairness 360, a complete toolkit with a standard interface for working professionals, to identify and mitigate bias in datasets and AI models. Some researchers also believe that, because of the uncertainty of bias mitigation algorithms, watchdog systems that audit AI algorithms should be put in place before the algorithms are deployed in the real world.
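The kind of dataset-level measurement such toolkits report can be illustrated with two widely used group-fairness quantities, statistical parity difference and disparate impact. The sketch below is written from the standard definitions of those quantities and deliberately does not use or imitate the API of Aequitas, Themis-ml, or AI Fairness 360; the function names, the group labels "a" and "b", and the eight-row dataset are all hypothetical.

```python
def selection_rate(outcomes, groups, group):
    """Share of favorable outcomes (1) among members of one group."""
    selected = [o for o, g in zip(outcomes, groups) if g == group]
    return sum(selected) / len(selected)


def statistical_parity_difference(outcomes, groups, unprivileged, privileged):
    """Difference in favorable-outcome rates; 0 means parity."""
    return (selection_rate(outcomes, groups, unprivileged)
            - selection_rate(outcomes, groups, privileged))


def disparate_impact(outcomes, groups, unprivileged, privileged):
    """Ratio of favorable-outcome rates; 1 means parity."""
    return (selection_rate(outcomes, groups, unprivileged)
            / selection_rate(outcomes, groups, privileged))


# Hypothetical decisions (1 = favorable) for two groups of four people each.
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

spd = statistical_parity_difference(outcomes, groups, "b", "a")
di = disparate_impact(outcomes, groups, "b", "a")
```

In this made-up sample, group "a" receives a favorable outcome three times out of four and group "b" once out of four, so the parity difference is -0.5 and the impact ratio is one third. A watchdog or audit process of the kind mentioned above could track such figures before and after deployment.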

23.11. Conclusion

This empirical study highlighted the possible existence of algorithmic bias in search, optimization, and coverage problems with some real-world applications. From the examples and models reported in the literature, it is evident that algorithms, especially self-learning algorithms, can be made biased intentionally or unintentionally. In particular, following catastrophic incidents caused by “black-box” artificial intelligence algorithms, law enforcement is now holding developers responsible for their intelligent machines’ decisions and actions. To have trust in intelligent systems or algorithms, cross-validation measures for detecting bias of different kinds are necessary. As a form of sanity check and preventive measure, software developers need to disclose algorithmic details, for example through explainable AI, as a best practice. Considering the unparalleled benefits of AI and its possible misuse or abuse, regulations are essential for AI developers to take responsibility for their products [47]. Accordingly, guidelines and principles such as the Algorithmic Accountability Act [48] have been introduced for companies to audit their AI/ML systems for bias and discrimination and to regulate the use of AI systems that are prone to misuse or are bad for humanity. Mariya Gabriel, Europe’s top official on the digital economy, said to companies using AI, “People need to be


informed when they are in contact with an algorithm and not another human being. Any decision made by an algorithm must be verifiable and explained.”

23.A.1. Appendix I

A search result from the starting (“San Jose, CA”) node to the destination (“Tampa, FL”) node, where each tuple indicates the neighboring node name and distance, is as follows:

“San Jose, CA” => [(“San Francisco, CA,” 47), (“Salinas, CA,” 58), (“Stockton, CA,” 80), (“Santa Rosa, CA,” 102), (“Sacramento, CA,” 127), (“Red Bluff, CA,” 235), (“Reno, NV,” 266), (“Santa Barbara, CA,” 294), (“Weed, CA,” 340), (“Santa Ana, CA,” 395), (“San Bernardino, CA,” 411), (“San Diego, CA,” 485), (“Salem, OR,” 664)]

“San Bernardino, CA” => [(“Santa Ana, CA,” 54), (“San Diego, CA,” 130), (“Santa Barbara, CA,” 152), (“Salinas, CA,” 374), (“Stockton, CA,” 388), (“San Jose, CA,” 411), (“Sacramento, CA,” 435), (“San Francisco, CA,” 450), (“Tucson, AZ,” 460), (“Reno, NV,” 470), (“Santa Rosa, CA,” 505), (“Richfield, UT,” 534), (“Red Bluff, CA,” 566), (“Weed, CA,” 671), (“Salt Lake City, UT,” 681)]

“Tucson, AZ” => [(“San Diego, CA,” 423), (“San Bernardino, CA,” 460), (“Roswell, NM,” 469), (“Santa Ana, CA,” 504), (“Santa Fe, NM,” 523), (“Santa Barbara, CA,” 607), (“Richfield, UT,” 652)]

“Roswell, NM” => [(“Santa Fe, NM,” 196), (“San Angelo, TX,” 320), (“Trinidad, CO,” 323), (“Wichita Falls, TX,” 389), (“Salida, CO,” 423), (“Tucson, AZ,” 469), (“Sherman, TX,” 506), (“Waco, TX,” 515), (“San Antonio, TX,” 538), (“Seminole, OK,” 539), (“Wichita, KS,” 585), (“Tulsa, OK,” 596), (“Sterling, CO,” 601), (“Tyler, TX,” 604), (“Salina, KS,” 640), (“Victoria, TX,” 651), (“Texarkana, TX,” 666), (“Shreveport, LA,” 691)]

“Waco, TX” => [(“Tyler, TX,” 134), (“Sherman, TX,” 163), (“San Antonio, TX,” 181), (“Wichita Falls, TX,” 204), (“Victoria, TX,” 223), (“San Angelo, TX,” 228), (“Shreveport, LA,” 233), (“Texarkana, TX,” 270), (“Seminole, OK,” 290), (“Tulsa, OK,” 380), (“Vicksburg, MS,” 419), (“Wichita, KS,” 471), (“Roswell, NM,” 515), (“Springfield, MO,” 539), (“Salina, KS,” 558), (“Topeka, KS,” 612), (“Tupelo, MS,” 620), (“Sedalia, MO,” 640), (“Tuscaloosa, AL,” 658), (“Saint Joseph, MO,” 659), (“Trinidad, CO,” 667), (“Selma, AL,” 675), (“Santa Fe, NM,” 681)]

“Vicksburg, MS” => [(“Shreveport, LA,” 186), (“Tupelo, MS,” 230), (“Tuscaloosa, AL,” 239), (“Texarkana, TX,” 240), (“Selma, AL,” 256), (“Tyler, TX,” 285), (“Sherman, TX,” 400), (“Waco, TX,” 419), (“Springfield, MO,” 460), (“Seminole, OK,” 489), (“Tallahassee, FL,” 496), (“Wichita Falls, TX,” 511), (“Tulsa, OK,” 522), (“Victoria, TX,” 530), (“Saint Louis, MO,” 530), (“Valdosta, GA,” 542), (“Vincennes, IN,” 556), (“Sedalia, MO,” 575), (“San Antonio, TX,” 576), (“Swainsboro, GA,” 579), (“Waycross, GA,” 586), (“Terre Haute, IN,” 614), (“Springfield, IL,” 614), (“San Angelo, TX,” 639), (“Savannah, GA,” 669), (“Saint Joseph, MO,” 687), (“Saint Augustine, FL,” 690)]

“Valdosta, GA” => [(“Waycross, GA,” 63), (“Tallahassee, FL,” 82), (“Saint Augustine, FL,” 155), (“Swainsboro, GA,” 168), (“Savannah, GA,” 172), (“Tampa, FL,” 233), (“Sarasota, FL,” 286), (“Selma, AL,” 302), (“Sumter, SC,” 328), (“West Palm Beach, FL,” 386), (“Tuscaloosa, AL,” 387), (“Wilmington, NC,” 449), (“Winston-Salem, NC,” 476), (“Tupelo, MS,” 483), (“Vicksburg, MS,” 542), (“Rocky Mount, NC,” 552), (“Roanoke, VA,” 585), (“Williamson, WV,” 655), (“Richmond, VA,” 672), (“Staunton, VA,” 673)]
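The reported route length can be rechecked from the listing above. The sketch below copies only the seven legs of the unbiased route from the appendix; the names `ROUTE_LEGS`, `route_length`, and `is_connected` are illustrative, and the 2797-mile benchmark is the Google Maps figure quoted in Case Study 1.

```python
# (from, to, miles) for each leg of the unbiased route, as listed above.
ROUTE_LEGS = [
    ("San Jose, CA", "San Bernardino, CA", 411),
    ("San Bernardino, CA", "Tucson, AZ", 460),
    ("Tucson, AZ", "Roswell, NM", 469),
    ("Roswell, NM", "Waco, TX", 515),
    ("Waco, TX", "Vicksburg, MS", 419),
    ("Vicksburg, MS", "Valdosta, GA", 542),
    ("Valdosta, GA", "Tampa, FL", 233),
]

BENCHMARK_MILES = 2797   # Google Maps distance quoted in Case Study 1


def route_length(legs):
    """Total miles over all legs of a route."""
    return sum(miles for _, _, miles in legs)


def is_connected(legs):
    """Check that each leg starts where the previous one ended."""
    return all(legs[i][1] == legs[i + 1][0] for i in range(len(legs) - 1))


total_miles = route_length(ROUTE_LEGS)
excess_ratio = (total_miles - BENCHMARK_MILES) / BENCHMARK_MILES
```

The legs sum to 3049 miles, matching the distance reported for Figure 23.14, which is about 9% longer than the quoted benchmark. A gap of this kind against an independent benchmark is the sort of signal that can prompt a request for an explanation.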

References

[1] K. Ha, Why AI is a threat to democracy—and what we can do to stop it (2020). https://www.technologyreview.com/s/613010/why-ai-is-a-threat-to-democracy-and-what-we-can-do-to-stop-it/.
[2] D. Dasgupta, Z. Akhtar, and S. Sen, Machine learning in cybersecurity: a comprehensive survey, J. Def. Model. Simul. (2020). doi: 10.1177/1548512920951275. URL https://doi.org/10.1177/1548512920951275.
[3] A. Mukerjee, R. Biswas, K. Deb, and A. P. Mathur, Multi-objective evolutionary algorithms for the risk-return trade-off in bank loan management, Int. Trans. Oper. Res., 9, 583–597 (2002).
[4] A. Webb, A love story: how I gamed online dating to meet my match, Reviewing Data Driven Algorithmic Dating Sites (2013).
[5] M. Bogen and A. Rieke, Help wanted: an examination of hiring algorithms, equity, and bias. Technical report, Upturn, Washington, DC (2018).
[6] L. Cohen, Z. C. Lipton, and Y. Mansour, Efficient candidate screening under multiple tests and implications for fairness, CoRR, abs/1905.11361 (2019). URL http://arxiv.org/abs/1905.11361.
[7] S. Danziger, J. Levav, and L. Avnaim-Pesso, Extraneous factors in judicial decisions, Proc. Natl. Acad. Sci., 108, 6889–6892 (2011).
[8] A. O’Keeffe and M. McCarthy, The Routledge Handbook of Corpus Linguistics (2010).
[9] S. Sen, K. D. Gupta, and M. Manjurul Ahsan, Leveraging machine learning approach to setup software-defined network (SDN) controller rules during DDoS attack. In eds. M. S. Uddin and J. C. Bansal, Proceedings of International Joint Conference on Computational Intelligence, Springer, Singapore, pp. 49–60 (2020). ISBN 978-981-13-7564-4.
[10] N. Sadman, K. D. Gupta, A. Haque, S. Poudyal, and S. Sen, Detect review manipulation by leveraging reviewer historical stylometrics in amazon, yelp, facebook and google reviews. In Proceedings of the 2020 The 6th International Conference on E-Business and Applications, ICEBA 2020, New York, NY, USA, pp. 42–47 (2020). Association for Computing Machinery. ISBN 9781450377355. doi: 10.1145/3387263.3387272. URL https://doi.org/10.1145/3387263.3387272.
[11] M. M. Ahsan, K. D. Gupta, M. M. Islam, S. Sen, M. L. Rahman, and M. Shakhawat Hossain, Covid-19 symptoms detection based on nasnetmobile with explainable AI using various imaging modalities, Mach. Learn. Knowl. Extr., 2(4), 490–504 (2020). ISSN 2504-4990. doi: 10.3390/make2040027. URL http://dx.doi.org/10.3390/make2040027.
[12] N. Sadman, K. Datta Gupta, M. A. Haque, S. Sen, and S. Poudyal, Stylometry as a reliable method for fallback authentication. In 2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 660–664 (2020). doi: 10.1109/ECTI-CON49241.2020.9158216.






© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0024

Chapter 24

Collective Intelligence: A Comprehensive Review of Metaheuristic Algorithms Inspired by Animals

Fevrier Valdez
Computer Science, Tijuana Institute of Technology, Calzada Tecnologico S/N, Tijuana, 22450, BC, Mexico

This chapter surveys optimization methods based on collective intelligence (CI) that are inspired by animal behavior. The review covers the cuckoo search algorithm (CSA), particle swarm optimization (PSO), ant colony optimization (ACO), and the bat algorithm (BA). We consider these algorithms because they are inspired by animal behavior and have proved useful for solving complex optimization problems in many applications. Many other algorithms are based on animal behavior, but we restrict the review to these four, mainly because we have worked with them extensively and have improved them by using fuzzy logic to adapt their parameters, which improved convergence and the quality of the results.

24.1. Introduction

The main motivation of this chapter is to review methods based on CI and on populations inspired by animal species, together with their applications. CI is considered a specific computational process that provides a straightforward explanation for several social phenomena; it comprises the theory, design, application, and development of biologically motivated computational paradigms. Overall, the three areas of CI are neural networks [41], fuzzy systems [107], and evolutionary computation [104]. The main contribution of this chapter is to review the importance of animal-inspired metaheuristics and to examine how authors have used these algorithms to solve complex optimization problems and obtain better results than with other, traditional methods. We used the tool available at www.connectedpapers.com to compute the relationships between works, and the Web of Science Core Collection (WoS) to search for the most recent works on these methods.

Optimization methods are commonly based on nature, biological evolution, animal species, and physical laws. Researchers often apply them to optimization problems such as tuning important parameters of a control system, finding the best architecture of a neural network, or, in robotics, finding the best route or minimizing battery use [110, 111]. These methods have been shown to outperform traditional methods on complex problems; examples can be found in [1, 5, 42, 43, 52, 61, 98, 106, 113, 124].

This chapter is organized as follows: Section 24.1 gives a brief introduction to collective intelligence; Section 24.2 presents the literature review and the methods analyzed in the chapter; Section 24.3 reviews applications that have used these methods within computational intelligence; Section 24.4 presents the conclusions of this review; and the references are listed at the end of the chapter.

24.2. Literature Review

Collective intelligence optimization methods have been widely used because they have demonstrated the ability to solve complex optimization problems in many areas. The term “collective intelligence” has come into use in [7–9, 12]. Swarm intelligence is the part of artificial intelligence based on the study of the actions of individuals in decentralized systems. This chapter describes heuristic techniques based on animal behavior and populations that researchers have used as optimization methods to solve everyday and industrial problems, and it gives a brief description of the metaheuristics CSA, PSO, ACO, and BA. Table 24.1 lists the most popular population-based optimization algorithms in chronological order.

24.2.1. Cuckoo Search Algorithm

The cuckoo optimization algorithm is based on the life of the cuckoo [123]. Its basis is the specific breeding and egg-laying behavior of this bird, and both adult cuckoos and eggs are used in the model. Adult cuckoos lay their eggs in the nests of other birds. Those eggs hatch and grow into mature cuckoos if the host birds do not find and remove them. The immigration of groups of cuckoos, together with the environmental conditions, is expected to lead them to converge on the best place for reproduction and breeding; this best place corresponds to the optimum of the objective function. CSA was developed by Yang and Deb [123]. It is a continuous, global search method based on the life of a cuckoo. Like other metaheuristic algorithms, CSA begins with an initial population, a group of cuckoos, which lay eggs in the habitats of host birds. In CSA, a randomly generated group of potential solutions represents these habitats [120, 121].

Table 24.1: Popular optimization algorithms inspired by populations in chronological order.

| Year | Popular optimization algorithms based on populations |
|------|------------------------------------------------------|
| 2020 | Improved binary grey wolf optimizer |
| 2020 | Political optimizer [4] |
| 2020 | Chimp optimization algorithm [68] |
| 2020 | Black widow optimization algorithm [54] |
| 2020 | Coronavirus optimization algorithm [82] |
| 2019 | Improved spider monkey optimization [115] |
| 2018 | Monkey search algorithm [72] |
| 2016 | Whale optimization algorithm [86] |
| 2016 | Dolphin swarm algorithm [93] |
| 2015 | Ant lion optimizer [85] |
| 2015 | Lightning search algorithm [99] |
| 2015 | Artificial algae algorithm [109] |
| 2012 | Fruit fly optimization algorithm [90] |
| 2012 | Krill herd algorithm [45] |
| 2010 | Bat algorithm [126] |
| 2010 | Firefly algorithm [120] |
| 2009 | Cuckoo search [119] |
| 2009 | Biogeography-based optimization [101] |
| 2009 | Gravitational search algorithm [94] |
| 2007 | Intelligent water drops [59] |
| 2006 | Cat swarm optimization [19] |
| 2006 | Glow-worm [71] |
| 2005 | Artificial bee algorithm [64] |
| 2005 | Harmony search algorithm [49] |
| 2005 | Honey bee algorithm [105] |
| 2002 | Bacterial foraging algorithm [103] |
| 2001 | Bee colony optimization [80] |
| 2002 | Estimation of distribution algorithm [88] |
| 1998 | Shark search algorithm [57] |
| 1992 | Genetic programming [70] |
| 1995 | Particle swarm optimization [40] |
| 1992 | Ant colony optimization [24] |
| 1989 | Tabu search [50] |
| 1979 | Cultural algorithms [97] |
| 1975 | Genetic algorithms [58] |
| 1966 | Evolutionary programming [6] |
| 1965 | Evolution strategies [10] |
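The cuckoo search loop described in this section (a population of nests, new eggs laid in randomly chosen host nests, and discovered eggs being removed) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the Lévy-flight step (Mantegna's method), the discovery fraction `pa`, the step scale `step`, and all function names are taken from the commonly used formulation of cuckoo search and are not given in this chapter.

```python
import math
import random

def levy_step(rng, beta=1.5):
    """One heavy-tailed step length via Mantegna's method (assumed, not from the chapter)."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    return rng.gauss(0, sigma) / abs(rng.gauss(0, 1)) ** (1 / beta)

def cuckoo_search(objective, bounds, n_nests=15, n_iter=200, pa=0.25, step=0.01, seed=0):
    """Minimize `objective` over box `bounds`; returns (best_nest, best_value, history)."""
    rng = random.Random(seed)
    dim = len(bounds)

    def random_nest():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def clip(x):
        return [min(max(x[d], bounds[d][0]), bounds[d][1]) for d in range(dim)]

    nests = [random_nest() for _ in range(n_nests)]   # initial habitats (candidate solutions)
    fit = [objective(n) for n in nests]
    history = [min(fit)]
    for _ in range(n_iter):
        for i in range(n_nests):
            # A cuckoo lays an egg: a Levy-flight move away from nest i
            egg = clip([nests[i][d] + step * levy_step(rng) * (bounds[d][1] - bounds[d][0])
                        for d in range(dim)])
            j = rng.randrange(n_nests)                # host nest chosen at random
            val = objective(egg)
            if val < fit[j]:                          # the egg survives only if it is better
                nests[j], fit[j] = egg, val
        # Host birds discover a fraction pa of the worst eggs; those nests are rebuilt
        worst_first = sorted(range(n_nests), key=lambda k: fit[k], reverse=True)
        for k in worst_first[:int(pa * n_nests)]:
            nests[k] = random_nest()
            fit[k] = objective(nests[k])
        history.append(min(fit))
    b = min(range(n_nests), key=lambda k: fit[k])
    return nests[b], fit[b], history
```

Because only the worst nests are abandoned and a host nest is replaced only by a better egg, the best value recorded in `history` never gets worse from one iteration to the next.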


24.2.2. Ant Colony Optimization

One of the first behaviors studied by researchers was the ability of ants to find the shortest path between their nest and a food source. These studies and observations led to the first algorithmic models of the foraging behavior of ants, developed by Marco Dorigo [27, 28]. Since then, research on ant-based algorithms has been very popular, resulting in a large number of algorithms and applications. The algorithms developed from studies of ant foraging behavior are referred to as instances of the ant colony optimization metaheuristic [30, 31]. The ACO algorithm has been applied to the Traveling Salesman Problem (TSP) [27, 28, 30–32]. Given a set of cities and the distances between them, the TSP is the problem of finding the shortest possible tour that visits every city exactly once. The problem can be represented by a complete weighted graph G = (N, E), where N is the set of nodes representing the cities and E is the set of edges. Each edge is assigned a value d_{ij}, which is the distance between cities i and j. When ACO is applied to the TSP, a pheromone strength \tau_{ij}(t) is associated with each edge (i, j), where \tau_{ij}(t) is a numerical value that is modified during the execution of the algorithm and t is the iteration counter. Equation (24.1) gives the probabilistic rule followed by ant k.

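$$p_{ij}^{k} = \frac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in N_i^k} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}}, \quad \text{if } j \in N_i^k \qquad (24.1)$$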

The ACO metaheuristic is inspired by the behavior of real ant colonies, which exhibit an interesting feature: they find the shortest paths between the nest and food. On their way, the ants deposit a substance called pheromone. This trail allows the ants to get back to their nest from the food source. The algorithm uses pheromone evaporation to avoid an unlimited increase of the pheromone trails and to allow the ants to forget bad decisions [33, 35]. In Eq. (24.1), p_{ij}^{k} is the probability of ant k going from node i to node j; \eta_{ij} is the heuristic information, that is, prior knowledge; N_i^k is the feasible neighborhood of ant k when in city i; and \tau_{ij} is the pheromone trail. The parameters \alpha and \beta determine the influence of the pheromone and of the prior knowledge, respectively.
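To make Eq. (24.1) and the role of evaporation concrete, the following Python sketch computes the transition probabilities for one ant, builds a tour, and updates the pheromone matrix. It is a minimal sketch under assumptions that are not stated in this chapter: the heuristic is taken as \eta_{ij} = 1/d_{ij}, each ant deposits q divided by its tour length, and the evaporation rate `rho` and all function names are hypothetical choices that follow the common Ant System formulation.

```python
import random

def heuristic_matrix(dist):
    """Assumed heuristic information: eta_ij = 1 / d_ij (0 on the diagonal)."""
    n = len(dist)
    return [[0.0 if a == b else 1.0 / dist[a][b] for b in range(n)] for a in range(n)]

def transition_probabilities(i, feasible, tau, eta, alpha=1.0, beta=2.0):
    """Eq. (24.1): probability of moving from city i to each feasible city j."""
    weights = {j: (tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in feasible}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def build_tour(dist, tau, alpha=1.0, beta=2.0, rng=random):
    """One ant visits every city exactly once, sampling moves from Eq. (24.1)."""
    n = len(dist)
    eta = heuristic_matrix(dist)
    tour = [rng.randrange(n)]
    unvisited = set(range(n)) - {tour[0]}
    while unvisited:
        probs = transition_probabilities(tour[-1], unvisited, tau, eta, alpha, beta)
        cities = list(probs)
        nxt = rng.choices(cities, weights=[probs[c] for c in cities])[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(tour, dist):
    """Length of the closed tour (returns to the starting city)."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))

def evaporate_and_deposit(tau, tours, dist, rho=0.5, q=1.0):
    """Evaporate all trails by a factor (1 - rho), then reinforce edges used by the ants."""
    n = len(tau)
    for a in range(n):
        for b in range(n):
            tau[a][b] *= (1.0 - rho)          # evaporation: bad decisions are forgotten
    for tour in tours:
        deposit = q / tour_length(tour, dist)  # shorter tours deposit more pheromone
        for k in range(len(tour)):
            a, b = tour[k], tour[(k + 1) % len(tour)]
            tau[a][b] += deposit
            tau[b][a] += deposit               # symmetric TSP
```

In a full run, several ants would call `build_tour` in each iteration, followed by one call to `evaporate_and_deposit` on all their tours.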

24.2.3. Particle Swarm Optimization

Particle swarm optimization (PSO) is a population-based stochastic optimization technique developed by Eberhart and Kennedy in 1995, inspired by the social behavior of bird flocking and fish schooling.

PSO shares many similarities with evolutionary computation techniques such as genetic algorithms (GA) [17]. The algorithm initializes a swarm of random particles, each of which is a candidate solution to the problem at hand, and evaluates these candidates at every iteration [66]; it then searches for optima by updating the swarm over successive generations. Unlike GA, however, PSO has no evolution operators such as crossover and mutation. In PSO, the potential solutions, called particles, fly through the problem space by following the current optimum particles [2]. PSO is also attractive because it has few parameters to adjust: one version, with slight variations, works well in a wide variety of applications. PSO has been used both for general approaches that apply across a wide range of applications and for applications focused on a specific requirement.
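The idea of particles flying through the problem space by following the current optimum particles can be sketched in Python as follows. The chapter does not give the update equations, so this sketch assumes the widely used inertia-weight form, v <- w v + c1 r1 (pbest - x) + c2 r2 (gbest - x) followed by x <- x + v; the parameter values (`w`, `c1`, `c2`), the clamping of positions to the bounds, and the function name `pso` are illustrative assumptions.

```python
import random

def pso(objective, bounds, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over box `bounds`; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial swarm: every particle is a candidate solution
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda k: pbest_val[k])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # best position seen by the swarm
    for _ in range(n_iter):
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (pbest[k][d] - pos[k][d])   # pull toward own best
                             + c2 * r2 * (gbest[d] - pos[k][d]))     # pull toward swarm best
                lo, hi = bounds[d]
                pos[k][d] = min(max(pos[k][d] + vel[k][d], lo), hi)  # keep inside the bounds
            val = objective(pos[k])
            if val < pbest_val[k]:
                pbest[k], pbest_val[k] = pos[k][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[k][:], val
    return gbest, gbest_val
```

Note that there is no crossover or mutation step: the only operators are the velocity and position updates, which is why so few parameters need tuning.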

24.2.4. Bat Algorithm

The bat algorithm is a bio-inspired algorithm developed by Yang [121], and it has been found to be very efficient. By idealizing some of the echolocation characteristics of microbats, various bat-inspired algorithms, or bat algorithms, can be developed. For simplicity, the following approximate or idealized rules are used [118]:

1. All bats use echolocation to sense distance, and they also “know” the difference between food/prey and background barriers in some magical way.
2. Bats fly randomly with velocity v_i at position x_i with a fixed frequency f_{min}, varying wavelength \lambda, and loudness A_0 to search for prey. They can automatically adjust the wavelength (or frequency) of their emitted pulses and adjust the rate of pulse emission r \in [0, 1], depending on the proximity of their target.
3. Although loudness can vary in many ways, we assume that it varies from a large (positive) A_0 to a minimum constant value A_{min}.

For simplicity, the frequency is assumed to lie in the range f \in [0, f_{max}]. The new solutions x_i^t and velocities v_i^t at time step t are generated using a random vector drawn from a uniform distribution [51].
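The three idealized rules above can be turned into a short Python sketch. The chapter does not spell out the update equations, so the sketch assumes the commonly published form of the bat algorithm: f_i = f_min + (f_max - f_min) \beta with \beta uniform in [0, 1], v_i^t = v_i^{t-1} + (x_i^{t-1} - x*) f_i, x_i^t = x_i^{t-1} + v_i^t, a local random walk around the best bat scaled by the mean loudness, loudness decay A <- \alpha A, and pulse rate r = r_0 (1 - exp(-\gamma t)). All parameter values and names are hypothetical.

```python
import math
import random

def bat_algorithm(objective, bounds, n_bats=20, n_iter=200, f_min=0.0, f_max=2.0,
                  a0=1.0, r0=0.5, alpha=0.9, gamma=0.9, seed=0):
    """Minimize `objective`; returns (best_position, best_value, history of best values)."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(x):
        return [min(max(x[d], bounds[d][0]), bounds[d][1]) for d in range(dim)]

    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    v = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in x]
    loud = [a0] * n_bats            # rule 3: loudness starts large and decreases
    rate = [0.0] * n_bats           # rule 2: pulse emission rate in [0, 1]
    b = min(range(n_bats), key=lambda i: fit[i])
    best, best_val = x[b][:], fit[b]
    history = [best_val]
    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency drawn with a uniform random number, then velocity and position update
            freq = f_min + (f_max - f_min) * rng.random()
            v[i] = [v[i][d] + (x[i][d] - best[d]) * freq for d in range(dim)]
            cand = clip([x[i][d] + v[i][d] for d in range(dim)])
            if rng.random() > rate[i]:
                # Local random walk around the current best solution
                cand = clip([best[d] + rng.uniform(-1, 1) * mean_loud for d in range(dim)])
            val = objective(cand)
            if val <= fit[i] and rng.random() < loud[i]:
                x[i], fit[i] = cand, val
                loud[i] *= alpha                                   # quieter near the prey
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))        # more frequent pulses
            if val < best_val:
                best, best_val = cand[:], val
        history.append(best_val)
    return best, best_val, history
```

The returned `history` records the best value after each iteration, which makes it easy to plot convergence or to compare parameter settings.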

24.3. Applications

In this section, we review applications of the methods and the relationships between the works of several authors, describing for each analyzed method only the works that use it. In particular, we study in detail only four algorithms based on collective intelligence: CSA, ACO, PSO, and BA. We present the papers connected to these methods and the most relevant applications found in recent years.

24.3.1. Cuckoo Search Review

The site https://www.connectedpapers.com/ provides detailed information about the relationships among published works around the world, taking as reference the most important papers identified after an exhaustive review of the literature. This information helps to identify which papers have been cited in recent years, which are the most relevant, who their authors are, and so on. For this study of CSA, the paper “Cuckoo search algorithm: a metaheuristic approach to solve structural optimization problems” [46] was taken as the main work from which to compute the relationships with the connected papers. Figure 24.1 shows the graph produced by the tool. The works it returns are not necessarily based on CSA; rather, they are linked through collective intelligence or similar topics because researchers have cited them. The graph presents the most frequently cited papers, which usually means that they are important seminal works for the field and worth becoming familiar with [14, 21–23, 47, 55, 56, 63, 74, 96]. Table 24.2 lists the papers, authors, and citation counts so that the relationships among them can be examined in detail. In addition, the table shows applications of CSA in many areas with different methods.

CSA has often been used to solve complex optimization problems. A search in the Web of Science (WoS) for “Cuckoo search applications” returned 399 journal papers. We briefly describe only three of them; the complete, updated list can be obtained with the same query in WoS. Lin et al. [79] developed a dynamic risk assessment of food safety based on an improved hidden Markov model (HMM) integrating CSA (CS-HMM). CSA was used to conduct a global search for the initial values of the HMM, and the Baum-Welch algorithm was then used to refine these initial values to obtain a trained risk assessment model. In [73], the cuckoo search and bee colony algorithm (CSBCA) was used to optimize test cases and generate path convergence within minimal execution time; its performance was compared with that of existing methods such as PSO, CSA, the bee colony algorithm (BCA), and the firefly algorithm (FA). Finally, in [108], an improved CSA variant for constrained nonlinear optimization was proposed. Figure 24.2 shows the number of papers per author for authors with at least three papers.


Figure 24.1: Papers associated with the original paper “Cuckoo search algorithm: a metaheuristic approach to solve structural optimization problems.”

Table 24.2: Prior works.

| Paper | Last author | Year | Citations | Graph citations |
|-------|-------------|------|-----------|-----------------|
| Use of a self-adaptive penalty approach for engineering optimization problems | Carlos A. Coello Coello | 2000 | 791 | 28 |
| Solving engineering optimization problems with the simple constrained particle swarm optimizer | Carlos A. Coello Coello | 2008 | 251 | 28 |
| An augmented Lagrange multiplier based method for mixed integer discrete continuous optimization and its applications to mechanical design | S. N. Kramer | 1994 | 476 | 26 |
| Constraint-handling in genetic algorithms through the use of dominance-based tournament selection | Efrén Mezura | 2002 | 586 | 23 |
| An effective co-evolutionary particle swarm optimization for constrained engineering design problems | Ling Wang | 2007 | 668 | 23 |
| A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice | Zong Woo Geem | 2005 | 1444 | 23 |
| Mixed variable structural optimization using Firefly Algorithm | Amir Hossein | 2011 | 576 | 23 |
| An improved particle swarm optimizer for mechanical design optimization problems | Q. H. Wu | 2004 | 311 | 22 |
| Society and civilization: an optimization algorithm based on the simulation of social behavior | Kim-Meow Liew | 2003 | 369 | 22 |
| Gaussian quantum-behaved particle swarm optimization approaches for constrained engineering design problems | Leandro dos Santos | 2010 | 256 | 21 |

Figure 24.2: CSA applications by authors.

24.3.2. Ant Colony Review

In this section, we present the papers associated with ACO. The paper selected for this analysis is “Ant colony optimization theory: a survey” [29]. Figure 24.3 shows the full network of related works. In the tool, selecting a prior work highlights all graph papers that reference it, and selecting a graph paper highlights all the prior work it references. In this example, however, only the paper cited above is shown as the main work. The tool also allows several searches: any paper can be selected according to our interests, and the network is automatically built from the selected work. This makes it possible to analyze in detail the relationships between similar works. Table 24.3 shows the prior works of the analyzed paper, with their citation and author information [13, 15, 25, 26, 34, 36, 37, 44, 81, 102].

Figure 24.3: Papers associated with the original paper “Ant colony optimization theory: a survey.”

Table 24.3: Prior works.

| Paper | Last author | Year | Citations | Graph citations |
|-------|-------------|------|-----------|-----------------|
| Ant system: optimization by a colony of cooperating agents | Alberto Colorni | 1996 | 10307 | 40 |
| Ant colony system: a cooperative learning approach to the traveling salesman problem | Luca Maria Gambardella | 1997 | 7121 | 36 |
| MACS-VRPTW: a multiple ant colony system for vehicle routing problems with time windows | Giovanni Agazzi | 1999 | 792 | 31 |
| Optimization, learning and natural algorithms | Marco Dorigo | 1992 | 3373 | 31 |
| Exact and approximate nondeterministic tree-search procedures for the quadratic assignment problem | Vittorio Maniezzo | 1999 | 337 | 30 |
| MAX-MIN ant system | Holger H. Hoos | 2000 | 2484 | 29 |
| Positive feedback as a search strategy | Alberto Colorni | 1991 | 751 | 29 |
| Ants can colour graphs | Alain Hertz | 1997 | 564 | 28 |
| A new rank based version of the ant system: a computational study | Christine Strauss | 1997 | 888 | 28 |
| AntNet: distributed stigmergetic control for communications networks | Marco Dorigo | 1998 | 1753 | 27 |

Applications of the ACO method within computational intelligence are included here. ACO combined with fuzzy systems is often used for parameter tuning, to find the most important parameters of the methods. In WoS, a query on the topic “ACO with fuzzy logic” returned 50 journal works. Only the first three are briefly described here; the complete, updated list can be obtained with the same search in WoS. In [11], a holistic optimization approach for inverted cart–pendulum control tuning, using a simplified ant colony optimization method with a constrained Nelder–Mead algorithm (ACO-NM), was proposed. In [18], a fuzzy PID controller optimized by an improved ACO algorithm was proposed for load frequency control in multi-area interconnected power systems. Finally, in [126], an improved ACO was developed to design an A2-C1 type fuzzy logic system. Figure 24.4 shows the number of publications per author for authors with at least one work, and Figure 24.5 shows the results by country; India and China have the largest number of publications for this query. Other recent applications of ACO can be found in [53, 60, 83, 127].

Figure 24.4: ACO with fuzzy logic categorized by authors.


Figure 24.5: ACO with fuzzy logic categorized by countries.

24.3.3. PSO Review As for the methods described earlier, we used the site https://www.connectedpapers. com/ to observe detailed information about the relationship among published works around the world taking as reference the most important papers considered after making an exhaustive review of the literature. The collected information can be useful to know which papers have been cited in the last years and which are more relevant, including the authors, etc. For example, the paper “Particle swarm optimization: an overview” [92] was considered for this study with PSO as the main work to review the relationship with the connected papers. Figure 24.6 shows the graph after use the software above earlier. The graph shows which papers are linked with some topics about collective intelligence or similar works because the researchers have cited their works using these references. The graph represents the papers that were most commonly cited. This usually means that they are important seminal works for this field and it would be a good idea to get familiar with them [20, 38, 39, 65, 67, 77, 84, 95, 100, 114]. Table 24.4 shows the relevant information about the connected papers shown in Figure 24.6. In this part, the applications of the PSO algorithm in computational intelligence are presented. While searching the topic “PSO with neural networks” in WoS, the results showed 2360 publications related to this query. However, only three papers are described, but the updated references can be found using the same query in WoS. In [62], A Novel PSO-Based Optimized Lightweight Convolution Neural Network for Movements Recognizing from Multichannel Surface Electromyogram was proposed. In [91], the authors proposed a work for prediction of clad characteristics using ANN and combined PSO-ANN algorithms in the laser metal deposition process. Finally, in [116], a PSO-GA based hybrid with

June 1, 2022 13:19


Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch24

Handbook on Computer Learning and Intelligence — Vol. II

Figure 24.6: Papers connected with the original paper “Particle swarm optimization: an overview”.

Adam optimization for ANN training, with application in medical diagnosis, was developed. Other works on PSO applied to neural networks, mathematical functions, and control are reviewed in [3, 76, 78, 112]. Figure 24.7 shows the results obtained, categorized by authors with at least 14 papers, when querying WoS with the topic used in this part.
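To make the PSO applications above more concrete, the following is a minimal sketch of a global-best PSO with a fixed inertia weight, in the spirit of the inertia-weight variant listed in Table 24.4. It is not taken from any of the cited papers: the function name `pso`, the parameter values (`w`, `c1`, `c2`), the bound clipping, and the `sphere` test function are illustrative assumptions.

```python
import random


def pso(objective, bounds, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO with a fixed inertia weight (illustrative sketch only)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the search box, zero initial velocities.
    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in x]
    pbest_val = [objective(p) for p in x]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + cognitive pull (own best) + social pull (swarm best).
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                # Assumption: positions are simply clipped to the bounds.
                x[i][d] = min(max(x[i][d] + v[i][d], lo), hi)
            val = objective(x[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = x[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = x[i][:], val
    return gbest, gbest_val


def sphere(z):
    """Hypothetical test objective: sum of squares, minimum 0 at the origin."""
    return sum(c * c for c in z)


best_x, best_val = pso(sphere, [(-5.0, 5.0)] * 3)
print(best_x, best_val)
```

In a PSO-ANN hybrid such as those cited above, `objective` would typically be the training or validation error of a network whose weights are encoded in the particle position; that encoding is left out of this sketch.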


Table 24.4: Prior works on PSO.

| Title | Last author | Year | Citations | Graph citations |
|---|---|---|---|---|
| A modified particle swarm optimizer | Russell C. Eberhart | 1998 | 9129 | 38 |
| The particle swarm—explosion, stability, and convergence in a multidimensional complex space | James Kennedy | 2002 | 7426 | 35 |
| The fully informed particle swarm: simpler, maybe better | José Neves | 2004 | 1517 | 35 |
| Population structure and particle swarm performance | Rui Mendes | 2002 | 1520 | 35 |
| A cooperative approach to particle swarm optimization | Andries Petrus Engelbrecht | 2004 | 1764 | 31 |
| Small worlds and mega-minds: effects of neighborhood topology on particle swarm performance | James Kennedy | 1999 | 1065 | 31 |
| A new optimizer using particle swarm theory | James Kennedy | 1995 | 12238 | 31 |
| Comprehensive learning particle swarm optimizer for global optimization of multimodal functions | S. Baskar | 2006 | 2583 | 30 |
| Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients | Harry C. Watson | 2004 | 2391 | 30 |
| Comparing inertia weights and constriction factors in particle swarm optimization | Yuhui Shi | 2000 | 2668 | 26 |

Figure 24.7: PSO with neural networks categorized by authors.


Figure 24.8: PSO with neural networks categorized by countries.

Figure 24.8 shows the results categorized by countries, where it can be seen that China has the largest number of publications related to the query described above.

24.3.4. BA Review

As with the other algorithms reviewed, we selected a popular article and examined its most important connections in the network built by querying the tool described earlier. The collected information is useful for reviewing which papers have been cited in recent years, which are most relevant, who their authors are, and so on. For example, the paper “Bat algorithm: a novel approach for global engineering optimization” [125] was taken as the BA work for reviewing the relationships with related papers. Figure 24.9 shows the graph obtained using the software described. The graph shows which papers are linked to topics in collective intelligence or to similar works because researchers cited these references in their work. The graph represents the papers that were most commonly cited [20, 38, 39, 65, 67, 77, 84, 95, 100, 114]. Table 24.5 presents information about the graph shown in this section. We can observe that the first paper has 791 citations [16, 22, 23, 47, 55, 56, 69, 75, 96, 117]. We searched WoS, as for the other methods analyzed in this chapter, with the topic “BA with fuzzy logic”; however, we found only 19 papers, possibly because few authors have tried to combine or apply these two techniques. Here, only three recent papers are presented, but the other papers can


Figure 24.9: Papers connected with the original paper “Bat algorithm: a novel approach for global engineering optimization.”

be reviewed in WoS. In addition, other works where this algorithm was used are cited. In [48], a novel hybrid BAT-fuzzy controller-based MPPT for a grid-connected PV-battery system was implemented. In [89], a comparative study of type-2 fuzzy particle swarm, bee colony, and bat algorithms in the optimization of fuzzy controllers was presented. Finally, in [87], a smart bat algorithm for wireless sensor network deployment in a 3D environment was proposed. Other recent applications of BA can be found in [69, 122]. Figure 24.10 shows the results obtained, categorized by authors with at least one paper, when querying WoS with the topic used in this part. Finally, Figure 24.11 shows the results categorized by countries, where it can be seen that China has the largest number of publications related to the query described above.
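For readers unfamiliar with the bat algorithm itself, the following is a minimal sketch of its commonly described update rules (frequency-tuned velocity update, local random walk around the current best, and loudness and pulse-rate adaptation). It is not the implementation used in any of the cited works; the function name `bat_algorithm`, all default parameter values, the bound clipping, the zero initial pulse rate, and the `sphere` test function are assumptions made for illustration.

```python
import math
import random


def bat_algorithm(objective, bounds, n_bats=25, n_iter=300, f_min=0.0, f_max=2.0,
                  alpha=0.9, gamma=0.9, loudness0=1.0, pulse_rate0=0.5, seed=0):
    """Basic bat algorithm for minimization (illustrative sketch only)."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(p):
        # Assumption: candidates are clipped to the search box.
        return [min(max(p[d], bounds[d][0]), bounds[d][1]) for d in range(dim)]

    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    v = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in x]
    loud = [loudness0] * n_bats
    rate = [0.0] * n_bats  # pulse rates start at 0 and grow toward pulse_rate0
    b = min(range(n_bats), key=fit.__getitem__)
    best, best_val = x[b][:], fit[b]
    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency-tuned velocity and position update.
            freq = f_min + (f_max - f_min) * rng.random()
            v[i] = [v[i][d] + (x[i][d] - best[d]) * freq for d in range(dim)]
            cand = clip([x[i][d] + v[i][d] for d in range(dim)])
            if rng.random() > rate[i]:
                # Local random walk around the current best, scaled by mean loudness.
                cand = clip([best[d] + rng.uniform(-1.0, 1.0) * mean_loud
                             for d in range(dim)])
            val = objective(cand)
            if val <= fit[i] and rng.random() < loud[i]:
                # Accept the move: the bat becomes quieter and pulses more often.
                x[i], fit[i] = cand, val
                loud[i] *= alpha
                rate[i] = pulse_rate0 * (1.0 - math.exp(-gamma * t))
            if val < best_val:
                best, best_val = cand[:], val
    return best, best_val


def sphere(z):
    """Hypothetical test objective: sum of squares, minimum 0 at the origin."""
    return sum(c * c for c in z)


best_x, best_val = bat_algorithm(sphere, [(-5.0, 5.0)] * 3)
print(best_x, best_val)
```

When BA is combined with fuzzy logic, as in the works cited above, a fuzzy system would typically either be the object being tuned (the position encodes controller parameters) or replace fixed settings such as `alpha` and `gamma`; neither option is shown here.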


Table 24.5: Prior works on BA.

| Title | Last author | Year | Citations | Graph citations |
|---|---|---|---|---|
| Use of a self-adaptive penalty approach for engineering optimization problems | Carlos A. Coello Coello | 2000 | 791 | 26 |
| Society and civilization: an optimization algorithm based on the simulation of social behavior | Kim-Meow Liew | 2003 | 369 | 23 |
| A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice | Zong Woo Geem | 2005 | 1444 | 22 |
| An improved particle swarm optimizer for mechanical design optimization problems | Q. H. Wu | 2004 | 311 | 21 |
| Nature-inspired metaheuristic algorithms | Xin-She Yang | 2008 | 3112 | 21 |
| An augmented Lagrange multiplier based method for mixed integer discrete continuous optimization and its applications to mechanical design | S. N. Kramer | 1994 | 476 | 21 |
| Solving engineering optimization problems with the simple constrained particle swarm optimizer | Carlos A. Coello Coello | 2008 | 251 | 21 |
| An effective co-evolutionary particle swarm optimization for constrained engineering design problems | Ling Wang | 2007 | 668 | 21 |
| Mixed variable structural optimization using firefly algorithm | Amir Hossein Alavi | 2011 | 576 | 19 |
| Constraint-handling in genetic algorithms through the use of dominance-based tournament selection | Efrén Mezura-Montes | 2002 | 586 | 19 |

Figure 24.10: BA with fuzzy logic categorized by authors.


Figure 24.11: BA with fuzzy logic categorized by countries.

24.4. Conclusions

In this chapter, after an exhaustive study of the different optimization methods inspired by collective intelligence, we observed in detail the importance of these algorithms and how behaviors developed by animals have been adapted to computational systems for solving complex real-world problems. These algorithms can improve the performance of many applications by combining different methods, which is often required in complex problems. For example, ant colony optimization is a metaheuristic motivated by the foraging behavior of ants. It represents a general algorithmic framework that can be applied to various optimization problems in management, engineering, and control. ACO takes inspiration from the way various ant species find the shortest path. PSO, inspired by the social behavior of bird flocking or fish schooling, is capable of solving many complex optimization problems. Finally, BA, the most recent of the three methods, is inspired by the behavior of microbats and has also achieved positive results. In this chapter, we analyzed only the most relevant methods proposed in recent years; other works cover other algorithms, where researchers can find similar methods. We therefore focused on methods based on collective intelligence inspired by animal behavior. Recently, many researchers have applied these techniques to complex applications and obtained better results than with traditional methods. In addition, this chapter highlights that optimization algorithms combined with computational intelligence can improve the results in many applications. Collective intelligence is nowadays used by researchers to achieve the best results in algorithms. The networks built with the most relevant works are used


to make calculations, giving us a general overview of the researchers applying these methods around the world.

Acknowledgment We would like to express our gratitude to CONACYT, Tecnologico Nacional de Mexico/Tijuana Institute of Technology, for the facilities and resources granted for the development of this research.

References
[1] Abdel-Basset, M., L. A. Shawky (2019). Flower pollination algorithm: a comprehensive review. Artif. Intell. Rev., 52(4):2533–2557.
[2] Angeline, P. J. (1998). Using selection to improve particle swarm optimization. In Proceedings 1998 IEEE World Congress on Computational Intelligence, pp. 84–89.
[3] Argha, R., D. Diptam, C. Kaustav (2013). Training artificial neural network using particle swarm optimization algorithm. Int. J. Adv. Res. Comput. Sci. Software Eng., 3(3):430–434.
[4] Askari, Q., I. Younas, M. Saeed (2020). Political optimizer: a novel socio-inspired metaheuristic for global optimization. Knowledge-Based Syst., 195:105709.
[5] Askarzadeh, A., E. Rashedi (2017). Harmony search algorithm.
[6] Bäck, T., G. Rudolph, H. P. Schwefel (1997). Evolutionary programming and evolution strategies: similarities and differences. In Proceedings of the Second Annual Conference on Evolutionary Programming, pp. 11–22.
[7] Beni, G. (1988). The concept of cellular robotic system. In Proceedings of the 1988 IEEE International Symposium on Intelligent Control, pp. 57–62. IEEE Computer Society Press.
[8] Beni, G., S. Hackwood (1992). Stationary waves in cyclic swarms. In Proceedings of the 1992 International Symposium on Intelligent Control, pp. 234–242. IEEE Computer Society Press.
[9] Beni, G., J. Wang (1989). Swarm intelligence. In Proceedings of the Seventh Annual Meeting of the Robotics Society of Japan, pp. 425–428. RSJ Press.
[10] Beyer, H. G., H. P. Schwefel (2002). Evolution strategies: a comprehensive introduction. Nat. Comput., 1:3–52.
[11] Blondin, M. J., P. M. Pardalos (2020). A holistic optimization approach for inverted cart-pendulum control tuning. Soft Comput., 24(6):4343–4359.
[12] Bonabeau, E., M. Dorigo, G. Theraulaz (1997). Swarm Intelligence. Oxford University Press.
[13] Bullnheimer, B., R. Hartl, C. Strauss (1997). A new rank based version of the ant system: a computational study. Working Paper No. 1, SFB Adaptive Information Systems and Modelling in Economics and Management Science, Vienna, to appear in CEJOR.
[14] Cagnina, L. C., S. Esquivel, C. Coello (2008). Solving engineering optimization problems with the simple constrained particle swarm optimizer. Informatica (Slovenia), 32:319–326.
[15] Caro, G. D., M. Dorigo (1998). AntNet: distributed stigmergetic control for communications networks. arXiv abs/1105.5449.
[16] Carreon-Ortiz, H., F. Valdez (2022). A new mycorrhized tree optimization nature-inspired algorithm. Soft Comput., 26:4797–4817. https://doi.org/10.1007/s00500-022-06865-8.
[17] Castillo, O., P. Melin (2002). Hybrid intelligent systems for time series prediction using neural networks, fuzzy logic, and fractal theory. IEEE Trans. Neural Networks, 13(6):1395–1408.
[18] Chen, G., Z. Li, Z. Zhang, S. Li (2020). An improved ACO algorithm optimized fuzzy PID controller for load frequency control in multi area interconnected power systems. IEEE Access, 8:6429–6447.


[19] Chu, S. C., P. W. Tsai, J. S. Pan (2006). Cat swarm optimization. In Proceedings of the 9th Pacific Rim International Conference on Artificial Intelligence, LNAI 4099, Guilin, pp. 854–858.
[20] Clerc, M., J. Kennedy (2002). The particle swarm—explosion, stability, and convergence in a multidimensional complex space. IEEE Trans. Evol. Comput., 6:58–73.
[21] Coelho, L. (2010). Gaussian quantum-behaved particle swarm optimization approaches for constrained engineering design problems. Expert Syst. Appl., 37:1676–1683.
[22] Coello, C. (2000). Use of a self-adaptive penalty approach for engineering optimization problems. Comput. Ind., 41:113–127.
[23] Coello, C., E. Mezura-Montes (2002). Constraint-handling in genetic algorithms through the use of dominance-based tournament selection. Adv. Eng. Inf., 16:193–203.
[24] Colorni, A., M. Dorigo, V. Maniezzo (1992). Distributed optimization by ant colonies. In Bourgine, F. V. P., ed., Proceedings of the First European Conference on Artificial Life, pp. 134–142. MIT Press.
[25] Costa, D., A. Hertz (1997). Ants can colour graphs. J. Oper. Res. Soc., 48:295–305.
[26] Dorigo, M. (1992). Optimization, learning and natural algorithms. Ph.D. Thesis, Politecnico di Milano, Italy.
[27] Dorigo, M. (1994). Learning by probabilistic Boolean networks. In Proceedings of the IEEE International Conference on Neural Networks, pp. 887–891.
[28] Dorigo, M., M. Birattari, T. Stützle (1992). Ant colony optimization. In IEEE Computational Intelligence Magazine, pp. 28–39.
[29] Dorigo, M., C. Blum (2005). Ant colony optimization theory: a survey. Theor. Comput. Sci., 344(2):243–278.
[30] Dorigo, M., E. Bonabeau, G. Theraulaz (2000). Ant algorithms and stigmergy. Future Gener. Comput. Syst., 16(8):851–871.
[31] Dorigo, M., G. D. Caro (1999a). Ant colony optimization: a new meta-heuristic. In Proceedings of the IEEE Congress on Evolutionary Computation, vol. 2, pp. 1470–1477.
[32] Dorigo, M., G. D. Caro (1999b). The ant colony optimization meta-heuristic. In New Ideas in Optimization, pp. 11–32.
[33] Dorigo, M., L. Gambardella (1996). A study of some properties of ant-Q. In Proceedings of the Fourth International Conference on Parallel Problem Solving from Nature, pp. 656–665.
[34] Dorigo, M., L. Gambardella (1997a). Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Trans. Evol. Comput., 1:53–66.
[35] Dorigo, M., L. M. Gambardella (1997b). Ant colonies for the travelling salesman problem. Biosystems, 43(2):73–81.
[36] Dorigo, M., V. Maniezzo, A. Colorni (1991). Positive feedback as a search strategy. Technical Report, No. 91-016, Politecnico di Milano, Milano.
[37] Dorigo, M., V. Maniezzo, A. Colorni (1996). Ant system: optimization by a colony of cooperating agents. IEEE Trans. Syst. Man Cybern. Part B: Cybern., 26(1):29–41.
[38] Eberhart, R., J. Kennedy (1995a). A new optimizer using particle swarm theory. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science (MHS’95), pp. 39–43.
[39] Eberhart, R., Y. Shi (2000). Comparing inertia weights and constriction factors in particle swarm optimization. In Proceedings of the 2000 Congress on Evolutionary Computation. CEC00 (Cat. No.00TH8512), vol. 1, pp. 84–88.
[40] Eberhart, R., C. Kennedy (1995b). A new optimizer using particle swarm theory. In Proc. Sixth Int. Symposium on Micro Machine and Human Science, pp. 33–43.
[41] Feng, J., Y. Chai, C. Xu (2021). A novel neural network to nonlinear complex-variable constrained nonconvex optimization. J. Franklin Inst., 358(8):4435–4457.
[42] Fister, I., X. S. Yang, J. Brest, I. Fister (2013). Modified firefly algorithm using quaternion representation. Expert Syst. Appl., 40(18):7220–7230.


[43] Gallo, C., V. Capozzi (2019). A simulated annealing algorithm for scheduling problems. J. Appl. Math. Phys., 7:2579–2594.
[44] Gambardella, L., E. Taillard, G. Agazzi (1999). MACS-VRPTW: a multiple ant colony system for vehicle routing problems with time windows. In D. Corne, M. Dorigo and F. Glover, eds., New Ideas in Optimization, McGraw-Hill, London, UK, pp. 63–76.
[45] Gandomi, A., A. Alavi (2012). Krill herd: a new bio-inspired optimization algorithm. Commun. Nonlinear Sci. Numer. Simul., 17:4831–4845.
[46] Gandomi, A., X. S. Yang, A. Alavi (2013). Cuckoo search algorithm: a metaheuristic approach to solve structural optimization problems. Eng. Comput., 29:245–245.
[47] Gandomi, A. H., X. Yang, A. H. Alavi (2011). Mixed variable structural optimization using Firefly Algorithm. Comput. Struct., 89:2325–2336.
[48] Ge, X., F. Ahmed, A. Rezvani, N. Aljojo, S. Samad, et al. (2020). Implementation of a novel hybrid BAT-Fuzzy controller based MPPT for grid-connected PV-battery system. Control Eng. Pract., 98:104380.
[49] Geem, Z. W. (2008). Novel derivative of harmony search algorithm for discrete design variables. Appl. Math. Comput., 199(1):223–230.
[50] Glover, F. (1989). Tabu search—part I. ORSA J. Comput., 1(3):190–206.
[51] Goel, N., D. Gupta, S. Goel (2013). Performance of Firefly and Bat Algorithm for unconstrained optimization problems. Int. J. Adv. Res. Comput. Sci. Software Eng., 3(5):1405–1409.
[52] Greco, R., I. Vanzi (2019). New few parameters differential evolution algorithm with application to structural identification. J. Traffic Transp. Eng. (Engl. Ed.), 6(1):1–14.
[53] Hajewski, J., S. Oliveira, D. Stewart, L. Weiler (2020). gBeam-ACO: a greedy and faster variant of Beam-ACO. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion (GECCO ’20), pp. 1434–1440.
[54] Hayyolalam, V., A. A. P. Kazem (2020). Black widow optimization algorithm: a novel metaheuristic approach for solving engineering optimization problems. Eng. Appl. Artif. Intell., 87:103249.
[55] He, Q., L. Wang (2007). An effective co-evolutionary particle swarm optimization for constrained engineering design problems. Eng. Appl. Artif. Intell., 20:89–99.
[56] He, S., E. Prempain, Q. Wu (2004). An improved particle swarm optimizer for mechanical design optimization problems. Eng. Optim., 36:585–605.
[57] Hersovici, M., M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, et al. (1998). The shark-search algorithm. An application: tailored Web site mapping. Computer Networks and ISDN Systems, 30(1):317–326.
[58] Holland, J. H. (1975). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. University of Michigan Press, Ann Arbor, MI.
[59] Hosseini, H. S. (2009). The intelligent water drops algorithm: a nature-inspired swarm-based optimization algorithm. Int. J. Bio-Inspir. Comput., 1(1/2):71.
[60] Huang, Z. M., W. Neng Chen, Q. Li, X. N. Luo, H. Q. Yuan, et al. (2020). Ant colony evacuation planner: an ant colony system with incremental flow assignment for multipath crowd evacuation. IEEE Trans. Cybern., doi: 10.1109/TCYB.2020.3013271.
[61] Joshi, A. S., O. Kulkarni, G. M. Kakandikar, V. M. Nandedkar (2017). Cuckoo search optimization: a review. Mater. Today: Proc., 4(8):7262–7269.
[62] Kan, X., D. Yang, L. Cao, H. Shu, Y. Li, et al. (2020). A novel PSO-based optimized lightweight convolution neural network for movements recognizing from multichannel surface electromyogram. Complexity, 2020:1–15.
[63] Kannan, B., S. Kramer (1994). An augmented Lagrange multiplier based method for mixed integer discrete continuous optimization and its applications to mechanical design. J. Mech. Des., 116:405–411.


[64] Karaboga, D. (2005). An idea based on honey bee swarm for numerical optimization, Technical Report—TR06, Erciyes University.
[65] Kennedy, J. (1999). Small worlds and mega-minds: effects of neighborhood topology on particle swarm performance. In Proceedings of the 1999 Congress on Evolutionary Computation-CEC99 (Cat. No. 99TH8406), vol. 3, pp. 1931–1938.
[66] Kennedy, J., R. C. Eberhart (1995). Particle swarm optimization. In Proceedings of IEEE International Conference on Neural Networks, pp. 1942–1948.
[67] Kennedy, J., R. Mendes (2002). Population structure and particle swarm performance. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No.02TH8600), vol. 2, pp. 1671–1676.
[68] Khishe, M., M. R. Mosavi (2020). Chimp optimization algorithm. Expert Syst. Appl., 149:113338.
[69] Kotwal, A., R. Bharti, M. Pandya, H. Jhaveri, R. Mangrulkar (2021). Application of BAT algorithm for detecting malignant brain tumors. In Dey, N., Rajinikanth, V. (eds.) Applications of BAT Algorithm and its Variants. Springer Tracts in Nature-Inspired Computing. Springer, Singapore, pp. 119–132.
[70] Koza, J. (1994). Genetic programming as a means for programming computers by natural selection. Stat. Comput., 4(2):87–112.
[71] Krishnanand, K. N., D. Ghose (2006). Glowworm swarm based optimization algorithm for multimodal functions with collective robotics applications. Multiagent Grid Syst., 2(3):209–222.
[72] Kuliev, E., V. Kureichik (2018). Monkey search algorithm for ECE components partitioning. J. Phys.: Conf. Ser., 1015:042026.
[73] Lakshminarayana, P., T. Sureshkumar (2020). Automatic generation and optimization of test case using hybrid cuckoo search and bee colony algorithm. J. Intell. Syst., 30:59–72.
[74] Lee, K., Z. W. Geem (2005a). A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice. Comput. Methods Appl. Mech. Eng., 194:3902–3933.
[75] Lee, K., Z. W. Geem (2005b). A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice. Comput. Methods Appl. Mech. Eng., 194:3902–3933.
[76] Li, C., T. Wu (2011). Adaptive fuzzy approach to function approximation with PSO and RLSE. Expert Syst. Appl., 38:13266–13273.
[77] Liang, J., A. Qin, P. Suganthan, S. Baskar (2006). Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE Trans. Evol. Comput., 10:281–295.
[78] Lima, N. F. D., T. B. Ludermir (2011). Frankenstein PSO applied to neural network weights and architectures. Evolutionary Computation (CEC), 2452–2456.
[79] Lin, X., J. Li, Y. Han, Z. Geng, S. Cui, et al. (2021). Dynamic risk assessment of food safety based on an improved hidden Markov model integrating cuckoo search algorithm: a sterilized milk study. J. Food Process Eng., n/a(n/a):e13630.
[80] Lučić, P., D. Teodorović (2001). Bee system: modeling combinatorial optimization transportation engineering problems by swarm intelligence. In Preprints of the RISTAN IV Triennial Symposium on Transportation Analysis, Sao Miguel, pp. 441–445.
[81] Maniezzo, V. (1999). Exact and approximate nondeterministic tree-search procedures for the quadratic assignment problem. Informs J. Comput., 11:358–369.
[82] Martínez-Álvarez, F., G. Asencio-Cortés, J. F. Torres, D. Gutiérrez-Avilés, L. Melgar-García, et al. (2020). Coronavirus optimization algorithm: a bioinspired metaheuristic based on the COVID-19 propagation model. Big Data, 8(4):308–322.


[83] Mazinan, A., F. Sagharichiha (2015). A novel hybrid PSO-ACO approach with its application to SPP. Evolving Systems, 6:293–302.
[84] Mendes, R., J. Kennedy, J. Neves (2004). The fully informed particle swarm: simpler, maybe better. IEEE Trans. Evol. Comput., 8:204–210.
[85] Mirjalili, S. (2015). The ant lion optimizer. Adv. Eng. Software, 83:80–98.
[86] Mirjalili, S., A. Lewis (2016). The whale optimization algorithm. Adv. Eng. Software, 95:51–67.
[87] Ng, C. K., C. H. Wu, W. H. Ip, K. Yung (2018). A smart bat algorithm for wireless sensor network deployment in 3-D environment. IEEE Commun. Lett., 22(10):2120–2123.
[88] Ocenasek, J., J. Schwarz (2002). Estimation of distribution algorithm for mixed continuous-discrete optimization problems. In 2nd Euro-International Symposium on Computational Intelligence, pp. 227–232.
[89] Olivas, F., G. A. Angulo, J. Perez, C. Camilo, F. Valdez, et al. (2017). Comparative study of type-2 fuzzy particle swarm, bee colony and bat algorithms in optimization of fuzzy controllers. Algorithms, 10:101.
[90] Pan, W. T. (2012). A new fruit fly optimization algorithm: taking the financial distress model as an example. Knowledge-Based Syst., 26:69–74.
[91] Pant, P., D. Chatterjee (2020). Prediction of clad characteristics using ANN and combined PSO-ANN algorithms in laser metal deposition process. Surf. Interfaces, 21:1–10.
[92] Poli, R., J. Kennedy, T. Blackwell (2007). Particle swarm optimization: an overview. Swarm Intell., 1.
[93] Qi Wu, T., M. Yao, J. Hua Yang (2016). Dolphin swarm algorithm. Front. Inf. Technol. Electron. Eng., 17:717–729.
[94] Rashedi, E., H. Nezamabadi-Pour, S. Saryazdi (2009). GSA: a gravitational search algorithm. Inf. Sci., 179(13):2232–2248.
[95] Ratnaweera, A., S. Halgamuge, H. Watson (2004). Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients. IEEE Trans. Evol. Comput., 8:240–255.
[96] Ray, T., K. Liew (2003). Society and civilization: an optimization algorithm based on the simulation of social behavior. IEEE Trans. Evol. Comput., 7:386–396.
[97] Reynolds, R. G. (1994). An introduction to cultural algorithms. In Proceedings of the 3rd Annual Conference on Evolutionary Programming, pp. 131–139.
[98] Rodrigues, D., L. Pereira, R. Nakamura, K. Costa, X. Yang, et al. (2013). João Paulo Papa, A wrapper approach for feature selection based on Bat Algorithm and Optimum-Path Forest. Expert Syst. Appl., 41:2250–2258.
[99] Shareef, H., A. A. Ibrahim, A. H. Mutlag (2015). Lightning search algorithm. Appl. Soft Comput., 36:315–333.
[100] Shi, Y., R. Eberhart (1998). A modified particle swarm optimizer. In IEEE International Conference on Evolutionary Computation Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98TH8360), pp. 69–73.
[101] Simon, D. (2009). Biogeography-based optimization. evolutionary computation. IEEE Trans. Evol. Comput., 12:702–713.
[102] Stützle, T., H. Hoos (2000). MAX-MIN ant system. Future Gener. Comput. Syst., 16:889–914.
[103] Tang, W. J., Q. H. Wu, J. R. Saunders (2006). Bacterial foraging algorithm for dynamic environments. In IEEE International Conference on Evolutionary Computation, pp. 1324–1330.
[104] Telikani, A., A. H. Gandomi, A. Shahbahrami (2020). A survey of evolutionary computation for association rule mining. Inf. Sci., 524:318–352.
[105] Teodorovic, M. Dell’orco (2005).
[106] Teodorović, D. (2009). Bee colony optimization (BCO). In Lim, C. P., L. C. Jain, S. Dehuri, eds., Innovations in Swarm Intelligence, Springer Berlin Heidelberg, pp. 39–60.


[107] Trivedi, A., P. K. Gurrala (2021). Fuzzy logic based expert system for prediction of tensile strength in Fused Filament Fabrication (FFF) process. Mater. Today: Proc., 44:1344–1349.
[108] Tsipianitis, A., Y. Tsompanakis (2020). Improved Cuckoo Search algorithmic variants for constrained nonlinear optimization. Adv. Eng. Software, 149.
[109] Uymaz, S. A., G. Tezel, E. Yel (2015). Artificial algae algorithm (AAA) for nonlinear global optimization. Appl. Soft Comput., 31:153–171.
[110] Valdez, F. (2016). Swarm intelligence: an introduction, history and applications. In Handbook on Computational Intelligence, pp. 587–606.
[111] Valdez, F. (2020). A review of optimization swarm intelligence-inspired algorithms with type-2 fuzzy logic parameter adaptation. Soft Comput., 24(1):215–226.
[112] Valdez, F., P. Melin, O. Castillo (2010). Evolutionary method combining Particle Swarm Optimisation and Genetic Algorithms using fuzzy logic for parameter adaptation and aggregation: the case neural network optimisation for face recognition. Int. J. Artif. Intell. Soft Comput., 2(1/2):77–102.
[113] Valdez, F., P. Melin, O. Castillo, O. Montiel (2008). A new evolutionary method with a hybrid approach combining particle swarm optimization and genetic algorithms using fuzzy logic for decision making. Appl. Soft Comput., 11(2):2625–2632.
[114] van den Bergh, F., A. Engelbrecht (2004). A cooperative approach to particle swarm optimization. IEEE Trans. Evol. Comput., 8:225–239.
[115] Wang, Z., J. Mumtaz, L. Zhang, L. Yue (2019). Application of an improved spider monkey optimization algorithm for component assignment problem in PCB assembly. Procedia CIRP, 83:266–271.
[116] Yadav, R., A. Anubhav (2020). PSO-GA based hybrid with adam optimization for ANN training with application in medical diagnosis. Cognit. Syst. Res., 64:191–199.
[117] Yang, X. S. (2008). Nature-Inspired Metaheuristic Algorithms. Luniver Press.
[118] Yang, X. S. (2010a). A new metaheuristic bat-inspired algorithm. In Nature Inspired Cooperative Strategies for Optimization, eds. J. R. Gonzalez et al., Studies in Computational Intelligence, vol. 284, Springer, pp. 65–74.
[119] Yang, X., S. Deb (2009). Cuckoo search via Lévy flights. In World Congress on Nature Biologically Inspired Computing (NaBIC), pp. 210–214.
[120] Yang, X. S. (2010b). Nature-Inspired Metaheuristic Algorithms, 2nd edn. Luniver Press.
[121] Yang, X. S. (2014). Nature-Inspired Optimization Algorithms. Elsevier Oxford.
[122] Yang, X. S. (2019). Multiobjective Bat Algorithm (MOBA).
[123] Yang, X. S., S. Deb (2010a). Cuckoo search via Lévy flights. In 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC), pp. 210–214.
[124] Yang, X. S., S. Deb (2010b). Eagle strategy using Lévy walk and firefly algorithms for stochastic optimization. In Studies in Computational Intelligence, vol. 284, pp. 101–111.
[125] Yang, X. S., A. Gandomi (2012). Bat algorithm: a novel approach for global engineering optimization. Eng. Comput., 29:464–483.
[126] Zhang, Z., T. Wang, Y. Chen, J. Lan (2019). Design of type-2 fuzzy logic systems based on improved ant colony optimization. Int. J. Control Autom. Syst., 17:536–544.
[127] Zhao, H. (2020). Optimal path planning for robot based on ant colony algorithm. In 2020 International Wireless Communications and Mobile Computing (IWCMC), pp. 671–675.


b4528-v2-ch25

© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0025

Chapter 25

Fuzzy Dynamic Parameter Adaptation for Gray Wolf Optimization of Modular Granular Neural Networks Applied to Human Recognition Using the Iris Biometric Measure

Patricia Melin∗, Daniela Sánchez†, and Oscar Castillo‡

Tijuana Institute of Technology, Calzada Tecnologico, s/n, CP 22379, Tijuana, Mexico

∗ [email protected] † [email protected] ‡ [email protected]

In this chapter, a gray wolf optimizer with dynamic parameter adaptation based on fuzzy theory is proposed for modular granular neural networks (MGNN). These architectures have several parameters, such as the number of sub-granules, the percentage of data for the training phase, the learning algorithm, the goal error, the number of hidden layers, and their number of neurons, and the optimization seeks to find them. The effectiveness of this optimization with its fuzzy dynamic parameter adaptation is tested using a database of iris biometric measures. The advantage of the proposed fuzzy dynamic parameter adaptation lies in determining the gray wolf optimizer parameters according to the population behavior, i.e., the parameters are adjusted depending on current results in order to improve them. Human recognition is an important area that can provide security for areas or information, especially if biometric measurements are used; for this reason, the method was applied to human recognition using a benchmark database with very good results.

25.1. Introduction

In this chapter, an approach for fuzzy dynamic adaptation of parameters for a gray wolf optimizer (GWO) applied to modular granular neural networks (MGNNs) is proposed. The goal of the proposed adjustment is to modify the GWO parameters during execution, i.e., the parameters are not fixed, so their values can be re-established in each iteration depending on the population behavior [1, 2]. To prove the advantages of the proposed method, modular granular neural networks are applied to a benchmark iris database [3–5]. Human recognition represents an important challenge [6] because the use of biometric measures in recent years has offered a greater guarantee of security in data, areas, or systems [7, 8]; for this reason, the techniques used to perform this task must be selected very carefully [9]. Among the most used techniques to perform human recognition are intelligent techniques such as artificial neural networks [10], fuzzy logic [11, 12], computer vision [13], and machine learning [14], among others.

Among the improvements made to conventional artificial neural networks, we can find ensemble neural networks (ENNs) [15, 16] and modular neural networks (MNNs) [17]. Both kinds of networks are composed of multiple artificial neural networks. In the case of ENNs, each neural network learns the same information or task; in MNNs, the information or task is divided into parts, and each part is learned by a neural network (module). In [18], a mixture of modular neural networks with granular computing (GrC) [19, 20] was proposed, where the granulation of information into subsets of different sizes was presented. It is important to mention that artificial neural network architectures can be designed by optimization techniques, such as the genetic algorithm (GA) [21, 22], ant colony optimization (ACO) [23], particle swarm optimization (PSO) [24], the bat algorithm [25], and the fireworks algorithm (FA) [26], just to mention a few. When two or more intelligent techniques are combined, the resulting system is called a hybrid intelligent system [18, 27]; this kind of system has been widely applied with excellent results [28, 29]. In this chapter, a hybrid intelligent system is proposed where modular neural networks, granular computing, fuzzy logic, and a gray wolf optimizer are combined to improve recognition rates using iris biometric measures.
This chapter is organized as follows. The proposed hybrid optimization is described in Section 25.2. The results obtained by the proposed method are presented in Section 25.3. Statistical comparisons by hypothesis testing are presented in Section 25.4. Finally, conclusions and future lines of research are outlined in Section 25.5.

25.2. Proposed Method

The proposed fuzzy dynamic adaptation and its application are described in this section; this hybrid intelligent system combines modular neural networks with a granular approach, fuzzy logic, and a gray wolf optimizer.

25.2.1. General Architecture of the Proposed Method

The proposed hybrid intelligent system uses modular granular neural networks to achieve human recognition based on the iris biometric measure; this artificial neural network was proposed in [18]. In this work, a variant of the gray wolf optimizer with a fuzzy dynamic adaptation is proposed to perform the optimization of this kind of artificial neural network. In addition, a comparison with a GWO without fuzzy dynamic parameter adaptation is performed to determine whether the fuzzy inference system improves the results. The main objective of this hybrid optimization is to adjust the optimization parameters during the search in order to find optimal modular granular neural network architectures. The main task of the optimization technique is to find the number of sub-granules (modules) and their respective neural architectures. In Figure 25.1, the granulation process is illustrated, where a whole granule represents a database. It can be divided into “m” sub-granules, where each sub-granule can be different in size; applied to human recognition, this means that each sub-granule can contain the images of a different number of persons. In this work, the gray wolf optimizer with a fuzzy dynamic adaptation performs the optimization of the granulation and the architecture of the modular granular neural networks.

Figure 25.1: The general architecture of proposed method.


Figure 25.2: The hierarchy of a gray wolf optimizer.

25.2.1.1. Description of the gray wolf optimizer

The gray wolf optimizer takes as its main paradigm the hunting behavior of gray wolves, as proposed in [30]. Each pack has between 5 and 12 wolves, and each wolf has a place in the hierarchy. The leaders, who make the important decisions, form the group called alphas [31, 32]. In Figure 25.2, the hierarchy is illustrated. The aspects and processes of the hunting behavior on which the algorithm is based, together with their mathematical models, are described below.

• Social hierarchy: The best solution is called alpha (α), the second-best solution is called beta (β), the third-best solution is called delta (δ), and the other wolves are considered omega solutions (ω).

• Encircling prey: An important process during a hunt is when gray wolves encircle the prey. This process can be represented by the following expressions:

\[ \vec{D} = |\vec{C} \cdot \vec{X}_p(t) - \vec{X}(t)|, \tag{25.1} \]

\[ \vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D}, \tag{25.2} \]

where \vec{A} and \vec{C} are coefficient vectors, \vec{X}_p is the prey position vector, \vec{X} is the position vector of a gray wolf, and t is the current iteration. The vectors \vec{A} and \vec{C} are calculated by the following equations:

\[ \vec{A} = 2\vec{a} \cdot \vec{r}_1 - \vec{a}, \tag{25.3} \]

\[ \vec{C} = 2 \cdot \vec{r}_2, \tag{25.4} \]

where \vec{r}_1 and \vec{r}_2 are random vectors with values in [0, 1] and \vec{a} is a vector whose components decrease linearly from 2 to 0 over the course of the iterations.

• Hunting: As the alpha, beta, and delta solutions are the best solutions, they know where the prey is; therefore, their positions allow the rest of the solutions to update their positions based on them. This process is represented by the following equations:

\[ \vec{D}_\alpha = |\vec{C}_1 \cdot \vec{X}_\alpha - \vec{X}|, \quad \vec{D}_\beta = |\vec{C}_2 \cdot \vec{X}_\beta - \vec{X}|, \quad \vec{D}_\delta = |\vec{C}_3 \cdot \vec{X}_\delta - \vec{X}|, \tag{25.5} \]

\[ \vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha, \quad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \cdot \vec{D}_\beta, \quad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \cdot \vec{D}_\delta, \tag{25.6} \]

\[ \vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}. \tag{25.7} \]

• Attacking prey: This process consists of exploitation. Here \vec{a} decreases from 2 to 0 over the iterations, and the \vec{A} vector contains random numbers in the interval [−a, a]. This allows a search agent to update its next position, which can be any position between its current position and the prey.

• Search for prey: This process consists of exploration. It allows divergence in the population, which the algorithm represents using \vec{A}. To avoid local optima problems and to favor exploration, a \vec{C} vector with values in the interval [0, 2] is used.

In Figure 25.3, the pseudocode of the original gray wolf optimizer is illustrated.

Figure 25.3: The pseudocode of the original gray wolf optimizer.

25.2.1.1.1. Description of the gray wolf optimizer with fuzzy dynamic parameter adaptation

The fuzzy dynamic parameter adaptation has been proposed for other optimization techniques and applied to artificial neural networks [1, 2]. In this work, a fuzzy inference system is proposed to adjust the GWO parameters. The fuzzy inference system has two input variables and three output variables. All the variables use three triangular membership functions, and their linguistic values are “Low,” “Medium,” and “High.” Figure 25.4 illustrates the structure of the fuzzy inference system.

Figure 25.4: A fuzzy inference system.

The input variables are:
• Iteration: A variable was added internally to the algorithm. It counts how many consecutive iterations the objective function has remained without changing its value. This variable is reset when its value reaches 5.
• Actual_a: This input variable takes the current value of a used in the GWO.

The output variables are:
• C: As previously shown in Eq. (25.4), r2 is necessary to calculate C. With this fuzzy variable, C is determined directly. This value is the same for all components of the C vector.
• Updated_a and r: As previously shown in Eq. (25.3), a and r1 are necessary to calculate A. With these fuzzy variables, a and r are determined directly. These values are the same for all components of the a and r vectors.

The fuzzy IF–THEN rules used in the fuzzy system are summarized in Figure 25.5. The range of each variable is shown in Table 25.1.
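To make the role of the fuzzy outputs concrete, the following is a minimal sketch of one position update following Eqs. (25.3)–(25.7), in which a, r, and C may be passed in as scalars (as they would be if supplied by a fuzzy system) or left to the random scheme of the original GWO. The function name `gwo_step`, the use of NumPy, and the minimization convention are assumptions made for illustration; the fuzzy inference system itself is not reproduced here because its rules are only given in Figure 25.5.

```python
import numpy as np

def gwo_step(positions, fitness, a, r=None, c=None, rng=None):
    """Return new positions for all wolves after one iteration (minimization).

    positions: (n_wolves, n_dims) array; fitness: (n_wolves,) array.
    a: scalar in [0, 2]. r, c: optional scalars (e.g., from a fuzzy system);
    if omitted, they are drawn at random as in the original GWO.
    """
    rng = np.random.default_rng() if rng is None else rng
    positions = np.asarray(positions, dtype=float)
    order = np.argsort(fitness)                  # lowest error first
    leaders = positions[order[:3]]               # alpha, beta, delta
    moves = []
    for leader in leaders:
        r1 = rng.random(positions.shape) if r is None else r
        coef_a = 2.0 * a * r1 - a                                        # Eq. (25.3)
        coef_c = 2.0 * rng.random(positions.shape) if c is None else c   # Eq. (25.4)
        dist = np.abs(coef_c * leader - positions)                       # Eq. (25.5)
        moves.append(leader - coef_a * dist)                             # Eq. (25.6)
    return np.mean(moves, axis=0)                                        # Eq. (25.7)
```

Passing `r` and `c` as scalars mirrors the statement that the fuzzy values are the same for every component of the corresponding vectors; leaving them as `None` recovers the randomized coefficients of Eqs. (25.3) and (25.4).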

Figure 25.5: Fuzzy rules of the fuzzy systems.

Table 25.1: Ranges of variables

Variable     Range
Iteration    1 to 5
Actual_a     0 to 2
C            0 to 2
r            0 to 1
Updated_a    0 to 2
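As an illustration of how three triangular membership functions could cover the ranges of Table 25.1, the sketch below evaluates “Low,” “Medium,” and “High” memberships for a given value. The evenly spaced placement of the triangles (peaks at the minimum, midpoint, and maximum of each range) and the helper names are assumptions; the chapter specifies only the shape, the linguistic values, and the ranges.

```python
# Ranges taken from Table 25.1.
RANGES = {
    "Iteration": (1.0, 5.0),
    "Actual_a": (0.0, 2.0),
    "C": (0.0, 2.0),
    "r": (0.0, 1.0),
    "Updated_a": (0.0, 2.0),
}

def triangular(x, a, b, c):
    """Triangular membership with feet a and c and peak b (a <= b <= c)."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def memberships(variable, x):
    """Degrees of Low/Medium/High for x, assuming evenly spaced triangles."""
    low, high = RANGES[variable]
    mid = (low + high) / 2.0
    return {
        "Low": triangular(x, low, low, mid),      # peak at the range minimum
        "Medium": triangular(x, low, mid, high),  # peak at the midpoint
        "High": triangular(x, mid, high, high),   # peak at the range maximum
    }
```

For example, under this assumed placement `memberships("Actual_a", 0.5)` gives equal degrees of 0.5 for “Low” and “Medium” and 0 for “High.”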

25.2.1.1.2. Description of the gray wolf optimizer for MGNN

The gray wolf optimizer aims at optimizing modular granular neural network architectures. The parameters that are subject to optimization are the number of sub-granules (modules), the percentage of data for the training phase, the learning algorithm, the goal error, the number of hidden layers, and the number of neurons of each hidden layer. To determine the total number of dimensions for each search agent, Eq. (25.8) is used, where each parameter to be optimized is represented by a dimension:

\[ \text{Dimensions} = 2 + (3 \cdot m) + (m \cdot h), \tag{25.8} \]

where m is the maximum number of sub-granules and h is the maximum number of hidden layers per module. The objective function is the following expression:

\[ f = \sum_{i=1}^{m} \left( \left( \sum_{j=1}^{n_m} X_j \right) / n_m \right), \tag{25.9} \]

where m is the total number of sub-granules (modules), X_j is 0 if the module provides the correct result and 1 if not, and n_m is the total number of data/images used for the testing phase in the respective module.
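Eqs. (25.8) and (25.9) can be checked with a few lines of code. The sketch below is illustrative only; the function names and the list-of-lists input format are assumptions.

```python
def agent_dimensions(m, h):
    """Eq. (25.8): m = maximum sub-granules, h = maximum hidden layers per module."""
    return 2 + 3 * m + m * h

def objective(module_outcomes):
    """Eq. (25.9): sum over modules of (misclassified / tested).

    module_outcomes holds one list per module; each entry is 0 for a correct
    result and 1 for an incorrect one (the X_j of the text).
    """
    return sum(sum(outcomes) / len(outcomes) for outcomes in module_outcomes)

# With the maxima listed in Table 25.2 (m = 10, h = 10):
print(agent_dimensions(10, 10))  # 132
```

As a small worked case, two modules with test outcomes [0, 0, 1, 0] and [0, 0] give f = 1/4 + 0 = 0.25.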

Table 25.2: Values for search space

Parameters of MNNs                 Minimum     Maximum
Modules (m)                        1           10
Percentage of data for training    20          80
Error goal                         0.000001    0.001
Learning algorithm                 1           3
Hidden layers (h)                  1           10
Neurons for each hidden layer      20          400

Figure 25.6: A structure of each search agent.

25.2.2. Proposed Method Applied to Human Recognition

In [33], a gray wolf optimizer for MGNN architectures was proposed, but without fuzzy dynamic parameter adaptation. To have a fair comparison, 10 search agents and a maximum of 30 iterations are used in this work, as in [33]. The minimum and maximum values used to limit the search space are summarized in Table 25.2. The gray wolf optimizer has two stopping conditions: when the maximum number of iterations is reached or when the best solution has an error value equal to zero. In Figure 25.6, the structure of each search agent is illustrated, and in Figure 25.7, the diagram of the proposed method is presented. For the MGNN learning phase, the learning algorithm can be selected from one of the following options:

1. Gradient descent with scaled conjugate gradient (SCG)
2. Gradient descent with adaptive learning and momentum (GDX)
3. Gradient descent with adaptive learning (GDA)
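The following sketch shows one way a flat search agent could be decoded into an architecture within the bounds of Table 25.2. The layout used here (two global values, then three values per module, then the neuron counts) is a hypothetical arrangement chosen only to match the dimension count of Eq. (25.8); the actual structure is the one shown in Figure 25.6. The names `scale` and `decode`, and the convention that agent values lie in [0, 1], are also assumptions.

```python
M_MAX, H_MAX = 10, 10   # maxima of modules and hidden layers (Table 25.2)

def scale(u, low, high):
    """Map u in [0, 1] linearly onto [low, high]."""
    return low + u * (high - low)

def decode(agent):
    """Decode a flat agent with values in [0, 1] into an MGNN description."""
    assert len(agent) == 2 + 3 * M_MAX + M_MAX * H_MAX    # Eq. (25.8)
    modules = round(scale(agent[0], 1, M_MAX))
    train_percent = scale(agent[1], 20, 80)
    design = []
    for i in range(modules):
        base = 2 + 3 * i                       # hypothetical per-module block
        layers = round(scale(agent[base + 2], 1, H_MAX))
        start = 2 + 3 * M_MAX + i * H_MAX      # hypothetical neuron block
        design.append({
            "error_goal": scale(agent[base], 1e-6, 1e-3),
            # 1 = SCG, 2 = GDX, 3 = GDA, following the numbered options above
            "learning_algorithm": round(scale(agent[base + 1], 1, 3)),
            "neurons": [round(scale(u, 20, 400))
                        for u in agent[start:start + layers]],
        })
    return {"train_percent": train_percent, "modules": design}
```

Keeping every dimension in [0, 1] and scaling at decode time is a design choice of this sketch; it lets the position update treat all dimensions uniformly regardless of their physical ranges.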


Figure 25.7: A Diagram of the proposed method.

25.2.2.1. Data selection, database, and pre-processing

The description of the database, data selection, and the pre-processing applied are presented below.

25.2.2.1.1. Data selection

In [18], a method to select images for each neural network phase (training and testing) was proposed. Each possible solution (search agent) of the optimization technique has a percentage of data. This percentage is converted to a number of images, and the images for each phase are selected randomly. In Figure 25.8, an example for the iris database is illustrated. In this case, a person has 14 images: 7 of them are for the training phase and 7 for the testing phase.

Figure 25.8: An example of images selection for training and testing phase.

25.2.2.1.2. Database

The iris database was collected and presented by the Institute of Automation of the Chinese Academy of Sciences (CASIA) [34]. Each image has a dimension of 320 × 280 pixels and is stored in JPEG format. The database was formed with information from 77 persons, where each person has 14 images. Figure 25.9 illustrates a sample of the images of the iris database.

25.2.2.1.3. Pre-processing

The pre-processing for the iris database uses the method developed by L. Masek and P. Kovesi [35]. For each image, the coordinates and radii of the iris and pupil are obtained in order to crop the iris region. The cropped image is then resized to 21 × 21 pixels. In Figure 25.10, the pre-processing is shown.
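The data selection step of Section 25.2.2.1.1 (a percentage converted to a number of images, followed by random selection) can be sketched as follows. The function name, the round-half-up conversion, and the safeguard that keeps both phases non-empty are assumptions; the chapter does not state how fractional image counts are handled.

```python
import math
import random

def split_images(n_images, train_percent, rng=None):
    """Randomly split image indices 1..n_images into training and testing sets."""
    rng = rng or random.Random()
    n_train = math.floor(n_images * train_percent / 100 + 0.5)  # round half up (assumption)
    n_train = max(1, min(n_images - 1, n_train))                # keep both phases non-empty
    ids = list(range(1, n_images + 1))
    train = sorted(rng.sample(ids, n_train))
    test = [i for i in ids if i not in train]
    return train, test

# 14 images per person and 50% for training gives 7 and 7, as in Figure 25.8.
train, test = split_images(14, 50, random.Random(0))
print(len(train), len(test))  # 7 7
```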


Figure 25.9: Sample of Iris database.

Figure 25.10: Sample pre-processing for Iris database.

25.3. Experimental Results

In this section, the results achieved by the proposed method applied to human recognition using the iris as a biometric measure are shown. The main comparison of the proposed method is against a gray wolf optimizer without fuzzy dynamic adaptation. In [33], the optimization of MGNN architectures was performed. In [36], an optimization was also performed, but with an automatic division of the images, where each division is learned by a submodule. In both works, the GWO used up to 80% of the images for the training phase. In this work, that test is replicated (test #1), and another test using up to 50% of the data for the training phase (test #2) is performed. A total of 20 runs were executed for each test to perform comparisons.

25.3.1. Results Using up to 80% of Data for Training

In this test, the proposed hybrid optimization can use up to 80% of the data for the training phase. In Table 25.3, the best five results with the proposed optimization are shown. The behavior of run #8 is illustrated in Figure 25.11, where the best, the average, and the worst results of each iteration are presented. In Figure 25.12, the alpha (first-best solution), beta (second-best solution), and delta (third-best solution) behaviors of run #8 are presented. In Figure 25.13, the average convergence of 20 runs obtained by the GWO presented in [33, 36] and by the proposed hybrid optimization is shown. In Table 25.4, a comparison of results among [33], [36], and the proposed method is shown.

Table 25.3: The best five results (Test #1)

Run 1
Training images: 75% (1, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 14)
Testing images: 25% (2, 12, and 13)
Rec. rate: 99.57%    Error: 0.0043
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 11               5 (232, 31, 20, 25, 31)
#2        12 to 17              5 (207, 62, 81, 86, 59)
#3        18 to 23              5 (200, 70, 145, 180, 190)
#4        24 to 33              5 (199, 69, 54, 209, 82)
#5        34 to 46              5 (58, 163, 210, 126, 214)
#6        47 to 56              5 (231, 54, 52, 155, 47)
#7        57 to 63              5 (41, 141, 182, 224, 202)
#8        64 to 72              5 (188, 153, 200, 150, 221)
#9        73 to 77              3 (247, 100, 69)

Run 8
Training images: 77% (1, 2, 3, 4, 5, 6, 8, 9, 11, 13, and 14)
Testing images: 23% (7, 10, and 12)
Rec. rate: 100%    Error: 0
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 12               5 (228, 146, 214, 163, 228)
#2        13 to 18              5 (196, 114, 113, 210, 96)
#3        19 to 32              5 (123, 170, 89, 65, 117)
#4        33 to 39              5 (180, 143, 200, 223, 203)
#5        40 to 41              5 (218, 225, 197, 99, 249)
#6        42 to 44              5 (142, 111, 250, 49, 221)
#7        45 to 53              5 (125, 190, 207, 117, 225)
#8        54 to 59              5 (155, 160, 164, 53, 194)
#9        60 to 65              5 (165, 150, 133, 218, 97)
#10       66 to 77              5 (89, 91, 164, 54, 168)

Run 10
Training images: 76% (1, 2, 3, 4, 5, 6, 8, 9, 11, 13, and 14)
Testing images: 24% (7, 10, and 12)
Rec. rate: 99.57%    Error: 0.0043
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 3                5 (156, 171, 24, 147, 182)
#2        4 to 14               4 (127, 151, 48, 48)
#3        15 to 31              5 (111, 146, 207, 245, 188)
#4        32 to 40              5 (156, 242, 45, 211, 154)
#5        41 to 56              5 (161, 191, 65, 108, 81)
#6        57 to 62              5 (126, 186, 70, 140, 45)
#7        63 to 74              5 (162, 76, 161, 96, 202)
#8        75 to 77              5 (157, 94, 217, 188, 225)

Run 13
Training images: 75% (1, 2, 3, 4, 5, 6, 8, 9, 11, 13, and 14)
Testing images: 25% (7, 10, and 12)
Rec. rate: 99.57%    Error: 0.0043
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 8                4 (119, 39, 131, 242)
#2        9 to 14               4 (44, 173, 119, 71)
#3        15 to 17              4 (43, 215, 49, 218)
#4        18 to 26              5 (30, 179, 96, 145, 234)
#5        27 to 28              4 (87, 210, 139, 114)
#6        29 to 37              4 (35, 115, 27, 240)
#7        38 to 47              5 (153, 84, 180, 128, 181)
#8        48 to 62              4 (63, 61, 40, 123)
#9        63 to 67              5 (29, 57, 214, 119)
#10       68 to 77              1 (148)

Run 17
Training images: 77% (1, 2, 3, 5, 6, 7, 8, 9, 10, 12, and 14)
Testing images: 23% (4, 11, and 13)
Rec. rate: 100%    Error: 0
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 4                5 (202, 140, 211, 137, 156)
#2        5 to 9                5 (167, 179, 239, 73, 108)
#3        10 to 13              5 (163, 195, 65, 142, 193)
#4        14 to 15              5 (236, 189, 153, 109, 152)
#5        16 to 25              5 (225, 71, 124, 148, 148)
#6        26 to 33              5 (32, 172, 113, 138, 198)
#7        34 to 47              5 (199, 150, 166, 164, 79)
#8        48 to 58              5 (180, 148, 112, 162, 79)
#9        59 to 72              5 (176, 199, 84, 157, 195)
#10       73 to 77              5 (228, 223, 231, 219, 224)

Figure 25.11: Convergence of run #8.

Figure 25.12: Alpha, beta, and delta behavior of run #8.

Figure 25.13: An average of convergence (up to 80%).

Table 25.4: Comparison of results (Test #1). Each cell gives the recognition rate with the error in parentheses.

Method       Best        Average            Worst
GWO [33]     100% (0)    99.04% (0.0096)    97.84% (0.0216)
GWO [36]     100% (0)    99.31% (0.0069)    98.70% (0.0130)
GWO + FIS    100% (0)    99.40% (0.0060)    99.13% (0.0087)

25.3.2. Results Using up to 50% of Data for Training

In this test, the proposed hybrid optimization can use up to 50% of the data for the training phase. In Table 25.5, the best five results with the proposed optimization are shown. The behavior of run #6 is shown in Figure 25.14, where the best, the average, and the worst results of each iteration are shown. In Figure 25.15, the alpha (first-best solution), beta (second-best solution), and delta (third-best solution) behaviors of run #6 are shown. In Figure 25.16, the average convergence of 20 runs obtained by the GWO without fuzzy dynamic parameter adaptation and by the proposed hybrid optimization is shown. The average convergence of the proposed method is better than that of the GWO without fuzzy dynamic parameter adaptation. In Table 25.6, a comparison of results between the GWO without fuzzy dynamic parameter adaptation and the proposed hybrid optimization is shown. From the values in Table 25.6, the proposed method achieves better results on average (97.16% versus 96.75%); even the worst result is improved (96.59% versus 95.82%).

Table 25.5: The best five results (Test #2)

Run 2
Training images: 47% (1, 3, 5, 6, 9, 11, and 13)
Testing images: 53% (2, 4, 6, 7, 8, 10, 12, and 14)
Rec. rate: 97.40%    Error: 0.0260
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 5                4 (82, 197, 188, 186)
#2        6 to 15               5 (184, 111, 88, 77, 191)
#3        16 to 23              5 (28, 153, 120, 69, 105)
#4        24 to 30              4 (188, 142, 159, 123)
#5        31 to 33              5 (144, 195, 65, 172, 192)
#6        34 to 35              2 (169, 175)
#7        36 to 49              5 (174, 124, 172, 161, 25)
#8        50 to 53              3 (124, 171, 73)
#9        54 to 63              3 (141, 148, 172)
#10       64 to 77              4 (48, 197, 161, 83)

Run 4
Training images: 48% (1, 3, 4, 5, 6, 9, and 13)
Testing images: 52% (2, 7, 8, 10, 11, 12, and 14)
Rec. rate: 97.40%    Error: 0.0260
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 6                5 (150, 146, 154, 186, 147)
#2        7 to 10               5 (147, 146, 86, 111, 154)
#3        11 to 18              5 (108, 134, 21, 168, 130)
#4        19 to 23              5 (142, 96, 196, 181, 71)
#5        24 to 36              5 (93, 59, 127, 141, 139)
#6        37 to 48              5 (69, 160, 150, 97, 161)
#7        49 to 62              5 (108, 180, 159, 150, 20)
#8        63 to 64              5 (129, 147, 135, 37, 199)
#9        65 to 74              5 (115, 157, 157, 161, 191)
#10       75 to 77              5 (92, 144, 197, 176, 179)

Run 6
Training images: 49% (1, 2, 3, 5, 6, 8, and 9)
Testing images: 51% (4, 7, 10, 11, 12, 13, and 14)
Rec. rate: 97.77%    Error: 0.0223
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 6                5 (179, 144, 127, 200, 157)
#2        7 to 14               5 (150, 100, 186, 82, 157)
#3        15 to 24              5 (161, 186, 169, 29, 65)
#4        25 to 41              5 (182, 138, 154, 150, 180)
#5        42 to 44              4 (162, 159, 23, 147)
#6        45 to 52              5 (170, 182, 98, 99, 63)
#7        53 to 58              5 (124, 95, 143, 155, 128)
#8        59 to 62              5 (163, 149, 59, 183, 170)
#9        63 to 73              5 (113, 195, 150, 101, 190)
#10       74 to 77              5 (182, 179, 193, 194, 52)

Run 9
Training images: 43% (2, 3, 4, 5, 6, and 13)
Testing images: 57% (1, 7, 8, 9, 10, 11, 12, and 14)
Rec. rate: 97.40%    Error: 0.0260
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 16               5 (155, 188, 175, 170, 105)
#2        17 to 26              5 (165, 160, 194, 147, 26)
#3        27 to 38              5 (132, 72, 24, 166, 61)
#4        39 to 46              5 (133, 29, 153, 40, 32)
#5        47 to 48              5 (154, 57, 194, 89, 142)
#6        49 to 52              5 (198, 166, 177, 58, 82)
#7        53 to 60              5 (158, 92, 127, 196, 170)
#8        61 to 71              5 (175, 178, 120, 29, 157)
#9        72 to 77              5 (139, 164, 193, 162, 139)

Run 16
Training images: 50% (2, 3, 4, 5, 6, 9, and 13)
Testing images: 52% (1, 7, 8, 10, 11, 12, and 14)
Rec. rate: 97.59%    Error: 0.0241
Module    Persons per module    Num. hidden layers (num. of neurons)
#1        1 to 11               5 (179, 199, 61, 163, 136)
#2        12 to 16              5 (181, 153, 157, 168, 189)
#3        17 to 22              5 (171, 190, 174, 100, 43)
#4        23 to 25              5 (32, 177, 146, 178, 90)
#5        26 to 41              5 (144, 149, 151, 170, 178)
#6        42 to 54              5 (161, 171, 176, 52, 51)
#7        55 to 63              5 (155, 174, 108, 185, 127)
#8        64 to 72              5 (180, 165, 150, 172, 160)
#9        73 to 77              5 (167, 130, 116, 119, 131)

Figure 25.14: Convergence of run #6.

Figure 25.15: Alpha, beta, and delta behavior of run #6.

Figure 25.16: Average of convergence (up to 50%).

Table 25.6: Comparison of results (Test #2). Each cell gives the recognition rate, with the error in parentheses where it is available.

Method          Best               Average            Worst
GWO             97.59% (0.0241)    96.75% (0.0325)    95.82% (0.0418)
Proposed GWO    97.77%             97.16%             96.59%

Table 25.7: Values of test #1.

Method             N     Mean      Standard deviation    Standard error of the mean    Estimated difference    t value    p value    Degrees of freedom
Melin P. [33]      20    99.036    0.478                 0.11                          −0.369                  −2.98      0.0055     31
Proposed Method    20    99.405    0.280                 0.063
Sanchez D. [36]    20    99.307    0.407                 0.097                         −0.097                  −0.88      0.384      33
Proposed Method    20    99.405    0.280                 0.063

Figure 25.17: Sample distribution (up to 80%, comparison #1).

25.4. Statistical Comparison of Results

To verify whether the results achieved by the proposed method are significantly better, statistical t-tests are performed. In these t-tests, the recognition rates reported in Section 25.3 are used.

25.4.1. Statistical Comparison for Test #1

In Table 25.7, the values obtained in the t-tests between the proposed optimization and the methods of [33] and [36] are shown, where t values of −2.98 and −0.88 are obtained, respectively. With p = 0.0055, there is sufficient evidence to say that the proposed optimization improves results when compared with the method presented in [33]; with p = 0.384, the difference with respect to [36] is not statistically significant. In Figures 25.17 and 25.18, the distribution of the samples is illustrated.
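The t value for the comparison with [33] can be reproduced from the summary statistics in Table 25.7. The sketch below uses the unequal-variance (Welch) form of the two-sample t statistic; this choice is an assumption, since the chapter does not name the variant used, although the reported degrees of freedom (31, 33) being below the pooled value of 38 are at least consistent with it. The function name is illustrative.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2           # squared standard errors
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof

# Melin P. [33] versus the proposed method (Table 25.7)
t, dof = welch_t(99.036, 0.478, 20, 99.405, 0.280, 20)
print(round(t, 2))  # -2.98
```

The degrees of freedom computed from the rounded table values come out close to, but not exactly at, the reported integer, which is expected because the published means and standard deviations are themselves rounded.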

25.4.2. Statistical Comparison for Test #2

In Table 25.8, the values obtained in the t-test between the proposed method and the GWO without fuzzy dynamic parameter adaptation are shown, where the t value is −3.23 (p = 0.0027). This means that there is sufficient evidence to say that the proposed optimization improved results. In Figure 25.19, the distribution of the samples is illustrated.

Figure 25.18: Sample distribution (up to 80%, comparison #2).

Table 25.8: Values of Test #2.

Method             N     Mean      Standard deviation    Standard error of the mean    Estimated difference    t value    p value    Degrees of freedom
GWO                20    96.750    0.472                 0.11                          −0.413                  −3.23      0.0027     34
Proposed Method    20    97.164    0.325                 0.073

Figure 25.19: Sample distribution (up to 50%).

25.5. Conclusions

In this chapter, a fuzzy inference system to perform dynamic parameter adaptation for a gray wolf optimizer was proposed. To compare the proposed optimization with a gray wolf optimizer without this adjustment, human recognition was performed using the iris biometric measure. Both optimization techniques design modular granular neural network architectures. The design consists of the number of sub-granules, the percentage of data for the training phase, the error goal, the learning algorithm, the number of hidden layers, and their respective number of neurons. The main objective is to minimize the recognition error. The results show improvements when the proposed fuzzy inference system is used; statistical comparisons were also performed, mainly with a previous work where a gray wolf optimizer was developed for MGNN optimization using the same general architecture. In conclusion, the proposed hybrid optimization generally improved the recognition rates because it allowed the parameters to change in each iteration depending on the behavior of the gray wolf population. In future works, other parameters will be considered, and other fuzzy inference systems will be proposed to improve results, especially when less than 50% of the data are used for the training phase.

References [1] B. González, F. Valdez, P. Melin and G. Prado-Arechiga, Fuzzy logic in the gravitational search algorithm enhanced using fuzzy logic with dynamic alpha parameter value adaptation for the optimization of modular neural networks in echocardiogram recognition, Appl. Soft Comput., 37, 245–254 (2015). [2] D. Sánchez, P. Melin and O. Castillo, Particle swarm optimization with fuzzy dynamic parameters adaptation for modular granular neural networks, in J. Kacprzyk, E. Szmidt, S. Zadro˙zny, K. Atanassov and M. Krawczak (eds.), Advances in Fuzzy Logic and Technology 2017. EUSFLAT 2017, IWIFSGN 2017. Advances in Intelligent Systems and Computing, vol. 643, Springer, Cham, pp. 277–288 (2017). [3] R. Abiyev and K. Altunkaya, Personal iris recognition using neural network, Int. J. Secur. Appl., 2, 41–50 (2008). [4] M. De Marsico, A. Petrosino, and S. Ricciardi, Iris recognition through machine learning techniques: a survey, Pattern Recognit. Lett., 82(2), 106–115 (2016). [5] A. M. Patil, D. S. Patil and P. Patil, Iris recognition using gray level co-occurrence matrix and Hausdorff dimension, Int. J. Comput. Appl., 133(8), 29–34 (2016). [6] M. De Marsico, M. Nappi and H. Proença, Human Recognition in Unconstrained Environments, 1st edn., Academic Press (2017). [7] S. Furnell and N. Clarke, Power to the people? The evolving recognition of human aspects of security, Comput. Secur., 31(8), 983–988 (2012). [8] G. Schryen, G. Wagner and A. Schlegel, Development of two novel face-recognition CAPTCHAs: a security and usability study, Comput. Secur., 60, 95–116 (2016). [9] M. Nixon, P. Correia, K. Nasrollahi, T. B. Moeslund, A. Hadid and M. Tistarelli, On soft biometrics, Pattern Recognit. Lett., 68(2), 218–230 (2015). [10] J. Heaton, Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks, 1st edn., CreateSpace Independent Publishing Platform (2015). [11] L. A. 
Zadeh, Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems, Soft Comput., 2, 23–25 (1998).

page 970

June 1, 2022 13:19

Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch25

Fuzzy Dynamic Parameter Adaptation for Gray Wolf Optimization

page 971

971

[12] L. A. Zadeh and J. Kacprzyk, Fuzzy Logic for the Management of Uncertainty, 1st edn., Wiley-Interscience (1992).
[13] R. Chellappa, The changing fortunes of pattern recognition and computer vision, Image Vision Comput., 55(1), 3–5 (2016).
[14] Z.-T. Liu, M. Wu, W.-H. Cao, J.-W. Mao, J.-P. Xu and G.-Z. Tan, Speech emotion recognition based on feature selection and extreme learning machine decision tree, Neurocomputing, 273, 271–280 (2018).
[15] L. K. Hansen and P. Salomon, Neural network ensembles, IEEE Trans. Pattern Anal. Mach. Intell., 12, 993–1001 (1990).
[16] A. Sharkey, Combining Artificial Neural Nets: Ensemble and Modular Multi-net Systems, 1st edn., Springer (1999).
[17] F. Azam, Biologically inspired modular neural networks, PhD thesis, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA (2000).
[18] D. Sánchez and P. Melin, Optimization of modular granular neural networks using hierarchical genetic algorithms for human recognition using the ear biometric measure, Eng. Appl. Artif. Intell., 27, 41–56 (2014).
[19] A. Gacek, Granular modelling of signals: a framework of granular computing, Inf. Sci., 221, 1–11 (2013).
[20] C. Zhong, W. Pedrycz, D. Wang, L. Li and Z. Li, Granular data imputation: a framework of granular computing, Appl. Soft Comput., 46, 307–316 (2016).
[21] J. H. Holland, Adaptation in Natural and Artificial Systems, 1st edn., University of Michigan Press (1975).
[22] K. F. Man, K. S. Tang, and S. Kwong, Genetic Algorithms: Concepts and Designs, 2nd edn., Springer (1999).
[23] M. Dorigo, Optimization, learning and natural algorithms, PhD thesis, Politecnico di Milano, Italy (1992).
[24] J. Kennedy and R. Eberhart, Particle swarm optimization, in Proceedings of ICNN’95, International Conference on Neural Networks, Perth, WA, Australia, vol. 4, pp. 1942–1948 (1995).
[25] X.-S. Yang, Bat algorithm: literature review and applications, Int. J. Bio-Inspired Comput., 5(3), 141–149 (2013).
[26] Y. Tan, Fireworks Algorithm: A Novel Swarm Intelligence Optimization, 1st edn., Springer (2015).
[27] M. Farooq, Genetic algorithm technique in hybrid intelligent systems for pattern recognition, Int. J. Innovative Res. Sci. Eng. Technol., 4(4), 1891–1898 (2015).
[28] P. Melin, D. Sánchez and O. Castillo, Genetic optimization of modular neural networks with fuzzy response integration for human recognition, Inf. Sci., 197, 1–19 (2012).
[29] G. Villarrubia, J. F. De Paz, P. Chamoso and F. De la Prieta, Artificial neural networks used in optimization problems, Neurocomputing, 272, 10–16 (2018).
[30] S. Mirjalili, S. M. Mirjalili, and A. Lewis, Grey wolf optimizer, Adv. Eng. Software, 69, 46–61 (2014).
[31] L. D. Mech, Alpha status, dominance, and division of labor in wolf packs, Can. J. Zool., 77, 1196–1203 (1999).
[32] C. Muro, R. Escobedo, L. Spector and R. Coppinger, Wolf-pack (Canis lupus) hunting strategies emerge from simple rules in computational simulations, Behav. Processes, 88, 192–197 (2011).
[33] P. Melin and D. Sánchez, A grey wolf optimization algorithm for modular granular neural networks applied to iris recognition, in A. Abraham, P. K. Muhuri, A. K. Muda and N. Gandhi (eds.), Advances in Intelligent Systems and Computing, 17th International Conference on Hybrid Intelligent Systems (HIS 2017), Delhi, India, Springer, pp. 282–293 (2017).

June 1, 2022 13:19


Handbook on Computer Learning and Intelligence - Vol. II…– 9.61in x 6.69in

b4528-v2-ch25

Handbook on Computer Learning and Intelligence — Vol. II

[34] Database of human iris. Institute of Automation of Chinese Academy of Sciences (CASIA) (2020). http://www.cbsr.ia.ac.cn/english/IrisDatabase.asp. [35] L. Masek and P. Kovesi, MATLAB source code for a biometric identification system based on iris patterns, The School of Computer Science and Software Engineering, The University of Western Australia (2003). [36] D. Sánchez, P. Melin and O. Castillo, A grey wolf optimizer for modular granular neural networks for human recognition, Comput. Intell. Neurosci., 2017, 1–26 (2017).


September 14, 2022 17:17


b4528-v2-ch26

© 2022 World Scientific Publishing Company https://doi.org/10.1142/9789811247323_0026

Chapter 26

Evaluating Inter-task Similarity for Multifactorial Evolutionary Algorithm from Different Perspectives

Lei Zhou∗,¶, Liang Feng†, Min Jiang‡,∗∗, and Kay Chen Tan§,††

∗School of Computer Science and Engineering, Nanyang Technological University, Singapore
†College of Computer Science, Chongqing University, China
‡Department of Artificial Intelligence, Xiamen University, China
§Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR
¶[email protected]  [email protected] ∗∗[email protected] ††[email protected]

The multifactorial evolutionary algorithm (MFEA) is a recently proposed method toward evolutionary multitasking. In contrast to the traditional single-task evolutionary search, MFEA conducts evolutionary search on multiple tasks simultaneously. It aims to improve convergence characteristics across multiple optimization problems at once by seamlessly transferring knowledge among them. For evolutionary multitasking, the similarity between tasks has great impact on the performance of MFEA. However, in the literature, only a few works have been conducted to provide deeper insights into the measure of task relationship in MFEA. In this chapter, we present a study of similarity measures between tasks for MFEA from three different viewpoints, i.e., the distance between best solutions, the fitness rank correlation, and the fitness landscape analysis. Furthermore, 21 multitasking problem sets are developed to investigate and analyze the effectiveness of the three similarity measures with MFEA for evolutionary multitasking.


26.1. Introduction

Taking inspiration from Darwin’s theory of evolution, evolutionary algorithms (EAs) have achieved great success in solving complex optimization problems, e.g., nonlinear, multimodal, and discrete NP-hard problems [1, 2]. By imitating mechanisms of biological evolution, such as reproduction and natural selection, EAs are capable of preserving useful information from elite individuals and maintaining the balance between exploration and exploitation, which directs optimization toward promising areas of the problem search space [3]. EAs have been demonstrated to provide near-optimal, if not optimal, solutions to many real-world optimization problems, including vehicle routing [4, 5], aircraft engine design [6], and image processing [7, 8]. However, as most existing EAs work in a single-task mode, the similarity between problems, which could enhance the evolutionary search process, is ignored in present EA design.

Recently, in contrast to traditional EAs, the multifactorial evolutionary algorithm (MFEA) has been proposed for evolutionary multitasking, i.e., solving multiple optimization problems (tasks) concurrently [9]. With the introduction of a unified representation for different tasks, MFEA is capable of exploiting the latent synergies between distinct (but possibly similar) optimization problems. In [9–11], MFEA demonstrated superior search performance over the traditional single-task EA in terms of solution quality and convergence speed on a set of continuous tasks, discrete tasks, and mixtures of continuous and combinatorial tasks.

In spite of the success obtained by MFEA, it is worth noting that its performance is greatly affected by the similarity between tasks [12]. Specifically, similar tasks can often enhance the problem solving on all the tasks, while dissimilar problems may deteriorate the optimization performance on one or all the tasks due to negative transfer across tasks. It is thus desirable to identify the inter-task relationship before the execution of evolutionary multitasking, so that similar tasks can be grouped and solved via MFEA.

In the literature, a few works have studied task similarity for MFEA. In particular, in [13], the authors presented several illustrative examples to show how the synergy of tasks may affect the optimization performance. Further, a synergy metric ξ was proposed to measure the correlation between the objective functions, which is then used to predict the performance of MFEA. Particularly, when ξ > 0, an improved optimization performance is expected. However, as the calculation of ξ requires information about the global optimal solution and function gradients, it is impractical in real-world applications, where the global optimum is usually unknown and the gradients cannot be calculated. On the other hand, in [12], Spearman’s rank correlation coefficient (SRCC) is employed as a measure for task similarity evaluation. In contrast to the synergy metric, it only requires the objective function values of a set of randomly sampled solutions. Therefore, this method can be applied to a broader range of optimization problems. In that study, to verify the similarity metric, nine multitasking sets are divided into three categories with respect to task similarity based on predefined thresholds of SRCC, which are denoted as high similarity (HS), medium similarity (MS), and low similarity (LS). The empirical results obtained with MFEA on these multitasking sets generally agree with the similarity categories of the benchmarks. However, as the number of multitasking sets investigated in [12] is small, the generalization capability of Spearman’s rank correlation coefficient needs to be further confirmed.

In the field of statistics and machine learning, a variety of methods have been introduced to evaluate the task relationship from different views [14–16]. In contrast, the study of task similarity for multitasking in optimization has received far less attention. Keeping this in mind, in this chapter, we embark on a study to measure the similarity between tasks for MFEA from three different views, i.e., the distance between best solutions, the fitness rank correlation, and the fitness landscape analysis. Particularly, we adopt the maximum mean discrepancy (MMD) to approximate the distance between best solutions, Spearman’s rank correlation coefficient (SRCC) to measure the fitness rank correlation as in [12], and fitness distance correlation (FDC) to analyze the fitness landscape similarity across tasks. Further, we develop new multitasking benchmarks, which contain 21 multitasking problem sets, to investigate and analyze the effectiveness of the three similarity measures with MFEA for evolutionary multitasking.

The rest of this chapter is organized as follows. A brief introduction to evolutionary multitasking and the multifactorial evolutionary algorithm is given in Section 26.2. Section 26.3 gives the motivation behind the three views as well as the detailed methods of the similarity measures for MFEA in evolutionary multitasking. Section 26.4 contains empirical studies on the 21 newly developed multitasking sets to investigate the effectiveness of the three similarity measures. Lastly, the concluding remarks of this chapter and potential directions for future research are discussed in Section 26.5.

26.2. Preliminaries

In this section, the concept of evolutionary multitasking is introduced, followed by a brief review of related work. Next, the details of the multifactorial evolutionary algorithm are presented.

26.2.1. Evolutionary Multitasking

In contrast to traditional EAs that solve one problem in a single run, evolutionary multitasking aims to tackle multiple optimization tasks simultaneously by exploiting the synergies between distinct tasks to enhance the optimization performance. Given $K$ minimization tasks $\{f_1, f_2, \dots, f_K\}$, the objective of evolutionary multitasking is to find $\{x_1, x_2, \dots, x_K\} = \operatorname{argmin}\{f_1(x), f_2(x), \dots, f_K(x)\}$ at the same time.

Since the pioneering evolutionary multitask algorithm MFEA was first proposed in [9], research efforts devoted to evolutionary multitasking have grown explosively. In particular, a plethora of strategies were proposed to enhance the performance of MFEA, such as decision variable manipulation [17, 18], search space alignment [19, 20], resource allocation [21, 22], and the control of knowledge transfer frequency [23]. Further, Feng et al. [24] diversified the family of MFEA with multifactorial particle swarm optimization (MFPSO) and multifactorial differential evolution (MFDE). Gupta et al. [25] extended MFEA to handle multi-objective optimization problems. Liaw et al. [26], Tang et al. [27], and Chen et al. [28] considered the simultaneous optimization of more than three tasks, which is also known as evolutionary many-tasking. Feng et al. [29] proposed an explicit multitask framework in which knowledge is explicitly exchanged between tasks in the form of solutions. Besides algorithm design, evolutionary multitasking has also been successfully applied to a wide range of real-world problems, such as machine learning [30, 31], operational research [32, 33], and engineering design [34, 35].

26.2.2. The Multifactorial Evolutionary Algorithm

The multifactorial evolutionary algorithm (MFEA) was proposed as an implementation of the evolutionary multitasking paradigm [9]. In MFEA, individuals are encoded in a unified search space and translated into a task-specific solution with respect to each of the tasks when evaluated. To compare the individual solutions in a multitasking environment, the following properties are defined for each individual in MFEA.

• Factorial Cost: The factorial cost $f_p$ of an individual $p$ denotes its fitness or objective value on a particular task $T_i$. For $K$ tasks, there is a vector of length $K$, in which each dimension gives the fitness of $p$ on the corresponding task.
• Factorial Rank: The factorial rank $r_p$ simply denotes the index of individual $p$ in the list of population members sorted in ascending order with respect to their factorial costs on one task; $r_p^j$ denotes this rank on task $j$.
• Scalar Fitness: The scalar fitness $\varphi_p$ of an individual $p$ is defined based on its best rank over all tasks, which is given by $\varphi_p = 1 / \min_{j \in \{1, \dots, K\}} r_p^j$. Scalar fitness is used to select the solutions for survival in the next generation.
• Skill Factor: The skill factor $\tau_p$ of individual $p$ denotes the task, amongst all tasks in multifactorial optimization (MFO), on which $p$ is most effective, i.e., $\tau_p = \operatorname{argmin}_j \{r_p^j\}$, where $j \in \{1, \dots, K\}$.

The pseudocode of MFEA is presented in Algorithm 26.1. The MFEA first initializes a population of N individuals and evaluates each of them on all tasks (lines 1–2). Then, the skill factor and scalar fitness are calculated for each individual in the population (line 3). Next, an offspring population is generated in each generation by assortative mating and evaluated via vertical cultural transmission (lines 5–6). Particularly, the assortative mating procedure applies genetic operators, i.e., crossover and mutation, to selected parents, where individuals with different skill factors are mated with a predefined probability. The vertical cultural transmission evaluates the offspring on specific tasks based on the skill factor of their parents. Finally, the scalar fitness and skill factor of individuals in both the parent and offspring populations are updated before the natural selection of solutions for the next generation (lines 7–8). The whole process continues until the stopping criteria are satisfied.

Algorithm 26.1: Pseudocode for MFEA.
1 Generate an initial population with N individuals.
2 Evaluate each individual on all the tasks.
3 Calculate scalar fitness and skill factor for each individual.
4 While stopping criteria are not satisfied, do
5     Apply assortative mating to generate an offspring population.
6     Evaluate offspring individuals via vertical cultural transmission.
7     Update the scalar fitness and skill factor of individuals in the parent and offspring populations.
8     Select the fittest N individuals to survive for the next generation.
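As a minimal sketch of the four properties defined above, the following Python code derives factorial ranks, scalar fitness, and skill factor from a matrix of factorial costs. The function names, the 0-based task index used for the skill factor, and the tie-breaking rule (ties resolved by population order) are illustrative assumptions, not part of the MFEA description in [9].

```python
from typing import List, Tuple


def factorial_ranks(costs: List[List[float]]) -> List[List[int]]:
    """costs[p][j] is the factorial cost of individual p on task j (minimization).

    Returns ranks[p][j], the 1-based position of p when the population is
    sorted in ascending order of cost on task j (ties broken by index).
    """
    n, k = len(costs), len(costs[0])
    ranks = [[0] * k for _ in range(n)]
    for j in range(k):
        order = sorted(range(n), key=lambda p: costs[p][j])
        for position, p in enumerate(order, start=1):
            ranks[p][j] = position
    return ranks


def scalar_fitness_and_skill(costs: List[List[float]]) -> List[Tuple[float, int]]:
    """For each individual, return (scalar fitness, skill factor)."""
    result = []
    for r in factorial_ranks(costs):
        best_rank = min(r)
        skill = r.index(best_rank)  # first task on which the best rank is attained
        result.append((1.0 / best_rank, skill))
    return result


# Hypothetical population of three individuals evaluated on two tasks.
costs = [
    [0.2, 9.0],
    [0.5, 1.0],
    [0.9, 4.0],
]
result = scalar_fitness_and_skill(costs)
print(result)
```

In this toy population the first individual is best on task 0 and the second is best on task 1, so both obtain scalar fitness 1, while the third individual ranks second on task 1 and obtains scalar fitness 0.5.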

26.3. Measuring the Similarity between Tasks for MFEA from Different Views

In this section, to explore the similarity measure of tasks for enhanced evolutionary multitasking, we present a study to investigate the task similarity from three different views. One is from the distance between best solutions, one is from the fitness rank correlation of sampled solutions, and the last is based on the fitness distance correlation. In particular, for evaluating task similarity based on fitness rank correlation, we employ the Spearman’s rank correlation coefficient used in [12]. For the other two, we propose two new similarity metrics based on the best solution discrepancy and fitness landscape analysis, respectively.

[Figure 26.1 plot: function value (vertical axis, 0 to 4) versus x (horizontal axis, 0 to 1) for tasks T1 and T2.]

Figure 26.1: The function graph of two functions that have a common global optimum.

26.3.1. Similarity Measure Based on Distance between Best Solutions

As discussed in [9], the key idea behind evolutionary multitasking is the sharing of useful traits across optimization tasks along the search process. If a high-quality solution found in one task happens to be a good solution of another, the sharing of this solution will enhance the corresponding optimization process significantly. For example, as illustrated in Figure 26.1, tasks $T_1$ and $T_2$ have a common global optimum. $T_1$ has a smooth landscape and is easy to solve, while $T_2$ contains many local optima. In this case, if the global optimum of $T_1$ has been found and transferred to $T_2$, $T_2$ will be solved directly at no extra cost. Taking this cue, an intuitive way of measuring the similarity between tasks for MFEA is based on the distance or discrepancy between the best solution of each task.

In practice, the optimum of a given task of interest is usually unknown beforehand, so it is hard to obtain the corresponding distance or similarity between tasks directly. In this chapter, we propose to approximate the distance between tasks with the maximum mean discrepancy (MMD) of the best sampled solutions of each task. MMD is a widely used statistical measure for comparing distributions in a Reproducing Kernel Hilbert Space (RKHS). It estimates the distance between two distributions with the distance between their corresponding means in the RKHS [16].

In particular, the pseudocode for the similarity measure with MMD is outlined in Algorithm 26.2. Given two tasks $T_1$ and $T_2$ with dimensions $D_1$ and $D_2$, two sets of solutions $S_1$ and $S_2$ of size $n$ with dimension $D_{\max} = \max(D_1, D_2)$ are uniformly and independently sampled in the unified search space. Next, the fitness vectors $F_1$ and $F_2$ are assigned to $S_1$ and $S_2$ based on the evaluation on $T_1$ and $T_2$, respectively. Further, the first $m$ best solutions, denoted as $S_{b1}$ and $S_{b2}$, are selected from $S_1$ and $S_2$ accordingly to calculate the MMD,¹ which is as follows:

$$\mathrm{MMD} = \left\| \frac{1}{m} \sum_{i=1}^{m} g\left(S_{b1}^{i}\right) - \frac{1}{m} \sum_{i=1}^{m} g\left(S_{b2}^{i}\right) \right\|^{2}, \tag{26.1}$$

where $\|\cdot\|$ denotes the Euclidean norm, and $g(s)$ extracts the first $D_{\min} = \min(D_1, D_2)$ dimensions of a solution $s$. A smaller MMD value implies that the distributions are more similar to each other.

Algorithm 26.2: Pseudocode for calculating the similarity measure with MMD.
Input: Two tasks T1 and T2.
Output: MMD value.
1 Uniformly and independently sample two sets of solutions S1 and S2 from the unified search space.
2 Evaluate S1 and S2 on T1 and T2, respectively.
3 Preserve the first m best solutions Sb1 and Sb2 from S1 and S2, respectively.
4 Calculate the MMD value using Eq. (26.1).
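The steps of Algorithm 26.2 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the unified search space is taken to be $[0, 1]^{D_{\max}}$, each task is a callable that reads only its own leading dimensions of a unified vector, the toy sphere tasks and all function names are hypothetical, and the sample sizes are far smaller than those used in the experiments of this chapter.

```python
import random
from typing import Callable, List

Vector = List[float]
Task = Callable[[Vector], float]


def make_sphere(center: float, dim: int) -> Task:
    """Hypothetical toy task: a sphere function on the first `dim` unified variables."""
    return lambda x: sum((xi - center) ** 2 for xi in x[:dim])


def best_samples(task: Task, n: int, m: int, d_max: int, rng: random.Random) -> List[Vector]:
    """Steps 1-3 of Algorithm 26.2 for one task: sample n points, keep the m best."""
    samples = [[rng.random() for _ in range(d_max)] for _ in range(n)]
    return sorted(samples, key=task)[:m]


def mmd_linear(sb1: List[Vector], sb2: List[Vector], d_min: int) -> float:
    """Eq. (26.1) with a linear mapping: squared distance between the two mean vectors."""
    total = 0.0
    for k in range(d_min):
        mean1 = sum(s[k] for s in sb1) / len(sb1)
        mean2 = sum(s[k] for s in sb2) / len(sb2)
        total += (mean1 - mean2) ** 2
    return total


rng = random.Random(0)
n, m = 2000, 50                    # illustrative sizes only
t1 = make_sphere(0.5, dim=3)
t2 = make_sphere(0.5, dim=2)       # optimum at the same location as t1
t3 = make_sphere(0.1, dim=2)       # optimum far from that of t1
d_max, d_min = 3, 2

sb1 = best_samples(t1, n, m, d_max, rng)
near = mmd_linear(sb1, best_samples(t2, n, m, d_max, rng), d_min)
far = mmd_linear(sb1, best_samples(t3, n, m, d_max, rng), d_min)
print(f"MMD(t1, t2) = {near:.4f}, MMD(t1, t3) = {far:.4f}")
```

The pair whose optima coincide yields a much smaller MMD than the pair whose optima are far apart, which is the behavior the measure relies on.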

¹ In this study, linear mapping is considered in the calculation of MMD.

26.3.2. Similarity Measure Based on Fitness Rank Correlation

According to [12], the fitness rank correlation estimates the degree of similarity between two rankings of a single set of solutions according to their fitness values on two different tasks. In contrast to the measure in Section 26.3.1, which only focuses on the best solutions, the fitness rank correlation provides a similarity measure based on all the sampled solutions. The rationale behind it is that if the solutions that perform well on one task are also good on the other, the improvement on one task is likely to be beneficial to the other. Therefore, it is expected that MFEA could be effective on two tasks with a high positive fitness rank correlation.

The calculation of the fitness rank correlation is based on Spearman’s rank correlation coefficient (SRCC) used in [12]. The pseudocode for the similarity measure with SRCC is presented in Algorithm 26.3. Given two tasks $T_1$ and $T_2$, $n$ solutions are uniformly and independently sampled from the unified search space. By evaluating these $n$ solutions on $T_1$ and $T_2$, respectively, the corresponding rank vectors $r_1$ and $r_2$ can be obtained based on the resulting fitness values. The SRCC value is then calculated as follows:

$$\mathrm{SRCC} = \frac{\operatorname{cov}(r_1, r_2)}{\sigma(r_1)\,\sigma(r_2)}, \tag{26.2}$$

where cov and $\sigma$ denote the covariance and standard deviation, respectively. Note that a higher value of SRCC indicates a more similar relationship between tasks.

Algorithm 26.3: Pseudocode for calculating the similarity measure with SRCC.
Input: Two tasks T1 and T2.
Output: SRCC value.
1 Uniformly and independently sample n solutions S from the unified search space.
2 Evaluate S on T1 and T2, respectively.
3 Obtain the rank vectors r1 and r2 by ranking the solutions with respect to their fitness on T1 and T2, respectively.
4 Calculate the SRCC value using Eq. (26.2).
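A small Python sketch of Algorithm 26.3 is given below. It assumes that no two samples share the same fitness value (ties are simply broken by sample order rather than averaged), and the helper names and toy tasks are hypothetical, chosen so that the two extreme cases of Eq. (26.2) are visible.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def rank_vector(values: Sequence[float]) -> List[int]:
    """1-based ranks in ascending order of value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """cov(a, b) / (sigma(a) * sigma(b)), as in Eq. (26.2)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = (sum((x - mean_a) ** 2 for x in a) / n) ** 0.5
    std_b = (sum((y - mean_b) ** 2 for y in b) / n) ** 0.5
    return cov / (std_a * std_b)


def srcc(task1: Callable[[Vector], float], task2: Callable[[Vector], float],
         samples: List[Vector]) -> float:
    """Steps 2-4 of Algorithm 26.3 on a shared set of sampled solutions."""
    r1 = rank_vector([task1(s) for s in samples])
    r2 = rank_vector([task2(s) for s in samples])
    return correlation(r1, r2)


rng = random.Random(1)
samples = [[rng.random() for _ in range(5)] for _ in range(500)]


def sphere(x: Vector) -> float:
    return sum(v * v for v in x)


same_order = srcc(sphere, lambda x: 3.0 * sphere(x) + 1.0, samples)
reverse_order = srcc(sphere, lambda x: -sphere(x), samples)
print(f"same ordering: {same_order:.3f}, reversed ordering: {reverse_order:.3f}")
```

Because SRCC depends only on ranks, a task and any strictly increasing transformation of it are treated as maximally similar, whereas a task and its negation are treated as maximally dissimilar.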

26.3.3. Similarity Measure Based on Fitness Landscape Analysis

Fitness landscape analysis (FLA) provides a systematic way to analyze the properties of a function’s search space [36]. It thus makes it possible to measure the similarity between tasks from the view of the tasks’ landscapes. To illustrate, Figure 26.2 gives a multitasking example of two tasks. In this figure, the blue and red curves represent tasks $T_1$ and $T_2$, respectively, which are going to be solved by MFEA simultaneously. The global optima of $T_1$ and $T_2$ are located at C and D, respectively. Assume the optimizations of both tasks start at point A. It is straightforward to see that task $T_1$ is able to find the global optimum at point C quickly, while task $T_2$ may get stuck at the local optimum A. However, if proper guidance can be derived based on the landscape analysis of $T_1$ and $T_2$, and the solution of $T_1$ at point C is transferred to $T_2$, it will direct the search of $T_2$ toward the corresponding global optimum at point D.

[Figure 26.2 plot: landscapes of tasks T1 and T2 with marked points A, B, C, and D.]

Figure 26.2: An illustrative example for similarity measure between tasks based on fitness landscape analysis.

In the literature, the fitness distance correlation (FDC) is one of the commonly used measures for fitness landscape analysis. It was proposed by Terry Jones and Stephanie Forrest to predict the performance of genetic algorithms (GAs) on functions with known global minima (or maxima) [37]. It is based on the evaluation of how the solutions’ fitness values are correlated with their distances to the global optimum. However, the original FDC evaluates the fitness-distance correlation of the same task, and is thus not able to measure the cross-task correlation required in the case of MFEA. Keeping this in mind, we propose a variant of FDC, called cross-task FDC (CTFDC), for the similarity measure of tasks in evolutionary multitasking.

Algorithm 26.4: Pseudocode for calculating the similarity measure with CTFDC.
Input: Two tasks T1 and T2.
Output: CTFDC value.
1 Uniformly and independently sample n solutions S from the unified search space.
2 Obtain the fitness vectors f1 and f2 by evaluating S on T1 and T2, respectively.
3 Get the best solutions s1 of T1 and s2 of T2.
4 For each solution si in S, calculate the distance between si and s1 (s2) to form the distance vector d1 (d2).
5 Calculate the CTFDC value using Eq. (26.3).

In particular, to calculate the cross-task FDC, as depicted in Algorithm 26.4, we first uniformly and independently sample $n$ solutions from the search space and evaluate them on tasks $T_1$ and $T_2$, respectively, to obtain the $n$-dimensional fitness vectors $\mathbf{f}_1$ and $\mathbf{f}_2$. Then, the solution with the best fitness value on each task, denoted as $s_1$ and $s_2$, is considered as the “global optimum” for the two tasks, respectively. Further, we obtain the distance vectors $\mathbf{d}_1$ and $\mathbf{d}_2$ by computing the Euclidean distance between the corresponding best solution (i.e., $s_1$ and $s_2$) and the other sampled solutions in each of the two tasks. Finally, CTFDC can be calculated as follows:

$$\mathrm{CTFDC}_{T_1 \to T_2} = \frac{\frac{1}{n} \sum_{i=1}^{n} \left(f_{1,i} - \bar{f}_1\right)\left(d_{2,i} - \bar{d}_2\right)}{\sigma(\mathbf{f}_1)\,\sigma(\mathbf{d}_2)}, \tag{26.3}$$

where $\bar{f}_1$ and $\bar{d}_2$, $\sigma(\mathbf{f}_1)$ and $\sigma(\mathbf{d}_2)$ are the means and standard deviations of $\mathbf{f}_1$ and $\mathbf{d}_2$, respectively. Note that by replacing $f_{1,i}$, $d_{2,i}$, $\bar{f}_1$, $\bar{d}_2$, $\mathbf{f}_1$, and $\mathbf{d}_2$ with $f_{2,i}$, $d_{1,i}$, $\bar{f}_2$, $\bar{d}_1$, $\mathbf{f}_2$, and $\mathbf{d}_1$ in Eq. (26.3), respectively, we can obtain $\mathrm{CTFDC}_{T_2 \to T_1}$ accordingly.

As the covariance between fitness and distance is calculated across tasks, the proposed CTFDC can be considered as a fitness landscape analysis across tasks, which serves as the task similarity measure for MFEA in the present study. Note that a high CTFDC value indicates that the solutions that result in fitness improvement of $T_1$ (or $T_2$) have a high probability of biasing the search of $T_2$ (or $T_1$) toward promising areas accordingly.
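To make Eq. (26.3) concrete, the following Python sketch computes $\mathrm{CTFDC}_{T_1 \to T_2}$ for toy tasks on a shared sample set. The function names and the toy tasks are illustrative assumptions, and `math.dist` requires Python 3.8 or later; the experiments in this chapter use far more samples.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]
Task = Callable[[Vector], float]


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Covariance of a and b divided by the product of their standard deviations."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = (sum((x - mean_a) ** 2 for x in a) / n) ** 0.5
    std_b = (sum((y - mean_b) ** 2 for y in b) / n) ** 0.5
    return cov / (std_a * std_b)


def ctfdc(task_f: Task, task_d: Task, samples: List[Vector]) -> float:
    """CTFDC from task_f to task_d: fitness on task_f versus distance to the
    best sampled solution of task_d (steps 2-5 of Algorithm 26.4)."""
    fitness = [task_f(s) for s in samples]
    best_for_d = min(samples, key=task_d)  # stands in for the unknown optimum
    distance = [math.dist(s, best_for_d) for s in samples]
    return correlation(fitness, distance)


rng = random.Random(2)
samples = [[rng.random(), rng.random()] for _ in range(2000)]


def t1(x: Vector) -> float:
    return sum((v - 0.5) ** 2 for v in x)


def t2(x: Vector) -> float:
    return 2.0 * t1(x) + 1.0          # shares the optimum of t1


def t3(x: Vector) -> float:
    return sum(v ** 2 for v in x)     # optimum in a corner of the space


aligned = ctfdc(t1, t2, samples)
misaligned = ctfdc(t1, t3, samples)
print(f"CTFDC(t1 -> t2) = {aligned:.3f}, CTFDC(t1 -> t3) = {misaligned:.3f}")
```

When the best sampled solution of the second task lies where the first task improves, fitness on the first task grows with distance to that solution and the coefficient is close to 1; when the optima are far apart, the coefficient drops.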

26.4. Experimental Results

To evaluate the effectiveness of the three similarity measures, an empirical study on the newly developed multitasking benchmarks, which contain 21 multitasking sets, is presented in this section.

26.4.1. Experiment Configuration

To investigate the similarity between different tasks, following [12], seven functions (i.e., “Griewank,” “Rastrigin,” “Ackley,” “Schwefel,” “Sphere,” “Rosenbrock,” and “Weierstrass”) are used as the individual tasks to be solved. The properties of these functions are summarized in Table 26.1. Further, as in [12], when paired for multitasking, each function is rotated by a randomly generated D × D orthogonal matrix M, where D denotes the dimension of the function. Next, with the seven individual functions, we develop the new multitasking benchmarks by considering all the possible combinations of the functions, which are summarized in Table 26.2.

Table 26.1: Summary of properties of the individual functions

Function      Dimension   Search Space      Global Optimum
Griewank      D = 50      [−100, 100]^D     [0, …, 0]^D
Rastrigin     D = 50      [−50, 50]^D       [0, …, 0]^D
Ackley        D = 50      [−50, 50]^D       [0, …, 0]^D
Schwefel      D = 50      [−500, 500]^D     [420.9687, …, 420.9687]^D
Sphere        D = 50      [−100, 100]^D     [0, …, 0]^D
Rosenbrock    D = 50      [−50, 50]^D       [0, …, 0]^D
Weierstrass   D = 25      [−0.5, 0.5]^D     [0, …, 0]^D

Table 26.2: Summary of the 21 multitasking problem sets

Set   Task Combination        Set   Task Combination        Set   Task Combination
P1    Griewank+Rastrigin      P2    Griewank+Ackley         P3    Griewank+Schwefel
P4    Griewank+Sphere         P5    Griewank+Rosenbrock     P6    Griewank+Weierstrass
P7    Rastrigin+Ackley        P8    Rastrigin+Schwefel      P9    Rastrigin+Sphere
P10   Rastrigin+Rosenbrock    P11   Rastrigin+Weierstrass   P12   Ackley+Schwefel
P13   Ackley+Sphere           P14   Ackley+Rosenbrock       P15   Ackley+Weierstrass
P16   Schwefel+Sphere         P17   Schwefel+Rosenbrock     P18   Schwefel+Weierstrass
P19   Sphere+Rosenbrock       P20   Sphere+Weierstrass      P21   Rosenbrock+Weierstrass

To evaluate the effectiveness of the three different similarity measures between tasks, a comparison between MFEA and the single-task evolutionary algorithm (SEA), which shares the same reproduction operator with MFEA, is performed on all the developed benchmarks. For a fair comparison, the parameters and search operators are configured according to [12], as detailed below:

(1) Population size: N = 30.
(2) Generations for each task: G = 500.
(3) Evolutionary operators:
    • Crossover: Simulated binary crossover (SBX).
    • Mutation: Polynomial mutation.
(4) Probability in assortative mating: rmp = 0.3.
(5) Number of independent runs: R = 20.
(6) The number of sampled solutions in MMD, SRCC, and CTFDC: n = 1,000,000.
(7) The number of best solutions selected in MMD: m = 100.

Lastly, as the calculation of the three similarity measures is based on uniformly and independently sampled solutions in the variable space, we consider the mean value of 10 independent calculation results for MMD, SRCC, and CTFDC in this study.
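As a quick check on the benchmark construction, the 21 problem sets of Table 26.2 are exactly the unordered pairs of the seven functions, enumerated in the order in which the functions are listed in Table 26.1. The short Python sketch below reproduces that enumeration; the variable names are illustrative.

```python
from itertools import combinations

FUNCTIONS = [
    "Griewank", "Rastrigin", "Ackley", "Schwefel",
    "Sphere", "Rosenbrock", "Weierstrass",
]

# Label every unordered pair P1, P2, ... in lexicographic order of position.
problem_sets = {
    f"P{index}": pair
    for index, pair in enumerate(combinations(FUNCTIONS, 2), start=1)
}

for name, (first, second) in problem_sets.items():
    print(f"{name}: {first}+{second}")
```

Seven functions give 7 × 6 / 2 = 21 pairs, matching the number of multitasking sets used in the experiments.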

26.4.2. Experimental Results and Analysis

Table 26.3 summarizes the performance obtained by MFEA and SEA on the 21 newly developed multitasking sets over 20 independent runs. In the table, column “True Performance” gives the comparison of MFEA against SEA, while column “Predicted Performance” gives the predicted performance of MFEA indicated by the corresponding similarity measure. In particular, “1” denotes that MFEA obtained superior performance in terms of solution quality or convergence speed over SEA on both of the tasks, while “−1” indicates that MFEA achieved deteriorated performance against SEA on at least one task.

Table 26.3: Results of the true performance obtained by MFEA over SEA, and the predicted performance of MFEA based on task similarity, on the 21 multitasking sets

Multitasking   True          Predicted Performance
Set            Performance   MMD     SRCC    CTFDC
P1              1             1       1       1
P2              1             1       1       1
P3              1            −1      −1       1
P4              1             1       1       1
P5              1             1       1       1
P6              1             1       1       1
P7              1             1       1       1
P8              1            −1      −1       1
P9              1             1       1       1
P10             1             1       1       1
P11             1             1       1       1
P12            −1            −1      −1      −1
P13             1             1       1       1
P14             1             1       1       1
P15            −1             1      −1      −1
P16             1            −1      −1       1
P17             1            −1      −1       1
P18            −1            −1      −1      −1
P19             1             1       1       1
P20             1             1       1       1
P21             1             1       1       1
Error rate      –            23.8%   19.0%   0%

It can be seen from the table that on 18 out of 21 multitasking sets, MFEA achieved superior performance over the single-task SEA, which again confirms the efficacy of evolutionary multitasking as discussed in [9]. On the other hand, we observe deteriorated performance on three multitasking sets, i.e., P12, P15, and P18. This highlights the importance of studying task relationships for enhanced evolutionary multitasking performance. As depicted in Figure 26.3, where the averaged convergence traces of six representative multitasking sets are presented, enhanced problem-solving performance can be observed on “Weierstrass” when it is paired with “Rosenbrock” in multitasking (see Figure 26.3(f)). However, the search process gets trapped in a local optimum when it is solved together with another function, i.e., “Schwefel” in Figure 26.3(e).

Further, the predicted performance is obtained based on the task similarity evaluated by the three similarity measures between tasks. In the following, we explain how these predictions are obtained based on the calculated task similarities, and analyze the effectiveness of each similarity measure accordingly.

26.4.2.1. Results and analysis on MMD

Figure 26.3: Convergence traces of MFEA against SEA on 6 representative multitasking sets. Y-axis: averaged fitness value in log scale; X-axis: generation.

Table 26.4: The mean MMD values of 10 independent calculations on the 21 multitasking sets

Function (T1)   Griewank   Rastrigin   Ackley   Schwefel   Sphere   Rosenbrock   Weierstrass
                (T2)       (T2)        (T2)     (T2)       (T2)     (T2)         (T2)
Griewank        –          0.213       0.196    0.432      0.200    0.266        0.235
Rastrigin       –          –           0.228    0.412      0.164    0.249        0.275
Ackley          –          –           –        0.424      0.224    0.246        0.291
Schwefel        –          –           –        –          0.405    0.417        0.400
Sphere          –          –           –        –          –        0.238        0.222
Rosenbrock      –          –           –        –          –        –            0.205
Weierstrass     –          –           –        –          –        –            –

The results of the MMD values for the 21 multitasking sets are listed in Table 26.4. As can be observed, the MMD values are small (less than 0.3) on task pairs that have the same global optimum. However, the MMD values become larger (greater than 0.4) when the function “Schwefel” is paired in the multitasking sets. In this section, we consider tasks in the multitasking sets with MMD < 0.3 to be more similar to each other and thus predict the corresponding MFEA performance as “1”; otherwise, the corresponding prediction is “−1.” From Table 26.3, it can be observed that almost all multitasking sets with MMD values