Genetic Programming Theory and Practice XX (Genetic and Evolutionary Computation) [1st ed. 2024] 9819984122, 9789819984121

Genetic Programming Theory and Practice brings together some of the most impactful researchers in the field of Genetic P


English Pages 352 [343] Year 2024


Table of contents :
Preface
Acknowledgments
Contents
Contributors
1 TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning
1.1 Introduction
1.2 Related Work
1.3 Evolutionary Algorithm
1.4 GraphPipelineIndividual Representation
1.4.1 Mutation
1.4.2 Crossover
1.5 TPOT2 API
1.5.1 TPOTEstimator
1.5.2 Ensembling
1.6 Experiment Set-Up
1.6.1 TPOT1 Versus TPOT2
1.7 Results and Discussion
1.8 Conclusions
1.8.1 Future Work
References
2 Analysis of a Pairwise Dominance Coevolutionary Algorithm with Spatial Topology
2.1 Introduction
2.2 Preliminaries
2.2.1 Coevolutionary Algorithms
2.2.2 Spatial Topologies for PDCoEA
2.2.3 Error Thresholds
2.2.4 Problems
2.2.5 MaximinHill—A Problem for Error Thresholds
2.3 Experimental Methodology
2.4 Experiments
2.4.1 Setup
2.4.2 Spatial Topology PDCoEA
2.4.3 Payoff and Genotypic Diversity
2.4.4 Error Threshold in STPDCoEA
2.5 Related Work
2.6 Conclusion
References
3 Accelerating Image Analysis Research with Active Learning Techniques in Genetic Programming
3.1 Introduction
3.2 Data Sets
3.2.1 KOMATSUNA
3.2.2 Cell Classification
3.3 Active Learning
3.4 AL-GP Applied to Decision Tree GP
3.4.1 Decision Tree GP (DT-GP)
3.4.2 Active Learning Implementation
3.4.3 KOMATSUNA Multi-image Results
3.4.4 KOMATSUNA Single-Image Results
3.4.5 Cell Classification
3.5 AL-GP Applied to SEE-Segment
3.5.1 SEE-Segment
3.5.2 AL Implementation for SEE-Segment
3.5.3 KOMATSUNA Results
3.6 Conclusions
References
4 How the Combinatorics of Neutral Spaces Leads Genetic Programming to Discover Simple Solutions
4.1 Introduction
4.2 Related Work
4.2.1 I/O Systems
4.2.2 RNA Studies
4.2.3 GP on Boolean Functions
4.2.4 Neutral Networks
4.2.5 Our Earlier Work
4.3 Genotypes, Phenotypes, Behavior, Fitness
4.3.1 Discrimination of Genotypes and Phenotypes
4.3.2 The Difference of Structural and Semantic Neutrality
4.4 Methods
4.4.1 Linear Genetic Programming
4.4.2 Boolean Function Programs/Circuits
4.4.3 Visualization Method
4.5 The Role of Neutrality
4.5.1 Longer Programs
4.5.2 A New Fitness Function
4.6 Results
4.6.1 A Comparison of Success Rates
4.6.2 Comparison of Search Trajectory Networks for Three Targets
4.6.3 Simpler Solutions
4.7 Discussion and Future Work
References
5 The Impact of Step Limits on Generalization and Stability in Software Synthesis
5.1 Introduction
5.2 Background
5.2.1 The Push Language and Interpreter
5.2.2 Step Limits and Infinite Loops
5.2.3 Success, Generalization, and Stability
5.3 Methodology and Experimental Design
5.4 Results
5.4.1 Last Index of Zero
5.4.2 Fuel Cost
5.4.3 Middle Character
5.4.4 GCD
5.5 Discussion
5.5.1 Stability of Evolved Programs
5.5.2 Stability and (mis)match with Instruction Set
5.5.3 Finding Additional Generalizing Solutions
5.5.4 Saving Computational Effort
5.6 Future Work
5.7 Conclusions
References
6 Genetic Programming Techniques for Glucose Prediction in People with Diabetes
6.1 Introduction
6.2 The Problem of Glucose Management
6.3 Background
6.3.1 Grammatical Evolution for Glucose Prediction
6.3.2 Recent Techniques for Glucose Prediction Based on Grammatical Evolution
6.4 Proposed Framework for Glucose Control
6.4.1 Framework Description
6.4.2 Experimental Results
6.5 Conclusions
References
7 Methods for Rich Phylogenetic Inference Over Distributed Sexual Populations
7.1 Introduction
7.2 Methods
7.2.1 Genome Instrumentation
7.2.2 Genealogical Inference
7.2.3 Population Size Inference
7.2.4 Positive Selection Inference
7.2.5 Software and Data
7.3 Results and Discussion
7.3.1 Genealogical Inference
7.3.2 Population Size Inference
7.3.3 Positive Selection Inference
7.4 Conclusion
References
8 A Melting Pot of Evolution and Learning
8.1 Introduction
8.2 Machine Learning
8.2.1 Binary and Multinomial Classification Through Evolutionary Symbolic Regression
8.2.2 Classy Ensemble: A Novel Ensemble Algorithm for Classification
8.2.3 EC-KitY: Evolutionary Computation Tool Kit in Python
8.3 Deep Learning
8.3.1 Evolution of Activation Functions for Deep Learning-Based Image Classification
8.3.2 Adaptive Combination of a Genetic Algorithm and Novelty Search for Deep Neuroevolution
8.4 Adversarial Deep Learning
8.4.1 An Evolutionary, Gradient-Free, Query-Efficient, Black-Box Algorithm for Generating Adversarial Instances in Deep Networks
8.4.2 Foiling Explanations in Deep Neural Networks
8.4.3 Patch of Invisibility: Naturalistic Black-Box Adversarial Attacks on Object Detectors
8.5 Concluding Remark
References
9 Particularity
9.1 Overview
9.2 Lexicase
9.3 Variance
9.4 Epsilon
9.5 Batched
9.6 Downsampled
9.7 Informed
9.8 Weighted
9.9 Gradient
9.10 Plexicase
9.11 Hidden
9.12 Living
9.13 Honor
References
10 The OpenELM Library: Leveraging Progress in Language Models for Novel Evolutionary Algorithms
10.1 Introduction
10.2 Background: Evolution and LLMs
10.3 OpenELM Evolutionary Algorithms
10.4 Language Models as Evolutionary Operators
10.4.1 Diff Models
10.4.2 LMX: Language Model Crossover
10.5 Engineering Challenges
10.5.1 OpenELM Inference Optimizations
10.5.2 Execution of Generated Code
10.6 OpenELM Domains
10.6.1 Sodarace
10.6.2 Image Generation
10.6.3 Prompts
10.6.4 Programming Puzzles
10.7 Discussion
10.8 Conclusion
References
11 GP for Continuous Control: Teacher or Learner? The Case of Simulated Modular Soft Robots
11.1 Introduction
11.2 Related Works
11.3 Background: Simulated Voxel-Based Soft Robots
11.3.1 VSR Morphology
11.3.2 VSR Controller
11.4 Evolutionary Optimization of VSR Controllers
11.4.1 Multi-layer Perceptron Optimized with a Genetic Algorithm
11.4.2 Array of Regression Trees Optimized with GP
11.4.3 Regression Graphs Optimized with GraphEA
11.5 Experiments and Results
11.5.1 Direct Evolution of the Controller
11.5.2 Offline Imitation Learning
11.6 Discussion
11.7 Concluding Remarks
References
12 Shape-constrained Symbolic Regression: Real-World Applications in Magnetization, Extrusion and Data Validation
12.1 Introduction
12.2 Related Work
12.3 Shape-constrained Symbolic Regression
12.3.1 Interaction Transformation Evolutionary Algorithm
12.4 Shape Constraint Handling
12.4.1 Single-Objective Approach
12.4.2 Multi-objective Approach
12.4.3 Feasible-Infeasible Two-Population Approach
12.5 Constraint Evaluation
12.5.1 Optimistic Approach
12.5.2 Pessimistic Approach
12.6 Real World Problems
12.6.1 Twin-Screw Extruder Modeling
12.6.2 Data Validation for Industrial Friction Performance Measurements
12.6.3 Magnetization Curves
12.7 Conclusion
References
13 Phylogeny-Informed Fitness Estimation for Test-Based Parent Selection
13.1 Introduction
13.2 Phylogeny-Informed Fitness Estimation
13.2.1 Phylogeny Tracking
13.3 Methods
13.3.1 Lexicase Selection
13.3.2 Diagnostic Experiments
13.3.3 Genetic Programming Experiments
13.3.4 Statistical Analyses
13.3.5 Software and Data Availability
13.4 Results and Discussion
13.4.1 Phylogeny-Informed Estimation Reduces Diversity Loss Caused by Subsampling
13.4.2 Phylogeny-Informed Estimation Improves Poor Exploration Caused by Down-Sampling
13.4.3 Phylogeny-Informed Estimation Can Enable Extreme Subsampling for Some Genetic Programming Problems
13.5 Conclusion
References
14 Origami: (un)folding the Abstraction of Recursion Schemes for Program Synthesis
14.1 Introduction
14.2 Recursion Schemes
14.2.1 Fixed Point of a Linked List
14.2.2 Functor Algebra
14.2.3 Well-Known Recursion Schemes
14.3 Origami
14.3.1 How to Choose a Template
14.3.2 Jokers to the Right: Catamorphism
14.3.3 When You Started Off with Nothing: Anamorphism
14.3.4 Stuck in the Middle with You: Hylomorphism
14.3.5 Clowns to the Left of Me: Accumorphism
14.4 Preliminary Results
14.5 Discussion and Final Remarks
References
15 Reachability Analysis for Lexicase Selection via Community Assembly Graphs
15.1 Introduction
15.2 Approach
15.2.1 Community Assembly Graphs
15.2.2 Calculating Stability
15.2.3 Assumptions
15.2.4 Reachability Analysis
15.3 Background
15.3.1 Lexicase Selection
15.3.2 Community Assembly Graphs
15.4 Proof of Concept in NK Landscapes
15.4.1 Methods
15.4.2 Results
15.5 Proof of Concept in Genetic Programming
15.5.1 Methods
15.5.2 Results
15.6 Conclusion
References
16 Let's Evolve Intelligence, Not Solutions
16.1 Introduction
16.2 What Should We Strive For?
16.3 What Assumptions Are Limiting Us?
16.3.1 Posit #1: Impossible to Engineer Intelligence
16.3.2 Posit #2: No Occam's Razor for Intelligence
16.3.3 Posit #3: Intelligence Is Grounded
16.3.4 Posit #4: Intelligence Is Transferable
16.3.5 Posit #5: Intelligence Is Intrinsically Self-reinforcing
16.4 What Do We Need?
16.4.1 A Caveat: Intelligence == Process and/or Intelligence == Capabilities and/or Intelligence == Individual(s)
16.4.2 The World
16.4.3 The Drivers
16.4.4 Models of Understanding
16.4.5 Process of Intelligence Self-Reinforcement
16.5 How Should We Approach It?
16.5.1 Revisiting Reproducibility
16.5.2 Back to the Intelligence Function
16.5.3 Genetic Programming of Intelligence
16.6 Conclusions
References
Appendix Index
Index

Genetic and Evolutionary Computation

Stephan Winkler Leonardo Trujillo Charles Ofria Ting Hu   Editors

Genetic Programming Theory and Practice XX

Genetic and Evolutionary Computation

Series Editors
Wolfgang Banzhaf, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
Kalyanmoy Deb, Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, USA

The area of Genetic and Evolutionary Computation has seen an explosion of interest in recent years. Methods based on the variation-selection loop of Darwinian natural evolution have been successfully applied to a whole range of research areas. The Genetic and Evolutionary Computation Book Series publishes research monographs, edited collections, and graduate-level texts in one of the most exciting areas of Computer Science. As researchers and practitioners alike turn increasingly to search, optimization, and machine-learning methods based on mimicking natural evolution to solve problems across the spectrum of the human endeavor, this growing field will continue to surprise with novel applications and results. Recent award-winning PhD theses, special topics books, workshops and conference proceedings in the areas of EC and Artificial Life Studies are of interest.

Areas of coverage include applications, theoretical foundations, technique extensions and implementation issues of all areas of genetic and evolutionary computation. Topics may include, but are not limited to:
• Optimization (multi-objective, multi-level)
• Design, control, classification, and system identification
• Data mining and data analytics
• Pattern recognition and deep learning
• Evolution in machine learning
• Evolvable systems of all types
• Automatic programming and genetic improvement

Proposals in related fields such as:
• Artificial life, artificial chemistries
• Adaptive behavior and evolutionary robotics
• Artificial immune systems
• Agent-based systems
• Deep neural networks
• Quantum computing
will be considered for publication in this series as long as GEVO techniques are part of or inspiration for the system being described. Manuscripts describing GEVO applications in all areas of engineering, commerce, the sciences, the arts and the humanities are encouraged.

Prospective Authors or Editors: If you have an idea for a book, we would welcome the opportunity to review your proposal. Should you wish to discuss any potential project further or receive specific information regarding our book proposal requirements, please contact Wolfgang Banzhaf, Kalyan Deb or Mio Sugino:

Areas: Genetic Programming/other Evolutionary Computation Methods, Machine Learning, Artificial Life
Wolfgang Banzhaf, Consulting Editor
BEACON Center for Evolution in Action, Michigan State University, East Lansing, MI 48824 USA
[email protected]

Areas: Genetic Algorithms, Optimization, Meta-Heuristics, Engineering
Kalyanmoy Deb, Consulting Editor
BEACON Center for Evolution in Action, Michigan State University, East Lansing, MI 48824 USA
[email protected]

Mio Sugino
[email protected]

The GEVO book series is the result of a merger of the two former book series: Genetic Algorithms and Evolutionary Computation https://www.springer.com/series/6008 and Genetic Programming https://www.springer.com/series/6016.

Editors
Stephan Winkler, School of Information Communications and Media, University of Applied Sciences Upper Austria, Hagenberg, Austria
Charles Ofria, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
Leonardo Trujillo, Engineering Sciences Graduate Program, Tecnológico Nacional de México IT de Tijuana, Tijuana, Baja California, Mexico
Ting Hu, School of Computing, Queen’s University, Kingston, ON, Canada

ISSN 1932-0167
ISSN 1932-0175 (electronic)
Genetic and Evolutionary Computation
ISBN 978-981-99-8412-1
ISBN 978-981-99-8413-8 (eBook)
https://doi.org/10.1007/978-981-99-8413-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Paper in this product is recyclable.

Preface

From 2020 until 2022, the COVID-19 pandemic struck the world and massively changed our daily lives. For the academic world, it meant a sudden halt to in-person meetings, workshops, and conferences, among many other consequences. Of course, several editions of the Genetic Programming Theory and Practice (GPTP) workshop were also affected: In 2020, we had to cancel GPTP; in 2021, we held it online, which turned out great, but still we hoped that we would never be forced to repeat that format for GPTP. In 2022, we planned GPTP as an in-person event at the University of Michigan, and after a long period of uncertainty about whether it would work, we were very glad when it finally did! We were very cautious (e.g., wearing masks at the event) and, unfortunately, several of our colleagues who would have liked to attend the workshop were not able to travel, but overall it was a great event.

November 2022 saw the premiere of the “GPTP Sandbox”, a new online format for GPTP in which we focused on two things: On the first day, six Ph.D. students were given the opportunity to present their research and receive feedback from the community; on the second day, we discussed how genetic programming could be promoted and made more visible in the machine learning community and in society in general.

In 2023, finally, we were able to plan GPTP under more or less “normal” circumstances (apart from the fact that, for the first time in a long period, Wolfgang Banzhaf was not part of the organization team due to his sabbatical year). For the first time we planned to hold the workshop at the Kellogg Hotel & Conference Center at Michigan State University, and we also decided to try to integrate more groups that are active in GP research into the GPTP community. Additionally, we decided to offer the speakers not only the possibility to give regular talks but also so-called lightning talks, i.e., shorter talks in which ideas and research directions could be presented without necessarily having final results or conclusions yet.


And then, in June 2023, GPTP finally came back to Michigan State University! Following the tradition, we started the twentieth edition of our workshop with a joint dinner at a local bar (downtown East Lansing), and then 45 GPTP’ers had three great days at the Kellogg Center, where we had a very nice conference room and delicious food (coffee breaks, lunch, and even a great conference dinner on the first day of the workshop). Day one started with a great keynote given by Oana Carja from Carnegie Mellon University; she talked about topological puzzles in biology, how geometry shapes evolution, and applications to designing intelligent collectives. What a great start to GPTP XX! We then saw and discussed presentations given by Jason Moore and Pedro Ribeiro (Cedars-Sinai Medical Center), Una-May O’Reilly (MIT CSAIL), Nathan Haut (Michigan State University), Wolfgang Banzhaf (Michigan State University), and Moshe Sipper (Ben-Gurion University); lightning talks were delivered by Nic McPhee (University of Minnesota), Jose Manuel Velasco Cabo (Universidad Complutense de Madrid), and Matthew Andres Moreno (Michigan State University). On day two, Thomas Baeck from Leiden University gave an exciting keynote, in which he talked about automated algorithm configuration for expensive optimization tasks; it was very interesting to see all the industrial applications in which theoretical advances of GP have led to successful solutions. Throughout the day, presentations were given by Lee Spector (Amherst College and UMass Amherst), Joel Lehman and Herbie Bradley (Carper.ai), Eric Medvet (University of Trieste), Christian Haider (University of Applied Sciences Upper Austria), and Alexander Lalejini (Grand Valley State University); lightning talks were given by Michael Affenzeller (University of Applied Sciences Upper Austria) and Stuart W. Card.
The last day started with one of the most remarkable keynotes in GPTP history, namely James Foster’s talk about his life and his academic journey in computer science and genetic programming. Certainly, none of us who heard this talk will ever forget it, as it was full of interesting stories and emotional moments. We then heard the last talks of GPTP XX, namely lightning talks by Lisa Soros (Barnard College) and Fabricio Olivetti de Franca (Federal University of ABC) and regular talks by Emily Dolson (Michigan State University) and Talib Hussain (John Abbott College). Throughout the event, not only after the talks but also during breaks and in the evening, we had great discussions, in which we often talked about the hype machine learning is currently seeing, and about GP being the perfect method for so many applications, as it is a very flexible method that produces interpretable results, which is so important in numerous applications. Nevertheless, we all agreed that we have to do more in order to make GP more visible in our community as well as in society in general! One of the most prominent issues we should address is that we need more generally and easily available implementations of GP that can be integrated into any data science workflow. We are very honored and grateful that we could once again organize another GPTP workshop in-person (Fig. 1). It is our intention that GPTP continues to be a core event for genetic programming research, bringing together academics, practitioners and theorists from diverse fields of science that intersect in our community, providing for a constructive, thoughtful, inspired and open interchange of ideas, and to do so, whenever possible, in-person, with a coffee during breaks or a beer at dinner.

Fig. 1 Attendees of GPTP XX at Michigan State University, June 2023

Ting Hu, Kingston, Ontario, Canada
Charles Ofria, East Lansing, Michigan, USA
Leonardo Trujillo, Tijuana, Baja California, Mexico
Stephan Winkler, Hagenberg, Austria
September 2023

Acknowledgments

We would like to thank all of the participants for making GP Theory and Practice a successful in-person workshop once again in 2023. Special thanks to our three wonderful keynote speakers, Oana, Thomas, and James: Your talks were amazing! We would also like to thank our financial supporters for making the existence of GP Theory and Practice possible for twenty great editions. For 2023, we are grateful to the following sponsors:
• Michael Affenzeller from HEAL and the University of Applied Sciences Upper Austria
• Stuart Card
• John Koza
• Jason H. Moore at the Department of Computational Biomedicine at Cedars-Sinai

A number of people made key contributions to the organization of the workshop. We are particularly grateful for contractual assistance by Mio Sugino, Springer Nature Tokyo, and editorial assistance by Kokila Durairaj, Springer Nature Chennai. We would also like to express our gratitude to Erik Goodman at the BEACON Center for the Study of Evolution in Action at Michigan State University for his continued support.

Ting Hu, Kingston, Ontario, Canada
Charles Ofria, East Lansing, Michigan, USA
Leonardo Trujillo, Tijuana, Baja California, Mexico
Stephan Winkler, Hagenberg, Austria
September 2023

Contents

1 TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning
Pedro Ribeiro, Anil Saini, Jay Moran, Nicholas Matsumoto, Hyunjun Choi, Miguel Hernandez, and Jason H. Moore

2 Analysis of a Pairwise Dominance Coevolutionary Algorithm with Spatial Topology
Mario Hevia Fajardo, Per Kristian Lehre, Jamal Toutouh, Erik Hemberg, and Una-May O’Reilly

3 Accelerating Image Analysis Research with Active Learning Techniques in Genetic Programming
Nathan Haut, Wolfgang Banzhaf, Bill Punch, and Dirk Colbry

4 How the Combinatorics of Neutral Spaces Leads Genetic Programming to Discover Simple Solutions
Wolfgang Banzhaf, Ting Hu, and Gabriela Ochoa

5 The Impact of Step Limits on Generalization and Stability in Software Synthesis
Nicholas Freitag McPhee and Richard Lussier

6 Genetic Programming Techniques for Glucose Prediction in People with Diabetes
J. Ignacio Hidalgo, Jose Manuel Velasco, Daniel Parra, and Oscar Garnica

7 Methods for Rich Phylogenetic Inference Over Distributed Sexual Populations
Matthew Andres Moreno

8 A Melting Pot of Evolution and Learning
Moshe Sipper, Achiya Elyasaf, Tomer Halperin, Zvika Haramaty, Raz Lapid, Eyal Segal, Itai Tzruia, and Snir Vitrack Tamam

9 Particularity
Lee Spector, Li Ding, and Ryan Boldi

10 The OpenELM Library: Leveraging Progress in Language Models for Novel Evolutionary Algorithms
Herbie Bradley, Honglu Fan, Theodoros Galanos, Ryan Zhou, Daniel Scott, and Joel Lehman

11 GP for Continuous Control: Teacher or Learner? The Case of Simulated Modular Soft Robots
Eric Medvet and Giorgia Nadizar

12 Shape-constrained Symbolic Regression: Real-World Applications in Magnetization, Extrusion and Data Validation
Christian Haider, Fabricio Olivetti de Franca, Bogdan Burlacu, Florian Bachinger, Gabriel Kronberger, and Michael Affenzeller

13 Phylogeny-Informed Fitness Estimation for Test-Based Parent Selection
Alexander Lalejini, Matthew Andres Moreno, Jose Guadalupe Hernandez, and Emily Dolson

14 Origami: (un)folding the Abstraction of Recursion Schemes for Program Synthesis
Matheus Campos Fernandes, Fabricio Olivetti de Franca, and Emilio Francesquini

15 Reachability Analysis for Lexicase Selection via Community Assembly Graphs
Emily Dolson and Alexander Lalejini

16 Let’s Evolve Intelligence, Not Solutions
Talib S. Hussain

Index

Contributors

Michael Affenzeller, Heuristic and Evolutionary Algorithms Laboratory (HEAL), University of Applied Sciences Upper Austria, Hagenberg, Austria
Florian Bachinger, Heuristic and Evolutionary Algorithms Laboratory (HEAL), University of Applied Sciences Upper Austria, Hagenberg, Austria
Wolfgang Banzhaf, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
Ryan Boldi, University of Massachusetts, Amherst, Amherst, MA, USA
Herbie Bradley, CarperAI and EleutherAI and CAML Lab, University of Cambridge and Stability AI, Cambridge, UK
Bogdan Burlacu, Heuristic and Evolutionary Algorithms Laboratory (HEAL), University of Applied Sciences Upper Austria, Hagenberg, Austria
Hyunjun Choi, Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, CA, USA
Dirk Colbry, Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, USA
Li Ding, University of Massachusetts, Amherst, Amherst, MA, USA
Emily Dolson, Michigan State University, East Lansing, MI, USA
Achiya Elyasaf, Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Mario Hevia Fajardo, University of Birmingham, Birmingham, England
Honglu Fan, EleutherAI and University of Geneva, Geneva, Switzerland
Matheus Campos Fernandes, Federal University of ABC, Santo Andre, SP, Brazil
Fabricio Olivetti de Franca, Federal University of ABC, Santo Andre, SP, Brazil
Emilio Francesquini, Federal University of ABC, Santo Andre, SP, Brazil
Nicholas Freitag McPhee, University of Minnesota Morris, Morris, MN, USA
Theodoros Galanos, EleutherAI and University of Malta, Aurecon, Malta, USA
Oscar Garnica, Universidad Complutense de Madrid, Madrid, Spain
Christian Haider, Heuristic and Evolutionary Algorithms Laboratory (HEAL), University of Applied Sciences Upper Austria, Hagenberg, Austria
Tomer Halperin, Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Zvika Haramaty, Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Nathan Haut, Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, USA
Erik Hemberg, ALFA, MIT CSAIL, Cambridge, England
Jose Guadalupe Hernandez, Michigan State University, East Lansing, MI, USA
Miguel Hernandez, Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, CA, USA
J. Ignacio Hidalgo, Instituto de Tecnología del Conocimiento, Universidad Complutense de Madrid, Madrid, Spain
Ting Hu, School of Computing, Queen’s University, Kingston, ON, Canada
Talib S. Hussain, John Abbott College, Ste-Anne-de-Bellevue, QC, Canada
Gabriel Kronberger, Heuristic and Evolutionary Algorithms Laboratory (HEAL), University of Applied Sciences Upper Austria, Hagenberg, Austria
Alexander Lalejini, Grand Valley State University, Allendale, MI, USA
Raz Lapid, DeepKeep, Tel-Aviv, Israel
Joel Lehman, CarperAI and StabilityAI, Newark, NJ, USA
Per Kristian Lehre, University of Birmingham, Birmingham, England
Richard Lussier, University of Minnesota Morris, Morris, MN, USA
Nicholas Matsumoto, Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, CA, USA
Eric Medvet, Department of Engineering and Architecture, University of Trieste, Trieste, Italy
Jason H. Moore, Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, CA, USA
Jay Moran, Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, CA, USA
Matthew Andres Moreno, University of Michigan, Ann Arbor, MI, USA
Giorgia Nadizar, Department of Mathematics and Geosciences, University of Trieste, Trieste, Italy
Gabriela Ochoa, Department of Computer Science, University of Stirling, Stirling, UK
Una-May O’Reilly, ALFA, MIT CSAIL, Cambridge, England
Daniel Parra, Universidad Complutense de Madrid, Madrid, Spain
Bill Punch, Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, USA
Pedro Ribeiro, Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, CA, USA
Anil Saini, Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, CA, USA
Daniel Scott, EleutherAI and Georgia Institute of Technology, Atlanta, USA
Eyal Segal, Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Moshe Sipper, Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Lee Spector, Amherst College, Amherst, MA, USA; University of Massachusetts, Amherst, Amherst, MA, USA
Snir Vitrack Tamam, Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Jamal Toutouh, University of Malaga, Malaga, Spain
Itai Tzruia, Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Jose Manuel Velasco, Universidad Complutense de Madrid, Madrid, Spain
Ryan Zhou, EleutherAI and Queen’s University, Kingston, Canada

Chapter 1

TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning Pedro Ribeiro, Anil Saini, Jay Moran, Nicholas Matsumoto, Hyunjun Choi, Miguel Hernandez, and Jason H. Moore

P. Ribeiro (corresponding author) · A. Saini · J. Moran · N. Matsumoto · H. Choi · M. Hernandez · J. H. Moore
Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, CA, USA, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_1

1.1 Introduction

In recent years, machine learning (ML) has been applied to a number of domains, including image recognition, weather forecasting, stock market prediction, recommendation engines, and text generation. A whole gamut of algorithms is available for these tasks, such as Logistic Regression, Naive Bayes, K-Nearest Neighbors, Decision Tree, Random Forest, and Support Vector Machine. Each algorithm also has a large number of hyperparameter settings to adjust. In addition, a typical user needs to select methods for data cleaning, feature selection, feature engineering, etc. The role of a data scientist is to search through a large space of possible operators and their hyperparameters in order to find the best-performing pipeline for a given task.

Over the years, multiple methods have been developed to automate the search for the best machine learning pipeline. One of those methods is TPOT [13]. In TPOT, the pipelines are represented as trees, where the root node is either a classifier or a regressor, with the other nodes encoding ML operators for data preprocessing, feature engineering, etc. TPOT has been successfully applied to several problems, such as genetic analysis [12].

Several extensions of TPOT have been developed that optimize the existing implementation or provide additional functionality. However, these changes were often not merged into the main software package. For example, in Parmentier et al. [14], the authors fork TPOT and implement a successive halving strategy to reduce computational demand. The algorithm begins by quickly searching through a larger population evaluated on smaller portions of the data early in the training phase, then evaluates fewer but likely better-performing models on larger portions of the dataset later. This allowed TPOT to explore a larger search space as well as reach better performance. TPOT was forked and modified in another paper to use covariate adjustments [11]. Recently, TPOT has been forked and modified to automate quantitative trait locus (QTL) analysis in a biology-based AutoML software package called AutoQTL [9].

However, none of the above forks of TPOT have been merged into the master branch. Due to the way TPOT and these forks are structured, it would also be difficult to do so. Since these extensions have been developed in isolation, they may not be compatible with the main repository. This limits their usability, as different features might not be usable together, and users may not be aware of the different forks and their respective features.

In this chapter, we introduce a new edition of TPOT called TPOT2.¹ Although TPOT2 shares some similarities with TPOT in how it searches for ML pipelines, it has been implemented from scratch to be easily maintainable and extendable. TPOT2 uses a graph-based representation, which is much more flexible than the tree-based representation used by TPOT. TPOT2 also gives the user much more flexibility to define various parameters of the underlying algorithm. The following sections describe the TPOT2 algorithm, including the representation, genetic operators, etc. For the remainder of the chapter, we will refer to the original TPOT as TPOT1 and the new version as TPOT2.

1.2 Related Work

There are several other methods in the domain of AutoML, which differ in the way they search for optimal machine learning pipelines. Popular methods include Auto-WEKA [16], Auto-Sklearn 1 and 2 [5, 7], and TPOT [13], among others. Both Auto-WEKA and Auto-Sklearn, for example, use Bayesian optimization. Auto-Sklearn 2 also improves upon the performance of its earlier version with meta-learning, successive halving, and other optimizations. TPOT uses an evolutionary algorithm to search through the space of possible pipelines.

Other existing packages have inspired various parts of TPOT2. For example, several Python packages implement an API for evolutionary algorithms, not necessarily to evolve machine learning pipelines. Popular packages include PyMoo [2], DEAP [8], and KarooGP [3]. Other packages, such as baikal² and skdag,³ have been developed for easily building graph-based Scikit-Learn pipelines. Instead of using the existing implementations for the evolutionary algorithm and graph-based pipelines, we developed a new implementation to meet our needs. In the future, we may work toward adding to TPOT2 the ability to export to other graph pipeline representations.

¹ https://github.com/EpistasisLab/tpot2
² https://github.com/alegonz/baikal
³ https://github.com/scikit-learn-contrib/skdag


1.3 Evolutionary Algorithm

TPOT2 implements an evolutionary algorithm module that follows the standard algorithm outlined in Fig. 1.1. First, an initial population is generated and evaluated. The individuals in the initial population are generated sequentially in the following way. TPOT2 loops through the possible final estimators and assigns them as the root node to the individuals one by one. After all final estimators have been assigned as a root to different individuals, the loop repeats for the following individuals. We randomly add between 0 and max(i, 3) nodes to each root in the i-th loop. We stop the loop when the required number of individuals has been generated.

The population is a list of individuals that is available to be used to generate new individuals. The primary evolutionary algorithm loop begins with survival selection, where lower-performing individuals are removed from the current population. Next, the parent selection algorithm selects (with replacement) individuals to be used in mutation, crossover, or a combination of both. Next, the crossover and mutation methods are used to generate new individuals. Note that crossover is one-directional, which means it generates one individual per pair. More details on the mutation and crossover operators are found in the next section.

If a newly generated individual is identical to an already evaluated individual, TPOT2 will mutate that individual until it becomes unique (up to 20 attempts). With parallel processing, we want every node or core to have an individual to evaluate in every generation; re-mutating duplicates ensures that we create the expected number of new individuals to saturate the computational resources. Finally, the set of new individuals is evaluated. Individuals that throw errors or time out are discarded. The remainder is added to the current population list. The algorithm then loops back into survival selection, where the now expanded population is cut down, and the loop continues. We exit the loop once we complete the desired number of generations, satisfy an early stopping condition, or receive a manual termination signal. The default best individual is the one with the highest value of the first objective function, which, by default, is the cross-validation score on the training data.

TPOT1 and TPOT2 use the Nondominated Sorting Genetic Algorithm II (NSGA-II) for survival selection [4]. For parent selection, TPOT1 selects parents randomly from the population, whereas the default in TPOT2 is to use Dominated Tournament Selection (as described in the NSGA-II paper [4]). While the selection methods are hardcoded into TPOT1, they can be passed as parameters in TPOT2.

Fig. 1.1 A flowchart illustrating the evolutionary algorithm used in TPOT2
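The loop described in this section can be sketched in a few lines of Python. This is a minimal, dependency-free illustration under stated assumptions, not TPOT2's implementation: the function `evolve` and its arguments are hypothetical names, individuals are plain tuples, and survival selection is simplified to single-objective truncation with a binary tournament for parent selection instead of NSGA-II.

```python
import random

def evolve(init_pop, evaluate, mutate, crossover, pop_size, n_offspring,
           generations, rng, crossover_rate=0.1, max_unique_attempts=20):
    """Toy version of the loop in Fig. 1.1 (one objective, truncation survival)."""
    scores = {}                              # every evaluated individual -> score
    for ind in init_pop:
        scores[ind] = evaluate(ind)
    population = list(scores)

    for _ in range(generations):
        # Survival selection: keep only the best pop_size individuals.
        population = sorted(population, key=scores.get, reverse=True)[:pop_size]

        offspring = []
        for _ in range(n_offspring):
            # Parent selection with replacement (binary tournament, an assumption).
            p1 = max(rng.choices(population, k=2), key=scores.get)
            p2 = max(rng.choices(population, k=2), key=scores.get)
            # One-directional crossover: one child per pair of parents.
            if rng.random() < crossover_rate:
                child = crossover(p1, p2, rng)
            else:
                child = mutate(p1, rng)
            # Re-mutate children that duplicate an already known individual.
            attempts = 0
            while (child in scores or child in offspring) and attempts < max_unique_attempts:
                child = mutate(child, rng)
                attempts += 1
            if child not in scores and child not in offspring:
                offspring.append(child)

        for child in offspring:
            try:
                scores[child] = evaluate(child)
            except Exception:
                continue                     # failing individuals are discarded
            population.append(child)

    return max(population, key=scores.get)

# Usage on a toy problem: maximise the number of ones in an 8-bit tuple.
def flip_one(ind, rng):
    i = rng.randrange(len(ind))
    return ind[:i] + (1 - ind[i],) + ind[i + 1:]

def splice(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

rng = random.Random(0)
start = [tuple(rng.randint(0, 1) for _ in range(8)) for _ in range(6)]
best = evolve(start, sum, flip_one, splice, pop_size=6, n_offspring=6,
              generations=40, rng=rng)
```

Because truncation always keeps the current best individual, the returned individual is never worse than the best member of the initial population.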

1.4 GraphPipelineIndividual Representation

The individuals in TPOT2 are represented as NetworkX directed acyclic graphs [10]. Figure 1.2 shows an example TPOT2 individual. The GraphPipelineIndividual class contains a template for the pipeline represented as a directed acyclic graph. A single node in the graph contains both the machine learning method type and its hyperparameters. The individual holds other parameters that dictate its search space:

• root_config_dict: Defines the root node's possible methods and hyperparameter ranges.
• inner_config_dict: Defines the inner nodes' possible methods and hyperparameter ranges. If set to None, the graph will have no inner nodes.
• leaf_config_dict: Defines the leaf nodes' possible methods and hyperparameter ranges. If set to None, leaf nodes are pulled from inner_config_dict.
• max_size: The maximum number of nodes in any pipeline.
• linear_pipeline: If True, TPOT2 will evolve only linear pipelines.

The structure of the configuration dictionaries (root_config_dict, etc.) is different from the one in TPOT1. The keys are the Python types for the desired methods, and the corresponding values are functions that return a set of hyperparameters from the desired search space. These functions are designed to be compatible with the Optuna hyperparameter optimization package [1]. The hyperparameters are currently chosen randomly, but we plan to explore Optuna integration in the future.
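As an illustration of this dictionary structure, the sketch below maps classes to functions that return hyperparameters. Everything in it is an assumption made for illustration: the two classes are placeholders rather than real estimators, `RandomTrial` is a hypothetical stand-in that only imitates the `suggest_*` style of an Optuna trial, and `sample_node` is not a TPOT2 function.

```python
import math
import random

class LogisticRegression:            # placeholder for an estimator class
    def __init__(self, C=1.0, penalty="l2"):
        self.C, self.penalty = C, penalty

class SelectPercentile:              # placeholder for a feature selector class
    def __init__(self, percentile=10):
        self.percentile = percentile

class RandomTrial:
    """Hypothetical stand-in that samples randomly behind a suggest_* interface."""
    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def suggest_float(self, name, low, high, log=False):
        if log:
            return math.exp(self.rng.uniform(math.log(low), math.log(high)))
        return self.rng.uniform(low, high)

    def suggest_int(self, name, low, high):
        return self.rng.randint(low, high)

    def suggest_categorical(self, name, choices):
        return self.rng.choice(choices)

# Values: functions that return one set of hyperparameters from the search space.
def params_logistic_regression(trial):
    return {"C": trial.suggest_float("C", 1e-4, 1e4, log=True),
            "penalty": trial.suggest_categorical("penalty", ["l1", "l2"])}

def params_select_percentile(trial):
    return {"percentile": trial.suggest_int("percentile", 1, 100)}

# Keys: the Python types of the methods allowed at each position in the graph.
root_config_dict = {LogisticRegression: params_logistic_regression}
leaf_config_dict = {SelectPercentile: params_select_percentile}

def sample_node(config_dict, trial):
    """Pick a method type, draw its hyperparameters, and instantiate it."""
    method = trial.suggest_categorical("method", list(config_dict))
    return method(**config_dict[method](trial))

root = sample_node(root_config_dict, RandomTrial(seed=0))
leaf = sample_node(leaf_config_dict, RandomTrial(seed=1))
```

Keeping the sampling logic behind a trial-like object is what would let a random sampler be swapped for an optimizer later without touching the configuration dictionaries.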


Fig. 1.2 An example individual in TPOT2

Additionally, when setting the key to be the special-case string "Recursive," the user can pass in the above parameters in a dictionary in place of a function to recursively define the search space for a given node. For example, the user can define the configuration dictionaries in such a way that the leaf nodes are set to be a pipeline of the shape "Selector-Transformer-Classifier" and the root node is a final classifier.

These configuration dictionaries can be customized for specific tasks. Some examples of useful search spaces users can define using the above tools are:

• AutoQTL: AutoQTL [9] is a fork of TPOT with a custom configuration and objective function. This can be replicated in TPOT2 by using a custom configuration dictionary that includes the genetic encoders and encoding frequency selectors used as transformers and selectors, as well as passing in the custom objective function as described in the paper.
• Logistic regression with selected features: TPOT2 can perform genetic feature selection for a given machine learning algorithm. For example, the root node can be set to logistic regression, and all leaf nodes can be set to a set of feature selectors. TPOT2 can then evolve a set of selectors that optimize the performance of logistic regression.
• Symbolic regression or classification: TPOT2 can also evolve symbolic regression or classification pipelines. The root node can be set to linear or logistic regression, inner nodes can encode basic arithmetic operators, and leaf nodes can encode Feature Set Selectors.


1.4.1 Mutation

In TPOT2, we implement eight mutation methods. Currently, each of these methods is selected with uniform probability during mutation. In the future, we might explore applying these methods with different probabilities.

• _mutate_get_new_hyperparameter: Pick a node randomly and assign new hyperparameters to that node.
• _mutate_replace_method: Pick a node randomly and select a new operator (and hyperparameters) for that node.
• _mutate_remove_node: Pick a random node (other than the root node) and remove it from the graph. Connect all its children to all its parents.
• _mutate_remove_extra_edge: Randomly pick a node with more than one outgoing edge and randomly remove one edge.
• _mutate_add_connection_from: Pick two nodes and add an edge between them (as long as the graph remains acyclic after the addition).
• _mutate_insert_leaf: Create a new node and add an edge between it and a randomly chosen existing node.
• _mutate_insert_bypass_node: Pick two nodes at random and create a new node. Add edges from one of the existing nodes to the new node and from the new node to the second node (as long as the graph remains acyclic after the addition).
• _mutate_insert_inner_node: Pick two nodes connected by an edge and create a new node. Remove the edge between the original nodes. Add edges from one existing node to the new node and from the new node to the second existing node.
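Three of the structural mutations listed above can be sketched on a small directed acyclic graph. TPOT2 stores individuals as NetworkX graphs; to stay dependency-free, this sketch instead uses a plain dictionary that maps each node to the set of nodes its edges point to. The function names and the edge direction (from a step to the step that consumes its output) are assumptions of the sketch, not TPOT2's code.

```python
def has_path(graph, src, dst):
    """Depth-first search: is dst reachable from src?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

def add_connection(graph, u, v):
    """Add edge u -> v only if the graph stays acyclic; report success."""
    if u == v or v in graph[u] or has_path(graph, v, u):
        return False
    graph[u].add(v)
    return True

def insert_inner_node(graph, u, v, new):
    """Replace the existing edge u -> v with the path u -> new -> v."""
    graph[u].remove(v)
    graph[u].add(new)
    graph[new] = {v}

def remove_node(graph, node):
    """Delete a node and connect each of its predecessors to each successor."""
    successors = graph.pop(node)
    for targets in graph.values():
        if node in targets:
            targets.remove(node)
            targets |= successors

# Usage: a linear selector -> scaler -> classifier pipeline.
graph = {"selector": {"scaler"}, "scaler": {"classifier"}, "classifier": set()}
rejected = add_connection(graph, "classifier", "selector")   # would close a cycle
accepted = add_connection(graph, "selector", "classifier")   # keeps the graph acyclic
insert_inner_node(graph, "scaler", "classifier", "pca")
remove_node(graph, "scaler")
```

Checking reachability before adding an edge is the standard way to keep a graph acyclic: an edge u -> v closes a cycle exactly when u is already reachable from v.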

1.4.2 Crossover

The crossover operator in TPOT2 takes in two individuals and modifies the first individual.

• _crossover_swap_branch: A branch, or a subgraph, is a node and its descendants. This operator selects the root of a subgraph and removes it. All other nodes in the subgraph are removed if they become disconnected from the root of the whole graph. (If a node in the graph has another path to the root, it is not removed.) A full subgraph from the second individual is copied into the first individual with outgoing edges to the same parents as the originally selected node. This is similar to a subtree crossover generalized to directed acyclic graphs.
• _crossover_take_branch: This operator copies a branch (subgraph) from the second individual and attaches an edge from the root of the branch to a node chosen at random in the first individual.


1.5 TPOT2 API

1.5.1 TPOTEstimator

The TPOTEstimator class is the primary entry point into TPOT2, where users can define the search space and input other parameters for evolving graph pipelines. For convenience and consistency with TPOT1, the TPOTClassifier and TPOTRegressor classes are also provided, which contain default parameters for use in classification and regression tasks, respectively. TPOT2 exposes more parameters and options than TPOT1, giving users more flexibility in defining the search space and evolutionary parameters. For example, users can now provide their own objective functions, specify the maximum size of the pipelines, separately define the possible operators for root nodes, inner nodes, and leaf nodes, and even define a set of preprocessing steps for all pipelines. Additionally, there are parameters related to the evolutionary algorithm, such as survival selection methods, parent selection methods, genetic operators, changing the population size over time, and changing the proportion of data used over time.

1.5.1.1 Optimizations

In Parmentier et al. [14], the authors describe how their successive halving strategy can improve the performance of TPOT1. It has also been shown to be very successful in Auto-Sklearn. Unfortunately, this feature was never brought into the main software package in TPOT1. We re-implement this feature in TPOT2. The premise of the algorithm is to evaluate a large number of pipelines with a small subset of the data in the early generations and fewer pipelines with a larger subset or all of the data in the later generations. The user can provide the range as well as the rate of change of the values of the corresponding parameters. The original paper found that halving the population size and doubling the computational budget a few times over the course of the evolution led to performance improvements. Future work can look into other ways of scaling these values.
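One way to picture such a schedule is the sketch below, which halves the population and doubles the fraction of data at evenly spaced points in the run. The function name `halving_schedule`, its parameters, and the even spacing are assumptions made for illustration; they are not TPOT2 parameters.

```python
def halving_schedule(generations, start_pop, start_fraction, n_steps):
    """Return one (population_size, data_fraction) pair per generation.

    The run is split into n_steps + 1 equal stages. At each stage boundary the
    population is halved and the data fraction doubled (capped at 1.0).
    """
    stage_len = generations / (n_steps + 1)
    schedule = []
    for gen in range(generations):
        stage = min(int(gen / stage_len), n_steps)
        pop = max(1, start_pop // 2 ** stage)
        fraction = min(1.0, start_fraction * 2 ** stage)
        schedule.append((pop, fraction))
    return schedule

# 8 generations, two halving steps: 64 pipelines on 25% of the data at the
# start, 16 pipelines on all of the data at the end.
schedule = halving_schedule(generations=8, start_pop=64,
                            start_fraction=0.25, n_steps=2)
```

In this example the product of population size and data fraction stays constant across stages, so each generation costs roughly the same if evaluation time scales linearly with the amount of data.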

1.5.1.2 Cross Validation Early Stopping (Experimental)

We can reduce the computational load by not evaluating all cross-validation folds for poorly performing models. When a model performs poorly on the first few folds of cross-validation, it can reasonably be assumed that it will continue to perform poorly on the remaining folds. We can save time and computational resources by terminating the evaluation early. TPOT2 implements two strategies, which can be used independently or simultaneously.


• Threshold Early Stopping: After evaluating each fold, the pipeline must reach a certain percentile of all previously evaluated scores; otherwise, it is discarded.
• Selection Early Stopping: In selection early stopping, we evaluate the folds one at a time. After each fold, we select the best individuals and discard the rest.

In the future, we will look into implementing an algorithm similar to greedy k-fold cross-validation, as described by Soper [15], to evaluate a population more efficiently.
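The threshold strategy can be sketched as follows. The details are assumptions of this sketch rather than TPOT2's behavior: scores are compared per fold against earlier pipelines' scores on that same fold, the percentile uses the nearest-rank definition, and no pipeline is discarded until `min_history` scores exist. The names `percentile` and `evaluate_with_threshold` are hypothetical.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def evaluate_with_threshold(fold_scorers, history, q=50, min_history=3):
    """Score folds in order; return None once a fold score falls below the
    q-th percentile of earlier pipelines' scores on that same fold."""
    scores = []
    for k, score_fold in enumerate(fold_scorers):
        s = score_fold()
        previous = history.setdefault(k, [])
        too_low = len(previous) >= min_history and s < percentile(previous, q)
        previous.append(s)
        if too_low:
            return None              # remaining folds are never evaluated
        scores.append(s)
    return sum(scores) / len(scores)

# Usage with made-up fold scores: three pipelines fill the history, then a
# pipeline that is weak on the first fold is stopped after that fold.
history = {}
for folds in ([0.80, 0.82, 0.81], [0.78, 0.80, 0.79], [0.85, 0.84, 0.86]):
    evaluate_with_threshold([lambda s=s: s for s in folds], history)
weak = evaluate_with_threshold([lambda s=s: s for s in (0.50, 0.90, 0.90)], history)
```

Passing each fold as a zero-argument callable keeps the expensive fit-and-score step lazy, so folds after the stopping point cost nothing.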

1.5.1.3 Validation Set to Select From Pareto Front Models

TPOT1 would sometimes overfit the cross-validation score with overly complex pipelines that had poor generalization compared to the simpler pipelines in the Pareto front. This is more common in smaller datasets. TPOT2 can subsample the training data into a validation set. It then uses the validation set to select the best model from the Pareto front, hopefully avoiding overfit models.

1.5.2 Ensembling

TPOT1 includes classifiers and regressors in the search space for the inner nodes. All inner classifier and regressor operators (i.e., those that are not the final estimator) are wrapped in a StackingEstimator object, which passes its inputs through to the next operator in addition to its prediction. When two classifiers or regressors exist in different branches, however, this causes two duplicates of the data to be passed to the final operator, which could negatively impact performance. In TPOT2, there is no passthrough from classifiers or regressors. Only the model predictions pass to the next layer. This allows TPOT2 to learn whether or not to pass data through to the root (final estimator) node.

In testing, we have not found much improvement from including classifiers and regressors in inner nodes for TPOT2. By default, only transformers and selectors are included in the search space for the inner nodes. Future work will explore ways to improve ensembling in TPOT2. It is possible that optimizations (in parallelization, caching, successive halving, early termination of the cross-validation evaluation, etc.) may improve the performance of ensembling by allowing more models to be evaluated in the same time period. Another option would be to do post hoc ensembling with the best-evaluated pipelines after the evolutionary algorithm completes. There may be an optimal configuration dictionary to define an efficient search space for an ensemble. For example, an individual could be defined by an ensemble of 'selector-transformer-classifier' pipelines followed by a meta-classifier.


1.6 Experiment Set-Up

For our experiments, we use the benchmark datasets from diverse domains compiled in Feurer et al. [6] and hosted by OpenML. Specifically, we compare the performance of TPOT2⁴ against TPOT1 on the 39 OpenML tasks grouped in D_test in that paper. Each task provides a training and testing split for an underlying dataset. These datasets vary in the number of samples, the number and types of features, and the presence of missing values.

To make the comparison between TPOT1 and TPOT2 as fair as possible, we apply some preprocessing steps. To take care of missing values, we preprocess the data with mean imputation of the numeric variables and mode imputation of the categorical variables. Then, we perform one-hot encoding of the categorical columns with the minimum frequency of categories set to 0.001. The latter was done because TPOT1 does not have a parameter for specifying categorical columns and does not do one-hot encoding by default. The preprocessing resulted in datasets with the number of samples ranging from 463 to 389279 and the number of columns ranging from 4 to 7200. To ensure the same partitions of the data for cross-validation in TPOT1 and TPOT2, we pass a scikit-learn cross-validation splitter with randomized splits and the same seed to both.

Note that OpenML task 168795 partitions its data such that there are fewer than ten examples for two classes in the training set. We therefore randomly resampled the classes with fewer than ten examples and appended the resampled examples to the training data so that each class has at least 10 samples. This was done so that the data could be correctly split during cross-validation. Additionally, the dataset for task 189866 was very large and caused TPOT1 and TPOT2 to run into memory issues. To alleviate memory issues for this task, we set n_jobs to 24 rather than 48 and allowed 10 h instead of 5 h before considering the run to have timed out. All experiments were conducted on a High-Performance Computing cluster node with an Intel Xeon Gold 6342 CPU with 48 threads and 1 TB of memory.
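The resampling step for rare classes can be sketched with the standard library alone. The function name `top_up_rare_classes`, the seed, and the toy data are assumptions of the sketch; the actual experiment code is in the repository linked from the chapter.

```python
import random
from collections import Counter, defaultdict

def top_up_rare_classes(rows, labels, min_count=10, seed=0):
    """Append randomly resampled rows until every class has min_count examples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rows, labels = list(rows), list(labels)      # copy, leave the inputs untouched
    for y, indices in by_class.items():
        for _ in range(max(0, min_count - len(indices))):
            i = rng.choice(indices)              # sample with replacement
            rows.append(rows[i])
            labels.append(y)
    return rows, labels

# Toy data: class "b" has only 3 examples, so 7 resampled copies are appended.
rows = list(range(15))
labels = ["a"] * 12 + ["b"] * 3
new_rows, new_labels = top_up_rare_classes(rows, labels)
counts = Counter(new_labels)
```

With at least 10 examples per class, a stratified 10-fold split can place one example of every class in each fold.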

1.6.1 TPOT1 Versus TPOT2

Although TPOT1 and TPOT2 have slightly different parameters, we set their parameter values so that the experimental settings are as close to each other as possible. The parameter values are shown in Table 1.1.

Table 1.1 AutoML methods and their parameters

Method  Parameter                            Value
TPOT1   scoring                              "roc_auc" for binary, "neg_log_loss" for multiclass
TPOT1   population_size                      48
TPOT1   generations                          30
TPOT1   n_jobs                               48
TPOT1   cv                                   StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
TPOT1   max_time_mins                        None
TPOT1   max_eval_time_mins                   5
TPOT2   scores                               ["roc_auc"] for binary, ["neg_log_loss"] for multiclass
TPOT2   n_jobs                               48
TPOT2   cv                                   StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
TPOT2   max_eval_time_seconds                300
TPOT2   crossover_probability                0.1
TPOT2   mutate_probability                   0.9
TPOT2   mutate_then_crossover_probability    0
TPOT2   crossover_then_mutate_probability    0
TPOT2   other_objective_functions            [number_of_nodes_objective]
TPOT2   other_objective_functions_weights    [-1]
TPOT2   memory_limit                         None
TPOT2   preprocessing                        False

⁴ The code to run experiments: https://github.com/epistasisLab/tpot2_gptp_experiments

We compare the performance of TPOT1 and TPOT2 at 30 generations. A population size of 48 was selected to match the 48 threads on the CPU we are using. Each individual pipeline is given a limit of 5 min to be evaluated. In theory, each generation should therefore take at most approximately 5 min, since all individuals can be evaluated simultaneously. The maximum time for the entire process would be approximately 2.5 h, with some extra time allocated for processing between generations and the final pipeline fitting. We also set a maximum time limit of 5 h per run to allow for wiggle room. If the run exceeds this limit, it is terminated and not included in the results.

Both algorithms used a 90% mutation rate and a 10% crossover rate. The primary objective is to maximize the area under the receiver operating characteristic curve (AUROC) for binary problems or to minimize the log loss for multiclass problems. In addition, a secondary objective function was included that minimizes the number of nodes in a given pipeline.

There is a key difference in the search space of the two algorithms. TPOT1 allows classifiers wrapped inside a custom StackingEstimator class, which passes its inputs along with its predictions to the next node, to be included in the inner or leaf nodes (this cannot be disabled). TPOT2, however, for these experiments, includes classifiers only in the root node.
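The time budget quoted in this section follows from simple arithmetic, which the sketch below makes explicit. The helper `worst_case_hours` is a hypothetical name, and the model is an assumption: it counts only evaluation time and assumes each worker evaluates one pipeline at a time.

```python
import math

def worst_case_hours(generations, population_size, n_workers, eval_limit_min):
    """Upper bound on evaluation time if every pipeline uses its full time limit."""
    batches = math.ceil(population_size / n_workers)   # sequential rounds per generation
    return generations * batches * eval_limit_min / 60

# 30 generations, 48 pipelines on 48 threads, 5 min per pipeline.
budget = worst_case_hours(30, 48, 48, 5)
# With half the workers, each generation needs two rounds of evaluation.
budget_half = worst_case_hours(30, 48, 24, 5)
```

Under these assumptions the bound is 2.5 h with 48 workers and 5 h with 24 workers, before adding any time for processing between generations or fitting the final pipeline.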


1.7 Results and Discussion

For each dataset, we evaluated five runs each for TPOT1 and TPOT2. Table 1.2 summarizes the number of completed and failed runs for each method. A failed run is one in which the algorithm throws an error and terminates prematurely, or does not finish in the allotted time period and is killed by the system. TPOT1 failed on 45 runs, and TPOT2 failed on only one.

The method that TPOT1 uses for timing out pipelines does not always work correctly, partly because it is incompatible with the C backends of many algorithms. This can cause it to get stuck and run over time. Some TPOT1 failures were also due to memory constraints with the larger datasets. In some other cases, TPOT1 seems to abruptly end training early before completing all generations. We are not sure what the cause of this is. We still included those results in our analysis, as TPOT1 returned a final pipeline in those runs.

The single TPOT2 failure occurred on the largest dataset and was due to memory issues. There are still some optimizations and fixes to be made so that TPOT2 is more stable for larger datasets. TPOT2 may also effectively end training early when memory issues prevent it from continuing. This occurs when all worker processes crash simultaneously, generally due to memory issues, which then causes all subsequent evaluations to fail (Fig. 1.3).

Table 1.2 Summary of run completion for each algorithm

Algorithm  failed_count  completed_count
TPOT1      45            150
TPOT2      1             194

Fig. 1.3 Stripplot summarizing duration in minutes to run each method on a given dataset. The x-axis denotes the identifiers of the datasets


We later found a bug introduced in version 0.12.0 that caused early termination of some runs in TPOT1. This bug reduced training times for some runs and potentially lowered the average scores for TPOT1 runs in this chapter. We have fixed this particular bug in the 0.12.1 release. The updated results with the fixed TPOT1 are published in the GitHub repository. Note that we only include completed runs (including the TPOT1 runs that end training abruptly but still return a final pipeline) in our analyses in this chapter.

The performance of TPOT1 and TPOT2 on each dataset is measured in the following way. For every run, we take the pipeline with the best objective function value (log loss or AUROC, depending on the dataset) across all generations on the training set, calculate its performance on the holdout set, and report that value. We average the values over the five runs. Figure 1.4 illustrates the scores of the best pipeline found in a given run on the holdout set for each dataset. Additionally, in Fig. 1.5, we show the standard deviation of scores for each model across all datasets. In Fig. 1.6, we plot the average test scores of TPOT1 and TPOT2 on all the datasets. Since all the points are very close to the diagonal line, we see that TPOT1 and TPOT2 have very similar performance with the given parameter settings, though in a few instances TPOT2 had meaningfully better scores than TPOT1.

Fig. 1.4 Stripplot summarizing the test scores of both methods on all the datasets. Log loss scores are reported for multiclass tasks and AUROC for binary problems. Diamonds indicate the average score for a method on a particular dataset

Fig. 1.5 Histograms (for binary and multiclass datasets, respectively) summarizing the standard deviation of scores for completed runs across all datasets

Fig. 1.6 Scatterplot summarizing the test scores of the methods on all the datasets. Each point in the plots represents a dataset. Log loss scores are reported for multiclass tasks and AUROC for binary problems

Next, we take a deeper look into the results of the TPOT2 runs. Figure 1.7 plots the best cross-validation scores on the training set in the population for a given generation for all runs on different datasets. TPOT2 appears to generally converge quickly, with only minor improvements after a few generations. We also look at the Pareto front of all the evaluated pipelines in a given run. This is shown in Fig. 1.8, where the y-axis denotes the value of the primary objective (log loss or AUROC), and the x-axis denotes the value of the secondary objective (number of nodes in a given pipeline). The plots show that the larger models provide negligible improvements in primary objective scores.

Fig. 1.7 TPOT2 best cross-validation scores in the population for a given generation. Each line represents a single run for a given dataset

Fig. 1.8 Pareto front of all the evaluated pipelines. Each line represents the Pareto front for a single run

The number-of-nodes objective is not a perfect measure of model complexity. A larger pipeline may be simpler than a smaller one. For example, a pipeline with logistic regression and several feature selectors is likely simpler than a model with a single XGBoost. However, more often than not, larger pipelines are more complex and overfit the cross-validation score on the training data. TPOT1 often had a similar issue where the final returned model was large. While the large model would have the best cross-validation score, it would underperform on test data compared to other Pareto front models.
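The trade-off described here can be illustrated with a small sketch that computes a Pareto front over the two objectives and then picks a model using a separate validation score. The numbers are made up purely for illustration and are not results from this chapter; `pareto_front` and the dictionary keys are hypothetical names.

```python
def pareto_front(pipelines):
    """Pipelines not dominated on (cv_score: higher is better, n_nodes: lower is better)."""
    front = []
    for p in pipelines:
        dominated = any(
            q["cv_score"] >= p["cv_score"] and q["n_nodes"] <= p["n_nodes"]
            and (q["cv_score"] > p["cv_score"] or q["n_nodes"] < p["n_nodes"])
            for q in pipelines
        )
        if not dominated:
            front.append(p)
    return front

# Illustrative values only: D is the largest pipeline and has the best
# cross-validation score, but not the best validation score.
pipelines = [
    {"name": "A", "cv_score": 0.90, "n_nodes": 1, "val_score": 0.89},
    {"name": "B", "cv_score": 0.92, "n_nodes": 3, "val_score": 0.91},
    {"name": "C", "cv_score": 0.91, "n_nodes": 4, "val_score": 0.90},
    {"name": "D", "cv_score": 0.93, "n_nodes": 7, "val_score": 0.88},
]
front = pareto_front(pipelines)
best_by_cv = max(front, key=lambda p: p["cv_score"])
best_by_validation = max(front, key=lambda p: p["val_score"])
```

In this toy example, C is dominated by B, and selecting by validation score returns the smaller pipeline B instead of D, which mirrors the motivation for the validation-set option in Sect. 1.5.1.3.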

1.8 Conclusions

In this chapter, we presented a new version of the popular AutoML package TPOT called TPOT2. Among other differences, TPOT2 uses a graph-based representation for individuals instead of the tree-based representation used by TPOT. By benchmarking both versions on a diverse set of datasets from OpenML, we show that the new version performs at least as well as the original. Moreover, the new version has several additional features that allow it to be more flexible for different types of problems, more easily maintained, and more easily extended. We will continue to build additional features and optimizations into TPOT2.


1.8.1 Future Work

There are many avenues for optimizing existing features and adding new ones in TPOT2. In this section, we highlight some of our future plans.

1.8.1.1 MetaLearner

TPOT2 allows users to specify the search space of pipelines for a particular dataset and to control other aspects of the evolutionary search through many different parameter settings. However, different parameter settings will be optimal for different types of problems. For example, a large dataset might benefit more from the successive halving algorithm than a smaller one. Trying out several parameter settings can be computationally expensive. Auto-Sklearn 2 addresses this issue by training a 'meta-learner' that estimates the best parameter values for a given dataset. In the future, we can also look into training a meta-learner to estimate optimal parameters for a given dataset in TPOT2.

1.8.1.2 Optuna Optimization

Currently, hyperparameters for different ML operators in TPOT2 are generated and mutated randomly. We plan to look into different strategies for integrating Optuna to optimize hyperparameters during an evolutionary run.

1.8.1.3 Interpretability

Given that the increase in the performance of TPOT2 plateaued early in an evolutionary run on many datasets, it is possible that complex pipelines are not required for optimal performance on these datasets. In order to further improve performance, future work could look into having TPOT2 focus more on hyperparameter optimization in the later generations. Alternatively, we could leverage TPOT2's strength in complex graph building to try to build more interpretable pipelines that may be composed of a higher number of more interpretable steps. For example, TPOT2 could optimize a set of more robust feature selection and engineering steps followed by a simple classifier, as opposed to a single complex XGBoost model. This process could include defining an objective function that more accurately measures interpretability.

Acknowledgements The study was supported by the following NIH grants: R01 LM010098 and U01 AG066833.


References

1. Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019)
2. Blank, J., Deb, K.: pymoo: Multi-objective optimization in Python. IEEE Access 8, 89497–89509 (2020)
3. Cavaglià, M., Gaudio, S., Hansen, T., Staats, K., Szczepańczyk, M., Zanolin, M.: Improving the background of gravitational-wave searches for core collapse supernovae: a machine learning approach. Mach. Learn.: Sci. Technol. 1(1), 015005 (2020)
4. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
5. Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M., Hutter, F.: Auto-sklearn 2.0: hands-free AutoML via meta-learning. arXiv:2007.04074 [cs.LG] (2020)
6. Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M., Hutter, F.: Auto-sklearn 2.0: hands-free AutoML via meta-learning. J. Mach. Learn. Res. 23(1), 11936–11996 (2022)
7. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems 2015, vol. 28, pp. 2962–2970 (2015)
8. Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., Gagné, C.: DEAP: evolutionary algorithms made easy. J. Mach. Learn. Res. 13, 2171–2175 (2012)
9. Freda, P.J., Ghosh, A., Zhang, E., Luo, T., Chitre, A.S., Polesskaya, O., St. Pierre, C.L., Gao, J., Martin, C.D., Chen, H. et al.: Automated quantitative trait locus analysis (AutoQTL). BioData Mining 16(1) (2023)
10. Hagberg, A.A., Schult, D.A., Swart, P.J.: Exploring network structure, dynamics, and function using NetworkX. In: Varoquaux, G., Vaught, T., Millman, J. (eds.), Proceedings of the 7th Python in Science Conference, pp. 11–15. Pasadena (2008)
11. Manduchi, E., Fu, W., Romano, J.D., Ruberto, S., Moore, J.H.: Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses. BMC Bioinf. 21(1) (2020)
12. Manduchi, E., Romano, J.D., Moore, J.H.: The promise of automated machine learning for the genetic analysis of complex traits. Hum. Genet. 141(9), 1529–1544 (2021)
13. Olson, R.S., Moore, J.H.: TPOT: a tree-based pipeline optimization tool for automating machine learning. In: Workshop on Automatic Machine Learning, pp. 66–74. PMLR (2016)
14. Parmentier, L., Nicol, O., Jourdan, L., Kessaci, M.E.: TPOT-SH: a faster optimization algorithm to solve the AutoML problem on large datasets. In: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 471–478. IEEE (2019)
15. Soper, D.S.: Greed is good: rapid hyperparameter optimization and model selection using greedy k-fold cross validation. Electronics 10(16), 1973 (2021)
16. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 847–855 (2013)

Chapter 2

Analysis of a Pairwise Dominance Coevolutionary Algorithm with Spatial Topology Mario Hevia Fajardo, Per Kristian Lehre, Jamal Toutouh, Erik Hemberg, and Una-May O’Reilly

M. Hevia Fajardo (corresponding author) · P. K. Lehre: University of Birmingham, Birmingham, England
J. Toutouh: University of Malaga, Malaga, Spain
E. Hemberg · U.-M. O’Reilly: ALFA, MIT CSAIL, Cambridge, MA, USA
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_2

2.1 Introduction

We are interested in replicating adversarial settings computationally. In these settings there are two competing sides, the behavior of the competitors can be asymmetric, and the sides interact because of their conflicting objectives. The competitors may be characterized as predators and prey, attackers and defenders, game players, software modules and test cases, etc. [17]. The evolutionary process drives the emergence of dynamics such as competitive oscillation or arms races. Replicating these dynamics computationally has multiple benefits, e.g., moving from reactive to anticipatory defenses, and tracing threats from behavioral intentions and tools [25]. Existing research studies this sort of computational replication with competitive coevolutionary algorithms in different application areas [14, 17, 25, 27].

A competitive coevolutionary algorithm (CCA) can be used to replicate adversarial learning [26]. An Evolutionary Algorithm (EA) typically evolves individual solutions [13] using an a-priori defined fitness function to evaluate an individual’s quality. In contrast, coevolutionary algorithms calculate an individual’s fitness based on its interactions with other individuals or a dynamic environment, allowing them to mimic coupled interactions. Moreover, the behavior of evolutionary processes depends on the selective pressure and mutation rates [10]. When the mutation rate is too high, genetic information can be lost. The threshold value (error threshold) for the mutation rate depends on the selective pressure in the evolutionary process (see Sect. 2.2.3).

The CCA can integrate a spatial topology that determines competitor and parent relationships in the population. Spatially organizing an algorithm alters the selection pressure and could lead to more robust solutions [15]. We want to start an empirical investigation of spatial topologies in CCAs with respect to performance, diversity and error threshold. Our research questions are:

RQ-1 What properties of a spatial topology impact the PDCoEA performance?
RQ-2 How do different spatial topologies impact the performance of a CCA in the form of the PDCoEA variants [20]?
RQ-3 How does the problem affect the CCA with a spatial topology?
RQ-4 How does the spatial topology impact the diversity of the PDCoEA?
RQ-5 How does the error threshold change for PDCoEA with spatial topologies?

We introduce STPDCoEA, a CCA with spatial topology. The STPDCoEA extends a theoretically analyzed CCA named PDCoEA [19]; it is simple to describe but has complex behavior. We study population connectivity through a spatial topology in terms of topology, performance, diversity, problem, and error threshold. We consider the game DefendIt [20], Games of Skill [6] and a OneMax-like problem. Our contributions are:

• An empirical investigation of how different spatial topologies impact the performance of the STPDCoEA. We observe that the topology has an impact on the performance when using a champion measure of the player.
• An analysis of different problems and their effect on the STPDCoEA with a spatial topology. We observe the impact on both performance and diversity.
• A comparison of the diversity for the spatial topology of the STPDCoEA. We see that the spatial topology has an impact on the diversity measurements. The distribution of the degree of connectedness of the nodes is a factor.
• An analysis of the error threshold for the STPDCoEA with spatial topologies. We observe that different topologies have different error thresholds.

The chapter proceeds as follows. In Sect. 2.2 we present the preliminaries. In Sect. 2.3 we present the experimental methodology. In Sect. 2.4 we present the experiments. In Sect. 2.5 we present related work. Finally, in Sect. 2.6 we present conclusions and future work.


2.2 Preliminaries

We use the following notational conventions. For any $n \in \mathbb{N}$, we write $[n] := \{1, \ldots, n\}$. Given a set $X$ and a function $f : X \to \mathbb{R}$, we let $\arg\max_{x \in X} f(x)$ refer to an arbitrary element in $X$ that takes the maximal $f$-value. For any $n \in \mathbb{N}$, the set of bitstrings of length $n$ is denoted $\{0,1\}^n$. For any bitstring $x \in \{0,1\}^n$, the $i$-th bit in $x$ is denoted $x_i$. We let $H(x, y) := \sum_{i=1}^{n} |x_i - y_i|$ denote the Hamming distance between two bitstrings $x, y \in \{0,1\}^n$. We use standard asymptotic notation, including $O$, $\Omega$, $\omega$ and $\mathrm{poly}(n)$. For any Boolean value $b$, we let $[b]$ denote the Iverson bracket, defined by $[b] = 1$ if $b$ is true, and $[b] = 0$ otherwise.

We start with coevolutionary algorithms in Sect. 2.2.1. In Sect. 2.2.2 we describe spatial topologies for PDCoEA. In Sect. 2.2.3 we describe error thresholds. In Sect. 2.2.4 we describe the problems (games) we investigate.
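As a small illustration of this notation, the following Python sketch computes the Hamming distance, the Iverson bracket and the number of 1-bits. Representing bitstrings as lists of 0/1 integers, and the function names themselves, are assumptions made for this example only.

```python
def hamming(x, y):
    """Hamming distance H(x, y) between two bitstrings of equal length."""
    if len(x) != len(y):
        raise ValueError("bitstrings must have the same length")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def iverson(b):
    """Iverson bracket [b]: 1 if b is true, 0 otherwise."""
    return 1 if b else 0


def ones(x):
    """Number of 1-bits in x, written |x|_1 later in the chapter."""
    return sum(x)
```

For example, `hamming([0, 1, 1, 0], [1, 1, 0, 0])` counts the two positions in which the bitstrings differ.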

2.2.1 Coevolutionary Algorithms

Biological coevolution refers to the influences two or more species exert on each other’s evolution [27]. A seminal paper on reciprocal relationships between insects and plants coined the term “coevolution” [9]. Coevolution can be cooperative, i.e., of mutual benefit, or competitive, i.e., based on negative interactions arising from constrained and shared resources or from predator-prey relationships.

An Evolutionary Algorithm (EA) typically evolves individual solutions, e.g., fixed-length bitstrings as in Genetic Algorithms (GAs) [13], using an a-priori defined fitness function to evaluate an individual’s quality. In contrast, coevolutionary algorithms calculate an individual’s fitness based on its interactions with other individuals or a dynamic environment, allowing them to mimic coupled natural species-to-species interactions.

We use the theoretically analyzed CCA called PDCoEA [19] as a starting point (see Algorithm 1). The basis of this algorithm is selection at the level of pairs of opposing players using the “dominance” relation in Definition 2.1. Here, $g_1(x, y)$ represents the payoff received by predator $x$ when competing against prey $y$, and $g_2(x, y)$ is the payoff of prey $y$ when competing against predator $x$. Informally, a predator-prey pair $(x_1, y_1)$ is said to “dominate” a predator-prey pair $(x_2, y_2)$ if simultaneously predator $x_1$ receives at least as high a payoff as predator $x_2$ when evaluated against prey $y_1$, and prey $y_1$ receives at least as high a payoff as prey $y_2$ when evaluated against predator $x_1$.

Initially (lines 1–3), the algorithm samples $\lambda$ predators and prey uniformly at random. In each generation (lines 5–14), the algorithm produces $\lambda$ new predator-prey pairs which form the next generation of predators $P_{t+1}$ and prey $Q_{t+1}$. Each new predator-prey pair is constructed by applying mutation (lines 11–12) to a pair $(x, y)$ of predator and prey selected from the current populations $P_t$ and $Q_t$ (lines 6–10). In the selection step, the algorithm compares two predator-prey pairs and selects the dominating pair.


Definition 2.1 ([19]) Given two functions $g_1, g_2 : X \times Y \to \mathbb{R}$ and two pairs $(x_1, y_1), (x_2, y_2) \in X \times Y$, we say that $(x_1, y_1)$ dominates $(x_2, y_2)$ wrt $g_1$ and $g_2$, denoted $(x_1, y_1) \succeq (x_2, y_2)$, if and only if $g_1(x_1, y_1) \geq g_1(x_2, y_1)$ and $g_2(x_1, y_1) \geq g_2(x_1, y_2)$.

Definition 2.1 generalizes the maximin-dominance relation defined in [19], which corresponds to the case where the second payoff function is $g_2(x, y) := -g_1(x, y)$.

Algorithm 1 Pairwise Dominance CoEA (PDCoEA) [19]
Require: Payoff functions $g_1, g_2 : \{0,1\}^n \times \{0,1\}^n \to \mathbb{R}$.
Require: Population size $\lambda \in \mathbb{N}$ and mutation rate $\chi \in (0, n]$.
 1: for $i \in [\lambda]$ do
 2:   Sample $P_0(i) \sim \mathrm{Unif}(\{0,1\}^n)$ and $Q_0(i) \sim \mathrm{Unif}(\{0,1\}^n)$
 3: end for
 4: for $t \in \mathbb{N}$ until termination criterion met do
 5:   for $i \in [\lambda]$ do
 6:     Sample $(x_1, y_1) \sim \mathrm{Unif}(P_t \times Q_t)$
 7:     Sample $(x_2, y_2) \sim \mathrm{Unif}(P_t \times Q_t)$
 8:     if $(x_1, y_1) \succeq (x_2, y_2)$ then $(x, y) := (x_1, y_1)$
 9:     else $(x, y) := (x_2, y_2)$
10:     end if
11:     Obtain $x'$ by flipping each bit in $x$ with probability $\chi/n$.
12:     Obtain $y'$ by flipping each bit in $y$ with probability $\chi/n$.
13:     Set $P_{t+1}(i) := x'$ and $Q_{t+1}(i) := y'$.
14:   end for
15: end for
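The pseudocode of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the list-of-bits representation, the fixed number of generations as termination criterion and the `seed` parameter are all assumptions. The dominance test follows the informal description in Sect. 2.2.1 (predator $x_1$ is compared with $x_2$ against prey $y_1$, and prey $y_1$ is compared with $y_2$ against predator $x_1$).

```python
import random


def dominates(pair1, pair2, g1, g2):
    """Pairwise dominance: x1 does at least as well as x2 against y1,
    and y1 does at least as well as y2 against x1."""
    (x1, y1), (x2, y2) = pair1, pair2
    return g1(x1, y1) >= g1(x2, y1) and g2(x1, y1) >= g2(x1, y2)


def mutate(bits, chi, rng):
    """Flip each bit independently with probability chi / n."""
    p = chi / len(bits)
    return [1 - b if rng.random() < p else b for b in bits]


def pdcoea(g1, g2, n, lam, chi, generations, seed=0):
    """Sketch of Algorithm 1; stops after a fixed number of generations (assumption)."""
    rng = random.Random(seed)
    # Lines 1-3: sample lam predators and lam prey uniformly at random.
    P = [[rng.randint(0, 1) for _ in range(n)] for _ in range(lam)]
    Q = [[rng.randint(0, 1) for _ in range(n)] for _ in range(lam)]
    for _ in range(generations):
        next_P, next_Q = [], []
        for _ in range(lam):
            # Lines 6-7: a uniform sample from P_t x Q_t is an independent
            # uniform choice of one predator and one prey.
            pair1 = (rng.choice(P), rng.choice(Q))
            pair2 = (rng.choice(P), rng.choice(Q))
            # Lines 8-10: keep the dominating pair, otherwise the second pair.
            x, y = pair1 if dominates(pair1, pair2, g1, g2) else pair2
            # Lines 11-13: bitwise mutation of both members of the pair.
            next_P.append(mutate(x, chi, rng))
            next_Q.append(mutate(y, chi, rng))
        P, Q = next_P, next_Q
    return P, Q
```

A toy zero-sum payoff such as `g1 = lambda x, y: sum(x) - sum(y)` with `g2 = lambda x, y: -g1(x, y)` is enough to exercise the sketch, e.g. `pdcoea(g1, g2, n=8, lam=10, chi=0.5, generations=3)`.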

2.2.2 Spatial Topologies for PDCoEA

We extend the PDCoEA with spatial topologies for the populations. First, we define the neighborhood of a vertex $v$ in a graph $G$ as the set of vertices that share an edge with $v$, and denote it by $N_G(v)$. The Spatial Topology Pairwise Dominance CoEA (STPDCoEA) is a variant of PDCoEA that uses a spatial topology for the populations, see Algorithm 2. The STPDCoEA requires a graph topology $G$. Each node in the graph contains a predator-prey pair $(x, y)$. At each generation $t$ (lines 6–15), each node $G_t(v)$ uniformly samples a node from its neighborhood $N_G(v)$; this creates a pair of adversaries (opponents). The “dominating” pair is varied, producing offspring which are stored on the node $G_{t+1}(v)$.

The STPDCoEA differs from the PDCoEA in three ways. First, the STPDCoEA evaluates all the nodes at each generation, whereas the PDCoEA samples uniformly from the population $\lambda$ times each generation. Second, the PDCoEA is implicitly fully connected, since every individual can compete against every other, whereas in the STPDCoEA the possible adversaries are determined by the connected nodes (the neighborhood). Finally, in the STPDCoEA each individual is paired with an opponent in the graph, and any offspring the pair generates remain paired in the next generation. In the PDCoEA, in contrast, individuals are paired randomly in every generation.


Algorithm 2 Spatial Topology Pairwise Dominance CoEA (STPDCoEA)
Require: Payoff functions $g_1, g_2 : \{0,1\}^n \times \{0,1\}^n \to \mathbb{R}$.
Require: Population size $\lambda \in \mathbb{N}$ and mutation rate $\chi \in (0, n]$.
Require: Graph topology $G := G(V, E)$.
 1: for $v \in [\lambda]$ do
 2:   Sample $P_0(v) \sim \mathrm{Unif}(\{0,1\}^n)$ and $Q_0(v) \sim \mathrm{Unif}(\{0,1\}^n)$
 3:   Set $G_0(v) := (P_0(v), Q_0(v))$
 4: end for
 5: for $t \in \mathbb{N}$ until termination criterion met do
 6:   for $v \in [\lambda]$ do
 7:     Let $(x_1, y_1) := G_t(v)$
 8:     Sample $(x_2, y_2) \sim \mathrm{Unif}(N_G(v))$
 9:     if $(x_1, y_1) \succeq (x_2, y_2)$ then $(x, y) := (x_1, y_1)$
10:     else $(x, y) := (x_2, y_2)$
11:     end if
12:     Obtain $x'$ by flipping each bit in $x$ with probability $\chi/n$.
13:     Obtain $y'$ by flipping each bit in $y$ with probability $\chi/n$.
14:     Set $G_{t+1}(v) := (x', y')$
15:   end for
16: end for
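A minimal Python sketch of Algorithm 2 is given below, under several assumptions that are not taken from the chapter: the topology is passed as a dictionary `neighbors` mapping each node to the list of nodes in $N_G(v)$, every node has at least one neighbor, the run stops after a fixed number of generations, and the dominance test follows the informal description in Sect. 2.2.1. All names are hypothetical.

```python
import random


def stpdcoea(g1, g2, n, neighbors, chi, generations, seed=0):
    """Sketch of Algorithm 2 on a topology given as {node: [neighbor, ...]}."""
    rng = random.Random(seed)

    def random_bits():
        return [rng.randint(0, 1) for _ in range(n)]

    def mutate(bits):
        # Flip each bit independently with probability chi / n.
        return [1 - b if rng.random() < chi / n else b for b in bits]

    def dominates(pair1, pair2):
        (x1, y1), (x2, y2) = pair1, pair2
        return g1(x1, y1) >= g1(x2, y1) and g2(x1, y1) >= g2(x1, y2)

    # Lines 1-4: every node starts with a random predator-prey pair.
    G = {v: (random_bits(), random_bits()) for v in neighbors}
    for _ in range(generations):
        next_G = {}
        for v in neighbors:
            pair1 = G[v]                          # line 7
            pair2 = G[rng.choice(neighbors[v])]   # line 8 (assumes a non-empty neighborhood)
            x, y = pair1 if dominates(pair1, pair2) else pair2
            next_G[v] = (mutate(x), mutate(y))    # lines 12-14
        G = next_G
    return G


# Usage on a 6-node cycle with a toy zero-sum payoff (illustrative only).
cycle = {v: [(v - 1) % 6, (v + 1) % 6] for v in range(6)}
toy_g1 = lambda x, y: sum(x) - sum(y)
toy_g2 = lambda x, y: -toy_g1(x, y)
final = stpdcoea(toy_g1, toy_g2, n=8, neighbors=cycle, chi=0.5, generations=5)
```

Because the next generation is written to a separate dictionary, every node in generation $t+1$ is computed from the pairs of generation $t$, matching the synchronous update in the pseudocode.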

2.2.3 Error Thresholds

The behavior of evolutionary processes depends strongly on the relationship between selective pressure and mutation rates. Informally, when the mutation rate exceeds a certain so-called “error threshold”, genetic information cannot be fully replicated, and good genetic material can disappear. The value of the error threshold depends on the selective pressure in the process. Eigen’s quasi-species model of evolution provides the first mathematical characterization [10]. Consider a “needle in the haystack” fitness landscape with one fit “master bitstring” surrounded by other bitstrings of the same low fitness. Assuming that the initial individuals are copies of the master sequence, two different evolutionary dynamics may occur. If the mutation rate is below the error threshold, then some fraction of the population will remain on the master sequence. However, if the mutation rate is above the error threshold, then the individuals will eventually distribute evenly over the search space, essentially losing the genetic information in the master bitstring. Note that this model makes several simplifying assumptions, e.g., infinite population size, infinite time, and a “needle in the haystack” landscape.

The first rigorous analysis connecting error thresholds to the runtime of evolutionary algorithms was presented in [18] (Theorem 2.1). This result applies to any evolutionary process following Algorithm 3. Note that the pseudocode in Algorithm 3 refers to an unspecified selection mechanism in line 3, and is thus an algorithmic “blueprint” covering a range of evolutionary algorithms, rather than a specific algorithm. The PDCoEA fits this blueprint if we focus only on what happens to one population (e.g., the predator population $P$) and ignore the other population (e.g., the prey population $Q$). The algorithm maintains a population of $\lambda$ individuals. In each generation (lines 2–5), the algorithm produces a new population of $\lambda$ individuals. Each of these individuals is produced independently, first by selecting (using any selection mechanism) an individual from the current population (line 3), then by applying bitwise mutation to the selected individual (line 4).

Fig. 2.1 Illustration of Theorem 2.1

Algorithm 3 Population Selection-Variation Algorithm (PSVA)
Require: Bitstring length $n \in \mathbb{N}$, population size $\lambda \in \mathbb{N}$, and mutation rate $\chi \in (0, n]$.
Require: Initial population of $\lambda$ individuals $P_0(1), \ldots, P_0(\lambda) \in \{0,1\}^n$.
 1: for $t \in \mathbb{N}$ until termination criterion met do
 2:   for $i \in [\lambda]$ do
 3:     Select a parent index $I_t(i) \in [\lambda]$, and set $x := P_t(I_t(i))$.
 4:     Obtain $x'$ by flipping each bit in $x$ with probability $\chi/n$.
 5:     Set $P_{t+1}(i) := x'$.
 6:   end for
 7: end for

Informally, Theorem 2.1 can be described as follows and is illustrated in Fig. 2.1 (left). Let $x^* \in \{0,1\}^n$ be any fixed bitstring, not necessarily the bitstring with highest fitness. Assume that the algorithm has selected an individual $x$ that is within Hamming distance $b(n) < n/5$ of $x^*$. Then, since $x$ is relatively close to $x^*$, any offspring $x'$ produced from $x$ via bitwise mutation will in expectation have a larger Hamming distance to $x^*$ than its parent $x$. The expected change in distance, i.e., the drift, increases as a function of the mutation rate $\chi$. If the algorithm does not select $x$ or other individuals close to $x^*$ with sufficiently high probability, then the population will “drift away” from the bitstring $x^*$. The negative drift can be offset by increasing the probability of selecting individuals near $x^*$, i.e., by increasing the selective pressure.

Note that we can choose $x^*$ to be any bitstring, e.g., the optimum. Assume that the fitness landscape is partitioned into a low fitness region (here called the “infeasible region”) and a high fitness region (here called the “feasible region”), where the feasible region forms a “funnel structure” as indicated in Fig. 2.1 (right). If the entrance to the funnel is sufficiently narrow, then we can consider a bitstring $x^*$ at the “entrance” of the funnel. The result above shows that if the mutation rate is too high, then the algorithm is prevented from progressing further into the funnel.

Theorem 2.1 makes this precise. Let $\alpha_0 \geq 1$ be a value such that no individual within Hamming distance $b(n)$ of the target bitstring $x^*$ has more than $\alpha_0$ offspring in expectation, i.e., condition 1 of Theorem 2.1. If the mutation rate satisfies $\chi > \ln(\alpha_0)/(1 - \delta)$, then with overwhelmingly high probability the algorithm needs exponential time to produce any search point within distance $a(n) := b(n)(1 - \delta)$ of $x^*$.

Theorem 2.1 ([18]) Given Algorithm 3 with mutation parameter $\chi$ and population size $\lambda = \mathrm{poly}(n)$. Let $a(n)$ and $b(n)$ be positive integers s.t. $b(n) \leq n/\chi$ and $d(n) := b(n) - a(n) = \omega(\ln n)$. For an $x^* \in \{0,1\}^n$, let $T(n)$ be the smallest $t \geq 0$ s.t. $H(P_t(j), x^*) \leq a(n)$ for some $j$, $1 \leq j \leq \lambda$. Let $R_t(i) := \sum_{j=1}^{\lambda} [I_t(j) = i]$. If there are constants $\alpha_0 \geq 1$ and $\delta > 0$ s.t.

1. $E\left[ R_t(i) \mid a(n) < H(P_t(i), x^*) < b(n) \right] \leq \alpha_0$ for all $i$, $1 \leq i \leq \lambda$,
2. $\psi := \ln(\alpha_0)/\chi + \delta < 1$, and
3. $\frac{b(n)}{n} < \min\left\{ \frac{1}{5}, \frac{1}{2} - \frac{1}{2}\sqrt{\psi(2 - \psi)} \right\}$,

then there exists a constant $c > 0$ such that

$$\Pr\left( T(n) \leq e^{c\,d(n)} \right) \leq e^{-\Omega(d(n))}. \qquad (2.1)$$
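To make the conditions concrete, the sketch below evaluates conditions 2 and 3 numerically for given $\alpha_0$, $\chi$ and $\delta$. It reads condition 3 as $b(n)/n < \min\{1/5,\ 1/2 - \tfrac{1}{2}\sqrt{\psi(2-\psi)}\}$, and the function names are hypothetical. Condition 1 depends on the selection mechanism and is not checked here.

```python
import math


def chi_threshold(alpha0, delta):
    """Condition 2 (psi < 1) holds exactly when chi exceeds ln(alpha0) / (1 - delta)."""
    return math.log(alpha0) / (1.0 - delta)


def psi(alpha0, chi, delta):
    """psi := ln(alpha0) / chi + delta, as in condition 2."""
    return math.log(alpha0) / chi + delta


def max_b_over_n(alpha0, chi, delta):
    """Upper bound on b(n)/n from condition 3, or None if condition 2 fails."""
    p = psi(alpha0, chi, delta)
    if not p < 1.0:
        return None
    return min(1.0 / 5.0, 0.5 - 0.5 * math.sqrt(p * (2.0 - p)))
```

For instance, with $\alpha_0 = 2$ the threshold `chi_threshold(2, delta)` approaches $\ln 2$ as $\delta \to 0$, which is consistent with the mutation probability $\ln(2)/n$ quoted for the PDCoEA in this section.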

Error thresholds were recently documented in coevolutionary algorithms. Informally, for any sufficiently small subset $A \times B \subset X \times Y$ of the search space, the PDCoEA (Algorithm 1 in this chapter) with bitwise mutation probability above $\ln(2)/n$ needs exponential time to sample any pair of strategies in $A \times B$ with overwhelmingly high probability. For the formal statement, see Theorem 14 in [19].

2.2.4 Problems

We describe the problems (games) we investigate here. In Sect. 2.2.4.1 we describe Games of Skill. In Sect. 2.2.4.2 we describe DefendIt. In Sect. 2.2.5 we describe MaximinHill, a simple problem we propose to explore error thresholds.

2.2.4.1 Games of Skill

Games of Skill [6] are defined as follows:

Definition 2.2 (Games of Skill [6]) We define the payoff of a Game of Skill as a random anti-symmetric matrix, where each entry equals

$$f(x_i, y_j) := \tfrac{1}{2}(Q_{ij} - Q_{ji}) = \tfrac{1}{2}(W_{ij} - W_{ji}) + S_i - S_j,$$

where $Q_{ij} = W_{ij} + S_i - S_j$, all $W_{ij}$ and $S_i$ are independent random variables with distributions $\mathcal{N}(0, \sigma_W^2)$ and $\mathcal{N}(0, \sigma_S^2)$ respectively, and $\sigma = \max\{\sigma_W, \sigma_S\}$.

The intuition is that $S_i$ captures a part of the transitive strength of a strategy $\pi_i$. It can be seen as a model where each player is assigned a single ranking, which is used to estimate winning probabilities. Moreover, $W_{ij}$ encodes all interactions that are specific only to $\pi_i$ playing against $\pi_j$, and can represent non-transitive interactions (i.e., cycles), although the randomness means it can also become transitive. We represent the strategies with bitstrings. A strategy $x_i$ is used if $|x|_1 = i$ and a strategy $y_j$ is used if $|y|_1 = j$ (recall that $|x|_1$ refers to the number of 1-bits in the bitstring $x$). We sorted $x_i$ and $y_j$ by the values of $S_i$ and $S_j$ so that the mutation operators used by CCAs can find similar solutions. Arguably, the sorting mimics real-world games of skill, since similar strategies tend to give similar results.
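The construction in Definition 2.2 can be sketched with the Python standard library as follows. Several choices here are assumptions made for illustration: the matrix has $m = n + 1$ rows and columns (one per possible number of 1-bits in a bitstring of length $n$), the sorting step is interpreted as ordering strategies by increasing $S_i$, and the function names are hypothetical.

```python
import random


def game_of_skill(m, sigma_w=1.0, sigma_s=1.0, seed=0):
    """Random anti-symmetric payoff matrix f[i][j] for m strategies per player."""
    rng = random.Random(seed)
    # Transitive strengths S_i, sorted so that neighboring strategies are similar
    # (an interpretation of the sorting step described in the text).
    S = sorted(rng.gauss(0.0, sigma_s) for _ in range(m))
    # Pairwise interaction terms W_ij.
    W = [[rng.gauss(0.0, sigma_w) for _ in range(m)] for _ in range(m)]
    return [[0.5 * (W[i][j] - W[j][i]) + S[i] - S[j] for j in range(m)]
            for i in range(m)]


def payoff(f, x, y):
    """Payoff of bitstring x against bitstring y: strategies are indexed by |x|_1 and |y|_1."""
    return f[sum(x)][sum(y)]


# Bitstrings of length n = 4 select one of n + 1 = 5 strategies.
f = game_of_skill(m=5, seed=42)
value = payoff(f, [1, 0, 1, 0], [1, 1, 1, 0])
```

By construction `f[i][j] == -f[j][i]`, so the diagonal is zero and the payoff of one player is the negative of the opponent's payoff.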

2.2.4.2 DefendIt

The definition of the DefendIt game is from [20]. An instance of the game is given by a tuple $(k, l, v, c, B^D, B^A)$ where $k \in \mathbb{N}$ is the number of resources, $l \in \mathbb{N}$ is the number of time-steps, $v = (v^{(1)}, \ldots, v^{(k)})$ where $v^{(j)} \in [0, \infty)$ is the value of resource $j \in [k]$, $c = (c^{(1)}, \ldots, c^{(k)})$ where $c^{(j)} \in [0, \infty)$ is the cost of resource $j \in [k]$, $B^D \in [0, \infty)$ is the defender’s budget, and $B^A \in [0, \infty)$ is the attacker’s budget.

Defender and attacker strategies are represented by bitstrings of length $n := k \cdot l$. We adopt the notation $x = (x_1^{(1)}, \ldots, x_l^{(1)}, \ldots, x_1^{(k)}, \ldots, x_l^{(k)}) \in \{0,1\}^n$ for the defender’s strategy, where $x_i^{(j)} = 1$ for $j \in [k]$ and $i \in [l]$ means that the defender attempts to acquire resource $j$ at time $i$. Analogously, we denote $y = (y_1^{(1)}, \ldots, y_l^{(1)}, \ldots, y_1^{(k)}, \ldots, y_l^{(k)}) \in \{0,1\}^n$ for the attacker’s strategy, where $y_i^{(j)} = 1$ for $j \in [k]$ and $i \in [l]$ means that the attacker attempts to acquire resource $j$ at time $i$.

We define the payoff of strategies in terms of the ownership of the resources. In particular, $z_i^{(j)} \in \{0,1\}$ is the ownership of resource $j \in [k]$ at time $i \in \{0\} \cup [l]$, where $z_i^{(j)} = 1$ indicates that the defender owns resource $j$ at time $i$, and $z_i^{(j)} = 0$ means that the attacker owns resource $j$ at time $i$. For all $j \in [k]$, we define $z_0^{(j)}(x, y) := 1$, which corresponds to the assumption that the defender is in possession of all resources at the beginning of the game. The ownership of a resource $j$ is defined inductively for $i \in [l]$ as follows:

$$z_i^{(j)}(x, y) := \begin{cases} z_{i-1}^{(j)}(x, y) & \text{if } x_i^{(j)} = y_i^{(j)}, \\ 1 & \text{if } x_i^{(j)} = 1 \text{ and } y_i^{(j)} = 0, \\ 0 & \text{if } x_i^{(j)} = 0 \text{ and } y_i^{(j)} = 1. \end{cases} \qquad (2.2)$$

Table 2.1 describes ownership outcomes for the move combinations in DefendIt. Intuitively, this means that if a player attempts to acquire the resource while the opponent does not, the player obtains the resource. If neither the defender nor the attacker moves, the ownership does not change. If both the defender and the attacker attempt to acquire the resource, the ownership does not change.

Table 2.1 Ownership outcomes for the 4 combinations of moves in DefendIt

| Defender \ Attacker | 0 | 1 |
|---|---|---|
| 0 | Previous owner (unchanged) | Attacker owns |
| 1 | Defender owns | Previous owner (unchanged) |

The overall cost of a defender or an attacker strategy $x$ is

$$C(x) := \sum_{j=1}^{k} c^{(j)} \sum_{i=1}^{l} x_i^{(j)},$$

i.e., the number of attempts to acquire a resource, weighted by the cost of that resource. A defender strategy $x$ is called over-budget if $C(x) > B^D$. Similarly, an attacker strategy $y$ is called over-budget if $C(y) > B^A$. Finally, the payoff functions $g_1$ for the defender and $g_2$ for the attacker are defined as

$$g_1(x, y) := \begin{cases} \sum_{j=1}^{k} v^{(j)} \sum_{i=1}^{l} z_i^{(j)}(x, y) & \text{if } C(x) \leq B^D, \\ -\sum_{j=1}^{k} \sum_{i=1}^{l} x_i^{(j)} & \text{otherwise,} \end{cases}$$

$$g_2(x, y) := \begin{cases} \sum_{j=1}^{k} v^{(j)} \left( l - \sum_{i=1}^{l} z_i^{(j)}(x, y) \right) & \text{if } C(y) \leq B^A, \\ -\sum_{j=1}^{k} \sum_{i=1}^{l} y_i^{(j)} & \text{otherwise.} \end{cases}$$

Informally, if the overall cost of a strategy exceeds the player’s budget (over-budget strategy), the payoff is minus the number of times the player attempts to acquire any resource. Note that the payoff function for an over-budget defender strategy is independent of the attacker strategy and isomorphic to the OneMax problem [8], and similarly for over-budget attacker strategies. If the overall cost of a strategy is within the budget (within-budget strategy), then the payoff is the number of time steps the player is in possession of each resource multiplied by the value of that resource, summed over all resources. For an analysis of the hardness of finding an optimal player strategy in DefendIt see [20]. We only consider DefendIt instances where the cost of a resource is identical to its value. This corresponds to the NP-hard subset sum problem, which is a special case of the knapsack problem [20]. Hence, the decision version of DefendIt is still NP-complete for our choice of costs.
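The ownership recursion, the cost function and the two payoff functions can be sketched in Python as below. The sketch assumes 0-based lists, with bit $x_i^{(j)}$ stored at index `j * l + i` (resource-major order, as in the definition of $x$), and uses hypothetical function and parameter names.

```python
def ownership(x, y, k, l):
    """z[j][i] for i = 0..l: 1 if the defender owns resource j at time i, else 0."""
    z = []
    for j in range(k):
        row = [1]  # the defender owns every resource at time 0
        for i in range(l):
            xi, yi = x[j * l + i], y[j * l + i]
            if xi == yi:
                row.append(row[-1])  # no move or both move: ownership unchanged
            elif xi == 1:
                row.append(1)        # only the defender moves
            else:
                row.append(0)        # only the attacker moves
        z.append(row)
    return z


def cost(s, c, k, l):
    """C(s): number of acquisition attempts per resource, weighted by its cost."""
    return sum(c[j] * sum(s[j * l:(j + 1) * l]) for j in range(k))


def g1(x, y, v, c, k, l, budget_d):
    """Defender payoff."""
    if cost(x, c, k, l) > budget_d:
        return -sum(x)  # over-budget: minus the number of attempts
    z = ownership(x, y, k, l)
    return sum(v[j] * sum(z[j][1:]) for j in range(k))


def g2(x, y, v, c, k, l, budget_a):
    """Attacker payoff."""
    if cost(y, c, k, l) > budget_a:
        return -sum(y)
    z = ownership(x, y, k, l)
    return sum(v[j] * (l - sum(z[j][1:])) for j in range(k))
```

As a worked case, take one resource with value and cost 2 over three time steps, a defender that never moves, and an attacker that moves at the second step. The defender keeps the resource only at time 1, so the within-budget payoffs are $2 \cdot 1$ for the defender and $2 \cdot (3 - 1)$ for the attacker.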


2.2.5 MaximinHill—A Problem for Error Thresholds

We propose MaximinHill, a simple maximin-optimization problem designed to be easy to optimize for any hill-climbing CCA. The simplicity of the problem allows us to understand better the dynamics of CCAs, especially the error thresholds. For any parameters $0 \leq \alpha \leq 1$ and $0 \leq \beta \leq 1$ we define:

$$\mathrm{MaximinHill}_{\alpha,\beta}(x, y) := \left| \left( \beta - \frac{|x|_1}{n} \right) \left( \alpha - \frac{|y|_1}{n} \right) \right| - \left| \beta - \frac{|x|_1}{n} \right| + \left| \alpha - \frac{|y|_1}{n} \right|,$$

where $|x|_1$ is the number of 1-bits in the bitstring $x$ and $|a|$ denotes the absolute value of any $a \in \mathbb{R}$. Given that player $x$ wants to maximize and player $y$ wants to minimize the function $\mathrm{MaximinHill}_{\alpha,\beta}$, the optimal worst-case strategies for $x$ and $y$ are to select a solution with $|x|_1 = \beta n$ and $|y|_1 = \alpha n$, respectively. When the opponent’s strategy remains unchanged, approaching $|x|_1 = \beta n$ (or $|y|_1 = \alpha n$) is always advantageous. However, despite the straightforward strategy when the opponent does not change, there is still a competitive aspect: if the opponent moves toward its optimal worst-case strategy, this is detrimental to your own payoff. The competitive nature comes from both players aiming to maximize their individual payoffs.

We next contribute an experimental methodology that allows the performance of PDCoEA (or other CCAs) strategies to be compared.
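The following sketch evaluates MaximinHill for two bitstrings. It assumes one particular reading of the definition, namely $|(\beta - |x|_1/n)(\alpha - |y|_1/n)| - |\beta - |x|_1/n| + |\alpha - |y|_1/n|$ with $\alpha, \beta \in [0, 1]$, and it assumes that both bitstrings have the same length $n$; the function name is hypothetical.

```python
def maximin_hill(x, y, alpha, beta):
    """MaximinHill value for bitstrings x (maximizer) and y (minimizer)."""
    n = len(x)
    u = abs(beta - sum(x) / n)   # distance of x from its target fraction beta
    v = abs(alpha - sum(y) / n)  # distance of y from its target fraction alpha
    return u * v - u + v
```

Under this reading, with $\alpha = \beta = 0.5$ and $n = 4$, both players sitting on their targets (two 1-bits each) gives the value 0. If only $x$ moves away from its target the value drops below 0, and if only $y$ moves away the value rises above 0, which matches the description that approaching the target is always advantageous for the respective player.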

2.3 Experimental Methodology

We now explain how we compare the performance of CCAs. It can be non-trivial to compare strategies from two or more CCAs. For example, it is computationally intractable (NP-hard) to determine how close strategies in DefendIt are to the optimum [20]. Also, it may be meaningless to compare strategies with randomly chosen strategies. This dilemma also occurs in more complex versions of coevolutionary algorithms, e.g., STPDCoEA.

We use a methodology introduced in [20] where the performance of an algorithm A is compared relative to one or more reference algorithms B. Figure 2.2 illustrates the methodology. We run A and all the algorithms of B independently for the same number of function evaluations and collect “champion” defenders and attackers at regular time intervals, specified by a period length $\tau$ [20]. We then evaluate the individuals in the population of an algorithm at a time $t$ against the champions collected from all algorithms up until time $t$. For a formal description of this methodology see [20]. Having introduced the experimental methodology, the next section describes the experiments with the STPDCoEA.


Fig. 2.2 Illustration of experimental methodology

2.4 Experiments

In this section, we investigate the concept of spatially distributed CCAs on graphs (spatial topologies); more specifically, we investigate the STPDCoEA introduced in Sect. 2.2.2. This section is organized as follows:

• In Sect. 2.4.1, we describe the experimental setup in general.
• In Sect. 2.4.2, we investigate how CCAs can be applied to spatial topologies and different properties of the topology. This approach explores how different spatial topologies impact the performance of a CCA (RQ-1, RQ-2 and RQ-3).
• In Sect. 2.4.3, we examine the influence of spatial CCAs on the payoff and genotypic diversity. We analyze how the spatial distribution affects the overall fitness and the variability of genotypes within the populations. By investigating the relationship between spatial organization and diversity, we gain insights into the potential benefits or limitations of spatially distributed CCAs (RQ-4).
• In Sect. 2.4.4, we explore the concept of error thresholds in mutation rates for spatial CCAs. By systematically varying the mutation rates in our experiments, we identify the critical points at which the performance of the CCAs significantly degrades (RQ-5). Understanding these error thresholds provides valuable information for practitioners to determine suitable mutation rates and optimize the algorithm’s performance in real-world applications. We also systematically tune the mutation rates and investigate the effect of tuned mutation rates in STPDCoEA.


2.4.1 Setup

We use the champions methodology of Sect. 2.3 to compare the topologies, setting the period length to $\tau = 100{,}000$. Each topology is compared against the PDCoEA, and we plot the payoffs of these champion comparisons. Our champions have the best-worst-case (minmax) payoff. We repeat each experiment 40 times unless stated otherwise. In the plots, the solid line shows the median of the experiment trials and the shaded region is the interquartile range.

For our analysis we used the following parameter settings for all algorithms unless noted otherwise. We used a population size of $625$ and a mutation parameter $\chi = 0.1$. The parameter selection was informed by experimental [20] and theoretical [19] analyses of PDCoEA, which suggest using large population sizes and a mutation parameter $\chi < \ln(2) \approx 0.693$, i.e., a mutation probability smaller than $\ln(2)/n$. All experiments are initialized with the all-zeroes bitstring. For DefendIt these initial solutions are part of the feasible solutions. In the plots, the predator (pred) is the attacker and the prey is the defender. Note that, due to space constraints, we sometimes only show pred or prey when both are similar.

2.4.2 Spatial Topology PDCoEA

In our study, we employed various graph topologies to explore the performance of CCAs. Specifically, we utilized the following graph structures: Barabási-Albert, cycle (representing a 1-D graph), lattice (representing a 2-D graph), Erdős-Rényi, and binary tree, see Fig. 2.3.

Fig. 2.3 The different topologies used for STPDCoEA

Barabási-Albert The Barabási-Albert graph (see Fig. 2.3a) is characterized by preferential attachment, where nodes with higher degrees have a greater likelihood of acquiring additional connections. This graph topology mimics real-world networks where a few nodes (called hubs) tend to attract more connections than others and some nodes have only a small number of connections.

1-D graph The cycle graph (see Fig. 2.3b), representing a 1-D graph, consists of nodes connected in a circular arrangement, forming a ring-like structure. This topology allows for interactions between neighboring nodes but does not incorporate long-range connections.

2-D graph The lattice graph (see Fig. 2.3c) represents a 2-D graph, featuring a grid-like structure where nodes are connected to their adjacent neighbors. This topology enables interactions between nodes in a structured and localized manner. It is similar to the spatial topology in [15].

Erdős-Rényi The Erdős-Rényi graph (see Fig. 2.3d) is a random graph where nodes are connected with a certain probability. This graph topology represents a more random and less structured network configuration.

Binary tree The binary tree graph (see Fig. 2.3e) is a hierarchical structure where each node has a maximum of two child nodes. This topology creates a branching pattern, resembling a tree-like structure. In addition, we added a cycle from all the leaves to the root node of the tree. This was essential for the STPDCoEA using the directed binary tree graph to work.

Each topology brings its own characteristics, allowing us to analyze the impact of these variations on the algorithm’s performance and explore the strengths and limitations of different graph structures in our optimization scenarios. Figure 2.3 shows examples of the topologies used in the experiments. Here we show the 1-D graph and the binary tree in their directed form. When using the directed version of the other three graphs, the direction of the edges was set uniformly at random. We note that this often resulted in subgraphs and nodes that only give (or receive) genetic information to (from) the graph without receiving (giving) it back.

First, we investigate undirected graphs, where individuals in the population compete with their neighbors (all edges are incoming) based on the topology. Then, we focus on the edge direction property of the graph. We study directed graphs where individuals compete only against their incoming neighborhoods, and we study the impact of edge directions on the performance and dynamics of the CCAs. Table 2.2 shows the connectedness of the topologies, as measured by the in- and out-degree of the nodes in the graph. The Barabási-Albert and Erdős-Rényi graphs were generated at random once and the same graph was used for all experiments. The undirected topologies all have a higher degree of connectedness than their directed counterparts.
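The three deterministic undirected topologies can be built with a few lines of standard-library Python, as sketched below. The constructions are assumptions chosen to be consistent with the degree statistics in Table 2.2, not the authors' code: the lattice wraps around at the borders (so that every node has degree 4), the binary tree uses heap numbering (children of node $v$ are $2v$ and $2v+1$), and the "cycle from the leaves to the root" is read as one extra edge from each leaf to the root. The random Barabási-Albert and Erdős-Rényi graphs are omitted.

```python
def cycle_graph(lam):
    """1-D graph: node v is linked to its two ring neighbors."""
    return {v: [(v - 1) % lam, (v + 1) % lam] for v in range(lam)}


def torus_lattice(side):
    """2-D graph on side*side nodes with wrap-around borders (assumption)."""
    nbrs = {}
    for r in range(side):
        for c in range(side):
            nbrs[r * side + c] = [
                ((r - 1) % side) * side + c,  # up
                ((r + 1) % side) * side + c,  # down
                r * side + (c - 1) % side,    # left
                r * side + (c + 1) % side,    # right
            ]
    return nbrs


def binary_tree_with_leaf_links(lam):
    """Heap-numbered binary tree on nodes 1..lam, plus an edge from each leaf to the root."""
    nbrs = {v: set() for v in range(1, lam + 1)}
    for v in range(2, lam + 1):
        nbrs[v].add(v // 2)      # edge to the parent
        nbrs[v // 2].add(v)
    for v in range(2, lam + 1):
        if 2 * v > lam:          # v has no children, so it is a leaf
            nbrs[v].add(1)
            nbrs[1].add(v)
    return {v: sorted(ns) for v, ns in nbrs.items()}


def degree_range(nbrs):
    """Minimum and maximum node degree of an undirected neighborhood map."""
    degrees = [len(ns) for ns in nbrs.values()]
    return min(degrees), max(degrees)
```

With $\lambda = 625$ these constructions give degree ranges of (2, 2) for the cycle, (4, 4) for the 25 × 25 lattice and (2, 315) for the binary tree, in line with the undirected rows of Table 2.2. The resulting dictionaries map each node to its neighborhood $N_G(v)$.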

2.4.2.1 Undirected Edges

First, we investigate the application of STPDCoEA on undirected graphs and compare the performance across various graph topologies and problems. Specifically, we analyze the impact of different graph structures, on two distinct problem instances: DefendIt and Games of Skill. By employing the STPDCoEA on these undirected graphs, we aim to uncover how the spatial distribution of individuals within the graph influences the overall performance of the algorithm. DefendIt In this set of experiments, we compare the undirected graphs on DefendIt. In Fig. 2.4, we observe that the overall performance of the different undirected graphs does not appear to be significantly affected. However, certain trends

32

M. Hevia Fajardo et al.

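Table 2.2 Descriptive statistics of in-degree and out-degree for different topologies. λ = 625 is the population size (number of nodes in the topology)

| Topology | In-degree min | In-degree median | In-degree mean | In-degree max | Out-degree min | Out-degree median | Out-degree mean | Out-degree max |
|---|---|---|---|---|---|---|---|---|
| Undirected 1-D graph | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Undirected Binary tree | 2 | 2 | 3 | 315 | 2 | 2 | 3 | 315 |
| Undirected 2-D graph | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Undirected Barabási-Albert | 1 | 3 | 4.51 | 111 | 1 | 3 | 4.51 | 111 |
| Undirected Erdös Rényi | 3 | 10 | 10.04 | 23 | 3 | 10 | 10.04 | 23 |
| Directed 1-D graph | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Directed Binary tree | 1 | 1 | 1.5 | 313 | 1 | 1 | 1.5 | 2 |
| Directed 2-D graph | 0 | 2 | 2 | 4 | 0 | 2 | 2 | 4 |
| Directed Barabási-Albert | 0 | 1 | 2.25 | 64 | 0 | 1 | 2.25 | 47 |
| Directed Erdös Rényi | 0 | 5 | 5.02 | 12 | 0 | 5 | 5.02 | 13 |
| PDCoEA (fully connected) | | | | | | | | |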

Fig. 2.4 The DefendIt problem, the lines show different undirected topologies. X-axis shows the number of fitness evaluations (t). Y-axis shows the median maximin payoff of the champions and their interquartile ranges

become apparent when comparing different graph types. Notably, the Barabási-Albert graph and the 2-D graph exhibit slightly better performance than the other graph structures. These graphs seem to provide a more conducive environment for cooperative evolution, leading to improved solutions. Additionally, the 1-D graph demonstrates a wider range of payoffs, indicating a greater diversity of solutions obtained. This suggests that the low connectivity of the 1-D graph promotes exploration and the emergence of diverse strategies among individuals. Overall, these findings suggest that the choice of graph topology can influence the performance and diversity of solutions achieved by the CCAs. The Barabási-Albert and 2-D graphs show promising performance for this particular problem, while the 1-D graph offers the potential for exploring a wider range of solutions.

Games of Skill In this set of experiments, we compare the undirected graphs on Games of Skill. Figure 2.5 compares the STPDCoEA topologies with PDCoEA. The performance of all spatial CCAs is statistically significantly


Fig. 2.5 The Games of Skill problem, the lines show different undirected topologies. X-axis shows the number of fitness evaluations (t). Y-axis shows the median maximin payoff of the champions and their interquartile ranges

Fig. 2.6 Pairwise Wilcoxon tests with Bonferroni correction. A green square indicates that the algorithm on the Y-axis had a statistically significantly higher median payoff in the predator population than the algorithm on the X-axis at the end of the run, a red square indicates a statistically significantly lower payoff, and yellow indicates no statistically significant difference. The DefendIt problem with directed edges is not shown as there are no statistically significant differences

lower compared to PDCoEA (cf. Fig. 2.6). This suggests that the spatial distribution of individuals within the graph structure may introduce additional challenges or constraints to the coevolutionary process on this optimization problem. When comparing the performance of spatial CCAs among themselves, we notice that the differences are not significant. Still, the binary tree graph performs worse than the other spatial CCAs, indicating that this graph structure may be ill-suited for this problem. Overall, the results from Fig. 2.5 indicate that the use of spatial CCAs can be detrimental on some problem types. These findings highlight the importance of better understanding when and how to employ spatial CCAs, and suggest that further investigation is needed into the underlying factors contributing to the observed performance differences.

2.4.2.2 Directed Graphs

We now explore directed graphs. Figure 2.7 shows the results from using directed graphs in the STPDCoEA. Notably, we observe that the performance of some graphs


Fig. 2.7 Directed edges, the lines show different topologies. X-axis shows the number of fitness evaluations (t). Y-axis shows the median maximin payoff of the champions and their interquartile ranges

experiences a noticeable decrease compared to their undirected counterparts. Among the directed graphs analyzed, the 1-D graph in the DefendIt problem stands out as having the largest drop in performance, making it statistically significantly worse than most other algorithms at the end of the run (cf. Fig. 2.6). However, it is worth noting that the performance on the Erdös Rényi graph remains relatively stable for both the DefendIt and Games of Skill problems. This suggests that the Erdös Rényi graph is robust to the introduction of directed connections. A possible explanation is that this graph is relatively well connected, so restricting the communication with directed connections does not change its dynamics too much.

2.4.3 Payoff and Genotypic Diversity

In this section, we explore the impact of the spatial topologies on the payoff and genotypic diversity of the evolving populations in CCAs. We aim to investigate how the spatial topology influences the variability of payoffs and genotypes within the populations. By analyzing this relationship, we gain insights into the effects of spatial topologies on the exploration of the search spaces during evolution.

2.4.3.1 Payoff Diversity

Figure 2.8 presents the standard deviation in payoffs sampled (50 samples) from the population at intervals of 10,000 fitness evaluations. The results highlight a contrast


Fig. 2.8 Payoff diversity for different topologies (lines). The x-axis shows the fitness evaluations (t). The y-axis shows the standard deviation of the payoffs for 50 random samples. The solid line is the median of 20 trials (interquartile range shaded)

between the DefendIt and Games of Skill problems in terms of payoff diversity when using spatial topologies. For the DefendIt problem, we observe that the use of a spatial topology in CCAs leads to higher payoff diversity compared to the non-spatial populations in PDCoEA. However, intriguingly, this higher diversity in payoffs is not observed in the Games of Skill problem. In this problem, PDCoEA populations maintain a similar level of payoff diversity compared to the different STPDCoEA topologies.

2.4.3.2 Hamming Distance Diversity

The genotypic diversity is measured with the Hamming distance. Figure 2.9 shows the median Hamming distance between solutions in the populations, along with its standard deviation, for CCAs employing both directed and undirected graphs. The figure shows the results for both DefendIt and Games of Skill. The overall trend suggests that a spatial topology tends to generate higher genotypic diversity within the populations. When examining Fig. 2.9, we observe, for both problems, that the 1-D graph consistently exhibits a high median Hamming distance, indicating large genotypic differences between solutions within the population. This is accompanied by a high standard deviation, which highlights the variability in the Hamming distances. Similarly, the 2-D graph demonstrates a relatively high median and standard deviation. Notably, the binary tree graph stands out because it has a low median Hamming distance and standard deviation in all cases. Both measures are low even when compared with the PDCoEA. Another observation is that for the Barabási-Albert graph the standard deviation tends to be high, but for the undirected version the median is relatively small compared with the 1-D and 2-D graphs.
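As a rough illustration of the two diversity measures discussed above, the following Python sketch computes the median and standard deviation of the Hamming distance over all pairs in a population of equal-length bitstrings. It is a minimal sketch under assumptions: the function names are hypothetical, the population standard deviation is used, and the text does not say whether all pairs or only a sample of pairs were measured.

```python
from itertools import combinations
from statistics import median, pstdev

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def hamming_diversity(population):
    """Median and population standard deviation of all pairwise distances."""
    distances = [hamming(a, b) for a, b in combinations(population, 2)]
    return median(distances), pstdev(distances)

if __name__ == "__main__":
    # Toy population of three bitstrings (illustrative data, not from the paper).
    toy_population = ["0000", "0011", "1111"]
    print(hamming_diversity(toy_population))
```

For a population of 625 individuals this enumerates roughly 195,000 pairs, which is still cheap; for much larger populations a random sample of pairs would be a reasonable substitute.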


Fig. 2.9 Genotypic diversity for different topologies (lines). The x-axis shows the fitness evaluations (t). The y-axis shows the median or the standard deviation of the Hamming distance in the populations. The solid line is the median of 20 trials with the interquartile range shaded

2.4.3.3 Hamming Distance Versus Graph Distance

In this section, we study the relation between the Hamming distance and the graph distance for all graph topologies. In Fig. 2.10, we plot these relations, with the directed graphs displayed at the top and the undirected graphs at the bottom. The plots show a snapshot of the whole population after $2 \times 10^6$ function evaluations. This visualization allows us to examine the relationship between the Hamming distance, which measures the dissimilarity between solutions, and the graph distance, which represents the number of edges between individuals in the graph structure, and to observe patterns and correlations across the different graph topologies. Note that nodes that are isolated from each other within the graph topology are assigned a graph distance that is 1.2 times the maximum graph distance (rounded). This treatment of isolated nodes provides a consistent representation of the graph distance metric. For the directed graphs, for two nodes A and B we plot


Fig. 2.10 Scatter-plot of a snapshot ($t = 2 \times 10^6$) for Hamming distance (x-axis) versus path length distance (y-axis) of all individual pairs in both populations

both the graph distance A to B and B to A (which may be different). We consider node A to be isolated from node B if there is no directed path from A to B.¹ In Fig. 2.10, several observations can be made regarding the characteristics of different graph topologies and their impact on the Hamming distance between pairs of nodes. In the case of the directed graphs, such as Barabási-Albert (see Fig. 2.10a), 2-D graph (see Fig. 2.10c), and Erdös Rényi (see Fig. 2.10d), we can observe pairs of nodes that are isolated from each other. However, despite their isolation, some of these node pairs still exhibit small Hamming distances. This phenomenon suggests that genetic information might flow in one direction, but not necessarily in the opposite direction, contributing to the observed differences. Moving on to the undirected Barabási-Albert graph (see Fig. 2.10f), we also encounter isolated nodes. These isolated nodes can either be part of a small subgraph or completely disconnected. Most of these isolated nodes tend to exhibit high Hamming distances, indicating substantial genetic differences between them. A notable finding emerges when considering the 1-D graph (see Fig. 2.10b and g). Here, a clear correlation between graph distance and Hamming distance can be observed. In the directed graph (see Fig. 2.10b), the right half of the plot appears to mirror the left half. This symmetry suggests that although these points seem far apart based on the graph distance, they are actually located near each other when considering the directionality of the graph. Finally, for the binary tree graph (see Fig. 2.10e and j), both in the directed and undirected cases, the solutions exhibit small Hamming distances. This observation implies that the individuals in these populations share significant genetic similarities. This might be caused by the small graph distances among all pairs of nodes.
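The graph-distance convention described above (directed shortest-path length, with isolated pairs assigned 1.2 times the maximum finite distance, rounded) can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the function names and the adjacency-list format are assumptions, and breadth-first search is used because all edges have unit length.

```python
from collections import deque

def bfs_distances(adj, source):
    """Directed shortest-path lengths (in edges) from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def graph_distances(adj, nodes, factor=1.2):
    """Distance for every ordered pair (A, B) with A != B.

    If there is no directed path from A to B, the pair receives
    round(factor * maximum finite distance), following the text.
    """
    reach = {u: bfs_distances(adj, u) for u in nodes}
    finite = [d for u in nodes for v, d in reach[u].items() if v != u]
    isolated = round(factor * max(finite)) if finite else None
    return {(u, v): reach[u].get(v, isolated)
            for u in nodes for v in nodes if u != v}

if __name__ == "__main__":
    # Toy directed chain 0 -> 1 -> ... -> 5 (illustrative, not from the paper).
    chain = {i: [i + 1] for i in range(5)}
    d = graph_distances(chain, list(range(6)))
    print(d[(0, 5)], d[(5, 0)])
```

Because distances are computed per ordered pair, the sketch also reproduces the asymmetry noted in the text: B may be reachable from A while A is isolated from B.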

2.4.4 Error Threshold in STPDCoEA

In the previous experiments, we used the same mutation probability (χ) for all algorithms. The parameter selection was informed by previous experimental [20] and

¹ Node B is not necessarily isolated from A if node A is isolated from B.


theoretical [19] analyses on PDCoEA suggesting the use of large population sizes and a mutation parameter χ < ln(2). These suggestions were based on the concept of error thresholds. Error thresholds for mutation rates are the critical points above which the performance of EAs and CCAs degrades significantly. They depend on both mutation rate and selection pressure, and different topologies can provide different selection pressures. We now investigate how the spatial topology of STPDCoEA impacts the error threshold.

2.4.4.1 Approximating the Error Threshold

In previous experiments, we used a conservative mutation probability of χ = 0.1. To approximate the error threshold for the spatial topologies we use the OneMax-like problem MaximinHill with α = 1 and β = 1, where increasing the number of ones is always beneficial for both populations. We use this problem because it only requires the algorithms to hill-climb and it does not have a deceptive landscape. These characteristics help us identify the error thresholds. Figure 2.11 illustrates the error thresholds in mutation rates for different types of CCAs. The graph shows that spatial CCAs generally exhibit lower error thresholds compared to the non-spatial PDCoEA. We hypothesize that the reduced communication and lower reproductive rate of individuals within the graph structure restricts the flow of genetic information, leading to decreased robustness against high mutation rates. Moreover, among the spatial CCAs, those utilizing directed graphs demonstrate even lower error thresholds, supporting our hypothesis. This suggests that the inclusion of spatial organization in the algorithm’s design influences the sensitivity

Fig. 2.11 Error threshold approximation for the MaximinHill problem for different topologies (lines). The Y-axis is the number of fitness evaluations until a solution is found (runtime). The X-axis is the mutation parameter χ. The solid line shows the median value over 100 trials and the shaded area is the interquartile range


to mutation rates. Note that for small mutation rates Fig. 2.11 also shows an increase in runtime. This is because most offspring are identical to their parents (due to the low mutation rate), therefore finding better solutions takes more time.

2.4.4.2 Tuned Mutation Rates

We observed the importance of appropriate mutation rates. In particular, we saw that when the mutation rate is too high, it can lead to divergence through excessive disruption and loss of promising solutions. Conversely, if the mutation rate is too low, there is a risk of converging to local optima due to a lack of exploration. To tune the mutation probability χ, we used Fig. 2.11 and the following methodology. We identified the best-performing runtime t among all mutation probabilities χ (lowest median runtime) and used the first mutation probability χ for which the median runtime exceeds t by at least $1.6 \times 10^5$. The mutation probabilities (χ) discovered by this heuristic are shown in Table 2.3. We see that the mean in- and out-degree (see Table 2.2) is correlated with the tuned mutation probability value.

DefendIt Fig. 2.12 shows the impact of using tuned mutation probabilities (χ) on the performance of algorithms with directed and undirected graphs. The results reveal a mixed effect on the performance when using the tuned parameter values. Among the directed graph topologies, the directed Barabási-Albert and 2-D graph algorithms had a reduced performance. PDCoEA showed a similar decrease in performance. This suggests that the tuned parameter values may not be suitable for these specific algorithm/problem pairs. Similarly, the Erdös Rényi and binary tree graphs demonstrated a slight reduction in performance with the tuned parameter values. Although the performance decline was not as great as in the previous cases, it still indicates that the parameter tuning did not yield the desired improvement for these algorithms in the directed graph context. Not surprisingly, the directed 1-D graph algorithm showed an improvement in performance when using the tuned parameter values. We attribute this to the use of a χ below the error threshold, whereas before a χ above the error threshold shown in Fig. 2.11 was used.
Finally, for undirected graph topologies, the performance of all algorithms seemed to have decreased with the tuned χ, compared to the results shown in Fig. 2.4.

Table 2.3 Mutation parameters χ for different topologies

| Topology | χ directed | χ undirected |
|---|---|---|
| 1-D graph | 0.07 | 0.10 |
| Binary tree | 0.09 | 0.14 |
| 2-D graph | 0.11 | 0.18 |
| Barabási-Albert | 0.14 | 0.20 |
| Erdös Rényi | 0.21 | 0.25 |
| PDCoEA | 0.4 | |
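The tuning heuristic described in this subsection can be written down as a short procedure. The Python sketch below is one possible reading, not the authors' implementation: the function name `tune_chi` is hypothetical, the runtimes in the usage example are made up for illustration, and the scan direction is an assumption, since the text does not say whether "first" means the first χ above the best-performing one or the first in some other order. Here the candidates are scanned in increasing order, starting just above the χ with the lowest median runtime.

```python
def tune_chi(median_runtime, margin=1.6e5):
    """Pick a mutation parameter from a map {chi: median runtime}.

    Finds the lowest median runtime t, then returns the first chi above the
    best-performing chi (assumed scan direction) whose median runtime is at
    least t + margin. Returns None if no candidate qualifies.
    """
    chis = sorted(median_runtime)
    best_chi = min(chis, key=lambda c: median_runtime[c])
    best_runtime = median_runtime[best_chi]
    for chi in chis:
        if chi > best_chi and median_runtime[chi] >= best_runtime + margin:
            return chi
    return None

if __name__ == "__main__":
    # Hypothetical median runtimes per chi (NOT values from Fig. 2.11).
    runtimes = {0.05: 9.0e5, 0.10: 5.0e5, 0.15: 4.0e5,
                0.20: 4.5e5, 0.25: 6.0e5, 0.30: 2.0e6}
    print(tune_chi(runtimes))
```

With the hypothetical data above, the best median runtime is attained at χ = 0.15, and χ = 0.25 is the first larger value whose runtime exceeds it by the margin.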


Fig. 2.12 The DefendIt problem, the lines show different topologies with tuned mutation probability (χ). X-axis shows the number of fitness evaluations (t). Y-axis shows the median maximin payoff of the champions (20 trials) and their interquartile ranges

Fig. 2.13 The Games of Skill problem, the lines show different topologies with tuned mutation probability (χ). X-axis shows the number of fitness evaluations (t). Y-axis shows the median maximin payoff of the champions (20 trials) and their interquartile ranges

Games of Skill Fig. 2.13 presents the results for the Games of Skill problem, showcasing the impact of tuned mutation probability (χ) on the performance of different graphs. In contrast to the observations in the DefendIt problem, the tuned mutation probability (χ) proved to be beneficial for all algorithms in terms of obtaining a higher payoff in the champions comparison. This suggests that the parameter adjustments successfully enhanced the algorithms’ ability to achieve higher payoffs.


2.5 Related Work

Coevolutionary Algorithm This contribution extends the insightful body of prior work studying and exploiting coevolutionary algorithms [2, 14, 17, 20, 23, 25–28]. In a basic CCA, at each generation an individual is assigned a fitness score derived from some function of its performance outcomes in its competitions. For example, the function can sum the performance outcomes or use the average, maximum, minimum, or median outcome [3]. Variations of the CCA have been defined for specific problem domains, e.g. [1, 4, 12, 21, 22, 29]. The dynamics of CCAs are difficult to analyze because their fitness evaluation is derived interactively [26]. Effectively, an individual’s score (or payoff in the case of DefendIt) is a sample-based estimate of its performance, where the samples are drawn from the opposing population, which itself is evolving.

CCA Problems Different competitive games or simplified problems have been studied [3, 11, 16, 17]. The study of DefendIt [20] was inspired by FlipIt. FlipIt was introduced in 2012 in a cryptographic setting by [7]. It was described as a game of “Stealthy Takeover” because it models “situations in which an attacker periodically compromises a system or critical resource completely, learns all its secret information and is not immediately detected by the system owner or defender.” The Games of Skill was formulated from a study of the geometrical properties of real-world games (e.g., Tic-Tac-Toe, Go, StarCraft II) [6]. These games were proved to resemble a spinning top, with the upright axis representing transitive strength, and the radial axis representing the non-transitive dimension, which corresponds to the number of cycles that exist at a particular transitive strength. This geometric structure has consequences for learning and clarifies why populations of strategies are necessary for training of agents, and how population size relates to the structure of the game.
A Nash clustering method was presented to measure the interaction between transitive and cyclical strategy behavior, as well as the effect that population size has on the convergence of learning in these games.

CCA Analysis There is also a nascent body of theoretical analysis of coevolutionary algorithms. Relevant to this contribution is the topic of error thresholds, a phenomenon first studied in molecular biology [18, 24]. An error threshold is an essential characteristic of a non-elitist evolutionary algorithm. Informally, the threshold describes how the performance of the algorithm suddenly degrades when the mutation rate is increased beyond a certain point which depends on the selective pressure. For traditional non-elitist evolutionary algorithms which use selection and bitwise mutation applied to optimization on the Boolean hypercube, this threshold occurs when each bit is flipped with probability $\chi/n \approx \ln(\alpha_0)/n$, where $\alpha_0$ is the reproductive rate of the selection mechanism (expected number of offspring of the fittest individual), and $n$ is the bitstring length. Mutation rates above this threshold lead to exponentially large runtime on problems with at most a polynomial number of global optima (Theorem 4 in [18]), while mutation rates below this threshold lead to polynomial expected runtime assuming some additional algorithmic and problem conditions are met (Theorem 1 in [5]).
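The threshold formula above is easy to evaluate numerically. The following Python sketch is a minimal worked example: the function name `error_threshold` is hypothetical, and the reproductive rate of 2 and bitstring length of 100 in the usage example are assumed values chosen only for illustration. A reproductive rate of 2 gives χ ≈ ln(2) ≈ 0.693, which matches the bound χ < ln(2) quoted earlier for PDCoEA.

```python
import math

def error_threshold(alpha0, n):
    """Approximate error threshold for a non-elitist EA with bitwise mutation.

    alpha0: reproductive rate of the selection mechanism (expected number of
            offspring of the fittest individual); must be positive.
    n:      bitstring length; must be positive.
    Returns (chi, per_bit_rate) with chi = ln(alpha0) and per_bit_rate = chi / n.
    """
    if alpha0 <= 0 or n <= 0:
        raise ValueError("alpha0 and n must be positive")
    chi = math.log(alpha0)
    return chi, chi / n

if __name__ == "__main__":
    # Assumed illustrative values: reproductive rate 2, bitstring length 100.
    chi, per_bit = error_threshold(2.0, 100)
    print(f"chi threshold = {chi:.3f}, per-bit rate = {per_bit:.5f}")
```

Note that a larger reproductive rate (stronger selection) raises the threshold, which is consistent with the observation in this chapter that less connected topologies tolerate only smaller mutation rates.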


Spatial Topologies Spatial evolutionary algorithms and topologies are described in [30]. For diversity and efficiency, the algorithm designer often tries to minimize the number of competitions per generation while maximizing the accuracy of the fitness estimate of each adversary. Assuming one or both populations are of size $N$, two extreme structures are: one-vs-one, where each adversary competes only once against a member of the opposing population, and all-vs-all, where each adversary competes against all members of the opposing population. One-vs-one has minimal fitness evaluations, $O(N)$. In contrast, all-vs-all has a high computational cost, $O(N^2)$ [28]. Other structures provide intermediate trade-offs between computation costs and competition bias (fitness estimation). In [23], adversaries termed hosts and parasites are placed on an $M \times M$ grid with a fixed neighborhood (size $c$) and one host and parasite per cell. Competitions take place among all competitors in a neighborhood, which reduces fitness evaluations to $O(Mc^2)$. An adversary has an outcome for each competition. A successful example of spatial topologies and coevolution is the training of Generative Adversarial Networks (GANs) [15]. The effects of signal propagation in a gradient-based and coevolutionary learning system were studied in [31]. These studies were limited to GANs for image generation.
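The difference in cost between the two extreme competition structures can be made concrete by enumerating the pairings. The Python sketch below is illustrative only: the function names are hypothetical, both populations are assumed to have the same size N, and one-vs-one is implemented as a random one-to-one matching, which is one of several ways to realize it.

```python
import random

def one_vs_one(pop_a, pop_b, rng):
    """Each member of pop_a meets exactly one randomly assigned member of pop_b."""
    opponents = list(pop_b)
    rng.shuffle(opponents)
    return list(zip(pop_a, opponents))

def all_vs_all(pop_a, pop_b):
    """Every member of pop_a meets every member of pop_b."""
    return [(a, b) for a in pop_a for b in pop_b]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    n = 10
    pop_a = [f"a{i}" for i in range(n)]
    pop_b = [f"b{i}" for i in range(n)]
    print(len(one_vs_one(pop_a, pop_b, rng)), "competitions (one-vs-one)")
    print(len(all_vs_all(pop_a, pop_b)), "competitions (all-vs-all)")
```

With N = 10 this yields 10 and 100 competitions respectively; spatial topologies fall between the two, since each individual only meets the opponents in its neighborhood.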

2.6 Conclusion

Competitive coevolutionary algorithms are used to model adversarial dynamics. The diversity of the adversarial populations can be increased with a spatial topology. We explored a pairwise dominance coevolutionary algorithm named PDCoEA to achieve more clarity in how a spatial topology impacts performance and complexity. We used a methodology for consistent algorithm comparison to empirically study the impact of topology, problem, and mutation rate on the dynamics and payoffs. We compared the dynamics over multiple runs and problems, and observed that the error threshold seems to be correlated with the connectedness of the topology. Future work will investigate a self-adapting mutation rate to see if it evolves to the error threshold. We will also study the solutions and the impact of the spatial topologies more quantitatively, e.g., by measuring coevolutionary pathologies. Finally, we will further investigate how to set mutation rates.

Acknowledgements Lehre and Hevia were supported by a Turing AI Fellowship (EPSRC grant ref EP/V025562/1).


References

1. Angeline, P.J., Pollack, J.B.: Competitive environments evolve better solutions for complex tasks. In: Proceedings of the Fifth International Conference (GA93), Genetic Algorithms, pp. 264–270 (1993)
2. Antonio, L.M., Coello, C.A.C.: Coevolutionary multi-objective evolutionary algorithms: a survey of the state-of-the-art. IEEE Trans. Evol. Comput. 1–16 (2018)
3. Axelrod, R.: The Evolution of Cooperation, vol. 10. Basic, New York (1984)
4. Cardona, A.B., Togelius, J., Nelson, M.J.: Competitive coevolution in ms. pac-man. In: 2013 IEEE Congress on Evolutionary Computation, pp. 1403–1410 (2013)
5. Corus, D., Dang, D.C., Eremeev, A.V., Lehre, P.K.: Level-based analysis of genetic algorithms and other search processes. IEEE Trans. Evol. Comput. 22(5), 707–719 (2018)
6. Czarnecki, W.M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., Jaderberg, M.: Real world games look like spinning tops. Adv. Neural. Inf. Process. Syst. 33, 17443–17454 (2020)
7. van Dijk, M., Juels, A., Oprea, A., Rivest, R.L.: FlipIt: The game of “Stealthy Takeover”. J. Cryptol. 26(4), 655–713 (2013)
8. Droste, S., Jansen, T., Wegener, I.: On the analysis of the (1+1) evolutionary algorithm. Theoret. Comput. Sci. 276(1–2), 51–81 (2002)
9. Ehrlich, P.R., Raven, P.H.: Butterflies and plants: a study in coevolution. Evolution 18(4), 586–608 (1964)
10. Eigen, M.: Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften 58(10), 465–523 (1971)
11. Ficici, S.G.: Solution concepts in coevolutionary algorithms. Ph.D. thesis, Brandeis University (2004)
12. Fogel, D.: Blondie24: playing at the edge of artificial intelligence (2001)
13. Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning, 1st edn. Addison-Wesley Longman Publishing Co. Inc, Boston (1989)
14. Hemberg, E., Rosen, J., Warner, G., Wijesinghe, S., O’Reilly, U.M.: Detecting tax evasion: a co-evolutionary approach. Artif. Intell. Law 24, 149–182 (2016)
15. Hemberg, E., Toutouh, J., Al-Dujaili, A., Schmiedlechner, T., O’Reilly, U.M.: Spatial coevolution for generative adversarial network training. ACM Trans. Evol. Learn. Optim. 1(2) (2021)
16. Jones, S.T., Outkin, A.V., Gearhart, J.L., Hobbs, J.A., Siirola, J.D., Phillips, C.A., Verzi, S.J., Tauritz, D., Mulder, S.A., Naugle, A.B.: Evaluating moving target defense with pladd. Technical report, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States) (2015)
17. Krawiec, K., Heywood, M.: Solving complex problems with coevolutionary algorithms. In: Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion, pp. 687–713. ACM (2016)
18. Lehre, P.K.: Negative drift in populations. In: Proceedings of the 11th International Conference on Parallel Problem Solving from Nature (PPSN 2010). LNCS, vol. 6238, pp. 244–253. Springer, Berlin (2010)
19. Lehre, P.K.: Runtime analysis of competitive co-evolutionary algorithms for maximin optimisation of a bilinear function. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’22, pp. 1408–1416. ACM, New York (2022)
20. Lehre, P.K., Hevia Fajardo, M., Hemberg, E., Toutouh, J., O’Reilly, U.M.: Analysis of a pairwise dominance coevolutionary algorithm and defendit. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’23. ACM, New York (2023)
21. Lim, C.U., Baumgarten, R., Colton, S.: Evolving behaviour trees for the commercial game DEFCON. In: European Conference on the Applications of Evolutionary Computation, pp. 100–110. Springer (2010)
22. Luke, S., et al.: Genetic programming produced competitive soccer softbot teams for robocup97. Genet. Program. 1998, 214–222 (1998)
23. Mitchell, M.: Coevolutionary learning with spatially distributed populations. Comput. Intell.: Princip. Pract. 400 (2006)
24. Ochoa, G.: Error thresholds in genetic algorithms. Evol. Comput. 14(2), 157–182 (2006)
25. O’Reilly, U.M., Toutouh, J., Pertierra, M., Sanchez, D.P., Garcia, D., Luogo, A.E., Kelly, J., Hemberg, E.: Adversarial genetic programming for cyber security: a rising application domain where gp matters. Genet. Program Evolvable Mach. 21, 219–250 (2020)
26. Popovici, E., Bucci, A., Wiegand, R.P., De Jong, E.D.: Coevolutionary Principles, pp. 987–1033. Springer, Berlin (2012)
27. Rosin, C.D., Belew, R.K.: New methods for competitive coevolution. Evol. Comput. 5(1), 1–29 (1997)
28. Sims, K.: Evolving 3d morphology and behavior by competition. Artif. Life 1(4), 353–372 (1994)
29. Togelius, J., Burrow, P., Lucas, S.M.: Multi-population competitive co-evolution of car racing controllers. In: 2007 IEEE Congress on Evolutionary Computation, pp. 4043–4050 (2007)
30. Tomassini, M.: Spatially Structured Evolutionary Algorithms: Artificial Evolution in Space and Time. Springer (2005)
31. Toutouh, J., O’Reilly, U.M.: Signal propagation in a gradient-based and evolutionary learning system. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 377–385 (2021)

Chapter 3

Accelerating Image Analysis Research with Active Learning Techniques in Genetic Programming Nathan Haut, Wolfgang Banzhaf, Bill Punch, and Dirk Colbry

N. Haut (✉) · B. Punch · D. Colbry: Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, USA
W. Banzhaf: Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_3

3.1 Introduction

Image analysis is the process of extracting useful information from image data. This extracted information can then be used to study systems captured in the image data. Image analysis is broadly applied from medical imaging to computer vision [11, 14]. In medical imaging, the image data will often come from magnetic resonance imaging (MRI), positron emission tomography (PET), computerized tomography (CT), x-rays, etc. [11]. For example, in [12] the authors are able to use MRI and PET image data with deep learning methods to improve the success rate of identifying Alzheimer’s disease. Image data is generally rich with information, but can also contain significant noise. Segmentation is a specific step in image analysis where the features of interest are isolated and background information (noise) is removed. Active learning is a field in machine learning where data selection is performed to maximally inform model development [2]. Active learning’s origins drew inspiration from query learning, which was a method for designing experiments with the goal



of maximizing information gain using statistical measures [8]. In many ways active learning is similar to query learning; the key difference is that active learning considers how the information gained will inform model development for a specific machine learning method. This is why active learning methods vary based on which machine learning method is being used, from neural networks to support vector machines [1, 6, 9]. Uncertainty sampling is a method for selecting new training data that maximizes model uncertainty, with the idea that the data points the model is maximally uncertain about will provide the most information for model training [7]. Uncertainty sampling has been shown to be an effective method when used with support vector machines, where one intuitive approach is to sample new data points nearest the decision boundary [6]. There are three main classes of active learning: pool-based, stream-based, and membership query synthesis [9]. Pool-based methods rely on an existing set of unlabelled data, which the active learning method searches to pick one or several data points that contain maximal information to label and add to the training set. Stream-based methods are similar to pool-based methods in the sense that the unlabelled data already exists. The key difference is that the data is not searched but rather fed to the AL method one data point at a time, and the AL method indicates whether each data point should be labelled based on some predicted information score. Membership query synthesis does not use existing data and instead searches a space to find a training point to be both generated and labelled. In this work, we focus on pool-based methods where we have existing unlabelled data and select training samples one at a time by selecting the sample with the highest predicted information score.
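The pool-based selection step described above can be sketched in a few lines of Python. This is a generic illustration, not the AL-GP implementation: all function names are hypothetical, and the information score used in the example (the variance of the predictions of a small model ensemble) is an assumed stand-in for whatever score a particular method defines.

```python
from statistics import pvariance

def ensemble_disagreement(ensemble, x):
    """Variance of the ensemble's predictions at x (assumed information score)."""
    return pvariance([model(x) for model in ensemble])

def select_next(pool, score):
    """Index of the unlabelled pool item with the highest information score."""
    return max(range(len(pool)), key=lambda i: score(pool[i]))

def active_learning_round(pool, labelled, score, oracle):
    """Move the highest-scoring item from the pool into the labelled set."""
    x = pool.pop(select_next(pool, score))
    labelled.append((x, oracle(x)))  # the oracle supplies the true label
    return x

if __name__ == "__main__":
    # Toy ensemble of three "models" and a toy labelling oracle (illustrative).
    ensemble = [lambda x: x, lambda x: 2 * x, lambda x: 0 * x]
    pool, labelled = [0.0, 1.0, 2.0], []
    chosen = active_learning_round(
        pool, labelled,
        score=lambda x: ensemble_disagreement(ensemble, x),
        oracle=lambda x: x * x,
    )
    print(chosen, pool, labelled)
```

Passing the score as a function keeps the selection loop independent of how uncertainty is estimated, so the same loop could be reused with a different score.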
While individual genetic programming models generally lack the statistical properties needed to compute uncertainty, the model populations present in GP can be used to quantify uncertainty across the diverse models within a population. We have previously applied active learning to genetic programming in symbolic regression tasks using active learning in genetic programming (AL-GP) with a stack-based genetic programming system (StackGP) [4]. In that work we took inspiration from [5] and explored how populations in GP can be exploited to select new training data that the current model population is maximally uncertain about. This is done by selecting a model ensemble of diverse individuals from the population and using the ensemble to find points that maximize the uncertainty to add to the training set. This process proved successful, and additional work is being performed to study how additional data diversity metrics can be included to improve the method. This general strategy of using the model populations in symbolic regression GP tasks to select training data that maximally inform evolution seemed generalizable to GP and evolutionary computation. In this work, we therefore demonstrate how AL-GP can be implemented and applied to other evolutionary computation methods and various image analysis problems to accelerate the development of segmentation and object classification models. These models can then be used to aid research in various fields reliant on image data.
The SEE-Insight project is an open-source framework that aims to relieve the biggest bottleneck in Scientific Image Understanding, which is manual image annotation. SEE-Segment [3] is the first tool developed for SEE-Insight and consists of an evolutionary machine learning approach that uses a genetic algorithm to select a computer vision algorithm and optimize its parameters for an image annotation task (image segmentation). SEE-Segment will work with a Graphical User Interface (GUI) allowing researchers to upload their image datasets and then incrementally annotate their images. The image annotations are used to test scientific hypotheses or as a first step to feeding into a data-driven model such as a neural network. Because annotating images can be slow and tedious, while researchers are interfacing with the GUI the SEE-Segment system is simultaneously searching an image grammar (aka “algorithm space”) to find automated methods that can reproduce their manual workflows. This search is happening “in the background” on large-scale systems. Given the complexity and size of the search space, there is no guarantee that it will converge to a reasonable solution. However, if a good algorithm is found then suggestions are passed to the researcher to help speed up their annotation process. In the best case, a fully automated algorithm is identified that can reproduce their manual annotation. In the worst case, SEE-Segment will not take any longer than manually annotating images without the discovery tools. This application is an instance of a Combined Algorithm Selection and Hyperparameter optimization (CASH) problem and uses a genetic algorithm to search for an algorithm (and its hyperparameters). Although the space is nondifferentiable and highly heterogeneous, preliminary results are promising. By using a well-defined image grammar and genetic algorithms as the core search tool, the results of the machine learning are highly human interpretable. This allows the system to “generate code” that can be used for teaching as well as copied and pasted into a researcher’s own program.
Decision tree GP (DT-GP) is a GP system that evolves decision trees and was developed as part of this work specifically to solve the problem of cell classification but is generalizable to any classification task. In this work, we explore the efficacy of AL-GP in two different population-based ML systems and then demonstrate how AL-GP can be applied in a research setting to accelerate progress in scientific studies.

3.2 Data Sets 3.2.1 KOMATSUNA To benchmark the active learning methods in both systems, the KOMATSUNA [13] data set was used since it is a fairly simple segmentation problem and has ground truth labels available. The KOMATSUNA data set contains 300 images of plants where the provided labels are the segmentation patterns that identify the plant from background data. The KOMATSUNA data set is ordered and tracks plants over time as they grow. Each set of 5 consecutive images is taken within the same day so the images are substantially similar within those sets. Example images from the KOMATSUNA data set are shown in Fig. 3.1 to demonstrate how the sizes of plants vary in the set and also how the camera angle and location can vary.


Fig. 3.1 Example image (a) from the KOMATSUNA data set with its corresponding label (b); the panels are titled (a) Plant 1, (b) Label 1, (c) Plant 2, (d) Label 2. The label is the true segmentation mask. A second example image (c) and its label (d) are shown to demonstrate the diversity of images in the data set

3.2.2 Cell Classification AL-GP was also applied to cell image data to show how active learning could be applied to the problem of cell segmentation and classification. For this data set, the goal of classification is to determine which of the cells are co-transfected (expressing two proteins of interest). The image data [10] consists of two streams, one that tracks the expression of green fluorescence which indicates the presence of one protein, and another that tracks purple fluorescence, which indicates the expression of the other protein. An example of this data is shown in Fig. 3.2.


Fig. 3.2 The image on the left shows cells that are expressing a protein with a purple fluorescent piece. The image on the right shows cells that are expressing a protein with a green fluorescent piece

3.3 Active Learning Active learning was used to iteratively select training samples to maximally inform model populations. Each run begins with a single randomly selected training sample and one additional training sample is selected and added by active learning after a set number of generations. Once the new sample is added, evolution continues on the expanded dataset. Uncertainty was quantified by measuring disagreement between the models in an ensemble. To do this, an ensemble of 10 diverse models is selected from the population. The ensemble of models is then evaluated on every unselected training sample and the uncertainty on each sample is recorded by measuring the average difference between the predictions of each model in the ensemble. The sample with the highest uncertainty value is then selected and added to the training set. This method varies slightly from the original AL-GP approach in that the ensemble sizes are constant and the models are not selected by selecting best-fitting models on different data partitions. This change was made since we begin with a single training sample, which would result in one model selected for the first ensemble. A single model lacks any measure for uncertainty.
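The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names (`disagreement`, `select_next_sample`), the representation of a model as a callable returning a list of numbers, and the use of mean absolute pairwise difference as the "average difference" are all hypothetical choices, and the way the 10 diverse models are picked from the population is not shown.

```python
from itertools import combinations

def disagreement(predictions):
    """Mean absolute difference between every pair of model outputs.

    `predictions` holds one equal-length list of numbers per ensemble member,
    e.g. a flattened 0/1 segmentation mask predicted by each model.
    """
    pairs = list(combinations(predictions, 2))
    if not pairs:  # a single model carries no disagreement signal
        return 0.0
    total = 0.0
    for p, q in pairs:
        total += sum(abs(a - b) for a, b in zip(p, q)) / len(p)
    return total / len(pairs)

def select_next_sample(ensemble, pool, selected):
    """Return the index of the unselected pool sample with the highest uncertainty."""
    candidates = [i for i in range(len(pool)) if i not in selected]
    return max(candidates,
               key=lambda i: disagreement([model(pool[i]) for model in ensemble]))

# Toy usage: three hypothetical threshold "models" over lists of pixel intensities.
ensemble = [lambda img, t=t: [1 if px > t else 0 for px in img] for t in (0.3, 0.5, 0.7)]
pool = [[0.1, 0.2, 0.9, 0.95],   # every model agrees on every pixel
        [0.4, 0.6, 0.45, 0.65],  # the thresholds disagree on every pixel
        [0.1, 0.6, 0.9, 0.2]]    # the thresholds disagree on one pixel
next_index = select_next_sample(ensemble, pool, selected={0})
```

In this toy pool the second image is chosen, since it is the one on which the ensemble members disagree most. The early return in `disagreement` mirrors the remark that a single model offers no measure of uncertainty.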

3.4 AL-GP Applied to Decision Tree GP 3.4.1 Decision Tree GP (DT-GP) Image segmentation and object classification were performed using an implementation of decision tree GP (DT-GP), where decision trees are evolved to consider which pixels are foreground or background for segmentation and to identify which class an object belongs to for object classification tasks. The decision trees can utilize 3 types of operators: boolean operators, (in)equality operators, and numerical


Fig. 3.3 Shown here is an example tree generated to segment the KOMATSUNA data set. Here delta is a placeholder for the data

operators. Boolean operators take in boolean values and return boolean values. Inequality and equality operators take in numeric values and return boolean values. Numeric operators take in numeric vectors and return numeric scalars. The boolean operators available are And, Or, Not, Nand, Nor, and Xor. The inequality operators available are ≥, >, ≤,
… $N_O$, which allows $P(x)$ to vary considerably, in principle, over the set of all outputs; (iii) the avoidance of finite size effects by requiring $N_O \gg 1$; (iv) the nonlinearity of $f$, a feature fulfilled by many realistic maps.

4.2.2 RNA Studies In earlier studies of the genotype-phenotype maps of RNA primary to secondary structure mappings it was found that many neutral networks exist which allow multiple RNA primary structures to map to few secondary structures [19]. The distribution of the number of primary structures to secondary structures was not uniform, but highly skewed: There were many secondary structures with just a few primary structures in their neutral networks and a small number of secondary structures with a large number of primary structures. This and other studies pointed to the importance of neutral networks and to the mechanisms by which they allow a search to succeed.

4.2.3 GP on Boolean Functions Boolean functions were an early study object in Genetic Programming [13]. In [14] the fitness landscape of Boolean functions was systematically examined in the context


W. Banzhaf et al.

of a tree-based GP system and found to yield a curious feature—some of the Boolean functions were easy to find with small trees, while many were not found under such restrictions. If one allowed larger trees, the number of different Boolean functions increased, until at some point all of them could be found. The proportion of each Boolean function in the search space seemed to approach a certain asymptotic density. If one increased the number of allowed nodes in the tree, no further increase in the concentrations of a certain Boolean function could be found.

4.2.4 Neutral Networks Neutrality itself has long been found to play a key role in the mechanisms of evolution. It was even stated that neutral evolution is the main engine of evolution [12]. Our own interests in neutral evolution go back to this inspiration from Biology [1]. While these ideas were formulated in a less formal system, the advent of the concept of neutral networks [22] brought a new perspective into these considerations, which subsequently led to a study of neutral networks in Genetic Programming [2]. In [23] the authors examined some newly defined statistical measures of Boolean functions, in particular in connection with the neutrality features of many genotypes. The Boolean functions themselves were more complicated than in other work (multiplexers, parity problem), and the definition of neutrality made use of the notion of neutral networks. Recently, Wright and Laue [25] examined the distribution of genotypes in a Cartesian Genetic Programming (CGP) representation, with a focus on the evolvability and complexity properties brought about by the genotype-phenotype maps in those systems.

4.2.5 Our Earlier Work In our previous research, we used a Linear Genetic Programming (LGP) system, where a genotype is defined as a unique genetic program and a phenotype is defined as the Boolean relation a program represents, and constructed neutral networks to quantitatively study neutrality and related properties, including robustness and evolvability, in the system [7, 8, 10]. We reported a highly redundant G-P map and a heterogeneous distribution of mutational connections among phenotypes. Meanwhile, a recent graph-based model, search trajectory networks (STNs) [17, 18, 20], was developed to analyse and visualise the search trajectories of any metaheuristic. Search trajectory networks are a data-driven, graph-based model of search dynamics where nodes represent a given state of the search process and edges represent search progression between consecutive states. In our most recent work [9], we adopt STNs for an examination of the statistical behavior of searchers navigating the corresponding genotype space in Boolean LGP. Nodes are genotypes/phenotypes and edges represent their mutational transitions.

4 How the Combinatorics of Neutral Spaces …


We also quantitatively measure the characteristics of phenotypes, including their genotypic abundance (the requirement for neutrality) and Kolmogorov complexity. We connect these quantified metrics with search trajectory visualisations, and find that more complex phenotypes are under-represented, with fewer genotypes, and are harder for evolution to discover. Less complex phenotypes, on the other hand, are over-represented by genotypes, are easier to find, and frequently serve as stepping stones for evolution.

4.3 Genotypes, Phenotypes, Behavior, Fitness In order to have a clearer understanding of what is going on in evolution, we have to get a grasp on the search process and define terms carefully. Figure 4.1 shows a sketch of the search process at different levels of the (discrete) fitness in our Boolean function system. At each level of fitness we can imagine a neutral network (solutions with equal fitness) connected through edges symbolizing mutational (or variational) moves. There are connections between levels as well (moves that allow fitness to change), which usually go through portal nodes, nodes that provide connections between different fitness levels.

Fig. 4.1 Sketch of a network of neutral networks. Each level depicts one neutral network, with a discrete fitness value corresponding to its level. Nodes depict genotypes (genetic programs) which are connected within a level, reachable by neutral moves, with few nodes allowing jumps to a lower level (better fitness). The fitness of a node is measured by executing it and comparing the function it stands for with a target relation. The neutral networks are connected through what are called portal nodes to other neutral networks at a lower (better) fitness level


4.3.1 Discrimination of Genotypes and Phenotypes The use of the term phenotype is somewhat ambiguous in Evolutionary Computation, extending a tradition from Biology. In Biology, an organism can have many different phenotypes. A phenotype is simply what is being observed, whether it is a structural trait of an organism or its behavior. Here, we shall take a closer look and offer a crisper definition. Before we go there, let us start with the easiest definition: genotypes. A genotype is the pattern which is subject to the evolutionary operators of mutation and crossover, i.e. subject to genetic manipulation; see also Fig. 4.2. As an aside: the reader might know Cartesian GP [16], which is frequently characterized as a graph GP system. However, we should emphasize that Cartesian GP, with its linear sequence of numbers encoding graphs, is actually a type of linear GP system. Phenotypes are more difficult to define. There are different definitions of phenotypes, but generally we understand phenotypes to be what is observed and subject to selection. The definition can be based on
1. its fitness;
2. its behavior; or
3. its effective structure.
What do we suggest adopting in the context of Genetic Programming? We suggest discarding 1, since it is very clear in even this simple example of Boolean functions that different Boolean functions can have the same fitness, but that does not make them identical phenotypes! We have a choice of either 2 or 3. If we adopt 3, we are at the lowest level of resolution, but such a suggestion would run counter to what has been traditionally considered a phenotype in Biology. If we adopt 2, we are following at least part of that tradition. But we then need to discriminate further

Fig. 4.2 The genotype maps to a phenotype which produces behavior that is judged by a fitness function


regarding the effective structure of the phenotype, since in GP there are many ways to produce a single behavior. Hence we suggest calling the different effective structures that produce the same behavior isotypes. Isotypes are semantically neutral, but different programs. Traditionally, they are considered the same phenotype, because they result in the same behavior.¹

4.3.2 The Difference of Structural and Semantic Neutrality This brings back a distinction we made earlier between structural (or syntactic) and semantic neutrality/introns. Structural introns are parts of different genotypes that are structurally neutral variations, i.e. they don’t affect the effective structure or isotype. Semantic introns, on the other hand, affect the effective structure of a program, but not its behavior. Given this situation, a remark is due on tree-based GP (TGP): most (if not all) neutral variations in TGP runs are semantic, whereas most (if not all) neutral variations in LGP are structural. However, it is much easier to identify structural neutrality than it is to identify semantic neutrality. For the former, one only has to analyze a program once, and this analysis is of order $O(L)$, with $L$ being the length of the program. For semantic neutrality, one has to run the program and analyze the semantics for each node on $k$ fitness cases, which for large $k$ is possible only on a sample, and is still time consuming. Note that both GP representations can in principle have both types of neutrality. So, one would need to run both a structural and a semantic analysis on both representations to identify the full amount of neutrality in each individual. One would do the easier analysis first (for structural neutrality), then exclude those parts of the program from the second type of analysis. But the majority of the reduction in complexity would come for LGP from the first step, while for TGP it would only come at the second step. That is why LGP is easier to handle: one could just do the first step and still end up with a good approximation of complexity, while for TGP the first step does not yield a good approximation. If one wants to get a handle on the complexity of an evolved solution, it is important to identify the neutrality in a representation.
What makes this difficult for tree GP is not only that semantic introns are time-consuming to identify, but also that genotype and phenotype in TGP are both trees, making it difficult to see the smaller (phenotypic) tree within the larger (genotypic) one. In other words, only a part of the GP tree genotype can count as the tree phenotype.
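The single-pass structural analysis mentioned above for linear programs can be sketched as a backward scan that tracks which registers still influence the output. The tuple encoding `(dest, op, src1, src2)`, the function name `effective_instructions`, and the choice of register 0 as output are assumptions made for this illustration; the scan visits each instruction once, so it is linear in the program length.

```python
def effective_instructions(program, output_register=0):
    """Indices of structurally effective instructions, found in one backward pass.

    Instructions are (dest, op, src1, src2) tuples of register indices.
    """
    needed = {output_register}           # registers whose current value still matters
    effective = []
    for index in range(len(program) - 1, -1, -1):
        dest, _op, src1, src2 = program[index]
        if dest in needed:
            effective.append(index)
            needed.discard(dest)         # the value is overwritten here ...
            needed.update((src1, src2))  # ... and now depends on the operands
    return sorted(effective)

program = [(4, "AND", 2, 3),   # effective: R4 feeds the next instruction
           (0, "OR", 1, 4),    # effective
           (4, "NOR", 1, 2),   # structural intron: R4 is never read afterwards
           (0, "AND", 3, 0)]   # effective: writes the output register
effective = effective_instructions(program)
```

Here three of the four instructions are effective and one is a structural intron. Semantic introns, such as an effective instruction whose result never changes the output, are not detected by this scan and would need the costlier behavioral analysis.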

¹ An alternative is to divide phenotypes into a static (structural) and a dynamic (behavioral) part.


4.4 Methods In this section, we describe the LGP system we used, the Boolean programs investigated in this study, and our visualization method using STNs.

4.4.1 Linear Genetic Programming Linear Genetic Programming (LGP) [4] is a variant of GP [3] where a sequential representation of computer programs is employed to encode an evolutionary individual. Such a linear genetic program often consists of a set of imperative instructions to be executed sequentially. Registers are used to either read input variables (input registers) or to enable computational capacity (calculation register). One or more registers can be designated as the output register(s) such that the final stored value(s) after the program is executed will be the program’s output.

4.4.2 Boolean Function Programs/Circuits We use an LGP algorithm for a three-input, one-output Boolean function search application, similar to our previously examined LGP system [7, 10, 11]. Each instruction has one return, two operands and one Boolean operator. The operator set has four Boolean functions {AND, OR, NAND, NOR}, any of which can be selected as the operator for an instruction. Three registers R1, R2, and R3 receive the three Boolean inputs, and are write-protected in a linear genetic program. That is, they can only be used as an operand in an instruction. Registers R0 and R4 are calculation registers, and can be used as either a return or an operand. Register R0 is also the designated output register, and the Boolean value stored in R0 after a linear genetic program’s execution will be the final output of the program. All calculation registers are initialized to FALSE before execution of a program. An example linear genetic program with three instructions is given as follows:
I1: R4 = R2 AND R3
I2: R0 = R1 OR R4
I3: R0 = R3 AND R0
A linear genetic program can have any number of instructions; for the ease of sampling in this study, we use linear genetic programs that have a fixed length of 6 or 12 instructions. The genotype in our GP algorithm is a unique linear genetic program. Since we have a finite set of registers and operators, as well as a fixed length for all programs, the genotype space is finite and we can calculate its size. For each instruction, two


registers can be chosen as return registers and any of the five registers can be used as one of two operands. Finally, an operator can be picked from the set of four possible Boolean functions. Thus, there are $2 \times 5 \times 5 \times 4 = 200$ unique instructions. Given the fixed length of six instructions for all linear genetic programs, we have a total number of $200^6 = 6.4 \times 10^{13}$ possible different programs. For 12 instructions, that number grows to about $4 \times 10^{27}$ different programs in the search space. The phenotype in our GP algorithm is a Boolean relationship that maps three inputs to one output, represented by a linear genetic program, i.e., $f : \mathbb{B}^3 \to \mathbb{B}$, where $\mathbb{B} = \{\mathrm{TRUE}, \mathrm{FALSE}\}$. There are thus a total of $2^{2^3} = 256$ possible Boolean relationships. Having $6.4 \times 10^{13}$ genotypes to encode 256 phenotypes, our LGP algorithm must have a highly redundant genotype-phenotype mapping. We define the abundance/redundancy of a phenotype as the total number of genotypes that map to it. We choose the fitness of a linear genetic program as the deviation of the phenotype’s behavior from a target Boolean function and want to minimize that deviation in the search process. Given three inputs, there are $2^3 = 8$ combinations of Boolean inputs. The Boolean relationship encoded by a linear genetic program can be seen as an 8-bit string representing the outputs that correspond to all 8 possible combinations of inputs. Formally, we define fitness as the Hamming distance between this 8-bit output and the target output. For instance, if the target relationship is $f(R_1, R_2, R_3) = R_1$ AND $R_2$ AND $R_3$, represented by the 8-bit output string 00000001, the fitness of a program encoding the FALSE relationship, i.e., 00000000, is 1. Fitness is to be minimized and falls into the range between 0 and 8, where 0 is the perfect fitness and 8 is the worst.
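To make the genotype, phenotype and fitness definitions concrete, the following is a minimal sketch of an interpreter for such programs. The tuple encoding `(dest, op, src1, src2)`, the helper names `run`, `phenotype` and `hamming`, and the input ordering (inputs enumerated from 000 to 111 with R1 as the leftmost bit) are assumptions of this sketch; the ordering was chosen so that the three-input AND target reads 00000001 as in the text.

```python
from itertools import product

# Operator set from the text; NAND/NOR are the negations of AND/OR.
OPS = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
    "NOR": lambda a, b: not (a or b),
}

def run(program, r1, r2, r3):
    """Execute a linear program given as (dest, op, src1, src2) register indices."""
    regs = [False, r1, r2, r3, False]  # R0 and R4 start as FALSE, R1..R3 hold inputs
    for dest, op, a, b in program:
        regs[dest] = OPS[op](regs[a], regs[b])
    return regs[0]  # R0 is the designated output register

def phenotype(program):
    """8-bit output string over inputs 000..111 (R1 taken as the leftmost input bit)."""
    return "".join("1" if run(program, *bits) else "0"
                   for bits in product([False, True], repeat=3))

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

# The three-instruction example program from the text.
example = [(4, "AND", 2, 3), (0, "OR", 1, 4), (0, "AND", 3, 0)]
target = "00000001"  # R1 AND R2 AND R3
fitness = hamming(phenotype(example), target)

# Search space sizes: instructions per slot, then programs of length 6 and 12.
instructions = 2 * 5 * 5 * 4
sizes = (instructions ** 6, instructions ** 12)
```

Under these assumptions the example program computes R3 AND (R1 OR (R2 AND R3)), which differs from the target on two input combinations, so its fitness is 2. Write protection of R1 to R3 is not enforced by this sketch; it is left to whoever constructs the program.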

4.4.3 Visualization Method We use Search Trajectory Networks (STNs) [18] to analyze and visualize the behavior of the studied fitness functions and program lengths. STNs are a graph-based tool to study the dynamics of evolutionary algorithms and other metaheuristics. Originally, the model was used to track the trajectories of search algorithms in genotype space, where nodes represent visited solutions (or genotypes) during the search process, and edges represent consecutive transitions between visited genotypes. However, for large search spaces, this approach yields unmanageable models. Therefore, coarser models have been proposed where nodes represent sets of related genotypes rather than single ones. In particular, nodes can group genotypes expressing the same behavior or phenotype [9, 21]. This is the approach we follow here. We generate and visualize Phenotype Search Trajectory Networks, where the search space locations (nodes) are unique phenotypes, and edges represent consecutive transitions between phenotypes. For constructing the Phenotype STN models, multiple adaptive walks (100 to be precise) are performed for each fitness function and program length. Adaptive walks are search trajectories in the fitness landscape that accept both neutral as well


as improving moves, with deteriorating fitness moves being prohibited. A move in our implementation is a single genotype mutation (change of a symbol for another selected uniformly at random). For each run, the sequence of genotypes visited is recorded. Thereafter, in a post-processing step, the visited genotypes across all runs are grouped into their corresponding phenotypes, and the transitions between pairs of grouped nodes are also aggregated into single edges to construct a single graph model. Notice that some nodes and transitions may appear multiple times during the sampling process. However, the graph model retains as nodes each unique phenotype, and as edges each transition between pairs of visited phenotypes. Counters are maintained as attributes of the nodes and edges, indicating their sampling frequency. These counters are later used as visual attributes to depict the size of nodes and the widths of edges. Once network models are constructed, we can proceed to visualize them and compute relevant network metrics. Node-edge diagrams are the most familiar form of network visualization, where nodes are assigned to points in the two-dimensional Euclidean space and edges connect adjacent nodes by lines or curves. Nodes and edges can be decorated with visual properties such as size, color and shape to highlight relevant characteristics. Our STN visualizations use node colors to identify four types of nodes: (1) best nodes, which have the minimum possible fitness (zero); (2) neutral nodes, whose adjacent outgoing nodes have the same fitness; (3) portals, which link to a node with improved fitness; and (4) portals to best, which are portal nodes with a direct link to best solutions. The shape of nodes identifies four positions in the search trajectories: (1) start of trajectories, (2) best node (fitness zero), (3) end of trajectories (which are not the best node), and (4) intermediate locations in the trajectories.
Edge colors indicate whether an edge is neutral or improving. Node sizes and edge thickness are proportional to their sampling frequency. A key aspect of network visualization is the graph layout, which accounts for the positions of nodes in the 2D Euclidean space. Graphs are mathematical objects; they do not have a unique visual representation. Many graph-layout algorithms have been proposed in the literature. Here, we use the layout we designed in [9] to visualize the phenotype STNs. This layout uses the fitness values as the nodes’ $y$ coordinates, while the $x$ coordinates are placed as a horizontal grid according to the number of nodes per fitness level. This layout allows us to appreciate the search progression towards smaller (better) fitness values, as well as the amount of neutrality present in the search space.
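The post-processing step that merges recorded walks into one network with frequency counters can be sketched as below. This is a simplified illustration, not the STN tooling cited in the text: the function name `build_phenotype_stn` is hypothetical, each walk is assumed to be already mapped from genotypes to phenotype labels, and steps that stay within the same phenotype are assumed not to create an edge.

```python
from collections import Counter

def build_phenotype_stn(walks):
    """Aggregate walks (lists of visited phenotype labels) into node and edge counters."""
    nodes, edges = Counter(), Counter()
    for walk in walks:
        nodes.update(walk)                     # sampling frequency of each phenotype
        for src, dst in zip(walk, walk[1:]):   # consecutive steps of one walk
            if src != dst:                     # assumption: skip steps within a phenotype
                edges[(src, dst)] += 1
    return nodes, edges

# Two toy walks, written as 8-bit phenotype strings.
walks = [["11111111", "01010101", "01010101", "00000001"],
         ["00110011", "01010101", "00000001"]]
nodes, edges = build_phenotype_stn(walks)
```

The resulting counters play the role of the node and edge attributes described above: a phenotype visited three times would be drawn larger, and a transition sampled twice would be drawn with a thicker edge.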

4.5 The Role of Neutrality In the literature of GP, there has been a discussion about the benefits of neutrality for a long time. It has often been said that neutrality harms, since it robs the evolutionary algorithm of guidance toward the goal, as a neutral network has identical fitness for all its genotypes or phenotypes, depending on which level one studies.


Fig. 4.3 The amount of neutrality of phenotype A, B depends on the space left for neutral variation (yellow part). The larger the neutral part, the more variants are available to that phenotype. At the same time, the shorter the effective part of the solution

However, it turns out that there are subtle differences between individuals, even if their fitness does not change on a neutral network. Two major differences are the abundance and the connectivity of phenotypes. As the results on discrete input-output maps discussed in the introduction show, some phenotypes are very frequently represented, compared to others, based on their complexity. Some are better connected in the network. The differences in abundance and connectivity guide a stochastic algorithm like evolution, even in the absence of fitness differences. This will lead to a higher probability of using highly abundant and better connected nodes in the network. These are two counter-acting tendencies: In a size limited search space (with a limit on the dimensionality of individuals, in our case 6 or 12 lines of code), abundance is tied to the amount of neutrality a phenotype gains via its neutral genotypes. If the total length of a program is fixed at $L$, the effective length is $E$ and the neutral length is $N$, the following simple relation has to hold:
$$N = L - E \tag{4.3}$$

Figure 4.3 shows two examples. Note that the degree of neutrality depends exponentially on the number of neutral instructions. Thus a phenotype represented by genotype $A$ will be more represented in the search space, and easier to find, compared to a phenotype represented by genotype $B$. It will be shorter in its effective part, thus less complex. This explains the phenomenon described earlier, of a negative exponential dependence of abundance on Kolmogorov complexity. On the other hand, a phenotype represented by genotype $B$ has better connectivity to other fitness levels because it has more effective instructions. This will make it easier to move via mutation from one phenotype to the other. These are counteracting tendencies that are able to explain why evolution in the longer run (when neutrality dominates) prefers low complexity solutions, while in the shorter run (when fitness effects dominate) it prefers to access solutions with higher complexity. Given the role of neutrality in influencing search performance, a natural question to ask is whether and how we can increase the degree of beneficial neutrality in a search space. In the following two subsections, we shall discuss two ways of increasing neutrality that seem to have a positive effect on search efficacy.


4.5.1 Longer Programs The simplest way to introduce more neutrality in the search space is to allow an increase in the length of programs, i.e. the number of instructions. While this does not change the ratio of abundances between phenotypes, at least beyond a certain minimum length (see [14]), it allows more connectivity between different phenotypes (i.e., non-neutral changes) through an expansion of the neutral networks of those different phenotypes. That is to say that the 1-mutant neighborhood of phenotypes includes more other phenotypes in larger networks than in smaller ones.

4.5.2 A New Fitness Function The second method is to consider the fitness function to be used in the search space more carefully, according to the problem. If there is a symmetry in the problem, we might want to exploit that for the fitness function, allowing those phenotypes/solutions that show invariance under a certain transformation to be grouped together in the fitness function. For example, in symbolic regression (SR), it is a well-known fact that a root mean square error (RMSE) fitness function which judges every data point separately, can be replaced by a correlation fitness function plus a post-processing step that uses the fact that all best fitness solutions under the correlation criterion are equally correct, modulo a linear scaling transformation at the end. This opens up multiple pathways to a good solution for the symbolic regression problem, where RMSE would only allow a very limited number.
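The symbolic regression example can be illustrated with a small sketch. It assumes that the correlation fitness is taken as one minus the squared Pearson correlation and that the post-processing step is an ordinary least-squares fit of slope and intercept; both are common choices but are assumptions here, as are the function names.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def rmse(pred, target):
    """Root mean square error, judging every data point separately."""
    return sqrt(mean([(p - t) ** 2 for p, t in zip(pred, target)]))

def correlation_fitness(pred, target):
    """1 - r^2, so 0 is best; invariant to scaling and shifting of `pred`."""
    mp, mt = mean(pred), mean(target)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    vp = sum((p - mp) ** 2 for p in pred)
    vt = sum((t - mt) ** 2 for t in target)
    if vp == 0 or vt == 0:
        return 1.0  # a constant output carries no correlation
    return 1.0 - cov * cov / (vp * vt)

def linear_scaling(pred, target):
    """Least-squares slope and intercept mapping `pred` onto `target`."""
    mp, mt = mean(pred), mean(target)
    slope = (sum((p - mp) * (t - mt) for p, t in zip(pred, target))
             / sum((p - mp) ** 2 for p in pred))
    return slope, mt - slope * mp

xs = [0.0, 1.0, 2.0, 3.0]
target = [3 * x + 2 for x in xs]   # hidden relation y = 3x + 2
pred = xs                          # a model that only found the shape y = x
slope, intercept = linear_scaling(pred, target)
scaled = [slope * p + intercept for p in pred]
```

In this toy case the unscaled model has a large RMSE but a perfect correlation fitness, and the scaling step recovers slope 3 and intercept 2. Every model of the form a·x + c with a ≠ 0 would score equally well under the correlation criterion, which is the sense in which it opens up more pathways to a solution.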

4.5.2.1 A Failed Attempt

Here, we apply the same method in the context of Boolean functions. The equivalent of a symmetry (an invariance under certain transformations) is, for instance, the fact that a negated Boolean function is also a solution, provided one adds a negation gate at the output. As in the case of SR, it is only the local, relative difference of outputs between different input patterns for the function that is important. Thus, we define a new fitness function called relative distance (RD) between two bit strings $b^{(1)}$ and $b^{(2)}$ as follows:
$$RD(b^{(1)}, b^{(2)}) = \sum_{i=1}^{N-1} \left( b_i^{(1)} - b_{i+1}^{(2)} \right) \tag{4.4}$$

Minimizing RD will allow phenotypes that have the same local pattern change to be identically treated. If one reaches distance 0, a final decision has to be made for the first bit, which determines the bit pattern completely. Consider an example: Two 8-bit bitstrings $b_1 = 10100101$ and $b_2 = 01000010$ have a Hamming distance

4 How the Combinatorics of Neutral Spaces …

77

of . H D = 6. If looked at from the perspective of the relative distance, though, their distance is much smaller: . R D = 2. This can be measured by transforming them into relative bit patterns .b1,r = 1110111 and .b2,r = 110011. The phenotype of the solution is thus either the phenotype already found, or its negation. But during the search process prior to hitting fitness 0, both types of patterns are treated equally, which allows more pathways towards a solution. It should thus be easier to find solutions with the help of the RD fitness function than it is with the help of the Hamming distance (HD) fitness function (which is the equivalent of MSE in the binary space). Unfortunately, that is not the case with this particular relative distance metric. While the additional symmetry of relative fitness might enhance the chances of finding the desired phenotype in a binary search space, this is balanced out by the countervailing influence of the enlarged search space. However, there is another symmetry that can serve to increase the neutrality of the search space: Applying a negation to the inputs of the Boolean function will produce three equivalent behaviors, those, together with the original behavior of the program, can be evaluated as the .NegativeInputDistance fitness measure and minimized. This is what we are going to explore in the following in more detail.
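The worked example can be checked with a short sketch. It follows the example rather than Eq. (4.4) literally: it assumes that the relative bit pattern is the XOR of adjacent bits and that RD is the Hamming distance between two relative patterns, an interpretation that reproduces the quoted values HD = 6 and RD = 2. The function names are hypothetical.

```python
def to_bits(text):
    return [int(ch) for ch in text]

def hamming(a, b):
    # Number of positions at which two equal-length bit lists differ.
    return sum(x != y for x, y in zip(a, b))

def relative_pattern(bits):
    # Assumed reading: 1 where adjacent bits differ, 0 where they agree.
    return [x ^ y for x, y in zip(bits, bits[1:])]

def relative_distance(a, b):
    return hamming(relative_pattern(a), relative_pattern(b))

b1 = to_bits("10100101")
b2 = to_bits("01000010")
hd = hamming(b1, b2)
rd = relative_distance(b1, b2)
b1_rel = "".join(str(bit) for bit in relative_pattern(b1))

# A bit string and its negation share the same relative pattern, which is
# the symmetry the RD measure is meant to exploit.
negated_b1 = [1 - bit for bit in b1]
rd_to_negation = relative_distance(b1, negated_b1)
```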

4.5.2.2 The Negative Input Fitness Function

We are going to lump together the distances of 4 different phenotypic behaviors into one distance measure. For each of these behaviors we calculate the Hamming distance $d$ to the target $t$, then select the minimum as the NegativeInputDistance (ND):

$$ND(b, t) = \min_{i=1 \ldots 4} d_i \qquad (4.5)$$

where

$$d_i = \begin{cases} d(b, t) & \text{for } i = 1 \\ d(\bar{b}_i, t) & \text{for } i = 2 \ldots 4 \end{cases}$$

with $\bar{b}_i$ symbolizing a single negated input $i$. Once ND has converged to 0, we know we have hit the target, and it is only a matter of resolving which of the 4 possible phenotypes (lumped together in ND) produced the target behavior. This can be resolved in a post-processing step in constant time by testing which of the 4 possibilities produces the target phenotype behavior.
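A minimal sketch of Eq. (4.5) for 3-input Boolean functions follows. It assumes a phenotype is stored as an 8-entry truth table indexed by the input pattern, so that negating input j amounts to reading the table at index `x XOR 2**j`; this encoding and the function names are illustrative assumptions, not the authors' implementation.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def negate_input(table, j):
    # Truth table of the same program when input j is negated
    # (assumed encoding: table[x] is the output for input pattern x).
    return [table[x ^ (1 << j)] for x in range(len(table))]

def negative_input_distance(behavior, target, n_inputs=3):
    # Candidate 0 is the original behavior; candidates 1..n_inputs each
    # have a single input negated.
    candidates = [behavior] + [negate_input(behavior, j) for j in range(n_inputs)]
    distances = [hamming(c, target) for c in candidates]
    best = min(distances)
    # The index supports the post-processing step: it tells which of the
    # 4 lumped behaviors attains the minimum.
    return best, distances.index(best)

target = [0, 1, 0, 1, 0, 1, 0, 1]       # outputs the lowest input bit
behavior = negate_input(target, 0)      # same function, input 0 negated
hd = hamming(behavior, target)
nd, which = negative_input_distance(behavior, target)
```

Here the plain Hamming distance is maximal, while ND is already 0 and `which` identifies the input negation that has to be undone.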

78

W. Banzhaf et al.

4.6 Results

4.6.1 A Comparison of Success Rates

Figure 4.4 shows the effects of longer programs and the new fitness function ND on the success rate of finding different phenotypes. Each of the 256 phenotypes is used as the target. A search starts with a randomly generated program and continues for 2,000 steps of point mutations. 100 runs are collected for each experiment with a specified target phenotype. We plot the success rate, computed as the proportion of the 100 runs that reach the target, as a function of the logarithm of the redundancy/abundance of the phenotype in the system. We can see that redundancy is not equally distributed, but clustered around a number of redundancy values. This is a reflection of the discreteness of the Boolean search space used. We can discern around 10 different clusters, from very low redundancy (high difficulty) to very high redundancy (low difficulty). As shown in the figure, all 4 strategies are comparably successful when a target is very easy to reach (right side). For targets somewhat harder to discover, strategies using the new fitness function ND are clearly more successful, even at the 100% level. More difficult

[Figure 4.4 plot: success rate (vertical axis, 0.00 to 1.00) against redundancy (horizontal axis, logarithmic, 10^3 to 10^6) for the strategies HD_L6, HD_L12, ND_L6 and ND_L12]

Fig. 4.4 Comparison of search strategies using fitness functions HD and ND, as well as program length 6 and 12. The 4 strategies are denoted using 4 different colors. Each data point represents one experiment using a specified phenotype as target. 256 distinct phenotypes are used as targets. Success rate is computed as the fraction of 100 runs in one experiment that reach a target phenotype within 2000 steps, and is plotted in relation to the target phenotype’s redundancy. Phenotypes with higher redundancies are easier to find


phenotypes are found more easily with the new ND fitness function, down to the most difficult phenotypes. For very difficult phenotypes, longer programs (L12) are more successful, again with ND fitness in the lead.

4.6.2 Comparison of Search Trajectory Networks for Three Targets

Next, we take a closer look at the search trajectories and investigate the effects of fitness ND and longer programs on neutrality. We pick three representative phenotypes as targets, in order of increasing difficulty of discovery: phenotype 84 (redundancy 1,313,880), phenotype 140 (redundancy 331,241), and phenotype 215 (redundancy 3,060). Figures 4.5, 4.6 and 4.7 show comparisons of STN visualizations for these three targets. Adopting the new fitness function ND and increasing program length allow more neutral moves (edges colored in grey), especially among stepping-stone phenotypes with medium fitness values. There are also more paths explored that lead to the target. These effects are increasingly prominent when searching for more difficult targets.

In addition to visualizing the STNs, we collect network metrics that quantitatively characterize these search networks. Figure 4.8 shows the comparison of these network metrics for the 4 search strategies. The number of nodes indicates the number of distinct phenotypes explored. Neutral edges captures the proportion of neutral search moves. While the differences are subtle, fitness ND and longer programs facilitate more thorough searches in the space by allowing more neutral moves. Degree best is the number of incoming edges to best phenotypes (targets and their equivalents). Strength best is the weighted sum of those incoming edges to best phenotypes, where edges are weighted by their visit frequencies. As seen in the figure, more difficult targets have few paths directly connected to them. But fitness function ND and longer programs enable the discovery of more distinct paths to the targets.
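The four network metrics can be sketched as follows. This is a minimal illustration under stated assumptions: each run is recorded as a list of (phenotype, fitness) pairs, an edge is a directed transition between two distinct phenotypes, an edge is neutral when both endpoints have the same fitness, and best nodes are those with fitness 0. The data layout and the name `stn_metrics` are hypothetical, not taken from the authors' tooling.

```python
from collections import Counter

def stn_metrics(runs):
    fitness_of = {}
    edge_weight = Counter()   # (source, destination) -> visit frequency
    for run in runs:
        for phenotype, fitness in run:
            fitness_of[phenotype] = fitness
        for (src, _), (dst, _) in zip(run, run[1:]):
            if src != dst:
                edge_weight[(src, dst)] += 1

    neutral = sum(1 for (s, d) in edge_weight if fitness_of[s] == fitness_of[d])
    into_best = {e: w for e, w in edge_weight.items() if fitness_of[e[1]] == 0}
    return {
        "nodes": len(fitness_of),
        "neutral_edges": neutral / len(edge_weight) if edge_weight else 0.0,
        "degree_best": len(into_best),
        "strength_best": sum(into_best.values()),
    }

# Two toy runs towards target phenotype 84 (fitness 0).
runs = [
    [(10, 3), (11, 3), (84, 0)],
    [(12, 4), (11, 3), (84, 0)],
]
metrics = stn_metrics(runs)
```

In this toy data the edge 11 -> 84 is visited twice, so degree best is 1 while strength best is 2.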

4.6.3 Simpler Solutions

It is a rather counter-intuitive fact that evolution seems to choose simpler solutions (phenotypes) over more complex ones. For many years our community has thought that evolution produces complex solutions. Bolstering this idea was the realization that above a certain complexity threshold there are many solutions to a problem, while below that threshold, there are none. In fact, there are normally only a few just above the threshold, and then there are many more as we increase the complexity. So it would seem that it is much easier to discover a complex solution than it would be to discover one of the few simple ones.

[Figure 4.5 panels: Target 84 at Length 6 (left) and Length 12 (right), with Fitness HD (top) and Fitness ND (bottom); vertical axis: fitness, 0 to 6. Legends: Type (Best, Neutral, Portal, Portal_B), Improving (FALSE, TRUE), Position (Begin, Best, End, Medium)]

Fig. 4.5 Search trajectory network (STN) for easy target phenotype 84, comparing fitness function HD (top) and ND (bottom) and program length 6 (left) and 12 (right). Nodes are distinct phenotypes and edges show mutational transitions among phenotypes during searches. The figure shows the aggregation of search trajectories of 100 runs. Node shapes denote positions, i.e., beginning phenotype (randomly chosen), best phenotypes (fitness 0), end phenotype, and intermediate phenotype. Edge colors denote if a mutation improves fitness

But there is another influence at work in Genetic Programming—neutrality. Neutrality makes sure that simple solutions (in a limited search space) have many more neutral ‘siblings’ due to the combinatorics of the neutral space. As a result, the abundance of simple solutions is much higher and increases the chance of finding them. Does that mean that evolution always finds the simplest solution? No, because it is not just the abundance of a phenotype that determines success, but also the connectivity of that phenotype. It is thus an effort (it requires optimization) to find the very simplest solution: Determining the Kolmogorov complexity of a bit string— in our case the phenotype’s behavior—is an effort, but it can be achieved by a suitably set up GP search. We can therefore assume that evolution will generally find a simple solution, but not the simplest. In the normal case, the simplest solution might just not have enough connectivity in the network to be easily accessible, except in very

[Figure 4.6 panels: Target 140 at Length 6 (left) and Length 12 (right), with Fitness HD (top) and Fitness ND (bottom); vertical axis: fitness, 0 to 6. Legends: Type (Best, Neutral, Portal, Portal_B), Improving (FALSE, TRUE), Position (Begin, Best, End, Medium)]

Fig. 4.6 Search trajectory network (STN) for medium target phenotype 140, comparing fitness function HD (top) and ND (bottom) and program length 6 (left) and 12 (right)

simple problems. That is because it needs to be accessed from a different fitness level, not from the neutral level on which it has so many siblings. Here we consider the cumulative development of the complexity distribution over runs. Figure 4.9 shows the distribution of phenotypic complexity in 100 runs of target phenotype 140, L = 12, with high difficulty. We can see that the complexity of programs at the beginning of runs is quite low: most random programs clock in at complexity 1, 2 or 3. In fact, the median is 2. As the target phenotype with K-complexity 3 is reached, most solutions are beyond the minimum of 3, with a median of 5 at the time of discovery of the target. Beyond the initial discovery of the target, however, the complexity of solutions trends downward, with a median of 4 at iteration 2000.

4.7 Discussion and Future Work

We have seen that evolution does something rather unexpected, if left to its own devices: It prefers simple over complex solutions, at least to the degree possible and achievable with reasonable effort. This is not a miracle but due to the higher

[Figure 4.7 panels: Target 215 at Length 6 (left) and Length 12 (right), with Fitness HD (top) and Fitness ND (bottom); vertical axis: fitness, 0 to 6. Legends: Type (Best, Neutral, Portal, Portal_B), Improving (FALSE, TRUE), Position (Begin, Best, End, Medium)]

Fig. 4.7 Search trajectory network (STN) for hard target phenotype 215, comparing fitness function HD (top) and ND (bottom) and program length 6 (left) and 12 (right). The differences on the level(s) shortly before reaching the target are dramatic between HD and ND runs. A huge increase in possible pathways demonstrates the advantage of searching with the ND metric over the simple HD metric

abundance of simpler solutions over complex ones, at least in search spaces that are limited in size. While this is an observation that could, in principle, be made in all GP systems allowing neutral solutions to play a role (i.e. those that do not exclude neutral search), it is particularly obvious in linear GP due to the ease of developing structural introns. While the same process is expected to go on in tree-based GP, it is less visible there due to the semantic nature of neutrality in that representation. Still it would be worthwhile to consider the same phenomenon in a TGP system. The corresponding intron discovery algorithm is easy to formulate, yet computationally expensive to execute. The general result on input-output maps is based on algorithmic information theory. This should be expected to hold under many different circumstances, and it is no surprise that genotype-phenotype maps in GP fall under it. However, due to the nature of neutrality, a particularly intuitive explanation for the exponential distribution of abundance can be seen in our system: The increase in neutrality leads to exponentially more neutral solutions, due to the combinatorics allowed in

[Figure 4.8 panels: bar charts of four STN metrics (Nodes, Neutral Edges, Degree Best, Strength Best) for Targets 84, 140 and 215 at program lengths 6 and 12, grouped by fitness function (HD, ND)]

Fig. 4.8 Metrics of STNs comparing two fitness functions HD and ND and two program lengths 6 and 12. Number of nodes indicates how many distinct phenotypes are visited. The proportion of neutral edges shows edges in an STN that connect genotypes to phenotypes with the same fitness value. Degree best is the number of incoming edges to best phenotypes (fitness of 0). Strength best is the weighted degree (sum of edge weights) of best nodes

the neutral space. We think this is a key insight and it is essential not to overengineer approaches to GP that do away with neutrality, or ignore it in favor of a presumed better efficiency of search algorithms by removing non-effective code during evolution. There is a reason for the emergence of neutrality from evolutionary processes, and it is counterproductive to fight this tendency. In fact it is not only an emergent process but it serves evolution at the same time to find simple solutions to problems faster than complex ones. To leave the reader with something more heavy to ponder at the end: Does this tendency to prefer simple solutions perhaps shed some new light on the “unreasonable effectiveness of mathematics in the natural sciences” [24], a question Eugene Wigner has pondered for Physics, but as easily applicable to Biology?


Fig. 4.9 Histogram of complexity distribution of phenotypes with target 140 at L = 12, HD fitness. Three cumulative distributions are shown, at the beginning of runs (random programs), at the moment they hit fitness 0, and at the end of 2000 iterations. The Kolmogorov complexity of the target phenotype is 3, thus no solution can be found with lower complexity. At the time of discovery, phenotypes are more complex than later


Acknowledgements The authors gratefully acknowledge the reviewers’ insightful comments which helped to improve the manuscript.

References

1. Banzhaf, W.: Genotype-phenotype-mapping and neutral variation: a case study in genetic programming. In: International Conference on Parallel Problem Solving from Nature, pp. 322–332. Springer (1994)
2. Banzhaf, W., Leier, A.: Evolution on neutral networks in genetic programming. In: Genetic Programming Theory and Practice III, pp. 207–221. Springer (2006)
3. Banzhaf, W., Nordin, P., Keller, R.E., Francone, F.: Genetic Programming: An Introduction. Morgan Kaufmann, San Francisco, CA (1998)
4. Brameier, M., Banzhaf, W.: Linear Genetic Programming. Springer (2007)
5. Dingle, K., Camargo, C., Louis, A.: Input-output maps are strongly biased towards simple outputs. Nat. Commun. 9, 761 (2018)
6. Dingle, K., Valle Perez, G., Louis, A.: Generic predictions of output probability based on complexities of inputs and outputs. Sci. Rep. 10, 4415 (2020)
7. Hu, T., Banzhaf, W.: Neutrality and variability: two sides of evolvability in linear genetic programming. In: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pp. 963–970 (2009)
8. Hu, T., Banzhaf, W., Moore, J.H.: The effect of recombination on phenotypic exploration and robustness in evolution. Artif. Life 20(4), 457–470 (2014)
9. Hu, T., Ochoa, G., Banzhaf, W.: Phenotype search trajectory networks for linear genetic programming. In: Genetic Programming: 26th European Conference, EuroGP 2023, Held as Part of EvoStar 2023, Brno, Czech Republic, April 12–14, 2023, Proceedings, pp. 52–67. Springer (2023)
10. Hu, T., Payne, J.L., Banzhaf, W., Moore, J.H.: Robustness, evolvability, and accessibility in linear genetic programming. In: European Conference on Genetic Programming, pp. 13–24. Springer (2011)
11. Hu, T., Payne, J.L., Banzhaf, W., Moore, J.H.: Evolutionary dynamics on multiple scales: a quantitative analysis of the interplay between genotype, phenotype, and fitness in linear genetic programming. Gen. Program. Evol. Mach. 13, 305–337 (2012)
12. Kimura, M.: The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK (1983)
13. Koza, J.R.: Genetic Programming. MIT Press, Cambridge, MA (1992)
14. Langdon, W.B., Poli, R.: Foundations of Genetic Programming. Springer (2002)
15. Lehman, J., et al.: The surprising creativity of digital evolution: a collection of anecdotes from the evolutionary computation and artificial life research communities. Artif. Life 26, 274–306 (2020)
16. Miller, J.F.: Cartesian genetic programming: its status and future. Gen. Program. Evol. Mach. 21, 129–168 (2020)
17. Ochoa, G., Malan, K.M., Blum, C.: Search trajectory networks of population-based algorithms in continuous spaces. In: European Conference on Applications of Evolutionary Computation, EvoApps, pp. 70–85. Springer International Publishing, Cham (2020)
18. Ochoa, G., Malan, K.M., Blum, C.: Search trajectory networks: a tool for analysing and visualising the behaviour of metaheuristics. Appl. Soft Comput. 109, 107,492 (2021)
19. Reidys, C., Stadler, P., Schuster, P.: Generic properties of combinatory maps: neutral networks of RNA secondary structures. Bull. Math. Biol. 59, 339–397 (1997)
20. Sarti, S., Adair, J., Ochoa, G.: Neuroevolution trajectory networks of the behaviour space. In: European Conference on Applications of Evolutionary Computation, EvoApps, Lecture Notes in Computer Science, vol. 13224, pp. 685–703. Springer (2022). 10.1007/978-3-031-024627_43
21. Sarti, S., Adair, J., Ochoa, G.: Neuroevolution trajectory networks of the behaviour space. In: European Conference on Applications of Evolutionary Computation, EvoApps, Lecture Notes in Computer Science, vol. 13224, pp. 685–703. Springer (2022)
22. Schuster, P., Fontana, W., Stadler, P.F., Hofacker, I.L.: From sequences to shapes and back: a case study in RNA secondary structures. In: Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 255(1344), pp. 279–284 (1994)
23. Vanneschi, L., Pirola, Y., Mauri, G., Tomassini, M., Collard, P., Verel, S.: A study of the neutrality of Boolean function landscapes in genetic programming. Theor. Comput. Sci. 425, 34–57 (2012)
24. Wigner, E.P.: The unreasonable effectiveness of mathematics in the natural sciences. Commun. Pure Appl. Math. 13, 1–14 (1960)
25. Wright, A.H., Laue, C.L.: Evolvability and complexity properties of the digital circuit genotype-phenotype map. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2021), pp. 840–848. ACM Press (2021)

Chapter 5

The Impact of Step Limits on Generalization and Stability in Software Synthesis

Nicholas Freitag McPhee and Richard Lussier

5.1 Introduction

Imagine that we’re assessing the generalization of a human-written program by running it on a set of test cases. If we were concerned about the possibility of infinite loops or long run times, we might set a time limit and terminate the program’s execution when that time limit is reached.¹ In doing this, there is always the risk that we set the time limit too low, stopping the program just before it would have successfully completed. Alternatively, we could use a time limit that is far too high, giving the program more time than it actually needs to solve the test cases at hand and consuming unnecessary computing resources. If only a few particularly “large” inputs caused the program to exceed the time limit, the program’s authors might argue that their program still generalizes, and that the time limit should be raised to a level that provides the program with a reasonable amount of time to solve the problematic test cases. On the other hand, we typically wouldn’t expect that providing extra time would create any problems for correctness; if the program could find the right answer in time $T$ then we’d expect that it would still generate the correct answer if given time $U > T$.

1 Programming competitions typically take this approach to testing submissions.

Supported by the Morris Academic Partners and the Undergraduate Research Opportunity Programs at the University of Minnesota Morris. N. Freitag McPhee (B) · R. Lussier University of Minnesota Morris, Morris, MN 56267, USA e-mail: [email protected] R. Lussier e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_5


Table 5.1 Subset of Table 3 from [2] showing the generalization rates for several benchmark software synthesis problems. “Num. generalizations” is the number of runs (out of 100) that (in the terminology used in this paper) generalized, i.e., had zero error on all the testing cases. “Prop. generalized” is the proportion of the successes (runs that had zero error on all training cases) that went on to generalize. All the generalization checks were performed using the same step limit used during the evolutionary process

Problem      Num. generalizations   Prop. generalized
Find Pair    4                      1.00
Fizz Buzz    25                     0.96
Fuel Cost    50                     1.00
GCD          8                      0.67

Because evolution is quite capable of finding things like infinite loops, most evolutionary software synthesis systems use some kind of execution limit to handle that possibility. There has often been an implicit assumption that if a program generalized, i.e., it had zero error on (often thousands of) previously unseen test cases, then the program was truly “general” in a sense that might be understood by a software developer. Conversely, it was implicitly assumed that if the program didn’t generalize, i.e., it had a non-zero error for at least one test case, then it wasn’t general. The results here make it clear that these assumptions don’t hold, at least not in a simple sense. In particular, we find that:

• There are evolved programs that don’t appear to generalize, but do generalize if given slightly higher step limits.
• There are evolved programs that appear to generalize, but which fail all the test cases when given slightly higher step limits.

The first of these isn’t terribly surprising; it’s not uncommon that programs need a little more time to terminate, especially when given new, larger inputs. It does mean, though, that we should be careful about how we interpret reported failures to generalize, as those may contain what are effectively false negatives, where logically correct programs fail to generalize because they are stopped prematurely.

The second observation is more problematic. We would typically assume that providing more resources (including a higher step limit) would at least not be a “bad thing”, and if a program generalized with a step limit of 800, it would still generalize with a step limit of 803, but our results show that this isn’t consistently the case. These then could potentially be considered false positives, where a program appears to generalize, but has logical or run-time properties that might not be considered “correct” or “acceptable” in human-generated code.
In combination, these observations make it hard to know how to interpret typical reports of generalization rates. Consider the subset of results from [2] in Table 5.1.² Here all of the successful solutions (out of 100 runs) evolved for Find Pair and Fuel Cost generalized, all but one of the solutions to Fizz Buzz generalized, and 2/3 of the solutions to GCD generalized.³ In each case, the reported generalization rate is computed using the same step limit used in the initial evolutionary processes. Our work here suggests, however, that we need to interpret these generalization numbers with care. For example, many of our evolved GCD solutions appear to generalize, but actually perform very poorly if given a slightly higher step limit.

Our goal in this chapter is to better understand the role of step limits and their impact on generalization. We take programs evolved with a fixed step limit (as is done in much software synthesis research) and run them with a range of step limits to see how their behavior is impacted by the step limit. This analysis highlights the existence of evolved programs that can generalize when given higher step limits than those used during evolution (see, e.g., Fig. 5.1), and evolved programs that jump from solving all the test cases to solving none of them with small increments to the step limit (see, e.g., Fig. 5.4). While the work here is done using PushGP [15, 16], the key ideas apply to any software synthesis system [13] that uses some kind of execution limits. This includes, for example, both grammatical evolution and grammar-guided genetic programming, although their execution limits are not always mentioned in published studies.⁴

In Sect. 5.2, we will go over the key features of the Push programming language needed for this research, including ways that step limits interact with looping. Next, in Sect. 5.3, we will outline our experimental procedure and the four test problems used for this research. In Sect. 5.4, we will present the results we found when running the evolved programs over a range of step limits, and then discuss our results in Sect. 5.5. Finally, we will suggest some potential future work in Sect. 5.6 and provide conclusions in Sect. 5.7.

2 Here the data from [2] is simply a representative example. We could have used similar data from numerous other publications.

5.2 Background

In this section, we’ll go over some of the key ideas used throughout this chapter. We’ll provide a brief overview of the Push language and its execution model and review how step limits are typically used to address concerns about infinite loops; for a more complete description see, e.g., [15, 16]. We’ll also review key concepts like success, generalization, and stability.

3 [2] uses the term “success” to mean a program that passes both the training and unseen testing cases, and doesn’t have a specific term for a program that passes the training cases irrespective of whether it generalizes to unseen cases. We’re limiting “success” to just mean that a program passes the training cases, and we say that a program “generalizes” if it also passes the unseen test cases.

4 There is no mention of execution limits in [7], for example, although examination of the code linked in the paper suggests that the loopBreakConst variable plays the same role as step limits here. [1] does mention that there is a limit, but doesn’t provide its value; again looking at the code suggests that the rec_counter variable is a kind of execution limit, although the details differ from PushGP step limits.


5.2.1 The Push Language and Interpreter

Push is a programming language designed specifically to support the evolution of programs in, e.g., PushGP. A Push interpreter contains a set of stacks, one for each type relevant to the problem at hand (e.g., a stack of integers, a stack of Booleans, etc.). All instructions then act on those stacks, popping their arguments from and pushing their results to the appropriately typed stacks. The instruction string-length, for example, will pop the top string from the string stack, compute its length, and then push the result onto the integer stack. Each execution of an instruction is termed an execution step. The instructions are held on a code stack, which can itself be manipulated by instructions in a variety of ways.

The execution of a Push program terminates when either (a) the code stack is empty (i.e., all the instructions have been executed) or (b) the step limit is reached. Push programs typically return one or more results of specific types; these are taken from the top of the appropriately typed stacks after execution has completed. Note that this can be done regardless of whether the program execution completed “normally” or because the step limit was exceeded.
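The execution model described above can be sketched with a toy stack interpreter. This is not a real Push implementation: it supports only two instructions, it assumes that every item taken from the code stack (literal or instruction) counts as one step, and names such as `run` and `INSTRUCTIONS` are made up for illustration.

```python
def string_length(stacks):
    # Pop the top string and push its length onto the integer stack.
    if stacks["string"]:
        stacks["integer"].append(len(stacks["string"].pop()))

def integer_add(stacks):
    # Instructions without enough arguments are treated as no-ops here.
    if len(stacks["integer"]) >= 2:
        b = stacks["integer"].pop()
        a = stacks["integer"].pop()
        stacks["integer"].append(a + b)

INSTRUCTIONS = {"string-length": string_length, "integer-add": integer_add}

def run(program, step_limit):
    stacks = {"integer": [], "string": [], "boolean": []}
    code = list(reversed(program))   # top of the code stack is the list end
    steps = 0
    # Stop when the code stack is empty or the step limit is reached.
    while code and steps < step_limit:
        item = code.pop()
        steps += 1
        if isinstance(item, bool):
            stacks["boolean"].append(item)
        elif isinstance(item, int):
            stacks["integer"].append(item)
        elif item in INSTRUCTIONS:
            INSTRUCTIONS[item](stacks)
        else:
            stacks["string"].append(item)   # any other string is a literal
    return stacks, steps

program = ["hello", "string-length", 3, "integer-add"]
full_stacks, full_steps = run(program, step_limit=100)
cut_stacks, cut_steps = run(program, step_limit=2)
```

In both cases a result can be read from the top of the integer stack: 8 after normal completion, and 5 when the run is cut off after two steps.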

5.2.2 Step Limits and Infinite Loops

One of the great strengths of Push is its support for a wide range of control flow mechanisms, allowing for the evolution of important constructs such as looping and recursion, along with perhaps less expected constructs such as loop unrolling. Most of these control structures work by manipulating blocks of instructions on the code stack.⁵ An evolved program can, for example, push multiple copies of a block of code onto the code stack as a way of repeating that behavior numerous times. With this, however, comes the serious risk of unwanted execution paths such as infinite loops.

It’s common practice to avoid infinite loops (or any excessively long program execution) through the use of a step limit, where the interpreter stops the execution of a Push program when the number of steps executed exceeds that limit. How, then, do we evaluate the performance (fitness) of a program that is terminated? A common approach in genetic programming would be to apply a harsh penalty value, possibly even removing that individual from the population. Harsh penalties, however, tend to be quite “discouraging” to evolutionary processes, which could cause them to avoid the “dangerous” regions of the search space altogether; see, e.g., work on the impact of size limits in tree-based genetic programming [9, 12]. If we have heavy penalties for exceeding the step limit then evolution might be discouraged from exploring the space of programs that use looping, as it’s easy for those to exceed the step limit and incur that penalty. We don’t want to “scare” evolution away from using looping constructs through penalties, however, since many interesting problems require loops.

5 See [17] for more on looping instructions, including examples.


As an alternative, we can take the answers from the appropriate stacks at any point in time, including when a program is terminated because it exceeds the step limit. This avoids harsh penalties, and in principle allows the evolutionary process to take advantage of what might be considered partial information (whatever is currently on the relevant stacks) when a program is terminated prematurely. This approach is how most PushGP research has been done.
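The two ways of scoring a terminated program can be contrasted with a small sketch. The "program" here is a hand-written stand-in that adds the numbers 1 to n with one addition per step; the penalty value and all names are arbitrary assumptions chosen for illustration.

```python
PENALTY = 1_000_000  # arbitrary harsh penalty, for illustration only

def sum_to_n(n, step_limit):
    # Toy looping program: one addition per execution step.
    total, steps = 0, 0
    for k in range(1, n + 1):
        if steps >= step_limit:
            return total, False      # terminated by the step limit
        total += k
        steps += 1
    return total, True               # completed normally

def error_with_penalty(n, target, step_limit):
    answer, finished = sum_to_n(n, step_limit)
    return abs(answer - target) if finished else PENALTY

def error_with_partial_answer(n, target, step_limit):
    # Take whatever answer is available when execution stops.
    answer, _ = sum_to_n(n, step_limit)
    return abs(answer - target)

penalized = error_with_penalty(10, 55, step_limit=8)
partial = error_with_partial_answer(10, 55, step_limit=8)
completed = error_with_partial_answer(10, 55, step_limit=10)
```

With a step limit of 8 the partial answer is 36, giving an error of 19 instead of the flat penalty, so a program that is "almost done" still receives a graded score.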

5.2.3 Success, Generalization, and Stability

The test problems used in this research are all software synthesis problems taken from two prominent GP software synthesis benchmark suites [3, 6]. These problems are essentially programming problems, where a set of input–output cases defines the behavior of the desired program, similar to unit tests. These input–output cases are then divided into training cases and test cases. During the evolutionary process, each evolved program is run on each of the training cases. The program is executed with each of the inputs in turn, and each output is compared to the target output specified in that input–output pair. Some kind of difference is computed between the target output and the actual output; this is called the error on that training case.

Two important concepts in this work are success and generalization, which address different aspects of how successful the evolutionary process was in solving a particular problem. We say that an evolved PushGP program succeeds if it passes (i.e., has zero error on) all the training cases used to score programs during evolution. There is a risk of overfitting, though. To account for this possibility, we check whether the program generalizes by running it on a set of previously unseen test cases. We say a program generalizes if it has zero error on each of the test cases.

Another important concept is stability; we define a stable program as one that is truly general in the sense that it generalizes for every step limit above a certain value. This also implies that, if given enough steps, the program would work for any sized input. On the other hand, an unstable program relies heavily on a particular step limit to function correctly. Small changes in the step limit can cause these programs to jump from generalizing to failing every test case. Although the program “works”, it does so only for a select few step limits, and there is no guarantee it would work for arbitrary inputs even if given additional steps.

To determine stability, there needs to be a way to observe how a program behaves over a range of step limits. In this research, this is done by running a “best” program (defined in Sect. 5.3) on every step limit from 0 to 1000 steps.
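The stability check described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function names are hypothetical, the input is assumed to be a list whose entry `s` records whether the program had zero error on every test case at step limit `s`, and "every step limit above a certain value" is necessarily approximated by "every step limit up to the end of the sweep". The third label, for programs that never generalize, is also an assumption added for completeness.

```python
from typing import Optional, Sequence


def stability_threshold(generalizes: Sequence[bool]) -> Optional[int]:
    """Return the smallest step limit T such that the program generalizes
    at T and at every larger step limit in the sweep, or None if the
    program fails at the largest step limit (no such T exists)."""
    threshold = None
    # Walk down from the largest step limit and stop at the first failure.
    for step_limit in range(len(generalizes) - 1, -1, -1):
        if not generalizes[step_limit]:
            break
        threshold = step_limit
    return threshold


def classify(generalizes: Sequence[bool]) -> str:
    """Label a sweep as stable, unstable, or never generalizing."""
    if stability_threshold(generalizes) is not None:
        return "stable"
    if any(generalizes):
        return "unstable"
    return "never generalizes"  # assumption: not a category from the text


# A program that starts generalizing at step limit 150 and never stops.
steady = [False] * 150 + [True] * 851
# A program that generalizes only at exactly 200 steps.
flaky = [False] * 1001
flaky[200] = True
print(classify(steady), stability_threshold(steady))
print(classify(flaky), stability_threshold(flaky))
```

Scanning from the top of the sweep keeps the check to a single pass and makes the finite-sweep assumption explicit: a program is only called stable if it still generalizes at the largest step limit tested.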


N. Freitag McPhee and R. Lussier

5.3 Methodology and Experimental Design

We tested the impact of varying step limits on four problems chosen to utilize several different input types and to lend themselves to a variety of looping constructs. Last Index of Zero comes from the original program synthesis benchmark (PSB) suite proposed by Helmuth et al. [6], while the rest are defined in the PSB2 suite of Helmuth et al. [2, 3].

To see what, if any, impact increasing the step limit had on the evolutionary behavior, we conducted 60 runs of each problem: 30 runs using a step limit of 200 and 30 additional runs using a step limit of 800. Once each run was completed, we took a “best performing” program (defined below) from each run and ran it on the test set for all step limits from 0 to 1000 to see how the behavior of a program might change with other step limits. Our definition of a “best performing” program was either a randomly chosen program that succeeded or, if no program succeeded in a run, a randomly chosen program with the lowest total error at the end of the run.

We then identified two important types of programs: those that generalized at the original step limit and those that generalized at some step limit other than the original one. Each program was then also classified as being either stable or unstable based on its behavior across different step limits. Stable programs had some step limit $T$ where the program generalized at step limit $T$ as well as at all step limits $U > T$. Unstable programs lacked this property, typically jumping between generalizing at some step limit and failing some (often many or even all) test cases when given slightly higher step limits.

In the remainder of this section, we’ll outline the four test problems used in this study. The full list of parameters can be seen in Table 5.2. The stacks and instruction sets used for each problem are identical to the ones used for these problems in the original program synthesis benchmark papers [2, 3, 6].

Last Index of Zero

In the Last Index of Zero (LIOZ) problem [6], the goal is to evolve a program that takes a vector of up to 50 integers containing at least one zero, and outputs the index (starting at 0) of the last zero in the vector. The output for

Table 5.2 Parameters used for every problem. All runs used Uniform Mutation with Addition and Deletion (UMAD) [5] (with UMAD rate 0.1) as the sole reproduction operator and lexicase selection [14] as the parent selection mechanism. This work uses Plushy genomes [10], which are converted to Push programs for evaluation

Parameter                 Value
Max generations           300
Population size           1000
Training set size         200
Testing set size          2000
Max initial plushy size   100
Step limits               200 and 800

5 The Impact of Step Limits on Generalization and Stability …


this problem comes from the integer stack. For example, given an input vector of ⟨0, 3, 4, 0, 2⟩, the expected output would be 3. The error for this problem is $|e - a|$, where $e$ is the expected output and $a$ is the actual output.
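As a concrete reading of the Last Index of Zero specification, the following Python sketch gives a reference answer and the per-case error. It is only an illustration of the target behavior and the error measure; the function names are hypothetical, and it is not a Push program nor part of the benchmark code.

```python
from typing import Sequence


def last_index_of_zero(values: Sequence[int]) -> int:
    """Reference answer: 0-based index of the last zero in `values`."""
    for index in range(len(values) - 1, -1, -1):
        if values[index] == 0:
            return index
    raise ValueError("input must contain at least one zero")


def lioz_error(expected: int, actual: int) -> int:
    """Absolute-difference error |e - a| for one input-output case."""
    return abs(expected - actual)


expected = last_index_of_zero([0, 3, 4, 0, 2])
print(expected, lioz_error(expected, actual=0))
```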

Fuel Cost

In the Fuel Cost (FC) problem [2, 3], the goal is to evolve a program which takes as input a vector of between 1 and 20 integers in the range $[6, 100000]$. For an input vector $[x_0, \ldots, x_{n-1}]$, the expected output is

$$\sum_{0 \le i < n} \left( \left\lfloor \frac{x_i}{3} \right\rfloor - 2 \right).$$

[...] (function) that represents the prediction of glucose ($\hat{G}(t)$). The function $\hat{G}(t)$ can be expressed as the sum of the current value of glucose $G(t)$ and an expression <expr>.

2. This rule defines the expression <expr>. It provides multiple alternative ways to construct an expression:

6 Genetic Programming Techniques for Glucose Prediction …


Algorithm 6 Example of a GE algorithm for glucose prediction
procedure (grammar, dataset, features, objectives)
    Load Dataset
    Features selection
    Grammar elaboration
    pop = Generate population(len_pop)
    for i ← 1 to N_generations do
        for j ← 1 to len_pop do
            $\hat{G}(t_j)$ = decode(grammar, pop_j)
            Fitness calculation
        end for
        pop = Selection(dataset, pop, fitness, objectives)
        pop = Crossover(pop)
        pop = Mutation(pop)
    end for
end procedure

Fig. 6.2 Example of a grammar for glucose prediction

• <expr> <op> <expr>: It allows combining two expressions with an operator <op>.
• <cte> <op> <var> <op> <expr>: It allows combining a constant <cte>, a variable <var>, and another expression with operator <op>.
• <var>: It represents a single variable, while (−<var>) represents the negation of a variable.
• pow(<var>, <sign><exponent>): It represents raising a variable to the power of <exponent>, with an optional sign <sign>, while (−pow(<var>, <sign><exponent>)) represents the negation of the result obtained from raising a variable to the power of <exponent>, with an optional sign <sign>.

3. This rule defines the possible variables <var> that can be used in an expression. The variables include:


J. I. Hidalgo et al.

• $BI$: represents the variable basal insulin, which is designed to maintain a steady level of insulin in the bloodstream to counteract the liver’s glucose release and meet the body’s basic insulin needs when not consuming food.
• $IB$: represents the variable insulin bolus. An insulin bolus is a specific dose of insulin administered (injected) to cover the immediate rise in blood glucose levels resulting from a meal or snack.
• $F_{ch}$: represents the absorption kinetics of carbohydrates in the human body following the Berger [28] or the Bateman model [29]. Both functions serve as a simplification and approximation of a very complex biological process.
• $HR$: represents the heart rate.
• $C$, $S$: represent other variables such as calories burned, sleep quality, etc.

4. This rule defines the possible operators <op> that can be used in an expression.
5. This rule defines a constant <cte> as a combination of a base <base> multiplied by 10 raised to the power of <sign><exponent>. It allows for the representation of numeric constants in the grammar.
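To make the role of such a grammar concrete, the sketch below decodes an integer genotype into an expression using the classic GE mapping (each codon, taken modulo the number of alternatives, selects a production for the leftmost non-terminal). The grammar is a hypothetical stand-in modelled on the rules described above, not a copy of Fig. 6.2: the start symbol name `<func>`, the operator set, the digit ranges, and the choice to consume a codon only when a rule has more than one alternative are all assumptions.

```python
from typing import Dict, List, Optional

Grammar = Dict[str, List[List[str]]]

# Hypothetical grammar modelled on the five rules described in the text.
GRAMMAR: Grammar = {
    "<func>": [["G + ", "<expr>"]],
    "<expr>": [
        ["(", "<expr>", " ", "<op>", " ", "<expr>", ")"],
        ["(", "<cte>", " ", "<op>", " ", "<var>", " ", "<op>", " ", "<expr>", ")"],
        ["<var>"],
        ["(-", "<var>", ")"],
        ["pow(", "<var>", ", ", "<sign>", "<exponent>", ")"],
    ],
    "<var>": [["BI"], ["IB"], ["Fch"], ["HR"]],
    "<op>": [["+"], ["-"], ["*"], ["/"]],
    "<cte>": [["<base>", "e", "<sign>", "<exponent>"]],
    "<base>": [[str(d)] for d in range(1, 10)],
    "<sign>": [["+"], ["-"]],
    "<exponent>": [[str(d)] for d in range(0, 4)],
}


def decode(genotype: List[int], grammar: Grammar, start: str = "<func>",
           max_wraps: int = 2) -> Optional[str]:
    """Map a list of integer codons to an expression string.

    Returns None when the codon budget (genotype length times the number
    of allowed passes) is exhausted before all non-terminals are expanded.
    """
    pending = [start]            # stack; top = leftmost unexpanded symbol
    output: List[str] = []
    used = 0                     # codons consumed so far
    budget = len(genotype) * (max_wraps + 1)
    while pending:
        symbol = pending.pop()
        if symbol not in grammar:            # terminal: emit as-is
            output.append(symbol)
            continue
        productions = grammar[symbol]
        if len(productions) == 1:            # no choice, no codon consumed
            choice = productions[0]
        else:
            if used >= budget:
                return None                  # invalid individual
            codon = genotype[used % len(genotype)]
            choice = productions[codon % len(productions)]
            used += 1
        pending.extend(reversed(choice))
    return "".join(output)


def evaluate(expression: str, variables: Dict[str, float]) -> float:
    """Evaluate a decoded expression; only `pow` and the variables are visible."""
    return eval(expression, {"__builtins__": {}, "pow": pow}, dict(variables))


phenotype = decode([0, 2, 3, 2, 2, 1], GRAMMAR)
print(phenotype)
print(evaluate(phenotype, {"G": 120.0, "BI": 1.0, "IB": 0.5, "Fch": 0.0, "HR": 70.0}))
```

In a full GE loop, the value returned by `evaluate` over a time series would feed the fitness calculation; a real implementation would also need to guard against division by zero and other numerical failures in evolved expressions.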

6.3.2 Recent Techniques for Glucose Prediction Based on Grammatical Evolution

GE [30] was developed to overcome the limitations of traditional GP in handling variable-length representations. The key idea of GE is to use a grammar to define the syntax of the programs being evolved, and a mapping process to translate the genotype (i.e., the genetic code) into a phenotype (i.e., the program). GE has been applied to the problem of blood glucose prediction in several works, with [31] being the seed of this line of research (Fig. 6.3A) [32]. The approach was evaluated using data from the OhioT1DM dataset [33] containing glucose and physiological data from 43 diabetic patients. These data are used as input for the grammatical evolution algorithm, which evolves a population of mathematical equations over several generations. The algorithm optimizes the equations based on their ability to fit the collected data and make accurate predictions of blood glucose levels.

One of the main problems encountered in recent years has been the limited availability of actual clinical data for developing and validating predictive models. This presents a significant challenge, as many studies rely on datasets not derived from clinical settings, raising concerns about their applicability to real-world scenarios. To address this issue, a novel approach was proposed in [34] that uses a Markov chain-based data enrichment method for simulating realistic glucose profiles (see Fig. 6.3B). This method considers the individual variability in glucose patterns, the temporal correlation of glucose values, and the impact of insulin and food intake on glucose levels. Subsequently, random grammatical evolution is applied to learn the underlying dynamics of the simulated glucose profiles. Finally, an ensemble method based on “bagging” combines multiple random grammatical evolution models, resulting in a more accurate and robust glucose forecasting model.
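The bagging step can be illustrated independently of GE. The sketch below trains several models on bootstrap resamples of the data and averages their predictions. It is a generic illustration under stated assumptions: the base learner `fit_line` is a least-squares line used purely as a stand-in for a GE run, and all names and the toy data are hypothetical rather than taken from [34].

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]          # (feature, target) pair
Model = Callable[[float], float]


def fit_line(data: Sequence[Sample]) -> Model:
    """Stand-in base learner: ordinary least-squares line y = a*x + b."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x_bar, y_bar = mean(xs), mean(ys)
    var_x = sum((x - x_bar) ** 2 for x in xs)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    slope = 0.0 if var_x == 0 else cov_xy / var_x
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def bagging(data: Sequence[Sample], fit: Callable[[Sequence[Sample]], Model],
            n_models: int, rng: random.Random) -> List[Model]:
    """Train n_models models, each on a bootstrap resample of the data."""
    return [fit([rng.choice(data) for _ in data]) for _ in range(n_models)]


def predict(models: Sequence[Model], x: float) -> float:
    """Ensemble prediction: the mean of the individual predictions."""
    return mean(model(x) for model in models)


# Toy data: a noisy linear trend (made-up numbers, not patient data).
rng = random.Random(0)
data = [(float(x), 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)) for x in range(30)]
ensemble = bagging(data, fit_line, n_models=25, rng=rng)
print(predict(ensemble, 40.0))
```

Averaging is the simplest way to combine regressors; the cited work may combine its models differently.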


Fig. 6.3 Grammatical evolution for glucose prediction

In boxes C and D of Fig. 6.3, we present a line of research that incorporates the Parkes and Clarke error grids as a means of measuring the safety of predictions and their possible application to actual clinical treatment. Contador et al. [35] employ a multi-objective optimization framework to simultaneously optimize two objectives: prediction accuracy and safety. The aim of the research is to analyze the performance of this approach in different scenarios, namely the “agnostic” scenario, where no future information about the patient is available, and the “what-if” scenario, where hypothetical changes in insulin or carbohydrate intake are considered. In the other study [36], the researchers propose a two-step process for glucose forecasting. In the first step, they extract latent glucose variability features from historical glucose data. These features capture the underlying patterns and dynamics of glucose fluctuations. GE is then employed to model the relationship between these latent features and future glucose values.

Structured Grammatical Evolution (SGE) overcomes some of the limitations of GE by using a different genotype-phenotype mapping process. In SGE, each gene in the genotype represents a non-terminal symbol of the grammar, and the mapping process consists of constructing a parse tree from the genotype, which represents the evolved program. This way, SGE ensures a one-to-one correspondence between genes and non-terminal symbols, increasing the locality of the genetic operators and reducing the genotype redundancy. Additionally, SGE uses a structured grammar, which allows the definition of substructures or modules that can be reused or combined in different ways, making the search space more modular and reducing the search effort. The paper that proposed this technique [37] validated it by applying it to the problem of blood glucose prediction and served as a starting point for a new line of research. Within this line we can cite the work of Parra et al. [38]. In this study (Fig. 6.3F), the authors incorporate, alongside SGE, some of the ideas proposed in SINDy [39, 40]. SINDy is a data-driven approach used to discover governing equations that


describe the dynamics of a system. The synergy between the two techniques allows for the discovery of compact and interpretable models that capture the essential relationships between glucose and relevant input variables. By reducing the complexity of the models and focusing on the most influential factors, the researchers aim to improve the interpretability and generalization of the solutions. The evolved difference equations have the potential to enhance glucose prediction models, making them useful for understanding and interpreting the relationships between glucose and other variables.

We can also mention [41] (Fig. 6.3G), which focuses on predicting hypoglycemia episodes. Hypoglycemia refers to a condition in which the blood glucose concentration drops below normal levels, typically below 70 mg/dL (3.9 mmol/L). This condition can occur when there is an imbalance between insulin dosage, food intake, and physical activity. As we have already stated, correct blood glucose management can be challenging, and hypoglycemia is one of the potential risks associated with tight glycemic control. Hypoglycemia can have various symptoms, including sweating, shakiness, confusion, dizziness, irritability, and in severe cases, loss of consciousness or seizures. These symptoms not only cause discomfort but can also impair cognitive function and physical coordination, leading to accidents and injuries. Recurrent episodes of hypoglycemia can also lead to a condition known as hypoglycemia unawareness, where individuals no longer experience the warning signs of low blood glucose, making it even more challenging to manage the condition effectively.

The consequences of hypoglycemia extend beyond immediate symptoms. Prolonged or recurrent episodes of low blood glucose can have both short-term and long-term implications for individuals with diabetes. Acutely, hypoglycemia can disrupt daily activities, impair quality of life, and result in hospitalization. In the long term, it can lead to adverse health outcomes, including cardiovascular events, cognitive impairment, and reduced life expectancy. These problems indicate the importance of predicting hypoglycemic events early so that the individual can react in time and take the necessary countermeasures.

Cruz et al. [41] demonstrate the application of Structured Grammatical Evolution to the prediction of short-term hypoglycemia events. The input features used in the models include physiological variables measured in the two hours preceding the prediction time. These variables consist of glucose levels obtained from a Continuous Glucose Monitoring System (CGM) and additional information such as heart rate, number of steps, and calories burned, which are collected using a smartwatch. The paper focuses on using Structured Grammatical Evolution (SGE) to train classification models for predicting the future state of individuals in terms of hypoglycemia or non-hypoglycemia within a short-term horizon (30, 60, 90, and 120 min). One of the significant advantages of the models generated through SGE is their high interpretability, as they consist of if-else statements that classify the data based on the input variables from each patient.
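To show what such an interpretable classifier might look like, the sketch below pairs the 70 mg/dL labeling threshold quoted above with a purely hypothetical if-else rule of the kind an SGE run could produce. The rule, its inputs, and every numeric cut-off inside `example_rule` are invented for illustration and do not come from [41]; the unit conversion assumes a glucose molar mass of 180.16 g/mol.

```python
HYPO_THRESHOLD_MG_DL = 70.0   # hypoglycemia threshold quoted in the text
MG_DL_PER_MMOL_L = 18.016     # from the molar mass of glucose, 180.16 g/mol


def mg_dl_to_mmol_l(glucose_mg_dl: float) -> float:
    """Convert a glucose concentration from mg/dL to mmol/L."""
    return glucose_mg_dl / MG_DL_PER_MMOL_L


def is_hypoglycemia(glucose_mg_dl: float) -> bool:
    """Ground-truth label for a glucose reading at the prediction horizon."""
    return glucose_mg_dl < HYPO_THRESHOLD_MG_DL


def example_rule(glucose_now: float, glucose_30_min_ago: float,
                 heart_rate: float) -> bool:
    """Hypothetical evolved-style rule: predict hypoglycemia at the horizon.

    All cut-offs below are made up; a real rule would be evolved per
    patient from CGM and smartwatch data.
    """
    slope = (glucose_now - glucose_30_min_ago) / 30.0   # mg/dL per minute
    if glucose_now < 90.0:
        if slope < -0.5:
            return True
        else:
            return heart_rate > 100.0
    else:
        return slope < -1.5


print(round(mg_dl_to_mmol_l(HYPO_THRESHOLD_MG_DL), 1))
print(example_rule(glucose_now=85.0, glucose_30_min_ago=105.0, heart_rate=70.0))
```

Because the model is a short chain of comparisons on named inputs, a clinician can read off exactly which conditions trigger an alert, which is the interpretability advantage the text refers to.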


6.4 Proposed Framework for Glucose Control

6.4.1 Framework Description

The box diagram in Fig. 6.4 depicts the main steps of our proposed framework for glucose prognosis in people with diabetes using various evolutionary techniques, with a primary focus on SGE. The diagram visually represents the sequential flow of the framework’s components and their interconnections.

The first step, labeled “Data Gathering,” involves collecting relevant data on glucose levels. This data serves as the foundation for subsequent stages of the framework. This step is built upon glUCModel [42, 43], a specialized web application designed for chronic disease management, primarily focusing on diabetes, which has been in operation for several years. glUCModel is a comprehensive platform that empowers patients and doctors to manage chronic illnesses more effectively. Patients can easily upload their personal and medical data, which are stored centrally in a secure database accessible to healthcare professionals. The platform offers seamless communication between patients and physicians while ensuring compliance with data protection regulations. glUCModel has been developed as a user-friendly web application, accessible from any Internet-connected device, including desktop computers, tablets, and mobile phones. This versatility enables patients to conveniently

Fig. 6.4 Box diagram: glucose prediction process using GE


engage with the platform, promoting regular monitoring and better adherence to treatment plans. The next step is “Data Augmentation,” which aims to enhance the available dataset by incorporating additional information or generating synthetic data points. This process helps expand the data’s variety and diversity, leading to better model training and performance. Simultaneously, the framework incorporates “Latent Variability Features” from the collected data. These features capture hidden patterns, trends, or variations within the dataset. By extracting and combining these latent features, the framework aims to improve the accuracy and robustness of the subsequent steps. The “Scenario Clustering” step involves grouping similar data instances or scenarios based on their characteristics. This clustering process helps identify distinct patterns or subgroups within the dataset, enabling a more targeted and practical analysis. Following the clustering step, the framework trains a model (“SGE training”), using SGE, tailored to the specific characteristics of the identified clusters. By training SGE models on each cluster, the framework can capture the unique characteristics and dynamics of groups of different glucose time series. The “Interpretable Personal Model” step involves developing personalized models for individual patients. These models are designed to be interpretable, meaning that healthcare professionals or patients can easily understand and interpret their inner workings. The aim is to create transparent models that provide insights into the glucose prediction process for better understanding and decision-making. Simultaneously, the framework employs “Classification Rules” derived from the SGE models. These rules help classify glucose levels into categories or states, identifying abnormal glucose patterns or potential risks. 
The “Glucose Prediction” step utilizes personalized models and classification rules to predict future glucose levels for individual patients. This prediction process considers the patient’s specific characteristics and patterns from the training phases. The framework also includes a “Hypoglycemia Alert” component, which uses the predicted glucose levels to detect and signal the occurrence of hypoglycemia (low blood sugar) promptly. This alert mechanism enables proactive interventions or treatments to prevent hypoglycemia-related complications. Finally, the “Glucose Control” step integrates predictions, classification rules, and patient-specific models to guide and optimize the management of glucose levels. This control component aims to maintain blood sugar within a healthy range and mitigate the risks associated with diabetes.

6.4.2 Experimental Results

Figure 6.5 displays a sequence of four images, each depicting a distinct stage in the use of our GE-driven software tool: Grammar formulation (6.5a), Model


Fig. 6.5 Our software (Pancreas Model tool) for glucose prediction based on GE

training (6.5b), Model validation (6.5c), and log generation (6.5d). This tool is the basis of our work, on top of which we add the following layers.

In the publication by Hidalgo et al. [34], we presented a comprehensive comparison exploring diverse data augmentation techniques alongside ensemble strategies and four prediction time horizons (30, 60, 90, and 120 minutes). The introduced approach, referred to as Random-GE, adopts a methodology akin to that of the Random Forest technique. Within this framework, GE is employed to train numerous models using real and synthetic data. The designation “Random” signifies that the evolutionary process stops early, as soon as a specified fitness threshold is reached. This yields a multitude of models of modest individual quality which, when used collectively, provide highly confident predictions, mitigating the inherent data uncertainty. The method uses a truncated generation count, leading to short training times. Despite this brevity, an ensemble of more than 100 models yields robust predictions at low computational cost. In the paper, the precision and reliability of these predictions are evaluated through both Parkes and Clarke’s error grid analyses. Figure 6.6 illustrates the outcomes of applying the techniques mentioned above. In Fig. 6.6a, the earlier GE results are depicted, revealing a significant number of predictions falling into zones C, D, and E within Clarke’s error grid. It is important to note that predictions


Fig. 6.6 Clarke’s error grid before and after dataset enhancement. Prediction Horizon is 120 min

within these zones may result in incorrect and potentially severe treatments, with some even classified as highly severe (zone E). Subsequently, Fig. 6.6b presents the outcomes following the integration of the introduced dataset enhancements. There is a clear reduction in the number of data points within these hazardous zones.

In our study published as Contador et al. [44], we explored novel approaches to enhance dataset quality to facilitate improved model training using GP. Our investigation encompassed two distinct aspects. First, we integrated latent variables associated with glucose variability. In parallel, we examined diverse techniques for clustering the data. These experiments were, again, carried out across four prediction time horizons (30, 60, 90, and 120 minutes). The empirical findings reveal a consistent trend: whether applied jointly or separately, these two enhancements yielded higher predictive accuracy than the conventional GP reference method.

Our work by Cruz et al. [41] presents a comparison between personalized and general models, along with the use of structured GE and dynamic structured GE for the prediction of hypoglycemia events in a short-term temporal horizon. Upon analysis, certain patterns emerge from the data:

• In some cases, the dynamic approach demonstrates improved recall rates compared to the static approach, indicating the potential benefits of incorporating dynamic structures in the grammatical evolution process.
• Conversely, for specific patients, the static approach yields higher recall values, showcasing the significance of tailoring the algorithm to individual characteristics.

It is worth noting that the general models generally exhibit competitive performance across patients, demonstrating the algorithm’s ability to capture and generalize patterns relevant to hypoglycemia classification. The variations in recall rates


between hypoglycemic and non-hypoglycemic events underscore the importance of balancing sensitivity and specificity in the classification process. This work needs to be extended to longer time horizons.
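The two recall rates mentioned here correspond to sensitivity (recall on hypoglycemic events) and specificity (recall on non-hypoglycemic events). A minimal sketch of how they are computed from paired labels and predictions follows; the function name and the example labels are hypothetical and are not results from [41].

```python
from typing import Sequence, Tuple


def sensitivity_specificity(actual: Sequence[bool],
                            predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary hypoglycemia labels.

    True means "hypoglycemia". A rate is NaN when its class is absent.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Made-up labels: 3 hypoglycemic and 5 non-hypoglycemic events.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
print(sensitivity_specificity(actual, predicted))
```

Reporting both numbers matters because hypoglycemic events are typically much rarer than normal readings, so a classifier can score well on one class while missing most of the other.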

6.5 Conclusions

In conclusion, this article overviews the most recent techniques utilizing GE for glucose prediction. Furthermore, we propose a comprehensive framework for glucose monitoring in patients with diabetes, leveraging various evolutionary techniques primarily focused on SGE. By incorporating advancements in the field, our framework offers a systematic approach to glucose monitoring and management, addressing the specific needs of individuals with diabetes.

In general, the proposed framework integrates physiological data gathering from multiple sources, data augmentation, latent variability features, SGE models, personal models, glucose prediction, hypoglycemia alert, and glucose control to provide a holistic approach to glucose monitoring in patients with diabetes. It leverages evolutionary techniques, mainly SGE, to develop personalized and interpretable models while incorporating data-driven insights to improve patient care and management.

While the proposed framework for glucose control based on GE presents a promising approach, there are several avenues for future research and development. The following directions can enhance the effectiveness of the framework and broaden its impact:

• Refinement of latent variability features: The extraction and utilization of latent variability features in the framework can be enhanced. Research can focus on identifying and incorporating additional hidden patterns or factors that influence glucose levels but are not directly observable. This can involve exploring advanced statistical techniques, incorporating domain knowledge, or leveraging other data sources (such as lifestyle information) to capture a broader range of variables affecting glucose control.
• Advanced model interpretability: While the framework emphasizes the development of interpretable personal models, further research can focus on refining the interpretability of these models. Exploring novel techniques for model visualization, feature importance attribution, and decision rule extraction can enhance the transparency and understanding of the model’s inner workings. This will facilitate trust and acceptance by healthcare professionals and patients, enabling better clinical decision-making.
• Extension to other diabetes-related outcomes: Expanding the framework’s scope to include other diabetes-related outcomes beyond glucose control can be an interesting avenue for future work. This may involve predicting complications, optimizing medication regimens, or personalizing dietary recommendations based on individual glucose profiles. By considering a broader range of factors, the framework can provide a more holistic approach to diabetes management.


Acknowledgements This work is supported by Spanish Government MINECO grants PID2021-125549OB-I00 and PDC2022-133429-I00.

References

1. Sapra, A., Bhandari, P.: Diabetes. In: StatPearls. StatPearls Publishing, Treasure Island (FL) (2023)
2. Amorosa, L.F., Lee, E.J., Swee, D.E.: Chapter 34 - diabetes mellitus. In: Rakel, R.E., Rakel, D.P. (eds.) Textbook of Family Medicine, 8th edn., pp. 731–755. W.B. Saunders, Philadelphia (2011)
3. World Health Organization: Global report on diabetes. World Health Organization (2016)
4. American Diabetes Association: Diagnosis and classification of diabetes mellitus. Diabetes Care 33(Suppl 1), S62–S69 (2010)
5. Michalek, D.A., Onengut-Gumuscu, S., Repaske, D.R., Rich, S.S.: Precision medicine in type 1 diabetes. J. Indian Inst. Sci. 103(1), 335–351 (2023)
6. Castorani, V., Favalli, V., Rigamonti, A., Frontino, G., Di Tonno, R., Morotti, E., Sandullo, F., Scialabba, F., Arrigoni, F., Dionisi, B., Foglino, R., Morosini, C., Olivieri, G., Barera, G., Meschi, F., Bonfanti, R.: A comparative study using insulin pump therapy and continuous glucose monitoring in newly diagnosed very young children with type 1 diabetes: it is possible to bend the curve of HbA1c. Acta Diabetol. (Aug. 2023)
7. Deshpande, A.D., Harris-Hayes, M., Schootman, M.: Epidemiology of diabetes and diabetes-related complications. Phys. Ther. 88(11), 1254–1264 (2008)
8. Papatheodorou, K., Banach, M., Bekiari, E., Rizzo, M., Edmonds, M.: Complications of diabetes 2017. J. Diabetes Res. 2018, 3086167 (2018)
9. Cappon, G., Vettoretti, M., Sparacino, G., Facchinetti, A.: Continuous glucose monitoring sensors for diabetes management: a review of technologies and applications. Diabetes Metab. J. 43(4), 383–397 (2019)
10. Thomas, A., Thevis, M.: Chapter three - recent advances in the determination of insulins from biological fluids. In: Advances in Clinical Chemistry, vol. 93, pp. 115–167. Elsevier (2019)
11. Parkes, J.L., Slatin, S.L., Pardo, S., Ginsberg, B.H.: A new consensus error grid to evaluate the clinical significance of inaccuracies in the measurement of blood glucose. Diabetes Care 23(8), 1143–1148 (2000)
12. Pfützner, A., Klonoff, D.C., Pardo, S., Parkes, J.L.: Technical aspects of the Parkes error grid. J. Diabetes Sci. Technol. 7(5), 1275–1281 (2013)
13. Clarke, W.L., Cox, D., Gonder-Frederick, L.A., Carter, W., Pohl, S.L.: Evaluating clinical accuracy of systems for self-monitoring of blood glucose. Diabetes Care 10(5), 622–628 (1987)
14. Cobelli, C., Mari, A.: Validation of mathematical models of complex endocrine-metabolic systems. A case study on a model of glucose regulation. Med. Biol. Eng. Comput. 21(4), 390–399 (1983)
15. Dalla Man, C., Rizza, R.A., Cobelli, C.: Meal simulation model of the glucose-insulin system. IEEE Trans. Biomed. Eng. 54(10), 1740–1749 (2007)
16. Dalla Man, C., Rizza, R.A., Cobelli, C.: Mixed meal simulation model of glucose-insulin system. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2006, 307–310 (2006)
17. Lobo, B., Farhy, L., Shafiei, M., Kovatchev, B.: A data-driven approach to classifying daily continuous glucose monitoring (CGM) time series. IEEE Trans. Biomed. Eng. 69(2), 654–665 (2022)
18. Felizardo, V., Garcia, N.M., Pombo, N., Megdiche, I.: Data-based algorithms and models using diabetics real data for blood glucose and hypoglycaemia prediction - a systematic literature review. Artif. Intell. Med., 102120 (2021)
19. Zecchin, C., Facchinetti, A., Sparacino, G., De Nicolao, G., Cobelli, C.: Neural network incorporating meal information improves accuracy of short-time prediction of glucose concentration. IEEE Trans. Biomed. Eng. 59(6), 1550–1560 (2012)
20. Hamdi, T., Ben Ali, J., Fnaiech, N., Di Costanzo, V., Fnaiech, F., Moreau, E., Ginoux, J.-M.: Artificial neural network for blood glucose level prediction. In: 2017 International Conference on Smart, Monitored and Controlled Cities (SM2C), pp. 91–95 (2017)
21. Tena, F., Garnica, O., Lanchares, J., Hidalgo, J.I.: Ensemble models of cutting-edge deep neural networks for blood glucose prediction in patients with diabetes. Sensors 21(21) (2021)
22. Tena, F., Garnica, O., Davila, J.L., Hidalgo, J.I.: An LSTM-based neural network wearable system for blood glucose prediction in people with diabetes. IEEE J. Biomed. Health Inform. (Aug. 2023)
23. Li, K., Daniels, J., Liu, C., Herrero, P., Georgiou, P.: Convolutional recurrent neural networks for glucose prediction. IEEE J. Biomed. Health Inform. 24(2), 603–613 (2019)
24. Zhu, T., Li, K., Herrero, P., Chen, J., Georgiou, P.: A deep learning algorithm for personalized blood glucose prediction. In: KHD@IJCAI, pp. 64–78 (2018)
25. Contreras, I., Oviedo, S., Vettoretti, M., Visentin, R., Vehí, J.: Personalized blood glucose prediction: a hybrid approach using grammatical evolution and physiological models. PLoS One 12(11), e0187754 (2017)
26. Liu, C., Vehí, J., Avari, P., Reddy, M., Oliver, N., Georgiou, P., Herrero, P.: Long-term glucose forecasting using a physiological model and deconvolution of the continuous glucose monitoring signal. Sensors (Basel) 19(19), 4338 (2019)
27. Hovorka, R., Canonico, V., Chassin, L.J., Haueter, U., Massi-Benedetti, M., Federici, M.O., Pieber, T.R., Schaller, H.C., Schaupp, L., Vering, T., Wilinska, M.E.: Nonlinear model predictive control of glucose concentration in subjects with type 1 diabetes. Physiol. Meas. 25(4), 905–920 (2004)
28. Berger, M., Cüppers, H., Hegner, H., Jörgens, V., Berchtold, P.: Absorption kinetics and biologic effects of subcutaneously injected insulin preparations. Diabetes Care 5(2), 77–91 (1982)
29. Garrett, E.R.: The Bateman function revisited: a critical reevaluation of the quantitative expressions to characterize concentrations in the one compartment body model as a function of time with first-order invasion and first-order elimination. J. Pharmacokinet. Biopharm. 22(2), 103–128 (1994)
30. O’Neill, M., Ryan, C.: Grammatical Evolution, pp. 33–47. Springer US, Boston, MA (2003)
31. Hidalgo, J.I., Colmenar, J.M., Risco-Martin, J.L., Cuesta-Infante, A., Maqueda, E., Botella, M., Rubio, J.A.: Modeling glycemia in humans by means of grammatical evolution. Appl. Soft Comput. 20, 40–53 (2014)
32. Hidalgo, J.I., Colmenar, J.M., Velasco, J.M., Kronberger, G., Winkler, S.M., Garnica, O., Lanchares, J.: Identification of models for glucose blood values in diabetics by grammatical evolution. Ch. 15, pp. 367–393. Springer International Publishing (2018)
33. Marling, C., Bunescu, R.: The OhioT1DM dataset for blood glucose level prediction: update 2020. CEUR Workshop Proc. 2675, 71–74 (2020)
34. Hidalgo, J.I., Botella, M., Velasco, J.M., Garnica, O., Cervigón, C., Martínez, R., Aramendi, A., Maqueda, E., Lanchares, J.: Glucose forecasting combining Markov chain based enrichment of data, random grammatical evolution and bagging. Appl. Soft Comput. J. 88 (2020)
35. Contador, S., Colmenar, J.M., Garnica, O., Velasco, J.M., Hidalgo, J.I.: Blood glucose prediction using multi-objective grammatical evolution: analysis of the “agnostic” and “what-if” scenarios. Genet. Program. Evol. Mach. 23(2), 161–192 (2022)
36. Contador, S., Velasco, J.M., Garnica, O., Hidalgo, J.I.: Glucose forecasting using genetic programming and latent glucose variability features. Appl. Soft Comput. 110, 107609 (2021)
37. Lourenço, N., Hidalgo, J.I., Colmenar, J.M., Garnica, O.: Structured grammatical evolution for glucose prediction in diabetic patients. In: GECCO 2019 - Proceedings of the 2019 Genetic and Evolutionary Computation Conference, pp. 1250–1257. Association for Computing Machinery, Inc (2019)
38. Parra, D., Joedicke, D., Gutiérrez, A., Velasco, J.M., Garnica, O., Colmenar, J.M., Hidalgo, J.I.: Obtaining difference equations for glucose prediction by structured grammatical evolution and sparse identification. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13789, pp. 189–196. Springer Science and Business Media Deutschland GmbH (2022)
39. Brunton, S.L., Proctor, J.L., Kutz, J.N.: Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Nat. Acad. Sci. 113(15), 3932–3937 (2016)
40. Brunton, S.L., Kutz, J.N.: Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, 2nd edn. Cambridge University Press (2022)
41. De La Cruz, M., Cervigon, C., Alvarado, J., Botella-Serrano, M., Hidalgo, J.: Evolving classification rules for predicting hypoglycemia events. In: 2022 IEEE Congress on Evolutionary Computation, CEC 2022 - Conference Proceedings. Institute of Electrical and Electronics Engineers Inc. (2022)
42. Hidalgo, J.I., Maqueda, E., Risco-Martín, J.L., Cuesta-Infante, A., Colmenar, J.M., Nobel, J.: glucmodel: a monitoring and modeling system for chronic diseases applied to diabetes. J. Biomed. Inf. 48, 183–192 (2014)
43. Hidalgo, I., Botella-Serrano, M., Lozano-Serrano, F., Maqueda, E., Lanchares, J., Martinez-Rodriguez, R., Aramendi, A., Garnica, O.: A web application for the identification of blood glucose patterns through continuous glucose monitoring and decision trees. In: Diabetes Technology & Therapeutics, vol. 22, pp. A64–A64. Mary Ann Liebert, Inc, 140 Huguenot Street, 3rd FL, New Rochelle, NY 10801 USA (2020)
44. Contador, S., Velasco, J.M., Garnica, O., Hidalgo, J.I.: Glucose forecasting using genetic programming and latent glucose variability features. Appl. Soft Comput. 110, 107609 (2021)

Chapter 7

Methods for Rich Phylogenetic Inference Over Distributed Sexual Populations

Matthew Andres Moreno

M. A. Moreno
University of Michigan, Ann Arbor, USA
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_7

7.1 Introduction

The structure of hereditary relatedness within an evolved population captures substantial aspects of its evolutionary history [5]. In the context of evolutionary computation (EC), such phylogenetic information can be diagnostic of systems’ runtime dynamics and thereby guide EC algorithm selection, tuning, and development. Some work has even gone so far as to apply phylogenetic information to mediate recombination [23], fitness estimation [13], and diversity maintenance schemes [2, 19].

Growth in parallel and distributed computing power benefits EC capability. Bigger population sizes, more sophisticated genetic representations, and more robust fitness evaluation will continue to come into reach. Very large-scale operation, however, will require renovation of methodologies poorly suited to parallel and distributed computing. Approaches relying on centralized control and complete system visibility, in particular, are expected to become increasingly inefficient and brittle.

Phylogenetic work in digital evolution systems has traditionally been accomplished through centralized tracking. Collecting and stitching together all parent–offspring relationships yields a perfect phylogenetic record. Even at the scale of a single processor, storing an entirely comprehensive phylogenetic record quickly becomes unwieldy. Fortunately, in asexual systems (where offspring have exactly one parent) extant lineages comprise only a minuscule fraction of all ancestors. So, pruning away extinct lineages greatly tempers memory use, even for long-lasting runs with large population sizes.

Sexual lineages (i.e., multiple parents per offspring) do not exhibit this lineage winnowing property. As such, application of the perfect tracking model becomes more challenging, although not entirely unheard of [3, 14, 15]. The APOGeT tool is notable in this domain. It applies a user-defined interbreeding compatibility measure to cluster together “species” on the fly, making use of the global system visibility to distill tractable summary data [9].

Under centralized tracking, extinction is fast and easy to detect through reference counting. Introducing a distributed computing model where lineages wind across networks of independent nodes, however, greatly complicates matters. Extinction notifications would have to wind back over all of a lineage’s node-to-node migrations. Data loss, whether due to hardware failures, dropped messages, or some other issue, exacerbates matters further still. Any lost extinction notification would entrench ancestors’ records, leaving a kind of zombie lineage. More worrisome, though, data loss could entirely disjoin components of the phylogenetic record, introducing profound uncertainty as to how large components relate. Unfortunately, at very large scales, hardware failures are expected to become near inevitabilities [10].

Although not traditionally performed in EC, phylogenetic analysis is possible without direct tracking. In fact, this is the de facto paradigm within biological science, where phylogenetic relationships are largely inferred through comparisons of extant traits. As it has become available, genetic sequence data has gained increasing prevalence as the basis for phylogenetic analysis. Unfortunately, phylogenetic reconstruction is notoriously difficult, demanding vast quantities of data, software, and computation, not to mention sufficient statistical and algorithmic efforts to precipitate an entire field of study.

Fortunately, EC can largely sidestep this plight. Malleability of the digital substrate invites explicit design of genetic components in order to facilitate straightforward phylogenetic inference from small amounts of data with minimal computational overhead.
This desideratum motivated the recent development of “hereditary stratigraphy,” a design for digital genetic material expeditious to phylogenetic inference [16]. Existing work with hereditary stratigraphy has been restricted exclusively to asexual lineages (i.e., exactly one parent per offspring). Given the essential role of sexual recombination operations (i.e., crossover) in EC, effort to address this limitation is of key significance. This work introduces techniques to apply hereditary stratigraphy methods to sexual lineages. Developed methods enable decentralized inference of (1) genealogical history, (2) population size fluctuations, and (3) selective sweeps. Such capabilities can enhance diagnostic telemetry that benefits application-oriented EC. Digital model systems for evolution research involving sexual dynamics may also benefit from enhanced observability. In both cases, the proposed approach affords greater scalability than previously possible. Given the difficulties in managing memory usage of sexual pedigree records, reconstruction-based analysis may even prove useful in the absence of multiprocessing.


7.2 Methods

After opening with a brief primer on hereditary stratigraphy and an overview of its application in the proposed “gene-” and “species-level” instrumentation (Sect. 7.2.1), discussion in this section turns to cover debuted inferential techniques and corresponding validation experiments. Three inferential techniques are described:

1. genealogical inference (Sect. 7.2.2),
2. effective population size inference (Sect. 7.2.3), and
3. positive selection inference (Sect. 7.2.4).

The first two rely on “species-level” annotation. The third, selection inference, applies “gene-level” instrumentation. Section 7.2.5 closes with information on software implementation and supplementary materials.

7.2.1 Genome Instrumentation

This section reviews hereditary stratigraphy as originally developed for inference over asexual populations, then introduces gene- and species-level hereditary stratigraphy instrumentation strategies designed for sexual populations. Subsequent discussion covers recombination and “gene drive” mechanisms employed for species-level instrumentation.

Hereditary Stratigraphy. Proposed methods for genealogical and evolutionary inference over distributed sexual populations draw on “hereditary stratigraphy,” originally developed to facilitate phylogenetic inference over asexual populations [17]. The core mechanism of this technique is a generation-on-generation accumulation of randomized “fingerprint” packets. Offspring inherit parents’ fingerprints and append a new entry. This process repeats with the next generation. Each accumulated fingerprint ultimately serves as a kind of lineage checkpoint. Because fingerprints are faithfully inherited each generation after their creation, discrepancy between coincident fingerprints held by two population members implies divergent ancestry at their shared time point. So, the last common ancestry necessarily precedes the first mismatching fingerprints.

Fingerprint values convey no functional phenotypic information. Rather, they should be considered simply as neutral ornamentation affixed to an underlying functional genome to instrument it. Because they tag along with genomes across replication events, a system’s phylogenetic history can be reconstructed by proxy based on these instruments.

Some attention must be paid to memory footprint. Proceeding naively, fingerprint accumulation with each generation would hopelessly bloat annotation size. Fortunately, the underlying inferential mechanism supports thinning out of fingerprints. Pruned-away fingerprints introduce uncertainty into the estimation of two records’ divergence generation. If every other fingerprint was pruned, for example, divergence generations would only be estimable to within two generations. In this way, the choice of which fingerprints to keep, and when, directly governs the underlying trade-off between memory use and inferential power. Although beyond the present scope, significantly more can be said about fingerprint curation. For details, refer to [16].

To simplify experimental setup and analysis, experiments reported here do not incorporate fingerprint pruning. However, all inference mechanisms introduced here are, in principle, compatible with fingerprint pruning. Further work remains to directly investigate how pruning affects presented approaches’ characterization of evolutionary history and dynamics. Analogous work in asexual populations should provide some initial indications in this direction [18].

This work uses 64-bit fingerprints, which collide spuriously with negligible probability $2^{-64} \approx 5 \times 10^{-20}$. At population size 100 over 200 generations, as in the first sets of reported experiments, the probability of any collision is minuscule: $< 2 \times 10^{-15}$. At population size 200 over 400 generations, as in the last set of reported experiments, the probability of any collision is also minuscule: $< 5 \times 10^{-15}$.

Sexual Instrumentation Schemes. As originally devised for asexual populations, hereditary stratigraph annotations assign one-to-one with genomes. Here, we explore two alternate schemes designed for instrumentation of sexual populations: “gene-level” and “species-level” instrumentation. Figure 7.1 compares these two schemes.

Fig. 7.1 Proposed “species-level” and “gene-level” instrumentation. Each organism has a single hereditary stratigraph attached as species-level instrumentation. Gene-level instrumentation associates instrumentation with individual genes, to be used for gene tree reconstructions

Gene-level instrumentation views individual genes simply as asexual atoms and instruments them individually. Reconstructions, therefore, operate along the lines of “gene tree” analyses in traditional phylogenetics [1]. In some cases, it may make sense to instrument every gene independently. Other applications may warrant instrumenting only a subset of genes or introducing instrumented “dummy” genes.

Species-level instrumentation associates one instrument per genome but these instruments adopt a consensus value within species populations.
Consensus arises through a “gene drive” mechanism (described below) that forces a single fingerprint value to sweep each fingerprint layer within interbreeding subpopulations. Species-level instrumentation powers genealogical inference (Sect. 7.2.2) and population size inference (Sect. 7.2.3). Gene-level hereditary stratigraph instrumentation powers positive selection inference (Sect. 7.2.4).

“Gene-Drive” Recombination Mechanism. Species-level tracking relies on consensus among fingerprint values within reproductively isolated subpopulations. Drift will ultimately yield consensus, but expected coalescence times can be long for large populations. A simple inheritance rule reaches consensus faster: inherit the larger of parents’ fingerprint values. Global maximum fingerprint values will spread rapidly and fix. Figure 7.2 depicts this mechanism.

Fig. 7.2 Gene drive mechanism for species-level instrumentation. The larger of parents’ differentia values at each layer is inherited. The largest value generated among layer 0 differentia ($a$) spreads from one member at generation 0 to all four by generation 2. This mechanism applies to “species-level” instrumentation (Fig. 7.1)

Asynchronous generations slightly complicate the picture. When individuals from different generations recombine, one parent will have fingerprint layers absent in the other. How should recombination proceed in this case? One possibility would be to simply “fast forward” the younger instrument to match the generational depth of the elder. However, as with fingerprint pruning above, full consideration remains for future work. All reported experiments use synchronous generations.

The gene-drive-based recombination mechanism described in this section is applied exclusively to species-level instrumentation. (Gene-level instruments, although tagging along with genes shuffled through genome-level recombination, did not themselves recombine.)

Fingerprint Collision Probability. Under gene-drive recombination, fixed fingerprints skew large due to the gene-drive criterion. This skew increases the probability of fingerprint collision between two populations, which would cause spurious detection of shared ancestry. We will assess the extent of this issue by computing threshold population sizes where collision probability becomes substantive. Suppose independent populations of size $a$ and $b$. The largest fingerprint in each population will drive to fixation. If each population member’s gene is drawn from a uniform distribution on integers $[0, u)$, then the probability of collision between the populations’ fixed genes can be derived as

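$$
\frac{a}{a+b}\sum_{n=1}^{a+b-1}
\frac{\binom{a+b}{n+1}\,u^{-n-1}\left(\frac{u-1}{u}\right)^{a+b-n-1}}{1-\left(\frac{u-1}{u}\right)^{a+b-1}}
\left(1-\frac{\binom{a-1}{n}}{\binom{a+b-1}{n}}\right)
+\frac{b}{a+b}\sum_{n=1}^{a+b-1}
\frac{\binom{a+b}{n+1}\,u^{-n-1}\left(\frac{u-1}{u}\right)^{a+b-n-1}}{1-\left(\frac{u-1}{u}\right)^{a+b-1}}
\left(1-\frac{\binom{b-1}{n}}{\binom{a+b-1}{n}}\right).
$$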


For 32-bit fingerprints ($u = 2^{32}$), collision occurs with $p < 0.5$ ($p = 0.46$) for populations of size $a = b = 2^{32}$. Collision occurs with $p < 0.01$ for populations of size $a = b = 2^{26}$. So, 32-bit fingerprints can differentiate species pairs of around $6.7 \times 10^{7}$ members each with reasonable consistency (see footnote 1).
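The quoted thresholds can be sanity-checked without the closed-form expression. The sketch below is not the chapter's derivation: it computes the same quantity, the probability that the maxima of two independent samples coincide, by direct summation for a small $u$, cross-checks that sum by brute-force enumeration, and then uses a large-$u$ approximation (an assumption introduced here, treating the chance that every draw falls at least $j$ values below the top as $e^{-aj/u}$) to revisit the 32-bit figures. All function names are hypothetical.

```python
import math
from fractions import Fraction
from itertools import product


def collision_exact(a, b, u):
    """Exact P(max of a draws == max of b draws), draws uniform on 0..u-1."""
    total = Fraction(0)
    for v in range(u):
        # P(max == v) = P(all draws <= v) - P(all draws <= v - 1)
        p_a = Fraction(v + 1, u) ** a - Fraction(v, u) ** a
        p_b = Fraction(v + 1, u) ** b - Fraction(v, u) ** b
        total += p_a * p_b
    return total


def collision_bruteforce(a, b, u):
    """Same probability by enumerating every combination of draws (tiny u only)."""
    hits = sum(
        max(draws[:a]) == max(draws[a:])
        for draws in product(range(u), repeat=a + b)
    )
    return Fraction(hits, u ** (a + b))


def collision_large_u(a, b, u):
    """Large-u approximation: sum of a geometric series over candidate maxima."""
    hit_a = -math.expm1(-a / u)             # 1 - exp(-a/u)
    hit_b = -math.expm1(-b / u)             # 1 - exp(-b/u)
    hit_either = -math.expm1(-(a + b) / u)  # 1 - exp(-(a+b)/u)
    return hit_a * hit_b / hit_either


print(collision_exact(2, 3, 4), collision_bruteforce(2, 3, 4))
print(collision_large_u(2**32, 2**32, 2**32))  # compare with p = 0.46
print(collision_large_u(2**26, 2**26, 2**32))  # compare with p < 0.01
```

Under this approximation, equal population sizes give roughly $q/(2-q)$ with $q = 1 - e^{-a/u}$, which lands close to the two values reported above.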

7.2.2 Genealogical Inference

This section describes evolutionary scenarios used to test genealogical inference and then explains conversion of recorded sexual pedigrees to phylogenetic trees (for reference in evaluating reconstruction quality).

Validation Experiment. This experiment tested the quality of genealogical history recovered from species-level annotation. Trials covered three evolutionary scenarios. The first, “allopatry,” induced full speciation through introduction of a total reproductive barrier among subpopulations. The second, “ring,” induced partial phylogenetic structure through an island model with small amounts of migration. Finally, a well-mixed control lacking meaningful phylogenetic structure was included. Ten independent replicates were performed for each treatment.

Populations evolved on the simple one-max problem, described in Supplementary Sect. 5.1. At the 200th generation, species-level annotations were extracted from extant population members. Phylogenetic reconstruction used agglomerative trie-based reconstruction techniques from [18]. To evaluate reconstruction quality, inferred phylogenies were compared to references distilled from the underlying pedigree record (described below). Reconstruction error was quantified using quartet distance [6, 20]. Inferred phylogenies were also visualized to confirm the recovery of major historical features.

Allopatry Treatment. This treatment simulates 100 generations of well-mixed sympatric evolution. At generation 100, the population is divided into two 50-member subpopulations. These subpopulations evolve independently for 50 generations. Then, at generation 150, the first subpopulation is split again into five 10-member subpopulations. All subpopulations then evolve independently for a further 50 generations. Phylogenetic reconstruction from this treatment should recover a binary branching at generation 100 followed by a quintuple branching along one lineage at generation 150.

Ring Treatment. This treatment splits the population into ten distinct subpopulations (i.e., islands). Islands were connected in a ring topology. One individual migrated between adjacent populations per generation.

Bag Treatment. This treatment applies a well-mixed population model. As such, no meaningful phylogenetic structure exists to be detected.
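To make the link between species-level annotations and recoverable history concrete, the following minimal sketch simulates a simplified allopatry scenario. It is an illustration under stated assumptions rather than the reported experimental code or the hstrat API: parents are drawn uniformly at random (no one-max selection), only the first split at generation 100 is modeled, no fingerprints are pruned, and all names are hypothetical.

```python
import random


def evolve(populations, rng, bits=64):
    """Advance each isolated subpopulation by one synchronous generation."""
    next_populations = []
    for members in populations:
        offspring = []
        for _ in members:
            mom, dad = rng.choice(members), rng.choice(members)
            # Gene drive: inherit the larger fingerprint at every layer ...
            column = [max(m, d) for m, d in zip(mom, dad)]
            # ... then deposit a fresh random fingerprint for this generation.
            column.append(rng.getrandbits(bits))
            offspring.append(column)
        next_populations.append(offspring)
    return next_populations


def first_mismatch(column_a, column_b):
    """Index of the first differing layer; shared ancestry must precede it."""
    for layer, (x, y) in enumerate(zip(column_a, column_b)):
        if x != y:
            return layer
    return min(len(column_a), len(column_b))


def run_allopatry(seed=1, size=100, split_at=100, generations=200):
    rng = random.Random(seed)
    populations = [[[rng.getrandbits(64)] for _ in range(size)]]  # layer 0
    for generation in range(1, generations + 1):
        if generation == split_at:
            (members,) = populations  # total reproductive barrier from here on
            populations = [members[: size // 2], members[size // 2:]]
        populations = evolve(populations, rng)
    return populations


left, right = run_allopatry()
print("across subpopulations:", first_mismatch(left[0], right[0]))
print("within a subpopulation:", first_mismatch(left[0], left[1]))
```

Annotations sampled from different subpopulations should first disagree at or shortly before layer 100, because layers deposited just before the split may not yet have reached consensus. Annotations from the same subpopulation should agree until the most recent layers, which have not yet swept.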

Footnote 1: Reported experiments used 64-bit fingerprint values, which will exhibit even lower collision probabilities. However, numerical considerations impede precise calculation.


Phylogeny Extraction from Pedigree Records. In order to provide a baseline reference for evaluation of annotation-based phylogenetic reconstruction, phylogenetic trees (reflecting “relatedness”) were extracted from full sexual pedigrees. Such conversion has fundamental limitations: well-mixed populations of modest size lack structured differences in phylogenetic relatedness. Although speciation and spatial structure introduce phylogenetic structure, this issue still arises among constituent well-mixed subpopulations. No effort was made to condense such arbitrary branch structure, which could be achieved through clustering to a relatedness threshold.

Phylogenetic relationships were distilled from pedigree data through a two-step process. First, pairwise Most Recent Common Ancestor (MRCA) depths were collated from the pedigree. Then, UPGMA reconstruction was applied, yielding an inferred tree [22]. Finally, branch lengths were corrected to accurately position leaf nodes at their known generational depths.
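The first step, collating pairwise MRCA depths from a sexual pedigree, can be sketched as follows. This is a hypothetical illustration, not the chapter's implementation: the pedigree is assumed to be a plain mapping from each individual to its two parents, alongside a mapping from each individual to its generation, and the MRCA depth of a pair is taken as the latest generation among their shared ancestors.

```python
from itertools import combinations


def ancestor_depths(individual, parents, depth):
    """Map every ancestor of `individual` (itself included) to its generation."""
    seen = {}
    stack = [individual]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen[node] = depth[node]
        stack.extend(parents.get(node, ()))  # founders have no recorded parents
    return seen


def pairwise_mrca_depths(tips, parents, depth):
    """Latest generation at which each pair of tips shares an ancestor."""
    lineages = {tip: ancestor_depths(tip, parents, depth) for tip in tips}
    table = {}
    for a, b in combinations(tips, 2):
        shared = lineages[a].keys() & lineages[b].keys()
        table[a, b] = max((depth[n] for n in shared), default=None)
    return table


# Toy pedigree: four founders, three generation-1 members, three extant tips.
depth = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 1, "F": 1, "G": 1, "H": 2, "I": 2, "J": 2}
parents = {
    "E": ("A", "B"), "F": ("A", "B"), "G": ("C", "D"),
    "H": ("E", "F"), "I": ("F", "G"), "J": ("G", "G"),
}
print(pairwise_mrca_depths(["H", "I", "J"], parents, depth))
```

A distance such as tip generation minus MRCA depth could then be handed to any UPGMA (average-linkage) routine. Pairs with no shared ancestor in the record, like H and J here, need a convention of their own, for example a depth below the founders.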

7.2.3 Population Size Inference

This section introduces the mechanistic principle behind distributed population size estimation, reports statistical formulations derived to this end, and then describes experiments that test the proposed inference technique.

Inference Principle. Recall that the proposed gene drive mechanism (Sect. 7.2.1) fixes the maximum of $n$ fingerprint values, where $n$ is the population size and each value is drawn from a uniform distribution. Under this gene drive mechanism, fixed fingerprint magnitude reveals information about population size: larger populations tend to fix greater fingerprints (as illustrated in Supplementary Fig. 9). Scaling integer fingerprint values to fall between 0 and 1, fixed fingerprint magnitude turns out to follow beta distribution $\beta(n, 1)$ [8].

Directly analogous techniques to estimate population size have also arisen in decentralized, anonymous network engineering [24, 25]. In these schemes, nodes independently draw a random vector of numerical values from a known distribution. Values are exchanged through an aggregating function (e.g., minimum, maximum, etc.), ultimately resulting in consensus values fixed within the network. Each node can then independently infer probabilistic information about the larger network, in a manner that is generally consistent across nodes.

Population Size Estimator Statistics. Statistical details for population size inference follow, some of which are, to the best of the author’s knowledge, not yet reported. Natural log is used throughout. Section 7.2.5 links to full derivations.

Maximum Likelihood Estimator (MLE). The maximum likelihood estimator for population size given $k$ independent observations of unit-scaled fixed-fingerprint values $x_i$ is

$$\hat{n}_{\mathrm{mle}} = -k \Big/ \sum_{i=1}^{k} \log(x_i). \tag{7.1}$$
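Equation 7.1 is short enough to try directly. The sketch below is illustrative only: `fixed_fingerprint` stands in for gene drive by drawing the maximum of $n$ uniform values on the unit interval, and the second returned value rescales the MLE by $(k-1)/k$, which removes the bias implied by $E(\hat{n}_{\mathrm{mle}}) = nk/(k-1)$. Function names are hypothetical.

```python
import math
import random


def fixed_fingerprint(n, rng):
    """Unit-scaled value that gene drive would fix: the max of n uniform draws."""
    return max(rng.random() for _ in range(n))


def estimate_population_size(observations):
    """Return (MLE from Eq. 7.1, bias-corrected estimate); needs k > 1."""
    k = len(observations)
    n_mle = -k / sum(math.log(x) for x in observations)
    n_unbiased = n_mle * (k - 1) / k  # because E[n_mle] = n * k / (k - 1)
    return n_mle, n_unbiased


rng = random.Random(42)
sample = [fixed_fingerprint(100, rng) for _ in range(10)]  # true n = 100, k = 10
print(estimate_population_size(sample))
```

Individual estimates from ten observations remain noisy, which is why the interval estimates that follow matter in practice.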


With true population size $n$, mean square error of the estimator is $\mathrm{MSE}(\hat{n}_{\mathrm{mle}}) = n^2 (k^2 + k - 2)/[(k - 1)^2 (k - 2)]$. Expected value follows as $E(\hat{n}_{\mathrm{mle}}) = nk/(k - 1)$ for $k > 1$. Rescaling $\hat{n}_{\mathrm{mle}}$ by $(k - 1)/k$ to remove this bias yields a mean-unbiased population size estimator. These MLE results were also derived in [25].

Confidence Interval. Confidence intervals are useful to the interpretation of population size estimates. Formulations derived from the maximum likelihood estimator are provided below. For a single observation of unit-scaled fixed-fingerprint magnitude $\hat{x}$, the population size $n$ can be estimated with $c\%$ confidence to fall between lower bound $\log[(50 + 0.5c)/100]/\log \hat{x}$ and upper bound $\log[(50 - 0.5c)/100]/\log \hat{x}$. At this low observational power, the 95% confidence interval spans a 145-fold range and the 99% confidence interval spans a 1057-fold range. For $k$ observations of unit-scaled fixed-fingerprint magnitudes $\hat{x}_i$, population size can be estimated with $c\%$ confidence to fall within the interval $(\hat{n}_{\mathrm{lb}}, \hat{n}_{\mathrm{ub}})$, computed as numerical solutions of

$$0 = 2\,\Gamma\!\left(k,\; -\hat{n}_{\mathrm{lb}} \sum_{i=1}^{k} \log \hat{x}_i\right) - \Gamma(k)\,\frac{100 + c}{100} \quad\text{and}\quad 0 = 2\,\Gamma\!\left(k,\; -\hat{n}_{\mathrm{ub}} \sum_{i=1}^{k} \log \hat{x}_i\right) - \Gamma(k)\,\frac{100 - c}{100}. \tag{7.2}$$

Here, $\Gamma(k)$ is the complete gamma function and the two-argument form $\Gamma(k, \cdot)$ is the upper incomplete gamma function. Four independent observations provide a 95% confidence interval spanning an 8-fold range or a 99% confidence interval spanning a 16-fold range. Nine independent observations are sufficient for a 95% confidence interval spanning a 4-fold range or a 99% confidence interval spanning a 6-fold range. Thirty-three independent observations are sufficient for a 95% confidence interval spanning a 2-fold range or a 99% confidence interval spanning a 2.5-fold range. Because $\sum_{i=1}^{k} \log \hat{x}_i \propto \hat{n}_{\mathrm{mle}}^{-1}$, confidence bound width can be shown to scale as a constant factor of population size $n$.

Median-unbiased Estimator. Evaluating either confidence interval formula with $c = 50$ derives a median-unbiased estimator.

Credible Intervals. Assuming a uniform prior over population size, credibility contained within a factor $f$ of the maximum likelihood estimate $\hat{n}_{\mathrm{mle}}$ can be calculated as $\gamma(k + 1, f k)/k! - \gamma(k + 1, k/f)/k!$. Here, $\gamma$ is the lower incomplete gamma function. By form, this credibility remains constant across population sizes $n$. Credibility interval width scales similarly with sample size as discussed above for confidence intervals.

Rolling Estimation. Experiments reported here compute a population size estimate and confidence intervals from a simple rolling collection of ten preceding fixed-gene magnitudes. Supplemental Sect. 5.2 walks through an example of rolling population size inference. More sophisticated regularizations have been proposed to estimate dynamically changing population sizes from time series data [24].

Validation Experiment. This experiment tests the ability of the population size estimation process to detect differences between populations and to detect changes within a population over time. Four treatments were evaluated with ten independent replicates performed per treatment.

Bottleneck treatment. Simulated population crash and rebound. Population size was kept at 100 for 67 generations, reduced an order of magnitude to 10 for 66 generations, and then returned to 100 for another 67 generations.

Range expansion. Simulated gradual growth in population size. Population size was initiated at 10 for 67 generations, then increased linearly for 66 generations to 142 at generation 133, and then maintained at 142 for another 67 generations.

Selection pressure. Modulated selection intensity. This affects effective population size by changing the number of population members eliminated without contributing to the gene pool. High selection pressure was applied for 67 generations (tournament size 8). Then, selection pressure was eliminated for 66 generations (tournament size 1). Finally, high selection pressure was reinstated for the last 67 generations (tournament size 8).

Control treatment. Constant population size 100 for 200 generations.
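The confidence interval formulation in this section can be evaluated with the standard library alone. The sketch below relies on one observation that is an inference made here rather than a statement from the chapter: for integer $k$, the normalized upper incomplete gamma function in Eq. 7.2 is the survival function of a Gamma($k$, 1) (Erlang) variable, so each bound is a quantile of that distribution divided by $-\sum_i \log \hat{x}_i$. The bisection solver and all names are hypothetical choices.

```python
import math


def erlang_cdf(x, k):
    """P(G <= x) for G ~ Gamma(shape=k, scale=1) with integer k >= 1."""
    term, partial = 1.0, 0.0
    for j in range(k):
        partial += term      # accumulates x**j / j!
        term *= x / (j + 1)
    return 1.0 - math.exp(-x) * partial


def erlang_quantile(p, k, tol=1e-12):
    """Invert erlang_cdf by bracketing and bisection."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, k) < p:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def confidence_interval(observations, c=95.0):
    """(lower, upper) bounds on population size at c% confidence."""
    k = len(observations)
    s = -sum(math.log(x) for x in observations)
    lower = erlang_quantile((100.0 - c) / 200.0, k) / s
    upper = erlang_quantile((100.0 + c) / 200.0, k) / s
    return lower, upper


for sample in ([0.9], [0.9, 0.95, 0.99, 0.97]):
    low, high = confidence_interval(sample)
    print(len(sample), "observation(s): fold width", high / low)
```

With one observation, the lower bound reduces to $\log(0.975)/\log \hat{x}$ at $c = 95$, matching the single-observation formula. The fold width depends only on $k$ and $c$, not on the observed values, which is the constant-factor scaling noted above.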

7.2.4 Positive Selection Inference

This section presents the proposed gene selection inference method and the design of validation experiments used to test it.

Inference Mechanism. Alleles experiencing positive selection increase in frequency. However, allele frequency increases can also occur through drift. The key difference between the two is the rate of increase: drift tends to be slower than selection, especially for large population sizes. Selection can therefore be differentiated from drift dynamics by considering copy count of gene descendants after a fixed number of generations $g$. If this copy count exceeds expectation under a null hypothesis of pure drift dynamics, positive selection can be inferred. Stronger positive selection will correlate with greater growth of copy count within the $g$ generation window. This is the working principle behind the proposed detection method.

The proposed mechanism measures delayed copy count through gene-level instrumentation. Each fingerprint is bundled with a fixed-length, zero-initialized bit field. These bit fields are copied verbatim to descendants along with the rest of gene annotations. At the $g$th generation following its creation, a single bit is set at random in each bit field. During subsequent recombination, corresponding bit fields with matching fingerprints combine using the bitwise OR operation. In this manner, set bits propagate among the records that trace back to a particular founder gene at the snapshot window outset. Annotations’ bit counts converge to reflect the number of gene copies present after generational delay $g$. Figure 7.3 summarizes the overall mechanism.

Set bits can undercount snapshot gene copies due to bit position overlap or gene copy extinctions subsequent to generation $g$. Sensitivity to larger copy counts could be achieved by setting bits instead with probability $p < 1$.
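A minimal sketch of the delayed copy count bookkeeping follows. It assumes the combining operation is bitwise OR (which is what lets set bits propagate), represents each bit field as a Python integer, and uses hypothetical function names; it is not the hstrat implementation.

```python
import random

FIELD_BITS = 64  # an 8-byte bit field


def snapshot(bit_field, rng):
    """At generation g after deposit: set one bit at a random position."""
    return bit_field | (1 << rng.randrange(FIELD_BITS))


def recombine(fingerprint_a, field_a, fingerprint_b, field_b):
    """Exchange set bits only when both fields annotate the same founder gene."""
    if fingerprint_a == fingerprint_b:
        merged = field_a | field_b
        return merged, merged
    return field_a, field_b


def estimated_copy_count(bit_field):
    """Number of set bits: a lower bound on copies present at the snapshot."""
    return bin(bit_field).count("1")


rng = random.Random(0)
founder = 0xC0FFEE  # fingerprint shared by every copy of one founder gene
fields = [snapshot(0, rng) for _ in range(5)]  # five copies alive at the snapshot
pooled = fields[0]
for other in fields[1:]:
    pooled, _ = recombine(founder, pooled, founder, other)
print(estimated_copy_count(pooled))
```

The printed estimate can fall below five whenever two copies happen to set the same bit position, which is the undercounting effect described above.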


Fig. 7.3 Proposed mechanism for detecting gene-level selection via a distributed delayed copy count estimation mechanism. Strata deposited at generation $n$ progress through 16 generations, with copy count of one allele growing due to selection. On the sixteenth generation, a “snapshot” is performed to set a random bit of the field annotated onto each descendant differentia copy (here, denoted by letter). In subsequent recombination events, set bits are exchanged between bit fields associated with common differentia. Copy count at generation $n + 16$ can then be estimated from these bit fields, with high copy count indicative of selection. Note that in this example collision between set bits $i'$ and $i''$ results in an undercount. This mechanism is associated with “gene-level” instrumentation (Fig. 7.1)

A bit field width of 8 bytes and a snapshot delay of 16 generations, by arbitrary choice, were used in experiments. Better sensitivity to weak selection events should be achievable through longer snapshot windows and larger bit fields, but potentially at the cost of diluting signal from strong selection events. Future work should explore how to best tailor snapshot window length and bit field widths.

Soft sweeps should, in principle, be detectable to some extent through this methodology, as they involve increases in copy count at faster-than-drift rates. Soft sweeps are scenarios where changes in environmental conditions induce positive selection on an existing, potentially widespread allelic variant that was previously neutral [11]. However, weak sweeps on very widespread alleles will register only a weak signal on this instrumentation because increases in copy count are spread across many preexisting allele copies. Although weak sweeps are not explored here, they merit future consideration.

Validation Experiment. A minimalistic experimental system was devised. Each individual in the population comprised a single floating-point number, representing a single focal gene. Gene values were restricted between 0.0 and 1.0. Fitness score was defined as the sum of the genetic value and a random number drawn from a continuous, uniform unit-valued distribution. So, individuals’ gene values corresponded directly to probabilistic fitness advantage. For example, a value of 0.2 would give an average 20% selective advantage. Fitness scores for each individual were calculated once per generation and used for all tournaments.

All individuals were initialized with a gene value 0.0. At generation 50, one organism’s gene value was set to either 0.0 (see footnote 2), 0.1, or 1.0. This operation was repeated in subsequent generations if the introduced allele went extinct. This procedure enabled comparison of a strong selective sweep (fitness advantage 1.0), a weaker selective sweep (fitness advantage 0.1), and a control treatment with pure drift dynamics (fitness advantage 0.0).

Synchronous selection with tournament size 2 was performed over 200 simulated generations with a constant population size of 400. Crossover propagated a random parent’s gene value. No mutation was applied. Ten independent replicates were performed for each treatment.
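The validation system is small enough to sketch end to end. The version below is a simplified reading of the description above, not the reported experimental code. Tournament entrants are sampled with replacement, each offspring takes two tournament-selected parents, and the control allele is marked with `sys.float_info.min` so it can be told apart from the background value; all three are assumptions introduced here, as are the function names. No gene-level instrumentation is attached; only true copy counts are tracked.

```python
import random
import sys


def tournament(fitness, rng):
    """Size-2 tournament: draw two members and return the index of the fitter."""
    i, j = rng.randrange(len(fitness)), rng.randrange(len(fitness))
    return i if fitness[i] >= fitness[j] else j


def run_sweep(allele, seed=0, size=400, generations=200, introduce_at=50):
    """Return the focal allele's copy count after each generation."""
    rng = random.Random(seed)
    genes = [0.0] * size
    copy_counts = []
    for generation in range(generations):
        if generation >= introduce_at and allele not in genes:
            genes[rng.randrange(size)] = allele  # introduce, or reintroduce if lost
        # Fitness is scored once per generation and reused across tournaments.
        fitness = [gene + rng.random() for gene in genes]
        parents = [
            (tournament(fitness, rng), tournament(fitness, rng))
            for _ in range(size)
        ]
        # Crossover on a one-gene genome: copy the gene of one parent at random.
        genes = [genes[rng.choice(pair)] for pair in parents]
        copy_counts.append(genes.count(allele))
    return copy_counts


strong = run_sweep(1.0)
control = run_sweep(sys.float_info.min)  # tiny marker value, effectively neutral
print("strong sweep, final copy count:", strong[-1])
print("control, final copy count:", control[-1])
```

Comparing copy counts 16 generations after introduction across the two runs gives a feel for the separation between sweep and drift that the bit-field estimates are meant to capture.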

7.2.5 Software and Data

Software, data, analyses, and supplementary materials are hosted via the Open Science Framework at https://osf.io/xj5pn/. Data structures and algorithms associated with hereditary stratigraphy methodology are published in the hstrat Python package on PyPI and at github.com/mmore500/hstrat [17]. Recombination features and corresponding C++ implementations are on the project’s roadmap. This work benefited greatly from open-source scientific software [4, 7, 20]. The Alife data standard facilitated tool interoperation [12].

7.3 Results and Discussion

This section reports validation experiments for debuted genealogical, population size, and positive selection inference techniques. Results support their efficacy.

Footnote 2: The smallest representable floating point value was set for fitness advantage treatment 0.0 so the introduced gene could be differentiated from the background gene. This value was small enough to have no meaningfully detectable effect on selection.


7.3.1 Genealogical Inference

Figure 7.4 compares phylogenetic trees reconstructed from species-level instrumentation to corresponding references extracted from perfectly tracked sexual pedigrees. For treatments with meaningful phylogenetic structure (the “allopatry” and “ring” treatments) phylogenetic reconstruction largely succeeded in recovering the historical relationships between subpopulations. In fact, for the “allopatry” treatment, inner node time points appear to more closely track the true generational time frames of speciation events (at generation 100 and 150) than the UPGMA-based pedigree distillation.

Figure 7.5 shows distributions of reconstruction error for each treatment. Across the three treatments, all ten replicate reconstructions yielded quartet distance from the reference below 0.66 (the null expectation for arbitrary trees) [21]. This confirms recovery of phylogenetic information in all three cases (exact binomial test, $p < 0.01$). However, as expected, reconstruction quality on the bag population structure was marginal due to the lack of meaningful phylogenetic information available to reconstruct. Performance on the ring and allopatry treatments was stronger, achieving quartet distances between reconstruction and reference of around 0.3 in the typical case. However, junk phylogenetic structure within the reference phylogeny obscures the amount of meaningful reconstruction error.

7.3.2 Population Size Inference Figure 7.6 summarizes the distribution of effective population size estimates across replicates at the beginning, middle, and end of evolutionary runs. Evidencing detection sensitivity, estimates differ across time points within all non-control treatments. For the bottleneck and selection pressure treatments, which involve reversion to initial conditions, estimate distributions at the first and last time points are comparable, as expected. Supplemental Fig. 10 shows ten-sample rolling estimates of population size for one replicate from each surveyed treatment. All population estimates respond to underlying demographic changes, although the response to selection pressure relaxation appears weaker than responses to changes in population size. Substantial estimate volatility appears across all cases. Supplemental Fig. 11 summarizes the detectability of underlying effective population size changes. Detection was performed by evaluating 95% confidence interval overlap between rolling population size estimates at different time points. No false positives were detected. Most true changes in effective population size were detected in at least nine out of ten replicates, except for the selection pressure treatment and for the last segment of the range expansion treatment.
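The interval-overlap criterion described above can be illustrated with a minimal sketch. The function names and the interval values are hypothetical and chosen for illustration only; how the confidence intervals themselves are computed is not shown here.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """Return True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def change_detected(ci_before: Interval, ci_after: Interval) -> bool:
    """Flag a population size change when the two intervals do not overlap."""
    return not intervals_overlap(ci_before, ci_after)


# Hypothetical 95% confidence intervals at two time points.
disjoint = change_detected((80.0, 120.0), (30.0, 60.0))
overlapping = change_detected((80.0, 120.0), (100.0, 150.0))
print(disjoint, overlapping)
```

Requiring non-overlap is a conservative criterion: it trades some sensitivity for a low false-positive rate.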

7 Methods for Rich Phylogenetic Inference Over Distributed Sexual Populations


Fig. 7.4 Comparison of reference phylogenies (top) and reconstructed phylogenies (bottom) for example replicates of each experimental treatment: bag (well-mixed), allopatry (population split at generation 100 with a secondary split on one branch at generation 150), and ring (ten island subpopulations with some migration). Phylogenies are downsampled from 100 to 20 tips for legibility. Extant organism IDs are annotated on tips. Taxon color coding is consistent between reference and reconstruction to facilitate comparison. Branch length on x axis given in generations

Fig. 7.5 Normalized quartet distances between reconstructed phylogenies and references distilled from tracked pedigree. Lower indicates less reconstruction error. Notches give bootstrapped 95% CI. Horizontal blue line indicates the expected quartet distance between random trees. Some reconstruction error is expected, especially in the control treatment, due to the resolution of effectively arbitrary phylogenetic structure among well-mixed population components

Fig. 7.6 Distributions of ten-sample MLE population size estimates by treatment across three time points. See Sect. 7.2.3 for population size and selection pressure manipulations performed for each treatment. Notches indicate bootstrapped 95% confidence intervals

7.3.3 Positive Selection Inference Gene selection experiments introduced novel alleles with fitness advantages of 1.0 (strong selection), 0.1 (weaker selection), or 0.0 (no selection, the control). Figure 7.7 plots underlying gene copy count with instrument delay copy count estimates. These bit counts were extracted from a single randomly sampled population member. For the strong selection treatment, allele frequency fixes rapidly and induces a sharp spike in the instrument values. For the weaker selection treatment, allele frequency grows somewhat less rapidly and is sometimes delayed. An instrument bit count spike is apparent, but it occurs with a smaller magnitude and more variable timing.

Fig. 7.7 Trajectories of true gene prevalence (red) and instrument set bit counts (blue). Top row summarizes distribution across replicates. Bottom row shows an example replicate of each treatment. Fitness advantage 0.0 conferred no selective benefit. Fitness advantage 0.1 corresponded to relatively weak selection and fitness advantage 1.0 corresponded to strong selection. Spikes of high gene prevalence annotation bit count (blue) are used to detect underlying selective dynamics (red). Note that y axis scaling differs among bottom-row graphs

The sensitivity and specificity of positive gene selection detection were evaluated by surveying false-positive rates (i.e., detecting selection on control replicates) and false-negative rates (i.e., non-detection of selection on fitness-advantaged replicates) across a range of detection threshold values. Supplemental Fig. 12 plots detection outcomes across a range of detection thresholds. Strong selection events can be unambiguously distinguished from neutral events, as well as from weaker selection events. Weaker selection and neutral events were not entirely separable. The middle-of-the-road detection threshold misidentified one neutral event and one weak positive event, corresponding to a 10% false-positive and 10% false-negative rate.
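The threshold survey described above amounts to counting misclassified replicates at each candidate threshold. The following sketch shows the bookkeeping under stated assumptions: the peak values are invented for illustration, and summarizing each replicate by a single peak instrument value is a simplification introduced here, not a detail taken from the text.

```python
from typing import Sequence, Tuple


def detection_rates(control_peaks: Sequence[float],
                    selected_peaks: Sequence[float],
                    threshold: float) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) at a threshold.

    A replicate counts as "selection detected" when its peak instrument
    value meets or exceeds the threshold.
    """
    false_pos = sum(p >= threshold for p in control_peaks)
    false_neg = sum(p < threshold for p in selected_peaks)
    return false_pos / len(control_peaks), false_neg / len(selected_peaks)


# Made-up peak values for ten control and ten weak-selection replicates.
control = [3, 4, 2, 5, 3, 9, 4, 2, 3, 4]
weak = [12, 10, 9, 11, 6, 13, 10, 12, 9, 11]

for threshold in (5, 8, 12):
    print(threshold, detection_rates(control, weak, threshold))
```

With these invented values, the middle threshold misclassifies one replicate in each group, mirroring the one-in-ten error rates reported in the text.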

7.4 Conclusion Nature operates as a fully distributed system. Methodology developed to provide our increasingly rich account of natural history often composes a larger picture through biomaterial samplings and best-effort field observations. It seems prudent, therefore, to look to biology not only for inspiration in engineering algorithms that employ ecology and evolution but also in devising methodology to observe them at scale. To this end, this work has explored reconstruction-based approaches to phylogenetic analysis. Proposed instruments collect “gene-” and “species-level” information from fully distributed EC systems that employ sexual recombination. This is achieved through extension of hereditary stratigraphy genome annotations originally designed for completely asexual populations. Experiments validated capability to detect aspects of genealogical history, demographic history, and selection dynamics, all without centralized tracking. The ultimate aim of this project is to provision infrastructure for phylogenetic observation to any pertinent digital evolution system, with a special eye to large-scale distributed processing. To this end, open-source, plug-and-play software implementations of hereditary stratigraphy algorithms and data structures are a core priority. Ample opportunity exists for collaboration to tailor hereditary stratigraphy techniques and software to applications across evolution systems, programming languages, and underlying hardware.

References
1. Avise, J.C.: Gene trees and organismal histories: a phylogenetic approach to population biology. Evol. 43(6), 1192–1208 (1989)
2. Burke, E., Gustafson, S., Kendall, G., Krasnogor, N.: Is increased diversity in genetic programming beneficial? An analysis of lineage selection. In: CEC ’03, vol. 2, pp. 1398–1405 (2003)
3. Burlacu, B., Affenzeller, M., Kommenda, M., Winkler, S., Kronberger, G.: Visualization of genetic lineages and inheritance information in genetic programming. In: GECCO Conference Companion Proceedings, pp. 1351–1358 (2013)
4. Cock, P.J., et al.: Biopython. Bioinform. 25(11), 1422 (2009)
5. Dolson, E.L., Vostinar, A.E., Wiser, M.J., Ofria, C.: The modes toolbox: measurements of open-ended dynamics in evolving systems. Artif. Life 25(1), 50–73 (2019)
6. Estabrook, G.F., McMorris, F., Meacham, C.A.: Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Zool. 34(2), 193–200 (1985)
7. Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., Gagné, C.: DEAP: evolutionary algorithms made easy. J. ML Res. 13, 2171–2175 (2012)
8. Gentle, J.E.: Generation of Random Numbers, pp. 305–331. Springer (2009)
9. Godin-Dubois, K., Cussat-Blanc, S., Duthen, Y.: Apoget: automated phylogeny over geological timescales. In: ALIFE 2019 (MethAL workshop) (2019)
10. Gropp, W., Snir, M.: Programming for exascale computers. Comput. Sci. & Eng. 15(6), 27–35 (2013)
11. Hermisson, J., Pennings, P.S.: Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genet. 169(4), 2335–2352 (2005)
12. Lalejini, A., et al.: Data standards for artificial life software. In: The 2019 Conference on Artificial Life, pp. 507–514. MIT Press (2019)
13. Lalejini, A., Moreno, M.A., Hernandez, J.G., Dolson, E.: Phylogeny-informed fitness estimation. In: GPTP XX. Springer (2023)
14. McPhee, N.F., Donatucci, D., Helmuth, T.: Using graph databases to explore the dynamics of genetic programming runs. In: GPTP XIII, pp. 185–201 (2016)
15. McPhee, N.F., Finzel, M.D., Casale, M.M., Helmuth, T., Spector, L.: A detailed analysis of a PushGP run. In: GPTP XIV, pp. 65–83 (2018)
16. Moreno, M.A., Dolson, E., Ofria, C.: Hereditary stratigraphy: genome annotations to enable phylogenetic inference over distributed populations. In: ALIFE 2022: the 2022 Conference on Artificial Life. MIT Press (2022a)
17. Moreno, M.A., Dolson, E., Ofria, C.: hstrat: a python package for phylogenetic inference on distributed digital evolution populations. J. Open Source Softw. 7(80), 4866 (2022)
18. Moreno, M.A., Dolson, E., Rodriguez-Papa, S.: Toward phylogenetic inference of evolutionary dynamics at scale. In: ALIFE 2023: Proceedings of the 2023 Artificial Life Conference, p. 79. MIT Press (2023)
19. Murphy, G., Ryan, C.: A simple powerful constraint for genetic programming. In: Genetic Programming, pp. 146–157. Springer (2008)
20. Sand, A., et al.: tqdist. Bioinform. 30(14), 2079–2080 (2014)
21. Smith, M.R.: Information theoretic generalized Robinson-Foulds metrics for comparing phylogenetic trees. Bioinform. 5007–5013 (2020)
22. Sokal, R., Michener, C.: A Statistical Method for Evaluating Systematic Relationships, p. 40. University of Kansas science bulletin, University of Kansas (1958)
23. Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evol. Comput. 10(2), 99–127 (2002)
24. Terelius, H., Varagnolo, D., Johansson, K.H.: Distributed size estimation of dynamic anonymous networks. In: IEEE Conference on Decision and Control (CDC), pp. 5221–5227 (2012)
25. Varagnolo, D., Pillonetto, G., Schenato, L.: Distributed statistical estimation of the number of nodes in sensor networks. In: IEEE Conference on Decision and Control (CDC), pp. 1498–1503 (2010)

Chapter 8

A Melting Pot of Evolution and Learning

Moshe Sipper, Achiya Elyasaf, Tomer Halperin, Zvika Haramaty, Raz Lapid, Eyal Segal, Itai Tzruia, and Snir Vitrack Tamam

8.1 Introduction In Evolutionary Computation (EC)—or Evolutionary Algorithms (EAs)—core concepts from evolutionary biology—inheritance, random variation, and selection—are harnessed in algorithms that are applied to complex computational problems. As discussed by [1], EAs present several important benefits over popular machine learning (ML) methods, including: less reliance on the existence of a known or discoverable gradient within the search space; ability to handle design problems, where the objective is to design new entities from scratch; fewer required a priori assumptions about the problem at hand; seamless integration of human expert knowledge; ability to solve problems where human expertise is very limited; support of interpretable solution representations; support of multiple objectives. Importantly, these strengths often dovetail with weak points of ML algorithms, which has resulted in an increasing number of works that fruitfully combine the fields of EC with ML and deep learning (DL). Herein, we will survey eight recent works by our group, which are at the intersection of EC, ML, and DL:

M. Sipper (B) · T. Halperin · Z. Haramaty · R. Lapid · E. Segal · I. Tzruia · S. V. Tamam
Department of Computer Science, Ben-Gurion University of the Negev, 8410501 Beer-Sheva, Israel

A. Elyasaf
Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, 8410501 Beer-Sheva, Israel

R. Lapid
DeepKeep, Tel-Aviv, Israel

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_8


• Machine Learning (Sect. 8.2)
  1. Binary and Multinomial Classification through Evolutionary Symbolic Regression [2] (Sect. 8.2.1)
  2. Classy Ensemble: A Novel Ensemble Algorithm for Classification [3] (Sect. 8.2.2)
  3. EC-KitY: Evolutionary Computation Tool Kit in Python [4] (Sect. 8.2.3)
• Deep Learning (Sect. 8.3)
  4. Evolution of Activation Functions for Deep Learning-Based Image Classification [5] (Sect. 8.3.1)
  5. Adaptive Combination of a Genetic Algorithm and Novelty Search for Deep Neuroevolution [6] (Sect. 8.3.2)
• Adversarial Deep Learning (Sect. 8.4)
  6. An Evolutionary, Gradient-Free, Query-Efficient, Black-Box Algorithm for Generating Adversarial Instances in Deep Networks [7] (Sect. 8.4.1)
  7. Foiling Explanations in Deep Neural Networks [8] (Sect. 8.4.2)
  8. Patch of Invisibility: Naturalistic Black-Box Adversarial Attacks on Object Detectors [9] (Sect. 8.4.3)

Readers whose interest is piqued by a particular project are invited to peruse the respective cited full paper (all are freely available online).

8.2 Machine Learning 8.2.1 Binary and Multinomial Classification Through Evolutionary Symbolic Regression [2] Classification is an important subfield of supervised learning. As such, many powerful algorithms have been designed over the years to tackle both binary datasets and multinomial, or multiclass, ones. Symbolic regression (SR) is a family of algorithms that aims to find regressors of arbitrary complexity. Sipper [2] showed that evolutionary SR-based regressors can be successfully converted into performant classifiers. We devised and tested three evolutionary SR-based classifiers: GPLearnClf, CartesianClf, and ClaSyCo. The first two are based on the one-vs-rest approach, while the last one is inherently multinomial.

GPLearnClf is based on the GPLearn package [10], which implements tree-based Genetic Programming (GP) symbolic regression, is relatively fast, and, importantly, interfaces seamlessly with Scikit-learn [11]. GPLearnClf evolves $C$ separate populations independently, each fitted to a specific class by considering as target values the respective column vector (of $C$ column vectors) of the one-hot-encoded target vector $y$. The fitness function is based on log loss (aka binary cross-entropy). Prediction is carried out by outputting the argmax over the outputs of the best evolved individuals (one from each population). The hyperparameters to tune were population size and generation count.

CartesianClf is based on Cartesian GP (CGP), which grew from a method of evolving digital circuits [12]. It is called “Cartesian” because it represents a program using a two-dimensional grid of nodes. The CGP package we used [13] evolves the population in a $(1 + \lambda)$ manner, i.e., in each generation, it creates $\lambda$ offspring (we used the default $\lambda = 4$) and compares their fitness to the parent individual. The fittest individual carries over to the next generation; in case of a draw, the offspring is preferred over the parent. The package uses tournament selection (tournament size $= |\text{population}|$), single-point mutation, and no crossover. We implemented CartesianClf similarly to GPLearnClf in a one-vs-rest manner, with $C$ separate populations evolving independently, using binary cross-entropy as fitness. The hyperparameters to tune were the number of rows, number of columns, and maximum number of generations.

ClaSyCo (Classification through Symbolic Regression and Coevolution) also employs $C$ populations of trees; however, these are not evolved independently as with the one-vs-rest method (as done with GPLearnClf and CartesianClf) but in tandem through cooperative coevolution. A cooperative coevolutionary algorithm involves a number of evolving populations, which come together to obtain problem solutions. The fitness of an individual in a particular population depends on its ability to collaborate with individuals from the other populations [14, 15]. Specifically, in our case, an individual SR tree $i$ in population $c$, $gp_i^c$, $i \in \{1, \ldots, n\_pop\}$, $c \in \{1, \ldots, C\}$, is assigned fitness through the following steps (we describe this per single dataset sample, although in practice fitness computation is vectorized in Python):
1. Individual $gp_i^c$ computes an output $\hat{y}_i^c$ for the sample under consideration.
2. Obtain the best-fitness classifier of the previous generation, $gp_{best}^{c'}$, for each population $c'$, $c' \in \{1, \ldots, C\}$, $c' \neq c$ (these are called “representatives” or “cooperators” [14]).
3. Each $gp_{best}^{c'}$ computes an output $\hat{y}_{best}^{c'}$ for the sample under consideration.
4. We now have $C$ output values, $\hat{y}_{best}^{1}, \ldots, \hat{y}_i^c, \ldots, \hat{y}_{best}^{C}$.
5. Compute $\sigma(\hat{y}_{best}^{1}, \ldots, \hat{y}_i^c, \ldots, \hat{y}_{best}^{C})$, where $\sigma$ is the softmax function.
6. Assign a fitness score to $gp_i^c$ using the cross-entropy loss function. (NB: only individual $gp_i^c$ is assigned fitness; all other $C - 1$ individuals are representatives.)

Tested over 162 datasets and compared to three state-of-the-art machine learning algorithms (XGBoost, LightGBM, and a deep neural network), we found our algorithms to be competitive. Further, we demonstrated how to find the best method for one’s dataset automatically, through the use of Optuna, a state-of-the-art hyperparameter optimizer [16].
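The per-sample fitness steps of ClaSyCo listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and example values are hypothetical, classes are indexed from 0 rather than 1, and the tree outputs are passed in as plain numbers instead of being computed by evolved trees.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def clasyco_fitness(y_hat_ic: float, c: int,
                    rep_outputs: np.ndarray, true_class: int) -> float:
    """Cross-entropy loss for one sample (lower is better).

    y_hat_ic    -- output of the individual being evaluated (population c)
    rep_outputs -- length-C vector of representative outputs; entry c is
                   replaced by y_hat_ic
    true_class  -- index of the sample's correct class
    """
    outputs = rep_outputs.astype(float).copy()
    outputs[c] = y_hat_ic                     # step 4: the C output values
    probs = softmax(outputs)                  # step 5: softmax
    return float(-np.log(probs[true_class]))  # step 6: cross-entropy


# Hypothetical representative outputs for C = 3 populations.
reps = np.array([0.2, -1.0, 0.5])
good = clasyco_fitness(y_hat_ic=2.0, c=1, rep_outputs=reps, true_class=1)
bad = clasyco_fitness(y_hat_ic=-2.0, c=1, rep_outputs=reps, true_class=1)
print(good, bad)
```

An individual in the population of the true class is rewarded for producing a large output relative to the other populations' representatives, which is what couples the populations during coevolution.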


8.2.2 Classy Ensemble: A Novel Ensemble Algorithm for Classification [3] Sipper [3] presented Classy Ensemble, a novel ensemble-generation algorithm for classification tasks, which aggregates models through a weighted combination of per-class accuracy. The field of ensemble learning has an illustrious history spanning several decades. Indeed, we ourselves employed this paradigm successfully in several recent works:
• Sipper and Sipper [17, 18] presented conservation machine learning, which conserves models across runs, users, and experiments. As part of this work, we compared multiple ensemble-generation methods, also introducing lexigarden, which is based on lexicase selection, a performant selection technique for evolutionary algorithms [19, 20].
• Sipper [21] presented SyRBo (Symbolic-Regression Boosting), an ensemble method based on strong learners that are combined via boosting, used to solve regression tasks.
• Sipper [22] introduced AddGBoost, a gradient boosting-style algorithm, wherein the decision tree is replaced by a succession of stronger learners, which are optimized via a state-of-the-art hyperparameter optimizer.
• Sipper [23] presented a comprehensive, stacking-based framework for combining deep learning with good old-fashioned machine learning, called Deep GOld. The framework involves ensemble selection from 51 retrained pretrained deep networks as first-level models, and 10 ML algorithms as second-level models.

Classy Ensemble receives as input a collection of fitted models, each one’s overall accuracy score, and per-class accuracy, i.e., each model’s accuracy values as computed separately for every class (note: we used scikit-learn’s [11] balanced_accuracy_score, which avoids inflated performance estimates on imbalanced datasets). Classy Ensemble adds to the ensemble the topk best-performing (over validation set) models, for each class. A model may be in the topk set for more than one class. Thus, for each model in the ensemble, we also maintain a list of classes for which it is a voter, i.e., its output for each voter class is taken into account in the final aggregation. The binary voter vector of size n_classes is set to 1 for classes the model is permitted to vote for, 0 otherwise. Thus, a model not in the ensemble is obviously not part of the final aggregated prediction; further, a model in the ensemble is only “allowed” to vote for those classes for which it is a voter, i.e., for classes it was among the topk. Classy Ensemble provides a prediction by aggregating its members’ predicted-class probabilities, weighted by the overall validation score, and taking into account voter permissions.

Tested over 153 machine learning datasets, we demonstrated that Classy Ensemble outperforms two other well-known aggregation algorithms (order-based pruning and clustering-based pruning) as well as our aforementioned lexigarden ensemble generator. We then enhanced Classy Ensemble with a genetic algorithm, creating Classy Evolutionary Ensemble, wherein an evolutionary algorithm is used to select the set of models that Classy Ensemble picks from. This latter algorithm was able to improve state-of-the-art deep learning models over the well-known, difficult ImageNet dataset.
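The aggregation rule of Classy Ensemble described in this section (per-class topk selection, binary voter vectors, score-weighted probabilities) can be sketched with NumPy. This is an illustrative reading of the prose, not the published code: the function name, argument layout, and example numbers are assumptions, and the models are represented only by their already-computed probability arrays.

```python
import numpy as np


def classy_ensemble_predict(probas, overall_scores, per_class_acc, topk=1):
    """Aggregate class probabilities with per-class voter permissions.

    probas         -- array (n_models, n_samples, n_classes) of predicted
                      class probabilities
    overall_scores -- array (n_models,) of overall validation scores
    per_class_acc  -- array (n_models, n_classes) of per-class validation
                      accuracy
    Returns (predicted class per sample, binary voter matrix).
    """
    probas = np.asarray(probas, dtype=float)
    overall_scores = np.asarray(overall_scores, dtype=float)
    per_class_acc = np.asarray(per_class_acc, dtype=float)
    n_models, _, n_classes = probas.shape

    # voters[m, k] == 1 if model m is among the topk models for class k.
    voters = np.zeros((n_models, n_classes))
    for k in range(n_classes):
        best = np.argsort(per_class_acc[:, k])[::-1][:topk]
        voters[best, k] = 1.0

    # Weight each model by its overall score and mask out classes it may not
    # vote for; models with no voter classes contribute nothing.
    weights = overall_scores[:, None] * voters
    aggregated = (probas * weights[:, None, :]).sum(axis=0)
    return aggregated.argmax(axis=1), voters


# Three hypothetical models, two samples, two classes.
probas = [[[0.9, 0.1], [0.6, 0.4]],
          [[0.2, 0.8], [0.3, 0.7]],
          [[0.5, 0.5], [0.5, 0.5]]]
preds, voters = classy_ensemble_predict(
    probas,
    overall_scores=[0.8, 0.7, 0.5],
    per_class_acc=[[0.9, 0.6], [0.5, 0.85], [0.4, 0.4]],
    topk=1,
)
print(preds, voters)
```

In this toy case the third model is not in the topk set for any class, so it has an all-zero voter vector and is effectively excluded from the ensemble.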

8.2.3 EC-KitY: Evolutionary Computation Tool Kit in Python [4] There is a growing community of researchers and practitioners who combine evolution and learning. Having used several open-source EC software packages over the years, we identified a large “hole” in the software landscape: there was a lacuna in the form of an EC package that is:
1. A comprehensive toolkit for running evolutionary algorithms.
2. Written in Python.
3. Able to work with or without scikit-learn (aka sklearn), the most popular ML library for Python. To wit, the package should support both sklearn and standalone (non-sklearn) modes.
4. Designed with modern software engineering in mind.
5. Designed to support all popular EC paradigms: genetic algorithms (GAs), genetic programming (GP), evolution strategies (ES), coevolution, multi-objective, etc.

While there are several EC Python packages, none fulfill all five requirements. Some are not written in Python, some are badly documented, some do not support multiple EC paradigms, and so forth. Importantly for the ML community, most tools do not intermesh with extant ML tools. Indeed, we have personally had experience with the hardships of combining EC tools with scikit-learn when doing evolutionary machine learning. Thus was born EC-KitY: a comprehensive Python library for doing EC, licensed under the BSD 3-Clause License, and compatible with scikit-learn. Designed with modern software engineering and machine learning integration in mind, EC-KitY can support all popular EC paradigms, including genetic algorithms, genetic programming, coevolution, evolutionary multi-objective optimization, and more. EC-KitY can work both in standalone, non-sklearn mode, and in sklearn mode. Below we show two code examples that solve a symbolic regression problem. In standalone mode, the user can run an EA with a mere three lines of code:

    from eckity.algorithms.simple_evolution import SimpleEvolution
    from eckity.subpopulation import Subpopulation
    from examples.treegp.non_sklearn_mode.symbolic_regression.sym_reg_evaluator import SymbolicRegressionEvaluator

    # Build the algorithm, run evolution, then execute the best individual.
    algo = SimpleEvolution(Subpopulation(SymbolicRegressionEvaluator()))
    algo.evolve()
    print('algo.execute(x=2, y=3, z=4):', algo.execute(x=2, y=3, z=4))

Running an EA in sklearn mode is just as simple:

    from sklearn.datasets import make_regression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    from eckity.algorithms.simple_evolution import SimpleEvolution
    from eckity.creators.gp_creators.full import FullCreator
    from eckity.genetic_encodings.gp.tree.utils import create_terminal_set
    from eckity.sklearn_compatible.regression_evaluator import RegressionEvaluator
    from eckity.sklearn_compatible.sk_regressor import SKRegressor
    from eckity.subpopulation import Subpopulation

    # Synthetic regression data; the terminal set is derived from the features.
    X, y = make_regression(n_samples=100, n_features=3)
    terminal_set = create_terminal_set(X)

    algo = SimpleEvolution(
        Subpopulation(creators=FullCreator(terminal_set=terminal_set),
                      evaluator=RegressionEvaluator()))

    # Wrap the evolutionary algorithm as a scikit-learn regressor.
    regressor = SKRegressor(algo)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    regressor.fit(X_train, y_train)
    print('MAE on test set:', mean_absolute_error(y_test, regressor.predict(X_test)))

We recently taught a course in which 48 students worked in groups of two or three, submitting a total of 22 projects that used EC-KitY to solve a diverse array of complex problems, including evolving Flappy Bird agents, evolving blackjack strategies, evolving Super Mario agents, evolving chess players, and solving problems such as maximum clique and vehicle routing. EC-KitY proved quite up to the tasks.

8.3 Deep Learning 8.3.1 Evolution of Activation Functions for Deep Learning-Based Image Classification [5] Artificial Neural Networks (ANNs), and, specifically, Deep Neural Networks (DNNs), have gained much traction in recent years and are now being effectively put to use in a variety of applications. Considerable work has been done to improve training and testing performance, including various initialization techniques, weight-tuning algorithms, different architectures, and more. However, one hyperparameter is usually left untouched: the activation function (AF). While recent work has seen the design of novel AFs [24–27], the Rectified Linear Unit (ReLU) remains by far the most commonly used one, mainly due to its overcoming the vanishing-gradient problem, thus affording faster learning and better performance.


Fig. 8.1 A sample CGP individual, with 3 inputs ($x$, $-1$, $1$) and 1 output ($y$). The genome consists of 5 3-valued genes, per 4 functional units, plus the output specification (no genes for the inputs). The first value of each 3-valued gene is the function’s index in the lookup table of functions (bottom-left), and the remaining two values are parameter nodes. The last gene determines the outputs to return. In the above example, with $n_i$ representing node $i$: node 3, gene 101, $f_1(n_0, n_1) = n_0 \times n_1$; node 4, gene 330, $f_3(n_3) = e^{n_3}$ (unary function, 3rd gene value ignored); node 5, gene 042, $f_0(n_4, n_2) = n_4 + n_2$; node 6, gene 250, $f_2(n_5) = \frac{1}{n_5}$; node 7, output node, $n_6$ is the designated output value. The topology is fixed throughout evolution, while the genome evolves

Lapid [5] introduced a novel coevolutionary algorithm to evolve AFs for image-classification tasks. Our method is able to handle the simultaneous coevolution of three types of AFs: input-layer AFs, hidden-layer AFs, and output-layer AFs. We surmised that combining different AFs throughout the architecture might improve the network’s performance. We devised a number of evolutionary algorithms, including a coevolutionary one, comprising three separate populations: (1) input-layer AFs, (2) hidden-layer AFs, and (3) output-layer AFs. Combining three individuals, one from each population, results in an AF architecture that can be evaluated. We compared our novel algorithm to four different methods: standard ReLU- or LeakyReLU-based networks, networks whose AFs are produced randomly, and two forms of single-population evolution, differing in whether an individual represents a single AF or three AFs. We chose ReLU and LeakyReLU as baseline AFs since we noticed that they are the most-used functions in the deep-learning domain. We used Cartesian genetic programming (CGP), wherein an evolving individual is represented as a two-dimensional grid of computational nodes (often an acyclic graph), which together express a program [28]. An individual is represented by a linear genome, composed of integer genes, each encoding a single node in the graph, which represents a specific function. A node consists of a function, from a given table of functions, and connections, specifying where the data for the node comes from. A sample individual in the evolving CGP population, representing the well-known sigmoid AF, is shown in Fig. 8.1. Tested on four datasets (MNIST, FashionMNIST, KMNIST, and USPS), coevolution proved to be a performant algorithm for finding good AFs and AF architectures.
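To make the genome encoding concrete, the sample individual of Fig. 8.1 can be decoded in a few lines of Python. This is a sketch under stated assumptions: the four-entry function table is reconstructed from the functions named in the caption (the table in the figure itself may list more), and the helper names are hypothetical.

```python
import math

# Function lookup table; indices follow the example in the Fig. 8.1 caption.
FUNCTIONS = [
    lambda a, b: a + b,        # f0: addition
    lambda a, b: a * b,        # f1: multiplication
    lambda a, b: 1.0 / a,      # f2: reciprocal (second argument ignored)
    lambda a, b: math.exp(a),  # f3: exponential (second argument ignored)
]


def evaluate_cgp(genome, output_node, inputs):
    """Evaluate a CGP genome given as (function, arg1, arg2) gene triples."""
    nodes = list(inputs)  # nodes 0..len(inputs)-1 hold the inputs
    for func_idx, a, b in genome:
        nodes.append(FUNCTIONS[func_idx](nodes[a], nodes[b]))
    return nodes[output_node]


# Genes 101, 330, 042, 250 from the caption; the output gene selects node 6.
SIGMOID_GENOME = [(1, 0, 1), (3, 3, 0), (0, 4, 2), (2, 5, 0)]


def cgp_sigmoid(x):
    """Evaluate the sample individual on inputs (x, -1, 1)."""
    return evaluate_cgp(SIGMOID_GENOME, output_node=6, inputs=(x, -1.0, 1.0))


print(cgp_sigmoid(0.0))  # 0.5
```

Tracing the nodes gives $n_3 = -x$, $n_4 = e^{-x}$, $n_5 = e^{-x} + 1$, and $n_6 = 1/(1 + e^{-x})$, which is the sigmoid function.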


8.3.2 Adaptive Combination of a Genetic Algorithm and Novelty Search for Deep Neuroevolution [6] As the field of Reinforcement Learning (RL) [29] is being applied to harder tasks, two unfortunate trends emerge: larger policies that require more computing time to train, and “deceptive” optima. While gradient-based methods do not scale well to large clusters, evolutionary computation (EC) techniques have been shown to greatly reduce training time by using modern distributed infrastructure [30, 31]. The problem of deceptive optima has long been known in the EC community: exploiting the objective function too early might lead to a sub-optimal solution, and attempting to escape it incurs an initial loss in the objective function. Novelty Search (NS) mitigates this issue by ignoring the objective function while searching for new behaviors [32]. This method has been shown to work for RL [31]. While both genetic algorithms (GAs) and NS have been shown to work in different environments [31], we attempted in [6] to combine the two to produce a new algorithm that does not fall behind either, and in some scenarios surpasses both. Segal [6] proposed a new algorithm: Explore-Exploit $\gamma$-Adaptive Learner ($E^2\gamma AL$, or EyAL). By preserving a dynamically-sized niche of novelty-seeking agents, the algorithm manages to maintain population diversity, exploiting the reward signal when possible and exploring otherwise. The algorithm combines both the exploitative power of a GA and the explorative power of NS while maintaining their simplicity and elegance. Our experiments showed that EyAL outperforms NS in most scenarios while being on par with a GA, and in some scenarios it can outperform both. EyAL also allows the substitution of the exploiting component (GA) and the exploring component (NS) with other algorithms, e.g., Evolution Strategy and Surprise Search, thus opening the door for future research.
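For readers unfamiliar with novelty search, the sketch below shows one common way to score novelty: the mean distance from an individual's behavior descriptor to its k nearest neighbors in the current population and an archive. It is offered only as background on the exploring component; it is not EyAL, the adaptive niche sizing is not modeled, and the behavior descriptors and the value of k are made up.

```python
import numpy as np


def novelty_scores(behaviors: np.ndarray, archive: np.ndarray,
                   k: int = 3) -> np.ndarray:
    """Mean Euclidean distance from each behavior to its k nearest neighbors
    among the archive and the rest of the current population."""
    pool = np.vstack([behaviors, archive])
    scores = np.empty(len(behaviors))
    for i, b in enumerate(behaviors):
        dists = np.linalg.norm(pool - b, axis=1)
        dists[i] = np.inf  # exclude the individual itself
        scores[i] = np.sort(dists)[:k].mean()
    return scores


# Hypothetical 2-D behavior descriptors (e.g., final agent positions).
population = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
archive = np.array([[0.0, 0.1], [0.1, 0.1]])
scores = novelty_scores(population, archive, k=2)
print(scores)
```

The individual far from all previously seen behaviors receives the highest score, regardless of how much reward it earned, which is what lets novelty-driven search step around deceptive optima.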

8.4 Adversarial Deep Learning 8.4.1 An Evolutionary, Gradient-Free, Query-Efficient, Black-Box Algorithm for Generating Adversarial Instances in Deep Networks [7] Despite their success, recent studies have shown that DNNs are vulnerable to adversarial attacks. A barely detectable change in an image can cause a misclassification in a well-trained DNN. Targeted adversarial examples can even evoke a misclassification of a specific class (e.g., misclassify a car as a cat). Researchers have demonstrated that adversarial attacks are successful in the real world and may be produced for data modalities beyond imaging, e.g., natural language and voice recognition [33–36]. DNNs’ vulnerability to adversarial attacks has raised concerns about applying these techniques to safety-critical applications.


To discover effective adversarial instances, most past work on adversarial attacks has employed gradient-based optimization [37–41]. Gradient computation can only be executed if the attacker is fully aware of the model architecture and weights. Thus, these approaches are only useful in a white-box scenario, where an attacker has complete access and control over a targeted DNN. Attacking real-world AI systems, however, might be far more arduous. The attacker must consider the difficulty of implementing adversarial instances in a black-box setting, in which no information about the network design, parameters, or training data is provided—the attacker is exposed only to the classifier’s input-output pairs. In this context, a typical strategy has been to attack trained replacement networks and hope that the generated examples transfer to the target model [42]. The substantial mismatch between the alternative model and the target model, as well as the significant computational cost of alternative network training, often renders this technique ineffective. Lapid [7] assumed a real-world, black-box attack scenario, wherein a DNN’s input and output may be accessed but not its internal configuration. We focused on a scenario in which a specific DNN is an image classifier, specifically, a convolutional neural network (CNN), which accepts an image as input and outputs a probability score for each class. We presented QuEry Attack (for Query-Efficient Evolutionary Attack): an evolutionary, gradient-free optimization approach for generating adversarial instances, more suitable for real-life scenarios, because usually there is no access to a model’s internals, including the gradients; thus, it is important to craft attacks that do not use gradients. 
Our proposed attack can deal with either constrained (an $\epsilon$ value that constrains the norm of the allowed perturbation) or unconstrained (no constraint on the norm of the perturbation) problems, and focuses on constrained, untargeted attacks. We believe that our framework can be easily adapted to the targeted setting. Figure 8.2 shows examples of successful and unsuccessful instances of images generated by QuEry Attack, evaluated against ImageNet, CIFAR10, and MNIST. QuEry Attack is a strong and fast attack that employs a gradient-free optimization strategy. We tested QuEry Attack against MNIST, CIFAR10, and ImageNet models, comparing it to other commonly used algorithms. We also evaluated QuEry Attack’s performance against non-differential transformations and robust models, and it succeeded in both scenarios.
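To make the black-box, constrained, untargeted setting concrete, the following is a minimal sketch of a gradient-free evolutionary search for an $\ell_\infty$-bounded perturbation. It is not QuEry Attack itself: the truncation selection, the Gaussian mutation, the parameter names (`eps`, `mut_scale`), and the two-class `toy_classifier` are all assumptions introduced for illustration. The only access to the model is through its output probabilities, and each call counts as one query.

```python
import numpy as np


def toy_classifier(x):
    """Hypothetical black-box model: class 1 if the mean pixel exceeds 0.5."""
    z = 20.0 * (float(np.mean(x)) - 0.5)
    p1 = 1.0 / (1.0 + np.exp(-z))
    return np.array([1.0 - p1, p1])


def evolve_linf_attack(predict_proba, x, true_label, eps=0.05,
                       pop_size=20, generations=200, mut_scale=0.5, seed=0):
    """Search for an untargeted adversarial example within an l_inf ball.

    predict_proba is queried as a black box; no gradients are used.
    Returns (x_adv, n_queries); x_adv is None if the search fails.
    """
    rng = np.random.default_rng(seed)
    # Each individual is a perturbation inside the l_inf ball of radius eps.
    pop = rng.uniform(-eps, eps, size=(pop_size,) + x.shape)
    queries = 0
    for _ in range(generations):
        fitness = np.empty(len(pop))
        for i, delta in enumerate(pop):
            x_adv = np.clip(x + delta, 0.0, 1.0)  # keep a valid image
            probs = predict_proba(x_adv)
            queries += 1
            if int(np.argmax(probs)) != true_label:
                return x_adv, queries  # misclassified: attack succeeded
            fitness[i] = probs[true_label]  # lower confidence is better
        # Truncation selection: keep the better half, refill with mutants.
        parents = pop[np.argsort(fitness)[: len(pop) // 2]]
        children = parents + rng.normal(0.0, mut_scale * eps, size=parents.shape)
        children = np.clip(children, -eps, eps)  # project back into the ball
        pop = np.concatenate([parents, children])
    return None, queries
```

In this sketch the fitness is simply the probability the model assigns to the true class, so the search needs nothing beyond input-output pairs. A real attack would also have to budget queries carefully, which this toy loop does not attempt.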

8.4.2 Foiling Explanations in Deep Neural Networks [8]

In order to render a DL model more interpretable, various explainable algorithms have been conceived. Van Lent [43] coined the term Explainable Artificial Intelligence (XAI), which refers to AI systems that “can explain their behavior either during execution or after the fact”. In-depth research into XAI methods has been sparked by the success of Machine Learning systems, particularly Deep Learning, in a variety of domains, and the difficulty in intuitively understanding the outputs of complex models, namely, how did a DL model arrive at a specific decision for a given input.


M. Sipper et al.

Fig. 8.2 Examples of adversarial attacks generated by QuEry Attack. With higher resolution the attack becomes less visible to the naked eye. The differences between images that successfully attack the model and those that do not are subtle. Top row: ImageNet ($\ell_\infty = 6/255$). Middle row: CIFAR10 ($\ell_\infty = 6/255$). Bottom row: MNIST ($\ell_\infty = 60/255$). Left: the original image. Middle: a successful attack. Right: a failed attack

Explanation techniques have drawn increased interest in recent years due to their potential to reveal hidden properties of DNNs [44]. For safety-critical applications, interpretability is essential, and sometimes even legally required. The importance assigned to each input feature for the overall classification result may be observed through explanation maps, which can be used to offer explanations. Such maps can be used to create defenses and detectors for adversarial attacks [45–47]. Tamam [8] showed that these explanation maps can be transformed into any target map, using only the maps and the network’s output probability vector. This was accomplished by adding a perturbation to the input that is scarcely (if at all) noticeable to the human eye. This perturbation has minimal effect on the neural network’s output; therefore, in addition to the classification outcome, the probability vector of all classes remains virtually identical. Our black-box algorithm, AttaXAI, enables manipulation of an image through a barely noticeable perturbation, without the use of any model internals, such that the explanation fits any given target explanation. AttaXAI explores the space of images through evolution, ultimately producing an adversarial image; it does so by continually updating a Gaussian probability distribution, used to sample the space of perturbations. By continually improving this distribution the search improves (Fig. 8.3). Figure 8.4 shows a sample result.


Fig. 8.3 Schematic of AttaXAI. Individual images are sampled from the population’s distribution $\mathcal{N}(\mu, \sigma)$, and fed into the model (Feature 1 and Feature 2 are image features, e.g., two pixel values; in reality the dimensionality is much higher). Then, the fitness function, i.e. the loss, is calculated using the output probability vectors and the explanation maps to approximate the gradient and update the distribution parameters, $\mu$ and $\sigma$

Fig. 8.4 An attack generated by AttaXAI. Dataset: ImageNet. DL Model: VGG16. XAI model: Deep Lift. The primary objective has been achieved: having generated an adversarial image ($x_{adv}$), virtually identical to the original ($x$), the explanation map ($g$) of the adversarial image ($x_{adv}$) is now, incorrectly, that of the target image ($x_{target}$); essentially, the two rightmost columns are identical

This work demonstrated how focused, undetectable modifications to the input data can result in arbitrary and significant adjustments to the explanation map. We showed that explanation maps of several known explanation algorithms may be modified at will. Importantly, this is feasible with a black-box approach, while maintaining the output of the model. We tested AttaXAI against the ImageNet and CIFAR100 datasets using 4 different network models.
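The distribution-update idea described for AttaXAI, sampling candidates from a Gaussian and using their losses to approximate a gradient, can be illustrated with a generic evolution-strategy loop. The sketch below is an assumption-laden simplification rather than the authors' implementation: only the mean $\mu$ is updated while $\sigma$ is held fixed, the loss is an arbitrary black-box function supplied by the caller, and the names `nes_minimize`, `lr`, and `pop_size` are hypothetical.

```python
import numpy as np


def nes_minimize(loss, mu, sigma=0.1, pop_size=50, lr=0.05, steps=200, seed=0):
    """Minimize a black-box loss by updating the mean of a Gaussian.

    Each step samples pop_size candidates from N(mu, sigma^2 I), scores them,
    and moves mu along a gradient estimate built from the samples alone.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float).copy()
    for _ in range(steps):
        noise = rng.standard_normal((pop_size, mu.size))
        losses = np.array([loss(mu + sigma * n) for n in noise])
        # Standardize losses so the step size does not depend on loss scale.
        shaped = (losses - losses.mean()) / (losses.std() + 1e-12)
        # Weighted average of the noise: a gradient estimate up to scale.
        direction = (shaped[:, None] * noise).mean(axis=0)
        mu -= lr * direction  # descend, since we are minimizing
    return mu
```

In the explanation-attack setting, the loss would presumably combine a distance between the current and target explanation maps with a term that keeps the output probabilities unchanged; here any callable can be plugged in, for example a simple quadratic for testing.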

8.4.3 Patch of Invisibility: Naturalistic Black-Box Adversarial Attacks on Object Detectors [9]

The implications of adversarial attacks can be far-reaching, as they can compromise the security and accuracy of systems that rely on DL. For instance, an adversarial attack on a vehicle-mounted, image-recognition system could cause it to misidentify a stop sign as a speed-limit sign [48], potentially causing the vehicle to crash. As DL becomes increasingly ubiquitous, the need to mitigate adversarial attacks becomes more pressing. Therefore, research into adversarial attacks and defenses is a rapidly growing area, with researchers working on developing robust and secure models that are less susceptible to such attacks.

In [9], we focused on fooling surveillance cameras (both indoor and outdoor), because of their ubiquity and susceptibility to attack, by creating adversarial patches (Fig. 8.5). Our objective was to generate physically plausible adversarial patches, which are performant and appear realistic, without the use of gradients. An adversarial patch is a specific type of attack, where an image is modified by adding a small, local pattern that engenders misclassification. The goal of such an attack is to intentionally mislead a model into making an incorrect prediction or decision. By “physically plausible” we mean patches that work not only digitally but also in the physical world, e.g., when printed and used.

Fig. 8.5 An adversarial patch evolved by our gradient-free algorithm, which conceals people from an object detector

The space of possible adversarial patches is huge, and with the aim of reducing it to afford a successful search process, we chose to use pretrained generative adversarial network (GAN) generators. Given a pretrained generator, we seek an input latent vector, corresponding to a generated image that leads the object detector to err. We leverage the latent space’s (relatively) small dimension, approximating the gradients using an Evolution Strategy algorithm [49], repeatedly updating the input latent vector by querying the target object detector until an appropriate adversarial patch is discovered. Figure 8.6 depicts a general view of our approach. We search for an input latent vector that, given a pretrained generator, corresponds to a generated image that “hides” a person from the object detector.

Fig. 8.6 Naturalistic Black-Box Adversarial Attack: Overview of framework. The system creates patches for object detectors by using the learned image manifold of a pretrained GAN ($G$) on real-world images (as is often the case, we use the GAN’s generator, but do not need the discriminator). We use a pretrained classifier ($C$) to force the optimizer to find a patch that resembles a specific class, the TV (total variation) component in order to make the images as smooth as possible, and the detector ($D$) for the actual detection loss. Efficient sampling of the GAN images via an iterative evolution strategy ultimately generates the final patch

The patches we generated can be printed and used in the real world. We compared different deep models and concluded that it is possible to generate patches that fool object detectors. The real-world tests of the printed patches demonstrated their efficacy in “concealing” persons, evidencing a basic threat to security systems.
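The framework overview names three ingredients of the objective: a detection term, a classifier term, and a total variation (TV) term that favors smooth patches. As a hedged illustration, the sketch below computes one common (anisotropic) definition of total variation and combines the three terms as a weighted sum. The weights `w_tv` and `w_cls`, the function names, and the choice of TV variant are assumptions; the source does not state how the terms are weighted or which TV formula is used.

```python
import numpy as np


def total_variation(img):
    """Anisotropic total variation: the sum of absolute differences between
    vertically and horizontally adjacent pixels of an H x W (x C) array."""
    img = np.asarray(img, dtype=float)
    vertical = np.abs(np.diff(img, axis=0)).sum()
    horizontal = np.abs(np.diff(img, axis=1)).sum()
    return vertical + horizontal


def patch_loss(detection_score, classifier_loss, patch, w_tv=0.1, w_cls=0.1):
    """Hypothetical combined objective to minimize: how strongly the detector
    still sees the person, plus smoothness and class-resemblance penalties."""
    return detection_score + w_tv * total_variation(patch) + w_cls * classifier_loss
```

A perfectly flat patch has zero total variation, while a checkerboard maximizes it, which is why adding this term pushes the search toward smooth, printable images.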

8.5 Concluding Remark

Our main conclusion from the works presented above is simple: When combined judiciously, EC and ML/DL reinforce each other to form a powerful alliance.

And we are fervently expanding this lineup of successful joint ventures.

Acknowledgements

This research was partially supported by the following grants: Israeli Innovation Authority through the Trust.AI consortium; Israeli Science Foundation grant no. 2714/19; Israeli Smart Transportation Research Center (ISTRC); Israeli Council for Higher Education (CHE) via the Data Science Research Center, Ben-Gurion University of the Negev, Israel.


References

1. Sipper, M., Olson, R.S., Moore, J.H.: Evolutionary computation: the next major transition of artificial intelligence? BioData Min. 10(1), 26 (2017)
2. Sipper, M.: Binary and multinomial classification through evolutionary symbolic regression. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO ’22, pp. 300–303. Association for Computing Machinery, New York, NY, USA (2022a)
3. Sipper, M.: Classy ensemble: a novel ensemble algorithm for classification. https://arxiv.org/abs/2302.10580 (2023a)
4. Sipper, M., Halperin, T., Tzruia, I., Elyasaf, A.: EC-KitY: Evolutionary computation tool kit in Python with seamless machine learning integration. SoftwareX 22, 101381 (2023b)
5. Lapid, R., Sipper, M.: Evolution of activation functions for deep learning-based image classification. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO ’22, pp. 2113–2121. Association for Computing Machinery, New York, NY, USA (2022a)
6. Segal, E., Sipper, M.: Adaptive combination of a genetic algorithm and novelty search for deep neuroevolution. In: Bäck, T., van Stein, B., Wagner, C., Garibaldi, J.M., Lam, H.K., Cottrell, M., Doctor, F., Filipe, J., Warwick, K., Kacprzyk, J. (eds.) Proceedings of the 14th International Joint Conference on Computational Intelligence, IJCCI 2022, Valletta, Malta, October 24–26, 2022, pp. 143–150. SciTePress (2022)
7. Lapid, R., Haramaty, Z., Sipper, M.: An evolutionary, gradient-free, query-efficient, black-box algorithm for generating adversarial instances in deep convolutional neural networks. Algorithms 15(11), (2022b)
8. Tamam, S.V., Lapid, R., Sipper, M.: Foiling explanations in deep neural networks. https://arxiv.org/abs/2211.14860 (2022)
9. Lapid, R., Sipper, M.: Patch of invisibility: naturalistic black-box adversarial attacks on object detectors. https://arxiv.org/abs/2303.04238 (2023)
10. GPLearn. https://gplearn.readthedocs.io/ (2021). Accessed 30 April 2021
11. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
12. Miller, J.F.: Cartesian Genetic Programming, pp. 17–34. Springer (2011)
13. Quade, M.: Cartesian. https://github.com/Ohjeah/cartesian (2020)
14. Pena-Reyes, C.A., Sipper, M.: Fuzzy CoCo: a cooperative-coevolutionary approach to fuzzy modeling. IEEE Trans. Fuzzy Syst. 9(5), 727–737 (2001)
15. Sipper, M., Moore, J.H.: OMNIREP: originating meaning by coevolving encodings and representations. Memetic Comput. 11(3), 251–261 (2019)
16. Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: A next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631 (2019)
17. Sipper, M., Moore, J.H.: Conservation machine learning. BioData Min. 13(1), 9 (2020)
18. Sipper, M., Moore, J.H.: Conservation machine learning: a case study of random forests. Nat. Sci. Rep. 11(1), 3629 (2021a)
19. Metevier, B., Saini, A.K., Spector, L.: Lexicase selection beyond genetic programming. In: Banzhaf, W., Spector, L., Sheneman, L. (eds.) Genetic Programming Theory and Practice XVI, pp. 123–136. Springer (2019)
20. Spector, L.: Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In: Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation, pp. 401–408. ACM (2012)
21. Sipper, M., Moore, J.H.: Symbolic-regression boosting. Genet. Program. Evolvable Mach. 1–25 (2021b)
22. Sipper, M., Moore, J.H.: AddGBoost: a gradient boosting-style algorithm based on strong learners. Mach. Learn. Appl. 7, 100243 (2022b)
23. Sipper, M.: Combining deep learning with good old-fashioned machine learning. SN Comput. Sci. 4(1), 85 (2022c)
24. Agostinelli, F., Hoffman, M., Sadowski, P., Baldi, P.: Learning activation functions to improve deep neural networks (2014). arXiv:1412.6830
25. Saha, S., Nagaraj, N., Mathur, A., Yedida, R.: Evolution of novel activation functions in neural network training with applications to classification of exoplanets (2019). arXiv:1906.01975
26. Sharma, S., Sharma, S.: Activation functions in neural networks. Towar. Data Sci. 6(12), 310–316 (2017)
27. Sipper, M.: Neural networks with à la carte selection of activation functions. SN Comput. Sci. 2(6), 1–9 (2021)
28. Miller, J.F., Harding, S.L.: Cartesian genetic programming. In: Proceedings of the 10th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO ’08, pp. 2701–2726. Association for Computing Machinery, New York, NY, USA (2008). ISBN 9781605581316
29. Sutton, R.S., Barto, A.G.: Reinforcement Learning: an Introduction, 2nd edn. MIT press (2018)
30. Salimans, T., Ho, J., Chen, X., Sidor, S., Sutskever, I.: Evolution strategies as a scalable alternative to reinforcement learning (2017). arXiv:1703.03864
31. Such, F.P., Madhavan, V., Conti, E., Lehman, J., Stanley, K.O., Clune, J.: Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning (2017). arXiv:1712.06567
32. Lehman, J., Stanley, K.O.: Exploiting open-endedness to solve problems through the search for novelty. In: Proceedings of the Eleventh International Conference on Artificial Life (ALIFE). MIT Press, Cambridge, MA, (2008)
33. Wang, X., Jin, H., He, K.: Natural language adversarial attack and defense in word level (2019)
34. Morris, J.X., Lifland, E., Yoo, J.Y., Qi, Y.: Textattack: a framework for adversarial attacks in natural language processing. In: Proceedings of the 2020 EMNLP (2020)
35. Carlini, N., Wagner, D.: Audio adversarial examples: Targeted attacks on speech-to-text. In: 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. IEEE (2018)
36. Schönherr, L., Kohls, K., Zeiler, S., Holz, T., Kolossa, D.: Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding (2018). arXiv:1808.05665
37. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples (2014). arXiv:1412.6572
38. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The limitations of deep learning in adversarial settings. In: 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. IEEE (2016)
39. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE (2017)
40. Gu, S., Rigazio, L.: Towards deep neural network architectures robust to adversarial examples (2014). arXiv:1412.5068
41. Moosavi-Dezfooli, S.-M., Fawzi, A., Frossard, P.: Deepfool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582 (2016)
42. Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519 (2017)
43. Van Lent, M., Fisher, W., Mancuso, M.: An explainable artificial intelligence system for small-unit tactical behavior. In: Proceedings of the National Conference on Artificial Intelligence, pp. 900–907. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2004
44. Došilović, F.K., Brčić, M., Hlupić, N.: Explainable artificial intelligence: a survey. In: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 0210–0215. IEEE (2018)
45. Walia, S., Kumar, K., Agarwal, S., Kim, H.: Using xai for deep learning-based image manipulation detection with shapley additive explanation. Symmetry 14(8), 1611 (2022)
46. Fidel, G., Bitton, R., Shabtai, A.: When explainability meets adversarial learning: Detecting adversarial examples using shap signatures. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2020)
47. Kao, C.-Y., Chen, J., Markert, K., Böttinger, K.: Rectifying adversarial inputs using xai techniques. In: 2022 30th European Signal Processing Conference (EUSIPCO), pp. 573–577. IEEE (2022)
48. Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., Song, D.: Robust physical-world attacks on deep learning visual classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634 (2018)
49. Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., Schmidhuber, J.: Natural evolution strategies. J. Mach. Learn. Res. 15(1), 949–980 (2014)

Chapter 9

Particularity

Lee Spector, Li Ding, and Ryan Boldi

9.1 Overview

In this paper we first describe lexicase selection, an algorithm that was originally developed for use in genetic programming systems, which produce computer programs through processes of variation and selection. We then present a generalization of the ideas that underlie lexicase selection, describing this generalization in terms of a design principle called “particularity.” After defining particularity, we review a sequence of developments that exemplify particularity in different ways to extend the problem-solving power of the systems in which they are used. In doing so, we broaden our scope beyond genetic programming, discussing the use of particularity in other forms of machine learning, including deep neural networks, and in biology. We conclude with some general comments about the future of particularity in the design of adaptive systems.

L. Spector (corresponding author)
Amherst College, Amherst, MA 01002, USA

L. Spector · L. Ding · R. Boldi
University of Massachusetts, Amherst, Amherst, MA 01002, USA

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_9


L. Spector et al.

9.2 Lexicase

In traditional approaches to genetic programming [27], individuals are selected to serve as parents, and thereby to produce offspring, on the basis of scalar fitness values. Usually these fitness values are measures of performance over a collection of training examples, which are often called “fitness cases” in the genetic programming literature. In some applications the different fitness cases that define a problem may present similar challenges to one another, and success or failure on any one case may be roughly as informative as success or failure on any other. But in many applications this will not be true. In some, edge cases of several kinds may be present, and some cases may require qualitatively different approaches than others. The numbers of cases that call for each approach may vary, and the number and nature of categories of cases may be unknown.

Lexicase selection [23, 37] is a parent selection method that prioritizes a single fitness case first and foremost when selecting each parent. It breaks ties with a second single fitness case, and then a third, and so on until a winner emerges.¹ If several potential parents remain after considering all cases then the final tie is broken randomly. In the standard version of the technique the sequence of fitness cases used is random, with a different shuffle of the cases used for selecting each parent.

¹ This lexicographic processing of fitness cases is the reason that lexicase selection is so named.

Lexicase selection has been shown to significantly improve the problem-solving power of genetic programming in settings ranging from digital circuit design to general software synthesis, and in some settings it has been shown to allow genetic programming systems to solve problems that could not be solved when using selection methods based on scalar fitness measures [14, 19, 20, 23, 24, 39].

Over evolutionary time, lexicase selection focuses on each particular case, each pair of cases, and more generally each subset of a problem’s fitness cases. It focuses on each in the sense that eliteness with respect to each will, with some shuffle of the cases, allow an individual to produce offspring. This is because each subset of the fitness cases may sometimes occur before all other cases in the shuffle, and thereby determine which individuals can be selected as parents.

Note that lexicase selection behaves differently than methods that select on the basis of generally good performance over many cases. Lexicase selection will, for example, often select individuals that are “specialists” in the sense that they are elite on one or a small number of cases, but atrociously bad on many others. This promotion of specialists appears to be connected to the problem-solving benefits of lexicase selection [18].

Lexicase selection also behaves differently than methods that adjust the influence that each case can have on an individual’s scalar fitness value, such as “implicit fitness sharing” [31]. Prior work has demonstrated problem-solving advantages of lexicase selection over implicit fitness sharing [23], with one explanation being that implicit fitness sharing cannot reward good performance on particular combinations of cases that are rarely handled well by the same individual.

Prior work has also considered the co-solvability of pairs of cases [28], along with mechanisms designed to maintain diversity with respect to user-specified qualities [33, 35]. These techniques can be considered to embrace the particularity design principle to some extent. While detailed comparisons of these techniques to lexicase selection are beyond the scope of this paper, we note that lexicase selection is perhaps both simpler to implement and more thoroughgoing in its particularity.

We define the “particularity” design principle as a mandate to prefer design choices, like those embodied in lexicase selection, that take all particular challenges and combinations of challenges posed by the environment seriously, and to explore their implications as independently as possible while avoiding averaging over multiple challenges. Often, in the history of the design of adaptive systems, averaging or other forms of aggregation have been considered necessary because the widely-used optimization methods operated only with scalar objective functions. But we now know that in many contexts there are alternatives, for example with lexicase selection or other many-objective optimization techniques. By attending more closely to the particular challenges posed by the environment, these methods may be able to adapt more quickly and successfully.
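The selection procedure described in this section (shuffle the cases, keep only the individuals that are elite on the first case, break ties with the next case, and finish with a random choice) can be sketched in a few lines. The data layout, with `errors[i][j]` holding the error of individual `i` on case `j` and lower values being better, and the function name are assumptions made for illustration.

```python
import random


def lexicase_select(errors, rng=random):
    """Return the index of one parent chosen by lexicase selection.

    errors[i][j] is the error of individual i on fitness case j (lower is
    better). A fresh random ordering of the cases is used for every call.
    """
    candidates = list(range(len(errors)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)
    for case in cases:
        # Keep only the candidates that are elite on this single case.
        best = min(errors[i][case] for i in candidates)
        candidates = [i for i in candidates if errors[i][case] == best]
        if len(candidates) == 1:
            break
    # Any tie remaining after all cases is broken randomly.
    return rng.choice(candidates)
```

With `errors = [[0, 9, 9], [9, 0, 9], [3, 3, 3]]`, each of the three individuals is elite on exactly one case, so each is selected whenever its case comes first in the shuffle. A selector based on total error would always pick the third individual and would never give the two specialists a chance to reproduce.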

9.3 Variance

The particularity of lexicase selection, meaning its promotion of good performance on each particular environmental challenge and on each particular combination of environmental challenges, contrasts not only with the common practice in genetic programming but also with a wide range of practices in machine learning more generally, where averaging and other forms of aggregation are ubiquitous.

Many machine learning methods do allow users to specify hyperparameters that adjust the “bias-variance tradeoff,” meaning the extent to which individual training examples influence a model’s behavior. High variance configurations allow each training example to have a large influence, but in many machine learning settings that means that the influence is on the behavior of a single model that is being trained. In evolutionary algorithms that operate on populations, by contrast, the influence can be on one among many approaches toward solutions that are being explored simultaneously in the population.

In a sense, particularist methods, when used in population-based algorithms, allow us to manage the bias/variance trade-off by refusing to sacrifice one for the other, demanding instead that we prioritize both the guidance provided by individual challenges (high variance) and the guidance provided by larger collections of challenges (high bias) in different parts of the population.

We note that higher population diversities often result from particularist methods, since different sub-populations are advantaged by attention to different combinations of features of the environment [16]. Prior studies suggest, however, that the advantages of these methods stem not from the maintenance of diversity per se, but from the fact that the contours of the diverse populations they produce reflect the diversity of the challenges posed by the environment [17]. That is, the advantages stem from particularity. The extent to which such advantages can be obtained in settings that don’t involve populations is a topic that deserves further study.

9.4 Epsilon

For some kinds of applications, the particularity of the original form of lexicase selection is too extreme. Consider, for example, continuous-valued symbolic regression applications. In this setting, the error for each individual on each case will be a real number, and it may be likely, depending on the problem and the genetic programming configuration, that no two individuals in the population will have exactly the same error for some particular case. When this happens there will be only one elite individual for the case in question, and there will never be ties that must be broken by performance on other cases. If this is true for a large number of cases then lexicase selection will usually select parents on the basis of only a single case, or of a small number of cases. This means that there will be little focus on individuals that do well on combinations of cases, and little guidance of the evolutionary process toward individuals that can perform well on all cases.

Fortunately, epsilon lexicase selection solves this problem [30]. By allowing not only the elite to survive each filtering step of lexicase selection, but also individuals that under-perform the elite only by a small amount, epsilon lexicase selection can outperform scalar fitness-based methods in several settings [29, 34].

Epsilon lexicase selection can be considered a “relaxation” of lexicase selection in the sense that the criteria for selection are less stringent. Other forms of relaxation have been explored, but not all of them appear to be advantageous [38]. We may speculate that those that will be advantageous will be attentive to the particularity of the problem environment both with respect to individual cases and with respect to combinations of cases.
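The relaxation can be expressed as a one-line change to the filtering step: instead of requiring equality with the best error on a case, a candidate survives if its error is within epsilon of the best. The sketch below assumes a single, fixed, user-supplied `epsilon`; variants in the literature commonly derive epsilon automatically for each case (for example from the spread of the population's errors), which is not shown here. The function name and data layout are hypothetical.

```python
import random


def epsilon_lexicase_select(errors, epsilon, rng=random):
    """Lexicase selection in which near-elite candidates also survive.

    errors[i][j] is the real-valued error of individual i on case j
    (lower is better). epsilon is a fixed tolerance (an assumption here).
    """
    candidates = list(range(len(errors)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)
    for case in cases:
        best = min(errors[i][case] for i in candidates)
        # Survive if within epsilon of the elite error on this case.
        candidates = [i for i in candidates if errors[i][case] <= best + epsilon]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)
```

Consider `errors = [[0.10, 0.50], [0.11, 0.20], [0.90, 0.10]]`. With `epsilon = 0` the second individual is never selected, because it is narrowly beaten on the first case and clearly beaten on the second. With `epsilon = 0.05` it ties with the first individual on the first case and then wins on the second, so good performance on the combination of cases is rewarded.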

9.5 Batched

Another form of relaxation of lexicase selection involves the grouping of cases into batches, within which averaging or some other form of aggregation is performed. Parents are selected on the basis of shuffled sequences of batches rather than shuffled sequences of single cases, with each step of filtering being based on comparisons of aggregate measures of performance over the batch.

Here the particularity of the method is applied not less stringently, as with epsilon lexicase selection, but rather more coarsely. This method proved successful in an application of lexicase selection in learning classifier systems, an evolutionary computing context quite different from genetic programming [1]. The success of batch lexicase selection in this setting suggests that intermediate levels of particularity may in some cases be helpful. It also suggests that the “particulars” upon which a particularist approach focuses need not be exactly the training examples provided by the problem environment. They might instead be “cases” that are derived from the training examples in some non-trivial way. Here they are simply averages over batches of training examples, but in principle, they might be derived from the training examples in other ways, some of which we discuss below.
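As a rough sketch of the batched variant, the code below shuffles the cases, cuts the shuffled list into batches of a given size, and then filters candidates on the mean error over each batch in turn. Re-forming the batches on every selection event and using the mean as the aggregate are assumptions; the source only states that cases are grouped into batches within which some form of aggregation is performed.

```python
import random


def batch_lexicase_select(errors, batch_size, rng=random):
    """Lexicase selection over batches of cases instead of single cases.

    errors[i][j] is the error of individual i on case j (lower is better).
    Candidates are filtered on their mean error within each batch.
    """
    n_cases = len(errors[0])
    cases = list(range(n_cases))
    rng.shuffle(cases)
    batches = [cases[k:k + batch_size] for k in range(0, n_cases, batch_size)]
    candidates = list(range(len(errors)))
    for batch in batches:
        mean_err = {i: sum(errors[i][c] for c in batch) / len(batch)
                    for i in candidates}
        best = min(mean_err.values())
        candidates = [i for i in candidates if mean_err[i] == best]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)
```

The batch size interpolates between two extremes: a batch size of 1 recovers ordinary lexicase selection, while a single batch containing every case reduces to choosing the individual with the best average error.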

9.6 Downsampled

What happens if we reduce the number of cases not by grouping them into batches over which we aggregate performance, but rather by using only the cases in a single batch, and considering them individually while ignoring all of the rest?

Downsampling is a general method by which data sets used for machine learning applications are decreased in size. Downsampling allows for larger systems to be used despite the entire data set being prohibitively expensive to enumerate. Furthermore, when downsampling is done every generation or iteration, it can help prevent overfitting as only portions of the training set are seen at a time, reducing the risk of memorization.

Because downsampling constrains the set of challenges that can be seen by an individual at a given time, it can help an adaptive system to attend to the particularity of the sampled subset of its problem environment. For this to be successful in solving the environment’s overarching problems, it is important that samples are changed sufficiently often, and that the lessons learned from some samples can be maintained while exploring lessons learned from others.

Another important effect of reducing the size of the training set is that every iteration becomes cheaper to perform. When downsampling to 10% of the size of the training set, a similar number of iterations could be 10% as expensive to perform. An evolutionary process could therefore be run for 10 times as long, or with 10 times as large a population, using the same computational budget.

When applied to lexicase selection, randomly downsampling training sets has been found to significantly improve problem-solving performance [15, 22, 25] when using the same computational budget as full lexicase selection. Selection schemes that select on the basis of aggregate measures seem to not benefit as much from the use of downsampling as lexicase selection does [2]. This may be because lexicase selection pays attention to particular challenges in the downsample, and is therefore able to maintain high levels of diversity that prevent premature convergence when running for a long time.

One might think that changing the downsample entirely every generation, as is done for random downsampling, could prevent effective learning. The intuition here is that the training set may be changed before the population really has a chance to get a foothold on the information that it provides. However, work has been done to show that, at least for genetic programming applied to certain benchmark program synthesis problems, the problem-solving power of these techniques is not hindered by rapid changes to the downsample [4]. The reason for this was attributed to the presence of synonymous or nearly synonymous cases that come in adjacent generations’ downsamples. Synonymous cases are cases that measure similar behavior and as such are passed by similar groups of individuals. If each downsample is likely to contain cases that are synonymous with cases in the previous and next downsamples, then there will be some consistency in the challenges presented by the environment over evolutionary time.
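The budget arithmetic in this section is easy to check with a small sketch. The helpers below draw a fresh random downsample of the case indices, restrict an error matrix to the sampled cases so that any selection method can then be applied to the reduced matrix, and count program evaluations per generation. All names are hypothetical, and counting one evaluation per individual per case is an assumption about how cost is measured.

```python
import random


def random_downsample(n_cases, rate, rng=random):
    """Pick a random subset of case indices; intended to be redrawn
    every generation. rate is the fraction of cases to keep."""
    k = max(1, round(rate * n_cases))
    return sorted(rng.sample(range(n_cases), k))


def restrict_to_sample(errors, sample):
    """Keep only the columns (cases) of the error matrix in the sample."""
    return [[row[c] for c in sample] for row in errors]


def evaluations_per_generation(pop_size, n_cases, rate=1.0):
    """Evaluations needed when each individual runs each sampled case."""
    return pop_size * max(1, round(rate * n_cases))
```

For a population of 1000 and 200 training cases, a full generation costs 200,000 evaluations, while a 10% downsample costs 20,000. The same budget therefore pays for ten times as many generations, or for a population ten times as large.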

9.7 Informed

One way of sharpening the focus of a downsampled selection scheme, as opposed to relaxing it, is to reduce the presence of synonymous cases. Synonymous cases are redundant as they share particularities with each other. Whilst including them might not be harmful on its own, they take the place of cases that might provide selection with more useful information (Table 9.1).

Informed downsampled lexicase selection is a method to automatically detect and maintain a downsample of cases that are individually particular or unique [3]. For example, the downsample would be filled with cases that measure qualitatively different behaviors in the individuals being selected. The downsamples are selected by analyzing how the population performs on training cases over the course of evolution. If two training cases are passed by the same groups of individuals, then these cases add no new information over each other. If two training cases are solved by disjoint sets of population members, these cases probably measure different behaviors, and are individually important to include in our training sets.

It turns out that constructing downsamples using this information further improves the success rate of genetic programming runs that use downsampled lexicase selection [2, 3]. This benefit is likely due to maintaining higher test case coverage over the course of a run [5]. Because lexicase selection is able to focus on the particularities in a training set, having specific cases missing results in the loss of certain ecological niches. When all (or as many as possible) of the particularities are represented in the sample, lexicase selection can maintain the niches and more effectively pursue paths to solutions.
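One way to turn the pairwise comparison described above into a downsample is a farthest-first traversal over the cases, where the distance between two cases is the number of individuals that pass one but not the other. The sketch below assumes this construction and a fully known pass/fail matrix; the function names are hypothetical, and estimating the matrix from only a fraction of the parents, or only every few generations, is not modeled.

```python
import random


def case_distance(solve_a, solve_b):
    """Hamming distance between two per-case pass/fail vectors."""
    return sum(a != b for a, b in zip(solve_a, solve_b))


def informed_downsample(passes, k, rng):
    """Pick k mutually distant cases.

    passes[c][i] is True if individual i passes case c.
    """
    n_cases = len(passes)
    chosen = [rng.randrange(n_cases)]
    # Distance from every case to its closest already-chosen case.
    min_dist = [case_distance(passes[c], passes[chosen[0]])
                for c in range(n_cases)]
    while len(chosen) < min(k, n_cases):
        remaining = [c for c in range(n_cases) if c not in chosen]
        far = max(min_dist[c] for c in remaining)
        nxt = rng.choice([c for c in remaining if min_dist[c] == far])
        chosen.append(nxt)
        min_dist = [min(d, case_distance(passes[c], passes[nxt]))
                    for c, d in enumerate(min_dist)]
    return chosen
```

Cases passed by exactly the same individuals are at distance zero from each other, so once one of them is in the sample the others are the last to be added; cases passed by disjoint groups are at maximal distance and are added first.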

9 Particularity


Table 9.1 Number of generalizing solutions (successes) out of 100 runs achieved by PushGP on the test set for a variety of program synthesis benchmark problems as reported in [3]. Results compare the performance of lexicase selection (Lex), informed downsampled lexicase selection (IDS), and randomly downsampled lexicase selection (Rnd) on these problems. DS rate is the downsampling rate, i.e., the proportion of the test cases that appear in the sample each generation. The parent rate is the proportion of parents used to estimate the niche maintained by a sample, and the generational interval is the number of generations between these estimations. Problem names in bold face are those where an informed downsampling approach performs the best out of all the techniques. Results marked with an asterisk (*) are significantly better than the corresponding run with random downsampling at a $p$ …

        … n_pop do
            P ← P \ worst(P, q)
        end while
    end while
    return best(P, q)
end function

▷ initialization ▷ generations ▷ build offspring ▷ crossover ▷ mutation ▷ merge parents and offspring

$\theta$, and the Gaussian mutation, which perturbs every element of the parent with Gaussian noise with zero mean and standard deviation $\sigma_{\mathrm{mut}}$. We drive the evolution with a fitness function $q$ which measures the quality of a candidate solution as a single value in $\mathbb{R}$. We detail in Sect. 11.5 what exactly $q$ measures: in brief, it operates by running a simulation where the robot equipped with the controller under evaluation, i.e., the candidate $f$ corresponding to the MLP defined by $\theta$, faces a task. The larger the value of $q(f)$, the better the solution. In our experiments, and according to our previous experience, we set $p_{\mathrm{xover}} = 0.8$, $\sigma_{\mathrm{xover}} = \sigma_{\mathrm{mut}} = 0.35$, $n_{\mathrm{pop}} = 100$, $n_{\mathrm{tour}} = 5$, and $n_{\mathrm{evals}} = 10000$.
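Two of the building blocks named in this passage, tournament selection of size $n_{\mathrm{tour}}$ and Gaussian mutation, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, individuals are assumed to be plain lists of floats, and drawing the tournament contestants without replacement is an assumption rather than something the text states.

```python
import random


def tournament(population, n_tour, q, rng):
    """Return the best of n_tour random individuals (larger q is better).

    Assumption: contestants are drawn without replacement.
    """
    return max(rng.sample(population, n_tour), key=q)


def gaussian_mutation(theta, sigma_mut, rng):
    """Add zero-mean Gaussian noise with std sigma_mut to every element."""
    return [w + rng.gauss(0.0, sigma_mut) for w in theta]
```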

11 GP for Continuous Control: Teacher or Learner …


11.4.2 Array of Regression Trees Optimized with GP

11.4.2.1 Representation of $f$

In this variant, we use an array of $m$ regression trees for encoding the multivariate function $f$. As discussed in Sect. 11.1, trees representing numerical functions through the combination of simple arithmetic operators are a rather successful way of expressing solutions to regression problems, in particular when the interpretability of the solution is important [1, 21]. In the context of this study, we define a regression tree for a regression problem $\mathbb{R}^n \to \mathbb{R}$ as a tree where terminal nodes are labeled with either a variable in $X = \{x_1, \dots, x_n\}$ or a constant in $C \subseteq \mathbb{R}$ and non-terminal nodes are labeled with an arithmetic operator in $O$. We assume that operators in $O$ may have different arities and that regression trees are valid in terms of arity: that is, if a node is labeled with an operator $o$ with arity $n_o$, then it has exactly $n_o$ child nodes. In this work, based on our previous experience, we set $O = \{+, -, \times, \div^*\}$, where all the operators are binary and $\div^*$ is the protected division, and $C = \{0, 0.5, 1, \dots, 5\}$. More sophisticated options are known in the literature for incorporating numerical constants in regression trees, such as ephemeral random constants [31, 39]: for the sake of simplicity, we do not employ them here and leave their investigation to future work. We denote by $T_{n,O,C}$ the set of all the possible trees defined as above. For expressing a multivariate function $f : \mathbb{R}^n \to \mathbb{R}^m$ suitable to be the controller of a VSR, we use an array $\mathbf{t} = (t_1, \dots, t_m)$ of $m$ trees in $T_{n,O,C}$, rather than a single tree. Hence, we define $f$ as

$$\mathbf{y} = f(\mathbf{x}) = \left[\tanh(t_1(\mathbf{x})) \;\dots\; \tanh(t_m(\mathbf{x}))\right], \tag{11.3}$$

where $t_i(\mathbf{x}) \in \mathbb{R}$ is the result of the application of the tree $t_i$ to the point $\mathbf{x} \in \mathbb{R}^n$. We apply $\tanh$ to the result of the application of the tree to ensure that the output of $f$ is in $[-1, 1]^m$, as required by the VSR. We denote by $T_{n,m,O,C}$ the set of all the possible arrays of $m$ trees, i.e., $T_{n,m,O,C} = T_{n,O,C}^m$. We remark that the trees in $\mathbf{t}$ are independent from each other.
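As a minimal sketch of this representation, the code below evaluates an array of trees and applies $\tanh$ to each output, as in Eq. (11.3). Several details are assumptions made for illustration: trees are encoded as nested tuples, variables are indexed from 0 (so $x_1$ is `("x", 0)`), and the protected division returns 1.0 for a near-zero divisor, a common convention that the text does not specify.

```python
import math


def protected_div(a, b):
    """Division that never raises. Assumption: 1.0 for a near-zero divisor."""
    return a / b if abs(b) > 1e-9 else 1.0


OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": protected_div,
}


def eval_tree(tree, x):
    """Evaluate one regression tree on the input vector x.

    A tree is a numeric constant, a variable ("x", i), or (op, left, right).
    """
    if isinstance(tree, (int, float)):
        return float(tree)
    if tree[0] == "x":
        return x[tree[1]]
    op, left, right = tree
    return OPS[op](eval_tree(left, x), eval_tree(right, x))


def controller(trees, x):
    """Eq. (11.3): apply tanh to the output of each of the m trees."""
    return [math.tanh(eval_tree(t, x)) for t in trees]
```

For instance, `controller([("+", ("x", 0), 1.0), ("/", ("x", 1), 0.0)], [0.5, 2.0])` evaluates the first tree to 1.5 and the second, through the protected division, to 1.0, before squashing both with $\tanh$.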

11.4.2.2 Optimization of $f$

Optimizing through evolution a univariate function represented as a regression tree is exactly the problem that is solved by tree-based GP. In this work, however, we search for arrays of trees, rather than single trees: the search space is $T_{n,m,O,C}$ instead of $T_{n,O,C}$, as for “normal” GP. We hence employ an EA which is an adaptation of a standard version of GP to accommodate this change. The EA we propose, which we call multi-GP for brevity and show in Algorithm 8, is similar to the one used for numerical vectors (shown in Algorithm 7), with three key differences: (a) it uses an initialization procedure that produces arrays of trees instead of numerical vectors; (b) it uses three genetic operators (two crossovers and one mutation), suitable for $T_{n,m,O,C}$; (c) it incorporates a mechanism for favoring the diversity in the population. The latter peculiarity, which is explained below in detail, is particularly important since the search space $T_{n,m,O,C}$ is discrete, differently from $\mathbb{R}^p$ of the previous case, and hence the risk of premature convergence due to lack of diversity is not negligible [34].

In particular, our multi-GP initializes the population by applying the ramped half-and-half procedure $m$ times and combining the resulting $n_{\mathrm{pop}} m$ trees in $n_{\mathrm{pop}}$ $m$-sized arrays of trees. Concerning crossover, multi-GP uses a crossover (with probability $\frac{1}{2} p_{\mathrm{xover}}$) that simply consists in applying, element-wise, the standard tree crossover to the pair of trees coming from the two parents, and another crossover (again with probability $\frac{1}{2} p_{\mathrm{xover}}$) that builds the child taking trees from parents with equal probability (a form of uniform crossover operating over arrays of trees). Concerning mutation, multi-GP applies the standard tree mutation element-wise, i.e., to each of the $m$ trees. Both the initialization procedure and the genetic operators output arrays of trees in which each individual tree height is guaranteed to be in $[h_{\min}, h_{\max}]$. Finally, concerning the mechanism for favoring the diversity in the population, multi-GP simply avoids inserting in the offspring those newly generated individuals which are already part of the offspring or the parent population, i.e., those which are duplicates. According to this mechanism, inspired by [2], multi-GP re-tries to generate individuals up to $n_{\mathrm{attempts}}$ times (this parameter is not shown in Algorithm 8 for brevity); if it fails, the duplicate individual is accepted. In our experiments, and according to our previous experience, we set $p_{\mathrm{xover}} = 0.8$, $n_{\mathrm{pop}} = 100$, $n_{\mathrm{tour}} = 5$, $n_{\mathrm{attempts}} = 100$, $h_{\min} = 3$, $h_{\max} = 8$, and $n_{\mathrm{evals}} = 10000$.

E. Medvet and G. Nadizar

Algorithm 8 The multi-GP EA used for optimizing arrays of regression trees. one-of(a, b) gives either a or b, with equal probability.

function evolve()
    P ← multi-ramped-half-and-half(n_pop, h_min, h_max)    ▷ initialization
    n ← n_pop
    while n < n_evals do
        P' ← ∅    ▷ build offspring
        while |P'| < n_pop do
            n ← n + 1
            r ← sample(U(0, 1), 1)
            if r ≤ ½ p_xover then    ▷ element-wise standard tree crossover
                t_1 ← tournament(P, n_tour, q)
                t_2 ← tournament(P, n_tour, q)
                t ← (tree-xover(t_{1,1}, t_{2,1}), …, tree-xover(t_{1,m}, t_{2,m}))
            else if r ≤ p_xover then    ▷ uniform crossover
                t_1 ← tournament(P, n_tour, q)
                t_2 ← tournament(P, n_tour, q)
                t ← (one-of(t_{1,1}, t_{2,1}), …, one-of(t_{1,m}, t_{2,m}))
            else    ▷ standard tree mutation
                t_1 ← tournament(P, n_tour, q)
                t ← (tree-mut(t_{1,1}), …, tree-mut(t_{1,m}))
            end if
            if t ∉ P ∪ P' then
                P' ← P' ∪ {t}
            end if
        end while
        P ← P ∪ P'    ▷ merge parents and offspring
        while |P| > n_pop do
            P ← P \ worst(P, q)
        end while
    end while
    return best(P, q)
end function
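The offspring-building step, including the duplicate-avoidance mechanism with $n_{\mathrm{attempts}}$ re-tries that Algorithm 8 omits, might be sketched as follows. This is not the authors' JGEA code: individuals are assumed to be tuples of hashable trees, the single-tree operators `tree_xover` and `tree_mut` are hypothetical callables supplied by the caller, the tournament is assumed to draw without replacement, and the bookkeeping of fitness evaluations is left out.

```python
import random


def make_child(pop, q, p_xover, n_tour, tree_xover, tree_mut, rng):
    """Build one array of trees with one of the three multi-GP operators."""
    def tournament():
        return max(rng.sample(pop, n_tour), key=q)

    r = rng.random()
    if r <= 0.5 * p_xover:      # element-wise standard tree crossover
        t1, t2 = tournament(), tournament()
        return tuple(tree_xover(a, b, rng) for a, b in zip(t1, t2))
    if r <= p_xover:            # uniform crossover over the array
        t1, t2 = tournament(), tournament()
        return tuple(rng.choice((a, b)) for a, b in zip(t1, t2))
    t1 = tournament()           # element-wise standard tree mutation
    return tuple(tree_mut(a, rng) for a in t1)


def build_offspring(pop, q, n_pop, n_attempts, **operator_args):
    """Fill the offspring, re-trying up to n_attempts (>= 1) times per child."""
    offspring = []
    while len(offspring) < n_pop:
        for _ in range(n_attempts):
            child = make_child(pop, q, **operator_args)
            if child not in pop and child not in offspring:
                break
        # After n_attempts failures the last (duplicate) child is accepted.
        offspring.append(child)
    return offspring
```

Passing the single-tree operators as arguments keeps the sketch independent of how trees are stored; any standard subtree crossover and mutation that respect the height bounds could be plugged in.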

11.4.3 Regression Graphs Optimized with GraphEA

11.4.3.1 Representation of $f$

In this variant, we use a graph as the multivariate function, inspired by the approach proposed in [15]. More specifically, we consider directed graphs with unlabeled edges whose nodes can be of four types: input, constant, hidden, or output. For a multivariate regression problem $\mathbb{R}^n \to \mathbb{R}^m$, input nodes are labeled with variables in $X = \{x_1, \dots, x_n\}$, constant nodes are labeled with constants in $C \subseteq \mathbb{R}$, hidden nodes are labeled with arithmetic operators in $O$, and output nodes are labeled with variables in $Y = \{y_1, \dots, y_m\}$. We enforce a few constraints on the structure of the graphs. Each graph $g$ has to: (a) be acyclic; (b) contain exactly $|X| = n$ input nodes, each labeled with one different variable in $X$; (c) contain exactly $|C|$ constant nodes, each labeled with one different constant in $C$; (d) contain exactly $|Y| = m$ output nodes, each labeled with one different variable in $Y$; (e) have no incoming edges to input and constant nodes, no outgoing edges from and exactly one incoming edge to output nodes, and the correct number of incoming edges (corresponding to the arity of the label) to each hidden node. We denote by $G_{n,m,O,C}$ the set of all the graphs that meet these requirements. When we compute the output $\mathbf{y} \in \mathbb{R}^m$ from the input $\mathbf{x} \in \mathbb{R}^n$ using a graph $g \in G_{n,m,O,C}$, we proceed as we do for regression trees. For each $j$-th element $y_j$ of $\mathbf{y}$, we consider the tree $t \in T_{n,O,C}$ defined by the subgraph starting from the output node labeled with $y_j$ and following only incoming edges: then, we compute the value of $y_j$ using that $t$. Similarly to the case of arrays of trees, we use $\tanh$ on the output of $t$ to make it fall in $[-1, 1]$. In this work, based on our experience and on [15], we set $C = \{0, 0.5, 1, \dots, 5\}$, as for the arrays of trees case, and $O = \{+, -, \times, \div^*, \log^*\}$, where $\log^*$ is the protected logarithm with unary arity.
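The evaluation procedure just described can be sketched as a memoized walk backwards from each output node. The encoding below (a dictionary of node labels plus a dictionary of ordered predecessor lists) is hypothetical and is not the data structure used by GraphEA; the behavior of the protected operators near zero is likewise an assumption.

```python
import math


def protected_div(a, b):
    # Assumption: 1.0 is returned for a near-zero divisor.
    return a / b if abs(b) > 1e-9 else 1.0


def protected_log(a):
    # Assumption: log of the absolute value, with 0.0 returned near zero.
    return math.log(abs(a)) if abs(a) > 1e-9 else 0.0


OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": protected_div, "log": protected_log}


def eval_graph(nodes, preds, x):
    """Compute y from x on an acyclic graph.

    nodes maps a node id to ("in", i), ("const", c), ("op", name) or
    ("out", j); preds maps a node id to its ordered predecessor ids.
    """
    cache = {}

    def value(node):
        if node not in cache:
            kind, arg = nodes[node]
            if kind == "in":
                cache[node] = x[arg]
            elif kind == "const":
                cache[node] = arg
            elif kind == "out":
                cache[node] = math.tanh(value(preds[node][0]))
            else:
                cache[node] = OPS[arg](*[value(p) for p in preds[node]])
        return cache[node]

    outputs = sorted((arg, node) for node, (kind, arg) in nodes.items()
                     if kind == "out")
    return [value(node) for _, node in outputs]
```

Because intermediate values are cached per node, a hidden node that feeds several outputs is computed once and reused, which is the kind of sharing that an array of independent trees cannot express.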

11.4.3.2 Optimization of $f$

We search the space $G_{n,m,O,C}$ of graphs, driven by the fitness function $q$, by means of the GraphEA algorithm [15]. In brief, GraphEA evolves a fixed-size population of graphs using a number of unary genetic operators (i.e., mutations defined over graphs) in the reproduction phase. In order to protect the innovation introduced by those operators, GraphEA employs a speciation mechanism, inspired by the one of NEAT [35]: as a side effect, this mechanism favors the diversity in the population. Moreover, GraphEA enforces structural constraints on graphs by performing the corresponding validity checks: whenever a check fails after the generation of a new individual, that individual is discarded and a new one is built. We refer the reader to the cited paper for further details. In our experiments and following [35], we set $n_{\mathrm{pop}} = 100$ and $n_{\mathrm{evals}} = 10000$; for the other parameters, we used the default values.

The rationale for considering also graphs as a representation of $f$ is that they may, potentially, be more able, with respect to arrays of trees, to capture dependencies among output variables. As an example, consider the case where $n = 3$, $m = 2$ and the target function is defined by $y_2 = y_1 + 1$ and $y_1$ = 2x3 xx21 −1 +1. With arrays of trees, the evolution should find an individual where the first and second trees are largely similar, despite having no way to make one tree in an individual influence the other tree. With graphs, the evolution should “simply” find a graph where the same subgraph is exploited twice to connect two output nodes. Intuitively, hence, evolution of graphs should be able to benefit more from the modularity of the problem, if any. Despite this interesting premise, the effective and efficient evolution of graphs encoding multivariate functions has not reached complete maturity: indeed, several approaches have been and are being proposed for evolving graphs [5, 20, 22].

11.5 Experiments and Results

We performed a twofold experimental campaign. First, we aimed at verifying whether multi-GP is on par with the widespread NE (i.e., MLP+GA) approach for evolving controllers for VSRs; for placing these results in a broader context, we also included GraphEA in the comparison. We describe these experiments in Sect. 11.5.1. Second, for gaining deeper insights into the experimental findings, we employed the same three EAs (NE, multi-GP, GraphEA) as learners from the controllers optimized through direct evolution, which acted as teachers: we describe these experiments in Sect. 11.5.2.

In both cases, we targeted the problem of optimizing a controller for a biped VSR that faces the task of locomotion. Since we considered a 2-D scenario, locomotion is indeed directed locomotion, i.e., the VSR has to run as fast as possible along the positive $x$-direction. We used the average $x$-velocity $v_x$ of the VSR during a simulation of 30 s (simulated time) as fitness, i.e., we set $q(f) = v_x$. For computing $v_x$, we simply considered the distance between the $x$-position of the VSR center of mass taken at $t = 0$ s and its $x$-position taken at $t = 30$ s.
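In code, this fitness reduces to a displacement divided by the simulated time. The helper below is a hypothetical illustration, not part of 2-D-VSR-Sim; it assumes that the displacement is taken with its sign, so that a robot moving backwards obtains a negative velocity.

```python
def average_x_velocity(x_start, x_end, duration_s=30.0):
    """Average x-velocity: signed displacement of the center of mass over time."""
    return (x_end - x_start) / duration_s
```

For example, a center of mass that moves from $x = 1$ m to $x = 83.5$ m in 30 s yields `average_x_velocity(1.0, 83.5) == 2.75`, i.e., 2.75 m/s.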


Fig. 11.1 Frames of a VSR with an evolved controller (here, the MLP evolved with the random seed 1 and $n_{\mathrm{comm}} = 1$) performing locomotion on the uneven terrain. Frames are taken every 0.2 s starting from $t = 5$ s. The color of each voxel encodes the current expansion/contraction: closer to orange for contraction (i.e., current area lower than rest area), closer to blue for expansion, closer to yellow for area at rest value

For increasing the difficulty of the task, we made the VSR run on a slightly uneven, rather than a flat, terrain. This way, controllers that better exploit the sensors the voxels are equipped with should be more capable of adapting to the terrain unevenness, eventually being faster. As VSR body, we used a polyomino composed of 10 voxels arranged in a biped-like shape. We recall that, since the VSR uses a homogeneous distributed controller (see Sect. 11.3.2), the input and output space sizes $n$ and $m$ are independent from the number of voxels in the VSR. We experimented with two values for the number $n_{\mathrm{comm}}$ of communication channels shared among voxels, namely $n_{\mathrm{comm}} = 1$ and $n_{\mathrm{comm}} = 3$. Differently from the number of voxels in the VSR, $n_{\mathrm{comm}}$ does have an impact on $n$ and $m$, and hence on the size of the search space. Figure 11.1 shows some frames captured during a simulation of one VSR equipped with an evolved controller. The figure shows the shape of the VSR body and the unevenness of the terrain, and makes it possible to appreciate the kind of synergic contraction and expansion that results, in the end, in an effective gait. We performed the experiments using 2-D-VSR-Sim [17] for the simulation and JGEA [18] for the evolution: the two software frameworks are glued together and available at https://github.com/ericmedvet/2d-robot-evolution. We made all our results publicly available at https://github.com/ericmedvet/2023-GPForContinuousControlAndLearning.

11.5.1 Direct Evolution of the Controller

For this part of the experimental campaign, we applied the three EAs described above to the problem of evolving a controller (i.e., a numerical vector $\theta \in \mathbb{R}^p$ for NE, an array of trees $\mathbf{t} \in T_{n,m,O,C}$ for multi-GP, a graph $g \in G_{n,m,O,C}$ for GraphEA) for the task of locomotion. We performed 10 evolutionary runs for each EA by varying the random seed.


Fig. 11.2 Median and inter-quartile range (across 10 runs) of the velocity $v_x^\star$ of the best individual along fitness evaluations. Multi-GP is not worse than NE only with larger search space, i.e., with $n_{\mathrm{comm}} = 3$; GraphEA is always the worst performer

Figure 11.2 reports the results of this experiment. It shows the fitness of the best individual in the population (shortly, the best fitness $v_x^\star$) during the evolution, one plot for each of the two values of $n_{\mathrm{comm}}$. Several interesting observations may be made from Fig. 11.2.

First, both NE and multi-GP appear to be able to evolve effective controllers within the budget of $n_{\mathrm{evals}} = 10000$ fitness evaluations. This claim is supported by the fact that (a) for these two EAs, the best fitness $v_x^\star$ curve reaches a stable value in the final part of the evolution and (b) the absolute values of $v_x^\star$ are on par with those obtained by other studies with the same kind of robots (e.g., [26]). Moreover, we visually inspected the behaviors of a few of the evolved VSRs and verified that they are well capable of expressing effective gaits. We make the videos of the VSRs obtained with the first random seed and $n_{\mathrm{comm}} = 1$ available at https://github.com/ericmedvet/2023-GPForContinuousControlAndLearning/tree/main/videos.

Second, multi-GP appears to be on par with NE with $n_{\mathrm{comm}} = 3$ and slightly worse with $n_{\mathrm{comm}} = 1$. For the former case, multi-GP is as effective and as efficient as NE; that is, it achieves the same values of $v_x^\star$ in the same number of evaluations. Interestingly, with the smaller search space corresponding to $n_{\mathrm{comm}} = 1$, multi-GP achieves lower values of final $v_x^\star$ than with $n_{\mathrm{comm}} = 3$, while NE does the opposite. We attempt to explain this finding with the fact that a larger number of communication channels makes reactive behaviors easier to achieve through large values for the exchanged communication; in turn, these large values are easier to obtain with multi-GP, which can exploit operators (such as $\div^*$) that can easily produce large outputs from small inputs. In other words, trees are maybe better at doing a sort of bang-bang control, i.e., at switching abruptly the actuation and communication values in a way that results in an effective overall gait. Indeed, we verified, by visually inspecting the videos of the evolved robots, that those controlled with trees produce in general movements which are less smooth than those produced by robots controlled by MLPs.


Fig. 11.3 Median and inter-quartile range (across 10 runs) of the solution size of the best individual along fitness evaluations. Solutions obtained with multi-GP converge to a stable size, apparently independent from the size of the search space, while those obtained with GraphEA do not

Third, differently from NE and multi-GP, GraphEA struggles at evolving good controllers: at the end of the evolution it reaches much lower values for $v_x^\star$. Moreover, the $v_x^\star$ curve does not seem to have reached a plateau. We hypothesize that the reason for this low effectiveness may lie in the default values for the speciation mechanism incorporated in GraphEA, which result in a slower evolution.

For supporting the interpretation of the lower effectiveness of GraphEA and for gaining more insights, we show in Fig. 11.3 the size of the best solution in the population during the evolution. For NE, the size is $|\theta|$ for every individual; for multi-GP, the size is the sum of the sizes of the trees in the array; for GraphEA, the size is the sum of the number of nodes and edges in the graph. Two interesting observations can be made by looking at Fig. 11.3. First, the size of the best arrays of trees for multi-GP appears to converge to a stable value after approximately 5000 fitness evaluations. Instead, the size of graphs in GraphEA never stops increasing: we interpret this as a sign that this EA was much slower, i.e., less efficient, than the other two in optimizing solutions for the case study considered here. Second, the size of the solutions evolved with multi-GP appears to be just slightly larger for the $n_{\mathrm{comm}} = 3$ case with respect to the $n_{\mathrm{comm}} = 1$ case. Since the size of an array of trees is given by the sum of the sizes of the composing trees, and considering that there are $m = 4$ trees in the former case and $m = 2$ in the latter, this means that the trees for $n_{\mathrm{comm}} = 3$ are in general simpler. Hence, an effective behavior may be obtained through simpler elaboration of the inputs.


11.5.2 Offline Imitation Learning

After having compared the three EAs in their ability to evolve effective controllers through direct experience of the environment where the robot is immersed, we investigated experimentally the possibility of obtaining a controller by making it “behave like” another, pre-evolved controller. For this aim, we proceeded as follows:

1. we took the controllers evolved in the previous experiment and, for each of them, performed one simulation saving a dataset containing all the input-output pairs it handled;
2. for each dataset and each EA, we performed a few evolutionary optimizations for obtaining a multivariate function that fits the dataset, driving the evolution with a fitness function measuring the error on the dataset (to be minimized);
3. we took each of the functions obtained at the previous step and put it inside a VSR, as a controller, and measured the velocity $v_x$ it achieved in a simulation.

Concerning the first step, and considering that we actually use $f$ inside the controller every 0.2 s, we obtained a dataset $(\mathbf{x}_i, \mathbf{y}_i)_i$ of $10 \cdot 30 \cdot 5 = 1500$ input-output pairs for each controller evolved in the first experiment (10 voxels, 30 s of simulation, 5 applications of $f$ per second). We further reduced the size of each dataset by even sampling to $n_d = 375$ pairs. This way, we obtained 10 datasets for each of the three EAs, i.e., a total of $30 + 30$ datasets given by the two values for $n_{\mathrm{comm}}$.

For the second step, we used the very same EAs we used in the first experiment. However, we here used the average mean squared error (aMSE), i.e., the MSE on each output variable averaged across the output variables, as fitness function:

$$q(f) = \frac{1}{m} \sum_{j=1}^{j=m} \frac{1}{n_d} \sum_{i=1}^{i=n_d} \left(f_j(\mathbf{x}_i) - y_{i,j}\right)^2, \tag{11.4}$$

where $f_j(\mathbf{x})$ is the $j$-th element of the vector in $\mathbb{R}^m$ obtained by applying $f$ on $\mathbf{x} \in \mathbb{R}^n$. We applied each EA to each dataset 5 times, by varying the random seed, hence performing $5 \cdot 3 \cdot 30 = 450$ runs for the case of $n_{\mathrm{comm}} = 1$ and 450 runs for the case of $n_{\mathrm{comm}} = 3$; we kept the two cases separated since the dimensionality of the datasets, and hence of the function to be optimized to fit them, was different. We call teacher the EA that evolved the controller which generated each dataset and learner the EA that we used to obtain a multivariate function given a dataset. Finally, for the third step, we took each of the $450 + 450$ evolved functions and used it inside a controller of a VSR for which we measured the velocity $v_x$ it achieved in a 30 s simulation.

We recall that the key difference between evolving a function through simulation and evolving one to fit a given dataset is that, in the first case, each function being evaluated is potentially able to obtain input-output pairs that facilitate the evolution. Differently, in the second case, the input-output pairs are given and cannot be changed. For this reason, we call offline imitation learning this way of obtaining a function for controlling a robot. It is offline because the functions being evolved are never put inside a controller, and it is a form of imitation because they can only attempt to reproduce the behavior of the teacher: they cannot experience directly the outcome of the actions they generate. This difference between direct evolution and offline imitation may be of practical relevance if what the function is expected to do is to generate an effective behavior on a generally large input space, rather than “simply” approximate it on a few points.
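The dataset reduction and the aMSE fitness of Eq. (11.4) can be sketched as follows. The function names are hypothetical, and “even sampling” is assumed here to mean taking pairs at a regular stride (every fourth pair when going from 1500 to 375); the chapter does not spell out the exact procedure.

```python
def even_sample(pairs, n_d):
    """Keep n_d pairs taken at a regular stride (assumed meaning of even sampling)."""
    step = len(pairs) / n_d
    return [pairs[int(i * step)] for i in range(n_d)]


def amse(f, dataset):
    """Eq. (11.4): per-output MSE, averaged over the m outputs.

    dataset is a list of (x, y) pairs; f(x) returns a list of m floats.
    """
    n_d = len(dataset)
    m = len(dataset[0][1])
    total = 0.0
    for x, y in dataset:
        pred = f(x)
        total += sum((pred[j] - y[j]) ** 2 for j in range(m))
    return total / (m * n_d)
```

Since both sums in Eq. (11.4) are finite, swapping their order and dividing once by $m \cdot n_d$ gives the same value as averaging the $m$ per-output MSEs, which is what the loop does.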

11.5.2.1 Evolution of Multivariate Functions from Data

Figure 11.4 displays the results of the second step in the form of six line plots showing how the fitness $\mathrm{aMSE}^\star$ of the best solution changed during the evolution. The figure shows one plot for each combination of teacher EA and value of $n_{\mathrm{comm}}$; within each plot, the line color corresponds to the learner EA. The first observation is that, in general, all three EAs are decent learners, i.e., the average MSE decreases steadily during the evolution. In most of the cases, and in particular for multi-GP, the budget of $n_{\mathrm{evals}} = 10000$ appears to suffice to reach a plateau. This is not particularly surprising as all the three EAs have been shown to be able to cope with regression problems.

Fig. 11.4 Median and inter-quartile range (across 10 runs) of the average MSE of the best individual along fitness evaluations, one column per teacher, one row per $n_{\mathrm{comm}}$. NE is the best learner, in particular when the teacher is NE


There are, however, differences among the combinations of teacher and learner. As learner, NE looks to be the most effective: the average MSE achieved at the end of the evolution is the lowest one in 4 out of 6 cases, with 3 of them being sharply better. NE struggles only with $n_{\mathrm{comm}} = 3$ when the teacher is not NE. This finding is somewhat consistent with the observations drawn in Sect. 11.5.1: we believe that the functions evolved by multi-GP (and GraphEA) obtain effective gaits in a rather different way than those evolved with NE. Very likely, this is due to the substantially different way the two representations model a multivariate function. For what concerns multi-GP, Fig. 11.4 shows that it is able to reach pretty quickly a decent $\mathrm{aMSE}^\star$ on the data, but then often stagnates. Interestingly, its greatest average MSE is observed when the teacher is multi-GP and with $n_{\mathrm{comm}} = 3$. That is, as learner multi-GP is not particularly good at imitating the behaviors it produced as teacher. Finally, similarly to what was observed from the results of the direct evolution, GraphEA appears to be noticeably worse than the other two EAs. It is in general slower and achieves the greatest average MSE in all but one case.

11.5.2.2 Learned Functions Used as Controllers

We took the $450 + 450$ functions evolved in the previous experiment and put them inside VSRs for assessing their ability to generate behaviors that are effective for locomotion, i.e., good gaits. We show the results of this evaluation in Table 11.1. The table shows, for each combination of teacher EA, learner EA, and value of $n_{\mathrm{comm}}$, the 50-th (i.e., the median), 75-th, and 90-th percentile of the 50 $v_x$ values (5 random seeds and 10 datasets) obtained through simulations. We report also the 75-th and 90-th percentile, instead of just the median, because the observed velocities were in general very low (see, for a comparison, the range of the $y$-axis of Fig. 11.2).

Table 11.1 $x$-velocity $v_x$ obtained by the learned controllers for different teachers (groups of columns) and values of $n_{\mathrm{comm}}$ (groups of rows). For each combination of teacher, learner, and $n_{\mathrm{comm}}$ value, we show the 50-th (i.e., the median), 75-th, and 90-th percentile of the 50 $v_x$ values obtained for that combination. Only the functions learned with NE achieve decent $v_x$ values when put inside a controller

                      Teacher →  NE                  |  Multi-GP            |  GraphEA
n_comm   Learner ↓    50-th   75-th   90-th  |  50-th   75-th   90-th  |  50-th   75-th   90-th
1        NE           0.12    1.03    1.79   |  0.04    0.17    0.54   |  0.03    0.16    0.56
1        Multi-GP     0.00    0.02    0.07   |  0.04    0.14    0.24   |  0.02    0.06    0.11
1        GraphEA      0.00    0.01    0.06   |  0.05    0.06    0.13   |  0.00    0.04    0.10
3        NE           0.04    0.17    0.70   |  0.12    0.25    0.48   |  0.01    0.08    0.23
3        Multi-GP     0.00    0.00    0.01   |  0.15    0.20    0.36   |  0.02    0.20    0.35
3        GraphEA      0.00    0.00    0.01   |  0.00    0.04    0.17   |  0.00    0.01    0.20


The main observation that we draw from Table 11.1 is that offline imitation learning does not, in general, work. None of the combinations achieves a median $v_x$ value which is on par with the median values obtained through direct evolution. This happens regardless of the learner and teacher. The greatest median $v_x$ is the one obtained when multi-GP is both the teacher and the learner and for $n_{\mathrm{comm}} = 3$; we recall, however, that $v_x$ is here $0.15\,\mathrm{m\,s^{-1}}$, while it was $\approx 2.75\,\mathrm{m\,s^{-1}}$ with direct evolution. If we look at the best controllers learned through offline imitation learning, i.e., at the 90-th percentile columns, we see that the best combination is the one where NE learns “from itself” with $n_{\mathrm{comm}} = 1$. We recall that the case of NE with $n_{\mathrm{comm}} = 1$ was also the best performing one for direct evolution. Nevertheless, the difference is still apparent: $1.79\,\mathrm{m\,s^{-1}}$ with offline imitation learning (and considering the 90-th percentile) and $\approx 3\,\mathrm{m\,s^{-1}}$ with direct evolution (median value, though rather representative, due to the short inter-quartile range). We visually inspected the behaviors obtained by controllers learned with offline imitation and found that they were in general very sloppy: the VSR was able to produce a gait-like movement, but it appeared somehow “out-of-sync” and definitely not effective in making the robot advance.

11.6 Discussion

We compared experimentally NE, multi-GP, and GraphEA in their ability to obtain an effective controller, in the form of a multivariate function $f : \mathbb{R}^n \to \mathbb{R}^m$, for the task of locomotion performed by a VSR. We used each of the three EAs in two ways: (a) by driving it with a fitness function that actually performs a simulation to measure how well $f$ behaves in practice (direct evolution) and (b) by driving it with a fitness function that measures how well $f$ fits the data produced by another function, in turn obtained through direct evolution (offline imitation learning). We found that, broadly speaking, GraphEA is largely less efficient and effective in both cases; we hence restrict the further discussion to the first two competitors.

The results obtained through direct evolution suggest that there are negligible differences between NE and multi-GP in terms of achieved $v_x^\star$: both EAs produced effective controllers. A deeper analysis of them, through visual inspection of the behaviors, suggests however that the two EAs achieve locomotion in different ways. Controllers obtained through NE give smoother behaviors than those obtained through multi-GP. The latter appears to realize a form of bang-bang control, i.e., to produce actuation values for the voxels with abrupt variations over time. We believe that the overall resulting behavior is still effective because the dynamics of the VSR body, namely, its spring-damper systems, mitigate this kind of control and, somehow, make it effective.

The results of offline imitation learning are more interesting, in our opinion. None of the EAs was able to consistently learn effective controllers. We remark that the EAs were decently able to solve the regression problem corresponding to imitation learning: this was true in particular for NE, which exploited well the approximation potential of MLPs. However, when we put the learned $f$ inside the VSRs, they were not able to produce effective gaits. From another point of view, the functions that were pretty good at fitting the data produced by their teachers were not successful in coupling with the body of the VSR, namely, with its corresponding dynamical system. Previous studies already showed that small differences in the shape of the body of a VSR may be very detrimental to the performance, if the controller is not adjusted to accommodate those differences [19]: this adjustment may be the result of an intrinsic plasticity of the controller itself, like with Hebbian learning [3], or may be simply obtained through a short re-optimization of the controller coupled with the modified body, as in [19].

From a broader perspective, we think that our results show that direct evolution produces functions that are much more robust than those obtained through offline imitation learning. Moreover, multi-GP appears to suffer more than NE from this lack of robustness: we attribute this weakness to the way GP represents numerical functions, which is very suitable for solving “static” symbolic regression problems, but may not be as effective when the functions are inserted in a dynamical system whose dynamics may be complex.

11.7 Concluding Remarks

We considered the case of simulated modular soft robots, namely, VSRs, and compared three EAs and their respective representations for evolving a controller that lets the VSR perform the task of locomotion well. We used a homogeneous distributed controller, in which a single multivariate function $f : \mathbb{R}^n \to \mathbb{R}^m$ is used inside each voxel to generate the VSR actions (contraction/expansion) in response to its observations (sensor readings). That is, we used the EAs to solve a continuous control task. We compared experimentally the three approaches (a multi-tree version of GP, a GA that evolves the weights of an MLP, and GraphEA, which directly evolves graphs of arithmetic operators) in two ways. First, through direct evolution, i.e., with a fitness function that evaluates a candidate $f$ by employing it in a simulation of the VSR facing its task. Second, through offline imitation learning, i.e., with a fitness function that evaluates the ability of a candidate $f$ to reproduce the input-output pairs of a teacher $f$. We discussed the results of our experiments and attempted to interpret the main findings: (a) GP is almost on par with GA+MLP, but produces less smooth behaviors; (b) offline imitation learning does not work, since a good coupling between body dynamics and controller appears to be extremely important for this control task. We believe our results may constitute a starting point for a larger discussion about how the dynamics of evolution intertwine with those of the life of the single agent, possibly resulting in a “moving” fitness landscape in which finding an equilibrium is not straightforward.
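The second kind of fitness function can be illustrated with a minimal sketch. The use of the mean squared error as the imitation loss, the array shapes, and all names (`imitation_fitness`, `teacher`, `constant_learner`) are assumptions made for illustration and are not taken from the experiments.

```python
import numpy as np

def imitation_fitness(candidate, observations, teacher_actions):
    """Mean squared error between candidate and teacher outputs (lower is better)."""
    predicted = np.array([candidate(o) for o in observations])
    return float(np.mean((predicted - teacher_actions) ** 2))

# Hypothetical teacher f: R^3 -> R^2 and a dataset of recorded input-output pairs.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
teacher = lambda o: np.tanh(W @ o)
observations = rng.uniform(-1.0, 1.0, size=(200, 3))
teacher_actions = np.array([teacher(o) for o in observations])

# Two hypothetical learners: a perfect copy of the teacher and a constant output.
constant_learner = lambda o: np.zeros(2)
perfect_error = imitation_fitness(teacher, observations, teacher_actions)
constant_error = imitation_fitness(constant_learner, observations, teacher_actions)
```

A low value of this fitness only states that the learner fits the teacher's data; as discussed above, it does not by itself guarantee an effective gait once the function is coupled with the body.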

References

1. Bacardit, J., Brownlee, A.E., Cagnoni, S., Iacca, G., McCall, J., Walker, D.: The intersection of evolutionary computation and explainable AI. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1757–1762 (2022)
2. Bartoli, A., De Lorenzo, A., Medvet, E., Squillero, G.: Multi-level diversity promotion strategies for grammar-guided genetic programming. Appl. Soft Comput. 83, 105599 (2019)
3. Ferigo, A., Iacca, G., Medvet, E., Pigozzi, F.: Evolving Hebbian learning rules in voxel-based soft robots. IEEE Trans. Cogn. Dev. Syst. (2022)
4. Floreano, D., Dürr, P., Mattiussi, C.: Neuroevolution: from architectures to learning. Evol. Intell. 1, 47–62 (2008)
5. Françoso Dal Piccol Sotto, L., Kaufmann, P., Atkinson, T., Kalkreuth, R., Porto Basgalupp, M.: Graph representations in genetic programming. Genet. Program. Evolvable Mach. 22(4), 607–636 (2021)
6. Harding, S., Miller, J.F.: Evolution of robot controller using cartesian genetic programming. In: Genetic Programming: 8th European Conference, EuroGP 2005, Lausanne, Switzerland, March 30–April 1, 2005. Proceedings 8, pp. 62–73. Springer (2005)
7. Hiller, J., Lipson, H.: Automatic design and manufacture of soft robots. IEEE Trans. Robot. 28(2), 457–466 (2012)
8. Jin, L., Li, S., Yu, J., He, J.: Robot manipulator control using neural networks: a survey. Neurocomputing 285, 23–34 (2018)
9. Kadlic, B., Sekaj, I., Pernecký, D.: Design of continuous-time controllers using cartesian genetic programming. IFAC Proc. Vol. 47(3), 6982–6987 (2014)
10. Koza, J.R., Rice, J.P.: Automatic programming of robots using genetic programming. In: AAAI, vol. 92, pp. 194–207 (1992)
11. La Cava, W., Orzechowski, P., Burlacu, B., de França, F.O., Virgolin, M., Jin, Y., Kommenda, M., Moore, J.H.: Contemporary symbolic regression methods and their relative performance (2021). arXiv:2107.14351
12. Legrand, J., Terryn, S., Roels, E., Vanderborght, B.: Reconfigurable, multi-material, voxel-based soft robots. IEEE Robot. Autom. Lett. (2023)
13. Lewis, M.A., Fagg, A.H., Solidum, A., et al.: Genetic programming approach to the construction of a neural network for control of a walking robot. In: ICRA, pp. 2618–2623. Citeseer (1992)
14. Lobov, S.A., Zharinov, A.I., Makarov, V.A., Kazantsev, V.B.: Spatial memory in a spiking neural network with robot embodiment. Sens. 21(8), 2678 (2021)
15. Medvet, E., Bartoli, A.: Evolutionary optimization of graphs with GraphEA. In: International Conference of the Italian Association for Artificial Intelligence, pp. 83–98. Springer (2021)
16. Medvet, E., Bartoli, A., De Lorenzo, A., Fidel, G.: Evolution of distributed neural controllers for voxel-based soft robots. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pp. 112–120 (2020a)
17. Medvet, E., Bartoli, A., De Lorenzo, A., Seriani, S.: 2D-VSR-Sim: a simulation tool for the optimization of 2-D voxel-based soft robots. SoftwareX 12, 100573 (2020)
18. Medvet, E., Nadizar, G., Manzoni, L.: JGEA: a modular java framework for experimenting with evolutionary computation. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 2009–2018 (2022a)
19. Medvet, E., Nadizar, G., Pigozzi, F.: On the impact of body material properties on neuroevolution for embodied agents: the case of voxel-based soft robots. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 2122–2130 (2022b)
20. Medvet, E., Pozzi, S., Manzoni, L.: A general purpose representation and adaptive EA for evolving graphs. In: Proceedings of the Genetic and Evolutionary Computation Conference (2023)
21. Mei, Y., Chen, Q., Lensen, A., Xue, B., Zhang, M.: Explainable artificial intelligence by genetic programming: a survey. IEEE Trans. Evol. Comput. (2022)
22. Miller, J.F., Harding, S.L.: Cartesian genetic programming. In: Proceedings of the 10th Annual Conference Companion on Genetic and Evolutionary Computation, pp. 2701–2726 (2008)
23. Nadizar, G., Medvet, E., Miras, K.: On the schedule for morphological development of evolved modular soft robots. In: European Conference on Genetic Programming (Part of EvoStar), pp. 146–161. Springer (2022a)
24. Nadizar, G., Medvet, E., Nichele, S., Pontes-Filho, S.: An experimental comparison of evolved neural network models for controlling simulated modular soft robots. Appl. Soft Comput. 110610 (2023a)
25. Nadizar, G., Medvet, E., Ramstad, H.H., Nichele, S., Pellegrino, F.A., Zullich, M.: Merging pruning and neuroevolution: towards robust and efficient controllers for modular soft robots. Knowl. Eng. Rev. 37 (2022b)
26. Nadizar, G., Medvet, E., Walker, K., Risi, S.: A fully-distributed shape-aware neural controller for modular robots. In: Proceedings of the Genetic and Evolutionary Computation Conference (2023b)
27. Nolfi, S.: Behavioral and Cognitive Robotics: an Adaptive Perspective. Stefano Nolfi (2021)
28. Nordin, P., Banzhaf, W.: Genetic programming controlling a miniature robot. In: Working Notes for the AAAI Symposium on Genetic Programming, vol. 61, p. 67. MIT, Cambridge, MA, USA, AAAI (1995)
29. Pfeifer, R., Bongard, J.: How the body shapes the way we think: a new view of intelligence. MIT Press (2006)
30. Pigozzi, F., Tang, Y., Medvet, E., Ha, D.: Evolving modular soft robots without explicit inter-module communication using local self-attention. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 148–157 (2022)
31. Poli, R., Langdon, W.B., McPhee, N.F.: A field guide to genetic programming (2008)
32. Salvato, E., Fenu, G., Medvet, E., Pellegrino, F.A.: Crossing the reality gap: a survey on sim-to-real transferability of robot controllers in reinforcement learning. IEEE Access (2021)
33. Seo, K., Hyun, S.: Toward automatic gait generation for quadruped robots using cartesian genetic programming. In: Applications of Evolutionary Computation: 16th European Conference, EvoApplications 2013, Vienna, Austria, April 3–5, 2013. Proceedings 16, pp. 599–605. Springer (2013)
34. Squillero, G., Tonda, A.: Divergence of character and premature convergence: a survey of methodologies for promoting diversity in evolutionary optimization. Inf. Sci. 329, 782–799 (2016)
35. Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evol. Comput. 10(2), 99–127 (2002)
36. Sui, X., Cai, H., Bie, D., Zhang, Y., Zhao, J., Zhu, Y.: Automatic generation of locomotion patterns for soft modular reconfigurable robots. Appl. Sci. 10(1), 294 (2020)
37. Talamini, J., Medvet, E., Bartoli, A., De Lorenzo, A.: Evolutionary synthesis of sensing controllers for voxel-based soft robots. In: ALIFE 2019: The 2019 Conference on Artificial Life, pp. 574–581. MIT Press (2019)
38. Turner, A.J., Miller, J.F.: Recurrent cartesian genetic programming. In: Parallel Problem Solving from Nature–PPSN XIII: 13th International Conference, Ljubljana, Slovenia, September 13–17, 2014. Proceedings 13, pp. 476–486. Springer (2014)
39. Virgolin, M., Alderliesten, T., Witteveen, C., Bosman, P.A.: Improving model-based genetic programming for symbolic regression of small expressions. Evol. Comput. 29(2), 211–237 (2021)

Chapter 12

Shape-constrained Symbolic Regression: Real-World Applications in Magnetization, Extrusion and Data Validation

Christian Haider, Fabricio Olivetti de Franca, Bogdan Burlacu, Florian Bachinger, Gabriel Kronberger, and Michael Affenzeller

C. Haider (✉) · B. Burlacu · F. Bachinger · G. Kronberger · M. Affenzeller: Heuristic and Evolutionary Algorithms Laboratory (HEAL), University of Applied Sciences Upper Austria, Hagenberg, Austria
F. O. de Franca: Center for Mathematics, Computation and Cognition (CMCC), Heuristics, Analysis and Learning Laboratory (HAL), Federal University of ABC, Santo Andre, Brazil

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_12

12.1 Introduction

Modeling complex systems, especially in critical applications, often requires the model to fulfill a set of mathematical properties. These properties can be derived from natural science laws, from prior knowledge specified by domain experts, or from observations made through experiments. Using such information about the desired properties to guide the construction of models can result in models that are more accurate, interpretable, and generalizable. In recent years, the availability of large datasets and computational methods has led to a surge of interest in scientific machine learning, but there is still a need for modeling techniques that can incorporate prior knowledge into these models [4].

One approach is to use Bayesian methods, which allow for the incorporation of prior distributions on model parameters and can produce posterior distributions that reflect both the data and the prior information [31]. Other techniques include regularization, which penalizes complex models to encourage sparsity and interpretability, and model selection criteria that take into account the complexity and goodness-of-fit of different models.

Another approach is to use shape constraints, which provide information about the function's shape. This makes it possible to define constraints on the model's image and derivatives, for example on the model's range, monotonicity, or concavity. Shape constraints can therefore help to restrict the solution space, leading to more reliable and accurate results. In addition to yielding models that better reflect the underlying physical principles, shape constraints can also help to improve extrapolation and to mitigate the effect of noise.

This contribution shows the usage of shape constraints within the context of symbolic regression (SR). Its objective is to show the handling and the functionality of shape-constrained symbolic regression (SCSR), especially in real-world applications. The paper is structured as follows: Sect. 12.2 summarizes the related work on how to integrate prior knowledge into data-based modeling. In Sect. 12.3, shape-constrained symbolic regression is defined. Section 12.4 shows different methods using SCSR. In Sect. 12.5, we discuss the constraint evaluation process. In Sect. 12.6, we showcase the usage of shape-constrained SR on three different real-world applications from various fields. Section 12.7 concludes the work.

12.2 Related Work

The integration of existing knowledge into data-driven modeling has become a prominent topic of interest in recent years [4]. As a result, there has been a growing number of publications on the subject, which is commonly referred to in the literature as “Physics-inspired Machine Learning”. This approach aims to combine domain-specific knowledge with machine learning techniques to improve the accuracy and interpretability of models. The integration of prior knowledge can help to overcome the limitations of purely data-driven models and has the potential to enable the development of more robust and reliable models.

Bladek and Krawiec discussed using formal constraints in combination with SR. Their approach, called Counterexample-driven Genetic Programming, uses a satisfiability modulo theories (SMT) solver to evaluate the constraints and to extend the training set with counterexamples [6]. In [19] the authors propose a multi-objective SR approach that is driven by both data and prior knowledge. The properties are defined using formal constraints, and an optimistic approach is used to evaluate the constraints by checking them against a set of discrete data samples. This is similar to our use of sampling as an optimistic constraint-handling approach. Asadzadeh et al. use a hybrid SR approach for black-box modeling, utilizing the representation of the hybrid model as an algebraic equation. Their approach is capable of including knowledge by partly fixing the regression tree based on prior knowledge [1]. Another important method for including knowledge in data-based modeling is given through monotonic constraints, which have recently become an important factor for explainable AI [27]. In the literature, models using monotonic constraints are often referred to as isotonic models [5, 23, 30]. Auguste et al. proposed two new methods of enforcing monotonicity constraints in regression and classification trees [2].

Lately, the inclusion of knowledge has also been discussed within the field of neural networks (NNs). Stewart and Ermon investigated the use of prior knowledge specified as constraints (e.g. physical behavior) in NNs. Their experiments showed promising results, but were rather small in scale [29]. Their approach has similarities to shape constraints. Another interesting approach in this field is represented by Domain Adapted Neural Networks (DANNs) [22]. DANNs allow the inclusion of monotonicity or approximation constraints during the training phase of deep NNs. Li et al. use semantic priors in the form of leading powers for symbolic expressions to define asymptotic constraints [20]. They make use of a special grammar that is generated by a NN. Raissi et al. present Physics-informed Neural Networks (PINNs), a method that allows the inclusion of prior knowledge in NNs [28].

Polynomial regression offers another way of including knowledge through shape constraints. This is enabled by sum-of-squares polynomials, based on the idea that a polynomial is non-negative if it can be represented as a sum of squared polynomials [25]. In [8] the authors used semidefinite programs (SDPs) to fit shape-constrained polynomial regression to noisy evaluations using monotonicity and convexity constraints. Papp and Alizadeh utilize polynomial splines for shape-constrained estimations [24]. In [17] the authors compared shape-constrained polynomial regression to other methods utilizing shape constraints on the Feynman benchmark instances with different levels of noise.

12.3 Shape-constrained Symbolic Regression

In this section, we discuss the algorithms that were used in the real-world applications in Sect. 12.6. These algorithms are capable of integrating prior knowledge into the modeling process. Shape-constrained symbolic regression (SCSR) allows enforcing certain characteristics of the model. Genetic Programming (GP) has proven to be especially suited to solving shape-constrained symbolic regression. The properties are enforced by defining constraints that refer to the shape of the function:

\[
f^*(x) = \operatorname*{argmin}_{f(x) \in \mathcal{M}} L(f(x), y), \quad x \in \mathcal{D} \subset \Omega, \qquad \text{subject to shape constraints } c_i(\mathcal{X}_i), \quad \mathcal{X}_i \subseteq \Omega, \tag{12.1}
\]

where $L(f(x), y)$ is the loss function, $f(x)$ the model, $y$ the target, $\mathcal{M}$ denotes the model space, $\mathcal{D}$ is the dataset, $\Omega$ is the domain, and $c_i$ are shape constraints with $\mathcal{X}_i$ being the domains for the constraints. For example, if we know that a given function $f(x)$ is monotonically increasing over the input variable $x_1$, we can introduce a monotonic increase constraint as:

\[
\frac{\partial f}{\partial x_1} \ge 0 \tag{12.2}
\]
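As a small worked illustration of such a constraint (the model form and the domain are hypothetical and not taken from the applications below), consider $f(x) = w_0 + w_1 x_1^2$ with $x_1 \in [0, 2]$. The monotonicity constraint then reduces to a sign condition on a single coefficient:

```latex
\[
\frac{\partial f}{\partial x_1} = 2 w_1 x_1 \ge 0 \quad \forall\, x_1 \in [0, 2]
\iff w_1 \ge 0 .
\]
```

For general symbolic models no such closed-form condition is available, which is why the constraint has to be evaluated approximately.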

Many strategies for handling constrained problems have already been studied in the evolutionary computation literature, ranging from penalization methods to multi-objective approaches [7]. In [17] the authors treat shape constraints as a constraint-handling problem and investigate two different single-objective approaches. The first approach simply discards all infeasible solutions by assigning them the worst objective value. The second approach uses two populations to store feasible and infeasible solutions separately from each other. In [11] the authors present a multi-objective approach to handle shape constraints, in which the two objectives (model quality and constraint satisfaction) are separated and handled independently of each other.

12.3.1 Interaction Transformation Evolutionary Algorithm

The Interaction Transformation Evolutionary Algorithm (ITEA) introduced in [10] constrains the search space of symbolic models using the Interaction-Transformation representation [9]. In short, this represents the set of all linear combinations of non-linear functions of the original variables. Each non-linear function is the composition of a unary function with a product of the original variables, each raised to an integer power:

\[
f_{\mathrm{IT}}(x, w) = w_0 + \sum_{j=1}^{m} w_j \cdot f_j(r_j(x)), \tag{12.3}
\]

representing a model with $m$ terms, where $w \in \mathbb{R}^{m+1}$ are the coefficients of the affine combination, $f_j : \mathbb{R} \to \mathbb{R}$ is the $j$-th transformation function, and $r_j : \mathbb{R}^d \to \mathbb{R}$ is the interaction function:

\[
r_j(x) = \prod_{i=1}^{d} x_i^{k_{ij}}, \tag{12.4}
\]

where $k_{ij} \in \mathbb{Z}$ represents the exponents for each variable. These regression models are evolved using a mutation-only evolutionary algorithm framework that, at every step, mutates every individual of the population and selects the next population using tournament selection applied to the concatenation of the original and mutated populations.
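The representation in Eqs. 12.3 and 12.4 can be sketched as a short evaluation routine. This is an illustrative sketch only, not the ITEA implementation: the function name `eval_it`, the data layout, and the example model are assumptions.

```python
import numpy as np

def eval_it(X, w, transforms, exponents):
    """Evaluate w0 + sum_j w_j * f_j(r_j(x)) for every row x of X.

    X          : (n_samples, d) float input matrix
    w          : (m + 1,) coefficients, w[0] is the intercept
    transforms : list of m unary functions f_j
    exponents  : (m, d) integer matrix, row j holds the exponents of term j
    """
    y = np.full(X.shape[0], w[0], dtype=float)
    for j, (f_j, k_j) in enumerate(zip(transforms, exponents)):
        r_j = np.prod(X ** k_j, axis=1)  # interaction: product of x_i ** k_ij
        y += w[j + 1] * f_j(r_j)         # transformation, then weighting
    return y

# Hypothetical model: f(x) = 1 + 2*sqrt(x1^2 * x2) - 0.5*log(x1 / x2)
X = np.array([[1.0, 4.0], [2.0, 1.0]])
w = np.array([1.0, 2.0, -0.5])
transforms = [np.sqrt, np.log]
exponents = np.array([[2, 1], [1, -1]])
y = eval_it(X, w, transforms, exponents)
```

Because the model is linear in `w` once the transformations and exponents are fixed, only the discrete structure (the choice of each `f_j` and the integer exponents) has to be searched by mutation.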

12.4 Shape Constraint Handling

Using shape-constrained regression, we need a strategy to handle the defined constraints. We discuss two different methods: (i) a single-objective approach and (ii) a multi-objective approach.

12.4.1 Single-Objective Approach

In [17] the authors introduce SCSR, a method that allows constraints to be defined on a function's shape in order to enforce certain properties. To handle these constraints, two different single-objective approaches are implemented. The first approach is a naive method that enforces the constraints by removing solutions that violate one or more constraints from the population. To this end, each infeasible solution is assigned the worst quality value, which makes its removal from the next generation likely. The second approach handles the constraints by defining two populations; this method is called Feasible-Infeasible Two-population [16]. The first population includes all feasible solutions, and the second the infeasible solutions. This two-population approach enables us to use two different fitness evaluation functions: the feasible population tries to minimize the error function, whereas the infeasible population tries to minimize the constraint violations. This approach allows us to keep a more diverse population throughout the generations.
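The naive approach can be sketched in a few lines. This is a minimal illustration under stated assumptions: fitness is an error to be minimized, the worst value is represented by infinity, and the dictionary layout and the name `penalized_fitness` are hypothetical.

```python
WORST_FITNESS = float("inf")  # assumption: fitness is an error to be minimized

def penalized_fitness(error, violations):
    """Return the error for feasible solutions and the worst value otherwise."""
    is_feasible = all(v == 0.0 for v in violations)
    return error if is_feasible else WORST_FITNESS

# Hypothetical population: "b" has the lowest error but violates a constraint.
population = [
    {"name": "a", "error": 0.20, "violations": [0.0, 0.0]},
    {"name": "b", "error": 0.05, "violations": [0.3, 0.0]},
    {"name": "c", "error": 0.10, "violations": [0.0, 0.0]},
]
ranked = sorted(population, key=lambda s: penalized_fitness(s["error"], s["violations"]))
ranking = [s["name"] for s in ranked]
```

Note that this scheme cannot distinguish a slightly infeasible solution from a severely infeasible one, which is one motivation for the two-population and multi-objective variants.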

12.4.2 Multi-objective Approach

Using a multi-objective approach for handling shape constraints is discussed in [13, 17]. The authors compared different algorithms, e.g., NSGA-II, MOEA/D, and NSGA-III. Besides testing different algorithms, two different methods are implemented: (i) using a single objective for each constraint and (ii) using two objectives, one for the data component and one for the physics component. Both multi-objective approaches use a soft-constraint handling method, which makes it possible to differentiate between infeasible solutions based on the severity of the violation.

12.4.2.1 Using a Single Objective for Each Constraint

This approach uses a single objective for each defined constraint, which leads to $1 + n$ objectives being solved. The first objective is to minimize the error function; the remaining $n$ objectives are to minimize the constraint violations. Splitting the constraints into individual objectives allows us to consider the strength of violation per constraint separately. The calculation for each objective function is defined as

\[
\begin{aligned}
P_i &= P_i^{\text{lower}} + P_i^{\text{upper}}, \\
P_i^{\text{lower}} &= \left| \min\big(\operatorname{lower}(f_i(x)) - \operatorname{lower}(c_i),\, 0\big) \right|, \\
P_i^{\text{upper}} &= \left| \max\big(\operatorname{upper}(f_i(x)) - \operatorname{upper}(c_i),\, 0\big) \right|,
\end{aligned}
\tag{12.5}
\]

where $\operatorname{lower}$ and $\operatorname{upper}$ are functions giving the lower and upper bound of the given interval, respectively, $f_i(x)$ is the evaluation of the interval corresponding to the $i$-th constraint, and $c_i$ is the feasibility interval for the $i$-th constraint.
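The per-constraint violation of Eq. 12.5 can be written as a small function. This is a sketch under the assumption that both the model bound and the feasibility interval are given as `(lower, upper)` tuples; the function name and the example intervals are hypothetical.

```python
def violation(model_interval, constraint_interval):
    """P_i = P_i^lower + P_i^upper for one constraint, following Eq. 12.5."""
    f_lo, f_hi = model_interval
    c_lo, c_hi = constraint_interval
    p_lower = abs(min(f_lo - c_lo, 0.0))  # how far the bound falls below the interval
    p_upper = abs(max(f_hi - c_hi, 0.0))  # how far the bound rises above the interval
    return p_lower + p_upper

inside = violation((0.2, 0.8), (0.0, 1.0))               # bound inside the interval
too_low = violation((-0.5, 0.8), (0.0, 1.0))             # violated on the lower side
both = violation((-0.5, 1.25), (0.0, 1.0))               # violated on both sides
one_sided = violation((-2.0, 3.0), (0.0, float("inf")))  # constraint f >= 0 only
```

A one-sided constraint such as a non-negative derivative is expressed here with an infinite upper bound, so that only the lower term can contribute.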

12.4.2.2 Two Objective Approach

Using the two-objective approach, we reduce the number of objectives considerably, especially on difficult problems with many constraints. The first objective targets the data component by minimizing the error function, and the second objective handles the physics component by minimizing the sum of constraint violations. The sum of the constraint violations is calculated in the same way as in the single-objective-per-constraint approach, as shown in Eq. 12.5, with the only addition that the individual constraint violation terms are summed up. With this approach we lose the capability of tracking and investigating each constraint separately, but we considerably reduce the complexity of the resulting Pareto front.
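The trade-off described above can be made concrete with a minimal sketch. The function names, the tuple layout, and the example values are assumptions for illustration; both objectives are minimized.

```python
def two_objectives(error, violations):
    """Collapse the per-constraint violations into a single physics objective."""
    return (error, sum(violations))

def dominates(a, b):
    """Pareto dominance for minimization: a is no worse everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Hypothetical solutions with three constraints each.
s1 = two_objectives(0.1, [0.0, 0.25, 0.25])
s2 = two_objectives(0.1, [0.5, 0.0, 0.0])
s3 = two_objectives(0.2, [0.25, 0.25, 0.25])
```

Here `s1` and `s2` violate different constraints but map to the same objective vector, which illustrates the loss of per-constraint information, while the Pareto front stays two-dimensional regardless of the number of constraints.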

12.4.3 Feasible-Infeasible Two-Population Approach

The Feasible-Infeasible Two-Population approach (FI-2Pop) [15] was originally conceived to handle constrained optimization problems without requiring additional hyperparameters (e.g., a penalization coefficient) or additional computational cost (e.g., calculating the Pareto front). It works by splitting the current population into two sets: one of the feasible solutions and the other composed of infeasible solutions. The evolution steps are applied to each set independently without any modification to the evolutionary algorithm framework. The only distinction between the two sets is in the evaluation of each solution: while the feasible set seeks to maximize the fitness, the infeasible set minimizes the constraint violation. At every step, whenever a feasible solution becomes infeasible, it is moved to the infeasible set and vice versa. This dynamic allows the population to explore the boundary regions of infeasibility, and both sets eventually converge to a minimum distance from each other, hopefully around the local optima. We have tested this approach in [17] in combination with ITEA.
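The split and the two evaluation criteria can be sketched as follows. This is a simplified illustration, not the implementation tested in [17]: it assumes that fitness maximization is expressed as error minimization, that each solution carries a precomputed total violation, and that the name `fi2pop_rankings` and the dictionary layout are hypothetical.

```python
def fi2pop_rankings(population):
    """Split a population by feasibility and rank each set with its own criterion."""
    feasible = [s for s in population if s["violation"] == 0.0]
    infeasible = [s for s in population if s["violation"] > 0.0]
    feasible.sort(key=lambda s: s["error"])        # feasible set: minimize the error
    infeasible.sort(key=lambda s: s["violation"])  # infeasible set: minimize the violation
    return feasible, infeasible

population = [
    {"name": "a", "error": 0.30, "violation": 0.0},
    {"name": "b", "error": 0.05, "violation": 0.8},
    {"name": "c", "error": 0.10, "violation": 0.0},
    {"name": "d", "error": 0.50, "violation": 0.1},
]
feasible, infeasible = fi2pop_rankings(population)
feasible_names = [s["name"] for s in feasible]
infeasible_names = [s["name"] for s in infeasible]
```

Re-running the split after every variation step moves offspring to the set matching their new feasibility status, which is the migration between sets described above.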

12.5 Constraint Evaluation

The evaluation of constraints can be a challenging task, since it usually requires finding the extrema of a non-linear, non-convex function within a given bounded domain. Using approximations may therefore be good practice in order to solve the task more efficiently. In general, two types of approximation methods for shape constraint evaluation can be distinguished: (i) optimistic and (ii) pessimistic.

12.5.1 Optimistic Approach

Optimistic approaches only check the constraints against a finite number of samples from the input space, accepting that the result might violate the constraint at points outside the samples. In [12] the authors use sampling as an optimistic approach to evaluate the constraints. Implementing optimistic constraint evaluation is rather easy, since infeasibility is established as soon as a single sample is found at which the constraint is violated. The drawback is that the quality of the evaluation strongly depends on the number of samples taken. The quality can be increased by increasing the number of samples, but this comes at the expense of computational time. Especially for high-dimensional problems, the number of samples needed grows exponentially with the number of dimensions.
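A sampling-based check of a monotonicity constraint might look as follows. This is a minimal sketch, not the method of [12]: uniform random sampling, the central finite-difference approximation of the derivative, and all names and default values are assumptions for illustration.

```python
import numpy as np

def is_increasing_on_samples(f, lower, upper, var, n_samples=1000, h=1e-6, seed=0):
    """Optimistic check of df/dx_var >= 0 on random samples from the box [lower, upper]."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    X = rng.uniform(lower, upper, size=(n_samples, lower.size))
    step = np.zeros(lower.size)
    step[var] = h
    derivative = (f(X + step) - f(X - step)) / (2 * h)  # central finite difference
    return bool(np.all(derivative >= 0.0))

# Hypothetical models on the box [-1, 1]^2, both checked with respect to x_0.
increasing_model = lambda X: np.exp(X[:, 0]) - X[:, 1] ** 2
oscillating_model = lambda X: np.sin(3.0 * X[:, 0])

accepted = is_increasing_on_samples(increasing_model, [-1, -1], [1, 1], var=0)
rejected = not is_increasing_on_samples(oscillating_model, [-1, -1], [1, 1], var=0)
```

A `True` result only means that no violation was found among the samples; a model that violates the constraint in a small region between the samples would still be accepted.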

12.5.2 Pessimistic Approach

Pessimistic approaches check the validity of the constraints against calculated bounds of the model and its partial derivatives. Thus, it can be ensured that if the resulting interval is within the constraint interval, the solution is feasible. In contrast to the optimistic approach, however, a resulting interval that lies outside the constraint interval does not guarantee that the solution is infeasible, because pessimistic methods tend to overestimate the image bounds. One suitable method for pessimistic evaluation is interval arithmetic (IA) [14, 21]. IA allows applying arithmetic operations when the given variables are represented as bounded or unbounded intervals. In [11, 17] the authors used IA for the shape constraint evaluation.
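The overestimation can be seen in a toy interval arithmetic sketch. It is deliberately simplified and not the implementation used in [11, 17]: only addition, subtraction, and multiplication are defined, outward rounding is ignored, and the `Interval` class and the example function are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def within(self, other):
        """True if this interval is fully contained in the other one."""
        return other.lo <= self.lo and self.hi <= other.hi

# f(x) = x * (1 - x) on x in [0, 1]; the true image is [0, 0.25].
x = Interval(0.0, 1.0)
one = Interval(1.0, 1.0)
bound = x * (one - x)                               # IA treats both x as independent
certified_loose = bound.within(Interval(0.0, 1.0))  # feasibility can be certified
certified_tight = bound.within(Interval(0.0, 0.5))  # cannot be certified
```

The constraint $f(x) \in [0, 0.5]$ actually holds on the whole domain, yet the IA bound $[0, 1]$ cannot certify it. This is the sense in which the check is pessimistic: feasible solutions may be rejected, but an accepted solution is guaranteed to satisfy the constraint.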

12.6 Real World Problems

In the following section, we present three real-world problems that illustrate the usability of shape constraints. These problems are selected for their relevance and for the insights they provide. We hope that these problems give readers a deeper understanding of the relevance of integrating prior knowledge into data-based modeling.

12.6.1 Twin-Screw Extruder Modeling

In this work, we used a hybrid modeling approach to predict the polymer melt flow through twin-screw extruder kneading blocks using symbolic regression together with shape constraints.

Table 12.1 Variation for the diameter ratio, dimensionless disc width, and dimensionless undercut

Parameter     Min    Max    Increment
$\Pi_D$       1.45   1.80   0.05
$\Pi_B$       0.05   0.40   0.05
$\Pi_{S2}$    0.10   0.60   0.10

Usually, polymers are mixed with a variety of filler elements to prepare them for the desired application. Every polymer goes through this process at least once, and the process is usually performed by a co-rotating twin-screw extruder due to its excellent mixing capabilities and modular screw configuration. The screw of a twin-screw extruder consists of conveying elements and kneading blocks, where the conveying elements are mainly used for material transportation and building up the pressure, and the kneading blocks are used for melting and mixing. Of these two elements, kneading blocks have a much more complex geometry and are therefore more difficult to model. This is why we have looked at the modeling of kneading blocks in this work.

Accurate and reliable models are needed, since it is important to understand the behavior of the extruder in order to enhance the compounding process, minimize the power demand, and increase product quality by tailoring the screw configuration to the material and application needs. This is why, over the past years, different modeling approaches were developed to describe the flow in an intermeshing co-rotating twin-screw extruder. The two most common approaches are numerical and analytical methods.

In this scenario, using shape-constrained symbolic regression is especially interesting since knowledge integration is possible at various steps of the hybrid modeling approach. A tailored knowledge integration will substantially increase the model quality. It will improve the range of applications, generalization properties, and accuracy. To show the possibilities of adding knowledge via shape constraints into a data-based modeling approach, we compared the extrapolation behavior of models using shape constraints versus models only using data.

To this end, a parametrically driven numerical design study was performed, varying the most common influencing parameters within a wide range covering most kneading blocks available in the industry. For each modeling setup, the flow field is evaluated and the dimensionless drag-flow capability $A_1$ is computed. For the offset angle $\alpha$, three configurations are considered: $\alpha = 30$, $\alpha = 45$, and $\alpha = 60$. The variations for the remaining three independent influencing parameters ($\Pi_B$, $\Pi_D$, $\Pi_{S2}$) are listed in Table 12.1.

As we can observe in Fig. 12.1, using shape constraints helps considerably in stabilizing the resulting models. We can also see that we fulfill the given constraints ($f(x) > 0$; $\partial f(x) / \partial \alpha \le 0$, $\alpha \in [60, 90]$; $\partial f(x) / \partial \Pi_B \ge 0$, $\Pi_B \in [0, 0.5]$), whereas using no additional information results in models violating these constraints.

Fig. 12.1 Comparing partial dependence plots using no information versus using additional information with shape constraints

Another benefit of using shape constraints is that we achieved an overall better test error on the extrapolation set. Using no additional information led to an average normalized mean squared error (NMSE) of 0.3406, whereas the SCSR approach resulted in an average NMSE of 0.1364.
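For reference, one common way to compute the NMSE is sketched below. The normalization by the variance of the target is an assumption; the text does not state which normalization was used for the values reported above.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Mean squared error divided by the variance of the target (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

# Hypothetical target values and two reference predictors.
y = [1.0, 2.0, 3.0, 4.0]
perfect = nmse(y, y)
mean_predictor = nmse(y, [2.5, 2.5, 2.5, 2.5])
```

Under this definition, a model that always predicts the mean of the target obtains an NMSE of 1, so values well below 1 indicate that the model explains a large part of the variance.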

12.6.2 Data Validation for Industrial Friction Performance Measurements Shape constraints can be used to validate that measurements match expected patterns in friction performance experiments. Together with Miba Frictec, we have developed an automated data validation solution where physically expected patterns in friction measurements are expressed in the form of shape constraints. The system automatically detects deviations from the expected patterns and marks those data as potentially erroneous. This reduces the manual effort required for data validation and is capable to detect subtle errors that would even be hard to spot for an expert. Miba Frictec is an international company with headquarters located in Austria that produces friction components for various applications, mainly in the automotive and energy sector. They develop new friction materials and components and produce components for high-performance applications at their facilities in Upper Austria. All produced components are regularly tested for quality assurance on industrial test benches for friction performance experiments, whereby measurement data are


collected into a large historic database mainly for analytical purposes. All tests are highly automated, which means that each day a large volume of data is collected. However, several problems may occur in the testing equipment which can render the measurements invalid. These problems include, for instance, sensor calibration issues, sensor failure, unexpected stops of the test bench, and unexpected failures of tested samples. Manual inspection of the data for errors is challenging because of the large volumes of data that are collected, and because some errors are subtle and can be missed easily. Therefore, an automated solution to identify erroneous data is required. We have decided to use a model-based approach, where we use symbolic regression to find a model that fits the observed data well. The parameters which are systematically varied in the friction experiments are the sliding velocity $v$, the normal force or pressure $p$, and the temperature $T$. The friction coefficient $\mu$ (for static and dynamic friction) is one of the most important performance indicators and is calculated from the forces measured on the test bench. We use all measurements to create prediction models for the friction coefficient $\hat{\mu} = f(p, v, T)$ using symbolic regression. A separate model is fit for each experiment because different types of friction materials can have significantly different performance values. For example, certain materials are more sensitive to temperature than others. From prior experience, we know the accuracy of measurements in successful tests, and from the database of historic tests, we know the typical variability observed when testing different samples of the same product type. Additionally, the main effects on friction performance are well-known and can be easily expressed in the form of shape constraints. For instance, in this application, we know that the friction coefficient should decrease with higher temperature and higher pressure.
The functional dependency between the load parameters and the friction coefficient should be smooth, monotonic, and convex/concave. This knowledge is encoded in the form of shape constraints for symbolic regression, which ensures that the identified SR models match the patterns expected by the experts. This by itself is already an important improvement over the unconstrained SR models. Additionally, we can use these models for data validation. The main concept of the shape constraints for data validation is that erroneous data are assumed to deviate from the expected patterns for friction experiments. Therefore, when fitting a regression model with shape constraints, the model does not fit erroneous data well and therefore has a higher training error. Thus, we can flag the data for manual inspection if we detect an unusually high training error. This behavior is shown in Fig. 12.2. In the case of valid data, the unconstrained model and the model using a monotonically increasing constraint result in a similar training error. If the data are erroneous, however, the training errors of the two models deviate strongly, which indicates that the data are invalid, since the constrained model can no longer fit the corrupted data as well. We use shape constraints in this process because they allow a formal description of the expected patterns that is flexible and easy to understand. For components that are regularly tested within the quality assurance process, it would be possible to check whether the new data are similar to existing data. For new configurations, however,


Fig. 12.2 Showcase of fitting a regression model with shape constraints on valid and erroneous data

Fig. 12.3 Example for model-based data validation with symbolic regression and shape constraints

we do not have such data. By using shape constraints, we do not need validated data or models from past experiments, because the shape constraints encode the expected patterns, and we can directly validate newly arriving data without any reference measurements for valid and invalid data. Note that fitting an unconstrained regression model would be insufficient, because the model would also fit the data errors and therefore may have a small training error even with erroneous data. Figure 12.3 shows an example of a data validation result. The experiment consists of a sequence of segments in which repeated measurements with the same load parameters are performed. The figure shows the measured and the predicted friction coefficient, as well as the load parameter values. The model is trained on the full dataset, but we use shape constraints to ensure that the prediction model is physically plausible. The approach detects a higher training error in the first and fifth segments, where the observed drop in the friction coefficient cannot be explained by the changes in the load parameters. Problematic segments are flagged for inspection based on an error threshold, which is determined from the known measurement accuracy. The shape constraints used in this example are:


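$$
\forall_{v,p,T}\; v \in [0, 1] \land p \in [0, 1] \land T \in [0, 1] \implies \left( 0 \le \mu_{dyn} \le 1 \land \frac{\partial \mu_{dyn}}{\partial v} \in [-0.01, 0.01] \land \frac{\partial \mu_{dyn}}{\partial p} \le 0 \land \frac{\partial^2 \mu_{dyn}}{\partial p^2} \le 0 \land \frac{\partial \mu_{dyn}}{\partial T} \ge 0 \land \frac{\partial^2 \mu_{dyn}}{\partial T^2} \ge 0 \right) \tag{12.6}
$$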

This application using shape-constrained symbolic regression has been deployed within a prototype application to be used by engineers. More details on symbolic regression for modeling friction components can be found in [3, 18].
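
The flagging step described in this section can be sketched in a few lines. The following is a minimal illustration, assuming that the predictions of a shape-constrained model are already available. The function name `flag_segments`, the use of RMSE as the per-segment error, and the toy numbers are assumptions made for this example and are not taken from the deployed prototype.

```python
from math import sqrt


def flag_segments(measured, predicted, segment_ids, threshold):
    """Return the ids of segments whose RMSE exceeds `threshold`.

    Hypothetical helper: `predicted` is assumed to come from a
    shape-constrained model fitted to the full experiment.
    """
    sums = {}  # segment id -> (sum of squared errors, number of points)
    for y, y_hat, seg in zip(measured, predicted, segment_ids):
        total, count = sums.get(seg, (0.0, 0))
        sums[seg] = (total + (y - y_hat) ** 2, count + 1)
    return sorted(seg for seg, (total, count) in sums.items()
                  if sqrt(total / count) > threshold)


# Toy data: segment 2 shows a drop that the model does not reproduce.
measured = [0.30, 0.31, 0.10, 0.12]
predicted = [0.30, 0.30, 0.30, 0.30]
segments = [1, 1, 2, 2]
flagged = flag_segments(measured, predicted, segments, threshold=0.05)
print(flagged)
```

In practice the threshold would be derived from the known measurement accuracy, as stated above, rather than chosen by hand as in this toy example.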

12.6.3 Magnetization Curves

In [26], we describe a use case that draws on a broader spectrum of domain knowledge to model magnetization curves (MC) for high magnetic field strengths. The major challenge within this modeling task was that only sparse data points were available, as illustrated in Fig. 12.4. The black part of the function represents the available data points, and the red part shows the values approximated via the Fröhlich-Kennelly extrapolation. Such MCs describe the relationship between the magnetic flux density B and the magnetic field strength H; such graphs are therefore called B-H curves. One important characteristic of such B-H curves is that, starting at a saturation point S, the magnetic flux density grows linearly with a slope equal to the permeability of the vacuum $\mu_0$. In addition to the input parameter B and target value H, we can derive additional information about the system in terms of the magnetic polarization

Fig. 12.4 Illustration of training/test data distribution for B-H graphs


Table 12.2 Defined constraints over H, B, J and $\mu_r$

Constraint                                                              Region
$\partial B / \partial H \in [4\pi \cdot 10^{-7}, 4\pi \cdot 10^{-7}]$  $H \in [8 \cdot 10^5, 10^6]$
$\partial B / \partial H \in [0, \infty]$                               $H \in [0, 10^6]$
$\partial^2 B / \partial H^2 \in [-10, 0]$                              $H \in [15000, 10^6]$
$J \in [0, 1.96]$                                                       $H \in [0, 10^6]$
$J \in [1.956, 1.96]$                                                   $H \in [5 \cdot 10^5, 10^6]$
$\mu_r \in [1, 13500]$                                                  $H \in [0, 10^6]$
$\mu_r \in [1, 1.005]$                                                  $H \in [9 \cdot 10^5, 10^6]$

Fig. 12.5 Transformation of B-H curve to receive the magnetic polarization and relative permeability of a material

J and relative permeability $\mu_r$, whereby $0 \le J \le S$ and $1 \le \mu_r \le \infty$. Figure 12.5 shows the transformation process used to obtain J and $\mu_r$. This transformation allows us to utilize additional information about the system for the algorithmic search. This knowledge is expressed in the constraints listed in Table 12.2. This approach leads to two types of constraints. First, constraints on the B-H curve can be evaluated directly using the raw data from the dataset. Second, constraints on J and $\mu_r$, which are called extended constraints (EC), are evaluated after a transformation in which the target value from the dataset is replaced with the model estimate; this is a post-processing step on the estimates. To test this approach, we set up an experiment using three different algorithms in 30 independent runs. The results are presented in Table 12.3, which shows the median RMSE of each method with and without additional information. It is noticeable that using EC leads to significantly lower prediction errors. A more detailed presentation of the results is given in [26].
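
To make the idea of an extended constraint concrete, the following sketch checks a bound on J using model estimates instead of dataset targets. It is only an illustration under stated assumptions: the chapter does not spell out the transformation, so the textbook relation $J = B - \mu_0 H$ is assumed here, and the function names and the toy saturating model `b_hat` are hypothetical.

```python
from math import pi

MU_0 = 4.0 * pi * 1e-7  # permeability of the vacuum


def polarization(b, h):
    """Magnetic polarization, assuming the textbook relation J = B - mu_0 * H."""
    return b - MU_0 * h


def bound_violations(values, lower, upper):
    """Return the values that fall outside the closed interval [lower, upper]."""
    return [v for v in values if not (lower <= v <= upper)]


def b_hat(h):
    """Hypothetical model estimate of B(H): linear vacuum term plus a saturating term."""
    return MU_0 * h + 1.96 * h / (h + 1000.0)


# Evaluate the extended constraint J in [0, 1.96] on model estimates, not on data.
sample_h = [0.0, 1.0e3, 1.0e5, 1.0e6]
j_estimates = [polarization(b_hat(h), h) for h in sample_h]
violations = bound_violations(j_estimates, 0.0, 1.96)
print(violations)
```

Checking a handful of sample points, as done here, only gives an optimistic indication of feasibility; it does not prove that the constraint holds over the whole region.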


Table 12.3 Median test RMSE over 30 independent runs for three algorithms: Age-layered population structure (ALPS), Genetic Algorithm (GA), and Offspring Selection Genetic Algorithm (OSGA)

            No information            Additional information
Material    ALPS    GA      OSGA      ALPS    GA      OSGA
M1          6.369   0.149   2.660     0.231   0.071   0.129
M2          2.264   1.373   3.589     0.174   0.159   0.142
M3          0.235   0.985   0.810     0.117   0.105   0.109
M4          0.282   1.802   1.033     0.126   0.138   0.134

12.7 Conclusion

The use of shape constraints in data-based modeling approaches allows us to address various use cases where the consideration of prior knowledge plays an important role. In this work, we explored three different applications, where we utilized shape constraints in different ways. In our first experiment, we applied SCSR to predict polymer melt flow through the kneading blocks of twin-screw extruders. The complex geometry of kneading blocks makes their modeling particularly challenging, and accurate predictions are crucial. We have shown that integrating specific domain knowledge via shape constraints leads to stabilized models that comply with essential physical constraints. Additionally, the SCSR approach outperforms traditional data-only models in terms of test error on the extrapolation set. In the second application, we employed shape constraints for data validation tasks, using SC to specify prior knowledge and detect faults in previously unseen data. Our experiments have shown that SCSR is capable of detecting data faults that can only be detected through the intricate interaction of the observed data. Therefore, we showed that SCSR is a powerful tool that can be seamlessly integrated into data import pipelines. As a last use case, we utilized SC to leverage a broader spectrum of domain knowledge for a specific modeling task. We incorporated prior domain knowledge to extract additional input parameters from the given data, which were then used to generate the final model. Our findings indicated that utilizing a broader spectrum of domain knowledge led to a significant increase in test quality. These results highlight the significance of shape constraints in enhancing model quality, generalization properties, extrapolation behavior and accuracy, particularly when applied to the modeling of complex systems.
The successful application of shape-constrained modeling opens up new possibilities for leveraging prior knowledge and improving the performance of data-based models in various real-world applications.


Acknowledgements

The authors gratefully acknowledge the federal state of Upper Austria for funding the research project FinCoM (Financial Condition Monitoring) and thus the underlying research of this study. Furthermore, the authors thank the federal state of Upper Austria as part of the program “#upperVISION2030” for funding the research project SPA (Secure Prescriptive Analytics) and thus the research of parts of this study. This project was partially funded by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), grant number 2021/12706-1.

References

1. Asadzadeh, M.Z., Gänser, H.-P., Mücke, M.: Symbolic regression based hybrid semiparametric modelling of processes: an example case of a bending process. Appl. Eng. Sci. 6, 100049 (2021)
2. Auguste, C., Malory, S., Smirnov, I.: A better method to enforce monotonic constraints in regression and classification trees (2020). arXiv:2011.00986
3. Bachinger, F., Kronberger, G.: Comparing shape-constrained regression algorithms for data validation. In: Moreno-Díaz, R., Pichler, F., Quesada-Arencibia, A. (eds.) Computer Aided Systems Theory – EUROCAST 2022, pp. 147–154. Springer Nature Switzerland, Cham (2022)
4. Baker, N., Alexander, F., Bremer, T., Hagberg, A., Kevrekidis, Y., Najm, H., Parashar, M., Patra, A., Sethian, J., Wild, S., et al.: Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence (2019)
5. Barlow, R.E., Brunk, H.D.: The isotonic regression problem and its dual. J. Am. Stat. Assoc. 67(337), 140–147 (1972)
6. Bladek, I., Krawiec, K.: Solving symbolic regression problems with formal constraints. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’19, pp. 977–984. Association for Computing Machinery, New York, NY, USA (2019)
7. Coello Coello, C.A.: Constraint-handling techniques used with evolutionary algorithms. In: Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion, GECCO ’16 Companion, pp. 563–587. Association for Computing Machinery, New York, NY, USA (2016)
8. Curmei, M., Hall, G.: Shape-constrained regression using sum of squares polynomials (2022)
9. de França, F.O.: A greedy search tree heuristic for symbolic regression. Inf. Sci. 442, 18–32 (2018)
10. de Franca, F.O., Aldeia, G.S.I.: Interaction-transformation evolutionary algorithm for symbolic regression. Evol. Comput. 29(3), 367–390 (2021)
11. Haider, C., de Franca, F., Burlacu, B., Kronberger, G.: Shape-constrained multi-objective genetic programming for symbolic regression. Appl. Soft Comput. 132, 109855 (2023)
12. Haider, C., de França, F.O., Kronberger, G., Burlacu, B.: Comparing optimistic and pessimistic constraint evaluation in shape-constrained symbolic regression. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’22, pp. 938–945. Association for Computing Machinery, New York, NY, USA (2022)
13. Haider, C., Kronberger, G.: Shape-constrained symbolic regression with NSGA-III. In: Moreno-Díaz, R., Pichler, F., Quesada-Arencibia, A. (eds.) Computer Aided Systems Theory – EUROCAST 2022, pp. 164–172. Springer Nature Switzerland, Cham (2022)
14. Hickey, T., Ju, Q., Van Emden, M.H.: Interval arithmetic: from principles to implementation. J. ACM 48(5), 1038–1068 (2001)
15. Kimbrough, S.O., Koehler, G.J., Lu, M., Wood, D.H.: On a feasible-infeasible two-population (fi-2pop) genetic algorithm for constrained optimization: distance tracing and no free lunch. Eur. J. Oper. Res. 190(2), 310–327 (2008)
16. Kimbrough, S.O., Koehler, G.J., Lu, M.-C., Wood, D.H.: On a feasible-infeasible two-population (fi-2pop) genetic algorithm for constrained optimization: distance tracing and no free lunch. European J. Oper. Res. 190, 310–327 (2008)


17. Kronberger, G., de Franca, F.O., Burlacu, B., Haider, C., Kommenda, M.: Shape-constrained symbolic regression-improving extrapolation with prior knowledge. Evol. Comput. 30(1), 75–98 (2022)
18. Kronberger, G., Kommenda, M., Promberger, A., Nickel, F.: Predicting friction system performance with symbolic regression and genetic programming with factor variables. In: Aguirre, H.E., Takadama, K. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2018, Kyoto, Japan, July 15-19, 2018, pp. 1278–1285. ACM (2018)
19. Kubalík, J., Derner, E., Babuška, R.: Symbolic regression driven by training data and prior knowledge. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO ’20, pp. 958–966. Association for Computing Machinery, New York, NY, USA (2020)
20. Li, L., Fan, M., Singh, R., Riley, P.: Neural-guided symbolic regression with asymptotic constraints (2019). arXiv:1901.07714
21. Lodwick, W.A.: Constrained Interval Arithmetic. Technical report, USA (1999)
22. Muralidhar, N., Islam, M.R., Marwah, M., Karpatne, A., Ramakrishnan, N.: Incorporating prior domain knowledge into deep neural networks. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 36–45. IEEE (2018)
23. Neelon, B., Dunson, D.B.: Bayesian isotonic regression and trend analysis. Biom. 60(2), 398–406 (2004)
24. Papp, D., Alizadeh, F.: Shape-constrained estimation using nonnegative splines. J. Comput. Graph. Stat. 23(1), 211–231 (2014)
25. Parrilo, P.A.: Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. Ph.D. thesis, California Institute of Technology (2000)
26. Piringer, D., Wagner, S., Haider, C., Fohler, A., Silber, S., Affenzeller, M.: Improving the flexibility of shape-constrained symbolic regression with extended constraints. In: Moreno-Díaz, R., Pichler, F., Quesada-Arencibia, A. (eds.) Computer Aided Systems Theory – EUROCAST 2022, pp. 155–163. Springer Nature Switzerland, Cham (2022)
27. Rai, A.: Explainable AI: From black box to glass box. J. Acad. Mark. Sci. 48, 137–141 (2020)
28. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019)
29. Stewart, R., Ermon, S.: Label-free supervision of neural networks with physics and domain knowledge. Proc. AAAI Conf. Artif. Intell. 31(1) (2017)
30. Tibshirani, R.J., Hoefling, H., Tibshirani, R.: Nearly-isotonic regression. Technometrics 53(1), 54–61 (2011)
31. van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M.G., Vannucci, M., Gelman, A., Veen, D., Willemsen, J., et al.: Bayesian statistics and modelling. Nat. Rev. Methods Prim. 1(1), 1 (2021)

Chapter 13

Phylogeny-Informed Fitness Estimation for Test-Based Parent Selection

Alexander Lalejini, Matthew Andres Moreno, Jose Guadalupe Hernandez, and Emily Dolson

A. Lalejini (corresponding author), Grand Valley State University, Allendale, MI, USA
M. A. Moreno, University of Michigan, Ann Arbor, MI, USA
J. G. Hernandez · E. Dolson, Michigan State University, East Lansing, MI, USA
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_13

13.1 Introduction

A phylogeny (ancestry tree) details the hereditary history of a population. Phylogenetic trees represent the evolutionary relationships among taxa (e.g., individuals, genotypes, genes, species, etc.). In evolutionary biology, phylogenies are typically estimated from the fossil record, phenotypic traits, and extant genetic information. Although imperfect, estimated phylogenies have profoundly advanced understanding of the evolution of life on Earth by organizing knowledge of biological diversity, chronicling the sequence of events comprising natural history, and revealing population-level evolutionary dynamics underlying those events [42, 45]. In evolutionary computing, we may track phylogenies at runtime with perfect (or user-adjustable) accuracy [2, 9, 10, 38–40]. In this context, a phylogeny can reveal how an evolutionary algorithm steers a population through a search space, showing the step-by-step process by which any solutions evolved [34]. Indeed, phylogenetic analyses have yielded insights for evolutionary computing research [7, 15, 34, 35]. For example, recent work suggests that phylodiversity metrics provide an improved window into the mechanisms that cause evolutionary algorithms to succeed


or fail [14, 24]. Insufficient diversity traps evolutionary algorithms at sub-optimal solutions through premature convergence [17]. Most commonly, an evolutionary algorithm’s capacity for diversity maintenance is measured by counting the number of distinct candidate solutions in the population at a single time point (e.g., counting the number of unique genotypes or phenotypes). Phylodiversity metrics, however, take into account the evolutionary history of a population by quantifying the topology of its phylogeny [12, 24]. In this way, phylodiversity metrics better capture how well an evolutionary algorithm has explored a search space compared to other, more commonly used diversity metrics. The power of this approach is evidenced by the fact that phylodiversity has been shown to be predictive of an evolutionary algorithm’s success; that is, in many contexts, runs of evolutionary computation that maintain more phylogenetic diversity are more likely to produce a high-quality solution [24, 44].

The informativeness of post hoc phylogenetic analyses naturally motivates the possibility of incorporating phylogenetic information at runtime. This direction has been only lightly explored, with the few existing approaches focusing on diversity maintenance for phylogenetic structure [6]. We propose to economize costly computation of phenotypic properties by substituting some evaluations with already-computed properties of near relatives. Specifically, we augment down-sampled lexicase selection [25] by exploiting a population’s phylogeny to estimate fitness. Lexicase-based parent selection algorithms have been shown to be highly successful in finding effective solutions to test-based problems across many domains [1, 11, 28, 30, 33, 36, 37]. Test-based problems specify expected behavior through a corpus of input/output examples (i.e., training cases).
Traditional selection procedures typically aggregate performance across the training set to produce a scalar fitness score, which is then used to select parents. Instead, lexicase selection considers each training case separately, which has been shown to improve diversity maintenance [14, 20] and overall search space exploration [26, 27]. Standard lexicase selection, however, still has the drawback of requiring each candidate solution in a population to be evaluated against all training cases, which can be computationally expensive. Down-sampled lexicase selection and cohort lexicase selection address this drawback by randomly subsampling the set of training cases used to evaluate candidate solutions each generation, reducing the number of per-generation training case evaluations required. This frees computational resources for other aspects of the evolutionary search (e.g., running for more generations), which has been shown to dramatically improve problem-solving success on many problems [16, 18, 22, 25].

Random subsampling can leave out important training cases, causing lexicase selection to fail to maintain certain critical genetic diversity [4, 26]. Moderating subsample rates or including redundant training cases can mitigate this drawback. Unfortunately, even the occasional loss of important diversity induced by random subsampling can prevent problem-solving success on certain problems. Domains that require substantial search space exploration are particularly sensitive, as are problems with training sets where an expected behavior is represented by a small number of training cases (making them more likely to be left out in a random subsample).
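
The selection procedure discussed above can be sketched as follows. This is a minimal illustration of lexicase selection combined with random down-sampling, not the authors' implementation: the function names, the convention that higher scores are better, and the toy score matrix are assumptions made for this example.

```python
import random


def lexicase_select(scores, case_ids, rng):
    """Select one parent index using lexicase selection.

    scores[i][c] is the score of individual i on training case c
    (assumption for this sketch: higher is better).
    """
    candidates = list(range(len(scores)))
    order = list(case_ids)
    rng.shuffle(order)  # every selection event uses a fresh case ordering
    for case in order:
        best = max(scores[i][case] for i in candidates)
        candidates = [i for i in candidates if scores[i][case] == best]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)  # break remaining ties at random


def down_sample(num_cases, rate, rng):
    """Randomly pick the training cases used for evaluation in one generation."""
    k = max(1, round(rate * num_cases))
    return rng.sample(range(num_cases), k)


rng = random.Random(1)
# Individuals 0 and 1 are specialists on different cases; individual 2 passes nothing.
scores = [[1, 0], [0, 1], [0, 0]]
parents = [lexicase_select(scores, [0, 1], rng) for _ in range(100)]
sampled_cases = down_sample(100, 0.1, rng)
```

With plain down-sampling, `lexicase_select` would be called with only the sampled case ids, so a specialist on a case that was left out loses its advantage in that generation. This is the loss of diversity that the estimation methods in this chapter aim to mitigate.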


Recently, Boldi et al. [3] proposed informed down-sampled lexicase selection, which uses runtime population statistics to construct subsamples that are less likely to leave out important training cases. Informed down-sampled lexicase improved problem-solving success over naive down-sampling for some problems but did not entirely eliminate the drawbacks of random subsampling [4].

Here, we propose “phylogeny-informed fitness estimation” as an alternative approach to mitigating the negative effects of random subsampling. Phylogeny-informed fitness estimation requires the population’s ancestry information to be tracked during evolution. Each generation, random subsampling is applied to the training set, determining which training cases each candidate solution is directly evaluated against. The population’s phylogeny is then used to estimate a candidate solution’s performance on any training cases that it was not evaluated against: the candidate receives the score of the nearest ancestor (or relative) that was evaluated on that training case. All training cases can then be used for lexicase selection, as each candidate solution has a score for all training cases (either true or estimated). Of course, phylogeny-informed fitness estimation requires that an individual’s score on a particular training case is likely to be similar to that of its close relatives (e.g., parent or grandparent). Under low to moderate mutation rates, we hypothesize that this condition holds for many common representations used across evolutionary computing.

In this work, we demonstrate two methods of phylogeny-informed estimation, ancestor-based estimation and relative-based estimation, in the context of down-sampled and cohort lexicase selection. We used the contradictory objectives and multi-path exploration diagnostics from the DOSSIER suite [27] to test success in mitigating the drawbacks of subsampling with lexicase selection.
These two diagnostics test a selection procedure’s capacity for diversity maintenance and search space exploration, two problem-solving characteristics that subsampling is known to degrade [4, 26]. Subsequently, we assessed the holistic performance of phylogeny-informed estimation on four GP problems from the general program synthesis benchmark suites [19, 21].

Overall, we find evidence that our phylogeny-informed estimation methods can help mitigate the drawbacks of random subsampling in the context of lexicase selection. Selection scheme diagnostics confirm that phylogeny-informed estimation can improve diversity maintenance and search space exploration. For full-fledged GP, phylogeny-informed estimation’s efficacy is sensitive to problem, subsampling method, and subsampling level.

13.2 Phylogeny-Informed Fitness Estimation

Phylogeny-informed fitness estimation is designed to integrate with any evolutionary algorithm that dynamically subsamples from a corpus of discrete fitness test cases. Implementation requires runtime availability of the evolving population’s phylogeny. For simplicity, our description assumes an asexually reproducing population (i.e., no


recombination); however, phylogeny-informed fitness estimation can in principle be extended to systems with recombination. In each generation, population members are evaluated against a subset of training cases (as determined by the particular subsampling routine). For example, 10% random down-sampling selects 10% of the training set (at random) to use for evaluation each generation. In this example scenario, each individual in the population is evaluated against 10% of the full training set. Without any form of fitness estimation, parent selection would be limited to using the performance scores on the sampled training cases. With phylogeny-informed fitness estimation, however, the complete training set can instead be used during selection. The population’s phylogeny is used to estimate a candidate solution’s performance on any training case that the individual was not directly evaluated against. We propose two methods of phylogeny-based fitness estimation: ancestor-based estimation and relative-based estimation. Ancestor-based estimation limits fitness estimation to using only the ancestors along the focal individual’s line of descent. To estimate an individual’s score on a training case, ancestor-based estimation iterates over the individual’s ancestors (i.e., parent, grandparents, etc.) along its lineage (from most to least recent) until finding the nearest ancestor that was evaluated against the focal training case. This ancestor’s score on the training case then serves as an estimate for the focal individual’s score on that training case. Figure 13.1 depicts a simple example of ancestor-based estimation. Relative-based estimation operates similarly to ancestor-based estimation, except estimations are not limited to direct ancestors. Instead, we use a breadth-first search starting from the focal individual to find the nearest relative in the phylogeny that was evaluated against the focal training case. 
Unlike ancestor-based estimation, the source of the estimate can lie off the focal individual’s line of descent (e.g., a “cousin”). Relative-based estimation may particularly benefit subsampling procedures that use the full training set partitioned across subsets of the population (e.g., cohort partitioning). In general, however, ancestor-based estimation can be more effectively optimized, making it more computationally efficient in practice. Simple ancestor-based estimation can forgo full phylogeny tracking, and instead pass the requisite evaluation information from parent to offspring, which is similar to fitness inheritance approximation techniques for evolutionary multi-objective optimization [5, 8]. However, our approach differs fundamentally from fitness inheritance techniques. We evaluate all individuals on a subsample of training cases each generation, while other fitness inheritance methods reduce evaluations by assigning a proportion of individuals to inherit the average fitness of their parents. Unfortunately, tuning inheritance proportion for these scalar inheritance techniques has largely proven to be problem-sensitive in practice [43]. To bound the computational cost of searching the phylogeny, the allowed search distance can be limited, and if no suitable ancestor or relative is found within the search limit, the estimate fails. Even without bounding, both ancestor-based and relative-based estimation can fail to find a suitable taxon in the phylogeny where the focal training case has been evaluated. For example, in the first generation, individuals in the population have no ancestors that can be used for estimates. In this


Fig. 13.1 Example application of ancestor-based phylogeny-informed fitness estimation. Taxa are depicted as circles in the tree. Extant taxa are unlabeled black circles (leaf nodes in the tree), and ancestral (non-extant) taxa are labeled A through G. In this example, an individual’s quality is assessed using five pass/fail training cases. A taxon’s scores on the five training cases are given as vectors below the associated node in the tree. For ancestral taxa, a “?” indicates that the score was not evaluated (and is therefore unknown), a “0” indicates that the taxon failed the training case when evaluated, and a “1” indicates that the taxon passed the training case. In the extant population of this example, the third and fourth training cases were evaluated, and all others were estimated. For these extant taxa, ancestor-based estimates are shown in color (corresponding with the color of the ancestor used for the estimate), and evaluated scores are shown in black (with an “X” below)

work, we assume maximally poor performance (e.g., failure, maximum error, etc.) for failed estimates. We recognize, however, that future work should explore alternative approaches to handling estimation failures. For example, it may be better to favor individuals with unknown scores on a training case over individuals known to fail that test case; that is, known failure should be considered to be worse than unknown performance. Similarly, future extensions should also consider penalizing the performance score (or increasing the error) on estimations based on how far away they are found in the phylogeny, as estimations from more distant relatives may be more likely to be inaccurate.
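
The two estimation methods described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' C++ implementation: the `Taxon` class, the function names, the `max_dist` and `fail_score` parameters, and the order in which equally distant relatives are visited are all choices made for this example.

```python
from collections import deque


class Taxon:
    """Phylogeny node annotated with the scores of evaluated training cases."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.scores = {}  # training case id -> evaluated score
        if parent is not None:
            parent.children.append(self)


def ancestor_estimate(taxon, case, max_dist=None, fail_score=0):
    """Walk up the line of descent until a taxon evaluated on `case` is found."""
    node, dist = taxon, 0
    while node is not None and (max_dist is None or dist <= max_dist):
        if case in node.scores:
            return node.scores[case]
        node, dist = node.parent, dist + 1
    return fail_score  # failed estimate: assume maximally poor performance


def relative_estimate(taxon, case, max_dist=None, fail_score=0):
    """Breadth-first search over parent and child links for the nearest relative."""
    seen = {id(taxon)}
    queue = deque([(taxon, 0)])
    while queue:
        node, dist = queue.popleft()
        if case in node.scores:
            return node.scores[case]
        if max_dist is not None and dist >= max_dist:
            continue  # do not expand beyond the search limit
        # Assumption: the parent is visited before the children at equal distance.
        neighbors = ([node.parent] if node.parent is not None else []) + node.children
        for other in neighbors:
            if id(other) not in seen:
                seen.add(id(other))
                queue.append((other, dist + 1))
    return fail_score


# Toy phylogeny: root -> a -> x, and root -> b (b is off x's line of descent).
root = Taxon()
a = Taxon(root)
b = Taxon(root)
x = Taxon(a)
root.scores = {0: 1}
b.scores = {2: 1}
x.scores = {3: 1}
```

In this toy phylogeny, `ancestor_estimate(x, 0)` finds the score stored at `root`, whereas case 2 was only evaluated on `b`, so only `relative_estimate` can supply an estimate for it.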

13.2.1 Phylogeny Tracking

Phylogeny-informed estimation methods require (1) runtime phylogeny tracking and (2) that the phylogeny is annotated with the training cases each taxon (i.e., node in the tree) has been evaluated against as well as the results of those evaluations. Phylogenies can be tracked at any taxonomic level of organization, be it individual


organisms (i.e., individual candidate solutions), genotypes, phenotypes, individual genes, et cetera. We refer generically to the entity corresponding to a node as a “taxon”, pluralized as “taxa”. In our experiments, taxa represented genotypes. As such, each taxon might represent multiple individual candidate solutions present in the population or that existed at some point during the run.

Complete phylogeny tracking requires that the ancestral relationships of all taxa be stored for the entire duration of the run. This mode of tracking, of course, can become computationally intractable for long runs with large population sizes, as the complete phylogeny can rapidly grow to be too large. Fortunately, phylogeny-informed estimation does not require complete phylogenies. Instead, we recommend pruning dead (extinct) branches for substantial memory and time savings. Although pruning does not affect ancestor-based estimation, it may affect relative-based estimation because close relatives without extant descendants will be removed as potential sources for fitness estimates. However, we do not expect any potential gains in estimation power from using complete phylogenies for relative-based estimation to outweigh the computational costs of complete phylogeny tracking.

Phylogenies can be tracked without increasing the overall time complexity of an evolutionary algorithm; adding a new node to the tree is a constant-time operation. Without pruning, however, the space complexity of phylogeny tracking is O(generations × population size). In most cases, pruning reduces the space complexity to O(generations + population size), although the exact impact depends on the strengths of both selection and any diversity maintenance mechanisms being used. While the worst-case time complexity of pruning is O(generations), this worst case can only occur a constant number of times per run of an evolutionary algorithm, and is thus amortized to O(1).
Thus, while phylogeny tracking is not zero cost, it can be achieved fairly efficiently. Practitioners must weigh the potential benefits of phylogeny tracking (e.g., phylogeny-informed estimation) against runtime overhead costs for their particular system. We expect the potential benefits to outweigh the tracking costs in systems where fitness evaluation is expensive. Efficient software libraries with plug-and-play tools for phylogeny tracking in evolutionary computing systems are increasingly widespread, such as hstrat [39], Phylotrackpy [13], MABE [2], Empirical [40], and DEAP [10]. In this work, we use the phylogeny tracking from the Empirical C++ library. The C++ implementation of our phylogeny-informed estimation methods is available as part of our supplemental material [29].
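To make the bookkeeping concrete, the following Python sketch shows one possible way to track a pruned phylogeny. It is an illustration only and not the Empirical implementation used in this work; the names `Taxon`, `Phylogeny`, `add_taxon`, and `remove_organism` are hypothetical, and the sketch assumes each taxon has a single parent (asexual reproduction).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Taxon:
    """One node in the phylogeny (hypothetical structure, not Empirical's API)."""
    taxon_id: int
    parent: Optional["Taxon"] = None
    num_offspring: int = 0   # child taxa still stored in the tree
    num_organisms: int = 0   # living organisms that belong to this taxon
    # Annotation required for estimation: training case id -> evaluated score.
    scores: Dict[int, float] = field(default_factory=dict)


class Phylogeny:
    def __init__(self) -> None:
        self.next_id = 0
        self.size = 0  # number of taxa currently stored

    def add_taxon(self, parent: Optional[Taxon] = None) -> Taxon:
        """Add a new taxon below `parent`; a constant-time operation."""
        taxon = Taxon(self.next_id, parent, num_organisms=1)
        self.next_id += 1
        self.size += 1
        if parent is not None:
            parent.num_offspring += 1
        return taxon

    def remove_organism(self, taxon: Taxon) -> None:
        """Record a death and prune any branch left with no extant descendants."""
        taxon.num_organisms -= 1
        node: Optional[Taxon] = taxon
        # Walk toward the root, deleting taxa that are extinct and childless.
        # Each taxon is deleted at most once over the whole run.
        while node is not None and node.num_organisms == 0 and node.num_offspring == 0:
            parent = node.parent
            if parent is not None:
                parent.num_offspring -= 1
            node.parent = None
            self.size -= 1
            node = parent
```

A single call to `remove_organism` may delete a long extinct lineage, but because every taxon is deleted at most once, the total pruning work over a run is bounded by the number of taxa ever added.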

13.3 Methods

We investigated phylogeny-informed fitness estimation in the context of two variants of the lexicase parent selection algorithm that use random subsampling: down-sampled lexicase and cohort lexicase. For each of these selection schemes, we compared three different modes of fitness estimation: ancestor-based estimation, relative-based estimation, and a no-estimation control. We repeated all experiments at four different subsampling levels: 1%, 5%, 10%, and 50%. In a first set of experiments, we used selection scheme diagnostics from the DOSSIER suite [27] to assess whether phylogeny-informed estimation is able to mitigate the drawbacks of random subsampling. These diagnostics measure different characteristics related to how a selection scheme steers populations through search spaces. Subsequently, we investigated how each mode of phylogeny-informed estimation impacts problem-solving success on four genetic programming (GP) problems from the program synthesis benchmark suites [19, 21].

13.3.1 Lexicase Selection

The lexicase parent selection algorithm is designed for test-based problems where candidate solutions are assessed based on their performance on a set of input/output examples (training cases) that specify correct behavior [23]. To select a single parent, lexicase selection shuffles the set of training cases into a random order, and all members of the population are included in a pool of candidates eligible for selection. Each training case is then applied in sequence (in shuffled order), filtering the pool of eligible candidates to include only those candidates with elite performance on the current training case. This filtering continues until all training cases have been applied in the shuffled sequence. After filtering, if more than one candidate remains in the pool, one is selected at random.
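The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the C++ code used in our experiments; the function name `lexicase_select` and the score-matrix layout (one row of per-case scores per candidate, higher is better) are assumptions made for the example.

```python
import random
from typing import Optional, Sequence


def lexicase_select(scores: Sequence[Sequence[float]],
                    rng: Optional[random.Random] = None) -> int:
    """Return the index of one parent chosen by lexicase selection.

    scores[i][c] is candidate i's score on training case c (higher is better).
    """
    rng = rng or random.Random()
    case_order = list(range(len(scores[0])))
    rng.shuffle(case_order)                 # random order of training cases
    pool = list(range(len(scores)))         # everyone starts out eligible
    for case in case_order:
        best = max(scores[i][case] for i in pool)
        # Keep only candidates with elite performance on the current case.
        pool = [i for i in pool if scores[i][case] == best]
    return rng.choice(pool)                 # break any remaining tie at random
```

Because the case order is reshuffled for every selection event, a specialist that is elite on only a few training cases is still selected whenever one of its cases appears early in the sequence.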

13.3.1.1 Down-Sampled Lexicase Selection

Each generation, down-sampled lexicase selection randomly subsamples the training set, using only those sampled training cases for evaluation and selection that generation [25, 37]. When augmented with phylogeny-informed fitness estimation, down-sampled lexicase still evaluates on only the sampled training cases. However, the full set of training cases is then used to select parents, and the population’s phylogeny is used to estimate an individual’s performance in any training cases it was not evaluated against.
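As an illustration of how a score on an unevaluated training case might be filled in, the sketch below walks up a taxon's line of descent until it finds an ancestor that was evaluated on that case. It is a simplified sketch under stated assumptions: the `Node` class, the function name `ancestor_estimate`, and the convention that the parent sits at depth 1 are hypothetical, and a failed estimate falls back to the worst possible score, as assumed in this work.

```python
from typing import Dict, Optional


class Node:
    """Hypothetical phylogeny node annotated with evaluated scores."""

    def __init__(self, scores: Dict[int, float], parent: Optional["Node"] = None):
        self.scores = scores    # training case id -> evaluated score
        self.parent = parent


def ancestor_estimate(taxon: Node, case: int, max_depth: int,
                      fail_score: float = 0.0) -> float:
    """Estimate `taxon`'s score on `case` from its nearest evaluated ancestor."""
    if case in taxon.scores:            # evaluated directly: no estimate needed
        return taxon.scores[case]
    node, depth = taxon.parent, 1       # assumption: the parent is at depth 1
    while node is not None and depth <= max_depth:
        if case in node.scores:
            return node.scores[case]    # nearest ancestor evaluated on this case
        node, depth = node.parent, depth + 1
    return fail_score                   # estimation failed: assume worst performance
```

With estimates available for every training case, selection can then run over the full training set even though each individual was evaluated on only the sampled cases.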

13.3.1.2 Cohort Lexicase Selection

Cohort lexicase selection partitions the training set and the population into an equal number of “cohorts” [25]. As is typical, we specify cohorts as evenly sized partitions of the population, with each assigned to a particular evenly sized partition of training cases. For example, a subsample level of 10% would divide the population and training set into 10 evenly sized cohorts. Cohort membership is randomly assigned each generation. Each cohort of candidate solutions is evaluated only on its corresponding cohort of training cases. This method of partitioning reduces the number of per-generation evaluations required, but still uses the full set of training cases each generation. To select a parent, cohort lexicase first selects a cohort to choose from, and applies standard lexicase selection to choose a parent from within the chosen cohort. Under phylogeny-informed fitness estimation, each candidate solution is still evaluated only on its cohort’s training cases, as described above. However, the full set of training cases is used when selecting a parent. Performance on unevaluated training cases is estimated from the population’s phylogeny.
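The partitioning step can be sketched as follows. This is an illustrative Python sketch, not our C++ implementation; `make_cohorts` is a hypothetical helper, and it assumes that the population size and the training set size are both divisible by the number of cohorts.

```python
import random
from typing import List, Tuple


def make_cohorts(pop_size: int, num_cases: int, num_cohorts: int,
                 rng: random.Random) -> List[Tuple[List[int], List[int]]]:
    """Randomly pair each cohort of individuals with a cohort of training cases."""
    members = list(range(pop_size))
    cases = list(range(num_cases))
    rng.shuffle(members)   # membership is reassigned at random each generation
    rng.shuffle(cases)
    # Striding through the shuffled lists yields evenly sized partitions.
    return [(members[k::num_cohorts], cases[k::num_cohorts])
            for k in range(num_cohorts)]


# Example: a 10% subsample level with 500 individuals and 100 training cases.
cohorts = make_cohorts(500, 100, 10, random.Random(1))
evaluations = sum(len(inds) * len(cs) for inds, cs in cohorts)
```

In this example each of the 10 cohorts pairs 50 individuals with 10 training cases, so a generation costs 5,000 evaluations instead of the 50,000 needed to evaluate every individual on every case.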

13.3.2 Diagnostic Experiments

The DOSSIER suite comprises a set of diagnostic benchmarks used to empirically analyze selection schemes on important characteristics of evolutionary search [27]. This suite of diagnostics has already been used to illuminate differences between down-sampled and cohort lexicase, as well as a variety of other lexicase selection variants [26]. Such discriminating power makes it a good choice for analysis of phylogeny-informed fitness estimation. Here, we used the contradictory objectives and multi-path exploration diagnostics to isolate and measure the effect of phylogeny-informed fitness estimation on diversity maintenance and search space exploration. We parameterized these experiments similarly to the diagnostics experiments from [27]. The diagnostics use a simple genome-based representation comprising a sequence of 100 floating-point values, each restricted to a range between 0.0 and 100.0. Each diagnostic translates a genome into a phenotype (also a length-100 sequence of continuous numeric values). Each position in a genome sequence is referred to as a “gene”, and each position in a phenotype is a “trait”. Lexicase selection treats each trait as a single training case. The trait value is taken as the candidate solution’s score on that training case (higher scores are better). We apply subsampling to the diagnostics in the same way as in [26]; traits not included in a subsample are marked as “unknown”, and their scores are not included in the phenotype. For all diagnostic experiments, we ran 10 replicates of each condition. In each replicate, we evolved a population of 500 individuals for 50,000 generations. For each generation of a run, we evaluated the population on the appropriate diagnostic and used the condition-specific selection procedure to select 500 individuals to reproduce asexually. 
We mutated offspring at a per-gene rate of 0.7%, drawing mutations from a normal distribution with a mean of 0.0 and a standard deviation of 1.0. For all conditions using phylogeny-informed fitness estimation, we limited phylogeny search depth to 10. Note that all subsampling levels are run for the same number of generations in our diagnostic experiments, as we make comparisons across trait estimation methods (and not across subsample levels). We describe the two diagnostics used in our experiments below, but we direct readers to [27] for a more detailed presentation.
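The mutation operator used for the diagnostics can be sketched as below. This is a hedged illustration: the function name `mutate` is hypothetical, and clamping mutated genes back into the range [0.0, 100.0] is an assumption made for the example, since the text specifies only that gene values are restricted to that range and not how out-of-range values are handled.

```python
import random
from typing import List, Sequence


def mutate(genome: Sequence[float], rng: random.Random,
           per_gene_rate: float = 0.007, sd: float = 1.0,
           lower: float = 0.0, upper: float = 100.0) -> List[float]:
    """Return a mutated copy of `genome` (asexual reproduction)."""
    child = list(genome)
    for i in range(len(child)):
        if rng.random() < per_gene_rate:
            # Perturb the gene with noise drawn from Normal(0, sd).
            value = child[i] + rng.gauss(0.0, sd)
            # Assumption: clamp to the legal range rather than, e.g., reflecting.
            child[i] = min(upper, max(lower, value))
    return child
```

At a per-gene rate of 0.7%, a length-100 genome receives 0.7 mutations per offspring on average.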

13.3.2.1 Contradictory Objectives Diagnostic

The contradictory objectives diagnostic measures a selection scheme’s capability to find and maintain multiple global optima in a population. This diagnostic translates genomes into phenotypes by identifying the gene with the greatest value and marking that gene as “active” (and all others as “inactive”). The active gene value is directly copied into the corresponding trait in the phenotype, and all traits associated with inactive genes are interpreted as zero-value traits. There are 100 independent global optima in this search space, one associated with each trait. All traits are maximized at the upper bound of 100.0. For analysis purposes, we consider any trait above a score of 96 (out of 100) as “satisfied”. On this diagnostic, we report a population’s “satisfactory trait coverage”. Satisfactory trait coverage counts the number of distinct traits satisfied across the entire population. Satisfactory trait coverage thus quantifies the unique global optima maintained in a population.
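The genome-to-phenotype translation and the coverage measure described above can be sketched as follows. The function names are hypothetical, ties for the greatest gene are broken by taking the first such gene (an assumption), and “above a score of 96” is interpreted as a strict inequality.

```python
from typing import List, Sequence


def contradictory_objectives(genome: Sequence[float]) -> List[float]:
    """Translate a genome into a phenotype with a single active trait."""
    active = max(range(len(genome)), key=lambda i: genome[i])  # first max on ties
    phenotype = [0.0] * len(genome)
    phenotype[active] = genome[active]   # inactive genes become zero-valued traits
    return phenotype


def satisfactory_trait_coverage(population: Sequence[Sequence[float]],
                                threshold: float = 96.0) -> int:
    """Count distinct traits satisfied by at least one member of the population."""
    satisfied = set()
    for genome in population:
        for trait, value in enumerate(contradictory_objectives(genome)):
            if value > threshold:
                satisfied.add(trait)
    return len(satisfied)
```

Because only one trait per individual can be non-zero, a population can raise its coverage only by maintaining individuals that specialize on different traits.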

13.3.2.2 Multi-path Exploration Diagnostic

The multi-path exploration diagnostic measures the ability of a selection scheme to continuously explore multiple pathways in a search space. This diagnostic translates genomes into phenotypes by first identifying the gene with the greatest value and marking that gene as the “activation gene”. Starting from the activation gene, all consecutive genes that are less than or equal to the previous gene are marked as active, creating an active region. Genes in the active region are directly copied into the corresponding traits in the phenotype, and all genes outside of the active region are interpreted as zero-valued traits. Traits are maximized at the upper bound (100.0), and a phenotype with all maximized traits occupies the global optimum in the search space. This diagnostic defines a search space with many pathways, each differing in path length and peak height but identical in slope (see Fig. 5.1 in [26] for a visual example). The pathway beginning at the first gene position leads to the global optimum, and all other pathways lead to local optima. Thus, a selection scheme must be able to continuously explore different pathways in order to consistently reach the global optimum. On this diagnostic, we report the best aggregate trait score in the final population; greater aggregate trait scores indicate that the population found better pathways in the search space.
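A minimal Python sketch of this translation is given below. The function name `multipath_exploration` is hypothetical, and ties for the greatest gene are broken by taking the first such gene, which is an assumption of the example.

```python
from typing import List, Sequence


def multipath_exploration(genome: Sequence[float]) -> List[float]:
    """Translate a genome into a phenotype via its active region."""
    start = max(range(len(genome)), key=lambda i: genome[i])  # activation gene
    phenotype = [0.0] * len(genome)
    phenotype[start] = genome[start]
    for i in range(start + 1, len(genome)):
        if genome[i] > genome[i - 1]:
            break                       # the active region ends at the first increase
        phenotype[i] = genome[i]        # non-increasing genes stay active
    return phenotype


example = multipath_exploration([3.0, 9.0, 7.0, 7.0, 8.0, 1.0])
aggregate_score = sum(example)
```

In the example, the activation gene is at the second position, the active region covers the values 9, 7, and 7, and the aggregate trait score is 23. An active region can span the whole genome only if the activation gene is at the first position, which is why only that pathway leads to the global optimum.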

13.3.3 Genetic Programming Experiments

We compared the problem-solving success of ancestor-based estimation, relative-based estimation, and the no-estimation control on four GP problems from the first and second program synthesis benchmark suites [19, 21]: Median, Small or Large, Grade, and Fizz Buzz. As in our diagnostic experiments, we conducted our GP experiments using both down-sampled and cohort lexicase selection, each at four subsampling levels: 1%, 5%, 10%, and 50%. To account for greater phenotypic volatility within the GP domain, we limited all phylogeny searches to depth five.

13.3.3.1 GP System

For all GP experiments, we ran 30 replicates of each condition. In each replicate, we evolved a population of 1,000 linear genetic programs using the SignalGP representation [31]. Our supplemental material [29] provides full instruction sets and configuration details (including source code). Although SignalGP supports the evolution of programs composed of many callable modules, we limited all programs to a single program module for simplicity. We reproduced programs asexually and applied mutations to offspring. Single-instruction insertions, deletions, and substitutions were applied, each at a per-instruction rate of 0.1%. We also applied ‘slip’ mutations [32], which can duplicate or delete sequences of instructions, at a per-program rate of 5%.

13.3.3.2 Program Synthesis Problems

The program synthesis benchmark suites include a variety of introductory programming problems that are well-studied and are commonly used to evaluate GP methodology. We chose the Median, Small or Large, Grade, and Fizz Buzz problems because they have training sets that contain different qualitative categories of training cases. Brief descriptions of each problem are provided below:

• Median [21]: Programs are given three integer inputs (−100 ≤ input_i ≤ 100) and must output the median value. We limited program length to a maximum of 64 instructions and limited the maximum number of instruction-execution steps to 64.
• Small or Large [21]: Programs are given an integer n and must output “small” if n < 1000, “large” if n ≥ 2000, and “neither” if 1000 ≤ n < 2000. We limited program length to a maximum of 64 instructions and limited the maximum number of instruction-execution steps to 64.
• Grade [21]: Programs receive five integer inputs in the range [0, 100]: A, B, C, D, and score. A, B, C, and D are monotonically decreasing and unique, each defining the minimum score needed to receive that “grade”. The program must read these thresholds and return the appropriate letter grade for the given score or return F if score < D. We limited program length to a maximum of 128 instructions and limited the maximum number of instruction-execution steps to 128.
• Fizz Buzz [19]: Given an integer x, the program must return “Fizz” if x is divisible by 3, “Buzz” if x is divisible by 5, “FizzBuzz” if x is divisible by both 3 and 5, and x if none of the prior conditions are true. We limited program length to a maximum of 128 instructions and limited the maximum number of instruction-execution steps to 128.

For each problem, we used 100 training cases for program evaluation and selection and 1,000 test cases for determination of problem-solving success; we provide the training and testing sets used for all problems in our supplemental material [29]. We evaluated all training and test cases on a pass–fail basis. For all problems, we ensured that input–output edge cases were included in both the training and testing sets. However, training sets were not necessarily balanced; that is, we did not ensure equal representation of different categories of outputs. We categorized a replicate as successful if it produced a program that solved all training cases that it was evaluated against and all 1,000 testing cases. We terminated runs after a solution was found or after 30,000,000 training case evaluations, which corresponds to 300 generations of evolution under standard lexicase selection where all programs are evaluated against all training cases each generation.
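The evaluation budget translates into a different number of generations at each subsampling level. The short sketch below works through that arithmetic; the helper name `generations_for_budget` is hypothetical, and the calculation assumes that only actual evaluations (and not estimated scores) count against the budget.

```python
def generations_for_budget(budget: int, pop_size: int, num_cases: int,
                           subsample_percent: int) -> int:
    """Generations affordable when each individual is evaluated on a subsample."""
    cases_per_individual = num_cases * subsample_percent // 100
    evaluations_per_generation = pop_size * cases_per_individual
    return budget // evaluations_per_generation


BUDGET = 30_000_000
generations = {pct: generations_for_budget(BUDGET, 1000, 100, pct)
               for pct in (100, 50, 10, 5, 1)}
```

With 1,000 programs and 100 training cases, the budget allows 300 generations of standard lexicase, 600 generations at 50% subsampling, 3,000 at 10%, 6,000 at 5%, and 30,000 at 1%.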

13.3.4 Statistical Analyses

For all experiments, we only compared among our three fitness estimation treatments. We did not compare measurements taken from treatments across problems, subsampling method (down-sampling and cohorts), or down-sampling level (1%, 5%, 10%, and 50%). Thus, we only made pairwise comparisons between fitness estimation methods that shared problem, subsampling method, and down-sampling level. When comparing distributions of measurements taken from different treatments, we performed Kruskal–Wallis tests to screen for statistical differences among independent conditions. For comparisons in which the Kruskal–Wallis test was significant (significance level of 0.05), we performed post hoc Wilcoxon rank-sum tests to discern pairwise differences. When comparing problem-solving success rates, we used pairwise Fisher’s exact tests (significance level of 0.05). We used the Holm–Bonferroni method to correct for multiple comparisons where appropriate.
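For readers unfamiliar with the Holm–Bonferroni method, the sketch below shows the step-down adjustment applied to a family of pairwise p-values. It is a generic illustration rather than our analysis code, and the three p-values in the example are made up; they stand in for the three pairwise comparisons among estimation treatments within one condition.

```python
from typing import List, Sequence, Tuple


def holm_bonferroni(p_values: Sequence[float],
                    alpha: float = 0.05) -> Tuple[List[float], List[bool]]:
    """Return Holm-adjusted p-values and the resulting reject decisions."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by the number of remaining tests.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted, [p <= alpha for p in adjusted]


# Hypothetical p-values for three pairwise comparisons.
adjusted, reject = holm_bonferroni([0.01, 0.04, 0.03])
```

In this example only the first comparison remains significant at the 0.05 level after correction, even though all three unadjusted p-values are below 0.05.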

13.3.5 Software and Data Availability

The software used to conduct our experiments, our data analyses, and our experiment data can be found in our supplemental material [29], which is hosted on GitHub and archived on Zenodo. Our experiments are implemented in C++ using the Empirical library [40]. Our experiment data are archived on the Open Science Framework at https://osf.io/wxckn/.


13.4 Results and Discussion

13.4.1 Phylogeny-Informed Estimation Reduces Diversity Loss Caused by Subsampling

Figure 13.2 shows final satisfactory trait coverages achieved on the contradictory objectives diagnostic, which measures a selection scheme’s ability to find and maintain a diversity of mutually exclusive optima. Satisfactory trait coverage is the total number of unique global optima maintained in a population. The horizontal line in Fig. 13.2 indicates median satisfactory trait coverage (of 10 runs) at the end of 50,000 generations of standard lexicase, providing a baseline reference for standard lexicase selection’s capacity for diversity maintenance on this diagnostic. Both ancestor-based and relative-based estimation methods mitigated the diversity loss caused by subsampling across most subsampling levels for both down-sampled

Fig. 13.2 Satisfactory trait coverage on the contradictory objectives diagnostic. Panels a and b show satisfactory trait coverage for ancestor-based, relative-based, and no estimation in the context of down-sampled lexicase and cohort lexicase, respectively. The black horizontal line in each plot indicates the median satisfactory trait coverage of 10 runs of standard lexicase selection (no subsampling). Kruskal–Wallis tests for all subsampling levels for each of down-sampled and cohort lexicase were statistically significant (p < 0.001)


and cohort lexicase. However, the magnitude of mitigation varied by condition. We were unable to detect a significant difference between ancestor-based estimation and the no-estimation control for 1% down-sampled lexicase, but in all other cases of down-sampled and cohort lexicase, both ancestor-based and relative-based estimation had greater satisfactory trait coverage (corrected Wilcoxon rank-sum tests, p < 0.003). At 50% subsampling, phylogeny-informed down-sampled lexicase performed as well as our standard lexicase baseline, entirely mitigating the drawbacks of down-sampling. Unsurprisingly, the absolute efficacy of both phylogeny-informed estimation methods decreased as subsampling levels became more extreme. We initially expected relative-based estimation to facilitate better satisfactory trait coverage than ancestor-based estimation, as fitness estimates were not limited to direct ancestors. However, our data do not support this hypothesis. We were unable to detect a significant difference in satisfactory trait coverage between ancestor-based and relative-based estimation across all subsampling levels for both down-sampled and cohort lexicase. Consistent with previous work [26], cohort lexicase with no estimation generally maintained more objectives than down-sampled lexicase with no estimation. This is likely due to the fact that cohort lexicase uses all training cases each generation (partitioned across different subsets of the population). In contrast, down-sampled lexicase may omit an entire category of training cases for a generation. Such omissions increase the chance that lexicase fails to select lineages that specialize on the excluded training cases, which can eliminate potentially important diversity. Given that cohort lexicase uses all training cases every generation, we initially expected cohort lexicase to benefit more from phylogeny-informed estimation than down-sampled lexicase. 
Under cohort partitioning, we expected nearby relatives to have been evaluated on missing training cases because contemporary relatives in other cohorts would be evaluated against different subsets of training cases. This would benefit relative-based estimation over ancestor-based estimation. Direct analysis of estimated phylogenetic distances under the diagnostic problems, reported in supplemental material, provides some support for this intuition. Observationally, however, cohort lexicase does not appear to have benefited from phylogeny-informed estimation more than down-sampled lexicase. Further investigation is needed to establish any potential interactions between subsampling methods and fitness estimation methods.

13.4.2 Phylogeny-Informed Estimation Improves Poor Exploration Caused by Down-Sampling

The multi-path exploration diagnostic measures a selection scheme’s ability to explore many pathways in a search space in order to find the global optimum. Figure 13.3 shows performance (max aggregate score) results from all experiments with the multi-path exploration diagnostic. The horizontal line in Fig. 13.3 indicates


Fig. 13.3 Maximum aggregate scores in the final generation on the multi-path exploration diagnostic. Panels a and b show max aggregate scores for ancestor-based, relative-based, and no estimation in the context of down-sampled lexicase and cohort lexicase, respectively. The black horizontal line in each plot indicates the median max aggregate score from 10 runs of standard lexicase selection (no subsampling). Kruskal–Wallis tests for each of the following comparisons were statistically significant (p < 0.001): down-sampled lexicase at 5%, 10%, and 50% subsampling levels and cohort lexicase at 5% and 10% subsampling levels

the median score (out of 10 runs) at the end of 50,000 generations of standard lexicase. This serves as a baseline reference for standard lexicase selection’s capacity for search space exploration. The phylogeny-informed estimation methods reduced the drawback of subsampling on the exploration diagnostic under most of the conditions tested, although the magnitude varied by condition. Under down-sampled lexicase, phylogeny-informed estimation improved performance at 5%, 10%, and 50% subsampling levels (statistically significant, p < 0.003), and we were unable to detect a statistically significant difference between ancestor-based and relative-based estimation at any subsampling level. With phylogeny-informed estimation, some runs of 50% down-sampled lexicase matched (or exceeded) the performance of our standard lexicase reference (Fig. 13.3a). This is particularly notable, as we ran all conditions for an equal number of generations regardless of subsampling level; that is, down-sampling (at all


subsampling levels) did not confer the usual benefits of increased population size or additional generations relative to our standard lexicase reference. Under cohort lexicase, we were unable to detect statistically significant differences among conditions under 1% and 50% subsampling levels. Consistent with prior work [26], the performance of cohort lexicase at moderate subsampling levels is close to that of standard lexicase on the exploration diagnostic, so there is little potential for either estimation method to improve over the no-estimation control. At 10% subsampling (i.e., 10 partitions), ancestor-based and relative-based estimation outperformed the no-estimation control (p < 0.001), and relative-based estimation outperformed ancestor-based estimation (p < 0.001). At 5% cohort partitioning (i.e., 20 partitions), relative-based estimation outperformed both ancestor-based estimation and the no-estimation control (p < 0.05), but we detected no statistically significant difference between ancestor-based estimation and the no-estimation control. As noted previously, we held generations constant across subsampling levels for these experiments. We do expect performance to continue to increase given longer runtimes (performance over time is visualized in the supplemental material [29]). Further experimentation is needed to investigate whether extra generations would allow for smaller subsample sizes (e.g., 1%, 5%) to achieve the performance of more moderate subsample sizes or of full lexicase selection. Moreover, it is unclear whether phylogeny-informed estimation would enhance progress over no-estimation controls if given more generations of evolution. Phylogeny-informed estimation more consistently improved performance under down-sampled lexicase (3 out of 4 subsampling levels) than under cohort lexicase (2 out of 4 subsampling levels). 
We hypothesize that cohort partitioning may interact poorly with our implementation of phylogeny-informed estimation in the context of lexicase selection. Lexicase selection is sensitive to the ratio between population size and the number of training cases used for selection [26]. Increasing the number of training cases for a fixed population size decreases the odds that any particular training case will appear early in any of the shuffled sequences of training cases used during selection, which can be detrimental to lexicase’s ability to maintain diversity. Cohort partitioning reduces the effective population size for any given selection event, as candidate solutions compete within only their cohort. Normally, cohort partitioning also reduces the number of training cases used to mediate selection events within each cohort, maintaining approximately the same ratio between population size and the number of training cases used for selection. However, in our implementation of phylogeny-informed estimation, we use all training cases to mediate selection in each cohort; during selection, this changes the ratio between the population size and number of training cases used for selection. We hypothesize that this dynamic can cause phylogeny-informed estimation to underperform when used in combination with cohort lexicase selection, especially for problems with low levels of redundancy in the training cases. To mitigate this problem, we could modify cohort partitioning when used in combination with phylogeny-informed estimation: evaluation would be conducted according to standard cohort partitioning procedures (still using the full training set across the entire population), but when selecting parents, we can


allow all individuals in the population to compete with one another (as opposed to competing only within their cohort).

13.4.3 Phylogeny-Informed Estimation Can Enable Extreme Subsampling for Some Genetic Programming Problems

Table 13.1a and b show problem-solving success on the Median, Small or Large, Grade, and Fizz Buzz program synthesis problems for down-sampled lexicase and cohort lexicase, respectively. We consider a run to be successful if it produces a program capable of passing all tests in the testing set. In total, we compared the success rates of ancestor-based estimation, relative-based estimation, and the no-estimation control under 32 distinct combinations of subsampling method (down-sampling and cohort partitioning), subsampling level (1%, 5%, 10%, and 50%), and program synthesis problem. We were unable to detect a statistically significant difference among fitness estimation treatments in 20 out of 32 comparisons (Table 13.1). At least one phylogeny-informed estimation method resulted in greater success rates in 4 out of 16 comparisons under down-sampled lexicase and in 5 out of 16 comparisons under cohort lexicase (p < 0.05, Fisher’s exact test with Holm–Bonferroni correction for multiple comparisons). Under down-sampled lexicase, the no-estimation control outperformed at least one phylogeny-informed estimation method in 3 out of 16 comparisons (p < 0.04). Under cohort lexicase, we detected no instances where the no-estimation control resulted in a significantly greater number of successes than either phylogeny-informed estimation method. Overall, the benefits of phylogeny-informed estimation varied by subsampling method and by problem. We found no statistically significant differences in success rates between ancestor-based and relative-based estimation in any comparison. Under cohort lexicase, phylogeny-informed estimation had a more consistently neutral or beneficial effect on problem-solving success. 
Down-sampled lexicase, however, was sometimes more successful without the addition of phylogeny-informed estimation; for example, on the Fizz Buzz problem with 5% down-sampling, the no-estimation control substantially outperformed either phylogeny-informed estimation method (Table 13.1a). This result contrasts with the diagnostics results where phylogeny-informed estimation was more consistently beneficial. We suspect this difference stems from the restricted scope of the diagnostics, which by design isolate specific problem-solving characteristics. Our diagnostics show that phylogeny-informed estimation can improve diversity maintenance and multi-path exploration when subsampling; however, there are other problem-solving characteristics that our initial methods of phylogeny-informed estimation may degrade. For example, any fitness estimation technique can result in inaccurate estimates, which may result in the selection of low-quality candidate solutions that would not have been otherwise selected. Further investigation is required to disentangle how phylogeny-informed estimation can impede evolutionary search, which will allow us to improve our estimation methods to address these shortcomings. Future work will run phylogeny-informed estimation on the full suite of DOSSIER diagnostic problems, assessing a wider range of problem-solving characteristics. Problem-solving successes at 1% subsampling are particularly notable, as evaluation was limited to just one training case per individual. Under down-sampled lexicase, 1% subsampling results in elite selection mediated by the sampled training case, and under cohort lexicase, each of the 100 cohorts undergoes elite selection on the particular corresponding training case. Even under these seemingly extreme conditions, the no-estimation control produced solutions to the Median and Grade problems.

Table 13.1 Problem-solving success for down-sampled lexicase (a) and cohort lexicase (b) on the GP problems Median, Small or Large, Grade, and Fizz Buzz. For each subsampling level, we compared the success rates of ancestor-based estimation, relative-based estimation, and no-estimation (out of 30 replicates each). The “Full” column gives the success rate of running standard lexicase selection (out of 30 replicates); we include these results as a baseline reference and did not make direct comparisons to them. We were unable to detect a statistically significant difference between relative-based and ancestor-based estimation across all tables (Fisher’s exact test with a Holm–Bonferroni correction for multiple comparisons). If phylogeny-based estimation results are bolded and italicized, they each significantly outperformed the no-estimation control. If a no-estimation control result is bolded and italicized, it significantly outperformed both phylogeny-based estimation methods. A result is annotated with a † if one phylogeny-informed estimation treatment differed significantly from the control, but the other did not


13.5 Conclusion

This work demonstrates two approaches to phylogeny-informed fitness estimation: one aimed to enable faster estimate lookup (ancestor-based estimation) and the other performing a more extensive search for close relatives (relative-based estimation). Using selection scheme diagnostic problems, we find evidence that our phylogeny-informed estimation methods can help to mitigate the drawbacks of subsampling in the context of lexicase selection, improving diversity maintenance (Fig. 13.2) and search space exploration (Fig. 13.3). In practice, we find that the effects of phylogeny-informed estimation on problem-solving success for GP vary by problem, subsampling method, and subsampling level (Table 13.1).

We did not find consistent or substantial differences between ancestor-based and relative-based fitness estimation. Future work should determine whether and how these two estimation methods differ in their influence on evolutionary search. As of now, we recommend the use of ancestor-based estimation, as its implementation can be more effectively optimized than relative-based estimation.

We opted for simplicity in our application of phylogeny-informed estimation to both down-sampled lexicase and cohort lexicase. Estimation methods allow each lexicase parent selection event to use the full training set, as an individual’s performance on any unevaluated training cases can be estimated. As such, phylogeny-informed estimation allows each member of the population to be evaluated on a different subset of the training set. We could subsample the training cases systematically to minimize the distance in the phylogeny required for estimates, which may lead to an increase in estimation accuracy and improved problem-solving success. Further, we could also take into account mutation information when choosing subsamples. For example, we could more thoroughly evaluate (i.e., use more training cases) offspring with larger numbers of mutations, as they may be more likely to be phenotypically distinct from their parents.

In addition to the phylogeny-informed estimation methods proposed in this work, we envision many ways in which runtime phylogeny tracking can be used to improve evolutionary search. Beyond fitness estimation, runtime phylogenetic analyses may be useful in the context of many quality diversity algorithms, which currently rely on phenotypic or behavioral diversity [41].

Acknowledgements

We thank the participants of Genetic Programming Theory and Practice XX for helpful comments and suggestions on our work. In particular, we thank Nicholas McPhee for insightful feedback on our manuscript. This work was supported in part through computational resources and services provided by the Institute for Cyber-Enabled Research at Michigan State University.


References

1. Aenugu, S., Spector, L.: Lexicase selection in learning classifier systems. In: Proceedings of the Genetic and Evolutionary Computation Conference - GECCO ’19, pp. 356–364. ACM Press, Prague, Czech Republic (2019). https://doi.org/10.1145/3321707.3321828, http://dl.acm.org/citation.cfm?doid=3321707.3321828
2. Bohm, C., G., N.C., Hintze, A.: MABE (Modular Agent Based Evolver): A framework for digital evolution research. In: Proceedings of the 14th European Conference on Artificial Life ECAL 2017, pp. 76–83. MIT Press, Lyon, France (2017). https://doi.org/10.7551/ecal_a_016
3. Boldi, R., Briesch, M., Sobania, D., Lalejini, A., Helmuth, T., Rothlauf, F., Ofria, C., Spector, L.: Informed down-sampled lexicase selection: identifying productive training cases for efficient problem solving (2023). http://arxiv.org/abs/2301.01488, arXiv:2301.01488 [cs]
4. Boldi, R., Lalejini, A., Helmuth, T., Spector, L.: A Static Analysis of Informed Down-Samples (2023). https://doi.org/10.1145/3583133.3590751, http://arxiv.org/abs/2304.01978, arXiv:2304.01978 [cs]
5. Bui, L.T., Abbass, H.A., Essam, D.: Fitness inheritance for noisy evolutionary multi-objective optimization. In: Proceedings of the 7th annual conference on Genetic and evolutionary computation, pp. 779–785. ACM, Washington, DC, USA (2005). https://doi.org/10.1145/1068009.1068141
6. Burke, E., Gustafson, S., Kendall, G., Krasnogor, N.: Is increased diversity in genetic programming beneficial? An analysis of lineage selection. In: The 2003 Congress on Evolutionary Computation, 2003. CEC ’03. vol. 2, pp. 1398–1405. IEEE, Canberra, Australia (2003). https://doi.org/10.1109/CEC.2003.1299834
7. Burlacu, B., Yang, K., Affenzeller, M.: Population diversity and inheritance in genetic programming for symbolic regression. Nat. Comput. (2023). https://doi.org/10.1007/s11047-022-09934-x
8. Chen, J.H., Goldberg, D.E., Ho, S.Y., Sastry, K.: Fitness inheritance in multi-objective optimization. In: Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, pp. 319–326. GECCO’02, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2002), event-place: New York City, New York
9. Crepinsek, M., Mernik, M., Liu, S.H.: Analysis of exploration and exploitation in evolutionary algorithms by ancestry trees. Int. J. Innov. Comput. Appl. 3(1), 11 (2011). https://doi.org/10.1504/IJICA.2011.037947
10. De Rainville, F.M., Fortin, F.A., Gardner, M.A., Parizeau, M., Gagné, C.: DEAP: a python framework for evolutionary algorithms. In: Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation, pp. 85–92. ACM, Philadelphia, Pennsylvania USA (2012). https://doi.org/10.1145/2330784.2330799
11. Ding, L., Spector, L.: Optimizing neural networks with gradient lexicase selection. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=J_2xNmVcY4
12. Dolson, E., Lalejini, A., Jorgensen, S., Ofria, C.: Interpreting the tape of life: ancestry-based analyses provide insights and intuition about evolutionary dynamics. Artif. Life 26(1), 58–79 (2020). https://doi.org/10.1162/artl_a_00313
13. Dolson, E., Moreno, M.A.: Phylotrackpy: a python phylogeny tracker (2023). https://doi.org/10.5281/ZENODO.7922091 [Computer software]
14. Dolson, E.L., Banzhaf, W., Ofria, C.: Ecological theory provides insights about evolutionary computation. preprint, PeerJ Preprints (2018). https://doi.org/10.7287/peerj.preprints.27315v1
15. Donatucci, D., Dramdahl, M.K., McPhee, N.F., Morris, M.: Analysis of Genetic Programming Ancestry Using a Graph Database (2014)
16. Ferguson, A.J., Hernandez, J.G., Junghans, D., Lalejini, A., Dolson, E., Ofria, C.: Characterizing the Effects of Random Subsampling on Lexicase Selection. In: Banzhaf, W., Goodman, E., Sheneman, L., Trujillo, L., Worzel, B. (eds.) Genetic Programming Theory and Practice XVII, pp. 1–23. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-39958-0_1. Series Title: Genetic and Evolutionary Computation


17. Goldberg, D.E., Richardson, J.: Genetic Algorithms with Sharing for Multimodal Function Optimization. In: Proceedings of the Second International Conference on Genetic Algorithms on Genetic Algorithms and Their Application, pp. 41–49. L. Erlbaum Associates Inc., USA (1987), event-place: Cambridge, Massachusetts, USA
18. Helmuth, T., Abdelhady, A.: Benchmarking parent selection for program synthesis by genetic programming. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, pp. 237–238. ACM, Cancún Mexico (2020). https://doi.org/10.1145/3377929.3389987
19. Helmuth, T., Kelly, P.: PSB2: the second program synthesis benchmark suite. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 785–794. ACM, Lille France (2021). https://doi.org/10.1145/3449639.3459285
20. Helmuth, T., McPhee, N.F., Spector, L.: Effects of Lexicase and Tournament Selection on Diversity Recovery and Maintenance. In: Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion - GECCO ’16 Companion, pp. 983–990. ACM Press, Denver, Colorado, USA (2016). https://doi.org/10.1145/2908961.2931657, http://dl.acm.org/citation.cfm?doid=2908961.2931657
21. Helmuth, T., Spector, L.: General Program Synthesis Benchmark Suite. In: Proceedings of the 2015 on Genetic and Evolutionary Computation Conference - GECCO ’15. pp. 1039–1046. ACM Press, Madrid, Spain (2015). https://doi.org/10.1145/2739480.2754769, http://dl.acm.org/citation.cfm?doid=2739480.2754769
22. Helmuth, T., Spector, L.: Problem-solving benefits of down-sampled lexicase selection. Artif. Life 27(3–4), 183–203 (2022)
23. Helmuth, T., Spector, L., Matheson, J.: Solving uncompromising problems with lexicase selection. IEEE Trans. Evolut. Comput. 19(5), 630–643 (2015)
24. Hernandez, J.G., Lalejini, A., Dolson, E.: What can phylogenetic metrics tell us about useful diversity in evolutionary algorithms? In: Banzhaf, W., Trujillo, L., Winkler, S., Worzel, B. (eds.) Genetic Programming Theory and Practice XVIII, pp. 63–82. Springer Nature Singapore, Singapore (2022). https://doi.org/10.1007/978-981-16-8113-4_4. Series Title: Genetic and Evolutionary Computation
25. Hernandez, J.G., Lalejini, A., Dolson, E., Ofria, C.: Random subsampling improves performance in lexicase selection. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion on - GECCO ’19, pp. 2028–2031. ACM Press, Prague, Czech Republic (2019). https://doi.org/10.1145/3319619.3326900
26. Hernandez, J.G., Lalejini, A., Ofria, C.: An Exploration of Exploration: Measuring the Ability of Lexicase Selection to Find Obscure Pathways to Optimality. In: Banzhaf, W., Trujillo, L., Winkler, S., Worzel, B. (eds.) Genetic Programming Theory and Practice XVIII, pp. 83–107. Springer Nature Singapore, Singapore (2022). https://doi.org/10.1007/978-981-16-8113-4_5. Series Title: Genetic and Evolutionary Computation
27. Hernandez, J.G., Lalejini, A., Ofria, C.: A suite of diagnostic metrics for characterizing selection schemes (2022), http://arxiv.org/abs/2204.13839
28. La Cava, W., Spector, L., Danai, K.: Epsilon-Lexicase Selection for Regression. In: Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 741–748. ACM, Denver Colorado USA (2016). https://doi.org/10.1145/2908812.2908898
29. Lalejini, A., Dolson, E., Moreno, M.A., Hernandez, J.G.: Phylogeny-informed evaluation (Archived GitHub repository) (2023). https://doi.org/10.5281/zenodo.7938857
30. Lalejini, A., Dolson, E., Vostinar, A.E., Zaman, L.: Artificial selection methods from evolutionary computing show promise for directed evolution of microbes. eLife 11, e79665 (2022). https://doi.org/10.7554/eLife.79665
31. Lalejini, A., Ofria, C.: Evolving event-driven programs with SignalGP. In: Proceedings of the Genetic and Evolutionary Computation Conference on - GECCO ’18, pp. 1135–1142. ACM Press, Kyoto, Japan (2018). https://doi.org/10.1145/3205455.3205523
32. Lalejini, A., Wiser, M.J., Ofria, C.: Gene duplications drive the evolution of complex traits and regulation. In: Proceedings of the 14th European Conference on Artificial Life ECAL 2017, pp. 257–264. MIT Press, Lyon, France (2017). https://doi.org/10.7551/ecal_a_045


33. Matsumoto, N., Saini, A.K., Ribeiro, P., Choi, H., Orlenko, A., Lyytikäinen, L.P., Laurikka, J.O., Lehtimäki, T., Batista, S., Moore, J.H.: Faster Convergence with Lexicase Selection in Tree-Based Automated Machine Learning. In: Pappa, G., Giacobini, M., Vasicek, Z. (eds.) Genetic Programming, vol. 13986, pp. 165–181. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-29573-7_11, https://link.springer.com/10.1007/978-3-031-29573-7_11, series Title: Lecture Notes in Computer Science
34. McPhee, N.F., Casale, M.M., Finzel, M., Helmuth, T., Spector, L.: Visualizing genetic programming ancestries using graph databases. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 245–246. ACM, Berlin Germany (2017). https://doi.org/10.1145/3067695.3075617
35. McPhee, N.F., Hopper, N.J.: Analysis of genetic diversity through population history. In: Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 2. pp. 1112–1120. GECCO’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1999), event-place: Orlando, Florida
36. Metevier, B., Saini, A.K., Spector, L.: Lexicase selection beyond genetic programming. In: Banzhaf, W., Spector, L., Sheneman, L. (eds.) Genetic Programming Theory and Practice XVI, pp. 123–136. Springer International Publishing, Cham (2019). https://doi.org/10.1007/978-3-030-04735-1_7. Series Title: Genetic and Evolutionary Computation
37. Moore, J.M., Stanton, A.: Lexicase selection outperforms previous strategies for incremental evolution of virtual creature controllers. In: Proceedings of the 14th European Conference on Artificial Life ECAL 2017, pp. 290–297. MIT Press, Lyon, France (2017). https://doi.org/10.7551/ecal_a_050, https://www.mitpressjournals.org/doi/abs/10.1162/isal_a_050
38. Moreno, M.A., Dolson, E., Ofria, C.: Hereditary stratigraphy: genome annotations to enable phylogenetic inference over distributed populations. In: The 2022 Conference on Artificial Life. MIT Press (2022). https://doi.org/10.1162/isal_a_00550
39. Moreno, M.A., Dolson, E., Ofria, C.: hstrat: a Python Package for phylogenetic inference on distributed digital evolution populations. J. Open Source Softw. 7(80), 4866 (2022). https://doi.org/10.21105/joss.04866
40. Ofria, C., Moreno, M.A., Dolson, E., Lalejini, A., Rodriguez Papa, S., Fenton, J., Perry, K., Jorgensen, S., Hoffmanriley, Grenewode, Edwards, O.B., Stredwick, J., Cgnitash, theycallmeheem, Vostinar, A., Moreno, R., Schossau, J., Zaman, L., djrain: Empirical: C++ library for efficient, reliable, and accessible scientific software (2020). https://doi.org/10.5281/ZENODO.4141943 [Computer Software]
41. Pugh, J.K., Soros, L.B., Stanley, K.O.: Quality Diversity: A New Frontier for Evolutionary Computation. Front. Robot. AI 3 (2016). https://doi.org/10.3389/frobt.2016.00040
42. Rothan, H.A., Byrareddy, S.N.: The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak. J. Autoimmunity 109, 102433 (2020)
43. Santana-Quintero, L.V., Montaño, A.A., Coello, C.A.C.: A Review of Techniques for Handling Expensive Functions in Evolutionary Multi-Objective Optimization. In: Hiot, L.M., Ong, Y.S., Tenne, Y., Goh, C.K. (eds.) Computational Intelligence in Expensive Optimization Problems, vol. 2, pp. 29–59. Springer Berlin Heidelberg, Berlin, Heidelberg (2010). https://doi.org/10.1007/978-3-642-10701-6_2, http://link.springer.com/10.1007/978-3-642-10701-6_2, series Title: Evolutionary Learning and Optimization
44. Shahbandegan, S., Hernandez, J.G., Lalejini, A., Dolson, E.: Untangling phylogenetic diversity’s role in evolutionary computation using a suite of diagnostic fitness landscapes. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 2322–2325. ACM, Boston Massachusetts (2022). https://doi.org/10.1145/3520304.3534028
45. Tucker, C.M., Cadotte, M.W., Carvalho, S.B., Davies, T.J., Ferrier, S., Fritz, S.A., Grenyer, R., Helmus, M.R., Jin, L.S., Mooers, A.O., Pavoine, S., Purschke, O., Redding, D.W., Rosauer, D.F., Winter, M., Mazel, F.: A guide to phylogenetic metrics for conservation, community ecology and macroecology: A guide to phylogenetic metrics for ecology. Biol. Rev. 92(2), 698–715 (2017). https://doi.org/10.1111/brv.12252

Chapter 14

Origami: (un)folding the Abstraction of Recursion Schemes for Program Synthesis

Matheus Campos Fernandes, Fabricio Olivetti de Franca, and Emilio Francesquini

M. C. Fernandes (corresponding author) · F. O. de Franca · E. Francesquini
Federal University of ABC, Santo Andre, SP, Brazil

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_14

14.1 Introduction

Computer programming can be seen as the task of creating a set of instructions that, when executed, can provide as output a solution to a specific problem. This task involves several steps starting from an abstract description of the solution (i.e., an algorithm) to a concrete implementation written in a programming language. Given the importance of creating computer programs, and the repetitive tasks usually involved, an often sought holy grail is the ability to automatically generate source code, either in part or in its entirety. This automatic generation would follow a certain high-level specification, which reduces the burden of manually creating the program. This problem is known as Program Synthesis (PS). In the specific scenario in which the user provides a set of input–output examples, the problem is referred to as Inductive Synthesis or Programming-by-Example. The main advantage of this approach is that it can be easier to create a set of inputs and expected outputs; on the other hand, it might be difficult to provide a representative set containing corner cases.

A popular approach for PS is Genetic Programming (GP), a technique that applies the concept of evolution to search for a program constrained by the search space of a solution representation. Some notable approaches include PushGP [1], CBGP [2], GE [3], and GGGP [4].

So far, most PS algorithms do not effectively exploit common programming patterns, which means the search algorithm needs to scan the search space to find a correct program without any guidance. While a less restrictive search space can be desirable to allow the algorithm to navigate toward one of the many solutions, a constrained search space, if correctly done, can speed up the search process, allowing the search algorithm to focus only on part of the program.

Fernandes et al. [5] explored the use of abstractions (e.g., higher-order functions) and additional information (e.g., type information) to help with the exploration of the search space by introducing Higher-Order Typed Genetic Programming (HOTGP). This algorithm exploits the information of the input and output types of the desired program to limit the search space. It also adds support for higher-order functions, λ-functions, and parametric polymorphism to make it possible to apply common programming patterns such as map and filter. The authors showed that the use of the type information and higher-order functions helped to improve the success rate on GPSB, surpassing some of the current state-of-the-art approaches.

A general and useful pattern is the use of Recursion Schemes [6]. This pattern captures the common structure of recursive functions and data structures as combinations of consumer and producer functions, also known as fold and unfold, respectively. They are known to be very general and capable of implementing many commonly used algorithms, ranging from data aggregation to sorting. Gibbons [7] coined the name origami programming and showed many examples of how to write common algorithms using these patterns. The folding and unfolding process can be generalized through Recursion Schemes, which divide the programming task into three simpler steps: (i) choosing the scheme (among a limited number of choices), (ii) choosing a fixed point of a data structure that describes the recursion trace, and (iii) writing the consumer (fold) and producer (unfold) procedures.
In this paper, we studied the problems in the General Program Synthesis Benchmark Suite (GPSB) [8] and solved them using Recursion Schemes, reporting a selection of distinct solutions. This examination revealed that most of the solutions for the proposed problems follow the common pattern of folding, unfolding a data structure, or a composition of both. With these observations, we explore the crafting of computer programs following these recursive patterns and craft program templates that follow these patterns along with an explanation of how they can constrain the search space of candidate programs. Our main goal is to find a set of recursion schemes that simplifies the program synthesis process while reducing the search space with type information. We also present the general idea on how to evolve such programs, called Origami, a program synthesis algorithm that first determines the (un)folding pattern it will evolve and then it evolves the corresponding template using inductive synthesis. The remainder of this text is organized as follows. In Sect. 14.2, we introduce recursion schemes and explain the basic concepts needed to understand our proposal. Section 14.3 presents Origami and outlines some examples of recursive patterns that can be used to solve common programming problems. In Sect. 14.4, we show some preliminary results adapting HOTGP to one of the presented patterns and analyze the results. Finally, in Sect. 14.5, we give some final observations about Origami and describe future work.


14.2 Recursion Schemes

Recursive functions are sometimes referred to as the goto of functional programming languages [6, 7, 9, 10]. They are an essential part of how to build a program in these languages, allowing programmers to avoid mutable state and other imperative programming constructs. Both goto and recursion are sometimes considered harmful when creating a computer program. The reason is that the former can make it hard to understand the program execution flow as the program grows larger, and the latter may lead to problems such as stack overflows if proper care is not taken. This motivated the introduction of higher-order functions as the preferred alternative to direct recursion.

Higher-order functions are functions that either (i) take one or more functions as an argument, or (ii) return a function as output. Here we are interested mostly in the first case. By using higher-order functions which apply the function received as input in a recursive pattern to a given value or data structure, we can represent many common recursion schemes. Some of the most well-known examples of these functions are map, filter, and fold. The main advantage of using these general patterns is that the programmer does not need to be as careful (guaranteeing termination or sane memory usage, for example) as they would need to be using direct recursion.

Among these higher-order functions, fold is the most general (in the sense that it can be used to implement the others). This pattern is capable of representing many recursive algorithms that start with a list of values and return a transformed value. Another common general pattern is captured by the unfold function, which starts from a seed value and unfolds it, generating a list of values in the process. While fold and unfold describe the most common patterns of recursive functions, they are limited to recursions that follow a linear path (e.g., a list). In some scenarios, the recursion follows a nonlinear path such as a binary tree (e.g., most comparison-based sort algorithms).

To generalize the fold and unfold operations to different recursive patterns, Recursion Schemes [6, 11] describe common patterns of generating and consuming inductive data types, not only limited to the consumption or generation of data, but also abstracting the idea of having access to partial results or even backtracking. The main idea is that the recursion path is described by a fixed point of an inductive data type and the program building task becomes limited to some specific definitions induced by the chosen data structure. This concept is explained in more detail in the following section.
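The claim that fold can implement the other higher-order functions can be illustrated with a minimal sketch. It writes map and filter in terms of the standard foldr and grows a list from a seed with unfoldr from Data.List. The names mapViaFold, filterViaFold, and countdown are illustrative choices made here and do not come from the chapter.

```haskell
import Data.List (unfoldr)

-- map as a fold: rebuild the list, applying f to each element on the way.
mapViaFold :: (a -> b) -> [a] -> [b]
mapViaFold f = foldr (\x acc -> f x : acc) []

-- filter as a fold: keep an element only when the predicate holds.
filterViaFold :: (a -> Bool) -> [a] -> [a]
filterViaFold p = foldr (\x acc -> if p x then x : acc else acc) []

-- unfold: grow a list from a seed value; Nothing stops the generation.
countdown :: Int -> [Int]
countdown = unfoldr step
  where
    step n
      | n <= 0    = Nothing
      | otherwise = Just (n, n - 1)
```

Loaded in GHCi, mapViaFold (* 2) [1, 2, 3] evaluates to [2,4,6], filterViaFold even [1 .. 6] to [2,4,6], and countdown 3 to [3,2,1]. None of the three functions calls itself directly; the recursion lives entirely inside foldr and unfoldr.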

14.2.1 Fixed Point of a Linked List

A generic inductive list (i.e., linked list) carrying values of type a is described as:¹

    data List a = Nil | Cons a (List a)

¹ In this paper, we employ the Haskell language notation, which is similar to the ML notation.


This is read as a list that can be either empty (Nil) or a combination (using Cons) of a value of type a followed by another list (i.e., its tail). In this context, a is called a type parameter.² We can eliminate the recursive definition by adding a second parameter to replace the recursion:

    data ListF a b = NilF | ConsF a b

This definition still allows us to carry the same information as List but, with the removal of the recursion from the type, it becomes impossible to create generic functions that work with different list lengths, because each nesting depth of ListF now has a different type. Fortunately, we can solve this problem by obtaining the fixed point of ListF, which corresponds to the definition of List:

    data Fix f = MkFix (f (Fix f))

    unfix :: Fix f -> f (Fix f)
    unfix (MkFix x) = x

MkFix is a data constructor and unfix extracts one layer of our nested structure. The Fix data type creates a link between the fixed point of a list and the list itself. This allows us to write recursive programs for data structures with non-recursive definitions that are very similar to those targeting data structures with recursive definitions.
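To make the link between List and the fixed point of ListF concrete, the following sketch converts between built-in Haskell lists and Fix (ListF a). The name fromList matches the one used in the chapter's later templates, but its definition here, as well as the helper toList, are assumptions made for illustration.

```haskell
-- Non-recursive list shape: b marks the position where the tail would go.
data ListF a b = NilF | ConsF a b

-- Fixed point: ties the knot so that b becomes "another layer of f".
data Fix f = MkFix (f (Fix f))

-- Remove one MkFix wrapper, exposing a single layer.
unfix :: Fix f -> f (Fix f)
unfix (MkFix x) = x

-- Assumed helper: wrap every cell of a built-in list in one ListF layer.
fromList :: [a] -> Fix (ListF a)
fromList []       = MkFix NilF
fromList (x : xs) = MkFix (ConsF x (fromList xs))

-- Assumed helper: peel the layers off again, recovering a built-in list.
toList :: Fix (ListF a) -> [a]
toList xs = case unfix xs of
  NilF         -> []
  ConsF y rest -> y : toList rest
```

With these definitions, toList (fromList [1, 2, 3]) gives back [1,2,3], and every list, whatever its length, has the single type Fix (ListF a).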

14.2.2 Functor Algebra

Recursion schemes come from two basic operations: the algebra, which describes how elements must be consumed, and the coalgebra, which describes how elements must be generated. Assume the fixed point structure f a supports a function fmap, a higher-order function that applies its first argument to every nested element of the structure f.³ A Functor Algebra (F-Algebra) and its dual, the Functor Co-Algebra, are then defined as:

    type Algebra f a = f a -> a
    type CoAlgebra f a = a -> f a

In plain English, an algebra is a function that combines all the information carried by the parametric type (f) into a single value (i.e., the fold function), and a coalgebra is a function that, given a seed value, creates a data structure defined by f (i.e., the unfold function). The application of an algebra to a fixed point structure is called a catamorphism, and the application of a coalgebra to a seed value is called an anamorphism:

    cata :: Functor f => (f a -> a) -> Fix f -> a
    cata alg xs = alg (fmap (cata alg) (unfix xs))

    ana :: Functor f => (a -> f a) -> a -> Fix f
    ana coalg seed = MkFix (fmap (ana coalg) (coalg seed))

    -- definition of fmap for ListF
    -- fmap :: (b -> c) -> ListF a b -> ListF a c
    instance Functor (ListF a) where
      fmap _ NilF        = NilF
      fmap f (ConsF a b) = ConsF a (f b)

² Similar to generics in Java and templates in C++.
³ In category theory, this is called an endofunctor.

So, given an algebra alg, cata peels the outer layer of the fixed point data, maps itself to the whole structure, and applies the algebra to the result. In short, the procedure traverses the structure to its deepest layer and applies alg recursively, accumulating the result. Similarly, ana can be seen as the inverse of cata. In this function, we first apply coalg to seed, generating a structure of type f (usually a singleton), then we map the ana coalg function to the just generated data to further expand it, and finally we enclose it inside a Fix structure to obtain the fixed point. In the context of lists, this procedure is known as unfold, as it departs from one value and unrolls it into a list of values.

Although the general idea of defining the fixed point of a data structure and implementing the catamorphism may look like over-complicating standard functions, the end result allows us to focus on much simpler implementations. In the special case of a list, we just need to specify the neutral element (NilF) and how to combine two elements (ConsF x y). All the inner mechanics of how the whole list is reduced are abstracted away in the cata function. In the next section, we will show different examples of programs developed using this pattern.
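As a worked illustration of the two operations, the sketch below repeats the definitions of cata and ana for ListF and supplies one algebra and one coalgebra. The names sumAlg, downFrom, and toList are hypothetical and chosen only for this example.

```haskell
data ListF a b = NilF | ConsF a b

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap f (ConsF a b) = ConsF a (f b)

data Fix f = MkFix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (MkFix x) = x

-- Catamorphism: recurse into the tail first, then apply the algebra.
cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

-- Anamorphism: apply the coalgebra first, then expand the new seed.
ana :: Functor f => (a -> f a) -> a -> Fix f
ana coalg = MkFix . fmap (ana coalg) . coalg

-- Algebra: NilF gives the neutral element, ConsF combines the head
-- with the already reduced tail.
sumAlg :: ListF Int Int -> Int
sumAlg NilF          = 0
sumAlg (ConsF x acc) = x + acc

-- Coalgebra: emit the seed and continue with seed - 1, stopping at zero.
downFrom :: Int -> ListF Int Int
downFrom n
  | n <= 0    = NilF
  | otherwise = ConsF n (n - 1)

-- Converting back to a built-in list is itself a catamorphism.
toList :: Fix (ListF a) -> [a]
toList = cata alg
  where
    alg NilF         = []
    alg (ConsF x xs) = x : xs
```

Here toList (ana downFrom 3) yields [3,2,1] and cata sumAlg (ana downFrom 4) yields 10. Only sumAlg and downFrom describe the problem; the traversal itself is fixed by cata and ana.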

14.2.3 Well-Known Recursion Schemes

Besides the already mentioned recursion schemes, there are less frequent patterns that hold some useful properties when building recursive programs. The most well-known recursion schemes (including the ones already mentioned) are:

• catamorphism/anamorphism: also known as folding and unfolding, respectively. The catamorphism aggregates the information stored in the inductive type. The anamorphism generates an inductive type starting from a seed value.
• paramorphism/apomorphism: these recursion schemes work as catamorphism and anamorphism; however, at every step they allow access to the original downwards structure.
• histomorphism/futumorphism: the histomorphism allows access to all previously consumed elements, from the most to the least recent, and the futumorphism allows access to the elements that are yet to be generated.
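The difference between a catamorphism and a paramorphism can be sketched for lists as follows. The definition of para below is one common formulation and is not given in the chapter; suffixes, fromList, and toList are hypothetical names used only for this illustration.

```haskell
data ListF a b = NilF | ConsF a b

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap f (ConsF a b) = ConsF a (f b)

data Fix f = MkFix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (MkFix x) = x

-- Paramorphism: like cata, but every recursive position carries both the
-- original substructure and the result already computed for it.
para :: Functor f => (f (Fix f, a) -> a) -> Fix f -> a
para alg = alg . fmap (\sub -> (sub, para alg sub)) . unfix

fromList :: [a] -> Fix (ListF a)
fromList = foldr (\x acc -> MkFix (ConsF x acc)) (MkFix NilF)

toList :: Fix (ListF a) -> [a]
toList xs = case unfix xs of
  NilF         -> []
  ConsF y rest -> y : toList rest

-- All suffixes of a list: each step needs the untouched tail (rest)
-- in addition to the accumulated result (acc).
suffixes :: [a] -> [[a]]
suffixes = para alg . fromList
  where
    alg NilF                  = [[]]
    alg (ConsF x (rest, acc)) = (x : toList rest) : acc
```

For example, suffixes [1, 2, 3] evaluates to [[1,2,3],[2,3],[3],[]]. A plain catamorphism would only see acc at each step, so the remaining tail would have to be rebuilt from the accumulated result.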

Algorithm 1 Origami Program Synthesis

    procedure Origami(x, y, types)           ▷ The training data and the program type.
        r ← pickRecursionScheme(types)       ▷ Step 1
        b ← pickInductiveType(types)         ▷ Step 2
        p ← pickTemplate(r, b, types)        ▷ Step 3
        f ← createFitnessFunction(p, b)      ▷ Step 4
        return evolveProgram(p, f)           ▷ Step 5
    end procedure

And, of course, we can also combine these morphisms creating the hylomorphism (anamorphism followed by catamorphism), metamorphism (catamorphism followed by anamorphism), and chronomorphism (combination of futumorphism and histomorphism). In the following section, we will explain some possible ideas on how to exploit these patterns in the context of program synthesis.
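A hylomorphism can be sketched without building the intermediate fixed point at all: the coalgebra produces one layer, the recursion fills it in, and the algebra consumes it. The fused definition of hylo below is a standard formulation assumed here, and countdown, productAlg, and factorial are illustrative names.

```haskell
data ListF a b = NilF | ConsF a b

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap f (ConsF a b) = ConsF a (f b)

-- Hylomorphism: unfold with the coalgebra, then fold with the algebra,
-- one layer at a time.
hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
hylo alg coalg = alg . fmap (hylo alg coalg) . coalg

-- Coalgebra: unfold n into the layers n, n - 1, ..., 1.
countdown :: Integer -> ListF Integer Integer
countdown n
  | n <= 0    = NilF
  | otherwise = ConsF n (n - 1)

-- Algebra: multiply the head with the product of the tail.
productAlg :: ListF Integer Integer -> Integer
productAlg NilF          = 1
productAlg (ConsF x acc) = x * acc

-- Factorial as "unfold a countdown, then fold with a product".
factorial :: Integer -> Integer
factorial = hylo productAlg countdown
```

In this sketch factorial 5 evaluates to 120 and factorial 0 to 1. The same result would be obtained by composing a catamorphism over productAlg with an anamorphism over countdown.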

14.3 Origami

The main idea of Origami is to reduce the search space by breaking down the synthesis process into smaller steps. An overview of the approach is outlined in Algorithm 1.

Step 1 of the algorithm is to determine the recursion scheme of the program. This can be done heuristically (e.g., following a flowchart; see https://hackage.haskell.org/package/recursion-schemes-5.2.2.4/docs/docs/flowchart.svg). Since there are just a few known morphisms and the distribution of use cases for each morphism is highly skewed, this determination could be made by running the candidate schemes in parallel, by an expert interactively, by a machine learning algorithm based on the input and output types, from the input/output examples, or from the textual description of the algorithm.

Then, in Step 2, after the choice of recursion scheme, it is time to choose the appropriate base (inductive) data type. The most common choices are natural numbers, lists, and rose trees. Besides these choices, one can provide custom data structures if needed. This choice could be made employing the same methods used in Step 1.

Step 3 deals with the choice of which specific template of evolvable functions (further explained in the next sections) will specify the parts of the program that must be evolved, returning a template function to be filled by the evolutionary process. Once this is done, we can build the fitness function (Step 4) that will receive the evolved functions, wrap them into the recursion scheme, and evaluate them using the training data. Finally, we run the evolution (Step 5) to find the correct program.

To illustrate the process, let us walk through generating a solution to the count odds problem from the General Program Synthesis Benchmark (GPSB) [8]:

Count Odds: Given a vector of integers, return the number of integers that are odd, without use of a specific even or odd instruction (but allowing instructions such as modulo and quotient).

We can start by determining the type signature of this function:

    countOdds :: [Int] -> Int

Step 1: As the type signature suggests, we are collapsing a list of values into a value of the same type. So, we should pick one of the catamorphism variants. Let us pick the plain catamorphism.

Step 2: In this step, we need to choose a base inductive type. Since the type information tells us we are working with lists, we can use the list functor.

Step 3: As the specific template, we choose the reduction to a value (from a list of integers to a single value). At this point of the code generation, we depart from the following template, where fromList is assumed to convert a Haskell list into its fixed point representation:

    countOdds :: [Int] -> Int
    countOdds ys = cata alg (fromList ys)

    alg :: ListF Int Int -> Int
    alg xs = case xs of
      NilF      -> e1
      ConsF x y -> e2

Step 4: We still need to fill the gaps e1 and e2 in the code. At this point, the piece of code NilF -> e1 can only evolve to a constant value, as it must return an integer and it does not have any integer available in scope. The piece of code ConsF x y -> e2 can only evolve to operations that involve x, y, and integer constants.

Step 5: Finally, the evolution can be run and the final solution should be:

    countOdds :: [Int] -> Int
    countOdds xs = cata alg (fromList xs)

    alg :: ListF Int Int -> Int
    alg xs = case xs of
      NilF      -> 0
      ConsF x y -> mod x 2 + y
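The five steps above can be tied together in a small runnable sketch. It passes the two evolvable gaps e1 and e2 as arguments to a template and scores a candidate on a handful of input/output pairs. The functions template and fitness, the training cases, and the definition of fromList are all assumptions made for illustration; in particular, the fitness shown here simply counts reproduced cases and is not claimed to be the measure used by Origami.

```haskell
data ListF a b = NilF | ConsF a b

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap f (ConsF a b) = ConsF a (f b)

data Fix f = MkFix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (MkFix x) = x

cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

-- Assumed conversion from a built-in list to its fixed point form.
fromList :: [a] -> Fix (ListF a)
fromList = foldr (\x acc -> MkFix (ConsF x acc)) (MkFix NilF)

-- Template from Step 3 with the gaps made explicit:
-- e1 fills the NilF branch, e2 fills the ConsF branch.
template :: Int -> (Int -> Int -> Int) -> [Int] -> Int
template e1 e2 ys = cata alg (fromList ys)
  where
    alg NilF        = e1
    alg (ConsF x y) = e2 x y

-- Hypothetical fitness: number of training cases the candidate reproduces.
fitness :: [([Int], Int)] -> ([Int] -> Int) -> Int
fitness cases prog =
  length (filter (\(input, expected) -> prog input == expected) cases)

-- Hypothetical training data for count odds.
trainingCases :: [([Int], Int)]
trainingCases = [([], 0), ([1, 2, 3], 2), ([2, 4], 0), ([-3, 5, 8], 2)]

-- The solution from Step 5, expressed by filling the two gaps.
countOdds :: [Int] -> Int
countOdds = template 0 (\x y -> mod x 2 + y)
```

With these definitions, fitness trainingCases countOdds is 4 (all cases pass), whereas a candidate such as template 0 (\_ y -> y + 1), which merely computes the list length, scores 1. Note that mod in Haskell returns a non-negative result for a positive divisor, so negative odd inputs are counted correctly.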

As we can see from this example, the evolution of the functions inside the recursion scheme is well determined by the input–output types of the main function. We should notice, though, that other solutions may require intermediate steps whose types are not entirely defined by the types of the main function. To identify the different templates that can appear, we manually solved the entire GPSB benchmark in such a way that the evolvable gaps of the recursion schemes have a well-determined and concrete function type. In this work, we highlight one example of each template, but the entire set of solutions is available in the GitHub repository at https://github.com/folivetti/origami-programming. This repository is in continuous development and will also contain alternative solutions using different data structures (e.g., indexed lists) and solutions to other benchmarks.

In the following examples, the evolvable parts of the solution are marked (underlined in green in the original typesetting), making more evident the number and size of the programs that must be evolved by the main algorithm. It should be noted that we made some concessions in the way some programs were solved. In particular, our solutions are only concerned with returning the required values and disregard any IO operations (for instance, printing the result with a string “The results is”), as we do not see the relevance of evolving this part of the program at this point. This will be part of the full algorithm for a fair comparison with the current state of the art.

14.3.1 How to Choose a Template

The first step requires making a decision among the available templates. Some options (from the most naive to the more advanced ones) are listed below:

• Run multiple searches in parallel with each one of the templates
• Integrate this decision as part of the search (e.g., encode it into the chromosome)
• Use the type information to pre-select a subset of the templates (see Table 14.1)
• Use the description of the problem together with a language model.

Regarding the use of type information, as we can see in Table 14.1, the type signature can constrain the possible recursion schemes, thus reducing the search space of this choice. There are also some specific patterns in the description of the program that can help us choose one of the templates. For example, whenever the problem requires returning the position of an element, we should use the accumorphism recursion scheme.

Table 14.1 Association between type signatures and their corresponding recursion schemes

Type signature    Recursion scheme
f a -> b          Catamorphism, accumorphism
a -> f b          Anamorphism
a -> b            Hylomorphism
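The pre-selection by type signature in Table 14.1 could be encoded as a simple lookup. The sketch below is only an illustration: the Shape type and the function candidateSchemes are hypothetical names, and a real implementation would have to derive the shape from an actual type signature.

```haskell
data Scheme = Catamorphism | Accumorphism | Anamorphism | Hylomorphism
  deriving (Eq, Show)

-- Hypothetical coarse classification of the main function's signature.
data Shape
  = ConsumesStructure  -- f a -> b
  | ProducesStructure  -- a -> f b
  | PlainToPlain       -- a -> b
  deriving (Eq, Show)

-- Candidate recursion schemes for each signature shape, as in Table 14.1.
candidateSchemes :: Shape -> [Scheme]
candidateSchemes ConsumesStructure = [Catamorphism, Accumorphism]
candidateSchemes ProducesStructure = [Anamorphism]
candidateSchemes PlainToPlain      = [Hylomorphism]
```

With such a lookup, a search over templates only needs to try the schemes returned for the shape at hand instead of all four.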


14.3.2 Jokers to the Right: Catamorphism

In the previous sections, we gave the definition of a catamorphism. Its definition is analogous to a right fold, but generalized to any fixed-point data structure. For a list $x$ with $n$ elements, this is equivalent to applying a function in the following order: $f(x_0, f(x_1, f(x_2, \ldots f(x_{n-1}, y) \ldots)))$, where $y$ is the initial value. So the accumulation is performed from the end of the structure to the beginning. Notice that all of the solutions follow the same main form cata alg (fromList data), which changes the input argument into a fixed form of a list and applies the algebra of the catamorphism. Specifically for the catamorphism, we observed four different templates, which we exemplify in the following, from the simplest to the more complicated approaches. In what follows, we will only present the definition of the alg function.
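The order of application described above can be made visible with the built-in foldr, which plays the role of a catamorphism on plain lists. The function name nesting is a hypothetical helper introduced only for this illustration; it renders the nested applications as a string rather than computing a value.

```haskell
-- Render the order in which a right fold combines the elements with the
-- initial value y: the last element is combined with y first.
nesting :: [String] -> String
nesting = foldr (\x acc -> "f(" ++ x ++ ", " ++ acc ++ ")") "y"
```

For example, nesting ["x0", "x1", "x2"] yields "f(x0, f(x1, f(x2, y)))", matching the formula in the text for a list of three elements.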

14.3.2.1 Reducing a Structure

The most common use case of catamorphism is to reduce a structure to a single value, or f a -> a. In this case, the algebra follows a simple function that is applied to each element and combined with the accumulated value. This template was already illustrated at the beginning of Sect. 14.3 with the example of countOdds. Due to space constraints, we refer the reader to that particular example.

14.3.2.2 Regenerating the Structure: Mapping

The higher-order function map is a fold that processes and reassembles the structure, so any function with a type signature f a -> f b is a catamorphism.

Double Letters Given a string, print (in our case, return) the string, doubling every letter character, and tripling every exclamation point. All other non-alphabetic and non-exclamation characters should be printed a single time each.

    -- required primitives: if-then-else, (<>), ([])
    -- user provided: constant '!', constant "!!!", isLetter
    alg NilF = []                          -- evolved: []
    alg (ConsF x xs) = if x == '!'         -- evolved: the whole right-hand side
                       then "!!!" <> xs
                       else if isLetter x then [x,x] <> xs
                       else x:xs

The main difference from the previous example is that, in this program, at every step the intermediate result (xs) is a list. Notice that the ConsF case is still constrained in such a way that we can either insert the character x somewhere in xs, or change x into a string and concatenate to the result.


Evolvable functions: given a function of type f a -> f b, we need to evolve (i) the pattern alg NilF of type f b and (ii) the pattern alg (ConsF x xs) of type a -> f b -> f b.
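A runnable version of the Double Letters template is sketched below. The Fix, ListF, fromList, and cata definitions are hand-rolled assumptions, the concatenation operator is taken to be (<>), and the wrapper name doubleLetters is introduced here only for illustration.

```haskell
import Data.Char (isLetter)

-- Hand-rolled fixed point and list base functor (assumptions of this sketch).
newtype Fix f = Fix (f (Fix f))

data ListF a r = NilF | ConsF a r

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap g (ConsF x r) = ConsF x (g r)

fromList :: [a] -> Fix (ListF a)
fromList = foldr (\x r -> Fix (ConsF x r)) (Fix NilF)

cata :: Functor f => (f b -> b) -> Fix f -> b
cata phi (Fix x) = phi (fmap (cata phi) x)

-- The accumulated value xs is itself a list, so the structure is rebuilt.
alg :: ListF Char String -> String
alg NilF = []
alg (ConsF x xs) =
  if x == '!'
    then "!!!" <> xs
    else if isLetter x then [x, x] <> xs
    else x : xs

doubleLetters :: String -> String
doubleLetters = cata alg . fromList
```

Evaluating doubleLetters "a!b 1" touches all three branches of the ConsF case.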

14.3.2.3 Generating a Function

When the evolved main function has more than a single argument, we can use a template that uses a catamorphism to consume the first argument, accumulate on it, and return a function that consumes the remaining arguments. Two-argument functions are written in Haskell notation as f a -> f a -> b, which is read as: a function that takes two arguments of type f a (e.g., a list of values) and returns a value of type b. This signature is equivalent to its curried form, f a -> (f a -> b): a function that takes a value of type f a and returns a function that takes an f a and returns a value of type b. While generating a function that returns a function seems to add complexity, the type constraints can help guide the synthesis more efficiently than if we were to interpret it as a function of two arguments.

Super Anagrams Given strings x and y of lowercase letters, return true if y is a super anagram of x, which is the case if every character in x is in y. To be true, y may contain extra characters but must have at least as many copies of each character as x does.

    -- required primitives: delete, constant bool
    --                      elem, ()
    alg NilF ys = True                     -- evolved: True
    alg (ConsF x xs) ys =                  -- evolved: the whole right-hand side
      (not . null) ys && elem x ys && xs (delete x ys)

For this problem, we incorporated the second argument of the function as a second argument of alg. For the base case, the end of the first string, we assume that this is a super anagram and return True. For the second pattern, we must remember that xs is supposed to be a function that receives a list and returns a boolean value. So, we first check that the second argument is not null and that x is contained in ys, and then evaluate xs, passing ys after removing the first occurrence of x. Notice that for the NilF case we are not limited to returning a constant value: we can apply any function to the second argument that returns a boolean. Thus, any function String -> Bool will work. Even though we have more possibilities for the base case, we can grow the tree carefully to achieve a proper solution. The same goes for the ConsF case, in which the search space is larger, as the function receives a char value, a string value, and a function of type String -> Bool as arguments. Evolvable functions: given a function of type f a -> f a -> b, we need to evolve (i) the pattern alg NilF of type f a -> b and (ii) the pattern alg (ConsF x xs) of type a -> (f a -> b) -> f a -> b.
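The function-returning catamorphism can be checked with the following sketch, in which the carrier of the fold is String -> Bool. As before, Fix, ListF, fromList, and cata are hand-rolled assumptions, and the wrapper name superAnagram is hypothetical.

```haskell
import Data.List (delete)

-- Hand-rolled fixed point and list base functor (assumptions of this sketch).
newtype Fix f = Fix (f (Fix f))

data ListF a r = NilF | ConsF a r

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap g (ConsF x r) = ConsF x (g r)

fromList :: [a] -> Fix (ListF a)
fromList = foldr (\x r -> Fix (ConsF x r)) (Fix NilF)

cata :: Functor f => (f b -> b) -> Fix f -> b
cata phi (Fix x) = phi (fmap (cata phi) x)

-- The fold over the first string builds a function awaiting the second string.
alg :: ListF Char (String -> Bool) -> String -> Bool
alg NilF _          = True
alg (ConsF x xs) ys = (not . null) ys && elem x ys && xs (delete x ys)

superAnagram :: String -> String -> Bool
superAnagram x = cata alg (fromList x)
```

Here superAnagram "abc" "cabd" consumes one matching character of the second string per step, while superAnagram "aab" "ab" fails on the second 'a'.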

14.3.2.4 Combination of Patterns

More complex programs often combine two or more different tasks represented as functions that return tuples (f a -> (b, c)). If both tasks are independent and they are both catamorphisms, they are equivalent to applying different functions to every element of the $n$-tuple. The evolution process would be the same as for the previous templates, but we would evolve one function for each output type.
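As a small illustration of this template, the sketch below folds a list once and returns a pair whose components are computed by two independent functions. The example task (sum and element count) and the name sumAndCount are hypothetical and not taken from the benchmark; Fix, ListF, fromList, and cata are hand-rolled assumptions.

```haskell
-- Hand-rolled fixed point and list base functor (assumptions of this sketch).
newtype Fix f = Fix (f (Fix f))

data ListF a r = NilF | ConsF a r

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap g (ConsF x r) = ConsF x (g r)

fromList :: [a] -> Fix (ListF a)
fromList = foldr (\x r -> Fix (ConsF x r)) (Fix NilF)

cata :: Functor f => (f b -> b) -> Fix f -> b
cata phi (Fix x) = phi (fmap (cata phi) x)

-- Each tuple component is updated by its own function of x and the
-- corresponding accumulated value; these are the parts that would be evolved.
sumAndCount :: [Int] -> (Int, Int)
sumAndCount = cata alg . fromList
  where
    alg NilF             = (0, 0)
    alg (ConsF x (s, n)) = (x + s, 1 + n)
```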

14.3.3 When You Started Off with Nothing: Anamorphism

The anamorphism starts from a seed value and then unfolds it into the desired recursive structure. Its evolutionary template is composed of a function over its argument that spans into multiple cases, each one responsible for evolving one function. The most common case is that this main function is a predicate that spans over a True or False response. The result of each case should be the return data structure containing one element and one seed value. For this recursion scheme, we only identified a single template, in which the first argument is used as the initial seed and any remaining argument (if of the same type) is used as a constant when building the program. Of course, during the program synthesis, we may test any permutation of the use of the input arguments.

For Loop Index Given 3 integer inputs start, end, and step, print the integers in the sequence $n_0 = \text{start}$, $n_i = n_{i-1} + \text{step}$ for each $n_i < \text{end}$, each on their own line.

    -- required primitives: (==), (+)
    forLoopIndex :: Int -> Int -> Int -> [Int]
    forLoopIndex start end step = toList (ana coalg start)    -- evolved: start
      where
        coalg seed = case seed == end of                      -- evolved: seed == end
          True  -> NilF
          False -> ConsF seed (seed + step)                   -- evolved: seed, (seed + step)

In this program, the first argument is the starting seed of the anamorphism, and step and end are used when defining each case. The case predicate must evolve a function that takes the seed as an argument and returns a boolean. To evolve such a function, we are limited to the logical and comparison operators. As the type of the seed is well determined, we must compare it with values of the same type, which can be constants or one of the remaining arguments. After that, we must evolve two programs: one that creates the element out of the seed (a function of type Int -> Int) and one that generates the next seed. Evolvable functions: given a function of type a -> a -> ... -> a -> f b, we need to evolve (i) the pattern coalg of type a -> f b a, when using list as the base type, (ii) the predicate function a -> Bool that returns True for the terminating case (if any), and (iii) the starting seed value (either a constant or one of the arguments).
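The anamorphism template can be exercised with the sketch below. Fix, ListF, ana, and toList are hand-rolled assumptions. Note that, with the equality predicate shown in the listing, the unfold only terminates when the sequence reaches end exactly; this sketch keeps that behaviour and is therefore only meant for such inputs.

```haskell
-- Hand-rolled fixed point and list base functor (assumptions of this sketch).
newtype Fix f = Fix (f (Fix f))

data ListF a r = NilF | ConsF a r

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap g (ConsF x r) = ConsF x (g r)

-- Anamorphism: expand the seed one layer, then unfold the new seeds.
ana :: Functor f => (a -> f a) -> a -> Fix f
ana psi = Fix . fmap (ana psi) . psi

-- Convert the fixed-point form back into a built-in list.
toList :: Fix (ListF a) -> [a]
toList (Fix NilF)        = []
toList (Fix (ConsF x r)) = x : toList r

forLoopIndex :: Int -> Int -> Int -> [Int]
forLoopIndex start end step = toList (ana coalg start)
  where
    coalg seed = case seed == end of
      True  -> NilF
      False -> ConsF seed (seed + step)
```

For instance, forLoopIndex 0 10 2 emits the seeds 0, 2, 4, 6, 8 and stops when the seed equals 10.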


14.3.4 Stuck in the Middle with You: Hylomorphism

Hylomorphism is the fusion of both catamorphism and anamorphism. This template works the same as evolving the functions for both schemes.

Collatz Numbers Given an integer, find the number of terms in the Collatz (hailstone) sequence starting from that integer.

    -- required primitives: constant int, (==)
    --                      (+), (*), mod, div
    alg NilF = 1                                -- evolved: 1
    alg (ConsF x xs) = 1 + xs                   -- evolved: 1 + xs
    coalg x = case x == 1 of                    -- evolved: x == 1
      True  -> NilF
      False -> ConsF x (if mod x 2 == 0         -- evolved: the whole if expression
                        then div x 2
                        else 3*x + 1)

In a hylomorphism, the coalgebra first produces a value that can then be consumed by the algebra. The single input argument is the seed used to generate the next hailstone number, which becomes the next seed. If this seed is equal to 1, the process terminates. The algebra in this case simply counts the number of generated values, but adds 1 in the NilF case to account for the value 1 that was dropped during the anamorphism. Evolvable functions: given a function of type a -> b, we need to evolve (i) the pattern coalg of type a -> f a b, (ii) the pattern alg (ConsF x xs) of type f a b -> b, and (iii) the pattern alg NilF of type b.
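The interplay of coalgebra and algebra can be tried with the sketch below. The hylo definition and the ListF functor are hand-rolled assumptions, and the odd case uses the standard Collatz step 3x + 1, which is an assumption of this sketch about the intended sequence.

```haskell
data ListF a r = NilF | ConsF a r

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap g (ConsF x r) = ConsF x (g r)

-- Hylomorphism: unfold with the coalgebra and fold with the algebra,
-- without materialising the intermediate structure.
hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
hylo phi psi = phi . fmap (hylo phi psi) . psi

collatz :: Int -> Int
collatz = hylo alg coalg
  where
    -- Count one per generated element, plus one for the final 1.
    alg NilF         = 1
    alg (ConsF _ xs) = 1 + xs
    -- Stop at 1; otherwise emit x and continue with the next hailstone number.
    coalg x = case x == 1 of
      True  -> NilF
      False -> ConsF x (if mod x 2 == 0 then div x 2 else 3 * x + 1)
```

Starting from 3, the sequence 3, 10, 5, 16, 8, 4, 2, 1 has eight terms: seven are emitted by the coalgebra and the last one is accounted for by the NilF case.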

14.3.5 Clowns to the Left of Me: Accumorphism

In some situations, our solution needs to traverse the inductive structure to the left. In other words, at every step, we accumulate the results and gain access to the partial results. For lists, this is equivalent to $f(\ldots f(f(y, x_0), x_1), \ldots x_{n-1})$. For this purpose, we can implement the accumorphism, a recursion scheme that requires an accumulator function st in addition to the algebra. Notice that the accumulator function will receive a fixed form structure and the current state, and it will return the fixed form with a tuple of the original value and the trace state. This template must be used carefully because it adds a degree of freedom through the st function: this function can be of any type, not limited by any of the main program types, thus expanding the search space. To avoid such problems, we will use accumulators in very specific use cases, as described in the following subsections.

14.3.5.1 Indexing Data

Whenever the problem requires the indexing of the data structure, we can use the accumulator to store the index of each value of the structure and, afterward, use this information to process the final solution. With this template, the accumulator should be of type Int.

Last Index of Zero Given a vector of integers, at least one of which is 0, return the index of the last occurrence of 0 in the vector.

    -- required primitives: if-then-else, (+), (==)
    --                      (), constant int, Maybe, Last
    lastIndexZero :: [Int] -> Int
    lastIndexZero xs = accu st alg (fromList xs) 0     -- evolved: 0
      where
        st NilF s = NilF
        st (ConsF x xs) s = ConsF x (xs, s+1)          -- evolved: s+1
        alg NilF s = -1                                -- evolved: -1
        alg (ConsF x xs) s = if x == 0 && xs == -1     -- evolved: the whole right-hand side
                             then s
                             else xs

This template requires that we evolve: the initial value for s, the expression to update s at every element of the list (the remainder of the st function is given), the value of the base case of alg, and the general case. The accumulator function of this program has the role of indexing our list. When the list is indexed, we build the result from the bottom up, signaling that we have not found a zero by initially returning -1. Whenever x == 0 and the current stored index is -1, the program returns the index stored at that level (s). Otherwise, it just returns the current xs. Evolvable functions: given a function of type f a -> b, we need to evolve (i) a constant value of type Int, (ii) the pattern st (ConsF x xs) s of type a -> Int -> (a, Int), (iii) the pattern alg NilF s of type Int -> b, and (iv) the pattern alg (ConsF x xs) s of type a -> b -> Int -> b.
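The indexing template can be run with the sketch below. The accu function shown here is one possible hand-rolled definition that matches the call shape accu st alg (fromList xs) s used in the listing; it is an assumption of this sketch and not necessarily the definition used by the authors. Fix, ListF, and fromList are likewise hand-rolled.

```haskell
-- Hand-rolled fixed point and list base functor (assumptions of this sketch).
newtype Fix f = Fix (f (Fix f))

data ListF a r = NilF | ConsF a r

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap g (ConsF x r) = ConsF x (g r)

fromList :: [a] -> Fix (ListF a)
fromList = foldr (\x r -> Fix (ConsF x r)) (Fix NilF)

-- Accumorphism: st pairs every child with the state it should receive,
-- and the algebra sees the folded children together with the local state.
accu :: Functor f
     => (f (Fix f) -> s -> f (Fix f, s))
     -> (f b -> s -> b)
     -> Fix f -> s -> b
accu st phi (Fix x) s = phi (fmap (uncurry (accu st phi)) (st x s)) s

lastIndexZero :: [Int] -> Int
lastIndexZero xs = accu st alg (fromList xs) 0
  where
    -- The state is the index of the current element.
    st NilF _         = NilF
    st (ConsF x r) s  = ConsF x (r, s + 1)
    -- (-1) signals that no zero has been found further to the right.
    alg NilF _        = -1
    alg (ConsF x r) s = if x == 0 && r == (-1) then s else r
```

For the input [0, 1, 0, 3], the zero at index 2 is the first one met on the way back from the end of the list, so its index is returned and the earlier zero at index 0 is ignored.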

14.3.5.2 A Combination of Catamorphisms

In some cases, the recursive function is equivalent to the processing of two or more catamorphisms, with a post-processing step that combines the results. A simple example is the average of the values of a vector, in which we need to sum the values and count the length of the vector, combining both final results with the division operator. This template constrains the type of the accumulator to a tuple of the return type of the program.

Vector Average Given a vector of floats, return the average of those floats. Results are rounded to 4 decimal places.

    -- required primitives: (+), (/)
    vecAvg :: [Double] -> Double
    vecAvg xs = accu st alg (fromList xs) (0.0, 0.0)                  -- evolved: (0.0, 0.0)
      where
        st NilF (s1, s2) = NilF
        st (ConsF x xs) (s1, s2) = ConsF x (xs, (s1 + x, s2 + 1))     -- evolved: s1 + x, s2 + 1
        alg NilF (s1, s2) = s1 / s2                                   -- evolved: s1 / s2
        alg (ConsF x xs) s = xs

We again illustrate this solution by splitting the accumulator function into two distinct functions, one for each element of the tuple. While the first function, f (here s1 + x), accumulates the sum of the values of the list, the second function, g (here s2 + 1), increments the accumulator by one at every step. In this template, the final solution is the combination of the values at the final state of the accumulator; thus, in the alg function, we just need to evolve a function of the elements of the state. Evolvable functions: given a function of type f a -> b, we need to evolve (i) a constant value of type (b, b), (ii) the pattern st (ConsF x xs) s of type a -> (b, b) -> (b, b), and (iii) the pattern alg NilF s of type (b, b) -> b. In the next section, we report a simple experiment with a subset of these problems as a proof of concept of our approach.
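A runnable version of the Vector Average template is sketched below, using the same hand-rolled accu as an assumption (it is one definition compatible with the call shape in the listing, not necessarily the authors'). The rounding to 4 decimal places mentioned in the problem statement is left out of this sketch.

```haskell
-- Hand-rolled fixed point and list base functor (assumptions of this sketch).
newtype Fix f = Fix (f (Fix f))

data ListF a r = NilF | ConsF a r

instance Functor (ListF a) where
  fmap _ NilF        = NilF
  fmap g (ConsF x r) = ConsF x (g r)

fromList :: [a] -> Fix (ListF a)
fromList = foldr (\x r -> Fix (ConsF x r)) (Fix NilF)

-- Accumorphism: st threads the state to the children, phi sees the local state.
accu :: Functor f
     => (f (Fix f) -> s -> f (Fix f, s))
     -> (f b -> s -> b)
     -> Fix f -> s -> b
accu st phi (Fix x) s = phi (fmap (uncurry (accu st phi)) (st x s)) s

vecAvg :: [Double] -> Double
vecAvg xs = accu st alg (fromList xs) (0.0, 0.0)
  where
    -- The state carries the running sum and the running count.
    st NilF _               = NilF
    st (ConsF x r) (s1, s2) = ConsF x (r, (s1 + x, s2 + 1))
    -- Only the end of the list combines the two accumulated values.
    alg NilF (s1, s2) = s1 / s2
    alg (ConsF _ r) _ = r
```

For [1, 2, 3, 6], the state reaching the end of the list is (12, 4), so the NilF case returns 3.0 and every ConsF case passes it through unchanged.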

14.4 Preliminary Results

The main objective of this work is to introduce the idea of using recursion schemes to solve programming challenges and to verify whether the current benchmark problems can be solved using this approach. In this section, we show how using the catamorphism template can help improve the overall performance of a GP approach. For this purpose, we use Higher-Order Typed Genetic Programming (HOTGP) [5]. HOTGP supports higher-order functions and λ-functions, disallows the creation of impure functions, and uses type information to guide the search. We adapted this algorithm to generate only the evolvable parts of a catamorphism (here implemented as a foldr) and tested most of the benchmarks that can be solved by this specific template. Specifically, we asked HOTGP to generate an expression for the alg (ConsF x xs) pattern. For alg NilF, we used default empty values depending on the data type: 0 for Int and Float, False for Bool, the space character for Char, and empty lists for lists and strings. Naturally, this is a simplification of the template, as all of the benchmarks we are interested in happen to use these values for the null pattern. In order to properly generate the recursion patterns, this part of the function should also be considered in the evolution. We set the maximum depth of the tree to 5, as the expressions we want to generate are always smaller than that. All the other parameters use the same values described by Fernandes et al. [5].

To position Origami within the current literature, we compare the obtained results against those obtained by HOTGP itself [5], PushGP in the original benchmark [12], Grammar-Guided Genetic Programming (G3P) [13], and the extended grammar version of G3P (here called G3P+) [14], as well as some recently proposed methods such as Code Building Genetic Programming (CBGP) [15], G3P with Haskell and Python grammars (G3Phs and G3Ppy) [16], Down-sampled lexicase selection (DSLS) [17], and Uniform Mutation by Addition and Deletion (UMAD) [18]. We removed the benchmarks that would need the algorithm to output a function (Mirror Image, Vectors Summed, and Grade), as this is not currently supported by HOTGP. For completeness, we also tested the benchmarks that can be solved by accumorphisms, even though this adaptation does not support them, to show that once committed to a template (e.g., catamorphism), the algorithm cannot find the correct solution if it requires a different template (e.g., accumorphism).

Analyzing the results depicted in Table 14.2, one can notice that, when comparing the standard HOTGP with Origami, Origami always obtains an equal or better number of perfect runs, except on the accumorphism benchmarks. Not only that, but the number of problems that are always solved increased from 1 to 4, and the number solved in at least 75% of the runs increased from 3 to 6, a clear improvement in success rate. When compared to related work, out of the 7 solvable benchmarks, Origami had the best results in 6 of them, the only exception being replace-space-with-newline. Overall, once we choose the correct template, the synthesis step becomes simpler.
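The simplified catamorphism template used in these experiments (a foldr whose null case is fixed to a type-dependent default) could be sketched as follows. The class DefaultEmpty and the function cataTemplate are hypothetical names introduced for illustration and are not part of HOTGP; only the default values themselves come from the text.

```haskell
-- Default value used for the NilF pattern, chosen by the result type.
class DefaultEmpty a where
  emptyValue :: a

instance DefaultEmpty Int   where emptyValue = 0
instance DefaultEmpty Float where emptyValue = 0
instance DefaultEmpty Bool  where emptyValue = False
instance DefaultEmpty Char  where emptyValue = ' '
instance DefaultEmpty [a]   where emptyValue = []

-- Only the step function (the ConsF pattern) is left to the evolutionary search.
cataTemplate :: DefaultEmpty b => (a -> b -> b) -> [a] -> b
cataTemplate step = foldr step emptyValue

-- Example: plugging in the count-odds expression from Sect. 14.3.
countOdds :: [Int] -> Int
countOdds = cataTemplate (\x acc -> mod x 2 + acc)
```

Under this arrangement, the search only has to produce the lambda body, which is consistent with the small maximum tree depth reported above.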

14.5 Discussion and Final Remarks

Our main hypothesis with this study is that, by starting the program synthesis through fixing one of the recursion schemes, we simplify the process of program synthesis. For this purpose, we used a general set of benchmarks widely used in the literature. Within this benchmark suite, we observed that, in most cases, the evolvable part of the programs becomes much simpler, to the point of being trivial. However, some of them require a pre-processing of the input arguments with some general-use functions (such as zip) to keep this simplicity, or an adaptation of the output type to return a function instead of a value. In some cases, more complicated functions can be evolved with the help of human interaction, by asking for additional information such as “when should the recursion stop?”. Also, in many cases, the type signature of each one of the evolvable programs already constrains the search space. For example, the pattern alg NilF must return a value of the return type of the program without using any additional information; thus, the space is constrained to constant values of the return type. Analyzing the minimal function set required to solve all these problems, one can formulate a basic idea about the adequate choice based on the signature of the main function and on any user-provided type or function. Table 14.3 shows the set we used for the presented experimental evaluation.

Table 14.2 Percentage of runs that returned a perfect solution on the validation set. The bottom part of the table summarizes the results as the number of times each algorithm had the highest percentage, and the number of problems in which the percentage was greater than or equal to a certain threshold. The benchmarks marked with * are only solvable with accumorphism

Benchmark                     Origami  HOTGP  DSLS  UMAD  PushGP  G3P  CBGP  G3P+  G3Phs  G3Ppy
checksum*                           0      –     1     5       0    0     –     0      –      –
count-odds                        100     50    11    12       8   12     0     3      –      –
double-letters                     94      0    50    20       6    0     –     0      –      –
last-index-of-zero*                 0      0    62    56      21   22    10    44      0      2
negative-to-zero                  100    100    82    82      45   63    99    13      0     66
replace-space-with-newline         60     38   100    87      51    0     0    16      –      –
scrabble-score                    100      –    31    20       2    2     –     1      –      –
string-lengths-backwards          100     89    95    86      66   68     –    18      0     34
syllables                          84      0    64    48      18    0     –    39      –      –
vector-average*                     0     80    88    92      16    5    88     0      4      0
# of Best Results                   6      1     2     2       0    0     0     0      0      0
= 100                               4      1     1     0       0    0     0     0      0      0
≥ 75                                6      3     4     4       0    0     2     0      0      0
≥ 50                                7      4     7     5       2    2     2     0      0      1
> 0                                 7      5    10    10       9    6     3     7      1      3
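The threshold rows at the bottom of Table 14.2 follow directly from the per-benchmark percentages. The sketch below recomputes them for the Origami and HOTGP columns; the names Column, countWhere, and summary are hypothetical, and Nothing stands for a benchmark with no reported result (shown as a dash in the table).

```haskell
-- One column of the table; Nothing marks a benchmark without a reported result.
type Column = [Maybe Int]

origami, hotgp :: Column
origami = map Just [0, 100, 94, 0, 100, 60, 100, 100, 84, 0]
hotgp   = [ Nothing, Just 50, Just 0, Just 0, Just 100
          , Just 38, Nothing, Just 89, Just 0, Just 80 ]

-- Count the reported percentages that satisfy a predicate.
countWhere :: (Int -> Bool) -> Column -> Int
countWhere p col = length [v | Just v <- col, p v]

-- The four threshold rows: = 100, >= 75, >= 50, > 0.
summary :: Column -> (Int, Int, Int, Int)
summary col =
  ( countWhere (== 100) col
  , countWhere (>= 75) col
  , countWhere (>= 50) col
  , countWhere (> 0) col )
```

Evaluating summary origami and summary hotgp gives (4, 6, 7, 7) and (1, 3, 4, 5), which are the values quoted in Sect. 14.4.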

Table 14.3 Function set used for solving the GPSB benchmark problems

Type class        Functions
Numbers           fromIntegral, +, -, *, /, ^, div, quot, mod, rem, abs, min, max
Logical           =, ==, /=, &&, not, ||
Lists             cons, snoc, <>, head, tail, init, last, null, length, delete, elem
Tuple             fst, snd
Map               findMap, insertWith
General purpose   if-then-else, case, uncurry, fromEnum, toEnum, id

Table 14.4 Function set assumed to be provided by the user. All of these functions and constants are explicitly mentioned in the problem description

Type            Functions
Int -> Bool     (< 1000), (>= 2000)
String          "small", "large", "!!!", "ABCDF", "ay"
Char            '!', ' ', '\n'
Int             0, 1, 64
Char -> Bool    isVowel, isLetter

Table 14.4 lists the set of functions and constants we assume should be provided by the user, as they are contained in the problem description. Some of them can be replaced by case-of instructions (e.g., isVowel, isLetter, scrabbleScore), which can increase the difficulty of obtaining a solution. We should note that, of the 29 problems considered here and implemented by a human programmer, 17% were trivial enough that they did not require any recursion scheme; 41% were solved using catamorphism; and 20% used accumorphism (although we were required to constrain the accumulator function). Anamorphism accounted for only 7% of the problems and hylomorphism for 14%. The distribution of the usage of each recursion scheme is shown in Fig. 14.1. As the problems become more difficult and other patterns emerge, we can resort to more advanced recursion schemes, such as dynomorphism when dealing with dynamic programming problems. Also, none of these problems required a recursive pattern with a base structure different from a list. In the future, we plan to test other benchmarks and introduce new ones that require different structures to test our approach. One challenge to this approach is how to treat the templates containing multiple evolvable parts. For example, anamorphism requires the evolution of three programs: one that generates the next element, one that generates the next seed, and one predicate that checks for the stop condition. We will consider a multi-gene approach [19] or a collaborative co-evolution strategy [20, 21].


Fig. 14.1 Distribution of recursion schemes used to solve the full set of GPSB problems

As a final consideration, we highlight the fact that most of the programs can be further simplified if we annotate the output type with monoids. In functional programming, and Haskell in particular, monoids are a class of types that have an identity value (mempty) and a binary operator (<>) such that mempty <> a = a <> mempty = a. With these definitions, we can replace many of the functions and constants described in Tables 14.3 and 14.4 with mempty and <>, reducing the search space.

Acknowledgements This research was partially funded by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), grant numbers #2021/12706-1, #2019/26702-8, #2021/06867-2, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) and CNPq grant number 301596/2022-0.

References

1. Helmuth, T., McPhee, N.F., Spector, L.: Program synthesis using uniform mutation by addition and deletion. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1127–1134 (2018)
2. Pantridge, E., Spector, L.: Code building genetic programming. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pp. 994–1002 (2020)
3. O’Neill, M., Ryan, C.: Grammatical evolution. IEEE Trans. Evolut. Comput. 5(4), 349–358 (2001)
4. Manrique, D., Ríos, J., Rodríguez-Patón, A.: Grammar-guided genetic programming. In: Encyclopedia of Artificial Intelligence, pp. 767–773 (2009)
5. Fernandes, M.C., De França, F.O., Francesquini, E.: HOTGP - Higher-Order Typed Genetic Programming. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’23, pp. 1091–1099, New York, NY, USA (2023). Association for Computing Machinery
6. Meijer, E., Fokkinga, M., Paterson, R.: Functional programming with bananas, lenses, envelopes and barbed wire. In: Functional Programming Languages and Computer Architecture: 5th ACM Conference Cambridge, MA, USA, August 26–30, 1991 Proceedings 5, pp. 124–144. Springer (1991)
7. Gibbons, J.: The fun of programming (2003)
8. Helmuth, T., Spector, L.: General program synthesis benchmark suite. In: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 1039–1046 (2015)
9. Jones, N.D.: The expressive power of higher-order types or, life without cons. J. Funct. Program. 11(1), 55–94 (2001)
10. Harvey, B.: Avoiding recursion. In: Hoyles, C., Noss, R. (eds.) Learning Mathematics and LOGO, pp. 393–426. The MIT Press, Cambridge, Massachusetts, USA (1992)
11. Garland, S.J., Luckham, D.C.: Program schemes, recursion schemes, and formal languages. J. Comput. Syst. Sci. 7(2), 119–160 (1973)
12. Helmuth, T., Spector, L.: General program synthesis benchmark suite. In: GECCO ’15: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 1039–1046. ACM, Madrid, Spain (2015)
13. Forstenlechner, S., Fagan, D., Nicolau, M., O’Neill, M.: A grammar design pattern for arbitrary program synthesis problems in genetic programming. In: McDermott, J., Castelli, M., Sekanina, L., Haasdijk, E., García-Sánchez, P. (eds.) Genetic Programming, pp. 262–277. Springer International Publishing, Cham (2017)
14. Forstenlechner, S., Fagan, D., Nicolau, M., O’Neill, M.: Extending program synthesis grammars for grammar-guided genetic programming. In: Auger, A., Fonseca, C.M., Lourenço, N., Machado, P., Paquete, L., Whitley, D. (eds.) Parallel Problem Solving from Nature – PPSN XV, pp. 197–208. Springer International Publishing, Cham (2018)
15. Pantridge, E., Helmuth, T., Spector, L.: Functional code building genetic programming. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’22, pp. 1000–1008, New York, NY, USA (2022). Association for Computing Machinery
16. Garrow, F., Lones, M.A., Stewart, R.: Why functional program synthesis matters (in the realm of genetic programming). In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO ’22, pp. 1844–1853, New York, NY, USA (2022). Association for Computing Machinery
17. Helmuth, T., Spector, L.: Problem-solving benefits of down-sampled lexicase selection. Artif. Life 27(3–4), 183–203 (2022)
18. Helmuth, T., McPhee, N.F., Spector, L.: Program synthesis using uniform mutation by addition and deletion. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18, pp. 1127–1134, New York, NY, USA (2018). Association for Computing Machinery
19. Searson, D.P., Leahy, D.E., Willis, M.J.: Gptips: an open source genetic programming toolbox for multigene symbolic regression. In: Proceedings of the International Multiconference of Engineers and Computer Scientists, vol. 1, pp. 77–80. Citeseer (2010)
20. Soule, T., Heckendorn, R.B.: Improving performance and cooperation in multi-agent systems. Genetic Programming Theory and Practice V, pp. 221–237 (2008)
21. Grefenstette, J., Daley, R.: Methods for competitive and cooperative co-evolution. In: Adaptation, Coevolution and Learning in Multiagent Systems: Papers from the 1996 AAAI Spring Symposium, pp. 45–50 (1996)

Chapter 15
Reachability Analysis for Lexicase Selection via Community Assembly Graphs

Emily Dolson and Alexander Lalejini

E. Dolson (corresponding author)
Michigan State University, East Lansing, MI, USA

A. Lalejini
Grand Valley State University, Allendale, MI, USA

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_15

15.1 Introduction

Lexicase selection is a state-of-the-art parent selection algorithm for genetic programming [27]. It has proven highly effective across a wide variety of problems [4, 17, 19, 22–24], and has spawned many variants [2, 13, 18, 28]. One challenge of working with lexicase selection, however, is that most fitness-landscape-based analytical techniques do not directly apply to it. Fitness landscapes represent the mapping of genotypes to fitness and the adjacency of genotypes to each other, providing intuition for which genotypes are (easily) reachable from which other genotypes via an evolutionary process. Because lexicase selection is designed for scenarios where multiple factors (e.g. different test cases that evolved code is run on) determine solution quality, there is no single fitness landscape for a given problem in the context of lexicase selection (for further details on how lexicase selection works, see Sect. 15.3.1). Moreover, a candidate solution’s probability of being selected by lexicase selection depends entirely on the composition of the population of other solutions that it is competing with [16]. In other words, ecological dynamics play a large role in lexicase selection [6, 8]; the fitness landscape for any individual criterion is constantly shifting due to endogenous change. Thus, even if we could calculate individual criterion fitness landscapes, there would not be a meaningful way to combine them into a single static model predicting the population’s change over time.

In cases where evolutionary algorithms fail to solve problems, fitness-landscape-based analyses are useful for understanding why [29]. Although alternative analytical techniques can fill some of this gap [14], it would be useful to have a technique for conducting reachability analysis, i.e. identifying whether lexicase selection is capable of reaching a solution to a given problem under a given configuration. In particular, such an analysis would help distinguish instances where lexicase selection simply needs more time or a larger population size from instances where it is incapable of solving, or unlikely to solve, a problem.

A technique capable of supporting reachability analysis for lexicase selection would need to consider (1) what populations of solutions can exist, and (2) what additional solutions are reachable from a given population. If we have (1), (2) is relatively easy to calculate, as it simply requires identifying solutions mutationally adjacent to those in the population. Identifying possible populations, however, may at first seem intractable. Fortunately, this problem has been solved by ecologists seeking to predict the way that an ecological community will change over time. Here, we propose to use a tool from ecology, the community assembly graph, to improve our understanding of the adaptive “landscape” experienced by lexicase selection. This approach can likely be generalized to other evolutionary algorithms with strong ecological interactions between members of the population.
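The two ingredients named above, which populations can exist and which solutions are mutationally adjacent to them, suggest a breadth-first exploration of population states. The sketch below is only a generic illustration under stated assumptions: the function reachable and its two parameters are hypothetical names, and the toy rules toyNeighbours and toySettle are invented for the example and do not model lexicase selection or any real fitness landscape.

```haskell
import Data.List (nub, sort)

type Genotype  = Int
type Community = [Genotype]  -- kept sorted and free of duplicates

-- Breadth-first enumeration of the communities reachable from a start community.
reachable :: (Community -> [Genotype])             -- mutationally adjacent genotypes
          -> (Community -> Genotype -> Community)  -- community left after adding one
          -> Community -> [Community]
reachable neighbours settle start = go [norm start] [norm start]
  where
    norm = sort . nub
    go seen []       = reverse seen
    go seen (c : cs) =
      let next  = nub [norm (settle c g) | g <- neighbours c]
          fresh = [n | n <- next, n `notElem` seen]
      in go (reverse fresh ++ seen) (cs ++ fresh)

-- Toy rule (hypothetical): genotypes 0..2, adjacent when their ids differ by one.
toyNeighbours :: Community -> [Genotype]
toyNeighbours c =
  nub [g | m <- c, g <- [m - 1, m + 1], g >= 0, g <= 2, g `notElem` c]

-- Toy rule (hypothetical): a newcomer displaces members more than one id below it.
toySettle :: Community -> Genotype -> Community
toySettle c g = [m | m <- g : c, m + 1 >= g]
```

With these toy rules, reachable toyNeighbours toySettle [0] visits [0], [0,1], [1,2], and [0,1,2]; a community containing only genotype 2 is never reached, which is the kind of conclusion a reachability analysis is meant to support.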

15.2 Approach

15.2.1 Community Assembly Graphs

In ecology, the process of building up a collection of “species” that can coexist with each other is called community assembly. There is a vast literature of ecological theory on this process that may be relevant to lexicase selection. For now, we will borrow a single concept from this literature: community assembly graphs [9, 26] (see Fig. 15.1). In these graphs, each node represents a possible community composition (i.e. set of species that could stably coexist with each other in the same space). Community compositions can be represented as bit-strings where each position corresponds to a species that could potentially be present. A one indicates that that species is present, while a zero indicates that it is absent. While this technique may potentially make for a very large graph ($2^n$ nodes, where $n$ is the number of possible species), note that the graph only includes stable communities. Many collections of species cannot all coexist with each other for more than a few generations and thus are not represented as nodes in the graph. Edges between nodes are directed and indicate transitions that can occur via the addition of a new species to the community at the source node.

Note that in the previous paragraph we referred to “species”, as these are the units that can belong to a community in ecology. In evolutionary computation, however, we do not usually attempt to define species. Instead, we can choose an alternative unit of organization to compose our communities out of. One obvious option would be genotypes (i.e. representations of the entire genome). However, we recommend using phenotypes (i.e. representations of a solution’s selection-relevant traits) instead, as all genotypes that share the same phenotype will also be ecologically equivalent. In the context of lexicase selection, we define “phenotypes” to be the vector of scores on all fitness criteria/test cases (sometimes referred to as “error vectors” in the literature).

Another important difference between community assembly graphs in ecology and the approach we propose here concerns which phenotypes may be added to a community. In traditional ecological scenarios, it is assumed that new species are being introduced to a local community from some nearby location, and thus any species could be introduced to any community. In evolutionary computation, however, new phenotypes come instead from the process of evolution. Consequently, for the purposes of making community assembly graphs for lexicase selection, we only consider the introduction of phenotypes that are mutationally adjacent to a phenotype that is already present in a community. For a very simple example community assembly graph for lexicase selection, see Fig. 15.1.

A run of lexicase selection can be approximately modeled as a random walk on this community assembly graph, although in practice some paths will have higher probabilities of being taken than others. Sink nodes in this graph indicate possible end states for the process of evolution by lexicase selection on the problem being solved. By definition, a sink node is a community that has no mutationally adjacent solutions capable of “invading” (i.e. surviving in) it. Thus, it is impossible to escape (setting aside, for now, the possibility of multiple simultaneous mutations). Any evolutionary scenario that produces multiple sink nodes has multiple possible outcomes. A corollary of this observation is that any solution not appearing in all sink nodes is not guaranteed to be found, and that running evolution for longer will not change that fact.

Fig. 15.1 A simple community assembly graph. This graph represents lexicase selection on a representative NK fitness landscape with N = 3, K = 2. The fitness contributions of each of the three genes function as the three fitness criteria for lexicase selection. Node labels indicate the ids of genotypes that are present in the community represented by each node. The genotype and phenotype corresponding to each id are shown below the graph. Edge labels indicate which genotype was added to the community to cause a transition from the edge’s source node to its destination node. Here, for simplicity, evolution is arbitrarily assumed to have started from a population containing only genotype 0. The two mutationally adjacent genotypes to 0 are 1 and 2, so we consider the effect of adding either of them to the starting community. Both genotype 1 and genotype 2 can stably coexist with genotype 0 (they each have a fitness criterion on which they outperform genotype 0 and a fitness criterion on which they do not), so adding either one results in a two-genotype community. Ultimately, there are two different sink nodes in this graph; evolution will likely stagnate when it reaches either of them. Because genotype 1 only appears in one of the sink nodes, we can conclude that it will probably not always be found, despite having the highest score on objective 1
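The edge rule described above (add one mutationally adjacent phenotype, then keep whatever can stably coexist) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The names `can_be_selected`, `stable_subset`, and `successors`, the toy score vectors, and the adjacency map are all hypothetical. It also assumes that a phenotype survives exactly when it has a non-zero chance of being selected by lexicase selection, which is a simplification of the stability criterion developed in Sect. 15.2.2.

```python
def can_be_selected(z, pool, cases, scores):
    """True if some ordering of `cases` lets phenotype z survive lexicase filtering."""
    if len(pool) == 1 or not cases:
        return True
    for c in cases:
        best = max(scores[y][c] for y in pool)
        if scores[z][c] == best:
            elite = frozenset(y for y in pool if scores[y][c] == best)
            if can_be_selected(z, elite, cases - {c}, scores):
                return True
    return False


def stable_subset(community, scores):
    """Members of `community` with a non-zero selection probability (assumed stable)."""
    n_cases = len(next(iter(scores.values())))
    cases = frozenset(range(n_cases))
    pool = frozenset(community)
    return frozenset(z for z in pool if can_be_selected(z, pool, cases, scores))


def successors(community, adjacency, scores):
    """Map each invading phenotype to the community that results from adding it."""
    out = {}
    for member in community:
        for invader in adjacency.get(member, ()):
            if invader in community:
                continue
            result = stable_subset(community | {invader}, scores)
            if result != community:  # a failed invasion creates no edge
                out[invader] = result
    return out


# Hypothetical toy landscape: phenotype id -> scores on three criteria.
toy_scores = {0: (1, 1, 0), 1: (2, 0, 0), 2: (0, 2, 0), 3: (2, 2, 1)}
toy_adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
first_steps = successors(frozenset({0}), toy_adjacency, toy_scores)
```

A community whose `successors` result is empty is a sink node in the sense used above.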

15.2.2 Calculating Stability

As the nodes in the community assembly graph correspond to stable communities, identifying which communities are stable is a critical step in calculating a community assembly graph. We define stable communities as follows:

Definition 15.1 A stable community is a population of phenotypes such that, if no mutations were allowed, the set of phenotypes present would have a high probability of remaining the same through $G$ generations worth of selection events.

In other words, a stable community is a set of phenotypes expected to stably coexist for $G$ generations. Note that this definition means stable communities can only be defined with respect to fixed values of $G$, population size, and some threshold for what counts as a “high” probability of survival.

What is the purpose of allowing generations to vary by including the $G$ term? In genetic programming, each phenotype can be represented as a neutral network of equivalent genotypes that are mutationally adjacent to each other [1]. While some phenotype $A$ may be adjacent to some other phenotype $B$, it is unlikely that all genotypes within $A$ are adjacent to all genotypes in $B$. Instead, to discover phenotype $B$ from phenotype $A$, evolution likely needs to traverse the neutral network of phenotype $A$. Consequently, it would be incorrect to assume that $B$ can be discovered as soon as $A$ is in the population. Instead, $A$ must first be able to survive for some amount of time related to the topology of its neutral network. If there is no neutrality in a landscape, $G$ can simply be set to 1.

We can identify stable communities in lexicase selection using a sequence of two equations. First, we identify $P_{lex}$, the probability of each member of the population being selected in a given selection event. $P_{lex}$ can be calculated with the following equation, derived in [16]:

$$
P_{lex}(i \mid Z, N) =
\begin{cases}
1 & \text{if } |Z| = 1 \\
1/|Z| & \text{if } |N| = 0 \\
\dfrac{\sum_{j=0}^{|N|} P_{lex}\left(i \mid \{z \in Z \mid z \text{ elite on } N_j\},\ \{n_i \in N \mid i \neq j\}\right)}{|N|} & \text{else}
\end{cases}
\tag{15.1}
$$

Here $Z$ is the set of solutions still under consideration and $N$ is the set of fitness criteria that have not yet been used.

Note that the run-time of this function is exponential, as the problem of calculating $P_{lex}$ is NP-hard [5]. However, in practice $P_{lex}$ can be calculated fairly efficiently using a variety of optimizations [5]. Once we have calculated $P_{lex}$, we can calculate the probability of each phenotype surviving for $G$ generations, $P_{survival}$, based on the following equation (adapted from [8]):

$$
P_{survival}(i, S, G, pop) = \left(1 - \left(1 - P_{lex}(i, pop)\right)^{S}\right)^{G}
\tag{15.2}
$$

where $i$ is the individual $P_{survival}$ is being calculated for, $S$ is the population size, $G$ is the number of generations, and $pop$ is the current population. As $S$ and $G$ get large, this function begins to approximate a step function [8]. Consequently, we can often safely make the simplifying assumption that the survival of each member of the population is either guaranteed or impossible.
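As a rough illustration of Eqs. 15.1 and 15.2, the recursion and the survival formula can be written directly in Python. This is a naive sketch that makes several assumptions. Scores are treated as "higher is better". The pool holds one entry per distinct phenotype. An explicit zero is returned when the individual has already been filtered out of the pool, a case that Eq. 15.1 leaves implicit. The function names and the toy data are hypothetical, and none of the optimizations from [5] are attempted.

```python
from functools import lru_cache


def p_lex(i, pool, cases, scores):
    """Naive evaluation of Eq. 15.1: chance that i is chosen in one selection event.

    scores[z][c] is the score of phenotype z on criterion c (higher is better).
    """
    @lru_cache(maxsize=None)
    def rec(pool, cases):
        if i not in pool:
            return 0.0  # i was filtered out earlier; implicit in Eq. 15.1
        if len(pool) == 1:
            return 1.0
        if not cases:
            return 1.0 / len(pool)
        total = 0.0
        for c in cases:
            best = max(scores[z][c] for z in pool)
            elite = frozenset(z for z in pool if scores[z][c] == best)
            total += rec(elite, cases - {c})
        return total / len(cases)

    return rec(frozenset(pool), frozenset(cases))


def p_survival(p_select, pop_size, generations):
    """Eq. 15.2: chance of being selected at least once per generation for G generations."""
    return (1.0 - (1.0 - p_select) ** pop_size) ** generations


# Hypothetical example: two specialists and one phenotype that is never elite.
example_scores = {0: (1, 0), 1: (0, 1), 2: (0, 0)}
specialist = p_lex(0, {0, 1, 2}, {0, 1}, example_scores)
never_elite = p_lex(2, {0, 1, 2}, {0, 1}, example_scores)
```

Evaluating `p_survival` for a range of selection probabilities at large `pop_size` and `generations` is a quick way to see the step-function behavior noted above.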

15.2.3 Assumptions

Three important simplifying assumptions are made in the construction of community assembly graphs in this paper. The first is that it is only possible to generate single-step mutants during a reproduction event. Obviously, many mutation schemes would violate this assumption. If necessary, it would be easy to amend our graph construction process to consider larger numbers of mutations. We refrain from doing so here in the interest of improved tractability. Whether this assumption reduces the utility of community assembly graph analysis is an important open question.

The second simplifying assumption is that no more than one new phenotype is added to the population per generation. In some cases, the result of two phenotypes being added simultaneously may be different than the results of those phenotypes being added in sequence. In such cases, the community assembly graph model may incorrectly predict the behavior of lexicase selection. In theory, if the landscape has sufficient neutrality (and thus $G$ should be large), it should be relatively uncommon for these cases to occur. Whether these cases significantly impact the behavior of lexicase selection is another open question. Moreover, even if these events are important, community assembly graphs give us a framework for quantifying that importance.

The third simplifying assumption is that the population size of each phenotype is irrelevant and we can conduct analyses purely on the basis of presence vs. absence of each phenotype. This assumption is likely justified, as population size will only impact $P_{lex}$ in scenarios where multiple solutions tie. Again, empirical analysis is warranted to determine whether these scenarios play a significant role in lexicase selection’s behavior.

15.2.4 Reachability Analysis

Analyzing the topological properties of the community assembly graph can provide a variety of insights into the landscape being explored by lexicase selection. For now, we will focus on a single one: reachability. Reachability analysis asks whether a path exists from a given starting point, $A$, to a given ending point, $Z$. Thus, it allows us to determine, for an initial population ($A$), whether lexicase selection is capable of finding an optimal solution ($Z$). Using a community assembly graph, we can conduct this analysis via a simple graph traversal.

We can also ask a stronger question about reachability: are there paths from $A$ that do not ultimately lead to $Z$? Community assembly graphs also make this form of reachability easy to assess; we need only determine whether there are any sink nodes (i.e. nodes with no outgoing edges) other than $Z$ that are accessible from $A$. The answer to this question tells us approximately how likely lexicase selection is to find an optimal solution to a given problem from a given initial population.
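Both reachability questions reduce to a standard traversal. The sketch below assumes the community assembly graph is already available as a plain adjacency mapping; the function names and the toy graph are hypothetical and only meant to show the shape of the computation.

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first search: every node reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reachability_report(graph, start, target):
    """Return (is the target reachable?, reachable sink nodes other than the target)."""
    seen = reachable(graph, start)
    sinks = {n for n in seen if not graph.get(n)}
    return target in seen, sinks - {target}


# Hypothetical graph: node -> list of nodes it has a directed edge to.
toy_graph = {"A": ["B", "C"], "B": ["Z"], "C": ["D"], "D": [], "Z": []}
target_found, other_sinks = reachability_report(toy_graph, "A", "Z")
```

An empty set of other sinks means every maximal path from the start ends at the target; a non-empty set lists the places where evolution could stagnate instead.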

15.2.4.1 Tractability

Why do we start from a predefined initial population, $A$? This restriction keeps the problem substantially more tractable. The full community assembly graph for a given configuration of lexicase selection on a given problem would have as many as $2^n$ nodes, where $n$ is the number of unique phenotypes that could possibly exist. While many of these communities can be excluded from the graph due to being unstable, the set of unique performance vectors is already potentially uncountable. By starting from a defined point, we can avoid attempting to enumerate all possible vectors. Fortunately, in the context of genetic programming, it is often reasonable to assume that we start from a population containing a single error vector: the one representing minimal scores on all fitness criteria.


However, in many practical cases, this restriction is still insufficient to make reachability analysis tractable. In these cases, we can limit our analysis to an even smaller sub-graph ($\mathcal{G}$) containing only the nodes that we are most likely to discover. We can find $\mathcal{G}$ using a modified version of Dijkstra’s algorithm. Like Dijkstra’s algorithm, our algorithm is a graph traversal in which all newly discovered nodes are placed in a priority queue. On every iteration, we remove the top node from the priority queue and “explore” it, i.e. we place all undiscovered nodes that it has an edge to in the priority queue. However, whereas Dijkstra’s algorithm assigns priorities based on minimal summed path length, our algorithm will assign priorities based on the hitting probability of each node given the portion of $\mathcal{G}$ that we have explored so far.

Hitting probability is the probability that a node will be visited in the course of a random walk along the graph. Note that it differs from the more commonly used properties of hitting time (the average time it will take a random walk to reach a node) and PageRank (the probability of ending a random walk on a node). We will denote the hitting probability for node $i$ when starting from node $A$ as $H_{iA}$.

Using this traversal algorithm, we can explore the top $N$ communities that are most likely to occur at some point during an evolutionary process started from the community represented by node $A$. To maintain tractability, we stop our traversal after we have fully explored $N$ nodes. As is standard for graph traversal algorithms, we consider a node fully explored once all nodes that it has a directed edge to have either been explored or are in the priority queue of nodes to explore in the future.

An interesting side effect of this algorithm is that at the end, we are left with nodes with three different states: (1) fully explored, (2) discovered but unexplored, and (3) undiscovered. Only fully explored nodes will be included in $\mathcal{G}$ for follow-up analysis. However, we can use our knowledge of the nodes in state 2 to determine which nodes in $\mathcal{G}$ truly have no outgoing edges and thus should be considered sink nodes for the purposes of reachability analysis (as opposed to nodes in $\mathcal{G}$ that have outgoing edges to nodes with too low a hitting probability to be included in $\mathcal{G}$).
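The traversal described above might look roughly like the following. This is a sketch under stated assumptions rather than the authors' code. The function name, the `successors` callback (which returns a mapping from neighbor to transition probability), and the toy graph are hypothetical. Priorities are a running estimate of hitting probability obtained by summing path probabilities, edges back to already explored nodes are ignored, and contributions that arrive after a node has been explored are not propagated further. Priority updates are handled by pushing a fresh heap entry and skipping stale ones, since Python's `heapq` has no update operation.

```python
import heapq
from itertools import count


def explore_most_likely(start, successors, n_explore):
    """Best-first traversal ordered by an estimated hitting probability."""
    hit = {start: 1.0}      # running hitting-probability estimates
    explored = []           # state 1: fully explored, in exploration order
    explored_set = set()
    edges = {}              # outgoing edges of explored nodes
    tie = count()           # tie-breaker so the heap never compares nodes
    heap = [(-1.0, next(tie), start)]
    while heap and len(explored) < n_explore:
        neg_p, _, node = heapq.heappop(heap)
        if node in explored_set or -neg_p != hit[node]:
            continue        # stale entry left behind by a priority update
        explored.append(node)
        explored_set.add(node)
        edges[node] = dict(successors(node))
        for nxt, p in edges[node].items():
            if nxt in explored_set:
                continue    # simplification: edges to explored nodes are ignored
            hit[nxt] = hit.get(nxt, 0.0) + hit[node] * p
            heapq.heappush(heap, (-hit[nxt], next(tie), nxt))
    frontier = {n for n in hit if n not in explored_set}  # state 2
    true_sinks = [n for n in explored if not edges[n]]
    return explored, frontier, true_sinks, hit


# Hypothetical transition probabilities between communities.
toy = {
    "A": {"B": 0.7, "C": 0.3},
    "B": {"D": 1.0},
    "C": {"D": 0.5, "E": 0.5},
    "D": {},
    "E": {},
}
order, frontier, sinks, estimates = explore_most_likely("A", lambda n: toy[n], 4)
```

Explored nodes whose only outgoing edges lead into `frontier` are not true sinks, which is why `true_sinks` is computed from the stored edges rather than from the explored sub-graph alone.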

15.2.4.2 Calculating Hitting Probability

Note that calculating hitting probability is non-trivial, particularly when we are calculating it as the graph is being traversed. Whenever we discover a new path to a node, its hitting probability (and thus its priority in the priority queue) must be updated.¹ In the case of simple paths, these updates can be handled by summing the probabilities of each path. However, when a cycle is discovered, things become more challenging, as we have effectively just discovered an infinite number of new paths to all nodes that are successors (direct or indirect) of any node in the cycle. As a simplifying assumption for the purposes of this proof-of-concept paper, we ignore the effect of cycles on our priorities. Consequently, we may underestimate the hitting probabilities of some nodes. As we expect cycles to be fairly rare in the context of lexicase selection (and, when they do occur, they are likely evenly distributed across the graph), we do not anticipate this bias has a dramatic impact on our analyses here. Development of an algorithm to appropriately handle cycles is underway as part of follow-up research.

¹ While standard heap-based priority queues do not support updating priorities, a Fibonacci heap can be used to enable this operation in amortized constant time.

One approach to solving this problem is to employ a combination of infinite series and combinatorics to calculate the probabilities of taking each possible path through the graph. Unfortunately, this approach is challenging to implement, due to the possibility of many overlapping cycles. A more straightforward approach is to reduce the problem of calculating hitting probability to the much easier problem of calculating PageRank. This reduction can be achieved by constructing a modified version of $\mathcal{G}$, which we will term $\mathcal{G}_i'$, for each node $i$ that we want to calculate the hitting probability of. $\mathcal{G}_i'$ is identical to $\mathcal{G}$ except that all outgoing edges from node $i$ are removed. Consequently, any path in $\mathcal{G}$ that would include node $i$ corresponds to a path in $\mathcal{G}_i'$ that ends at node $i$.

One other aspect of our hitting probability calculations that it is important to be aware of is that all probabilities we calculate are conditional on the part of the graph already traversed at any point in time. It is possible that at the time we reach our stopping condition of having explored $N$ nodes, there will be nodes in the priority queue that have a higher hitting probability than nodes that we have already explored. Similarly, our analysis makes no prediction about what will happen in the unlikely event that evolution reaches a node not contained in $\mathcal{G}$. It would be worthwhile in future work to explore the impact of these shortcomings and whether they can be overcome.
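One way to make the edge-removal reduction concrete is to treat the target node as absorbing and compute the probability that a random walk is absorbed there. The sketch below does this with a simple fixed-point iteration, which also copes with cycles. It is an assumption-laden illustration: the function name, the dictionary-based graph format, and the toy graph are hypothetical, and the chapter does not say how the authors compute the end-of-walk probabilities in practice.

```python
def hitting_probability(edges, start, target, tol=1e-12, max_iter=10_000):
    """Probability that a random walk from `start` ever visits `target`.

    edges[u] = {v: transition probability}. A missing or empty entry marks a sink.
    Outgoing edges of `target` are ignored, mirroring the modified graph in which
    every walk that reaches the target ends there.
    """
    nodes = set(edges) | {v for out in edges.values() for v in out}
    nodes |= {start, target}
    h = {n: 0.0 for n in nodes}
    h[target] = 1.0
    for _ in range(max_iter):
        delta = 0.0
        for u in nodes:
            if u == target:
                continue  # target is absorbing
            new = sum(p * h[v] for v, p in edges.get(u, {}).items())
            delta = max(delta, abs(new - h[u]))
            h[u] = new
        if delta < tol:
            break
    return h[start]


# Hypothetical graph with a cycle between A and B and a competing sink C.
cyclic = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"A": 0.5, "Z": 0.5},
    "C": {},
    "Z": {"A": 1.0},
}
h_z = hitting_probability(cyclic, "A", "Z")
h_b = hitting_probability(cyclic, "A", "B")
```

Starting the iteration from zero everywhere except the target makes it converge to the smallest non-negative solution, so walks that are trapped in a cycle and never reach the target are not counted.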

15.3 Background

For readers interested in more detail on the inner workings of lexicase selection and the history of community assembly graphs, we offer some additional context.

15.3.1 Lexicase Selection

Lexicase selection is a parent selection algorithm designed to operate in contexts where multiple criteria affect fitness [27]. In genetic programming, these criteria are generally the candidate solution’s performance on various test cases. To select an individual from the population, the test cases are placed in a random order and the entire population is placed into a pool of solutions eligible for selection. The algorithm then iterates through the test cases in their randomized sequence. For each test case, the best-performing solutions in the pool are identified. These solutions are allowed to remain in consideration for selection, while all others are removed from the pool. Once a single solution remains, it is selected. If several solutions are still tied after all test cases have been used, one of them is selected at random.
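The selection procedure just described can be sketched as follows. The function name and data layout are hypothetical, and scores are assumed to be "higher is better"; for error-based test cases the comparison would be flipped.

```python
import random


def lexicase_select(population, scores, rng=random):
    """Select one parent by lexicase selection.

    population: list of individual ids.
    scores[id]: sequence of per-test-case scores for that individual.
    """
    n_cases = len(scores[population[0]])
    order = list(range(n_cases))
    rng.shuffle(order)                      # random test-case order
    pool = list(population)                 # everyone starts eligible
    for case in order:
        best = max(scores[z][case] for z in pool)
        pool = [z for z in pool if scores[z][case] == best]
        if len(pool) == 1:
            break                           # a single solution remains
    return rng.choice(pool)                 # random choice among remaining ties


# Hypothetical population: two specialists and one solution that is never elite.
demo_scores = {"a": (1, 0), "b": (0, 1), "c": (0, 0)}
demo_rng = random.Random(0)
picks = [lexicase_select(["a", "b", "c"], demo_scores, demo_rng) for _ in range(200)]
```

In this toy population the two specialists are each selected whenever their own test case comes first in the ordering, while the third solution is filtered out by whichever case is considered first.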

15.3.2 Community Assembly Graphs

A central set of questions in community ecology concerns the extent to which ecological communities assemble in predictable ways. As an obvious way to systematically address this class of question, community assembly graphs have a long history in ecological theory [3, 10, 25, 26]. However, in the context of ecology there are a number of factors that interfere with using community assembly graphs to accurately model community dynamics. Servan and Allesina identify three specific complicating factors: rate of invasion, size of invasion, and timing of invasion [26]. Uncertainty in these factors makes it hard to predict the outcome of invasions a priori.

In evolutionary computation, rate of invasion translates to the rate of discovery of new phenotypes. As previously discussed, due to the neutrality inherent in most genetic programming problems, we can assume this is relatively slow. Moreover, we can tune the rate at which we expect to discover new phenotypes using the $G$ parameter. For this same reason, we can be relatively confident that the size of the invasion is always a single individual with the new phenotype. Interestingly, these are the same simplifying assumptions Servan and Allesina end up making [26], suggesting that evolutionary computation is well-suited to this framework.

Timing of invasion is the most challenging of these factors to account for in the context of evolutionary computation. Specifically, the major variable is how the invasion is timed relative to fluctuations in other phenotypes’ population sizes. Fortunately for our immediate purposes, it is unlikely to make a large difference in the context of lexicase selection, as the ecology of lexicase selection is not heavily influenced by population size [8]. However, this challenge will be important to consider when using community assembly graphs to analyze other selection schemes. Servan and Allesina handle this problem by assuming populations are at equilibrium, which may be a viable approach in evolutionary computation as well.

15.4 Proof of Concept in NK Landscapes

To confirm that community assembly graphs accurately predict the dynamics of lexicase selection, we initially test them out in the context of a simple and well-understood problem: NK Landscapes [15].


15.4.1 Methods

NK landscapes are a popular and well-studied class of fitness landscapes with a tunable level of epistasis. They are governed by two parameters: N and K. Genomes that evolve on NK Landscapes are bit-strings of length N. Each of the N sites contributes some amount to the overall fitness of the bitstring, depending on its value and the values of the K adjacent sites. Fitness contributions are calculated based on a randomly generated lookup table for each site, which contains a different fitness contribution for every possible combination of values that the site and its K adjacent sites could have. For example, if K = 0, each site would have two possible fitness contributions: one which would be contributed if the bit at that position were a 1 and one which would be contributed if it were a 0. If K = 1, there would be 4 possible values (for 00, 01, 10, or 11), and so on. For the purpose of lexicase selection, the fitness contribution at each site is treated as a separate fitness criterion.

We generate a community assembly graph for an arbitrary NK fitness landscape (see Fig. 15.2). We chose a small value of N (3) for the purposes of being able to visualize the entire community assembly graph and a relatively high value of K (2) to increase the complexity of the search space. In calculating this community assembly graph, we assume $G = 1$ and population size is 100.

To assess the accuracy of the community assembly graph’s predictions, we then ran 30 replicate runs of evolution on this NK Landscape. These runs were allowed to proceed for 500 generations with a population size of 100. We tested two different mutation rates. The first was low (0.001 per-site) to ensure that the assumptions of the community assembly graph were not violated. The second was more realistic (0.1 per-site), to test the impact of violations to the assumptions.
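To make the per-site lookup tables concrete, the following sketch builds an NK landscape and returns the vector of per-site contributions that serves as the phenotype for lexicase selection. It assumes that the K adjacent sites are the next K positions with wrap-around, which is a common convention but is not specified in the text; the function names are hypothetical.

```python
import random
from itertools import product


def make_nk_tables(n, k, rng):
    """One random lookup table per site, keyed by the (k + 1)-bit neighborhood."""
    return [
        {bits: rng.random() for bits in product((0, 1), repeat=k + 1)}
        for _ in range(n)
    ]


def nk_contributions(genome, tables, k):
    """Per-site fitness contributions, used as separate lexicase criteria."""
    n = len(genome)
    return tuple(
        # Assumed neighborhood: the site itself plus the next k sites, wrapping around.
        tables[i][tuple(genome[(i + j) % n] for j in range(k + 1))]
        for i in range(n)
    )


landscape_rng = random.Random(1)
nk_tables = make_nk_tables(3, 2, landscape_rng)          # N = 3, K = 2
criteria = nk_contributions((0, 1, 1), nk_tables, 2)     # one score per site
```

With N = 3 and K = 2 every site depends on the whole genome, so each table has 8 entries, and each genome maps to a vector of three criterion scores.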

15.4.2 Results

At the low mutation rate, the community assembly graph perfectly identified the set of final communities observed (see Table 15.1). At the higher mutation rates, the population consistently converged on a single community. This community was one of the possible end states predicted by the community assembly graph. These preliminary results suggest that violations to the assumptions of the community assembly graph may alter the probabilities of observing the different final communities. Further experiments to determine whether violations to the assumptions can ever lead to convergence on non-sink nodes are underway (although such an occurrence should theoretically be impossible).

Encouragingly, the community that was consistently reached under the high mutation rates is the “better” community to find, in that it includes the best-performing solutions on each fitness criterion. We hypothesize that this trend may generalize across evolutionary scenarios, and thus that predictions based on community assembly graphs may represent a near worst-case scenario. While more work is needed to test this hypothesis, the reasoning behind it is as follows: the primary assumption that is violated by having a high mutation rate is the assumption that only one new “species” is introduced at once. Violating this assumption should tend to favor the best-performing solutions, as they will tend to outcompete other solutions that are introduced simultaneously.

Fig. 15.2 Community assembly graph for the NK fitness landscape used in these experiments. N = 3, K = 2. The starting population is assumed to contain only the bitstring 000. Starting node is blue, sink nodes are red

Table 15.1 Final communities reached by lexicase selection on the example NK Landscape

Mutation rate    1, 2, 3    1, 2, 4    2, 4, 7    Other
0.001            11         13         6          0
0.01             0          30         0          0
0.1              0          30         0          0


15.5 Proof of Concept in Genetic Programming

While the results on NK Landscapes are promising, NK Landscapes are dramatically simpler than realistic genetic programming problems. To understand whether community assembly graphs are a practical tool for genetic programming, we next test them on a selection of problems from the first and second program synthesis benchmark suites [11, 12].

15.5.1 Methods

Any community assembly graph of an evolutionary algorithm is inherently specific to a particular problem, genetic representation, mutation scheme, and selection scheme. Here, we use community assembly graphs to conduct a supplemental analysis of the experiments in [20]. These experiments used SignalGP as a genetic representation [21]. SignalGP is a tag-based linear genetic programming system in which programs evolve sequences of instructions to execute in response to receiving external signals. Instructions can take arguments which indicate which data they operate on. Mutations can occur in four forms: (1) one instruction is replaced with a different instruction, (2) an instruction is inserted into the genome, (3) an instruction is deleted from the genome, or (4) the arguments to an instruction are changed. For more details, see [21].

Although Lalejini et al. [20] explored multiple variations on lexicase selection, here we will focus only on standard lexicase selection. To carry these analyses out for other selection schemes in the future, we would need to adjust Eq. 15.1 accordingly.

It is not tractable to map out the entire mutational landscape of SignalGP. Instead, we must sample a representative portion of the landscape. A representative sample should, theoretically, include the regions of genotype space where we expect evolution to end up. We propose that the best way to find these regions is to conduct multiple replicate runs of evolution in the scenario of interest. Here, we use 10 replicate runs per condition. From each of these runs, we extract the full genotype-level phylogeny of the final population, i.e. the full ancestry tree of which genotypes descended from which other genotypes [7]. These phylogenies will indicate what parts of the genotype space evolution ultimately traversed. However, that information alone is insufficient, as we must also know the ecological context that each genotype is likely to find itself in.

To obtain this additional information, we conduct mutational landscaping analysis around each genotype in each phylogeny. For each genotype, we produce 10,000 random mutants and test their performance on all test cases. We record all instances where a mutant had a different phenotype (i.e. test case performance profile) than the genotype it was generated from. By aggregating this data, we produce a network showing the probability of mutating from each phenotype to each other phenotype.
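The aggregation step could be sketched as follows. The record format, the function name, and in particular the choice to normalize by the total number of sampled mutants per parent phenotype are assumptions made for illustration; the text does not specify how the probabilities are normalized.

```python
from collections import Counter, defaultdict


def phenotype_transitions(records):
    """Estimate phenotype-to-phenotype mutation probabilities.

    records: iterable of (parent_phenotype, mutant_phenotype) pairs, one per
    sampled mutant. Phenotypes are hashable score vectors (e.g. tuples).
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for parent, mutant in records:
        totals[parent] += 1
        if mutant != parent:                 # only phenotype-changing mutants form edges
            counts[parent][mutant] += 1
    return {
        parent: {mutant: c / totals[parent] for mutant, c in out.items()}
        for parent, out in counts.items()
    }


# Hypothetical sample of mutants and their phenotypes.
sample = [
    ((0, 0), (0, 0)),
    ((0, 0), (1, 0)),
    ((0, 0), (1, 0)),
    ((0, 0), (0, 1)),
    ((1, 0), (1, 1)),
    ((1, 0), (1, 0)),
]
network = phenotype_transitions(sample)
```

Pooling records from every genotype that shares a phenotype gives one outgoing probability distribution per phenotype, which is the form needed when deciding which phenotypes can be introduced to a community.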


Finally, we construct the community assembly graph. Phenotype, rather than genotype, determines the ecological impact of a given solution being present in the population. Thus, each node in the community assembly graph represents a combination of phenotypes. Each community has the potential to have a given phenotype introduced to it if and only if that phenotype is mutationally adjacent to one of its member-phenotypes. The probability of mutating from one phenotype to another determines the probability of that mutation occurring.

Even with this sampling approach, the full community assembly graph is intractably large. To avoid this problem, we use the probability-based priority queue approach discussed in Sect. 15.2.4.1. For the purposes of comprehensible data visualization, we explore a relatively small (100) number of nodes per graph.

15.5.2 Results We first conduct a community assembly graph for the grade problem [12] (see Fig. 15.3). Lalejini et al. found that this problem was solved fairly consistently [20], suggesting that it is relatively easy for SignalGP to solve. Indeed, the perfect solution appears in this graph and is the only sink node. Thus, it is reachable from our starting point (the community containing only the phenotype with scores of 0 for all test cases), and evolution is unlikely to get stuck anywhere else in between. The community corresponding to the perfect solution also has a relatively high PageRank, suggesting we should usually expect evolution to reach it. Interestingly, there is an edge directly from the starting community to the optimal solution. However, the probability of taking this path is very low. Likely, the existence of this edge is due to mutational landscaping around the optimal genotype; it is often fairly easy to get a mutation that completely breaks a good solution. Since all mutations are bidirectional, the existence of such mutations means that it is technically also possible to mutate from the starting phenotype directly to the ending phenotype. Nevertheless, it is important to be aware that the probability of starting with a genotype that makes such a mutation possible is vanishingly small. The community assembly graph for the median problem [12] is qualitatively similar (albeit with the optimal solution having a somewhat lower PageRank) (see Fig. 15.4). Since Lalejini et al. found similar performance between the grade and median problems [20], the similarity of these graphs makes sense. Next, we construct a community assembly graph for the FizzBuzz problem [11] (see Fig. 15.5). Interestingly, the optimal solution is also the only true2 sink node in this graph, and it is also technically reachable from the starting community. However, the PageRank for the optimal community is very low, indicating that actually arriving there is unlikely. 
Instead, there appears to be a node in the middle of the graph that 2

Note that there are other nodes that have no outgoing edges within this sub-graph; however, these nodes all have outgoing edges to nodes that did not make it into the sub-graph because reaching them is unlikely.

296

E. Dolson and A. Lalejini

Fig. 15.3 Community assembly graph of the 100 most accessible communities for the grade problem. Node colors and sizes indicate the PageRank of each node, which translates to the probability of a random walk ending on each node. Edge colors indicate the probability of choosing each edge. Nodes are arranged along the y axis according to how far away from the starting node they are (measured as shortest path). The starting node (representing a community containing only the worst-performing phenotype) is the lowest node on the y axis. Note that this portion of the graph contains only one true sink node (outlined in red). In this case, that node represents the optimal solution, indicating that the solution for this problem is indeed reachable

functions as some sort of attractor. Thus, we can conclude that the reason Lalejini et al. found that this problem was harder to solve [20] was likely due primarily to mutations producing the optimal genotype being rare and/or evolution spending most of its time in the vicinity of the attractor. The small-or-large problem [12] was solved the least frequently in Lalejini et al.’s analysis [20]. Indeed, none of the 10 runs of evolution used to build our community assembly graph found a solution. Consequently, the optimal solution does not appear in the community assembly graph. There is a true sink node, but it is not easy to reach (see Fig. 15.6 (top)). To understand the impact of sub-graph size, we also construct a 1000-node community assembly graph for this problem (see Fig. 15.6 (bottom)). The larger graph has a number of additional sink nodes, although they all still have fairly low PageRanks. Thus, it seems unlikely that there is a single node where the search process is consistently getting stuck. However, some of the difficulty of this problem may arise from there being many different places where it is possible to get stuck. One important caveat to this analysis is that it is inherently biased by the trajectory that evolution actually took in the underlying runs of evolution. Since we only carried


Fig. 15.4 Community assembly graph of the 100 most accessible communities for the median problem. Node colors and sizes indicate the PageRank of each node, which translates to the probability of a random walk ending on each node. Edge colors indicate the probability of choosing each edge. Nodes are arranged along the y axis according to how far away from the starting node they are (measured as shortest path). The starting node (representing a community containing only the worst-performing phenotype) is the lowest node on the y axis. This portion of the graph contains only one true sink node (outlined in red), which represents the optimal solution

out mutational landscaping around observed genotypes, there are many parts of the fitness landscape that we are unaware of. If we have no examples of runs where a problem was successfully solved, we are very unlikely to conclude that a solution is reachable. However, we believe that this analysis is nevertheless useful for identifying where and why evolution tends to get stuck.
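The quantities read off these figures (true sink nodes, shortest-path distance from the starting node, and the probability that a random walk ends on a given node) can be illustrated on a toy graph. The following is a minimal sketch, not the authors' implementation: the five-node graph, its transition probabilities, and all function names are hypothetical, and the end-of-walk probability is computed by simple propagation with absorbing sinks rather than by the PageRank computation used in this chapter.

```python
from collections import deque

# Hypothetical toy community assembly graph:
# node -> {successor: probability of taking that edge}.
graph = {
    "start": {"A": 0.5, "B": 0.5},
    "A": {"optimum": 0.2, "B": 0.8},
    "B": {"trap": 1.0},
    "optimum": {},  # true sink: no outgoing edges
    "trap": {},     # true sink that is not the optimum
}


def sink_nodes(graph):
    """Return the nodes with no outgoing edges, sorted by name."""
    return sorted(node for node, out in graph.items() if not out)


def shortest_path_lengths(graph, source):
    """Breadth-first search: number of edges from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for succ in graph[node]:
            if succ not in dist:
                dist[succ] = dist[node] + 1
                queue.append(succ)
    return dist


def walk_end_probabilities(graph, source, steps=100):
    """Propagate probability mass along edges, treating sinks as absorbing."""
    prob = {node: 0.0 for node in graph}
    prob[source] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in graph}
        for node, mass in prob.items():
            if not graph[node]:
                nxt[node] += mass  # mass that reached a sink stays there
            else:
                for succ, p in graph[node].items():
                    nxt[succ] += mass * p
        prob = nxt
    return prob


if __name__ == "__main__":
    print(sink_nodes(graph))
    print(shortest_path_lengths(graph, "start"))
    print(walk_end_probabilities(graph, "start"))
```

In this toy graph the optimum is reachable, but a walk from the start ends there with probability of only about 0.1, while about 0.9 of the mass ends in the other sink. This is the same qualitative situation as a reachable optimal node with low PageRank.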

15.6 Conclusion

Based on our observations thus far, community assembly graphs appear to accurately predict the possible end states of lexicase selection. Consequently, we can use them to identify circumstances where optimal solutions are inaccessible. Moreover, we can use them in the same way that fitness landscapes are used for other evolutionary algorithms: as a tool for understanding unexpected results. Like fitness landscapes, community assembly graphs can be large and costly to calculate. Nevertheless, their ability to precisely calculate possible evolutionary trajectories makes them a powerful


Fig. 15.5 Community assembly graph of the 100 most accessible communities for the FizzBuzz problem. Node colors and sizes indicate the PageRank of each node. Edge colors indicate the probability of choosing each edge. Nodes are arranged along the y axis according to how far away from the starting node they are (measured as shortest path). The starting node (representing a community containing only the worst-performing phenotype) is the lowest node on the y axis. Note that this portion of the graph contains only one true sink node (outlined in red). In this case, that node represents the optimal solution, indicating that the solution for this problem is indeed reachable. However, this node has very low PageRank, indicating that reaching it is relatively unlikely

Fig. 15.6 Community assembly graph of the 100 (top) and 1000 (bottom) most accessible communities for the small-or-large problem. Node colors and sizes indicate the PageRank of each node. Edge colors indicate the probability of choosing each edge. The starting node is the lowest node on the y axis. These portions of the graph contain true sink nodes (outlined in red), but they do not correspond to the optimal solution

analytical tool to have in our tool-kit. In particular, we hypothesize that they may be valuable for comparing the effect of subtle changes to a selection scheme.

Further research is necessary to understand how accurately community assembly graphs built from a small sample of runs of evolution can predict the possible outcomes of subsequent runs. Relatedly, it will be important to understand the impact of sub-graph size on this prediction accuracy. While even the small graphs shown here seem to intuitively describe different evolutionary scenarios, we have also presented preliminary evidence that some dynamics may be missed by small sub-graphs.

A number of additional open questions relate to the interaction between neutral networks and community assembly graphs. Neutral networks are networks of mutationally adjacent genotypes with the same phenotype. Our analysis assumes that these networks are somewhat sizeable and thus take some time to traverse before a new phenotype can be discovered. This assumption is why we expect it to be uncommon that multiple new phenotypes are introduced to a community at once. Recent analysis by Banzhaf et al. lends preliminary support to this assumption in the context of genetic programming, but raises an important additional layer of nuance: the expected size of neutral networks should decrease with the complexity of program outputs [1]. This pattern likely has implications for the topology of community assembly graphs, although it is not immediately obvious what they are.

In the future, we plan to extend this community assembly graph approach to analyze other evolutionary algorithms in which the probability of various solutions being selected is strongly impacted by the composition of the population. We also plan to develop better techniques for quantitatively analyzing the resulting graphs, as such techniques will enable us to make use of larger sub-graphs than those shown here.
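The definition of a neutral network used above (mutationally adjacent genotypes sharing a phenotype) can be made concrete with a toy example. This is only an illustrative sketch under stated assumptions: genotypes are short bit strings, a mutation is a single bit flip, and the majority-vote genotype-to-phenotype map is hypothetical. None of these choices comes from the systems analyzed in this chapter.

```python
from itertools import product


def phenotype(genotype):
    """Hypothetical genotype-to-phenotype map: 1 if most bits are set, else 0."""
    return int(sum(genotype) * 2 > len(genotype))


def neighbors(genotype):
    """Yield every genotype that is one bit flip away."""
    for i in range(len(genotype)):
        yield genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]


def neutral_networks(length):
    """Group all bit strings of the given length into neutral networks.

    A neutral network is a connected set of genotypes that share a phenotype,
    where two genotypes are connected if they differ by one mutation.
    """
    unvisited = set(product((0, 1), repeat=length))
    networks = []
    while unvisited:
        seed = unvisited.pop()
        target = phenotype(seed)
        component = {seed}
        stack = [seed]
        while stack:
            current = stack.pop()
            for nb in neighbors(current):
                if nb in unvisited and phenotype(nb) == target:
                    unvisited.remove(nb)
                    component.add(nb)
                    stack.append(nb)
        networks.append(component)
    return networks


if __name__ == "__main__":
    for network in neutral_networks(4):
        print(phenotype(next(iter(network))), len(network))
```

With four-bit genotypes, this map yields one neutral network of 11 genotypes and one of 5. The larger a network is, the more mutations can accumulate inside it before a new phenotype is discovered.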

References

1. Banzhaf, W., Hu, T., Ochoa, G.: How the combinatorics of neutral spaces leads GP to discover simple solutions (2023), to appear in Genetic Programming Theory and Practice XX
2. Boldi, R., Briesch, M., Sobania, D., Lalejini, A., Helmuth, T., Rothlauf, F., Ofria, C., Spector, L.: Informed down-sampled lexicase selection: Identifying productive training cases for efficient problem solving (2023). https://doi.org/10.48550/arXiv.2301.01488
3. Capitán, J.A., Cuesta, J.A., Bascompte, J.: Statistical mechanics of ecosystem assembly. Phys. Rev. Lett. 103(16), 168101 (2009). https://doi.org/10.1103/PhysRevLett.103.168101. American Physical Society
4. Ding, L., Spector, L.: Optimizing Neural Networks with Gradient Lexicase Selection. In: International Conference on Learning Representations (2022)
5. Dolson, E.: Calculating lexicase selection probabilities is NP-hard. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1575–1583. GECCO ’23, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3583131.3590356
6. Dolson, E., Banzhaf, W., Ofria, C.: Applying Ecological Principles to Genetic Programming. In: Banzhaf, W., Olson, R.S., Tozier, W., Riolo, R. (eds.) Genetic Programming Theory and Practice XV, pp. 73–88. Springer International Publishing, Cham (2018)
7. Dolson, E., Rodriguez-Papa, S., Moreno, M.A.: Phylotrack: C++ and python libraries for in silico phylogenetic tracking. J. Open Source Softw. (in review). https://doi.org/10.5281/zenodo.7922092
8. Dolson, E.L., Banzhaf, W., Ofria, C.: Ecological theory provides insights about evolutionary computation. PeerJ Preprints 6, e27315v1 (2018). https://doi.org/10.7287/peerj.preprints.27315v1
9. Hang-Kwang, L., Pimm, S.L.: The assembly of ecological communities: a minimalist approach. J. Animal Ecol. 749–765 (1993)
10. Hang-Kwang, L., Pimm, S.L.: The assembly of ecological communities: A minimalist approach. J. Animal Ecol. 62(4), 749–765 (1993). https://doi.org/10.2307/5394. Publisher: Wiley, British Ecological Society
11. Helmuth, T., Kelly, P.: PSB2: the second program synthesis benchmark suite. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 785–794. ACM, Lille France (2021). https://doi.org/10.1145/3449639.3459285
12. Helmuth, T., Spector, L.: General program synthesis benchmark suite. In: Proceedings of the 2015 on Genetic and Evolutionary Computation Conference - GECCO ’15, pp. 1039–1046. ACM Press, Madrid, Spain (2015). https://doi.org/10.1145/2739480.2754769
13. Hernandez, J.G., Lalejini, A., Dolson, E., Ofria, C.: Random subsampling improves performance in lexicase selection. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 2028–2031. GECCO ’19, Association for Computing Machinery (2019). https://doi.org/10.1145/3319619.3326900


14. Hernandez, J.G., Lalejini, A., Ofria, C.: A suite of diagnostic metrics for characterizing selection schemes (2022). https://doi.org/10.48550/arXiv.2204.13839
15. Kauffman, S., Levin, S.: Towards a general theory of adaptive walks on rugged landscapes. J. Theor. Biol. 128(1), 11–45 (1987). https://doi.org/10.1016/S0022-5193(87)80029-2
16. La Cava, W., Helmuth, T., Spector, L., Moore, J.H.: A Probabilistic and Multi-Objective Analysis of Lexicase Selection and ε-Lexicase Selection. Evol. Comput. 1–26 (2018). https://doi.org/10.1162/evco_a_00224
17. La Cava, W., Spector, L., Danai, K.: Epsilon-Lexicase Selection for Regression. In: Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 741–748. ACM, Denver Colorado USA (2016). https://doi.org/10.1145/2908812.2908898
18. La Cava, W., Spector, L., Danai, K.: Epsilon-Lexicase Selection for Regression. In: Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 741–748. GECCO ’16, Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2908812.2908898
19. Lalejini, A., Dolson, E., Vostinar, A.E., Zaman, L.: Artificial selection methods from evolutionary computing show promise for directed evolution of microbes. eLife 11, e79665 (2022). https://doi.org/10.7554/eLife.79665
20. Lalejini, A., Moreno, M.A., Hernandez, J.G., Dolson, E.: Phylogeny-informed fitness estimation (2023), to appear in Genetic Programming Theory and Practice XX
21. Lalejini, A., Ofria, C.: Evolving event-driven programs with SignalGP. In: Proceedings of the Genetic and Evolutionary Computation Conference on - GECCO ’18, pp. 1135–1142. ACM Press, Kyoto, Japan (2018). https://doi.org/10.1145/3205455.3205523
22. Matsumoto, N., Saini, A.K., Ribeiro, P., Choi, H., Orlenko, A., Lyytikäinen, L.P., Laurikka, J.O., Lehtimäki, T., Batista, S., Moore, J.H.: Faster Convergence with Lexicase Selection in Tree-Based Automated Machine Learning. In: Pappa, G., Giacobini, M., Vasicek, Z. (eds.) Genetic Programming, vol. 13986, pp. 165–181. Springer Nature Switzerland, Cham (2023). Series Title: Lecture Notes in Computer Science
23. Metevier, B., Saini, A.K., Spector, L.: Lexicase Selection Beyond Genetic Programming. In: Banzhaf, W., Spector, L., Sheneman, L. (eds.) Genetic Programming Theory and Practice XVI, pp. 123–136. Springer International Publishing, Cham (2019). Series Title: Genetic and Evolutionary Computation
24. Moore, J.M., Stanton, A.: Lexicase selection outperforms previous strategies for incremental evolution of virtual creature controllers. In: Proceedings of the 14th European Conference on Artificial Life ECAL 2017, pp. 290–297. MIT Press, Lyon, France (2017). https://doi.org/10.7551/ecal_a_050, https://www.mitpressjournals.org/doi/abs/10.1162/isal_a_050
25. Schreiber, S.J., Rittenhouse, S.: From simple rules to cycling in community assembly. Oikos 105(2), 349–358 (2004). https://doi.org/10.1111/j.0030-1299.2004.12433.x
26. Serván, C.A., Allesina, S.: Tractable models of ecological assembly. Ecol. Lett. 24(5), 1029–1037 (2021)
27. Spector, L.: Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In: Proceedings of the 14th annual conference companion on Genetic and evolutionary computation, pp. 401–408. ACM (2012). http://dl.acm.org/citation.cfm?id=2330846
28. Spector, L., Cava, W.L., Shanabrook, S., Helmuth, T., Pantridge, E.: Relaxations of lexicase parent selection. In: Banzhaf, W., Olson, R.S., Tozier, W., Riolo, R. (eds.) Genetic Programming Theory and Practice XV, pp. 105–120. Genetic and Evolutionary Computation, Springer International Publishing (2018). https://doi.org/10.1007/978-3-319-90512-9_7
29. Østman, B., Adami, C.: Predicting evolution and visualizing high-dimensional fitness landscapes. In: Recent Advances in the Theory and Application of Fitness Landscapes, pp. 509–526. Emergence, Complexity and Computation, Springer, Berlin, Heidelberg (2014). https://doi.org/10.1007/978-3-642-41888-4_18

Chapter 16

Let’s Evolve Intelligence, Not Solutions

Talib S. Hussain

T. S. Hussain, John Abbott College, Ste-Anne-de-Bellevue, QC, Canada
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8_16

16.1 Introduction

Despite significant advances in the techniques used throughout the disparate fields related to artificial intelligence (AI), including machine learning (ML), neural networks (NN) and deep learning, evolutionary computation (EC), etc., we remain immersed in the perspective of trying to develop solutions to problems. In large part, we are driven to be useful: to create systems that help now and/or are profitable now, to produce viable research that meets a tangible need and to continue to make gains towards automating the learning of complex tasks. However, for most of us, the dreams of strong AI (of systems that can learn and adapt on their own, know why they do what they do, and that perhaps one day can “think” on their own) remain dreams that seem too far-fetched. Some of us hope we can engineer intelligence from component techniques. Some believe the mysterious depths of a deep network are perhaps indeed a form of intelligence and are enamoured of the amazing recent results of generative large language models. Some pursue less ambitious dreams of explainable AI that can provide post hoc reasoning behind black-box choices. Some feel the problem is intractable: that we can never build an artificial system that demonstrates true understanding. Some believe that strong AI may in fact be so different from natural intelligence that we may not recognize it as intelligence. And some of us fear the singularity and question the ethics of creating artificial life.

In this position paper, I wish to discuss the challenge of re-orienting ourselves to creating intelligence, not solutions. Rather than following the current research paths that focus on creating better and better performing solutions within narrow domains for specific types of problems, let us define research paths that focus on creating strong to “strongish” AI that is highly intelligent in many core capabilities and broadly useful across many domains and types of problems (see Fig. 16.1). I use the qualifier ‘-ish’ to reflect the need to strive for AI that can understand, reason, learn, plan, be creative and solve problems to achieve goals, but that is not necessarily self-aware/conscious. (While such consciousness may turn out to be necessary for meaningful intelligence [4], we are quite far from understanding how to create it.) I lay out a general re-framing that identifies a broad approach to consider with regard to creating strongish AI and identify a number of research avenues that may be useful to explore. I believe that the field of evolutionary computation in general, and the field of genetic programming in particular, offers a rich opportunity for achieving this, but I believe my ideas apply broadly across all approaches to AI as well as related fields such as cognitive modelling.

Fig. 16.1 Solutions focus versus intelligence focus

The gap between current techniques and the “general problem-solving ability of man” remains “unbridgeable” [2]. This paper seeks a long-view path to begin to bridge that gap. Let’s try to “step back” and reason from first principles, somewhat independently of prior work per se. What should we be striving for? What assumptions are limiting us? What do we need? How should we approach it?

16.2 What Should We Strive For?

As a field, across all domains, research communities and endeavours, I believe there are several fundamental changes we can aspire to make in how we approach our work. Broadly, current approaches are based on applying an AI technique to solve a specific problem. We have made great strides in addressing the challenge of shareable data, in sharing tools to build/train AIs using various techniques, and in defining methods for assessing and comparing task performance of different solutions. We can now create human-competitive AI solutions using a variety of techniques (transformers, deep learning, evolutionary computation, etc.), though the computing and data resources involved may be tremendous. There has even been some recent interest (mostly in private labs, e.g. [9]) in exploring artificial general intelligence, as well as early research on embodied AI and embodied artificial evolution (e.g. [6, 7, 10]). However, each of these systems/solutions is essentially


what I will term, somewhat controversially perhaps, an “intelligence dead end”. It is a solution highly tuned to a particular problem that has been developed essentially “from scratch” using particular data and a particular learning process, often with each aspect of the approach highly influenced by human design choices and carefully shaped with iterative human oversight. That final solution cannot be used to solve anything else. Its knowledge and behaviours cannot be transferred. It is a narrow, custom-made solution. Researchers and developers can of course transfer their own knowledge of how to build such solutions to build yet more. But each solution stands alone. Moreover, each technique is generally its own stovepipe and does not play well with others. Some research efforts have indeed, by contrast, been focused on more general capabilities and phenomena rather than simply being solution oriented. These efforts offer some intuitions that are complementary to the ideas presented in this paper as well as techniques that are likely key avenues for future pursuit of these ideas. Some efforts have focused on the process of learning and/or adapting the way in which the systems learn, such as continual learning [13] and meta-learning or “learning to learn” approaches [12, 13]. These approaches investigate methods for continually improving performance on a specific task (which often retains a “solution-focus”) as well as for generalizing performance across multiple tasks or new situations. Yet, generally, the measures of success for most efforts remain focused on (task or multi-task) performance rather than on intrinsic (intelligence) properties of the learning system itself. Other efforts have focused on complexity, such as artificial life and “open-endedness” approaches [17]. 
These approaches offer some valuable perspectives, insights and techniques into the challenge of creating systems that do not converge to specific solutions, but rather generate increasingly complex strategies and other intrinsic properties. However, in practice, many such systems are destined to stop producing anything interesting within days of launching… (since) their problem domains contain only a narrow set of possibilities... (Yet) open-endedness could be a prerequisite to AI, or at least a promising path to it. [17]

Implicit in the solutions-oriented focus of most AI research and development is the idea that we should strive for intelligent systems that reliably solve problems as well as or better than humans, even if they can only do so in a narrow context. Implicit in the large data and large-scale focus of many efforts is the idea that intelligence is a property that may emerge once we reach sufficient scale, that it is a numbers game. Recent remarkable performance by systems such as ChatGPT (Chat Generative Pre-Trained Transformer) [15] certainly provides some evidence of the potential merits of this approach for creating highly powerful, though still “weak AI”, one-off solutions. However, if we wish to avoid such “intelligence dead-ends”, I suggest that we must strive to adopt methods that encourage and even require us to create the opposite: “intelligence seeds” that promote and propagate the creation of increasingly intelligent systems. In other words, we should be creating and sharing AIs that may build upon themselves and/or upon other AIs to enable the creation of new AIs that are quantifiably more intelligent across multiple dimensions.


Broadly across the diverse research fields related to artificial intelligence, this paper could be regarded as advocating steps towards a future in which researchers/developers in one field and domain using one set of AI/EC/ML/NN/etc. techniques create certain intelligent capabilities that other researchers/developers in another field or domain or using another technique leverage directly to create even more intelligent capabilities. A future in which (trained) “solutions” in one domain are directly used to create “solutions” in another, not just in a shared “code-to-build-it” sense, shared methodology sense, or shared data sense, but in a shared “components of intelligence” sense. This could include, for example, taking a trained intelligence and transferring it to a new domain with relatively little additional training or re-training required, taking the trained research products from two different labs and combining and extending them in another lab while leveraging prior training cycles directly, and/or taking parts of two different intelligences and, with relatively little effort, combining those parts to create an improved new intelligence.

For us specifically in the field of evolutionary computation, this paper could be regarded as an exploration of how we might adapt evolutionary computation methods to evolve strongish AI. I suggest that we need to adopt the (contrarian) viewpoints that natural evolution does NOT actually produce and reward intelligence and that living and experiencing the real world does NOT actually require or induce the development of intelligence. We need to go beyond natural evolution processes (evolution++?) and beyond the real world towards processes and experiences designed to cultivate intelligence. Rather than evolve systems that succeed by any means possible (often the simplest), we want to evolve systems that perform in the most intelligent ways possible (even at the expense of “success”, at least initially).

16.3 What Assumptions Are Limiting Us?

In general, I believe that current research and development approaches, for all their successes, suffer from some assumptions that limit us in our quest to create intelligence. Here I call out these possible limitations by positing their opposite.

16.3.1 Posit #1: Impossible to Engineer Intelligence

A key underlying assumption of much AI research is that somehow, as humans, we will be able to engineer intelligence by engineering solutions: that we can code the “right” algorithm or set of rules, that we can build it, train it and/or “tweak” it and we will make steady progress towards achieving strong AI. Let us posit the opposite first: that it is not possible to “engineer” intelligence. On the plus side, we have an existence proof. Animal and human intelligence evolved. They evolved to solve the problem of survival in a dynamic, complex, nuanced and often inimical world. However, while we humans did evolve in this


world, we did not evolve to be doctors, mathematicians, engineers, astrophysicists, chefs, graphic designers, programmers and so on. We as a species grew that knowledge over time and as individuals we are taught the methods and ways of thinking needed to perform those tasks (often despite the natural competing tendencies we evolved to have). Moreover, natural intelligence evolved via a variety of messy methods and happenstance. While human physiology embodies intelligence, it itself is not a design for intelligence per se. We don’t know enough. And yet we want the AIs we create using ourselves as inspiration to assist us on our complex tasks. Though we don’t know enough about what intelligence is to build it, perhaps we can make it feasible to “grow” strongish AI. For example, we can provide the conditions that will challenge and encourage an intelligence to develop as well as a framework for sharing and combining intelligences that will enable us to collectively achieve significant gains in intelligence capabilities over time. We would still need to engineer processes for growing the AIs, and it is likely that we could engineer some components required by intelligence to bootstrap or improve the efficiency of those processes. But the path forward requires significant collaborative exploration on many new dimensions of intelligence that we have yet to understand and consider.

16.3.2 Posit #2: No Occam’s Razor for Intelligence

There is often an underlying assumption in our research that we should be able to create an intelligent system if we could (just) discover the (simple) principles by which it operates. That it will all just fall into place if we can model neural networks appropriately, discover the right set of rules that define robust reasoning, model an effective evolutionary process with the right representations and fitness function, and/or create the right transformers trained on the right data. But recall: the world is complex and what led to our intelligence isn’t simple. It isn’t just because of the process of evolution or because of our brains. It is because of the complexity of the world around us and the great many things that have impacted and continue to impact our experience and survival (and a lot of luck). Moreover, we have just the one (or a limited few) existence proof. Of the millions (billions? trillions?) of species that have existed, we alone approach the level of intelligence we want to demand of our AIs. We seek human-competitive intelligence. And we cannot truly say that evolutionary algorithms or artificial neural networks are techniques that will inevitably produce such intelligence; there is no such guarantee. So, let us posit second that there is no Occam’s razor for intelligence. We will not find just the right algorithm. We will not achieve intelligence in a single lab by the efforts of a single researcher working diligently on their favourite approach. We will need something much messier and more complicated. Something that may, perhaps, accrete or emerge from a myriad of complementary and competitive methods, from formative experiences and interactions in complex environments, from a wealth of


potentially contradictory human-provided guidance and/or from many small-scale and large-scale iterative improvements using processes that themselves change in response to complex, dynamic worlds.

16.3.3 Posit #3: Intelligence Is Grounded

Most recent AI approaches are based on the use of large amounts of training data that are treated as independent of each other with the tacit assumption that the training data is “enough”: that it provides all the information needed to extract the desired intelligence and that all the context needed for learning is captured in these independent data samples. Moreover, in the vast majority of approaches, we lock down the system after it has been trained. Yet, we then typically assess the utility of the system by its ability to generalize to new situations in just the right way, even though it is, in essence, independent of those new situations.

Thus, let us instead posit, thirdly, that strongish AI is grounded. It learns in the context of a meaningfully complex world and from the situations it experiences, over time, in that world. Its learning is tied to those experiences and how they are experienced: context matters, order matters, relationships matter. As a result of being grounded in its world, an intelligence can model, deeply, the external world around it, thereby becoming capable of analyzing situations it is faced with and predicting potential outcomes of different courses of actions. This model, whatever form it may take, enables the intelligence to base its choices on the nuances of a current situation, and not just on simple pattern classification or context-insensitive analysis. Such a grounded intelligence is also likely to be highly efficient: it shouldn’t need to experience the entire world in order to ground itself in it. Rather, it should learn rapidly from key experiences that are intrinsically grounding and rich in context.

16.3.4 Posit #4: Intelligence Is Transferable

Embodied Intelligence research certainly seeks to address the issue of grounding an intelligence in a world, and it offers valuable insights and avenues to explore. Yet I suggest that embodying an intelligence is only one approach to grounding it to its world, and potentially fundamentally limiting on its own. For while intelligence may be grounded in a world, I suggest that we as a field require that the intelligence also be transferable to new worlds. In order for us, across our broad AI-related fields, to make dramatic strides in creating new intelligences that will address diverse real-world needs, we require the ability to take an intelligence trained in one domain and rapidly adapt it or enhance it to be effective in a different domain. So, while the intelligence must ground its internal models and processing to the world in which it was created, it must do so in a manner that allows it (or us) to rapidly adapt those models when faced with a new world with new context and challenges, or more


generally be used as part of a bootstrapping step that leverages the intelligence’s knowledge and capabilities to seed the new intelligence-creation process in the new world. The resulting new intelligence would in turn become grounded in the new world (or perhaps in both). What this means in practice is difficult to imagine since it is not a part of current research methodologies. Recently, the creation of large-scale foundation models has offered a new methodology wherein we can use the same trained model, with transfer learning, for diverse tasks across different research and development efforts and different domains [3]. There have also been recent efforts to create “unified” models that can perform multiple different tasks without requiring task-specialized components [14]. These are certainly steps in the direction of creating transferable intelligences, though perhaps not grounded ones. Likely, we need our intelligences to be composable—comprised of separable components that can be recombined in new ways to best meet the needs of the new world. Perhaps some intelligences are transformable—able to rapidly adapt or be adapted to new goals, new constraints and new data. Perhaps an individual agent is not our key unit of “intelligence” per se—rather it may be a complex set of individuals or interacting algorithms that collectively provide robustness to new environments. Perhaps the internal structure of an intelligence retains a degree of “plasticity” that enables it to adapt to new sensing modalities, new performance objectives, new sets of possible actions and new types of experiences.

16.3.5 Posit #5: Intelligence Is Intrinsically Self-reinforcing

Finally, let us posit that we must develop systems in which the process of creating intelligence is itself “self-reinforcing”. In other words, via experiencing and surviving the world, the process that leads to intelligence will reward elements of an entity that support intelligence. For example, if there is a combination of parts that results in intelligence, the process will, over time, reward the refinement of those parts and their combination into different configurations until, eventually, improved intelligence emerges.

Natural evolution finds a solution that works, not necessarily a solution that works intelligently. As mentioned before, only one solution of the type we are seeking has evolved in the natural world. Let us therefore claim that natural evolution is NOT self-reinforcing for intelligence (of the type we seek to create). If it were, the world would be filled with such intelligences. So, we need to explore methods beyond the limitations of natural evolution. We require self-reinforcing foundations for our research or else we are likely to fail due to low odds of creating an intelligence. For instance, without this property, even if we seed a system with intelligent components or precursors, a non-intelligent solution that can survive in the complex early portion of development may overtake the system and reduce the chances of an emerging intelligence to near zero. This applies within any automated process as well as across research efforts themselves. For example, the seductive capabilities of large language models (e.g. in-context


learning) such as ChatGPT will likely capture a significant portion of the human (and financial) capital across the world for the foreseeable future, despite the approach (of predicting the next words without understanding their meaning) likely not being an approach to general artificial intelligence (though it is a fabulous solution…). What else might we consider about such a self-reinforcing intelligence creation process? Likely, it implies that “learning” in intelligence is a continual process based on the particular sequences of situations experienced in the world, and in response to changes in the world itself over time. Learning doesn’t have a start and an end, per se. An intelligence learns—always (at least in principle). The process that leads to intelligence always continues to operate, especially in light of a changing world and corresponding changing needs. Moreover, the process likely encapsulates everything at both the micro and macro levels across all time horizons. It encompasses the evolution of species, the growth of individuals over time, the learning induced by a particular situation, the creation of supporting (and shaping) constructs such as families, tribes, cultures and societies. It encapsulates the teaching of a parent or instructor, the experiences over time of living in the world, the changes over time in how individuals are rewarded and much more. In a process of the type we seek to create, intelligence begets intelligence.

16.4 What Do We Need?

The remainder of this paper explores four key needs, implied by our five key posits above, that our research must likely address. These needs in turn inform and identify new research areas and new research methodologies. Specifically, we need to:

1. Create complex worlds within which intelligences will grow or evolve. A world must comprise all the context that will be needed to acquire the skills desired of the AI, sufficient complexity to require strongish intelligence, as well as sufficiently nuanced (interactive) experiences and consequences to induce the development of strongish intelligence.
2. Create drivers within and across worlds that will explicitly define and provide purpose to our systems and induce them to develop capabilities that address situations intelligently and to ground themselves appropriately to their world(s).
3. Create (eventually) intelligent entities that internally build models of the world and how to behave in it that are explicitly grounded to their experiences in the world yet transferable (in part or whole) to new contexts.
4. Create elements in the world and/or within the process trying to create the intelligence that will reward, manipulate and otherwise impact or direct the process of creating intelligence itself so that it is self-reinforcing.

The following sections discuss each of these in turn and identify potential research elements involved. As a beginning notation for discussing the suggested reframing, let us use the tuple $[\{W\}, \{D\}, \{M\}, S]$ to represent what is needed to create


intelligence, where $\{W\}$ = the world(s) within which the intelligence is grounded, $\{D\}$ = the drivers for intelligence, $\{M\}$ = the grounded models of understanding, and $S$ = the process of intelligence self-reinforcement.

16.4.1 A Caveat: Intelligence == Process and/or Intelligence == Capabilities and/or Intelligence == Individual(s)

A key goal of this paper is to re-frame thinking from creating solutions to creating intelligence. However, that raises the question of what an intelligence is. In this paper, I do not seek to clearly define it. Rather, I intentionally open the door to a wide, multi-level interpretation of intelligence. The viewpoint that a particular “trained” individual is an “intelligence” is one valid view, yet it is one that is very analogous to a “solutions-focus”. It ignores the process that led to that final individual, ignores all interim versions, demotes the importance of any internal capabilities or components and ignores the context of any other individuals that may influence its behaviour. So, I suggest that we move away from a default single-entity view. Let us instead consider that anything that can be characterized or measured as exhibiting a capability associated with intelligence is potentially an intelligence. An intelligence is a process, a set of adaptive capabilities, a set of rules, a specific application of certain rules, a specific set of values following a set of rules and so on. Let us avoid premature convergence of our research fields due to assuming a particular fixed interpretation of what an intelligence is. For example, if we create the general ability to combine different intelligences that use different techniques, then it may very well be that multiple different processes are running at different time scales to achieve that merged result. In the sections that follow, the word intelligence may be used seemingly at different levels, often an “individual”, but sometimes a component capability, sometimes the process and sometimes “all-of-the-above”. As much as possible, I suggest we treat these as interchangeable for now, since a true artificial general intelligence may very well be inseparable from the process(es) that created it and that it is a part of. We simply do not know enough yet.

16.4.2 The World

In most existing evolutionary and machine learning systems, the “world” is the set of data used to train and test the system. The “world” is thus defined by the range of possible values and the actual set of values used in training the system. These data carefully capture the problem to be solved. Random presentation of the data is intended to remove potential biases due to order encountered. Data is carefully segregated to ensure there is no leakage. Data is cleaned and organized to minimize noise and maximize the chances of successful learning. In other words, modern


systems are trained in an artificial, created utopia (dystopia?). The inhabitants of our “world” are given a carefully curated, limited experience, subject to high censorship and strict expectations, and then we hope they can be useful in the real world and behave only within expected constraints. This is less true in the latest large-scale efforts that train on a huge variety of large non-pre-processed data. However, almost ubiquitously, these inhabitants have no control over the data they are exposed to. They are not agents in their own growth and evolution. In other systems, such as embodied intelligence approaches, the “world” is a simulated or real environment within which the system navigates, interacts and learns on its own to achieve goals, such as scene navigation, change detection or rearrangement [6] or with human guidance, such as with imitation learning [1, 19]. Such approaches support the creation of solutions that perform well on the desired tasks, especially with large-scale training on massive data sets [6], and many offer valuable insights into methods for enabling experience-based learning for AI. But, again, these systems are largely focused on learning to perform well on specific tasks, not upon creating intelligence per se. What do we really need, then, when creating a world to support the creation of intelligence? That is the question. That is research. However, I suggest that it likely requires things such as “depth”/“richness” and “variability”/“messiness”, and also likely requires mechanisms that enable an intelligence to experience “consequences”.

16.4.2.1 Depth and Variability

The world should not always be neat with clear answers to clear situations. It should be rife with contradictions, inconsistencies, and non-sequiturs, as well as being filled with elements that are stable, interrelated and predictable. Multiple reasons may explain a particular outcome, multiple conflicting interpretations (or truths) may apply to a situation, multiple courses of action may lead to similar outcomes and similar courses of action may lead to very different outcomes. Why do we need such a world? Because a world that is not deep and rich does not need intelligence in order to “survive”. Because a world that is highly predictable and not variable does not need intelligence in order to “survive”. Non-intelligent (or weakly intelligent) solutions, complex as they may be, suffice in those situations. Such depth and variability are difficult or perhaps impossible to capture in the “data” within a traditional methodology. We can imagine the inclusion of a great number of examples of which “inputs” and “actions” lead to which “outputs”, with a good data set capturing some of the needed depth and variability. But this still presupposes that the world and the act of interacting with it can be “captured in data” in the first place. Something critical is missing—the fact that a true intelligence, once deployed, must interact with that world as the world itself changes and it must anticipate and deal with the consequence of its actions, where those consequences in turn may vary depending on the situation, over time or even stochastically. Researchers in embodied intelligence do attempt to address this issue directly.


Many embodied AI researchers believe that there are genuine discoveries to be made about the properties of intelligence needed to handle real-world environments that can only be made by attempting to solve problems in environments that are as close to the real world as is feasible at this time. [6, p. 5].

They often define their “world model” $W$ as a function that, given the state of the environment $s_t$ at time $t$ and an agent action $a$, produces a prediction $W(s_t, a) = s_{t+1}$ of the state of the world at time $t + 1$ if the agent were to take action $a$.
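As a minimal sketch of such a world-model function, assuming a toy one-dimensional track (the `State` class, the `world_model` name and the track itself are illustrative assumptions and are not taken from [6]):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical environment state: the agent's position on a 1-D track."""
    position: int
    track_length: int = 5


def world_model(state: State, action: int) -> State:
    """W(s_t, a): predict s_{t+1} if the agent takes `action` (-1, 0 or +1)."""
    if action not in (-1, 0, 1):
        raise ValueError("action must be -1, 0 or +1")
    # Clamp to the track so the prediction is always a valid state.
    new_pos = min(max(state.position + action, 0), state.track_length - 1)
    return State(position=new_pos, track_length=state.track_length)


s0 = State(position=0)
s1 = world_model(s0, +1)   # move right
s2 = world_model(s1, -1)   # move back left
```

The function is a pure state-to-state mapping: it says what changes after one action, and nothing about the depth, variability or consequences of the world.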

While having changes of state based on actions taken is a key part of a world needed to induce intelligence, I suggest that it is not sufficient. Much as evolution within our real-world does not tend to produce intelligence of the type we are seeking, simply embodying an AI within a world focused on the changes resulting from individual actions is not likely to produce such intelligence. This is because the depth and variability of a world that requires intelligence are not explicitly captured.

16.4.2.2 Consequences

A grounded intelligence must be able to relate its knowledge to its experience in the world. A key aspect of that experience is that of consequence: without consequences, all actions are of no true or distinguishing importance as far as the intelligence itself is concerned. It could be claimed that the ultimate sign of intelligence is the active minimization of (all) consequences. However, consequence exists across different scales and time frames. There is the immediate consequence to an individual of an adverse event leading to an immediate failure. There is the indirect consequence of a sequence of seemingly unrelated actions leading to a particular favourable or unfavourable outcome that was not intended by those actions. There is the delayed consequence where certain actions taken earlier lead to a consequence later on regardless of intervening actions that were taken in the meantime. There is the long-term consequence of a set of similar or repeated actions leading to a particular conclusion, where that conclusion may be approached in a continual fashion (e.g. bad diet = bad health) or abruptly (e.g. risky sports = sudden death). There is the consequence that gets worse over time or gets better over time. Practice makes perfect. Familiarity breeds contempt. The same actions undertaken in essentially the same situations do not always have the same outcome.

Why is it important for us to capture complex consequences in our worlds? Because dealing with this requires intelligence. Without it, “weak AI” solutions can suffice. Consequence is perhaps only meaningful in a system that has intent and only has value when it impacts decisions made and leads to new learnings. A trained classifier never experiences any consequences for its good or poor responses. At worst, we humans stop using it, but the AI itself did not experience the consequence. An AI of any sort that does not have continual learning simply has no way to learn from the consequences of its actions. A new round of human-guided training is not consequence, just more training. A robotic system that navigates a real-world environment may suffer damage from a poor choice leading to a fall, for example.


However, while the robot becomes damaged, the intelligence itself does not truly experience consequences unless there is some learning involved or some impact on its future intent. Embodied intelligences are perhaps starting points for AIs that experience consequence since they learn continually and via reinforcement. But we want our intelligences to be able to change what they choose to learn or how they choose to learn it in order to achieve or avoid certain consequences; we want them to meaningfully act to avoid anticipated consequences; we want them to “learn from their mistakes” and truly understand “cause and effect” [16]. In order to support all this, our world must explicitly provide consequences for the intelligence to experience. Perhaps better than anything else, these will help ground our intelligence.
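To make the distinction between immediate and delayed consequences concrete, here is a small sketch of a world that schedules consequences for later delivery. Everything in it (the `ConsequenceWorld` class, the action names, the numeric values) is a hypothetical illustration, not a mechanism proposed in this chapter.

```python
from collections import defaultdict


class ConsequenceWorld:
    """Toy world (hypothetical) that delivers consequences now or later."""

    def __init__(self):
        self.time = 0
        # Maps a due time step to the consequences that arrive at that step.
        self.pending = defaultdict(list)

    def act(self, action, immediate=0.0, delayed=None):
        """Take an action; `delayed` is an optional (delay_steps, value) pair."""
        if delayed is not None:
            delay, value = delayed
            self.pending[self.time + delay].append((action, value))
        # Consequences felt at this step: the immediate one (if any) plus any
        # earlier delayed ones that fall due, whatever the current action is.
        felt = [("now", immediate)] if immediate else []
        felt += [(f"delayed:{a}", v) for a, v in self.pending.pop(self.time, [])]
        self.time += 1
        return felt


world = ConsequenceWorld()
log = [
    world.act("eat_junk", immediate=1.0, delayed=(2, -5.0)),
    world.act("rest"),
    world.act("exercise", immediate=-0.5),
]
```

The penalty scheduled by the first action arrives two steps later, during an unrelated action, so an agent can only avoid it by relating the later outcome to the earlier choice.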

16.4.2.3 World Notation

We can define a notation to try to broadly capture how we may define worlds that support the creation of intelligence. Let’s say that a world $W$ that enables intelligence is comprised of a set of elements $E$ and experiences $X$, as well as dependencies between experiences, denoted as $W = [\{E\}, \{X\}, \{X_i \rightarrow_d X_j\}]$, where $X_i \rightarrow_d X_j$ denotes that an experience $X_j$ that occurred later in time wouldn’t have happened without the occurrence of the earlier experience $X_i$ due to the dependency relationship $d$. An element of the world $E$ is something that can either be perceived or manipulated by an individual, or that can interact with or happen to an individual. Thus, individuals become elements of the world, in addition to the objects and events of the world. Elements have state, and that state may change during an experience. Dependencies can be simple temporal ones (e.g. occurred-before, occurred at same time) or more specific (e.g. directly caused, influenced by), or a given experience may not have any dependencies at all. An experience $X_i = [I_1 \rightarrow \ldots \rightarrow I_n]$ involves a sequence of interactions $I_1 \rightarrow \ldots \rightarrow I_n$, where a given interaction $I_k$ is a non-empty set of (concurrent) actions $I_k = \{A\}$. A given $A_m$ is the action taken by a subset of the elements $E_m$ of the world and their consequence $C_m$ upon themselves ($E_m$) or other elements ($E_n$) of the world: $I_k = \{A_m(E_m) \rightarrow C_m\langle E_m, E_n\rangle\}$. The individual(s) forming our artificial intelligence may be part of $E_m$, $E_n$ or neither. If involved, the individual(s) may be passive or active within an interaction. A passive individual is in the role of an observer. An active individual contributes their own action(s) to the interaction. Each element can specify full, partial or no current state. An action in general should be specific and observable, but it may also encompass the lack of action, an action avoided, or a series of uninterrupted sub-actions forming a single higher level action. The consequence $C_m$ can represent the complete or partial change of state of any of the elements $E_m$, $E_n$ affected in the interaction. This notation allows us to capture a broad spectrum of approaches to our “data”. At one end, if each piece of data is considered as independent of all others as well as independent of any interactions with the intelligence (as per a traditional data-driven approach), then $W$ reduces to $\{X\}$ and each experience reduces to a single interaction comprising a single action with a passive individual $E$ (i.e. a single input-output pair:


$X = I = [A(E) \rightarrow C(E)]$). At the other end, if the world is fully simulated, then each experience comprises the full sequence of actions taken by and/or experienced by the intelligence in the simulated world during a particular contiguous training series of events. These may entail scripted events that indicate the beginning and end of an experience, they may involve events that occurred within a certain location, or they may encompass all events within a simulation run. If the simulated world is reset between runs, then there are no dependencies between experiences across runs. If the world is not reset, then each experience is (potentially) dependent on all others that occurred before it during other runs via an “occurred-before” dependency. More generally, though, we may define arbitrary sequences of interactions as we see fit. We may capture “canned” variations that can be trained in a large-scale fashion. We may capture interactive experiences that can be re-experienced multiple times, potentially with small, randomized variations. We may capture a variety of known dependencies among elements. Using such notation, we may also compare worlds in terms of how they overlap or vary in their elements, experiences, actions and consequences. For example, a “gentler” world may exhibit less-extreme consequences for the same actions involving the same elements. A “richer” world may exhibit more elements, more actions, and more variety in consequences. A “deep” world may exhibit longer interaction sequences. A “messy” world may include many similar sequences of interactions with conflicting consequences. An intelligence that performs mental rehearsal or experimentation to solidify knowledge can compute random variations of a single “true” experience it had in the world. A world targeting more complex intelligence may include more challenging dependencies between experiences (e.g. highly separated in time, chained dependencies, etc.). 
This notation encompasses both dynamic, interactive data and static, fixed data. It supports orchestrated sequences of experiences selected to foster certain kinds of learning (e.g. much in the way that a parent may limit and guide the experiences of a child and slowly allow those experiences to grow in complexity as the child grows in capability). It supports freely interactive, self-guided data collection and experimentation. It also supports immersive learning in context with a large number of potential interactions.
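The world notation above can be sketched as a small set of data structures. This is one possible encoding under stated assumptions: the class and function names (`Action`, `Experience`, `World`, `from_dataset`, `depth`) are hypothetical, elements are represented only by name, and "depth" is taken to mean the longest interaction sequence.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """A_m(E_m) -> C_m<E_m, E_n>: who acted, and the resulting state changes."""
    name: str
    actors: frozenset            # E_m: names of the elements taking the action
    consequence: tuple = ()      # C_m: (element_name, new_state) pairs


@dataclass
class Experience:
    """X = [I_1 -> ... -> I_n]; each interaction I_k is a non-empty set of actions."""
    name: str
    interactions: list           # list of frozensets of Action


@dataclass
class World:
    """W = [{E}, {X}, {X_i ->_d X_j}]."""
    elements: set = field(default_factory=set)
    experiences: dict = field(default_factory=dict)
    dependencies: set = field(default_factory=set)   # (earlier, d, later)

    def add_experience(self, exp, depends_on=()):
        if any(len(interaction) == 0 for interaction in exp.interactions):
            raise ValueError("an interaction must contain at least one action")
        for earlier, _kind in depends_on:
            if earlier not in self.experiences:
                raise KeyError(f"unknown earlier experience: {earlier}")
        self.experiences[exp.name] = exp
        for earlier, kind in depends_on:
            self.dependencies.add((earlier, kind, exp.name))


def from_dataset(pairs):
    """Traditional data: each input-output pair becomes X = I = [A(E) -> C(E)]."""
    world = World(elements={"E"})
    for i, (x, y) in enumerate(pairs):
        act = Action(f"present:{x}", frozenset({"E"}), (("E", y),))
        world.add_experience(Experience(f"X{i}", [frozenset({act})]))
    return world


def depth(world):
    """One possible 'depth' measure: the longest interaction sequence."""
    return max((len(x.interactions) for x in world.experiences.values()), default=0)


static = from_dataset([(0, "cat"), (1, "dog")])

sim = World(elements={"agent", "door"})
push = Action("push", frozenset({"agent"}), (("door", "open"),))
walk = Action("walk_through", frozenset({"agent"}), (("agent", "inside"),))
sim.add_experience(Experience("run1", [frozenset({push}), frozenset({walk})]))
sim.add_experience(Experience("run2", [frozenset({walk})]),
                   depends_on=[("run1", "occurred-before")])
```

Here `static` is the degenerate data-driven case (independent single-action experiences with no dependencies), while `sim` is a non-reset simulated world in which the second run depends on the first.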


16.4.3 The Drivers

The flip side of consequence is purpose. The ultimate purpose of an intelligence could be said to be the maximization of positive consequences and the minimization of negative ones. More generally, the purpose of an intelligence should be to meet the expectations we have of it. But what should those expectations be if we wish to create strongish AI? In a solutions-focused approach, the purpose of a system is usually to perform the objective task as well as possible. The training process embeds that purpose in the evaluation functions used to train and tune the system. Once deployed, that purpose


is, generally, implicitly pursued in the attempt to provide the best performance. In many typical solutions, the purpose is fairly simple and straightforward, often seeking to determine the best response to specific demands made by a user. In some solutions, such as controlling dynamical systems or driving a car, the purpose is more complex, nuanced and perhaps multi-objective in nature. Specific target states should be achieved and/or maintained, and specific negative outcomes avoided. In an intelligence-focused approach, however, the role of purpose is much different. It is more “existential” in nature. The purpose of an intelligence should be to perform intelligently. The purpose of the process we are using to create the intelligence must be to foster the development of increasingly intelligent capabilities. We seek to create intelligent systems that will be able to understand, reason, learn, plan, be creative and solve problems to achieve goals. Thus, our approach must strive to create systems that do those things, at all times. It is not enough for an AI to accomplish a task objective. It must also be able to detect when it should change its efforts towards achieving a different objective, when a new situation warrants additional tasks to be undertaken, and/or when situations occur that are beyond its capabilities and it requires assistance or guidance. It must be able to pursue multiple competing objectives, identify new goals and develop new ways of solving problems. As a field, we need to create novel methods for “driving” our systems to be more and more intelligent. To not just perform better, but to perform in increasingly intelligent ways. As EC researchers, we are used to thinking of fitness functions that guide our evolutionary processes to creating solutions that express our desired capabilities by maximizing fitness. 
As NN researchers, we are used to learning functions that seek to incrementally move our structures towards better-performing solutions. However, an intelligence creation process is not as simple as maximizing a fitness function; it is not as simple as minimizing mean squared error. It involves being useful, it involves growth and survival, it involves learning from experience and past choices made. It involves being aware of how and why the intelligence behaved as it did and learning better, more intelligent ways to behave in the future. It involves maximizing an intelligence function.

16.4.3.1 Intelligence Function

So, what is this intelligence function? That is the question. That is research. But we can imagine some of the things it may include. We need to reward our AIs based not only on whether they survived or on how “well” they did but also on “how” they tried to solve the problem and/or whether they tried to survive in a “desired way”. Perhaps a reward based on whether they used the same consistent methods or explored different approaches, depending on the context. Perhaps an assessment of whether they applied what they learned from one problem to try to solve another. And, what about rewarding intelligent failure? Thus, we likely need nuanced intelligence functions that penalize rote behaviours while attempting to measure and reward the behaviours we wish our AI to learn (e.g. analytic thinking, generalization, reasoning, planning, understanding, etc.). Functions that reward the seeds of intelligence.


But we don’t yet understand ourselves how to measure those things properly. For example, what is the math for assessing a nascent AI that does poorly on everything, but applies cross-domain, generalized learning while failing? I suggest that we need some “broad” measures of how broadly an AI was able to apply a particular type of intelligent behaviour (regardless of how “bad” the overall performance was). We also need some “narrow” measures that reward an AI for applying particular intelligent behaviours on particular sub-tasks. What is the math for assessing an AI that solves a particular sub-task in an extremely “intelligent” way, but fails miserably on the overall task and all other sub-tasks? What is the math for assessing an AI that makes a detailed analysis showing great situational understanding, but chooses to prioritize an inappropriate objective? This is a rich area for exploration. We need not only to identify the different types of intelligent behaviours to assess but also to develop the associated measures and learn when and how to apply them to support the creation of intelligence. Since we don’t understand intelligence yet, we likely may not initially understand which behaviours a nascent intelligence should learn first, and moreover there likely are many paths to growing more and more intelligent.
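As a purely illustrative sketch of what “broad” and “narrow” measures might look like, assume that some other mechanism has already labelled which behaviours were observed on each attempt (detecting those behaviours is exactly the open research problem, so the labels here are made up). The function names and record layout are hypothetical.

```python
def broad_measure(records, behaviour):
    """Fraction of distinct tasks on which `behaviour` was observed, ignoring score."""
    tasks = {r["task"] for r in records}
    if not tasks:
        return 0.0
    hits = {r["task"] for r in records if behaviour in r["behaviours"]}
    return len(hits) / len(tasks)


def narrow_measure(records, behaviour, task):
    """Fraction of attempts at one `task` on which `behaviour` was observed."""
    attempts = [r for r in records if r["task"] == task]
    if not attempts:
        return 0.0
    return sum(behaviour in r["behaviours"] for r in attempts) / len(attempts)


# Made-up attempt log: low scores, but transfer is applied across two tasks.
records = [
    {"task": "maze", "behaviours": {"transfer"}, "score": 0.1},
    {"task": "sorting", "behaviours": {"transfer", "planning"}, "score": 0.0},
    {"task": "arithmetic", "behaviours": set(), "score": 0.9},
    {"task": "maze", "behaviours": set(), "score": 0.2},
]

breadth_of_transfer = broad_measure(records, "transfer")
transfer_on_maze = narrow_measure(records, "transfer", "maze")
```

Neither measure looks at `score`, so an AI that fails everywhere while applying transfer broadly would still be rewarded along that dimension.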

16.4.3.2 Embodied Drivers

In the natural world, many of the drivers that perhaps led to intelligence are based on our biological needs. We are embodied intelligences. These needs require us (individually, in the aggregate, across time) to solve problems, perhaps in novel ways, to address the challenges of achieving those needs, day after day, year after year, generation after generation, species after species. What of an AI? At the moment, current AIs do not generally make decisions based on any scarcity of resource that influences the decision itself. Examples of such decisions would include determining what analysis should be made if data has a cost to the AI, letting the energy needed to perform computations impact the types and scale of computations taken by an AI, accounting for the energy required by a mobile AI in any actions taken, and so on. Embodied drivers that we build into a system intended to create AI can in principle be anything. They may be tangential to the actual problems we want the AI to focus on or they may be integral to those problems. They may be subtle and nuanced or they may be apparent and significant. They may in fact be directly related to the real-world needs of the “body” of an AI and reflect physical limitations of the AI (e.g. a mobile robot). They may be abstract “needs” created to foster the growth of intelligence, that is, to foster the capability to take into account demanding, persistent needs that interact with a changing, dynamic, complex world. A tension exists between an embodied driver and an objective driver. Any needs that are not directly related to the objective function necessarily consume resources and influence decisions in a manner that may reduce performance on an objective function. But, what of our intelligence function? The tension induced by our intelligence function is between the internal approach of the AI and the overall approach across the process of creating our intelligence. The most intelligent approaches for


a given individual, the best models of understanding it can develop, are simply grist for the mill within the larger process of creating ever more intelligent systems.

16.4.3.3 Driver Notation

We can define a notation to facilitate a discussion of drivers that foster the creation of intelligence. Let’s say that the self-reinforcing process $S$ used to create the intelligence, as applied to a particular world $W$, contains a set of drivers $\{D\}$ and one or more methods $\{O\}$ for resolving the competing objectives represented by those drivers if needed: $S\langle W\rangle = [\{D\}, \{O\}]$. A given driver may be applied to a particular set of (or all) possible experiences in the world, $D\langle\{X\}\rangle$. A given driver comprises a purpose element $P$, and one or more assessment functions: an intelligence function $Q$, a feedback function $F$, an embodied function $B$ and/or an evaluation function $V$: $D_i\langle\{X\}\rangle = [P_i, Q_i, F_i, B_i, V_i]$. Each of the assessment functions is a function of a set of intelligence models and a subset of (or all) experiences, $f(\{M\}, \{X'\} \subseteq \{X\})$. In other words, each assessment in principle has access to any of the information on the actions taken in the world, the consequences (outcomes) experienced, and internal processing of the intelligence(s). This enables a driver to apply to a set of independent models, to encompass intelligences comprised of sub-models, multiple collaborating intelligences, competitive assessment across intelligences and so on. A purpose is defined in terms of target relative achievement across all the assessment functions of the driver in the current experience context: $P = f(Q, F, B, V)$. For that set of experiences, the goal of the intelligence can be said to be achieving the purpose $P$. This could be, for example, achieving a certain minimum level of assessment for each function, achieving a certain level of assessment for at least one function, maximizing achievement on all functions across all models, increasing performance on at least two functions over the course of the experiences, and so on. Moreover, we may consider the set of experiences as very specific (i.e. these particular actual experiences), nuanced (e.g. a set of notional experiences that obey some characteristics, such as experiences with short interaction sequences), or very general (e.g. all experiences). The notation here is intended to be suggestive rather than prescriptive since there are many directions research could lead. Since there are multiple drivers possible in our process, and different drivers may apply to different or overlapping sets of experiences, this can produce a multiobjective effect such that in any given experience, an intelligence may have competing objectives influencing its choices. We may also define drivers that apply at different levels of our process. Thus, we encompass functions used by an individual intelligence to influence their actions and their learning, as well as functions to be applied across generations of individuals in an evolutionary search. An intelligence function $Q$ performs an assessment of the quality of an “intelligence capability” of the model(s) used by the intelligence in the current experience context. The assessment should evaluate some aspect of “how intelligently did I perform in this intelligence dimension?”, with the goal of driving the system to increased intelligence along that dimension. Perhaps this may include assessments


such as the diversity of responses to the experiences, severity of poor consequences suffered, ability to reduce severity of consequences over the course of a series of experiences (as determined by their temporal dependencies), length of interactions achieved within the experiences, time or depth of processing, nuanced nature of response and so on. A “broad” function may encompass a large number of varied experiences, a “narrow” function may encompass a small number of similar experiences. A feedback function . F defines the logic used to assess the quality of an intelligence’s model and performance. This may not only translate to elements of a learning function within a model but also include feedback that may be given to an intelligence on how it may improve. These may include identifying parts of the model that may need to be updated, An embodied function . B defines the logic used to assess how well an intelligence achieves its physical and/or logical needs. These may include measures of maintaining/maximizing/minimizing a certain energy state, cost of resources used, cost of consequences, processing time used and so on. Finally, an evaluation function .V defines the logic that assesses relative performance on the objective task or sub-task. This may encompass a standard objective function or fitness function for overall performance. However, since a function may potentially apply to a sub-model, it may include assessment of a component capability. Since it may apply to multiple models, it may include assessment of ensemble performance. The drivers may be used at several levels of the process for creating an intelligence—at the level of assessing part of an intelligence, at the level of a single intelligence learning from its world, at the level of multiple intelligences collaborating within the world, at the level of evaluating multiple intelligences to determine relative goodness, and at an aggregate level across a process. 
In keeping with our goal of creating grounded intelligences, these drivers can play a key role in that grounding. Not only can an intelligence learn to map a particular understanding to a particular set of experiences, but it can also learn the context of the drivers that informed the choices made by the intelligence in those experiences. The process may apply one or more method(s) $O$ to resolve conflicts between multiple drivers, such as randomly accepting one, performing a weighted combination, using a voting mechanism and so on. Depending on the level of the process at which the driver is being applied, the effect of applying a method $O$ may range from affecting actual decisions made by an intelligence to helping evolve solutions on a Pareto-optimal front.

Overall, the intent of this notation is to identify the wide range of drivers that can be incorporated into a process for creating intelligence. The notation encompasses most current approaches, with many focused on a single driver $D\langle\{all-X\}\rangle = [P: \text{maximize-}V,\ V: \text{task-specific-objective-function}]$. A typical multi-objective approach defines several drivers $D_i\langle\{all-X\}\rangle = [P: \text{maximize-}V_i,\ V_i]$ with $O = \text{achieve-pareto-optimality}$. But we can also represent an embodied intelligence that uses human feedback on its real-world actions: $D\langle\{all-X\}\rangle = [P: \text{maximize-}Q\text{-before-}B\text{-before-}F\text{-before-}V,\ Q: \text{maximize-explanation-clarity},\ F: \text{positive-or-negative-reinforcement},\ B: \text{minimize-processing-energy},\ V: \text{objective-function}]$. Such a solution may always try to provide the best explanation possible even at the expense of failing due to running out of energy and being told it did poorly.
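The conflict-resolution methods $O$ named above (randomly accepting one driver, a weighted combination, voting) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the chapter does not prescribe an implementation, and the driver scores, action names, weights and function names below are all hypothetical. A strict priority ordering, in the spirit of the maximize-Q-before-B-before-F-before-V purpose, is included as a fourth option.

```python
import random
from typing import Callable, Dict, List

# Assumption: a driver is modelled as a function scoring a candidate action in [0, 1].
Driver = Callable[[str], float]


def resolve_random(drivers: Dict[str, Driver], actions: List[str],
                   rng: random.Random) -> str:
    """Randomly accept one driver and take the action it scores highest."""
    chosen = rng.choice(sorted(drivers))
    return max(actions, key=drivers[chosen])


def resolve_weighted(drivers: Dict[str, Driver], actions: List[str],
                     weights: Dict[str, float]) -> str:
    """Take the action with the highest weighted sum of driver scores."""
    return max(actions,
               key=lambda a: sum(weights[n] * d(a) for n, d in drivers.items()))


def resolve_vote(drivers: Dict[str, Driver], actions: List[str]) -> str:
    """Each driver votes for its favourite action; ties go to the earlier action."""
    votes = {a: 0 for a in actions}
    for d in drivers.values():
        votes[max(actions, key=d)] += 1
    return max(actions, key=lambda a: votes[a])


def resolve_priority(drivers: Dict[str, Driver], actions: List[str],
                     order: List[str]) -> str:
    """Compare actions on the first driver in `order`, breaking ties with the next."""
    return max(actions, key=lambda a: tuple(drivers[n](a) for n in order))


# Hypothetical scores for three candidate actions under three drivers.
SCORES = {
    "Q": {"explain": 0.9, "rest": 0.1, "act": 0.4},  # explanation clarity
    "B": {"explain": 0.2, "rest": 0.9, "act": 0.5},  # energy saved
    "V": {"explain": 0.3, "rest": 0.0, "act": 0.8},  # task objective
}
DRIVERS: Dict[str, Driver] = {name: table.__getitem__ for name, table in SCORES.items()}
ACTIONS = ["explain", "rest", "act"]
```

With these made-up numbers, an equally weighted combination selects "act", while the strict Q-first ordering selects "explain" despite its poor energy score, which mirrors the behaviour described for the embodied example above.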

16.4.4 Models of Understanding

Next, we come back to the intelligence itself. What does the intelligence itself do? How is it an agent in its own evolution? Why does it do what it does and to what end? If our goal is to create intelligences that truly understand the world they experience, our intelligences must ground all their behaviour in terms of their experiences. And, recall that we want to create intelligences that are also transferable. So, let us consider our intelligences to be comprised of models of understanding where a given intelligence may comprise a single model, or multiple sub-models.

16.4.4.1 Grounded Models

What does it mean for an intelligence to be grounded in experience? Generally, we recognize that any action performed by an intelligence occurs within some context related to its purpose. This context may be exploration, curiosity, or experimentation. It may be in service of a particular role or in service of multiple roles. A choice of action may be at the expense of one objective but in service of another, e.g. due to competing objectives due to different roles. Additionally, we recognize that the intelligence has developed within the context of the experiences that it has had. Let us say that an intelligent system “remembers” what led to its knowledge, to some degree. The depth, fidelity and quality of that remembrance may vary in many ways, but the key point is that it is explicitly accessible. The experience brings the idea to mind, and the idea brings the experience to mind. Thus, all attempts to reason can be grounded in experience, and all attempts to generalize can likewise be grounded in experience. The questions of how to store these remembrances (directly, indirectly) and how to associate them with the system’s understanding and structure (directly, indirectly) are research questions. Somehow, though, the relationships must be made tangible and accessible to the intelligence process. In such a way, for example, an intelligent agent could give the actual references (or chain of references or chain of ideas) used for any given part of a response. So, a neural network that essentially breaks enormous numbers of experiences apart into distributed weights so that they cannot be individually reconstituted faces an intrinsic challenge—it is not grounded in a meaningful way. An EC algorithm in which an experience by an individual in an early generation led to a certain intelligent capability also faces a potentially intrinsic challenge—an individual in a later generation may not be able to reconstitute what led to that capability.

16.4.4.2 Transferable Models

What does it mean for a model acquired by one intelligence in one world to be transferable to a different intelligence in the same or different world or for the same intelligence to adapt that model to a different world? Generally, this may entail a variety of approaches, such as the following:

• In an experience-focused approach, an intelligent system provides a way to “map” experiences of the new world to experiences in the original world. Combined with grounded knowledge, this in turn allows the knowledge (or structure associated with the mapped experiences) to be transferred in whole or in part. How such a mapping may be defined or applied are research questions, but some form of approach, either intrinsic to the AI technique or complementary to it, is needed.
• In an external-knowledge approach, knowledge about the intelligence’s internal model(s) and their relationship to supporting experiences is identified to some level of fidelity and stored independently of the individual. This knowledge can then be provided to other intelligences to inform their model development. The form of this external knowledge may include explicit symbolic representations of knowledge, mappings of experiences to observed intelligent behaviours to emulate, and more.
• In an apprenticeship approach, a model in one context is used to create experiences for another intelligence (or for itself in a different context) that will support the creation of comparable intelligence capabilities.
• In a shared-components approach, the internal model of one intelligence, in whole or in part, is directly appropriated by another intelligence. For example, this could involve the literal copying of the weights and structure of a neural module.
• In a generic-intermediary model approach, a common representation and execution framework for different models is defined. A given model is translated to this generic form and then that is transferred to another intelligence. Such an approach would be quite intriguing and powerful if it could encompass highly diverse AI techniques. But even encompassing a few key techniques could have immense payoff in terms of worldwide collaboration across research efforts.
• Finally, in a “walk a mile in my shoes” approach, sequences of experiences that were used in the development of an intelligence are provided to the new intelligence in the attempt to recreate that learning outcome. We may term this a “formative transfer” approach that could enable the transfer of grounded intelligence between different researchers using different techniques. Formative transfer may be applied to replicate a prior intelligence’s full experience or key parts of its experience. Since the intelligence is grounded, key experiences that led to new understandings may be easier to identify explicitly.

16.4.4.3 Model Notation

There are many, many different AI techniques that exist, and each necessarily creates and notates its own internal models in technique-specific ways. Rather than try to create a notation that encompasses all these specifics (a potentially intractable task), we instead define a general notation that allows us to describe any model in terms of its groundedness and transferability. Let us define a model $M(\{X\}, W) = [\{U\}, \{G\}, \{T\}]$ as having a set of understandings $\{U\}$ and a set of grounded understandings $\{G\}$ that have been formed from a set of experiences $\{X\}$ in a world $W$, as well as a set of transfer methods $\{T\}$. An understanding in general is formed from a (sub)set of experiences but may not be explicitly tied to them: $U = f(\{X'\}, W)$. We say that a grounded understanding $\{G_{\{X_i, D_j\}}\} \subseteq \{U\}$ is an understanding from $\{U\}$ that can be explicitly tied to a specific (sub)set of experiences $\{X_i\}$ and/or a particular set of drivers $\{D_j\}$. We say that a transfer method $T$ defines an approach enabling the transfer of a set of understandings and/or transformed experiences to a new model and/or new world: $T = \{(U_{W,M}, \{X\}) \rightarrow (U_{W',M'}, \{X'\})\}$. A model may include more than one transfer method, and a transfer method may be technique-specific and/or world-specific. With this type of notation, we seek to encourage explicit consideration of grounding and transferability. With further refinement, perhaps we may use such a notation to discuss the explicatory power of a grounded understanding. How similar must a new experience $X_i$ be to one or more of the $\{X\}$ of a grounded understanding $G$ in order to trigger the application of that understanding during execution? Perhaps we may compare different models in terms of coverage, e.g. what percentages of the understandings are grounded, what percentages of the understandings may be transferred, how broadly may models be transferred to other techniques and other worlds. Perhaps we may identify what percentage of formative experiences can be transformed effectively in another world.
Since a given intelligence may comprise multiple sub-models, this notation may enable us to identify what percentage of its capabilities may be readily transferred to other intelligences in the same world, or which sub-models may require more significant re-working if the intelligence itself is applied to a new world. The former may be particularly useful in an evolutionary++ process in which understandings are transferred between individuals in a population or across generations. The latter may be particularly useful in determining the long-term potential of an intelligence for furthering research at other labs.
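As one way to make the coverage comparison concrete, the following minimal sketch stores a model as a list of understandings and reports the percentage that are grounded and the percentage that are transferable. It assumes, purely for illustration, that an understanding counts as grounded when it carries an explicit tie to at least one experience or driver; the class names, field names and example data are hypothetical and not part of the chapter's notation.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Understanding:
    """One understanding U; the field names are illustrative assumptions."""
    name: str
    experiences: FrozenSet[str] = frozenset()  # explicit ties to experiences {X_i}
    drivers: FrozenSet[str] = frozenset()      # explicit ties to drivers {D_j}
    transferable: bool = False                 # covered by at least one method in {T}

    @property
    def grounded(self) -> bool:
        """Grounded here means explicitly tied to experiences and/or drivers."""
        return bool(self.experiences or self.drivers)


@dataclass
class Model:
    """M({X}, W) = [{U}, {G}, {T}], with {G} derived as the grounded subset of {U}."""
    understandings: List[Understanding] = field(default_factory=list)
    transfer_methods: List[str] = field(default_factory=list)

    def grounded_understandings(self) -> List[Understanding]:
        return [u for u in self.understandings if u.grounded]

    def coverage(self) -> Dict[str, float]:
        """Percentage of understandings that are grounded and that are transferable."""
        total = len(self.understandings)
        if total == 0:
            return {"grounded": 0.0, "transferable": 0.0}
        grounded = len(self.grounded_understandings())
        transferable = sum(u.transferable for u in self.understandings)
        return {"grounded": 100.0 * grounded / total,
                "transferable": 100.0 * transferable / total}


# Hypothetical model with four understandings, three of them grounded.
model = Model(
    understandings=[
        Understanding("hot-things-burn", experiences=frozenset({"x1", "x7"})),
        Understanding("ask-before-acting", drivers=frozenset({"F"}), transferable=True),
        Understanding("shortcut-heuristic"),
        Understanding("stack-large-on-bottom", experiences=frozenset({"x3"})),
    ],
    transfer_methods=["formative-transfer"],
)
```

Calling `model.coverage()` on this example reports 75% grounded and 25% transferable. Two models built with different techniques could be compared on these figures without exposing their technique-specific internals.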

16.4.5 Process of Intelligence Self-Reinforcement

Our current AIs spring full-formed from the crucible of the development and training process, generally directed by a single mind (i.e. the researcher) or a team of minds (i.e. the development team). These AIs are an end-product rather than a work in progress. They are, generally, useful only insofar as they solve problems properly. All intermediary versions are merely ephemeral forms, though perhaps useful to retain for their diversity in continuing to improve the overall solution. Reinforcement and feedback are useful, generally, only for small changes to internal structure and across many training cycles. A deep neural network, for example, will develop more and more effective internal representations as exposure and repetitions increase, but the process of learning, the interim structures, and any key learning milestones along the way have no intrinsic value; only the final structure does.

Consider an intelligence that performs very well and “intelligently” in one world $W_1$ but does not initially transfer well to a second world $W_2$. It may need to “forget” some of its understandings that are preventing it from learning well in the new world. It may need to “go-back-to-school” and re-learn old understandings in new ways. The process may need to identify that an understanding that seemed intelligent in $W_1$ was actually not intelligent and merely a fancy “solution” or “work-around”. The process may need to recognize that a number of formative experiences in $W_1$ actually conflict with how things work in $W_2$. Perhaps an entire component of intelligence is misleading or inappropriate in the new world. Perhaps a key driver in $W_1$ is missing in $W_2$, thereby invalidating several understandings. Perhaps an earlier incarnation of the intelligence with a simpler set of understandings would transfer better to $W_2$?

Or, consider an intelligence $AI_1$ that performs very intelligently in $W_1$, and an $AI_2$ that performs very intelligently in $W_2$. The $AI_2$ research team may wish to apply $AI_2$ to $W_1$ to explore whether it can bring added value. However, attempts to manually re-train $AI_2$ don’t work well enough. Perhaps $AI_1$ should be used directly as a teacher for $AI_2$? Perhaps $AI_1$ contains a component capability that $AI_2$ lacks, and that component, along with its grounded understandings, can be directly added to $AI_2$? Perhaps both $AI_1$ and $AI_2$ perform in a complementary fashion and subsets of their understandings should be combined into a third approach $AI_3$?

I suggest that if we as a field wish to create ever-more-intelligent systems, then we need to move away from an outcome-oriented process. To create a strongish AI, we need to allow for a fluid ebb and flow within our process.

16.4.5.1 Graduated Experiences

Let us consider the fitness function. In almost all evolutionary computation approaches, the fitness function represents our goal and is usually a fixed function that distinguishes between solutions. But this pre-supposes that we know in advance what a good solution should look like. It pre-supposes that what determines a good solution will remain unchanged over the course of our evolution. If trying to evolve a strongish AI, it further pre-supposes that we are able to assess intelligence with a fixed, definitive, mathematically straightforward function. This is very unlikely to work; we need something more nuanced, graduated and varying. For example, we may need our evolutionary process to behave more like a K-12 schooling process. Early generations may focus on creating simpler behaviours using simpler intelligence functions that reflect simpler expectations and simpler sequences of interactions with fewer elements and possible actions. As generations start to converge on sufficient capabilities for that level of expected intelligence, additional drivers, elements and actions may be added. This graduated process may continue until the experiences in the world become highly complex.

And what about cases where the problem itself changes frequently and in unexpected ways? New data, new priorities, new objectives, new realities, new capabilities all dramatically change what our desired intelligence may be able to do or learn. These may involve new experiences, new elements, new dependencies or more complicated interactions in our world. These in turn may require new drivers since a new type of intelligence capability may be required to address them and hence new measures to reward that capability may be needed. We need mechanisms that will reward different types, levels and components of intelligence as they appear, and not penalize insufficient performance at the wrong stage of development. Such mechanisms may be automated, may require human-provided feedback, and/or be part of an iterative refinement process.

16.4.5.2 Variable World Complexity

Just as many AI efforts involve significant attention to the data used, we will likely need to pay careful attention to the experiences we create. Unlike a more traditional approach where all data is considered, generally, of equal value and utility for training at any time, this will likely not be true for our worlds’ experiences. We must tune experiences so that they both require sufficient intelligence and also so that they don’t require too much intelligence. In other words, the world in which our intelligences are created must grow with them in complexity. It may even need to shrink in complexity in some dimensions as it grows in complexity in others to ensure success. This variable world complexity may seem non-intuitive, but I suggest that it may be essential for promoting the growth of intelligence. Let’s call this “phases” of world complexity. Let’s expand our world notation by introducing a phase factor: $W_p = [\{E\}_p, \{X\}_p, \{X_i \rightarrow_d X_j\}]$. Our world at a particular phase $p$ will comprise a particular set of elements $\{E\}_p$ and a particular set of experiences $\{X\}_p$ involving those elements. As we move from one (simpler) phase to another (more complex) phase, we may, for example, increase the set of elements, increase the number and variety of experiences, increase the number of interactions per experience, increase the number of elements involved in the interactions, increase the number and variety of dependencies, and/or increase the complexity of the dependencies. If done appropriately, I suggest that such a process should ensure that by the end of every phase, our solutions are as intelligent as possible, and more intelligent across all phases than if just the “final” phase had been used from the beginning.
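The phase notation can be read as a data structure that is advanced from one phase to the next. The sketch below is one possible reading, not the chapter's implementation: it assumes elements and experiences are identified by plain strings, that a dependency is a pair of experience names, and that an experience may be retired when a phase shrinks the world in one dimension. All names and the example world are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Dependency = Tuple[str, str]  # (X_i, X_j): experience X_j depends on X_i


@dataclass(frozen=True)
class WorldPhase:
    """W_p = [{E}_p, {X}_p, {X_i -> X_j}] for one phase p."""
    p: int
    elements: FrozenSet[str]
    experiences: FrozenSet[str]
    dependencies: FrozenSet[Dependency]


def next_phase(phase: WorldPhase,
               add_elements=frozenset(),
               add_experiences=frozenset(),
               add_dependencies=frozenset(),
               retire_experiences=frozenset()) -> WorldPhase:
    """Grow the world, optionally shrinking it by retiring some experiences."""
    experiences = (phase.experiences - retire_experiences) | add_experiences
    # Keep only dependencies whose two endpoints still exist in the new phase.
    dependencies = frozenset(
        (a, b) for (a, b) in phase.dependencies | add_dependencies
        if a in experiences and b in experiences
    )
    return WorldPhase(phase.p + 1, phase.elements | add_elements,
                      experiences, dependencies)


# Hypothetical three-phase world.
w0 = WorldPhase(0, frozenset({"block"}), frozenset({"push"}), frozenset())
w1 = next_phase(w0, add_elements={"ramp"}, add_experiences={"roll", "stack"},
                add_dependencies={("push", "stack")})
w2 = next_phase(w1, add_experiences={"balance"},
                add_dependencies={("stack", "balance")},
                retire_experiences={"push"})
```

In this example the second transition adds a harder experience while retiring the simplest one, so the world grows in one dimension and shrinks in another.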

16.4.5.3 Process Composability

Using a specific process (e.g. genetic programming (GP), back-propagation) and a tailored representation (domain-specific, task-specific) to create a special-purpose solution to a specific problem has a clear consequence—we are generally unable to use elements of a solution for one problem to form a solution for another problem.

To create strongish AI, I suggest we may require that the self-reinforcing processes used should themselves be composable: parts of a process or sub-process should be able to be shared across research efforts and explicitly combined in different ways to explore new processes. This may be a key factor in producing models that are highly transferable, since we would be sharing not only what we created but also how to use and adapt it further. What this means in practice is an open question, but it may offer many research avenues to explore.

16.4.5.4 Formative Feedback

If we wish our process to create intelligences that are self-reinforcing, then those intelligences likely need to be able to identify what parts of themselves are “more intelligent” and what parts are less so. In the real world, as soon as a child starts doing something unwise, a nearby adult is likely to provide immediate formative feedback—e.g. “Don’t do that! You’ll hurt yourself”. If a student writes a mid-term essay that has some issues, a good teacher will provide detailed feedback so that they can learn what they did wrong and improve their understanding for the next time. This type of detailed formative feedback on performance is generally missing from the processes we use to create intelligences—a typical objective function, for example, is classic summative feedback. Having multiple, nuanced drivers with a good intelligence function could provide some steps towards incorporating formative feedback. However, creating a good intelligence function may be particularly challenging research. Conversely, receiving a performance analysis of how “intelligently” the AI performed may be something that a human can do fairly easily. Sometimes when it comes to intelligence, “you know it when you see it”. Consider an approach involving human-based feedback to a grounded intelligence. If feedback on a particular set of experiences is particularly strong, perhaps only the understandings grounded in those experiences should be rewarded? Perhaps those should become building blocks for new intelligences? There are a variety of methods from how humans learn that may be incorporated into the process. A key tenet of embodied learning is the notion that embodied agents should learn in the same way as humans—encompassing learning guided by humans or agents as well as exploratory learning (e.g. enhanced simulations that permit exploration). It may also involve allowing an intelligence to solve problems more like a human, especially a human that is learning. 
For instance, rather than provide a single answer, the AI could provide multiple notional answers and seek feedback on which is best. Mechanisms for incorporating human feedback into our processes, particularly an EC process, could provide very valuable formative feedback with respect to our intelligence function. This could take the form of crowdsourcing-based EC [18], human contribution or refinement of an intelligence function $Q$, and/or human-provided reinforcement feedback playing the role of a feedback function $F$ in our drivers. In a complex EC process that provides graduated experiences and variable world complexity, humans could also provide guidance and inputs at key transition points to help the overall process “ratchet up” the challenge for the evolving AIs.

16.4.5.5 Process Notation

The processes that may be used to create AIs vary dramatically in their details across the field. How we train a neural network, how we develop symbolic AI and how we evolve solutions with GP have very little in common in their specifics. Yet, I suggest that there is a common way we might approach the process at a high level. Let’s extend our earlier notation for $S$, the process for creating self-reinforcing intelligence: $S\langle W\rangle = [\{D\}, \{O\}, \{J\}, \{H\}, \{L\}, \{Z\}]$. As before, it contains a set of drivers $\{D\}$ and methods $\{O\}$ for resolving competing objectives. Additionally, it comprises a set of methods $\{J\}$ for adjusting the detailed training process to increase the degree of intelligence required over time, a set of methods $\{H\}$ for obtaining high-level feedback on intelligence characteristics exhibited, a set of methods $\{L\}$ for enabling intelligences to learn in a manner that reinforces their intelligence, and a set of methods $\{Z\}$ for ensuring that intelligences are meaningfully reusable. An adjusting method $J$ may direct changes in any of the other elements of $S$ as well as in the world $W$ to which $S$ is being applied. For instance, it may increase the set of drivers over time, or increase the severity of the consequences in the world, change the set of experiences used in the world, allow for increased or decreased feedback or guidance from humans at different phases of the process, and/or change the methods being used for distilling new reusable intelligence components. $\{J\}$ may include scripted, static methods that always proceed in the same manner (e.g. double the number of experiences every 100 generations) and/or dynamic methods that adjust based on certain conditions being met, on performance metrics (such as attaining minimum thresholds for specific drivers) or on process metrics (such as achieving a certain degree of diversity across the population).
$\{H\}$ may include automated methods or human-based methods that operate intimately with the normal training methodology or independently of it. One may imagine, for example, an AI receiving specific feedback from a human regarding a specific task it is working on or just completed. One may also imagine an AI receiving feedback daily from a human on how intelligently it performed overall that day across all tasks completed. $\{L\}$ comprises a wide range of methods for learning from humans, other intelligences, from experiments and more such that the intelligence is learning from itself or another intelligence. While these may be part of the core learning technique used by a researcher, these methods are intended to be focused on the higher level of how “intelligence begets intelligence.” $\{Z\}$ may vary significantly from technique to technique, and may not be possible in all cases. The purpose of a $Z$ method is to create reusable “chunks” of intelligence. These may be sub-models of a model, sub-rules of a ruleset, or sub-structures of a neural structure and so on. A given $Z$ method may involve, for example, decomposing an intelligence, composing two or more intelligences, repackaging an intelligence with a standard interface to facilitate sharing, or even interpreting the meanings of an intelligence. Multiple methods of each type may be possible within a complex process due to variations in how a given method is applied based on the details of different intelligent components. For instance, symbolic AIs with explicit rules likely require different handling than black-box neural networks with deep structures. By explicitly specifying constituent methods, this notation in principle supports the ability to compose a new high-level, self-reinforcing process from existing ones by combining, in part or in whole, the methods of those processes. For example, we could compose a new process $S_3$ from existing processes $S_1$ and $S_2$, where $\{D_3\} = \{D_1\} \cup \{D_2\}$, $\{L_3\} = \{l_{1b} \in L_1, l_{2a} \in L_2\}$ and the other sets are taken from $S_1$.
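The composition example and the two kinds of adjusting method $J$ can be sketched in a few lines. This is a hedged illustration only: methods are represented by plain string labels, and `Process`, `compose`, `scheduled_experience_count`, `thresholds_met` and the example processes are hypothetical names introduced here, not an API from the chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Process:
    """S<W> = [{D}, {O}, {J}, {H}, {L}, {Z}]; each method is a string label here."""
    D: Set[str] = field(default_factory=set)
    O: Set[str] = field(default_factory=set)
    J: Set[str] = field(default_factory=set)
    H: Set[str] = field(default_factory=set)
    L: Set[str] = field(default_factory=set)
    Z: Set[str] = field(default_factory=set)


def compose(s1: Process, s2: Process,
            learn_from_s1: Set[str], learn_from_s2: Set[str]) -> Process:
    """Build S3 with D3 = D1 | D2, L3 picked from L1 and L2, and the rest from S1."""
    if not learn_from_s1 <= s1.L or not learn_from_s2 <= s2.L:
        raise ValueError("selected learning methods must come from the source processes")
    return Process(D=s1.D | s2.D, O=set(s1.O), J=set(s1.J), H=set(s1.H),
                   L=learn_from_s1 | learn_from_s2, Z=set(s1.Z))


def scheduled_experience_count(base: int, generation: int, period: int = 100) -> int:
    """Static J method: double the number of experiences every `period` generations."""
    return base * 2 ** (generation // period)


def thresholds_met(scores: Dict[str, float], minimums: Dict[str, float]) -> bool:
    """Dynamic J method: advance only when every listed driver reaches its minimum."""
    return all(scores.get(name, float("-inf")) >= m for name, m in minimums.items())


# Hypothetical source processes and their composition.
s1 = Process(D={"V"}, O={"weighted"}, J={"double-every-100"},
             H={"daily-human-review"}, L={"l1a", "l1b"}, Z={"decompose"})
s2 = Process(D={"Q", "F"}, O={"vote"}, J={"threshold-gate"},
             H={"per-task-feedback"}, L={"l2a"}, Z={"repackage"})
s3 = compose(s1, s2, learn_from_s1={"l1b"}, learn_from_s2={"l2a"})
```

Keeping each method set explicit is what makes the union and selection operations trivial; the hard research problem, which this sketch does not address, is making the underlying methods actually interoperate.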

16.5 How Should We Approach It?

There is a lot of research needed to pursue this path of creating “strongish” AI. I have presented some re-framing, posed some questions and made some suggestions. Here I offer some additional general thoughts as well as discuss some early technical ideas on how genetic programming in particular could approach these paths.

16.5.1 Revisiting Reproducibility

We as a field are deeply steeped in the lore of reproducibility based on performance against data: sharing the same data sets, striving to ensure that training and testing data do not mix to avoid leakage, and assessing performance in a manner that focuses on outcomes. I suggest that we may need a different way of thinking. The reproducibility of a process may have more to do with the statistical likelihood of generating a certain quality of intelligence than achieving a certain level of performance against an objective function. So, the bane of data leakage perhaps becomes trumped by the bane of rote/unintelligent performance. The drive for high objective performance on a narrowly-defined metric perhaps becomes the drive for highly-nuanced performance across a variety of measures. The ability to show higher performance when using more training data is perhaps given less importance than the ability to adapt to the changed context of experiences that have already been given. The measure of the ability to give the same “right” answer, reliably, to comparable or repeated questions is replaced with the ability to give increasingly more informative and nuanced “answers” to the same types of questions over time.

16.5.2 Back to the Intelligence Function

The intelligence function (what it is, what it comprises, how to represent it, and how to gauge or assess a system against it) is perhaps the most foundational research we can work on. Where do we start? Let us first consider the simple idea that an intelligence function $Q = \{q_i\}$ is comprised of a set of intelligence capabilities $q_i$, each of which can be independently assessed. A given $q_i$ does not necessarily need to be independent of the others but should characterize a qualitatively different type or dimension of “intelligent thinking”, such as different types of logical reasoning [11], intelligent behaviours (e.g. [5]) and/or one of a “multiple” of intelligences [8]. These may include capabilities such as the degree to which the system applies first-order logic, inductive logic, abductive logic, persistence, adaptability, contingency thinking, etc. A specific intelligence function may combine performance across the various capabilities, may use a multi-objective approach and/or may vary the relative importance of different capabilities over time. These different times could reflect phases of the world or times in the “life” of an individual. For instance, a good value for a given intelligence capability may mean that the system was consistent in applying that capability at all times, or perhaps that it was better at applying it at certain key times, or perhaps that it appropriately applied it “when young” and didn’t “when older”. In other words, the use of certain capabilities may only be appropriate in certain times and contexts. Using them inappropriately reflects lower intelligence. This approach to the intelligence function offers a great deal of research flexibility. A “less intelligent” system may be comprised of fewer $q_i$ or lower assessed values across the $q_i$ than a “more intelligent” system. A “more intelligent” system may be one that exhibits the appropriate patterns of changing behaviour over time.
For instance, it may indulge in certain behaviours when something new is presented that it no longer indulges in later on, when that thing has become familiar. Moreover, we may have different drivers focused on different $q_i$. For instance, one driver may apply the purpose of maximizing the use of intelligence capability $q_1$ across all experiences (e.g. it may be critical that the intelligence seek to use deductive reasoning at all times). Another driver may apply the purpose of meeting some minimal threshold of using $q_2$ for a small set of experiences (e.g. for certain experiences, the intelligence should generate multiple alternative options to a course of action). As the world grows in complexity across phases, a driver may include, exclude or assign a different level of importance to using a particular subset $\{q_j\}$ of all $q_i$. The specifics of any given intelligence function are likely less important initially compared to the importance of adopting the approach of performing collaborative research to improve systems against shared intelligence capabilities. Different subcommunities may explore different intelligence functions, but those functions should be reconcilable at some level. For example, two different functions may share certain intelligence capabilities but not others; two different functions may be combined to form a more robust third one; different functions may be provided as competing drivers within a world. The set of all possible $q_i$ and combinations thereof that one may possibly consider also offers a valuable motivator for researchers: identifying a new $q_i$, or identifying the impact of a given $q_i$ or combination of $q_i$’s, becomes a valuable contribution to the science and could inform a large number of theses and dissertations. The drive to improve upon the state of the art will lead to intense efforts to create systems that perform increasingly well (depth) against more and more $q_i$ (breadth), with each new enhancement in “intelligence breadth and depth” inspiring new funding opportunities and tangible returns on investment. Eventually, as we learn more about intelligence capabilities, we should collectively be able to grow our efforts to use more and more complex intelligence functions, to perform increasingly well across an increasing breadth of domains, and systematically approach assessable, achievable strongish AI.
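One simple reading of an intelligence function as a set of independently assessed capabilities is a weighted mean whose weights change with the phase of the world. The sketch below assumes that reading for illustration only; the capability names, the assessed values, the phase weights and the function names are hypothetical, and a multi-objective treatment would be an equally valid alternative.

```python
from typing import Dict


def intelligence_score(assessed: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of assessed capability values q_i (each assumed to lie in [0, 1]).

    Capabilities that are weighted but were never assessed count as 0.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one capability needs a positive weight")
    return sum(w * assessed.get(name, 0.0) for name, w in weights.items()) / total


def driver_satisfied(assessed: Dict[str, float], capability: str, minimum: float) -> bool:
    """A driver whose purpose is to reach a minimal threshold on one capability."""
    return assessed.get(capability, 0.0) >= minimum


# Hypothetical weights: the later phase starts to reward deductive logic as well.
PHASE_WEIGHTS = {
    0: {"persistence": 1.0},
    1: {"persistence": 1.0, "deductive-logic": 1.0},
}
assessed = {"persistence": 0.75, "deductive-logic": 0.25}

score_phase_0 = intelligence_score(assessed, PHASE_WEIGHTS[0])
score_phase_1 = intelligence_score(assessed, PHASE_WEIGHTS[1])
```

The same individual scores lower in the later phase because a capability it has barely developed has begun to count, which is the kind of shifting expectation the phased approach is meant to express.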

16.5.3 Genetic Programming of Intelligence

What about the role of GP in the quest to evolve strongish AI? It is likely too much of a leap to imagine that we can simply evolve the “program” for intelligence from base programming operations, even if a complex, hand-crafted grammar is used. As stated earlier, we simply don’t know enough about intelligence. But, perhaps we can co-evolve the set of operations? Let us think about what grounded knowledge really means. It means that a certain set of experiences map to a certain set of actions. So, why not capture that mapping as an operation? In this way, the language of our GP grows and evolves. Normally, this may be highly infeasible, but what if we adopt variable world complexity and graduated experiences? Initially, our GP need only evolve programs using a simple base language in a simple base world. Using assessments from an initial intelligence function and other initial drivers, we can ground our individuals to the experiences that they performed best on and even distill those individuals down to only those programming portions that are involved in those (interactive) experiences. From this, we can extract new programming operations to capture re-usable intelligent operations and expand our GP language. As the world grows in complexity, the process continues. Next, imagine that feedback for certain understandings of an individual highlights their value for intelligence. Since we can tie those understandings to portions of the individual’s program, we can begin to identify meaningful sub-modules. This is a form of decomposition/pruning that allows us to break individuals up into potentially useful building blocks. Those can be separated out as distinct individuals. Massive recombinations of those building block individuals into new individuals and subsequent genetic competition against the latest experiences and drivers could lead to more and more powerful intelligences.
This resulting algorithm, if feasible, could provide reusable intelligent components that are grounded and transferable. Being “just code” at a core level, our models should easily transfer to new domains as potential building blocks. We may also generate a valuable “language” for describing components of intelligence from scratch.

Within such a general approach there are many questions to explore. Figure 16.2 provides an illustration of increasing world complexity over time. The nature of how the world changes, the types of new complexities introduced in successive phases, and the linking of those complexities to the details of a specific problem domain are all open questions. Once we have such models, how do we ensure that a population continues to grow in intelligence without losing diversity in intelligence capabilities? Perhaps there are different impacts from how we vary population size over time, such as increasing population size continuously as time passes or decreasing population size as we move from one world phase to another (see Fig. 16.3)? Perhaps as the population grows in overall performance against the intelligence function, we increase population size to ensure new capabilities are not extinguished?

Fig. 16.2 Increasing world complexity over time

Fig. 16.3 Alternative changes in population size as world complexity increases

The interplay between experiences and drivers also offers a rich set of questions to explore, as illustrated in Fig. 16.4. Should sub-populations within a world be focused on different drivers even when provided with the same or similar experiences? This could perhaps ensure that different intelligence capabilities are separately reinforced. Should sub-populations have the same drivers, but be provided with different sets of experiences? This could perhaps ensure that differently grounded knowledge is learned. Should such sub-populations in turn be mixed in later generations so that more complex intelligence capabilities evolve as they are exposed to new drivers or experiences?

Fig. 16.4 Potential ways to incorporate different sets of drivers within a population

If we have the capability to transfer intelligence, then this opens up many possibilities to explore in an evolutionary algorithm. How might we exchange intelligence across different populations evolving in different worlds (e.g., cross-breeding; see Fig. 16.5)? How might we exchange intelligence between individuals within the same or successive generations (e.g., peer-learning, teachers, parents)? How is transfer used instead of or in conjunction with recombination? What is the impact of importing intelligence from another context mid-evolution? Can we merge two worlds over the course of evolution to produce intelligences that perform intelligently in both?

Fig. 16.5 Exchange of intelligences across different populations in different worlds or phases
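To make the questions above about population size, drivers, and exchange more concrete, the following is a minimal Python sketch of how such an experiment might be set up. It is an illustration only, not a method proposed in this chapter. Every name in it (`population_size`, `evolve`, `migrate`, `run`) is hypothetical, and so are the two toy drivers and the choice of real-valued vectors as individuals. World phases here change only the population size; modelling the increasing complexity of the world itself is left open.

```python
import random


def population_size(phase, base=20, schedule="grow"):
    """Hypothetical size schedules across successive world phases."""
    if schedule == "grow":      # larger population in later phases
        return base * (phase + 1)
    if schedule == "shrink":    # smaller population in later phases
        return max(2, base // (phase + 1))
    return base                 # any other value: constant size


def evolve(pop, driver, size, rng, sigma=0.1):
    """One generation: keep the better half under `driver`, refill by mutation."""
    survivors = sorted(pop, key=driver, reverse=True)[: max(1, len(pop) // 2)]
    children = []
    while len(survivors) + len(children) < size:
        parent = rng.choice(survivors)
        children.append([g + rng.gauss(0, sigma) for g in parent])
    return (survivors + children)[:size]


def migrate(pop_a, pop_b, k, rng):
    """Swap k randomly chosen individuals between two sub-populations."""
    for _ in range(k):
        i, j = rng.randrange(len(pop_a)), rng.randrange(len(pop_b))
        pop_a[i], pop_b[j] = pop_b[j], pop_a[i]


def run(phases=3, gens_per_phase=5, schedule="grow", seed=0):
    rng = random.Random(seed)
    # Two toy drivers (assumptions): each rewards a different gene.
    drivers = [
        lambda ind: -abs(ind[0] - 1.0),
        lambda ind: -abs(ind[1] + 1.0),
    ]
    start = population_size(0, schedule=schedule)
    pops = [
        [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(start)]
        for _ in drivers
    ]
    for phase in range(phases):
        size = population_size(phase, schedule=schedule)
        for _ in range(gens_per_phase):
            pops = [evolve(p, d, size, rng) for p, d in zip(pops, drivers)]
        # Exchange individuals between sub-populations at each phase boundary.
        migrate(pops[0], pops[1], k=2, rng=rng)
    return pops


if __name__ == "__main__":
    final = run()
    print([len(p) for p in final])
```

With the default "grow" schedule and three phases, each sub-population ends with 60 individuals. Swapping the schedule, the drivers, or the point at which `migrate` is called gives a simple way to compare the alternatives raised above, though any conclusions would depend heavily on these toy assumptions.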

16.6 Conclusions

In this paper, I discussed the challenge of re-orienting ourselves to creating intelligence, not solutions, targeting the dream of strong or strongish AI. I laid out a general re-framing and a notation that identify a broad approach to creating intelligence and a number of research questions that may be useful to explore. I believe that the field of evolutionary computation in general, and the field of genetic programming in particular, offers a rich opportunity for exploring these questions, but I believe these ideas apply broadly across all approaches to AI. The challenges of creating any AI that performs well on a complex real-world task remain high, but there have been many exciting recent advances. I believe it behooves us to “step back” and look beyond current methodologies to where we would like to end up with regard to strong or strongish AI. In my efforts here to “step back”, I have somewhat intentionally avoided diving into the minutiae of existing research in the hope of emphasizing a new, higher-level dialogue. I hope these ideas spark complementary or even opposing visions and, of course, foster intriguing discussions.


Index

A
Active learning, 45
Adversarial attacks, 150, 152, 153
Artificial general intelligence, 304
Artificial intelligence, 151, 303
Artificial life, 303
AutoML, 2, 10, 15

B
Benchmark, 9, 164, 166, 264, 268, 270, 276, 277, 279
Bias-variance tradeoff, 161
Boolean functions, 50

C
Classification, 5, 7, 47, 106, 118, 144, 146, 148–150, 152, 166, 308
Combinatorics, 80
Competitive coevolutionary algorithms, 42
Complexity, 14, 42, 71, 75, 106, 144, 272, 305
Computer vision, 45
Crossover, 4, 6, 10, 112, 126, 135

D
Data augmentation, 106
Deep learning, 45, 111, 143, 144, 146–148, 151, 165, 192, 303
Digital evolution, 125, 140
Diversity, 20, 48, 106, 322
Downsampled lexicase selection, 163, 164

E
Ecological dynamics, 283
Ensemble learning, 146
Epsilon lexicase selection, 162, 167
Error threshold, 20
Evolutionary robotics, 207
Explainable artificial intelligence, 151

F
Fitness estimation, 42, 243, 244, 246–248, 251, 253, 256, 258
Fold, 264–267, 271
Functional programming, 265, 280

G
Generalization, 8, 22, 87–89, 91, 93, 94, 97, 98, 100, 101, 103, 159, 166, 316
Genotype-phenotype maps, 69
Glucose prediction, 106
Gradient lexicase selection, 166
Grammatical evolution, 106, 263
Graphs, 4, 6, 29
Grounded intelligence, 308

I
Image analysis, 45
Imitation learning, 204, 218, 221, 222, 312
Informed downsampled lexicase selection, 164
Input-output maps, 67
Intelligence function, 316

K
Kolmogorov complexity, 80

L
Language model, 270
Large language models, 178, 303
Lexicase selection, 146, 159–168, 171, 173, 242, 243, 247, 248, 250–252, 254, 255, 258, 277, 283

M
Machine learning, 1, 2, 4, 5, 143–147, 151, 159, 161, 163, 167, 173, 268, 311
Multi-objective optimization, 115, 147, 228, 283, 290

N
Neuroevolution, 144, 150, 203
Neutrality, 82
Neutral networks, 66

O
Optimization, 4, 7, 11, 15, 16, 28, 147, 151, 161, 167–169

P
Parallel and distributed computing, 125
Particularity, 159, 161, 162, 165, 167, 168, 172, 173
Phenotype, 169
Phylogeny, 131, 136, 241–248, 250, 252–258
Plexicase selection, 167, 168
Program synthesis, 164, 167, 168, 243, 247, 249, 250, 256, 263, 264, 268, 273, 277
Push, 89, 90, 99, 102, 263, 277

Q
Quality, 150, 161–163, 169, 171
Quality-diversity, 169, 181, 186

R
Random subsampling, 242, 243, 246, 247, 258
Real-world applications, 29, 151, 155, 231
Recombination, 125–129, 133–135, 139, 329
Recursion schemes, 264–270, 273, 276, 277, 279, 280
Reinforcement learning, 150, 168

S
Search trajectory networks, 68
Segmentation, 46
Shape-constrained regression, 226, 227
Simple solutions, 80
Soft robots, 204, 205, 222
Software synthesis, 88, 89, 91, 102, 160
Spatial topology, 20
Stability, 89, 91, 93, 97–99, 102
Step limits, 88, 89, 91–103
Symbolic regression, 5, 144, 147, 162, 168, 226

T
Tree-Based Pipeline Optimization Tool (TPOT), 4–15

U
Uncertainty, 46, 106
Unfold, 264–267, 273

V
Variable world complexity, 324
Visualization, 36, 52, 121

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 S. Winkler et al. (eds.), Genetic Programming Theory and Practice XX, Genetic and Evolutionary Computation, https://doi.org/10.1007/978-981-99-8413-8