English Pages 363 [373] Year 2009
Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5481
Leonardo Vanneschi Steven Gustafson Alberto Moraglio Ivanoe De Falco Marc Ebner (Eds.)
Genetic Programming 12th European Conference, EuroGP 2009 Tübingen, Germany, April 15-17, 2009 Proceedings
Volume Editors

Leonardo Vanneschi
University of Milano-Bicocca
Department of Informatics, Systems and Communication (D.I.S.Co.)
Viale Sarca 336-U14, 20126 Milano, Italy
E-mail: [email protected]

Steven Gustafson
GE Global Research
Niskayuna, NY 12309, USA
E-mail: [email protected]

Alberto Moraglio
University of Coimbra
Department of Computer Engineering
Polo II - Pinhal de Marrocos, 3030 Coimbra, Portugal
E-mail: [email protected]

Ivanoe De Falco
Institute of High Performance Computing and Networking
National Research Council of Italy (ICAR - CNR)
Via P. Castellino 111, 80131 Napoli, Italy
E-mail: [email protected]

Marc Ebner
Eberhard Karls Universität Tübingen
Wilhelm Schickard Institut für Informatik, Abt. Rechnerarchitektur
Sand 1, 72076 Tübingen, Germany
E-mail: [email protected]

Cover illustration: Detail of You Pretty Little Flocker by Alice Eldridge (2008)
www.infotech.monash.edu.au/research/groups/cema/flocker/flocker.html

Library of Congress Control Number: Applied for
CR Subject Classification (1998): D.1, F.1, F.2, I.5, I.2, J.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-01180-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-01180-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12652358 06/3180 543210
Preface
The 12th European Conference on Genetic Programming, EuroGP 2009, took place in Tübingen, Germany during April 15–17 at one of the oldest universities in Germany, the Eberhard Karls Universität Tübingen. This volume contains manuscripts of the 21 oral presentations held during the day, and the nine posters that were presented during a dedicated evening session and reception. The topics covered in this volume reflect the current state of the art of genetic programming, including representations, theory, operators and analysis, feature selection, generalization, coevolution, and numerous applications. A rigorous, double-blind peer-review process was used, with each submission reviewed by at least three members of the international Program Committee. In total, 57 papers were submitted with an acceptance rate of 36% for full papers and an overall acceptance rate of 52% including posters. The MyReview management software originally developed by Philippe Rigaux, Bertrand Chardon, and other colleagues from the Université Paris-Sud Orsay, France was used for the reviewing process. We are sincerely grateful to Marc Schoenauer from INRIA, France for his continued assistance in hosting and managing the software. Paper review assignments were largely done by an optimization process matching paper keywords to keywords of expertise submitted by reviewers. EuroGP 2009 was part of the larger Evo* 2009 event, which also included three other co-located events: EvoCOP 2009, EvoBIO 2009, and EvoWorkshops 2009. We would like to thank the many people who made EuroGP and Evo* a success. The great community of researchers and practitioners who submit and present their work, as well as serve the vital task of making timely and constructive reviews, are the foundation for our success.
We are indebted to the local organizer, Marc Ebner from the Wilhelm Schickard Institute for Computer Science at the University of Tübingen for his hard and invaluable work that really made Evo* 2009 an enjoyable and unforgettable event. We extend our thanks to Andreas Zell, Chair of Computer Architecture at the Wilhelm Schickard Institute for Computer Science at the University of Tübingen and Peter Weit, Vice Director of the Seminar for Rhetorics at the New Philology Department at the University of Tübingen for local support. Our acknowledgments also go to the Tourist Information Center of Tübingen, especially Marco Schubert, and to the German Research Foundation (DFG) for financial support. We express our gratitude to the Evo* Publicity Chair Ivanoe De Falco, from ICAR, National Research Council of Italy and to Antonio Della Cioppa from the University of Salerno, Italy for his collaboration in maintaining the Evo* Web page. We would like to thank our two internationally renowned invited keynote speakers: Peter Schuster, President of the Austrian Academy of Sciences and Stuart R. Hameroff, Professor Emeritus of the Departments of Anesthesiology
and Psychology and Director of the Center for Consciousness Studies at the University of Arizona, Tucson, USA. Last but not least, also this year, like every year since the first EuroGP edition in 1998, the work of organization has been made efficient, pleasant and enjoyable by the continued and invaluable coordination and assistance of Jennifer Willies and we are wholeheartedly grateful to her. Without her dedicated effort and the support from the Centre for Emergent Computing at Edinburgh Napier University, UK, these events would not be possible. April 2009
Leonardo Vanneschi Steven Gustafson Alberto Moraglio Ivanoe De Falco Marc Ebner
Organization
Administrative details were handled by Jennifer Willies, Centre for Emergent Computing at Edinburgh Napier University, UK.
Organizing Committee
Program Co-chairs: Leonardo Vanneschi (University of Milano-Bicocca, Italy) and Steven Gustafson (GE Global Research, USA)
Publication Chair: Alberto Moraglio (University of Coimbra, Portugal)
Publicity Chair: Ivanoe De Falco (ICAR, National Research Council of Italy)
Local Chair: Marc Ebner (University of Tübingen, Germany)
Program Committee
Hussein Abbass (UNSW@ADFA, Australia)
Lee Altenberg (University of Hawaii at Manoa, USA)
R. Muhammad Atif Azad (University of Limerick, Ireland)
Wolfgang Banzhaf (Memorial University of Newfoundland, Canada)
Anthony Brabazon (University College Dublin, Ireland)
Nicolas Bredeche (Université Paris-Sud, France)
Edmund Burke (University of Nottingham, UK)
Stefano Cagnoni (Università degli Studi di Parma, Italy)
Philippe Collard (Laboratoire I3S (UNSA-CNRS), France)
Pierre Collet (LSIIT-FDBT, France)
Ernesto Costa (Universidade de Coimbra, Portugal)
Ivanoe De Falco (ICAR, National Research Council of Italy, Italy)
Michael Defoin Platel (University of Auckland, New Zealand)
Edwin DeJong (Universiteit Utrecht, The Netherlands)
Antonio Della Cioppa (Università di Salerno, Italy)
Ian Dempsey (Pipeline Financial Group, Inc., USA)
Federico Divina (Universidad Pablo de Olavide, Spain)
Marc Ebner (Universität Tübingen, Germany)
Anikó Ekárt (Aston University, UK)
Anna Esparcia-Alcázar (ITI Valencia, Spain)
Daryl Essam (UNSW@ADFA, Australia)
Francisco Fernández de Vega (Universidad de Extremadura, Spain)
Christian Gagné (MDA, Canada)
Mario Giacobini (Università degli Studi di Torino, Italy)
Folino Gianluigi (ICAR-CNR, Italy)
Steven Gustafson (GE Global Research, USA)
Jin-Kao Hao (LERIA, Université d’Angers, France)
Inman Harvey (University of Sussex, UK)
Tuan-Hao Hoang (University of New South Wales @ ADFA, Australia)
Gregory Hornby (UCSC, USA)
Colin Johnson (University of Kent, UK)
Tatiana Kalganova (Brunel University, UK)
Maarten Keijzer (Chordiant Software International, The Netherlands)
Robert E. Keller (University of Essex, UK)
Graham Kendall (University of Nottingham, UK)
Asifullah Khan (Pakistan Inst. of Engineering and Applied Sciences, Pakistan)
Krzysztof Krawiec (Poznan University of Technology, Poland)
Jiri Kubalik (Czech Technical University in Prague, Czech Republic)
William B. Langdon (University of Essex, UK)
Kwong Sak Leung (The Chinese University of Hong Kong, Hong Kong)
John Levine (University of Strathclyde, UK)
Simon M. Lucas (University of Essex, UK)
Robert Matthew MacCallum (Imperial College London, UK)
Penousal Machado (Universidade de Coimbra, Portugal)
Bob McKay (Seoul National University, Korea)
Nic McPhee (University of Minnesota, Morris, USA)
Jörn Mehnen (Cranfield University, UK)
Xuan Hoai Nguyen (Seoul National University, Korea)
Miguel Nicolau (INRIA, France)
Julio Cesar Nievola (Pontificia Universidade Catolica do Parana, Brazil)
Michael O’Neill (University College Dublin, Ireland)
Una-May O’Reilly (MIT, USA)
Clara Pizzuti (Institute for High-Performance Computing and Networking, Italy)
Riccardo Poli (University of Essex, UK)
Thomas Ray (University of Oklahoma, USA)
Denis Robilliard (Université du Littoral, Cote D’Opale, France)
Marc Schoenauer (INRIA, France)
Michele Sebag (Université Paris-Sud, France)
Lukas Sekanina (Brno University of Technology, Czech Republic)
Yin Shan (Medicare Australia)
Sara Silva (Universidade de Coimbra, Portugal)
Moshe Sipper (Ben-Gurion University, Israel)
Alexei N. Skurikhin (Los Alamos National Laboratory, USA)
Terence Soule (University of Idaho, USA)
Ivan Tanev (Doshisha University, Japan)
Ernesto Tarantino (ICAR-CNR, Italy)
Marco Tomassini (University of Lausanne, Switzerland)
Leonardo Vanneschi (Università degli Studi di Milano, Italy)
Sébastien Verel (Université de Nice Sophia Antipolis/CNRS, France)
Man Leung Wong (Lingnan University, Hong Kong)
Tina Yu (Memorial University of Newfoundland, Canada)
Mengjie Zhang (Victoria University of Wellington, New Zealand)
Table of Contents

Oral Presentations

One-Class Genetic Programming
    Robert Curry and Malcolm I. Heywood
Genetic Programming Based Approach for Synchronization with Parameter Mismatches in EEG
    Dilip P. Ahalpara, Siddharth Arora, and M.S. Santhanam
Memory with Memory in Tree-Based Genetic Programming
    Riccardo Poli, Nicholas F. McPhee, Luca Citi, and Ellery Crane
On Dynamical Genetic Programming: Random Boolean Networks in Learning Classifier Systems
    Larry Bull and Richard Preen
Why Coevolution Doesn’t “Work”: Superiority and Progress in Coevolution
    Thomas Miconi
On Improving Generalisation in Genetic Programming
    Dan Costelloe and Conor Ryan
Mining Evolving Learning Algorithms
    András Joó
The Role of Population Size in Rate of Evolution in Genetic Programming
    Ting Hu and Wolfgang Banzhaf
Genetic Programming Crossover: Does It Cross over?
    Colin G. Johnson
Evolution of Search Algorithms Using Graph Structured Program Evolution
    Shinichi Shirakawa and Tomoharu Nagao
Genetic Programming for Feature Subset Ranking in Binary Classification Problems
    Kourosh Neshatian and Mengjie Zhang
Self Modifying Cartesian Genetic Programming: Fibonacci, Squares, Regression and Summing
    Simon Harding, Julian F. Miller, and Wolfgang Banzhaf
Automatic Creation of Taxonomies of Genetic Programming Systems
    Mario Graff and Riccardo Poli
Extending Operator Equalisation: Fitness Based Self Adaptive Length Distribution for Bloat Free GP
    Sara Silva and Stephen Dignum
Modeling Social Heterogeneity with Genetic Programming in an Artificial Double Auction Market
    Shu-Heng Chen and Chung-Ching Tai
Exploring Grammatical Evolution for Horse Gait Optimisation
    James E. Murphy, Michael O’Neill, and Hamish Carr
There Is a Free Lunch for Hyper-Heuristics, Genetic Programming and Computer Scientists
    Riccardo Poli and Mario Graff
Tree Based Differential Evolution
    Christian B. Veenhuis
A Rigorous Evaluation of Crossover and Mutation in Genetic Programming
    David R. White and Simon Poulding
On Crossover Success Rate in Genetic Programming with Offspring Selection
    Gabriel Kronberger, Stephan Winkler, Michael Affenzeller, and Stefan Wagner
An Experimental Study on Fitness Distributions of Tree Shapes in GP with One-Point Crossover
    César Estébanez, Ricardo Aler, José M. Valls, and Pablo Alonso

Posters

Behavioural Diversity and Filtering in GP Navigation Problems
    David Jackson
A Real-Time Evolutionary Object Recognition System
    Marc Ebner
On the Effectiveness of Evolution Compared to Time-Consuming Full Search of Optimal 6-State Automata
    Marcus Komann, Patrick Ediger, Dietmar Fey, and Rolf Hoffmann
Semantic Aware Crossover for Genetic Programming: The Case for Real-Valued Function Regression
    Quang Uy Nguyen, Xuan Hoai Nguyen, and Michael O’Neill
Beneficial Preadaptation in the Evolution of a 2D Agent Control System with Genetic Programming
    Lee Graham, Robert Cattral, and Franz Oppacher
Adaptation, Performance and Vapnik-Chervonenkis Dimension of Straight Line Programs
    José L. Montaña, César L. Alonso, Cruz E. Borges, and José L. Crespo
A Statistical Learning Perspective of Genetic Programming
    Nur Merve Amil, Nicolas Bredeche, Christian Gagné, Sylvain Gelly, Marc Schoenauer, and Olivier Teytaud
Quantum Circuit Synthesis with Adaptive Parameters Control
    Cristian Ruican, Mihai Udrescu, Lucian Prodan, and Mircea Vladutiu
Comparison of CGP and Age-Layered CGP Performance in Image Operator Evolution
    Karel Slaný

Author Index
One-Class Genetic Programming
Robert Curry and Malcolm I. Heywood
Dalhousie University, 6050 University Avenue, Halifax, NS, Canada B3H 1W5
{rcurry,mheywood}@cs.dal.ca
Abstract. One-class classification naturally provides only one class of exemplars, the target class, from which to construct the classification model. The one-class approach is constructed from artificial data combined with the known in-class exemplars. A multi-objective fitness function in combination with a local membership function is then used to encourage a co-operative coevolutionary decomposition of the original problem under a novelty detection model of classification. Learners are therefore associated with different subsets of the target class data and encouraged to trade off detection versus false positive performance, where this is equivalent to assessing the misclassification of artificial exemplars versus detection of subsets of the target class. Finally, the architecture makes extensive use of active learning to reinforce the scalability of the overall approach.
Keywords: One-Class Classification, Coevolution, Active Learning, Problem Decomposition.
1 Introduction
The ability to learn from a single class of exemplars is of importance under domains where it is not feasible to collect exemplars representative of all scenarios, e.g., fault or intrusion detection, or possibly when it is desirable to encourage fail safe behaviors in the resulting classifiers. As such, the problem of one-class learning or ‘novelty detection’ presents a set of requirements distinct from those typically encountered in the classification domain. For example, the discriminatory models of classification most generally employed might formulate the credit assignment goal in terms of maximizing the separation between the in- and out-class exemplars. Clearly this is not feasible under the one-class scenario. Moreover, the one-class case often places more emphasis on requiring fail safe behaviors that explicitly identify when data differs from the target class, or ‘novelty detection’. Machine learning algorithms employed under the one-class domain therefore need to address the discrimination/novelty detection problem directly.

Specific earlier works include Support Vector Machines (SVM) [1,2,3], bottleneck neural networks [4] and a recent coevolutionary genetic algorithm approach based on an artificial immune system [5] (for a wider survey see [6,7]). In particular the one-class SVM model of Schölkopf relies on the correct identification of “relaxation parameters” to separate exemplars from the origin (representing the second unseen class) [1]. Unfortunately, the values for such parameters vary as a function of the data set. However, a recent work proposed a kernel autoassociator for one-class classification [3]. In this case the kernel feature space is used to provide the required non-linear encoding, this time in a very high dimensional space (as opposed to the MLP approach to the encoding problem). A linear mapping is then performed to reconstruct the original attributes as the output. Finally, the work of Tax again uses a kernel based one-class classifier. This approach is distinct in that data is artificially generated to aid the identification of the most concise hypersphere describing the in-class data [2]. Such a framework builds on the original support vector data description model, whilst reducing the significance of specific parameter selections. The principal drawback, however, is that tens or even hundreds of thousands of artificially generated training exemplars are required to build a suitably accurate model [2].

The work proposed in this paper uses the artificial data generation model of Tax, but specifically addresses the training overhead by employing an active learning algorithm. Moreover, the Genetic Programming (GP) paradigm provides the opportunity to solve the problem using an explicitly multiple objective model, where this provides the basis for cooperative coevolutionary problem decomposition.

L. Vanneschi et al. (Eds.): EuroGP 2009, LNCS 5481, pp. 1–12, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Methodology
The general principles of the one-class GP (OCGP) methodology, as originally suggested in [8], comprise four main components:

(1) Local membership function: Conventionally, GP programs provide a mapping between the multi-dimensional attribute space and a real-valued one-dimensional number line called the gpOut axis. A binary switching function (BSF), as popularized by Koza, is then used to map the one-dimensional ‘raw’ gpOut to one of two classes, as shown in Fig. 1(a) [9]. However, a BSF assumes that the two classes can be separated at the origin. Moreover, under a one-class model of learning, given that we have no information on the distribution of out-class data, exemplars belonging to the unseen classes are just as likely to appear on either side of the origin, resulting in high false positive rates. Therefore, instead of using the ‘global’ BSF, GP individuals utilize a Gaussian or ‘local’ membership function (LMF), Fig. 1(b). A small region of the gpOut axis is therefore evolved for expressing in-class behavior, where this region is associated with a subset of the target distribution encountered during training. In this way GP individuals act as novelty detectors, as any region on the gpOut axis other than that of the LMF is associated with the out-class conditions, thus supporting conservative generalization properties when deployed. Moreover, instead of a single classifier providing a single mapping for all in-class exemplars, our goal will be to encourage the coevolution of multiple GP novelty detectors to map unique subsets of the in-class exemplars to their respective LMF.

Fig. 1. (a) ‘Global’ binary switching function vs. (b) Gaussian ‘local’ membership function

(2) Artificial outlier generation: In a conventional binary classification problem the decision boundaries between classes are supported by exemplars from each class. However, in a one-class scenario it is much more difficult to establish appropriate and concise decision boundaries since they are only supported by the target class. Therefore, the approach we adopt for developing one-class classifiers is to build the outlier class data artificially and train using a two class model. The ‘trick’ is to articulate this goal in terms of finding an optimal tradeoff between detection and false positive rates as opposed to explicitly seeking solutions with ‘zero error’.

(3) Balanced block algorithm (BBA): When generating artificial outlier data it is necessary to have a wider range of possible values than the target data and also to ensure that the target data is surrounded in all attribute directions. Therefore, the resulting ‘two-class’ training data set tends to be unbalanced and large, implying artificial data partitions in the order of tens of thousands of exemplars (as per Tax [2]). The increased size of the training data set has the potential to significantly increase the training overhead of GP. To address this overhead the BBA active learning algorithm is used [10]; thus fitness evaluation is conducted over a much smaller subset of training exemplars, dynamically identified under feedback from individuals in the population, Fig. 2.

Fig. 2. The balanced block algorithm first partitions the target and outlier classes (Level 0) [10]. Balanced blocks of training data are then formed by selecting a partition from each class by dynamic subset selection (Level 1). For each block multiple subsets are then chosen for GP, training again performed by DSS (Level 2).

(4) Evolutionary multi-objective optimization (EMO): EMO allows multiple objectives to be specified, thereby providing a more effective way to express the quality of GP programs. Moreover, EMO provides a means of comparing individuals under multiple objectives without resorting to a priori scalar weighting functions. In this way the overall OCGP classifier is identified through a cooperative approach that supports the simultaneous development of multiple programs from a single population. The generation of artificial data in and around the target data means that outlier data lying within the actual target distribution cannot be avoided. Thus, when attempting to classify the training data it is necessary to cover as much of the target data as possible (i.e., maximize detection rate), while also minimizing the amount of outlier data covered (i.e., minimize false positive rate); these are the first two objectives. Furthermore, it is desirable to have an objective to encourage diversity among solutions by actively rewarding non-overlapping behavior between the coverage of different classifiers as evolved from the same population. Finally, the fourth objective encourages solution simplicity, thus reducing the likelihood of overfitting and promoting solution transparency. GP programs are compared by their objectives using the notion of dominance, where a classifier A is said to dominate classifier B if it performs at least as well as B in all objectives and better than B in at least one objective. Pareto ranking then combines the objectives into a scalar fitness by assigning each classifier a rank based on the number of classifiers by which it is dominated [11,12]. A classifier is said to be non-dominated if it is not dominated by any other classifier in the population and has a rank of zero. The set of all non-dominated classifiers is referred to as the Pareto front. The Pareto front programs represent the current best trade-offs of the multi-objective criteria, providing a range of candidate solutions. Pareto front programs influence OCGP by being favored for reproduction and by updating the archives of best programs, which determine the final OCGP classifiers.

2.1 OCGP Algorithm
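Before walking through the algorithm, the dominance test and Pareto ranking defined under component (4) of Section 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes every objective has been expressed so that smaller is better (for example, 1 minus detection rate, false positive rate, overlap, and program size).

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if objective vector a dominates b.

    Assumption: all objectives are minimized. a dominates b when it is
    no worse in every objective and strictly better in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_ranks(population: List[Sequence[float]]) -> List[int]:
    """Rank each individual by the number of individuals that dominate it."""
    return [
        sum(dominates(other, me) for other in population)
        for me in population
    ]


def pareto_front(population: List[Sequence[float]]) -> List[int]:
    """Indices of the non-dominated (rank zero) individuals."""
    return [i for i, rank in enumerate(pareto_ranks(population)) if rank == 0]
```

An individual never dominates itself or an exact duplicate, so duplicated objective vectors share the same rank; removing duplicates is then a separate filtering step.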
The general framework of our algorithm is described by the flowchart in Fig. 3. The first step is to initialize the random GP population of programs and the necessary data structures, Step 1. OCGP is not restricted to a specific form of GP but in this work a linear representation is assumed. Artificial outlier data is then generated in and around the provided target data, Step 2. The next stage outlines the three levels of the balanced block algorithm (BBA), Fig. 2, within the dashed box. Level 0, Step 3, partitions the target and outlier data. The second level, Step 5, selects a target and outlier partition to create the Level 1 balanced block. At Level 2, Step 7, subsets are selected from the block to form the subset of exemplars over which fitness evaluation is performed (steady state tournament).

Fig. 3. Framework for OCGP assuming target data is provided as input

The next box outlines how individuals are evaluated. First, programs are evaluated on the current data subset to establish the corresponding gpOut distribution, Step 10. The classification region is then determined to parameterize the LMF, Step 11, and the multi-objective fitness is established, Step 12. Once all programs have their multi-objective fitness the programs can be Pareto ranked, Step 13. The Pareto ranking determines the tournament winners, or parents, to which genetic operators can be applied to create children, Step 14.

In addition, parent programs update the difficulty of exemplars in order to influence future subset selections. That is to say, previous performance (error) on artificial–target class data is used to guide the number of training subsets sampled from Level 1 blocks. As such, difficulty values are averaged across the data in Levels 1 and 2, Steps 6 and 8 respectively. Once training is complete at Level 2, the population and the archives associated with the current target partitions are combined, Step 15, and evaluated on the Level 1 block (Step 10 through Step 13). The archive of the target partition is then updated with the resulting Pareto front, Step 16, and partition difficulties are updated in order to influence future partition selections and the number of future subset iterations. The change in partition error rates for each class is also used to determine the block stop criteria, Step 4. Once the block stop criteria have been met, the archives are filtered, with any duplicates across the archives being removed, Step 17, and the final OCGP classifier consists of the remaining archive programs. More details of the BBA are available from [10].

2.2 Developments
Relative to the above, this work introduces the following developments:

Clustering. In the original formulation of OCGP the classification region for each program (i.e., LMF location) was determined by dividing a program’s gpOut axis into class-consistent ‘regions’ (see Fig. 4) and the corresponding class separation distance (1), or csd, between adjacent regions estimated. The target region that maximizes csd with respect to the neighboring outlier regions has the best separability and is chosen as the classification region for the GP program (determining classification region at Fig. 3 Step 11).

\[ csd_{0/1} = \frac{|\mu_0 - \mu_1|}{\sqrt{\sigma_0^2 + \sigma_1^2}} \qquad (1) \]
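As an illustration of how the class separation distance could drive the choice of classification region, the following Python sketch computes csd between two sets of gpOut values and picks the target region with the best separation. It is a sketch under stated assumptions rather than the original OCGP code: the function names are hypothetical, population variances are used, regions are assumed to be supplied already ordered along the gpOut axis, and taking the minimum csd over the two neighboring outlier regions is an assumption, since the text does not state how the two neighbors are combined.

```python
import math
from statistics import mean, pvariance
from typing import List, Optional, Sequence, Tuple


def class_separation_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """csd = |mu_a - mu_b| / sqrt(var_a + var_b) for two sets of gpOut values."""
    mu_a, mu_b = mean(a), mean(b)
    denom = math.sqrt(pvariance(a) + pvariance(b))
    if denom == 0.0:
        # Degenerate case: both regions collapse to single points.
        return math.inf if mu_a != mu_b else 0.0
    return abs(mu_a - mu_b) / denom


def best_target_region(
    regions: List[Tuple[int, Sequence[float]]]
) -> Tuple[Optional[int], float]:
    """Pick the target region best separated from its outlier neighbors.

    regions: (label, gpOut values) pairs ordered along the gpOut axis,
    with label 1 for target regions and 0 for outlier regions.
    Assumption: a region's score is its minimum csd to adjacent outlier regions.
    """
    best_idx, best_score = None, -math.inf
    for i, (label, values) in enumerate(regions):
        if label != 1:
            continue
        neighbours = [
            regions[j][1]
            for j in (i - 1, i + 1)
            if 0 <= j < len(regions) and regions[j][0] == 0
        ]
        if not neighbours:
            continue
        score = min(class_separation_distance(values, n) for n in neighbours)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score
```

The mean and standard deviation of the selected region would then parameterize the Gaussian LMF for that program.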
Fig. 4. Determining GP classification region by class separation distance

In this work the LMF is associated with the most dense region of the gpOut axis, i.e., independent of the label information. Any artificial exemplars lying within the LMF might either result in the cluster being penalized once the fitness criteria are applied or be considered as outliers generated by the artificial data generation model. In effect we establish a ‘soft’ model of clustering (may contain some artificial data points) in place of the previous ‘hard’ identification of clusters (no clusters permitted with any artificial data points). To this end, subtractive clustering [13] is used to cluster the one-dimensional gpOut axis and has the benefit of not requiring an a priori specification of the number of clusters. The mixed content of a cluster precludes the use of a class separation distance metric. Instead the sum squared error (SSE) of the exemplars within the LMF is employed. The current GP program’s associated LMF returns confidence values for each of the exemplars in the classification region. Therefore, the error for each type of exemplar can be determined by subtracting their confidence value from their actual class label (i.e., reward for target exemplars in classification region and penalize for outliers). The OCGP algorithm using gpOut clustering will be referred to as OCGPC.

Caching. The use of the clustering algorithm caused the OCGPC algorithm to run much more slowly than OCGP. The source was identified to be the clustering of the entire GP population and the current archive on the entire Level 1 block (Fig. 3 Step 15). Therefore, instead of evaluating the entire population and archive on the much larger Level 1 block, the mean and standard deviation are cached from the subset evaluation step. Caching was introduced in both the OCGP and OCGPC algorithms and was found to speed up training time without negatively impacting classification performance.

Overlap. The overlap objective has been updated from the previous work to compare tournament programs against the current archive instead of comparing against the other tournament programs (assessing multi-objective fitness at Step 12). Individuals losing a tournament are destined to be replaced by the search operators, and thus should not contribute to the overlap evaluation.
Moreover, comparison against the archive programs is more relevant, as they represent the current best solution to the current target partition (i.e., target exemplars already covered) and thus encourages tournament programs to classify the remaining uncovered target exemplars. Artificial outlier generation. Modifications have been made in order to improve the quality of the outlier data (Fig. 3 Step 2). Previously a single radius, R, was determined by the attribute of the target data having the largest range and was then used as the radius for all attribute dimensions when creating outliers. If a large disparity exists between attribute ranges, this can lead to large volumes of the outlier distribution with little to no relevance to the target data. Alternatively, a vector R of radii is used consisting of a radius for each attribute. Additionally, when the target data consists of only non-negative values, negative outlier attribute values are restricted to within a close proximity to zero.
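The per-attribute radius scheme can be illustrated with a minimal Python sketch. It is not the authors' implementation: the uniform box sampling, the `scale` factor, the `neg_margin` fraction and all function names are assumptions introduced here for illustration. Only the use of one radius per attribute, and the restriction of negative values to a neighbourhood of zero when the target data is non-negative, come from the text.

```python
import random


def per_attribute_radii(target, scale=1.5):
    """Return (centres, radii) with one radius per attribute.

    Assumption: each radius is `scale` times half the attribute range,
    centred on the mid-range of the target data.
    """
    columns = list(zip(*target))
    centres = [(max(c) + min(c)) / 2.0 for c in columns]
    radii = [scale * (max(c) - min(c)) / 2.0 for c in columns]
    return centres, radii


def generate_outliers(target, n_outliers, scale=1.5, neg_margin=0.05, seed=0):
    """Sample artificial outliers uniformly in a box sized per attribute."""
    rng = random.Random(seed)
    centres, radii = per_attribute_radii(target, scale)
    # If every target value is non-negative, keep negative outlier values
    # within a small margin of zero (hypothetical margin: neg_margin * radius).
    non_negative = all(v >= 0 for row in target for v in row)
    outliers = []
    for _ in range(n_outliers):
        row = []
        for c, r in zip(centres, radii):
            v = rng.uniform(c - r, c + r)
            if non_negative:
                v = max(v, -neg_margin * r)
            row.append(v)
        outliers.append(row)
    return outliers


# Two attributes with very different ranges: a single shared radius would
# waste most of the outlier volume along the first attribute.
target = [[0.0, 10.0], [1.0, 1000.0], [0.5, 500.0]]
outliers = generate_outliers(target, n_outliers=200)
```

With a single shared radius, the box along the first attribute would be as wide as along the second; the vector of radii keeps the outliers proportionate to each attribute's own range.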
3 Experiments
In contrast to the previous performance evaluation [8], we concentrate on benchmarking against data sets that have large unbalanced distributions in the underlying exemplar distribution, and are thus known to be difficult to classify under binary classification methods. Specifically, the Adult, Census-Income (KDD), Mushroom and Letter Recognition data sets from the UCI machine learning repository [14] were utilized (Table 1). The Letter Recognition data set was used to create three one-class classification data sets where the target data was alternately all vowels (Letter-vowel), the letter ‘a’ (Letter-a) and the letter ‘e’ (Letter-e). For the Adult and Census data sets a predefined training and test partition exists. For the Letter Recognition and Mushroom data sets the data was first divided into a 75% training and 25% test data set while maintaining the class distributions of the original data. The class 0 exemplars were removed from the training data set to form the one-class target data set.

Table 1. Binary classification datasets. The larger in-class partition of Adult, Census and Letter-vowel was matched with a larger artificial exemplar training partition.

Dataset        Features   Partition   Class 0   Class 1   Total
Adult          14         Train       50,000    7,506     57,506
                          Test        11,355    3,700     15,055
Census         40         Train       50,000    5,472     55,472
                          Test        34,947    2,683     37,630
Letter-vowel   16         Train       50,000    2,660     52,660
                          Test        4,031     969       5,000
Letter-a       16         Train       10,000    574       10,574
                          Test        4,803     197       5,000
Letter-e       16         Train       10,000    545       10,545
                          Test        4,808     192       5,000
Mushroom       22         Train       10,000    1,617     11,617
                          Test        872       539       1,411

Training was performed on a dual G4 1.33 GHz Mac Server with 2 GB of RAM. All experiments are based on 50 GP runs where runs differ only in their choice of random seeds for initializing the population while all other parameters remain unchanged. Table 2 lists the common parameter settings for all runs. The OCGP algorithm results are compared to results found by a one-class support vector machine (OC ν-SVM) [1] and a one-class or bottleneck neural network (BNN)¹, where both algorithms are trained on the target data alone. Additionally, the OCGP results will be compared to a two-class support vector machine (ν-SVM) which uses both the artificial outlier and the target data. The two-class SVM is used as a baseline binary classification algorithm in order to assess to what degree the OCGP algorithms are able to provide a better characterization of the problem, i.e., both algorithms are trained on the target and artificial outlier data. Comparison against Tax’s SVM was not possible as the current implementation does not scale to large data sets.

¹ Unlike the original reference ([4]), the BNN was trained using the efficient second-order Conjugate Gradient weight update rule with tansig activation functions, both of which make a significant improvement over first-order error back-propagation.
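The data preparation described above (a 75%/25% split that preserves class proportions, followed by removal of the class 0 exemplars from the training partition) can be sketched as follows. This is an illustrative sketch only: the function names, the rounding rule and the seed are assumptions, and the paper does not state how the split was implemented.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split exemplar indices so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_fraction * len(idxs))
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)


def one_class_target_set(labels, train_idx, target_label=1):
    """Drop the class 0 exemplars from the training partition."""
    return [i for i in train_idx if labels[i] == target_label]


# Hypothetical toy labels: 40 target (class 1) and 160 class 0 exemplars.
labels = [1] * 40 + [0] * 160
train_idx, test_idx = stratified_split(labels)
target_idx = one_class_target_set(labels, train_idx)
```

The test partition keeps both classes, so detection rate and false positive rate can still be measured on real class 0 data that was never seen during training.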
Table 2. Parameter settings

Dynamic Page-Based Linear GP
  Population size             125
  Tournament size             4
  Max # of pages              32
  Number of registers         8
  Page size                   8 instructions
  Max page size               8 instructions
  Instr. prob. 1, 2 or 3      0/5, 4/5, 1/5
  Function set                {+, –, ×, ÷}
  Prob. Xover, Mut., Swap     0.9, 0.5, 0.9
  Terminal set                {# of attributes}

Balanced Block Algorithm parameters
  Target partition size       ≈ #Patterns / #Archives
  Max subset iterations       500
  Tourneys per subset         2000
  Outlier partition size, Max block selections, Level 2 subset size: 10, 6, 100 (values as extracted; row assignment uncertain)

Archive parameters
  Number of archives          Adult = 15, Census = 11, Letter-vowel = 10, Letter-a = 6, Letter-e = 6, Mushroom = 4
  Archive size                10
The algorithms are compared in terms of ROC curves of (FPR, DR) pairs on test data sets (Fig. 5). Due to the large number of runs of the OCGP algorithms and the multiple levels of voting possible, plotting all of the OCGP solutions becomes too convoluted. Instead, only non-dominated solutions will be shown, where these represent the best-case OCGP (FPR, DR) pairs over the 50 runs. Similarly, only the non-dominated solutions of the bottleneck neural networks will be shown, while for the other algorithms only a small number of solutions are present and so all results will be plotted. Comments will be made as follows on an algorithm-by-algorithm basis:

OCGPC. Of the two one-class GP models, the cluster variant for establishing the region of interest on the gpOut axis described in Sect. 2.2 appeared to be the more consistent. Specifically, OCGPC provided the best performing curves under the Adult and Vowel data sets (Fig. 5(a) and (c)) and was consistently the runner up under all but the Census data set. In each of the three cases where it appears as a runner up it was second to the BNN. However, GP retains the advantage of indexing a subset of the attributes, whereas multi-layer neural networks index all features, a bias of NN methods in general and the autoassociator method of deployment in particular.

OCGP. When assuming the original region based partitioning of gpOut, ROC performance decreases significantly, with strongest performance on the Census data set and runner up performance under Adult and Mushroom. Performance on the remaining data sets might match or better the one-class ν-SVM, but is generally worse than BNN or OCGPC.

BNN. As remarked above, the BNN was the strongest one-class model, with best ROC curves on Letter ‘a’, Letter ‘e’ and Mushroom, and joint best on Census (with OCGP). Moreover, the approach was always better than the SVM methods benchmarked in this work.
Fig. 5. Test dataset ROC curves (Detection Rate versus False Positive Rate) for OCGPC, OCGP, BNN, OC ν-SVM and ν-SVM on (a) Adult, (b) Census, (c) Vowel, (d) Letter ‘a’, (e) Letter ‘e’ and (f) Mushroom
SVM methods. Neither SVM method – one-class or binary SVM trained on the artificial versus target class – performed well. Indeed the binary SVM was at best degenerate on all but the Letter ‘a’ data set; resulting in it being ranked worst in all cases. This indicates that merely combining artificial data with target class data is certainly not sufficient for constructing one-class learners as the learning objective is not correctly expressed. The one-class SVM model generally avoided degenerate solutions, but never performed better than OCGPC or BNN.
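The selection of non-dominated (FPR, DR) pairs used for the ROC plots can be sketched as a simple sweep. This is a generic Pareto filter written for illustration, not code from the paper; the function name and the toy pairs are hypothetical. A pair is kept when no other pair has both a lower-or-equal false positive rate and a higher-or-equal detection rate.

```python
def non_dominated(points):
    """Return the (fpr, dr) pairs not dominated by any other pair.

    Lower FPR is better and higher DR is better.
    """
    front = []
    # Sort by increasing FPR; for ties, the highest DR comes first.
    for fpr, dr in sorted(set(points), key=lambda p: (p[0], -p[1])):
        # A pair survives only if it improves on the best DR seen so far.
        if not front or dr > front[-1][1]:
            front.append((fpr, dr))
    return front


# Hypothetical (FPR, DR) pairs pooled over several runs.
pairs = [(0.1, 0.5), (0.2, 0.4), (0.2, 0.9), (0.05, 0.3), (0.5, 0.9), (0.6, 0.95)]
front = non_dominated(pairs)
```

Because the pairs are pooled over all runs before filtering, the resulting front is a best-case envelope rather than the curve of any single run, which matches the way the OCGP and BNN results are reported.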
4 Conclusion
A Genetic Programming (GP) classifier has been proposed in this work for the one-class classification domain. Four main components of the algorithm have been identified. Artificial outlier generation is used to establish more concise boundaries and to enable the use of a two-class classifier. The active learning algorithm BBA tackles the class imbalance and GP scalability issues introduced by the large number of artificial outliers required. Gaussian ‘local’ membership functions allow GP programs to respond to specific regions of the target distribution and act as novelty detectors. Evolutionary multi-objective optimization drives the search for improved target regions, while allowing for the simultaneous development of multiple programs that cooperate towards the overall problem decomposition through the objective for minimizing overlapping coverage. A second version of the OCGP algorithm is introduced, namely OCGPC, which determines classification regions by clustering the one-dimensional gpOut axis. In addition, ‘caching’ of the classification regions is introduced to both algorithms, in order to eliminate the need to redetermine classification regions over training blocks. Caching reduces training times without negatively impacting classification performance. Modifications were also made to improve the quality of generated artificial outliers and to the objectives used to determine classification regions, including the use of the sum-squared error and improving the overlap objective by comparing to only the current best archive solutions. The OCGP and OCGPC algorithms were evaluated on six data sets larger than previously examined. The results were compared against two one-class classifiers trained on target data alone, namely one-class ν-SVM and a bottleneck neural network (BNN). An additional comparison was made with a two-class SVM trained on target data and the generated artificial outlier data. 
The OCGPC and BNN models were the most consistent performers overall; thereafter model preference might be guided by the desire for solution simplicity. In this case the OCGPC model makes additional contributions as it operates as a classifier as opposed to an autoassociator, i.e., autoassociators are unable to simplify solutions in terms of attributes indexed. Future work will concentrate in this direction and on applying OCGPC to learning without artificial data.
R. Curry and M.I. Heywood
Acknowledgements BNN and SVM models were built in MATLAB and LIBSVM respectively. The authors acknowledge MITACS, NSERC, CFI and SwissCom Innovations.
References
1. Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., Williamson, R.: Estimating the support of a high-dimensional distribution. Neural Computation 13, 1443–1471 (2001)
2. Tax, D., Duin, R.: Uniform object generation for optimizing one-class classifiers. Journal of Machine Learning Research 2, 155–173 (2001)
3. Zhang, H., Huang, W., Huang, Z., Zhang, B.: A kernel autoassociator approach to pattern classification. IEEE Transactions on Systems, Man and Cybernetics - Part B 35(3), 593–606 (2005)
4. Manevitz, L., Yousef, M.: One-class document classification via neural networks. Neurocomputing 70(7-9), 1466–1481 (2007)
5. Wu, S., Banzhaf, W.: Combatting financial fraud: A coevolutionary anomaly detection approach. In: Genetic and Evolutionary Computation Conference (GECCO), pp. 1673–1680 (2008)
6. Markou, M., Singh, S.: Novelty detection: A review – part 1: Statistical approaches. Signal Processing 83, 2481–2497 (2003)
7. Markou, M., Singh, S.: Novelty detection: A review – part 2: Neural network based approaches. Signal Processing 83, 2499–2521 (2003)
8. Curry, R., Heywood, M.: One-class learning with multi-objective Genetic Programming. In: Proceedings of the IEEE Systems, Man and Cybernetics Conference, SMC, pp. 1938–1945 (2007)
9. Koza, J.: Genetic programming: On the programming of computers by means of natural selection. Statistics and Computing 4(2), 87–112 (1994)
10. Curry, R., Lichodzijewski, P., Heywood, M.: Scaling genetic programming to large datasets using hierarchical dynamic subset selection. IEEE Transactions on Systems, Man and Cybernetics - Part B 37(4), 1065–1073 (2007)
11. Kumar, R., Rockett, P.: Improved sampling of the pareto-front in multiobjective genetic optimizations by steady-state evolution: A pareto converging genetic algorithm. Evolutionary Computation 10(3), 283–314 (2002)
12. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: A comparative case study and the strength pareto approach. IEEE Transactions on Evolutionary Computation 3(4), 257–271 (1999)
13. Chiu, S.: 9. In: Fuzzy Information Engineering: A Guided Tour of Applications. John Wiley & Sons, Chichester (1997)
14. Asuncion, A., Newman, D.J.: UCI Repository of Machine Learning Databases. Dept. of Information and Comp. Science. University of California, Irvine (2008), http://www.ics.uci.edu/~mlearn/mlrepository.html
Genetic Programming Based Approach for Synchronization with Parameter Mismatches in EEG

Dilip P. Ahalpara¹, Siddharth Arora², and M.S. Santhanam²,³

¹ Institute for Plasma Research, Near Indira Bridge, Bhat, Gandhinagar-382428, India
² Physical Research Laboratory, Navrangpura, Ahmedabad-380009, India
³ Indian Institute of Science Education and Research, Pashan, Pune-411021, India
[email protected], [email protected], [email protected]
Abstract. Effects of parameter mismatches in synchronized time series are studied first for an analytical non-linear dynamical system (coupled logistic map, CLM) and then in a real system (Electroencephalograph (EEG) signals). The internal system parameters derived from GP analysis are shown to be quite effective in understanding aspects of synchronization and non-synchronization in the two systems considered. In particular, GP is also successful in generating the CLM coupled equations to a very good accuracy with reasonable multi-step predictions. It is shown that synchronization in the above two systems is well understood in terms of parameter mismatches in the system equations derived by GP approach.
1 Introduction
Synchronized time series with errors are known to exist in nature as well as in many nonlinear dynamical systems [1,2,3,4]. For instance, consider two dynamical systems X(µ) and Y (µ) which are coupled to one another and µ is an internal parameter. If x(t) and y(t) are the measured output of some variable from X and Y systems respectively, then two possible scenarios can arise within the context of synchronization. This could lead to complete synchronization, implying x(t)=y(t) for all t after certain transient time. Another possibility is that |x(t)-y(t)| 5.0) return( 3000000 – (d * timing) ) return( 4000000 – (total navigator movement * timing) )
// TIER 3 // TIER 2 // TIER 1
2) Fitness Function E. Fitness function E is a count of the number of targets at their depots when evaluation ends. This provides very impoverished performance feedback to the evolutionary process, and no reward for intermediate work. Its value is

FE = 1 / (T + 1)    (6)

where T is the number of targets at their depots.

Table 5. GP System Settings

Setting, Parameter, or Option   Value
Function set                    {FA, LA, RA, TDR, TDS, DDR, DDS, S, V, ERC, TAR, SS, SV, +, –, *, /, SQR, SQRT, IF, NEG, SW}
Halting condition               151 generations (Sequence 1); 79 generations (Sequence 2)
3.2 The GP System

GP system settings are listed in Table 5. Only those that differ from version one are given. The function set is identical to that of version one except for the addition of SW. Because there are no walls or obstacles, and because targets are pushed instead of carried, FA, LA, RA, and TAR act as constants.

3.3 Results

Since fitness function FE uses inversion to convert a maximization problem into a minimization problem, all logged fitness values are converted to T = (1 / f) – 1, called the converted fitness.

Table 6. Statistics for Best-of-Run Converted Fitness

Configuration                   Without exaptation   With exaptation
Mean best converted fitness     2.96                 6.10
Standard deviation              2.84                 3.62
Confidence interval (95%)       ±0.63                ±0.98
Number of samples               78                   52
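The relation between fitness function E and the converted fitness, together with the confidence intervals in Table 6, can be checked with a short sketch. The function names are hypothetical, and the normal approximation (z = 1.96) is an assumption: the paper does not state how the 95% intervals were computed, although this approximation reproduces the reported half-widths to two decimals.

```python
import math


def fitness_e(targets_at_depots):
    """Fitness function E: FE = 1 / (T + 1), to be minimized."""
    return 1.0 / (targets_at_depots + 1)


def converted_fitness(f):
    """Invert FE to recover the number of targets: T = (1 / f) - 1."""
    return 1.0 / f - 1.0


def ci95_half_width(std_dev, n, z=1.96):
    """Half-width of a 95% confidence interval under a normal approximation."""
    return z * std_dev / math.sqrt(n)


# Round trip: 7 targets at their depots -> FE = 0.125 -> converted fitness 7.
t_recovered = converted_fitness(fitness_e(7))

# Table 6 rows: (standard deviation, number of samples).
without_exaptation = ci95_half_width(2.84, 78)
with_exaptation = ci95_half_width(3.62, 52)
```

Working with the converted fitness keeps the reported statistics in units of targets delivered, which is easier to interpret than the inverted value logged during evolution.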
Beneficial Preadaptation in the Evolution of a 2D Agent Control System
Fig. 2. Distribution showing that exaptation produces many fewer of the least fit results
The best-of-run fitness statistics for each sequence are given in Table 6. A Mann-Whitney U test comparison gives p ≤ 0.0001. Both the exaptive and non-exaptive configurations have a comparable amount of variation in their outcomes. The overlap in outcome distributions, however, is less extreme than that of version one, showing an even greater benefit for the exaptive approach. Fig. 2 shows how the outcomes are spread across the full range of 0 to 20. As with version one, the most obvious difference between the exaptive and non-exaptive approaches is in the poorest outcomes, with the exaptive approach being about half as likely as the non-exaptive approach to produce control systems that perform extremely poorly. The balance of outcomes shows that the exaptive approach exceeds the non-exaptive approach, to greater or lesser degrees, in all other bins in the graph.
4 Conclusion

The 2D GP Navigator provides examples of complex task learning with and without prior training on simplified versions of the learning task or rewards for intermediate steps. The exaptive approaches, which add new complexity to the task with each epoch or tier, successfully outperform otherwise identical non-exaptive approaches. In version one the benefit was moderate, whereas the improvement in version two, with its harsher task and impoverished feedback, was more substantial.

Several unexpected aspects of the behavior of the system seem to be evident in the fitness data and in animations of navigators in action [8], [9]. In version one the non-exaptive configuration was able to produce highly fit solutions. At the outset this had seemed unlikely given the complexity of the task. Observations of the behavior of fit individuals, such as [8], indicate that the placement of targets allowed many to be collected without requiring the ability to sense them. Because targets are clustered, the navigator can expect to gather many of them simply by repeatedly traversing a small area. Many target and depot collisions will occur, and many targets will be deposited with little effort. This occurs in both sequence configurations.

Nevertheless, the exaptive configuration is more efficient. The outcome distributions in Fig. 1 show that the boosted advantage stems less from producing a greater number of highly fit solutions, and more from producing considerably fewer of the poorest. The exaptive approach seems less likely to converge on weak solutions, with its epochs of preadaptation acting as a safeguard against such outcomes.

One other noteworthy aspect of the performance in version one is that in the final 60 generations, the exaptive configuration is initially worse off, but quickly overtakes the non-exaptive configuration. The earlier epochs seem to endow the population with beneficial building blocks that are harnessed later on. Despite entering the final epoch with poorer fitness than the non-exaptive GP system, the population possesses a store of genetic material that allows it to progress more quickly. It is preadapted.

The results from version two tell a similar story, but with an even more exaggerated benefit for the exaptive approach. The task was made more difficult than that of version one in a variety of ways and this seems to have helped in highlighting the potential strength of exaptation and learning in stages. What is perhaps most remarkable about version two is the ability of the non-exaptive system to solve the problem at all. Given the crude and impoverished feedback provided by the fitness function, we expected no progress. Its successful runs, despite being infrequent, are still somewhat surprising. This speaks to the strength and tenacity of the GP problem-solving technique.

L. Graham, R. Cattral, and F. Oppacher
References
1. Koza, J.R.: Genetic Programming. MIT Press, Cambridge (1992)
2. Gould, S.J., Vrba, E.: Exaptation: A Missing Term in the Science of Form. Paleobiology 8(1), 4–15 (1982)
3. Andrews, P.W., Gangestad, W., Matthews, D.: Adaptationism – How to Carry Out an Exaptationist Program. Behavioral and Brain Sciences 25(4), 489–553 (2002)
4. Gould, S.J.: The Structure of Evolutionary Theory, ch. 11. The Belknap Press of Harvard University Press, Cambridge (2002)
5. Frazzetta, T.H.: Complex Adaptations in Evolving Populations. Sinauer Associates, Sunderland (1975)
6. Lembcke, S.: Chipmunk Game Dynamics (last accessed, July 2008), http://wiki.slembcke.net/main/published/Chipmunk
7. Mann, H.B., Whitney, D.R.: On a Test of Whether One of Two Random Variables is Stochastically Larger Than the Other. Annals of Mathematical Statistics 18, 50–60 (1947)
8. Graham, L.: Animation of an Evolved 2D Navigator (last accessed, October 2008), http://www.youtube.com/watch?v=GrnHnGwmZ7A
9. Graham, L.: Animation of an Evolved 2D Navigator (last accessed, October 2008), http://www.youtube.com/watch?v=i4f5gT-GV9M
Adaptation, Performance and Vapnik-Chervonenkis Dimension of Straight Line Programs

José L. Montaña¹, César L. Alonso², Cruz E. Borges¹, and José L. Crespo³

¹ Departamento de Matemáticas, Estadística y Computación, Universidad de Cantabria, 39005 Santander, Spain
{montanjl,borgesce}@unican.es
² Centro de Inteligencia Artificial, Universidad de Oviedo, Campus de Viesques, 33271 Gijón, Spain
[email protected]
³ Departamento de Matemática Aplicada, Estadística y Ciencias de la Computación, Universidad de Cantabria, 39005 Santander, Spain
[email protected]
Abstract. We present an empirical comparison of model selection methods based on Linear Genetic Programming. Two statistical methods are compared: model selection based on Empirical Risk Minimization (ERM) and model selection based on Structural Risk Minimization (SRM). For this purpose we have identified the main components which determine the capacity of some linear structures as classifiers, showing an upper bound for the Vapnik-Chervonenkis (VC) dimension of classes of programs representing linear code defined by arithmetic computations and sign tests. This upper bound is used to define a fitness based on VC regularization that performs significantly better than the fitness based on empirical risk.

Keywords: Genetic Programming, Linear Genetic Programming, Vapnik-Chervonenkis dimension.
1 Introduction
Throughout these pages we study some theoretical and empirical properties of a new structure for representing computer programs in the GP paradigm. This data structure, called straight line program (slp) in the framework of Symbolic Computation ([1]), was introduced for the first time into the GP setting in [2]. An slp consists of a finite sequence of computational assignments. Each assignment is obtained by applying some functional (selected from a specified set) to a set of arguments that can be variables, constants or pre-computed results. The slp structure can describe complex computable functions using fewer computational resources than GP-trees. The key point for explaining this feature is the ability of slp’s to reuse previously computed results during the evaluation process. Another advantage with respect to trees is that the slp structure can describe multivariate functions by selecting a number of assignments as the output set. Hence one single slp has the same representation capacity as a forest of trees (see [2] for a complete presentation of this structure).

Linear Genetic Programming (LGP) is a GP variant that evolves sequences of instructions from an imperative programming language or from a machine language. The structure of the program representation consists of assignments of operations over constants or memory variables, called registers, to other registers (see [3] for a complete overview of LGP). The GP approach with slp’s can be seen as a particular case of LGP where the data structures representing the programs are lists of computational assignments.

We study the practical performance of ad hoc recombination operators for slp’s. We apply the slp-based GP approach to solve some instances of the symbolic regression problem. Experimentation done over academic examples uses a weak form of structural risk minimization and suggests that the slp structure behaves very well when dealing with bounded length individuals directed to minimize a compromise between empirical risk and non-scalar length (i.e. number of non-linear operations used by the structure). We have calculated an explicit upper bound for the Vapnik-Chervonenkis dimension (VCD) of some particular classes of slp’s. This bound constitutes our basic tool in order to perform structural risk minimization of the slp structure.

L. Vanneschi et al. (Eds.): EuroGP 2009, LNCS 5481, pp. 315–326, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Straight Line Programs: Basic Concepts and Properties
Straight line programs are commonly used for solving problems of algebraic and geometric flavor. An extensive study of the use of slp’s in this context can be found in [4]. The formal definition of slp’s we provide in this section is taken from [2].

Definition 1. Let $F = \{f_1, \ldots, f_n\}$ be a set of functions, where each $f_i$ has arity $a_i$, $1 \le i \le n$, and let $T = \{t_1, \ldots, t_m\}$ be a set of terminals. A straight line program (slp) over $F$ and $T$ is a finite sequence of computational instructions $\Gamma = \{I_1, \ldots, I_l\}$, where for each $k \in \{1, \ldots, l\}$,

\[ I_k \equiv u_k := f_{j_k}(\alpha_1, \ldots, \alpha_{a_{j_k}}); \]

with $f_{j_k} \in F$, $\alpha_i \in T$ for all $i$ if $k = 1$ and $\alpha_i \in T \cup \{u_1, \ldots, u_{k-1}\}$ for $1 < k \le l$. The set of terminals $T$ satisfies $T = V \cup C$ where $V = \{x_1, \ldots, x_p\}$ is a finite set of variables and $C = \{c_1, \ldots, c_q\}$ is a finite set of constants. The number of instructions $l$ is the length of $\Gamma$.
Usually an slp $\Gamma = \{I_1, \ldots, I_l\}$ will be identified with the set of variables $u_i$ introduced at each instruction $I_i$, thus $\Gamma = \{u_1, \ldots, u_l\}$. Each of the non-terminal variables $u_i$ can be considered as an expression over the set of terminals $T$ constructed by a sequence of recursive compositions from the set of functions $F$. Following [2] we denote by $SLP(F, T)$ the set of all slp’s over $F$ and $T$.

Example 1. Let $F$ be the set given by the three binary standard arithmetic operations $F = \{+, -, *\}$ and let $T = \{1, x_1, x_2, \ldots, x_n\}$ be the set of terminals. Any slp $\Gamma$ in $SLP(F, T)$ represents an $n$-variate polynomial with integer coefficients.
An output set of an slp $\Gamma = \{u_1, \ldots, u_l\}$ is any set of non-terminal variables of $\Gamma$, that is $O(\Gamma) = \{u_{i_1}, \ldots, u_{i_t}\}$. The function computed by an slp $\Gamma = \{u_1, \ldots, u_l\}$ over $F$ and $T$ with set of terminal variables $V = \{x_1, \ldots, x_p\}$ and with output set $O(\Gamma) = \{u_{i_1}, \ldots, u_{i_t}\}$, denoted by $\Phi_\Gamma : I^p \to O^t$, is defined recursively in the natural way and satisfies $\Phi_\Gamma(a_1, \ldots, a_p) = (b_1, \ldots, b_t)$, where $b_j$ stands for the value of the expression over $V$ of the non-terminal variable $u_{i_j}$ when we replace each variable $x_k$ with $a_k$, $1 \le k \le p$.
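Definition 1 and the notion of output set can be made concrete with a small interpreter. The sketch below is illustrative only: the tuple encoding of instructions, the naming of non-terminal variables as `u1`, `u2`, ... and the function `evaluate_slp` are assumptions of this example, and the function set is restricted to the binary operations of Example 1.

```python
import operator

# Function set of Example 1: the three binary arithmetic operations.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate_slp(instructions, terminals, output_set):
    """Evaluate an slp given as a list of (op, arg1, arg2) instructions.

    Arguments name either a terminal or an earlier non-terminal variable
    'u1', 'u2', ...; referring to a later variable raises KeyError, which
    enforces the ordering required by Definition 1.
    """
    values = dict(terminals)
    for k, (op, a, b) in enumerate(instructions, start=1):
        values[f"u{k}"] = FUNCTIONS[op](values[a], values[b])
    return tuple(values[name] for name in output_set)


# u1 := x1 * x1;  u2 := u1 + x2;  u3 := u2 * u2;  u4 := u3 - 1
slp = [("*", "x1", "x1"), ("+", "u1", "x2"), ("*", "u2", "u2"), ("-", "u3", "1")]
terminals = {"1": 1, "x1": 3, "x2": 2}

# Output set {u2, u4}: one slp computes two polynomials at once.
result = evaluate_slp(slp, terminals, output_set=("u2", "u4"))
print(result)  # (11, 120)
```

Note that `u2` is computed once and then used twice in `u3`; a tree representation of the same polynomial would have to duplicate that subexpression, which is the reuse advantage described in the introduction.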
3 Vapnik-Chervonenkis Dimension of Families of slp’s
In recent years GP has been applied to a range of complex learning problems, including classification and symbolic regression, in a variety of fields like quantum computing, electronic design, sorting, searching, game playing, etc. A common feature of both tasks is that they can be thought of as a supervised learning problem (see [5]) where the hypothesis class C is the search space described by the genotypes of the evolving structures. In the seventies the work by Vapnik and Chervonenkis ([6], [7], [8]) provided a remarkable family of bounds relating the performance of a learning machine to its capacity (see [9] for a modern presentation of the theory). The Vapnik-Chervonenkis dimension (VCD) is a measure of the capacity of a family of functions (or learning machines) as classifiers. The VCD depends on the class of classifiers. Hence, it does not make sense to calculate the VCD for GP in general; however, it does make sense if we choose a particular class of computer programs as classifiers. Our aim is to study in depth the formal properties of GP algorithms, focusing on the analysis of the classification complexity (VCD) of straight line programs.

3.1 Estimating the VC Dimension of Slp’s Parameterized by Real Numbers
The following definition of VC dimension is standard. See for instance [7].

Definition 2. Let $C$ be a class of subsets of a set $X$. We say that $C$ shatters a set $A \subset X$ if for every subset $E \subset A$ there exists $S \in C$ such that $E = S \cap A$. The VC dimension of $C$ is the cardinality of the largest set that is shattered by $C$.

Throughout this section we deal with concept classes $C_{k,n}$ such that concepts are represented by $k$ real numbers, $w = (w_1, \ldots, w_k)$, instances are represented by $n$ real numbers, $x = (x_1, \ldots, x_n)$, and the membership test to the family $C_{k,n}$ is expressed by a formula $\Phi_{k,n}(w, x)$ taking as inputs the pair concept/instance $(w, x)$ and returning the value 1 if "$x$ belongs to the concept represented by $w$" and 0 otherwise. We can think of $\Phi_{k,n}$ as a function from $\mathbb{R}^{k+n}$ to $\{0, 1\}$. So for each concept $w$, define:

\[ C_w := \{x \in \mathbb{R}^n : \Phi_{k,n}(w, x) = 1\}. \tag{1} \]
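Definition 2 can be explored by brute force on small finite examples. The following sketch is an illustration only, not part of the paper's argument: the function names, the finite domain and the two toy concept classes (thresholds and closed intervals on the real line, each concept being a membership test parameterized by real numbers $w$) are assumptions of this example.

```python
from itertools import combinations


def shatters(concepts, points):
    """True if the concepts realise all 2^|points| labelings of the points."""
    patterns = {tuple(bool(c(x)) for x in points) for c in concepts}
    return len(patterns) == 2 ** len(points)


def vc_dimension_on(concepts, domain):
    """Largest d such that some d-subset of the finite domain is shattered.

    The search can stop at the first size with no shattered subset, because
    every subset of a shattered set is itself shattered.
    """
    best = 0
    for d in range(1, len(domain) + 1):
        if any(shatters(concepts, subset) for subset in combinations(domain, d)):
            best = d
        else:
            break
    return best


domain = [0, 1, 2, 3]
ws = [-0.5, 0.5, 1.5, 2.5, 3.5]

# k = 1: concept w accepts x when x - w >= 0 (a single sign test).
thresholds = [lambda x, w=w: x >= w for w in ws]
# k = 2: concept (a, b) accepts x when a <= x <= b.
intervals = [lambda x, a=a, b=b: a <= x <= b for a in ws for b in ws]

print(vc_dimension_on(thresholds, domain))  # 1
print(vc_dimension_on(intervals, domain))   # 2
```

Such enumeration is only feasible for tiny classes; for families of slp's the number of concepts is infinite, which is why an analytic upper bound is sought in the remainder of the section.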
The goal is to obtain an upper bound on the VC dimension of the collection of sets

\[ C_{k,n} = \{C_w : w \in \mathbb{R}^k\}. \tag{2} \]

Now assume that the formula $\Phi_{k,n}$ is a boolean combination of $s$ atomic formulas, each of them having one of the following forms:

\[ \tau_i(w, x) > 0 \tag{3} \]

or

\[ \tau_i(w, x) = 0 \tag{4} \]

where $\{\tau_i(w, x)\}_{1 \le i \le s}$ are infinitely differentiable functions from $\mathbb{R}^{k+n}$ to $\mathbb{R}$. Next, make the following assumptions about the functions $\tau_i$. Let $\alpha_1, \ldots, \alpha_v \in \mathbb{R}^n$. Form the $sv$ functions $\tau_i(w, \alpha_j)$ from $\mathbb{R}^k$ to $\mathbb{R}$. Choose $\Theta_1, \ldots, \Theta_r$ among these, and define

\[ \Theta : \mathbb{R}^k \to \mathbb{R}^r \tag{5} \]

as

\[ \Theta(w) := (\Theta_1(w), \ldots, \Theta_r(w)). \tag{6} \]

Assume there is a bound $B$ independent of the $\alpha_i$, $r$ and $\epsilon_1, \ldots, \epsilon_r$ such that if $\Theta^{-1}(\epsilon_1, \ldots, \epsilon_r)$ is a $(k - r)$-dimensional $C^\infty$-submanifold of $\mathbb{R}^k$ then $\Theta^{-1}(\epsilon_1, \ldots, \epsilon_r)$ has at most $B$ connected components. With the above setup, the following result is proved in [10].

Theorem 1. The VC dimension $V$ of a family of concepts $C_{k,n}$ whose membership test can be expressed by a formula $\Phi_{k,n}$ satisfying the above conditions satisfies:

\[ V \le 2 \log_2 B + 2k \log_2(2es) \tag{7} \]
Next we state our main result concerning the VCD of a collection of subsets accepted by a family of slp's. We will say that a subset $C \subset \mathbb{R}^n$ is accepted by a slp $\Gamma$ if the function computed by $\Gamma$, $\Phi_\Gamma$, expresses the membership test to $C$. For slp's $\Gamma = (u_1, \ldots, u_l)$ of length $l$ accepting sets, we assume that the output is the last instruction $u_l$ and takes values in $\{0, 1\}$.

Theorem 2. Let $T = \{t_1, \ldots, t_n\}$ be a set of terminals and let $F = \{+, -, *, /, \mathrm{sign}\}$ be a set of functionals, where $\{+, -, *, /\}$ denotes the set of standard arithmetic operations and $\mathrm{sign}(x)$ is a function that outputs 1 if its input $x \in \mathbb{R}$ satisfies $x \ge 0$ and outputs 0 otherwise. Let $\Gamma_{n,L}$ be the collection of slp's $\Gamma$ over $F$ and $T$ using at most $L$ non-scalar operations (i.e. operations in $\{*, /, \mathrm{sign}\}$) and a free number of scalar operations (i.e. operations in $\{+, -\}$) whose output is obtained by applying the functional sign either to a previously computed result or to a terminal $t_j$, $1 \le j \le n$. Let $C_{n,L}$ be the class of concepts defined by the subsets of $\mathbb{R}^n$ accepted by some slp belonging to $\Gamma_{n,L}$. Then
\[ \mathrm{VC\text{-}dim}(C_{n,L}) \le 2(n+1)(n+L)L(2L + \log_2 L + 9) \tag{8} \]
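To get a feeling for how the bounds (7) and (8) grow, they can be evaluated numerically. The following is a minimal sketch; the function names are hypothetical and only transcribe the right-hand sides of the two inequalities.

```python
import math


def vc_bound_theorem1(B: float, k: int, s: int) -> float:
    """Right-hand side of (7): 2*log2(B) + 2*k*log2(2*e*s)."""
    return 2 * math.log2(B) + 2 * k * math.log2(2 * math.e * s)


def vc_bound_slp(n: int, L: int) -> float:
    """Right-hand side of (8): 2(n+1)(n+L)L(2L + log2(L) + 9)."""
    return 2 * (n + 1) * (n + L) * L * (2 * L + math.log2(L) + 9)


if __name__ == "__main__":
    # One terminal (n = 1) and a growing number of non-scalar operations.
    for L in (1, 2, 4, 8):
        print(L, vc_bound_slp(n=1, L=L))
```

For fixed n the bound (8) grows roughly like L^3, which is what makes the number of non-scalar operations a natural complexity parameter.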
Adaptation, Performance and VC Dimension of Straight Line Programs
Sketch of the proof. The first step in the proof consists of constructing a universal slp $\Gamma_U$, over sets $F_U$ and $T_U$, that parameterizes the elements of the family $\Gamma_{n,L}$. The definition of $F_U$ and $T_U$ depends only on the parameters $n$ and $L$ and will be clear after the construction. The key idea in the definition of $\Gamma_U$ is the introduction of a set of parameters $\alpha, \beta$ taking values in $\{0, 1\}^k$, for a suitable natural number $k$, such that each specialization of this set of parameters yields a particular slp belonging to $\Gamma_{n,L}$ and, conversely, each slp in $\Gamma_{n,L}$ can be obtained by specializing the parameters $\alpha, \beta$. For this purpose define $u_{-n+m} = t_m$ for $1 \le m \le n$. Note that any non-scalar assignment $u_i$, $1 \le i \le L$, in a slp $\Gamma$ belonging to $\Gamma_{n,L}$ is a function of $t = (t_1, \ldots, t_n)$ that can be parameterized as follows.
\begin{align}
u_i = U_i(\alpha, \beta)(t) = {} & \alpha^i_{-n} \Big( \sum_{j=-n+1}^{i-1} \alpha^i_j u_j \Big) * \Big( \sum_{j=-n+1}^{i-1} \beta^i_j u_j \Big) \tag{9} \\
& + (1 - \alpha^i_{-n}) \Big[ \beta^i_{-n} \frac{\sum_{j=-n+1}^{i-1} \alpha^i_j u_j}{\sum_{j=-n+1}^{i-1} \beta^i_j u_j} + (1 - \beta^i_{-n}) \, \mathrm{sgn}\Big( \sum_{j=-n+1}^{i-1} \beta^i_j u_j \Big) \Big], \tag{10}
\end{align}
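for some suitable values $\alpha = (\alpha^i_j)$, $\beta = (\beta^i_j)$, with $\alpha^i_j, \beta^i_j \in \{0, 1\}$. Let us consider the family of parametric slp's $\{\Gamma_{(\alpha,\beta)}\}$ where, for each $(\alpha, \beta)$, the slp $\Gamma_{(\alpha,\beta)} := (U_1(\alpha, \beta), \ldots, U_L(\alpha, \beta))$. Next replace the family of concepts $C_{n,L}$ with the class of subsets of $\mathbb{R}^n$, $C := \{C_{(\alpha,\beta)}\}$, where for each $(\alpha, \beta)$ the set $C_{(\alpha,\beta)}$ is given as follows.
\[ C_{(\alpha,\beta)} := \{ t = (t_1, \ldots, t_n) \in \mathbb{R}^n : t \text{ is accepted by } \Gamma_{(\alpha,\beta)} \} \tag{11} \]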
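The way a single parameterized instruction specializes to a product, a quotient or a sign test can be sketched in a few lines. This is only an illustration of formulas (9) and (10) as reconstructed here: the function name `universal_step` and the flat argument layout (selectors passed separately from the coefficient vectors) are assumptions made for readability, and the quotient is evaluated only when its coefficient is non-zero so that 0/1 specializations never divide by zero.

```python
from typing import Sequence


def sgn(x: float) -> int:
    """sign functional of Theorem 2: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0


def universal_step(sel_alpha: float, sel_beta: float,
                   alpha: Sequence[float], beta: Sequence[float],
                   u: Sequence[float]) -> float:
    """Evaluate one parameterized instruction U_i.

    sel_alpha, sel_beta play the role of alpha^i_{-n}, beta^i_{-n};
    alpha, beta hold the coefficients alpha^i_j, beta^i_j;
    u holds the previously available values u_{-n+1}, ..., u_{i-1}.
    """
    s_alpha = sum(a * v for a, v in zip(alpha, u))
    s_beta = sum(b * v for b, v in zip(beta, u))
    # Assumption: skip the division when the quotient branch is switched off.
    quotient_weight = (1 - sel_alpha) * sel_beta
    quotient = s_alpha / s_beta if quotient_weight != 0 else 0.0
    return (sel_alpha * s_alpha * s_beta
            + quotient_weight * quotient
            + (1 - sel_alpha) * (1 - sel_beta) * sgn(s_beta))


if __name__ == "__main__":
    u = [3.0, 2.0]                      # two terminals: t1 = 3, t2 = 2
    alpha, beta = [1, 1], [1, 0]        # sums: t1 + t2 and t1
    print(universal_step(1, 0, alpha, beta, u))  # product branch
    print(universal_step(0, 1, alpha, beta, u))  # quotient branch
    print(universal_step(0, 0, alpha, beta, u))  # sign branch
```

Note that the scalar operations never appear explicitly: they are absorbed into the 0/1 linear combinations, which is why only non-scalar operations are counted.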
In the new class $C$, the parameters $\alpha^i_j, \beta^i_j$ are allowed to take values in $\mathbb{R}$. Since $C_{n,L} \subset C$, it is enough to bound the VC dimension of $C$.

Claim A. The number of parameters $\alpha, \beta$ is exactly
\[ (n + 1)(n + L)L \tag{12} \]
Claim B. For each $i$, $1 \le i \le L$, the following holds:
(1) The function $U_i(\alpha, \beta)(t)$ is a piecewise rational function in the variables $\alpha, \beta, t$ of formal degree bounded by $3 \cdot 2^i - 2$.
(2) $U_i$ is defined up to a set of zero measure, and there is a partition of the domain of definition of $U_i$ by subsets $(\Omega^i_j)_{1 \le j \le n_i}$ with $n_i \le 2^i$ such that each $\Omega^i_j$ is defined by a conjunction of $i$ rational inequalities of the form $p \ge 0$ or $p < 0$ with degree $\deg p \le 3 \cdot 2^i - 2$. Moreover, the restriction of $U_i$ to the set $\Omega^i_j$, $U_i|_{\Omega^i_j}$, is a rational function of degree bounded by $3 \cdot 2^i - 2$.
(3) Condition $U_L(\alpha, \beta)(t) = 1$ can be expressed by a boolean formula of the following form:
\[ \bigvee_{1 \le i \le 2^L} \; \bigwedge_{1 \le j \le L} p_{i,j} \, \epsilon_{i,j} \, 0; \tag{13} \]
where for each $i, j$, $p_{i,j}$ is a rational function in the variables $\alpha, \beta, t$ of degree bounded by $3 \cdot 2^L - 2$ and $\epsilon_{i,j}$ is a sign condition in $\{\ge, <\}$.

[...] 1 and in T if k = 1. We keep homogeneous populations of equal length slp's. Next, we describe the recombination operator.

Definition 3. (slp-crossover) (see [2]) Let $\Gamma = \{u_1, \ldots, u_L\}$ and $\Gamma' = \{u'_1, \ldots, u'_L\}$ be two slp's over $F$ and $T$. First, a position $k$ in $\Gamma$ is randomly selected; $1 \le k \le L$. Let $S_{u_k} = \{u_{j_1}, \ldots, u_{j_m}\}$ be the piece of the code of $\Gamma$ related to the evaluation of $u_k$, with the assumption that $j_1 < \ldots < j_m$. Next, randomly select a position $t$ in $\Gamma'$ with $m \le t \le L$ and modify $\Gamma'$ by substituting the subset of instructions $\{u'_{t-m+1}, \ldots, u'_t\}$ in $\Gamma'$ by the instructions of $\Gamma$ in $S_{u_k}$, suitably renamed. The renaming function $R$ over $S_{u_k}$ is defined as $R(u_{j_i}) = u'_{t-m+i}$, for all $i \in \{1, \ldots, m\}$. With this process we obtain the first offspring from $\Gamma$ and $\Gamma'$. For the second offspring we symmetrically repeat this strategy, but now we begin by randomly selecting a position $k'$ in $\Gamma'$.

Example 2.
Let us consider the following slp's:
\[
\Gamma \equiv \begin{cases} u_1 := x + y \\ u_2 := u_1 * u_1 \\ u_3 := u_1 * x \\ u_4 := u_3 + u_2 \\ u_5 := u_3 * u_2 \end{cases}
\qquad
\Gamma' \equiv \begin{cases} u_1 := x * x \\ u_2 := u_1 + y \\ u_3 := u_1 + x \\ u_4 := u_2 * x \\ u_5 := u_1 + u_4 \end{cases}
\]
If $k = 3$ then $S_{u_3} = \{u_1, u_3\}$, and $t$ must be selected in $\{2, \ldots, 5\}$. Assuming that $t = 3$, the first offspring will be:
\[
\Gamma_1 \equiv \begin{cases} u_1 := x * x \\ u_2 := x + y \\ u_3 := u_2 * x \\ u_4 := u_2 * x \\ u_5 := u_1 + u_4 \end{cases}
\]
For the second offspring, if the selected position in $\Gamma'$ is $k' = 4$, then $S_{u_4} = \{u_1, u_2, u_4\}$. Now if $t = 5$, the offspring will be:
\[
\Gamma_2 \equiv \begin{cases} u_1 := x + y \\ u_2 := u_1 * u_1 \\ u_3 := x * x \\ u_4 := u_3 + y \\ u_5 := u_4 * x \end{cases}
\]
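The crossover of Definition 3 can be sketched as follows and checked against Example 2. The encoding is an assumption made for this illustration: an slp is a list of triples `(op, arg1, arg2)`, terminals are strings, and an integer argument `j` refers to the instruction $u_j$ (1-based, as in the text). The positions `k` and `t` are passed explicitly instead of being drawn at random, so that the example is reproducible.

```python
def dependencies(slp, k):
    """Sorted indices j1 < ... < jm of the instructions needed to evaluate u_k."""
    needed, stack = set(), [k]
    while stack:
        j = stack.pop()
        if j in needed:
            continue
        needed.add(j)
        _, a, b = slp[j - 1]
        stack.extend(arg for arg in (a, b) if isinstance(arg, int))
    return sorted(needed)


def slp_crossover(donor, receiver, k, t):
    """Copy the code of donor related to u_k into receiver, ending at position t."""
    s = dependencies(donor, k)
    m = len(s)
    if not m <= t <= len(receiver):
        raise ValueError("t must satisfy m <= t <= L")
    # Renaming function R(u_{j_i}) = u'_{t-m+i}
    rename = {j: t - m + i for i, j in enumerate(s, start=1)}
    child = list(receiver)
    for j in s:
        op, a, b = donor[j - 1]
        args = tuple(rename[x] if isinstance(x, int) else x for x in (a, b))
        child[rename[j] - 1] = (op, *args)
    return child


# The two parents of Example 2.
GAMMA = [("+", "x", "y"), ("*", 1, 1), ("*", 1, "x"), ("+", 3, 2), ("*", 3, 2)]
GAMMA_PRIME = [("*", "x", "x"), ("+", 1, "y"), ("+", 1, "x"), ("*", 2, "x"), ("+", 1, 4)]

if __name__ == "__main__":
    print(slp_crossover(GAMMA, GAMMA_PRIME, k=3, t=3))  # first offspring
    print(slp_crossover(GAMMA_PRIME, GAMMA, k=4, t=5))  # second offspring
```

Because the copied block is renamed as a whole and placed at positions that only refer backwards, the offspring is again a valid slp of the same length.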
Next we describe mutation. The first step when mutation is applied to a slp Γ consists of selecting an instruction ui ∈ Γ at random. Then a new random selection is made within the arguments of the function f ∈ F that constitutes the instruction ui . The final step is the substitution of the selected argument for another one in T ∪ {u1 , . . . , ui−1 } randomly chosen.
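A minimal sketch of this mutation step is given below, under the same hypothetical encoding as before (triples `(op, arg1, arg2)`, string terminals, integer references to earlier instructions). Two simplifications are assumptions of the sketch, not of the text: every instruction is treated as binary, and the replacement argument is drawn uniformly, so it may coincide with the argument it replaces.

```python
import random


def mutate(slp, terminals, rng):
    """Replace one randomly chosen argument of one randomly chosen instruction."""
    i = rng.randrange(1, len(slp) + 1)        # instruction u_i, 1-based
    op, a, b = slp[i - 1]
    position = rng.randrange(2)               # which argument to replace
    # Admissible arguments: T united with {u_1, ..., u_{i-1}}
    candidates = list(terminals) + list(range(1, i))
    args = [a, b]
    args[position] = rng.choice(candidates)
    child = list(slp)
    child[i - 1] = (op, *args)
    return child


if __name__ == "__main__":
    parent = [("+", "x", "y"), ("*", 1, 1), ("*", 1, "x"), ("+", 3, 2), ("*", 3, 2)]
    print(mutate(parent, terminals=("x", "y"), rng=random.Random(0)))
```

Restricting the candidates to terminals and earlier instructions keeps the mutated program acyclic, which is what makes it a straight line program.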
J.L. Montaña et al.

4.1 Fitness Based on Structural Risk Minimization
Under Structural Risk Minimization, a set of possible models C forms a nested structure C1 ⊂ C2 ⊂ · · · ⊂ CL ⊂ · · · ⊂ C. Here CL represents the set of models of complexity L, where L is some suitable parameter depending on the problem. We require that the VC-dimension VL of model CL is an increasing function of L. In our particular case, C is the class of slp's having length bounded by some constant l with set of terminals T := {t1, . . . , tn} and set of functionals F = {+, −, ∗, /, sign}, and CL is the class of slp's in C which use at most L non-scalar operations. In this situation one chooses the model that minimizes the right side of Equation 17. For practical use of Equation 17 we adopt the following formula with appropriately chosen practical values of theoretical constants (see [12] for the derivation of this formula).
\[ \varepsilon_m(h) \left( 1 - \sqrt{ p(h) - p(h) \ln p(h) + \frac{\ln m}{2m} } \right)^{-1}, \tag{18} \]
where $\varepsilon_m(h)$ means empirical risk, $p(h) = \frac{h}{m}$, and $h$ is a function such that for each SLP $\Gamma$ of length bounded by $l$ with set of terminals $T$ and set of functionals $F$ the following holds: $h(\Gamma) = V_L$ where $L = \min\{k : \Gamma \in C_k\}$; that is, $h$ is the VC-dimension of the class of models using as many non-scalar operations as $\Gamma$.
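Formula (18) can be turned into a fitness function in a few lines. The sketch below assumes the square-root form of the penalty from [12]; the function name is hypothetical, the value of h is supplied by the caller (the text only says that practical values of the constants are chosen), and returning infinity when the penalty factor is not positive is a convention adopted here, not something stated in the text.

```python
import math


def vc_fitness(empirical_risk: float, h: float, m: int) -> float:
    """Penalized risk of formula (18) for complexity h > 0 and sample size m."""
    p = h / m
    radicand = p - p * math.log(p) + math.log(m) / (2 * m)
    if radicand < 0:
        return math.inf          # assumption: treat as an inadmissible model
    denominator = 1 - math.sqrt(radicand)
    if denominator <= 0:
        return math.inf          # assumption: penalty factor must stay positive
    return empirical_risk / denominator


if __name__ == "__main__":
    # Same empirical risk, increasing complexity, m = 20 sample points.
    for h in (1, 2, 4, 8):
        print(h, vc_fitness(0.1, h, 20))
```

With a fixed empirical risk, the penalized value increases with h as long as h < m, so among equally accurate slp's the one with fewer non-scalar operations is preferred.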
4.2 Experimentation
We consider instances of Symbolic Regression for our experimentation. The symbolic regression problem has been approached by Genetic Programming in several contexts. Usually, in this paradigm a population of tree-like structures encoding expressions is evolved. We adopt slp's as the structures that evolve within the process. We keep homogeneous populations of equal length individuals throughout the process.

Experiment 1. We compare four crossover methods: uniform, one-point crossover, two-point crossover and slp-crossover as defined in Definition 3. For this purpose we have performed 200 executions of the algorithm for each instance and crossover operator. We have run our implementation of GP with straight line programs on the following three target functions:
\[ f_1(x) = x^4 + x^3 + x^2 + x; \qquad f_2(x) = \cos(2x); \qquad f_3(x) = 2.718\,x^2 + 3.1416\,x \]
These functions are also proposed in the book by John Koza as illustrative examples of the tree-based GP approach ([13]). The sample set contains only 20 points for each of the three target functions. The following values for the control parameters are the same for all tested functions: population size M = 500; number of generations G = 50; probability of crossover pc = 0.9; probability of mutation pm = 0.01; length of the homogeneous population of slp's L = 20. The set of functions is F = {+, −, ∗, //} for f1 and f3 and F = {+, −, ∗, //, sin} for
f2. In the above sets, "//" indicates the protected division, i.e. x//y returns x/y if y ≠ 0 and 1 otherwise. The basic set of terminals is T = {x, 1}. We include in T a constant c0 for the target function f3. The constant c0 takes random values in the interval [−1, 1] and is fixed before each execution. In Table 2 we show the corresponding success rates for each crossover method and target function. The success rate (SR) is defined as the ratio of the successful runs to the total number of runs. In our case a run is considered successful if an individual with an empirical risk lower than 0.2 is found. The threshold of 0.2 is obtained by multiplying the number of test points, 20, by 0.01, which is the maximum error allowed to consider a point as a success point. We can see that uniform crossover and slp-crossover are the best recombination operators for the studied target functions. On the other hand, one-point crossover is the worst of the studied recombination operators. Note that uniform crossover and two-point crossover are not defined for the tree structure.

Experiment 2. This experiment was designed to test the effectiveness of VC regularization of the slp model using the fitness based on Structural Risk Minimization as given by the formula in Equation 18 (VC-fitness) instead of fitness based on Empirical Risk Minimization (ERM-fitness). We first describe an experimental comparison using the methodology and data sets from [12], starting with the experimental setup.

Target functions. The following target functions were used.

Discontinuous piecewise polynomial function, defined as follows:
\[ g_1(x) = \begin{cases} 4x^2(3 - 4x), & x \in [0, 0.5], \\ (4/3)\,x(4x^2 - 10x + 7) - 3/2, & x \in (0.5, 0.75], \\ (16/3)\,x(x - 1)^2, & x \in (0.75, 1]. \end{cases} \]
Sine-square function: $g_2(x) = \sin^2(2\pi x)$, $x \in [0, 1]$.
Two-dimensional sinc function: $g_3(x) = \dfrac{\sin\sqrt{x_1^2 + x_2^2}}{x_1^2 + x_2^2}$, $x_1, x_2 \in [-5, 5]$.
Polynomial function: $g_4(x) = x^4 + x^3 + x^2 + x$, $x \in [0, 1]$.

Estimators used.
We use slp's of fixed length l with functionals in {+, −, ∗, /} for target functions g2, g3, g4. For target function g1 we add the sign operator. In the experiments we set l = 6, 10, 20 and 40. The model complexity is measured by the number of non-scalar operations in the considered slp, that is, the number of instructions which do not contain {+, −}-operators. This is a measure of the non-linearity of the regressor as suggested by Theorem 2. Each estimator has been obtained by running a GP algorithm with the following values for the control parameters: population size M = 500 or M = 100; number of generations G = 50; probability of crossover pc = 0.9; probability of mutation pm = 0.01. In all trials, slp-crossover, as described in Definition 3, was used.
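The complexity measure used here is easy to compute from the code of an slp. A minimal sketch, assuming a hypothetical encoding in which each instruction is a tuple whose first component is the operator symbol:

```python
# Operators counted as non-scalar, following Theorem 2.
NON_SCALAR = {"*", "/", "sign"}


def non_scalar_count(slp) -> int:
    """Number of instructions whose operator is not + or -."""
    return sum(1 for op, *_ in slp if op in NON_SCALAR)


if __name__ == "__main__":
    slp = [("+", "x", "y"), ("*", 1, 1), ("*", 1, "x"), ("+", 3, 2), ("*", 3, 2)]
    print(non_scalar_count(slp))
```

This count is the parameter L that indexes the nested classes CL of Section 4.1, so it is the quantity from which h in formula (18) is derived.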
Experimentation procedure. A training set of fixed size n is generated. The x-values follow a uniform distribution in the input domain. The prediction risk $\varepsilon_{n_{test}}$ of the model chosen by the GP algorithm based on SLP evolution is estimated by the mean square error (MSE) between the model (estimated from training data) and the true values of the target function g(x) for independently generated test inputs $(x_i, \hat{y}_i)_{1 \le i \le n_{test}}$, i.e.:
\[ \varepsilon_{n_{test}} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} (g(x_i) - \hat{y}_i)^2 \tag{19} \]
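Equation (19) amounts to a plain mean square error over a fresh test sample. The sketch below uses the sine-square target g2 and a constant model as a stand-in for an evolved slp; the helper names and the fixed seed are assumptions of this illustration.

```python
import math
import random


def g2(x: float) -> float:
    """Sine-square target function on [0, 1]."""
    return math.sin(2 * math.pi * x) ** 2


def prediction_risk(model, target, xs) -> float:
    """Mean square error between target(x_i) and the prediction model(x_i)."""
    return sum((target(x) - model(x)) ** 2 for x in xs) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    test_inputs = [rng.uniform(0.0, 1.0) for _ in range(200)]  # n_test = 200
    print(prediction_risk(lambda x: 0.5, g2, test_inputs))
```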
Table 1. Prediction Risk: VC-fitness vs. ERM-fitness

                    population size 100                 population size 500
                g1      g2      g3      g4          g1      g2      g3      g4
length 6
  ERM-fitness   0.1559  0.1502  0.0456  0.1054      0.0670  0.1501  0.0456  0.1054
  VC-fitness    0.0396  0.1502  0.0456  0.1019      0.0706  0.1443  0.0456  0.1019
  Comparative   3.9368  1       1       1.0343      0.9490  1.0401  1       1.0343
length 10
  ERM-fitness   0.1559  0.2068  0.0456  0.1054      0.0396  0.1502  0.0456  0.0377
  VC-fitness    0.1559  0.1434  0.0456  0.1791      0.0396  0.1502  0.0456  0.1054
  Comparative   1       1.4421  1       0.5884      1       1       1       0.3576
length 20
  ERM-fitness   0.1566  0.1324  0.0456  0.4982      0.0396  0.1396  0.0456  0.0870
  VC-fitness    0.1559  0.2068  0.0456  0.1852      0.0661  0.1827  0.0456  0.0633
  Comparative   1.0004  0.6402  1       2.6900      0.5990  0.7640  1       1.3744
length 40
  ERM-fitness   0.1029  0.1439  0.0456  0.1357      0.0745  0.1502  0.0456  0.2287
  VC-fitness    0.1559  0.2068  0.0456  0.5709      0.0326  0.1502  0.0456  0.0805
  Comparative   0.6600  0.6958  1       0.2376      2.2852  1       1       2.8409
The above experimental procedure is repeated 100 times using 100 different random realizations of n training samples (from the same statistical distribution). Experiments were performed using a small training sample (n = 20) and a large test set (ntest = 200). Table 1 shows comparison results for fitness based on ERM (ERM-fitness) and fitness based on VC-regularization (VC-fitness). Experiments for each target function g = gi, 1 ≤ i ≤ 4, are divided into four groups corresponding to different values of the length of the evolved slp's (l = 6, l = 10, l = 20, l = 40). For each length, two possible population sizes are considered (100 and 500 individuals). For each population size, two fitness functions are considered: ERM-fitness and VC-fitness. The values in the fitness rows represent the estimated prediction error of the selected model. The values in the comparative rows represent the ratio between the prediction risk for the regressor
Table 2. SR over 200 independent runs for the crossover operators

Function   1-point   2-point   uniform   slp
f1         100       100       100       100
f2          30        35        53        53
f3          42        71        91        93
obtained using the ERM-fitness and the corresponding value for the regressor obtained using the VC-fitness. Accordingly, the values in the comparative rows that are greater than or equal to 1 represent a better (or at least equal) performance of VC-fitness. If we consider an experiment successful when the comparative value is ≥ 1, then the success rate is greater than 70%. If we consider an experiment strictly successful when the comparative value is > 1, then the strict success rate is greater than 30%.
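The two percentages quoted above can be recomputed directly from the 32 comparative values reported in Table 1. The sketch below lists them in the order in which they appear in the table data; only the counting is of interest, so the order does not matter.

```python
# The 32 comparative values (ERM risk / VC risk) reported in Table 1.
COMPARATIVE = [
    3.9368, 1, 1, 1, 1.4421, 1, 1.0004, 0.6402,
    1, 1.0343, 0.9490, 1.0401, 1, 1.0343, 1, 1,
    1, 0.3576, 0.5990, 0.7640, 1, 1.3744, 2.2852, 1,
    1, 2.8409, 0.5884, 2.6900, 0.6600, 0.6958, 1, 0.2376,
]

successes = sum(1 for c in COMPARATIVE if c >= 1)
strict_successes = sum(1 for c in COMPARATIVE if c > 1)

if __name__ == "__main__":
    total = len(COMPARATIVE)
    print(f"success rate: {successes}/{total} = {successes / total:.2%}")
    print(f"strict success rate: {strict_successes}/{total} = {strict_successes / total:.2%}")
```

This gives 23 of 32 experiments (about 71.9%) with a comparative value of at least 1, and 10 of 32 (about 31.3%) with a value strictly above 1, in agreement with the rates stated in the text.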
5 Conclusions and Future Research
We have calculated a sharp bound for the VC dimension of the GP genotype defined by computer programs using straight line code. We have used this bound to perform VC-based model selection under the GP paradigm, showing that this model selection method consistently outperforms LGP algorithms based on empirical risk minimization. A next step in our research is to compare VC-regularization of slp's with other regularization methods based on asymptotic analysis, such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). A second goal in our research on SLP-based GP is to study the experimental behavior of the straight line program computation model under Vapnik-Chervonenkis regularization but without assuming previous knowledge of the length of the structure. This investigation is crucial in practical applications for which the GP machine must be able to learn not only the shape but also the length of the evolved structures. To this end new recombination operators must be designed, since the crossover procedure employed in this paper only applies to populations having fixed length chromosomes.
Acknowledgements. This work was partially supported by Spanish Grants TIN2007-67466-C02-02, MTM2007-62799 and the FPU program.
References

1. Giusti, M., Heintz, J., Morais, J.E., Pardo, L.M.: Straight line programs in geometric elimination theory. Journal of Pure and Applied Algebra 124, 121–146 (1997)
2. Alonso, C.L., Montaña, J.L., Puente, J.: Straight line programs: a new Linear Genetic Programming Approach. In: Proc. 20th IEEE International Conference on Tools with Artificial Intelligence, ICTAI, pp. 517–524 (2008)
3. Brameier, M., Banzhaf, W.: Linear Genetic Programming. Springer, Heidelberg (2007)
4. Aldaz, M., Heintz, J., Matera, G., Montaña, J.L., Pardo, L.: Time-space tradeoffs in algebraic complexity theory. Journal of Complexity 16, 2–49 (1998)
5. Teytaud, O., Gelly, S., Bredeche, N., Schoenauer, M.: A Statistical Learning Theory Approach of Bloat. In: Proceedings of the 2005 conference on Genetic and Evolutionary Computation, pp. 1784–1785 (2005)
6. Vapnik, V., Chervonenkis, A.: Ordered risk minimization. Automation and Remote Control 34, 1226–1235 (1974)
7. Vapnik, V.: Statistical learning theory. John Wiley & Sons, Chichester (1998)
8. Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16, 264–280 (1971)
9. Lugosi, G.: Pattern classification and learning theory. In: Principles of Nonparametric Learning, pp. 5–62. Springer, Heidelberg (2002)
10. Karpinski, M., Macintyre, A.: Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. J. Comp. Sys. Sci. 54, 169–176 (1997)
11. Milnor, J.: On the Betti Numbers of Real Varieties. Proc. Amer. Math. Soc. 15, 275–280 (1964)
12. Cherkassky, V., Ma, Y.: Comparison of Model Selection for Regression. Neural Computation 15, 1691–1714 (2003)
13. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge (1992)
A Statistical Learning Perspective of Genetic Programming

Nur Merve Amil, Nicolas Bredeche, Christian Gagné*, Sylvain Gelly, Marc Schoenauer, and Olivier Teytaud

TAO, INRIA Saclay, LRI, Bat. 490, Université Paris-Sud, 91405 Orsay CEDEX, France
(*) LVSN, GEL-GIF, Univ. Laval, Québec, Canada, F1V0A6
Abstract. This paper proposes a theoretical analysis of Genetic Programming (GP) from the perspective of statistical learning theory, a well-grounded mathematical toolbox for machine learning. By computing the Vapnik-Chervonenkis dimension of the family of programs that can be inferred by a specific setting of GP, it is proved that a parsimonious fitness ensures universal consistency. This means that empirical error minimization allows convergence to the best possible error when the number of test cases goes to infinity. However, it is also proved that the standard method consisting in putting a hard limit on the program size still results in programs whose size increases without bound as a function of their accuracy. It is also shown that cross-validation or hold-out for choosing the complexity level that optimizes the error rate in generalization also leads to bloat. So a more complicated modification of the fitness is proposed in order to avoid unnecessary bloat while nevertheless preserving universal consistency.
1 Introduction
This paper is about two important issues in Genetic Programming (GP), namely Universal Consistency (UC) and code bloat. UC consists in the convergence to the optimal error rate with regard to an unknown distribution of examples. A restricted version of UC is consistency, which focuses on the convergence to the optimal error rate within a restricted search space. Both UC and consistency are well studied in the field of statistical learning theory. Despite their possible benefits, they have not been widely studied in the field of GP. Code bloat is the uncontrolled growth of program size that may occur in GP when relying on a variable length representation [10,11]. This has been identified as a key problem in GP for which there have been several empirical studies. However, very few theoretical studies have addressed this issue directly. The work presented in this paper is intended to provide some theoretical insights on the bloat phenomenon and its link with UC in the context of GP-based learning, taking a statistical learning theory perspective [23]. Statistical learning theory provides several theoretical tools to analyze some aspects of learning accuracy. Our main objective consists in performing an in-depth analysis of bloat as well as providing appropriate solutions to avoid it. Section 2 briefly presents issues of code bloat with GP. Sections 3 and 4 present all the aforementioned results about code bloat avoidance and UC and propose a new approach ensuring both. Then, Section 5 provides some extensions of the previous theoretical results on the use of cross-validation and hold-out methodologies. Some experimental results follow in Section 6, illustrating the accuracy of the theoretical results. Section 7 finally concludes this paper with a discussion of the consequences of those theoretical results for GP practitioners and uncovers some perspectives for future work.

L. Vanneschi et al. (Eds.): EuroGP 2009, LNCS 5481, pp. 327–338, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Code Bloat in GP
Due to length constraints, we do not introduce here some important theories around code bloat: introns, fitness causes bloat, and removal bias. The reader is referred to [1,3,11,13,16,17,21] for more information. Some common solutions against bloat rely either on specific operators (e.g. size-fair crossover [12], or different fair mutation [14]), on some parsimony-based penalization of the fitness [22] or on an abrupt limitation of the program size such as the one originally used by Koza [10]. Also, some multi-objective approaches have been proposed [2,5,7,15,19]. Some other more particular solutions have been proposed but are not widely used yet [18,24]. Also, all proofs are removed due to length constraints. Readers familiar with the mathematics of e.g. [6] should however be able to guess the main ideas.

Although code bloat is not clearly understood, it is possible to distinguish at least two kinds of code bloat. We first define structural bloat as the code bloat that necessarily takes place when no optimal solution can be approximated by a set of programs with bounded length. In such a situation, optimal solutions of increasing accuracy will also exhibit an increasing complexity (larger programs), as larger and larger code will be generated in order to better approximate the target function. This extreme case of structural bloat has also been demonstrated in [9]. The authors use some polynomial functions of increasing difficulty, and demonstrate that a precise fit can only be obtained through an increased bloat (see also [4] for related issues about problem complexity in GP). Another form of bloat is functional bloat, which takes place when program length keeps on growing even though an optimal solution (of known complexity) does lie in the search space.
In order to clarify this point, let us use a simple symbolic regression problem defined as follows: given a set S of test cases, the goal is to find a function f (here, a GP-tree) that minimizes the Mean Square Error (or MSE). If we intend to approximate a polynomial (e.g. $14x^2$ with $x \in [0, 1]$), we may observe code bloat since it is possible to find arbitrarily long polynomials that give the exact solution (e.g. $14x^2 + 0 \cdot x^3 + \ldots$), or sequences of polynomials of length growing to $\infty$ and accuracy converging to the optimal accuracy (e.g. $P_n(x) = 14x^2 + \sum_{i=1}^{n} \frac{1}{n!\,i!} x^i$). Most of the works cited earlier are in fact concerned with functional bloat, which is the most surprising, and the most disappointing, kind of bloat. We will consider various levels of functional bloat: cases where length of programs found by GP runs to infinity as the number of test cases
runs to infinity whereas a bounded-length solution exists, and also cases where large programs are found with high probability by GP whereas a small program is optimal.

Another important issue is to study the convergence of the function given by GP toward the actual function used to generate the test cases, under some sufficient conditions and when the number of test cases goes to infinity. This property is known in statistical learning as Universal Consistency (UC). Note that this notion is slightly different from that of universal approximation, commonly referred to in symbolic regression, where GP search using operators {+, ∗} is assumed to be able to approximate any continuous function. UC is rather concerned with the behavior of the algorithm when the number of test cases goes to infinity: the existence of a polynomial that approximates a given function at any arbitrary precision does not imply that any polynomial approximation built from a set of sample points will converge to that given function when the number of points goes to infinity. More precisely, UC can be stated informally as follows (a formal definition will be given later). A GP setting corresponds to symbolic regression from examples if it takes as inputs a finite number of examples $x_1, \ldots, x_n$ with their associated labels $y_1, \ldots, y_n$ and outputs a program $P_n$. Universal consistency holds if, when pairs $(x_1, y_1), \ldots, (x_n, y_n)$ (test cases) are independently and identically distributed as the random variable $(x, y)$, $L(P_n) \to L^*$, where $L(p) = \Pr(y \ne p(x))$ and where $L^* = \inf_{p \text{ measurable}} L(p)$. Throughout this paper, Pr(.) denotes probabilities, as the traditional notation P(.) is used for programs.
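The polynomial example of functional bloat given earlier in this section can be checked numerically. The sketch below assumes the sequence reads $P_n(x) = 14x^2 + \sum_{i=1}^{n} x^i/(n!\,i!)$, and measures the largest deviation from the target $14x^2$ on a grid of [0, 1]; the function names and the grid size are choices made for this illustration.

```python
import math


def p_n(x: float, n: int) -> float:
    """Polynomial with n extra terms whose coefficients vanish as n grows."""
    extra = sum(x ** i / (math.factorial(n) * math.factorial(i))
                for i in range(1, n + 1))
    return 14 * x ** 2 + extra


def sup_error(n: int, grid: int = 101) -> float:
    """Largest absolute deviation from 14*x^2 on an evenly spaced grid of [0, 1]."""
    points = [k / (grid - 1) for k in range(grid)]
    return max(abs(p_n(x, n) - 14 * x ** 2) for x in points)


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 8):
        print(n, sup_error(n))
```

The number of terms grows with n while the error shrinks towards 0: accuracy converges to the optimum although a three-symbol program already achieves it exactly.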
3 Negative Results without Regularization and Resampling
Definition 1 precisely defines the space of programs under examination. Theorem 1 evaluates its VC-dimension [23]. Many theorems in the sequel are based only on VC-dimensions and hold for other sets of programs as well. Note how mild the hypotheses behind our results are: we consider any programs of bounded length, working with real variables, provided that the computation time is a priori bounded. Usual families of programs in GP verify this hypothesis and much stronger hypotheses. For example, usual tree-based representations avoid loops, and therefore all quantities that have to be bounded in the lemma below (typically, the number of times each operator is used) are bounded for trees of bounded depth. This is also true for directed acyclic graphs. We deal here with a very general case; much better constants can be derived for specific cases, without changing the fundamental results in the sequel of the paper.

Definition 1 (Set of programs studied). Let F(n, t, q, m, z) be the set of functions from $\mathbb{R}^{z-m}$ towards {0, 1} which can be computed by a program with a maximum of n lines as follows:
(1) A run uses at most t operations.
(2) Each line contains one operation among the following:
– Operations α → exp(α) (at most q times);
– Operations +, −, ×, and /;
– Jumps conditioned on >, ≥, [...]

[...] $L^*$. Then, $L(P_s) \to L^* \implies V(P_s) \to \infty$.
Interpretation. This is structural bloat: if the space of programs approximates but does not contain the optimal function and cannot approximate it within bounded size, then bloat occurs. Note that for any $F_1, F_2, \ldots$, the assumption $\forall V,\ L_V > L^*$ holds simultaneously for all $V$ for many distributions, as we consider countable unions of families with finite VC-dimension (e.g. see [6, chap. 18]). We now show that, even in cases in which an optimal short program exists, the usual procedure (known as the method of Sieves; see also [20]) defined below, consisting in defining a maximum VC-dimension depending upon the sample size and then using a family of functions accordingly, leads to bloat.

Theorem 3 (Bloat with the method of Sieves). Let $F_1, \ldots, F_k, \ldots$ be nonempty sets of functions with finite VC-dimensions $V_1, \ldots, V_k, \ldots$, and let $F = \cup_n F_n$. Then, given $s$ i.i.d. test cases, consider $\hat{P} \in F_s$ minimizing the empirical risk $\hat{L}$ in $F_s$. From theorems about the method of Sieves, we already know that if $V_s = o(s/\log(s))$ and $V_s \to \infty$, then $\Pr\big(L(\hat{P}) \le \hat{L}(\hat{P}) + \epsilon(s, V_s, \delta)\big) \ge 1 - \delta$ and almost surely $L(\hat{P}) \to \inf_{P \in F} L(P)$. We now state that if $V_s \to \infty$, and noting $V(P) = \min\{V_k ; P \in F_k\}$, then $\forall V_0, \delta_0 > 0$, $\exists \Pr$, a probability distribution on $X$ and $Y$, such that $\exists g \in F_1$ such that $L(g) = L^*$, and for $s$ sufficiently large $\Pr\big(V(\hat{P}) \le V_0\big) \le \delta_0$.
Interpretation. The result in particular implies that for any $V_0$, there is a distribution of test cases such that $\exists g$ with $V(g) = V_1$ and $L(g) = L^*$ and, with probability 1, $V(\hat{P}) \ge V_0$ infinitely often as $s$ increases. This shows that bloat can occur if we use only an abrupt limit on code size, if this limit depends upon the number of test cases (a fortiori if there is no limit). Note that this result, proved thanks to a particular distribution, could indeed be proved for the whole class of classification problems for which the conditional probability of $Y = 1$ (conditionally to $X$) is equal to $\frac{1}{2}$ in an open subset of the domain.
4 Universal Consistency without Bloat
In this section, we consider a more complicated case where the goal is to ensure UC while simultaneously avoiding unnecessary bloat. This means that an optimal program does exist in a given family of functions and convergence towards the minimal error rate is performed without increasing the program complexity. This is achieved by: i) merging regularization and bounding of the VC-dimension, and ii) penalization of the complexity (i.e. length) of programs by a penalty term $R(s, P) = R(s)R'(P)$ depending upon the sample size and the program. $R(\cdot, \cdot)$ is user-defined and the algorithm looks for a classifier with a small value of both $R'$ and $L$. In the following, we study both the UC of this algorithm (i.e. $L \to L^*$) and the no-bloat theorem (i.e. $R' \to R'(P^*)$ when $P^*$ exists). Note that the bound $V_s = o(\log(s))$ is much stronger than the usual limit used in the method of Sieves (see Theorem 3).

Theorem 4 (No-bloat theorem). Let $F_1, \ldots, F_k, \ldots$ with finite VC-dimensions $V_1, \ldots, V_k, \ldots$. Let $F = \cup_n F_n$. Define $V(P) = V_k$ with $k = \inf\{t \mid P \in F_t\}$. Define $L_V = \inf_{P \in F_V} L(P)$. Consider $V_s = o(\log(s))$ and $V_s \to \infty$. Consider also that $\hat{P}_s$ minimizes $\tilde{L}(P) = \hat{L}(P) + R(s, P)$ in $F_s$, and assume that $R(s, \cdot) \ge 0$. Assume that $\sup_{P \in F_{V_s}} R(s, P) = o(1)$. Then, $L(\hat{P}_s) \to \inf_{P \in F} L(P)$ almost surely. Note that for a well-chosen family of functions, $\inf_{P \in F} L(P) = L^*$. Moreover, assume that $\exists P^* \in F_{V^*}$ such that $L(P^*) = L^*$. With $R(s, P) = R(s)R'(P)$ and with $R'(s) = \sup_{P \in F_{V_s}} R'(P)$, we get the following results:
1. Non-asymptotic no-bloat theorem: For any $\delta \in ]0, 1]$, $R'(\hat{P}_s) \le R'(P^*) + (1/R(s))\,2\epsilon(s, V_s, \delta)$ with probability at least $1 - \delta$. This result is in particular interesting for $\epsilon(s, V_s, \delta)/R(s) \to 0$.
2. Almost-sure no-bloat theorem: If for some $\alpha > 0$, $R(s)s^{(1-\alpha)/2} = O(1)$, then almost surely $R'(\hat{P}_s) \to R'(P^*)$, and if $R'(P)$ has discrete values (such as the number of instructions in $P$ or many complexity measures for programs) then for $s$ sufficiently large, $R'(\hat{P}_s) = R'(P^*)$;
3. Convergence rate: For any $\delta \in ]0, 1]$, with probability at least $1 - \delta$,
\[ L(\hat{P}_s) \le \inf_{P \in F_{V_s}} L(P) + \underbrace{R(s)R'(s)}_{=o(1) \text{ by hypothesis}} + 2\epsilon(s, V_s, \delta), \]
where $\epsilon(s, V, \delta) = \sqrt{\dfrac{4 - \log(\delta/(4s^{2V}))}{2s - 4}}$.
Interpretation. Combining a code limitation and a penalization leads to UC without bloat.
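The two ingredients of Theorem 4 can be sketched numerically: the deviation term ε(s, V, δ) and the choice of the program minimizing the penalized empirical error. The code below assumes the square-root form of ε; the candidate list, the names and the penalty schedule R(s) = s^(-1/4) are hypothetical and serve only as an illustration.

```python
import math


def epsilon(s: int, V: float, delta: float) -> float:
    """sqrt((4 - log(delta / (4 * s**(2V)))) / (2s - 4)), computed in log space."""
    log_term = math.log(delta) - math.log(4) - 2 * V * math.log(s)
    return math.sqrt((4 - log_term) / (2 * s - 4))


def penalized_choice(candidates, s, R):
    """Pick the candidate minimizing empirical error + R(s) * R'(P).

    candidates: list of (name, empirical_error, complexity) triples,
    where complexity plays the role of R'(P), e.g. a number of instructions.
    """
    return min(candidates, key=lambda c: c[1] + R(s) * c[2])


if __name__ == "__main__":
    print(epsilon(1000, 3, 0.05))
    programs = [("short", 0.10, 3), ("long", 0.10, 50), ("longer", 0.09, 400)]
    print(penalized_choice(programs, s=1000, R=lambda s: s ** -0.25))
```

With equal or nearly equal empirical errors, the penalty term decides in favour of the shorter program, which is the mechanism behind the no-bloat statements.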
5 Some Negative Results with Subsampling: Hold-Out or Cross-Validation
When one tries to learn a relation between x and y, the "true" cost function (typically the mean squared error of a given approximate relation, which is an expectation under some usually unknown law of probability) is generally not available. It is usually replaced by its empirical mean on a finite sample. Minimizing this empirical mean is natural, but this can be done over various families of functions (e.g. trees with depth 1, 2, 3, and so on). Choosing between these various levels is hard. Typically, the empirical mean decreases as the complexity is increased, but this decrease is not generally a decrease of the generalization error, as trees of larger depth usually have a very bad generalization error due to overfitting. Therefore, the problem is somewhat multi-objective: there is a conflict between the empirical error and the complexity level. This multi-objective optimization setting has been studied in [2,5,7]. This section is devoted to hold-out and cross-validation as tools for UC without bloat.

First, let us consider hold-out for choosing the complexity level. Consider $X_0, \ldots, X_N, Y_0, \ldots, Y_N$, $2(N + 1)$ samples (each of them consisting of $n$ examples, i.e. $X_i = (X_{i,1}, X_{i,2}, \ldots, X_{i,n})$ and $Y_i = (Y_{i,1}, Y_{i,2}, \ldots, Y_{i,n})$), the $X_i$'s being learning sets and the $Y_i$'s being (hold-out) test sets. Consider that the function can be chosen in many complexity levels, $F_0 \subset F_1 \subset F_2 \subset F_3 \subset \ldots$, where $F_0$ is non-empty and $F_i \ne F_{i+1}$. Note $\hat{L}_k(f)$ the error rate of the function $f$ in the set $X_k$ of examples: $\hat{L}_k(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, X_{k,i})$, where $l(f, x) = 1$ if $f$ fails on $x$ and 0 otherwise. Define $f_k = \arg\min_{F_k} \hat{L}_k(\cdot)$. In hold-out, after the complete learning, the resulting classifier is $f_{k^*(n)}$, where $k^*(n) = \arg\min_{k \le N(n)} l_k$ and $l_k = \frac{1}{n} \sum_{i=1}^{n} l(f_k, Y_{k,i})$. In the sequel, we assume that $f \in F_k \Rightarrow 1 - f \in F_k$ and that $VCdim(F_k) \to \infty$ as $k \to \infty$.
Hold-out can be organized in two ways:
– Greedy case: all $X_k$’s and $Y_k$’s are independent; each complexity level $F_k$ is tested separately, with its own learning set $X_k$ and test set $Y_k$.
– Case with pairing: $X_0$ is independent of $Y_0$, and $\forall k$, $X_k = X_0$ and $Y_k = Y_0$; the same learning set and the same test set are used for all complexity levels. This case is far more usual.
Theorem 5 (No bloat avoidance with greedy hold-out). Consider greedy hold-out for choosing between complexity levels $0, 1, \dots, N(n)$. If $N(n)$ is a constant, then for some distribution of examples, $\forall k \in [0, N]$, $P(k^*(n) = k) \to 1/(N+1)$. If $N(n) \to \infty$ as $n \to \infty$, then for some distribution of examples such that an optimal function lies in $F_0$, greedy hold-out leads to $k^*(n) \to \infty$ as $n \to \infty$ and therefore $\limsup_{n\to\infty} k^*(n) = \infty$.
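The hold-out selection rule with pairing can be sketched in a few lines. This is only an illustration under stated assumptions, not the setting of the proofs: the nested classes are taken to be histogram classifiers on [0, 1) with 2^k bins (so they are nested, closed under complement, and of growing VC-dimension), labels are in {0, 1}, and the names `make_sample`, `erm`, `error_rate` and `paired_hold_out` are hypothetical.

```python
import random

def make_sample(n, rng, noise=0.2):
    """n points uniform in [0, 1); label 1 with probability 1 - noise, else 0.
    The best classifier is the constant 1, which belongs to F_0."""
    return [(rng.random(), 1 if rng.random() >= noise else 0) for _ in range(n)]

def erm(sample, k):
    """Empirical risk minimizer in the assumed class F_k:
    majority label in each of 2**k equal bins (ties and empty bins give 1)."""
    bins = 2 ** k
    votes = [0] * bins
    for x, y in sample:
        votes[int(x * bins)] += 1 if y == 1 else -1
    return [1 if v >= 0 else 0 for v in votes]

def error_rate(labels, sample):
    """Fraction of examples on which the histogram classifier fails."""
    bins = len(labels)
    return sum(labels[int(x * bins)] != y for x, y in sample) / len(sample)

def paired_hold_out(train, test, max_level):
    """k* = argmin_k l_k, with every f_k fitted on `train` and scored on `test`.
    Ties are broken toward the smallest k (an assumption of this sketch)."""
    l = [error_rate(erm(train, k), test) for k in range(max_level + 1)]
    k_star = min(range(max_level + 1), key=lambda k: l[k])
    return k_star, l

rng = random.Random(0)
train, test = make_sample(200, rng), make_sample(200, rng)
k_star, hold_out_errors = paired_hold_out(train, test, max_level=6)
print("selected level:", k_star)
```

Repeating the last lines with many seeds gives an empirical picture of how often a level above 0 is selected although the optimal function lies in F_0.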
N.M. Amil et al.
All the following results are in the general case of $N$ a non-decreasing function of $n$.
Proposition 2 (Bloat cannot be controlled by hold-out with pairing, first result). Consider the case with pairing. For arbitrarily large $v$, there exists a distribution with optimal function in $F_0$ such that $\liminf_{n\to\infty} \Pr(k^*(n) \ge v) > 0$.
Now, let’s consider a distribution that depends on $n$. This is interesting, as it provides lower bounds on what can be guaranteed, for a given value of $n$, independently of the distribution. For technical reasons, and without loss of generality up to a renumbering of the $F_k$, we assume that $F_{v+1}$ has a larger VC-dimension than $F_v$. We can show that, with a distribution dependent on $n$, $\limsup_{n\to\infty} k^*(n) = \infty$. This leads to another negative result about the control of bloat by hold-out.
Proposition 3 (Bloat cannot be controlled by hold-out with pairing, second result). $\limsup_n k^*(n) = \infty$, where the distribution depends on $n$ but is always such that an optimal function lies in $F_0$.
This result is in the setting of a distribution which depends on $n$; it is of course not relevant for modeling the evolution of one particular problem as the number of examples increases, but it shows that, for hold-out with pairing, no bound on $k^*(n)$ for $n \ge n_0$ can be provided, whatever $n_0$ may be, unless the distribution of problems is taken into account.
Cross-Validation for the Control of Bloat. We now turn our attention to the case of cross-validation. We formalize $N$-fold cross-validation as follows:
$$f_k^i = \arg\min_{F_k} L(\cdot, X'^{\,i}_k), \qquad k^* = \arg\min_k \frac{1}{N}\sum_{i=1}^{N} L(f_k^i, X_k^i),$$
$$X'^{\,i}_k = (X_k^1, X_k^2, \dots, X_k^{i-1}, X_k^{i+1}, X_k^{i+2}, \dots, X_k^N) \quad \text{for } i \le N,$$
where for any $i$ and $k$, $X_k^i$ is a sample of $n$ points. Greedy cross-validation could be considered as in the case of hold-out above: all $X_k^i$ could be independent. This leads to the same result (for some distribution, $k^*(n) \to \infty$) with roughly the same proof. We therefore only consider cross-validation with pairing, i.e. $\forall i, k, k'$, $X_k^i = X_{k'}^i$. For short, we note $X_k^i = X^i$.
Theorem 6 (Negative result on subsampling). Assume that $F_k$ has a VC-dimension going to $\infty$ as $k \to \infty$. One cannot avoid bloat with only hold-out or cross-validation, in the sense that with paired hold-out, or greedy hold-out, or cross-validation, for any $V$, there exists some distribution for which almost surely $k^*(n) > V$ infinitely often whereas an optimal function lies in $F_0$. Note that the propositions above show, in some cases, stronger forms of bloat. If we consider greedy hold-out, hold-out with pairing and cross-validation with
A Statistical Learning Perspective of Genetic Programming
pairing, then: (1) For some well-chosen distribution of examples, greedy hold-out almost surely leads to (i) $k^*(n) \to \infty$ if $N \to \infty$, and (ii) $k^*(n)$ asymptotically uniformly distributed in $[[0, N]]$ if $N$ is finite, whereas an optimal function lies in $F_0$ (Theorem 5). (2) Whatever $V = VCdim(F_v)$ may be, for some well-chosen distribution, hold-out with pairing almost surely leads to $k^*(n) > V$ infinitely often whereas an optimal function lies in $F_0$ (Proposition 2). (3) Whatever $V = VCdim(F_v)$ may be, for some well-chosen distribution, cross-validation with pairing almost surely leads to $k^*(n) > V$ infinitely often whereas an optimal function lies in $F_0$.
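The paired N-fold cross-validation rule formalized in this section can be sketched as follows. As an assumption made purely for illustration, the class F_k is again a histogram classifier with 2^k bins on [0, 1), with labels in {0, 1}; `fit`, `loss` and `cross_validation_level` are hypothetical names.

```python
import random

def fit(sample, k):
    """ERM in the assumed class F_k: majority label in each of 2**k equal bins."""
    bins = 2 ** k
    votes = [0] * bins
    for x, y in sample:
        votes[int(x * bins)] += 1 if y == 1 else -1
    return [1 if v >= 0 else 0 for v in votes]

def loss(labels, sample):
    """Empirical error rate L(f, X) of a fitted histogram on a sample."""
    bins = len(labels)
    return sum(labels[int(x * bins)] != y for x, y in sample) / len(sample)

def cross_validation_level(folds, max_level):
    """Paired N-fold CV: the same folds X^1..X^N are reused for every level k.
    f_k^i is fitted on all folds except X^i and scored on X^i."""
    scores = []
    for k in range(max_level + 1):
        fold_losses = []
        for i, held_out in enumerate(folds):
            rest = [p for j, fold in enumerate(folds) if j != i for p in fold]
            fold_losses.append(loss(fit(rest, k), held_out))
        scores.append(sum(fold_losses) / len(folds))
    k_star = min(range(max_level + 1), key=lambda k: scores[k])
    return k_star, scores

rng = random.Random(1)
folds = [[(rng.random(), 1 if rng.random() >= 0.2 else 0) for _ in range(40)]
         for _ in range(5)]
k_star, cv_scores = cross_validation_level(folds, max_level=5)
print("selected level:", k_star)
```

In this toy distribution the optimal classifier is constant and lies in F_0, so any run that returns a level above 0 is an instance of the selection noise that the negative results describe.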
6
Experimental Results
Some theoretical elements presented in Sections 3 and 4 are verified experimentally in this section. The experiments are conducted using Koza-style GP [10], with a problem setup similar to the classical symbolic regression example, modified for binary classification. This setting is covered by the theoretical results above. The GP branches used are addition, subtraction, multiplication, protected division, and if-less-than. This last branch takes four arguments, returning the third argument if the first argument is less than the second one, and the fourth argument otherwise. The GP terminals are the variable $x$ and the constants 0 and 1. The learning task consists in minimizing the error $e(i)$ between the desired output $y_i \in \{-1, 1\}$ and the output $\hat y_i$ obtained from the tested GP tree for the input $x_i$, as follows: $e(i) = \max(1 - y_i \hat y_i, 0)$. The fitness measure used in the experiments, which is minimized, is the (averaged) sum of the errors, to which is added a complexity factor that approximates the VC-dimension of the GP program: $f = \frac{1}{s}\sum_{i=1}^{s} e(i) + k\,\frac{t^2 \log_2(t)}{s}$, where $t$ is the number of nodes of the GP program tested, $s$ is the number of test cases used for fitness evaluation, and $k$ is a trade-off weight that sets the importance of the complexity penalization relative to the accuracy term. The $s$ test cases are distributed uniformly in $x_i \in [0, 1]$, with associated $y_i \in \{-1, 1\}$. For $x_i < 0.4$, each $y_i$ is equal to 1 with probability 0.25 (so $y_i = -1$ with probability 0.75); for $x_i \in [0.4, 0.6[$, $y_i = 1$ with probability 0.5; and for $x_i \ge 0.6$, $y_i = 1$ with probability 0.75. Thus, the classifier with the best generalization capabilities returns $y_i^* = -1$ for $x_i < 0.4$, $y_i^* = 1$ for $x_i \ge 0.6$ and a random output for $x_i \in [0.4, 0.6[$, with a minimal generalization error of $0.4 \times 0.25 + 0.2 \times 0.5 + 0.4 \times 0.25 = 0.3$.
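The fitness measure and the test-case distribution described above can be sketched as below. This is a hedged illustration, not the Open BEAGLE code used by the authors: the complexity term is assumed to read k·t²·log₂(t)/s (one reading of the formula in the text), the tree size t is assumed to be at least 1, and all function names are hypothetical.

```python
import math
import random

def hinge_error(y, y_hat):
    """e(i) = max(1 - y_i * y_hat_i, 0), with y_i in {-1, 1}."""
    return max(1.0 - y * y_hat, 0.0)

def complexity_penalty(t, s, k):
    """Assumed reading of the penalty term: k * t^2 * log2(t) / s, for t >= 1."""
    return k * t * t * math.log2(t) / s

def fitness(outputs, targets, tree_size, k):
    """Mean hinge error over the s test cases plus the complexity penalty
    (to be minimized)."""
    s = len(targets)
    mean_error = sum(hinge_error(y, o) for y, o in zip(targets, outputs)) / s
    return mean_error + complexity_penalty(tree_size, s, k)

def draw_case(rng):
    """One test case: x uniform in [0, 1), P(y = 1 | x) = 0.25, 0.5 or 0.75."""
    x = rng.random()
    p = 0.25 if x < 0.4 else (0.5 if x < 0.6 else 0.75)
    return x, (1 if rng.random() < p else -1)

# Error of the best possible classifier, region by region.
bayes_error = 0.4 * 0.25 + 0.2 * 0.5 + 0.4 * 0.25

rng = random.Random(0)
cases = [draw_case(rng) for _ in range(50)]
constant_outputs = [1.0] * len(cases)  # a one-node tree returning the constant 1
print(fitness(constant_outputs, [y for _, y in cases], tree_size=1, k=0.001))
```

With `tree_size=1` the penalty vanishes because log₂(1) = 0, so the printed value is just the mean hinge error of the constant classifier on the drawn cases.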
After the evolutions, each best-of-run classifier is thus evaluated by a fine sampling of the input space, with the generalization error evaluated as the difference between the output given by the tested best-of-run classifier and the output obtained by a classifier with best generalization capabilities. Five types of GP evolutions have been tested: i) no limitation on the tree size (no depth limit and complexity trade-off k = 0), ii) depth limitation on the tree size of 17 levels (complexity trade-off k = 0), iii) soft complexity penalty in the fitness (complexity trade-off k = 0.0001), iv) medium complexity penalty in the fitness (complexity trade-off k = 0.001), and v) important complexity penalty in the fitness (complexity trade-off k = 0.01). For the three last approaches, the depth limitation of 17 levels is still maintained. The
[Figure 1, panels (a) and (b): plots not reproduced. Panel (a) plots the average generalization error (vertical axis from 0.3 to 0.42) and panel (b) the average tree size (vertical axis from 0 to 1000), both against the number of test cases (10 to 100), with one curve per approach: No limit, Depth limit, k=0.0001, k=0.001, k=0.01.]
Fig. 1. Generalization errors and tree sizes observed for different size limitations. Figure (a) shows the average generalization errors observed, with apparently better results for the approaches where the fitness includes some parsimony pressure. Figure (b) shows the average tree sizes obtained, where important bloat is observed for the no limitation and maximum depth limitations.
selection method used is lexicographic parsimony pressure [15], that is, regular tournament selection with 4 participants, with the smallest participant taken in case of ties. Other GP parameters are: population of 1000 individuals; evolution over 200 generations; crossover probability of 0.8; subtree, swap and shrink mutation with probability 0.05 each; and finally half-and-half initialization with maximal depth of 5. All the experiments have been implemented using the GP facilities of the Open BEAGLE (http://beagle.gel.ulaval.ca, [8]) C++ framework for evolutionary computation. The experiments have been conducted with different numbers of test cases, varying from s = 10 to s = 100 by steps of 10. One hundred evolutions are run for each combination of approach and number of test cases, for a total of 5 000 evolutions (5 approaches × 10 values of s × 100 runs). Figure 1 shows the average generalization errors and tree sizes obtained for the different approaches as a function of the number of test cases used for fitness evaluation. These results show that bloat occurs when the size is not limited, even when lexicographic parsimony pressure is used (see curve No limit of Figure 1b), which validates Theorem 3. Then, as stated by Theorem 2, UC is achieved using moderate complexity penalization in the fitness measure, with a convergence toward the optimal generalization error of 0.3 (see curve k=0.001 of Figure 1a). Third, as predicted by Theorem 4, increasing the penalization leads to both UC and no bloat (see curve k=0.01 of both Figures 1a and 1b). Note that Theorem 3 asserts that this result cannot be achieved by a priori scaling of the complexity, and that Section 5 shows that this cannot be achieved by cross-validation.
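The selection scheme described above (tournament of 4 participants, smallest tree wins ties) can be sketched as follows. This is a minimal illustration, not the Open BEAGLE implementation: individuals are represented as hypothetical dictionaries with `fitness` (minimized) and `nodes` fields, and contestants are assumed to be drawn with replacement.

```python
import random

def pick_winner(contestants):
    """Lowest fitness wins (fitness is minimized); among exact ties,
    the individual with fewer nodes is preferred."""
    return min(contestants, key=lambda ind: (ind["fitness"], ind["nodes"]))

def tournament(population, rng, size=4):
    """Draw `size` contestants uniformly at random (with replacement,
    an assumption of this sketch) and return the lexicographic winner."""
    return pick_winner([rng.choice(population) for _ in range(size)])

population = [
    {"name": "a", "fitness": 0.30, "nodes": 57},
    {"name": "b", "fitness": 0.30, "nodes": 9},
    {"name": "c", "fitness": 0.45, "nodes": 3},
]
winner = tournament(population, random.Random(0))
print(winner["name"])
```

Size only matters when fitness values are exactly equal, which is why this form of parsimony pressure alone does not bound tree growth.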
7
Conclusion
In this paper, we have proposed a theoretical study of two important issues in Genetic Programming (GP) known as Universal Consistency (UC) and code
bloat. We have shown that the understanding of the bloat phenomenon in GP could benefit from classical results from statistical learning theory. The main limit of our work is that it deals only with the statistical elements of genetic programming (effect of noise) and not with the dynamics (the effect of bounded computational power). Application of theorems from learning theory has led to two original outcomes with both positive and negative results. Firstly, results on UC of GP: there is almost sure asymptotic convergence to the optimal error rate in the context of binary classification with GP with any of the classical forms of regularization (from learning theory): the method of Sieves, or Structural Risk Minimization. Secondly, results on code bloat: i) if the ideal target function does not have a finite description then code bloat is unavoidable (structural bloat: obviously, if there is no finite-length program with optimal error, then reaching the optimal error is, at best, only possible with an infinite growth of code), and ii) code bloat can be avoided by simultaneously bounding the length of the programs with some ad hoc limit and using some parsimony pressure in the fitness function (functional bloat), i.e. by combining Structural Risk Minimization and Sieves. An important point is that all methods leading to no-bloat use a regularization term; in particular, cross-validation or hold-out methods do not reach no-bloat.
Acknowledgements. This work was supported in part by the PASCAL Network of Excellence, and by postdoctoral fellowships from the ERCIM (Europe) and the FQRNT (Québec) to C. Gagné. We thank Bill Langdon for very helpful comments.
References
1. Banzhaf, W., Nordin, P., Keller, R.E., Francone, F.D.: Genetic Programming: an introduction. Morgan Kaufmann Publisher Inc., San Francisco (1998)
2. Bleuler, S., Brack, M., Thiele, L., Zitzler, E.: Multiobjective genetic programming: Reducing bloat using SPEA2. In: Proceedings of the 2001 Congress on Evolutionary Computation CEC 2001, COEX, World Trade Center, 159 Samseong-dong, Gangnam-gu, Seoul, Korea, pp. 536–543. IEEE Press, Los Alamitos (2001)
3. Blickle, T., Thiele, L.: Genetic programming and redundancy. In: Hopf, J. (ed.) Genetic Algorithms Workshop at KI 1994, pp. 33–38. Max-Planck-Institut für Informatik (1994)
4. Daida, J.M., Bertram, R.R., Stanhope, S.A., Khoo, J.C., Chaudhary, S.A., Chaudhri, O.A., Polito II, J.A.: What makes a problem GP-Hard? Analysis of a tunably difficult problem in genetic programming. Genetic Programming and Evolvable Machines 2(2), 165–191 (2001)
5. De Jong, E.D., Watson, R.A., Pollack, J.B.: Reducing bloat and promoting diversity using multi-objective methods. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2001, pp. 11–18. Morgan Kaufmann Publishers, San Francisco (2001)
6. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer, Heidelberg (1997)
7. Ekart, A., Nemeth, S.: Maintaining the diversity of genetic programs. In: Foster, J.A., Lutton, E., Miller, J., Ryan, C., Tettamanzi, A.G.B. (eds.) EuroGP 2002. LNCS, vol. 2278, pp. 162–171. Springer, Heidelberg (2002)
8. Gagné, C., Parizeau, M.: Genericity in evolutionary computation software tools: Principles and case study. International Journal on Artificial Intelligence Tools 15(2), 173–194 (2006)
9. Gustafson, S., Ekart, A., Burke, E., Kendall, G.: Problem difficulty and code growth in genetic programming. Genetic Programming and Evolvable Machines 4(3), 271–290 (2004)
10. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
11. Langdon, W.B.: The evolution of size in variable length representations. In: IEEE International Congress on Evolutionary Computations (ICEC 1998), pp. 633–638. IEEE Press, Los Alamitos (1998)
12. Langdon, W.B.: Size fair and homologous tree genetic programming crossovers. Genetic Programming And Evolvable Machines 1(1/2), 95–119 (2000)
13. Langdon, W.B., Poli, R.: Fitness causes bloat: Mutation. In: Late Breaking Papers at GP 1997, pp. 132–140. Stanford Bookstore (1997)
14. Langdon, W.B., Soule, T., Poli, R., Foster, J.A.: The evolution of size and shape. In: Advances in Genetic Programming III, pp. 163–190. MIT Press, Cambridge (1999)
15. Luke, S., Panait, L.: Lexicographic parsimony pressure. In: GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 829–836. Morgan Kaufmann Publishers, San Francisco (2002)
16. McPhee, N.F., Miller, J.D.: Accurate replication in genetic programming. In: Genetic Algorithms: Proceedings of the Sixth International Conference (ICGA 1995), Pittsburgh, PA, USA, pp. 303–309. Morgan Kaufmann, San Francisco (1995)
17. Nordin, P., Banzhaf, W.: Complexity compression and evolution. In: Genetic Algorithms: Proceedings of the Sixth International Conference (ICGA 1995), Pittsburgh, PA, USA, pp. 310–317. Morgan Kaufmann, San Francisco (1995)
18. Ratle, A., Sebag, M.: Avoiding the bloat with probabilistic grammar-guided genetic programming. In: Artificial Evolution VI. Springer, Heidelberg (2001)
19. Silva, S., Almeida, J.: Dynamic maximum tree depth: A simple technique for avoiding bloat in tree-based GP. In: Cantú-Paz, E., Foster, J.A., Deb, K., Davis, L., Roy, R., O’Reilly, U.-M., Beyer, H.-G., Kendall, G., Wilson, S.W., Harman, M., Wegener, J., Dasgupta, D., Potter, M.A., Schultz, A., Dowsland, K.A., Jonoska, N., Miller, J., Standish, R.K. (eds.) GECCO 2003. LNCS, vol. 2723, pp. 1776–1787. Springer, Heidelberg (2003)
20. Silva, S., Costa, E.: Dynamic limits for bloat control: Variations on size and depth. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 666–677. Springer, Heidelberg (2004)
21. Soule, T.: Exons and code growth in genetic programming. In: Foster, J.A., Lutton, E., Miller, J., Ryan, C., Tettamanzi, A.G.B. (eds.) EuroGP 2002. LNCS, vol. 2278, pp. 142–151. Springer, Heidelberg (2002)
22. Soule, T., Foster, J.A.: Effects of code growth and parsimony pressure on populations in genetic programming. Evolutionary Computation 6(4), 293–309 (1998)
23. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
24. Zhang, B.-T., Mühlenbein, H.: Balancing accuracy and parsimony in genetic programming. Evolutionary Computation 3(1) (1995)
Quantum Circuit Synthesis with Adaptive Parameters Control
Cristian Ruican, Mihai Udrescu, Lucian Prodan, and Mircea Vladutiu
Advanced Computing Systems and Architectures Laboratory, University “Politehnica” Timisoara, 2 V. Parvan Blvd., Timisoara 300223, Romania
{crys,mudrescu,lprodan,mvlad}@cs.upt.ro
http://www.acsa.upt.ro
Abstract. The contribution presented herein proposes an adaptive genetic algorithm applied to quantum logic circuit synthesis that dynamically adjusts its control parameters. The adaptation is based on statistical data analysis for each genetic operator type, in order to offer the appropriate exploration at algorithm runtime without user intervention. The applied performance measurement attempts to highlight the “good” parameters and to introduce an intuitive meaning for the statistical results. The experimental results indicate an important synthesis runtime speedup. Moreover, while other GA approaches can only tackle the synthesis for quantum circuits over a small number of qubits, this algorithm can be employed for circuits that process up to 5-6 qubits.
1
Introduction
The implementation of the meta-heuristic approach for quantum circuit synthesis makes use of the ProGeneticAlgorithm (ProGA) [2] framework, which provides a robust and optimized C++ environment for developing genetic algorithms. The problem of setting values for the different control parameters is crucial for genetic algorithm performance [7]. This paper introduces a genetic algorithm tailored for evolving quantum circuits. Our ProGA framework is used for the genetic algorithm implementation, its architecture being extended in order to handle statistical information. The statistical data are analyzed on-the-fly by the adaptive algorithm and the results are used for adjusting the genetic parameters at run-time. The framework becomes a useful tool for the adaptive behavior, and it is designed to allow easy development for this type of engineering problem, all its low-level details being implemented in a software library. It also allows for different configurations, thus making the comparison between the characteristics of the emerged solutions straightforward; this fact will be used later in the experiments section. In theory, the parameter control involved in a genetic algorithm may be related to population size, mutation probability, crossover probability, selection type, etc. Our proposal focuses on parameter control using statistical information from the current state of the search. To the best of our knowledge, this is the first meta-heuristic approach for GA-based quantum circuit synthesis. The experimental results show that
L. Vanneschi et al. (Eds.): EuroGP 2009, LNCS 5481, pp. 339–350, 2009. © Springer-Verlag Berlin Heidelberg 2009
parameter control provides a higher convergence rate and therefore an important runtime speedup. Moreover, previous GA approaches for quantum circuit synthesis were effective only on a small number of qubits (up to 3-4 qubits [12]), while this solution works as well for 5-6 qubit circuits. A quantum circuit is composed of one or more quantum gates, acting over a qubit set, according to a quantum algorithm. Designing a quantum circuit in order to implement a given function is not an easy task, because even if we know the target unitary transformation we don’t know how to compose it out of elementary transformations. Even if the circuit is eventually rendered, we don’t have information about its efficiency. This is the main reason why we propose a genetic algorithm approach for the synthesis task. The genetic algorithm will evolve a possible solution that will be evaluated against previously obtained solutions, and ultimately a close-to-optimal solution will be indicated. In a circuit synthesis process the correct values of the control parameters are hard to determine; this is the main reason why we propose an adaptive genetic algorithm that will control and change these values at run-time.
1.1
Problem Definition
Quantum circuit synthesis is considered to be the automatic combination and optimization of quantum circuits in order to implement a given function. It is an extensively investigated topic [5][9][10][12]. Several research groups have published results, and significant progress has been reported in gate number reduction, qubit reduction, and even runtime speedup. Quantum circuit synthesis can play an important role in the development of quantum computing technology; in the last decades, automatic classical circuit synthesis has improved the use of new circuits (in terms of development time, delay time, integration scale, cost and time to market, etc.), allowing developers to be ever more creative. New complex applications are possible, while the limits of classical physical technology are pushed to the edge. To briefly state the problem, we may consider as the requirement the construction of a given quantum function out of a set of elementary operators (which are implemented by elementary gates). For digital or analog circuit synthesis, the problem can be solved by following an already known path; however, when the problem is moved into the quantum world, the situation becomes different. From examining the state of the art, there is no commonly accepted path to follow for finding a solution [6][9][12].
2
Proposed Approach
The architecture is important in the realm of system development. At first glance, within our proposal different parts may be identified: a high-level description language parser used to map the quantum circuit description to a low-level representation, an algorithm responsible for the optimization of the abstract circuit, and a genetic algorithm responsible for the synthesis and optimization tasks.
The proposed breakdown structure indicates a layered software architecture (see Figure 1), each layer being responsible for a dedicated task. The rippled computation allows for intermediate results that can be used or optimized in the next layer. Thus, starting from a circuit description in a high-level language, after applying all the phases, the process eventually leads to the corresponding circuit. As intermediate results, we have the abstract description of the circuit, the internal data representation used for the optimization, and other program-relevant information. Our Quantum Hardware Description Language (QHDL) [8] parser uses a generic implementation to create the internal data structure that is used, later on, by the genetic algorithm. The adjustment of the genetic algorithm control parameters is made by the meta-heuristic component. In the end, the evolved solution is optimized and possibly a new evolution cycle is triggered. The synthesis solution, namely the circuit layout, is provided as the result.
[Figure 1: diagram not reproduced. Labels in the diagram: Optimization, CircuitOptimization, ProGAFramework, GeneticAlgorithm, MetaHeuristic, Parser, QHDLParser, FileParser, Application, QuantumCircuitSynthesis, Start Synthesis, connected by «flow» arrows.]
Fig. 1. Software flow of the proposed approach
3
Genetic Algorithm Details
A dedicated genetic algorithm is used to evolve a circuit synthesis solution. The obtained solution is not necessarily the optimal one, thus providing incentives for tuning the algorithm. The terminal set used in the synthesis problem is composed of quantum gates (any gate from a database may be randomly used in the chromosome encoding), of the implemented methods that generate random numbers (used in the selector probabilities and in the gate selector when genetic operators are applied), and of the constant gate characteristic values (i.e. quantum circuit cost and efficiency). The function set is derived from the nature of the problem, and is composed of the mathematical functions necessary to evaluate the circuit output function (tensor product, multiplication and equality). The fitness measure specifies the solution quality within the synthesis algorithm. Therefore, the fitness assigned to a chromosome indicates how close the individual output is to the algorithm target. Considering the discrete search space $X$ and the objective function $f : X \to \mathbb{R}$, our goal is to find $\max_{x \in X} f(x)$, where $x$ is a vector of decision variables, $f(x) = f(x_1, \dots, x_n)$. It is a maximization problem, because the goal is to find the optimum quantum circuit that implements a given input function. The fitness function is defined as:
$$\mathrm{eval}(x) = f(x) + W \times \mathrm{penalty}(x) \qquad (1)$$
where
$$f = \frac{f(\text{evolved circuit})}{f(\text{initial circuit})} \qquad (2)$$
and
$$\mathrm{penalty} = 1 - \frac{\text{number of evolved gates} - \text{number of initial gates}}{\text{number of initial gates}} \qquad (3)$$
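Equations (1) and (3) translate directly into a few lines of code. The sketch below is illustrative only: the ratio of equation (2) is taken as an already computed input (`f_ratio`, a hypothetical name), and the flag `same_function` encodes the statement in the surrounding text that the penalty is applied only when the evolved circuit has the same functionality as the given one.

```python
def penalty(evolved_gates, initial_gates):
    """Equation (3): 1 - (evolved - initial) / initial."""
    return 1.0 - (evolved_gates - initial_gates) / initial_gates

def evaluate(f_ratio, evolved_gates, initial_gates, weight=1.0, same_function=True):
    """Equation (1): eval = f + W * penalty.
    The penalty term is added only for functionally equivalent circuits."""
    if not same_function:
        return f_ratio
    return f_ratio + weight * penalty(evolved_gates, initial_gates)

# A functionally equivalent circuit with 3 gates instead of 4, with W = 1.
print(evaluate(1.0, evolved_gates=3, initial_gates=4))  # 2.25
```

Under this reading, a circuit with fewer gates than the initial one gets a penalty term above 1, so among equivalent circuits the shorter one has the higher fitness.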
The fitness operator is implemented as a comparison between the output function of the chromosome and the output function of the given circuit, therefore revealing the differences between them. The quantum circuit output function is computed by applying the tensor product for all horizontal rows and then multiplying all the results (see Figure 2), each gate having a mathematical representation for its logic. The initial circuit is provided by the user via a high-level language hardware description. A penalty function is used in order to indicate the fact that there is a more efficient chromosome than that of the given circuit. In our optimization approach, the penalty function is implemented as the difference between the number of gates of the evolved circuit and of the given circuit, divided by the number of given gates; it is applied only when the evolved circuit has the same functionality as the given circuit. The penalty is considered as a constraint for the algorithm and it is used to ensure that a better circuit is obtained from the given one. For simplicity, we consider W = 1, the focus in this paper being on the operator performance.
The circuit representation is essential for the chromosome encoding (see Figure 2). Our encoding approach is to split the circuit representation in sections and planes [1], a representation that will be used in the chromosome definition. Following Nature, where a chromosome is composed of genes, in our chromosome the genes are the circuit sections. We are able to encode the circuit within the chromosome [3], and to represent a possible candidate solution. A gene will store the specific characteristics of a particular section, and the genetic operators will be applied at the gene level or inside the gene.
[Figure 2: diagram not reproduced. Labels in the diagram: chromosome, gene 1, gene 2, gene m, encoding.]
Fig. 2. Chromosome Encoding
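The computation of the circuit output function can be sketched as follows, under one reading of the description above: the gates of each section (one per plane, top to bottom) are combined with the tensor product, and the section matrices are then multiplied in the order in which the sections are applied. Matrices are plain lists of rows, and the names `kron`, `matmul`, `circuit_matrix` and the layout of `sections` are assumptions of this sketch, not the ProGA data structures.

```python
from functools import reduce

def kron(a, b):
    """Tensor (Kronecker) product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    """Ordinary matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def circuit_matrix(sections):
    """Each section is a list of gate matrices, top plane first.
    Sections are applied left to right, so later sections multiply on the left."""
    section_matrices = [reduce(kron, gates) for gates in sections]
    return reduce(lambda acc, s: matmul(s, acc), section_matrices)

I = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

# Two-qubit example: X on the top qubit, then a CNOT spanning both planes.
u = circuit_matrix([[X, I], [CNOT]])
print(u)
```

Comparing the resulting matrix with that of the given circuit, entry by entry, is the equality test on which the fitness comparison relies; with floating-point gates such as Hadamard a tolerance would be needed.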
The mutation operator is applied either inside the gene, where only one locus is replaced, or at the gene level, where the entire gene is randomly replaced with new locus gates (see Algorithm 1 and Algorithm 2). The crossover operator is much more complex than mutation. In this case, the gates selected from the parents are used to create offspring, by copying their contents and properties. Thus, using a crossover probability, two genes are selected for reproduction and, by using one or two cut points, the content in between is exchanged (see Algorithm 3 and Algorithm 4).
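The two gene-level operators can be sketched as below. This is a simplified illustration rather than the ProGA implementation: a chromosome is assumed to be a list of genes, each gene a list of single-qubit gate labels (one per plane), multi-qubit gates are ignored, cut points are expressed in genes instead of flat indexes, and `GATES` is a hypothetical gate database.

```python
import random

GATES = ["I", "H", "X", "Z"]  # hypothetical single-qubit gate database

def mutate_gene_level(chromosome, rng):
    """Gene-level mutation: replace the whole content of one randomly
    chosen gene with new randomly generated gates."""
    child = [gene[:] for gene in chromosome]
    g = rng.randrange(len(child))
    child[g] = [rng.choice(GATES) for _ in child[g]]
    return child

def crossover_gene_level(parent1, parent2, rng, two_points=False):
    """Gene-level crossover: copy parent1, then exchange whole genes with
    parent2 between the Start and Stop positions."""
    n = len(parent1)
    child = [gene[:] for gene in parent1]
    cut1 = rng.randrange(n)                 # between 0 and n - 1
    if two_points:
        cut2 = rng.randint(1, n)            # between 1 and number of genes
        while cut2 == cut1:
            cut2 = rng.randint(1, n)
        start, stop = sorted((cut1, cut2))  # order the two cut points
    else:
        start, stop = cut1, n               # one cut point: up to the end
    child[start:stop] = [gene[:] for gene in parent2[start:stop]]
    return child

p1 = [["H", "I"], ["X", "Z"], ["I", "I"]]
p2 = [["Z", "Z"], ["H", "H"], ["X", "X"]]
rng = random.Random(3)
print(mutate_gene_level(p1, rng))
print(crossover_gene_level(p1, p2, rng, two_points=True))
```

Both functions copy the parent before modifying it, so the parents stay available for computing the operator feedback afterwards.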
Algorithm 1. mutation inside the gene
Require: Selected gene.
Ensure: The new offspring.
1: get the number of present genes and the gene length
2: randomly select a gene and a locus
3: detect the gate corresponding to the selected locus
4: if empty position inside of the quantum gate then
5:   repeat
6:     move the selected locus to the right
7:   until a different gate id is detected (to detect the quantum gate end)
8: end if
9: search to the left the neighboring gate and memorize its locus
10: search to the right the neighboring gate and memorize its locus
11: generate a new random quantum gate or more quantum gates between the left and right locus positions
Algorithm 2. mutation at the gene level
Require: Selected gene.
Ensure: The new offspring.
1: get the number of present genes and the gene length
2: randomly select a gene
3: perform the gene mutation by replacing the complete contents with new randomly generated gates
4
Adaptive Behavior
Almost every practical genetic algorithm is controlled by several parameters, such as population size, selector type, mutation probability, and crossover probability. From a meta-heuristic point of view, genetic algorithms are considered to contain all the information necessary for adaptive behavior. Nevertheless, in the following subsections we present how the adaptive behavior optimizes the circuit synthesis search algorithm (from the user’s point of view, setting the parameters is far from being a trivial task). ProGA is an object-oriented framework used for genetic algorithm implementations. Software methods and design patterns are applied in order to create the necessary abstraction levels. The framework supports genetic algorithm implementations through the derivation of new classes from abstract ones. The type of algorithm (steady state or non-overlapping), the population structure, the encoding of the genome, and the initial settings of the control parameters are defined within this framework. An important characteristic of the framework is the possibility of extending its functionality. Thus, as presented in Figure 3, the Adaptive Control can use the framework interface, allowing its integration into the system.
Algorithm 3. crossover inside the gene
Require: Selected gene.
Ensure: The new offspring.
1: create a new offspring
2: copy parent1 details into offspring
3: randomly select a cut1 point between 0 and chromosome length minus 1
4: detect the gate corresponding to the selected locus
5: if empty position inside a quantum gate then
6:   repeat
7:     move the selected locus to the right
8:   until a different gate id is detected (to detect the quantum gate end)
9: end if
10: search to the left the neighboring gate and memorize its index
11: search to the right the neighboring gate and memorize its index
12: if the crossover operator has one point then
13:   calculate Start index as Cut1 point
14:   calculate Stop index as the chromosome length
15: end if
16: if the crossover operator has two cut points then
17:   randomly select a cut2 point between right index and chromosome length minus 1
18:   calculate Start index as left index
19:   calculate Stop index as Cut2 point
20: end if
21: exchange the elements of offspring and parent2 between the Start and Stop indexes
[Figure 3: diagram not reproduced. Labels in the diagram: Statistic Data, Adaptive algorithm, GA parameters, ProGA Framework, Synthesis Process.]
Fig. 3. Adaptive Control Integration within ProGA
The framework provides the data necessary for statistical analysis and the current values of the control parameters, while the Adaptive Control component returns the newly adjusted parameter values to the framework. The Adaptive Control is considered an external tool for the genetic algorithm implementation, and it is responsible only for the appropriate update of the control parameters. Any fitness higher than or equal to 1 is considered a solution of the synthesis problem. For each solution, statistical data are saved for later analysis: the generation number at which the solution evolved, the resulting fitness value, the chromosome values that generated the solution, and the time required for evolution. Identical solutions are not saved into the history list, because, from an algorithmic point of view, it is not useful to analyze identical data values. Thus, the history list will always contain better solutions for the given synthesis problem.
Algorithm 4. crossover at the gene level
Require: Selected gene.
Ensure: The new offspring.
1: create a new offspring
2: copy parent1 details into offspring
3: randomly select a cut1 point between 0 and chromosome length minus 1
4: if the crossover operator has one cut point then
5:   calculate Start index as Cut1 × GeneLength
6:   calculate Stop index as the chromosome length
7: end if
8: if the crossover operator has two cut points then
9:   randomly select a cut2 point between 1 and number of genes
10:   repeat
11:     selection for cut2
12:   until it is different than cut1
13:   order the cut1 and cut2 points
14:   calculate Start index as Cut1 × GeneLength
15:   calculate Stop index as Cut2 × GeneLength minus 1
16: end if
17: exchange the elements of offspring and parent2 between the Start and Stop index
4.1
Operator Performance
Two types of statistical data are used as input for the adaptive algorithm. The first type is represented by the fitness results for each population, corresponding to the best, mean and worst chromosomes. The second type is represented by the operator performance. Following an idea proposed in reference [11], the performance records are essential in deciding on the operator's reward. Each offspring produced by an operator is recorded in one of four categories:
– Absolute: the resulting offspring has a better fitness than the best fitness from the previous generation.
– Relative: the resulting offspring has a better fitness than its parents, but is not absolute.
– In Range: the resulting offspring has a fitness situated between its parents' fitness values.
– Worse: the resulting offspring has a lower fitness than its parents' fitness values.
For the circuit synthesis algorithm, the mutation and crossover probabilities are major parameters that need to be controlled dynamically. Because we have defined two mutation and two crossover operators, there are four sets of statistical data that need to be stored and later analyzed for the parameter control adjustment. Each operator's offspring result is important and needs to be recorded. The first type of mutation, called mutation A, is responsible for gate mutations inside genes, while the second type of mutation, called mutation B, is applied at the chromosome level. The same rules are defined for the
C. Ruican et al.
crossover operator (applied at the gene level and called crossover A, and at the chromosome level and therefore called crossover B). The operators are implemented within the ProGA framework, and adding an extra layer for the adaptive algorithm is sufficient for the meta-heuristic extension. From the meta-heuristic point of view, the operator implementation details are not important; the only requirement is to know the number of operators, because a separate statistical structure is reserved for each of them. The algorithm receives breeding feedback from each operator and analyzes the returned data by computing the operator performance and deciding on its adjustment rate. The controlled parameters are changed by using the feedback data from the search (stored as statistical data). The adaptive algorithm distinguishes between the qualities of the solutions evolved by different operators, and then adjusts the rates based on merit. As described, the adaptive algorithm is external to the genetic algorithm framework, the only interaction consisting of the transfer of parameter rates and feedback data.
4.2 Performance Assessment
Several statistical functions may provide valuable information about the data distribution; functions such as Maximum, Minimum, Arithmetic Average and Standard Deviation may be applied to any kind of statistical data. For each generation, the maximum, arithmetic average and minimum fitness values are provided by the genetic algorithm framework and then stored in the statistical data. When the genetic evolution has evolved a solution, other statistical functions are computed: the maximum over all the generation maximum fitness values, the arithmetic average of all the maximum values, etc. Thus, we defined statistical functions on each generation, and statistical functions over all the generations, which are only used to evaluate the algorithm efficiency. The second type of statistical data, for the operator performance, is computed when the operator is applied. For example, when the Crossover B result is available, the resulting offspring fitness value is compared against the previous best fitness and, if it is higher, the Absolute value is incremented by one. If it is lower, the comparison continues and, if the offspring fitness is higher than the parents' fitness, then the Relative value is incremented, etc. After each generation, the operator performance is updated with the statistical data. Following Rechenberg's 1/5 rule [4], the acquired data are analyzed after five generations. Parameters α, β, γ and δ are introduced to rank the operator performance; they are not adjusted during the algorithm evolution. Their purpose is only to statically rank the operator performance. Thus, an absolute improvement has a higher importance than a relative improvement, and a worse result drastically decreases the operator rank. The operator reward is updated according to the following formula:
$$\sigma(op) = \alpha \times \mathit{Absolute} + \beta \times \mathit{Relative} - \gamma \times \mathit{InRange} - \delta \times \mathit{Worse} \tag{4}$$
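The classification of an offspring and the reward of formula (4) can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, higher fitness is taken to be better (as in the synthesis problem, where a fitness of at least 1 is a solution), and the default weights for α, β, γ and δ are invented placeholders, since the paper does not list the values it used.

```python
from collections import Counter


def classify(offspring_fitness, parent_fitnesses, previous_best):
    """Assign an offspring to one of the four performance categories."""
    low, high = min(parent_fitnesses), max(parent_fitnesses)
    if offspring_fitness > previous_best:
        return "absolute"      # better than the best of the previous generation
    if offspring_fitness > high:
        return "relative"      # better than both parents, but not absolute
    if offspring_fitness >= low:
        return "in_range"      # between the parents' fitness values
    return "worse"             # below both parents


def reward(counts, alpha=4.0, beta=2.0, gamma=1.0, delta=3.0):
    """Formula (4); the weight values are placeholders, not taken from the paper."""
    return (alpha * counts["absolute"] + beta * counts["relative"]
            - gamma * counts["in_range"] - delta * counts["worse"])


# One operator's record, accumulated over a few offspring.
record = Counter(classify(f, (0.5, 0.6), previous_best=0.8)
                 for f in (0.9, 0.7, 0.7, 0.55, 0.4))
sigma = reward(record)
```

One such `Counter` would be kept per operator (mutation A, mutation B, crossover A, crossover B), matching the separate statistical structure reserved for each operator.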
5 Experiments
During the performed experiments, several variables were used to measure, control and manipulate the application results. The proposed synthesis tool supports two types of statistical study: correlational research, in which statistical data are measured while looking for possible relationships between sets of variables, and experimental research, in which some variables are manipulated in order to observe their effect on other sets of variables. The data analysis of the experimental results also creates correlations between the manipulated variables and those affected by the manipulation.

The experiments were conducted on a computer with the following configuration: Intel Pentium M processor at 1.86GHz, 1GB RAM and SuSe 10.3 as the operating system. In order to avoid lucky guesses, the experiments were repeated 20 times, the average result being used for comparison. The configurations for the algorithm parameters used in our experiments are shown in Table 1. Each case study starts with a benchmark quantum circuit (see Figure 4) that is used for the synthesis algorithm evaluation. For each benchmark, the name of the circuit is presented along with its number of qubits (with garbage qubits if present) and the circuit cost. Two synthesis configurations (see Table 1) are used to evolve a synthesis solution, different parameters being manipulated during the test evolution (i.e. the behavior will dynamically adjust the mutation and the crossover probabilities), and the results are presented as graphs.

Fig. 4. Benchmark - Five-Qubit [5] (a), 2-Qubit Ripple-Carry Adder [13] (b) and Six-Qubit [5] (c) circuits

Table 1. Configurations Used in the Performed Experiments
Variable Name                    Configuration 1    Configuration 2
Population size / Generations    50/150             50/150
Mutation type                    Multiple           Single
Crossover type                   Two points         One point
Selector type                    Roulette Wheel     Rank
Elitism percent                  0.1                0.05
Mutation probability             0.03               0.05
Crossover probability            0.3                0.4
Performance increase/decrease    0.1/0.1            0.15/0.2
[Figure 5 panels: Statistic Fitness Evolution (fitness value; MIN, MEAN, MAX, STDEV per generation); Best Individual Fitness Evolution (fitness value vs. generation number; Mean, Worst, Best); Algorithm performance (average time in ticks and solutions vs. number of runs); Mutation and Crossover Adaption (probability value vs. generation number; Mutation, Crossover)]
Fig. 5. Statistic Results for Configuration 2
5.1 Five-Qubit Circuit
For the Five-Qubit test circuit, we used xor5 (see Figure 4a), whose output is the EXOR of all its inputs [5]. The algorithm results are plotted in Figure 5. Using the proposed configurations, we have evolved solutions for the employed benchmark circuit; the bottom-right plot shows the automatic adjustment of the mutation and crossover probabilities.
5.2 Six-Qubit Circuit
For the Six-Qubit test circuit, we used the graycode function (see Figure 4c); if the circuit for such a function is run in reverse, then the output is the ordinal number of the corresponding Gray code pattern [5]. Using the parameters set by configurations 1 and 2, several equivalent solutions emerged, differentiated by the number of gates, costs and feasibility. A second six-qubit circuit is an add cell (see Figure 4b) taken from reference [13], and our synthesis methodology is able to evolve the same solution in less than 32 seconds.
5.3 Going beyond 6-Qubit Circuits
Other genetic algorithm based approaches have presented effective solutions only for three or four-qubit circuits. As stated in [12], the main difficulties encountered were: the complexity of performing the tensor product for large matrices, the high number of individuals used for the total population, and the complexity of encoding a specific quantum gate. Our approach tackles these problems first by using an OOP (object-oriented programming) environment (backed by a framework architecture that employs optimization techniques); this improves the effectiveness of using quantum operations (including the tensor product). Second, our chromosome representation and meta-heuristic approach allow for using small populations (about 300 individuals) within the genetic evolution process. Another improvement comes from the fact that our approach uses a more flexible encoding scheme for the quantum gates. As a result, the experiments can be performed within our synthesis framework for 5 and 6 qubit circuits (see Table 2). Even so, attempting to perform synthesis over a larger number of qubits will also have to confront the complexity problem of matrix multiplication. However, we intend to further investigate this matter and optimize our framework, in order to extend the effectiveness of our approach to even larger quantum circuits.

Table 2. Test of the Approach Convergence
Number of inputs    Mutation       Crossover      Real time (average      Runtime
per q-gate          probability    probability    20 runs) as in [12]     (average 20 runs)
3-input             0.4            0.6            < 30sec                 < 13sec
3-input             0.2            0.6            < 60sec                 < 8sec
4-inputs            0.6            0.4            < 30sec                 < 16sec
5-inputs            0.2            0.4            Not reported            < 31sec
6-inputs            0.2            0.6            Not reported            < 33sec
6 Conclusion
Physicists still have a long way to go in order to bring the quantum circuit implementation details into a clear view, including solutions to the decoherence and gate support problems. Engineers also have a role to play in building a real quantum computer as a super-machine based on solid-state qubits. This paper has presented a new approach for the automated tuning of the controlled parameters of a genetic algorithm that is used for the synthesis of quantum circuits. Statistical data are saved on each generation and analyzed by an algorithm that dynamically adjusts the parameter values. This paper has also offered a strategy to implement Rechenberg's rule and the operator performance analysis in a circuit synthesis algorithm. The experiments and the source code availability prove the effectiveness of the approach for the quantum circuit synthesis task. Future work will focus on the automatic adjustment of the population size and of the selector type, depending on the problem complexity (which is directly related to the number of qubits involved in the circuit description).
References
1. Ruican, C., Udrescu, M., Prodan, L., Vladutiu, M.: Automatic Synthesis for Quantum Circuits Using Genetic Algorithms. In: Beliczynski, B., Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007. LNCS, vol. 4431, pp. 174–183. Springer, Heidelberg (2007)
2. Ruican, C., Udrescu, M., Prodan, L., Vladutiu, M.: A Genetic Algorithm Framework Applied to Quantum Circuit Synthesis. Nature Inspired Cooperative Strategies for Optimization (2007)
3. Ruican, C., Udrescu, M., Prodan, L., Vladutiu, M.: Software Architecture for Quantum Circuit Synthesis. In: International Conference on Artificial Intelligence and Soft Computing (2008)
4. Eiben, E.A., Michalewicz, Z., Schoenauer, M., Smith, J.E.: Parameter Control in Evolutionary Algorithms. In: Parameter Setting in Evolutionary Algorithms (2007)
5. Maslov, D.: Reversible Logic Synthesis Benchmarks Page (2008), http://www.cs.uvic.ca/%7Edmaslov
6. Maslov, D., Dueck, G.W., Miller, M.D.: Quantum Circuit Simplification and Level Compaction. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2008)
7. Herrera, F., Lozano, M.: Fuzzy adaptive genetic algorithms: design, taxonomy, and future directions. Soft Computing 7(8), 545–562 (2003)
8. Stillerman, M., Guaspari, D., Polak, W.: Final Report-A Design Language for Quantum Computing. Odyssey Research Associates, Inc., New York (2003)
9. Svore, K., Cross, A., Aho, A., Chuang, I., Markov, I.: Toward a Software Architecture for Quantum Computing Design Tools. IEEE Computer, Los Alamitos (2006)
10. Rubinstein, B.I.P.: Evolving quantum circuits using genetic programming. In: Proceedings of the 2001 Congress on Evolutionary Computation (2001)
11. Gheorghies, O., Luchian, H., Gheorghies, A.: Walking the Royal Road with Integrated-Adaptive Genetic Algorithms. University Alexandru Ioan Cuza of Iasi (2005), http://thor.info.uaic.ro/~tr/tr05-04.pdf
12. Lukac, M., Perkowski, M.: Evolving quantum circuits using genetic algorithm. In: Proceedings of the 2002 NASA/DOD Conference on Evolvable Hardware (2002)
13. Van Meter, R., Munro, V.J., Nemoto, K., Itoh, K.M.: Arithmetic on a Distributed-Memory Quantum Multicomputer. ACM Journal on Emerging Technologies in Computer Systems 3(4), A17 (2008)
Comparison of CGP and Age-Layered CGP Performance in Image Operator Evolution
Karel Slaný
Faculty of Information Technology, Brno University of Technology
Božetěchova 2, 612 66 Brno, Czech Republic
[email protected]
Abstract. This paper analyses the efficiency of the Cartesian Genetic Programming (CGP) methodology in the image operator design problem at the functional level. The CGP algorithm is compared with an age-layering enhancement of the CGP algorithm in terms of the best results achieved and the computational effort required. Experimental results show that the Age-Layered Population Structure (ALPS) algorithm combined with CGP can perform better in the task of image operator design than a common CGP algorithm.
1 Introduction
Cartesian Genetic Programming (CGP) was introduced by J. F. Miller and P. Thomson in 1999 [8]. In contrast to the standard genetic programming approach, CGP represents solution programs as bounded (c × r)-node directed graphs. It utilizes only a mutation operator, which operates on small populations. The influence of different aspects of the CGP algorithm has been investigated; for example the role of neutrality [2, 14], bloat [6], modularity [13] and the usage of search strategies [7]. In order to evolve more complicated digital circuits, CGP has been extended to operate at the functional level [9]. Gates in the nodes were replaced by high-level components, such as adders, shifters, comparators, etc. This approach has been shown to be well suited to the evolution of various image operators such as noise removal filters and edge detectors. Several papers report that the resulting operators are human-competitive [5, 10].

The literature describes many techniques designed to preserve diversity in the population, such as pre-selection [1], crowding [3], island models [11], etc. Escaping the local optima can only be achieved by the introduction of new, randomly generated individuals into the population. The simplest way to implement this technique is to restart the evolution multiple times with different random number generator seeds. This increases the chance of finding the global optimum, but the evolutionary algorithm must be given enough time to find a local optimum. More sophisticated methods do not restart the process from scratch, but periodically introduce randomly generated individuals into the existing population. The algorithm has to ensure that the newcomers are not easily discarded by existing better solutions and receive enough time to evolve. One such algorithm is the Age-Layered Population Structure (ALPS) algorithm introduced by G. Hornby in 2006 [4].

This paper deals with the comparison of a standard CGP based algorithm with a modification of ALPS in the case of image operator evolution. The performance of these algorithms is measured by comparing the achieved fitness during the evolution process.

L. Vanneschi et al. (Eds.): EuroGP 2009, LNCS 5481, pp. 351–361, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Image Filter Evolution
As introduced in [9, 10], an image operator can be considered as a digital circuit with nine 8-bit inputs and a single 8-bit output. This circuit can then process grey-scaled 8-bit per pixel encoded images. Every pixel value of the filtered image is calculated using a corresponding pixel and its neighbours in the processed image.
2.1 CGP at the Functional Level
In CGP a candidate graph (circuit) solution is represented by an array of c (columns) × r (rows) interconnected programmable elements (gates). The number of the circuit inputs, ni, and outputs, no, is fixed through the entire evolutionary process. Each element input can be connected to the output of any other element which is placed somewhere in the previous l columns. Feedback connections are not allowed. The l parameter defines the interconnection degree and influences the size of the total search space. In case of l = 1 only neighbouring columns are allowed to connect; on the other hand, if l = c then an element can connect to any element in any previous column. Each programmable element is allowed to perform one of the functions defined in the function set Γ. The functions stored in Γ influence the design level of the circuit. The set Γ can represent a set of gates or functions defined at a higher level of abstraction. Every candidate solution is encoded into a chromosome, which is a string of c × r × (ei + 1) + no integers as shown in fig. 1. The parameter ei is the number of inputs of the used programmable elements. If we use two-input programmable elements, then ei = 2. CGP, unlike genetic programming (GP), operates with a small population of λ elements, where usually λ = 5. The initial population is randomly generated. In each generation step the new population consists of a parent, which is the fittest individual from the previous generation, and its mutants. In case of multiple individuals sharing the best fitness, an individual which was not selected to be the parent in the previous generation is selected to be the parent of the next generation. This is used mainly to ensure diversity of the small population. The mutants are created by a mutation operator, which randomly modifies genes of an individual. The crossover operator is not used in CGP.
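The chromosome encoding can be illustrated by decoding the example chromosome of fig. 1. The sketch below rests on several assumptions that the text does not state explicitly: each gene triple is read as (first input, second input, function index), circuit inputs are numbered from 0, elements are numbered after the inputs in chromosome order, and results are not truncated to 8 bits. The function names are hypothetical.

```python
def decode(chromosome, n_nodes, n_outputs, arity=2):
    """Split a chromosome into node genes (in1, in2, function) and output genes."""
    step = arity + 1
    genes = chromosome[: n_nodes * step]
    nodes = [tuple(genes[k:k + step]) for k in range(0, len(genes), step)]
    return nodes, chromosome[-n_outputs:]


def evaluate(chromosome, inputs, functions, n_nodes, n_outputs):
    """Compute the circuit outputs for one input vector (feed-forward only)."""
    nodes, outputs = decode(chromosome, n_nodes, n_outputs)
    values = list(inputs)                       # indices 0 .. n_inputs - 1
    for in1, in2, func in nodes:                # element k gets index n_inputs + k
        values.append(functions[func](values[in1], values[in2]))
    return [values[o] for o in outputs]


def active_nodes(chromosome, n_inputs, n_nodes, n_outputs):
    """Indices of the elements that actually contribute to an output."""
    nodes, outputs = decode(chromosome, n_nodes, n_outputs)
    active, stack = set(), list(outputs)
    while stack:
        index = stack.pop()
        if index >= n_inputs and index not in active:
            active.add(index)
            in1, in2, _ = nodes[index - n_inputs]
            stack.extend((in1, in2))
    return sorted(active)


# Chromosome and function set from fig. 1: c = 3, r = 2, three inputs, two outputs.
FIG1 = [1, 2, 1, 0, 0, 1, 2, 3, 0, 3, 4, 0, 1, 6, 0, 0, 6, 1, 6, 8]
GAMMA = [lambda x, y: x + y, lambda x, y: x - y]   # +(0), -(1)

assert len(FIG1) == 3 * 2 * (2 + 1) + 2            # c * r * (ei + 1) + no
result = evaluate(FIG1, [5, 3, 2], GAMMA, n_nodes=6, n_outputs=2)
used = active_nodes(FIG1, n_inputs=3, n_nodes=6, n_outputs=2)
```

Under this indexing the elements 3, 4, 6 and 8 are active for the fig. 1 chromosome, so two of the six elements do not influence the outputs; the figure may number its gates differently.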
In various applications which utilize CGP, crossover has been identified to have a rather destructive effect. In the particular case of image filter evolution at the functional level the crossover operator does not demonstrate a very destructive behaviour [12]. However, no benefits of utilizing crossover operators have been shown.

Fig. 1. An example of a 3-input circuit. Parameters are set to l = 3, c = 3, r = 2, Γ = {+(0), −(1)}. Gates 2 and 7 are not used. The chromosome looks like 1,2,1, 0,0,1, 2,3,0, 3,4,0, 1,6,0, 0,6,1, 6, 8. The last two numbers represent the connection of the outputs.

Fig. 2. Example of an image filter consisting of a 3 × 3 input and output matrix connected to the image operator circuit

In image operator evolution the goal of CGP is to find a filter which minimizes the difference between the filtered image If and the reference image Ir, which must be present for a particular input image Ii. If the input and the reference image are of the size K × L pixels and a square 3 × 3 input and output matrix is used, then the filtered image has the size of (K − 2) × (L − 2) pixels. Because of the shape of the matrices the pixels at the edge of the filtered images can be read but they cannot be written. For an 8-bit grey-scale image the fitness value fv can be defined by the following expression:
$$f_v = \sum_{i=1}^{K-2} \sum_{j=1}^{L-2} \left| I_f(i,j) - I_r(i,j) \right| \tag{1}$$
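Expression (1) can be sketched in plain Python as follows. Images are assumed to be lists of rows of 8-bit integers, the candidate operator is assumed to be any function mapping the nine window pixels to one value, and the names `apply_operator` and `fitness` are chosen for illustration only.

```python
def apply_operator(image, operator):
    """Slide a 3 x 3 window over a K x L image; the result is (K-2) x (L-2)."""
    rows, cols = len(image), len(image[0])
    filtered = []
    for i in range(1, rows - 1):
        out_row = []
        for j in range(1, cols - 1):
            window = [image[i + di][j + dj]
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            out_row.append(operator(window) & 0xFF)   # keep the result in 8 bits
        filtered.append(out_row)
    return filtered


def fitness(filtered, reference):
    """Sum of absolute pixel differences over the non-edge pixels, as in (1)."""
    total = 0
    for i, row in enumerate(filtered):
        for j, value in enumerate(row):
            # filtered[i][j] corresponds to the reference pixel (i + 1, j + 1)
            total += abs(value - reference[i + 1][j + 1])
    return total


image = [[(3 * i + 5 * j) % 256 for j in range(4)] for i in range(4)]
identity = lambda window: window[4]          # centre pixel of the 3 x 3 window
score = fitness(apply_operator(image, identity), image)
```

An operator that simply passes the centre pixel through reproduces the interior of the input, so comparing its output with the input itself gives a fitness of zero.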
The expression (1) sums the differences of corresponding pixels in the filtered and the reference image. When fv drops to fv = 0, the images If and Ir of the size K × L pixels are indistinguishable from each other (except the pixels on the edges). Papers [10, 5] show that this approach leads to good image filters. The results are satisfactory even in cases when only a single image is used in the fitness function.
2.2 ALPS Paradigm for CGP
Premature convergence has always been a problem in genetic algorithms. One way to deal with this problem is to increase the mutation probability. This keeps the diversity high, but it is also very likely to destroy good alleles which are already present in the population. When the mutation rate is set too high, the genetic operator cannot explore the narrow surroundings of a particular solution. Large population sizes can also be a solution to the diversity problem, but then more time is needed to search for a good solution. The Age-Layered Population Structure (ALPS) [4] algorithm adds time tags into a genetic algorithm. These tags represent the age of a particular candidate solution in the population. The candidate solutions are only allowed to interact and compete in groups which contain solutions of similar age. By structuring the population into age-based groups, the algorithm ensures that a newly generated solution cannot easily be outperformed by a better solution which is already present in the population. In addition, new, randomly generated solutions are added at regular intervals. These are the two main mechanisms which maintain population diversity in the ALPS algorithm. The age measure is the count of how many generations the candidate solution has been evolving in the population. Newly generated solutions start with the age of 0. Individuals which were generated by a genetic operator such as mutation or crossover receive the age value of the oldest parent increased by 1. Every time a candidate solution is taken to be a parent, its age increases by 1. If a candidate solution is used multiple times as a parent during one generation cycle, its age is still increased by 1 only once. The population is defined as a structure of age layers, which restrict the competition and breeding among candidate solutions. Each layer, except for the last layer, has a maximum age limit.
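The age bookkeeping described above can be sketched as follows. The `Individual` class, its fields and the two helper functions are hypothetical names introduced for illustration; the sketch only encodes the three stated rules (new solutions start at age 0, offspring inherit the oldest parent's age plus 1, and a parent ages by 1 at most once per generation).

```python
from dataclasses import dataclass


@dataclass
class Individual:
    genome: list
    age: int = 0                    # newly generated solutions start with age 0
    used_as_parent: bool = False    # set when selected as a parent this generation


def make_offspring(parents, genome):
    """Create an offspring whose age is the oldest parent's age plus 1."""
    for parent in parents:
        parent.used_as_parent = True
    return Individual(genome, age=max(p.age for p in parents) + 1)


def end_of_generation(population):
    """Age every individual that served as a parent, once per generation."""
    for individual in population:
        if individual.used_as_parent:
            individual.age += 1
            individual.used_as_parent = False


parent = Individual([0, 1, 2], age=3)
first = make_offspring([parent], [0, 1, 3])
second = make_offspring([parent], [0, 4, 2])    # same parent used twice
end_of_generation([parent, first, second])
```

Recording parenthood with a flag and applying the increment at the end of the generation is one simple way to make sure that repeated use of the same parent costs it only one year of age.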
This limit allows only solutions with an age below its value to be in the layer. The last layer has no maximum age restriction, so that any best solution can stay in this layer for an unlimited time. The structure of the layers can be defined in various ways. Different systems for setting the age limits are shown in tab. 1. The limit values are multiplied by an age-gap parameter to obtain the maximum age of an individual in each layer.

Table 1. Various aging scheme distributions, which can be used in the ALPS algorithm
                 maximum age in layer
aging scheme     0     1     2     3     4     5
linear           1     2     3     4     5     6
Fibonacci        1     2     3     5     8     13
polynomial       1     2     4     9     16    25
exponential      1     2     4     8     16    32
factorial        1     2     6     24    120   720

The ALPS algorithm was designed for maintaining diversity in difficult GP based problems. Its main genetic operator is crossover with tournament selection. Crossover is not used in CGP, however; only mutation and elitism are utilized. Some modifications are therefore needed to make the ALPS algorithm work with CGP in the case of image filter evolution. These changes mainly involve removing the crossover operator from the algorithm.

During every generation cycle each layer interacts with other layers by sending individuals to the next layer or by receiving new individuals from the previous layer. The original ALPS algorithm starts with a randomly initialized first layer. The other layers are empty and are filled during the process of evolution. The individuals grow older and move to the next layers or are discarded. At regular intervals the bottom layer is replaced by newly generated individuals. Let us call the parameter describing this behaviour randomize-period. Its value is the number of generations between two randomizations of the bottom layer. Whenever the age of a member of a particular layer exceeds the age limit for this layer, the member is moved to the next layer. This formulation can cause trouble in the implementation of the algorithm; consider the case when new members have to be moved into a fully occupied higher layer. In this case, the layer which has to accept members or offspring from a previous layer is divided into halves. The first half is used for generating new members from the original layer and the second half is populated by the newcomers. After this step both halves behave again as a single layer. Elitism, similar to CGP, is also used. Each layer keeps its best evolved member and only replaces it with an individual with a better or at least the same fitness. If this individual is selected to be a parent, its age is not increased, in order to keep it in the current layer. During the process of evolution the size limits of the population do not change, but the number of individuals in the layers may vary. This is caused by the fact that in the initial phase the algorithm starts with only one populated layer. In certain situations a layer can also become extinct, when the current layer members and their offspring are moved into the next layer and no newcomers have arrived.
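The layer age limits derived from the aging schemes of tab. 1 and the age-gap parameter can be sketched as below. The closed forms used to generate the multipliers are an assumption chosen to reproduce the tabulated values for layers 0 to 5, and the function names are hypothetical. An individual is taken to stay in a layer while its age is below that layer's limit.

```python
from math import factorial


def scheme_factors(scheme, n):
    """First n multipliers of an aging scheme, matching the values in tab. 1."""
    if scheme == "linear":
        return [i + 1 for i in range(n)]
    if scheme == "fibonacci":
        factors, a, b = [], 1, 2
        for _ in range(n):
            factors.append(a)
            a, b = b, a + b
        return factors
    if scheme == "polynomial":
        return [i + 1 if i < 2 else i * i for i in range(n)]
    if scheme == "exponential":
        return [2 ** i for i in range(n)]
    if scheme == "factorial":
        return [factorial(i + 1) for i in range(n)]
    raise ValueError(f"unknown aging scheme: {scheme}")


def age_limits(scheme, n_layers, age_gap):
    """Maximum age per layer; the last layer has no limit (None)."""
    return [f * age_gap for f in scheme_factors(scheme, n_layers - 1)] + [None]


def target_layer(age, limits):
    """Index of the lowest layer whose limit the given age does not reach."""
    for layer, limit in enumerate(limits):
        if limit is None or age < limit:
            return layer
    return len(limits) - 1


limits = age_limits("polynomial", n_layers=5, age_gap=20)
```

With five layers, the polynomial scheme and an age-gap of 20, the limits come out as 20, 40, 80 and 180 generations for the first four layers, with the top layer unrestricted.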
3 Experimental Set-Up
In order to evolve image operators, a set of functions has been adopted from [10]. The CGP and modified ALPS-CGP algorithms are used to find a random shot noise filter and a Sobel filter using a set of three pictures as training data. The evolved image operators are connected to a 3 × 3 input and output mask. The size of the pictures is 256 × 256 pixels.

Table 2. Function set used in the experiments. All functions have 8-bit inputs and outputs.
ID   Function        Description
0    x ∨ y           binary or
1    x ∧ y           binary and
2    x ⊕ y           binary xor
3    x + y           addition
4    x +sat y        saturated addition
5    (x + y) >> 1    average
6    Max(x, y)       maximum
7    Min(x, y)       minimum

Fig. 3. Images used in described experiments. Top row contains input images entering the evolved image operators. Bottom row shows reference images used for fitness evaluation. The evolution searches for a Sobel operator and for a random shot noise removal filter.

The experiment consists of two main groups which differ in the way the evolved image operator is defined. In the first set of experiments the image operator consists of 8 columns × 6 rows of programmable elements with the interconnection parameter l = 1. The value l = 1 allows only interconnection of neighbouring columns. It is chosen because it allows an easy implementation as a pipelined filter in hardware. The second set of experiments utilizes chromosomes which consist of only a single row with 48 programmable elements with the interconnection parameter set to l = 48. This value ensures unlimited interconnection, except that the output of an evolved circuit cannot connect directly to its input.

The CGP algorithm uses a population size of 8 individuals. The mutation probability is set to 8% in both algorithms. The ALPS-CGP algorithm uses 5 layers. Each layer can hold up to 8 individuals. The polynomial aging scheme is used with age-gap = 20. The bottom layer is regenerated with random individuals every randomize-period = 5 cycles. Each evolutionary process of 10000 generations is repeated 100 times. Average data are used to compare the two algorithms. The measured data are compared according to the evaluation number. That means the fitness values are plotted against the number of evaluations which the algorithm has performed rather than against the generation it has reached. This is done because the ALPS-CGP algorithm uses larger populations. Thus it has a greater chance of exploring a larger part of the search space in a single generation cycle than the CGP algorithm.

Fig. 4. The progression of the fitness value during the first 500 evaluations when evolving the Sobel filter by using the camera pictures. The image operator consists of 8 × 6 elements.

Fig. 5. The progression of the fitness value during the first 500 evaluations when evolving the Sobel filter by using the circle pictures. The image operator consists of 8 × 6 elements.

Fig. 6. The progression of the fitness value during the first 500 evaluations when evolving the random shot noise removal filter by using the Lena pictures. The image operator consists of 8 × 6 elements.
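Recording fitness against the number of evaluations, rather than against the generation number, can be sketched with a plain parent-plus-mutants loop. This is a toy sketch under stated assumptions: the genome is a list of 8-bit integers instead of a CGP chromosome, fitness is minimized, the tie rule is simplified to accepting a mutant that is at least as good as the parent, and the function names are hypothetical. Only the population size of 8 and the 8% mutation probability are taken from the set-up above.

```python
import random


def mutate(genome, rate, rng):
    """Replace each gene by a random 8-bit value with probability `rate`."""
    return [rng.randrange(256) if rng.random() < rate else g for g in genome]


def evolve(fitness, genome, pop_size=8, rate=0.08, generations=100, seed=0):
    """Parent-plus-mutants search; returns best genome, its fitness and a log."""
    rng = random.Random(seed)
    parent, parent_fit = list(genome), fitness(genome)
    evaluations = 1
    history = [(evaluations, parent_fit)]        # (evaluation count, best fitness)
    for _ in range(generations):
        mutants = [mutate(parent, rate, rng) for _ in range(pop_size - 1)]
        scored = [(fitness(m), m) for m in mutants]
        evaluations += len(scored)
        best_fit, best = min(scored, key=lambda pair: pair[0])
        if best_fit <= parent_fit:               # ties go to the mutant
            parent, parent_fit = best, best_fit
        history.append((evaluations, parent_fit))
    return parent, parent_fit, history


target = [10, 20, 30, 40]
distance = lambda g: sum(abs(a - b) for a, b in zip(g, target))
best, best_fit, history = evolve(distance, [0, 0, 0, 0], generations=50)
```

Plotting the second column of `history` against the first gives a curve that can be compared fairly between algorithms that evaluate different numbers of individuals per generation.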
Table 3. The average achieved fitness after finishing 10000 generations
set of experiments   image set       average fitness CGP   average fitness ALPS
1                    camera Sobel    2 034 740             1 371 492
1                    circle Sobel    1 948 039             1 345 093
1                    Lena impulse    52 980                31 681
2                    camera Sobel    2 323 204             1 893 629
2                    circle Sobel    2 176 023             1 580 389
2                    Lena impulse    47 557                31 284
Fig. 7. The progression of the fitness value during the first 500 evaluations when evolving the Sobel filter by using the camera pictures. The image operator consists of 1 × 48 elements.
Fig. 8. The progression of the fitness value during the first 500 evaluations when evolving the Sobel filter by using the circle pictures. The image operator consists of 1 × 48 elements.
3.1 Results
In the first set of experiments, image operators with a rectangular arrangement of programmable elements and the interconnection parameter l = 1 were evolved. The graphs in fig. 4, 5 and 6 show that the ALPS-CGP algorithm behaves slightly better than the standard CGP algorithm. On average we have obtained similar or better image operators in less time than when using only the CGP algorithm.
Fig. 9. The progression of the fitness value during the first 500 evaluations when evolving the random shot noise removal filter by using the Lena pictures. The image operator consists of 1 × 48 elements.
The average fitness values, measured after reaching the final generation, are summarized in tab. 3. In the second set of experiments the evolutionary process searched for image operators with a linear structure and a high interconnection parameter, allowing more complex structures to be designed. This also increases the search space of all possible solutions. The graphs in fig. 7, 8 and 9 show the behaviour of the algorithms in the first 500 evaluations. Again we have obtained similar results.
4 Discussion
In the first group of experiments, the ALPS-CGP algorithm shows a performance gain. In the initial phase of a run, ALPS-CGP converges to local optima faster than the simple CGP algorithm, yet it still keeps improving the fitness of the evolved image operators afterwards. On average, the ALPS algorithm achieves better fitness values.
In the second group of experiments, the algorithms behave much as in the first group: ALPS again achieves better fitness values than the CGP algorithm. Only for the random shot noise filter do the two algorithms show approximately the same performance. This may be because the random shot filter is easier to construct from the function set Γ. The second group of experiments also showed that the evolved image operators achieve slightly worse results than in the first group. One explanation may be that in the second case the elements of the image operator are allowed to connect more freely than in the first experimental set, so the search space is much larger. Another explanation may be that the image operator configuration taken from [10] might be better optimized for the role of an image operator.
The whole system is implemented in software. Completing all runs of the first or the second set of experiments takes 6 days of computation on a system with two 1800 MHz dual-core AMD Opteron processors. This is mainly caused by the time-consuming fitness function: every pixel in the training images has to be computed separately. The slow performance is also a drawback when the optimal parameter settings have to be found. The ALPS-CGP algorithm is not much more complex than the CGP algorithm; the main differences are the time-tags and the resulting restriction. Neither is difficult to implement in hardware, so the next step will be to implement the system in an FPGA. This will allow a more thorough evaluation of ALPS-CGP and CGP performance. Larger input masks can then also be used, which may lead to better image operators.
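The time-tags and the restriction they impose can be illustrated with a minimal sketch. It follows the general age-layering idea of [4] and is not the implementation used in these experiments. The names `Individual`, `layer_limits`, `allowed_layer` and `enforce_age_restriction` are hypothetical, and the linear age limits, the fixed layer capacity, the replace-worst promotion rule and the convention that lower fitness is better are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Individual:
    genome: tuple
    fitness: float  # assumption: lower is better
    age: int = 0    # the time-tag


def layer_limits(num_layers, age_gap):
    """Maximum age per layer; linear spacing is an assumption of this sketch."""
    limits = [float(age_gap * (i + 1)) for i in range(num_layers - 1)]
    limits.append(math.inf)  # the top layer accepts individuals of any age
    return limits


def allowed_layer(age, limits):
    """Index of the lowest layer whose age limit still admits this age."""
    for i, limit in enumerate(limits):
        if age <= limit:
            return i
    return len(limits) - 1


def enforce_age_restriction(layers, limits, capacity):
    """Push over-age individuals one layer up, working from the bottom layer."""
    for i in range(len(layers) - 1):
        staying = [ind for ind in layers[i] if ind.age <= limits[i]]
        leaving = [ind for ind in layers[i] if ind.age > limits[i]]
        layers[i] = staying
        target = layers[i + 1]
        for ind in leaving:
            if len(target) < capacity:
                target.append(ind)
                continue
            # Layer is full: the newcomer must beat the worst resident.
            worst = max(range(len(target)), key=lambda k: target[k].fitness)
            if ind.fitness < target[worst].fitness:
                target[worst] = ind
    return layers


if __name__ == "__main__":
    limits = layer_limits(3, 10)
    layers = [
        [Individual((0,), 5.0, age=12), Individual((1,), 7.0, age=3)],
        [Individual((2,), 9.0, age=15)],
        [],
    ]
    enforce_age_restriction(layers, limits, capacity=1)
    for idx, layer in enumerate(layers):
        print(idx, [ind.genome for ind in layer])
```

Apart from the age counter stored with each individual, the only extra logic is a comparison against a per-layer limit, which is consistent with the remark that the scheme should map to hardware without much difficulty.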
5 Conclusions
The performance of a standard CGP approach and of an ALPS-enhanced CGP algorithm was analysed on the task of image operator evolution. The performance of the algorithms was measured under six system settings. The experiments have shown that the ALPS-CGP algorithm performs better than the standard CGP algorithm. In the more difficult cases, however, the advantage of the ALPS-CGP algorithm appears to be small. Further experiments, including a hardware implementation, are needed to obtain a better comparison of the two algorithms.
Acknowledgements. This work was supported by the Grant Agency of the Czech Republic under No. 102/07/0850, Design and hardware implementation of a patent-invention machine, and by the Research intention No. MSM 0021630528, Security-Oriented Research in Information Technology.
References
[1] Cavicchio, D.J.: Adaptive search using simulated evolution. Ph.D. thesis, University of Michigan (1970)
[2] Collins, M.: Finding needles in haystacks is harder with neutrality. Genetic Programming and Evolvable Machines 7(2), 131–144 (2006)
[3] De Jong, K.A.: Analysis of the Behavior of a Class of Genetic Adaptive Systems. Ph.D. thesis, University of Michigan (1975)
[4] Hornby, G.S.: ALPS: the age-layered population structure for reducing the problem of premature convergence. In: GECCO 2006: Proceedings of the 8th annual conference on Genetic and evolutionary computation, pp. 815–822. ACM, New York (2006)
[5] Martínek, T., Sekanina, L.: An evolvable image filter: Experimental evaluation of a complete hardware implementation in FPGA. In: Moreno, J.M., Madrenas, J., Cosp, J. (eds.) ICES 2005. LNCS, vol. 3637, pp. 76–85. Springer, Heidelberg (2005)
[6] Miller, J.F.: What bloat? Cartesian genetic programming on Boolean problems. In: 2001 Genetic and Evolutionary Computation Conference Late Breaking Papers, pp. 295–302 (2001)
[7] Miller, J.F., Smith, S.L.: Redundancy and computational efficiency in Cartesian genetic programming. IEEE Trans. Evolutionary Computation 10(2), 167–174 (2006)
[8] Miller, J.F., Thomson, P.: Cartesian genetic programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000)
[9] Sekanina, L.: Image filter design with evolvable hardware. In: Cagnoni, S., Gottlieb, J., Hart, E., Middendorf, M., Raidl, G.R. (eds.) EvoIASP 2002, EvoWorkshops 2002, EvoSTIM 2002, EvoCOP 2002, and EvoPlan 2002. LNCS, vol. 2279, pp. 255–266. Springer, Heidelberg (2002)
[10] Sekanina, L.: Evolvable Components: From Theory to Hardware. Springer, Heidelberg (2004)
[11] Skolicki, Z., De Jong, K.A.: Improving evolutionary algorithms with multirepresentation island models. In: Yao, X., Burke, E.K., Lozano, J.A., Smith, J., Merelo-Guervós, J.J., Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004. LNCS, vol. 3242, pp. 420–429. Springer, Heidelberg (2004)
[12] Slaný, K., Sekanina, L.: Fitness landscape analysis and image filter evolution using functional-level CGP. In: Ebner, M., O’Neill, M., Ekárt, A., Vanneschi, L., Esparcia-Alcázar, A.I. (eds.) EuroGP 2007. LNCS, vol. 4445, pp. 311–320. Springer, Heidelberg (2007)
[13] Walker, J.A., Miller, J.F.: Investigating the performance of module acquisition in Cartesian genetic programming. In: GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, pp. 1649–1656. ACM, New York (2005)
[14] Yu, T., Miller, J.F.: Neutrality and the evolvability of Boolean function landscape. In: Miller, J., Tomassini, M., Lanzi, P.L., Ryan, C., Tettamanzi, A.G.B., Langdon, W.B. (eds.) EuroGP 2001. LNCS, vol. 2038, pp. 204–217. Springer, Heidelberg (2001)
Author Index
Affenzeller, Michael 232
Ahalpara, Dilip P. 13
Aler, Ricardo 244
Alonso, César L. 315
Alonso, Pablo 244
Amil, Nur Merve 327
Arora, Siddharth 13
Banzhaf, Wolfgang 85, 133
Borges, Cruz E. 315
Bredeche, Nicolas 327
Bull, Larry 37
Carr, Hamish 183
Cattral, Robert 303
Chen, Shu-Heng 171
Citi, Luca 25
Costelloe, Dan 61
Crane, Ellery 25
Crespo, José L. 315
Curry, Robert 1
Dignum, Stephen 159
Ebner, Marc 268
Ediger, Patrick 280
Estébanez, César 244
Fey, Dietmar 280
Gagné, Christian 327
Gelly, Sylvain 327
Graff, Mario 145, 195
Graham, Lee 303
Harding, Simon 133
Heywood, Malcolm I. 1
Hoffmann, Rolf 280
Hu, Ting 85
Jackson, David 256
Johnson, Colin G. 97
Joó, András 73
Komann, Marcus 280
Kronberger, Gabriel 232
McPhee, Nicholas F. 25
Miconi, Thomas 49
Miller, Julian F. 133
Montaña, José L. 315
Murphy, James E. 183
Nagao, Tomoharu 109
Neshatian, Kourosh 121
Nguyen, Quang Uy 292
Nguyen, Xuan Hoai 292
O’Neill, Michael 183, 292
Oppacher, Franz 303
Poli, Riccardo 25, 145, 195
Poulding, Simon 220
Preen, Richard 37
Prodan, Lucian 339
Ruican, Cristian 339
Ryan, Conor 61
Santhanam, M.S. 13
Schoenauer, Marc 327
Shirakawa, Shinichi 109
Silva, Sara 159
Slaný, Karel 351
Tai, Chung-Ching 171
Teytaud, Olivier 327
Udrescu, Mihai 339
Valls, José M. 244
Veenhuis, Christian B. 208
Vladutiu, Mircea 339
Wagner, Stefan 232
White, David R. 220
Winkler, Stephan 232
Zhang, Mengjie 121