English Pages 464 [459] Year 2007
Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4684
Lishan Kang Yong Liu Sanyou Zeng (Eds.)
Evolvable Systems: From Biology to Hardware 7th International Conference, ICES 2007 Wuhan, China, September 21-23, 2007 Proceedings
Volume Editors Lishan Kang China University of Geosciences School of Computer Science Wuhan, Hubei 430074, China E-mail: kangw [email protected] Yong Liu The University of Aizu,Tsuruga Ikki-machi, Aizu-Wakamatsu City, Fukushima 965-8580, Japan E-mail: [email protected] Sanyou Zeng China University of Geosciences School of Computer Science Wuhan, Hubei 430074, China E-mail: [email protected]
Library of Congress Control Number: 2007933938
CR Subject Classification (1998): B.6, B.7, F.1, I.6, I.2, J.2, J.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-74625-0 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-74625-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12115266 06/3180 543210
Preface
We are proud to introduce the proceedings of the 7th International Conference on Evolvable Systems: From Biology to Hardware (ICES 2007) held in Wuhan, China, September 21–23, 2007. ICES 2007 successfully attracted 123 submissions. After rigorous reviews, 41 high-quality papers were included in the proceedings of ICES 2007, representing an acceptance rate of 33%. ICES conferences are the first series of international conferences on evolvable systems. The idea of evolvable systems, whose origins can be traced back to the cybernetics movement of the 1940s and the 1950s, has recently led to bio-inspired systems with self-reproduction or self-repair of the original hardware structures, and evolvable hardware with the autonomous reconfiguration of hardware structures by evolutionary algorithms. Following the workshop Towards Evolvable Hardware taking place in Lausanne, Switzerland, in October 1995, the 1st International Conference on Evolvable Systems: From Biology to Hardware (ICES 1996) was held in Tsukuba, Japan (1996). Subsequent ICES conferences were held in Lausanne, Switzerland (1998), Edinburgh, UK (2000), Tokyo, Japan (2001), Trondheim, Norway (2003), and Barcelona, Spain (2005) where it was decided that China University of Geosciences, Wuhan, would be the location of ICES 2007 with Lishan Kang as the General Chair. ICES 2007 addressed the theme “From Laboratory to Real World” by explaining how to shorten the gap between evolvable hardware research and design for real-world applications in semiconductor engineering and mechanical engineering. ICES 2007 featured the most up-to-date research and applications in digital hardware evolution, analog hardware evolution, bio-inspired systems, mechanical hardware evolution, evolutionary algorithms in hardware design, and hardware implementations of evolutionary algorithms. 
ICES 2007 also provided a venue to foster technical exchanges, renew everlasting friendships, establish new connections, and present Chinese cultural traditions to help overcome cultural barriers. On behalf of the Organizing Committee, we would like to warmly thank the sponsors, China University of Geosciences and Chinese Society of Astronautics, who helped in one way or another to achieve our goals for the conference. We wish to express our appreciation to Springer for publishing the proceedings of ICES 2007 in the Lecture Notes in Computer Science. We would also like to thank the authors for submitting their work, as well as the Program Committee members and reviewers for their enthusiasm, time and expertise. The invaluable help of active members of the Organizing Committee, including Xuesong Yan, Qiuming Zhang, Yan Guo, Siqing Xue, Ziyi Chen, Xiang Li, Guang Chen, Rui Wang, Hui Wang, and Hui Shi, in setting up and maintaining the online submission systems, assigning the papers to the reviewers, and preparing the camera-ready version of the proceedings was highly appreciated, and we would like to thank them personally for their efforts to make ICES 2007 a success.

September 2007
Lishan Kang Yong Liu Sanyou Zeng
Organization
ICES 2007 was organized by the School of Computer Science and Research Center for Space Science and Technology, China University of Geosciences, sponsored by China University of Geosciences and Chinese Society of Astronautics.
Honorary Conference Chair
Yanxin Wang, China University of Geosciences, China

General Chair
Lishan Kang, China University of Geosciences, China

Program Chair
Yong Liu, University of Aizu, Japan
Tetsuya Higuchi, National Institute of Advanced Industrial Science and Technology, Japan

Local Chair
Sanyou Zeng, China University of Geosciences, China
Program Committee
Elhadj Benkhelifa, University of the West of England, UK
Peter J. Bentley, University College London, UK
Stefano Cagnoni, Università degli Studi di Parma, Italy
Carlos A. Coello Coello, Depto. de Computación, Mexico
Peter Dittrich, Friedrich Schiller University, Germany
Marco Dorigo, Université Libre de Bruxelles, Belgium
Rolf Drechsler, University of Bremen, Germany
Marc Ebner, Universitaet Wuerzburg, Germany
Manfred Glesner, Darmstadt University, Germany
Darko Grundler, University of Zagreb, Croatia
Pauline C. Haddow, The Norwegian University of Science and Technology, Norway
Alister Hamilton, Edinburgh University, UK
Morten Hartmann, Norwegian University of Science and Technology, Norway
Jingsong He, University of Science and Technology of China, China
Arturo Hernandez Aguirre, Tulane University, USA
Francisco Herrera, University of Granada, Spain
Tetsuya Higuchi, National Institute of Advanced Industrial Science and Technology, Japan
Masaya Iwata, National Institute of Advanced Industrial Science and Technology, Japan
Yaochu Jin, Honda Research Institute Europe, Germany
Didier Keymeulen, Jet Propulsion Laboratory, USA
Jason Lohn, NASA Ames Research Center, USA
Michael Lones, Department of Electronics, University of York, UK
Wenjian Luo, University of Science and Technology of China, China
Juan Manuel Moreno Arostegui, Technical University of Catalonia (UPC), Spain
Karlheinz Meier, University of Heidelberg, Germany
Julian Miller, Department of Electronics, University of York, UK
Masahiro Murakawa, National Institute of Advanced Industrial Science and Technology, Japan
Michael Orlov, Ben-Gurion University, Israel
Marek Perkowski, Portland State University, USA
Eduardo Sanchez, Logic Systems Laboratory, Switzerland
Lukas Sekanina, Brno University of Technology, Czech Republic
Moshe Sipper, Ben-Gurion University, Israel
Adrian Stoica, Jet Propulsion Lab, USA
Kiyoshi Tanaka, Shinshu University, Japan
Gianluca Tempesti, University of York, UK
Christof Teuscher, University of California, San Diego (UCSD), USA
Yann Thoma, École d'ingénieurs de Genève, Switzerland
Adrian Thompson, University of Sussex, UK
Jon Timmis, University of York, UK
Jim Torresen, University of Oslo, Norway
Jochen Triesch, J.W. Goethe University, Germany
Edward Tsang, University of Essex, UK
Gunnar Tufte, The Norwegian University of Science and Technology, Norway
Andy Tyrrell, University of York, UK
Youren Wang, Nanjing University of Aeronautics and Astronautics, China
Xin Yao, University of Birmingham, UK
Ricardo Zebulum, Jet Propulsion Lab, USA
Sanyou Zeng, China University of Geosciences, China
Qingfu Zhang, University of Essex, UK
Shuguang Zhao, Xidian University, China
Steering Committee
Pauline C. Haddow, The Norwegian University of Science and Technology, Norway
Tetsuya Higuchi, National Institute of Advanced Industrial Science and Technology, Japan
Julian F. Miller, University of Birmingham, UK
Jim Torresen, University of Oslo, Norway
Andy Tyrrell (Chair), University of York, UK
Table of Contents
Digital Hardware Evolution

An Online EHW Pattern Recognition System Applied to Sonar Spectrum Classification
  Kyrre Glette, Jim Torresen, and Moritoshi Yasunaga
Design of Electronic Circuits Using a Divide-and-Conquer Approach
  Guoliang He, Yuanxiang Li, Li Yu, Wei Zhang, and Hang Tu
Implementing Multi-VRC Cores to Evolve Combinational Logic Circuits in Parallel
  Jin Wang, Chang Hao Piao, and Chong Ho Lee
An Intrinsic Evolvable Hardware Based on Multiplexer Module Array
  Jixiang Zhu, Yuanxiang Li, Guoliang He, and Xuewen Xia
Estimating Array Connectivity and Applying Multi-output Node Structure in Evolutionary Design of Digital Circuits
  Jie Li and Shitan Huang
Research on the Online Evaluation Approach for the Digital Evolvable Hardware
  Rui Yao, You-ren Wang, Sheng-lin Yu, and Gui-jun Gao
Research on Multi-objective On-Line Evolution Technology of Digital Circuit Based on FPGA Model
  Guijun Gao, Youren Wang, Jiang Cui, and Rui Yao
Evolutionary Design of Generic Combinational Multipliers Using Development
  Michal Bidlo

Analog Hardware Evolution

Automatic Synthesis of Practical Passive Filters Using Clonal Selection Principle-Based Gene Expression Programming
  Zhaohui Gan, Zhenkun Yang, Gaobin Li, and Min Jiang
Research on Fault-Tolerance of Analog Circuits Based on Evolvable Hardware
  Qingjian Ji, Youren Wang, Min Xie, and Jiang Cui
Analog Circuit Evolution Based on FPTA-2
  Qiongqin Wu, Yu Shi, Juan Zheng, Rui Yao, and Youren Wang

Bio-inspired Systems

Knowledge Network Management System with Medicine Self Repairing Strategy
  JeongYon Shim
Design of a Cell in Embryonic Systems with Improved Efficiency and Fault-Tolerance
  Yuan Zhang, Youren Wang, Shanshan Yang, and Min Xie
Design on Operator-Based Reconfigurable Hardware Architecture and Cell Circuit
  Min Xie, Youren Wang, Li Wang, and Yuan Zhang
Bio-inspired Systems with Self-developing Mechanisms
  André Stauffer, Daniel Mange, Joël Rossier, and Fabien Vannel
Development of a Tiny Computer-Assisted Wireless EEG Biofeedback System
  Haifeng Chen, Ssanghee Seo, Donghee Ye, and Jungtae Lee
Steps Forward to Evolve Bio-inspired Embryonic Cell-Based Electronic Systems
  Elhadj Benkhelifa, Anthony Pipe, Mokhtar Nibouche, and Gabriel Dragffy
Evolution of Polymorphic Self-checking Circuits
  Lukas Sekanina

Mechanical Hardware Evolution

Sliding Algorithm for Reconfigurable Arrays of Processors
  Natalia Dowding and Andy M. Tyrrell
System-Level Modeling and Multi-objective Evolutionary Design of Pipelined FFT Processors for Wireless OFDM Receivers
  Erfu Yang, Ahmet T. Erdogan, Tughrul Arslan, and Nick Barton
Reducing the Area on a Chip Using a Bank of Evolved Filters
  Zdenek Vasicek and Lukas Sekanina

Evolutionary Design

Walsh Function Systems: The Bisectional Evolutional Generation Pattern
  Nengchao Wang, Jianhua Lu, and Baochang Shi
Extrinsic Evolvable Hardware on the RISA Architecture
  Andrew J. Greensted and Andy M. Tyrrell
Evolving and Analysing "Useful" Redundant Logic
  Asbjoern Djupdal and Pauline C. Haddow
Adaptive Transmission Technique in Underwater Acoustic Wireless Communication
  Guoqing Zhou and Taebo Shim
Autonomous Robot Path Planning Based on Swarm Intelligence and Stream Functions
  Chengyu Hu, Xiangning Wu, Qingzhong Liang, and Yongji Wang
Research on Adaptive System of the BTT-45 Air-to-Air Missile Based on Multilevel Hierarchical Intelligent Controller
  Yongbing Zhong, Jinfu Feng, Zhizhuan Peng, and Xiaolong Liang
The Design of an Evolvable On-Board Computer
  Chen Shi, Shitan Huang, and Xuesong Yan

Evolutionary Algorithms in Hardware Design

Extending Artificial Development: Exploiting Environmental Information for the Achievement of Phenotypic Plasticity
  Gunnar Tufte and Pauline C. Haddow
UDT-Based Multi-objective Evolutionary Design of Passive Power Filters of a Hybrid Power Filter System
  Shuguang Zhao, Qiu Du, Zongpu Liu, and Xianghe Pan
Designing Electronic Circuits by Means of Gene Expression Programming II
  Xuesong Yan, Wei Wei, Qingzhong Liang, Chengyu Hu, and Yuan Yao
Designing Polymorphic Circuits with Evolutionary Algorithm Based on Weighted Sum Method
  Houjun Liang, Wenjian Luo, and Xufa Wang
Robust and Efficient Multi-objective Automatic Adjustment for Optical Axes in Laser Systems Using Stochastic Binary Search Algorithm
  Nobuharu Murata, Hirokazu Nosato, Tatsumi Furuya, and Masahiro Murakawa
Minimization of the Redundant Sensor Nodes in Dense Wireless Sensor Networks
  Dingxing Zhang, Ming Xu, Wei Xiao, Junwen Gao, and Wenshen Tang
Evolving in Extended Hamming Distance Space: Hierarchical Mutation Strategy and Local Learning Principle for EHW
  Jie Li and Shitan Huang

Hardware Implementation of Evolutionary Algorithms

Adaptive and Evolvable Analog Electronics for Space Applications
  Adrian Stoica, Didier Keymeulen, Ricardo Zebulum, Mohammad Mojarradi, Srinivas Katkoori, and Taher Daud
Improving Flexibility in On-Line Evolvable Systems by Reconfigurable Computing
  Jim Torresen and Kyrre Glette
Evolutionary Design of Resilient Substitution Boxes: From Coding to Hardware Implementation
  Nadia Nedjah and Luiza de Macedo Mourelle
A Sophisticated Architecture for Evolutionary Multiobjective Optimization Utilizing High Performance DSP
  Quanxi Li and Jingsong He
FPGA-Based Genetic Algorithm Kernel Design
  Xunying Zhang, Chen Shi, and Fei Hui
Using Systolic Technique to Accelerate an EHW Engine for Lossless Image Compression
  Yunbi Chen and Jingsong He

Author Index
An Online EHW Pattern Recognition System Applied to Sonar Spectrum Classification

Kyrre Glette¹, Jim Torresen¹, and Moritoshi Yasunaga²

¹ University of Oslo, Department of Informatics, P.O. Box 1080 Blindern, 0316 Oslo, Norway {kyrrehg,jimtoer}@ifi.uio.no
² University of Tsukuba, Graduate School of Systems and Information Engineering, 1-1-1 Ten-ou-dai, Tsukuba, Ibaraki, Japan [email protected]
Abstract. An evolvable hardware (EHW) system for high-speed sonar return classification has been proposed. The system demonstrates an average accuracy of 91.4% on a sonar spectrum data set. This is better than a feed-forward neural network and previously proposed EHW architectures. Furthermore, this system is designed for online evolution. Incremental evolution, data buses and high level modules have been utilized in order to make the evolution of the 480 bit-input classifier feasible. The classification has been implemented for a Xilinx XC2VP30 FPGA with a resource utilization of 81% and a classification time of 0.5μs.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 1–12, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

High-speed pattern recognition systems applied in time-varying environments, and thus needing adaptability, could benefit from an online evolvable hardware (EHW) approach [1]. One EHW approach to online reconfigurability is the Virtual Reconfigurable Circuit (VRC) method proposed by Sekanina in [2]. This method does not change the bitstream to the FPGA itself; rather, it changes the register values of a circuit already implemented on the FPGA, and thereby obtains virtual reconfigurability. This approach has a speed advantage over reconfiguring the FPGA itself, and it is also more feasible because proprietary formats prevent direct FPGA bitstream manipulation. However, the method requires substantial logic resources.

An EHW pattern recognition system, Logic Design using Evolved Truth Tables (LoDETT), has been presented by Yasunaga et al. Applications include face image and sonar target recognition [3,4]. This architecture is capable of classifying large input vectors (512 bits) into several categories. The classifier function is directly coded in large AND gates. The category module with the highest number of activated AND gates determines the classification. Incremental evolution is utilized such that each category is evolved separately. The average recognition accuracy for this system, applied to the sonar target task, is 83.0%. However, evolution is performed offline and the final system is synthesized. This approach gives rapid (< 150 ns) classification in a compact circuit, but lacks run-time reconfigurability.

A system proposed earlier by the authors addresses the reconfigurability by employing a VRC-like array of high-level functions [5]. Online/on-chip evolution is attained, and therefore the system seems suited to applications with changes in the training set. However, the system is limited to recognizing one category out of ten possible input categories. A new architecture was then proposed by the authors to allow for the high classification capabilities of the LoDETT system, while maintaining the online evolution features from [5]. This was applied to multiple-category face image recognition and a slightly higher recognition accuracy than the LoDETT system was achieved [6]. While in LoDETT a large number of inputs to the AND gates can be optimized away during circuit synthesis, the run-time reconfiguration aspect of the online architecture has led to a different approach employing fewer elements. The evolution part of this system has been implemented on an FPGA in [7]. Fitness evaluation is carried out in hardware, while the evolutionary algorithm runs on an on-chip processor.

In this paper the architecture, previously applied to face image recognition, has been applied to the sonar target recognition task. The nature of this application has led to differences in the architecture parameters. Changes in the fitness function were necessary to deal with the higher difficulty of this problem.

The sonar target dataset was presented by Gorman and Sejnowski in [8]. A feed-forward neural network was presented, which contained 12 hidden units and was trained using the back-propagation algorithm. A classification accuracy of 90.4% was reported. Later, better results have been achieved on the same data set, using variants of the Support Vector Machine (SVM) method. An accuracy of 95.2% was obtained in a software implementation presented in [9]. There also exist hardware implementations of SVMs, such as [10], which performs biometric classification in 0.66 ms using an FPGA.

The next section introduces the architecture of the evolvable hardware system. Then, the sonar return-specific implementation is detailed in section 3. Aspects of evolution are discussed in section 4. Results from the experiments are given and discussed in section 5. Finally, section 6 concludes the paper.
2 The Online EHW Architecture

The EHW architecture is implemented as a circuit whose behaviour and connections can be controlled through configuration registers. By writing the genome bitstream from the genetic algorithm (GA) to these registers, one obtains the phenotype circuit which can then be evaluated. This approach is related to the VRC technique, as well as to the architectures in our previous works [11,5].

2.1 System Overview
A high-level view of the system can be seen in figure 1. The system consists of three main parts – the classification module, the evaluation module, and the CPU. The classification module operates stand-alone except for its reconfiguration which is carried out by the CPU. In a real-world application one would imagine some preprocessing module providing the input pattern and possibly some software interpretation of the classification result. The evaluation module operates in close cooperation with the CPU for the evolution of new configurations. The evaluation module accepts a configuration bitstring, also called genome, and calculates its fitness value. This information is in turn used by the CPU for running the rest of the GA. The evaluation module has been implemented and described in detail in [7].

Fig. 1. High level system view

Fig. 2. EHW classification module view

2.2 Classification Module Overview
The classifier system consists of K category detection modules (CDMs), one for each category Ci to be classified – see figure 2. The input data to be classified is presented to each CDM concurrently on a common input bus. The CDM with the highest output value will be detected by a maximum detector, and the identifying number of this category will be output from the system. Alternatively, the system could also state the degree of certainty of a certain category by taking the output of the corresponding CDM and dividing by the maximum possible output. In this way, the system could also propose alternative categories in case of doubt.

2.3 Category Detection Module
Each CDM consists of M "rules" or functional unit (FU) rows – see figure 3. Each FU row consists of N FUs. The inputs to the circuit are passed on to the inputs of each FU. The 1-bit outputs from the FUs in a row are fed into an N-input AND gate. This means that all outputs from the FUs must be 1 in order for a rule to be activated. The 1-bit outputs from the AND gates are connected to an input counter which counts the number of activated FU rows.
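The row, counter and maximum-detector structure described in sections 2.2 and 2.3 can be illustrated with a minimal software sketch. This is not the authors' hardware description: the names `fu_row_active`, `cdm_output`, `classify`, `gt` and `le` are hypothetical, the FUs are modelled as plain threshold functions, and resolving ties in favour of the lowest category index is an assumption that the text does not specify.

```python
from typing import Callable, Sequence

# A functional unit is modelled as any function mapping the input pattern to 0 or 1.
FU = Callable[[Sequence[int]], int]

def fu_row_active(row: Sequence[FU], pattern: Sequence[int]) -> int:
    """N-input AND: the rule fires only if every FU in the row outputs 1."""
    return int(all(fu(pattern) == 1 for fu in row))

def cdm_output(rows: Sequence[Sequence[FU]], pattern: Sequence[int]) -> int:
    """Counter: number of activated FU rows in one category detection module."""
    return sum(fu_row_active(row, pattern) for row in rows)

def classify(cdms: Sequence[Sequence[Sequence[FU]]], pattern: Sequence[int]) -> int:
    """Maximum detector: index of the CDM with the highest count.

    Assumption: ties go to the lowest index.
    """
    counts = [cdm_output(rows, pattern) for rows in cdms]
    return counts.index(max(counts))

# Hypothetical threshold FUs used only for this illustration.
def gt(index: int, constant: int) -> FU:
    return lambda p: int(p[index] > constant)

def le(index: int, constant: int) -> FU:
    return lambda p: int(p[index] <= constant)

pattern = [200, 10, 90]
cdm0 = [[gt(0, 100), le(1, 50)], [gt(2, 100), gt(0, 50)]]  # first row fires
cdm1 = [[le(0, 100), gt(1, 5)], [le(2, 50), le(1, 20)]]    # no row fires
category = classify([cdm0, cdm1], pattern)
```

With these toy rows, `cdm0` counts one activated row and `cdm1` none, so the maximum detector selects category 0. Dividing a CDM count by M would give the degree of certainty mentioned in section 2.2.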
As the number of FU rows is increased, so is the output resolution from each CDM. Each FU row is evolved from an initial random bitstream, which ensures a variation in the evolved FU rows. To draw a parallel to the LoDETT system, each FU row represents a kernel function. More FU rows give more kernel functions (with different centers) that the unknown pattern can fall into.

Fig. 3. Category detection module. N functional units are connected to an N input AND gate.

Fig. 4. Functional unit. The data MUX selects which of the input data to feed to the functions f1 and f2. The f MUX selects which of the function results to output.

2.4 Functional Unit
The FUs are the reconfigurable elements of the architecture. This section describes the FU in a general way, and section 3.2 will describe the application-specific implementation. As seen in figure 4, each FU behavior is controlled by configuration lines connected to the configuration registers. Each FU has all input bits to the system available at its inputs, but only one data element (e.g. one byte) of these bits is chosen. One data element is thus selected from the input bits, depending on the configuration lines. This data is then fed to the available functions. Any number and type of functions could be imagined, but for clarity, in figure 4 only two functions are illustrated. The choice of functions for the sonar classification application will be detailed in section 3.1. In addition, the unit is configured with a constant value, C. This value and the input data element are used by the function to compute the output from the unit. The advantage of selecting which inputs to use is that connection to all inputs is not required. A direct implementation of the LoDETT system [4] would have required, in the sonar case, N = 60 FUs in a row. Our system typically uses N = 6 units. The rationale is that not all of the inputs are necessary for the pattern recognition. This is reflected in the don't cares evolved in [4].
3 Implementation

This section describes the sonar return classification application and the following application-specific implementation of the FU module. The evaluation module, which contains one FU row and calculates fitness based on the training vectors, is in principle equal to the description in [7] and will not be further described in this paper.

3.1 Sonar Return Classification
The application data set has been found in the CMU Neural Networks Benchmark Collection¹, and was first used by Gorman and Sejnowski in [8]. This is a real-world data set consisting of sonar returns from underwater targets of either a metal cylinder or a similarly shaped rock. The number of CDMs in the system then becomes K = 2. The returns have been collected from different aspect angles and preprocessed based on experiments with human listeners, such that the input signals are spectral envelopes containing 60 samples, normalized to values between 0.0 and 1.0 – see figure 5. There are 208 returns in total which have been divided into equally sized training and test sets of 104 returns. The samples have been scaled by the authors to 8-bit values ranging between 0 and 255. This gives a total of 60 × 8 = 480 bits to input to the system for each return.

Fig. 5. The sonar return spectral envelope, which is already a preprocessed signal, has its 60 samples scaled to 8-bit values before they are input to the CDMs

Based on the data elements of the input being 8-bit scalars, the functions available to the FU elements have been chosen to be greater than and less than or equal. Through experiments these functions have been shown to work well, and intuitively this allows for detecting the presence or absence of frequencies in the signal, and their amplitude. The constant is also 8 bits, and the input is then compared to this value to give true or false as output. This can be summarized as follows, with I being the selected input value, O the output, and C the constant value:

f   Description          Function
0   Greater than         O = 1 if I > C, else 0
1   Less than or equal   O = 1 if I ≤ C, else 0

¹ http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/neural/bench/cmu/
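The scaling step and the two FU functions from the table above can be sketched as follows. The helper names `scale_to_8bit` and `fu_function` are hypothetical, and round-to-nearest with clamping is an assumption, since the text only states that the samples are scaled to values between 0 and 255.

```python
SAMPLES_PER_RETURN = 60
BITS_PER_SAMPLE = 8
INPUT_BITS = SAMPLES_PER_RETURN * BITS_PER_SAMPLE  # 60 x 8 = 480 bits per return

def scale_to_8bit(sample: float) -> int:
    """Map a normalized sample in [0.0, 1.0] to an integer in [0, 255].

    Assumption: round to nearest and clamp; the rounding rule is not stated in the text.
    """
    return max(0, min(255, round(sample * 255)))

def fu_function(f: int, i: int, c: int) -> int:
    """Apply the function selected by f to input I and constant C.

    f = 0: greater than        (O = 1 if I > C, else 0)
    f = 1: less than or equal  (O = 1 if I <= C, else 0)
    """
    if f == 0:
        return int(i > c)
    return int(i <= c)

# Toy three-sample envelope, for illustration only (a real return has 60 samples).
envelope = [0.0, 0.2, 1.0]
scaled = [scale_to_8bit(s) for s in envelope]
```

Because the two functions are complementary, exactly one of `fu_function(0, i, c)` and `fu_function(1, i, c)` is 1 for any pair of values.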
3.2 Functional Unit Implementation
Based on the choice of data elements and functions above, the application-specific implementation of the FU can be determined. As described in the introduction, the VRC technique used for reconfiguration of circuits has the disadvantage of requiring substantial logic resources. This is especially the case when one needs to select the input to a unit from many possible sources, which is common in EHW. The problem is even greater when working with data buses as inputs instead of bit signals.

Fig. 6. Implementation of the FU for the sonar spectrum recognition case
Instead of using a large amount of multiplexer resources for selecting one 8-bit sample from 60 possible, we have opted for a "time multiplexing" scheme, where only one sample is presented to the unit at a time. See figure 6. The 60 samples are presented sequentially, one for each clock cycle, together with their identifying sample number. The FU checks the input sample number for a match with the sample number stored in the configuration register, and in the case of a match, the output value of the FU is stored in a memory element. This method thus requires a maximum of 60 clock cycles before an FU has selected its input value. The sample input value is used for comparison with the constant value C stored in the configuration. Since the two functions greater than and less than or equal are opposite, only a greater than-comparator is implemented, and the function bit in the configuration decides whether to choose the direct or the negated output.
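A cycle-by-cycle software model may help to make the time multiplexing scheme concrete. The sketch below is an illustration under assumptions, not the authors' circuit: the class name `TimeMultiplexedFU` and the helper `present_samples` are hypothetical, sample numbers are assumed to start at 0, and the negation is modelled as an XOR with the function bit.

```python
from typing import Sequence

class TimeMultiplexedFU:
    """Software model of one FU that sees a single sample per clock cycle."""

    def __init__(self, sample_number: int, function_bit: int, constant: int):
        self.sample_number = sample_number  # configured sample to select
        self.function_bit = function_bit    # 0: greater than, 1: less than or equal
        self.constant = constant            # 8-bit constant C
        self.output_reg = 0                 # memory element holding the FU output

    def clock(self, input_sample_number: int, input_sample_value: int) -> None:
        # Only one greater-than comparator is needed.
        greater = int(input_sample_value > self.constant)
        # The function bit selects the direct (0) or the negated (1) output.
        result = greater ^ self.function_bit
        # The result is stored only when the sample number matches the configuration.
        if input_sample_number == self.sample_number:
            self.output_reg = result

def present_samples(fu: TimeMultiplexedFU, samples: Sequence[int]) -> int:
    """Present all samples sequentially, one per clock cycle, and return the FU output."""
    for number, value in enumerate(samples):
        fu.clock(number, value)
    return fu.output_reg

samples = [10, 200, 30]  # toy input; a real return has 60 samples
out_gt = present_samples(TimeMultiplexedFU(1, 0, 100), samples)  # 200 > 100
out_le = present_samples(TimeMultiplexedFU(2, 1, 30), samples)   # 30 <= 30
```

In this model every FU in the system can watch the same sample bus, which is the resource saving the scheme aims for: no FU needs a 60-to-1 multiplexer of 8-bit values.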
4 Evolution
This section describes the evolutionary process. Although the base mechanisms are the same as in [6], there are important changes to the fitness function. The GA implemented for the experiments follows the Simple GA style [12]. The algorithm is written to be run on the PowerPC 405 hard processor core in the Xilinx Virtex-II Pro (or better) FPGAs [11], or the MicroBlaze soft processor core available for a greater number of FPGA devices [5]. Allowing the GA to run in software instead of implementing it in hardware gives an increased flexibility compared to a hardware implementation.
The GA associates a bit string (genome) with each individual in the population. For each individual, the fitness evaluation circuit is configured with the associated bit string, and training vectors are applied on the inputs. By reading back the fitness value from the circuit, the individuals can be ranked and used in selection for a new generation. When an individual with a maximum possible fitness value has been created (or the maximum limit of generations has been reached), the evolution run is over and the bit string can be used to configure a part of the operational classification circuit.

4.1 Genome
The encoding of each FU in the genome string is as follows:

Spectrum sample address (6 bit) | Function (1 bit) | Constant (8 bit)

This gives a total of Bunit = 15 bits for each unit. The genome for one FU row is encoded as follows:

FU1 (15b) | FU2 (15b) | ... | FUN (15b)

The total amount of bits in the genome for one FU row is then, with N = 6, Btot = Bunit × N = 15 × 6 = 90. In the implementation this is rounded up to 96 bits (3 words).

4.2 Incremental Evolution of the Category Detectors
Evolving the whole classification system in one run would give a very long genome, therefore an incremental approach is chosen. Each category detector CDMi can be evolved separately, since there is no interdependency between the different categories. This is also true for the FU rows each CDM consists of. Although the fitness function changes between the rows, as will be detailed in the next section, the evolution can be performed on one FU row at a time. This significantly reduces the genome size.

4.3 Fitness Function
The basic fitness function, applied in [6], can be described as follows: A certain set of the available vectors, Vt, are used for training of the system, while the remaining, Vv, are used for verification after the evolution run. Each row of FUs is fed with the training vectors (v ∈ Vt), and the fitness is based on the row's ability to give a positive (1) output for vectors v belonging to its own category (Cv = Ci), while giving a negative (0) output for the rest (Cv ≠ Ci). In the case of a positive output when Cv = Ci, the value 1 is added to the fitness sum. When Cv ≠ Ci and the row gives a negative output (value 0), 1 is added to the fitness sum. The other cases do not contribute to the fitness value.
The basic fitness function FB for a row can then be expressed in the following way, where o is the output of the FU row:

$$F_B = \sum_{v \in V_t} x_v \quad \text{where} \quad x_v = \begin{cases} o & \text{if } C_v = C_i \\ 1 - o & \text{if } C_v \neq C_i \end{cases} \tag{1}$$
While in the face image application each FU row within the same category was evolved with the same fitness function, the increased variation of the training set in the current application made it sensible to divide the training set between different FU rows. Each FU row within the same category is evolved separately, but by changing the fitness function between the evolution runs one can make different FU rows react to different parts of the training set. The extended fitness function FE can then be expressed as follows:

$$F_E = \sum_{v \in V_t} x_v \quad \text{where} \quad x_v = \begin{cases} o & \text{if } C_v = C_i \text{ and } v \in V_{f,m} \\ 1 - o & \text{if } C_v \neq C_i \end{cases} \tag{2}$$
where Vf,m is the part of the training set that FU row m will be trained to react positively to. For instance, if the training set of 104 vectors is divided into 4 equally sized parts, FU row 1 of the CDM would receive an increase in fitness for the first 26 vectors, if the output is positive for vectors belonging to the row's category (i.e. "rock" or "metal"). In addition, the fitness is increased for giving a negative output to vectors not belonging to the row's category, for all vectors of the training set.
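To make the genome layout of Section 4.1 and the two fitness functions concrete, the following Python sketch decodes one FU row and computes FB or FE over a list of training vectors. It is a minimal sketch under stated assumptions, not the authors' implementation: the bit order inside a 15-bit field (address, then function, then constant, most significant first), the rule that the row output is the AND of its FU outputs, the treatment of addresses above 59, and all function names are assumptions made here.

```python
from typing import List, Optional, Sequence, Set, Tuple

Unit = Tuple[int, int, int]            # (sample address, function bit, constant)
Vector = Tuple[Sequence[int], str]     # (60 scaled samples, category label)


def decode_row(genome: int, n_units: int = 6) -> List[Unit]:
    """Split a row genome into 15-bit fields: 6-bit address, 1-bit function, 8-bit constant.

    Assumption: FU 1 occupies the most significant field."""
    units = []
    for k in range(n_units):
        field = (genome >> (15 * (n_units - 1 - k))) & 0x7FFF
        units.append((field >> 9, (field >> 8) & 1, field & 0xFF))
    return units


def fu_output(unit: Unit, samples: Sequence[int]) -> int:
    address, function, constant = unit
    if address >= len(samples):
        return 0  # assumption: no sample number ever matches, register stays 0
    return int(samples[address] > constant) ^ function


def row_output(units: Sequence[Unit], samples: Sequence[int]) -> int:
    """Assumption: the row is active only if every FU in it is active (AND)."""
    return int(all(fu_output(u, samples) for u in units))


def row_fitness(units: Sequence[Unit], training: Sequence[Vector], own_category: str,
                positive_part: Optional[Set[int]] = None) -> int:
    """F_B when positive_part is None, otherwise F_E with positive_part = indices in V_f,m."""
    total = 0
    for index, (samples, category) in enumerate(training):
        o = row_output(units, samples)
        if category != own_category:
            total += 1 - o                      # reward a negative output
        elif positive_part is None or index in positive_part:
            total += o                          # reward a positive output
        # own-category vectors outside V_f,m do not contribute
    return total
```

With the training set of 104 vectors split into 4 parts, FU row 1 would be evaluated with `positive_part=set(range(26))`, while negative outputs are still rewarded over the whole training set.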
5 Results
This section presents the results of the implementation and experiments undertaken. The classification results are based on a software simulation of the EHW architecture, which has identical functionality to the hardware proposed. Results from a hardware implementation of the classifier module are also presented.

5.1 Architecture and Evolution Parameters
The architecture parameters N and M, that is, the number of FUs in an FU row and the number of FU rows in a CDM, respectively, have been evaluated. From experiments, a value of N = 6 has been shown to work well. Increasing the number of FU rows for a category leads to an increase in the recognition accuracy, as seen in figure 7. However, only a few FU rows are required before the system classifies with a relatively good accuracy, thus the system could be considered operational before the full number of FU rows are evolved. It is worth noting that the training set classification accuracy rises relatively quickly to a high value and then grows slowly, whereas the test set accuracy increases more steadily as more FU rows are added.

For the evolution experiments, a population size of 32 is used. The crossover rate is 0.9. Linear fitness scaling is used, with 4 expected copies of the best individual. In addition, elitism is applied. The maximum number of generations allowed for one evolution run is 1000. By observing figure 7 one can see that this produces better results than having a limit of 20000 generations, even though this implies that fewer FU rows are evolved to a maximum fitness value.

The classification accuracies obtained by using the basic fitness function FB, and the extended fitness function FE with the training set partitioned into 4 parts of equal size, have been compared (see figure 8). The use of FE allows for a higher classification accuracy on both the training and the test set.

Fig. 7. Average classification accuracy on the training and test sets as a function of the number of FU rows per CDM. N = 6. Results for generation limits of 1000 and 20000. [Plot: accuracy (%) versus FU rows; curves: training (1000 gens., Fe), test (1000 gens., Fe), training (20000 gens., Fe), test (20000 gens., Fe)]

Fig. 8. Average classification performance obtained using fitness functions FB and FE. N = 6. [Plot: accuracy (%) versus FU rows; curves: training (1000 gens., Fe), test (1000 gens., Fe), training (1000 gens., Fb), test (1000 gens., Fb)]

5.2 Classification Accuracy
Ten evolution runs were conducted, each with the same training and test set as described in section 3.1, but with different randomized initialization values for the genomes. The extended fitness function FE is used. The results can be seen in table 1. The highest average classification accuracy, 91.4%, was obtained at M = 58 rows. The best system evolved presented an accuracy of 97.1% at M = 66 rows (however, the average value was only 91.2%). The results for M = 20, a configuration requiring fewer resources, are also shown. These values are higher than the average and maximum values of 83.0% and 91.7% respectively obtained in [4], although different training and test sets have been used. The classification performance is also better than the average value of 90.4% obtained from the original neural network implementation in [8], but lower than the results obtained by SVM methods (95.2%) [9].

Table 1. Average classification accuracy

FU rows (M)   training set   test set
20            97.3%          87.8%
58            99.6%          91.4%
5.3 Evolution Speed
In the experiments, several rows did not achieve maximum fitness before reaching the limit of 1000 generations per evolution run. The average number of generations required for each evolution run (that is, one FU row) was 853. This gives an average of 98948 generations for the entire system. The average evolution time for the system is 63s on an Intel Xeon 5160 processor using 1 core. This gives an average of 0.54s for one FU row, or 4.3s for 8 rows (the time before the system has evolved 4 FU rows for each category and thus is operational). A hardware implementation of the evaluation module has been reported in [7]. This was reported to use 10% of the resources of a Xilinx XC2VP30 FPGA, and, together with the GA running on an on-chip processor, the evolution time was equivalent to the time used by the Xeon workstation. Similar results, or better because of software optimization, are expected for the evolution in the sonar case.

5.4 Hardware Implementation
An implementation of the classifier module has been synthesized for a Xilinx XC2VP30 FPGA in order to obtain an impression of speed and resource usage. The resources used for two configurations of the system can be seen in table 2. While the M = 58 configuration uses 81% of the FPGA slices, the M = 20 configuration only uses 28% of the slices. Both configurations classify at the same speed, due to the parallel implementation. Given the post-synthesis clock frequency estimate of 118MHz, and the delay of 63 cycles before one pattern is classified, one has a classification time of 0.5μs.

Table 2. Post-synthesis device utilization for two configurations of the 2-category classification module implemented on an XC2VP30

Resource           M = 20   M = 58   Available
Slices             3866     11189    13696
Slice Flip Flops   4094     11846    27392
4 input LUTs       3041     8793     27392
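The figures quoted in Sections 5.3 and 5.4 can be cross-checked with a few lines of arithmetic. The Python sketch below only recombines numbers stated in the text (853 generations per row, 58 rows for each of 2 categories, 63s total evolution time, 63 cycles at 118MHz, and the slice counts of table 2); the function names are assumptions introduced here.

```python
def total_generations(avg_per_row: int, rows_per_category: int, categories: int) -> int:
    """Average number of generations for the whole system."""
    return avg_per_row * rows_per_category * categories


def classification_time_us(cycles: int, clock_hz: float) -> float:
    """Latency of one classification in microseconds."""
    return cycles / clock_hz * 1e6


def utilization(used: int, available: int) -> float:
    """Fraction of a device resource that is occupied."""
    return used / available


if __name__ == "__main__":
    rows = 58 * 2
    print("generations:", total_generations(853, 58, 2))
    print("seconds per FU row:", round(63 / rows, 2))
    print("seconds for 8 rows:", round(8 * 63 / rows, 1))
    print("classification time (us):", round(classification_time_us(63, 118e6), 2))
    print("slice use, M = 20:", round(utilization(3866, 13696), 3))
    print("slice use, M = 58:", round(utilization(11189, 13696), 3))
```

The per-row evolution time follows from dividing the total time by the 116 FU rows of the M = 58 system, which is also the factor relating 853 to 98948.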
5.5 Discussion

Although a good classification accuracy was achieved, it became apparent that there were larger variations within each category in this data set than in the face image recognition data set applied in [6]. The fitness function was therefore extended such that each row would be evolved with emphasis on a specific part of the training set. This led to increased classification accuracy, because each row could specialize on certain features of the training set. However, the partitioning of the training set was fixed, and further investigation into the partitioning could be useful. The experiments also showed that by increasing the number of FU rows per category, better generalization abilities were obtained. The fact that generalization improved when the evolution was cut off at an earlier stage could indicate that a CDM consisting of less "perfect" FU rows has a higher diversity and is thus less sensitive to noise in the input patterns.

The main improvement of this system over the LoDETT system is the aspect of on-line evolution. As a bonus, the classification accuracies are also higher. The drawback is the slower classification speed, 63 cycles, whereas LoDETT only uses 3 cycles for one pattern. It can be argued that this slower speed is negligible in the case of the sonar return application, since the time used for preprocessing the input data in a real-world system would be higher than 63 cycles. In this case, even an SVM-based hardware system such as the one reported in [10] could be fast enough. Although the speed is not directly comparable since the application differs, that system has a classification time of 0.66ms, roughly 1000 times slower than our proposed architecture. Therefore, the proposed architecture would be better suited to pattern recognition problems requiring very high throughput. The LoDETT system has been successfully applied to genome informatics and other applications [13,14]. It is expected that the proposed architecture could also perform well on similar problems, if suitable functions for the FUs are found.
6 Conclusions
The online EHW architecture proposed has so far proven to perform well on a face image recognition task and a sonar return classification task. Incremental evolution and high level building blocks are applied in order to handle the complex inputs. The architecture benefits from good classification accuracy at a very high throughput. The classification accuracy has been shown to be higher than an earlier offline EHW approach. Little evolution time is needed to get a basic working system operational. Increased generalisation can then be added through further evolution. Further, if the training set changes over time, it would be possible to evolve better configurations in parallel with a constantly operational classification module.
Acknowledgment

The research is funded by the Research Council of Norway through the project Biological-Inspired Design of Systems for Complex Real-World Applications (proj. no. 160308/V30).
References

1. Yao, X., Higuchi, T.: Promises and challenges of evolvable hardware. In: Higuchi, T., Iwata, M., Weixin, L. (eds.) ICES 1996. LNCS, vol. 1259, pp. 55–78. Springer, Heidelberg (1997)
2. Sekanina, L., Ruzicka, R.: Design of the special fast reconfigurable chip using common FPGA. In: Proc. of Design and Diagnostics of Electronic Circuits and Systems - IEEE DDECS'2000, pp. 161–168. IEEE Computer Society Press, Los Alamitos (2000)
3. Yasunaga, M. et al.: Evolvable sonar spectrum discrimination chip designed by genetic algorithm. In: Proc. of 1999 IEEE Systems, Man, and Cybernetics Conference (SMC'99), IEEE Computer Society Press, Los Alamitos (1999)
4. Yasunaga, M., Nakamura, T., Yoshihara, I., Kim, J.: Genetic algorithm-based design methodology for pattern recognition hardware. In: Miller, J.F., Thompson, A., Thompson, P., Fogarty, T.C. (eds.) ICES 2000. LNCS, vol. 1801, pp. 264–273. Springer, Heidelberg (2000)
5. Glette, K., Torresen, J., Yasunaga, M., Yamaguchi, Y.: On-chip evolution using a soft processor core applied to image recognition. In: Proc. of the First NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2006), Los Alamitos, CA, USA, pp. 373–380. IEEE Computer Society Press, Los Alamitos (2006)
6. Glette, K., Torresen, J., Yasunaga, M.: An online EHW pattern recognition system applied to face image recognition. In: Giacobini, M., et al. (eds.) EvoWorkshops 2007. LNCS, vol. 4448, pp. 271–280. Springer, Heidelberg (to appear, 2007)
7. Glette, K., Torresen, J., Yasunaga, M.: Online evolution for a high-speed image recognition system implemented on a Virtex-II Pro FPGA. In: The Second NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2007) (accepted, 2007)
8. Gorman, R.P., Sejnowski, T.J.: Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1(1), 75–89 (1988)
9. Frieß, T.T., Cristianini, N., Campbell, C.: The Kernel-Adatron algorithm: a fast and simple learning procedure for Support Vector machines. In: Proc. 15th International Conf. on Machine Learning, pp. 188–196. Morgan Kaufmann, San Francisco, CA (1998)
10. Choi, W.-Y., Ahn, D., Pan, S.B., Chung, K.I., Chung, Y., Chung, S.-H.: SVM-based speaker verification system for match-on-card and its hardware implementation. Electronics and Telecommunications Research Institute journal 28(3), 320–328 (2006)
11. Glette, K., Torresen, J.: A flexible on-chip evolution system implemented on a Xilinx Virtex-II Pro device. In: Moreno, J.M., Madrenas, J., Cosp, J. (eds.) ICES 2005. LNCS, vol. 3637, pp. 66–75. Springer, Heidelberg (2005)
12. Goldberg, D.: Genetic Algorithms in search, optimization, and machine learning. Addison Wesley, Reading (1989)
13. Yasunaga, M., et al.: Gene finding using evolvable reasoning hardware. In: Tyrrell, A., Haddow, P., Torresen, J. (eds.) ICES 2003. LNCS, vol. 2606, pp. 228–237. Springer, Heidelberg (2003)
14. Yasunaga, M., Kim, J.H., Yoshihara, I.: The application of genetic algorithms to the design of reconfigurable reasoning VLSI chips. In: FPGA '00: Proceedings of the 2000 ACM/SIGDA eighth international symposium on Field programmable gate arrays, New York, NY, USA, pp. 116–125. ACM Press, New York (2000)
Design of Electronic Circuits Using a Divide-and-Conquer Approach

Guoliang He1, Yuanxiang Li1, Li Yu2, Wei Zhang2, and Hang Tu2

1 State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, China
2 School of Computer Science, Wuhan University, Wuhan 430072, China
[email protected], [email protected], [email protected], [email protected]
Abstract. Automatic design of electronic logic circuits has become a new research focus over the last twenty years through the combination of FPGA technology and intelligent algorithms. However, as logic circuits have become larger and more complex, it has become difficult for automatic design methods to obtain valid and optimized circuits. Based on a divide-and-conquer approach, a two-layer encoding scheme is devised for the design of electronic logic circuits. During evolution, the two layers are evolved in parallel while interacting with each other. Moreover, in order to simulate and evaluate evolved electronic logic circuits, a two-step simulation algorithm is proposed to reduce the computational complexity of simulating circuits and to improve simulation efficiency. Finally, a random number generator is designed automatically with this encoding scheme and the proposed simulation algorithm, and the result shows that the method is efficient.

Keywords: Evolvable Hardware, Divide-and-Conquer Approach, Simulation Algorithm.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 13–22, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

In recent years the evolution of digital circuits has been intensively studied; it is expected to allow large and efficient electronic circuits to be produced automatically instead of by human design. However, a central issue in the evolutionary design of electronic circuits is the problem of scale, on which no major breakthrough has yet been made [1,2]. A possible way of tackling the problem is to use building blocks that are higher-level functions rather than two-input gates. It was also observed that evolutionary design of circuits is easier with this method than with the original methods, in which the building blocks are merely gates [3,4,5]. However, identifying suitable building blocks and thus evolving efficient electronic circuits are difficult tasks that need further investigation. Another efficient method is to divide a larger circuit into several sub-circuits and then evolve each sub-circuit separately [6,7], but how to divide an unknown circuit into sub-circuits from its abstract description alone is also an open question. Other methods, including variable-length chromosome coding [8], compressed chromosome encoding, and biological development [9], have also been proposed to address the problem of circuit scale.

In this paper, a two-layer encoding method for designing electronic circuits is introduced, based on a divide-and-conquer approach, and a two-step simulation algorithm is then proposed to simulate and evaluate evolved circuits [10,11]. The algorithm adopts the event-driven idea and simulates logic circuits stage by stage to reduce the computational complexity of circuit simulation.
2 Design Logic Circuits with a Two-Layer Encoding Method

2.1 Circuit Encoding Scheme

At present, one of the most important issues in designing a complex hardware system is how to deal with the scalability of the chromosome representation of electronic circuits. Because the length of the chromosome increases exponentially as the complexity of the circuit grows, it is inefficient to evolve a valid circuit with current evolutionary techniques.
[Flowchart: an initialized circuit; the representation of the circuit; evolution of directed graph; evolution of genetic tree; simulation and fitness calculation; satisfy the ending condition; end the evolution]
Fig. 1. The overview of the evolution
Based on the theory of divide-and-conquer and evolvable hardware, a method is introduced in which a circuit is divided into modules before evolution, and then every module and the connections among the modules are evolved in parallel. An overview of the evolution process is depicted in Fig. 1. In accordance with this principle, a two-layer encoding method is introduced to evolve circuits. Fig. 2 shows the general architecture of this two-layer encoding scheme. The first layer is an acyclic directed graph, which is formed by connecting the modules through links. The second layer consists of many trees, which form a population inside each module; the modules are predefined before evolution. First, each module acts as a genetic tree after an initialized circuit is modularized; these modules are then connected by links to form an acyclic directed graph that represents a circuit. A population is thus obtained from the different links among the genetic trees. The nodes in each genetic tree are classified as terminal nodes and non-terminal nodes. The former are elements of the set of constant values ("0" or "1") and variables. The latter are fundamental logic gates or evolved modules; in this paper only the elementary logic gates listed in Table 1 are considered.
Fig. 2. The two-layer encoding scheme

Table 1. Allowed cell functions

Letter   Function
1        !a
2        a • b
3        a + b
4        !(a • b)
5        !(a + b)
6        a ⊕ b
7        !(a ⊕ b)
8        a • !c + b • c
9        D-Trigger
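The combinational cell functions of Table 1 can be illustrated by evaluating a genetic tree recursively. The Python sketch below is only an illustration: the nested-tuple tree representation, the name `evaluate`, and the full-adder example are assumptions made here and are not taken from the paper. The D-Trigger (letter 9) is left out because it needs stored state and a clock, which a purely combinational evaluator does not model.

```python
from typing import Dict, Union

# A tree is a constant (0 or 1), a variable name, or a tuple (letter, child, ...).
Tree = Union[int, str, tuple]

CELL_FUNCTIONS = {
    1: lambda a: 1 - a,                          # !a
    2: lambda a, b: a & b,                       # a AND b
    3: lambda a, b: a | b,                       # a OR b
    4: lambda a, b: 1 - (a & b),                 # NAND
    5: lambda a, b: 1 - (a | b),                 # NOR
    6: lambda a, b: a ^ b,                       # XOR
    7: lambda a, b: 1 - (a ^ b),                 # XNOR
    8: lambda a, b, c: (a & (1 - c)) | (b & c),  # a AND !c OR b AND c (multiplexer)
}


def evaluate(tree: Tree, inputs: Dict[str, int]) -> int:
    """Evaluate a genetic tree bottom-up for one assignment of the variables."""
    if isinstance(tree, int):      # terminal node: constant
        return tree
    if isinstance(tree, str):      # terminal node: variable
        return inputs[tree]
    letter, *children = tree       # non-terminal node: a gate from Table 1
    return CELL_FUNCTIONS[letter](*(evaluate(child, inputs) for child in children))


# Hypothetical example: the two outputs of a full adder as two genetic trees.
SUM = (6, (6, "a", "b"), "cin")
CARRY = (3, (2, "a", "b"), (2, "cin", (6, "a", "b")))

if __name__ == "__main__":
    print(evaluate(SUM, {"a": 1, "b": 0, "cin": 1}), evaluate(CARRY, {"a": 1, "b": 0, "cin": 1}))
```

Cell function 8 behaves as a two-way multiplexer: it passes a when c is 0 and b when c is 1.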
2.2 Evolving Scheme

With this encoding method for representing evolving circuits, several strategies can be devised. At the beginning of the evolution, there is only one tree inside each module, and the acyclic directed graph of connections among the modules is also unique. During evolution, a population of trees is produced in each module by the evolutionary operations discussed later, but in each module only one main tree has its leaf nodes and root node connected to the outside of the module. At the same time, the links among the modules are described as acyclic directed graphs, and the many different ways of connecting the modules form the higher-level population of directed graphs.

From this evolving strategy, it can be seen that the two-layer encoding scheme is efficient and flexible for designing more complex circuits. On the one hand, the modules and the directed graphs are evolved separately, so the short chromosome that represents any one of them can be handled very efficiently by an evolutionary algorithm. At the same time, the same modules are reused to form a population of directed graphs for evolution. Therefore, as the complexity of the evolvable circuit and the population size grow, the length of the chromosome representing the circuit and the evolutionary computation increase to a lesser extent. On the other hand, different evolution strategies, each efficient for the code structure of its level, can be used flexibly to obtain a valid and optimized circuit rapidly.

2.3 Evolutionary Operations

Because of the two-layer encoding method, evolutionary operators are implemented differently in each layer to obtain valid circuits. For the first layer, the operators on graphs are as follows:

Crossover: randomly select two graphs and combine their differing parts into two new graphs.
Mutation: add, delete, or change some of the connections among nodes within a graph to form a new graph.
Selection: insert the newly produced individuals into the former population, so that a new population is created at the end of each generation. When the number of individuals reaches the upper limit, delete the individuals with the worst fitness values. The selection probability is a function of the fitness value of the genetic tree and a value predefined at the beginning.

For the evolution of the genetic trees in each module, which is the second layer, the operators are as follows:

Exchange: this operator has two variants. The first swaps the order of subtrees within the same tree; the second exchanges subtrees between two trees, on the precondition that the inputs and outputs of the two subtrees match.
Mutation: change one or several non-terminal nodes in a tree according to the mutation probability.
Upgrade: pick a subtree of a tree and make it a tree in its own right.
Deletion: delete a branch of a tree.
Generation: create subtrees according to the probability of appearance of nodes on different layers.
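Three of the second-layer operators above (Mutation, Upgrade, and the first variant of Exchange) can be sketched on trees written as nested tuples `(letter, child, ...)`, with the letters of Table 1. This is a minimal sketch under assumptions made here: the tuple representation, the function names, and in particular the choice to mutate a gate only into another gate with the same number of inputs (so that the tree stays well formed) are not specified in the paper.

```python
import random
from typing import Iterator, Union

Tree = Union[int, str, tuple]

# Number of inputs of each combinational cell function in Table 1.
ARITY = {1: 1, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 3}


def subtrees(tree: Tree) -> Iterator[tuple]:
    """Yield every non-terminal subtree in pre-order, the tree itself first."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from subtrees(child)


def mutate(tree: Tree, rate: float, rng: random.Random) -> Tree:
    """Mutation: replace non-terminal nodes, each with probability `rate`."""
    if not isinstance(tree, tuple):
        return tree
    letter, *children = tree
    if rng.random() < rate:
        # Assumption: only gates with the same arity are eligible replacements.
        candidates = [g for g, n in ARITY.items() if n == ARITY[letter] and g != letter]
        if candidates:
            letter = rng.choice(candidates)
    return (letter, *(mutate(child, rate, rng) for child in children))


def upgrade(tree: Tree, rng: random.Random) -> Tree:
    """Upgrade: pick a subtree and make it a tree in its own right."""
    if not isinstance(tree, tuple):
        return tree
    return rng.choice(list(subtrees(tree)))


def swap_subtrees(tree: Tree, rng: random.Random) -> Tree:
    """Exchange (first variant): swap the order of two subtrees of the root."""
    if not isinstance(tree, tuple) or len(tree) < 3:
        return tree
    children = list(tree[1:])
    i, j = rng.sample(range(len(children)), 2)
    children[i], children[j] = children[j], children[i]
    return (tree[0], *children)
```

Passing an explicit `random.Random` instance keeps evolution runs reproducible for a given seed.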
3 Simulation and Evaluation

Simulation and evaluation are very important processes that guide the evolution strategy. Existing emulation software can emulate circuits well, but it cannot simulate and evaluate logic circuits efficiently for the proposed two-layer scheme. The function of a digital circuit is commonly validated by simulating it on test data. During validation, a change in the test data does not always cause the output value of every logic gate to change. Based on this observation, a circuit simulation algorithm is introduced that avoids repeatedly simulating macro blocks whose output values have not changed when the test data for the same circuit are altered. The proposed simulation algorithm can therefore shorten the runtime and improve the efficiency of simulation.

3.1 Simulation Algorithm of Electronic Logic Circuits

For the simulation of logic circuits, it is necessary to handle the parallelism and delay of logic gates or modules. This paper introduces a "time and incident" form that records the signal lines and the values they will take when they change at a certain moment in the future. Because the emulated circuit has to be downloaded onto the FPGA board to perform a real-time simulation, a standard delay model is used throughout the simulation process (assuming that the delay of a signal line is zero) [12]. The data structure of the "time and incident" form is as follows:

Structure of "Time" form {
    ┊
    Time;          time variable; records a certain moment of the system
    Pevent;        incident pointer; points to the head of the array of incidents that will happen at moment "Time"
    Next;          pointer to the next "Time" form
    ┊
}
Structure of "Incident" form {
    ┊
    Signal_ID;     identifier of the signal that triggers the current incident
    Signal_Value;  value the signal will take at that moment
    Next;          pointer to the next "Incident" form
    ┊
}
Following the treatment of circuit characteristics described above, the main idea of the simulation algorithm is as follows. The emulation of a logic circuit is divided into two stages according to the topological structure of the circuit. The first stage structures the evolved circuit into macro entities. Each macro entity is a sub-circuit without feedback and is described as a directed tree. Because not every macro entity needs to be emulated in every emulation clock cycle, each macro entity in the circuit is set to the activated or suspended state according to whether its input values have changed. The second stage emulates the activated macro entities. The program traverses every node (operator, variable, or constant) of the macro entity and calculates the output value of each node according to its type and its child nodes. These values are then stored in the corresponding places. (Within a macro entity the nodes are arranged in breadth-first order, and the emulation proceeds layer by layer from the leaf nodes toward the root.) For each basic macro entity (or logic gate), the output value and the delay are determined once the input signals are given. The main process of the simulation algorithm is shown in Figure 3. The emulation of the macro entities is the key point, since it determines the time complexity and efficiency of the whole algorithm. Macro entities correspond to genetic trees in the evolutionary operations; that is, emulating a macro entity means emulating a genetic tree. This paper uses the event-driven idea and does not emulate macro entities that do not need emulation. Compared with algorithms that have to traverse every macro entity, this algorithm saves unnecessary emulation time and improves the efficiency of simulation.
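The event-driven idea can be sketched in a few lines of Python. The sketch below is an illustration under assumptions, not the simulator used in the paper: the "Time" and "Incident" forms are modelled as a dictionary from time to a list of (signal, value) pairs, each macro entity is reduced to a single function with a fixed delay of at least one clock unit, and the names `MacroEntity` and `simulate` are introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class MacroEntity:
    """A feedback-free block with one output (illustrative model)."""
    name: str
    inputs: Tuple[str, ...]
    output: str
    func: Callable[..., int]
    delay: int = 1          # assumed to be at least one clock unit


def simulate(entities: List[MacroEntity], stimuli: Dict[int, Dict[str, int]],
             initial: Dict[str, int], end_time: int):
    """Return the final signal values and the number of entity evaluations."""
    signals = dict(initial)
    fanout = defaultdict(list)              # signal -> entities that read it
    for entity in entities:
        for signal in entity.inputs:
            fanout[signal].append(entity)
    time_table = defaultdict(list)          # "Time" form -> list of "Incident" forms
    last_driven: Dict[str, int] = {}        # last value scheduled for each output
    evaluations = 0
    for now in range(end_time + 1):
        # External input changes and scheduled incidents for the current time;
        # pop() also deletes the record of the current time from the time table.
        incidents = list(stimuli.get(now, {}).items()) + time_table.pop(now, [])
        activated = {}                      # the set S of activated macro entities
        for signal, value in incidents:
            if signals.get(signal) != value:
                signals[signal] = value
                for entity in fanout[signal]:
                    activated[entity.name] = entity
        for entity in activated.values():   # suspended entities are skipped
            evaluations += 1
            value = entity.func(*(signals[s] for s in entity.inputs))
            if value != last_driven.get(entity.output, signals[entity.output]):
                last_driven[entity.output] = value
                time_table[now + entity.delay].append((entity.output, value))
    return signals, evaluations


if __name__ == "__main__":
    circuit = [MacroEntity("and1", ("a", "b"), "n1", lambda a, b: a & b),
               MacroEntity("or1", ("n1", "c"), "y", lambda n, c: n | c)]
    start = {"a": 0, "b": 0, "c": 0, "n1": 0, "y": 0}
    print(simulate(circuit, {0: {"a": 1}, 1: {"b": 1}, 4: {"c": 1}}, start, 5))
```

In this small example a simulator that evaluated both entities at every one of the six time steps would perform 12 evaluations, whereas the event-driven loop evaluates an entity only when one of its inputs has changed.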
3.2 Evaluation of Electronic Logic Circuits

To design digital circuits effectively, the circuit and its emulation result must be evaluated to guide the strategy for further evolution. According to the characteristics of logic circuits, the evaluation can be carried out from two aspects:

Evaluation of circuit: this includes the evaluation of the circuit's function and of its scale. Evaluating the function means checking whether the circuit meets the design requirements and achieves the designated function. As for the scale, once the circuit's function is complete, the fewer logic gates it has, the higher its degree of parallelism. In fact, a smaller scale means higher parallelism when the performance and speed of the circuit are the same.

Evaluation of complexity: here complexity is defined as the time the simulation software needs to obtain the output of the circuit, or to reach a certain state.

The fitness function of logic circuits is as follows:

$$\text{fitness} = g^{c_1}\, s^{-c_2}\, p^{-c_3}, \qquad c_i\ (i = 1, 2, 3) \text{ are nonnegative constants.}$$

In this formula, g denotes the degree of optimization of the circuit's performance, s denotes the runtime, and p denotes the circuit's scale. A good circuit has good performance, short runtime, and small scale.
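As a worked check, the fitness can be evaluated for the parameter values reported for the experiment in Section 4 (Table 3, with C1 = 4.0, C2 = 0.5, C3 = 0.5). The Python sketch below reads the formula as the product of g raised to c1, s raised to minus c2, and p raised to minus c3, which is the reading that reproduces the fitness values listed in Table 3; the function name `circuit_fitness` is an assumption introduced here.

```python
def circuit_fitness(g: float, s: float, p: float,
                    c1: float = 4.0, c2: float = 0.5, c3: float = 0.5) -> float:
    """fitness = g^c1 * s^(-c2) * p^(-c3), with nonnegative constants c1, c2, c3."""
    return g ** c1 * s ** -c2 * p ** -c3


if __name__ == "__main__":
    # (g, s, p) for generations 1, 1000 and 2000 as listed in Table 3.
    for label, g, s, p in [("Gen 1", 0, 95, 2100),
                           ("Gen 1000", 6, 2756, 21992),
                           ("Gen 2000", 11, 3340, 49183)]:
        print(label, round(circuit_fitness(g, s, p), 6))
```

With c1 = 4.0 the performance term dominates: raising g from 6 to 11 outweighs the simultaneous growth in runtime and scale.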
[Flowchart steps]
1. Initialize the population and their parameters.
2. Read the values of the input signals at the current time.
3. If the value of an input signal changes, the macro entities connected to it are added to the set S.
4. Is there an event for the current time in the time table? If so, modify the value of the corresponding signal, add the macro entities connected to the signal to the set S, and delete the event.
5. Simulate the macro entities in the set S. If the output value of a macro entity changes, it is added to the event table as an event.
6. Delete the record of the current time in the time table.
7. Advance the simulation by one unit clock.
8. Is the ending condition satisfied? If so, end the simulation process.

Fig. 3. The simulation algorithm of digital circuits
Moreover, the derivation trees generated during evolution need evaluation. Because a tree is not the whole circuit, the root node is used to represent the corresponding tree. The fitness of the nodes is calculated as follows:

1. Find the m circuits with the highest fitness and add their fitness to every node that the directed graph of the circuit is connected to. The higher the fitness sum of a node, the more important its role in the connections of the directed graph.
2. Evaluate each module separately. This method is employed for certain hardware objects.
3. The key nodes determined by rules 1 and 2 require a special fitness calculation: evaluate the change in the directed graph's performance when a node mutates, and increase the fitness of the nodes whose mutation leads to a large variation in the directed graph's fitness.
4 Experiment
Using this two-layer encoding method and the simulation algorithm, a practical pseudo-random number generator is designed automatically to validate the efficiency of the approach. At the beginning of evolution, a basic embryonic circuit has to be given as the primitive individual. The embryonic circuit of the pseudo-random number generator contains 65 modules. Of these, 64 modules are single D flip-flops connected in series, and the 65th module is a combinational circuit whose inputs are connected to the output terminals of the 64 flip-flops and whose output is connected to the input terminal of the first D flip-flop. In fact, this is a typical LFSR pseudo-random number generator.

Table 2. The evaluation results of the optimized circuit

Test                                  Gen 1     Gen 1000  Gen 2000
Monobit Test                          1.000000  0.912541  0.773936
Birthday Spacing Test                 1.000000  0.827782  0.612698
Overlapping Permutations Test         1.000000  0.720198  0.515017
Ranks of 31x31 and 32x32 Matrices     1.000000  0.210932  0.321032
Ranks of 6x8 Matrices                 1.000000  1.000000  0.999164
Monkey Tests on 20-bit Words          1.000000  1.000000  1.000000
Monkey Tests OPSO, OQSO, DNA          1.000000  1.000000  1.000000
Count the 1's in a Stream of Bytes    1.000000  1.000000  1.000000
Count the 1's in Specific Bytes       1.000000  1.000000  1.000000
Parking Lot Test                      1.000000  0.999613  0.602602
Minimum Distance Test                 1.000000  1.000000  0.946229
Random Spheres Test                   1.000000  0.973520  0.534159
The Squeeze Test                      1.000000  1.000000  0.999676
Overlapping Sums Test                 1.000000  1.000000  0.653748
Runs Test                             1.000000  1.000000  0.659004
The Craps Test                        1.000000  1.000000  0.322741
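The embryonic circuit described in this section (a chain of D flip-flops whose first input is driven by a combinational function of the flip-flop outputs) can be sketched in software as a Fibonacci LFSR. This is only an illustration: the XOR feedback, the tap positions, and the function name `lfsr_bits` are assumptions, since the evolved feedback module of the paper is not given.

```python
def lfsr_bits(state, taps, width, count):
    """Simulate a chain of `width` D flip-flops with XOR feedback.

    Bit i of `state` holds the output of flip-flop i+1. On every clock the
    feedback bit (XOR of the 1-based stages in `taps`) enters flip-flop 1
    and every other flip-flop takes the value of its predecessor.
    Returns the bits observed at the last flip-flop and the final state.
    """
    mask = (1 << width) - 1
    out = []
    for _ in range(count):
        out.append((state >> (width - 1)) & 1)   # output of the last stage
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
    return out, state

# Small 4-stage example with taps at stages 4 and 3.
bits, final_state = lfsr_bits(state=1, taps=(4, 3), width=4, count=15)

# The same routine scales to 64 stages; these tap positions are illustrative only.
stream, _ = lfsr_bits(state=1, taps=(64, 63, 61, 60), width=64, count=200)
```

In the 4-stage example the register returns to its starting state after 15 clocks, the longest period possible for four flip-flops, and the 15 output bits contain eight ones.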
Table 3. The values of parameters and its fitness

          Gen 1     Gen 1000  Gen 2000
g         0         6         11
s         95        2756      3340
p         2100      21992     49183
Fitness   0.000000  0.166469  1.142326
For the individuals of the j-th generation, the evaluation function evaluates the nj-bit random binary string generated by emulating these individuals. The values g, s and p in f(g, s, p) are obtained simultaneously by emulation. nj is a monotonically increasing function of j, with n0 = 200000. The FIPS-140 standard and the well-known Diehard software are adopted to test randomness. The performance of the best individuals from three generations is shown in Table 2, and the values of the parameters and the resulting fitness are shown in Table 3 (parameters: C1 = 4.0, C2 = 0.5, C3 = 0.5). Table 2 shows that after 2000 generations of evolution the random number generator passes most of the randomness tests, which indicates that the proposed method can design random number generators with good performance and is capable of automatically designing both combinational and sequential circuits.
5 Conclusion
Based on the divide-and-conquer method, a two-layer encoding scheme is presented for the evolution of logic circuits. In the evolution process, the genetic trees and the links among modules are evolved separately but still interact with each other. Moreover, based on this encoding method, a simulation algorithm is proposed to simulate and evaluate electronic logic circuits more efficiently. However, some issues of this divide-and-conquer approach remain to be considered. In this paper the circuit is divided in advance; how to automatically divide a circuit into several sub-circuits that act as modules in the two-layer encoding method has yet to be addressed. In future work we hope to distill principles from the evolution of circuits that can serve as fundamental building blocks for designing larger logic circuits efficiently and easily. It is also necessary to improve the evolution and simulation algorithms in order to evolve circuits quickly and efficiently.
Acknowledgments This work is supported by the National Natural Science Foundation of China under grant No. 60442001 and National High-Tech Research and Development Program of China (863 Program) No. 2002AA1Z1490.
Implementing Multi-VRC Cores to Evolve Combinational Logic Circuits in Parallel
Jin Wang1, Chang Hao Piao2, and Chong Ho Lee1
1 Department of Information & Communication Engineering, Inha University, Incheon, Korea [email protected]
2 Department of Automation Engineering, ChongQing University of Posts and Telecommunications, Chongqing, China
Abstract. To address the scalability issue of evolvable hardware, this paper proposes an evolvable system based on multiple virtual reconfigurable circuit (VRC) cores to evolve combinational logic circuits in parallel. The basic idea behind the proposed scheme is to divide a combinational logic circuit into several sub-circuits, each of which is evolved independently as a subcomponent by its corresponding VRC core. The virtual reconfigurable circuit architecture is designed for implementing real-world applications of evolvable hardware (EHW) in common FPGAs. In our approach, all the VRC cores are realized in a Xilinx Virtex xcv2000E FPGA as an evolvable system to achieve parallel evolution. The proposed method is evaluated on the evolution of a 3-bit multiplier and a 3-bit adder and compared to direct evolution and incremental evolution in terms of computational effort and hardware implementation cost.
Keywords: Intrinsic evolvable hardware, scalability, parallel evolutionary algorithm, incremental evolution.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 23–34, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction
As an alternative to the conventional specification-based circuit design method, EHW has been introduced over the last decade as an important paradigm for automatic circuit design. However, there is still a long way to go before EHW becomes a real substitute for human designers. One of the problems is that most evolved circuits are limited in size [1, 2]. This is known as the scalability issue of EHW. In [2], Yao indicated that existing evolvable systems are generally not scalable for two reasons: (1) The chromosome string length of EHW, which increases with the target circuit size. A long chromosome string is required to represent a complex system. This often makes the search space so large that it is difficult to explore even with evolutionary techniques. (2) The computational complexity of an evolutionary algorithm (EA), which is a more pivotal factor than the chromosome string length. Generally, the number of individual evaluations required to find a desired solution can increase drastically with the complexity of the target system.
This paper focuses on the scalability issue as it applies to the evolutionary design of combinational logic circuits. In our proposal, a combinational logic circuit is decomposed into several sub-circuits. An evolvable system including several VRC cores is employed to evolve the separate sub-circuits in parallel. Finally, the evolved sub-circuits interact to perform the expected top-level logic function. Our proposed method approaches the scalability issue of EHW by speeding up the EA computation, shortening the chromosome length, and reducing the computational complexity of the task. Experiments on evolving a 3-bit multiplier and a 3-bit adder are conducted to compare the execution time and hardware cost of the proposed evolutionary strategy with those of direct evolution and incremental evolution [3, 4, 5].
The rest of this paper is organized as follows: Section 2 briefly introduces the existing approaches to the scalability issue of EHW. The proposed scheme, which implements parallel evolution with multi-VRC cores, is presented in Section 3. Section 4 describes the hardware realization of the proposed evolvable system. Experimental results are summarized in Section 5 and discussed in Section 6. Section 7 concludes our work.
2 Previous Scalable Approaches to EHW
Various approaches have been proposed to address the scalability issue of EHW. Murakawa et al. [6] tackled the problem of scale in evolved circuits by using function-level evolution. Because higher-level functions rather than multi-input gates are employed as building blocks, an important property of the function-level approach is that the size of the chromosome remains limited while the complexity of the circuits can grow arbitrarily. This approach is reasonable in itself, and it has also been considered for evolving spatial image operators [7, 8]. While higher-level functions allow the designer to reduce the EA search space and make evolution easier, a disadvantage is that the evolved solutions may not exhibit any innovation in their structure [9]. In addition, identifying suitable function blocks, and thus evolving efficient electronic circuits, is a difficult and time-consuming task.
Parallel genetic algorithms, one of the most promising choices for improving the computational ability of EAs, have been presented in different forms [10, 11]. Using parallelism to increase the speed of evolution seems to be an answer to the high computational cost. However, while parallel evolution does offer limited relief from the high computational cost, it does not provide any new capabilities from the standpoint of computational complexity theory. For example, the computational complexity of evolving combinational logic circuits, which grows exponentially with the number of circuit inputs under a traditional genetic algorithm, still grows exponentially with the number of circuit inputs under a parallel genetic algorithm. When billions of candidate circuits must be evaluated to evolve even small combinational logic circuits (e.g. a 4-bit multiplier), relying too heavily on sheer speedup of the EA itself does not seem a reasonable solution to the computational complexity issue of EHW.
A possible way to reduce the computational complexity of EHW is incremental evolution, which is based on the principle of divide-and-conquer. Incremental evolution was first introduced by Torresen [3] as a scalability approach to EHW. Following the incremental evolution strategy, different non-trivial circuits have been successfully evolved using both extrinsic [3, 4] and intrinsic EHW [5]. In this approach, a circuit is evolved through its smaller components. This means that evolution is first undertaken individually and serially on a set of sub-circuits, and the evolved sub-circuits are then employed as building blocks for further evolution or for structuring the target circuit. A variation of incremental evolution, named the multiobjective genetic algorithm, was suggested by Coello Coello et al. in [12]. In their case, each output of a combinational logic circuit is handled as an objective that is evolved independently in its corresponding subcomponent. Another scheme related to the idea of system partitioning is cooperative coevolution, proposed by Potter and De Jong [13]. It consists of the serial evolution of coadapted subcomponents which interact to perform the function of the top system.
3 Description of the Proposed Approach
Our proposed scheme approaches the scalability issue of EHW from three aspects: speeding up the EA, limiting the chromosome length, and decreasing the computational complexity of the problem. The main idea behind our approach is to use parallel intrinsic evolution to handle subcomponents that are decomposed versions of the top system. The circuit decomposition and assembly is inspired by the principle of divide-and-conquer, which has been introduced in [3, 4, 5, 12, 14] to limit the computational complexity of EHW. A generalized hardware architecture for evolving decomposed subcomponents in parallel is introduced in this paper. This architecture, which we call multi-VRC cores, can realize parallel evolution in a single commercial FPGA. The parallel intrinsic evolution of subcomponents is the most significant difference between our approach and the existing extrinsic EHW approaches [3, 4, 12, 13, 14], in which subcomponents were evolved serially by software simulation.
3.1 Decomposition of Logic Circuit
Two strategies are commonly used to decompose combinational logic circuits: Shannon decomposition and system output decomposition [4, 14]. In this paper, to simplify the hardware implementation, only the second scheme, decomposition of the system output, is employed. Fig. 1 shows the system output decomposition approach for evolving a 2-bit multiplier, which has a 4-bit input and a 4-bit output. In this scenario, following the system output decomposition strategy, the 4-bit output of the multiplier is split into two groups, as indicated by the vertical line in the truth table. Each partitioned 2-bit output is used to evolve its corresponding subcomponent (subcomponent 1 or 2). Each subcomponent is a 4-in/2-out circuit. The evolved subcomponents can then be assembled (as shown in Fig. 1) to perform the correct multiplier function.
Although this particular illustration shows two subcomponents, the actual number of subcomponents may be larger. For example, we can evolve a separate circuit for each single output bit, in which case four subcomponents with a 4-bit input and a 1-bit output would be required.
Fig. 1. Partitioning output function of 2-bit multiplier
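The output decomposition of the 2-bit multiplier can be reproduced with a short sketch. The helper names and the 0-based output indexing (most significant bit first) are illustrative choices, not taken from the paper.

```python
from itertools import product

def multiplier_truth_table(bits=2):
    """Rows of (input bits, output bits) for a bits x bits multiplier."""
    rows = []
    for a, b in product(range(1 << bits), repeat=2):
        inputs = tuple(int(c) for c in format(a, f"0{bits}b") + format(b, f"0{bits}b"))
        outputs = tuple(int(c) for c in format(a * b, f"0{2 * bits}b"))
        rows.append((inputs, outputs))
    return rows

def partition_outputs(table, groups):
    """One sub-table per group of output columns; the inputs are shared."""
    return [[(inp, tuple(out[i] for i in group)) for inp, out in table]
            for group in groups]

table = multiplier_truth_table(2)
groups = [(0, 1), (2, 3)]            # two 4-in/2-out subcomponents
sub1, sub2 = partition_outputs(table, groups)

# Reassembling the two partial outputs restores the full multiplier output.
reassembled = [o1 + o2 for (_, o1), (_, o2) in zip(sub1, sub2)]
```

Each sub-table keeps all 16 input rows but only two output columns, so each subcomponent can be evolved against its own target.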
3.2 Evolutionary Algorithm
In this work, an intrinsic EHW built on the multi-VRC cores structure is used to evolve the partitioned subcomponents in parallel. The virtual reconfigurable circuit architecture was first proposed by Sekanina [8] for implementing real-world applications of EHW in common FPGAs. The structure of the VRC is flexible and can be designed to fit the requirements of a given application. For evolving combinational logic circuits, a geometric structure based on Cartesian genetic programming (CGP) has been implemented on a VRC by Sekanina [15]. CGP, whose phenotype is a two-dimensional array of logic cells, was first introduced by Miller et al. in [16]. In our approach, to reduce the chromosome length, a revised two-dimensional gate array that imposes more connection restrictions than standard CGP is employed. In CGP each cell can take its inputs from the external inputs of the cell array or from the cell outputs in any previous layer, whereas each gate in our array can connect only to the gate outputs of the immediately preceding layer. Very similar gate array structures have been reported by Torresen [3] and Coello Coello [12] for learning combinational logic circuits.
The basic frame of the parallel evolutionary algorithm employed in our evolvable system is illustrated in Fig. 2. Although this particular illustration shows a parallel evolutionary algorithm designed for evolving two subcomponents, the actual number of subcomponents can be larger. In this model, the population is divided into two subpopulations that maintain the two decomposed subcomponents. The evolution of each subcomponent follows the (1+λ) evolution strategy, where λ = 2. The evolutionary operations are based only on selection and mutation. In the initial phase, each subpopulation of λ individuals is created at random.
Once the fitness of each individual has been evaluated, the fittest individual is selected as the parent chromosome. The next generation of the subpopulation is generated from the fittest individual and its λ mutants. This process repeats until the stop criterion of each subpopulation is met, which is defined as: (1) the EA finds the expected solution for its corresponding subcomponent; or (2) the predefined number of generations is exhausted. During the evolutionary process, each VRC core maintains the evolution of its corresponding subpopulation independently.
Fig. 2. Flow diagram of proposed parallel evolutionary algorithm for evolving two subcomponents
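The loop of a single subpopulation can be sketched in software as a (1+λ) evolution strategy with λ = 2. This is a minimal sketch under stated assumptions: the toy fitness (counting ones), the number of flipped bits per mutant, the generation limit, and all names are illustrative, and the paper's version runs in hardware on each VRC core.

```python
import random

def mutate(bits, n_flips, rng):
    """Return a copy of `bits` with n_flips randomly chosen positions inverted."""
    child = list(bits)
    for i in rng.sample(range(len(child)), n_flips):
        child[i] ^= 1
    return child

def one_plus_lambda(fitness, length, target, lam=2, n_flips=1,
                    max_generations=10_000, seed=0):
    """(1+lambda) strategy using only selection and mutation."""
    rng = random.Random(seed)
    # The initial subpopulation of lam individuals is created at random.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(lam)]
    parent = max(population, key=fitness)
    best = fitness(parent)
    for generation in range(max_generations):
        if best >= target:                 # stop criterion 1: solution found
            return parent, generation
        for _ in range(lam):
            child = mutate(parent, n_flips, rng)
            score = fitness(child)
            if score > best:               # keep the parent unless a mutant is better
                parent, best = child, score
    return parent, max_generations         # stop criterion 2: generations exhausted

# Toy problem: maximise the number of ones in a 16-bit string.
solution, generations = one_plus_lambda(sum, length=16, target=16)
```

In a multi-core setting, one such loop would run per subcomponent, each with its own fitness function and stop criterion.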
Each subcomponent processes a different decomposed system output function, so the fitness value of each individual in a subpopulation is calculated by comparing the subcomponent output with its corresponding partitioned desired system output as follows:

\[
\mathrm{Fitness} = \sum_{\mathrm{vector}} \sum_{\mathrm{output}} x,
\qquad
x = \begin{cases} 1, & \mathrm{output} = \mathrm{expect} \\ 0, & \mathrm{output} \neq \mathrm{expect} \end{cases}
\tag{1}
\]

For each partitioned output vector, each single-bit output is compared with its corresponding expected system output (labeled expect). If they are equal, x is set to 1 and added to the fitness. Fitness is thus the number of output bits (output), summed over all partitioned output vectors (vector) in the set, that match the truth table.
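Equation (1) amounts to counting matching output bits over all training vectors. A minimal sketch follows; the function name `bit_match_fitness` and the half-adder example are illustrative and do not come from the paper.

```python
def bit_match_fitness(candidate, input_vectors, expected_outputs):
    """Count the output bits that equal the expected bits, as in Eq. (1)."""
    total = 0
    for vector, expect in zip(input_vectors, expected_outputs):
        output = candidate(vector)
        total += sum(1 for o, e in zip(output, expect) if o == e)
    return total

# Toy target: a half adder with outputs (carry, sum).
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
expected = [(0, 0), (0, 1), (0, 1), (1, 0)]

def exact(v):
    return (v[0] & v[1], v[0] ^ v[1])

def flawed(v):
    return (v[0] & v[1], v[0] | v[1])      # sum bit is wrong for input (1, 1)

perfect_score = bit_match_fitness(exact, inputs, expected)
flawed_score = bit_match_fitness(flawed, inputs, expected)
```

The maximum score equals the number of vectors times the number of output bits, here 4 × 2 = 8.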
4 Hardware Implementation
A Celoxica RC1000 PCI board [17] equipped with a Xilinx Virtex xcv2000E FPGA [18] (see Fig. 3) is employed as our experimental platform for implementing and verifying the proposed multi-VRC cores architecture. This board has been successfully applied as a high-performance, flexible and low-cost FPGA-based hardware platform for different computationally intensive applications [19, 20]. The proposed evolvable system is composed of two main components (as shown in Fig. 3): the control and interface unit and several VRC cores. All operations of the VRC cores are controlled by the control and interface unit, which executes the commands from the host PC and connects to the on-board 8 Mbytes of SRAM. Each VRC core in the proposed evolvable system corresponds to a decomposed subcomponent as defined in the previous section. A VRC core consists of a virtual reconfigurable circuit unit, an EA unit, and a Fitness unit. The EA unit implements the genetic operations and generates the configuration bit strings (chromosomes) that configure the virtual reconfigurable circuit unit. The virtual reconfigurable circuit unit, whose function is virtually reconfigurable, processes the input data from four memory banks. The Fitness unit calculates individual fitness by comparing the output of the virtual reconfigurable circuit unit with its corresponding partitioned output in the truth table.
Fig. 3. Organization of the proposed evolvable system with multi-VRC cores
The virtual reconfigurable circuit unit in each VRC core can be considered a digital circuit that acts as a decomposed subcomponent of the top system. As an example, Fig. 4 presents a virtual reconfigurable circuit unit designed for evolving a 6-input/3-output subcomponent. It consists of 43 uniform function elements (FEs) arranged in an array of 6 columns and 8 rows. In the last column of the FE array, the number of FEs equals the number of system outputs, and each FE corresponds to one system output. Every FE has two inputs and can be configured to perform one of the 8 logic functions listed in Fig. 4. The input connections of each FE are selected by its two multiplexers. An input of an FE in column 2 (3, 4, 5 or 6) can be connected to the output of any FE in the immediately preceding column. In column 1, each FE input can be connected to any one of the system inputs or set to a bias value of 1 or 0. Each FE is equipped with a flip-flop to support pipeline processing, and a column of FEs is considered a single stage of the pipeline. Each FE needs 9 bits to determine its input connections (3+3 bits) and its function (3 bits). Although column 6 requires fewer than 72 configuration bits, the configuration memory still uses 72 bits per column in our implementation to simplify the hardware design. Hence, the configuration memory holds 72 × 6 = 432 bits. In our approach, the eight FEs in a column are configured simultaneously, so the 432 bits in the configuration memory are divided and stored in 6 configuration banks (cnfBank) of 72 bits each.
Fig. 4. Virtual reconfigurable circuit unit for the evolution of the 6-in/3-out subcomponent
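The 432-bit configuration layout described above (6 banks of 72 bits, 9 bits per FE) can be modelled functionally as follows. Several details are assumptions made for illustration only: the order of the three 3-bit fields, the contents of the function table (the actual set is listed in the paper's Fig. 4), the mapping of column-1 selectors to the six inputs plus the constants 0 and 1, and the choice of the first three rows of the last column as outputs. The pipeline flip-flops are ignored.

```python
ROWS, COLS, FE_BITS = 8, 6, 9

# Hypothetical function table, indexed by the 3-bit function code.
FUNCS = [
    lambda a, b: a & b,
    lambda a, b: a | b,
    lambda a, b: a ^ b,
    lambda a, b: 1 - (a & b),
    lambda a, b: 1 - (a | b),
    lambda a, b: 1 - a,
    lambda a, b: a,
    lambda a, b: b,
]

def to_int(bits):
    return int("".join(str(b) for b in bits), 2)

def decode(chromosome):
    """Split 432 bits into 6 banks of 8 (select_a, select_b, function) triples."""
    assert len(chromosome) == ROWS * COLS * FE_BITS
    bank_bits = ROWS * FE_BITS
    banks = []
    for c in range(COLS):
        bank = chromosome[c * bank_bits:(c + 1) * bank_bits]
        fes = []
        for r in range(ROWS):
            gene = bank[r * FE_BITS:(r + 1) * FE_BITS]
            fes.append((to_int(gene[0:3]), to_int(gene[3:6]), to_int(gene[6:9])))
        banks.append(fes)
    return banks

def evaluate(banks, inputs, n_outputs=3):
    """Propagate one 6-bit input vector through the array, column by column."""
    signals = list(inputs) + [0, 1]        # six system inputs plus bias 0 and 1
    for fes in banks:
        signals = [FUNCS[f](signals[a], signals[b]) for a, b, f in fes]
    return signals[:n_outputs]

def fe_gene(select_a, select_b, function):
    return [int(c) for c in f"{select_a:03b}{select_b:03b}{function:03b}"]

# Every FE forwards the signal of its own row (function code 6 passes input a).
identity = [bit for _ in range(COLS) for r in range(ROWS) for bit in fe_gene(r, r, 6)]
result = evaluate(decode(identity), [1, 0, 1, 1, 0, 0])
```

With this pass-through configuration the first three system inputs appear unchanged at the three outputs, which is a convenient sanity check for the decoding.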
Fig. 5. Architecture of EA unit
Fig. 5 describes the architecture of the EA unit designed for the evolution of a 6-input/3-output subcomponent. When the EA is activated, the population memory is filled with two chromosomes, which are mutated versions of two 6×72-bit random numbers generated by the Random Number Generator (RNG), a linear cellular automaton [21]. The mutation unit processes the 6×72-bit data in 6 clock cycles, 72 bits at a time. Only randomly selected bits are inverted, and the number of mutated bits is determined by the predefined mutation rate (0.8% in this work). After all chromosomes of the initial population have been evaluated, the best chromosome is chosen as the parent chromosome and stored in the best chromosome memory. The new population is generated from the parent chromosome and its 2 mutants. If the fitness of an offspring chromosome is better than that of the parent, the parent chromosome stored in the best chromosome memory is replaced. Fitness calculation is realized in the Fitness unit. The set of input training vectors is loaded from the on-board SRAM and fed to the VRC unit as input. The output vectors of the VRC unit are sent to the Fitness unit and compared with the partitioned expected output set specified in the truth table (which is also stored in the on-board SRAM). The fitness value is increased for every matching output bit. Therefore, the maximal fitness value is 64×3 = 192 (the decomposed system output in this scenario is 3 bits wide, and a 6-bit input gives 64 training vectors).
5 Experimental Results
Our proposed multi-VRC cores-based evolvable system was designed in VHDL and synthesized into the Virtex xcv2000E FPGA using Xilinx ISE 8.1. According to our synthesis report, the proposed evolvable system can be operated at more than 90 MHz in all cases. However, the actual hardware experiments were run at 30 MHz to ease synchronization with the PCI interface, which operates correctly with PCI bus clocks from zero to 33 MHz. In this paper, a 3-bit multiplier and a 3-bit adder were employed as evolutionary targets to illustrate our proposed scheme. Three different evolutionary strategies were used in the experiments: (1) direct evolution, as employed in [15, 22]; (2) incremental evolution as proposed by Torresen [4], with the partitioned training vector strategy only; (3) our proposed multi-VRC cores-based scheme. The maximum number of generations of one EA run was set to 2^27 in all evolutionary strategies. To achieve reasonable hardware cost and performance, a uniform 8×6 FE array was employed to evolve all the decomposed subcomponents in incremental evolution and in our proposed strategy. A 12×6 FE array was selected for the direct evolution of the 3-bit multiplier, based on our previous experiments: no feasible 3-bit multiplier solution could be evolved using smaller FE arrays (e.g. 10×6, 8×8). To simplify the hardware design, this 12×6 FE array was also employed to directly evolve the 3-bit adder.
For the evolution of the 3-bit multiplier with the proposed scheme, system partitions into two and into three subcomponents were each implemented. To balance the computational complexity across the decomposed subcomponents, the system outputs were partitioned as follows: (1,3,5), (2,4,6) for 2 subcomponents and (1,4), (2,5), (3,6) for 3 subcomponents. The same system decomposition rule was also employed in incremental evolution.
Table 1 compares direct evolution, incremental evolution, and the multi-VRC cores-based approach in terms of the device cost, the chromosome length of each subcomponent, the number of successful EA runs that found feasible logic circuits (times of success), the average and standard deviation of the number of generations, and the average total evolution time. We performed 100 runs for each case.
The evolved 3-bit adder has 6 inputs and 4 outputs; the input carry is not considered in this work. Only the decomposition into two subcomponents was implemented in our experiment. Table 2 summarizes all results for evolving the 3-bit adder under the different settings. All average experimental results are from 100 independent EA runs for each case.
Table 1. The results of evolving 3-bit multiplier with different evolution strategies
(Entries separated by "/" are per-subcomponent values.)

EA type                Divided outputs    Device cost (slices)  Chromosome (bits)  Generations (avg.)         Generations (std. dev.)    Total evolution time (avg.)  Times of success
Direct evolution       1-6                4274                  792                18081238                   19461400                   77.147 sec                   50
Incremental evolution  1,3,5 / 2,4,6      2984                  432                1526060 / 2122824          1803625 / 2297710          15.569 sec                   58 / 61
Incremental evolution  1,4 / 2,5 / 3,6    3423                  432                505963 / 391404 / 133943   494748 / 353602 / 165753   4.400 sec                    70 / 67 / 60
Multi-VRC cores        1,3,5 / 2,4,6      4505                  432                2450171                    2326822                    10.454 sec                   61
Multi-VRC cores        1,4 / 2,5 / 3,6    6422                  432                610772                     409938                     2.606 sec                    57
Table 2. The results of evolving 3-bit adder with different evolution strategies
(Entries separated by "/" are per-subcomponent values.)

EA type                Divided outputs  Device cost (slices)  Chromosome (bits)  Generations (avg.)  Generations (std. dev.)  Total evolution time (avg.)  Times of success
Direct evolution       1-4              4130                  792                380424              465868                   1.623 sec                    47
Incremental evolution  1,3 / 2,4        2948                  432                48011 / 58684       63766 / 70317            0.455 sec                    56 / 57
Multi-VRC cores        1,3 / 2,4        4460                  432                69256               62631                    0.295 sec                    51
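The chromosome lengths listed in Tables 1 and 2 can be related to the FE array dimensions. The text derives only the 432-bit figure (8 rows, 6 columns, 9 bits per FE). The sketch below additionally assumes that each multiplexer selector needs just enough bits to address one row of the preceding column; this assumption is not stated in the paper, but it reproduces the 792-bit length of the 12×6 array as well. The function name is illustrative.

```python
def chromosome_bits(rows, cols, n_functions=8):
    """Configuration bits for a rows x cols FE array with two input multiplexers per FE."""
    select_bits = (rows - 1).bit_length()           # bits needed to address one of `rows` sources
    function_bits = (n_functions - 1).bit_length()  # 3 bits for 8 logic functions
    return rows * cols * (2 * select_bits + function_bits)

decomposed = chromosome_bits(8, 6)    # 8 x 6 array used for each subcomponent
direct = chromosome_bits(12, 6)       # 12 x 6 array used for direct evolution
```

Under this assumption the 8×6 array needs 9 bits per FE and the 12×6 array needs 11 bits per FE.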
6 Discussion
We have presented the results of our initial experiments on the multi-VRC cores-based evolvable system. The analysis is based on the two examples presented in this paper. In all cases, the numbers of successful EA runs that found feasible logic circuits are comparable for direct evolution, incremental evolution, and our proposed approach. Since the main motive of the approach proposed in this paper is to develop an efficient evolvable system that overcomes the scalability issue of EHW, we also compare the computational cost required by the three approaches. The computational cost of the different evolutionary strategies can be evaluated using the average total system evolution time. The multi-VRC cores-based EHW clearly outperforms the other approaches in all cases. All results indicate that the execution time for EA learning depends significantly on the level of system decomposition selected. For the evolution of the 3-bit multiplier, the speedup obtained by multi-VRC cores with two decomposed subcomponents is 7.4 (against direct evolution) and 1.5 (against incremental evolution with two subcomponents). With the decomposition into three subcomponents, the speedup of the multi-VRC cores-based EHW is 29.6 (against direct evolution) and 1.7 (against incremental evolution with three subcomponents). Similar results are obtained for the evolution of the 3-bit adder.
The proposed multi-VRC cores-based EHW can be considered a hybridization of parallel intrinsic evolution and divide-and-conquer-based incremental evolution. The better performance in terms of computational cost obtained by our approach is mainly due to two new features of our proposed evolvable system: (1) Parallel multi-VRC cores with powerful computational ability. In our approach, all the VRC cores are implemented in one FPGA, which executes the evolution of the subcomponents (e.g. fitness evaluation and genetic operations) in parallel. The most obvious advantage of this implementation is that the evolution of each subcomponent is completely pipelined and parallel. The proposed hardware implementation promises to overcome the overhead introduced by slow inter-processor communication and the setup time issues of general multi-processor-based parallel evolution. (2) System decomposition strategy. The main advantage of decomposing the system output is that evolution is performed on smaller subcomponents with fewer outputs than the top system. The number of gates required to implement each subcomponent can be reduced because of the smaller system output. A shorter chromosome can be employed to represent each subcomponent. For example, in our experiments, the chromosome length of each subcomponent was reduced from 792 to 432 bits. In addition, the decomposed output function also reduces the computational complexity of the problem to be solved. Therefore, by partitioning the system output, a simpler and smaller EA search space can be achieved in the evolution of each subcomponent.
Another interesting observation concerns the hardware implementation cost.
With the introduction of the system decomposition strategy, the device cost is larger than that needed for the traditional direct evolution approach. More VRC cores are required in our proposed approach than in direct evolution (which employs only one VRC). However, the increase in device cost is not very significant in the multi-VRC cores-based EHW. According to our synthesis results, the hardware cost of the two-VRC cores-based approach is very close to that of direct evolution. In the three-VRC cores-based approach, the device cost is 1.5 times that of direct evolution. This is because the decomposed subcomponents employ smaller input/output combinations. In accordance with Shannon's effect [15], we can employ a smaller FE array and chromosome memory to implement each partitioned subcomponent. Moreover, our original motive is to overcome the existing scalability issue of EHW. Device cost is not considered a serious issue in our work, because the number of transistors available in circuits increases according to Moore's Law.
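The speedups and the device cost ratio quoted in this section follow directly from the average evolution times and slice counts reported in Table 1. The short check below recomputes them; the dictionary layout and key names are illustrative.

```python
# Average total evolution times (seconds) for the 3-bit multiplier, from Table 1.
times = {
    ("direct", 1): 77.147,
    ("incremental", 2): 15.569,
    ("incremental", 3): 4.400,
    ("multi_vrc", 2): 10.454,
    ("multi_vrc", 3): 2.606,
}

def speedup(baseline, proposed):
    """Ratio of baseline time to proposed time; larger means faster."""
    return times[baseline] / times[proposed]

two_vs_direct = round(speedup(("direct", 1), ("multi_vrc", 2)), 1)
two_vs_incremental = round(speedup(("incremental", 2), ("multi_vrc", 2)), 1)
three_vs_direct = round(speedup(("direct", 1), ("multi_vrc", 3)), 1)
three_vs_incremental = round(speedup(("incremental", 3), ("multi_vrc", 3)), 1)

# Device cost (slices) of three VRC cores relative to direct evolution.
cost_ratio = round(6422 / 4274, 1)
```

Rounded to one decimal place, these reproduce the values 7.4, 1.5, 29.6 and 1.7 for the speedups and 1.5 for the device cost ratio.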
7 Conclusion
In this paper, we have presented a novel scalable evolvable hardware architecture, called multi-VRC cores, for synthesizing combinational logic circuits. The proposed approach uses a divide-and-conquer-based technique to decompose the top system into several subcomponents. All subcomponents are then evolved independently and in parallel by their corresponding VRC cores. The experimental results show that our proposed scheme performs significantly better than direct evolution and incremental evolution in terms of EA execution time. Both the 3-bit multiplier and the 3-bit adder can be evolved in less than 3 seconds on average, which is unmatched by any other reported evolvable system. Future work will be devoted to applying this scheme to more complex real-world applications.
Acknowledgments. This work was supported by the Korean MOCIE under research project 'Super Intelligent Chip Design'.
References
1. Torresen, J.: Possibilities and Limitations of Applying Evolvable Hardware to Real-world Application. In: Proc. of the 10th International Conference on Field Programmable Logic and Applications, FPL-2000, Villach, Austria, pp. 230–239 (2000)
2. Yao, X., Higuchi, T.: Promises and Challenges of Evolvable Hardware. IEEE Transactions on Systems, Man, and Cybernetics 29(1), 87–97 (1999)
3. Torresen, J.: A Divide-and-Conquer Approach to Evolvable Hardware. In: Sipper, M., Mange, D., Pérez-Uribe, A. (eds.) ICES 1998. LNCS, vol. 1478, pp. 57–65. Springer, Heidelberg (1998)
4. Torresen, J.: Evolving Multiplier Circuits by Training Set and Training Vector Partitioning. In: Tyrrell, A.M., Haddow, P.C., Torresen, J. (eds.) ICES 2003. LNCS, vol. 2606, pp. 228–237. Springer, Heidelberg (2003)
5. Wang, J., et al.: Using Reconfigurable Architecture-Based Intrinsic Incremental Evolution to Evolve a Character Classification System. In: Hao, Y., Liu, J., Wang, Y.-P., Cheung, Y.-m., Yin, H., Jiao, L., Ma, J., Jiao, Y.-C. (eds.) CIS 2005. LNCS (LNAI), vol. 3801, pp. 216–223. Springer, Heidelberg (2005)
6. Murakawa, M., et al.: Hardware Evolution at Function Level. In: Ebeling, W., Rechenberg, I., Voigt, H.-M., Schwefel, H.-P. (eds.) PPSN IV 1996. LNCS, vol. 1141, pp. 62–71. Springer, Heidelberg (1996)
7. Zhang, Y., et al.: Digital Circuit Design Using Intrinsic Evolvable Hardware. In: Proc. of the 2004 NASA/DoD Conference on the Evolvable Hardware, pp. 55–63. IEEE Computer Society Press, Los Alamitos (2004)
8. Sekanina, L.: Virtual Reconfigurable Circuits for Real-World Applications of Evolvable Hardware. In: Tyrrell, A.M., Haddow, P.C., Torresen, J. (eds.) ICES 2003. LNCS, vol. 2606, pp. 186–197. Springer, Heidelberg (2003)
9. Sekanina, L.: Evolutionary Design of Digital Circuits: Where Are Current Limits? In: Proc. of the First NASA/ESA Conference on Adaptive Hardware and Systems, AHS 2006, pp. 171–178. IEEE Computer Society Press, Los Alamitos (2006)
10. Gordon, V.S., Whitley, D.: Serial and Parallel Genetic Algorithms as Function Optimizers. In: Proc. of the Fifth International Conference on Genetic Algorithms, pp. 177–183. Morgan Kaufmann, San Mateo, CA (1993)
11. Cantu-Paz, E.: A Survey of Parallel Genetic Algorithms. Calculateurs Parallèles 10(2), 141–171 (1998)
12. Coello Coello, C.A., Aguirre, A.H.: Design of Combinational Logic Circuits Through an Evolutionary Multiobjective Optimization Approach. Artificial Intelligence for Engineering, Design, Analysis and Manufacture 16(1), 39–53 (2002)
13. Potter, M.A., De Jong, K.A.: Cooperative Co-evolution: An Architecture for Evolving Coadapted Subcomponents. Evolutionary Computation 8(1), 1–29 (2000)
14. Kalganova, T.: Bidirectional Incremental Evolution in Extrinsic Evolvable Hardware. In: Proc. of the 2nd NASA/DoD Workshop on Evolvable Hardware, pp. 65–74. IEEE Computer Society Press, Los Alamitos (2000)
15. Sekanina, L., et al.: An Evolvable Combinational Unit for FPGAs. Computing and Informatics 23(5), 461–486 (2004)
16. Miller, J.F., Thomson, P.: Cartesian Genetic Programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000)
17. Celoxica Inc.: RC1000 Hardware Reference Manual V2.3 (2001)
18. http://www.xilinx.com
19. Martin, P.: A Hardware Implementation of a Genetic Programming System Using FPGAs and Handel-C. Genetic Programming and Evolvable Machines 2(4), 317–343 (2001)
20. Bensaali, F., et al.: Accelerating Matrix Product on Reconfigurable Hardware for Image Processing Applications. IEE Proceedings - Circuits, Devices and Systems 152(3), 236–246 (2005)
21. Wolfram, S.: Universality and Complexity in Cellular Automata. Physica 10D, 1–35 (1984)
22. Miller, J.F., et al.: Principles in the Evolutionary Design of Digital Circuits–Part I. Journal of Genetic Programming and Evolvable Machines 1(1), 7–35 (2000)
An Intrinsic Evolvable Hardware Based on Multiplexer Module Array
Jixiang Zhu, Yuanxiang Li, Guoliang He, and Xuewen Xia
The State's Key Laboratory of Software Engineering, WuHan University
[email protected]
Abstract. Traditionally, designing analog and digital electrical circuits has been a demanding engineering task. With the emergence of Evolvable Hardware (EHW) and the significant research carried out in this domain over the last ten or so years, EHW has been established as a promising solution for the automatic design of digital and analog circuits. At present, the main research in the EHW field focuses on extrinsic and intrinsic evolution. In this paper, we concentrate on intrinsic evolution. Existing work on implementing intrinsic evolution mainly follows three approaches: first, evolving the configuration bitstream directly and then recomposing it; second, amending the contents of the Look-Up Tables (LUTs) with dedicated tools; third, setting up a virtual circuit on a physical chip and evolving its “parameters”, which are defined by the designer, so that the corresponding circuit evolves as the parameters change. This paper sets aside the first and second approaches and proposes a virtual circuit based on a Multiplexer Module Array (MMA), which is implemented on a Xilinx Virtex-II Pro (XC2VP20) FPGA.
Keywords: intrinsic, digital, multiplexer, FPGA.
1 Introduction
Evolvable Hardware (EHW) is the application of Genetic Algorithms (GA) and Genetic Programming (GP) to electronic circuits and devices [3]. Research on EHW concentrates mainly on extrinsic evolution and intrinsic evolution. In extrinsic evolution, the fitness of individuals is evaluated by software simulation during the evolution, and only the final best individual is downloaded to the physical chip. For digital circuits, extrinsic evolution amounts to using some method to evolve a given functionality; in other words, the aim of research on extrinsic evolution is to find better algorithms that yield an optimized and correct circuit more easily than previous algorithms do. Some such algorithms have been proposed by Tatiana Kalganova et al. in [8,9], who successfully evolved some digital circuits by input decomposition and output decomposition, among other methodologies. In intrinsic evolution, by contrast, every individual is downloaded to the physical chip during the evolution; the chip evaluates the corresponding circuit and returns its fitness to the GA process.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 35–44, 2007. © Springer-Verlag Berlin Heidelberg 2007
There are many methods to implement intrinsic evolution. A methodology for evolving the bitstream directly is described in [1]. It is the most intuitive form of intrinsic evolution; however, it requires an expert understanding of the given FPGA as well as familiarity with the structure of its configuration bitstream. Although the authors describe the bitstream composition of the XC2V40 and show how to locate the LUT contents in a configuration bitstream, the approach has two serious limitations. On the one hand, different types of FPGA have different configuration layouts, even when produced by the same corporation. If the experimental environment changes, we must spend extra effort on familiarizing ourselves with the new environment, re-parse the configuration, locate the LUT contents among millions of configuration bits, and link them into a gene for evolution, so the approach clearly lacks portability. On the other hand, an illegal bitstream may destroy the FPGA, and evolving the configuration bitstream directly can easily generate illegal bitstreams; when these are downloaded to the FPGA, the chip will not work and may even be damaged.
Another methodology, introduced in [2,3,4], makes use of the JBits SDK or other tools provided by Xilinx. JBits is a set of Java classes that provide an Application Program Interface (API) into the bitstream of the Xilinx Virtex FPGA family; it provides the capability of designing and dynamically modifying circuits in Xilinx Virtex series FPGA devices. Compared with the previous method, this method can modify the LUT contents more safely, and it has been shown to be a feasible method for intrinsic evolution. However, it still has some limitations [2]. The largest drawback of the JBits API is its manual nature: everything must be stated explicitly in the source code, including the routing. Another equally important limitation is that the JBits API also requires users to be very familiar with the architecture of the specific FPGA; hence, this method lacks flexibility.
A virtual EHW FPGA based on the “sblock” has been proposed in [5]. The functionality of an “sblock” and its connectivity to its neighbors are determined by the configuration of its LUT; to alter the function of an “sblock” or change its connectivity, the LUT is reprogrammed, and the virtual EHW FPGA is then mapped onto the physical FPGA. The virtual EHW FPGA has many of the characteristics of a physical FPGA; it is able to address genome complexity issues and enable the evolution of large, complex circuits. Ultimately, however, this method still manipulates the LUT contents at a low level, which raises concerns about its safety.
Viewed together, these three methodologies all change the LUT contents, directly or indirectly, to achieve intrinsic evolution, and they share deficiencies in flexibility, portability and security. The virtual circuit methodology does not operate on the LUT contents directly; it is a middle layer between the FPGA and the GA, so we do not need to spend much effort on becoming familiar with the given FPGA. Some virtual circuits have been introduced in [6,7]. In this paper, we propose a new virtual circuit, the MMA; compared with previous virtual circuits, its primary advantage is its convenience in encoding and decoding. The MMA is designed in VHDL and implemented on a Xilinx Virtex-II Pro series FPGA, the XC2VP20. By setting up an MMA in a physical chip, we only need to evolve a data structure that we define ourselves in order to realize intrinsic evolution. During the GA process,
we need neither manipulate the bitstream nor change the LUT contents, so the evolution is safe. In addition, this method only requires us to grasp the general approach of designing a simple circuit in VHDL or Verilog HDL, rather than to master the internal structure of a specific FPGA as a professional hardware engineer would. Since VHDL is a universal hardware description language, the method is in this sense “evolution friendly” [5] and flexible. The following sections give a detailed description of the MMA methodology.
2 The Structure of MMA
In this section, we introduce the structure of the MMA in a bottom-up manner.
2.1 Function Styles
The function styles are defined by the designer. They may be basic logic gates such as “AND”, “OR” and “NOT”, or more complex function modules, depending on whether gate-level or function-level EHW is to be carried out. In our experiment, we defined ten function styles, including basic logic gates and some simple functions; every function style has a complementary function style, see Table 1. During the evolution, the numbers ‘0’ to ‘9’ represent the corresponding functions, and this table is consulted when decoding chromosomes into circuits.
Table 1. Function styles in our experiment
No.  Function Style      No.  Function Style
0    a                   5    !(a + b)
1    !a                  6    a ⊕ b
2    a • b               7    !(a ⊕ b)
3    a + b               8    a • !c + b • c
4    !(a • b)            9    !(a • !c + b • c)
2.2 Multiplexer Module
Figure 1 illustrates the internal structure of a single multiplexer module in our system. The blocks marked “MUX1”, “MUX2”, “MUX3” and “MUX4” are the multiplexers; those marked “reg1”, “reg2”, “reg3” and “reg4” are the control ports of the corresponding multiplexers; and those marked “Function Style 0”, “Function Style 1”, …, “Function Style i” are the function modules defined in the previous section. The input signals of each multiplexer module are introduced in Section 2.3, and the output of “MUX4” is the output of the whole multiplexer module. Every multiplexer module contains several input-switch multiplexers and one function-switch multiplexer. The number of input-switch multiplexers depends on the function styles: if the function style with the most inputs has n inputs, n input-switch multiplexers are needed even though the other function styles have fewer than n inputs, because every possible function style must be supplied with sufficient inputs.
In the proposed system, each multiplexer module contains four multiplexers: three for input switching and one for function switching. As shown in Figure 1, “MUX1”, “MUX2” and “MUX3” are used for input switching, because our experiment includes two three-input functions, and “MUX4” is used for function switching. Each of the four multiplexers is connected to a register, whose contents determine which input is forwarded to the output. The input signals of a multiplexer module are all connected to “MUX1”, “MUX2” and “MUX3”; each of these three multiplexers selects one input signal as an input of the function modules that follow, so all function modules receive three inputs. If a function style needs fewer than three input signals, the redundant inputs are ignored. The output of “MUX4”, selected from the outputs of the function modules according to “reg4”, is the output of the whole multiplexer module, so the module takes on the functionality of the selected function module. Consequently, to change the logical structure or functionality of the circuit, it suffices to change the contents of these four registers.
Fig. 1. Structure of a Multiplexer Module
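The behaviour of a single multiplexer module can be mimicked in software as follows. This is a minimal sketch rather than the authors' VHDL: the function names `mux` and `multiplexer_module` are hypothetical, signals are assumed to be 0/1 integers, and the function modules are taken from Table 1.

```python
# Function modules of Table 1, indexed by function style number (0..9).
STYLES = [
    lambda a, b, c: a,
    lambda a, b, c: 1 - a,
    lambda a, b, c: a & b,
    lambda a, b, c: a | b,
    lambda a, b, c: 1 - (a & b),
    lambda a, b, c: 1 - (a | b),
    lambda a, b, c: a ^ b,
    lambda a, b, c: 1 - (a ^ b),
    lambda a, b, c: (a & (1 - c)) | (b & c),
    lambda a, b, c: 1 - ((a & (1 - c)) | (b & c)),
]


def mux(signals, select):
    """A multiplexer forwards the signal addressed by its control register."""
    return signals[select]


def multiplexer_module(signals, reg1, reg2, reg3, reg4):
    """Model of one module: MUX1..MUX3 pick inputs, MUX4 picks the function."""
    a = mux(signals, reg1)                   # MUX1, input switch
    b = mux(signals, reg2)                   # MUX2, input switch
    c = mux(signals, reg3)                   # MUX3, input switch
    outputs = [f(a, b, c) for f in STYLES]   # all function modules see a, b, c
    return mux(outputs, reg4)                # MUX4, function switch
```

Changing only the four register values changes both which signals the module listens to and which function it computes, which is the property the evolution relies on.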
2.3 Multiplexer Module Array
Figure 2 illustrates the structure of an MMA prepared for a circuit with m inputs (I0, I1, I2, …, Im-1) and n outputs (O0, O1, O2, …, On-1); the blank boxes represent the multiplexer modules and the arrows represent the routing. The size of the MMA is determined by the scale of the target circuit: the number of columns can be chosen freely, while the number of rows cannot be less than the larger of the numbers of inputs and outputs, i.e. max(m, n). In the VHDL description of the MMA, every multiplexer module is connected to the modules of the previous column and to the primary inputs of the circuit. During the GA process, the physical structure of the circuit therefore does not actually change; we only need to evolve the contents of the registers. As mentioned in Section 2.2, these registers determine the logical structure and functionality of the circuit, so the circuit changes whenever the registers evolve.
Take the design of a 4-bit adder as an example. A 4-bit adder has 8 inputs and 5 outputs, so we design an 8×8 MMA. Each of the 8 multiplexer modules in the first column is connected directly to the 8 primary inputs, while every multiplexer module in the second to eighth columns is connected to the 8 primary inputs and the 8 outputs of the previous column. That is, the modules in the first column have 8 inputs each and all other modules have 16 inputs each. The 5 outputs are selected from the outputs of the 8 modules in the last column.
Fig. 2. The structure of MMA
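The column-by-column wiring described above can be sketched as a forward evaluation of the array. The code below is an illustrative software model only: `evaluate_mma` and the `grid[row][col]` layout are hypothetical names, signals are assumed to be 0/1 integers, and the way the final circuit outputs are picked from the last column is left to the caller because the text does not specify it.

```python
# Function modules of Table 1, indexed by function style number (0..9).
STYLES = [
    lambda a, b, c: a,
    lambda a, b, c: 1 - a,
    lambda a, b, c: a & b,
    lambda a, b, c: a | b,
    lambda a, b, c: 1 - (a & b),
    lambda a, b, c: 1 - (a | b),
    lambda a, b, c: a ^ b,
    lambda a, b, c: 1 - (a ^ b),
    lambda a, b, c: (a & (1 - c)) | (b & c),
    lambda a, b, c: 1 - ((a & (1 - c)) | (b & c)),
]


def evaluate_mma(grid, primary_inputs):
    """Evaluate an MMA and return the outputs of its last column.

    grid[row][col] holds the four register values (reg1, reg2, reg3, reg4)
    of one module. Signal indices address the primary inputs first and the
    outputs of the previous column after them.
    """
    rows, cols = len(grid), len(grid[0])
    previous = []                                  # first column sees primary inputs only
    for col in range(cols):
        signals = list(primary_inputs) + previous
        current = []
        for row in range(rows):
            reg1, reg2, reg3, reg4 = grid[row][col]
            a, b, c = signals[reg1], signals[reg2], signals[reg3]
            current.append(STYLES[reg4](a, b, c))
        previous = current
    return previous
```

For the 4-bit adder of the text, `grid` would have 8 rows and 8 columns and `primary_inputs` would hold 8 bits, so a module in columns 2 to 8 chooses among 16 signals.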
3 The Proposed System
In this section, we introduce the proposed system.
3.1 The Experiment Platform
The platform of our research is an ADM-XPL card with an embedded XC2VP20 device, installed in a PC via PCI; the XC2VP20 is able to work once the driver is installed correctly. The ADM-XPL is an advanced PCI Mezzanine Card (PMC) supporting Xilinx Virtex-II Pro (V2PRO) devices. An on-board high-speed multiplexed address and data local bus connects the bridge to the target FPGA. The memory resources provided on board include DDR SDRAM, pipelined ZBT and Flash, all of which are optimized for direct use by the FPGA using IP and toolkits provided by Xilinx. More detailed specifications can be found in [10,11,12].
Figure 3 illustrates the proposed system. The FPGA Space is allocated from the usable memories or registers in the FPGA, and the arrows indicate the direction of data flow. The GA process on the PC is written in C++, while the modules in the XC2VP20 are described in VHDL; a Counter is needed to generate the primary inputs of the MMA. During the GA process, the chromosomes are written to the FPGA Space one by one. Once a chromosome has been written, the MMA configures the circuit according to that chromosome. The Counter then generates the primary inputs, from “000…0” (all “0”) to “111…1” (all “1”), and the configured circuit produces the outputs corresponding to every input combination. When the Counter reaches the all-“1” state, the evaluation stops. The FPGA Space then returns the Truth Table of the current circuit to the GA process, where it is compared with the target Truth Table to evaluate the fitness of the current chromosome. The next chromosome then goes through the same stages to obtain its fitness, and so on until the GA process terminates.
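The role of the Counter, sweeping all input combinations and collecting one output row per combination, can be illustrated in software. This is a sketch under stated assumptions: `truth_table` and `adder4` are hypothetical names, the most-significant-bit-first ordering of inputs and outputs is an assumption, and `adder4` merely stands in for a configured circuit or for the target behaviour.

```python
def truth_table(circuit, num_inputs):
    """Sweep a counter from all-0 to all-1 and record the circuit outputs."""
    table = []
    for count in range(2 ** num_inputs):
        # Assumed bit order: bits[0] is the most significant counter bit.
        bits = [(count >> (num_inputs - 1 - i)) & 1 for i in range(num_inputs)]
        table.append(tuple(circuit(bits)))
    return table


def adder4(bits):
    """Reference 4-bit adder: 8 input bits (a3..a0, b3..b0) -> 5 sum bits."""
    a = int("".join(str(bit) for bit in bits[:4]), 2)
    b = int("".join(str(bit) for bit in bits[4:]), 2)
    total = a + b
    return [(total >> i) & 1 for i in range(4, -1, -1)]


target = truth_table(adder4, 8)   # 256 rows of 5 bits each
```

With 8 inputs the sweep produces 256 rows, so the truth table of a 4-bit adder contains 256 × 5 = 1280 bits.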
Fig. 3. Intrinsic Evolution Framework
In each such cycle, the GA process needs to call the relevant API functions to write data to and read data from the FPGA Space, so the C++ compiler must be configured to use the API header files and libraries, following the approach introduced in [10].
3.2 VHDL Design
In the proposed system, designing the virtual circuit in VHDL is preliminary work. As shown in Figure 3, the virtual circuit contains three modules: the first, marked “FPGA Space”, is a region allocated from the usable memory resources mentioned in the previous section; the second is the MMA; and the third is a Counter. The FPGA Space module receives chromosomes and returns the Truth Table, acting as the interface between the GA process and the MMA: the GA process writes every chromosome to the FPGA Space module, and the FPGA Space module returns the Truth Table to the GA process. The MMA module reads data from the FPGA Space module and maps them to a real functional circuit. The outputs of the Counter module are the inputs of the MMA module. The Counter starts generating the input combinations of the MMA module as soon as the circuit has been mapped; each time an input combination is generated, the corresponding output combination is written to an appointed location of the FPGA Space module. When all the input combinations have been processed, these locations constitute the Truth Table of the circuit. Once this preliminary work is finished, the design is universal for evolving different target circuits, or needs only minor modifications, such as enlarging the MMA or changing its routing, to evolve more complex target circuits.
3.3 Encoding and Decoding
A pivotal problem in the EHW domain is how to encode circuits as chromosomes, which has a direct influence on both the course of evolution and the final outcome. A good coding should satisfy at least three conditions: first, its length should stay within the range that the GA can handle; second, it should be convenient to decode into a practical circuit; third, the evolution of the coding should reflect the evolution of both functionality and routing. The methodology proposed in this paper makes it easy to obtain such a coding. As mentioned in Section 2.2, every multiplexer module has four control ports, each connected to its own register, so a multiplexer module can be described by four integers; the four integers of all multiplexer modules are then joined together row by row to form a chromosome. In the experiment, we evolved a 4-bit adder with an 8×8 MMA. The modules in the first column had 8 inputs, numbered from ‘0’ to ‘7’, while those in the other 7 columns had 16 inputs, numbered from ‘0’ to ‘15’, where the first 8 numbers represented the 8 primary inputs and the last 8 numbers represented the 8 outputs of the previous column. For example, see Figure 4, if a multiplexer module is described by four integers as follows:
7 12 0 6
Fig. 4. An example
then we decode this module as follows. The number “7” means that the last primary input is the first input of the module; the number “12” means that the output of the module in the 5th row of the previous column is the second input; the number “0” means that the first primary input is the third input; and the number “6” selects function style 6, which according to Table 1 is “a ⊕ b”. Since this function has only two inputs, the third input is ignored, as explained in Section 2.2, so the module has the functionality of “XOR”. The 8×8 MMA has 64 such modules; joining these 64×4 integers together gives a chromosome of length 256 in our experiment.
3.4 Fitness Evaluation
The fitness is evaluated by the GA. Once the GA process obtains the Truth Table of the current circuit from the appointed locations of the FPGA, it compares this table with the Truth Table of the target circuit bit by bit and counts the matching bits; the percentage of matching bits is the fitness of the current circuit. Fitness evaluation differs in minor details between GA processes; see [8,9] for a more detailed description.
3.5 Experiment Results
In our experiment, the maximum number of GA generations was 2,000,000 and the program was executed 10 times. Nine runs ran into the “fitness stalling effect” [9], and the fitness reached 100% in only one run. The low success rate is mainly attributed to the following reasons. Firstly, an 8×8 MMA is theoretically sufficient to evolve a 4-bit adder, but in practice it is difficult to evolve a fully functional 4-bit adder within a finite number of generations at this size, so increasing the size of the MMA should help to improve the success rate. Secondly, the routing between multiplexer modules in our experiment was rigid: every module could connect only to the modules of the previous column or to the primary inputs, and each module has only one output, so many modules were not fully utilized; a more flexible routing would increase the likelihood of successful evolution within a limited size. Thirdly, the function styles defined in the proposed system are too simple: as Table 1 shows, every function style has only one output, and adding multi-output or more complex function styles could improve the efficiency of evolution. In addition, the GA operators in our experiment were not efficient enough, because we made no special effort to find a better evolutionary algorithm; this may be another reason for the low success rate.
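The fitness measure of Section 3.4, the percentage of truth-table bits that match the target, can be sketched as follows. The function name `fitness` and the representation of a truth table as a list of equal-length bit tuples are assumptions made for illustration; the original GA process is written in C++ and is not reproduced here.

```python
def fitness(observed, target):
    """Percentage of truth-table bits on which `observed` matches `target`.

    Both tables are sequences of rows, one row of output bits per input
    combination, and must have the same shape.
    """
    if len(observed) != len(target):
        raise ValueError("truth tables must have the same number of rows")
    total = 0
    matches = 0
    for observed_row, target_row in zip(observed, target):
        if len(observed_row) != len(target_row):
            raise ValueError("rows must have the same number of output bits")
        for observed_bit, target_bit in zip(observed_row, target_row):
            total += 1
            if observed_bit == target_bit:
                matches += 1
    return 100.0 * matches / total
```

For example, `fitness([(0, 1), (1, 1)], [(0, 1), (1, 0)])` compares four bits, three of which agree, and therefore evaluates to 75.0; a fully evolved circuit scores 100.0.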
How to improve the success rate of evolution will be the next step of our research. The main point of this paper is to propose a model for intrinsic evolution, and the experimental result shows that it is a feasible methodology. Table 2 shows the fully evolved chromosome. Every four integers represent a single multiplexer module; the shaded modules are not used in the final circuit. There are 25 shaded modules, so the other 39 modules constitute the final circuit.
Table 2. A fully evolved chromosome
5 9 3 11 8 7 7 10 1 13 6 3 1 0 9 7 3 10 0 6 4 1 8 15
7 5 6 8 5 2 3 0
12 5 8 3 6 2 10 4 2 6 14 3 4 11 6 9 6 4 10 2 13 8 9 0 10 5 7 6 4 14 3 5
6 8 12 4 9 7 2 7 6 2 11 2 9 6 10 1 3 7 15 4 10 8 12 4 11 3 9 0 1 9 7 2 11 13 5 2 10 3 2 0 5 1 11 3 13 11 10 0 13 10 9 2 5 7 13 6 9 3 5 1 12 8 2 3 9 12 5 3 1 5 13 2 5 9 10 7 11 9 1 0 14 12 8 0 9 10 1 5 6 10 5 10 5 1 13 3
5 1 8 4 0 4 8 3 15 11 7 4 10 13 9 3 8 14 12 8 4 0 15 7 12 7 9 0 1 5 10 7
8 10 2 4 9 11 7 2 2 6 13 6 15 12 7 4 13 2 4 1 14 5 6 0 12 15 11 3 0 4 11 2
9 15 6 12 8 3 14 11 0 13 10 7 7 3 0 15 4 1 2 9 10 4 13 0
3 6 4 6 6 5 8 2
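The decoding rule of Section 3.3 can be sketched in a few lines. The helper names (`split_modules`, `describe_source`, `decode_module`) and the zero-based row index in the result are hypothetical choices for this illustration; the example decodes the single module (7, 12, 0, 6) discussed in the text, not the chromosome of Table 2.

```python
STYLE_NAMES = [
    "a", "!a", "a AND b", "a OR b", "!(a AND b)", "!(a OR b)",
    "a XOR b", "!(a XOR b)", "a AND !c OR b AND c", "!(a AND !c OR b AND c)",
]


def split_modules(chromosome, genes_per_module=4):
    """Cut a flat chromosome into one tuple of four integers per module."""
    if len(chromosome) % genes_per_module != 0:
        raise ValueError("chromosome length must be a multiple of 4")
    return [tuple(chromosome[i:i + genes_per_module])
            for i in range(0, len(chromosome), genes_per_module)]


def describe_source(index, num_primary=8):
    """Map an input number to a primary input or a previous-column row."""
    if index < num_primary:
        return ("primary", index)
    return ("previous_column", index - num_primary)   # zero-based row


def decode_module(genes, num_primary=8):
    """Decode (reg1, reg2, reg3, reg4) into input sources and a function."""
    reg1, reg2, reg3, reg4 = genes
    return {
        "inputs": [describe_source(r, num_primary) for r in (reg1, reg2, reg3)],
        "function": STYLE_NAMES[reg4],
    }


example = decode_module((7, 12, 0, 6))
```

Here `example["inputs"]` names the last primary input, row index 4 of the previous column (the 5th row) and the first primary input, and `example["function"]` is the XOR style, in line with the worked example of Section 3.3.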
Decoding this chromosome into a practical circuit with the method described in Section 3.3 yields the circuit shown in Figure 5. The 25 unused modules are not included in the circuit diagram, while the other 39 modules are. However, only 32 modules are visible in the diagram, because 7 modules have function style 0 (see Table 1); these modules act as buffers, so we draw them as wires for clarity.
Fig. 5. The corresponding circuit diagram of the fully evolved chromosome
4 Conclusion
The main purpose of this paper is to propose an intrinsic evolution methodology based on the MMA. As an example, we presented the proposed system by evolving a 4-bit adder. The 4-bit adder has of course been evolved successfully by other methodologies; our aim is only to show that the methodology is feasible and that, compared with some previous methodologies, it has advantages in circuit encoding and decoding, flexibility and portability. In this experiment, we did not take any special measures to control the evolution in order to improve the efficiency and success rate of the GA process; this is not the focus of this paper, but it will be the next step of our research.
Acknowledgement
The first author would like to thank the State Key Laboratory of Software Engineering, Wuhan University, for supporting our research work in this field, as well as the National Natural Science Foundation under Grant No. 60442001 and the National High-Tech Research and Development Program of China (863 Program) under Grant No. 2002AA1Z1490. The authors acknowledge the anonymous referees.
References
1. Upegui, A., Sanchez, E.: Evolving Hardware by Dynamically Reconfiguring Xilinx FPGAs. In: Moreno, J.M., Madrenas, J., Cosp, J. (eds.) ICES 2005. LNCS, vol. 3637, pp. 56–65. Springer, Heidelberg (2005)
2. Guccione, S., Levi, D., Sundararajan, P.: JBits: Java based interface for reconfigurable computing. Xilinx Inc. (1999)
3. Hollingworth, G., Smith, S., Tyrrell, A.: Safe Intrinsic Evolution of Virtex Devices. In: The Second NASA/DoD Workshop on Evolvable Hardware, pp. 195–202 (2000)
4. Hollingworth, G., Smith, S., Tyrrell, A.: The Intrinsic Evolution of Virtex Devices Through Internet Reconfigurable Logic. In: Miller, J.F., Thompson, A., Thompson, P., Fogarty, T.C. (eds.) ICES 2000. LNCS, vol. 1801, p. 72. Springer, Heidelberg (2000)
5. Haddow, P.C., Tufte, G.: Bridging the Genotype-Phenotype Mapping for Digital FPGAs, pp. 109–115. IEEE Computer Society Press, Los Alamitos (2000)
6. Sekanina, L.: On Dependability of FPGA-Based Evolvable Hardware Systems That Utilize Virtual Reconfigurable Circuits, pp. 221–228. ACM, New York (2006)
7. Glette, K., Torresen, J.: A Flexible On-Chip Evolution System Implemented on a Xilinx Virtex-II Pro Device. In: Moreno, J.M., Madrenas, J., Cosp, J. (eds.) ICES 2005. LNCS, vol. 3637, pp. 66–75. Springer, Heidelberg (2005)
8. Kalganova, T.: Bidirectional Incremental Evolution in Extrinsic Evolvable Hardware. In: Proc. of the Second NASA/DoD Workshop on Evolvable Hardware (EH'00), pp. 65–74 (2000)
9. Stomeo, E., Kalganova, T., Lambert, C.: Generalized Disjunction Decomposition for Evolvable Hardware. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 36(5), 1024–1043 (2006)
10. Alpha Data Parallel Systems Ltd.: ADM-XRC SDK 4.7.0 User Guide (Win32), Version 4.7.0.1 (2006)
11. Alpha Data Parallel Systems Ltd.: ADM-XRC-PRO-Lite (ADM-XPL) Hardware Manual, Version 1.8 (2005)
12. Alpha Data Parallel Systems Ltd.: ADC-PMC2 User Manual, Version 1.1 (2006)
Estimating Array Connectivity and Applying Multi-output Node Structure in Evolutionary Design of Digital Circuits
Jie Li and Shitan Huang
Xi’an Microelectronics Technology Institute, 710071 Xi’an, Shaanxi, China
[email protected]
Abstract. Array connectivity is an important feature for measuring the efficiency of evolution. Generally, the connectivity is estimated by array geometry and level-back separately. In this paper, a connectivity model based on the number of paths between the first node and the last node is established. With the help of multinomial coefficient expansion, a formula for estimating array connectivity is presented. With this technique, the array geometry and level-back are taken into account simultaneously, and comparing the connectivity of arrays with different geometries and level-backs becomes possible. Inspired by this approach, a multi-output node structure is developed. This structure increases the connectivity without increasing the array size. A multiobjective fitness function based on the power consumption and critical delay of circuits is proposed, which enables evolved circuits to meet the requirements of applications. Experimental results show that the proposed approach offers flexibility in constructing circuits and thus improves the efficiency of the evolutionary design of circuits.
Keywords: Evolvable Hardware, array connectivity, multinomial coefficient, multi-output node structure.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 45–56, 2007. © Springer-Verlag Berlin Heidelberg 2007
1 Introduction
Graphical rectangular arrays are commonly used in evolutionary design to map a digital circuit into a genotype [1][2][3][4][5]. In this representation, each input of a node can be connected to one output of the nodes in the previous columns. Feedback and connections within the same column are not allowed. The array is characterized by its geometry and level-back [6]. The array geometry refers to the number of columns and rows of an array [7]. The level-back indicates the maximum number of columns preceding the current node whose outputs can be connected to the inputs of that node [6]. The values of the array geometry and level-back determine the connectivity of an array: the larger the array and the larger the level-back, the greater the array connectivity. As the array connectivity increases, the algorithm gains more flexibility to construct circuits that match the functional requirements, which increases the evolutionary efficiency. Miller studied the effects of array geometry on efficiency [8]. He argued that a smaller array may tend to produce higher-fitness solutions if it is sufficient to realize the circuit. Kalganova applied different array geometries and level-backs to the evolution of multiple-valued combinational circuits [9] and investigated their influence on the average fitness and success ratio. The conclusion is that carefully selecting the geometry and level-back can improve the EA performance. She also pointed out that the number of rows has less effect on the evolutionary process. In [4], Kalganova aimed to improve the solution quality by dynamically changing the array geometry. Other researchers tried to obtain the maximum connectivity by employing a maximum level-back equal to the number of columns [3][6][10]. These published works evaluated the array connectivity by geometry and level-back separately. Connectivity can be compared conveniently between arrays with the same level-back and different geometries, or between arrays with the same geometry and different level-backs. However, for arrays that differ in both geometry and level-back, such a comparison is not possible, and it therefore cannot help to determine the best choice of these parameters efficiently. In this paper, a model for estimating connectivity based on the number of paths between the first and the last node of an array is established. Combined with the expansion of multinomial coefficients, a formula for calculating the array connectivity is presented. The advantages of this technique are that the level-back and geometry are taken into account simultaneously and that arrays of different level-backs and geometries can be compared. Based on this model, a multi-output node structure is developed. The structure enables a node to act as a functional cell as well as a routing cell.
A fitness function based on power consumption and critical delay is also given, which steers the evolved circuits toward the requirements of applications. This article is organized as follows: Section 2 presents the model and the connectivity formula, and Section 3 describes the node structure. Section 4 introduces the fitness function and Section 5 the evolutionary algorithm. The experimental results are reported in Section 6. Conclusions are given in Section 7.
2 Estimating Array Connectivity

Suppose there is an n×m array (n rows and m columns), and that each node has a 1-input, 1-output structure. The array connectivity can be summarized as the total number of paths that connect the nodes in the first column to those in the last column. Since every node in the last column is equally able to connect to the nodes in the first column, the definition can be simplified as follows: given an n×m array with a level-back of lb, the total number of paths that connect the input of the last node (No. n×m – 1) to the output of the first node (No. 0) is defined as the array connectivity, denoted by C(n, m, lb).

Any column in an array, except the first and the last, is in one of two opposite states: a path passes through the column, or no path passes through it. Let 1 represent the state in which a column is passed by a path and 0 the state in which a column is skipped. The array connection state can then be encoded as a binary string. In an n×m array, there are n possible entries for a path to enter the next column. Fig. 1 shows the different connection states of a 3×4 array with lb = 3, and the path number of each state.
Fig. 1. Array connection states and path numbers. (a) State coding: 1111, path number: 9; (b) State coding: 1011, path number: 3; (c) State coding: 1101, path number: 3; (d) State coding: 1001, path number: 1.
As shown in Fig. 1, the number of array connection states depends on the level-back and the number of columns. When lb = 1, a node can only connect to the nodes in its neighboring column, as in Fig. 1(a). For lb = 2, a node can connect both to the nodes in the neighboring column and to the nodes in the column at a distance of 2. The array states then include the situations shown in Fig. 1 (a), (b) and (c). When lb increases to 3, the state shown in Fig. 1 (d) is also included. The number of array states therefore grows as the number of columns and the level-back increase. The number of array states is independent of the number of rows, whereas the specific number of routing paths depends on the number of rows.

Following this analysis, we investigated the connection states of several arrays with different geometries and level-backs in order to find the relationships between them. The statistical results are listed in table 1, where m is the number of columns, lb is the level-back, and Si denotes the number of states that have i columns skipped by the routing paths; in other words, Si is the number of binary strings containing i zeros that are admissible under the given level-back. When lb = 1, there is only one connection state, S0. The particular number of paths is determined by the number of rows and the number of outputs of a node.

By comparing the data in table 1 with the expanded coefficients of the binomial, trinomial and quadrinomial, we can examine the relationships between them. Fig. 2 shows the comparison results. Similar results can be obtained as m and lb increase. An expression for calculating the connectivity of an n×m array with lb ≥ 2 can therefore be derived from Fig. 2 as follows:

$$C(n, m, lb) = \sum_{i=0}^{1} \binom{m-2}{i}^{(lb)} n^{m-2-i} + \sum_{i=2}^{\lceil (m-2)/2 \rceil + lb - 2} \binom{m-i-1}{i}^{(lb)} n^{m-2-i} \qquad (1)$$

where $\binom{m-2}{i}^{(lb)}$ represents the coefficients of the lb-nomial, and $\lceil (m-2)/2 \rceil$ denotes the least integer greater than or equal to $(m-2)/2$.
Table 1. Statistical results of array connection states

lb = 2
m     S0   S1   S2   S3   S4   S5
3     1    1
4     1    2
5     1    3    1
6     1    4    3
7     1    5    6    1
8     1    6    10   4
9     1    7    15   10   1
10    1    8    21   20   5
11    1    9    28   35   15   1

lb = 3
m     S0   S1   S2   S3   S4   S5   S6
3     1    1
4     1    2    1
5     1    3    3
6     1    4    6    2
7     1    5    10   7
8     1    6    15   16   6
9     1    7    21   30   19   3
10    1    8    28   50   45   16   1
11    1    9    36   77   90   51   10

lb = 4
m     S0   S1   S2   S3   S4   S5   S6   S7
3     1    1
4     1    2    1
5     1    3    3    1
6     1    4    6    4
7     1    5    10   10   3
8     1    6    15   20   12   2
9     1    7    21   35   31   12   1
10    1    8    28   56   65   40   10
11    1    9    36   84   120  101  44   6
As mentioned previously, the node is predefined as a 1-input, 1-output structure, so there are n possible entries in each column except the first and the last. Now consider a multi-output node structure in which each node has k outputs. The number of possible entries of a column then becomes kn. As expression (1) shows, this greatly enhances the array connectivity without increasing the size of the array, and thus improves the performance of the EA. This analysis leads to the final formula for estimating array connectivity:
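$$C(n, m, lb) = \begin{cases} 0, & m = 1 \\ 1, & m = 2 \\ (kn)^{m-2}, & lb = 1;\ m > 2 \\ \displaystyle\sum_{i=0}^{1} \binom{m-2}{i}^{(lb)} (kn)^{m-2-i} + \sum_{i=2}^{\lceil (m-2)/2 \rceil + lb - 2} \binom{m-i-1}{i}^{(lb)} (kn)^{m-2-i}, & lb = 2, 3, \ldots, (m-1);\ m > 2 \end{cases} \qquad (2)$$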
where k ≥ 1. This formula considers the effects of array geometry together with those of level-back, so connectivity can be compared between arrays of different geometries and level-backs. As the formula shows, the number of columns has a greater effect on connectivity than the number of rows. When the level-back takes its maximum value, the array connectivity reaches its peak. These conclusions agree with those of the published works.
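The path-counting model behind the connectivity estimate can be checked by brute force. The following is a minimal sketch rather than the authors' implementation: it enumerates the binary column states directly instead of using lb-nomial coefficients, and it assumes that a level-back of lb allows at most lb − 1 consecutive skipped columns. The function names `state_counts` and `connectivity` are illustrative.

```python
from itertools import product


def state_counts(m, lb):
    """Return {i: S_i}: number of column states with i skipped columns.

    A state is a binary string over the m - 2 inner columns
    (1 = column passed by the path, 0 = column skipped).
    Assumption: lb consecutive skipped columns would exceed the level-back.
    """
    inner = max(m - 2, 0)
    counts = {}
    for bits in product("01", repeat=inner):
        state = "".join(bits)
        if "0" * lb in state:
            continue  # this state needs a connection longer than lb columns
        skipped = state.count("0")
        counts[skipped] = counts.get(skipped, 0) + 1
    return counts


def connectivity(n, m, lb, k=1):
    """Number of paths from node 0 to the last node of an n x m array.

    Each column that a path passes through offers k * n possible entries.
    """
    if m == 1:
        return 0
    return sum(s_i * (k * n) ** (m - 2 - i)
               for i, s_i in state_counts(m, lb).items())


if __name__ == "__main__":
    # The four states of Fig. 1 (3x4 array, lb = 3) contribute 9 + 3 + 3 + 1 paths.
    print(connectivity(3, 4, 3))
    # A 3x3 array with lb = 1: 1-output nodes versus 3-output nodes.
    print(connectivity(3, 3, 1), connectivity(3, 3, 1, k=3))
```

Because the enumeration grows as 2^(m−2), this sketch is only practical for small arrays; its purpose is to make the meaning of Si and of the (kn) factor concrete.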
Fig. 2. Comparison results. (a) Binomial coefficients vs. lb = 2; (b) Trinomial coefficients vs. lb = 3; (c) Quadrinomial coefficients vs. lb = 4.
3 Representation of Multi-output Node Structure

Multi-output node structures have been introduced to satisfy the requirements of multi-output logic functions [11][12][13]. In those studies, all the outputs of a node are used as outputs of the defined function, and the number of outputs of a node varies with the function the node implements. In contrast, we propose a multi-output structure with a fixed number of outputs. In this work, the number of outputs is 3, which is slightly larger than the number of outputs of the most complex function (e.g. the 1-bit full adder). Each redundant output of a node is connected at random to one of the inputs of the node. In this way, a node can act not only as a functional node that implements the required logic function, but also as a switching node that passes data directly from its inputs to its outputs.

The multi-output node structure is represented hierarchically, as shown in Fig. 3 (a). The external connection describes how the inputs are connected to the outside world. The internal connection describes the implemented function, the function outputs and the connections of the redundant outputs. Fig. 3 (b) shows a diagram of a 3-input 3-output structure and the detailed representation of the node, in which the implemented function is a 1-bit full adder. The first and second outputs are the carry and the sum of the adder, respectively. The third output is randomly connected to the second input of the node.
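The dual role of a node, as functional cell and as routing cell, can be illustrated with a small sketch. This is an assumed data model, not the authors' genotype encoding: the class `Node`, its field `redundant_routes` and the helper `full_adder` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Node:
    """A node whose unused (redundant) outputs are wired to its own inputs."""
    function: Callable[..., Tuple[int, ...]]  # implemented logic function
    redundant_routes: Tuple[int, ...]         # input index feeding each redundant output

    def evaluate(self, inputs: Sequence[int]) -> Tuple[int, ...]:
        functional = tuple(self.function(*inputs))
        routed = tuple(inputs[i] for i in self.redundant_routes)
        return functional + routed


def full_adder(a: int, b: int, c: int) -> Tuple[int, int]:
    total = a + b + c
    return (total >> 1, total & 1)  # (carry, sum)


# Functional node: outputs are (carry, sum, copy of the second input).
fa_node = Node(full_adder, redundant_routes=(1,))

# Pure routing node: no function outputs, all three outputs pass inputs through.
wire_node = Node(lambda a, b, c: (), redundant_routes=(0, 1, 2))

if __name__ == "__main__":
    print(fa_node.evaluate((1, 0, 1)))
    print(wire_node.evaluate((1, 0, 1)))
```

In this sketch the choice of which input feeds a redundant output is fixed by hand; in the text it is chosen at random.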
Fig. 3. Hierarchical representation of node structure. (a) Hierarchical representation; (b) Example of a 1-bit adder node.
4 Multi-objective Fitness Function

Generally, fitness evaluation for combinational circuits is separated into two stages: the first evolves a fully functional circuit; the second searches for an optimal solution starting from that result. In this work, power consumption and critical delay are used to evaluate circuits in the second stage. The power consumption is the total power consumed by all the MOSFETs in the circuit. The critical delay is the propagation delay of the critical path. The new fitness function is formulated as follows:

$$\mathit{fitness} = \begin{cases} \mathit{fit\_number}, & \mathit{fit\_number} < \mathit{max\_fit} \\ \mathit{max\_fit} + \alpha / (\mathit{power} \times \mathit{max\_delay}), & \mathit{fit\_number} = \mathit{max\_fit} \end{cases} \qquad (3)$$

where fit_number is the number of bits that match the truth table; max_fit is the total number of output bits of the truth table; power denotes the power consumption and max_delay the critical delay; α is a self-defined regulator for normalization.

Suppose that the power consumption and delay of a p-type MOSFET are the same as those of an n-type MOSFET. The power consumption and critical delay of each building block can then be determined from the CMOS implementations of primitive gates [15]. The results are shown in table 4, where HA and FA refer to the 1-bit half adder and the 1-bit full adder, respectively. The data for HA and FA are obtained from the evolved circuits.

Table 4. Power consumption and critical delay of building blocks

No.   Building block   Critical delay   Power consumption
1     WIRE             0                0
2     NOT              1                2
3     AND              2                6
4     NAND             1                4
5     OR               2                6
6     NOR              1                4
7     XOR              2                8
8     NXOR             2                8
9     MUX              2                6
10    HA               2                14
11    FA               4                22
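The second-stage evaluation can be sketched as follows, using the per-block values of table 4 and the fitness function (3). This is an illustrative sketch under stated assumptions: the netlist format (a dict mapping each signal to a gate and its input signals), the function names, the default α = 1 and the gate-level decomposition of the human-designed full adder (with the x0 ⊕ x1 term shared between sum and carry) are all assumptions, not details given in the text.

```python
from functools import lru_cache

# (critical delay, power consumption) of each building block, from table 4
BLOCKS = {
    "WIRE": (0, 0), "NOT": (1, 2), "AND": (2, 6), "NAND": (1, 4),
    "OR": (2, 6), "NOR": (1, 4), "XOR": (2, 8), "NXOR": (2, 8),
    "MUX": (2, 6), "HA": (2, 14), "FA": (4, 22),
}


def power(circuit):
    """Total power: sum over all building blocks in the netlist."""
    return sum(BLOCKS[gate][1] for gate, _ in circuit.values())


def critical_delay(circuit, outputs):
    """Longest accumulated block delay from any primary input to any output."""
    @lru_cache(maxsize=None)
    def arrival(signal):
        if signal not in circuit:  # primary input
            return 0
        gate, inputs = circuit[signal]
        return BLOCKS[gate][0] + max(arrival(s) for s in inputs)

    return max(arrival(o) for o in outputs)


def fitness(fit_number, max_fit, pwr, max_delay, alpha=1.0):
    """Fitness function (3): functionality first, then power and delay."""
    if fit_number < max_fit:
        return fit_number
    return max_fit + alpha / (pwr * max_delay)


# Assumed netlist for c' = x0.x1 + c.(x0 xor x1), s = c xor x0 xor x1
FULL_ADDER = {
    "p": ("XOR", ("x0", "x1")),
    "s": ("XOR", ("p", "c")),
    "g": ("AND", ("x0", "x1")),
    "t": ("AND", ("c", "p")),
    "cout": ("OR", ("g", "t")),
}

if __name__ == "__main__":
    pwr = power(FULL_ADDER)
    delay = critical_delay(FULL_ADDER, ("s", "cout"))
    print(pwr, delay, fitness(16, 16, pwr, delay))
```

A fully functional circuit always scores above max_fit, so any reduction in power × delay raises its fitness, while a non-functional circuit never exceeds max_fit − 1.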
5 Evolutionary Algorithm

In this paper, Cartesian Genetic Programming (CGP) is applied as the evolutionary algorithm for evolving circuits. CGP was developed by Miller and Thomson for the automatic evolution of digital circuits [2]. Typically, one-point mutation is adopted in CGP. To obtain higher diversity, an extended operator, inverted mutation, is employed in this work. The inverted mutation randomly selects two points within a chromosome and exchanges, one pair at a time, the internal connections of the nodes that lie symmetrically about the center of these two points. This operation can generate a new gene sequence while partly preserving the existing structure. From this point of view, the inverted operation has some features of crossover. However, inversion acts on a single chromosome only, and no information is exchanged between chromosomes, so it remains a form of mutation. The combination of standard CGP, the 3-input 3-output node structure, the inverted mutation and the new fitness function is called modified CGP (MCGP), and is used in the following experiments.
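One possible reading of the inverted mutation is a reversal of the internal-connection genes between the two selected points. The sketch below follows that reading; it assumes the chromosome's internal connections are stored as a list with one entry per node, and the function name and the optional `points` argument are hypothetical.

```python
import random


def inverted_mutation(internal, points=None, rng=random):
    """Reverse the internal-connection genes between two points.

    internal: list with one internal-connection gene per node (assumed layout).
    points:   optional (i, j) pair; chosen at random when omitted.
    Swapping the nodes symmetric about the segment center, pair by pair,
    is equivalent to reversing the segment.
    """
    child = list(internal)
    if points is None:
        points = rng.sample(range(len(child)), 2)
    i, j = sorted(points)
    child[i:j + 1] = reversed(child[i:j + 1])
    return child


if __name__ == "__main__":
    genes = ["a", "b", "c", "d", "e"]
    print(inverted_mutation(genes, points=(1, 3)))
```

Genes outside the selected segment are untouched, which is how the operator keeps part of the existing structure.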
6 Experimental Results and Analysis

In the following experiments, the point mutation rate was set so that at least 1 and at most 4 genes were modified. The rate of the inverted mutation was fixed at 0.3. The level-back was set to 1.

6.1 1-Bit Full Adder

The results produced in this experiment and those reported in [6] are listed in table 5. One of the optimal circuits found in this experiment is shown in Fig. 4 (a). The best solution obtained in [6] and a human-designed circuit are shown in Fig. 4 (b) and (c). The performance comparison is given in table 6, in which Max_delay denotes the critical delay and Power the power consumption. Note that if the 1-output structure is applied, the 3×3 array in table 5 has a connectivity of only 3. With the 3-output structure, the connectivity becomes three times as large. Table 6 shows that the circuits generated by MCGP and CGP perform equivalently. However, Fig. 4 (a) and (b) show that in the MCGP circuit the sum and the carry were generated simultaneously, while in the CGP circuit they appeared at different times.

Table 5. Evolutionary results of 1-bit full adder in 100 runs

Algorithm            MCGP     CGP
Maximum generation   10,000   10,000
Level-back           1        2
Array geometry       3×3      1×3
Connectivity         9        2
Successful cases     100      84

Fig. 4. Circuits of 1-bit full adder. (a) MCGP; (b) CGP; (c) Human designed.

Table 6. Performance comparison of 1-bit full adder's circuits

Design         Max_delay   Power   Outputs
MCGP           4           22      $c' = (c \oplus x_1) \cdot x_0 + \overline{(c \oplus x_1)} \cdot c$; $s = (c \oplus x_1) \cdot \overline{x_0} + \overline{(c \oplus x_1)} \cdot x_0$
CGP            4           22      $c' = (x_0 \oplus x_1) \cdot c + \overline{(x_0 \oplus x_1)} \cdot x_1$; $s = c \oplus x_0 \oplus x_1$
Human design   6           34      $c' = x_0 \cdot x_1 + c \cdot (x_0 \oplus x_1)$; $s = c \oplus x_0 \oplus x_1$

6.2 2-Bit Adder

The results produced in this experiment and those reported in [6] are listed in table 7. The optimal circuit found in this experiment is shown in Fig. 5 (a). The best solution obtained in [6] and one possible carry look-ahead circuit are shown in Fig. 5 (b) and (c). The carry look-ahead circuit is taken from [15]. The performance comparison is given in table 8. MCGP took advantage of functional building blocks and achieved a relatively high success rate, even though its array connectivity was smaller than that of CGP.

Table 7. Evolution results of 2-bit adder in 100 runs

Algorithm            MCGP     CGP
Maximum generation   50,000   50,000
Level-back           1        5
Array geometry       3×3      1×6
Connectivity         9        16
Successful cases     98       62
Fig. 5. Circuits of 2-bit adder. (a) MCGP; (b) CGP; (c) Carry look-ahead.

Table 8. Performance comparison of 2-bit adder's circuits

Design             Max_delay   Power   Outputs
MCGP               6           44      $c' = ((c \oplus x_1) \cdot x_0 + \overline{(c \oplus x_1)} \cdot c) \cdot (x_1 \oplus y_1) + x_1 \cdot \overline{(x_1 \oplus y_1)}$; $s_0 = c \oplus x_0 \oplus y_0$; $s_1 = ((c \oplus x_1) \cdot x_0 + \overline{(c \oplus x_1)} \cdot c) \oplus x_1 \oplus y_1$
CGP                6           44      $c' = y_1 \cdot \overline{(x_1 \oplus y_1)} + (x_0 \cdot \overline{(x_0 \oplus y_0)} + c \cdot (x_0 \oplus y_0)) \cdot (x_1 \oplus y_1)$; $s_0 = c \oplus x_0 \oplus y_0$; $s_1 = (x_0 \cdot \overline{(x_0 \oplus y_0)} + c \cdot (x_0 \oplus y_0)) \oplus x_1 \oplus y_1$
Carry look-ahead   8           78      $c' = x_1 \cdot y_1 + (x_1 \oplus y_1) \cdot (x_0 \cdot y_0 + (x_0 \oplus y_0) \cdot c)$; $s_0 = c \oplus x_0 \oplus y_0$; $s_1 = x_1 \oplus y_1 \oplus (x_0 \cdot y_0 + (x_0 \oplus y_0) \cdot c)$
6.3 2-Bit Multiplier

The best solution is shown in Fig. 6, and the performance comparison is given in table 9. As shown in Fig. 6, the best solution used 7 gates, the number mentioned by Miller [6], and was equivalent to the most efficient circuit reported in [6]. Compared with the circuit produced in [6], the circuit evolved in this work had a significantly shorter delay owing to its use of primitive gates such as NAND and NOR.

Table 9. Performance comparison of 2-bit multiplier's circuits
Outputs MCGP
CGP Human design
Max_delay
Power
3
34
p2 = (x1. y1).( x0. y0) ; p3 = (x1.y1).(x0.y0).(x1.y0) . p 0 = x 0. y 0 ; p1 = ( x 0. y1) ⊕ ( x1. y 0) ;
7
48
p2 = x1. y1.( x0. y0) ; p 3 = x 0.x1. y 0. y1 ;
4
48
p 0 = x 0. y 0 ; p1 = x1.y0 ⊕ x0.y1 ;
p2 = x1.y1 + x0. y0 ; p3 = x1.y0 + x0.y1 . p 0 = x 0. y 0 ; p1 = ( x1. y 0 ) ⊕ ( x 0. y1) ;
Fig. 6. Evolved optimal circuit of 2-bit multiplier

6.4 4-Bit Adder

The array had a geometry of 5×5, and the maximum generation was limited to 1,000,000. Four successful cases were achieved in 20 runs. The most efficient circuit obtained in the experiments is shown in Fig. 7. The evolved circuit tended to use fine-grained building blocks (e.g. NAND, NXOR, MUX) rather than coarse-grained ones (e.g. half adder, full adder). Table 10 shows the performance comparison. The ripple-carry and carry look-ahead circuits follow those given in [15].

Table 10. Performance comparison of 4-bit adder's circuits

             MCGP   Ripple-carry   Carry look-ahead
Max_delay    8      16             8
Power        102    88             204

Fig. 7. Evolved optimal circuit of 4-bit adder

According to table 10, the optimal MCGP solution had a much shorter delay than the ripple-carry circuit, and its power consumption was only half that of the carry look-ahead circuit. On these results, the evolved circuit performs better overall than the ripple-carry and carry look-ahead circuits.
7 Conclusion

A formula for estimating array connectivity is developed in this work. According to the formula, four ways of increasing the connectivity can be identified: the first is to increase the number of array columns; the second is to use a large level-back; the third is to increase the number of rows. These three methods have been studied by other researchers, whose works consider them separately, while in this paper their effects on connectivity are considered as a whole. The presented technique can be of great benefit in determining the parameters for the evolutionary design of digital circuits. The fourth way is to employ a multi-output node structure, which increases the array connectivity and improves the efficiency of evolution without increasing the size of the array. This structure offers more flexibility in constructing circuits, in that it allows a node to act as a functional cell as well as a routing cell. This feature is also beneficial for fault tolerance, and further investigation will be carried out on this issue. Experiments on combinational circuits show that the proposed approach increases array connectivity and evolvability, and thus improves evolutionary efficiency.

Acknowledgments. The authors would like to thank the anonymous reviewers for their helpful comments.
References

1. Coello, C.A.C., Christiansen, A.D., Aguirre, A.H.: Use of Evolutionary Techniques to Automate the Design of Combinational Circuits. International Journal of Smart Engineering System Design 2(4), 299–314 (2000)
2. Miller, J.F., Thomson, P.: Cartesian Genetic Programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000)
3. Sekanina, L.: Design Methods for Polymorphic Digital Circuits. In: Proceeding of 8th IEEE Design and Diagnostic Circuits and Systems Workshop, Sopron, HU, UWH, pp. 145–150. IEEE Computer Society Press, Los Alamitos (2005)
4. Kalganova, T., Miller, J.F.: Evolving more Efficient Digital Circuits by Allowing Circuit Layout Evolution and Multi-objective Fitness. In: Stoica, A., et al. (eds.) Proceeding of the 1st NASA/DoD Workshop on Evolvable Hardware, pp. 54–65. IEEE Computer Society Press, Los Alamitos (1999)
5. Vassilev, V.K., Job, D., Miller, J.F.: Towards the Automatic Design of More Efficient Digital Circuits. In: Lohn, J., et al. (eds.) Proceeding of the 2nd NASA/DoD Workshop on Evolvable Hardware, pp. 151–160. IEEE Computer Society Press, Los Alamitos, CA (2000)
6. Miller, J.F., Job, D., Vassilev, V.K.: Principles in the Evolutionary Design of Digital Circuits - Part I. Genetic Programming and Evolvable Machines 1(1), 7–35 (2000)
7. Kalganova, T., Miller, J.F.: Circuits Layout Evolution: An Evolvable Hardware Approach. Colloquium on Evolutionary Hardware Systems. IEE Colloquium Digest (1999)
8. Miller, J.F., Thomson, P.: Aspects of Digital Evolution: Geometry and Learning. In: Sipper, M., Mange, D., Pérez-Uribe, A. (eds.) ICES 1998. LNCS, vol. 1478, pp. 25–35. Springer, Heidelberg (1998)
9. Kalganova, T., Miller, J.F., Fogarty, T.: Some Aspects of an Evolvable Hardware Approach for Multiple-valued Combinational Circuit Design. In: Sipper, M., Mange, D., Pérez-Uribe, A. (eds.) ICES 1998. LNCS, vol. 1478, pp. 78–89. Springer, Heidelberg (1998)
10. Sekanina, L., Vašíček, Z.: On the Practical Limits of the Evolutionary Digital Filter Design at the Gate Level. In: Rothlauf, F., Branke, J., Cagnoni, S., Costa, E., Cotta, C., Drechsler, R., Lutton, E., Machado, P., Moore, J.H., Romero, J., Smith, G.D., Squillero, G., Takagi, H. (eds.) EvoWorkshops 2006. LNCS, vol. 3907, pp. 344–355. Springer, Heidelberg (2006)
11. Miller, J.F., Thomson, P., Fogarty, T.C.: Genetic Algorithms and Evolution Strategies. In: Quagliarella, D., et al. (eds.) Engineering and Computer Science: Recent Advancements and Industrial Applications, Wiley, Chichester (1997)
12. Kalganova, T.: An Extrinsic Function-level Evolvable Hardware Approach. In: Proceeding of the 3rd European Conference on Genetic Programming (Euro 2000), Springer, London (2000)
13. Vassilev, V.K., Miller, J.F.: Scalability Problem of Digital Circuit Evolution Evolvability and Efficient Design. In: Proceeding of the 2nd NASA/DoD Workshop on Evolvable Hardware, IEEE Computer Society Press, Los Alamitos (2000)
14. Sekanina, L.: Evolutionary Design of Gate-level Polymorphic Digital Circuits. In: Rothlauf, F., Branke, J., Cagnoni, S., Corne, D.W., Drechsler, R., Jin, Y., Machado, P., Marchiori, E., Romero, J., Smith, G.D., Squillero, G. (eds.) EvoWorkshops 2005. LNCS, vol. 3449, pp. 185–194. Springer, Heidelberg (2005)
15. Uyemura, J.P.: Introduction to VLSI Circuits and Systems. Wiley, Chichester (2001)
Research on the Online Evaluation Approach for the Digital Evolvable Hardware

Rui Yao, You-ren Wang, Sheng-lin Yu, and Gui-jun Gao

College of Automation and Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, 210016, China
Abstract. An issue that arises in evolvable hardware is how to verify the correctness of the evolved circuit, especially in online evolution. The traditional exhaustive evaluation approach makes evolvable hardware impractical for real-world applications. In this paper an incremental evaluation approach for online evolution is proposed, in which the immune genetic algorithm is used as the search engine. Evolution is performed incrementally: small seed-circuits are evolved first; these seed-circuits are then employed to evolve larger module-circuits; and the module-circuits are in turn used to build still larger circuits. An 8-bit adder, an 8-bit multiplier and a 110-sequence detector have been evolved successfully. Evolution with the incremental evaluation approach is faster than with the exhaustive evaluation method; furthermore, the incremental evaluation approach can be used for both combinational and sequential logic circuits.

Keywords: Evolvable hardware, online evolution, incremental evaluation, digital circuit.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 57–66, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

In evolvable hardware (EHW), there are two ways of performing evolution: online evolution and offline evolution [1]. So far, most EHW researchers have preferred the offline approach, in which circuits are evaluated using their software models [2, 3]. However, software simulation is too slow, and modeling is very difficult in some cases; moreover, it cannot repair local hardware faults online [4]. Online evolution is the prerequisite for online hardware repair. For online evolution, however, fitness evaluation and circuit verification turn out to be even more difficult [5].

How to guarantee the correctness of the evolved circuits is one of the open problems of EHW research, especially for online evolution. On the one hand, a test set must be designed to guarantee the correctness of the circuit's logic function, but a complete test set cannot be obtained in some cases. In theory, a combinational logic circuit's correctness is guaranteed if all of its possible input combinations have been tested. For a sequential logic circuit, in contrast, the output at a given moment depends not only on its inputs at that time but also on its previous state. Online verification of a sequential circuit is therefore very difficult, and at times the fitness evaluation can hardly be performed at all. On the other hand, exhaustive evaluation does not work for circuits with a large number of inputs, since the number of possible input combinations grows exponentially with the number of inputs. For instance, there are $2^{64}$ = 18,446,744,073,709,551,616 input combinations in a 64-bit adder. If a complete test is adopted, and evaluating one test vector takes 1 ps, then some $2^{64} \times 10^{-12}$ s = 18,446,744 seconds ≈ 213.5 days are needed to test all the input combinations, i.e., to evaluate one individual. Suppose that there are 50 individuals in the population and that 2000 generations are needed to find a correct solution; the evolution time needed is then about 2000 × 50 × 213.5 ≈ 21,350,000 days ≈ 58,494 years.

To address the difficulty of fitness evaluation caused by circuit complexity, Jim Torresen proposed partitioning the training set as well as each training vector in order to realize relatively large combinational logic circuits [6]; Hidenori Sakanashi et al. presented a random partial evaluation approach for lossless compression of very high resolution bi-level images [7]; and Mehrdad Salami et al. proposed a fast evolutionary algorithm that estimates the fitness of an offspring from that of its parents in order to speed up the evaluation process [8].

However, Torresen's approach speeds up evolution at the expense of resources and is suitable only for combinational logic circuits; Sakanashi's approach cannot guarantee 100% correctness of digital logic circuits; Salami's approach is suitable only for applications with slow fitness evaluation, and it loses its usefulness if the time to estimate an individual exceeds the time to evaluate it. In this paper, the idea of partitioned verification from electronic design automation is introduced into the evaluation of EHW circuits as an incremental evaluation approach. The approach is used in the online evolutionary design of digital circuits, and an 8-bit adder, an 8-bit multiplier and a 110-sequence detector have been evolved successfully.
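The exhaustive-evaluation estimate for the 64-bit adder can be reproduced numerically. This is a minimal sketch of the arithmetic only; it assumes $10^{-12}$ s per test vector (the factor that appears in the calculation), 50 individuals, 2000 generations and 365-day years, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365

test_vectors = 2 ** 64              # input combinations assumed in the text
seconds_per_vector = 1e-12          # assumed evaluation time per test vector
population_size = 50
generations = 2000

seconds_per_individual = test_vectors * seconds_per_vector
days_per_individual = seconds_per_individual / SECONDS_PER_DAY
total_days = generations * population_size * days_per_individual
total_years = total_days / DAYS_PER_YEAR

if __name__ == "__main__":
    print(f"{seconds_per_individual:,.0f} s per individual")
    print(f"{days_per_individual:.1f} days per individual")
    print(f"{total_years:,.0f} years for the whole run")
```

Whatever per-vector time is assumed, the total scales with $2^{64}$, which is why exhaustive evaluation is ruled out for wide circuits.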
2 The Incremental Evaluation Approach

The basic idea of the incremental evaluation approach is to perform evolution incrementally: first, small basic modules from which a circuit can be constructed, namely seed-circuits, are evolved; these are then used to evolve larger module-circuits; and the module-circuits are in turn used to build still larger circuits.

1) When a seed-circuit is evolved, it is evaluated using a test set that covers only its own function. Because most of the evolution time in incremental evaluation is spent on evolving the seed-circuits, the evolution time decreases dramatically.

2) When a newly developed circuit is evaluated, the test set is designed incrementally. Because the correctness of the seed-circuits has already been verified, only the newly developed function is evaluated. For instance, consider a 3-bit serial adder, a2a1a0 + b2b1b0. After the least significant bit (a0 + b0) and the second least significant bit (a1 + b1) have been evolved, the output of the most significant bit (a2 + b2) depends only on the local bits a2, b2 and the incoming carry c1, so only $2^3 = 8$ test vectors are needed, whereas exhaustive evaluation needs $2^6 = 64$. By analogy, incremental evaluation needs 16 × 8 = 128 test vectors to evaluate a 16-bit adder, while exhaustive evaluation needs $2^{32}$ = 4,294,967,296. The number of test vectors thus grows linearly rather than exponentially with the number of circuit inputs. The test set shrinks dramatically while the circuit's correctness is still ensured, and the evolution speed improves greatly.

3) When larger circuits are evolved from the module-circuits, only the interconnections between them are taken into account in designing their test sets.
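The difference between linear and exponential growth of the test set can be made concrete for the serial adder. The following sketch counts test vectors under the scheme described above, with one 3-input stage (two operand bits and the incoming carry) verified per bit position; the function names are illustrative and the counts ignore the carry-in of the whole adder, as the text does.

```python
from itertools import product


def stage_vectors():
    """All (a_i, b_i, carry_in) combinations for a single adder stage."""
    return list(product((0, 1), repeat=3))


def incremental_vectors(bits):
    """Each bit position is verified on its own 8 local test vectors."""
    return bits * len(stage_vectors())


def exhaustive_vectors(bits):
    """Every combination of the two operands is applied."""
    return 2 ** (2 * bits)


if __name__ == "__main__":
    for bits in (3, 4, 8, 16):
        print(bits, incremental_vectors(bits), exhaustive_vectors(bits))
```

For the most significant bit of the 3-bit adder this gives the 8 local vectors of the example, against 64 for the exhaustive test.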
3 The Technical Scheme

3.1 The Online EHW Evolutionary Platform

The block diagram of our online evolutionary platform, built on a Xilinx Virtex FPGA, is shown in figure 1.

Fig. 1. Block diagram of the online evolutionary platform

The platform adopts a co-processing environment of CPU and FPGA. The hardware platform comprises the FPGA, an input SRAM and an output SRAM, while the software platform is provided by the Java-based API JBits. The Xilinx XCV1000 FPGA is used here; it is capable of partial dynamic reconfiguration [9] and contains 1M equivalent system gates. The total storage capacity of the input and output SRAMs is 8 Mb, and the SRAM may be divided into 4 banks of 2 Mb each.
3.2 Coding

The building blocks in the evolution area are modules of different granularities. They are modeled as modules with 2 input ports and 1 output port, as shown in figure 2.
Fig. 2. Sketch map of the modules in the evolution area
Since both the functions of the modules of different granularities and the interconnections between them need to be coded, a multi-parameter cascading and partitioning coding method is used. The whole chromosome is divided into two sections, Section A and Section B, as shown in figure 3(a). Section A determines the interconnections, as shown in figure 3(b); Section B determines the modules' functions, as shown in figure 3(c).
Fig. 3. Multi-parameters cascading and partitioning coding
3.3 Immune Genetic Algorithm

The immune genetic algorithm (IGA) can effectively tackle both the premature convergence of the genetic algorithm and the slow running speed of the immune algorithm. In the early 1990s, Forrest et al. proposed the framework of the immune genetic algorithm by combining the mechanism by which antibodies recognize antigens with the GA [10]. Since then, many new approaches to designing the affinities and the selection probability have appeared: the concentration and affinity of the antibody have been used to design the expected reproduction rate; immune operators such as vaccination and immune selection have been designed and combined with the GA's operators; and the concentration of the antibody has been determined using information entropy [11-14].

In this paper a modified IGA, namely MMIGA, is used as the evolutionary algorithm. The performance index of the circuit is regarded as the antigen, the solutions representing the circuit's topology are viewed as the antibodies, and the agreement between the actual response and the expected response of the circuit is taken as the affinity of the antibody to the antigen. The MMIGA algorithm can be described as follows:

Step 1. Initialization. Randomly generate N antibodies to form the original population Am.

Step 2. Calculate the affinities. View the fitness of each antibody as its affinity. Calculate the affinity of each antibody, namely $d_i$. If the optimal individual is found, stop; otherwise, continue.

Step 3. Sort the antibodies in Am by affinity. The one with the highest affinity is viewed as the monkey-king.

Step 4. Select the S highest-affinity antibodies from Am to form a new set Bm.

Step 5. Clone the S antibodies in Bm to form Cm of N antibodies. The number of clones of the ith antibody is:

$$n_i = \mathrm{Round}\left\{ \frac{d_i}{\sum_{j=1}^{S} d_j} \cdot N \right\}, \quad i = 1, 2, \ldots, S \qquad (1)$$

where Round() is the operator that rounds its argument to the closest integer.

Step 6. Perform mutation on Cm with a probability of 0.95 and obtain Dm. The number of mutation bits of each antibody is adjusted adaptively according to its affinity. The number of mutation bits of the ith antibody is:

$$mb_i = \mathrm{Int}\left( \frac{d_{max} - d_i}{d_{max}} \cdot r_{max} \cdot L \right), \quad i = 1, 2, \ldots, N \qquad (2)$$

where $d_{max}$ is the maximum affinity, L is the length of the antibody, and $r_{max}$ is the maximum mutation rate (here 0.5). Simulated annealing is used to determine whether to keep the mutation: if the affinity increases, the mutation is kept; otherwise, it is kept with a small probability.

Step 7. Select the N-S lowest-affinity antibodies from Am to form a temporary set Em.

Step 8. Perform mutation on Em and obtain Fm.

Step 9. Sort the antibodies in Dm by affinity and determine the dissimilarity matrix NL. Since the dissimilarity measure is symmetric, only the upper triangular matrix is calculated to reduce the computational cost, as shown in formula (3).

$$NL = \begin{bmatrix} nl_{11} & nl_{12} & \cdots & nl_{1N} \\ nl_{21} & \cdots & \cdots & nl_{2N} \\ \cdots & \cdots & \cdots & \cdots \\ \cdots & \cdots & \cdots & nl_{NN} \end{bmatrix} \qquad (3)$$

Step 10. Determine the fuzzy select array SL. Select S-1 dissimilar antibodies according to NL in the order $nl_{11} \to nl_{12} \to \cdots \to nl_{1n} \to nl_{21} \to \cdots \to nl_{nn}$ to form SL.

Step 11. Keep the elite of Am, select S-1 antibodies from Dm according to SL, and add the N-S antibodies in Fm to form the new antibody set Am.

Step 12. Repeat steps 2-11 until the optimal individual is found or the maximum number of generations is reached.
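The affinity-proportional cloning of Step 5 and the adaptive mutation size of Step 6 can be sketched as two small functions. This sketch assumes that Round() rounds halves upward and that Int() truncates toward zero; the function names and the example affinities are illustrative, not taken from the experiments.

```python
import math


def round_half_up(x):
    """Round to the closest integer, with halves rounded upward (assumption)."""
    return math.floor(x + 0.5)


def clone_counts(affinities, n_total):
    """Step 5: clones per selected antibody, proportional to its affinity."""
    total = sum(affinities)
    return [round_half_up(d / total * n_total) for d in affinities]


def mutation_bits(d, d_max, length, r_max=0.5):
    """Step 6: number of bits to mutate; the fitter the antibody, the fewer."""
    return int((d_max - d) / d_max * r_max * length)


if __name__ == "__main__":
    # Hypothetical affinities of S = 3 selected antibodies, N = 10 clones in total.
    print(clone_counts([5, 3, 2], 10))
    # An antibody with half the maximum affinity and 20 bits mutates 5 of them.
    print(mutation_bits(5, 10, 20))
```

Because each clone count is rounded separately, the counts may not sum exactly to N for arbitrary affinities; how the original algorithm handles that case is not stated in the text.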
4 Results and Discussion
4.1 Online Evolutionary Design Results of the Combinational Logic Circuits
Results of the Serial Adder. The 4-bit serial adder and the 8-bit serial adder are taken as examples in this paper. Incremental evaluation and exhaustive evaluation are used in the contrast experiments. The parameters of the algorithm are as follows: the size of the antibody set is N=150, the size of the highest-affinity antibody set is S=5, and the mutation probability of the lowest-affinity antibody set is 0.9. The maximum number of generations is 5000. Each evaluation approach is run 50 times, and the generations and time needed are shown in figures 4 to 7. Figures 4 and 6 show that, when the 4-bit adder and the 8-bit adder are evolved using incremental evaluation and exhaustive evaluation, the number of generations needed hardly differs, because incremental evolution is used in both cases. Figures 5 and 7, however, clearly show that the evolution time of exhaustive evaluation is longer than that of incremental evaluation. For the 4-bit adder the difference in evolution time is small, because the test sets are of similar size: the size of the test set is 500 for exhaustive evaluation and 200 for incremental evaluation. For the 8-bit adder in figure 7 the evolution times differ distinctly, because the size of the test set is 66000 for exhaustive evaluation and 200 for incremental evaluation. The speed-up obtained with the incremental evaluation approach is therefore governed by the size of the test set.
Results of the 8-Bit Multiplier. The 8-bit multiplier is evolved using the seed-circuits of the 8×2 pipelining multiplier and the 16-bit adder, whose topology is shown in fig. 8. The size of the test set is 4096; the parameters are the same as those used for the serial adder.
Fig. 4. Generations needed for the 4-bit adder (plot "Evolution generation of 4-bit adder"; x-axis: runs; y-axis: generations; curves: exhaustive evaluation, incremental evaluation)
Fig. 5. Time needed for the 4-bit adder (plot "Evolution time of 4-bit adder"; x-axis: runs; y-axis: time in minutes; curves: exhaustive evaluation, incremental evaluation)
The seed-circuits in figure 8 use the look-up tables (LUTs) and the routing resources, as well as the fast carry chain, AND, MUX and XOR, etc. As a result, the operation speed is improved and hardware resources are saved: only 5 slices are used in the 8×2 pipelining multiplier, and 8 slices in the 16-bit adder. The optimum solution of the 8-bit multiplier uses only four 8×2 pipelining multipliers and three 16-bit adders, in total 44 slices, i.e. 22 CLBs.
4.2 Online Evolutionary Design Results of the Sequential Logic Circuits
Fig. 6. Generations needed for the 8-bit adder (plot "Evolution generations of 8-bit adder"; x-axis: runs; y-axis: generations; curves: exhaustive evaluation, incremental evaluation)
Fig. 7. Time needed for the 8-bit adder (plot "Evolution time of 8-bit adder"; x-axis: runs; y-axis: time in minutes; curves: exhaustive evaluation, incremental evaluation)
The circuit of the 110-sequence detector has an input 'x' and an output 'z', in addition to 2 flip-flops. Its input is a random signal string; its output 'z' is '1' whenever a 110-sequence appears. Three states (00, 01 and 11) of its 4 possible states are used.
When the 110-sequence detector is evolved using traditional evaluation, LUTs are used as the basic modules. The evolution is run 20 times, of which 10 runs are successful, i.e. the success rate of the evolution is 0.5. The generations and time needed in the 10 successful runs are shown in table 1. The mean number of generations in table 1 is 932, and the mean evolution time is 120.4 minutes.
When the 110-sequence detector is evolved using incremental evaluation, the flip-flop module, whose function has been verified, and the LUT of the combinational logic are taken as basic modules. The evolution is run 30 times; the generations and time needed are shown in table 2.
Fig. 8. Seed-circuits of the 8-bit multiplier

Table 1. Generations and time needed to evolve the 110-sequence detector using the traditional method
Run          1     2     3     4     5     6     7     8     9     10
Generation   860   1889  358   1026  236   735   301   2173  1234  508
Time (min)   136   169   52    111   36    78    54    249   193   126
Table 2. Generations and time needed to evolve the 110-sequence detector using the incremental evaluation
Run   Generation   Time (min)      Run   Generation   Time (min)
1     5            1               16    17           2
2     7            1               17    64           7
3     35           9               18    14           1
4     101          14              19    11           1
5     149          17              20    37           4
6     96           11              21    29           3
7     25           5               22    30           3
8     135          15              23    13           2
9     32           4               24    12           2
10    230          26              25    26           3
11    78           9               26    47           4
12    6            1               27    67           7
13    23           2               28    33           4
14    50           6               29    44           5
15    21           2               30    16           1
The mean number of generations in table 2 is 48.4, which is about 5.2% of that of the traditional evolution, and the mean evolution time is 5.7 minutes, which is about 4.7% of that of the traditional method. We can therefore conclude that the incremental evaluation not only increases the success rate of the evolution, but also speeds up the evolution dramatically and reduces the evolution time.
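The means and ratios quoted above can be re-derived from the values listed in tables 1 and 2. The following short Python check is only an illustration; the variable names are hypothetical and the data are copied from the two tables.

```python
from statistics import mean

# Table 1: traditional evaluation, 10 successful runs.
trad_generations = [860, 1889, 358, 1026, 236, 735, 301, 2173, 1234, 508]
trad_minutes = [136, 169, 52, 111, 36, 78, 54, 249, 193, 126]

# Table 2: incremental evaluation, 30 runs (runs 1-15, then runs 16-30).
incr_generations = [5, 7, 35, 101, 149, 96, 25, 135, 32, 230, 78, 6, 23, 50, 21,
                    17, 64, 14, 11, 37, 29, 30, 13, 12, 26, 47, 67, 33, 44, 16]
incr_minutes = [1, 1, 9, 14, 17, 11, 5, 15, 4, 26, 9, 1, 2, 6, 2,
                2, 7, 1, 1, 4, 3, 3, 2, 2, 3, 4, 7, 4, 5, 1]

# Ratios of incremental to traditional means.
generation_ratio = mean(incr_generations) / mean(trad_generations)
time_ratio = mean(incr_minutes) / mean(trad_minutes)

print(f"mean generations: {mean(trad_generations):.1f} vs {mean(incr_generations):.1f}")
print(f"mean minutes:     {mean(trad_minutes):.1f} vs {mean(incr_minutes):.1f}")
print(f"generation ratio: {generation_ratio:.1%}, time ratio: {time_ratio:.1%}")
```

The generation ratio comes out at roughly 5%, and the time ratio just under 5%; the exact last digit of the time ratio depends on whether the means are rounded before dividing.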
5 Conclusion
An incremental evaluation approach for the online evolution of evolvable hardware has been proposed in this paper. The seed-circuits are evolved using a partial test set, and the newly developed circuit is then evaluated using a test set designed in an incremental way. The immune genetic algorithm is used as the evolutionary algorithm. It is shown that the incremental evaluation approach can simplify the online verification and evaluation of digital circuits and accelerate the evolutionary design process. It opens a new avenue for tackling the slow evolution speed and the difficulty of online evaluation caused by the complexity of the circuits.
Acknowledgments. The work presented in this paper has been funded by the National Natural Science Foundation of China (60374008, 90505013) and the Aeronautical Science Foundation of China (2006ZD52044, 04I52068).
References
1. Yao, X., Higuchi, T.: Promises and Challenges of Evolvable Hardware. IEEE Trans. on Systems, Man and Cybernetics, Part C: Applications and Reviews 29(1), 87–97 (1999)
2. Zhao, S.-g., Jiao, L.-c.: Multi-objective evolutionary design and knowledge discovery of logic circuits based on an adaptive genetic algorithm. Genetic Programming and Evolvable Machines 7(3), 195–210 (2006)
3. Liu, R., Zeng, S.-y., Ding, L., et al.: An Efficient Multi-Objective Evolutionary Algorithm for Combinational Circuit Design. In: Proc. of the First NASA/ESA Conference on Adaptive Hardware and Systems, pp. 215–221 (2006)
4. Wang, Y.-r., Yao, R., Zhu, K.-y., et al.: The Present State and Future Trends in Bio-inspired Hardware Research (in Chinese). Bulletin of National Natural Science Foundation of China 5, 273–277 (2004)
5. Xu, Y., Yang, B., Zhu, M.-c.: A new genetic algorithm involving mechanism of simulated annealing for digital FIR evolving hardware (in Chinese). Journal of Computer-Aided Design & Computer Graphics 18(5), 674–678 (2006)
6. Torresen, J.: Evolving Multiplier Circuits by Training Set and Training Vector Partitioning. In: Tyrrell, A.M., Haddow, P.C., Torresen, J. (eds.) ICES 2003. LNCS, vol. 2606, pp. 228–237. Springer, Heidelberg (2003)
7. Sakanashi, H., Iwata, M., Higuchi, T.: Evolvable hardware for lossless compression of very high resolution bi-level images. Computers and Digital Techniques 151(4), 277–286 (2004)
8. Salami, M., Hendtlass, T.: The Fast Evaluation Strategy for Evolvable Hardware. Genetic Programming and Evolvable Machines 6(2), 139–162 (2005)
9. Upegui, A., Sanchez, E.: Evolving Hardware by Dynamically Reconfiguring Xilinx FPGAs. In: Moreno, J.M., Madrenas, J., Cosp, J. (eds.) ICES 2005. LNCS, vol. 3637, pp. 56–65. Springer, Heidelberg (2005)
10. Forrest, S., Perelson, A.S.: Genetic algorithms and the immune system. In: Schwefel, H.-P., Männer, R. (eds.) PPSN 1990. LNCS, vol. 496, pp. 320–325. Springer, Heidelberg (1991)
11. Fukuda, T., Mori, K., Tsukiyama, M.: Parallel Search for Multi-modal Function Optimization with Diversity and Learning of Immune Algorithm. In: Dasgupta, D. (ed.) Artificial Immune Systems and Their Applications, pp. 210–220. Springer, Berlin (1999)
12. Jiao, L.-c., Wang, L.: A Novel Genetic Algorithm Based on Immunity. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 30(5), 552–561 (2000)
13. Cao, X.-b., Liu, K.-s., Wang, X.-f.: Solve Packing Problem Using an Immune Genetic Algorithm. Mini-Micro Systems 21(4), 361–363 (2000)
14. Song, D., Fu, M.: Adaptive Immune Algorithm Based on Multi-population. Control and Decision 20(11), 1251–1255 (2005)
Research on Multi-objective On-Line Evolution Technology of Digital Circuit Based on FPGA Model
Guijun Gao, Youren Wang, Jiang Cui, and Rui Yao
College of Automation and Engineering, Nanjing University of Aeronautics and Astronautics, 210016 Nanjing, China
guijun [email protected]
Abstract. A novel multi-objective evolutionary mechanism for digital circuits is proposed. Firstly, each CLB of the FPGA is configured as a minimum evolutionary structure cell (MESC). The two-dimensional array of MESCs is encoded with integer values, and the functions and interconnections of the MESCs are reconfigured. Secondly, the circuit function, the number of active CLBs and the circuit response speed are taken as the evolutionary objectives. The fitness of the circuit function is evaluated by on-line test. The fitness of the number of active CLBs and of the response speed is evaluated by searching the evolved circuit in reverse direction. Digital circuits are then designed by multi-objective on-line evolution using these evaluation methods. Thirdly, a multi-objective optimization algorithm is improved to quicken the convergence of on-line evolution. Finally, a Hex-BCD code conversion circuit is taken as an example. The experimental results demonstrate the feasibility and effectiveness of the new on-line design method for digital circuits.
Keywords: Evolvable Hardware, Digital Circuit, On-line Evolution, Multi-objective Evolutionary Method, FPGA Model.
1 Introduction
The key principle of Evolvable Hardware (EHW) [1,2] is to realize hardware self-adaptation, self-combination and self-repair by means of evolutionary algorithms and programmable logic devices (PLDs). There are two ways to realize it: off-line evolution [3] and on-line evolution [4]. At present, most research addresses off-line evolution. Firstly, the structure of the circuit is designed by an evolutionary algorithm and a mathematical model. Secondly, placement and routing are carried out by dedicated design software. Thirdly, the physical structure is mapped into the FPGA. Finally, the logic function of the designed circuit is evaluated according to the test results of the FPGA. The disadvantages of this approach are evident: its evaluation is slow and its evolution time is long, and it is difficult to evaluate circuit function, active area, time delay, power dissipation and so on in a comprehensive way. It therefore cannot implement real-time repair of hardware faults. EHW on-line evolution can configure the functions and interconnections of the minimum evolutionary structure cells (MESCs) in the FPGA directly. During evolutionary design, each chromosome of every generation can be evaluated in real time, evaluation indices of the circuit's overall performance can be obtained, and hardware can be repaired on-line in the case of local faults. Research on EHW on-line evolution technology is therefore important. Its key technologies include the encoding method, on-line evaluation, the multi-objective evolutionary mechanism, and FPGA chips suitable for evolution.
Today, on-line evolution of EHW is usually based on binary encoding [5,6], whose shortcoming is an overlong chromosome. This makes the search space huge, so that a satisfactory result may not be found within the available time. If min-term or function-level encoding is used, decoding is necessary, and the evolved logic circuits do not map one-to-one onto the physical circuits. These methods are therefore unfavorable for on-line hardware repair.
Multi-objective evolution of digital circuits has been studied for many years. Zhao Shuguang [7] integrated several objectives into a single objective in the form of a "sum of weights" based on a netlist-level representation. He Guoliang [8] and Soliman [9] studied multi-objective evolution with two objectives, the circuit function and the number of active logic gates. They first evolved a 100% functional circuit and then optimized the number of active logic gates. These multi-objective evolutionary designs, however, are off-line. Considering the internal structure of the FPGA, we propose a method of on-line multi-objective evolution of digital circuits that is suited to the FPGA. It encodes the MESCs and evolves three objectives: one evolutionary aim (a 100% functional circuit) and two restricted aims (the number of active CLBs and the circuit response speed). On this basis, a novel multi-objective on-line evolutionary algorithm is presented.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 67–76, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Multi-objective Design Principle of Digital Circuit On-Line Evolution
Generally, if the input/output characteristic of the circuit to be designed is known, a multi-objective evolutionary algorithm can be used to implement the circuit function and optimize the structure. Firstly, the initial chromosomes representing the circuit structure are created randomly. Secondly, each chromosome is downloaded into the FPGA and then tested and evaluated. Thirdly, the multi-objective evolutionary algorithm evolves the old chromosomes and creates new ones according to the test results. The second and third steps are repeated until some chromosome satisfies the multi-objective evolutionary conditions, at which point the expected circuit has been realized. Fig. 1 shows the multi-objective on-line evolutionary process of a digital circuit.
2.1 Multi-objective Optimization Model and Evolutionary Algorithm
Fig. 1. Multi-objective on-line design process of digital circuit
On-line evolutionary design of circuits is an important application of multi-objective optimization algorithms, so how to construct the objectives and design the evolutionary algorithm needs to be studied. For multi-objective on-line evolution in EHW, we improve the multi-objective optimization model based on the typical restriction method. The multi-objective optimization problem of digital circuit design can be expressed as follows:
\[ \begin{aligned} &\text{primary function: } y = fitness(x) = F \\ &\text{subsidiary functions: } power(x) \le D, \quad timedelay(x) \le P \end{aligned} \qquad (1) \]
The functional fitness value of the evolved circuit is calculated as follows:
\[ F = \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} C_{i,j} \qquad (2) \]
\[ C_{i,j} = \begin{cases} 1, & outdata = epdata \\ 0, & outdata \ne epdata \end{cases} \qquad (3) \]
where outdata is the output value of the current evolved circuit, epdata is the output value of the expected circuit, n is the total number of circuit output pins, and m is the total number of test signals. The fitness value of the circuit function directly reflects how closely the current circuit approaches the expected circuit. Power dissipation is approximated in this paper by the number of active CLBs, denoted by D. Time delay is defined as the maximal signal transmission time from input to output, denoted by P. The values of power dissipation and time delay can be obtained by searching the evolved circuit in reverse direction. Fig. 2 shows the realization process of the multi-objective evolutionary algorithm. The steps are as follows:
STEP 1. Create the initial population randomly.
STEP 2. Judge whether each chromosome of the population satisfies the constraint conditions and the niche qualification. If yes, it is preserved in the register pool. The best chromosome of the population is marked.
STEP 3. Evolve the population. If the register pool is not full, go to STEP 2. Otherwise, go to the next step.
STEP 4. Judge whether there are excellent genes in the population. If there are, mark and lock them.
STEP 5. Evolve the population by the multi-objective evolutionary algorithm, based on strategies such as mutation and preserving excellent genes.
STEP 6. Evaluate each objective. If all of the multi-objective evolutionary conditions are satisfied, evolution stops. Otherwise, go to STEP 4.
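As an illustration of equations (2) and (3), the functional fitness can be computed by counting matching output bits over all test signals. The sketch below is in Python; the 4-bit-input, 5-bit-output Hex-BCD truth table and the "pass-through" candidate circuit are assumptions chosen for the example and are not taken from the experimental setup.

```python
def to_bits(value, width):
    """Return `width` bits of `value`, least significant bit first."""
    return [(value >> k) & 1 for k in range(width)]


def expected_hex_to_bcd(value):
    """Expected outputs (epdata): one tens bit followed by four units bits."""
    tens, units = divmod(value, 10)
    return [tens] + to_bits(units, 4)


def pass_through(value):
    """A deliberately imperfect candidate: copies the input, tens bit stuck at 0."""
    return [0] + to_bits(value, 4)


def functional_fitness(circuit, expected, test_signals):
    """F = sum over output pins i and test signals j of C_ij (equations 2 and 3)."""
    fitness = 0
    for signal in test_signals:
        outdata = circuit(signal)
        epdata = expected(signal)
        # C_ij is 1 when the output bit matches the expected bit, else 0.
        fitness += sum(1 for o, e in zip(outdata, epdata) if o == e)
    return fitness


signals = range(16)            # m = 16 test signals
max_fitness = 16 * 5           # n = 5 output pins, so F_max = n * m
f_perfect = functional_fitness(expected_hex_to_bcd, expected_hex_to_bcd, signals)
f_candidate = functional_fitness(pass_through, expected_hex_to_bcd, signals)
```

A circuit is 100% functional when F reaches n·m; the pass-through candidate is correct for inputs 0 to 9 and loses bits only on inputs 10 to 15, so its fitness is below the maximum.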
Fig. 2. Flow chart of multi-objective evolutionary algorithm
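The reverse-direction search used to obtain D (number of active CLBs) and P (time-delay level) can be sketched as a backward traversal from the outputs. The Python sketch below assumes a feed-forward netlist described by a hypothetical `fanin` dictionary that maps each CLB to the two sources driving its F1 and F2 pins; the node names are invented for the example.

```python
def active_clbs(outputs, fanin, primary_inputs):
    """Mark every CLB reachable from the outputs by walking fan-in edges backwards."""
    active = set()
    stack = list(outputs)
    while stack:
        node = stack.pop()
        if node in primary_inputs or node in active:
            continue  # stop at the input column or at already marked CLBs
        active.add(node)
        stack.extend(fanin[node])
    return active


def delay_levels(outputs, fanin, primary_inputs):
    """Longest input-to-output path, counted in CLB levels (acyclic netlist assumed)."""
    memo = {}

    def depth(node):
        if node in primary_inputs:
            return 0
        if node not in memo:
            memo[node] = 1 + max(depth(src) for src in fanin[node])
        return memo[node]

    return max(depth(out) for out in outputs)


# Hypothetical example: c2 and c4 do not influence any output, so they are inactive.
inputs = {"x0", "x1", "x2"}
fanin = {
    "c1": ("x0", "x1"),
    "c2": ("x1", "x2"),
    "c3": ("c1", "x2"),
    "c4": ("c2", "c2"),
}
outputs = ["c3", "c1"]

D = len(active_clbs(outputs, fanin, inputs))   # number of active CLBs
P = delay_levels(outputs, fanin, inputs)       # time-delay level
```

Counting levels rather than physical delay is a simplification made here; it corresponds to the "time-delay level" reported in the experiments.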
2.2 Encoding and Searching Methods of Digital Circuit
In the design of digital circuits there are many encoding methods, such as binary code, min-term code and function-level code. We propose an integer code method based on the FPGA model. This method encodes both the functions of the CLBs and the interconnections between CLBs. Fig. 3 shows the encoding and searching methods for the 2-D array of the FPGA. We encode the 2-D array with a cascaded multi-parameter encoding method. In this paper, one LUT is used in every CLB, and two input pins (F1 and F2) and one output pin (XQ) of every LUT are used to evolve the circuit. Every chromosome can be expressed by the 80-bit code in Fig. 3. The first 55 bits of the code express the interconnections between CLBs; the remaining bits express the functions of the CLBs.
Fig. 3. Schematic of structure of FPGA, code and searching direction of digital circuit evolutionary design (diagram labels: input column with test signal input, evolutionary columns, output column with test result output, searching direction)
Searching the active CLBs of the 2-D evolutionary array in reverse direction is important for multi-objective evaluation. The searching direction is marked in Fig. 3. Firstly, the CLBs connected to the output column are marked by searching this column. Secondly, if a searched CLB has been marked, the CLBs connected to it are marked in turn by searching in reverse direction. When the input column has been searched, the search stops. Consequently, the positions of all active CLBs are marked, and the number of active CLBs and the time delay can be calculated by this method.
2.3 Improved Multi-objective Evolutionary Strategy of Digital Circuit Design
The evolutionary design process of a digital circuit is as follows: confirm the logical function of the circuit to be designed, design the fitness evaluation function, search for the optimal structure by the evolutionary algorithm, download each chromosome of each generation to the FPGA, evaluate the current circuit, and stop the evolution if the current circuit satisfies the expected function. By analyzing evolved circuits, we find that some genes of some chromosomes are already optimal (that is, the sub-functions of the circuit corresponding to these genes satisfy some evolutionary conditions) although the chromosomes as a whole are not. These good genes may nevertheless be destroyed and have to be evolved again in later generations, which increases the evolution time. On the basis of this analysis, we present an improved approach to circuit design: when some sub-circuits satisfy the performance requirements, we preserve the genes corresponding to these sub-circuits and evolve only the other functions and their interconnections. This yields a new strategy for accelerating the convergence of evolutionary circuit design. For example, we use n individuals of a single population to design the circuit by parallel evolution. The improved strategy is realized as follows: when the outputs of one or more pins of the FPGA satisfy the test conditions, the CLBs influencing these outputs are marked and locked so that they cannot be destroyed. The locked genes are then available for learning during subsequent evolution, which quickens convergence.
Fig. 4. Improved evolutionary strategy of accelerating evolution process convergence (three phases: evolutionary initial phase, phase of "locking" some excellent genes with the locked part marked, and evolutionary end; inputs I0 to In, outputs O0 to On)
The idea of preserving sections of excellent genes is applied to the multi-objective evolutionary design of digital circuits. The search space is reduced by preserving the locked genes on the basis of the multi-objective evaluation of every generation, which raises the speed of multi-objective on-line evolutionary design. The idea resembles the way an expert designer works.
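The locking strategy described above can be sketched as a mutation operator that skips locked gene positions. The Python sketch below is only an illustration under stated assumptions: the integer gene range, the per-gene mutation rate, and the hypothetical `cone_of` mapping from an output pin to the gene positions influencing it are not specified in the text.

```python
import random


def lock_genes(locked, satisfied_outputs, cone_of):
    """Return the locked set extended by the genes behind every satisfied output pin."""
    new_locked = set(locked)
    for pin in satisfied_outputs:
        new_locked |= cone_of[pin]
    return new_locked


def mutate_unlocked(chromosome, locked, gene_max, rate, rng):
    """Mutate integer genes with probability `rate`, leaving locked positions untouched."""
    child = list(chromosome)
    for pos in range(len(child)):
        if pos in locked:
            continue  # preserved "excellent" gene
        if rng.random() < rate:
            child[pos] = rng.randint(0, gene_max)
    return child


# Hypothetical example: output O0 already passes its test, so its genes are locked.
rng = random.Random(0)
cone_of = {"O0": {0, 1}, "O1": {2, 3, 4}}
locked = lock_genes(set(), ["O0"], cone_of)
parent = [1, 2, 3, 0, 1, 2]
child = mutate_unlocked(parent, locked, gene_max=3, rate=1.0, rng=rng)
```

Because locked positions are never touched, only the genes outside the locked set contribute to the remaining search space, which is the reduction the strategy relies on.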
3 Experimental Results
3.1 Experimental Program and Results Analysis
A Virtex FPGA is adopted as the experimental hardware model and JBits 2.8 as the software platform. Four basic logic functions are used to configure the CLBs of the FPGA: F1&F2, F1&(~F2), F1^F2 and F1|F2. A HEX-BCD code conversion circuit is taken as the experimental example. The evolutionary area is an array of 5*7. Because the first column works as inputs and the last as outputs, these two columns do not participate in the evolution, so the actual evolutionary area is an array of 5*5. The multi-objective evaluation function is given by equation (1) in Section 2.1.
Fig. 5 shows 30 groups of comparative results between multi-objective evolution and single-objective evolution. The results show that the number of active CLBs and the time-delay levels of multi-objective evolution are lower than those of single-objective evolution, but its convergence is clearly slower.
Fig. 5. Comparison of experimental results between multi-objective and single-objective evolution: (a) comparative chart of active CLBs (d1: single-objective, d2: multi-objective; x-axis: experimental times; y-axis: number of active CLBs); (b) comparative chart of time-delay levels (p1: single-objective, p2: multi-objective; x-axis: experimental times; y-axis: time-delay levels); (c) comparative chart of evolutionary time (t1: single-objective, t2: multi-objective; x-axis: experimental times; y-axis: evolutionary time)
We use the accelerating convergence strategy to quicken the convergence of multi-objective evolution. Fig. 6 compares the convergence speed of the single-objective algorithm, the common multi-objective algorithm and the accelerating multi-objective algorithm. The experimental results indicate that the accelerating multi-objective algorithm converges faster than the common algorithm. Local degeneration is caused by the constraint conditions, but the evolutionary process as a whole still converges. Ten groups of evolution time values of the two methods are shown in Fig. 7.
Fig. 6. Comparative chart of convergence curve of different algorithms in the process of evolution (c1: accelerating multi-objective evolution, c2: common multi-objective evolution, c3: single-objective evolution; x-axis: evolutionary generations; y-axis: percentage of convergence)
Fig. 7. Histogram of evolutionary time of two multi-objective evolutionary algorithms (t1: common multi-objective evolution, t2: accelerating multi-objective evolution; x-axis: experimental times; y-axis: evolutionary time)
3.2 Circuit Structures Analysis
A typical circuit structure obtained by single-objective evolution is shown in Fig. 8 and one obtained by multi-objective evolution in Fig. 9. The gray parts show the obvious redundancies, and the white parts may still be simplified. In Fig. 8 the number of active CLBs is 18 and the time-delay level is 5, whereas in Fig. 9 the number of active CLBs is 10 and the time-delay level is 3. The circuit structure obtained by multi-objective evolution is clearly better than that obtained by single-objective evolution, and the redundancies are greatly reduced. The multi-objective approach therefore has the advantages of lower resource consumption and shorter time delay.
Fig. 8. Circuit structure based on single-objective evolution
Fig. 9. Circuit structure based on multi-objective evolution
4 Conclusions
Based on the structural characteristics of the FPGA, multi-objective on-line evolution of digital circuits is realized in this paper by encoding the MESCs with a two-level integer method and applying a multi-objective evolutionary algorithm. An improved multi-objective evolutionary strategy based on locking sections of excellent genes is presented. The circuit is then grown by evolving several sub-modules into which the whole circuit is divided and preserving the excellent gene sections. The experimental results demonstrate the feasibility of the new design method for digital circuits. Compared with the circuit designed by single-objective evolution, the circuit obtained by multi-objective evolution has fewer active CLBs, fewer redundancies and a lower time-delay level. The improved multi-objective evolutionary strategy also shortens the evolutionary time greatly. In this paper, the proposed method is applied only to combinational logic circuits; more research is still needed on multi-objective on-line evolutionary methods for sequential circuits.
Acknowledgments. The work presented in this paper has been funded by the National Natural Science Foundation of China (60374008, 90505013) and the Aeronautical Science Foundation of China (2006ZD52044, 04I52068).
References
1. Yao, X., Higuchi, T.: Promises and challenges of evolvable hardware. IEEE Transactions on Systems, Man and Cybernetics 29, 87–97 (1999)
2. Lohn, J.D., Hornby, G.S.: Evolvable hardware: using evolutionary computation to design and optimize hardware systems. Computational Intelligence Magazine 1, 19–27 (2006)
3. Anderson, E.F.: Off-line Evolution of Behaviour for Autonomous Agents in Real-Time Computer Games. In: Proc. of 7th International Conference of Parallel Problem Solving from Nature, Granada, Spain, pp. 689–699 (2002)
4. Damiani, E., Tettamanzi, A.G.B., Liberali, V.: On-line Evolution of FPGA-based Circuits: A Case Study on Hash Functions. In: Proc. of the First NASA/DoD Workshop on Evolvable Hardware, California, USA, pp. 26–33 (1999)
5. Yao, X., Liu, Y.: Getting most out of evolutionary approaches. In: Proc. of NASA/DoD Conference of Evolvable Hardware, Washington DC, USA, pp. 8–14 (2002)
6. Hartmann, M., Lehre, P.K., Haddow, P.C.: Evolved digital circuits and genome complexity. In: Proc. of NASA/DoD Conference of Evolvable Hardware, Washington DC, USA, pp. 79–86 (2005)
7. Shuguang, Z., Yuping, W., Wanhai, Y., et al.: Multi-objective Adaptive Genetic Algorithm for Gate-Level Evolution of Logic Circuits. Chinese Journal of Computer-aided Design & Computer Graphics 16, 402–406 (2004)
8. Guoliang, H., Yuanxiang, L., Xuan, W., et al.: Multiobjective simulated annealing for design of combinational logic circuits. In: Proc. of the 6th World Congress on Intelligent Control and Automation, Dalian, China, pp. 3481–3484 (2006)
9. Soliman, A.T., Abbas, H.M.: Combinational circuit design using evolutionary algorithms. In: IEEE CCECE 2003 Canadian Conference on Electrical and Computer Engineering, Canada, pp. 251–254 (2003)
Evolutionary Design of Generic Combinational Multipliers Using Development
Michal Bidlo
Brno University of Technology, Faculty of Information Technology, Božetěchova 2, 61266 Brno, Czech Republic
[email protected]
Abstract. Combinational multipliers represent a class of circuits that is usually considered to be hard to design by means of the evolutionary techniques. However, experiments conducted under the previous research demonstrated (1) a suitability of an instruction-based developmental model to design generic multiplier structures using a parametric approach, (2) a possibility of the development of irregular structures by introducing an environment which is considered as an external control of the developmental process – inspired by the structures of conventional multipliers and (3) an adaptation of the developing structures to the different environments by utilizing the properties of the building blocks. These experiments have represented the first case when generic multipliers were designed using an evolutionary algorithm combined with the development. The goal of this paper is to present an improved developmental model working with the simplified building blocks based on the concept of conventional generic multipliers, in particular, adders and basic AND gates. We show that this approach allows us to design generic multiplier structures which exhibit better delay in comparison with the classic multipliers, where adder represents a basic component.
1 Introduction
The design of combinational multipliers has been often concerned as a non-trivial task for demonstrating the capabilities of evolutionary systems. Gate-level representation has been usually utilized whose search space is typically rugged and hard to explore using the evolutionary algorithms. In case of applying a direct encoding (a non-developmental genotype–phenotype mapping) it is extremely difficult to achieve scalability of the evolved solutions, i.e. to obtain larger instances of the circuits, for example, when the traditional Cartesian Genetic Programming (CGP) is utilized [1]. Therefore, more effective representations have been investigated in order to overcome these issues and, in general, improve the scalability and evolvability of digital circuits, as summarized in the following paragraph. Miller et al outlined the principles in the evolutionary design of digital circuits and showed some results of evolved combinational arithmetic circuits, including multipliers, in [2]. A detailed study of the fitness landscape in case of the evolutionary design of combinational circuits using Cartesian Genetic Programming is L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 77–88, 2007. c Springer-Verlag Berlin Heidelberg 2007
78
M. Bidlo
proposed in [3]. 3 × 3 multipliers constitute the largest and most complex circuits designed by means of traditional CGP in these papers. Vassilev et al utilized a method based on CGP which exploits redundancy contained in the genotypes. Larger (up to 4 × 4 bits) and more efficient multipliers were evolved by means of this approach in comparison with the conventional designs [4]. Vassilev and Miller studied the evolutionary design of 3 × 3 multipliers by means of evolved functional modules rather than only two-input gates [5]. Their approach is based on Murakawa’s method of evolving sub-circuits as the building blocks of the target design in order to speed up and improve the scalability of the design process [6]. Torresen applied the partitioning of the training vectors and the partitioning of the training set approach (so-called increased complexity evolution or incremental evolution) for the design of multiplier circuits. His approach was focused on improving the evolution time and evolvability rather than optimizing the target circuit. 5×5 multipliers were evolved using this method [7]. Stomeo et al devised a decomposition strategy for evolvable hardware which allows to design large circuits [8]. Among others, 6×6 multipliers were evolved by means of this approach. Aoki et al introduced an effective graph-based evolutionary optimization technique called Evolutionary Graph Generation [9]. The potential capability of this method was demonstrated through experimental synthesis of arithmetic circuits at different levels of abstraction. 16 × 16 multipliers were evolved using word-level arithmetic components (such as one-bit full adders or one-bit registers). An instruction-based developmental system was introduced in [10] for the design of arbitrarily large multipliers. Genetic algorithm was utilized to evolve a program for the construction of generic multipliers using a parametric approach. 
Basic AND gates and higher-level building blocks based on a one-bit adder were utilized. The concept of an environment (an external control of the developmental process) was introduced in order to design irregular structures. An interesting phenomenon of adaptation to different environments was observed. The results presented in [10] represent the first case in which generic multipliers were evolved using development. In this paper an improved developmental model is presented that is based on the system introduced in [10]. Simplified building blocks are utilized in order to clarify the circuit structure and to avoid restricting the search space (only basic AND gates and pure half and full adders are considered). The adaptation is exploited (as discussed in [10]) for the design of different circuit structures depending on the environment chosen. In particular, the experiments are devoted to the design of carry-save multipliers, which exhibit a shorter delay than the classic multipliers described in [11]. Note that the evolutionary development of classic generic multipliers was introduced in [10].
2 Biologically Inspired Development
In nature, development is the biological process of ontogeny, i.e. the formation of a multicellular organism from a zygote. It is influenced by the genetic information of the organism and by the environment in which the development is carried out.
Evolutionary Design of Generic Combinational Multipliers
In computer science, and in evolutionary algorithms in particular, computational development has been inspired by this biological phenomenon. Computational development is usually considered to be a non-trivial and indirect mapping from genotypes to phenotypes in an evolutionary algorithm. In such a case the genotype has to contain a prescription for the construction of a target object. While the genetic operators work with the genotypes, the fitness calculation (evaluation of the candidate solutions) is applied to the phenotypes created by means of the development. The environment in computational development may be understood as external information (in addition to the genetic information included in the genotype) and as an additional control mechanism of the development. The principles of computational development together with a brief biological background are summarized in [12].
3 Development of Efficient Generic Multipliers
The method of development is inspired by the construction of conventional combinational multipliers, for which generic algorithms exist. Figure 1 shows two typical designs of a 4 × 4 multiplier constructed by means of the conventional approach [11]. It is evident that the circuits contain parts which differ from the rest of the circuit, i.e. they represent a kind of irregularity. In particular, this is the case for the first level (“row”) of AND gates occurring in both multipliers in Fig. 1a,b and for the last level of adders occurring in the carry-save multiplier (Fig. 1b). However, the rest of the circuit structure exhibits a high level of regularity that can be expressed by means of an iterative algorithm utilizing variables and parameters related to a given instance of the multiplier. For example, the number of bits of the operands determines the number of AND gates and adders in a given circuit level as well as the number of levels of the multiplier. Therefore, this concept is assumed to be convenient for the design of generic multipliers using development and an evolutionary algorithm. Experiments were conducted in previous research dealing with the evolutionary design of generic multipliers using an instruction-based developmental model [10]. The building blocks utilized in that method include an adder combined with a basic AND gate, which, however, may be an unsuitable approach that prevents the evolution from finding better solutions. Nevertheless, general programs were evolved for the construction of multipliers whose structure corresponds to that of the classic combinational multiplier shown in Fig. 1a. The obtained results showed the possibility of designing generic multipliers using an evolutionary algorithm with development, which gave rise to an interesting area deserving further investigation.
Therefore, new features of the developmental model are introduced in this paper in order to design more efficient multipliers than those evolved in [10]. The approach presented herein is based on the structure of the carry-save multiplier (see Fig. 1b), which exhibits a shorter delay than the classic multiplier shown in Fig. 1a [11]. In particular, simplified building blocks are introduced, including only basic AND gates and pure one-bit adders. In addition to the single-block generative instruction introduced in [10], an enhanced instruction for creating the circuit structure is presented that is able to generate two building blocks at a time. It is needed because the construction algorithm for the generic multiplier structure becomes more complex when the simplified building blocks are used.

Fig. 1. 4 × 4 conventional multipliers: (a) classic combinational multiplier, (b) more efficient carry-save multiplier. a0, ..., a3 and b0, ..., b3 denote the bits of the first and the second operand, respectively; p0, ..., p7 represent the bits of the product.
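The regularity of the carry-save structure can be illustrated by a short iterative sketch parameterized by the operand width w. This is not the evolved construction program and not an exact copy of Fig. 1b: the function names are hypothetical, every adder is modeled as a full adder (zero inputs are not optimized away, and half adders are not used), and the circuit is evaluated directly on integer operands rather than built as a netlist.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ c
    cout = (a & b) | (c & (a ^ b))
    return s, cout


def carry_save_multiply(a, b, w):
    """Multiply two w-bit unsigned integers using only AND gates and full adders."""
    n = 2 * w                                   # width of the product
    a_bits = [(a >> j) & 1 for j in range(w)]
    b_bits = [(b >> i) & 1 for i in range(w)]
    sums = [0] * n                              # saved sum bits, indexed by weight
    carries = [0] * n                           # saved carry bits, indexed by weight

    for i in range(w):                          # one level per bit of operand B
        # Level of AND gates: partial product A * b_i, shifted by i positions.
        partial = [0] * n
        for j in range(w):
            partial[i + j] = a_bits[j] & b_bits[i]
        # Level of adders: carries are saved for the next level, not propagated.
        new_sums, new_carries = [0] * n, [0] * n
        for k in range(n):
            s, c = full_adder(sums[k], carries[k], partial[k])
            new_sums[k] = s
            if k + 1 < n:
                new_carries[k + 1] = c
        sums, carries = new_sums, new_carries

    # Last level of adders: a ripple-carry adder merges saved sums and carries.
    product, carry = 0, 0
    for k in range(n):
        s, carry = full_adder(sums[k], carries[k], carry)
        product |= s << k
    return product
```

Only the loop bounds depend on w, which is the property that a generic construction program has to capture; the first AND level and the final adder level are the irregular parts.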
4 Instruction-Based Developmental System
A simple two-dimensional grid consisting of a given number of rows and columns was chosen as a suitable structure for the development of the target circuits. The building blocks are placed into this grid by means of a developmental program. In order to handle irregularities, an external control of the developmental process, called an environment, has been introduced. A building block represents a basic component of the circuit to be developed. The general structure of the block is shown in Figure 2a. Each building block contains three inputs, of which one or two may be unused depending on the type of the block. Each building block has two outputs, of which one may be meaningless, i.e. permanently set to logic 0, depending on the block type. The outputs are denoted symbolically as out0 and out1. For a block with only one output, out0 represents the effective output and out1 is permanently set to logic 0. The circuit is developed inside a grid (rectangular array), which proved to be a suitable structure for the design of combinational multipliers (see Figure 2b). Figure 3 shows the set of building blocks utilized for the experiments presented in this paper. The position (row, col) in the grid is utilized for the interconnection of the blocks. The inputs of the blocks are connected
to the outputs of the neighboring blocks by referencing the symbolic names of the outputs or, via indices, to the primary inputs of the circuit, depending on the block type. Feedback is not allowed. For example, out1(row, col − 1) means that the input of the block at the position (row, col) in the grid is connected to the output denoted out1 of the block on its left-hand side. The connections to the primary inputs are determined by the indices v0 and v1. Let A = a0 a1 a2, B = b0 b1 b2 represent the primary inputs (operands A and B) of a 3 × 3 multiplier. For instance, an AND gate with v0 = 1 and v1 = 2 has its inputs connected to the second bit (a1) of operand A and the third bit (b2) of operand B. For the building blocks at the borders of the grid (row = 0 or col = 0), where no blocks with valid outputs exist at row − 1 or col − 1, the corresponding inputs of the block at (row, col) are set to logic 0. The development of the circuit is performed by means of a developmental program. This program, which is the subject of evolution, consists of simple application-specific instructions. The instructions make use of numeric literals 0, 1, ..., max_value, where max_value is specified by the designer at the beginning of evolution. In addition to the numeric literals, a parameter and some variables of the developmental system can be utilized.

Fig. 2. (a) Structure of a building block. (row, col) determines the position of the block in the grid, see part (b). The connection of the inputs depends on the type and position of the block. (b) A grid of building blocks with m rows and n columns for the development of generic multipliers.

Fig. 3. Building blocks for the development of generic multipliers. (a, b) buffers (identity functions), (c) AND gate, (d, e) half adders, (f, g, h) full adders. (row, col) denotes the position in the grid. v0 and v1 determine the indices of primary input bits. The connections of the individual inputs of the blocks are shown. Unused inputs and outputs are not depicted (set to logic 0). Note that the full adders (g, h) are new to this paper and are inspired by and intended for the design of carry-save multipliers.

The parameter represents the width (the number of bits) of the operands, i.e. the inputs of the multiplier. The parameter is referenced by its symbolic name w, whose value is specified by the designer at the beginning of the evolutionary process. For example, when a 4 × 4 multiplier is designed, w = 4. The value of the parameter does not change during the evolutionary process. Four variables are integrated into the developmental system, denoted v0, v1, v2 and v3, whose values are altered by the appropriate instructions during the execution of the program (the developmental process). Table 1 describes the instruction set utilized for the development. The SET instruction assigns a value determined by a numeric literal, the parameter or another variable to a specified variable. The INC and DEC instructions increase and decrease, respectively, the value of a given variable. The difference can be specified only by a numeric literal. Simple loops inside the developmental program are provided by the REP instruction, whose first argument determines the repetition count and whose second argument states the number of instructions after the REP instruction to be repeated. Inner loops are not allowed, i.e. REP instructions inside the repeated code are interpreted as NOP (no operation) instructions. The GEN instruction generates one or two building blocks of the types specified by its arguments. If (row, col) does not exceed the grid boundaries, the block is generated at that position. When two blocks are generated, the second one is placed at (row + 1, col). No block is generated outside the grid. The possibility of generating two building blocks during one execution of the GEN instruction is a new feature introduced in this paper in comparison with the system presented in [10], where only one block could be generated.
This variant has been chosen in order to reduce the complexity of the developmental process now that the simplified building blocks have been introduced. Note that this approach in no way restricts the capabilities of the construction algorithm because the number and types of the blocks to be generated are determined independently by the evolutionary algorithm. When an AND gate is generated, its inputs are connected to the primary inputs indexed by the current values of the variables v0, v1, as shown in Figure 3c. If v0 or v1 exceeds the bit width of the operands, the corresponding input of the AND gate is connected to logic 0. The inputs of the other building blocks are determined by the position (row, col) in the grid at which they are generated. After executing GEN, col is increased by one.

Table 1. Instructions utilized for the development

  Instruction  Arguments        Description
  0: SET       variable, value  Assign value to variable. variable ∈ {v0, v1, v2, v3}, value ∈ {0, 1, ..., max_value, w, v0, v1, v2, v3}.
  1: INC       variable, value  Increase variable by value. variable ∈ {v0, v1, v2, v3}, value ∈ {0, 1, ..., max_value}.
  2: DEC       variable, value  If variable ≥ value, then decrease variable by value. variable ∈ {v0, v1, v2, v3}, value ∈ {0, 1, ..., max_value}.
  3: REP       count, number    Repeat the following number instructions count times. All REP instructions among them are interpreted as NOP instructions (inner loops are not allowed).
  4: GEN       block1, block2   Generate block1 at the current position (row, col). If block2 is a non-empty block, generate block2 at (row + 1, col). Increase col by 1.
  5: NOP                        An empty operation.

In fact, the developmental program may consist of several parts, which may consist of different numbers of instructions. Let us define the length of a program (or of a part of a program) as the number of instructions it is composed of. These parts are executed on demand with respect to an environment. A single execution of a part of the program is referred to as a developmental step. The purpose of the environment is to enable the system to develop more complex structures which may not be fully regular. The environment is represented by a finite sequence of values specified by the designer at the beginning of the evolution, e.g. env = (0, 1, 2, 2). The number of different values in the environment corresponds to the number of parts of the developmental program. In addition, there is an environment pointer (denoted e) that determines a particular value in the environment during development. Each part of the program is executed deterministically, sequentially and independently of the others according to the environment values. However, the parameter and the variables of the developmental system are shared by all parts of the program. At the beginning of the evolutionary process the value of the parameter w and the form of the environment env are defined by the designer. The developmental program, whose number of parts and their lengths are also specified a priori by the designer, operates over these data in order to develop a multiplier of a given size. Multipliers of different sizes are thus created by setting the parameter and adjusting the environment. Hence a circuit of a given size is always developed from scratch; this is a case of parametric developmental design. The following algorithm is defined in order to handle the developmental process.

1. Initialize row, col, v0, v1, v2, v3 and e to 0.
2. Execute the env(e)-th part of the program.
3. If a GEN instruction was executed, increase row by 2 if two building blocks were generated simultaneously, or by 1 if only single blocks were generated. Increase e by one and set col to 0.
4. If neither e nor row exceeds its bound, go to step 2.
5. Evaluate the resulting circuit.
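Steps 1 to 4 together with the instruction set of Table 1 can be sketched as a small interpreter. This is a minimal sketch under several assumptions that are not stated in the paper: instructions are encoded as tuples (name, arg1, arg2) rather than numeric codes, block types are plain strings, EMPTY stands for the empty block, the repetition count of REP may be a literal, w or a variable, and the grid only records which block was placed (circuit evaluation, step 5, is omitted). The example part is hypothetical and generates one level of AND gates.

```python
EMPTY = None  # hypothetical marker for "no block" in a GEN argument


def resolve(arg, w, variables):
    """Turn an argument (literal, 'w', or a variable name) into a number."""
    if arg == "w":
        return w
    if isinstance(arg, str):
        return variables[arg]
    return arg


def run_part(part, state, w, grid):
    """Execute one program part; return how many rows its GEN instructions used."""
    variables = state["vars"]
    rows_used = 0

    def execute(instr):
        nonlocal rows_used
        op, x, y = instr
        if op == "SET":
            variables[x] = resolve(y, w, variables)
        elif op == "INC":
            variables[x] += y
        elif op == "DEC":
            if variables[x] >= y:
                variables[x] -= y
        elif op == "GEN":
            for offset, block in enumerate((x, y)):
                r, c = state["row"] + offset, state["col"]
                if block is not EMPTY:
                    rows_used = max(rows_used, offset + 1)
                    if r < len(grid) and c < len(grid[0]):
                        grid[r][c] = (block, variables["v0"], variables["v1"])
            state["col"] += 1
        # NOP, and REP reached inside a loop body, do nothing.

    pc = 0
    while pc < len(part):
        op, x, y = part[pc]
        if op == "REP":
            body = part[pc + 1 : pc + 1 + y]
            for _ in range(resolve(x, w, variables)):
                for instr in body:
                    execute(instr)
            pc += 1 + len(body)
        else:
            execute(part[pc])
            pc += 1
    return rows_used


def develop(parts, env, w, rows, cols):
    """Steps 1-4 of the developmental algorithm; returns the grid of blocks."""
    grid = [[EMPTY] * cols for _ in range(rows)]
    state = {"row": 0, "col": 0, "vars": {"v0": 0, "v1": 0, "v2": 0, "v3": 0}}
    e = 0
    while e < len(env) and state["row"] < rows:
        state["row"] += run_part(parts[env[e]], state, w, grid)
        e += 1
        state["col"] = 0
    return grid


# Hypothetical part: w AND gates in one row, each with a different first input.
part0 = [("REP", "w", 2), ("GEN", "AND", EMPTY), ("INC", "v0", 1)]
grid = develop([part0], env=(0,), w=4, rows=8, cols=8)
```

Each grid cell stores the block type together with the values of v0 and v1 at generation time, since only these two variables determine the primary inputs of an AND gate.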
5 Evolutionary System Setup
A chromosome consists of a linear array of instructions, each of which is represented by an operation code and two arguments (the utilization of the arguments depends on the type of the instruction). The array contains n parts of the developmental program stored in sequence, whose lengths (numbers of instructions) are $l_0, l_1, \ldots, l_{n-1}$. The number of parts and their lengths are determined by the designer. In general, the structure of a chromosome can be expressed as $i_{0,0}\, i_{0,1} \ldots i_{0,l_0-1};\ \ldots;\ i_{n-1,0}\, i_{n-1,1} \ldots i_{n-1,l_{n-1}-1}$, where $i_{j,k}$ denotes the k-th instruction of the j-th part of the program for $k = 0, 1, \ldots, l_j - 1$ and $j = 0, 1, \ldots, n - 1$. During the application of the genetic operators the parts of the program are not distinguished, i.e. the chromosome is handled as a single sequence of instructions. The chromosomes have constant length during the evolution. The population consists of 32 chromosomes, which are generated randomly at the beginning of evolution. Tournament selection of base 2 is utilized. Mutation of a chromosome is performed by a random selection of an instruction followed by a random choice of a part of the instruction (the operation code or one of its arguments). If the operation code is to be mutated, an entirely new instruction replaces the original one; otherwise one argument is mutated. The mutation algorithm ensures proper values of the arguments depending on the instruction type (see Table 1). Mutation is performed with probability 0.03, and only one instruction per chromosome is mutated. A special crossover operator is applied with probability 0.9, working as follows. Two parent chromosomes are selected and an instruction is selected randomly in each of them ($i_1$, $i_2$). A position (index) is chosen randomly in each of the two offspring ($c_1$, $c_2$). After the crossover, the first offspring contains the original instructions of the first parent and the second offspring those of the second parent, with the exception that $i_1$ is put at position $c_2$ in the second offspring and $i_2$ is put at position $c_1$ in the first offspring. The fitness function is calculated as the number of bits processed correctly by the multiplier developed by means of the program stored in the chromosome.
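The chromosome layout, the tournament selection and the crossover operator described above can be sketched as follows. The function names are hypothetical, instructions are treated as opaque list items, and the crossover follows one reading of the description: each offspring is a copy of its parent in which a single randomly chosen position is overwritten by an instruction taken from the other parent, so the chromosome length stays constant. Ties in the tournament are resolved in favor of the first pick, which is also an assumption.

```python
import random


def split_parts(chromosome, lengths):
    """Recover the program parts from the flat chromosome."""
    parts, start = [], 0
    for length in lengths:
        parts.append(chromosome[start:start + length])
        start += length
    return parts


def tournament(population, fitnesses, rng):
    """Tournament selection of base 2: the fitter of two random individuals."""
    i = rng.randrange(len(population))
    j = rng.randrange(len(population))
    return population[i] if fitnesses[i] >= fitnesses[j] else population[j]


def crossover(parent1, parent2, rng):
    """Exchange one randomly selected instruction between two offspring."""
    i1 = rng.choice(parent1)            # instruction selected in the first parent
    i2 = rng.choice(parent2)            # instruction selected in the second parent
    c1 = rng.randrange(len(parent1))    # target position in the first offspring
    c2 = rng.randrange(len(parent2))    # target position in the second offspring
    child1, child2 = list(parent1), list(parent2)
    child1[c1] = i2
    child2[c2] = i1
    return child1, child2
```

Passing an explicit `random.Random` instance as `rng` keeps runs reproducible, which is convenient when many independent runs are compared.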
The experiments were conducted with the evolution of programs for the construction of 4 × 4 multipliers, i.e. the parameter w = 4. There are $2^{4+4} = 256$ possible test vectors and the multipliers produce 8-bit results. Therefore, the maximum fitness representing a working solution equals 256 · 8 = 2048. After the evolution the resulting program is verified in order to determine whether it is able to create larger multipliers, typically up to the size of 14 × 14 bits. This circuit size was determined experimentally: it allows a sufficient number of developmental steps to be performed for demonstrating the correctness of the evolved program while keeping the verification time reasonable. If a program shows this ability, it is considered general.
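The fitness calculation can be sketched as below. The developed circuit is abstracted as a hypothetical callable `multiplier(a, b)` returning an integer; how the grid of building blocks is actually simulated is not covered here.

```python
def fitness(multiplier, w):
    """Count correctly computed output bits over all 2**(2*w) input pairs."""
    out_bits = 2 * w
    mask = (1 << out_bits) - 1
    score = 0
    for a in range(1 << w):
        for b in range(1 << w):
            wrong = (multiplier(a, b) ^ (a * b)) & mask   # 1-bits mark errors
            score += out_bits - bin(wrong).count("1")
    return score


def max_fitness(w):
    """Number of test vectors times the number of product bits."""
    return (1 << (2 * w)) * (2 * w)
```

For w = 4 the maximum is 256 · 8 = 2048, which a perfect multiplier such as `lambda a, b: a * b` reaches. The same function with a larger w could serve for the generality check, at a cost that grows as $2^{2w}$.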
6 Experimental Results and Discussion
The experiments were devoted to the design of carry-save multipliers, which exhibit better properties than the classic multipliers. In [10] external information called an environment was introduced into the development for additional control of the developmental process. Moreover, an ability of the design to adapt to different environments was observed, which allows different multiplier structures to be created. This feature was utilized in the experiments presented
herein in order to investigate the ability of the evolutionary design system to construct carry-save multipliers. The selection of the evolved programs and circuits presented in this paper is based on their generality (i.e. the ability to construct generic multipliers) and their resemblance to the carry-save multiplier structure with respect to the circuit delay and the number of building blocks the developed multipliers are composed of. In the first set of experiments a subset of the building blocks from Fig. 3 was chosen for the design of the carry-save multipliers (see Fig. 1b): only the blocks (a, b, c, d, g, h) were involved in the design process. Considering the irregular structure of the conventional carry-save multiplier, a program consisting of four parts is to be evolved. The parts of the program are executed according to the environment env = (0, 1, 2, 2, 3), which is specified a priori with respect to the structure of the carry-save multiplier. The construction of the circuit is therefore performed as follows. Considering Fig. 1b, the first level of AND gates is created using part 0. The second level of AND gates together with the following level of adders is constructed by means of part 1. According to the environment, the next levels of AND gates and adders are created by applying part 2 twice. Finally, part 3 is utilized to design the last level of adders. Two hundred independent runs of the evolutionary algorithm were conducted, of which 18% evolved a correct program for the construction of 4 × 4 multipliers. 60% of the evolved programs were classified as general, i.e. able to create arbitrarily large multipliers. Figure 4 shows (a) one of the best evolved general programs and (b) a 4 × 4 multiplier constructed by means of that program.
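The paper gives the environment only for w = 4. Following the level-by-level description above, one plausible generalization to other operand widths repeats part 2 once for each remaining AND level, i.e. w − 2 times. The helper below is a sketch of that assumption, with a hypothetical name; it is not a rule stated in the paper.

```python
def carry_save_environment(w):
    """Assumed environment for a w x w carry-save multiplier.

    Part 0: first AND level; part 1: second AND level plus adders;
    part 2: each further AND level plus adders; part 3: last adder level.
    """
    if w < 2:
        raise ValueError("operand width must be at least 2")
    return (0, 1) + (2,) * (w - 2) + (3,)
```

For w = 4 this reproduces env = (0, 1, 2, 2, 3), the sequence used in the experiments.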
Fig. 4. (a) An evolved general program, (b) 4 × 4 multiplier exhibiting the carry-save structure created by means of this program. Note that blank rectangles represent empty blocks (not generated by any instruction) whose outputs are considered as logic 0.

At the beginning of the development, the system is initialized: the variables are set to 0, the parameter is set to 4, the row and column positions are initialized to 0 and the number of rows and columns is limited to 8, so that no gate may be generated beyond the grid boundaries. According to the first element of the environment (0), part 0 of the evolved program is executed (see Fig. 4a). The first REP instruction initiates a loop that repeats the two instructions following it 4 times (w = 4 when designing a 4 × 4 multiplier). In each pass, an AND gate (code 2 in the argument of the GEN instruction at line 2) is generated with its inputs connected to the primary inputs of the circuit indexed by the values of the variables v0, v1. Moreover, v0 is increased by 1 (line 3) so that the AND gates generated in different passes have different first inputs. After executing a GEN instruction, the column position is increased by 1. After finishing part 0, the row position is increased by 1 and the column position is set to 0. According to the next element of the environment (1), part 1 is executed. Note, however, that the GEN instruction at line 2 of part 1 generates two building blocks into the current column, the second block “under” the first one: a full adder is generated into the second row of the first column (code 5 in the first GEN argument) and the identity function is generated into the third row of the first column (code 0 in the second argument). Since building blocks have been generated into two rows, the row position is increased by 2 after finishing part 1. When part 3 is executed, only the full adders are generated (code 5 of the GEN instructions at lines 4 and 5) because there is no space left in the grid for the second level of blocks specified by the second argument of the GEN instructions; the number of rows of the grid was limited to 8 for this experiment. It is evident that the multiplier shown in Fig. 4 could be optimized with respect to the inputs of some building blocks (e.g. adders with only one non-zero input could be replaced by identity functions, as demonstrated in [10]). After this optimization the circuit corresponds to the carry-save multiplier shown in Fig. 1b.
The second set of experiments was devoted to the design of multipliers using the full set of building blocks shown in Fig. 3 and the same form of environment as in the previous experiment. This setup therefore corresponds to both variants of the multipliers from Fig. 1. The prefix (0, 1, 2, 2) of the environment may be utilized for the evolution of the classic multiplier structure shown in Fig. 1a. Again, 200 independent experiments were conducted, of which 37% produced working programs, and 54% of these were classified as general. However, the experiments showed that the evolution of efficient carry-save multipliers is extremely difficult using this setup. Although all the resources of the first set of experiments are available, no valid carry-save structure was obtained. The evolution generated the carry-save components very rarely and not at positions where they could be usefully employed during the circuit operation. The classic structures (Fig. 1a) were evolved instead. An example of a general program together with a 4 × 4 multiplier is shown in Fig. 5; it represents the same type of classic multiplier structure that was evolved in [10].

Fig. 5. (a) An evolved general program, (b) a 4 × 4 multiplier based on the structure of the classic combinational multiplier. Blank rectangles represent empty blocks whose outputs are logic 0.

The experiments presented in this section represent a continuation of the successful research in the field of the evolutionary design of generic multipliers using development. The phenomenon of adaptation of the developmental process to different environments during the evolution, introduced in [10], enabled us to design various multiplier structures. In particular, the carry-save structure, which exhibits a shorter delay than the classic multiplier, was rediscovered in this paper; this was the main goal of the new sets of experiments. Although the carry-save multipliers proved to be very hard to evolve, the evolutionary developmental system demonstrated the ability to design this class of multipliers using a reduced set of building blocks. Moreover, simplified building blocks were introduced in this paper together with a developmental model improved in comparison with [10]. The evolutionary process is therefore less restricted, which, however, makes the evolution more difficult because of the lower level of abstraction in the circuit representation. The success of the evolution of carry-save multipliers demonstrates the ability of the experimental system to design different circuit structures with more complex interconnections of their components, which represents a promising area for further research.
7 Conclusions
In view of the successful experiments, there is considerable potential for applying this model to other classes of well-scalable circuits, e.g. adders, median networks and sorting networks. Future research will therefore focus on adjusting the existing system to specific circuit structures in order to investigate the evolution of designs involving other building blocks and environments with respect to the construction of generic combinational circuits.

Acknowledgements. This work was supported by the Grant Agency of the Czech Republic under contract No. 102/07/0850 Design and hardware implementation of a patent-invention machine, No. 102/05/H050 Integrated Approach
to Education of PhD Students in the Area of Parallel and Distributed Systems and the Research Plan No. MSM 0021630528 Security-Oriented Research in Information Technology.
References
1. Miller, J.F., Thomson, P.: Cartesian genetic programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000)
2. Miller, J.F., Job, D.: Principles in the evolutionary design of digital circuits – part I. Genetic Programming and Evolvable Machines 1(1), 8–35 (2000)
3. Miller, J.F., Job, D.: Principles in the evolutionary design of digital circuits – part II. Genetic Programming and Evolvable Machines 3(2), 259–288 (2000)
4. Vassilev, V., Job, D., Miller, J.: Towards the automatic design of more efficient digital circuits. In: Proc. of the Second NASA/DoD Workshop on Evolvable Hardware, Palo Alto, CA, pp. 151–160. IEEE Computer Society Press, Los Alamitos (2000)
5. Vassilev, V., Miller, J.F.: Scalability problems of digital circuit evolution. In: Proc. of the 2nd NASA/DoD Workshop of Evolvable Hardware, Los Alamitos, CA, US, pp. 55–64. IEEE Computer Society Press, Los Alamitos (2000)
6. Murakawa, M., Yoshizawa, S., Kajitani, I., Furuya, T., Iwata, M., Higuchi, T.: Hardware evolution at function level. In: Ebeling, W., Rechenberg, I., Voigt, H.M., Schwefel, H.-P. (eds.) PPSN IV 1996. LNCS, vol. 1141, pp. 206–217. Springer, Heidelberg (1996)
7. Torresen, J.: Evolving multiplier circuits by training set and training vector partitioning. In: Tyrrell, A.M., Haddow, P.C., Torresen, J. (eds.) ICES 2003. LNCS, vol. 2606, pp. 228–237. Springer, Heidelberg (2003)
8. Stomeo, E., Kalganova, T., Lambert, C.: Generalized disjunction decomposition for evolvable hardware. IEEE Transactions on Systems, Man and Cybernetics – Part B 36, 1024–1043 (2006)
9. Aoki, T., Homma, N., Higuchi, T.: Evolutionary synthesis of arithmetic circuit structures. Artificial Intelligence Review 20, 199–232 (2003)
10. Bidlo, M.: Evolutionary development of generic multipliers: Initial results. In: Proc. of The 2nd NASA/ESA Conference on Adaptive Hardware and Systems, AHS 2007, IEEE Computer Society Press, Los Alamitos (2007)
11. Wakerly, J.F.: Digital Design: Principles and Practice. Prentice Hall, New Jersey, US (2001)
12. Kumar, S., Bentley, P.J. (eds.): On Growth, Form and Computers. Elsevier Academic Press, Amsterdam (2003)
Automatic Synthesis of Practical Passive Filters Using Clonal Selection Principle-Based Gene Expression Programming

Zhaohui Gan¹, Zhenkun Yang¹, Gaobin Li¹, and Min Jiang²

¹ College of Information Science and Engineering, Wuhan University of Science & Technology, 430081 Wuhan, China
[email protected], [email protected]
² College of Computer Science, Wuhan University of Science & Technology, 430081 Wuhan, China
[email protected]
Abstract. This paper proposes a new method to synthesize practical passive filters using Clonal Selection principle-based Gene Expression Programming and a binary tree representation. The circuit encoding of this method is simple and efficient. Using this method, both the circuit topology and the component parameters can be evolved simultaneously. Discrete component values are used in the algorithm for practical implementation. Two kinds of filters are synthesized to verify the method; the experimental results show that this approach can generate passive RLC filters quickly and effectively.
1 Introduction Automatic design of electronic circuit is the dream of electronic engineers. Many scholars have done a lot of research on this direction. Until now, automatic design of digital circuit has made great progress. However, analog circuit synthesis, including topology and sizing, is a complex problem. Most analog circuits were designed by skilled engineer who uses conventional methods based on domain-specific knowledge. However, recent significant development of computer technology and circuit theory made it possible for us to take some approaches to automatically synthesize analog circuits by computers. Analog circuit synthesis involves both the sizing (component value) and topology (circuit structure) optimization. Recently, remarkable progress has been made in analog circuit synthesis. Horrocks successfully applied Genetic Algorithms (GA) into component value optimization for passive and active filter using preferred component values [1, 2]. Parallel Tabu Search Algorithm (PTS) was also applied into the same problem by Kalinli [3]. GA was also applied to select circuit topology and component values for analog circuit by Grimbleby [4]. Koza and his collaborators have done lots of research on automatic analog circuit synthesis by means of Genetic Programming (GP) [5, 6], it maybe the most notable progress in this field. Based on GP, they developed circuit-construction program trees which have four kinds of circuit-construction L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 89–99, 2007. © Springer-Verlag Berlin Heidelberg 2007
Z. Gan et al.
functions. The behavior of each circuit was evaluated by SPICE (Simulation Program with Integrated Circuit Emphasis). The main drawbacks of this technique are that it is very complex to implement and requires a large amount of computing time. Some topology-restricted approaches were proposed to reduce the computing time. Lohn and Colombano developed a compact circuit topology representation that decreases the complexity of evolutionary synthesis of analog circuits [7]. These representation methods exclude some potential topologies and decrease the running time. Using a parallel genetic algorithm, these algorithms allow circuit size, circuit topology, and component values to be evolved [8]. In [9, 10], a novel tree representation method was proposed to synthesize basic RLC filter circuits; this representation is restricted to series-parallel topologies. Compared with general-topology methods, this approach is more efficient for passive circuit synthesis. Gene Expression Programming (GEP), a new evolutionary algorithm for the automatic creation of computer programs, was invented by Candida Ferreira in 1999 [11]. The main difference between GA, GP and GEP is the form of individual encoding: individuals in GA are linear strings of fixed length, whereas GP individuals are trees of different shapes and sizes. GEP individuals are also trees of different shapes and sizes, but they are encoded as linear strings of fixed length using Karva notation [12]. GEP combines the advantages of both GA and GP while overcoming some of their shortcomings. The Clonal Selection Algorithm (CSA), inspired by natural immunological mechanisms, has been successfully applied in several challenging domains, such as multimodal optimization, data mining and pattern recognition [13, 14]. CSA can enhance the diversity of the population and converges quickly.
Clonal Selection principle-based Gene Expression Programming (CS-GEP) [15] is an evolutionary algorithm proposed by us that combines the advantages of the Clonal Selection Algorithm and GEP. In this paper, CS-GEP is applied to practical passive filter synthesis on the basis of the binary tree representation. The method allows both the circuit topology and the sizing to be evolved simultaneously. Furthermore, to account for practical conditions, discrete component values are used in the passive filter synthesis, which makes the results convenient for engineering implementation. Two kinds of passive filter design tasks are carried out to evaluate the proposed approach; the experimental results demonstrate that our method can synthesize passive filters effectively. This paper is organized as follows. Section 2 gives an overview of the related work, including the circuit representation method and GEP. Section 3 explains the circuit encoding method in detail and applies the CS-GEP algorithm to practical passive filter synthesis. The experiments on a low-pass and a high-pass filter design are covered in Section 4. Conclusions and further work are given in Section 5.
2 Related Work and Motivations
2.1 An Overview of Gene Expression Programming
Gene Expression Programming is a new evolutionary algorithm for the automatic creation of computer programs. Similar to GP, GEP also uses randomly generated
Automatic Synthesis of Practical Passive Filters
population and applies genetic operators to this population until the algorithm finds an individual that satisfies the termination criteria. The main difference between GP and GEP is that GEP individuals are encoded as linear strings of fixed length using Karva notation, which makes the chromosome simple, linear, and compact. In GEP, each gene is composed of a fixed number of symbols, divided into a head and a tail. The head contains symbols that represent both functions and terminals, whereas the tail contains only terminals. For each problem, the length of the head h is chosen by the user, whereas the length of the tail t is given by:
t = h × ( n − 1) + 1
(1)
where n is the largest arity of the functions in the function set. For example, given the function set F = {E, Q, S, +, -, *, /} and the terminal set T = {x, y}, suppose the algebraic expression is:
\sqrt{\sin(xy) + y + e^{x}}    (2)
If we set h = 10, then since n = 2 here, t = 11. The gene is shown in Fig. 1:

Position: 0 1 2 3 4 5 6 7 8 9 | 0 1 2 3 4 5 6 7 8 9 0
Symbol:   Q + S + * y E x y x | y x x x y y x y x y x
          (head)              | (tail)
Fig. 1. The gene of GEP for equation (2)

[Figure 2: (a) The expression tree of Eq. (2); (b) A four-gene chromosome linked by addition]
Fig. 2. The expression tree of gene and chromosome
Here Q represents the square-root function, E the exponential function, and S the sine function. The gene can be represented by the expression tree shown in Fig. 2 (a). GEP chromosomes are usually composed of several genes of equal length. The interaction between the genes is specified by the linking function. An example of a four-gene chromosome linked by addition is shown in Fig. 2 (b).
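The decoding of a Karva-notation gene can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the gene is read breadth-first, each function symbol takes as many children as its arity, and whatever remains unread is the non-coding part of the gene. The helper names (`tail_length`, `decode`, `evaluate`) and the nested-list tree layout are assumptions made for this example.

```python
import math
import operator
from collections import deque

# Function set of the example: Q = square root, E = exponential, S = sine.
UNARY = {"Q": math.sqrt, "E": math.exp, "S": math.sin}
BINARY = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}
# Any symbol that is not listed here is treated as a terminal (arity 0).
ARITY = {**{s: 1 for s in UNARY}, **{s: 2 for s in BINARY}}


def tail_length(head_length, max_arity):
    """Eq. (1): t = h * (n - 1) + 1."""
    return head_length * (max_arity - 1) + 1


def decode(gene):
    """Read a Karva-notation gene breadth-first into a [symbol, children] tree."""
    symbols = iter(gene)
    root = [next(symbols), []]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [next(symbols), []]
            node[1].append(child)
            queue.append(child)
    return root  # symbols that were never read form the non-coding region


def evaluate(node, env):
    """Evaluate an expression tree for the terminal values given in env."""
    symbol, children = node
    args = [evaluate(child, env) for child in children]
    if symbol in UNARY:
        return UNARY[symbol](args[0])
    if symbol in BINARY:
        return BINARY[symbol](args[0], args[1])
    return env[symbol]


GENE = "Q+S+*yExyxyxxxyyxyxyx"  # head of 10 symbols followed by a tail of 11
tree = decode(GENE)
value = evaluate(tree, {"x": 1.0, "y": 2.0})
```

Because the tail holds only terminals and has length t from Eq. (1), the reader never runs out of symbols, which is why every gene of this form decodes to a valid expression.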
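[Figure 3: (a) Schematic of an example RLC circuit driven by VS through RS 1k and loaded by RL 1k at Vout, with components labeled L1 22uH, L2 0.33uH, L3 100uH, L4 47uH, L5 68uH, C1 0.1uF, C2 82uF, C3 0.22uF; (b) Binary tree representation, with series (+) and parallel (//) connection nodes over the leaves RS, L1, C1, L2, L3, C2, L4, C3, L5, RL]
Fig. 3. An example of RLC circuit and its tree representation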
2.2 Circuit Representation and Analysis
A binary tree can be used to represent the structure of series-parallel RLC circuits [9]; an example RLC circuit and its corresponding binary tree representation are shown in Fig. 3. There are two kinds of nodes in the binary tree: connection nodes and component nodes. Two types of connection nodes represent series and parallel connections, denoted by + and //, respectively. The component nodes consist of three types of passive electrical components: R (resistor), L (inductor) and C (capacitor). Compared with other evolutionary approaches that rely on SPICE simulation, the circuit analysis algorithm for this circuit representation is very simple. First, the impedance of each node is calculated from the leaves to the root. Second, starting from the root node, the current flows down through the tree, and the current of each node is obtained according to circuit theory [16]. If the node is a series node, the current flowing out to each child equals the current flowing in; if the node is a parallel node, the current flowing in is divided among the child nodes in inverse proportion to their impedances. Finally, the voltage across RL is calculated by multiplying its current by its impedance, and the voltage gain is obtained by dividing the voltage across RL by the source voltage.
2.3 Motivation from Binary Tree Representation and GEP
The binary tree representation of RLC circuits is compact and efficient for synthesizing practical passive filters. GEP retains the benefits of GA and GP, and its chromosome is simple and efficient thanks to the linear encoding. These advantages motivated us to develop an algorithm that combines the binary tree representation and GEP to synthesize practical passive filters.
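The tree-based circuit analysis of Section 2.2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the class names `Component` and `Connection`, the function names, and the example RC circuit are hypothetical. Impedances are computed from the leaves to the root, then the source current is pushed down the tree, passing unchanged through series nodes and splitting in inverse proportion to the branch impedances at parallel nodes.

```python
import math
from dataclasses import dataclass
from typing import Union


@dataclass
class Component:
    name: str     # label such as "RS" or "RL"
    kind: str     # "R", "L" or "C"
    value: float  # ohms, henries or farads


@dataclass
class Connection:
    op: str       # "+" for series, "//" for parallel
    left: "Node"
    right: "Node"


Node = Union[Component, Connection]


def impedance(node, omega):
    """Complex impedance of a subtree at angular frequency omega (> 0)."""
    if isinstance(node, Component):
        if node.kind == "R":
            return complex(node.value, 0.0)
        if node.kind == "L":
            return 1j * omega * node.value
        if node.kind == "C":
            return 1.0 / (1j * omega * node.value)
        raise ValueError(f"unknown component kind: {node.kind}")
    z_left = impedance(node.left, omega)
    z_right = impedance(node.right, omega)
    if node.op == "+":
        return z_left + z_right
    if node.op == "//":
        return z_left * z_right / (z_left + z_right)
    raise ValueError(f"unknown connection: {node.op}")


def voltage_across(node, name, current, omega):
    """Push a current down the tree; return the voltage across component
    `name`, or None if it is not in this subtree."""
    if isinstance(node, Component):
        return current * impedance(node, omega) if node.name == name else None
    if node.op == "+":
        i_left = i_right = current          # series: same current in both children
    else:
        z_left = impedance(node.left, omega)
        z_right = impedance(node.right, omega)
        i_left = current * z_right / (z_left + z_right)   # current divider
        i_right = current * z_left / (z_left + z_right)
    found = voltage_across(node.left, name, i_left, omega)
    if found is None:
        found = voltage_across(node.right, name, i_right, omega)
    return found


def voltage_gain(root, frequency_hz, load="RL", source_voltage=1.0):
    """Complex voltage gain V(load) / VS for the whole tree."""
    omega = 2.0 * math.pi * frequency_hz
    total_current = source_voltage / impedance(root, omega)
    return voltage_across(root, load, total_current, omega) / source_voltage


# Hypothetical first-order example: RS in series with (C1 parallel to RL).
rs = Component("RS", "R", 1e3)
rl = Component("RL", "R", 1e3)
c1 = Component("C1", "C", 1e-6)
low_pass = Connection("+", rs, Connection("//", c1, rl))
gain_low = abs(voltage_gain(low_pass, 1.0))
gain_high = abs(voltage_gain(low_pass, 1e6))
```

The sketch recomputes subtree impedances during the downward pass for brevity; caching them per node would avoid the repeated work on larger trees.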
3 Synthesize RLC Filters Using CS-GEP
3.1 Circuit Encoding Method
The template of the RLC passive filter is shown in Fig. 4, where VS is the input signal and Vout is the output signal; RS and RL are both 1 kΩ. Fig. 4 (a) is the schematic of the circuit; the part enclosed by the dashed line is to be evolved by our algorithm. Fig. 4 (b) is the corresponding tree representation of this circuit.

[Figure 4: (a) The schematic, with labels VS, RS 1k, Evolved RLC Circuit, RL 1k, Vout; (b) The tree representation, with labels +, RS, Evolved RLC Tree, RL]
Fig. 4. The template of evolved RLC circuit

The most difficult part of analog circuit synthesis is how to encode the circuit, including its topology and sizing. The main issue is to find a rule that describes an analog circuit correctly. In this paper, the binary tree representation combined with GEP is employed to encode series-parallel RLC circuits. The circuit in Fig. 3 is encoded in Fig. 5. Fig. 5 (a) shows that a gene consists of three parts: head, tail and value domain. The topology of a circuit is represented by the head and tail, and the sizing of the circuit is determined by the value domain. The structure of the circuit is encoded in Fig. 5 (b); only two kinds of functions, + and //, exist in the GEP encoding of a circuit. The component values of the circuit are listed in Fig. 5 (c).

(a) The structure of a gene
Topology: Head, Tail
Sizing: Value Domain

(b) The head and tail of the gene
Head (positions 0 to 18): + RS + L1 // C1 + L2 + // // L3 C2 + + L4 C3 L5 RL
Tail (positions 19 to 38): R C L L R C R L C R R R L C C L R L L R

(c) The value domain of the gene
Value Domain: 5 11 18 14 2 9 63 25 34 7 4 54 12 36 50 48 2 19 12 58

Fig. 5. The gene of the circuit in Fig. 3

Manufactured preferred component values typically follow the E12 series, which has twelve values in each decade: 10, 12, 15, 18, 22, 27, 33, 39, 47, 56, 68, 82, 100, ... Each component value ranges over five decades, so sixty values (the Component Value Set) are available for each component. The numbers in the value domain denote the indices of the values in the Component Value Set.
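The Component Value Set can be sketched as follows. The twelve E12 values per decade and the five-decade span come from the text; the starting decade of each component type and the zero-based indexing are assumptions of this sketch, since the paper does not state them, and the names `component_value_set` and `decode_value` are hypothetical.

```python
# E12 preferred values, normalized to one decade.
E12_MANTISSAS = (1.0, 1.2, 1.5, 1.8, 2.2, 2.7, 3.3, 3.9, 4.7, 5.6, 6.8, 8.2)


def component_value_set(lowest_exponent, decades=5):
    """Return the 12 * decades preferred values starting at 10**lowest_exponent.

    lowest_exponent is an assumption: the source does not say which
    decade each component type starts from.
    """
    return [mantissa * 10.0 ** (lowest_exponent + decade)
            for decade in range(decades)
            for mantissa in E12_MANTISSAS]


def decode_value(index, value_set):
    """Map a value-domain entry to a component value (zero-based: assumption)."""
    return value_set[index]


# Hypothetical resistor set covering 10 ohm up to 820 kilo-ohm.
resistor_values = component_value_set(1)
```

Restricting the search to such a table means every evolved circuit can be built from stock parts without a separate rounding step after optimization.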
Each individual of CS-GEP consists of a gene encoded as in Fig. 5, which always yields a valid circuit. CS-GEP evolves a population of individuals until the algorithm finds a circuit that satisfies the predefined specification.
3.2 The Framework of CS-GEP
The Artificial Immune System (AIS) is a class of evolutionary algorithms inspired by the natural immune system and used to solve real-world problems. The Clonal Selection Principle (CSP) establishes the idea that only those cells that recognize the antigens are selected to proliferate. When an antibody recognizes an antigen with a certain affinity, it is selected to proliferate asexually. During asexual reproduction, the B-cell clones undergo hypermutation; together with a high selection pressure, this improves their affinities. At the end of this process, the B-cells with higher antigenic affinity are selected to become memory cells with long life spans. Based on the CSP, the Clonal Selection Algorithm (CLONALG) was proposed by de Castro and Von Zuben to solve complex problems [14]. Combining the advantages of CSA and GEP, Clonal Selection principle-based Gene Expression Programming (CS-GEP) is proposed as a new evolutionary algorithm. The algorithm is briefly described as follows:
Step 1: Initialization. An initial population P (N individuals) is generated randomly.
Step 2: Evaluation. Evaluate the fitness of each individual. The fitness function is based on the mean squared error (MSE) E of the individual, expressed as:
E = \frac{1}{n} \sum_{i=1}^{n} W_i (A_i - T_i)^2    (3)
where n is the number of frequency sampling points, i is the index of the sampling point, W_i is the weight of sampling point i, and A_i and T_i are the actual and target frequency responses of the circuit at sampling point i. The fitness of the individual is then defined as:
fitness = 1000 \cdot \frac{1}{1 + E}    (4)
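Equations (3) and (4) can be sketched directly in code. This is a minimal illustration; the function names and the list-based inputs are assumptions of the example, and the choice of sampling points and weights is left to the caller because it is problem specific.

```python
def weighted_mse(actual, target, weights):
    """Eq. (3): E = (1/n) * sum_i W_i * (A_i - T_i)**2."""
    if not actual or not (len(actual) == len(target) == len(weights)):
        raise ValueError("need equally long, non-empty sequences")
    n = len(actual)
    return sum(w * (a - t) ** 2
               for a, t, w in zip(actual, target, weights)) / n


def fitness(actual, target, weights):
    """Eq. (4): fitness = 1000 / (1 + E), so a perfect match scores 1000."""
    return 1000.0 / (1.0 + weighted_mse(actual, target, weights))
```

Since E is never negative, the fitness lies in (0, 1000] and grows as the response approaches the target.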
Step 3: Selection Select n (n [...]

P_a = \begin{cases} [...], & sq > sq_0 \\ 0.95, & sq \le sq_0 \end{cases}
P_b = \begin{cases} 0, & sq > sq_0 \\ 0.05, & sq \le sq_0 \end{cases}    (3)
Q. Wu et al.
The fitness function is given in equation 4.
Fitness' = \frac{P_a}{\frac{1}{n-1} \sum_{i=1}^{n} (U_o(i) - U_o'(i))^2} + \frac{P_b}{A_m - A_{m0}}    (4)
In equation 4, only the square error is evaluated in the early stage of the evolution; the gain enters the evaluation only once the square error is no larger than sq0. The flowchart of the extrinsic evolution of analog circuits is shown in Fig. 2.
Fig. 2. Flowchart of extrinsic evolution of analog circuits
4 Experiments
4.1 Evolution of Feedback Amplifier
Sine and square waves are used as input signals in the experiment. The sine wave is defined as in formula (1). The parameters of the square wave are as follows: the desired gain (Am0) is 5, the frequency is 1 kHz, and the maximum voltage (v0) is 0.01 V.
Analog Circuit Evolution Based on FPTA-2
Table 1. Evolved results of amplifiers

Number   Sine wave                Square wave
         Open-loop   Feedback     Open-loop   Feedback
1        2.562       10.365       1.147       3.348
2        3.134       8.581        1.095       4.143
3        2.307       14.012       1.954       3.316
4        5.046       12.955       2.075       4.324
5        1.497       11.231       2.034       3.357
6        5.164       9.021        1.845       5.549
7        2.45        13.357       1.157       3.348
8        6.358       15.025       1.034       5.051
9        1.754       14.344       1.125       3.167
10       7.46        9.348        1.462       4.921
11       3.011       16.127       1.039       3.654
12       4.046       16.696       1.535       4.254
13       2.132       8.395        2.014       2.395
14       1.963       10.184       1.951       2.168
15       5.486       9.946        1.674       3.214
In the experiments, the mutation probability is 0.08, the maximum number of generations is 10000, and the maximum fitness value is 40. Fifteen experiments are performed. An FPTA-2 cell without the feedback structure is also evolved as an amplifier, with all other parameters unchanged. The evolution results are shown in Table 1. As Table 1 shows, the fitness values of the open-loop circuits are much smaller than those of the feedback amplifiers. This means that FPTA-2 cells with a feedback structure are easier to evolve into an amplifier than cells without it. The best results for the two circuit types are shown in Fig. 3 to Fig. 6. For the circuits without the feedback structure, the best result emerges in the 10th experiment. The maximum fitness value is 7.46, and the corresponding chromosome is: 110001110001000110001111110010110101111011101111010101101. As shown in Fig. 3(a), the peak-to-peak values differ between the two cycles of the output, and the larger one is 30.1 mV. The gain is about 3.01 and the frequency is about 1 kHz, as can be estimated from Fig. 3(b). There is a comparatively large frequency component at 0 Hz.
This is because a bias voltage exists in the output of the transistor circuit; it also exists in the other results. Although the output curve looks smooth, the result is not good, since the voltage gain is much smaller than desired. The passband covers only a small range around 1 kHz, as shown in Fig. 3(c). The results obtained from circuits with the feedback structure in Fig. 4 are much better than those in Fig. 3. The maximum fitness value is 16.696, found in the 12th experiment, and the corresponding chromosome is: 101011011101100110010011111 110010111001001001000110110101. The peak-to-peak value is approximately 49 mV, and the gain is about 4.9. The period of the output signal is 1 ms, and the curve is very smooth. Although the curve is still slightly distorted, there are no harmonics in the signal, which means the distortion coefficient of the circuit is small. The passband in Fig. 4(c) is much wider than that in Fig. 3(c). The amplifier behaves as a low-pass filter. The total time consumption is slightly larger in the feedback experiments than in the open-loop ones, because the chromosome contains more genes. However, it takes less time to reach a given fitness value in the feedback experiments, so convergence is faster in Fig. 4 than in Fig. 3. The results in Fig. 5 and Fig. 6 are largely the same as those in Fig. 3 and Fig. 4, respectively. The open-loop circuit cannot obtain results as good as those of the feedback circuits. There are harmonics in the square-wave output of the circuit with feedback, which degrades the result.
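The comparisons drawn from Table 1 can be checked with a few lines of code. The sketch below only restates the tabulated fitness values; the list names and the helper `best_run` are hypothetical and introduced for this illustration.

```python
# Fitness values copied from Table 1, experiments 1 to 15.
SINE_OPEN = [2.562, 3.134, 2.307, 5.046, 1.497, 5.164, 2.45, 6.358,
             1.754, 7.46, 3.011, 4.046, 2.132, 1.963, 5.486]
SINE_FEEDBACK = [10.365, 8.581, 14.012, 12.955, 11.231, 9.021, 13.357, 15.025,
                 14.344, 9.348, 16.127, 16.696, 8.395, 10.184, 9.946]
SQUARE_OPEN = [1.147, 1.095, 1.954, 2.075, 2.034, 1.845, 1.157, 1.034,
               1.125, 1.462, 1.039, 1.535, 2.014, 1.951, 1.674]
SQUARE_FEEDBACK = [3.348, 4.143, 3.316, 4.324, 3.357, 5.549, 3.348, 5.051,
                   3.167, 4.921, 3.654, 4.254, 2.395, 2.168, 3.214]


def best_run(values):
    """Return (1-based experiment number, fitness) of the best experiment."""
    index = max(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


def feedback_always_better(open_loop, feedback):
    """True if the feedback fitness exceeds the open-loop fitness in every run."""
    return all(f > o for o, f in zip(open_loop, feedback))
```

Applied to the table, `best_run` picks experiment 10 for the open-loop sine case and experiment 12 for the feedback sine case, matching the runs discussed in the text.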
[Figure 3: (a) Optimized output; (b) Amplitude frequency; (c) Gain-frequency curve; (d) Max fitness value]
Fig. 3. Result of open-loop structure and sine wave input

[Figure 4: (a) Optimized output; (b) Amplitude frequency; (c) Gain-frequency curve; (d) Max fitness value]
Fig. 4. Result of feedback structure and sine wave input

[Figure 5: (a) Optimized output; (b) Amplitude frequency; (c) Gain-frequency curve; (d) Max fitness value]
Fig. 5. Result of open-loop structure and square wave input

[Figure 6: (a) Optimized output; (b) Amplitude frequency; (c) Gain-frequency curve; (d) Max fitness value]
Fig. 6. Result of feedback structure and square wave input

[Figure 7: (a) Optimized output; (b) Amplitude frequency; (c) Gain-frequency curve; (d) Max fitness value]
Fig. 7. Optimized output after fitness adjustment
4.2 Fitness Adjustment
The fitness function is adjusted for the feedback circuits. As an example, the input sine wave is the same as described above, and 15 experiments are carried out. The output of the best evolved circuit is shown in Fig. 7. The curves in the two cycles are basically the same, because the average gain over the two cycles is taken into account in the fitness. In the evolution, the square error reaches sq0 at generation 7593, when the gain is about 4.8, and the maximum fitness occurs at generation 9205, when the gain is about 5. The results show that the fitness adjustment benefits the evolution of the sine-wave amplifier.
5 Parallel Cell Structure
As circuits become more complex, the number of FPTA-2 cells increases rapidly. The arrangement of the cells and the connections between them are an important factor influencing the evolution. Taking a 4-cell structure as an example, the cells can be arranged as shown in Fig. 8(a). The cells are easily separated into two parts if some switches are configured off. This results in zero output of the circuit, which accounts for a large proportion of the evolution and means wasted time and ineffective evolution. Another 4-cell structure is put forward as shown in Fig. 8(b). Four cells are arranged in two rows and two columns, and cell b and cell c are both connected to the output cell d. If one of the two cells is separated, the other cell is probably still connected to cell d, so the probability of zero output is smaller.

[Figure 8: (a) Serial arrangement; (b) Parallel arrangement]
Fig. 8. Two types of the four-cell arrangement

Taking an integrator circuit as an example, the two types of cell arrangement are tested and compared. The input of the circuit is a DC voltage of 1 V lasting 1 ms. The maximum fitness value curves are presented in Fig. 9(a) and Fig. 9(b). The proportion of zero-output generations among all evolved generations is 0.5267 in Fig. 9(a) and 0.3629 in Fig. 9(b). The parallel arrangement of cells works more efficiently than the serial arrangement. Because the chromosome is longer and multi-cell evolution is more complex, further research is needed on multi-cell circuit evolution.

[Figure 9: (a) Serial arrangement; (b) Parallel arrangement]
Fig. 9. Best fitness of two types of circuit arrangement
6 Conclusion
This paper discusses a method for analog circuit evolution. Evolution of the feedback amplifier gives much better results than evolution of the open-loop circuit. The fitness function is adjusted according to a square error threshold. The four-cell circuit arrangement is revised, and the results show that the parallel arrangement speeds up evolution and makes multi-cell circuit evolution more effective.
References
[1] Design of a 96 decibel operational amplifier and other problems for which a computer program evolved by genetic programming is competitive with human performances. In: 1996 Japan-China Joint Int. Workshop Information Systems, Ashikaga Inst. Technol., Ashikaga, Japan (1996)
[2] Zebulum, R.S., Pacheco, M.A., Vellasco, M.: Synthesis of CMOS Operational Amplifiers through Genetic Algorithms. In: Proc. XI Brazilian Symp. Integrated Circuit Design, pp. 125–128 (1998)
[3] Zhiqiang, Z., Youren, W.: A new Analog Circuit Design Based on Reconfigurable Transistor Array. In: 2006 8th International Conference on Solid-State and Integrated Circuit Technology Proceedings, Shanghai, China, pp. 1745–1747 (2006)
[4] Yuehua, Z., Yu, S.: Analog Evolvable hardware Experiments based on FPTA. Users of Instruments 14(3), 10–12 (2006)
[5] Kaiyang, Z., Youren, W.: Research on evolvable circuits based on reconfigurable analog array. Computer testing and control 13(10) (2005)
[6] Stoica, A., Zebulum, R., Keymeulen, D., Tawel, R., Daud, T., Thakoor, A.: Reconfigurable VLSI Architectures for Evolvable Hardware: from Experimental Field Programmable Transistor Arrays to Evolution-Oriented Chips. IEEE Transactions on VLSI Systems, Special Issue on Reconfigurable and Adaptive VLSI Systems 9(1), 227–232 (2001)
[7] Levi: HereBoy: A Fast Evolutionary Algorithm. In: Proceedings of the Second NASA/DoD Workshop on Evolvable Hardware, Washington, pp. 17–24 (2000)
[8] Vieira, P.F., Sá, L.B., Botelho, J.P.B., Mesquita, A.: Evolutionary Synthesis of Analog Circuits Using Only MOS Transistors. In: Proceedings of the 2004 NASA/DoD Conference on Evolvable Hardware (EH'04) (2004)
[9] Langeheine, J., Trefzer, M., Brüderle, D., Meier, K., Schemmel, J.: On the Evolution of Analog Electronic Circuits Using Building Blocks on a CMOS FPTA. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3102, pp. 1316–1327. Springer, Heidelberg (2004)
Knowledge Network Management System with Medicine Self Repairing Strategy
JeongYon Shim
Division of General Studies, Computer Science, Kangnam University, San 6-2, Kugal-Dong, Kihung-Gu, YongIn Si, KyeongKi Do, Korea, Tel.: +82 31 2803 736
[email protected]
Abstract. In a complex information environment, the role of intelligent systems is growing, and it is essential to develop smarter and more efficient intelligent systems for automatic knowledge acquisition, for structuring memory so that related information can be stored and retrieved efficiently, and for repairing the system automatically. Focusing on self repair, this study designs a Medicine Self Repairing Strategy for knowledge network management. The concepts of Self Type, Internal Entropy and medicine treatment are defined for modeling the Self Repairing System. We applied the proposed system to a virtual memory consisting of a knowledge network and tested the results.
Keywords: Knowledge network management strategy, Internal entropy, medicine.
1 Introduction
Living things are exposed to a dynamic, complex and dangerous environment. To survive, they have developed their own means, such as an intelligent system, of overcoming the difficulties they face. They protect the body from external stimuli and attacks of disease and keep the body in balance through an autonomic self repairing system. When the system cannot overcome the external attacks, the body loses its balance. In oriental medicine, the loss of balance is regarded as a disease, and the main role of medicine treatment is to help the autonomic self repairing system recover the balance of the body. Medicine helps to remove the harmful factors and to make up the deficiencies of the body. As computer technology develops rapidly, the information environment, like the real world, becomes more and more complex. In particular, the internet has changed the paradigm of modern society. With this trend, the role of intelligent systems is growing, and it is essential to develop smarter and more efficient intelligent systems for automatic knowledge acquisition, for structuring memory so that related information can be stored and retrieved efficiently, and for repairing the system automatically. Many studies of intelligent systems adopting ideas from life science have been made and applied to many practical areas in recent decades.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 119–128, 2007. © Springer-Verlag Berlin Heidelberg 2007
Focusing on the self repairing system, in this study Medicine Self Repairing Strategy for knowledge network management is designed. The concepts of Self type, Internal Entropy, medicine treatment are defined for modeling Self Repairing System. We applied this proposed system to virtual memory consisting of knowledge network and tested the results.
2 Intelligent Knowledge Network Management System
An intelligent knowledge management system was designed by the author to integrate the functions of knowledge selection, knowledge acquisition using a modular neural network and symbolic learning, memory structuring, perception and inference, and knowledge retrieval [1]. As shown in Figure 1, data flows through the system as follows. The raw data that pass the Preprocessing module are temporarily stored in the Input Knowledge Pool for efficient information processing. Input data stored in the Input Knowledge Pool are filtered by the Reactive layer and distributed to the ALM (Adaptive Learning Module) or the SKAM (Symbolic Knowledge Acquisition Module). Training data are used for the learning process in the ALM, and the output nodes of the learning frame are connected to nodes in the higher Associative Knowledge layer. The ALM consists of a three-layer neural network and is trained by the BP algorithm. Symbolic data activate the SKAM, which analyzes knowledge and associative relations logically. Its output nodes are also connected to the nodes in the Associative Knowledge layer. The Associative Knowledge layer constructs a knowledge net based on the knowledge propagated from the previous step. The constructed knowledge net is selectively stored in the connected memory. For efficient memory retention, maintenance and knowledge retrieval, the memory is designed to consist of a knowledge network. The knowledge network is composed of knowledge cells organized according to their associative relations and concepts. The system performs knowledge network management. During this processing, discarded data and useless dead knowledge are sent to the Pruning pool and removed.

Fig. 1. Intelligent Knowledge Network Management System

2.1 Knowledge Network Management Strategy
We designed the knowledge cells of the knowledge network to have several properties, i.e. ID, T (Self Type), IE (Internal Entropy) and C (contents), for efficient memory maintenance and knowledge retrieval. As shown in Figure 2, a knowledge cell is connected to other knowledge cells, and the collection of knowledge cells composes the knowledge network. The knowledge network is represented in the form of a Knowledge Associative list.
Fig. 2. Knowledge cell, Knowledge network and Knowledge Associative list
The knowledge network management strategy maintains an efficient memory by identifying dead cells and removing them. Its main function is to prune dead cells or negative cells and constantly keep the memory in a fresh state. In this process, the system checks whether there is a negative cell and calculates the Internal Entropy of the whole memory. If it finds a dead cell, it sends a signal and starts the pruning process. In the case of a negative cell, it determines whether the negative cell should be destroyed. While a negative cell is being destroyed, the memory is reconfigured. After removal, a new associative relation is made by calculating the new strength. In the example of Figure 3, node kj is removed because it is a dead cell. After its removal, ki is directly connected to kk with a new strength, calculated by Eq. (1):
R_{ik} = R_{ij} * R_{jk}    (1)
For the more general case shown in Figure 5, if there are n dead cells between ki and kj, the new strength Rij is calculated by Eq. (2):
R_{ij} = R_{i,i+1} * \cdots * R_{j-1,j}    (2)
Fig. 3. Knowledge nodes with dead cells
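The pruning of dead cells described above can be sketched as follows. The representation is an assumption made for illustration: links are held in a dictionary mapping (source, destination) pairs to strengths, and the function name `prune_cell` is hypothetical. When a dead cell is removed, each predecessor is linked to each successor with the product of the two strengths, as in Eq. (1); if a direct link already exists, this sketch keeps the stronger of the two, a choice the paper does not specify.

```python
def prune_cell(links, dead):
    """Remove `dead` from a link map {(src, dst): strength} and bridge
    its predecessors to its successors with multiplied strengths."""
    incoming = {src: r for (src, dst), r in links.items() if dst == dead}
    outgoing = {dst: r for (src, dst), r in links.items() if src == dead}
    # Keep every link that does not touch the dead cell.
    pruned = {pair: r for pair, r in links.items() if dead not in pair}
    for src, r_in in incoming.items():
        for dst, r_out in outgoing.items():
            if src == dst:
                continue  # do not create self links
            bridged = r_in * r_out  # Eq. (1): R_ik = R_ij * R_jk
            # Assumption: an existing direct link is kept if it is stronger.
            pruned[(src, dst)] = max(bridged, pruned.get((src, dst), 0.0))
    return pruned


# Example of Eq. (1): kj is dead, so ki is linked directly to kk.
network = {("ki", "kj"): 0.9, ("kj", "kk"): 0.8}
repaired = prune_cell(network, "kj")
```

Pruning several consecutive dead cells one after another multiplies all strengths along the path, which corresponds to the chained product of Eq. (2).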
3 Medicine Self Repairing Strategy
3.1 Self Maintenance
Medicine Self Repairing Strategy is designed for self maintenance of the system. As shown in Figure 4, the system periodically checks the SEG (Surviving Energy Gauge) balancing factor to see whether it is in the balanced state. All knowledge cells composing the knowledge network in memory are represented as objects with an ID, Self Type, Self Internal Entropy and contents. Of these attributes, Self Type and Self Internal Entropy are used by the Medicine Self Repairing Strategy. The main role of the medicine is to select a bad knowledge cell that is harmful or useless for intelligent processing and to remove it. The medicine type matching mechanism is a memory cleaning process that removes the knowledge cells whose type matches the medicine type.
Fig. 4. Self maintenance
Self Type
Self Type is defined as a cell's own property representing its characteristics. It has five factors, M, F, S, E and K, whose meaning can be determined depending on the application area. Attracting and rejecting relationships exist between the types, as shown in Table 1 and Table 2. If a type meets another type in an Attracting relation, the two types are associated and their relational strength increases. Conversely, if it meets a type in a Rejecting relation, an expelling strength acts on the two types and their relational strength decreases.
Table 1. Type Matching Rule: Attracting Relation
Attracting Relation
M ⊕ F
F ⊕ E
E ⊕ K
K ⊕ S
S ⊕ M

Table 2. Type Matching Rule: Rejecting Relation
Rejecting Relation
M E
E S
S F
F K
K M
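The two type matching tables can be sketched as a lookup. The pairs are copied from Table 1 and Table 2; treating each relation as symmetric (unordered pairs) is an assumption of this sketch, as is the function name `relation`. The amount by which the relational strength changes is not given in the text, so the sketch only classifies a pair of types.

```python
# Unordered type pairs taken from Table 1 and Table 2.
ATTRACTING = {frozenset(pair) for pair in
              (("M", "F"), ("F", "E"), ("E", "K"), ("K", "S"), ("S", "M"))}
REJECTING = {frozenset(pair) for pair in
             (("M", "E"), ("E", "S"), ("S", "F"), ("F", "K"), ("K", "M"))}


def relation(type_a, type_b):
    """Classify two Self Types as 'attracting', 'rejecting' or 'none'."""
    pair = frozenset((type_a, type_b))
    if pair in ATTRACTING:
        return "attracting"
    if pair in REJECTING:
        return "rejecting"
    return "none"  # identical types or types outside M, F, S, E, K
```

With five types there are ten distinct pairs, and the two tables together cover each of them exactly once, so under the symmetry assumption any two different types are either attracting or rejecting.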
3.2 Self Internal Entropy and SEG Balancing Factor
Every knowledge cell has a Self Internal Entropy (IE, Ei) representing the strength of the cell. It takes a value in [-1, 1]. A negative value means negative energy, a positive value means positive energy, and zero represents the state of no energy.
E_i = \frac{1 - \exp(-\sigma x)}{1 + \exp(-\sigma x)}    (3)
SEG (Surviving Energy Gauge) is a measure for evaluating the state of surviving ability in the surrounding environment. This value is used for maintaining the system. SEG is calculated by Eq. (4):
SEG(t) = \frac{\sum_{i=1}^{n} E_{ti}}{n}    (4)
3.3 Medicine Self Repairing Mechanism
In this system, every knowledge cell has a Self Type with a value of M, F, E, K or S. However, the Self Type of a knowledge cell can change to another type or to an abnormal type depending on the situation. For example, if a state of starvation continues for a long time, the cell can be changed by the self maintenance mechanism into a dead cell, whose type is D. A dead cell has a Self Internal Entropy of zero. During memory structuring, knowledge cells of abnormal type can be added to the knowledge network. If many useless cells are included in the knowledge network, the efficiency of the system drops. Because the value of SEG also drops, the self maintenance system can easily detect the abnormal state. To prevent this situation and recover from it, a repairing treatment is essential. In this study, we propose the Medicine Self Repairing Mechanism for recovering the knowledge network and removing its bad sectors. The main idea is medicine type matching removal: the medicine searches for the cells of matching type and removes them from the knowledge network. The precondition is that the system detects the abnormal types and finds the appropriate medicine matched with each abnormal type. Empirical knowledge is used for selecting the medicine in this process. The system should have a medicine matching type list for medicine self repairing.
Fig. 5. Medicine type matching remove
The following algorithm represents the Medicine Self Repairing algorithm.
Algorithm 1: Medicine Self Repairing Algorithm
Start
STEP1: Check SEG, Repairing Signal.
STEP2: Input the medicine type, MED.
STEP3: while(!EOF)
  begin
    If (SEG ≤ θ and Repairing control = ON) then
      Input the medicine type, MED.
      Search the type, T, in the cell matched with MED referring medicine type matching table.
      If (found) Remove the found cell.
  end.
STEP4: Call Knowledge Network Restructuring module
Stop
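A compact sketch of the removal step of Algorithm 1 is given below. It is an illustration under stated assumptions, not the author's implementation: the `Cell` class, the function names, the threshold value and the example cells are hypothetical, SEG is taken as the mean Internal Entropy as in Eq. (4), and the restructuring module of STEP4 is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    ident: str               # cell ID, e.g. "K6"
    self_type: str           # Self Type, e.g. "M", "F", "D"
    internal_entropy: float  # IE in [-1, 1]


def seg(cells):
    """Eq. (4): mean Internal Entropy of the cells in memory."""
    return sum(cell.internal_entropy for cell in cells) / len(cells)


def medicine_repair(cells, medicine_types, theta, repairing_on=True):
    """Remove every cell whose Self Type matches the medicine, but only
    when SEG <= theta and the repairing control is on (STEP3)."""
    if not cells or not repairing_on or seg(cells) > theta:
        return list(cells)
    return [cell for cell in cells if cell.self_type not in medicine_types]


# Hypothetical three-cell memory; theta = 0.5 is an arbitrary threshold.
memory = [Cell("K1", "M", 1.0), Cell("K6", "D", 0.0), Cell("K4", "C", -1.0)]
repaired = medicine_repair(memory, {"D"}, theta=0.5)
```

After such a removal the links of the deleted cells still have to be rebuilt, which is the task the algorithm delegates to the Knowledge Network Restructuring module.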
4 Experiments
The prototype of the Knowledge Network management system was applied to a virtual memory, and the Medicine Self Repairing Strategy was tested using the knowledge associative list and the medicine lookup table of Table 3. Figure 6 shows the initial state of the knowledge network, and Table 4 shows its knowledge list containing Self Type, knowledge nodes, IE and Relation. In this experiment, the medicine type matching repairing strategy was processed in four steps, as shown in the graph of Figure 7. MED-D means D type cell removing, MED-DP is D and P type
Fig. 6. Knowledge network
Table 3. Medicine Lookup table
MED
Attracting Relation   F
Rejecting Relation    D, P, C
Table 4. Knowledge associative list of initial knowledge network

Type   Knowledge i   IE     Rel.   Knowledge i + 1
M      K1            1.0    0.9    K2
M      K1            1.0    0.4    K7
F      K2            0.8    0.8    K3
F      K2            0.8    0.6    K5
F      K3            0.6    0.2    K4
F      K3            0.6    0.8    K6
C      K4            -1.0   0.0    NULL
S      K5            0.7    0.0    NULL
D      K6            0.0    0.0    NULL
S      K7            1.0    0.7    K8
S      K7            1.0    0.3    K9
S      K7            1.0    0.5    NULL
C      K8            -1.0   0.0    NULL
M      K9            0.2    0.1    K10
F      K10           0.3    0.0    NULL
D      K11           0.0    0.6    K10
D      K11           0.0    0.1    K12
D      K11           0.0    0.9    K13
M      K12           0.2    0.0    NULL
P      K13           -0.3   0.6    K14
M      K14           0.1    0.0    NULL
removing, and MED-DPC is D, P and C type removing. MEDA-F means adding energy to F type cells, because the F type is in an Attracting relation with this medicine according to the Medicine Lookup table. As shown in Figure 8, the value of SEG rises as Medicine Type Matching Repairing is processed. As a result, the memory was successfully updated to the optimal state.
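The removal step of Algorithm 1 can be sketched on the rows of Table 4 as follows. This is only an illustrative sketch: the function names `apply_medicine` and `repair`, the tuple layout, and the way SEG and θ are passed in as plain numbers are assumptions, since the computation of SEG and the restructuring module are not reproduced here. The MEDA-F step (adding energy to F type cells) is left out because the amount of energy added is not specified.

```python
# Rows of Table 4: (type, knowledge_i, IE, relation, knowledge_i_plus_1).
# NULL entries of the table are represented by None.
TABLE_4 = [
    ("M", "K1", 1.0, 0.9, "K2"), ("M", "K1", 1.0, 0.4, "K7"),
    ("F", "K2", 0.8, 0.8, "K3"), ("F", "K2", 0.8, 0.6, "K5"),
    ("F", "K3", 0.6, 0.2, "K4"), ("F", "K3", 0.6, 0.8, "K6"),
    ("C", "K4", -1.0, 0.0, None), ("S", "K5", 0.7, 0.0, None),
    ("D", "K6", 0.0, 0.0, None), ("S", "K7", 1.0, 0.7, "K8"),
    ("S", "K7", 1.0, 0.3, "K9"), ("S", "K7", 1.0, 0.5, None),
    ("C", "K8", -1.0, 0.0, None), ("M", "K9", 0.2, 0.1, "K10"),
    ("F", "K10", 0.3, 0.0, None), ("D", "K11", 0.0, 0.6, "K10"),
    ("D", "K11", 0.0, 0.1, "K12"), ("D", "K11", 0.0, 0.9, "K13"),
    ("M", "K12", 0.2, 0.0, None), ("P", "K13", -0.3, 0.6, "K14"),
    ("M", "K14", 0.1, 0.0, None),
]


def apply_medicine(cells, med_types):
    """Remove every row whose Self Type matches one of the medicine types."""
    return [cell for cell in cells if cell[0] not in med_types]


def repair(cells, med_types, seg, theta, control_on=True):
    """Apply the medicine only when SEG <= theta and repairing is switched on.

    `seg` and `theta` are supplied by the caller (hypothetical interface).
    """
    if seg <= theta and control_on:
        return apply_medicine(cells, med_types)
    return list(cells)


if __name__ == "__main__":
    steps = (("MED-D", {"D"}), ("MED-DP", {"D", "P"}), ("MED-DPC", {"D", "P", "C"}))
    for name, med in steps:
        print(name, "rows left:", len(apply_medicine(TABLE_4, med)))
```

The sketch only filters rows by type; whether relations that point to a removed cell are also cleaned up is left to the restructuring module of STEP4.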
Fig. 7. Change of Entropy by medicine type matching repairing
Fig. 8. Change of SEG by medicine type matching repairing
5 Conclusion
A knowledge network management system with a Medicine Type Matching Repairing strategy was designed. The concepts of Self Type, Internal Entropy and SEG were defined and used for the self repairing system. We applied the proposed system to a virtual memory consisting of a knowledge net and tested the results. The tests showed that the memory was successfully updated and maintained by the medicine type matching repairing strategy. The proposed system can also be usefully applied to many areas requiring efficient memory management.
References
1. Shim, J.-Y.: Knowledge Retrieval Using Bayesian Associative Relation in the Three Dimensional Modular System. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds.) APWeb 2004. LNCS, vol. 3007, pp. 630–635. Springer, Heidelberg (2004)
2. Anderson, J.R.: Learning and Memory. Prentice Hall, Englewood Cliffs
3. Fausett, L.: Fundamentals of Neural Networks. Prentice Hall, Englewood Cliffs
4. Haykin, S.: Neural Networks. Prentice Hall, Englewood Cliffs
5. Shim, J.-Y., Hwang, C.-S.: Data Extraction from Associative Matrix based on Selective learning system. In: IJCNN'99, Washington, D.C. (1999)
6. Anderson, J.R.: Learning and Memory. Prentice Hall, Englewood Cliffs
7. Shim, J.-Y.: Automatic Knowledge Configuration by Reticular Activating System. In: Wang, L., Chen, K., Ong, Y.S. (eds.) ICNC 2005. LNCS, vol. 3610, Springer, Heidelberg (2005)
Design of a Cell in Embryonic Systems with Improved Efficiency and Fault-Tolerance
Yuan Zhang, Youren Wang, Shanshan Yang, and Min Xie
College of Automation and Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
[email protected]
Abstract. This paper presents a new design of cells to construct embryonic arrays, the function unit of which can act in three different operating modes. Compared with cells based on a LUT with four inputs and one output, the new architecture displays improved flexibility and resource utilization ratios. The configuration memory employed by embryonics can implement 1-bit error correcting and 2-bit error checking by using extended Hamming code. Two-level fault tolerance is achieved in the embryonic array by the error correcting mechanism of the memory at cell level and the column-elimination mechanism at array level, which is triggered by cell-level fault detection. The implementation and simulation of a 4-bit adder subtracter circuit is presented as a practical example to show the effectiveness of embryonic arrays in terms of functionality and two-level fault tolerance.
Keywords: Embryonic systems, Cellular arrays, Two-level self-repair, Extended Hamming code, Fault tolerance of configuration memory.
1 Introduction
The requirements for reliability and fault tolerance of electronic systems become higher and higher as the complexity of digital systems increases. Traditional fault-tolerant design cannot provide a solution for an integrated SOC (System On a Chip) because of its complex architecture, large volume, and difficulty of integration. Therefore it is necessary to seek new architectures and mechanisms for on-line self-repair of VLSI (Very Large Scale Integration) circuits. Embryonic systems [1] are planar cellular arrays with self-test and self-repair properties. Properties similar to those found in biological organisms, such as multi-cellular organization, cellular replication, cellular differentiation and cellular division, are used to build digital circuits with high reliability and robust fault tolerance. Early cellular architectures were based on multiplexers [2], which are unsuitable for implementing large-scale circuits because of their large resource requirements and routing difficulty. In recent years, the LUT has mostly been used, as in the POE (phylogenesis, ontogenesis, epigenesis) architecture [3]. However, it does not solve the problem of low resource utilization ratio. In this paper an improved cellular architecture is proposed, which can implement an arithmetic operation or two gates using three operational modes, to improve flexibility and resource utilization ratio.
It is necessary to study the reliability and fault tolerance of the configuration memory within embryonic arrays, as errors in the memory will render the system useless. Simple off-line test, the addition of a complex artificial immune system [4,5] for on-line self-test, and Hamming code for 1-bit error correcting [6] have been proposed previously. In this paper we apply extended Hamming code [7] to memory fault tolerance; it corrects a 1-bit error at cell level and gives a 2-bit error signal to trigger array-level self-repair, which enhances the fault tolerance.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 129–139, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Architecture of Cells in Embryonic Systems
The embryonic array is a symmetrical electronic cellular array. Fig. 1 gives an abstract view of the basic components of the embryonic cell. Every cell has an identical circuit structure, which contains a co-ordinate generator, memory, function unit, switch box and control unit [8]. Input multiplexers controlled by the co-ordinate are contained in the southmost and northmost cells, so that the inputs of the system can be sourced from the west as well as from the south and north.
[Figure: an embryonic system made of active cells and spare cells, with an expanded view of one cell containing the co-ordinate generator, memory, control unit, function unit (LUT), self-test and self-repair logic, and a switch box with N, E, S and W ports.]
Fig. 1. Architecture of cells in embryonic systems
2.1 Co-ordinate Generator and Memory
Each cell chooses its configuration bits stored in the memory according to its co-ordinate. Because column-elimination is adopted, each cell needs only one co-ordinate and the memory only has to store the configuration bits of the cells within the same row. When the cell is in the normal state, the configuration bits are chosen from the memory according to its co-ordinate, which is calculated by adding 1 to the co-ordinate of the west cell. When it is in the transparent state, the cell passes the co-ordinate of the west cell directly to the east and selects the transparent configuration from the memory.
2.2 Function Unit
Fig. 2 shows the function unit in some detail. A cell has three different operational modes, which make it flexible enough to map all kinds of application circuits onto the array conveniently.
– 4-LUT mode: when MODE(1:0) is “11”, the function unit acts as a 16-bit LUT which supplies an output depending on its four inputs. It is suitable for implementing a circuit with a larger logic function.
– 3-LUT mode: when MODE(1:0) is “01”, the function unit is split into two 8-bit LUTs, both supplying a result depending on three inputs. The first result can go through the flip-flop as the first output. The second one can be used as a second output. A cell can perform an arithmetic operation, such as a full adder or a full subtracter.
– 2-LUT mode: when MODE(1:0) is “00”, the function unit can be regarded as two 4-bit LUTs, each with two inputs and one output. A cell can implement two gates.
According to the configuration bits, the inputs of the function unit can be chosen from fixed logic values, any of the outputs of the neighboring cells, or the cell's own output. The configuration bit EN_REG can enable sequential operation.
[Figure: function unit with input multiplexers (INPUTS_MUX), LUT stages selected by MODE(1) and MODE(0), a D flip-flop clocked by CLK and enabled by EN_REG, and the outputs Fun0_out and Fun1_out.]
Fig. 2. Architecture of function unit
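The three operating modes can be illustrated with a small behavioral sketch. The bit layout of the 16 LUT bits, the assignment of inputs to the two small LUTs, and the names `lut` and `function_unit` are assumptions made for this illustration; the flip-flop and EN_REG path are not modeled.

```python
def lut(table, index):
    """Return bit `index` of the truth table `table`."""
    return (table >> index) & 1


def function_unit(mode, config, inputs):
    """Evaluate the function unit combinationally.

    mode   -- MODE(1:0) as a string: "11" (4-LUT), "01" (3-LUT), "00" (2-LUT)
    config -- LUT contents; low bits belong to the first LUT (assumed layout)
    inputs -- four input bits (i0, i1, i2, i3)
    Returns (fun0_out, fun1_out); fun1_out is None in 4-LUT mode.
    """
    i0, i1, i2, i3 = inputs
    if mode == "11":
        index = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)
        return lut(config & 0xFFFF, index), None
    if mode == "01":
        # Both 8-bit LUTs are assumed to share the same three inputs.
        index = i0 | (i1 << 1) | (i2 << 2)
        return lut(config & 0xFF, index), lut((config >> 8) & 0xFF, index)
    if mode == "00":
        # Each 4-bit LUT is assumed to use its own pair of inputs.
        first = lut(config & 0xF, i0 | (i1 << 1))
        second = lut((config >> 4) & 0xF, i2 | (i3 << 1))
        return first, second
    raise ValueError("unsupported mode: " + mode)


# Full adder in 3-LUT mode: sum table 0x96 (odd parity), carry table 0xE8 (majority).
FULL_ADDER = 0xE896
# Two XOR gates in 2-LUT mode: each 4-bit table is 0x6.
TWO_XOR = 0x66
```

With these tables, `function_unit("01", FULL_ADDER, (a, b, c, 0))` yields the sum and carry of a full adder, which is how one cell can replace several single-output 4-LUT cells.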
2.3 Switch Box
The switch box is designed for propagating information between cells. As shown in Fig. 3, according to the configuration bits, the outputs in each direction can select from the function unit's outputs and the cell's inputs from the other three directions.
[Figure: switch box with two outputs per side (N0_OUT, N1_OUT, E0_OUT, E1_OUT, S0_OUT, S1_OUT, W0_OUT, W1_OUT). Each output is driven by a multiplexer that selects among FUN0_OUT, FUN1_OUT and the six inputs of the other three sides, under control of the select fields N0, N1, E0, E1, S0, S1, W0 and W1.]
Fig. 3. Architecture of switch box
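The selection performed for each switch box output can be sketched as an 8-to-1 multiplexer. The ordering of the eight candidate sources and the helper names `candidate_sources` and `route` are assumptions for illustration; only the set of candidates (the two function outputs plus the inputs of the other three sides) is taken from the description above.

```python
DIRECTIONS = ("N", "E", "S", "W")


def candidate_sources(out_dir):
    """List the eight signals an output on side `out_dir` may select from."""
    names = [d + str(i) + "_IN" for d in DIRECTIONS if d != out_dir for i in (0, 1)]
    return names + ["FUN0_OUT", "FUN1_OUT"]


def route(out_dir, select, signals):
    """Return the value driven on one output of side `out_dir`.

    select  -- 3-bit select field taken from the configuration bits
    signals -- dict mapping signal names to their current bit values
    """
    if not 0 <= select < 8:
        raise ValueError("select must fit in 3 bits")
    return signals[candidate_sources(out_dir)[select]]
```

For example, with `select = 6` an output on the south side forwards FUN0_OUT under the assumed ordering.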
2.4 Control Unit
The control unit is responsible for controlling the other blocks in the cell by combining fault flags into a cell state signal, which triggers the cell's elimination. A cell has four states. The array's behavior is determined by the transitions from one state to the next at discrete time steps, according to the fault flags from the self-test logic within the cell and the state of the neighboring cells in the same column. The cell works in the normal state when no faults are detected. When a 2-bit error is detected in the memory, the cell goes into the non-reparable transparent state, from which it cannot return to the normal state. When a fault is detected in the function unit or in any neighboring cell of the same column, the cell goes into the reparable transparent state and returns to the normal state when the fault disappears. When the cell is in either of the transparent states, the control unit sets the transparent signal to ‘1’, and the self-test logic continues to test the transparent configuration bits. If a 2-bit error is detected in the transparent configuration bits, the cell goes into the backup state and sets the backup signal to ‘1’.
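The four states and their transitions can be summarized in a small state machine sketch. The flag names, the treatment of the backup state as final, and the assumption that the transparent signal stays at 1 in the backup state are not stated in the text and are labeled here as assumptions.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    REPARABLE = "reparable transparent"
    NONREPARABLE = "non-reparable transparent"
    BACKUP = "backup"


def next_state(state, mem_2bit_error, fun_fault, column_fault, transparent_2bit_error):
    """Compute the next cell state from the fault flags (one discrete time step)."""
    if state is State.BACKUP:
        return State.BACKUP  # assumed to be a final state
    if state is State.NORMAL:
        if mem_2bit_error:
            return State.NONREPARABLE
        if fun_fault or column_fault:
            return State.REPARABLE
        return State.NORMAL
    # Either transparent state: the transparent configuration is still tested.
    if transparent_2bit_error:
        return State.BACKUP
    if state is State.NONREPARABLE or mem_2bit_error:
        return State.NONREPARABLE
    if fun_fault or column_fault:
        return State.REPARABLE
    return State.NORMAL  # the fault has disappeared


def transparent_signal(state):
    """1 whenever the cell is not in the normal state (assumption for BACKUP)."""
    return int(state is not State.NORMAL)


def backup_signal(state):
    return int(state is State.BACKUP)
```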
3 Fault Tolerance Mechanism of Embryonic Systems
In order to improve the fault tolerance of embryonic systems, we are developing a hierarchical approach that includes cell-level and array-level fault tolerance, allowing the system to keep operating normally in the presence of multiple faults with fewer redundant resources. The “stuck-at” fault model, which covers the most common faults, is used.
3.1 Fault Detection and Fault Tolerance at Cell-Level
Duplicating the function unit gives it fault-detection ability, since a fault will cause the outputs of the two copies to differ. This discrepancy can be detected by an XOR gate comparing the function unit outputs.
When the outputs of the two function units are different, the fault signal from the comparator is set to ‘1’, indicating that a fault has occurred in the cell. The advantages of this method of fault detection are that it is simple and operates on-line. However, fault location and fault tolerance cannot be obtained by dual modular redundancy.
Errors within the memory will lead to the failure of the function unit and of the connections between cells, which will make the cell abnormal. To achieve high reliability and robust fault tolerance of the configuration memory, extended Hamming code, which can correct a 1-bit error and detect a 2-bit error, provides a solution for the design of memory fault tolerance. When a 1-bit error occurs, the memory corrects it so that the cell keeps operating normally. When a 2-bit error occurs, a fault signal is given to trigger array reconfiguration.
Extended Hamming Code. Hamming code is known to be a perfect 1-bit error correcting code that uses extra check bits. The extended Hamming code is slightly more powerful: by adding a parity check bit, it can detect that a 2-bit error has occurred as well as correct any single error. The code word is created as follows:
1. Mark all bit positions that are powers of two as check bits (positions 1, 2, 4, etc.).
2. All other bit positions are for the data to be encoded (positions 3, 5, 6, 7, etc.).
3. Each check bit calculates the parity for some of the bits in the code word. The position of the check bit determines the sequence of bits that it alternately checks and skips. For example:
   Position 1: check 1 bit, skip 1 bit, check 1 bit, skip 1 bit, etc. (1, 3, 5, 7, 9, etc.)
   Position 2: check 2 bits, skip 2 bits, check 2 bits, skip 2 bits, etc. (2, 3, 6, 7, etc.)
4. Set a check bit to 1 if the total number of ones in the positions it checks is odd. Set it to 0 if the total number of ones in the positions it checks is even.
5. Set the parity check bit to 1 if the total number of ones in all data bits and check bits is odd. Set it to 0 if the total number of ones in all data bits and check bits is even.
Each data bit is checked by several check bits. An error in a data bit changes the check bits that check it. To locate the error, check each check bit and add the positions of those that are wrong; the sum indicates the location of the bad bit. The parity check bit helps to detect a 2-bit error.
Error Checking and Correcting Circuits for the Memory. As shown in Fig. 4, the co-ordinate is passed to the configuration memory to select the cell's configuration, from which the corresponding check bits and parity check bit are generated according to the coding process described above. The co-ordinate is also passed to the standard check bits and parity check bit memory to select the standard ones. A discrepancy caused by errors can be detected by an XOR gate comparing their outputs. The error detecting and correcting unit judges the number of errors according to the result of the comparison. It then corrects the bad bit when one error has occurred, or gives the 2-bit error signal when a 2-bit error is detected. If the generated check bits differ from the standard ones, at least one error has occurred in the configuration bits. The parity check bits must then be compared to identify the number of errors. A discrepancy in the parity check bit indicates that one error has occurred; the bad bit is then located according to the comparison of the check bits and inverted, so that correct configuration bits are exported. Otherwise, a 2-bit error is detected, which cannot be corrected at cell level. The control unit receives the 2-bit error signal and activates array-level self-repair. All the cells in the faulty column select the transparent configuration bits, which are then tested. If a 2-bit error arises in the transparent configuration bits, the control unit generates the backup signal and the backup transparent configuration bits are used to keep the cell correctly transparent.
[Figure: configuration memory addressed by the co-ordinate and the transparent and backup signals, a check bits and parity check bit generator, a standard check bits and parity check bit memory, a comparator, and an error detecting and correcting unit that outputs the correct configuration bits and the 2-bit error signal.]
Fig. 4. Fault detection and fault tolerance for the configuration memory
3.2 Fault Tolerance at Array-Level
A reconfiguration mechanism based on column-elimination (Fig. 5) is adopted for array-level fault tolerance in this paper.
[Figure: three panels showing the array with co-ordinates 0, 1, 2 in each row: normal (no fault); reconfiguration (a fault is detected in the array); return to normal (the faulty column becomes transparent). Legend: active cell, spare cell, faulty cell, transparent cell, fault signal propagation, co-ordinate propagation.]
Fig. 5. The process of array-level fault tolerance based on cell recombination
When a 2-bit error occurs in the memory or a fault is detected in the function unit, the control unit exports the transparent signal. The signal propagates along the column through an OR-gate network, and all the cells within that column are signaled to move to the transparent state. Consequently, the faulty column is removed from the array and its function is taken over by the spare column.
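Column-elimination and the resulting shift of co-ordinates can be sketched as follows. The function names and the value fed into the westmost cell (`west_input = -1`, so that the first normal column receives co-ordinate 0) are assumptions for illustration.

```python
def column_transparent(fault_flags):
    """OR the cell fault flags of each column (the OR-gate network).

    fault_flags[row][col] is True when that cell reports a fault.
    """
    columns = len(fault_flags[0])
    return [any(row[c] for row in fault_flags) for c in range(columns)]


def coordinates(transparent, west_input=-1):
    """Propagate co-ordinates from west to east along one row.

    A normal cell adds 1 to the co-ordinate of its west neighbor; a
    transparent cell passes the west co-ordinate on unchanged.
    """
    coords = []
    current = west_input
    for is_transparent in transparent:
        if not is_transparent:
            current += 1
        coords.append(current)
    return coords


if __name__ == "__main__":
    faults = [[False] * 3 for _ in range(3)]
    print(coordinates(column_transparent(faults)))
    faults[1][1] = True  # a fault in the cell of row 1, column 1
    print(coordinates(column_transparent(faults)))
```

In the second case the whole middle column becomes transparent and the last column takes over co-ordinate 1, so it loads the configuration that the faulty column held before.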
4 Example
To illustrate the two-level fault tolerance of embryonic systems, the design of a 4-bit adder subtracter is presented. Fig. 6 shows the circuit structure. It is composed of four full adders and four XOR gates. Add_sub is the control signal. When it is ‘0’, the circuit performs as a 4-bit adder. When it is ‘1’, the circuit performs as a 4-bit subtracter.
[Figure: four full adders FA0 to FA3 chained through the carries C0 to C3, with inputs x0 to x3, inputs y0 to y3 passed through XOR gates controlled by add_sub, sum outputs s0 to s3 and carry output c.]
Fig. 6. Schematic diagram of 4-bit adder subtracter
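The behavior of this circuit can be checked with a bit-level sketch built from the same parts, four full adders and four XOR gates. Feeding the control signal into the first carry input is the usual two's complement arrangement and is assumed here; the function names are illustrative.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def add_sub4(x, y, add_sub):
    """4-bit adder (add_sub = 0) or subtracter (add_sub = 1).

    Each bit of y passes through an XOR gate controlled by add_sub, and
    add_sub is also used as the carry into the lowest full adder (assumed).
    Returns (4-bit result, carry out of the highest full adder).
    """
    carry = add_sub
    result = 0
    for i in range(4):
        xi = (x >> i) & 1
        yi = ((y >> i) & 1) ^ add_sub  # XOR gate on the y input
        s, carry = full_adder(xi, yi, carry)
        result |= s << i
    return result, carry
```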
According to the cell architecture in Section 2, a cell is capable of implementing a full adder or two gates. Six cells are enough for the logic circuit, which can be mapped onto a 3×3 array with a column of spare cells.
[Figure: 3×3 embryonic array with cells Cell00 to Cell22. The cells labelled 2-LUT (Cell21, Cell22) hold the XOR gates, the cells labelled 3-LUT (Cell01, Cell02, Cell11, Cell12) hold the full adders FA0 to FA3, and three cells are spare. The memory contents are shown as 11-digit hexadecimal configuration words, including 00000041041, 08382037800, 08382037000, 10248E08C40, 10245008F80, 10249200C38 and 102463C0C00.]
Fig. 7. Implementation of 4-bit adder subtracter in an embryonic array
The placement and routing in the embryonic array and the corresponding data in the memory are shown in Fig. 7. The four cells implementing the four full adders work in 3-LUT mode and the two cells implementing the four XOR gates work in 2-LUT mode. The rightmost column of spare cells is ready for array-level fault tolerance based on reconfiguration.
If the circuit were designed with a function unit based on a LUT with four inputs and one output, at least twelve cells would be needed and the routing would be more difficult. Here six cells are enough, thanks to the reconfigurable function unit with its three operational modes. The improved function unit therefore enhances the cellular function and reduces the hardware resources.
5 Simulation and Results
The whole system is described in VHDL, synthesized, placed and routed by XST and the P&R tools from Xilinx, and then post-simulated with ModelSim SE 6.0c.
5.1 Verification of Cell-Level Fault Tolerance
We take cell11 as the fault injection cell here. As shown in Fig. 8, cell11_configbits “10248E08CC0” represents the configuration exported from the memory with a 1-bit error. Cell11_corrected_configbits “10248E08C40” represents the configuration bits corrected by the extended Hamming code. The result shows that the bad bit is corrected and that the right configuration bits “10248E08C40” are exported to configure the function unit and switch box, ensuring that the sum s and carry c of adding x to y are right.
5.2 Verification of Array-Level Fault Tolerance
When a 2-bit error occurs in the memory or a fault is detected in the function unit, array-level self-repair is triggered by the fault signal from the cell-level self-test logic. The logic function is then mapped onto the non-faulty cells by column-elimination.
Fig. 8. The simulation result of memory fault tolerance with 1-bit error occurring
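The correction shown in Fig. 8 relies on the extended Hamming code of Section 3.1. A minimal sketch of that scheme is given below, assuming that data bit i is placed at the i-th position that is not a power of two and that the stored check bits and parity bit are compared against values regenerated from the configuration bits. The function names and the bit-to-position mapping are assumptions, not the authors' circuit. The example words are the ones quoted in the text for cell11.

```python
def data_positions(width):
    """Code word positions (1-based) holding data: all non-powers of two."""
    positions, p = [], 1
    while len(positions) < width:
        if p & (p - 1):  # true when p is not a power of two
            positions.append(p)
        p += 1
    return positions


def parity(value):
    return bin(value).count("1") & 1


def encode(data, width):
    """Return (check_bits, parity_bit) for `width` data bits.

    Bit k of check_bits is the check bit at position 2**k; it equals the
    XOR of all data bits whose position has bit k set.
    """
    check = 0
    for i, pos in enumerate(data_positions(width)):
        if (data >> i) & 1:
            check ^= pos
    return check, parity(data) ^ parity(check)


def decode(data, check, parity_bit, width):
    """Return (data, status): 'ok', 'corrected' or 'double_error'."""
    regenerated, _ = encode(data, width)
    syndrome = regenerated ^ check  # sum of the positions that are wrong
    overall = parity(data) ^ parity(check) ^ parity_bit
    if syndrome == 0 and overall == 0:
        return data, "ok"
    if overall == 1:  # odd number of errors: treat as a single error
        positions = data_positions(width)
        if syndrome in positions:
            data ^= 1 << positions.index(syndrome)
        return data, "corrected"
    return data, "double_error"  # parity agrees but the check bits differ


if __name__ == "__main__":
    good = 0x10248E08C40
    check, par = encode(good, 44)
    print(decode(0x10248E08CC0, check, par, 44)[1])  # 1-bit error from Fig. 8
    print(decode(0x10248E08C80, check, par, 44)[1])  # 2-bit error from Fig. 9
```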
Fig. 9. The simulation result of the system with 2-bit error occurring in the memory
Fig. 10. The simulation result of the system with a fault occurring in the function unit
2-bit Error Occurring in the Memory. As shown in Fig. 9, when the configuration bits of cell11 read from the memory are “10248E08C80”, the 2-bit error signal cell11_memory_fault is high, indicating that a 2-bit error has occurred. At this time the sum s and carry c of adding x to y are wrong, showing that cell-level self-repair has failed. Then the transparent configuration bits “00000041C41”, which also contain a 2-bit error, are chosen, and the system again exports a wrong result. The cell then goes into the backup state, setting the backup signal high, and works with the correct transparent configuration “00000041041”. Notice that after some reconfiguration time the result is right. The system returns to normal and the co-ordinates of the last-column cells, cell02_x, cell12_x and cell22_x, change from “10” to “01”: the spare column takes over the function of the faulty column.
A Fault Occurring in the Function Unit. We introduce a signal cell11_fun_fault to simulate failures in the function unit. When it is ‘0’, the inputs of the function unit are connected to the normal input signals; otherwise they are connected to the fixed logic value ‘1’, simulating a stuck-at-1 fault. As shown in Fig. 10, when cell11_fun_fault goes to logic 1 at 200 ns, there is a small time interval during which the output of the circuit is unknown, just before it returns to normal behavior. Then cell11 selects the transparent configuration bits “00000041041” and the co-ordinates of the last-column cells, cell02_x, cell12_x and cell22_x, change from “10” to “01”: the spare column takes over the function of the faulty cell's column. When the fault disappears at 400 ns, reconfiguration is triggered and the correct result of subtracting y from x is obtained again.
6 Conclusions and Future Work
In this paper the architecture of embryonic systems is introduced. A controllable multi-mode function unit enhances the cell function and reduces the hardware resources. The configuration memory is endowed with 1-bit error correcting and 2-bit error detecting in the form of extended Hamming code. The fault tolerance is thus improved by the two-level self-repair strategy, which includes memory fault tolerance at cell level and cell recombination by column-elimination at array level. Our future research will focus on optimizing the cell's architecture and improving the resource utilization ratios and the ability of fault location at cell level, in order to achieve automatic design of large-scale digital circuits.
Acknowledgments. The work presented in this paper has been funded by the National Natural Science Foundation of China (60374008, 90505013) and the Aeronautic Science Foundation of China (2006ZD52044, 04I52068).
References
1. Mange, D., Sanchez, E., Stauffer, A., Tempesti, G., Marchal, P., Piguet, C.: Embryonics: A New Methodology for Designing Field-Programmable Gate Arrays with Self-Repair and Self-Replicating Properties [J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 6(3), 387–399 (1998)
2. Ortega, C., Tyrrell, A.: Design of a Basic Cell to Construct Embryonic Arrays [J]. IEE Procs. On Computers and Digital Techniques 145-3, 242–248 (1998)
3. Moreno, J., Thoma, Y., Sanchez, E.: POEtic: A Prototyping Platform for Bioinspired Hardware [A]. In: Proceedings of the eighth International Conference on Evolvable Systems: From Biology to Hardware [C], Barcelona, Spain, pp. 177–187 (2005)
4. Zhang, X., Dragffy, G., Pipe, A.G., Zhu, Q.M.: Artificial Innate Immune System: an Instant Defence Layer of Embryonics [A]. In: Proceedings of the 3rd International Conference on Artificial Immune Systems [C], Catania, Italy, pp. 302–315 (2004)
5. Bradley, D.W., Tyrrell, A.M.: Immunotronics: Novel Finite-State-Machines Architectures with Built-In Self-Test Using Self-Nonself differentiation [J]. IEEE Transactions on Evolutionary Computation 6(3), 227–238 (2002)
6. Prodan, L., Udrescu, M., Vladutiu, M.: Self-Repairing Embryonic Memory Arrays [A]. In: Proc. of The sixth NASA/DoD workshop on Evolvable Hardware [C], Seattle, USA, pp. 130–137 (2004)
7. Zhang, J., Zhang, X.: Extended Hamming Code Algorithm and Its Application in FLASH/EEPROM Driver [J]. Ordnance Industry Automation 22(3), 52–54 (2003)
8. Ortega-Sanchez, C., Mange, D., Smith, S., Tyrrell, A.M.: Embryonics: A Bio-Inspired Cellular Architecture with Fault-Tolerant Properties [J]. Genetic Programming and Evolvable Machines 1(3), 187–215 (2000)
Design on Operator-Based Reconfigurable Hardware Architecture and Cell Circuit
Min Xie, Youren Wang, Li Wang, and Yuan Zhang
College of Automation and Engineering, Nanjing University of Aeronautics and Astronautics, 210016 Nanjing, China
[email protected]
Abstract. Due to its generic and highly programmable nature, the gate-based FPGA can implement a wide range of applications. However, its small cell and complex interconnection network cause problems of low hardware resource utilization ratio and long interconnection time-delay in the compute-intensive information processing field. The PMAC (Programmable Multiply-Add Cell) presented in this article ensures high speed and flexibility by adding much programmability to the multiply-add structure. The PMAC array architecture resolves these problems and greatly increases the resource utilization ratio and the efficiency of information processing. After a PMAC model is established and simulated, the PMAC array is implemented on the Virtex-II Pro series XC2VP100 device. By implementing an FFT butterfly operation and a 4th order FIR on the PMAC array, the flexibility and correctness of the architecture are demonstrated. The results also show an average increase of 28.3% in resource utilization ratio and a decrease of 15.5% in interconnection time-delay.
Keywords: Reconfigurable computing, Reconfigurable hardware, FPGA, Operator-based programmable cell circuit, Information processing.
1 Introduction
There are three types of information processors: microprocessor, ASIC and PLD. The performance of microprocessors, which are based on software programs, is limited by their lack of customizability, and their processing speed is lower than that of an ASIC. An ASIC is fast enough to satisfy real-time processing demands, but it has the disadvantages of a fixed architecture and no programmability. The FPGA strikes a balance between speed and flexibility, but it has the defects of low resource utilization ratio, long interconnection time-delay and high power dissipation [1], [2]. Recently there has been a significant increase in the size of gate-based FPGAs. Architectures contain tens of thousands to hundreds of thousands of logic elements, providing a logic capacity of millions of gates, which can implement a wide range of applications. However, when a gate-based FPGA is used for algorithm-level evolvable hardware, such an architecture has the disadvantages of a too long chromosome and a too big search space, so on-line evolution of the operator-based circuit cannot be implemented. Therefore it is necessary to use the operator-based circuit as the fundamental unit to evolve the algorithm-level circuit.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 140–150, 2007. © Springer-Verlag Berlin Heidelberg 2007
In recent years, more and more coarse-grain programmable cell circuits have been researched [3], [4], [5]. Paul Franklin [6] presented the architecture of RaPiD (Reconfigurable Pipelined Data-path); its special data-path provides high-performance support for the regular calculations of a DSP processor. However, it is a linear structure, which can achieve high performance only in linear systems. Higuchi [7] used a DSP processor as the cell circuit, but this architecture occupied too many hardware resources to be implemented on a single chip. Aiming at the disadvantages of the gate-based FPGA in the information processing field, this article presents a new architecture based on the operator-based PMAC, which balances speed, feasibility, power consumption and resource utilization ratio.
2 Reconfigurable Hardware Architecture
2.1 The General Architecture
The general architecture is based on the 2-D cell array structure shown in Fig. 1. The architecture contains five main parts: the reconfigurable cell circuit PMAC, the programmable interconnection network (Configurable Interconnection), the embedded memory BlockRAM, the programmable I/O resource (I/O routing pool) and the I/O block (not shown in Fig. 1). In the architecture, the hard core is a 16×16 cell array. Based on the characteristics of information processing algorithms, a hierarchical architecture is adopted. The 1st level is the 8-bit PMAC, which can implement byte-based data processing; the 2nd level is the CLB, which is composed of 4 PMACs. Such a design can easily expand the byte-based cell PMAC to the word-based cell CLB. The CLB array makes word-based information processing available. If expansion is unnecessary, the 4 PMACs in a CLB can implement a byte-based FFT butterfly operation or a 4th order FIR, etc. The 3rd level is the Hunk, which is composed of 4 CLBs. When the CLB is a word-based cell, the Hunk can implement a word-based FFT butterfly operation or a 4th order FIR, etc.
Fig. 1. The general reconfigurable hardware architecture
In the architecture, the hierarchy of cells determines the hierarchy of the programmable interconnection network. PMAC, CLB and Hunk each have their own programmable interconnection network. BlockRAM, which is adjacent to the Hunk, satisfies the memory demands of information processing. The I/O routing pool is the important resource for interconnection between the inner logic and the I/O blocks.
2.2 Design of Operator-Based Programmable Cell PMAC
The design of a cell circuit for highly efficient information processing depends on the analysis of the inherent characteristics of the target algorithms. When information processing algorithms are implemented in hardware, not only the calculation volume and complexity but also the regularity and modularization should be taken into account. From the analysis of many algorithms, such as FFT/IFFT, FIR/IIR, DCT/IDCT, convolution and correlation, we conclude that these algorithms use many multiplications and additions, and that the ratio of the calculation volume of addition to that of multiplication is approximately 1:1. So the PMAC adopts an 8-bit CLA (Carry Look-ahead Adder) and an 8-bit Booth multiplier as its core, as shown in Fig. 2.
Fig. 2. PMAC cell circuit architecture
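The roughly 1:1 ratio of additions to multiplications can be illustrated with a direct-form FIR filter written as a chain of multiply-accumulate steps, each of which corresponds to one multiplier followed by one adder. This is only an illustrative sketch; the function name `fir` and the way operations are counted (every accumulation counts as one addition) are assumptions.

```python
def fir(samples, taps):
    """Direct-form FIR filter built from multiply-accumulate steps.

    Returns (outputs, multiply_count, add_count).
    """
    multiplies = adds = 0
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]  # one multiplication, one addition
                multiplies += 1
                adds += 1
        outputs.append(acc)
    return outputs, multiplies, adds


if __name__ == "__main__":
    # Impulse response of a 4-tap filter: the output reproduces the taps.
    print(fir([1, 0, 0, 0], [1, 2, 3, 4]))
```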
Most coarse-grain cells are based on 4-bit or 8-bit ALUs, which have powerful functions, but such logic cells are more complex and do not take the 1:1 rule sufficiently into account. The functional and technical features and advantages of the PMAC, which is the hard core of the architecture, are as follows.
(1) Configuration of multiplier input registers: the multiplier input registers can be configured as shift registers or general registers by setting MUX1 and MUX2. Data are shifted into the registers from ShiftinA and ShiftinB or input from DataA and DataB. By setting MUX3 and MUX4, the registers can be bypassed.
(2) Configuration of multiplier output registers: registers between the adder and the Booth multiplier make pipelined data processing available. If the registers are unnecessary, they can be bypassed by setting MUX5. Setting MUX5, MUX6 and MUX7 changes the connection mode of the adder and the multiplier. These modes allow any 2 of the multiplier output MULin, Add, and the external inputs DataC and DataD to be added.
(3) Configuration of multiplier expansion [8]: sometimes the precision of byte-based processing is not enough and word-based processing is needed. Based on the distributive law of multiplication, the 4 byte-based multipliers of the PMACs in a CLB can be expanded into a word-based multiplier.
(4) Configuration of adder module: a_s is the addition/subtraction control signal. The module is an adder when a_s is high. The inputs of the module can be MULin, DataC, DataD or Add. The module can perform the following operations: DataC plus/minus MULin, which is the most common information processing operator; DataD plus/minus DataC, which means the adder can be used separately; and, by setting MUX8 and MUX9, the adder can be configured as an accumulator or an inverse accumulator, which saves many resources when there are more additions than multiplications.
(5) Configuration of adder expansion: the byte-based adder can be expanded to a word-based adder. Saturation is adopted for the adder: the sum takes the maximal value on overflow and the minimal value on underflow.
(6) Handshaking signals among PMACs: data-valid input new_data, trigger enable en, current-level computation finished done, current-level computation in progress busy, start of computation start, and ready to accept data rdyfdata. There are also some control and logic signals: clock clk, partial product input part_pro_in, partial product output part_pro_out, carry in carryin, and carry out carryout.

PMAC is a powerful cell. A single PMAC can perform byte-based multiplication, addition/subtraction, accumulation, inverse accumulation, multiply-accumulation, inverse multiply-accumulation, etc.

2.3 Design of Programmable Interconnection Network

In the CLB, the programmable interconnection of the 4 PMACs is shown in Fig. 3.
Fig. 3. Programmable interconnection network in CLB
Fig. 4. Testing and scanning chain for shift registers in CLB
Small circles denote programmable bits. In a generic FPGA, the programmable interconnection network is distributed symmetrically and a switch matrix is adopted in order to be highly programmable. Because information processing dataflows are regular, we adopt a direct interconnection mode. Such a design without a switch matrix saves much chip area. A simply configured CLB can form a shift register that can be used to test the shift registers in the PMAC array, as shown in Fig. 4. Black dots denote effective connections. The other parts of the PMAC array can be tested in the same way.
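The multiplier expansion of item (3) in Section 2.2 rests on the distributive law. The following is a minimal sketch of that idea, not the PMAC circuit itself: it assumes unsigned operands, a 16-bit word, and plain integer products in place of the Booth multipliers. The function names `mul8` and `mul16_from_mul8` are hypothetical.

```python
def mul8(a: int, b: int) -> int:
    """Stand-in for one byte-based multiplier (unsigned 8 x 8 -> 16 bits)."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def mul16_from_mul8(a: int, b: int) -> int:
    """Unsigned 16 x 16 product assembled from four 8 x 8 products."""
    assert 0 <= a < 65536 and 0 <= b < 65536
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    # (a_hi*2^8 + a_lo) * (b_hi*2^8 + b_lo), expanded by the distributive law:
    # one high product, two cross products and one low product.
    return ((mul8(a_hi, b_hi) << 16)
            + ((mul8(a_hi, b_lo) + mul8(a_lo, b_hi)) << 8)
            + mul8(a_lo, b_lo))
```

The four calls to `mul8` correspond to the four byte-based multipliers of a CLB; the shifts and additions are what the expanded adder would have to provide.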
3 Analysis of Experiment Results

3.1 Hardware Resource Consumption of PMAC

PMAC is modeled on the ISE 6.2 development platform of Xilinx Corporation and simulated on the ModelSim SE 6.0 simulation platform of Mentor Graphics Corporation. The hardware platform is an XC2VP100 FPGA. We used VHDL to describe the PMAC cell circuit and the general architecture, and synthesized the design with Xilinx Synthesis Technology (XST); finally, the programming bit file was configured into the XC2VP100. Table 1 gives the resource consumption of the main units in PMAC. Add_Sub, Booth_Multi and Shif_Regi*2 denote the 8-bit adder/subtracter with registered output, the 8-bit Booth multiplier with registered output and the two 8-bit shift registers, respectively.

Table 1. The resource consumption amounts of PMAC on XC2VP100

Inner units               Add_Sub    Booth_Multi    Shif_Regi*2    PMAC
Total Equivalent Gates    160        643            131*2          1235
Slices                    4          33             4*2            40
Slice FF                  8          29             8*2            53
The PMAC, which consumes only 1235 equivalent NAND gates in total, occupies 40 Slices. The hardware resource utilization ratio is only 66% and that of the registers is only 65% [9]. The reason is that PMAC needs many 1-bit adders in the Slices but rarely needs the other logic resources. Moreover, the 40 separate Slices need to be interconnected; the interconnection lines would be long and complex, which wastes power and increases the time delay. If the embedded block MULT18x18 is adopted to implement the 8-bit multiplier, 33 more Slices are needed to form a PMAC. Employing the embedded block can neither increase the resource utilization ratio nor decrease the power consumption.

If the operator-based PMAC is adopted, several cells with simple interconnection can implement certain information processing algorithms. Such an architecture has three main advantages. First, PMAC can implement many algorithms on the cell array flexibly. Second, the Booth multiplier and the CLA in the PMAC are both compact, high-speed computing parts, and the interconnection between PMACs is short and simple, which ensures a short time delay, so a high clock frequency can be achieved. Third, complex interconnection is one of the main reasons why a generic FPGA consumes much power; simplifying the interconnection also reduces the power. If the Booth multiplier, CLA and shift registers in PMAC are fully used, in other words, if the ratio of multiplications to additions is 1:1, the inner resource utilization ratio is about 97.25%, which exceeds that of a generic FPGA by 33%. PMAC balances flexibility, speed, power consumption and resource utilization ratio and achieves better results.
3.2 Performance Analysis of PMAC
Fig. 5 shows the post-Place&Route simulation of the main functions of PMAC. The definitions of the input/output pins correspond to Fig. 1. In the simulation, sh_ctrl denotes the MUX1 and MUX2 configuration bits; when sh_ctrl is high, shina and shinb are shifted by 1 bit and output on shouta and shoutb. a_s_rol denotes the MUX6, MUX7, MUX8 and MUX9 configuration bits. When (a_s)&(a_s_rol)=10, PMAC performs the most common operation, multiply-add; when (a_s)&(a_s_rol)=11, PMAC performs multiply-accumulation. Table 2 shows the combined control functions of a_s and a_s_rol.

Table 2. The add/subtract mode control

a_s_rol    a_s = 0               a_s = 1
0          data_c-shina*shinb    data_c+shina*shinb
1          add(i)-mul(i-1)       add(i)+mul(i-1)
In the simulation, the data format is Q7Q6Q5Q4Q3Q2Q1Q0. All data are scaled at Q4; that is, the last 4 bits Q3Q2Q1Q0 denote the fractional part and the first 4 bits Q7Q6Q5Q4 the integer part, so the data precision is 0.0625 (1/16). For example, the first group of inputs is shina=57=>3.5625, shinb=25=>1.5625, data_c=16=>1, and the corresponding outputs are mul=89=>5.5625, add=105=>6.5625. The sign "=>" denotes the scaled conversion of the data. The average error between the outputs and the actual calculation results 5.5664 and 6.5664 is 0.0039. In a concrete application, the scaling scheme can be selected. For example, the rotating factor W in the FFT could be scaled at Q7, which gives a precision of about 0.0078 (1/128).
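The Q4 example above can be reproduced with a few lines of integer arithmetic. This is only a sketch of the scaling, under the assumption that the product is truncated back to 4 fractional bits by a right shift; the helper names `to_real` and `q_mul` are hypothetical and do not come from the paper.

```python
FRAC_BITS = 4  # Q4 scaling: 4 fractional bits, resolution 1/16


def to_real(raw: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert a scaled integer to the real value it represents."""
    return raw / (1 << frac_bits)


def q_mul(a: int, b: int, frac_bits: int = FRAC_BITS) -> int:
    """Fixed-point product: the raw product carries 2*frac_bits fractional
    bits, so shift right by frac_bits (truncation is assumed here)."""
    return (a * b) >> frac_bits


shina, shinb, data_c = 57, 25, 16      # 3.5625, 1.5625 and 1.0 in Q4
mul = q_mul(shina, shinb)              # multiplier output
add = data_c + mul                     # multiply-add output
error = to_real(shina) * to_real(shinb) - to_real(mul)
```

With these inputs `mul` and `add` come out as 89 and 105, matching the simulated outputs quoted in the text, and `error` is the 0.0039 truncation error.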
Fig. 5. The Post-Place&Route simulation of PMAC implemented on XC2VP100
Fig. 5 shows that PMAC has three levels of registers. The data in rectangles are the 1st group. The 1st level is the shift registers: shouta and shoutb appear 1 cycle after shina and shinb. The 2nd level is the multiplier output registers: the products appear 2 cycles after shina and shinb. The 3rd level is the adder output registers: the sum appears 3 cycles after data_c. As shown in Fig. 5, the functions of PMAC are correct and stable.
3.3 Implementation of Information Processing Algorithm on PMAC Array

3.3.1 Implementation of FFT Butterfly Operation on PMAC Array

The FFT, which transforms signals from the time domain to the frequency domain, is an important information processing algorithm. Its general formulas are:

$$X(k) = \mathrm{FFT}[x(n)] = \sum_{n=0}^{N-1} x(n)\, W_N^{kn} \qquad (1)$$

$$x(n) = \mathrm{IFFT}[X(k)] = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-kn} \qquad (2)$$

where $0 \le k \le N-1$ and the rotating factor is $W_N = e^{-j 2\pi / N}$. As formulas (1) and (2) show, the FFT and the IFFT are very similar: both are based on many regular butterfly operations. The butterfly operation dataflow is shown in Fig. 6.
Fig. 6. Butterfly operation dataflow
A, B and W are all complex numbers: A=a1+a2i, B=b1+b2i, W=w1+w2i. After combining terms, the butterfly operation can be divided into two parts, each of which has 4 real additions and 4 real multiplications. By changing the a_s signal of each cell periodically, the butterfly operation can easily be implemented on 4 PMACs (Fig. 7).
Fig. 7. The interconnection mode of butterfly operation implemented on PMAC array
In Fig. 7, the solid lines denote effective interconnections and the shaded ones ineffective interconnections. Every butterfly operation is completed in two cycles; that is, each group of inputs is held for two cycles. In the first cycle, a_s_1, a_s_3 and a_s_4 are high and a_s_2 is low; the 4 PMACs carry out formulas (3) and (4), which form the A+BW branch. In the second cycle, the 4 a_s signals are inverted; the 4 PMACs carry out formulas (5) and (6), which form the A-BW branch.
Out_re1 = (a1 + b1*w1) + b2i*w2i     (3)
Out_im1 = (a2i + b2i*w1) + b1*w2i    (4)
Out_re2 = (a1 - b1*w1) - b2i*w2i     (5)
Out_im2 = (a2i - b2i*w1) - b1*w2i    (6)
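Formulas (3) to (6) can be checked against ordinary complex arithmetic. The sketch below works in floating point rather than in the Q4/Q7 fixed-point formats of the hardware, and the function name `butterfly` is hypothetical; it only illustrates that each branch needs the four real products b1*w1, b2*w2, b2*w1 and b1*w2.

```python
def butterfly(a: complex, b: complex, w: complex):
    """Return (A + B*W, A - B*W) using only real additions and products."""
    a1, a2 = a.real, a.imag
    b1, b2 = b.real, b.imag
    w1, w2 = w.real, w.imag
    # Since i*i = -1, the term written b2i*w2i in (3) contributes -b2*w2.
    out_re1 = (a1 + b1 * w1) - b2 * w2   # formula (3)
    out_im1 = (a2 + b2 * w1) + b1 * w2   # formula (4), coefficient of i
    out_re2 = (a1 - b1 * w1) + b2 * w2   # formula (5)
    out_im2 = (a2 - b2 * w1) - b1 * w2   # formula (6), coefficient of i
    return complex(out_re1, out_im1), complex(out_re2, out_im2)
```

The sign pattern of the last term in each line is what the a_s signals select: one of the four adders subtracts in the first cycle, and all four signals are inverted in the second.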
In this process, the rotating factor W is scaled at Q7 and the other data are scaled at Q4. The simulation is shown in Fig. 8. For example, for the 1st group of data, a1=96, a2=64, b1=48, b2=56, w1=48, w2=64, the outputs are out_re1=91=>5.6875≈5.625, out_re2=97=>6.0625≈6.025, out_im1=108=>6.75≈6.8125 and out_im2=21=>1.3125≈1.2875; the average error between the outputs and the actual calculation results is about 0.0406.
Fig. 8. The Post-Place&Route simulation of butterfly operation implemented on PMAC array
We implemented the same butterfly operation on the (FPGA) Slices array and on the MULT&Slice resource for comparison. Table 3 shows the performance and efficiency comparison of the butterfly operation implemented on the different resources.

Table 3. The performance and efficiency comparison of butterfly operation implemented on different hardware resource

Logic resource       (FPGA)Slices    MULT&Slice    PMAC
Frequency (MHz)      122.654         174.368       225.887
Utilization ratio    65%

σ^m | f(τ^{m+1}) ≤ f(σ^m)} and Card(N) = 0
2. If M = {υ^{m−1} < σ^m | f(υ^{m−1}) ≥ f(σ^m)} and Card(M) = 0

An example of a discrete Morse function is given in Fig. 1. Following [4], a discrete vector field is defined on a cell complex with a discrete Morse function as follows.

Definition 3. Let K be an arbitrary cell complex. A discrete vector field V on K is a mapping V : K → K ∪ {∅} such that the following conditions are satisfied:
1. If V(σ) ≠ ∅ then dim V(σ) = dim σ + 1.
2. If V(σ) = τ then V(τ) = ∅.
3. For all σ ∈ K, the cardinality |V^{−1}(σ)| ≤ 1.

In other words, the discrete vector field is a sequence of cells σ_1, τ_1, σ_2, τ_2, ..., σ_n, τ_n, called a vector chain. In each pair σ_i, τ_i the cell σ_i is a face of τ_i. The second and third conditions state that each cell can serve as an image or pre-image of no more than one cell and, therefore, V is well-defined.

N. Dowding and A.M. Tyrrell

Fig. 1. Discrete Morse function on the cell complex and the corresponding discrete vector field

A particularly useful class of discrete vector fields connected with Morse functions is the class of gradient-like discrete vector fields. Gradient-like discrete vector fields are associated with a discrete Morse function and can be constructed in several steps. Start with the sub-complex K^1. For every non-critical 1-cell σ^1, if a vertex v ∈ ∂σ^1 satisfies f(v) > f(σ^1), establish the vector that goes from v to σ^1. Proceed to higher dimensions until all cells and their boundaries have been examined. That is, if σ^m < τ^{m+1} and f(τ^{m+1}) ≤ f(σ^m), then the vector starts at τ^{m+1} and finishes at σ^m. The discrete vector field shown in Fig. 1 belongs to the class of gradient-like vector fields. If K is a cell complex and f : K → R is a discrete Morse function, then a unique gradient-like vector field corresponding to f can be defined on K.

The discrete vector field establishes the rules of transition from one arbitrary cell of the cell complex to another until a critical cell is reached, thus determining the discrete flow on the cell complex. Hence, the discrete gradient-like vector field leads one away from a zero-valued critical vertex towards a maximal critical vertex. Specifying the location of the replacement processor and establishing a valid route to it are the most challenging problems for a distributive reconfiguration strategy (see [9]). In Section 2 it will be shown how discrete Morse functions can be used to discover an appropriate location for the replacement processor and to establish a valid path to this location.
2 Sliding Tree Algorithm

A successful reconfiguration strategy for embedded binary trees should ideally have the following properties:
– It should be able to construct new routes bypassing the faulty vertices
– It should preserve the mutual disposition of the tree nodes
Sliding Algorithm for Reconfigurable Arrays of Processors
Algorithm 1. Sliding Tree Algorithm
if k is not critical vertex then
    Update distance function on grid G
end if
if k is in H-tree then
    Remove subtree Sb rooted at k and all nodes between k and its parent p
    Update distance function on G
    Compute the area consumed by the subtree Sb
    Compute the target value of distance function C
    Find node v with LPV(v) = C using Path Searching Algorithm
    if v is not found then
        STOP
    end if
    Construct path from p to v
    Plant subtree Sb at node v
end if
The second property allows one to maintain the integrity of the tree and its layout, which guarantees the construction of optimal routes and the possibility of repeated reconfiguration. Assume that an arbitrary operational node v of the embedded tree has failed, and denote the subtree rooted at v by Sb(v). Then communication between the root and all nodes of Sb(v) is damaged. To recover from the fault, the subtree Sb(v) is removed together with the path between v and its parent node p. A replacement node is searched for, and the subtree Sb is restored at the new position. If the algorithm fails to find an appropriate position, the reconfiguration fails.

2.1 Distance Function
Let G be a graph representation of the mesh-connected array of processors, called the grid graph. The discrete Morse function proposed in this section is defined in two steps: first, its values on the vertices are defined; then they are complemented by values on the edges. Let C be the set of vertices of the grid that should be excluded from the routing process. These vertices are called critical. All vertices that correspond to faulty processors are included in the class C, as are the vertices of the embedded tree and the vertices that lie on the border of the grid. The distance function is defined as follows:

Definition 4. For each non-critical vertex v of the graph G with the set of critical vertices C = {C_1, C_2, ..., C_n}, define the distance function as

$$LPV(v) = \min_{C_i \in C} \Big[ \min_{c \in C_i} \mathrm{length}(v, c) \Big] \qquad (1)$$
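Definition 4 can be evaluated for all vertices at once with a multi-source breadth-first search. The sketch below is one possible reading, not the authors' implementation: it assumes that length(v, c) is the shortest-path length in a 4-neighbour grid graph, that the classes C_i are merged into a single collection of critical cells, and that vertices are addressed as (row, column) pairs. The function name `distance_function` is hypothetical.

```python
from collections import deque


def distance_function(rows, cols, critical):
    """Return a dict mapping each reachable grid vertex to LPV(v).

    Critical vertices get the value 0; every other vertex gets the
    length of the shortest grid path to its nearest critical vertex.
    """
    lpv = {c: 0 for c in critical}
    queue = deque(lpv)
    while queue:
        r, c = queue.popleft()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and nxt not in lpv:
                lpv[nxt] = lpv[(r, c)] + 1   # one wavefront further out
                queue.append(nxt)
    return lpv
```

For a single critical vertex the equal-valued vertices form diamond-shaped wavefronts, since the search expands one grid step at a time from every critical vertex simultaneously.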
Note that the values of the distance function on the strong critical vertices are zero. An example of the distance function for one critical vertex is given in Fig. 1. Vertices that are equidistant from the critical one form the well-known diamond-shaped wavefronts (e.g. [10], [11]). With the function defined in this way, another set of critical vertices on the grid can be identified. For these vertices the following inequality holds:

$$LPV(u) \le LPV(v) \quad \text{for all } u \in I(v) \qquad (2)$$

where I(v) is the set of vertices adjacent to the vertex v. Collectively these vertices are called the maximal critical vertices of the graph G corresponding to the function LPV(v). They subdivide the graph into a collection of subgraphs known as neighborhood subgraphs, or N-subgraphs, of G. The distance function is a Morse function on each separate N-subgraph; hence it can be called a locally discrete Morse function. Locally discrete Morse functions are studied in [5].

2.2 Discrete Vector Fields
With the discrete function LPV defined on the set of vertices, the discrete vector field can be defined in a non-unique way. To maintain the structural integrity of the embedded tree, it is desirable that the discrete flow emanating from a vertex of the grid graph has a minimal number of twists. To achieve this it is necessary to choose the direction of the flow. More precisely, the discrete vector field should be built in the following manner. Let G be a grid graph and let C be the set of critical vertices. Assume that the distance function is computed for each vertex and that N-subgraphs are defined for each critical vertex. Let F : G → G ∪ {∅} be a discrete vector field defined as follows:
– If v is a critical vertex, then the image of v under F is the empty set: Im(v) = ∅.
– If v is not critical, let the edge g = (u_1, u_2) define the priority direction of the vector field on the N-subgraph H (see Fig. 2). Let v be a vertex of H. Let E_v = {(v, w_i)}, where 1 ≤ i ≤ 4, be the set of edges incident to v such that LPV(w_i) > LPV(v). Then the edge e_j ∈ E_v is assigned to be the image of v under F if the following equation holds: ω(e_j) = min_k(α(g, e_k)), where α(g, e_k) is the oriented angle between the edges g and e_k.

The method of defining the discrete vector field for a given discrete Morse function can vary and may be chosen to suit the task. For example, different vector fields are obtained when a different local priority vector is chosen.

2.3 Path Searching Algorithm
In this section it will be shown how the distance function and the direction preserving vector field can be used in the search of an appropriate replacement vertex and establishing the valid route.
Fig. 2. Discrete vector field on the grid. Assume that f indicates priority direction. Numbers in the boxes show the value of distance function of the node. Vertex W has three candidate vertices with higher values of distance function. Then the vertex B is chosen because edge (W, B) is co-directed with f .
Due to the regularity of the H-tree layout, the area of the grid consumed by a tree of level d can easily be computed using a simple recursive algorithm. Knowing the area required by a subtree, it is easy to conclude that the minimal value of the distance function for a candidate node u for the root of the subtree should satisfy equation (3), where h and w denote the height and the width of the H-tree expressed in numbers of vertices:

$$LPV(u) = \max\left\{ \frac{w}{2} + 1,\ \frac{h}{2} + 1 \right\} \qquad (3)$$

The priority direction is chosen to be pnt − cld, where cld is the root of the subtree to be relocated and pnt is its parent node. The Path Searching Algorithm aims to find a vertex that satisfies (3). The search is directed by the distance function and the discrete vector field defined in Section 2.2. The Path Searching Algorithm follows the discrete vector field until it hits a maximal critical vertex, where the search stops.

The discrete vector field can conveniently direct the search within an N-subgraph. But N-subgraphs constitute rather small portions of the grid, especially when a large number of critical points is present. When the path reaches a maximal critical vertex w and LPV(w) is less than required by (3), it is desirable that the algorithm continues the search. This can be done by a simple extension of the Basic Scheme of the Path Searching Algorithm, in which the extended family of nearest nodes is examined when a maximal critical vertex is found. The algorithm continues the search along the non-growing vertices for a finite number of steps. This enhanced algorithm is called the Extended Scheme.

In this research two types of algorithm have been tested. The first, the basic search algorithm, completes execution upon meeting a maximal critical point. The second, the extended search algorithm, continues the search until a satisfactory vertex is found or until a specified number of search steps has elapsed.
Algorithm 2. Path Searching Algorithm (Basic Scheme)
Input: 1. target value of distance function C
       2. Starting at parent node p: v = p
PATH ⇐ v
while LPV(v) < C do
    list L ⇐ incident vertices(v)
    M = max_{u∈L}(LPV(u))
    for all u ∈ L do
        if LPV(u) = M then
            S ⇐ u
        end if
    end for
    if L not ∅ then
        ω = 0
        for all u ∈ L do
            if ω < angle[(v, u), direction] then
                ω = angle[(v, u), direction]
                best = u
            end if
        end for
        PATH ⇐ best
        v = best
    else
        STOP
    end if
end while
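The basic scheme can be sketched in a few lines. The version below is an interpretation rather than a transcription of Algorithm 2: following Section 2.2 and Fig. 2, it moves only to neighbours with a strictly higher distance value and, among those, picks the one whose step is best aligned with the priority direction. The names `basic_path_search` and `angle_between`, the (row, column) vertex encoding and the dictionary `lpv` are all assumptions made for illustration.

```python
import math


def angle_between(d1, d2):
    """Unsigned angle in radians between two 2-D direction vectors."""
    dot = d1[0] * d2[0] + d1[1] * d2[1]
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    return abs(math.atan2(cross, dot))


def basic_path_search(lpv, start, target, direction):
    """Follow increasing LPV values from start until LPV(v) >= target.

    lpv maps (row, col) -> distance value. Returns the path as a list of
    vertices, or None if a maximal critical vertex is hit first.
    """
    path = [start]
    v = start
    while lpv[v] < target:
        r, c = v
        higher = [u for u in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
                  if u in lpv and lpv[u] > lpv[v]]
        if not higher:
            return None  # no growing neighbour: the basic scheme stops here
        # Prefer the step that deviates least from the priority direction.
        v = min(higher,
                key=lambda u: angle_between((u[0] - r, u[1] - c), direction))
        path.append(v)
    return path
```

Because every step strictly increases the distance value, the loop always terminates. An extended scheme would replace the `return None` branch with a bounded search over non-growing neighbours.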
3 Results

To test the reconfiguration scheme based on the Sliding Tree Algorithm, a simulation has been designed and the reliability of the system (and, hence, the effectiveness of the algorithm) has been estimated. The simulation code was written in C++ and run under Microsoft Visual Studio 6. The LEDA 4.4.1 library by Algorithmic Solutions Software GmbH was used to support graph manipulations. The simulation was run under the following assumptions:
– all processors are fault-free at the initial moment of time
– processors are statistically independent, that is, the failure of a single processor does not affect the status of other processors in the array
– faults are fatal and faulty processors cannot be used again
– buses and communication lines are considered to be fault-free

These assumptions allow one to design a realistic reliability model of the array without excessive complications. Figure 3 compares the search techniques of the Basic and Extended Schemes.
[Plot: survivability coefficient versus number of faulty nodes (5 to 60) for the basic and extended schemes, d = 5 and d = 6]
Fig. 3. Survivability coefficient for the basic scheme and the extended scheme for the trees of different depth
The reliability of a system that consists of N components over the time period t is defined as ρ(t) = S(t)/N (see [16]), where S(t) is the number of surviving components at time t and N is the total number of components. The reliability of a single processing element over a period of time t is given by the expression R(t) = e^{−λt}, where λ is the failure rate, usually expressed as the number of failures per 10^6 hours. To get a more realistic picture, assume that all processing elements of the grid have equal probability of failure. To estimate the reliability of the system, the method considered in [11] is employed. Let C_i be the survivability coefficient, defined as the probability that the array survives with i faults. C_i is determined as C_i = S_i/K, where S_i is the number of successful recoveries of the system from i faults and K is the number of patterns with i faulty nodes. The reliability of the system can be expressed as

$$R(t) = \sum_{i=0}^{K} \binom{M}{i} C_i R^{M-i} (1 - R)^i \qquad (4)$$

The survivability coefficient C_i is determined experimentally using the simulation. A uniform random number generator was used to generate two random integers x, y ∈ {1, ..., max(h(d), w(d))} and to simulate the failure of the processor with coordinates (x, y). Then the Sliding Tree Algorithm is used to reconfigure the tree.

To understand the level of reliability of the proposed algorithm, it is useful to compare it with previous work on reconfigurable binary trees (see [7], [8]). All three schemes are specialized reconfiguration schemes for binary trees. Such a comparison is difficult because of differences in characteristics such as the number of spares. Also, the parameters of the experiments (time range and depth of the tree) differ substantially. However, it is still possible to compare the absolute numbers of the reliability evaluation. Figure 4 compares the reliability of the basic and extended versions of the Sliding Tree Algorithm with two specialized schemes, the modular tree scheme and the RAE scheme, presented in [8] and [7] respectively.

[Plot: reliability versus time (h × 10^6, 0.05 to 0.7) for Basic STA, Extended STA, the modular scheme and the RAE scheme]

Fig. 4. Comparison of reliability of Basic and Extended schemes with two specialized schemes: modular scheme and RAE scheme

Figure 4 shows that both the basic and the extended scheme demonstrate reliability higher than 0.9 up to 0.25 × 10^6 h. During this period they show reliability close to that of the RAE and modular schemes. Moreover, the extended scheme outperforms the RAE scheme up to 0.35 × 10^6 h. This is a promising result for a general architecture compared with specialized ones.
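Equation (4) is straightforward to evaluate once the survivability coefficients are known. The sketch below assumes one particular reading of the formula, namely a binomial weight with M taken as the number of processing elements and the sum running over the fault counts for which a coefficient C_i is available. The function names and the argument layout are hypothetical.

```python
from math import comb, exp


def element_reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t) for a single processing element."""
    return exp(-failure_rate * t)


def system_reliability(surv, m: int, r: float) -> float:
    """Sum over i of comb(m, i) * C_i * r**(m - i) * (1 - r)**i.

    surv[i] is the survivability coefficient C_i for i faults,
    m is the assumed number of processing elements,
    r is the reliability of one element at the time of interest.
    """
    return sum(comb(m, i) * c_i * r ** (m - i) * (1 - r) ** i
               for i, c_i in enumerate(surv))
```

Two sanity checks follow from the binomial theorem: if every fault pattern is survivable (all C_i equal to 1 for i = 0..m) the result is 1, and if no fault can be tolerated (only C_0 = 1) the result is r**m.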
4 Conclusion

The framework for reconfiguration algorithm design presented in this paper proposes a general method for developing distributive reconfiguration algorithms based on discrete Morse functions and discrete vector fields. The general framework will be called the General Design Environment for Reconfiguration Strategies (GDERS). GDERS represents a generic and systematic approach to the problem of reconfiguration algorithms. The method of discrete functions on cell complexes used in GDERS can be applied to any topology of the host arrays. Moreover, the class of target arrays can be extended to the class of planar graphs. This can be done by considering the minimal spanning tree of the target graph during the process of reconfiguration and reconnecting the nodes of the target graph when reconfiguration is complete. Another advantage of the proposed reconfiguration strategy is distributivity. The Sliding Tree Algorithm demonstrates the capabilities of the distributive approach on the tree structure, which is characterized by strong data dependency. The Sliding Tree Algorithm serves as an example of how GDERS can be used in practice. In addition, the reliability demonstrated by the Sliding Tree Algorithm is good even when compared to the specialized schemes.

The Sliding Tree Algorithm has considerable potential for future work. The most promising directions are methods of decreasing the number of spare nodes and further study of discrete Morse functions with the aim of improving the Path Searching Algorithm of the main Sliding Tree Algorithm.

The main outcome of this paper is the design of a distributive and flexible reconfiguration algorithm for mesh-connected arrays of processors which is capable of recovery after repeated faults. The importance of the Sliding Tree Algorithm for the further development of adaptive and evolvable systems follows from the advantages that the algorithm can deliver. First of all, applying the algorithm for fault tolerance can provide uninterrupted operation of the system, thus allowing the system to evolve new features and to adapt to a new environment. Moreover, the algorithm can be used generally for restructuring the target array, hence assisting adaptation and self-development under specific conditions. In addition, the generic nature of the proposed approach allows one to apply the algorithm to different target arrays and different discrete functions.
References

1. Milnor, J.: Morse Theory. Ann. Math. St., Princeton Univ. Pr. (1973)
2. Forman, R.: A User's Guide to Morse Theory. In: Sem. Lotharingen Comb. (2002)
3. Forman, R.: Morse Theory for Cell Complexes. Adv. in Math. 134, 90–145 (1998)
4. Forman, R.: Combinatorial Vector Fields and Dynamical Systems. Mathematische Zeitschrift 228, 629–681 (1998)
5. Goresky, M., MacPherson, R.: Stratified Morse Theory. Springer, Berlin and Heidelberg GmbH (1988)
6. Leiserson, C.E.: Area-Efficient Graph Layout (for VLSI). In: IEEE 21st Ann. Symp. Found. Comp. Sc. (1980)
7. Raghavendra, C.S., Avizienis, A., Ercegovac, M.D.: Fault Tolerance in Binary Tree Architectures. IEEE Trans. Comp. 33, 568–572 (1984)
8. Hassan, A., Agarval, V.: A Fault-tolerant modular architecture for binary trees. IEEE Trans. Comp. 35, 356–361 (1986)
9. Jigang, W., Shrikanthan, T.: An Improved Reconfiguration Algorithm for Degradable VLSI/WSI Array. Jour. Syst. Architecture 49, 23–31 (2003)
10. Lee, C.Y.: The Algorithm for Path Connections and Its Applications. IRE Trans. Electr. Comp. EC-10, 346–365 (1961)
11. Kung, S.-Y., Jean, S.-N., Chang, C.-W.: Fault-Tolerant Array Processors Using Single Track Switches. IEEE Trans. Comp. 38, 501–514 (1989)
12. Abachi, H., Walker, A.-J.: Reliability analysis of tree, torus and hypercube message passing architecture. In: Proc. of the 29th Southeast. Symp. on Syst. Th., pp. 44–48. IEEE Computer Society Press, Los Alamitos (1997)
13. Chean, M., Fortes, J.A.B.: A Taxonomy of Reconfiguration Techniques for Fault-Tolerant Processor Arrays. IEEE Comp. 23, 55–69 (1990)
14. Ortega, C., Mange, D., Smith, S.L., Tyrrell, A.M.: Embryonics: A Bio-Inspired Cellular Architecture with Fault-Tolerant Properties. Jour. of Gen. Prog. and Evol. Machines 1, 187–215 (2000)
15. Greenstead, A.J., Tyrrell, A.M.: An Endocrinologic-Inspired Hardware Implementation of a Multicellular System. In: Proc. NASA/DoD Conf. Evol. Hardware, Seattle (2004)
16. Lala, P.K.: Fault-Tolerant and Fault Testable Hardware Design. Prentice Hall Int., Englewood Cliffs (1985)
System-Level Modeling and Multi-objective Evolutionary Design of Pipelined FFT Processors for Wireless OFDM Receivers

Erfu Yang¹, Ahmet T. Erdogan¹, Tughrul Arslan¹, and Nick Barton²

¹ School of Engineering and Electronics
² School of Biological Sciences
The University of Edinburgh, King's Buildings, Edinburgh EH9 3JL, United Kingdom
{E.Yang,Ahmet.Erdogan,T.Arslan,N.Barton}@ed.ac.uk

Abstract. The precision and power consumption of pipelined FFT processors are highly affected by the wordlengths in fixed-point application systems. Because the design space is nonconvex, wordlength optimization under multiple competing objectives is a complex, time-consuming task. This paper proposes a new approach to the multi-objective evolutionary optimization design of pipelined FFT processors for wireless OFDM receivers. In our new approach, the number of design variables can be significantly reduced. We also fully investigate how the internal wordlength configuration affects the precision and power consumption of the FFT by setting the wordlengths of the input and the FFT coefficients to 12 and 16 bits in fixed-point number type. A new system-level model for representing the power consumption of the pipelined FFT is also developed and utilized in this paper. Finally, simulation results are provided to validate the effectiveness of applying the nondominated sorting genetic algorithm to the multi-objective evolutionary design of a 1024-point pipelined FFT processor for wireless OFDM receivers.
1 Introduction

Precision and power consumption are of prime importance in fixed-point DSP (digital signal processing) applications. The FFT (fast Fourier transform) has emerged as one of the main DSP algorithms and has been widely applied in wireless communication systems. Since the FFT is the key component in wireless OFDM (Orthogonal-Frequency-Division-Multiplexing) receivers [1, 2], it is chosen as a benchmark system in this paper to investigate system-level modeling and evolutionary optimization issues under multiple objectives. Determining the optimum wordlength plays an important role in a complex DSP system [2]. In a complex system this process may take more than half of the total design time. Since the design space is nonconvex, there are often many locally optimal solutions [2]. Though it has been reported in [3] that a good trade-off between precision and complexity can be found by using different wordlengths for all the stages, the wordlengths obtained by the simulation-based strategy may be far from globally optimal or Pareto-optimal in the domain of multi-objective optimization.

L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 210–221, 2007. © Springer-Verlag Berlin Heidelberg 2007
To reach a trade-off between conflicting objectives such as power consumption and precision when designing a full-custom pipelined FFT processor with variable wordlength, a number of wordlength configurations for the different stages of the FFT processor should be explored in a rapid and efficient manner. The conventional method for exploring different wordlength configurations is based on simulation and offline analysis, which often requires long runtimes and results in low design efficiency. Multi-objective evolutionary algorithms, however, can automatically generate all the optimal configurations under the given design constraints, which significantly improves the design efficiency under multiple objectives. The aim of this paper is to present an automatic approach for efficiently generating all the Pareto-optimal solutions under two competing design objectives, i.e., power consumption and precision. Toward this end, a multi-objective evolutionary algorithm is selected and applied to the multi-objective optimization of pipelined FFT processors for wireless OFDM receivers. Compared with the commonly used simulation-based wordlength optimization methods, the evolution-based multi-objective optimization approach is capable of exploring the entire search space. Moreover, all the solutions can be generated simultaneously by running the algorithm. Therefore, the time for determining optimum wordlengths can be significantly reduced.
2
Related Work
The precision and hardware complexity of fixed-point FFT processors are affected by the wordlengths. In recent years, wordlength-based optimization has received considerable attention [4, 5, 6, 7, 2]. In [4], a genetic algorithm (GA) was used to optimize the wordlength in a 16-point radix-4 pipelined FFT processor. However, only the wordlengths of the input data and the FFT coefficients were optimized by the GA, and the work in [4] can be classified as a single-objective optimization problem. Multi-objective evolutionary algorithms have also been applied to the design of DSP and VLSI systems [8, 9, 10]. Multi-objective optimization of pipelined FFT processors was investigated in particular in [5, 6, 7], where a multi-objective genetic algorithm was employed to find solutions for the FFT coefficients that give optimum performance in terms of signal-to-noise ratio (SNR) and power consumption. In these works, the authors focused only on the impact of the wordlength of the fixed-point FFT coefficients on performance and power consumption. Because the research objective was to find FFT coefficients with optimum performance, the FFT coefficients were used directly to encode the chromosomes in [5, 6, 7]. Due to this representation, the size of the chromosome used by the multi-objective genetic algorithms in [5, 6, 7] depends strongly on the size of the FFT. For example, for a 1024-point pipelined FFT, the number of variables required in [5, 6, 7] is more than 2048. Together with a large population, this makes the computational requirements and algorithm complexity in [5, 6, 7] extremely high.
Compared with these existing works, this paper presents a new method to deal with the multi-objective optimization problems arising in the design of pipelined FFT processors for fixed-point applications. In our new multi-objective optimization strategy, the number of design variables is the same as the number of stages in the pipelined FFT processor. For example, there are only 10 design variables to be optimized in a 1024-point FFT. Unlike [5, 6, 7], we fully investigate how the internal wordlength configuration affects the precision and power consumption of the FFT by setting the wordlengths of the input and the FFT coefficients to 12 and 16 bits, respectively, in fixed-point number type. Full fixed-point operations are used in this study to perform the multi-objective optimizations, whereas only partial fixed-point computation was used in [5, 6, 7]. Furthermore, we adopt a new system-level model to represent power consumption rather than a lower-level dynamic-switching-based power consumption model [6, 7].
There are also other wordlength-based optimizations. In [2] a fast algorithm for finding an optimum wordlength was presented, which uses the sensitivity information of hardware complexity and system output error with respect to the signal wordlengths. However, a complexity-and-distortion measure is needed to update the search direction, and it may be very difficult to obtain such a measure for a complex system. Moreover, it is also hard to extend optimization methods based on direction information to other, more complex applications. To find the optimum wordlengths for all the stages of FFT processors, simulation-based results have been reported in [3], where the wordlengths in an 8k-point pipelined FFT processor were optimized by using a C model. However, the computation time and complexity increase exponentially as the size of the FFT processor increases. As a result, this simulation-based wordlength optimization method can only search part of the solution space. In addition, the power consumption was not explicitly dealt with in [3]. In [2] the wordlength optimization of an OFDM demodulator was investigated as a case study. Only 4 wordlength variables, i.e., the FFT input, equalizer right input, channel estimator input, and equalizer upper input, were selected. The internal wordlengths of the FFT were assumed to have already been decided. In their simulations, only the inputs to the FFT and other blocks were constrained to be of fixed-point type, whereas the blocks themselves, including the FFT, were simulated in floating-point type. In this study, by contrast, we deal with the multi-objective optimization issues by selecting the internal wordlengths of the FFT as design variables. In our simulations, the entire FFT processor is simulated in fixed-point type. Thus, the results will be closer to the real DSP hardware environment.
3
Pipelined FFT Processors and Wordlength Optimizations
The precision of pipelined FFT processors depends on the wordlength. To increase the precision, a longer wordlength is desirable. However, a short wordlength is preferred for reducing the cost and power consumption. Although the optimum wordlength may be obtained by trial and error or by computer-based simulations, the process itself is extremely time-consuming. It has been reported that in a complex system it may take 50% of the design time to determine the optimum wordlength, see [2] and the references therein. For pipelined FFT processors it is very difficult to obtain an analytical equation for the optimum wordlength, even when all the stages of the FFT processor have the same wordlength. When the wordlength of each stage needs to be configured independently, determining the optimum wordlengths for the whole processor becomes particularly challenging unless there is an efficient method to automatically generate all the trade-off solutions under multiple competing objectives and constraints.
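As a rough illustration of why a longer wordlength improves precision, the following sketch rounds a value to a signed fixed-point number and compares the rounding error for several wordlengths. The function name `quantize`, the split into `wordlength` and `frac_bits`, round-to-nearest, and saturation on overflow are all assumptions made for this example; the paper does not specify the number format beyond the wordlength.

```python
def quantize(value, wordlength, frac_bits):
    """Round `value` to a signed fixed-point number (illustrative assumption).

    `wordlength` is the total number of bits (including the sign bit) and
    `frac_bits` is the number of fractional bits. Overflow saturates.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (wordlength - 1))        # most negative integer code
    highest = (1 << (wordlength - 1)) - 1    # most positive integer code
    code = round(value * scale)              # Python rounds exact ties to even
    code = max(lowest, min(highest, code))   # saturate instead of wrapping
    return code / scale


# Usage: the error of representing 0.3 for three hypothetical wordlengths,
# with all bits except the sign bit used as fractional bits.
for bits in (8, 12, 16):
    approx = quantize(0.3, bits, bits - 1)
    print(bits, approx, abs(0.3 - approx))
```

Each additional fractional bit halves the worst-case rounding error, which is the precision side of the trade-off; the cost side (memory and power) grows with the wordlength.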
4
Statement of the Multi-objective Optimization Problem
Let $x$ denote the design variable vector in multi-objective optimizations. In this paper the output SQNR (signal-to-quantization-noise ratio) is used to measure the precision of fixed-point pipelined FFT processors. The first objective to be optimized is defined by

$$\min f_1(x) = D - \mathrm{SQNR}(x) \tag{1}$$

where $D$ is a positive constant that can be used to represent the desired SQNR. The SQNR is calculated as follows:

$$\mathrm{SQNR}(x) = 10 \log_{10} \frac{\sum_{i=1}^{2N} \left[ (R_i^{fl})^2 + (I_i^{fl})^2 \right]}{\sum_{i=1}^{2N} \left[ (R_i^{fl} - R_i^{fix})^2 + (I_i^{fl} - I_i^{fix})^2 \right]} \tag{2}$$

where $N$ is the data length of the FFT, and $R_i^{fl}$ and $I_i^{fl}$ are the real and imaginary parts of the output of the floating-point FFT. Correspondingly, $R_i^{fix}$ and $I_i^{fix}$ are the real and imaginary parts of the output of the fixed-point FFT. The second objective is to minimize the power consumption, i.e.,

$$\min f_2(x) = \mathrm{Power}(x) \tag{3}$$

where $\mathrm{Power}(x)$ will be detailed in the next section. Now we are in a position to state the multi-objective optimization problem under consideration in this paper as follows:

Problem 1 (multi-objective optimization problem). Find the design variable vector $x^* \in \Omega$ under some constraints to minimize the objective vector function

$$\min f(x) = [f_1(x), f_2(x)] \tag{4}$$

in the sense of Pareto optimality, i.e., for every $x \in \Omega$ either $f_i(x) = f_i(x^*)$ for all $i \in \{1, 2\}$, or there is at least one $i \in \{1, 2\}$ such that $f_i(x) > f_i(x^*)$.
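The precision objective can be sketched in a few lines. The code below is only an illustration of (1) and (2), assuming the floating-point and fixed-point FFT outputs are available as equal-length sequences of Python complex numbers; the function names `sqnr_db` and `precision_objective` are hypothetical, and the default of 40 dB for `D` is the value used in the simulation settings later in the paper.

```python
import math


def sqnr_db(float_out, fixed_out):
    """SQNR in dB following (2): signal energy over quantization-error energy."""
    if len(float_out) != len(fixed_out):
        raise ValueError("outputs must have the same length")
    # |z|^2 equals R^2 + I^2 for a complex sample z = R + jI.
    signal = sum(abs(z) ** 2 for z in float_out)
    noise = sum(abs(zf - zq) ** 2 for zf, zq in zip(float_out, fixed_out))
    if noise == 0.0:
        return math.inf  # fixed-point output is identical to the reference
    return 10.0 * math.log10(signal / noise)


def precision_objective(float_out, fixed_out, desired_db=40.0):
    """Objective f1 following (1): D - SQNR(x), to be minimized."""
    return desired_db - sqnr_db(float_out, fixed_out)


# Usage with two made-up output samples (not data from the paper).
reference = [1 + 0j, 0 + 1j]
quantized = [0.9 + 0j, 0 + 1j]
print(sqnr_db(reference, quantized), precision_objective(reference, quantized))
```

The sums simply run over all samples supplied, so the caller decides which outputs enter the comparison.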
So for the multi-objective optimization problem, we are interested in finding all the nondominated solutions. A solution $x_1$ dominates a solution $x_2$ if both of the following conditions hold:
– 1) $\forall i \in \{1, 2\}$: $f_i(x_1) \le f_i(x_2)$
– 2) $\exists j \in \{1, 2\}$: $f_j(x_1) < f_j(x_2)$
The Pareto-optimal set is defined as the set of all solutions that are nondominated within the entire search space. The objective function values of the Pareto-optimal set constitute the Pareto front, which we intend to find and which is used to assess the performance of a multi-objective optimization algorithm. As wordlength variables in a wireless OFDM receiver, we choose the wordlength of each stage in the pipelined FFT processor, since these wordlengths have the most significant effect on precision and power consumption in the OFDM system. The inputs and outputs of the FFT are set to 12-bit fixed-point numbers, since 12 bits are sufficient to represent the values required in this study. We can therefore focus on the effect of the internal wordlengths on the performance measurements of the FFT. It should be noted that in this case the number of variables is determined only by the size of the FFT processor. For example, there are 10 wordlength variables for a 1024-point FFT since there are 10 stages in total.
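The two dominance conditions translate directly into code. The sketch below is a minimal illustration for minimization problems, assuming each solution is represented by a tuple of objective values such as `(f1, f2)`; the names `dominates` and `nondominated` and the sample values are hypothetical.

```python
def dominates(a, b):
    """True if objective vector `a` dominates `b` (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))         # condition 1
    strictly_better = any(x < y for x, y in zip(a, b))   # condition 2
    return no_worse and strictly_better


def nondominated(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Usage with made-up (SQNR error in dB, power in mW) pairs.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
print(nondominated(candidates))
```

Here `(3.0, 4.0)` is dropped because `(2.0, 3.0)` is better in both objectives, while the remaining three points trade one objective against the other.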
5
System-Level Modeling of Pipelined FFT Processors
To solve the multi-objective fixed-point optimization problems of pipelined FFT processors with different numbers of data points, a new system-level modeling approach is needed. By using such a system-level approach, a computational model can be developed to give the main performance metrics of pipelined FFT processors, such as power consumption, precision, and area. In this paper, we focus on developing a system-level model to compute the power consumption of a fixed-point FFT processor under different numbers of data points and different wordlengths. The precision metric can be obtained from a relatively simple relation, as shown in (2). A parameterized system-level model is desired in order to reduce the computational complexity and resource requirements in power-efficient embedded applications. The model developed by a system-level approach also needs to provide some flexibility and scalability. Toward this end, in this study we propose four parameterized system-level models to represent the power consumption of FFT processors, i.e., single exponential fitting, bi-exponential fitting, single polynomial fitting, and bi-polynomial fitting. For the exponential fitting methods, the following modeling formula is used:

$$\mathrm{Power} = a e^{b M_{size}} + c e^{d M_{size}} \tag{5}$$

where $M_{size}$ is the FIFO memory size required by the FFT. For the polynomial fitting methods, a cubic polynomial function is used, i.e.:

$$\mathrm{Power} = a M_{size}^3 + b M_{size}^2 + c M_{size} + d \tag{6}$$
For bi-exponential and bi-polynomial modeling, the modeling data were grouped into two sections according to whether the memory size lies in [0.0625, 0.5] KB or in [1, 16] KB. To determine the coefficients of the system-level models (5) and (6), a set of data representing the power consumption of pipelined FFT processors under different FIFO memory sizes is needed. For this purpose, both the modeling and the validating data sets can be obtained by using an accurate SystemC-based system-level simulator for FFT processors. In the SystemC-based system-level simulator, a gate-level power analysis is performed for each component by selecting a variety of parameters and representative input vectors [11]. The power information obtained for each component is then back-annotated to its SystemC object. After executing the complete SystemC-based simulator, sufficiently accurate power estimates can be obtained.
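To make the structure of these models concrete, the sketch below evaluates (5) and (6) and selects one of two coefficient sets depending on the memory-size section. It is an illustration only: the function names are hypothetical, the coefficients must come from a fit to simulator data (none are given in the paper), and rejecting memory sizes that fall outside both sections is an assumption of this example.

```python
import math

LOW_SECTION_KB = (0.0625, 0.5)   # first section of the modeling data
HIGH_SECTION_KB = (1.0, 16.0)    # second section of the modeling data


def power_exponential(m_size_kb, a, b, c, d):
    """Model (5): a*exp(b*M) + c*exp(d*M)."""
    return a * math.exp(b * m_size_kb) + c * math.exp(d * m_size_kb)


def power_cubic(m_size_kb, a, b, c, d):
    """Model (6): a*M^3 + b*M^2 + c*M + d, evaluated in Horner form."""
    return ((a * m_size_kb + b) * m_size_kb + c) * m_size_kb + d


def power_two_section(m_size_kb, low_coeffs, high_coeffs, model=power_exponential):
    """Evaluate `model` with the coefficient set of the matching section."""
    if LOW_SECTION_KB[0] <= m_size_kb <= LOW_SECTION_KB[1]:
        return model(m_size_kb, *low_coeffs)
    if HIGH_SECTION_KB[0] <= m_size_kb <= HIGH_SECTION_KB[1]:
        return model(m_size_kb, *high_coeffs)
    raise ValueError("memory size lies outside both modeled sections")
```

Passing `model=power_cubic` gives the bi-polynomial variant with the same section logic, so the optimizer only ever sees a function from memory size to estimated power.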
Fig. 1. Modelling FFT power consumption with FIFO memory size (plot of power consumption (mW) against memory size (KB), comparing the original data with the cubic polynomial, bi-cubic polynomial, exponential, and bi-exponential fits)
The flexibility and scalability of the proposed system-level models lie in their coefficients, which can easily be re-determined by using different modeling and validating data sets representing different architectures of FFT processors. To model the relationship between the dominating FIFO memory size and the system-level power consumption, the only requirement is to present the necessary data set to the parameterized system-level models. The advantage of this parameterized system-level modeling is that it does not influence the multi-objective optimization algorithms, which will be detailed later in this paper. As an example, Fig. 1 shows a system-level modeling case that uses a set of data representing the system-level power consumption of FFT processors of different sizes. The figure also compares the four modeling methods on the same modeling and validating data set. From this figure we observe that the bi-exponential fitting method gives the best modeling performance, so it is used in the simulation example reported later in this paper.
6
Multi-objective Evolutionary Design
Since the wordlength optimization problem is a noncontinuous optimization problem with a nonconvex search space, it is hard to find a globally optimal solution [2]. Conventional optimization techniques, such as gradient-based and simplex-based methods, cannot be applied efficiently to this kind of optimization problem because there is no easy way to obtain gradient information or to analytically calculate the derivatives of the complex objective functions $f_1(x)$ and $f_2(x)$. Hence, non-gradient optimizers dominate in these applications. Many popular non-gradient optimization methods are currently available, such as simulated annealing (SA), evolutionary algorithms (EAs) including genetic algorithms, random search, Tabu search, and particle swarm optimization (PSO). Among these methods, evolutionary algorithms have been recognized as possibly well suited to multi-objective optimization problems, see [12]. In this study, multi-objective evolutionary algorithms (MOEAs) are also chosen as the main optimizers because of their advantages over other optimization methods. In particular, we are interested in applying NSGA-II [13] to the multi-objective optimization problem under consideration. In comparison to NSGA [14], NSGA-II has many advantages and has been used in many applications. NSGA-II was developed by using a fast sorting algorithm and incorporating elitism, and there is no need to specify a sharing parameter. In the NSGA-II algorithm, the initial population of size N is first created with random values. The population is then sorted, based on nondomination, into different fronts. The first front is the completely nondominated set in the current population, and the second front is dominated only by the individuals in the first front.
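The sorting into fronts described above can be sketched as follows. This is a deliberately simple version that repeatedly peels off the currently nondominated individuals; it is not the fast sorting procedure of NSGA-II in [13], only an illustration that should yield the same fronts. The function names and sample values are hypothetical.

```python
def dominates(a, b):
    """True if objective vector `a` dominates `b` (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def sort_into_fronts(objectives):
    """Return a list of fronts, each a sorted list of indices into `objectives`.

    The first list is the nondominated front (rank 1), the second list is the
    front that remains nondominated once the first is removed, and so on.
    """
    remaining = set(range(len(objectives)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(objectives[j], objectives[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Usage with made-up (f1, f2) values for five individuals.
population = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 5.0)]
for rank, front in enumerate(sort_into_fronts(population), start=1):
    print(rank, front)
```

The rank printed for each front corresponds to the nondomination level that is used as the fitness value during selection.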
Each individual in each front is assigned a fitness (or rank) value equal to its nondomination level (the fitness of the first level is 1, that of the next-best level is 2, and so on). Once the nondominated sort is completed, the usual tournament selection, recombination, and mutation operators are used to generate an offspring population. After the initial generation, elitism is introduced by combining the current and previous population members, so the size of the combined population is doubled. This intermediate population is then sorted according to nondomination by using a fast sorting algorithm. After a new population of size N has been generated, selection is performed to produce the individuals of the next generation. The offspring are generated from the selected parents by using the usual crossover and mutation operators. In NSGA-II, the diversity among nondominated solutions is maintained by a crowding comparison procedure, see [13] for more details. One of the disadvantages of NSGA-II in [13] is that it does not provide a method or metric to measure how good or bad the final solutions are, particularly for a very complex engineering multi-objective optimization problem. In this study we therefore also have to find appropriate metrics to assess the quality of the multi-objective optimal design of pipelined FFT processors. Generally, the metrics used to measure the performance of heuristic evolutionary algorithms for multi-objective optimization include Generational Distance (GD), Spacing (SP), Maximum Spread (MS), etc. For more details on these performance metrics, the reader is referred to [15] and the references therein. In this paper we choose only the SP and MS metrics to measure the performance of NSGA-II for the multi-objective optimization problem of the pipelined FFT processor under consideration. The SP is used to measure how well the solutions are distributed (spaced) in the nondominated sets found so far by MOEAs. It is computed from the variance of the distances between neighboring vectors in the nondominated sets. The SP is defined by

$$S_p = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2} \tag{7}$$

where $n$ is the number of nondominated vectors found so far, $d_i = \min_{j=1, j \ne i}^{n} \left( \sum_{m=1}^{M} |f_m^i - f_m^j| \right)$, and $\bar{d} = \sum_{i=1}^{n} d_i / n$, in which $M$ is the total number of objective functions. Since SP measures the standard deviation of the distances among the solutions found so far, the smaller it is, the better the distribution of the solutions. A value of zero indicates that all the points on the Pareto front are evenly spaced. The MS is used to represent the maximum extension between the farthest solutions in the nondominated sets found so far. MS is defined as follows:

$$M_s = \sqrt{\sum_{m=1}^{M} \left( \max_{i=1}^{n} f_m^i - \min_{i=1}^{n} f_m^i \right)^2} \tag{8}$$

Unlike the metric SP, a bigger MS value indicates a better performance.
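Both metrics are short to compute once a nondominated front is available. The sketch below follows (7) and (8), assuming the front is given as a list of objective tuples with at least two members; the function names `spacing` and `maximum_spread` and the sample front are hypothetical.

```python
import math


def spacing(front):
    """SP following (7): spread of nearest-neighbor distances in the front."""
    n = len(front)
    if n < 2:
        raise ValueError("spacing needs at least two nondominated vectors")
    # d_i: smallest sum of absolute objective differences to any other vector.
    d = [
        min(sum(abs(a - b) for a, b in zip(front[i], front[j]))
            for j in range(n) if j != i)
        for i in range(n)
    ]
    d_bar = sum(d) / n
    return math.sqrt(sum((d_i - d_bar) ** 2 for d_i in d) / (n - 1))


def maximum_spread(front):
    """MS following (8): diagonal of the box enclosing the front."""
    num_objectives = len(front[0])
    return math.sqrt(sum(
        (max(p[m] for p in front) - min(p[m] for p in front)) ** 2
        for m in range(num_objectives)
    ))


# Usage with a made-up, evenly spaced front.
example_front = [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0), (3.0, 0.0)]
print(spacing(example_front), maximum_spread(example_front))
```

For the evenly spaced example every nearest-neighbor distance is the same, so the spacing is zero, while the spread depends only on the extreme values of each objective.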
7
Simulation Results
In this section we demonstrate a design example in which the nondominated sorting genetic algorithm (NSGA-II) is applied to the multi-objective optimization of a 1024-point pipelined FFT processor. In the simulation the control parameters of NSGA-II were set as follows: maximum number of generations = 250, population size = 100, probability of crossover = 0.85, and probability of mutation = 0.15. These control parameters are held constant during a design run.
The other initial settings for the entire design are summarized in the following:
– 1) The inputs and outputs of the FFT were set to be 12-bit fixed-point numbers in the specification.
– 2) D was set to be 40 dB.
– 3) The FFT coefficients had a wordlength of 16 bits and were assumed to be stored in a ROM.
– 4) There was no scaling for any of the stages of the FFT.
– 5) There were 10 design variables in total, ranging from 8 to 24.
– 6) The bi-exponential model for power consumption was used in the whole simulation.
– 7) The desired SP and MS were 0.25 and 35.0, respectively.
We have run the complete program over 10 times, and the success rate under the design requirements above is 100%. The results obtained from one typical run are shown in Figs. 2-4. In Fig. 2 the x-axis represents the SQNR error defined by (1) and the y-axis is the power consumption (mW). In this design example the SP and MS performance metrics are 0.1910 and 39.6434, respectively. Figure 3 illustrates the Pareto-optimal set, from which a trade-off design of the internal wordlength configuration can easily be made according to further design requirements and the user's preferences. The number of individuals on each ranked front over the generations is plotted in Fig. 4. From this figure we can also observe that the multi-objective optimization algorithm worked as intended. It should also be noted that the results are only for the purpose of demonstration. If the settings of the control parameters of the MOEA are changed, the performance of the trade-off design may be further improved.
Fig. 2. Pareto front obtained by MOEA for 1K FFT (plot titled "The non-dominated front": power consumption (mW) against SQNR error (dB))
Fig. 3. Pareto-optimal set (panels (a)-(j): wordlength of stage no. 1-10, respectively, plotted against the solution index)
Fig. 4. Number of individuals on each ranked front (3-D plot titled "Ranking-based multi-objective evolutions": number of individuals against ranking no. and generation)
Unlike the commonly used simulation-based wordlength optimization methods, the approach proposed in this paper is capable of exploring the entire search space. Most importantly, all the solutions can be generated simultaneously by exploring the entire search space. Therefore, the time for determining the optimum wordlengths can be expected to be significantly reduced. In this example, the computation time is only 111.76 seconds (Windows XP platform, DELL D810 laptop). The trade-off design can now be made easily in terms of the Pareto-optimal front and set.
8
Conclusion
Determining the optimum wordlength for pipelined FFT processors under multiple competing objectives is a complex, time-consuming task. A new approach to solving the multi-objective evolutionary optimization design of pipelined FFT processors for wireless OFDM receivers has been proposed in this paper. How the internal wordlength configuration affects the precision and power consumption of the FFT has been fully investigated in this paper by setting the wordlengths of input and FFT coefficients to be 12 and 16 bits in fixed-point number type. A new system-level model for representing power consumption of pipelined FFT processors has also been developed and utilized in this paper. Simulation results have been provided to validate the effectiveness of applying the nondominated sorting genetic algorithm to the multi-objective evolutionary design of a 1024-point pipelined FFT processor for wireless OFDM receivers.
Acknowledgment
This research is funded by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant EP/C546318/1. The authors thank all of the team members of the ESPACENET project (Evolvable Networks of Intelligent and Secure Integrated and Distributed Reconfigurable System-On-Chip Sensor Nodes for Aerospace Based Monitoring and Diagnostics), which involves the Universities of Edinburgh, Surrey, Essex, and Kent, Surrey Satellite Technology (SSTL), NASA Jet Propulsion Laboratory (JPL), EPSON, and Spiral Gateway. The support of Stefan Johansson, Peter Nilsson, Nasri Sulaiman, and Ali Ahmadinia is greatly appreciated.

References
1. Fechtel, S.A., Blaickner, A.: Efficient FFT and equalizer implementation for OFDM receivers. IEEE Transactions on Consumer Electronics 45(4), 1104–1107 (1999)
2. Han, K., Evans, B.L.: Optimum wordlength search using sensitivity information. EURASIP Journal on Applied Signal Processing 5, 1–14 (2006)
3. Johansson, S., He, S., Nilsson, P.: Wordlength optimization of a pipelined FFT processor. In: Proceedings of the 42nd Midwest Symposium on Circuits and Systems, Las Cruces, NM, pp. 501–503 (August 1999)
4. Sulaiman, N., Arslan, T.: A genetic algorithm for the optimisation of a reconfigurable pipelined FFT processor. In: Proceedings of the 2004 NASA/DoD Conference on Evolvable Hardware, Seattle, WA, June 24-26, 2004, pp. 104–108 (2004)
5. Sulaiman, N., Arslan, T.: A multi-objective genetic algorithm for on-chip real-time optimisation of word length and power consumption in a pipelined FFT processor targeting a MC-CDMA receiver. In: Proceedings of the 2005 NASA/DoD Conference on Evolvable Hardware, Washington, D.C., June 29 - July 1, 2005, pp. 154–159 (2005)
6. Sulaiman, N., Erdogan, A.T.: A multi-objective genetic algorithm for on-chip real-time adaptation of a multi-carrier based telecommunications receiver. In: Proceedings of the 1st NASA/ESA Conference on Adaptive Hardware and Systems, Istanbul, Turkey, June 15-18, 2006, pp. 424–427 (2006)
7. Sulaiman, N., Arslan, T.: A multi-objective genetic algorithm for on-chip real-time adaptation of a multi-carrier based telecommunications receiver. In: Proceedings of the 2006 IEEE Congress on Evolutionary Computation (CEC 2006), Vancouver, BC, Canada, July 16-21, 2006, pp. 3161–3165. IEEE Computer Society Press, Los Alamitos (2006)
8. Bright, M., Arslan, T.: Multi-objective design strategies for high-level low-power design of DSP system. In: Proceedings of the IEEE International Symposium on Circuits and Systems, Orlando, Florida, pp. 80–83. IEEE Computer Society Press, Los Alamitos (June 1999)
9. Palermo, G., Silvano, C., Zaccaria, V.: Multi-objective design space exploration of embedded systems. Journal of Embedded Computing 1(11), 1–9 (2002)
10. Talarico, C., Rodriguez-Marek, E., Sung Koh, M.: Multi-objective design space exploration methodologies for platform based SOCs. In: Proceedings of the 13th Annual IEEE International Symposium and Workshop on Engineering of Computer Based Systems, Potsdam, Germany, March 27-30, 2006, pp. 353–359. IEEE Computer Society Press, Los Alamitos (2006)
11. Ahmadinia, A., Ahmad, B., Arslan, T.: System level reconfigurable FFT architecture for system-on-chip design. In: Proceedings of the 2nd NASA/ESA Conference on Adaptive Hardware and Systems, August 5-7, 2007, Edinburgh, UK (to appear, 2007)
12. Deb, K.: Multi-Objective Optimization Using Evolutionary Algorithms, 1st edn. John Wiley & Sons, Ltd, Chichester (2002)
13. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 182–197 (2002)
14. Srinivas, N., Deb, K.: Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation 2(3), 221–248 (1994)
15. Salazar-Lechuga, M., Rowe, J.E.: Particle swarm optimization and fitness sharing to solve multi-objective optimization problems. In: Proceedings of the IEEE Swarm Intelligence Symposium 2006, Indianapolis, Indiana, May 12-14, 2006, pp. 90–97. IEEE Computer Society Press, Los Alamitos (2006)
16. Han, K., Evans, B.L.: Wordlength optimization with complexity-and-distortion measure and its applications to broadband wireless demodulator design. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 17-21, 2004, pp. 37–40. IEEE Computer Society Press, Los Alamitos (2004)
17. Jenkins, W.K., Mansen, A.J.: Variable word length DSP using serial-by-modulus residue arithmetic. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, MN, pp. 89–92. IEEE Computer Society Press, Los Alamitos (May 1993)
18. Pappalardo, F., Visalli, G., Scarana, M.: An application-oriented analysis of power/precision trade-off in fixed and floating-point arithmetic units for VLSI processors. In: Circuits, Signals, and Systems, pp. 416–421 (2004)
Reducing the Area on a Chip Using a Bank of Evolved Filters
Zdenek Vasicek and Lukas Sekanina
Faculty of Information Technology, Brno University of Technology, Božetěchova 2, 612 66 Brno, Czech Republic
[email protected], [email protected]
Abstract. An evolutionary algorithm is utilized to find a set of image filters which can be employed in a bank of image filters. This filter bank exhibits at least comparable visual quality of filtering in comparison with a sophisticated adaptive median filter when applied to remove salt-and-pepper noise of high intensity (up to 70% corrupted pixels). The main advantage of this approach is that it requires four times fewer resources on a chip than the adaptive median filter. The solution also behaves very well for impulse burst noise, which is typical for satellite images.
1
Introduction
As low-cost digital cameras have found their way into almost every place, the need for high-quality, high-performance, and low-cost image filters is growing. In this paper, a new approach to the design of impulse noise filters is proposed. The aim is to introduce a class of simple image filters that utilize small filtering windows and whose performance is at least comparable to existing well-tuned algorithms designed for common processors. Furthermore, an area-efficient hardware implementation is required because these filters have to be implemented on off-the-shelf hardware, such as field programmable gate arrays (FPGA).
In most cases, impulse noise is caused by malfunctioning pixels in camera sensors, faulty memory locations in hardware, or errors in data transmission (especially in satellite images [1]). We distinguish two common types of impulse noise: the salt-and-pepper noise (commonly referred to as intensity spikes or speckle) and the random-valued shot noise. For images corrupted by salt-and-pepper noise, the noisy pixels can take only the maximum or minimum values (i.e. 0 or 255 for 8-bit grayscale images). In the case of random-valued shot noise, the noisy pixels have an arbitrary value. We will deal with the salt-and-pepper noise in this paper. Traditionally, salt-and-pepper noise is removed by median filters. When the noise intensity is less than approx. 10%, a simple median utilizing a 3×3 or 5×5-pixel window is sufficient.
Evolutionary algorithms (EA) have been applied to image filter design problems in recent years [2,3,4]. An EA is utilized either to find some coefficients of a pre-designed filtering algorithm or to devise the complete structure of a target image filter. While the first approach only allows existing designs to be tuned, the second approach has led to completely new, previously unknown filtering schemes [4]. The images filtered by evolved filters are not as smudged as the images filtered by median filters. Moreover, evolved filters occupy only approx. 70% of the area needed to implement the median filter on a chip.
When the intensity of noise increases (10-90% of pixels are corrupted), simple median filters are not sufficient and more advanced techniques have to be utilized. Various approaches have been proposed (see a survey of the methods, e.g. in [5]). Among others, adaptive medians provide good results [6]. However, they utilize large filtering windows, and additional values (such as the maximum and minimum value of the filtering window) have to be calculated. This makes them expensive in terms of hardware resources. Other algorithms are difficult to accelerate in hardware for real-time processing of images coming from cameras. Unfortunately, the evolutionary design approach stated above, which works up to 10% noise intensity, does not work for higher noise intensities. The method proposed in this paper combines simple evolved filters with human-designed components to create a bank of 3 × 3 filters which provides a sufficient filtering quality for high noise intensities (up to 70%) and, simultaneously, a very low implementation cost in hardware.
L. Kang, Y. Liu, and S. Zeng (Eds.): ICES 2007, LNCS 4684, pp. 222–232, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Conventional Image Filters
Various approaches have been proposed to remove salt-and-pepper noise from grayscale images [7,8,9,5]. As linear filters tend to smooth the image, most of the proposed approaches are nonlinear. The median filter is the most popular nonlinear filter for removing impulse noise because of its good denoising power, computational efficiency and reasonable implementation cost in hardware [10]. The median filter utilizes the fact that original and corrupted pixels are significantly different and hence the corrupted pixels can easily be identified as non-medians. However, when the noise level (the number of corrupted pixels) increases, some pixels remain corrupted and unfiltered [11]. Median filters which utilize larger filtering windows are capable of removing noise of high intensity, but the filtered images do not exhibit a sufficient visual quality.

The adaptive median filters produce significantly better resulting images than conventional medians [6]. The filter operates with a kernel of Smax × Smax pixels. The kernel is divided into subkernels of size 3 × 3, 5 × 5, . . . , Smax × Smax inputs. For each subkernel, the minimum, maximum and median value is calculated. In order to obtain the filtered pixel, the calculated values are processed by the algorithm described in [6].

Fig. 1. Images obtained by using conventional filters. a) Original image b) Noisy image corrupted by 40% salt-and-pepper noise, PSNR: 9.364 dB c) Filtered by median filter with the kernel size 3 × 3, PSNR: 18.293 dB d) Filtered by median filter with the kernel size 5 × 5, PSNR: 24.102 dB e) Filtered by adaptive median with the kernel size up to 5 × 5, PSNR: 26.906 dB f) Filtered by adaptive median with the kernel size up to 7 × 7, PSNR: 27.315 dB.

To visually compare images filtered by the conventional median filter and the adaptive median filter, Figure 1 provides some examples for 40% salt-and-pepper noise (PSNR stands for the peak signal-to-noise ratio). We can observe that there are many unfiltered shots in the image obtained by the median filter. We used the 3 × 3 kernel in order to easily compare the results described in the next sections. Note that the use of larger kernels implies that many details are lost in the image. On the other hand, the image obtained by the adaptive median filter is sharp and preserves details. However, the adaptive median (with a 5 × 5 filtering window) costs approx. eight times more area on a chip in comparison to a conventional 3 × 3 median filter. Even better visual results can be achieved by using more specialized algorithms [9], but their hardware implementation leads to area-expensive and slow solutions.
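The conventional 3 × 3 median baseline discussed in this section can be illustrated with a minimal, pure-Python sketch. It is not the implementation used for Fig. 1: the noise model (each pixel independently replaced by 0 or 255 with the given probability), the synthetic gradient image and all function names are assumptions made for illustration, and PSNR is computed with the usual definition for 8-bit images.

```python
import math
import random


def add_salt_and_pepper(image, intensity, rng):
    """Return a copy where each pixel becomes 0 or 255 with probability `intensity`."""
    noisy = [row[:] for row in image]
    for row in noisy:
        for j in range(len(row)):
            if rng.random() < intensity:
                row[j] = rng.choice((0, 255))
    return noisy


def median3x3(image):
    """Apply a 3x3 median filter; border pixels are copied unfiltered."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            window = [image[i + di][j + dj]
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            out[i][j] = sorted(window)[4]  # 5th smallest of 9 values
    return out


def psnr(filtered, original):
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    h, w = len(original), len(original[0])
    mse = sum((filtered[i][j] - original[i][j]) ** 2
              for i in range(h) for j in range(w)) / (h * w)
    return float("inf") if mse == 0 else 10 * math.log10(255 ** 2 / mse)


rng = random.Random(0)
# Synthetic diagonal gradient (assumption); not one of the paper's test images.
original = [[2 * (i + j) for j in range(64)] for i in range(64)]
noisy = add_salt_and_pepper(original, 0.10, rng)
filtered = median3x3(noisy)
print(f"noisy: {psnr(noisy, original):.2f} dB, "
      f"median 3x3: {psnr(filtered, original):.2f} dB")
```

Raising the `intensity` argument towards 0.4 lets one observe the effect described above: an increasing number of shots survive the 3 × 3 median because corrupted pixels start to dominate the window.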
3 Evolutionary Design of Image Filters
3.1 The Approach
This section describes the evolutionary method which can be utilized to create innovative 3 × 3 image filters [4]. These filters will be utilized in the proposed bank of filters. Every image filter is considered as a function (a digital circuit in the case of hardware implementation) of nine 8-bit inputs and a single 8-bit output, which processes grayscale (8 bits/pixel) images. As Fig. 2 shows, every pixel value of the filtered image is calculated using the corresponding pixel and its eight neighbors in the processed image. In order to evolve an image filter which removes a given type of noise, we need an original (training) image to measure the fitness values of candidate filters. The goal of EA is to minimize the difference between the original image and the filtered image. The generality of the evolved filters (i.e., whether the filters also work sufficiently well on other images corrupted by the same type of noise) is tested by means of a test set.
3.2 EA for Filter Evolution
The method is based on Cartesian Genetic Programming (CGP) [12]. A candidate filter is represented using a graph which contains nc (columns) × nr (rows) nodes placed in a grid. The role of EA is to find the interconnection of the programmable nodes and the functions performed by the nodes. Each node represents a two-input function which receives two 8-bit values and produces an 8-bit output. Table 1 shows the functions we consider useful for this task [4]. We can observe that these functions are also suitable for hardware implementation (i.e. there are no functions such as multiplication or division). A node input may be connected either to the output of another node placed anywhere in the preceding columns or to a primary input. Filters are encoded as arrays of integers of the size 3 × nc × nr + 1. For each node, three integers are utilized which encode the connections of the node inputs and the function. The last integer encodes the primary output of a candidate filter.

Table 1. List of functions implemented in each programmable node

code  function    description         code  function      description
0     255         constant            8     x ≫ 1         right shift by 1
1     x           identity            9     x ≫ 2         right shift by 2
2     255 − x     inversion           A     swap(x, y)    swap nibbles
3     x ∨ y       bitwise OR          B     x + y         + (addition)
4     ¬x ∨ y      bitwise ¬x OR y     C     x +S y        + with saturation
5     x ∧ y       bitwise AND         D     (x + y) ≫ 1   average
6     ¬(x ∧ y)    bitwise NAND        E     max(x, y)     maximum
7     x ⊕ y       bitwise XOR         F     min(x, y)     minimum

Fig. 2. The concept of image filtering using a 3 × 3 filter (left). An example of evolved filter (right).
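The encoding described above (three integers per node plus one output gene) can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' implementation. Several details are assumptions: nodes are numbered column by column after the nine primary inputs (as the numbering in Fig. 2 suggests), so a valid connection always refers to a lower index; plain addition wraps modulo 256; and the exact nibble arrangement of swap(x, y) is a guess, since the paper only names the operation.

```python
def swap_nibbles(x, y):
    # Assumption: low nibble of x becomes the high nibble of the result and
    # the high nibble of y becomes the low nibble. The paper does not specify.
    return ((x & 0x0F) << 4) | (y >> 4)


# Node functions of Table 1, indexed by function code 0..15 (A..F = 10..15).
FUNCTIONS = [
    lambda x, y: 255,               # 0: constant
    lambda x, y: x,                 # 1: identity
    lambda x, y: 255 - x,           # 2: inversion
    lambda x, y: x | y,             # 3: bitwise OR
    lambda x, y: (255 - x) | y,     # 4: bitwise (NOT x) OR y
    lambda x, y: x & y,             # 5: bitwise AND
    lambda x, y: 255 - (x & y),     # 6: bitwise NAND
    lambda x, y: x ^ y,             # 7: bitwise XOR
    lambda x, y: x >> 1,            # 8: right shift by 1
    lambda x, y: x >> 2,            # 9: right shift by 2
    swap_nibbles,                   # A: swap nibbles (see assumption above)
    lambda x, y: (x + y) & 0xFF,    # B: addition, assumed to wrap modulo 256
    lambda x, y: min(x + y, 255),   # C: addition with saturation
    lambda x, y: (x + y) >> 1,      # D: average
    lambda x, y: max(x, y),         # E: maximum
    lambda x, y: min(x, y),         # F: minimum
]


def evaluate(chromosome, inputs):
    """Evaluate a chromosome on nine 8-bit inputs I0..I8.

    Layout: (input_a, input_b, function_code) per node, then one output gene.
    Index k < 9 selects primary input k; index k >= 9 selects node k - 9.
    """
    values = list(inputs)
    node_count = (len(chromosome) - 1) // 3
    for n in range(node_count):
        a, b, code = chromosome[3 * n: 3 * n + 3]
        values.append(FUNCTIONS[code](values[a], values[b]))
    return values[chromosome[-1]]


# Hypothetical two-node example: node 9 = max(I0, I1), node 10 = min(node 9, I4).
example = [0, 1, 14, 9, 4, 15, 10]
window = [10, 200, 3, 4, 50, 6, 7, 8, 9]
print(evaluate(example, window))
```

A full 8 × 4 grid would use a chromosome of 3 × 32 + 1 = 97 integers under the same layout.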
EA uses a single genetic operator, mutation, which modifies 5% of the chromosome (this value was determined experimentally). No crossover operator is utilized in this type of EA because no suitable crossover operator has been proposed so far [13]. Mutation modifies either a node or an output connection. The EA operates with a population of λ individuals (typically, λ = 8). The initial population is randomly generated. Every new population consists of a parent (the fittest individual from the previous population) and its mutants. If two or more individuals have received the same fitness score in the previous generation, the individual which did not serve as the parent in the previous population is selected as the new parent. This strategy has proven to be very useful [12]. The evolution is typically stopped (1) when the current best fitness value has not improved in recent generations, or (2) after a predefined number of generations.

3.3 Fitness Function
The design objective is to minimize the difference between the filtered image and the original image. Usually, the mean difference per pixel (mdpp) is minimized. Let u denote a corrupted image and let v denote a filtered image. The original (uncorrupted) version of u will be denoted as w. The image size is K × K (K = 128) pixels but only the area of 126 × 126 pixels is considered because the pixel values at the borders are ignored and thus remain unfiltered. The fitness value of a candidate filter is obtained as

$$\mathit{fitness} = 255 \cdot (K-2)^2 - \sum_{i=1}^{K-2} \sum_{j=1}^{K-2} |v(i,j) - w(i,j)|.$$
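As a minimal sketch, the fitness value and the tie-breaking parent selection of Section 3.2 could be written as below. Images are assumed to be K × K lists of lists indexed from 0, so the sum over i, j = 1, ..., K − 2 covers exactly the interior pixels. The function names and the list-based population bookkeeping are illustrative assumptions, not taken from the paper.

```python
def fitness(filtered, original):
    """255*(K-2)^2 minus the sum of absolute differences over the interior."""
    k = len(original)
    total = sum(abs(filtered[i][j] - original[i][j])
                for i in range(1, k - 1) for j in range(1, k - 1))
    return 255 * (k - 2) ** 2 - total


def select_parent(scores, parent_index):
    """Return the index of the new parent.

    The fittest individual wins; among equally fit individuals, one that
    was not the previous parent is preferred.
    """
    best = max(scores)
    candidates = [i for i, s in enumerate(scores) if s == best]
    non_parents = [i for i in candidates if i != parent_index]
    return non_parents[0] if non_parents else parent_index


# A perfect filter on a 128 x 128 image reaches the maximum fitness.
clean = [[0] * 128 for _ in range(128)]
print(fitness(clean, clean))  # 255 * 126 * 126
```

Since higher values are better here, minimizing the mean difference per pixel and maximizing this fitness are equivalent for a fixed image size.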
3.4 Design Examples
This approach was utilized to evolve efficient image filters for Gaussian noise and 5% salt-and-pepper noise and to create novel implementations of edge detectors [4]. Examples of filtered images for 40% salt-and-pepper noise are given in Fig. 3. When compared with the common median filter (see Fig. 1 and PSNR), evolved filters preserve more details and generate sharper images. Note that these filtered images represent the best outputs that can be obtained by a single 3 × 3-input filter evolved using the described method for the 40% noise. Figure 2 shows an example of an evolved filter. We can observe that EA can create only a combinational behavior and the filter utilizes only 3 × 3 pixels at the input. These filters are not able to compete with adaptive median filters, which operate on larger kernels in a more sophisticated way. A way to improve evolved filters could be to increase the kernel size; however, this would lead to smoothing and losing details in images.
Fig. 3. A corrupted image (see Fig. 1) filtered by evolved filters a) evf1 (PSNR: 18.868 dB), b) evf2 (PSNR: 18.266 dB) and c) evf3 (PSNR: 18.584 dB)
4 Proposed Approach
In order to create a salt-and-pepper noise filter which generates filtered images of the same quality as an adaptive median filter and which is suitable for hardware implementation, we propose to combine several simple image filters utilizing the 3 × 3 window that are designed by an evolutionary algorithm according to Section 3. As Figure 4(a) shows, the procedure has three steps: (1) the reduction of the dynamic range of noise, (2) processing using a bank of n filters and (3) deterministic selection of the best result. We analyzed various filters evolved according to the description in Section 3 and recognized that they have problems with the large dynamic range of corrupted pixels (0/255). A straightforward solution to this problem is to create a component which inverts all pixels with value 255, i.e. all shots are transformed to have a uniform value.

Fig. 4. a) Proposed architecture for salt-and-pepper noise removal and b) training image
The preprocessed image then enters a bank of n filters that operate in parallel. Since we repeated the evolutionary design of salt-and-pepper noise filters (according to Section 3) many times, we have gathered various implementations of this type of filter. We selected n different evolved filters which exhibit the best filtering quality and utilized them in the bank. Note that all these filters were designed by EA using the same type of noise and training image and with the same aim: to remove 40% salt-and-pepper noise. Finally, the outputs coming from filters 1 . . . n are combined by an n-input median filter which can easily be implemented using comparators [14]. As the proposed system naturally forms a pipeline, the overall design can operate at the same frequency as a simple median filter when implemented in hardware.
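The three steps of the architecture can be sketched for a single 3 × 3 window and n = 3 as follows. The preprocessing and the comparator-style 3-input median follow the description above. The three filters in the bank, however, are hypothetical stand-ins (simple order statistics), because the evolved circuits themselves are not listed in the text; all names are illustrative.

```python
def preprocess(window):
    """Invert every pixel with value 255 so that all shots have the value 0."""
    return [0 if p == 255 else p for p in window]


def median3(a, b, c):
    """3-input median built only from min/max (comparator) operations."""
    return max(min(a, b), min(max(a, b), c))


def bank_filter(window, filters):
    """Preprocess the window, run three filters on it, take their median."""
    pre = preprocess(window)
    o0, o1, o2 = (f(pre) for f in filters)  # expects exactly three filters
    return median3(o0, o1, o2)


# Hypothetical stand-ins for evolved filters: 5th, 6th and 7th smallest
# value of the window. These are NOT the evolved filters from the paper.
stand_ins = [
    lambda w: sorted(w)[4],
    lambda w: sorted(w)[5],
    lambda w: sorted(w)[6],
]

window = [255, 12, 0, 14, 255, 13, 0, 15, 11]  # centre pixel is a shot
print(bank_filter(window, stand_ins))
```

In hardware the three filters would work in parallel on the same preprocessed window; the sequential calls here only mirror the data flow.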
5 Experimental Results
5.1 Quality of Filtering
The filters utilized in the bank were evolved using the method described in Section 3. These filters use a kernel of 3 × 3 pixels and contain up to 8 × 4 programmable nodes with functions according to Table 1.
Fig. 5. Examples of test images: a) bridge, b) goldhill, c) lena, and d) bridge, e) goldhill, f) lena with 40% noise

Table 2. PSNR [dB] for the adaptive median filter with the kernel size up to 5 × 5 and 7 × 7

kernel size  image/noise  10%     20%     40%     50%     70%
5 × 5        goldhill     31.155  30.085  26.906  24.290  15.859
5 × 5        bridge       29.474  28.064  24.993  22.567  14.781
5 × 5        lena         33.665  31.210  27.171  24.433  15.468
5 × 5        pentagon     32.767  31.460  28.235  25.217  16.315
5 × 5        camera       30.367  28.560  25.145  22.675  14.973
7 × 7        goldhill     31.155  30.085  27.315  25.961  20.884
7 × 7        bridge       29.474  28.058  25.177  23.710  19.060
7 × 7        lena         33.655  31.207  27.529  25.984  20.455
7 × 7        pentagon     32.767  31.460  28.621  27.175  21.654
7 × 7        camera       30.367  28.560  25.298  23.852  19.242
A training 128 × 128-pixel image was partially corrupted by 40% salt-and-pepper noise (see Fig. 4(b)). EA operates with an eight-member population. The 5% mutation is utilized. A single run is terminated after 196,608 generations. Results will be demonstrated for 5 test images of size 256 × 256 pixels which contain salt-and-pepper noise with the intensity of 10%, 20%, 40%, 50% and 70% corrupted pixels. Figure 5 shows some examples of test images. Table 2 summarizes results obtained for the adaptive median filter which serves as a reference implementation. All results are expressed in terms of

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\frac{1}{MN} \sum_{i,j} \left( v(i,j) - w(i,j) \right)^2}$$

where N × M is the size of the image. Table 3 summarizes results for the images filtered using the bank of size 3 and 5. The output pixel is calculated by a 3-input (5-input, respectively) median circuit. Surprisingly, only three filters utilized in the bank are needed to obtain a bank filter which produces images of at least comparable visual quality to the adaptive median filter. This fact is demonstrated by Figure 6, where the visual quality of the images filtered by the adaptive median and the 3-bank filter is practically indistinguishable.

Table 3. PSNR [dB] for the bank filter

filter  image/noise  10%     20%     40%     50%     70%
3-bank  goldhill     33.759  30.619  27.716  25.867  19.091
3-bank  bridge       31.458  28.992  25.83   24.282  18.333
3-bank  lena         30.304  28.162  25.684  24.137  18.324
3-bank  pentagon     34.631  31.89   28.681  26.577  18.437
3-bank  camera       30.576  28.185  25.284  23.72   17.85
5-bank  goldhill     34.392  31.131  27.966  25.965  19.079
5-bank  bridge       32.321  29.714  26.124  24.441  18.327
5-bank  lena         30.393  28.424  25.881  24.203  18.314
5-bank  pentagon     35.201  32.411  28.945  26.683  18.435
5-bank  camera       31.091  28.74   25.576  23.919  17.845

Table 4. Results of synthesis for different filters

filter                                  no. of slices  area [%]  fmax [MHz]
optimal median filter 3 × 3             268            1.1       305
optimal median filter 5 × 5             1506           6.4       305
optimal median filter 7 × 7             4426           18.7      305
adaptive median 5 × 5                   2024           8.6       303
adaptive median 7 × 7                   6567           27.8      298
evolved filter fb1 (utilized in bank)   156            0.7       316
evolved filter fb2 (utilized in bank)   199            0.8       318
evolved filter fb3 (utilized in bank)   137            0.6       308
evolved filter fb4 (utilized in bank)   183            0.8       321
evolved filter fb5 (utilized in bank)   148            0.6       320
proposed filter bank, 3-bank            500            2.1       308
proposed filter bank, 5-bank            843            3.6       305

5.2 Implementation Cost
In order to compare the implementation cost of median filters, adaptive median filters, evolved filters and the bank of filters, all these filters were implemented in an FPGA [15]. Results of synthesis are given for the relatively large Virtex II Pro XC2vp50-7 FPGA, which contains 23616 slices (configurable elements of the FPGA). This FPGA is available on our experimental Combo6x board. Table 4 shows that the proposed bank filters require a considerably smaller area on the chip in comparison to adaptive median filters, whose implementation is based on area-demanding sorting networks. In order to implement the proposed 3-bank filter in a small and cheap embedded system, a smaller FPGA, XC3S50, is sufficient (it contains 768 slices). However, a larger and more expensive XC3S400 FPGA (containing 3584 slices) has to be utilized to implement the adaptive median filter.

5.3 Other Properties of Evolved Bank Filters
Figure 7 shows another interesting feature we observed for the bank of evolved filters. This kind of filter is relatively good at removing impulse burst noise; much better than the adaptive median filters. Impulse bursts usually corrupt images during the data transmission phase when the impulse noise occurs. The main reason for the occurrence of bursts is the interference of the frequency-modulated carrier signal with signals from other sources of emission. Reliable elimination of this type of noise by means of standard robust filters can be achieved only by using sliding windows that are large enough. However, e.g., the 5 × 5 median filter leads to significant smearing of useful information [1]. Note that the images shown in Figure 7 were obtained by the bank filter which was not trained for impulse burst noise at all. This solution represents a promising area of our future research.

Fig. 6. Comparison of resulting images filtered using the adaptive median filter with kernel size up to 7 × 7 (a, b, c) and 3-bank filter (d, e, f)

Fig. 7. a) Image corrupted by 40% impulse noise (bursts), images filtered using b) adaptive median with kernel size up to 5 × 5 (PSNR: 11.495 dB) and c) 3-bank filter (PSNR: 22.618 dB)
6 Discussion
The proposed approach was evaluated on a single class of images. Future work will be devoted to testing the proposed filtering scheme on other types of images. Nevertheless, the results obtained for this class of images are quite promising from the application point of view. We can reach the quality of adaptive median filtering using a 3-bank filter while utilizing approximately four times fewer resources. This can potentially lead to a significant reduction of the power consumption of a target system. Moreover, Table 4 does not consider the implementation cost of supporting circuits (i.e. the FIFOs) needed to correctly read the filtering windows from memory. This cost can be significant since adaptive median filters require larger filtering windows than the bank filter.

Fig. 8. Example of an evolved filter utilized in the 3-bank filter

Currently we do not know exactly why three (five, respectively) filters evolved with the aim of removing 40% salt-and-pepper noise are able to suppress salt-and-pepper noise with the intensity up to 70%. Moreover, none of these filters alone works sufficiently well in the task for which it was trained (the 40% noise). We can speculate that although these filters perform the same task, they operate in different ways. While a median filter gives as its output one of the pixels of the filtering window, evolved filters sometimes produce new pixel values. By processing these n values in the n-input median, the shot can be suppressed. We tested several variants of evolved filters in the bank but never observed a significant degradation in the image quality.

The existence of several filters in the bank offers an opportunity to continuously evolve one of them while the remaining ones could still be sufficient to achieve correct filtering. A possible incorrect behavior of the candidate filter will probably not influence the system significantly. Therefore, this approach could lead to on-line adaptive filtering, especially in the case when EA can modify different filters of the bank. Note that a solution which uses only a single filter cannot be utilized in an on-line adaptive system in which the image processing must not be interrupted.
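The resource comparison quoted in this section can be checked directly against the slice counts of Table 4 and the device sizes given in Section 5.2. The short calculation below only re-derives those figures. Treating the Virtex II Pro slice counts as if they carried over unchanged to the smaller devices is a simplifying assumption of this sketch.

```python
TOTAL_SLICES = 23616  # Virtex II Pro XC2vp50, from Section 5.2

# Slice counts taken from Table 4.
slices = {
    "median 3x3": 268,
    "adaptive median 5x5": 2024,
    "adaptive median 7x7": 6567,
    "3-bank": 500,
    "5-bank": 843,
}

# Occupied area in percent of the device, rounded as in Table 4.
area = {name: round(100 * n / TOTAL_SLICES, 1) for name, n in slices.items()}

# How many times more slices the adaptive medians need than the 3-bank filter.
ratio_5x5 = slices["adaptive median 5x5"] / slices["3-bank"]
ratio_7x7 = slices["adaptive median 7x7"] / slices["3-bank"]

# Device capacities quoted in Section 5.2.
XC3S50_SLICES, XC3S400_SLICES = 768, 3584

print(area)
print(f"adaptive 5x5 / 3-bank: {ratio_5x5:.2f}")
print(f"adaptive 7x7 / 3-bank: {ratio_7x7:.2f}")
print("3-bank fits XC3S50:", slices["3-bank"] <= XC3S50_SLICES)
print("adaptive 5x5 fits XC3S50:", slices["adaptive median 5x5"] <= XC3S50_SLICES)
```

The factor of about four thus corresponds to the 5 × 5 adaptive median; against the 7 × 7 variant the gap is considerably larger.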
7 Conclusions
In this paper, a new class of image filters was introduced. The proposed bank filter consists of a set of evolved filters equipped with a simple preprocessing and post-processing unit. Our solution provides the same filtering capability as a standard adaptive median filter while using far fewer resources on a chip. The solution also exhibits very good behavior for impulse burst noise, which is typical for satellite images. In particular, the evolutionary design of image filters for this type of noise will be investigated in our future research.
Acknowledgements. This research was partially supported by the Grant Agency of the Czech Republic under No. 102/07/0850 "Design and hardware implementation of a patent-invention machine" and the Research Plan No. MSM 0021630528 "Security-Oriented Research in Information Technology".
References
1. Koivisto, P., Astola, J., Lukin, V., Melnik, V., Tsymbal, O.: Removing Impulse Bursts from Images by Training-Based Filtering. EURASIP Journal on Applied Signal Processing 2003(3), 223–237 (2003)
2. Dumoulin, J., Foster, J., Frenzel, J., McGrew, S.: Special Purpose Image Convolution with Evolvable Hardware. In: Oates, M.J., Lanzi, P.L., Li, Y., Cagnoni, S., Corne, D.W., Fogarty, T.C., Poli, R., Smith, G.D. (eds.) EvoWorkshops 2000. LNCS, vol. 1803, pp. 1–11. Springer, Heidelberg (2000)
3. Porter, P.: Evolution on FPGAs for Feature Extraction. PhD thesis, Queensland University of Technology, Brisbane, Australia (2001)
4. Sekanina, L.: Evolvable components: From Theory to Hardware Implementations. Natural Computing. Springer, Heidelberg (2004)
5. Schulte, S., Nachtegael, M., Witte, V.D., Van der Weken, D., Kerre, E.E.: Fuzzy impulse noise reduction methods for color images. In: Computational Intelligence, Theory and Applications International Conference 9th Fuzzy Days in Dortmund, pp. 711–720. Springer, Heidelberg (2006)
6. Hwang, H., Haddad, R.A.: New algorithms for adaptive median filters. In: Tzou, K.-H., Koga, T. (eds.) Proc. SPIE, Visual Communications and Image Processing '91: Image Processing, vol. 1606, pp. 400–407 (1991)
7. Yung, N.H., Lai, A.H.: Novel filter algorithm for removing impulse noise in digital images. In: Proc. SPIE, Visual Communications and Image Processing '95, vol. 2501, pp. 210–220 (1995)
8. Bar, L., Kiryati, N., Sochen, N.: Image deblurring in the presence of salt-and-pepper noise. In: Scale Space, pp. 107–118 (2005)
9. Nikolova, M.: A variational approach to remove outliers and impulse noise. J. Math. Imaging Vis. 20(1-2), 99–120 (2004)
10. Ahmad, M.O., Sundararajan, D.: A fast algorithm for two-dimensional median filtering. IEEE Transactions on Circuits and Systems 34, 1364–1374 (1987)
11. Dougherty, E.R., Astola, J.T.: Nonlinear Filters for Image Processing. SPIE/IEEE Series on Imaging Science & Engineering (1999)
12. Miller, J., Job, D., Vassilev, V.: Principles in the Evolutionary Design of Digital Circuits – Part I. Genetic Programming and Evolvable Machines 1(1), 8–35 (2000)
13. Slany, K., Sekanina, L.: Fitness landscape analysis and image filter evolution using functional-level cgp. In: EuroGP 2007. LNCS, vol. 4445, pp. 311–320. Springer, Heidelberg (2007)
14. Knuth, D.E.: The Art of Computer Programming: Sorting and Searching, 2nd edn. Addison-Wesley, Reading (1998)
15. Vasicek, Z., Sekanina, L.: An area-efficient alternative to adaptive median filtering in fpgas. In: Proc. of the 17th Conf. on Field Programmable Logic and Applications, pp. 1–6. IEEE Computer Society Press, Los Alamitos (to appear, 2007)
Walsh Function Systems: The Bisectional Evolutional Generation Pattern
Nengchao Wang¹, Jianhua Lu², and Baochang Shi¹,*
¹ Department of Mathematics, Huazhong University of Science and Technology, Wuhan, 430074, China
² State Key Laboratory of Coal Combustion, Huazhong University of Science and Technology, Wuhan, 430074, China
[email protected], [email protected]
Abstract. In this paper, the concept of evolution is introduced to examine the generation process for Walsh function systems. By considering the generation process for Walsh function systems as the evolution process of certain discrete dynamic systems, a new unified generation pattern which is called the Bisectional Evolutional Generation Pattern (BEGP for short) for Walsh function systems is proposed, combined with their properties of symmetric copying. As a byproduct of this kind of pattern, a kind of ordering for Walsh function systems which is called quasi-Hadamard ordering is found naturally.
Keywords: Walsh function, Quasi-Hadamard ordering, Bisectional Evolutional Pattern, Symmetric copying.
1 Introduction Walsh function systems are a kind of closed orthogonal function systems which take only two values +1 and -1. The first system was introduced by J. L.Walsh in 1923 [1] which now is called Walsh function system of Walsh ordering. The mathematical expression of the i th Walsh function of the k th family in this system has following form k −1
wk ,i ( x ) = ∏ sgn[cos ir 2 r π x ],
0 ≤ x