Advances in neural networks - ISNN 2008 5th International Symposium on Neural Networks, ISNN 2008, Beijing, China, September 24-28, 2008: proceedings [1 ed.] 9783540877318, 3540877312

The two volume set LNCS 5263/5264 constitutes the refereed proceedings of the 5th International Symposium on Neural Netw


English Pages 927 Year 2008


Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany

5263

Fuchun Sun Jianwei Zhang Ying Tan Jinde Cao Wen Yu (Eds.)

Advances in Neural Networks – ISNN 2008 5th International Symposium on Neural Networks, ISNN 2008 Beijing, China, September 24-28, 2008 Proceedings, Part I


Volume Editors

Fuchun Sun
Tsinghua University, Dept. of Computer Science and Technology
Beijing 100084, China
E-mail: [email protected]

Jianwei Zhang
University of Hamburg, Institute TAMS
22527 Hamburg, Germany
E-mail: [email protected]

Ying Tan
Peking University, Department of Machine Intelligence
Beijing 100871, China
E-mail: [email protected]

Jinde Cao
Southeast University, Department of Mathematics
Nanjing 210096, China
E-mail: [email protected]

Wen Yu
Departamento de Control Automático, CINVESTAV-IPN
México D.F., 07360, México
E-mail: [email protected]

Library of Congress Control Number: 2008934862
CR Subject Classification (1998): F.1.1, I.2.6, I.5.1, H.2.8, G.1.6
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-87731-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-87731-8 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2008
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12529735 06/3180 543210

Preface

This book and its companion volume, LNCS vols. 5263 and 5264, constitute the proceedings of the 5th International Symposium on Neural Networks (ISNN 2008) held in Beijing, the capital of China, during September 24–28, 2008. ISNN is a prestigious annual symposium on neural networks with past events held in Dalian (2004), Chongqing (2005), Chengdu (2006), and Nanjing (2007). Over the past few years, ISNN has matured into a well-established series of international symposiums on neural networks and related fields. Following the tradition, ISNN 2008 provided an academic forum for the participants to disseminate their new research findings and discuss emerging areas of research. It also created a stimulating environment for participants to interact with each other and exchange information on future challenges and opportunities of neural network research. ISNN 2008 received 522 submissions from about 1,306 authors in 34 countries and regions (Australia, Bangladesh, Belgium, Brazil, Canada, China, Czech Republic, Egypt, Finland, France, Germany, Hong Kong, India, Iran, Italy, Japan, South Korea, Malaysia, Mexico, The Netherlands, New Zealand, Poland, Qatar, Romania, Russia, Singapore, South Africa, Spain, Switzerland, Taiwan, Turkey, UK, USA, Virgin Islands (UK)) across six continents (Asia, Europe, North America, South America, Africa, and Oceania). Based on rigorous reviews by the Program Committee members and reviewers, 192 high-quality papers were selected for publication in the proceedings with an acceptance rate of 36.7%. These papers were organized in 18 cohesive sections covering all major topics of neural network research and development. In addition to the contributed papers, the ISNN 2008 technical program included four plenary speeches by Dimitri P. Bertsekas (Massachusetts Institute of Technology, USA), Helge Ritter (Bayreuth University, Germany), Jennie Si (Arizona State University, USA), and Hang Li (Microsoft Research Asia, China). 
Besides the regular sessions and panels, ISNN 2008 also featured four special sessions focusing on some emerging topics. As organizers of ISNN 2008, we would like to express our sincere thanks to Tsinghua University, Peking University, The Chinese University of Hong Kong, and the Institute of Automation at the Chinese Academy of Sciences for their sponsorship, and to the IEEE Computational Intelligence Society, International Neural Network Society, European Neural Network Society, Asia Pacific Neural Network Assembly, the China Neural Networks Council, and the National Natural Science Foundation of China for their technical co-sponsorship. We thank the National Natural Science Foundation of China and Microsoft Research Asia for their financial and logistic support. We would also like to thank the members of the Advisory Committee for their guidance, the members of the International Program Committee and additional reviewers for reviewing the papers, and members of the Publications Committee for checking the accepted papers in a short period of time. In particular, we would like to thank Springer for publishing the proceedings in the prestigious series of Lecture Notes in Computer Science. Meanwhile, we wish to express our heartfelt appreciation to the plenary and panel speakers, special session organizers, session Chairs, and student helpers. In addition, there are still many more colleagues, associates, friends, and supporters who helped us in immeasurable ways; we express our sincere gratitude to them all. Last but not least, we would like to thank all the speakers, authors, and participants for their great contributions that made ISNN 2008 successful and all the hard work worthwhile.

September 2008

Fuchun Sun Jianwei Zhang Ying Tan Jinde Cao Wen Yu

Organization

General Chair Bo Zhang, China

General Co-chair Jianwei Zhang, Germany

Advisory Committee Chairs Xingui He, China Yanda Li, China Shoujue Wang, China

Advisory Committee Members Hojjat Adeli, USA Shun-ichi Amari, Japan Zheng Bao, China Tianyou Chai, China Guoliang Chen, China Ruwei Dai, China Wlodzislaw Duch, Poland Chunbo Feng, China Walter J. Freeman, USA Kunihiko Fukushima, Japan Aike Guo, China Zhenya He, China Frank L. Lewis, USA Ruqian Lu, China Robert J. Marks II, USA Erkki Oja, Finland Nikhil R. Pal, India Marios M. Polycarpou, USA Leszek Rutkowski, Poland DeLiang Wang, USA Paul J. Werbos, USA Youshou Wu, China Donald C. Wunsch II, USA Youlun Xiong, China

Lei Xu, Hong Kong Shuzi Yang, China Xin Yao, UK Gary G. Yen, USA Bo Zhang, China Nanning Zheng, China Jacek M. Zurada, USA

Program Committee Chairs Ying Tan, China Jinde Cao, China Wen Yu, Mexico

Steering Committee Chairs Zengqi Sun, China Jun Wang, China

Organizing Committee Chairs Fuchun Sun, China Zengguang Hou, China

Plenary Sessions Chair Derong Liu, USA

Special Sessions Chairs Xiaoou Li, Mexico Changyin Sun, China Cong Wang, China

Publications Chairs Zhigang Zeng, China Yunong Zhang, China

Publicity Chairs Andrzej Cichocki, Japan Alois Knoll, Germany Yi Shen, China

Finance Chair Yujie Ding, China Huaping Liu, China

Registration Chair Fengge Wu, China

Local Arrangements Chairs Lei Guo, China Minsheng Zhao, China

Electronic Review Chair Xiaofeng Liao, China

Steering Committee Members Shumin Fei, China Chengan Guo, China Min Han, China Xiaofeng Liao, China Baoliang Lu, China Zongben Xu, China Zhang Yi, China Hujun Yin, UK Huaguang Zhang, China Ling Zhang, China Chunguang Zhou, China

Program Committee Members Ah-Hwee Tan, Singapore Alan Liew, Australia Amir Hussain, UK Andreas Stafylopatis, Greece Andries Engelbrecht, South Africa Andrzej Cichocki, Japan Bruno Apolloni, Italy Cheng Xiang, Singapore Chengan Guo, China Christos Tjortjis, UK

Chuandong Li, China Dacheng Tao, Hong Kong Daming Shi, Singapore Danchi Jiang, Australia Dewen Hu, China Dianhui Wang, Australia Erol Gelenbe, UK Fengli Ren, China Fuchun Sun, China Gerald Schaefer, UK Guangbin Huang, Singapore Haibo He, USA Haijun Jiang, China He Huang, Hong Kong Hon Keung Kwan, Canada Hongtao Lu, China Hongyong Zhao, China Hualou Liang, USA Huosheng Hu, UK James Lam, Hong Kong Jianquan Lu, China Jie Zhang, UK Jinde Cao, China Jinglu Hu, Japan Jinling Liang, China Jinwen Ma, China John Qiang Gan, UK Jonathan H. Chan, Thailand José Alfredo F. Costa, Brazil Ju Liu, China K. Vijayan Asari, USA Kang Li, UK Khurshid Ahmad, UK Kun Yuan, China Liqing Zhang, China Luonan Chen, Japan Malik Ismail, USA Marco Gilli, Italy Martin Middendorf, Germany Matthew Casey, UK Meiqin Liu, China Michael Li, Australia Michel Verleysen, Belgium Mingcong Deng, Japan Nian Zhang, USA

Nikola Kasabov, New Zealand Norikazu Takahashi, Japan Okyay Kaynak, Turkey Paul S. Pang, New Zealand Péter Érdi, USA Peter Tino, UK Ping Guo, China Ping Li, Hong Kong Qiankun Song, China Qing Ma, Japan Qing Tao, China Qinglong Han, Australia Qingshan Liu, China Quanmin Zhu, UK Rhee Man Kil, Korea Rubin Wang, China Sabri Arik, Turkey Seiichi Ozawa, Japan Sheng Chen, UK Shunshoku Kanae, Japan Shuxue Ding, Japan Stanislaw Osowski, Poland Stefan Wermter, UK Sungshin Kim, Korea Tingwen Huang, Qatar Wai Keung Fung, Canada Wei Wu, China Wen Yu, Mexico Wenjia Wang, UK Wenlian Lu, China Wenwu Yu, Hong Kong Xiaochun Cheng, UK Xiaoli Li, UK Xiaoqin Zeng, China Yan Liu, USA Yanchun Liang, China Yangmin Li, Macao Yangquan Chen, USA Yanqing Zhang, USA Yi Shen, China Ying Tan, China Yingjie Yang, UK Zheru Chi, Hong Kong

Reviewers Dario Aloise Ricardo de A. Araujo Swarna Arniker Mohammadreza Asghari Oskoei Haibo Bao Simone Bassis Shuhui Bi Rongfang Bie Liu Bo Ni Bu Heloisa Camargo Liting Cao Jinde Cao Lin Chai Fangyue Chen Yangquan Chen Xiaofeng Chen Benhui Chen Sheng Chen Xinyu Chen Songcan Chen Long Cheng Xiaochun Cheng Zunshui Cheng Jungik Cho Chuandong Li Antonio J. Conejo Yaping Dai Jayanta Kumar Debnath Jianguo Du Mark Elshaw Christos Emmanouilidis Tolga Ensari Yulei Fan Mauricio Figueiredo Carlos H. Q. Foster Sabrina Gaito Xinbo Gao Zaiwu Gong Adilson Gonzaga Shenshen Gu Dongbing Gu Suicheng Gu Qianjin Guo

Jun Guo Chengan Guo Hong He Fengqing Han Wangli He Xiangnan He Yunzhang Hou Wei Hu Jin Hu Jun Hu Jinglu Hu Yichung Hu Xi Huang Chuangxia Huang Chi Huang Gan Huang He Huang Chihli Hung Amir Hussain Lei Jia Qiang Jia Danchi Jiang Minghui Jiang Lihua Jiang Changan Jinag Chi-Hyuck Jun Shunshoku Kanae Deok-Hwan Kim Tomoaki Kobayashi Darong Lai James Lam Bing Li Liping Li Chuandong Li Yueheng Li Xiaolin Li Kelin Li Dayou Li Jianwu Li Ping Li Wei Li Xiaoli Li Yongmin Li Yan Li

Rong Li Guanjun Li Jiguo Li Lulu Li Xuechen Li Jinling Liang Clodoaldo Aparecido de Moraes Lima Yurong Liu Li Liu Maoxing Liu Nan Liu Chao Liu Honghai Liu Xiangyang Liu Fei Liu Lixiong Liu Xiwei Liu Xiaoyang Liu Yang Liu Gabriele Lombardo Xuyang Lou Jianquan Lu Wenlian Lu Xiaojun Lu Wei Lu Ying Luo Lili Ma Shingo Mabu Xiangyu Meng Zhaohui Meng Cristian Mesiano Xiaobing Nie Yoshihiro Okada Zeynep Orman Stanislaw Osowski Tsuyoshi Otake Seiichi Ozawa Neyir Ozcan Zhifang Pan Yunpeng Pan Zhifang Pang Federico Pedersini Gang Peng Ling Ping Chenkun Qi

Jianlong Qiu Jianbin Qiu Dummy Reviewer Zhihai Rong Guangchen Ruan Hossein Sahoolizadeh Ruya Samli Sibel Senan Zhan Shu Qiankun Song Wei Su Yonghui Sun Junfeng Sun Yuan Tan Lorenzo Valerio Li Wan Lili Wang Xiaofeng Wang Jinlian Wang Min Wang Lan Wang Qiuping Wang Guanjun Wang Duan Wang Weiwei Wang Bin Wang Zhengxia Wang Haikun Wei Shengjun Wen Stefan Wermter Xiangjun Wu Wei Wu Mianhong Wu Weiguo Xia Yonghui Xia Tao Xiang Min Xiao Huaitie Xiao Dan Xiao Wenjun Xiong Junlin Xiong Weijun Xu Yan Xu Rui Xu Jianhua Xu

Gang Yan Zijiang Yang Taicheng Yang Zaiyue Yang Yongqing Yang Bo Yang Kun Yang Qian Yin Xiuxia Yang Xu Yiqiong Simin Yu Wenwu Yu Kun Yuan Zhiyong Yuan Eylem Yucel Yong Yue Jianfang Zeng Junyong Zhai Yunong Zhang Ping Zhang Libao Zhang Baoyong Zhang

Houxiang Zhang Jun Zhang Qingfu Zhang Daoqiang Zhang Jiacai Zhang Yuanbin Zhang Kanjian Zhang Leina Zhao Yan Zhao Cong Zheng Chunhou Zheng Shuiming Zhong Jin Zhou Bin Zhou Qingbao Zhu Wei Zhu Antonio Zippo Yanli Zou Yang Zou Yuanyuan Zou Zhenjiang Zhao

Table of Contents – Part I

Computational Neuroscience

Single Trial Evoked Potentials Study during an Emotional Processing Based on Wavelet Transform
Ling Zou, Renlai Zhou, Senqi Hu, Jing Zhang, and Yansong Li

Robust Speaker Modeling Based on Constrained Nonnegative Tensor Factorization
Qiang Wu, Liqing Zhang, and Guangchuan Shi

A Hypothesis on How the Neocortex Extracts Information for Prediction in Sequence Learning
Weiyu Wang

MENN Method Applications for Stock Market Forecasting
Guangfeng Jia, Yuehui Chen, and Peng Wu

New Chaos Produced from Synchronization of Chaotic Neural Networks
Zunshui Cheng

A Two Stage Energy Model Exhibiting Selectivity to Changing Disparity
Xiaojiang Guo and Bertram E. Shi

A Feature Extraction Method Based on Wavelet Transform and NMFs
Suwen Zhang, Wanyin Deng, and Dandan Miao

Cognitive Science

Similarity Measures between Connection Numbers of Set Pair Analysis
Junjie Yang, Jianzhong Zhou, Li Liu, Yinghai Li, and Zhengjia Wu

Temporal Properties of Illusory-Surface Perception Probed with Poggendorff Configuration
Qin Wang and Marsanori Idesawa

Interval Self-Organizing Map for Nonlinear System Identification and Control
Luzhou Liu, Jian Xiao, and Long Yu

A Dual-Mode Learning Mechanism Combining Knowledge-Education and Machine-Learning
Yichang Chen and Anpin Chen

The Effect of Task Relevance on Electrophysiological Response to Emotional Stimuli
Baolin Liu, Shuai Xin, Zhixing Jin, Xiaorong Gao, Shangkai Gao, Renxin Chu, Yongfeng Huang, and Beixing Deng

A Detailed Study on the Modulation of Emotion Processing by Spatial Location
Baolin Liu, Shuai Xin, Zhixing Jin, Xiaorong Gao, Shangkai Gao, Renxin Chu, Beixing Deng, and Yongfeng Huang

Mathematical Modeling of Neural Systems

MATLAB Simulation and Comparison of Zhang Neural Network and Gradient Neural Network for Time-Varying Lyapunov Equation Solving
Yunong Zhang, Shuai Yue, Ke Chen, and Chenfu Yi

Improved Global Exponential Stability Criterion for BAM Neural Networks with Time-Varying Delays
Yonggang Chen and Tiheng Qin

Global Exponential Stability and Periodicity of CNNs with Time-Varying Discrete and Distributed Delays
Shengle Fang, Minghui Jiang, and Wenfang Fu

Estimation of Value-at-Risk for Exchange Risk Via Kernel Based Nonlinear Ensembled Multi Scale Model
Kaijian He, Chi Xie, and Kinkeung Lai

Delay-Dependent Global Asymptotic Stability in Neutral-Type Delayed Neural Networks with Reaction-Diffusion Terms
Jianlong Qiu, Yinlai Jin, and Qingyu Zheng

Discrimination of Reconstructed Milk in Raw Milk by Combining Near Infrared Spectroscopy with Biomimetic Pattern Recognition
Ming Sun, Qigao Feng, Dong An, Yaoguang Wei, Jibo Si, and Longsheng Fu

Data Fusion Based on Neural Networks and Particle Swarm Algorithm and Its Application in Sugar Boiling
Yanmei Meng, Sijie Yan, Zhihong Tang, Yuanling Chen, and Jingneng Liu

Asymptotic Law of Likelihood Ratio for Multilayer Perceptron Models
Joseph Rynkiewicz

An On-Line Learning Radial Basis Function Network and Its Application
Nini Wang, Xiaodong Liu, and Jianchuan Yin

A Hybrid Model of Partial Least Squares and RBF Neural Networks for System Identification
Nini Wang, Xiaodong Liu, and Jianchuan Yin

Nonlinear Complex Neural Circuits Analysis and Design by q-Value Weighted Bounded Operator
Hong Hu and Zhongzhi Shi

Fuzzy Hyperbolic Neural Network Model and Its Application in H∞ Filter Design
Shuxian Lun, Zhaozheng Guo, and Huaguang Zhang

On the Domain Attraction of Fuzzy Neural Networks
Tingwen Huang, Xiaofeng Liao, and Hui Huang

CG-M-FOCUSS and Its Application to Distributed Compressed Sensing
Zhaoshui He, Andrzej Cichocki, Rafal Zdunek, and Jianting Cao

Dynamic of Cohen-Grossberg Neural Networks with Variable Coefficients and Time-Varying Delays
Xuehui Mei and Haijun Jiang

Permutation Free Encoding Technique for Evolving Neural Networks
Anupam Das, Md. Shohrab Hossain, Saeed Muhammad Abdullah, and Rashed Ul Islam

Six-Element Linguistic Truth-Valued Intuitionistic Reasoning in Decision Making
Li Zou, Wenjiang Li, and Yang Xu

A Sequential Learning Algorithm for RBF Networks with Application to Ship Inverse Control
Gexin Bi and Fang Dong

Stability and Nonlinear Analysis

Implementation of Neural Network Learning with Minimum L1-Norm Criteria in Fractional Order Non-gaussian Impulsive Noise Environments
Daifeng Zha

Stability of Neural Networks with Parameters Disturbed by White Noises
Wuyi Zhang and Wudai Liao

Neural Control of Uncertain Nonlinear Systems with Minimum Control Effort
Dingguo Chen, Jiaben Yang, and Ronald R. Mohler

Three Global Exponential Convergence Results of the GPNN for Solving Generalized Linear Variational Inequalities
Xiaolin Hu, Zhigang Zeng, and Bo Zhang

Disturbance Attenuating Controller Design for a Class of Nonlinear Systems with Unknown Time-Delay
Geng Ji

Stability Criteria with Less Variables for Neural Networks with Time-Varying Delay
Tao Li, Xiaoling Ye, and Yingchao Zhang

Robust Stability of Uncertain Neural Networks with Time-Varying Delays
Wei Feng, Haixia Wu, and Wei Zhang

Novel Coupled Map Lattice Model for Prediction of EEG Signal
Minfen Shen, Lanxin Lin, and Guoliang Chang

Adaptive Synchronization of Delayed Chaotic Systems
Lidan Wang and Shukai Duan

Feedforward and Fuzzy Neural Networks

Research on Fish Intelligence for Fish Trajectory Prediction Based on Neural Network
Yanmin Xue, Hongzhao Liu, Xiaohui Zhang, and Mamoru Minami

A Hybrid MCDM Method for Route Selection of Multimodal Transportation Network
Lili Qu and Yan Chen

Function Approximation by Neural Networks
Fengjun Li

Robot Navigation Based on Fuzzy RL Algorithm
Yong Duan, Baoxia Cui, and Huaiqing Yang

Nuclear Reactor Reactivity Prediction Using Feed Forward Artificial Neural Networks
Shan Jiang, Christopher C. Pain, Jonathan N. Carter, Ahmet K. Ziver, Matthew D. Eaton, Anthony J.H. Goddard, Simon J. Franklin, and Heather J. Phillips

Active Noise Control Using a Feedforward Network with Online Sequential Extreme Learning Machine
Qizhi Zhang and Yali Zhou

Probabilistic Methods

A Probabilistic Method to Estimate Life Expectancy of Application Software
Shengzhong Yuan and Hong He

Particle Filter with Improved Proposal Distribution for Vehicle Tracking
Huaping Liu and Fuchun Sun

Cluster Selection Based on Coupling for Gaussian Mean Fields
Yarui Chen and Shizhong Liao

Multiresolution Image Fusion Algorithm Based on Block Modeling and Probabilistic Model
Chenglin Wen and Jingli Gao

An Evolutionary Approach for Vector Quantization Codebook Optimization
Carlos R.B. Azevedo, Esdras L. Bispo Junior, Tiago A.E. Ferreira, Francisco Madeiro, and Marcelo S. Alencar

Kernel-Based Text Classification on Statistical Manifold
Shibin Zhou, Shidong Feng, and Yushu Liu

A Boost Voting Strategy for Knowledge Integration and Decision Making
Haibo He, Yuan Cao, Jinyu Wen, and Shijie Cheng

Supervised Learning

A New Strategy for Pridicting Eukaryotic Promoter Based on Feature Boosting
Shuanhu Wu, Qingshang Zeng, Yinbin Song, Lihong Wang, and Yanjie Zhang

Searching for Interacting Features for Spam Filtering
Chuanliang Chen, Yunchao Gong, Rongfang Bie, and Xiaozhi Gao

Structural Support Vector Machine
Hui Xue, Songcan Chen, and Qiang Yang

The Turning Points on MLP’s Error Surface
Hung-Han Chen

Parallel Fuzzy Reasoning Models with Ensemble Learning
Hiromi Miyajima, Noritaka Shigei, Shinya Fukumoto, and Toshiaki Miike

Classification and Dimension Reduction in Bank Credit Scoring System
Bohan Liu, Bo Yuan, and Wenhuang Liu

Polynomial Nonlinear Integrals
JinFeng Wang, KwongSak Leung, KinHong Lee, and Zhenyuan Wang

Testing Error Estimates for Regularization and Radial Function Networks
Petra Vidnerová and Roman Neruda

Unsupervised Learning

A Practical Clustering Algorithm
Wei Li, Haohao Li, and Jianye Chen

Concise Coupled Neural Network Algorithm for Principal Component Analysis
Lijun Liu, Jun Tie, and Tianshuang Qiu

Spatial Clustering with Obstacles Constraints by Hybrid Particle Swarm Optimization with GA Mutation
Xueping Zhang, Hui Yin, Hongmei Zhang, and Zhongshan Fan

Analysis of the Kurtosis-Sum Objective Function for ICA
Fei Ge and Jinwen Ma

BYY Harmony Learning on Weibull Mixture with Automated Model Selection
Zhijie Ren and Jinwen Ma

A BYY Split-and-Merge EM Algorithm for Gaussian Mixture Learning
Lei Li and Jinwen Ma

A Comparative Study on Clustering Algorithms for Multispectral Remote Sensing Image Recognition
Lintao Wen, Xinyu Chen, and Ping Guo

A Gradient BYY Harmony Learning Algorithm for Straight Line Detection
Gang Chen, Lei Li, and Jinwen Ma

Support Vector Machine and Kernel Methods

An Estimation of the Optimal Gaussian Kernel Parameter for Support Vector Classification
Wenjian Wang and Liang Ma

Imbalanced SVM Learning with Margin Compensation
Chan-Yun Yang, Jianjun Wang, Jr-Syu Yang, and Guo-Ding Yu

Path Algorithms for One-Class SVM
Liang Zhou, Fuxin Li, and Yanwu Yang

Simulations for American Option Pricing Under a Jump-Diffusion Model: Comparison Study between Kernel-Based and Regression-based Methods
Hyun-Joo Lee, Seung-Ho Yang, Gyu-Sik Han, and Jaewook Lee

Global Convergence Analysis of Decomposition Methods for Support Vector Regression
Jun Guo and Norikazu Takahashi

Rotating Fault Diagnosis Based on Wavelet Kernel Principal Component
L. Guo, G.M. Dong, J. Chen, Y. Zhu, and Y.N. Pan

Inverse System Identification of Nonlinear Systems Using LSSVM Based on Clustering
Changyin Sun, Chaoxu Mu, and Hua Liang

A New Approach to Division of Attribute Space for SVR Based Classification Rule Extraction
Dexian Zhang, Ailing Duan, Yanfeng Fan, and Ziqiang Wang

Chattering-Free LS-SVM Sliding Mode Control
Jianning Li, Yibo Zhang, and Haipeng Pan

Selection of Gaussian Kernel Parameter for SVM Based on Convex Estimation
Changqian Men and Wenjian Wang

Multiple Sources Data Fusion Strategies Based on Multi-class Support Vector Machine
Luo Zhong, Zhe Li, Zichun Ding, Cuicui Guo, and Huazhu Song

A Generic Diffusion Kernel for Semi-supervised Learning
Lei Jia and Shizhong Liao

Weighted Hyper-sphere SVM for Hypertext Classification
Shuang Liu and Guoyou Shi

Theoretical Analysis of a Rigid Coreset Minimum Enclosing Ball Algorithm for Kernel Regression Estimation
Xunkai Wei and Yinghong Li

Kernel Matrix Learning for One-Class Classification
Chengqun Wang, Jiangang Lu, Chonghai Hu, and Youxian Sun

Structure Automatic Change in Neural Network
Han Honggui, Qiao Junfei, and Li Xinyuan

Hybrid Optimisation Algorithms

Particle Swarm Optimization for Two-Stage FLA Problem with Fuzzy Random Demands . . . . . . . . . . Yankui Liu, Siyuan Shen, and Rui Qin

776

T-S Fuzzy Model Identification Based on Chaos Optimization . . . . . . . . . . Chaoshun Li, Jianzhong Zhou, Xueli An, Yaoyao He, and Hui He

786

ADHDP for the pH Value Control in the Clarifying Process of Sugar Cane Juice . . . . . . . . . . Xiaofeng Lin, Shengyong Lei, Chunning Song, Shaojian Song, and Derong Liu

796

Dynamic PSO-Neural Network: A Case Study for Urban Microcosmic Mobile Emission . . . . . . . . . . Chaozhong Wu, Chengwei Xu, Xinping Yan, and Jing Gong

806

An Improvement to Ant Colony Optimization Heuristic . . . . . . . . . . Youmei Li, Zongben Xu, and Feilong Cao

816

Extension of a Polynomial Time Mehrotra-Type Predictor-Corrector Safeguarded Algorithm to Monotone Linear Complementarity Problems . . . . . . . . . . Mingwang Zhang and Yanli Lv

826

QoS Route Discovery of Ad Hoc Networks Based on Intelligence Computing . . . . . . . . . . Cong Jin and Shu-Wei Jin

836

Memetic Algorithm-Based Image Watermarking Scheme . . . . . . . . . . Qingzhou Zhang, Ziqiang Wang, and Dexian Zhang

845


A Genetic Algorithm Using a Mixed Crossover Strategy . . . . . . . . . . Li-yan Zhuang, Hong-bin Dong, Jing-qing Jiang, and Chu-yi Song

854

Condition Prediction of Hydroelectric Generating Unit Based on Immune Optimized RBFNN . . . . . . . . . . Zhong Liu, Shuyun Zou, Shuangquan Liu, Fenghua Jin, and Xuxiang Lu

864

Synthesis of a Hybrid Five-Bar Mechanism with Particle Swarm Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ke Zhang

873

Robust Model Predictive Control Using a Discrete-Time Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yunpeng Pan and Jun Wang

883

A PSO-Based Method for Min-ε Approximation of Closed Contour Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bin Wang, Chaojian Shi, and Jing Li

893

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

903

Table of Contents – Part II

Machine Learning and Data Mining

Rough Set Combine BP Neural Network in Next Day Load Curve Forecasting

1

Improved Fuzzy Clustering Method Based on Entropy Coefficient and Its Application

11

An Algorithm of Constrained Spatial Association Rules Based on Binary

21

Sequential Proximity-Based Clustering for Telecommunication Network Alarm Correlation

30

A Fast Parallel Association Rules Mining Algorithm Based on FP-Forest

40

Improved Algorithm for Image Processing in TCON of TFT-LCD

50

Clustering Using Normalized Path-Based Metric

57

Association Rule Mining Based on the Semantic Categories of Tourism Information

67

The Quality Monitoring Technology in the Process of the Pulping Papermaking Alkaline Steam Boiling Based on Neural Network

74

A New Self-adjusting Immune Genetic Algorithm

81

Calculation of Latent Semantic Weight Based on Fuzzy Membership

91

Research on Spatial Clustering Acetabuliform Model and Algorithm Based on Mathematical Morphology

100

Intelligent Control and Robotics

Partner Selection and Evaluation in Virtual Research Center Based on Trapezoidal Fuzzy AHP

110

A Nonlinear Hierarchical Multiple Models Neural Network Decoupling Controller

119

Adaptive Dynamic Programming for a Class of Nonlinear Control Systems with General Separable Performance Index

128

A General Fuzzified CMAC Controller with Eligibility

138

Case-Based Decision Making Model for Supervisory Control of Ore Roasting Process

148

An Affective Model Applied in Playmate Robot for Children

158

The Application of Full Adaptive RBF NN to SMC Design of Missile Autopilot

165

Multi-Objective Optimal Trajectory Planning of Space Robot Using Particle Swarm Optimization

171

The Direct Neural Control Applied to the Position Control in Hydraulic Servo System

180

An Application of Wavelet Networks in the Carrying Robot Walking

190

TOPN Based Temporal Performance Evaluation Method of Neural Network Based Robot Controller

200

A Fuzzy Timed Object-Oriented Petri Net for Multi-Agent Systems

210

Fuzzy Reasoning Approach for Conceptual Design

220

Extension Robust Control of a Three-Level Converter for High-Speed Railway Tractions

227

Pattern Recognition

Blind Image Watermark Analysis Using Feature Fusion and Neural Network Classifier

237

Gene Expression Data Classification Using Independent Variable Group Analysis

243

The Average Radius of Attraction Basin of Hopfield Neural Networks

253

A Fuzzy Cluster Algorithm Based on Mutative Scale Chaos Optimization

259

A Sparse Sampling Method for Classification Based on Likelihood Factor

268

Estimation of Nitrogen Removal Effect in Groundwater Using Artificial Neural Network

276

Sequential Fuzzy Diagnosis for Condition Monitoring of Rolling Bearing Based on Neural Network

284

Evolving Neural Network Using Genetic Simulated Annealing Algorithms for Multi-spectral Image Classification

294

Detecting Moving Targets in Ground Clutter Using RBF Neural Network

304

Application of Wavelet Neural Networks on Vibration Fault Diagnosis 313 321 331

341

Audio, Image Processing and Computer Vision

Denoising Natural Images Using Sparse Coding Algorithm Based on the Kurtosis Measurement

351

A New Denoising Approach for Sound Signals Based on Non-negative Sparse Coding of Power Spectra

359

Building Extraction Using Fast Graph Search

367

376 Image Denoising Using Neighbouring Contourlet Coefficients

384

Robust Watermark Algorithm Based on the Wavelet Moment Modulation and Neural Network Detection

392

Manifold Training Technique to Reconstruct High Dynamic Range Image

402

Face Hallucination Based on CSGT and PCA

410

Complex Effects Simulation Based Large Particles System on GPU

419

A Selective Attention Computational Model for Perceiving Textures

429

Classifications of Liver Diseases from Medical Digital Images

439

A Global Contour-Grouping Algorithm Based on Spectral Clustering

449

Emotion Recognition in Chinese Natural Speech by Combining Prosody and Voice Quality Features

457

Fault Diagnosis

On-Line Diagnosis of Faulty Insulators Based on Improved ART2 Neural Network

465

Diagnosis Method for Gear Equipment by Sequential Fuzzy Neural Network

473

Study of Punch Die Condition Discrimination Based on Wavelet Packet and Genetic Neural Network

483

Data Reconstruction Based on Factor Analysis

492

Synthetic Fault Diagnosis Method of Power Transformer Based on Rough Set Theory and Bayesian Network

498

Fuzzy Information Fusion Algorithm of Fault Diagnosis Based on Similarity Measure of Evidence

506

Other Applications and Implementations

NN-Based Near Real Time Load Prediction for Optimal Generation Control

516

A Fuzzy Neural-Network-Driven Weighting System for Electric Shovel

526

Neural-Network-Based Maintenance Decision Model for Diesel Engine

533

Design of Intelligent PID Controller Based on Adaptive Genetic Algorithm and Implementation of FPGA

542

Fragile Watermarking Schemes for Tamperproof Web Pages

552

Real-Time Short-Term Traffic Flow Forecasting Based on Process Neural Network

560

Fuzzy Expert System to Estimate Ignition Timing for Hydrogen Car

570

Circuitry Analog and Synchronization of Hyperchaotic Neuron Model

580

A Genetic-Neural Method of Optimizing Cut-Off Grade and Grade of Crude Ore

588

A SPN-Based Delay Analysis of LEO Satellite Networks

598

Research on the Factors of the Urban System Influenced Post-development of the Olympics’ Venues

607

A Stock Portfolio Selection Method through Fuzzy Delphi

615

A Prediction Algorithm Based on Time Series Analysis

624

Applications of Neural Networks in Electronic Engineering

An Estimating Traffic Scheme Based on Adaline

632

SVM Model Based on Particle Swarm Optimization for Short-Term Load Forecasting

642

A New BSS Method of Single-Channel Mixture Signal Based on ISBF and Wavelet

650

A Novel Pixel-Level and Feature-Level Combined Multisensor Image Fusion Scheme

658

Combining Multi Wavelet and Multi NN for Power Systems Load Forecasting

666

An Adaptive Algorithm Finding Multiple Roots of Polynomials

674

Cellular Neural Networks and Advanced Control with Neural Networks

Robust Designs for Directed Edge Overstriking CNNs with Applications

682

Application of Local Activity Theory of Cellular Neural Network to the Chen’s System

692

Application of PID Controller Based on BP Neural Network Using Automatic Differentiation Method

702

Neuro-Identifier-Based Tracking Control of Uncertain Chaotic System

712

Robust Stability of Switched Recurrent Neural Networks with Discrete and Distributed Delays under Uncertainty

720

Nature Inspired Methods of High-dimensional Discrete Data Analysis

WHFPMiner: Efficient Mining of Weighted Highly-Correlated Frequent Patterns Based on Weighted FP-Tree Approach

730

Towards a Categorical Matching Method to Process High-Dimensional Emergency Knowledge Structures

740

Identification and Extraction of Evoked Potentials Based on Borel Spectral Measure for Less Trial Mixtures

748

A Two-Step Blind Extraction Algorithm of Underdetermined Speech Mixtures

757

A Semi-blind Complex ICA Algorithm for Extracting a Desired Signal Based on Kurtosis Maximization

764

Fast and Efficient Algorithms for Nonnegative Tucker Decomposition

772

Pattern Recognition and Information Processing Using Neural Networks

Neural Network Research Progress and Applications in Forecast

783

Adaptive Image Segmentation Using Modified Pulse Coupled Neural Network

794

Speech Emotion Recognition System Based on BP Neural Network in Matlab Environment

801

Broken Rotor Bars Fault Detection in Induction Motors Using Park’s Vector Modulus and FWNN Approach

809

Coal and Gas Outburst Prediction Combining a Neural Network with the Dempster-Shafter Evidence

822

Using the Tandem Approach for AF Classification in an AVSR System

830

Author Index

841

Single Trial Evoked Potentials Study during an Emotional Processing Based on Wavelet Transform

Ling Zou1,2,3, Renlai Zhou2,3,*, Senqi Hu4, Jing Zhang2, and Yansong Li2

1 Faculty of Information Science & Engineering, Jiangsu Polytechnic University, Changzhou, Jiangsu, 213164, China
2 State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing, 100875, China
3 Beijing Key Lab of Applied Experimental Psychology, Beijing, 100875, China
4 Department of Psychology, Humboldt State University
{Ling Zou,Renlai Zhou,Senqi Hu,Jing Zhang,Yansong Li, rlzhou}@bnu.edu.cn

Abstract. The present study examined single-trial extraction of event-related potentials (ERPs) during emotional processing using the wavelet transform, and analyzed the brain responses to emotional stimuli. ERPs were recorded from 64 electrodes in 10 healthy university students while three types of emotional pictures (pleasant, neutral, and unpleasant) from the International Affective Picture System were presented. All subjects showed significantly greater P300 and slow wave amplitudes at antero-inferior, medial-inferior and posterior electrode sites for pleasant and unpleasant pictures than for neutral pictures, and unpleasant pictures elicited more positive P300 and slow wave effects than pleasant pictures. The results indicated the effectiveness of the wavelet transform-based approach in ERP single-trial extraction and further supported the view that emotional stimuli are processed more intensely.

Keywords: ERPs, wavelet transform, emotion, P300, slow wave.

* Corresponding author.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 1–10, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

In recent years there has been growing interest in understanding the brain mechanisms subserving emotion, brain asymmetries related to emotion, and the influence of emotion on memory [1-3]. A number of neuroimaging studies have investigated the brain response to the passive viewing of affective pictures using a variety of methods [4-6]. Among these methods, event-related potentials (ERPs) of the electroencephalogram (EEG) are widely used because they are non-invasive and readily available to community clinics [5-6]. Investigations of evoked potentials to emotional visual stimuli have revealed higher cortical positivity in response to emotional compared with neutral stimuli. Radilovà studied evoked potentials in response to emotional pictures and found that unpleasant, compared to neutral, visual stimuli elicited more robust P300 effects [7]. Radilovà and coworkers also reported a significantly greater P300 for erotic compared with non-erotic scenes, leading to the suggestion that the arousing quality of emotional stimuli produces a heightened P300 component independently of the valence of the stimuli [8]. Similarly, other researchers reported significantly larger evoked potentials to arousing vs. neutral stimuli [6, 9]. Interestingly, the effects occurred primarily from frontal to parietal recording sites, which are also associated with P300 generation [10]. Keil et al. demonstrated that both the P300 and the late positive slow wave show an arousal-related signal enhancement, with the largest differences in late VEPs as a function of emotional arousal near the PZ electrode [5-6].

The most common way to visualize ERPs is to average time-locked single-trial measurements. The implicit assumption in the averaging is that the task-related cognitive process does not vary much in timing from trial to trial. However, it has been evident for a few decades that in many cases this assumption is not valid. Observing the variation in the parameters of the ERPs permits the dynamic assessment of changes in cognitive state. Thus, the current goal in the analysis of ERPs is the estimation of the single potentials, which we call single-trial extraction. Several techniques have been proposed to improve the visualization of the ERPs against the background EEG, with varying success [11-13]. Among these, the wavelet transform (WT) is regarded as a promising one because of its optimal resolution in both the time and the frequency domain. The WT has been used to decompose the ERP signal onto a space of basis functions. With this technique the ERP is assumed to be the result of the superimposition of wave packets at various frequencies with varying degrees of frequency stabilization, enhancement and time locking within the conventional frequency bands of the ongoing EEG activity, such as the delta, theta, alpha and gamma ranges. Wavelet analysis treats ERP responses in the time-frequency plane and has yielded new knowledge about ERP components [14]. However, selecting the relevant frequency band and interpreting the results in the frequency domain remain challenging tasks.

In this paper, we first estimated single-trial ERPs during emotional processing based on wavelet multiresolution analysis (MRA) [15], and then analyzed the brain responses to emotional stimuli using the extracted ERPs. The results indicated the effectiveness of the wavelet transform-based approach in ERP single-trial extraction and further supported the view that emotional stimuli are processed more intensely.

2 Materials and Methods

2.1 Subjects

Ten healthy undergraduate students (7 men) from Beijing Normal University, aged 19 to 25 years, participated in the experiment. All subjects were right-handed with normal or corrected-to-normal vision.

2.2 Stimuli and Design

210 color pictures were selected from the International Affective Picture System (IAPS), consisting of 70 highly arousing pleasant, 70 neutral, and 70 highly arousing unpleasant images. The pictures were chosen according to the normative ratings of the IAPS. The pictures were arranged so that the 70 neutral, 70 pleasant and 70 unpleasant pictures were shown in separate blocks. Emotional pictures were presented on a 19-in. computer screen with a refresh rate of 60 Hz. The screen was placed approximately 1.5 m in front of the viewer. Each picture was presented for 1000 ms, with inter-trial intervals varying between 2500 and 3000 ms. After the EEG recording for each type of emotional stimulus, subjects were asked to rate the respective pictures on an 11-point scale, where 0 means no feeling and 100 means very pleasant or unpleasant.

2.3 Electrophysiological Recordings

EEG activity was recorded continuously from 64 leads with a DC amplifier in AC mode (bandpass: 0.01-100 Hz; SYNAMPS, Neuroscan) and digitized at a rate of 500 Hz. For each trial, 1.1 s of data was saved to hard disk (from 0.1 s pre- to 1 s post-stimulation). Horizontal and vertical electrooculograms (EOG) were recorded by electrodes placed above and below the left eye (VEOG) and lateral to the outer canthus of each eye (HEOG). Offline, the EEG was re-referenced to linked mastoids. For the purpose of statistical analysis, the mean voltages of the averaged visually evoked potentials (VEPs) were obtained for regions defined by the horizontal plane (anterior, medial, posterior) and the vertical plane (inferior, superior), based on recording sites of the international 10-20 system [16]. The locations of these regions with respect to sites of the international 10-20 system are shown in Fig. 1.

2.4 Multiresolution Analysis: Discrete Wavelet Transform (DWT)

The DWT is a time-frequency analysis technique that is well suited to non-stationary signals such as ERPs. The DWT analyzes the signal at different frequency bands with different resolutions by decomposing the signal into a coarse approximation and detail information. The DWT employs two sets of functions, called scaling functions and wavelet functions, which are associated with lowpass and highpass filters, respectively. The decomposition of the signal into different frequency bands is obtained by successive highpass and lowpass filtering of the time domain signal. The original signal x[n] is first passed through a half-band highpass filter g[n] and a lowpass filter h[n]. After the filtering, half of the samples can be eliminated according to Nyquist's rule, since the signal now has a highest frequency of π/2 radians instead of π. The signal can therefore be subsampled by 2, simply by discarding every other sample. This constitutes one level of decomposition and can mathematically be expressed as follows:

$$y_{\mathrm{high}}[k] = \sum_{n} x[n] \cdot g[2k - n] \tag{1}$$

$$y_{\mathrm{low}}[k] = \sum_{n} x[n] \cdot h[2k - n] \tag{2}$$

where $y_{\mathrm{high}}[k]$ and $y_{\mathrm{low}}[k]$ are the outputs of the highpass and lowpass filters after the subsampling, and are referred to as detail coefficients and approximation coefficients, respectively. This procedure is repeated by decomposing the approximation coefficients


until further decomposition is not possible. The detail coefficients d_i at level i then constitute the level-i DWT coefficients. At each level, the successive filtering and subsampling result in half the time resolution and double the frequency resolution, hence multiresolution analysis.
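As an illustration of Eqs. (1) and (2), the following minimal sketch evaluates one level of decomposition directly from the sums. It assumes the two-tap Haar filters (not the Daubechies filters used in this study) and treats samples outside the signal as zero; the function names are hypothetical and chosen only for this example.

```python
import math

S = 1.0 / math.sqrt(2.0)
H = [S, S]     # lowpass filter h[n]; Haar, an illustrative assumption
G = [S, -S]    # highpass filter g[n]; Haar, an illustrative assumption


def filter_and_subsample(x, f):
    """Evaluate y[k] = sum_n x[n] * f[2k - n], with samples outside x taken as zero."""
    full_len = len(x) + len(f) - 1      # length of the full convolution
    n_out = (full_len + 1) // 2         # number of even-indexed convolution samples
    y = []
    for k in range(n_out):
        acc = 0.0
        for n, xn in enumerate(x):
            m = 2 * k - n               # filter index used in Eqs. (1) and (2)
            if 0 <= m < len(f):
                acc += xn * f[m]
        y.append(acc)
    return y


def dwt_level(x):
    """One decomposition level: (approximation, detail) coefficients."""
    return filter_and_subsample(x, H), filter_and_subsample(x, G)


approx, detail = dwt_level([4.0, 6.0, 10.0, 12.0])
```

Feeding `approx` back into `dwt_level` gives the next level, which is the repetition described above. Practical implementations usually extend the signal at its borders rather than zero-padding it, so their boundary coefficients differ from this sketch.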

Fig. 1. Layout of the electrode array. 1, 2 (left/right antero-superior); 3, 4 (left/right antero-inferior); 5, 6 (left/right medial-superior); 7, 8 (left/right medial-inferior); 9, 10 (left/right postero-inferior); 11, 12 (left/right postero-superior).

In this study, we chose the Daubechies wavelets as the basic wavelet functions for their simplicity and general-purpose applicability in a variety of time-frequency representation problems [17]. Given the sampling frequency of 500 Hz, a 6-level decomposition was used, yielding 6 scales of details (d1-d6) and a final approximation (a6).
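The nominal frequency band of each level follows from repeatedly halving the Nyquist band, and the number of coefficients per level follows from the filter length. The sketch below works both out for a 500 Hz sampling rate and 6 levels. Two inputs are assumptions rather than values stated in the text: an epoch of 551 samples (−0.1 s to 1 s inclusive at 500 Hz) and a 10-tap filter (the usual length of the Daubechies wavelet with five vanishing moments, the Daubechies-5 wavelet named in the Results). The length rule floor((N + F − 1) / 2) holds for non-periodic border extension, which is also assumed.

```python
FS = 500        # sampling rate in Hz, as stated in the recordings section
LEVELS = 6      # decomposition depth used in this study


def detail_band(fs, level):
    """Nominal (low, high) band in Hz of the detail coefficients at `level`."""
    return fs / 2 ** (level + 1), fs / 2 ** level


def coeff_lengths(n_samples, filter_len, levels):
    """Coefficient count per level for non-periodic border extension (assumed)."""
    lengths = []
    n = n_samples
    for _ in range(levels):
        n = (n + filter_len - 1) // 2   # filtering lengthens, subsampling halves
        lengths.append(n)
    return lengths


bands = {f"d{j}": detail_band(FS, j) for j in range(1, LEVELS + 1)}
bands[f"a{LEVELS}"] = (0.0, FS / 2 ** (LEVELS + 1))

# 551 samples and a 10-tap filter are assumptions made for this illustration.
counts = coeff_lengths(551, 10, LEVELS)

for name, (low, high) in bands.items():
    print(f"{name}: {low:g}-{high:g} Hz")
print("coefficients in d1..d6:", counts)
```

Under these assumptions the counts come out as 280, 144, 76, 42, 25 and 17, which agrees with the coefficient counts reported in the Results; the final approximation a6 has the same length as d6.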

3 Results

Fig. 2 shows the seven signals obtained from the 6-level decomposition of a sample VEP and the reconstructed single-trial signal (from a subject during the unpleasant picture stimuli, at the PO3 electrode site). For a single-trial signal x[n] from P1 electrode and using Daubechies-5 wavelets, these levels correspond to the following frequency bands: d1: 125-250 Hz, d2: 62.5-125 Hz, d3: 31.3-62.5 Hz (Gamma), d4: 15.6-31.3 Hz (Beta), d5: 7.8-15.6 Hz (Alpha), d6: 3.9-7.8 Hz (Theta), a6: 0.1-3.9 Hz (Delta). The wavelet transform yielded 280 coefficients in d1, 144 in d2, 76 in d3, 42 in d4, 25 in d5, 17 in d6 and 17 in a6. Then, we calculated the wavelet energy for each frequency band as introduced in [18] and obtained the relative energy values in percent, which reflect the probability distribution of energy across the resolution levels. In this case, the delta band preserved most of the signal energy (approx. 72% at PO3), which indicated that the waveform morphology is determined predominantly by this band. In order to capture an adequate proportion of the signal energy, the theta band was also included in the analysis. Combined, the delta and theta bands preserve 77% of the signal energy at site PO3. In our study, the combination of the delta and theta bands preserved over 67% (i.e. at least two thirds) of the signal energy at all electrode sites. The delta band corresponded to the approximation level (a6) of the MRA, while the theta band corresponded to the highest detail level (d6). All activity from frequency bands higher than the theta band was suppressed by setting the corresponding wavelet coefficients to zero and then applying the inverse transform to the time domain. Delta and theta frequencies have been shown to be very important in the generation of the P3 response to auditory stimuli [11, 14]. Fig. 3 shows the time-frequency distributions for the above sample VEP and its wavelet-based VEP estimate, respectively. The unpleasant stimuli appeared at 0 s. The same axis range for the amplitude is used here. Visual-related activity is clearly

Fig. 2. Sample VEP decomposition and reconstruction. The original signal was shown at the uppermost panel in the left column. Left column showed the decomposed signal reflecting the time course of the signal in the respective frequency band. Right column showed the reconstructed single-trial signal by the sum of a6 and d6.


Fig. 3. Sample results for the time-frequency plot of a single trial of VEP. Left column: the original signal. Right column: the single-trial signal reconstructed by the wavelet transform.

noticeable in the time-frequency distribution of the wavelet-based VEP estimate, whereas such activity can hardly be seen in the raw signal. Therefore, we conclude that the wavelet-based method can recover the evoked potential.

To investigate which brain areas are activated and to compare the VEP components of the same brain area during the processing of different emotions, the wavelet transform method described above was applied to the single trials of each subject at the 62 electrodes shown in Fig. 1 (excluding the VEOG and HEOG sites). For each subject, the results of the wavelet decomposition of the 15 single trials were averaged, and then the grand mean visually evoked potentials (VEPs) of the 10 subjects under the three types of stimuli were obtained for 12 scalp areas. Mean voltages in these regions were assessed in the P300 window (300-500 ms) and in the slow wave window (550-900 ms) [5-6, 16].

Fig. 4 shows that the grand mean VEPs at the PZ site were composed of five components: an N100, a P200, an N200, a P300 component and a late positive slow wave. Here, we focused on the P300 and slow wave time windows, which indicate the sustained and high-level processing of salient visual stimuli [6, 16]. Grand-average ERPs to unpleasant, pleasant, and neutral stimuli are presented in Fig. 5. Here, we selected the P3/4, C3/4, CP3/4, F3/4, PO7/8, PZ, CZ, CPZ, FZ and OZ electrodes, which are distributed over the antero-inferior, medial-inferior, postero-inferior and postero-superior scalp areas.

Table 1 shows the mean and standard deviation (SD) of the P300 window (300-500 ms) amplitude of the grand-average VEPs in response to the three types of emotional stimuli at the posterior, medial-inferior and antero-inferior sites shown in Fig. 1. Table 2 shows the mean and standard deviation (SD) of the slow wave window (550-900 ms) under the same conditions.
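The band selection used for the single-trial estimates (relative wavelet energy per band, then keeping only the delta and theta coefficients before the inverse transform) can be sketched as follows. This is a simplified illustration, not the study's implementation: it assumes the orthonormal Haar wavelet instead of Daubechies-5, a synthetic 128-sample signal, and hypothetical function names.

```python
import math

S = 1.0 / math.sqrt(2.0)


def haar_decompose(x, levels):
    """Return [a_L, d_L, ..., d_1]; len(x) must be divisible by 2**levels."""
    approx, details = list(x), []
    for _ in range(levels):
        pairs = [(approx[2 * i], approx[2 * i + 1]) for i in range(len(approx) // 2)]
        details.append([(p - q) * S for p, q in pairs])   # highpass branch
        approx = [(p + q) * S for p, q in pairs]          # lowpass branch
    return [approx] + details[::-1]


def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    approx = list(coeffs[0])
    for detail in coeffs[1:]:
        merged = []
        for a, d in zip(approx, detail):
            merged += [(a + d) * S, (a - d) * S]
        approx = merged
    return approx


def relative_energy(coeffs):
    """Energy of each band as a percentage of the total coefficient energy."""
    energies = [sum(c * c for c in band) for band in coeffs]
    total = sum(energies)
    return [100.0 * e / total for e in energies]


def keep_lowest_bands(coeffs, n_keep=2):
    """Zero all bands except the first n_keep (here a_L and d_L)."""
    return [band if i < n_keep else [0.0] * len(band)
            for i, band in enumerate(coeffs)]


# Synthetic example: a slow wave plus a fast alternating disturbance.
N, LEVELS = 128, 6
signal = [math.sin(2 * math.pi * i / N) + 0.3 * (-1) ** i for i in range(N)]
coeffs = haar_decompose(signal, LEVELS)
rel = relative_energy(coeffs)                       # order: a6, d6, d5, ..., d1
estimate = haar_reconstruct(keep_lowest_bands(coeffs))
```

Here `rel[0] + rel[1]` plays the role of the combined delta and theta energy share, and `estimate` is the low-band reconstruction. With the Haar wavelet the reconstruction is piecewise constant, so a smoother wavelet such as the Daubechies family is preferable for real VEP data.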


Fig. 4. Grand-average VEPs at the electrode PZ in response to three types of emotional stimuli

Fig. 5. Grand-average VEPs in response to three types of emotional stimuli at P3/4, C3/4, CP3/4, F3/4, PO7/8, PZ, CZ, CPZ, FZ and OZ electrode sites


Table 1. Mean amplitude and standard deviation (SD) of the P300 window of the grand-average VEPs in response to three types of stimuli at different electrode sites

Electrode sites                                         Unpleasant            Neutral               Pleasant
                                                        Mean (μV)  SD (μV)    Mean (μV)  SD (μV)    Mean (μV)  SD (μV)
Postero-inferior                                         5.27       0.38       0.42       0.39       3.49       0.37
Postero-superior                                         4.47       0.28       0.60       0.25       3.47       0.27
Medial-inferior (up) (C1, C3, C5, C2, C4, C6)            0.65       0.53      -0.01       0.55      -1.06       0.45
Medial-inferior (down) (CP1, CP3, CP5, CP2, CP4, CP6)    3.63       0.40       0.16       0.56       1.59       0.23
Antero-inferior                                         -1.61       0.33      -0.18       0.50      -2.54       0.41

Table 2. Mean amplitude and standard deviation (SD) of the slow wave window of the grand-average VEPs in response to three types of stimuli at different electrode sites

Electrode sites                                         Unpleasant            Neutral               Pleasant
                                                        Mean (μV)  SD (μV)    Mean (μV)  SD (μV)    Mean (μV)  SD (μV)
Postero-inferior                                         2.51       0.79      -0.00       0.55       1.45       0.07
Postero-superior                                         2.11       0.67       0.12       0.08       1.53       0.52
Medial-inferior (up) (C1, C3, C5, C2, C4, C6)            1.50       0.17       0.22       0.12       0.26       0.70
Medial-inferior (down) (CP1, CP3, CP5, CP2, CP4, CP6)    2.57       0.50      -0.05       0.06       1.44       0.23
Antero-inferior                                          0.05       0.55       0.14       0.14      -0.02       0.68

From Fig. 5, Table 1 and Table 2, we can see the distribution of the P300 and slow wave over the scalp areas. The positive P300 and slow wave were greatest over posterior sites, both inferior and superior, as well as medial-inferior (down) sites. Negative P300 and slow wave voltages were found over antero-inferior and medial-inferior (up) sites. The distribution areas were the same as those described in [15]. The results demonstrated that the voltages evoked by unpleasant and pleasant pictures were greater than those evoked by neutral pictures. Statistical analyses showed that unpleasant pictures evoked greater positive P300 voltages than pleasant pictures (e.g. at postero-inferior sites, Mean P300 = 5.27 μV, SD P300 = 0.38 μV under the unpleasant stimuli, versus Mean P300 = 3.49 μV, SD P300 = 0.37 μV under the pleasant stimuli); pleasant pictures evoked greater negative P300 voltages than unpleasant pictures (e.g. at antero-inferior sites, Mean P300 = -1.61 μV, SD P300 = 0.33 μV under the unpleasant stimuli, versus Mean P300 = -2.54 μV, SD P300 = 0.41 μV under the pleasant stimuli); and unpleasant pictures evoked greater slow wave voltages than pleasant pictures (e.g. at the medial-inferior (up) sites, Mean Slow = 1.50 μV, SD Slow = 0.17 μV under the unpleasant stimuli, versus Mean Slow = 0.26 μV, SD Slow = 0.70 μV under the pleasant stimuli), which differed from [16].
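The window means compared above are averages of the VEP over a fixed latency range. A minimal sketch of that computation is given below, assuming epochs that start 100 ms before stimulus onset and are sampled at 500 Hz, as described in the recordings section. The function names and the synthetic trials are hypothetical and serve only to show the index arithmetic.

```python
FS = 500        # sampling rate in Hz
PRE_MS = 100    # the epoch starts 100 ms before stimulus onset


def average_trials(trials):
    """Point-by-point mean over equally long single-trial waveforms."""
    n = len(trials)
    return [sum(samples) / n for samples in zip(*trials)]


def window_mean(epoch, start_ms, end_ms, fs=FS, pre_ms=PRE_MS):
    """Mean amplitude from start_ms to end_ms (inclusive) after stimulus onset."""
    first = round((start_ms + pre_ms) * fs / 1000)
    last = round((end_ms + pre_ms) * fs / 1000)
    segment = epoch[first:last + 1]
    return sum(segment) / len(segment)


def synthetic_trial(amplitude, n_samples=551):
    """Flat trial with a plateau of `amplitude` between 300 and 500 ms (illustration only)."""
    trial = [0.0] * n_samples
    for i in range(200, 301):       # samples 200..300 correspond to 300..500 ms
        trial[i] = amplitude
    return trial


trials = [synthetic_trial(2.0), synthetic_trial(4.0)]
vep = average_trials(trials)
p300_mean = window_mean(vep, 300, 500)      # P300 window
slow_mean = window_mean(vep, 550, 900)      # slow wave window
```

In the study the same two windows would be applied to the wavelet-based estimates after averaging 15 trials per subject and then averaging over subjects and over the electrodes of each scalp region.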


4 Discussion

In this paper we have pursued two complementary goals: (1) to improve the visualization of single-trial ERPs using the wavelet transform method and to apply it to cognitive VEPs; (2) to analyze the brain responses to emotional stimuli using the extracted single-trial VEPs.

Firstly, we used the MRA method to estimate the single-trial VEPs by keeping the wavelet coefficients of the low frequency bands (delta and theta) and then reconstructing the example signal. We then obtained the grand mean VEPs at the scalp areas by applying this wavelet-based approach to the 10 subjects, with 15 trials per subject. The results showed that the VEPs obtained by the wavelet method could be used as a reliable, sensitive, and high-resolution indicator for emotion studies after ensemble averaging of only 15 trials.

Secondly, our results showed greater P300 and slow wave amplitudes for unpleasant and pleasant pictures compared to neutral stimuli, indicating that motivationally relevant stimuli automatically direct attentional resources, are processed more deeply and thus provoke an arousal-related enhancement of VEPs, which further supports the view that emotional stimuli are processed more intensely. We observed greater P300 and slow wave amplitudes for unpleasant pictures compared to pleasant pictures, whereas in previous studies pleasant pictures evoked the greatest P300 and slow wave amplitudes [6, 15]. Our results based on the estimated single-trial VEPs also showed significantly greater P300 and slow wave amplitudes at antero-inferior, medial-inferior and posterior electrode sites for pleasant and unpleasant pictures than for neutral pictures. These findings are in accordance with results demonstrating the largest differences in late VEPs as a function of emotional arousal for electrode sites near PZ [5, 6, 10].

The MRA of the wavelet transform method enables the latency and amplitude of the VEP to be detected more accurately, which is difficult to achieve using methods based solely on either a time-domain or a frequency-domain approach [12, 13]. In addition, the WT method can significantly reduce the number of stimuli required for the detection of small VEPs. The wavelet method suggested in this paper therefore has great potential for clinical cognitive practice. In future work, we will explore how other factors (e.g. sex, hemisphere) influence emotion perception using the WT method or WT-based methods.

Acknowledgments. This work was supported by the open project of State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, and Jiangsu Education Nature Foundation (07KJD510038).

References

1. Renlai, Z., Senqi, H.: Effects of Viewing Pleasant and Unpleasant Photographs on Facial EMG Asymmetry. Perceptual and Motor Skills 99, 1157–1167 (2004)
2. Gasbarri, A., Arnone, B., Pompili, A., Marchetti, A., Pacitti, F., Saadcalil, S., Pacitti, C., Tavares, M.C., Tomaz, C.: Sex-related Lateralized Effect of Emotional Content on Declarative Memory: an Event Related Potential Study. Behav. Brain Res. 168, 177–184 (2006)
3. Wiens, S.: Interoception in Emotional Experience. Curr. Opin. Neurol. 18, 442–447 (2005)
4. Phan, K.L., Wager, T., Taylor, S.F., Liberzon, I.: Functional Neuroanatomy of Emotion: a Meta-analysis of Emotion Activation Studies in PET and fMRI. NeuroImage 16, 331–348 (2002)
5. Keil, A., Müller, M.M., Gruber, T., Stolarova, M., Wienbruch, C., Elbert, T.: Effects of emotional arousal in the cerebral hemispheres: a study of oscillatory brain activity and event-related potentials. Clin. Neurophysiol. 112, 2057–2068 (2001)
6. Cuthbert, B., Schupp, H., Bradley, M., Birbaumer, N., Lang, P.: Brain Potentials in Affective Picture Processing: Covariation with Autonomic Arousal and Affective Report. Biol. Psychol. 52, 95–111 (2000)
7. Radilovà, J.: The Late Positive Components of Visual Evoked Responses Sensitive to Emotional Factors. Act. Nerv. Super. (suppl. 3), 334 (1982)
8. Radilovà, J.: P300 and the Emotional States Studied by Psycho Physiological Methods. Int. J. Psychophysiol. 7, 364–366 (1989)
9. Dolcos, F., Cabeza, R.: Event-related Potentials of Emotional Memory: Encoding Pleasant, Unpleasant, and Neutral Pictures. Cogn. Affect. Behav. Neurosci. 2, 252–263 (2002)
10. Polich, J., Kok, A.: Cognitive and biological determinants of P300: an integrative review. Biol. Psychol. 41, 103–146 (1995)
11. Roth, A., Roesch-Ely, D., Bender, S., Weisbrod, M., Kaiser, S.: Increased Event-related Potential Latency and Amplitude Variability in Schizophrenia Detected through Wavelet-based Single Trial Analysis. Int. J. Psychophysiology 66, 244–254 (2007)
12. Vorobyov, S., Cichocki, A.: Blind noise reduction for multisensory signals using ICA and subspace filtering with application to EEG analysis. Biol. Cybern. 86, 293–303 (2002)
13. Yin, H.E., Zeng, Y.J., Zhang, J.H.: Application of adaptive noise cancellation with neural-network-based fuzzy inference system for visual evoked potentials estimation. Med. Eng. Phys. 26, 87–92 (2004)
14. Demiralp, T., Ademoglu, A., Istefanopulos, Y., Basar-Eroglu, C., Basar, E.: Wavelet analysis of oddball P300. Int. J. Psychophysiology 39, 221–227 (2001)
15. Mallat, S.: A Theory for Multiresolution Signal Decomposition: the Wavelet Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989)
16. Herbert, B.M., Pollatos, O., Schandry, R.: Interoceptive Sensitivity and Emotion Processing: An EEG study. Int. J. Psychophysiology 65, 214–227 (2007)
17. Polikar, R., Topalis, A., Green, D., Kounios, J., Clark, C.M.: Comparative Multiresolution Wavelet Analysis of ERP Spectral Bands Using an Ensemble of Classifiers Approach for Early Diagnosis of Alzheimer’s Disease. Computers in Biology and Medicine 37, 542–556 (2007)
18. Rosso, O.A., Blanco, S., Yordanova, J., Kolev, V., Figliola, A., Schürmann, M., Basar, E.: Wavelet entropy: a new tool for analysis of short duration brain electrical signals. J. Neurosci. Methods 105, 65–75 (2001)

Robust Speaker Modeling Based on Constrained Nonnegative Tensor Factorization Qiang Wu, Liqing Zhang, and Guangchuan Shi Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China {johnnywu,lqzhang,sgc1984}@sjtu.edu.cn

Abstract. Nonnegative tensor factorization is an extension of nonnegative matrix factorization (NMF) to the multilinear case, in which nonnegativity constraints are imposed on the PARAFAC/Tucker model. In this paper, to identify speakers in noisy environments, we propose a new method based on the PARAFAC model, called constrained Nonnegative Tensor Factorization (cNTF). The speech signal is encoded as a general higher order tensor in order to learn basis functions from multiple interrelated feature subspaces. We simulate a cochlear-like peripheral auditory stage motivated by the human auditory perception mechanism. cNTF extracts a sparse speech feature representation, which is used for robust speaker modeling. Orthogonal and nonsmooth sparse control constraints are further imposed on the PARAFAC model in order to preserve the useful information of each feature subspace in the higher order tensor. An alternating projection algorithm is applied to obtain a stable solution. Experimental results demonstrate that our method improves recognition accuracy, particularly in noisy environments.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 11–20, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Speaker recognition is the task of determining a person's identity from his or her voice, and it has great potential for applications in industry, business, and security. For a speaker recognition system, feature extraction is one of the important tasks; it aims at finding succinct, robust, and discriminative features from acoustic data. Acoustic features such as linear predictive cepstral coefficients (LPCC) [1], mel-frequency cepstral coefficients (MFCC) [1], and perceptual linear predictive coefficients (PLP) [2] are commonly used. Conventional speaker modeling methods such as Gaussian mixture models (GMM) [3] achieve very high performance for speaker identification and verification tasks on high-quality data when training and testing conditions are well controlled. In real applications, however, such systems usually do not perform well on the large variety of speech signals corrupted by adverse conditions such as environmental noise and channel distortions. Feature compensation techniques [2,4] such as CMS and RASTA have been developed for robust speech recognition. Spectral subtraction [5] and subspace-based filtering [6] techniques, which assume a priori knowledge of the noise spectrum, have been widely used because of their simplicity.

Recently, computational auditory nerve models and sparse coding have attracted much attention from both the neuroscience and speech signal processing communities. Smith and Lewicki [7] proposed an algorithm for learning efficient auditory codes using a theoretical model for coding sound in terms of spikes. Much research on sparse coding and representation for sound and speech [8,9,10] has also proved useful for auditory modeling and speech separation, which offers a potential route to robust speech feature extraction.

As a powerful data modeling tool for pattern recognition, multilinear algebra of higher order tensors has been proposed as a potent mathematical framework for manipulating the multiple factors underlying the observations. Common tensor decomposition methods currently include: (1) the CANDECOMP/PARAFAC model [11,12,13]; (2) the Tucker model [14,15]; (3) Nonnegative Tensor Factorization (NTF), which imposes a nonnegativity constraint on the CANDECOMP/PARAFAC model [16,17]. In computer vision applications, multilinear ICA [18] and tensor discriminant analysis [19] have been applied to image representation and recognition, where they improve recognition performance.

In this paper, we propose a new feature extraction method for robust speaker recognition based on an auditory periphery model and tensor factorization. A novel tensor factorization method called cNTF is derived by imposing orthogonality and nonnegativity constraints on the tensor structure. The advantages of our feature extraction method include the following: (1) simulation of the human auditory perception mechanism provides a higher frequency resolution at low frequencies, which helps to obtain robust spectro-temporal features; (2) a supervised feature extraction procedure via cNTF learns the basis functions of multiple interrelated feature subspaces, which preserve the individual spectro-temporal information in the tensor structure; furthermore, the orthogonality constraint ensures redundancy minimization between different basis functions; (3) the sparseness constraint on cNTF enhances the energy concentration of the speech signal, which preserves the useful features during noise reduction. The sparse tensor feature extracted by cNTF can be further processed via the discrete cosine transform into a representation called the auditory-based nonnegative tensor feature (ANTF), which can be used as a feature for speaker recognition.
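The DCT step that turns a sparse feature vector into cepstral-like ANTF coefficients can be sketched as follows. This is only an illustration under stated assumptions: the helper name `dct_ii` is hypothetical, an unnormalized DCT-II is assumed because the text does not say which DCT variant or scaling is used, and the input is a random stand-in rather than real speech data. The sizes 200 and 16 simply echo the dimensions reported in the experiments.

```python
import numpy as np

def dct_ii(v, n_coef):
    """Return the first n_coef unnormalized DCT-II coefficients of vector v."""
    v = np.asarray(v, dtype=float)
    N = v.shape[0]
    k = np.arange(n_coef)[:, None]      # coefficient index
    n = np.arange(N)[None, :]           # sample index
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    return basis @ v

# Hypothetical 200-dimensional sparse feature vector (random stand-in, not real data)
rng = np.random.default_rng(0)
sparse_feature = rng.random(200)
antf = dct_ii(sparse_feature, 16)       # 16 cepstral-like coefficients
```

Keeping only the first few coefficients compacts the feature, which is the same role the DCT plays in MFCC extraction.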

2 Method

2.1 Multilinear Algebra and PARAFAC Model

Multilinear algebra is the algebra of higher order tensors. A tensor is a higher order generalization of a matrix. Let $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$ denote a tensor. The order of $\mathcal{X}$ is $M$. An element of $\mathcal{X}$ is denoted by $x_{n_1, n_2, \ldots, n_M}$, where $1 \le n_d \le N_d$ and $1 \le d \le M$. The mode-$d$ matricization, or matrix unfolding, of an $M$th-order tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$ rearranges the elements of $\mathcal{X}$ to form the matrix $X_{(d)} \in \mathbb{R}^{N_d \times N_{d+1} N_{d+2} \cdots N_M N_1 \cdots N_{d-1}}$, whose columns are the vectors in $\mathbb{R}^{N_d}$ obtained by varying the index $n_d$ while keeping the other indices fixed. Matricizing a tensor is similar to vectorizing a matrix.

The PARAFAC model was suggested independently by Carroll and Chang [11] under the name CANDECOMP (canonical decomposition) and by Harshman [12] under the name PARAFAC (parallel factor analysis), and it has gained increasing attention in the data mining field. This model has a structural resemblance to many physical models of common real-world data, and its uniqueness property implies that data following the PARAFAC model can be uniquely decomposed into individual contributions.
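A small sketch may help make the mode-$d$ unfolding concrete. The helper name `unfold` is hypothetical and uses 0-based modes. The column ordering is an assumption: the remaining modes are taken cyclically, with mode $d+1$ varying fastest and mode $d-1$ slowest, which is one reading of the dimension list $N_{d+1} N_{d+2} \cdots N_M N_1 \cdots N_{d-1}$.

```python
import numpy as np

def unfold(X, d):
    """Mode-d unfolding (0-based d): rows follow n_d, columns follow all other indices.

    Assumed column ordering: mode d+1 varies fastest, mode d-1 slowest.
    """
    M = X.ndim
    # Axis order from slowest to fastest column index: d-1, ..., 0, M-1, ..., d+1
    order = [d] + [(d - k) % M for k in range(1, M)]
    return np.transpose(X, order).reshape(X.shape[d], -1)

X = np.arange(24).reshape(2, 3, 4)   # small 3rd-order tensor with N = (2, 3, 4)
X1 = unfold(X, 1)                    # shape (3, 2 * 4)
```

Each column of `X1` is a mode-1 fiber of `X`, that is, the vector obtained by varying the second index while the other two stay fixed.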


An $M$-way tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$ has rank 1 if it can be represented by the outer product of $M$ vectors:

$$\mathcal{X} = a^{(1)} \circ a^{(2)} \circ \cdots \circ a^{(M)}, \tag{1}$$

where $\circ$ is the outer product operator and $a^{(d)} \in \mathbb{R}^{N_d}$ for $d = 1, 2, \ldots, M$. The rank of tensor $\mathcal{X}$, denoted $R = \mathrm{rank}(\mathcal{X})$, is the minimal number of rank-1 tensors required to yield $\mathcal{X}$:

$$\mathcal{X} = \sum_{r=1}^{R} A^{(1)}_{:,r} \circ A^{(2)}_{:,r} \circ \cdots \circ A^{(M)}_{:,r}, \tag{2}$$

where $A^{(d)}_{:,r}$ represents the $r$th column vector of the mode matrix $A^{(d)} \in \mathbb{R}^{N_d \times R}$. The PARAFAC model aims to find a rank-$R$ approximation of the tensor $\mathcal{X}$,

$$\mathcal{X} \approx \sum_{r=1}^{R} A^{(1)}_{:,r} \circ A^{(2)}_{:,r} \circ \cdots \circ A^{(M)}_{:,r}. \tag{3}$$

The PARAFAC model can also be written in matrix notation by use of the Khatri-Rao product, which gives the equivalent expressions

$$X_{(d)} \approx A^{(d)} \left( A^{(d-1)} \odot \cdots \odot A^{(1)} \odot A^{(M)} \odot \cdots \odot A^{(d+1)} \right)^{T}, \tag{4}$$

where $\odot$ is the Khatri-Rao product operator.

2.2 Constrained Nonnegative Tensor Factorization

Given a nonnegative $M$-way tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$, nonnegative tensor factorization (NTF) seeks a factorization of $\mathcal{X}$ in the form

$$\mathcal{X} \approx \hat{\mathcal{X}} = \sum_{r=1}^{R} A^{(1)}_{:,r} \circ A^{(2)}_{:,r} \circ \cdots \circ A^{(M)}_{:,r}, \tag{5}$$

where the mode matrices $A^{(d)} \in \mathbb{R}^{N_d \times R}$ for $d = 1, \ldots, M$ are restricted to have only nonnegative elements in the factorization. In order to find an approximate tensor factorization $\hat{\mathcal{X}}$, we can construct a least squares cost function $J_{LS}$ and a KL-divergence cost function $J_{KL}$ based on the approximate factorization model (4). The cost functions with mode matrices $A^{(d)}$ are given by

$$J_{LS1}(A^{(d)}) = \frac{1}{2} \sum_{d=1}^{M} \left\| X_{(d)} - A^{(d)} Z_{(d)} \right\|_F^2 = \frac{1}{2} \sum_{d=1}^{M} \sum_{p=1}^{N_d} \sum_{q=1}^{N_{\bar{d}}} \left( [X_{(d)}]_{pq} - [A^{(d)} Z_{(d)}]_{pq} \right)^2 \tag{6}$$

$$J_{KL1}(A^{(d)}) = \sum_{d=1}^{M} D\!\left( X_{(d)} \,\|\, A^{(d)} Z_{(d)} \right) = \sum_{d=1}^{M} \sum_{p=1}^{N_d} \sum_{q=1}^{N_{\bar{d}}} \left( [X_{(d)}]_{pq} \log \frac{[X_{(d)}]_{pq}}{[A^{(d)} Z_{(d)}]_{pq}} - [X_{(d)}]_{pq} + [A^{(d)} Z_{(d)}]_{pq} \right) \tag{7}$$

where $Z_{(d)} = \left( A^{(d-1)} \odot \cdots \odot A^{(1)} \odot A^{(M)} \odot \cdots \odot A^{(d+1)} \right)^{T}$ and $N_{\bar{d}} = \prod_{j \neq d} N_j$. These cost functions are quite similar to those of NMF [20]; they perform a matrix factorization in each mode and minimize the error over all modes.

To the above model we can add a constraint that makes the basis functions as orthogonal as possible, i.e. that ensures redundancy minimization between different basis functions. This orthogonality constraint can be imposed by minimizing $\sum_{p \neq q} [A^{(d)T} A^{(d)}]_{pq}$.

For the traditional NMF methods, many approaches have been proposed to control the sparseness by additional constraints or penalization terms. These constraints or penalizations can be applied to the basis vectors or to both the basis and the encoding vectors. The nsNMF model [22] proposed the factorization model $V = WSH$, with a smoothing matrix $S \in \mathbb{R}^{q \times q}$ given by

$$S = (1 - \theta) I + \frac{\theta}{q} \mathbf{1} \mathbf{1}^{T}, \tag{8}$$

where $I$ is the identity matrix, $\mathbf{1}$ is a vector of ones, and the parameter $\theta$ satisfies $0 \le \theta \le 1$. For $\theta = 0$, the model (8) is equivalent to the original NMF. As $\theta \to 1$, stronger smoothness is imposed on $S$, leading to strong sparseness in both $W$ and $H$. With this nonsmooth approach, we can control the sparseness of the basis vectors and encoding vectors while maintaining the faithfulness of the model to the data.

The same idea can be applied to NTF. The corresponding cost functions with orthogonality and sparseness control constraints are then given by

$$J_{LS2}(A^{(d)}) = \frac{1}{2} \sum_{d=1}^{M} \left( \sum_{p=1}^{N_d} \sum_{q=1}^{N_{\bar{d}}} \left( [X_{(d)}]_{pq} - [A^{(d)} S Z_{(d)}]_{pq} \right)^2 + \alpha \sum_{p \neq q} [A^{(d)T} A^{(d)}]_{pq} \right) \tag{9}$$

$$J_{KL2}(A^{(d)}) = \sum_{d=1}^{M} \left( \sum_{p=1}^{N_d} \sum_{q=1}^{N_{\bar{d}}} \left( [X_{(d)}]_{pq} \log \frac{[X_{(d)}]_{pq}}{[A^{(d)} S Z_{(d)}]_{pq}} - [X_{(d)}]_{pq} + [A^{(d)} S Z_{(d)}]_{pq} \right) + \alpha \sum_{p \neq q} [A^{(d)T} A^{(d)}]_{pq} \right) \tag{10}$$

where $\alpha > 0$ is a balancing parameter between reconstruction and orthogonality. We can derive multiplicative learning algorithms for the mode matrices $A^{(d)}$ using the exponential gradient, which are similar to those in NMF. Element-wise update rules for minimizing the cost functions (9) and (10) are directly derived as done in [16,17]:

– LS:

$$A^{(d)}_{ij} \leftarrow A^{(d)}_{ij} \, \frac{[X_{(d)} Z_{(d)}^{T} S^{T}]_{ij}}{[A^{(d)} S Z_{(d)} Z_{(d)}^{T} S^{T}]_{ij} + \alpha \sum_{p \neq j} [A^{(d)T}]_{pi}} \tag{11}$$

– KL:

$$A^{(d)}_{ij} \leftarrow A^{(d)}_{ij} \, \frac{\sum_{k} [X_{(d)}]_{ik} [S Z_{(d)}]_{jk} / [A^{(d)} S Z_{(d)}]_{ik}}{\sum_{k} [S Z_{(d)}]_{jk} + \alpha \sum_{p \neq j} [A^{(d)T}]_{pi}} \tag{12}$$
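The LS update (11) can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, modes are 0-based, the smoothing matrix of Eq. (8) is assumed to have size $q = R$, a small `eps` is added to the denominator to avoid division by zero, and the unfolding is assumed to order its columns so that it matches the Khatri-Rao factor order in Eq. (4).

```python
from functools import reduce
import numpy as np

def unfold(X, d):
    """Mode-d unfolding; column order chosen to match the factor order of Eq. (4)."""
    M = X.ndim
    order = [d] + [(d - k) % M for k in range(1, M)]
    return np.transpose(X, order).reshape(X.shape[d], -1)

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices that share R columns."""
    R = mats[0].shape[1]
    out = mats[0]
    for B in mats[1:]:
        out = (out[:, None, :] * B[None, :, :]).reshape(-1, R)
    return out

def z_matrix(A, d):
    """Z_(d): transposed Khatri-Rao product of all mode matrices except A[d]."""
    M = len(A)
    return khatri_rao([A[(d - k) % M] for k in range(1, M)]).T

def smoothing_matrix(R, theta):
    """S = (1 - theta) I + (theta / R) 1 1^T, i.e. Eq. (8) with q = R (assumed)."""
    return (1.0 - theta) * np.eye(R) + (theta / R) * np.ones((R, R))

def reconstruct(A):
    """Sum of R rank-1 tensors built from the columns of the mode matrices."""
    R = A[0].shape[1]
    return sum(reduce(np.multiply.outer, [a[:, r] for a in A]) for r in range(R))

def cntf_ls(X, R, theta=0.0, alpha=0.0, n_iter=100, seed=0, eps=1e-12):
    """Multiplicative LS updates of Eq. (11), cycling over the modes."""
    rng = np.random.default_rng(seed)
    A = [rng.random((n, R)) + 0.1 for n in X.shape]      # positive random start
    S = smoothing_matrix(R, theta)
    for _ in range(n_iter):
        for d in range(X.ndim):
            SZ = S @ z_matrix(A, d)                      # R x N_dbar
            numer = unfold(X, d) @ SZ.T                  # [X_(d) Z^T S^T]_ij
            ortho = A[d].sum(axis=1, keepdims=True) - A[d]   # sum over p != j of A_ip
            denom = A[d] @ (SZ @ SZ.T) + alpha * ortho + eps
            A[d] = A[d] * numer / denom
    return A

# Synthetic nonnegative rank-3 tensor (stand-in data)
rng = np.random.default_rng(1)
true_factors = [rng.random((n, 3)) for n in (6, 5, 4)]
X = reconstruct(true_factors)

A_init = cntf_ls(X, R=3, n_iter=0)                       # same seed, no updates
A_fit = cntf_ls(X, R=3, n_iter=100)                      # plain NTF: theta = 0, alpha = 0
err_init = np.linalg.norm(X - reconstruct(A_init))
err_fit = np.linalg.norm(X - reconstruct(A_fit))
A_sparse = cntf_ls(X, R=3, theta=0.3, alpha=0.1, n_iter=50)
```

With `theta = 0` and `alpha = 0` the loop reduces to ordinary NTF updates, so `err_fit` should fall below `err_init`. Because every factor in the update is nonnegative, the mode matrices stay nonnegative for any `theta` and `alpha`.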

3 Feature Extraction Based on Auditory Model and Tensor Representation

The human auditory system is highly capable in speech recognition and speaker recognition. Much research on auditory models has shown that features based on simulation of the auditory system are more robust than traditional features under noisy background. In our feature extraction framework, we calculate the frequency selectivity information by imitating the process performed in the auditory periphery and pathway. The robust speech features are then obtained by projecting the extracted auditory information onto multiple interrelated feature subspaces via cNTF. A diagram of the feature extraction and speaker recognition framework is shown in Figure 1.

[Figure 1: block diagram. Block labels: Speech, Pre-Emphasis, Cochlear Filters, Nonlinearity, Cochlear Feature, Feature Tensor by Different Speakers, cNTF, Spectro-Temporal Basis Functions, DCT, GMM, Recognition Result.]

Fig. 1. Feature extraction and recognition framework

3.1 Feature Extraction Based on Auditory Model

We extract the features by imitating the processes that occur in the auditory periphery and pathway, such as the outer ear, middle ear, basilar membrane, inner hair cells, auditory nerves, and cochlear nucleus. We implement traditional pre-emphasis to model the combined outer and middle ear functions, $x_{pre}(t) = x(t) - 0.97\,x(t-1)$, where $x(t)$ is the discrete-time speech signal, $t = 1, 2, \ldots$, and $x_{pre}(t)$ is the filtered output signal. The frequency selectivity of the peripheral auditory system, such as the basilar membrane, is simulated by a bank of cochlear filters, which have an impulse response of the following form:

$$g_i(t) = a_i t^{n-1} e^{-2\pi b_i \mathrm{ERB}(f_i) t} \cos(2\pi f_i t + \phi_i), \quad (1 \le i \le N), \tag{13}$$

where $n$ is the order of the filters and $N$ is the number of filters in the bank. For the $i$th filter, $f_i$ is the center frequency, $\mathrm{ERB}(f_i)$ is the equivalent rectangular bandwidth (ERB) of the auditory filter, $\phi_i$ is the phase, and $a_i, b_i \in \mathbb{R}$ are constants, where $b_i$ determines the rate of decay of the impulse response, which is related to the bandwidth. In order to model the nonlinearity of the inner hair cells, we compute the power of each band in every frame $k$ with a logarithmic nonlinearity:

$$P(i, k) = \log\Big( 1 + \gamma \sum_{t \in \text{frame } k} \{ x_g^i(t) \}^2 \Big), \tag{14}$$

where $P(i, k)$ is the output power, $\gamma$ is a scaling constant, and $x_g^i(t) = \sum_{\tau} x_{pre}(\tau)\, g_i(t - \tau)$ is the output of the $i$th gammatone filter. This model can be considered as the average firing rates in the inner hair cells, which simulate the higher auditory pathway. The resulting power feature vector $P(i, k)$ at frame $k$, with component index of frequency $f_i$, comprises the spectro-temporal power representation of the auditory response. Similar to Mel-scale processing in MFCC extraction, this power spectrum provides a much higher frequency resolution at low frequencies than at high frequencies.

3.2 Sparse Tensor Representation

In order to extract robust features based on the tensor structure, we model the cochlear power features of different speakers as a 3rd-order tensor $\mathcal{X} \in \mathbb{R}^{N_f \times N_t \times N_s}$. Each feature tensor is an array with three modes, frequency × time × speaker identity, which comprises the cochlear power feature matrices $X \in \mathbb{R}^{N_f \times N_t}$ of the different speakers. We then transform the auditory feature tensor into multiple interrelated subspaces by cNTF to learn the basis functions $A^{(d)}$ ($d = 1, 2, 3$). Figure 2 shows the tensor model for the calculation of the basis functions. Compared with traditional subspace learning methods, the extracted tensor features may characterize the differences between speakers and preserve the discriminative information for classification.

Fig. 2. Tensor model for calculation of basis functions via cNTF

As described in Section 3.1, the cochlear power feature can be considered as the neuron response in the inner hair cells. The hair cells have receptive fields, which refer to a coding of sound frequency. Here we employ the sparse localized basis function $A \in \mathbb{R}^{N_f \times R}$ in the time-frequency subspace to transform the auditory feature into the sparse feature subspace, where $R$ is the dimension of the sparse feature subspace. The representation of the auditory sparse feature $X_s$ is obtained via the following transformation:

$$X_s = \hat{A} X, \tag{15}$$

Fig. 3. Results of cNTF applied to the clean speech data. (a) Basis functions (100×80) in the spectro-temporal domain. (b) Examples of encoding feature vectors.

where $\hat{A}$ consists of the nonnegative elements of $A^{-1}$, i.e. $\hat{A} = [A^{-1}]_{+}$. Figure 3(a) shows an example of the basis functions in the spectro-temporal domain. From this result we can see that most elements of the basis functions are close to zero, which accords with the sparseness constraint of cNTF. Figure 3(b) gives several examples of the encoding feature vector after transformation, which also illustrate the sparse character of the feature. Our feature extraction model is based on the fact that in sparse coding the energy of the signal is concentrated in only a few components, while the energy of additive noise remains uniformly spread over all the components. As in a soft-threshold operation, the absolute values of the pattern from the sparse coding components are compressed towards zero. The noise is reduced while the signal is not strongly affected. We also impose an orthogonality constraint on cNTF, which helps to extract useful features by minimizing the redundancy between different basis functions.
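The auditory front end of Section 3.1 and the projection of Eq. (15) can be sketched as follows. Several details are assumptions that the text does not fix: the Glasberg-Moore formula for ERB, filter order $n = 4$, $b_i = 1.019$, unit-energy normalization in place of $a_i$, an 80-sample frame, and the Moore-Penrose pseudo-inverse standing in for $A^{-1}$ because $A$ is generally not square. All function names are hypothetical and the input is a synthetic tone, not speech.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore formula, assumed)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, n=4, b=1.019, dur=0.025, phase=0.0):
    """Impulse response of Eq. (13), scaled to unit energy (assumed choice of a_i)."""
    t = np.arange(1, int(dur * fs) + 1) / fs
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t + phase)
    return g / np.sqrt(np.sum(g ** 2))

def cochlear_power(x, fs, centers, frame_len, gamma=1.0):
    """Pre-emphasis, gammatone filtering, and the log power of Eq. (14) per frame."""
    x_pre = np.append(x[0], x[1:] - 0.97 * x[:-1])
    n_frames = len(x_pre) // frame_len
    P = np.empty((len(centers), n_frames))
    for i, fc in enumerate(centers):
        y = np.convolve(x_pre, gammatone_ir(fc, fs))[:len(x_pre)]   # causal filtering
        frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
        P[i] = np.log1p(gamma * np.sum(frames ** 2, axis=1))
    return P

def sparse_projection(A, P):
    """X_s = A_hat X with A_hat = [pinv(A)]_+ (pseudo-inverse is an assumption)."""
    A_hat = np.maximum(np.linalg.pinv(A), 0.0)
    return A_hat @ P

fs = 8000
t = np.arange(4000) / fs
x = np.sin(2 * np.pi * 1000.0 * t)              # synthetic 1 kHz tone, 0.5 s
centers = [250.0, 1000.0, 3000.0]               # three illustrative center frequencies
P = cochlear_power(x, fs, centers, frame_len=80)

rng = np.random.default_rng(0)
A = rng.random((len(centers), 2))               # stand-in for a learned basis, N_f x R
Xs = sparse_projection(A, P)                    # R x N_t sparse feature
```

For the 1 kHz tone, the channel centered at 1 kHz should carry the largest log power, which is a quick sanity check on the filterbank.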

4 Experimental Results

In this section we provide the evaluation results of a speaker identification system using ANTF. The Aurora2 speech corpus, which is designed to evaluate speech recognition algorithms in noisy conditions, is used to test the recognition performance. Different noise classes were considered to evaluate the performance of ANTF against the MFCC, Mel-NMF, and Mel-PCA features, and identification accuracy was assessed. In our experiments the sampling rate of the speech signals was 8 kHz. For the given speech signals, we employed a time window of length 40000 samples (5 s). For computational simplicity, we selected 36 cochlear filter banks and a time duration of 10 samples (1.25 ms). The dimension of the speaker data is then 36 × 10 = 360. We calculated the basis functions using cNTF after the calculation of the cochlear power feature. For learning the basis functions in the different subspaces, 550 sentences (5 sentences per person) were selected randomly as the training data, and a 200-dimensional sparse tensor representation was extracted. In order to estimate the speaker model and test the efficiency of our method, we used 5500 sentences (50 sentences per person) as training data, and 1320 sentences (12 sentences per person) mixed with different kinds of noise were used as testing data. The testing data were mixed with subway, babble, car noise, and exhibition hall noise at SNRs of 20 dB, 15 dB, 10 dB, and 5 dB. For the final feature set, 16 cepstral coefficients were extracted and used for speaker modeling. A GMM with 64 Gaussian mixtures was used to build the recognizer. For comparison, the performance of MFCC, Mel-NMF, and Mel-PCA with 16-order cepstral coefficients was also tested. We use PCA and NMF to learn the part-based representation in the spectro-temporal domain after mel filtering, which is similar to [9]. The feature after PCA or NMF projection was further processed into the cepstral domain via the discrete cosine transform.

Table 1. Identification accuracy (%) in four noisy conditions (subway, babble, car noise, exhibition hall) for the Aurora2 noise testing dataset

Noise            SNR (dB)   ANTF   Mel-NMF   Mel-PCA   MFCC
Subway            5         24.5   15.5       3.6       2.7
Subway           10         58.2   40.9      12.7      16.4
Subway           15         82.7   67.3      50.9      44.6
Subway           20         86.4   88.2      88.2      76.4
Babble            5         24.6   23.6      21.8      16.4
Babble           10         60.0   41.8      51.8      51.8
Babble           15         83.6   61.8      79.1      79.1
Babble           20         89.1   82.7      96.4      93.6
Car noise         5         23.6    3.6       2.7       5.5
Car noise        10         57.3   26.4      10.0      17.3
Car noise        15         79.1   57.3      38.2      44.6
Car noise        20         86.4   74.6      79.1      78.2
Exhibition hall   5         16.4    9.1       3.6       1.8
Exhibition hall  10         50.9   29.1      20.9      20.0
Exhibition hall  15         82.7   68.2      59.1      50.0
Exhibition hall  20         90.9   86.4      89.1      76.4

Table 1 presents the identification accuracy obtained by ANTF and the baseline systems in all testing conditions. We can observe from Table 1 that the performance of ANTF degrades more slowly with increasing noise intensity than that of the other features. It performs better than the other three features in high-noise conditions such as the 5 dB condition. Figure 4 shows the identification rate in the four noisy conditions averaged over SNRs between 5 and 20 dB, and the overall average accuracy across all the conditions. The results suggest that this auditory-based tensor representation feature is robust against additive noise, which indicates the potential of the new feature for dealing with a wider variety of noisy conditions.

Fig. 4. Identification accuracy in four noisy conditions averaged over SNRs between 5-20dB, and the overall average accuracy across all the conditions, for ANTF and other three features using Aurora2 noise testing dataset
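The averages that Figure 4 summarizes can be recomputed from Table 1 with a few lines of Python. The numbers below are transcribed from Table 1 on the assumption that, within each noise type, the values are listed in the order ANTF, Mel-NMF, Mel-PCA, MFCC. The function names are illustrative.

```python
# Identification accuracy (%) read from Table 1; SNR order is 5, 10, 15, 20 dB.
TABLE1 = {
    "Subway": {"ANTF": [24.5, 58.2, 82.7, 86.4], "Mel-NMF": [15.5, 40.9, 67.3, 88.2],
               "Mel-PCA": [3.6, 12.7, 50.9, 88.2], "MFCC": [2.7, 16.4, 44.6, 76.4]},
    "Babble": {"ANTF": [24.6, 60.0, 83.6, 89.1], "Mel-NMF": [23.6, 41.8, 61.8, 82.7],
               "Mel-PCA": [21.8, 51.8, 79.1, 96.4], "MFCC": [16.4, 51.8, 79.1, 93.6]},
    "Car noise": {"ANTF": [23.6, 57.3, 79.1, 86.4], "Mel-NMF": [3.6, 26.4, 57.3, 74.6],
                  "Mel-PCA": [2.7, 10.0, 38.2, 79.1], "MFCC": [5.5, 17.3, 44.6, 78.2]},
    "Exhibition hall": {"ANTF": [16.4, 50.9, 82.7, 90.9], "Mel-NMF": [9.1, 29.1, 68.2, 86.4],
                        "Mel-PCA": [3.6, 20.9, 59.1, 89.1], "MFCC": [1.8, 20.0, 50.0, 76.4]},
}

def average_over_snr(table):
    """Mean accuracy over the four SNR levels for each noise type and feature."""
    return {noise: {feat: sum(vals) / len(vals) for feat, vals in feats.items()}
            for noise, feats in table.items()}

def overall_average(per_noise):
    """Mean of the per-noise averages for each feature."""
    feats = list(next(iter(per_noise.values())).keys())
    return {f: sum(per_noise[n][f] for n in per_noise) / len(per_noise) for f in feats}

per_noise = average_over_snr(TABLE1)
overall = overall_average(per_noise)
for noise, row in per_noise.items():
    print(noise, {f: round(a, 1) for f, a in row.items()})
print("Average", {f: round(a, 1) for f, a in overall.items()})
```

Under this reading of the table, ANTF has the highest overall average, in line with the robustness claim in the text.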


5 Conclusion

In this paper, we presented a novel speech feature extraction framework that is robust to noise at different SNRs, intended for identification systems operating under a wide variety of conditions. The approach is primarily data-driven and effectively extracts a robust speech feature, called ANTF, that is invariant to noise types and to interference of different intensities. We derived a new feature extraction method, called cNTF, for robust speaker identification. The research is mainly focused on encoding speech in a general higher order tensor structure in order to extract the robust auditory-based feature from interrelated feature subspaces. The frequency selectivity features at the basilar membrane and inner hair cells were used to represent the speech signals in the spectro-temporal domain, and the cNTF algorithm was then employed to extract the sparse tensor representation for robust speaker modeling. The discriminative and robust information of different speakers may be preserved after the projection onto the multiple interrelated subspaces. Experiments on Aurora2 have shown that the new method improves noise robustness in comparison with baseline systems trained on the same amount of information.

Acknowledgment The work was supported by the National High-Tech Research Program of China (Grant No.2006AA01Z125) and the National Natural Science Foundation of China (Grant No. 60775007).

References

1. Rabiner, L.R., Juang, B.: Fundamentals of Speech Recognition. Prentice Hall, New Jersey (1996)
2. Hermansky, H., Morgan, N.: RASTA Processing of Speech. IEEE Trans. Speech Audio Process. 2, 578–589 (1994)
3. Reynolds, D.A., Quatieri, T.F., Dunn, R.: Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing 10, 19–41 (2000)
4. Reynolds, D.A.: Experimental Evaluation of Features for Robust Speaker Identification. IEEE Trans. Speech Audio Process. 2, 639–643 (1994)
5. Berouti, M., Schwartz, R., Makhoul, J.: Enhancement of Speech Corrupted by Acoustic Noise. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 1979, vol. 4, pp. 208–211 (1979)
6. Hermus, K., Wambacq, P., Van hamme, H.: A Review of Signal Subspace Speech Enhancement and Its Application to Noise Robust Speech Recognition. EURASIP Journal on Applied Signal Processing 1, 195–209 (2007)
7. Smith, E., Lewicki, M.S.: Efficient Auditory Coding. Nature 439, 978–982 (2006)
8. Kim, T., Lee, S.Y.: Learning Self-organized Topology-preserving Complex Speech Features at Primary Auditory Cortex. Neurocomputing 65, 793–800 (2005)
9. Cho, Y.C., Choi, S.: Nonnegative Features of Spectro-temporal Sounds for Classification. Pattern Recognition Letters 26, 1327–1336 (2005)
10. Asari, H., Pearlmutter, B.A., Zador, A.M.: Sparse Representations for the Cocktail Party Problem. Journal of Neuroscience 26, 7477–7490 (2006)
11. Carroll, J.D., Chang, J.J.: Analysis of Individual Differences in Multidimensional Scaling via An n-way Generalization of "Eckart-Young" Decomposition. Psychometrika 35, 283–319 (1970)
12. Harshman, R.A.: Foundations of the PARAFAC Procedure: Models and Conditions for An "Explanatory" Multi-modal Factor Analysis. UCLA Working Papers in Phonetics 16, 1–84 (1970)
13. Bro, R.: PARAFAC: Tutorial and Applications. Chemometrics and Intelligent Laboratory Systems 38, 149–171 (1997)
14. De Lathauwer, L., De Moor, B., Vandewalle, J.: A Multilinear Singular Value Decomposition. SIAM Journal on Matrix Analysis and Applications 21, 1253–1278 (2000)
15. Kim, Y.D., Choi, S.: Nonnegative Tucker Decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (2007)
16. Welling, M., Weber, M.: Positive Tensor Factorization. Pattern Recognition Letters 22, 1255–1261 (2001)
17. Shashua, A., Hazan, T.: Non-negative Tensor Factorization with Applications to Statistics and Computer Vision. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 792–799 (2005)
18. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear Independent Components Analysis. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 547–553 (2005)
19. Tao, D.C., Li, X.L., Wu, X.D., Maybank, S.J.: General Tensor Discriminant Analysis and Gabor Feature for Gait Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1700–1715 (2007)
20. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. Advances in Neural Information Processing Systems 13, 556–562 (2001)
21. Li, S.Z., Hou, X.W., Zhang, H.J., Cheng, Q.S.: Learning Spatially Localized, Parts-based Representation. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1–6 (2001)
22. Pascual-Montano, A., Carazo, J.M., Kochi, K., Lehmann, D., Pascual-Marqui, R.D.: Nonsmooth Nonnegative Matrix Factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 403–415 (2006)

A Hypothesis on How the Neocortex Extracts Information for Prediction in Sequence Learning Weiyu Wang Department of Biology, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong [email protected]

Abstract. From the biological view, each component of a temporal sequence is represented by neural code in cortical areas of different orders. In areas of any order, minicolumns divide a component into sub-components and process them in parallel. Thus a minicolumn is a functional unit. Its layer IV neurons form a network in which cell assemblies for sub-components form. Layer III neurons are then triggered and feed back to layer IV. Taking the delay into account, Hebbian learning on the connections from layer III to layer IV can associate a sub-component with the next one. One sub-component may be linked to multiple following sub-components plus itself, but the prediction is made deterministic by a mechanism involving competition and threshold dynamics. So instead of learning the whole sequence, minicolumns selectively extract information. Information for complex concepts is distributed over multiple minicolumns, and long-term thinking takes the form of integrated dynamics in the whole cortex, including recurrent activity.

Keywords: Sequence prediction; Columnar architecture; Neocortex; Connectionism; Associative memory.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 21–29, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Most human and animal learning processes can be viewed as sequence learning. Sun and Giles summarize the problems related to sequence learning in four categories: sequence prediction, sequence generation, sequence recognition, and sequential decision making [1]. The four categories are closely related [1], and sequence prediction is arguably the foundation of the other three. Sequence learning can be approached from various disciplines, but typically it deals with sequences of symbols and is applied to language processing. In this setting, a temporal pattern is defined as a temporal sequence, and each static pattern constituting it is defined as a component (Wang and Arbib [2]). Because of the intrinsic complexity of language, a component usually cannot be determined solely by the previous component, but only by a previous sequence segment, defined as the context [2]. To learn complex sequences, a short-term memory (STM) at least as long as the maximum degree of these sequences is inevitable, and at least one context detector is assigned to each context. According to the model proposed by Wang and Yuwono in 1995 [3][4], a neural network with 2m+(n+1)r neurons (m context sensors, m modulators, n terminals each with an STM of length r) can learn an arbitrary sequence of length at most m and degree at most r, with at most n symbols. Starzyk and He proposed a more complex model with hierarchical structure in 2007 [5]. To learn a sequence of length l with n symbols, the primary level network requires 3nl+2n+2l+m neurons, where m is the number of output neurons for the next hierarchical level network (equal to the number of symbols in the next level), and the total number of neurons must include all hierarchical levels [5]. Such a high cost makes the application of sequence learning undesirable.

Another problem arises if we extend the discipline from language to others, for example vision. There the input is nearly a continuous-time temporal sequence with continuous-valued components, as the time interval is in milliseconds and thousands of neurons are involved in the primary visual representation. This leads to an extremely large symbol set and an extremely long sequence to be learned, even within a few minutes. It is therefore clearly impossible to use the traditional sequence learning approach, which aims at remembering the whole sequence and the relationship between each context and its corresponding component. We have to find other methods to solve these problems. There is surely an answer, and the guarantee is our own existence. We do read piles of articles and indeed learn something from them. We receive a tremendous amount of information from our sense organs throughout our lives, and even at the last moment of our life we can recall some scenes from our earliest years. Obviously, what is important is not only how to learn, but also what to learn.

This article approaches sequence learning from a different viewpoint: how to pick up useful information from the input sequences and store it in an organized way. This is defined as "information extraction". Our idea is to solve this problem by studying the biological architecture of the nervous system. In part 2, a mechanism for information extraction is hypothesized, based on the hierarchical and columnar organization of the cerebral cortex. In part 3, a neural network is built to simulate the function of a single minicolumn according to this hypothesis. Part 4 gives the conclusion and summarizes the significance of this model.
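To give a feel for the neuron counts quoted above, the two formulas can be evaluated for concrete sizes. The function names and the example sizes are hypothetical; only the formulas 2m+(n+1)r and 3nl+2n+2l+m come from the text.

```python
def wang_yuwono_neurons(m, n, r):
    """Neuron count 2m + (n + 1) r quoted for the Wang-Yuwono model."""
    return 2 * m + (n + 1) * r

def starzyk_he_primary_neurons(n, l, m):
    """Neuron count 3nl + 2n + 2l + m quoted for the primary level of the hierarchical model."""
    return 3 * n * l + 2 * n + 2 * l + m

# Illustrative sizes (assumed): 26 symbols, sequences of length 1000, degree 5
flat = wang_yuwono_neurons(m=1000, n=26, r=5)
primary = starzyk_he_primary_neurons(n=26, l=1000, m=10)
print(flat, primary)
```

The second count grows with the product of symbol count and sequence length, which is why the cost becomes prohibitive for very large symbol sets and very long sequences.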

2 The Mechanism for Information Extraction 2.1 The Hierarchical Structure of Neocortex and Abstraction Take the visual pathway as example. Light enters the eyes and is transduced to electrical signal in retina. The neural signal is transferred to primary visual areas via thalamus. Then information is submitted to secondary visual areas. For forming declarative memory, further transfer is to medial temporal lobe memory system and back to higher order cortex areas [6][7][8]. Though the mechanism of declarative memory formation is not completely clear yet, it is widely accepted that forming abstract concepts requires high level integration of information. Along this pathway the integration level rises, so is the abstraction level. If we describe this pathway mathematically as a vector series V1, V2,…,Vn, where Vi is a Ni elements 0-1 vector, then a component of a temporal sequence is represented by a assignment to each vector in this vector series, instead of only one vector. Notice the higher footnote i is, the higher abstraction level Vi has. And Vi depends onVi-1 ( i=2, 3,…n). This structure is somewhat an analog of Starzyk and He’s hierarchical model [5], in the difference that

A Hypothesis on How the Neocortex Extracts Information

23

it deals with real neural code instead of symbols, and much more complex integration (computation) is applied between two hierarchical levels. 2.2 The Columnar Organization of Neocortex and PDP The neocortex is horizontally divided into 6 layers. Layer IV contains different types of stellate and pyramidal cells, and is the main target of thalamocortical and intrahemispheric corticocortical afferents. Layers I through III are the main target of interhemispheric corticocortical afferents. Layer III contains predominantly pyramidal cells and is the principal source of corticocortical efferents. Layer V and VI are efferents to motor-related subcortical structures and thalamus separately [9]. Vertically neocortex is columnar organized with elementary module minicolumn. The minicolumn is a discrete module at the layers IV, II, and VI, but connected to others for most neurons of layer III [10]. Considering the vector series V1, V2,…,Vn, vector Vi is divided into sub-vectors in corresponding minicolumns for any i. Each sub-vector represents a sub-component, and is processed independently in its minicolumn. Minicolumns transmit processed information to the next hierarchical level minicolumns. This accords with the idea of “Parallel Distributed Processing” (PDP) proposed by Rumelhart and McClelland [11][12]. 2.3 Minicolumn Architecture A model for the structure of a minicolumn is shown in figure 1. In this model, all pyramidal cells and stellate cells in layer IV of the minicolumn form a symmetrical Hebbian network. As neurons involved are limited and closely packed, we can assume any neuron is connected to all other neurons through short axons, whose transmission delay can be omitted. If the pyramidal cells connect to pyramidal cells directly, the connections are excitatory. If the pyramidal cells connect to other pyramidal cells through stellate cells, the connections are inhibitory. Thus this network contains both excitatory and inhibitory connections. 
Typical Hebbian learning in this network will form cell assemblies [13]. Each cell assembly stands for a sub-component. Signals are transmitted from layer IV pyramidal cells to layer III pyramidal cells through long axons. As layer III contains predominantly pyramidal cells, the connections are mainly excitatory. Thus layer III is not an ideal place for forming cell assemblies: without inhibitory connections, two cell assemblies will intermingle and merge into one even if they overlap only slightly. The representations in layer III simply correspond to the cell assemblies in layer IV, and we can assume no overlap in layer III, as this can be achieved automatically through a winner-take-all (WTA) mechanism, as also used in Wang and Arbib’s model [2]. Signals are transmitted from layer III through long axons either to other minicolumns or back to layer IV.

2.4 Association in Minicolumns during Learning

What is important is the transmission from layer III back to layer IV (the feedback). Typically the function of feedback is thought to be refinement or synchronization,


W. Wang

Fig. 1. Structure of a minicolumn, focusing on layer IV (afferent) and layer III (efferent). Pyramidal cells and stellate cells in layer IV connect with each other through short axons, forming a network with both excitatory and inhibitory synapses. Layer IV pyramidal cells transmit signals to layer III pyramidal cells through long axons. Layer III pyramidal cells may transmit signals through long axons to layer IV pyramidal cells of other minicolumns, or back to the layer IV pyramidal cells of their own minicolumn, forming a feedback loop (indicated by the thick lines).

for example in the model proposed by Korner et al. [14]. But in our view, the feedback loop, together with its transmission delay, is the basis for associating a sub-component with the next sub-component. Notice that the two sub-components involved are not input at the same time, whereas Hebbian learning based on synaptic plasticity requires the two neurons involved to be excited at the same time [13][15-17]. This is solved by the transmission delay of the feedback loop. Synaptic modification can only happen in the synaptic junction, through changes in the amount of neurotransmitter released by the presynaptic neuron or in the number of postsynaptic receptors [15-17]. Suppose the delay from the excitation of layer IV pyramidal cell bodies (dendrites) to the excitation of layer III pyramidal cell axon terminals is Δt, and the durations of sub-component A and sub-component B are t1 and t2 respectively (t1, t2 >> Δt), with B following A. Then from time 0 to Δt, no Hebbian learning happens at the synaptic junctions between layer III pyramidal cell axons and layer IV pyramidal cell bodies (dendrites), for only the latter are excited. From Δt to t1, Hebbian learning associates sub-component A with itself, denoted as learning the ordered pair (A,A). From t1 to t1+Δt, the layer III pyramidal cell axon terminals still represent sub-component A, while the


layer IV pyramidal cell bodies (dendrites) already code for sub-component B. Hence the association is (A,B). From t1+Δt to t1+t2, the association will be (B,B).

2.5 Competition and Threshold Dynamic during Retrieval

Suppose A, B, C, B, D, E, A, B, F, D, E, C, A, B denotes a sequence of sub-components of a temporal sequence in a minicolumn. After learning, (A,B), (B,C), (B,D), (B,F), (C,B), (C,A), (D,E), (E,A), (E,C), (F,D) plus (A,A), (B,B), (C,C), (D,D), (E,E), (F,F) are learned. Now input A (lasting time t > Δt). The cell assembly for A in layer IV is evoked. From Δt to t, the feedback from layer III tries to evoke both A and B. But A is excited, supported by the exterior input, and it inhibits the excitation of the cell assembly for B. When the exterior input ceases at time t, the only remaining stimulation is from layer III, and this stimulation lasts exactly Δt. Because the cell assembly for A has been excited, the threshold of its neurons rises, so it cannot be evoked again for quite a while (at least Δt). Thus the cell assembly for B finally gets its chance to excite. After another Δt, the cell assembly for B ceases exciting and cannot be evoked again, and the layer III feedback tries to evoke the three cell assemblies for C, D, and F. Each of them tends to excite and to inhibit the other two, and the competition leads to nothing being excited (more accurately, the three may excite as a “flash”, since inhibition is triggered by excitation, but this “flash” is very short compared with Δt and disappears without further effect). Hence, judging from the exterior performance of the minicolumn, only (A,B), (D,E), (F,D) are learned.

2.6 Summary

By the mechanism described above, an input temporal sequence is understood at different abstraction levels in different hierarchical levels of the neocortex. In each hierarchical level, the components (temporal sequences) are divided into sub-components (sub-temporal sequences) by minicolumns.
Each minicolumn only extracts the deterministic feature of the sub-temporal sequences: if sub-component A is always followed by sub-component B and by no other sub-component, the minicolumn learns that A predicts B.
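The net effect of Sections 2.4 to 2.6 can be checked at the symbolic level with a short sketch. It does not model neurons, delays, or thresholds; it only assumes that a pair (A,B) survives retrieval when B is the single distinct follower of A. The function names `followers_of` and `deterministic_pairs` are illustrative and do not come from the paper.

```python
from collections import defaultdict

def followers_of(sequence):
    """Map each sub-component to the set of distinct sub-components that follow it."""
    followers = defaultdict(set)
    for current, nxt in zip(sequence, sequence[1:]):
        if current != nxt:  # self-pairs such as (A,A) are learned but predict nothing new
            followers[current].add(nxt)
    return followers

def deterministic_pairs(sequence):
    """Keep only the pairs (A, B) where B is the single possible follower of A."""
    followers = followers_of(sequence)
    return sorted((a, min(fs)) for a, fs in followers.items() if len(fs) == 1)

# The example sequence of Section 2.5
sequence = list("ABCBDEABFDECAB")
print(deterministic_pairs(sequence))  # [('A', 'B'), ('D', 'E'), ('F', 'D')]
```

For the example sequence this gives (A,B), (D,E), and (F,D), the pairs reported in Section 2.5. B, C, and E each have several followers and therefore retrieve nothing.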

3 The Neural Network Simulation of the Minicolumn

We built only a small network containing 10 layer IV pyramidal cells and 10 layer III pyramidal cells for demonstration. Of course the network can be expanded to hundreds of neurons to simulate a real minicolumn. Let binary arrays F[10] and T[10] denote the layer IV neurons and layer III neurons respectively. For simplicity, we let T[i] = F[i], i = 0, 1, …, 9, though in the real case the representations in layer III for cell assemblies in layer IV can be quite different and involve different numbers of neurons. Thus a cell assembly 1111100000 is also represented as 1111100000 in layer III in our network. Array Thresh[10] denotes the thresholds of the layer IV neurons, whose value is 1 initially and 21 after exciting, but returns to 1 after Δt. Intra[10][10] is the learning matrix for association among layer IV neurons, whose value is in [-300, 30]. (Negative means inhibitory. As one pyramidal cell can inhibit another through numerous stellate cells, the inhibitory connection is thought to be much stronger.)


Inter[10][10] is the learning matrix for association from layer III to layer IV, whose value is in [0,2] (only excitatory; thus the effect of layer III pyramidal cells on layer IV pyramidal cells is not as strong as that of layer IV pyramidal cells on themselves). The sequence learning process takes discrete steps, and we set Δt = 1 step (the delay in a minicolumn cannot be very long). An input sequence is denoted A[10](a), B[10](b), C[10](c), …, where A[10], B[10], C[10] are 10-element 0-1 vectors, and a, b, c are integers giving the number of steps each state lasts. At a step when the input is I[10](n) (n > 0 is the remaining time this state lasts), learning starts by setting F[i] = I[i]. The learning rule for updating intra[i][j] is

intra[i][j] = (intra[i][j] >= 0) × (F[i]F[j] × 0.5 × (30 − intra[i][j]) − F[i]F[j] × 3) + (intra[i][j] … thresh[i])   j≠i   j

Notice that in each step we need to repeat the above calculation until F[i] no longer changes (as newly evoked neurons can in turn evoke others). The result is the final evoked cell assembly. Then let T[i] = F[i], simulating the information transmission, and refresh the thresholds by Thresh[i] = 1 + 20F[i]. Finally set I[10](n) = I[10](n-1) and continue (when n = 1, set I[i] = 0 and n = 1).

Now look at an example. The temporal sequence 1111000000(16), 0000000001(24), 0000111000(13), 1111000000(7), 0000000110(19) is input 10 or more times (enough repetitions are necessary, as the inter-state association can only happen once when one state changes to another). After learning, intra[10][10] approximates

\[
\begin{pmatrix}
30 & 30 & 30 & 30 & -300 & -300 & -300 & -300 & -300 & -300 \\
30 & 30 & 30 & 30 & -300 & -300 & -300 & -300 & -300 & -300 \\
30 & 30 & 30 & 30 & -300 & -300 & -300 & -300 & -300 & -300 \\
30 & 30 & 30 & 30 & -300 & -300 & -300 & -300 & -300 & -300 \\
-300 & -300 & -300 & -300 & 30 & 30 & 30 & -300 & -300 & -300 \\
-300 & -300 & -300 & -300 & 30 & 30 & 30 & -300 & -300 & -300 \\
-300 & -300 & -300 & -300 & 30 & 30 & 30 & -300 & -300 & -300 \\
-300 & -300 & -300 & -300 & -300 & -300 & -300 & 30 & 30 & -300 \\
-300 & -300 & -300 & -300 & -300 & -300 & -300 & 30 & 30 & -300 \\
-300 & -300 & -300 & -300 & -300 & -300 & -300 & -300 & -300 & 30
\end{pmatrix}
\]

Inter[10][10] approximates

\[
\begin{pmatrix}
2 & 2 & 2 & 2 & 0 & 0 & 0 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 0 & 0 & 0 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 0 & 0 & 0 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 0 & 0 & 0 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 2 & 2 & 2 & 0 & 0 & 0 \\
2 & 2 & 2 & 2 & 2 & 2 & 2 & 0 & 0 & 0 \\
2 & 2 & 2 & 2 & 2 & 2 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 2 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 2 & 0 \\
0 & 0 & 0 & 0 & 2 & 2 & 2 & 0 & 0 & 2
\end{pmatrix}
\]

Four cell assemblies, 1111000000, 0000111000, 0000000110, and 0000000001, are formed. The extracted information is (0000000001, 0000111000) and (0000111000, 1111000000); thus the input 0000000001 returns the sequence 0000111000, 1111000000. 0000000110 retrieves nothing, as it is associated with nothing. 1111000000 retrieves nothing either, but this is because it is associated with both 0000000110 and 0000000001.

In this neural network, cell assemblies are required not to overlap. If two cell assemblies in layer IV overlap, their representations in layer III also share a common part. This common part will try to evoke both cell assemblies no matter which of them caused it, leading to undesired results. This can be solved by adding another feed-forward learning stage that constructs the layer III representations of the cell assemblies and ensures no overlap (for example, the WTA mechanism used in Wang and Arbib’s model [2]).

Oscillation may occasionally happen during retrieval. This requires that the input sequence itself ends with repeating cycles, like the sequence A, B, C, D, C, D, C. After learning this sequence, input C or D will lead to oscillation between C and D. But this situation is really rare: if the above sequence does not end with C or D, for example A, B, C, D, C, D, C, A, then C will not retrieve D (as it is associated with both D and A), and no oscillation can happen.
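The retrieval behaviour reported for this example can be reproduced with a simplified sketch. It is not the simulation described above: the learning rules, thresholds, and inhibition are replaced by two assumptions. First, Inter is taken to hold the value 2 wherever a layer III neuron of one assembly was active just before a layer IV neuron of the next assembly (row = layer III source, column = layer IV target, a convention inferred from the example). Second, an assembly is retrieved only when exactly one other assembly is fully driven. The names `build_inter`, `followers`, and `retrieve` are illustrative.

```python
# States of the example sequence, in input order (durations omitted)
SEQUENCE = ["1111000000", "0000000001", "0000111000", "1111000000", "0000000110"]
ASSEMBLIES = sorted(set(SEQUENCE), reverse=True)

def bits(pattern):
    return [int(ch) for ch in pattern]

def build_inter(sequence, strength=2):
    """Assumed layout: inter[i][j] links layer III neuron i to layer IV neuron j."""
    n = len(sequence[0])
    inter = [[0] * n for _ in range(n)]
    links = [(s, s) for s in sequence] + list(zip(sequence, sequence[1:]))
    for src, dst in links:  # self-pairs (A,A) plus successor pairs (A,B)
        for i, s in enumerate(bits(src)):
            for j, d in enumerate(bits(dst)):
                if s and d:
                    inter[i][j] = strength
    return inter

def followers(inter, assemblies, src):
    """Other assemblies whose neurons all receive drive from src."""
    active = [i for i, s in enumerate(bits(src)) if s]
    drive = [sum(inter[i][j] for i in active) for j in range(len(inter))]
    return [a for a in assemblies
            if a != src and all(drive[j] > 0 for j, b in enumerate(bits(a)) if b)]

def retrieve(inter, assemblies, start, max_steps=10):
    """Follow successors while exactly one candidate is driven (competition otherwise)."""
    chain, current = [], start
    for _ in range(max_steps):
        candidates = followers(inter, assemblies, current)
        if len(candidates) != 1:
            break
        current = candidates[0]
        chain.append(current)
    return chain

inter = build_inter(SEQUENCE)
print(retrieve(inter, ASSEMBLIES, "0000000001"))  # ['0000111000', '1111000000']
```

Under these assumptions, 0000000110 and 1111000000 both retrieve an empty chain, the former because nothing follows it and the latter because two candidates compete. The `max_steps` bound stands in for the end of concentration, so a sequence ending in a repeating cycle would alternate until the bound is reached.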

4 Conclusion and Significance

The model proposed in this article deals with the sequence learning problem from a different viewpoint: extracting information. Adopting this idea, what matters is what information to extract rather than how to remember all information. The most significant advantage of this idea is that the memory capacity required is not proportional to the sequence length and degree, but to the useful information (knowledge) contained in the sequence. Multiple different sequences may contain common knowledge. The common knowledge appears as the same sub-sequences in certain minicolumns of certain hierarchical levels. For example, a stone, a tire, or a basketball rolling down a hill appear to be quite different scenes if every detail is considered, but in physics all of them are abstracted as the process of a round object rolling down a slope. This is because the essence of abstraction is the process of extracting important common


features while omitting the other unimportant details. This process is fulfilled in our model by the complex connections among minicolumns of different hierarchical levels, which lead to complicated neural computation. Naturally, as the abstraction level increases, the knowledge becomes more and more general and the amount of information is reduced, as shown by the decrease in the variation of sequences. It is arguable that at high enough hierarchical levels, only a few sequences repeat frequently.

Learning proceeds by forming associative memory in minicolumns. Each minicolumn associates a sub-component with itself and with its immediate follower. But through competition and threshold dynamic, A evokes B if and only if B is the only possible follower of A. This means a minicolumn does not consider a temporal sequence’s degree: every temporal sequence is treated as a simple sequence. Thus a minicolumn can remember only a small portion of the sequence by itself, seemingly useless compared with Wang and Yuwono’s model [3][4] and Starzyk and He’s model [5]. But the advantage is that the neural network for a minicolumn is extremely simple, as described in Section 3, with much lower cost than Wang’s or Starzyk’s models. Thus it is well suited to being a functional unit. The complex tasks are expected to be accomplished by the whole network composed of millions of such functional units.

Typically a sub-component in a minicolumn can only retrieve one or two following sub-components, and then this minicolumn ceases. But the retrieved sub-components are submitted to higher level minicolumns, and may trigger retrieval in them. By repeating this activity, and through possible crosses or loops (recurrent activity), a sub-component might trigger unlimited retrieval. This process must be consciously controlled by concentration (a mysterious cognitive function not discussed in this article).

Finally, the model has the following important features:

1. Higher hierarchical level minicolumns tend to learn more than lower hierarchical level minicolumns, as sequence variation is reduced at high abstraction levels.
2. Two seemingly completely different objects may retrieve the same thing, provided they share some common feature and the concentration is on this common feature. For example, an elephant and a glacier may both retrieve the concept of “huge”.

Acknowledgments. The author thanks Bertram E. Shi of the Department of Electronic & Computer Engineering, HKUST, for offering illuminating advice and inspiring discussion.

References

1. Sun, R., Giles, L.C.: Sequence Learning: From Recognition and Prediction to Sequential Decision Making. IEEE Intell. Syst. 16, 67–70 (2001)
2. Wang, D., Arbib, A.M.: Complex Temporal Sequence Learning Based on Short-term Memory. Proc. IEEE 78, 1536–1543 (1990)
3. Wang, D., Yuwono, B.: Anticipation-Based Temporal Pattern Generation. IEEE Trans. Syst. Man Cybern. 25, 615–628 (1995)
4. Wang, D., Yuwono, B.: Incremental Learning of Complex Temporal Patterns. IEEE Trans. Neural Networks 7, 1465–1481 (1996)
5. Starzyk, A.J., He, H.: Anticipation-Based Temporal Sequences Learning in Hierarchical Structure. IEEE Trans. Neural Networks 18, 344–358 (2007)
6. Squire, R.L., Zola, M.S.: The Medial Temporal Lobe Memory System. Science 253, 1380–1386 (1991)
7. Thompson, F.R., Kim, J.J.: Memory systems in the brain and localization of a memory. PNAS 93, 13438–13444 (1996)
8. Mayes, A., Montaldi, D., Migo, E.: Associative Memory and the Medial Temporal Lobes. Trends Cogn. Sci. 11, 126–135 (2007)
9. Creutzfeldt, D.O.: Cortex Cerebri: Performance, Structural and Functional Organization of the Cortex. Oxford University Press, USA (1995)
10. Mountcastle, B.V.: The Columnar Organization of the Neocortex. Brain 120, 701–722 (1997)
11. Rumelhart, D.E., McClelland, J.L.: The PDP Research Group: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Foundations, vol. 1. MIT Press, Cambridge (1986)
12. McClelland, J.L., Rumelhart, D.E.: The PDP Research Group: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Psychological and Biological Models, vol. 2. MIT Press, Cambridge (1986)
13. Hebb, D.O.: The Organization of Behavior. Wiley, New York (1949)
14. Korner, E., Gewaltig, O.M., Korner, U., Richter, A., Rodemann, T.: A model of computation in neocortical architecture. Neural Networks 12, 989–1005 (1999)
15. Bliss, P.V.T., Collingridge, L.G.: A synaptic model of memory: long-term potentiation in the hippocampus. Nature 361, 31–39 (1993)
16. Bear, F.M.: A synaptic basis for memory storage in the cerebral cortex. PNAS 93, 13453–13459 (1996)
17. Chen, R.W., Lee, S., Kato, K., Spencer, D.D., Shepherd, M.G., Williamson, A.: Long-term modifications of synaptic efficacy in the human inferior and middle temporal cortex. PNAS 93, 8011–8015 (1996)

MENN Method Applications for Stock Market Forecasting

Guangfeng Jia, Yuehui Chen, and Peng Wu

School of Information Science and Engineering, University of Jinan, 250022 Jinan, China
[email protected]

Abstract. A new approach for forecasting stock indices based on the Multi Expression Neural Network (MENN) is proposed in this paper. The approach employs multi expression programming (MEP) to evolve the architecture of the MENN and particle swarm optimization (PSO) to optimize the parameters encoded in the MENN. This framework allows input variable selection and over-layer connections for the various nodes involved. The performance and effectiveness of the proposed method are evaluated using stock market forecasting problems and compared with related methods.

Keywords: Multi Expression Programming, Artificial Neural Network, Stock Market Forecasting.

1 Introduction

Stock index forecasting is an integral part of everyday life. Current methods of forecasting require some elements of human judgment and are subject to error. Stock indices are sequences of data points, typically measured at uniform time intervals. There are several motivations for trying to predict stock market prices. The most basic of these is financial gain. Any system that can consistently pick winners and losers in the dynamic marketplace would make its owner very wealthy. Thus, many individuals, including researchers, investment professionals, and average investors, are continually looking for such a superior system that will yield high returns [1][2].

Artificial neural networks (ANNs) represent one widely used technique for stock market forecasting. Apparently, White [3] first used neural networks for market forecasting. In other work, Chiang, Urban, and Baldridge used ANNs to forecast the end-of-year net asset value of mutual funds. Trafalis used feedforward ANNs to forecast the change in the S&P 500 index. Typically the predicted variable is continuous, so that stock market prediction is usually a specialized form of regression. Any type of neural network can be used for stock index prediction (the network type must, however, be appropriate for regression or classification, depending on the problem type). The network can also have any number of input and output variables [4]. In addition to stock index prediction, neural networks have been trained to perform a variety of finance-related tasks.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 30–39, 2008. © Springer-Verlag Berlin Heidelberg 2008


There are experimental and commercial systems used for tracking commodity markets and futures, foreign exchange trading, financial planning, company stability, and bankruptcy prediction. Banks use neural networks to scan credit and loan applications to estimate bankruptcy probabilities, while money managers can use neural networks to plan and construct profitable portfolios in real time. As the application of neural networks in the financial area is so vast, we will focus on stock market prediction. However, most commonly there is a single variable that is both the input and the output.

Despite the widespread use of ANNs, there are significant problems to be addressed. ANNs are data-driven models, and consequently the underlying rules in the data are not always apparent. Also, the buried noise and complex dimensionality of stock market data make it difficult to learn or re-estimate the ANN parameters. It is also difficult to come up with an ANN architecture that can be used for all domains. In addition, ANNs occasionally suffer from the over-fitting problem.

In this paper, an automatic method for constructing a MENN network is proposed. Based on pre-defined instruction/operator sets, a MENN network can be created and evolved. MENN allows input variable selection and over-layer connections for different nodes. The novelty of this paper is in the use of the multi expression neural network model for selecting the important features and for improving the accuracy.

The paper is organized as follows: Section 2 gives a short review of the original MEP algorithm and the ANN model. The representation of the MENN model and a hybrid learning algorithm for designing the artificial neural network are given in Section 3. Section 4 presents some simulation results for the stock market forecasting problems. Finally, in Section 5 we present some concluding remarks.

2 MEP, PSO Algorithms and ANN Model

2.1 A Short Review of the Original MEP Algorithm

Evolutionary algorithms are defined as randomized search procedures inspired by the working mechanism of genetics and natural selection [5]. There are different types of evolutionary algorithms, such as genetic algorithms (GA), genetic programming (GP), evolution strategies (ES), and evolutionary programming (EP). MEP [6][7] is a relatively new technique in genetic programming that was first introduced in 2002 by Oltean and Dumitrescu. A traditional GP encodes a single expression (computer program). By contrast, a MEP chromosome encodes several expressions. The best of the encoded solutions is chosen to represent the chromosome. A MEP individual includes some genes which are represented by substrings of variable length. The number of genes per chromosome is constant and this number defines the length of a chromosome [8]. Each gene encodes a terminal or a function symbol which is selected from a terminal set T or a function set F. The two sets for a given problem are pre-defined. A gene that encodes a function includes some pointers towards the function arguments. The number of the


Fig. 1. A valid MEP chromosome

pointers depends on how many arguments the function has. In order to ensure that each chromosome is a valid MEP individual, there are some restrictions on initializing the population [6]:

1) The first gene of the chromosome must contain a terminal that is randomly selected from the terminal set T.
2) For all other genes which encode functions, we need to generate pointers toward the function arguments. All the pointers must have indices lower than that of the current gene. In this way only syntactically correct chromosomes are generated.

An example of a chromosome using the sets F = {+, -, ∗, sin} and T = {a, b, c, d} is shown in Fig. 1. The MEP chromosomes are read in a top-down fashion starting with the first gene. A gene that encodes a terminal specifies a simple expression, and a gene that encodes a function specifies a complex expression (formed by linking the operands specified by the argument positions with the current function symbol) [5]. For instance, genes 1, 2, 4 and 5 in Fig. 1 encode simple expressions formed by a single terminal symbol. These expressions are: E1 = a; E2 = b; E4 = c; E5 = d. Gene 3 indicates the operation ∗ on the operands located at positions 1 and 2 of the chromosome. Therefore gene 3 encodes the expression E3 = a ∗ b. Gene 6 indicates the operation sin on the operand located at position 4. Therefore gene 6 encodes the expression E6 = sin c. Gene 7 indicates the operation − on the operands located at positions 3 and 5. Therefore gene 7 encodes the expression E7 = (a ∗ b) − d. Gene 8 indicates the operation + on the operands located at positions 7 and 6. Therefore gene 8 encodes the expression E8 = (a ∗ b) − d + sin c. The tree representations of these expressions are shown in Fig. 2. As a MEP chromosome encodes more than one expression, it is necessary to choose one of the expressions to represent the chromosome. The chromosome fitness is usually defined as the fitness of the best expression encoded by that chromosome.

2.2 PSO Algorithm

The PSO conducts searches using a population of particles that correspond to individuals in an Evolutionary Algorithm (EA) [9][10]. To obtain the parameters of the MENN, the particle swarm optimization (PSO) algorithm is employed. All free parameters in the MENN constitute a particle. Initially, a population of particles is


Fig. 2. Tree representations of a MEP chromosome

randomly generated. Each particle represents a potential solution and has a position represented by a position vector x_i. A swarm of particles moves through the problem space, with the moving velocity of each particle represented by a velocity vector v_i. At each time step, a function f_i, representing a quality measure, is calculated by using x_i as input. Each particle keeps track of its own best position, which is associated with the best fitness it has achieved so far, in a vector p_i. Furthermore, the best position among all the particles obtained so far in the population is kept track of as p_g. In addition to this global version, another version of PSO keeps track of the best position among all the topological neighbors of a particle. At each time step t, by using the individual best position, p_i(t), and the global best position, p_g(t), a new velocity for particle i is updated by

\[ v_i(t+1) = v_i(t) + c_1 \phi_1 \bigl(p_i(t) - x_i(t)\bigr) + c_2 \phi_2 \bigl(p_g(t) - x_i(t)\bigr) \tag{1} \]

where c_1 and c_2 are positive constants and φ_1 and φ_2 are uniformly distributed random numbers in [0,1]. The velocity v_i is limited to the range of V_max (if the velocity violates this limit, it is set to its proper limit). Changing the velocity in this way enables particle i to search around both its individual best position, p_i, and the global best position, p_g. Based on the updated velocities, each particle changes its position according to

\[ x_i(t+1) = x_i(t) + v_i(t+1) \tag{2} \]
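Equations (1) and (2) translate almost directly into code. The sketch below is a minimal global-best PSO under assumptions the text does not fix: the fitness is minimized, c_1 = c_2 = 2, V_max = 1, φ_1 and φ_2 are drawn independently for every dimension, and particles start at rest inside a hypothetical search box `bounds`. The names `pso_step` and `pso` are illustrative.

```python
import random

def clamp(value, limit):
    """Restrict a velocity component to [-limit, +limit]."""
    return max(-limit, min(limit, value))

def pso_step(x, v, p_best, g_best, c1=2.0, c2=2.0, v_max=1.0, rng=random):
    """Apply Eq. (1) with velocity clamping, then Eq. (2), to one particle."""
    new_v = []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        phi1, phi2 = rng.random(), rng.random()
        velocity = vi + c1 * phi1 * (pi - xi) + c2 * phi2 * (gi - xi)
        new_v.append(clamp(velocity, v_max))
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v

def pso(fitness, dim, n_particles=20, iters=100, bounds=(-5.0, 5.0), seed=0):
    """Minimize `fitness`; returns the global best position and its value."""
    rng = random.Random(seed)
    lo, hi = bounds
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    p_best = [x[:] for x in xs]
    p_val = [fitness(x) for x in xs]
    best = min(range(n_particles), key=p_val.__getitem__)
    g_best, g_val = p_best[best][:], p_val[best]
    for _ in range(iters):
        for i in range(n_particles):
            xs[i], vs[i] = pso_step(xs[i], vs[i], p_best[i], g_best, rng=rng)
            value = fitness(xs[i])
            if value < p_val[i]:          # update the particle's own best
                p_best[i], p_val[i] = xs[i][:], value
                if value < g_val:         # update the swarm's best
                    g_best, g_val = xs[i][:], value
    return g_best, g_val

best_x, best_val = pso(lambda x: sum(t * t for t in x), dim=2)
```

Because the global best is replaced only by a strictly better value, the returned fitness can never be worse than that of the best initial particle. In the setting of this paper, a particle would hold the free parameters of one network, and `fitness` would be its forecasting error.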

2.3 A Short Review of the Traditional ANN Model

A typical neural network consists of layers. In a single layered network there is an input layer of source nodes and an output layer of neurons. A multi-layer network has in addition one or more hidden layers of hidden neurons. Some standard three-layer feed-forward networks are used widely [11]. A representative feed-forward neural network consists of a three layer structure: input layer, output layer and hidden layer. Each layer is composed of variable nodes. The type of this network is displayed in Fig. 3. The number of nodes in the hidden layers is selected to make the network more efficient and to interpret the data more accurately. The relationship between the input and output can be non-linear or linear, and its characteristics are determined by the weights assigned to the connections between the nodes in the two adjacent layers. Changing the weight will change the input-to-output behavior of the network.

Fig. 3. A fully connected feed-forward network with one hidden layer and one output layer

A feed-forward neural network analysis consists of two stages, namely training and testing. During the training stage, an input-to-output mapping is determined iteratively using the available training data. The actual output, propagated forward from the current input set, is compared with the target output, and the required compensation is transmitted backwards to adjust the node weights so that the error can be reduced at the next iteration. The training stage is stopped once a pre-set error threshold is reached, and the node weights are frozen at this point. During the testing stage, data with unknown properties are provided as input and the corresponding output is calculated using the fixed node weights. The feed-forward neural network has been shown to perform well in many areas in previous research.
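The training stage described above can be illustrated with a very small network. This is a sketch under assumptions not stated in the text: one hidden layer of sigmoid nodes, a single linear output node, squared error, plain gradient descent with learning rate 0.1, and a stopping threshold of 1e-4. The class `TinyNet` and the function `train` are illustrative names.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    """Input layer -> one sigmoid hidden layer -> one linear output node."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, output

    def train_step(self, x, target, lr=0.1):
        """One forward pass, then send the output error backwards to adjust the weights."""
        hidden, output = self.forward(x)
        err = output - target                          # derivative of 0.5 * err**2
        for j, h in enumerate(hidden):
            grad_h = err * self.w2[j] * h * (1.0 - h)  # error reaching hidden node j
            self.w2[j] -= lr * err * h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
            self.b1[j] -= lr * grad_h
        self.b2 -= lr * err
        return 0.5 * err * err

def train(net, samples, threshold=1e-4, max_epochs=5000):
    """Repeat passes over the data until the summed error drops below the threshold."""
    total = float("inf")
    for _ in range(max_epochs):
        total = sum(net.train_step(x, t) for x, t in samples)
        if total < threshold:
            break
    return total

net = TinyNet(n_in=2, n_hidden=3, seed=1)
final_error = train(net, [([0.5, -0.2], 0.3)])
```

After `train` returns, the weights are left unchanged, so calling `net.forward` on unseen inputs corresponds to the testing stage.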

3 A Novel MENN Model

3.1 Representation

In this research, a novel multi expression programming based encoding method with specific instruction set is selected for representing a MENN model. The

Fig. 4. A valid MENN chromosome

reason for choosing this representation is that the tree can be created and evolved using the linear chromosome structure of MEP. The function set F and terminal set T used for generating a MENN model are described as F = {+, -, ∗, sin} and T = {a, b, c, d}. A gene that encodes a function includes some pointers towards the function arguments. The number of pointers depends on how many arguments the function has. The value of a MENN gene expression is calculated recursively. The multi expression neural network is shown in Fig. 4. From this point of view, the MENN can also be viewed as a flexible neural network chromosome.

3.2 Initialization

The initial population is generated according to a predefined population size parameter, which determines the number of MENN chromosomes in the population. Individuals of the population are repeatedly generated by the following procedure.

1) The function symbol or terminal symbol of each gene is selected from a function set F or a terminal set T. According to the proposed representation scheme, the first symbol of each MENN chromosome must be a terminal symbol. For all genes which encode functions, pointers have to be generated to address the function arguments. All the pointers must have indices lower than that of the current gene.
2) The second part of each gene consists of the MENN parameters, which include weight parameters and activation function parameters. These real parameters are randomly generated in [0, 1].
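The structural part of this procedure, together with the top-down decoding reviewed in Section 2.1, can be sketched as follows. The sketch covers only step 1): it uses F = {+, -, ∗, sin} and T = {a, b, c, d}, stores pointers as 0-based indices, and leaves out the weight and activation parameters of step 2), since the text does not specify how they enter the computation. The probability `p_function` and the names `random_chromosome` and `evaluate` are illustrative.

```python
import math
import random

# symbol -> (number of arguments, implementation)
FUNCTIONS = {
    "+": (2, lambda x, y: x + y),
    "-": (2, lambda x, y: x - y),
    "*": (2, lambda x, y: x * y),
    "sin": (1, math.sin),
}
TERMINALS = ["a", "b", "c", "d"]

def random_chromosome(length, rng, p_function=0.5):
    """First gene is a terminal; function genes point only to earlier genes."""
    genes = [(rng.choice(TERMINALS), ())]
    for index in range(1, length):
        if rng.random() < p_function:
            symbol = rng.choice(sorted(FUNCTIONS))
            arity = FUNCTIONS[symbol][0]
            pointers = tuple(rng.randrange(index) for _ in range(arity))
            genes.append((symbol, pointers))
        else:
            genes.append((rng.choice(TERMINALS), ()))
    return genes

def evaluate(genes, env):
    """Read the chromosome top-down and return the value of every expression."""
    values = []
    for symbol, pointers in genes:
        if symbol in FUNCTIONS:
            _, func = FUNCTIONS[symbol]
            values.append(func(*(values[p] for p in pointers)))
        else:
            values.append(env[symbol])
    return values

# The chromosome of Fig. 1, with positions 1..8 rewritten as indices 0..7
FIG1 = [("a", ()), ("b", ()), ("*", (0, 1)), ("c", ()), ("d", ()),
        ("sin", (3,)), ("-", (2, 4)), ("+", (6, 5))]
values = evaluate(FIG1, {"a": 2.0, "b": 3.0, "c": 0.0, "d": 1.0})
print(values[7])  # E8 = (a * b) - d + sin c = 5.0
```

Because every pointer refers to an earlier gene, one pass from the first gene to the last is enough, and each gene's value is available before any later gene needs it. The chromosome fitness would then be taken from the best of the values E1 to E8, as Section 2.1 describes.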

3.3 Procedure of the General Learning Algorithm

The general learning procedure for constructing the MEP-NN model can be described as follows:

1) The initial population is generated randomly. All the learning parameters of the MENN model should be assigned in advance.
2) The fitness value is calculated for each individual by using the PSO algorithm to optimize the parameters encoded in the chromosome.
3) The genetic operators, crossover and mutation, are applied.
4) If the maximum number of generations is reached or no better solution is found for a significantly long time (300 steps), then stop; otherwise go to step 2).

4 Experiment Setup and Result

To test the efficacy of the proposed method, the MEP-NN model is applied to a stock index prediction problem. We have used stock prices in the IT sector: the daily stock prices of International Business Machines Corporation (IBM) and Dell Inc. [12], collected from www.finance.yahoo.com. For both IBM and Dell Inc., the training data are from February 10, 2003 to September 10, 2004 and the test data are from September 13, 2004 to January 21, 2005. The two stock data sets are represented by ’opening value’, ’low value’, ’high value’ and ’closing value’. In addition, experiments on the S&P CNX NIFTY stock index [13] are carried out to evaluate the performance of the proposed method. S&P CNX NIFTY is a well-diversified 50-stock index accounting for 25 sectors of the economy. It is used for a variety of purposes such as benchmarking fund portfolios, index-based derivatives and index funds. The CNX indices are computed using a market-capitalization-weighted method, wherein the level of the index reflects the total market value of all the stocks in the index relative to a particular base period.

Fig. 5. Test results of IBM


Fig. 6. Test results of DELL

Fig. 7. Test results of NIFTY

The performance of the method is measured in terms of RMSE. Parameters used by MENN in these experiments are presented in Table 1. For comparison purposes, the forecast performances of a traditional artificial neural network (ANN) model and a support vector machine (SVM) model [14] are also shown in Table 2. The actual stock prices and the predicted ones for the three stock indices are shown in Fig. 5, Fig. 6 and Fig. 7. From Table 2, it is observed that the proposed MENN models are better than the traditional neural network.

Table 1. Empirical comparison of RMSE results for three methods

Model Name                 IBM Corp.   Dell Inc.   NIFTY
SVM model [14]             0.02849     0.03665     0.03220
ANN model [14]             0.03520     0.05182     0.01857
MENN model (This paper)    0.02887     0.02786     0.01587
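For reference, the RMSE used as the performance measure is the square root of the mean squared difference between actual and predicted values. A minimal sketch follows; the function name `rmse` and the sample numbers are illustrative and are not taken from the experiments.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical normalized prices, only to show the call
print(rmse([0.50, 0.52, 0.55], [0.49, 0.53, 0.56]))
```

Lower values indicate forecasts closer to the actual series.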

5 Conclusion

In this paper, a new approach for designing artificial neural networks using multi expression programming is proposed. From the viewpoint of calculation structure, the MENN model can be viewed as a flexible multi-layer feedforward neural network with over-layer connections and free activation function parameters. The work demonstrates that it is possible to evolve the structure and parameters of artificial neural networks simultaneously by using multi expression programming. Simulation results for the stock market forecasting problems show the feasibility and effectiveness of the proposed method.

Acknowledgments. This work is partially supported by the National Science Foundation of China under grant No. 60573065, the Key Subject Research Foundation of Shandong Province and the Natural Science Foundation of Shandong Province (grant Y2007G33).

References

1. Robert, R., Jae, T., Lee, L.: Artificial Intelligence in Finance and Investing, ch. 10. IRWIN (1996)
2. Wu, Q., Chen, Y.H., Wu, P.: Higher Order Neural Networks for Stock Index Modeling. In: Zhang, M. (ed.) Artificial Higher Order Neural Networks for Economics and Business (in press, 2008)
3. White, H.: Economic Prediction Using Neural Networks: The Case of IBM Daily Stock Returns. In: Proc. of IEEE Int'l Conference on Neural Networks (1988)
4. Hecht-Nielsen, R.: Kolmogorov's mapping neural network existence theorem. In: Proc. 1st IEEE Int'l Joint Conf. Neural Network (1987)
5. Adil, B., Lale, O.: MEPAR-miner: Multi-expression programming for classification rule mining. European Journal of Operational Research 183, 767–784 (2007)
6. Oltean, M., Dumitrescu, D.: Multi Expression Programming. Technical Report, UBB-01-2002, Babes-Bolyai University, Cluj-Napoca, Romania (2002), www.mep.cs.ubbcluj.ro
7. Crina, G., Ajith, A., Sang, Y.H.: MEPIDS: Multi-Expression Programming for Intrusion Detection System. In: Mira, J., Álvarez, J.R. (eds.) IWINAC 2005. LNCS, vol. 3562, pp. 163–172. Springer, Heidelberg (2005)
8. Oltean, M., Grosan, C.: Evolving Digital Circuits Using Multi Expression Programming. In: Zebulum, R., et al. (eds.) NASA/DoD Conference on Evolvable Hardware, June 24-26, pp. 87–90. IEEE Press, NJ (2004)
9. Kennedy, J.: Particle swarm optimization. In: Proc. IEEE Int. Conf. on Neural Networks, IV, pp. 1942–1948 (1995)
10. Yoshida, H., Kawata, K., Fukuyama, Y., Takayama, S., Nakanishi, Y.: A Particle Swarm Optimization for Reactive Power and Voltage Control Considering Voltage Security Assessment. IEEE Trans. Power Syst. 15, 1232–1239 (2000)
11. Zhang, X.Q., Chen, Y.H., Yang, J.Y.: Stock Index Forecasting Using PSO Based Selective Neural Network Ensemble. In: International Conference on Artificial Intelligence (ICAI 2007), vol. 1, pp. 260–264 (2007)
12. Hassan, M.R.U., Nath, B., Kirley, M.: A Fusion Model of HMM, ANN and GA for Stock Market Forecasting. Expert Systems with Applications 33, 171–180 (2007)
13. National Stock Exchange of India Limited, http://www.nse-india.com
14. Wu, Q., Chen, Y.H., Liu, Z.: Ensemble Model of Intelligent Paradigms for Stock Market Forecasting. In: First International Workshop on Knowledge Discovery and Data Mining (WKDD 2008), pp. 205–208 (2008)

New Chaos Produced from Synchronization of Chaotic Neural Networks⋆ Zunshui Cheng School of Mathematics and Physics, Qingdao University of Science and Technology Qingdao 266061, China [email protected]

Abstract. In this paper, we investigate the synchronization dynamics of neural networks. Generalized linear synchronization (GLS) is proposed to capture a general kind of proportional relationship between two-neuron networks. Under synchronization, we find that the node has complex dynamics with some interesting characteristics, and some new chaotic phenomena can be found. Numerical simulations show that this method works very well for two-neuron networks with identical Lorenz systems. Our method can also be applied to other systems. Keywords: Neural Networks, Chaos, Synchronization, Control system, Numerical simulation.

1 Introduction

Recently, the dynamical properties of neural networks have been extensively investigated, and many applications have been found in different areas. Most previous literature has mainly been devoted to stability analysis. However, it has been shown that such networks can exhibit complicated dynamics and even chaotic behavior if the network parameters are appropriately chosen. Motivated by the study of chaotic phenomena, increasing interest has been devoted to the study of chaos synchronization since the pioneering work of Pecora and Carroll [1]. Synchronization of neural networks has many applications, for example in secure communication. Therefore, the study of synchronization of neural networks is an important step for both understanding brain science and designing neural networks for practical use [2]-[7]. There are different types of synchronization in interacting nodes of chaotic neural networks, such as complete synchronization (CS), generalized synchronization (GS), phase synchronization, lag synchronization and anticipating synchronization, etc. [8][12]. Projective synchronization and generalized projective synchronization, which are special cases of generalized synchronization, are becoming some of the most noticeable subjects. Their typical feature is that the state variables of the two coupled systems may

This work was jointly supported by the Doctoral Found of QUST, and the Natural Science Foundation of Henan Province, China under Grant 0611055100.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 40–46, 2008. © Springer-Verlag Berlin Heidelberg 2008


synchronize up to a scaling factor, but the Lyapunov exponents and fractal dimensions remain unchanged [13]-[22]. Recently, generalized projective synchronization (GPS) has attracted increasing interest from researchers. Early projective synchronization was usually investigated only in a class of partially linear systems [17]-[19]; however, generalized projective synchronization is studied in a general class of neural systems including non-partially-linear systems [13]-[15], [20]-[22]. In [16], modified projective synchronization (MPS) was proposed to acquire a general kind of proportional relationship between the drive and response systems. But in practical applications, rotation is also common and interesting. To the best of our knowledge, rotary and projective synchronization still remains an open problem, so generalized linear synchronization (GLS), of which the projective schemes above are special cases, will be proposed and considered in this paper. Motivated by the above discussion, and using active control techniques, we investigate generalized linear synchronization in this paper. The remainder of this paper is organized as follows: In Section 2, the definition and theoretical analysis of generalized linear synchronization of chaotic neural systems are provided. In Section 3, chaos produced from linear synchronization of two chaotic neural systems is analyzed. Finally, the paper closes with a conclusion and some discussion.

2 Generalized Linear Synchronization of Two-Neuron Systems

In this section, following the idea of generalized projective synchronization, we study generalized linear synchronization (GLS). Consider the following two-neuron chaotic system:

\[
\begin{cases}
\dot{x}_m = f(x_m), \\
\dot{x}_s = g(x_m, x_s),
\end{cases}
\tag{1}
\]

where $x_m, x_s \in \mathbb{R}^n$ are $n$-dimensional state vectors. The subscripts $m$ and $s$ stand for the master and slave systems, respectively, and $f : \mathbb{R}^n \to \mathbb{R}^n$ and $g : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$ are vector fields in $n$-dimensional space. If there exists a matrix $A \in \mathbb{R}^{n \times n}$ such that $\lim_{t \to \infty} \| A x_m - x_s \| = 0$, then generalized linear synchronization (GLS) of system (1) is achieved, and $A$ is called a transform factor. We take the Lorenz system as the master system:

\[
\begin{cases}
\dot{x}_m = a(y_m - x_m), \\
\dot{y}_m = c x_m - x_m z_m - y_m, \\
\dot{z}_m = x_m y_m - b z_m,
\end{cases}
\tag{2}
\]

where $a = 10$, $b = 8/3$, $c = 28$, for which the system has a chaotic attractor. In order to realize GLS, the following slave system is constructed:

\[
\begin{cases}
\dot{x}_s = a(y_s - x_s) + u_1, \\
\dot{y}_s = c x_s - x_s z_s - y_s + u_2, \\
\dot{z}_s = x_s y_s - b z_s + u_3,
\end{cases}
\tag{3}
\]


where $u_1$, $u_2$ and $u_3$ are the control inputs. To determine the appropriate control inputs $u_i$ ($i = 1, 2, 3$), assume the transform factor

\[
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}.
\]

The error vector is defined as

\[
\begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix}
=
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
\begin{pmatrix} x_m \\ y_m \\ z_m \end{pmatrix}
-
\begin{pmatrix} x_s \\ y_s \\ z_s \end{pmatrix}.
\tag{4}
\]

Then the error dynamical system is obtained:

\[
\begin{cases}
\dot{e}_1 = a_{11} a (y_m - x_m) + a_{12}(c x_m - x_m z_m - y_m) + a_{13}(x_m y_m - b z_m) - a(y_s - x_s) - u_1, \\
\dot{e}_2 = a_{21} a (y_m - x_m) + a_{22}(c x_m - x_m z_m - y_m) + a_{23}(x_m y_m - b z_m) - (c x_s - x_s z_s - y_s) - u_2, \\
\dot{e}_3 = a_{31} a (y_m - x_m) + a_{32}(c x_m - x_m z_m - y_m) + a_{33}(x_m y_m - b z_m) - (x_s y_s - b z_s) - u_3.
\end{cases}
\tag{5}
\]

Employing the original methods of active control, the control inputs $u_i$ ($i = 1, 2, 3$) are chosen as follows:

\[
\begin{cases}
u_1 = -a y_s + c a_{12} x_m + (a a_{11} + a a_{12} - a_{12}) y_m + (a a_{13} - b a_{13}) z_m - a_{12} x_m z_m + a_{13} x_m y_m, \\
u_2 = -c x_s + (c a_{22} - a a_{21} + a_{21}) x_m - a_{22} x_m z_m + (a_{23} - b a_{23}) z_m + a a_{21} y_m + a_{23} x_m y_m + x_s z_s, \\
u_3 = (c a_{32} - a a_{31} + b a_{31}) x_m + (a a_{31} + b a_{32} - a_{32}) y_m - a_{32} x_m z_m + a_{33} x_m y_m - x_s y_s.
\end{cases}
\tag{6}
\]

Based on the above choice of inputs, error system (5) becomes

\[
\begin{cases}
\dot{e}_1 = -a e_1, \\
\dot{e}_2 = -e_2, \\
\dot{e}_3 = -b e_3.
\end{cases}
\tag{7}
\]
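The step from (5) to (7) can be checked directly for the first component; the other two components follow in the same way. The lines below are a worked verification added for illustration and use only the first rows of (5) and (6). After $u_1$ is substituted, the terms in $y_s$, $c a_{12} x_m$, $a_{12} x_m z_m$, $a_{13} x_m y_m$, $b a_{13} z_m$, $a_{12} y_m$ and $a a_{11} y_m$ cancel:

\[
\begin{aligned}
\dot{e}_1 &= a_{11} a (y_m - x_m) + a_{12}(c x_m - x_m z_m - y_m) + a_{13}(x_m y_m - b z_m) - a(y_s - x_s) - u_1 \\
&= -a a_{11} x_m - a a_{12} y_m - a a_{13} z_m + a x_s \\
&= -a \,(a_{11} x_m + a_{12} y_m + a_{13} z_m - x_s) = -a e_1 .
\end{aligned}
\]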

One can see that all eigenvalues of the closed-loop system (7), namely $-a$, $-1$ and $-b$, have negative real parts, so the error system (7) converges to zero. In other words, the chosen control inputs result in a stable error system, and generalized linear synchronization of two identical Lorenz systems is realized. Remark. In fact, our method can also be applied to the generalized linear synchronization of other neural networks with identical chaotic systems at each node, such as the Chen system, the Lü system, etc.
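The convergence claimed above can be illustrated numerically. The sketch below is not the author's simulation code: it writes the control law compactly as u = A f(x_m) - f(x_s) + D e with D = diag(a, 1, b), which expands to the inputs in (6), and integrates the master and slave systems with a classical fourth-order Runge-Kutta scheme. The rotation matrix `A_rot`, the initial states, the step size and the helper names are all assumptions chosen for the example.

```python
import numpy as np

a, b, c = 10.0, 8.0 / 3.0, 28.0          # Lorenz parameters from the text
D = np.array([a, 1.0, b])                # decay rates of error system (7)

def lorenz(v):
    """Right-hand side of the Lorenz system (2)."""
    x, y, z = v
    return np.array([a * (y - x), c * x - x * z - y, x * y - b * z])

def control(A, m, s):
    """Active control in compact form: u = A f(m) - f(s) + D e, e = A m - s."""
    e = A @ m - s
    return A @ lorenz(m) - lorenz(s) + D * e

def rhs(state, A):
    """Master system (2) stacked with the controlled slave system (3)."""
    m, s = state[:3], state[3:]
    return np.concatenate([lorenz(m), lorenz(s) + control(A, m, s)])

def simulate(A, m0, s0, h=2e-3, steps=5000):
    """Integrate with classical RK4; return the error norm at start and end."""
    state = np.concatenate([m0, s0]).astype(float)
    e_start = np.linalg.norm(A @ state[:3] - state[3:])
    for _ in range(steps):
        k1 = rhs(state, A)
        k2 = rhs(state + 0.5 * h * k1, A)
        k3 = rhs(state + 0.5 * h * k2, A)
        k4 = rhs(state + h * k3, A)
        state = state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    e_end = np.linalg.norm(A @ state[:3] - state[3:])
    return e_start, e_end

# Hypothetical transform factor: a 45-degree rotation about the z axis.
t = np.pi / 4
A_rot = np.array([[np.cos(t), -np.sin(t), 0.0],
                  [np.sin(t),  np.cos(t), 0.0],
                  [0.0,        0.0,       1.0]])
e_start, e_end = simulate(A_rot, np.array([1.0, 1.0, 1.0]),
                          np.array([-3.0, 2.0, 5.0]))
print(e_start, e_end)
```

Because the error obeys the linear system (7) regardless of the chaotic master trajectory, the printed final error norm should be several orders of magnitude smaller than the initial one.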


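3 Chaos Produced from Linear Synchronization of Two Chaotic Systems

If the error vector is chosen as

\[
\begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix}
=
\begin{pmatrix} -1 & 1 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x_m \\ y_m \\ z_m \end{pmatrix}
-
\begin{pmatrix} x_s \\ y_s \\ z_s \end{pmatrix},
\tag{8}
\]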

Fig. 1. Projection of the response systems onto the x−y plane

Fig. 2. Projection of the response systems onto the x−z plane

Fig. 3. Projection of the response systems onto the y−z plane

Fig. 4. Phase plot of the response system

then we obtain the following error dynamical system:

\[
\begin{cases}
\dot{e}_1 = -a(y_m - x_m) + (c x_m - x_m z_m - y_m) - a(y_s - x_s) - u_1, \\
\dot{e}_2 = -a(y_m - x_m) + (c x_m - x_m z_m - y_m) - (c x_s - x_s z_s - y_s) - u_2, \\
\dot{e}_3 = (x_m y_m - b z_m) - (x_s y_s - b z_s) - u_3,
\end{cases}
\tag{9}
\]


Fig. 5. Phase plot of the response system

the control inputs $u_i$ ($i = 1, 2, 3$) are taken in the following form:

\[
\begin{cases}
u_1 = -a y_s + c x_m - y_m - x_m z_m, \\
u_2 = -c x_s + (c + a - 1) x_m - x_m z_m - a y_m + x_s z_s, \\
u_3 = x_m y_m - x_s y_s.
\end{cases}
\tag{10}
\]

By the above discussion, the synchronization of the two identical Lorenz systems can be realized. Under synchronization, the response system has complex dynamics, and some new chaotic phenomena can be found (see Fig. 1 to Fig. 5).
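The inputs in (10) should be the special case of the general inputs (6) for the transform factor in (8). The following sketch checks this numerically at random states; it is an illustrative consistency check, and the function names `u_general` and `u_special` and the matrix name `A8` are hypothetical.

```python
import numpy as np

a, b, c = 10.0, 8.0 / 3.0, 28.0   # Lorenz parameters from the text

def u_general(A, m, s):
    """Control inputs of Eq. (6) for an arbitrary transform factor A."""
    (a11, a12, a13), (a21, a22, a23), (a31, a32, a33) = A
    xm, ym, zm = m
    xs, ys, zs = s
    u1 = (-a * ys + c * a12 * xm + (a * a11 + a * a12 - a12) * ym
          + (a * a13 - b * a13) * zm - a12 * xm * zm + a13 * xm * ym)
    u2 = (-c * xs + (c * a22 - a * a21 + a21) * xm - a22 * xm * zm
          + (a23 - b * a23) * zm + a * a21 * ym + a23 * xm * ym + xs * zs)
    u3 = ((c * a32 - a * a31 + b * a31) * xm
          + (a * a31 + b * a32 - a32) * ym
          - a32 * xm * zm + a33 * xm * ym - xs * ys)
    return np.array([u1, u2, u3])

def u_special(m, s):
    """Control inputs of Eq. (10)."""
    xm, ym, zm = m
    xs, ys, zs = s
    u1 = -a * ys + c * xm - ym - xm * zm
    u2 = -c * xs + (c + a - 1.0) * xm - xm * zm - a * ym + xs * zs
    u3 = xm * ym - xs * ys
    return np.array([u1, u2, u3])

# Transform factor read off from Eq. (8).
A8 = np.array([[-1.0, 1.0, 0.0],
               [-1.0, 1.0, 0.0],
               [0.0,  0.0, 1.0]])

rng = np.random.default_rng(0)
max_gap = 0.0
for _ in range(100):
    m, s = rng.normal(size=3), rng.normal(size=3)
    gap = np.max(np.abs(u_general(A8, m, s) - u_special(m, s)))
    max_gap = max(max_gap, gap)
print(max_gap)
```

Since the first two rows of the matrix in (8) coincide, e_1 → 0 and e_2 → 0 imply that x_s and y_s both approach y_m − x_m, while z_s approaches z_m.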

4 Conclusions

In this paper, the definition of generalized linear synchronization (GLS) is proposed for a general kind of proportional relationship between the drive and response systems. Under generalized linear synchronization (GLS), we find that the response system has complex dynamics with some interesting characteristics, and some new chaotic phenomena can be found. It should be noted that our method can be applied to other chaotic neural systems, such as the Chen system and the Lü system, and more interesting chaotic characteristics may be found. The two-neuron chaotic system can use the same system at both nodes, such as the Lorenz system. We can also choose a different system on each neuron. These topics are beyond the scope of the present paper and can be investigated elsewhere in the near future.


Acknowledgments The authors would like to thank these referees for their valuable suggestions and comments.

References

1. Pecora, L.M., Carroll, T.L.: Synchronization in chaotic systems. Phys. Rev. Lett. 64, 821–824 (1990)
2. Lu, W.L., Chen, T.P.: Synchronization of coupled connected neural networks with delays. IEEE Trans. Circuits and System 51, 2491–2503 (2004)
3. Lu, J., Cao, J.: Synchronization-based approach for parameters identification in delayed chaotic neural networks. Physica A 382, 672–682 (2007)
4. Yu, W., Cao, J., Lv, J.: Global synchronization of linearly hybrid coupled networks with time-varying delay. SIAM Journal on Applied Dynamical Systems 7, 108–133 (2008)
5. Cao, J., Wang, Z., Sun, Y.: Synchronization in an array of linearly stochastically coupled networks with time delays. Physica A 385, 718–728 (2007)
6. Sun, Y., Cao, J.: Adaptive synchronization between two different noise-perturbed chaotic systems with fully unknown parameters. Physica A 376, 253–265 (2007)
7. Sun, Y., Cao, J.: Adaptive lag synchronization of unknown chaotic delayed neural networks with noise perturbation. Physics Letters A 364, 277–285 (2007)
8. Yu, W., Cao, J.: Adaptive Q-S (lag, anticipated, and complete) time-varying synchronization and parameters identification of uncertain delayed neural networks. Chaos 16, 023119 (2006)
9. Cao, J., Lu, J.: Adaptive synchronization of neural networks with or without time-varying delays. Chaos 16, 013133 (2006)
10. Cao, J., Lu, J.: Adaptive complete synchronization of two identical or different chaotic (hyperchaotic) systems with fully unknown parameters. Chaos 15, 043901 (2005)
11. Amritkar, R.E.: Spatially synchronous extinction of species under external forcing. Phys. Rev. Lett. 96, 258102 (2006)
12. Shahverdiev, E.M., Sivaprakasam, S., Shore, K.A.: Lag synchronization in time-delayed systems. Physics Letters A 292, 320–324 (2002)
13. Li, C., Yan, J.: Generalized projective synchronization of chaos: The cascade synchronization approach. Chaos, Solitons and Fractals 30, 140–146 (2006)
14. Li, G.: Generalized projective synchronization of two chaotic systems by using active control. Chaos, Solitons and Fractals 30, 77–82 (2006)
15. Kittel, A., Parisi, J., Pyragas, K.: Generalized synchronization of chaos in electronic circuit experiments. Physica D 112, 459–471 (1998)
16. Li, G.: Modified projective synchronization of chaotic system. Chaos, Solitons and Fractals 32, 1786–1790 (2007)
17. Ronnie, M., Jan, R.: Projective synchronization in three-dimensional chaotic systems. Phys. Rev. Lett. 82, 3042–3045 (1999)
18. Xu, D., Li, Z.: Controlled projective synchronization in nonpartially-linear chaotic systems. Int. J. Bifurcat Chaos 12, 1395–1402 (2002)
19. Xu, D., Chee, C., Li, C.: A necessary condition of projective synchronization in discrete-time systems of arbitrary dimensions. Chaos, Solitons and Fractals 22, 175–180 (2004)
20. Rulkov, N.F., Sushchik, M.M., Tsimring, L.S., et al.: Generalized synchronization of chaos in directionally coupled chaotic systems. Phys. Rev. E 51, 980–994 (1995)
21. Kittel, A., Parisi, J., Pyragas, K.: Generalized synchronization of chaos in electronic circuit experiments. Physica D 112, 459–471 (1998)
22. Kocarev, L., Parlitz, U.: Generalized synchronization, predictability, and equivalence of unidirectionally coupled dynamical systems. Phys. Rev. Lett. 76, 1816–1819 (1996)

A Two Stage Energy Model Exhibiting Selectivity to Changing Disparity

Xiaojiang Guo1 and Bertram E. Shi2,*

1 Department of Electronics Engineering, Tsinghua University, Beijing, China [email protected]
2 Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong [email protected]

Abstract. We show that by cascading the disparity energy model and the motion energy model, we obtain neurons that are selective for changing disparity, which is a cue that biological systems may use in the perception of stereomotion. We demonstrate that the outputs of this model exhibit joint tuning to disparity and stereo-motion. The output achieves a peak response for an input with a preferred disparity that also changes at a preferred rate. The joint tuning curve in the disparity–change of disparity space is approximately separable. We further demonstrate that incorporating a normalization step between the two stages reduces the variability of the model output. Keywords: Motion Energy, Disparity Energy, Stereo-motion, Changing Disparity, Visual Cortex.

1 Introduction

Stereo-motion refers to motion towards or away from a binocular observer. There are at least two cues that could be exploited by an observer to detect or estimate this motion: changing disparity (CD) or inter-ocular velocity difference (IOVD) [3]. The CD cue is derived by first combining monocular images to obtain a disparity signal at each time, and then examining the change in disparity over time. The IOVD cue is derived by first examining the change in each monocular image over time to obtain velocity signals, which are then combined across the two eyes. Psychophysical evidence suggests that both signals play a role in the perception of stereo-motion [2]. Here, we present a biologically plausible two-stage model for creating neurons selective for changing disparity. The first stage extracts disparity signals using a population of disparity energy neurons tuned to different disparities via phase shifts [4]. The second stage then establishes selectivity to disparity changes over time by a temporal filtering operation similar to that used in motion energy models [1].

This work was supported in part by the Hong Kong Research Grants Council under Grant HKUST 619205.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 47–54, 2008. © Springer-Verlag Berlin Heidelberg 2008


Both motion energy and disparity energy have been used to model the outputs of complex cells in the primary visual cortex. However, to our knowledge, they have not been integrated previously to model neurons that are selective to changing disparity. Previous models combining motion and disparity energy models were tuned to fronto-parallel motion, since the preferred velocities for the left and right eyes were assumed to be identical[5][6]. However, for stereo-motion stimuli, the velocities are non-identical. In particular, for motion along the midline between the two eyes, the left and right image velocities are of opposite sign. The paper is organized as follows. Section 2 describes the two-stage stereomotion model. Section 3 explores several characteristics of the two stage model.

2 Two Stage Stereomotion Energy Model

The model can be decomposed into a cascade of two stages: a disparity selective stage based on the disparity energy model, followed by a temporal filtering stage based upon the motion energy model. In this section, we review both the motion energy and disparity energy models, and show how they may be combined to achieve selectivity to changing disparity.

2.1 Disparity Energy Model

The disparity energy model is depicted in Fig. 1(a). For simplicity, we assume one-dimensional images that lie along corresponding epipolar lines. Left and right images are first convolved with complex-valued spatial Gabor filters that model pairs of spatial receptive fields in phase quadrature. The disparity energy is the squared magnitude of the sum. Mathematically, we denote the left and right input signals by $I_l(x)$ and $I_r(x)$, where $x$ indexes the distance from the receptive field center. We denote the outputs of the spatial Gabor filters by

\[
U_l(\psi_l) = \int_{-\infty}^{+\infty} g(x)\, e^{j(\Omega_x x + \psi_l)} I_l(x)\, dx, \qquad
U_r(\psi_r) = \int_{-\infty}^{+\infty} g(x)\, e^{j(\Omega_x x + \psi_r)} I_r(x)\, dx,
\tag{1}
\]

where $g(x)$ is a Gaussian envelope with standard deviation σ2, $\Omega_x$ is the spatial frequency of the Gabor function, and $\psi_l$ and $\psi_r$ are phase shifts applied to the left and right Gabor filters. The disparity energy is the squared magnitude of the sum of the outputs of the left and right Gabor filters:

\[
E_d(\Delta\psi) = \left| U_l(\psi_l) + U_r(\psi_r) \right|^2
= |U_l(0)|^2 + |U_r(0)|^2 + 2\,\mathrm{Re}\!\left( U_l(0)\, U_r(0)^{*} e^{j\Delta\psi} \right).
\tag{2}
\]

The disparity energy depends on the input images and the relative phase difference between the left and right Gabor filters, $\Delta\psi = \psi_l - \psi_r$, but not on the absolute phases $\psi_l$ and $\psi_r$.
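The expansion in (2), and the fact that only the phase difference matters, can be checked numerically. The sketch below is illustrative only: the zero-phase filter responses `ul0` and `ur0` are arbitrary complex numbers standing in for real Gabor outputs, and the function names are hypothetical. It uses the relation U(ψ) = e^{jψ} U(0), which follows from the form of (1).

```python
import cmath
import math
import random

def disparity_energy(ul0, ur0, psi_l, psi_r):
    """|U_l(psi_l) + U_r(psi_r)|^2, using U(psi) = exp(j psi) U(0)."""
    total = cmath.exp(1j * psi_l) * ul0 + cmath.exp(1j * psi_r) * ur0
    return abs(total) ** 2

def disparity_energy_expanded(ul0, ur0, dpsi):
    """Right-hand side of Eq. (2); depends only on dpsi = psi_l - psi_r."""
    cross = ul0 * ur0.conjugate() * cmath.exp(1j * dpsi)
    return abs(ul0) ** 2 + abs(ur0) ** 2 + 2.0 * cross.real

# Arbitrary stand-ins for the zero-phase filter outputs U_l(0), U_r(0).
random.seed(0)
ul0 = complex(random.gauss(0, 1), random.gauss(0, 1))
ur0 = complex(random.gauss(0, 1), random.gauss(0, 1))

dpsi = math.pi / 3
for psi_r in (0.0, 0.7, 2.1):
    # Same phase difference, different absolute phases.
    psi_l = psi_r + dpsi
    direct = disparity_energy(ul0, ur0, psi_l, psi_r)
    expanded = disparity_energy_expanded(ul0, ur0, dpsi)
    print(round(direct, 6), round(expanded, 6))
```

Each printed pair should agree, and all three rows should show the same value, since the absolute phases differ but the phase difference does not.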

Fig. 1. In the standard disparity energy model (a), left and right images are first filtered by complex valued Gabor filters with different phase shifts, and then summed and squared. In the stereo-motion energy model (b), the outputs of several disparity energy neurons with different preferred disparities due to relative phase shifts Δψ between the left and right Gabor filters are combined to obtain a complex valued output whose phase varies with disparity. This output is then normalized and passed through a temporal Gabor filter to obtain an output that responds to changing disparity.

The preferred disparity of a disparity energy neuron depends upon the phase difference $\Delta\psi$. Suppose that every pixel $x$ in the right image corresponds to pixel $x + d$ in the left image, i.e. $I_r(x) = I_l(x + d)$. For small $d$, the output of the right Gabor filter can be approximated by

\[
\begin{aligned}
U_r(\psi_r) &= \int_{-\infty}^{+\infty} g(x)\, e^{j(\Omega_x x + \psi_r)} I_l(x + d)\, dx
= \int_{-\infty}^{+\infty} g(x - d)\, e^{j(\Omega_x (x - d) + \psi_r)} I_l(x)\, dx \\
&\approx e^{j(\Omega_x d - \Delta\psi)} \int_{-\infty}^{+\infty} g(x)\, e^{j(\Omega_x x + \psi_l)} I_l(x)\, dx
= U_l(\psi_l)\, e^{j(\Omega_x d - \Delta\psi)}.
\end{aligned}
\tag{3}
\]

Thus, a position shift at the input results in a phase change at the output. Substituting this expression into (2), we obtain

\[
E_d(\Delta\psi) \approx |U_l(0)|^2 + |U_l(0)|^2 + 2\, |U_l(0)|\, |U_l(0)| \cos(\Omega_x d - \Delta\psi).
\tag{4}
\]

Thus, we can see that the energy output achieves its maximum when the input disparity is approximately equal to $d_{\mathrm{pref}} = \Delta\psi / \Omega_x$.

2.2 Motion Energy Model

The motion energy model has been used to model the responses of direction selective neurons in the primary visual cortex [1]. As in the disparity energy model, the input


image is first convolved with a complex-valued spatial Gabor function. As demonstrated by (3), a small shift in the image can be approximated by a shift in the phase of the filter output. Repeating this phase shift over time results in an oscillation in the output with frequency $\omega_t = -v \Omega_x$, where $v$ is the image velocity. By passing the output of the spatial Gabor filter through a temporal Gabor filter tuned to temporal frequency $\Omega_t$, we obtain a filter that responds maximally when the input image has significant energy at spatial frequencies near $\Omega_x$ moving at velocity $v = -\Omega_t / \Omega_x$.

2.3 Stereomotion Model

The stereomotion model exploits the fact that by combining the disparity energies at different phase shifts appropriately, we can obtain a signal that oscillates as the image disparity changes. Thus, as in the motion energy model, this oscillation can be detected by cascading this output with a temporal Gabor filter. It can be shown that the response of the disparity energy model can be written as [5][7]

\[
E_d(\Delta\psi) = S + P \cos(\Psi_d - \Delta\psi),
\tag{5}
\]

where $S = |U_l(0)|^2 + |U_r(0)|^2$, $P = 2\, |U_l(0)|\, |U_r(0)|$ and $\Psi_d = \arg\!\left( U_l(0)\, U_r(0)^{*} \right)$. If we define

\[
\begin{aligned}
E_d(0) &= S + P \cos(\Psi_d), & E_d(\pi) &= S - P \cos(\Psi_d), \\
E_d(\pi/2) &= S + P \sin(\Psi_d), & E_d(-\pi/2) &= S - P \sin(\Psi_d),
\end{aligned}
\tag{6}
\]

then we can express the output of the combination unit in Fig. 1(b) as

\[
\text{output} = \frac{E_d(0) - E_d(\pi)}{4} + j\, \frac{E_d(\pi/2) - E_d(-\pi/2)}{4}
= \frac{P e^{j\Psi_d}}{2} = U_l(0)\, U_r(0)^{*}.
\tag{7}
\]

By passing this output through a temporal filter, we can obtain an energy neuron tuned to changing velocity. Substituting the approximation in (3), we obtain

\[
U_l(0)\, U_r(0)^{*} \approx U_l(\psi_l) \left( U_l(\psi_l)\, e^{j\Omega_x d} \right)^{*} = |U_l(\psi_l)|^2\, e^{-j\Omega_x d}.
\tag{8}
\]

Thus, assuming that $|U_l(\psi_l)|^2$ changes slowly, the output of the combination unit oscillates as the disparity changes, much in the same way that the output of a spatial Gabor filter oscillates as the input translates in the motion energy model. When examining stereomotion, the model should be insensitive to the scaling of the input intensity. Therefore, in the normalization unit, we normalize the output of the combination unit by the sum of the squared magnitudes of the two monocular spatial Gabor outputs. In Section 3, we will show that the normalization unit helps to improve the stability and reliability of the response.
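The combination unit described by (5) to (7) can be sketched in a few lines. This is an illustrative check rather than the authors' implementation: the disparity energies are generated from the closed form (5), the filter responses `ul0` and `ur0` are hypothetical complex numbers, and the function names are assumptions.

```python
import cmath
import math

def energy(S, P, psi_d, dpsi):
    """Disparity energy in the form of Eq. (5)."""
    return S + P * math.cos(psi_d - dpsi)

def combination_unit(ul0, ur0):
    """Combine four phase-shifted disparity energies as in Eq. (7)."""
    S = abs(ul0) ** 2 + abs(ur0) ** 2
    P = 2.0 * abs(ul0) * abs(ur0)
    psi_d = cmath.phase(ul0 * ur0.conjugate())
    e_0 = energy(S, P, psi_d, 0.0)
    e_pi = energy(S, P, psi_d, math.pi)
    e_plus = energy(S, P, psi_d, math.pi / 2)
    e_minus = energy(S, P, psi_d, -math.pi / 2)
    # Real part from the 0 / pi pair, imaginary part from the +-pi/2 pair.
    return (e_0 - e_pi) / 4.0 + 1j * (e_plus - e_minus) / 4.0

# Hypothetical zero-phase Gabor responses U_l(0) and U_r(0).
ul0, ur0 = complex(0.8, -0.3), complex(0.2, 0.5)
print(combination_unit(ul0, ur0))
print(ul0 * ur0.conjugate())
```

The two printed values should coincide up to rounding, which is the identity stated at the end of (7): the four real-valued energies together carry the complex product of the two monocular responses.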

Fig. 2. The average energy response of the model, which is tuned to the preferred velocity difference of 1 pixel/frame. The spatial frequency and temporal frequency are both 2π/20. Axes: left retina velocity (horizontal), right retina velocity (vertical).

3 Characteristics of the Model

In this section, we explore the characteristics of the model. For the results in Section 3.1, we omit the normalization operation in Fig. 1(b), because the normalization process does not essentially affect the basic characteristics.

3.1 Velocity Difference Tuning and Common Velocity Invariance

Motion towards or away from the observer results in a changing disparity, which can also be expressed in terms of the difference, $v_d$, between the image velocities in the left and right images. For fronto-parallel motion, the velocities in the left and right images will be the same. We denote the common or average velocity between the left and right images by $v_c$. Here we show that the combined stereo-motion energy neurons are tuned to velocity differences, $v_d$, but are invariant to changes in the common velocity, $v_c$, between the two eyes. We simulated the model using 100 inputs consisting of translating random dots. The model is tuned to a velocity difference of 1 pixel/frame. Fig. 2 depicts the average response of the energy output, from which we can see two salient properties. First, the bottom-left to upper-right diagonal cross section shows the $v_d$ tuning, and has the greatest energy output along the diagonal line $v_d = 1$. Second, the bottom-right to upper-left diagonal cross section shows the invariance to $v_c$, since the energy response along the common velocity line remains roughly unchanged.

3.2 Joint Disparity and Velocity Difference Tuning

Fig. 3 plots the average energy outputs of the models versus $d$ and $v_d$ for translating random dots, with (a) for the model without the normalization unit and (b) for the model with the normalization unit. Both models are tuned to a preferred velocity difference of 2 pixels/frame and a preferred disparity of 0. From the figure we can see that the contours of both plots are ovals with major and minor axes approximately parallel to the coordinate axes. The peak response occurs at the preferred disparity and preferred velocity difference. Taking any horizontal (or vertical) cross section of the plot results in a velocity difference tuning (or disparity tuning) curve whose peak lies at the preferred velocity difference (or disparity).

Fig. 3. Energy responses of model without normalization (a) and with normalization (b). Both are simulated over 200 translating random dots. The disparity range is -40 to 40 pixels; velocity difference range is -3 to 5 pixels/frame. The spatial and temporal frequency of the cell is 2π/40 and 2π/20, respectively. Therefore the preferred velocity difference is 2 pixels/frame. Red indicates large values. Blue indicates low values. Axes: velocity difference (horizontal), disparity (vertical).

From Fig. 3, it appears that the horizontal and vertical cross sections of the tuning surface have little dependence on where the cross section is taken. For example, the velocity difference tuning curve seems to have a similar shape (up to a scaling factor) with the peak location at the preferred velocity difference, no matter which disparity cross section is chosen. This suggests that disparity tuning and $v_d$ tuning may be separable. Here we show that this is indeed the case. Denote the energy response of the normalized model by $f(v_d, d)$, the disparity tuning curve at $v_{d0}$ by $h_{v_{d0}}(d)$, and the velocity difference tuning curve at $d_0$ by $h_{d_0}(v_d)$. We approximate $f(v_d, d)$ using the assumption of separability by

\[
\hat{f}(v_d, d) = k\, \bar{h}_{v_d}(d)\, \bar{h}_{d}(v_d) \qquad \forall\, v_d, d,
\tag{9}
\]

where $k$ is a scaling factor, and $\bar{h}_{v_d}(d)$ and $\bar{h}_{d}(v_d)$ are the average vertical and horizontal cross sections, respectively. From Fig. 4 we can see that the measured tuning surface and its approximation assuming separability are approximately the same. To evaluate this fit quantitatively, we examine the error of the estimate. Here, we only take into account the model with the normalization unit, because it is much more stable than the non-normalized model, as described below. We quantify the fit using the square root of the mean squared error normalized by the average response

Fig. 4. (a) Measured energy response $f(v_d, d)$ for translating noise stimuli. (b) Energy response estimated assuming separability, $\hat{f}(v_d, d)$. The color scales of the two plots are identical. Axes: velocity difference (horizontal), disparity (vertical).

\[
I = \frac{\sqrt{\mathrm{Average}\!\left( \left( f(v_d, d) - \hat{f}(v_d, d) \right)^2 \right)}}{\mathrm{Average}\!\left( f(v_d, d) \right)}
\tag{10}
\]
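The separable approximation in (9) and the error index in (10) can be sketched as follows. This is a hedged illustration, not the authors' analysis code: the tuning surface is synthetic, the text does not say how the scaling factor k is chosen (a least-squares fit is assumed here), and the function names are hypothetical.

```python
import numpy as np

def separable_fit(f):
    """Approximate f[d, vd] by k * h_bar(d) * h_bar(vd), as in Eq. (9)."""
    h_d = f.mean(axis=1)    # average vertical cross section (function of d)
    h_v = f.mean(axis=0)    # average horizontal cross section (function of vd)
    outer = np.outer(h_d, h_v)
    # Least-squares scale factor k (an assumption; the text leaves k open).
    k = np.sum(f * outer) / np.sum(outer ** 2)
    return k * outer

def fit_error(f, f_hat):
    """Index I of Eq. (10): RMS error divided by the average response."""
    return np.sqrt(np.mean((f - f_hat) ** 2)) / np.mean(f)

# Synthetic tuning surface: a separable bump plus a small non-separable term.
d = np.linspace(-40.0, 40.0, 81)[:, None]      # disparity (rows)
vd = np.linspace(-3.0, 5.0, 41)[None, :]       # velocity difference (columns)
f = np.exp(-(d / 15.0) ** 2) * np.exp(-((vd - 2.0) / 1.5) ** 2)
f = f + 0.01 * np.exp(-((d - 10.0 * (vd - 2.0)) / 20.0) ** 2)

f_hat = separable_fit(f)
print(fit_error(f, f_hat))
```

For an exactly separable surface the index is zero up to rounding, so a small value of I indicates that a product of the two average cross sections explains most of the response.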

For the data shown, the value of $I$ is 2.66%. This supports the conclusion that the responses of the stereo-motion tuned neurons are approximately separable.

3.3 Comparison between Models with and without Normalization

The characteristics of the two models do not differ by much in terms of the average energy response, as we can see from Fig. 3. However, intuitively, the output of the normalized model should exhibit less variation than that of the non-normalized model.

Fig. 5. (a) Standard deviation of the non-normalized model. (b) Standard deviation of the normalized model. The standard deviation is expressed as multiples of the average response. The color scales of the two figures are different. Axes: velocity difference (horizontal), disparity (vertical).


Fig. 5(a) and (b) show the relative standard deviation of the energy output. The standard deviation for the non-normalized model is 1.7 to 2 times the average response, while that of the normalized model is below 0.6 times the average response. In particular, in the vicinity of the preferred region, i.e. disparity = 0 and $v_d$ = 2 pixels/frame, the standard deviation is below 0.15 times the average response for the units with normalization. Therefore, the normalization unit greatly improves the stability and reliability of the response.
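Why normalization reduces variability can be illustrated with a toy calculation. The sketch below is not a reproduction of Fig. 5: it looks only at the combination-unit output before temporal filtering, draws hypothetical random filter responses to mimic stimulus-dependent contrast, and keeps the phase relation between the two eyes fixed. All names and values are assumptions.

```python
import random
import statistics

def raw_output(ul0, ur0):
    """Combination-unit output U_l(0) U_r(0)* before normalization."""
    return ul0 * ur0.conjugate()

def normalized_output(ul0, ur0):
    """Output divided by the summed squared magnitudes of both responses."""
    return raw_output(ul0, ur0) / (abs(ul0) ** 2 + abs(ur0) ** 2)

def relative_std(values):
    """Standard deviation expressed as a multiple of the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

random.seed(1)
phase_shift = complex(0.6, -0.8)   # unit-magnitude factor: a fixed disparity
raw, norm = [], []
for _ in range(500):
    # Random left response; its magnitude varies from stimulus to stimulus.
    ul0 = complex(random.gauss(0, 1), random.gauss(0, 1))
    ur0 = ul0 * phase_shift
    raw.append(abs(raw_output(ul0, ur0)))
    norm.append(abs(normalized_output(ul0, ur0)))

print(relative_std(raw), relative_std(norm))
```

In this idealized setting the raw magnitude fluctuates with the squared contrast of each stimulus, while the normalized magnitude stays fixed at 1/2, so its relative standard deviation is essentially zero. Real stimuli break the exact relation between the two monocular responses, which is why the normalized model in Fig. 5 still shows some variation.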

References

1. Adelson, E.H., Bergen, J.R.: Spatiotemporal Energy Models for the Perception of Motion. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 2, 284–299 (1985)
2. Brooks, K.R., Stone, L.S.: Stereomotion Speed Perception: Contributions from both Changing Disparity and Interocular Velocity Difference Over a Range of Relative Disparities. J. Vis. 4, 1061–1079 (2004)
3. Cumming, B.G., Parker, A.J.: Binocular Mechanisms for Detecting Motion-in-Depth. Vis. Res. 34, 483–495 (1994)
4. Ohzawa, I.: Mechanisms of Stereoscopic Vision: The Disparity Energy Model. Curr. Opin. Neurobiol. 8, 509–515 (1998)
5. Qian, N.: Computing Stereo Disparity and Motion with Known Binocular Cell Properties. Neural Comput. 6, 390–404 (1994)
6. Qian, N., Andersen, R.A.: A Physiological Model for Motion-Stereo Integration and a Unified Explanation of Pulfrich-Like Phenomena. Vis. Res. 37, 1683–1698 (1997)
7. Fleet, D.J., Wagner, H., Heeger, D.J.: Neural Encoding of Binocular Disparity: Energy Models, Position Shifts and Phase Shifts. Vis. Res. 36, 1839–1857 (1996)

A Feature Extraction Method Based on Wavelet Transform and NMFs

Suwen Zhang1, Wanyin Deng1, and Dandan Miao2

1 School of Automation, Wuhan University of Technology, Wuhan, 430070, China
2 School of Resource and Environmental Science, Wuhan University, Wuhan, 430070, China
[email protected]

Abstract. In this paper, a feature extraction method is proposed that combines the Wavelet Transformation (WT) with Non-negative Matrix Factorization with Sparseness constraints (NMFs), for both normal face images and partially occluded ones. First, we apply a two-level wavelet transformation to the face images. Then, the low-frequency sub-bands are decomposed by NMFs to extract either holistic or parts-based representations by constraining the sparseness of the basis images. This method not only overcomes the low speed and low recognition rate of traditional methods such as PCA and ICA, but also controls the sparseness of the decomposed matrices freely and discovers stable, intuitive local features more easily than the classical non-negative matrix factorization algorithm (NMF) and the local non-negative matrix factorization algorithm (LNMF). The experimental results show that this feature extraction method is simple and feasible, with lower complexity. It is also insensitive to expression and partial occlusion, and obtains a higher recognition rate. Moreover, the WT+NMFs algorithm is more robust than traditional ones when the occlusion is serious.

Keywords: Feature extraction, Wavelet transformation, NMFs, Face recognition.

1 Introduction

In recent years, with the development of applications such as electronic commerce, face recognition has become one of the most promising biometric authentication methods. Feature extraction is the most important part of face recognition. The many existing methods fall into two broad categories: methods based on geometric characteristics and methods based on statistical characteristics. Since geometric feature extraction is sensitive to illumination, expression and posture, methods based on statistical characteristics have dominated in recent years, among which the most frequently used are principal component analysis (PCA) and independent component analysis (ICA). However, PCA and ICA do not impose a non-negativity constraint on the matrices they decompose, so positive and negative coefficients can cancel each other, which weakens the features and lowers the recognition accuracy. Lee and Seung proposed the concept of non-negative matrix factorization (NMF) [1], whose factors are all non-negative and which produces a parts-based representation of images because it allows only additive, not subtractive, combinations of basis components. The problem with NMF is that it does not always yield basis vectors that give a local representation, because the sparseness levels of the basis vectors and the coefficient matrix are not high enough. The localized character of the LNMF method [2] is obvious, but LNMF converges slowly and cannot explicitly control the sparseness of the representation. Hoyer proposed NMF with sparseness constraints, which controls the sparseness of both the basis vectors and the coefficient matrix [3].

In this paper, we combine the Wavelet Transformation (WT) and Non-negative Matrix Factorization with Sparseness Constraints (NMFs) to extract features for face recognition. First, we apply a two-level wavelet transformation to the face images and decompose the low-frequency sub-bands using NMFs. The experimental results show that the two-level wavelet transformation largely overcomes the influence of changes in posture and expression. It captures the essential features and effectively reduces the computational complexity. In addition, NMFs can not only discover more stable and intuitive local features but also obtain either holistic or parts-based representations by freely constraining the sparseness of the basis images. Under serious occlusion, NMFs is more robust than the NMF algorithm.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 55–62, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Wavelet Transformation

The wavelet transformation is a time-frequency signal analysis method. With it, an image can be decomposed into sub-band images with different spatial resolutions, frequency characteristics and directional features. Changes in facial expression and small-scale occlusion affect only the high-frequency part of the image, not the low-frequency part. In addition, the wavelet transformation allows perfect reconstruction, which guarantees that no information is lost during decomposition and that no redundancy is introduced. Therefore, we can use wavelet analysis to filter out the high-frequency information before feature extraction, and use only the low-frequency sub-image for recognition.

Given a two-dimensional signal $f(x_1, x_2)$ that is square-integrable, i.e. $f(x_1, x_2) \in L^2(\mathbb{R}^2)$, the continuous wavelet transformation of $f(x_1, x_2)$ is defined as

$$WT_f(a; b_1, b_2) = \frac{1}{a} \iint f(x_1, x_2)\, \psi\!\left(\frac{x_1 - b_1}{a}, \frac{x_2 - b_2}{a}\right) dx_1\, dx_2 \tag{1}$$

where the wavelet basis function is

$$\psi_{a; b_1, b_2}(x_1, x_2) = \frac{1}{a}\, \psi\!\left(\frac{x_1 - b_1}{a}, \frac{x_2 - b_2}{a}\right) \tag{2}$$

The most frequently used wavelet transform in image processing is the dyadic wavelet transform, which discretizes formula (1) with $a = 2^n$, $b \in \mathbb{Z}$. A small change in the exponent n produces a large change of scale, so the dyadic wavelet transform has a scale-amplifying character in signal analysis. Applying a 2-level wavelet decomposition to the original image gives the result shown in Fig. 1.

Fig. 1. Two-level wavelet transformation

Fig. 2. Face recognition procedure

We select the low-frequency component LL of the second level as the wavelet feature, which retains the overall shape information of the human face while weakening local detail [4].
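The selection of the second-level LL sub-band can be illustrated with a small sketch. It assumes the Haar wavelet with orthonormal scaling, for which each LL coefficient is the sum of a 2×2 block divided by 2; the function names `haar_ll` and `two_level_ll` are hypothetical, and other wavelets (such as db2 or db6) and border handling would need a proper wavelet library.

```python
def haar_ll(image):
    """One level of the 2-D Haar transform, returning only the LL sub-band.

    With orthonormal Haar filters, each LL coefficient is the sum of a
    2x2 block of pixels divided by 2.
    """
    rows, cols = len(image), len(image[0])
    assert rows % 2 == 0 and cols % 2 == 0, "even dimensions expected"
    return [
        [(image[r][c] + image[r][c + 1]
          + image[r + 1][c] + image[r + 1][c + 1]) / 2.0
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


def two_level_ll(image):
    """Apply the LL step twice: an N x N image becomes (N/4) x (N/4)."""
    return haar_ll(haar_ll(image))


# A constant 8x8 "image" shrinks to 2x2; each level doubles the value.
flat = [[1.0] * 8 for _ in range(8)]
ll2 = two_level_ll(flat)
print(len(ll2), len(ll2[0]), ll2[0][0])
```

Under the same assumptions, a 128×128 input yields a 32×32 LL sub-band, which matches the sizes reported in the experiments of Section 5.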

3 NMF and NMFs

3.1 NMF

Given a non-negative matrix $V_{n \times m}$, NMF finds non-negative matrices $W_{n \times r}$ and $H_{r \times m}$ such that $V \approx WH$. Each column of V is a non-negative vector of dimension n corresponding to a face image, m is the number of training images, and r is the dimension of the feature vector. Each column of W represents a basis vector, while each column of H holds the weights used to approximate the corresponding column of V with the bases from W. NMF decomposition is an NP-hard problem in general; it is treated as an optimization problem and solved iteratively for the basis matrix W and the coefficient matrix H. One form of the objective function is the Euclidean distance:

$$D(V \| WH) = \| V - WH \|^2 = \sum_{ij} \left( V_{ij} - (WH)_{ij} \right)^2$$

So NMF factorization is a solution to the following optimization problem:

$$\min_{W,H} D(V \| WH) \quad \text{s.t.} \quad W, H \ge 0, \quad \sum_i W_{ij} = 1$$
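A problem of this form is commonly attacked with multiplicative update rules for the Euclidean objective. The following is a minimal plain-Python sketch under stated assumptions: random positive initialization, a fixed iteration count, and a small constant `eps` to avoid division by zero. The names `nmf`, `matmul`, `transpose` and `sq_error` are illustrative and do not come from the paper, and the column normalization of W is applied once at the end rather than inside the loop.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sq_error(v, w, h):
    """Squared Euclidean distance ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, r, iters=200, seed=0, eps=1e-12):
    """Multiplicative-update NMF sketch; returns (W, H) with V ~ WH."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(n)]
    # Rescale so each column of W sums to 1; WH is unchanged because the
    # matching row of H absorbs the scale factor.
    for a in range(r):
        s = sum(w[i][a] for i in range(n)) or 1.0
        for i in range(n):
            w[i][a] /= s
        h[a] = [x * s for x in h[a]]
    return w, h


# Hypothetical 4x3 non-negative matrix of rank 2.
V = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [1.0, 3.0, 3.0], [2.0, 5.0, 3.0]]
W, H = nmf(V, r=2)
print(round(sq_error(V, W, H), 6))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative without any explicit projection step.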

Lee and Seung presented an iterative approach that reaches a local minimum of this objective function [5].

3.2 NMFs

Non-negative matrix factorization with sparseness constraints (NMFs) is a matrix factorization algorithm based on NMF. When the original matrix V is factorized, the sparseness of the feature matrix W or of the encoding matrix H can be controlled to meet the requirements of a specific application. When NMFs is used as a feature extraction algorithm for face recognition, the feature matrix W should be sparse. The differences among the elements of W then increase, the features of each basis face become more prominent, and a face image to be recognized is related to fewer basis faces, which makes it easier to recognize. The optimization problem in NMFs is defined as follows:

$$\min_{W,H} D(V \| WH) \quad \text{s.t.} \quad W, H \ge 0, \quad \sum_i W_{ij} = 1, \quad \mathrm{sparseness}(w_i) = S_w, \quad \mathrm{sparseness}(h_j) = S_h$$

where $w_i$ is the i-th column of W and $h_j$ is the j-th row of H. Here $S_w$ and $S_h$ are the desired sparseness levels of W and H respectively; both are set by the user. The sparseness of each column $w_i$ of W is defined as

$$\mathrm{sparseness}(w_i) = \frac{\sqrt{n} - \left( \sum_j w_{ij} \right) / \sqrt{\sum_j w_{ij}^2}}{\sqrt{n} - 1}$$

Here n is the dimension of the non-negative vector. From the definition, the function equals 1 if and only if the vector contains exactly one non-zero element, and equals 0 if and only if all elements are equal. When the basis image matrix W has low sparseness, the gray levels of the elements in each column (each basis face) differ little, so the column reflects the relationships among gray levels across the whole face image and captures the holistic character of the face. This idea is similar to PCA, which is based on holistic feature extraction. When W has high sparseness, the gray levels within each column differ greatly, so the column retains only the relationships among gray levels within part of the face image and better reflects local facial features. This idea corresponds to the traditional NMF and LNMF algorithms [6].
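The two limiting cases of the sparseness measure can be checked with a short sketch. The function name `sparseness` and the example vectors are illustrative; the code only evaluates the measure and does not implement the projection step that an NMFs solver would need to enforce a target sparseness.

```python
import math


def sparseness(x):
    """Sparseness of a non-negative vector, between 0 and 1."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    if n < 2 or l2 == 0.0:
        raise ValueError("need at least two entries and a non-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


print(sparseness([0.0, 0.0, 3.0, 0.0]))   # single non-zero entry
print(sparseness([2.0, 2.0, 2.0, 2.0]))   # all entries equal
print(sparseness([4.0, 1.0, 0.0, 0.0]))   # somewhere in between
```

Vectors between the two extremes give intermediate values, which is what lets the user-set levels $S_w$ and $S_h$ trade holistic against local basis images.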

4 Face Recognition

Face recognition can be divided into a training process and a recognition process. The overall scheme is shown in Fig. 2. The algorithm is as follows:

(i) The training sample images are preprocessed so that all images have the same size.
(ii) Each image is transformed by the 2-level wavelet transformation to obtain its low-frequency sub-band LL2.
(iii) The low-frequency sub-bands of the m training images are assembled into an n×m matrix V. Each column $v_j \in \mathbb{R}^n$ is formed by stacking the columns of the low-frequency sub-band of the j-th training image [7]. It also satisfies $v_{ij} \ge 0$, $\sum_{i=1}^{n} v_{ij} = 1$, $j = 1, 2, \ldots, m$.
(iv) V is factorized by NMFs to obtain the basis image matrix W and the weight matrix H. In general, r is chosen as a square number below 100 [7]. W and H are then smaller than the original matrix, which yields a compressed model of the original data matrix [8].
(v) The low-frequency sub-bands of the training samples and of the testing samples are each projected onto the "feature subspace" W spanned by the basis images. With $W^{+} = (W^T W)^{-1} W^T$, the projection vector (also the weight vector) of a face sample on the basis images is $h = W^{+} v$. These projection vectors are the feature vectors used to describe the face [9].
(vi) A nearest-neighbor classifier is used to classify the testing faces.
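Steps (v) and (vi) can be sketched in plain Python as follows. Instead of forming $W^{+}$ explicitly, the sketch solves the normal equations $(W^T W) h = W^T v$, which gives the same h when the columns of W are linearly independent. All names (`solve`, `project`, `nearest_neighbor`) and the toy data are hypothetical, and squared Euclidean distance is assumed for the classifier because the paper does not name a metric.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("W^T W is singular; basis columns are dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def project(w, v):
    """Return h = (W^T W)^{-1} W^T v for a basis w (n x r) and a vector v."""
    cols = list(zip(*w))                      # the r columns of W
    gram = [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(x * y for x, y in zip(ci, v)) for ci in cols]
    return solve(gram, rhs)


def nearest_neighbor(h, train_h, train_labels):
    """Label of the training vector closest to h (squared Euclidean)."""
    dists = [sum((p - q) ** 2 for p, q in zip(h, t)) for t in train_h]
    return train_labels[dists.index(min(dists))]


# Toy basis with r = 2 and two labelled training projections.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_h = [project(W, [2.0, 3.0, 5.0]), project(W, [0.0, 1.0, 1.0])]
labels = ["A", "B"]
print(nearest_neighbor(project(W, [2.1, 2.9, 5.0]), train_h, labels))
```

Solving the normal equations avoids an explicit matrix inverse, though it still fails when W is rank deficient, which the sketch reports as an error.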

5 Simulation Experiment and Comparison

The factorization results of three algorithms, NMF, NMFs and WT+NMFs, are tested. The ORL face database is used for the experiment. It contains 40 individuals with 10 face images each, 400 images in all; each image has 256 gray levels and a size of 92×112 pixels, and posture and angle vary from image to image. In addition, the eyes, nose and mouth of the testing faces are occluded at random. The occluding patches are gray-level blocks of 80×30 pixels.

Fig. 3. Examples of normal face images and partially occluded ones

For each person, 5 face images are selected at random for training, and the remaining 5 are used as the images to be recognized, so the training set and the testing set contain 200 images each. In the experiment, the images are first normalized to 128×128. Next, the 2-level wavelet transformation is applied to obtain the low-frequency sub-band of each image, whose size is 32×32. Each low-frequency sub-band is then reshaped into a 1024-dimensional column vector, and the column vectors of all face images form the matrix V to be decomposed. Because 200 training images are selected, V has 1024 rows and 200 columns. V is then decomposed by NMFs to obtain W and H. The programs were written in MATLAB 7.0 and run on a PC with a 2.2 GHz Pentium 4 processor and 512 MB of memory. The procedure and conclusions of each experiment are as follows.

First, the direct NMFs method is compared with WT+NMFs using three different wavelet bases, for each of which the 2-level wavelet transformation result is obtained. The experimental data for the direct NMFs and WT+NMFs methods are shown in Table 1.

Fig. 4. The images decomposed through different wavelets (panels: the original image, db6, haar, Db2)

Table 1. The comparison of the average recognition rate and the average computation time

Method            Average identification rate (%)    Average computation time (s)
                  without mask     with mask
NMFs              91.5             85.5               218.2
WT+NMFs (Db2)     95.0             93.5               49.8
WT+NMFs (Haar)    94.5             91.5               48.7
WT+NMFs (Db6)     93.5             92                 49.3

The experimental results show that the WT+NMFs method shortens the average computation time and improves recognition; in particular, when the face is partially occluded, the recognition rate increases by as much as 8 percentage points. The recognition results are not sensitive to the choice among the three wavelets. Next, for both the occluded and the unoccluded case, recognition experiments are run with different values of r and of the sparseness; the results are shown in Fig. 5(a) and (b). The figure shows that the recognition rate first increases with r, but decreases once r grows beyond a certain point. When the face is not occluded and r is relatively low, the recognition rate of the WT+NMFs method is higher under high sparseness. When the face is occluded and r is relatively high, the recognition rate of the WT+NMFs method is higher under low sparseness.

Fig. 5. Comparison of recognition rate: (a) normal face images; (b) partially occluded face images

Fig. 6. Basis images of NMFs with r=81 (panels: sw=0.45, sw=0.55, sw=0.65, sw=0.75)

Fig. 6 shows the basis images after WT+NMFs decomposition under different sparseness levels when r=81. The figure shows that, as the sparseness increases, the NMFs basis images shift from holistic to local. This indicates that, by controlling the sparseness, NMFs can express both holistic and local features. Face recognition with low sparseness, based on the holistic representation, is less sensitive to occlusion than recognition with high sparseness, based on the local representation, and is therefore more robust.

6 Conclusions

In this paper, a face feature extraction method based on the wavelet transformation and NMFs is proposed. The method is simple and feasible. It is insensitive to changes in facial pose, expression and head ornaments. It can also discover stable and intuitive local features and control the sparseness of the decomposed matrices freely. Under serious occlusion, the WT+NMFs algorithm is more robust than the NMF algorithm. It also greatly accelerates feature extraction, overcoming the slow speed of traditional NMF. The method is applicable not only to face feature extraction but also to other image feature extraction problems.

References

1. Lee, D.D., Seung, H.S.: Unsupervised Learning by Convex and Conic Coding. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 515–521. The MIT Press, Massachusetts (1997)
2. Feng, T., Li, S.Z., Shum, H.Y., Zhang, H.J.: Local Non-negative Matrix Factorization as a Visual Representation. In: 2nd International Conference on Development and Learning, Cambridge, pp. 7695–1459 (2002)
3. Hoyer, P.O.: Non-negative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research 5, 1457–1469 (2004)
4. Manthalkar, R., Biswas, P.K., Chatterji, B.N.: Rotation and Scale Invariant Texture Features Using Discrete Wavelet Packet Transform. Pattern Recognition Letters 24, 2452–2462 (2003)
5. Lee, D.D., Seung, H.S.: Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 401, 788–791 (1999)
6. Pu, X.R., Zhang, Y., Zheng, Z.M., Wei, Z., Mao, Y.: Face Recognition Using Fisher Non-negative Matrix Factorization with Sparseness Constraints. In: Wang, J., Liao, X.-F., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3497, pp. 112–117. Springer, Heidelberg (2005)
7. Ouyang, Y.B., Pu, X.R., Zhang, Y.: Wavelet-based Non-negative Matrix Factorization with Sparseness Constraints for Face Recognition. Application Research of Computers 10, 159–162 (2006)
8. Chen, W.G., Qi, F.H.: Learning NMF Representation Using a Hybrid Method Combining Feasible Direction Algorithm and Simulated Annealing. Acta Electronica Sinica 31, 2190–2193 (2003)
9. Zhang, Z.W., Yang, F., Xia, K.W., Yang, R.X.: Research on Face Recognition Method Based on Wavelet Transform and NMF. Computer Engineering 33, 176–179 (2007)

Similarity Measures between Connection Numbers of Set Pair Analysis

Junjie Yang, Jianzhong Zhou, Li Liu, Yinghai Li, and Zhengjia Wu

School of Hydropower and Information Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
[email protected]

Abstract. Set Pair Analysis (SPA) is a new system analysis approach and uncertainty theory. The similarity measure between connection numbers is the key to applications of SPA in multi-attribute decision-making, pattern recognition and artificial intelligence. However, it is difficult to describe the similarity degree between connection numbers accurately. This paper presents the distance between connection numbers, a group of checking criteria, and similarity degree functions of connection numbers in SPA to measure the similarity between connection numbers; the rationality of these measures is verified against the criteria. The results show the effectiveness of the proposed similarity measures.

Keywords: Set Pair Analysis, Similarity measures, Similarity degree function.

1 Introduction

In the real world there are all kinds of uncertainties, such as fuzzy uncertainty, random uncertainty, indeterminate-known uncertainty, unknown and unexpected-incident uncertainty, and uncertainty resulting from imperfect information [1]. The most successful approach to understanding and manipulating uncertain knowledge is the fuzzy set theory proposed by Zadeh. Set Pair Analysis (SPA) theory provides another way of expressing and processing uncertainties. The theory overlaps with many other uncertainty theories, especially fuzzy set theory, evidence theory, Boolean reasoning methods and rough set theory. SPA theory emphasizes relativity and fuzziness in information processing, and can distinguish relatively certain information from relatively uncertain information in the system under study. The connection number theory, which is rich in content and significant in the history of mathematics, has been established. SPA regards the connection number as a kind of number that can describe uncertain quantities, and holds that it differs essentially from constants, variables and super uncertain quantities [2,3]. The similarity measure between connection numbers is the key to applications of SPA in multi-attribute decision-making, pattern recognition and artificial intelligence. However, because a connection number contains the identity, discrepancy and contrary information of a system, it is difficult to describe the similarity degree between connection numbers accurately. In this paper, new similarity measures are proposed; they are presented in Sections 2 and 3. Finally, conclusions are given in Section 4.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 63–68, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 The Distance between Connection Numbers

In this section, we present the similarity measures between connection numbers by adopting an extension of distance in functional analysis.

Definition 2.1. Let $\mu_1$ and $\mu_2$ be two connection numbers, where $\mu_1 = a_1 + b_1 i + c_1 j$ and $\mu_2 = a_2 + b_2 i + c_2 j$. The weighted Minkowski distance between $\mu_1$ and $\mu_2$ is defined as

$$d_q(\mu_1, \mu_2) = \sqrt[q]{\omega_a |a_1 - a_2|^q + \omega_b |b_1 - b_2|^q + \omega_c |c_1 - c_2|^q} \tag{1}$$

where $\omega_a$, $\omega_b$ and $\omega_c$ are weights. Three special cases are:

(1) Hamming distance (q = 1)

$$d_1(\mu_1, \mu_2) = \omega_a |a_1 - a_2| + \omega_b |b_1 - b_2| + \omega_c |c_1 - c_2| \tag{2}$$

(2) Euclidean distance (q = 2)

$$d_2(\mu_1, \mu_2) = \sqrt{\omega_a (a_1 - a_2)^2 + \omega_b (b_1 - b_2)^2 + \omega_c (c_1 - c_2)^2} \tag{3}$$

(3) Chebyshev distance (q → ∞)

$$d_\infty(\mu_1, \mu_2) = \max(\omega_a |a_1 - a_2|, \omega_b |b_1 - b_2|, \omega_c |c_1 - c_2|) \tag{4}$$

3 Similarity Measures between Connection Numbers

In this section, a group of criteria for checking the rationality of similarity measures is presented, and similarity measures between connection numbers are then proposed using the idea of a similarity degree function [4,5].

3.1 Checking Criterion

Let $\mu_1$, $\mu_2$ and $\mu_3$ be three connection numbers, where $\mu_1 = a_1 + b_1 i + c_1 j$, $\mu_2 = a_2 + b_2 i + c_2 j$ and $\mu_3 = a_3 + b_3 i + c_3 j$. Let $\rho(\mu_1, \mu_2)$, $\rho(\mu_1, \mu_3)$ and $\rho(\mu_2, \mu_3)$ denote the similarity degree function between $\mu_1$ and $\mu_2$, $\mu_1$ and $\mu_3$, and $\mu_2$ and $\mu_3$ respectively. A similarity degree function must satisfy the following criteria:

Criterion 3.1: $0 \le \rho(\mu_1, \mu_2) \le 1$.
Criterion 3.2 (monotonicity criterion): $\rho(\mu_1, \mu_3) \le \min(\rho(\mu_1, \mu_2), \rho(\mu_2, \mu_3))$ if $\mu_1 \preceq \mu_2 \preceq \mu_3$.
Criterion 3.3 (symmetry criterion): $\rho(\mu_1, \mu_2) = \rho(\mu_2, \mu_1)$.
Criterion 3.4: $\rho(\mu_1, \mu_2) = 0$ if and only if $\mu_1 = 1 + 0i + 0j$ and $\mu_2 = 0 + 0i + 1j$; $\rho(\mu_1, \mu_2) = 1$ if and only if $\mu_1 = \mu_2$, that is, $a_1 = a_2$ and $c_1 = c_2$.
Criterion 3.5: $\rho(\mu_1, \mu_2) = \rho(\mu_1^{-}, \mu_2^{-})$, where $\mu^{-} = c + bi + aj$ is called the complement connection number of $\mu = a + bi + cj$.
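The criteria lend themselves to a quick numerical check. The sketch below is illustrative only: as the candidate measure it assumes the Euclidean form that the paper introduces in Definition 3.2, restricted to the a and c components and divided by √2, and it tests a single hypothetical triple ordered so that $a_1 \le a_2 \le a_3$ and $c_1 \ge c_2 \ge c_3$. The names `rho`, `complement` and `check_criteria` are not from the paper.

```python
import math


def rho(mu1, mu2):
    """Candidate measure: 1 - Euclidean distance over (a, c) / sqrt(2)."""
    (a1, _, c1), (a2, _, c2) = mu1, mu2
    return 1.0 - math.sqrt(((a1 - a2) ** 2 + (c1 - c2) ** 2) / 2.0)


def complement(mu):
    """Swap identity and contrary parts: a + bi + cj -> c + bi + aj."""
    a, b, c = mu
    return (c, b, a)


def check_criteria(measure, mu1, mu2, mu3, tol=1e-12):
    """Check criteria 3.1, 3.2, 3.3 and 3.5 on one ordered triple.

    The triple is assumed to satisfy a1 <= a2 <= a3 and c1 >= c2 >= c3.
    """
    r12, r23, r13 = measure(mu1, mu2), measure(mu2, mu3), measure(mu1, mu3)
    return {
        "3.1 range": all(-tol <= r <= 1.0 + tol for r in (r12, r23, r13)),
        "3.2 monotonicity": r13 <= min(r12, r23) + tol,
        "3.3 symmetry": abs(r12 - measure(mu2, mu1)) <= tol,
        "3.5 complement": abs(
            r12 - measure(complement(mu1), complement(mu2))) <= tol,
    }


# Hypothetical triple (a, b, c) with a + b + c = 1, ordered as required.
triple = [(0.1, 0.4, 0.5), (0.3, 0.3, 0.4), (0.6, 0.2, 0.2)]
print(check_criteria(rho, *triple))
# Criterion 3.4 end points: total opposition and identity.
print(rho((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)), rho(triple[0], triple[0]))
```

A check on one triple is of course no proof; it only helps catch a candidate measure that violates a criterion outright.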

3.2 Similarity Measures

Definition 3.1. Let $\mu$ be a connection number, where $\mu = a + bi + cj$.

(1) $C(\mu) = a - c$ is called the core of $\mu$.
(2) $C^{\omega}(\mu) = \omega_a \cdot a + \omega_b \cdot b + \omega_c \cdot c$ is called the weighted core of $\mu$, where $\omega_a$, $\omega_b$ and $\omega_c$ are the weights of a, b and c respectively, with $\omega_a \ge \omega_c \ge 0 \ge \omega_b$.
(3) $S(\mu) = a(1 + \alpha \cdot b)$ is called the identity degree of $\mu$, and $D(\mu) = c(1 + \beta \cdot b)$ is the contrary degree of $\mu$, where $\alpha, \beta \in [0, 1]$ reflect the risk attitude of the decision maker. The larger $\alpha$ is, the more likely the discrepancy degree of $\mu$ is to convert into identity degree; the larger $\beta$ is, the more likely it is to convert into contrary degree.

Let $\mu_1$ and $\mu_2$ be two connection numbers, where $\mu_1 = a_1 + b_1 i + c_1 j$ and $\mu_2 = a_2 + b_2 i + c_2 j$. The similarity measures between two connection numbers are defined as follows.

Definition 3.2. The similarity degree function $\rho(\mu_1, \mu_2)$ is defined as

$$\rho(\mu_1, \mu_2) = 1 - \frac{d_q(\mu_1, \mu_2)}{2^{1/q}} \tag{5}$$

where $d_q(\mu_1, \mu_2)$ denotes the Minkowski distance between $\mu_1$ and $\mu_2$.

It is obvious that similarity measure (5) meets criteria 3.1, 3.3, 3.4 and 3.5. Taking q = 2 as an example, we now prove that it also meets criterion 3.2. Let $\mu_1$, $\mu_2$ and $\mu_3$ be three connection numbers, where $\mu_1 = a_1 + b_1 i + c_1 j$, $\mu_2 = a_2 + b_2 i + c_2 j$, $\mu_3 = a_3 + b_3 i + c_3 j$, and $\mu_1 \preceq \mu_2 \preceq \mu_3$, that is, $a_1 \le a_2 \le a_3$ and $c_1 \ge c_2 \ge c_3$. We can derive

$$\begin{cases} (c_1 - c_3)^2 \ge (c_1 - c_2)^2, \quad (c_1 - c_3)^2 \ge (c_2 - c_3)^2, \\ (a_1 - a_3)^2 \ge (a_1 - a_2)^2, \quad (a_1 - a_3)^2 \ge (a_2 - a_3)^2. \end{cases} \tag{6}$$

Then (6) can be written as

$$\begin{cases} (c_1 - c_3)^2 \ge \max\{(c_1 - c_2)^2, (c_2 - c_3)^2\}, \\ (a_1 - a_3)^2 \ge \max\{(a_1 - a_2)^2, (a_2 - a_3)^2\}. \end{cases} \tag{7}$$

The following can be derived from (7):

$$\left((a_1 - a_3)^2 + (c_1 - c_3)^2\right)^{1/2} \ge \max\left\{\left((a_1 - a_2)^2 + (c_1 - c_2)^2\right)^{1/2}, \left((a_2 - a_3)^2 + (c_2 - c_3)^2\right)^{1/2}\right\},$$

$$d(\mu_1, \mu_3) \ge \max\{d(\mu_1, \mu_2), d(\mu_2, \mu_3)\},$$

$$1 - d(\mu_1, \mu_3) \le \min\{1 - d(\mu_1, \mu_2), 1 - d(\mu_2, \mu_3)\}.$$

So the following result can be derived: $\rho(\mu_1, \mu_3) \le \min\{\rho(\mu_1, \mu_2), \rho(\mu_2, \mu_3)\}$. This shows that similarity measure (5) satisfies criterion 3.2.

Definition 3.3. The similarity degree function between $\mu_1$ and $\mu_2$ is defined as

$$\rho(\mu_1, \mu_2) = 1 - \frac{d_q(\mu_1, \mu_2)}{2^{1/q}} - \frac{|C(\mu_1) - C(\mu_2)|}{2} \tag{8}$$

Definition 3.4. $\rho(\mu_1, \mu_2)$ is defined as

$$\rho(\mu_1, \mu_2) = 1 - \frac{|C(\mu_1) - C(\mu_2)|}{2} - \frac{|S(\mu_1) - S(\mu_2)| + |D(\mu_1) - D(\mu_2)|}{2} \tag{9}$$

It is obvious that similarity measure (9) meets criteria 3.1, 3.3, 3.4 and 3.5. That it also meets criterion 3.2 is proved as follows. Let $\alpha = \beta = 1$; then

$$\rho(\mu_1, \mu_2) = 1 - \frac{|\Delta a_{12} - \Delta c_{12}|}{2} - \left( \frac{|[2 - (a_1 + a_2)]\Delta a_{12} - (a_1 c_1 - a_2 c_2)|}{2} + \frac{|[2 - (c_1 + c_2)]\Delta c_{12} - (a_1 c_1 - a_2 c_2)|}{2} \right) \tag{10}$$

where $\Delta a_{12} = a_1 - a_2$ and $\Delta c_{12} = c_1 - c_2$. Let $c_1 = c_2 = c$; then

$$\rho(\mu_1, \mu_2) = 1 - f_1 |\Delta a_{12}| \tag{11}$$

Let $a_1 = a_2 = a$; then

$$\rho(\mu_1, \mu_2) = 1 - f_2 |\Delta c_{12}| \tag{12}$$

where $f_1 = \frac{1 + |4 - 2c - (a_1 + a_2)|}{2}$ and $f_2 = \frac{1 + |4 - 2a - (c_1 + c_2)|}{2}$.

If $\mu_1 \preceq \mu_2 \preceq \mu_3$, that is, $a_1 \le a_2 \le a_3$ and $c_1 \ge c_2 \ge c_3$, it is obvious that $|\Delta a_{13}| \ge |\Delta a_{12}|$ and $|\Delta a_{13}| \ge |\Delta a_{23}|$, so

$$|\Delta a_{13}| \ge \max\{|\Delta a_{12}|, |\Delta a_{23}|\} \tag{13}$$

Substituting (13) into (11) and (12) gives $\rho(\mu_1, \mu_3) \le \min\{\rho(\mu_1, \mu_2), \rho(\mu_2, \mu_3)\}$. This shows that similarity measure (9) satisfies criterion 3.2.

A group of examples illustrates the effectiveness of the proposed similarity measures between connection numbers.

Example 3.1. Let $\mu_1$ and $\mu_2$ be two connection numbers, where $\mu_1 = a_1 + b_1 i + c_1 j$ and $\mu_2 = a_2 + b_2 i + c_2 j$. The calculated results of the proposed similarity measures are shown in Table 1.

Table 1. The examples of the proposed similarity measures

No.   μ1            μ2            ρ(μ1, μ2)
      a1     c1     a2     c2     Def. 3.2   Def. 3.3   Def. 3.4 (α=β=1)   Def. 3.4 (α=0.8, β=0.2)
1     0.1    0.5    0.1    0.5    1.0000     1.0000     1.0000             1.0000
2     0.0    1.0    1.0    0.0    0.0000     0.0000     0.0000             0.0000
3     1.0    0.0    0.0    1.0    0.0000     0.0000     0.0000             0.0000
4     0.2    0.5    0.3    0.4    0.9000     0.8000     0.7700             0.7850
5     0.4    0.1    0.5    0.2    0.9000     0.9000     0.9200             0.9190
6     0.2    0.5    0.1    0.4    0.9000     0.9000     0.9200             0.9010
7     0.2    0.5    0.1    0.6    0.9000     0.8000     0.7700             0.7850
8     0.2    0.5    0.3    0.4    0.9000     0.8000     0.7700             0.7850
9     0.2    0.5    0.3    0.6    0.9000     0.9000     0.9600             0.9210
10    0.1    0.5    0.4    0.6    0.7764     0.6764     0.7200             0.7360
11    0.1    0.5    0.2    0.8    0.7764     0.6764     0.8200             0.7360
12    0.4    0.6    0.2    0.8    0.8000     0.6000     0.6000             0.6000
13    0.6    0.4    0.8    0.2    0.8000     0.6000     0.6000             0.6000

In Table 1, rows 1 to 3 illustrate that the proposed similarity measures satisfy criterion 3.4, and rows 12 and 13 illustrate criterion 3.5. Rows 4 to 9 show that, when |Δa12| and |Δc12| are the same for two pairs of connection numbers, the similarity degree function of Definition 3.2 cannot distinguish between the pairs, whereas Definitions 3.3 and 3.4 overcome this problem. Compared with the values given by Definition 3.3, those of Definition 3.4 are better distributed, which indicates that Definition 3.4 is more effective. Furthermore, other results can be obtained by changing the values of α and β in Definition 3.4 according to the needs of the practical problem.
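Individual rows of Table 1 can be recomputed with a short sketch of Definitions 3.2 to 3.4. Several choices here are assumptions inferred from the tabulated numbers rather than stated in the paper: q = 2, weights (ω_a, ω_b, ω_c) = (1, 0, 1) so that the discrepancy term b does not enter the distance, and b = 1 − a − c for each connection number. All function names are hypothetical, and the Chebyshev limit q → ∞ is not covered.

```python
def connection_number(a, c):
    """Build (a, b, c), assuming the normalisation a + b + c = 1."""
    return (a, 1.0 - a - c, c)


def minkowski(mu1, mu2, q=2.0, weights=(1.0, 0.0, 1.0)):
    """Weighted Minkowski distance; default weights ignore b (assumption)."""
    total = sum(w * abs(x - y) ** q for w, x, y in zip(weights, mu1, mu2))
    return total ** (1.0 / q)


def core(mu):
    """C(mu) = a - c."""
    return mu[0] - mu[2]


def identity_degree(mu, alpha):
    """S(mu) = a (1 + alpha b)."""
    return mu[0] * (1.0 + alpha * mu[1])


def contrary_degree(mu, beta):
    """D(mu) = c (1 + beta b)."""
    return mu[2] * (1.0 + beta * mu[1])


def rho_32(mu1, mu2, q=2.0):
    return 1.0 - minkowski(mu1, mu2, q) / 2.0 ** (1.0 / q)


def rho_33(mu1, mu2, q=2.0):
    return rho_32(mu1, mu2, q) - abs(core(mu1) - core(mu2)) / 2.0


def rho_34(mu1, mu2, alpha=1.0, beta=1.0):
    ds = abs(identity_degree(mu1, alpha) - identity_degree(mu2, alpha))
    dd = abs(contrary_degree(mu1, beta) - contrary_degree(mu2, beta))
    return 1.0 - abs(core(mu1) - core(mu2)) / 2.0 - (ds + dd) / 2.0


# Rows 4 and 10 of Table 1, given as (a1, c1, a2, c2).
for a1, c1, a2, c2 in [(0.2, 0.5, 0.3, 0.4), (0.1, 0.5, 0.4, 0.6)]:
    m1, m2 = connection_number(a1, c1), connection_number(a2, c2)
    print([round(v, 4) for v in (rho_32(m1, m2), rho_33(m1, m2),
                                 rho_34(m1, m2), rho_34(m1, m2, 0.8, 0.2))])
```

Under these assumptions the printed values for rows 4 and 10 agree with the tabulated ones to four decimal places; only these two rows are checked here.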

4 Conclusions

Describing the similarity degree between connection numbers is one of the basic problems of connection number theory in SPA. In this paper, the similarity degree between connection numbers is described by similarity degree functions. A group of checking criteria and similarity measures for connection numbers are proposed. The computed examples show that the proposed similarity measures are a useful attempt at measuring the similarity between connection numbers in SPA.

Acknowledgments

Project supported by the State Key Development Program for Basic Research of China (No. 2007CB714107); The Special Research Foundation for the Public Welfare Industry of the Ministry of Science and Technology and the Ministry of Water Resources (No. 200701008); The Specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20050487062).

References

1. Jiang, Y.L., Zhuang, Y.T., Li, Z.X.: Application of Set Pair Analysis in Urban Planning Project Comprehensive Evaluation. In: Proceedings of 2005 International Conference on Machine Learning and Cybernetics, pp. 2267–2271 (2005)
2. Zhao, K.Q., Xuan, A.L.: Set Pair Theory: A New Theory Method of Non-Define and Its Applications. Systems Engineering 14, 18–23 (1996)
3. Jiang, Y.L., Xu, C.F., Yao, Y., Zhao, K.Q.: Systems Information in Set Pair Analysis and Its Applications. In: Proceedings of International Conference on Machine Learning and Cybernetics, pp. 1717–1722 (2004)
4. Cheng, K.Y.: Research in Fuzzy Logic on Set Pair Analysis. Systems Engineering Theory & Practice 32, 210–213 (2004)
5. Zhang, D.F., Huang, S.L., Li, F.: An Approach to Measuring the Similarity between Vague sets. Journal of Huazhong University of Science and Technology 32, 59–60 (2004)

Temporal Properties of Illusory-Surface Perception Probed with Poggendorff Configuration

Qin Wang and Marsanori Idesawa

Graduate School of Information Systems, The University of Electro-Communications
1-5-1, Chofugaoka, Chofu-Shi, Tokyo, 182-8585, Japan
Tel.: +81-424-43-5649
[email protected]

Abstract. Temporal properties of illusory surface perception were investigated by using the probing method of the Poggendorff configuration. We used real lines and an opaque illusory surface to compose the Poggendorff configuration, which was presented with an intermittent display method: the real lines were displayed continuously and the opaque illusory surface was displayed periodically with various duration and interval times. The results showed that the opaque illusory surface required a minimum duration of approximately 220 msec for sustained perception. An interval of as much as 2200 msec was needed to obliterate the perception of the opaque illusory surface. We found that the intermittent display method is effective for directly examining the time course of illusory surface perception. Furthermore, we concluded that a better understanding of the surface perception mechanism of the human visual system can be achieved by combining the intermittent display method with the probing method of the Poggendorff configuration.

Keywords: surface perception, illusory surface, temporal properties, Poggendorff illusion.

1

Introduction

A three-dimensional (3D) illusory surface is perceived by the partial disparity along an object’s contour where no physical visual stimuli make point-by-point correspondence (Fig. 1). In relation to the phenomenon of 3D illusory surface perception, opaque and transparent perception has been discovered, and interaction between them has been reported. [1], [2], [3] Temporal properties are a crucial aspect of illusory surface perception in the human visual system, but the results of conventional studies examining the temporal properties have not been consistent. [4], [5], [6], [7], [8], [9] The purpose of the present study was to investigate the temporal properties of 3D illusory surface perception by using the probing method of the Poggendorff configuration. F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 69–77, 2008. c Springer-Verlag Berlin Heidelberg 2008 

Q. Wang and M. Idesawa

The Poggendorff illusion is one of the geometrical illusions [10], [11]. In this illusion, two collinear line segments obliquely abutting an inducing element are perceived as noncollinear (Fig. 2(a)). In the Poggendorff illusion without physical contact, the gaps between the line segments and the inducing element are 1.4 degrees (Fig. 2(b)). Opaque surface perception can be probed from the occurrence of the illusion in the Poggendorff configuration without physical contact [12], [13]. However, in daily experience, the human visual system’s response to a stimulus does not result in a final percept without some time delay. Likewise, the perception of a stimulus decays with some time delay after the stimulus disappears. In the present study, we hypothesized that an illusory surface is the result of responses by a transient system. In other words, some time delay is required for the genesis of illusory surface perception, and illusory surface perception requires some time to decay. On the basis of this hypothesis, we examined the temporal properties of illusory surface perception by using the intermittent display method and the probing method of the Poggendorff configuration.

Fig. 1. An example of illusory surfaces. (L is for left-eye view and R is for right-eye view.) A white square surface is perceived.

Fig. 2. The Poggendorff configurations. (L is for left-eye view and R is for right-eye view.) (a) The conventional Poggendorff configuration. (b) The Poggendorff configuration without physical contact.

2 Probing Method for Opaque Surface Perception

The Poggendorff configuration without physical contact is devised based on the conventional Poggendorff illusion (Fig. 2(a)). In the configuration without physical contact, the gaps are 1.4 degrees between the line segments and the inducing element. The nearer perceptual depth of the inducing surface and its opaque property are indispensable factors for perceiving the illusion without physical contact.

Fig. 3. Principles of the Poggendorff configuration. (a) Diagram of the probing method of the Poggendorff configuration. When observing the Poggendorff configuration without physical contact, such that the line segments are at a farther depth than the inducing element, if the illusion occurs, the inducing element is perceived as an opaque surface; otherwise, the inducing element is not perceived as an opaque surface. (b) The Poggendorff illusion without physical contact.

Fig. 4. Diagram of the intermittent display method. (L is for left-eye view and R is for right-eye view.) (a) The procedure of the display; the real lines are displayed continuously and the testing opaque illusory surfaces are displayed periodically in the duration and interval times. (b) The stimuli presented in the duration time, in which the lines and testing opaque illusory surface are displayed synchronously. (c) The stimuli presented in interval time, in which only lines are displayed.


The probing method for opaque surface perception has been proposed on the basis of the characteristics of the Poggendorff configuration without physical contact. In the probing method of the Poggendorff configuration, opaque surface perception could be probed from the occurrence of the illusion in the Poggendorff configuration without physical contact. Specifically, when we observe the Poggendorff configuration without physical contact, such that the line segments are at a farther depth than the inducing element, if the illusion occurs, the inducing element is perceived as an opaque surface; otherwise, the inducing element is not perceived as an opaque surface (Fig. 3).

3 Intermittent Display Method

In the present study, the Poggendorff configuration was composed of real lines and an opaque illusory test surface. The lines were displayed continuously, and the testing surface was displayed periodically with various duration and interval times. In other words, the real lines and the opaque illusory surface were displayed synchronously during the duration time, but only the real lines were displayed during the interval time (Fig. 4).
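The timing of the intermittent display can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes that each cycle starts with a duration phase at time 0 followed by an interval phase, which the text does not state explicitly.

```python
def surface_visible(t_ms, duration_ms, interval_ms):
    """Return True if the test surface is shown at time t_ms (t_ms >= 0).

    Assumption: every cycle is one duration phase followed by one interval
    phase, and the first cycle starts at t = 0. The real lines are always on.
    """
    period = duration_ms + interval_ms
    return (t_ms % period) < duration_ms


def surface_windows(total_ms, duration_ms, interval_ms):
    """List the (onset, offset) windows of the test surface in [0, total_ms)."""
    windows = []
    onset = 0
    while onset < total_ms:
        windows.append((onset, min(onset + duration_ms, total_ms)))
        onset += duration_ms + interval_ms
    return windows


if __name__ == "__main__":
    # Example values only: 220 msec duration, 1000 msec interval.
    for onset, offset in surface_windows(3000, 220, 1000):
        print(f"surface on from {onset} to {offset} msec")
```

With such a schedule, the duration and interval times can be varied independently while the lines stay on throughout.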

4 Experiments

4.1 General

We conducted two experiments in the present study. In Experiment 1, real lines and the testing opaque surface were presented in four display sequences. In Experiment 2, we used the intermittent display method, in which real lines were displayed continuously and the opaque illusory surface was displayed periodically with various duration and interval times.

Apparatus. The left and right eye views were generated by a Silicon Graphics Octane 2 workstation and presented synchronously on a screen (220 cm × 176 cm) by dual projectors (EPSON ELP-735). With a passive pair of polarized glasses, the left eye view could only be seen by the left eye, and the right eye view could only be seen by the right eye, so that stereopsis could be obtained easily. The subject sat 50 cm from the stimuli. Head movement was restricted by a chinrest, and the passive pair of polarized glasses was fixed on the chinrest.

Stimuli. The stimulus used in the experiments consisted of a line (0.06 deg width, 10.2 deg length) and an inducing element (3.4 deg width, 18.8 deg length); the acute angle between them was 33 deg, and the gap between the line elements and the inducing element was 1.4 deg. The line elements were 10 mm farther than the inducing element. The distance from the stimuli to the subject was set at 50 cm.

4.2 Experiment 1

In this experiment, the real lines and test illusory surface were presented in various display sequences and for various display times. Four display sequences were prepared for the display of the lines and testing surface. Several display times were used for each display sequence. The subjects’ task was to state their perception of the two line segments: whether the right line segment was perceived as higher than, collinear with, or lower than the left line segment. The subjects responded by pressing the left, center, or right button on the mouse. A 2000 msec blank screen followed each stimulus to allow for the subject’s response, and the next trial started after this blank screen. The experimental room was dark during the experiment, and no feedback was given. Four observers with normal or corrected-to-normal vision participated in the experiment. The rates of upper perception and collinear perception are plotted against the display times in the figures presented for each display sequence (Fig. 5, Fig. 6, Fig. 7, Fig. 8). The horizontal axis is the display time for the testing surface.

Fig. 5. Display sequence of type-I and experimental results. (a) Display sequence of type-I. The Poggendorff configuration with lines and testing surface was presented for the display time in the range of 150-850 msec. (b) The experimental results. The collinear perception was dominant in the range of 450-850 msec.

Fig. 6. Display sequence of type-II and experimental results. (a) Display sequence of type-II. In the display sequence of type-II, the opaque illusory surface was presented first for the previous surface display times of 150-1650 msec. Subsequently, the Poggendorff configuration was displayed for 450 msec. (b) The experimental results. The collinear perception was dominant when the opaque illusory surface was displayed previously for 150-1650 msec.


In the display sequence of type-I, the Poggendorff configuration with lines and testing surface was presented for display times that varied in the range of 150-850 msec. The results showed that collinear perception was dominant when the configuration with lines and testing surface was displayed for 450-850 msec. That is, the Poggendorff illusion could not be perceived when the display time of the Poggendorff configuration was between 450 msec and 850 msec. This result suggests that the testing surface could not be perceived as an opaque surface in the display range of 450-850 msec (Fig. 5). In detail, the illusory surface could not be perceived when the real lines and the illusory surface were presented synchronously for between 450 msec and 850 msec. This result is consistent with reports that an illusory surface needs somewhat more time to be perceived than a real image [7], [9].

In the display sequence of type-II, the opaque illusory surface was presented first for the previous surface display times of 150-1650 msec. Subsequently, the Poggendorff configuration was displayed for 450 msec. Collinear perception was dominant when the opaque illusory surface was displayed first for 150-1650 msec. This result suggests that the Poggendorff illusion could not be perceived in the time range in which the opaque illusory surface was displayed previously (Fig. 6).

In the display sequence of type-III, the Poggendorff configuration was first displayed for 450 msec. Next, the opaque illusory surface was presented for the varied display times of 150-1650 msec. The results indicate that the upper perception became dominant when the opaque illusory surface was displayed for 1150-1650 msec after the configuration had first been displayed for 450 msec. That is, the Poggendorff illusion could not be perceived, or the perception of the illusion was ambiguous, when the opaque illusory surface was presented for 150-850 msec after the display of the Poggendorff configuration with the real line and the illusory surface (Fig. 7).

In the display sequence of type-IV, the opaque illusory surface was presented for 100 msec. Then, the Poggendorff configuration was displayed for 450 msec. Subsequently, the opaque illusory surface was presented for the varied surface display times of 150-850 msec. The results show that the Poggendorff illusion could be perceived when the opaque illusory surface was displayed for 250-850 msec after the display of the Poggendorff configuration. Moreover, the Poggendorff illusion was more difficult to perceive when the opaque illusory surface was presented before the configuration than when it was presented after it. The results also show that when the opaque surface was presented for 100 msec before the Poggendorff configuration, the time required for the following surface display to produce the Poggendorff illusion decreased (Fig. 8).

From these investigations, we inferred that the afterimage of the real line remained after the visual stimuli disappeared when the real line and illusory surface disappeared synchronously, whereas the illusory surface could not be perceived adequately in this case. We assumed that it was necessary to keep displaying the illusory surface until the afterimage of the real line disappeared.

Fig. 7. Display sequence of type-III and experimental results. (a) Display sequence of type-III. In the display sequence of type-III, the Poggendorff configuration was displayed for 450 msec. Subsequently, the opaque illusory surface was presented for the next surface display time that we varied in the range of 150-1650 msec. (b) The experimental results. The upper perception became dominant when the opaque illusory surface was next displayed for 1150-1650 msec.

Fig. 8. Display sequence of type-IV and experimental results. (a) Display sequence of type-IV. In type-IV, the opaque illusory surface was presented for 100 msec, then, the Poggendorff configuration was displayed for 450 msec. Subsequently, the opaque illusory surface was presented for the third surface display time that we varied in the range of 150-850 msec. (b) The experimental results. The upper perception was dominant when the opaque illusory surface was displayed for 250-850 msec after the display of the Poggendorff configuration.

4.3 Experiment 2

In Experiment 2, we examined the duration time required for the occurrence of the Poggendorff illusion. The stimuli were displayed in random order with various interval times. The interval time was set in increments of 200 msec from 600 msec to 3200 msec. The subjects performed an adjustment task: they adjusted the duration time, by pressing a key on the keyboard, until they perceived the lines as noncollinear. A 2000 msec blank screen followed each adjustment, and the next trial started after this blank screen. The experimental room was dark during the experiment, and no feedback was given. Five observers with normal or corrected-to-normal vision participated in the experiment. The duration time for the Poggendorff illusion is plotted against the interval time (Fig. 9). The Poggendorff illusion could be perceived when the duration time was more than 220 msec for interval times of less than 2200 msec. For interval times of more than 2200 msec, the illusion disappeared even when the duration time was 800 msec. Thus, the opaque illusory surface required a minimum duration of about 220 msec for sustained perception, and an interval as long as 2200 msec was needed to obliterate the perception of the opaque illusory surface.

[Plot of Fig. 9: duration time (msec) against interval time (msec), with interval-axis ticks from 600 to 2200 msec. Data labels in the plot: 218.4, 175.4, 173.2, 141.2, 132.6, 113.8, 103.6, 112.2, 190.6.]

Fig. 9. Experimental results. The duration time for the Poggendorff illusion is plotted against the interval time.

5 Conclusions

In the present study, we examined the temporal properties of opaque illusory surface perception by using the Poggendorff illusion without physical contact as the probing method for detecting surface perception. We utilized an intermittent method to display the test opaque illusory surface. We observed that the opaque illusory surface for sustained perception required a minimum of about 220 msec of duration time. An interval time as long as 2200 msec was needed to obliterate the perception of the opaque illusory surface. We expect to achieve better understanding of the surface perceiving mechanism of the human visual system by utilizing the intermittent display method and the probing method of opaque surface perception of the Poggendorff illusion without physical contact.

References

1. Idesawa, M.: Perception of 3-D Transparent Illusory Surface in Binocular Fusion. Japanese Journal of Applied Physics 30, 1289–1292 (1991)
2. Idesawa, M.: Two Types of Occlusion Cues for the Perception of 3-D Illusory Objects in Binocular Fusion. Japanese Journal of Applied Physics 32, 75–78 (1993)
3. Idesawa, M.: A Study on Visual Mechanism with Optical Illusions. Journal of Robotics and Mechatronics 9, 85–91 (1997)
4. Spillman, L., Fuld, K., Gerrits, H.J.M.: Brightness Contrast in Ehrenstein Illusion. Vision Research 16, 713–719 (1976)
5. Gellatly, A.R.H.: Perception of An Illusory Triangle with Masked Inducing Figure. Perception 9, 599–602 (1980)
6. Parks, T.E., Rock, I., Anson, R.: Illusory Contour Lightness: A Neglected Possibility. Perception 12, 43–47 (1983)
7. Susan, P.: The Perception of Illusory Contours. Springer, Heidelberg (1987)
8. Ringach, D., Shapley, R.: Spatial and Temporal Properties of Illusory Contours and Amodal Boundary Completion. Vision Research 36, 3037–3050 (1996)
9. Idesawa, M., Nakane, Y., Zhang, Q., Shi, W.: Spatiotemporal Influence of Preperceived Surfaces on the Perception of Bistably Perceptible Surfaces with Binocular Viewing Perception. In: ECVP, vol. 29 (2000)
10. Ninio, J.: Characterisation of the Misalignment and Misangulation Components in the Poggendorff and Corner-Poggendorff Illusions. Perception 28, 949–964 (1999)
11. Westheimer, G., Wehrhahn, C.: Real and Virtual Borders in the Poggendorff Illusion. Perception 26, 1495–1501 (1997)
12. Wang, Q., Idesawa, M.: Veiled Factors in the Poggendorff Illusion. Japanese Journal of Applied Physics 43, 11–14 (2004)
13. Wang, Q., Idesawa, M.: Surface Perception Detecting Method by Using the Poggendorff Illusion in Binocular Viewing. Perception 34 ECVP, 187 (2005)

Interval Self-Organizing Map for Nonlinear System Identification and Control

Luzhou Liu, Jian Xiao, and Long Yu

School of Electrical Engineering, Southwest Jiaotong University, Chengdu 610031, China
[email protected]

Abstract. The self-organizing map (SOM) is an unsupervised neural network that projects high-dimensional data onto a low-dimensional map. This paper presents a novel model, the interval self-organizing map (ISOM), whose weights are interval numbers, in contrast to the conventional SOM approach. Correspondingly, a new competition algorithm based on gradient descent is proposed according to a different criterion function defined in this paper, and the convergence of the new algorithm is proved. To improve the robustness of the inverse control system, the inverse controller is approximated by an ISOM, which is cascaded with the original system to form a composite pseudo-linear system. Simulation results show that the inverse system achieves superior tracking precision and robustness. Keywords: Interval self-organizing map, Unsupervised learning, Nonlinear system, Inverse control system.

1 Introduction

The self-organizing map (SOM) [1] is an unsupervised learning algorithm that clusters and projects potentially high-dimensional input data onto a discrete neural grid or map of usually reduced dimensions. The SOM is a vector quantization method that preserves the topological relationships between input vectors when they are projected to a lower-dimensional display space. It was developed to help identify clusters in multidimensional datasets and has been used successfully in a wide range of applications, including nonlinear system identification and control [2][3].

However, almost every system involves a certain amount of uncertainty. To deal with this, it is frequently assumed that the parameters are represented by intervals. We therefore first introduce interval computing, which has become an active research branch of scientific computation [4]. To process uncertainty, we propose a novel type of self-organizing map, the interval self-organizing map (ISOM), whose weights are interval numbers. The competitive learning algorithm, including initialization, updating of neuron weights, and the use of prior knowledge and preferential training, differs from that of the SOM because it adopts a new distance measure based on both empirical risk and structural risk. The winner of every competition is an interval number. The upper bound and lower bound of each weight are adapted separately with the same scalar α(t). The convergence of the new algorithm is proved using the Robbins-Monro stochastic approximation principle.

Although some expressions of the ISOM are similar to those of the conventional SOM, the two are essentially different. The ISOM considers not only the difference between the midpoint of the interval weight and the training data but also the influence of the radius of the interval weight, which makes the final interval weight converge to the single training data point. The method is applied to identification and to the construction of an inverse controller for the nonlinear control system described by Narendra [5]. The validity of the proposed method is illustrated by simulation examples.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 78–86, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Interval Self-Organizing Map (ISOM)

2.1 Interval Preliminaries

R. E. Moore introduced interval computing in the late 1950s. Ever since, it has become an active research branch of scientific computation. An interval is represented by its lower bound and upper bound as $X = [\underline{x}, \overline{x}]$. The following midpoint, radius, and distance are used in this paper for calculations with the lower bound and upper bound of an interval weight:

$$\mathrm{mid}(X) = (\underline{x} + \overline{x})/2, \qquad (1)$$

$$\mathrm{rad}(X) = (\overline{x} - \underline{x})/2, \qquad (2)$$

$$d(X, Y) = |\mathrm{mid}(X) - \mathrm{mid}(Y)| + |\mathrm{rad}(X) - \mathrm{rad}(Y)|. \qquad (3)$$
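The three quantities in (1)-(3) can be written down directly. The following is a minimal sketch, assuming that the distance in (3) is the sum of the absolute midpoint difference and the absolute radius difference; the class and function names are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] (hypothetical helper type)."""
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lower bound must not exceed upper bound")


def mid(x: Interval) -> float:
    """Midpoint, as in (1)."""
    return (x.lo + x.hi) / 2


def rad(x: Interval) -> float:
    """Radius, as in (2)."""
    return (x.hi - x.lo) / 2


def dist(x: Interval, y: Interval) -> float:
    """Distance, as in (3), assuming absolute values of both differences."""
    return abs(mid(x) - mid(y)) + abs(rad(x) - rad(y))


if __name__ == "__main__":
    w = Interval(1.0, 3.0)
    point = Interval(2.0, 2.0)  # a crisp value is an interval of radius 0
    print(mid(w), rad(w), dist(w, point))
```

A crisp data point is represented as an interval of radius 0, which is how the input vectors are treated in the learning algorithm.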

2.2 ISOM and Learning Algorithm

Analogously, the interval self-organizing map learns the topological mapping $f: G \subset R^m \to X \subset R^n$ by means of self-organization driven by samples $x$ in $X$, where $G$ is an output map containing a set of nodes, each representing an interval element in the m-dimensional Euclidean space. Let $x = [x_1, \ldots, x_n]^T \in X$ be the input vector. It is assumed to be connected in parallel to every node in the output map. Every component of the input vector $x = [x_1, \ldots, x_n]^T$ can be regarded as an interval number whose midpoint is $x_k$, $k = 1, 2, \ldots, n$, and whose radius is 0. The $k$-th component of the interval weight vector of node $i$ is denoted $w_{ik} = [\underline{w}_{ik}, \overline{w}_{ik}]$, $\underline{w}_{ik}, \overline{w}_{ik} \in R$, $k = 1, 2, \ldots, n$, where $\underline{w}_i = [\underline{w}_{i1}, \ldots, \underline{w}_{in}]^T \in R^n$ and $\overline{w}_i = [\overline{w}_{i1}, \ldots, \overline{w}_{in}]^T \in R^n$. The learning rule is given as follows:

$$\overline{w}_i(t+1) = \overline{w}_i(t) + \Delta\overline{w}_i(t), \qquad (4)$$

$$\underline{w}_i(t+1) = \underline{w}_i(t) + \Delta\underline{w}_i(t), \qquad (5)$$

where

$$\Delta\overline{w}_i(t) = \alpha_1(t)\, h_{ci}(t)\, [x - \mathrm{mid}(w_i(t)) - \mathrm{rad}(w_i(t))], \qquad (6)$$

$$\Delta\underline{w}_i(t) = \alpha_2(t)\, h_{ci}(t)\, [x - \mathrm{mid}(w_i(t)) + \mathrm{rad}(w_i(t))]. \qquad (7)$$

Here $t = 0, 1, 2, \ldots$ is the discrete-time coordinate, and $\alpha(t)$ is a suitable, monotonically decreasing sequence of scalar-valued gain coefficients, $0 < \alpha(t) < 1$. $h_{ci}$ is the neighborhood function. Denoting the coordinates of nodes $c$ and $i$ by the vectors $r_c$ and $r_i$, respectively, a proper form for $h_{ci}$ might be a kernel function such as

$$h_{ci} = h_0 \exp(-\|r_i - r_c\|^2 / \sigma^2), \qquad (8)$$

with $h_0 = h_0(t)$ and $\sigma = \sigma(t)$ as suitable decreasing functions of time. The steps of the competitive algorithm of the ISOM are as follows:

(a) Initialization. Generate $N \times n$ midpoints of weights and set equal radii, where $N$ is the number of nodes.
(b) Present the $j$-th input data $x_j = [x_{j1}, \ldots, x_{jn}]^T$ to every node in parallel.
(c) Find the best-matching neuron $c$ according to

$$\|x_j - w_c\| = \min_i \{\varepsilon\}, \qquad (9)$$

where

$$\varepsilon = \|\mathrm{mid}(x_j) - \mathrm{mid}(w_i)\|_2^2 + \|\mathrm{rad}(x_j) - \mathrm{rad}(w_i)\|_2^2 + |\mathrm{mid}(x_j) - \mathrm{mid}(w_i)| \bullet |\mathrm{rad}(x_j) - \mathrm{rad}(w_i)|, \qquad (10)$$

and “$\bullet$” denotes the dot product. The criterion function $\varepsilon$ minimizes not only the interval distance (empirical risk) but also the complexity of the structure (structural risk), which gives the model good generalization ability.
(d) Set the sign variable $flag \in R^n$, $flag_k = (\overline{w}_{ck} - x_{jk}) \times (\underline{w}_{ck} - x_{jk})$, $k = 1, 2, \ldots, n$. If $flag_k > 0$, then set $\overline{w}_{ck} = x_{jk} + \mathrm{rad}(w_{ck})$ and $\underline{w}_{ck} = x_{jk} - \mathrm{rad}(w_{ck})$. This adjustment makes the winning interval weight cover the input point and keeps the winner closest to the input data under the definition of the criterion function.
(e) Gradient descent optimization of $\varepsilon$ with respect to the upper bound and lower bound of the interval weight, respectively, yields the sequences (4)-(7).
(f) Set $j = j + 1$ and return to step (b) until all the data have been trained.

2.3 Convergence Property of ISOM

We analyze the convergence properties of the interval self-organizing map (ISOM) with multidimensional input using the Robbins-Monro stochastic approximation principle. It is shown that the ISOM algorithm optimizes a well-defined energy function and converges almost surely if the input data are drawn from a discrete stochastic distribution.


For the case of discrete input, we employ an energy function for the ISOM algorithm:

$$J(W) = \frac{1}{2} \sum_{c,i} h_{ci}(t) \sum_{x_j \in X_c} p_j\, \varepsilon. \qquad (11)$$

According to the Robbins-Monro algorithm, we can give a rigorous proof of the convergence of the ISOM algorithm under the following two hypotheses.

• H.2.3.1. The input $x_j \in R^n$ has the discrete probability density $p(x) = \sum_{j=1}^{L} p_j\, \delta(x - x_j)$.
• H.2.3.2. The learning rate $\alpha(t)$ satisfies the following conditions:
(a) $\lim_{t \to \infty} \alpha(t) = 0$,
(b) $\sum_{t=0}^{\infty} \alpha(t) = \infty$,
(c) $\sum_{t=0}^{\infty} \alpha^2(t) < \infty$.

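Theorem 3.1. Assume that [H.2.3.1] and [H.2.3.2] hold. Then the ISOM algorithm will minimize the energy function (11) and converge almost surely.

Proof. $J$ is piecewise differentiable. Taking derivatives of both sides of (11) with respect to the upper bound and lower bound of the weight, respectively, gives

$$\frac{\partial J}{\partial \overline{w}_{ik}} = \frac{1}{2} \sum_{x_j \in X} p_j\, h_{ci}(t)\, \frac{\partial \varepsilon}{\partial \overline{w}_{ik}} = E\left[h_{ci}(t)\, \frac{\partial \varepsilon}{\partial \overline{w}_{ik}}\right], \qquad (12)$$

$$\frac{\partial J}{\partial \underline{w}_{ik}} = \frac{1}{2} \sum_{x_j \in X} p_j\, h_{ci}(t)\, \frac{\partial \varepsilon}{\partial \underline{w}_{ik}} = E\left[h_{ci}(t)\, \frac{\partial \varepsilon}{\partial \underline{w}_{ik}}\right]. \qquad (13)$$

By the Robbins-Monro algorithm, set $\partial J / \partial \overline{w}_{ik} = 0$ and $\partial J / \partial \underline{w}_{ik} = 0$. Because the energy function includes absolute values, we consider two cases.

Case 1: if $x_{jk} \ge \mathrm{mid}(w_{ik})$, then

$$\frac{\partial \varepsilon}{\partial \overline{w}_{ik}} = -\frac{1}{2}\,(x_{jk} - \mathrm{mid}(w_{ik}) - \mathrm{rad}(w_{ik})), \qquad (14)$$

$$\frac{\partial \varepsilon}{\partial \underline{w}_{ik}} = -\frac{3}{2}\,(x_{jk} - \mathrm{mid}(w_{ik}) + \mathrm{rad}(w_{ik})). \qquad (15)$$

Case 2: if $x_{jk} < \mathrm{mid}(w_{ik})$, then

$$\frac{\partial \varepsilon}{\partial \overline{w}_{ik}} = -\frac{3}{2}\,(x_{jk} - \mathrm{mid}(w_{ik}) - \mathrm{rad}(w_{ik})), \qquad (16)$$

$$\frac{\partial \varepsilon}{\partial \underline{w}_{ik}} = -\frac{1}{2}\,(x_{jk} - \mathrm{mid}(w_{ik}) + \mathrm{rad}(w_{ik})). \qquad (17)$$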
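The coefficients 1/2 and 3/2 in (14)-(17) can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it takes a single component k, treats the input as an interval of radius 0, reads the criterion (10) for that component as ε = (x − mid)² + rad² + |x − mid|·rad, and compares the closed-form derivatives with central finite differences. All function names are hypothetical.

```python
def eps(x, lo, hi):
    """One-component criterion, assuming rad(x) = 0 (our reading of (10))."""
    m = (lo + hi) / 2
    r = (hi - lo) / 2
    return (x - m) ** 2 + r ** 2 + abs(x - m) * r


def analytic_grads(x, lo, hi):
    """(d eps / d upper, d eps / d lower) following (14)-(17)."""
    m = (lo + hi) / 2
    r = (hi - lo) / 2
    if x >= m:  # Case 1
        return -0.5 * (x - m - r), -1.5 * (x - m + r)
    return -1.5 * (x - m - r), -0.5 * (x - m + r)  # Case 2


def numeric_grads(x, lo, hi, step=1e-6):
    """Central finite differences with respect to the upper and lower bound."""
    d_hi = (eps(x, lo, hi + step) - eps(x, lo, hi - step)) / (2 * step)
    d_lo = (eps(x, lo + step, hi) - eps(x, lo - step, hi)) / (2 * step)
    return d_hi, d_lo


if __name__ == "__main__":
    # Example values only: one point on each side of the midpoint 0.4.
    for x in (0.9, 0.1):
        print(x, analytic_grads(x, 0.2, 0.6), numeric_grads(x, 0.2, 0.6))
```

Under this reading, the two cases differ only in which bound receives the larger coefficient, which is why a single update form with separate gains suffices.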

It is obvious that the formulas for the upper bound and lower bound of the weights are the same in both cases except for the coefficient. Without loss of generality, we can write the learning rule as (6), (7). Notice that the Robbins-Monro algorithm ensures that the upper and lower bounds of the weights of the ISOM converge almost surely to the roots of $\partial J / \partial \overline{w}_{ik} = 0$ and $\partial J / \partial \underline{w}_{ik} = 0$ if the roots exist. In practice, $J$ usually exhibits several local minima. Therefore, it is inevitable that the upper and lower bounds of the weights obtained by the ISOM may converge only to local minima. However, it has been observed that introducing a neighborhood function $h_{ci}$ that has a very large range at the beginning and gradually shrinks during the learning process enables the ISOM algorithm, to some extent, to achieve a good global ordering. To avoid an essential error (i.e., $\underline{w}_{ik} > \overline{w}_{ik}$), the learning rate $\alpha(t)$ and the neighborhood function should be chosen equal in (6) and (7). With $\alpha_1 = \alpha_2 = \alpha$, subtracting (7) from (6) gives $\mathrm{rad}(w_i(t+1)) = (1 - \alpha(t) h_{ci}(t))\, \mathrm{rad}(w_i(t))$, so the radius remains nonnegative whenever $\alpha(t) h_{ci}(t) \le 1$.

The conventional SOM and the ISOM algorithm are essentially different. The latter considers not only the dissimilarity between the interval weight and the training data but also the effect of the interval radius. The interval weight may degenerate to an exact weight as $t \to \infty$, in which case the ISOM degenerates to the conventional SOM. In practical applications to identification, the proposed algorithm shows good generalization performance because the interval weight can control the error effectively, which enhances the robustness of the network model.
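Taken together, steps (a)-(f) and the update rules (4)-(8) of Sect. 2.2 can be sketched as a short training loop. This is a minimal sketch under several assumptions that are not in the paper: the nodes lie on a one-dimensional chain, h0 is fixed at 1, the gain and neighborhood width decay linearly, the same gain is used for both bounds, and (6) is taken as the upper-bound update and (7) as the lower-bound update. The names `criterion` and `isom_train` are hypothetical.

```python
import math
import random


def criterion(x, lo_i, hi_i):
    """Criterion (10) for a crisp input x (radius 0) and one interval weight."""
    dm = [xk - (l + h) / 2 for xk, l, h in zip(x, lo_i, hi_i)]
    dr = [(h - l) / 2 for l, h in zip(lo_i, hi_i)]
    return (sum(v * v for v in dm) + sum(v * v for v in dr)
            + sum(abs(a) * b for a, b in zip(dm, dr)))


def isom_train(data, n_nodes, radius0=0.1, epochs=20, alpha0=0.5, seed=0):
    """Train a 1-D chain of interval weights; returns (lower, upper) bounds."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = n_nodes / 2  # assumed initial neighborhood width
    # (a) random midpoints in [0, 1) and equal radii
    mids = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    lo = [[v - radius0 for v in m] for m in mids]
    hi = [[v + radius0 for v in m] for m in mids]
    total = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data:  # (b) present each input to every node
            frac = t / total
            alpha = alpha0 * (1 - frac)            # decreasing gain in (0, 1)
            sigma = max(sigma0 * (1 - frac), 0.5)  # decreasing width
            # (c) best-matching node under the criterion
            c = min(range(n_nodes), key=lambda i: criterion(x, lo[i], hi[i]))
            # (d) recentre the winner on x where x lies outside the interval
            for k in range(dim):
                if (hi[c][k] - x[k]) * (lo[c][k] - x[k]) > 0:
                    r = (hi[c][k] - lo[c][k]) / 2
                    lo[c][k], hi[c][k] = x[k] - r, x[k] + r
            # (e) updates (4)-(7) with Gaussian neighborhood (8), h0 = 1
            for i in range(n_nodes):
                h = math.exp(-((i - c) ** 2) / sigma ** 2)
                for k in range(dim):
                    m = (lo[i][k] + hi[i][k]) / 2
                    r = (hi[i][k] - lo[i][k]) / 2
                    hi[i][k] += alpha * h * (x[k] - m - r)
                    lo[i][k] += alpha * h * (x[k] - m + r)
            t += 1  # (f) next sample
    return lo, hi


if __name__ == "__main__":
    # Toy two-dimensional data, for illustration only.
    samples = [[0.1, 0.2], [0.9, 0.8], [0.15, 0.25], [0.85, 0.75]]
    lower, upper = isom_train(samples, n_nodes=5)
    for l, u in zip(lower, upper):
        print([round(v, 3) for v in l], [round(v, 3) for v in u])
```

Because both bounds share the same gain, each update multiplies the radius by a factor between 0 and 1, so the intervals stay well ordered and shrink as training proceeds.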

3 Nonlinear System Identification and Inverse Control

In nonlinear system identification and control, the approximation precision and the generalization ability should be considered simultaneously. Nevertheless, generalization ability deserves more attention in practical applications. It is important to establish a valid model that constrains the control error, and the ISOM is an effective approach for this. The SOM was previously applied to learn static input–output mappings. The vector-quantized temporal associative memory (VQTAM) approach enables the SOM and other unsupervised networks to learn dynamical mappings. We generalize this approach to the ISOM to learn dynamic mappings. We are interested in systems that can be modeled by the following nonlinear discrete-time difference equation:

$$y(t+1) = f[y(t), \ldots, y(t - n_y + 1);\ u(t), \ldots, u(t - n_u + 1)], \qquad (18)$$

where $n_y$ and $n_u$ are the (memory) orders of the dynamical model. In many situations, it is also desirable to approximate the inverse mapping of a nonlinear plant:

$$u(t) = f^{-1}[y(t+1), \ldots, y(t - n_y + 1);\ u(t-1), \ldots, u(t - n_u + 1)]. \qquad (19)$$

The weight vector of neuron $i$, $w_i(t)$, has its dimension increased accordingly. These changes are written as

$$w_i(t) = \begin{pmatrix} w_i^{in}(t) \\ w_i^{out}(t) \end{pmatrix} \quad \text{and} \quad x(t) = \begin{pmatrix} x^{in}(t) \\ x^{out}(t) \end{pmatrix}, \qquad (20)$$

where $w_i^{in}(t)$ and $w_i^{out}(t)$ are the portions of the weight vector that store information about the inputs and the outputs of the mapping being studied. To approximate the forward dynamics in (18), the following definitions apply:

$$x^{in}(t) = [y(t), \ldots, y(t - n_y + 1);\ u(t), \ldots, u(t - n_u + 1)], \qquad x^{out}(t) = y(t+1). \qquad (21)$$

In inverse controller design, one defines

$$x^{in}(t) = [y(t+1), \ldots, y(t - n_y + 1);\ u(t-1), \ldots, u(t - n_u + 1)], \qquad x^{out}(t) = u(t). \qquad (22)$$

Fig. 1. Identification with ISOM

Fig. 2. Structure of inverse control system with ISOM


According to the ISOM algorithm, the output of the controller is an interval number. To convert the interval number to an exact one, we take the midpoint of the interval as the exact value. To construct the inverse controller, a forward dynamic identification model must be built first. The ISOM model is then cascaded with the original system to form a pseudo-linear system. Fig. 1 and Fig. 2 show the identification and inverse control structures, respectively.
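The regressor definitions in (21) and (22) can be illustrated by building training pairs from recorded input and output sequences. This is a minimal sketch with hypothetical function names; it assumes that `y` and `u` are equally long lists indexed by the time step t.

```python
def forward_samples(y, u, ny, nu):
    """Pairs (x_in, x_out) for the forward model, following (21)."""
    start = max(ny, nu) - 1  # first t for which all delayed terms exist
    samples = []
    for t in range(start, len(y) - 1):
        x_in = [y[t - k] for k in range(ny)] + [u[t - k] for k in range(nu)]
        samples.append((x_in, y[t + 1]))
    return samples


def inverse_samples(y, u, ny, nu):
    """Pairs (x_in, x_out) for the inverse model, following (22)."""
    start = max(ny, nu) - 1
    samples = []
    for t in range(start, len(y) - 1):
        # y(t+1), y(t), ..., y(t-ny+1) followed by u(t-1), ..., u(t-nu+1)
        x_in = ([y[t + 1 - k] for k in range(ny + 1)]
                + [u[t - k] for k in range(1, nu)])
        samples.append((x_in, u[t]))
    return samples


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    y = [0.0, 1.0, 2.0, 3.0]
    u = [10.0, 11.0, 12.0, 13.0]
    print(forward_samples(y, u, ny=2, nu=2))
    print(inverse_samples(y, u, ny=2, nu=2))
```

In the inverse case the desired next output y(t+1) enters the input part and the control u(t) becomes the output part, so at run time the reference signal would take the place of y(t+1).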

4 Simulation

The nonlinear system is assumed to be of the form

$$y(t+1) = \frac{y(t)}{1 + y^2(t)} + u^2(t). \qquad (23)$$

The input $u(t)$ is a random input in the interval [0, 1]. The added noise is white Gaussian noise with a variance of 0.05. The results are given in Table 1 and Table 2 in the form of mean-squared error (MSE). In addition, the radius of the weights can directly affect the precision of identification and control. Fig. 3 shows the inverse control system tracking a sine wave with a radius of 0.03 and accurate data. The influence of the interval radius on the MSE of inverse control is shown in Fig. 4.

Table 1. Identification precision with ISOM and SOM

Type    MSE (accurate data)    MSE (noise data)
ISOM    0.02081                0.02990
SOM     0.03140                0.04450

Table 2. Inverse control error with ISOM

Type of data     Generalization error    Tracking error
Accurate data    0.02972                 0.0544
Noise data       0.03135                 0.0720

Fig. 3. The inverse control system tracking sine wave using ISOM

Fig. 4. The MSE of inverse control influenced with interval radius

The results show that the ISOM algorithm can identify the nonlinear system effectively. It has better approximation and generalization abilities than the conventional SOM. The inverse control system constructed using the ISOM is more robust, which makes it a valid approach to nonlinear system control.
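The data-generating setup of this section can be sketched as follows. This is an illustration only: it reads (23) as y(t+1) = y(t)/(1 + y²(t)) + u²(t), and it assumes a zero initial output and noise added to the recorded outputs, neither of which is specified in the text. The function names are hypothetical.

```python
import random


def simulate_plant(n_steps, noise_var=0.0, seed=0):
    """Simulate plant (23) driven by a random input in [0, 1).

    Returns (u, y, y_noisy). Assumptions: y(0) = 0 and zero-mean Gaussian
    noise of variance noise_var added to the recorded outputs.
    """
    rng = random.Random(seed)
    u, y = [], [0.0]
    for _ in range(n_steps):
        u_t = rng.random()
        u.append(u_t)
        y.append(y[-1] / (1 + y[-1] ** 2) + u_t ** 2)
    if noise_var > 0:
        std = noise_var ** 0.5
        y_noisy = [v + rng.gauss(0.0, std) for v in y]
    else:
        y_noisy = list(y)
    return u, y, y_noisy


def mse(a, b):
    """Mean-squared error between two equally long sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


if __name__ == "__main__":
    u, y, y_noisy = simulate_plant(500, noise_var=0.05)
    print("MSE between clean and noisy outputs:", mse(y, y_noisy))
```

Sequences generated this way could be turned into forward or inverse training pairs and compared through the MSE, in the spirit of Table 1 and Table 2.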

5 Conclusion

In this paper, the ISOM algorithm is proposed based on the SOM algorithm and interval analysis theory. However, the ISOM algorithm is essentially different from the SOM algorithm because the ISOM criterion function considers both empirical risk (approximation precision) and structural risk (structure complexity). First, the set of interval weight vectors tends to describe the density function of the input vectors by modifying the interval midpoints and radii, whereas the standard SOM can only modify weight vectors. In the self-organizing process, a smaller radius of an interval weight suggests a higher distribution density of the input vectors. The ISOM is therefore more accurate in modeling and control (Table 1 and Table 2). Second, local interactions between processing units still tend to preserve the continuity of the interval weight vectors, just as in the SOM. The interval weights strike a balance between describing the density function of the input vectors and preserving the continuity of the interval weights. The simulation results confirm the validity of the ISOM algorithm in system identification and nonlinear control.


A Dual-Mode Learning Mechanism Combining Knowledge-Education and Machine-Learning

Yichang Chen1 and Anpin Chen2

1 Department of Information Management, NPIC, No.51, Minsheng E. Rd., Pingtung City, Pingtung County, Taiwan 900, R.O.C
2 Institute of Information Management, NCTU, No. 1001, University Road, Hsinchu, Taiwan 300, R.O.C
[email protected], [email protected]

Abstract. Since 1956, the definitions of learning used in Artificial Intelligence and in Psychology with respect to the human mind and behavior have clearly differed. Owing to the rapid growth of computing power, it is now possible to enhance the learning mechanism of AI. This work discusses the learning process starting from traditional AI learning models, which are almost all based on trial and error. Some related literature has pointed out that teaching-based education increases learning efficiency more than trial and error does. For that reason, we extend the learning process and propose a dual-perspective learning mechanism, E&R-R XCS. Because XCS is one of the more accurate AI models, we use it as the foundation for developing an intelligence-learning model. Finally, this work gives an inference-based discussion of the accuracy and accumulative performance of XCS, R-R XCS, and E&R-R XCS, respectively, and draws a summary from it: the proposed dual-learning mechanism is a successful enhancement.

Keywords: Artificial intelligence, Psychology, Trial and error, Teaching-base education, Intelligence-Learning.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 87–96, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Traditionally, Artificial Intelligence, as defined in Computer Science, builds helpful machines that find solutions to complex problems in a more human-like fashion [1,2]. This generally involves adopting characteristics of human intelligence and applying them as algorithms in a computer-friendly way. A more or less flexible or efficient approach can be taken depending on the established requirements, which influences how artificial the intelligent behavior appears. Such research, for example Neural Networks, Fuzzy Approaches, Genetic Algorithms, and so on, focuses on Soft Computing. XCS (Extended Classifier System) is likewise a hybrid approach that performs well in accuracy and rule evolution for prediction applications. However, up to now, the Artificial Intelligence techniques based on Soft Computing, including the family of evolutionary approaches, have all relied on the trial-and-error or stimulus-response method, such as [1,2] and [3], to construct their learning models. To illustrate, suppose the Chinese idiom “An Illusory Snake in a Goblet” is taken as an input-output pattern for training the learning model. A model will certainly be formed, but it is in fact a wrong model trained on a bad experience. Besides, the parameters of such trained models are strongly affected by the input dataset, especially when the training inputs differ greatly from the testing inputs. In many studies, input and output datasets with a high correlation are chosen, or the strong assumption is made that the inputs and outputs are relevant. Thus, a subjective black-box view and a tuning view are easily reached [4].

The other sub-domain is the Expert System, whose primary goal is to make expertise available to decision makers and technicians who need answers quickly. There is never enough expertise to go around; certainly it is not always available at the right place at the right time. The same systems' in-depth knowledge of specific subjects can assist supervisors and managers with situation assessment and long-range planning. These knowledge-based applications of artificial intelligence have enhanced productivity in business, science, engineering, and even the military. Expert systems take the opposite extreme and construct domain knowledge first; for that reason, however, they lack flexibility and adaptability. In fact, each new deployment of an expert system yields valuable data about what works in which context, thus fueling the AI research that provides even better applications. Much research, whether on Soft Computing techniques or on Expert Systems, tries to simulate a human-like way of thinking. From the viewpoint of classical psychology, however, research on the human mind is research on human behavior.
Since Plato, Psychology has been an unfathomable philosophy, and advanced AI researchers should pay attention to the development of Human Psychology, from simple to complex and from a single factor to multiple factors. However, traditional AI techniques seldom focus on the high-level processes of the human mind and attend only to the definition of learning from Empiricist Psychology. In Modern Psychology, the core has already shifted from an Empiricism base to the Information Process Theory of the human mind, and even to a Cognitive Psychology base. As for knowledge and model construction, the teaching-based aspect has also been brought into the learning process. On this basis, this work tries to enhance the learning process of traditional AI techniques, whose definition of learning has cognitive blind spots (scotomas), and develops a novel learning model that incorporates the concept of Cognitive Psychology and uses the highly accurate predictive XCS [5] model as its foundation.

2 Related Survey

2.1 Information Process Theory

Previous learning-based artificial intelligence techniques, such as neural networks and their hybrid methods, all form their models by trial-and-error learning, the traditional definition of learning. To enhance the learning style, however, a cognitive account of learning, Information Process Theory, is worth taking into consideration. According to the information-processing model of learning (see Fig. 1), there is a series of stages by which new information is learned (Gagne, 1985) [6]. Information is received by receptors (such as the eyes and ears), from which it is passed to the sensory register, where all of it is held, but only for a few hundredths of a second. At this point, selective perception acts as a filter that causes some aspects of the information to be ignored and others to be attended to. For example, the ears (receptors) receive the sounds comprising “Pi equals 3.14” along with various other background sounds, and all those sounds are passed on to the sensory register in the brain. Then, through the selective perception process, some of the information (hopefully the “Pi equals 3.14”) is attended to. The information that is attended to is transformed and passed on to short-term memory, which can contain only a few items of information at a time (depending on their complexity). For instance, if “Pi equals 3.14” is attended to, it is passed on to short-term memory, where it might be said to “echo” for a few seconds, and the echoing can be prolonged through rehearsal. Items can persist in short-term memory for up to about 20 seconds without rehearsal, but with constant rehearsal they can be retained indefinitely. Finally, the information may be passed on to long-term memory; this process is called encoding. For example, if appropriate encoding processes are exercised to link “Pi equals 3.14” with prior knowledge, then the information is passed on to long-term memory. In the traditional model of human memory (Atkinson and Shiffrin, 1968 [7]; Waugh and Norman, 1965 [8]), immediate free recall yields items directly retrieved from a temporary short-term memory (STM) and items retrieved by retrieval cues from a more durable storage in long-term memory (LTM).

Fig. 1. Information Process Theory proposed by Gagne [9]
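The stages described above (sensory register, selective perception, short-term memory with rehearsal, and encoding into long-term memory) can be pictured with a small toy model. This is only an illustrative sketch: the class and method names are hypothetical, and the capacity of seven items is an assumption, since the text says only that short-term memory holds "a few items". The 20-second lifetime is taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryModel:
    """Toy model of the information-processing stages (all names hypothetical)."""
    stm_capacity: int = 7        # assumed value; the text only says "a few items"
    stm_lifetime: float = 20.0   # seconds an item survives without rehearsal
    stm: dict = field(default_factory=dict)  # item -> time of last rehearsal
    ltm: set = field(default_factory=set)

    def perceive(self, sounds, attended, now):
        """Selective perception: only attended items pass on to short-term memory."""
        for item in sounds:
            if item in attended:
                if item not in self.stm and len(self.stm) >= self.stm_capacity:
                    # Displace the least recently rehearsed item.
                    oldest = min(self.stm, key=self.stm.get)
                    del self.stm[oldest]
                self.stm[item] = now

    def rehearse(self, item, now):
        """Rehearsal resets the decay clock of an item still held in STM."""
        if item in self.stm:
            self.stm[item] = now

    def decay(self, now):
        """Drop items that were not rehearsed within the STM lifetime."""
        self.stm = {i: t for i, t in self.stm.items()
                    if now - t <= self.stm_lifetime}

    def encode(self, item):
        """Encoding: copy an item that is currently in STM into LTM."""
        if item in self.stm:
            self.ltm.add(item)
```

In this sketch an item that is rehearsed often enough never decays, while an encoded item remains in long-term memory even after it has left short-term memory.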

2.2 XCS

Most machine learning techniques are developed from information process theory (IPT). Whether they apply part of the IPT concept or its entire flow, they all simulate various operations of memory. For example, neural network models are applications of neuroanatomy, and for them it is necessary to define the neural structures of the brain that are simulated as memory. Others are evolutionary computing models, such as GA, GP, and LCSs. Among them, an LCS is flexible in rule generation: as John Holland described, it represents information about the structure of the world in the form of rules and messages on an internal message list, which plays the role of its STM or LTM. The system can use the message list to store information about (a) the current state of the world (response) and (b) previous states (stimulus). An LCS therefore has the ability to store rules according to the input information.


Fig. 2. XCS Procedure

However, Wilson’s XCS [10] is a recently developed learning classifier system (LCS) that differs in several ways from more traditional LCSs. In XCS, classifier fitness is based on the accuracy of a classifier's pay-off prediction instead of the prediction itself. In addition, the genetic algorithm (GA) takes place in the action sets instead of the whole population. XCS's fitness definition and GA locus together result in a strong tendency for the system to evolve accurate, maximally general classifiers that efficiently cover the state-action space of the problem and allow the system's “knowledge” to be readily seen. Because of these properties, XCS has been adopted as the kernel of the model proposed in this work. XCS’s detailed loop is shown in Fig. 2. First, the current situation is sensed: the detector receives the input from the environment. Second, the match set [M] is formed from all classifiers in the population [N] that match the situation. Third, the prediction array [PA] is formed based on the classifiers in the match set [M]; [PA] predicts the resulting pay-off for each possible action ai. Based on [PA], one action is chosen for execution and the action set [A] is formed, which includes all classifiers of [M] that propose the chosen action. Next, the winning action is executed. Then the previous action set [A]-1 is modified using the Q-learning-like payoff quantity P, which is a combination of the previous reward p-1 and the largest action prediction in the prediction array [PA]. Moreover, the GA may be applied to [A]-1. If a problem ends on the current time-step (a single-step problem or the last step of a multistep problem), [A] is modified according to the current reward, p, and the GA may be applied to [A]. The loop is executed as long as the termination criterion is not met. A termination criterion is a certain number of trials/inputs. Finally, XCS’s architecture is a much neater development based on IPT than the previous models.

However, XCS is not sufficient to represent IPT. The following discussion explains why.
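The loop described above (match set, prediction array, action set, payoff update) can be sketched in a few lines of Python. This is a heavily simplified illustration, not the algorithm of [10]: covering, the GA, fitness updates, and exploration are all omitted, the function and field names are hypothetical, and the learning rate `beta` and discount `gamma` values are assumptions chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    condition: str            # string over {'0', '1', '#'}; '#' matches either bit
    action: int
    prediction: float = 10.0  # pay-off prediction
    error: float = 0.0        # prediction error estimate
    fitness: float = 0.1


def matches(cl, state):
    """A classifier matches when every non-'#' position equals the state bit."""
    return all(c in ('#', s) for c, s in zip(cl.condition, state))


def match_set(population, state):
    """[M]: all classifiers in the population that match the current situation."""
    return [cl for cl in population if matches(cl, state)]


def prediction_array(m_set):
    """[PA]: fitness-weighted average pay-off prediction per action in [M]."""
    num, den = {}, {}
    for cl in m_set:
        num[cl.action] = num.get(cl.action, 0.0) + cl.prediction * cl.fitness
        den[cl.action] = den.get(cl.action, 0.0) + cl.fitness
    return {a: num[a] / den[a] for a in num if den[a] > 0}


def payoff(prev_reward, pa, gamma=0.71):
    """Q-learning-like target P for the previous action set (multistep case)."""
    return prev_reward + gamma * max(pa.values())


def update(action_set, target, beta=0.2):
    """Move error and prediction of each classifier in [A] toward the target."""
    for cl in action_set:
        cl.error += beta * (abs(target - cl.prediction) - cl.error)
        cl.prediction += beta * (target - cl.prediction)


def single_step_trial(population, state, reward_fn):
    """One trial of a single-step problem with purely greedy action selection."""
    m_set = match_set(population, state)
    pa = prediction_array(m_set)
    action = max(pa, key=pa.get)
    a_set = [cl for cl in m_set if cl.action == action]
    update(a_set, reward_fn(state, action))  # single step: target is the reward
    return action
```

For a multistep problem, `payoff` would supply the target for the previous action set [A]-1, while the final step uses the current reward directly, as `single_step_trial` does here.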


2.3 Discussion

Using a computer as a metaphor for memory, short-term memory is RAM (highly volatile and easily lost when something else is entered), while long-term memory is like a hard drive or diskette (the information stays stored even after the machine is turned off). This metaphor is especially helpful because a computer knows the address of each bit of information as a result of the manner in which the information is entered. It is essential that information placed into a student's long-term memory be linked in a way that lets the student retrieve it later. A teacher who understands the relationship between memory and retrieval can lay out a lesson plan to assist the student in the process and enhance his learning. As stated earlier, while rehearsal is important to short-term memory, it can also be used to transfer information into long-term memory. Elaborating or making material memorable will also enhance the student's learning process. The effective teacher will elaborate and rehearse material so that the student can remember the information more easily. That is why the input material is highly relevant to memorizing and to forming valued information, that is, knowledge. However, it is important to note that most applied AI models, even the XCS model, have trouble determining which data they should remember or learn. Therefore, the complete learning procedure, in which an effective teacher helps the memory process by introducing the student to various organizational techniques, cannot be realized in them.

3 Dual-Mode Learning Mechanism

3.1 Conceptual Framework

During the Middle Period (mid 1900s), Knowledge was thought of simply as the transformation of sensory inputs into associated thought, together with the realization that sensory inputs are transformed prior to storage. In the early twentieth century, Knowledge was still considered within a framework of stimulus and response (S-R). The profound breakthrough of this period was that, by studying S-R, one can gain insight into the working of cognitive knowledge. This view of knowledge learning is largely based on the narrow sense of cognitive psychology, namely information processing theory. Furthermore, S-R research in cognitive psychology is historically analogous to black-box testing. Following these two aspects, this work applies cognitive learning to modify the learning process of traditional soft computing techniques in order to increase the efficiency of forming knowledge storage. That is, the purpose of this work is to combine information process theory and knowledge learning to initiate the concept of the dual learning mode framework shown in Fig. 3. It contains two parts: Knowledge Education learning and Reinforcement-Rehearsal (R-R) learning.

Fig. 3. Dual perspective learning process of Education and R-R mechanism

3.2 Proposed Model (E&R-R XCS)

The R-R XCS model, an intermediate version in the development of E&R-R XCS, is itself an enhanced version of XCS obtained by adding a rehearsal mechanism. Because both adopt the GA as the evolution methodology for classifiers and both are based on XCS, their accuracy rates should be equivalent given the same training and testing data. The leverage of R-R XCS relative to XCS deserves mention. The reasonable assumption is that R-R XCS has higher leverage than XCS, because R-R XCS automatically gives more consideration to valuable information. Its performance, however, may decrease, and its accuracy ratio might not be better than that of XCS [11].

Education & R-R XCS (E&R-R XCS), an implementation of the proposed learning concept, aims to increase the accuracy ratio by taking the educational efficiency of learning into account. In Fig. 4, there are two starting points, which differs from XCS, and E&R-R XCS includes the R-R XCS discussed above. In addition, in the education learning part, discovered knowledge, verified theories, and defined theorems are all considered as input patterns to the mechanism. Such data should be valuable and worth “teaching” to the model, that is, worth being learned by it. Thus, we add a practice route to the education part. (E&R-R XCS is a “model”, not a student, and the model does not need to practice more than twice.) In this way, those input data are easily memorized and stored by the receiver and internalized into the knowledge rule base [N1]. The population in the knowledge rule base has a higher weight or effectiveness than that in the experience rule base, and the detector should give more consideration to the knowledge rule base [N1] than to the experience rule base [N2]. WM still stores the current situation in advance. Second, the match set [M] is formed from [N1] or [N2], that is, from either the knowledge rule base or the experience rule base. The following steps are the same as those of R-R XCS. The difference is that the initially picked population comes more from the knowledge rule base than from the experience rule base. In the mechanism, this population from the knowledge rule base acts as the “principle”.

When the entire loop has finished, the new population should be generated from the knowledge rule base into the experience rule base. Some experiences may be produced from real knowledge, if that knowledge really exists. Furthermore, when a rehearsal population passes from the repeater to the detector, the detector should verify whether the repeated population qualifies to be transferred to the receiver. That is, the knowledge population comes not only from the outside environment but also from the internal mechanism. Educational knowledge may also be added to the knowledge rule base [N1] when new knowledge or theory is discovered. The remaining procedures are the same as in XCS and have already been detailed in the previous section.
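One way to picture the preference for the knowledge rule base [N1] over the experience rule base [N2] when the match set is formed is sketched below. The text gives no numerical weighting, so the factor `knowledge_weight=2.0`, the rule representation as (condition, action) pairs, and the voting step are all assumptions made purely for illustration.

```python
def matches(condition, state):
    """A condition matches when every non-'#' position equals the state symbol."""
    return all(c in ('#', s) for c, s in zip(condition, state))


def form_match_set(knowledge_rules, experience_rules, state, knowledge_weight=2.0):
    """Return (rule, weight) pairs for [M]; rules from [N1] get the higher weight."""
    m_set = [(r, knowledge_weight) for r in knowledge_rules if matches(r[0], state)]
    m_set += [(r, 1.0) for r in experience_rules if matches(r[0], state)]
    return m_set


def vote(m_set):
    """Pick the action with the largest total weight among the matching rules."""
    totals = {}
    for (condition, action), weight in m_set:
        totals[action] = totals.get(action, 0.0) + weight
    return max(totals, key=totals.get) if totals else None
```

With one matching knowledge rule proposing action "a" and one matching experience rule proposing action "b", the knowledge rule wins the vote, which mirrors the role of knowledge rules as the “principle”.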


Fig. 4. E&R-R XCS Procedure

4 Inference Discussion

In practice, the case of adding new population from the detector does not happen easily. E&R-R XCS imposes a strict discipline on knowledge; in fact, the percentage of knowledge passed from the detector to the receiver is low. That is, the population in the knowledge rule base should be kept strictly correct. Following these descriptions, three inferences about the models can be deduced in this section. Their theoretical accuracy and accumulated performance are detailed in turn below. The x-axis, Time, in Figs. 5, 6, and 7 may mean either time or the number of times the model has been operated. The y-axis is the theoretical accuracy or the accumulated performance.

4.1 In Fig. 5, γ is defined as the difference between the accuracy ratios of R-R XCS and XCS. A second quantity is defined as the difference between the accuracy ratios of E&R-R XCS and XCS. It is reasonable to expect that this second difference is much larger than |γ|, and that |γ| >= 0. The explanation is that R-R XCS, with rehearsal learning, focuses on valuable information. When γ is approximately zero, the two models are applied to all of the original data. When |γ| is much larger than zero, the two models are applied to identify the result for valuable information. As for the second difference, owing to the educational efficiency of learning, it should be larger, which means that the accuracy ratio of E&R-R XCS is much better than that of XCS.

4.2 In Fig. 6, μ is defined as the difference between the accumulative outputs of R-R XCS and XCS. It is reasonable that |μ| >= 0. The explanation is that R-R XCS, with rehearsal learning, focuses on valuable information, but its accuracy rate is not necessarily better than that of XCS. Indeed, the leverage effect of R-R XCS originates from its focus on more valuable information. If an output is correct and contributes positively to the result, the accumulative output increases more. Conversely, if an output is wrong, the accumulative output decreases more as well.
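The leverage argument can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that each correct output adds a fixed amount to the accumulative output and each wrong output subtracts the same amount, and that leverage simply scales that amount. The function name and the numbers are hypothetical and do not come from the paper.

```python
def accumulate(outcomes, leverage=1.0):
    """Running total: +leverage for a correct output, -leverage for a wrong one."""
    total, curve = 0.0, []
    for correct in outcomes:
        total += leverage if correct else -leverage
        curve.append(total)
    return curve


# Same sequence of correct/wrong outputs, different leverage.
outcomes = [True, True, False]
baseline = accumulate(outcomes, leverage=1.0)    # stands in for XCS
leveraged = accumulate(outcomes, leverage=2.0)   # stands in for R-R XCS
mu = leveraged[-1] - baseline[-1]
```

Under these assumptions, the sign of μ depends on whether correct outputs outnumber wrong ones: equal accuracy with higher leverage amplifies both gains and losses.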

Fig. 5. Theoretical Accuracy of XCS, R-R XCS, and E&R-R XCS

Fig. 6. Theoretical-Accumulative Performance of XCS and R-R XCS

Fig. 7. Theoretical-Accumulative Performance of XCS and E&R-R XCS


4.3 In Fig. 7, μ1 and μ2 are defined as differences between the accumulative outputs of E&R-R XCS and XCS. It is reasonable that |μ1| >> |μ2| >= 0. The explanation is that E&R-R XCS not only has rehearsal learning focused on valuable information but also involves the educational efficiency of learning. Therefore, its accuracy rate is definitely better than that of XCS. E&R-R XCS also retains the leverage effect, which has the same origin as in R-R XCS. Because the accuracy ratio is increased, the output is usually positive for the result, and the accumulative output increases much more. In short, the learning accuracy of the proposed E&R-R XCS is much better than that of XCS, and R-R XCS, compared with XCS, at least has a leverage effect on accuracy and accumulated performance.

5 Conclusion

As described in the motivation of this work, knowledge discovery, theory verification, and theorem definition are aggregated, rather than disregarded, by the learning mechanism developed here. They also continue to accumulate over history, which is why civilization advances, culture accumulates, and knowledge is transmitted. Finally, this work proposes an efficient dual-mode learning mechanism that combines passive learning (knowledge education) and self-learning (machine learning) [12]. That is, the major contribution of this work is the proposed mechanism. Once a more accurate AI technique is invented, it could be substituted for XCS, and the performance of the mechanism would become more efficient.

References

1. McCarthy, J.: Generality in Artificial Intelligence. Communications of the ACM 30(12), 1030–1035 (1987)
2. Ghirlanda, S., Enquist, M.: Artificial Neural Networks as Models of Stimulus Control. Animal Behavior 56(6), 1383–1389 (1998)
3. Ghirlanda, S., Enquist, M.: The Geometry of Stimulus Control. Animal Behavior 58(4), 695–706 (1999)
4. Chiew, V.: A Software Engineering Cognitive Knowledge Discovery Framework. In: 1st IEEE International Conference on Cognitive Informatics, pp. 163–172. IEEE Press, Calgary (2002)
5. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation 3(2), 149–175 (1995)
6. Gagne, R.M.: The Conditions of Learning and Theory of Instruction. Holt, Rinehart & Winston, New York (1985)
7. Atkinson, R.C., Shiffrin, R.M.: Human Memory: A Proposed System and Its Control Processes. The Psychology of Learning and Motivation: Advances in Research and Theory, vol. 2, pp. 89–195. Academic Press, New York (1968)
8. Waugh, N., Norman, D.A.: Primary Memory. Psychological Review 72, 89–104 (1965)
9. Gagne, R.M., Medsker, K.L.: The Conditions of Learning. Training Applications. Harcourt Brace, New York (1996)
10. Butz, M.V., Wilson, S.W.: An Algorithmic Description of XCS, Soft Computing - A Fusion of Foundations. Methodologies and Applications. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 144–153. Springer, Heidelberg (2004)
11. Chen, Y.C.: Applying Cognitive Learning to Enhance XCS to Construct a Dual-Mode Learning Mechanism of Knowledge-Education and Machine-Learning - an Example of Knowledge Learning on Finance Prediction. PhD Thesis. National Chiao Tung University, Taiwan (2005)
12. Piaget, J.: Structuralism. Harper & Row, New York (1970)

The Effect of Task Relevance on Electrophysiological Response to Emotional Stimuli*

Baolin Liu1, Shuai Xin1, Zhixing Jin1, Xiaorong Gao2, Shangkai Gao2, Renxin Chu3, Yongfeng Huang4, and Beixing Deng4

1 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, P.R. China
2 Department of Biomedical Engineering, Tsinghua University, Beijing 100084, P.R. China
3 Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
4 Department of Electronic Engineering, Tsinghua University, Beijing 100084, P.R. China
[email protected]

Abstract. To verify whether emotion processing is modulated by task relevance, two tasks are performed in this paper: a Simple Task and a Complex Task. In the Simple Task, negative pictures are target stimuli, while in the Complex Task white-framed negative pictures are target stimuli. Subjects are required to respond at the onset of a target stimulus. The EEG (electroencephalogram) epochs are averaged and ERP (event-related potential) components are obtained. The P300 amplitude is smaller in the Complex Task than in the Simple Task, which indicates that the emotion P300 is significantly modulated by task relevance. As the P1 and N1 amplitudes are also decreased in the Complex Task compared with the Simple Task, we suggest that the P1/N1 components elicited by emotional stimuli are modulated by task relevance, too.

Keywords: Emotion, ERP, P300, Attention, Task relevance.

* This work is supported by the National Basic Research Program of China (973 program) (No. 2006CB303100 and No. 2007CB310806), by the National High Technology Development Program of China (No. 2007AA010306 and No. 2006AA01Z444), and by the National Natural Science Foundation of China (No. 30630022).

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 97–106, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

ERPs evoked by emotional stimuli have been studied extensively, for example in work on the integration of emotion and working memory [1,2,3,4], the integration of emotion and inhibitory control [5,6,7], and the Attentional Blink effect for emotional stimuli [8,9]. Researchers have shown strong interest in P300 components, which are influenced by emotional valence. Previous studies found that the P300 waves evoked by negative pictures were stronger [10,11,12,13,14,15].

However, P300 is not the earliest ERP component related to emotion perception. There is a family of task-relevant ERP components prior to the P300 [16,17,18]. For example, an enlarged P1 component could be observed over posterior scalp sites contralateral to the attended visual field compared with the unattended field [19]; Heinze and Mangun [20] studied the ERP waves elicited by bilateral and unilateral stimuli and found that the early P1 component reflected the facilitation of visual inputs at attended locations in visual processing. The N1 component is also commonly regarded as an indicator of the emotion effect (more positive ERPs for pleasant and unpleasant stimuli than for neutral stimuli). In studies in which stimuli were presented rapidly and unpredictably to the right and left visual fields, paying attention to the events in one field produced an amplitude enhancement of the early P1, N1 and/or N2 components elicited by those stimuli over the contralateral occipital scalp [21,22,23,24].

Many ERP studies on the emotional modulation of attention and/or task performance have already been done. For instance, emotional modulation of attention shifting was investigated in Posner’s [25] spatial orienting task by conditioning the attention cue to an aversive white noise; Phelps et al. [26] provided evidence that emotion potentiates the effects of attention on low-level visual processing in stimulus-driven attention. Furthermore, stronger ERP components reflect the increased attentional resources devoted to the processing of emotional stimuli [27,28,29,30]. There are also some studies on the emotional modulation of task performance. For example, negative emotions have been demonstrated to improve task performance [31].

There has already been some research on emotion perception modulated by task relevance or attention. Gierych et al. [32] designed two experiments to investigate the ERP responses to “smile-provoking” pictures. In the first experiment, both affective stimuli were set as targets in an “oddball” procedure, being presented among the more frequent green disks.
Then, in the second experiment, they were both non-targets, whereas the green disks were task-relevant. Both experiments and all pairs of stimuli produced similar results, which indicated that affective stimuli might produce attentional reallocation of processing resources.

However, ERP research on task-modulated emotion is limited, and most current studies are based on fMRI (functional magnetic resonance imaging) or PET (positron emission tomography). Harlan et al. [33] conducted experiments with task-related fMRI to investigate how attentional focus could modulate the ERPs elicited by scenes that varied in emotional content. They showed that the response to emotional task-relevant scenes was strengthened. Meanwhile, in another fMRI study, Lane et al. [34] showed that the arousal effect was stronger when participants attended to their own emotional responses than when they attended to the spatial setting of the stimulus (indoor/outdoor/either). These results suggest a stronger arousal effect when subjects attend to the emotional aspect of a stimulus to a greater extent. Based on these experiments, we can infer that if attention is distracted by added factors, the arousal effect decreases because of the attention deficit.

In the perceptual grouping study by Han et al. [35], the stimulus arrays were either evenly distributed, grouped into rows or columns by proximity or similarity, surrounded by colored dots, or shown with a fixation cross. As a result, the elicited Pd100 was significantly modulated by task relevance. In the study of the Attentional Blink (AB) effect for emotional stimuli (120 pictures of the IAPS, the International Affective Picture System) [9], participants were required to name the black-framed target stimuli aloud. Using stimuli similar to theirs, we designed a Complex Task, in which the subjects were required to identify the white-framed negative stimuli, and a Simple Task, in which the subjects were only required to respond at the onset of a negative picture. We would like to verify whether emotion processing is modulated by task relevance.

2 Materials and Methods Twenty-seven subjects (13 females and 14 males), mean age 22.3 (±3.4) years, were recruited from undergraduate students of Tsinghua University. All the subjects participated in the two experiments. The participants were screened by phone and written questionnaire for history of neurological and psychiatric illness, drug abuse, and psychotropic medication use. They were all right-handed and had normal or corrected-tonormal vision. Handedness was measured using the Edinburgh Handedness Inventory (EHI) [36]. All participants filled in the Volunteer Screening Form and were paid (RMB20/ hour) for their participation. All experiments were conducted in accordance with the Declaration of Helsinki and all the procedures were carried out with adequate understanding of the subjects, who read and signed the Research Consent Form before participating in this research. All the subjects were required to complete the positive affect- negative affect scales (PANAS) questionnaire [37]. The PANAS was a 30 item questionnaire producing 6 scores of positive affects (PA) and negative affects (NA) altogether and had been correlated with both hemispheric asymmetry in brain processes of emotional perception and sensitivity to affective manipulation [38]. A one-way ANOVA (analysis of variance) analysis showed there was no significant difference between the subjects in their PANAS scores. 84 pictures were selected from the IAPS [39,40] according to the valence dimension (28 pleasant, 28 unpleasant, and 28 neutral). The pictures were divided into 2 groups, each group has 14 negative pictures, 14 positive, and 14 neutral. They were all presented in a pseudo random sequence. 24 pictures (8 pleasant, 8 unpleasant, and 8 neutral) for the Complex Task were selected and white-framed. Other 18 pictures were not framed. White-framed pictures were more than others to avoid oddball paradigm, which might elicit oddball P300. 
The stimuli were presented on a PC (Intel Pentium D, 3.0 GHz, 1 GB RAM) with a 22-inch color monitor. The screen was at a distance of 100 cm from the subjects, and the resolution was 1440 × 900 pixels. The luminance of the pictures was determined according to the research by Palomba et al. [41]. Whether an EEG epoch was judged negative or positive depended not only on the emotional valence of the picture estimated before the experiment, but also on the response of each subject. Therefore, if a stimulus, such as a spider, was considered negative by most people but not by a given subject, we excluded all the data concerning this stimulus after all the experiments. In the Simple Task, subjects were invited to the laboratory and were required to fill in the Research Consent Form. Then they completed the EHI and PANAS. We used a one-way ANOVA to exclude the left-handed subjects and the affect-deviant subjects.


B. Liu et al.

Subjects sat in front of the computer screen and received instructions explaining the experimental task. Stimuli were presented on a black background on a 22-inch color monitor at a viewing distance of 100 cm. The room was sound-attenuated and dimly lit. EEG signals were recorded from 19 scalp sites (Fp1/Fp2, F3/F4, C3/C4, P3/P4, O1/O2, F7/F8, T3/T4, T5/T6, Cz, Fz and Pz) according to the International 10/20 System [42], using a Neuroscan Synamps2 EEG/ERP system. The reference channel was linked to the earlobes. Once all the electrodes were attached, we checked each of them and made sure the impedance was < 5 kΩ.

When a subject was ready, a spoken introduction to the task was played to inform him/her that 42 pictures would be presented and that there would be a break every 14 pictures. The subject was required to pay full attention to each picture and to press "B" when an unpleasant picture was shown. He/she was instructed to wait until the input screen appeared and then to respond as accurately as possible. The entire task lasted about 7 minutes. Each picture stimulus was presented for 2 s, after which the program waited for the response; the response result screen was then shown to tell the subject what he/she had pressed. A 6 s interval occurred between two trials, during which the screen was black except for a cross at the center of the screen, on which the subject was instructed to fixate. The trials were in a pseudo-random order.

All subjects used both hands to make responses. The "B" key was selected as the response key for negative stimuli; the subject could select another key on the keyboard for both the positive and the neutral stimuli. We selected "B" because it is the first letter of "Bad" and lies just above the space bar, which was disabled to avoid misoperation. The self-selected key could counterbalance the hand differences between the subjects.
There were two candidate words on the result screen, "bad" (when "B" was pressed) or "not bad" (when another key was pressed), presented at the central position. After that, the white cross was shown for 6 s until the next picture appeared. No cross was shown while the pictures were being presented.

The Complex Task was modified from the Simple Task as follows. 42 pictures were used in this task: 14 pleasant (numbered 1-14), 14 unpleasant (numbered 15-28) and 14 neutral (numbered 29-42). Subjects could not see the internal numbering of the pictures. The following pictures were selected and white-framed: 1-8, 15-22 and 29-36. A subject was required to press "B" if a picture was both negative and white-framed, and to press the self-selected key in all other cases.

The EEG from each electrode site was digitized at 256 Hz with an amplifier band pass of 0.01-40 Hz, including a 50 Hz notch filter, and was stored for off-line averaging. After all the experiments, subjects were required to give each picture an integer score between 1 and 9 as a self-assessment of their emotional reactions, where 1 indicated a very unpleasant picture, 9 a very pleasant picture, and 5 a neutral stimulus. A one-way ANOVA on these scores showed no significant difference between the mean arousal ratings of the pictures in the two tasks.

3 Results Computerized artifact rejection was performed to discard epochs in which deviation in eye position, blinks, or amplifier blocking occurred [43]. The rejected epochs were

The Effect of Task Relevance on Electrophysiological Response to Emotional Stimuli


considered invalid. We selected the datasets of the best 20 subjects (7 females and 13 males), and all datasets of the other 7 subjects were rejected. Finally, about 20% of the selected 20 subjects' trials (the negative epochs) were rejected for violating these artifact criteria. The EEG epochs we selected spanned from 100 ms before stimulus onset to 900 ms after stimulus onset. Fig. 1 shows the average ERPs to unpleasant, pleasant, and neutral stimuli in the Simple Task. The amplitude differences are clearly visible: the P300 amplitude in response to unpleasant pictures is significantly higher than that in response to either pleasant or neutral pictures.
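The epoch window described above (100 ms before to 900 ms after stimulus onset, at the 256 Hz sampling rate given in Section 2) can be illustrated with a minimal sketch. The function names, the rounding of millisecond offsets to whole samples, and the handling of incomplete epochs are assumptions made for illustration; the paper does not describe its averaging software, and artifact rejection is not modeled here.

```python
def epoch_bounds(onset_sample, fs=256, pre_ms=100, post_ms=900):
    """Return (start, stop) sample indices of one epoch around a stimulus onset.

    Assumption: millisecond offsets are rounded to the nearest sample.
    """
    n_pre = round(pre_ms * fs / 1000)
    n_post = round(post_ms * fs / 1000)
    return onset_sample - n_pre, onset_sample + n_post


def average_erp(signal, onsets, fs=256, pre_ms=100, post_ms=900):
    """Average all epochs of one channel that lie fully inside the recording."""
    epochs = []
    for onset in onsets:
        start, stop = epoch_bounds(onset, fs, pre_ms, post_ms)
        if start >= 0 and stop <= len(signal):
            epochs.append(signal[start:stop])
    if not epochs:
        raise ValueError("no complete epochs available")
    length = len(epochs[0])
    # Point-by-point mean across epochs gives the time-locked average (ERP).
    return [sum(e[i] for e in epochs) / len(epochs) for i in range(length)]
```

With these assumptions each epoch covers 256 samples, and onsets too close to the start or end of the recording are skipped rather than padded.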

Fig. 1. The average ERPs to unpleasant, pleasant and neutral stimuli in the Simple Task. P300 amplitude in response to unpleasant pictures is significantly higher than that in response to either pleasant pictures or neutral pictures.

Studies using neuroimaging techniques and source modeling analyses have shown that the effects of emotional stimuli are strongest in occipital and posterior brain locations [44,45,46], so we paid most of our attention to occipital and posterior electrode sites: parietal electrodes (P3, P4 and Pz), occipital electrodes (O1 and O2), and temporal-occipital electrodes (T5 and T6). Additionally, what we cared about here was the modulation itself rather than the emotional valence. Therefore, we averaged the negative EEG epochs, which elicited more robust emotion effects, to obtain the emotional ERPs. We plotted the negative ERPs (ERP waves evoked by negative pictures, at sites T5, T6, P3, P4, O1, O2 and Pz) of both the Simple Task and the Complex Task in Fig. 2. The comparison clearly shows the decrease of P300 amplitude in the Complex Task at all sites, as well as the shorter P300 latency in the Complex Task at most of the sites.


Fig. 2. Averaged ERP waveforms evoked by negative pictures: Simple Task vs. Complex Task. We can see the P300 latencies are significantly shorter in the Complex Task than in the Simple Task at parietal electrode sites (P3, P4 and Pz) and temporal-occipital electrode sites (T5 and T6), and the P300 amplitudes are significantly larger in the Simple Task than in the Complex Task at all concerned sites.

The negative P300 amplitude was significantly larger in the Simple Task than in the Complex Task at all concerned sites (P3 [F(1,38)=82.84, p0, we can build a layered neural circuit Ψ defined by equation (1), and its fixed point can be viewed as a continuous map $F(x_1,\dots,x_m)=(F_1(x_1,\dots,x_m),\dots,F_q(x_1,\dots,x_m))$ from $[0,1]^m$ to $[0,1]^q$, such that $|F(x_1,\dots,x_m)-f(x_1,\dots,x_m)|<\varepsilon$, where $x_1,x_2,\dots,x_m$ are the inputs of the neural circuit. If there are in total m>2 layers in such a layered neural circuit Ψ, its first m-1 layers use binary logic, i.e. its first m-1 layers simulate Boolean formulas, and the q-value fuzzy logic is used only in the last layer; such a neural circuit is called a "Boolean layered neural circuit". Furthermore, for an arbitrary layered neural circuit Ψ whose fixed point function is $F(x_1,\dots,x_m)$, we can find a q-value fuzzy logical function $F'(x_1,\dots,x_m)=(F'_1(x_1,\dots,x_m),\dots,F'_q(x_1,\dots,x_m))$ of weighted Bounded operator, such that $|F(x_1,\dots,x_m)-F'(x_1,\dots,x_m)|<\varepsilon$.

Note: In this paper, the number of layers in a neural circuit is the number of weight layers.

Proof. We can use the universal approximation theorem (Haykin (1999) [7]) and Theorem 2 in [8] to prove this theorem. □

We can extend Theorem 1 to a much more general case: Theorem 2 states that all kinds of recurrent neural circuits described by the 1st order partial differential equations (PDEs) (4) can be simulated by neural circuits described by (1). The 1st order partial differential equation (4) has a strong ability to describe neural phenomena. The neural circuit described by (4) can have feedback. Because of the feedback in a recurrent neural circuit, chaos may occur in such a neural circuit.

\[
\begin{cases}
\dot{x}_1 = -a_1 x_1 + w_1 f_1(x_1, x_2, \dots, x_n) + u_1 \\
\dot{x}_2 = -a_2 x_2 + w_2 f_2(x_1, x_2, \dots, x_n) + u_2 \\
\quad\vdots \\
\dot{x}_n = -a_n x_n + w_n f_n(x_1, x_2, \dots, x_n) + u_n
\end{cases}
\tag{4}
\]

Nonlinear Complex Neural Circuits Analysis and Design


where every $f_i(x_1, x_2, \dots, x_n)$, $1 \le i \le n$, is a continuous and bounded function in the domain of the trajectory space TR, and we suppose $0 \le |x_i| \le C$ for every $i = 1, 2, \dots, n$.

Theorem 2. In a finite time range, $0 \le t \le T$, every neural circuit described by equation (4) can be simulated by a neural circuit described by equation (1) with an arbitrarily small error $\varepsilon > 0$, and such a neural circuit takes a Boolean layered neural circuit as its feedback part.

Proof. Omitted for reasons of space.



3 The Basic Neural Circuits and Nonlinear Complex Neural Circuits Analysis and Design

Electronic digital circuits can be classified as combinatorial circuits and time serial circuits; neural circuits can likewise be classified into two similar kinds, combinatorial neural circuits and time serial neural circuits.

Combinatorial neural circuits: A circuit is defined as a combinatorial neural circuit if its output at time t is determined by the inputs at time t.

Time serial neural circuits: A circuit is defined as a time serial neural circuit if its output at time t and the next state S(t+1) are determined not only by the inputs at time t, but also by the neural circuit's state at time t, S(t), and before time t.

A time serial neural circuit has some kind of feedback. The neural oscillator and neural registers are the two simplest kinds of time serial neural circuits. Because of the feedback, chaos may occur in a time serial neural circuit. We can prove that chaos may occur in a neural system described by (1).

Theorem 3. Chaos may occur in a neural circuit described by (1).

Proof. Omitted for reasons of space.



Chaotic behavior differs from ordinary non-chaotic behavior, so when we design a time serial neural circuit, two kinds of approaches should be considered. The first tries to design a time serial neural circuit with no chaos under a definite precision, and the other tries to design a chaotic time serial neural circuit under a set of precisions which can continuously control the calculation error to arbitrarily small levels. In this section we discuss this problem. All binary time serial circuits in digital computers (with finite-bit storage) work in a periodic way, but chaos causes a time serial neural circuit to work in an aperiodic way. In order to understand the function of a time serial neural circuit which works in a chaotic way, it is necessary to approximate chaotic neural circuits at arbitrary precision and make them work in a periodic (Turing computable) way. Roughly speaking, the reason is that only finitely many bits are needed to represent a rational number, whereas infinitely many bits are needed for an irrational number. In order to simulate the neural circuit, neural cells which are simulated by Turing machines would have to run infinitely many steps to compute irrational numbers. From an engineering point of view, it is impossible to wait for a Turing machine to run infinitely many steps, so approximation


H. Hu and Z. Shi

of irrational numbers is necessary in order to understand the function of a chaotic neural circuit.

Definition 5 (Approximate police). We can use an equivalence relation E to approximate real number vectors by rational number vectors. All vectors in the same equivalence class of the quotient space $[R^k|E]$ of an equivalence relation on the k-dimensional real space $R^k$ are approximated by the same rational number vector, denoted as a fuzzy granular vector. If $G_k$ is the set of all equivalence relations on $R^k$, then $G_k$ is a semi-order space or lattice. The order in $G_k$ is defined as follows: if $E_1$ and $E_2$ are two equivalence relations and every equivalence class $e_1$ in the quotient space $[R^k|E_1]$ is a subclass of an equivalence class $e_2$ in the quotient space $[R^k|E_2]$, then $E_1 \le E_2$, where an equivalence relation $E_i$ divides $R^k$ into a set of equivalence classes $[R^k|E_i]$. If $C_k$ is a subset of $G_k$ and every equivalence class e in $[R^k|E]$, for all E in $C_k$, is a connected convex region in $R^k$, then $C_k$ is defined as a k-dimensional approximate police, and the equivalence classes of such a $C_k$ are intuitively called granules. We denote

\[ d(D) = \sup_{x,y \in D} |x - y| \]

as the diameter of a granule D,

\[ r(E_i) = \max_{D \in [R^k|E_i]} d(D) \]

as the rough-rate, and

\[ p(E_i) = \min_{D \in [R^k|E_i]} d(D) \]

as the precision of an equivalence relation $E_i$ in the k-dimensional approximate police $C_k$.

In the following pages, we suppose that the number of neural cells is finite. For the sake of simplicity, the initial state of a time serial neural circuit is included in its initial input I(0). If there are k neural cells, at least a k-dimensional multi-scale approximate police should be used for inner computing. Approximate polices can be applied to the system time t, the input, the inner computing, the output, and the feedback of the output of a neural circuit. In the following discussion, we suppose a 1-dimensional multi-scale approximate police is used for the system time t, so the system time can be represented as steps t = 0, 1, 2, 3, … under an equivalence relation with precision > 0. In this case, we can use an automorphism from $R^n$ to $R^n$ to compute the discrete trajectory of a time serial neural circuit. Such an automorphism can be obtained by changing the differential equation (4) into the discrete difference approximation (5).
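As an illustration of Definition 5, the sketch below uses one particular family of equivalence relations: a uniform grid that partitions $R^k$ into half-open cubes of side h, each represented by its centre. This grid, and the names `granule` and `rough_rate`, are assumptions chosen for simplicity and are not taken from the paper. Each cube is a connected convex region, so it qualifies as a granule, and because all cubes are congruent the rough-rate and the precision coincide at $h\sqrt{k}$.

```python
import math


def granule(x, h):
    """Map a real vector to the centre of the grid cube of side h containing it.

    All vectors in the same cube share this representative, which plays the
    role of the fuzzy granular vector of their equivalence class.
    """
    return tuple((math.floor(xi / h) + 0.5) * h for xi in x)


def rough_rate(h, k):
    """Diameter sup|x - y| of a k-dimensional cube of side h, i.e. h * sqrt(k).

    For a uniform grid every granule has this diameter, so r(E) = p(E).
    """
    return h * math.sqrt(k)
```

Halving h refines every cube into smaller cubes, so in the order of Definition 5 the finer grid lies below the coarser one: two vectors that share a granule at side h/2 also share a granule at side h.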



\[
\begin{cases}
x_1(t+\Delta t) = (1 - a_1)\,x_1(t) + w_1 f_1(x_1(t), x_2(t), \dots, x_n(t)) + u_1 \\
x_2(t+\Delta t) = (1 - a_2)\,x_2(t) + w_2 f_2(x_1(t), x_2(t), \dots, x_n(t)) + u_2 \\
\quad\vdots \\
x_n(t+\Delta t) = (1 - a_n)\,x_n(t) + w_n f_n(x_1(t), x_2(t), \dots, x_n(t)) + u_n
\end{cases}
\tag{5}
\]
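The difference equation (5) can be iterated directly once every state is rounded to a grid after each step. The sketch below does this for a hypothetical one-cell circuit whose nonlinearity is the logistic function $f(x) = 4x(1-x)$ with $a = 1$, $w = 1$, $u = 0$; this choice, the grid sizes, and all function names are assumptions made only to obtain a well-known chaotic map, and none of them comes from the paper. Running the same circuit on two nearby grids gives a concrete feel for how the approximation precision can change the computed trajectory.

```python
def quantize(x, h):
    """Approximate each coordinate by the nearest multiple of the grid size h."""
    return [round(xi / h) * h for xi in x]


def step(x, a, w, u, f):
    """One update of equation (5): x_i <- (1 - a_i) x_i + w_i f_i(x) + u_i."""
    return [(1 - a[i]) * x[i] + w[i] * f[i](x) + u[i] for i in range(len(x))]


def trajectory(x0, a, w, u, f, steps, h):
    """Iterate equation (5), rounding the state to the grid after every step."""
    x = quantize(x0, h)
    out = [x]
    for _ in range(steps):
        x = quantize(step(x, a, w, u, f), h)
        out.append(x)
    return out


# Hypothetical one-cell circuit (logistic map), not taken from the paper.
a, w, u = [1.0], [1.0], [0.0]
f = [lambda x: 4.0 * x[0] * (1.0 - x[0])]

coarse = trajectory([0.3], a, w, u, f, 200, 1e-6)
fine = trajectory([0.3], a, w, u, f, 200, 1e-7)
max_gap = max(abs(p[0] - q[0]) for p, q in zip(coarse, fine))
```

Both runs start from the same input and agree for the first few steps; `max_gap` records the largest separation between the two quantized trajectories over the run.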

We now discuss the behavior of trajectories under different equivalence relations. Theorem 4 tells us that any small difference between two equivalence relations of approximate polices may cause unpredictable results in some chaotic neural circuits. Theorem 5 tells us that all details of a time serial neural circuit can be revealed under an approximation of sufficiently high precision.

Definition 6 (The difference of two equivalence relations of approximate polices (AP)). If $E_1$ and $E_2$ are two equivalence relations and every vector $(x_1, x_2, \dots, x_k)$ in $R^k$ (or in a domain D in $R^k$) is approximated as $e_1(x_1, x_2, \dots, x_k)$ and $e_2(x_1, x_2, \dots, x_k)$ by $E_1$ and $E_2$ respectively, then the difference of $E_1$ and $E_2$ in a domain D can be defined as

\[
\mathrm{dif}(E_1, E_2) = \int_D \left\| e_1(x_1 x_2 \dots x_k) - e_2(x_1 x_2 \dots x_k) \right\| \, dx_1\, dx_2 \dots dx_k .
\tag{6}
\]

Suppose $E_1(\theta)$ and $E_2(\theta)$ are equivalence relations in two k-dimensional multi-scale approximate polices $A_1$ and $A_2$ respectively, where θ is a parameter of the equivalence relations; e.g., θ can be the rough-rate $r(E_i) = \max_{D \in [R^k|E_i]} d(D)$ of an equivalence relation. For a chaotic dynamical system, similar approximations of irrational numbers may cause very different trajectories, i.e. no matter how small $\mathrm{dif}(E_1(\theta), E_2(\theta))$ is, if $\mathrm{dif}(E_1(\theta), E_2(\theta)) > \varepsilon > 0$, where ε is a constant, then under the input I(0) the difference between the two trajectories $T(I(0)|E_1(\theta)) = \{O_{E_1}(0), O_{E_1}(1), \dots, O_{E_1}(t), \dots\}$ and $T(I(0)|E_2(\theta)) = \{O_{E_2}(0), O_{E_2}(1), \dots, O_{E_2}(t), \dots\}$, which are calculated under $E_1$ and $E_2$ respectively, may be unpredictable when t tends to infinity, i.e. $|O_{E_1}(t) - O_{E_2}(t)| > g > 0$, where $O_E(t) = (o_{E1}(t), o_{E2}(t), \dots, o_{Ek}(t))$ is a fuzzy granular output vector and g is an arbitrarily large constant. Based on this fact we have Theorem 4.

On the other hand, any small difference between two trajectories of a time serial neural circuit can be detected under a suitable approximate police with a small enough rough-rate. That is, suppose $T(I_a(0)) = \{O_a(0), O_a(1), \dots, O_a(t), \dots\}$ and $T(I_b(0)) = \{O_b(0), O_b(1), \dots, O_b(t), \dots\}$ are two different trajectories of a time serial neural circuit under inputs $I_a(0)$ and $I_b(0)$ with no approximate police, and they are indistinguishable under an equivalence relation $E(\theta)$, i.e. $|T(I_a(0)|E(\theta)) - T(I_b(0)|E(\theta))| = 0$, where $T(I_a(0)|E(\theta))$ and $T(I_b(0)|E(\theta))$ (with $\theta = r(E) = \max_{D \in [R^k|E]} d(D)$) are the two trajectories at inputs $I_a(0)$ and $I_b(0)$ under the equivalence relation $E(\theta)$ respectively. Then, by reducing the rough-rate $\theta = r(E)$, the difference between the two orbits will appear, i.e.

\[ \lim_{\theta \to 0} \left| T(I_a(0)|E(\theta)) - T(I_b(0)|E(\theta)) \right| > \varepsilon > 0 . \]

In this way, we can prove Theorem 5. Strict proofs of these theorems are omitted for reasons of space.

Theorem 4. Suppose $E_1(\theta)$ and $E_2(\theta)$ are equivalence relations in two k-dimensional multi-scale approximate polices $A_1$ and $A_2$ respectively. For a time serial neural circuit, if $\mathrm{dif}(E_1(\theta), E_2(\theta)) > \varepsilon > 0$, where ε is an arbitrarily small constant, then under the input I(0) the difference between the two trajectories $T(I(0)|E_1(\theta)) = \{O_{E_1}(0), O_{E_1}(1), \dots, O_{E_1}(t), \dots\}$ and $T(I(0)|E_2(\theta)) = \{O_{E_2}(0), O_{E_2}(1), \dots, O_{E_2}(t), \dots\}$, which are calculated under $E_1$ and $E_2$ respectively, may be unpredictable when t tends to infinity, i.e. $|O_{E_1}(t) - O_{E_2}(t)| > g > 0$, where g is an arbitrarily large constant.

Theorem 5. Suppose $G_k$ is an approximate police such that for any small $\varepsilon > 0$ there are equivalence relations in $G_k$ with rough-rate less than ε, and $T(I_a(0)) = \{O_a(0), O_a(1), \dots, O_a(t), \dots\}$ and $T(I_b(0)) = \{O_b(0), O_b(1), \dots, O_b(t), \dots\}$ are two different trajectories of a time serial neural circuit under inputs $I_a(0)$ and $I_b(0)$ with no approximate police. If we apply $G_k$ to the input, output and inner computing of this neural circuit, then there is an $\varepsilon > 0$ such that $T(I_a(0))$ and $T(I_b(0))$ are discriminable under all equivalence relations E in $G_k$ with rough-rates smaller than $\varepsilon/2$.

Summary. The important characteristics of chaotic dynamics, i.e., aperiodic dynamics in deterministic systems, are the apparent irregularity of time traces and the divergence of trajectories over time (starting from two nearby initial conditions). Any small error in the calculation of a chaotic deterministic system will cause unpredictable divergence of the trajectories over time, i.e. such neural circuits may behave very differently when calculated at different precisions. According to Theorem 4, any small difference between two approximations of a trajectory of a chaotic time serial neural circuit may produce two totally different approximate results for this trajectory, and Theorem 5 tells us that all details of a chaotic time serial neural circuit S can be revealed by an equivalence relation E in a finite-precision multi-scale approximate police Gk with small enough rough-rates. It is therefore reasonable to apply a multi-scale precision approximate police in the analysis and design of a neural circuit with chaos, and such an approximate police should be able to continuously control the calculation error in order to reveal enough detail of the functions of such neural circuits.

According to the above analysis, when we design or analyze a time serial neural circuit, two kinds of approaches should be considered. (1) For a time serial neural circuit with no chaos, we can use back-propagation-like learning to find a suitable fuzzy logical framework for it. (2) For a chaotic time serial neural circuit, we should find a set of fuzzy logical frameworks under different equivalence relations in an approximate police which can continuously control the calculation error; every such fuzzy logical framework can be computed at a definite precision in the same way as in (1).

However, under what conditions the fuzzy logical frameworks of a chaotic time serial neural circuit under different equivalence relations in an approximate police have similar or continuously changing structures is still an open problem.

Acknowledgements This work is supported by the National Science Foundation of China (No. 60435010, 90604017, 60675010), 863 National High-Tech Program (No.2006AA01Z128), National Basic Research Priorities Programme(No. 2003CB317004) and the Nature Science Foundation of Beijing (No. 4052025).

References

1. Canuto, A.M.P., Fairhurst, M.: An investigation of fuzzy combiners applied to a hybrid multi-neural system. In: IEEE Proceedings of the VII Brazilian Symposium on Neural Networks (SBRN 2002), pp. 156–161 (2002)
2. Yager, R.: Families of OWA operators. Fuzzy Sets and Systems 59(22), 125–148 (1993)
3. Castro, J.L.: Fuzzy logic controllers are universal approximators. IEEE Transactions on Systems, Man and Cybernetics 25(4), 629–635 (1995)
4. Li, H.X., Chen, C.L.P.: The equivalence between fuzzy logic systems and feedforward neural networks. IEEE Trans. on Neural Networks 11(2), 356–365 (2000)
5. Li, Z.P.: Pre-attentive segmentation and correspondence in stereo. Philos. Trans. R Soc. Lond. B Biol. Sci. 357(1428), 1877–1883 (2002)
6. Cho, S.B., Kim, J.H.: Multiple network fusion using fuzzy logic. IEEE Trans. on Neural Networks 6(2), 497–501 (1995)
7. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice-Hall, Inc., Englewood Cliffs (1999)
8. Sartori, M.A., Antsaklis, P.J.: A simple method to derive bounds on the size and to train multilayer neural networks. IEEE Trans. on Neural Networks 2(4), 467–471 (1991)
9. Kim, S.S.: A neuro-fuzzy approach to integration and control of industrial processes: Part I. J. Fuzzy Logic Intell. Syst. 8(6), 58–69 (1998)
10. Li, Z.P.: A neural model of contour integration in the primary visual cortex. Neural Computation 10, 903–940 (1998)

Fuzzy Hyperbolic Neural Network Model and Its Application in H∞ Filter Design

Shuxian Lun (1,2), Zhaozheng Guo (1), and Huaguang Zhang (3)

(1) School of Information Science and Engineering, Bohai University, Jinzhou. 110004 Liaoning, China
(2) The Key Laboratory of Complex System and Intelligent Science, Institute of Automation Chinese Academy of Science, Haidian 100080 Beijing, China
(3) School of Information Science and Engineering, Northeastern University, Shenyang. 110004 Liaoning, China
[email protected]

Abstract. This paper studies an H∞ filter based on a new fuzzy neural model for signal estimation of nonlinear continuous-time systems with time delays. First, a new fuzzy neural model, called the fuzzy hyperbolic neural network model (FHNNM), is developed. The FHNNM is a combination of a special fuzzy model and a modified BP neural network. The main advantages of the FHNNM over traditional fuzzy neural networks are its explicit expression of expert experience and its global analytical description. In addition, in contrast to fuzzy neural networks based on the T-S fuzzy model, no premise structure identification and no completeness design of the premise variable space are needed. Next, we design a stable H∞ filter based on the FHNNM using the linear matrix inequality (LMI) method. A simulation example is provided to illustrate the design procedure of the proposed method.

Keywords: Fuzzy hyperbolic neural network, H∞ filter, Linear matrix inequality (LMI), Nonlinear system.

1 Introduction

Recently, there has been a lot of interest in the problem of robust H∞ filtering for nonlinear systems [1,2,3,4,5,6]. The advantage of using an H∞ filter over a Kalman filter is that no statistical assumption on the noise signals is needed. However, it is in general difficult to design an efficient filter for signal estimation of nonlinear systems. This paper deals with the H∞ filtering problem based on the fuzzy hyperbolic neural network model for continuous-time nonlinear systems. There have been some successful examples of fuzzy neural network theory in filtering applications [7,8,9,10], and successes have been achieved in situations where the dynamics of systems are so complex that it is impossible to construct an accurate model. However, the identification of a fuzzy neural network based on the T-S fuzzy model is difficult. In order to overcome this difficulty, this paper studies H∞ filter design based on the fuzzy hyperbolic neural network model for a class of continuous-time nonlinear systems. A new continuous-time fuzzy neural network model, called the fuzzy hyperbolic neural network model (FHNNM), is developed based on the fuzzy hyperbolic model proposed in [11,12,13]. An FHNNM is both a valid global description and a nonlinear model in nature. Besides the advantage mentioned above, the advantage of using the FHNNM over the T-S fuzzy neural network model is that no premise structure identification is needed and no completeness design of the premise variable space is needed. Thus, an H∞ filter using the FHNNM can obtain better estimation performance than one using other fuzzy neural network models. The FHNNM can be obtained without knowing much information about the real plant, and it can easily be derived from a set of fuzzy rules.

The present paper is organized as follows. In Section 2, the principle of the FHNNM is described. In Section 3, the H∞ filter design based on the fuzzy hyperbolic neural network model, called the fuzzy hyperbolic neural network H∞ filter, is addressed. The H∞ filter design problem based on the FHNNM is converted to the feasibility problem of a linear matrix inequality (LMI), which makes the prescribed attenuation level as small as possible, subject to some LMI constraints. In Section 4, a simulation example is employed to demonstrate the design procedure for fuzzy hyperbolic neural network H∞ filters.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 222–230, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 The Principle of Fuzzy Hyperbolic Neural Network

The fuzzy hyperbolic neural network model is composed of two parts, corresponding to the premise and the conclusion of hyperbolic type fuzzy rules, respectively. Hyperbolic type fuzzy rules are defined as follows.

Definition 1 [11,12,13]. Given a plant with n input variables $x = (x_1(t), \dots, x_n(t))^T$ and n output variables $\dot{x} = (\dot{x}_1(t), \dots, \dot{x}_n(t))^T$. For each output variable $\dot{x}_l$, $l = 1, 2, \dots, n$, the corresponding group of hyperbolic type fuzzy rules has the following form:

$R^j$: IF $x_1$ is $F_{x_1}$ and $x_2$ is $F_{x_2}$, ..., and $x_n$ is $F_{x_n}$ THEN $\dot{x}_l = \pm c_{F_{x_1}} \pm c_{F_{x_2}} \pm \dots \pm c_{F_{x_n}}$

where $F_{x_i}$ (i = 1, ..., n) are fuzzy sets of $x_i$, which include $P_{x_i}$ (positive) and $N_{x_i}$ (negative), and $\pm c_{F_{x_i}}$ (i = 1, ..., n) are 2n real constants corresponding to $F_{x_i}$;

(i) The constant terms $\pm c_{F_{x_i}}$ in the THEN-part correspond to $F_{x_i}$ in the IF-part; that is, if the language value of the $F_{x_i}$ term in the IF-part is $P_{x_i}$, $+c_{F_{x_i}}$ must appear in the THEN-part; if the language value of the $F_{x_i}$ term in the IF-part is $N_{x_i}$, $-c_{F_{x_i}}$ must appear in the THEN-part; if there is no $F_{x_i}$ in the IF-part, $\pm c_{F_{x_i}}$ does not appear in the THEN-part.

(ii) There are $2^n$ fuzzy rules in each rule base; that is, there are a total of $2^n$ input variable combinations of all the possible $P_{x_i}$ and $N_{x_i}$ in the IF-part.

We call this group of fuzzy rules a hyperbolic type fuzzy rule base (HFRB). To describe a plant with n output variables, we will need n HFRBs.

Fig. 1 shows the configuration of the fuzzy hyperbolic neural network model. In Fig. 1, the weights between layers $L_1$ and $L_2$, $L_2$ and $L_3$, and $L_4$ and $L_5$ are all 1. The first layer $L_1$ is the input layer of the fuzzy hyperbolic neural network model. Each neural unit directly transfers one input variable $x_i(t)$, $i = 1, 2, \dots, n$, i.e. $f_1(x_i) = x_i$, $i = 1, 2, \dots, n$. Thus, the number of neural units is the same as the dimension of the input variables.


S. Lun, Z. Guo, and H. Zhang

Fig. 1. The configuration of fuzzy hyperbolic neural network model

The second layer $L_2$ describes the fuzzy sets $F_{x_i}$ ($P_{x_i}$ and $N_{x_i}$, $i = 1, 2, \dots, n$) of the input variables and is used to compute the membership functions of the input variables. The membership functions of $P_{x_i}$ and $N_{x_i}$ are

\[
\mu_{P_{x_i}}(x_i) = e^{-\frac{1}{2}(x_i - k_i)^2}, \qquad
\mu_{N_{x_i}}(x_i) = e^{-\frac{1}{2}(x_i + k_i)^2}
\tag{1}
\]

where $k_i$ are positive constants. Therefore, we have

\[
f_2(\cdot) =
\begin{cases}
\mu_{P_{x_i}}(x_i) = e^{-\frac{1}{2}(x_i - k_i)^2}, & \text{IF-part is } P_{x_i} \\
\mu_{N_{x_i}}(x_i) = e^{-\frac{1}{2}(x_i + k_i)^2}, & \text{IF-part is } N_{x_i}
\end{cases}
\tag{2}
\]

The third layer $L_3$ performs product inference. Each neural unit represents the corresponding hyperbolic type fuzzy rule and computes the fitness of that rule. The output function $f_3^{(k)}(\cdot)$ ($k = 1, 2, \dots, 2^n$) of the kth neural unit of the third layer is

\[
\begin{aligned}
f_3^{(1)}(\cdot) &= \mu_{P_{x_1}}(x_1)\,\mu_{P_{x_2}}(x_2) \cdots \mu_{P_{x_n}}(x_n) \\
f_3^{(2)}(\cdot) &= \mu_{N_{x_1}}(x_1)\,\mu_{P_{x_2}}(x_2) \cdots \mu_{P_{x_n}}(x_n) \\
&\;\;\vdots \\
f_3^{(2^n)}(\cdot) &= \mu_{N_{x_1}}(x_1)\,\mu_{N_{x_2}}(x_2) \cdots \mu_{N_{x_n}}(x_n)
\end{aligned}
\]

The fourth layer $L_4$ executes the normalization. The weights between $L_3$ and $L_4$ are $c_{F_{x_1}} + c_{F_{x_2}} + \dots + c_{F_{x_n}}$, ..., $-c_{F_{x_1}} - c_{F_{x_2}} - \dots - c_{F_{x_n}}$, respectively. The output function $f_4^{(k)}(\cdot)$ ($k = 1, 2, \dots, 2^n$) of the kth neural unit of the fourth layer is

\[
\begin{aligned}
f_4^{(1)}(\cdot) &= (c_{F_{x_1}} + c_{F_{x_2}} + \dots + c_{F_{x_n}})\, f_3^{(1)}(\cdot)/G \\
f_4^{(2)}(\cdot) &= (-c_{F_{x_1}} + c_{F_{x_2}} + \dots + c_{F_{x_n}})\, f_3^{(2)}(\cdot)/G \\
&\;\;\vdots \\
f_4^{(2^n)}(\cdot) &= (-c_{F_{x_1}} - c_{F_{x_2}} - \dots - c_{F_{x_n}})\, f_3^{(2^n)}(\cdot)/G
\end{aligned}
\]

where

\[
G = \mu_{P_{x_1}}(x_1)\mu_{P_{x_2}}(x_2) \cdots \mu_{P_{x_n}}(x_n) + \mu_{N_{x_1}}(x_1)\mu_{P_{x_2}}(x_2) \cdots \mu_{P_{x_n}}(x_n) + \dots + \mu_{N_{x_1}}(x_1)\mu_{N_{x_2}}(x_2) \cdots \mu_{N_{x_n}}(x_n).
\]

The fifth layer $L_5$ is the output layer. The output variable $\dot{x}_l$ is

\[
\dot{x}_l = \sum_{k=1}^{2^n} f_4^{(k)}(\cdot), \qquad l = 1, 2, \dots, n.
\]

We can derive the following model from references [11,12,13]:

\[
\dot{x}_l = f(x) = \sum_{i=1}^{n} \frac{c_{F_{x_i}} e^{k_i x_i} - c_{F_{x_i}} e^{-k_i x_i}}{e^{k_i x_i} + e^{-k_i x_i}} = \sum_{i=1}^{n} c_{F_{x_i}} \tanh(k_i x_i)
\tag{3}
\]

According to (3), the whole system has the following form:

\[
\dot{x} = A \tanh(k_x x)
\tag{4}
\]

where P is a constant vector, A is a constant matrix, and $\tanh(k_x x)$ is defined by $\tanh(k_x x) = [\tanh(k_1 x_1)\ \tanh(k_2 x_2)\ \cdots\ \tanh(k_n x_n)]^T$. Therefore, we can obtain an analytical description (4) from the FHNNM, i.e., the FHNNM is equivalent to (4). In Fig. 1, we need to identify the parameters $c_{F_{x_i}}$ and $k_i$. In fact, according to (2), we can simplify the configuration of the fuzzy hyperbolic neural network as shown in Fig. 2. In Fig. 2, $h_1(x_i) = x_i$, $h_2(\cdot) = \tanh(\cdot)$ and $h_3(\cdot) = \sum_{i=1}^{n} o_i$, where $o_i$ ($i = 1, 2, \dots, n$) are the inputs of the neural unit. Thus, the identification work becomes easier.
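The equivalence between the five-layer network and the closed form (3) can be checked numerically. The sketch below is an illustration only: the function names and the sample values of $x$, $c_{F_{x_i}}$ and $k_i$ are assumptions, not taken from the paper. It enumerates all $2^n$ rules with the Gaussian memberships of (1), applies product inference and normalization, and compares the result with $\sum_i c_{F_{x_i}} \tanh(k_i x_i)$.

```python
import itertools
import math


def fhnn_output(x, c, k):
    """Evaluate layers L2 to L5 by enumerating all 2^n hyperbolic fuzzy rules.

    sign = +1 selects the positive set P (membership centred at +k_i),
    sign = -1 selects the negative set N (membership centred at -k_i).
    """
    numerator = 0.0
    normaliser = 0.0  # G, the sum of all rule strengths
    for signs in itertools.product((1, -1), repeat=len(x)):
        strength = 1.0   # product inference (layer L3)
        weight = 0.0     # THEN-part: signed sum of the constants c_i
        for xi, ci, ki, s in zip(x, c, k, signs):
            strength *= math.exp(-0.5 * (xi - s * ki) ** 2)
            weight += s * ci
        numerator += weight * strength
        normaliser += strength
    return numerator / normaliser


def tanh_form(x, c, k):
    """Closed form (3): sum of c_i * tanh(k_i * x_i)."""
    return sum(ci * math.tanh(ki * xi) for xi, ci, ki in zip(x, c, k))
```

The agreement follows because the ratio of the two memberships of one variable is $e^{2 k_i x_i}$, so the normalized difference of memberships reduces to $\tanh(k_i x_i)$; the enumeration costs $2^n$ rule evaluations, whereas the closed form needs only n.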

3 Fuzzy Hyperbolic Neural Filter Analysis and Design

Consider the FHNNM of the nonlinear system in the following form:

\[
\begin{aligned}
\dot{x}(t) &= A \tanh(k_x x) + A_d \tanh(k_x x(t - d)) + B w(t) \\
y(t) &= C \tanh(k_x x) + D w(t) \\
s(t) &= L x(t)
\end{aligned}
\tag{5}
\]

where $x(t) = [x_1(t), x_2(t), \dots, x_n(t)]^T \in R^{n \times 1}$ denotes the state vector; $y(t) \in R^{m \times 1}$ denotes the measurement vector; $s(t) \in R^{q \times 1}$ denotes the signal to be estimated; $w(t) \in R^{n \times 1}$ is assumed to be a bounded disturbance input; $B \in R^{n \times n}$, $D \in R^{m \times m}$ and $L \in R^{q \times n}$ are constant matrices; d(t) is the time-varying delay in the state and satisfies $\dot{d}(t) \le \beta < 1$, where β is a known constant.


Fig. 2. The simple configuration of the FHNNM

Based on the FHNNM (4), the following fuzzy hyperbolic neural network $H_\infty$ filter is addressed:

$$\begin{aligned} \dot{\hat{x}}(t) &= A\tanh(k_x \hat{x}) + K\big(y(t) - C\tanh(k_x \hat{x})\big) \\ \hat{s}(t) &= L\hat{x}(t), \quad \hat{x}(0) = 0 \end{aligned} \tag{6}$$

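Then, the augmented filter error system can be written in the following form:

$$\begin{aligned} \dot{\eta}(t) &= \bar{A}\tanh(k\eta(t)) + \bar{A}_d\tanh(k\eta(t-d)) + \bar{B}\omega(t) \\ e(t) &= s(t) - \hat{s}(t) = \bar{C}\eta(t) \end{aligned} \tag{7}$$

where $\eta(t) = [x^T(t)\ \ \hat{x}^T(t)]^T$, $\bar{A} = \begin{bmatrix} A & 0 \\ KC & A - KC \end{bmatrix}$, $\bar{A}_d = \begin{bmatrix} A_d & 0 \\ 0 & 0 \end{bmatrix}$, $\bar{B} = [B^T\ \ (KD)^T]^T$, $\bar{C} = [L\ \ -L]$, $k = \mathrm{diag}(k_x, k_x)$.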
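The block structure of the augmented system (7) can be checked numerically. The following sketch builds the augmented matrices from (5) and (6) and compares the stacked derivative of $[x;\ \hat{x}]$ with the augmented right-hand side. All matrices are random placeholders, the helper name `augment` is hypothetical, and the disturbance dimension `nw` is an assumption made so that the products $Bw$ and $Dw$ are well defined.

```python
import numpy as np

def augment(A, A_d, B, C, D, K, L):
    """Build the augmented matrices of (7) from the system (5) and the filter (6)."""
    n = A.shape[0]
    Z = np.zeros((n, n))
    A_bar = np.block([[A, Z], [K @ C, A - K @ C]])
    Ad_bar = np.block([[A_d, Z], [Z, Z]])
    B_bar = np.vstack([B, K @ D])
    C_bar = np.hstack([L, -L])
    return A_bar, Ad_bar, B_bar, C_bar

rng = np.random.default_rng(1)
n, m, q, nw = 3, 2, 1, 2                      # nw: disturbance dimension (assumption)
A, A_d = rng.normal(size=(n, n)), rng.normal(size=(n, n))
B, C = rng.normal(size=(n, nw)), rng.normal(size=(m, n))
D, K, L = rng.normal(size=(m, nw)), rng.normal(size=(n, m)), rng.normal(size=(q, n))
kx = rng.uniform(0.5, 2.0, size=n)
x, xh, xd, xhd = (rng.normal(size=n) for _ in range(4))   # x, x_hat and delayed copies
w = rng.normal(size=nw)

# Direct evaluation of (5) and (6)
y = C @ np.tanh(kx * x) + D @ w
x_dot = A @ np.tanh(kx * x) + A_d @ np.tanh(kx * xd) + B @ w
xh_dot = A @ np.tanh(kx * xh) + K @ (y - C @ np.tanh(kx * xh))

# Evaluation through the augmented form (7)
A_bar, Ad_bar, B_bar, C_bar = augment(A, A_d, B, C, D, K, L)
k = np.concatenate([kx, kx])
eta, eta_d = np.concatenate([x, xh]), np.concatenate([xd, xhd])
eta_dot = A_bar @ np.tanh(k * eta) + Ad_bar @ np.tanh(k * eta_d) + B_bar @ w
err = C_bar @ eta
```

The lower block row of $\bar{A}$ arises because the filter is driven by $y(t)$, which contributes $KC\tanh(k_x x)$ and $KDw$ to the estimate dynamics.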

Theorem 1. For nonlinear system (5) and a prescribed real number γ > 0, if there exist a positive definite diagonal matrices P such that the matrix inequality ⎡ ⎤ PA¯ + A¯ T P + α H + Q PA¯ d PB¯ ⎣ A¯ T P ⎦ 0, 1 1  pi > 0. Because of cosh(ki ηi ) = (eki ηi + e−kiηi ) 2 ≥ (eki ηi ) 2 · (e−ki ηi ) 2 = 1 and ki > 0, pi > 0 , we known V (t) > 0 for all η and V (t) → ∞ as η 2 → ∞. Along the trajectories of system (7) with w(t) = 0, the corresponding time derivative of V (t) is given by n

V˙ =2 ∑ pi tanh(kη )η˙ + α tanhT (kη )H tanh(kη )− i=1

˙ tanhT (kηd )H tanh(kηd ) α (1 − d(t)) =2 tanhT (kη )Pη˙ + α tanhT (kη )H tanh(kη )− ˙ tanhT (kηd )H tanh(kηd ) α (1 − d(t)) ≤2 tanhT (kη )P[A¯ tanh(kη ) + A¯ d tanh(kηd ]+

α tanhT (kη )H tanh(kη ) − tanhT (kηd )H tanh(kηd )   T    tanh(kη ) PA¯ + A¯ T P + α H PA¯ d tanh(kη ) × ≤ tanh(kηd ) tanh(kηd ) A¯ Td P −H  Δ where P = diag(p1 , p2 , · · · pn ) ∈ R2n×2n, ηd = η (t − d(t)), α = 1 1 − β . A sufficient condition for V˙ < 0 is as follows   PA¯ + A¯ T P + α H PA¯ d 0 and M > 0 such that ||x(t) − x∗ || ≤ M  x(0) − x∗  e−λt

(4)

for all t ≥ 0.

Definition 2. The domain of attraction of equilibrium x∗ is the maximal region Ω such that every solution x(t) to model (1) satisfying (3) with x(0) ∈ Ω approaches x∗.

In order to prove the main result regarding the attraction domain of model (1), we need the following lemma.

On the Domain Attraction of Fuzzy Neural Networks


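Lemma 1. ([7]). For any $a_{ij} \in R$, $x_j, y_j \in R$, $i, j = 1, \cdots, n$, we have the following estimations:

$$\left|\bigwedge_{j=1}^{n} a_{ij}x_j - \bigwedge_{j=1}^{n} a_{ij}y_j\right| \le \sum_{1\le j\le n} \big(|a_{ij}| \cdot |x_j - y_j|\big) \tag{5}$$

and

$$\left|\bigvee_{j=1}^{n} a_{ij}x_j - \bigvee_{j=1}^{n} a_{ij}y_j\right| \le \sum_{1\le j\le n} \big(|a_{ij}| \cdot |x_j - y_j|\big) \tag{6}$$

3 Attraction Domain for Fuzzy Neural Networks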

In this section, we will use the Lyapunov method to obtain the attraction domain for the fuzzy neural networks. The main result is presented as the following theorem.

Theorem 1. Suppose that $x^* = (x_1^*, \cdots, x_n^*)$ is an equilibrium point of model (1) with coefficients satisfying (2), and the following holds: $d_i > |f_i(x_i^*)| \sum_{j=1}^{n} \zeta_{ji}$, $i = 1, 2, \cdots, n$, where $\zeta_{ij} = |\xi_{ij}| + |\gamma_{ij}| + |\delta_{ij}|$. Then, we have the following:

(a) $x^*$ is locally exponentially stable.

(b) Let

$$\delta = \frac{2}{n} \min_{1\le i\le n} \left\{ \frac{d_i - |f_i(x_i^*)| \sum_{j=1}^{n} \zeta_{ji}}{M_f \sum_{j=1}^{n} \zeta_{ji}} \right\};$$

then the open ball $B(x^*, \delta)$ is contained in the domain of robust attraction of $x^*$.

Proof. First, we prove part (a). Since $d_i > |f_i(x_i^*)| \sum_{j=1}^{n} \zeta_{ji}$, $i = 1, 2, \cdots, n$, let $\epsilon$ be any positive number such that $\epsilon < \min_{1\le i\le n} \{ d_i - |f_i(x_i^*)| \sum_{j=1}^{n} \zeta_{ji} \}$. Consider the Lyapunov function

$$V(x(t)) = e^{\epsilon t} \sum_{i=1}^{n} |x_i(t) - x_i^*| = e^{\epsilon t} \sum_{i=1}^{n} |y_i(t)| \tag{7}$$

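where $y_i(t) = x_i(t) - x_i^*$. For $i = 1, \cdots, n$, by (2), Lemma 1 and Lemma 2, we have the following:

$$\begin{aligned} \frac{dV(x(t))}{dt} &= \epsilon e^{\epsilon t} \sum_{i=1}^{n} |y_i(t)| + e^{\epsilon t} \sum_{i=1}^{n} \frac{d^+|y_i(t)|}{dt} \\ &\le \epsilon e^{\epsilon t} \sum_{i=1}^{n} |y_i| + e^{\epsilon t} \sum_{i=1}^{n} \Big[ -d_i|y_i(t)| + \Big|\sum_{j=1}^{n} \xi_{ij}\big(f_j(x_j(t)) - f_j(x_j^*)\big)\Big| \\ &\qquad + \Big|\bigwedge_{j=1}^{n} \gamma_{ij} f_j(x_j(t)) - \bigwedge_{j=1}^{n} \gamma_{ij} f_j(x_j^*)\Big| + \Big|\bigvee_{j=1}^{n} \delta_{ij} f_j(x_j(t)) - \bigvee_{j=1}^{n} \delta_{ij} f_j(x_j^*)\Big| \Big] \\ &\le e^{\epsilon t} \sum_{i=1}^{n} (\epsilon - d_i)|y_i(t)| + e^{\epsilon t} \sum_{i=1}^{n} \sum_{j=1}^{n} \big[|\xi_{ij}| + |\gamma_{ij}| + |\delta_{ij}|\big]\,|f_j(x_j(t)) - f_j(x_j^*)| \\ &= e^{\epsilon t} \sum_{i=1}^{n} (\epsilon - d_i)|y_i(t)| + e^{\epsilon t} \sum_{i=1}^{n} \sum_{j=1}^{n} |\zeta_{ij}|\,|f_j(x_j(t)) - f_j(x_j^*)| \end{aligned} \tag{8}$$

Since

$$\begin{aligned} |f_j(x_j(t)) - f_j(x_j^*)| &= \Big|f_j'(x_j^*)(x_j(t) - x_j^*) + \frac{f_j''(\xi_j)}{2}(x_j(t) - x_j^*)^2\Big| \\ &\le |f_j'(x_j^*)(x_j(t) - x_j^*)| + \frac{M_f}{2}\,|(x_j(t) - x_j^*)^2| \end{aligned} \tag{9}$$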

By formulas (8) and (9), we have n n n   dV (y(t)) Mf  ≤ eǫt |ζij | + (ǫ − di + |fi′ (x∗i )| |ζij ||yi (t)|)|yi (t)| dt 2 j=1 j=1 i=1

(10) We evaluate at t0 . If



we have have



n di −ǫ−|fi (x∗ i )| j=1 ζji 2 }, min { n 1≤i≤n n Mf j=1 ζji dV (x(t)) < 0, so V ((x(t)) is a decreasing function dt

|xi (t0 ) − x∗i |
t0 . Thus, we

|xi (0) − x∗i |,

||xi (t) − x∗i ||∞ ≤ ne−ǫt ||xi (0) − x∗i ||∞ According to Definition 3, x(0) is in the attraction domain of x∗ . Thus, we have completed the proof of Theorem 1.

4 Conclusion

In this paper, we study the local dynamics of fuzzy neural networks. A criterion on the attraction domain for fuzzy neural networks has been obtained using the Lyapunov method. It is believed that the stability property of neural networks is very important in designing neural networks. Thus, the results presented in this paper are useful in the application and design of fuzzy neural networks, since the conditions can be easily verified.

References 1. Cao, J.: An Estimation of the Domain and Convergence Rate of Hopfield Associative Memory. J. Electron. 21, 488–491 (1999) 2. Cao, J.: An Estimation of the Domain of Attraction and Convergence Rate for Hopfield Continuous Feedback Neural Networks. Phys. Lett. A. 325, 370–374 (2004) 3. Cao, J., Tao, Q.: An Estimation of the Domain of Attraction and Convergence Rate for Hopfield Continuous Feedback Neural Networks. J. Comput. Syst. Sci. 62, 528– 534 (2001) 4. Cao, J., Tao, Q.: An Estimation of the Domain of Attraction and Convergence Rate for Hopfield Associatiative Memory and an Application. J. Comput. Syst. Sci. 60, 179–186 (2000) 5. Cao, J., Chen, T.: Globally Exponentially Robust Stability and Periodicity of Delayed Neural Networks. Chaos, Solitions and Fractals 22, 957–963 (2004) 6. Cao, J., Wang, J.: Global Exponential Stability and Periodicity of Recurrent Neural Networks with Time Delays. IEEE Transactions on Circuits and Systems-Part I 52, 920–931 (2005) 7. Yang, X., Liao, X.F., Bai, S., Evans, D.: Robust Exponential Stability and Domains of Attraction in a Class of Interval Neural Networks. Chaos, Solitions and Fractals 26, 445–451 (2005) 8. Yang, X., Liao, X.F., Li, C., Evans, D.: New Estimate on the Domains of Attraction of Equilitrium Points in Continuous Hopfield Neural Networks. Phys. Lett. A. 351, 161–166 (2006) 9. Yang, T., Yang, L.B., Wu, C.W., Chua, L.O.: Fuzzy Cellular Neural Networks: Theory. In: Proc. of IEEE International Workshop on Cellular Neural Networks and Applications, pp. 181–186 (1996) 10. Chen, A., Huang, L., Liu, Z., Cao, J.: Periodic Bidirectional Associative Memory Neural Networks with Distributed Delays. Journal of Mathematical Analysis and Applications 317, 80–102 (2006) 11. Chen, A., Huang, L., Cao, J.: Existence and Stability of Almost Periodic Solution for BAM Neural Networks with Delays. Applied Mathematics and Computation 137, 177–193 (2003) 12. Chen, T., Amari, S.: Stability of Asymmetric Hopfield Networks. 
IEEE Trans. Neural Networks 12, 159–163 (2001) 13. Horn, R.A., Johnson, C.R.: Topics in Matrix Analysis. Cambridge University Press, Cambridge (1999) 14. Huang, T., Cao, J., Li, C.: Necessary and Sufficient Condition for the Absolute Exponential Stability for a Class of Neural Networks with Finite Delay. Phys. Lett. A. 352, 94–98 (2006)

236

T. Huang, X. Liao, and H. Huang

15. Huang, T.: Exponential Stability of Fuzzy Cellular Neural Networks with Unbounded Distributed Delay. Phys. Lett. A. 351, 48–52 (2006) 16. Li, C., Liao, X., Huang, T.: Global Stability Analysis for Delayed Neural Networks via an Interval Matrix Approach. IET Control Theory Appl. 1, 743–748 (2007) 17. Liao, X., Wang, J., Cao, J.: Global and Robust Stability of Interval Hipfield Neural Networks with Time-varying Delays. Int. J. Neural Sys. 13, 171–182 (2003) 18. Liao, X.F., Yu, J.B.: Robust Stability for Interval Hopfield Neural Networks with Time Delay. IEEE Trans. on Neural Networks 9, 1042–1045 (1998) 19. Song, Q.K., Cao, J.: Global Robust Stability of Interval Neural Networks with Multiple Time-varying Delays. Mathematics and Computers in Simulation 74, 38– 46 (2007) 20. Song, Q.K., Zhao, Z., Li, Y.: Global Exponential Stability of BAM Neural Networks with Distributed Delays and Reactiondiffusion Terms, vol. 335, pp. 213–225 (2005) 21. Wang, L., Lin, Y.: Global Robust Stability for Shunting Inhibitory CNNs with Delays. Int. J. Neural Syst. 14, 229–235 (2004) 22. Yang, T., Yang, L.B., Wu, C.W., Chua, L.O.: Fuzzy Cellular Neural Networks: Applications. In: Proc. of IEEE International Workshop on Cellular Neural Networks and Applications, pp. 225–230 (1996) 23. Yang, T., Yang, L.B.: The Global Stability of Fuzzy Cellular Neural Network. Circuits and Systems I: Fundamental Theory and Applications 43, 880–883 (1996)

CG-M-FOCUSS and Its Application to Distributed Compressed Sensing

Zhaoshui He^{1,2}, Andrzej Cichocki^{1,3,4}, Rafal Zdunek^{5}, and Jianting Cao^{1,6}

1 Lab. for Advanced Brain Signal Processing, Brain Science Institute, Wako-shi, Saitama, 351-0198, Japan
2 School of Electronics and Information Engineering, South China University of Technology, Guangzhou, 510641, China
3 System Research Institute, Polish Academy of Sciences (PAN), Warsaw, 00-901, Poland
4 Warsaw University of Technology, Warsaw, 00-661, Poland
5 Institute of Telecommunications, Teleinformatics, and Acoustics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
6 Saitama Institute of Technology, Saitama, 369-0293, Japan
{he shui,cia}@brain.riken.jp, [email protected], [email protected]

Abstract. M-FOCUSS is one of the most successful and efficient methods for sparse representation. To reduce the computational cost of M-FOCUSS and to extend its applicability to large-scale problems, this paper extends M-FOCUSS to CG-M-FOCUSS by incorporating conjugate gradient (CG) iterations. Furthermore, CG-M-FOCUSS is applied to distributed compressed sensing. We illustrate the performance of CG-M-FOCUSS with an MRI image reconstruction example, in which CG-M-FOCUSS not only reconstructs the MRI image with high precision but also considerably reduces the computational time.

Keywords: FOCUSS, M-FOCUSS, Compressed sensing, Sparse representation.

1 Introduction

Consider the sparse representation problem:

$$x(t) = As(t), \quad t = 1, \cdots, T, \quad \text{or} \quad X = AS, \tag{1}$$

where $x(t) \in R^m$ is the given vector (observation), $s(t) \in R^n$ is an unknown vector representing sparse sources or hidden components, $A = [a_1, \cdots, a_n] \in R^{m\times n}$ is a given full-row-rank basis matrix, T is the number of available samples, m is the number of observations, and n is the number of sources. We consider only the overcomplete case: m < n. The main objective is to find the sparse solutions (sparse sources) s(t) satisfying equations (1).

Sparse representation has found many applications in compressed sensing [1], electromagnetic and biomagnetic problems (EEG/MEG), time-frequency representation, image processing, fault diagnosis, etc. [2, 3]. Recently, much attention has been paid to this problem due to its importance. Many methods have been developed for this problem: the matching pursuit (MP) method [4], the orthogonal matching pursuit (OMP) method [5], minimum ℓ1-norm methods (e.g., linear programming (LP) [6,7,8,9,10], shortest path decomposition (SPD) [11,9]), various FOCUSS methods [12, 13, 14, 2, 15, 16] and M-FOCUSS [17]. Among these methods, M-FOCUSS is one of the most efficient in terms of both speed and precision. In this paper, we extend M-FOCUSS to CG-M-FOCUSS by incorporating conjugate gradient iterations and apply it to compressive sensing [18, 19, 20].

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 237–245, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 CG-M-FOCUSS

At first, let us consider the sparse representation of multiple measurement vectors:

$$\bar{X} = A\bar{S}, \tag{2}$$

where $\bar{X} = [\bar{x}(1), \cdots, \bar{x}(L)]$, $\bar{S} = [\bar{s}(1), \cdots, \bar{s}(L)]$, and usually L ≪ T (e.g., L = 4). Here $\bar{X}$ is considered to be a block of X in model (1). In this approach, instead of computing all T vectors s(1), · · · , s(T) in model (1) simultaneously, we split them into blocks (usually strongly overlapped) and attempt to estimate them sequentially, block by block. The overlapping blocks are justified by the fact that the sources are usually locally smooth or continuous. In more detail, we use M-FOCUSS or CG-M-FOCUSS to process the T samples in model (1) according to the following scheme: first, the T samples in model (1) are segmented into T/L blocks with block length L (L ≪ T, e.g., L = 4); second, to make the estimated signals smooth, we set an appropriate percentage of overlap between two neighboring blocks (typically 50%-80% overlap); then, we apply M-FOCUSS or CG-M-FOCUSS to each block. According to this scheme,

2.1 M-FOCUSS

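Using M-FOCUSS, the sparse representation problem (2) can be converted into the following optimization problem [17]:

$$\min_{\bar{S}} J(\bar{S}; p) = \min_{\bar{S}} \sum_{i=1}^{n} \Big( \sum_{l=1}^{L} \bar{s}_i^2(l) \Big)^{p/2} \quad \text{subject to: } \bar{X} = A\bar{S} \tag{3}$$

The iterative formula of M-FOCUSS for solving problem (3) is

$$\bar{S} = \Pi^{-1}(\bar{S}) \cdot A^T \cdot \big(A \cdot \Pi^{-1}(\bar{S}) \cdot A^T\big)^{-1} \cdot \bar{X}, \tag{4}$$

where $\Pi^{-1}(\bar{S}) = \mathrm{diag}\Big( \big(\sum_{l=1}^{L} \bar{s}_1^2(l)\big)^{\frac{2-p}{2}}, \cdots, \big(\sum_{l=1}^{L} \bar{s}_n^2(l)\big)^{\frac{2-p}{2}} \Big)$.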
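To make the update (4) concrete, here is a minimal NumPy sketch of one M-FOCUSS pass, written under the assumption that $\Pi^{-1}(\bar{S})$ is the diagonal matrix of row energies raised to $(2-p)/2$ as defined above. The helper name `m_focuss_step`, the problem sizes and the synthetic data are hypothetical; a linear solve replaces the explicit inverse, which is an implementation choice rather than something the paper prescribes.

```python
import numpy as np

def m_focuss_step(A, X, S, p=1.0):
    """One pass of iteration (4), using a linear solve instead of an explicit inverse."""
    pi_inv = np.sum(S ** 2, axis=1) ** ((2.0 - p) / 2.0)   # diagonal of Pi^{-1}(S)
    PAt = pi_inv[:, None] * A.T                            # Pi^{-1}(S) A^T, shape (n, m)
    return PAt @ np.linalg.solve(A @ PAt, X)               # Pi^{-1} A^T (A Pi^{-1} A^T)^{-1} X

rng = np.random.default_rng(2)
m, n, L = 8, 16, 4                        # small illustrative sizes (assumption)
A = rng.normal(size=(m, n))
S_true = np.zeros((n, L))
S_true[[1, 6, 11]] = rng.normal(size=(3, L))   # three jointly active rows
X = A @ S_true

S = np.ones((n, L))                       # all-ones initialization
for _ in range(5):
    S = m_focuss_step(A, X, S, p=1.0)
```

Every iterate satisfies the constraint $\bar{X} = A\bar{S}$ by construction, since multiplying (4) by $A$ from the left returns $\bar{X}$.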

2.2 CG-M-FOCUSS Algorithm

In the iterative formula (4) of M-FOCUSS, it is time-consuming to compute the inverse of the symmetric positive definite matrix $A\Pi^{-1}(\bar{S})A^T$, because it must be computed separately for each block at each iteration, and computing a matrix inverse is usually quite expensive. For this reason, we exploit conjugate gradient (CG) iterations in (4) to speed up M-FOCUSS and extend its applicability to large-scale problems. The linear CG method is one of the most computationally inexpensive techniques for solving large linear algebraic systems of equations; for a symmetric positive definite coefficient matrix, it converges to a minimal norm least squares solution in a finite number of iterations [21, 22]. For the linear equation set

$$H\lambda = b, \tag{5}$$

where λ is the unknown vector and H is an m × m symmetric positive definite matrix, the CG method can find the solution very efficiently. For convenience, we denote the solution obtained by the CG method as $\lambda = H^{-1}b = cg(H, b, \lambda^{(0)}, \varepsilon)$, where $\lambda^{(0)}$ is the initialization and ε is the tolerance.

It is worth noting that the performance of the linear CG method depends on the distribution of the eigenvalues of the coefficient matrix H [22]. In detail, if H has r distinct real-valued eigenvalues (r ≤ m), then the CG iterations will terminate at the solution in at most r iterations. In other words, if the matrix H has very few distinct eigenvalues, the CG method will be extremely fast. For example, if r = 1, CG can find the right solution in only one iteration, even for large-scale problems. To improve the eigenvalue distribution of H and accelerate the CG method [22], we precondition the matrix H in (5) by a linear transform via a nonsingular preconditioning transform matrix C as

$$(C^{-T}HC^{-1})\tilde{\lambda} = C^{-T}b, \tag{6}$$

where $\tilde{\lambda} = C\lambda$. Then we can equivalently solve (5) by $\lambda = C^{-1} \cdot cg(C^{-T}HC^{-1}, C^{-T}b, \lambda^{(0)}, \varepsilon)$. In this way, the convergence rate of the CG method depends on the eigenvalues of the matrix $C^{-T}HC^{-1}$, so we can accelerate the CG method by choosing an appropriate preconditioning matrix C.

Setting $H \triangleq A\Pi^{-1}A^T$ in (4), we have the iterative formula of CG-M-FOCUSS as

$$\bar{s}(l) = \Pi^{-1}(\bar{S}) \cdot A^T \cdot cg\big(A\Pi^{-1}(\bar{S})A^T, \bar{x}(l), \lambda^{(0)}(l), \varepsilon\big), \quad l = 1, \cdots, L. \tag{7}$$

The expression (7) can be implemented by decomposing it into the following two expressions:

$$\begin{aligned} \lambda(l) &= cg\big(A\Pi^{-1}(\bar{S})A^T, \bar{x}(l), \lambda^{(0)}(l), \varepsilon\big) \\ \bar{s}(l) &= \Pi^{-1}(\bar{S}) \cdot A^T \cdot \lambda(l) \end{aligned} \tag{8}$$

In this paper, $\lambda^{(0)}(l)$ in (8) is initialized as $\lambda^{(0)}(l) = \lambda(l-1)$ when l > 1.
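The solver denoted $cg(H, b, \lambda^{(0)}, \varepsilon)$ above can be sketched as a textbook linear conjugate gradient loop. This is an illustrative version, not the authors' code: the stopping rule (residual norm at most ε times the norm of b) is an assumption, since the paper does not state how the tolerance is measured, and the function also returns the iteration count so that the "at most r iterations for r distinct eigenvalues" property can be observed on a synthetic matrix.

```python
import numpy as np

def cg(H, b, lam0, eps, max_iter=None):
    """Linear conjugate gradient for H lam = b, with H symmetric positive definite.

    Stops when ||b - H lam|| <= eps * ||b|| (assumed stopping rule).
    Returns the approximate solution and the number of iterations used.
    """
    lam = np.array(lam0, dtype=float)      # copy of the warm start
    r = b - H @ lam                        # residual
    d = r.copy()                           # first search direction
    rs = r @ r
    tol2 = (eps * np.linalg.norm(b)) ** 2
    if max_iter is None:
        max_iter = 10 * len(b)
    iters = 0
    while rs > tol2 and iters < max_iter:
        Hd = H @ d
        alpha = rs / (d @ Hd)              # exact line search along d
        lam += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d          # next H-conjugate direction
        rs = rs_new
        iters += 1
    return lam, iters

# Synthetic SPD matrix with only three distinct eigenvalues (illustrative data).
rng = np.random.default_rng(0)
m = 30
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
eigs = np.repeat([1.0, 2.0, 5.0], m // 3)
H = Q @ np.diag(eigs) @ Q.T
H = (H + H.T) / 2                          # remove rounding asymmetry
b = rng.normal(size=m)

lam, iters = cg(H, b, np.zeros(m), eps=1e-10)
```

With three distinct eigenvalues the loop needs no more than three passes here, which is the behaviour that motivates preconditioning in (6).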


Since preconditioning plays a crucial role in CG strategies, we discuss how to design an appropriate preconditioning transform matrix C for CG-M-FOCUSS (7). Perform the singular value decomposition (SVD) on A as $A = U\Sigma V^T$, where

$$\Sigma = [\Lambda, 0] = \begin{pmatrix} \sigma_1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_m & 0 & \cdots & 0 \end{pmatrix}, \tag{9}$$

with $\Lambda = \mathrm{diag}(\sigma_1, \cdots, \sigma_m)$. Alternatively, we can perform the eigenvalue decomposition (EVD) of the matrix $AA^T = U\Lambda^2U^T$ to obtain the matrices U and Λ. Here we choose the preconditioning transform matrix C as $C = \Lambda U^T$. Premultiplying both sides of model (2) by the transform matrix $C^{-T} = \Lambda^{-1}U^{-1}$, we have

$$\tilde{X} = \tilde{A}\bar{S}, \tag{10}$$

where $\tilde{X} = \Lambda^{-1}U^{-1}\bar{X}$ and $\tilde{A} = \Lambda^{-1}U^{-1}A$. So problem (2) is preconditioned to problem (10) and can be equivalently solved via (10). Due to the preconditioning transform matrix C, CG-M-FOCUSS is more efficient for problem (10) than for the original problem (2). Based on the above discussion, CG-M-FOCUSS for the MMV problem (2) can be outlined as follows:

Algorithm 1. CG-M-FOCUSS for MMV problem (2)

1) Perform EVD on the matrix $AA^T$ ($= U\Lambda^2U^T$) or SVD on A ($= U\Sigma V^T$) to get U and Λ. Compute $\tilde{A} = \Lambda^{-1}U^{-1}A$ and $\tilde{X} = \Lambda^{-1}U^{-1}\bar{X}$. Set the parameter p and $\varepsilon \le 10^{-3}$.
2) Initialize $\bar{S}$ as $\bar{S}^{(0)}$, initialize λ and set k = 0.
3) Compute $\tilde{T}^{(k)} = \Pi^{-1}(\bar{S}^{(k)}) \cdot \tilde{A}^T$.
4) Update $\bar{S}$ as follows:
   for l = 1 to L do
      Update the Lagrange multiplier vector $\lambda = cg(\tilde{A}\tilde{T}^{(k)}, \tilde{x}(l), \lambda, \varepsilon)$;
      Update $\bar{s}(l)$ by $\bar{s}^{(k+1)}(l) = \tilde{T}^{(k)} \cdot \lambda$;
   end
5) Let k = k + 1 and go to step 3) until convergence is reached.

For convenience, λ is initialized as the zero vector $\lambda = 0_{m\times 1}$ in this paper. CG-M-FOCUSS is more suitable for large-scale problems than the standard M-FOCUSS because the conjugate directions in CG can be generated very economically. The standard M-FOCUSS uses a conventional method (e.g., Gaussian elimination) to calculate the matrix inverse $[A\Pi^{-1}A^T]^{-1}$, whose computational complexity is $O(m^3)$, whereas the computational cost of the CG method for (5) is only $O(m^2)$ per iteration.
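The steps of Algorithm 1 can be put together in a compact NumPy sketch. It is an illustration under stated assumptions rather than the authors' Matlab implementation: the names `cg` and `cg_m_focuss` are hypothetical, the CG stopping rule (relative residual below ε) is assumed, a fixed iteration count stands in for the unspecified convergence test in step 5, and the synthetic jointly sparse data are made up for the demonstration.

```python
import numpy as np

def cg(H, b, lam0, eps, max_iter=500):
    """Plain linear CG for H lam = b; stops when ||residual|| <= eps * ||b|| (assumed rule)."""
    lam = np.array(lam0, dtype=float)
    r = b - H @ lam
    d = r.copy()
    rs = r @ r
    tol2 = (eps * np.linalg.norm(b)) ** 2
    for _ in range(max_iter):
        if rs <= tol2:
            break
        Hd = H @ d
        alpha = rs / (d @ Hd)
        lam += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return lam

def cg_m_focuss(A, X, p=1.0, eps=1e-3, n_iter=30):
    """Sketch of Algorithm 1 for one block X = A S (A must have full row rank)."""
    m, n = A.shape
    L = X.shape[1]
    # Step 1: EVD of A A^T = U Lambda^2 U^T, then precondition with Lambda^{-1} U^T.
    w, U = np.linalg.eigh(A @ A.T)
    P = (1.0 / np.sqrt(w))[:, None] * U.T
    A_t, X_t = P @ A, P @ X
    # Step 2: all-ones initialization of S, zero Lagrange multiplier.
    S = np.ones((n, L))
    lam = np.zeros(m)
    for _ in range(n_iter):                    # fixed count replaces a convergence test
        # Step 3: T = Pi^{-1}(S) A_t^T with Pi^{-1} = diag(row energy^((2 - p) / 2)).
        pi_inv = np.sum(S ** 2, axis=1) ** ((2.0 - p) / 2.0)
        T = pi_inv[:, None] * A_t.T
        H = A_t @ T
        # Step 4: one CG solve per column, warm-started from the previous column.
        S_new = np.empty_like(S)
        for l in range(L):
            lam = cg(H, X_t[:, l], lam, eps)
            S_new[:, l] = T @ lam
        S = S_new
    return S

# Synthetic jointly sparse test problem (illustrative data only).
rng = np.random.default_rng(0)
m, n, L = 20, 40, 4
A = rng.normal(size=(m, n))
support = np.array([3, 17, 29])
S_true = np.zeros((n, L))
S_true[support] = rng.choice([-1.0, 1.0], size=(3, L)) * (1.0 + rng.random((3, L)))
X = A @ S_true

S_hat = cg_m_focuss(A, X, p=1.0, eps=1e-3, n_iter=30)
rel_residual = np.linalg.norm(A @ S_hat - X) / np.linalg.norm(X)
largest_rows = np.sort(np.argsort(np.sum(S_hat ** 2, axis=1))[-3:])
```

In this sketch the matrix `H` is formed explicitly for readability; a large-scale implementation would instead pass a matrix-vector product to the CG routine, which is where the fast transforms discussed later would enter.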

3 Distributed Compressed Sensing by Overlapping CG-M-FOCUSS

Let Z be an unknown matrix in $R^{n\times T}$. Suppose that we have m linear measurements $X \in R^{m\times T}$ of the unknown signal matrix Z as follows:

$$X = \Phi \cdot Z, \tag{11}$$

where Φ consists of m rows drawn from an n × n orthogonal transform matrix (e.g., a Fourier transform matrix), so m ≤ n. The standard methods require at least n measurements. Suppose Z is compressible or sparse in an appropriate transform domain $\mathcal{W}$, described by the orthogonal sparsifying transform $W \in R^{n\times n}$ (after extracting the real and imaginary parts if necessary) [1], i.e., $S_z = WZ$ is sparse. Then, from (11), we have

$$X = A \cdot S_z, \tag{12}$$

where $A = \Phi \cdot W^{-1}$. Since $S_z$ is sparse, it is possible to reconstruct Z in the transform domain $\mathcal{W}$ even if m < n [19, 18, 20]. Given Φ and W, compressed sensing or compressive sampling recovers the true signals Z by exploiting their sparsity or compressibility in the transform domain $\mathcal{W}$. The CG-M-FOCUSS proposed in Section 2 can then be employed to solve the distributed compressed sensing problem (12) [23].

For the compressed sensing problem (12), it is worth mentioning some important features that arise in CG-M-FOCUSS [1] due to the orthogonal transform matrix W and the partial orthogonal transform matrix Φ. Firstly, the SVD step for finding the preconditioning transform matrix C can be omitted because $AA^T = \Phi \cdot W^{-1} \cdot W^{-T} \cdot \Phi^T = I_{m\times m}$. Secondly, the computational complexity of $A\Pi^{-1}(\bar{S})A^T$ can usually be reduced by fast orthogonal transforms W and Φ (e.g., fast Fourier transform, fast wavelet transform and so on).
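The identity $AA^T = I_{m\times m}$ that lets the preconditioning step be skipped is easy to confirm numerically. In the sketch below, Φ is a random selection of rows from the unitary DFT matrix and W is a random orthogonal matrix standing in for a wavelet transform; both choices and all sizes are assumptions for illustration. Because the DFT matrix is complex, the transpose is taken as the conjugate transpose.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 16, 10                                          # illustrative sizes (assumption)

F = np.fft.fft(np.eye(n), norm="ortho")                # unitary n x n DFT matrix
rows = np.sort(rng.choice(n, size=m, replace=False))   # keep m of the n rows
Phi = F[rows]                                          # partial orthogonal transform

W, _ = np.linalg.qr(rng.normal(size=(n, n)))           # stand-in orthogonal sparsifying transform
A = Phi @ W.T                                          # A = Phi W^{-1}, using W^{-1} = W^T

gram = A @ A.conj().T                                  # should equal the m x m identity
```

Since all eigenvalues of $AA^T$ coincide, the matrix already has the clustered spectrum that the preconditioner $C = \Lambda U^T$ is designed to produce.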

4 Experiments

In this section, we give an MRI image reconstruction example to illustrate the performance of CG-M-FOCUSS and compare it with the standard FOCUSS and the standard M-FOCUSS [24]. All methods are implemented in Matlab 7.2 and run on a Dell PC with an Intel Xeon 3 GHz CPU under Windows XP Professional. The algorithm parameters are as follows: the sparsity parameter is p = 1; the block length is L = 10 and the overlapping rate is 80%; for CG-M-FOCUSS, the tolerance is ε = 0.001. All algorithms are initialized with the matrix "1", in which all entries are 1. The Lagrange multiplier vector is initialized as the zero vector $\lambda = 0_{m\times 1}$.

Consider a 512 × 512 MRI image reconstruction problem by compressed sensing. We extracted 472 of the 512 possible parallel lines in the spatial frequency of an image I. The other 40 lines were removed (see Fig. 1). Thus, a 472 × 512 DFT coefficient matrix $I_f$, the kept DFT coefficient matrix of the original MRI image I after removal, was obtained; the corresponding compressed sensing matrix Φ is a 472 × 512 matrix obtained by randomly removing the corresponding 40 rows of the DFT transform matrix. Considering that images usually have a sparse representation in the wavelet domain, we reconstruct the MRI image in the wavelet domain using the Daubechies 4 transform W. Then, we can derive the following complex-valued compressed sensing problem:

$$I_f = \Phi \cdot I = \Phi \cdot W^{-1} \cdot W \cdot I = A \cdot I_W, \tag{13}$$

where $A = \Phi \cdot W^{-1}$ and $I_W = W \cdot I$. Equation (13) can be further represented as a standard real-valued compressed sensing problem with 512 samples (t = 1, · · · , 512; m = 472, n = 512):

$$I_f^R + I_f^I = (A^R + A^I) \cdot I_W, \tag{14}$$

where $I_f^R$, $I_f^I$ are respectively the real part and imaginary part of $I_f$, and $A^R$, $A^I$ are the real part and imaginary part of A, respectively. Then, we can reconstruct the original MRI image I by $\hat{I} = W^{-1} \cdot \hat{I}_W$, where $\hat{I}_W$ is the solution of (14).

Fig. 1. MRI image reconstruction results. Upper left: removed DFT coefficients (in white). Upper right: original MRI image. Lower left: linear reconstruction; some artifacts are indicated by an arrow. Lower right: sparse reconstruction by CG-M-FOCUSS.

As with M-FOCUSS, it also takes much time for CG-M-FOCUSS to compute the matrix-matrix multiplications $A\Pi^{-1}(\bar{S})A^T$. As mentioned in Section 3, fast algorithms are usually available for this kind of compressed sensing [1]. Note that $A = \Phi \cdot W^{-1}$, where W is an orthogonal wavelet matrix ($W^{-1} = W^T$) and Φ is a part of a Fourier transform matrix. So the multiplication $A\Pi^{-1}(\bar{S})A^T$ can be done efficiently by performing the fast inverse wavelet transform and fast DFT on the matrix $\Pi^{-1}$ [1]. For any vector $v \in R^n$, the computational complexity of the fast DFT is only O(n log n). In addition, the SVD or EVD step for finding a good preconditioning transform matrix C is omitted in this example because $AA^T = I$ is an identity matrix.

Table 1. MRI reconstruction results

Method                  PSNR [dB]   Runtime (seconds)
Linear reconstruction   25.03       \
Standard FOCUSS         28.88       1913.17
M-FOCUSS                33.91       942.58
CG-M-FOCUSS             33.81       525.98

All algorithms run 30 iterations and converge within 30 iterations. From Table 1, we can see that M-FOCUSS and CG-M-FOCUSS achieved better results than the standard FOCUSS. Moreover, M-FOCUSS and CG-M-FOCUSS achieved almost the same results (i.e., PSNRs slightly higher than 33.8 dB), and the main difference is the computational time. So we show only the MRI image reconstructed by CG-M-FOCUSS in Fig. 1. In addition, Table 1 shows that the sparse MRI method gained approximately 8.8 dB compared with the linear reconstruction method, which sets the unobserved DFT coefficients to zero and then directly performs the inverse DFT. We can also compare their results in Fig. 1, where the linear reconstruction suffers from arc-like streaking artifacts (indicated by the arrow) due to undersampling, whereas the artifacts are much less noticeable in the sparse reconstruction.
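For readers who want to reproduce the kind of comparison reported in Table 1, the following sketch computes PSNR with the common definition $10\log_{10}(\text{peak}^2/\text{MSE})$. The paper does not state which peak value or exact PSNR definition was used, so both the formula and the `peak` argument are assumptions; the last line only recomputes the gain quoted in the text from the Table 1 entries.

```python
import numpy as np

def psnr(reference, estimate, peak):
    """Peak signal-to-noise ratio in dB, using 10 * log10(peak^2 / MSE) (assumed definition)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy check: a constant error of 25.5 on a 0..255 scale gives MSE = 650.25, i.e. 20 dB.
ref = np.zeros((8, 8))
est = np.full((8, 8), 25.5)
value = psnr(ref, est, peak=255.0)

# Gain of CG-M-FOCUSS over linear reconstruction, from the Table 1 entries.
gain = 33.81 - 25.03
```

The computed gain of 8.78 dB matches the "approximately 8.8 dB" stated in the text.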

5 Conclusions

M-FOCUSS is a very efficient method for sparse representation and compressed sensing, which can simultaneously process multiple measurement vectors. In this paper, we extended M-FOCUSS to CG-M-FOCUSS by incorporating CG iterations. CG-M-FOCUSS is computationally less expensive and more suitable for large-scale problems than the standard M-FOCUSS. In addition, the application of CG-M-FOCUSS to compressed sensing is discussed. An MRI image reconstruction was performed by distributed compressed sensing. We have shown that CG-M-FOCUSS can considerably reduce computation time compared to the standard FOCUSS and M-FOCUSS while achieving almost the same PSNR as M-FOCUSS. In addition, we would like to emphasize that M-FOCUSS and CG-M-FOCUSS can achieve better results by applying overlapping. This point is confirmed and supported by the MRI reconstruction example given in this paper.

References 1. Kim, S.J., Koh, K., Lustig, M., Boyd, S., Gorinevsky, D.: An interior-point method for large-scale ℓ1 -regularized least squares. IEEE Journal on Selected Topics in Signal Processing 4(1), 606–617 (2007)


2. Rao, B.D.: Signal processing with the sparseness constraint. In: Proceedings of the ICASSP, Seattle, WA, vol. III, pp. 1861–1864 (1998) 3. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, New York (2003) 4. Mallat, S.G., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing 41(12), 3397–3415 (1993) 5. Tropp, J.: Greed is good: algorithmic results for sparse approximation. IEEE Trans. Information Theory 50(10), 2231–2242 (2004) 6. Chen, S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20(1), 33–61 (1998) 7. Donoho, D.L., Elad, M.: Maximal sparsity representation via ℓ1 minimization. In: Proc. National Academy Science, vol. 100, pp. 2197–2202 (2003) 8. Li, Y.Q., Cichocki, A., Amari, S.: Analysis of sparse representation and blind source separation. Neural Computation 16, 1193–1234 (2004) 9. Takigawa, I., Kudo, M., Toyama, J.: Performance analysis of minimum ℓ1 -norm solutions for underdetermined source separation. IEEE Trans. Signal Processing 52(3), 582–591 (2004) 10. Li, Y.Q., Amari, S., Cichocki, A., Ho, D.W.C., Xie, S.L.: Underdetermined blind source separation based on sparse representation. IEEE Trans. Signal Processing 54(2), 423–437 (2006) 11. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse representations. Signal Processing 81, 2353–2362 (2001) 12. Gorodnitsky, I.F., George, J., Rao, B.D.: Neuromagnetic source imaging with FOCUSS: A recursive weighted minimum norm algorithm. Electroencephalography and Clinical Neurophysiology 95(4), 231–251 (1995) 13. Rao, B.D., Kreutz-Delgado, K.: Deriving algorithms for computing sparse solutions to linear inverse problems. In: Conference Record of the Thirty-First Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 955–959 (1997) 14. 
Gorodnitsky, I.F., Rao, B.D.: Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Trans. Signal Processing 45(3), 600–616 (1997) 15. Rao, B.D., Kreutz-Delgado, K.: An affine scaling methodology for best basis selection. IEEE Trans. Signal Processing 47(1), 187–200 (1999) 16. Kreutz-Delgado, K., Murry, J.F., Rao, B.D., et al.: Dictionary learning algorithms for sparse representation. Neural Computation 15, 349–396 (2003) 17. Cotter, S.F., Rao, B.D., Engan, K., Kreutz-Delgado, K.: Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Trans. Signal Processing 53(7), 2477–2488 (2005) 18. Baraniuk, R.: Compressive sensing. IEEE Signal Processing Magazine 24(4), 118– 121 (2007) 19. Donoho, D.: Compressed sensing. IEEE Trans. on Information Theory 52(4), 1289– 1306 (2006) 20. Duarte, M., Davenport, M., Takhar, D., Laska, J., Sun, T., Kelly, K., Baraniuk, R.: Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine 25(2), 83–91 (2008) 21. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49(6), 409–436 (1952)


22. Nocedal, J., Wright, S.J.: Numerical optimization, 2nd edn. Springer series in operations research and financial engineering. Springer, New York (2006) 23. Duarte, M., Sarvotham, S., Baron, D., Wakin, M., Baraniuk, R.: Distributed compressed sensing of jointly sparse signals. In: Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, pp. 1537–1541 (2005) 24. Lustig, M., Donoho, D.L., Santos, J.M., Pauly, J.M.: Compressed sensing MRI. IEEE Signal Processing Magazine 25(2), 72–82 (2008)

Dynamic of Cohen-Grossberg Neural Networks with Variable Coefficients and Time-Varying Delays

Xuehui Mei and Haijun Jiang

College of Mathematics and System Sciences, Xinjiang University, Urumqi 830046, China
[email protected]

Abstract. In this paper, we study Cohen-Grossberg neural networks with variable coefficients and time-varying delays. By applying the Young inequality technique and the Dini derivative, and by introducing many real parameters, we directly estimate the upper bound of solutions and establish new and useful criteria on boundedness and global exponential stability. The results obtained in this paper extend and generalize the corresponding results in the previous literature.

Keywords: Neural networks; Boundedness; Exponential stability; Variable coefficients; Delays.

1 Introduction

In recent years, dynamical characteristics such as the stability and periodicity of Hopfield networks, cellular neural networks and bidirectional associative memory neural networks have played an important role in pattern recognition, associative memory, and combinatorial optimization (see [1-10]). In particular, a general neural network, which is called the Cohen-Grossberg neural network and can function as a stable associative memory, was developed and studied. The stability of recurrent neural networks is a prerequisite for almost all neural network applications. In this paper, we consider a general form of the Cohen-Grossberg neural network model with variable coefficients and time-varying delays: for $i = 1, 2, \cdots, n$,

$$\dot{x}_i(t) = -a_i(x_i(t)) \Big[ b_i(t, x_i(t)) - \sum_{j=1}^{n} c_{ij}(t) f_j(x_j(t)) - \sum_{j=1}^{n} d_{ij}(t) f_j(x_j(t - \tau_{ij}(t))) + I_i(t) \Big] \tag{1}$$

The main purpose of this paper is to study the dynamic behavior of the general Cohen-Grossberg neural network system (1). By applying the Young inequality technique and the Dini derivative, and by introducing many real parameters, we directly estimate the upper bound of solutions of system (1). We will establish new and useful criteria on the boundedness and global exponential stability. We will see that the results obtained in this paper extend and generalize the corresponding results in [2, 7, 10].

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 246–254, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Preliminaries

For system (1), for convenience of description, we introduce the following assumptions.

(H1) The functions $a_i(u) > 0$ are bounded and satisfy a local Lipschitz condition, and there are positive constants $\underline{\alpha}_i$, $\bar{\alpha}_i$ such that $0 < \underline{\alpha}_i < a_i(u) < \bar{\alpha}_i < +\infty$ for all $u \in R$, $t \in (0, +\infty)$, $i = 1, 2, \cdots, n$.

(H2) The functions $b_i(t, u)$ are continuous, $b_i(t, 0)$ are bounded, and there exist positive bounded continuous functions $\beta_i(t)$ such that $\dfrac{b_i(t, u) - b_i(t, v)}{u - v} \ge \beta_i(t) > 0$ for all $t \in (0, +\infty)$, $u, v \in R$, $u \ne v$, $i = 1, 2, \cdots, n$.

(H3) The functions $f_i(u)$ satisfy a Lipschitz condition, i.e., there exist positive constants $k_i$ ($i = 1, 2, \cdots, n$) such that $|f_i(u) - f_i(v)| \le k_i|u - v|$ for all $u, v \in R$, $i = 1, 2, \cdots, n$.

(H4) There are constants $p_1, p_2, \cdots, p_n$ such that n

pi βi (t) − n



r−lij r−gij r−sij r−hij r−1 pj (|cij (t)| r−1 kj r−1 + |dij (t)| r−1 kj r−1 ) r j=1

1 l l pj (|cij (t)|hij kjij + |dij (t)|sij kjij ) > σ > 0 r j=1

for all t ∈ [0, ∞], i = 1, 2, · · · , n. Remark 1. In system (2.1), If cij (t) ≡ cij dij (t) ≡ dij , in which aij , dij are constants, then the assume (H4 ) is transformed into the following form. (H4′ ) There are constants p1 , p2 , · · · , pn , such that n

pi βi (t) − n



r−lij r−gij r−hij r−sij r−1  pj (|cij | r−1 kj r−1 + |dij | r−1 kj r−1 ) r j=1

1 l l pj (|cij |hij kjij + |dij |sij kjij ) > σ > 0 r j=1

for all t ∈ [0, ∞], i = 1, 2, · · · , n.
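As a rough illustration of how a condition of the form (H4′) might be checked for given constants, the following sketch evaluates its left-hand side for every neuron at one fixed value of $\beta_i(t)$. The function name `h4_margins`, the argument layout and the numerical values are assumptions made for this example only; they do not come from the paper.

```python
def h4_margins(p, beta, c, d, k, r, h, l, s, g):
    """Left-hand side of (H4') for each i, for one fixed value beta[i] of beta_i(t).

    c, d, h, l, s, g are n-by-n nested lists; p, beta, k are lists of length n.
    Zero weights combined with negative exponents are not handled in this sketch.
    """
    n = len(p)
    margins = []
    for i in range(n):
        first = 0.0   # sum that is multiplied by (r - 1) / r
        second = 0.0  # sum that is multiplied by 1 / r
        for j in range(n):
            cij, dij = abs(c[i][j]), abs(d[i][j])
            first += p[j] * (
                cij ** ((r - h[i][j]) / (r - 1)) * k[j] ** ((r - l[i][j]) / (r - 1))
                + dij ** ((r - s[i][j]) / (r - 1)) * k[j] ** ((r - g[i][j]) / (r - 1))
            )
            second += p[j] * (
                cij ** h[i][j] * k[j] ** l[i][j] + dij ** s[i][j] * k[j] ** g[i][j]
            )
        margins.append(p[i] * beta[i] - (r - 1) / r * first - second / r)
    return margins


# Hypothetical two-neuron data with r = 2 and all exponents equal to 1.
ones = [[1, 1], [1, 1]]
margins = h4_margins(
    p=[1.0, 1.0], beta=[3.0, 3.0],
    c=[[0.2, -0.1], [0.1, 0.2]], d=[[0.1, 0.1], [-0.1, 0.1]],
    k=[1.0, 1.0], r=2, h=ones, l=ones, s=ones, g=ones,
)
condition_holds = all(m > 0 for m in margins)  # a positive margin allows some sigma > 0
```

With these illustrative numbers both margins are approximately 2.5, so any $\sigma$ below that value would satisfy the inequality. For time-varying $\beta_i(t)$ the check would have to be repeated over the range of $t$ of interest.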


X. Mei and H. Jiang

For system (1), we assume that $I_i(t)$ $(i = 1, 2, \dots, n)$ are continuous bounded functions and that $\tau_{ij}(t)$ $(i, j = 1, 2, \dots, n)$ are nonnegative, continuous, bounded functions. Let $\tau = \sup\{\tau_{ij}(t) : t \in [0, +\infty),\ i, j = 1, 2, \dots, n\}$. We introduce $C([-\tau, 0], R^n)$ as the initial function space of system (1), which is the Banach space of all continuous functions $\phi = (\phi_1(t), \phi_2(t), \dots, \phi_n(t))^T \in C([-\tau, 0], R^n)$ with the norm $\|\phi\| = \sup_{-\tau \le \theta \le 0} |\phi(\theta)|$, where $|\phi(\theta)| = [\max_{1 \le i \le n} |\phi_i(\theta)|^r]^{\frac{1}{r}}$.

Definition 1. System (1) is said to be uniformly ultimately bounded if there exists a constant $B > 0$ such that for each $H > 0$ there exists a $T(H) > 0$ such that $[t_0 \in R_+,\ \phi \in C[-\tau, 0],\ \|\phi\| \le H,\ t > t_0 + T]$ imply $|x(t, t_0, \phi)| \le B$.

Definition 2. System (1) is said to be globally exponentially stable if there are constants $\epsilon > 0$ and $M \ge 1$ such that for any two solutions $x(t) = (x_1(t), x_2(t), \dots, x_n(t))$ and $y(t) = (y_1(t), y_2(t), \dots, y_n(t))$ of system (1) with the initial functions $\phi, \psi \in C[-\tau, 0]$, respectively, one has $|x(t) - y(t)| \le M \|\phi - \psi\| \exp(-\epsilon t)$ for all $t \in R_+$.

As a preliminary, we first give the following lemma on the Young inequality.

Lemma 1. Assume that $a \ge 0$, $b \ge 0$, $p > 1$, $q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$. Then the following inequality holds:

$$ab \le \frac{1}{p} a^p + \frac{1}{q} b^q.$$

3 Boundedness and Global Exponential Stability

Theorem 1. Suppose that (H1)–(H4) hold. Then system (1) is uniformly bounded.

Proof. For $B_1 > 0$, take $\varphi \in C([-\tau, 0], R^n)$ such that $\|\varphi\| \le B_1$. Let $x(t) = (x_1(t), x_2(t), \dots, x_n(t))$ be the solution of system (1) that satisfies the initial condition $x_i(s) = \varphi_i(s)$, $s \in [-\tau, 0]$ $(i = 1, 2, \dots, n)$. Let $x_i(t) = p_i u_i(t)$, $i = 1, 2, \dots, n$. Then system (1) is transformed into the following form:

$$\dot{u}_i(t) = -\frac{1}{p_i} a_i(p_i u_i(t)) \Big[ b_i(t, p_i u_i(t)) - \sum_{j=1}^{n} c_{ij}(t) f_j(p_j u_j(t)) - \sum_{j=1}^{n} d_{ij}(t) f_j(p_j u_j(t - \tau_{ij}(t))) + I_i(t) \Big].$$

We take

$$V_i(t) = \left| \int_0^{u_i(t)} \frac{|s|^{r-1}}{a_i(p_i s)}\, ds \right|. \qquad (2)$$


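Calculating the derivative of $V_i(t)$ along the solutions of the transformed system, where $\sigma(u_i(t))$ denotes the sign of $u_i(t)$ (not the constant $\sigma$ of (H4)), we obtain

$$
\begin{aligned}
\dot{V}_i(t) &= \frac{|u_i(t)|^{r-1}}{a_i(p_i u_i(t))}\, \sigma(u_i(t))\, \dot{u}_i(t) \\
&= \sigma(u_i(t)) \frac{|u_i(t)|^{r-1}}{p_i} \Big[ -b_i(t, p_i u_i(t)) + \sum_{j=1}^{n} c_{ij}(t) f_j(p_j u_j(t)) + \sum_{j=1}^{n} d_{ij}(t) f_j(p_j u_j(t - \tau_{ij}(t))) - I_i(t) \Big] \\
&\le \frac{|u_i(t)|^{r-1}}{p_i} \Big[ -\sigma(u_i(t)) \big( b_i(t, p_i u_i(t)) - b_i(t, 0) \big) + |b_i(t, 0)| + \sum_{j=1}^{n} |c_{ij}(t)|\,|f_j(p_j u_j(t)) - f_j(0)| \\
&\qquad + \sum_{j=1}^{n} |d_{ij}(t)|\,|f_j(p_j u_j(t - \tau_{ij}(t))) - f_j(0)| + \sum_{j=1}^{n} |c_{ij}(t)|\,|f_j(0)| + \sum_{j=1}^{n} |d_{ij}(t)|\,|f_j(0)| + |I_i(t)| \Big] \\
&\le \frac{|u_i(t)|^{r-1}}{p_i} \Big[ -p_i |u_i(t)| \beta_i(t) + \sum_{j=1}^{n} |c_{ij}(t)| p_j k_j |u_j(t)| + \sum_{j=1}^{n} |d_{ij}(t)| p_j k_j |u_j(t - \tau_{ij}(t))| \\
&\qquad + \Big( |b_i(t, 0)| + \sum_{j=1}^{n} |c_{ij}(t)|\,|f_j(0)| + \sum_{j=1}^{n} |d_{ij}(t)|\,|f_j(0)| + |I_i(t)| \Big) \Big] \\
&\le \frac{1}{p_i} \Big[ -p_i |u_i(t)|^r \beta_i(t) + \sum_{j=1}^{n} |c_{ij}(t)| p_j k_j |u_i(t)|^{r-1} |u_j(t)| + \sum_{j=1}^{n} |d_{ij}(t)| p_j k_j |u_i(t)|^{r-1} |u_j(t - \tau_{ij}(t))| + M |u_i(t)|^{r-1} \Big],
\end{aligned}
$$

where

$$M = \sup_{1 \le i \le n,\ t \ge 0} \Big\{ |b_i(t, 0)| + \sum_{j=1}^{n} |c_{ij}(t)|\,|f_j(0)| + \sum_{j=1}^{n} |d_{ij}(t)|\,|f_j(0)| + |I_i(t)| \Big\}.$$

By Lemma 1,

$$k_j |c_{ij}(t)|\,|u_i(t)|^{r-1} |u_j(t)| \le \frac{r-1}{r} |c_{ij}(t)|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} |u_i(t)|^r + \frac{1}{r} |c_{ij}(t)|^{h_{ij}} k_j^{l_{ij}} |u_j(t)|^r$$

and

$$k_j |d_{ij}(t)|\,|u_i(t)|^{r-1} |u_j(t - \tau_{ij}(t))| \le \frac{r-1}{r} |d_{ij}(t)|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} |u_i(t)|^r + \frac{1}{r} |d_{ij}(t)|^{s_{ij}} k_j^{g_{ij}} |u_j(t - \tau_{ij}(t))|^r.$$

From the above, we have

$$
\begin{aligned}
D^+ V_i(t) \le \frac{1}{p_i} \Big[ & -p_i |u_i(t)|^r \beta_i(t) + \frac{r-1}{r} \sum_{j=1}^{n} p_j \Big( |c_{ij}(t)|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} + |d_{ij}(t)|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} \Big) |u_i(t)|^r \\
& + \frac{1}{r} \sum_{j=1}^{n} p_j |c_{ij}(t)|^{h_{ij}} k_j^{l_{ij}} |u_j(t)|^r + \frac{1}{r} \sum_{j=1}^{n} p_j |d_{ij}(t)|^{s_{ij}} k_j^{g_{ij}} |u_j(t - \tau_{ij}(t))|^r + M |u_i(t)|^{r-1} \Big]. \qquad (3)
\end{aligned}
$$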


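Take $B_2 = \max\{B_1, \frac{4M}{3\sigma}\}$. Next we prove that $|u_i(t)|^r \le B_2^r$ for all $t \ge 0$, $i = 1, 2, \dots, n$. If $|u_i(t)|^r \le B_2^r$ is not true, then there exist $i$ and $t_1 > 0$ such that

$$|u_i(t_1)|^r = B_2^r, \qquad \frac{d|u_i(t)|^r}{dt}\Big|_{t = t_1} = r |u_i(t_1)|^{r-1} \sigma(u_i(t_1)) \dot{u}_i(t_1) \ge 0,$$

and $|u_j(t)|^r \le B_2^r$ for $-\tau \le t \le t_1$, $j = 1, 2, \dots, n$. Thus we have

$$\dot{V}_i(t_1) = \sigma(u_i(t_1)) \frac{|u_i(t_1)|^{r-1}}{a_i(p_i u_i(t_1))} \dot{u}_i(t_1) \ge 0.$$

But from (3), (H4) and $M B_2^{r-1} \le \frac{3}{4} \sigma B_2^r$, we have

$$
\begin{aligned}
D^+ V_i(t_1) &\le \frac{1}{p_i} \Big[ -p_i B_2^r \beta_i(t_1) + \frac{r-1}{r} \sum_{j=1}^{n} p_j \Big( |c_{ij}(t_1)|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} + |d_{ij}(t_1)|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} \Big) B_2^r \\
&\qquad + \frac{1}{r} \sum_{j=1}^{n} p_j |c_{ij}(t_1)|^{h_{ij}} k_j^{l_{ij}} B_2^r + \frac{1}{r} \sum_{j=1}^{n} p_j |d_{ij}(t_1)|^{s_{ij}} k_j^{g_{ij}} B_2^r + \frac{3}{4} \sigma B_2^r \Big] \\
&< \frac{1}{p_i} \Big[ -\sigma B_2^r + \frac{3}{4} \sigma B_2^r \Big] = -\frac{\sigma B_2^r}{4 p_i} < 0.
\end{aligned}
$$

This is a contradiction. Thus, we get $|u_i(t)|^r \le B_2^r$ for all $t \ge 0$. Let $B_3 = \max_{1 \le i \le n} \{p_i (B_2 + 1)\}$. We finally have

$$|x_i(t)| < B_3 \qquad (4)$$

for all $t \ge 0$. Therefore, the solutions of system (1) are defined on $R_+$ and are uniformly bounded. This completes the proof.

In system (1), if $c_{ij}(t) \equiv c_{ij}$ and $d_{ij}(t) \equiv d_{ij}$, where $c_{ij}$ and $d_{ij}$ are constants, then system (1) takes the following form:

$$\dot{x}_i(t) = -a_i(x_i(t)) \Big[ b_i(t, x_i(t)) - \sum_{j=1}^{n} c_{ij} f_j(x_j(t)) - \sum_{j=1}^{n} d_{ij} f_j(x_j(t - \tau_{ij}(t))) + I_i \Big]. \qquad (5)$$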

Theorem 2. Suppose that (H1)–(H3) and (H4′) hold. Then system (5) has an equilibrium point $x^* = (x_1^*, x_2^*, \dots, x_n^*)$.

The proof of Theorem 2 is easy, so we omit it. Let $y(t) = x(t) - x^*$. Then

$$\dot{y}_i(t) = -a_i(y_i(t) + x_i^*) \Big[ b_i(t, y_i(t) + x_i^*) - b_i(t, x_i^*) - \sum_{j=1}^{n} c_{ij} \big( f_j(y_j(t) + x_j^*) - f_j(x_j^*) \big) - \sum_{j=1}^{n} d_{ij} \big( f_j(y_j(t - \tau_{ij}(t)) + x_j^*) - f_j(x_j^*) \big) \Big].$$


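Theorem 3. Suppose that (H1)–(H3) and (H4′) hold. Then the solution of system (5) is globally exponentially stable.

Proof. Under assumption (H4′), let

$$V_i(t) = \int_0^{|y_i(t)|} \frac{r s^{r-1}}{a_i(s)}\, ds.$$

It is easy to obtain

$$\frac{|y_i(t)|^r}{\bar{\alpha}_i} \le V_i(t) \le \frac{|y_i(t)|^r}{\alpha_i}.$$

Calculating the derivative of $V_i(t)$, we obtain

$$
\begin{aligned}
\dot{V}_i(t) &= \frac{r |y_i(t)|^{r-1}}{a_i(|y_i(t)|)} \sigma(y_i(t)) \Big\{ -a_i(y_i(t) + x_i^*) \Big[ b_i(t, y_i(t) + x_i^*) - b_i(t, x_i^*) - \sum_{j=1}^{n} c_{ij} \big( f_j(y_j(t) + x_j^*) - f_j(x_j^*) \big) \\
&\qquad - \sum_{j=1}^{n} d_{ij} \big( f_j(y_j(t - \tau_{ij}(t)) + x_j^*) - f_j(x_j^*) \big) \Big] \Big\} \\
&\le -\frac{r \alpha_i}{\bar{\alpha}_i} \beta_i(t) |y_i(t)|^{r} + \frac{r \bar{\alpha}_i}{\alpha_i} \Big[ \sum_{j=1}^{n} |c_{ij}| k_j |y_i(t)|^{r-1} |y_j(t)| + \sum_{j=1}^{n} |d_{ij}| k_j |y_i(t)|^{r-1} |y_j(t - \tau_{ij}(t))| \Big].
\end{aligned}
$$

By the Young inequality we further have

$$
\begin{aligned}
\dot{V}_i(t) &\le \frac{r \alpha_i}{\bar{\alpha}_i} \{ -\beta_i(t) \alpha_i V_i(t) \} + \frac{r \bar{\alpha}_i}{\alpha_i} \Big[ \frac{r-1}{r} \sum_{j=1}^{n} \Big( |c_{ij}|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} + |d_{ij}|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} \Big) \bar{\alpha}_i V_i(t) \\
&\qquad + \frac{1}{r} \sum_{j=1}^{n} |c_{ij}|^{h_{ij}} k_j^{l_{ij}} \bar{\alpha}_j V_j(t) + \frac{1}{r} \sum_{j=1}^{n} |d_{ij}|^{s_{ij}} k_j^{g_{ij}} \bar{\alpha}_j \bar{V}_j(t) \Big],
\end{aligned}
$$

in which

$$\bar{y}_j(t) = \sup_{t - \tau \le s \le t} |y_j(s)|, \qquad \bar{V}_j(t) = \sup_{t - \tau \le s \le t} |V_j(s)|.$$

We take

$$
\begin{aligned}
\lambda_i = \inf \Big\{ \lambda : \ & p_i \lambda - p_i \beta_i(t) \alpha_i + \frac{r-1}{r} \sum_{j=1}^{n} p_i \bar{\alpha}_i \Big( |c_{ij}|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} + |d_{ij}|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} \Big) \\
& + \frac{1}{r} \sum_{j=1}^{n} p_j |c_{ij}|^{h_{ij}} k_j^{l_{ij}} \bar{\alpha}_j + e^{\lambda \tau} \frac{1}{r} \sum_{j=1}^{n} p_j |d_{ij}|^{s_{ij}} k_j^{g_{ij}} \bar{\alpha}_j = 0 \Big\}.
\end{aligned}
$$

Let $\alpha = \min\{\lambda_1, \lambda_2, \dots, \lambda_n\}$. Then we have

$$
\begin{aligned}
& p_i \alpha - p_i \beta_i(t) \alpha_i + \frac{r-1}{r} \sum_{j=1}^{n} p_i \bar{\alpha}_i \Big( |c_{ij}|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} + |d_{ij}|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} \Big) \\
& \qquad + \frac{1}{r} \sum_{j=1}^{n} p_j |c_{ij}|^{h_{ij}} k_j^{l_{ij}} \bar{\alpha}_j + e^{\alpha \tau} \frac{1}{r} \sum_{j=1}^{n} p_j |d_{ij}|^{s_{ij}} k_j^{g_{ij}} \bar{\alpha}_j \le 0
\end{aligned}
$$

for all $t \ge 0$, $i = 1, 2, \dots, n$.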


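We choose a constant $\beta > 1$ such that $\beta p_i e^{\lambda \tau} > 1$ for all $t \in [-\tau, 0]$ and $i = 1, 2, \dots, n$. Let

$$Z_i(t) = \beta p_i \sum_{j=1}^{n} \bar{V}_j(0) e^{-\alpha t}.$$

Further, let the constant $k > 1$ be such that $V_i(t) < k Z_i(t)$ for all $t \in [-\tau, 0]$. Next we prove that $V_i(t) < k Z_i(t)$ for all $t \in [0, \infty)$. If this is not true, then there exist $i \in \{1, 2, \dots, n\}$ and $t_i > 0$ such that $V_i(t_i) = k Z_i(t_i)$, $V_j(t) < k Z_j(t)$ for $t \in [-\tau, t_i)$, $j = 1, 2, \dots, n$, and $D^+ V_i(t_i) \ge k Z_i'(t_i) = -\alpha k Z_i(t_i)$. Since $\bar{V}_j(t_i) = \sup_{-\tau \le \theta \le 0} V_j(t_i + \theta)$ and $Z_j(t)$ is a strictly monotone decreasing function, there is $\theta^* \in [-\tau, 0]$ such that

$$\bar{V}_j(t_i) = \sup_{-\tau \le \theta \le 0} V_j(t_i + \theta) = V_j(t_i + \theta^*) < k Z_j(t_i + \theta^*) \le k Z_j(t_i - \tau).$$

Thus

$$
\begin{aligned}
\dot{V}_i(t_i) - k Z_i'(t_i) &= D^+ V_i(t_i) + k \beta p_i \alpha \sum_{j=1}^{n} \bar{V}_j(0) e^{-\alpha t_i} \\
&< \frac{r \bar{\alpha}_i}{\alpha_i} \Big[ -\beta_i(t_i) \alpha_i k Z_i(t_i) + \frac{r-1}{r} \sum_{j=1}^{n} \Big( |c_{ij}|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} + |d_{ij}|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} \Big) \bar{\alpha}_i k Z_i(t_i) \\
&\qquad + \frac{1}{r} \sum_{j=1}^{n} |c_{ij}|^{h_{ij}} k_j^{l_{ij}} \bar{\alpha}_j k Z_j(t_i) + \frac{1}{r} \sum_{j=1}^{n} |d_{ij}|^{s_{ij}} k_j^{g_{ij}} \bar{\alpha}_j k Z_j(t_i - \tau) \Big] \\
&\quad + \frac{r \bar{\alpha}_i}{\alpha_i} \Big[ p_i \beta_i(t_i) \alpha_i - \frac{r-1}{r} \sum_{j=1}^{n} p_i \bar{\alpha}_i \Big( |c_{ij}|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} + |d_{ij}|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} \Big) \\
&\qquad - \frac{1}{r} \sum_{j=1}^{n} p_j |c_{ij}|^{h_{ij}} k_j^{l_{ij}} \bar{\alpha}_j - e^{\alpha \tau} \frac{1}{r} \sum_{j=1}^{n} p_j |d_{ij}|^{s_{ij}} k_j^{g_{ij}} \bar{\alpha}_j \Big] k \beta e^{-\alpha t_i} \sum_{l=1}^{n} \bar{V}_l(0) \\
&= \frac{r \bar{\alpha}_i}{\alpha_i} \Big[ -\beta_i(t_i) \alpha_i k \beta p_i e^{-\alpha t_i} \sum_{l=1}^{n} \bar{V}_l(0) + \frac{r-1}{r} \sum_{j=1}^{n} \Big( |c_{ij}|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} + |d_{ij}|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} \Big) \bar{\alpha}_i k \beta p_i e^{-\alpha t_i} \sum_{l=1}^{n} \bar{V}_l(0) \\
&\qquad + \frac{1}{r} \sum_{j=1}^{n} |c_{ij}|^{h_{ij}} k_j^{l_{ij}} \bar{\alpha}_j k \beta p_j e^{-\alpha t_i} \sum_{l=1}^{n} \bar{V}_l(0) + \frac{1}{r} \sum_{j=1}^{n} |d_{ij}|^{s_{ij}} k_j^{g_{ij}} \bar{\alpha}_j k \beta p_j e^{-\alpha (t_i - \tau)} \sum_{l=1}^{n} \bar{V}_l(0) \Big] \\
&\quad + \frac{r \bar{\alpha}_i}{\alpha_i} \Big[ p_i \beta_i(t_i) \alpha_i - \frac{r-1}{r} \sum_{j=1}^{n} p_i \bar{\alpha}_i \Big( |c_{ij}|^{\frac{r-h_{ij}}{r-1}} k_j^{\frac{r-l_{ij}}{r-1}} + |d_{ij}|^{\frac{r-s_{ij}}{r-1}} k_j^{\frac{r-g_{ij}}{r-1}} \Big) \\
&\qquad - \frac{1}{r} \sum_{j=1}^{n} p_j |c_{ij}|^{h_{ij}} k_j^{l_{ij}} \bar{\alpha}_j - e^{\alpha \tau} \frac{1}{r} \sum_{j=1}^{n} p_j |d_{ij}|^{s_{ij}} k_j^{g_{ij}} \bar{\alpha}_j \Big] k \beta e^{-\alpha t_i} \sum_{l=1}^{n} \bar{V}_l(0) \\
&= 0.
\end{aligned}
$$

This is a contradiction. So we have

$$V_i(t) \le Z_i(t) = \beta p_i \sum_{l=1}^{n} \bar{V}_l(0) e^{-\alpha t} \quad \text{for all } t \ge 0.$$

Because

$$\bar{V}_i(0) = \sup_{-\tau \le t \le 0} \int_0^{|y_i(t)|} \frac{r s^{r-1}}{a_i(s)}\, ds \le \sup_{-\tau \le t \le 0} \frac{|y_i(t)|^r}{\alpha_i} \le \frac{\|\phi - x^*\|^r}{\min_{1 \le i \le n} \{\alpha_i\}}$$

and

$$V_i(t) \ge \frac{|y_i(t)|^r}{\bar{\alpha}_i},$$

we further have

$$|x_i(t) - x_i^*|^r \le \frac{n \bar{\alpha}_i \beta p_i}{\min_{1 \le i \le n} \{\alpha_i\}} \|\phi - x^*\|^r e^{-\alpha t} \quad \text{for } t \ge 0,$$

so that $|x_i(t) - x_i^*|$ decays exponentially with rate $\alpha / r$.

This shows that all solutions of system (5) are globally exponentially stable. This completes the proof of Theorem 3.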
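The convergence stated in Theorem 3 can be illustrated numerically. The sketch below integrates a hypothetical two-neuron instance of system (5) with a constant delay by the forward Euler method and compares two solutions started from different constant histories. The functions $a_i$, $b_i$, $f_j$, the weights, the delay and the step size are all assumptions chosen for illustration (with $\beta_i = 3$ dominating the connection weights); none of them come from the paper.

```python
import math


def simulate(x0, T=20.0, dt=0.01, tau=0.5):
    """Forward-Euler integration of a 2-neuron instance of system (5).

    The initial function is the constant history x(s) = x0 on [-tau, 0].
    """
    c = [[0.2, -0.1], [0.1, 0.2]]      # instantaneous weights c_ij (illustrative)
    d = [[0.1, 0.1], [-0.1, 0.1]]      # delayed weights d_ij (illustrative)
    I = [0.5, -0.3]                    # constant inputs I_i (illustrative)

    def a(u):                          # amplification function, 1 < a(u) <= 1.5
        return 1.0 + 0.5 / (1.0 + u * u)

    def b(u):                          # behaved function with beta_i = 3
        return 3.0 * u

    f = math.tanh                      # activation with Lipschitz constant 1

    lag = int(round(tau / dt))
    hist = [list(x0) for _ in range(lag + 1)]  # hist[0] = x(t - tau), hist[-1] = x(t)
    for _ in range(int(round(T / dt))):
        x, x_delayed = hist[-1], hist[0]
        new = []
        for i in range(2):
            bracket = b(x[i]) + I[i]
            for j in range(2):
                bracket -= c[i][j] * f(x[j]) + d[i][j] * f(x_delayed[j])
            new.append(x[i] - dt * a(x[i]) * bracket)
        hist.append(new)
        hist.pop(0)
    return hist[-1]


xa = simulate([2.0, -1.0])
xb = simulate([-3.0, 4.0])
gap = max(abs(u - v) for u, v in zip(xa, xb))  # distance between the two solutions at t = T
```

Under these assumptions the two trajectories end up at practically the same point, which is the behavior expected when all solutions converge exponentially to a single equilibrium. A forward Euler scheme is used only for simplicity; a dedicated delay-differential-equation solver would be preferable for quantitative work.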

4 Conclusion

In this paper, we have investigated the boundedness and global exponential stability of Cohen-Grossberg neural networks. Using the Young inequality technique and the Dini derivative, and introducing a number of real parameters to estimate the upper bound of solutions directly, we gave a sufficient criterion ensuring the boundedness and global exponential stability of system (1). The obtained result improves and extends several earlier publications and is useful in applications such as the manufacture of high-quality neural networks.

Acknowledgement This work was supported by The National Natural Science Foundation of P.R. China (60764003), The Major Project of The Ministry of Education of P.R. China (207130) and The Scientific Research Programmes of Colleges in Xinjiang (XJEDU2007G01, XJEDU2006I05).


References

1. Cao, J., Li, X.: Stability in Delayed Cohen-Grossberg Neural Networks: LMI Optimization Approach. Physica D 212, 54–65 (2005)
2. Cao, J., Liang, J.: Boundedness and Stability for Cohen-Grossberg Neural Networks with Time-Varying Delays. J. Math. Anal. Appl. 296, 665–685 (2004)
3. Chen, T., Rong, L.: Delay Independent Stability Analysis of Cohen-Grossberg Neural Networks. Phys. Lett. A 317, 436–449 (2003)
4. Lu, W., Chen, T.: New Conditions on Global Stability of Cohen-Grossberg Neural Networks. Neural Comput. 15, 1173–1189 (2003)
5. Hwang, C., Cheng, C., Li, T.: Globally Exponential Stability of Generalized Cohen-Grossberg Neural Networks with Delays. Phys. Lett. A 319, 157–166 (2003)
6. Li, Y.: Existence and Stability of Periodic Solutions for Cohen-Grossberg Neural Networks with Multiple Delays. Chaos Solitons & Fractals 20, 459–466 (2004)
7. Wang, L., Zou, X.: Exponential Stability of Cohen-Grossberg Neural Networks. Neural Networks 15, 415–422 (2002)
8. Wang, L., Zou, X.: Harmless Delays in Cohen-Grossberg Neural Networks. Phys. D 170, 163–173 (2002)
9. Yuan, K., Cao, J.: An Analysis of Global Asymptotic Stability of Delayed Cohen-Grossberg Neural Networks via Nonsmooth Analysis. IEEE Trans. Circuits Syst. I 52, 1854–1861 (2005)
10. Zeng, Z., Wang, J.: Global Exponential Stability of Recurrent Neural Networks with Time-Varying Delays in the Presence of Strong External Stimuli. Neural Networks 19, 1528–1537 (2006)

Six-Element Linguistic Truth-Valued Intuitionistic Reasoning in Decision Making

Li Zou1, Wenjiang Li2, and Yang Xu3

1 School of Computer and Information Technology, Liaoning Normal University, Dalian 116029, P.R. China, [email protected]
2 College of Automation, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, P.R. China, wjl [email protected]
3 Intelligent Control and Development Center, Southwest Jiaotong University, Chengdu, 610031, P.R. China, [email protected]

Abstract. An intuitionistic linguistic truth-valued reasoning approach for decision making with both comparable and incomparable truth values is proposed in this paper. By using the lattice implication algebra, a six-element linguistic truth-valued intuitionistic propositional logic system is established which can express both truth degree and falsity degree. The implication operation of linguistic truth-valued intuitionistic propositional logic can be deduced from four implications between the truth values involved. Therefore, we can use more information in the process of reasoning and eventually improve the precision of reasoning. As reasoning and operations act directly on linguistic truth values in the decision process, the issue of how to obtain the weights for rational decision making results is discussed. An illustrative example suggests that the proposed approach is effective for decision making under a linguistic information environment with both truth degree and falsity degree.

Keywords: Lattice implication algebra, Linguistic truth-valued intuitionistic propositional logic, Decision making.

1 Introduction

In the real world, people usually make judgements in natural language using uncertain words. The truth values of a fuzzy proposition are naturally linguistic, e.g., of the form "true", "very true", "possibly false", etc. Therefore, the truth values of a proposition are often not exactly true or false, but are accompanied by linguistic hedges [1], such as absolutely, highly, very, quite, exactly, almost, rather, somewhat, slightly and so on. Different linguistic hedges lead to different judgements, since the degree of the assessment is strengthened or weakened by the hedge. In recent years, some researchers have paid attention to linguistic hedges. Ho proposed an algebraic model, Hedge Algebra, for dealing with linguistic information [2][3]. Turksen studied the formalization and inference of descriptive words, substantive words and declarative sentences [4][5]. Huynh [6] proposed a new model for parametric representation of linguistic truth-values [7][8].

Many-valued logic, a great extension and development of classical logic, has always been a crucial direction in non-classical logic. It is a good tool for dealing with linguistic values. Incomparable linguistic truth values exist in many-valued logic. Lattice-valued logic is an important case of many-valued logic. It can be used to describe uncertain information that may be comparable or incomparable. In [9][10], Xu et al. discussed the lattice-valued propositional logic LP(X) and the gradational lattice-valued propositional logic Lvpl based on lattice implication algebra. Xu et al. have characterized the set of linguistic values by a lattice-valued algebraic structure and investigated the corresponding logic systems with linguistic truth-values based on lattice implication algebra (LIA for short) [11][12]. From the point of view of lattice-valued logic systems [13][14], linguistic truth-values can be put into the lattice implication algebra (LIA) [15][16]. Zou [17] proposed a framework of linguistic truth-valued propositional logic and developed the reasoning method of the six-element linguistic truth-valued logic system.

Sometimes we analyse an event which has both certain and uncertain characteristics, or which has both supporting and opposing evidence. Therefore, a proposition has two truth values: a truth degree and a falsity degree. From the viewpoint of the intuitionistic fuzzy sets introduced by K. Atanassov, the truth value of a fuzzy proposition p is given by a pair of real numbers (μ(p), ν(p)) in the closed interval [0,1] with the constraint μ(p) + ν(p) ≤ 1. In [18] the evaluation function V was defined over a set of propositions S in such a way that V(p) = <μ(p), ν(p)>. Hence the function V : S → [0, 1] × [0, 1] gives, for each proposition in S, a pair which represents its truth degree and its falsity degree [1].

Building on the above work, we put the linguistic truth-values into intuitionistic fuzzy logic. The truth values of the intuitionistic fuzzy logic are then linguistic truth-values instead of numbers. We then discuss the properties of linguistic truth-valued reasoning in intuitionistic fuzzy logic.

The rest of this paper is organized as follows. Section 2 outlines the six-element linguistic truth-valued lattice implication algebra, which can express both comparable and incomparable linguistic truth-values. Section 3 introduces the linguistic truth-valued intuitionistic logic whose truth-value field is the six-element linguistic truth-valued lattice implication algebra, and provides its logical properties. Section 4 illustrates with an example how the proposed approach works for the reasoning method of six-element linguistic truth-valued intuitionistic propositional logic. Finally, Section 5 concludes the paper.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 266–274, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Six-Element Linguistic Truth-Valued Lattice Implication Algebra

In this section we briefly review the notion of linguistic truth-valued lattice implication algebra and its main properties. Let V be a set of linguistic truth values. Every linguistic truth value v ∈ V is composed of a linguistic hedge operator h and a basic word c, i.e., V = H × C, where the linguistic hedge operator set H is a totally ordered finite set. According to the characteristics of lattice implication algebra, we can construct a new lattice implication algebra as the product of some lattice implication algebras. When a hedge is added to the sentence P(x), the truth value V(P) of P is strengthened or weakened; we denote the result by HV(P). Hence the truth value set is V = {h+T, 0T, h−T, h+F, 0F, h−F}, whose elements represent strong true, true, weak true, strong false, false and weak false, respectively. Let L = (V, ∨, ∧, ′, →), where the operations "∨" and "∧" are given by the Hasse diagram of L in Fig. 1 and the operations "→" and "′" are defined in Table 1 and Table 2, respectively. Then L = (V, ∨, ∧, ′, →) is a lattice implication algebra.

[Fig. 1. Hasse diagram of L]

We will discuss linguistic truth-valued intuitionistic logic based on the six-element lattice implication algebra (abbreviated to LTV-IP).
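The six-element algebra L of this section can be sketched in code as a product of a three-element and a two-element Łukasiewicz chain. This encoding is an assumption made for illustration and is not stated in the paper: each value is written as a pair (a, b) with a in {0, 1, 2} and b in {0, 1}, chosen so that the complement pairs h+F with h+T, 0F with 0T and h−F with h−T, and so that h+F is the least and h+T the greatest element. The names `CODE`, `implies`, `complement` and `join` are hypothetical.

```python
# Assumed encoding of the six linguistic truth values as pairs (a, b).
# "h-" is written with an ASCII minus sign.
CODE = {
    "h+F": (0, 0), "0F": (1, 0), "h-F": (2, 0),
    "h-T": (0, 1), "0T": (1, 1), "h+T": (2, 1),
}
NAME = {pair: name for name, pair in CODE.items()}


def implies(x, y):
    """Componentwise Lukasiewicz implication on the product chain."""
    (a, b), (c, d) = CODE[x], CODE[y]
    return NAME[(min(2, 2 - a + c), min(1, 1 - b + d))]


def complement(x):
    """x' is the same as x -> (least element)."""
    a, b = CODE[x]
    return NAME[(2 - a, 1 - b)]


def join(x, y):
    """Lattice join via the lattice implication algebra identity x v y = (x -> y) -> y."""
    return implies(implies(x, y), y)


values = list(CODE)
# Full implication table as a dictionary: table[x][y] = x -> y.
table = {x: {y: implies(x, y) for y in values} for x in values}
```

Under this encoding, 0F and h−T are incomparable and their join is 0T, which matches the idea that some truth and falsity values cannot be ordered against each other.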

3 Six-Element Linguistic Truth-Valued Intuitionistic Propositional Logic

Since some kinds of truth and falsity are incomparable, we can choose the linguistic truth-values based on the six-element LIA as the truth-valued field of intuitionistic logic. We denote the linguistic truth-valued intuitionistic proposition by LTV-IP.

Table 1. Implication operator of L = (V, ∨, ∧, ′, →)

→      h+F    0F     h−F    h−T    0T     h+T
h+F    h+T    h+T    h+T    h+T    h+T    h+T
0F     0T     h+T    h+T    0T     h+T    h+T
h−F    h−T    0T     h+T    h−T    0T     h+T
h−T    h−F    h−F    h−F    h+T    h+T    h+T
0T     0F     h−F    h−F    0T     h+T    h+T
h+T    h+F    0F     h−F    h−T    0T     h+T

Table 2. Complementary operator of L = (V, ∨, ∧, ′, →)

v      h+F    0F     h−F    h−T    0T     h+T
v′     h+T    0T     h−T    h−F    0F     h+F

According to the definition of intuitionistic proposition, the truth-valued field of LTV-IP is as follows:

Li = {(h+T, h−F), (0T, 0F), (0T, h−F), (h−T, h−F), (h−T, 0F), (h−T, h+F)}.

The complement "′" is given in Table 3.

Table 3. Complementary operator of Li

v      (h+T, h−F)    (0T, 0F)    (0T, h−F)    (h−T, h−F)    (h−T, 0F)    (h−T, h+F)
v′     (h−T, h+F)    (0T, 0F)    (h−T, 0F)    (h−T, h−F)    (0T, h−F)    (h+T, h−F)

The conjunction, disjunction and implication are defined as follows. Let G, H ∈ LTV-IP with $v(G) = (h_i T, h_j F)$ and $v(H) = (h_m T, h_l F)$. Then

1. $v(G \vee H) = (h_i T \vee h_m T,\ h_j F \wedge h_l F)$;
2. $v(G \wedge H) = (h_i T \wedge h_m T,\ h_j F \vee h_l F)$;
3. $v(G \to H) = v(G) \to v(H) = \big( (h_i T \to h_m T) \wedge (h_j F \to h_m T) \wedge (h_j F \to h_l F),\ h_i T \to h_l F \big)$.

Note that 1 and 2 obviously satisfy the valuation conditions of LTV-IP. For 3, we get

$$
\begin{aligned}
v(G \to H) &= v(G) \to v(H) \\
&= \big( (h_i T \to h_m T) \wedge (h_j F \to h_m T) \wedge (h_j F \to h_l F),\ h_i T \to h_l F \big) \\
&= \big( h_{\min\{n,\, n-i+m\}} T \wedge h_{\min\{n,\, j+m\}} T \wedge h_{\min\{n,\, n-l+j\}} T,\ h_{\max\{0,\, i+l-n\}} F \big).
\end{aligned}
$$


For the truth degree of G → H there are four cases, whose subscripts are n, n − i + m, j + m and n − l + j, respectively. For the falsity degree of G → H, the subscript is i + l − n. We can prove that the sum is always equal to or less than n. Hence the definitions of conjunction, disjunction and implication of LTV-IP are rational.

The symbols in the LTV-IP logic system are:
(1) The set of propositional variables: X = {p, q, r, ...};
(2) The set of constants: Li;
(3) Logical connectives: →, ′;
(4) Auxiliary symbols: ), (.

The set F of formulae of LTV-IP is the least set Y satisfying the following conditions:
(1) X ⊆ Y;
(2) L ⊆ Y;
(3) If p, q ∈ Y, then p′ and p → q ∈ Y.

Note that from the viewpoint of universal algebra, LTV-IP is the free algebra on X w.r.t. the type T = Li ∪ {′, →}, where each α ∈ Li is a 0-ary operation. According to the properties of lattice implication algebra, L and LTV-IP can be regarded as algebras of the same type T = Li ∪ {′, →}, and for any p, q ∈ F,
(1) p ∨ q = (p → q) → q,
(2) p ∧ q = (p′ ∨ q′)′.

Definition 1. A mapping v : LTV-IP → Li is called a valuation of LTV-IP if it is a T-homomorphism.

Corollary 1. Let v : LTV-IP → Li be a mapping. Then v is a valuation of LTV-IP if and only if it satisfies
(1) $v(h_\alpha T, h_\beta F) = (h_\alpha T, h_\beta F)$ for any $(h_\alpha T, h_\beta F) \in L_i$;
(2) v(p′) = (v(p))′ for any p ∈ F;
(3) v(p → q) = v(p) → v(q) for any p, q ∈ F.

Definition 2. Well-formed formulae of LTV-IP, or formulae for short, are defined recursively as follows:
(1) An LTV-IP atom is a formula;
(2) If G, H are LTV-IP formulae, then ∼G, (G ∨ H), (G ∧ H) and (G → H) are formulae;
(3) No expression is a formula unless it is compelled to be one by (1) and (2).

Some intuitionistic linguistic truth-valued properties hold as follows.

Theorem 1. For any $(h_i T, h_j F) \in L_i$,
(1) $(h_- T, h_+ F) \to (h_i T, h_j F) = (h_+ T, h_- F)$,
(2) $(h_+ T, h_- F) \to (h_i T, h_j F) = (h_i T, h_j F)$,
(3) $(h_i T, h_j F) \to (h_- T, h_+ F) = (h_j T, h_i F)$,
(4) $(h_i T, h_j F) \to (h_+ T, h_- F) = (h_+ T, h_- F)$.
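The subscript formula for the implication can be tried out directly on indices. The sketch below assumes, purely for illustration, that the hedges are indexed as h− = 0, 0 = 1, h+ = 2 with n = 2, and that an element $(h_i T, h_j F)$ of Li is represented by the index pair (i, j) with i + j ≤ n; this indexing is not stated explicitly in the paper. Under this assumption exactly six pairs are admissible, and the four identities of Theorem 1 can be checked by enumeration.

```python
from itertools import product

N = 2  # assumed largest hedge index: h- = 0, 0 = 1, h+ = 2

# Index pairs (i, j) standing for (h_i T, h_j F), restricted to i + j <= N.
L_I = [(i, j) for i, j in product(range(N + 1), repeat=2) if i + j <= N]

TOP = (N, 0)     # (h+ T, h- F)
BOTTOM = (0, N)  # (h- T, h+ F)


def implies(g, h):
    """Index form of v(G -> H) for v(G) = (h_i T, h_j F), v(H) = (h_m T, h_l F)."""
    (i, j), (m, l) = g, h
    truth = min(N, N - i + m, j + m, N - l + j)
    falsity = max(0, i + l - N)
    return (truth, falsity)


# Enumerate the four identities of Theorem 1 over all admissible pairs.
theorem1_holds = all(
    implies(BOTTOM, x) == TOP
    and implies(TOP, x) == x
    and implies(x, BOTTOM) == (x[1], x[0])
    and implies(x, TOP) == TOP
    for x in L_I
)
```

The enumeration only confirms the identities for this particular indexing and n = 2; it is a sanity check of the formula, not a replacement for a proof.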


Corollary 2. For any $(h_i T, h_j F) \in L_i$,
(1) $(h_- T, h_+ F) \to (h_- T, h_+ F) = (h_+ T, h_- F)$,
(2) $(h_+ T, h_- F) \to (h_+ T, h_- F) = (h_+ T, h_- F)$,
(3) $(h_+ T, h_- F) \to (h_- T, h_+ F) = (h_- T, h_+ F)$,
(4) $(h_- T, h_+ F) \to (h_+ T, h_- F) = (h_+ T, h_- F)$.

Definition 3. For any $(h_i T, h_j F), (h_m T, h_l F) \in L_i$, $(h_i T, h_j F)$ is said to be truer than $(h_m T, h_l F)$ if and only if $h_i T \ge h_m T$ and $h_j F < h_l F$, or $h_i T > h_m T$ and $h_j F \le h_l F$. This is denoted by $(h_i T, h_j F) \ge (h_m T, h_l F)$.

Theorem 2. For any $(h_i T, h_j F), (h_m T, h_l F) \in L_i$, if $(h_i T, h_j F) \ge (h_m T, h_l F)$, then $(h_i T, h_j F) \to (h_- T, h_+ F) \le (h_m T, h_l F) \to (h_- T, h_+ F)$.

Proof. From Theorem 1, we get $(h_i T, h_j F) \to (h_- T, h_+ F) = (h_j T, h_i F)$ and $(h_m T, h_l F) \to (h_- T, h_+ F) = (h_l T, h_m F)$. Since $(h_i T, h_j F) \ge (h_m T, h_l F)$ and the linguistic hedge set $H = \{h_i \mid i = 1, 2, \dots, n\}$ is a chain, we get $(h_j T, h_i F) \le (h_l T, h_m F)$.

Note that if the consequent is the most false value, then the truth degree of the implication decreases as the truth degree of the premise increases. Conversely, as the truth degree of the premise decreases, the truth degree of the implication increases. This property is consistent with classical logic and with people's intuition. Also, the linguistic truth-values based on LIA are special cases of intuitionistic linguistic truth-values. So the linguistic truth-valued intuitionistic logic is an extension of linguistic truth-valued logic.

Theorem 3. For any $(h_i T, h_j F), (h_i T, h_l F), (h_m T, h_j F) \in L_i$,
(1) $(h_i T, h_j F) \to (h_i T, h_l F) = (h_{i+j} T, h_- F)$,
(2) $(h_i T, h_j F) \to (h_m T, h_j F) = (h_{j+m} T, h_- F)$.

When we do fuzzy inference in an intuitionistic fuzzy logic system based on linguistic truth-values, we must consider the fact that a proposition has a truth degree as well as a falsity degree. More information is therefore used in the reasoning process, which can improve the precision of reasoning and reduce the loss of information.

4 A Kind of Decision Making Approach

In this section, we consider multiple attribute decision making with a pair of linguistic truth values as information. Let $A = \{A_1, A_2, \dots, A_n\}$ be a finite set of alternatives, let $G = \{G_1, G_2, \dots, G_n\}$ be a finite set of attributes, and let $W = (w_1, w_2, \dots, w_n)$ be the intuitionistic fuzzy linguistic truth-valued weight vector of the attributes. Let $R = (r_{ij})_{m \times n}$ be an intuitionistic fuzzy linguistic truth-valued decision matrix, where $r_{ij} = (\mu_{ij}, \nu_{ij}) \in L_i$, $\mu_{ij}$ indicates the degree to which the alternative $A_j$ satisfies the attribute $G_i$, and $\nu_{ij}$ indicates the degree to which the alternative $A_j$ does not satisfy the attribute $G_i$. The conclusion is

$$D = \bigwedge_{i=1}^{n} \big( w_i \to G_i(A_j) \big).$$

The optimal alternative is the $A_j \in A$ that maximizes D.

Example: We consider a simple example of evaluating the set of cars A = {A1 = Chevrolet, A2 = Toyota, A3 = Buick} with the attribute set G = {G1 = comfort, G2 = price, G3 = repair frequency}. Assume the evaluation set is Li, where "true" is replaced by "high" and "false" is replaced by "low". The evaluations are given in Table 4.

Table 4. Evaluation table

r_ij    A1             A2             A3
G1      (0T, 0F)       (h+T, h−F)     (h−T, h−F)
G2      (0T, h−F)      (h−T, 0F)      (h+T, h−F)
G3      (h−T, h+F)     (0T, h−F)      (0T, 0F)

Assume the weights of importance for G are w1 = (0T, h−F), w2 = (h−T, 0F), w3 = (h+T, h−F). Then the weighted evaluations are as given in Table 5.

Table 5. Weighted evaluation table

r_ij       A1             A2             A3
w1 → G1    (0T, h−F)      (h+T, h−F)     (h−T, h−F)
w2 → G2    (0T, h−F)      (0T, h−F)      (0T, h−F)
w3 → G3    (h−T, h+F)     (0T, h−F)      (0T, 0F)

Hence,

D = {(h−T, h−F)/Chevrolet, (0T, h−F)/Toyota, (h−T, h−F)/Buick}.

Finally, we choose the Toyota. Note that the result sometimes contains incomparable elements, which accords with people's intuition.

5 Conclusions

We have found that some properties of lattice-valued logic based on linguistic truth-values are suitable for research on linguistic truth-values. The result is consistent with people's intuition. Classical logic and linguistic truth-valued logic based on LIA are special cases of this logic system. Problems which have positive evidence and negative evidence at the same time can be dealt with by means of linguistic truth-valued intuitionistic fuzzy logic. If a proposition has both credibility and incredibility, then the reasoning method proposed above can be used. The illustrated approach suggests that linguistic truth-valued propositional logic can make intelligent decision making systems more effective.

Acknowledgments. This work is partially supported by the National Nature Science Foundation of China (Grant No. 60603047) and the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No. 20060613007.

References

1. Herrera, F., Herrera, E., Martinez, L.: A Fusion Approach for Managing Multigranularity Linguistic Term Sets in Decision Making. International Journal of Fuzzy Sets and Systems 114, 43–58 (2000)
2. Ho, N.C., Wechler, W.: Hedge Algebras: an Algebraic Approach to Structure of Sets of Linguistic Truth Values. International Journal of Fuzzy Sets and Systems 35, 281–293 (1990)
3. Ho, N.C., Wechler, W.: Extended Hedge Algebras and Their Application to Fuzzy Logic. International Journal of Fuzzy Sets and Systems 52, 259–281 (1992)
4. Turksen, I.B.: Computing With Descriptive and Verisic Words. In: NAFIP 1999, pp. 13–17 (1999)
5. Turksen, I.B., Kandel, A., Zhang, Y.Q.: Universal Truth Tables and Normal Forms. IEEE Trans. Fuzzy Systems 6, 295–303 (1998)
6. Huynh, V.N., Nam, H.V.: Ordered Structure-based Semantics of Linguistic Terms of Linguistic Variables and Approximate Reasoning. In: Third International Conference on Computing Anticipatory Systems (CASYS 1999), pp. 98–116. AIP Press, New York (1999)
7. Huynh, V.N., Ho, T.B., Nakamori, Y.: A Parametric Representation of Linguistic Hedges in Zadeh's Fuzzy Logic. International Journal of Approximate Reasoning 30, 203–223 (2002)
8. Nguyen, C.H., Huynh, V.N.: An Algebraic Approach to Linguistic Hedges in Zadeh's Fuzzy Logic. International Journal of Fuzzy Set and Systems 129, 229–254 (2002)
9. Xu, Y., Ruan, D., Qin, K.Y., Liu, J.: Lattice-Valued Logic. Springer, Heidelberg (2004)
10. Xu, Y., Ruan, D., Kerre, E.E., Liu, J.: α-Resolution Principle Based on Lattice-valued Propositional Logic LP(X). International Journal of Information Sciences 130, 195–223 (2000)
11. Xu, Y., Ruan, D., Kerre, E.E., Liu, J.: α-Resolution Principle Based on First-order Lattice-valued Propositional Logic LF(X). International Journal of Information Sciences 132, 195–223 (2001)
12. Xu, Y., Liu, J., Ruan, D., Lee, T.T.: On the Consistency of Rule Bases Based on Lattice-valued First-Order Logic LF(X). International Journal of Intelligent Systems 21, 399–424 (2006)
13. Pei, Z., Xu, Y.: Lattice Implication Algebra Model of a Kind of Linguistic Terms and its in Inference. In: 6th Internal FLINS Conference, pp. 93–98 (2004)
14. Xu, Y., Chen, S., Ma, J.: Linguistic Truth-Valued Lattice Implication Algebra and Its Properties. In: 2006 IMACS Multiconference on Computational Engineering in Systems Applications (CESA 2006), Beijing, pp. 1413–1418 (2006)
15. Zou, L., Liu, X., Xu, Y.: Resolution Method of Linguistic Truth-valued Propositional Logic. In: International Conference on Neural Networks and Brain, pp. 1996–2000. IEEE Press, New York (2005)
16. Zou, L., Ma, J., Xu, Y.: A Framework of Linguistic Truth-valued Propositional Logic Based on Lattice Implication Algebra. In: 2006 IEEE International Conference on Granular Computing, pp. 574–577. IEEE Press, New York (2006)
17. Zou, L., Li, J.L., Xu, K.J., Xu, Y.: A Kind of Resolution Method of Linguistic Truth-valued Propositional Logic Based on LIA. In: 4th International Conference on Fuzzy Systems, pp. 32–36. IEEE Press, New York (2007)
18. Atanassov, K.: Elements of Intuitionistic Fuzzy Logic, Part I. International Journal of Fuzzy Set and Systems 95, 39–52 (1998)

A Sequential Learning Algorithm for RBF Networks with Application to Ship Inverse Control

Gexin Bi and Fang Dong

College of Navigation, Dalian Maritime University, 1 Linghai Road, 116026 Dalian, China
[email protected], [email protected]

Abstract. A sequential learning algorithm for constructing radial basis function (RBF) networks, referred to as the dynamic orthogonal structure adaptation (DOSA) algorithm, is introduced. The algorithm learns samples sequentially, with both the structure and the connecting parameters of the network adjusted on-line. The algorithm is further improved by setting initial hidden units and incorporating weight factors. Based on the improved DOSA algorithm, a direct inverse control strategy is introduced and applied to ship control. Results of a ship course control simulation demonstrate the applicability and effectiveness of the improved DOSA algorithm and the RBF network-based inverse control strategy.

Keywords: Radial basis function network, Sequential learning, Inverse control.

1

Introduction

Neural networks have become a fashionable area of research in recent years. They have been applied in many areas such as modelling and controlling of non-linear dynamic systems [1]. Most of the successful application results, however, are achieved when the network is applied to system with static dynamics. It is realized that a considerable part of industry processes have time-varying dynamics in their nature, and such processes are prone to be influenced by time-varying environment. Thus, adaptive control of such systems could not be obtained merely using neural network models with fixed structure. One solution is to develop adaptive neural network models and control strategies whose structure are variable to process time-variants. As a kind of feed forward neural networks, radial basis function (RBF) networks are found to be suitable for on-line adaptation because of their features of best approximation and quick convergence [2]. Utilizing the merits of RBF networks, sequential learning algorithms are developed in recent years which are suitable for control applications. Sequential learning algorithms overcome drawbacks of RBF networks with fixed structure. They does not need retraining whenever a new observation is received, with advantages of low computation burden and adaptive ability. The most widely used sequential learning algorithms are RAN, RANEKF, MRAN and GGAP-RBF algorithms [3-6]. F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 275–282, 2008. c Springer-Verlag Berlin Heidelberg 2008 


In this paper, we introduce the dynamic orthogonal structure adaptation (DOSA) algorithm to construct RBF networks on-line [7]. It adds each new observation directly as a hidden unit and uses the normalized error reduction ratio to prune units that contribute little to the system output. The algorithm adapts to changes in system dynamics with fast learning speed while employing only a small number of parameters. By combining the RBF network with an adaptive inverse control mechanism, we present a neural inverse control strategy [8,9]. Ship course control simulation results demonstrate the applicability and effectiveness of the algorithm and the control strategy.

2 DOSA Algorithm for RBF Networks

The DOSA algorithm combines a subset selection scheme with a sequential learning mode. The resulting network structure is variable: the new observation is added directly as a hidden unit, and units that contribute little to the output over a number of observations are pruned. The contribution of each hidden unit is measured by its normalized error reduction ratio, a concept generalized from the error reduction ratio of the OLS algorithm [10]. The sliding window is a first-in-first-out sequence of fixed width. When a new observation is received, the sliding window is updated by adding the new observation and discarding the oldest one. The window has the form

$$\text{window} = [(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)] \tag{1}$$

Here $N$ is the width of the sliding data window. The data in the sliding window consist of the input $X \in \mathbb{R}^{n \times N}$ and the output $Y \in \mathbb{R}^{N \times m}$, where $n$ is the dimension of the input and $m$ is the dimension of the output. The learning procedure begins with no hidden units. At each step, the new observation is added directly as a hidden unit. The candidate units are then formed from it together with the existing hidden units:

$$C = [c_1, \ldots, c_M] = \begin{pmatrix} c_{1,1} & \cdots & c_{1,M} \\ \vdots & \ddots & \vdots \\ c_{n,1} & \cdots & c_{n,M} \end{pmatrix} \tag{2}$$

where $M$ is the number of candidate hidden units and $n$ is their dimension. By evaluating a Gaussian function of the Euclidean distance between each sliding window input and each candidate hidden unit, we obtain the response matrix $\Phi$ of the hidden units to the input matrix of the sliding window:

$$\Phi = [\Phi_1, \ldots, \Phi_M] = \begin{pmatrix} \phi_{1,1} & \cdots & \phi_{1,M} \\ \vdots & \ddots & \vdots \\ \phi_{N,1} & \cdots & \phi_{N,M} \end{pmatrix} \tag{3}$$

with

$$\phi_{j,k} = \exp\left(-\frac{\|x_j - c_k\|^2}{2\sigma^2}\right), \quad 1 \le j \le N,\ 1 \le k \le M \tag{4}$$

where $c_k$ is the $k$-th hidden unit, $\sigma$ is a width constant and $\|\cdot\|$ is the Euclidean norm. In a sequential learning scheme, we pay more attention to the newer observations because the information they carry better represents the current dynamics of the system. This is especially true when the data are used to represent, on-line, systems with time-varying dynamics. In this paper we improve the DOSA algorithm by employing forgetting factors. Here we use linear weighting coefficients:

$$\beta_i = \frac{2i}{N(N+1)}, \quad 1 \le i \le N \tag{5}$$
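As an illustration of Eqs. (3)-(5), the following minimal sketch builds the Gaussian response matrix for a sliding window and the linear weighting coefficients. The function names are hypothetical, and two points are assumptions of the sketch rather than statements of the paper: the window is ordered oldest first, so that the newest sample receives the largest weight, and $\beta_i$ scales the whole $i$-th row of $\Phi$.

```python
import math

def gaussian_response(window_inputs, centers, sigma):
    """Response matrix Phi (N x M): phi[j][k] = exp(-||x_j - c_k||^2 / (2 sigma^2))."""
    phi = []
    for x in window_inputs:
        row = []
        for c in centers:
            dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            row.append(math.exp(-dist2 / (2.0 * sigma ** 2)))
        phi.append(row)
    return phi

def linear_weights(n):
    """Linear weighting coefficients beta_i = 2i / (N(N+1)), i = 1..N."""
    return [2.0 * i / (n * (n + 1)) for i in range(1, n + 1)]

def weight_rows(phi, beta):
    """Assumption: beta_i multiplies every element of the i-th row of Phi."""
    return [[b * v for v in row] for row, b in zip(phi, beta)]

# Usage: a window of two 2-D inputs, both of which are also candidate units.
inputs = [[0.0, 0.0], [1.0, 0.0]]
phi = gaussian_response(inputs, inputs, sigma=1.0)
weighted = weight_rows(phi, linear_weights(len(inputs)))
```

The coefficients sum to one for any window width, so the weighting redistributes emphasis toward recent samples without changing the overall scale.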

The response matrix is transformed into a weighted response matrix by multiplying its elements by the corresponding $\beta_i$. $\Phi$ is then transformed into a set of orthogonal basis vectors by decomposing it as $\Phi = WA$. Here we use the Gram-Schmidt method:

$$W = [w_1, \ldots, w_M] = \begin{pmatrix} w_{1,1} & \cdots & w_{1,M} \\ \vdots & \ddots & \vdots \\ w_{N,1} & \cdots & w_{N,M} \end{pmatrix} \tag{6}$$

Calculate the error reduction ratio of each vector $w_k$:

$$[err]_{ki} = \frac{(w_k^T y_i)^2}{(w_k^T w_k)(y_i^T y_i)} \tag{7}$$

For a multi-input multi-output (MIMO) process,

$$[err]_k = \frac{\sum_{i=1}^{m} (w_k^T y_i)^2}{(w_k^T w_k)\,\mathrm{trace}(Y^T Y)} \tag{8}$$

The geometrical interpretation of the error reduction ratio is

$$\frac{(w_k^T y_i)^2}{\|w_k\|^2 \|y_i\|^2} = \cos^2 \theta_{ki} \tag{9}$$

where $\theta_{ki}$ is the angle between the basis vector $w_k$ and the desired output vector $y_i$. According to vector space theory, $\sum_{k=1}^{M} \cos^2 \theta_{ki} = 1$ in the single-output case. This explains why $\sum_{k=1}^{M} [err]_k = 1$ in the OLS algorithm under the single-output condition. Unlike in the OLS algorithm, the response matrix in the DOSA algorithm is generally not square, because the size of the sliding window is generally not the same as the number of hidden units. Specifically, $\sum_{k=1}^{M} [err]_k > 1$ when $M > N$ and $\sum_{k=1}^{M} [err]_k < 1$ when $M < N$; $\sum_{k=1}^{M} [err]_k = 1$ holds only when $M = N$. In order to evaluate the contribution of the hidden units directly, the normalized error reduction ratio (nerr) is obtained by

$$[nerr]_k = \frac{[err]_k}{\sum_{j=1}^{M} [err]_j} \tag{10}$$
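The orthogonalization and the ratios of Eqs. (7) and (10) can be sketched for the single-output case as follows. This is a minimal illustration using classical Gram-Schmidt on plain Python lists; the function names and the tolerance used to skip numerically dependent columns are assumptions of the sketch.

```python
def gram_schmidt(columns):
    """Classical Gram-Schmidt: orthogonalize a list of column vectors."""
    basis = []
    for col in columns:
        w = list(col)
        for q in basis:
            qq = sum(a * a for a in q)
            if qq > 0.0:
                coef = sum(a * b for a, b in zip(q, col)) / qq
                w = [wi - coef * qi for wi, qi in zip(w, q)]
        basis.append(w)
    return basis

def error_reduction_ratios(basis, y):
    """[err]_k = (w_k^T y)^2 / ((w_k^T w_k)(y^T y)) for one output vector y."""
    yy = sum(v * v for v in y)
    ratios = []
    for w in basis:
        ww = sum(a * a for a in w)
        wy = sum(a * b for a, b in zip(w, y))
        # Assumed tolerance: a (near-)zero basis vector contributes nothing.
        ratios.append(0.0 if ww < 1e-12 else wy * wy / (ww * yy))
    return ratios

def normalize(ratios):
    """[nerr]_k = [err]_k / sum_j [err]_j."""
    total = sum(ratios)
    return [r / total for r in ratios]

# Usage: three candidate columns in a window of width three (M = N).
columns = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
err = error_reduction_ratios(gram_schmidt(columns), [1.0, 2.0, 3.0])
nerr = normalize(err)
```

With three independent columns in a window of width three the ratios sum to one, while dropping the last column leaves a sum below one, which is the situation the normalization is meant to handle.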

At each step, we select the units whose summed normalized error reduction ratio is less than an accuracy threshold $\rho$. First select $[nerr]_{k_1} = \min\{[nerr]_k,\ 1 \le k \le M\}$. If $[nerr]_{k_1} < \rho$, then select $[nerr]_{k_2} = \min\{[nerr]_k,\ 1 \le k \le M,\ k \ne k_1\}$. The selection procedure continues until $\sum_{k=k_1}^{k_{S+1}} [nerr]_k \ge \rho$. Then select $k_1, \ldots, k_S$ and mark the corresponding hidden units $S_k = \{c_{k_1}, \ldots, c_{k_S}\}$. The selection is made at each step. If the same hidden units are selected for $M_S$ consecutive observations, those units are pruned from the network; that is, the units in the intersection of the sets selected over the past $M_S$ observations are removed:

$$I = S_k \cap S_{k-1} \cap \cdots \cap S_{k-M_S+1} \tag{11}$$

After hidden units have been added or pruned at each step, the weights $\Theta$ between the hidden layer and the output layer are adjusted using linear least-squares estimation:

$$\Theta = \Phi^{+} Y = (\Phi^T \Phi)^{-1} \Phi^T Y \tag{12}$$
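The selection and pruning rule can be sketched as below. The unit labels, the dictionary layout and the function names are hypothetical; the sketch assumes each hidden unit carries a stable identifier so that the sets chosen at different steps can be intersected as in Eq. (11).

```python
def select_low_contributors(nerr_by_unit, rho):
    """Units with the smallest [nerr] whose cumulative sum stays below rho."""
    order = sorted(nerr_by_unit, key=nerr_by_unit.get)
    selected, total = set(), 0.0
    for unit in order:
        if total + nerr_by_unit[unit] >= rho:
            break  # adding this unit would reach the accuracy threshold
        total += nerr_by_unit[unit]
        selected.add(unit)
    return selected

def units_to_prune(history, m_s):
    """Intersection of the sets selected at the last m_s steps."""
    if len(history) < m_s:
        return set()
    return set.intersection(*history[-m_s:])

# Usage with the threshold and M_S quoted later in the paper (0.02 and 3).
history = []
for nerr in ({"a": 0.010, "b": 0.005, "c": 0.500, "d": 0.485},
             {"a": 0.004, "b": 0.300, "c": 0.400, "d": 0.296},
             {"a": 0.001, "b": 0.002, "c": 0.600, "d": 0.397}):
    history.append(select_low_contributors(nerr, rho=0.02))
prune = units_to_prune(history, m_s=3)
```

Only unit "a" is selected at all three steps in this toy run, so it is the only unit returned for pruning.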

One drawback of sequential learning is that there are no initial hidden units, which increases the learning time and the complexity of the algorithm. We further improve the algorithm by directly setting several initial samples as hidden units, without pruning. This proved helpful in stabilizing the learning procedure.

3 The Neural Inverse Controller

Adaptive neural inverse control of unknown discrete-time nonlinear dynamical systems has received much attention in recent years [8,9]. Its basic idea is to drive the plant with a signal from a controller whose model is the inverse model of the plant, so that the output of the plant follows the input to the controller and the anticipated control effect is realized [8]. The key to inverse control is therefore how to obtain the inverse model of the plant. In our study, the RBF network constructed by the improved DOSA algorithm is introduced to satisfy the requirements of on-line control. The configuration of the proposed direct inverse control strategy is shown in Fig. 1. The controller employs an RBF network constructed by the DOSA algorithm. The inputs of the network include the desired output, derivatives, and delayed signals from the input and output of the system. To describe the input-output dynamics of a nonlinear system, the nonlinear autoregressive model with exogenous inputs (NARX) representation is used:

$$y(t+1) = f(y(t), \ldots, y(t+1-n_y), u(t), \ldots, u(t+1-n_u)) \tag{13}$$


Fig. 1. Configuration of RBF network-based adaptive inverse control strategy

where $y(\cdot)$ is the system output, $u(\cdot)$ is the system input, $n_y$ and $n_u$ are the maximum lags in the output and input, and $f(\cdot)$ is the nonlinear function to be identified. The model given by (13) thus represents a process that may be both dynamical and nonlinear. Suppose that the dynamical system represented by (13) is invertible. Then there exists a function $g(\cdot)$ such that the input can be expressed as a nonlinear expansion in lagged inputs and outputs:

$$u(t) = g(y(t+1), y(t), \ldots, y(t+1-n_y), u(t-1), \ldots, u(t+1-n_u)) \tag{14}$$

Assuming that the function $g(\cdot)$ is known, expression (14) allows the control action at time $t$ to be calculated such that the value $y(t+1)$ is reached by the system at time $t+1$. Thus, if the objective of the control action is to reach a set point $r(t+1)$, the control action is obtained by replacing the process output $y(t+1)$ with the desired plant output $r(t+1)$:

$$u(t) = g(r(t+1), y(t), \ldots, y(t+1-n_y), u(t-1), \ldots, u(t+1-n_u)) \tag{15}$$
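To make the argument list of $g(\cdot)$ in (15) concrete, the following sketch assembles it from stored histories. The function name and the convention that histories are stored newest last are assumptions of this illustration, not part of the paper.

```python
def inverse_regressor(r_next, y_hist, u_hist, n_y, n_u):
    """Arguments of g(.) in Eq. (15), in the order they are written there.

    y_hist: past outputs, newest last, so y_hist[-1] is y(t).
    u_hist: past inputs, newest last, so u_hist[-1] is u(t-1).
    """
    if len(y_hist) < n_y or len(u_hist) < n_u - 1:
        raise ValueError("not enough history for the requested lags")
    outputs = [y_hist[-1 - i] for i in range(n_y)]      # y(t), ..., y(t+1-n_y)
    inputs = [u_hist[-1 - i] for i in range(n_u - 1)]   # u(t-1), ..., u(t+1-n_u)
    return [r_next] + outputs + inputs

# Usage: n_y = 2 output lags and n_u = 3 input lags.
vec = inverse_regressor(10.0, [1.0, 2.0, 3.0], [0.1, 0.2], n_y=2, n_u=3)
```

The resulting vector has $1 + n_y + (n_u - 1)$ entries, which would be the input dimension of a network approximating $g(\cdot)$ before any additional inputs are appended.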

Here we use an RBF network with tapped time delays to approximate the identifier governed by (14). The severity of the system invertibility condition is weakened by constructing a predictive model like (14), which can be approximately inverted to obtain control actions. It can be weakened further by incorporating gradient information that captures the trend of changes in the time-varying dynamics. Here we incorporate the first-order and second-order differences of the system output as inputs of the RBF network. The configuration of the RBF network is shown in Fig. 2.

4 Ship Course Control Application

The design of a ship control strategy is challenging because a ship's motion is a complex nonlinear system with time-varying dynamics. The dynamics of a ship vary with changes in sailing conditions such as speed, loading conditions and trim. Similar changes may also be caused by environmental disturbances such as waves, wind and current. We therefore examine the performance of the proposed control strategy by applying it to ship control.

Fig. 2. Configuration of RBF network as inverse controller

[Figure: two panels plotting wind speed (m/s) and wind direction (degree) against time (s), from 0 to 1200 s]

Fig. 3. Wind speed and wind force curve

The ship model in this application is based on the "MARINER" model [11]. The simplified nonlinear equations of 3 degrees-of-freedom (DOF) ship motion are:

$$\text{surge}: \quad m(\dot{u} - vr - x_G r^2) = X \tag{16}$$

$$\text{sway}: \quad m(\dot{v} + ur - x_G \dot{r}) = Y \tag{17}$$

$$\text{yaw}: \quad I_z \dot{r} + m x_G (\dot{v} + ur) = N \tag{18}$$

where $m$ is the mass of the ship, $u$ and $v$ are the surge and sway velocities, $r$ is the yaw rate, $I_z$ is the moment of inertia about the z-axis, $X$ and $Y$ are the forces in the directions of the x- and y-axes, respectively, $N$ is the moment about the z-axis and $x_G$ is the position of the center of gravity along the x-axis. The objective of our simulation was to steer a ship along set courses with small deviations while avoiding large control actions. The desired course was set to 10° during [0 s, 360 s], 20° during [361 s, 720 s] and −20° during [721 s, 120 s]. To make the simulation more realistic, the influences of wind and random measurement noise were considered [12]. The wind force was set to 4 on the Beaufort scale; the changes in wind speed and direction are illustrated in Fig. 3. Noise was added with a standard deviation of 0.2. The improved DOSA algorithm was implemented on-line. In the simulation, the ship speed was set to 15 knots, and the rudder angle and rudder rate were constrained to ±20° and ±5°/s, respectively. The parameters were chosen as follows: $N = 20$, $M_S = 3$, $\rho = 0.02$. For comparison, traditional PID control was also implemented under the same conditions, with parameters tuned as $K_P = 8$, $K_I = 0.01$, $K_D = 80$. Simulation results are shown in Figs. 4 and 5.

Fig. 4. Ship heading course and rudder angle (RBF Network Inverse control)

Fig. 5. Ship heading course and rudder angle (PID control)

Comparing Figs. 4 and 5, we find that although both methods track the desired course well, the proposed RBF network-based inverse control strategy uses much less rudder action. The results also indicate that the controller reacts quickly to environmental changes with smooth rudder actions, which shows that the RBF network constructed by the DOSA algorithm can adapt to changes in the ship dynamics, and that the inverse control strategy can reduce the effect of the long time delay to a low level.
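For reference, the PID baseline with the rudder constraints quoted above might look like the following minimal sketch. Only the gains and the ±20° and ±5°/s limits come from the text; the class name, the discrete-time form, the sample time `dt`, the units of the heading error and the sign convention are all assumptions of this illustration.

```python
class SaturatedPID:
    """Discrete PID heading controller with rudder angle and rate limits."""

    def __init__(self, kp=8.0, ki=0.01, kd=80.0,
                 max_angle=20.0, max_rate=5.0, dt=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.max_angle, self.max_rate, self.dt = max_angle, max_rate, dt
        self.integral = 0.0
        self.prev_error = None
        self.rudder = 0.0

    def step(self, desired_heading, heading):
        error = desired_heading - heading
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0  # no previous sample on the first call
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        command = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Limit the rudder angle first, then limit how fast the rudder moves.
        command = max(-self.max_angle, min(self.max_angle, command))
        max_step = self.max_rate * self.dt
        change = max(-max_step, min(max_step, command - self.rudder))
        self.rudder += change
        return self.rudder

# Usage: a 10 degree course demand with the ship still on heading 0.
pid = SaturatedPID(dt=1.0)
first = pid.step(10.0, 0.0)
second = pid.step(10.0, 0.0)
```

With a one-second step, the rate limit lets the rudder move by at most 5° per call, so a large course error saturates the angle limit only after several steps.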

5 Conclusion

A direct inverse control strategy based on an RBF network constructed by the improved DOSA algorithm is introduced. Simulation results show that the proposed control strategy can control a nonlinear ship model with quick response and satisfactory course-tracking ability.

References

1. Chen, S., Wang, X.X., Harris, C.J.: NARX-based Nonlinear System Identification Using Orthogonal Least Squares Basis Hunting. IEEE Trans. Contr. Sys. Tech. 16, 78–84 (2008)
2. Moody, J., Darken, C.: Fast Learning in Networks of Locally-Tuned Processing Units. Neur. Comput. 1, 281–294 (1989)
3. Platt, J.: A Resource Allocating Network for Function Interpolation. Neur. Comput. 3, 213–225 (1991)
4. Kadirkamanathan, V., Niranjan, M.: A Function Estimation Approach to Sequential Learning with Neural Network. Neur. Comput. 5, 954–975 (1993)
5. Lu, Y.W., Sundararajan, N., Saratchandran, P.: A Sequential Learning Scheme for Function Approximation by Using Minimal Radial Basis Function Neural Networks. Neur. Comput. 9, 461–478 (1997)
6. Huang, G.B., Saratchandran, P., Sundararajan, N.: A Generalized Growing and Pruning RBF (GGAP-RBF) Neural Network for Function Approximation. IEEE Trans. Neur. Netw. 16, 57–67 (2005)
7. Yin, J.C., Dong, F., Wang, N.N.: A Novel Sequential Learning Algorithm for RBF Networks and Its Application to Dynamic System Identification. In: 2006 International Joint Conference on Neural Networks, pp. 827–834. IEEE Press, New York (2006)
8. Widrow, B., Wallach, E.: Adaptive Inverse Control. Prentice Hall, Upper Saddle River (1996)
9. Deng, H., Li, H.X.: A Novel Neural Approximate Inverse Control for Unknown Nonlinear Discrete Dynamical Systems. IEEE Trans. Sys. Man Cyber. 35, 115–123 (2005)
10. Chen, S., Cowan, C.F.N., Grant, P.M.: Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks. IEEE Trans. Neur. Netw. 2, 302–309 (1991)
11. Chislett, M.S., Strom, J.T.: Planar Motion Mechanism Tests and Full-scale Steering and Manoeuvring Predictions for a Mariner Class Vessel. Technical Report, Hydro and Aerodynamics Laboratory, Denmark (1965)
12. Zhang, Y., Hearn, G.E., Sen, P.: A Neural Network Approach to Ship Track-Keeping Control. IEEE J. Ocean. Eng. 21, 513–527 (1996)

Implementation of Neural Network Learning with Minimum $L_1$-Norm Criteria in Fractional Order Non-Gaussian Impulsive Noise Environments*

Daifeng Zha

College of Electronic Engineering, Jiujiang University, 332005 Jiujiang, China
[email protected]

Abstract. The minimum $L_1$-norm optimization model has found extensive applications in linear parameter estimation. The $L_1$-norm model is robust in non-Gaussian alpha stable distribution error or noise environments, especially for signals that contain sharp transitions (such as biomedical signals with spiky series) or dynamic processes. However, its implementation is more difficult than that of the least-squares ($L_2$-norm) model because of its discontinuous derivatives. In this paper, a new neural network for solving $L_1$-norm optimization problems is presented. It is proved that this neural network is able to converge to the exact solution of a given problem. An implementation of the $L_1$-norm optimization model is presented, in which the new neural network is constructed and its performance is evaluated theoretically and experimentally.

Keywords: $L_1$-norm optimization, Neural network, Alpha stable distribution, Non-Gaussian distribution.

* This work is supported by National Science Foundation of China under Grant 60772037.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 283–290, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

In modern signal processing fields such as communication, automatic control, speech and biomedical engineering, and in most conventional, linear-theory-based methods, linear parametric estimation models such as the linear predictor and the auto-regressive moving-average model have been used extensively. The error or noise is usually assumed to be Gaussian distributed with finite second-order statistics. An $L_2$-norm (least-squares) solution is easy to find and is suitable for situations where the noise or errors are Gaussian. In real application environments, however, the assumption that the noise or errors are Gaussian is often unrealistic. For a non-Gaussian error distribution, the solution obtained with the minimum $L_2$-norm optimization model may be very poor, especially for signals with sharp outliers. In fact, the $L_1$-norm optimization model [1][4] is superior to $L_2$-norm models when the observation vector is contaminated by large outliers or impulsive noise, such as stable-distribution noise [2], which includes underwater acoustic, low-frequency atmospheric and many man-made noises. Stable distributions are suitable for modeling random variables whose probability density functions have heavier tails than the Gaussian density function. They describe physical processes with sudden, short-lived, high-amplitude impulses, and they have no finite second-order or higher-order statistics. A stable distribution has no closed-form probability density function, so it is described by its characteristic function [3]:

$$\Phi(t) = \exp\{ j\mu t - \gamma |t|^{\alpha} [1 + j\beta\, \mathrm{sgn}(t)\, \omega(t, \alpha)] \} \tag{1}$$

where $\omega(t, \alpha) = \tan\frac{\alpha\pi}{2}$ if $\alpha \ne 1$, or $\frac{2}{\pi}\log|t|$ if $\alpha = 1$. Here $0 < \alpha \le 2$ is the characteristic exponent, which controls the thickness of the tails of the distribution; the Gaussian process is the special case of a stable process with $\alpha = 2$. The dispersion parameter $\gamma > 0$ is similar to the variance of a Gaussian process, $-1 \le \beta \le 1$ is the symmetry parameter, and $-\infty < \mu < \infty$ is the location parameter. Typical lower-order alpha stable distribution sequences are shown in Fig. 1.
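The characteristic function (1) can be evaluated directly, as in the following minimal sketch. The function name and the parameter checks are assumptions of the illustration; the formula itself follows (1) with the two cases of $\omega(t, \alpha)$.

```python
import cmath
import math

def stable_char_function(t, alpha, beta, gamma, mu):
    """Characteristic function of an alpha stable distribution, as in Eq. (1)."""
    if not (0.0 < alpha <= 2.0) or not (-1.0 <= beta <= 1.0) or gamma <= 0.0:
        raise ValueError("parameters outside the ranges stated in the text")
    if t == 0.0:
        return complex(1.0, 0.0)
    sign = 1.0 if t > 0 else -1.0
    if alpha == 1.0:
        omega = (2.0 / math.pi) * math.log(abs(t))
    else:
        omega = math.tan(alpha * math.pi / 2.0)
    exponent = (1j * mu * t
                - gamma * abs(t) ** alpha * (1.0 + 1j * beta * sign * omega))
    return cmath.exp(exponent)

# Usage: the parameter set used for the noise in Experiment 2.
value = stable_char_function(0.5, alpha=1.8, beta=0.0, gamma=1.0, mu=0.0)
```

For $\alpha = 2$ and $\beta = 0$ the expression reduces to $\exp(j\mu t - \gamma t^2)$, the characteristic function of a Gaussian variable with mean $\mu$ and variance $2\gamma$, which is one way to sanity-check the sketch.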

Fig. 1. Alpha stable distribution sequences

In this paper, we present a new neural network to solve the $L_1$-norm optimization problem in alpha stable distribution environments. Experimental validation shows that the proposed network is able to converge globally to the exact solution of a given problem.

2 $L_1$-Norm Optimization Model

Because of its excellent properties, the $L_1$-norm optimization model has been studied extensively. However, as the scale of the model increases, conventional numerical algorithms become inadequate for solving real-time problems. One possible and promising approach to real-time optimization is to apply neural networks. Because of their inherent massive parallelism, neural network-based approaches can solve optimization problems within a time constant of the network.


In fact, many models can be abstracted mathematically as the following over-determined system of linear equations [1][8]:

$$x = As - e \tag{2}$$

where $A = \{a_{ij}\} \in \mathbb{R}^{M \times N}$ ($M > N$) is the model matrix derived from a given data set, $s = [s_1, s_2, \ldots, s_N]^T \in \mathbb{R}^N$ is the unknown vector of parameters to be estimated, $x = [x_1, x_2, \ldots, x_M]^T \in \mathbb{R}^M$ is the vector of observations or measurements containing errors or artifacts, and $e \in \mathbb{R}^M$ is the alpha stable distribution error or noise vector. Define the $L_1$-norm of the error vector as

$$\|e\|_1 = \|As - x\|_1 \tag{3}$$

Then the parameter vector $s$ can be found by solving the following unconstrained optimization model:

$$s_{opt} = \min_{s} \|As - x\|_1 \tag{4}$$

This model is called the $L_1$-norm optimization model. It is generally difficult to solve because of its discontinuous derivatives. Using the following Proposition 1, we turn the problem described in (4) into another form that is easier to solve.

Proposition 1: The optimization model described in (4) is equivalent to the following optimization model:

$$s_{opt} = \min_{s} \left\{ \max_{y} \left( y^T (As - x) \right) \right\} \tag{5}$$

where $y = [y_1, y_2, \ldots, y_M]^T \in \mathbb{R}^M$, $|y_i| \le 1$, $i = 1, 2, \ldots, M$.

Proof: Let $u = (As - x) \in \mathbb{R}^M$. Then for any $y$ with $|y_i| \le 1$, we have

$$y^T (As - x) = y^T u = \sum_{i=1}^{M} y_i u_i \le \sum_{i=1}^{M} |y_i||u_i| \le \sum_{i=1}^{M} |u_i| = \|As - x\|_1 \tag{6}$$

Equality is attained by choosing $y_i = \mathrm{sgn}(u_i)$. Thus

$$\max_{y} \left( y^T (As - x) \right) = \|As - x\|_1 \tag{7}$$

This completes the proof of Proposition 1.
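Proposition 1 is easy to check numerically for a given residual vector. The sketch below uses hypothetical helper names and compares the inner product obtained with $y_i = \mathrm{sgn}(u_i)$ against the $L_1$-norm, and against another admissible $y$.

```python
def l1_norm(u):
    return sum(abs(v) for v in u)

def maximizing_y(u):
    """y_i = sgn(u_i), which attains the maximum of y^T u over |y_i| <= 1."""
    return [1.0 if v > 0 else -1.0 if v < 0 else 0.0 for v in u]

def inner(y, u):
    return sum(a * b for a, b in zip(y, u))

# Usage: a residual vector u = As - x chosen only for illustration.
u = [0.5, -2.0, 0.0, 3.0]
best = inner(maximizing_y(u), u)         # reaches the L1-norm of u
other = inner([0.3, 1.0, -1.0, 0.9], u)  # any other admissible y gives less
```

Here `best` equals the $L_1$-norm of `u`, while `other` is smaller, in line with the chain of inequalities in (6).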

3 The $L_1$-Norm Neural Network

We now propose a neural network for solving the problem in (5), whose model is described by the following dynamic system:


$$\begin{cases} \dfrac{ds}{dt} = -A^T P(y + As - x) \\ \dfrac{dy}{dt} = -(AA^T + I)y + P(y + As - x) \end{cases} \tag{8}$$

where $P(v) = [P(v_1), P(v_2), \ldots, P(v_M)]^T \in \mathbb{R}^M$ and $P(v_i)$ is a projection operator defined as

$$P(v_i) = \frac{1}{2}(|v_i + 1| - |v_i - 1|) = \begin{cases} 1 & \text{if } v_i > 1 \\ v_i & \text{otherwise} \\ -1 & \text{if } v_i < -1 \end{cases} \tag{9}$$
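The right-hand side of (8) and the projection (9) translate almost line by line into code. The sketch below is only an illustration on plain Python lists: the helper names are hypothetical, and the forward-Euler step is an assumption made here for numerical experimentation, whereas the paper integrates the system in SIMULINK.

```python
def project(v):
    """Projection operator P of Eq. (9), applied to a scalar."""
    return 0.5 * (abs(v + 1.0) - abs(v - 1.0))

def mat_vec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def rhs(a, x, s, y):
    """Right-hand side (ds/dt, dy/dt) of the dynamic system (8)."""
    a_s = mat_vec(a, s)
    p = [project(yi + ri - xi) for yi, ri, xi in zip(y, a_s, x)]
    ds = [-v for v in mat_vec(transpose(a), p)]
    aat_y = mat_vec(a, mat_vec(transpose(a), y))
    dy = [-(q + yi) + pi for q, yi, pi in zip(aat_y, y, p)]
    return ds, dy

def euler_step(a, x, s, y, h):
    """One forward-Euler step of size h (an assumed integration scheme)."""
    ds, dy = rhs(a, x, s, y)
    return ([si + h * d for si, d in zip(s, ds)],
            [yi + h * d for yi, d in zip(y, dy)])

# Usage: for a consistent system x = A s*, the point (s*, 0) is an equilibrium.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ds0, dy0 = rhs(A, [2.0, 3.0, 5.0], [2.0, 3.0], [0.0, 0.0, 0.0])
```

Both derivative vectors are zero in this usage example, which matches the equilibrium conditions discussed in the text.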

The proposed $L_1$-norm neural network ($L_1$N-NN) described in (8) is shown in Fig. 2.

Fig. 2. The $L_1$-norm neural network

Since the neural network described in (8) is a continuous-time network governed by a set of ordinary differential equations, it can be implemented in real time. In such an implementation, the projection operator $P(v_i)$ is simply a limiter with a unit threshold. The matrix and vector multiplications are synaptic weighting and summing operations and hence can be implemented with a number of adders with weighting functions [5][6]. The rest are a number of simple integrators. As a simulation of the implementation, Fig. 3 illustrates the implementation of the proposed $L_1$N-NN under MATLAB SIMULINK.

Fig. 3. The implementation of the proposed $L_1$N-NN

In the following, we will prove that the neural network described in (8) and Fig. 2 converges globally to the exact solution of problem (5), or equivalently of problem (4). Let $L(s, y) = y^T (As - x)$. According to the K-T theorem [7], $s^* \in \mathbb{R}^N$ is a solution to the problem in (5) if and only if there exists $y^*$ such that $(s^*, y^*)$ is a saddle point of model (5), that is, $L(s^*, y) \le L(s^*, y^*) \le L(s, y^*)$. Thus we easily obtain

$$(y - y^*)^T (As^* - x) \le 0 \tag{10}$$

$$(y^*)^T (As^* - x) \le (y^*)^T (As - x) \tag{11}$$

then there exists $y^*$ satisfying

$$\begin{cases} A^T y^* = 0 \\ y^* = P(y^* + As^* - x) \end{cases} \tag{12}$$

For any $y$, the following inequality holds:

$$\{P(y + As - x) - y^*\}^T (x - As^*) \ge 0 \tag{13}$$

The solution set of

$$\begin{cases} \dfrac{ds}{dt} = -A^T P(y + As - x) = 0 \\ \dfrac{dy}{dt} = -(AA^T + I)y + P(y + As - x) = 0 \end{cases} \tag{14}$$

is just the equilibrium point set of dynamic system (8). Let $E_1 = A^T y$ and $E_2 = y - P(y + As - x)$; then

$$\begin{cases} \dfrac{ds}{dt} = -E_1 + A^T E_2 \\ \dfrac{dy}{dt} = -AE_1 - E_2 \end{cases} \tag{15}$$

Let $(s^*, y^*)$ denote the solution set of model (5). By (12), when $(s, y) = (s^*, y^*)$ we have $\frac{ds}{dt} = 0$ and $\frac{dy}{dt} = 0$. This gives the relationship between the solution set of model (5) and the equilibrium point set of dynamic system (8).

4 Numerical Experiments

Two numerical experiments are conducted to validate the proposed $L_1$N-NN and to evaluate its convergence. The experiments are performed under MATLAB 6.5 and its SIMULINK for Windows XP.

Fig. 4. The convergence process: (a) $s_1, s_2, s_3$; (b) $s_1$; (c) $s_2$; (d) $s_3$


Experiment 1: Let us consider the following problem:

$$s_{opt} = \min_{s} \|As - x\|_1 \tag{16}$$

where $A = [1,0,0;\ 1,1,1;\ 1,2,4;\ 1,3,9;\ 1,4,6]$ and $x = [1,2,1,3,3]^T$. The exact solution to problem (16) is $s = [1.0000, 0.3571, 0.0714]^T$. Under MATLAB SIMULINK, we run the NN shown in Fig. 2 and start the simulation. The solution trajectories corresponding to the state variables of dynamic system (8) are obtained rapidly and displayed in Fig. 4(a). The numerical result is clearly equal to the exact solution. Fig. 4(b-d) demonstrates the convergence process of the neural network with different initial values. It shows that the proposed network converges rapidly to the exact solution of the given problem, independently of the initial values.

Fig. 5. The solution of $s_1, s_2, s_3$ corresponding to different $k$: (a) $k = 0.0001$; (b) $k = 0.01$; (c) $k = 1$; (d) $k = 10$. The observation vectors printed with the panels are, in the order they appear, $x = [1.0000\ 6.0000\ 17.0000\ 34.0000\ 27.0000]^T$, $x = [0.9828\ 5.8599\ 16.8636\ 33.9385\ 26.9425]^T$, $x = [0.9995\ 5.9930\ 16.9959\ 33.9970\ 26.9964]^T$ and $x = [-10.3137\ -9.1728\ 3.7807\ 26.5906\ 13.2231]^T$.


Experiment 2: For $x = As - e = As - ke_0$, let us solve the following problem in alpha stable distribution environments:

$$s_{opt} = \min_{s} \|As - x\|_1 = \min_{s} \|e\|_1 = \min_{s} \|ke_0\|_1 \tag{17}$$

where $A = [1,0,0;\ 1,1,1;\ 1,2,4;\ 1,3,9;\ 1,4,6]$ and $e_0 = [3.1026\ 0.9499\ -2.3361\ -0.1507\ 1.4950]^T$ is an error vector whose elements are alpha stable distribution variables ($\alpha = 1.8$, $\beta = 0$, $\mu = 0$, $\gamma = 1$). Suppose that $s = [1, 2, 3]^T$. For different $k$, we run the NN and start the simulation. The solutions of $s_1, s_2, s_3$ corresponding to different $k$ are displayed in Fig. 5.
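How the observation vector of Experiment 2 is formed from $A$, $s$, $e_0$ and $k$ can be sketched as follows. The function names are hypothetical, and the sketch only illustrates the construction $x = As - ke_0$; it is not claimed to reproduce the vectors printed with Fig. 5.

```python
A = [[1, 0, 0], [1, 1, 1], [1, 2, 4], [1, 3, 9], [1, 4, 6]]
E0 = [3.1026, 0.9499, -2.3361, -0.1507, 1.4950]
S_TRUE = [1, 2, 3]

def observations(k):
    """x = A s - k e0 for the matrix, parameters and error vector quoted above."""
    clean = [sum(a * s for a, s in zip(row, S_TRUE)) for row in A]
    return [c - k * e for c, e in zip(clean, E0)]

def l1_residual(s, x):
    """L1-norm of A s - x."""
    return sum(abs(sum(a * v for a, v in zip(row, s)) - xi)
               for row, xi in zip(A, x))

# Usage: the noise-free observations and the residual at the true parameters.
x_clean = observations(0.0)
residual = l1_residual(S_TRUE, observations(2.0))
```

With $k = 0$ the observations are $[1, 6, 17, 34, 27]^T$, and for any $k$ the residual at the true parameters equals $k\|e_0\|_1$, as stated in (17).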

5 Conclusion

In this paper, a new neural network for solving $L_1$-norm optimization problems is presented. It is proved that this neural network is able to converge to the exact solution of a given problem. The network is a continuous-time network governed by a set of ordinary differential equations and hence can be implemented easily. As a simulation, an implementation of the proposed neural network under MATLAB SIMULINK has been presented, and numerical validation experiments have been carried out with it. They show that the proposed network gives exact solutions with rapid convergence, independently of the initial values. In addition, the experiments illustrate that the proposed network has practical applications to alpha stable distribution error problems in non-Gaussian noise environments.

References

1. Cichocki, A., Unbehauen, R.: Neural Networks for Solving Systems of Linear Equations – Part II: Minimax and Least Absolute Value Problems. IEEE Trans. Circuits Syst. II 39, 619–633 (1992)
2. Nikias, C.L., Shao, M.: Signal Processing with Alpha-Stable Distribution and Applications, 1st edn. Wiley, Chichester (1995)
3. Georgiou, P.G., Tsakalides, P., Kyriakakis, C.: Alpha-Stable Modeling of Noise and Robust Time-Delay Estimation in the Presence of Impulsive Noise. IEEE Trans. on Multimedia 1, 291–301 (1999)
4. Bloomfield, P., Steiger, W.L.: Least Absolute Deviations: Theory, Applications and Algorithms. Birkhäuser, Boston (1983)
5. Xia, Y.S.: A New Neural Network for Solving Linear Programming Problems and Its Application. IEEE Trans. Neural Networks 7, 525–529 (1996)
6. Xia, Y.S., Wang, J.: A General Methodology for Designing Globally Convergent Optimization Neural Networks. IEEE Trans. Neural Networks 9, 1331–1343 (1998)
7. Luenberger, D.G.: Introduction to Linear and Nonlinear Programming. Addison-Wesley, New York (1973)
8. Zala, C.A., Barrodale, I., Kennedy, J.S.: High-resolution Signal and Noise Field Estimation Using the L1 (least absolute values) Norm. IEEE J. Oceanic Eng. OE-12, 253–264 (1987)

Stability of Neural Networks with Parameters Disturbed by White Noises

Wuyi Zhang and Wudai Liao

Zhongyuan University of Technology, 450007, Zhengzhou, Henan, China
[email protected], [email protected]

Abstract. The almost sure exponential stability (ASES) of neural networks whose parameters are disturbed by noise is studied. The basis of the study is that the parameters of neural networks implemented by very large scale integration (VLSI) approaches are well described by white-noise stochastic processes, and an appropriate way to impose random factors on deterministic neural networks is proposed. By using the theory of stochastic dynamical systems and matrix theory, some stability criteria are obtained which ensure that the neural networks are ASES, and the convergence rate is estimated. The capacity of well-designed neural networks to endure random factors is also estimated. The results obtained in this paper require only computing the eigenvalues, or verifying the negative definiteness, of some matrices constructed from the parameters of the neural networks. An illustrative example is given to show the effectiveness of the results.

Keywords: Almost sure, exponential stability, matrix theory.

1 Introduction

Neural networks have been applied to many scientific and engineering fields in the form of VLSI circuits, for example in neuro-dynamic optimization. White noise occurs unavoidably in this type of circuit system; that is, the parameters of the circuit, such as resistances and capacitances, take random values whose means equal the designed values. A new type of neural network, the stochastic neural network (SNN), was first proposed and its stability discussed in [1]. Since then, many results have been presented [2,3,4,5,6,7]. SNNs can be treated as nonlinear dynamical systems perturbed by stochastic noise, and these systems can be described by Itô stochastic differential equations. Few results, however, concern the way in which the random factors are imposed on the neural networks. In this paper, an appropriate way of doing so is proposed: each parameter of the neural network is modelled as a stochastic process, namely its designed value perturbed by white noise. After deriving the mathematical equation of this new type of SNN, some ASES conditions are obtained by using stochastic analysis and matrix analysis, and the convergence rate is given. The capacity of the SNN to endure the random intensity is also estimated. An illustrative example is given to show the effectiveness of the results obtained.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 291–298, 2008. © Springer-Verlag Berlin Heidelberg 2008


This paper is organized as follows. In Section 2, notations, definitions (such as the white-noise stochastic process and the standard Brownian motion) and lemmas are given, and the mathematical description of the SNN is derived from the estimation formulas of the parameters. In Section 3, stability criteria are obtained which require only computing the eigenvalues, or checking the negative definiteness, of some matrices constructed from the network's parameters; the Lyapunov exponents of the equilibria are also estimated. In Section 4, an illustrative example is given to show the effectiveness of the results.

2 Preliminary

In this section, we give some notations, definitions, lemmas, and the mathematical description of the SNN.

2.1 Notations, Definition and Lemma

In this paper, $\mathbb{R}^n$ denotes the $n$-dimensional real Euclidean space and $x \in \mathbb{R}^n$ an $n$-dimensional column vector. The Euclidean norm of $x \in \mathbb{R}^n$ is denoted $|x|$. $\mathbb{R}^{n \times n}$ denotes the space of $n \times n$ real matrices; for $A \in \mathbb{R}^{n \times n}$, $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote the maximum and minimum eigenvalues of $A$, and $|A|_1$ and $|A|_\infty$ the 1-norm and $\infty$-norm of $A$, respectively.

Definition 1. A scalar-valued stochastic process $\xi_t$ defined on the probability space $(\Omega, \mathcal{F}, P)$ is called a standard white noise if and only if its expectation is $E\xi_t = 0$ and its covariance is $\mathrm{covar}(\xi_{t+\tau}, \xi_t) = \delta(\tau)$, or equivalently its spectral density is 1, where $\delta(\tau)$ is the Dirac $\delta$ function.

Let $w_t$ be a 1-dimensional standard Brownian motion. Formally, we have

$$\frac{dw_t}{dt} = \lim_{\tau \to 0} \frac{w_{t+\tau} - w_t}{\tau} = \xi_t \tag{1}$$

For $x, f(\cdot) \in \mathbb{R}^n$, $g(\cdot) \in \mathbb{R}^n$ and $w_t$ a 1-dimensional standard Brownian motion, we consider the following Itô stochastic differential equation:

$$dx(t) = f(x)\,dt + g(x)\,dw_t \tag{2}$$

Here we assume that $f(0) = g(0) = 0$, which implies that (2) has the trivial solution $x = 0$. For a Lyapunov function $V(x): \mathbb{R}^n \to \mathbb{R}$, we define the differential operator associated with (2) as

$$LV(x) = V_x(x)^T f(x) + \frac{1}{2} g(x)^T V_{xx}(x) g(x)$$

Lemma 1. Assume that there exist $V(x) \in C^2(\mathbb{R}^n; \mathbb{R})$ and constants $p > 0$, $c_1 > 0$, $c_2 \in \mathbb{R}$, $c_3 \ge 0$ such that for all $x \ne 0$, $t \ge 0$,

1) $V(x) \ge c_1 |x|^p$,
2) $LV(x) \le c_2 V(x)$,
3) $|V_x(x)^T g(x)|^2 \ge c_3 V^2(x)$

hold. Then the estimate

$$\limsup_{t \to \infty} \frac{1}{t} \log |x(t; x_0)| \le -\frac{c_3 - 2c_2}{2p} \quad \text{a.s.}$$

holds for any $x_0 \in \mathbb{R}^n$. In particular, if $c_3 > 2c_2$, then the trivial equilibrium $x = 0$ of (2) is ASES with Lyapunov exponent $(c_3 - 2c_2)/2p$.

Remark 1. In Lemma 1, condition 3) obviously holds for $c_3 = 0$. In this case, if conditions 1) and 2) hold with $c_2 < 0$, then the conclusion of Lemma 1 is true.

2.2 The Description of SNN

Now we consider neural networks whose parameters are perturbed by white noise:

$$\frac{dx_i(t)}{dt} = -b_i(t) x_i + \sum_{j=1}^{n} a_{ij}(t) f_j(x_j) + I_i(t), \quad i = 1, 2, \cdots, n \tag{3}$$

Here $x = (x_1, x_2, \cdots, x_n)^T \in \mathbb{R}^n$ is the state vector, $B(t) = \mathrm{diag}(b_1(t), b_2(t), \cdots, b_n(t))$ is the feedback gain matrix, $A(t) = (a_{ij}(t))_{n \times n} \in \mathbb{R}^{n \times n}$ is the weight matrix between neurons, $a_{ij}(t)$ is the connection weight from neuron $j$ to neuron $i$, and $I(t)$ is the bias vector. The vector activation function is $f(x) = (f_1(x_1), f_2(x_2), \cdots, f_n(x_n))^T$, where each $f_i: \mathbb{R} \to \mathbb{R}$ satisfies a local Lipschitz condition: for every $x_0$ there exist a constant $l_i > 0$ and a neighborhood $B(x_0)$ such that for all $\theta, \rho \in B(x_0)$,

$$|f_i(\theta) - f_i(\rho)| \le l_i |\theta - \rho| \tag{4}$$

Condition (4) covers common activation functions used in neural networks, such as the sigmoid ($l_i = 1/4$) and linear saturation ($l_i = 1$) functions. Assume that the parameters have the following estimates:

$$\begin{aligned}
b_i(t) &= b_i + \beta_i \xi_t, & i &= 1, 2, \cdots, n, \\
a_{ij}(t) &= a_{ij} + \alpha_{ij} \xi_t, & i, j &= 1, 2, \cdots, n, \\
I_i(t) &= I_i + \gamma_i \xi_t, & i &= 1, 2, \cdots, n.
\end{aligned} \tag{5}$$


Here $b_i$, $a_{ij}$, $I_i$ are the well-designed (nominal) parameters of the neural network, $\xi_t$ is the standard white noise, and $\beta_i$, $\alpha_{ij}$, $\gamma_i$ are the noise intensities of the corresponding parameters. By using (5) and (1), Eq. (3) can be rewritten as the following Itô-type stochastic differential equation:

$$dx_i(t) = \Big[-b_i x_i + \sum_{j=1}^{n} a_{ij} f_j(x_j) + I_i\Big] dt + \Big[-\beta_i x_i + \sum_{j=1}^{n} \alpha_{ij} f_j(x_j) + \gamma_i\Big] dw_t,$$

or, in vector form,

$$dx(t) = [-bx + Af(x) + I]\,dt + [-\beta x + \alpha f(x) + \gamma]\,dw_t. \tag{6}$$

Here $b = \mathrm{diag}(b_1, \cdots, b_n)$, $A = (a_{ij})_{n \times n}$, $I = (I_1, \cdots, I_n)^T$, $\beta = \mathrm{diag}(\beta_1, \cdots, \beta_n)$, $\alpha = (\alpha_{ij})_{n \times n}$, $\gamma = (\gamma_1, \cdots, \gamma_n)^T$. Let $x^*$ be an equilibrium of (6) and take the transformation $y = x - x^*$; then (6) takes the following form:

$$dy(t) = [-by + Ag(y)]\,dt + [-\beta y + \alpha g(y)]\,dw_t. \tag{7}$$

Here $g(y) = f(y + x^*) - f(x^*)$ satisfies $|g_i(y_i)| \le l_i |y_i|$, $i = 1, 2, \cdots, n$, with Lipschitz constants $l_i$. So, in order to study the stability of the equilibrium $x^*$ of (6), we only need to study the equilibrium y = 0 of (7).
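A sample path of (7) can be approximated numerically. The following is a minimal sketch using the Euler-Maruyama scheme, which is a standard discretization chosen here for illustration and not a method taken from the paper. The function name, its argument layout, and the step size are assumptions; the diagonal matrices b and β are passed as plain lists.

```python
import math
import random


def euler_maruyama(b, A, beta, alpha, g, y0, dt, n_steps, rng):
    """Integrate dy = [-b y + A g(y)] dt + [-beta y + alpha g(y)] dw_t.

    b, beta: lists holding the diagonals of the matrices b and beta.
    A, alpha: n x n nested lists.
    g: scalar function applied componentwise (hypothetical choice of caller).
    One scalar Brownian increment drives every component, as in (7).
    """
    n = len(y0)
    y = list(y0)
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # shared increment ~ N(0, dt)
        gy = [g(v) for v in y]
        drift = [-b[i] * y[i] + sum(A[i][j] * gy[j] for j in range(n))
                 for i in range(n)]
        diffusion = [-beta[i] * y[i] + sum(alpha[i][j] * gy[j] for j in range(n))
                     for i in range(n)]
        y = [y[i] + drift[i] * dt + diffusion[i] * dw for i in range(n)]
    return y
```

Because g(0) = 0, a path started at y = 0 stays at 0 for every noise realization, which is the discrete counterpart of y = 0 being an equilibrium of (7).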

3 Main Results

In this section, we establish sufficient algebraic criteria ensuring that the equilibrium of (6) is almost surely exponentially stable, and we estimate the corresponding Lyapunov exponent. Select a symmetric positive-definite matrix Q and a diagonal matrix $P = \mathrm{diag}(p_i)$ with $p_i > 0$, $i = 1, 2, \cdots, n$, and construct the following symmetric matrix:

$$H = \begin{pmatrix} -Qb - bQ + \beta Q \beta & QA - \beta Q \alpha \\ A^T Q - \alpha^T Q \beta & -P + \alpha^T Q \alpha \end{pmatrix}.$$

Denote by $-\lambda$ the maximum eigenvalue of H. Since the matrix $\alpha^T Q \alpha$ is positive semi-definite, we can easily deduce that $-\lambda + p_i \ge 0$, $i = 1, 2, \cdots, n$.

Theorem 1. Assume that $x^*$ is an equilibrium of (6). If there exist a symmetric positive-definite matrix Q and a positive diagonal matrix $P = \mathrm{diag}(p_1, p_2, \cdots, p_n)$ such that the maximum eigenvalue $-\lambda$ of the matrix H satisfies

$$\lambda > \frac{l_i^2\, p_i}{1 + l_i^2}, \quad i = 1, 2, \cdots, n,$$

then the equilibrium $x^*$ is ASES with convergence rate $\mu / (2\lambda_{\max}(Q))$, where $\mu := \min_{1 \le i \le n} \{\lambda(1 + l_i^2) - p_i l_i^2\} > 0$. Here $l_i$, $i = 1, 2, \cdots, n$, are the Lipschitz constants of the activation functions $f_i$ at the equilibrium $x^*$.


Proof. In order to examine the stability of the equilibrium $x^*$ of (6), we equivalently study the trivial equilibrium y = 0 of (7). For (7), construct the Lyapunov function $V(y) = y^T Q y$; then $V_y(y) = 2Qy$, $V_{yy}(y) = 2Q$, and its differential operator along (7) satisfies

$$\begin{aligned}
LV(y) &= V_y^T(y)\,[-by + Ag(y)] + \frac{1}{2}\,[-y^T \beta + g^T(y)\alpha^T]\, V_{yy}(y)\, [-\beta y + \alpha g(y)] \\
&= 2y^T Q\,[-by + Ag(y)] + [-y^T \beta + g^T(y)\alpha^T]\, Q\, [-\beta y + \alpha g(y)] \\
&= y^T [-Qb - bQ + \beta Q \beta]\, y + 2y^T (QA - \beta Q \alpha)\, g(y) + g^T(y)(\alpha^T Q \alpha)\, g(y) \\
&= \big(y^T,\ g^T(y)\big)\, H \begin{pmatrix} y \\ g(y) \end{pmatrix} + g^T(y)\, P\, g(y) \\
&\le -\lambda\big(|y|^2 + |g(y)|^2\big) + \sum_{i=1}^{n} p_i\, g_i^2(y_i) \\
&= -\sum_{i=1}^{n} \lambda y_i^2 + \sum_{i=1}^{n} (-\lambda + p_i)\, g_i^2(y_i).
\end{aligned}$$

Because $|g_i(y_i)| \le l_i |y_i|$ and $-\lambda + p_i \ge 0$, $i = 1, 2, \cdots, n$, we have

$$LV(y) \le -\sum_{i=1}^{n} \big[\lambda + (\lambda - p_i)\, l_i^2\big]\, y_i^2 \le -\mu \sum_{i=1}^{n} y_i^2 \le -\frac{\mu}{\lambda_{\max}(Q)}\, V(y).$$

By using Lemma 1 with $p = 2$, $c_2 = -\mu/\lambda_{\max}(Q)$, $c_3 = 0$, one can deduce that the trivial equilibrium y = 0 of (7), and equivalently the equilibrium $x^*$ of (6), is ASES, and the Lyapunov exponent is $\mu/(2\lambda_{\max}(Q))$. □

How to select an appropriate matrix P is the key problem in using Theorem 1. We give a way to do this in Corollary 1. Select a positive number $r > 0$, denote $R = r \cdot \mathrm{diag}(l_1^{-2}, \cdots, l_n^{-2})$, let $E \in \mathbb{R}^{n \times n}$ be the identity matrix, and construct the symmetric matrix $H_1$:

$$H_1 = \begin{pmatrix} -Qb - bQ + rE + \beta Q \beta & QA - \beta Q \alpha \\ A^T Q - \alpha^T Q \beta & -R + \alpha^T Q \alpha \end{pmatrix}.$$

Corollary 1. Assume that $x^*$ is an equilibrium of (6). If there exist a symmetric positive-definite matrix Q and a positive number $r > 0$ such that the matrix $H_1$ is negative definite, then the equilibrium $x^*$ is ASES.

Proof. In Theorem 1, choose $p_i = (1 + l_i^{-2})\, r$; then $H_1 = H + rE_{2n \times 2n}$. Since $H_1$ is negative definite, we have $\lambda_{\max}(H_1) = \lambda_{\max}(H) + r < 0$, that is, $-\lambda + r < 0$. So

$$\lambda > r = \frac{l_i^2\, p_i}{1 + l_i^2}, \quad i = 1, 2, \cdots, n.$$

By Theorem 1, the equilibrium $x^*$ is ASES. □


Let $Q = E$ in Corollary 1. The matrix $H_1$ then takes the following form:

$$H_2 = \begin{pmatrix} -2b + rE + \beta^2 & A - \beta \alpha \\ A^T - \alpha^T \beta & -R + \alpha^T \alpha \end{pmatrix},$$

and we have the following result.

Corollary 2. Assume that $x^*$ is an equilibrium of (6). If there exists a positive number $r > 0$ such that the matrix $H_2$ is negative definite, then the equilibrium $x^*$ is ASES.

Remark 2. A necessary condition for the matrix $H_2$ to be negative definite is that the free parameter r satisfies

$$l_i^2 \sum_{j=1}^{n} \alpha_{ji}^2 < r < 2b_i - \beta_i^2, \quad i = 1, 2, \cdots, n.$$
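The interval in Remark 2 comes from requiring every diagonal entry of $H_2$ to be negative, and it is cheap to evaluate before any eigenvalue computation. The sketch below is an illustration only; the function name and the list-based argument layout are assumptions, not part of the paper.

```python
def admissible_r_interval(b, beta, alpha, l):
    """Return (lower, upper) such that Remark 2 requires lower < r < upper.

    b, beta, l: lists of length n (diagonals of b and beta, Lipschitz constants).
    alpha: n x n nested list with alpha[j][i] holding alpha_{ji}.
    """
    n = len(b)
    # Lower bound: max_i  l_i^2 * sum_j alpha_{ji}^2  (column sums of squares).
    lower = max(l[i] ** 2 * sum(alpha[j][i] ** 2 for j in range(n))
                for i in range(n))
    # Upper bound: min_i  2 b_i - beta_i^2.
    upper = min(2 * b[i] - beta[i] ** 2 for i in range(n))
    return lower, upper
```

With the data of the example in Section 4 (b = diag(1, 1), β = diag(0.1, 0.1), α with rows (0.2, 0.1) and (0.1, 0.2), and $l_1 = l_2 = 1/4$), this gives approximately 0.003125 < r < 1.99, so the choice r = 1 made there is admissible. The condition is only necessary; negative definiteness of $H_2$ still has to be checked.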

In the following, we consider how much perturbation a well-designed (deterministic) neural network can endure. Denote

$$D = \begin{pmatrix} -2b + rE & A \\ A^T & -R \end{pmatrix}, \qquad S = \begin{pmatrix} \beta^2 & -\beta \alpha \\ -\alpha^T \beta & \alpha^T \alpha \end{pmatrix}.$$

Obviously, $H_2 = D + S$; this decomposes the matrix $H_2$ into its deterministic part and its random part.

Corollary 3. Assume that $x^*$ is an equilibrium of the well-designed deterministic neural network. If the condition $\lambda_{\max}(S) < -\lambda_{\max}(D)$ holds, then the equilibrium $x^*$ is also ASES.

Proof. By Corollary 2, we need only verify that the matrix $H_2$ is negative definite. For any $z \in \mathbb{R}^{2n}$, we have

$$z^T H_2 z = z^T D z + z^T S z \le \big(\lambda_{\max}(D) + \lambda_{\max}(S)\big)\, |z|^2.$$

Since $\lambda_{\max}(D) + \lambda_{\max}(S) < 0$, this shows that the matrix $H_2$ is negative definite. □

From the matrix inequality $\rho(A) \le |A|_1$, where $\rho(A)$ is the spectral radius of the matrix A and $\lambda_{\max}(A) \le \rho(A)$ if A is symmetric (with equality when A is also positive semi-definite, as S is), we have the following corollary.

Corollary 4. Assume that $x^*$ is an equilibrium of the well-designed deterministic neural network. If the condition $|S|_1 < -\lambda_{\max}(D)$ holds, then the equilibrium $x^*$ is also ASES.


Remark 3. $|S|_1 = \max_j \big\{ \sum_{i=1}^{2n} |s_{ij}| \big\}$ is easier to compute than the maximum eigenvalue of the matrix S, which makes Corollary 4 convenient in applications.

4 An Example

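Consider the following 2-dimensional SNN:

$$\begin{aligned}
\frac{dx_1}{dt} &= -b_1(t)\,x_1 + a_{11}(t) f_1(x_1) + a_{12}(t) f_2(x_2) + I_1(t), \\
\frac{dx_2}{dt} &= -b_2(t)\,x_2 + a_{21}(t) f_1(x_1) + a_{22}(t) f_2(x_2) + I_2(t),
\end{aligned}$$

with the sigmoid activation function

$$f_1(u) = f_2(u) = \frac{1}{1 + e^{-u}}, \quad u \in \mathbb{R}.$$

The Lipschitz constants are $l_1 = l_2 = f_i'(0) = 1/4$. The parameters, estimated statistically, are as follows:

$$\begin{aligned}
&b_1(t) = 1 + 0.1\xi_t, \quad b_2(t) = 1 + 0.1\xi_t, \\
&a_{11}(t) = 2 + 0.2\xi_t, \quad a_{12}(t) = 1 + 0.1\xi_t, \quad a_{21}(t) = 1 + 0.1\xi_t, \quad a_{22}(t) = 2 + 0.2\xi_t, \\
&I_1(t) = I_2(t) = -3 - 0.3\xi_t.
\end{aligned}$$

We have

$$b = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \beta = \begin{pmatrix} 0.1 & 0 \\ 0 & 0.1 \end{pmatrix}, \quad A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \quad \alpha = \begin{pmatrix} 0.2 & 0.1 \\ 0.1 & 0.2 \end{pmatrix}.$$

Choose r = 1; then $R = \mathrm{diag}(16, 16)$. The matrices D, S in Corollary 4 are as follows:

$$D = \begin{pmatrix} -1 & 0 & 2 & 1 \\ 0 & -1 & 1 & 2 \\ 2 & 1 & -16 & 0 \\ 1 & 2 & 0 & -16 \end{pmatrix}, \qquad S = \begin{pmatrix} 0.01 & 0 & -0.02 & -0.01 \\ 0 & 0.01 & -0.01 & -0.02 \\ -0.02 & -0.01 & 0.05 & 0.04 \\ -0.01 & -0.02 & 0.04 & 0.05 \end{pmatrix}.$$

We obtain $\lambda_{\max}(D) = -0.4223$ and $|S|_1 = 0.12$. The condition $|S|_1 < -\lambda_{\max}(D)$ of Corollary 4 holds, so the equilibrium $x_1 = x_2 = 0$ is ASES.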
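The two numbers quoted in the example can be re-checked with a few lines of dependency-free code. The sketch below is an illustration, not the authors' computation: it uses the maximum absolute column sum for $|S|_1$ and a shifted power iteration for $\lambda_{\max}(D)$. The function names, the iteration count, and the start vector are assumptions.

```python
def one_norm(M):
    """Matrix 1-norm: the maximum absolute column sum."""
    n_rows, n_cols = len(M), len(M[0])
    return max(sum(abs(M[i][j]) for i in range(n_rows)) for j in range(n_cols))


def lambda_max_symmetric(M, iters=2000):
    """Largest eigenvalue of a symmetric matrix via shifted power iteration."""
    n = len(M)
    # Gershgorin bound: all eigenvalues lie in [-c, c], so M + c*I is PSD and
    # its dominant eigenvalue is lambda_max(M) + c.
    c = max(sum(abs(v) for v in row) for row in M)
    x = [1.0 + 0.1 * i for i in range(n)]  # arbitrary start vector
    for _ in range(iters):
        y = [sum(M[i][j] * x[j] for j in range(n)) + c * x[i] for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        if norm == 0.0:  # only happens when M = -c*I
            return -c
        x = [v / norm for v in y]
    # Rayleigh quotient of M at the converged unit vector.
    Mx = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(x[i] * Mx[i] for i in range(n))


D = [[-1, 0, 2, 1], [0, -1, 1, 2], [2, 1, -16, 0], [1, 2, 0, -16]]
S = [[0.01, 0.0, -0.02, -0.01], [0.0, 0.01, -0.01, -0.02],
     [-0.02, -0.01, 0.05, 0.04], [-0.01, -0.02, 0.04, 0.05]]

lam_D = lambda_max_symmetric(D)      # close to (-17 + sqrt(261)) / 2
norm_S = one_norm(S)                 # largest column sum of |s_ij|
condition_holds = norm_S < -lam_D    # condition of Corollary 4
```

For this D the largest eigenvalue has the closed form $(-17 + \sqrt{261})/2 \approx -0.4223$, because D is built from 2 × 2 blocks that share the eigenvectors of A, so the iteration result can be compared against it directly.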

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under grants No. 60774051 and 60474001.


References

1. Liao, X.X., Mao, X.R.: Exponential Stability and Instability of Stochastic Neural Networks. Stochast. Anal. Appl. 14, 165–185 (1996)
2. Liao, W., Wang, D., Wang, Z., Liao, X.X.: Stability of Stochastic Cellular Neural Networks. Journal of Huazhong Univ. of Sci. and Tech. 35, 32–34 (2007)
3. Liao, W., Liao, X.X., Shen, Y.: Robust Stability of Time-delayed Interval CNN in Noisy Environment. Acta Automatica Sinica 30, 300–305 (2004)
4. Blythe, S., Mao, X.R., Liao, X.X.: Stability of Stochastic Delayed Neural Networks. Journal of the Franklin Institute 338, 481–495 (2001)
5. Shen, Y., Liao, X.X.: Robust Stability of Nonlinear Stochastic Delayed Systems. Acta Automatica Sinica 25, 537–542 (1999)
6. Liao, X.X., Mao, X.R.: Exponential Stability of Stochastic Delay Interval Systems. Systems and Control Letters 40, 171–181 (2000)
7. Liao, X.X., Mao, X.R.: Stability of Stochastic Neural Networks. Neural, Parallel and Scientific Computations 14, 205–224 (1996)
8. Mao, X.R.: Stochastic Differential Equations and Applications. Horwood Pub., Chichester (1997)

Neural Control of Uncertain Nonlinear Systems with Minimum Control Effort

Dingguo Chen^1, Jiaben Yang^2, and Ronald R. Mohler^3

1 Siemens Power Transmission and Distribution Inc., 10900 Wayzata Blvd., Minnetonka, Minnesota 55305, USA
2 Department of Automation, Tsinghua University, Beijing, 100084, People's Republic of China
3 Department of Electrical and Computer Engineering, Oregon State University, Corvallis, OR 97330, USA

Abstract. A special class of nonlinear systems is studied in this paper in the context of fuel-optimal control. These systems feature parametric uncertainties and confined control inputs. The control objective is to minimize the integrated control cost over the applicable time horizon. The conventional adaptive control schemes are difficult to apply. An innovative design approach is proposed to handle the uncertain parameters, the physical limitations of the control variables, and fuel-optimal control performance simultaneously. The proposed control design methodology analyzes the fuel control problem for nominal cases, employs a hierarchical neural network structure, constructs the lower-level neural networks to identify the switching manifolds, and utilizes the upper-level neural network to coordinate the outputs of the lower-level neural networks so as to achieve control robustness in an approximately fuel-optimal manner. Theoretical results are presented to justify the proposed design procedures for synthesizing adaptive, intelligent hierarchical neural controllers for uncertain nonlinear systems.

Keywords: Bilinear System, Uncertain Nonlinear System, Multiple Input Nonlinear System, Neural Network, Fuel Optimal Control, Neural Control, Switching Manifold, Hierarchical Neural Network.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 299–308, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Bilinear systems [3] have been widely studied because of their appealing structure, which can be utilized for better system controllability. A generalization of bilinear systems is the class of affine systems that are linear in control. Numerous research results have been reported in two main categories: (1) in the context of adaptive control, where system parametric uncertainties are considered but physical control restrictions and control performance indexes are not; and (2) in the context of optimal control, where both physical control restrictions and control performance indexes are considered but control robustness is not. Many practical systems need to be controlled following appropriate control designs so that optimal control performance is achieved with a certain degree of control robustness. Few results have been obtained in the context of adaptive, optimal, and constrained control. It is the motivation of this paper to make a contribution in this direction. It is true that numerous elegant adaptive control techniques are available, but they are difficult to apply to this type of problem. In recent years, neural networks have been introduced into nonlinear adaptive control design [14,16,15,17,18] in the hope of tackling some of the difficult problems that conventional design approaches cannot handle. This is due to the superior function approximation capabilities, the distributed structure, and the parallel processing feature of neural networks and, more importantly, to the appealing structure of three-layered neural networks, which are linear in the unknown parameters when linearized at unknown optimal parameters [18] or at given sub-optimal parameters [21]. The control designs based on the adaptive control schemes [18,21], however, have the drawback that the control signal is not constrained within a predesignated physical range. Further, additional control performance criteria, e.g., optimal control performance, are difficult to incorporate within the framework of the traditional adaptive control schemes. It becomes apparent that the popular adaptive control schemes cannot be directly applied to practical problems that require the control signal to be bounded by a given number, and a new theory tailored for the application has yet to be worked out. It is the objective of this paper to make an attempt to address the fuel-optimal control problem in the broader context of adaptive, optimal, and constrained control.
The intention is to generalize the results from the previous efforts [11,7,10,22] so that the new results cover a broader class of problems that have multiple control variables, need to respect control constraints, need to incorporate a broader class of control performance criteria, and contain parametric uncertainties. This paper is organized as follows. Section 2 describes the class of uncertain nonlinear systems to be studied and the control objective, and makes several conventional assumptions. The fuel-optimal control problem is analyzed and an iterative numerical solution process is presented in Section 3. The control problem studied in this paper is decomposed into a series of control problems that do not have parameter uncertainties. This decomposition is utilized in the hierarchical neural control design methodology, which is presented in Section 4. The synthesis of hierarchical neural controllers aims to achieve (a) near fuel-optimal control of the studied systems with constrained control, and (b) adaptive control of the studied systems with unknown parameters. Theoretical results that justify the fuel-optimal control oriented neural control design procedures are developed and presented in Section 5. Finally, some conclusions are drawn.

2 Problem Statement

Although the conventional adaptive control schemes are powerful, they have common drawbacks: (a) the control usually does not consider the physical control limitations, and (b) a performance index is difficult to incorporate. Most practical systems need to consider both physical constraints and an optimal performance index, and yet require robustness of control with respect to certain system parameter variations. This paper attempts to address these challenges. The systems to be studied in this paper are linear in both control and parameters, and feature parametric uncertainties, confined control inputs, and multiple control inputs. These systems are represented by a finite-dimensional differential system linear in control and linear in parameters, as shown in [22]. The control objective is to follow a theoretically sound control design methodology to design the controller such that the system is adaptively controlled with respect to parametric uncertainties and yet achieves a desired control performance. To facilitate the theoretical derivations, several conventional assumptions are made, the same as in [22] except that AS4 is slightly modified to reflect minimum fuel control and AS9 is added:

AS4: The control performance criterion is $J = \int_{t_0}^{t_f} \big[e_0 + \sum_{k=1}^{m} e_k |u_k|\big]\, ds$, where $t_0$ and $t_f$ are the initial time and the final time, respectively, and $e_k$ ($k = 0, \cdots, m$) are non-negative constants. The cost functional reflects the requirement of fuel-optimal control and the interest of this paper in having the total cost related to the integral of the absolute control effort of each control variable over time.

Remark 1: If $e_k = 0$ for $k = 1, \cdots, m$ and $e_0 > 0$, the control performance criterion becomes $J = \int_{t_0}^{t_f} e_0\, ds$, which corresponds to the time-optimal control problem. In this sense, the time-optimal control problem can be viewed as a special case of the fuel-optimal control problem studied in this paper.

AS9: The total number of switch times for all control components for the studied fuel-optimal control problem is greater than the number of state variables.

Remark 2: AS9 is true for practical systems to the best knowledge of the authors. The assumption is made to support the rigor of the theoretical results developed in this paper.
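For a piecewise-constant control, which is the form a bang-off-bang control takes, the cost functional of AS4 reduces to a finite sum. The sketch below evaluates it; the function name and the representation of the control as (duration, values) segments are assumptions made for illustration.

```python
def fuel_cost(segments, e0, e):
    """Evaluate J = integral of [e0 + sum_k e[k] * |u_k|] over the horizon.

    segments: list of (duration, u) pairs, where u is the list of control
    values held constant over that duration (piecewise-constant control).
    e0, e: non-negative weights, e0 for elapsed time and e[k] for control k.
    """
    total = 0.0
    for duration, u in segments:
        running = e0 + sum(ek * abs(uk) for ek, uk in zip(e, u))
        total += running * duration
    return total
```

For a single control that is +1 for one time unit, 0 for one, and -1 for one, with $e_0 = 1$ and $e_1 = 2$, the cost is 3 · 1 + 2 · (1 + 0 + 1) = 7. Setting $e_1 = 0$ leaves only the elapsed time 3, which is the time-optimal special case of Remark 1.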

3 Fuel-Optimal Control Problem and Numerical Solution

Decomposing the control problem (P) into a series of control problems (P0) is an important step toward the hierarchical neural control design, which is intended to address the near fuel-optimal control of uncertain nonlinear systems. The distinction between the control problem (P) and the control problem (P0) is made to facilitate the development of the hierarchical neural control design and for clarity of presentation. The original control problem (P) is associated with an unknown parameter vector p, while the control problem (P0) corresponds to a given parameter vector p. The control problem (P) can be viewed as a family of control problems (P0), which together represent an approximately accurate characterization of the dynamic behaviors exhibited by the nonlinear systems in the control problem (P).


The application of the Pontryagin minimum principle gives rise to the so-called two-point boundary-value problem (TPBVP), which must be satisfied by an optimal solution. In general, an analytic solution to the TPBVP is extremely difficult, and usually practically impossible, to obtain. It is shown that the iterative solution obtained by the switching-times-variation method (STVM) [3,4] through successive approximation converges to the unique solution of the optimal control problem, provided that the TPBVP has a unique solution. With some derivations, the optimal control can be written as follows:

$$u_k^* = u_k^{*+} + u_k^{*-} \tag{1}$$

where $u_k^{*+} = \frac{1}{2}[\mathrm{sgn}(-s_k^*(t) - 1) + 1]$, $u_k^{*-} = \frac{1}{2}[\mathrm{sgn}(-s_k^*(t) + 1) - 1]$, and $s_k^*(t)$ is the kth component of the switching vector. It has been shown in [4] that the number of optimal switching times must be finite provided that no singular solutions exist. Let the zeros of $-s_k(t) - 1$ be $\tau_{k,j}^{+}$ ($j = 1, \cdots, 2N_k^{+}$, $k = 1, \cdots, m$; and $\tau_{k,j_1}^{+} < \tau_{k,j_2}^{+}$ for $1 \le j_1 < j_2 \le 2N_k^{+}$), which represent the switching times corresponding to the positive control $u_k^{*+}$, and let the zeros of $-s_k(t) + 1$ be $\tau_{k,j}^{-}$ ($j = 1, \cdots, 2N_k^{-}$, $k = 1, \cdots, m$; and $\tau_{k,j_1}^{-} < \tau_{k,j_2}^{-}$ for $1 \le j_1 < j_2 \le 2N_k^{-}$), which represent the switching times corresponding to the negative control $u_k^{*-}$. Altogether, the $\tau_{k,j}^{+}$'s and $\tau_{k,j}^{-}$'s represent the switching times, which uniquely determine $u_k^*$ as follows:

$$u_k^*(t) = \frac{1}{2} \Big\{ \sum_{j=1}^{N_k^{+}} \big[\mathrm{sgn}(t - \tau_{k,2j-1}^{+}) - \mathrm{sgn}(t - \tau_{k,2j}^{+})\big] - \sum_{j=1}^{N_k^{-}} \big[\mathrm{sgn}(t - \tau_{k,2j-1}^{-}) - \mathrm{sgn}(t - \tau_{k,2j}^{-})\big] \Big\}. \tag{2}$$

Let the switching vector for the kth component of the control vector be $\tau^{N_k} = [(\tau^{N_k^{+}})^{\top}\ (\tau^{N_k^{-}})^{\top}]^{\top}$, where $\tau^{N_k^{+}} = [\tau_{k,1}^{+} \cdots \tau_{k,2N_k^{+}}^{+}]^{\top}$ and $\tau^{N_k^{-}} = [\tau_{k,1}^{-} \cdots \tau_{k,2N_k^{-}}^{-}]^{\top}$ (⊤ denotes transposition). Let $N_k = 2N_k^{+} + 2N_k^{-}$. Then $\tau^{N_k}$ is the switching vector of $N_k$ dimensions.

Let the vector of switch functions for the control variable $u_k$ be defined as $\phi^{N_k} = [\phi_1^{N_k} \cdots \phi_{2N_k^{+}}^{N_k}\ \phi_{2N_k^{+}+1}^{N_k} \cdots \phi_{2N_k^{+}+2N_k^{-}}^{N_k}]^{\top}$, where $\phi_j^{N_k} = (-1)^{j-1} e_k \big(s_k(\tau_{k,j}^{+}) + 1\big)$ ($j = 1, \cdots, 2N_k^{+}$) and $\phi_{j+2N_k^{+}}^{N_k} = (-1)^{j} e_k \big(s_k(\tau_{k,j}^{-}) - 1\big)$ ($j = 1, \cdots, 2N_k^{-}$). The gradient that can be used to update the switching vector $\tau^{N_k}$ is given by

$$\nabla_{\tau^{N_k}} J = -\phi^{N_k}. \tag{3}$$

The optimal switching vector can be obtained iteratively by using a gradient-based method:

$$\tau^{N_k, i+1} = \tau^{N_k, i} + K^{k,i} \phi^{N_k} \tag{4}$$

where $K^{k,i}$ is a properly chosen $N_k \times N_k$-dimensional diagonal matrix with non-negative entries for the ith iteration of the iterative optimization process, and $\tau^{N_k, i}$ represents the ith iterate of the switching vector $\tau^{N_k}$. When the optimal switching vectors are determined upon convergence, the optimal control trajectories and the optimal state trajectories are computed. This process is repeated for all selected nominal cases, as discussed in Section 4, until all needed off-line optimal control and state trajectories are obtained. These trajectories are used in training the fuel-optimal control oriented neural networks.
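The representation of a bang-off-bang control by its switching times in Eq. (2), and one update step of Eq. (4), can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the diagonal gain matrix is passed as a list of its diagonal entries, and the switch-function vector φ is taken as given, since computing it requires the costate-dependent switching function of the specific system.

```python
def sgn(z):
    """Sign function with sgn(0) = 0."""
    return (z > 0) - (z < 0)


def control_from_switching_times(t, tau_plus, tau_minus):
    """Evaluate u_k(t) from ordered switching times, following Eq. (2).

    tau_plus / tau_minus hold on/off time pairs for the +1 and -1 arcs:
    [on_1, off_1, on_2, off_2, ...], each of even length.
    """
    pos = sum(sgn(t - tau_plus[i]) - sgn(t - tau_plus[i + 1])
              for i in range(0, len(tau_plus), 2))
    neg = sum(sgn(t - tau_minus[i]) - sgn(t - tau_minus[i + 1])
              for i in range(0, len(tau_minus), 2))
    return 0.5 * (pos - neg)


def update_switching_vector(tau, gains, phi):
    """One iteration of Eq. (4) with the diagonal of K given as a list."""
    return [t + k * p for t, k, p in zip(tau, gains, phi)]
```

With `tau_plus = [1, 2]` and `tau_minus = [3, 4]`, the control is +1 on (1, 2), -1 on (3, 4), and 0 elsewhere. Exactly at a switching time the expression evaluates to ±0.5 because sgn(0) = 0 is assumed; this affects only isolated time instants.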

4 Neural Control Design Methodology for Fuel-Optimal Control

In one of the previous endeavors, the hierarchical neural network based control design was applied to the single-machine infinite-bus (SMIB) power system [11]. In a recent attempt [22] to generalize the control design approaches and extend them to a broader class of nonlinear systems, multiple control inputs were considered. However, the control problems studied have been limited to time-optimal control. This paper presents a design methodology for fuel-optimal control, which in a sense can be considered a more general control problem than the time-optimal control problem. The proposed design consists of two major steps:

1. Use neural networks to approximate the switching manifolds for all the control components $u_k$ ($k = 1, \cdots, m$) for each selected nominal case;
2. Use neural networks to approximate the coordination function, which determines the relative control effort contributions of the lower-level neural controllers.

The system dynamic behaviors are affected not only by the initial system state and the control variables, but also by the parameter vector p. The analysis of the effect of p on the system dynamic behaviors is helpful in determining the nominal cases required in the proposed design approach. Based on the qualitative system behavior analysis, the parameter vector space may be tessellated in such a way that an appropriate tessellation granularity is achieved to meet the desired control performance with a minimal number of nominal cases. For each individual control problem (P0), bang-off-bang control results. Consequently, the switching manifold can be identified using the optimal control and state trajectories that are obtained using applicable numerical methods and that cover the stability region of interest. Mathematically, this is equivalent to saying that $u_i = -\mathrm{sgn}(S_i(x))$ for $|S_i(x)| > 1$, and $u_i = 0$ for $|S_i(x)| < 1$ ($i = 1, \cdots, m$). $S_i(x)$ is the switching function, with $|S_i(x)| = 1$ identifying the switching manifolds.

Since the switch functions are functions of state variables and costate variables, and the state trajectories and costate trajectories are usually not readily available analytically, direct approximation of $S_i(x)$ using neural networks is difficult. Instead, the patterns generated from the off-line calculated optimal control and state trajectories are used to determine the relationship between the control variable $u_k$ and the state x. Since the fuel-optimal control is a bang-off-bang control, the $u_k$'s thus obtained need further processing. This includes cascading another neural network to carry out the following computation logic: $v_k = \frac{1}{2}[\mathrm{sgn}(-u_k - 1) + 1] + \frac{1}{2}[\mathrm{sgn}(-u_k + 1) - 1]$. This additional neural network uses the Heaviside activation function and is constructed directly, so it does not require any off-line training. The resulting switching-manifold-identification oriented neural controller is shown in Fig. 1.

[Figure: block diagram with input x, a block labeled "Standard NN", and output u_k]

Fig. 1. Switching manifolds identification by neural network
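The computation logic stated above for $v_k$ is a fixed three-level map and can be evaluated without any training. The sketch below implements the expression literally; the function names are assumptions, and sgn(0) = 0 is an assumed convention.

```python
def sgn(z):
    """Sign function with sgn(0) = 0."""
    return (z > 0) - (z < 0)


def three_level(z):
    """Literal evaluation of 1/2[sgn(-z - 1) + 1] + 1/2[sgn(-z + 1) - 1]."""
    return 0.5 * (sgn(-z - 1) + 1) + 0.5 * (sgn(-z + 1) - 1)
```

As written, the expression returns +1 for arguments below -1, 0 for arguments strictly between -1 and +1, and -1 for arguments above +1, which is the same pattern as Eq. (1) of Section 3 applied to a switching-function value. Exactly at ±1 it returns ∓0.5 under the assumed sgn(0) = 0.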

The design of the upper-level neural controllers also utilizes the off-line generated optimal state trajectories. In addition, it makes use of the outputs of the lower-level neural controllers to determine the relative control effort contribution from each lower-level neural controller. Each component of the final control vector is the sum of the corresponding lower-level neural control signals, each modulated by the corresponding coordinating signal of the upper-level neural networks. The hierarchical neural control diagram is shown in Fig. 2.
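The combination rule just described, a weighted sum of lower-level control signals with weights supplied by the upper level, can be sketched as follows. The array shapes and the function name are hypothetical; the source does not specify how the coordinating signals are parameterized or normalized.

```python
def hierarchical_control(lower_outputs, coordination):
    """Combine lower-level controls into the final control vector.

    lower_outputs[i][k]: k-th control component proposed by nominal case i.
    coordination[i][k]: upper-level coordinating signal for that component
    (hypothetical layout, one weight per nominal case and component).
    Returns u with u[k] = sum_i coordination[i][k] * lower_outputs[i][k].
    """
    n_cases = len(lower_outputs)
    m = len(lower_outputs[0])
    return [sum(coordination[i][k] * lower_outputs[i][k] for i in range(n_cases))
            for k in range(m)]
```

For two nominal cases with outputs (1, 0) and (1, -1) and coordinating signals (0.25, 0.5) and (0.75, 0.5), the combined control is (1.0, -0.5). When the weights for a component sum to one and all lower-level controllers agree, the combined value coincides with their common output.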

5 Theoretical Justification on Construction of Hierarchical Neural Controllers

To validate the proposed fuel-optimal control design methodology for using neural networks to adaptively control the uncertain nonlinear systems studied in this paper, two main theoretical results are presented to justify (a) the use of lower-level neural networks for switching manifold identification for the control problem (P0) and (b) the use of the hierarchical neural networks for adaptive control of the control problem (P). To address the first issue, we present the following result.

Proposition 1. For the control problem (P0) with assumptions AS1 through AS4 and AS6 satisfied, let the switching manifolds for the kth component of the control vector u be $S_k$ ($k = 1, 2, \cdots, m$). Define $S_k^{+} = S_k$ for $S_k = 1$, and $S_k^{-} = S_k$ for $S_k = -1$. Let $D_k^{+}$, a connected open subset of Ω, be constructed in such a way that the switching manifold $S_k^{+}$ is a subset of $D_k^{+}$ and $D_k^{+} = \{x : \|x - y\| < \epsilon_k^{+},\ y \in S_k^{+},\ x \in \Omega\}$, where $\epsilon_k^{+}$ is a pre-specified positive number and $k = 1, 2, \cdots, m$. Similarly, $D_k^{-}$ is constructed such that $S_k^{-} \subset D_k^{-}$ and $D_k^{-} = \{x : \|x - y\| < \epsilon_k^{-},\ y \in S_k^{-},\ x \in \Omega\}$, where $\epsilon_k^{-}$ is a pre-specified positive number and $k = 1, 2, \cdots, m$. If $D_k^{+}$ and $D_k^{-}$ are constructed such that $D_k^{+} \cap D_k^{-} = \emptyset$, then there exists a state feedback neural controller $u_{nn,k} = NN_k(x)$, which only takes the values -1, 0, or +1 with x being the state, such that if $x \in \Omega - D_k^{+} - D_k^{-}$, then $\|u_k(x) - u_{nn,k}(x)\| = 0$ ($k = 1, 2, \cdots, m$).

[Figure: block diagram. Upper-level neural networks (NN blocks fed by x through TDL blocks) produce coordinating signals 1, 2, ..., M; lower-level neural networks (NN blocks fed by x) produce u_1, u_2, ..., u_M; a multiplier processing unit forms the products, which are summed to give the control u.]

Fig. 2. Hierarchical fuel-optimal neural control

Proof: First of all, note that $\Omega - D_k^{+} - D_k^{-}$ is a subset of Ω and is also a closed subset, hence compact. The optimal control $u_k = g_k(x)$ ($k = 1, 2, \cdots, m$) with $x \in \Omega$ is discontinuous only on $x \in S_k^{+} \cup S_k^{-}$. It can be approximated by a continuous function, say $v_k = h_k(x)$, with the same support and with sufficiently small error $\gamma_k > 0$, such that $h_k = g_k$ if $x \in \Omega - D_k^{+} - D_k^{-}$, and $|h_k(\cdot) - g_k(\cdot)| < \gamma_k$ for $x \in D_k^{+} \cup D_k^{-}$. Then for any $\epsilon^* > 0$, there exists a neural network $NN1_k(x, \Theta^*)$ with the optimal parameter vector $\Theta^*$ such that $|NN1_k^*(x, \Theta^*) - h_k(x)| < \epsilon^*$. It follows from AS6 that the optimal control $u_k$ is of bang-off-bang type, and therefore $h_k(\cdot)$ takes a value of -1, 0, or +1 for any $x \in \Omega - D_k^{+} - D_k^{-}$. Let $\gamma_{1,k} = \epsilon^*$. Then $1 - \gamma_{1,k} < NN1_k(x, \Theta^*) < 1 + \gamma_{1,k}$ when $h_k = 1$, or $-1 - \gamma_{1,k} < NN1_k(x, \Theta^*) < -1 + \gamma_{1,k}$ when $h_k = -1$, or $-\gamma_{1,k} < NN1_k(x, \Theta^*) < \gamma_{1,k}$ when $h_k = 0$, for $x \in \Omega - D_k^{+} - D_k^{-}$. As long as $\epsilon^*$ is chosen such that $\epsilon^* < \frac{1}{2}$, one of the following three mutually exclusive conditions is satisfied: either $1 < 2 \cdot NN1_k(x, \Theta^*) < 3$, or $-3 < 2 \cdot NN1_k(x, \Theta^*) < -1$, or $-1 < 2 \cdot NN1_k(x, \Theta^*) < 1$. Consequently, $\frac{1}{2}[\mathrm{sgn}(-2 \cdot NN1_k(x, \Theta^*) - 1) + 1] + \frac{1}{2}[\mathrm{sgn}(-2 \cdot NN1_k(x, \Theta^*) + 1) - 1] = h_k(x)$ for $x \in \Omega - D_k^{+} - D_k^{-}$. But $\frac{1}{2}[\mathrm{sgn}(-2 \cdot NN1_k(x, \Theta^*) - 1) + 1] + \frac{1}{2}[\mathrm{sgn}(-2 \cdot NN1_k(x, \Theta^*) + 1) - 1]$ can be constructed as another neural network using the Heaviside activation function. Let $NN_k(x) = \frac{1}{2}[\mathrm{sgn}(-2 \cdot NN1_k(x, \Theta^*) - 1) + 1] + \frac{1}{2}[\mathrm{sgn}(-2 \cdot NN1_k(x, \Theta^*) + 1) - 1]$. Thus, the existence of the neural controller $u_{nn,k} = NN_k(x)$ is assured. This completes the proof.

With the application of the above result along with AS8, it follows that there exists an off-line trained neural network $NN1_k(x, \Theta_s)$ such that $|NN1_k(x, \Theta_s) - h_k(x)| \le |NN1_k(x, \Theta_s) - NN1_k^*(x, \Theta^*)| + |NN1_k^*(x, \Theta^*) - h_k(x)| < \epsilon_s + \epsilon^*$. As long as the off-line trained neural network and the neural network with the ideal parameters are sufficiently close, i.e., if $\epsilon_s < \frac{1}{2} - \epsilon^*$, then, as shown above, this off-line trained neural network, even though not in its optimal configuration, is good enough for approximating the switching manifolds. From the practical implementation point of view, this is particularly meaningful in the sense that it justifies constructing the lower-level neural controllers from the optimal control and state trajectories while still achieving the desired, sufficiently accurate switching manifold approximations.

To address the second issue highlighted at the beginning of this section, we present the following result, which is a generalization to fuel-optimal control of a result in [22], where time-optimal control is considered.

Proposition 2. For the control problem (P) with the assumptions AS1 through AS9 satisfied, suppose Ω is a compact region such that, with the bang-off-bang control, the optimal trajectories starting in the compact region remain in it. Then

1. for any $\epsilon_1 > 0$ and $\epsilon_2 > 0$, there exists $\epsilon_3 > 0$ such that if $\|x_0' - x_0\| < \epsilon_3$, there exists a terminal time $t_f$ such that $|t_f - t_f^*| < \epsilon_1$ and $\|x(x_0', t_f) - x^*(x_0, t_f^*)\| < \epsilon_2$, where $t_f^*$ is the optimal terminal time for the initial state $x_0$; $x^*(x_0, t)$ is the optimal trajectory starting from $x_0$; and $x_0'$ is a perturbed initial condition of $x_0$.
2. for any $\epsilon_4 > 0$ and $\epsilon_5 > 0$, there exists $\epsilon_6 > 0$ such that if $\|p' - p\| < \epsilon_6$, there exists a terminal time $t_f$ such that $|t_f - t_f^*| < \epsilon_4$ and $\|x(x_0, p', t_f) - x^*(x_0, p, t_f^*)\| < \epsilon_5$, where $t_f^*$ is the optimal terminal time for the initial state $x_0$; $x^*(x_0, p, t)$ is the optimal trajectory starting from $x_0$ for the control problem (P) with the parameter vector p; and $p'$ is a perturbed parameter vector of p.

Proof: Due to the page limit, only a sketch of the proof is presented. First, consider a perturbation in the initial state $x_0$ and show that, for a small change in the initial state, the switching times vector makes a correspondingly small change in order to drive the final state to the origin. Secondly, consider an increment dp in p and show that, for a small change in the parameter vector, the switching times vector makes a correspondingly small adjustment to still drive the final state to the origin. In both steps, perturbation analysis is conducted, and integration of the system equations and certain norms are applied, along with the assumptions, especially AS9.

In particular, for the case of the optimal state trajectory $x^*(x_0, t_f)$, the optimal final switching time $t_f^*$, and the perturbed initial condition $x_0'$ of $x_0$, by properly choosing $\epsilon_3$, one can obtain $|t_f - t_f^*| < \epsilon_1$ and $\|x(x_0', t_f) - x^*(x_0, t_f^*)\| < \epsilon_2$. For the case of the optimal state trajectory $x^*(x_0, p, t_f)$, the optimal final switching time $t_f^*$, and the perturbed parameter vector $p'$ of p, by properly choosing $\epsilon_6$, one can obtain $|t_f - t_f^*| < \epsilon_4$ and $\|x(x_0, p', t_f) - x^*(x_0, p, t_f^*)\| < \epsilon_5$. This completes the proof.

The above result indicates that the system dynamic behavior for an unknown parameter vector p can be closely approximated by the behavior corresponding to a parameter sub-region of the tessellation that is sufficiently small and contains the unknown parameter vector p. In addition, the theoretical results presented in the paper together justify the proposed design methodology: how the switching manifolds for fuel-optimal control problems are identified using neural networks, and how the hierarchical neural network controls the system in a near-optimal manner.
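To make the idea of a tessellated parameter space concrete, the following sketch locates the sub-region containing a given parameter vector and returns a representative nominal vector for it. A uniform rectangular grid is assumed purely for illustration; the paper leaves the tessellation to a qualitative analysis of the system, and all names below are hypothetical.

```python
def nominal_case_index(p, lower, upper, cells):
    """Locate the grid cell of a uniform tessellation that contains p.

    lower, upper: per-dimension bounds of the parameter box (lower[d] < upper[d]).
    cells: number of equal sub-intervals along each dimension.
    Returns a tuple of cell indices; points on the upper boundary are clamped
    into the last cell.
    """
    index = []
    for value, lo, hi, n in zip(p, lower, upper, cells):
        if not lo <= value <= hi:
            raise ValueError("parameter outside the tessellated box")
        width = (hi - lo) / n
        index.append(min(int((value - lo) / width), n - 1))
    return tuple(index)


def cell_center(index, lower, upper, cells):
    """Centre of a cell, usable as the nominal parameter vector for that cell."""
    return [lo + (i + 0.5) * (hi - lo) / n
            for i, lo, hi, n in zip(index, lower, upper, cells)]
```

For the box [0, 1] × [1, 2] split into 4 × 2 cells, the vector (0.3, 1.9) falls into cell (1, 1), whose centre is (0.375, 1.75). Finer grids shrink the distance between any p and its nominal vector, which is the quantity that item 2 of Proposition 2 requires to be small.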

6 Conclusions

Unlike a previous attempt to address adaptive, time-optimal control of uncertain nonlinear systems, this paper aims at achieving adaptive control of uncertain nonlinear systems in an approximately fuel-optimal manner. The studied nonlinear system control problem is characterized by nonlinear systems that are affine in both the control variables and the uncertain parameters, by control variables that are physically restricted, and by a cost functional to be minimized that is the integral, over an applicable time horizon, of a function linear in the absolute values of the control variables. Since the conventional adaptive control techniques cannot be directly applied to the adaptive control of the studied systems, a neural network based control methodology is adopted. This novel control design allows for the incorporation of control performance criteria and constraints on the control variables while remaining practically effective in addressing the parameter uncertainty. The proposed hierarchical neural controller consists of lower-level neural networks for fuel-optimal control of the respective nominal cases and an upper-level neural network for determining the relative contribution of each lower-level neural controller. The control design procedures are presented with theoretical justifications and are practically convenient and useful in synthesizing robust, adaptive, and fuel-optimal neural controllers.

References 1. Mohler, R.R.: Nonlinear Systems Volume I, Dynamics and Control. Prentice-Hall, Englewood Cliffs (1991) 2. Mohler, R.R.: Nonlinear Systems Volume II, Applications to Bilinear Control. Prentice Hall, Englewood Cliffs (1991) 3. Mohler, R.R.: Bilinear Control Processes. Academic Press, New York (1973) 4. Moon, S.F.: Optimal Control of Bilinear Systems and Systems Linear in Control, Ph.D. dissertation. The University of New Mexico (1969) 5. Lee, E.B., Markus, L.: Foundations of Optimal Control Theory. Wiley, New York (1967)


6. Rugh, W.J.: Linear System Theory. Prentice-Hall, Englewood Cliffs (1993) 7. Chen, D., Mohler, R., Chen, L.: Neural-Network-Based Adaptive Control with Application to Power Systems. In: Proc. American Control Conf., San Diego, pp. 3236–3240 (1999) 8. Chen, D., Mohler, R.: Nonlinear Adaptive Control with Potential FACTS Applications. In: Proc. American Control Conf., San Diego, pp. 1077–1081 (1999) 9. Chen, D., Mohler, R.: The Properties of Latitudinal Neural Networks with Potential Power System Applications. In: Proc. American Control Conf., Philadelphia, pp. 980–984 (1998) 10. Chen, D., Mohler, R., Chen, L.: Synthesis of Neural Controller Applied to Power Systems. IEEE Trans. Circuits and Systems I 47, 376–388 (2000) 11. Chen, D.: Nonlinear Neural Control with Power Systems Applications. Ph.D. Dissertation, Oregon State University (1998) 12. Chen, D., Mohler, R., Shahrestani, S., Hill, D.: Neural-Net-Based Nonlinear Control for Prevention of Voltage Collapse. In: Proc. 38th IEEE Conference on Decision and Control, Phoenix, pp. 2156–2161 (1999) 13. Chen, D., Mohler, R.: Theoretical Aspects on Synthesis of Hierarchical Neural Controllers for Power Systems. In: Proc. 2000 American Control Conference, Chicago, pp. 3432–3436 (2000) 14. Sanner, R., Slotine, J.: Gaussian Networks for Direct Adaptive Control. IEEE Trans. Neural Networks 3, 837–863 (1992) 15. Yesidirek, A., Lewis, F.: Feedback Linearization Using Neural Network. Automatica 31, 1659–1664 (1995) 16. Chen, F., Liu, C.: Adaptively Controlling Nonlinear Continuous-Time Systems Using Multilayer Neural Networks. IEEE Trans. Automatic Control 39, 1306–1310 (1994) 17. Lewis, F., Yesidirek, A., Liu, K.: Neural Net Robot Controller with Guaranteed Tracking Performance. IEEE Trans. Neural Networks 6, 703–715 (1995) 18. Polycarpou, M.: Stable Adaptive Neural Control Scheme for Nonlinear Systems. IEEE Trans. Automatic Control 41, 447–451 (1996) 19. 
Zakrzewski, R.R., Mohler, R.R., Kolodziej, W.J.: Hierarchical Intelligent Control with Flexible AC Transmission System Application. IFAC J. Control Engineering Practice 2, 979–987 (1994) 20. Narendra, K., Mukhopadhyay, S.: Intelligent Control Using Neural Networks. IEEE Control Systems Magazine 12, 11–18 (1992) 21. Chen, D., Yang, J.: Robust Adaptive Neural Control Applied to a Class of Nonlinear Systems. In: Proc. 17th IMACS World Congress: Scientific Computation, Applied Mathematics and Simulation, Paris (2005) T5-I-01-0911 22. Chen, D., Yang, J., Mohler, R.: On Near Optimal Neural Control of a Class of Nonlinear Systems with Multiple Inputs. Neural Computing and Applications 2 (2007)

Three Global Exponential Convergence Results of the GPNN for Solving Generalized Linear Variational Inequalities Xiaolin Hu1 , Zhigang Zeng2 , and Bo Zhang1 1 State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 2 School of Automation, Wuhan University of Technology, Wuhan 430070, China

Abstract. The general projection neural network (GPNN) is a versatile recurrent neural network model capable of solving a variety of optimization problems and variational inequalities. In a recent article [IEEE Trans. Neural Netw., 18(6), 1697-1708, 2007], the linear case of the GPNN was studied extensively from the viewpoint of stability analysis, and it was utilized to solve the generalized linear variational inequality with various types of constraints. In the present paper we supplement three global exponential convergence results for the GPNN for solving these problems. The first one is different from those shown in the original article, and the other two are improved versions of two results in that article. The validity of the new results is demonstrated by numerical examples.

1 Introduction

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 309–318, 2008. © Springer-Verlag Berlin Heidelberg 2008

The following problem is called the generalized linear variational inequality (GLVI): find $x^* \in \Re^m$ such that $Nx^* + q \in X$ and

$$(Mx^* + p)^T (x - Nx^* - q) \ge 0 \quad \forall x \in X, \tag{1}$$

where $M, N \in \Re^{m \times m}$; $p, q \in \Re^m$; and $X$ is a closed convex set in $\Re^m$. It has many scientific and engineering applications, e.g., linear programming and quadratic programming [1], extended linear programming [2] and extended linear-quadratic programming [2, 3]. If $X$ is a box set, i.e.,

$$X = \{x \in \Re^m \mid \underline{x} \le x \le \overline{x}\}, \tag{2}$$

where $\underline{x}$ and $\overline{x}$ are constants (without loss of generality, any component of $\underline{x}$ or $-\overline{x}$ can be $-\infty$), a neurodynamic approach was proposed in [4] and [5] from different viewpoints for solving it. Moreover, in [5], the neurodynamic system was named the general projection neural network (GPNN). A general form of the system is as follows:

$$\frac{dx}{dt} = \lambda W \{-Nx + P_X((N - \alpha M)x + q - \alpha p) - q\}, \tag{3}$$


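where $\lambda \in \Re$, $W \in \Re^{m \times m}$ and $\alpha \in \Re$ are positive constants, and $P_X(x) = (P_{X_1}(x_1), \cdots, P_{X_m}(x_m))^T$ with

$$P_{X_i}(x_i) = \begin{cases} \underline{x}_i, & x_i < \underline{x}_i, \\ x_i, & \underline{x}_i \le x_i \le \overline{x}_i, \\ \overline{x}_i, & x_i > \overline{x}_i. \end{cases} \tag{4}$$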
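The componentwise projection in (4) is a clipping operation, so it can be sketched in a few lines. The snippet below is only an illustration: the function name `project_box`, the bounds and the random test vectors are assumptions made here, not part of the paper. It also evaluates both sides of the non-expansiveness inequality stated in Lemma 1 for one random pair of vectors.

```python
import numpy as np

def project_box(x, lower, upper):
    """Componentwise projection P_X of x onto the box {lower <= x <= upper}, as in (4)."""
    return np.clip(x, lower, upper)

# Hypothetical box; a component of the lower bound may be -inf, as allowed below (2).
lower = np.array([-4.0, 0.0, -np.inf])
upper = np.array([6.0, 6.0, 6.0])

rng = np.random.default_rng(0)
u = rng.normal(scale=10.0, size=3)
v = rng.normal(scale=10.0, size=3)

# Non-expansiveness: ||P_X(u) - P_X(v)|| should not exceed ||u - v||.
gap_projected = np.linalg.norm(project_box(u, lower, upper) - project_box(v, lower, upper))
gap_original = np.linalg.norm(u - v)
print(gap_projected, gap_original)
```

Because the projection acts on each coordinate separately, the same inequality also holds coordinate by coordinate, which is the form used in the proof of Theorem 1.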

Recently, the stability of the above GPNN was studied extensively in [6]. Many global convergence and stability results were presented. In addition, when $X$ in the GLVI (1) is not a box set, but a polyhedral set defined by inequalities and equalities, several specific GPNNs similar to (3) were formulated to solve the corresponding problems. Some particular stability results of those GPNNs were also discussed. In the present paper, we give a few new stability results of the GPNNs, reflecting our up-to-date progress in studying this type of neural network.

Throughout the paper, $\|x\|$ denotes the $l_2$ norm of a vector $x$, $I$ denotes the identity matrix with an appropriate dimension, and $X^*$ stands for the solution set of GLVI (1), which is assumed to be nonempty. In addition, it is assumed that there exists at least one finite point in $X^*$. Define an operator $D^+ f(t) = \limsup_{h \to 0^+} (f(t + h) - f(t))/h$, where $f(t)$ is a function mapping from $\Re$ to $\Re$.

2 Main Results

2.1 Box Set Constraint

First, we give a new stability result of the GPNN (3) for solving the GLVI with the box-type constraint described in (2). A useful lemma is introduced first [5, 4].

Lemma 1. Consider $P_X : \Re^m \to X$ defined in (4). For any $u, v \in \Re^m$, we have $\|P_X(u) - P_X(v)\| \le \|u - v\|$.

Theorem 1. Let $N = \{n_{ij}\}$ and $D = N - \alpha M = \{d_{ij}\}$. If

$$n_{ii} > \sum_{j=1, j \ne i}^{m} |n_{ij}| + \sum_{j=1}^{m} |d_{ij}|, \quad \forall i = 1, \cdots, m, \tag{5}$$

then the GPNN (3) with $W = I$ is globally exponentially stable.

Proof. From (5) there exists $\theta > 0$ such that

$$n_{ii} \ge \sum_{j=1, j \ne i}^{m} |n_{ij}| + \sum_{j=1}^{m} |d_{ij}| + \theta, \quad \forall i = 1, \cdots, m. \tag{6}$$

Let $x^*$ be a finite point in $X^*$, $L(t_0) = \max_{1 \le i \le m} |x_i(t_0) - x_i^*|$ and $z_i(t) = |x_i(t) - x_i^*| - L(t_0) e^{-\lambda\theta(t - t_0)}$. In the following we will show $z_i(t) \le 0$ for any $i = 1, \cdots, m$ and all $t \ge t_0$ by contradiction. Actually, if this is not true, there must exist a sufficiently small $\epsilon > 0$, two time instants $t_1$ and $t_2$ satisfying $t_0 \le t_1 < t_2$, and at least one $k \in \{1, \cdots, m\}$, such that

$$z_k(t_1) = 0, \quad z_k(t_2) = \epsilon, \tag{7}$$

$$D^+ z_k(t_1) \ge 0, \quad D^+ z_k(t_2) > 0, \tag{8}$$

$$z_i(s) \le \epsilon, \quad \forall i = 1, \cdots, m; \; t_0 \le s \le t_2. \tag{9}$$

From (3) we have

$$dx/dt = \lambda\{-N(x - x^*) + P_X(Dx + q - \alpha p) - P_X(Dx^* + q - \alpha p)\} \tag{10}$$

and from Lemma 1 we have

$$|P_{X_i}(d_i x + q_i - \alpha p_i) - P_{X_i}(d_i x^* + q_i - \alpha p_i)| \le |d_i (x - x^*)| \le \sum_{j=1}^{m} |d_{ij}| |x_j - x_j^*|, \quad \forall i = 1, \cdots, m, \tag{11}$$

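where $d_i \in \Re^{1 \times m}$ denotes the $i$th row of $D$. Without loss of generality, we assume $x_k(t_2) - x_k^* > 0$. (The case of $x_k(t_2) - x_k^* < 0$ can be reasoned similarly.) It follows from (7), (9), (10) and (11) that $x_k(t_2) - x_k^* = L(t_0)e^{-\lambda\theta(t_2 - t_0)} + \epsilon$, $|x_i(t_2) - x_i^*| \le L(t_0)e^{-\lambda\theta(t_2 - t_0)} + \epsilon$, $\forall i = 1, \cdots, m$, and

$$\begin{aligned}
D^+ z_k(t_2) &= D^+ |x_k(t_2) - x_k^*| + \lambda\theta L(t_0) e^{-\lambda\theta(t_2 - t_0)} \\
&\le -\lambda n_{kk}(x_k(t_2) - x_k^*) + \lambda \sum_{j=1, j \ne k}^{m} |n_{kj}| |x_j(t_2) - x_j^*| + \lambda \sum_{j=1}^{m} |d_{kj}| |x_j(t_2) - x_j^*| + \lambda\theta L(t_0) e^{-\lambda\theta(t_2 - t_0)} \\
&\le -\lambda n_{kk}\left(L(t_0)e^{-\lambda\theta(t_2 - t_0)} + \epsilon\right) + \lambda \sum_{j=1, j \ne k}^{m} |n_{kj}| \left(L(t_0)e^{-\lambda\theta(t_2 - t_0)} + \epsilon\right) \\
&\quad + \lambda \sum_{j=1}^{m} |d_{kj}| \left(L(t_0)e^{-\lambda\theta(t_2 - t_0)} + \epsilon\right) + \lambda\theta L(t_0) e^{-\lambda\theta(t_2 - t_0)} \\
&= \lambda \left(-n_{kk} + \sum_{j=1, j \ne k}^{m} |n_{kj}| + \sum_{j=1}^{m} |d_{kj}| + \theta\right) L(t_0) e^{-\lambda\theta(t_2 - t_0)} + \lambda \left(-n_{kk} + \sum_{j=1, j \ne k}^{m} |n_{kj}| + \sum_{j=1}^{m} |d_{kj}|\right) \epsilon.
\end{aligned}$$

The first term is nonpositive by (6) and the second term is negative by (5). Hence, in view of (5) and (6), we have $D^+ z_k(t_2) < 0$, which contradicts (8). Therefore,

$$|x_i(t) - x_i^*| \le L(t_0) e^{-\lambda\theta(t - t_0)}, \quad \forall i = 1, \cdots, m; \; t \ge t_0. \tag{12}$$

The proof is completed.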
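Condition (5) is a row-wise dominance test, and the largest admissible $\theta$ in (6) is the smallest row margin. The sketch below computes these margins for a small hypothetical pair of matrices; the helper name `theta_margins` and the numerical data are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def theta_margins(N, M, alpha):
    """Row margins n_ii - sum_{j != i} |n_ij| - sum_j |d_ij| with D = N - alpha * M.

    Condition (5) holds when every margin is positive; the smallest margin
    is then an admissible theta in (6).
    """
    D = N - alpha * M
    off_diagonal = np.abs(N).sum(axis=1) - np.abs(np.diag(N))
    return np.diag(N) - off_diagonal - np.abs(D).sum(axis=1)

# Hypothetical 2 x 2 data, chosen only to illustrate the computation.
N = np.array([[3.0, 0.5],
              [-0.5, 4.0]])
M = np.array([[2.0, 0.0],
              [0.0, 3.5]])

margins = theta_margins(N, M, alpha=1.0)
theta = margins.min()
print(margins, theta)
```

If every margin is positive, (12) guarantees a componentwise decay rate of at least $\lambda\theta$; a nonpositive margin only means that this particular sufficient condition does not apply.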


The above theorem is proved in the spirit of [7]. From the analysis it can be inferred that the convergence rate of (3) is at least $\lambda\theta$, where $\theta$ is the difference between the left and right hand sides of (5). Different from most of the results in [6], the exponential convergence rate here is expressed in terms of every component of the state vector separately, which provides a more detailed estimation than the results obtained by the usual Lyapunov method. In the above proof, if we choose $L(t_0) = \|x(t) - x^*\|_2$, following similar arguments we can arrive at the following condition, which assures global exponential stability as well: the minimum eigenvalue of $(N + N^T)/2$ is greater than $\|D\|$. Interestingly, this is a result stated in Corollary 1 of [6], where a different proof was given.

2.2 General Constraints

Consider the GLVI (1) with $X$ defined as

$$X = \{x \in \Re^m \mid x \in \Omega_x, \; Ax \in \Omega_y, \; Bx = c\}, \tag{13}$$

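where $A \in \Re^{h \times m}$, $B \in \Re^{r \times m}$, $c \in \Re^r$, and $\Omega_x$, $\Omega_y$ are two box sets defined as $\{x \in \Re^m \mid \underline{x} \le x \le \overline{x}\}$ and $\{y \in \Re^h \mid \underline{y} \le y \le \overline{y}\}$, respectively (cf. (2)). Let $\tilde{A} = (A^T, B^T)^T$ and

$$\tilde{M} = \begin{pmatrix} M & -\tilde{A}^T \\ 0 & I \end{pmatrix}, \quad \tilde{p} = \begin{pmatrix} p \\ 0 \end{pmatrix}, \quad \tilde{N} = \begin{pmatrix} N & 0 \\ \tilde{A}N & 0 \end{pmatrix}, \quad \tilde{q} = \begin{pmatrix} q \\ \tilde{A}q \end{pmatrix},$$

$$\tilde{\Omega}_y = \{y \in \Re^{h+r} \mid (\underline{y}^T, c^T)^T \le y \le (\overline{y}^T, c^T)^T\}, \quad \tilde{U} = \Omega_x \times \tilde{\Omega}_y.$$

It was shown in [6] that the GLVI can be converted to another GLVI with a box set $\tilde{U}$ only, and as a result, can be solved by using the following specific GPNN:

$$\frac{du}{dt} = \lambda W \{-\tilde{N}u + P_{\tilde{U}}((\tilde{N} - \alpha\tilde{M})u + \tilde{q} - \alpha\tilde{p}) - \tilde{q}\}, \tag{14}$$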

where $\lambda > 0$, $\alpha > 0$, $W \in \Re^{(m+h+r) \times (m+h+r)}$ are constants, $u = (x^T, y^T)^T$ is the state vector, and $P_{\tilde{U}}(\cdot)$ is the activation function defined similarly as in (4). The output of the neural network is simply $x(t)$, the first part of the state $u(t)$.

In [6], it was proved that when $W = (\tilde{N} + \alpha\tilde{M})^T$, if $M^T N > 0$ then the output trajectory $x(t)$ of the neural network is globally convergent to the unique solution $x^*$ of the problem (1). In the following, we show that if this condition holds, the convergence rate can be exponential by choosing an appropriate scaling factor $\lambda$. The proof is inspired by [8].

Theorem 2. Consider GPNN (14) with $W = (\tilde{N} + \alpha\tilde{M})^T$ for solving the GLVI with $X$ defined in (13). If $M^T N > 0$ and $\lambda$ is large enough, then the output trajectory $x(t)$ of the neural network is globally exponentially convergent to the unique solution of the problem.

Proof. It was shown in [6, Theorem 5] that the solution of the GLVI is unique, which corresponds to the first part of any equilibrium point of (14). Consider the function $V(u(t)) = \|u(t) - u^*\|^2/2$, where $u^*$ is a finite equilibrium point of (14). Following a similar analysis procedure to that of Corollary 4 in [5] we can derive

$$\frac{dV(u(t))}{dt} \le \lambda\{-\alpha(u - u^*)^T \tilde{M}^T \tilde{N} (u - u^*) - \|P_{\tilde{U}}((\tilde{N} - \alpha\tilde{M})u + \tilde{q} - \alpha\tilde{p}) - \tilde{N}u - \tilde{q}\|^2\}.$$

It follows that

$$\frac{dV(u(t))}{dt} \le \lambda\alpha\{-(u - u^*)^T \tilde{M}^T \tilde{N} (u - u^*)\} = \lambda\alpha\{-(x - x^*)^T M^T N (x - x^*)\} \le \lambda\alpha\{-\beta\|x - x^*\|^2\},$$

where $\beta > 0$ denotes the minimum eigenvalue of $(M^T N + N^T M)/2$. Then

$$V(u(t)) \le V(u(t_0)) - \lambda\alpha\beta \int_{t_0}^{t} \|x(s) - x^*\|^2 ds$$

and

$$\|x(t) - x^*\|^2 \le 2V(u(t_0)) - 2\lambda\alpha\beta \int_{t_0}^{t} \|x(s) - x^*\|^2 ds.$$

Without loss of generality it is assumed that $\|x(t_0) - x^*\|^2 > 0$, which implies $V(u(t_0)) > 0$. Then there exist $\tau > 0$ and $\mu > 0$ that depend on $x(t_0)$ only, so that $\int_{t_0}^{t_0+\tau} \|x(s) - x^*\|^2 ds \ge \tau\mu$. If $\lambda$ is large enough so that $\lambda \ge V(u(t_0))/(\alpha\beta\tau\mu)$, we have

$$V(u(t_0)) - \lambda\alpha\beta \int_{t_0}^{t_0+\tau} \|x(s) - x^*\|^2 ds \le 0.$$

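It follows that for any $t > t_1 \ge t_0 + \tau$

$$\begin{aligned}
\|x(t) - x^*\|^2 &\le \|x(t_1) - x^*\|^2 + 2V(u(t_0)) - 2\lambda\alpha\beta \int_{t_0}^{t_1} \|x(s) - x^*\|^2 ds - 2\lambda\alpha\beta \int_{t_1}^{t} \|x(s) - x^*\|^2 ds \\
&\le \|x(t_1) - x^*\|^2 + 2V(u(t_0)) - 2\lambda\alpha\beta \int_{t_0}^{t_0+\tau} \|x(s) - x^*\|^2 ds - 2\lambda\alpha\beta \int_{t_1}^{t} \|x(s) - x^*\|^2 ds \\
&\le \|x(t_1) - x^*\|^2 - 2\lambda\alpha\beta \int_{t_1}^{t} \|x(s) - x^*\|^2 ds.
\end{aligned}$$

As a result,

$$\frac{\|x(t) - x^*\|^2 - \|x(t_1) - x^*\|^2}{t - t_1} \le -2\lambda\alpha\beta \frac{f(t) - f(t_1)}{t - t_1},$$

where $f(t) = \int_{t_1}^{t} \|x(s) - x^*\|^2 ds$. Let $t \to t_1 + 0$, then we have

$$\frac{d\|x(t) - x^*\|^2}{dt} \le -2\lambda\alpha\beta \|x(t) - x^*\|^2.$$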

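Therefore

$$\|x(t) - x^*\| \le \|x(t_1) - x^*\| e^{-\lambda\alpha\beta(t - t_1)} = c_0 e^{-\lambda\alpha\beta(t - t_0)}, \quad \forall t > t_1,$$

where $c_0 = \|x(t_1) - x^*\| e^{\lambda\alpha\beta(t_1 - t_0)}$. Since $dV(u(t))/dt \le 0$, $u(t) \in S = \{u \in \Re^{m+h+r} \mid V(u(t)) \le V(u(t_0))\}$ for all $t \ge t_0$. Moreover, $V(u(t))$ is radially unbounded, so $S$ is bounded, which implies that $\|x(t) - x^*\|$ is bounded over $t \ge t_0$. Let $\Delta = \max_{t_0 \le t \le t_1} \|x(t) - x^*\|$ and $c_1 = \Delta / e^{-\lambda\alpha\beta(t_1 - t_0)}$. We have

$$\|x(t) - x^*\| \le \Delta = c_1 e^{-\lambda\alpha\beta(t_1 - t_0)} \le c_1 e^{-\lambda\alpha\beta(t - t_0)}, \quad \forall t_0 \le t \le t_1.$$

Hence

$$\|x(t) - x^*\| \le c_m e^{-\lambda\alpha\beta(t - t_0)}, \quad \forall t \ge t_0,$$

where $c_m = \max(c_0, c_1)$. The proof is completed.

2.3 Inequality Constraints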

Consider $X$ in (13) with inequality constraints only; i.e.,

$$X = \{x \in \Re^m \mid Ax \in \Omega_y\}, \tag{15}$$

where the notations are the same as in (13). Let

$$\hat{N} = A N M^{-1} A^T, \quad \hat{q} = -A N M^{-1} p + Aq.$$

The following specific GPNN is proposed to solve the problem:

– State equation

$$\frac{du}{dt} = \lambda W \{-\hat{N}u + P_{\Omega_y}((\hat{N} - \alpha I)u + \hat{q}) - \hat{q}\}; \tag{16a}$$

– Output equation

$$v = M^{-1} A^T u - M^{-1} p, \tag{16b}$$

where $\lambda \in \Re$, $\alpha \in \Re$, $\lambda > 0$, $\alpha > 0$ and $W \in \Re^{h \times h}$.

In [6], it was proved that when $W = (\hat{N} + \alpha I)^T$, if $M^T N > 0$ then the output trajectory $v(t)$ of the neural network is globally convergent to the unique solution $x^*$ of the problem (1). In the following, we show that if this condition holds, the convergence rate can be exponential by choosing an appropriate $\lambda$.

Theorem 3. Consider GPNN (16) with $W = (\hat{N} + \alpha I)^T$ for solving the GLVI with $X$ defined in (15). If $M^T N > 0$ and $\lambda$ is large enough, then the output trajectory $v(t)$ of the neural network is globally exponentially convergent to the unique solution of the problem.


Proof. From [6, Theorem 6], the solution of the GLVI is unique, and it is identical to $v^* = M^{-1} A^T u^* - M^{-1} p$, where $u^*$ is any equilibrium point of (16a). Define a function

$$V(u(t)) = \frac{1}{2}\|u(t) - u^*\|^2, \quad t \ge t_0.$$

From (16b), we have

$$\|v - v^*\|^2 = \|M^{-1} A^T (u - u^*)\|^2 \le \|M^{-1} A^T\|^2 \|u - u^*\|^2.$$

Thus $V(u) \ge \dfrac{\|v - v^*\|^2}{2\|M^{-1} A^T\|^2}$. Following a similar analysis to that of Corollary 4 in [5] we can deduce

$$\frac{dV(u(t))}{dt} \le \lambda\{-\alpha(u - u^*)^T \hat{N} (u - u^*) - \|P_{\Omega_y}((\hat{N} - \alpha I)u + \hat{q}) - \hat{N}u - \hat{q}\|^2\}.$$

It follows that

$$\begin{aligned}
\frac{dV(u(t))}{dt} &\le \lambda\alpha\{-(u - u^*)^T A N M^{-1} A^T (u - u^*)\} \\
&= \lambda\alpha\{-[M^{-1} A^T (u - u^*)]^T M^T N [M^{-1} A^T (u - u^*)]\} \\
&= \lambda\alpha\{-(v - v^*)^T M^T N (v - v^*)\} \le \lambda\alpha\{-\beta\|v - v^*\|^2\},
\end{aligned}$$

where $\beta > 0$ denotes the minimum eigenvalue of $(M^T N + N^T M)/2$. Then

$$V(u(t)) \le V(u(t_0)) - \lambda\alpha\beta \int_{t_0}^{t} \|v(s) - v^*\|^2 ds$$

and

$$\|v(t) - v^*\|^2 \le 2\gamma V(u(t_0)) - 2\lambda\alpha\beta\gamma \int_{t_0}^{t} \|v(s) - v^*\|^2 ds,$$

where $\gamma = \|M^{-1} A^T\|^2$. The rest of the proof is similar to the latter part of the analysis of Theorem 2, and is omitted for brevity.
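Both proofs rest on an algebraic identity that reduces a quadratic form in the network state to a quadratic form in $M^T N$. The sketch below checks the two identities numerically on random data. It assumes the block forms $\tilde{M} = [M, -\tilde{A}^T; 0, I]$ and $\tilde{N} = [N, 0; \tilde{A}N, 0]$ for Section 2.2 (our reading of the definitions taken from [6]) and $\hat{N} = A N M^{-1} A^T$ for Section 2.3; the dimensions, the seed and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, h, r = 3, 2, 1

# Random test data; M is shifted by 4*I only to keep it well conditioned.
M = rng.normal(size=(m, m)) + 4.0 * np.eye(m)
N = rng.normal(size=(m, m))
A = rng.normal(size=(h, m))
B = rng.normal(size=(r, m))
A_tilde = np.vstack([A, B])          # A~ = (A^T, B^T)^T
k = h + r

# Assumed block structure of the augmented matrices in Section 2.2.
M_tilde = np.block([[M, -A_tilde.T],
                    [np.zeros((k, m)), np.eye(k)]])
N_tilde = np.block([[N, np.zeros((m, k))],
                    [A_tilde @ N, np.zeros((k, k))]])

# Identity behind Theorem 2: M~^T N~ has M^T N in its top-left block and zeros elsewhere.
product = M_tilde.T @ N_tilde
expected = np.zeros((m + k, m + k))
expected[:m, :m] = M.T @ N

# Identity behind Theorem 3: w^T N^ w = z^T (M^T N) z with z = M^{-1} A^T w.
N_hat = A @ N @ np.linalg.inv(M) @ A.T
w = rng.normal(size=h)
z = np.linalg.solve(M, A.T @ w)
lhs = w @ N_hat @ w
rhs = z @ (M.T @ N) @ z
print(np.allclose(product, expected), np.isclose(lhs, rhs))
```

The first identity explains why only the $x$-part of $u - u^*$ survives in the bound on $dV/dt$ in the proof of Theorem 2, and the second explains the substitution of $v - v^*$ in the proof of Theorem 3.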

3 Illustrative Examples

Example 1. Let us first solve a GLVI (1) with a box set, where

$$M = \begin{pmatrix} 4 & 2 & -1 \\ 0 & 3 & 0 \\ -1 & 3 & 6 \end{pmatrix}, \quad N = \begin{pmatrix} 5 & 2 & -1 \\ 1 & 5 & 0 \\ -1 & 3 & 8 \end{pmatrix}, \quad p = \begin{pmatrix} -1 \\ 2 \\ 5 \end{pmatrix}, \quad q = \begin{pmatrix} 0 \\ 2 \\ 0 \end{pmatrix},$$

and $X = \{x \in \Re^3 \mid (-4, 0, -4)^T \le x \le (6, 6, 6)^T\}$. Let $\alpha = 1$; it is easy to verify that the condition in Theorem 1 is satisfied. Actually, $n_{11} - |n_{12}| - |n_{13}| - \sum_{j=1}^{3} |d_{1j}| = 1$, $n_{22} - |n_{21}| - |n_{23}| - \sum_{j=1}^{3} |d_{2j}| = 1$, and $n_{33} - |n_{31}| - |n_{32}| - \sum_{j=1}^{3} |d_{3j}| = 2$. Then the GPNN (3) is globally exponentially stable. All numerical simulations validated this conclusion. Fig. 1 demonstrates the state trajectories started from the initial point $x(0) = (10, 6, -5)^T$ with $\lambda = 1$ ($t_0$ is set to 0), which converge to the unique solution of the problem $x^* = (0.4265, -0.4853, -0.2647)^T$. To show their exponential convergence rates, we take the natural logarithm of both sides of (12),

$$\ln|x_i(t) - x_i^*| \le \ln L(t_0) - \lambda\theta t, \quad \forall i = 1, \cdots, 3; \; t \ge 0,$$

and depict both sides of the above inequality in Fig. 2. (It is evident that $\theta$ can be chosen as $\theta = 1$.) The right-hand-side quantity now becomes a straight line in the figure. It is seen that the errors of the states are all upper bounded by this line.

Fig. 1. State trajectories $x_1(t)$, $x_2(t)$, $x_3(t)$ of the GPNN (3) in Example 1 with $W = I$, $\lambda = \alpha = 1$ and $x(0) = (10, 6, -5)^T$ (horizontal axis: time unit $t$ from 0 to 5; vertical axis: states).

Fig. 2. Solution error $\ln|x_i(t) - x_i^*|$, $i = 1, 2, 3$, of the GPNN (3) in Example 1 (horizontal axis: time unit $t$ from 0 to 5). The estimated upper bound (dashed line) is also plotted.
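The experiment of Example 1 can be reproduced approximately with a forward-Euler discretization of (3). The sketch below is a minimal illustration under stated assumptions: the integration scheme, the step size `dt`, the horizon `t_end` and all function names are choices made here and are not taken from the paper, whose simulation setup is not described. The matrices, bounds, initial point and the reported solution are those of Example 1.

```python
import numpy as np

# Data of Example 1.
M = np.array([[4.0, 2.0, -1.0], [0.0, 3.0, 0.0], [-1.0, 3.0, 6.0]])
N = np.array([[5.0, 2.0, -1.0], [1.0, 5.0, 0.0], [-1.0, 3.0, 8.0]])
p = np.array([-1.0, 2.0, 5.0])
q = np.array([0.0, 2.0, 0.0])
lower = np.array([-4.0, 0.0, -4.0])
upper = np.array([6.0, 6.0, 6.0])
alpha, lam = 1.0, 1.0
D = N - alpha * M
x_star = np.array([0.4265, -0.4853, -0.2647])   # solution reported in the text (4 decimals)

def gpnn_rhs(x):
    """Right-hand side of (3) with W = I."""
    return lam * (-N @ x + np.clip(D @ x + q - alpha * p, lower, upper) - q)

def simulate(x0, t_end=15.0, dt=1e-3):
    """Forward-Euler integration (an assumed scheme); returns times and states."""
    steps = int(round(t_end / dt))
    xs = np.empty((steps + 1, x0.size))
    xs[0] = x0
    for k in range(steps):
        xs[k + 1] = xs[k] + dt * gpnn_rhs(xs[k])
    return np.linspace(0.0, steps * dt, steps + 1), xs

# Row margins of condition (5); the smallest one is theta.
margins = np.diag(N) - (np.abs(N).sum(axis=1) - np.abs(np.diag(N))) - np.abs(D).sum(axis=1)
theta = margins.min()

t, xs = simulate(np.array([10.0, 6.0, -5.0]))
errors = np.abs(xs - x_star)                          # |x_i(t) - x_i^*| per component
bound = errors[0].max() * np.exp(-lam * theta * t)    # L(t0) * exp(-lambda * theta * t)
print(margins, xs[-1])
```

Plotting `np.log(errors)` against `np.log(bound)` for $0 \le t \le 5$ gives a picture of the same kind as Fig. 2. Because `x_star` is rounded to four decimals, the computed error levels off near $10^{-5}$ instead of decaying indefinitely.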

Fig. 3. Output trajectories $v_1(t)$, $v_2(t)$, $v_3(t)$ of the GPNN (16) in Example 2 with $W = (\hat{N} + \alpha\hat{M})^T$, $\lambda = \alpha = 1$ and ten random initial points (horizontal axis: time unit $t$ from 0 to 1; vertical axis: outputs).

Fig. 4. Solution error $\ln\|v(t) - x^*\|$ of the GPNN (16) in Example 2 (horizontal axis: time unit $t$ from 0 to 1). Because of numerical errors in simulations, when $\ln\|v(t) - x^*\| \le -8$, the trajectories become unstable, and thus are not shown here.

Example 2. Consider a GLVI with a polyhedron set X defined in (15). Let M=



    1 1 0 1 −1 −1 −1 0  , N = 0 −1 0  , p = −1 , q = 2 , A = −5 , 5 −1

1 −1 −1 −1 1 0 0 1 −1

0 3 −1

2

0

and $\Omega_y = \{y \in \Re^2 \mid -10 \le y \le 10\}$. It can be verified that $M^T N > 0$. The GPNN (16) with $W = (\hat{N} + \alpha\hat{M})^T$ can be used to solve the problem according to Theorem 3. Simulation results showed that from any initial point this neural network globally converges to the unique equilibrium point $u^* = (-0.0074, -0.7556)^T$. Then, the solution of the GLVI is calculated as $x^* = (-0.4444, -3.2296, -1.9852)^T$. Fig. 3 displays the output trajectories of the neural network with $\lambda = \alpha = 1$ and 10 different initial points, and Fig. 4 displays the solution error (in natural logarithm) along these trajectories. It is seen that for any of the 10 curves in Fig. 4 there exists a straight line with negative slope above it; that is, the solution error is upper bounded by an exponential function of $t$ which tends to zero as $t \to \infty$.
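The mechanics of GPNN (16) can be illustrated on a problem small enough to solve by hand. The sketch below does not use the data of Example 2; it uses a hypothetical toy problem with $M = N = I$, $p = (-2, -2)^T$, $q = 0$, $A = (1, 1)$ and $\Omega_y = [-1, 1]$, for which the GLVI solution is the projection of $q - p = (2, 2)^T$ onto $\{x \mid -1 \le x_1 + x_2 \le 1\}$, namely $(0.5, 0.5)^T$. The forward-Euler scheme, the step size and the names are likewise assumptions.

```python
import numpy as np

# Hypothetical toy data (NOT the data of Example 2).
M = np.eye(2)
N = np.eye(2)
p = np.array([-2.0, -2.0])
q = np.zeros(2)
A = np.array([[1.0, 1.0]])
y_low, y_high = np.array([-1.0]), np.array([1.0])
alpha, lam = 1.0, 1.0

M_inv = np.linalg.inv(M)
N_hat = A @ N @ M_inv @ A.T                  # N^ = A N M^{-1} A^T
q_hat = -A @ N @ M_inv @ p + A @ q           # q^ = -A N M^{-1} p + A q
identity = np.eye(A.shape[0])
W = (N_hat + alpha * identity).T             # choice of W used in Theorem 3

def state_rhs(u):
    """Right-hand side of the state equation (16a)."""
    projected = np.clip((N_hat - alpha * identity) @ u + q_hat, y_low, y_high)
    return lam * W @ (-N_hat @ u + projected - q_hat)

def output(u):
    """Output equation (16b)."""
    return M_inv @ A.T @ u - M_inv @ p

# Forward-Euler integration from u(0) = 0 (an assumed scheme and step size).
u = np.zeros(1)
dt = 1e-3
for _ in range(5000):
    u = u + dt * state_rhs(u)
v = output(u)
print(u, v)
```

For this toy problem the state equation reduces to $du/dt = -6(u + 1.5)$ while $u + 4 \ge 1$, so $u$ tends to $-1.5$ and the output tends to $(0.5, 0.5)^T$, which matches the hand-computed solution.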

4 Concluding Remarks

The general projection neural network (GPNN) has attracted much attention in recent years. The paper presents three sets of global exponential convergence conditions for it, which extend our recent results to some extent. Numerical examples illustrate the correctness of these new results.

Acknowledgments. The work was supported by the National Natural Science Foundation of China under the grant No. 60621062 and 60605003, the National Key Foundation R&D Projects under the grant No. 2003CB317007, 2004CB318108 and 2007CB311003, and the Basic Research Foundation of Tsinghua National Laboratory for Information Science and Technology (TNList).

References

1. Hu, X., Wang, J.: Design of General Projection Neural Networks for Solving Monotone Linear Variational Inequalities and Linear and Quadratic Optimization Problems. IEEE Trans. Syst., Man, Cybern. B 37, 1414–1421 (2007)
2. He, B.: Solution and Applications of a Class of General Linear Variational Inequalities. Sci. China Ser. A-Math. 39, 395–404 (1996)
3. Hu, X.: Applications of the General Projection Neural Network in Solving Extended Linear-Quadratic Programming Problems with Linear Constraints. Neurocomputing (accepted)
4. Gao, X.B.: A Neural Network for a Class of Extended Linear Variational Inequalities. Chinese Journal of Electronics 10, 471–475 (2001)
5. Xia, Y., Wang, J.: A General Projection Neural Network for Solving Monotone Variational Inequalities and Related Optimization Problems. IEEE Trans. Neural Netw. 15, 318–328 (2004)
6. Hu, X., Wang, J.: Solving Generally Constrained Generalized Linear Variational Inequalities Using the General Projection Neural Networks. IEEE Trans. Neural Netw. 18, 1697–1708 (2007)
7. Zeng, Z., Wang, J., Liao, X.: Global Exponential Stability of a General Class of Recurrent Neural Networks with Time-Varying Delays. IEEE Trans. Circuits Syst. I 50, 1353–1358 (2003)
8. Xia, Y., Feng, G., Kamel, M.: Development and Analysis of a Neural Dynamical Approach to Nonlinear Programming Problems. IEEE Trans. Automatic Control 52, 2154–2159 (2007)

Disturbance Attenuating Controller Design for a Class of Nonlinear Systems with Unknown Time-Delay Geng Ji School of Mathematics and Information Engineering, Taizhou University, Linhai 317000, P.R. China [email protected]

Abstract. An adaptive neural network control design approach is proposed for a class of nonlinear systems with unknown time delay. By constructing a proper Lyapunov-Krasovskii functional, the uncertainty of the unknown time delay is compensated. In addition, the semiglobally input-to-state practically stable (ISpS) disturbance attenuation problem is solved by a neural network technique. The feasibility of neural network approximation of unknown system functions is guaranteed over a practical compact set. Finally, a numerical simulation is given to show the effectiveness of the approach.

Keywords: Disturbance attenuation, input-to-state practically stable, adaptive neural network control, nonlinear time-delay systems.

1 Introduction

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 319–329, 2008. © Springer-Verlag Berlin Heidelberg 2008

Control of nonlinear systems has received much attention, and many analysis techniques and design methodologies have been developed [1-2]. In [2], it is shown from the geometric theory of nonlinear systems that, under certain assumptions, a nonlinear system can be decomposed into two cascaded systems: one is a nonlinear system and the other is a linearizable system. A class of systems with this structure, called minimum phase nonlinear systems, has been discussed extensively [3-4].

In the past few years, disturbance attenuation and almost disturbance decoupling problems have been extensively studied for uncertain nonlinear dynamic systems. Many interesting results in this area have been obtained [5-7]. In [5], disturbance attenuation was studied for a class of nonlinear systems which only contain uncertain disturbance. The problem of almost disturbance decoupling was considered in [7]. In these works, the systems contain either known functions or bounded unknown parameters, and the robust control method was used. Nevertheless, when the systems contain unknown functions, they are hard to handle with the robust method.

In recent years, adaptive neural control schemes have been found to be useful for the control of nonlinear uncertain systems with unknown smooth functions, and many significant developments have been achieved [8-10]. Direct adaptive neural network control was presented for a class of affine nonlinear systems in the strict-feedback form with unknown nonlinearities by Ge and Wang [9]. The problem of semiglobally ISpS disturbance attenuation was investigated for a class of uncertain nonlinear minimum-phase systems in [10]. However, all these works study nonlinear systems without time delay.

It is well known that time delays often appear in practical systems. In general, the existence of time delays degrades the control performance and sometimes makes the closed-loop stabilization difficult, especially for nonlinear systems. Stabilization of nonlinear systems with time delay has received considerable attention, and many approaches to this issue have been developed (see [11-13]). In [11], adaptive neural control was presented for a class of strict-feedback nonlinear time-delay systems. The unknown time delays were compensated for through the use of appropriate Lyapunov-Krasovskii functionals.

In this paper, we discuss nonlinear systems with unknown time delay. Motivated by references [10,11], we use the adaptive neural network control method to study the problem of semiglobally ISpS disturbance attenuation for a class of minimum phase nonlinear systems with some structural uncertainties which cannot be directly handled by the existing robust control design methods. An appropriate Lyapunov-Krasovskii functional is used to construct a Lyapunov function candidate such that the uncertainty from the unknown time delay is removed.

The paper is organized as follows. The problem description and preliminary results are given in Section 2. In Section 3, an adaptive neural network control design scheme is presented. A simulation result is shown in Section 4. Finally, the conclusion is given in Section 5.

2 Problem Description and Preliminaries

Consider an uncertain nonlinear system with time delay

$$\begin{aligned}
\dot{x}(t) &= f(x(t), \xi(t)) \\
\dot{\xi}(t) &= u(t) + d_1(x(t), \xi(t)) + d_2(\xi(t - \tau)) + p(x(t), \xi(t))\,\omega \\
y(t) &= h(x(t), \xi(t))
\end{aligned} \tag{1}$$

where $x \in R^{n-1}$ and $\xi \in R$ are the state variables, $u \in R$ is the control input, $y \in R^p$ is the regulated output, and $\omega \in R^q$ is an exogenous input (reference and/or noise). The function vectors $f(\cdot, \cdot)$, $d_1(\cdot, \cdot)$ and $h(\cdot, \cdot)$ are unknown smooth function vectors satisfying $f(0, 0) = 0$, $d_1(0, 0) = 0$ and $h(0, 0) = 0$. $d_2(\cdot)$ is an unknown smooth function satisfying Assumption 2.2. $p(\cdot, \cdot)$ is a known smooth matrix with proper dimension. $\tau$ is an unknown time delay, which is bounded by a known constant, i.e. $\tau \le \tau_{\max}$. Since $f(x, \xi)$ and $h(x, \xi)$ are smooth with $f(0, 0) = 0$, $h(0, 0) = 0$, they can be decomposed into

$$f(x, \xi) = f_0(x) + f_1(x, \xi)\xi, \quad h(x, \xi) = h_0(x) + h_1(x, \xi)\xi \tag{2}$$

where $f_0(x) = f(x, 0)$, $h_0(x) = h(x, 0)$ with $f_0(0) = 0$, $h_0(0) = 0$.


The following assumptions for system (1), which will be used throughout the paper, are proposed.

Assumption 2.1: For the $x$-subsystem of (1), there exist a radially unbounded positive definite differentiable function $\psi(x)$ and a positive constant $\alpha$ satisfying

$$\frac{\partial \psi(x)}{\partial x} f_0(x) \le -\alpha\psi(x), \quad \forall x \in R^{n-1} \tag{3}$$

This assumption states that the $x$-subsystem is asymptotically stable with respect to $\xi \equiv 0$. $\psi(x)$ is a radially unbounded positive definite differentiable function which need not be known.

Assumption 2.2: The unknown smooth function $d_2(\xi(t))$ satisfies the inequality $|d_2(\xi(t))| \le |\xi(t)|\,\rho(\xi(t))$, where $\rho(\xi(t))$ is a known smooth function.

Assumption 2.3: Define $Z = [x^T, \xi^T]^T \in \Omega_Z \subset R^n$ with $\Omega_Z$ a known compact set.

Firstly, we discuss the problem of disturbance attenuation for system (1) under the assumption that $f(\cdot, \cdot)$, $d_1(\cdot, \cdot)$ and $h(\cdot, \cdot)$ are known. To obtain good performance with asymptotic stability, a controller is designed to stabilize the system and to guarantee that the closed-loop system has good $L_2$ performance for any given constant $\gamma > 0$, i.e.

$$\int_0^t y(\tau)^T y(\tau)\,d\tau \le \gamma^2 \int_0^t \omega(\tau)^T \omega(\tau)\,d\tau + \beta(x_0, \xi_0)$$

where $t \ge 0$ and $\beta : R^{n-1} \times R \to R$ with $\beta(x_0, \xi_0) \ge 0$. In Assumption 2.1, since $\psi(x)$ is radially unbounded and positive definite, denoting $s = \psi(x)$, there exists a class $K$-function $k(\cdot)$ such that

$$h_0^T(x) h_0(x) \le k(s) \tag{4}$$

Based on the function $k(s)$, a storage function $V_0(s)$ can be constructed [15]

$$V_0(s) = s \sup_{0 \le t \le 1} \frac{dk(t)}{dt} + \int_s^{2s} k(t)\,dt \tag{5}$$

and it satisfies the following inequalities

$$V_0(s) \ge k(s), \quad s\,\frac{d}{ds} V_0(s) \ge V_0(s) \tag{6}$$

Define the following Lyapunov function candidate

$$V = V_0(s) + \frac{1}{2}\xi^T\xi + \frac{1}{2}\int_{t-\tau}^{t} \xi^2(\tau)\rho^2(\xi(\tau))\,d\tau \tag{7}$$

The derivative of $V$ along the trajectory of (1) is

$$\dot{V} = \frac{dV_0(s)}{ds}\frac{\partial\psi}{\partial x}\left[f_0(x) + f_1(x, \xi)\xi\right] + \xi^T\dot{\xi} + \frac{1}{2}\xi^2(t)\rho^2(\xi(t)) - \frac{1}{2}\xi^2(t - \tau)\rho^2(\xi(t - \tau)) \tag{8}$$


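In view of Assumption 2.1, (9) is easily obtained from (1), (6) and (8):

$$\dot{V} \le -\alpha V_0(s) + \frac{dV_0(s)}{ds}\frac{\partial\psi}{\partial x} f_1 \xi + \xi^T[u + d_1 + p\omega] + \xi^T d_2(\xi(t - \tau)) + \frac{1}{2}\xi^2(t)\rho^2(\xi(t)) - \frac{1}{2}\xi^2(t - \tau)\rho^2(\xi(t - \tau)) \tag{9}$$

Applying Assumption 2.2, we have

$$\dot{V} \le -\alpha V_0(s) + \frac{dV_0(s)}{ds}\frac{\partial\psi}{\partial x} f_1 \xi + \xi^T[u + d_1 + p\omega] + \frac{1}{2}\xi^2(t)\rho^2(\xi(t)) + |\xi| \cdot |\xi(t - \tau)| \cdot \rho(\xi(t - \tau)) - \frac{1}{2}\xi^2(t - \tau)\rho^2(\xi(t - \tau)) \tag{10}$$

By using Young's inequality,

$$|\xi| \cdot |\xi(t - \tau)| \cdot \rho(\xi(t - \tau)) \le \frac{1}{2}\xi^2 + \frac{1}{2}\xi^2(t - \tau)\rho^2(\xi(t - \tau)) \tag{11}$$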
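For readers who want the intermediate step, inequality (11) is the elementary bound $ab \le \frac{1}{2}a^2 + \frac{1}{2}b^2$. The shorthand $a$ and $b$ below is introduced here only for this illustration and is not notation from the paper.

```latex
% Shorthand (assumed here): a = |\xi(t)|, \quad b = |\xi(t-\tau)|\,\rho(\xi(t-\tau)).
\begin{aligned}
0 \le \tfrac{1}{2}(a - b)^2 &= \tfrac{1}{2}a^2 - ab + \tfrac{1}{2}b^2
\quad\Longrightarrow\quad ab \le \tfrac{1}{2}a^2 + \tfrac{1}{2}b^2, \\
|\xi| \cdot |\xi(t-\tau)| \cdot \rho(\xi(t-\tau)) &\le \tfrac{1}{2}\xi^2 + \tfrac{1}{2}\xi^2(t-\tau)\,\rho^2(\xi(t-\tau)).
\end{aligned}
```

The delayed term on the right is exactly the one with a negative sign in (10), which is why the Lyapunov-Krasovskii integral in (7) removes the dependence on the unknown delay.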

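Substituting (11) into (10), (10) becomes

$$\dot{V} \le -\alpha V_0(s) + \frac{dV_0(s)}{ds}\frac{\partial\psi}{\partial x} f_1 \xi + \xi^T[u + d_1 + p\omega] + \frac{1}{2}\xi^2 + \frac{1}{2}\xi^2(t)\rho^2(\xi(t)) \tag{12}$$

Now, choosing $\alpha_0$ with $0 < \alpha_0 < \alpha$ and considering (2), (4), (6) and (12) yields

$$\dot{V} + \alpha_0 y^T y - \alpha_0\gamma^2\omega^T\omega \le -(\alpha - \alpha_0)V_0(s) - \alpha_0\gamma^2\omega^T\omega + \xi^T p\omega + \xi^T[u + \varphi(Z)] + \frac{1}{2}\xi^2 + \frac{1}{2}\xi^2(t)\rho^2(\xi(t)) \tag{13}$$

where

$$\varphi(Z) = \left[\frac{dV_0(s)}{ds}\frac{\partial\psi}{\partial x} f_1\right]^T + d_1 + 2\alpha_0 h_1^T h_0 + \alpha_0 h_1^T h_1 \xi$$

with $Z = [x^T, \xi^T]^T \in R^n$. We use the following inequality:

$$-\alpha_0\gamma^2\omega^T\omega + \xi^T p\omega \le \frac{1}{4\alpha_0\gamma^2}\xi^T p p^T \xi \tag{14}$$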
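Inequality (14) can be checked by completing the square in $\omega$. The derivation below is a sketch added for illustration; it uses only the symbols already appearing in (14).

```latex
\begin{aligned}
-\alpha_0\gamma^2\,\omega^T\omega + \xi^T p\,\omega
&= -\alpha_0\gamma^2 \left\| \omega - \frac{p^T\xi}{2\alpha_0\gamma^2} \right\|^2
   + \frac{1}{4\alpha_0\gamma^2}\,\xi^T p\,p^T \xi \\
&\le \frac{1}{4\alpha_0\gamma^2}\,\xi^T p\,p^T \xi .
\end{aligned}
```

Equality holds when $\omega = p^T\xi/(2\alpha_0\gamma^2)$, so the bound is tight for the worst-case disturbance.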

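Select the following controller $u$:

$$u = -c\xi - \varphi(Z) - \frac{1}{4\alpha_0\gamma^2} p p^T \xi - \frac{1}{2}\xi - \frac{1}{2}\xi\rho^2 \tag{15}$$

where $c > 0$ is a constant scalar to be designed. Consequently, letting $c \ge \frac{1}{2}(\alpha - \alpha_0)$, we have

$$\dot{V} + \alpha_0 y^T y - \alpha_0\gamma^2\omega^T\omega \le -(\alpha - \alpha_0)\left[V_0(s) + \frac{1}{2}\xi^T\xi\right] \le 0 \tag{16}$$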


Therefore, by integrating both sides of (16) from 0 to $t$, and considering that $V(x, \xi) \ge 0$, it is easily obtained that

$$\int_0^t y^T y\,d\tau \le \gamma^2 \int_0^t \omega^T\omega\,d\tau + \frac{1}{\alpha_0} V(x_0, \xi_0), \quad \forall t \ge 0 \tag{17}$$

Then, the following result is obtained.

Theorem 2.1: Consider the nonlinear time-delay system (1) without uncertainty (that is, with $f$, $d_1$ and $h$ known), satisfying Assumption 2.1 and Assumption 2.2 with $\psi(x)$ being known. For any $\gamma > 0$, given any initial state $[x(0)^T, \xi(0)^T]^T \in R^n$ and scalar $\alpha_0$ with $0 < \alpha_0 < \alpha$, there exists a feedback controller given by (15), which solves the problem of disturbance attenuation with global asymptotic stability.

Remark 1: In the above analysis, we obtain Theorem 2.1 under the assumption that $f(\cdot, \cdot)$, $d_1(\cdot, \cdot)$ and $h(\cdot, \cdot)$ are known. If $f(\cdot, \cdot)$, $d_1(\cdot, \cdot)$ and $h(\cdot, \cdot)$ are unknown smooth functions, the problem of disturbance attenuation cannot be easily solved. This motivates us to seek a new design method. A natural way to do this is to use neural networks, with which the unknown nonlinear functions in the system can be approximated.

Following [14], we have

Definition 1: The system $\dot{x}(t) = f(x, u)$ is input-to-state practically stable (ISpS) if there exist a function $R_1$ of class $KL$, a function $R_2$ of class $K$ and a nonnegative constant $\delta$ such that, for any initial condition $x(0)$ and each measurable essentially bounded control $u(t)$ defined for all $t \ge 0$, the associated solution $x(t)$ exists for all $t \ge 0$ and satisfies

$$|x(t)| \le R_1(|x(0)|, t) + R_2(\|u_t\|) + \delta \tag{18}$$

where $u_t$ is the truncated function of $u$ at $t$ and $\|\cdot\|$ stands for the $L_\infty$ supremum norm.

Definition 2: A $C^1$ function $V$ is said to be an exp-ISpS Lyapunov function for system $\dot{x}(t) = f(x, u)$ if

(1) there exist functions $\alpha_1$, $\alpha_2$ of class $K_\infty$ such that

$$\alpha_1(|x|) \le V(x) \le \alpha_2(|x|), \quad \forall x \in R^n \tag{19}$$

(2) there exist two constants $k > 0$, $\delta \ge 0$ and a class $K_\infty$-function $R_3$ such that

$$\frac{\partial V}{\partial x} f(x, u) \le -kV(x) + R_3(|u|) + \delta \tag{20}$$

Proposition 1: For any control system $\dot{x}(t) = f(x, u)$, the following properties are equivalent: (i) It is ISpS. (ii) It has an exp-ISpS Lyapunov function.

3 Adaptive Neural Control Design

In this section, we assume that $f(\cdot,\cdot)$, $d_1(\cdot,\cdot)$ and $h(\cdot,\cdot)$ are unknown. Thus, the problem of disturbance attenuation cannot be handled directly by the controller (15). An adaptive neural method is proposed for system (1), and the main result is obtained below. For the practical controller design that follows, let us define the sets $\Omega_{c_\xi} \subset \Omega_Z$ and $\Omega_{Z_0}$ as follows:

$$\Omega_{c_\xi} := \{\xi \mid |\xi| < c_\xi\}, \quad \Omega_{Z_0} := \Omega_Z - \Omega_{c_\xi} \tag{21}$$

where $c_\xi$ is a constant that can be chosen arbitrarily small and "$-$" in (21) denotes the complement of set $B$ in set $A$, i.e., $A - B := \{x \mid x \in A \text{ and } x \notin B\}$. As the controller (16) contains the uncertain vector field $\varphi(Z)$, we employ RBF neural networks to approximate it. According to the main result stated in [16], any real-valued continuous function can be approximated arbitrarily closely by a network of RBF type over a compact set. The compactness of the set $\Omega_{Z_0}$ is required to guarantee the feasibility of the neural network approximation, and it is established by the following lemma.

Lemma [11]: The set $\Omega_{Z_0}$ is compact.

Based on the above lemma, given any $\varepsilon > 0$, by appropriately choosing $\mu_i \in R^n$, $i = 1, 2, \cdots, l$, for some sufficiently large integer $l$, the function $\varphi(Z)$ can be approximated by an RBF neural network on the compact set $\Omega_{Z_0}$, i.e.,

$$\varphi(Z) = W^{*T} S(Z) + \varepsilon^* \tag{22}$$
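As an illustration of the approximator in (22), the following minimal sketch evaluates $W^T S(Z)$ for a scalar output. The Gaussian form of the basis functions, $S_i(Z) = \exp(-\|Z - \mu_i\|^2 / \eta^2)$, is an assumption (the text above does not state the basis explicitly), and the function names, centers and weights are hypothetical.

```python
import math

def rbf_vector(z, centers, width):
    """Basis vector S(Z); the Gaussian form used here is an assumption."""
    return [
        math.exp(-sum((zi - ci) ** 2 for zi, ci in zip(z, c)) / width ** 2)
        for c in centers
    ]

def rbf_output(weights, z, centers, width):
    """Scalar network output W^T S(Z)."""
    s = rbf_vector(z, centers, width)
    return sum(w * si for w, si in zip(weights, s))

# Hypothetical two-dimensional example with three centers.
centers = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
weights = [0.5, -1.0, 0.25]
value = rbf_output(weights, (0.0, 0.0), centers, width=2.0)
```

In an adaptive scheme the weight vector would be the estimate $\hat{W}$ rather than a fixed list.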

where $W^*$ is the vector of ideal constant weights, and $|\varepsilon^*| \le \varepsilon$ is the approximation error with constant $\varepsilon > 0$. Consequently, the ideal controller $u^*$ is given by

$$u^* = -c\xi - W^{*T} S(Z) - \varepsilon^* - \frac{1}{4\alpha_0\gamma^2} p p^T \xi - \frac{1}{2}\xi - \frac{1}{2}\xi\rho^2$$

Since $W^*$ is unknown, let $\hat{W}$ be the estimate of $W^*$. Choose the practical controller as

$$u = -c\xi - \hat{W}^T S(Z) - \frac{1}{4\alpha_0\gamma^2} p p^T \xi - \frac{1}{2}\xi - \frac{1}{2}\xi\rho^2 \tag{23}$$

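Consider the following Lyapunov function candidate

$$V = V_0(s) + \frac{1}{2}\xi^T\xi + \frac{1}{2}\tilde{W}^T\Gamma^{-1}\tilde{W} + \frac{1}{2}\int_{t-\tau}^{t} \xi^2(\tau)\rho^2(\xi(\tau))\,d\tau \tag{24}$$

where $\Gamma = \Gamma^T > 0$ is an adaptation gain matrix and $\tilde{W} = \hat{W} - W^*$.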


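In light of Assumption 2.1 and Assumption 2.2, and referring to (12), (13) and (14), it is easily obtained that

$$\begin{aligned}
\dot{V} + \alpha_0 y^T y - \alpha_0\gamma^2\omega^T\omega
&\le -(\alpha - \alpha_0)V_0(s) + \frac{1}{4\alpha_0\gamma^2}\xi^T p p^T \xi + \frac{1}{2}\xi^2 + \frac{1}{2}\xi^2(t)\rho^2(\xi(t)) + \xi^T(u + \varphi(Z)) + \tilde{W}^T\Gamma^{-1}\dot{\hat{W}} \\
&= -(\alpha - \alpha_0)V_0(s) - c\xi^T\xi + \xi^T\varepsilon^* + \tilde{W}^T\Gamma^{-1}\left(\dot{\hat{W}} - \Gamma S(Z)\xi\right)
\end{aligned} \tag{25}$$

Consider the following adaptation law

$$\dot{\hat{W}} = \Gamma\left[S(Z)\xi - \sigma\hat{W}\right] \tag{26}$$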
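For a numerical simulation, the adaptation law (26) has to be integrated in time. The sketch below shows one explicit-Euler step under two assumptions that are not part of the paper: the gain is taken as $\Gamma = \texttt{gamma} \cdot I$, and the step size `dt` and the function name are hypothetical.

```python
def adaptation_step(w_hat, s, xi, gamma, sigma, dt):
    """One explicit-Euler step of dW/dt = Gamma [S(Z) xi - sigma W],
    with Gamma = gamma * I (a simplifying assumption)."""
    return [w + dt * gamma * (si * xi - sigma * w) for w, si in zip(w_hat, s)]

# Starting from zero weights, the sigma term vanishes in the first step.
w_next = adaptation_step([0.0, 0.0], [1.0, 0.5], xi=2.0, gamma=2.0, sigma=0.2, dt=0.1)
```

The $-\sigma\hat{W}$ term pulls the estimate back toward zero, which is what keeps the weights bounded in the analysis that follows.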

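where $\sigma > 0$ is a small constant. For $\xi \in \Omega_{Z_0}$, choose

$$c = \frac{\alpha - \alpha_0}{2} + c_1 + \frac{\alpha - \alpha_0}{2\xi^2}\int_{t-\tau_{\max}}^{t} \xi^2(\tau)\rho^2(\tau)\,d\tau,$$

where $c_1 > 0$. Since $[t - \tau, t] \subset [t - \tau_{\max}, t]$, we have the inequality

$$\int_{t-\tau}^{t} \xi^2(\tau)\rho^2(\tau)\,d\tau \le \int_{t-\tau_{\max}}^{t} \xi^2(\tau)\rho^2(\tau)\,d\tau$$

Moreover, the following inequalities hold:

$$-\sigma\tilde{W}^T\hat{W} = -\sigma\tilde{W}^T(\tilde{W} + W^*) \le -\frac{\sigma\|\tilde{W}\|^2}{2} + \frac{\sigma\|W^*\|^2}{2}$$

$$-c_1\xi^T\xi + \xi^T\varepsilon^* \le \frac{\varepsilon^{*2}}{4c_1} \le \frac{\varepsilon^2}{4c_1}$$

Hence

$$\dot{V} + \alpha_0 y^T y - \alpha_0\gamma^2\omega^T\omega \le -(\alpha - \alpha_0)V_0(s) - \frac{\alpha - \alpha_0}{2}\xi^T\xi - \frac{\alpha - \alpha_0}{2}\int_{t-\tau}^{t} \xi^2(\tau)\rho^2(\tau)\,d\tau - \frac{\sigma\|\tilde{W}\|^2}{2} + \frac{\sigma\|W^*\|^2}{2} + \frac{\varepsilon^2}{4c_1} \tag{27}$$

Let $\delta = \frac{\sigma\|W^*\|^2}{2} + \frac{\varepsilon^2}{4c_1}$. If we choose $\sigma$ and $\Gamma$ such that $\sigma \ge (\alpha - \alpha_0)\lambda_{\max}(\Gamma^{-1})$, then from (27) we have the following inequality

$$\begin{aligned}
\dot{V} + \alpha_0 y^T y - \alpha_0\gamma^2\omega^T\omega
&\le -(\alpha - \alpha_0)V_0(s) - \frac{\alpha - \alpha_0}{2}\xi^T\xi - \frac{\alpha - \alpha_0}{2}\tilde{W}^T\Gamma^{-1}\tilde{W} - \frac{\alpha - \alpha_0}{2}\int_{t-\tau}^{t} \xi^2(\tau)\rho^2(\tau)\,d\tau + \delta \\
&= -(\alpha - \alpha_0)V + \delta
\end{aligned} \tag{28}$$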


As $\alpha_0 y^T y \ge 0$, it follows from (28) that

$$\dot{V} \le -(\alpha - \alpha_0)V + \alpha_0\gamma^2\omega^T\omega + \delta \tag{29}$$

Referring to Definition 1, Definition 2 and Proposition 1, it is easy to obtain from (29) that the closed-loop system is input-to-state practically stable with respect to $\omega$. Then, it is easily deduced from (18) that the state variables of the closed-loop system are ultimately bounded if the states and NN weights are initialized in some compact set $\Omega_{Z_0}$ with bounded $\omega$. From inequality (28), we obtain that

$$\dot{V} + \alpha_0 y^T y - \alpha_0\gamma^2\omega^T\omega \le \delta \tag{30}$$

Integrating both sides of (30) yields

$$\int_0^t y^T y\,d\tau \le \gamma^2\int_0^t \omega^T\omega\,d\tau + \frac{1}{\alpha_0}\left(V(0) + \delta t\right) \tag{31}$$
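The step from (30) to (31) can be spelled out as follows. This is a short worked derivation, assuming $V(t) \ge 0$ for all $t$ (which holds for the candidate (24) provided $V_0 \ge 0$; that property of $V_0$ is taken here as an assumption, since $V_0$ is defined earlier in the paper).

$$\begin{aligned}
&\int_0^t \left(\dot{V} + \alpha_0 y^T y - \alpha_0\gamma^2\omega^T\omega\right) d\tau \le \delta t \\
\Longrightarrow\quad & V(t) - V(0) + \alpha_0\int_0^t y^T y\,d\tau - \alpha_0\gamma^2\int_0^t \omega^T\omega\,d\tau \le \delta t \\
\Longrightarrow\quad & \alpha_0\int_0^t y^T y\,d\tau \le \alpha_0\gamma^2\int_0^t \omega^T\omega\,d\tau + V(0) + \delta t - V(t) \\
\Longrightarrow\quad & \int_0^t y^T y\,d\tau \le \gamma^2\int_0^t \omega^T\omega\,d\tau + \frac{1}{\alpha_0}\left(V(0) + \delta t\right),
\end{aligned}$$

where the last line drops the term $-V(t) \le 0$ and divides by $\alpha_0 > 0$.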

Theorem 3.1: Consider the uncertain nonlinear time-delay system (1) satisfying Assumptions 2.1-2.3 with $\omega \in L_{2e}^q[0, \infty)$. For any $\gamma > 0$ and compact set $\Omega_{Z_0}$, given $\varepsilon > 0$, $\sigma > 0$, and $\alpha_0$ with $0 < \alpha_0 < \alpha$, there exist $l$, $\eta$, $\mu_i$, $\Gamma$, $c$ such that the solution of the closed-loop system is uniformly ultimately bounded and inequality (31) holds.

Remark 2: Theorem 3.1 essentially describes the solvability of the problem of ISpS disturbance attenuation for the uncertain nonlinear time-delay system (1). As the $L_2$-gain $\gamma$ can be arbitrarily small, the disturbance's effect on the output can almost be removed. Because the system is ISpS, the energy of the output on $[0, \infty)$ is unbounded even though $\delta$ can be arbitrarily small, but its power is bounded by $\delta$.

4 Simulation

Consider the following form of system (1)

$$\begin{aligned}
\dot{x}_1(t) &= -x_1(t) + x_1(t)\cdot x_2^2(t) + \xi^2(t) \\
\dot{x}_2(t) &= x_1(t)\cdot x_2(t)\cdot\xi(t) + \xi(t) \\
\dot{\xi}(t) &= u(t) + x_1(t) + \xi^2(t) + \sin(\xi(t - \tau)) + p(x(t), \xi(t))\,\omega \\
y(t) &= (x_1(t), x_2(t), \xi(t))^T
\end{aligned} \tag{32}$$

where $\omega = \sin(t)$ is the disturbance input and $p(x, \xi) = \cos(x_1(t)\cdot x_2(t)\cdot\xi(t))$. For simulation purposes, we assume that $\tau = 5$. An RBF neural network is employed to approximate $\varphi(Z)$ and the practical controller $u$ is given as follows:

$$u = -c\xi - \hat{W}^T S(Z) - \frac{1}{4\alpha_0\gamma^2} p p^T \xi - \frac{1}{2}\xi - \frac{1}{2}\xi \tag{33}$$

and $\hat{W}$ is updated by (26).

Fig. 1. The output y(1) with different γ (curves for γ = 0.1 and γ = 0.05; vertical axis: Output y(1), scale ×10⁻⁴; horizontal axis: Time (Sec), 0 to 50)

Fig. 2. The output y(2) with different γ (curves for γ = 0.1 and γ = 0.05; vertical axis: Output y(2); horizontal axis: Time (Sec), 0 to 50)

Fig. 3. The output y(3) with different γ (curves for γ = 0.1 and γ = 0.05; vertical axis: Output y(3); horizontal axis: Time (Sec), 0 to 50)

The neural network $\hat{W}^T S(Z)$ contains 27 nodes (i.e., $l = 27$), with centers $\mu_k$ ($k = 1, 2, \cdots, l$) evenly spaced in $[-2, 2] \times [-2, 2] \times [-2, 2]$, and widths $\eta = 2$. The design parameters of the above controller are $c = 7$, $\alpha_0 = 1$, $\sigma = 0.2$, $\Gamma = \mathrm{diag}\{2.0, 2.0, \cdots, 2.0\}$. The initial weight is $\hat{W} = [0, \cdots, 0]^T$. The initial states are $[x_1(0), x_2(0), \xi(0)]^T = [0, 0, 0]^T$. Figs. 1-3 show the simulation results of applying the controller (33) to system (32) with $\gamma = 0.1$ and $\gamma = 0.05$. From Figs. 1-3, we can see that a smaller gain $\gamma$ yields stronger attenuation of the output $y$.
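The simulation setup above can be collected in a few lines. The sketch below builds the 27 centers and the zero initial weights. Placing three grid points per axis at -2, 0 and 2 is an assumption that is consistent with 27 evenly spaced centers on the cube, and the variable names are hypothetical.

```python
from itertools import product

# Three evenly spaced points per axis give 3**3 = 27 centers (assumed layout).
axis = [-2.0, 0.0, 2.0]
centers = list(product(axis, repeat=3))

# Design parameters as listed in the text.
params = {"c": 7.0, "alpha0": 1.0, "sigma": 0.2, "eta": 2.0, "gamma_gain": 2.0}

# Zero initial weight vector and zero initial state [x1, x2, xi].
w_hat0 = [0.0] * len(centers)
state0 = [0.0, 0.0, 0.0]
```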

5 Conclusion

This paper has considered the problem of semiglobal ISpS disturbance attenuation for a class of nonlinear systems with unknown time delay. The time-delay term is cancelled by using an appropriate Lyapunov-Krasovskii functional. Based on neural network techniques and Lyapunov theory, an adaptive controller has been designed. Simulation results are presented to show the effectiveness of the approach.

References

1. Krstić, M., Kanellakopoulos, I., Kokotović, P.: Nonlinear and Adaptive Control Design. Wiley, New York (1995)
2. Isidori, A.: Nonlinear Control Systems, 3rd edn. Springer, New York (1995)
3. Xie, L.H., Su, W.Z.: Robust H∞ Control for a Class of Cascaded Nonlinear Systems. IEEE Transactions on Automatic Control 42, 1465–1469 (1997)
4. Byrnes, C.I., Isidori, A.: Asymptotic Stabilization of Minimum Phase Nonlinear Systems. IEEE Transactions on Automatic Control 36, 1122–1137 (1991)
5. Jiang, Z.P.: Global Output Feedback Control with Disturbance Attenuation for Minimum-phase Nonlinear Systems. Systems & Control Letters 39, 155–164 (2000)
6. Su, W.Z., Xie, L.H., Souza, C.E.: Global Robust Disturbance Attenuation and Almost Disturbance Decoupling for Uncertain Cascaded Nonlinear Systems. Automatica 35, 697–707 (1999)
7. Lin, Z.: Almost Disturbance Decoupling with Global Asymptotic Stability for Nonlinear Systems with Disturbance-affected Unstable Zero Dynamics. Systems & Control Letters 33, 163–169 (1998)
8. Tang, Y.G., Sun, F.C., Sun, Z.Q.: Neural Network Control of Flexible-link Manipulators Using Sliding Mode. Neurocomputing 70, 288–295 (2006)
9. Ge, S.S., Wang, C.: Direct Adaptive NN Control of a Class of Nonlinear Systems. IEEE Transactions on Neural Networks 13, 214–221 (2002)
10. Zhou, G.P., Su, W.Z., Wang, C.: Semiglobally ISpS Disturbance Attenuation via Adaptive Neural Design for a Class of Nonlinear Systems. In: Proceedings of the 6th World Congress on Control and Automation, pp. 2964–2968 (2006)
11. Ge, S.S., Hong, F., Lee, T.H.: Adaptive Neural Network Control of Nonlinear Systems with Unknown Time Delays. IEEE Transactions on Automatic Control 48, 2004–2010 (2003)
12. Zeng, Z.G., Wang, J.: Analysis and Design of Associative Memories Based on Recurrent Neural Networks with Linear Saturation Activation Functions and Time-varying Delays. Neural Computation 19, 2149–2182 (2007)
13. Cao, J., Ren, F.: Exponential Stability of Discrete-time Genetic Regulatory Networks with Delays. IEEE Transactions on Neural Networks 19, 520–523 (2008)
14. Jiang, Z.P., Praly, L.: Design of Robust Adaptive Controllers for Nonlinear Systems with Dynamic Uncertainties. Automatica 34, 825–840 (1998)
15. Jiang, Z.P., Teel, A., Praly, L.: Small-gain Theorem for ISS Systems and Applications. Math. Contr., Signals Syst. 7, 104–130 (1994)
16. Hornik, K., Stinchcombe, M., White, H.: Multilayer Feed-forward Networks are Universal Approximators. Neural Networks 2, 359–366 (1989)

Stability Criteria with Less Variables for Neural Networks with Time-Varying Delay

Tao Li, Xiaoling Ye, and Yingchao Zhang

Department of Information and Communication, Nanjing University of Information Science and Technology, 210044 Nanjing, Jiangsu, China
[email protected]

Abstract. In this paper, a new delay-dependent stability criterion for neural networks is derived by using a simple integral inequality. The result is expressed in terms of linear matrix inequalities and turns out to be equivalent to the existing result while including the least number of variables. This implies that some redundant variables in the existing stability criterion can be removed while maintaining the efficiency of the stability conditions. With the present stability condition, the computational burden is largely reduced. A numerical example is given to verify the effectiveness of the proposed criterion.

Keywords: Delay-dependent, Asymptotic stability, Neural networks, Linear matrix inequality (LMI).

1 Introduction

In recent years, neural networks (NNs) have attracted much attention in research and have found successful applications in many areas such as pattern recognition, image processing, association, and optimization problems [1,2]. One of the important research topics is the global asymptotic stability of neural network models. However, in the implementation of artificial NNs, time delays are unavoidable due to the finite switching speed of amplifiers. It has been shown that the existence of time delays in NNs may lead to oscillation, divergence or instability. Recently, the stability issue of NNs with time delays has been extensively studied (see [3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]). Among the various stability methods, a notable one is the free-weighting matrix method in [15,16], which is very effective for tackling the delay-dependent stability problem for time-delay NNs since neither bounding techniques on cross-product terms nor model transformations are involved. However, the free-weighting matrix method often needs to introduce many slack variables in obtaining LMI conditions and thus leads to a significant increase in the computational demand. One natural question is how to simplify existing stability results by using as few matrix variables as possible while maintaining the effectiveness of the stability conditions. In this paper, a simplified delay-dependent stability criterion for neural networks is obtained by using a simple integral inequality. The result is shown to be equivalent to that in [16] but with far fewer variables. This implies that our result is more efficient, as the computational burden is largely reduced.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 330–337, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Problem Formulation

Consider the following delayed neural network:

$$\dot{x}(t) = -Cx(t) + Ag(x(t)) + Bg(x(t - d(t))) + u, \tag{1}$$

where $x(\cdot) = [x_1(\cdot), x_2(\cdot), \cdots, x_n(\cdot)]^{\tau} \in R^n$ is the neuron state vector, $g(x(\cdot)) = [g_1(x_1(\cdot)), g_2(x_2(\cdot)), \cdots, g_n(x_n(\cdot))]^{\tau} \in R^n$ denotes the neuron activation function, and $u = [u_1, u_2, \cdots, u_n]^{\tau} \in R^n$ is a constant input vector. $C = \mathrm{diag}\{c_1, c_2, \cdots, c_n\}$ is a diagonal matrix with $c_i > 0$, $i = 1, 2, ..., n$. $A$ and $B$ are the connection weight matrix and the delayed connection weight matrix, respectively. The time delay $d(t)$ is a time-varying continuous function that satisfies

$$0 < d(t) < h, \quad \dot{d}(t) \le \mu, \tag{2}$$

where $h > 0$ and $\mu$ are constants. In the following, we assume that each neuron activation function $g_i(\cdot)$, $i = 1, 2, ..., n$, in (1) satisfies the following condition:

$$0 \le \frac{g_i(x) - g_i(y)}{x - y} \le k_i, \quad \forall x, y \in R, \ x \ne y, \ i = 1, 2, ..., n, \tag{3}$$

where $k_i$, $i = 1, 2, ..., n$, are constants. Assume $x^* = [x_1^*, x_2^*, ..., x_n^*]^{\tau}$ is an equilibrium of system (1). The transformation $z(\cdot) = x(\cdot) - x^*$ transforms system (1) into the following system:

$$\dot{z}(t) = -Cz(t) + Af(z(t)) + Bf(z(t - d(t))), \tag{4}$$

where $z(\cdot) = [z_1(\cdot), z_2(\cdot), \cdots, z_n(\cdot)]^{\tau}$ is the state vector of the transformed system, $f(z(\cdot)) = [f_1(z_1(\cdot)), f_2(z_2(\cdot)), \cdots, f_n(z_n(\cdot))]^{\tau}$ and $f_i(z_i(\cdot)) = g_i(z_i(\cdot) + x_i^*) - g_i(x_i^*)$, $i = 1, 2, ..., n$. Note that the functions $f_i$ satisfy the following conditions:

$$0 \le \frac{f_i(z_i)}{z_i} \le k_i, \quad f_i(0) = 0, \quad \forall z_i \ne 0,$$

which is equivalent to

$$f_i(z_i)\left(f_i(z_i) - k_i z_i\right) \le 0, \quad f_i(0) = 0.$$

The purpose of this paper is to establish a simplified LMI condition with fewer slack variables such that the NN described by (4) is globally asymptotically stable while obtaining an allowable delay bound $h$ that is as large as possible.
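The product form of the sector condition is easy to test numerically for a candidate activation function. The following sketch checks $f(z)(f(z) - kz) \le 0$ on a grid of sample points. The choice of `tanh` with $k = 1$ is purely illustrative and not taken from the paper, and a finite grid can only refute the condition, not prove it.

```python
import math

def sector_holds(f, k, samples):
    """Check f(z) * (f(z) - k*z) <= 0 on the given sample points."""
    return all(f(z) * (f(z) - k * z) <= 0.0 for z in samples)

samples = [i / 10.0 for i in range(-50, 51)]

# tanh has slope at most 1, so the sector [0, 1] is expected to hold.
ok_unit_sector = sector_holds(math.tanh, 1.0, samples)

# A sector bound that is too small (k = 0.5) is violated near the origin.
ok_half_sector = sector_holds(math.tanh, 0.5, samples)
```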

3 Main Results

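Theorem 1. For given scalars $h > 0$ and $\mu$, the origin of system (4) with (2) is asymptotically stable if there exist positive definite matrices $P$, $Q_i$ $(i = 1, 2, 3)$, $Z$, and positive diagonal matrices $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, ..., \lambda_n\}$, $T_1 = \mathrm{diag}\{t_{11}, t_{12}, ..., t_{1n}\}$ and $T_2 = \mathrm{diag}\{t_{21}, t_{22}, ..., t_{2n}\}$ such that the following LMI holds:

$$\Gamma = \begin{bmatrix}
\Gamma_{11} - \frac{1}{h}Z & \frac{1}{h}Z & \Gamma_{13} & PB - hC^{\tau}ZB & 0 \\
* & -(1 - \mu)Q_1 - \frac{2}{h}Z & 0 & KT_2 & \frac{1}{h}Z \\
* & * & \Gamma_{33} & \Lambda B + hA^{\tau}ZB & 0 \\
* & * & * & \Gamma_{44} & 0 \\
* & * & * & * & -Q_3 - \frac{1}{h}Z
\end{bmatrix} < 0, \tag{5}$$

where

$$\begin{aligned}
\Gamma_{11} &= -PC - C^{\tau}P + Q_1 + Q_3 + hC^{\tau}ZC, \\
\Gamma_{13} &= PA - C^{\tau}\Lambda + KT_1 - hC^{\tau}ZA, \\
\Gamma_{33} &= \Lambda A + A^{\tau}\Lambda + Q_2 - 2T_1 + hA^{\tau}ZA, \\
\Gamma_{44} &= -(1 - \mu)Q_2 + hB^{\tau}ZB - 2T_2, \\
K &= \mathrm{diag}\{k_1, k_2, ..., k_n\},
\end{aligned}$$

and $*$ denotes the symmetric term in a symmetric matrix.

Proof: Introduce the following Lyapunov-Krasovskii functional:

$$\begin{aligned}
V(z(t)) ={}& z^{\tau}(t)Pz(t) + \int_{-h}^{0}\int_{t+\theta}^{t} \dot{z}^{\tau}(s)Z\dot{z}(s)\,ds\,d\theta + \int_{t-d(t)}^{t} f^{\tau}(z(s))Q_2 f(z(s))\,ds \\
&+ 2\sum_{i=1}^{n} \lambda_i \int_{0}^{z_i} f_i(s)\,ds + \int_{t-d(t)}^{t} z^{\tau}(s)Q_1 z(s)\,ds + \int_{t-h}^{t} z^{\tau}(s)Q_3 z(s)\,ds,
\end{aligned}$$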

where $P > 0$, $Z > 0$, $Q_i > 0$ $(i = 1, 2, 3)$. The time derivative of $V(z(t))$ along the trajectories of system (4) gives

$$\begin{aligned}
\dot{V}(z(t)) \le{}& 2z^{\tau}(t)P\dot{z}(t) + 2f^{\tau}(z(t))\Lambda\dot{z}(t) + h\dot{z}^{\tau}(t)Z\dot{z}(t) - \int_{t-h}^{t} \dot{z}^{\tau}(s)Z\dot{z}(s)\,ds \\
&+ z^{\tau}(t)(Q_1 + Q_3)z(t) - (1 - \mu)z^{\tau}(t - d(t))Q_1 z(t - d(t)) - z^{\tau}(t - h)Q_3 z(t - h) \\
&+ f^{\tau}(z(t))Q_2 f(z(t)) - (1 - \mu)f^{\tau}(z(t - d(t)))Q_2 f(z(t - d(t))).
\end{aligned}$$

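From the integral inequality [8], we have

$$\begin{aligned}
-\int_{t-h}^{t} \dot{z}^{\tau}(s)Z\dot{z}(s)\,ds
={}& -\int_{t-h}^{t-d(t)} \dot{z}^{\tau}(s)Z\dot{z}(s)\,ds - \int_{t-d(t)}^{t} \dot{z}^{\tau}(s)Z\dot{z}(s)\,ds \\
\le{}& -\frac{1}{h}\left(\int_{t-h}^{t-d(t)} \dot{z}(s)\,ds\right)^{\tau} Z \left(\int_{t-h}^{t-d(t)} \dot{z}(s)\,ds\right) - \frac{1}{h}\left(\int_{t-d(t)}^{t} \dot{z}(s)\,ds\right)^{\tau} Z \left(\int_{t-d(t)}^{t} \dot{z}(s)\,ds\right) \\
={}& \begin{bmatrix} z(t - d(t)) \\ z(t - h) \end{bmatrix}^{\tau}
\begin{bmatrix} -\frac{1}{h}Z & \frac{1}{h}Z \\ \frac{1}{h}Z & -\frac{1}{h}Z \end{bmatrix}
\begin{bmatrix} z(t - d(t)) \\ z(t - h) \end{bmatrix}
+ \begin{bmatrix} z(t) \\ z(t - d(t)) \end{bmatrix}^{\tau}
\begin{bmatrix} -\frac{1}{h}Z & \frac{1}{h}Z \\ \frac{1}{h}Z & -\frac{1}{h}Z \end{bmatrix}
\begin{bmatrix} z(t) \\ z(t - d(t)) \end{bmatrix}
\end{aligned} \tag{6}$$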
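The bound used in the second line of (6) is a Jensen-type integral inequality. A scalar, discretized version can be checked numerically as in the sketch below. The Riemann-sum discretization, the scalar weight `z_weight` standing in for the matrix $Z$, and the test signal are all illustrative assumptions and not part of the proof.

```python
import math

def jensen_gap(velocity, dt, h, z_weight=1.0):
    """Return  int(v*Z*v) - (1/h) * (int v) * Z * (int v)  for a sampled scalar
    signal v on an interval of length len(velocity)*dt <= h.
    The integral inequality states that this gap is nonnegative."""
    length = len(velocity) * dt
    if length > h:
        raise ValueError("interval must not be longer than h")
    integral_sq = sum(z_weight * v * v for v in velocity) * dt
    integral = sum(velocity) * dt
    return integral_sq - z_weight * integral * integral / h

# Hypothetical sampled derivative on an interval of length 0.5 with h = 1.
velocity = [math.sin(0.1 * k) for k in range(50)]
gap = jensen_gap(velocity, dt=0.01, h=1.0, z_weight=2.0)
```

For the discretized sums the nonnegativity follows from the Cauchy-Schwarz inequality, which mirrors the continuous-time argument.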

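On the other hand, it is clear that

$$f_i(z_i(t))\left(f_i(z_i(t)) - k_i z_i(t)\right) \le 0, \quad i = 1, 2, ..., n, \tag{7}$$

$$f_i(z_i(t - d(t)))\left(f_i(z_i(t - d(t))) - k_i z_i(t - d(t))\right) \le 0, \quad i = 1, 2, ..., n. \tag{8}$$

Thus, for any $T_j = \mathrm{diag}\{t_{1j}, t_{2j}, ..., t_{nj}\} \ge 0$, $j = 1, 2$, we have

$$\begin{aligned}
\dot{V}(z(t)) \le{}& \eta^{\tau}(t)\Theta\eta(t) - 2\sum_{i=1}^{n} t_{i1} f_i(z_i(t))\left(f_i(z_i(t)) - k_i z_i(t)\right) \\
&- 2\sum_{i=1}^{n} t_{i2} f_i(z_i(t - d(t)))\left(f_i(z_i(t - d(t))) - k_i z_i(t - d(t))\right),
\end{aligned} \tag{9}$$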

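where $\eta(t) = [z^{\tau}(t) \ \ z^{\tau}(t - d(t)) \ \ f^{\tau}(z(t)) \ \ f^{\tau}(z(t - d(t))) \ \ z^{\tau}(t - h)]^{\tau}$ and

$$\Theta = \begin{bmatrix}
\Gamma_{11} - \frac{1}{h}Z & \frac{1}{h}Z & \Gamma_{13} & PB - hC^{\tau}ZB & 0 \\
* & \Gamma_{22} - \frac{2}{h}Z & 0 & 0 & \frac{1}{h}Z \\
* & * & \Gamma_{33} & \Lambda B + hA^{\tau}ZB & 0 \\
* & * & * & \Gamma_{44} & 0 \\
* & * & * & 0 & -Q_3 - \frac{1}{h}Z
\end{bmatrix}.$$

Applying the Schur complement to (5) gives $\dot{V}(t) < 0$ by (9). Hence, system (4) is asymptotically stable.

Recently, a less conservative delay-dependent stability condition for delayed NNs was proposed in [16] by introducing some free-weighting matrices, which is restated as follows.

Lemma 1. For given scalars $h > 0$ and $\mu$, the origin of system (4) with (2) is asymptotically stable if there exist positive definite matrices $P$, $Q_i$ $(i = 1, 2, 3)$, $Z$, positive diagonal matrices $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, ..., \lambda_n\}$, $T_1 = \mathrm{diag}\{t_{11}, t_{12}, ..., t_{1n}\}$, $T_2 = \mathrm{diag}\{t_{21}, t_{22}, ..., t_{2n}\}$, and matrices $N = [N_1^{\tau} \ N_2^{\tau} \ N_3^{\tau} \ N_4^{\tau} \ N_5^{\tau}]^{\tau}$, $M = [M_1^{\tau} \ M_2^{\tau} \ M_3^{\tau} \ M_4^{\tau} \ M_5^{\tau}]^{\tau}$, such that the following LMI holds:

$$\Phi = \begin{bmatrix}
\Phi_{11} & \Phi_{12} & \Phi_{13} & PB + N_4^{\tau} & N_5^{\tau} - M_1 & hN_1 & hM_1 & -hC^{\tau}Z \\
* & \Phi_{22} & \Phi_{23} & \Phi_{24} & -N_5^{\tau} + M_5^{\tau} - M_2 & hN_2 & hM_2 & 0 \\
* & * & \Phi_{33} & \Lambda B & -M_3 & hN_3 & hM_3 & hA^{\tau}Z \\
* & * & * & \Phi_{44} & -M_4 & hN_4 & hM_4 & hB^{\tau}Z \\
* & * & * & * & -Q_3 - M_5 - M_5^{\tau} & hN_5 & hM_5 & 0 \\
* & * & * & * & * & -hZ & 0 & 0 \\
* & * & * & * & * & * & -hZ & 0 \\
* & * & * & * & * & * & * & -hZ
\end{bmatrix} < 0, \tag{10}$$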


where

$$\begin{aligned}
\Phi_{11} &= -PC - C^{\tau}P + Q_1 + Q_3 + N_1 + N_1^{\tau}, & \Phi_{12} &= N_2^{\tau} - N_1 + M_1, \\
\Phi_{13} &= PA - C^{\tau}\Lambda + KT_1 + N_3^{\tau}, & \Phi_{22} &= -(1 - \mu)Q_1 - N_2 - N_2^{\tau} + M_2 + M_2^{\tau}, \\
\Phi_{23} &= -N_3^{\tau} + M_3^{\tau}, & \Phi_{24} &= KT_2 - N_4^{\tau} + M_4^{\tau}, \\
\Phi_{33} &= \Lambda A + A^{\tau}\Lambda + Q_2 - 2T_1, & \Phi_{44} &= -(1 - \mu)Q_2 - 2T_2.
\end{aligned}$$

Although Theorem 1 and Lemma 1 are obtained via different methods, they turn out to be equivalent. To show this, we give the following theorem.

Theorem 2. Inequality $\Gamma < 0$ in Theorem 1 is feasible if and only if $\Phi < 0$ in Lemma 1 is feasible.

Proof: Note that $\Phi$ in Lemma 1 can be expressed as $\Phi = \Gamma_1 + XW + W^{\tau}X^{\tau} < 0$, where

⎤ Γ11 + N1 + N1τ N2τ − N1 + M1 Γ13 Γ14 −M1 −N1 −M1 ⎢ ∗ Φ22 0 L2 T2 −M2 −N1 −M2 ⎥ ⎥ ⎢ ⎢ ∗ ∗ Γ 0 0 0 ⎥ 33 Γ34 ⎥ ⎢ ∗ ∗ ∗ Γ44 0 0 0 ⎥ Γ1 = ⎢ ⎥, ⎢ ⎥ ⎢ ∗ ∗ ∗ ∗ −Q 0 0 3 ⎥ ⎢ ⎣ ∗ ∗ ∗ ∗ ∗ − h1 Z 0 ⎦ ∗ ∗ ∗ ∗ ∗ ∗ − h1 Z     τ 0 0 N3τ N4τ N5τ 0 0 I −I 0 0 0 −I 0 X= , W = , 0 I 0 0 −I 0 −I 0 0 M3τ M4τ M5τ 0 0

and Γ11 , Γ13 , Γ14 , Φ22 Γ33 , Γ34 , Γ44 are defined in Theorem 1 and Lemma 1. For t  t−d(t) τ τ τ ξ(t) = [η τ (t) ( t−d(t) z(s)ds) ˙ ( t−h z(s)ds) ˙ ] = 0, it is seen that W ξ(t) = 0. According to Finsler’s Lemma, Φ < 0 holds if and only if the following inequality is true ξ τ (t)Γ1 ξ(t) < 0 , Then it yields that Γ1 < 0 is equivalent to Γ2 = ΠΓ1 Π τ < 0, where ⎡ ⎡ ⎤ −M1 −N1 − h1 Z I0000 I 0 ⎢ −N2 + 1 Z −M2 − 1 Z ⎢ 0 I 0 0 0 −I I ⎥ h h ⎢ ⎢ ⎥ ⎢Γ ⎢0 0 I 0 0 0 0 ⎥ 0 0 ⎢ ⎢ ⎥ ⎢ ⎥ 0 0 Π=⎢ ⎢ 0 0 0 I 0 0 0 ⎥ , Γ2 = ⎢ ⎢ ⎢ 0 0 0 0 I 0 −I ⎥ 0 0 ⎢ ⎢ ⎥ ⎣ ⎣0 0 0 0 0 I 0 ⎦ 0 − h1 Z 0 00000 0 I 0 − h1 Z



⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ ⎦


It is obvious that Γ < 0 holds if Γ2 < 0 holds. Conversely, if Γ < 0 holds, then Γ2 < 0 is feasible by taking N1 = − h1 Z, M1 = 0, N2 = h1 Z, M2 = − h1 Z. Thus, Φ is also feasible. This completes the proof. Remark 1. From the proof of Theorem 2, it is clear that Theorem 1 is equivalent to Theorem 1 in [16]. This means that the free weighting matrices Ni , Mi (i = 1, ..., 5) in [16] can be removed while maintaining the effectiveness of the stability condition. Theorem 1 provides stability criterion for NNs with d(t) satisfying 0 < d(t) < ˙ h and d(t) ≤ µ. In many cases, µ is unknown. For this circumstance, a rateindependent criterion for the delay satisfying 0 < d(t) < h is derived as follows by choosing Q1 = Q2 = 0 in Theorem 1. Corollary 1. For given scalar h > 0, the origin of system (4) with delay d(t) satisfy 0 < d(t) < h is asymptotically stable if there exist positive matrices P, Q3 , Z, positive diagonal matrices Λ = diag{λ1 , λ2 , ...λn }, T1 = diag{t11 , t12 , ...t1n } and T2 = diag{t21 , t22 , ...t2n } such that the following LMI holds: ⎡ ⎤ 0 Υ11 h1 Z P A − C τ Λ + KT1 − hC τ ZA P B − hC τ ZB 1 ⎢ ∗ −2Z ⎥ 0 KT2 h hZ ⎢ ⎥ τ ⎢ ∗ ⎥ 0, i = 1, 2, . . . , n, A = (aij )n×n , B = (bij )n×n are the interconnection matrices representing the weight coefficients of the neurons, and ΔC, ΔA, ΔB are the uncertainties of system matrices of the form [ΔC

ΔA ΔB] = HF (t)[E

E0

E1 ],

(2)

where the time-varying nonlinear function $F(t)$ satisfies

$$F^T(t)F(t) \le I, \quad \forall t \in R. \tag{3}$$

In this paper, it is assumed that the activation function $g(u)$ is bounded and globally Lipschitz; that is,

$$0 \le \frac{g_i(\xi_1) - g_i(\xi_2)}{\xi_1 - \xi_2} \le k_i, \quad i = 1, 2, \ldots, n. \tag{4}$$

Then, by using the well-known Brouwer's fixed point theorem [12], one can easily prove that there exists an equilibrium point for Eq. (1). Assume that $u^* = (u_1^*, u_2^*, \ldots, u_n^*)$ is an equilibrium point of system (1); we then shift the equilibrium point $u^*$ to the origin. The transformation $x(\cdot) = u(\cdot) - u^*$ puts system (1) into the following form:

$$\dot{x}(t) = -(C + \Delta C)x(t) + (A + \Delta A)f(x(t)) + (B + \Delta B)f(x(t - h(t))), \tag{5}$$

340

W. Feng, H. Wu, and W. Zhang

where $x(t)$ is the state vector of the transformed system, and $f_j(x_j(t)) = g_j(x_j(t) + u_j^*) - g_j(u_j^*)$ with $f_j(0) = 0$ for $j = 1, 2, \ldots, n$. Note that each activation function $f_i(\cdot)$ satisfies the following sector condition:

$$0 \le \frac{f_i(\xi_1) - f_i(\xi_2)}{\xi_1 - \xi_2} \le k_i, \quad i = 1, 2, \ldots, n. \tag{6}$$

Definition 1. The parametric uncertainties $\Delta C$, $\Delta A$, $\Delta B$ are said to be admissible if both (2) and (3) hold.

Definition 2. The equilibrium point 0 is said to be globally robustly stable if, for all admissible uncertainties $\Delta C$, $\Delta A$, $\Delta B$, it is locally stable in the sense of Lyapunov and globally attractive, where global attractivity means that every trajectory tends to the equilibrium point as $t \to \infty$.

Lemma 1. Given any real matrices $\Sigma_1$, $\Sigma_2$, $\Sigma_3$ with appropriate dimensions and a scalar $\varepsilon > 0$ such that $0 < \Sigma_3 = \Sigma_3^T$, the following inequality holds:

$$\Sigma_1^T\Sigma_2 + \Sigma_2^T\Sigma_1 \le \varepsilon\Sigma_1^T\Sigma_3\Sigma_1 + \varepsilon^{-1}\Sigma_2^T\Sigma_3^{-1}\Sigma_2. \tag{7}$$

Fact 1. [Schur complement] Given constant symmetric matrices Σ1 , Σ2 , Σ3 , where Σ1 = Σ1T , and 0 < Σ2 = Σ2T , then Σ1 + Σ3T Σ2−1 Σ3 < 0 if and only if 

3

 Σ1 Σ3T < 0, Σ3 −Σ2

or



 Σ1 Σ3 < 0. Σ3T −Σ2

(8)

Main Result

Theorem 1. The equilibrium point of system(5) is globally robustly stable if there exist symmetrical and positive definite matrices P, Q1 , Q2 , Q3 , a positive diagonal matrix Λ = diag {λ1 , λ2 , . . . λn } and two positive scalars ǫ1 , ǫ2 , satisfying the following LMI: ⎡

Ξ11 ⎢ ⋆ ⎢ ⎢ ⋆ ⎢ Ξ =⎢ ⎢ ⋆ ⎢ ⋆ ⎢ ⎣ ⋆ ⋆

0 Ξ22 ⋆ ⋆ ⋆ ⋆ ⋆

Ξ13 0 Ξ33 ⋆ ⋆ ⋆ ⋆

Ξ14 0 Ξ34 Ξ44 ⋆ ⋆ ⋆

0 0 0 0 Ξ55 ⋆ ⋆

where Ξ11 = −2P C + Q1 + Q2 + (ǫ1 + ǫ2 )E T E Ξ13 = P A − C T ΛT + (ǫ1 − ǫ2 )E T E0 Ξ14 = P B + (ǫ1 − ǫ2 )E T E1

Ξ16 0 0 0 0 Ξ66 ⋆

⎤ 0 0 ⎥ ⎥ Ξ37 ⎥ ⎥ 0 ⎥ ⎥ 0 to each classifier by the following:

$$F(x) = \mathrm{sign}\left[\left(\sum_{i=1}^{k} \alpha_i \log\frac{p^{+}(\phi_i(x))}{p^{-}(\phi_i(x))}\right) - T\right] \tag{10}$$

In the experiments, we denote these two thresholds by tpe and tpi for the Promoter-Exon classifier and the Promoter-Intron classifier, respectively.
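The decision rule (10) can be sketched in a few lines. In the sketch below, the per-feature probabilities $p^{+}$ and $p^{-}$ are passed in directly as numbers, the treatment of a zero score as positive is an assumption, and all function names and example values are hypothetical.

```python
import math

def log_ratio(p_pos, p_neg):
    """log(p+ / p-) for one feature value; both probabilities must be > 0."""
    return math.log(p_pos / p_neg)

def boosted_decision(log_ratios, alphas, threshold):
    """Return +1 or -1 following (10): sign(sum_i alpha_i * log_ratio_i - T).
    A score of exactly zero is mapped to +1 here (an assumption)."""
    score = sum(a * r for a, r in zip(alphas, log_ratios))
    return 1 if score - threshold >= 0 else -1

# Two hypothetical features with made-up probabilities and weights.
ratios = [log_ratio(0.8, 0.2), log_ratio(0.4, 0.6)]
alphas = [0.5, 0.3]
label_low_t = boosted_decision(ratios, alphas, threshold=0.06)
label_high_t = boosted_decision(ratios, alphas, threshold=0.75)
```

Raising the threshold $T$ makes the classifier more conservative, which is the role played by tpe and tpi.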

3 Experimental Results and Discussions

3.1 Training Sequence Sets

Our training sequence sets were downloaded from http://www.fruitfly.org/seq_tools/datasets/Human. All the training sequences are 300 bp long, and each promoter sequence is taken from 250 bp upstream to 50 bp downstream of the TSS. The three training sets for promoter, exon and intron include 565, 890 and 4345 sequences, respectively. We use these sequence sets to train our classifiers by setting Kmax = 4^5 = 1024, and the algorithm stopped at k = 641. For convenience, our promoter prediction system is called PPFB in the following discussion.

3.2 Large Genomic Sequence Analysis and Comparisons

Our promoter prediction system recognizes promoter regions in large genomic sequences by a sliding window. A window is moved over the sequence and its content is classified. The window length is set to 300 bp and the step length to 1 bp in our system. A promoter region is obtained by clustering the prediction outputs with a gap tolerance of 1 kb. To evaluate the performance of our algorithm, we compared our system with four other promoter prediction systems: PromoterInspector [1], Dragon Promoter Finder (DPF) [2], Eponine [4] and FirstEF [3]. These four methods are accessible via the Internet and are currently the best four prediction systems. The evaluation set for comparison is the same as that used in PromoterInspector and DPF and is currently a standard for evaluating the performance of promoter recognition systems. This set consists of six GenBank genomic sequences with a total length of 1.38 Mb and 35 known TSSs (see Table 3 in [1]). We adopt the same evaluation criterion used by PromoterInspector [1]: a predicted region is counted as correct if a TSS is located within the region or if a region boundary is within 200 bp 5' of such a TSS. The main results and comparisons are presented in Table 1.

In these experiments, PromoterInspector is used with default settings, our system is used with tpe = 0.06 and tpi = 0.75, and DPF is used with se = 0.45. The setting se = 0.45 is found to give a balanced sensitivity and specificity result. We observed that when the se of DPF is set too high, the number of false positives increases much more rapidly than the number of true positives. For the same reason, we set t = 0.995 for Eponine and p = 0.98 for FirstEF. By comparing the results of DPF with PromoterInspector, we can see that

488

S. Wu et al.

Table 1. Results of large genomic sequence analysis

Accession number   Method                        TP(a)  FP(b)  % TP of total predictions  % Coverage(c)
AC002397           PromoterInspector             5      1      83.3                       29.4
AC002397           DPF (se=0.45)                 6      4      60                         35.2
AC002397           Eponine (t=0.995)             8      1      88.8                       47
AC002397           FirstEF (p=0.98)              7      3      70                         41.1
AC002397           PPFB (tpe=0.06, tpi=0.75)     9      0      100                        52.9
L44140             PromoterInspector             6      14     30                         54.5
L44140             DPF (se=0.45)                 6      14     30                         54.5
L44140             Eponine (t=0.995)             6      12     33.3                       54.5
L44140             FirstEF (p=0.98)              6      11     35.2                       54.5
L44140             PPFB (tpe=0.06, tpi=0.75)     8      10     44.4                       72.7
D87675             PromoterInspector             1      2      33.3                       100
D87675             DPF (se=0.45)                 1      3      25                         100
D87675             Eponine (t=0.995)             1      1      50                         100
D87675             FirstEF (p=0.98)              1      0      100                        100
D87675             PPFB (tpe=0.06, tpi=0.75)     1      1      50                         100
AF017257           PromoterInspector             1      0      100                        100
AF017257           DPF (se=0.45)                 1      0      100                        100
AF017257           Eponine (t=0.995)             1      3      25                         100
AF017257           FirstEF (p=0.98)              1      0      100                        100
AF017257           PPFB (tpe=0.06, tpi=0.75)     1      0      100                        100
AF146793           PromoterInspector             1      2      33.3                       25
AF146793           DPF (se=0.45)                 1      4      20                         25
AF146793           Eponine (t=0.995)             1      3      25                         25
AF146793           FirstEF (p=0.98)              1      3      25                         25
AF146793           PPFB (tpe=0.06, tpi=0.75)     1      3      25                         25
AC002368           PromoterInspector             1      1      50                         100
AC002368           DPF (se=0.45)                 1      3      25                         100
AC002368           Eponine (t=0.995)             1      0      100                        100
AC002368           FirstEF (p=0.98)              1      1      50                         100
AC002368           PPFB (tpe=0.06, tpi=0.75)     1      0      100                        100

(a) TP: true positive; (b) FP: false positive; (c) % Coverage: the percentage of true promoters in a sequence.

Table 2. Results and comparisons of five prediction systems on human Chromosome 22

Method                       TP    FP    Se(%)(a)  Sp(%)(b)
PromoterInspector            239   274   60.8      46.6
DPF (se=0.37)                241   482   61.3      33.3
Eponine (t=0.9975)           247   248   62.8      49.9
FirstEF (p=0.98)             242   270   61.5      47.2
PPFB (tpe=0.1, tpi=0.9)      262   246   66.6      51.5

(a) Sensitivity: Se = TP/(TP+FN); (b) Specificity: Sp = TP/(TP+FP); FN: false negative. TP+FN = 393
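The Se and Sp columns of Table 2 follow directly from the TP and FP counts and the footnote definitions. The short sketch below recomputes them; the dictionary layout and function name are hypothetical, while the counts and the total of 393 known genes are copied from the table.

```python
KNOWN_GENES = 393  # TP + FN, from the footnote of Table 2

counts = {  # method: (TP, FP), copied from Table 2
    "PromoterInspector": (239, 274),
    "DPF": (241, 482),
    "Eponine": (247, 248),
    "FirstEF": (242, 270),
    "PPFB": (262, 246),
}

def se_sp(tp, fp, positives=KNOWN_GENES):
    """Return (Se, Sp) in percent: Se = TP/(TP+FN), Sp = TP/(TP+FP)."""
    return 100.0 * tp / positives, 100.0 * tp / (tp + fp)

metrics = {name: se_sp(tp, fp) for name, (tp, fp) in counts.items()}
```

The recomputed values agree with the reported percentages to within 0.1 percentage point; the small differences come from rounding in the table.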

A New Strategy for Pridicting Eukaryotic Promoter Based on Feature Boosting

489

although DPF can predict more promoters, it also results in more false positives. Comparing the prediction results of our system with those of DPF, Eponine, FirstEF and PromoterInspector shows that our method performs well in terms of both sensitivity and specificity, especially for the sequences AC002397 and L44140. We also evaluate the performance of our system on Release 3.1 of human chromosome 22, with a length of 35 Mb and 393 known genes annotated by the Chromosome 22 Gene Annotation Group at the Sanger Institute (http://www.sanger.ac.uk/HGP/Chr22). In this experiment, PromoterInspector is used with default settings, our system is used with tpe = 0.1 and tpi = 0.9, and DPF is used with se = 0.37 to give a comparable sensitivity and specificity result. For the same reason, we set t = 0.9975 for Eponine and p = 0.98 for FirstEF. We adopt the same evaluation criterion used by Scherf with PromoterInspector: all predictions located in the range -2000 to +500 around the 5' extremity of a known gene are considered true positive promoter regions (TP), and predictions outside this range are considered false positives (FP). The recognition results and comparisons are summarized in Table 2. The results show that our system performs better than the other four systems. Although our method obtained better results than the other models, the improvement is not large. This is because our method is based on probability and is not applicable to sequences of small probability, so our future work will concentrate on considering and combining other features to predict promoters.
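Section 3.2 describes how promoter regions are formed by clustering sliding-window predictions with a gap tolerance of 1 kb. A minimal sketch of that clustering step is given below. Representing each positive window by a single position, and treating a gap of exactly 1000 bp as still inside the tolerance, are assumptions; the function name is hypothetical.

```python
def cluster_hits(positions, gap=1000):
    """Merge positive window positions (in bp) into regions.
    A hit joins the current region if it lies within `gap` bp of the
    region's last hit; otherwise it starts a new region."""
    regions = []
    for pos in sorted(positions):
        if regions and pos - regions[-1][1] <= gap:
            regions[-1][1] = pos        # extend the current region
        else:
            regions.append([pos, pos])  # open a new region
    return [tuple(r) for r in regions]

regions = cluster_hits([100, 101, 102, 2000, 2500, 5000])
```

Each returned pair is the first and last hit of one predicted region, which can then be compared against annotated TSS positions.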

4 Conclusions

Eukaryotic promoter prediction is one of the most elusive problems in DNA sequence analysis. Although a number of algorithms have been proposed, most of them suffer from low sensitivity or too many false positives. In this paper, we developed a new algorithm called PPFB for promoter prediction based on the following hypothesis: a promoter is determined by certain motifs, and different promoters consist of different motifs. A new feature selection strategy based on divergence is proposed, and a new promoter prediction system that uses each feature as a classifier and combines them by feature boosting is developed. Experimental results show that the performance of our method is better than that of PromoterInspector, Dragon Promoter Finder, Eponine, and FirstEF. In the future, we will use various different features to construct different weak classifiers and integrate them into one to further improve the prediction accuracy.

Acknowledgments

This work is supported by a grant from the National Natural Science Foundation of China (project 60772028).

References

1. Scherf, M., Klingenhoff, A., Werner, T.: Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J. Mol. Biol. 297, 599–606 (2000)
2. Bajic, V.B., Seah, S.H., Chong, A., Krishnan, S.P.T., Koh, J.L.Y., Brusic, V.: Computer model for recognition of functional transcription start sites in polymerase II promoters of vertebrates. Journal of Molecular Graphics & Modeling 21, 323–332 (2003)
3. Davuluri, R.V., Grosse, I., Zhang, M.Q.: Computational identification of promoters and first exons in the human genome. Nat. Genet. 29, 412–417 (2001)
4. Down, T.A., Hubbard, T.J.: Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res. 12, 458–461 (2002)
5. Wu, S., Xie, X., Liew, A.W., Hong, Y.: Eukaryotic promoter prediction based on relative entropy and positional information. Physical Review E 75, 041908-1–041908-7 (2007)
6. Prestridge, D.S., Burks, C.: The density of transcriptional elements in promoter and nonpromoter sequences. Hum. Mol. Genet. 2, 1449–1453 (1993)
7. Hutchinson, G.B.: The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Comput. Appl. Biosci. 12, 391–398 (1996)
8. Cross, S.H., Clark, V.H., Bird, A.P.: Isolation of CpG islands from large genomic clones. Nucleic Acids Res. 27, 2099–2107 (1999)
9. Ioshikhes, I.P., Zhang, M.Q.: Large-scale human promoter mapping using CpG islands. Nat. Genet. 26, 61–63 (2000)
10. Ponger, L., Mouchiroud, D.: CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics 18, 631–633 (2002)
11. Schapire, R., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. In: Proc. of Annual Conf. on Computational Learning Theory, pp. 80–91 (1998)
12. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 2nd edn. Academic Press, San Diego (2003)

Searching for Interacting Features for Spam Filtering

Chuanliang Chen1, Yunchao Gong2, Rongfang Bie1,*, and Xiaozhi Gao3

1 Department of Computer Science, Beijing Normal University, Beijing 100875, China
2 Software Institute, Nanjing University, Nanjing, China
3 Department of Electrical Engineering, Helsinki University of Technology, Otakaari 5 A, 02150 Espoo, Finland
[email protected], [email protected], [email protected]

Abstract. In this paper, we introduce a novel feature selection method, INTERACT, to select relevant words of emails for spam email filtering, i.e., classifying an email as spam or legitimate. Four traditional feature selection methods from the text categorization domain, Information Gain, Gain Ratio, Chi Squared, and ReliefF, are also used for performance comparison. Three classifiers, Support Vector Machine (SVM), Naïve Bayes, and a novel classifier, Locally Weighted learning with Naïve Bayes (LWNB), are discussed in this paper. Four popular datasets are employed as the benchmark corpora in our experiments to examine the capabilities of these five feature selection methods and the three classifiers. In our simulations, we find that LWNB improves on Naïve Bayes and gains higher prediction results by learning local models, and its performance is sometimes better than that of the SVM. Our study also shows that INTERACT can result in better classifier performance than the other four traditional methods for spam email filtering.

Keywords: Interacting features, Feature selection, Naïve Bayes, Spam filtering.

* Corresponding author.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 491–500, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

The increasing popularity of electronic mail has enticed direct marketers to flood the mailboxes of millions of users with unsolicited messages. These messages are usually referred to as spam or, more formally, Unsolicited Bulk E-mail (UBE), and may advertise anything from vacations to get-rich schemes [1]. Spam has a negative effect on people's daily lives: it fills mailboxes, buries important personal mail, wastes network bandwidth, and consumes users' time and energy, not to mention the other problems associated with it (crashed mail servers, pornography advertisements sent to children, etc.). A study in 1997 indicated that spam messages constituted approximately 10% of the incoming messages to a corporate network [4]. CAUBE.AU reports that the volume of spam is increasing at an alarming rate, and some people claim they are even abandoning their e-mail accounts because of spam [3]. This situation seems to be worsening with time, and without appropriate countermeasures, spam messages could eventually undermine the usability of e-mail. These serious threats make spam filtering, whose task is to rule out unsolicited e-mails automatically from the e-mail stream, an increasingly important problem.

In recent years, many studies have addressed spam filtering with machine learning, because attempts to introduce legal measures against spam mailing have had limited effect. Several supervised learning algorithms have been successfully applied to spam filtering: Naïve Bayes [5,6,7,8], Support Vector Machine [9,10], Memory Based Learning methods [11,12], and Decision Tree [13]. Among these classification methods, the Naïve Bayes is particularly attractive for spam filtering, as its performance is surprisingly good [12]. The Naïve Bayes classifier has been the filtering engine of many commercial anti-spam products. Therefore, in this paper, we aim at improving the prediction ability of the Naïve Bayes by introducing a locally learned model.

In order to train or test classifiers, it is necessary to go through a large corpus of spam and legitimate e-mails. The e-mails of a corpus have to be preprocessed to extract the words (features) belonging to the message subjects, the bodies and/or the attachments. As the number of features in a corpus can end up being very high, it is usual to choose those features that best represent each message before training the filter, which prevents the classifiers from over-fitting [14]. The effectiveness of the classifiers relies on an appropriate choice of these features: both the preprocessing steps of feature extraction and the selection of the most representative features are crucial for the performance of the filters [15]. In this paper, a novel feature selection method, INTERACT, and a novel classifier, LWNB, are introduced to deal with spam filtering.

The remainder of this paper is organized as follows. Section 2 describes the INTERACT algorithm for spam filtering. We explain the principles of the e-mail representation and preprocessing in Section 3. The classifiers used in this paper are presented in Section 4. We report the performances of the five feature selection methods and the three classifiers, using F measure and accuracy, in Section 5. Section 6 concludes our study with a few remarks.

2 INTERACT Algorithm

Interacting features challenge current feature selection methods for classification. A feature by itself may have little correlation with the target concept, yet when it is combined with some other features, they can be strongly correlated with the target concept [2]. Many traditional feature selection methods unintentionally remove such features and thus yield poor classification performance. The INTERACT algorithm can handle feature interaction efficiently, at a much lower time cost than the traditional methods. A brief description of the INTERACT algorithm is presented below; more details can be found in [2].

The INTERACT algorithm searches for interacting features by solving two key problems: how to update the C-contribution efficiently, and how to deal with the feature order problem. The C-contribution of a feature indicates how significantly the elimination of that feature affects consistency. In particular, the C-contribution of an irrelevant feature is zero.


To solve the first problem, the INTERACT algorithm calculates the C-contribution efficiently with a hashing mechanism [2]: each instance is inserted into a hash table, with its values of the features in Slist used as the hash key, where Slist is the set of ranked features not yet eliminated (Slist is initialized with the full set of features). Instances with the same hash key are inserted into the same entry of the hash table and update the label information recorded there.

For the second problem, we assume that the set of features can be divided into a subset S1 of relevant features and a subset S2 of irrelevant ones. The INTERACT algorithm intends to remove the features in S2 first and to preserve the features in S1, which are more likely to remain in the final set of selected features. The INTERACT algorithm achieves this by applying a heuristic: it ranks the individual features by symmetrical uncertainty (SU) in descending order, so that the (heuristically) most relevant feature is positioned at the beginning of the list. SU is described in information theory textbooks and in Numerical Recipes. It is often used as a fast correlation measure to evaluate the relevance of individual features [12,17].

INTERACT is a filtering algorithm that employs backward elimination to remove the features with no or low C-contribution. Given a full set of N features and a class attribute C, INTERACT finds a feature subset Sbest for the class concept [2]. The algorithm consists of two major parts: first, the features are ranked in descending order of their symmetrical uncertainty values; second, the features are evaluated one by one, starting from the end of the ranked feature list. The process is shown as follows.

Algorithm 1. INTERACT Algorithm
Input: F, the full feature set with N features {F1, F2, …, FN}; C, the class label; δ, a predefined threshold.
Output: Sbest, the subset of selected features.
Process:
  Sbest = ∅
  for i = 1 to N do
    calculate SU(Fi, C) for Fi
    append Fi to Sbest
  end
  sort Sbest in descending order according to SU(Fi, C)
  F ← last element of Sbest
  repeat
    if F ≠ NULL then
      p ← C-contribution of F
      if p ≤ δ then
        remove F from Sbest
      end
    end
    F ← next element of Sbest, moving toward the head of the list
  until F = NULL
  return Sbest
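The two parts of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, features are assumed to be discrete, and the C-contribution is assumed to be the increase in inconsistency rate when a feature is dropped from the current subset, which is how we read the consistency-based definition in [2]. The dictionary of per-pattern label counts plays the role of the hash table described above.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def inconsistency_rate(X, y, features):
    """Hash every instance on its values for `features`; within each bucket,
    all instances outside the majority class count as inconsistent."""
    buckets = defaultdict(Counter)
    for row, label in zip(X, y):
        buckets[tuple(row[f] for f in features)][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in buckets.values())
    return inconsistent / len(y)


def interact(X, y, delta=0.0):
    """Backward elimination over features ranked by SU (sketch of Algorithm 1)."""
    columns = list(zip(*X))
    su = [symmetrical_uncertainty(col, y) for col in columns]
    s_best = sorted(range(len(columns)), key=lambda f: su[f], reverse=True)
    for f in reversed(list(s_best)):  # start from the end of the ranked list
        rest = [g for g in s_best if g != f]
        # Assumed C-contribution: rise in inconsistency when f is dropped.
        p = inconsistency_rate(X, y, rest) - inconsistency_rate(X, y, s_best)
        if p <= delta:
            s_best = rest
    return s_best
```

On a toy data set whose label is the XOR of the first two binary features, each feature alone has zero SU with the class, yet removing either of the two raises the inconsistency rate, so both are kept while an unrelated third feature is eliminated. This is the interaction effect that single-feature rankings miss.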


3 Preprocessing of Corpus and Message Representation

3.1 Feature Selection Methods for Comparison

Four other feature selection methods are used in this paper to assess the capability of the INTERACT algorithm: the Chi Squared ($\chi^2$) statistic, Information Gain, Gain Ratio, and ReliefF. Their definitions are given below. In the following formulas, $m$ is the number of classes (in the spam filtering domain, $m = 2$), and $C_i$ denotes the $i$th class. $V$ is the number of partitions into which a feature can split the training set. Let $N$ be the total number of samples, $N_{C_i}$ the number of samples of class $i$, and $N^{(v)}$ the number of samples in the $v$th partition. In the $v$th partition, $N_{C_i}^{(v)}$ denotes the number of samples belonging to class $i$.

Chi Squared: The Chi Squared statistic is calculated by comparing the observed frequency of a class with its a priori frequency. The definition is:

$$\chi^2 = \sum_{i=1}^{m} \sum_{v=1}^{V} \frac{\left(N_{C_i}^{(v)} - \tilde{N}_{C_i}^{(v)}\right)^2}{\tilde{N}_{C_i}^{(v)}}, \qquad (1)$$

where $\tilde{N}_{C_i}^{(v)} = (N^{(v)} / N)\, N_{C_i}$ denotes the prior frequency.

Information Gain: Information Gain is based on the feature's impact on decreasing the entropy, and is defined as follows:

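$$\mathrm{InfoGain} = \left[\sum_{i=1}^{m} -\left(\frac{N_{C_i}}{N}\right) \log\left(\frac{N_{C_i}}{N}\right)\right] - \left[\sum_{v=1}^{V} \left(\frac{N^{(v)}}{N}\right) \sum_{i=1}^{m} -\left(\frac{N_{C_i}^{(v)}}{N^{(v)}}\right) \log\left(\frac{N_{C_i}^{(v)}}{N^{(v)}}\right)\right]. \qquad (2)$$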

Gain Ratio: Gain Ratio was first used in C4.5 and is defined as:

$$\mathrm{GainRatio} = \mathrm{InfoGain} \Big/ \left[\sum_{i=1}^{m} -\left(\frac{N_{C_i}}{N}\right) \log\left(\frac{N_{C_i}}{N}\right)\right]. \qquad (3)$$
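The three count-based scores above can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, logarithms are taken to base 2, and `gain_ratio` follows Eq. (3) as printed (normalization by the class entropy), whereas C4.5 itself normalizes by the split information of the partition.

```python
from collections import Counter
from math import log2


def _counts(feature, labels):
    """Return N, N_{C_i}, N^{(v)} and N_{C_i}^{(v)} as counters."""
    n = len(labels)
    class_n = Counter(labels)                # N_{C_i}
    part_n = Counter(feature)                # N^{(v)}
    joint_n = Counter(zip(feature, labels))  # N_{C_i}^{(v)}
    return n, class_n, part_n, joint_n


def chi_squared(feature, labels):
    """Eq. (1): observed versus prior (expected) class frequency per partition."""
    n, class_n, part_n, joint_n = _counts(feature, labels)
    total = 0.0
    for c, n_c in class_n.items():
        for v, n_v in part_n.items():
            expected = n_v / n * n_c
            total += (joint_n[(v, c)] - expected) ** 2 / expected
    return total


def class_entropy(labels):
    """First bracket of Eq. (2): entropy of the class distribution."""
    n = len(labels)
    return -sum(k / n * log2(k / n) for k in Counter(labels).values())


def info_gain(feature, labels):
    """Eq. (2): class entropy minus the entropy remaining after the split."""
    n, class_n, part_n, joint_n = _counts(feature, labels)
    remaining = 0.0
    for v, n_v in part_n.items():
        for c in class_n:
            k = joint_n[(v, c)]
            if k:  # 0 * log(0) is treated as 0
                remaining -= (n_v / n) * (k / n_v) * log2(k / n_v)
    return class_entropy(labels) - remaining


def gain_ratio(feature, labels):
    """Eq. (3) as printed: information gain divided by the class entropy."""
    h = class_entropy(labels)
    return info_gain(feature, labels) / h if h > 0 else 0.0
```

For a binary word feature that splits four messages perfectly into two spam and two legitimate ones, every expected count is 1, so the Chi Squared score is 4 and the information gain equals the full class entropy of 1 bit. A feature that is independent of the class scores 0 on all three measures.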

ReliefF: The key idea of Relief is to estimate the quality of features according to how well their values distinguish among instances that are near to each other. ReliefF is an extension of Relief that estimates the probabilities more reliably and can deal with incomplete and multi-class data sets. More details can be found in [17].

3.2 Corpus Preprocessing and Message Representation

Each e-mail in the corpora is represented as a set of words. After analyzing all the e-mails of a corpus, a dictionary with N words/features is formed. Every e-mail is represented as a feature vector of N elements, where the ith element is a binary variable indicating whether the ith word occurs in the e-mail. During preprocessing, we perform word stemming, stop-word removal, and Document Frequency Thresholding (DFT) in order to reduce the dimension of the feature space. The HTML tags of the e-mails are also removed during preprocessing. Finally, we extract the first 5,000 tokens of the dictionary according to their mutual information to form the corpora used in this paper.
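The binary representation described above can be sketched as follows. The code is a simplified illustration with hypothetical names: the stop-word list is a tiny placeholder, stemming and the final mutual-information ranking are omitted, and the document frequency threshold `min_df` is an assumed parameter, since the paper does not state the value it uses.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "to", "and", "of", "is"}


def tokenize(text):
    """Strip HTML tags, lower-case the text, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_dictionary(emails, min_df=2):
    """Keep the words that occur in at least `min_df` e-mails (DFT)."""
    doc_freq = Counter()
    for email in emails:
        doc_freq.update(set(tokenize(email)))
    return sorted(w for w, k in doc_freq.items() if k >= min_df)


def to_binary_vector(email, dictionary):
    """The ith element is 1 iff the ith dictionary word occurs in the e-mail."""
    present = set(tokenize(email))
    return [1 if w in present else 0 for w in dictionary]
```

Each message then becomes a 0/1 vector whose length equals the dictionary size, which is the input format assumed by the feature selection methods and classifiers in the following sections.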

4 Classifiers for Spam Filtering

In this paper, we use three classifiers to test the capabilities of the aforementioned feature selection methods: Support Vector Machine (SVM), Naïve Bayes, and Locally Weighted learning with Naïve Bayes (LWNB), an improvement of Naïve Bayes that we introduce into the spam filtering domain for the first time. We only briefly introduce the LWNB here; more details can be found in [1].

In the LWNB, the Naïve Bayes is learned locally in the same way as linear regression is used in locally weighted linear regression. A local Naïve Bayes model is fit to a subset of the data that lies in the neighborhood of the instance whose class value is to be predicted [1]. The training samples in this neighborhood are weighted, with more distant ones receiving less weight. The classification is then obtained from this local Naïve Bayes model. The subset of the data used to train each locally weighted Naïve Bayes model is determined by a nearest neighbors algorithm: the first k nearest neighbors are selected to form this subset, where k is a user-specified parameter. To determine the weight of each instance of the subset, we use, as in [1], a linear weighting function in our experiments, defined as:

$$f_{linear} = 1 - d_i / d_k, \qquad (4)$$

where $d_i$ is the Euclidean distance to the $i$th nearest neighbor $x_i$ (so $d_k$ is the distance to the $k$th, i.e. the farthest selected, neighbor). Clearly, with $f_{linear}$ the weight decreases linearly with the distance. Empirical study shows that the LWNB is not particularly sensitive to the choice of k as long as k is not too small [1]. Too small a k may cause the local Naïve Bayes model to fit the noise in the data.

The Naïve Bayes calculates the posterior probability of class $c_l$ for a test instance with $m$ attribute values $a_1, a_2, \ldots, a_m$ as follows:

$$p(c_l \mid a_1, a_2, \ldots, a_m) = \frac{p(c_l) \prod_{j=1}^{m} p(a_j \mid c_l)}{\sum_{i=1}^{C} \left[ p(c_i) \prod_{j=1}^{m} p(a_j \mid c_i) \right]}, \qquad (5)$$

where $C$ is the total number of classes. In the LWNB, the individual probabilities on the right-hand side of (5) are estimated based on the weighted data. The prior probability for class $c_l$ becomes:

$$p(c_l) = \frac{1 + \sum_{i=0}^{n} I(c_i = c_l)\, w_i}{C + \sum_{i=0}^{n} w_i}, \qquad (6)$$

where $c_i$ is the class value of the $i$th training instance, $w_i$ is its weight, and the indicator function $I(x = y)$ is 1 iff $x = y$. The attributes are assumed to be nominal; numeric attributes are discretized. The conditional probability of $a_j$ is given by:

$$p(a_j \mid c_l) = \frac{1 + \sum_{i=0}^{n} I(a_j = a_{ij})\, I(c_i = c_l)\, w_i}{n_j + \sum_{i=0}^{n} I(a_j = a_{ij})\, w_i}, \qquad (7)$$

where $n_j$ is the number of values of attribute $j$, and $a_{ij}$ is the value of attribute $j$ of the $i$th instance.
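The prediction step of the LWNB, Eqs. (4) to (6), can be sketched in Python as below. This is a minimal illustration, not the implementation used in the experiments: the names are hypothetical, attributes are assumed to be nominal values encoded as numbers, and neighbors at zero distance from all others receive weight 1. One point is an explicit assumption: the conditional estimate here normalizes by the total weight of the neighbors in class $c_l$ (the usual Laplace-smoothed form), whereas the denominator of Eq. (7) as printed sums over $I(a_j = a_{ij})$.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def lwnb_predict(train_X, train_y, x, k, n_values):
    """Return (predicted class, posterior dict) for the test instance x.

    n_values[j] is the number of distinct values of attribute j.
    """
    classes = sorted(set(train_y))
    dists = [euclidean(row, x) for row in train_X]
    # Indices of the k nearest neighbors of x.
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    d_k = dists[nearest[-1]]
    # Eq. (4): linear weights; the kth neighbor itself gets weight 0.
    w = {i: (1.0 - dists[i] / d_k) if d_k > 0 else 1.0 for i in nearest}
    total_w = sum(w.values())

    scores = {}
    for c in classes:
        class_w = sum(w[i] for i in nearest if train_y[i] == c)
        score = (1.0 + class_w) / (len(classes) + total_w)  # Eq. (6)
        for j, a_j in enumerate(x):
            match_w = sum(w[i] for i in nearest
                          if train_y[i] == c and train_X[i][j] == a_j)
            # Assumed normalization over the weighted instances of class c.
            score *= (1.0 + match_w) / (n_values[j] + class_w)
        scores[c] = score

    z = sum(scores.values())  # denominator of Eq. (5)
    posterior = {c: s / z for c, s in scores.items()}
    return max(posterior, key=posterior.get), posterior
```

Because every feature takes part in the Euclidean distance, the neighborhood (and hence the local model) depends directly on which features are kept, which is why feature selection matters for this classifier.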

5 Experiments and Analysis

5.1 Corpus in Simulations

The experiments are based on four popular benchmark corpora, PU1, PU2, PUA, and Ling Spam, which are all available from [16]. In all PU corpora and the Ling Spam corpus, attachments, HTML tags, and header fields other than the subject are removed, leaving only subject lines and mail body texts. In order to address privacy, each token of a corpus is encoded as a unique integer. The details of each corpus are given below.

PU1 Corpus: The PU1 corpus consists of 1,099 messages, of which 481 are spam and 618 are legitimate. The spam rate is 43.77%.

PU2 Corpus: The PU2 corpus contains fewer messages than PU1, namely 721. Among them, 579 messages are labeled legitimate and 142 spam.

PUA Corpus: The PUA corpus has 1,142 messages, half of which, i.e., 571 messages, are marked as spam and the other half as legitimate.

Ling Spam Corpus: The Ling Spam corpus includes 2,412 legitimate messages from a linguistic mailing list and 481 spam messages collected by the author. The spam rate is 16.63%. Unlike the PU corpora, the messages of the Ling Spam corpus come from different sources: the legitimate messages are collected from a spam-free, topic-specific mailing list and the spam ones from a personal mailbox. Therefore, the distribution of mails is less similar to a normal user's mail stream, which makes the messages of the Ling Spam corpus easy to separate.

5.2 Performance Measures

We use two popular evaluation metrics of the text categorization domain to measure the performance of the classifiers: accuracy and F measure.

Accuracy: Accuracy is the percentage of correct predictions among all predictions. It is defined as follows:

$$\mathrm{Accuracy} = \frac{P_c}{P_t} \times 100\%, \qquad (8)$$

where $P_c$ is the number of correct predictions, and $P_t$ is the total number of predictions. The higher the accuracy, the better.


F measure: The F measure is defined as follows:

$$F = \frac{2 R P}{R + P}, \qquad (9)$$

where $R$ is the Recall, the percentage of the messages of a given category that are classified correctly, and $P$ is the Precision, the percentage of the messages predicted to be in a given category that are classified correctly. The F measure ranges from 0 to 1, and the higher, the better.

5.3 Results and Analysis
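Before turning to the results, the two measures of Section 5.2 can be made concrete with a short Python sketch. The function names are hypothetical, and two choices are assumptions made for illustration: the category of interest is passed explicitly as `positive` (for example the spam class), and the F measure is defined as 0 when recall and precision are both 0.

```python
def accuracy(y_true, y_pred):
    """Eq. (8): percentage of correct predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


def f_measure(y_true, y_pred, positive):
    """Eq. (9): harmonic mean of recall R and precision P for one category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```

With five messages of which three are predicted correctly, the accuracy is 60%; if two of the three spam messages are caught and one legitimate message is flagged by mistake, recall and precision are both 2/3 and so is the F measure.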

The following classification performance is measured through 10-fold cross-validation. We select all of the interacting features, i.e., features with non-negative C-contribution. Table 1 summarizes the results of dimension reduction after INTERACT selects the features.

Table 1. Summary of results of INTERACT selected features on the four benchmark corpora

Corpus                                               PU1   PU2   PUA   Ling Spam
Num. of features with non-negative C-contribution    43    43    42    64

From Table 1, we can see that the dimension of the data is reduced sharply once INTERACT removes the irrelevant features. Therefore, we run the classifiers on these data directly rather than reducing them further by adjusting the parameter δ. Table 1 also shows that the corpora contain many words/features that are irrelevant for spam filtering: about 99% of the features are removed by INTERACT. The histograms in Fig. 1 and Fig. 2 show the performances of the three classifiers, SVM (with a linear kernel), Naïve Bayes, and LWNB, on the four corpora. For the four other feature selection methods used for comparison, we select the first M features according to the features' scores, where M is the number of interacting features found by the INTERACT algorithm.

From Fig. 1 and Fig. 2, we find that the INTERACT algorithm can improve the performances of all three classifiers. In most cases, their performances on the reduced corpus, evaluated by accuracy and F measure, are equal to or better than those on the full corpus. For example, the performances of the SVM on the PU1 and PU2 corpora reduced by INTERACT are equal to those on the full corpora, and its performance on the PUA corpus reduced by INTERACT is better than that on the full corpus. However, the performance of the SVM on the Ling Spam corpus reduced by INTERACT is slightly worse than that on the full corpus. The feature selection capability of INTERACT is clearly better than that of the other popular feature selection methods. The competitive performances of the classifiers on the data reduced by INTERACT show that only a few relevant words suffice to distinguish between spam and legitimate e-mails. This is true in practice: for example, it is well known that the words “buy, purchase, jobs, …” usually appear in spam e-mails, and they are thus useful distinguishers of e-mail category.


Fig. 1. Performances of the aforementioned three classifiers and four feature selection methods on the PU1, PU2, PUA, and Ling Spam benchmark corpora, measured by accuracy

Fig. 2. Performances of the aforementioned classifiers and four feature selection methods on the PU1, PU2, PUA, and Ling Spam benchmark corpora, measured by the F measure


The performance of the LWNB is also promising. On the Ling Spam corpus, its performance is even better than that of the SVM, a well-known powerful classifier. On the PU1 and Ling Spam corpora, the LWNB successfully improves on the Naïve Bayes by using locally weighted models. However, its performance is worse than that of the Naïve Bayes on the PU2 and PUA corpora. The reason may be that the spam filtering task suits the class conditional independence hypothesis of the Naïve Bayes, that is, given the class label, the frequencies of the words in one e-mail are conditionally independent of one another.

A careful observation raises another question: why does the LWNB perform poorly on the full corpus? The reason is that the full corpus contains many irrelevant features, as the feature selection results of INTERACT also indicate. When the neighbors are determined, all the features take part in the distance calculation, and the many irrelevant features mask the effect of the truly relevant ones, so that the LWNB finds wrong or irrelevant neighbors from which to build the locally weighted Naïve Bayes models. Nevertheless, the LWNB is still a promising classifier for spam filtering when combined with a good feature selection method such as INTERACT.

6 Conclusions

In this paper, we present our work on spam filtering. First, we introduce the INTERACT algorithm to select interacting words/features for spam filtering. Four other traditional feature selection methods are also run in the experiments for performance comparison. Second, we introduce a novel classifier, LWNB, to improve the performance of the Naïve Bayes, one of the most popular classifiers in the spam filtering area. In total, three classifiers, SVM, Naïve Bayes and LWNB, are run in our simulations on four corpora preprocessed by the five feature selection methods and on the corresponding full corpora. Two popular evaluation metrics, accuracy and F measure, are used to measure the performances of these three classifiers. Our empirical study shows that INTERACT feature selection can improve the performances of all three classifiers, and that its feature selection ability is better than that of the four traditional feature selection methods. We briefly analyze the reason why INTERACT and the other four methods can work together to perform well. We also find that the LWNB can improve the performance of the Naïve Bayes and is sometimes superior to the SVM.

Acknowledgements. The research work presented in this paper was supported by grants from the National Natural Science Foundation of China (Project No. 10601064). Xiaozhi Gao's research work was funded by the Academy of Finland under Grant 214144.

References

1. Frank, E., Hall, M., Pfahringer, B.: Locally Weighted Naive Bayes. In: Proc. of the Conference on Uncertainty in Artificial Intelligence, pp. 249–256 (2003)
2. Zhao, Z., Liu, H.: Searching for Interacting Features. In: Proc. of International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, pp. 1156–1161 (2007)


3. CAUBE.AU (2006), http://www.caube.org.au/spamstats.html
4. Cranor, L.F., LaMacchia, B.A.: Spam! In: Communications of ACM, pp. 74–83. ACM Press, New York (1998)
5. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian Approach to Filtering Junk E-mail. AAAI Technical Report WS-98-05, AAAI 1998 Workshop on Learning for Text Categorization (1998)
6. Schneider, K.M.: A Comparison of Event Models for Naïve Bayes Anti-Spam E-Mail Filtering. In: Proc. of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 307–314 (2003)
7. Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C.D., Stamatopoulos, P.: Learning to Filter Spam E-mail: A Comparison of a Naïve Bayesian and a Memory-based Approach. In: Proc. of the Workshop on Machine Learning and Textual Information Access, pp. 1–13 (2000)
8. Zhang, L., Zhu, J., Yao, T.: An Evaluation of Statistical Spam Filtering Techniques. ACM Trans. Asian Lang. Inf. Process 3, 243–269 (2004)
9. Drucker, H., Wu, D., Vapnik, V.N.: Support Vector Machines for Spam Categorization. IEEE Trans. on Neural Networks 10, 1048–1054 (1999)
10. Kolcz, A., Alspector, J.: SVM-based Filtering of E-mail Spam with Content-specific Misclassification Costs. In: Proc. of the TextDM 2001 Workshop on Text Mining - held at the 2001 IEEE International Conference on Data Mining (2001)
11. Sakkis, G., Androutsopoulos, I., Paliouras, G., Stamatopoulos, P.: A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists. Information Retrieval 6, 49–73 (2003)
12. Yu, L., Liu, H.: Feature Selection for High-dimensional Data: A Fast Correlation-based Filter Solution. In: Proc. of the 20th International Conference on Machine Learning, Washington DC, pp. 856–863 (2003)
13. Carreras, X., Marquez, L.: Boosting Trees for Anti-spam Email Filtering. In: Proc. International Conference on Recent Advances in Natural Language Processing (RANLP 2001), Tzigov Chark, Bulgaria, pp. 58–64 (2001)
14. Méndez, J.R., Iglesias, E.L., Fdez-Riverola, F., Díaz, F., Corchado, J.M.: Analyzing the Impact of Corpus Preprocessing on Anti-Spam Filtering Software. Research on Computing Science 17, 129–138 (2005)
15. Méndez, J.R., Fdez-Riverola, F., Díaz, F., Iglesias, E.L., Corchado, J.M.: A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain. In: Perner, P. (ed.) ICDM 2006. LNCS (LNAI), vol. 4065, pp. 106–120. Springer, Heidelberg (2006)
16. Email Benchmark Corpus (2006), http://www.aueb.gr/users/ion/publications.html
17. Kononenko, I.: Estimating Attributes: Analysis and Extensions of Relief. In: Proc. of European Conference on Machine Learning, pp. 171–182. Springer, Heidelberg (1994)

Structural Support Vector Machine

Hui Xue1, Songcan Chen1,⋆, and Qiang Yang2

1 Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics, 210016, Nanjing, P.R. China
2 Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong
{xuehui,s.chen}@nuaa.edu.cn, [email protected]
http://parnec.nuaa.edu.cn

Abstract. Support Vector Machine (SVM) is one of the most popular classifiers in pattern recognition; it aims to find a hyperplane that separates two classes of samples with the maximal margin. As a result, traditional SVM usually focuses more on the scatter between classes and neglects the different data distributions within classes, which are also vital for an optimal classifier in different real-world problems. Recently, using as much of the structural information hidden in a given dataset as possible to improve the generalization ability of a classifier has yielded a class of effective large margin classifiers, typified by the Structured Large Margin Machine (SLMM). SLMM is generally derived by optimizing a corresponding objective function using second-order cone programming (SOCP). In contrast to SVM, which is developed from a quadratic programming (QP) problem, SLMM is more effective in classification performance but has the following shortcomings: 1) large time complexity; 2) lack of sparsity of the solution; and 3) poor scalability to the size of the dataset. In this paper, following the same line of research, we develop a novel algorithm, termed Structural Support Vector Machine (SSVM), by embedding the structural information directly into the SVM objective function rather than into the constraints as in SLMM. In this way, SSVM 1) overcomes the above three shortcomings; 2) achieves empirically better or comparable generalization relative to SLMM; and 3) achieves theoretically and empirically better generalization than SVM.

Keywords: Support vector machine, Structural information, Rademacher complexity, Pattern recognition.

1 Introduction

In the past decade, large margin machines have become a hot topic of research in machine learning. Support Vector Machine (SVM) [1], the most famous among them, is derived from statistical learning theory [2] and has achieved great success in pattern recognition.

⋆ Corresponding author: Tel: +86-25-84896481 Ext.12106; Fax: +86-25-84498069. This work was supported respectively by NSFC (60773061) and Jiangsu NSF (BK2008xxx).

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 501–511, 2008. © Springer-Verlag Berlin Heidelberg 2008


Given a training set $\{x_i, y_i\}_{i=1}^{n} \in \mathbb{R}^m \times \{\pm 1\}$, the basic objective of SVM is to learn a classifier $f = w^T x + b$ that maximizes the margin between classes:

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \; i = 1, \cdots, n \qquad (1)$$

If we focus on the constraints in (1), we can immediately capture the following insight about SVM, which is easily generalized to the soft margin version:

Theorem 1. SVM constrains the scatter between classes as $w^T S_b w \ge 4$, where $S_b = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ and $\mu_i$ is the mean of class $i$ ($i = 1, 2$).

Proof. Without loss of generality, we assume that class one has the class label $y_i = 1$ and class two has $y_j = -1$. Then we reformulate the constraints as $w^T x_i + b \ge 1$, where $x_i$ belongs to class one, and $w^T x_j + b \le -1$, where $x_j$ belongs to class two. Let the numbers of samples in the two classes be $n_1$ and $n_2$ respectively. Averaging the constraints within each class, we have $\frac{1}{n_1}\sum_{i=1}^{n_1}(w^T x_i + b) = w^T \mu_1 + b \ge 1$ and $-\frac{1}{n_2}\sum_{j=1}^{n_2}(w^T x_j + b) = -(w^T \mu_2 + b) \ge 1$. Adding the two inequalities, we obtain $w^T(\mu_1 - \mu_2) \ge 2$. Squaring the inequality, we further have $w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w \ge 4$, i.e. $w^T S_b w \ge 4$. □

Consequently, following the above theorem, SVM actually gives a natural lower bound to the scatter between classes, in accordance with its original motivation of maximizing the margin. However, it discards the prior information about the data distribution within classes, which is also vital for classification. In fact, in different real-world problems, different classes may have different underlying data structures. The classifier should therefore adjust its discriminant boundaries to fit these structures, which matter for classification and especially for the generalization capacity of the classifier. The traditional SVM, however, does not differentiate the structures, and the derived decision hyperplane lies unbiasedly right in the middle of the support vectors [3, 4], which may lead to a nonoptimal classifier in real-world problems. Recently, some new large margin machines have been presented that pay more attention to the structural information than SVM does.
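Theorem 1 can be checked numerically on a toy example. The Python sketch below uses made-up two-dimensional points and a hand-picked hyperplane (both assumptions for illustration only, with hypothetical function names); it verifies that the hyperplane satisfies the hard-margin constraints of (1) and then evaluates $w^T S_b w = (w^T(\mu_1 - \mu_2))^2$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def is_feasible(w, b, class_one, class_two):
    """Hard-margin constraints of (1): y_i (w^T x_i + b) >= 1 for all samples."""
    return (all(dot(w, x) + b >= 1 for x in class_one)
            and all(dot(w, x) + b <= -1 for x in class_two))


def between_class_scatter(w, class_one, class_two):
    """w^T S_b w with S_b = (mu_1 - mu_2)(mu_1 - mu_2)^T."""
    diff = [m1 - m2 for m1, m2 in zip(mean(class_one), mean(class_two))]
    return dot(w, diff) ** 2


# Toy data and a feasible (not necessarily optimal) hyperplane.
class_one = [(2.0, 0.0), (3.0, 1.0)]
class_two = [(-2.0, 0.0), (-3.0, -1.0)]
w, b = (0.5, 0.0), 0.0
```

Here the class means are (2.5, 0.5) and (−2.5, −0.5), so $w^T(\mu_1 - \mu_2) = 2.5$ and $w^T S_b w = 6.25$, which respects the bound of 4 stated in the theorem.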
They provide a novel view for designing a classifier: the classifier should be sensitive to the structure of the data distribution, under the assumption that the data contains clusters. Minimax Probability Machine (MPM) [5] and Maxi-Min Margin Machine (M$^4$) [3] stress the global structure of the two classes and apply two ellipsoids, i.e. two clusters, to characterize the class distributions respectively. By using the Mahalanobis distance, which combines the mean and covariance of the ellipsoids, they integrate the global structural information into the large margin machines. However, emphasizing only the global structure of the classes is too coarse. In many real-world problems, samples within a class are more likely to follow different distributions. Therefore, Structured Large Margin Machine (SLMM) [4] first applies clustering methods to capture the underlying structures in each class. As a result, SLMM uses several ellipsoids, whose number equals the number of clusters, to enclose the training data, rather than only two ellipsoids, one per class, as in M$^4$. The optimization problem in soft margin SLMM can be formulated as (2) [4], which introduces the covariance matrix of each cluster into the constraints:

$$\max \; \rho - C \sum_{l=1}^{|P|+|N|} \xi_l$$
$$\text{s.t.} \quad (w^T x_l + b) \ge \frac{|P_i|}{Max_P}\, \rho\, \sqrt{w^T \Sigma_{P_i} w} - \xi_l, \quad x_l \in P_i,$$
$$-(w^T x_l + b) \ge \frac{|N_j|}{Max_N}\, \rho\, \sqrt{w^T \Sigma_{N_j} w} - \xi_l, \quad x_l \in N_j,$$
$$w^T r = 1, \quad \xi_l \ge 0 \qquad (2)$$

where $\xi_l$ is the penalty for violating the constraints, and $C$ is a regularization parameter that makes a trade-off between the margin and the penalties incurred. $P_i$ denotes the $i$th cluster in class one, $i = 1, \cdots, C_P$, and $N_j$ denotes the $j$th cluster in class two, $j = 1, \cdots, C_N$, where $C_P$ and $C_N$ are the numbers of clusters in the two classes respectively. $r$ is a constant vector that limits the scale of the weight $w$. By simple algebraic deduction, MPM, M$^4$ and even SVM can all be viewed as special cases of SLMM. Experimentally, SLMM also achieves better classification performance than these popular large margin machines. However, SLMM has much larger time complexity than SVM. Its optimization problem must be solved by SOCP, which is relatively difficult to handle in real applications, and the corresponding solution loses the sparsity of SVM, which is derived from a QP problem. Consequently, SLMM scales poorly with the size of the dataset and cannot easily be generalized to large-scale or multi-class problems. Furthermore, in the kernel version, SLMM must kernelize the covariance matrix of each cluster within the constraints, which adds further computational complexity.

In this paper, we present a novel classification algorithm that provides a general way to incorporate the structural information into the learning framework of the traditional SVM. We call our method SSVM, which stands for Structural Support Vector Machine. Inspired by SLMM, SSVM also first exploits the intrinsic structures of the samples within classes by unsupervised clustering methods, but then introduces the data distributions of the clusters in the different classes directly into the traditional optimization function of SVM rather than into the constraints. The contributions of SSVM can be described as follows:

• SSVM naturally integrates the prior structural information within classes into SVM, without destroying the classical framework of SVM. The corresponding optimization problem can be solved by QP, just as for SVM. Consequently, SSVM can overcome the above shortcomings of SLMM.
• SSVM empirically has comparable or better generalization than SLMM, since it considers the separability between classes and the compactness within classes simultaneously. Although SLMM can capture the structural information within classes by clustering algorithms, it still places more emphasis on the separability between classes, owing to the characteristics of the traditional large margin machines, and therefore is likely not to exploit the prior information sufficiently.
• SSVM can be theoretically proved to have lower Rademacher complexity than SVM, in the sense that it has better generalization capacity, whereas the generalization performance of SLMM is only validated empirically. This further justifies that introducing the data distribution within classes into the classifier design is essential for better recognition.

The rest of the paper is organized as follows. Section 2 presents the proposed Structural Support Vector Machine and also discusses the kernelization of SSVM. In Section 3, the theoretical analysis of the generalization capacity is deduced. Section 4 gives the experimental results. Some conclusions are drawn in Section 5.

2 Structural Support Vector Machine (SSVM)

Following the line of research in SLMM, SSVM also has two steps: clustering and learning. It first adopts clustering techniques to capture the data distribution within classes, and then minimizes the compactness term of each cluster (the within-cluster scatter along w), which leads to further maximizing the margin while taking the data structures into account.

Many clustering methods, such as K-means, nearest neighbor clustering and fuzzy clustering, can be applied in the first step. After the clustering, the structural information is introduced into the objective function through the covariance matrices of the clusters, so the clusters should be compact and spherical for the computation. Following SLMM, we use Ward's linkage clustering in SSVM, which is one of the hierarchical clustering techniques. During the clustering, the Ward's linkage between the clusters to be merged increases as the number of clusters decreases [4]. We can draw a curve to represent this process. By finding the knee point, i.e. the point of maximum curvature of the curve, the number of clusters can be determined automatically. Furthermore, Ward's linkage clustering is also applicable in the kernel space.

After clustering, we obtain $c_1$ and $c_2$ clusters in the two classes respectively. We denote the clusters in the classes as $P_1, \dots, P_{c_1}$ and $N_1, \dots, N_{c_2}$. From Theorem 1, we have proved that SVM gives a natural lower bound on the separability between classes through the constraints. So here we pay more attention to the compactness within classes, that is, to the clusters that cover the different structural information in the different classes. We aim to maximize the margin and simultaneously minimize the compactness term. Accordingly, the SSVM model in the soft margin version can be formulated as:

$$
\min_{w,\,b}\ \frac{1}{2}\|w\|^2 + \frac{\lambda}{2}\, w^T \Sigma\, w + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.}\quad y_i (w^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, n
\tag{3}
$$

where $\Sigma = \Sigma_{P_1} + \cdots + \Sigma_{P_{c_1}} + \Sigma_{N_1} + \cdots + \Sigma_{N_{c_2}}$, and $\Sigma_{P_i}$ and $\Sigma_{N_j}$ are the covariance matrices of the ith and jth clusters in the two classes, $i = 1, \dots, c_1$, $j = 1, \dots, c_2$. $\lambda \ge 0$ is the parameter that regulates the relative importance of the structural information within the clusters.

Compared to SVM, SSVM inherits the advantage of SLMM of incorporating the data distribution information in a local way: it considers the covariance matrices of the clusters in each class, which contain the statistical trend of data occurrence [4]. However, different from SLMM, SSVM introduces the prior information directly into the objective function rather than into the constraints. Therefore, SSVM can use the same techniques as SVM to solve the optimization problem, which mitigates the large computational complexity of SLMM. The algorithm can efficiently converge to the global optimum, which also retains sparsity, and it scales better with the size of the datasets. Moreover, by minimizing the compactness term of the clusters, SSVM is more likely to further maximize the margin between classes, which may lead to comparable or better classification and generalization performance than SLMM. We will address these points in more detail in the following sections.

By incorporating the constraints into the objective function, we can rewrite (3) as a primal Lagrangian. Then we transform the primal into the dual problem following the same steps as in SVM:

$$
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left[ x_i^T (I + \lambda \Sigma)^{-1} x_j \right]
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ i = 1, \dots, n,\qquad \sum_{i=1}^{n} \alpha_i y_i = 0
\tag{4}
$$

Eq. (4) is a typical convex optimization problem. Using the same QP techniques as SVM, we can obtain the solution. The derived classifier function, which is used to predict the class labels of future unseen samples x, can then be formulated as follows:

$$
f(x) = \operatorname{sgn}\left[ \sum_{i=1}^{n} \alpha_i y_i\, x_i^T (I + \lambda \Sigma)^{-1} x + b \right]
\tag{5}
$$
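To make the role of $\Sigma$ in (4) and (5) concrete, the following is a minimal sketch, in plain Python, of how the summed cluster covariance matrix and the modified inner product $x^T (I + \lambda\Sigma)^{-1} y$ could be computed in the linear case. The helper names (`cluster_cov`, `structural_cov`, `solve`, `ssvm_kernel`) and the toy clusters are illustrative assumptions, not part of the original method description; the covariance uses the $1/|C|$ normalization that appears in Eq. (7).

```python
# Illustrative sketch only: all function names and the toy data are assumptions.

def cluster_cov(points):
    """Covariance (1/|C|) * sum (x - mu)(x - mu)^T of a single cluster."""
    n, d = len(points), len(points[0])
    mu = [sum(p[k] for p in points) / n for k in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for p in points:
        for a in range(d):
            for b in range(d):
                cov[a][b] += (p[a] - mu[a]) * (p[b] - mu[b]) / n
    return cov

def structural_cov(clusters):
    """Sigma: the sum of the covariance matrices of all clusters of both classes."""
    d = len(clusters[0][0])
    sigma = [[0.0] * d for _ in range(d)]
    for cluster in clusters:
        cov = cluster_cov(cluster)
        for a in range(d):
            for b in range(d):
                sigma[a][b] += cov[a][b]
    return sigma

def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def ssvm_kernel(x, y, sigma, lam):
    """Modified inner product x^T (I + lam * Sigma)^{-1} y used in (4) and (5)."""
    d = len(x)
    A = [[(1.0 if a == b else 0.0) + lam * sigma[a][b] for b in range(d)]
         for a in range(d)]
    z = solve(A, y)  # z = (I + lam * Sigma)^{-1} y
    return sum(x[k] * z[k] for k in range(d))

# Toy example: one cluster per class; here Sigma turns out to be the identity.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 1.0), (5.0, 3.0)]]
sigma = structural_cov(clusters)
print(ssvm_kernel([1.0, 2.0], [3.0, 4.0], sigma, 1.0))  # 5.5
print(ssvm_kernel([1.0, 2.0], [3.0, 4.0], sigma, 0.0))  # 11.0, the plain dot product
```

With $\lambda = 0$ the modified inner product reduces to the ordinary dot product, which mirrors the statement that SVM is a special case of SSVM. A Gram matrix built from `ssvm_kernel` could, under these assumptions, be passed to any standard SVM dual solver.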

It is noteworthy that SSVM boils down to the same solution framework as SVM, except for one additional regularization parameter λ. When λ = 0, SSVM degenerates to the traditional SVM. Thus SVM can be viewed as a special case of SSVM.

We can also apply the kernel trick in SSVM in order to further improve the classification performance in complex pattern recognition problems. Furthermore, compared to SLMM, which has to kernelize each cluster covariance matrix separately, SSVM performs the kernelization on the sum of all the cluster covariance matrices, which makes it simpler and more effective. Assume that the nonlinear mapping function is $\Phi : R^m \to H$, where H is a high-dimensional Hilbert space. Then the optimization problem of SSVM in the kernel space can be described as:

$$
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left[ \Phi(x_i)^T (I + \lambda \Sigma^{\Phi})^{-1} \Phi(x_j) \right]
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ i = 1, \dots, n,\qquad \sum_{i=1}^{n} \alpha_i y_i = 0
\tag{6}
$$

Due to the high (even infinite) dimension, Φ usually cannot be formulated explicitly. A solution to this problem is to express all computations in terms of dot products, which is known as the kernel trick [1]. The kernel function $k : R^m \times R^m \to R$, $k(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, yields the corresponding kernel matrix $K \in R^{n \times n}$, $K_{ij} = k(x_i, x_j)$, the so-called Gram matrix. Consequently, we aim to transform (6) into a form of dot products so that the kernel trick can be adopted. For each covariance matrix in the kernel space, we have

$$
\Sigma_i^{\Phi} = \frac{1}{|C_i^{\Phi}|} \sum_{\Phi(x_j) \in C_i^{\Phi}} [\Phi(x_j) - \mu_i^{\Phi}][\Phi(x_j) - \mu_i^{\Phi}]^T
= \frac{1}{|C_i^{\Phi}|}\, T_i^{\Phi} T_i^{\Phi T} - T_i^{\Phi}\, \vec{1}_{|C_i^{\Phi}|} \vec{1}_{|C_i^{\Phi}|}^{T}\, T_i^{\Phi T}
\tag{7}
$$

where $C_i^{\Phi}$ denotes the clusters without differentiating between the classes, $i \in [1, c_1 + c_2]$. $T_i^{\Phi}$ is a subset of the sample matrix, composed of the samples belonging to cluster i in the kernel space. $\vec{1}_{|C_i^{\Phi}|}$ denotes a $|C_i^{\Phi}|$-dimensional vector with all components equal to $1/|C_i^{\Phi}|$. Then we obtain

$$
\Sigma^{\Phi} = \sum_{i=1}^{c_1 + c_2} \Sigma_i^{\Phi}
= \sum_{i=1}^{c_1 + c_2} \left[ \frac{1}{|C_i^{\Phi}|}\, T_i^{\Phi} T_i^{\Phi T} - T_i^{\Phi}\, \vec{1}_{|C_i^{\Phi}|} \vec{1}_{|C_i^{\Phi}|}^{T}\, T_i^{\Phi T} \right]
= P^{\Phi}\, \Psi\, P^{\Phi T}
\tag{8}
$$

where $P^{\Phi} = [T_1^{\Phi}, \dots, T_{c_1 + c_2}^{\Phi}]$,

$$
\Psi = \begin{pmatrix}
\frac{1}{|C_1^{\Phi}|} I_{|C_1^{\Phi}|} - \vec{1}_{|C_1^{\Phi}|} \vec{1}_{|C_1^{\Phi}|}^{T} & & \\
& \ddots & \\
& & \frac{1}{|C_{c_1+c_2}^{\Phi}|} I_{|C_{c_1+c_2}^{\Phi}|} - \vec{1}_{|C_{c_1+c_2}^{\Phi}|} \vec{1}_{|C_{c_1+c_2}^{\Phi}|}^{T}
\end{pmatrix}
$$

and $I_{|C_i^{\Phi}|}$ is a $|C_i^{\Phi}| \times |C_i^{\Phi}|$ identity matrix, $i \in [1, c_1 + c_2]$.


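By Woodbury's formula

$$
(A + UBV)^{-1} = A^{-1} - A^{-1} U B\, (B + B V A^{-1} U B)^{-1} B V A^{-1}
\tag{9}
$$

So

$$
(I + \lambda \Sigma^{\Phi})^{-1} = (I + \lambda P^{\Phi} \Psi P^{\Phi T})^{-1}
= I - \lambda P^{\Phi} \Psi\, (\Psi + \lambda \Psi P^{\Phi T} P^{\Phi} \Psi)^{-1} \Psi P^{\Phi T}
\tag{10}
$$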
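The step from (9) to (10) can be checked by direct substitution. The identification of A, U, B and V below is not spelled out in the text, so it should be read as an assumption about how (9) was applied; it also presumes that the inverse on the right-hand side exists.

$$
A = I,\qquad U = \lambda P^{\Phi},\qquad B = \Psi,\qquad V = P^{\Phi T}
$$

$$
(I + \lambda P^{\Phi} \Psi P^{\Phi T})^{-1}
= I - (\lambda P^{\Phi})\, \Psi \left( \Psi + \Psi P^{\Phi T} (\lambda P^{\Phi}) \Psi \right)^{-1} \Psi P^{\Phi T}
= I - \lambda P^{\Phi} \Psi \left( \Psi + \lambda \Psi P^{\Phi T} P^{\Phi} \Psi \right)^{-1} \Psi P^{\Phi T}
$$

Every occurrence of $\Phi$ on the right-hand side now appears only inside products of the form $P^{\Phi T} P^{\Phi}$ or, once the expression is sandwiched between $\Phi(x_i)^T$ and $\Phi(x_j)$, inside $\Phi(x_i)^T P^{\Phi}$ and $P^{\Phi T} \Phi(x_j)$, all of which are matrices of kernel evaluations.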

By substituting (10) into the optimization function (6), we have the kernel form of the dual problem (6) as follows:

$$
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left[ K_{ij} - \lambda \tilde{K}_i^{T} \Psi\, (\Psi + \lambda \Psi \hat{K} \Psi)^{-1} \Psi \tilde{K}_j \right]
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ i = 1, \dots, n,\qquad \sum_{i=1}^{n} \alpha_i y_i = 0
\tag{11}
$$

where $\tilde{K}_i$ represents the ith column of the kernel Gram matrix $\tilde{K}$, $\tilde{K}_{ij} = k(x_i^{C_t}, x_j)$, and $x_i^{C_t}$ is the sample realigned according to the sequence of the clusters, $t = 1, \dots, c_1 + c_2$. $\hat{K}$ is the kernel Gram matrix with $\hat{K}_{ij} = k(x_i^{C_t}, x_j^{C_t})$.

3 Rademacher Complexity

In this section, we discuss the generalization capacity of SSVM in theory. Different from SLMM, whose better generalization performance than SVM is validated only empirically by experiments, we prove that introducing the structural information within classes improves the generalization bound compared to SVM. Here we adopt the Rademacher complexity measure [6] and show that the new error bound is tighter.

For traditional kernel machines, the VC-dimension [2] is customarily used to estimate the generalization error bound of a classifier. However, that bound involves a fixed complexity penalty which does not depend on the training data, and thus cannot be universally effective [6]. Recently, Rademacher complexity has been presented as an alternative notion for evaluating the complexity of a classifier instead of the classical VC-dimension [7]. For kernel machines, we can obtain an upper bound on the Rademacher complexity:

Theorem 2 [6]. If $k : X \times X \to R$ is a kernel, and $S = \{x_1, \dots, x_n\}$ is a sample of points from X, then the empirical Rademacher complexity of the classifier $F_B$ satisfies

$$
\hat{R}_n(F_B) \le \frac{2B}{n} \sqrt{\sum_{i=1}^{n} k(x_i, x_i)} = \frac{2B}{n} \sqrt{\operatorname{tr}(K)}
\tag{12}
$$

where B is the bound on the weights w in the classifier.
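As a small numerical illustration of the bound in (12), the sketch below evaluates $(2B/n)\sqrt{\operatorname{tr}(K)}$ for a given kernel and sample. The function names and the Gaussian kernel parameterization $\exp(-\|x - y\|^2 / (2\sigma^2))$ are assumptions made for this example; the text does not fix a particular parameterization.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel; the exp(-||x-y||^2 / (2 sigma^2)) form is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def rademacher_bound(samples, kernel, B):
    """Right-hand side of (12): (2B / n) * sqrt(trace of the Gram matrix)."""
    trace = sum(kernel(x, x) for x in samples)  # only the diagonal k(x_i, x_i) is needed
    return 2.0 * B / len(samples) * math.sqrt(trace)

samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(rademacher_bound(samples, gaussian_kernel, B=1.0))  # 1.0, since k(x, x) = 1 for every x
```

Because only the diagonal of the Gram matrix enters the bound, any modification of the kernel that shrinks the values $k(x_i, x_i)$ also shrinks the bound.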


Following Theorem 2, we then give the complexity analysis of SSVM compared to SVM.

Theorem 3 (Complexity Analysis). The upper bound of the empirical Rademacher complexity $\hat{R}_{SSVM}(f)$ in SSVM is at most the upper bound of $\hat{R}_{SVM}(f)$ in SVM, that is, $\operatorname{tr}(K_{SSVM}) \le \operatorname{tr}(K_{SVM})$.

Due to limited space, we omit the proof here. The theorem states that there is an advantage in considering the separability between classes and the compactness within classes simultaneously, i.e. in using the structural information within the clusters, to further reduce the Rademacher complexity of the classifiers being considered. Intuitively, minimizing the compactness term of the clusters is more likely to lead to a larger margin than in SVM, which means better generalization performance in practice. Theorem 3 provides a theoretical interpretation of this intuition.
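Since the proof is omitted, the following sketch indicates why the trace inequality is plausible in the linear case. It assumes that $K_{SSVM}$ denotes the Gram matrix of the inner product appearing in (4), and it is not claimed to be the authors' argument.

$$
\Sigma \succeq 0,\ \lambda \ge 0
\ \Longrightarrow\ I + \lambda\Sigma \succeq I
\ \Longrightarrow\ (I + \lambda\Sigma)^{-1} \preceq I
$$

$$
(K_{SSVM})_{ii} = x_i^T (I + \lambda\Sigma)^{-1} x_i \ \le\ x_i^T x_i = (K_{SVM})_{ii}
\ \Longrightarrow\
\operatorname{tr}(K_{SSVM}) = \sum_{i=1}^{n} (K_{SSVM})_{ii} \ \le\ \sum_{i=1}^{n} (K_{SVM})_{ii} = \operatorname{tr}(K_{SVM})
$$

The first implication uses only that $\Sigma$ is a sum of covariance matrices and is therefore positive semidefinite.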

4 Experiments

To evaluate the proposed Structural Support Vector Machine (SSVM) algorithm, we investigate its classification accuracy and computational efficiency on several real-world UCI datasets. Since the Structured Large Margin Machine (SLMM) [4] has been shown to be more effective, in terms of classification accuracy, than many relatively modern learning machines, such as the Minimax Probability Machine (MPM) [5], the MaxiMin Margin Machine (M$^4$) [3] and Radial Basis Function Networks (RBFN), in this experiment we compare SSVM only with SLMM and SVM. For each dataset, we divide the samples into two non-overlapping training and testing sets, each containing almost half of the samples of each class. This process is repeated ten times to generate ten independent runs for each dataset, and the average results are reported. Because the kernel version performs relatively better, we uniformly compare the algorithms in the kernel, soft margin setting. The width parameter σ of the Gaussian kernel and the regularization parameters C and λ are selected from the set $\{2^{-10}, 2^{-9}, \dots, 2^{9}, 2^{10}\}$ by cross-validation. We apply the Sequential Minimal Optimization (SMO) algorithm to solve the QP problems in SSVM and SVM, and the SeDuMi program to solve the SOCP problem in SLMM.

The experimental results are listed in Table 1. In each block of the table, the first row is the training accuracy and variance, the second row is the testing accuracy and variance, and the third row is the average training time over the ten runs after the parameters have been selected.

Table 1. The training and testing accuracies (%), variances and average training time (sec.) of SSVM compared with SLMM and SVM on the UCI datasets

Dataset      Measure        SSVM            SLMM            SVM
Automobile   Train acc.     96.25 ± 0.01    95.31* ± 0.01   95.63* ± 0.01
             Test acc.      91.14 ± 0.00    88.63* ± 0.03   88.48* ± 0.01
             Time (sec.)    0.44            3.20            0.36
Bupa         Train acc.     77.36 ± 0.10    76.03* ± 0.15   75.68* ± 0.08
             Test acc.      76.18 ± 0.04    73.52* ± 0.12   73.06* ± 0.06
             Time (sec.)    1.23            18.77           0.89
Hepatitis    Train acc.     84.10 ± 0.01    82.59* ± 0.01   79.87* ± 0.00
             Test acc.      83.25 ± 0.00    79.82* ± 0.03   79.61* ± 0.01
             Time (sec.)    0.58            3.75            0.42
Ionosphere   Train acc.     98.46 ± 0.00    96.97* ± 0.03   96.80* ± 0.02
             Test acc.      97.52 ± 0.01    95.63* ± 0.05   95.11* ± 0.02
             Time (sec.)    1.17            5.71            0.79
Pima         Train acc.     79.65 ± 0.02    80.63 ± 0.05    76.04* ± 0.01
             Test acc.      78.63 ± 0.01    79.46 ± 0.02    77.08* ± 0.02
             Time (sec.)    12.53           72.14           7.67
Sonar        Train acc.     95.58 ± 0.02    95.27 ± 0.01    86.54* ± 0.15
             Test acc.      87.60 ± 0.07    86.21* ± 0.11   85.00* ± 0.13
             Time (sec.)    0.61            3.34            0.50
Water        Train acc.     98.81 ± 0.02    95.61* ± 0.10   98.47 ± 0.02
             Test acc.      98.69 ± 0.01    95.49* ± 0.12   90.51* ± 0.09
             Time (sec.)    0.39            1.56            0.29
Wdbc         Train acc.     95.96 ± 0.00    94.89* ± 0.05   92.54* ± 0.01
             Test acc.      95.72 ± 0.00    94.57* ± 0.03   94.25* ± 0.01
             Time (sec.)    3.58            43.65           2.77

'*' denotes that the difference between SSVM and the other method is significant at the 5% significance level, i.e., t-value > 1.7341.

We can make several interesting observations from these results:

• SSVM is consistently superior to SVM on all the datasets in both training and testing accuracy, owing to the proper consideration of data distribution information. Furthermore, SSVM also outperforms SLMM on almost all the datasets except Pima, because SSVM simultaneously captures the separability between classes and the compactness within classes, rather than only emphasizing the separability as SLMM does, which may miss some useful classification information. The gap between the classification accuracies of the two algorithms on Pima is less than one percent.

• The training and testing accuracies of SSVM are broadly comparable on each dataset, which provides experimental support for its better generalization capacity than SVM, in accordance with the theoretical analysis in Theorem 3. The variances show the good stability of the SSVM algorithm.

• We also report the average training time of the three algorithms. SSVM is slower than SVM due to the clustering pre-processing. However, it is much quicker than SLMM, which adopts SOCP as the optimizer rather than the QP used in SSVM. Consequently, in view of efficiency as well as classification performance, SSVM is likely the best option among the three algorithms.

• In order to find out whether SSVM is significantly better than SLMM and SVM, we perform the t-test on the classification results of the ten runs to calculate the statistical significance of SSVM. The null hypothesis H0 states that there is no significant difference between the mean number of patterns correctly classified by SSVM and by the other two methods. If the hypothesis H0 for a dataset is rejected at the 5% significance level, i.e., the t-test value is more than 1.7341, the corresponding results in Table 1 are marked with '*'. As shown in Table 1, SSVM possesses significantly superior classification performance compared with the other two methods on almost all datasets, especially according to the testing accuracies. On Pima, there seems to be no significant difference between SSVM and SLMM, i.e. t-value < 1.7341. This accords with our conclusions.
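The significance test described above can be sketched as follows. The threshold 1.7341 coincides with the one-sided 5% critical value of Student's t distribution with 18 degrees of freedom, which is what a pooled two-sample test on two sets of ten runs would use; that the pooled form was the one applied is an assumption, and the accuracy values in the example are made up purely for illustration.

```python
import math
from statistics import mean, variance

T_CRIT = 1.7341  # threshold quoted in the text

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))

# Hypothetical test accuracies (%) of two methods over ten runs, for illustration only.
method_a = [91.0, 91.5, 90.8, 91.2, 91.4, 90.9, 91.1, 91.3, 91.0, 91.2]
method_b = [88.5, 88.9, 88.1, 88.6, 88.4, 88.7, 88.3, 88.8, 88.2, 88.5]

t_value = pooled_t(method_a, method_b)
print("significant at the 5% level:", t_value > T_CRIT)
```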

5 Conclusion

In this paper, we propose a novel large margin machine called the Structural Support Vector Machine (SSVM). Following the research on SLMM, SSVM also first captures the data distribution information in the classes by clustering. Based on the insights about the constraints in the traditional SVM, we further introduce the compactness within classes, derived from the structural information, into the learning framework of SVM. The new optimization problem can be solved by the same QP as SVM, rather than by the SOCP used in recent related algorithms such as MPM, M$^4$ and SLMM. Consequently, SSVM not only has much lower time complexity but also retains the sparsity of the solution. Furthermore, we validate that SSVM has better generalization capacity than SVM both in theory and in practice, and that its classification performance is better than or comparable to that of these related algorithms.

Throughout the paper, we discuss SSVM for binary classification problems. However, SSVM can be generalized to multi-class problems by using vector-labeled output techniques, and to large-scale problems by combining it with minimum enclosing ball techniques [8]. These issues will be the subject of our future research.

References

1. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
2. Vapnik, V.: Statistical Learning Theory. Wiley, Chichester (1998)
3. Huang, K., Yang, H., King, I., Lyu, M.R.: Learning Large Margin Classifiers Locally and Globally. In: ICML (2004)
4. Yeung, D.S., Wang, D., Ng, W.W.Y., Tsang, E.C.C., Zhao, X.: Structured Large Margin Machines: Sensitive to Data Distributions. Machine Learning 68, 171–200 (2007)
5. Lanckriet, G.R.G., Ghaoui, L.E., Bhattacharyya, C., Jordan, M.I.: A Robust Minimax Approach to Classification. JMLR 3, 555–582 (2002)
6. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
7. Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. JMLR 3, 463–482 (2002)
8. Tsang, I.W., Kocsor, A., Kwok, J.T.: Simpler Core Vector Machines with Enclosing Balls. In: ICML (2007)

The Turning Points on MLP’s Error Surface

Hung-Han Chen
8787 Southside Boulevard, Suite #503, Jacksonville, Florida 32256, USA
[email protected]

Abstract. This paper presents a different view on the issue of local minima and introduces a new search method for the Backpropagation learning algorithm of Multi-Layer Perceptrons (MLP). In the conventional point of view, Backpropagation may be trapped at local minima instead of finding the global minimum. This concept often reduces the confidence that people have in neural networks. However, one could argue that most local minima may be caused by the limitations of search methods. Therefore a new search method to address this situation is proposed in this paper. This new method, “retreat and turn”, has been applied to several different types of data, alone or combined with other techniques. The encouraging results are included in this paper.

Keywords: MLP neural networks, Backpropagation, Error Surface, Escaping Local Minima, Retreat and Turn.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 512–520, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Neural networks have been one of the important methods for problem solving based upon the concept of artificial intelligence. The easy-to-use supervised learning algorithm, Backpropagation, has made Multi-Layer Perceptron (MLP) neural networks popular for solving pattern recognition problems. However, almost since the beginning, MLP neural networks have been criticized on several grounds by many researchers. One of the major criticisms is that Backpropagation may be trapped at local minima instead of finding the global minimum.

“It is both well known and obvious that hill climbing does not always work. The simplest way to fail is to get stuck on a local minimum [1].” When people treat the Backpropagation learning algorithm as a variation of hill climbing techniques, they often believe that Backpropagation may be trapped at local minima. However, it is not clear whether such a situation is caused by the limitation of the search choices or by the gradients really descending nowhere on the error surface.

No matter what causes local minima, generating random learning factors and scanning through the neighborhood are normally used to escape them. However, random learning factors are not efficient, scanning is time consuming, and neither guarantees an escape from local minima within a limited time frame. Therefore, it would be better to save these options as a last resort and to find a more efficient method of escaping local minima instead.

This paper presents an innovative search method, “retreat and turn”, to help Backpropagation escape from most local minima. The turning mechanism of the proposed method, which takes the firing status of the neurons into account, can help Backpropagation find a meaningful and efficient path of descent on the error surface. If Backpropagation with this search method can escape from almost all local minima, it may eventually reach the global minimum without any help from complicated mathematical calculation.

Section 2 of this paper discusses the Backpropagation algorithm and its error surface. Section 3 studies local minima and gradients in detail. The “retreat and turn” search method is presented in Section 4. Some examples using this search method are included in Section 5. Final conclusions are discussed in Section 6.

2 Backpropagation and Error Surface

MLP neural networks trained with the Backpropagation learning algorithm can be composed in many different structures. In this paper, we discuss only the form with one nonlinear hidden layer and one linear output layer, as shown in Fig. 1 (a). The error term E is measured in squared form, and the transfer function of the hidden neurons is the sigmoid function. Backpropagation is, in fact, a gradient descent method used with MLP neural networks. Equation (1) indicates that the weights are updated through the gradients on the error surface.

$$
w(j, i) \leftarrow w(j, i) + \eta \cdot \frac{-\partial E}{\partial w(j, i)}
\tag{1}
$$

By using the chain rule to propagate the error term E from the output layer back to the hidden layer, the gradients can be generalized with a delta function as in equation (2). The delta function is then defined in equation (3). The delta function can be calculated as in equation (4) for the output layer and as in equation (5) for the hidden layer.

$$
\frac{\partial E}{\partial w(j, i)} = -\delta(j) \cdot O(i)
\tag{2}
$$

$$
\delta(j) = \frac{-\partial E}{\partial net(j)}
\tag{3}
$$

$$
\delta(n) = f'(net(n)) \cdot (T(n) - O(n))
\tag{4}
$$

$$
\delta(j) = f'(net(j)) \cdot \sum_{n} \delta(n)\, w(n, j) = O(j)\,(1 - O(j)) \cdot \sum_{n} \delta(n)\, w(n, j)
\tag{5}
$$
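Equations (1) to (5) can be sketched for a single training pattern as follows. This is a minimal illustration in plain Python, assuming $E = \frac{1}{2}\sum_n (T(n) - O(n))^2$, a bias input of 1 appended to each layer, and $f' = 1$ for the linear output neurons; the function names and the weight layout are choices made for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_h, W_o):
    """W_h[j] holds the input weights of hidden neuron j (last entry is the bias);
    W_o[n] holds the weights of linear output neuron n (last entry is the bias)."""
    xb = list(x) + [1.0]
    hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W_h] + [1.0]
    out = [sum(w * v for w, v in zip(row, hb)) for row in W_o]
    return xb, hb, out

def error(x, t, W_h, W_o):
    """Squared error E = 1/2 * sum (T - O)^2 (the factor 1/2 is an assumption)."""
    _, _, out = forward(x, W_h, W_o)
    return 0.5 * sum((tn - on) ** 2 for tn, on in zip(t, out))

def deltas(x, t, W_h, W_o):
    xb, hb, out = forward(x, W_h, W_o)
    # Eq. (4) with f' = 1 for a linear output neuron.
    d_out = [tn - on for tn, on in zip(t, out)]
    # Eq. (5): O(j)(1 - O(j)) * sum_n delta(n) w(n, j).
    d_hid = [hb[j] * (1.0 - hb[j]) * sum(d_out[n] * W_o[n][j] for n in range(len(W_o)))
             for j in range(len(W_h))]
    return xb, hb, d_out, d_hid

def update(x, t, W_h, W_o, eta):
    """Eq. (1) combined with Eq. (2): w(j, i) <- w(j, i) + eta * delta(j) * O(i)."""
    xb, hb, d_out, d_hid = deltas(x, t, W_h, W_o)
    new_W_o = [[w + eta * d_out[n] * hb[j] for j, w in enumerate(row)]
               for n, row in enumerate(W_o)]
    new_W_h = [[w + eta * d_hid[j] * xb[i] for i, w in enumerate(row)]
               for j, row in enumerate(W_h)]
    return new_W_h, new_W_o
```

A finite-difference check of $\partial E / \partial w(j, i)$ against $-\delta(j) \cdot O(i)$ is a simple way to confirm that the deltas are implemented consistently with equation (2).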

MLP neural networks with the Backpropagation learning algorithm have been said to have some drawbacks, especially the chance of being trapped at a local minimum; however, they do, in principle, offer all the potential of universal computing devices. They were intuitively appealing to many researchers because of their intrinsic nonlinearity, computational simplicity and resemblance to the behavior of neurons [1]. Therefore, if the issue of local minima can be resolved, we can see the unlimited potential that MLP neural networks may have for the future advancement of machine learning and intelligence.

[Figure 1: (a) network diagram with an Input layer, a nonlinear Hidden layer (units ϕ) and a linear Output layer; (b) error E plotted against the weight W_j, with a local minimum and a global minimum marked.]

Fig. 1. (a) The structure of a MLP neural network and (b) The error surface for W_j

For the Backpropagation learning algorithm, the purpose of changing the weights is to reduce the error. How the error changes along the direction of the gradient as we change a specific weight can be drawn as a one-dimensional error surface, as shown in Fig. 1 (b). Therefore the number of hidden neurons and output weights decides the dimensions of the overall error surface. Since the rest of the weights are also constantly changing along the direction of their gradients, the error landscape will change even if this specific weight stays the same.

There is much research on the topic of the MLP error surface. Frasconi et al. [2] list studies on the surface attached to the cost. Blum [3] identifies local minima for multi-layered networks on the XOR problem. Hush et al. [4] give interesting qualitative indications of the shape of the error surface by investigating small examples. The error surface is comprised of numerous flat and steep regions where the gradients vary by several orders of magnitude. The flat plateaus could extend to infinity in all directions, and the steep troughs may be extremely flat in the direction of search. Kordos et al. [5] identify some important error surface properties in a survey of the factors influencing the MLP error surface. They conclude that the error surface depends on the network structure, training data, and transfer and error functions, but not on the training method. One of the properties is that the error surface of networks with hidden layers has a starfish-like structure with many ravines. Another is that global minima lie at infinity with MSE error functions, and local minima lie in ravines that asymptotically reach higher error values.

3 Local Minima and Gradients

Since an algorithm can search the error surface only a limited number of times in order to descend, it is very possible that being trapped at a local minimum simply means that the search algorithm has not found the right direction and distance to descend on the error surface. In such a case, the point is a local minimum only because of the search algorithm and not because of the topography of the error surface.

This misunderstanding can be illustrated by the proofs and disproofs of local minima for the XOR problem using a simple multilayer Perceptron network. As mentioned earlier, Blum [3] has proven that there is a line of local minima on the error surface. However, other researchers have also proven either that the points on Blum's line are saddle points [6] or that there is no local minimum on the XOR error surface [7]. According to them, Blum's proof is based on incorrect assumptions, and naïve visualization of slices through the error surface may fail to reveal its true nature.

Since the Backpropagation algorithm learns through the gradients on the error surface, we can examine this issue from the viewpoint of gradients.



Lemma 1. Any minimum on the error surface for a gradient descent method, local or global, should satisfy

$$
\sum_{i} \eta \cdot \frac{-\partial E}{\partial W_i} = 0
$$

in vector space. (If the same learning factor is used on every weight, then η can be omitted.) Therefore no further descent can be made by updating weights through gradients. In such a case, either all individual $\partial E / \partial W_i$ are zero, or those individual $\partial E / \partial W_i$ are not all zero but somehow the directional sum of the gradients becomes zero.

There are some properties of gradients for MLP neural networks worth noting here.

Property 1. Zero gradients for a linear output neuron simply imply that the sum of squared errors associated with that neuron is zero in batch mode training. Therefore, zero gradients for all output weights could possibly mean the ultimate minimum E = 0.

Property 2. For hidden neurons, $\partial E / \partial W_i$ is large only when the output squared errors are large or the derivatives of the transfer function, $f'(net(j)) = O(j)(1 - O(j))$, are large. Since the output errors affect each hidden neuron almost equally, if the effect from the output weights is not significantly different, then the biggest gradient mostly occurs when a hidden neuron is the least certain regarding its firing status. In other words, when the net function of a hidden neuron mostly falls in the middle of the sigmoid function curve, its derivative is often larger and this hidden neuron is less certain whether to fire or not.
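Property 2 rests on the shape of the sigmoid derivative $O(j)(1 - O(j))$, which is largest in the middle of the curve. A short sketch (the function names are chosen for this example) makes the point numerically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid expressed through its output: O * (1 - O)."""
    o = sigmoid(z)
    return o * (1.0 - o)

# The derivative peaks at net = 0 (output 0.5, the least certain firing status)
# and shrinks towards zero as the neuron saturates in either direction.
for net in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(net, round(sigmoid(net), 4), round(sigmoid_prime(net), 4))
```

At `net = 0` the derivative takes its maximum value of 0.25, so for comparable back-propagated error a hidden neuron near the middle of its range receives the largest delta.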

4 Turning Points on Error Surface

The remedy for the limitation of Backpropagation search methods, which causes the situation of being trapped at local minima, could lie in the two properties of gradients described in the previous section. Obviously, the situation of all zero gradients is the final target we are pursuing, E = 0. In that case there is no need, and nowhere, to descend on the error surface.

What if the individual $\partial E / \partial W_i$ are not all zero but somehow the directional sum of the gradients becomes zero? This paper presents a new search method, “retreat and turn”, to find an efficient path of descent on the error surface.

“Retreat” is a normal reaction in hill climbing techniques when the error increases at the current iteration. A common practice on retreat is to retrieve the best weights and reduce the learning factor. However, this line search along the sum of the gradient directions is sometimes not enough to find a path of descent on the error surface before the learning factor becomes too small and a bigger one needs to be randomly generated.

As an ancient Chinese idiom says, “If we can’t move the mountain, we can at least make the road turn.” This paper introduces a turning mechanism for the MLP search algorithm. Assume the extreme case in which the directional sum of all gradients is zero. If we take one non-zero $\partial E / \partial W_i$ out of the equation, the directional sum of the gradients becomes non-zero again. Then the learning algorithm can find a path of descent on the error surface if the learning step is small enough.

The question here is which $\partial E / \partial W_i$ should be removed to create the best effect. In the proposed search method, we choose not to remove any gradient of the output weights, since these are directly produced by the output errors. Therefore, the better choices are one or several $\partial E / \partial W_i$ from the hidden layer.

From the previous sections, it is known that the delta function, δ(j), of each hidden neuron is also a key element of the gradients associated with that neuron. Since the largest δ(j) often comes from the least certain hidden neuron, the turning mechanism presented in this paper is to restrict the hidden neuron with the largest δ(j) from updating its weights when the MLP encounters an error increase. Fig. 2 is the flowchart describing the process of the “retreat and turn” search algorithm.

The reason for restricting the hidden neuron with the largest δ(j) from updating its weights can be analyzed from the following aspects. The first concerns the retreat process: the largest δ(j) makes the greatest change to the weights, according to equation (2), and therefore it could be the biggest contributor to the error increase. The second concerns the firing status of the hidden neurons: for the least certain neuron, its contribution to reducing the error can be either very good or very bad, and it is certainly very bad when the error increases.

When the largest δ(j) is removed from the equation, the directional sum of the gradients turns on the error surface. The concept of this search method especially echoes the discovery of ravines and the starfish-like error surface in [4] and [5]. Just as water always flows to lower ground through “troughs” or “ravines”, the error can descend on the surface by turning away from the walls (when encountering an error increase).

[Figure 2: flowchart. Recoverable node labels: backpropagation; error increase? (Yes/No); increase η to double; decrease η to half; η too small? (Yes/No); randomly generate a larger η; reset the δ pool; remove largest δ from pool; no more δ in the pool? (Yes/No); add next large δ to pool; update weights according to the δ pool. The connections between the nodes are not recoverable from the extracted text.]

Fig. 2. The process of “Retreat and Turn” search method for hidden layer
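The loop described in this section can be sketched as follows. This is one possible reading of the method, not the author's implementation: the exact branch order of the flowchart is not recoverable here, so the sketch assumes that an accepted step doubles η and releases all restricted neurons, while a rejected step keeps the best weights, halves η, regenerates η at random when it falls below a threshold, and restricts the not-yet-restricted hidden neuron with the largest batch-summed |δ(j)|. The names `batch_pass`, `retreat_and_turn`, `frozen` and `eta_min`, the range of the random η, and the use of summed |δ| are all assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_pass(X, T, W_h, W_o):
    """Return batch squared error, summed update directions (delta * O) for both
    layers, and the per-hidden-neuron sum of |delta(j)| over the batch."""
    g_h = [[0.0] * len(r) for r in W_h]
    g_o = [[0.0] * len(r) for r in W_o]
    mag = [0.0] * len(W_h)
    err = 0.0
    for x, t in zip(X, T):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(r, xb))) for r in W_h] + [1.0]
        out = [sum(w * v for w, v in zip(r, hb)) for r in W_o]
        d_o = [tn - on for tn, on in zip(t, out)]
        err += 0.5 * sum(d * d for d in d_o)
        for n, d in enumerate(d_o):
            for j, h in enumerate(hb):
                g_o[n][j] += d * h
        for j in range(len(W_h)):
            d_j = hb[j] * (1.0 - hb[j]) * sum(d_o[n] * W_o[n][j] for n in range(len(W_o)))
            mag[j] += abs(d_j)
            for i, v in enumerate(xb):
                g_h[j][i] += d_j * v
    return err, g_h, g_o, mag

def retreat_and_turn(X, T, W_h, W_o, eta=0.1, eta_min=1e-6, iters=2000, seed=0):
    rng = random.Random(seed)
    best_err, g_h, g_o, mag = batch_pass(X, T, W_h, W_o)
    frozen = set()  # hidden neurons currently restricted from updating
    for _ in range(iters):
        # Trial step: restricted hidden neurons keep their incoming weights.
        cand_h = [r[:] if j in frozen else [w + eta * g for w, g in zip(r, g_h[j])]
                  for j, r in enumerate(W_h)]
        cand_o = [[w + eta * g for w, g in zip(r, g_o[n])] for n, r in enumerate(W_o)]
        err, c_gh, c_go, c_mag = batch_pass(X, T, cand_h, cand_o)
        if err < best_err:
            # Descent found: accept the step, enlarge eta, release all neurons.
            W_h, W_o, best_err = cand_h, cand_o, err
            g_h, g_o, mag = c_gh, c_go, c_mag
            eta *= 2.0
            frozen.clear()
        else:
            # Retreat: discard the trial weights (the best ones are kept), shrink eta.
            eta *= 0.5
            if eta < eta_min:
                eta = rng.uniform(0.01, 0.5)  # arbitrary range, an assumption
            # Turn: restrict the free hidden neuron with the largest |delta|.
            free = [j for j in range(len(W_h)) if j not in frozen]
            if free:
                frozen.add(max(free, key=lambda j: mag[j]))
            else:
                frozen.clear()
    return W_h, W_o, best_err
```

Because a trial step is accepted only when the batch error decreases, the returned error can never exceed the initial one; how quickly it decreases depends on the assumptions listed above.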

5 Some Examples

This new search method, “retreat and turn”, has been applied to two different types of data. One is a rare-event problem from healthcare data: predicting the 3-month inpatient risk for 2.4 million insured members [8]. The other is evenly distributed data from the USGS Land Use/Cover Categories [9].

The prevalence of 3-month inpatient risk in the population is only 1.3%. One year of medical and pharmacy claim history is summarized and grouped into 53 features by ICD-9 diagnostic codes, CPT-4 procedure codes and NDC pharmacy codes. This featured data is first “divided” into several subgroups by a Self-Organizing Map (SOM), and then “conquered” by MLP neural networks using the “retreat and turn” search algorithm. This combination of two technologies, Chen’s model [8], has been shown to outperform leading commercial risk score software in the healthcare industry. Table 1 shows the comparison on the validation data. The notation “5k model” means that 5,000 members are targeted as the outputs of the inpatient risk model.

Table 1. Comparison with a commercial risk software

Model                         True Positives   False Positives   Total     Sensitivity   PPV
Commercial Risk Score > 13    1,748            9,099             10,847    5.31%         16.12%
Commercial Risk Score > 14    1,619            8,124             9,743     4.92%         16.62%
Commercial Risk Score > 15    1,531            7,346             8,877     4.65%         17.25%
Commercial Risk Score > 16    1,416            6,679             8,095     4.30%         17.49%
Commercial Risk Score > 17    1,302            6,121             7,423     3.96%         17.54%
Commercial Risk Score > 18    1,213            5,582             6,795     3.69%         17.85%
Commercial Risk Score > 19    1,143            5,116             6,259     3.47%         18.26%
Commercial Risk Score > 20    1,081            4,695             5,776     3.29%         18.72%
Commercial Risk Score > 21    1,019            4,356             5,375     3.10%         18.96%
Commercial Risk Score > 22    973              4,042             5,015     2.96%         19.40%
Chen’s model, 5k model        1,778            2,708             4,486     5.40%         39.63%
Chen’s model, 10k model       2,412            5,913             8,325     7.33%         28.97%
Chen’s model, 15k model       3,004            10,336            13,340    9.13%         22.52%
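The PPV column of the table can be reproduced from the true-positive and false-positive counts, which also confirms how the columns are to be read. The sketch below assumes the usual definition PPV = TP / (TP + FP); the function name is chosen for this example. Sensitivity additionally requires the total number of actual positives, which the table does not list, so it is not recomputed here.

```python
def ppv(true_pos, false_pos):
    """Positive predictive value: share of flagged members who are true positives."""
    return true_pos / (true_pos + false_pos)

# Rows taken from the table: commercial risk score > 13 and the 5k model.
print(round(100 * ppv(1748, 9099), 2))  # 16.12
print(round(100 * ppv(1778, 2708), 2))  # 39.63
```

At a comparable number of true positives (1,748 versus 1,778), the 5k model flags fewer than half as many members in total, which is what the higher PPV expresses.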

[Figure 3: two panels titled comf18, both plotted against Iteration (1 to 9000): (a) MSE and (b) Accuracy. Curves in each panel: BP 18-20-4, Chen 18-20-4, BP 18-100-4, Chen 18-100-4.]

Fig. 3. (a) MSE (b) Accuracy of the MLP simulations for comf18

The USGS Land Use/Cover Categories data, named comf18, are generated from segmented images of four classes. Each segmented region is separately histogram-equalized to 20 levels. Then the joint probability density of pairs of pixels separated by a given distance and a given direction is estimated. For each separation, the co-occurrences for the four directions are folded together to form a triangular matrix. From each of the resulting three matrices, six features are computed: angular second moment, contrast, entropy, correlation, and the sums of the main diagonal and the first off-diagonal. This results in 18 features for each classification window [9].

Four simulations have been designed to investigate the “retreat and turn” search method with comf18. Two of them use traditional Backpropagation (BP) with an adaptive learning factor. Whenever the learning factor becomes too small, a bigger learning factor is randomly generated. These two networks are constructed with 20 and 100 hidden neurons, respectively. The “retreat and turn” search method (Chen) is then added to conduct two new simulations. Fig. 3 (a) and (b) show the Mean Square Error (MSE) and accuracy plots from these four simulations. The results of the proposed method are comparable to the advanced technique from [9] if running time is not a concern.

The smoothness of the descending path on the error surface can be measured by the number of times a learning factor is randomly generated to escape local minima. There are 49 and 37 random learning factors in the first 10,000 iterations of the BP-18-20-4 and BP-18-100-4 simulations, respectively. When the “retreat and turn” search method is added, these counts are reduced to 0 and 0 for the first 10,000 iterations. When the simulations with the proposed search method are extended to 100,000 iterations, only 9 and 0 random learning factors are found.

[Figure 4: two panels, both titled “comf18”, showing MSE versus iterations on logarithmic axes. (a) Iterations 1 to 100,000; (b) iterations 1,000 to 100,000. Curves: Chen 18-20-4, Chen 18-100-4.]

Fig. 4. (a) A full and (b) a partial logarithmic-scale plot with the proposed method

6 Conclusions

This paper presents an innovative search method, “retreat and turn”, for the MLP’s Backpropagation learning algorithm. In order to escape from local minima, or equivalently to find a descending path on the error surface within a limited number of searches, the proposed method incorporates the firing status of each hidden neuron to make a meaningful and efficient turn whenever it encounters an error increase. The proposed method has been tested with different types of data for up to 100,000 iterations without being stuck in a local minimum. Meanwhile, the method updates the learning factor in its normal way, often for tens of thousands of iterations, without the need to generate a random one. This means the descending path on the error surface is almost always smooth. The logarithmic scale plots in Fig. 4 demonstrate the ability of the proposed search method to descend constantly on the MLP error surface, possibly reaching the global minimum when the learning curve eventually flattens. Judging by the MSE curves, we are also confident that this method can bring us to a fairly good solution for any error surface within a reasonable amount of time.

Undoubtedly, many advanced techniques with complicated mathematical calculations can perform faster or better on certain problems. However, the comparable simulation results from the proposed method are quite encouraging, since the obstacle of local minima is mostly removed and computational simplicity is retained. With this proposed search method, the Backpropagation MLP may one day become a universal computing device, if millions of neurons can be run at once when computational speed is greatly improved in the future. One remaining issue for future study is the speed of learning when errors decrease to a certain level and cause small gradients.

References

1. Minsky, M., Papert, S.: Epilog: the new connectionism. In: Perceptrons, 3rd edn., pp. 247–280. MIT Press, Cambridge (1988)
2. Frasconi, P., Gori, M., Tesi, A.: Success and Failures of Backpropagation: A Theoretical Investigation. In: Omidvar, O. (ed.) Progress in Neural Networks, vol. 5. Ablex Publishing, Greenwich (1993)
3. Blum, E.K.: Approximation of Boolean Functions by Sigmoidal Networks: Part I: XOR and Other Two-Variable Functions. Neural Computation 1, 532–540 (1989)
4. Hush, D.R., Horne, B., Salas, J.M.: Error Surfaces for Multilayer Perceptrons. IEEE Transactions on Systems, Man, and Cybernetics 22, 1152–1161 (1992)
5. Kordos, M., Duch, W.: On Some Factors Influencing MLP Error Surface. In: 7th International Conference of Artificial Intelligence and Soft Computing, pp. 217–222 (2004)
6. Hamey, L.G.: The Structure of Neural Network Error Surface. In: 6th Australian Conference on Neural Networks, pp. 197–200 (1995)
7. Sprinkhuizen-Kuyper, I.G., Boers, E.J.: A Comment on Paper of Blum: Blum’s “local minima” are Saddle Points. Technical Report 94-34, Department of Computer Science, Leiden University (1994)
8. Chen, H.H., Manry, M.T.: Improving Healthcare Predictive Modeling using NeuroSequences. In: 16th Federal Forecasters Conference (in press, 2008)
9. Abdurrab, A.A., Manry, M.T., Li, J., Malalur, S.S., Gore, R.G.: A Piecewise Linear Network Classifier. In: 20th International Joint Conference on Neural Networks, pp. 1750–1755 (2007)

Parallel Fuzzy Reasoning Models with Ensemble Learning

Hiromi Miyajima, Noritaka Shigei, Shinya Fukumoto, and Toshiaki Miike
Kagoshima University, 1-21-40 Korimoto, Kagoshima, 890-0065, Japan
{miya,shigei}@eee.kagoshima-u.ac.jp

Abstract. This paper proposes a new learning algorithm and a parallel model for fuzzy reasoning systems. The proposed learning algorithm, which is based on the ensemble learning algorithm AdaBoost, sequentially trains a series of weak learners, each of which is a fuzzy reasoning system. In the algorithm, each weak learner is trained with a learning data set in which the data misclassified by the previous weak learner are over-represented relative to the other data. The output of the ensemble system is a majority vote weighted by weak learner accuracy. Further, the parallel model is proposed in order to enhance the ensemble effect. The model is made up of more than one ensemble system, each of which consists of weak learners. In order to show the effectiveness of the proposed methods, numerical simulations are performed. The simulation results show that the proposed parallel model with fuzzy reasoning systems constructed by AdaBoost is the most accurate among all the methods compared.

Keywords: Ensemble learning, AdaBoost, Parallel model, Fuzzy reasoning model.

1 Introduction

Many studies on self-tuning fuzzy systems have been proposed[1,2,3,4]. The aim of these studies is to construct fuzzy reasoning rules automatically from input and output data based on the steepest descent method. Obvious drawbacks of the steepest descent method are its computational complexity and the problem of getting trapped in a local minimum. In order to address them, some novel methods have been developed: i) fuzzy rules are created or deleted one by one starting from a small or large number of rules[5,6], ii) a genetic algorithm is used to determine the structure of the fuzzy model[7], iii) a self-organization or a vector quantization technique is used to determine the initial assignment of fuzzy rules[8,9] and iv) generalized objective functions are used[10]. However, there are few studies considering the distribution of learning data; in most of the conventional methods, each element in the given data set is always selected with equal probability. Ensemble learning is an approach that aims to get a better solution by using several weak learners with different distributions of learning data[11], where a weak learner means one with fewer parameters or less learning data. In a previous paper, we proposed a learning algorithm based on boosting, which is one of the ensemble learning methods[12]. However, the effect of ensemble learning was insufficient, because weak learners with different characteristics were not constructed and the error rate in learning was not taken into account in determining the output.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 521–530, 2008. © Springer-Verlag Berlin Heidelberg 2008


In this paper, we propose a new learning algorithm based on AdaBoost and a parallel model with plural weak learners, each of which is a fuzzy reasoning system with ensemble learning. In the proposed algorithm, all learning data are selected randomly, but at each step the share of incorrectly classified data is increased so that each learning model is forced to focus on the misclassified data. The output for any input data is given as a vote over the weak learners, weighted by the inverse of their error rates in learning. Further, in order to improve the ensemble effect, parallel models with plural ensemble systems are proposed. In order to show the effectiveness of the proposed algorithm, numerical simulations are performed.

2 Fuzzy Reasoning Model and Its Learning

2.1 Fuzzy Reasoning Model

This section describes the conventional fuzzy reasoning model using the delta rule[1]. It is the basis for the proposed method. Let $x = (x_1, \cdots, x_m)$ denote the input variable. Let $y$ denote the output variable. Then the rules of the simplified fuzzy reasoning model can be expressed as

\[ R_j : \text{if } x_1 \text{ is } M_{1j} \text{ and } \cdots \ x_m \text{ is } M_{mj} \text{ then } y \text{ is } w_j \tag{1} \]

where $j \in \{1, \cdots, n\}$ is a rule number, $i \in \{1, \cdots, m\}$ is a variable number, $M_{ij}$ is a membership function of the antecedent part, and $w_j$ is the weight of the consequent part. The membership value $\mu_j$ of the antecedent part for input $x$ is expressed as

\[ \mu_j = \prod_{i=1}^{m} M_{ij}(x_i) \tag{2} \]

where $M_{ij}$ is the triangular membership function of the antecedent part. Let $c_{ij}$ and $b_{ij}$ denote the center and the width of $M_{ij}$, respectively. Then $M_{ij}$ is expressed as

\[ M_{ij}(x_i) = \begin{cases} 1 - \dfrac{2 \cdot |x_i - c_{ij}|}{b_{ij}} & \left(c_{ij} - \dfrac{b_{ij}}{2} \le x_i \le c_{ij} + \dfrac{b_{ij}}{2}\right) \\ 0 & (\text{otherwise}). \end{cases} \tag{3} \]

The output $y^*$ of fuzzy reasoning is derived from Eq.(4).

\[ y^* = \frac{\sum_{j=1}^{n} \mu_j \cdot w_j}{\sum_{j=1}^{n} \mu_j} \tag{4} \]
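The forward computation of Eqs.(2) to (4) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function names and the two-rule example system are hypothetical, and returning `None` when no rule fires is an assumption, since the text does not say how a zero denominator in Eq.(4) is handled.

```python
from math import prod


def membership(x: float, c: float, b: float) -> float:
    """Triangular membership of Eq.(3): 1 at the center c, 0 outside [c - b/2, c + b/2]."""
    if c - b / 2 <= x <= c + b / 2:
        return 1.0 - 2.0 * abs(x - c) / b
    return 0.0


def rule_activations(x, centers, widths):
    """Eq.(2): mu_j is the product over inputs i of M_ij(x_i).

    centers[j][i] and widths[j][i] hold c_ij and b_ij for rule j and input i.
    """
    return [
        prod(membership(xi, c, b) for xi, c, b in zip(x, c_row, b_row))
        for c_row, b_row in zip(centers, widths)
    ]


def infer(x, centers, widths, weights):
    """Eq.(4): activation-weighted average of the consequent weights w_j."""
    mu = rule_activations(x, centers, widths)
    total = sum(mu)
    if total == 0.0:
        return None  # assumption: the source does not define the output in this case
    return sum(m * w for m, w in zip(mu, weights)) / total


# Hypothetical system with two rules and one input
centers = [[0.0], [1.0]]
widths = [[2.0], [2.0]]
weights = [0.0, 1.0]
y = infer([0.25], centers, widths, weights)
```

In this toy system the two triangles overlap on [0, 1], so the output interpolates linearly between the two consequent weights.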

As shown above, fuzzy reasoning models are determined by the parameters $c_{ij}$, $b_{ij}$ and $w_j$. How can we determine them? One approach is to regard fuzzy reasoning models as learning models. To do so, we represent a fuzzy reasoning model as the fuzzy network shown in Fig.1[1]. In the following, a learning algorithm is described using this network. In this method, the weights of the network, which are equivalent to the parameters of the membership functions of the antecedent part and the real numbers of the consequent part, are learned using the descent method[1].


Fig. 1. Fuzzy network model

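The objective function $E$ is defined to evaluate the reasoning error between the desirable output $y^r$ and the reasoning output $y^*$ of the system.

\[ E = \frac{1}{2}(y^* - y^r)^2 \tag{5} \]

In order to minimize the objective function $E$, the parameters $\theta \in \{c_{ij}, b_{ij}, w_j\}$ are updated based on the descent method as follows.

\[ \theta(t+1) = \theta(t) - K_\theta \frac{\partial E}{\partial \theta} \tag{6} \]

where $t$ is the iteration number and $K_\theta$ is a constant. From Eqs.(2) to (5), the $\partial E / \partial \theta$'s are calculated as follows:

\[ \frac{\partial E}{\partial c_{ij}} = \frac{\mu_j}{\sum_{j=1}^{n} \mu_j} \cdot (y^* - y^r) \cdot (w_j(t) - y^*) \cdot \frac{2\,\mathrm{sgn}(x_i - c_{ij}(t))}{b_{ij}(t) \cdot M_{ij}(x_i)}, \tag{7} \]

\[ \frac{\partial E}{\partial b_{ij}} = \frac{\mu_j}{\sum_{j=1}^{n} \mu_j} \cdot (y^* - y^r) \cdot (w_j(t) - y^*) \cdot \frac{1 - M_{ij}(x_i)}{M_{ij}(x_i)} \cdot \frac{1}{b_{ij}(t)}, \text{ and} \tag{8} \]

\[ \frac{\partial E}{\partial w_j} = \frac{\mu_j}{\sum_{j=1}^{n} \mu_j} \cdot (y^* - y^r), \tag{9} \]

where

\[ \mathrm{sgn}(z) = \begin{cases} -1 ; & z < 0 \\ 0 ; & z = 0 \\ 1 ; & z > 0. \end{cases} \tag{10} \]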
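The update rule of Eq.(6) with the derivatives of Eqs.(7) to (9) can be sketched as follows for a single training sample. This is a minimal illustration, not the authors' code: the function names are hypothetical, terms with $M_{ij}(x_i) = 0$ are skipped (an assumption made to avoid division by zero; the corresponding $\mu_j$ is zero in that case anyway), and all gradients are computed before any parameter is changed, which the text does not specify. The default learning coefficients are the values listed in Table 1.

```python
from math import prod


def sgn(z: float) -> int:
    """Sign function of Eq.(10)."""
    return (z > 0) - (z < 0)


def membership(x, c, b):
    """Triangular membership function of Eq.(3)."""
    return 1.0 - 2.0 * abs(x - c) / b if abs(x - c) <= b / 2 else 0.0


def forward(x, C, B, W):
    """Return memberships M[j][i], activations mu[j] (Eq. 2) and output y* (Eq. 4).

    Raises ZeroDivisionError if no rule fires for x.
    """
    M = [[membership(xi, c, b) for xi, c, b in zip(x, c_row, b_row)]
         for c_row, b_row in zip(C, B)]
    mu = [prod(row) for row in M]
    y = sum(m * w for m, w in zip(mu, W)) / sum(mu)
    return M, mu, y


def gradients(x, y_r, C, B, W):
    """Gradients of E = (y* - y_r)^2 / 2 with respect to c_ij, b_ij and w_j."""
    M, mu, y = forward(x, C, B, W)
    total = sum(mu)
    g_c = [[0.0] * len(x) for _ in W]
    g_b = [[0.0] * len(x) for _ in W]
    g_w = [0.0] * len(W)
    for j in range(len(W)):
        common = mu[j] / total * (y - y_r)
        g_w[j] = common                                   # Eq.(9)
        for i in range(len(x)):
            if M[j][i] == 0.0:
                continue  # assumption: skip terms that would divide by zero
            k = common * (W[j] - y)
            g_c[j][i] = k * 2.0 * sgn(x[i] - C[j][i]) / (B[j][i] * M[j][i])  # Eq.(7)
            g_b[j][i] = k * (1.0 - M[j][i]) / M[j][i] / B[j][i]              # Eq.(8)
    return g_c, g_b, g_w


def step(x, y_r, C, B, W, K_c=0.001, K_b=0.001, K_w=0.01):
    """One in-place update theta <- theta - K_theta * dE/dtheta (Eq. 6)."""
    g_c, g_b, g_w = gradients(x, y_r, C, B, W)
    for j in range(len(W)):
        W[j] -= K_w * g_w[j]
        for i in range(len(x)):
            C[j][i] -= K_c * g_c[j][i]
            B[j][i] -= K_b * g_b[j][i]
```

A finite-difference comparison against `forward` is a convenient way to confirm that the signs in Eqs.(7) and (8) have been transcribed correctly.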


In a learning algorithm based on the descent method, the initial values $c_{ij}(0)$, $b_{ij}(0)$ and $w_j(0)$ are decided randomly, and the parameters $c_{ij}(t)$, $b_{ij}(t)$ and $w_j(t)$ are updated by Eqs.(7), (8) and (9).

2.2 Learning Algorithm A

In this section, we describe in detail the learning algorithm outlined in the previous section. A target data set $D = \{(x_1^p, \cdots, x_m^p, y_p^r) \mid p = 1, \cdots, P\}$ is given in advance. The objective of learning is to minimize the following error.

\[ E = \frac{1}{P} \sum_{p=1}^{P} |y_p^* - y_p^r|. \tag{11} \]

A conventional learning algorithm is shown below[10].

Learning Algorithm A
Step 1. The initial number of rules, $c_{ij}$, $b_{ij}$ and $w_j$ are set randomly. The threshold $T_1$ for the reasoning error is given. Let $T_{max}$ be the maximum number of learning times. The learning coefficients $K_c$, $K_b$ and $K_w$ are set.
Step 2. Let $t = 1$.
Step 3. Let $p = 1$.
Step 4. An input and output data $(x_1^p, \cdots, x_m^p, y_p^r) \in D$ is given.
Step 5. The membership value of each rule is calculated by Eqs.(2) and (3).
Step 6. The reasoning output $y_p^*$ is calculated by Eq.(4).
Step 7. The real number $w_j$ is updated by Eq.(9).
Step 8. The parameters $c_{ij}$ and $b_{ij}$ are updated by Eqs.(7) and (8).
Step 9. If $p = P$ then go to the next step. If $p < P$ then go to Step 4 with $p \leftarrow p + 1$.
Step 10. The reasoning error $E(t)$ is calculated by Eq.(11). If $E(t) \le T_1$ then learning is terminated.
Step 11. If $t \ne T_{max}$ then go to Step 3 with $t \leftarrow t + 1$. Otherwise learning is terminated.

2.3 Learning Algorithm B

In this section, the learning algorithm given in the previous paper is shown[12]. The algorithm consists of three steps, in each of which a sub-learner is trained. In total, three fuzzy reasoning models are constructed, and finally the models are integrated into one classifier by majority among their outputs. The three sub-learners are trained with different distributions of learning data. For the first sub-learner, the distribution is the same as for the conventional one. For the second and third sub-learners, the distribution of learning data is modified so that the sub-learners focus on the data to which the first and second ones did not adjust enough; specifically, the data incorrectly classified by the previous sub-learners are selected with a higher probability than the other data[13].

Table 1. Simulation conditions for two-category classification problems

Kc, Kb            0.001
Kw                0.01
# training data   600
# test data       6400
Initial cij       equal interval
Initial bij       constant
Initial wi        random
Initial bij       1.0
θ1                0.02 (circle), 0.04 (torus), 0.06 (triple circle)
Tmax              20000

In the following, assume that $y^r \in \{0, 1\}$. Let $D$ be the target learning data given in advance. Let the fuzzy reasoning model constructed by Learning Algorithm A with learning data $D$ be denoted by $net_D$. Further, let the output of $net_D$ for input $x$ be denoted by $net_D(x)$. In order to modify the distribution of learning data, the sets $D_{miss}$, $D_A$ and $D_B$ are defined in the following. Let $D_{miss} = \{(x, y^r) \in D \mid s(net_D(x)) \ne y^r\}$, that is, $D_{miss}$ is the subset of $D$ that consists of the data learned incorrectly, where

\[ s(x) = \begin{cases} 1 & \text{for } x \ge 0.5 \\ 0 & \text{for } x < 0.5. \end{cases} \]

$D_A$ is constructed using $D$ and $D_{miss}$ as shown below.
Step 1. $D_A \leftarrow \emptyset$.
Step 2. A set $D^*$ is randomly selected with equal probability between $D$ and $D_{miss}$.
Step 3. A data $(x, y^r) \in D^*$ is randomly selected with equal probability.
Step 4. $D_A \leftarrow D_A \cup \{(x, y^r)\}$.
Step 5. If $|D_A| < P$ then go to Step 2. Otherwise the procedure is terminated.

Let $D_B = \{(x, y^r) \in D \mid s(net_D(x)) \ne s(net_{D_A}(x))\}$. Then the output of the boosting algorithm for input $x$ is defined as follows:

\[ s\left( \frac{s(net_D(x)) + s(net_{D_A}(x)) + s(net_{D_B}(x))}{3} \right), \tag{12} \]

that is, the output for input $x$ is given as the decision by majority among the outputs of $net_D(x)$, $net_{D_A}(x)$ and $net_{D_B}(x)$. The proposed algorithm B, which is based on boosting, is presented below.

Learning Algorithm B
In Steps 1∼3, Learning Algorithm A is invoked.
Step 1. The fuzzy reasoning model $net_D$ is constructed by using learning data $D$.
Step 2. The fuzzy reasoning model $net_{D_A}$ is constructed by using learning data $D_A$.
Step 3. The fuzzy reasoning model $net_{D_B}$ is constructed by using learning data $D_B$.
Step 4. The output for any input data $x$ is calculated by Eq.(12). The algorithm ends.

Note that the proposed method does not refine a sub-learner constructed in the previous step, but creates a sub-learner that focuses on the data incorrectly classified by the sub-learners created in the previous steps.
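The data-set construction and the majority decision of Learning Algorithm B can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the sub-learners are passed in as plain callables returning a real-valued output, all function names are hypothetical, $D_A$ is treated as a list with repetitions (a multiset) so that it can reach size $P$ while oversampling $D_{miss}$, and the construction falls back to sampling from $D$ alone if nothing is misclassified.

```python
import random


def s(v: float) -> int:
    """Threshold function: 1 for v >= 0.5, otherwise 0."""
    return 1 if v >= 0.5 else 0


def build_d_a(D, net_d, rng):
    """Construct D_A: each draw picks D or D_miss with equal probability,
    then a uniformly random element of the chosen set, until |D_A| = |D|."""
    d_miss = [(x, y) for x, y in D if s(net_d(x)) != y]
    pools = [D, d_miss] if d_miss else [D]  # assumption: fall back to D alone
    return [rng.choice(rng.choice(pools)) for _ in range(len(D))]


def build_d_b(D, net_d, net_da):
    """D_B: the data on which the first two sub-learners disagree."""
    return [(x, y) for x, y in D if s(net_d(x)) != s(net_da(x))]


def majority(x, nets) -> int:
    """Eq.(12): majority decision among the thresholded sub-learner outputs."""
    return s(sum(s(net(x)) for net in nets) / len(nets))
```

In a full pipeline, each of the three data sets would be handed to Learning Algorithm A to obtain the corresponding sub-learner before `majority` is applied.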


Let us consider the probability distribution of the learning data. The probabilities $p((x, y^r))$ of selecting data $(x, y^r) \in D$ for learning are as follows. For the first sub-learner in Step 1, for any $(x, y^r) \in D$, $p((x, y^r)) = 1/|D|$. For the second sub-learner in Step 2,

\[ p((x, y^r)) = \begin{cases} 0.5/|D| & \text{if } (x, y^r) \notin D_{miss} \\ 0.5/|D| + 0.5/|D_{miss}| & \text{if } (x, y^r) \in D_{miss}. \end{cases} \]

For the third sub-learner in Step 3,

\[ p((x, y^r)) = \begin{cases} 1/|D_B| & \text{if } (x, y^r) \in D_B \\ 0 & \text{if } (x, y^r) \notin D_B. \end{cases} \]
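As a quick sanity check on the second sub-learner's distribution, consider an illustrative case with $|D| = 600$ (the training-set size in Table 1) and a hypothetical $|D_{miss}| = 30$; the value 30 is an assumption chosen only to make the arithmetic concrete.

\[ p = \frac{0.5}{600} \approx 0.00083 \ \text{ for } (x, y^r) \notin D_{miss}, \qquad p = \frac{0.5}{600} + \frac{0.5}{30} = 0.0175 \ \text{ for } (x, y^r) \in D_{miss} \]

\[ 570 \cdot \frac{0.5}{600} + 30 \cdot \left( \frac{0.5}{600} + \frac{0.5}{30} \right) = 0.475 + 0.525 = 1 \]

Under these assumed counts, each misclassified datum is 21 times as likely to be drawn as a correctly classified one, and the probabilities sum to 1 as required.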

3 The Proposed Method

Since Learning Algorithm A is based on the descent method, it suffers from problems such as local minima and slow learning. In order to address them, several methods have been proposed[5,6,7,8,9,10]. Further, we have proposed Learning Algorithm B, which is a form of ensemble learning[12]. However, the effect of the ensemble seems insufficient. In this paper, we propose a new learning algorithm based on AdaBoost, which generalizes Learning Algorithm B, and a parallel model of ensemble learning systems.

Fig. 2. Parallel model

[Figure 3: two panels, (a) For filter and (b) For AdaBoost. x-axis: Number of rules (15 to 65); y-axis: Error rate [%] (0 to 3). Curves: Conventional, Filter 1, Filter 3, Filter 5 in (a); Conventional, AdaBoost 1, AdaBoost 3, AdaBoost 5 in (b).]

Fig. 3. Error rate versus number of rules for circle

3.1 Learning Algorithm C

The proposed method is based on AdaBoost[13]. The algorithm is as follows.

Learning Algorithm C
Let $D_1 = D$.

\[ MSE_1 = \frac{1}{P} \sum_{(x, y^r) \in D_1} |s(net_{D_1}(x)) - y^r| \tag{13} \]

\[ B_1 = \frac{1}{MSE_1} \tag{14} \]

\[ d_1(x, y^r) = \frac{|s(net_{D_1}(x)) - y^r|}{P \cdot MSE_1} \tag{15} \]

Input. Target data set: $D = \{(x^p, y_p^r) \mid p = 1, \cdots, P\}$
Output. Fuzzy reasoning models: $net_{D_1}, \cdots, net_{D_L}$
Step 1. Let $l = 1$ and $D_1 = D$. The fuzzy reasoning model $net_{D_1}$ is constructed by using learning data $D_1$.
Step 2. $D_{l+1} \leftarrow \emptyset$. Until $|D_{l+1}| = P$, repeat (2.1) and (2.2).
(2.1) A data $(x, y) \in D_l$ is randomly selected with the probability

\[ p((x, y)) = \frac{|s(net_{D_l}(x)) - y|}{\sum_{(x', y') \in D} |s(net_{D_l}(x')) - y'|}. \]

(2.2) $D_{l+1} \leftarrow D_{l+1} \cup \{(x, y)\}$.
Step 3. $l \leftarrow l + 1$. The fuzzy reasoning model $net_{D_l}$ is trained by using learning data $D_l$.
Step 4. If $l = L$, then the algorithm is terminated. Otherwise, go to Step 2.

The output of Learning Algorithm C for input $x$ is defined as follows:

\[ s\left( \frac{\sum_{l=1}^{L} B_l \cdot s(net_{D_l}(x))}{\sum_{l=1}^{L} B_l} \right), \tag{16} \]

where

\[ MSE_l = \frac{1}{P} \sum_{(x, y) \in D} |s(net_{D_l}(x)) - y| \quad \text{and} \quad B_l = \frac{1}{MSE_l}. \]
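The resampling loop and the weighted vote of Eq.(16) can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' code: `train` is a hypothetical callable standing in for Learning Algorithm A, each data set is kept as a list with repetitions, the data set is left unchanged when the current learner misclassifies nothing, and $MSE_l$ is clamped from below by a small `eps` so that $B_l = 1/MSE_l$ stays finite. The selection weights follow step (2.1) as written, so only data misclassified by the current learner receive nonzero probability.

```python
import random


def s(v: float) -> int:
    """Threshold function: 1 for v >= 0.5, otherwise 0."""
    return 1 if v >= 0.5 else 0


def error_rate(net, D) -> float:
    """MSE_l in the text: mean of |s(net(x)) - y| over D (the misclassification rate)."""
    return sum(abs(s(net(x)) - y) for x, y in D) / len(D)


def learning_algorithm_c(D, train, L, rng, eps=1e-12):
    """Train L weak learners; D_{l+1} is resampled from D_l with probability
    proportional to |s(net(x)) - y|, following step (2.1)."""
    nets, D_l = [], list(D)
    for l in range(L):
        net = train(D_l)
        nets.append(net)
        errs = [abs(s(net(x)) - y) for x, y in D_l]
        if l + 1 < L and sum(errs) > 0:  # assumption: keep D_l if nothing is misclassified
            D_l = rng.choices(D_l, weights=errs, k=len(D))
    B = [1.0 / max(error_rate(net, D), eps) for net in nets]  # B_l = 1 / MSE_l
    return nets, B


def predict(x, nets, B) -> int:
    """Eq.(16): vote weighted by B_l, then thresholded by s."""
    return s(sum(b * s(net(x)) for net, b in zip(nets, B)) / sum(B))
```

For example, `learning_algorithm_c(D, train, L=3, rng=random.Random(0))` would return three learners and their weights, which `predict` then combines for a new input.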

[Figure 4: two panels, (a) For filter and (b) For AdaBoost. x-axis: Number of rules (20 to 90); y-axis: Error rate [%] (0 to 6). Curves: Conventional, Filter 1, Filter 3, Filter 5 in (a); Conventional, AdaBoost 1, AdaBoost 3, AdaBoost 5 in (b).]

Fig. 4. Error rate versus number of rules for doughnuts

[Figure 5: two panels, (a) For filter and (b) For AdaBoost. x-axis: Number of rules (40 to 130); y-axis: Error rate [%] (0 to 10). Curves: Conventional, Filter 1, Filter 3, Filter 5 in (a); Conventional, AdaBoost 1, AdaBoost 3, AdaBoost 5 in (b).]

Fig. 5. Error rate versus number of rules for circle

3.2 Parallel Models of Ensemble System

In order to improve the effect of ensemble learning, a parallel model constructed from plural fuzzy reasoning systems with ensemble learning is proposed (see Fig.2). As shown in Fig.2, k fuzzy reasoning systems M1, · · · , Mk are constructed independently from different learning data sets X1, · · · , Xk by using Learning Algorithm B or C, and the output for any input x is determined by majority among them. In particular, if X1 = · · · = Xk, the model is called a uniform parallel model.
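The majority decision of the parallel model can be sketched in a few lines. This is an illustrative sketch: each ensemble system is assumed to be a callable returning 0 or 1 (for example, the output of Eq.(12) or Eq.(16)), the function name is hypothetical, and ties, which cannot occur for the odd values k = 3, 5 used in the simulations, are resolved to class 0 by assumption.

```python
def parallel_output(x, systems) -> int:
    """Majority decision among k ensemble systems, each returning 0 or 1."""
    votes = sum(system(x) for system in systems)
    return 1 if 2 * votes > len(systems) else 0
```

For a non-uniform parallel model, each element of `systems` would be trained on its own learning data set; for a uniform one, all would share the same data set and differ only through random initialization and resampling.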

4 Numerical Simulations

We perform two experiments to show the validity of the proposed method using the learning rule described in the previous section. We solve the two-category classification problems Circle, Torus and Triple circle to investigate the basic features of the proposed method and to compare its performance with that of the conventional method. In the classification problems, points on [0, 1] × [0, 1] are classified into two classes: class 0 and class 1. The regions for class 0 and class 1 are separated by circles centered at (0.5,0.5). For Circle, Torus and Triple circle, the number of circles is one, two and three, respectively. The desired output $y_p^r$ is set as follows:

\[ y_p^r = \begin{cases} 0 & \text{if } x^p \text{ belongs to class 0} \\ 1 & \text{if } x^p \text{ belongs to class 1.} \end{cases} \]

The target data set is constructed so as to cover the input space [0, 1] × [0, 1] uniformly. The conditions of the simulation are shown in Table 1. The results of reasoning output are shown in Figs. 3, 4 and 5, where Conventional, Filter 1 and AdaBoost 1 denote the methods using Learning Algorithm A, Learning Algorithm B and Learning Algorithm C with L = 3, respectively, and Filter 3, Filter 5, AdaBoost 3 and AdaBoost 5 denote the parallel models with k = 3, 5. Each value in the results is an average over 30 trials. As shown in Figs. 3, 4 and 5, AdaBoost 1 is more effective than Conventional and Filter 1, and the parallel models Filter 3, Filter 5, AdaBoost 3 and AdaBoost 5 are more powerful than the single models. We have also performed simulations of uniform parallel models. They are more effective than Filter 1 or AdaBoost 1, but inferior to non-uniform parallel models. It seems that parallel models are superior to single models in correctness and speed of learning, even when uniform models are used.
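The three classification problems described in this section can be sketched as a small data generator. The radii below are hypothetical, since the text states only the number of circles and their common center; the convention that points outside all circles belong to class 0, with the class alternating at each circle, is likewise an assumption. Uniform random sampling stands in for the unspecified procedure used to cover the input space uniformly.

```python
import random
from math import hypot


def label(x1: float, x2: float, radii) -> int:
    """Class alternates each time a circle centered at (0.5, 0.5) is crossed.

    Assumption: points outside every circle are class 0.
    """
    r = hypot(x1 - 0.5, x2 - 0.5)
    return sum(r < radius for radius in radii) % 2


def make_data(n: int, radii, rng):
    """Draw n points uniformly on [0, 1] x [0, 1] and label them."""
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return [((x1, x2), label(x1, x2, radii)) for x1, x2 in points]


# Hypothetical radii; the source does not state them.
PROBLEMS = {
    "circle": [0.3],
    "torus": [0.2, 0.4],
    "triple circle": [0.15, 0.3, 0.45],
}

train = make_data(600, PROBLEMS["torus"], random.Random(0))
```

With these conventions the Torus problem yields a ring of class 1 between the two circles, with class 0 both inside the inner circle and outside the outer one.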

5 Conclusion

In this paper, we propose a new learning algorithm based on AdaBoost and a parallel model. In the proposed algorithm, all learning data are selected randomly, but at each step the share of incorrectly classified data is increased so that each learning model is forced to focus on the misclassified data. The output for any input data is given as a vote over the weak learners, weighted by the inverse of their error rates in learning. Further, in order to improve the ensemble effect, a parallel model with plural ensemble systems is proposed. In numerical simulations, the proposed methods have shown good performance in terms of error rate compared to the conventional one. In particular, the proposed parallel model has shown the best performance. Finally, we describe our remaining tasks on the proposed methods. In this paper, the simulations are performed for relatively simple classification problems. We will examine the effectiveness of the methods for more complicated problems, which, for example, involve high-dimensional data and soft decision boundaries. We will also explain the effectiveness of the proposed methods theoretically.

Acknowledgements

This work was supported by KAKENHI (19500195).

References

1. Nomura, H., Hayashi, I., Wakami, N.: A Self-Tuning Method of Fuzzy Reasoning by Delta Rule and Its Application to a Moving Obstacle Avoidance. Journal of Japan Society for Fuzzy Theory & Systems 4, 379–388 (1992)
2. Mendel, J.M.: Fuzzy Logic Systems for Engineering: A Tutorial. Proceedings of the IEEE 83, 345–377 (1995)
3. Lin, C., Lee, C.: Neural Fuzzy Systems. Prentice Hall, PTR (1996)
4. Gupta, M.M., Jin, L., Homma, N.: Static and Dynamic Neural Networks. IEEE Press, Los Alamitos (2003)
5. Araki, S., Nomura, H., Hayashi, I., Wakami, N.: A Fuzzy Modeling with Iterative Generation Mechanism of Fuzzy Inference Rules. Journal of Japan Society for Fuzzy Theory & Systems 4, 722–732 (1992)
6. Fukumoto, S., Miyajima, H., Kishida, K., Nagasawa, Y.: A Destructive Learning Method of Fuzzy Inference Rules. In: IEEE International Conference on Fuzzy Systems, pp. 687–694 (1995)
7. Nomura, H., Hayashi, I., Wakami, N.: A Self Tuning Method of Fuzzy Reasoning by Genetic Algorithm. In: International Fuzzy Systems and Intelligent Control Conference, pp. 236–245 (1992)
8. Wang, L.X., Mendel, J.M.: Fuzzy Basis Functions, Universal Approximation, and Orthogonal Least Square Learning. IEEE Trans. Neural Networks 3, 807–814 (1992)
9. Kishida, K., Miyajima, H.: A Learning Method of Fuzzy Inference Rules using Vector Quantization. In: International Conference on Artificial Neural Networks, vol. 2, pp. 827–832 (1998)
10. Fukumoto, S., Miyajima, H.: Learning Algorithms with Regularization Criteria for Fuzzy Reasoning Model. Journal of Innovative Computing, Information and Control 1, 249–263 (2006)
11. Miyoshi, S., Hara, K., Okada, M.: Analysis of Ensemble Learning using Simple Perceptrons Based on Online Learning Theory. Physical Review E 71, 1–11 (2005)
12. Miyajima, H., Shigei, N., Fukumoto, S., Nakatsu, N.: A Learning Algorithm with Boosting for Fuzzy Reasoning Model. In: International Conference on Fuzzy Systems and Knowledge Discovery, vol. 2, pp. 85–90 (2007)
13. Schapire, R.E.: A Brief Introduction to Boosting. In: 16th International Joint Conference on Artificial Intelligence, pp. 1401–1406 (1999)

Classification and Dimension Reduction in Bank Credit Scoring System

Bohan Liu, Bo Yuan, and Wenhuang Liu
Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, P.R. China
[email protected], {yuanb,liuwh}@sz.tsinghua.edu.cn

Abstract. Customer credit is an important concept in the banking industry, which reflects a customer’s non-monetary value. Using credit scoring methods, customers can be assigned to different credit levels. Many classification tools, such as Support Vector Machines (SVMs), Decision Trees, and Genetic Algorithms, can deal with high-dimensional data. However, from the point of view of a customer manager, the classification results from the above tools are often too complex and difficult to comprehend. As a result, it is necessary to perform dimension reduction on the original customer data. In this paper, an SVM model is employed as the classifier and a “Clustering + LDA” method is proposed to perform dimension reduction. A comparison with some widely used techniques is also made, which shows that our method works reasonably well.

Keywords: Dimension Reduction, LDA, SVM, Clustering.

1 Introduction

Customer credit is an important concept in the banking industry, which reflects a customer’s non-monetary value. The better a customer’s credit, the higher the value that commercial banks perceive in him or her. Credit scoring refers to the process of customer credit assessment using statistical and related techniques. Generally speaking, banks assign customers to good and bad categories based on their credit values. As a result, the problem of credit assessment becomes a typical classification problem in pattern recognition and machine learning.

As far as classification is concerned, some representative features need to be extracted from the customer data, which are later used by classifiers. Many classification tools, such as Support Vector Machines (SVMs), Decision Trees, and Genetic Algorithms, can deal with high-dimensional data. However, the classification results from the above tools based on the original data are often too complex to be understood by customer managers. As a result, it is necessary to perform dimension reduction on the original data by removing irrelevant features. Once the dimension of the data is reduced, the results from the classification tools may become simpler and easier to explain, and thus easier for bank staff to comprehend. On the other hand, it should be noted that the classification accuracy still needs to remain at an acceptable level after dimension reduction.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 531–538, 2008. © Springer-Verlag Berlin Heidelberg 2008


2 Credit Data and Classification Models

The experimental data set (Australian Credit Approval Data Set) was taken from the UCI repository [2]. It has 690 samples, each with 8 symbolic features and 6 numerical features. There are 2 classes (the majority rate is about 55.5%) and no missing feature values. The data set was randomly divided into a training set (490 samples) and a test set (200 samples). All numerical features were linearly scaled to be within [0, 1]. In this paper, the SVM model was employed as the classifier, which has been widely used in various classification tasks and credit assessment applications [3, 4, 5].

2.1 Preliminary Results

In order to use the SVM model, all symbolic features need to be transformed into numerical features. A simple and commonly used scheme is shown in Table 1. In this example, a symbolic feature S taking 3 possible values a, b, and c is transformed into 3 binary features (S1, S2, and S3).

Table 1. A simple way to transform symbolic features into numerical features

         S1   S2   S3
S = a    1    0    0
S = b    0    1    0
S = c    0    0    1

In the experimental studies, K-fold cross-validation was adopted [6], with the parameter K set to 5. In the SVM model, the RBF kernel was used and its parameters were chosen based on a series of trials. The accuracies of the SVM were 86.7347% and 87.5% on the training set and the test set, respectively. The implementation of the SVM was based on “libsvm-2.85” [7].

2.2 An Alternative Way to Handle Symbolic Features

There is an alternative way to transform symbolic features, which is based on the idea of probabilities [10]. Let t represent a symbolic feature whose possible values are t1, t2, …, tk. Let ωi (i = 1, 2, …, M) denote the ith class label. For example, the case of t = tk is represented by:

\[ (P(\omega_1 \mid t = t_k),\ P(\omega_2 \mid t = t_k),\ \ldots,\ P(\omega_M \mid t = t_k)) \]

Since the probabilities always sum to 1, each symbolic feature can be represented by M-1 numerical features. As a result, for two-class problems, each symbolic feature can be represented by a single numerical feature. Compared to the scheme in Table 1, this new scheme is favorable when the number of classes is small (two classes in this paper) while the cardinality of each symbolic feature is high. With this type of transformation of the symbolic features in the credit data, the accuracies of the SVM were 86.939% and 88.0% on the training set and the test set, respectively.
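For the two-class case, the probability-based encoding can be sketched as follows. This is an illustrative sketch with hypothetical function names: the conditional probabilities are estimated as simple relative frequencies, which the text does not spell out, and mapping values unseen during fitting to 0.5 is an assumption. It seems natural to fit the table on the training set only, although the source does not say so.

```python
from collections import Counter


def fit_probability_encoding(values, labels, positive):
    """Estimate P(class = positive | t = v) for each symbolic value v."""
    totals = Counter(values)
    hits = Counter(v for v, y in zip(values, labels) if y == positive)
    return {v: hits[v] / totals[v] for v in totals}


def encode(values, table, default=0.5):
    """Replace each symbolic value by its estimated class probability.

    Assumption: values not seen when fitting the table map to `default`.
    """
    return [table.get(v, default) for v in values]
```

Each symbolic feature then contributes a single numerical column, compared with one column per possible value under the scheme of Table 1.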


3 Dimension Reduction Techniques The main objective is to project the original data into a 2D space, which is intuitive to analyze. For this purpose, LDA (Linear Discriminant Analysis) was used to reduce the dimension of the data. Although there are many other dimension reduction tools such as PCA (Principal Components Analysis), LDA is usually preferred in terms of the classification accuracy after dimension reduction. Since LDA can only deal with numerical features, all symbolic features in the original data set were transformed into numerical features by the method in Section 2.2. An improved LDA was also proposed to address some of the weaknesses of the standard LDA technique. 3.1 LDA (Linear Discriminant Analysis) The purpose of LDA is to perform dimension reduction while preserving as much class discriminatory information as possible [8]. In two-class problems, LDA is often refereed to as FLD (Fisher Linear Discriminant). In this method, the between-class scatter matrix SB and the within-class scatter matrix SW are defined as:

SB =

N N (µ ∑ ( ) i

j

− µ j )(µ i − µ j )

T

i

(1)

i , j i< j

SW = ∑ ∑ ( x − µi )( x − µi )

T

i

x∈ωi

(2)

In Eq. 1 and Eq. 2, Ni is the number of samples in class ωi while μi is the mean of data in class ωi. Note that for M-class problems, there are at most M-1 projection directions [9] and consequently it is only possible to project the original data to a line for two-class problems. The optimal projection is defined as Wopt that maximizes the following function:

\[ J(W_{opt}) = \frac{W_{opt}^T S_B W_{opt}}{W_{opt}^T S_W W_{opt}} \qquad (3) \]
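For the two-class case, the maximizer of Eq. 3 is known to be proportional to SW^{-1}(μ1 − μ2). The following sketch computes it for 2-D points using only the standard library. The function names and the toy data are assumptions made for illustration; real data would have more dimensions and would call for a linear algebra library.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]

def within_scatter(classes):
    """S_W of Eq. 2 for 2-D points: sum over classes of (x - mu_i)(x - mu_i)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for pts in classes:
        mu = mean(pts)
        for x in pts:
            d = [x[0] - mu[0], x[1] - mu[1]]
            for r in range(2):
                for c in range(2):
                    s[r][c] += d[r] * d[c]
    return s

def fisher_direction(class1, class2):
    """Unit vector proportional to S_W^{-1} (mu_1 - mu_2)."""
    sw = within_scatter([class1, class2])
    m1, m2 = mean(class1), mean(class2)
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("S_W is singular; regularization would be needed")
    # Closed-form inverse of a 2x2 matrix applied to diff.
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]

# Hypothetical toy data: two well separated classes in 2-D.
class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.5)]
class2 = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.5)]
w = fisher_direction(class1, class2)
proj1 = [w[0] * x + w[1] * y for x, y in class1]
proj2 = [w[0] * x + w[1] * y for x, y in class2]
```

Projecting both classes onto `w` yields one number per sample, which is the single projection line available to standard LDA in two-class problems.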

3.2 Clustering Based LDA

Although the objective is to transform the original data into 2D data, for two-class problems it is only possible to get a single projection vector from the standard LDA. In the following, a new LDA method based on clustering is proposed. The key idea is to partition the data in each class into subclasses through clustering. The number of subclasses is a tunable parameter of the new LDA method. For example, for a two-class problem, two clusters (subclasses) can be created in each original class, and by doing so the number of classes increases from 2 to 4. As a result, it is now possible to get three nonzero eigenvalues (instead of one). The projection directions are the eigenvectors corresponding to the nonzero eigenvalues of $S_W^{-1} S_B$. Since the rank of SB is now more than 2, it is possible to select two projection directions.
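The relabeling step at the heart of this idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `kmeans` is a plain textbook k-means with seeded random initial centers, `split_into_subclasses` is a hypothetical helper name, and the toy data are invented. The new labels would then be passed to a standard multi-class LDA.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on tuples; returns a cluster index for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign

def split_into_subclasses(points, labels, k=2, seed=0):
    """Cluster each original class separately and return subclass labels."""
    new_labels = [None] * len(points)
    for offset, cls in enumerate(sorted(set(labels))):
        idx = [i for i, y in enumerate(labels) if y == cls]
        assign = kmeans([points[i] for i in idx], k, seed=seed)
        for i, a in zip(idx, assign):
            new_labels[i] = offset * k + a
    return new_labels

# Hypothetical toy data: each class consists of two separate blobs.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (0, 20), (1, 20), (0, 21), (20, 0), (21, 0), (20, 1)]
labels = [1] * 6 + [2] * 6
new_labels = split_into_subclasses(points, labels, k=2)
max_directions = len(set(new_labels)) - 1  # at most M-1 directions for M classes
```

Because the initial centers are random, different seeds may produce different subclasses and therefore different projection directions.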


B. Liu, B. Yuan, and W. Liu

4 Experiments

Experimental studies were conducted to empirically investigate the effectiveness of the proposed LDA method. Comparison with two existing LDA extensions capable of producing multiple projection directions for two-class problems was also performed.

4.1 The Effectiveness of Clustering Based LDA

The widely used k-means clustering algorithm with k=2 (dividing each original class into two subclasses) was employed. This parameter value was selected based on a few preliminary trials. Three nonzero eigenvalues were found based on the training set: λ1=1166, λ2=571 and λ3=163. The first two eigenvalues were selected and their corresponding eigenvectors were used as the projection directions. Fig. 1 shows the transformed 2D data from the training set and the test set.

Fig. 1. (a) Training Set; (b) Test Set

Table 2. Classification accuracies of the SVM with the new clustering based LDA

                        Training Set   Test Set
Original Data           86.939%        88%
Transformed Data (1)    84.694%        88%
Transformed Data (2)    86.939%        88%
Transformed Data (3)    86.327%        87.5%
Transformed Data (4)    88.163%        87%
Transformed Data (5)    84.082%        89%


As can be seen immediately from Fig. 1, the 2D projections make it much easier for people to understand the distribution of the two classes. The accuracies of the SVM on the original data and the transformed data, referred to as “Transformed Data (1)”, are shown in Table 2. It is clear that the accuracies of the SVM remained almost unchanged while the dimension of the data was reduced from 14 to 2. This result also indicates that the original data set contains a significant amount of redundancy as far as classification is concerned. Since the initial cluster centers are randomly selected in the k-means algorithm, different initial cluster centers may result in different final clusters and projection directions. To demonstrate this point, some examples of other 2D projections (training set only) that can be obtained from the same data set are shown in Fig. 2.

Fig. 2. Four different dimension reduction results on the same training set (panels: Transformed Data (2), Transformed Data (3), Transformed Data (4), Transformed Data (5))

4.2 Comparison with Other LDA Techniques

There are several variations of the original LDA framework in the literature that can find multiple nonzero eigenvalues for two-class problems. Two representative examples are briefly described below:

1. Nonparametric Discriminant Analysis (NPLDA) [11] employs the K Nearest Neighbor (KNN) method when calculating the between-class scatter matrix SB in order to make SB full rank. Consequently, it is possible to get more than one nonzero eigenvalue (multiple projection directions).

Fig. 3. (a) NPLDA when the parameter K of KNN equals 25; (b) NPLDA when the parameter K of KNN equals 50; (c) NPLDA when the parameter K of KNN equals 100; (d) W2 method

2. The second method (referred to as W2 in this paper) uses the original SB and SW. The first projection Wopt is the same as in the original LDA. The second projection W2 (orthogonal to Wopt) is defined as the eigenvector corresponding to the nonzero eigenvalue of:

\[ \left[ S_W^{-1} - \frac{S_B^T (S_W^{-1})^2 S_B}{S_B^T (S_W^{-1})^3 S_B} (S_W^{-1})^2 \right] S_B \]

Table 3. Classification accuracies of the SVM with different LDA methods

                        Training Set   Test Set
Clustering Based LDA    88.163%        87%
NPLDA, K=25             81.429%        81.5%
NPLDA, K=50             87.551%        85.5%
NPLDA, K=100            88.367%        86%
W2 method               88.571%        87%


As shown in Table 3, in the experiments using NPLDA, the accuracy improved gradually as the value of K increased. When K was set to 100, the accuracy reached a satisfactory level, although the process of searching for the 100 nearest neighbors for each sample may require extra computational cost. By contrast, the W2 method showed good performance in terms of time complexity and classification accuracy. Note that it can only find a fixed projection map without the flexibility of choosing the number of projection directions as well as selecting the “best” projection maps. In summary, the proposed clustering based LDA method worked reasonably well compared to other representative LDA methods.

5 Conclusion and Future Work

The major focus of this paper is on improving the clarity of the customer data. Generally speaking, dimensionality is a major challenge for data interpretation and understanding by domain experts. For this purpose, various LDA related techniques for dimension reduction were tested, including a new clustering based LDA method. Experimental results showed that all these techniques were effective at reducing the dimension of the customer data set of interest while the classification accuracies of the SVM model remained almost unaffected after dimension reduction.

In addition to the preliminary work reported in this paper, there are a few directions for future work. Firstly, the proposed dimension reduction techniques need to be further tested on large scale customer data sets from commercial banks. Secondly, as shown in this paper, the projection directions as well as the classification accuracies may vary with different cluster patterns from the same data set due to the randomness of the clustering algorithm and different parameter values. As a result, a thorough analysis is required to better understand the relationship between clustering and LDA in order to investigate what kind of cluster patterns are preferred for the purpose of dimension reduction.

Acknowledgement

This work was supported by National Natural Science Foundation of China (NSFC, Project no.: 70671059).

References

[1] Quan, M., Qi, J., Shu, H.: An Evaluation Index System to Assess Customer Value. Nankai Business Review 7(3), 17–23 (2004)
[2] Merz, C.J., Murphy, P.M.: UCI repository of machine learning databases, http://www.ics.uci.edu/pub/machine-learning-databases
[3] Yang, Y.: Adaptive credit scoring with kernel learning methods. European Journal of Operational Research 183, 1521–1536 (2007)
[4] Martens, D., Baesens, B., Van Gestel, T., Vanthienen, J.: Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research 183, 1466–1476 (2007)
[5] Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press, London (2006)
[6] Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 1137–1143 (1995)
[7] Chang, C., Lin, C.: Libsvm: a library for Support Vector Machine, http://www.csie.ntu.edu.tw/~cjlin/libsvm
[8] Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugenics 7, 179–188 (1936)
[9] Duda, R., Hart, P.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
[10] Duch, W., Grudziński, K., Stawski, G.: Symbolic Features in Neural Networks. In: 5th Conference on Neural Networks and Soft Computing, pp. 180–185 (2000)
[11] Fukunaga, K., Mantock, J.: Nonparametric Discriminant Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 671–678 (1983)

Polynomial Nonlinear Integrals

JinFeng Wang1, KwongSak Leung1, KinHong Lee1, and Zhenyuan Wang2

1 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong SAR
2 Department of Mathematics, University of Nebraska at Omaha, Omaha, USA
{jfwang,ksleung,khlee}@cse.cuhk.edu.hk, [email protected]

Abstract. Nonlinear Integrals are a useful integration tool. They produce a set of virtual values by projecting the original data onto a virtual space. The classical Nonlinear Integrals implement the projection along a line with respect to the attributes. In many cases, however, a linear projection is not sufficient to achieve good performance for classification or regression. In this paper, we propose a generalization of Nonlinear Integrals, the Polynomial Nonlinear Integrals (PNI). A polynomial function with respect to the attributes is used as the integrand of the Nonlinear Integrals. It makes the projection follow different kinds of curves to the virtual space, so that the virtual values obtained by the Nonlinear Integrals are more regular and easier to deal with. To test the capability of the Polynomial Nonlinear Integrals, we apply them to classification on some real datasets. Because of the computational complexity, we use a feature selection method studied in another paper of ours as preprocessing. We vary the highest power of the polynomial from 1 to 5 to observe the change in performance of PNI and the effect of the highest power. Experiments show that PNI clearly improves on the classical NI and that the performance does not necessarily rise as the highest power is increased.

Keywords: Nonlinear integrals, Polynomial nonlinear integrals, Projection, Classification.

1 Introduction

Nonlinear Integrals are known to give good results on classification and regression despite their large computational complexity. Since the fuzzy measure was first introduced by Sugeno [1], many versions of Nonlinear Integrals with respect to fuzzy measures have been proposed by researchers and applied to classification and regression on real-world data [2]-[5]. In these methods, the nonlinear integrals are used as confidence fusion tools. Given an object X = x1, x2, ..., xn, for each class Ck, k = 1, 2, ..., m, a fuzzy measure is needed to fuse the n degrees of confidence for the statement "X belongs to class Ck" based on the value of each xi, i = 1, 2, ..., n. So m fuzzy measures are used and m(2^n − 1) values of fuzzy measures need to be determined. Moreover, these methods are pixel-wise, so a large amount of training data is required. They have large time and space complexity. Unlike

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 539–548, 2008. © Springer-Verlag Berlin Heidelberg 2008


J. Wang et al.

the methods above, another method called WCIPP (Weighted-Choquet-Integral based Projection Pursuit) uses a weighted Choquet Integral as a projection tool [6]. In WCIPP, only one fuzzy measure, defined on the power set of the set of all feature attributes, is used to describe the importance of each feature attribute as well as their interactions [7]-[9] towards the classification of the records. The original classification problem in n-dimensional space is transformed into a one-dimensional problem through the optimal projection based on Nonlinear Integrals. We used a generalized WCIPP with respect to the signed fuzzy measure in previous research. The signed fuzzy measure can better describe the interaction and contribution of attributes to the decision. The integrand is represented by f′ = âf + b̂ and the fuzzy measure is extended to the signed fuzzy measure. The resulting classifier is called the generalized Nonlinear Integrals Classifier. However, the application of generalized Nonlinear Integrals is limited: when the number of features is very large, the computational complexity of nonlinear integrals becomes unacceptable. So we use feature selection as preprocessing to reduce the attributes and lower the complexity, which extends the application of Nonlinear Integrals to more real problems.

In this research, we use a polynomial kernel instead of the linear function in the above generalized Nonlinear Integrals as the integrand to describe the projection path. This projects the original data to the virtual space along different curves according to the degree of the polynomial integrand. The virtual data may be handled more easily and more accurately due to the effect of the polynomial. We vary the highest degree of the polynomial kernel from 1 to 5 and study how the performance of our model changes with the degree.

This paper is organized as follows. In Section 2, the fundamental concepts related to Fuzzy Measures and Nonlinear Integrals are introduced. Then the main algorithm of Generalized Nonlinear Integrals for classification is presented in Section 3. Section 4 extends the integrand from the classical function to the polynomial kernel and establishes the corresponding model based on Polynomial Nonlinear Integrals. The experimental results are shown in Section 5, together with a detailed analysis. Finally, some conclusions are summarized.

2 Fundamental Concepts

In classification, we are given a data set consisting of l example records, called the training set, where each record contains the value of a decisive attribute, Y, and the values of predictive attributes x1, x2, ..., xn. The positive integer l is the data size. The classifying attribute indicates the class to which each example belongs, and it is a categorical attribute with values coming from an unordered finite domain. The set of all possible values of the classifying attribute is denoted by C = {c1, c2, ..., cm}, where each ck, k = 1, 2, ..., m, refers to a specified class. The feature attributes are numerical, and their values are described by an n-dimensional vector, (f(x1), f(x2), ..., f(xn)). The range of the vector, a subset of n-dimensional Euclidean space, is called the feature space. The j-th observation, consisting of n feature attributes and the classifying attribute, can be denoted by


(fj(x1), fj(x2), ..., fj(xn)), j = 1, 2, ..., l. Before introducing the model, we present the fundamental concepts as follows.

2.1 Fuzzy Measure [8]

Let X = {x1, x2, ..., xn} be a nonempty finite set of feature attributes and P(X) be the power set of X.

Definition 1. A fuzzy measure, µ, is a mapping from P(X) to [0, ∞] satisfying the following conditions: 1) µ(∅) = 0; 2) A ⊂ B ⇒ µ(A) ≤ µ(B), ∀A, B ∈ P(X).

To further understand the practical meaning of the fuzzy measure, let us consider the elements in a universal set X as a set of predictive attributes used to predict a certain objective. Then, for each individual predictive attribute as well as each possible combination of the predictive attributes, a distinct value of a fuzzy measure is assigned to describe its influence on the objective. Due to the nonadditivity of the fuzzy measure, the influences of the predictive attributes on the objective are interdependent, so that their global contribution to the objective is not just the simple sum of their individual contributions. The set function µ is nonadditive in general. If µ(X) = 1, then µ is said to be regular. The monotonicity and non-negativity of the fuzzy measure are too restrictive for real applications. Thus, the signed fuzzy measure, which is a generalization of the fuzzy measure, has been defined [10], [11] and applied.

Definition 2. A set function µ : P(X) → (−∞, +∞) is called a signed (non-monotonic) fuzzy measure provided that µ(∅) = 0.

A signed fuzzy measure allows its value to be negative and drops the monotonicity constraint. Thus, it is more flexible in describing the individual and joint contribution rates of the predictive attributes in a universal set towards some target.

2.2 Nonlinear Integrals

Definition 3. Let µ be a non-monotonic fuzzy measure on P(X) and f be a real-valued function on X. The Choquet integral of f with respect to µ is obtained by

\[ \int f \, d\mu = \int_{-\infty}^{0} [\mu(F_\alpha) - \mu(X)] \, d\alpha + \int_{0}^{\infty} \mu(F_\alpha) \, d\alpha \]

where Fα = {x | f(x) ≥ α}, for any α ∈ (−∞, +∞), is called the α-cut of f. To calculate the value of the Nonlinear Integral of a given real-valued function f, the values of f, i.e., (f(x1), f(x2), ..., f(xn)), are usually sorted in nondecreasing order so that 0 ≤ f(x′1) ≤ f(x′2) ≤ ... ≤ f(x′n), where x′1, x′2, ..., x′n is a certain permutation of x1, x2, ..., xn. The value of the Nonlinear Integral can then be obtained by

\[ \int f \, d\mu = \sum_{i=1}^{n} [f(x'_i) - f(x'_{i-1})] \, \mu(\{x'_i, x'_{i+1}, \ldots, x'_n\}), \quad \text{where } f(x'_0) = 0 \]

The Choquet integral is based on linear operators to deal with nonlinear space.
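The sorted-sum formula above translates directly into code. The following is a minimal sketch, assuming the measure is stored as a dictionary keyed by `frozenset`; the function name `choquet_integral` and the measure values are illustrative assumptions.

```python
def choquet_integral(f, mu):
    """Discrete Choquet integral of f with respect to mu.

    f:  dict mapping each attribute to its value f(x).
    mu: dict mapping frozensets of attributes to measure values,
        with mu[frozenset()] == 0.
    """
    order = sorted(f, key=lambda x: f[x])  # attributes in nondecreasing f
    total, prev = 0.0, 0.0                 # prev plays the role of f(x'_0) = 0
    for i, x in enumerate(order):
        upper = frozenset(order[i:])       # the set {x'_i, ..., x'_n}
        total += (f[x] - prev) * mu[upper]
        prev = f[x]
    return total

# Illustrative measure on two attributes.
mu = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.2,
    frozenset({"x2"}): 0.6,
    frozenset({"x1", "x2"}): 1.0,
}
y1 = choquet_integral({"x1": 3.0, "x2": 5.0}, mu)  # 3*1.0 + (5-3)*0.6
y2 = choquet_integral({"x1": 5.0, "x2": 3.0}, mu)  # 3*1.0 + (5-3)*0.2
```

The two calls give different results for the same pair of values because the measure is nonadditive: which attribute carries the larger value determines which subset weight is applied to the increment.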

3 Projection Based on Nonlinear Integral for Classification

Based on the nonlinear integral, we can build an aggregation tool that projects the feature space onto a virtual 1-dimensional space. Under the projection, each point in the feature space becomes a value of the virtual variable. A point (f(x1), f(x2), ..., f(xn)) is projected to Ŷ, the value of the virtual variable, on a real axis through a nonlinear integral defined by Ŷ = ∫ f dµ. Once the values of µ are determined, we can calculate the virtual value Ŷ from f. Fig. 1 illustrates the projection from a 2-D feature space onto a real axis, L, by the nonlinear integral. The contours are broken lines because of the nonadditivity of the fuzzy measure. We can classify the cases according to the virtual values projected onto the axis by nonlinear integrals.

3.1 GA Based Learning Fuzzy Measure

Here we discuss the optimization of the fuzzy measure µ under the criterion of minimizing the corresponding global misclassification rate, which is obtained in the second part (the linear classifier of Section 3.2). In our GA model, we use a variant of the original function f, f′ = âf + b̂, where â is a vector to scale the values of the predictive attributes and b̂ is a vector to shift the coordinates of the data. Each chromosome represents the fuzzy measure µ, the scaling vector â and the shifting vector b̂. A signed fuzzy measure is 0 at the empty set ∅. If there are n attributes in the training data, a chromosome has 2^n − 1 + 2n genes, which are set to random real values at initialization. The genetic operations used are traditional ones. At each generation, for each chromosome, all variables are fixed and the virtual values of all training data are calculated using the nonlinear integral. The misclassification rate determined in the second part is used as the fitness value.

3.2 Linear Classifier for the Virtual Values

After determining the fuzzy measure µ, the scaling vector â, the shifting vector b̂ and the respective classification function from the training data in the GA, the original data in the n-dimensional feature space are projected onto a 1-dimensional space using fuzzy integrals. A linear classifier is then needed to classify the virtual data. Discriminant analysis is introduced in detail in [12]. We use Fisher's linear discriminant function [13] to perform classification in the projected space.
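The chromosome described in Section 3.1 holds 2^n − 1 measure values plus n scaling and n shifting parameters. A possible decoding is sketched below. The gene ordering is an assumption made for illustration, since the text does not specify one, and `decode_chromosome` is a hypothetical helper name.

```python
from itertools import combinations

def decode_chromosome(genes, n):
    """Split a flat gene list into (mu, a, b) for n attributes.

    Assumed layout (not specified in the text): the first 2**n - 1 genes are
    the measure values of the nonempty subsets, ordered by subset size and
    then lexicographically; the next n genes are the scaling vector a; the
    last n genes are the shifting vector b.
    """
    n_subsets = 2 ** n - 1
    if len(genes) != n_subsets + 2 * n:
        raise ValueError("expected 2**n - 1 + 2n genes")
    subsets = [frozenset(c) for size in range(1, n + 1)
               for c in combinations(range(n), size)]
    mu = {frozenset(): 0.0}  # a signed fuzzy measure is 0 at the empty set
    mu.update(zip(subsets, genes[:n_subsets]))
    a = genes[n_subsets:n_subsets + n]
    b = genes[n_subsets + n:]
    return mu, a, b

# Two attributes: 2**2 - 1 + 2*2 = 7 genes.
mu, a, b = decode_chromosome([0.2, 0.6, 1.0, 1.0, 2.0, 4.0, 6.0], n=2)
```

The gene count grows exponentially with n, which is why the number of attributes has to stay small for the GA to remain tractable.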

4 Polynomial Nonlinear Integrals

From Fig. 1, we can see a simple graphical representation of the projection by the classical Nonlinear Integrals. But in many real cases, a linear function cannot describe the practical information of databases very well. In [15], Nonlinear


Integrals with a quadratic core were proposed, but this remains limited in many real cases. So we extend the integrand from a linear function to a polynomial function as in Definition 4.

Definition 4. Let µ be a non-monotonic fuzzy measure on P(X) and f be a real-valued function on X. The Choquet integral of f with respect to µ is obtained by

\[ \int f \, d\mu = \int_{-\infty}^{0} [\mu(F_\alpha) - \mu(X)] \, d\alpha + \int_{0}^{\infty} \mu(F_\alpha) \, d\alpha \]

where Fα = {x | f(x) ≥ α}, for any α ∈ (−∞, +∞), is called the α-cut of f. Let µ be a non-monotonic fuzzy measure on X and f be a nonnegative function on X. The Polynomial Nonlinear Integral with respect to µ is obtained by

\[ \int f^p \, d\mu = \sum_{i=1}^{n} [f(x'_i)^p - f(x'_{i-1})^p] \, \mu(\{x'_i, x'_{i+1}, \ldots, x'_n\}) \]

where x′1, x′2, ..., x′n is a certain permutation of x1, x2, ..., xn so that 0 ≤ f(x′1) ≤ f(x′2) ≤ ... ≤ f(x′n), and f(x′0)^p = 0. Here p is a positive integer and f^p is the integrand that replaces the classical linear one.

In this section, we discuss the projection by Polynomial Nonlinear Integrals for different degrees of the polynomial integrand. We design the polynomial integrand as (âf + b̂)^p. When p = 1, the Polynomial Nonlinear Integral coincides with the classical generalized Nonlinear Integral. For simplicity, we limit our discussion to two-dimensional spaces in this paper. A similar idea applies to higher-dimensional feature spaces.

4.1 p=1

When p = 1, the projection axis is linear and the projection contours are piecewise linear. In 2-dimensional space, the projection axis satisfies the equation a1f1 + b1 = a2f2 + b2, a ≠ 0, b ≠ 0. The slope of the projection axis can be positive or negative. Let us see an example illustrating the situation with respect to the classical fuzzy measure. Example 4.1. Let µ1 = 0.2, µ2 = 0.6, µ12 = 1.0. The other parameters are a1 = 1, b1 = 4; a2 = 2, b2 = 6. The real axis L can then be computed by solving the equation a1f1 + b1 = a2f2 + b2.

Fig. 1. Projection of classical Nonlinear Integrals


Fig. 2. Projection of PNI with degree 2

Fig. 3. Projection of PNI with degree 3

\[ L: f_2 = \frac{b_1 - b_2}{a_2} + \frac{a_1}{a_2} f_1 = -1 + 0.5 f_1 \]

The contours can be computed using the generalized Nonlinear Integrals defined in Section 2.2. When a1f1 + b1 < a2f2 + b2, the contours are above L, y = 0.4f1 + 1.2f2 + 5.2. When a1f1 + b1 > a2f2 + b2, the contours are below L, y = 0.2f1 + 1.6f2 + 5.6. This projection is shown in Fig. 1. In our model, we extend the fuzzy measure to a generalized fuzzy measure, the signed fuzzy measure. This means that the joint contribution of multiple features may not be larger than the individual contributions. This situation makes the direction of the projection lines opposite to those in Fig. 1.

4.2 p=2

When p = 2, the polynomial integrand is represented as (af + b)^2. The projection axis can be computed similarly to the case p = 1; it satisfies (a1f1 + b1)^2 = (a2f2 + b2)^2, a ≠ 0, b ≠ 0. So there are two projection axes, obtained by solving the above equation, i.e.

\[ L: f_2 = \frac{\pm (a_1 f_1 + b_1) - b_2}{a_2} \]

Projection contours may be parabolas, hyperbolas or ellipses depending on the signs of the parameters. Examples are shown in Fig. 2. The data with (a1f1 + b1)^2 < (a2f2 + b2)^2 are in the blue areas and those with (a1f1 + b1)^2 > (a2f2 + b2)^2 are in the red areas. The blue projection curves follow the function

\[ y = \mu_{12} (a_1 f_1 + b_1)^2 + \mu_2 [(a_2 f_2 + b_2)^2 - (a_1 f_1 + b_1)^2] = (\mu_{12} - \mu_2)(a_1 f_1 + b_1)^2 + \mu_2 (a_2 f_2 + b_2)^2 \]

The red projection curves follow the function

\[ y = \mu_{12} (a_2 f_2 + b_2)^2 + \mu_1 [(a_1 f_1 + b_1)^2 - (a_2 f_2 + b_2)^2] = (\mu_{12} - \mu_1)(a_2 f_2 + b_2)^2 + \mu_1 (a_1 f_1 + b_1)^2 \]

4.3 p=3

When p = 3, the polynomial integrand can be represented as (af + b)^3. The projection axis needs to satisfy (a1f1 + b1)^3 = (a2f2 + b2)^3, a ≠ 0, b ≠ 0. Due to the odd exponent, there is only one line, as in the case p = 1. The difference between p = 1 and p = 3 lies only in the projection path: the former is a straight line, while the latter follows a polynomial curve of degree 3. A representative figure is given in Fig. 3. When p = 4, the situation is similar to that of p = 2; when p = 5, the situation is similar to that of p = 3. So the detailed process and figures are skipped.
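The projections discussed in Sections 4.1 to 4.3 can be checked numerically. The sketch below reads the polynomial integrand as (a_i f_i + b_i)^p per attribute and then applies the sorted-sum Choquet formula to those values. This reading, the function names, and the test points are assumptions for illustration; the parameters are those of Example 4.1.

```python
def choquet_integral(g, mu):
    """Sorted-sum Choquet integral of g (dict attribute -> value) w.r.t. mu."""
    order = sorted(g, key=lambda x: g[x])
    total, prev = 0.0, 0.0
    for i, x in enumerate(order):
        total += (g[x] - prev) * mu[frozenset(order[i:])]
        prev = g[x]
    return total

def polynomial_integral(f, a, b, mu, p=1):
    """Choquet integral of the integrand (a*f + b)**p, attribute by attribute."""
    g = {x: (a[x] * f[x] + b[x]) ** p for x in f}
    return choquet_integral(g, mu)

# Parameters of Example 4.1.
mu = {frozenset(): 0.0, frozenset({"x1"}): 0.2,
      frozenset({"x2"}): 0.6, frozenset({"x1", "x2"}): 1.0}
a = {"x1": 1, "x2": 2}
b = {"x1": 4, "x2": 6}

# p = 1, point above L (a1*f1 + b1 < a2*f2 + b2): y = 0.4*f1 + 1.2*f2 + 5.2
y_above = polynomial_integral({"x1": 2, "x2": 3}, a, b, mu, p=1)
# p = 1, point below L (a1*f1 + b1 > a2*f2 + b2): y = 0.2*f1 + 1.6*f2 + 5.6
y_below = polynomial_integral({"x1": 10, "x2": 1}, a, b, mu, p=1)
# p = 2 at the first point: (mu12 - mu2)*(a1*f1 + b1)**2 + mu2*(a2*f2 + b2)**2
y_quad = polynomial_integral({"x1": 2, "x2": 3}, a, b, mu, p=2)
```

On the axis L the two linear contour formulas for p = 1 agree, which is a quick consistency check on the constants in Example 4.1.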

5 Experimental Results and Analysis

Our experiments have two parts. The first part, described in Table 1, contains two synthetic datasets and the Monk series of datasets from the UCI repository [14]. The synthetic datasets have the same yin-yang distribution and different sizes, 100 and 200, as shown in Fig. 4.

Fig. 4. The synthetic data distribution

Table 1. Description of datasets in part 1

Datasets     Examples   Attributes   Classes
Syn-Data1    100        2            2
Syn-Data2    200        2            2
Monk1        556        6            2
Monk2        601        6            2
Monk3        554        6            2

Table 2. Description of datasets in part 2

Datasets               Examples   Attributes   Classes
Heart                  270        13           2
Pima                   768        7            2
Wdbc                   569        30           2
Breast-cancer-winson   699        9            2
Echocardiogram         132        13           2

Table 3. The feature subsets using RS

Datasets               RS
Heart                  {1, 8, 13}
Pima                   {2, 6, 8}
Wdbc                   {23, 24}
Breast-cancer-winson   {3, 5, 6, 7}
Echocardiogram         {1, 3, 9}

The second part contains five datasets selected from the UCI repository, whose attributes are reduced to a reduct. The detailed information is shown in Table 2. Two of these datasets, breast-cancer-winson and echocardiogram, have noisy data labeled as "?". We substitute these values with the most common value or the mean value, as implemented in RSES 2.0. The number of attributes of each dataset is rather large for Nonlinear Integrals to deal with: learning the fuzzy measure would take a very long time. So feature selection is a necessary step. Based on previous research, we adopt the reduct in Rough Sets to process the data before classification. As is well known, there may be many reducts in Rough Sets for one database. We simply pick the one with the higher information gain. The selected feature subsets are shown in Table 3. The feature subsets obtained from Rough Sets are much smaller than the original ones. This greatly improves the efficiency of Nonlinear Integrals because the time for learning the signed fuzzy measure is greatly reduced. The main algorithm of the classification model is implemented in Matlab v7.2. We test the performance of this model for p from 1 to 5. The results in each situation are shown in Table 4. The italic format denotes the best result for each dataset.

Table 4. The results of PNI with different degrees for datasets in part 1 and part 2

Datasets                            p=1     p=2     p=3     p=4     p=5
Syn-Data1              train-accu   0.959   0.966   0.958   0.966   0.959
                       test-accu    0.902   0.931   0.901   0.941   0.910
Syn-Data2              train-accu   0.964   0.959   0.954   0.952   0.947
                       test-accu    0.945   0.935   0.925   0.905   0.929
Monk1                  train-accu   0.867   0.890   0.880   0.883   0.827
                       test-accu    0.789   0.793   0.744   0.886   0.797
Monk2                  train-accu   0.720   0.703   0.680   0.670   0.660
                       test-accu    0.677   0.646   0.611   0.644   0.646
Monk3                  train-accu   0.954   0.967   0.972   0.971   0.978
                       test-accu    0.950   0.964   0.975   0.975   0.986
Heart                  train-accu   0.650   0.662   0.666   0.655   0.659
                       test-accu    0.556   0.600   0.600   0.633   0.611
Pima                   train-accu   0.777   0.776   0.775   0.769   0.769
                       test-accu    0.751   0.755   0.767   0.749   0.740
Wdbc                   train-accu   0.903   0.898   0.906   0.910   0.902
                       test-accu    0.866   0.875   0.879   0.882   0.863
Breast-cancer-winson   train-accu   0.967   0.959   0.952   0.959   0.967
                       test-accu    0.931   0.938   0.930   0.938   0.954
Echocardiogram         train-accu   0.923   0.921   0.920   0.918   0.918
                       test-accu    0.885   0.886   0.909   0.894   0.894

Because a polynomial function can describe the data distribution of some particular datasets, projection along a polynomial curve can be more helpful for classifying the corresponding virtual data. The accuracy of Polynomial Nonlinear Integrals is better than that of the classical one, i.e. p = 1, in most cases. But the performance of Polynomial Nonlinear Integrals is not best when the degree is highest, so the accuracy does not simply grow as the degree is increased.

6 Conclusions

In this paper, we remove the restriction of classical Nonlinear Integrals to a linear integrand and introduce a polynomial function as the nonlinear integrand. This extension generalizes the projection from a straight line to a wider variety of curves, which can cover more complicated data. The accuracy of the classification model does not necessarily increase with the degree of the polynomial. So we can select the degree of the Polynomial Nonlinear Integrals that works best for a given dataset and obtain better performance. At the same time, the complexity of Polynomial Nonlinear Integrals is not greater than that of the classical Nonlinear Integrals Classifier.


References

1. Sugeno, M.: Theory of Fuzzy Integrals and Its Applications. Doctoral Thesis, Tokyo Institute of Technology (1974)
2. Grabisch, M.: The Representation of Importance and Interaction of Features by Fuzzy Measures. Pattern Recognition Letters 17, 567–575 (1996)
3. Grabisch, M., Nicolas, J.M.: Classification by Fuzzy Integral: Performance and Tests. Fuzzy Sets and Systems 65, 255–271 (1994)
4. Keller, J.M., Yan, B.: Possibility Expectation and Its Decision Making Algorithm. In: 1st IEEE Int. Conf. on Fuzzy Systems, San Diego, pp. 661–668 (1992)
5. Mikenina, L., Zimmermann, H.J.: Improved Feature Selection and Classification by the 2-additive Fuzzy Measure. Fuzzy Sets and Systems 107, 197–218 (1999)
6. Xu, K.B., Wang, Z.Y., Heng, P.A., Leung, K.S.: Classification by Nonlinear Integral Projections. IEEE Transactions on Fuzzy Systems 11(2), 187–201 (2003)
7. Wang, W., Wang, Z.Y., Klir, G.J.: Genetic Algorithm for Determining Fuzzy Measures from Data. Journal of Intelligent and Fuzzy Systems 6, 171–183 (1998)
8. Wang, Z.Y., Klir, G.J.: Fuzzy Measure Theory. Plenum, New York (1992)
9. Wang, Z.Y., Leung, K.S., Wang, J.: A Genetic Algorithm for Determining Nonadditive Set Functions in Information Fusion. Fuzzy Sets and Systems 102, 463–469 (1999)
10. Murofushi, T., Sugeno, M., Machida, M.: Non Monotonic Fuzzy Measures and the Choquet Integral. Fuzzy Sets and Systems 64, 73–86 (1994)
11. Grabisch, M., Murofushi, T., Sugeno, M. (eds.): Fuzzy Measures and Integrals: Theory and Applications. Physica-Verlag (2000)
12. McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York (1992)
13. Mika, S., Smola, A.J., Schölkopf, B.: An Improved Training Algorithm for Fisher Kernel Discriminants. In: Jaakkola, T., Richardson, T. (eds.) Proc. Artificial Intelligence and Statistics (AISTATS 2001), pp. 98–104 (2001)
14. Merz, C., Murphy, P.: UCI Repository of Machine Learning Databases (1996), ftp://ftp.ics.uci.edu/pub/machine-learning-databases
15. Liu, M., Wang, Z.Y.: Classification Using Generalized Choquet Integral Projections. In: Proc. World Congress of the International Fuzzy Systems Association (IFSA 2005), pp. 421–426 (2005)

Testing Error Estimates for Regularization and Radial Function Networks

Petra Vidnerová and Roman Neruda

Institute of Computer Science, Academy of Sciences of the Czech Republic
Pod vodárenskou věží 2, Prague 8, Czech Republic
[email protected]

Abstract. Regularization theory presents a sound framework for solving supervised learning problems. However, there is a gap between the theoretical results and the practical suitability of regularization networks (RN). Radial basis function networks (RBF) can be seen as a special case of regularization networks with a selection of learning algorithms. We study the relationship between RN and RBF, and experimentally evaluate their approximation and generalization ability with respect to the number of hidden units.

Keywords: Regularization, Radial Basis Function Networks, Generalization.

1 Introduction

The problem of supervised learning is a subject of great interest. In many applications, we are given a set of examples $\{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}\}_{i=1}^N$ that was obtained by random sampling of some real function $f$, generally in the presence of noise. We refer to this set as a training set. The goal is to recover the function $f$ from the data, or to find the best estimate of it. The function need not interpolate all the given data points exactly, but it should generalize well, that is, it should also give relevant outputs for data not included in the training set.

Supervised learning is often studied as a function approximation problem [1]. Given the data set, we are looking for a function that approximates the unknown function $f$. This is usually done by empirical risk minimization, i.e. by minimizing the functional $H[f] = \frac{1}{N}\sum_{i=1}^N (f(x_i) - y_i)^2$ over a chosen hypothesis space, i.e. over a set of functions of a chosen type (representable by a chosen type of neural network).

In Section 2 we study the problem of learning from examples as a function approximation problem and show how the regularization network (RN) is derived from regularization theory. In Section 3 we describe one type of neural network, the RBF network, which can be seen as a special case of RN. Learning methods based on the regularization approach have, in general, a very good theoretical background. The relation between the number of hidden units and approximation accuracy has also been extensively studied, and bounds on the convergence rate of solutions with a limited number of hidden units to the optimal solution have been derived (e.g. [2,3,4]). In Section 4 we demonstrate experimentally that the theoretical estimates for RN hold to some degree for RBF networks, and we derive several recommendations for choosing the number of units.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 549–554, 2008. © Springer-Verlag Berlin Heidelberg 2008


2 Approximation via Regularization Network

We are given a set of examples $\{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}\}_{i=1}^N$ obtained by random sampling of some real function $f$, and we would like to find this function. Since this problem is ill-posed, we have to add some a priori knowledge about the function. We usually assume that the function is smooth, in the sense that two similar inputs correspond to two similar outputs and the function does not oscillate too much. This is the main idea of regularization theory, where the solution is found by minimizing the functional (1), which contains both the data and the smoothness information:

$$H[f] = \frac{1}{N}\sum_{i=1}^N (f(x_i) - y_i)^2 + \gamma\Phi[f], \tag{1}$$

where $\Phi$ is called a stabilizer and $\gamma > 0$ is the regularization parameter controlling the trade-off between the closeness to data and the smoothness of the solution. The regularization scheme (1) was first introduced by Tikhonov [5] and is therefore called Tikhonov regularization. The regularization approach has a good theoretical background: it was shown that for a wide class of stabilizers the solution has the form of a feed-forward neural network with one hidden layer, called a regularization network, and that different types of stabilizers lead to different types of regularization networks [6,7].

Poggio and Smale [7] proposed a learning algorithm (Alg. 2.1) derived from the regularization scheme (1). They choose the hypothesis space as a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_K$ defined by an explicitly chosen, symmetric, positive-definite kernel function $K_x(x') = K(x, x')$. The stabilizer is defined by means of the norm in $\mathcal{H}_K$, so the problem is formulated as follows:

$$\min_{f \in \mathcal{H}_K} H[f], \quad \text{where } H[f] = \frac{1}{N}\sum_{i=1}^N (y_i - f(x_i))^2 + \gamma\|f\|_K^2. \tag{2}$$

The solution of minimization (2) is unique and has the form

$$f(x) = \sum_{i=1}^N c_i K_{x_i}(x), \qquad (N\gamma I + K)c = y, \tag{3}$$

where $I$ is the identity matrix, $K$ is the matrix $K_{i,j} = K(x_i, x_j)$, and $y = (y_1, \ldots, y_N)$. The most commonly used kernel function is the Gaussian $K(x, x') = e^{-\left(\frac{\|x - x'\|}{b}\right)^2}$.

Input: Data set $\{x_i, y_i\}_{i=1}^N \subseteq X \times Y$
Output: Function $f$.
1. Choose a symmetric, positive-definite function $K_x(x')$, continuous on $X \times X$.
2. Create $f : X \to Y$ as $f(x) = \sum_{i=1}^N c_i K_{x_i}(x)$ and compute $c = (c_1, \ldots, c_N)$ by solving
$$(N\gamma I + K)c = y, \tag{4}$$
where $I$ is the identity matrix, $K_{i,j} = K(x_i, x_j)$, $y = (y_1, \ldots, y_N)$, and $\gamma > 0$.

Algorithm 2.1
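The two steps of Algorithm 2.1 can be illustrated with a minimal sketch in plain Python. The function names (`gaussian_kernel`, `solve_linear`, `fit_regularization_network`), the toy data and the parameter values are assumptions made for illustration only; they do not come from the paper, and a practical implementation would normally rely on a linear algebra library.

```python
import math

def gaussian_kernel(x, z, b=1.0):
    """Gaussian kernel K(x, z) = exp(-(||x - z|| / b)^2)."""
    dist = math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
    return math.exp(-(dist / b) ** 2)

def solve_linear(A, y):
    """Solve A c = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [yi] for row, yi in zip(A, y)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    c = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][k] * c[k] for k in range(r + 1, n))
        c[r] = (M[r][n] - s) / M[r][r]
    return c

def fit_regularization_network(X, y, gamma=1e-3, b=1.0):
    """Return f(x) = sum_i c_i K(x_i, x), where (N*gamma*I + K) c = y."""
    N = len(X)
    A = [[gaussian_kernel(X[i], X[j], b) + (N * gamma if i == j else 0.0)
          for j in range(N)] for i in range(N)]
    c = solve_linear(A, y)
    return lambda x: sum(ci * gaussian_kernel(xi, x, b) for ci, xi in zip(c, X))

# Hypothetical toy data: noise-free samples of sin on [0, 3].
X = [[0.5 * i] for i in range(7)]
y = [math.sin(x[0]) for x in X]
f = fit_regularization_network(X, y, gamma=1e-6, b=1.0)
```

With a very small γ the fitted function nearly interpolates the training points; larger values of γ trade closeness to the data for smoothness, which is the trade-off described for functional (1).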


The power of Alg. 2.1 lies in its simplicity and effectiveness. However, its real performance depends significantly on the choice of the parameter $\gamma$ and the type of kernel function. The optimal choice of these parameters depends on the particular data set, and there is no general heuristic for setting them.

3 RBF Neural Networks

An RBF neural network (RBF network) [1,8] represents a relatively new model of neural network. In contrast to classical models (multi-layer perceptrons, etc.), it is a network with local units.

$$y(x) = \varphi\left(\frac{\|x - c\|}{b}\right) \tag{5}$$

$$f_s(x) = \sum_{j=1}^h w_{js}\,\varphi\left(\frac{\|x - c_j\|}{b_j}\right) \tag{6}$$

Fig. 1. RBF network architecture and RBF network function
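The unit function (5) and the network output (6) can be sketched as follows, assuming Gaussian units and the plain Euclidean norm. The names `rbf_unit` and `rbf_network` and the layout of `weights` (one row per hidden unit, one column per output) are illustrative choices, not notation from the paper.

```python
import math

def rbf_unit(x, c, b):
    """Gaussian RBF unit: phi(||x - c|| / b) with phi(r) = exp(-r^2)."""
    dist = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
    return math.exp(-(dist / b) ** 2)

def rbf_network(x, centers, widths, weights):
    """Network output f_s(x) = sum_j w_js * phi(||x - c_j|| / b_j).

    weights[j][s] is the weight from hidden unit j to output s.
    """
    hidden = [rbf_unit(x, c, b) for c, b in zip(centers, widths)]
    n_out = len(weights[0])
    return [sum(weights[j][s] * hidden[j] for j in range(len(hidden)))
            for s in range(n_out)]

# Two hidden units, one output; the input coincides with the first center.
out = rbf_network([0.0, 0.0],
                  centers=[[0.0, 0.0], [1.0, 0.0]],
                  widths=[1.0, 1.0],
                  weights=[[1.0], [2.0]])
```

Because each unit responds mainly to inputs near its center, the first unit contributes its full weight here while the second is damped by the factor $e^{-1}$.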

An RBF network is a standard feed-forward neural network with one hidden layer of RBF units and a linear output layer (Fig. 1). The RBF units represent RBF functions (5), usually Gaussians. The network computes its output (6) as a linear combination of the outputs of the hidden layer. There is a variety of algorithms for RBF network learning; in our past work we studied their behavior and the possibilities of their combination [9]. The two most significant algorithms, three-step learning and gradient learning, are sketched in Algorithm 3.1 and Algorithm 3.2. See [9] for details.

Input: Data set $\{x_i, y_i\}_{i=1}^N$
Output: $\{c_i, b_i, C_i, w_{ij}\}_{i=1..h}^{j=1..m}$
1. Set the centers $c_i$ by k-means clustering.
2. Set the widths $b_i$ and matrices $C_i$.
3. Set the weights $w_{ij}$ by solving $\Phi W = D$, where
$$D_{ij} = \sum_{t=1}^N y_{tj}\, e^{-\left(\frac{\|x_t - c_i\|_{C_i}}{b_i}\right)^2}, \qquad \Phi_{qr} = \sum_{t=1}^N e^{-\left(\frac{\|x_t - c_q\|_{C_q}}{b_q}\right)^2} e^{-\left(\frac{\|x_t - c_r\|_{C_r}}{b_r}\right)^2}.$$

Algorithm 3.1


Input: Data set $\{x_i, y_i\}_{i=1}^N$
Output: $\{c_i, b_i, C_i, w_{ij}\}_{i=1..h}^{j=1..m}$
1. Put a small part of the data aside as an evaluation set $ES$, keep the rest as a training set $TS$.
2. $\forall j$: $c_j(i) \leftarrow$ random sample from $TS_1$; $\forall j$: $b_j(i), \Sigma_j^{-1}(i) \leftarrow$ small random value; $i \leftarrow 0$.
3. $\forall j$, for each $p(i)$ in $c_j(i), b_j(i), \Sigma_j^{-1}(i)$: $\Delta p(i) \leftarrow -\epsilon \frac{\delta E_1}{\delta p} + \alpha \Delta p(i-1)$, $p(i) \leftarrow p(i) + \Delta p(i)$.
4. $E_1 \leftarrow \sum_{x \in TS_1} (f(x) - y_i)^2$, $E_2 \leftarrow \sum_{x \in TS_2} (f(x) - y_i)^2$.
5. If $E_1$ and $E_2$ are decreasing, $i \leftarrow i + 1$, go to 3, else STOP. If $E_2$ started to increase, STOP.

Algorithm 3.2
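The update rule of step 3 and the stopping rule of step 5 can be sketched generically, independent of the RBF parameterization. The function `gradient_learning`, its arguments and the one-dimensional quadratic toy problem are hypothetical and serve only to show the momentum update and the early stopping on an evaluation error; they are not the authors' implementation.

```python
def gradient_learning(p, grad_e1, e1, e2, eps=0.1, alpha=0.1, max_iter=1000):
    """Momentum gradient descent with early stopping (cf. Algorithm 3.2).

    p        -- list of parameters
    grad_e1  -- function returning dE1/dp for every parameter
    e1, e2   -- training and evaluation error functions
    """
    delta = [0.0] * len(p)
    err1, err2 = e1(p), e2(p)
    for _ in range(max_iter):
        g = grad_e1(p)
        # Step 3: delta_p(i) = -eps * dE1/dp + alpha * delta_p(i - 1)
        delta = [-eps * gj + alpha * dj for gj, dj in zip(g, delta)]
        candidate = [pj + dj for pj, dj in zip(p, delta)]
        new1, new2 = e1(candidate), e2(candidate)
        # Step 5: continue only while both errors keep decreasing
        if new1 >= err1 or new2 >= err2:
            break
        p, err1, err2 = candidate, new1, new2
    return p

# Hypothetical toy problem: the training error has its minimum at 3.0,
# the evaluation error at the slightly different value 3.1.
target_train, target_eval = 3.0, 3.1
p = gradient_learning(
    [0.0],
    grad_e1=lambda q: [2.0 * (q[0] - target_train)],
    e1=lambda q: (q[0] - target_train) ** 2,
    e2=lambda q: (q[0] - target_eval) ** 2,
)
```

The iteration cap plays the role of the maximal number of learning iterations mentioned in the experiments; the values of `eps` and `alpha` here are arbitrary.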

4 Error Estimates

The relation between the number of hidden units and approximation accuracy has been extensively studied, and bounds on the convergence rate of solutions with a limited number of hidden units to the optimal solution (3) have been derived (e.g. [2,3,4]). Most of the results agree on a convergence rate close to $1/\sqrt{h}$, where $h$ is the number of hidden units. In [4, Theorems 4.2–6.3], upper bounds are derived on the convergence rate of suboptimal solutions to the optimal solution achievable without restrictions on the model complexity. The bounds are of the form $1/\sqrt{h}$ multiplied by a term depending on the data set size, the output vector, the Gram matrix of the kernel function with respect to the input data (the matrix obtained by applying the kernel function to all pairs of data points), and the regularization parameter.

In this section, we study experimentally the relation between the network size (i.e. the number of hidden units) and approximation accuracy and generalization. In line with the theoretical results, we expect the approximation accuracy to improve with an increasing number of hidden units. Reasonable approximation accuracy should be achieved already with small networks. In addition, a high number of hidden units makes the learning task more difficult, which can influence the results.

In our experiments, we applied gradient learning (Alg. 3.2) to data from the Proben1 repository [10]. Fig. 2 and Fig. 3 show the results for the cancer task. Fig. 2 shows the error achieved on the training set (median of 10 computations) and the corresponding error on the testing set. It can be seen that for small numbers of hidden units the training error decreases rapidly as units are added, while for networks with more than 100 units there is no significant improvement. The situation for generalization ability, represented by the testing error, is similar. However, the improvement stops earlier: the minimal errors are achieved for networks with about 40 units. In this particular case, a network with 40 hidden units is sufficient. Bigger networks (such as those with 100 hidden units) achieve a better approximation on the training set, but do not exhibit better generalization. The maximal number of learning iterations was set to 50 000, which was reached in most cases for networks with more than 100 units. Therefore overfitting was not observed for networks with higher numbers of hidden units.

Fig. 2. Testing and training errors depending on the number of network units (plot "Training and testing error": error against number of units, 0 to 500; curves: training error, testing error)

The numbers of iterations needed to train networks of different sizes are shown in Fig. 3. It clearly shows that the time needed for network training increases significantly with the number of hidden units.

Fig. 3. Number of iterations needed to train network with given number of hidden units (plot "Number of iterations": iterations against number of units, 0 to 500)

Since the convergence is quite fast, we can suggest that small networks provide sufficiently good solutions. The theoretically estimated convergence rates justify using networks of smaller complexity in real-life applications. Smaller networks also have a smaller number of parameters that have to be tuned during the training process. Therefore, they are more easily trained.


5 Conclusion

Most learning algorithms work with networks of fixed architecture. Those that also optimize the number of hidden units can be divided into two groups: incremental and pruning. Pruning algorithms start with large networks and try to eliminate the irrelevant units, while incremental algorithms start with a small network and add units as long as the network performance improves. The theoretical results mentioned above speak in favor of incremental algorithms. First, learning of small networks is fast, since only a small number of parameters has to be optimized. Second, it is quite probable that a reasonable solution will be found among the smaller networks. Based on our experiments, we recommend starting with a small number of hidden units and increasing the network size only as long as the generalization ability also improves.

Several issues remain to be solved in our future work. The behavior of the learning algorithm is influenced by the choice of learning parameters. In our case, an optimal selection of the learning rate $\epsilon$ of the gradient algorithm had a crucial effect on the performance. Some way of automatically adapting the learning parameters should be tested. Moreover, we plan to perform the same experiments with a three-step learning algorithm for RBF networks that is closer to the RN approach and usually provides faster, if not always better, solutions.

Acknowledgement. This research has been supported by the project no. KJB100300804 of Grant Agency of AS ČR, and by the Institutional Research Plan AV0Z10300504 "Computer Science for the Information Society: Models, Algorithms, Applications".

References
1. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Tom Robins (1999)
2. Xu, L., Krzyżak, A., Yuille, A.: On Radial Basis Function Nets and Kernel Regression: Statistical Consistency, Convergence Rates, and Receptive Field Size. Neural Netw. 7(4), 609–628 (1994)
3. Corradi, V., White, H.: Regularized Neural Networks: Some Convergence Rate Results. Neural Computation 7, 1225–1244 (1995)
4. Kůrková, V., Sanguineti, M.: Learning with Generalization Capability by Kernel Methods of Bounded Complexity. J. Complex. 21(3), 350–367 (2005)
5. Tikhonov, A., Arsenin, V.: Solutions of Ill-posed Problems. W.H. Winston, Washington (1977)
6. Poggio, T., Girosi, F.: A Theory of Networks for Approximation and Learning. Technical report, Cambridge, MA, USA (1989)
7. Poggio, T., Smale, S.: The Mathematics of Learning: Dealing with Data. Notices of the AMS 50, 536–544 (2003)
8. Powell, M.: Radial Basis Functions for Multivariable Interpolation: A Review. In: IMA Conference on Algorithms for the Approximation of Functions and Data, RMCS, Shrivenham, England, pp. 143–167 (1985)
9. Neruda, R., Kudová, P.: Learning Methods for Radial Basis Functions Networks. Future Generation Computer Systems 21, 1131–1142 (2005)
10. Prechelt, L.: PROBEN1 – A Set of Benchmarks and Benchmarking Rules for Neural Network Training Algorithms. Technical Report 21/94, Universitaet Karlsruhe (1994)

A Practical Clustering Algorithm

Wei Li¹, Haohao Li², and Jianye Chen¹

¹ School of Science, Hangzhou Dianzi University, Hangzhou 310018, China
² School of Mathematics and Statistics, Lanzhou University, Lanzhou 730107, China

Abstract. We present a novel clustering algorithm (SDSA algorithm) based on the concept of the short distance between consecutive points and the small angle between the consecutive vectors formed by three adjacent points. The proposed SDSA algorithm is not only suitable for almost all test data sets used by Chung and Lin for the point symmetry-based K-means algorithm (PSK algorithm) and their newly proposed modified point symmetry-based K-means algorithm (MPSK algorithm), but is also suitable for many other cases where the PSK algorithm and the MPSK algorithm do not perform well. On some test data sets, experimental results demonstrate that the proposed SDSA algorithm is rather encouraging when compared to the previous PSK algorithm and MPSK algorithm.

Keywords: Pattern recognition, Data clustering, PSK algorithm, MPSK algorithm, SDSA algorithm.

1 Introduction

Partitioning a set of data points into nonoverlapping clusters is an important topic in data analysis and pattern classification. It has many applications, in areas such as medicine, psychology, biology, sociology, pattern recognition, and image processing. Cluster seeking is very experiment-oriented in the sense that cluster algorithms that can deal with all situations are not yet available. Extensive and good overviews of clustering algorithms can be found in the literature [1,2,3]. Perhaps the best-known and most widely used member of the family is the K-means algorithm. Many efficient clustering algorithms have been developed for data sets of different distributions in the past several decades [4,5,6,7,8,9]. Each approach has its own merits and disadvantages. Among these clustering algorithms, Su and Chou [8] were the first to take the point symmetry issue into account. Based on their proposed point symmetry distance measure, they presented a novel and efficient clustering algorithm, which is very suitable for symmetrical intra-clusters; for convenience, their clustering algorithm is named the PSK algorithm. Experimental results demonstrate that the PSK clustering algorithm outperforms the traditional K-means algorithm. In essence, the PSK algorithm not only inherits the simplicity of the K-means algorithm, but can also handle symmetrical intra-clusters quite well. Recently, the PSK algorithm was modified by Chung and Lin [9,10] and extended to handle both symmetrical intra-clusters and symmetrical inter-clusters (MPSK algorithm), as well as data sets with the line symmetry property (LSK algorithm). However, the PSK algorithm and the MPSK algorithm do not perform so well if the symmetry property is not ideal. This fact will be shown by our simulation results on some sets of data points later.

In this paper, we propose a new effective clustering algorithm, based on the very simple concepts of close distance and slowly varying angles formed by three consecutive points. Several data sets are used to illustrate its effectiveness when compared to the previous PSK algorithm and MPSK algorithm. The rest of this paper is organized as follows. In Section 2, the PSK algorithm and the MPSK algorithm are surveyed. In Section 3, our proposed algorithm is described. In Section 4, some experimental results are presented to show the effectiveness of the proposed SDSA algorithm. In Section 5, some concluding remarks are given.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 555–560, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 PSK Algorithm and MPSK Algorithm

Based on the K-means algorithm, Su and Chou [8] presented an efficient point symmetry distance (PSD) measure to help partition the data set into clusters where each cluster has the point symmetry property. Given $N$ data points $\{p_i \mid 1 \le i \le N\}$, after running the K-means algorithm, let the obtained $K$ temporary cluster centroids be denoted by $\{c_k \mid 1 \le k \le K\}$. The PSD measure between the data point $p_i$ and the data point $p_j$ relative to the cluster centroid $c_k$ is defined as

$$d_s(p_j, c_k) = \min_{i} \frac{\|(p_j - c_k) + (p_i - c_k)\|}{\|p_j - c_k\| + \|p_i - c_k\|} \tag{1}$$

for $i \ne j$ and $1 \le i \le N$.
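As a small worked illustration of the PSD measure (1), the following sketch evaluates it directly from the definition. The function name `psd` and the three-point data set are assumptions made for this example; skipping pairs with a zero denominator is also an added convention, since (1) is undefined when both points coincide with the centroid.

```python
import math

def _norm(v):
    return math.sqrt(sum(a * a for a in v))

def psd(j, points, centroid):
    """Point symmetry distance of points[j] relative to a centroid, Eq. (1)."""
    dj = [a - c for a, c in zip(points[j], centroid)]
    best = float("inf")
    for i, p in enumerate(points):
        if i == j:
            continue
        di = [a - c for a, c in zip(p, centroid)]
        denom = _norm(dj) + _norm(di)
        if denom == 0.0:  # both points sit on the centroid; skip (added convention)
            continue
        num = _norm([a + b for a, b in zip(dj, di)])
        best = min(best, num / denom)
    return best

# (1, 0) and (-1, 0) are mirror images about the origin, (0, 2) has no mirror partner.
points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0)]
d_symmetric = psd(0, points, (0.0, 0.0))
d_unpaired = psd(2, points, (0.0, 0.0))
```

A point with an exact mirror image about the centroid gets distance 0, while the unpaired point gets a clearly positive value, which is what makes the measure sensitive to point symmetry.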

Fig. 1. Clustering performance comparison for the first data set. (a) The data set contains three compact circles. (b) The clustering result obtained by using the K-means algorithm. (c) The clustering result obtained by using the PSK algorithm. (d) The clustering result obtained by using the proposed MPSK algorithm.

Fig. 2. Clustering performance comparison for the second data set. (a) The data set contains three compact circles. (b) The clustering result obtained by using the K-means algorithm. (c) The clustering result obtained by using the PSK algorithm. (d) The clustering result obtained by using the proposed MPSK algorithm.

The PSK algorithm works for clustering point-symmetrical data sets, and experimental results demonstrated that it significantly outperforms the conventional K-means clustering algorithm for this kind of data set. Recently, Chung and Lin pointed out two possible problems in the PSD measure: (1) it lacks the distance difference symmetry property, and (2) it leads to an unsatisfactory clustering result in the case of symmetrical inter-clusters. Because of these two problems, Chung and Lin proposed the MPSK algorithm [9]. In their experiments, the clustering results of the MPSK algorithm are better than those of the PSK algorithm, as shown in Fig. 1 and Fig. 2. However, the MPSK algorithm may lead to unsatisfactory clustering results if the symmetry property of the data set is either not ideal or too perfect. This fact will be shown by our simulation results on some sets of data points in Section 4.

3 The Proposed SDSA Algorithm

This section presents our proposed algorithm, which is based on the short distance between consecutive points and the small angle between the consecutive vectors formed by three adjacent points. We therefore name it the distance and direction orientation clustering algorithm (SDSA algorithm). The SDSA algorithm not only clusters almost all data sets used in [8,9] successfully, but can also handle many other data sets that cannot be clustered satisfactorily by the K-means algorithm, the PSK algorithm and the MPSK algorithm. More specifically, given a data set $D$ with $N$ data points, and letting $\delta_1 > 0$ and $\delta_2 > 0$ be two predetermined tolerances, the complete SDSA algorithm is as follows.

$k = 1$
Step 1: Choose a point $p_1 \in D$ randomly. Let the temporary cluster $C_k^t = \{p_1\}$;
Step 2: Update $D$ by $D := D - C_k^t$; if $D = \emptyset$, $C_k = C_k^t$, stop;
Step 3: If there exists a point $p_2 \in D$ such that $\|p_2 - p_1\| = \min \|p - p_1\|$ [...] $\delta_2 \|p_j - p_{j-1}\| \|p_{j-1} - p_{j-2}\|$ then $C_k^t = C_k^t \cup \{p_j\}$, $j = j + 1$, go to Step 4; otherwise $C_k = C_k^t$, $k = k + 1$, go to Step 1.

The proposed SDSA algorithm proceeds incrementally, adding one new cluster at each stage. No initial cluster centers are required. The key step is Step 5, which means that if a point is close to the previously tested point and the direction does not turn too sharply, then this point should belong to the current cluster.
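The idea behind the steps above can be sketched in Python under explicit assumptions, because parts of the step list are not recoverable from the text. The sketch assumes that a candidate point is accepted when (i) it is the nearest unassigned point to the last point of the current cluster and closer than $\delta_1$, and (ii) the inner product of the two most recent direction vectors exceeds $\delta_2$ times the product of their lengths. The name `sdsa_sketch`, the deterministic choice of the starting point, and both tests as written are assumptions, not the authors' exact procedure.

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sdsa_sketch(points, delta1, delta2):
    """Grow clusters as chains of nearby points that do not turn sharply."""
    remaining = list(points)
    clusters = []
    while remaining:
        # The paper picks the start point randomly; the first remaining point
        # is used here so that the example is reproducible.
        chain = [remaining.pop(0)]
        while remaining:
            last = chain[-1]
            idx = min(range(len(remaining)),
                      key=lambda i: _dist(remaining[i], last))
            cand = remaining[idx]
            step = _dist(cand, last)
            if step >= delta1:  # assumed distance test against delta1
                break
            if len(chain) >= 2:
                prev = chain[-2]
                v_new = [a - b for a, b in zip(cand, last)]
                v_old = [a - b for a, b in zip(last, prev)]
                dot = sum(a * b for a, b in zip(v_new, v_old))
                if dot <= delta2 * step * _dist(last, prev):  # assumed angle test
                    break
            chain.append(remaining.pop(idx))
        clusters.append(chain)
    return clusters

# Hypothetical data: two well-separated horizontal rows of points.
line_a = [(float(x), 0.0) for x in range(5)]
line_b = [(float(x), 10.0) for x in range(5)]
clusters = sdsa_sketch(line_a + line_b, delta1=1.5, delta2=0.7)
```

In this reading no cluster centers or cluster count are supplied in advance; a new cluster is opened whenever the current chain can no longer be extended.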

4 Experimental Results

As discussed in Section 3, the geometry of the SDSA algorithm is quite clear and simple. The points in the same cluster form a series of close points, and the adjacent vectors formed by three consecutive points do not turn too sharply. Thus, it is clear that the SDSA algorithm performs satisfactorily on the data sets given in Fig. 1 and Fig. 2 and on all data sets used by Chung and Lin (Section 6, [9]), since these data sets obviously have the geometric characteristics required by the SDSA algorithm.

In this section, several data sets are used to demonstrate the feasibility and the extension capability of our proposed SDSA algorithm. Experimental results reveal that the SDSA algorithm gives encouraging results on these data sets, whereas the PSK algorithm and the MPSK algorithm do not perform satisfactorily. The parameter $\delta_1$ can be chosen according to the data size. For our test data sets, $\delta_1$ is set to 0.1 cm and $\delta_2$ is set to 0.7.

The first data set consists of two circle shells, where one circle shell is embedded in the other. After running the PSK and MPSK algorithms on this data set, there are several misclassified data points, as shown in Fig. 3(a) and (b). Fig. 3(c) illustrates the clustering result obtained with our proposed SDSA algorithm, which is satisfactory (clearly, we obtain similar results if the circle shells are replaced by ellipsoidal shells). The data set used in Fig. 4 contains two crossed ellipsoidal shells, which form the outline of the badge of CCTV. The symmetry property of this data set is "too perfect", since the two ellipsoidal shells have the same symmetry center and symmetry lines. Thus, the PSK algorithm and the MPSK algorithm cannot handle this case well, as shown in Fig. 4(a) and (b). However, our proposed SDSA algorithm gives a satisfactory clustering result, as shown in Fig. 4(c). Clearly, the newly proposed LSK algorithm [10] for clustering data sets with the line symmetry property cannot handle this data set well either.

Fig. 3. One example to demonstrate the power of the SDSA algorithm. (a) Two obtained clusters by running PSK algorithm. (b) Two obtained clusters by running MPSK algorithm. (c) Two obtained clusters by running the SDSA algorithm.

Fig. 4. One example to demonstrate the power of the PSK algorithm. (a) The given point symmetrical data set. (b) Two obtained clusters by running K-means algorithm on (a). (c) Two obtained clusters by running the PSK algorithm on (a).

5 Conclusions

In this paper, we have presented the SDSA algorithm. The proposed clustering algorithm not only performs satisfactorily on most data sets that can be well clustered by the PSK and MPSK algorithms, but can also handle many data sets that cannot be well clustered by them. Experimental results demonstrate the feasibility of our proposed SDSA algorithm, and the relevant experimental results are rather encouraging. Moreover, the PSK algorithm and the MPSK algorithm are both point-based clustering methods that start with the cluster centers initially placed at arbitrary positions and proceed by moving the cluster centers at each step in order to minimize the clustering error. The main disadvantage of these methods lies in their sensitivity to the initial positions of the cluster centers. In contrast, the proposed SDSA algorithm does not depend on any initial parameter values. Instead of randomly selecting initial values for all cluster centers, as is the case with most clustering algorithms, the proposed technique proceeds incrementally, attempting to add one new cluster at each stage. This characteristic can be advantageous for discovering the correct number of clusters.

Acknowledgments. This work was partially supported by Natural Science Foundation of Zhejiang Province Y606026.

References
1. Jain, A.K., Dubes, R.C.: Algorithms for Clustering. Prentice Hall, Englewood Cliffs (1988)
2. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
3. Hartigan, J.: Clustering Algorithms. Wiley, New York (1975)
4. Fischer, B., Buhmann, J.M.: Bagging for Path Based Clustering. IEEE Trans. Pattern Anal. Machine Intel. 25, 1411–1415 (2003)
5. Bajcsy, P., Ahuja, N.: Location and Density Based Hierarchical Clustering Using Similarity Analysis. IEEE Trans. Pattern Anal. Machine Intel. 20, 1011–1015 (1998)
6. Zhu, C., Po, L.M.: Minimax Partial Distortion Competitive Learning for Optimal Codebook Design. IEEE Trans. Image Process 7, 1400–1409 (1998)
7. Fred, L.N., Leitao, M.N.: A New Cluster Isolation Criterion Based on Dissimilarity Increments. IEEE Trans. Pattern Anal. Machine Intel. 25, 944–958 (2003)
8. Su, M.C., Chou, C.H.: A Modified Version of the K-means Algorithm with A Distance Based on Cluster Symmetry. IEEE Trans. Pattern Anal. Machine Intel. 23, 674–680 (2001)
9. Chung, K.L., Lin, J.S.: Faster and more Robust Point Symmetry-based K-means Algorithm. Pattern Recognit. 40, 410–422 (2007)
10. Chung, K.L., Lin, J.S.: An Efficient Line Symmetry-based K-means Algorithm. Pattern Recognition Letters 27, 765–772 (2006)

Concise Coupled Neural Network Algorithm for Principal Component Analysis

Lijun Liu¹,², Jun Tie², and Tianshuang Qiu¹

¹ School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116024, China
² Department of Mathematics, Dalian Nationalities University, 116605 Dalian, China
[email protected]

Abstract. A concise system of ordinary differential equations (ODE) for the eigen-decomposition problem of a symmetric positive matrix is proposed in this paper. Stability properties of the proposed ODE are obtained by the theory of first order approximation. A novel coupled neural network (CNN) algorithm for principal component analysis (PCA) is obtained based on this concise ODE model. Compared with most non-coupled neural PCA algorithms, the proposed online CNN algorithm works in a recursive manner and simultaneously estimates the eigenvalue and the eigenvector adaptively. Because the proposed CNN makes effective use of the online eigenvalue estimate during the learning process, it reaches a fast convergence speed, which is further verified by the numerical experiment results. An adaptive algorithm for sequential extraction of subsequent principal components is also obtained by means of deflation techniques.

Keywords: Principal component analysis, Coupled neural network, Stability, Eigenvalue.

1 Introduction

Principal component analysis (PCA) is a widely used statistical technique in such areas as data compression, data filtering and feature extraction. In the standard numerical approach to PCA, the sample covariance matrix is first computed and then its eigenvectors and associated eigenvalues are extracted by some well-known numerical algorithm, e.g. the QR decomposition or the SVD algorithm. However, this approach is not practicable for large data sets with high-dimensional covariance matrices. Unlike traditional numerical techniques, neural network approaches to PCA pursue an "online" approach where an estimate of the principal directions is updated after each presentation of a data point. Therefore, approaches based on neural networks are especially suitable for high-dimensional data and for tracking in non-stationary environments. Since the pioneering work by Oja [1] on a simplified linear neuron with a constrained Hebbian learning rule, which extracts the principal component from stationary input data, a variety of neural learning algorithms for PCA have been proposed [2], [3]. Among these algorithms is the well-known generalized Hebbian algorithm (GHA) [2] proposed by Sanger, which successfully extracts subsequent lower order components sequentially using deflated inputs. However, due to limited training, errors in the extractions accumulate and become dominant, which makes GHA converge slowly. To improve the convergence, several authors proposed improved neural PCA algorithms [4], [5], [6]. It should be noted that most PCA algorithms are derived from a gradient descent or ascent approach. Thus it is always necessary to choose proper learning parameters to guarantee both small misadjustment and fast convergence. To overcome this problem, many recursive least squares (RLS) type algorithms have been proposed [7], [9], [10], which make use of a data-dependent learning rate and thus lead to a great improvement in convergence speed as well as stability. However, most RLS-type algorithms are computationally expensive. Thus, the attempts to improve the methods and to suggest new approaches and information criteria are continuing [11], [12], [13], [8].

On the other hand, it should be noted that most previously suggested rules did not consider eigenvalue estimates in the update equations of the weights, an exception being attempts to control the learning rate based on the eigenvalue estimates [5]. In this paper, we provide a novel neural learning rule where eigenvectors and eigenvalues are simultaneously estimated in coupled update equations. In non-coupled PCA rules, the eigen-motion in all directions mainly depends on the principal eigenvalue of the covariance matrix. So numerical stability and fast convergence can only be achieved by guessing this eigenvalue in advance. Coupled neural PCA rules, in contrast, incorporate a real-time estimate of the eigenvalue and therefore perform very well in terms of both stability and convergence speed. Numerical results further illustrate this point.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 561–568, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Concise Coupled Learning Equations for PCA

The symmetric positive definite matrix $C = E[xx^T]$ is the $n \times n$ covariance matrix of the zero mean data process $\{x(t) \in \mathbb{R}^n\}$ with $t = 0, 1, 2, \cdots$, where $E[\cdot]$ denotes the expectation operator on the entire data set. In order to find the first principal eigenvector of $C$, Möller and Könies [8] proposed a criterion given by

$$p = w^T C w \lambda^{-1} - w^T w + \ln \lambda \tag{1}$$

Here $w$ denotes the $n$-dimensional weight vector, i.e., the estimate of the principal eigenvector $w_1$ associated with the largest eigenvalue $\lambda_1$ of $C$, and $\lambda$ is the eigenvalue estimate of $C$. A direct Newton's method for optimizing the objective function $p$ leads to the following coupled differential equations

$$\frac{dw(t)}{dt} = C w(t)\lambda^{-1}(t) - w(t)w^T(t)Cw(t)\lambda^{-1}(t) - \frac{1}{2}w(t)\left(1 - w^T(t)w(t)\right), \qquad \frac{d\lambda(t)}{dt} = w^T(t)Cw(t) - w^T(t)w(t)\lambda(t) \tag{2}$$

by proper approximation of the inverse of the Hessian matrix

$$H(w, \lambda) = 2\begin{pmatrix} C\lambda^{-1} - I & -Cw\lambda^{-2} \\ -w^T C\lambda^{-2} & w^T C w\lambda^{-3} - \frac{1}{2}\lambda^{-2} \end{pmatrix} \tag{3}$$

However, in [8], they only focus on the first principal eigenvector for stationary case. In [11], Hou and Chen also obtained the same equations by introducing a new information criterion, which makes the analysis much easier p = wT Cw − wT wλ + λ

(4)

It is proved that the learning rule system which uses (4) as gradient is not stable at the stationary point $(\lambda_1, w_1)$, i.e., the principal eigenvalue and its associated eigenvector. They therefore use an alternative Newton's method to reach the same learning rule (2). An algorithm for extracting more principal eigenvectors in the non-stationary case is also obtained, but it is still computationally inefficient because it is based on the same equation (2) as in [8]. Looking closely at the differential equations (2), the last term $\frac{1}{2} w(t)(1 - w^T(t) w(t))$ approaches zero near the equilibrium $(\lambda_1, w_1)$ because $w_1^T w_1 = 1$. We therefore propose the following simplified differential equations for principal component analysis:

$$\frac{dw(t)}{dt} = C w(t) \lambda^{-1}(t) - w(t) w^T(t) C w(t) \lambda^{-1}(t)$$
$$\frac{d\lambda(t)}{dt} = w^T(t) C w(t) - w^T(t) w(t) \lambda(t) \qquad (5)$$

We will prove that (5) is stable at (w1 , λ1 ) in the Stability Analysis Section.
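The behaviour of the simplified system (5) can be checked numerically with a forward-Euler step. The sketch below is only an illustration: the function names, the 2 × 2 matrix (eigenvalues 3 and 1), the step size, and the starting point are assumptions chosen for the example, not values from the paper.

```python
def matvec(C, w):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(c_ij * w_j for c_ij, w_j in zip(row, w)) for row in C]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def euler_coupled_pca(C, w0, lam0, eta=0.05, steps=2000):
    """Forward-Euler integration of the simplified system (5).

    lam0 must be positive because the w-equation divides by lambda.
    """
    w, lam = list(w0), float(lam0)
    for _ in range(steps):
        Cw = matvec(C, w)
        wCw = dot(w, Cw)
        # dw/dt = C w / lambda - w (w^T C w) / lambda
        dw = [(cw_i - w_i * wCw) / lam for cw_i, w_i in zip(Cw, w)]
        # dlambda/dt = w^T C w - (w^T w) lambda
        dlam = wCw - dot(w, w) * lam
        w = [w_i + eta * d for w_i, d in zip(w, dw)]
        lam = lam + eta * dlam
    return w, lam


# Hypothetical covariance matrix with eigenvalues 3 and 1;
# its principal eigenvector is [1, 1] / sqrt(2).
C_demo = [[2.0, 1.0], [1.0, 2.0]]
w_est, lam_est = euler_coupled_pca(C_demo, w0=[1.0, 0.0], lam0=4.0)
print(w_est, lam_est)
```

Starting from a unit-norm w keeps the norm close to one during the transient, which is the regime in which the dropped term of (2) is negligible.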

3 Coupled Neural Network Algorithm for PCA

A direct discretization of (5) leads to the following iterative procedure

$$w(n+1) = w(n) + \eta(n) \left[ C w(n) \lambda^{-1}(n) - w(n) w^T(n) C w(n) \lambda^{-1}(n) \right]$$
$$\lambda(n+1) = \lambda(n) + \eta(n) \left[ w^T(n) C w(n) - w^T(n) w(n) \lambda(n) \right] \qquad (6)$$

where η(n) > 0 is the adaptive learning rate. In a non-stationary environment, C is a function of time rather than a constant matrix, i.e.,

$$C(k) = \alpha C(k-1) + (1 - \alpha) x(k) x^T(k), \qquad (7)$$

where α is the exponential forgetting factor. In the stationary case, $\alpha = (k-1)/k$. In practice, α lies in the range 0.99 ≤ α ≤ 1.0. We then obtain an online algorithm based on (5) for extracting the principal component pair $(w_1, \lambda_1)$:

$$w(k+1) = w(k) + \frac{\eta}{\lambda(k)} \left[ C(k+1) w(k) - w(k) w^T(k) C(k+1) w(k) \right]$$
$$\lambda(k+1) = \lambda(k) + \eta \left[ w^T(k) C(k+1) w(k) - w^T(k) w(k) \lambda(k) \right] \qquad (8)$$

L. Liu, J. Tie, and T. Qiu

However, in practice it is generally time-consuming, or even infeasible, to compute C(k + 1), so a way to avoid computing it is desirable. As the statistics of the process under observation change slowly and smoothly with time, and under the assumption that η is relatively small, a simple approximation (see also [11] and [12]) is

$$C(k) w(k) \approx C(k) w(k-1). \qquad (9)$$

So by equation (7), we have

$$\begin{aligned} C(k+1) w(k) &= \left[ \alpha C(k) + (1 - \alpha) x(k+1) x^T(k+1) \right] w(k) \\ &= \alpha C(k) w(k) + (1 - \alpha) x(k+1) y(k+1) \\ &\approx \alpha C(k) w(k-1) + (1 - \alpha) x(k+1) y(k+1) \end{aligned} \qquad (10)$$

where $y(k+1) = w^T(k) x(k+1)$ denotes the linear output of the single linear neuron for pattern x(k + 1). Unlike the procedure proposed in [11], we no longer need to approximate $w^T(k) C(k+1) w(k)$ when recursively computing the principal component pair $(w_1, \lambda_1)$. If we denote $q(k) = C(k) w(k-1)$, it is only necessary to compute q(k) recursively, rather than computing C(k) explicitly. The term $w^T(k-1) C(k) w(k-1)$ is simply computed as $\nu(k) = w^T(k-1) q(k)$. To sum up, we propose the following algorithm for computing $(w_1, \lambda_1)$.
1. Let λ(0) = 0, and choose w(0) and q(0) as random vectors in $[-1, 1]^n$. Choose $\epsilon_0$ and $\epsilon_1$ as small precision constants.
2. In step k ≥ 1, randomly select a pattern x(k), and compute $y(k) = w^T(k-1) x(k)$, $q(k) = \alpha q(k-1) + (1 - \alpha) x(k) y(k)$ and $\nu(k) = w^T(k-1) q(k)$.
3. Compute $\lambda(k) = \lambda(k-1) + \eta [\nu(k) - w^T(k-1) w(k-1) \lambda(k-1)]$ and $w(k) = w(k-1) + \frac{\eta}{\lambda(k-1)} [q(k) - w(k-1) \nu(k)]$.
4. If $|\lambda(k) - \lambda(k-1)| < \epsilon_0$ and $\left| \frac{|w^T(k) w(k-1)|}{\|w(k)\| \, \|w(k-1)\|} - 1 \right| < \epsilon_1$, go to step 5; otherwise set k = k + 1 and go to step 2.
5. Take w(k) as $w_1$ and λ(k) as $\lambda_1$. End.
As for the extraction of subsequent eigenvectors, we follow the general deflation procedure proposed in [2] and [11]. Suppose the first i − 1 (i > 1) eigenvalue-eigenvector pairs $(w_j, \lambda_j)$, j = 1, 2, ..., i − 1, have been obtained. Let $e_i = x - \sum_{j=1}^{i-1} w_j w_j^T x$. Applying the proposed algorithm for computing the principal eigenvector $w_1$ to the new input $e_i$, we obtain the i-th eigenvalue-eigenvector pair $(w_i, \lambda_i)$.
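Steps 2 and 3 of the algorithm above can be sketched as follows. Several choices here are assumptions made for the illustration rather than part of the paper: the sketch starts from a positive λ(0) because the update of w divides by λ(k − 1), it uses the stationary-case factor α = (k − 1)/k (so q(0) has no influence), it walks through the samples in order, and it runs for a fixed number of samples instead of applying the stopping test of step 4. The helper `make_samples` is a hypothetical data generator.

```python
import math
import random


def online_coupled_pca(samples, w0, lam0=1.0, eta=0.01):
    """Online estimate of the principal pair (w_1, lambda_1), steps 2 and 3.

    Assumptions of this sketch: lam0 > 0, alpha = (k - 1) / k, no stopping test.
    """
    w, lam = list(w0), float(lam0)
    q = [0.0] * len(w0)  # overwritten at k = 1 because alpha = 0 there
    for k, x in enumerate(samples, start=1):
        alpha = (k - 1) / k
        y = sum(wi * xi for wi, xi in zip(w, x))        # y(k) = w(k-1)^T x(k)
        q = [alpha * qi + (1 - alpha) * xi * y for qi, xi in zip(q, x)]
        nu = sum(wi * qi for wi, qi in zip(w, q))       # nu(k) = w(k-1)^T q(k)
        ww = sum(wi * wi for wi in w)
        new_lam = lam + eta * (nu - ww * lam)
        # The w-update uses lambda(k-1), i.e. the value before this step.
        w = [wi + (eta / lam) * (qi - wi * nu) for wi, qi in zip(w, q)]
        lam = new_lam
    return w, lam


def make_samples(n_samples, seed=0):
    """Hypothetical generator: zero-mean 2-D Gaussian, variances 5 and 10, rho = 0.9."""
    rng = random.Random(seed)
    rho, sx, sy = 0.9, math.sqrt(5.0), math.sqrt(10.0)
    out = []
    for _ in range(n_samples):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((sx * z1, sy * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)))
    return out


w_hat, lam_hat = online_coupled_pca(make_samples(5000), w0=[1.0, 0.0])
print(w_hat, lam_hat)
```

Only the vector q(k) is stored between steps, which is the point of the approximation (9): the covariance matrix C(k) is never formed.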

4 Stability Analysis

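In this section, we briefly analyze the stability of (5), following an approach similar to that in [8] and [11]. The Hessian

$$H(w, \lambda) = \begin{bmatrix} \dfrac{\partial \dot{w}}{\partial w} & \dfrac{\partial \dot{w}}{\partial \lambda} \\ \dfrac{\partial \dot{\lambda}}{\partial w} & \dfrac{\partial \dot{\lambda}}{\partial \lambda} \end{bmatrix} \qquad (11)$$

for (5) at the stationary point $(\bar{w}, \bar{\lambda})$ can be written as

$$H(\bar{w}, \bar{\lambda}) = \begin{bmatrix} C \bar{\lambda}^{-1} - I - 2 \bar{w} \bar{w}^T & 0 \\ 0 & -1 \end{bmatrix} \qquad (12)$$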

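Let $C = U \Lambda U^T$ be the eigenvalue decomposition of C, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ is a diagonal matrix with eigenvalues $\lambda_1 > \lambda_2 > \ldots > \lambda_n > 0$, and U is the matrix of the corresponding eigenvectors. Let

$$\bar{U} = \begin{bmatrix} U & 0 \\ 0 & 1 \end{bmatrix}. \qquad (13)$$

If $\bar{\lambda} = \lambda_i$ and $\bar{w} = w_i$, then

$$H(w_i, \lambda_i) = \bar{U} \begin{bmatrix} \Lambda \lambda_i^{-1} - I - 2 e_i e_i^T & 0 \\ 0 & -1 \end{bmatrix} \bar{U}^T \qquad (14)$$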

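where $e_i$ is a vector with all entries zero except $e_i(i) = 1$. The eigenvalues $a_{i,1}, a_{i,2}, \ldots, a_{i,n+1}$ of $H(w_i, \lambda_i)$ are the diagonal entries of the middle matrix in (14):

$$a_{i,i} = -2, \quad a_{i,n+1} = -1, \quad a_{i,j} = \frac{\lambda_j}{\lambda_i} - 1, \quad j = 1, 2, \ldots, n, \; j \neq i. \qquad (15)$$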

Therefore, only at the stationary point $(w_1, \lambda_1)$ (the principal component) are all the eigenvalues of H negative. At every other stationary point $(w_i, \lambda_i)$ with i > 1, there is at least one positive eigenvalue, for example $a_{i,1} = \lambda_1/\lambda_i - 1 > 0$, which means that $(w_1, \lambda_1)$ is the only stable stationary point of (5).
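The sign pattern behind this argument can be tabulated directly from the block-diagonal matrix in (14). The sketch below is a small illustration; the function name and the three-eigenvalue spectrum are hypothetical.

```python
def jacobian_eigenvalues(eigs, i):
    """Diagonal entries of the middle matrix in (14) at (w_i, lambda_i).

    eigs: eigenvalues of C in decreasing order; i: 0-based index of the
    stationary point being examined.
    """
    lam_i = eigs[i]
    # Entry j of Lambda / lambda_i - I - 2 e_i e_i^T
    vals = [lam_j / lam_i - 1.0 - (2.0 if j == i else 0.0)
            for j, lam_j in enumerate(eigs)]
    vals.append(-1.0)  # lower-right entry of (14), the lambda direction
    return vals


spectrum = [3.0, 2.0, 0.5]  # hypothetical eigenvalues of C
for i in range(len(spectrum)):
    vals = jacobian_eigenvalues(spectrum, i)
    print(i + 1, vals, all(v < 0 for v in vals))
```

For this spectrum only the first stationary point yields entries that are all negative; at the others the entry belonging to the larger eigenvalue of C is positive.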

5 Experiment

In the following, we provide a simulation result to illustrate the performance of the proposed neural network algorithm. Because the proposed algorithm is based on the differential equations (5), which are a simplified version of (2), its online algorithm is computationally more efficient than that of (2). Additionally, compared with the adaptive algorithm proposed in [11], the proposed algorithm only needs to compute q(k) recursively, rather than a(k) and b(k) as in [11].

[Figure 1. Panel titles: "w(t) of Oja algorithm" and "w(t) of the proposed algorithm". Each panel plots the curves w1(t) and w2(t) over t from 0 to 300.]

Fig. 1. (a) Principal direction estimation with Oja algorithm. (b) Principal direction estimation with the proposed algorithm.

For simplicity, we only compare the performance of the proposed algorithm with the classical Oja algorithm for computing the largest eigenvalue and the corresponding eigenvector; both need a careful selection of the learning rate η > 0. The numerical results show that the proposed algorithm performs well even for large values of η, while the success of the Oja algorithm depends on a rather small value of η. Specifically, a data set $D_x = \{(x_i, y_i)\}$ with i = 1, ..., 500 is drawn from a zero-mean two-dimensional Gaussian distribution with correlation coefficient $\rho_{xy} = 0.9$ and variances D(x) = 5 and D(y) = 10. The sample covariance matrix is computed as

$$C = \begin{bmatrix} 4.9705 & 6.4651 \\ 6.4651 & 10.2043 \end{bmatrix}$$

We randomly select only 300 of the 500 samples, i.e., 3/5 of the overall number of samples, to adaptively update the weight vector w according to the proposed algorithm and the Oja algorithm [1]. Using Matlab's command [v,d]=eigs(C), we obtain the largest eigenvalue $\lambda_1 = 14.5621$ and $w_1 = [0.5589, 0.8292]^T$. It is well known that the Oja algorithm is sensitive to the selection of the learning rate η > 0. In this experiment, we chose η = 0.005 for the Oja algorithm by trial, while for the proposed algorithm we select a relatively large η = 0.8. For Oja's learning algorithm, $\lambda_1$ is approximated by $\lambda(t) = \frac{1}{t} \sum_{i=1}^{t} y^2(i)$, which is computationally inefficient. We get λ(300) = 13.5980 with the Oja algorithm, while the proposed algorithm gives λ(200) = 14.3313. As for the principal eigenvector $w_1$, the proposed algorithm estimates it as $w(300) = [0.55888, 0.82939]^T$, while the Oja algorithm gives $w(200) = [0.53964, 0.84256]^T$. The results are shown in Fig. 1(a)-(b) and Fig. 2(a)-(b). As seen from Fig. 1(a) and (b), the proposed algorithm is much more stable than the Oja algorithm, even with the large learning rate η = 0.8. The online estimate of $\lambda_1$ is much more accurate than that of the Oja algorithm, as shown in Fig. 2(a). From Fig. 2(b), it is easy to see that $\lim_{t \to \infty} \|w(t)\| = 1$ for both the proposed algorithm and the Oja algorithm, which further supports our simplification of the coupled neural network (CNN) equations (2).

[Figure 2. Panel titles: "comparison of eigenvalue estimation" (curves: Oja, Proposed, True) and "comparison of weight length" (curves: Proposed: norm of w(t), Oja: norm of w(t)). The horizontal axes run from 0 to 300.]

Fig. 2. (a) Comparison of the estimated largest eigenvalue between the Oja algorithm and the proposed algorithm. (b) Convergence of the norm of w(t) to one for both algorithms.
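The reference eigenpair quoted in this section can be cross-checked without Matlab, since a symmetric 2 × 2 matrix has a closed-form largest eigenvalue. The helper name below is hypothetical; the matrix entries are the ones reported above.

```python
import math


def top_eigenpair_2x2(a, b, d):
    """Largest eigenvalue and unit eigenvector of [[a, b], [b, d]], assuming b != 0."""
    mean = 0.5 * (a + d)
    lam1 = mean + math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    # (b, lam1 - a) solves (a - lam1) * v_x + b * v_y = 0.
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    return lam1, (vx / norm, vy / norm)


lam1, w1 = top_eigenpair_2x2(4.9705, 6.4651, 10.2043)
print(lam1, w1)
```

Because the matrix entries are rounded to four decimals, agreement with the reported values is expected only to about three decimal places.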

6 Conclusion

This paper proposes an adaptive algorithm for computing principal eigenvectors as well as eigenvalues, based on a simplified coupled neural network model. The algorithm is computationally more efficient than those proposed in [8] and [11]. Unlike most existing neural-network-based learning rules for PCA, the proposed CNN online learning rule extracts eigenvalues and eigenvectors simultaneously. As discussed in [8], non-coupled PCA rules suffer from a stability-speed problem, since the eigenmotion depends on the eigenvalues of the covariance matrix. Simulations confirm that coupled PCA learning rules applied in a chain of simultaneously trained stages lead to improved accuracy of the eigenvectors and eigenvalues. The proposed algorithm is most applicable to image processing, where eigenvalues are needed in PCA problems. In the experiment section, we focus only on computing the principal direction of simple synthetic data. Applications of the proposed algorithm to signal processing and performance comparisons with other algorithms are the emphasis of future work.

References
1. Oja, E.: Principal Components, Minor Components, and Linear Neural Networks. Neural Networks 5, 927–935 (1992)
2. Sanger, T.D.: Optimal Unsupervised Learning in a Single-layer Linear Feedforward Neural Network. Neural Networks 2, 459–473 (1989)
3. Diamantaras, K.I., Kung, S.Y.: Principal Component Neural Networks: Theory and Applications. Wiley, New York (1996)
4. Xu, L., Yuille, A.L.: Robust Principal Component Analysis by Self-organizing Rules Based on Statistical Physics Approach. IEEE Trans. on Neural Networks 6, 131–143 (1995)
5. Chen, L., Chang, S.: An Adaptive Learning Algorithm for Principal Component Analysis. IEEE Trans. on Neural Networks 6, 1255–1263 (1995)
6. Cichocki, A., Kasprzak, W., Skarbek, W.: Adaptive Learning Algorithm for Principal Component Analysis with Partial Data. Proc. Cybernetics Syst. 2, 1014–1019 (1996)
7. Bannour, S., Azimi-Sadjadi, M.R.: Principal Component Extraction Using Recursive Least Squares Learning. IEEE Trans. on Neural Networks 6, 457–469 (1995)
8. Moller, R., Konies, A.: Coupled Principal Component Analysis. IEEE Trans. on Neural Networks 15, 214–222 (2004)
9. Yang, B.: Projection Approximation Subspace Tracking. IEEE Trans. on Signal Processing 43, 95–107 (1995)
10. Ouyang, S., Bao, Z.: Robust Recursive Least Squares Learning Algorithm for Principal Component Analysis. IEEE Trans. on Neural Networks 11, 215–221 (2000)
11. Hou, L., Chen, T.P.: Online Algorithm of Coupled Principal (Minor) Component Analysis. Journal of Fudan University 45, 158–168 (2006)
12. Hua, Y.B., Xiang, Y., Chen, T.P.: A New Look at the Power Method for Fast Subspace Tracking. Digital Signal Processing 9, 207–314 (1999)
13. Ouyang, S., Bao, Z.: Fast Principal Component Extraction by a Weighted Information Criterion. IEEE Trans. on Neural Networks 11, 215–221 (2002)

Spatial Clustering with Obstacles Constraints by Hybrid Particle Swarm Optimization with GA Mutation

Xueping Zhang1, Hui Yin1, Hongmei Zhang1, and Zhongshan Fan2

1 School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
2 Henan Academy of Traffic Science and Technology, Zhengzhou 450052, China
[email protected]

Abstract. In this paper, we propose a novel method for Spatial Clustering with Obstacles Constraints (SCOC) based on an advanced Hybrid Particle Swarm Optimization (HPSO) with GA mutation. We first use HPSO to obtain the obstructed distance, and then develop a novel algorithm, HPKSCOC, based on HPSO and K-Medoids to cluster spatial data with obstacles constraints. The experimental results show that the HPKSCOC algorithm not only achieves a higher local convergence speed and a stronger global optimum search, but also takes into account the obstacles constraints and practicalities of spatial clustering; it performs better than Improved K-Medoids SCOC (IKSCOC) in terms of quantization error and converges faster than Genetic K-Medoids SCOC (GKSCOC).

Keywords: Spatial clustering, Obstacles constraints, Hybrid particle swarm optimization, Mutation, K-Medoids.

1 Introduction

Spatial Clustering with Obstacles Constraints (SCOC) has become a new topic in Spatial Data Mining (SDM). As an example, Fig. 1 shows the clustering of spatial data with physical obstacle constraints. Ignoring the constraints leads to an incorrect interpretation of the correlation among data points. To the best of our knowledge, only three clustering algorithms for SCOC have been proposed, namely COD-CLARANS [1], AUTOCLUST+ [2], and DBCluC [3,4], and each has shortcomings. COD-CLARANS computes the obstructed distance using a visibility graph, which is costly and unfit for large spatial data; in addition, it only attends to local convergence. AUTOCLUST+ builds a Delaunay structure to solve SCOC, which is also costly and unfit for large spatial data. DBCluC cannot handle large high-dimensional data sets. We developed Genetic K-Medoids SCOC (GKSCOC), based on Genetic Algorithms (GAs) and Improved K-Medoids SCOC (IKSCOC), in [5]. The experiments show that GKSCOC is effective, but its drawback is a comparatively slow clustering speed. Particle Swarm Optimization (PSO) can solve a variety of difficult optimization problems. Compared with GAs, the advantages of PSO are its simplicity in coding, its consistency in performance, and the smaller number of parameters to be adjusted; it can also be used efficiently on large data sets.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 569–578, 2008. © Springer-Verlag Berlin Heidelberg 2008

In this paper, we explore the applicability of PSO for SCOC. We first use a Hybrid PSO (HPSO) algorithm with GA mutation to obtain the obstructed distance, and then develop the HPKSCOC algorithm, based on HPSO and K-Medoids, to cluster spatial data with obstacles constraints. To address the main shortcoming of the PSO algorithm, namely that it easily falls into a local minimum, an advanced HPSO with GA mutation is adopted in this paper. By adding a mutation operator, the algorithm can escape the attraction of a local minimum in the later convergence phase while keeping its fast speed in the early phase. The experiments show that HPKSCOC is better than IKSCOC in terms of quantization error and converges faster than GKSCOC. The remainder of the paper is organized as follows. Section 2 introduces HPSO with a GA mutation operator. The obstructed distance by HPSO is discussed in Section 3. Section 4 presents HPKSCOC. The performance of HPKSCOC is shown in Section 5, and Section 6 concludes the paper.

[Figure 1. Panels: (a) Data objects and obstacles constraints; (b) Clusters ignoring obstacle constraints. Labels in the figure: Bridge, River, Mountain, and clusters C1 to C4.]

Fig. 1. Clustering data objects with obstacles constraints

2 Hybrid PSO with GA Mutation

2.1 Standard PSO

PSO is a parallel population-based computation technique proposed by Kennedy and Eberhart in 1995 [6,7], motivated by the behavior of organisms such as schooling fish and flocking birds. To find an optimal or near-optimal solution to the problem, PSO updates the current generation of particles using information about the best solution obtained by each particle and by the entire population. The mathematical description of PSO is as follows. Suppose the dimension of the search space is D and the number of particles is n. The vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ represents the position of the i-th particle, $pBest_i = (p_{i1}, p_{i2}, \ldots, p_{iD})$ is the best position it has found so far, and the best position of the whole swarm is represented as $gBest = (g_1, g_2, \ldots, g_D)$. The vector $V_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$ is the position change rate of the i-th particle. Each particle updates its position according to the following formulas:

$$v_{id}(t+1) = w v_{id}(t) + c_1 \, rand() \, [p_{id}(t) - x_{id}(t)] + c_2 \, rand() \, [g_d(t) - x_{id}(t)], \qquad (1)$$
$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1), \quad 1 \le i \le n, \; 1 \le d \le D, \qquad (2)$$


where $c_1$ and $c_2$ are positive constant parameters, rand() is a random function with range [0, 1], and w is the inertia weight. In this paper, the inertia weight is set according to

$$w = w_{\max} - \frac{w_{\max} - w_{\min}}{I_{\max}} \times I, \qquad (3)$$

where $w_{\max}$ is the initial value of the weighting coefficient, $w_{\min}$ is the final value of the weighting coefficient, $I_{\max}$ is the maximum number of iterations or generations, and I is the current iteration or generation number. Equation (1) is used to calculate the particle's new velocity, and the particle then flies toward a new position according to equation (2). The range of the d-th position component is $[XMINX_d, XMAXX_d]$ and the range of the d-th velocity component is $[-VMAXX_d, VMAXX_d]$. If a value calculated by equations (1) and (2) exceeds its range, it is set to the boundary value. The performance of each particle is measured according to a predefined fitness function, which is usually proportional to the cost function associated with the problem. This process is repeated until user-defined stopping criteria are satisfied. A disadvantage of the global PSO is that it tends to be trapped in a local optimum under some initialization conditions [8].

2.2 Hybrid PSO with GA Mutation
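The standard PSO update of Section 2.1, equations (1) to (3) together with the boundary rule, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the default parameter values, the zero initial velocities, and the test function are assumptions.

```python
import random


def pso_minimize(f, dim, bounds, n_particles=20, iters=100,
                 w_max=0.9, w_min=0.4, c1=2.0, c2=2.0, v_max=1.0, seed=0):
    """Global-best PSO with the linearly decreasing inertia weight of (3)."""
    rng = random.Random(seed)
    lo, hi = bounds
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_val = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    history = [gbest_val]
    for it in range(1, iters + 1):
        w = w_max - (w_max - w_min) / iters * it                   # equation (3)
        for i in range(n_particles):
            for d in range(dim):
                v = (w * vs[i][d]
                     + c1 * rng.random() * (pbest[i][d] - xs[i][d])
                     + c2 * rng.random() * (gbest[d] - xs[i][d]))  # equation (1)
                vs[i][d] = max(-v_max, min(v_max, v))              # velocity bound
                x_new = xs[i][d] + vs[i][d]                        # equation (2)
                xs[i][d] = max(lo, min(hi, x_new))                 # position bound
            val = f(xs[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = xs[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = xs[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def sphere(x):
    """Hypothetical test function with its minimum at the origin."""
    return sum(v * v for v in x)


best, best_val, history = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(best, best_val)
```

The best value found never increases from one iteration to the next, because gBest is replaced only when a particle improves on it.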

The parameters $w$, $c_1$, $c_2$ must satisfy the relation

$$\frac{c_1 + c_2}{2} - 1 < w < 1 \quad \text{and} \quad c_1 + c_2 > 0, \qquad (4)$$

to guarantee that the particles converge to the optimization result. How to coordinate these parameters to obtain a high convergence speed is another difficult matter, so we adopt a hybrid algorithm of PSO and GA with self-adaptive velocity mutation [9, 10], named HPSO, to coordinate $w$, $c_1$, $c_2$ so that the algorithm performs well. Because $w$, $c_1$, $c_2$ are constrained by equation (4), the following objective function is introduced to evaluate the particle performance of HPSO:

$$q_k = \frac{Z_k}{S}, \quad k = 1, 2, \ldots, Q, \qquad (5)$$
$$E(t) = -\sum_{k=1}^{Q} q_k \ln q_k, \qquad (6)$$

where E(t) is the particle population distribution entropy, used to evaluate the distribution of the population. Here, the HPSO is adopted as follows, following [9,10].
1. Initialize the swarm population and each particle's position and velocity;
2. Evaluate each particle's fitness;
3. Initialize gBest, pBest, $w_{\max}$, $w_{\min}$, $c_1$, $c_2$, the maximum generation, and set generation = 0;
4. While (generation […]
[…] ∈ E(G), $v_i = v_j$, others (7)

where $v_i$ and $v_j$ are any two nodes in the graph E(G), $\langle v_i, v_j \rangle$ represents an arc in the graph, and $w_{ij}$ is its weight. The simulation result is shown in Fig. 2(b), where the black solid line represents the shortest path obtained.
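The paper states later that the shortest path on this weighted graph is obtained with the Dijkstra algorithm. A minimal sketch of that step is given below; the adjacency-list representation, the node names, and the edge weights are hypothetical and do not come from the MAKLINK graph of the paper.

```python
import heapq


def dijkstra(graph, start, goal):
    """Shortest path on a graph given as {node: [(neighbor, weight), ...]}."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == goal:
            break
        for v, w_uv in graph.get(u, []):
            nd = d + w_uv
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return float("inf"), []
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]


def undirected(edges):
    """Build an adjacency list from (u, v, weight) triples."""
    graph = {}
    for u, v, w_uv in edges:
        graph.setdefault(u, []).append((v, w_uv))
        graph.setdefault(v, []).append((u, w_uv))
    return graph


# Hypothetical graph: nodes stand for link midpoints, weights for arc lengths.
demo_graph = undirected([("start", "a", 2.0), ("start", "b", 5.0),
                         ("a", "b", 1.0), ("a", "goal", 6.0), ("b", "goal", 2.0)])
length, path = dijkstra(demo_graph, "start", "goal")
print(length, path)
```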


3.2 Optimal Obstructed Path by HPSO

Suppose the shortest path of the MAKLINK graph obtained by the Dijkstra algorithm is $P_0, P_1, P_2, \ldots, P_D, P_{D+1}$, where $P_0 = start$ is the start point and $P_{D+1} = goal$ is the goal point. $P_i$ (i = 1, 2, ..., D) is the midpoint of a free link. The optimization task is to adjust the positions of the $P_i$ to shorten the path and obtain the optimized (or an acceptable) path in the planning space. The adjustment process of $P_i$ is shown in Fig. 2(c) [11]. The position of $P_i$ is determined by the following parametric equation:

$$P_i = P_{i1} + (P_{i2} - P_{i1}) \times t_i, \quad t_i \in [0, 1], \; i = 1, 2, \ldots, D. \qquad (8)$$

Each particle $X_i$ is constructed as $X_i = (t_1 t_2 \ldots t_D)$. Accordingly, the fitness value of the i-th particle is defined as

$$f(X_i) = \sum_{k=1}^{D+1} |P_{k-1} P_k|, \quad i = 1, 2, \ldots, n, \qquad (9)$$

where $|P_{k-1} P_k|$ is the Euclidean distance between the two points and $P_k$ can be calculated according to equation (8). Here, the HPSO is presented as follows.
1. Initialize particles at random, and set $pBest_i = X_i$;
2. Calculate each particle's fitness value by equation (9) and label the particle with the minimum fitness value as gBest;
3. For $t_1 = 1$ to $t_{\max 1}$ do {
4. For each particle $X_i$ do {
5. Update $v_{id}$ and $x_{id}$ by equations (1) and (2);
6. Calculate the fitness according to equation (9); }
7. Update gBest and $pBest_i$;
8. For GA, initialize $n_p$, $p_c$, $p_m$, $T_G$, and generate the initial population;
9. While ($T < T_G$) do { Calculate fitness of GA by equation (6);
10. Selection, Crossover, Mutation, and generate next generation }
11. Accept $w$, $c_1$, $c_2$;
12. if ($v_{id} > VMAX_d$) then $v_{id} = rand() \, VMAX_d$, $pBest(t) = x_{id}(t)$;
13. if $\|v\| \le \varepsilon$ Terminate }
14. Output the obstructed distance.
Here $t_{\max 1}$ is the maximum number of iterations and ε is the minimum velocity. The simulation result is shown in Fig. 2(d), where the red solid line represents the optimal obstructed path obtained by PSO.
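The particle encoding of equation (8) and the fitness of equation (9) can be sketched as follows. The function names and the single-link geometry are hypothetical; each free link is represented by its two end points $P_{i1}$ and $P_{i2}$.

```python
import math


def path_points(start, goal, links, t):
    """Way points P_0 .. P_{D+1}: P_i = P_i1 + (P_i2 - P_i1) * t_i, equation (8)."""
    pts = [start]
    for (p1, p2), ti in zip(links, t):
        pts.append((p1[0] + (p2[0] - p1[0]) * ti,
                    p1[1] + (p2[1] - p1[1]) * ti))
    pts.append(goal)
    return pts


def path_length(start, goal, links, t):
    """Fitness of a particle t = (t_1, ..., t_D): total path length, equation (9)."""
    pts = path_points(start, goal, links, t)
    return sum(math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(pts, pts[1:]))


# Hypothetical scene: one free link running from (2, -1) to (2, 3).
start, goal = (0.0, 0.0), (4.0, 0.0)
links = [((2.0, -1.0), (2.0, 3.0))]
print(path_length(start, goal, links, [0.25]))  # way point (2, 0), straight path
print(path_length(start, goal, links, [0.0]))   # way point (2, -1), a detour
```

A PSO run over the vector t in $[0, 1]^D$ with this fitness would then shorten the path while keeping every way point on its free link.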


Fig. 2. Optimal obstructed path by HPSO based on MAKLINK Graph

4 Spatial Clustering with Obstacles Constraints Based on HPSO and K-Medoids

4.1 IKSCOC Based on K-Medoids

Typical partitioning-based algorithms are K-Means, K-Medoids, and CLARANS. Here, the K-Medoids algorithm is adopted for SCOC to prevent cluster centers from falling on an obstacle. The square-error function is adopted to estimate the clustering quality, and is defined as

$$E = \sum_{j=1}^{N_c} \sum_{p \in C_j} (d(p, m_j))^2, \qquad (10)$$

where $N_c$ is the number of clusters $C_j$, $m_j$ is the cluster center of cluster $C_j$, and d(p, q) is the direct Euclidean distance between the two points p and q. To handle obstacle constraints, the criterion function for estimating the quality of spatial clustering with obstacles constraints is revised as

$$E_o = \sum_{j=1}^{N_c} \sum_{p \in C_j} (d_o(p, m_j))^2, \qquad (11)$$

where $d_o(p, q)$ is the obstructed distance between point p and point q. The IKSCOC method is adopted as follows [5].
1. Select $N_c$ objects to be cluster centers at random;
2. Distribute the remaining objects to the nearest cluster center;
3. Calculate $E_o$ according to equation (11);
4. While ($E_o$ changed) do { Let current E = $E_o$;
5. Select a non-center point to replace the cluster center $m_j$ randomly;
6. Distribute objects to the nearest center;
7. Calculate E according to equation (10);
8. If E > current E, go to 5;
9. Calculate $E_o$;
10. If $E_o$ < current E, form new cluster centers }.
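The two criteria (10) and (11) differ only in the distance function, which the sketch below makes explicit. The function names are hypothetical, and `detour_distance` is a toy stand-in for the obstructed distance $d_o$: it assumes a wall along the line x = 0 that can be crossed only at a single gate point, which is not the obstacle model of the paper.

```python
import math


def euclidean(p, q):
    """Direct Euclidean distance d(p, q)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def squared_error(clusters, medoids, dist):
    """Sum over clusters of squared distances to the cluster center, as in (10)/(11)."""
    return sum(dist(p, m) ** 2
               for points, m in zip(clusters, medoids)
               for p in points)


GATE = (0.0, 3.0)  # hypothetical single crossing point of the wall x = 0


def detour_distance(p, q):
    """Toy obstructed distance: points on opposite sides must pass through GATE."""
    if p[0] * q[0] < 0:
        return euclidean(p, GATE) + euclidean(GATE, q)
    return euclidean(p, q)


clusters = [[(-1.0, 0.0), (1.0, 0.0)]]
medoids = [(-1.0, 0.0)]
print(squared_error(clusters, medoids, euclidean))        # E of equation (10)
print(squared_error(clusters, medoids, detour_distance))  # E_o of equation (11)
```

In this toy scene the obstructed criterion is much larger than the Euclidean one, which is the situation in which ignoring obstacles would merge points that are in fact far apart.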


However, IKSCOC still has two shortcomings: first, selecting the initial values randomly may lead to different spatial clustering results, or even to no solution; second, it only attends to local convergence and is sensitive to outliers.

4.2 HPKSCOC Based on HPSO and K-Medoids

PSO has been applied to data clustering [13-16]. In the context of clustering, a single particle represents the $N_c$ cluster centroids. That is, each particle $X_i$ is constructed as follows:

$$X_i = (m_{i1}, \ldots, m_{ij}, \ldots, m_{iN_c}), \qquad (12)$$

where $m_{ij}$ refers to the j-th cluster centroid of the i-th particle in cluster $C_{ij}$. Here, the objective function is defined as follows:

$$f(x_i) = \frac{1}{J_i} \qquad (13)$$
$$J_i = \sum_{j=1}^{N_c} \sum_{p \in C_{ij}} d_o(p, m_{ij}) \qquad (14)$$

The HPKSCOC is developed as follows.
1. Execute the IKSCOC algorithm to initialize one particle to contain $N_c$ selected cluster centroids;
2. Initialize the other particles of the swarm to contain $N_c$ selected cluster centroids at random;
3. For t = 1 to $t_{\max}$ do {
4. For each particle $X_i$ do {
5. For each object p do {
6. Calculate $d_o(p, m_{ij})$;
7. Assign object p to cluster $C_{ij}$ such that $d_o(p, m_{ij}) = \min_{c = 1, \ldots, N_c} \{ d_o(p, m_{ic}) \}$;
8. Calculate the fitness according to equation (13); }}
9. Update gBest and $pBest_i$;
10. For GA, initialize $n_p$, $p_c$, $p_m$, $T_G$, and generate the initial population;
11. While $T < T_G$ do { Calculate fitness of GA by equation (6);
12. Selection, Crossover, Mutation, and generate next generation }
13. Accept $w$, $c_1$, $c_2$;
14. Update cluster centroids by equations (1) and (2);
15. if ($v_{id} > VMAX_d$) then $v_{id} = rand() \, VMAX_d$, $pBest(t) = x_{id}(t)$;
16. if $\|v\| \le \varepsilon$ Terminate;
17. Optimize new individuals using IKSCOC }
18. Output.


where $t_{\max}$ is the maximum number of iterations for PSO and ε is the minimum velocity. Step 1 is intended to overcome the tendency of the global PSO to be trapped in a local optimum under some initialization conditions. Step 17 is intended to improve the local convergence speed of the global PSO.
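Steps 5 to 8 of the listing, i.e. assigning every object to its nearest centroid and evaluating equations (13) and (14) for one particle, can be sketched as follows. The function name is hypothetical, and the example passes a plain Euclidean distance where the paper uses the obstructed distance $d_o$.

```python
import math


def assign_and_score(objects, centroids, dist):
    """Assign each object to its nearest centroid; return clusters, J_i and f = 1 / J_i."""
    clusters = [[] for _ in centroids]
    total = 0.0
    for p in objects:
        ds = [dist(p, m) for m in centroids]
        j = min(range(len(centroids)), key=ds.__getitem__)  # nearest centroid
        clusters[j].append(p)
        total += ds[j]                                       # equation (14)
    fitness = 1.0 / total if total > 0 else float("inf")     # equation (13)
    return clusters, total, fitness


def euclid(p, q):
    """Stand-in distance for the demo; the paper uses the obstructed distance d_o."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


objects = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
particle = [(0.0, 1.0), (10.0, 1.0)]  # one particle: N_c = 2 centroids
clusters, J, f = assign_and_score(objects, particle, euclid)
print(clusters, J, f)
```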

5 Results and Discussion

We carried out experiments separately with K-Medoids, IKSCOC, GKSCOC, and HPKSCOC. The parameters were n = 50, $w_{\max} = 0.999$, $w_{\min} = 0.001$, $c_1 = c_2 = 2$, $V_{\max} = 0.4$, $t_{\max} = 100$, $T_G = 0.01$, $n_p = 50$, $p_c = 0.6$, $p_m = 0.01$, ε = 0.001.

Fig. 3 shows the results on the synthetic Dataset1. Fig. 3(a) shows the original data with simple obstacles. Fig. 3(b) shows the 4 clusters found by K-Medoids without considering obstacles constraints. Fig. 3(c), Fig. 3(d), and Fig. 3(e) show the 4 clusters found by IKSCOC, GKSCOC, and HPKSCOC, respectively. The clustering results in Fig. 3(c), Fig. 3(d), and Fig. 3(e) are more practical than that in Fig. 3(b), and those in Fig. 3(e) and Fig. 3(d) are both superior to the one in Fig. 3(c). Fig. 4 shows the results on the real Dataset2, which consists of residential spatial data points with river and railway obstacles in a facility location problem for city parks. Fig. 4(a) shows the original data with river and railway obstacles. Fig. 4(b) and Fig. 4(c) show the 10 clusters found by K-Medoids and HPKSCOC, respectively. The clustering result in Fig. 4(c) is more practical than the one in Fig. 4(b). It can therefore be concluded that HPKSCOC is effective and yields more practical results. Fig. 5 shows the value of J in each experiment on Dataset1 for IKSCOC and HPKSCOC. It shows that IKSCOC is sensitive to the initial value and converges to very different local optima when started from different initial values, while HPKSCOC converges to nearly the same optimum each time.


Fig. 3. Clustering Dataset1


Fig. 4. Clustering dataset Dataset2

Fig. 6 shows the convergence speed in one experiment on Dataset1. HPKSCOC converges in about 12 generations, while GKSCOC needs nearly 25 generations, so HPKSCOC is effective and converges faster than GKSCOC. We can therefore conclude that HPKSCOC has a stronger global convergence ability than IKSCOC and a higher convergence speed than GKSCOC.

Fig. 5. HPKSCOC vs. IKSCOC

Fig. 6. HPKSCOC vs. GKSCOC

6 Conclusions

In this paper, we explore the applicability of PSO for SCOC. We first use an advanced HPSO with GA mutation to obtain the obstructed distance, and then develop HPKSCOC to cluster spatial data with obstacles constraints. By adding a mutation operator, the HPSO algorithm can escape the attraction of a local minimum in the later convergence phase while keeping its fast speed in the early phase. The experiments show that the HPKSCOC algorithm not only achieves a higher local convergence speed and a stronger global optimum search, but also takes into account the obstacles constraints and practicalities of spatial clustering; it is better than IKSCOC in terms of quantization error and converges faster than GKSCOC.


Acknowledgments. This work is partially supported by the Science Technology Innovation Project of Henan (Number: 2008HASTIT012), the Natural Sciences Fund of Henan (Number: 0511011000, Number: 0624220081).

References
1. Tung, A.K.H., Hou, J., Han, J.: Spatial Clustering in the Presence of Obstacles. In: 2001 International Conference on Data Engineering, pp. 359–367 (2001)
2. Estivill-Castro, V., Lee, I.J.: AUTOCLUST+: Automatic Clustering of Point-Data Sets in the Presence of Obstacles. In: 2000 International Workshop on Temporal, Spatial and Spatial-Temporal Data Mining, pp. 133–146 (2000)
3. Zaïane, O.R., Lee, C.H.: Clustering Spatial Data When Facing Physical Constraints. In: The 2002 IEEE International Conference on Data Mining, pp. 737–740 (2002)
4. Wang, X., Rostoker, C., Hamilton, H.J.: DBRS+: Density-Based Spatial Clustering in the Presence of Obstacles and Facilitators (2004), http://ftp.cs.uregina.ca/Research/Techreports/2004-09.pdf
5. Zhang, X., Wang, J., Wu, F., Fan, Z., Li, X.: A Novel Spatial Clustering with Obstacles Constraints Based on Genetic Algorithms and K-Medoids. In: The Sixth International Conference on Intelligent Systems Design and Applications, pp. 605–610 (2006)
6. Eberhart, R.C., Kennedy, J.: A New Optimizer Using Particle Swarm Theory. In: The Sixth International Symposium on Micro Machine and Human Science, pp. 39–43 (1995)
7. Kennedy, J., Eberhart, R.C.: Particle Swarm Optimization. In: 1995 IEEE International Conference on Neural Networks, vol. IV, pp. 1942–1948 (1995)
8. Bergh, F.V.D.: An Analysis of Particle Swarm Optimizers. Ph.D. Thesis, University of Pretoria (2001)
9. Esmin, A.A.A., Lambert-Torres, G., Alvarenga, G.B.: Hybrid Evolutionary Algorithm Based on PSO and GA Mutation. In: The 6th International Conference on Hybrid Intelligent Systems, p. 57 (2006)
10. Zhao, F., Zhang, Q., Wang, L.: A Scheduling Holon Modeling Method with Petri Net and its Optimization with a Novel PSO-GA Algorithm. In: The 10th International Conference on Computer Supported Cooperative Work in Design, pp. 1302–1307 (2006)
11. Qin, Y., Sun, D., Li, N., Cen, Y.: Path Planning for Mobile Robot Using the Particle Swarm Optimization with Mutation Operator. In: The Third International Conference on Machine Learning and Cybernetics, pp. 2473–2478 (2004)
12. Habib, M.K., Asama, H.: Efficient Method to Generate Collision Free Paths for Autonomous Mobile Robot Based on New Free Space Structuring Approach. In: 1991 International Workshop on Intelligent Robots and Systems, pp. 563–567 (1991)
13. Van der Merwe, D.W., Engelbrecht, A.P.: Data Clustering Using Particle Swarm Optimization. In: IEEE Congress on Evolutionary Computation 2003, pp. 215–220 (2003)
14. Xiao, X., Dow, E.R., Eberhart, R., Miled, Z.B., Oppelt, R.J.: Gene Clustering Using Self-Organizing Maps and Particle Swarm Optimization. In: The 2003 International Conference on Parallel and Distributed Processing Symposium, p. 154 (2003)
15. Cui, X., Potok, T.E., Palathingal, P.: Document Clustering Using Particle Swarm Optimization. In: 2005 IEEE on Swarm Intelligence Symposium, pp. 185–191 (2005)
16. Omran, M.G.H.: Particle Swarm Optimization Methods for Pattern Recognition and Image Processing. Ph.D. Thesis, University of Pretoria (2005)

Analysis of the Kurtosis-Sum Objective Function for ICA

Fei Ge and Jinwen Ma⋆

Department of Information Science, School of Mathematical Sciences and LMAM, Peking University, Beijing, 100871, China
[email protected]

Abstract. The majority of existing Independent Component Analysis (ICA) algorithms are based on maximizing or minimizing a certain objective function with the help of gradient learning methods. However, it is rather difficult to prove that ICA has no spurious solution under a given objective function and the gradient learning algorithm used to optimize it. In this paper, we present an analysis of the kurtosis-sum objective function, i.e., the sum of the absolute kurtosis values of all the estimated components, with a kurtosis switching algorithm to maximize it. In the two-source case, it is proved that any local maximum of this kurtosis-sum objective function corresponds to a feasible solution of the ICA problem in the asymptotic sense. The simulation results further show that the kurtosis switching algorithm always leads to a feasible solution of the ICA problem for various types of sources.

Keywords: Independent component analysis, Blind signal separation, Spurious solution, Kurtosis, Switching algorithm.

1 Introduction

Independent Component Analysis (ICA) provides a powerful statistical tool for signal processing and data analysis. It aims at decomposing a random vector that is an instantaneous linear combination of several independent random variables, so that the decomposed components are as mutually independent as possible. One major application of ICA is Blind Signal Separation (BSS), where simultaneous observations $x(t) = [x_1(t), \ldots, x_m(t)]^T$ are linear mixtures of independent signal sources $s(t) = [s_1(t), \ldots, s_n(t)]^T$ via a mixing matrix $A \in \mathbb{R}^{m \times n}$ such that $x(t) = As(t)$. Typically, we consider the case $m = n$, and the purpose of ICA is to solve or learn an $n \times n$ matrix $W$ such that $WA$ has one and only one non-zero entry in each row and in each column. Such a $W$, called a separating matrix or demixing matrix, corresponds to a feasible solution of the ICA problem. Clearly, the independence assumption on the estimated components is the key to solving the ICA problem. That is, if the components of $y(t) = Wx(t)$ are mutually independent, they can be considered as the recovered sources.

⋆ Corresponding author.
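The feasibility condition stated in the Introduction (exactly one non-zero entry of $WA$ per row and per column) can be checked numerically. The following is a minimal sketch, assuming a relative threshold `tol` to decide which entries count as non-zero; the function name, the threshold and the example matrices are illustrative and not taken from the paper.

```python
import numpy as np

def is_feasible_solution(P, tol=0.1):
    """Return True if P = W A is close to a scaled permutation matrix.

    Each row and each column must contain exactly one entry whose
    magnitude exceeds `tol` times the largest magnitude in P.
    The threshold `tol` is an illustrative choice, not from the paper.
    """
    P = np.asarray(P, dtype=float)
    significant = np.abs(P) > tol * np.abs(P).max()
    return bool(np.all(significant.sum(axis=0) == 1)
                and np.all(significant.sum(axis=1) == 1))

# Hypothetical 2x2 example: A mixes two sources, W_good undoes the mixing
# up to a permutation and a sign change, W_bad leaves the mixtures as they are.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
W_good = np.array([[0.0, -1.0],
                   [1.0, 0.0]]) @ np.linalg.inv(A)
W_bad = np.eye(2)

print(is_feasible_solution(W_good @ A))
print(is_feasible_solution(W_bad @ A))
```

In practice $WA$ is never exactly a permutation matrix, which is why a tolerance is needed; any reasonable choice of threshold would serve the same illustrative purpose.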

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 579–588, 2008. c Springer-Verlag Berlin Heidelberg 2008 


Actually, the independence measure among the estimated components can serve as a good objective or contrast function for ICA. Supposing that $p_i(y_i)$ is the marginal probability density function (pdf) of the $i$-th component of $y = Wx = WAs$, and $p(y)$ is the joint pdf of $y$, we can use the Kullback divergence to set up the following Minimum Mutual Information (MMI) criterion [1]:

\[ I(y) = \int p(y) \log \frac{p(y)}{\prod_{i=1}^{n} p_i(y_i)} \, dy. \tag{1} \]

Clearly, $I(y)$ is nonnegative and vanishes only when all $y_i$ are mutually independent. Moreover, this MMI criterion is equivalent to the Maximum Likelihood (ML) criterion [2] if $p_i(\cdot)$ coincides with the pdf of each source. Since the pdfs of the sources are unknown in advance, we generally substitute some predefined or model pdfs for the real pdfs in the mutual information. In this way, however, the MMI approach works only in the cases where the components of $y$ are either all super-Gaussian [3] or all sub-Gaussian [4]. For the cases where the sources contain both super-Gaussian and sub-Gaussian signals in an unknown manner, it was conjectured that the model pdfs $p_i(y_i)$ should keep the same kurtosis signs as the source pdfs. This conjecture motivated the proposal of the so-called one-bit matching condition [5], which can be basically stated as “all the sources can be separated as long as there is a one-to-one same-sign-correspondence between the kurtosis signs of all source pdf’s and the kurtosis signs of all model pdf’s”. Under the one-bit matching condition, Liu, Chiu, and Xu simplified the mutual information into a cost function and proved that the global maximum of the cost function corresponds to a feasible solution of the ICA problem [6]. Ma, Liu, and Xu further proved that all the maxima of the cost function correspond to feasible solutions in the two-source mixing setting [7]. Recently, this cost function was further analyzed in [8] and an efficient learning algorithm was constructed with it in [9]. However, the one-bit matching condition is not sufficient for the MMI criterion, because Vrins and Verleysen [10] have already proved that spurious maxima exist for it when the sources are strongly multimodal. On the other hand, there have been many ICA algorithms that explicitly or implicitly utilize certain flexible pdfs to fit different types of sources.
Actually, these methods learn the separating matrix together with the parameters in the flexible model pdfs, nonlinear functions, or switching functions. From the simple switching or parametric functions (e.g., [11,12,13]) to the complex mixture densities (e.g., [5,14,15]), these flexible functions have enabled the algorithms to successfully separate the sources in both simulation experiments and applications. However, an essential issue remains: whether all the local optima of the objective function in each of these methods correspond to feasible solutions. Clearly, if all the local optima correspond to feasible solutions, any gradient-type algorithm will always succeed in solving the ICA problem. Otherwise, if there exists some optimum which does not correspond to a feasible solution, a gradient-type algorithm may be trapped in such a local optimum and lead to a spurious solution. Thus, for an objective function, it is


vital to know whether there exists a local optimum which does not correspond to a feasible solution, or equivalently whether an algorithm to optimize it has no spurious solution. Actually, the stability analyses by Amari et al. [16] and by Cardoso and Laheld [17] only gave certain conditions under which the algorithm is stable at a feasible solution, but did not guarantee that a stable solution is feasible. Besides the mutual information, another typical independence measure is nongaussianity. If $s_1, \ldots, s_n$ are independent non-Gaussian random variables, their linear combination $x = a_1 s_1 + \ldots + a_n s_n$ ($a_i \neq 0$) is a random variable which tends to be closer to Gaussian than $s_1, \ldots, s_n$ individually. A classical measure of nongaussianity is the fourth-order cumulant, or kurtosis. For extracting a single component from the mixture, kurtosis or its square as a contrast function has been investigated by Delfosse and Loubaton [18] and by Hyvärinen and Oja [19]. The extrema of the single-unit contrast function correspond to the original sources. By a deflation approach, all the independent components can be detected sequentially. This is the origin of the FastICA algorithm [19]. On the other hand, we can construct a kurtosis-sum objective function, i.e., the sum of the absolute kurtosis values of all the estimated components, to extract all the components simultaneously. Although Vrins and Verleysen [10] already showed that such a kurtosis-based contrast function is superior to the entropy-based ones for multimodal sources, at least when $n = 2$, there is still no theoretical analysis of its spurious solutions. In this paper, we investigate the kurtosis-sum objective function theoretically and propose a kurtosis switching algorithm to maximize it. It is proved that, in the two-source case, all the local maxima correspond to feasible solutions of the ICA problem, or in other words, the kurtosis switching algorithm has no spurious solution, as long as the sources have non-zero kurtosis.
Moreover, we demonstrate our theoretical results by the simulation experiments. In the sequel, the kurtosis-sum objective function and the kurtosis switching algorithm are introduced in Section 2. Then, the no spurious solution property of the kurtosis switching algorithm is proved for the two-source case in Section 3. Furthermore, simulation experiments are conducted to demonstrate the algorithm in Section 4. Finally, Section 5 contains a brief conclusion.

2 Kurtosis-Sum Objective Function and Kurtosis Switching Algorithm

As is well known, kurtosis is one of the most important features of a source signal or pdf. Supposing that $x$ is a random variable with zero mean, its kurtosis is defined by

\[ \mathrm{kurt}\{x\} = E\{x^4\} - 3\,(E\{x^2\})^2, \tag{2} \]

where $E\{\cdot\}$ denotes the expectation. Clearly, Gaussian variables have zero kurtosis. A non-Gaussian signal or random variable is called super-Gaussian if its kurtosis is positive, and sub-Gaussian if its kurtosis is negative.
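The definition in Eq. (2) and the resulting super-/sub-Gaussian classification can be illustrated on samples. The sketch below is only an illustration: the sample size, the dead zone `tol` around zero and the three test distributions are assumptions made here, not part of the paper.

```python
import numpy as np

def kurtosis(x):
    """Sample version of Eq. (2) after centering x to zero mean."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.mean(x**4) - 3.0 * np.mean(x**2) ** 2

def gaussianity_type(x, tol=0.1):
    """Label a sample by the sign of its kurtosis (tol is an ad hoc dead zone)."""
    k = kurtosis(x)
    if k > tol:
        return "super-Gaussian"
    if k < -tol:
        return "sub-Gaussian"
    return "approximately Gaussian"

rng = np.random.default_rng(0)
N = 200_000
samples = {
    "gaussian": rng.standard_normal(N),
    "laplacian": rng.laplace(scale=1 / np.sqrt(2), size=N),   # unit variance
    "uniform": rng.uniform(-np.sqrt(3), np.sqrt(3), size=N),  # unit variance
}
for name, x in samples.items():
    print(f"{name:10s} kurt = {kurtosis(x):+.3f}  ->  {gaussianity_type(x)}")
```

For unit-variance variables the theoretical values are 0 (Gaussian), 3 (Laplacian) and -1.2 (uniform), so the printed estimates should lie close to these.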


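It follows from Eq. (2) that

\[ \mathrm{kurt}\{\alpha x\} = \alpha^4\, \mathrm{kurt}\{x\}, \quad \alpha \in \mathbb{R}; \tag{3} \]

and if $x_1$ and $x_2$ are independent, we have

\[ \mathrm{kurt}\{x_1 + x_2\} = \mathrm{kurt}\{x_1\} + \mathrm{kurt}\{x_2\}. \tag{4} \]

2.1 Kurtosis-Sum Objective Function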

We consider the ICA problem with $n$ sources and $n$ observations. Without loss of generality, we assume that the sources have zero mean and unit variance. Moreover, the observed signals can be further pre-whitened such that $E\{x\} = 0$ and $E\{xx^T\} = I$. Then, for any orthogonal transformation matrix $W$, the estimated signals $y = Wx$ are always whitened. The kurtosis-sum objective function is defined by

\[ J(W) = \sum_{i=1}^{n} |\mathrm{kurt}\{y_i\}| = \sum_{i=1}^{n} |\mathrm{kurt}\{w_i^T x\}|, \tag{5} \]

where $x$ is the (pre-whitened) observed signal (as a random vector), and $W = [w_1, w_2, \cdots, w_n]^T$ is the orthogonal de-mixing matrix to be estimated. Since the two transformations are linear, $y = Wx = WAs = Rs$, where $R = WA$ is another orthogonal matrix. Because $A$ is constant, we consider $R$ instead of $W$, and by Eqs. (3) and (4) we have

\[ J(W) = J(R) = \sum_{i=1}^{n} \Big|\mathrm{kurt}\Big\{\sum_{j=1}^{n} r_{ij} s_j\Big\}\Big| = \sum_{i=1}^{n} \Big|\sum_{j=1}^{n} r_{ij}^4\, \mathrm{kurt}\{s_j\}\Big| = \sum_{i=1}^{n} \Big|\sum_{j=1}^{n} r_{ij}^4 \kappa_j\Big| = \sum_{i=1}^{n} k_i \sum_{j=1}^{n} r_{ij}^4 \kappa_j, \tag{6} \]

where $\kappa_j$ denotes the kurtosis of the $j$-th source signal, and

\[ k_i = \mathrm{sign}\Big\{\sum_{j=1}^{n} r_{ij}^4 \kappa_j\Big\}. \tag{7} \]
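To make Eqs. (6) and (7) concrete, the following sketch evaluates $J(R)$ and the switch coefficients $k_i$ directly from a given $R$ and a vector of source kurtoses. In a real separation task both are unknown, so this is purely an illustration of the formulas; the kurtosis values and the two example matrices are hypothetical.

```python
import numpy as np

def kurtosis_sum(R, kappa):
    """Eq. (6): J(R) = sum_i | sum_j r_ij^4 * kappa_j |."""
    R = np.asarray(R, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    return np.abs((R**4) @ kappa).sum()

def switch_signs(R, kappa):
    """Eq. (7): k_i = sign(sum_j r_ij^4 * kappa_j)."""
    R = np.asarray(R, dtype=float)
    return np.sign((R**4) @ np.asarray(kappa, dtype=float))

kappa = np.array([3.0, -1.2])   # hypothetical source kurtoses (Laplacian, uniform)
theta = np.pi / 4
R_mixed = np.array([[np.cos(theta), np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])   # maximally mixed rotation
R_perm = np.array([[0.0, 1.0],
                   [-1.0, 0.0]])                        # signed permutation (feasible)

print(kurtosis_sum(R_perm, kappa), switch_signs(R_perm, kappa))
print(kurtosis_sum(R_mixed, kappa), switch_signs(R_mixed, kappa))
```

With these values the signed permutation attains $|\kappa_1| + |\kappa_2| = 4.2$, whereas the rotation by $\pi/4$ gives only $2 \times 0.25 \times (3 - 1.2) = 0.9$.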

In the above equations, $\kappa_j$ is unknown. Moreover, $R$ is related to $W$, but is also unknown. However, with the samples of $x$ we can directly estimate $\mathrm{kurt}\{y_i\}$ and hence the kurtosis-sum objective function. Since the absolute value function is not differentiable at zero, we treat $k_i$ as a $\pm 1$ coefficient, which leads to a kurtosis switching function.

2.2 Kurtosis Switching Algorithm

We further construct a kurtosis switching algorithm to maximize the kurtosis-sum objective function. Before doing so, we give an estimate of $\mathrm{kurt}\{y_i\}$ with the samples from the observation. With a set of samples $D = \{x_1, \ldots, x_N\}$, it is quite reasonable to use the following statistic:

\[ f(w_i|D) = \frac{1}{N} \sum_{l=1}^{N} (w_i^T x_l)^4 - 3 \tag{8} \]

to estimate $\mathrm{kurt}\{w_i^T x\}$, since $E\{(w_i^T x)^2\} = 1$ for whitened $x$ and unit-norm $w_i$. With the above preparations, we can construct the kurtosis switching algorithm as follows.

(1) Initialization. The mixed signal $x$ should be pre-whitened. $W$ is initially set to be an orthogonal matrix, and each $k_i$ is set to be either 1 or −1.
(2) Select a sample data set $D$ from the mixed signals.
(3) Evaluate the kurtosis values of the current estimated components, $f(w_i|D)$, and update $k_i := \mathrm{sign}\{f(w_i|D)\}$ for $i = 1, \ldots, n$. (Note that this update is not always active in each iteration.)
(4) Calculate the gradient. Compute $\partial f(w_i|D)/\partial w_i$ for $i = 1, \ldots, n$, and set

\[ \nabla J_W = \left[ k_1 \frac{\partial f(w_1|D)}{\partial w_1}, \cdots, k_n \frac{\partial f(w_n|D)}{\partial w_n} \right]. \tag{9} \]

(5) Obtain the constrained gradient. Project $\nabla J_W$ onto the Stiefel manifold by

\[ \hat{\nabla} J_W = W W^T \nabla J_W - W \nabla J_W^T W. \tag{10} \]

(6) Update $W := W + \eta \hat{\nabla} J_W$. A certain regularization process may be applied to $W$ if $W$ is far from orthogonal.
(7) Repeat steps (2) through (6) until $\|\hat{\nabla} J_W\| < \varepsilon$, where $\|\cdot\|$ is the Euclidean norm and $\varepsilon\,(> 0)$ is a pre-selected threshold value for stopping the algorithm.

In this algorithm, the absolute value operator $|\cdot|$ is replaced by multiplication with a switch coefficient $k_i = \pm 1$, which guarantees the maximization of the original kurtosis-sum objective function, because the kurtosis signs are always checked. Meanwhile, we utilize a modified gradient of the objective function with respect to $W$, which automatically keeps the constraint $WW^T = I$ satisfied, to first order in $\eta$, after each update of $W$.
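Steps (1) through (7) can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' implementation: it uses the whole sample as $D$ in every iteration, stores the gradient with row $i$ equal to $k_i\,\partial f/\partial w_i$ so that it has the same layout as $W$, and takes an SVD-based re-orthogonalization as the unspecified "regularization process". The step size, tolerance, seeds and the toy mixing matrix are all assumptions.

```python
import numpy as np

def whiten(X):
    """Center X (n x N) and whiten it so that the sample covariance is I."""
    X = X - X.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(np.cov(X))
    V = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # whitening matrix
    return V @ X, V

def kurtosis_switching(X, eta=0.02, eps=1e-6, max_iter=5000, seed=0):
    """Steps (1)-(7) on pre-whitened data X (n x N), using all samples as D."""
    n, N = X.shape
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((n, n)))   # (1) orthogonal start
    for _ in range(max_iter):
        Y = W @ X
        f = np.mean(Y**4, axis=1) - 3.0                # (3) Eq. (8)
        k = np.where(f >= 0, 1.0, -1.0)                #     switch coefficients
        G = k[:, None] * (4.0 / N) * (Y**3) @ X.T      # (4) row i = k_i * df/dw_i
        G_hat = W @ W.T @ G - W @ G.T @ W              # (5) Eq. (10)
        if np.linalg.norm(G_hat) < eps:                # (7) stopping rule
            break
        W = W + eta * G_hat                            # (6) ascent step
        U, _, Vt = np.linalg.svd(W)                    # assumed regularization:
        W = U @ Vt                                     # nearest orthogonal matrix
    return W

# Toy check with one super-Gaussian and one sub-Gaussian source.
rng = np.random.default_rng(1)
N = 20_000
S = np.vstack([rng.laplace(scale=1 / np.sqrt(2), size=N),
               rng.uniform(-np.sqrt(3), np.sqrt(3), size=N)])
A = np.array([[0.8, 0.6],
              [-0.4, 0.9]])                            # hypothetical mixing matrix
X, V = whiten(A @ S)
W = kurtosis_switching(X)
R = W @ V @ A                                          # overall source-to-output map
print(np.round(R, 3))
```

If the sketch behaves as the analysis in Section 3 predicts, the printed $R$ should be close to a permutation matrix with sign ambiguities. A fixed step size is used for simplicity; sources with larger kurtosis may require a smaller `eta`.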

3 No Spurious Solution Analysis in Two-Source Case

With the kurtosis switching algorithm, we can reach a local maximum of the kurtosis-sum objective function. We now analyze the no-spurious-solution property of the kurtosis-sum objective function for the two-source case in the asymptotic sense. The two sources are required to have non-zero kurtosis. Clearly, in the two-source case, $R$ is a $2 \times 2$ orthogonal matrix, and can be parameterized by

\[ R = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{pmatrix}. \tag{11} \]


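Thus, we have

\[ J(W) = J(R) = J(\theta) = |\kappa_1 \cos^4\theta + \kappa_2 \sin^4\theta| + |\kappa_1 \sin^4\theta + \kappa_2 \cos^4\theta|. \tag{12} \]

Below we analyze the local maxima of $J(\theta)$ for different signs of $\kappa_1$ and $\kappa_2$.

Case 1. If $\kappa_1 > 0$ and $\kappa_2 > 0$, or $\kappa_1 < 0$ and $\kappa_2 < 0$, we have

\[ J(\theta) = (|\kappa_1| + |\kappa_2|)(\cos^4\theta + \sin^4\theta) = |\kappa_1 + \kappa_2| \left( \frac{3}{4} + \frac{1}{4} \cos 4\theta \right). \]

In this case the kurtosis of each estimated component always has the same sign as the source kurtoses. It is easily verified that $J(\theta)$ has local maxima only at $\theta \in \{m\pi/2\}$, $m \in \mathbb{Z}$, which lead $R$ to the following forms:

\[ R = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}, \]

where $\lambda_i \in \{\pm 1\}$, $i = 1, 2$. Certainly, all these $R$, i.e., the local maxima, correspond to feasible solutions of the ICA problem.

Case 2. If $\kappa_1 < 0$ and $\kappa_2 > 0$, the kurtosis signs of the two source components of $s$ are different. In this case, $J(\theta)$ becomes a piecewise function as follows:

\[ J(\theta) = \begin{cases} (\kappa_1 + \kappa_2)(\sin^4\theta + \cos^4\theta), & \text{if } \frac{\sin^4\theta}{\cos^4\theta} \ge -\frac{\kappa_1}{\kappa_2} \text{ and } \frac{\sin^4\theta}{\cos^4\theta} < -\frac{\kappa_2}{\kappa_1}, \\ (-\kappa_1 - \kappa_2)(\sin^4\theta + \cos^4\theta), & \text{if } \frac{\sin^4\theta}{\cos^4\theta} < -\frac{\kappa_1}{\kappa_2} \text{ and } \frac{\sin^4\theta}{\cos^4\theta} \ge -\frac{\kappa_2}{\kappa_1}, \\ (\kappa_1 - \kappa_2)(\cos^4\theta - \sin^4\theta), & \text{if } \frac{\sin^4\theta}{\cos^4\theta} \ge -\frac{\kappa_1}{\kappa_2} \text{ and } \frac{\sin^4\theta}{\cos^4\theta} \ge -\frac{\kappa_2}{\kappa_1}, \\ (\kappa_2 - \kappa_1)(\cos^4\theta - \sin^4\theta), & \text{if } \frac{\sin^4\theta}{\cos^4\theta} < -\frac{\kappa_1}{\kappa_2} \text{ and } \frac{\sin^4\theta}{\cos^4\theta} < -\frac{\kappa_2}{\kappa_1}. \end{cases} \]

For convenience, we define $\alpha = -\kappa_1/\kappa_2$ and $\phi = \tan^{-1}\big(\sqrt[4]{\min(\alpha, 1/\alpha)}\big) \le \frac{\pi}{4}$. Then, the range of $\theta$ can be divided into three non-overlapping sets:

\[ S_1 = \{\theta \mid \tan^4\theta \ge \max(\alpha, 1/\alpha)\} = \bigcup_{m=-\infty}^{+\infty} \left[ m\pi + \frac{\pi}{2} - \phi,\; m\pi + \frac{\pi}{2} + \phi \right]; \]

\[ S_2 = \{\theta \mid \tan^4\theta < \min(\alpha, 1/\alpha)\} = \bigcup_{m=-\infty}^{+\infty} (m\pi - \phi,\; m\pi + \phi); \]

\[ S_3 = \{\theta \mid \min(\alpha, 1/\alpha) \le \tan^4\theta < \max(\alpha, 1/\alpha)\} = \mathbb{R} \setminus (S_1 \cup S_2). \]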

We now consider $\theta$ in the three sets, respectively, as follows.

(a). If $\theta \in S_1$, $J(\theta) = (\kappa_1 - \kappa_2)(\cos^4\theta - \sin^4\theta) = -(\kappa_2 - \kappa_1)\cos 2\theta$ has local maxima only at $\{m\pi + \frac{\pi}{2}\}$, $m \in \mathbb{Z}$, and $\inf_{\theta \in S_1} J(\theta) = -(\kappa_2 - \kappa_1)\cos(\pi - 2\phi) = (\kappa_2 - \kappa_1)\cos 2\phi$.
(b). If $\theta \in S_2$, $J(\theta) = (\kappa_2 - \kappa_1)(\cos^4\theta - \sin^4\theta) = (\kappa_2 - \kappa_1)\cos 2\theta$ has local maxima only at $\{m\pi\}$, $m \in \mathbb{Z}$, and $\inf_{\theta \in S_2} J(\theta) = (\kappa_2 - \kappa_1)\cos 2\phi$.
(c). If $\theta \in S_3$, $J(\theta) = (\kappa_1 + \kappa_2)(\sin^4\theta + \cos^4\theta)$ if $-\kappa_1 < \kappa_2$; or $J(\theta) = (-\kappa_1 - \kappa_2)(\sin^4\theta + \cos^4\theta)$ if $-\kappa_1 > \kappa_2$. So $J(\theta) = |\kappa_1 + \kappa_2|(\sin^4\theta + \cos^4\theta) = |\kappa_1 + \kappa_2|(\frac{3}{4} + \frac{1}{4}\cos 4\theta)$. It is easy to see that $J(\theta)$ has no local maximum within $S_3$, and $\sup_{\theta \in S_3} J(\theta) = |\kappa_1 + \kappa_2|(\frac{3}{4} + \frac{1}{4}\cos 4\phi)$.

According to the above analysis, we have

\[ \begin{aligned} \inf_{\theta \in S_1} J(\theta) = \inf_{\theta \in S_2} J(\theta) &= (\kappa_2 - \kappa_1)\cos 2\phi = (\kappa_2 - \kappa_1)(1 - \tan^4\phi)\cos^4\phi \\ &= (\kappa_2 - \kappa_1)\Big(1 - \min\Big(-\frac{\kappa_1}{\kappa_2}, -\frac{\kappa_2}{\kappa_1}\Big)\Big)\cos^4\phi \\ &= (\kappa_2 - \kappa_1)\frac{|\kappa_2 + \kappa_1|}{\max(-\kappa_1, \kappa_2)}\cos^4\phi; \end{aligned} \tag{13} \]

\[ \begin{aligned} \sup_{\theta \in S_3} J(\theta) &= |\kappa_1 + \kappa_2|\Big(\frac{3}{4} + \frac{1}{4}\cos 4\phi\Big) = |\kappa_1 + \kappa_2|(1 + \tan^4\phi)\cos^4\phi \\ &= |\kappa_1 + \kappa_2|\Big(1 + \min\Big(-\frac{\kappa_1}{\kappa_2}, -\frac{\kappa_2}{\kappa_1}\Big)\Big)\cos^4\phi \\ &= |\kappa_1 + \kappa_2|\frac{\kappa_2 - \kappa_1}{\max(\kappa_2, -\kappa_1)}\cos^4\phi. \end{aligned} \tag{14} \]

Because $\mathbb{R} = S_1 \cup S_2 \cup S_3$ and $\inf_{\theta \in S_1} J(\theta) = \inf_{\theta \in S_2} J(\theta) = \sup_{\theta \in S_3} J(\theta)$, $J(\theta)$ cannot reach any local maximum at the boundary points of $S_3$. Thus, $J(\theta)$ can have local maxima only at $\{m\pi/2\}$, $m \in \mathbb{Z}$. For the case $\kappa_1 > 0$ and $\kappa_2 < 0$, it can be easily verified that $J(\theta)$ behaves in the same way as in Case 2.

Summing up all the analysis results, we have proved that in the two-source case, $J(W) = J(R)$ can only have local maxima that correspond to feasible solutions of the ICA problem. That is, $J(W)$ is locally maximized only at a separation matrix $W$ which leads $R$ to a permutation matrix up to sign ambiguities. From the above analysis, we can find that when sources with positive kurtosis and negative kurtosis co-exist, the range of $R$ (corresponding to a unit circle of $\theta$) can be divided into non-overlapping sets, and on each of them the kurtosis signs of $y_i$ do not change. Thus, the update of the kurtosis sign of $y_i$ in each iteration is not necessarily active. In fact, a real switching operation happens only when the parameter moves across the boundary between two such sets.
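The conclusion of this analysis can be checked numerically by evaluating Eq. (12) on a grid of $\theta$ and locating the grid points that are strictly larger than both neighbours. The sketch below does this for a few kurtosis pairs covering Case 1 and Case 2; the grid resolution and the kurtosis values are arbitrary choices made for illustration.

```python
import numpy as np

def J_theta(theta, k1, k2):
    """Eq. (12) for a rotation R(theta) and source kurtoses k1, k2."""
    c4, s4 = np.cos(theta) ** 4, np.sin(theta) ** 4
    return np.abs(k1 * c4 + k2 * s4) + np.abs(k1 * s4 + k2 * c4)

def local_maxima(k1, k2, n_grid=720):
    """Grid angles in [0, 2*pi) where J is strictly larger than both neighbours."""
    theta = np.arange(n_grid) * (2 * np.pi / n_grid)
    J = J_theta(theta, k1, k2)
    is_max = (J > np.roll(J, 1)) & (J > np.roll(J, -1))   # circular neighbours
    return theta[is_max]

for k1, k2 in [(3.0, 1.5), (-1.2, -0.8), (-1.2, 3.0), (2.0, -0.5)]:
    maxima = local_maxima(k1, k2)
    print((k1, k2), np.round(maxima / (np.pi / 2), 6))    # in units of pi/2
```

For each pair the reported maxima should be the integer multiples 0, 1, 2, 3 of $\pi/2$, in agreement with the statement that $J(\theta)$ is locally maximized only at $\{m\pi/2\}$. A grid search is of course only a sanity check, not a substitute for the proof.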

4 Experimental Results

In order to substantiate our theoretical results and test the kurtosis switching algorithm, we conducted two experiments on real and artificial signals. We also compared the results of our algorithm with those of the Extended Infomax algorithm [11] and the FastICA algorithm [20]. Firstly, we utilized two audio recordings as independent source signals. Each of these two signals contains 4000 samples, and their sample kurtoses are 0.6604 and 0.1910, respectively. The observation signals were generated as two linear mixtures of these two audio signals through a random matrix. We implemented the kurtosis switching algorithm on the observation signals. After the kurtosis switching algorithm stopped, it was found that the two sources were separated with

\[ R = \begin{pmatrix} 1.0021 & -0.0280 \\ -0.0233 & -1.0020 \end{pmatrix}. \]

Fig. 1. Sketches of the kurtosis-sum objective function J(θ) and the absolute kurtosis values |kurt{y1}| and |kurt{y2}| of the estimated components of y in the two-source experiment, for θ from zero to π.

The performance index (refer to [4]) of this separation result was 0.1024. In the same situation, the Extended Infomax algorithm could only reach a performance index of 0.4658. For the FastICA algorithm, the symmetric approach was selected, and the performance index was 0.1215 when using “tanh” as the nonlinearity, but improved to 0.0988 when using “power 3”. Thus, in terms of the correctness of the ICA solution, the kurtosis switching algorithm could be as good as the FastICA algorithm, although it required more iterations and took much longer than FastICA.

For illustration, Fig. 1 shows sketches of the kurtosis-sum objective function and the absolute kurtosis values of the two estimated components of $y$ in the above two-source experiment for $\theta$ from zero to $\pi$. Theoretically, as the two sources are super-Gaussian, their mixtures should have positive kurtosis. However, the estimated kurtosis of $y_i$ could be negative at some $\theta$ or $W$. Besides, our analysis indicates that each $|\mathrm{kurt}\{y_i\}|$ attains a maximum at $\theta = m\pi/2$, but this does not hold exactly for finite data. Indeed, the maxima of the kurtosis-sum objective function were not exactly at $\theta = m\pi/2$, due to estimation errors.

We further conducted another experiment on seven synthetic sources, with random samples generated from: (a) a Laplacian distribution; (b) an exponential distribution, which is not symmetric; (c) a uniform distribution; (d) a Beta distribution $\beta(2, 2)$; (e) a bimodal Gaussian mixture $\frac{1}{2} N(-1.5, 0.25) + \frac{1}{2} N(1.5, 0.25)$; (f) a unimodal Gaussian mixture $\frac{1}{2} N(0, 0.25) + \frac{1}{2} N(0, 2.25)$; (g) a trimodal Gaussian mixture $\frac{1}{3} N(-2, 0.25) + \frac{1}{3} N(0, 0.25) + \frac{1}{3} N(2, 0.25)$. Three of them ((a), (b) and (f)) were super-Gaussian, while the remaining four sources were sub-Gaussian. All the sources were normalized before mixing. For each source, there were 1000 samples. The observation signals were generated as seven linear mixtures of these seven independent synthetic signals through a random matrix. We implemented the kurtosis switching algorithm on these observation signals and obtained a successful separation matrix with $R$ given as follows:

\[ R = \begin{pmatrix}
-0.0205 & 0.0141 & 0.0139 & -0.0223 & -0.0361 & 0.0397 & 1.0231 \\
-0.0178 & 0.0008 & -1.0108 & 0.0049 & 0.0015 & 0.0106 & 0.0697 \\
1.0103 & 0.0121 & -0.0088 & -0.0071 & -0.0688 & 0.0019 & 0.0103 \\
-0.0333 & -0.0398 & 0.0034 & -1.0165 & -0.0057 & -0.0130 & -0.0106 \\
-0.0740 & 0.0320 & -0.0114 & 0.0392 & -1.0097 & -0.0138 & -0.0378 \\
-0.0179 & 1.0062 & 0.0085 & -0.0614 & 0.0200 & 0.0430 & -0.0017 \\
-0.0232 & 0.0575 & -0.0059 & 0.0466 & -0.0407 & -1.0112 & 0.0111
\end{pmatrix}. \]
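The seven synthetic sources described above can be generated and their kurtosis signs checked as follows. This is a sketch under stated assumptions: the second argument of $N(\cdot, \cdot)$ is read as a variance, the Laplacian, exponential and uniform distributions use NumPy's default parameters (the kurtosis after normalization does not depend on them), and the sample size is larger than the 1000 used in the paper so that the estimated signs are stable.

```python
import numpy as np

def excess_kurtosis(x):
    """Kurtosis of Eq. (2) after normalizing x to zero mean and unit variance."""
    x = (x - x.mean()) / x.std()
    return np.mean(x**4) - 3.0

def gaussian_mixture(rng, means, variances, size):
    """Equal-weight Gaussian mixture; N(mu, v) is read as mean mu, variance v (assumption)."""
    comp = rng.integers(len(means), size=size)
    return rng.normal(np.asarray(means)[comp], np.sqrt(np.asarray(variances))[comp])

def make_sources(rng, size):
    """Sources (a)-(g) of the seven-source experiment, before normalization."""
    return {
        "a_laplacian": rng.laplace(size=size),
        "b_exponential": rng.exponential(size=size),
        "c_uniform": rng.uniform(size=size),
        "d_beta22": rng.beta(2.0, 2.0, size=size),
        "e_bimodal": gaussian_mixture(rng, [-1.5, 1.5], [0.25, 0.25], size),
        "f_unimodal": gaussian_mixture(rng, [0.0, 0.0], [0.25, 2.25], size),
        "g_trimodal": gaussian_mixture(rng, [-2.0, 0.0, 2.0], [0.25, 0.25, 0.25], size),
    }

rng = np.random.default_rng(0)
for name, x in make_sources(rng, 50_000).items():
    k = excess_kurtosis(x)
    print(f"{name:14s} kurt = {k:+.2f}  ({'super' if k > 0 else 'sub'}-Gaussian)")
```

Under these assumptions the signs should agree with the text: (a), (b) and (f) positive, the other four negative.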

According to $R$, the performance index of the kurtosis switching algorithm was 2.0003. In the same situation, the FastICA algorithm's performance index was 1.9542 when using “power 3” as the nonlinearity, and 1.3905 when using “tanh”. The Extended Infomax algorithm, however, did not separate all the sources, with a performance index of 15.4736. Therefore, in this complicated case with seven sources, the kurtosis switching algorithm achieved a separation result almost as good as that of the FastICA algorithm, though it required more steps to converge, and outperformed the Extended Infomax algorithm. Moreover, this experimental result suggests that our theoretical results on the kurtosis-sum objective function may extend to cases with more than two sources. Besides the two demonstrations above, we have conducted many simulations with various types of signal sources. All the experimental results conformed to the theoretical analysis and no spurious solutions were encountered.
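The performance index is defined in [4] rather than in this text. The sketch below assumes the commonly used unnormalized Amari-type form, which sums, over all rows and all columns of $|R|$, the ratio of each entry to the largest entry of its row or column, minus one per row and per column. That this is the exact index used by the authors is an assumption, although it does reproduce the value 0.1024 reported for the two-source $R$.

```python
import numpy as np

def performance_index(R):
    """Amari-type index: zero if and only if R is a scaled permutation matrix."""
    P = np.abs(np.asarray(R, dtype=float))
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return rows.sum() + cols.sum()

# R from the two-source audio experiment.
R2 = np.array([[1.0021, -0.0280],
               [-0.0233, -1.0020]])
print(round(performance_index(R2), 4))   # 0.1024
```

Smaller values indicate that $R$ is closer to a permutation matrix with scaling, which is why the index is a convenient single-number summary of separation quality.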

5 Conclusions

We have investigated the ICA problem through the kurtosis-sum objective function, which is just the sum of the absolute kurtosis values of the estimated components. We prove that in the two-source case, the maxima of this kurtosis-sum objective function all correspond to feasible solutions of the ICA problem, as long as the sources have non-zero kurtosis. Moreover, in order to maximize the kurtosis-sum objective function, a kurtosis switching algorithm is constructed. The experimental results show that the kurtosis-sum objective function works well for solving the ICA problem and that, apart from the convergence speed, the kurtosis switching algorithm can arrive at a solution as good as that of the FastICA algorithm.

Acknowledgements. This work was supported by the Ph.D. Programs Foundation of Ministry of Education of China for grant 20070001042.


References

1. Comon, P.: Independent Component Analysis – a New Concept? Signal Processing 36, 287–314 (1994)
2. Cardoso, J.F.: Infomax and Maximum Likelihood for Blind Source Separation. IEEE Signal Processing Letters 4, 112–114 (1997)
3. Bell, A., Sejnowski, T.: An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7, 1129–1159 (1995)
4. Amari, S.I., Cichocki, A., Yang, H.: A New Learning Algorithm for Blind Separation of Sources. Advances in Neural Information Processing 8, 757–763 (1996)
5. Xu, L., Cheung, C.C., Amari, S.I.: Learned Parametric Mixture Based ICA Algorithm. Neurocomputing 22, 69–80 (1998)
6. Liu, Z.Y., Chiu, K.C., Xu, L.: One-Bit-Matching Conjecture for Independent Component Analysis. Neural Computation 16, 383–399 (2004)
7. Ma, J., Liu, Z.Y., Xu, L.: A Further Result on the ICA One-Bit-Matching Conjecture. Neural Computation 17, 331–334 (2005)
8. Ma, J., Chen, Z., Amari, S.I.: Analysis of Feasible Solutions of the ICA Problem under the One-Bit-Matching Condition. In: Rosca, J.P., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 838–845. Springer, Heidelberg (2006)
9. Ma, J., Gao, D., Ge, F., Amari, S.: A One-Bit-Matching Learning Algorithm for Independent Component Analysis. In: Rosca, J.P., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 173–180. Springer, Heidelberg (2006)
10. Vrins, F., Verleysen, M.: Information Theoretic Versus Cumulant-based Contrasts for Multimodal Source Separation. IEEE Signal Processing Letters 12, 190–193 (2005)
11. Lee, T.W., Girolami, M., Sejnowski, T.J.: Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources. Neural Computation 11, 417–441 (1999)
12. Zhang, L., Cichocki, A., Amari, S.I.: Self-Adaptive Blind Source Separation Based on Activation Function Adaptation. IEEE Trans. Neural Networks 15, 233–243 (2004)
13. Ma, J., Ge, F., Gao, D.: Two Adaptive Matching Learning Algorithms for Independent Component Analysis. In: Hao, Y., Liu, J., Wang, Y.-P., Cheung, Y.-m., Yin, H., Jiao, L., Ma, J., Jiao, Y.-C. (eds.) CIS 2005. LNCS (LNAI), vol. 3801, pp. 915–920. Springer, Heidelberg (2005)
14. Welling, M., Weber, M.: A Constrained EM Algorithm for Independent Component Analysis. Neural Computation 13, 677–689 (2001)
15. Boscolo, R., Pan, H., Roychowdhury, V.P.: Independent Component Analysis Based on Nonparametric Density Estimation. IEEE Trans. Neural Networks 15, 55–64 (2004)
16. Amari, S.I., Chen, T.P., Cichocki, A.: Stability Analysis of Learning Algorithms for Blind Source Separation. Neural Networks 10, 1345–1351 (1997)
17. Cardoso, J.F., Laheld, B.: Equivariant Adaptive Source Separation. IEEE Trans. Signal Processing 44, 3017–3030 (1996)
18. Delfosse, N., Loubaton, P.: Adaptive Blind Separation of Independent Sources: a Deflation Approach. Signal Processing 45, 59–83 (1995)
19. Hyvärinen, A., Oja, E.: A Fast Fixed-point Algorithm for Independent Component Analysis. Neural Computation 9, 1483–1492 (1997)
20. Hyvärinen, A.: Fast and Robust Fixed-point Algorithms for Independent Component Analysis. IEEE Trans. Neural Networks 10, 626–634 (1999)

BYY Harmony Learning on Weibull Mixture with Automated Model Selection

Zhijie Ren and Jinwen Ma

Department of Information Science, School of Mathematical Sciences and LMAM, Peking University, Beijing, 100871, China
[email protected]

Abstract. Bayesian Ying-Yang (BYY) harmony learning has provided a new learning mechanism to implement automated model selection on finite mixtures during parameter learning with a set of sample data. In this paper, two kinds of BYY harmony learning algorithms, called the batch-way gradient learning algorithm and the simulated annealing learning algorithm, respectively, are proposed for Weibull mixture modeling. They are based on the maximization of the harmony function on two different architectures of the BYY learning system related to the Weibull mixture, such that model selection can be made automatically during parameter learning on the Weibull mixture. The two proposed algorithms are both demonstrated well by simulation experiments on some typical sample data sets with a certain degree of overlap.

Keywords: Bayesian Ying-Yang (BYY) harmony learning, Weibull mixture, Automated model selection, Parameter learning, Simulated annealing.

1 Introduction

Weibull mixture is a leading model in the field of reliability. There have been several statistical methods to solve the problem of parameter learning or estimation on the Weibull mixture model, such as maximum likelihood estimation, graphical estimation and the EM algorithm. However, these methods usually assume that the number $k$ of components in the mixture is known in advance. If this number is unknown, it can be selected according to Akaike's information criterion [1] or its extensions [2,3]. However, this conventional approach involves a large computational cost, since the entire process of parameter estimation has to be repeated for a number of different choices of $k$. Since $k$ is just the scale of the Weibull mixture model, its selection is essentially a model selection problem for Weibull mixture modeling.

The Bayesian Ying-Yang (BYY) harmony learning system and theory, proposed in 1995 in [4] and developed subsequently in [5,6,7], has provided a new efficient tool to solve the compound problem of model selection and parameter learning on the finite mixture model. In fact, by maximizing a harmony function on a certain BYY learning system related to finite mixture, model selection can be made automatically during parameter learning for Gaussian mixture either on a BI-architecture via some gradient-type and fixed-point learning algorithms [8,9,10,11] or on a B-architecture via the BYY annealing learning algorithm [12]. Recently, this BYY harmony learning approach has also been applied to Poisson mixture modeling [13].

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 589–599, 2008. © Springer-Verlag Berlin Heidelberg 2008

In this paper, we extend the BYY harmony learning mechanism of parameter learning with automated model selection to Weibull mixture. We consider the two-parameter Weibull model, which is by far the most widely used probability distribution for life data analysis. Its probability density function (pdf) takes the following explicit expression (refer to [14]):

\[ f(x) = \frac{a x^{a-1}}{b^a} \exp[-(x/b)^a], \quad a, b > 0, \tag{1} \]

where $a$ is the shape parameter and $b$ is the scale parameter. If a population consists of $k$ sub-populations with the pdfs $f_1(x), \ldots, f_k(x)$, being linearly mixed with the proportions $p_1 (\ge 0), \ldots, p_k (\ge 0)$, respectively, under the constraint that $p_1 + \ldots + p_k = 1$, then the pdf of the population takes the following form:

\[ f(x) = p_1 f_1(x) + \ldots + p_k f_k(x), \tag{2} \]

which is considered as the general form of the finite mixture model. $f(x)$ in Eq. (2) is referred to as a Weibull mixture if each $f_i(x)$ is a Weibull probability distribution. In this paper, under a BI-architecture of the BYY learning system for Weibull mixture, a batch-way gradient learning algorithm is constructed to achieve the parameter learning or estimation of Weibull mixture with automated model selection. Moreover, under a B-architecture of the BYY learning system for Weibull mixture, a simulated annealing learning algorithm is also constructed for the same purpose.
The simulation experiments demonstrate that the two proposed BYY learning algorithms can make model selection automatically during the parameter learning on the sample data, as long as the actual Weibull components in the original mixture are separated to a certain degree.
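The Weibull pdf of Eq. (1) and the mixture of Eq. (2) translate directly into code. The following sketch evaluates both and draws samples from a mixture; the three-component parameter values are made up for illustration and are not the ones used in the experiments.

```python
import numpy as np

def weibull_pdf(x, a, b):
    """Two-parameter Weibull pdf of Eq. (1): shape a > 0, scale b > 0, x > 0."""
    x = np.asarray(x, dtype=float)
    return (a / b) * (x / b) ** (a - 1) * np.exp(-((x / b) ** a))

def mixture_pdf(x, p, a, b):
    """Finite mixture of Eq. (2) with proportions p and Weibull components."""
    return sum(pj * weibull_pdf(x, aj, bj) for pj, aj, bj in zip(p, a, b))

def sample_mixture(rng, n, p, a, b):
    """Draw n samples: pick a component by p, then scale a standard Weibull(a_j) by b_j."""
    comp = rng.choice(len(p), size=n, p=p)
    return np.asarray(b)[comp] * rng.weibull(np.asarray(a)[comp])

# Hypothetical three-component mixture (parameter values are illustrative only).
p, a, b = [0.3, 0.3, 0.4], [2.0, 4.0, 6.0], [1.0, 5.0, 12.0]
x = np.linspace(1e-6, 30.0, 200_001)
y = mixture_pdf(x, p, a, b)
area = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))   # trapezoidal rule
print(round(area, 4))
```

The numerical integral should be very close to 1, which is a quick check that the density is normalized and that the proportions sum to one.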

2 BYY Learning System for Weibull Mixture and Proposed Learning Algorithms

A BYY system describes each observation $x \in X \subset \mathbb{R}^n$ and its corresponding inner representation $y \in Y \subset \mathbb{R}^m$ via the two types of Bayesian decomposition of the joint density, $p(x, y) = p(x)p(y|x)$ and $q(x, y) = q(x|y)q(y)$, which are named the Yang machine and the Ying machine, respectively. Given a data set $D_x = \{x_t\}_{t=1}^{N}$, the learning task of a BYY system is to ascertain all the components of $p(y|x)$, $p(x)$, $q(x|y)$, $q(y)$ with a harmony learning mechanism which is implemented by maximizing the functional

\[ H(p \| q) = \int p(y|x) p(x) \ln[q(x|y) q(y)] \, dx \, dy - \ln z_q, \tag{3} \]

where $z_q$ is a regularization term. Here, we neglect this term, i.e., let $z_q = 1$.

2.1 BI-Architecture of BYY Learning System

The BYY system is said to have a BI-architecture if $p(y|x)$ and $q(x|y)$ are both parametric. That is, $p(y|x)$ and $q(x|y)$ are both from a family of probability densities with a parameter $\theta$. We use the following BI-architecture of the BYY system for the Weibull mixture. The inner representation $y$ is discrete, i.e., $y \in \{1, 2, \ldots, k\} \subset \mathbb{R}$, and $q(y = j) = \alpha_j \ge 0$ with $\sum_{j=1}^{k} \alpha_j = 1$. $p(x)$ is specified by the empirical density $p_0(x) = \frac{1}{N} \sum_{t=1}^{N} G(x - x_t)$, where $x \in \mathbb{R}$ and $G(\cdot)$ is a kind of kernel function, and the Yang path is given by the following form:

\[ p(y = j | x) = \frac{\alpha_j q(x|\theta_j)}{q(x|\Theta_k)}, \qquad q(x|\Theta_k) = \sum_{j=1}^{k} \alpha_j q(x|\theta_j), \tag{4} \]

where $q(x|\theta_j) = q(x|y = j)$, and $\Theta_k = \{\alpha_j, \theta_j\}_{j=1}^{k}$ denotes the set of parameters. Putting all these component densities into Eq. (3) and letting the kernel function approach the delta function $\delta(x)$, the harmony functional $H(p \| q)$ is transformed into the following harmony function:

\[ J(\Theta_k) = \frac{1}{N} \sum_{t=1}^{N} \sum_{j=1}^{k} \frac{\alpha_j q(x_t|\theta_j)}{\sum_{i=1}^{k} \alpha_i q(x_t|\theta_i)} \ln[\alpha_j q(x_t|\theta_j)], \tag{5} \]

where $q(x_t|\theta_j)$ is the two-parameter Weibull pdf, and $\theta_j = \{a_j, b_j\}$.

2.2 B-Architecture of BYY Learning System

If $q(x|y)$ is parametric and $p(y|x)$ is free to be determined by learning, the BYY system is said to have a B-architecture. For the Weibull mixture, we use the following B-architecture of the BYY system. The inner representation $y$, $q(y = j)$ and $p(x)$ are defined as in the BI-architecture, and the regularization term $z_q$ is again ignored. Moreover, $p(y|x)$ is a probability distribution that is free to be determined under the general constraints $p(j|x) \ge 0$, $\sum_{j=1}^{k} p(j|x) = 1$. In the same way, we get the following harmony function:

\[ J(\Theta_k) = \frac{1}{N} \sum_{t=1}^{N} \sum_{j=1}^{k} p(j|x_t) \ln[\alpha_j q(x_t|a_j, b_j)], \tag{6} \]

where $\Theta_k = \{\Theta_1, \Theta_2\}$, $\Theta_1 = \{p(j|x_t)\}_{j=1,t=1}^{k,N}$ and $\Theta_2 = \{\alpha_j, a_j, b_j\}_{j=1}^{k}$.

2.3 Batch-Way Gradient BYY Learning Algorithm

To get rid of the constraints on $\alpha_j$, we utilize the transformation $\alpha_j = \exp(\beta_j) / \sum_{i=1}^{k} \exp(\beta_i)$ for each $j$, where $-\infty < \beta_1, \ldots, \beta_k < +\infty$. After such a transformation, the parameters of the harmony function $J(\Theta_k)$ given by Eq. (5) are essentially $\{\beta_j, \theta_j\}_{j=1}^{k}$, $\theta_j = \{a_j, b_j\}$.
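The harmony function of Eq. (5), with the softmax parameterization of $\alpha_j$ just described, can be evaluated as in the sketch below. It works with logarithms for numerical stability, which is an implementation choice made here; the data and parameter values are illustrative only.

```python
import numpy as np

def weibull_logpdf(x, a, b):
    """Log of the two-parameter Weibull pdf, evaluated for x > 0."""
    return np.log(a) - np.log(b) + (a - 1) * (np.log(x) - np.log(b)) - (x / b) ** a

def harmony(x, beta, a, b):
    """Harmony function of Eq. (5) with alpha = softmax(beta)."""
    x = np.asarray(x, dtype=float)[:, None]             # shape (N, 1)
    beta, a, b = (np.asarray(v, dtype=float) for v in (beta, a, b))
    log_alpha = beta - beta.max()
    log_alpha = log_alpha - np.log(np.exp(log_alpha).sum())
    log_U = log_alpha + weibull_logpdf(x, a, b)          # ln[alpha_j q(x_t|theta_j)], (N, k)
    shifted = log_U - log_U.max(axis=1, keepdims=True)   # stabilize the exponentials
    post = np.exp(shifted)
    post /= post.sum(axis=1, keepdims=True)              # p(j|x_t) of Eq. (4)
    return np.mean(np.sum(post * log_U, axis=1))

# Illustrative data: two well-separated Weibull components (values are made up).
rng = np.random.default_rng(0)
x = np.concatenate([1.0 * rng.weibull(2.0, 500), 10.0 * rng.weibull(6.0, 500)])
print(harmony(x, beta=[0.0, 0.0], a=[2.0, 6.0], b=[1.0, 10.0]))   # generating parameters
print(harmony(x, beta=[0.0, 0.0], a=[2.0, 6.0], b=[3.0, 4.0]))    # mismatched scales
```

The harmony value at the generating parameters should be clearly larger than at the mismatched ones, which is the property a gradient ascent on $J(\Theta_k)$ relies on.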


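By computing the derivatives of $J(\Theta_k)$ with respect to $\beta_j$, $a_j$ and $b_j$, we can obtain the batch-way gradient learning algorithm for the Weibull mixture modeling. Its update rule is given as follows:

\[ \Delta a_j = \frac{\eta}{N} \sum_{t=1}^{N} p(j|x_t) \lambda_j(t) \left( \frac{1}{a_j} + \ln\frac{x_t}{b_j} \Big( 1 - \Big(\frac{x_t}{b_j}\Big)^{a_j} \Big) \right), \tag{7} \]

\[ \Delta b_j = \frac{\eta}{N} \sum_{t=1}^{N} p(j|x_t) \lambda_j(t) \left( -\frac{a_j}{b_j} \Big( 1 - \Big(\frac{x_t}{b_j}\Big)^{a_j} \Big) \right), \tag{8} \]

\[ \Delta \beta_j = \frac{\eta}{N} \sum_{t=1}^{N} \frac{1}{q(x_t|\Theta_k)} \sum_{i=1}^{k} \lambda_i(t) (\delta_{ij} - \alpha_j) U_i(x_t), \tag{9} \]

where $\eta > 0$ is the learning rate, which can be selected by experience, $U_j(x) = \alpha_j q(x|\theta_j)$, $\lambda_j(t) = 1 - \sum_{l=1}^{k} (p(l|x_t) - \delta_{jl}) \ln U_l(x_t)$, $j = 1, 2, \ldots, k$, and $\delta_{ij}$ is the Kronecker function.

2.4 Simulated Annealing Learning Algorithm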

Because the maximization of Eq. (6) is a discrete optimization problem, it is very easy to be trapped in a local maximum. To solve this local maximum problem, we employ a simulated annealing BYY harmony learning algorithm and leave the details to Ref. [12]. We consider

\[ L_\lambda(\Theta_k) = J(\Theta_k) + \lambda O_N(p(y|x)), \tag{10} \]

where

\[ O_N(p(y|x)) = -\frac{1}{N} \sum_{t=1}^{N} \sum_{j=1}^{k} p(j|x_t) \ln p(j|x_t), \tag{11} \]

and $\lambda \ge 0$. If we let $\lambda \to 0$ from $\lambda_0 = 1$ appropriately in a simulated annealing procedure, the maximum of $L_\lambda(\Theta_k)$ will correspond to the global maximum of $J(\Theta_k)$ with a high probability. In view of $\max_{\Theta_k} L_\lambda(\Theta_k) = \max_{\Theta_1, \Theta_2} L_\lambda(\Theta_1, \Theta_2)$, the maximization of $L_\lambda(\Theta_k)$ can be carried out by an alternating maximization procedure:

Step 1: Fix $\Theta_2 = \Theta_2^{old}$, get $\Theta_1^{new} = \arg\max_{\Theta_1} L_\lambda(\Theta_1, \Theta_2)$.
Step 2: Fix $\Theta_1 = \Theta_1^{old}$, get $\Theta_2^{new} = \arg\max_{\Theta_2} L_\lambda(\Theta_1, \Theta_2)$.

When $\lambda$ is fixed, this iterative procedure does not stop until $L_\lambda(\Theta_k)$ converges to a local maximum. Furthermore, we can solve for $\Theta_1^{new}$ and $\Theta_2^{new}$ as follows. On the one hand, we fix $\Theta_2$ and maximize over $\Theta_1$, which yields the unique solution

\[ p(j|x_t) = \frac{[\alpha_j q(x_t|a_j, b_j)]^{1/\lambda}}{\sum_{i=1}^{k} [\alpha_i q(x_t|a_i, b_i)]^{1/\lambda}}, \quad t = 1, \ldots, N;\ j = 1, \ldots, k. \tag{12} \]


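On the other hand, we fix $\Theta_1$ and maximize over $\Theta_2$. By the method of Lagrange multipliers, we obtain the following equations for $a_j$ and $b_j$, together with a unique solution for $\alpha_j$, for $j = 1, \ldots, k$:

$$\frac{1}{N}\sum_{t=1}^{N} p(j|x_t)\left[\frac{1}{a_j} + \ln\frac{x_t}{b_j} - \left(\frac{x_t}{b_j}\right)^{a_j}\ln\frac{x_t}{b_j}\right] = 0, \tag{13}$$

$$\frac{1}{N}\sum_{t=1}^{N} p(j|x_t)\left(-\frac{a_j}{b_j} + \frac{a_j x_t^{a_j}}{b_j^{a_j+1}}\right) = 0, \tag{14}$$

$$\hat{\alpha}_j = \frac{1}{N}\sum_{t=1}^{N} p(j|x_t). \tag{15}$$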

From Eqs.(13) and (14), we can obtain an approximate solution $\hat{a}_j$, $\hat{b}_j$ with the help of numerical tools. With the above derivation, we have constructed an alternating optimization algorithm for maximizing $L_\lambda(\Theta_k)$. Furthermore, if $\lambda$ is attenuated appropriately over a sufficiently long time, this alternating maximization algorithm anneals toward the global maximum of $J(\Theta_k)$, and thus automated model selection with parameter estimation can be implemented.
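The two half-steps of the alternating procedure can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation. Eq.(14) gives $b_j^{a_j} = \sum_t p(j|x_t)x_t^{a_j} / \sum_t p(j|x_t)$; substituting this into Eq.(13) leaves a single equation in $a_j$, which is solved here by bisection. The function names, the bracket `[lo, hi]` and the iteration count are hypothetical choices.

```python
import math
import random

def weibull_logpdf(x, a, b):
    """log of the Weibull density (a/b) (x/b)^(a-1) exp(-(x/b)^a), for x > 0."""
    z = x / b
    return math.log(a / b) + (a - 1.0) * math.log(z) - z ** a

def tempered_posterior(x, alphas, params, lam):
    """Eq. (12): p(j|x) proportional to [alpha_j q(x|a_j, b_j)]^(1/lam)."""
    logs = [(math.log(al) + weibull_logpdf(x, a, b)) / lam
            for al, (a, b) in zip(alphas, params)]
    top = max(logs)  # shift by the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def update_alpha(posteriors_j):
    """Eq. (15): alpha_j is the average of p(j|x_t) over the samples."""
    return sum(posteriors_j) / len(posteriors_j)

def solve_shape_scale(xs, ps, lo=1e-3, hi=200.0, iters=60):
    """Solve Eqs. (13)-(14) for one component, with weights ps[t] = p(j|x_t)."""
    sw = sum(ps)
    mean_log = sum(p * math.log(x) for p, x in zip(ps, xs)) / sw
    xmax = max(xs)

    def g(a):
        # Eq. (13) after eliminating b via Eq. (14); (x/xmax)^a keeps the sums bounded.
        num = sum(p * (x / xmax) ** a * math.log(x) for p, x in zip(ps, xs))
        den = sum(p * (x / xmax) ** a for p, x in zip(ps, xs))
        return num / den - mean_log - 1.0 / a

    for _ in range(iters):  # g is increasing in a, so bisection applies
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    a = 0.5 * (lo + hi)
    b = xmax * (sum(p * (x / xmax) ** a for p, x in zip(ps, xs)) / sw) ** (1.0 / a)
    return a, b

# Usage: with all weights equal to 1 this reduces to an ordinary Weibull fit.
rng = random.Random(0)
xs = [rng.weibullvariate(20.0, 4.0) for _ in range(2000)]  # arguments are (scale, shape)
a_hat, b_hat = solve_shape_scale(xs, [1.0] * len(xs))
print(a_hat, b_hat)
```

As $\lambda$ decreases below 1, the exponent $1/\lambda$ in `tempered_posterior` sharpens the posterior toward a hard assignment.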

3 Experimental Results

In this section, several simulation experiments are conducted to demonstrate the performance of the batch-way gradient learning algorithm and the simulated annealing learning algorithm for both model selection and parameter estimation on sample data sets from typical Weibull mixtures. Moreover, we compare the learning efficiency of the two proposed algorithms. For feasibility of the implementation, we only consider the case $a > 1$ in our experiments.

3.1 Sample Data Sets and Initialization of the Parameters

We begin with a description of the four sets of sample data used in our experiments. We conducted four Monte Carlo experiments in which samples are drawn from a mixture of three or four Weibull distributions; the four data sets are shown in Figs. 1-4, respectively. In order to clearly distinguish the samples from different Weibull components in the figures, we represent the samples of each component with a different symbol, defined in the upper-right corner of each figure. The x-coordinate of a point is the numerical value of the sample, while the y-coordinate is an artificial value that is the same for all points of one component and changes with the component only for ease of observation. The true values of the parameters of the Weibull mixtures used to generate the four sample data sets are given in Table 1, where $a_j$, $b_j$, $\alpha_j$ and $N_j$ denote the shape parameter, scale parameter, mixing proportion and number of samples of the $j$-th Weibull density, respectively.

Fig. 1. The First Sample Data Set S1

Fig. 2. The Second Sample Data Set S2

Fig. 3. The Third Sample Data Set S3

Fig. 4. The Fourth Sample Data Set S4

For analysis, we define the degree of overlap between two components (i.e., Weibull distributions) in a sample data set by

$$O_p = \frac{1}{n}\sum_{t=1}^{n} h_1(x_t)h_2(x_t), \qquad h_j(x_t) = \frac{\alpha_j p(j|x_t)}{\alpha_1 p(1|x_t) + \alpha_2 p(2|x_t)}, \quad j = 1, 2. \tag{16}$$
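A data set such as S1 can be drawn with the standard library, and the overlap degree of Eq.(16) evaluated on it. The sketch below uses the S1 parameters from Table 1. It rests on two assumptions that the text does not settle: $p(j|x_t)$ in Eq.(16) is read as the density of component $j$ at $x_t$, and the average is taken over the samples of the two components being compared. The result is therefore not claimed to reproduce Table 2.

```python
import math
import random

def weibull_pdf(x, a, b):
    """Weibull density with shape a and scale b."""
    z = x / b
    return (a / b) * z ** (a - 1.0) * math.exp(-(z ** a))

def draw_component(n, a, b, rng):
    # random.weibullvariate takes (scale, shape), i.e. (b, a) in the paper's notation
    return [rng.weibullvariate(b, a) for _ in range(n)]

def overlap_degree(xs, comp1, comp2):
    """Eq. (16), with p(j|x_t) read as the component density (an assumption)."""
    (al1, a1, b1), (al2, a2, b2) = comp1, comp2
    total = 0.0
    for x in xs:
        u1 = al1 * weibull_pdf(x, a1, b1)
        u2 = al2 * weibull_pdf(x, a2, b2)
        s = u1 + u2
        if s > 0.0:  # skip points where both densities underflow to zero
            total += (u1 / s) * (u2 / s)
    return total / len(xs)

rng = random.Random(0)
# S1 from Table 1, one tuple (alpha_j, a_j, b_j, N_j) per component
s1 = [(0.25, 2.0, 2.0, 300), (0.35, 4.0, 20.0, 420), (0.40, 10.0, 40.0, 480)]
samples = [draw_component(n, a, b, rng) for (_, a, b, n) in s1]
pair = samples[0] + samples[1]
print(overlap_degree(pair, s1[0][:3], s1[1][:3]))
```

Under this reading the measure lies between 0 (no sample is shared) and 0.25 (the two weighted densities coincide everywhere).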

Table 2 lists the degrees of overlap between adjacent components in each of the four sample data sets. We further discuss the initialization of the parameters in the algorithms. In order to make model selection automatically, we should select $k$ to be larger than the true number $k^*$ of components in the sample data set. However, a larger $k$ may increase the running time and the risk of selecting a wrong model, so we will give an appropriate range for the initial value of $k$. The initial value of $\beta_j$ can be freely chosen from some interval for the BYY annealing algorithm, and the batch-way gradient learning algorithm converges more efficiently when the initial values of these $\beta_j$ are equal or close. In our simulation experiments, $a_j$ and $b_j$ are initialized by means of the Weibull transformation deduced in [14]. For the BYY annealing learning algorithm, $\{p(y = j|x_t),\ j = 1, \ldots, k,\ t = 1, \ldots, N\}$ can be initialized randomly.

Table 1. The Parameters of the Original Weibull Mixtures Used to Generate the Four Sample Data Sets

| The sample set | Weibulls | $a_j$ | $b_j$ | $\alpha_j$ | $N_j$ |
|---|---|---|---|---|---|
| S1 (N = 1200) | Weibull1 | 2 | 2 | 0.25 | 300 |
| | Weibull2 | 4 | 20 | 0.35 | 420 |
| | Weibull3 | 10 | 40 | 0.40 | 480 |
| S2 (N = 200) | Weibull1 | 2 | 2 | 0.175 | 35 |
| | Weibull2 | 4 | 15 | 0.35 | 70 |
| | Weibull3 | 12 | 35 | 0.225 | 45 |
| | Weibull4 | 15 | 65 | 0.25 | 50 |
| S3 (N = 1200) | Weibull1 | 2 | 2 | 0.25 | 300 |
| | Weibull2 | 6 | 20 | 0.25 | 300 |
| | Weibull3 | 10 | 50 | 0.25 | 300 |
| | Weibull4 | 20 | 80 | 0.25 | 300 |
| S4 (N = 1200) | Weibull1 | 2 | 2 | 0.25 | 300 |
| | Weibull2 | 4 | 10 | 0.25 | 300 |
| | Weibull3 | 6 | 20 | 0.25 | 300 |
| | Weibull4 | 8 | 35 | 0.25 | 300 |

Table 2. The Degrees of Overlap between any Two Components in Each of the Four Sample Data Sets

The sample set: S1 ($k^* = 3$), S2 ($k^* = 4$), S3 ($k^* = 4$), S4 ($k^* = 4$)
Overlapping degree of adjacent clusters: 0.0021 0.0038 0.0001 0.0168 0.0214 0.0088 0.0008 0.0014 0.0034 0.0420 0.0484

3.2 Simulation Results for Model Selection and Parameter Estimation

Firstly, we implemented the batch-way gradient algorithm on each of the four sample data sets S1-S4. The stopping criterion of the algorithm is $|J_{new} - J_{old}| < 10^{-7}$. The experimental results on S1-S3 are given in Table 3; they are all successful on both model selection and parameter estimation. However, the automated model selection on the sample set S4 failed. When the stopping criterion was satisfied, there were five active components in the resulting Weibull mixture, which does not agree with the original Weibull mixture. The reason for this failure might be that the degrees of overlap between some adjacent components in S4 are quite high.

Table 3. The Experimental Results of the Batch-way Gradient Learning Algorithm

| The sample set | Weibulls | $\hat{a}_j$ | $\hat{b}_j$ | $\hat{\alpha}_j$ |
|---|---|---|---|---|
| S1 (N = 1200) | Weibull1 | 1.9482 | 2.0529 | 0.2526 |
| | Weibull2 | 4.7094 | 20.1738 | 0.3533 |
| | Weibull3 | 10.9082 | 40.2776 | 0.3941 |
| S2 (N = 200) | Weibull1 | 2.9695 | 2.1548 | 0.1774 |
| | Weibull2 | 4.1712 | 16.2098 | 0.3637 |
| | Weibull3 | 12.3271 | 34.8763 | 0.2094 |
| | Weibull4 | 16.8998 | 65.1038 | 0.2495 |
| S3 (N = 1200) | Weibull1 | 1.9780 | 2.0325 | 0.2501 |
| | Weibull2 | 6.5847 | 20.0977 | 0.2510 |
| | Weibull3 | 10.2482 | 50.1388 | 0.2494 |
| | Weibull4 | 20.6729 | 80.1908 | 0.2496 |

We further implemented the simulated annealing learning algorithm on the four sample data sets. The stopping criterion is $|L_\lambda(\Theta_k^{new}) - L_\lambda(\Theta_k^{old})| < 10^{-7}$, and $\lambda$ is given by the expression $\lambda(t) = 1/(a(1 - \exp(-b(t - 1))) + c)$, where $t$ denotes the iteration number. In this case, $a = 500$, $b = \ln 10/10000$, $c = 0.5$. The experimental results of the simulated annealing algorithm on the four sample data sets are given in Table 4; they are all successful on both model selection and parameter estimation.

Table 4. The Experimental Results of the Simulated Annealing Learning Algorithm

| The sample set | Weibulls | $\hat{a}_j$ | $\hat{b}_j$ | $\hat{\alpha}_j$ |
|---|---|---|---|---|
| S1 (N = 1200) | Weibull1 | 1.9637 | 2.0358 | 0.2508 |
| | Weibull2 | 4.4712 | 20.0428 | 0.3478 |
| | Weibull3 | 10.2418 | 40.0965 | 0.4014 |
| S2 (N = 200) | Weibull1 | 2.9678 | 2.1428 | 0.1750 |
| | Weibull2 | 3.9992 | 16.1519 | 0.3650 |
| | Weibull3 | 12.0365 | 34.8157 | 0.2100 |
| | Weibull4 | 16.6213 | 65.0674 | 0.2500 |
| S3 (N = 1200) | Weibull1 | 1.9790 | 2.0312 | 0.2500 |
| | Weibull2 | 6.5456 | 20.0758 | 0.2500 |
| | Weibull3 | 10.0101 | 50.0399 | 0.2492 |
| | Weibull4 | 20.3964 | 80.1490 | 0.2508 |
| S4 (N = 1200) | Weibull1 | 1.8056 | 1.9616 | 0.2633 |
| | Weibull2 | 5.1810 | 9.8643 | 0.2442 |
| | Weibull3 | 7.3671 | 20.1510 | 0.2464 |
| | Weibull4 | 8.6626 | 35.5023 | 0.2461 |

Finally, we compare the performance of the batch-way gradient and simulated annealing learning algorithms through the following specific analysis. We begin by comparing the performance of the two algorithms on parameter estimation.

Table 5. $\Delta x$ of the Two Algorithms

| The sample data set | Learning algorithm | $\Delta\alpha$ | $\Delta a$ | $\Delta b$ |
|---|---|---|---|---|
| S1 (N = 1200) | BWG | 0.00041 | 0.0404 | 0.00082 |
| | SA | 0.00006 | 0.0148 | 0.00033 |
| S2 (N = 200) | BWG | 0.0065 | 0.2536 | 0.0125 |
| | SA | 0.0063 | 0.2459 | 0.0110 |
| S3 (N = 1200) | BWG | 0.00002 | 0.0114 | 0.0003 |
| | SA | 0.00002 | 0.0088 | 0.00026 |

Table 6. The Runtime Costs of the Two Algorithms

| The sample set | BWG | SA |
|---|---|---|
| S1 | 98.9210 | 40.1560 |
| S2 | 56.9680 | 5.6720 |
| S3 | 149.5940 | 13.5160 |

According to the experimental results on a sample data set, for each parameter $x$ we can compute $\bar{x}$, the ratio of the estimated parameters to the actual parameters, and then define $\Delta x = \|\bar{x} - 1\|^2$ to equivalently describe the mean-square error between the estimated parameter and the actual parameter. Thus, $\Delta x$ can be used as a criterion for evaluating the performance of a learning algorithm on parameter estimation. The values of $\Delta x$ for the batch-way gradient and simulated annealing algorithms on the first three sample data sets are given in Table 5, where $x$ represents a single parameter in the Weibull mixture, BWG represents the batch-way gradient learning algorithm, and SA represents the simulated annealing learning algorithm. It can be observed from Table 5 that both algorithms perform well on parameter estimation when the number of samples is relatively large, whereas the mean-square error becomes high for both algorithms when the number of samples is small. Moreover, the degree of overlap between the components in a sample data set also plays an important role in the parameter learning: Table 5 shows that the mean-square error is much lower if the degree of overlap is small enough. On the same sample sets, the mean-square errors of the simulated annealing learning algorithm are lower than those of the batch-way gradient learning algorithm, which is also confirmed by further experiments.

Secondly, we consider the range of the degree of overlap among the components in a sample data set for which the two proposed learning algorithms can be successful. It was found from the simulation experiments that the simulated annealing learning algorithm generally tolerates a larger range than the batch-way gradient algorithm does.

Thirdly, we compare the ranges from which the initial $k$ can be selected for the two algorithms. From the simulation experiments, it was found that the range of $k$ for the simulated annealing learning algorithm is $[k^*, 2k^* + 1]$, which is wider than the range $[k^*, 2k^* - 1]$ for the batch-way gradient learning algorithm.

Fourthly, we compare the runtime costs of the two algorithms. The runtime costs of the two algorithms on the sample data sets S1-S3 are listed in Table 6. It can be observed from Table 6 that the runtime of the batch-way gradient learning algorithm is always longer than that of the simulated annealing learning algorithm on these sample data sets.

From the above comparisons on the four aspects, the simulated annealing learning algorithm is better than the batch-way gradient learning algorithm not only on automated model selection but also on parameter estimation and runtime. Therefore, the simulated annealing learning algorithm is more efficient for the Weibull mixture modeling.
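The error measure used in Table 5 can be checked against the reported estimates. The sketch below reads $\Delta x$ as the squared Euclidean norm of $\bar{x} - 1$, with one ratio per component; this reading is an interpretation of the definition in the text, and the helper name `delta` is illustrative. With the S1 values from Table 1 and the batch-way gradient estimates from Table 3, it reproduces the BWG row for S1 in Table 5.

```python
def delta(estimated, actual):
    """Sum of squared deviations of the ratios estimated/actual from 1."""
    return sum((e / a - 1.0) ** 2 for e, a in zip(estimated, actual))

# S1: actual values from Table 1, batch-way gradient estimates from Table 3
actual_alpha, est_alpha = [0.25, 0.35, 0.40], [0.2526, 0.3533, 0.3941]
actual_a, est_a = [2.0, 4.0, 10.0], [1.9482, 4.7094, 10.9082]
actual_b, est_b = [2.0, 20.0, 40.0], [2.0529, 20.1738, 40.2776]

print(round(delta(est_alpha, actual_alpha), 5))  # 0.00041
print(round(delta(est_a, actual_a), 4))          # 0.0404
print(round(delta(est_b, actual_b), 5))          # 0.00082
```

Because the deviations are taken on ratios, the measure is scale-free, so errors in $a_j$, $b_j$ and $\alpha_j$ can be compared even though the parameters differ by orders of magnitude.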

4 Conclusions

After introducing the BYY learning system, the BI- and B-architectures, and the harmony function, we have established two BYY learning algorithms for Weibull mixtures with automated model selection: a batch-way gradient learning algorithm on the BI-architecture and a simulated annealing learning algorithm on the B-architecture. Both algorithms perform well on sample sets from Weibull mixtures with certain degrees of overlap. Moreover, we have compared the two algorithms from four aspects and found that the simulated annealing learning algorithm is more efficient for the Weibull mixture modeling than the batch-way gradient learning algorithm.

Acknowledgements. This work was supported by the Natural Science Foundation of China for grants 60771061 and 60471054.

References

1. Akaike, H.: A New Look at the Statistical Model Identification. IEEE Trans. Automatic Control AC-19, 716–723 (1974)
2. Bozdogan, H.: Model Selection and Akaike's Information Criterion: the General Theory and its Analytical Extensions. Psychometrika 52, 345–370 (1987)
3. Schwarz, G.: Estimating the Dimension of a Model. The Annals of Statistics 6, 461–464 (1978)
4. Xu, L.: Ying-Yang Machine: a Bayesian-Kullback Scheme for Unified Learnings and New Results on Vector Quantization. In: Proceedings of the 1995 International Conference on Neural Information Processing (ICONIP 1995), vol. 2, pp. 977–988 (1995)
5. Xu, L.: Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, Three-layer Nets and ME-RBF-SVM Models. International Journal of Neural Systems 11, 43–69 (2001)
6. Xu, L.: Ying-Yang Learning. In: Arbib, M.A. (ed.) The Handbook of Brain Theory and Neural Networks, 2nd edn., pp. 1231–1237. The MIT Press, Cambridge (2002)
7. Xu, L.: BYY Harmony Learning, Structural RPCL, and Topological Self-organizing on Mixture Models. Neural Networks 15, 1231–1237 (2002)
8. Ma, J., Wang, T., Xu, L.: A Gradient BYY Harmony Learning Rule on Gaussian Mixture with Automated Model Selection. Neurocomputing 56, 481–487 (2004)
9. Ma, J., Gao, B., Wang, Y., et al.: Conjugate and Natural Gradient Rules for BYY Harmony Learning on Gaussian Mixture with Automated Model Selection. International Journal of Pattern Recognition and Artificial Intelligence 19, 701–713 (2005)
10. Ma, J., Wang, L.: BYY Harmony Learning on Finite Mixture: Adaptive Gradient Implementation and a Floating RPCL Mechanism. Neural Processing Lett. 24(1), 19–40 (2006)
11. Ma, J., He, X.: A Fast Fixed-point BYY Harmony Learning Algorithm on Gaussian Mixture with Automated Model Selection. Pattern Recognition Letters 29(6), 701–711 (2008)
12. Ma, J., Liu, J.: The BYY Annealing Learning Algorithm for Gaussian Mixture with Automated Model Selection. Pattern Recognition 40, 2029–2037 (2007)
13. Liu, J., Ma, J.: An Adaptive Gradient BYY Learning Rule for Poisson Mixture with Automated Model Selection. In: Huang, D.-S., Heutte, L., Loog, M. (eds.) ICIC 2007. LNCS, vol. 4681, pp. 1059–1069. Springer, Heidelberg (2007)
14. Robert, B.A.: The New Weibull Handbook, 4th edn. North Palm Beach, Fla. (2000)

A BYY Split-and-Merge EM Algorithm for Gaussian Mixture Learning

Lei Li and Jinwen Ma⋆

Department of Information Science, School of Mathematical Sciences and LAMA, Peking University, Beijing, 100871, China
[email protected]

⋆ Corresponding author.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 600–609, 2008. © Springer-Verlag Berlin Heidelberg 2008

Abstract. Gaussian mixture is a powerful statistical tool and has been widely used in the fields of information processing and data analysis. However, its model selection, i.e., the selection of the number of Gaussians in the mixture, is still a difficult problem. Fortunately, the newly established Bayesian Ying-Yang (BYY) harmony function provides an efficient criterion for model selection on the Gaussian mixture modeling. In this paper, we propose a BYY split-and-merge EM algorithm for Gaussian mixture that maximizes the BYY harmony function by dynamically splitting or merging the unsuited Gaussians in the estimated mixture obtained from the EM algorithm at each step. The experiments demonstrate that this BYY split-and-merge EM algorithm can make both model selection and parameter estimation efficiently for the Gaussian mixture modeling.

Keywords: Bayesian Ying-Yang (BYY) harmony learning, Gaussian mixture, EM algorithm, Model selection, Parameter estimation.

1 Introduction

As a powerful statistical tool, Gaussian mixture has been widely used in the fields of information processing and data analysis. Generally, the parameters of a Gaussian mixture can be estimated by the expectation-maximization (EM) algorithm [1] under the maximum-likelihood framework. However, the EM algorithm not only suffers from the problem of local optima, but also converges to a wrong result when the number of Gaussians in the mixture is set incorrectly. Since the number of Gaussians is just the scale of the Gaussian mixture model, the selection of the number of Gaussians in the mixture is also referred to as model selection. Conventionally, we can choose a best number $k^*$ of Gaussians via some selection criterion, such as Akaike's information criterion (AIC) [2] and the Bayesian inference criterion [3]. However, these criteria have certain limitations and often lead to a wrong result. Moreover, this approach involves a large computational cost since the entire process of parameter estimation has to be repeated for a number of different choices of $k$.

In the past several years, with the development of the Bayesian Ying-Yang (BYY) harmony learning system and theory [4,5], a new kind of BYY harmony learning algorithms, such as the adaptive, conjugate, natural gradient, simulated annealing and fixed-point learning algorithms [6,7,8,9,10], have been established to make model selection automatically during the parameter learning. Although these new algorithms are quite efficient for both model selection and parameter estimation for the Gaussian mixture modeling, they rely on the assumption that $k$ is larger than the number of actual Gaussians in the sample data, but not by too much. Actually, if $k$ is much larger than the true number, these algorithms often converge to a wrong result. Nevertheless, how to overestimate the true number of Gaussians in the sample data in such a way is also a difficult problem.

In this paper, we propose a new kind of split-and-merge EM algorithm that gradually increases the harmony function at each step through a split or merge operation on the estimated mixture from the EM algorithm, and terminates at the maximum of the harmony function. Since the maximization of the harmony function corresponds to the correct model selection on the Gaussian mixture modeling [11], and the split-and-merge operation can escape from a local maximum of the likelihood function, the BYY split-and-merge EM algorithm can lead to a better solution for both model selection and parameter estimation.

The rest of the paper is organized as follows. In Section 2, we revisit the EM algorithm for Gaussian mixtures. We further introduce the BYY learning system and the harmony function in Section 3. In Section 4, we present the BYY split-and-merge EM algorithm. Several experiments on synthetic and real-world data sets, including a practical application of unsupervised color image segmentation, are conducted in Section 5 to demonstrate the efficiency of the proposed algorithm. Finally, we conclude briefly in Section 6.

2 The EM Algorithm for Gaussian Mixtures

The probability density of a Gaussian mixture of $k$ components in $\Re^d$ can be described as follows:

$$\Phi(x) = \sum_{i=1}^{k}\pi_i\,\phi(x|\theta_i), \quad \forall x \in \Re^d, \tag{1}$$

where $\phi(x|\theta_i)$ is a Gaussian probability density with the parameters $\theta_i = (m_i, \Sigma_i)$ ($m_i$ is the mean vector and $\Sigma_i$ is the covariance matrix, which is assumed positive definite) given by

$$\phi(x|\theta_i) = \phi(x|m_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(x-m_i)^T\Sigma_i^{-1}(x-m_i)}, \tag{2}$$

and $\pi_i \in [0, 1]$ $(i = 1, 2, \cdots, k)$ are the mixing proportions under the constraint $\sum_{i=1}^{k}\pi_i = 1$. If we encapsulate all the parameters into one vector $\Theta_k = (\pi_1, \pi_2, \ldots, \pi_k, \theta_1, \theta_2, \ldots, \theta_k)$, then, according to Eq.(1), the density of the Gaussian mixture can be rewritten as

$$\Phi(x|\Theta_k) = \sum_{i=1}^{k}\pi_i\,\phi(x|\theta_i) = \sum_{i=1}^{k}\pi_i\,\phi(x|m_i, \Sigma_i). \tag{3}$$


There are many learning algorithms for the Gaussian mixture modeling, of which the EM algorithm is probably the best known. By alternately implementing the E-step to estimate the probability distribution of the unobservable random variable and the M-step to increase the log-likelihood function, the EM algorithm finally leads to a local maximum of the log-likelihood function of the model. For the Gaussian mixture model, given a sample data set $S = \{x_1, x_2, \cdots, x_N\}$ as a special incomplete data set, the log-likelihood function can be expressed as follows:

$$\log p(S|\Theta_k) = \log\prod_{t=1}^{N}\Phi(x_t|\Theta_k) = \sum_{t=1}^{N}\log\sum_{i=1}^{k}\pi_i\,\phi(x_t|\theta_i), \tag{4}$$

which can be optimized iteratively via the EM algorithm as follows:

$$P(j|x_t) = \frac{\pi_j\,\phi(x_t|\theta_j)}{\sum_{i=1}^{k}\pi_i\,\phi(x_t|\theta_i)}, \tag{5}$$

$$\pi_j^{+} = \frac{1}{N}\sum_{t=1}^{N} P(j|x_t), \tag{6}$$

$$m_j^{+} = \frac{1}{\sum_{t=1}^{N}P(j|x_t)}\sum_{t=1}^{N} P(j|x_t)\,x_t, \tag{7}$$

$$\Sigma_j^{+} = \frac{1}{\sum_{t=1}^{N}P(j|x_t)}\sum_{t=1}^{N} P(j|x_t)\,(x_t - m_j^{+})(x_t - m_j^{+})^T. \tag{8}$$

Although the EM algorithm can have some good convergence properties in certain situations ([12,13,14]), it certainly has no ability to determine the proper number of the components for a sample data set because it is based on the maximization of the likelihood. In order to overcome this weakness, we will utilize the BYY harmony function as the criterion for the Gaussian mixture modeling.
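The EM iteration of Eqs.(5)-(8) can be sketched for the univariate case ($d = 1$), where the covariance matrix reduces to a scalar variance. The data, initial values and function names below are illustrative assumptions. The log-likelihood of Eq.(4) is recorded before every step to show that it does not decrease.

```python
import math
import random

def normal_pdf(x, m, var):
    return math.exp(-0.5 * (x - m) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(xs, pis, means, variances):
    """One EM iteration for a univariate Gaussian mixture, Eqs. (5)-(8) with d = 1."""
    k, n = len(pis), len(xs)
    post = []
    for x in xs:  # E-step, Eq. (5): posterior P(j|x_t)
        w = [pis[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        s = sum(w)
        post.append([v / s for v in w])
    new_pis, new_means, new_vars = [], [], []
    for j in range(k):  # M-step
        nj = sum(p[j] for p in post)
        mj = sum(p[j] * x for p, x in zip(post, xs)) / nj              # Eq. (7)
        vj = sum(p[j] * (x - mj) ** 2 for p, x in zip(post, xs)) / nj  # Eq. (8)
        new_pis.append(nj / n)                                          # Eq. (6)
        new_means.append(mj)
        new_vars.append(vj)
    return new_pis, new_means, new_vars

def log_likelihood(xs, pis, means, variances):
    """Eq. (4) for the univariate mixture."""
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(pis, means, variances))) for x in xs)

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(6.0, 1.0) for _ in range(300)]
pis, means, variances = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]
history = []
for _ in range(30):
    history.append(log_likelihood(xs, pis, means, variances))
    pis, means, variances = em_step(xs, pis, means, variances)
print(means)
```

The number of components is fixed throughout the iteration, so EM on its own cannot select $k$.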

3 BYY Learning System and Harmony Function

In a BYY learning system, each observation $x \in X \subset R^d$ and its corresponding inner representation $y \in Y \subset R^m$ are described with two types of Bayesian decomposition, $p(x, y) = p(x)p(y|x)$ and $q(x, y) = q(y)q(x|y)$, which are called the Yang machine and the Ying machine, respectively. For the Gaussian mixture modeling, $y$ is limited to be an integer in $Y = \{1, 2, \ldots, k\}$. With a sample data set $D_x = \{x_t\}_{t=1}^{N}$, the aim of the BYY learning system is to specify all the aspects of $p(y|x)$, $p(x)$, $q(x|y)$, $q(y)$ by maximizing the following harmony functional:

$$H(p\|q) = \int p(y|x)\,p(x)\ln[q(x|y)\,q(y)]\,dx\,dy - \ln z_q, \tag{9}$$

where $z_q$ is a regularization term and will often be neglected.

If both $p(y|x)$ and $q(x|y)$ are parametric, i.e., from a family of probability densities with a parameter $\theta \in R^d$, the BYY learning system is said to have a Bi-directional Architecture (BI-Architecture). For the Gaussian mixture modeling, we use the following specific BI-Architecture of the BYY learning system: $q(j) = \alpha_j$ ($\alpha_j \ge 0$ and $\sum_{j=1}^{k}\alpha_j = 1$) and $p(x) = \frac{1}{N}\sum_{t=1}^{N}\delta(x - x_t)$. Furthermore, the BI-architecture is constructed with the following parametric forms:

$$p(y = j|x) = \frac{\alpha_j\,q(x|\theta_j)}{q(x|\Theta_k)}, \qquad q(x|\Theta_k) = \sum_{j=1}^{k}\alpha_j\,q(x|\theta_j), \tag{10}$$

where $q(x|\theta_j) = q(x|y = j)$ with $\theta_j$ consisting of all its parameters and $\Theta_k = \{\alpha_j, \theta_j\}_{j=1}^{k}$. Substituting all these component densities into Eq.(9), we have the following harmony function:

$$H(p\|q) = J(\Theta_k) = \frac{1}{N}\sum_{t=1}^{N}\sum_{j=1}^{k}\frac{\alpha_j\,q(x_t|\theta_j)}{\sum_{i=1}^{k}\alpha_i\,q(x_t|\theta_i)}\ln[\alpha_j\,q(x_t|\theta_j)]. \tag{11}$$

When each $q(x|\theta_j)$ is a Gaussian probability density given by Eq.(2), $J(\Theta_k)$ becomes a harmony function on Gaussian mixtures. Furthermore, it has been demonstrated by experiments [6,7,8,9,10] and theoretical analysis [11] that, as this harmony function reaches its global maximum, a number of Gaussians will match the actual Gaussians in the sample data, respectively, with the mixing proportions of the extra Gaussians attenuating to zero. Thus, we can use the harmony function as a reasonable criterion for model selection on Gaussian mixture.
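The harmony function of Eq.(11) is straightforward to evaluate for a fitted mixture. The following sketch does so for the univariate Gaussian case; the toy data and parameter values are assumptions chosen for illustration. It also shows one elementary way in which redundancy is penalized: replacing a component by two identical copies with half its mixing proportion leaves the mixture density unchanged but lowers $J$ by $\ln 2$ times the average posterior mass of that component.

```python
import math

def normal_pdf(x, m, var):
    return math.exp(-0.5 * (x - m) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def harmony(xs, alphas, means, variances):
    """J(Theta_k) of Eq. (11) for a univariate Gaussian mixture."""
    total = 0.0
    for x in xs:
        u = [a * normal_pdf(x, m, v) for a, m, v in zip(alphas, means, variances)]
        s = sum(u)
        # posterior-weighted log of alpha_j q(x|theta_j); zero-weight terms contribute nothing
        total += sum((uj / s) * math.log(uj) for uj in u if uj > 0.0)
    return total / len(xs)

xs = [-0.5, 0.0, 0.5, 5.5, 6.0, 6.5]
j_two = harmony(xs, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
# Same density, but the second component is duplicated with halved proportions.
j_three = harmony(xs, [0.5, 0.25, 0.25], [0.0, 6.0, 6.0], [1.0, 1.0, 1.0])
print(j_two, j_three, j_two - j_three)
```

The likelihood is identical for the two parameter sets, so the difference in $J$ comes entirely from how the posterior mass is spread over the components.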

4 The BYY Split-and-Merge EM Algorithm

With the above preparations, we now present our BYY split-and-merge EM algorithm. Given a sample data set $S$ from an original mixture with $k^*$ $(> 1)$ actual Gaussians, we use the EM algorithm to get $k$ estimated Gaussians from the initial parameters. If $k \neq k^*$, some estimated Gaussians cannot match the actual Gaussians properly, and it is usually efficient to utilize a split-and-merge EM algorithm to split or merge those unsuited Gaussians dynamically. The main mechanisms of the split-and-merge EM algorithm are the split and merge criteria. Based on the BYY harmony function and the analysis of the overlap between two Gaussians in a sample data set, we construct the split and merge criteria as well as the split-and-merge EM algorithm in the following three subsections.

4.1 The Harmony Split Criterion

After each usual EM procedure, we get the estimated parameters $\Theta_k$ of the Gaussian mixture. According to Eq.(11), the harmony function $J(\Theta_k)$ can be further expressed in the sum form $J(\Theta_k) = \sum_{j=1}^{k} H(p_j\|q_j)$, where

$$H(p_j\|q_j) = \frac{1}{N}\sum_{t=1}^{N}\frac{\alpha_j\,q(x_t|\theta_j)}{\sum_{i=1}^{k}\alpha_i\,q(x_t|\theta_i)}\ln[\alpha_j\,q(x_t|\theta_j)]. \tag{12}$$


Clearly, $H(p_j\|q_j)$ denotes the harmony or matching level of the $j$-th estimated Gaussian with respect to the corresponding actual Gaussian in the sample data set. In order to improve the total harmony function, we can split the Gaussian with the least component harmony value $H(p_j\|q_j)$. That is, if $H(p_r\|q_r)$ is the least one, the harmony split criterion will implement the split operation on the $r$-th estimated Gaussian. Specifically, we divide it into two components $i'$, $j'$ with their parameters being designed as follows (refer to [15]). Generally, the covariance matrix $\Sigma_r$ can be decomposed as $\Sigma_r = USV^T$, where $S = \mathrm{diag}[s_1, s_2, \cdots, s_d]$ is a diagonal matrix with nonnegative diagonal elements in descending order, and $U$ and $V$ are two (standard) orthogonal matrices. Then, we further set $A = U\sqrt{S} = U\,\mathrm{diag}[\sqrt{s_1}, \sqrt{s_2}, \cdots, \sqrt{s_d}]$ and take the first column $A_1$ of $A$. Finally, we have the parameters for the two split Gaussians as follows, where $\gamma$, $\mu$, $\beta$ are all set to be 0.5:

$$\alpha_{i'} = \gamma\alpha_r, \quad \alpha_{j'} = (1 - \gamma)\alpha_r; \tag{13}$$

$$m_{i'} = m_r - (\alpha_{j'}/\alpha_{i'})^{1/2}\mu A_1; \tag{14}$$

$$m_{j'} = m_r + (\alpha_{i'}/\alpha_{j'})^{1/2}\mu A_1; \tag{15}$$

$$\Sigma_{i'} = (\alpha_{j'}/\alpha_{i'})\Sigma_r + ((\beta - \beta\mu^2 - 1)(\alpha_r/\alpha_{i'}) + 1)A_1A_1^T; \tag{16}$$

$$\Sigma_{j'} = (\alpha_{i'}/\alpha_{j'})\Sigma_r + ((\beta\mu^2 - \beta - \mu^2)(\alpha_r/\alpha_{j'}) + 1)A_1A_1^T. \tag{17}$$

4.2 The Overlap Merge Criterion

For the $r$-th component with the sample $x_t$, we introduce a special function $U(x_t, r) = p(y = r|x_t)(1 - p(y = r|x_t))$, where $p(y = r|x_t)$ is just the posterior probability of the sample $x_t$ over the $r$-th component. Clearly, in the estimated Gaussian mixture, $U(x_t, r)$ is a special measure of the degree to which the sample $x_t$ belongs to the $r$-th component. With this special measure, we can define the degree of overlap between two components under a given sample data set $S$ as follows:

$$F_{i,j} = \frac{\sum_{x_t \in \Omega_i^{\varepsilon}} U(x_t, j)\cdot\sum_{x_t \in \Omega_j^{\varepsilon}} U(x_t, i)}{\#\Omega_i^{\varepsilon}\cdot\#\Omega_j^{\varepsilon}\cdot dist(i, j)}, \tag{18}$$

where $\Omega_r^{\varepsilon} = \{x_t \mid p(y = r|x_t) > 0.5\ \&\ U(x_t, r) \ge \varepsilon\}$ and $dist(i, j)$ is the Mahalanobis distance between the $i$-th and $j$-th components. Since $F_{i,j}$ is a measure of the overlap between components $i$ and $j$, the two components should be merged together if $F_{i,j}$ is large enough. Thus, the overlap merge criterion is that if $F_{i,j}$ is the highest one, the $i$-th and $j$-th components will be merged into one component by the following rules ([15]):

$$\alpha_r = \alpha_i + \alpha_j; \tag{19}$$

$$m_r = (\alpha_i m_i + \alpha_j m_j)/\alpha_r; \tag{20}$$

$$\Sigma_r = (\alpha_i\Sigma_i + \alpha_j\Sigma_j + \alpha_i m_i m_i^T + \alpha_j m_j m_j^T - \alpha_r m_r m_r^T)/\alpha_r. \tag{21}$$
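The split rules of Eqs.(13)-(17) and the merge rules of Eqs.(19)-(21) can be sketched in the scalar case $d = 1$, where $\Sigma_r$ is a variance and $A_1$ reduces to its square root. This restriction, the function names and the example values are assumptions made for brevity. The merged mean is normalized by $\alpha_r$ here so that the merge preserves the first two moments of the pair.

```python
import math

def split_component(alpha_r, m_r, var_r, gamma=0.5, mu=0.5, beta=0.5):
    """Eqs. (13)-(17) for d = 1, where A_1 reduces to sqrt(var_r)."""
    a1 = math.sqrt(var_r)
    al_i, al_j = gamma * alpha_r, (1.0 - gamma) * alpha_r
    m_i = m_r - math.sqrt(al_j / al_i) * mu * a1
    m_j = m_r + math.sqrt(al_i / al_j) * mu * a1
    var_i = (al_j / al_i) * var_r \
        + ((beta - beta * mu ** 2 - 1.0) * (alpha_r / al_i) + 1.0) * a1 * a1
    var_j = (al_i / al_j) * var_r \
        + ((beta * mu ** 2 - beta - mu ** 2) * (alpha_r / al_j) + 1.0) * a1 * a1
    return (al_i, m_i, var_i), (al_j, m_j, var_j)

def merge_components(ci, cj):
    """Moment-matching merge, Eqs. (19)-(21) for d = 1."""
    (al_i, m_i, v_i), (al_j, m_j, v_j) = ci, cj
    al_r = al_i + al_j
    m_r = (al_i * m_i + al_j * m_j) / al_r
    v_r = (al_i * v_i + al_j * v_j + al_i * m_i ** 2 + al_j * m_j ** 2
           - al_r * m_r ** 2) / al_r
    return al_r, m_r, v_r

# A component with proportion 0.4, mean 3 and variance 4 is split and merged back.
left, right = split_component(0.4, 3.0, 4.0)
print(left, right)
print(merge_components(left, right))
```

With $\gamma = \mu = \beta = 0.5$ the two halves sit half a standard deviation on either side of the original mean, each with three quarters of the original variance, and merging them recovers the original component.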

4.3 Procedure of the BYY Split-and-Merge EM Algorithm

With the harmony split criterion and the overlap merge criterion, we can present the procedure of the BYY split-and-merge EM algorithm as follows:

1. According to the initial values of $k$ and the parameters $\Theta_k$, implement the usual EM algorithm and then compute $J(\Theta_k)$.

2. Implement the following split and merge operations independently.

Split Operation: With the current $k$ and the obtained parameters $\Theta_k$, split the Gaussian $q(x|\theta_r)$ of the least component harmony value into two new Gaussians $q(x|\theta_{j'})$ and $q(x|\theta_{j''})$ according to Eqs.(13)-(17). Then, implement the usual EM algorithm from the parameters of the previous and split Gaussians to obtain the updated parameters $\Theta_{split}$ for the current mixture of $k + 1$ Gaussians; compute $J(\Theta_{split})$ on the sample data set and denote it by $J_{split}$.

Merge Operation: With the current $k$ and the parameters $\Theta_k$, merge the two Gaussians with the highest degree of overlap into one Gaussian according to Eqs.(19)-(21), and implement the usual EM algorithm from the parameters of the previous and merged Gaussians to obtain the updated parameters $\Theta_{merge}$ for the current mixture of $k - 1$ Gaussians; compute $J(\Theta_{merge})$ on the sample data set and denote it by $J_{merge}$.

3. Compare the three values $J_{old} = J(\Theta_k)$, $J_{split}$ and $J_{merge}$, and continue the iteration until it stops.
(i) If $J_{split} = \max(J_{old}, J_{split}, J_{merge})$, we accept the result of the split operation, set $k = k + 1$, $\Theta_{k+1} = \Theta_{split}$, and go to step 2;
(ii) If $J_{merge} = \max(J_{old}, J_{split}, J_{merge})$, we accept the result of the merge operation, set $k = k - 1$, $\Theta_{k-1} = \Theta_{merge}$, and go to step 2;
(iii) If $J_{old} = \max(J_{old}, J_{split}, J_{merge})$, we stop the algorithm with the current $\Theta_k$ as the final result.

It can be seen from the above procedure that both the split and merge operations try to increase the total harmony function, while the stopping criterion prevents the algorithm from splitting or merging too many Gaussians. Thus, the harmony function criterion will make a correct model selection, while the usual EM algorithm still maintains a maximum likelihood (ML) solution of the parameters $\Theta_k$. Therefore, this split-and-merge EM procedure will lead to a better solution on the Gaussian mixture modeling for both model selection and parameter estimation.
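The accept-or-stop logic of Step 3 can be sketched as a small driver that is independent of how the split, merge and EM refits are implemented. The callables `harmony`, `try_split` and `try_merge`, as well as the `max_rounds` guard, are hypothetical placeholders and are not part of the paper. Ties are resolved here by stopping, which is one possible reading of case (iii).

```python
def split_merge_search(theta, harmony, try_split, try_merge, max_rounds=100):
    """Greedy loop of Step 3: keep whichever of (current, split, merge) has the largest J."""
    j_old = harmony(theta)
    for _ in range(max_rounds):
        candidates = []
        for op in (try_split, try_merge):
            proposal = op(theta)  # refitted parameters, or None if the operation does not apply
            if proposal is not None:
                candidates.append((harmony(proposal), proposal))
        if not candidates:
            break
        j_best, best = max(candidates, key=lambda c: c[0])
        if j_best <= j_old:       # case (iii): neither operation improves J
            break
        theta, j_old = best, j_best  # case (i) or (ii)
    return theta, j_old

# Toy stand-ins: the "model" is just the component count k, and J peaks at k = 6.
def toy_harmony(k):
    return -(k - 6) ** 2

def toy_split(k):
    return k + 1

def toy_merge(k):
    return k - 1 if k > 1 else None

print(split_merge_search(8, toy_harmony, toy_split, toy_merge))  # (6, 0)
```

In a real run each proposal would be the result of an EM refit started from the split or merged parameters, so each round costs two EM runs and two evaluations of $J$.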

5 Experimental Results

In this section, we demonstrate the BYY split-and-merge EM algorithm through a simulation experiment and two applications: the classification of two real-world data sets and unsupervised color image segmentation. Moreover, we compare it with the greedy EM algorithm given in [16] on unsupervised color image segmentation.

Fig. 1. (a): The Synthetic Data Set with Six Gaussians Used in the Simulation Experiment. (b)-(e): The Experimental Results at the Four Typical Iterations of the BYY Split-and-Merge EM Algorithm. (f): The Final Experimental Result of the BYY Split-and-Merge EM Algorithm. Panel titles: (a) Original Data Set, (b) Initial Classification, (c) Split Operation, (d) Merge Operation, (e) Merge Operation, (f) Final Result.

5.1 Simulation Result

In the simulation experiment, a synthetic data set containing six bivariate Gaussian distributions (i.e., $d = 2$) with a certain degree of overlap, shown in Fig. 1(a), was used to demonstrate the performance of the BYY split-and-merge EM algorithm. The initial mean vectors were obtained by the k-means algorithm at $k = 8$, as shown in Fig. 1(b). The BYY split-and-merge EM algorithm was implemented on the synthetic data set until $J(\Theta_k)$ arrived at a maximum. The typical results during the procedure of the BYY split-and-merge EM algorithm are shown in Fig. 1(c)-(f), respectively. It can be observed from these figures that the BYY split-and-merge EM algorithm not only detected the correct number of Gaussians for the synthetic data set, but also led to a good estimation of the parameters in the original Gaussian mixture.

5.2 On Classification of the Real-World Data

We further applied the BYY split-and-merge EM algorithm to the classification of the Iris data ( 3-class, 4-dimensional, 150 samples) and the Wine data (3-class, 13-dimensional, 178 samples ). In the both experiments, we masked the class indexes of these samples and used them to check the classification accuracy of the BYY split-and-merge EM algorithm. For quick convergence of the algorithm, a low threshold T is set such that as long as some mixing proportion was less than

A BYY Split-and-Merge EM Algorithm for Gaussian Mixture Learning

607

Table 1. The Classification Results of the BYY Split-and-Merge EM Algorithm on Real-world Data Sets

The data set | ε | T | k | The classification accuracy
Iris data set | 0.2 | 0.10 | 2 | 98.0% ± 0.006
Wine data set | 0.2 | 0.10 | 4 | 96.4% ± 0.022

Fig. 2. The Experimental Results on Unsupervised Color Image Segmentation. (a) The Original Color Images. (b) The Segmentation Results of the BYY Split-and-Merge EM Algorithm. (c) The Segmentation Results of the Greedy EM Algorithm.

T, the corresponding Gaussian in the mixture would be discarded immediately. In the experiments, for each data set with k = 2, 4, we ran the algorithm 100 times from different initial parameters. The classification results of the algorithm on the Iris and Wine data sets are summarized in Table 1. It can be seen from Table 1 that the classification accuracies were rather high and stable (with a very small deviation from the average classification accuracy).

5.3 On Unsupervised Color Image Segmentation

Segmenting a digital color image into homogenous regions corresponding to the objects (including the background) is a fundamental problem in image


processing. When the number of objects in an image is not known in advance, the image segmentation problem is in an unsupervised mode and becomes rather difficult in practice. If we consider each object as a Gaussian distribution, the whole color image can be regarded as a Gaussian mixture in the data or color space. Then, the BYY split-and-merge EM algorithm provides a new tool for solving this unsupervised color image segmentation problem. We applied it to unsupervised color image segmentation on three typical color images expressed in the three-dimensional RGB color space and compared it with the greedy EM algorithm. The three color images for the experiments are given in Fig. 2(a). The segmentation results of these color images by the BYY split-and-merge EM algorithm are given in Fig. 2(b). For comparison, the segmentation results of these color images by the greedy EM algorithm are given in Fig. 2(c). From the segmented images of the two algorithms given in Fig. 2, it can be seen that the BYY split-and-merge EM algorithm separated the objects from the background efficiently. Moreover, our proposed algorithm obtained a more accurate segmentation on the contours of the objects in each image.
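The step from a learned Gaussian mixture to a segmented image can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2x2 image, and the mixture parameters below are all assumptions chosen for illustration, whereas in the paper the parameters would come from the BYY split-and-merge EM algorithm. Each RGB pixel is simply labeled with the component that has the largest weighted density.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)
    return np.exp(expo) / norm

def segment_pixels(image, alphas, means, covs):
    """Label each RGB pixel with the index of its most probable Gaussian."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)
    # Column j holds alpha_j * density_j for every pixel.
    weighted = np.stack(
        [a * gaussian_pdf(pixels, m, s) for a, m, s in zip(alphas, means, covs)],
        axis=1,
    )
    return weighted.argmax(axis=1).reshape(h, w)

# Hypothetical toy image: a dark left column and a bright right column.
image = np.array([[[10, 10, 10], [240, 240, 240]],
                  [[20, 15, 12], [250, 245, 235]]])
# Hypothetical mixture parameters (one dark and one bright component).
alphas = [0.5, 0.5]
means = [np.array([15.0, 15.0, 15.0]), np.array([245.0, 245.0, 245.0])]
covs = [np.eye(3) * 100.0, np.eye(3) * 100.0]
labels = segment_pixels(image, alphas, means, covs)
```

Here `labels` has the same height and width as the image, and every distinct label corresponds to one segmented region.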

6 Conclusions

Under the framework of the Bayesian Ying-Yang (BYY) harmony learning system and theory, we have established a BYY split-and-merge EM algorithm with the help of the conventional EM algorithm. By splitting or merging the unsuited estimated Gaussians obtained from the EM algorithm, the BYY split-and-merge EM algorithm can increase the total harmony function at each step until the estimated Gaussians in the mixture match the actual Gaussians in the sample data set. The simulation and practical experiments demonstrate that the BYY split-and-merge EM algorithm can achieve a better solution for Gaussian mixture modeling in both model selection and parameter estimation.

Acknowledgments. This work was supported by the Natural Science Foundation of China under grant 60771061.

References

1. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39, 1–38 (1977)
2. Akaike, H.: A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19, 716–723 (1974)
3. Schwarz, G.: Estimating the Dimension of a Model. The Annals of Statistics 6, 461–464 (1978)
4. Xu, L.: Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, Three-layer Nets and ME-RBF-SVM Models. International Journal of Neural Systems 11, 43–69 (2001)
5. Xu, L.: BYY Harmony Learning, Structural RPCL, and Topological Self-Organizing on Mixture Models. Neural Networks 15, 1231–1237 (2002)
6. Ma, J., Wang, T., Xu, L.: A Gradient BYY Harmony Learning Rule on Gaussian Mixture with Automated Model Selection. Neurocomputing 56, 481–487 (2004)
7. Ma, J., Gao, B., Wang, Y., et al.: Conjugate and Natural Gradient Rules for BYY Harmony Learning on Gaussian Mixture with Automated Model Selection. International Journal of Pattern Recognition and Artificial Intelligence 19(5), 701–713 (2005)
8. Ma, J., Wang, L.: BYY Harmony Learning on Finite Mixture: Adaptive Gradient Implementation and A Floating RPCL Mechanism. Neural Processing Letters 24(1), 19–40 (2006)
9. Ma, J., Liu, J.: The BYY Annealing Learning Algorithm for Gaussian Mixture with Automated Model Selection. Pattern Recognition 40, 2029–2037 (2007)
10. Ma, J., He, X.: A Fast Fixed-point BYY Harmony Learning Algorithm on Gaussian Mixture with Automated Model Selection. Pattern Recognition Letters 29(6), 701–711 (2008)
11. Ma, J.: Automated Model Selection (AMS) on Finite Mixtures: A Theoretical Analysis. In: Proceedings of International Joint Conference on Neural Networks, Vancouver, Canada, pp. 8255–8261 (2006)
12. Ma, J., Xu, L., Jordan, M.I.: Asymptotic Convergence Rate of the EM Algorithm for Gaussian Mixtures. Neural Computation 12(12), 2881–2907 (2000)
13. Ma, J., Xu, L.: Asymptotic Convergence Properties of the EM Algorithm with respect to the Overlap in the Mixture. Neurocomputing 68, 105–129 (2005)
14. Ma, J., Fu, S.: On the Correct Convergence of the EM Algorithm for Gaussian Mixtures. Pattern Recognition 38(12), 2602–2611 (2005)
15. Zhang, Z., Chen, C., Sun, J., et al.: EM Algorithms for Gaussian Mixtures with Split-and-Merge Operation. Pattern Recognition 36, 1973–1983 (2003)
16. Verbeek, J.J., Vlassis, N., Kröse, B.: Efficient Greedy Learning of Gaussian Mixture Models. Neural Computation 15(2), 469–485 (2003)

A Comparative Study on Clustering Algorithms for Multispectral Remote Sensing Image Recognition

Lintao Wen1, Xinyu Chen1, and Ping Guo1,2,*

1 Image Processing and Pattern Recognition Laboratory, Beijing Normal University, Beijing 100875, China
2 School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
[email protected], [email protected], [email protected]

Abstract. Since little prior knowledge about remote sensing images can be obtained before performing recognition tasks, various unsupervised classification methods have been applied to solve such problems. Choosing an appropriate clustering method is therefore critical to achieving good results. However, there is no standard criterion for deciding which clustering method is more suitable or more effective. In this paper, we conduct a comparative study on three clustering methods: C-Means, Finite Mixture Model clustering, and Affinity Propagation. The advantages and disadvantages of each method are evaluated by experiments and classification results.

1 Introduction

In the last decades, remote sensing imagery has proved to be a powerful technology for monitoring the earth's surface and atmosphere at global, regional, and even local scales. The volume of remote sensing images continues to grow at an enormous rate due to advances in sensor technology for both high spatial and temporal resolution systems. Consequently, an increasing quantity of multispectral images acquired in many geographical areas is available. There are many applications of analyzing and classifying remote sensing images, such as geology remote sensing, water area remote sensing, vegetation remote sensing, soil remote sensing, multispectral remote sensing, and so on. In all these applications, the key processing step is to recognize the regions of interest in a multispectral remote sensing image. Due to the lack of prior knowledge, unsupervised classification has been chosen to accomplish such recognition tasks, and two important factors affect the accuracy of the recognition result: one is feature extraction, and the other is the clustering method.

Usually a remote sensing image contains two kinds of features: spectral features and texture features. The spectral feature is regarded as one of the most important pieces of information for remote sensing image interpretation. This kind of feature can be utilized to characterize the most important contents of various types of remote sensing images. On the other hand, the texture feature describes attributes between a pixel and the other pixels around it. The texture feature represents the spatial information of an

Dr. Xinyu Chen and Dr. Ping Guo are the corresponding authors.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 610–617, 2008. © Springer-Verlag Berlin Heidelberg 2008


image, which can be treated as an important visual primitive for searching visually similar patterns in the image. However, in classification, the results obtained by adopting texture analysis methods alone are not very good. For example, the edges between different classes may be incorrectly classified, because texture feature extraction has to be based on a small region, not a single pixel. A spectral feature, such as the gray value, can be extracted from a single pixel; however, its limitation lies in the information it can represent. Therefore, combining these two features, the spectral feature and the texture feature, into a new feature vector is an effective approach, as it represents the most effective features of the given remote sensing images [1-5]. In our previous work [6], we proposed a method that adopts Ant Colony Optimization (ACO) [7-10] to find this mixed feature vector, and it improves the accuracy of the recognition results.

Many clustering methods [11-13] have been introduced to remote sensing image recognition in previous studies, such as C-Means clustering [14,15], Finite Mixture Model Clustering (FMMC) [16-18], and Affinity Propagation (AP) [19,20]. Some of these methods use statistical information and some do not. When we do image recognition, which method is more suitable for a given circumstance, and which should be used? To answer this question, in this paper we conduct a comparative study on the C-Means, FMMC, and AP clustering methods. Their advantages and disadvantages are given after a series of experiments.

2 Clustering Methods

Clustering methods are all based on a measure of similarity between data points. A common approach is to cluster data points by iteratively calculating the similarity, or some measurement derived from it, until termination conditions are satisfied. In the following parts of this section, we give a detailed explanation of the three clustering methods used in our experiments: C-Means, FMMC, and AP. This will be helpful for understanding the experimental results.

2.1 C-Means

The C-Means clustering method is a traditional and popular clustering method. It is based on the squared-error criterion. It takes a number c as input and classifies the data points into c clusters. The algorithm is composed of the following steps:

(1) Place c points as cluster centers and assign each point to a cluster. The rule can be expressed as

$$j^* = \arg\min_{j} \left\{ \| y_i - c_j \|^2 \right\}, \quad j \in (1, c), \tag{1}$$

in which $j^*$ is the index of the cluster to which $y_i$ is assigned, and $c_j$ represents the center point of the jth cluster.

(2) Calculate the cluster mean $m_i$ and the squared-error criterion $J_e$ using Equations 2 and 3, respectively:

$$m_i = \frac{1}{N_i} \sum_{y \in \Gamma_i} y, \tag{2}$$

$$J_e = \sum_{i=1}^{c} \sum_{y \in \Gamma_i} \| y - m_i \|^2, \tag{3}$$

in which $\Gamma_i$ represents cluster i.

(3) Calculate $p_j$ for each point y, and assign y to a new cluster:

$$p_j = \begin{cases} \dfrac{N_j}{N_j + 1} \| y - m_j \|^2, & j \neq i, \\[2ex] \dfrac{N_i}{N_i + 1} \| y - m_i \|^2, & j = i, \end{cases} \quad j \in (1, c), \tag{4}$$

$$j^* = \arg\min_{k} \{ p_k \}, \quad k \in (1, c). \tag{5}$$

(4) Recalculate the affected means and $J_e$.

(5) Iterate steps (3) and (4) until $J_e$ remains constant for a predefined number of iterations.

2.2 Finite Mixture Model Clustering (FMMC)

FMMC employs the Finite Mixture Model (FMM) and the Expectation-Maximization (EM) algorithm to estimate parameters. After calculating the posterior probability, it uses the Bayes decision rule to classify $x_i$ into cluster $j^*$. The joint probability distribution of data points in the FMM can be expressed as

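$$p(x, \Theta) = \sum_{j=1}^{k} \alpha_j G(x, m_j, \Sigma_j), \quad \Big( \alpha_j \geq 0, \ \sum_{j=1}^{k} \alpha_j = 1 \Big), \tag{6}$$

where

$$G(x, m_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\left[ -\frac{1}{2} (x - m_j)^T \Sigma_j^{-1} (x - m_j) \right]. \tag{7}$$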

This is a general expression of the multivariate Gaussian distribution; x is a random vector of dimension d. The parameter set $\Theta = \{\alpha_j, m_j, \Sigma_j\}_{j=1}^{k}$ collects the FMM parameters, in which $\alpha_j$ is the mixing weight, $m_j$ is the mean vector, and $\Sigma_j$ is the covariance matrix of the jth component of the model. Usually k is pre-assigned, and then the EM algorithm is adopted to estimate the other parameters. EM is also an iterative process and contains two steps, the E-step and the M-step.

E-step: Calculate the posterior probability

$$P(j | x_i) = \frac{\alpha_j G(x_i, m_j, \Sigma_j)}{\sum_{l=1}^{k} \alpha_l G(x_i, m_l, \Sigma_l)}. \tag{8}$$

M-step: Calculate the model parameter vector

$$\alpha_j^{new} = \frac{1}{N} \sum_{i=1}^{N} \frac{\alpha_j G(x_i, m_j, \Sigma_j)}{\sum_{l=1}^{k} \alpha_l G(x_i, m_l, \Sigma_l)} = \frac{1}{N} \sum_{i=1}^{N} P(j | x_i), \tag{9}$$

$$m_j^{new} = \frac{1}{\alpha_j^{new} N} \sum_{i=1}^{N} P(j | x_i) \, x_i, \tag{10}$$

$$\Sigma_j^{new} = \frac{\sum_{i=1}^{N} P(j | x_i) (x_i - m_j^{new})(x_i - m_j^{new})^T}{\sum_{i=1}^{N} P(j | x_i)}. \tag{11}$$

These two steps are iterated alternately until the likelihood function L(Θ) reaches a local maximum, where the likelihood can be expressed as

$$L(\Theta) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{j=1}^{k} p(x_n | j) \, p(j). \tag{12}$$
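The E-step and M-step above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the function names, the synthetic two-cluster data, and the initial parameters are hypothetical, and the covariance update uses the freshly updated mean, as in the standard EM algorithm.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density (Eq. 7) evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)) / norm

def em_step(x, alphas, means, covs):
    """One EM iteration; also returns the log-likelihood before the update."""
    n = x.shape[0]
    k = len(alphas)
    # E-step (Eq. 8): posterior probability of each component for each point.
    weighted = np.stack(
        [alphas[j] * gaussian_pdf(x, means[j], covs[j]) for j in range(k)], axis=1
    )
    post = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step (Eqs. 9-11).
    nj = post.sum(axis=0)                      # equals alpha_j^new * N
    new_alphas = nj / n
    new_means = (post.T @ x) / nj[:, None]
    new_covs = []
    for j in range(k):
        diff = x - new_means[j]
        new_covs.append((post[:, j, None] * diff).T @ diff / nj[j])
    log_lik = np.log(weighted.sum(axis=1)).sum()   # logarithm of Eq. 12
    return new_alphas, new_means, np.array(new_covs), post, log_lik

# Hypothetical data: two well-separated 2-D clusters of 100 points each.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(6.0, 1.0, (100, 2))])
alphas = np.array([0.5, 0.5])
means = np.array([[1.0, 1.0], [5.0, 5.0]])
covs = np.array([np.eye(2), np.eye(2)])
log_liks = []
for _ in range(20):
    alphas, means, covs, post, log_lik = em_step(x, alphas, means, covs)
    log_liks.append(log_lik)
labels = post.argmax(axis=1)   # Bayes decision: most probable component
```

Tracking `log_liks` across iterations is a simple way to check that the likelihood does not decrease, which is the behavior the convergence result relies on.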

The convergence of the likelihood function was proved by Redner and Walker [18], and the value reached may be only a local maximum.

2.3 Affinity Propagation (AP)

AP is a new clustering method proposed by Frey and Dueck in 2007 [19]. It is quite different from other clustering methods. AP treats each data point as a potential cluster center. Real-valued messages are transmitted between pairs of data points recursively until a good set of clusters emerges. The AP algorithm takes a similarity matrix as input. The similarity can be set to the negative squared Euclidean distance (for points i and k, $s(i,k) = -\|x_i - x_k\|^2$). In the message-passing steps that follow, two kinds of messages are exchanged between points. One is called the "responsibility" r(i,k), which indicates how well point k serves as the center of point i. The other is the "availability" a(i,k), which reflects how appropriate it would be for point i to choose point k as its center. The rules to calculate these two values are:

$$r(i, k) = s(i, k) - \max_{k' \,\text{s.t.}\, k' \neq k} \{ a(i, k') + s(i, k') \}, \tag{13}$$

$$a(i, k) = \min\Big\{ 0, \ r(k, k) + \sum_{i' \,\text{s.t.}\, i' \notin \{i, k\}} \max\{0, r(i', k)\} \Big\}, \tag{14}$$

$$a(k, k) = \sum_{i' \,\text{s.t.}\, i' \neq k} \max\{0, r(i', k)\}. \tag{15}$$


At the initial step, a(i,k) is set to zero. Then r(i,k) and a(i,k) are calculated in turn in the following iterations (the message-passing procedure). The iteration terminates after a predetermined number of iterations, or after the local decisions remain constant for a number of iterations. The center of each point can be found by Equation 16:

$$c_i^* = \arg\max_{k} \big( a(i, k) + r(i, k) \big), \tag{16}$$

in which $c_i^*$ is the center of point i. Rather than requiring that the number of clusters be pre-specified, AP takes s(k,k) as input for each point k, so that data points with larger values are more likely to be chosen as centers. These values are called "preferences", and they influence the resulting number of clusters. By setting these preference values, we can obtain a suitable clustering result. AP also uses another parameter, a damping factor λ∈(0, 1), to prevent numerical oscillations that arise in some circumstances. During the message-passing procedure, each message is set to λ times its value from the previous iteration plus 1-λ times its prescribed updated value.
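The message-passing updates of Equations 13-16, together with the damping rule, can be sketched as follows. This is a minimal vectorized sketch, not the code used in the experiments. The function name, the six toy points, the fixed iteration count, and the choice of the median similarity as the common preference are assumptions made for illustration.

```python
import numpy as np

def affinity_propagation(s, damping=0.5, n_iter=200):
    """Run the updates of Eqs. (13)-(15) on a similarity matrix s.

    s[k, k] holds the preference of point k. Returns, for each point,
    the index of the point chosen as its center (Eq. 16).
    """
    n = s.shape[0]
    r = np.zeros((n, n))
    a = np.zeros((n, n))
    rows = np.arange(n)
    for _ in range(n_iter):
        # Responsibility (Eq. 13): s(i,k) - max over k' != k of a(i,k') + s(i,k').
        as_ = a + s
        best = as_.argmax(axis=1)
        best_val = as_[rows, best]
        as_[rows, best] = -np.inf
        second_val = as_.max(axis=1)
        r_new = s - best_val[:, None]
        r_new[rows, best] = s[rows, best] - second_val
        r = damping * r + (1.0 - damping) * r_new

        # Availability (Eqs. 14 and 15).
        rp = np.maximum(r, 0.0)
        rp[rows, rows] = r[rows, rows]        # keep r(k,k) unclipped
        col_sum = rp.sum(axis=0)              # r(k,k) + sum of positive r(i',k)
        a_new = col_sum[None, :] - rp         # drop the contribution of i itself
        diag = a_new[rows, rows].copy()       # a(k,k) of Eq. 15
        a_new = np.minimum(a_new, 0.0)
        a_new[rows, rows] = diag
        a = damping * a + (1.0 - damping) * a_new
    return (a + r).argmax(axis=1)

# Hypothetical toy data: two groups of three points each.
x = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
              [4.0, 2.0], [4.0, 4.0], [4.0, 0.0]])
s = -((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
np.fill_diagonal(s, np.median(s))   # assumed preference: the median similarity
centers = affinity_propagation(s)
```

Points that share the same entry in `centers` belong to the same cluster. Raising or lowering the preference on the diagonal changes how many distinct centers emerge.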

3 Experiments and Result Analysis

In our previous work, we employed an ACO method to select a feature vector, consisting of spectral and texture features of a remote sensing image, that represents an image pixel well in its feature space [6]. The experiments in this paper are based on this result. In these experiments, we use the aforementioned three methods with the selected features to cluster image pixels and then analyze the results.

Table 1. The best result of different clustering methods

[Image table: rows Original Image, C-Means, FMMC, and AP; columns (Data/Cluster No.) Data-1/2, Data-2/2, and Data-3/3.]

3.1 Experiments

Three images, referred to as Data-1, Data-2, and Data-3, were used in our experiments. They were selected from the database of the Landsat-5 platform, which was launched by the USA, and the remote sensor was the Thematic Mapper (TM). All these images have 6 bands and contain at least two kinds of geographical objects. At the feature selection step, we found a feature set including DMCF, HM, and TS features. After that, each of the three clustering methods used the same feature vector for the clustering task. For each method we ran the experiment ten times and calculated the accuracy of all the results. Table 1 shows the best result image of each method. Table 2 gives the statistics of clustering accuracy over all experiments.

Table 2. Statistical results of clustering accuracy

Data | Statistic | C-Means | FMMC | AP
Data-1 | Worst (%) | 64.32 | 81.85 | 83.33
Data-1 | Mean (%) | 73.71 | 83.38 | 86.29
Data-1 | Best (%) | 88.47 | 85.64 | 88.56
Data-2 | Worst (%) | 67.52 | 77.33 | 76.77
Data-2 | Mean (%) | 75.63 | 79.45 | 80.13
Data-2 | Best (%) | 83.91 | 81.52 | 81.37
Data-3 | Worst (%) | 57.84 | 75.35 | 79.38
Data-3 | Mean (%) | 68.32 | 76.72 | 80.33
Data-3 | Best (%) | 81.62 | 80.41 | 82.26

3.2 Results

From Table 1 we can observe that all three methods can produce a good result after repeated experiments. For Data-2, the difference among the best clustering results of the three methods is clearer than for the other two data sets. In Table 2, we can see that AP performs better than the other two methods: its mean clustering accuracy is higher than theirs on every data set. Besides this, AP and FMMC are much more stable than C-Means. Although the best clustering accuracy of C-Means is comparable to that of FMMC and AP, its worst result is much lower than theirs.
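The stability claim can be checked directly from the numbers in Table 2. The short sketch below copies the accuracies from the table; the variable names and the use of the best-minus-worst gap as a stability measure are our own assumptions for illustration.

```python
# Accuracy values (%) copied from Table 2, as (worst, mean, best)
# for Data-1, Data-2, and Data-3 in that order.
table2 = {
    "C-Means": [(64.32, 73.71, 88.47), (67.52, 75.63, 83.91), (57.84, 68.32, 81.62)],
    "FMMC": [(81.85, 83.38, 85.64), (77.33, 79.45, 81.52), (75.35, 76.72, 80.41)],
    "AP": [(83.33, 86.29, 88.56), (76.77, 80.13, 81.37), (79.38, 80.33, 82.26)],
}

# Gap between the best and worst run: a smaller gap means a more stable method.
spread = {
    method: [round(best - worst, 2) for worst, _, best in rows]
    for method, rows in table2.items()
}

# Method with the highest mean accuracy on each data set.
best_mean = [
    max(table2, key=lambda method: table2[method][i][1]) for i in range(3)
]
```

The gap is about 16 to 24 percentage points for C-Means and roughly 3 to 5 points for FMMC and AP, and AP has the highest mean accuracy on all three data sets.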

4 Conclusions

In this paper, we apply the newly proposed AP clustering algorithm to the remote sensing image recognition task for the first time. From the experimental results, we found that the AP algorithm performed better than the other two methods. Its mean clustering accuracy is always the best. Furthermore, its best accuracy exceeded that of the other two methods on two of the three data sets, and it never produced the worst result. In our opinion, this is because AP is not affected by the initial center selection or by the accuracy of the estimated probability distribution of the data points. In addition, the high dimension of the feature vectors does not add much complexity to AP. All this helps make its results more stable and accurate. The


FMMC method is another good method, which performs almost as well as AP in terms of accuracy. But it is more complex than the other two, which makes it take more time to classify image pixels, and it can only find a locally optimal result. Finally, C-Means is simple and can also produce good results in some circumstances. However, the major problem of C-Means is that it is strongly affected by the initial center selection, which makes its clustering results unstable. For instance, while the best accuracy of C-Means is much better than that of the other two in the second experiment, its worst accuracy is worse than theirs. Therefore, we have to initialize different centers and run it many times to ensure that it finds the best clustering result. By analyzing these experimental results, we can draw the conclusion that the AP algorithm is an effective method for remote sensing image recognition, especially for remote sensing images for which the number of clusters is unknown.

Acknowledgements. The research work described in this paper was fully supported by a grant from the National Natural Science Foundation of China (Project No. 60675011).

References

1. Tian, Y., Guo, P., Lyu, M.R.: Comparative Studies on Feature Extraction Methods for Multispectral Remote Sensing Image Classification. In: IEEE International Conference on Systems, Man and Cybernetics, pp. 1275–1279 (2005)
2. Yin, Q., Guo, P.: Multispectral Remote Sensing Image Classification with Multiple Features. In: International Conference on Machine Learning and Cybernetics, vol. 1, pp. 360–365 (2007)
3. Baraldi, A., Parmiggiani, F.: An Investigation on the Texture Characteristics Associated with Gray Level Co-occurrence Matrix Statistical Parameters. IEEE Transactions on Geoscience and Remote Sensing 32(2), 293–303 (1995)
4. Li, J., Narayanan, R.M.: Integrated Spectral and Spatial Information Mining in Remote Sensing Imagery. IEEE Transactions on Geoscience and Remote Sensing 42(3), 673–684 (2004)
5. Wikantika, K., Tateishi, R., Harto, A.B.: Spectral and Textural Information of Multisensor Data for Land Use Classification in Metropolitan Area. In: IEEE International Geoscience and Remote Sensing Symposium, pp. 2843–2845 (2000)
6. Wen, L., Yin, Q., Guo, P.: Ant Colony Optimization Algorithm for Feature Selection and Classification of Multispectral Remote Sensing Image. In: IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2008) (accepted, 2008)
7. Dorigo, M., Stutzle, T.: Ant Colony Optimization. MIT Press, Cambridge (2004)
8. Bello, R., Puris, A.: Two Step Ant Colony System to Solve the Feature Selection Problem. In: Martínez-Trinidad, J.F., Carrasco Ochoa, J.A., Kittler, J. (eds.) CIARP 2006. LNCS, vol. 4225, pp. 588–596. Springer, Heidelberg (2006)
9. Yan, Z., Yuan, C.: Ant Colony Optimization for Feature Selection in Face Recognition. In: Knudsen, J.L. (ed.) ECOOP 2001. LNCS, vol. 2072, pp. 221–226. Springer, Heidelberg (2001)
10. Nakamichi, Y., Arita, T.: Diversity Control in Ant Colony Optimization. Artificial Life and Robotics 7(4), 198–204 (2004)
11. Tso, B.C.K., Mather, P.M.: Classification of Multisource Remote Sensing Imagery Using a Genetic Algorithm and Markov Random Fields. IEEE Transactions on Geoscience and Remote Sensing 37(3), 1255–1260 (1999)
12. Briem, G.J., Benediktsson, J.A., Sveinsson, J.R.: Multiple Classifiers Applied to Multisource Remote Sensing Data. IEEE Transactions on Geoscience and Remote Sensing 40(10), 2291–2299 (2002)
13. Shekhar, S., Schrater, P.R., Vatsavai, R.R., Wu, W., Chawla, S.: Spatial Contextual Classification and Prediction Models for Mining Geospatial Data. IEEE Transactions on Multimedia 4, 174–188 (2002)
14. Webb, A.R.: Statistical Pattern Recognition. Oxford University Press, London (1999)
15. Ruan, Q.: Digital Image Processing. Electronics Industry Press, Beijing (2001)
16. Sanjay-Gopal, S., Hebert, T.J.: Bayesian Pixel Classification Using Spatially Variant Finite Mixtures and the Generalized EM Algorithm. IEEE Transactions on Image Processing 7(7), 1014–1028 (1998)
17. Guo, P., Lu, H.: A Study on Bayesian Probabilistic Image Automatic Segmentation. Acta Optica Sinica 22(12), 1479–1483 (2002)
18. Redner, R.A., Walker, H.F.: Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review 26(2), 195–239 (1984)
19. Frey, B.J., Dueck, D.: Clustering by Passing Messages between Data Points. Science 315, 972–976 (2007)
20. Frey, B.J., Dueck, D.: Non-metric Affinity Propagation for Unsupervised Image Categorization. In: IEEE International Conference on Computer Vision (ICCV), pp. 1–8 (2007)

A Gradient BYY Harmony Learning Algorithm for Straight Line Detection

Gang Chen, Lei Li, and Jinwen Ma⋆

Department of Information Science, School of Mathematical Sciences and LAMA, Peking University, Beijing, 100871, China
[email protected]

Abstract. Straight line detection is a basic problem in image processing and has been extensively studied from different aspects, but most of the existing algorithms need to know the number of straight lines in an image in advance. However, the Bayesian Ying-Yang (BYY) harmony learning can make model selection automatically during parameter learning for the Gaussian mixture modeling, which can be further applied to detecting the correct number of straight lines automatically by representing the straight lines with Gaussians or Gaussian functions. In this paper, a gradient BYY harmony learning algorithm is proposed to detect the straight lines automatically from an image as long as the pre-assumed number of straight lines is larger than the true one. It is demonstrated by the simulation and real image experiments that this gradient BYY harmony learning algorithm can not only determine the number of straight lines automatically, but also detect the straight lines accurately against noise.

Keywords: Bayesian Ying-Yang (BYY) harmony learning, Gradient algorithm, Automated model selection, Straight line detection.

1 Introduction

Straight line detection, as a basic class of curve detection, is very important for image processing, pattern recognition and computer vision. For tackling this problem, many kinds of learning algorithms have been developed from different aspects. The Hough transform (HT) and its variations (see Refs. [1,2] for reviews) might be the most classical approach. However, this kind of algorithm usually suffers from heavy computational cost, huge storage requirements and the detection of false positives, even though the Randomized Hough Transform (RHT) [3] and the constrained Hough Transform [4] have been proposed to overcome these weaknesses. Later on, many other algorithms for straight line or curve detection appeared (e.g., [5,6]), but most of these algorithms need to know the number of straight lines or curves in the image in advance. Recently, the Bayesian Ying-Yang (BYY) harmony learning system and theory [7]-[8] have developed a new mechanism of automated model selection on Gaussian mixture, implemented by a series of BYY harmony learning algorithms [9]-[13]. Essentially, they have been established for the Gaussian mixture

Corresponding author.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 618–626, 2008. © Springer-Verlag Berlin Heidelberg 2008


modeling with a favorable feature that model selection can be made automatically during parameter learning. That is, they can learn the correct number of actual Gaussians in a sample data set automatically. In the data space of a binary image, all the black pixels or points are regarded as the samples or sample points generated from the image, and the distance from a sample point to the straight line it lies along is subject to some Gaussian distribution or function, since there always exists some noise. Thus, the straight lines can be represented through some Gaussians, and their detection in a binary image is equivalent to Gaussian mixture modeling with both automated model selection and parameter learning, which can certainly be solved by this kind of BYY harmony learning on Gaussian mixture. On the other hand, according to the BYY harmony learning on the mixture of experts, a gradient learning algorithm was already proposed in [14] for straight line or ellipse detection, but it was not applicable to the general case. In this paper, with straight lines being implicitly represented by the Gaussians of the distances from samples to them, we propose a new gradient BYY harmony learning algorithm for straight line detection based on the gradient BYY harmony learning rule established in [9]. The experiments demonstrate that this gradient BYY harmony learning algorithm can efficiently determine the number of straight lines and locate these straight lines accurately in an image. In the sequel, we introduce the BYY learning system and the harmony function and propose the gradient BYY harmony learning for straight line detection in Section 2. In Section 3, several experiments on both simulated and real-world images are conducted to demonstrate the efficiency of our proposed algorithm. Finally, we conclude briefly in Section 4.

2 The Gradient Learning Algorithm for Straight Line Detection

2.1 BYY Learning System and the Harmony Function

A BYY system describes each observation $x \in X \subset \mathbb{R}^n$ and its corresponding inner representation $y \in Y \subset \mathbb{R}^m$ via the two types of Bayesian decomposition of the joint density: p(x, y) = p(x)p(y|x) and q(x, y) = q(y)q(x|y), which are called the Yang machine and the Ying machine, respectively. Given a data set $D_x = \{x_t\}_{t=1}^{N}$ from the Yang or observable space, the goal of harmony learning on a BYY learning system is to extract the hidden probabilistic structure of x with the help of y from specifying all aspects of p(y|x), p(x), q(x|y) and q(y) via a harmony learning principle implemented by maximizing the functional

$$H(p\|q) = \int p(y|x) \, p(x) \ln[q(x|y) \, q(y)] \, dx \, dy, \tag{1}$$

which is essentially equivalent to minimizing the Kullback-Leibler divergence between the Yang and Ying machines, i.e., p(x, y) and q(x, y), because

$$KL(p\|q) = \int p(y|x) \, p(x) \ln \frac{p(y|x) \, p(x)}{q(x|y) \, q(y)} \, dx \, dy = -H(p\|q) - H(p),$$

where H(p) is the entropy of p(x, y) and is invariant to q(x, y). If both p(y|x) and q(x|y) are parametric, i.e., from a family of probability densities with parameter θ, the BYY learning system is said to have a BIdirectional Architecture (BI-Architecture for short). For the Gaussian mixture model with a given sample set $D_x = \{x_t\}_{t=1}^{N}$, we can utilize the following specific BI-architecture of the BYY learning system. The inner representation y is discrete in Y = {1, 2, ..., k} (i.e., with m = 1), and the observation x comes from a Gaussian mixture distribution. On the Ying space, we let $q(y = j) = \alpha_j \geq 0$ with $\sum_{j=1}^{k} \alpha_j = 1$. On the Yang space, we suppose that p(x) is a blind Gaussian mixture distribution, with a set of sample data $D_x$ being generated from it. Moreover, in the Ying path, we let each $q(x|y = j) = q(x|\theta_j)$ be a Gaussian probability density function (pdf) given by

$$q(x|\theta_j) = q(x|m_j, \Sigma_j) = \frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \, e^{-\frac{1}{2}(x - m_j)^T \Sigma_j^{-1} (x - m_j)}, \tag{2}$$

where $m_j$ is the mean vector and $\Sigma_j$ is the covariance matrix, which is assumed to be positive definite. On the other hand, the Yang path is constructed under the Bayesian principle by the following parametric form:

$$p(y = j|x) = \frac{\alpha_j q(x|\theta_j)}{q(x|\Theta_k)}, \qquad q(x|\Theta_k) = \sum_{j=1}^{k} \alpha_j q(x|\theta_j), \tag{3}$$

where $\Theta_k = \{\alpha_j, \theta_j\}_{j=1}^{k}$ and $q(x|\Theta_k)$ is just a Gaussian mixture that will approximate the true Gaussian mixture p(x) hidden in the sample data $D_x$ via the harmony learning on the BYY learning system. With all these component densities substituted into Eq.(1), we have

$$H(p\|q) = E_{p(x)}\left[ \sum_{j=1}^{k} \frac{\alpha_j q(X|\theta_j)}{\sum_{i=1}^{k} \alpha_i q(X|\theta_i)} \ln[\alpha_j q(X|\theta_j)] \right], \tag{4}$$

that is, it becomes the expectation of the random variable $\sum_{j=1}^{k} \frac{\alpha_j q(X|\theta_j)}{\sum_{i=1}^{k} \alpha_i q(X|\theta_i)} \ln[\alpha_j q(X|\theta_j)]$, where X is just the random variable (or vector) subject to p(x). Based on the given sample data set $D_x$, we get an estimate of $H(p\|q)$ as the following harmony function for Gaussian mixture with the parameter set $\Theta_k$:

$$J(\Theta_k) = \frac{1}{N} \sum_{t=1}^{N} \sum_{j=1}^{k} \frac{\alpha_j q(x_t|\theta_j)}{\sum_{i=1}^{k} \alpha_i q(x_t|\theta_i)} \ln[\alpha_j q(x_t|\theta_j)]. \tag{5}$$

2.2 The Gradient Learning Rule for Straight Line Detection

In order to maximize the above harmony function J(Θk ), Ma, Wang and Xu proposed a general (batch-way) gradient learning rule [9] for Gaussian mixture.
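The harmony function of Eq.(5) that this rule maximizes is straightforward to evaluate numerically. The following is a minimal sketch, not the authors' code: the function names, the synthetic two-cluster samples, and the parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Gaussian density of Eq. (2) evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)) / norm

def harmony(x, alphas, means, covs):
    """Estimate J(Theta_k) of Eq. (5) on the sample set x."""
    # Column j holds alpha_j * q(x_t | theta_j).
    weighted = np.stack(
        [a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, means, covs)],
        axis=1,
    )
    post = weighted / weighted.sum(axis=1, keepdims=True)   # p(y = j | x_t), Eq. (3)
    # Treat 0 * ln(0) as 0 so that vanishing components do not produce NaN.
    logs = np.log(np.where(weighted > 0, weighted, 1.0))
    return float((post * logs).sum(axis=1).mean())

# Hypothetical samples and parameters for a two-component mixture.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
alphas = np.array([0.5, 0.5])
means = np.array([[-2.0, -2.0], [2.0, 2.0]])
covs = np.array([np.eye(2), np.eye(2)])
j_value = harmony(x, alphas, means, covs)
```

Since $\alpha_j q(x_t|\theta_j) = p(y = j|x_t) \, q(x_t|\Theta_k)$, the value returned equals the average log-likelihood of the mixture minus the average entropy of the posterior probabilities, so J rewards both a good fit and unambiguous assignments of samples to components.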


Although some new learning algorithms (e.g., [10]-[13]) have already been proposed to improve it, we still use it in this paper for its convenience of generalization to the case of straight line detection. Actually, in the Gaussian mixture model, if we set

$$\alpha_j = e^{\beta_j} \Big/ \sum_{i=1}^{k} e^{\beta_i},$$

and substitute it into the harmony function given in Eq.(5), by the derivatives of $J(\Theta_k)$ with respect to all the parameters, we can easily construct the general gradient learning rule proposed in [9]. For straight line detection, we can use the following Gaussian functions to implicitly represent the straight lines in the image:

$$q(u|l) = q(x, y|l) = \exp\left\{ -\frac{(w_l^T (x, y)^T - b_l)^2}{2 \tau_l^2 \, w_l^T w_l} \right\}, \tag{6}$$

where $u = (x, y)$ denotes the pair of coordinates of a pixel point in the binary image. The sample data set $\{u_t = (x_t, y_t)\}_{t=1}^{N}$ consists of all the black pixel points in the binary image. In each Gaussian function, there are two parameters $w_l$ and $b_l$, from which we can get the equation of the straight line it represents: $w_l^T u = b_l$. Suppose that there are $k$ straight lines in the image, or $k$ Gaussian functions in our mixture model. Then, we can replace all these components in the general gradient learning rule [9] with these $q(u|l)$ and obtain the following new gradient learning rule for straight line detection:

\[ \Delta w_l = \eta \frac{\alpha_l}{N} \sum_{t=1}^{N} h(l|u_t) U(l|u_t) \frac{-(w_l^T u_t - b_l)^2 w_l - (w_l^T u_t - b_l) w_l^T w_l u_t}{e^{r_l} (w_l^T w_l)^2}, \tag{7} \]

\[ \Delta b_l = \eta \frac{\alpha_l}{N} \sum_{t=1}^{N} h(l|u_t) U(l|u_t) \frac{w_l^T u_t - b_l}{e^{2 r_l} (w_l^T w_l)}, \tag{8} \]

\[ \Delta r_l = \eta \frac{\alpha_l}{N} \sum_{t=1}^{N} h(l|u_t) U(l|u_t) \frac{-(w_l^T u_t - b_l)^2}{e^{2 r_l} (w_l^T w_l)}, \tag{9} \]

\[ \Delta \beta_l = \eta \frac{\alpha_l}{N} \sum_{t=1}^{N} \sum_{j=1}^{k} h(j|u_t) U(j|u_t) (\delta_{jl} - \alpha_j), \tag{10} \]

where

\[ U(l|u_t) = 1 + \sum_{r=1}^{k} (\delta_{rl} - P(r|u_t)) \ln(\alpha_r q(u_t|r)), \tag{11} \]

\[ h(l|u_t) = \frac{q(u_t|l)}{\sum_{r=1}^{k} \alpha_r q(u_t|r)}, \qquad P(r|u_t) = \alpha_r h(r|u_t). \tag{12} \]

Here $\eta > 0$ is the learning rate, which is selected from 0.01 to 0.1 in our experiments in the next section.
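The quantities that enter the rule above can be evaluated directly. The following is a minimal sketch, not the authors' implementation: it computes the mixing proportions from the β parameters, the line Gaussian of Eq. (6), the posterior P(r|u) and the factor U of Eqs. (11)-(12), and the harmony function of Eq. (5). The function names are hypothetical, the width is passed as `tau` rather than through the parameter r_l, and the log-domain evaluation is a numerical-stability choice made for this sketch.

```python
import math

def softmax(betas):
    """Mixing proportions alpha_j = exp(beta_j) / sum_i exp(beta_i)."""
    m = max(betas)                       # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in betas]
    s = sum(exps)
    return [e / s for e in exps]

def log_line_gaussian(u, w, b, tau):
    """ln q(u|l) from Eq. (6): -(w^T u - b)^2 / (2 tau^2 w^T w)."""
    d = w[0] * u[0] + w[1] * u[1] - b
    ww = w[0] * w[0] + w[1] * w[1]
    return -(d * d) / (2.0 * tau * tau * ww)

def posterior_and_U(u, alphas, lines):
    """Return (P, U, avg) for one pixel u.

    lines holds one (w, b, tau) triple per component;
    avg is sum_r P(r|u) ln(alpha_r q(u|r)).
    """
    logs = [math.log(a) + log_line_gaussian(u, w, b, tau)
            for a, (w, b, tau) in zip(alphas, lines)]   # ln(alpha_r q(u|r))
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]
    total = sum(weights)
    P = [v / total for v in weights]                    # P(r|u) = alpha_r h(r|u)
    avg = sum(p * v for p, v in zip(P, logs))
    U = [1.0 + v - avg for v in logs]                   # Eq. (11)
    return P, U, avg

def harmony(points, alphas, lines):
    """Harmony function J of Eq. (5) with the line Gaussians as components."""
    return sum(posterior_and_U(u, alphas, lines)[2] for u in points) / len(points)
```

For points lying exactly on a single line with one component, q equals 1 everywhere, so the sketch returns J = 0, P = 1 and U = 1.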

[Figure 1: three scatter-plot panels, both axes ranging from −60 to 60. (a) The first image data set S1 with k* = 2. (b) The second image data set S2 with k* = 3. (c) The third image data set S3 with k* = 4.]

Fig. 1. The Experimental Results of the Gradient BYY Harmony Learning Algorithm on Three Binary Image Data Sets at k = 6

[Figure 2: three scatter-plot panels, both axes ranging from −60 to 60. (a) The fourth image data set S4 with k* = 2. (b) The fifth image data set S5 with k* = 3. (c) The sixth image data set S6 with k* = 4.]

Fig. 2. The Experimental Results of the Gradient BYY Harmony Learning Algorithm on the Three Binary Image Data Sets with Salt Noise at k = 6

[Figure 3: four panels. (a) The original texture image (Brodatz texture D68). (b) A small image window from (a). (c) The data set after the pre-process. (d) The result of straight line detection.]

Fig. 3. The Experimental Results of Texture Analysis via the Gradient BYY Harmony Learning Algorithm

In this way, we regard each straight line as one Gaussian function in the mixture model and run the gradient learning algorithm with the above rule. The competitive learning on the mixing proportions determines the number of actual straight lines in the image, and the learned equations locate them. After the gradient learning algorithm has converged, we obtain all the parameters $\Theta_k = \{(\alpha_l, w_l, b_l)\}_{l=1}^{k}$ and discard the components with a very low mixing proportion. Then, we take each pair of $w_l$ and $b_l$ in the remaining mixture to construct a straight line equation $L_l: w_l^T u = b_l$, with the mixing proportion $\alpha_l$ representing the proportion of the pixel points along this straight line $L_l$ among all $N$ points. Hence, all the actual straight lines in the image are detected by the gradient learning algorithm.
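The discarding step described above can be sketched as follows. This is an illustrative sketch only: the function name `prune_components` and the default threshold are assumptions of this example (0.05 is the value used in the experiments below), and the parameters are assumed to be plain Python sequences.

```python
def prune_components(alphas, ws, bs, eps=0.05):
    """Keep the (alpha_l, w_l, b_l) triples whose mixing proportion is at least eps.

    Each surviving triple describes one detected line w_l^T u = b_l.
    The threshold eps is an assumption of this sketch.
    """
    return [(a, w, b) for a, w, b in zip(alphas, ws, bs) if a >= eps]
```

With k = 3 initial components and mixing proportions (0.48, 0.02, 0.5), the second component would be dropped and two lines reported.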

3 Experiment Results

In this section, several simulation experiments are carried out to demonstrate the gradient BYY harmony learning algorithm for straight line detection, on both the determination of the number of straight lines and the location of these straight lines. Moreover, the gradient BYY harmony learning algorithm is also applied to texture classification. Using k* to denote the true number of straight lines in the binary image, we implement the gradient BYY harmony algorithm on each set of binary image data with k > k*, η = 0.01 and ε = 0.05. The other parameters are initialized randomly within certain intervals. In all the experiments, the learning was stopped when $|J(\Theta_k^{new}) - J(\Theta_k^{old})| < 10^{-6}$.

We implement the gradient BYY harmony learning algorithm on the three sets of binary image data shown in Fig. 1(a), (b), (c), respectively. It detects the actual straight lines in each binary image automatically and accurately. As shown in Fig. 1(c), the algorithm is implemented on the third set S3 of binary image data of four straight lines with k = 6. After the algorithm has converged, the mixing proportions of the two extra Gaussian functions (straight lines) have been reduced below 0.05 so that they can be discarded, while the other four lines are located accurately. Thus, the correct number of straight lines in the image is detected automatically on this image data set. A similar result is obtained on the second image data set S2 with k = 6, k* = 3. As shown in Fig. 1(b), although there is only a small number of pixel points along each straight line, the algorithm can still detect the three actual straight lines accurately, with the mixing proportions of the three extra lines again reduced below 0.05.

In addition to the correct number detection, we further test the performance of the algorithm on the sets of image data with salt noise, which are shown in Fig. 2. It can be observed from Fig. 2 that the algorithm can still detect the straight lines in images with many extra noisy points.
Finally, we apply the gradient BYY harmony learning algorithm to texture classification. Texture classification is also a fundamental problem in computer vision with a wide variety of applications. Sometimes, we may encounter an image with strip texture, such as the one shown in Fig. 3(a). The problem is how to characterize it and distinguish it from other kinds of texture. In order to solve this problem, we can use the gradient BYY harmony learning algorithm to design the following texture classification scheme:

Step 1: Split a texture image into some small image windows.
Step 2: Implement the gradient BYY harmony learning algorithm to detect the straight lines in each small window.
Step 3: If the average number of parallel straight lines detected in one window is around some number corresponding to a sort of strip texture (from sparse to dense), we consider that this texture is this sort of strip texture.

In our experiment on texture classification, we find that there are usually three parallel straight lines in a window (an example is shown in Fig. 3(d)). Thus, we can say that this image contains the strip texture of three strip lines per window. Therefore, this scheme is useful for strip texture classification.
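The three steps of the scheme can be sketched as below. This is a hedged illustration rather than the authors' code: the image is assumed to be a list of rows, windows are taken as non-overlapping squares, and `count_lines` is a hypothetical callable standing in for the gradient BYY line detector applied to one window.

```python
def split_windows(image, size):
    """Step 1: cut a 2-D image (list of rows) into non-overlapping size x size windows."""
    rows, cols = len(image), len(image[0])
    windows = []
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            windows.append([row[c:c + size] for row in image[r:r + size]])
    return windows

def average_line_count(image, size, count_lines):
    """Steps 2-3: average the number of lines reported per window.

    count_lines is a placeholder for a line detector; the image is assumed
    to be at least size x size so that at least one window exists.
    """
    counts = [count_lines(w) for w in split_windows(image, size)]
    return sum(counts) / len(counts)
```

The returned average would then be compared with the line count that characterizes a given sort of strip texture; how close it must be is left open here, as it is in the text.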

4 Conclusions

We have proposed a new gradient BYY harmony learning algorithm for straight line detection. It is derived from the maximization of the harmony function on the mixture of Gaussian functions with the help of the general gradient BYY harmony learning rule. Several simulation experiments have demonstrated that the correct number of straight lines can be automatically detected on a binary image. Moreover, the gradient BYY harmony learning algorithm is successfully applied to texture classification.

Acknowledgments. This work was supported by the Natural Science Foundation of China for grant 60771061.

References

1. Ballard, D.: Generalizing the Hough Transform to Detect Arbitrary Shapes. Pattern Recognition 13(2), 111–122 (1981)
2. Illingworth, J., Kittler, J.: A Survey of the Hough Transform. Computer Vision, Graphics, and Image Processing 44, 87–116 (1988)
3. Xu, L., Oja, E., Kultanen, P.: A New Curve Detection Method: Randomized Hough Transform (RHT). Pattern Recognition Letters 11, 331–338 (1990)
4. Olson, C.F.: Constrained Hough Transform for Curve Detection. Computer Vision and Image Understanding 73(3), 329–345 (1999)
5. Olson, C.F.: Locating Geometric Primitives by Pruning the Parameter Space. Pattern Recognition 34(6), 1247–1256 (2001)
6. Liu, Z.Y., Qiong, H., Xu, L.: Multisets Mixture Learning-Based Ellipse Detection. Pattern Recognition 39, 731–735 (2006)
7. Xu, L.: Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, Three-Layer Nets and ME-RBF-SVM Models. International Journal of Neural Systems 11(1), 43–69 (2001)
8. Xu, L.: BYY Harmony Learning, Structural RPCL, and Topological Self-Organizing on Mixture Models. Neural Networks 15, 1231–1237 (2002)
9. Ma, J., Wang, T., Xu, L.: A Gradient BYY Harmony Learning Rule on Gaussian Mixture with Automated Model Selection. Neurocomputing 56, 481–487 (2004)
10. Ma, J., Gao, B., Wang, Y., Cheng, Q.: Conjugate and Natural Gradient Rules for BYY Harmony Learning on Gaussian Mixture with Automated Model Selection. International Journal of Pattern Recognition and Artificial Intelligence 19, 701–713 (2005)
11. Ma, J., Wang, L.: BYY Harmony Learning on Finite Mixture: Adaptive Gradient Implementation and a Floating RPCL Mechanism. Neural Processing Letters 24(1), 19–40 (2006)
12. Ma, J., Liu, J.: The BYY Annealing Learning Algorithm for Gaussian Mixture with Automated Model Selection. Pattern Recognition 40, 2029–2037 (2007)
13. Ma, J., He, X.: A Fast Fixed-Point BYY Harmony Learning Algorithm on Gaussian Mixture with Automated Model Selection. Pattern Recognition Letters 29(6), 701–711 (2008)
14. Lu, Z., Cheng, Q., Ma, J.: A Gradient BYY Harmony Learning Algorithm on Mixture of Experts for Curve Detection. In: Gallagher, M., Hogan, J.P., Maire, F. (eds.) IDEAL 2005. LNCS, vol. 3578, pp. 250–257. Springer, Heidelberg (2005)

An Estimation of the Optimal Gaussian Kernel Parameter for Support Vector Classification

Wenjian Wang and Liang Ma

School of Computer and Information Technology, Key Laboratory of Computational Intelligence & Chinese Information Processing of Ministry of Education, Shanxi University, 030006 Taiyuan, P.R.C
[email protected]



Abstract. The selection of the kernel function and its parameters strongly influences the generalization performance of the support vector machine (SVM), and it has become a focus of SVM research. At present, there are no general rules for selecting an optimal kernel function for a given problem; instead, Gaussian and polynomial kernels are commonly used in practical applications. Based on an analysis of the relationship between the Gaussian kernel support vector machine and scale space theory, this paper proves the existence of a certain range of the parameter σ within which the generalization performance is good. An appropriate σ within the range can be obtained via dynamic evaluation as well. Simulation results demonstrate the feasibility and effectiveness of the presented approach.

Keywords: Bound estimation, Gaussian parameter tuning, Support vector machine, Scale space theory.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 627–635, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

The Support Vector Machine (SVM), developed by Vapnik [8], is gaining popularity due to many attractive features and promising empirical performance. It has been successfully applied in many areas such as text categorization [3], time series prediction [7], and face detection [11]. SVM is a kernel-based approach, so the selection of the kernel function and its parameters has a heavy effect on its performance. How to select the optimal kernel function and its parameters has become one of the critical problems in SVM research. Although there has been some research on this problem [4-6, 12], limitations remain, such as high computation cost, the need for prior information about the data, and weak generalization ability. For a given problem, there is no effective approach for choosing the optimal kernel function and its parameters. Instead, the Gaussian kernel $K(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2))$ is the most commonly used due to its good features [1]. Once the kernel function is fixed, the tuning of its parameters demands close attention in order to achieve the desired level of generalization. In the Gaussian kernel case, because the parameter σ is closely associated with the generalization performance of SVM, how to choose an appropriate σ is worth pursuing. In practical applications, the parameter σ is usually chosen by experience. Ref. [9] presented a novel approach to select the parameter σ for the Gaussian kernel based on scale space theory. It has attractive advantages such as a simple algorithm and good simulation results. Regrettably, it did not provide a corresponding theoretical proof. Based on an analysis of the relationship between Gaussian kernel support vector classification (SVC) and scale space clustering, this paper proves the existence of a certain range of σ within which the generalization performance is good. An appropriate σ within the range can be obtained via dynamic evaluation. This work can be regarded as an important complement to Ref. [9], and it can provide a guide for the parameter tuning of the Gaussian kernel.
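As a small illustration of the Gaussian kernel mentioned above, the sketch below evaluates K(x, z) for vectors given as plain Python sequences. The function name is hypothetical; only the formula comes from the text.

```python
import math

def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for equal-length sequences x and z."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

For a fixed pair of distinct points, a larger σ gives a kernel value closer to 1, which is one way to see why σ affects how smooth the resulting classifier is.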

2 Estimation of the Optimal Parameter for Gaussian SVC

Practical applications show that the Gaussian SVC has excellent performance [2], while the parameter tuning of the Gaussian kernel plays a critical role in obtaining good performance. By experiments, Ref. [9] shows that there exists a certain range of σ within which the generalization performance is stable. This paper gives the corresponding theoretical proof.

In scale space theory, $p(x)$ (the probability distribution of data in the original space) is embedded into a continuous family $P(x, \sigma)$ of gradually smoother versions of it. $P(x, 0)$ represents the original image, and increasing the scale should simplify the image without creating spurious structures. There exists a certain range of the scale within which the corresponding images are stable (i.e., the intrinsic structure of the images can be clearly seen) [10]. Similarly, the parameter σ controls the amplitude of the Gaussian function, and hence the generalization ability of SVM. Because the Gaussian parameter σ and the scale play formally similar roles, we naturally expect the influence of σ on generalization performance in the Gaussian kernel SVM to be the same as that of the scale σ on visual stability in the human visual system. Scale space theory provides a theoretical basis for the existence of a considerable range of the scale σ in Gaussian scale space clustering, within which the generalization performance should be stable.

For a given dataset $X = \{x_i \mid x_i \in R^n, i = 1, \ldots, N\}$, where $n$ is the dimension of the data $x_i$ and $N$ is the number of points in $X$, a point of the scale space can be represented by

\[ P(x, \sigma) = p(x) * g(x, \sigma) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i) * g(x, \sigma), \]

where $p(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)$ is the probability distribution of data in the original space, and $g(x, \sigma) = \frac{1}{2\pi\sigma^2} \exp\big(-\frac{\|x\|^2}{2\sigma^2}\big)$ [10].

Because $g(x, \sigma) * \delta(x - x_0) = g(x - x_0, \sigma)$, we have

\[ P(x, \sigma) = \frac{1}{N} \sum_{i=1}^{N} g(x - x_i, \sigma) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2\pi\sigma^2} \exp\Big(-\frac{\|x - x_i\|^2}{2\sigma^2}\Big). \tag{1} \]

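Ref. [10] shows that the data point $P(x, \sigma)$ is stable in a certain interval of σ, $[\sigma_1, \sigma_2]$, that is, $\forall \varepsilon > 0$,

\[ \sum_{i=1}^{N} | P(x_i, \sigma_1) - P(x_i, \sigma_2) | < \varepsilon, \tag{2} \]

where $\sigma_1$, $\sigma_2$ are two random values in the interval $[\sigma_1, \sigma_2]$. Without loss of generality, assume $\sigma_1 < \sigma_2$. Recalling (1) and (2), we have

\[ \frac{1}{N} \sum_{i=1}^{N} \Big| \sum_{j=1}^{N} \Big( \frac{1}{2\pi\sigma_1^2} \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\Big) - \frac{1}{2\pi\sigma_2^2} \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_2^2}\Big) \Big) \Big| < \varepsilon. \tag{3} \]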
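The scale-space quantities in Eqs. (1) and (2) are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, data points are plain tuples, and the normalization 1/(2πσ²) is kept exactly as printed in the definition of g(x, σ).

```python
import math

def scale_space_density(x, data, sigma):
    """P(x, sigma) of Eq. (1): the average of Gaussians centred at the data points."""
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    total = 0.0
    for xi in data:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        total += norm * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / len(data)

def stability_gap(data, sigma1, sigma2):
    """Left-hand side of Eq. (2): sum_i |P(x_i, sigma1) - P(x_i, sigma2)|."""
    return sum(abs(scale_space_density(x, data, sigma1)
                   - scale_space_density(x, data, sigma2)) for x in data)
```

A small `stability_gap` for two scales means the smoothed density changes little at the data points between them, which is the sense of stability used in Eq. (2).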

Similarly to the above analysis, Theorem 1 proves that there exists a certain range of σ for the Gaussian kernel within which the generalization performance is stable.

Theorem 1. For an SVC machine, there exists a certain range of σ, $[\sigma_1, \sigma_2]$, within which the generalization performance is stable, that is, $\forall \varepsilon > 0$,

\[ \Big| \sum_{i=1}^{N} | f(x_i, \sigma_1) - y_i | - \sum_{i=1}^{N} | f(x_i, \sigma_2) - y_i | \Big| < \varepsilon, \tag{4} \]

where $f(x_i, \sigma) = \sum_{j=1}^{N} \alpha_j y_j K(x_i, x_j, \sigma) + b$ is the decision function, and $\sigma_1$, $\sigma_2$ are two random values in the interval $[\sigma_1, \sigma_2]$. Without loss of generality, assume $\sigma_1 < \sigma_2$.

Proof. Firstly, simplifying the left side of inequality (4):

\begin{align}
& \Big| \sum_{i=1}^{N} | f(x_i, \sigma_1) - y_i | - \sum_{i=1}^{N} | f(x_i, \sigma_2) - y_i | \Big| \notag \\
& \le \sum_{i=1}^{N} | ( f(x_i, \sigma_1) - y_i ) - ( f(x_i, \sigma_2) - y_i ) | \notag \\
& = \sum_{i=1}^{N} | f(x_i, \sigma_1) - f(x_i, \sigma_2) | \tag{5} \\
& = \sum_{i=1}^{N} \Big| \Big( \sum_{j=1}^{N} \alpha'_j y_j K(x_i, x_j, \sigma_1) + b_1 \Big) - \Big( \sum_{j=1}^{N} \alpha''_j y_j K(x_i, x_j, \sigma_2) + b_2 \Big) \Big| \notag \\
& = \sum_{i,j=1}^{N} \Big| \alpha'_j y_j K(x_i, x_j, \sigma_1) - \alpha''_j y_j K(x_i, x_j, \sigma_2) + \frac{1}{N}(b_1 - b_2) \Big|. \tag{6}
\end{align}

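Usually, $b_1$ and $b_2$ are zero, so formula (6) is equivalent to

\[ \sum_{i,j=1}^{N} | \alpha'_j y_j K(x_i, x_j, \sigma_1) - \alpha''_j y_j K(x_i, x_j, \sigma_2) |. \tag{7} \]

Because $\alpha'_j \le C$ and $\alpha''_j \le C$, formula (7) can be enlarged as:

\begin{align}
& \sum_{i,j=1}^{N} | \alpha'_j y_j K(x_i, x_j, \sigma_1) - \alpha''_j y_j K(x_i, x_j, \sigma_2) | \notag \\
& \le \sum_{i,j=1}^{N} | C \cdot y_j \cdot K(x_i, x_j, \sigma_1) - C \cdot y_j \cdot K(x_i, x_j, \sigma_2) | \notag \\
& = \sum_{i,j=1}^{N} \Big| C \cdot y_j \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\Big) - C \cdot y_j \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_2^2}\Big) \Big| \notag \\
& = \sum_{i=1}^{N} \Big( \sum_{j=1}^{N^{+}} \Big| C \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\Big) - C \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_2^2}\Big) \Big| + \sum_{j=N^{+}+1}^{N} \Big| -C \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\Big) + C \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_2^2}\Big) \Big| \Big) \notag \\
& = \sum_{i,j=1}^{N} \Big| C \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\Big) - C \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_2^2}\Big) \Big|. \tag{8}
\end{align}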

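Now we only need to prove that formula (8) $\le \varepsilon$. Because $\exp(-\frac{1}{x})$ increases with $x > 0$, removing the absolute value in formula (8), we have

\[ \sum_{i,j=1}^{N} C \cdot \Big( \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_2^2}\Big) - \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\Big) \Big). \tag{9} \]

Applying the Differential Mean Value Theorem to (9), we have

\[ \sum_{i,j=1}^{N} C \cdot \Delta \cdot \Big( \frac{2}{\xi_1^3} \exp\Big(-\frac{\|x_i - x_j\|^2}{2\xi_1^2}\Big) \Big), \tag{10} \]

where $\Delta = \sigma_2 - \sigma_1$, $\xi_1 \in (\sigma_1, \sigma_2)$.

In the sequel, simplifying the left side of inequality (3):

\begin{align}
& \frac{1}{N} \sum_{i=1}^{N} \Big| \sum_{j=1}^{N} \Big( \frac{1}{2\pi\sigma_1^2} \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\Big) - \frac{1}{2\pi\sigma_2^2} \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_2^2}\Big) \Big) \Big| \notag \\
& \Rightarrow \sum_{i,j=1}^{N} \frac{1}{2\pi N} \cdot \Big( \frac{1}{\sigma_2^2} \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_2^2}\Big) - \frac{1}{\sigma_1^2} \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\Big) \Big) \notag \\
& \Rightarrow \sum_{i,j=1}^{N} \frac{1}{2\pi N} \cdot \Delta \cdot \Big| \frac{1}{\xi_2^2} - 1 \Big| \cdot \Big( \frac{2}{\xi_2^3} \cdot \exp\Big(-\frac{\|x_i - x_j\|^2}{2\xi_2^2}\Big) \Big), \tag{11}
\end{align}

where $\Delta = \sigma_2 - \sigma_1$, $\xi_2 \in (\sigma_1, \sigma_2)$. Let

\[ C = \min\Big( \frac{1}{2\pi N} \cdot \Big( \Big| \frac{1}{\xi_2^2} - 1 \Big| \Big) \cdot \frac{\xi_1^3}{\xi_2^3} \cdot \exp\Big( \frac{\|x_i - x_j\|^2}{2} \cdot \Big( \frac{1}{\xi_1^2} - \frac{1}{\xi_2^2} \Big) \Big) \;\Big|\; i, j \in 1, \ldots, N \Big), \tag{12} \]

then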

\[ \sum_{i,j=1}^{N} C \cdot \Delta \cdot \Big( \frac{2}{\xi_1^3} \exp\Big(-\frac{\|x_i - x_j\|^2}{2\xi_1^2}\Big) \Big) \le \sum_{i,j=1}^{N} \frac{1}{2\pi N} \cdot \Delta \cdot \Big| \frac{1}{\xi_2^2} - 1 \Big| \cdot \Big( \frac{2}{\xi_2^3} \exp\Big(-\frac{\|x_i - x_j\|^2}{2\xi_2^2}\Big) \Big) < \varepsilon. \]

That is, (5) ≤ (10) ≤ (11) < ε. For a given dataset, $\xi_2$ can be obtained through the scale space clustering. Once C is set, there must exist a $\xi_1$ satisfying (12), that is, the range of σ, $[\sigma_1, \sigma_2]$, must exist. The conclusion also holds for different datasets. This completes the proof.

Theorem 1 provides the theoretical basis for the existence of a certain range of σ within which, for any given dataset, the generalization performance is good. For a practical problem, we need only obtain any σ belonging to the range, and then we can achieve good generalization performance.
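The text does not spell out how the range is located in practice, so the following is only one possible sketch, under stated assumptions: σ is scanned over an increasing grid, a testing error has already been measured for each grid value, and the range is taken as the longest run of consecutive grid values whose error stays within a tolerance of the best error. The function name, the tolerance and the grid are all hypothetical.

```python
def stable_interval(sigmas, errors, tol=0.0):
    """Return (sigma_lo, sigma_hi) for the longest run of consecutive grid
    points whose error is within tol of the minimum error.

    sigmas must be sorted in increasing order and paired with errors.
    """
    best = min(errors)
    best_run, start = None, None
    for i, e in enumerate(errors):
        if e - best <= tol:
            if start is None:
                start = i                      # a new run begins here
            if best_run is None or i - start > best_run[1] - best_run[0]:
                best_run = (start, i)          # longest run seen so far
        else:
            start = None                       # the run is broken
    return sigmas[best_run[0]], sigmas[best_run[1]]
```

For example, with the made-up grid `[0.5, 1.0, 1.5, 2.0, 2.5]` and made-up errors `[10.0, 1.67, 1.67, 1.67, 5.0]`, the call returns `(1.0, 2.0)`; any σ inside that interval would then be an acceptable choice.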

3 Simulation Results

Two UCI datasets, Iris (including 60 training and 60 testing data) and Glass (including 80 training and 66 testing data), are used to verify the presented approach. For the Iris dataset, Figs. 1 and 2 describe the trends of the number of support vectors (SVs) and the testing error with σ when C takes different values. Table 1 lists the simulation results.

Fig. 1. The trend of #SVs with σ

Fig. 2. The trend of testing error with σ

Table 1. Bounds of optimal σ and testing error for Iris dataset when C takes different values

C        Bound of optimal σ    #SVs    Testing error (%)
1        [0.66, 1.16]          20      1.67
10       [1.87, 2.60]          13      1.67
50       [2.61, 2.92]          9       1.67
100      [3.69, 4.13]          9       1.67
500      [1.35, 1.56]          9       1.67
1000     [1.19, 1.61]          8       1.67
5000     [1.19, 2.49]          7       1.67
10000    [1.19, 2.47]          7       1.67


From Figs.1 and 2, it can be seen that the optimal stable interval of the parameter σ is different when C takes different values. Within the corresponding ranges, the generalization performance is stable. From Table 1, we can see the misclassification rate reaches 1.67% for all cases. Figs 3-8 illustrate the optimal hyperplanes for the Iris dataset when C takes 50, 100, 500, 1000, 5000, 10000 and σ takes an arbitrary value within the optimal ranges, respectively.

Fig. 3. The optimal hyperplane when C=50 and σ =2.8

Fig. 4. The optimal hyperplane when C=100 and σ =4.0

Fig. 5. The optimal hyperplane when C=500 and σ =1.4

Fig. 6. The optimal hyperplane when C=1000 and σ =1.5

Fig. 7. The optimal hyperplane when C=5000 and σ =1.6

Fig. 8. The optimal hyperplane when C=10000 and σ =1.8

From Figs. 3-8, it can be easily observed that the optimal hyperplanes are different when C takes different values. When C takes a small value such as 50 or 100, the corresponding hyperplane is basically linear. When C takes a large value such as 500, 1000, 5000, or 10000, the hyperplane becomes nonlinear and its shape is similar to the curve of a Gaussian function. More interestingly, the shape of the corresponding optimal hyperplane remains almost unchanged as C increases beyond a certain value, i.e., 1000.

For the Glass dataset, Figs. 9 and 10 describe the trends of the number of SVs and the testing error with σ when C takes different values, respectively. Table 2 lists the simulation results.

Fig. 9. The trend of #SVs with σ

Fig. 10. The trend of testing error with σ

From Figs. 9 and 10, it can be observed that the optimal stable interval of the parameter σ is also different when C takes different values. Similar to the Iris dataset, within the corresponding ranges, the generalization performance is stable. From Table 2, it can be seen that when C takes 500 or 1000, the corresponding testing error reaches its smallest value (27.2%). All these experiments demonstrate that if C takes a different value, the corresponding stable performance range is different. When C increases from small to large, the number of support vectors decreases, but the corresponding testing error is almost the same within the whole stable performance ranges. Therefore, for a practical problem, we need only select an arbitrary value from the range.

Table 2. Bounds of optimal σ and testing error for Glass dataset when C takes different values

C        Bound of optimal σ    #SVs    Testing error (%)
1        [1.63, 1.80]          71      30.3
10       [1.84, 2.09]          52      30.3
50       [3.03, 3.17]          46      28.8
100      [3.50, 3.85]          45      28.8
500      [5.08, 5.34]          40      27.2
1000     [5.77, 6.64]          40      27.2
5000     [8.12, 8.36]          39      28.8
10000    [11.45, 12.46]        38      28.8

4 Conclusion

This paper proves the equivalence between the influence of the Gaussian parameter σ and that of the scale on the generalization performance. More importantly, a considerable range of σ can be obtained, within which the corresponding classifier has stable and good performance. Simulation results illustrate the effectiveness of the proposed approach. Whether the results can be successfully applied to other kernels will be our future research work.

Acknowledgements

The work described in this paper was partially supported by the National Natural Science Foundation of China (No. 60673095), Key Project of Science Technology Research of Ministry of Education (No. 208021), Hi-Tech R&D (863) Program (No. 2007AA01Z165), Program for New Century Excellent Talents in University (NCET07-0525), Program for the Top Young Academic Leaders of Higher Learning Institutions, Program for Science and Technology Development in University (No. 200611001), and Program for Selective Science and Technology Development Foundation for Returned Overseas of Shanxi Province.

References

1. Broomhead, D.S., Lowe, D.: Multivariable Functional Interpolation and Adaptive Networks. Complex Systems 2, 321–355 (1988)
2. Byun, H., Lee, S.W.: Applications of Support Vector Machines for Pattern Recognition. In: Proc. of the International Workshop on Pattern Recognition with Support Vector Machines, pp. 213–236. Springer, Niagara Falls (2002)
3. Rennie, J., Rifkin, R.: Improving Multiclass Text Classification with the Support Vector Machine. Technology Report AI Memo AIM-2001-026 and CCL Memo 210. Massachusetts Institute of Technology, MIT (October 2001)
4. Tsuda, K., Ratsch, G., Mika, S., et al.: Learning to Predict the Leave-One-Out Error of Kernel Based Classifiers. In: Dorffner, G., Bischof, H., Hornik, K. (eds.) ICANN 2001. LNCS, vol. 2130, pp. 331–338. Springer, Heidelberg (2001)
5. Seeger, M.: Bayesian Model Selection for Support Vector Machines, Gaussian Processes and Other Kernel Classifiers. In: Advances in Neural Information Processing Systems, vol. 12, pp. 603–649. MIT Press, Cambridge (2000)
6. Wu, S., Amari, S.: Conformal Transformation of Kernel Functions: a Data-dependent Way to Improve Support Vector Machine Classifiers. Neural Processing Letters 15, 59–67 (2002)
7. Van Gestel, T., Suykens, J.A.K., Baestaens, D.E., et al.: Financial Time Series Prediction Using Least Squares Support Vector Machines within the Evidence Framework. IEEE Transactions on Neural Networks 12, 809–821 (2001)
8. Vapnik, V.: The Nature of Statistical Learning Theory. Wiley, Chichester (1995)
9. Wang, W.J., Xu, Z.B., Lu, W.Z., Zhang, X.Y.: Determination of the Spread Parameter in the Gaussian Kernel for Classification and Regression. Neurocomputing 55, 643–663 (2003)
10. Leung, Y., Zhang, J.S., Xu, Z.B.: Clustering by Scale-Space Filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1369–1410 (2000)
11. Li, Y., Gong, S., Sherrah, J., Liddell, H.: Multi-View Face Detection Using Support Vector Machines and Eigenspace Modeling. In: 4th International Conference on Knowledge-Based Intelligent Engineering System and Allied Technologies, Brighton, UK, pp. 241–244 (2000)
12. Zhou, W.D., Zhang, L., Jiao, L.C.: An Improved Principle for Measuring Generalization Performance. Chinese Journal of Computers 26, 598–604 (2003)

Imbalanced SVM Learning with Margin Compensation

Chan-Yun Yang1, Jianjun Wang2, Jr-Syu Yang3, and Guo-Ding Yu3

1 Department of Mechanical Engineering, Technology and Science Institute of Northern Taiwan, No. 2 Xue-Yuan Rd., Beitou, Taipei, Taiwan, 112
[email protected]
2 School of Mathematics & Statistics, Southwest University, Chongqing 400715, China
[email protected]
3 Department of Mechanical and Electro-Mechanical Engineering, Tamkang University, Taipei, Taiwan, 251
[email protected], [email protected]

Abstract. The paper surveys previous solutions and further proposes a new solution based on cost-sensitive learning for the imbalanced dataset learning problem in support vector machines. The general idea of the cost-sensitive approach is to adopt an inversely proportional penalization scheme, which forms a penalty regularized model. In this paper, additional margin compensation is further included to achieve a more accurate solution. As is known, the margin plays an important role in drawing the decision boundary. This motivates the study to produce an imbalanced margin between the classes, which enables the decision boundary to shift. The imbalanced margin is hence allowed to recompense the overwhelmed class as margin compensation. Combined with the penalty regularization, the margin compensation is capable of moderately calibrating the decision boundary and can be utilized to refine the biased boundary. The effect decreases the need for a high penalty on the minority class and protects the classification from the risk of overfitting. Experimental results show a promising potential for future applications.

Keywords: Margin, Imbalanced learning, Support vector machine, Classification, Pattern recognition.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 636–644, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

In machine learning, class imbalanced learning deserves attention for both its practical implications and its importance. Because the class imbalance problem is pervasive, abundant research has been published on the topic [1-2]. In the imbalanced case, common learning machines tend to be overwhelmed by the large classes and ignore the small ones. To solve the problem, a number of studies have modified the common machines to generate a hypothesis that is robust to the overwhelming majority [1-3]. Cost-sensitive learning was introduced first as a solution for imbalanced class learning. This kind of strategy gives a higher learning cost to the samples in the minority class to counterbalance the degree of imbalance [3-4]. A general practice is to set the misclassification costs so that errors on the minority class outweigh errors on the majority class. The reweighting scheme is generally merged into the common version of classification algorithms [4]. Corresponding solutions for imbalanced class learning in support vector machines (SVMs) [6-7] can also be found in [8-10]. These solutions actually share the same technical merit of balancing costs. Veropoulos et al. [8] suggested a solution for cost-sensitive learning which used different penalty constants for different classes of data to make errors on minority-class samples costlier than errors on majority-class samples. Akbani et al. [10] developed a method combining the synthetic minority over-sampling technique (SMOTE) [9] with the Veropoulos different-cost algorithm to push the biased decision boundary away from the minority class. Among the efficient methods developed, the Veropoulos cost regularized method deserves more attention because its formulation is intrinsically coherent with the original prototype of SVM. In fact, the remedy has been widely applied and extended in many applications [9, 11-15].

2 Veropoulos Cost-Sensitive Model

The study starts with the penalty regularized model proposed by Veropoulos et al. [8]. The key idea of the model is to introduce unequal penalties for the samples in the imbalanced classes [16]. In the optimization, the misclassification of a positive (minority) sample receives a higher penalty than the misclassification of a negative (majority) sample. The high penalty then translates into a bias for the Lagrange multipliers because the cost of the corresponding misclassification is heavier. This drifts the decision boundary from the positive class towards the negative class. The imbalanced dataset learning starts with a set $T$ consisting of $l^+$ positive and $l^-$ negative training samples in a d-dimensional input space $\Re^d$:

\[ T = \{ (x_p, y_p) \cup (x_n, y_n) \mid y_p = +1, \; y_n = -1, \; x \in \Re^d \}, \tag{1} \]

where $p$, ranging from 1 to $l^+$, and $n$, ranging from 1 to $l^-$, denote respectively the indices of the samples in the positive and negative classes. In imbalanced dataset learning, the training set $T$ generally has unequal class sizes, $l^- > l^+$, because the positive class is statistically under-represented with respect to the negative class. With the set $T$, the Veropoulos model, based on the soft-margin SVM [17], learns the target concept $f(x) = w^T \Lambda(x) + b$:

\[ \min \; \frac{1}{2} \|w\|^2 + C^+ \sum_{p=1}^{l^+} \xi_p + C^- \sum_{n=1}^{l^-} \xi_n, \tag{2} \]

subject to

\[ y_p (w^T \Lambda(x_p) + b) \ge 1 - \xi_p \;\text{for}\; \{p \mid y_p = +1\}, \quad y_n (w^T \Lambda(x_n) + b) \ge 1 - \xi_n \;\text{for}\; \{n \mid y_n = -1\}, \quad \text{and}\; \xi_p, \xi_n \ge 0, \tag{3} \]


where $C^+$ and $C^-$ denote the penalty constants for the positive and negative class, respectively. In these expressions, a map function $\Lambda: T \to H$, mapping the learning set from the lower $d$-dimensional input space to a higher-dimensional reproducing kernel Hilbert space (RKHS) $H$, is introduced to solve a generalized non-linear classification problem [5]. In the space $H$, the non-linear problem can be solved linearly. The target concept $f(\mathbf{x}) = \mathbf{w}^T \Lambda(\mathbf{x}) + b$ refers to the decision hyperplane in the imbalanced classification. The weight vector $\mathbf{w}$ is normal to the decision boundary in the Hilbert space $H$, the bias $b$ is a scalar that offsets the decision boundary, and the slack variables $\xi_i$ compensate for samples that do not satisfy the boundary constraints. The model thus corrects the decision boundary by assigning different costs to misclassifications in the different classes. In general, a misclassification in the positive class is costlier than one in the negative class. The smaller the positive class, the higher its misclassification cost. With the techniques of constrained optimization, the dual form of the cost-sensitive model for solving the imbalanced classification problem can be represented as:



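$$\arg\max_{\alpha}\ \sum_{i=1}^{l^+ + l^-} \alpha_i - \frac{1}{2} \sum_{i=1}^{l^+ + l^-} \sum_{j=1}^{l^+ + l^-} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j), \tag{4}$$

subject to

$$0 \le \alpha_p \le C^+, \quad 0 \le \alpha_n \le C^-, \tag{5}$$

and

$$\sum_{p=1}^{l^+} \alpha_p = \sum_{n=1}^{l^-} \alpha_n, \tag{6}$$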

where $k(\cdot,\cdot)$ is a kernel function given by

$$k(\mathbf{x}_i, \mathbf{x}_j) = \Lambda(\mathbf{x}_i)^T \Lambda(\mathbf{x}_j). \tag{7}$$

The derivation from (2)-(3) to (4)-(6) follows the same steps as in the soft-margin SVM [17].
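As an illustration of the primal objective (2), the following minimal sketch evaluates the class-weighted cost for a linear map, that is, assuming $\Lambda(\mathbf{x}) = \mathbf{x}$. The function name `veropoulos_objective` and the toy samples are hypothetical and are not taken from the paper; each slack is set to the smallest value that satisfies (3).

```python
def veropoulos_objective(w, b, samples, c_pos, c_neg):
    """Evaluate 0.5*||w||^2 + C+ * sum(xi_p) + C- * sum(xi_n) for a linear map."""
    reg = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for x, y in samples:
        f = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack = max(0.0, 1.0 - y * f)  # smallest slack satisfying constraint (3)
        penalty += (c_pos if y > 0 else c_neg) * slack
    return reg + penalty


# Toy data (assumption): two positive and two negative samples in 2-D.
samples = [((2.0, 0.0), +1), ((0.5, 0.0), +1), ((-2.0, 0.0), -1), ((0.0, 0.0), -1)]
w, b = (1.0, 0.0), 0.0

equal_cost = veropoulos_objective(w, b, samples, c_pos=1.0, c_neg=1.0)
minority_heavy = veropoulos_objective(w, b, samples, c_pos=5.0, c_neg=1.0)
```

Raising `c_pos` relative to `c_neg` increases only the contribution of the positive sample inside the margin, which is what pushes an optimizer to move the boundary away from the minority class.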

3 Extended Model with Margin Compensation

In the Veropoulos model, changing the misclassification costs is equivalent to changing the penalties from the viewpoint of the loss function. Preliminary work on changing the loss function in SVM can be found in [18]. This study extends the model from the same viewpoint. The hinge loss function of the soft-margin SVM [17], $\varphi(y, f(\mathbf{x})) = \max(0, 1 - y f(\mathbf{x}))$, is adopted and modified to develop the extended model. Based on the hinge loss, separate loss functions for the positive and negative class are proposed as follows. The proposal changes the slope and the hinge point of the inclined segment of the hinge loss through two additional constants, so that different costs can be allocated to the positive and the negative class.

$$\xi_p^+ = \varphi^+(y_p, f(\mathbf{x}_p)) = \begin{cases} 0 & \text{if } a^+ y_p f(\mathbf{x}_p) \ge 1, \\ c^+ \left(1 - a^+ y_p f(\mathbf{x}_p)\right) & \text{otherwise,} \end{cases} \tag{8}$$

and

$$\xi_n^- = \varphi^-(y_n, f(\mathbf{x}_n)) = \begin{cases} 0 & \text{if } a^- y_n f(\mathbf{x}_n) \ge 1, \\ c^- \left(1 - a^- y_n f(\mathbf{x}_n)\right) & \text{otherwise,} \end{cases} \tag{9}$$

where the paired constants $(c^+, a^+)$ and $(c^-, a^-)$ are assigned to positive and negative samples, respectively, to change the slopes and the hinge points of the inclined segment of the loss function. Referring to (2)-(3) of the Veropoulos model, we replace tuning by the penalty constants $C^+$ and $C^-$ with tuning by the paired constants $(c^+, a^+)$ and $(c^-, a^-)$. Besides $c^+$ and $c^-$, which are analogous to $C^+$ and $C^-$ in the Veropoulos model, the constants $a^+$ and $a^-$ are added to regularize the imbalanced learning problem not only by penalty but also by margin. This provides more freedom in dealing with the imbalanced dataset. With the paired constants, the primal problem of the soft-margin SVM for imbalanced dataset learning can be rewritten as:

$$\min\ \frac{1}{2}\|\mathbf{w}\|^2 + C \Big( \sum_{p} \xi_p^+ + \sum_{n} \xi_n^- \Big), \tag{10}$$

subject to

$$y_p(\mathbf{w}^T \Lambda(\mathbf{x}_p) + b) \ge \frac{c^+ - \xi_p^+}{c^+ a^+} \ \text{ for } \{p \mid y_p = +1\}, \quad y_n(\mathbf{w}^T \Lambda(\mathbf{x}_n) + b) \ge \frac{c^- - \xi_n^-}{c^- a^-} \ \text{ for } \{n \mid y_n = -1\}, \quad \text{and } \xi_p^+, \xi_n^- \ge 0. \tag{11}$$
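The link between the loss function (8) and the constraint for positive samples in (11) can be made explicit with a short derivation. This is a sketch that assumes $c^+ > 0$ and $a^+ > 0$; the negative class follows in the same way with $(c^-, a^-)$.

```latex
% Active branch of (8): a^+ y_p f(x_p) < 1
\xi_p^+ = c^+\bigl(1 - a^+ y_p f(\mathbf{x}_p)\bigr)
\;\Longrightarrow\;
a^+ y_p f(\mathbf{x}_p) = 1 - \frac{\xi_p^+}{c^+}
\;\Longrightarrow\;
y_p f(\mathbf{x}_p) = \frac{c^+ - \xi_p^+}{c^+ a^+}.

% Inactive branch of (8): a^+ y_p f(x_p) >= 1, so \xi_p^+ = 0
y_p f(\mathbf{x}_p) \ge \frac{1}{a^+} = \frac{c^+ - 0}{c^+ a^+}.

% Both branches are covered by the single inequality in (11):
y_p f(\mathbf{x}_p) \ge \frac{c^+ - \xi_p^+}{c^+ a^+}, \qquad \xi_p^+ \ge 0 .
```

Read this way, $1/a^+$ plays the role of the required functional margin for the positive class, so lowering $a^+$ relative to $a^-$ asks for a wider margin on the minority side.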

In this formulation, the constants $c^+$, $c^-$ and $C$ together play the role of the constants $C^+$ and $C^-$ in the Veropoulos model. The equivalent settings are $C^+ = C c^+$ and $C^- = C c^-$. Two constants, $a^+$ and $a^-$, are added to regularize the model. The pair $c^+$ and $c^-$ controls the penalty, and the additional pair $a^+$ and $a^-$ controls the margin. Following a derivation similar to that of the soft-margin SVM [17], a quadratic programming problem is obtained for imbalanced dataset learning:

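$$\arg\max_{\alpha}\ \sum_{i=1}^{l^+ + l^-} \frac{\alpha_i}{a_i} - \frac{1}{2} \sum_{i=1}^{l^+ + l^-} \sum_{j=1}^{l^+ + l^-} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j), \tag{12}$$

subject to

$$0 \le \alpha_p \le c^+ a^+ C, \quad 0 \le \alpha_n \le c^- a^- C, \tag{13}$$

and

$$\sum_{i=1}^{l^+ + l^-} y_i \alpha_i = 0, \tag{14}$$

with $a_i = a^+$ for positive samples and $a_i = a^-$ for negative samples.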
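The effect of the paired constants can be sketched in a few lines. The helper names `compensated_hinge` and `dual_upper_bound` are hypothetical; the first evaluates the loss (8)/(9) for one sample, assuming $c > 0$, and the second evaluates the upper bound of the box constraint (13).

```python
def compensated_hinge(y, f, c, a):
    """Loss (8)/(9): zero when a*y*f >= 1, otherwise c * (1 - a*y*f). Assumes c > 0."""
    return max(0.0, c * (1.0 - a * y * f))


def dual_upper_bound(c, a, big_c):
    """Upper bound of the box constraint (13): alpha <= c * a * C."""
    return c * a * big_c


# With c = a = 1 the loss reduces to the ordinary hinge loss.
plain = compensated_hinge(y=+1, f=0.5, c=1.0, a=1.0)
# A larger c steepens the slope (penalty compensation).
steeper = compensated_hinge(y=+1, f=0.5, c=5.0, a=1.0)
# A smaller a moves the hinge point from y*f = 1 to y*f = 1/a (margin compensation).
shifted = compensated_hinge(y=+1, f=1.5, c=5.0, a=0.5)
```

A sample with `y * f = 1.5` incurs no loss under the ordinary hinge, but with `a = 0.5` the hinge point sits at `y * f = 2`, so the same sample is still penalized.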

Figure 1 illustrates the motivation of the study. Two classes, each drawn as a shaded disk whose area indicates the class size, are shown in both top and lateral views with their centers aligned horizontally. A Gaussian distribution is assumed for the data points in the shaded regions. The inclined segments rising from the horizontal axis to the tops of the Gaussian curves are analogous to the loss functions, apart from the sign of the slope. Initially, the heights of the assumed Gaussians are normalized to be equal, which corresponds to the uncompensated condition. The decision boundary drawn vertically through the intersection of the segments is biased away from the ideal decision boundary (Fig. 1a). The Veropoulos model, which applies a costlier penalty to misclassifications in the positive class, is analogous to raising the height of the corresponding Gaussian. Because of the raised height, the decision boundary drawn through the intersection shifts closer to the ideal boundary (Fig. 1b). Furthermore, if the margin compensation from (8)-(9) is adopted, the bias is reduced further (Fig. 1c).

[Figure 1 panels: (a) Uncompensated learning; (b) Compensated with adjusted penalty; (c) Compensated further with an adjustable margin. Labels: Majority Class, Minority Class, Ideal Decision Boundary, Actual Decision Boundary, Bias.]

Fig. 1. The compensation by adjusting simultaneously the penalty and margin

4 Experiments and Results

4.1 Evidence of Margin Compensation

Evidence can be found by progressively increasing the margin compensation. The experiments use a 2-dimensional dataset of two classes generated from multivariate normal distributions with unit variance, centered at $(\sqrt{2}/2, \sqrt{2}/2)$ and $(-\sqrt{2}/2, -\sqrt{2}/2)$, respectively. The ratio of examples in the positive and negative class, marked "□" and "○" respectively, is 20:100. For convenient visual comparison, the ideal decision boundaries are drawn as heavy dashed lines in the panels of Fig. 2 and 3. Using a grid search with cross-validation, the penalty constant C is set to 1 for near-optimal generalization performance. With this setting of C, classifications are performed with different c+/c- and a+/a- ratios, as shown in the panels of Fig. 2 and 3. The resulting decision boundaries and their imbalanced margins are drawn as heavy solid lines and light dashed lines, respectively. Fig. 2b and 3b show the decision boundaries of the compensated Veropoulos model. Compared with those of the uncompensated model (Fig. 2a and 3a), the bias of the decision boundary is reduced, but margin compensation is still lacking (a+/a- = 1). The ratio a+/a- is then gradually decreased from 0.9 to 0.3. The results show that the decision boundaries improve further with margin compensation, regardless of whether the linear kernel or the second-order polynomial kernel is used (Fig. 2c-2d and 3c-3d). With the goal of higher generalization performance, the improvements include shifting the boundary closer to the ideal boundary, rotating the boundary towards the orientation of the ideal boundary, and flattening the boundary into a smoother curve.
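A dataset of this kind could be generated roughly as follows. This is only a sketch under stated assumptions: the class centers are taken to be $(\sqrt{2}/2, \sqrt{2}/2)$ and $(-\sqrt{2}/2, -\sqrt{2}/2)$, the covariance is the identity matrix, and the function name `make_imbalanced_gaussians` and the seed are hypothetical. It does not reproduce the exact samples used in the paper.

```python
import math
import random


def make_imbalanced_gaussians(n_pos=20, n_neg=100, seed=0):
    """Draw a 2-D two-class sample with unit variance per coordinate (assumed)."""
    rng = random.Random(seed)
    m = math.sqrt(2.0) / 2.0  # assumed coordinate of the class centers
    pos = [((rng.gauss(m, 1.0), rng.gauss(m, 1.0)), +1) for _ in range(n_pos)]
    neg = [((rng.gauss(-m, 1.0), rng.gauss(-m, 1.0)), -1) for _ in range(n_neg)]
    return pos + neg


data = make_imbalanced_gaussians()
n_positive = sum(1 for _, y in data if y == +1)
n_negative = sum(1 for _, y in data if y == -1)
```

Passing `n_pos=30, n_neg=90` gives the 30:90 ratio used for the later performance experiment.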

[Figure 2 panels: (a) c+/c- = 1, a+/a- = 1; (b) c+/c- = 5, a+/a- = 1; (c) c+/c- = 5, a+/a- = 0.5; (d) c+/c- = 5, a+/a- = 0.3.]

Fig. 2. Effect of margin compensation with linear kernel

[Figure 3 panels: (a) c+/c- = 1, a+/a- = 1; (b) c+/c- = 5, a+/a- = 1; (c) c+/c- = 5, a+/a- = 0.5; (d) c+/c- = 5, a+/a- = 0.3.]

Fig. 3. Effect of margin compensation with second order polynomial kernel

4.2 Performance Improvement with Margin Compensation

A performance indicator is needed to assess the improvement. As is common in the analysis of imbalanced learning, the metrics tp and tn, which measure the rates of true positive and true negative predictions in the confusion matrix illustrated in Fig. 4 [19], are defined as

$$tp = \frac{TP}{l^+}, \quad tn = \frac{TN}{l^-}. \tag{15}$$

With the metrics tp and tn, the indicator gmean, which measures the geometric mean of the tp rate and the tn rate, was proposed by Kubat et al. [20]:

$$gmean = \sqrt{tp \cdot tn}. \tag{16}$$

As is known, gmean is high only when both the tp and tn rates are high and close to each other. If either of them drops, the imbalance between the two scores lowers the gmean value. This criterion meets the requirements of the performance assessment. Following the previous procedure, a 30:90 imbalanced set is generated for the experiment. The experiment uses gmean averaged over 10-fold cross-validation to test the performance under varied settings of c+/c- and a+/a-. Fig. 5 shows the resulting performance contours. As shown, the imbalanced dataset prefers a lower ratio of a+/a- combined with a high penalty ratio c+/c- for high generalization performance. This is because a lower a+/a- ratio straightens the over-buckled separating hyperplane caused by the high penalty applied to the minority class. An over-buckled hyperplane implies high model complexity, which would lead the classifier to overfit. In contrast to the coupled setting of high c+/c- and low a+/a-, the choice of low c+/c- with high a+/a- also gives a high gmean value. This type of setting, however, is not applicable in practice: our experiment shows that a non-negligible bias still remains when the c+/c- ratio is insufficient.
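The gmean indicator of (15)-(16) is easy to compute from labels and predictions. The sketch below assumes labels in {+1, -1} and at least one sample of each class; the function name `gmean` and the toy predictions are illustrative only.

```python
import math


def gmean(y_true, y_pred):
    """Geometric mean of the true-positive and true-negative rates, as in (15)-(16)."""
    l_pos = sum(1 for t in y_true if t == 1)
    l_neg = sum(1 for t in y_true if t == -1)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return math.sqrt((tp / l_pos) * (tn / l_neg))


# Toy case (assumption): 4 positive and 16 negative samples.
y_true = [1] * 4 + [-1] * 16
all_negative = [-1] * 20                   # classifier that ignores the minority class
one_hit = [1, -1, -1, -1] + [-1] * 16      # finds one of the four positives

score_all_negative = gmean(y_true, all_negative)
score_one_hit = gmean(y_true, one_hit)
```

The all-negative classifier is correct on 16 of 20 samples, yet its gmean is zero, which is why gmean is preferred over plain accuracy for imbalanced data.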

                        Actual Class
                        p                 n
Predicted Class   T     True Positive     False Positive
                  F     False Negative    True Negative

Fig. 4. Confusion matrix for imbalanced learning analysis

[Figure 5: contour plot of gmean; axes: a+/a- ratio and c+/c- ratio.]

Fig. 5. Performance contours with varied c+/c- and a+/a- ratios

5 Conclusion

The essential aspects of margin compensation in imbalanced SVM learning were presented. The paper developed imbalanced margins for learning with an imbalanced dataset. The development provides an opportunity to further calibrate the decision boundary produced by underlying cost-sensitive balancing approaches, such as the Veropoulos penalty-regularized model. Combined with the cost-sensitive balancing approaches, the compensation not only effectively reduced the bias caused by the class imbalance but also gave the resulting classifier potentially good generalization performance.


References

1. Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: Special Issue on Learning from Imbalanced Data Sets. SIGKDD Explorations Newsletter 6, 1–6 (2004)
2. Weiss, G.M.: Mining with Rarity: A Unifying Framework. Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining 6, 7–19 (2004)
3. Domingos, P.: MetaCost: A General Method for Making Classifiers Cost Sensitive. In: Fifth International Conference on Knowledge Discovery and Data Mining, pp. 155–164. ACM Press, New York (1999)
4. Elkan, C.: The Foundations of Cost-Sensitive Learning. In: Seventeenth International Joint Conference on Artificial Intelligence, pp. 973–978. Morgan Kaufmann, San Francisco (2001)
5. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
6. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, New York (1998)
7. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2, 121–167 (1998)
8. Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the Sensitivity of Support Vector Machines. In: International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 55–60 (1999)
9. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
10. Akbani, R., Kwek, S., Japkowicz, N.: Applying Support Vector Machines to Imbalanced Datasets. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 39–50. Springer, Heidelberg (2004)
11. Cohen, G., Hilario, M., Pellegrini, C.: One-Class Support Vector Machines with a Conformal Kernel - A Case Study in Handling Class Imbalance. In: Fred, A.L.N., Caelli, T., Duin, R.P.W., Campilho, A.C., Ridder, D. (eds.) SSPR&SPR 2004. LNCS, vol. 3138, pp. 850–858. Springer, Heidelberg (2004)
12. Campadelli, P., Casiraghi, E., Valentini, G.: Support Vector Machines for Candidate Nodules Classification. Neurocomputing 68, 281–289 (2005)
13. Lee, K.K., Gunn, S.R., Harris, C.J., Reed, P.A.S.: Classification of Imbalanced Data with Transparent Kernels. In: INNS-IEEE International Joint Conference on Neural Networks, pp. 2410–2415. IEEE Press, Washington (2001)
14. Callut, J., Dupont, P.: Fβ Support Vector Machines. In: International Joint Conference on Neural Networks, pp. 1443–1448. IEEE Press, Montreal (2005)
15. Shin, H., Cho, S.B.: Response Modeling with Support Vector Machines. Expert Systems with Applications 30, 746–760 (2006)
16. Karakoulas, G.J., Shawe-Taylor, J.: Optimizing Classifiers for Imbalanced Training Sets. Advances in Neural Information Processing Systems 11, 253–259 (1999)
17. Cortes, C., Vapnik, V.N.: Support Vector Networks. Machine Learning 20, 273–297 (1995)
18. Yang, C.Y.: Generalization Ability in SVM with Fuzzy Class Labels. In: International Conference on Computational Intelligence and Security 2006 (CIS 2006). IEEE Press, Guangzhou (2006)
19. Fawcett, T.: An Introduction to ROC Analysis. Pattern Recognition Letters 27, 861–874 (2006)
20. Kubat, M., Holte, R., Matwin, S.: Learning when Negative Examples Abound. In: Someren, M.V., Widmer, G. (eds.) ECML 1997. LNCS, vol. 1224, pp. 146–153. Springer, Heidelberg (1997)

Path Algorithms for One-Class SVM

Liang Zhou, Fuxin Li, and Yanwu Yang

Institute of Automation, Chinese Academy of Sciences, 100190 Beijing, China
[email protected],{fuxin.li,yanwu.yang}@ia.ac.cn

Abstract. The One-Class Support Vector Machine (OC-SVM) is an unsupervised learning algorithm that identifies unusual or outlying points (outliers) in a given dataset. In OC-SVM, the regularization hyperparameter and the kernel hyperparameter must be set in order to obtain a good estimate. Cross-validation is often used for this purpose, but it requires multiple runs with different hyperparameters, which makes it very slow. Recently, solution path algorithms have become popular. They can obtain the solutions for all hyperparameter values in a single run rather than re-solving the optimization problem multiple times. Generalizing previous solution path algorithms for SVMs, this paper proposes a complete set of solution path algorithms for OC-SVM, including a ν-path algorithm and a kernel-path algorithm. In the kernel-path algorithm, a new method is proposed to avoid the failure of the algorithm due to an indefinite matrix. Using these algorithms, we can obtain the optimum hyperparameters by computing an entire solution path with computational cost $O(n^2 + cnm^3)$ for the ν-path algorithm or $O(cn^3 + cnm^3)$ for the kernel-path algorithm (c: a constant, n: the number of samples, m: the number of samples on the margin).

Keywords: Path algorithm, One-Class SVM, Regularization, Kernel.

1 Introduction

Support Vector Machines (SVMs) are a family of powerful statistical learning techniques for pattern recognition, regression and density estimation problems. They have proven effective in many practical applications. SVMs are based on the structural risk minimization (SRM) induction principle, which is derived from statistical learning theory [1]. Tax [2] and Schölkopf [3] independently proposed the One-Class Support Vector Machine (OC-SVM) as an extension of SVMs to identify unusual or outlying points (outliers) in a given dataset. OC-SVM has been widely applied in many areas, such as outlier ranking, minimum volume set estimation and density estimation. It plays an especially important role in the field of intrusion detection.

In machine learning and pattern recognition, most problems can be transformed into optimization problems for which the values of some hyperparameters have to be specified in advance. In the case of OC-SVM, the regularization hyperparameter and the kernel hyperparameter have to be specified beforehand. Many ways ([4],[5]) have been proposed to explore optimum hyperparameters in learning algorithms. Recently, solution path algorithms have come into focus. The idea of the solution path algorithm originated with Efron [6], where the least angle regression (LARS) algorithm was proposed to calculate all possible Lasso estimates (for different values of the regularization hyperparameter) for a given problem in a single run. Rosset [7] found that any optimization problem with an L1 regularization and a quadratic, piecewise quadratic, piecewise linear, or linear loss function has a piecewise linear regularization path. Following this direction, Zhu [8] proposed an entire regularization path algorithm for the L1-norm support vector classification (SVC), and Zhu [9] proposed a similar algorithm for the standard L2-norm SVC. Both algorithms rely on the property that the paths are piecewise linear. For Support Vector Regression (SVR), a similar approach was used to build a path algorithm [10]. Meanwhile, Lee [11] used Zhu's approach [9] to obtain a regularization path algorithm for OC-SVM.

However, all the algorithms mentioned above produce only the solution path for the regularization hyperparameter, not for the kernel hyperparameter, which is very important for good performance. To the best of our knowledge, no algorithm has been proposed to compute the solution path for the kernel hyperparameter in OC-SVM. Recently, Wang [12] provided an approach to explore the kernel path for SVC. However, if we simply adopt Wang's approach to compute the path of OC-SVM, invalid solutions may occur due to an indefinite matrix and make the algorithm fail. In this paper, we propose a complete set of algorithms, following Wang's approach, to compute the entire solution path of OC-SVM for both the regularization hyperparameter and the kernel hyperparameter. In computing the path for the kernel hyperparameter, we propose a mathematical trick to avoid the failure of Wang's algorithm due to an indefinite matrix. Our experiments on two synthetic datasets show that our algorithms are practical and efficient for finding the regularization hyperparameter or the kernel hyperparameter for OC-SVM problems.

The remainder of the paper is organized as follows. Section 2 introduces OC-SVM. Section 3 analyzes the optimization problem for OC-SVM, which is the basis for the solution path algorithms. The details of our solution path algorithms for OC-SVM are presented in Section 4. Experiments are presented in Section 5, and Section 6 concludes this paper.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 645–654, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 One-Class SVM

The OC-SVM was proposed as a support vector method for estimating a set, called the one-class, that encloses "most" of a given training dataset $x_i \in R^d$, $i = 1, \dots, n$, without any class information. It attempts to find a hyperplane in the feature space that separates the data from the origin with maximum margin. The primal form of the one-class optimization problem proposed by Schölkopf [3] is:

$$\min_{w, \xi, \rho}\ Risk_{primal} = \frac{1}{2} w^T w - \rho + \frac{1}{\tilde{\nu} n} \sum_{i=1}^{n} \xi_i \tag{1}$$

$$\text{s.t. } w^T \phi(x_i) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n, \quad \tilde{\nu} \in [0, 1].$$


The hyperparameter $\tilde{\nu} \in [0, 1]$ in OC-SVM acts as an upper bound on the fraction of outliers. The Lagrangian dual of (1) is:

$$\min_{\alpha}\ Risk_{dual} = \frac{1}{2} \alpha^T Q \alpha \quad \text{s.t. } 0 \le \alpha_i \le 1,\ i = 1, \dots, n, \quad \sum_{i=1}^{n} \alpha_i = \nu, \quad \nu \in [0, n], \tag{2}$$

where $\nu = \tilde{\nu} n$, $Q_{ij} = K_\sigma(x_i, x_j)$, and $K_\sigma(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is a positive definite kernel function with a kernel hyperparameter $\sigma$. The decision function is $\mathrm{sign}\big(\sum_{i=1}^{n} \alpha_i K_\sigma(x_i, x) - \rho\big)$.

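The relevant Karush-Kuhn-Tucker (KKT) complementarity conditions are:

$$\alpha_i [w^T \phi(x_i) - \rho + \xi_i] = 0 \quad \text{and} \quad \beta_i \xi_i = 0. \tag{3}$$

From the KKT complementarity conditions, we obtain:

$$w^T \phi(x_i) < \rho \Rightarrow \alpha_i = 1,\ \beta_i = 0,\ \xi_i > 0$$
$$w^T \phi(x_i) = \rho \Rightarrow \alpha_i \in (0, 1),\ \beta_i \in (0, 1),\ \xi_i = 0$$
$$w^T \phi(x_i) > \rho \Rightarrow \alpha_i = 0,\ \beta_i = 1,\ \xi_i = 0$$

These three cases correspond to points lying inside the margin ($w^T \phi(x_i) < \rho$), on the margin ($w^T \phi(x_i) = \rho$) and outside the margin ($w^T \phi(x_i) > \rho$), respectively.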
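The decision function and the three KKT cases can be sketched as follows. The Gaussian kernel form $K_\sigma(x, y) = \exp(-\sigma \|x - y\|^2)$ is the one used in the experiments of this paper; the function names, the tolerance `tol` and the toy values are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, y, sigma):
    """K_sigma(x, y) = exp(-sigma * ||x - y||^2)."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(x, y)))


def decision_value(x, points, alpha, rho, sigma):
    """sum_i alpha_i * K_sigma(x_i, x) - rho; its sign is the OC-SVM decision."""
    return sum(a * gaussian_kernel(xi, x, sigma) for a, xi in zip(alpha, points)) - rho


def kkt_partition(alpha, tol=1e-9):
    """Group indices by dual variable: alpha = 1, 0 < alpha < 1, alpha = 0."""
    at_upper = [i for i, a in enumerate(alpha) if a >= 1.0 - tol]
    on_margin = [i for i, a in enumerate(alpha) if tol < a < 1.0 - tol]
    at_zero = [i for i, a in enumerate(alpha) if a <= tol]
    return at_upper, on_margin, at_zero


pts = [(0.0, 0.0)]
near = decision_value((0.0, 0.0), pts, [1.0], rho=0.5, sigma=1.0)
far = decision_value((3.0, 0.0), pts, [1.0], rho=0.5, sigma=1.0)
groups = kkt_partition([1.0, 0.4, 0.0, 0.6])
```

The three index groups returned by `kkt_partition` are the sets that Section 3 calls L, E and R.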

3 Problem Analysis

In this section, we look into the OC-SVM problem for insights into designing an efficient path algorithm. Let us fix one hyperparameter (the regularization hyperparameter ν or the kernel hyperparameter σ) and define the path algorithm for the other. The optimal solution may then be regarded as a vector-valued function of the free hyperparameter (denoted as μ):

$$(\hat{\alpha}(\mu), \hat{\rho}(\mu)) = \arg\min_{\alpha, \rho} Risk_{dual}(\alpha, \rho \mid \mu).$$

The optimal solution varies as μ changes. For every value of μ, we can partition the training dataset into the following three subsets L, E and R:

$$L = \Big\{ i : \sum_{j=1}^{n} \alpha_j K_\mu(x_i, x_j) < \rho,\ \alpha_i = 1 \Big\}$$
$$E = \Big\{ i : \sum_{j=1}^{n} \alpha_j K_\mu(x_i, x_j) = \rho,\ \alpha_i \in (0, 1) \Big\} \tag{4}$$
$$R = \Big\{ i : \sum_{j=1}^{n} \alpha_j K_\mu(x_i, x_j) > \rho,\ \alpha_i = 0 \Big\}$$

where $\sum_{j=1}^{n} \alpha_j K_\mu(x_i, x_j) = \sum_{j=1}^{n} \alpha_j \phi(x_j)^T \phi(x_i) = w^T \phi(x_i)$.

Suppose the subset E contains m items, which are represented as an m-tuple $(E(1), \dots, E(m))$. Let $\hat{\alpha}_E = (\rho, \alpha_{E(1)}, \dots, \alpha_{E(m)})$. Equation (4) gives a linear system $L(\hat{\alpha}_E(\mu), \mu) \triangleq \sum_{j=1}^{n} \alpha_j K_\mu(x_i, x_j) - \rho = 0$, $i \in E$, with m linear equations. If μ now increases by an infinitesimally small step ε such that the three subsets L, E, R remain unchanged, the corresponding linear system becomes $L(\hat{\alpha}_E(\mu + \epsilon), \mu + \epsilon) = 0$. Adding the constraint in (2), we have

$$L(\hat{\alpha}_E(\mu + \epsilon), \mu + \epsilon) = L(\hat{\alpha}_E(\mu), \mu + \epsilon) + [-\mathbf{1}, K_\mu^E]\,(\hat{\alpha}_E(\mu + \epsilon) - \hat{\alpha}_E(\mu)),$$
$$\Delta\nu = \sum_{i=1}^{m} \big(\hat{\alpha}_{E(i)}(\mu + \epsilon) - \hat{\alpha}_{E(i)}(\mu)\big), \tag{5}$$

where $\mathbf{1} = (1, \dots, 1)^T$ and $K_\mu^E = [K_\mu(x_{E(i)}, x_{E(j)})]_{i,j=1}^{m}$. The next solution $\hat{\alpha}_E(\mu + \epsilon)$ can then be updated as

$$\hat{\alpha}_E(\mu + \epsilon) = \hat{\alpha}_E(\mu) + \begin{pmatrix} -\mathbf{1} & K_\mu^E \\ 0 & \mathbf{1}^T \end{pmatrix}^{-1} \begin{pmatrix} -L(\hat{\alpha}_E(\mu), \mu + \epsilon) \\ \Delta\nu \end{pmatrix}. \tag{6}$$

Hence, given a hyperparameter μ and its corresponding optimal solution, the solutions for neighboring hyperparameter values can be computed exactly as long as the three point subsets remain unchanged. However, when we change the value of μ to a larger extent, some points in the subsets L, E, R might enter other subsets. An event is said to occur when some subsets change. We categorize events as follows:

– A new point i from L or R joins E, i.e., the condition for a variable $\alpha_i$ with $i \in L$ or with $i \in R$ ceases to hold if $\hat{\alpha}_{E(i)}$ keeps moving in the same direction.
– The variable $\alpha_i$ for some $i \in E$ reaches 0 or 1. In this case, the linear system (4) will cease to hold if $\hat{\alpha}_{E(i)}$ changes further in the same direction, i.e., point i leaves E and joins some other subset.

By monitoring the occurrence of these events, we can find the next breakpoint at which the updating formula needs to be calculated again. The algorithm then updates the point subsets and continues to trace the path.

4 Path Algorithm

Based on the observations from the last section, we go on to design path algorithms for OC-SVM. Since there is no supervised information on the training dataset in OC-SVM, its path algorithm differs from those for classification and regression. This section first introduces the two path algorithms, for the regularization hyperparameter and the kernel hyperparameter respectively, and then presents the computational complexity analysis.

4.1 ν-Path

Similar to the classification and regression path algorithms, the OC-SVM path algorithm focuses only on the points at the elbow. The algorithm simply decreases ν, iterates through all the events that occur as ν decreases, and computes the coefficients of the piecewise linear path at each event. Let $\nu^l$ denote the value of ν right after the l-th event has occurred. We assume that the kernel hyperparameter is prespecified by the user and remains fixed during the execution of the ν-path algorithm.

Initialization. We start from ν = n. If we set ν > n, the dual problem (2) has no solution. So we set ν = n as the initial hyperparameter value, and it is then trivial to solve the optimization problem in (2). The solution is simply $\alpha_i = 1$ for all i, meaning that all the points are inside the margin. For ν = n, we have $\sum_{j=1}^{n} \alpha_j \phi(x_j)^T \phi(x_i) = w^T \phi(x_i) \le \rho$ for all i. So the solution ρ can be any value with $\rho \ge \max_i \sum_{j=1}^{n} \alpha_j \phi(x_j)^T \phi(x_i)$. The initial values are set as ν = n and $\rho = \max_i \sum_{j=1}^{n} \alpha_j \phi(x_j)^T \phi(x_i)$. At this point, we have $|E| > 0$. Compared with the SVC path algorithms, which have to solve linear equations to find the initial hyperparameter values, the initialization problem for the one-class ν-path algorithm is much easier to solve.
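The initialization step can be sketched directly from the description above. The sketch assumes a Gaussian kernel of the form $\exp(-\sigma \|x - y\|^2)$ and uses hypothetical names (`initial_solution`, `scores`); it sets all dual variables to 1, takes ρ as the largest kernel row sum, and reports which points attain that maximum and therefore sit on the margin.

```python
import math


def gaussian_kernel(x, y, sigma):
    """K_sigma(x, y) = exp(-sigma * ||x - y||^2)."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(x, y)))


def initial_solution(points, sigma):
    """Start of the nu-path at nu = n: alpha_i = 1 and rho = max_i sum_j K(x_i, x_j)."""
    n = len(points)
    alpha = [1.0] * n
    # With alpha_j = 1 for all j, w^T phi(x_i) is the i-th row sum of the kernel matrix.
    scores = [sum(gaussian_kernel(xj, xi, sigma) for xj in points) for xi in points]
    rho = max(scores)
    elbow = [i for i, s in enumerate(scores) if s == rho]  # points with w^T phi(x_i) = rho
    return alpha, rho, elbow


# Toy data (assumption): three points on a line; the middle one has the largest row sum.
alpha0, rho0, elbow0 = initial_solution([(0.0,), (1.0,), (2.0,)], sigma=1.0)
```

No linear system is solved here, which reflects the remark that the one-class initialization is easier than in the SVC path algorithms.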

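Tracing the ν-Path. We let $\alpha^l$ denote the solution $\hat{\alpha}(\mu)$ right after the l-th event. If $|E| > 0$ does not hold, we reduce ρ until E contains at least one point. This procedure only involves shrinking ρ to reduce the radius without changing the center of the circle. The algorithm still holds even when more than one point enters the elbow simultaneously. For ν such that $\nu^{l+1} < \nu < \nu^l$, we use (6) to get the solution. In the ν-path algorithm, we rewrite the linear equations (6) as follows:

$$\hat{\alpha}_E^{l+1}(\nu^l + \epsilon) = \hat{\alpha}_E^{l}(\nu^l) + \epsilon \begin{pmatrix} -\mathbf{1} & K_E \\ 0 & \mathbf{1}^T \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{0} \\ 1 \end{pmatrix}. \tag{7}$$

Now let $A = \begin{pmatrix} -\mathbf{1} & K_E \\ 0 & \mathbf{1}^T \end{pmatrix}$, $\delta = \begin{pmatrix} \mathbf{0} \\ 1 \end{pmatrix}$ and $b = A^{-1} \delta$, with $b = (b_0, b_1, \dots, b_m)^T$. We can use a nonlinear kernel function and a small ridge term to ensure that $A^{-1}$ always exists. So $\hat{\alpha}_E^{l+1}(\nu^l + \epsilon) = \hat{\alpha}_E^{l}(\nu^l) + \epsilon b$. Then:

$$f^{l+1}(x_i) = \sum_{j=1}^{n} \alpha_j^{l+1} K(x_i, x_j) - \rho^{l+1} = \sum_{j=1}^{n} \alpha_j^{l} K(x_i, x_j) - \rho^{l} + \epsilon \Big( -b_0 + \sum_{j=1}^{m} b_j K(x_i, x_{E(j)}) \Big) = f^{l}(x_i) + \epsilon \Big( -b_0 + \sum_{j=1}^{m} b_j K(x_i, x_{E(j)}) \Big). \tag{8}$$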
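The direction $b = A^{-1}\delta$ in (7) can be computed with any linear solver. The sketch below uses plain Gaussian elimination to stay self-contained; the names `solve_linear` and `nu_path_direction` are hypothetical, the ridge term mentioned in the text is not included, and the kernel block `k_elbow` is assumed to be the $m \times m$ matrix $K_E$ over the elbow points.

```python
def solve_linear(a, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix A is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def nu_path_direction(k_elbow):
    """Build A = [[-1, K_E], [0, 1^T]] and delta = (0, ..., 0, 1), then return b = A^{-1} delta."""
    m = len(k_elbow)
    a = [[-1.0] + list(row) for row in k_elbow]  # m rows: -rho column, then K_E
    a.append([0.0] + [1.0] * m)                  # last row: sum of alpha changes
    delta = [0.0] * m + [1.0]
    return solve_linear(a, delta)


# Toy kernel block (assumption): two elbow points with K(x1, x2) = 0.5.
b = nu_path_direction([[1.0, 0.5], [0.5, 1.0]])
```

The first entry of `b` is the rate of change of ρ and the remaining entries are the rates of change of the elbow multipliers, which sum to 1 per unit change of ν.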

Finding breakpoints. As ν decreases, the algorithm keeps track of the following events:

– A point enters E from L or R: some $x_i$ with $i \in L^l \cup R^l$ hits the hyperplane, i.e. $\sum_{j=1}^{n} \alpha_j \phi(x_j)^T \phi(x_i) = \rho \Rightarrow f^{l+1}(x_i) = 0$. To track this, we can use (8) to get the maximal step $\epsilon_i$ for each $x_i \in L^l \cup R^l$ before the event occurs: $\epsilon_i = \dfrac{-f^l(x_i)}{-b_0 + \sum_{j=1}^{m} b_j K(x_i, x_{E(j)})}$. And ǫ1 = max{ǫi | ǫi
σlow
r = θ;
while r < 1 − ε
    σ = r σ_t; solve (9) to compute (α(σ), ρ(σ));
    if (α(σ), ρ(σ)) is the valid solution
        α_{t+1} = α(σ); ρ_{t+1} = ρ(σ); σ_{t+1} = σ; t = t + 1;
    else r = r^{1/2};
update the point subsets L, E, R;
Output: a sequence of solutions (α(σ), ρ(σ)), σ_low ≤ σ ≤ σ_high

4.3 Computational Complexity

In the ν-path algorithm, the kernel matrix remains unchanged as ν decreases, so the entire kernel matrix is computed just once at a cost of $O(n^2)$. The main cost at each iteration is the computation of $A^{-1}$, which is $O(m^3)$. We assume that the number of iterations is a small constant multiple c of the number of samples n, so the total computational cost of the ν-path algorithm is $O(n^2 + cnm^3)$. In the kernel-path algorithm, the kernel matrix varies as σ decreases at each iteration, but it is not necessary to recompute the entire kernel matrix. At each iteration, the cost is $O(m^2)$ to compute the kernel matrix, $O(m^3)$ to solve the quadratic programming problem, and $O(n^2)$ to find breakpoints. As in the ν-path algorithm, we assume that the number of iterations is cn. In summary, the total computational cost of the kernel-path algorithm is $O(cn^3 + cnm^3)$.

Fig. 1. Experimental results of the OC-SVM ν-path algorithm (the two columns on the left) and kernel-path algorithm (the two columns on the right). For each algorithm, the left figure shows the results for the "mixture" data and the right one shows the results for the "multi-gaussian" data. Blue points are items from the training dataset. The learned one-class is covered by green points. Purple points on the margin are the current points of E. The top image shows the one-class in the initial stage, and the bottom image shows the one-class in the final stage.

5 Experiments

We demonstrate our algorithms on two data sets: "mixture" from [11] and "multi-gaussian", which is generated from three independent Gaussians with different means and variances. The experiments were run in MATLAB on a 2.4 GHz Pentium 4 processor with 512 MB of RAM. Four movies illustrating the execution of the algorithms are available on YouTube (www.youtube.com/v/{fwg51ibPyxo,eptbRFWq- k,DnMuGZNHU88,V3paRoJ1bEg}). The movies illustrate the execution of the ν-path algorithm and the kernel-path algorithm for the OC-SVM. Figure 1 is excerpted from these movies with a specific set of parameters: σ = 0.5 in the ν-path algorithm and ν = 0.5 in the kernel-path algorithm. Figure 1 shows the effects of both path algorithms. As ν decreases along the path, the region covered by the learned one-class grows larger (the hyperparameter $\tilde{\nu} \in [0, 1]$ in OC-SVM acts as an upper bound on the fraction of outliers). As σ in the Gaussian kernel $K_\sigma(x, y) = \exp(-\sigma \|x - y\|^2)$ is reduced, the effect of individual points becomes smaller, and the region becomes smoother and better connected. Overall, the algorithm analysis and the experimental results indicate that the path algorithms in this paper are effective in computing the entire regularization path and kernel path for OC-SVM.

6 Conclusions

In this paper, we proposed ν-path and kernel-path algorithms for OC-SVM by adapting Wang's approach. In the kernel-path algorithm, we used quadratic programming to ensure that the computed solutions are valid. Experiments on two synthetic datasets demonstrate that the solution path for OC-SVM can be obtained at acceptable computational cost. In many practical applications, the optimum values of both the regularization hyperparameter ν and the kernel hyperparameter σ need to be determined simultaneously. The task is therefore to find the optimum hyperparameter pair. Our future work is to develop a two-dimensional solution algorithm that searches for the optimum hyperparameter pair. As a simple strategy, we can start from an initial point and take a greedy approach, trying to find the best move along four possible directions (increasing or decreasing ν or σ) using the corresponding path algorithms for ν and σ. However, the greedy approach might not give the best result. More experiments and analysis are needed on this issue.

Acknowledgments. This work was supported by the Hi-tech Research and Development Program of China (863) (2008AA01Z121).

References

1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
2. Tax, D.M.J., Duin, R.P.W.: Support Vector Domain Description. Pattern Recognition Letters 20, 1191–1199 (1999)
3. Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C.: Estimating the Support of a High-Dimensional Distribution. Neural Computation 13, 1443–1472 (2001)
4. Platt, J.: Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods - Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1999)
5. Chang, C.C., Lin, C.J.: LIBSVM: a Library for Support Vector Machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
6. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least Angle Regression. Annals of Statistics 32, 407–499 (2004)
7. Rosset, S., Zhu, J.: Piecewise Linear Regularized Solution Paths. The Annals of Statistics 35, 1012–1030 (2007)
8. Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: L1 Norm Support Vector Machines. In: Advances in Neural Information Processing Systems 16. MIT Press, Cambridge (2003)
9. Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: The Entire Regularization Path for the Support Vector Machine. Journal of Machine Learning Research 5, 1391–1415 (2004)
10. Gunter, L., Zhu, J.: Computing the Solution Path for the Regularized Support Vector Regression. In: Advances in Neural Information Processing Systems 18 (NIPS 2005) (2005)
11. Lee, G., Scott, C.D.: The One Class Support Vector Machine Solution Path. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 521–524 (2007)
12. Wang, G., Yeung, D.Y., Lochovsky, F.H.: A Kernel Path Algorithm for Support Vector Machines. In: Proceedings of the 24th International Conference on Machine Learning, pp. 951–958 (2007)

Simulations for American Option Pricing Under a Jump-Diffusion Model: Comparison Study between Kernel-Based and Regression-based Methods Hyun-Joo Lee, Seung-Ho Yang, Gyu-Sik Han, and Jaewook Lee Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, Kyungbuk 790-784, Korea {lhj1120,grimaysh,swallow,jaewookl}@postech.ac.kr http://dreamlab.postech.ac.kr

Abstract. There is no exact analytic formula for valuing an American option, even in the diffusion model, because of its early exercise feature. Monte Carlo simulation (MCS) methods have recently been applied successfully to American option pricing, especially under diffusion models. They include regression-based methods and kernel-based methods. In this paper, we conduct a performance comparison study between the kernel-based MCS methods and the regression-based MCS methods under a jump-diffusion model.

Keywords: American option, kernel-based regression, jump-diffusion model.

1 Introduction

An American option gives its holder the right to sell or buy the underlying asset at the strike price at any time up to and including maturity. Valuing an American-style derivative, which involves finding its optimal exercise time, has therefore been one of the main issues in computational finance. Among the many methods suggested for valuing American-style derivatives, the regression-based methods with Monte Carlo simulation have recently attracted many researchers and practitioners because of their simplicity and flexibility. They express the problem of pricing an American option as a stochastic dynamic programming problem in which exercise is allowed only at user-preset discrete times. [15] was the first to simulate the American option's price, using a bundling technique and backward induction. [4] estimates the conditional expectation of the option's continuation value at each early exercise time using a sequential nonlinear regression algorithm. [11] uses the least-squares regression method to approximate the continuation value. Recently, [8] proposed a kernel-based MCS method. In this paper, we compare the performance of the kernel-based MCS method with that of three regression-based MCS methods, i.e. simple regression, low-estimators, and Longstaff and Schwartz's method, for American option pricing under a jump-diffusion model [1,3,11,13].

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 655–662, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 American Option Problem

Given the discounted payoff process from exercise at $t$, $U(t)$, we can define the continuous-time American option pricing problem as follows (see [7] and the references therein for more details):

$$\sup_{\tau \in \mathcal{T}} E[U(\tau)]$$

where $\mathcal{T}$ is a class of admissible stopping times with values in $[0, T]$. Assume that the underlying instrument price is a Markov process $\{S(t), 0 \le t \le T\}$. If we define $\tilde{h}$ as a nonnegative payoff function, the payoff to the option holder when exercising at $t$ is $\tilde{h}(S(t))$. The option price can be represented as follows:

$$\sup_{\tau \in \mathcal{T}} E\left[ e^{-\int_0^{\tau} r(u)\,du}\, \tilde{h}(S(\tau)) \right]$$

where $\{r(t), 0 \le t \le T\}$ is a risk-neutral interest rate process. In a real stock market, the assumption of continuous exercise is not realistic. Therefore, we consider a finite set of exercise opportunities $t_1 < t_2 < \cdots < t_m$ and define $S_i$ as the state of the underlying Markov process at $t_i$. Backward induction is widely used in solving optimal stopping problems. We define $\tilde{V}_i(s)$ as the option value at time $t_i$ when $S_i = s$, $V_i(s)$ as its present value, and $h_i$ as the present value of the payoff function $\tilde{h}_i$ at $t_i$. The discounted values are $V_i(s) = d_{0,i}\tilde{V}_i(s)$ and $h_i(s) = d_{0,i}\tilde{h}_i(s)$, where $d_{0,i}$ is the discount factor from time 0 to $t_i$. Then the option value at each exercise opportunity $t_i$ can be expressed by dynamic programming in the following way:

$$V_m(s) = h_m(s), \qquad V_{i-1}(s) = \max\{\, h_{i-1}(s),\ E[V_i(S_i) \mid S_{i-1} = s] \,\}, \quad i = 1, \ldots, m. \qquad (1)$$

It is obvious that $V_m(s)$ is the same as $h_m(s)$ at the expiration date $t_m$, according to the no-arbitrage assumption. At each time step $t_i$, $i = m-1, \ldots, 0$, the option value is the maximum of the immediate exercise value and the expected value of continuing, under the assumption that investors are rational. The value of holding an American option rather than exercising it at an exercise opportunity is called the continuation value. It can be calculated as

$$C_i(s) = E[\, V_{i+1}(S_{i+1}) \mid S_i = s \,], \quad i = 0, \ldots, m-1. \qquad (2)$$

Using the dynamic programming recursion (1), (2) can be expressed as follows:

$$C_i(s) = E[\, \max\{h_{i+1}(S_{i+1}),\ C_{i+1}(S_{i+1})\} \mid S_i = s \,], \quad i = 0, \ldots, m-1. \qquad (3)$$

Continuation values cannot be calculated explicitly at each time step and must therefore be estimated.
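To make the recursion (1)–(3) concrete, the following is a minimal sketch of a least-squares estimate of the continuation value, in the spirit of the regression approach of [11]. It is an illustration under stated assumptions rather than the method evaluated in this paper: the paths follow a plain geometric Brownian motion (no jumps), the regression basis is a quadratic polynomial in the stock price, and the function name `lsm_american_put` and all parameter names are hypothetical.

```python
import numpy as np

def lsm_american_put(s0, strike, r, sigma, maturity, m, n_paths, seed=0):
    """Least-squares Monte Carlo sketch for an American put.

    Assumptions (not from the paper): GBM paths, constant rate r,
    quadratic polynomial regression on in-the-money paths only.
    """
    rng = np.random.default_rng(seed)
    dt = maturity / m
    disc = np.exp(-r * dt)                      # one-step discount factor
    z = rng.standard_normal((n_paths, m))
    log_steps = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    paths = s0 * np.exp(np.cumsum(log_steps, axis=1))  # column i holds S at t_{i+1}

    # Terminal condition of (1): the value at t_m equals the payoff.
    value = np.maximum(strike - paths[:, -1], 0.0)

    # Backward induction over the earlier exercise dates t_{m-1}, ..., t_1.
    for i in range(m - 2, -1, -1):
        value *= disc                           # value of continuing, seen from this date
        s = paths[:, i]
        payoff = np.maximum(strike - s, 0.0)
        itm = payoff > 0.0
        if np.count_nonzero(itm) > 3:
            # Regression estimate of the continuation value C_i(s) in (2).
            coef = np.polyfit(s[itm], value[itm], 2)
            continuation = np.polyval(coef, s[itm])
            exercise_now = payoff[itm] > continuation
            rows = np.flatnonzero(itm)[exercise_now]
            value[rows] = payoff[rows]          # the max{...} step of (1)

    # Time 0: compare immediate exercise with the discounted average.
    return max(strike - s0, 0.0, disc * value.mean())

price = lsm_american_put(36.0, 40.0, 0.06, 0.2, 1.0, 50, 20000)
print(price)
```

The sketch works with undiscounted values and discounts one step at a time, which is equivalent to the discounted formulation $V_i = d_{0,i}\tilde{V}_i$ used above when the interest rate is constant.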

3 Simulation Methods

3.1 Jump-Diffusion Model

The celebrated Black-Scholes model assumes that stock prices follow a lognormal pure-diffusion model. However, it often fails to describe the properties of real stock prices. Instead, jump models, which incorporate discontinuous jumps into the stock price model, are widely accepted as alternatives that reflect the behavior of real stock prices. In [13], Merton suggested the jump-diffusion model

$$S_t = S_0 \exp\left[ \mu t + \sigma W_t^{P} + \sum_{i=1}^{N_t} Y_i \right] \qquad (4)$$

under the objective probability measure $P$. Here $\mu$ and $\sigma$ are the drift and the volatility of the diffusion part of the stock price's log-return, and $W_t$ is the Wiener process. $N_t$ is the Poisson process counting the jumps of $S_t$ and has a jump intensity parameter $\lambda$, the average number of jumps per unit time. The $Y_i$, each of which is a jump size, are independently normally distributed with mean $m$ and variance $\delta^2$. In order to value derivatives, this model is changed under the martingale measure $Q$ to

$$S_t = S_0 \exp\left[ \mu^{Q} t + \sigma W_t^{Q} + \sum_{i=1}^{N_t} Y_i \right] \qquad (5)$$

$$\mu^{Q} = r - \frac{\sigma^2}{2} - \lambda E[e^{Y_i} - 1] = r - \frac{\sigma^2}{2} - \lambda\left[ \exp\left(m + \frac{\delta^2}{2}\right) - 1 \right] \qquad (6)$$

where $r$ is the risk-free interest rate. We consider simulating the jump-diffusion model on a fixed time grid with the MCS method. To produce a stock path $(S_1, \ldots, S_m)$ for $m$ fixed times $(t_1, \ldots, t_m)$, we first simulate the Wiener increments $G_i$ with distribution $N(0, \sigma^2(t_i - t_{i-1}))$ and then generate the compound Poisson part in three steps:
1. Simulate the total number of jumps $N$ from a Poisson distribution with parameter $\lambda T$.
2. Simulate $N$ independent random variables $U_i$ which are uniformly distributed on the interval $[0, T]$.
3. Simulate $N$ independent random variables $Y_i$, the jump sizes, which follow the normal distribution $N(m, \delta^2)$.
The discretized trajectory is given by … $G_k + \sum_{j=1}^{N} 1_{U_j}$ …

$$D_i^{+} W(\beta) = \begin{cases} h_i^{+}(\beta), & \beta_i \ge 0 \\ h_i^{-}(\beta), & \beta_i < 0 \end{cases}, \qquad D_i^{-} W(\beta) = \begin{cases} h_i^{+}(\beta), & \beta_i > 0 \\ h_i^{-}(\beta), & \beta_i \le 0 \end{cases} \qquad (7)$$

where

$$h_i^{\pm}(\beta) = -d_i \pm \varepsilon + \sum_{j=1}^{l} k_{ij}\beta_j. \qquad (8)$$

Let $\Omega^*$ denote the set of optimal solutions of Problem 3 as follows:

$$\Omega^* = \{\beta \in S \mid \max_{i \in I_1(\beta)} D_i^{-} W(\beta) \le \min_{i \in I_2(\beta)} D_i^{+} W(\beta)\}.$$

In a practical situation, the optimality condition (5) is often relaxed as

$$\max_{i \in I_1(\beta)} D_i^{-} W(\beta) \le \min_{i \in I_2(\beta)} D_i^{+} W(\beta) + \tau \qquad (9)$$

Global Convergence Analysis of Decomposition Methods for SVR

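where $\tau$ is a positive constant. In this paper, however, we employ neither (5) nor (9) for the optimality condition but

$$\max_{i \in I_1^{\delta}(\beta)} D_i^{\delta-} W(\beta) < \min_{i \in I_2^{\delta}(\beta)} D_i^{\delta+} W(\beta) + \tau \qquad (10)$$

where

$$I_1^{\delta}(\beta) = \{i : -C + \delta \le \beta_i \le C\}, \qquad I_2^{\delta}(\beta) = \{i : -C \le \beta_i \le C - \delta\}, \qquad (11)$$

$$D_i^{\delta+} W(\beta) = \begin{cases} h_i^{+}(\beta), & \beta_i > -\delta \\ h_i^{-}(\beta), & \beta_i \le -\delta \end{cases}, \qquad D_i^{\delta-} W(\beta) = \begin{cases} h_i^{+}(\beta), & \beta_i \ge \delta \\ h_i^{-}(\beta), & \beta_i < \delta \end{cases} \qquad (12)$$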

and $\delta$ is any positive constant smaller than $C$. Usually $\delta$ is set to a sufficiently small positive number. In the following, any $\beta \in S$ satisfying (10) is said to be a $(\tau, \delta)$-optimal solution. The set of $(\tau, \delta)$-optimal solutions is denoted by $\Omega^{(\tau,\delta)}$, that is

$$\Omega^{(\tau,\delta)} = \{\beta \in S \mid \max_{i \in I_1^{\delta}(\beta)} D_i^{\delta-} W(\beta) < \min_{i \in I_2^{\delta}(\beta)} D_i^{\delta+} W(\beta) + \tau\}.$$

Also, a pair of indices $(i, j)$ such that

$$i \in I_1^{\delta}(\beta), \qquad j \in I_2^{\delta}(\beta), \qquad D_i^{\delta-} W(\beta) \ge D_j^{\delta+} W(\beta) + \tau \qquad (13)$$

is called a $(\tau, \delta)$-violating pair at $\beta$. The $(\tau, \delta)$-optimality condition (10) holds at $\beta$ if and only if there exists no $(\tau, \delta)$-violating pair at $\beta$.

3.3 Decomposition Algorithm

As a generalization of Flake and Lawrence's SMO algorithm [10], we consider the following decomposition algorithm for solving Problem 3.

Algorithm 1. Given training samples $\{(p_i, d_i)\}_{i=1}^{l}$, a kernel function $K(\cdot, \cdot)$, positive constants $C$ and $\varepsilon$, and an integer $q$ ($\le l$), execute the following procedure.
1. Let $\beta(0) = 0$ and $k = 0$.
2. If $\beta = \beta(k)$ satisfies the optimality condition (10) then stop. Otherwise go to Step 3.
3. Select the working set $L_B(k) \subseteq L = \{1, 2, \ldots, l\}$ where $|L_B(k)| \le q$.
4. Find $\beta = [\beta_1, \beta_2, \ldots, \beta_l]^T$ which minimizes the objective function $W(\beta)$ under the constraints (4) and $\beta_i = \beta_i(k)$, $\forall i \in L_N(k) = L \setminus L_B(k)$.
5. Set $\beta(k+1)$ to an optimal solution of the optimization problem in Step 4. Add 1 to $k$, and go to Step 2.
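The stopping test in Step 2 and the notion of a $(\tau, \delta)$-violating pair from (10)–(13) can be sketched as follows. This is only an illustrative reading of the definitions, assuming a dense kernel matrix `K` with entries $k_{ij}$ and a feasible $\beta \in S$; the function name `find_violating_pair` is hypothetical, and returning the maximal violating pair is one possible choice, not a selection rule prescribed by Algorithm 1.

```python
import numpy as np

def find_violating_pair(beta, K, d, C, eps, tau, delta):
    """Return a (tau, delta)-violating pair (i, j) at beta, or None.

    None means that the optimality condition (10) holds at beta.
    """
    g = K @ beta - d                          # shared part of h_i^+ and h_i^- in (8)
    h_plus, h_minus = g + eps, g - eps

    # One-sided quantities of (12).
    d_plus = np.where(beta > -delta, h_plus, h_minus)    # D_i^{delta+} W(beta)
    d_minus = np.where(beta >= delta, h_plus, h_minus)   # D_i^{delta-} W(beta)

    # Index sets of (11).
    i1 = np.flatnonzero((beta >= -C + delta) & (beta <= C))
    i2 = np.flatnonzero((beta >= -C) & (beta <= C - delta))
    if i1.size == 0 or i2.size == 0:
        return None

    i = i1[np.argmax(d_minus[i1])]
    j = i2[np.argmin(d_plus[i2])]
    # Condition (13): the pair violates (10) by at least tau.
    if d_minus[i] >= d_plus[j] + tau:
        return int(i), int(j)
    return None

K = np.eye(2)
d = np.array([1.0, -1.0])
print(find_violating_pair(np.zeros(2), K, d, C=1.0, eps=0.1, tau=1e-3, delta=1e-3))
```

In this toy problem the starting point $\beta(0) = 0$ has a violating pair, while $\beta = (0.9, -0.9)$, which minimizes $W$ on the feasible set, has none.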


J. Guo and N. Takahashi

It is apparent that the sequence $\{\beta(k)\}_{k=0}^{\infty}$ generated by Algorithm 1 satisfies the two conditions

$$\beta(k) \in S \qquad (14)$$

$$W(\beta(k+1)) \le W(\beta(k)) \qquad (15)$$

for all $k$. Since the objective function $W(\cdot)$ is bounded from below in $S$, (15) implies that the sequence $\{W(\beta(k))\}_{k=0}^{\infty}$ necessarily converges to a certain value. On the other hand, it is not clear whether the sequence $\{\beta(k)\}_{k=0}^{\infty}$ converges to $\Omega^{(\tau,\delta)}$ or not. The convex optimization problem arising in Step 4 is formulated as follows: Find $\{\beta_i\}_{i \in L_B(k)}$ that minimize

$$\tilde{W}(\beta_{L_B(k)}) = -\sum_{i \in L_B(k)} d_i \beta_i + \varepsilon \sum_{i \in L_B(k)} |\beta_i| + \frac{1}{2} \sum_{i \in L_B(k)} \sum_{j \in L_B(k)} k_{ij} \beta_i \beta_j + \sum_{i \in L_B(k)} c_i \beta_i$$

subject to

$$\sum_{i \in L_B(k)} \beta_i = \sum_{i \in L_B(k)} \beta_i(k), \qquad -C \le \beta_i \le C, \ \forall i \in L_B(k)$$

where $\beta_{L_B(k)}$ is the vector obtained by removing $\{\beta_i\}_{i \in L_N(k)}$ from $\beta$, and

$$c_i = \sum_{j \in L_N(k)} k_{ij} \beta_j(k), \quad i \in L_B(k).$$

If a QP solver is available, this problem can be solved by the following algorithm.

Algorithm 2. Given $\varepsilon$, $C$, $L_B(k)$, $k_{ij}$ ($i, j \in L_B(k)$), and $d_i$, $c_i$, $\beta_i(k)$ ($i \in L_B(k)$), execute the following procedure.
1. Set $\beta_i = \beta_i(k)$ for all $i \in L_B(k)$.
2. Set

$$(L_i, U_i, \sigma_i) = \begin{cases} (0, C, 1), & \text{if } \beta_i > 0 \text{ or } (\beta_i = 0 \text{ and } \partial_i \le 0) \\ (-C, 0, -1), & \text{if } \beta_i < 0 \text{ or } (\beta_i = 0 \text{ and } \partial_i > 0) \end{cases}$$

for all $i \in L_B(k)$, where

$$\partial_i = D_i^{+} \tilde{W}(\beta_{L_B(k)}) \Big|_{\beta_i = 0} = -d_i + \varepsilon + c_i + \sum_{j \in L_B(k)} k_{ij} \beta_j.$$

3. Find $\beta_{L_B(k)}$ which minimizes

$$\sum_{i \in L_B(k)} (-d_i + \varepsilon \sigma_i + c_i) \beta_i + \frac{1}{2} \sum_{i \in L_B(k)} \sum_{j \in L_B(k)} k_{ij} \beta_i \beta_j$$

subject to $L_i \le \beta_i \le U_i$, $\forall i \in L_B(k)$, and

$$\sum_{i \in L_B(k)} \beta_i = \sum_{i \in L_B(k)} \beta_i(k).$$

4. Set $\beta_{L_B(k)}$ to the optimal solution of the QP problem in Step 3.
5. If the optimality condition

$$\max_{i \in I_1(\beta_{L_B(k)})} D_i^{-} \tilde{W}(\beta_{L_B(k)}) \le \min_{i \in I_2(\beta_{L_B(k)})} D_i^{+} \tilde{W}(\beta_{L_B(k)})$$

holds then stop. Otherwise, go to Step 2.

4 Global Convergence Analysis

4.1 Properties of $\Omega^*$ and $\Omega^{(\tau,\delta)}$

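From (6) and (11) the following lemma can be obtained easily.

Lemma 3. $I_1(\beta) \supseteq I_1^{\delta}(\beta)$ and $I_2(\beta) \supseteq I_2^{\delta}(\beta)$ for any $\beta \in S$ and $\delta \in (0, C)$. Moreover, $\lim_{\delta \to 0} I_1^{\delta}(\beta) = I_1(\beta)$ and $\lim_{\delta \to 0} I_2^{\delta}(\beta) = I_2(\beta)$ for any $\beta \in S$.

Also, from (7) and (12) the following lemma can be obtained.

Lemma 4. $D_i^{\delta+} W(\beta) \ge D_i^{+} W(\beta)$ and $D_i^{\delta-} W(\beta) \le D_i^{-} W(\beta)$ for any $\beta \in S$, $\delta \in (0, C)$, and $i \in L$. Moreover, $\lim_{\delta \to 0} D_i^{\delta+} W(\beta) = D_i^{+} W(\beta)$ and $\lim_{\delta \to 0} D_i^{\delta-} W(\beta) = D_i^{-} W(\beta)$ for any $\beta \in S$ and $i \in L$.

Proposition 1. $\Omega^{(\tau,\delta)} \supseteq \Omega^*$ for any $\tau > 0$ and $\delta \in (0, C)$. Moreover, $\lim_{\delta \to 0} \lim_{\tau \to 0} \Omega^{(\tau,\delta)} = \Omega^*$.

Proof. Let $\beta$ be any point in $\Omega^*$. Then $\beta$ satisfies (5). It follows from Lemmas 3 and 4 that

$$\max_{i \in I_1^{\delta}(\beta)} D_i^{\delta-} W(\beta) \le \max_{i \in I_1(\beta)} D_i^{-} W(\beta) \qquad (16)$$

$$\min_{i \in I_2(\beta)} D_i^{+} W(\beta) \le \min_{i \in I_2^{\delta}(\beta)} D_i^{\delta+} W(\beta) \qquad (17)$$

From (16), (17) and (5), we have

$$\max_{i \in I_1^{\delta}(\beta)} D_i^{\delta-} W(\beta) \le \min_{i \in I_2^{\delta}(\beta)} D_i^{\delta+} W(\beta) < \min_{i \in I_2^{\delta}(\beta)} D_i^{\delta+} W(\beta) + \tau \qquad (18)$$

which implies $\beta \in \Omega^{(\tau,\delta)}$. The second statement can be proved by taking the limit $\delta \to 0$ and $\tau \to 0$ in (18). □

Lemma 5. Let $\{\beta(n)\}_{n=0}^{\infty}$ be any sequence such that $\beta(n) \in S$, $\forall n$, and $\lim_{n \to \infty} \beta(n) = \bar{\beta}$. Then there exist positive integers $n_1$ and $n_2$ such that $I_1(\beta(n)) \supseteq I_1(\bar{\beta})$, $\forall n \ge n_1$ and $I_2(\beta(n)) \supseteq I_2(\bar{\beta})$, $\forall n \ge n_2$.

Proof. We will prove only the first formula. The second one can be proved in the same way. Let $i$ be any member of $I_1(\bar{\beta})$. Then $\bar{\beta}_i$ satisfies $-C < \bar{\beta}_i \le C$. Since $\beta_i(n)$ converges to $\bar{\beta}_i$, there exists a positive integer $n_1(i)$ such that $-C < \beta_i(n) \le C$, $\forall n \ge n_1(i)$, which implies $i \in I_1(\beta(n))$, $\forall n \ge n_1(i)$. Let $n_1 = \max_{i \in I_1(\bar{\beta})} n_1(i)$. Then all members of $I_1(\bar{\beta})$ belong to $I_1(\beta(n))$, $\forall n \ge n_1$. This completes the proof. □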


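Proposition 2. The set $\Omega^*$ is closed.

Proof. Let $\{\beta(n)\}_{n=1}^{\infty}$ be any sequence such that $\beta(n) \in \Omega^*$, $\forall n$ and $\lim_{n \to \infty} \beta(n) = \bar{\beta}$. It suffices for us to show that $\bar{\beta} \in \Omega^*$. Since $\beta(n) \in \Omega^*$, $\forall n$, we have

$$\max_{i \in I_1(\beta(n))} D_i^{-} W(\beta(n)) \le \min_{i \in I_2(\beta(n))} D_i^{+} W(\beta(n)), \quad \forall n.$$

It follows from this inequality and Lemma 5 that there exists a positive integer $n_1$ such that

$$\max_{i \in I_1(\bar{\beta})} D_i^{-} W(\beta(n)) \le \min_{i \in I_2(\bar{\beta})} D_i^{+} W(\beta(n)), \quad \forall n \ge n_1. \qquad (19)$$

Suppose

$$\max_{i \in I_1(\bar{\beta})} D_i^{-} W(\bar{\beta}) > \min_{i \in I_2(\bar{\beta})} D_i^{+} W(\bar{\beta}). \qquad (20)$$

Then there exist $i_1 \in I_1(\bar{\beta})$ and $i_2 \in I_2(\bar{\beta})$ such that $D_{i_1}^{-} W(\bar{\beta}) > D_{i_2}^{+} W(\bar{\beta})$. Let

$$\Delta = D_{i_1}^{-} W(\bar{\beta}) - D_{i_2}^{+} W(\bar{\beta}) > 0. \qquad (21)$$

From the definition of $D_i^{-} W(\beta)$ and the assumption that $\beta(n)$ converges to $\bar{\beta}$, we can easily show that there exists an $n_2$ such that

$$D_{i_1}^{-} W(\beta(n)) > D_{i_1}^{-} W(\bar{\beta}) - \frac{\Delta}{2}, \quad \forall n \ge n_2. \qquad (22)$$

Similarly, we can show that there exists an $n_3$ such that

$$D_{i_2}^{+} W(\beta(n)) < D_{i_2}^{+} W(\bar{\beta}) + \frac{\Delta}{2}, \quad \forall n \ge n_3. \qquad (23)$$

From (21)–(23) we have

$$D_{i_1}^{-} W(\beta(n)) - D_{i_2}^{+} W(\beta(n)) > D_{i_1}^{-} W(\bar{\beta}) - \frac{\Delta}{2} - D_{i_2}^{+} W(\bar{\beta}) - \frac{\Delta}{2} = 0, \quad \forall n \ge \max\{n_2, n_3\}$$

which contradicts (19). Therefore (20) is wrong, which implies $\bar{\beta} \in \Omega^*$. □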

Lemma 6. Let $\{\beta(n)\}_{n=0}^{\infty}$ be any sequence such that $\beta(n) \in S$, $\forall n$ and $\lim_{n \to \infty} \beta(n) = \bar{\beta}$. Then there exist positive integers $n_1$ and $n_2$ such that $I_1^{\delta}(\beta(n)) \subseteq I_1^{\delta}(\bar{\beta})$, $\forall n \ge n_1$ and $I_2^{\delta}(\beta(n)) \subseteq I_2^{\delta}(\bar{\beta})$, $\forall n \ge n_2$ for any $\delta \in (0, C)$.

Proof. We will prove only the first formula. The second one can be proved similarly. Let $i$ be any nonmember of $I_1^{\delta}(\bar{\beta})$. Then $\bar{\beta}_i$ satisfies $\bar{\beta}_i < -C + \delta$. Since $\beta_i(n)$ converges to $\bar{\beta}_i$, there exists a positive integer $n_1(i)$ such that $\beta_i(n) < -C + \delta$, $\forall n \ge n_1(i)$, which implies $i \notin I_1^{\delta}(\beta(n))$, $\forall n \ge n_1(i)$. Let $n_1 = \max_{i \notin I_1^{\delta}(\bar{\beta})} n_1(i)$. Then all nonmembers of $I_1^{\delta}(\bar{\beta})$ do not belong to $I_1^{\delta}(\beta(n))$, $\forall n \ge n_1$. This is equivalent to the first formula. □

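Proposition 3. The set $S \setminus \Omega^{(\tau,\delta)}$ is closed for any $\tau > 0$ and $\delta \in (0, C)$.

Proof. Let $\{\beta(n)\}_{n=1}^{\infty}$ be any sequence such that $\beta(n) \in S \setminus \Omega^{(\tau,\delta)}$, $\forall n$, and $\lim_{n \to \infty} \beta(n) = \bar{\beta}$. Then we have

$$\max_{i \in I_1^{\delta}(\beta(n))} D_i^{\delta-} W(\beta(n)) \ge \min_{i \in I_2^{\delta}(\beta(n))} D_i^{\delta+} W(\beta(n)) + \tau, \quad \forall n.$$

It follows from this inequality and Lemma 6 that there exists a positive integer $n_1$ such that

$$\max_{i \in I_1^{\delta}(\bar{\beta})} D_i^{\delta-} W(\beta(n)) \ge \min_{i \in I_2^{\delta}(\bar{\beta})} D_i^{\delta+} W(\beta(n)) + \tau, \quad \forall n \ge n_1.$$

Suppose

$$\max_{i \in I_1^{\delta}(\bar{\beta})} D_i^{\delta-} W(\bar{\beta}) < \min_{i \in I_2^{\delta}(\bar{\beta})} D_i^{\delta+} W(\bar{\beta}) + \tau.$$

Then we can show in a similar way to the proof of Proposition 2 that this leads to a contradiction. Therefore $\bar{\beta} \in S \setminus \Omega^{(\tau,\delta)}$. □

4.2 Convergence Proof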

Let $V_q(\beta)$ be the family of sets $M \subseteq L$ such that $|M| \le q$ and $M$ contains at least one $(\tau, \delta)$-violating pair at $\beta \in S$. For any $M \subseteq L$ and $\beta \in S$, we define the point-to-set map $\Gamma_M(\beta)$ as

$$\Gamma_M(\beta) \triangleq \{ y \in S \mid y_i = \beta_i, \ \forall i \in L \setminus M, \ \max_{i \in I_1(y) \cap M} D_i^{-} W(y) \le \min_{i \in I_2(y) \cap M} D_i^{+} W(y) \}.$$

By using this definition, the set of optimal solutions of the subproblem in Step 4 can be expressed as $\Gamma_{L_B(k)}(\beta(k))$. We also define a point-to-set map $A$ from $S$ to itself as follows:

$$A(\beta) = \begin{cases} \bigcup_{M \in V_q(\beta)} \Gamma_M(\beta), & \text{if } \beta \notin \Omega^{(\tau,\delta)} \\ \beta, & \text{if } \beta \in \Omega^{(\tau,\delta)}. \end{cases} \qquad (24)$$

Let $\{\beta(k)\}_{k=0}^{\infty}$ be the sequence generated by Algorithm 1. Then $\beta(k+1) \in A(\beta(k))$ holds for all $k$. We present some lemmas. The proofs can be found in [6].

Lemma 7. Let $\{\beta(n)\}_{n=0}^{\infty}$ be any sequence such that $\beta(n) \in S$, $\forall n$ and $\lim_{n \to \infty} \beta(n) = \bar{\beta}$. If $\bar{\beta} \in S \setminus \Omega^{(\tau,\delta)}$ then $V_q(\beta(n)) \subseteq V_q(\bar{\beta})$ for sufficiently large $n$.

Lemma 8. For any $M \subseteq L$, the point-to-set map $\Gamma_M(\beta)$ is closed on $S$.

Lemma 9. The point-to-set map $A(\beta)$ defined by (24) is closed on $S \setminus \Omega^{(\tau,\delta)}$.

Lemma 10. The objective function $W(\beta)$ of Problem 3 is a descent function for the set of $(\tau, \delta)$-optimal solutions $\Omega^{(\tau,\delta)}$ and the point-to-set map $A(\beta)$ defined by (24).

Now we are ready to give the global convergence theorem for Algorithm 1.


Theorem 2. Let $\{\beta(k)\}_{k=0}^{\infty}$ be the sequence generated by Algorithm 1. If the working set $L_B(k)$ contains at least one $(\tau, \delta)$-violating pair at $\beta(k)$ for all $k$, then any convergent subsequence of $\{\beta(k)\}_{k=0}^{\infty}$ has a limit in $\Omega^{(\tau,\delta)}$.

The proof of Theorem 2 can be found in [6]. From Theorem 2 and Proposition 3, we immediately derive the following theorem.

Theorem 3. If the working set $L_B(k)$ contains at least one $(\tau, \delta)$-violating pair at $\beta(k)$ for all $k$, then Algorithm 1 stops at $\Omega^{(\tau,\delta)}$ within a finite number of iterations for any $\tau > 0$ and $\delta \in (0, C)$.

5 Conclusion

In this paper, we have analyzed the convergence property of the decomposition algorithm for SVR, where the QP problem is the one formulated by Flake and Lawrence, and have given a rigorous proof that the algorithm always stops within a finite number of iterations.

Acknowledgments. This research was partly supported by the Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for JSPS Research Fellows, 18·9473.

References
1. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
2. Platt, J.C.: Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge (1998)
3. Joachims, T.: Making Large-scale SVM Learning Practical. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge (1998)
4. Keerthi, S.S., Shevade, S.K., Bhattacharyya, C.S.S., Murthy, K.R.K.: Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation 13, 637–649 (2001)
5. Hsu, C.W., Lin, C.J.: A Simple Decomposition Method for Support Vector Machines. Machine Learning 46, 291–314 (2002)
6. Takahashi, N., Nishi, T.: Global Convergence of Decomposition Learning Methods for Support Vector Machines. IEEE Trans. on Neural Networks 17, 1362–1368 (2006)
7. Shevade, S.K., Keerthi, S.S., Bhattacharyya, C.S.S., Murthy, K.R.K.: Improvements to the SMO Algorithm for SVM Regression. IEEE Trans. on Neural Networks 11, 1183–1188 (2000)
8. Laskov, P.: An Improved Decomposition Algorithm for Regression Support Vector Machines. In: Workshop on Support Vector Machines, NIPS 1999 (1999)
9. Liao, S.P., Lin, H.T., Lin, C.J.: A Note on the Decomposition Methods for Support Vector Regression. Neural Computation 14, 1267–1281 (2002)
10. Flake, G.W., Lawrence, S.: Efficient SVM Regression Training with SMO. Machine Learning 46, 271–290 (2002)
11. Zangwill, W.I.: Nonlinear Programming: A Unified Approach. Prentice-Hall, Englewood Cliffs (1967)
12. Luenberger, D.G.: Linear and Nonlinear Programming. Addison-Wesley, Reading (1989)
13. Guo, J., Takahashi, N., Nishi, T.: Convergence Proof of a Sequential Minimal Optimization Algorithm for Support Vector Regression. In: Proc. of IJCNN 2006, pp. 747–754 (2006)

Rotating Fault Diagnosis Based on Wavelet Kernel Principal Component L. Guo, G.M. Dong, J. Chen, Y. Zhu, and Y.N. Pan State Key Laboratory of Mechanical System and Vibration, Shanghai Jiao Tong University, Shanghai 200240, PR China [email protected]

Abstract. In this paper, the application of nonlinear feature extraction based on wavelet kernel KPCA (WKPCA) to fault diagnosis is presented. The Mexican hat wavelet kernel is introduced to enhance the nonlinear mapping capability of kernel PCA. Experimental data sets from a rotor working under four conditions (normal, oil whirl, rub and unbalance) are used to test the WKPCA method. The feature reduction results of WKPCA are compared with those of the PCA and KPCA methods. The results indicate that WKPCA can classify the rotor fault type efficiently and is more suitable for nonlinear feature reduction in the fault diagnosis area.

Keywords: Kernel PCA; wavelet kernel; fault diagnosis; rotating machinery.

1 Introduction

Rotating machinery such as turbines and compressors is key equipment in power plants and chemical engineering plants. Defects and malfunctions of these machines result in significant economic loss. Therefore, fault diagnosis of these machines is of great importance. In an intelligent fault diagnosis system, the feature extraction and reduction process plays a very important role. Principal component analysis (PCA) has been widely used in dimensionality reduction, noise removal, and feature extraction from the original data set. However, for complicated cases in industrial processes, especially nonlinear ones, PCA is unsuccessful as it is linear by nature [1]. Kernel Principal Component Analysis (KPCA) [2] has been proposed to tackle nonlinear problems in recent research. As a nonlinear extension of PCA, KPCA can efficiently compute principal components in a high-dimensional feature space by the use of nonlinear kernel functions. Many types of kernels can be used, such as the RBF kernel, the sigmoid kernel and the linear kernel. Since the wavelet technique shows promise for both non-stationary signal approximation and classification, it is valuable to study whether a better classification performance on equipment degradation data can be obtained by combining the wavelet technique with KPCA. In this paper, an admissible wavelet kernel is constructed, which implements the combination of the wavelet technique with KPCA. Practical vibration signals measured from a rotor with different fault types on the Bently rotor test bed are classified by PCA, RBF KPCA and Wavelet KPCA (WKPCA). The comparison results indicate that all three methods can perform the condition recognition, but the rotor condition is reflected more clearly by WKPCA. Therefore, WKPCA is more effective for rotating machinery fault diagnosis.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 674–681, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Kernel Principal Components Analysis

PCA is an orthogonal transformation of the initial coordinate system that describes the data [1]. The transformed vectors are linear combinations of the original data. Given a set of $n$-dimensional feature vectors $x_t$ ($t = 1, 2, \ldots, m$), generally with $n < m$, assume that the vector mean is zero. Then the covariance matrix of the vectors is

$$C = \frac{1}{m} \sum_{t=1}^{m} x_t x_t^T \qquad (1)$$

The principal components (PCs) are computed by solving the eigenvalue problem of the covariance matrix $C$,

$$\lambda_i v_i = C v_i \qquad (2)$$

where $\lambda_i$ ($i = 1, 2, \ldots, n$) are the eigenvalues, sorted in descending order, and $v_i$ ($i = 1, 2, \ldots, n$) are the corresponding eigenvectors. To represent the raw vectors with low-dimensional ones, we compute the first $k$ eigenvectors ($k \le n$), which correspond to the $k$ largest eigenvalues. In order to select the number $k$, a threshold $\theta$ is introduced to denote the approximation precision of the $k$ largest eigenvectors:

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \ge \theta \qquad (3)$$

Given the precision parameter $\theta$, the number of eigenvectors $k$ can be decided. Let

$$V = [v_1, v_2, \ldots, v_k], \qquad \Lambda = \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_k].$$

After the matrix $V$ is decided, the low-dimensional feature vector, named the PC, of a raw vector is determined as follows:

$$P = V^T x_t \qquad (4)$$

PCA performs well on linear problems, but it does not perform well on nonlinear problems [1, 3, 4]. Kernel principal component analysis (KPCA) generalizes linear PCA to the nonlinear case using the kernel method. The idea of KPCA is to first map the original input vectors $x_t$ into a high-dimensional feature space as $\varphi(x_t)$ and then to carry out linear PCA on $\varphi(x_t)$.
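The linear PCA steps (1)–(4) can be sketched as below. This is a minimal illustration, assuming the samples are stored as rows of a matrix `X` and are mean-centered in code to satisfy the zero-mean assumption; the function name `pca_reduce` is hypothetical.

```python
import numpy as np

def pca_reduce(X, theta):
    """Project the rows of X (m samples, n features) onto the first k PCs.

    k is the smallest number of leading eigenvalues whose share of the
    total reaches the threshold theta, as in (3).
    """
    Xc = X - X.mean(axis=0)                  # enforce the zero-mean assumption
    C = Xc.T @ Xc / X.shape[0]               # covariance matrix, Eq. (1)
    lam, V = np.linalg.eigh(C)               # eigenpairs of Eq. (2), ascending
    lam, V = lam[::-1], V[:, ::-1]           # reorder to descending eigenvalues
    ratio = np.cumsum(lam) / np.sum(lam)
    k = min(int(np.searchsorted(ratio, theta)) + 1, lam.size)
    P = Xc @ V[:, :k]                        # Eq. (4) applied to every sample
    return P, lam, k

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) * np.array([10.0, 1.0, 0.1])
P, lam, k = pca_reduce(X, 0.9)
print(k, P.shape)
```

Here one direction carries almost all of the variance, so a threshold of 0.9 keeps a single component.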


The sample covariance matrix $\hat{C}$ of $\varphi(x_t)$ is formulated as

$$\hat{C} = \frac{1}{m} \sum_{t=1}^{m} \varphi(x_t) \varphi(x_t)^T \qquad (5)$$

The eigenvalue problem in the high-dimensional feature space is defined as

$$\lambda_i \left( \varphi(x_t) \cdot v_i \right) = \left( \varphi(x_t) \cdot \hat{C} v_i \right), \quad t = 1, 2, \ldots, m \qquad (6)$$

where $\lambda_i$ ($i = 1, 2, \ldots, m$) are the non-zero eigenvalues of $\hat{C}$ and $v_i$ ($i = 1, 2, \ldots, m$) are the corresponding eigenvectors, which can be expressed as

$$v_i = \sum_{j=1}^{m} \alpha_i(j)\, \varphi(x_j), \quad i = 1, 2, \ldots, m \qquad (7)$$

where $\alpha_i$ is the vector of kernel coefficients. After combining Eqs. (5), (6) and (7), we get

$$m \lambda_i \alpha_i = K \alpha_i, \quad i = 1, 2, \ldots, m \qquad (8)$$

where $K$ is the $m \times m$ kernel matrix, that is, $K(i, j) = \varphi(x_i) \cdot \varphi(x_j)$. The introduction of the kernel function is based on the fact that an inner product in the feature space has an equivalent kernel in the input space, and thus it is necessary neither to know the form of the function $\varphi(x)$ nor to calculate the inner product in the high-dimensional space. Finally, the principal components for the input vector $x_t$ ($t = 1, 2, \ldots, m$) can be obtained by

$$s_t(i) = v_i \cdot \varphi(x_t) = \sum_{j=1}^{m} \alpha_i(j)\, K(x_j, x_t), \quad i = 1, 2, \ldots, m \qquad (9)$$

For the purpose of dimensionality reduction, the first $k$ eigenvectors $\alpha_i$ ($i = 1, 2, \ldots, k$) can be selected as the optimal projection axes by sorting the eigenvalues $\lambda_i$ of $K$ in descending order [5].
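A compact sketch of Eqs. (5)–(9) is given below. Several details are assumptions that the text does not spell out: the RBF kernel is taken in the form $\exp(-\|x - y\|^2 / (2\sigma^2))$, the kernel matrix is centered in feature space so that the mapped data have zero mean, and each coefficient vector is scaled so that $v_i$ has unit norm. The names `rbf_kernel` and `kpca_scores` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kpca_scores(K, k):
    """Return the first k kernel principal components of the training data.

    K is the m x m kernel matrix. The eigenvalues returned are those of the
    centered kernel matrix, i.e. m * lambda_i in the notation of Eq. (8).
    """
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J                           # centering in feature space (assumption)
    w, U = np.linalg.eigh(Kc)                # solves Eq. (8)
    order = np.argsort(w)[::-1][:k]          # keep the k largest eigenvalues
    w, U = w[order], U[:, order]
    alphas = U / np.sqrt(w)                  # scale so that each v_i has unit norm
    return Kc @ alphas, w                    # Eq. (9) for all training samples

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
scores, w = kpca_scores(rbf_kernel(X, X, 2.0), 3)
print(scores.shape)
```

With this normalization the score columns are mutually orthogonal and their squared norms equal the retained eigenvalues, which gives a quick consistency check on an implementation.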

3 Wavelet Kernel

Besides dot-product type kernels ($K(x, x') = \Phi(x) \cdot \Phi(x')$), translation invariant kernels, i.e., $K(x, x') = \Phi(x - x')$, derived in [6] are admissible SV kernels if they satisfy Mercer's condition. Ref. [7] gives the necessary and sufficient condition for translation invariant kernels.

Theorem 1: A translation invariant kernel $K(x, x') = \Phi(x - x')$ is an admissible SV kernel if and only if its Fourier transform satisfies

$$F[K](\omega) = (2\pi)^{-N/2} \int_{R^N} \exp(-j(\omega \cdot x))\, K(x)\, dx \ge 0 \qquad (10)$$

Theorem 2: Given a mother wavelet $\psi(x) \in L^2(R)$, if $x, x' \in R^n$, the translation-invariant wavelet kernels that satisfy the translation invariant kernel theorem are

$$K(x, x') = \prod_{i=1}^{n} \psi\left( \frac{x_i - x_i'}{\beta} \right) \qquad (11)$$

where $\beta$ denotes the dilation and $\beta \in R$. The proofs of Theorem 1 and Theorem 2 are given in Ref. [7]. In this research, the Mexican hat wavelet function

$$\psi(x) = (1 - x^2) \exp\left( -\frac{x^2}{2} \right) \qquad (12)$$

is chosen to construct the translation invariant wavelet kernel. By Theorem 2, the wavelet kernel is

$$K(x, x') = \prod_{i=1}^{n} \psi\left( \frac{x_i - x_i'}{\beta} \right) = \prod_{i=1}^{n} \left[ \left( 1 - \left( \frac{x_i - x_i'}{\beta} \right)^2 \right) \exp\left( -\frac{(x_i - x_i')^2}{2\beta^2} \right) \right] \qquad (13)$$

The proof that Formula (13) is an admissible SV kernel is given below.

Proof: According to Theorem 1, it is sufficient to prove the inequality

$$F[K](\omega) = (2\pi)^{-n/2} \int_{R^n} \exp(-j(\omega \cdot x))\, K(x)\, dx \ge 0$$

for all $\omega$, where

$$K(x) = \prod_{i=1}^{n} \psi\left( \frac{x_i}{\beta} \right) = \prod_{i=1}^{n} \left[ \left( 1 - \left( \frac{x_i}{\beta} \right)^2 \right) \exp\left( -\frac{x_i^2}{2\beta^2} \right) \right].$$

We can obtain the Fourier transform

$$\begin{aligned} F[K](\omega) &= (2\pi)^{-n/2} \int_{R^n} \exp(-j(\omega \cdot x))\, K(x)\, dx \\ &= (2\pi)^{-n/2} \int_{R^n} \exp(-j(\omega \cdot x)) \prod_{i=1}^{n} \left[ \left( 1 - \left( \frac{x_i}{\beta} \right)^2 \right) \exp\left( -\frac{x_i^2}{2\beta^2} \right) \right] dx \\ &= (2\pi)^{-n/2} \prod_{i=1}^{n} \int_{-\infty}^{\infty} \left( 1 - \left( \frac{x_i}{\beta} \right)^2 \right) \exp\left( -\frac{x_i^2}{2\beta^2} - j \omega_i x_i \right) dx_i \\ &= \prod_{i=1}^{n} \omega_i^2 \beta^3 \exp\left( -\frac{\omega_i^2 \beta^2}{2} \right) \ge 0 \end{aligned} \qquad (14)$$

This completes the proof.
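As an illustration of (12) and (13), the following sketch evaluates the Mexican hat wavelet kernel on a small random sample and checks numerically that the resulting Gram matrix is symmetric, has a unit diagonal and has no significantly negative eigenvalue, which is what admissibility implies. The function name `mexican_hat_kernel` and the test data are hypothetical.

```python
import numpy as np

def mexican_hat_kernel(X, Y, beta):
    """Wavelet kernel of Eq. (13) between the rows of X and the rows of Y."""
    u = (X[:, None, :] - Y[None, :, :]) / beta       # scaled coordinate differences
    sq = u**2
    # Product over the n coordinates of psi(u_i) = (1 - u_i^2) exp(-u_i^2 / 2).
    return np.prod((1.0 - sq) * np.exp(-sq / 2.0), axis=2)

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
K = mexican_hat_kernel(X, X, beta=1.5)
print(np.linalg.eigvalsh(K).min())
```

The dilation $\beta$ plays a role similar to the width of an RBF kernel; the value 1.5 used here is arbitrary.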


4 Experiment and Data

4.1 Data Collection

A rotor fault simulation experiment is carried out on a Bently RK 4 rotor kit. The samples are then obtained through data collection, preprocessing and feature extraction. Figure 1 is a photo of the experimental system, which includes the Bently Rotor Kit RK4 test bed, sensors, a signal conditioner and a data acquisition computer.

Fig. 1. Experiment system

Four rotor running states are simulated: normal, unbalance, rotor radial rub and oil whirl, abbreviated as GOOD, UBAL, RRUB and OILW, respectively. The analysis bandwidth is 1000 Hz and the sampling frequency is set to 2560 Hz. The number of sample points is 4096. The constant rotating speeds are 3000 rpm for GOOD and RRUB, 1727 rpm for OILW and 3600 rpm for UBAL. For each running state, 100 data sets are collected and analyzed.

4.2 Feature Extraction

In the machinery fault diagnosis field, features are often extracted in the time domain and the frequency domain, and features from both domains have been applied successfully [8]. In this paper, we extract features in both the time and frequency domains and make full use of the information from the two kinds of features. Two dimensional parameters and six non-dimensional statistical parameters are selected as the time domain features. In the frequency domain, we take into account four features from the amplitude spectrum. In total, we get 12 features, which are listed in Table 1.

Table 1. The selected 12 features

No.  Feature             No.  Feature
1    RMS                 7    tolerance index
2    Peak_Peak Value     8    skewness index
3    impulsion index     9    0.5 f_a
4    kurtosis index      10   1 f_a
5    waveform index      11   2 f_a
6    peak index          12   3 f_a

f_a: rotating frequency of a rotor.
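The twelve features of Table 1 could be computed along the following lines. The paper does not state the formulas, so the definitions below are assumptions based on commonly used textbook forms (for example, the peak index as peak over RMS and the tolerance index as peak over the squared mean of the square-rooted absolute values). The spectral features are read from a single-sided amplitude spectrum at the bin nearest to each multiple of the rotating frequency. The function name `rotor_features` and the dictionary keys are hypothetical.

```python
import numpy as np

def rotor_features(x, fs, f_rot):
    """Time- and frequency-domain features of one vibration record.

    x: samples, fs: sampling frequency in Hz, f_rot: rotating frequency in Hz.
    All formulas are assumed definitions, not taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    rms = np.sqrt(np.mean(x**2))
    peak = abs_x.max()
    mean_abs = abs_x.mean()
    centered = x - x.mean()
    std = centered.std()

    # Single-sided amplitude spectrum; bin spacing is fs / len(x).
    amp = 2.0 * np.abs(np.fft.rfft(x)) / x.size

    def amp_at(multiple):
        return amp[int(round(multiple * f_rot * x.size / fs))]

    return {
        "rms": rms,
        "peak_peak": x.max() - x.min(),
        "impulsion_index": peak / mean_abs,
        "kurtosis_index": np.mean(centered**4) / std**4,
        "waveform_index": rms / mean_abs,
        "peak_index": peak / rms,
        "tolerance_index": peak / np.mean(np.sqrt(abs_x))**2,
        "skewness_index": np.mean(centered**3) / std**3,
        "amp_0.5x": amp_at(0.5),
        "amp_1x": amp_at(1.0),
        "amp_2x": amp_at(2.0),
        "amp_3x": amp_at(3.0),
    }

# A 50 Hz unit sine sampled as in Section 4.1 (2560 Hz, 4096 points).
t = np.arange(4096) / 2560.0
features = rotor_features(np.sin(2.0 * np.pi * 50.0 * t), 2560.0, 50.0)
print(features["rms"], features["amp_1x"])
```

With 4096 points at 2560 Hz the frequency resolution is 0.625 Hz, so a rotating speed of 3000 rpm (50 Hz) falls exactly on a spectral bin, whereas other speeds are mapped to the nearest bin.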

5 Experiment Results and Discussion

In this paper, PCA, KPCA and WKPCA are used to extract the linear and nonlinear features from the original feature set, respectively. With a threshold of 90% of the total eigenvalue sum, we choose the first three eigenvalues to describe the original data. The features are shown in Figure 2, Figure 3 and Figure 4.

Fig. 2. PCA features (a) PC1-PC2 features; (b) PC2-PC3 features (legend: GOOD, OILW, RRUB, UBAL)

PCA features of the original signals are plotted in Figure 2. Figure 2(a) shows the first two PC features. Because the PC values of GOOD and UBAL are too small to be distinguished there, magnified PC2 and PC3 features are shown in Figure 2(b). It can be seen that the PCA features fail to entirely separate the four running states in the linear feature space because of some overlaps in their clusters.

Fig. 3. KPCA features (a) PC1-PC2 features; (b) PC2-PC3 features (legend: GOOD, OILW, RRUB, UBAL)

Fig. 4. WKPCA features (a) PC1-PC2 features; (b) PC2-PC3 features (legend: GOOD, OILW, RRUB, UBAL)

KPCA features of the signals are given in Figure 3. The RBF kernel function is selected with the kernel parameter $\sigma = 2$. Compared with PCA, the nonlinear features extracted by KPCA separate the four states much better. However, there is still some overlap in the clusters. For example, in Figure 3(a), OILW and RRUB overlap and cannot be clustered well. The same phenomenon can be found in Figure 3(b) for the GOOD and UBAL conditions. In Figure 4, it can be seen that the nonlinear features extracted by WKPCA from the original features can entirely separate the four conditions. In Figure 4(a), the features of OILW and RRUB are totally separated from each other. In Figure 4(b), the features of GOOD and UBAL are separated almost without overlap. The clustering ability of the WKPCA features is clearly superior to that of the linear features, because WKPCA can explore higher order information of the original data by using the wavelet kernel.

6 Conclusion

By using kernels to perform the nonlinear mapping, KPCA makes the features as linearly separable as possible. The kernel plays a crucial role in the nonlinear mapping. In this paper, the wavelet kernel is introduced into the KPCA method to diagnose rotor faults. Four classical rotor running states, including normal, unbalance, rotor radial rub and oil whirl, are simulated on a Bently Rotor Kit, and the sample data are used for a fault diagnosis test. The feature matrix of the sample data is analyzed by PCA, KPCA and WKPCA, respectively. The comparison of the classification results shows that WKPCA can entirely separate the rotor conditions while the other two methods fail. That is to say, WKPCA is more effective for diagnosing rotor faults.

Acknowledgements

This research is supported by the Natural Science Foundation of China (Grant No. 50675140), the National High Technology Research and Development Program of China (Grant No. 2006AA04Z175) and a China Postdoctoral Science Foundation funded project (Grant No. 20070420655).

References

1. Sun, R., Tsung, F., Qu, L.: Evolving kernel principal component analysis for fault diagnosis. Computers and Industrial Engineering 53(2), 361–371 (2007)
2. Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
3. Qian, H., Liu, Y.B., Lv, P.: Kernel Principal Components Analysis for early identification of gear tooth crack. In: Proceedings of the World Congress on Intelligent Control and Automation (WCICA), Dalian, pp. 5748–5751 (2006)
4. Zhao, H., Yuen, P.C., Kwok, J.: A novel incremental principal component analysis and its application for face recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 36(4), 873–886 (2006)
5. Feng, W., Junyi, C., Binggang, C.: Nonlinear feature fusion scheme based on kernel PCA for machine condition monitoring. In: Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, ICMA 2007, Harbin, pp. 624–629 (2007)
6. Smola, A.J., Schölkopf, B., Müller, K.-R.: The connection between regularization operators and support vector kernels. Neural Networks 11(4), 637–649 (1998)
7. Zhang, L., Zhou, W., Jiao, L.: Wavelet Support Vector Machine. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 34(1), 34–39 (2004)
8. Sun, W., Chen, J., Li, J.: Decision tree and PCA-based fault diagnosis of rotating machinery. Mechanical Systems and Signal Processing 21(3), 1300–1317 (2007)

Inverse System Identification of Nonlinear Systems Using LSSVM Based on Clustering

Changyin Sun¹,², Chaoxu Mu¹, and Hua Liang¹

¹ College of Electrical Engineering, Hohai University, Nanjing 210098, P.R. China
² School of Automation, Southeast University, Nanjing 210096, P.R. China
[email protected]

Abstract. In this paper we propose an algorithm that embeds fuzzy c-means (FCM) clustering in the least squares support vector machine (LSSVM). We use the method to identify the inverse system of a plant with immeasurable crucial variables and nonlinear characteristics that are hard to describe analytically. In the course of identification, we construct the allied inverse system from the left inverse soft-sensing function and the right inverse system, decide the number of clusters by a validity function, and then use the proposed method to approximate the nonlinear allied inverse system via offline training. Simulation experiments indicate that the proposed method is effective and provides satisfactory performance, with good accuracy and low computational cost.

Keywords: LSSVM; FCM clustering; Nonlinear systems; Identification.

1 Introduction

As nonlinear systems are often complex and dynamic, and in particular may contain crucial variables that cannot be measured, it is difficult to identify nonlinear systems and even more difficult to identify their inverse systems. Artificial neural networks (ANN) have good learning capability for the identification of nonlinear inverse systems, and some researchers have successfully applied ANN to identify inverse systems. Dai et al. proposed the neural network α-th order inverse system method for the control of nonlinear systems [1] to resolve the above-mentioned problem, and have also applied the method to biochemical processes, robot control and electric systems [2]. The support vector machine (SVM) is a machine learning technique founded on statistical learning theory and the structural risk minimization principle, and it copes well with problems involving small samples, nonlinearity, high dimensionality and local minima [3]. As an interesting variant of the standard SVM, LSSVM was proposed by Suykens and Vandewalle for pattern recognition [4]. Compared with the standard SVM, LSSVM uses equality constraints instead of inequalities in the problem formulation and a least squares cost function instead of the ε-insensitive loss function. As a result, the solution follows from a linear KKT system instead of a quadratic programming problem, which makes LSSVM cheaper to train.

Clustering algorithms are used to analyze the interaction among samples by organizing them into clusters, so that samples within a cluster are more similar to each other than to samples belonging to other clusters [5]. FCM clustering is an effective data clustering algorithm employing the least squares principle [6]. In a clustering algorithm the number of clusters is important; we search for a suitable number of clusters using a validity function. Following the principle of clustering, we propose to embed the fuzzy c-means clustering algorithm in LSSVM and apply the resulting method to nonlinear inverse systems. The effectiveness of the proposed method is demonstrated by numerical experiments.

The rest of this paper is organized as follows. Section 2 shows how to construct the left inverse soft-sensing function and the right inverse system. Section 3 gives a brief review of LSSVM regression and then describes the proposed algorithm of embedding FCM in LSSVM. Section 4 applies the proposed method to the identification of the allied inverse system. Experimental results are presented in Section 5, and Section 6 gives some concluding remarks.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 682–690, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 The Inverse System Method

For the class of nonlinear systems with immeasurable crucial variables, we obtain the inverse system from the left inverse function and the right inverse system. The nonlinear relation obtained from the left inverse system is used to constitute the allied inverse system. Let $x = (x_1, \dots, x_n)^T$ be the state variables, divided into two groups: the directly immeasurable group $\hat{x} = (x_1, x_2, \dots, x_l)^T$ and the directly measurable group $\bar{x} = (x_{l+1}, x_{l+2}, \dots, x_n)^T$. The input and the output of the original system are $u$ and $y$, respectively. First, assume that $h_i = h_i(x)$, $i = 1, 2, \dots, t$, are functions of the state variables $x = (\hat{x}^T, \bar{x}^T)^T$ that can be measured directly, and write $h = (h_1, h_2, \dots, h_t)^T$, where $t$ is the number of functions. We denote $z = (z_1, z_2, \dots, z_t, z_{t+1}, \dots, z_{t+n-l})^T$, where

$$z_i = \begin{cases} h_i, & 1 \le i \le t; \\ x_{i-t+l}, & t+1 \le i \le t+n-l. \end{cases} \tag{1}$$

The relation is estimated by the left inverse soft-sensing function according to the following steps.

Step 1): Take the first order derivative of $z$ with respect to $\hat{x}$,

$$\frac{\partial z}{\partial \hat{x}^T} = \frac{\partial (z_1, z_2, \dots, z_t, z_{t+1}, \dots, z_{t+n-l})^T}{\partial (x_1, x_2, \dots, x_l)} \tag{2}$$

and compute $\mathrm{rank}(\partial z / \partial \hat{x}^T)$. If $\mathrm{rank}(\partial z / \partial \hat{x}^T) = l$, the number of independent variables is $l$. According to the inverse function theorem, we can then choose $l$ independent components from $z$ to form a new vector $(g_1, g_2, \dots, g_l)$ and use them to establish the nonlinear function $\hat{x} = \phi(g_1, g_2, \dots, g_l)$. The nonlinear mapping $\phi$ denotes a nonlinear relation whose explicit form is not known. In this case, the algorithm ends. If $\mathrm{rank}(\partial z / \partial \hat{x}^T) < l$, new information is needed to estimate the immeasurable variables, and the algorithm continues to step 2.

Step 2): Take the first order derivative of the first component $z_1$ of $z$: with $z_1 = z_1(x)$, we have $\dot{z}_1 = \frac{\partial z_1}{\partial x^T}\dot{x} = \dot{z}_1(x)$. (However, since $\dot{x} = f(x, u)$, $\dot{z}_1$ may contain the input $u$ and its derivative. Its implementation would then inevitably contain self-feedback of the input and its derivative, so that the estimation error would be amplified severely and the soft-sensing method based on the left inverse system would be useless in practice. To avoid this kind of self-feedback, if the derivative of $z_i$ contains $u$ or its derivative, that derivative must be abandoned and the algorithm moves on to the next step. The same analysis applies in the later steps. Here we assume that $\dot{z}_1$ does not contain $u$ or its derivative.) We add $\dot{z}_1$ to $z$ and form a new vector $z_{(1)} = (z_1, \dot{z}_1, z_2, \dots, z_t, z_{t+1}, \dots, z_{t+n-l})^T$. Take the first order derivative of $z_{(1)}$, that is, compute $\partial z_{(1)} / \partial \hat{x}^T$ and its rank. If $\mathrm{rank}(\partial z_{(1)} / \partial \hat{x}^T) = l$, choose $l$ independent components from $z_{(1)}$ to form a new vector $(g_1, g_2, \dots, g_l)$, use them to establish the nonlinear function $\hat{x} = \phi(g_1, g_2, \dots, g_l)$, and the algorithm ends. Otherwise more information is needed, and the algorithm goes to the next step.

Step 3): Take the first order derivative of the second component $z_2$ of $z$; the analysis and computation are similar to the foregoing step.

Step t + n − l): If $l$ independent components cannot be found in the previous steps, the algorithm enters this step. Take the first order derivative of $z_{t+n-l}$: with $z_{t+n-l} = z_{t+n-l}(x)$, we have $\dot{z}_{t+n-l} = \frac{\partial z_{t+n-l}}{\partial x^T}\dot{x} = \dot{z}_{t+n-l}(x)$. We add $\dot{z}_{t+n-l}$ to form $z_{(t+n-l)} = (z_1, \dot{z}_1, z_2, \dot{z}_2, \dots, z_{t+n-l}, \dot{z}_{t+n-l})^T$ and compute $\partial z_{(t+n-l)} / \partial \hat{x}^T$. If its rank satisfies $\mathrm{rank}(\partial z_{(t+n-l)} / \partial \hat{x}^T) = l$, similarly choose $l$ independent components from $z_{(t+n-l)}$ to form a new vector $(g_1, g_2, \dots, g_l)$, establish the nonlinear function $\hat{x} = \phi(g_1, g_2, \dots, g_l)$, and the algorithm ends. Otherwise $\hat{x}$ cannot be expressed by $(z_1, \dot{z}_1, z_2, \dot{z}_2, \dots, z_{t+n-l}, \dot{z}_{t+n-l})$, and higher order derivatives must be introduced to express the relation. The conditions for the left inverse soft-sensing function and further proofs can be found in [7].

Consider a SISO nonlinear system $\dot{x} = f(x, u)$, $y = d(x, u)$. Define the initial point $(x_0, u_0) = (x(t_0), u(t_0))$ at $t = t_0$, and denote the r-th order time derivative of the output $y = d(x, u)$ by $y^{(r)} = y^{(r)}(x, u) = d_r(x, u)$. Suppose a nonnegative integer $\alpha$ exists such that all $x$ and $u$ within a neighborhood of the initial point $(x_0, u_0)$ satisfy

$$\frac{\partial d_r(x, u)}{\partial u} = 0, \quad r = 0, 1, \dots, \alpha - 1; \qquad \frac{\partial d_r(x, u)}{\partial u} \neq 0, \quad r = \alpha. \tag{3}$$

Then the relative order of the original system exists within the neighborhood of the initial point $(x_0, u_0)$. The necessary and sufficient condition for right invertibility within the neighborhood of $(x_0, u_0)$ is that the relative order $\alpha$ exists [2]. If the original nonlinear system is invertible, its right inverse system can be obtained as follows: differentiate $y$ repeatedly until the derivative contains $u$ for the first time, and record it as $Y = y^{(r)}(x, u)$. If $\mathrm{rank}(\partial Y / \partial u) \neq 0$ is satisfied, the original system is right invertible and $\alpha = r$. From $Y = y^{(r)}(x, u)$ we can solve for the right inverse system and denote it as $u = \psi(x, y^{(r)})$.
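As a quick illustration of condition (3) and of solving for the right inverse, consider a first-order toy system. This system is a hypothetical example chosen for its simplicity and is not taken from the paper; only the symbols $d_r$, $\alpha$ and $\psi$ follow the text.

```latex
\begin{aligned}
&\dot{x} = -x + u, \qquad y = x, \\
&d_0(x,u) = y = x, &&\frac{\partial d_0}{\partial u} = 0, \\
&d_1(x,u) = \dot{y} = -x + u, &&\frac{\partial d_1}{\partial u} = 1 \neq 0, \\
&\Rightarrow\ \alpha = 1, \qquad u = \psi(x, \dot{y}) = \dot{y} + x .
\end{aligned}
```

The input first appears in the first derivative of the output, so the relative order is 1 and the right inverse is obtained by solving $\dot{y} = -x + u$ for $u$.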

3 LSSVM Based on FCM Clustering

In the following we briefly introduce LSSVM and describe the proposed algorithm. Consider a given training set of $N$ data points $\{x_j, y_j\}_{j=1}^{N}$, with input $x_j \in R^p$ and output $y_j \in R$. The nonlinear mapping $\varphi(\cdot)$ maps the input data into a higher dimensional feature space. In the feature space, an LSSVM model takes the form $\hat{y}(x) = w^T \varphi(x) + b$ [8]. In order to improve the accuracy of the LSSVM optimization problem for a scaled training set, we look for a method that partitions the training set in detail. The FCM clustering technique is widely used because of its efficacy and simplicity. This paper adopts FCM clustering to preprocess the training set and capture its features. The number of clusters for FCM is decided by the validity function

$$S = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} \mu_{ij}^{m} \|z_i - x_j\|^2}{N \cdot \min_{i \neq q;\ i, q = 1, \dots, c} \|z_i - z_q\|^2} \tag{4}$$

where $\|\cdot\|$ is the Euclidean norm and $z_i$ is the i-th cluster center. The numerator, which matches the objective function of FCM clustering, is a compactness validity function that reflects the compactness of the clusters. The denominator is a separation validity function that measures how well the clusters are separated. In fact, $S$ approaches zero if the number of clusters approaches the number of data points, although this rarely happens in practice. In applications, we select the cluster number at the largest curvature change of $S$ and take it as the adaptive cluster number. The FCM algorithm is based on the following objective function [9]:

$$J_m(U, Z) = \min \sum_{i=1}^{c} \sum_{j=1}^{N} \mu_{ij}^{m} \|x_j - z_i\|^2; \qquad \sum_{i=1}^{c} \mu_{ij} = 1; \quad 0 \le \mu_{ij} \le 1 \tag{5}$$

where $N = n_1 + n_2 + \dots + n_c$, $Z$ is the matrix of cluster centers and $c$ is the number of clusters. The parameter $m$ is a fuzzy exponent in the range $m \in (1, \infty)$. $\mu_{ij}$ is the membership grade of the data point $x_j$ in the cluster with center $z_i$, $\mu_{ij} \in U$, and $U$ is the matrix of membership grades.
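The validity function in formula (4) can be evaluated directly once cluster centers and membership grades are available. The following is a minimal sketch in plain Python; the function names `sq_dist` and `validity_index` and the toy data are assumptions made for illustration and do not come from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def validity_index(points, centers, memberships, m=2.0):
    """Validity function S of formula (4): compactness / (N * separation).

    memberships[i][j] is mu_ij, the grade of point j in cluster i.
    """
    n = len(points)
    c = len(centers)
    # Numerator: membership-weighted squared distances to the centers.
    compactness = sum(
        (memberships[i][j] ** m) * sq_dist(centers[i], points[j])
        for i in range(c)
        for j in range(n)
    )
    # Denominator: smallest squared distance between two distinct centers.
    separation = min(
        sq_dist(centers[i], centers[q])
        for i in range(c)
        for q in range(c)
        if i != q
    )
    return compactness / (n * separation)


# Hypothetical toy data: two tight, well-separated groups on a line.
points = [(0.0,), (0.2,), (4.0,), (4.2,)]
centers = [(0.1,), (4.1,)]
memberships = [[1.0, 1.0, 0.0, 0.0],   # grades for the first cluster
               [0.0, 0.0, 1.0, 1.0]]   # grades for the second cluster
S = validity_index(points, centers, memberships)
```

Small values of `S` indicate compact, well-separated clusters. In the procedure described above, `S` would be computed for several candidate values of c and the cluster number chosen where the curve changes curvature most.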


The fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the membership grade $\mu_{ij}$ and the cluster center $z_i$ updated by

$$\mu_{ij} = \frac{1}{\sum_{q=1}^{c} \left( \frac{\|x_j - z_i\|}{\|x_j - z_q\|} \right)^{\frac{2}{m-1}}}; \qquad z_i = \frac{\sum_{j=1}^{N} \mu_{ij}^{m} x_j}{\sum_{j=1}^{N} \mu_{ij}^{m}} \tag{6}$$

We get a membership grade matrix after some iterations in accordance with formulas (5) and (6). The iteration stops when $\|U^{(K+1)} - U^{(K)}\| \le \varepsilon$, where $\varepsilon$ is a termination criterion between 0 and 1 and $K$ is the number of iterations. Each data point is assigned to the cluster in which it has the largest membership grade: if $\mu_{ij} = \max(\mu_{1j}, \dots, \mu_{ij}, \dots, \mu_{cj})$, the data point $x_j$ belongs to cluster $i$. LSSVM regression is then applied to each cluster separately. The following optimization problem is formulated for each cluster:

$$\min \left( \frac{1}{2} w_i^T w_i + \frac{\gamma_i}{2} \sum_{j=1}^{n_i} e_{ij}^2 \right) \tag{7}$$

subject to:

$$y_{ij} = w_i^T \varphi(x_{ij}) + b_i + e_{ij}; \quad i = 1, \dots, c, \quad j = 1, \dots, n_i \tag{8}$$

where $e_{ij}$ is the error for the j-th point in cluster $i$, and $\gamma_i$, $w_i$ and $b_i$ denote the regularization factor, the weight vector and the bias term of cluster $i$. Using the method of Lagrange multipliers, the Lagrangian of problem (7) and (8) is

$$L_i(w, e, \alpha) = \frac{1}{2} w_i^T w_i + \frac{\gamma_i}{2} \sum_{j=1}^{n_i} e_{ij}^2 + \sum_{j=1}^{n_i} \alpha_{ij} \left[ y_{ij} - w_i^T \varphi(x_{ij}) - b_i - e_{ij} \right] \tag{9}$$

where $(\alpha_{11}, \dots, \alpha_{1n_1}, \dots, \alpha_{c1}, \dots, \alpha_{cn_c})^T$ are the Lagrange multipliers and $(e_{11}, \dots, e_{1n_1}, \dots, e_{c1}, \dots, e_{cn_c})^T$ are the errors. The Karush-Kuhn-Tucker (KKT) conditions of each subproblem are:

$$\frac{\partial L_i}{\partial w_i} = 0; \quad \frac{\partial L_i}{\partial \alpha_{ij}} = 0; \quad \frac{\partial L_i}{\partial b_i} = 0; \quad \frac{\partial L_i}{\partial e_{ij}} = 0 \tag{10}$$

Solving the KKT conditions gives the parameters and expressions of the primal problem:

$$w_i = \sum_{j=1}^{n_i} \alpha_{ij} \varphi(x_{ij}); \quad \sum_{j=1}^{n_i} \alpha_{ij} = 0; \quad \alpha_{ij} = \gamma_i e_{ij}; \quad y_{ij} - w_i^T \varphi(x_{ij}) - b_i - e_{ij} = 0 \tag{11}$$

Define the kernel function $k(x, x_{ij}) = \varphi^T(x) \varphi(x_{ij})$; it is a symmetric function that satisfies the Mercer conditions. An appropriate kernel function and its parameters usually have to be chosen according to the problem at hand, and several choices of kernel function are possible. In this work, the radial basis function (RBF) is used as the kernel function.


According to the validity function, the training set is divided into several clusters. By training LSSVM on the inputs $x_{ij}$ we obtain the parameters $\alpha_{ij}$ and $b_i$. Substituting these parameters into the regression model of cluster $i$ gives the expression $\hat{y}(x) = \sum_{j=1}^{n_i} \alpha_{ij} k(x, x_{ij}) + b_i$.
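For one cluster, eliminating the weight vector and the errors from conditions (11) leaves a linear system in the multipliers and the bias: the multipliers sum to zero, and each training target equals the kernel expansion plus the bias plus the multiplier divided by the regularization factor. The sketch below solves that system in plain Python. The function names, the toy data and the kernel scaling exp(-||a-b||²/(2σ²)) are assumptions; the paper states only that an RBF kernel is used.

```python
import math


def rbf(a, b, sigma2):
    """RBF kernel exp(-||a-b||^2 / (2*sigma2)); this scaling is an assumption."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2.0 * sigma2))


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lssvm_fit(X, y, gamma, sigma2):
    """Return (alpha, b) for one cluster from the system implied by (11)."""
    n = len(X)
    A = [[0.0] + [1.0] * n]                    # sum_j alpha_j = 0
    for j in range(n):
        row = [1.0] + [rbf(X[j], X[k], sigma2) for k in range(n)]
        row[j + 1] += 1.0 / gamma              # alpha_j / gamma replaces e_j
        A.append(row)
    sol = solve(A, [0.0] + list(y))
    return sol[1:], sol[0]


def lssvm_predict(x, X, alpha, b, sigma2):
    """Evaluate y_hat(x) = sum_j alpha_j k(x, x_j) + b."""
    return sum(a * rbf(x, xj, sigma2) for a, xj in zip(alpha, X)) + b


# Hypothetical toy data standing in for the points of a single cluster.
X = [(0.0,), (0.5,), (1.0,), (1.5,)]
y = [math.sin(v[0]) for v in X]
alpha, b = lssvm_fit(X, y, gamma=1000.0, sigma2=0.1)
```

In the clustered scheme, `lssvm_fit` would be called once per cluster, and a new point would be evaluated with the model of the cluster it is assigned to.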

4 The Allied Inverse System Identification Using the Proposed Method

When some important feedback variables cannot be measured directly, we obtain their estimates from the left inverse soft-sensing system. However, the estimate $\hat{x} = \phi(g_1, g_2, \dots, g_l)$ cannot be expressed explicitly because the mapping $\phi$ is not known, and data for the immeasurable states are difficult to acquire, so neither an analytic expression nor a nonlinear approximation algorithm can be applied to $\phi$ directly. Considering this, in order to obtain the inverse system of the nonlinear system, we substitute the left inverse soft-sensing estimate $\hat{x} = \phi(g_1, g_2, \dots, g_l)$ into the right inverse system $u = \psi(x, y^{(r)})$ to form a new allied inverse system: $\hat{u} = \bar{\psi}(\bar{x}, g_1, g_2, \dots, g_l, y^{(r)})$. All variables in the allied inverse system are directly measurable, so we can use the proposed method to approximate it. The identification algorithm is sketched in Fig. 1 and can be summed up as follows:

Fig. 1. Identification using LSSVM based on FCM clustering

Step 1): Acquire training data points of the allied inverse system and preprocess the training set (computing derivatives, normalization and so on).
Step 2): Calculate the validity function in formula (4) and select the cluster number c at the largest curvature change.
Step 3): Initialize the FCM clustering parameters c, m, ε and K. Randomly generate a membership grade matrix and take it as U0. Start the iteration of the FCM clustering algorithm: at the k-th iteration, calculate the center vector zk from Uk, then update zk and Uk by formula (6).
Step 4): Index the training points according to the membership degree to form new training sets. Select the regularization factor γ and the kernel parameter σ. Fit each training set using LSSVM.
Step 5): Test the allied inverse model with the testing data points.
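Steps 3 and 4 above (FCM iteration followed by assigning each point to its strongest cluster) can be sketched in plain Python as follows. The function names, the seed handling, the toy data and the use of the largest absolute change in the membership matrix as the stopping norm are assumptions made for this illustration; the paper does not specify which matrix norm is used.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def fcm(points, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means: returns (centers, U) with U[i][j] = mu_ij."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial membership matrix U0 whose columns sum to 1.
    U = [[rng.random() + 1e-3 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        col = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= col
    centers = []
    for _ in range(max_iter):
        # Center update: membership-weighted mean of the points.
        centers = []
        for i in range(c):
            w = [U[i][j] ** m for j in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[j] * points[j][d] for j in range(n)) / total
                for d in range(dim)))
        # Membership update from the squared distances to the new centers.
        new_U = [[0.0] * n for _ in range(c)]
        for j in range(n):
            d2 = [max(sq_dist(points[j], centers[i]), 1e-12) for i in range(c)]
            for i in range(c):
                new_U[i][j] = 1.0 / sum(
                    (d2[i] / d2[q]) ** (1.0 / (m - 1.0)) for q in range(c))
        # Stopping rule (assumed norm): largest absolute change in U.
        change = max(abs(new_U[i][j] - U[i][j])
                     for i in range(c) for j in range(n))
        U = new_U
        if change <= eps:
            break
    return centers, U


def hard_labels(U):
    """Assign each point to the cluster with the largest membership grade."""
    c, n = len(U), len(U[0])
    return [max(range(c), key=lambda i: U[i][j]) for j in range(n)]


# Hypothetical toy data: two groups on a line.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
centers, U = fcm(points, c=2)
labels = hard_labels(U)
```

The resulting `labels` would be used to split the training set into per-cluster subsets, each of which is then fitted by its own LSSVM model.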

5 Simulation Results and Analysis

The plant to be considered is described by the following differential equations:

$$\begin{cases} \dot{x}_1 = x_2 - 314.159 \\ \dot{x}_2 = 19.635 - 0.625 x_1 - 35.863 \sin x_1 + 11.232 \sin(2 x_1) \\ \dot{x}_3 = 0.1 u - 0.273 x_3 + 0.173 \cos x_1 \\ y = 0.514 \sin^2 x_1 + (0.758 x_3 + 0.242 \cos x_1)^2 \end{cases} \tag{12}$$

where the state variables are $(x_1, x_2, x_3)^T$, $\bar{x} = x_2$ is measurable and $\hat{x} = (x_1, x_3)^T$ is immeasurable. We first construct the right inverse system by computing the first order derivative of $y$. Since $\partial \dot{y} / \partial u \neq 0$, the relative order is equal to 1, namely $\alpha = 1$. The right inverse system of the system described in (12) exists and is formulated as $u = \psi(x_1, x_2, x_3, \dot{y})$. The immeasurable feedback variables can be obtained from the left inverse soft-sensing system. Define $z = (x_2, y)^T$ and compute the first order derivative. Since $\mathrm{rank}(\partial z / \partial \hat{x}^T) < 2$, we introduce $\dot{x}_2$ and continue. Define $z_{(1)} = (x_2, \dot{x}_2, y)^T$ and compute the first order derivative, which gives $\mathrm{rank}(\partial z_{(1)} / \partial \hat{x}^T) = 2$. So we select two independent components to establish the nonlinear function $(x_1, x_3) = \phi(\dot{x}_2, y)$ and substitute the soft-sensing relation into the right inverse system. The allied inverse system to be estimated is the function $\hat{u} = \bar{\psi}(x_2, \dot{x}_2, y, \dot{y})$. Fig. 2 illustrates the identification framework of the allied inverse system.

Fig. 2. The inverse system identification framework

A sine wave with frequency 0.02 rad/sec is used as the excitation signal. The simulation time is 500 s and the sampling period is 0.1 s. From formula (12), we obtain 5000 data points and preprocess the training set by differentiation and normalization. Training vectors are constructed in the form $(x_2, \dot{x}_2, y, \dot{y}, u)$. We set the weighting exponent $m = 2$, the termination criterion $\varepsilon = 10^{-5}$ and the maximum number of iterations $K = 100$. We select 300 training data points from the samples, calculate the value of the validity function in formula (4), and find that the cluster number at the largest curvature change is 5. We then start the cluster iteration to divide the training set into clusters and choose the regularization factor and the kernel parameter. Here we select the RBF kernel with $\gamma = 1000$ and $\sigma^2 = 0.1$ for all regressions. Each cluster is trained with LSSVM and tested on another 587 testing points. All experiments are executed on a 2.66 GHz PC with 512 MB of memory. The performance indexes are the mean absolute error (MAE) and the root mean square error (RMSE). To examine the effect of the validity function, the experiment is run at c = 4, c = 5, c = 7 and c = 9. Table 1 shows the results. We also compare the training and testing results of plain LSSVM with those of the proposed method. For the above training set, LSSVM alone cannot approximate the system well; Fig. 3 gives the curves for identification and testing.

Table 1. Comparison in RMSE, MAE and time for different cluster numbers

                      c = 4                c = 5                c = 7                c = 9
                      training  testing    training  testing    training  testing    training  testing
MAE (×10⁻⁴)           8.294     8.991      4.599     4.551      5.326     8.055      5.630     8.759
RMSE (×10⁻³)          2.334     9.132      1.121     2.574      2.117     3.226      2.273     3.043
constructing time     0.5187 s             0.3977 s             1.1571 s             0.7606 s
total time            0.8748 s             0.6140 s             1.3233 s             0.9061 s

Fig. 3. The training curve and the testing curves using different methods

Table 2. Simulation results of the proposed method and LSSVM

                      MAE                  RMSE                 time (s)
                      training  testing    training  testing
the proposed method   0.00046   0.00046    0.00112   0.00257    0.614
LSSVM                 0.0260    0.0337     0.0349    0.0371     3.047
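The right inverse of plant (12) can be checked numerically: differentiating the output once brings in the input through the third state equation, and that relation can be solved for the input. The sketch below is a plain Python illustration; the function names and the test state are assumptions, and the inverse is valid only where the factor 0.758·x3 + 0.242·cos x1 is nonzero.

```python
import math


def state_derivs(x, u):
    """Right-hand side of plant (12) for state x = (x1, x2, x3) and input u."""
    x1, x2, x3 = x
    dx1 = x2 - 314.159
    dx2 = (19.635 - 0.625 * x1 - 35.863 * math.sin(x1)
           + 11.232 * math.sin(2.0 * x1))
    dx3 = 0.1 * u - 0.273 * x3 + 0.173 * math.cos(x1)
    return dx1, dx2, dx3


def y_dot(x, u):
    """First time derivative of the output y of plant (12)."""
    x1, _, x3 = x
    dx1, _, dx3 = state_derivs(x, u)
    g = 0.758 * x3 + 0.242 * math.cos(x1)
    dg = 0.758 * dx3 - 0.242 * math.sin(x1) * dx1
    # d/dt of 0.514*sin^2(x1) is 0.514*sin(2*x1)*dx1.
    return 0.514 * math.sin(2.0 * x1) * dx1 + 2.0 * g * dg


def right_inverse(x, ydot):
    """Solve ydot = y_dot(x, u) for u; valid only where g != 0."""
    x1, x2, x3 = x
    dx1 = x2 - 314.159
    g = 0.758 * x3 + 0.242 * math.cos(x1)
    if abs(g) < 1e-12:
        raise ValueError("d(ydot)/du vanishes at this state")
    dg = (ydot - 0.514 * math.sin(2.0 * x1) * dx1) / (2.0 * g)
    dx3 = (dg + 0.242 * math.sin(x1) * dx1) / 0.758
    return (dx3 + 0.273 * x3 - 0.173 * math.cos(x1)) / 0.1


# Hypothetical state and input used only to check the round trip.
x_test = (0.8, 314.5, 1.2)
u_test = 0.7
u_recovered = right_inverse(x_test, y_dot(x_test, u_test))
```

This inverse still depends on the immeasurable states x1 and x3, which is why the text replaces them by the soft-sensing relation and identifies the allied inverse system from measurable quantities instead.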


Table 2 gives the result of the comparison. The proposed method clearly outperforms plain LSSVM: both the training accuracy and the generalization on the testing set are better.
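The two performance indexes used in the tables are the standard mean absolute error and root mean square error. A minimal sketch of how they can be computed is given below; the function names and the sample numbers are illustrative assumptions and are unrelated to the reported results.

```python
import math


def mae(targets, predictions):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def rmse(targets, predictions):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets))


# Hypothetical values used only to exercise the two functions.
targets = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 5.0]
```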

6 Conclusion

In this paper, we present an LSSVM method based on FCM clustering and apply it to the identification of the allied inverse system. The experiments have shown that LSSVM regression after automatic clustering captures more features and gives a more accurate estimate than LSSVM and SVM, and at the same time has a low computational cost. The method is efficient and can be used for identification with good performance.

Acknowledgement

This work was supported by the Natural Science Foundation of Jiangsu Province, China under Grant BK2006564 and the Doctoral Project of the Ministry of Education of P.R. China under Grant 20070286001.

References

1. Dai, X., Liu, J., Feng, C., et al.: Neural network α-th order inverse system method for the control of nonlinear continuous systems. IEE Proc.-Control Theory Appl. 145, 519–523 (1998)
2. Dai, X.: Multivariable nonlinear inverse control methods with neural networks. The Science Press, Beijing (2001)
3. Vapnik, V.: An overview of statistical learning theory. IEEE Transactions on Neural Networks 10, 955–999 (1999)
4. Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letter 9(3), 293–300 (1999)
5. Kim, P.J., Chang, H.J., Song, D.S., et al.: Fast support vector data description using k-means clustering. In: Liu, D., Fei, S., Hou, Z. (eds.) ISNN 2007. LNCS, vol. 4493, pp. 506–514. Springer, Heidelberg (2007)
6. Yao, J., Dash, M., Tan, S.T.: Entropy-based fuzzy clustering and fuzzy modeling. Fuzzy Sets and Systems 113, 381–388 (2000)
7. Dai, X.Z., Wang, W.C., Ding, Y.H., et al.: Assumed inherent sensor inversion based ANN dynamic soft-sensing method and its application in erythromycin fermentation process. Computers and Chemical Engineering 30, 1203–1225 (2006)
8. Sun, C.Y., Song, J.Y.: An adaptive internal model control based on LS-SVM. In: Liu, D., Fei, S., Hou, Z. (eds.) ISNN 2007. LNCS, vol. 4493, pp. 479–485. Springer, Heidelberg (2007)
9. Xing, S.W., Liu, H.B., Niu, X.X.: Fuzzy Support Vector Machines based on FCM Clustering. In: Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, pp. 2608–2613. IEEE Press, Los Alamitos (2005)

A New Approach to Division of Attribute Space for SVR Based Classification Rule Extraction

Dexian Zhang¹, Ailing Duan¹, Yanfeng Fan², and Ziqiang Wang¹

¹ School of Information Science and Engineering, Henan University of Technology, Zheng Zhou 450052, China
[email protected]
² Computer College, Northwestern Polytechnical University, Xi’an 710072, China

Abstract. SVM based rule extraction has become an important preprocessing technique for data mining, pattern classification, and so on. Two key problems must be solved in classification rule extraction based on SVMs: the ranking of attribute importance and the discretization of continuous attributes. In this paper, a new measure for determining the importance level of the attributes based on trained SVR (support vector regression) classifiers is first proposed. Based on this measure, a new approach for dividing the continuous attribute space based on support vectors is presented, and a new approach for classification rule extraction from trained SVR classifiers is given. The performance of the new approach is demonstrated on several computing cases. The experimental results show that the proposed approach can remarkably improve the validity of the extracted classification rules compared with other rule construction approaches, especially for complicated classification problems.

1 Introduction

Rule extraction from trained SVMs has become an important preprocessing technique for data mining, pattern classification, and so on. It aims at extracting rules that indicate the relationship between the inputs and outputs of a trained SVM and at describing that relationship in simple rule forms that are easy to understand. Rule extraction based on SVM can be widely used for integrating different kinds of AI techniques. In particular, it can provide new approaches for automatic knowledge acquisition and discovery and new tools for rule learning. Moreover, SVM based rule extraction can promote the application of the SVM technique in fields such as data mining and decision support, improve the application effect of SVMs and widen their application fields.

The existing approaches for constructing classification rules can be roughly classified into two categories: data driven approaches and model driven approaches. The main characteristic of the data driven approaches is that the symbolic rules are extracted entirely by processing the sample data. The most representative approach is the ID3 algorithm and the corresponding C4.5 system introduced by Quinlan for inducing classification models, also called decision trees, from data. This approach has a clear and simple theory and good rule extraction ability, and it is appropriate for problems with a large number of samples. But it still has many problems, such as heavy dependence on the number and distribution of samples, excessive sensitivity to noise, and difficulty in dealing with continuous attributes effectively. The main characteristic of the model driven approaches is that a model is first established from the sample set, and rules are then extracted based on the relation between inputs and outputs represented by the model. Theoretically, these rule extraction approaches can overcome the shortcomings of the data driven approaches mentioned above, so the model driven approaches are promising for rule extraction. The representative approaches are rule extraction approaches based on neural networks [1-8]. Though these methods are effective to a certain extent, some problems remain, such as low efficiency and validity and difficulty in dealing with continuous attributes.

Two key problems must be solved in classification rule extraction: attribute selection and the discretization of continuous attributes. Attribute selection selects the best subset of attributes out of the original set, that is, the attributes that are important for maintaining the concepts in the original data are selected from the entire attribute set. How to determine the importance level of attributes is the key to attribute selection. Mutual information based attribute selection [9-10] is a common method, in which the information content of each attribute is evaluated with regard to class labels and other attributes. By calculating mutual information, the attributes are ranked by importance according to their ability to maximize the evaluation formula. Another attribute selection method uses an entropy measure to evaluate the relative importance of attributes [11]. The entropy measure is based on the similarities of different instances without considering the class labels. In [12], the separability-correlation measure is proposed for determining the importance of the original attributes. The measure includes two parts: the ratio of intra-class distance to inter-class distance and an attribute-class correlation measure. Through the attribute-class correlation measure, the correlation between changes in attributes and the corresponding changes in class labels is taken into account when ranking the importance of attributes. The attribute selection methods mentioned above are sample driven methods. Their performance depends heavily on the number and distribution of samples, and it is also difficult to use them with continuous attributes. Therefore, more effective heuristic information is still required for attribute selection in classification rule extraction.

For the discretization of continuous attributes, some approaches have been proposed, such as those based on information entropy in ID3 and those based on χ² distribution analysis [4]. The discretization of the attribute space depends mainly on the position and shape characteristics of the classification hypersurface, but the approaches mentioned above reflect these characteristics only indirectly. Therefore, although these approaches are effective to some extent, more effective approaches are still required for the discretization of continuous attributes in classification rule extraction.

SVM (support vector machine) is a technique for data classification based on statistical learning theory. Based on the principle of structural risk minimization, it can generalize well even for small sample sets. Therefore, how to extract rules from trained SVMs has become an important preprocessing technique for data mining, pattern classification, and so on [13]. This paper mainly studies attribute selection and the discretization of continuous attributes based on trained SVR classifiers, and develops a new approach for rule extraction.

The rest of the paper is organized as follows. Section 2 introduces how to construct a classifier based on SVR. In Section 3, a new measure for attribute importance ranking is proposed. In Section 4, a new approach for dividing the continuous attribute space based on support vectors is presented. In Section 5, a new rule extraction algorithm is described. Section 6 shows the experimental results of rule extraction by the proposed algorithm. Section 7 concludes the paper.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 691–700, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 The Classifier Based on SVR

SVR is a technique for pattern classification, function approximation, and so on. SVR classifiers have the advantage that they can be used for classification problems with more than two class labels. In this paper, we use an SVR classifier to determine the importance level of attributes and to construct classification rules. Given a set of training sample points $(x_i, z_i)$, $i = 1, \ldots, l$, in which $x_i \in R^n$ is an input and $z_i \in R^1$ is a target output, the output function of the SVR classifier is a summation of kernel functions $K(\cdot, \cdot)$ constructed on the basis of a set of support vectors $x_i$. The function can be described as

\[ Z(x) = \sum_{i=1}^{l} \beta_i K(x_i, x) + b \tag{1} \]

Here, $\beta_i$ is a nonzero parameter obtained by training the SVR, and $K(x_i, x)$ is the kernel function. In this paper, we use the following radial basis function (RBF) as the kernel function:

\[ K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0 \tag{2} \]

where $\gamma$ is a kernel parameter. From formulas (1) and (2), we get

\[ \frac{\partial Z(x)}{\partial x_k} = \sum_{i=1}^{l} 2\gamma\beta_i (x_{ik} - x_k) \exp(-\gamma \|x_i - x\|^2) \tag{3} \]

Here $x_{ik}$ and $x_k$ are the $k$-th attribute values of the $i$-th support vector and of the sample point $x$, respectively.


During the construction of classification rules, only the attribute space covered by the sample set needs to be taken into account. According to formula (3), the derivatives of the SVR output $Z(x)$ with respect to each SVR input $x_k$ exist at every order, since the RBF kernel is infinitely differentiable.
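Formulas (1) to (3) can be illustrated with a short numerical sketch. It is only a sketch under stated assumptions: the support vectors, the coefficients $\beta_i$, the bias $b$ and the value of $\gamma$ below are made-up toy values, not values from the paper, and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(xi, x, gamma):
    """K(x_i, x) = exp(-gamma * ||x_i - x||^2), as in formula (2)."""
    diff = np.asarray(xi, dtype=float) - np.asarray(x, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def svr_output(x, support_vectors, beta, b, gamma):
    """Z(x) = sum_i beta_i * K(x_i, x) + b, as in formula (1)."""
    return sum(bi * rbf_kernel(sv, x, gamma)
               for sv, bi in zip(support_vectors, beta)) + b

def svr_gradient(x, support_vectors, beta, gamma):
    """dZ/dx_k = sum_i 2*gamma*beta_i*(x_ik - x_k)*K(x_i, x), as in formula (3)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for sv, bi in zip(support_vectors, beta):
        sv = np.asarray(sv, dtype=float)
        grad += 2.0 * gamma * bi * (sv - x) * rbf_kernel(sv, x, gamma)
    return grad

# Toy model (assumed values for illustration only)
SV = np.array([[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]])
BETA = np.array([0.7, -1.2, 0.4])
B, GAMMA = 0.1, 0.5
```

A central finite difference of `svr_output` should agree with `svr_gradient` at any point, which is a convenient way to check the sign convention of formula (3).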

3

Measure for Attribute Importance Ranking

Definition 1. For a given sample set, the attribute value space $\Omega$ is defined as

\[ \Omega = \{x \mid \min x_k \le x_k \le \max x_k,\ k = 1, \ldots, n\} \tag{4} \]

where $\min x_k$ and $\max x_k$ are the minimal and maximal values of the $k$-th attribute in the given sample set, respectively.

In the coordinate system formed by the attributes and the class label, for a given attribute space $\Omega$, the importance level of each attribute depends on the mean perpendicular degree between this attribute axis and the classification hypersurface in the adjacent space of the classification hypersurface. The higher the mean perpendicular degree, the higher the importance level. A measure for attribute importance ranking therefore has to solve two problems. One is how to estimate the adjacent space of the classification hypersurface. The other is how to estimate the mean perpendicular degree between the attribute axes and the classification hypersurface in the given space. The methods for solving these two problems are discussed next.

3.1

Estimation of the Adjacent Space of Classification Hypersurface

In classification tasks, class labels are quantified as integers in some order, for example 0, 1, 2, ... For a given trained SVR classifier and the attribute space, supposing the classification error of the trained SVR classifier given by formula (1) is $\epsilon_e$, a point $x$ in the adjacent space of the classification hypersurface must satisfy the following conditions.

Condition 1. $\tau < \mathrm{MOD}(Z(x)) < 1 - \tau$, where $\mathrm{MOD}(\cdot)$ returns the fractional part of a floating-point value, $Z(x)$ is the output value of the trained SVR classifier at point $x$, and $\tau$ is a parameter with $\epsilon_e < \tau < 0.5$.

Condition 2. $\|\mathrm{grad}(x)\| > \eta\,\|\mathrm{grad}(\Gamma)\|$, where $\|\mathrm{grad}(x)\|$ is the gradient norm at point $x$, $\|\mathrm{grad}(\Gamma)\|$ is the mean gradient norm in the attribute space $\Gamma$, and $\eta$ is a parameter with $0 < \eta < 2$.

Definition 2. For a given attribute space $\Gamma$, $\Gamma \subset \Omega$, the adjacent space of the classification hypersurface $V_\Gamma$ is defined as

\[ V_\Gamma = \{x \mid \tau < \mathrm{MOD}(Z(x)) < 1 - \tau \ \text{and}\ \|\mathrm{grad}(x)\| > \eta\,\|\mathrm{grad}(\Gamma)\|\} \tag{5} \]


3.2


Computing of the Mean Perpendicular Degree

Definition 3. For a given trained SVR and attribute space $\Gamma$, $\Gamma \subset \Omega$, the perpendicular level between the classification hypersurface and the attribute axis $x_k$ is defined as

\[ P_{x_k}(x) = \frac{\left|\partial Z(x) / \partial x_k\right|}{\|\mathrm{grad}(x)\|} \tag{6} \]

For a given attribute space $\Gamma$, $\Gamma \subset \Omega$, whose adjacent space of the classification hypersurface is $V_\Gamma$, we can generate a sample set $S$ randomly in $V_\Gamma$. The measure of the importance level of attribute $x_k$ can then be computed as

\[ J_P(x_k) = \frac{\sum_{x \in S} P_{x_k}(x)}{|S|} \tag{7} \]

Here $|S|$ is the number of samples in the sample set $S$. In this paper we usually let $|S|$ be between 20 and 200. The importance level measure $J_P(x_k)$ represents the degree of influence of attribute $x_k$ on the classification. In the rule extraction process, the value $J_P(x_k)$ is therefore the key guidance information for selecting attributes and dividing the attribute value space.
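One possible way to estimate $J_P(x_k)$ is rejection sampling: draw points uniformly in the attribute box, keep those that satisfy Conditions 1 and 2, and average the perpendicular levels of formula (6). The sketch below assumes this sampling scheme and a Monte Carlo estimate of the mean gradient norm; neither detail is specified in the text, and the function name, the callables `Z` and `grad_Z`, and the default parameter values are hypothetical.

```python
import numpy as np

def importance_levels(Z, grad_Z, lower, upper, tau=0.2, eta=1.0,
                      n_samples=100, n_probe=500, max_draws=100000, seed=0):
    """Estimate J_P(x_k) of formula (7) for every attribute k.

    Z and grad_Z are callables giving the classifier output and its gradient.
    lower/upper bound the attribute space (a box, as in Definition 1).
    """
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    # Mean gradient norm over the attribute space (Monte Carlo estimate, an assumption)
    mean_norm = np.mean([np.linalg.norm(grad_Z(rng.uniform(lower, upper)))
                         for _ in range(n_probe)])

    perp = []
    for _ in range(max_draws):
        if len(perp) == n_samples:
            break
        x = rng.uniform(lower, upper)
        g = grad_Z(x)
        norm = np.linalg.norm(g)
        z = Z(x)
        frac = z - np.floor(z)                 # MOD(Z(x)): fractional part
        # Conditions 1 and 2 define the adjacent space V of formula (5)
        if tau < frac < 1.0 - tau and norm > eta * mean_norm:
            perp.append(np.abs(g) / norm)      # formula (6), all k at once
    if not perp:
        raise ValueError("no sample fell into the adjacent space")
    return np.mean(perp, axis=0)               # formula (7)
```

For a toy classifier whose output depends on the first attribute only, the estimate is 1 for that attribute and 0 for the other, which matches the intuition that the hypersurface is perpendicular to the first axis.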

4

Division of Attribute Space

In this paper, classification rules are expressed in the commonly used IF-THEN form:

IF <conditions> THEN <class>    (8)

The rule antecedent (IF part) contains a set of conditions connected by the logical conjunction operator (AND). In this paper we refer to each rule condition as a term, so that the rule antecedent is a logical conjunction of terms of the form: IF term 1 AND term 2 AND .... Each term has one of two forms. One form is a triple ⟨attribute, operator, value⟩, where the operator can be <, ≥, or =. The other form is a triple that restricts an attribute to an interval, as in the terms of Tables 3 and 4 (for example, restricting petalwidth to [0.7, 1.55)). Each attribute is used only once in one rule condition. Each rule antecedent describes an attribute subspace. The rule consequent (THEN part) specifies the class label predicted for cases whose attributes satisfy all the terms specified in the rule antecedent.

Definition 4. For a given rule, if an attribute has been used in the rule condition, it is called a used attribute. Otherwise it is called an unused attribute.

Definition 5. For a given attribute space $\Omega$ and the sample set $S_\Omega$ in it, let $P_j$ denote the proportion of class label $j$ in $S_\Omega$. If

\[ P_i = \max_j P_j \tag{9} \]

then class label $i$ is the key class label of the attribute space $\Omega$.


Definition 6. For a given rule and all of its unused attributes $x_j$, if

\[ J_P(x_k) = \max_j J_P(x_j) \tag{10} \]

then the attribute $x_k$ is the key attribute for the rule extraction.

For a given classification task and a trained SVR classifier, the rule extraction process begins with the division of the whole attribute space $\Omega$. In the whole attribute space $\Omega$, the key attribute $x_k$ is selected from all of the attributes. The key attribute $x_k$ is then divided to create rules. At first, therefore, each rule antecedent comprises only one attribute.

Definition 7. For a given rule, if $1 - P_i \le \epsilon_e$ or $V_r / V \le \epsilon_v$, the rule is called a finished rule. Otherwise, it is called an unfinished rule. Here, $P_i$ is the proportion of the key class label among all class labels in the given sample set $S_T$; $V_r$ and $V$ are the volumes of the attribute subspace described by the rule antecedent and of the whole attribute space, respectively; $\epsilon_e$ and $\epsilon_v$ are the given classification error and the volume limit for space division, respectively.

Each created rule is examined to judge whether it is an unfinished rule. For each unfinished rule, the attribute subspace described by the rule antecedent is divided further in the same way as the whole attribute space $\Omega$, and more attributes are added to the unfinished rule. The division of the selected key attribute includes two steps: initial interval division and interval mergence.

4.1

Initial Interval Division

If the key attribute is a discrete one with a limited number of values in the given attribute space, construct the interval $[v_k - \xi, v_k + \xi]$ for each value $v_k$ in the attribute value set. Here, $\xi$ is a constant; in this paper we usually set $\xi = 0.001$. The number of intervals of the discrete attribute is thus the size of its value set.

For the case that the key attribute is continuous, the paper proposes an initial interval division based on the support vectors of the SVR classifier. According to the discussion in Section 2, the support vectors must be located in the adjacent space of the classification hypersurface. The support vectors therefore provide effective guidance for discretizing continuous attributes into initial intervals. The method can be described as follows.

Step 1: Initializing
a) Determine the value range $[\omega_1, \omega_2]$ of the key attribute $x_k$ based on the minimal and maximal values of the attribute in the given whole attribute space $\Omega$ and on the rule antecedent of the unfinished rule to be extended.
b) Determine the value set $\{v_{k,1}, \ldots, v_{k,n}\}$ of the key attribute in $[\omega_1, \omega_2]$ used by the support vectors of the trained SVR classifier. Sort the set $\{v_{k,1}, \ldots, v_{k,n}\}$ in ascending order.

Step 2: Initial interval generating
For each value $v_{k,i}$ in the value set $\{v_{k,1}, \ldots, v_{k,n}\}$ of the key attribute $x_k$, generate an initial interval. If $i = 1$, the initial interval is $[\omega_1, (v_{k,1} + v_{k,2})/2]$. If $i = n$, the initial interval is $[(v_{k,n-1} + v_{k,n})/2, \omega_2]$. For the other values in the value set, the initial interval is $[(v_{k,i-1} + v_{k,i})/2, (v_{k,i} + v_{k,i+1})/2]$.

4.2

Interval Mergence

For each given unfinished rule, after the initial intervals for its key attribute have been generated, each initial interval can be added to the rule to create a new rule. Each rule newly created from an initial interval has its own attribute subspace described by its rule antecedent, and therefore its own key attribute and key class label in this subspace. The differences in key attribute and key class label among the initial intervals reflect the position and shape characteristics of the classification hypersurface in or near these subspaces. Interval mergence should therefore be based on the differences in key attribute and key class label among the intervals.

Each created rule is examined to judge whether it is an unfinished rule. For two adjacent intervals whose newly created rules are both unfinished, different key attributes indicate that the position and shape characteristics of the classification hypersurface differ considerably in or near the two subspaces, so the two adjacent intervals cannot be merged. If their key attributes are the same but their key class labels are not, the two intervals cannot be merged either. The mergence condition for two such adjacent intervals is thus that their key attributes and key class labels are the same. For two adjacent intervals whose newly created rules are both finished, the mergence condition is that their key class labels are the same.
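The initial interval division for a continuous key attribute and the mergence conditions above can be sketched as follows. Several details are assumptions made for illustration: duplicate support-vector values are removed before taking midpoints, a finished interval is never merged with an unfinished one (the text does not cover this mixed case), and a merged interval keeps the description of its first member instead of being re-evaluated. The callback `describe` is hypothetical; it stands for whatever procedure returns the finished flag, key class label and key attribute of the rule built on an interval.

```python
def initial_intervals(values, low, high):
    """Split [low, high] at the midpoints of consecutive support-vector values."""
    v = sorted(set(values))              # ascending order; duplicates dropped (assumption)
    if len(v) < 2:
        return [(low, high)]
    mids = [(a + b) / 2.0 for a, b in zip(v, v[1:])]
    bounds = [low] + mids + [high]
    return list(zip(bounds, bounds[1:]))  # one interval per value

def can_merge(a, b):
    """a and b are (finished, key_label, key_attribute) of two adjacent intervals."""
    fin_a, label_a, attr_a = a
    fin_b, label_b, attr_b = b
    if fin_a and fin_b:                   # both finished: same key class label
        return label_a == label_b
    if not fin_a and not fin_b:           # both unfinished: same label and key attribute
        return label_a == label_b and attr_a == attr_b
    return False                          # mixed case: not merged (assumption)

def merge_intervals(intervals, describe):
    """Merge adjacent intervals from left to right whenever can_merge allows it."""
    merged = []
    for iv in intervals:
        info = describe(iv)
        if merged and can_merge(merged[-1][1], info):
            prev_iv, prev_info = merged[-1]
            merged[-1] = ((prev_iv[0], iv[1]), prev_info)
        else:
            merged.append((iv, info))
    return [iv for iv, _ in merged]
```

For example, support-vector values 1, 2 and 4 in the range [0, 5] give the initial intervals [0, 1.5], [1.5, 3] and [3, 5]; if the first two lead to finished rules with the same key class label, they are merged into [0, 3].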

5

Rules Extraction Method

The algorithm for classification rule construction based on trained SVR classifiers proposed in this paper is described as follows.

Step 1: Initializing
a) Divide the given sample set into two parts, i.e., the training sample set and the test set. According to the training sample set, generate the attribute space $\Omega$ by formula (4).
b) Set the predefined value of the error rate $\epsilon_e$ and the volume limit $\epsilon_v$ for space division.

Step 2: Rule generating
a) Generate a queue R for finished rules and a queue U for unfinished rules.
b) Out of the attribute set, select the attribute $x_k$ with the largest value of $J_P(x_k)$ computed by formula (7) as the key attribute. Divide attribute $x_k$ into intervals. Merge the pairs of adjacent intervals. A rule is generated for each merged interval. If the class error of the generated rule on the sample set is less than the predefined value, put it into queue R; otherwise put it into queue U.
c) If U is empty, the extraction process terminates; otherwise go to d).
d) Pick an unfinished rule from queue U in a certain order, and perform interval division and mergence. A rule is generated for each merged interval. If the class error of the generated rule on the sample set is less than the predefined value, put it into queue R; otherwise put it into queue U. Go to c).

Step 3: Rule processing
Examine the number of rules for each class label. Let the rules of the class label with the largest number of rules be the default rules.
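The control flow of Steps 2 and 3 is a worklist loop over the two queues. The sketch below shows only that loop; the callables `is_finished` and `refine` are hypothetical placeholders for the finished-rule test of Definition 7 and for the interval division and mergence of Section 4, and rules are treated as opaque objects. Unfinished rules are taken in first-in, first-out order, which is one possible reading of "a certain order".

```python
from collections import Counter, deque

def extract_rules(initial_rules, is_finished, refine):
    """Queue-based rule generation (Step 2).

    is_finished(rule) -> bool   : finished-rule test (placeholder)
    refine(rule) -> list[rule]  : division and mergence on the rule's subspace (placeholder)
    The loop terminates only if refine eventually yields finished rules,
    which the volume limit in Definition 7 is meant to guarantee.
    """
    finished = deque()     # queue R
    unfinished = deque()   # queue U
    for rule in initial_rules:
        (finished if is_finished(rule) else unfinished).append(rule)
    while unfinished:                       # step c)
        rule = unfinished.popleft()         # step d), FIFO order (assumption)
        for child in refine(rule):
            (finished if is_finished(child) else unfinished).append(child)
    return list(finished)

def default_class(rule_labels):
    """Step 3: the class label carried by the largest number of rules."""
    return Counter(rule_labels).most_common(1)[0][0]
```

In a toy setting where a rule is a one-dimensional interval, `is_finished` checks that the width is at most 1, and `refine` halves the interval, the call `extract_rules([(0.0, 4.0)], ...)` returns the four unit intervals in left-to-right order.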

6

Experiment and Analysis

The spiral problem [14] and the congressional voting records (voting for short), hepatitis, iris plant (iris for short), and Statlog Australian credit approval (credit-a for short) data sets from the UCI repository [15] are employed as computing cases, as shown in Table 1.

Table 1. Computing Cases

                        Spiral  Voting  Hepatitis  Iris  Credit-A
Total Samples           168     232     80         150   690
Training Samples        84      78      53         50    173
Testing Samples         84      154     27         100   517
Classification Numbers  2       2       2          3     2
Total Attributes        2       16      19         4     15
Discrete Attributes     0       16      13         0     9
Continuous Attributes   2       0       6          4     6

Since no other approaches for extracting rules from SVR are available, we include a popular rule learning approach, C4.5R, for comparison. The experimental results are tabulated in Table 2. For the spiral problem and the Iris plant problem, the rule sets extracted by the new approach are shown in Table 3 and Table 4, respectively.

Table 2. Experimental Results Comparison between New Approach (NA) and C4.5R

           #Rules (NA : C4.5R)  Err.Train (NA : C4.5R)  Err.Test (NA : C4.5R)
Spiral     5 : 3                8.33% : 38.1%           9.52% : 40.5%
Voting     2 : 4                2.5% : 2.6%             3.8% : 3.2%
Hepatitis  4 : 5                3.77% : 3.8%            7.4% : 29.6%
Iris       3 : 4                0% : 0%                 8% : 10%
Credit-A   2 : 3                14.3% : 13.9%           15% : 14.9%

Table 3. Rules Set of Spiral Problem Generated by the Proposed Algorithm

R1  x0 < −2.91 −→ C0
R2  x0 [−1.069, −0.048) ∧ x1 [−1.947, 1.017) −→ C0
R3  x0 [−0.048, 1.065) ∧ x1 [−2.018, −1.017) −→ C0
R4  x0 [1.065, 2.19) ∧ x1 ≥ −1.62 −→ C0
R5  Default −→ C1

Table 4. Rules Set of Iris plant Problem Generated by the Proposed Algorithm

R1  petalwidth [0.7, 1.55) −→ Iris-versicolor
R2  petalwidth ≥ 1.55 −→ Iris-virginica
R3  Default −→ Iris-setosa

Table 2 shows that the rule extraction results of the new approach are better than or comparable to those of C4.5R, and clearly better for the spiral problem. For the spiral problem, C4.5R has difficulty extracting effective rules (40.5% test error), whereas the new approach reaches 9.52% test error, a result beyond our anticipation. This suggests that the proposed approach can remarkably improve the validity of the extracted rules for complicated classification problems. Moreover, in four of the five cases the new approach extracts fewer rules than C4.5R; the spiral problem is the exception. The test error of the rules extracted by the new approach is lower than that of C4.5R on the Spiral, Hepatitis, and Iris cases and slightly higher on the Voting and Credit-A cases.
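To make the rule format concrete, the three Iris rules of Table 4 can be applied with a few lines of code. This is only an illustration of how an ordered rule set with a default rule is evaluated; the function name is hypothetical and the thresholds are copied from Table 4.

```python
def classify_iris(petal_width):
    """Apply rules R1-R3 of Table 4 to a petal width value."""
    if 0.7 <= petal_width < 1.55:      # R1: petalwidth in [0.7, 1.55)
        return "Iris-versicolor"
    if petal_width >= 1.55:            # R2: petalwidth >= 1.55
        return "Iris-virginica"
    return "Iris-setosa"               # R3: default rule
```

Because the antecedents of R1 and R2 do not overlap, the order in which they are tested does not matter here; only the default rule has to come last.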

7

Conclusions

In this paper, based on an analysis of the relation between the position and shape characteristics of the classification hypersurface and the gradient distribution of the trained SVR classifier, a new measure for determining the importance level of attributes based on trained SVR classifiers is proposed. A new approach for the division of the continuous attribute space based on support vectors is presented. Building on this work, a new approach and algorithm for rule extraction based on trained SVRs is proposed. It is suitable for classification problems with continuous attributes. The performance of the new approach is demonstrated on several typical examples. The computing results indicate that the new approach can remarkably improve the validity of the extracted rules compared with C4.5R, especially for complicated classification problems.

References

1. Fu, L.: Rule Generation from Neural Networks. IEEE Trans. Systems Man. Cybernet 24, 1114–1124 (1994)
2. Towell, G.G., Shavlik, J.W.: Extracting Refined Rules from Knowledge-based Neural Networks. Machine Learning 13, 71–101 (1993)
3. Lu, H.J., Setiono, R., Liu, H.: NeuroRule: A Connectionist Approach to Data Mining. In: Proceedings of 21st International Conference on Very Large Data Bases, Zurich, Switzerland, pp. 478–489 (1995)
4. Zhou, Z.H., Jiang, Y., Chen, S.F.: Extracting Symbolic Rules from Trained Neural Network Ensembles. AI Communications 16, 3–15 (2003)
5. Sestito, S., Dillon, T.: Knowledge Acquisition of Conjunctive Rules Using Multilayered Neural Networks. International Journal of Intelligent Systems 8, 779–805 (1993)
6. Craven, M.W., Shavlik, J.W.: Using Sampling and Queries to Extract Rules from Trained Neural Networks. In: Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, USA, pp. 37–45 (1994)
7. Maire, F.: Rule-extraction by Backpropagation of Polyhedra. Neural Networks 12, 717–725 (1999)
8. Setiono, R., Leow, W.K.: On Mapping Decision Trees and Neural Networks. Knowledge Based Systems 12, 95–99 (1999)
9. Battiti, R.A.: Using Mutual Information for Selecting Features in Supervised Neural Net Learning. IEEE Trans. on Neural Networks 5, 537–550 (1994)
10. Bollacker, K.D., Ghosh, J.C.: Mutual Information Feature Extractors for Neural Classifiers. In: Proceedings of IEEE Int. Conference on Neural Networks, vol. 3, pp. 1528–1533 (1996)
11. Dash, M., Liu, H., Yao, J.C.: Dimensionality Reduction of Unsupervised Data. In: Proceedings of 9th IEEE Int. Conf. on Tools of Artificial Intell., pp. 532–539 (1997)
12. Fu, X.J., Wang, L.P.: Data Dimensionality Reduction with Application to Simplifying RBF Network Structure and Improving Classification Performance. IEEE Trans. System, Man, Cybern, Part B-Cybernetics 33, 399–409 (2003)
13. Zhang, Y., Su, H.Y., Jia, T., Chu, J.C.: Rule Extraction from Trained Support Vector Machines. In: Ho, T.-B., Cheung, D., Liu, H. (eds.) PAKDD 2005. LNCS (LNAI), vol. 3518, pp. 61–70. Springer, Heidelberg (2005)
14. Kamarthi, S.V., Pittner, S.: Accelerating Neural Network Training Using Weight Extrapolation. Neural Networks 12, 1285–1299 (1999)
15. Blake, C., Keogh, E., Merz, C.J.: UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, USA (1998), http://www.ics.uci.edu/~mlearn/MLRepository.htm

Chattering-Free LS-SVM Sliding Mode Control

Jianning Li, Yibo Zhang, and Haipeng Pan

Institute of Automation, Zhejiang Sci-Tech University, 310018, Hangzhou, China
[email protected]

Abstract. Least squares support vector machine (LS-SVM) classifiers are a class of kernel methods whose solution follows from a set of linear equations. In this work we present a least squares support vector machine sliding mode control (LS-SVM-SMC) strategy for uncertain discrete systems with input saturation. The output of the LS-SVM replaces the sign function of the reaching law in traditional sliding mode control (SMC). An equivalent matrix is constructed for the input saturation condition in the scheme. By combining LS-SVM-SMC with linear matrix inequalities (LMIs), a chattering-free control algorithm is applied to uncertain discrete systems with input saturation. The feasibility and effectiveness of the LS-SVM-SMC scheme are demonstrated via numerical examples. Compared with conventional SMC, the LS-SVM-SMC is able to achieve the desired transient response with input saturation, and there is no chattering in steady state while unmatched parameter uncertainty exists.

Keywords: Sliding mode control; Least squares support vector machine; Discrete uncertain system; Input saturation; Linear matrix inequality.

1

Introduction

SMC is well established as a general design approach for robust control systems. The long history of its development and its main results since the 1950s have been reported in [1]. Due to the widespread use of digital controllers, research on variable structure control for discrete-time systems has become an important branch of control theory, and different reaching conditions are presented in [2]-[3]. The main drawback of these reaching conditions is that when the system states are in the switching region, they cannot reach the origin but tend to chatter close to the origin. In recent years, a number of papers have built on [2], see e.g. [4]-[8]. In applications, input saturation and state constraints exist. When they are introduced into a conventional sliding mode control algorithm, good system response performance cannot be achieved easily, and the system may become unstable. A linear variable structure control approach is proposed in [9] for discrete-time systems subject to input saturation. In [10]-[11], the input saturation condition is expressed in terms of linear matrix inequalities.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 701–708, 2008. © Springer-Verlag Berlin Heidelberg 2008


In this paper, a class of uncertain discrete systems with input saturation is considered. In order to achieve the desired closed-loop performance and robustness, a chattering-free control is proposed that combines the LMI approach and LS-SVM with a discrete quasi-sliding mode control law. First, an equivalent matrix for the input saturation is adopted. Second, an LS-SVM is used to replace the sign function, which removes the chattering. Compared with the conventional sliding mode control strategy, the LS-SVM-SMC algorithm with input saturation therefore has several advantages. First, it accounts for the input saturation, whereas large inputs exist in traditional SMC. Second, the control algorithm is chattering free and robust to unmatched parameter uncertainty. Third, the control law can be realized easily in real applications. The paper is organized as follows. In Section 2, we review the theory of LS-SVM and introduce some useful lemmas. In Section 3, the system description for uncertain discrete systems with input saturation is given. In Section 4, an LS-SVM-SMC scheme for uncertain discrete systems with input saturation is introduced. The stability analysis of LS-SVM-SMC is given in Section 5. In Section 6, simulation results are presented to show the effectiveness of the proposed control for uncertain discrete systems with input saturation. Finally, conclusions are given in Section 7.

2

Theory

A detailed description of the theory of SVM can be found in several excellent books and tutorials [12]-[15]. An alternative formulation of SVM is the LS-SVM proposed in [16]. The LS-SVM method considers the following optimization problem:

\[ \min\ L(w, b, \xi) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{l} \xi_i^2 \quad \text{s.t.}\ y_i = w^T \varphi(x_i) + b + \xi_i,\ i = 1, 2, \ldots, l \tag{1} \]

The feature map $\varphi(x)$ is implicitly known from Mercer's theorem. The dual problem of (1) is given by the following set of linear equations [17]:

\[ \begin{bmatrix} K + \gamma^{-1} I_n & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix} \tag{2} \]

For a new given input $x$, the LS-SVM classifier is given by

\[ f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i K(x, x_i) + b \right) \tag{3} \]

LS-SVM classifiers achieve performance comparable to the standard SVM on a series of benchmark data sets with less overall complexity. We introduce some useful lemmas that are essential for the proofs in the following parts.

Lemma 1 [18]. For any $x \in R^p$, $y \in R^q$, matrices $D$ and $E$ with compatible dimensions, $F^T F \le I$, and $\varepsilon > 0$, the following inequality holds:

\[ 2 x^T D F E y \le \varepsilon x^T D D^T x + \varepsilon^{-1} y^T E^T F^T F E y \tag{4} \]

Lemma 2 [19]. For any $x, y \in R^n$ and a matrix $M > 0$, the following inequality holds:

\[ 2 x^T y \le x^T M x + y^T M^{-1} y \tag{5} \]
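As a concrete illustration of the LS-SVM equations (2) and (3) of this section, the linear system can be assembled and solved directly. The sketch below is written under assumptions that are not in the text: it uses a Gaussian RBF kernel with a hypothetical width parameter `sigma2`, treats the identity block as having one row per training sample, and uses hypothetical function names.

```python
import numpy as np

def rbf_gram(X1, X2, sigma2):
    """Gram matrix of the Gaussian kernel exp(-||a - b||^2 / (2 * sigma2)) (assumed kernel)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2))

def lssvm_fit(X, y, gamma, sigma2):
    """Solve the linear system (2) for alpha and b."""
    l = X.shape[0]
    A = np.zeros((l + 1, l + 1))
    A[:l, :l] = rbf_gram(X, X, sigma2) + np.eye(l) / gamma   # K + gamma^{-1} I
    A[:l, l] = 1.0                                           # column of ones
    A[l, :l] = 1.0                                           # row of ones
    rhs = np.concatenate([np.asarray(y, dtype=float), [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:l], sol[l]

def lssvm_decision(X_new, X, alpha, b, sigma2):
    """Argument of sgn(.) in (3); take np.sign of the result to classify."""
    return rbf_gram(X_new, X, sigma2) @ alpha + b
```

Training therefore costs one dense linear solve of size l + 1, which is the sense in which the LS-SVM solution follows from a set of linear equations rather than a quadratic program.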

3

System Description

Consider the following uncertain discrete system with input saturation:

\[ x(k+1) = (A + \Delta A) x(k) + B\,\mathrm{sat}(u(k)) \tag{6} \]

where $x(k) \in R^n$ is the state vector, $u(k) \in R^m$ is the control input, $A$ and $B$ are matrices of appropriate dimensions, and $\Delta A \in R^{n \times n}$ represents the parameter uncertainties. The function $\mathrm{sat}(u(k))$ is defined componentwise as

\[ \mathrm{sat}(u_i(k)) = \begin{cases} u_H, & \text{if } u_i(k) > u_H \\ u_i(k), & \text{if } -u_L \le u_i(k) \le u_H \\ -u_L, & \text{if } u_i(k) < -u_L \end{cases} \quad i = 1, 2, \ldots, m \tag{7} \]

where $u_H, u_L \in R^+$ are the bounded actuator limits. The saturating function (7) is conveniently expressed as

\[ \mathrm{sat}(u(k)) = D u(k) \tag{8} \]

where $D \in R^{m \times m}$ satisfies

\[ D(i, i) = \begin{cases} u_H / u_i(k), & \text{if } u_i(k) > u_H \\ 1, & \text{if } -u_L \le u_i(k) \le u_H \\ -u_L / u_i(k), & \text{if } u_i(k) < -u_L \end{cases} \quad i = 1, 2, \ldots, m \tag{9} \]
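The equivalence between the saturation function (7) and the diagonal matrix representation (8)-(9) is easy to check numerically. The sketch below takes $D$ to be diagonal, which is what (9) specifies entry by entry; the function names and the numeric limits in the usage note are illustrative assumptions.

```python
import numpy as np

def sat(u, u_H, u_L):
    """Componentwise saturation of formula (7): clip to [-u_L, u_H]."""
    return np.clip(np.asarray(u, dtype=float), -u_L, u_H)

def saturation_matrix(u, u_H, u_L):
    """Diagonal matrix D of formula (9), so that sat(u) = D @ u as in (8)."""
    u = np.asarray(u, dtype=float)
    d = np.ones_like(u)                 # unsaturated components keep gain 1
    high = u > u_H
    low = u < -u_L
    d[high] = u_H / u[high]             # scale down to the upper limit
    d[low] = -u_L / u[low]              # scale down to the lower limit
    return np.diag(d)
```

With $u_H = 2$, $u_L = 1$ and $u = (3, -0.5, -4)$, the diagonal of $D$ is $(2/3, 1, 1/4)$ and $Du = (2, -0.5, -1)$. Every diagonal entry lies in $(0, 1]$, and $D$ depends on the current input $u(k)$.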

Then (6) can be rewritten as

\[ x(k+1) = (A + \Delta A) x(k) + B D u(k) \tag{10} \]

4

LS-SVM-SMC Design for Uncertain Discrete System with Input Saturation

We choose the sliding function as in [2]; its parameter is designed according to LMIs:

\[ s = Cx, \quad C = B^T P \tag{11} \]

where $P$ is the solution of the LMIs that will be introduced later. The reaching law is selected as

\[ s(k+1) = s(k) - qT s(k) - \varepsilon T o_s, \quad q > 0,\ \varepsilon > 0 \tag{12} \]

where $o_s$ is the output of the LS-SVM, defined as

\[ o_s = \mathrm{LSSVM}(s(k)) \tag{13} \]

Here LSSVM denotes the functional characteristics of the least squares support vector machine.
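The effect of replacing the sign function in the reaching law (12) can be seen in a scalar simulation. The sketch below makes several assumptions that are not in the text: $T$ is read as the sampling period, the numeric values of $q$, $\varepsilon$ and $T$ are arbitrary, and a `tanh` function is used as a hypothetical smooth stand-in for the LS-SVM output $o_s$ (the trained LS-SVM of the paper is not reproduced here).

```python
import math

def simulate_reaching(s0, q, eps, T, switch, steps=400):
    """Iterate s(k+1) = s(k) - q*T*s(k) - eps*T*switch(s(k)), as in (12)."""
    s = s0
    trajectory = [s]
    for _ in range(steps):
        s = s - q * T * s - eps * T * switch(s)
        trajectory.append(s)
    return trajectory

def sign(s):
    """Discontinuous switching term of the conventional reaching law."""
    return (s > 0) - (s < 0)

def smooth_switch(s, width=0.05):
    """Smooth surrogate for the LS-SVM output (assumption, for illustration only)."""
    return math.tanh(s / width)

chatter = simulate_reaching(1.0, q=5.0, eps=1.0, T=0.01, switch=sign)
smooth = simulate_reaching(1.0, q=5.0, eps=1.0, T=0.01, switch=smooth_switch)
```

With the sign function, the trajectory ends up alternating in sign around zero with an amplitude of roughly $\varepsilon T / (2 - qT)$, which is the chattering band of the quasi-sliding mode. With the smooth surrogate, the trajectory decays to zero without changing sign.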

[Figure: block diagram. Block labels: Equivalent Control, Sliding Function, LS-SVM, SMC, Uncertain Discrete Systems with Input Saturation. Signal labels: u1(k), u2(k), u(k) = u1(k) + u2(k), s(k), os, x(k), x(0).]

Fig. 1. Diagram of the LS-SVM-SMC

The control law combines discrete quasi-sliding mode control with equivalent control. The overall control $u(k)$ is chosen as

\[ u(k) = -(CB)^{-1} C A x(k) + (CB)^{-1} [(1 - qT) s(k) - \varepsilon T o_s] \tag{14} \]

The LS-SVM-SMC control scheme is shown in Fig. 1. In the sliding mode, the control $u(k)$ is the equivalent control, which is described as

\[ u(k) = -(CB)^{-1} C A x(k) \tag{15} \]

so the dynamic equation of the quasi-sliding mode is given by

\[ x(k+1) = (A + \Delta A - B D (CB)^{-1} C A) x(k) \tag{16} \]

5

Robust Stability Analysis

Theorem 1. For any $x \in R^p$, $y \in R^q$, and matrices $D$ and $E$ with compatible dimensions, the following inequality holds:

\[ -2 x^T D E y \le x^T D D^T x + y^T E^T E y \tag{17} \]

Proof.

\[ 0 \le (D^T x + E y)^T (D^T x + E y) = x^T D D^T x + 2 x^T D E y + y^T E^T E y \ \Rightarrow\ -2 x^T D E y \le x^T D D^T x + y^T E^T E y \tag{18} \]

Theorem 2. The dynamic equation of the quasi-sliding mode (16) is asymptotically stable if there exists a symmetric positive-definite matrix P such that the following linear matrix inequalities hold,   (A + △A)T P (A + △A) + AT P A + AT A − P (A + △A)T P BD α2 > 0)

(2)

is also a kernel matrix.

Definition 2. $K_2 = \alpha_1 S + \alpha_2 S^2$ $(\alpha_1 > \alpha_2 > 0)$ is called a 2-diffusion kernel.

If we extend the idea of the 2-diffusion similarity graph to $d$-length paths with the similarity measure

\[ \pi_d(x_i, x_j) = \sum_{(x_i x_{t_2} \cdots x_{t_d} x_j) \in T^d_{x_i x_j}} \pi(x_i, x_{t_2})\, \pi(x_{t_2}, x_{t_3}) \cdots \pi(x_{t_d}, x_j), \]

see Figure 1 (right), we can define the $d$-diffusion similarity matrix in the same way.

A Generic Diffusion Kernel for Semi-supervised Learning


Definition 3. A similarity matrix based on $\pi_d$ is called a $d$-diffusion similarity matrix and denoted as $S_d$. When $d \to \infty$, $S_d$ is the generic diffusion similarity matrix and denoted as $S_\infty$.

Theorem 2. $S_d = S^d$, $d \ge 2$.

Proof. We prove the theorem by induction. First, from Theorem 1, we have $S_2 = S^2$. Second, assume that $S_n = S^n$ for some $n \ge 2$. Then

\[ \begin{aligned} (S_{n+1})_{ij} &= \pi_{n+1}(x_i, x_j) \\ &= \sum_{(x_i x_{t_2} \cdots x_{t_{n+1}} x_j) \in T^{n+1}_{x_i x_j}} \pi(x_i, x_{t_2})\, \pi(x_{t_2}, x_{t_3}) \cdots \pi(x_{t_{n+1}}, x_j) \\ &= \sum_{(x_k, x_j) \in E} \pi(x_k, x_j) \sum_{(x_i x_{r_2} \cdots x_{r_n} x_k) \in T^{n}_{x_i x_k}} \pi(x_i, x_{r_2}) \cdots \pi(x_{r_n}, x_k) \\ &= (S S_n)_{ij} = (S S^n)_{ij} = (S^{n+1})_{ij}, \end{aligned} \]

that is, $S_{n+1} = S^{n+1}$. Therefore, $S_d = S^d$, $d \ge 2$. □

Then the linear combination of (2) can be extended to

\[ \sum_i \alpha_i S_i = \sum_i \alpha_i S^i, \quad \alpha_i > \alpha_{i+1} > 0, \tag{3} \]

and (3) is a kernel as well.

Definition 4. $K_d = \alpha_1 S + \alpha_2 S^2 + \cdots + \alpha_d S^d + \cdots$ $(\alpha_i > \alpha_{i+1} > 0)$ is called a $d$-diffusion kernel.

Now we define our generic diffusion kernel as follows.

Definition 5. Let $S$ be the base similarity matrix of the semi-supervised graph $G = (L \cup U, E)$, $0 < \lambda \le 1$, and $f(x)$ an arbitrary real function with the restriction that

\[ \begin{cases} \dfrac{f^{(d)}(0)}{d!} \lambda^d > 0, \\[2mm] \dfrac{f^{(d_1)}(0)}{d_1!} \lambda^{d_1} > \dfrac{f^{(d_2)}(0)}{d_2!} \lambda^{d_2}, & d_1 < d_2. \end{cases} \]

Let

\[ K_f = \sum_{d=0}^{\infty} \frac{f^{(d)}(0)}{d!} \lambda^d S^d = f(\lambda S), \]

then $K_f$ is a generic diffusion kernel, $f(x)$ the diffusion generating function, and $\lambda$ the diffusion parameter.
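The identity $S_d = S^d$ of Theorem 2 can be checked by brute force on a small graph: sum the products of edge weights over every $d$-edge route between two nodes and compare with the matrix power. The sketch below assumes that missing edges carry weight 0 and that $T^d_{x_i x_j}$ contains every route with $d$ edges, including routes that revisit nodes, which is the reading under which the matrix-power identity holds. The function name and the example weights are hypothetical.

```python
import itertools
import numpy as np

def path_similarity(S, i, j, d):
    """pi_d(x_i, x_j): sum over all d-edge routes of the product of edge weights."""
    n = S.shape[0]
    total = 0.0
    # d edges means d - 1 intermediate nodes
    for mids in itertools.product(range(n), repeat=d - 1):
        nodes = (i,) + mids + (j,)
        total += np.prod([S[a, b] for a, b in zip(nodes, nodes[1:])])
    return total

# Small symmetric weight matrix (made-up values); zeros mean "no edge"
S_EXAMPLE = np.array([[0.0, 1.0, 0.5, 0.0],
                      [1.0, 0.0, 2.0, 0.0],
                      [0.5, 2.0, 0.0, 1.0],
                      [0.0, 0.0, 1.0, 0.0]])
```

The enumeration costs $n^{d-1}$ products per entry, whereas the matrix power needs only $d - 1$ matrix multiplications, which is the practical benefit of Theorem 2.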


L. Jia and S. Liao

Fig. 2. Base similarity graph and 2,3,4-diffusion similarity graphs. [Panels: S, S², S³, S⁴, with similarity measures π₂, π₃, π₄.]

Fig. 3. The relationship between d, λ and the weight of $S^d$ in $K_{(1+\lambda S)^d}$. λ = 0.1 is suitable. [Axes: d (horizontal), $\frac{f^{(d)}(0)}{d!}\lambda^d$ (vertical); curves for λ = 0.05, 0.10, 0.15, 0.20, 0.25.]

Example 1. When $f(x) = \exp(x)$ and $0 < \lambda \le 1$,

\[ K_f = \sum_{d=0}^{\infty} \frac{1}{d!} \lambda^d S^d = \exp(\lambda S) = K_{\exp}, \]

which is the exponential diffusion kernel.

Example 2. When $f(x) = (1 - x)^{-1}$ and $0 < \lambda \le 1$,

\[ K_f = \sum_{d=0}^{\infty} \lambda^d S^d = (1 - \lambda S)^{-1} = K_N, \]

which is the von Neumann diffusion kernel.

The generic diffusion kernel is rooted in the belief that the relevance of longer paths should decrease: the bigger $d$ is, the smaller the weight of $S^d$ becomes. In practice, any real function $f(x)$ could be used to generate $K_f$, with the only restriction that $f^{(d)}(0)$ must exist and be non-negative, and $\lambda$ is a parameter chosen to ensure that the weight of $S^d$ decreases.

3.2

Property of Generic Diffusion Kernel

A smooth eigenvector of the similarity matrix has the property that two elements of the vector have similar values if there exist paths with large weights between the corresponding nodes in the graph, and a small eigenvalue corresponds to a smooth eigenvector [13,14]. To demonstrate the computational feasibility of $K_f$, we have the following theorem.

Theorem 3. $K_f$ has the same eigenvectors as $S$, and simply applies a spectral transformation to $S$.

Proof. Since $S$ is symmetric, it can be expressed as $S = V' \Lambda V$, where the orthogonal matrix $V$ contains its eigenvectors and the diagonal matrix $\Lambda$ its eigenvalues. For any real function $f(x)$ we know that $f(V' \Lambda V) = V' f(\Lambda) V$. Then for any generic diffusion kernel

\[ K_f = f(\lambda S) = f(V' \lambda \Lambda V) = V' f(\lambda \Lambda) V, \]

and the theorem holds. □

Thus, when computing $K_f$, we can first perform a spectral decomposition of $S$ and then apply $f(x)$ directly to the eigenvalues of $S$.

Theorem 4. $K_f$ keeps the order of the eigenvalues, so that smooth eigenvectors have small eigenvalues in $K_f$.

Proof. Since

\[ K_f = V' f(\lambda\Lambda) V = V' \sum_{d=0}^{\infty} \frac{f^{(d)}(0)}{d!} \lambda^d \Lambda^d V, \]

and $\frac{f^{(d)}(0)}{d!}\lambda^d$ is monotone decreasing in $d$, we can choose a certain small $\lambda$ satisfying

\[ \lambda^d < \|S\|_2^{-1} \frac{d!}{f^{(d)}(0)}, \]

which makes $\frac{f^{(d)}(0)}{d!}\lambda^d \Lambda^d$ decrease. From spectral graph theory we know that a smaller eigenvalue corresponds to a smoother eigenvector over the graph. □

3.3

Framework of Kernel Methods in Semi-supervised Learning

Based on the generic diffusion kernel approach obtained above, we propose a 5-step kernel method framework for semi-supervised learning:

1. Given L and U.
2. Building G = (L ∪ U, E) with some similarity measure π.
3. Creating the base similarity matrix S.
4. Computing $K_f$, such as $K_e$ and $K_N$.
5. Using a common kernel algorithm, such as SVM.

However, some kernel algorithms need U to be labeled when learning. Much successful research has addressed this, for instance spectral clustering [15] and label propagation [3].
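Steps 3 and 4 of the framework, together with the spectral computation described for Theorem 3, can be sketched as follows. The sketch rests on assumptions made for illustration: the base similarity is taken to be the Gaussian measure $\exp(-\|x_i - x_j\|^2/\delta^2)$ that appears in Table 1, the function names are hypothetical, and for the von Neumann kernel $\lambda$ is chosen so that $\lambda$ times the largest eigenvalue of $S$ is below 1, which the power series needs in order to converge.

```python
import numpy as np

def base_similarity(X, delta):
    """Step 3: S_ij = exp(-||x_i - x_j||^2 / delta^2) over labeled and unlabeled points."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / delta ** 2)

def generic_diffusion_kernel(S, f, lam):
    """Step 4: K_f = f(lam * S), computed by applying f to the eigenvalues of S."""
    evals, evecs = np.linalg.eigh(S)            # S = evecs @ diag(evals) @ evecs.T
    return (evecs * f(lam * evals)) @ evecs.T   # same eigenvectors, transformed spectrum

def von_neumann(t):
    """Generating function (1 - t)^(-1); requires t < 1 on the spectrum."""
    return 1.0 / (1.0 - t)

# Usage on a few made-up points
X_DEMO = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
S_DEMO = base_similarity(X_DEMO, delta=1.0)
LAM = 0.5 / np.linalg.eigvalsh(S_DEMO).max()
K_EXP = generic_diffusion_kernel(S_DEMO, np.exp, LAM)
K_VN = generic_diffusion_kernel(S_DEMO, von_neumann, LAM)
```

The resulting matrix can then be passed to any kernel algorithm that accepts a precomputed Gram matrix (step 5). One eigendecomposition of $S$ is enough to evaluate several generating functions or several values of $\lambda$.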

4

Experiments

In this section, experiments on simulated data and benchmark databases are designed to assess the validity and feasibility of generic diffusion kernels.

The first experiment verifies the effectiveness of $K_f$ in keeping the order of eigenvalues and making smooth eigenvectors have relatively small eigenvalues. We use a simulated graph to test our work. As shown in Figure 2 (top left), the base similarity graph has 13 nodes. Without loss of generality, we do not limit the graph to any particular similarity measure π and assign all weights to 1. The 2,3,4-diffusion similarity graphs are also shown. We choose $f(x) = (1 + \lambda x)^d$ in this experiment, so $K_f = (1 + \lambda S)^d$. Since any two nodes in the base similarity graph are connected by a path of length at most 4, we let d = 4. The relationship between d, λ and the weight in $K_f$ is illustrated in Figure 3. We set λ = 0.1 in this experiment to ensure the decline of the weight of $S^d$ in this generic diffusion kernel. From the spectral decomposition of $K_f$ in Figure 4, it can be seen clearly that smooth eigenvectors correspond to small eigenvalues.

Fig. 4. The spectral decomposition of generic diffusion kernel. It shows that when eigenvalues become larger the eigenvectors become less smooth. [Panel title: Diffusion Similarity Graph of $K_{(1+0.1S)^4}$; eigenvalue labels: 19.0798, 19.074, 19.8958, 19.9751, 18.1843, 23.6539, 54.2347, 54.2407, 52.9749, 57.544, 402.434, 403.212, 404.689.]

Next we show classification results on both simulated data and benchmark databases. We choose the exponential function as the generating function with λ = 1, and use the Euclidean distance as the similarity measure to create the base similarity matrix in these experiments. Figures 5 and 6 show the utility of the unlabeled data in improving the decision boundary of linear and non-linear classifiers. In Figure 5, only 30 samples are labeled (15 samples in each group). The boundary shown as a solid line is learned by SVM from both the labeled data (shown in color) and the unlabeled data using our diffusion kernel. Figure 6 shows a much more complicated case, where the classifier is non-linear and only 6 samples are labeled (3 samples in each group).

Finally, we apply the generic diffusion kernel to two UCI benchmark databases to verify its effectiveness. The similarity measure π used here is (1). We use $K_f = (1 - \lambda S)^{-1}$ and set λ = 1. The results are given in the first two lines of Table 1. As shown in Example 1, when $f(x) = \exp(x)$, the generic diffusion kernel reduces to the exponential diffusion kernel. This kernel has been investigated by Kondor et al. We cite their experimental results here to demonstrate the successful application of generic diffusion kernels to these UCI benchmark databases [7].


Fig. 5. Linear classifier (solid line) learned by SVM with generic diffusion kernel, where only 30 samples are labeled. The boxed samples are support vectors.


Fig. 6. Non-linear classifier (solid line) learned by SVM with generic diffusion kernel, where only 6 samples are labeled. The boxed samples are support vectors.

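Table 1. Classification results on 6 benchmark databases. Rows marked '*' are cited from Kondor et al.'s work.

| Database | #Sample | #Attr | Error | #SV | $\pi$ | $f(x)$ | $K_f$ | $\lambda$ |
|---|---|---|---|---|---|---|---|---|
| Heart | 270 | 13 | 16.30% | 108 | $\exp(-\|x_i - x_j\|^2/\delta^2)$ | $(1 - x)^{-1}$ | $K_N$ | 1.0 |
| Ionosphere | 351 | 34 | 4.84% | 167 | $\exp(-\|x_i - x_j\|^2/\delta^2)$ | $(1 - x)^{-1}$ | $K_N$ | 1.0 |
| Cancer* | 699 | 9 | 3.64% | 62.9 | Euclidean | $\exp(x)$ | $K_e$ | 0.3 |
| Vote* | 435 | 16 | 3.91% | 252.9 | Euclidean | $\exp(x)$ | $K_e$ | 1.5 |
| Income* | 48842 | 11 | 18.50% | 1033.4 | Euclidean | $\exp(x)$ | $K_e$ | 0.4 |
| Mushroom* | 8124 | 22 | 0.75% | 28.2 | Euclidean | $\exp(x)$ | $K_e$ | 0.1 |

5 Conclusions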

We have proposed the generic diffusion kernel for semi-supervised learning. It generalizes previous diffusion kernels to a flexible form, and reduces the design of diffusion kernels for semi-supervised learning to the selection of generating functions and parameters. Both theoretical analysis and comparative experiments support the validity and feasibility of the generic diffusion kernel. As the generic diffusion kernel suggests a new approach to kernel construction for semi-supervised learning, the tuning of generating parameters and the selection of generating functions for a particular database deserve attention, and extensions to other graph-based learning problems also need to be investigated in future work.

Acknowledgment. This work is supported in part by Natural Science Foundation of China under Grant No. 60678049 and Natural Science Foundation of Tianjin under Grant No. 07JCYBJC14600.


References

1. Chapelle, O., Schölkopf, B.: Semi-Supervised Learning. MIT Press, Cambridge (2006)
2. Zhou, Z.H., Li, M.: Semi-Supervised Learning with Co-Training. In: 19th International Joint Conference on Artificial Intelligence. IJCAI, pp. 908–913 (2005)
3. Zhu, X.J.: Semi-Supervised Learning with Graphs. Ph.D. thesis, Carnegie Mellon University (2005)
4. Joachims, T.: Transductive Inference for Text Classification Using Support Vector Machines. In: 16th International Conference on Machine Learning, pp. 200–209. Morgan Kaufmann, San Francisco (1999)
5. Chung, F.R.K.: Spectral Graph Theory. Regional Conference Series in Mathematics. American Mathematical Society (1997)
6. Haussler, D.: Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, University of California at Santa Cruz (1999)
7. Kondor, R.I., Lafferty, J.: Diffusion Kernels on Graphs and Other Discrete Structures. In: 19th International Conference on Machine Learning, pp. 315–322. Morgan Kaufmann, San Francisco (2002)
8. Smola, A., Kondor, R.I.: Kernels and Regularization on Graphs. In: 16th Annual Conference on Learning Theory, pp. 128–144. Springer, Heidelberg (2003)
9. Chapelle, O., Weston, J., Schölkopf, B.: Cluster Kernels for Semi-Supervised Learning. In: Advances in Neural Information Processing Systems, vol. 15, pp. 601–608. MIT Press, Cambridge (2003)
10. Zhu, X.J., Kandola, J., Ghahramani, Z., Lafferty, J.: Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning. In: Advances in Neural Information Processing Systems, vol. 17, pp. 1641–1648. MIT Press, Cambridge (2005)
11. Fouss, F., Pirotte, A., Renders, J.M., Saerens, M.: A Novel Way of Computing Similarities between Nodes of a Graph, with Application to Collaborative Recommendation. In: IEEE/WIC/ACM International Joint Conference on Web Intelligence, pp. 550–556. IEEE Computer Society, Washington (2005)
12. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. China Machine Press, Beijing (2005)
13. Fiedler, M.: A Property of Eigenvectors of Nonnegative Symmetric Matrices and Its Applications to Graph Theory. Czechoslovak Mathematical Journal 25, 619–633 (1975)
14. Mohar, B.: Laplace Eigenvalues of Graphs - A Survey. Discrete Mathematics 109, 171–183 (1992)
15. Ng, A., Jordan, M., Weiss, Y.: On Spectral Clustering. In: Advances in Neural Information Processing Systems, vol. 14, pp. 849–856. MIT Press, Cambridge (2002)

Weighted Hyper-sphere SVM for Hypertext Classification

Shuang Liu (1) and Guoyou Shi (2)

(1) College of Computer Science & Engineering, Dalian Nationalities University, Dalian Development Area Liaohe West Road 18, 116600 Dalian, China
(2) College of Navigation, Dalian Maritime University, Linghai Road 1, 116026 Dalian, China
[email protected], [email protected]

Abstract. With more and more hypertext documents available online, hypertext classification has become a popular research topic in information retrieval. Hyperlinks, HTML tags and category labels distributed over linked documents provide rich classification information. Integrating this information with the content tfidf result as the document feature vector, this paper proposes a new weighted hyper-sphere support vector machine for hypertext classification. By eliminating the influence of uneven class sizes with weight factors, the new method solves multi-class classification with less computational complexity than binary support vector machines. Experiments on a benchmark data set verify the efficiency and feasibility of our method.

Keywords: Hypertext classification, Hyper-sphere support vector machine, Weight factor, Uneven class sizes.

1 Introduction

With the rapid development of the Internet, the number of hypertext documents increases remarkably every day. How to classify and utilize this information is very important, which leads to hypertext classification. Automatic hypertext classification differs from traditional text classification in that hyperlinks and HTML tags are available. Yang [1] studied the use of hypertext regularities to improve classification accuracy. Liu [2] combined plain text information and hyperlink structure during training, and then used a fuzzy transductive support vector machine (SVM) as the classifier. Joachims [3] has shown that SVM performs better than other machine learning methods. However, support vector machines were originally proposed for binary classification, whereas hypertext document classification is a multi-class classification problem. Much work has been done on extending binary SVMs to multi-class problems, such as the one-against-one and one-against-all methods. The essence of these methods is still the optimal hyper-plane, which leads to higher computational complexity in the quadratic programming problems. To reduce computational complexity, Zhu [4] proposed sphere-structured SVMs for multi-class classification with less computational complexity than binary SVMs.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 733–740, 2008. © Springer-Verlag Berlin Heidelberg 2008


Considering hyperlinks, html tags information and plain text information, we propose a weighted hyper-sphere SVM to solve hypertext classification. Here, weight factor is used to reduce the influence of uneven class sizes. In this paper, after a short introduction to hyper-sphere SVMs, we present our weighted hyper-sphere SVMs based on analysis of hyper-sphere SVM. Then we give our method of computing document feature vector and discuss the experiments. Finally, we summarize the paper.

2 Hyper-sphere Support Vector Machine

Without loss of generality, we consider nonlinearly separable training and testing samples. Given a set of training data $X = \{x_i \mid x_i \in R^N,\ i = 1, \ldots, m\}$, seeking the minimum bounding sphere of each class $k$ can be formulated as the following optimization problem:

$$\min_{c_k, R_k} \; R_k^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to: } \|\phi(x_i) - c_k\|^2 \le R_k^2 + \xi_i,\; \xi_i \ge 0,\; i = 1, \ldots, m. \tag{1}$$

where each minimum bounding sphere $S_k$ for class $k$ is characterized by its center $c_k$ and radius $R_k$, $C$ is the penalty factor and $\xi_i \ge 0$ are slack variables. In order to separate the data precisely, a function mapping samples into a high dimensional feature space is used, that is, $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. By introducing Lagrange multipliers $\alpha_{i,k}$ and $\beta_{i,k}$, the original optimization problem is equivalent to determining the saddle point of (2):

$$L(R_k, \xi_i, c_k, \alpha_i, \beta_i) = R_k^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\left[\alpha_{i,k}\left(R_k^2 + \xi_i - \|\phi(x_i) - c_k\|^2\right) + \beta_{i,k}\xi_i\right]. \tag{2}$$

By introducing Lagrange multipliers, the original optimization problem becomes its dual optimization problem in the following format:

$$\min_{\alpha_{i,k}} \; \sum_{i,j=1}^{m}\alpha_{i,k}\alpha_{j,k}\,k(x_i, x_j) - \sum_{i=1}^{m}\alpha_{i,k}\,k(x_i, x_i) \quad \text{subject to: } 0 \le \alpha_{i,k} \le C,\; i = 1, \ldots, m,\; \sum_{i=1}^{m}\alpha_{i,k} = 1. \tag{3}$$

The resulting decision function can be computed as:

$$f_k(x) = \mathrm{sgn}\left(R_k^2 - \sum_{i,j=1}^{m}\alpha_{i,k}\alpha_{j,k}\,k(x_i, x_j) + 2\sum_{i=1}^{m}\alpha_{i,k}\,k(x_i, x) - k(x, x)\right). \tag{4}$$


The class of the new testing point $x$ can be decided by the following principle:

$$x \text{ lies } \begin{cases} \text{inside of the hyper-sphere} & \text{if } f_k(x) = 1 \\ \text{outside of the hyper-sphere} & \text{if } f_k(x) = -1 \\ \text{on the hyper-sphere} & \text{if } f_k(x) = 0 \end{cases} \tag{5}$$
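The decision rule in (4) and (5) only needs kernel evaluations, which the following sketch illustrates. It assumes an RBF kernel and a toy sphere whose multipliers are chosen by hand rather than obtained from the dual (3); the names `rbf`, `dist_sq_to_center` and `sphere_decision` are illustrative.

```python
import math


def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel (an assumed kernel choice for this sketch)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))


def dist_sq_to_center(x, support, alpha, kernel=rbf):
    """||phi(x) - c_k||^2 with c_k = sum_i alpha_i phi(x_i), via the kernel trick."""
    cc = sum(ai * aj * kernel(xi, xj)
             for ai, xi in zip(alpha, support)
             for aj, xj in zip(alpha, support))
    cx = sum(ai * kernel(xi, x) for ai, xi in zip(alpha, support))
    return cc - 2 * cx + kernel(x, x)


def sphere_decision(x, support, alpha, radius_sq, kernel=rbf):
    """Sign of R_k^2 - ||phi(x) - c_k||^2 as in (4): 1 inside, -1 outside, 0 on."""
    value = radius_sq - dist_sq_to_center(x, support, alpha, kernel)
    return (value > 0) - (value < 0)


# Toy sphere: two support vectors with equal, hand-picked multipliers.
support = [(0.0, 0.0), (2.0, 0.0)]
alpha = [0.5, 0.5]
# The radius is the distance from the center to a support vector on the boundary.
radius_sq = dist_sq_to_center(support[0], support, alpha)

inside = sphere_decision((1.0, 0.0), support, alpha, radius_sq)
outside = sphere_decision((10.0, 0.0), support, alpha, radius_sq)
```

In a multi-class setting one such sphere would be evaluated per class; how ties between overlapping spheres are resolved is not specified in this section.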

3 New Weighted Hyper-sphere SVM

In this section, based on an analysis of the shortcomings of the hyper-sphere SVM, we propose our new weighted hyper-sphere SVM.

3.1 Analysis of Hyper-sphere SVM

For the hyper-sphere SVM, the KKT conditions of (2) are:

$$\alpha_{i,k}\left(R_k^2 + \xi_i - \|\phi(x_i) - c_k\|^2\right) = 0. \tag{6}$$

$$\beta_{i,k}\,\xi_i = (C - \alpha_{i,k})\,\xi_i = 0. \tag{7}$$

Based on (6) and (7), we can conclude that there are three cases for the optimal solution $\alpha_{i,k}$:

1) $\alpha_{i,k} = 0$; due to (7), $\xi_i = 0$ and $x_i$ is correctly classified.
2) $0 < \alpha_{i,k} < C$, $\beta_{i,k} \ne 0$, $\xi_i = 0$; $x_i$ is a standard support vector.
3) $\alpha_{i,k} = C$. If $0 \le \xi_i$ … $> 0$ should be misclassification samples not belonging to class $k$.

From (8), we can get:

$$\alpha_2 = \frac{1 - \alpha_1 n_k}{m - n_k}. \tag{9}$$

Since $\alpha_2$ reflects the error of the resulting classifier, it must be small or zero. When $n_k$ is small, $\alpha_2$ becomes larger, which means more classification error, and when $n_k$ is large, $\alpha_2$ becomes smaller, which means less classification error. So when training sets with uneven class sizes are used, the classification error of the hyper-sphere SVM is biased towards the classes with smaller training sets. To reduce the influence of uneven class sizes, we propose the following weighted hyper-sphere SVM.


3.2 New Weighted Hyper-sphere SVM

A weight factor $s_k$ for each class $k$ is used in our new algorithm. The primal problem can be formulated as:

$$\min_{c_k, R_k} \; R_k^2 + C\sum_{i=1}^{m} s_k\,\xi_i \quad \text{subject to: } \|\phi(x_i) - c_k\|^2 \le R_k^2 + \xi_i,\; \xi_i \ge 0,\; i = 1, \ldots, m. \tag{10}$$

where $s_k$ is the weight factor of a data point $x_i$ belonging to class $k$. The term $s_k\xi_i$ in (10) is the error loss resulting from misclassifying $x_i$. Here, $s_k$ can be computed by the following method:

$$s_k = 1 - \frac{\text{number of samples in class } k}{\text{total number of samples in the training set}}. \tag{11}$$

From (11), we know that when the number of samples in class $k_1$ is smaller than the number of samples in class $k_2$, $s_{k_1} > s_{k_2}$. Through $s_k$, we can give more influence to classes $k$ with small class sizes and improve the performance of the resulting classifier. By introducing Lagrange multipliers $\alpha_{i,k}$ and $\beta_{i,k}$, the original optimization problem is equivalent to determining the saddle point of (12):

$$L(R_k, \xi_i, c_k, \alpha_i, \beta_i) = R_k^2 + C\sum_{i=1}^{m} s_k\,\xi_i - \sum_{i=1}^{m}\left[\alpha_{i,k}\left(R_k^2 + \xi_i - \|\phi(x_i) - c_k\|^2\right) + \beta_{i,k}\xi_i\right]. \tag{12}$$

With the same method as that in the hyper-sphere SVM, the dual optimization problem is defined as:

$$\min_{\alpha_{i,k}} \; \sum_{i,j=1}^{m}\alpha_{i,k}\alpha_{j,k}\,k(x_i, x_j) - \sum_{i=1}^{m}\alpha_{i,k}\,k(x_i, x_i) \quad \text{subject to: } 0 \le \alpha_{i,k} \le C s_k,\; i = 1, \ldots, m,\; \sum_{i=1}^{m}\alpha_{i,k} = 1. \tag{13}$$

The decision function of the weighted hyper-sphere SVM is the same as (4). The difference between the hyper-sphere SVM and the weighted hyper-sphere SVM is the upper bound of the Lagrange multiplier $\alpha_{i,k}$ in (3) and (13). Through $s_k$, our algorithm provides better control of the error rates and good generalization performance of the resulting classifier.
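As a small illustration of (11) and of the box constraint in (13), the sketch below computes $s_k$ from a list of class labels and the resulting upper bounds $C s_k$. The label counts are made up for the example, and the function names are illustrative.

```python
from collections import Counter


def class_weights(labels):
    """s_k = 1 - n_k / m for every class k, following (11)."""
    m = len(labels)
    return {k: 1 - n / m for k, n in Counter(labels).items()}


def alpha_upper_bounds(labels, C):
    """Upper bound C * s_k on the multipliers of class k, as in (13)."""
    return {k: C * s for k, s in class_weights(labels).items()}


# Hypothetical, deliberately uneven class sizes.
labels = ["student"] * 6 + ["course"] * 3 + ["faculty"] * 1
s = class_weights(labels)
bounds = alpha_upper_bounds(labels, C=100.0)
```

The smallest class receives the largest weight, so slack on its samples is penalized more heavily than slack on the large classes.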

4 Hypertext Document Feature Vector Computation

We consider two kinds of feature factors: one is the traditional tfidf computation taking HTML tags into account, the other is the hyperlinks between documents.


The traditional tfidf method is combined with HTML tag information to represent documents [5]. Each document $d_i$ is represented by simply taking into account whether and how frequently a word $t_k$ appears in $d_i$ and in the document collection. $\mathrm{tfidf}(t_k, d_i)$ can be computed as:

$$\mathrm{tfidf}(t_k, d_i) = \mathrm{tf}(t_k, d_i)\,\log\frac{Tr}{\#Tr(t_k)}\,\bigl(1 + \mathrm{in\_h1\_headline}(d_i)\cdot \mathrm{h1\_factor} + \mathrm{in\_title}(d_i)\cdot \mathrm{title\_factor} + \mathrm{in\_anchor}(d_i)\cdot \mathrm{anchor\_factor}\bigr) \tag{14}$$

$$\mathrm{tf}(t_k, d_i) = \begin{cases} 1 + \log \#(t_k, d_i) & \text{if } \#(t_k, d_i) > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $\#(t_k, d_i)$ is the number of times term $t_k$ occurs in the document $d_i$, $Tr$ is the number of training documents and $\#Tr(t_k)$ denotes the number of documents in $Tr$ in which term $t_k$ occurs at least once. Then each document $d_j$ is represented as a vector of weights, $d_j = \langle w_{1j}, w_{2j}, \ldots, w_{rj} \rangle$, where $r$ is the number of words that occur at least once in the document collection. Weights can be further normalized by cosine normalization, i.e.

$$w_{ki} = \frac{\mathrm{tfidf}(t_k, d_i)}{\sqrt{\sum_{s=1}^{r'} \mathrm{tfidf}(t_s, d_i)^2}}. \tag{15}$$

where $r'$ is the set of features resulting from feature selection.

For the hyperlink structure, $l(d_x, d_y)$ represents the length of a shortest path between $d_x$ and $d_y$, and $l(d_x, d_y, d_z)$ denotes the length of a shortest path between $d_x$ and $d_y$ not traversing $d_z$. Our adopted measure [2] of the hyperlink similarity between two documents captures three important notions about certain hyperlink structures that imply semantic relations: a direct path between two documents, the number of ancestor documents that refer to both documents in question, and the number of descendant documents that both documents refer to. $s_l(d_i, d_j)$ considers shortest direct paths between the documents, and is defined as follows:

$$s_l(d_i, d_j) = \frac{1}{2^{\,l(d_i, d_j)}} + \frac{1}{2^{\,l(d_j, d_i)}}. \tag{16}$$

If there is no path between di and dj, we do not add any weight to this similarity component. This equation ensures that as shortest paths increase in length, the similarity between the documents decreases. Let A denote common ancestors for di and dj, then sanc (di , d j ) that considers common ancestors is defined below:

$$s_{anc}(d_i, d_j) = \sum_{d_h \in A} \frac{1}{2^{\,\left(l(d_h, d_i, d_j) + l(d_h, d_j, d_i)\right)}}, \tag{17}$$

which means the more common ancestors, the higher the similarity. Let $A$ denote the common descendants of $d_i$ and $d_j$; then $s_{des}(d_i, d_j)$, which considers common descendants, is defined as:

$$s_{des}(d_i, d_j) = \sum_{d_h \in A} \frac{1}{2^{\,\left(l(d_i, d_h, d_j) + l(d_j, d_h, d_i)\right)}}. \tag{18}$$
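The direct-path component (16) can be sketched with a breadth-first search over the directed link graph. The adjacency-dictionary representation and the function names are assumptions made for this example; a pair with no path in one direction contributes nothing for that direction, as described above.

```python
from collections import deque


def shortest_path_length(graph, src, dst):
    """Length of a shortest directed path src -> dst (BFS), or None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def direct_path_similarity(graph, di, dj):
    """s_l(di, dj) = 1/2^l(di,dj) + 1/2^l(dj,di), skipping missing paths."""
    total = 0.0
    for a, b in ((di, dj), (dj, di)):
        length = shortest_path_length(graph, a, b)
        if length is not None:
            total += 1 / 2 ** length
    return total


# Hypothetical link structure: a -> b -> c.
links = {"a": ["b"], "b": ["c"], "c": []}
s_ab = direct_path_similarity(links, "a", "b")
s_ac = direct_path_similarity(links, "a", "c")
```

The ancestor and descendant components (17) and (18) would additionally need shortest paths that avoid a given node, which this sketch does not implement.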

The complete hyperlink similarity function between two hyperlinked documents $d_i$ and $d_j$, $S_{ij}^{links}$, is a linear combination of the above three components:

$$S_{ij}^{links} = w_l\, s_l(d_i, d_j) + w_a\, s_{anc}(d_i, d_j) + w_d\, s_{des}(d_i, d_j), \tag{19}$$

where $w_l$, $w_a$ and $w_d$ are cost factors that determine the tradeoff among these three components. Integrating these two kinds of information, we obtain:

$$w'_{ki} = f_{terms}\, w_{ki} + f_{links}\, S_{ij}^{links} \quad (d_j \in links(d_i)), \tag{20}$$

where $f_{terms}$ and $f_{links}$ are constants, and we use the normalized $w'_{ki}$ as the feature vector of hypertext documents.
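The term part of the feature vector, (14) and (15), can be sketched as follows. The sketch assumes that the tag factors of (14) are passed in as a single number `tag_boost` (the sum of the factors that apply to the document); the function names and the sample counts are illustrative and not taken from the paper.

```python
import math


def tf(count):
    """tf(t_k, d_i) = 1 + log(count) if the term occurs, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0


def tfidf(count, n_docs, doc_freq, tag_boost=0.0):
    """Equation (14); tag_boost stands for the sum of the applicable tag factors."""
    return tf(count) * math.log(n_docs / doc_freq) * (1 + tag_boost)


def cosine_normalize(weights):
    """Equation (15): divide each weight by the Euclidean norm of the vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)


# Hypothetical document with two terms out of 100 training documents:
# the first term occurs 3 times, appears in 10 documents and is boosted by 0.5;
# the second occurs once and appears in 50 documents.
raw = [tfidf(3, 100, 10, tag_boost=0.5), tfidf(1, 100, 50)]
normalized = cosine_normalize(raw)
```

Combining these weights with the link similarity as in (20) would then be a weighted sum with the constants $f_{terms}$ and $f_{links}$.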

5 Experiments

Our data set for the experiments comes from the WebKB [6] collection of WWW pages made available by the CMU text learning group. Following the setup in [7], this data set includes 4127 pages and 10945 hyperlinks interconnecting them. We did not use stemming or a stoplist because using them may hurt performance to some extent. To verify the efficiency of our method, we compare our experimental results with those of the original hyper-sphere SVMs. Since the computational complexity of binary SVMs for the multi-class classification problem is much larger than that of sphere-structured SVMs (for example, $k(k-1)/2$ quadratic programs vs. $k$ quadratic programs), we do not discuss the computational complexity of our new method. The performance of each classifier is measured using the conventional micro-averaged and macro-averaged F1 values. This measure combines recall and precision in the following way:

$$\text{Recall} = \frac{\#\text{ correct positive predictions}}{\#\text{ of positive examples}}, \qquad \text{Precision} = \frac{\#\text{ correct positive predictions}}{\#\text{ of positive predictions}}. \tag{21}$$

$$F1 = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}. \tag{22}$$
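Equations (21) and (22), together with the micro- and macro-averages used to summarize them over categories, can be sketched from per-category counts of true positives, false positives and false negatives. The category counts below are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 from counts, following (21) and (22); 0.0 when it is undefined."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def micro_macro_f1(per_class):
    """per_class maps a category name to (tp, fp, fn).

    Micro-average: F1 of the counts pooled over all categories.
    Macro-average: mean of the per-category F1 values.
    """
    macro = sum(f1(*counts) for counts in per_class.values()) / len(per_class)
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn), macro


# Hypothetical counts for a large and a small category.
counts = {"student": (80, 10, 20), "faculty": (5, 5, 15)}
micro, macro = micro_macro_f1(counts)
```

Because the micro-average pools counts, it is dominated by large categories, whereas the macro-average weights every category equally; this is why both are reported when class sizes are uneven.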


For ease of comparison, we summarize the F1 scores over the different categories using the micro- and macro-averages of F1 scores:

micro-avg F1 = F1 over categories and documents
macro-avg F1 = average of within-category F1 values

[Figure: F1 scores for the settings "terms only", "hyperlinks only" and "combined"; series: micro-avg F1 of hs-SVM, micro-avg F1 of our method, macro-avg F1 of hs-SVM, macro-avg F1 of our method.]

Fig. 1. Micro-avg F1 and macro-avg F1 comparisons of hyper-sphere SVM and our weighted hyper-sphere SVM

[Figure: precision/recall-breakeven point versus number of samples; series: whs-SVM and hs-SVM on student, course and faculty.]

Fig. 2. Average precision/recall-breakeven point comparisons of hyper-sphere SVM and our weighted hyper-sphere SVM


For hyper-sphere SVMs, an RBF kernel with C=1000 and σ=1.45 is used. For our weighted hyper-sphere SVMs, an RBF kernel with C=136 and σ=1.56 is used. Fig. 1 and Fig. 2 summarize the main results. In Fig. 1, the label “terms only” means the classifier uses only the terms on the pages, the label “hyperlinks only” means it uses only the hyperlinks between pages, and the label “combined” means it uses both terms and hyperlink information. From Fig. 1, we can see that both the hyper-sphere SVM and the weighted hyper-sphere SVM perform better when using hyperlink and term information together than when using only one of them. Fig. 2 shows how performance changes with increasing training set size for the categories course, faculty and student. At the beginning of this experiment, s1=0.3, s2=0.4, s3=0.6 are used as weight factors for the classes student, course and faculty. As the training set size increases, the weight factors increase as well. Finally, the weight factors become s1=0.42, s2=0.75, s3=0.84. Owing to the larger weight factor for the class faculty, which has the smaller training set, the average precision/recall-breakeven point of this class is higher than with the original hyper-sphere SVM. From Fig. 2, we can see that our weighted hyper-sphere SVM performs better than the hyper-sphere SVM.

6 Conclusions

The hyper-sphere SVM handles multi-class problems with less computational complexity than binary SVMs. Based on an analysis of the hyper-sphere SVM, we propose a weighted hyper-sphere SVM to solve the problem of uneven class sizes. To classify hypertext documents precisely, we combine the text information of documents with hyperlink information. Experiments on a benchmark data set show the efficiency and feasibility of our new method.

References

1. Yang, Y., Slattery, S., Ghani, R.: A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems 18, 219–241 (2002)
2. Hong, L.: Learning Text Classification Rules From Labeled and Unlabeled Examples. Dissertation, Shanghai Jiao Tong University (2003)
3. Joachims, T.: Make Large-scale Support Vector Machine Learning Practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1999)
4. Zhu, M.L., Chen, S.F., Liu, X.D.: Sphere-structured Support Vector Machines for Multiclass Pattern Recognition. LNCS, vol. 2369, pp. 589–593. Springer, Heidelberg (2003)
5. Salton, G., Buckley, C.: Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24(5), 513–523 (1988)
6. The 4 Universities Data Set, http://www.cs.cmu.edu/afs/cs/project/theo-20/www-/data
7. Slattery, S., Mitchell, T.: Discovering Test Set Regularities in Relational Domains. In: 17th International Conference on Machine Learning (ICML 2000), pp. 895–902. Morgan Kaufmann, Stanford (2000)

Theoretical Analysis of a Rigid Coreset Minimum Enclosing Ball Algorithm for Kernel Regression Estimation

Xunkai Wei (1, 2, *) and Yinghong Li (2)

(1) Beijing Aeronautical Technology Research Center, Beijing 100076, China
(2) Air Force Engineering University, Xian 710038
[email protected]

Abstract. A rigid coreset minimum enclosing ball training machine for kernel regression estimation is proposed. First, it transforms the kernel regression estimation problem into a center-constrained minimum enclosing ball representation, and subsequently trains the kernel method using the proposed MEB algorithm. The primal variables of the kernel method are recovered via the KKT conditions. Then, a detailed theoretical analysis and the main theoretical results of our new algorithm are given. It can be concluded that the proposed MEB training algorithm is independent of the sample dimension and that its time complexity is linear in the number of samples, which greatly reduces the complexity and is expected to speed up the learning process considerably. Finally, future development directions are discussed.

* This work was supported in part by the NSFC under Grant #60672179, by the China 863 program Innovation Foundation under Grant #2005AA000200 and the Doctorate Foundation of School of Engineering, Air Force Engineering University under Grant #BC0501.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 741–752, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Recently, support vector machines (SVM) and kernel methods have achieved great success in many domains [1]. Statistical learning theory [2] has changed the way learning machines are designed and has greatly advanced pattern recognition theory. One of the most notable features of SVM and related kernel methods is that they can be represented and solved in a Quadratic Programming (QP) form. However, the complexity of a QP problem is about $O(m^3)$, and for large datasets the kernel Gram matrix consumes a large amount of memory. Consequently, an interior point based algorithm has to carry out a great number of large matrix operations, which degrades the algorithm and makes it converge slowly. Thus the successful application of kernel methods to large datasets mainly depends on efficient training algorithms that can cope with large datasets.

Generally, the main idea for improving the performance of a training algorithm is to reduce its complexity [1]. One idea is to decompose the problem into smaller sub-problems that are then solved using a standard training algorithm, such as chunking, decomposition, sequential minimal optimization (SMO), and successive over relaxation (SOR). Another idea is to reduce the burden (rank) of the kernel Gram matrix, for example by matrix decomposition, sampling or other methods. In fact, using an approximate [3] instead of an exact solution of the training problem can achieve satisfactory application results by requiring the relative gap tolerance or the violation of the KKT conditions to be smaller than a given epsilon. Moreover, a breakthrough [4] was made with a minimum enclosing ball (MEB) based approximation algorithm that has both commendable time complexity and space complexity. This makes it practical for high dimensional large dataset problems. There are already pioneering works [5] [6] [7] on MEB applications in training kernel methods, which achieved fast convergence and commendable performance compared with the standard SVM.

Motivated by the above, this paper is organized as follows. Section 2 reviews the basics and state-of-the-art MEB algorithms. Section 3 outlines the idea of viewing kernel methods as a MEB, and briefly introduces the generalized core vector machine (GCVM) algorithm. Then a new fast kernel coreset MEB approximation algorithm and a detailed theoretical analysis are presented in Section 4. Conclusions and future developments are given in Section 5.

2 The Minimum Enclosing Ball Problem

Given a finite set of points $Z := \{z_i \in R^d\}_{i=1}^{m}$, the minimum enclosing ball of $Z$ is the smallest ball that contains all the points in $Z$. The MEB problem arises in a number of important applications, which often require working in high dimensional spaces. Applications of MEB include gap tolerant classifier design, SVM model selection, support vector clustering, shape fitting, computer graphics (e.g., collision detection, visibility culling), and facility location problems.

2.1 Definitions and Formulations

Definition 1. For $c \in R^d$ and a nonnegative $r \in R$, a ball centered at $c \in R^d$ with radius $r \in R$ is defined as $B_{c,r} := \{x \in R^d : \|x - c\| \le r\}$, where $\|\cdot\|$ denotes the Euclidean norm.

Definition 2. Given a finite set of points $Z := \{z_i \in R^d\}_{i=1}^{m}$, a ball $B_{c,r}$ is called the MEB $B_{c_Z, r_Z} := \mathrm{MEB}(Z)$ of $Z$ if it encloses all the given points with the minimum radius.

Definition 3. Given $\varepsilon > 0$, a ball $B_{c,r}$ is said to be a $(1 + \varepsilon)$ approximation to the MEB $B_{c_Z, r_Z}$ of $Z$ if $Z \subset B_{c,r}$ and $r < (1 + \varepsilon) r_Z$.

Definition 4. A subset $X \subseteq Z$ is said to be an $\varepsilon$ coreset of $Z$ if $r_X \le r_Z \le (1 + \varepsilon) r_X$, where $B_{c_X, r_X}$ is a MEB of $X$.


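With the above definitions, the MEB problem can be formulated in the following kernelized form:

$$\max_{\alpha} \; \Phi(\alpha) := \alpha^T \mathrm{diag}(K) - \alpha^T K \alpha \quad \text{s.t. } \alpha^T \mathbf{1} = 1,\; \alpha_i \ge 0,\; i = 1, \ldots, m. \tag{1}$$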

In practice, the dual QP form is more often used, and this paper will use this form for the kernel extension algorithms of MEB. In the rest of the paper, we always rely on the dual form (1).

2.2 State-of-the-Art MEB Algorithms in Euclidean Space

For a fixed dimension $d$, the MEB can be computed in $O(m)$ operations. However, the cost depends exponentially on the dimension $d$ [8]. Thus, for high dimensional MEB problems, these algorithms suffer from the curse of dimensionality. High dimensional MEB became practical only when Bădoiu, Har-Peled, and Indyk [4] found an $O(1/\varepsilon^2)$ size coreset that is independent of both $d$ and $m$. Based on this result, the MEB can be computed in $O\!\left(\frac{md}{\varepsilon^2} + \frac{1}{\varepsilon^{10}}\log\frac{1}{\varepsilon}\right)$ operations. Later, Bădoiu and Clarkson [9] found an improved $O(1/\varepsilon)$ size coreset. With this result, the MEB can be computed in $O\!\left(\frac{md}{\varepsilon} + \frac{1}{\varepsilon^{5}}\log\frac{1}{\varepsilon}\right)$ operations. Kumar, Mitchell and Yildirim [10] independently found the $O(1/\varepsilon)$ size coreset and, together with an SOCP formulation, improved the complexity bound to $O\!\left(\frac{md}{\varepsilon} + \frac{1}{\varepsilon^{4.5}}\log\frac{1}{\varepsilon}\right)$.

Recently, Yildirim [11] proposed two new first-order Taylor approximation algorithms for MEB via a modified Frank-Wolfe algorithm in $O(md/\varepsilon)$ operations, which is the best known complexity bound for an $O(1/\varepsilon)$ size coreset. In addition, Clarkson [12] surveys the relations among coresets, sparse greedy approximation, and the Frank-Wolfe algorithm from a more general point of view.
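To make the idea of a $(1 + \varepsilon)$ approximation concrete, the sketch below implements the simple iterative scheme usually attributed to Bădoiu and Clarkson: repeatedly move the current center a shrinking step towards the farthest point. It is shown only as an illustration of Euclidean MEB approximation and is not the algorithm analyzed in this paper; the function name and the test points are assumptions.

```python
import math


def approx_meb(points, eps):
    """(1 + eps)-approximate minimum enclosing ball in Euclidean space.

    Starts at an arbitrary input point and performs ceil(1 / eps^2) updates
    c <- c + (p - c) / (i + 1), where p is the point farthest from c.
    """
    center = list(points[0])
    for i in range(1, math.ceil(1 / eps ** 2) + 1):
        far = max(points, key=lambda p: math.dist(p, center))
        center = [c + (f - c) / (i + 1) for c, f in zip(center, far)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


# Corners of a square: the exact MEB is centered at the origin with radius sqrt(2).
square = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
center, radius = approx_meb(square, eps=0.1)
```

Each update costs $O(md)$ for the farthest-point search, and the number of updates depends only on $\varepsilon$, which is the property that makes coreset-style methods attractive in high dimensions.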

3 View Kernel Methods as MEB

The first kernel MEB-like algorithm, called support vector data description (SVDD) [13], was proposed by Tax against the background of SVM. The hard-margin SVDD first maps all the data into an implicit high dimensional feature space. Then, by formulating the dual QP problem of MEB using the kernel trick, it constructs an exact MEB there. Kumar proposed the first coreset kernel $(1 + \varepsilon)$ MEB in $O(1/\varepsilon^2)$ iterations, and it beats the standard SVM implementation with Gaussian kernel in hand recognition [14].

Recently, Tsang et al. proposed a new coreset-MEB based learning algorithm called core vector machine (CVM) [5] [6]. They have shown that the hard-margin SVDD, one-class, two-class L2SVM, and L2SVR can be put into the CVM framework. Moreover, CVM is required to solve an exact sub-QP MEB problem in each iteration. It is a $(1 + \varepsilon)$ approximation to the MEB in $O\!\left(\frac{m}{\varepsilon^2} + \frac{1}{\varepsilon^4}\right)$ with an $O(1/\varepsilon)$ size coreset.

However, as pointed out in Section 1, it is more reasonable to solve a relaxed sub-QP MEB problem by setting the relative gap tolerance $\delta$ in the interior point QP algorithm, since Kumar indicated that specific relations between the relative gap tolerance $\delta$ [10] and $\varepsilon$ can be given theoretically. This motivates us to give a truly $(1 + \varepsilon)$ MEB, which will be investigated in Section 4.

3.1 L2 Regression Estimation Machine

As pointed out in [5], the original CVM works for L2SVM only when the condition $k(x, x) = \kappa$ is satisfied. However, support vector regression cannot be trained by the minimum enclosing ball algorithm adopted in CVM. This is due to the fact that there is a linear term in the objective of the dual. Therefore, the minimum enclosing ball training algorithm adopted in CVM cannot be extended to regression estimation directly. Recall that given a training dataset $\{z_i = (x_i, y_i)\}_{i=1}^{m}$, with input $x_i \in R^d$ and $y_i \in R$, the support vector regression estimation machine constructs a linear function $f(x) = w^T \varphi(x) + b$ in the kernel induced feature space by minimizing the deviation from the training data using a loss function such as the $\varepsilon$-insensitive loss function:

$$|y - f(x)|_{\varepsilon} = \begin{cases} 0, & \text{if } |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon, & \text{otherwise} \end{cases} \tag{2}$$

where $\varepsilon$ is the tolerance tube width.
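The loss (2) is easy to state in code. The sketch below is a direct transcription; the function name and the sample values are illustrative.

```python
def eps_insensitive_loss(y, fx, eps):
    """|y - f(x)|_eps: zero inside the tube of half-width eps, linear outside."""
    return max(0.0, abs(y - fx) - eps)


inside_tube = eps_insensitive_loss(1.0, 1.05, eps=0.1)
outside_tube = eps_insensitive_loss(1.0, 1.5, eps=0.1)
```

Predictions that deviate from the target by at most $\varepsilon$ incur no loss, so only points outside the tube contribute to the training error.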

Meanwhile, it tries to keep the linear function $f(x)$ flat, i.e. to minimize the term $\|w\|$, so as to obtain a maximum margin. By introducing slack variables $\xi_i, \xi_i^*$, the primal of the regression estimation machine using L2 norm errors can be formulated as:

$$\min \; \|w\|^2 + b^2 + \frac{C}{\mu m}\sum_{i=1}^{m}\left(\xi_i^2 + \xi_i^{*2}\right) + 2C\varepsilon \quad \text{s.t. } y_i - (w^T\varphi(x_i) + b) \le \varepsilon + \xi_i,\; (w^T\varphi(x_i) + b) - y_i \le \varepsilon + \xi_i^* \tag{3}$$

where $C > 0$ is a penalty constant specified by the user and $\mu > 0$ is a parameter that controls the size of $\varepsilon$. The Lagrange dual can be constructed from the following counterpart (this is possible because multiplying the objective by a constant does not affect the decision variables):

$$L(w, b, \varepsilon, \xi_i, \xi_i^*, \alpha_i, \alpha_i^*) = \frac{1}{2C}\|w\|^2 + \frac{1}{2C}b^2 + \frac{1}{2\mu m}\sum_{i=1}^{m}\left(\xi_i^2 + \xi_i^{*2}\right) + \varepsilon + \sum_{i=1}^{m}\alpha_i\left(y_i - w^T\varphi(x_i) - b - \varepsilon - \xi_i\right) + \sum_{i=1}^{m}\alpha_i^*\left(w^T\varphi(x_i) + b - y_i - \varepsilon - \xi_i^*\right). \tag{4}$$

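Using the KKT optimality conditions, we get the following relations:

$$\begin{cases} \dfrac{\partial L}{\partial w} = 0 \Rightarrow w = C\sum_{i=1}^{m}(\alpha_i - \alpha_i^*)\varphi(x_i) \\[2mm] \dfrac{\partial L}{\partial b} = 0 \Rightarrow b = C\sum_{i=1}^{m}(\alpha_i - \alpha_i^*) \\[2mm] \dfrac{\partial L}{\partial \varepsilon} = 0 \Rightarrow \sum_{i=1}^{m}(\alpha_i + \alpha_i^*) = 1 \\[2mm] \dfrac{\partial L}{\partial \xi_i} = 0 \Rightarrow \xi_i = \mu m \alpha_i, \quad \dfrac{\partial L}{\partial \xi_i^*} = 0 \Rightarrow \xi_i^* = \mu m \alpha_i^* \end{cases} \tag{5}$$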

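Then we can write the dual as

$$\max_{\alpha_i, \alpha_i^*} \; \begin{bmatrix}\alpha^T & \alpha^{*T}\end{bmatrix}\begin{bmatrix}\frac{2}{C}y \\ -\frac{2}{C}y\end{bmatrix} - \begin{bmatrix}\alpha^T & \alpha^{*T}\end{bmatrix}\tilde{K}\begin{bmatrix}\alpha \\ \alpha^*\end{bmatrix} \quad \text{s.t. } \begin{bmatrix}\alpha^T & \alpha^{*T}\end{bmatrix}\mathbf{1} = 1,\; \alpha, \alpha^* \ge 0 \tag{6}$$

where $y = [y_1, y_2, \ldots, y_m]^T$, and $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T$, $\alpha^* = [\alpha_1^*, \alpha_2^*, \ldots, \alpha_m^*]^T$ are the Lagrange dual variables,

$$\tilde{K} = \left[\tilde{k}(z_i, z_j)\right] = \begin{bmatrix} K + \mathbf{1}\mathbf{1}^T + \frac{\mu m}{C}I & -(K + \mathbf{1}\mathbf{1}^T) \\ -(K + \mathbf{1}\mathbf{1}^T) & K + \mathbf{1}\mathbf{1}^T + \frac{\mu m}{C}I \end{bmatrix}$$

where $\tilde{K}$ is a $2m \times 2m$ kernel matrix. Obviously, (6) is a QP problem; using an efficient QP solver, we can obtain the dual variables. According to the KKT conditions, the primal variables can be recovered as

$$\begin{cases} w = C\sum_{i=1}^{m}(\alpha_i - \alpha_i^*)\varphi(x_i), \quad b = C\sum_{i=1}^{m}(\alpha_i - \alpha_i^*) \\ \xi_i = \mu m \alpha_i, \quad \xi_i^* = \mu m \alpha_i^* \end{cases} \tag{7}$$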


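The tolerance tube width $\varepsilon$ can be obtained via the following KKT conditions:

$$\begin{cases} \alpha_i\left(y_i - (w^T\varphi(x_i) + b) - \varepsilon - \xi_i\right) = 0 \\ \alpha_i^*\left((w^T\varphi(x_i) + b) - y_i - \varepsilon - \xi_i^*\right) = 0 \end{cases} \tag{8}$$

Adding the above two equations, and using the fact that $\begin{bmatrix}\alpha^T & \alpha^{*T}\end{bmatrix}\mathbf{1} = 1$, we get

$$\varepsilon = \begin{bmatrix}\alpha^T & \alpha^{*T}\end{bmatrix}\begin{bmatrix}y \\ -y\end{bmatrix} - C\begin{bmatrix}\alpha^T & \alpha^{*T}\end{bmatrix}\tilde{K}\begin{bmatrix}\alpha \\ \alpha^*\end{bmatrix}. \tag{9}$$

Again according to the KKT conditions $\xi_i = \mu m \alpha_i$, $\xi_i^* = \mu m \alpha_i^*$, and using the fact that $\begin{bmatrix}\alpha^T & \alpha^{*T}\end{bmatrix}\mathbf{1} = 1$, we get

$$\mu = \frac{1}{m}\sum_{i=1}^{m}\left(\xi_i + \xi_i^*\right). \tag{10}$$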
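The step from (8) to (9) can be spelled out as follows; this is a sketch of the algebra using only the relations (7) and the block structure of the $2m \times 2m$ kernel matrix, written here as $\tilde{K}$. Summing both conditions in (8) over $i$ and using $\sum_i(\alpha_i + \alpha_i^*) = 1$ gives the first line, and substituting (7) gives the rest.

```latex
\begin{aligned}
\varepsilon
  &= \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)\,y_i
     - \sum_{i=1}^{m}(\alpha_i - \alpha_i^*)\bigl(w^T\varphi(x_i) + b\bigr)
     - \sum_{i=1}^{m}\bigl(\alpha_i\xi_i + \alpha_i^*\xi_i^*\bigr) \\
  &= (\alpha - \alpha^*)^T y
     - C\,(\alpha - \alpha^*)^T\bigl(K + \mathbf{1}\mathbf{1}^T\bigr)(\alpha - \alpha^*)
     - \mu m\bigl(\|\alpha\|^2 + \|\alpha^*\|^2\bigr) \\
  &= \begin{bmatrix}\alpha^T & \alpha^{*T}\end{bmatrix}
     \begin{bmatrix} y \\ -y \end{bmatrix}
     - C\begin{bmatrix}\alpha^T & \alpha^{*T}\end{bmatrix}
       \tilde{K}
       \begin{bmatrix}\alpha \\ \alpha^*\end{bmatrix}.
\end{aligned}
```

The last equality uses the fact that the diagonal blocks of $\tilde{K}$ contribute the $\frac{\mu m}{C}\left(\|\alpha\|^2 + \|\alpha^*\|^2\right)$ term while the off-diagonal blocks supply the cross terms of the quadratic form in $\alpha - \alpha^*$.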

We see that the pre-specified parameter $\mu$ is closely related to the slack variables, and that $\mu$ actually controls the regression error. Note that although (6) is a standard QP problem, it is not of the form required by CVM because of the linear term in the dual objective. Thus, the original CVM algorithm cannot be applied directly, and we should seek an alternative method to utilize the features of CVM.

3.2 Regression Estimation as a Center-Constrained MEB

By augmenting an extra item $\Delta_i \in R$ to each $\varphi(z_i)$, i.e. $[\varphi^T(z_i) \;\; \Delta_i]^T$, while constraining the center to be of the form $[c^T \;\; 0]^T$, a center-constrained MEB was formulated. Using this new MEB formulation, [6] proposed a generalized CVM (GCVM) algorithm and further extended CVM to support vector regression (SVR), ranking, and imbalanced SVM. As shown in [6], the center-constrained MEB can be modified as:

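$$
\max_{\alpha} \; \alpha^T (\mathrm{diag}(K) + \Delta) - \alpha^T K \alpha
\quad \text{s.t.} \quad \alpha \ge 0, \;\; \alpha^T \mathbf{1} = 1
\tag{11}
$$

where $\Delta := [\Delta_1^2, \ldots, \Delta_m^2]^T$. The radius and center can be recovered as

$$
c = \sum_{i=1}^{m} \alpha_i \varphi(z_i), \qquad
r = \sqrt{\alpha^T (\mathrm{diag}(K) + \Delta) - \alpha^T K \alpha}
\tag{12}
$$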


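Moreover, the squared distance between the center $[c^T \;\; 0]^T$ and any point $[\varphi^T(z_l) \;\; \Delta_l]^T$ is computed using the kernel trick by:

$$
\| c - \varphi(z_l) \|^2 + \Delta_l^2 = \| c \|^2 - 2 (K\alpha)_l + k_{ll} + \Delta_l^2
\tag{13}
$$

Due to the constraint $\alpha^T \mathbf{1} = 1$, for an arbitrary $\eta \in R$, (11) is equivalent to

$$
\max_{\alpha} \; \alpha^T (\mathrm{diag}(K) + \Delta - \eta \mathbf{1}) - \alpha^T K \alpha
\quad \text{s.t.} \quad \alpha \ge 0, \;\; \alpha^T \mathbf{1} = 1
\tag{14}
$$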
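Both the kernel-trick distance (13) and the equivalence of (11) and (14) are easy to confirm on a small example. The sketch below assumes a linear kernel so that the center can also be formed explicitly; the helper names and the numbers are illustrative only, and `delta_sq` holds the squared entries $\Delta_i^2$.

```python
def linear_gram(points):
    """Gram matrix of the linear kernel (an assumption made for this check)."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


def quad_form(alpha, K):
    """Return alpha^T K alpha."""
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))


def objective(alpha, K, delta_sq, eta=0.0):
    """Objective of (14); eta = 0 gives the objective of (11)."""
    lin = sum(alpha[i] * (K[i][i] + delta_sq[i] - eta) for i in range(len(alpha)))
    return lin - quad_form(alpha, K)


def sq_dist_to_center(alpha, K, delta_sq, l):
    """Right-hand side of (13): ||c||^2 - 2 (K alpha)_l + k_ll + Delta_l^2."""
    k_alpha_l = sum(K[l][j] * alpha[j] for j in range(len(alpha)))
    return quad_form(alpha, K) - 2.0 * k_alpha_l + K[l][l] + delta_sq[l]


points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
alpha = [0.5, 0.25, 0.25]                   # nonnegative, sums to 1
delta_sq = [0.0, 1.0, 4.0]
K = linear_gram(points)

# Explicit center c = sum_i alpha_i x_i, available here because the kernel is linear.
c = [sum(a * p[d] for a, p in zip(alpha, points)) for d in range(2)]
explicit = sum((ci - xi) ** 2 for ci, xi in zip(c, points[1])) + delta_sq[1]
print(explicit, sq_dist_to_center(alpha, K, delta_sq, 1))
print(objective(alpha, K, delta_sq), objective(alpha, K, delta_sq, eta=3.0))
```

On the simplex the two objectives differ by the constant $\eta$, so they share the same maximizer, which is the point of (14).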

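Therefore, (14) now allows a linear term in the objective. In order to use the GCVM algorithm, we just need to define the following items:

$$
\tilde{\alpha} = [\tilde{\alpha}_1, \tilde{\alpha}_2, \ldots, \tilde{\alpha}_{2m}]^T = [\alpha^T \;\; \alpha^{*T}]^T
\tag{15}
$$

$$
\tilde{\Delta} = -\mathrm{diag}(\tilde{K}) + \eta \mathbf{1} + \frac{2}{C} \begin{bmatrix} y \\ -y \end{bmatrix}
\tag{16}
$$

Thus, for a sufficiently large $\eta > 0$ (so that $\tilde{\Delta} \ge 0$), (6) can be written as

$$
\max_{\tilde{\alpha}} \; \tilde{\alpha}^T (\mathrm{diag}(\tilde{K}) + \tilde{\Delta} - \eta \mathbf{1}) - \tilde{\alpha}^T \tilde{K} \tilde{\alpha}
\quad \text{s.t.} \quad \tilde{\alpha} \ge 0, \;\; \tilde{\alpha}^T \mathbf{1} = 1
\tag{17}
$$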
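That (17) reproduces the dual objective (6) follows by substituting (16), and it can also be confirmed numerically. The sketch below is only an illustration: the matrix `K_tilde` is an arbitrary symmetric stand-in rather than one built from data, and all names and values are assumptions.

```python
def quad_form(a, M):
    """Return a^T M a."""
    n = len(a)
    return sum(a[i] * M[i][j] * a[j] for i in range(n) for j in range(n))


def delta_tilde(K_tilde, y, C, eta):
    """Equation (16): -diag(K_tilde) + eta * 1 + (2/C) [y; -y]."""
    y_aug = list(y) + [-v for v in y]
    return [-K_tilde[i][i] + eta + 2.0 * y_aug[i] / C for i in range(len(y_aug))]


def objective_6(a, K_tilde, y, C):
    """Dual objective (6) with a = [alpha; alpha*]."""
    y_aug = list(y) + [-v for v in y]
    lin = sum(ai * 2.0 * yi / C for ai, yi in zip(a, y_aug))
    return lin - quad_form(a, K_tilde)


def objective_17(a, K_tilde, delta, eta):
    """Objective (17) in center-constrained MEB form."""
    lin = sum(a[i] * (K_tilde[i][i] + delta[i] - eta) for i in range(len(a)))
    return lin - quad_form(a, K_tilde)


# Stand-in symmetric matrix for m = 2 (assumed values, block structure as in K-tilde).
K_tilde = [[3.0, 1.0, -2.0, -1.0],
           [1.0, 3.0, -1.0, -2.0],
           [-2.0, -1.0, 3.0, 1.0],
           [-1.0, -2.0, 1.0, 3.0]]
y = [0.5, -1.0]
C, eta = 4.0, 10.0
a = [0.1, 0.2, 0.3, 0.4]                    # feasible: nonnegative, sums to 1
delta = delta_tilde(K_tilde, y, C, eta)
print(objective_6(a, K_tilde, y, C), objective_17(a, K_tilde, delta, eta))
print(min(delta) >= 0.0)                    # eta is large enough in this example
```

The role of $\eta$ is visible in the last line: it only has to be large enough to make every entry of $\tilde{\Delta}$ nonnegative, and it cancels out of the objective.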

Now (17) has the same form as (14), and we call (17) the Generalized Center-Constrained Minimum Enclosing Ball (GCCMEB). In fact, for the given support vector kernel regression estimation problem, each training point $z_i = (x_i, y_i)$, $i = 1, \ldots, m$, can be seen as two separate points

$$
[\tilde{\varphi}^T(z_i) \;\; \Delta_i]^T = \left[ \varphi^T(x_i) \;\;\; 1 \;\;\; \sqrt{\tfrac{\mu m}{C}}\, e_i^T \;\;\; \Delta_i \right]^T
\tag{18}
$$

$$
[\tilde{\varphi}^T(z_{i+m}) \;\; \Delta_{i+m}]^T = \left[ -\varphi^T(x_i) \;\;\; -1 \;\;\; \sqrt{\tfrac{\mu m}{C}}\, e_{i+m}^T \;\;\; \Delta_{i+m} \right]^T
\tag{19}
$$

where $e_i$ is a $2m$-dimensional vector with the $i$-th element being 1 and the remaining elements 0.

The primal variables for the L2 support vector regression estimation machine can then be recovered by

$$
w = C \sum_{i=1}^{m} (\tilde{\alpha}_i - \tilde{\alpha}_{i+m})\,\varphi(x_i), \quad
b = C \sum_{i=1}^{m} (\tilde{\alpha}_i - \tilde{\alpha}_{i+m}), \quad
\xi_i = \mu m \tilde{\alpha}_i, \quad \xi_i^* = \mu m \tilde{\alpha}_{i+m}
\tag{20}
$$
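One way to see why each training point splits into the two points (18) and (19) is to form the augmented features explicitly and check that their inner products reproduce the blocks of $\tilde{K}$. The sketch below assumes a linear kernel, so that $\varphi(x) = x$, and leaves out the $\Delta$ entry, which does not enter $\tilde{K}$; the function names and data are illustrative.

```python
import math


def augmented_features(X, mu, C):
    """Explicit features of (18)-(19) for phi(x) = x, without the Delta entry."""
    m = len(X)
    s = math.sqrt(mu * m / C)
    feats = []
    for sign, offset in ((1.0, 0), (-1.0, m)):
        for i, x in enumerate(X):
            e = [0.0] * (2 * m)
            e[i + offset] = s               # sqrt(mu*m/C) times the unit vector
            feats.append([sign * v for v in x] + [sign] + e)
    return feats


def gram(F):
    """Matrix of pairwise inner products of the feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in F] for u in F]


X = [[1.0, 2.0], [0.5, -1.0]]               # m = 2 toy inputs (assumed)
G = gram(augmented_features(X, mu=0.2, C=5.0))
for row in G:
    print([round(v, 4) for v in row])
```

The diagonal blocks carry $K + \mathbf{1}\mathbf{1}^T$ plus the ridge $\mu m / C$, and the off-diagonal blocks carry $-(K + \mathbf{1}\mathbf{1}^T)$, which is the structure of $\tilde{K}$ used in (6).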


3.3 Generalized Core Vector Machines

The GCVM algorithm is shown in Table 1, which is based on a simple iteration algorithm [8]. The main idea is to incrementally expand the ball by including the point that is farthest away from the current center. However, it supposes that the sub-QP problem is solved exactly in each iteration. That is to say, it supposes that the sub-QP algorithm converges with zero error tolerance, which is guaranteed theoretically but is numerically impossible in practical applications. Therefore, it is more reasonable to take the error tolerance into account when using an approximation algorithm.

Table 1. Generalized Core Vector Machines

Algorithm GCVM: $(1 + \varepsilon)$ MEB solver
Inputs: $k(\cdot,\cdot)$, $\varphi(\cdot)$, $Z \subset F$, $m \ge 3$, $\varepsilon \in (0, 1)$
1. Initialize coreset $X \leftarrow \{z_1, z_2\}$
2. Terminate if there is no training point $z$ such that $\varphi(z)$ falls outside the $(1 + \varepsilon)$ ball $B_{c^t, (1+\varepsilon) r^t}$
3. Find $z$ such that $\varphi(z)$ is the furthest away from the center $c^t$ using (13), and set $X_{t+1} = X_t \cup \{z\}$
4. Find the new ball $B_{c^{t+1}, (1+\varepsilon) r^{t+1}}$ using (12)
5. $t \leftarrow t + 1$, go to 2

Remark: we can see that GCVM has time complexity bounded by $O\!\left(\frac{m}{\varepsilon^2} + \frac{1}{\varepsilon^4}\right)$.

4 Proposed MEB Algorithm for Kernel Regression Estimation

The algorithm (see Table 2) is based on Kumar's iterated MEB algorithm, but we use a QP solver instead of an SOCP solver in each sub-MEB problem. The MEB algorithm returns an $O(\frac{1}{\varepsilon})$ size coreset, which is independent of both the dimension and the number of points. We then give a detailed theoretical analysis of the algorithm and, finally, the main results of our coreset MEB algorithm. In our previous work [7], we applied the rigid coreset MEB algorithm to pattern classification. Here we further extend it to the kernel regression estimation problem.

4.1 The Rigid Coreset MEB Algorithm

The proposed coreset MEB algorithm is listed in Table 2. It first uses the two farthest points to initialize the coreset and the approximated MEB radius. Then, by setting the relation between the relative gap tolerance and the approximation epsilon, it returns a $(1 + \delta)$ approximation to the MEB of the coreset in each iteration. Finally, it returns an at-most $(1 + \varepsilon)$ approximation to the MEB of $Z$.

Table 2. The proposed Rigid Coreset MEB Approximation Algorithm

Algorithm 1: $(1 + \varepsilon)$ MEB QP Solver
Inputs: $k(\cdot,\cdot)$, $\varphi(\cdot)$, $Z \subset F$, $m \ge 3$, $\varepsilon \in (0, 1)$
1. Initialize coreset $X \leftarrow \{z_{f_1}, z_{f_2}\}$
2. ALL_CHECKED $\leftarrow 0$
3. $\delta \leftarrow \frac{\varepsilon^2}{163}$
4. $k \leftarrow 0$
5. While (~ALL_CHECKED), do
6. $B, c, r \leftarrow \mathrm{QPMEB}(X, \delta)$
7. $z \leftarrow \arg\max_{z \in Z \setminus X} \| \varphi(z) - c \|^2$
8. If $z \in B_{c, (1 + \frac{\varepsilon}{2}) r}$ then ALL_CHECKED $\leftarrow 1$
9. Else ALL_CHECKED $\leftarrow 0$
10. $k \leftarrow k + 1$
11. End If
12. $X \leftarrow X \cup \{z\}$
13. End While
14. Output $B^k$, $c^k$, $(1 + \frac{\varepsilon}{2}) r^k$, $X^k$
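The control flow of Table 2 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the points live in ordinary Euclidean space (the identity map stands in for $\varphi$), and the inner QP solver is replaced by a simple farthest-point iteration in the style of Bădoiu and Clarkson [9], which does not enforce the gap tolerance $\delta = \varepsilon^2/163$. The sketch also stops before adding the last checked point to the coreset. The function names are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def approx_meb(points, iters=500):
    """Stand-in for QPMEB: repeatedly move the center toward the farthest point."""
    c = list(points[0])
    for t in range(1, iters + 1):
        far = max(points, key=lambda p: dist(c, p))
        c = [ci + (fi - ci) / (t + 1) for ci, fi in zip(c, far)]
    r = max(dist(c, p) for p in points)     # radius that encloses the coreset
    return c, r


def rigid_coreset_meb(Z, eps):
    """Outer loop of Table 2 on explicit points; returns center, radius, coreset."""
    # Step 1: a far-apart pair found with two farthest-point sweeps.
    f1 = max(range(len(Z)), key=lambda i: dist(Z[0], Z[i]))
    f2 = max(range(len(Z)), key=lambda i: dist(Z[f1], Z[i]))
    X = [f1, f2]
    while True:
        c, r = approx_meb([Z[i] for i in X])             # step 6
        rest = [i for i in range(len(Z)) if i not in X]
        if not rest:
            break
        far = max(rest, key=lambda i: dist(c, Z[i]))     # step 7
        if dist(c, Z[far]) <= (1 + eps / 2) * r:         # step 8
            break
        X.append(far)                                    # step 12
    return c, (1 + eps / 2) * r, X


random.seed(0)
Z = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(60)]
c, R, X = rigid_coreset_meb(Z, eps=0.2)
print(len(X), R)
```

By construction every point of `Z` lies inside the returned ball, since the loop only stops once the farthest remaining point passes the test of step 8.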

4.2 Theoretical Analysis

In fact, we can tighten the result to an at-most $(1 + \varepsilon)$ approximation to the MEB with an $O(\frac{1}{\varepsilon})$ size coreset by utilizing the relation between the relative gap tolerance $\delta$ and $\varepsilon$. For completeness, we list all the necessary lemmas and the two theorems. For the detailed proofs, the reader should refer to [7].


Lemma 1. Let $B_{c_Z, r_Z}$ be the MEB of point set $Z$. Then any closed halfspace containing $c_Z$ contains at least one point $z_i$ in the boundary subset of $B_{c_Z, r_Z}$, at distance $r_Z$ from the center $c_Z$.

Lemma 2. The initial radius $r^0$ obtained using the two farthest points $z_{f_1}, z_{f_2}$ satisfies $r^0 \ge \frac{1}{3} r_Z$.

Lemma 3. Let $B_{c_X, r_X}$ be the MEB of the point set $X$, and let a point $z \notin B_{c_X, (1 + \frac{\varepsilon}{3}) r_X}$ for an $\varepsilon \in (0, 1)$. Then the radius of the MEB of $X \cup \{z\}$ is at least $(1 + \frac{\varepsilon^2}{33}) r_X$.

Lemma 4. Let $B_{c_Z, r_Z}$ be the MEB of point set $Z$. Let $B_{c, r}$ be the $(1 + \delta)$ approximation to the MEB. Then

$$
\| c_Z - c \| \le r_Z \sqrt{\delta(\delta + 2)} \quad \text{and} \quad
B_{c_Z,\, r_Z - \| c - c_Z \|} \subseteq B_{c, r} \subseteq B_{c_Z,\, (1 + \delta) r_Z + \| c - c_Z \|}
$$

hold.

Lemma 5. Let $B_{c_Z, r_Z}$ be the MEB of point set $Z$. Let $B_{c, r}$ be the $(1 + \delta)$ approximation to the MEB. If $z \notin B_{c, (1 + \frac{\varepsilon}{2}) r}$, then $z \notin B_{c_Z, (1 + \frac{\varepsilon}{3}) r_Z}$, provided that the gap tolerance $\delta$ satisfies

$$
\delta \le \sqrt{1 + \left( \frac{\varepsilon}{6 + 3\varepsilon} \right)^2} - 1 .
$$

From the above lemmas, we can derive the following important theorems:

Theorem 1. Let $B_{c_Z, r_Z}$ be the MEB of point set $Z$. The ball returned by Algorithm 1 is a $(1 + \varepsilon)$ approximation to the MEB $B_{c_Z, r_Z}$, and the time complexity is bounded by $O\!\left(\frac{m}{\varepsilon^2} + \frac{1}{\varepsilon^4}\right)$.

Theorem 2. The final coreset $X$ returned by Algorithm 1 is an $\varepsilon$ coreset of $Z$ and has $O(\frac{1}{\varepsilon})$ size, provided that the gap tolerance $\delta$ satisfies δ(δ + 2) ≤ 2(2ε+1). The space complexity of Algorithm 1 is bounded by $O(\frac{1}{\varepsilon^2})$.

4.3 Complexity Analysis

To show the superiority of the proposed algorithm, we review here the time and space complexity analysis adopted in [7].

Time complexity: Suppose that a QP implementation takes $O(m^3)$ time, and assume that the kernel evaluations take constant time. Because only one point is added to the coreset in each loop, the size of the coreset is $|B^k| = 2 + k$. The initialization step takes $O(m)$ time, step 6 takes $O((k + 2)^3) = O(k^3)$ time, the distance computations in steps 7 and 8 take $O((k + 2)^2 + km) = O(k^2 + km)$ time, and the other steps take only constant time. Therefore, each loop takes $O(k^3 + km)$ time, and for $S_k = O(\frac{1}{\varepsilon})$ loops the total time complexity is bounded by:

$$
T = \sum_{k=1}^{S_k} O(km + k^3) = O(S_k^2 m + S_k^4) = O\!\left( \frac{m}{\varepsilon^2} + \frac{1}{\varepsilon^4} \right)
\tag{21}
$$

Space complexity: Suppose we have $m$ samples in the QP problem; then we need $O(m)$ space to store them. Since the samples are stored outside the core memory, we ignore this term in the subsequent analysis. As can be concluded from Algorithm 1, only core vectors are involved in the QP. Therefore, the space complexity for the $k$-th loop is $O(|B^k|^2)$, and for the $O(\frac{1}{\varepsilon})$ size coreset the space complexity is bounded by:

$$
S_k = O\!\left( \frac{1}{\varepsilon^2} \right)
\tag{22}
$$
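For readers who want the intermediate step behind (21), the sum can be evaluated in closed form. Writing $S$ for the number of loops (a shorthand introduced only for this note):

```latex
\sum_{k=1}^{S} \left( k m + k^{3} \right)
  = m \, \frac{S(S+1)}{2} + \left( \frac{S(S+1)}{2} \right)^{2}
  = O\!\left( S^{2} m + S^{4} \right),
\qquad
S = O\!\left( \tfrac{1}{\varepsilon} \right)
\;\Longrightarrow\;
O\!\left( \frac{m}{\varepsilon^{2}} + \frac{1}{\varepsilon^{4}} \right).
```

The first term comes from the distance computations over all $m$ points and the second from the cubic cost of the inner QP on a coreset that grows by one point per loop.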

Note that our rigid MEB algorithm has a time complexity bound of the same order as CVM. Moreover, given the same epsilon, our algorithm tends to provide a tighter approximating ball, which is assured by the theoretical analysis of the relation between the gap tolerance and epsilon.

5 Conclusions

We proposed an MEB algorithm capable of handling large data sets and extended it to kernel regression estimation training. We gave the main lemmas of the theoretical analysis and concluded with two main theorems for the proposed rigid MEB training algorithm. The main contribution is a rigid theoretical relation that links the approximation epsilon with the gap tolerance and guarantees the convergence of the algorithm theoretically. For future developments, we will continue to extend it to more cases. In particular, we would like to develop a fast MEB training algorithm for kernel methods based on vector-valued reproducing kernel Hilbert space theory, which is expected to solve multiclass classification and multivariable regression problems with low model complexity and fast training speed at the same time. We would especially like to use it as the default solver of our proposed enclosing machine learning methods, a new machine learning paradigm connecting the cognition process and machine learning. Interested readers should refer to [15-16] for more details.

Acknowledgments. The first author would like to give special thanks to Dr. Ivor W. Tsang for his invaluable discussions and constructive suggestions about the new kernel MEB algorithm. Special thanks also go to Dr. Piyush Kumar, whose excellent work enabled us to give a rigid theoretical analysis and proofs for the new algorithm. Special thanks also go to Dr. Johan Löfberg and Dr. Hsuan-Tien Lin for hints on implementation.

References

1. Li, Y.H., Wei, X.K., Liu, J.X.: Engineering Applications of Support Vector Machines. China Weapon Industry Press, Beijing (2004)
2. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
3. Smola, A., Schölkopf, B.: A Tutorial on Support Vector Regression. Statistics and Computing 14(3), 199–222 (2004)
4. Bădoiu, M., Har-Peled, S., Indyk, P.: Approximate Clustering via Core-sets. In: Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pp. 250–257 (2002)
5. Tsang, I.W., Kwok, J.T., Cheung, P.-M.: Core Vector Machines: Fast SVM Training on Very Large Data Sets. Journal of Machine Learning Research 6, 363–392 (2005)
6. Tsang, I.W., Kwok, J.T., Zurada, J.M.: Generalized Core Vector Machines. IEEE Transactions on Neural Networks 17(5), 1126–1140 (2006)
7. Wei, X.K., Law, R., Zhang, L., Feng, Y., Dong, Y., Li, Y.H.: A Fast Coreset Minimum Enclosing Ball Kernel Machines. In: Proceedings of International Joint Conference on Neural Networks 2008, Hong Kong, pp. 3366–3373 (2008)
8. Welzl, E.: Smallest Enclosing Disks (Balls and Ellipsoids). In: Maurer, H. (ed.) New Results and New Trends in Computer Science, pp. 359–391. Springer, Heidelberg (1991)
9. Bădoiu, M., Clarkson, K.L.: Smaller Core-sets for Balls. In: Proceedings of the 14th Annual Symposium on Discrete Algorithms, pp. 801–802 (2003)
10. Kumar, P., Mitchell, J.S.B., Yildirim, E.A.: Approximate Minimum Enclosing Ball in High Dimension Using Core-sets. The ACM Journal of Experimental Algorithmics 8(1) (2003)
11. Yildirim, E.A.: Two Algorithms for the Minimum Enclosing Ball Problem (manuscript, 2007)
12. Clarkson, K.L.: Coreset, Sparse Greedy Approximation, and the Frank-Wolfe Algorithm (manuscript, 2007)
13. Tax, D.M.J., Duin, R.P.W.: Support Vector Data Description. Pattern Recognition Letters 20(4), 1191–1199 (1999)
14. Bulatov, Y., Jambawalikar, S., Kumar, P., Sethia, S.: Hand Recognition using Geometric Classifiers. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072, pp. 753–759. Springer, Heidelberg (2004)
15. Wei, X.K., Li, Y.H., Li, Y.F., Zhang, D.F.: Enclosing Machine Learning: Concepts and Algorithms. Neural Computing and Applications 17(3), 237–243 (2008)
16. Wei, X.K.: Enclosing Machine Learning Paradigm Blogger (2008), http://uniquescaler.blogspot.com

Kernel Matrix Learning for One-Class Classification

Chengqun Wang¹, Jiangang Lu¹, Chonghai Hu², and Youxian Sun¹

¹ State Key Lab. of Industrial Control Tech., Zhejiang University, 310027, China
² Dept. of Mathematics, Zhejiang University, 310027, China
{cqwang,jglu,yxsun}@iipc.zju.edu.cn, [email protected]

Abstract. Kernel-based one-class classification is a special type of classification problem, and is widely used as an outlier detection and novelty detection technique. One of the most commonly used methods is the support vector data description (SVDD). However, its performance is mostly affected by which kernel is used. A promising way is to learn the kernel from the data automatically. In this paper, we focus on the problem of choosing the optimal kernel from a kernel convex hull for a given one-class classification task, and propose a new approach. Kernel methods work by nonlinearly mapping the data into an embedding feature space and then searching for relations in this space; however, this mapping is performed implicitly by the kernel function, and choosing a suitable kernel is a difficult problem. In our method, we first transform the data points linearly so that we obtain a new set whose variances equal unity. Then we choose the minimum embedding ball as the criterion to learn the optimal kernel matrix over the kernel convex hull. This leads to a convex quadratically constrained quadratic programming (QCQP) problem. Experimental results on a collection of benchmark data sets demonstrate the effectiveness of the proposed method.

Keywords: One-class kernel matrix learning, Kernel learning, One-class classification, Kernel selection, Support vector data description.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 753–761, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

One-class classification, or the data description problem, differs in one essential aspect from the conventional classification problem. In one-class classification, it is assumed that only information about one of the classes, the target class, is available. This means that only example objects of the target class can be used and that no information about the other class is present. One-class classification is widely used in outlier detection and novelty detection problems, and one of the most commonly used methods is the SVDD [16,17], which is inspired by the Support Vector Classifier [13]. It obtains a spherically shaped boundary model around the dataset, which can be made flexible by using a kernel function. Kernel SVDD works by embedding the data into a high-dimensional feature space and then searching for a suitable minimum embedding ball [3] in this space, which covers all of the target data points. The boundary of the hyper-sphere can be used to detect novel data points or outliers; however, the performance is mostly affected by which kernel is used. The choice of an appropriate kernel function is of crucial importance. To alleviate this difficulty, a promising way is to learn the kernel itself automatically.

Recently, a number of algorithms have been proposed to learn the kernel from labeled examples automatically. Early work on kernel learning is limited to learning the parameters of some pre-specified kernel form, e.g., by minimizing some estimates of the generalization error of SVMs [4], or by using cross validation to select the parameters of the kernels [6,9]. More recently, the work has gone beyond kernel parameter learning by learning the kernel itself with a specified criterion. Lanckriet et al. [8] pioneered the work of learning a linear combination of pre-specified kernels with the max-margin criterion for SVMs [13,14] using semidefinite programming (SDP), and it has been improved in [1] using the sequential minimal optimization (SMO) algorithm [12]. A reformulation of this problem as a semi-infinite linear program (SILP) has been proposed in [15] along with extensions to regression. Ong, C.S., et al. [11] define a Reproducing Kernel Hilbert Space (RKHS) on the space of kernels itself, leading to the so-called Hyperkernel, and choose a kernel from the parameterized family of kernels by minimizing the regularized risk functional in the Hyper-RKHS. In order to relax the restrictions on the choice of the kernel or kernel matrix, Hoi, S.C.H., et al. [7] give a nonparametric kernel matrix learning method from pairwise constraints. Besides the max-margin criterion, Yeung, D.Y., et al. choose the kernel Fisher discriminant (KFD) as the criterion [20], and Ye, J., et al. [21] use the regularized kernel discriminant (RKD) as the criterion to learn the optimal kernel from a convex combination of kernel matrices.
These methods do not directly evaluate the suitability of a kernel function; they work through an optimization procedure. There are also several kernel evaluation approaches that directly assess a given kernel matrix for the given task. For example, kernel target alignment (KTA), introduced by Cristianini, N., et al. [5], gives a specific value for the degree of agreement between a kernel matrix and the target matrix. Nguyen, C.H., et al. [10] point out several drawbacks of KTA and propose a surrogate measure to evaluate the goodness of a kernel matrix. These methods work in binary or multi-class kernel learning problems. They need between-class information to define the criterion: max-margin requires the margin information between different classes, KFD or RKD requires the inter-class scatter matrix, and KTA needs the target matrix information. In one-class classification, only the target data set is available, so we need a different way to define the criterion. Inspired by the SVDD, we choose the minimum embedding ball as the criterion, which is widely studied in the Core Vector Machine (CVM) [18,19]. In this paper, we propose a new approach to learn the optimal kernel for the one-class classification problem. First, we linearly rescale the data points so that each feature has unit variance, and then we learn an optimal kernel over a linear combination of pre-specified kernel functions under the minimum embedding ball criterion. Fortunately, this leads to a QCQP problem, which can be efficiently solved by optimization tools.

The rest of this paper is organized as follows. We first recall the traditional kernel-based SVDD and the kernel convex hull in Section 2. Our kernel learning algorithm and the corresponding criterion are then proposed in Section 3. Some empirical results are illustrated in Section 4. The conclusion and discussion are given in Section 5.

2 SVDD and Kernel Convex Hull

Given a subset $S := \{x_1, \cdots, x_n\}$ of an input space $X \subseteq R^m$, via an implicit embedding $\Phi : X \to F$, a kernel function implicitly maps the input into a high-dimensional feature space $F \subseteq R^p$ ($p \gg m$) satisfying

$$\kappa(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$$

Let $\Phi(S) = [\Phi(x_1), \cdots, \Phi(x_n)]$ be the image of $S$ under the map $\Phi$. Hence $\Phi(S)$ is a subset of the inner product space $F$. Let $K$ be the Gram matrix of kernel evaluations between all pairs of elements of $S$:

$$K_{ij} = \kappa(x_i, x_j), \quad \text{for all } i, j = 1, \cdots, n$$

Before giving our kernel matrix learning algorithm in the next section, we first recall the SVDD in this section and then give the definition of the kernel convex hull.

2.1 Support Vector Data Description

Kernel-based SVDD [16,17] is one of the most commonly used methods for the supervised one-class classification problem. Via a suitable kernel, it maps the data into a high-dimensional feature space and then searches for a minimum embedding ball, which is a spherically shaped boundary around the target data set. The minimum embedding ball is characterized by the center $a$ and the radius $R \ge 0$.

$$
\begin{aligned}
\min_{a, R} \quad & R^2 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & \| x_i - a \|^2 \le R^2 + \xi_i \\
& 0 \le \xi_i, \quad i = 1, \cdots, n
\end{aligned}
\tag{1}
$$

where the $\xi_i$ are the slack variables, which allow for the possibility of outliers in the training data set: the distance from $x_i$ to the center $a$ need not be strictly smaller than $R$, but larger distances are penalized. Its dual form is

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i \langle x_i, x_i \rangle - \sum_{i,j=1}^{n} \alpha_i \alpha_j \langle x_i, x_j \rangle \\
\text{s.t.} \quad & e^T \alpha = 1, \quad 0 \le \alpha_i \le C, \quad i = 1, \cdots, n
\end{aligned}
\tag{2}
$$

where $\alpha_i$, for $i = 1, \cdots, n$, is the Lagrange multiplier, and $e$ is the $n$-dimensional all-ones column vector. Note that in equation (2) the object $x_i$ only appears in the form of inner products with other objects $x_j$. When a more flexible data description is required instead of the rigid hyper-sphere, another choice for the inner product can be considered. Replacing the inner product $\langle x_i, x_j \rangle$ by a kernel function $G_{ij} = \kappa(x_i, x_j)$, we obtain the kernel-based SVDD as follows

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i G_{ii} - \sum_{i,j=1}^{n} \alpha_i \alpha_j G_{ij} \\
\text{s.t.} \quad & e^T \alpha = 1, \quad 0 \le \alpha_i \le C, \quad i = 1, \cdots, n
\end{aligned}
\tag{3}
$$

Then, the minimum embedding ball determined by the problem (1) is $B(a, R)$, where

$$
a = \sum_{i=1}^{n} \alpha_i x_i, \qquad R^2 = G_{kk} - 2 \alpha^T G_k + \alpha^T K \alpha
\tag{4}
$$
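The dual (3) and the recovery formula (4) can be tried on a toy data set. The sketch below is not the solver used in the paper: it assumes $C \ge 1$, so that the upper bound on $\alpha_i$ is never active, uses a linear kernel, reads $K$ in (4) as the same Gram matrix $G$, and maximizes (3) with a plain Frank-Wolfe iteration over the simplex. The function names and the data are illustrative.

```python
def svdd_frank_wolfe(G, iters=5000):
    """Approximately maximize (3) over the simplex, assuming C >= 1."""
    n = len(G)
    alpha = [0.0] * n
    alpha[0] = 1.0                          # start at a vertex of the simplex
    for t in range(iters):
        G_alpha = [sum(G[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        grad = [G[i][i] - 2.0 * G_alpha[i] for i in range(n)]
        s = max(range(n), key=lambda i: grad[i])   # best vertex for the linear model
        step = 2.0 / (t + 2.0)
        alpha = [(1.0 - step) * a for a in alpha]
        alpha[s] += step
    return alpha


def radius_sq(G, alpha, k):
    """Equation (4): R^2 = G_kk - 2 alpha^T G_k + alpha^T G alpha."""
    n = len(G)
    g_k = sum(G[k][j] * alpha[j] for j in range(n))
    quad = sum(alpha[i] * G[i][j] * alpha[j] for i in range(n) for j in range(n))
    return G[k][k] - 2.0 * g_k + quad


# Four corners of a square (assumed data); the enclosing ball is centered at (1, 1).
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
G = [[p[0] * q[0] + p[1] * q[1] for q in pts] for p in pts]
alpha = svdd_frank_wolfe(G)
center = [sum(a * p[d] for a, p in zip(alpha, pts)) for d in range(2)]
k = max(range(len(pts)), key=lambda i: alpha[i])   # a point with alpha_k > 0
print(center, radius_sq(G, alpha, k))
```

With a nonlinear kernel only the Gram matrix changes; the center is then available only implicitly through $\alpha$, which is why (4) is written in terms of kernel evaluations.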

for any $x_k \in SV$

0, and $\Delta V(t) = V(t) - V(t-1) \le 0$, then the function $V(t)$ is a Lyapunov stable function.

The dynamic neural network algorithm proposed in this paper uses proper parameter learning rules which restrict the parameter learning process, so the parameter learning algorithm can adaptively modify the parameters in stable steps. The main steps of the parameter adjusting algorithm are as follows:

(1) Define the error function:

$$
E(t) = \frac{1}{2} \left( y_d(t) - y(t) \right)^2
\tag{6}
$$

where $y_d(t)$ is the expected output at step $t$, $y(t)$ is the actual output at step $t$, and $y(t) = O(t)$. The goal of learning is to reach $E(t) = 0$.

(2) Choose a suitable Lyapunov function:

$$
V(t) = \phi(E(t)) = \frac{1}{2} [E(t)]^2
\tag{7}
$$

where $V(t) = 0$ if $E(t) = 0$, and $V(t) > 0$ if $E(t) \ne 0$.

(3) Change the parameters of the network so that:

$$
\Delta V(t) = V(t) - V(t-1) \le 0
\tag{8}
$$

Formula (8) ensures that the algorithm is stable.

(4) Set $t = t + 1$. If the error is small enough, stop; otherwise, return to step (2).

The parameters are changed according to the following method:

wij2 (k ) =

wij1 (t ) =

φ (t − 1)e(t − 1) + yd (t )

, n , j = 1,2,

,p

(9)

φ (t − 1)e(t − 1) + yd (t ) 1 ) , l = 1,2, × Gi ( p 2 nx j (t ) n ∑ w (t )

,p

(10)

nOut HL j (t )

l =1

,i

= 1,2,

jl

where $G(x) = F^{-1}(x)$, $Out_j^{HL}(t) \ne 0$, $\sum_{l=1}^{p} w_{jl}^2(t) \ne 0$, and $\phi(x)$ is the function $\phi(x) = \rho x$; the value of $\rho$ can be changed according to the requirement.
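The stability condition in steps (1) to (3) is simple to state in code. The following minimal sketch only evaluates (6), (7) and (8) on a recorded sequence of desired and actual outputs; it does not implement the weight updates (9) and (10), and the function names are illustrative.

```python
def error_energy(y_desired, y_actual):
    """Equation (6): E(t) = 0.5 * (y_d(t) - y(t))^2."""
    return 0.5 * (y_desired - y_actual) ** 2


def lyapunov(E):
    """Equation (7): V(t) = 0.5 * E(t)^2."""
    return 0.5 * E ** 2


def is_stable(desired, actual):
    """Check condition (8), V(t) - V(t-1) <= 0, along a whole trajectory."""
    V = [lyapunov(error_energy(d, y)) for d, y in zip(desired, actual)]
    return all(V[t] - V[t - 1] <= 0.0 for t in range(1, len(V)))


print(is_stable([1.0, 1.0, 1.0], [0.0, 0.5, 0.9]))   # error shrinks at every step
print(is_stable([1.0, 1.0], [0.9, 0.0]))             # error grows at the second step
```

Because $V$ is a monotone function of the squared error, requiring (8) at every step is the same as requiring that the squared tracking error never increases.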

3 Structure Automatic Change Algorithm

3.1 Improved SVM

We use SVM to classify the input. SVM is based on statistical learning theory. SVM first maps the input points into a high-dimensional feature space and finds a separating hyperplane that maximizes the margin between two classes in this space. Suppose we are given a labeled training set $S = ((x_1, s_1), \ldots, (x_N, s_N))$, where $s_i \in \{+1, -1\}$ and $x_i \in \Re^n$. Considering that the training data are not linearly separable, the goal of SVM is to find an optimal hyperplane such that:

$$
s_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N
\tag{11}
$$

where $w \in R^n$, $b \in R$, and $\xi_i \ge 0$ is a slack variable. For $\xi_i > 1$, the data are misclassified. Finding an optimal hyperplane amounts to solving the following constrained optimization problem:

$$
\min_{w, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i ; \quad \text{subject to} \quad s_i (w^T x_i + b) \ge 1 - \xi_i
\tag{12}
$$

where $C$ is a user-defined positive cost parameter and $\sum \xi_i$ is an upper bound on the number of training errors. By solving (12), we can get the final hyperplane decision function:

$$
g(x) = \mathrm{sign}(w^T x + b) = \mathrm{sign}\Big( \sum_{i \in SV} s_i \alpha_i \langle x, x_i \rangle + b \Big)
\tag{13}
$$
where $\alpha_i$ is a Lagrange multiplier and the training samples for which $\alpha_i \ne 0$ are support vectors (SVs).

Construction of the structure automatic change neural network (SACNN) consists of structure and parameter learning parts. The parameter learning task is similar to the training of a static neural network, but the learning of hidden layer nodes and of the neural network weights via SVM and hybrid learning methods is dynamic. The disadvantage of learning by SVM is that the number of nodes is equal to the number of SVs, which is usually very large. In the hybrid learning approach, the number of nodes is determined by SVM, and the weights are learned by error back-propagation. The results show that learning of weights by back-propagation instead of SVM degrades the computing time. In contrast to the conventional learning approaches, the novelty of SACNN learning introduced below is the combination of clustering with linear-kernel SVM. In SACNN, the number of nodes in the hidden layer is equal to the number of clusters instead of SVs, which keeps the number of nodes small. Since the number of nodes is decided by clustering instead of automatically by the SVs in SVM, the consequent part learning by SVM described above cannot be applied directly. In this paper, a new consequent learning approach based on linear-kernel SVM is used. Here, the outputs of the input layer in SACNN are considered to be input features to a linear-kernel SVM. After learning, the decision function is a linear combination of features (nodes), and the weighting coefficients are used in computing the consequent parameters of each node. Detailed learning algorithms are described as follows.

From (12) and (13), the conventional SVM requires all the data to be available at once and classifies the features in batches. Moreover, the conventional SVM only considers the sign of $g$; it does not consider the real value of $g$. In SACNN, we add new nodes or cut redundant nodes in the hidden layer based on the state of $r$, so we need to compute the real value of $g$ for classifying. For this reason, the final hyperplane decision function in formula (13) does not need the sign, but the real value. So this formula can be computed as:

$$
g(x) = w^T x + b
\tag{14}
$$

The formula (12) can be written as:

$$
s_i \left( g(x) \right) \ge 1 - \xi_i
\tag{15}
$$

Because $s_i \in \{+1, -1\}$ and $\xi_i$ is a slack variable, the function $1 - \xi_i$ can be replaced by the state $r_i$. So (15) can be changed to:

$$
\begin{cases}
g(x) \ge r_i, & s_i = 1 \\
g(x) \le r_i, & s_i = -1
\end{cases}
\tag{16}
$$

This formula can classify the output value of the input layer; in the next section we will use this improved SVM for structure automatic change.


3.2 Structure Automatic Change Algorithm

There are four parameters characterizing the structure automatic change to be considered:
1) The number of nodes in the hidden layer $N$;
2) The location of the center $x_i$;
3) The radius of each node $r_i$;
4) The weight value $w$ of $g$.

In this section, we present a novel structure automatic change algorithm that is capable of determining the four parameters. The parameters are connected with each other; we describe them together as follows. We only consider the hidden layer structure, and the whole neural network is dynamic only through the hidden layer. Initially, the number of nodes in the hidden layer is 3; the structure of the nodes is the same as in Figure 2; the state of the nodes in the hidden layer is $\{r_1, r_2, r_3\}$; and $w$ is the weight value of $g$. Formula (16) can be simplified: the classifying function is used to classify the input values and judge whether they belong to the current centers. So it can be simplified as:

$$
g(x) - r_i \ge 0
\tag{17}
$$

If this formula holds, the input value is outside the center $x_i$; otherwise, the input value belongs to the center. Based on formulas (14) and (17), the judgment condition can be written as:

$$
w^T x + b - r_i \ge 0
\tag{18}
$$

The rules for adding or pruning nodes in the hidden layer are described as follows:
1) If formula (18) holds and the computing error $e \ge e_d$, where $e_d$ is the expected error, a new node should be added to the hidden layer.
2) If formula (18) does not hold, there is a redundant node around node $i$, and the computing error $e \le e_d$, the present node in the hidden layer should be pruned.
Otherwise, the neural network should only change the weight values to satisfy the required conditions of the researched subjects.

3.3 The Whole Structure Automatic Changed Neural Network (SACNN)

According to the aforementioned, the whole structure automatic changed neural network algorithm is summarized as follows.

1. Set the initial values: $w$, $w^1$, $w^2$; the number of nodes in the hidden layer $n = 3$; the locations of the centers $\{x_1, x_2, x_3\}$; the state of the nodes in the hidden layer $\{r_1, r_2, r_3\}$; the expected error $e_d$; and we assume the value $b = 0$.
2. Judge whether to add a new node to the hidden layer or prune the present node in the hidden layer:


Use formula (18) to compute the judgment value, with the weight value $w$ given as:

$$
w = \arg\min \left( x(t) - x_i \right)
\tag{19}
$$

where $x(t)$ is the input value at time $t$ and $x_i$ is the location of the center. The formula (18) can be written as:

$$
\langle x(t) - x_i, \; x_i \rangle - r_i \ge 0
\tag{20}
$$

2.1 If formula (20) holds and $e(t) \ge e_d$, add a new node to the hidden layer, $n = n + 1$. The weight value $w$ of $g$ should be given as:

$$
w_j(t+1) = w_j(t), \quad j = 1, 2, \ldots, n-1; \qquad w_n(t+1) = x(t) - x_i
\tag{21}
$$

The state of the nodes in the hidden layer should be:

⎧r j ⎪⎪ 1 r j = ⎨(1 + 2 )ri n ⎪ ⎪⎩ r

j ≠ i, n j =i

(22)

j=n

The weights $w^1$, $w^2$ will be changed; initialize the $n$-th weight values as:

$$
w_{n\bullet}^{1} = \frac{w_{i\bullet}^{1} + w_{j\bullet}^{1}}{2}
\tag{23}
$$

$$
w_{n\bullet}^{2} = \frac{w_{i\bullet}^{2} + w_{j\bullet}^{2}}{2}
\tag{24}
$$

where $j$ is the second nearest local center of the new node. Then go to step 3.

2.2 If formula (20) does not hold, find the nearest node around node $i$, denoted node $i-i$. If the distance between these two nodes is less than the state $r_i$ or $r_{i-i}$, and $e(t) \le e_d$, prune the present node $i$ in the hidden layer; the number of nodes in the hidden layer becomes $n = n - 1$. The weight value $w$ of $g$ should be given as:

$$
w_j(t+1) =
\begin{cases}
(1 - q)\, w_j(t) + q\, w_{j-j}(t), & j = i \\
w_j(t), & j \ne i
\end{cases}
\tag{25}
$$

where $q$ equals $\frac{r_{j-j}}{r_j}$. The state of the nodes in the hidden layer should be:

Fig. 3.1. The growing process step of SACNN

Fig. 3.2. The pruning process step of SACNN

Fig. 3. The structure changing steps of SACNN in the hidden layer

⎧r j ⎪⎪ 1 r j = ⎨(1 + 2 )ri n ⎪ ⎪⎩ 0 The

j ≠ i, i − i j =i

(26)

j = i −i

weights $w^1$, $w^2$ will be changed; initialize the $n$-th weight values as:

$$
w_{n\bullet}^{1} = \frac{(1 - q)\, w_{i\bullet}^{1} + q\, w_{j\bullet}^{1}}{2}
\tag{27}
$$

$$
w_{n\bullet}^{2} = \frac{(1 - q)\, w_{i\bullet}^{2} + q\, w_{j\bullet}^{2}}{2}
\tag{28}
$$

where $j$ is the second nearest local center of the new node. Then go to step 3.

3. Adjust the parameters according to formulas (9), (23), (24), and compute the output value of the hidden layer.
4. Adjust the parameters according to formulas (10), (27), (28), and compute the input value of the output layer.
5. Compute the output value of the neural network and the current error $e(t)$.
6. Go to step 2; stop when the computing time is reached.
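The grow, prune or adjust decision made in step 2 can be summarized as a small function. The sketch below is a simplified reading of rules 1) and 2) of Section 3.2 together with condition (18); the function name, the string labels and the boolean flag `redundant_neighbor` (standing in for the test on the distance to the nearest node) are assumptions made for illustration, and none of the weight or state updates are included.

```python
def structure_action(w, x, r_i, e, e_d, b=0.0, redundant_neighbor=False):
    """Return 'grow', 'prune' or 'adjust' for node i, following rules 1) and 2)."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + b    # g(x) = w^T x + b, formula (14)
    outside = (g - r_i) >= 0.0                      # condition (18)
    if outside and e >= e_d:
        return "grow"                               # rule 1): add a node
    if (not outside) and redundant_neighbor and e <= e_d:
        return "prune"                              # rule 2): remove the node
    return "adjust"                                 # only the weights change


print(structure_action([1.0, 0.5], [2.0, 2.0], r_i=1.0, e=0.3, e_d=0.1))
print(structure_action([1.0, 0.5], [0.2, 0.2], r_i=1.0, e=0.05, e_d=0.1,
                       redundant_neighbor=True))
print(structure_action([1.0, 0.5], [0.2, 0.2], r_i=1.0, e=0.3, e_d=0.1))
```

Keeping the decision separate from the updates makes it clear that the structure changes only when both the geometric condition and the error condition point the same way.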


The details of the structure changing process are shown in Fig. 3. In contrast with GCS, SACNN has several merits. Firstly, the algorithm can remove redundant nodes in the hidden layer, which simplifies the neural network structure during the process, so computing time and memory space are saved. With this structure selection method, the final neural network is appropriate for the current research objects. Secondly, the parameter adjusting algorithm does not need to compute the mean squared error (MSE); it only needs the error function at time $k$, based on the Lyapunov error energy function, so it also saves computing time and memory space. In SACNN, we use the improved SVM algorithm to classify the input values and then change the nodes in the hidden layer. Because the SVM algorithm is simplified, the computing time of the classifying step is also reduced.

4 Simulation and Discussion We present two examples in this section, we use SACNN to track the functions, to show the tracking results of the proposed neural network and compare with the conditional GCS algorithm. Consider two common functions such as:

$$ y = 2 \times (x_1 - 2x_1^2) \times e^{-x_2/2} \tag{29} $$

$$ y = \sin(x_1) + 5x_2 \tag{30} $$
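As a rough sketch, the two test functions and the data sets used in the experiments can be written as follows. The reading of the exponent in (29) as $e^{-x_2/2}$ is taken from the extracted text, and the 25 × 25 grid used here for the 625 test points is an assumption; the paper only gives the counts (200 random training points and 625 test points). The names `f1`, `f2`, `train_points` and `test_points` are hypothetical.

```python
import math
import random


def f1(x1, x2):
    # Function (29): y = 2 * (x1 - 2 * x1^2) * exp(-x2 / 2)
    return 2.0 * (x1 - 2.0 * x1 ** 2) * math.exp(-x2 / 2.0)


def f2(x1, x2):
    # Function (30): y = sin(x1) + 5 * x2
    return math.sin(x1) + 5.0 * x2


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 200 randomly selected training inputs with x1, x2 in [0, 1]
train_points = [(rng.random(), rng.random()) for _ in range(200)]

# 625 test inputs; a regular 25 x 25 grid is assumed here
test_points = [(i / 24.0, j / 24.0) for i in range(25) for j in range(25)]

train_targets = [f1(x1, x2) for x1, x2 in train_points]
test_targets = [f1(x1, x2) for x1, x2 in test_points]
```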

We use the structure automatic changed neural network (SACNN) algorithm to track these functions. We take the real output of the system at time k as y(k), so the error at time k is e(k), and ec(k) = e(k) − e(k − 1). The inputs of the proposed neural network are P(k) = (e(k), ec(k), y(k)), with x1, x2 ∈ [0, 1]. The stable errors chosen for tracking these functions are 0.1 and 0.01. The simulation results are as follows. Figure 4 shows three-dimensional perspective plots of the functions. 200 randomly selected data points are used for training the neural network and 625 points for the generalization test. We measure the neural network error in terms of the average of the square defined by the Lyapunov error energy function, which is used to modify the neural network

Fig. 4. Perspective plots of the test functions (axes: x1 value, x2 value, y value). (a) $y = 2 \times (x_1 - 2x_1^2) \times e^{-x_2/2}$. (b) $y = \sin(x_1) + 5x_2$

Fig. 5. The tracking results of the test functions, e = 0.01 (axes: x1 value, x2 value, the value of tracking). (a) $y = 2 \times (x_1 - 2x_1^2) \times e^{-x_2/2}$. (b) $y = \sin(x_1) + 5x_2$

Table 1. The performance comparison of different algorithms and different stable errors

| Function | Algorithm | CPU Time (s) | Test Error | Stable Error | Max Nodes |
| --- | --- | --- | --- | --- | --- |
| $y = 2 \times (x_1 - 2x_1^2) \times e^{-x_2/2}$ | SACNN | 8.25 | 0.1 | 0.1 | 11 |
| $y = 2 \times (x_1 - 2x_1^2) \times e^{-x_2/2}$ | SACNN | 35.25 | 0.016 | 0.01 | 12 |
| $y = 2 \times (x_1 - 2x_1^2) \times e^{-x_2/2}$ | GCS | 82.67 | 0.021 | 0.01 | 15 |
| $y = \sin(x_1) + 5x_2$ | SACNN | 22.87 | 0.101 | 0.1 | 15 |
| $y = \sin(x_1) + 5x_2$ | SACNN | 142.54 | 0.017 | 0.01 | 26 |
| $y = \sin(x_1) + 5x_2$ | GCS | 186.17 | 0.022 | 0.01 | 21 |

parameters. The training goal is to achieve the stable error. Figure 5 shows the tracking results when the stable error e is 0.01. The results show that SACNN can track the functions within the given stable error. We then compare our method with the GCS algorithm. Table 1 shows the performance of SACNN and GCS under the same computing conditions, and the comparison of different stable errors for SACNN. According to Table 1, the smaller the stable error, the longer the CPU running time and the larger the number of neurons in the hidden layer of SACNN and GCS. From the comparison between SACNN and GCS it is clear that, at the same stable error, the neural network structure of SACNN is simpler and the CPU running time needed to reach the stable state is shorter.
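The comparison at stable error 0.01 can be made concrete with a little arithmetic on the Table 1 entries. The sketch below only restates the reported numbers; the labels `f1` and `f2` for functions (29) and (30) and the variable names are hypothetical.

```python
# (function, algorithm, CPU time in seconds, max hidden nodes) at stable error 0.01,
# copied from Table 1
rows = [
    ("f1", "SACNN", 35.25, 12),
    ("f1", "GCS", 82.67, 15),
    ("f2", "SACNN", 142.54, 26),
    ("f2", "GCS", 186.17, 21),
]

table = {(func, alg): (cpu, nodes) for func, alg, cpu, nodes in rows}

# Ratio of GCS CPU time to SACNN CPU time for each function
speedup = {
    func: table[(func, "GCS")][0] / table[(func, "SACNN")][0]
    for func in ("f1", "f2")
}

# Difference in maximum node count (GCS minus SACNN) for each function
node_gap = {
    func: table[(func, "GCS")][1] - table[(func, "SACNN")][1]
    for func in ("f1", "f2")
}

for func in ("f1", "f2"):
    print(func, round(speedup[func], 2), node_gap[func])
```

On the reported figures, GCS takes roughly 2.35 times as long as SACNN on function (29) and roughly 1.31 times as long on function (30). The node counts are 12 versus 15 for (29) and 26 versus 21 for (30).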

5 Conclusions

In practical applications of neural network algorithms, if the number of hidden layer nodes is too small, parameter convergence is slow; if it is too large, the computational load is heavy. In this paper, a structure automatic changed neural network (SACNN) algorithm is proposed, by which the test functions can be tracked within the given stable error. Simulation results have demonstrated the efficiency of the proposed algorithm.


1) The structure automatic changed neural network (SACNN) can change its structure on-line; the structure automatic change algorithm makes the neural network suitable for the research object.
2) The SACNN algorithm saves memory space and computing time when compared with GCS.
3) The SACNN algorithm can be used both as an on-line and as an off-line algorithm.


Particle Swarm Optimization for Two-Stage FLA Problem with Fuzzy Random Demands

Yankui Liu, Siyuan Shen, and Rui Qin

College of Mathematics & Computer Science, Hebei University, 071002 Baoding, Hebei, China
[email protected], [email protected], [email protected]

Abstract. A new class of two-stage facility location-allocation (FLA) problems is studied, in which the demands are characterized by fuzzy random variables with known possibility and probability distributions. To solve the two-stage FLA problem, an approximation method is developed to turn the original infinite-dimensional FLA problem into a finite-dimensional one. Since the approximating FLA problem is neither linear nor convex, conventional optimization algorithms cannot be used to solve it. To overcome this difficulty, this paper designs a hybrid algorithm, which integrates the approximation method, neural network (NN) and particle swarm optimization (PSO) algorithm, to solve the approximating two-stage FLA problem. One numerical example with five facilities and ten customers is presented to demonstrate the effectiveness of the designed algorithm.

Keywords: location-allocation problem, fuzzy random programming, approximation method, neural network, particle swarm optimization.

1 Introduction

Location-allocation problems occur in many practical settings [1] where facilities provide a homogeneous service, and they have attracted the interest of many researchers, such as Lee, Green and Kim [6] and Logendran and Terrell [19]. Numerous methodologies have been proposed in the literature to solve location-allocation problems. For example, Lozano et al. [21] discussed the application of Kohonen maps to a class of location-allocation problems. Love [20] considered a one-dimensional FLA problem using dynamic programming. Liu, Kao and Wang [8] solved location-allocation problems with rectilinear distances by simulated annealing. Gong et al. [3] designed a hybrid evolutionary method for the obstacle location-allocation problem.

The purpose of this paper is to employ fuzzy random theory [2,7,11,12,13,14,15,16,17,18,23] and two-stage fuzzy random programming [9,10,24] to model a capacitated FLA problem from a new point of view, where the demands are assumed to be characterized by a fuzzy random vector and the expected value criterion is adopted in the objective. When making decisions on the FLA problem, we assume that the decisions are made in two stages such that the expected transportation cost from facilities to customers is minimized. The first-stage decision is made before the values of the fuzzy random parameters are observed, and the second-stage decisions, or recourse actions, are taken after the realization of the fuzzy random parameters is observed. The optimal value of the second stage depends on the realization of the fuzzy random parameters and on the first-stage decisions.

Since the original two-stage FLA problem includes fuzzy random parameters defined through possibility and probability distributions with infinite supports, it is inherently an infinite-dimensional optimization problem that can rarely be solved directly. Therefore, algorithmic procedures for solving the original FLA problem must rely on intelligent computing as well as approximation schemes, which result in an approximating two-stage FLA problem. Because the approximating FLA problem is neither linear nor convex, we design a hybrid PSO algorithm to solve it. The theoretical foundation of the hybrid algorithm is ensured by a convergence theorem about the approximating objective functions.

The rest of the paper is organized as follows. Section 2 presents a new class of two-stage FLA problems with fuzzy random demands. Section 3 applies the approximation method to the expectation objective of the FLA problem and deals with the convergence of the approximating objectives. The convergence result allows us to design a hybrid algorithm in Section 4, where the approximation method, neural network (NN) and particle swarm optimization (PSO) are combined to solve the approximating FLA problem. Section 5 presents one numerical example with five facilities and ten customers to demonstrate the effectiveness of the designed algorithm. Finally, Section 6 gives the conclusions.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 776–785, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Two-Stage FLA Problem Formulation

In this paper, we adopt the following notations for the two-stage fuzzy random FLA problem.

• $s_i$ is the capacity of the ith facility, $i = 1, 2, \cdots, n_1$;
• $(x_i, y_i)$ is the unknown location of the ith facility, $i = 1, 2, \cdots, n_1$;
• $(a_j, b_j)$ is the known location of the jth customer, $j = 1, 2, \cdots, n_2$;
• $d_j$ is the fuzzy random demand of the jth customer, $j = 1, 2, \cdots, n_2$;
• $z_{ij}(\omega, \gamma)$ is the quantity transported from i to j in state $(\omega, \gamma)$ for $i = 1, 2, \cdots, n_1$ and $j = 1, 2, \cdots, n_2$.

In addition, we assume that the path between any customer and any facility in the two-stage FLA problem is connected, and that the transportation cost is proportional to the quantity transported and the travel distance. The facilities $i = 1, 2, \cdots, n_1$ are supposed to be located within a certain region $D_1 = \{(x, y) \mid g_i(x_i, y_i) \le 0,\ i = 1, 2, \cdots, n_1\}$, where the constraints $g_i(x_i, y_i) \le 0$ for $i = 1, 2, \cdots, n_1$ represent the potential region of locations of new facilities, $x = (x_1, x_2, \cdots, x_{n_1})^T$, $y = (y_1, y_2, \cdots, y_{n_1})^T$, and the fuzzy random demand $\xi = (d_1, d_2, \cdots, d_{n_2})^T$ is defined on a probability space.


In the proposed FLA problem, the decision variables are divided into two groups. The first-stage location variable $(x, y)$, which represents the locations of new facilities, must be chosen before a fuzzy random event $(\omega, \gamma)$ occurs. Here the outcome of a fuzzy random event refers to the realizations of the fuzzy random demands $d_j$ for $j = 1, 2, \cdots, n_2$. In the second stage, the customers' demands $d_j$ for $j = 1, 2, \cdots, n_2$ become known. As a consequence, the second-stage distribution variables $z_{ij}$ should then be chosen. According to this scheme, we present a two-stage FLA problem, in which there are two optimization problems to be solved. The second-stage problem is formulated by assuming $(x, y)$ and $(\omega, \gamma)$ to be fixed, and is built as follows

$$ \begin{cases} \min\ \sum\limits_{i=1}^{n_1} \sum\limits_{j=1}^{n_2} z_{ij}(\omega, \gamma) \sqrt{(x_i - a_j)^2 + (y_i - b_j)^2} \\ \text{subject to}\ \sum\limits_{j=1}^{n_2} z_{ij}(\omega, \gamma) \le s_i, \quad i = 1, \ldots, n_1 \\ \phantom{\text{subject to}\ } \sum\limits_{i=1}^{n_1} z_{ij}(\omega, \gamma) = d_{j,\omega}(\gamma), \quad j = 1, \ldots, n_2 \\ \phantom{\text{subject to}\ } z_{ij}(\omega, \gamma) \ge 0, \quad i = 1, \ldots, n_1;\ j = 1, \ldots, n_2. \end{cases} \tag{1} $$
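To make the objective and constraints of problem (1) concrete, the following sketch evaluates the transportation cost of a given allocation and checks whether it satisfies the capacity, demand and nonnegativity constraints for one fixed realization of the demands. It does not solve the linear program; the function names and the small two-facility instance are hypothetical.

```python
import math


def transport_cost(z, facilities, customers):
    """Objective of (1): sum of z[i][j] times the Euclidean distance i -> j."""
    total = 0.0
    for i, (xi, yi) in enumerate(facilities):
        for j, (aj, bj) in enumerate(customers):
            total += z[i][j] * math.hypot(xi - aj, yi - bj)
    return total


def is_feasible(z, capacities, demands, tol=1e-9):
    """Check the three constraint groups of (1) for realized demands."""
    n1, n2 = len(capacities), len(demands)
    # z_ij >= 0
    if any(z[i][j] < -tol for i in range(n1) for j in range(n2)):
        return False
    # sum_j z_ij <= s_i for every facility i
    if any(sum(z[i]) > capacities[i] + tol for i in range(n1)):
        return False
    # sum_i z_ij = d_j for every customer j
    return all(
        abs(sum(z[i][j] for i in range(n1)) - demands[j]) <= tol
        for j in range(n2)
    )


# Hypothetical instance: two facilities, two customers.
facilities = [(0.0, 0.0), (3.0, 4.0)]
customers = [(3.0, 0.0), (0.0, 4.0)]
capacities = [5.0, 5.0]
demands = [2.0, 3.0]

z_near = [[2.0, 0.0], [0.0, 3.0]]   # each customer served by its nearer facility
z_single = [[2.0, 3.0], [0.0, 0.0]]  # everything shipped from facility 0

cost_near = transport_cost(z_near, facilities, customers)
cost_single = transport_cost(z_single, facilities, customers)
```

Both allocations are feasible here, and the first one is cheaper; a linear programming solver would search over all feasible allocations for the minimum.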

Here the dependence of the second-stage distribution decision $z_{ij}(\omega, \gamma)$ on $(\omega, \gamma)$ is of a completely different nature from the dependence of $d_j$ on $(\omega, \gamma)$. It is not functional but simply implies that the distribution variables $z_{ij}$ are typically not the same under different realizations of $(\omega, \gamma)$. They are chosen so that the constraints in problem (1) hold almost surely with respect to $(\omega, \gamma)$. If we use $Q(x, y, \xi_\omega(\gamma))$ to represent the optimal value of problem (1) at fixed $(x, y)$ and $(\omega, \gamma)$, then the recourse function $Q_E(x, y) = E_\xi[Q(x, y, \xi)]$ is the expected value of the transportation cost incurred in the second stage. Based on the notations above, the first stage of the FLA problem reads

$$ \min_{x, y} \{ Q_E(x, y) \mid (x, y) \in D \}, \tag{2} $$

where D = {(x, y) ∈ D1 | Ch{Q(x, y, ξ) < ∞} = 1} is the feasible region of the first-stage programming (2), and Ch is the mean chance defined in [13]. Combining the problems (1) and (2) yields the two-stage FLA problem with fuzzy random demands. If the fuzzy random demand ξ has an infinite support, then the FLA problem (2) is an infinite-dimensional optimization problem. In this case, we cannot solve the FLA problem via the conventional optimization algorithms. To overcome this difficulty, in the next section, we apply the approximation method [9] to turn the original FLA problem into a finite-dimensional one.

3 Approximating Two-Stage FLA Problem

During the solution of the two-stage FLA problem, we are required to compute the recourse function QE : (x, y) → Eξ [Q(x, y, ξ)] repeatedly, where ξ is the fuzzy random demand.

Assume that the demand $\xi = (d_1, d_2, \cdots, d_{n_2})^T$ has an infinite support $\Xi = \prod_{i=1}^{n_2} [a_i, b_i] \subset \Re^{n_2}$. Using the approximation method [9], we can obtain a sequence $\{\zeta_m\}$ of finitely supported primitive fuzzy random vectors, where $\zeta_m = (d_{m,1}, d_{m,2}, \cdots, d_{m,n_2})^T$ for $m = 1, 2, \cdots$. Suppose the randomness of $\zeta_m$ is characterized by a discrete random variable $\omega$ assuming a finite number of values $\omega_i$ with probability $p_i$ for $i = 1, 2, \cdots, N$; and for each i, the fuzzy vector $\zeta_{m,\omega_i} = (d_{m,1,\omega_i}, d_{m,2,\omega_i}, \cdots, d_{m,n_2,\omega_i})^T$ takes on the value

$$ \hat\zeta_m^{ij} = (\hat d_{m,1}^{ij}, \hat d_{m,2}^{ij}, \cdots, \hat d_{m,n_2}^{ij})^T $$

with $\mu_{ij} > 0$ for $j = 1, 2, \cdots, N_i$, and $\max_{1 \le j \le N_i} \mu_{ij} = 1$. Thus the support of $\zeta_m$, $\Xi_m = \{\hat\zeta_m^{ij} \mid i = 1, 2, \cdots, N;\ j = 1, 2, \cdots, N_i\}$, is a finite subset of $\Re^{n_2}$. For each fixed location variable $(x, y)$ and $\hat\zeta_m^{ij} \in \Xi_m$, we can obtain the second-stage value $Q(x, y, \hat\zeta_m^{ij})$ by solving the linear programming (1) via the simplex algorithm. Without loss of generality, we assume that for each i and fixed $(x, y)$, the second-stage value function satisfies the condition $Q(x, y, \hat\zeta_m^{i1}) \le Q(x, y, \hat\zeta_m^{i2}) \le \cdots \le Q(x, y, \hat\zeta_m^{iN_i})$. Then the value of the recourse function $Q_E(x, y)$ at $(x, y)$ can be computed by

$$ Q_{m,E}(x, y) = \sum_{i=1}^{N} p_i \sum_{j=1}^{N_i} p_{ij}\, Q(x, y, \hat\zeta_m^{ij}), \tag{3} $$

where for each pair $(i, j)$,

$$ Q(x, y, \hat\zeta_m^{ij}) = \min\ (q^{ij})^T y^{ij} \quad \text{subject to}\quad W y^{ij} = h^{ij} - T^{ij} x,\quad y^{ij} \ge 0, \tag{4} $$

and the weights $p_{ij}$ are given by

$$ p_{ij} = \frac{1}{2} \left( \max_{k=1}^{j} \mu_{ik} - \max_{k=0}^{j-1} \mu_{ik} \right) + \frac{1}{2} \left( \max_{k=j}^{N_i} \mu_{ik} - \max_{k=j+1}^{N_i+1} \mu_{ik} \right). \tag{5} $$

It is easy to check that the weights satisfy the constraints $p_{ij} \ge 0$ and $\sum_{j=1}^{N_i} p_{ij} = \max_{j=1}^{N_i} \mu_{ij} = 1$, $i = 1, 2, \cdots, N$. Given the first-stage location variable $(x, y)$, the procedure to compute the recourse function $Q_E(x, y)$ is summarized as

Algorithm 1. (Approximation method)
Step 1. Generate sample points $\hat\zeta_m^{ij} = (\hat d_{m,1}^{ij}, \hat d_{m,2}^{ij}, \cdots, \hat d_{m,n_2}^{ij})^T$ uniformly from the support $\Xi$ of $\xi$ for $i = 1, 2, \cdots, N$ and $j = 1, 2, \cdots, N_i$.
Step 2. Solve the second-stage linear programming (4) and denote the optimal value as $Q(x, y, \hat\zeta_m^{ij})$ for $i = 1, 2, \cdots, N$ and $j = 1, 2, \cdots, N_i$.
Step 3. Calculate the weight $p_{ij}$ according to formula (5) for $i = 1, 2, \cdots, N$ and $j = 1, 2, \cdots, N_i$.
Step 4. Return the value of $Q_E(x, y)$ via the estimation formula (3).
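Steps 3 and 4 of Algorithm 1 can be sketched as follows, assuming the second-stage values from Step 2 are already available. The sketch assumes the usual boundary convention $\mu_{i0} = \mu_{i,N_i+1} = 0$ in formula (5), which the text does not state explicitly. The function names and the toy numbers are hypothetical.

```python
def credibility_weights(mu):
    """Weights p_ij of formula (5) for memberships mu[0..N-1],
    listed in the order of increasing second-stage value."""
    n = len(mu)
    padded = [0.0] + list(mu) + [0.0]  # assumed: mu_0 = mu_{N+1} = 0
    weights = []
    for j in range(1, n + 1):
        left = max(padded[1:j + 1]) - max(padded[0:j])
        right = max(padded[j:n + 1]) - max(padded[j + 1:n + 2])
        weights.append(0.5 * left + 0.5 * right)
    return weights


def recourse_estimate(probs, values, memberships):
    """Formula (3): sum_i p_i * sum_j p_ij * Q(x, y, zeta^ij).

    probs[i] is p_i, values[i][j] the second-stage value, and
    memberships[i][j] the membership degree mu_ij of sample point (i, j).
    """
    total = 0.0
    for p_i, q_i, mu_i in zip(probs, values, memberships):
        # Sort so that the second-stage values are nondecreasing in j.
        order = sorted(range(len(q_i)), key=lambda j: q_i[j])
        q_sorted = [q_i[j] for j in order]
        mu_sorted = [mu_i[j] for j in order]
        w = credibility_weights(mu_sorted)
        total += p_i * sum(wj * qj for wj, qj in zip(w, q_sorted))
    return total


# Toy data: two random outcomes, three fuzzy sample points each.
weights = credibility_weights([0.5, 1.0, 0.5])
estimate = recourse_estimate(
    probs=[0.5, 0.5],
    values=[[1.0, 2.0, 3.0], [6.0, 2.0, 4.0]],
    memberships=[[0.5, 1.0, 0.5], [0.5, 0.5, 1.0]],
)
```

The explicit sort inside `recourse_estimate` plays the role of the "without loss of generality" ordering assumption in the text.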


The convergence of Algorithm 1 is ensured by the following convergence theorem.

Theorem 1. Let $\xi = (d_1, d_2, \cdots, d_{n_2})^T$ be a continuous fuzzy random demand involved in the original two-stage FLA problem (2) with a compact interval support $\Xi \subset \Re^{n_2}$, and $\{\zeta_m\}$ the discretization of $\xi$. If $\sum_{j=1}^{n_2} d_j \le \sum_{i=1}^{n_1} s_i$ almost surely, then for each $(x, y) \in D$, the approximating objective value $Q_{m,E}(x, y)$ converges to that of the original two-stage FLA problem (2), i.e.,

$$ \lim_{m \to \infty} Q_{m,E}(x, y) = Q_E(x, y), $$

where $Q_{m,E}(x, y) = E_{\zeta_m}[Q(x, y, \zeta_m)]$, and $Q_E(x, y) = E_\xi[Q(x, y, \xi)]$.

Proof. Since $\sum_{j=1}^{n_2} d_j \le \sum_{i=1}^{n_1} s_i$ almost surely, the second-stage value function $Q(x, y, \hat\xi)$ is finite almost surely with respect to $\hat\xi \in \Xi$. Noting that $Q(x, y, \hat\xi)$ is convex with respect to $\hat\xi = \hat d$, it is continuous on the support $\Xi$ of the demand d. Finally, it follows from [9, Theorem 2] that the theorem is valid. The proof is complete.

4 A Hybrid PSO Algorithm

If the demands $d_j$ for $j = 1, 2, \cdots, n_2$ are characterized by continuous fuzzy random variables, then the FLA problem (2) is inherently an infinite-dimensional optimization problem that cannot be solved directly by conventional optimization algorithms. To provide a general solution method for the two-stage FLA problem, we design a hybrid algorithm by integrating the PSO algorithm, NN and the approximation method.

The PSO algorithm, originally developed by Kennedy and Eberhart [4], is an optimization method based on the metaphor of the social behavior of flocks of birds and schools of fish. Compared to the genetic algorithm, the PSO algorithm has a much better intelligent background, and its theoretical framework is very simple, so it can be implemented easily. Recently the PSO algorithm has attracted much attention and has been successfully applied in evolutionary computing, unconstrained continuous optimization and many other fields [5]. In our proposed hybrid algorithm, the approximation method is used to compute the recourse function $Q_E(x, y)$, an NN is trained to approximate the recourse function, and the PSO algorithm and the trained NN are integrated to solve the two-stage FLA problem (2).

Training an NN: The computation of $Q_E(x, y)$ via the approximation method is a time-consuming process, since for each first-stage location decision $(x, y)$ and every realization $\zeta_{m,\omega}(\gamma)$ of $\zeta_m$ we are required to solve the second-stage programming problem (1) via the simplex algorithm. To speed up the solution process, we replace the recourse function $Q_E(x, y)$ by an NN, since a trained NN has the ability to approximate integrable functions [22]. In this paper, we employ the fast BP algorithm to train a feedforward NN to approximate the recourse function $Q_E(x, y)$. We only consider an NN with an input layer, one hidden layer and an output layer connected in a feedforward way. Let $\{(x^i, y^i, q_i) \mid i = 1, 2, \cdots, M\}$ be a set of input-output data generated by the approximation method. The training process is to find the best weight vector w so that the error function

$$ Err(w) = \frac{1}{2} \sum_{i=1}^{M} |F(x^i, y^i, w) - q_i|^2 $$

is minimized, where $F(x^i, y^i, w)$ is the output function of the NN, and $q_i$ is the value of $Q_E(x^i, y^i)$ evaluated by the approximation method.

Representation Structure: In the two-stage FLA problem (2), we use a vector $X = (x, y)$ as a particle to represent the location of new facilities, where $x = (x_1, x_2, \cdots, x_{n_1})^T$, $y = (y_1, y_2, \cdots, y_{n_1})^T$, and

$$ (x, y) = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_{n_1} \end{pmatrix} = \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_{n_1} & y_{n_1} \end{pmatrix}. $$

Here $X_i = (x_i, y_i)$ is the location of the ith facility, $i = 1, 2, \cdots, n_1$.
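The error function Err(w) above can be illustrated with a very small feedforward network. This is a sketch under several assumptions that the paper does not fix: tanh hidden units, a linear output unit, and a particle flattened into the input vector $(x_1, \cdots, x_{n_1}, y_1, \cdots, y_{n_1})$. The function names and the toy data are hypothetical, and the fast BP training step itself is not shown.

```python
import math


def nn_output(inputs, weights):
    """Output F of a network with one hidden layer (tanh) and a linear output."""
    hidden_w, hidden_b, out_w, out_b = weights
    hidden = [
        math.tanh(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b


def training_error(data, weights):
    """Err(w) = 1/2 * sum_i |F(x^i, y^i, w) - q_i|^2."""
    return 0.5 * sum((nn_output(inp, weights) - q) ** 2 for inp, q in data)


# Toy data for one facility: input (x_1, y_1), target q_i.
data = [([1.0, 2.0], 3.0), ([0.0, 1.0], 1.0)]

# One hidden unit with zero weights, so the network outputs the bias 2.0.
weights = ([[0.0, 0.0]], [0.0], [1.0], 2.0)

err = training_error(data, weights)
```

A training routine would adjust `weights` to reduce `err`; any gradient-based or BP-style update could be plugged in at that point.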

Initialization: Initialize pop_size particles $X^k$ for $k = 1, \cdots,$ pop_size from the feasible region D.

Operations in PSO Algorithm: Suppose that the searching space is $2n_1$-dimensional and that pop_size particles form the colony. The position and the velocity of the kth particle can be represented as

$$ X^k = \begin{pmatrix} X_{k,1} \\ X_{k,2} \\ \vdots \\ X_{k,n_1} \end{pmatrix} = \begin{pmatrix} x_{k,1} & y_{k,1} \\ x_{k,2} & y_{k,2} \\ \vdots & \vdots \\ x_{k,n_1} & y_{k,n_1} \end{pmatrix}, \qquad V^k = \begin{pmatrix} V_{k,1} \\ V_{k,2} \\ \vdots \\ V_{k,n_1} \end{pmatrix} = \begin{pmatrix} u_{k,1} & v_{k,1} \\ u_{k,2} & v_{k,2} \\ \vdots & \vdots \\ u_{k,n_1} & v_{k,n_1} \end{pmatrix}. $$

Each particle has its own best position (pbest)

$$ P^k = \begin{pmatrix} P_{k,1} \\ P_{k,2} \\ \vdots \\ P_{k,n_1} \end{pmatrix} = \begin{pmatrix} p_{k,1} & q_{k,1} \\ p_{k,2} & q_{k,2} \\ \vdots & \vdots \\ p_{k,n_1} & q_{k,n_1} \end{pmatrix}, $$

which represents the personal smallest objective value so far at time t. The global best particle (gbest) of the colony is denoted by

$$ P^g = \begin{pmatrix} P_{g,1} \\ P_{g,2} \\ \vdots \\ P_{g,n_1} \end{pmatrix} = \begin{pmatrix} p_{g,1} & q_{g,1} \\ p_{g,2} & q_{g,2} \\ \vdots & \vdots \\ p_{g,n_1} & q_{g,n_1} \end{pmatrix}, $$


which is the best particle found so far at time t in the colony. Using the notations above, the new position of the kth particle is updated by

$$ X^k(t + 1) = X^k(t) + V^k(t + 1), \tag{6} $$

while the new velocity of the kth particle is renewed by

$$ V^k(t + 1) = w V^k(t) + c_1 r_1 (P^k - X^k(t)) + c_2 r_2 (P^g - X^k(t)), \tag{7} $$

where $k = 1, 2, \cdots,$ pop_size; w is called the inertia coefficient; $c_1$ and $c_2$ are learning rates, which are nonnegative constants; and $r_1$ and $r_2$ are two independent random numbers generated in the unit interval [0, 1].

Hybrid PSO Algorithm: To solve the proposed two-stage FLA problem, we first employ the approximation method to generate a set of input-output data for the recourse function $Q_E(x, y)$, then we use the training set to train an NN to approximate the recourse function $Q_E(x, y)$. After the NN is well trained, we integrate PSO and the trained NN to produce a hybrid algorithm. During the solution process, we use formula (6) to update the position of the kth particle, employ formula (7) to renew the velocity of the kth particle, and use the trained NN to calculate the objective values for all particles. We repeat the above process until a stopping criterion is satisfied. The procedure of the proposed hybrid PSO algorithm for solving the two-stage FLA problem is summarized as

Algorithm 2. A Hybrid PSO Algorithm
Step 1. Generate a set of input-output data for the recourse function $Q_E(x, y)$ by the approximation method.
Step 2. Train an NN to approximate the recourse function $Q_E(x, y)$ by the generated input-output data.
Step 3. Initialize pop_size particles with random positions and velocities, and evaluate the objective values for all particles by the trained NN.
Step 4. Set pbest of each particle and its objective value equal to its current position and objective value, and set gbest and its objective value equal to the position and objective value of the best initial particle.
Step 5. Update the position and velocity of each particle according to formulas (6) and (7), respectively.
Step 6. Calculate the objective values for all particles by the trained NN.
Step 7. For each particle, compare the current objective value with that of its pbest. If the current objective value is smaller than that of pbest, then renew pbest and its objective value with the current position and objective value.
Step 8. Find the best particle of the current swarm with the smallest objective value. If the objective value is smaller than that of gbest, then renew gbest and its objective value with the position and objective value of the current best particle.
Step 9. Repeat the fifth to eighth steps for a given number of cycles.
Step 10. Return the gbest and its objective value as the optimal solution and the optimal value.
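Steps 3 to 10 of Algorithm 2 can be sketched as below. Several choices are assumptions rather than details taken from the paper: the trained NN is replaced by an arbitrary callable `objective` (a simple quadratic stands in for it in the usage example), the feasible region is taken to be a box and positions are clamped to it, the initial velocity range is arbitrary, and $r_1$, $r_2$ are drawn once per particle and cycle. The name `hybrid_pso` is hypothetical.

```python
import random


def hybrid_pso(objective, dim, lower, upper, pop_size=100, cycles=200,
               c1=2.0, c2=2.0, w_start=0.9, w_end=0.4, seed=0):
    rng = random.Random(seed)
    span = upper - lower
    # Step 3: random positions and velocities, then evaluate all particles.
    X = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(pop_size)]
    V = [[rng.uniform(-0.1 * span, 0.1 * span) for _ in range(dim)]
         for _ in range(pop_size)]
    values = [objective(x) for x in X]
    # Step 4: pbest and gbest start from the initial swarm.
    pbest = [x[:] for x in X]
    pbest_val = values[:]
    g = min(range(pop_size), key=values.__getitem__)
    gbest, gbest_val = X[g][:], values[g]
    history = [gbest_val]
    for t in range(cycles):  # Step 9
        # Inertia coefficient decreasing linearly from w_start to w_end.
        w = w_start - (w_start - w_end) * t / max(cycles - 1, 1)
        for k in range(pop_size):
            r1, r2 = rng.random(), rng.random()
            for d in range(dim):
                # Step 5: velocity by formula (7), then position by formula (6).
                V[k][d] = (w * V[k][d]
                           + c1 * r1 * (pbest[k][d] - X[k][d])
                           + c2 * r2 * (gbest[d] - X[k][d]))
                X[k][d] = min(max(X[k][d] + V[k][d], lower), upper)
            values[k] = objective(X[k])          # Step 6
            if values[k] < pbest_val[k]:         # Step 7
                pbest[k], pbest_val[k] = X[k][:], values[k]
        g = min(range(pop_size), key=values.__getitem__)  # Step 8
        if values[g] < gbest_val:
            gbest, gbest_val = X[g][:], values[g]
        history.append(gbest_val)
    return gbest, gbest_val, history  # Step 10


def surrogate(position):
    # Stand-in for the trained NN: a simple quadratic bowl centred at 30.
    return sum((p - 30.0) ** 2 for p in position)


best, best_val, history = hybrid_pso(surrogate, dim=10, lower=10.0, upper=50.0,
                                     pop_size=30, cycles=60, seed=1)
```

Here `dim=10` corresponds to five facilities with two coordinates each. In the hybrid algorithm proper, `surrogate` would be replaced by the forward pass of the NN trained in Steps 1 and 2.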

5 One Numerical Example

We now provide an example to show the effectiveness of the hybrid PSO algorithm.

Example 1. Consider a two-stage FLA problem with five facilities and ten customers. The demands $d_j$ and locations $(a_j, b_j)$ of the ten customers are collected in Table 1, where the demands $d_j$ for $j = 1, 2, \cdots, 10$ are supposed to be triangular fuzzy random variables, and $X_i$ for $i = 1, 2, \cdots, 10$ are Bernoulli distributed stochastic variables. The capacities $s_i$ for $i = 1, 2, 3, 4, 5$ are 26, 27, 28, 29 and 30, respectively. Then the problem can be built as the following two-stage FLA model

$$ \begin{cases} \min\ Q_E(x, y) \\ \text{subject to}\ 10 \le x_i \le 50, \quad i = 1, 2, \cdots, 5 \\ \phantom{\text{subject to}\ } 10 \le y_i \le 50, \quad i = 1, 2, \cdots, 5, \end{cases} \tag{8} $$

where $Q_E(x, y) = E_\xi[Q(x, y, \xi)]$, and

$$ \begin{aligned} Q(x, y, \xi) = \min\ & \sum_{i=1}^{5} \sum_{j=1}^{10} z_{ij}(\omega, \gamma) \sqrt{(x_i - a_j)^2 + (y_i - b_j)^2} \\ \text{subject to}\ & \sum_{j=1}^{10} z_{ij}(\omega, \gamma) \le s_i, \quad i = 1, 2, \ldots, 5 \\ & \sum_{i=1}^{5} z_{ij}(\omega, \gamma) = d_j(\omega, \gamma), \quad j = 1, 2, \ldots, 10 \\ & z_{ij} \ge 0, \quad i = 1, 2, \ldots, 5;\ j = 1, 2, \ldots, 10. \end{aligned} \tag{9} $$

First, for each fixed first-stage location variable $(x, y)$, we generate $5.12 \times 10^5$ sample points via the approximation method to compute the recourse function $Q_E(x, y)$. Then, for each sample point $\hat\zeta^{ij}$, we solve the second-stage programming (9) via the simplex algorithm and obtain the second-stage value $Q(x, y, \hat\zeta^{ij})$ for $i = 1, 2, \cdots, 1024$, and $j = 1, 2, \cdots, 500$. After that, we employ formula (3) to compute the value of the recourse function $Q_E(x, y)$ at $(x, y)$.

Table 1. Demands and Locations of 10 Customers

| j | $(a_j, b_j)$ | $d_j$ |
| --- | --- | --- |
| 1 | (15,24) | $d_{1,\omega} = (X_1(\omega) + 3, X_1(\omega) + 5, X_1(\omega) + 6)$ with $X_1 \sim BE(1/2)$ |
| 2 | (16,14) | $d_{2,\omega} = (X_2(\omega) + 4, X_2(\omega) + 5, X_2(\omega) + 7)$ with $X_2 \sim BE(1/2)$ |
| 3 | (25,14) | $d_{3,\omega} = (X_3(\omega) + 3, X_3(\omega) + 4, X_3(\omega) + 6)$ with $X_3 \sim BE(1/2)$ |
| 4 | (23,16) | $d_{4,\omega} = (X_4(\omega) + 6, X_4(\omega) + 7, X_4(\omega) + 9)$ with $X_4 \sim BE(1/2)$ |
| 5 | (13,16) | $d_{5,\omega} = (X_5(\omega) + 6, X_5(\omega) + 7, X_5(\omega) + 9)$ with $X_5 \sim BE(1/2)$ |
| 6 | (20,18) | $d_{6,\omega} = (X_6(\omega) + 4, X_6(\omega) + 6, X_6(\omega) + 7)$ with $X_6 \sim BE(1/2)$ |
| 7 | (26,30) | $d_{7,\omega} = (X_7(\omega) + 5, X_7(\omega) + 6, X_7(\omega) + 8)$ with $X_7 \sim BE(1/2)$ |
| 8 | (26,20) | $d_{8,\omega} = (X_8(\omega) + 4, X_8(\omega) + 5, X_8(\omega) + 7)$ with $X_8 \sim BE(1/2)$ |
| 9 | (12,14) | $d_{9,\omega} = (X_9(\omega) + 7, X_9(\omega) + 8, X_9(\omega) + 10)$ with $X_9 \sim BE(1/2)$ |
| 10 | (23,12) | $d_{10,\omega} = (X_{10}(\omega) + 7, X_{10}(\omega) + 8, X_{10}(\omega) + 10)$ with $X_{10} \sim BE(1/2)$ |
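The sample sizes quoted above, and the capacity condition of Theorem 1, can be checked directly from Table 1. The sketch below assumes that the ten Bernoulli variables are independent, so that there are $2^{10} = 1024$ equally likely random outcomes; the paper gives the count 1024 but does not state independence. All names are hypothetical.

```python
from itertools import product

# Offsets (left, peak, right) of the triangular demands in Table 1:
# d_j = (X_j + left, X_j + peak, X_j + right)
OFFSETS = [(3, 5, 6), (4, 5, 7), (3, 4, 6), (6, 7, 9), (6, 7, 9),
           (4, 6, 7), (5, 6, 8), (4, 5, 7), (7, 8, 10), (7, 8, 10)]
CAPACITIES = [26, 27, 28, 29, 30]


def demand_triples(x_outcome):
    """Triangular fuzzy demands for one realization of (X_1, ..., X_10)."""
    return [(x + lo, x + peak, x + hi)
            for x, (lo, peak, hi) in zip(x_outcome, OFFSETS)]


# All realizations of ten Bernoulli variables (independence assumed).
scenarios = list(product((0, 1), repeat=10))
scenario_prob = 0.5 ** 10

# 500 fuzzy sample points per random outcome, as in the text.
total_samples = len(scenarios) * 500

# Largest possible total demand: right endpoints with every X_j = 1.
worst_total_demand = max(
    sum(hi for _, _, hi in demand_triples(s)) for s in scenarios
)
total_capacity = sum(CAPACITIES)
```

The worst-case total demand is 89 and the total capacity is 140, so the condition $\sum_j d_j \le \sum_i s_i$ holds for every realization in this example.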


Using the above approximation method, we generate a set $\{(x^i, y^i, q_i) \mid i = 1, 2, \cdots, 2000\}$ of input-output data for the recourse function $Q_E(x, y)$, and use the training set to train an NN to approximate the recourse function $Q_E(x, y)$. After the NN is well trained, it is embedded into a PSO algorithm to produce a hybrid algorithm to search for the optimal solutions. The parameters adopted in the implementation of the PSO algorithm are set as follows: the inertia coefficient w decreases linearly from 0.9 to 0.4, the learning rates are $c_1 = c_2 = 2$, and the population size is 100. With these settings, a run of the hybrid PSO algorithm with 200 generations gives the following optimal solution

$(x_1^*, y_1^*) = (40.786653, 10.000000)$,
$(x_2^*, y_2^*) = (10.000000, 10.000000)$,
$(x_3^*, y_3^*) = (10.000000, 10.000000)$,
$(x_4^*, y_4^*) = (10.000000, 10.000000)$,
$(x_5^*, y_5^*) = (23.443389, 50.000000)$,

whose objective value is 1027.699464.

6 Conclusions

This paper has presented a new class of two-stage fuzzy random FLA problems. Since the demand d, characterized by both possibility and probability distributions, usually has an infinite support, we cannot solve the FLA problem by conventional optimization algorithms. To overcome this difficulty, an approximation method was adopted to turn the original FLA problem into a finite-dimensional one. Furthermore, an approximation-based hybrid PSO algorithm was designed for solving the proposed FLA problem. One numerical example with five facilities and ten customers was provided to demonstrate the effectiveness of the designed algorithm.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 70571021), the Program for One Hundred Excellent and Innovative Talents in Colleges and Universities of Hebei Province, and the Natural Science Foundation of Hebei Province (No. A2008000563).

References

1. Domschke, W., Drex, A.: An International Bibliography on Location and Layout Planning. Springer, Heidelberg (1984)
2. Feng, X., Liu, Y.K.: Measurability Criteria for Fuzzy Random Vectors. Fuzzy Optim. Decision Making 5, 245–253 (2006)
3. Gong, D., Gen, M., Xu, W., Yamazaki, G.: Hybrid Evolutionary Method for Obstacle Location-Allocation Problem. Computers & Industrial Engineering 29, 525–530 (1995)
4. Kennedy, J., Eberhart, R.C.: Particle Swarm Optimization. In: Proc. of the IEEE International Conference on Neural Networks, Piscataway, NJ, pp. 1942–1948 (1995)
5. Kennedy, J., Eberhart, R.C., Shi, Y.: Swarm Intelligence. Morgan Kaufmann Publishers, San Francisco (2001)
6. Lee, S., Green, G., Kim, C.: A Multiple Criteria Models for the Location-Allocation Problem. Computers & Operations Research 8, 1–8 (1981)
7. Liu, B.: Uncertainty Theory: An Introduction to Its Axiomatic Foundations. Springer, Berlin (2004)
8. Liu, C.M., Kao, R.L., Wang, A.H.: Solving Location-Allocation Problems with Rectilinear Distances by Simulated Annealing. Journal of Operational Research Society 45, 1304–1315 (1994)
9. Liu, Y.K.: The Approximation Method for Two-Stage Fuzzy Random Programming with Recourse. IEEE Trans. Fuzzy Syst. 15, 1197–1208 (2007)
10. Liu, Y.K., Dai, X.D.: Minimum-Risk Criteria in Two-Stage Fuzzy Random Programming. In: Proc. of the 2007 IEEE International Conference on Fuzzy Systems, pp. 1008–1012. Imperial College, London (2007)
11. Liu, Y.K., Liu, B.: Fuzzy Random Variable: A Scalar Expected Value Operator. Fuzzy Optim. Decision Making 2, 143–160 (2003)
12. Liu, Y.K., Liu, B.: A Class of Fuzzy Random Optimization: Expected Value Models. Information Sciences 155, 89–102 (2003)
13. Liu, Y.K., Liu, B.: On Minimum-Risk Problems in Fuzzy Random Decision Systems. Computers & Operations Research 32, 257–283 (2005)
14. Liu, Y.K., Liu, B.: Fuzzy Random Programming with Equilibrium Chance Constraints. Information Sciences 170, 363–395 (2005)
15. Liu, Y.K., Gao, J.: The Independence of Fuzzy Variables with Applications to Fuzzy Random Optimization. Int. J. Uncertainty Fuzziness Knowl.-Based Syst. 15, 1–20 (2007)
16. Liu, Y.K., Wang, S.: Theory of Fuzzy Random Optimization Theory. China Agricultural University Press, Beijing (2006)
17. Liu, Y.K., Wang, S.: A Credibility Approach to the Measurability of Fuzzy Random Vectors. International Journal of Natural Sciences & Technology 1, 111–118 (2006)
18. Liu, Y.K., Gao, J.: Convergence Criteria and Convergence Relations for Sequences of Fuzzy Random Variables. In: Wang, L., Jin, Y. (eds.) FSKD 2005. LNCS (LNAI), vol. 3613, pp. 321–331. Springer, Heidelberg (2005)
19. Logendran, R., Terrell, M.P.: Uncapacitated Plant Location Allocation Problem with Price Sensitive Stochastic Demands. Computers & Operations Research 15, 189–198 (1998)
20. Love, R.F.: One-Dimensional Facility Location-Allocation Problem Using Dynamic Programming. Management Science 24, 224–229 (1976)
21. Lozano, S., Guerrero, F., Onieva, L., Larrañeta, J.: Kohonen Maps for Solving a Class of Location-Allocation Problems. European Journal of Operational Research 108, 106–117 (1998)
22. Scarselli, F., Tsoi, A.C.: Universal Approximation Using Feedforward Neural Networks: A Survey of Some Existing Methods, and Some New Results. Neural Networks 11, 15–37 (1998)
23. Wang, S., Watada, J.: T-Independence Condition for Fuzzy Random Vector Based on Continuous Triangular Norms. Journal of Uncertain Systems 2, 155–160 (2008)
24. Zheng, M., Liu, Y.K.: The Properties of Two-Stage Fuzzy Random Programming. In: Proc. of the 6th International Conference of Machine Learning and Cybernetic, Hong Kong, China, pp. 1271–1276 (2007)

T-S Fuzzy Model Identification Based on Chaos Optimization

Chaoshun Li, Jianzhong Zhou⋆, Xueli An, Yaoyao He, and Hui He

College of Hydroelectric Digitization Engineering, Huazhong University of Science and Technology, 430074 Wuhan, China
[email protected], [email protected]

Abstract. Nonlinear system identification by fuzzy models has been widely applied in many fields, because a fuzzy model can approximate any nonlinear system to a given accuracy. In this paper, a nonlinear system identification algorithm based on the Takagi-Sugeno (T-S) fuzzy model is proposed. Since the fuzzy space structure of the T-S fuzzy model strongly influences the precision of the final identification result, a new fuzzy clustering method based on chaos optimization, which is expected to cluster data more accurately, is adopted to partition the fuzzy space and obtain the fuzzy rules. Based on the fuzzy space partition and the resulting structure parameters, the least squares method is used to identify the consequent parameters. A typical nonlinear function is simulated to test the precision of the proposed identification technique, and the algorithm is then applied to the identification of a boiler-turbine system.

Keywords: T-S fuzzy model, Identification, Chaos optimization, Boiler-turbine.

1 Introduction

System identification is the foundation of system control and fault diagnosis, and most real systems are nonlinear. Nonlinear system identification has therefore attracted a great deal of attention in recent years, and researchers have proposed many methods for it. Takagi and Sugeno proposed the well-known T-S fuzzy model in [1], which has since attracted wide interest [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] because it can approximate any real system to a given accuracy and needs fewer rules than other models. In addition, since the consequent part of the T-S fuzzy model is described by linear functions, the model is well suited to analyzing and designing a controller with traditional control strategies.

The identification of a T-S fuzzy model contains two key steps, namely premise parameter determination and consequent parameter identification. In [2, 3], researchers put forward methods that identify the parameters of the two parts together, but such methods need iterative calculation and are thus time-consuming. In [4, 5, 6, 7, 8], the two parts were identified separately: the structure was identified first, and the consequent parameters were then obtained based on the premise parameters. Fuzzy clustering is the main method used to identify the premise part; it partitions the input space according to the input features and then yields the membership functions of the antecedents. Researchers have made great efforts to improve the clustering algorithms used in fuzzy model identification [8, 9, 10, 11, 12], aiming at either higher efficiency or higher precision.

In this paper, the precision of the fuzzy clustering strategy used in structure determination is taken into consideration, and a new fuzzy clustering method based on chaos optimization is proposed. The most attractive feature of chaos optimization is that it can search for the global extremum of the target function, which is the basis of a good clustering result. The fuzzy clustering method based on chaos optimization is presented first, followed by the whole T-S fuzzy model identification strategy. Finally, the proposed method is applied to the identification of a typical nonlinear function and of a boiler-turbine system.

⋆ Corresponding author.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 786–795, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 The T-S Model Identification Method

2.1 T-S Fuzzy Model

To identify a Multiple Input Multiple Output (MIMO) system, we consider Multiple Input Single Output (MISO) systems, because a MIMO system can be divided into MISO systems. Assume that a MISO system P(X, Y) is the system to be identified, where X is the system input with dimension m and Y is the system output. The T-S fuzzy model of this system can be described by the following IF-THEN fuzzy rules:

$$\text{Rule } i:\ \text{IF } x_1 \text{ is } A_i^1 \text{ and } \cdots \text{ and } x_m \text{ is } A_i^m \text{ THEN } y_i = p_i^0 + p_i^1 x_1 + \cdots + p_i^m x_m, \tag{1}$$

where $i = 1, 2, \ldots, N$, $N$ is the number of fuzzy rules, $x = (x_1, \ldots, x_m)$ is the system input, $m$ is the dimension of the input vector, $y_i$ is the output of the $i$th rule, and $p_i^j$ are the consequent parameters of the $i$th rule. The final output of the T-S fuzzy model is expressed by weighted mean defuzzification as

$$y = \frac{\sum_{i=1}^{N} w_i y_i}{\sum_{i=1}^{N} w_i}, \tag{2}$$

where the weight $w_i$ represents the overall truth value of the premise of the $i$th rule for the input, calculated as

$$w_i = \prod_{j=1}^{m} \mu(A_i^j), \tag{3}$$


where $\mu(A_i^j)$ is the membership grade, described by a Gaussian function as

$$\mu(A_i^j) = \exp\left(-\frac{(x_j - c_i^j)^2}{r_i^j}\right), \tag{4}$$

where $c_i^j$ and $r_i^j$ represent the center and width of the membership function, respectively.

2.2 The Premise Parameters Identification

The main work of premise parameter identification is partitioning the input fuzzy space to get the center $c_i^j$ and width $r_i^j$ of each fuzzy set. Here, we propose a new fuzzy clustering method based on chaos optimization to partition the input space and obtain the centers of the fuzzy sets. The new method can cluster data accurately, because the chaos optimization strategy avoids being trapped in local minima when optimizing the clustering objective function. The chaos-optimization-based fuzzy clustering method gives the centers of the fuzzy sets, while the width $r_i^j$ of each fuzzy set is obtained through the K-Nearest Neighbor rule:

$$r_i^j = \frac{1}{\beta}\left(\frac{1}{n}\sum_{k=1}^{n} \left| c_i^j - c_k^j \right|\right), \tag{5}$$

where $c_k^j$ is one of the $n$ centers nearest to $c_i^j$, and $\beta$ is a constant.

Chaos Optimization Strategy. Chaos is a seemingly random, irregular movement that appears in a deterministic system; it is a complex and universal natural phenomenon. Chaos variables are random, ergodic and, to some extent, regular. The basic idea of searching for an optimum with chaos variables is to produce chaos variables with a chaotic map, project them onto the interval of the optimization variables, and then search for the optimal solution with the projected variables. The randomness and ergodicity of chaos variables make it possible for chaos optimization to reach the global optimum quickly. We choose the well-known Logistic map as the chaotic map, a one-dimensional quadratic map defined by

$$x_{i+1} = \mu x_i (1 - x_i), \tag{6}$$

where $\mu$ is a control parameter. When $\mu = 4$, equation (6) generates chaotic evolution and the sequence $\{x_i\}$ is completely chaotic; it is called a chaos sequence. The values in a chaos sequence do not repeat, which means that every value in the given interval of the optimization variables can be reached by projecting the chaos variables onto the optimization variables, and thus the global optimum of the objective function $f(X)$ can be achieved. The specific steps for optimizing a continuous problem with an $s$-dimensional solution are: choose $s$ initial values $x_i^{(0)}$, $0 < i \le s$; obtain the chaos variable sequences by iterating equation (6); project the chaos variables onto the solution space of the optimization problem through the linear transformation $x_{mi} = a + b x_i$; and finally search for the optimal solution.

Fuzzy Clustering Method Based on Chaos Optimization. When classifying $n$ samples into $c$ classes, the clustering objective function is defined as

$$J_m = \sum_{k=1}^{n} \sum_{i=1}^{c} (\mu_{ik})^m (d_{ik})^2, \tag{7}$$

where $d_{ik}$ is the distance between sample $x_k$ and center $v_i$, usually the Euclidean distance $d_{ik} = \| x_k - v_i \|$. The fuzzy membership grades of each sample over all clusters are assumed to sum to 1:

$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad k = 1, \ldots, n. \tag{8}$$

The optimal cluster structure is expected to be achieved when the clustering objective function reaches its minimum. Under constraint (8), the Lagrange multiplier rule shows that an extremum of $J_m$ can be found only when the fuzzy membership grades and cluster centers satisfy equations (9) and (10):

$$\mu_{ik} = \frac{1}{\sum_{j=1}^{c} (d_{ik}/d_{jk})^{2/(m-1)}}, \tag{9}$$

$$v_i = \frac{\sum_{k=1}^{n} (\mu_{ik})^m x_k}{\sum_{k=1}^{n} (\mu_{ik})^m}. \tag{10}$$
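The two update rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard fuzzy c-means exponent 2/(m − 1) in equation (9), the Euclidean distance, and hypothetical function names (`update_memberships`, `update_centers`, `objective`) that do not come from the paper.

```python
import math

def distances(X, V):
    """d[i][k]: Euclidean distance between center V[i] and sample X[k]."""
    # A tiny floor avoids division by zero when a sample sits on a center.
    return [[max(math.dist(v, x), 1e-12) for x in X] for v in V]

def update_memberships(X, V, m=2.0):
    """Equation (9), assuming the standard exponent 2/(m - 1)."""
    d = distances(X, V)
    c, n, p = len(V), len(X), 2.0 / (m - 1.0)
    return [[1.0 / sum((d[i][k] / d[j][k]) ** p for j in range(c))
             for k in range(n)] for i in range(c)]

def update_centers(X, U, m=2.0):
    """Equation (10): membership-weighted mean of the samples."""
    dim = len(X[0])
    V = []
    for row in U:
        w = [u ** m for u in row]
        total = sum(w)
        V.append([sum(wk * x[j] for wk, x in zip(w, X)) / total
                  for j in range(dim)])
    return V

def objective(X, V, U, m=2.0):
    """Equation (7): the clustering objective J_m."""
    d = distances(X, V)
    return sum((U[i][k] ** m) * d[i][k] ** 2
               for i in range(len(V)) for k in range(len(X)))
```

Applying `update_memberships` and then `update_centers` once is the single refinement step that the clustering method combines with the chaos search; for fixed data, such a step does not increase the value returned by `objective`.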

The main idea of the proposed clustering method is to optimize the clustering objective function $J_m$ with a chaos optimization strategy. The search proceeds as follows: the Logistic map generates a chaos sequence; the chaos variables are projected onto the elements of the cluster center matrix; the fuzzy membership matrix U is refreshed through equation (9) accordingly; and finally the value of $J_m$ is calculated and compared with the best value found so far. To improve search efficiency, we use a chaos optimization strategy that gradually reduces the search range, and we combine the gradient method with chaos optimization: whenever chaos optimization yields a new current optimal solution, namely a cluster center matrix V, we compute new V and U by applying equations (9) and (10) once. The new fuzzy clustering method thus combines a mutative scale chaos optimization strategy with the gradient method, so that it can search for the global optimum quickly and effectively.
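The search strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the iteration count, the patience before shrinking, the shrink ratio, the use of the data bounding box as the initial search range, and all function names are assumptions made for the example.

```python
import math
import random

def fcm_cost(X, V, m=2.0):
    """Memberships by equation (9) and objective J_m by equation (7)."""
    p = 2.0 / (m - 1.0)
    d = [[max(math.dist(v, x), 1e-12) for x in X] for v in V]
    U = [[1.0 / sum((d[i][k] / d[j][k]) ** p for j in range(len(V)))
          for k in range(len(X))] for i in range(len(V))]
    J = sum(U[i][k] ** m * d[i][k] ** 2
            for i in range(len(V)) for k in range(len(X)))
    return J, U

def refine(X, U, m=2.0):
    """One center update by equation (10)."""
    V = []
    for row in U:
        w = [u ** m for u in row]
        V.append([sum(wk * x[j] for wk, x in zip(w, X)) / sum(w)
                  for j in range(len(X[0]))])
    return V

def chaos_cluster(X, c, iters=1500, patience=50, shrink=0.5, seed=1):
    """Chaos search over cluster centers with a shrinking search box."""
    rng = random.Random(seed)
    dim = len(X[0])
    data_lo = [min(x[j] for x in X) for j in range(dim)]
    data_hi = [max(x[j] for x in X) for j in range(dim)]
    lo = [data_lo[:] for _ in range(c)]
    hi = [data_hi[:] for _ in range(c)]
    z = [[rng.uniform(0.01, 0.99) for _ in range(dim)] for _ in range(c)]
    best_V, best_J, stall = None, float("inf"), 0
    for _ in range(iters):
        # Logistic map (6) with mu = 4, kept strictly inside (0, 1).
        z = [[min(max(4.0 * t * (1.0 - t), 1e-9), 1.0 - 1e-9) for t in row]
             for row in z]
        # Linear projection a + b*z onto the current search box.
        V = [[lo[i][j] + (hi[i][j] - lo[i][j]) * z[i][j] for j in range(dim)]
             for i in range(c)]
        J, U = fcm_cost(X, V)
        if J < best_J:
            V = refine(X, U)          # one step of equations (9)-(10)
            J, _ = fcm_cost(X, V)
            best_V, best_J, stall = V, J, 0
        else:
            stall += 1
        if stall >= patience:
            # Mutative scale: shrink the box around the best centers so far.
            for i in range(c):
                for j in range(dim):
                    half = shrink * (hi[i][j] - lo[i][j]) / 2.0
                    lo[i][j] = max(best_V[i][j] - half, data_lo[j])
                    hi[i][j] = min(best_V[i][j] + half, data_hi[j])
            stall = 0
    return best_V, best_J
```

Calling `chaos_cluster(X, c)` on a list of samples returns the best center matrix found and its objective value; the stopping rule based on the change of V is replaced here by a fixed iteration budget to keep the sketch short.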


The specific steps of the chaos-optimization-based fuzzy clustering algorithm are:
1) Initialize the cluster number c, the cluster center V, and the stopping threshold ε.
2) Optimize the objective function through iterative chaos searching: generate a chaos sequence, project the chaos variables onto the elements of V, calculate the objective function value, and record the best value so far.
3) If the best objective value stays unchanged for a fixed number of steps, reduce the search scales of the elements of the cluster center.
4) Refine the optimal value with the gradient method.
5) When the change of V is within the threshold ε, stop the calculation. The V corresponding to the optimal objective function value gives the best cluster centers.

2.3 The Consequent Parameter Identification

For L input-output pairs $(X_k, y_k)$, equation (2) gives

$$AP = Y, \tag{11}$$

where $P = (p_1^0, \ldots, p_1^m, \ldots, p_N^0, \ldots, p_N^m)^T$, $Y = [y_1, \ldots, y_L]^T$, and the coefficient matrix A is described as

$$A^T = \begin{bmatrix}
1 & 1 & \cdots & 1 \\
\lambda_{11} x_{11} & \lambda_{12} x_{12} & \cdots & \lambda_{1L} x_{1L} \\
\vdots & \vdots & \cdots & \vdots \\
\lambda_{N1} x_{11} & \lambda_{N2} x_{12} & \cdots & \lambda_{NL} x_{1L} \\
\vdots & \vdots & \cdots & \vdots \\
1 & 1 & \cdots & 1 \\
\lambda_{11} x_{m1} & \lambda_{12} x_{m2} & \cdots & \lambda_{1L} x_{mL} \\
\vdots & \vdots & \cdots & \vdots \\
\lambda_{N1} x_{m1} & \lambda_{N2} x_{m2} & \cdots & \lambda_{NL} x_{mL}
\end{bmatrix}, \tag{12}$$

where $x_{jk}$ is the $j$th element of the $k$th input and $\lambda_{ik}$ is the normalized weight of the $i$th rule for the $k$th input:

$$\lambda_{ik} = w_i \Big/ \sum_{l=1}^{N} w_l. \tag{13}$$

We use the least squares method (LSM) to solve equation (11) and obtain the consequent parameter vector P. Based on the discussion above, the identification method is summarized as:
1) Select the input signals of the fuzzy model, and partition the input space through the algorithm in Section 2.2 to get the centers of the fuzzy sets.
2) Based on the centers of the fuzzy sets, get the widths of the fuzzy sets by equation (5).
3) With the centers and widths of the fuzzy sets, calculate the coefficient matrix A, then obtain the parameter vector P by solving equation (11) through LSM.
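Steps 2) and 3) can be sketched with NumPy as below. The sketch rests on several assumptions that are not stated in the paper: equation (5) is read as the mean absolute per-dimension gap to the n nearest centers divided by β; the regressor multiplying $p_i^0$ is taken to be $\lambda_{ik}$ and the one multiplying $p_i^j$ to be $\lambda_{ik} x_{jk}$, ordered rule by rule to match P; and all function names are hypothetical.

```python
import numpy as np

def widths_from_centers(C, n_near=1, beta=1.0):
    """Equation (5), read as mean absolute gap to the n nearest centers over beta."""
    R = np.empty_like(C, dtype=float)
    for i in range(len(C)):
        order = np.argsort(np.linalg.norm(C - C[i], axis=1))
        nearest = order[1:n_near + 1]          # order[0] is the center itself
        R[i] = np.abs(C[i] - C[nearest]).mean(axis=0) / beta
    return np.maximum(R, 1e-6)                 # keep widths strictly positive

def normalized_weights(X, C, R):
    """Equations (3), (4) and (13): lam[k, i] for sample k and rule i."""
    mu = np.exp(-((X[:, None, :] - C[None, :, :]) ** 2) / R[None, :, :])
    w = mu.prod(axis=2)                        # product over input dimensions
    return w / np.maximum(w.sum(axis=1, keepdims=True), 1e-300)

def design_matrix(X, lam):
    """Rows of A: for each rule i, lam_i * [1, x_1, ..., x_m]."""
    L = X.shape[0]
    X1 = np.hstack([np.ones((L, 1)), X])
    return (lam[:, :, None] * X1[:, None, :]).reshape(L, -1)

def fit_consequents(X, y, C, R):
    """Least squares solution of A P = Y, equation (11)."""
    A = design_matrix(X, normalized_weights(X, C, R))
    P, *_ = np.linalg.lstsq(A, y, rcond=None)
    return P

def predict(X, C, R, P):
    """Model output of equation (2) for every row of X."""
    return design_matrix(X, normalized_weights(X, C, R)) @ P
```

Here `X` has shape (L, m), `C` and `R` have shape (N, m), and `P` has length N(m + 1). The centers `C` would come from the clustering step; in a quick experiment they can simply be placed on a grid.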

3 Applications

3.1 The Identification of a Nonlinear Function

We select a typical nonlinear system to test the proposed identification method:

$$y(k+1) = \frac{y(k)\,y(k-1)\,y(k-2)\,u(k-1)\,[y(k-2) - 1] + u(k)}{1 + y(k-1)^2 + y(k-2)^2}, \tag{14}$$

where the input u(k) is defined as

$$u(k) = \begin{cases}
\sin\frac{\pi k}{25}, & k < 250, \\
1, & 250 \le k < 500, \\
-1, & 500 \le k < 750, \\
0.3 \sin\frac{\pi k}{25} + 0.1 \sin\frac{\pi k}{35} + 0.6 \sin\frac{\pi k}{10}, & 750 \le k < 1000.
\end{cases} \tag{15}$$

We simulate this system for 1000 iterations and sample u(k) and y(k). We choose y(k), y(k − 1) and u(k) as the input signals of the fuzzy model and set the number of rules to 6. Identifying this system with the proposed method gives the results shown in Fig. 1, where the solid line is the original system output and the dashed line is the output of the T-S fuzzy model. We choose the Mean Square Error (MSE) as the error criterion and compare the identification precision of the method in this paper with the methods in [4, 5]; the results are listed in Table 1. Fig. 1 shows that the output of the fuzzy model follows almost the same track as the output of the original system, which means that the proposed method has relatively high accuracy. Table 1 shows that the proposed method gives the lowest MSE of the three methods.
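The data generation for this experiment can be sketched as follows. The zero initial conditions and the function names are assumptions made for the example; the paper does not state how the first samples are initialized.

```python
import math

def u_signal(k):
    """Input signal of equation (15)."""
    if k < 250:
        return math.sin(math.pi * k / 25)
    if k < 500:
        return 1.0
    if k < 750:
        return -1.0
    return (0.3 * math.sin(math.pi * k / 25)
            + 0.1 * math.sin(math.pi * k / 35)
            + 0.6 * math.sin(math.pi * k / 10))

def simulate(steps=1000):
    """Iterate equation (14); signals before k = 0 are taken as zero (assumption)."""
    u = [u_signal(k) for k in range(steps)]
    y = [0.0] * (steps + 1)
    for k in range(steps):
        y1 = y[k - 1] if k >= 1 else 0.0
        y2 = y[k - 2] if k >= 2 else 0.0
        u1 = u[k - 1] if k >= 1 else 0.0
        num = y[k] * y1 * y2 * u1 * (y2 - 1.0) + u[k]
        y[k + 1] = num / (1.0 + y1 ** 2 + y2 ** 2)
    return u, y

def training_pairs(u, y):
    """Model inputs [y(k), y(k-1), u(k)] paired with the target y(k+1)."""
    return [([y[k], y[k - 1], u[k]], y[k + 1]) for k in range(1, len(u))]

def mse(pred, target):
    """Mean Square Error, the criterion used in Table 1."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
```

`training_pairs(*simulate())` yields the regressor and target pairs on which a fuzzy model would be fitted, and `mse` then compares the model output with the simulated output.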

Fig. 1. Comparison of outputs of real system and T-S fuzzy model

Table 1. Precision comparison of different methods

Methods                  MSE
Method in [4]            0.01
Method in [5]            5.0137E-3
Method in this paper     3.384E-3

3.2 The Identification of Boiler-Turbine System

The boiler-turbine is a complicated nonlinear system, described by the following group of differential equations:

$$\dot{x}_1 = -0.0018 u_2 x_1^{9/8} + 0.9 u_1 - 0.15 u_3, \tag{16}$$

$$\dot{x}_2 = (0.073 u_2 - 0.16) x_1^{9/8} - 0.1 x_2, \tag{17}$$

$$\dot{x}_3 = (141 u_3 - (1.1 u_2 - 0.19) x_1)/185, \tag{18}$$

$$y_1 = x_1, \tag{19}$$

$$y_2 = x_2, \tag{20}$$

$$y_3 = 0.05(0.13073 x_3 + 100 a_{cs} + q_e/9 - 67.975), \tag{21}$$

where $x_1$ is the drum pressure (kg/cm2), $x_2$ is the electrical output (MW), $x_3$ is the density of the liquid in the drum (kg/cm3), $u_1$ is the level of the boiler regulator, $u_2$ is the level of the valve controlling the quantity of steam for the turbine, $u_3$ is the level of the feedwater regulator, $y_3$ is the drum level (m), $a_{cs}$ is the coefficient for the quality of steam,

$$a_{cs} = \frac{(1 - 0.001538 x_3)(0.8 x_1 - 25.6)}{x_3 (1.0394 - 0.0012304 x_1)},$$

and $q_e$ is the consumption rate of steam (kg/sec),

$$q_e = (0.854 u_2 - 0.147) x_1 + 45.59 u_1 - 2.514 u_3 - 2.096.$$

The input signals $u_1$, $u_2$ and $u_3$ are selected as

$$u_1(t) = u_1^0 + 0.7 \sin(2\pi t/2000) \sin(2\pi t/1000),$$
$$u_2(t) = u_2^0 + 0.6 \sin(2\pi t/2500) \sin(2\pi t/1100),$$
$$u_3(t) = u_3^0 + 0.8 \sin(2\pi t/1500) \sin(2\pi t/800).$$

The initial conditions are set as $x_1^0 = 108$, $x_2^0 = 66.65$, $x_3^0 = 428$, $u_1^0 = 0.34$, $u_2^0 = 0.69$, $u_3^0 = 0.436$. We simulate the system for 2000 seconds and sample the input and output of the system every second to get the training data. We select $y_1(k-1)$, $y_1(k-6)$, $u_1(k-1)$, $u_2(k-1)$ and $u_3(k-1)$ as inputs of the drum pressure fuzzy model; $y_2(k-1)$, $y_2(k-6)$, $u_1(k-1)$, $u_2(k-1)$ and $u_3(k-1)$ as inputs of the electrical output fuzzy model; and $y_3(k-1)$, $y_3(k-6)$, $u_1(k-1)$, $u_2(k-1)$ and $u_3(k-1)$ as inputs of the drum level fuzzy model. The rule number is set to 3.

Using the sampled data, we obtain the identification results shown in Fig. 2 and Table 2. Fig. 2 compares the output of the original system with the output of the fuzzy model, where the solid line represents the output of the original system and the dashed line represents the output of the fuzzy model. In Fig. 2 it is difficult to distinguish the outputs of the original system from those of the identified model, which shows the accuracy of the identification method designed in this paper. To show the precision of the proposed method more clearly, Table 2 gives a detailed comparison between the proposed method and the method in [12]. The proposed method improves the precision of boiler-turbine system identification considerably.

Fig. 2. Comparison of outputs of original system and fuzzy model

Table 2. Precision comparison of different methods

Methods                  MSE of y1     MSE of y2     MSE of y3
Method in [12]           4.0524E-4     3.5920E-5     2.2910E-5
Method in this paper     8.405E-7      8.406E-7      2.112E-9

4 Conclusions

In this paper, we have proposed a new identification method for the T-S fuzzy model, paying particular attention to the identification of the premise parameters, which include the number of fuzzy rules and the centers and widths of the fuzzy sets. We have proposed a fuzzy clustering method based on chaos optimization that aims to improve the accuracy of the fuzzy space partition and thus the precision of the fuzzy model identification. We have applied the method to the identification of a typical complicated nonlinear system; the results confirm the validity of the identification method, and the comparison with other methods shows the improvement in precision. The successful application to boiler-turbine identification indicates the prospects of the method for engineering applications.

Acknowledgments. This paper is supported by the Special Research Foundation for the Public Welfare Industry of the Ministry of Science and Technology and the Ministry of Water Resources (No. 200701008) and the National Natural Science Foundation of China (50579022, 50539140).

References

1. Takagi, T., Sugeno, M.: Fuzzy Identification of System and Its Application to Modeling and Control. IEEE Transactions on System Man Cybernet 15, 16–32 (1985)
2. Sugeno, M., Yasukawa, T.: A Fuzzy-logic-based Approach to Qualitative Modeling. IEEE Trans. on Fuzzy Systems 1, 7–31 (1993)
3. Du, H.P., Zhang, N.: Application of Evolving Takagi-Sugeno Fuzzy Model to Nonlinear System Identification. Applied Soft Computing 8, 676–686 (2008)
4. Chen, J.Q., Xi, Y.G., Zhang, Z.J.: A Clustering Algorithm for Fuzzy Model Identification. Fuzzy Sets and Systems 98, 319–329 (1998)
5. Sugeno, M., Takahiro, Y.: A Fuzzy-logic-based Approach to Qualitative Modeling. IEEE Transactions on Fuzzy Systems 1, 7–31 (1993)
6. Leski, J.M.: Tsk-fuzzy Modeling Based on E-insensitive Learning. IEEE Trans. on Fuzzy Systems 13, 181–193 (2005)
7. Mohamed, L.H., Vincent, W.: Takagi-sugeno Fuzzy Modeling Incorporating Input Variables Selection. IEEE Trans. on Fuzzy Systems 10, 728–742 (2002)
8. Kim, E., Lee, H., Park, M., Park, M.: A Simply Identified Sugeno-type Fuzzy Model Via Double Clustering. Information Sciences 110, 25–39 (1998)
9. Kemal, K., Özge, U., Türksen, I.B.: Comparison of Different Strategies of Utilizing Fuzzy Clustering in Structure Identification. Information Sciences 177, 5153–5162 (2007)
10. Amine, T., Frederic, L., Mohamed, K., Gilles, E.: Fuzzy Identification of A Greenhouse. Applied Soft Computing 7, 1092–1101 (2007)
11. Wu, B.L., Yu, X.H.: Fuzzy Modelling and Identification with Genetic Algorithm Based Learning. Fuzzy Sets and Systems 113, 351–365 (2000)
12. Liu, J.Z., Chen, Y.Q., Zeng, D.L., et al.: Identification of a Boiler-turbine System Using T-s Fuzzy Model. In: Proceedings of IEEE Tencon, Beijing, pp. 1278–1281 (2002)
13. Abdelazim, T., Malik, O.P.: Identification of Nonlinear Systems by Takagi-Sugeno Fuzzy Logic Grey Box modeling for Real-time Control. Control Engineering Practice 13, 1489–1498 (2005)
14. Li, Y.G., Shen, J.: T-s Fuzzy Modeling Based on V-support Vector Regression Machine. In: Proceedings of the CSEE, vol. 26, pp. 148–153 (2006)
15. Deng, L.C., Wang, G.J., Chen, H.: Fuzzy Identification on Inverse Dynamic Process of Steam Temperature Object of Boiler. In: Proceedings of the CSEE, vol. 27, pp. 76–80 (2007)
16. Wang, H.W., Gu, H.: An Integrated Algorithm for Structure Identification and Parameter Identification of Fuzzy Model. Chinese Journal of Computers 29, 1977–1981 (2006)

ADHDP for the pH Value Control in the Clarifying Process of Sugar Cane Juice

Xiaofeng Lin¹, Shengyong Lei¹, Chunning Song¹, Shaojian Song¹, and Derong Liu²

¹ College of Electrical Engineering, Guangxi University, 530004 Nanning, China
[email protected]
² Department of Electrical and Computer Engineering, University of Illinois at Chicago

Abstract. The clarifying process of sugar cane juice is an important stage of sugar production. It is characterized by strong nonlinearity, multiple constraints, time variation, large time delay, and multiple inputs. Keeping the neutralized pH value within a required range is an important task, which is of vital significance for acquiring high-quality purified juice, reducing energy consumption and raising sucrose recovery. This article uses the ADHDP (Action-Dependent Heuristic Dynamic Programming) method to optimize and control the neutralized pH value in the clarifying process of sugar cane juice. In this way, we can stabilize the clarifying process, enhance the quality of the purified juice and, in the end, enhance the quality of the product sugar. The method does not need a precise mathematical model of the controlled object, and it is trained on-line. The simulation results indicate that the method has good prospects for industrial application.

Keywords: Adaptive Dynamic Programming (ADP), ADHDP, The clarifying process of sugar cane juice, The neutralized pH value.

1 Introduction

China is the third largest sugar-producing country in the world, with a yield of 8 million tons per year. There are more than 100 sugar factories in Guangxi province, producing 5 million tons per annum, about 60% of China's total. However, the level of automation in our sugar factories is very low. With the existing technological process and equipment, a key problem is how to use all available information and adjust the processing parameters on site in real time, so as to keep production in its optimum state and enhance the quality of the purified juice. We adopt the sulfurous acid method and integrate optimization-based control technology to set up an intelligent, integrated and optimal control system, which optimizes control of the complicated industrial process and can greatly improve the technical management of clarification in the sugar factory.

The sulfitation process is the one mainly adopted at present. Clarifying the juice is a complicated physical-chemistry process, divided into the stages of predefecation, heating, neutralization reaction, and sedimentation and filtering [1][2]. First, a small amount of lime is put into the mixed juice, and at the same time phosphoric acid is added to adjust the pH to slightly acidic or neutral. The mixed juice is then heated for the first time, with the temperature controlled within the range of 55-70 °C. After that, the mixed juice is sent into the neutralization device, where lime liquid and sulfur dioxide gas are added to it. Sulfurous acid and calcium hydroxide then neutralize each other, producing calcium sulfite, which precipitates, while colloids coagulate at the same time. The neutralized juice is then heated for the second time to accelerate the reaction of phosphoric acid and sulfurous acid. Finally, the neutralized juice goes to the subsider for settling. The flowchart of sugarcane juice clarification is shown in Figure 1.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 796–805, 2008. © Springer-Verlag Berlin Heidelberg 2008



[Flowchart stages: mixed juice; pre-ash (pH 6.6~6.9); first heating (55~70 °C); sulphitation with SO2 (pH 3.0~4.5); liming with milk of lime (pH 7.0~7.4); second heating (100~105 °C); subsidation; filtering; evaporation. Streams shown: mud juice, clear juice, filtered juice, mud.]

Fig. 1. Flow diagram of the clarifying process of sugar cane juice

The main factors affecting the sulphitation neutralization are:
(1) Instability of the juice flow directly influences the following operations, such as adding lime, sulfur dioxide and phosphoric acid.
(2) A pre-ash pH value that is either too high or too low makes the sulphitation neutralization harder to control.
(3) The flows of lime milk and sulfur dioxide. If the amount of lime put into the juice is too small or the amount of sulfur dioxide is too large, the sugarcane juice becomes acidic, which affects the neutralization reaction, causes high SO2 and calcium contents in the purified juice, and inevitably decreases the purity of the juice. If too much lime or too little sulfur dioxide is put into the juice, the reducing sugar decomposes and the Color Value of the purified juice increases, which also inevitably decreases the purity of the juice.

Keeping the pH value stable in the clarifying process of sugar cane juice is therefore of vital significance, since it influences the output and quality of white sugar. To resolve the pH control problems mentioned above, we propose to use the ADHDP method to control the neutralized pH value in the clarifying process of sugar factories.


2 Adaptive Dynamic Programming

Adaptive Dynamic Programming (ADP) was proposed by Werbos [3] in the 1970s. It is a neural-network-based approximate dynamic programming method. Dynamic programming is the only method that can exactly solve long-term optimization problems for stochastic, fluctuating, general nonlinear systems, but it runs into the curse of dimensionality in practical implementation. ADP improves the numerical process and avoids the curse of dimensionality by using an action-critic structure. The clarifying process is a complex process of physical and chemical reactions, so it is very difficult to establish its mathematical model. For such a dynamic system, which contains many uncertain factors but must be maintained at a certain running status, adaptive dynamic programming is therefore a feasible scheme. Suppose that one is given a discrete-time nonlinear (time-varying) dynamical system

$$x(t+1) = F[x(t), u(t), t], \quad t = 0, 1, 2, \ldots \tag{1}$$

where $x \in R^n$ represents the state vector of the system and $u \in R^m$ denotes the control action. Suppose that one associates with this system the performance index (or cost)

$$J[x(i), i] = \sum_{k=i}^{\infty} \gamma^{k-i} U[x(k), u(k), k] \tag{2}$$

where U is called the utility function and γ is the discount factor with 0 4.


C. Wu et al.

Although experiments have shown that PSO with a constriction factor can produce better solutions than the standard PSO, it is easily trapped in local solutions. We propose a PSO algorithm with a dynamic constriction factor. When the global best position satisfies the threshold condition, the constriction factor is 1; otherwise, it is adjusted dynamically according to the difference between adjacent global best positions:

$$\eta = \begin{cases} 1, & p_{gd} < \varepsilon, \\ 1 - \left| \dfrac{p_{gb}^{k} - p_{gb}^{k-1}}{p_{gb}^{k}} \right|, & \text{else}, \end{cases} \tag{4}$$

where $\varepsilon$ is a threshold, which can be set according to the global best position. The dynamic particle swarm optimization (DPSO) is used to train the BP neural network. The fitness is the MSE of the neural network, and the positions of the particles represent the weight matrices of the network.
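The rule in equation (4) can be sketched as below. This is a hedged illustration only: the global best is summarized by a single positive scalar `g` (for instance its fitness value), which is an assumption, since the exact meaning of $p_{gd}$ and $p_{gb}$ is not recoverable from the text here; the velocity update is the standard PSO form with the factor applied to the whole update, and the inertia weight 0.7 is a hypothetical value inside the range [0.3, 1.4] listed in Table 1.

```python
import random

def dynamic_constriction(g_curr, g_prev, eps):
    """Equation (4): eta = 1 below the threshold, else 1 - |(g_k - g_{k-1}) / g_k|.

    Assumes eps > 0 so that g_curr is nonzero whenever the division is reached.
    """
    if g_curr < eps:
        return 1.0
    return 1.0 - abs((g_curr - g_prev) / g_curr)

def velocity_update(v, x, pbest, gbest, eta,
                    omega=0.7, c1=2.05, c2=2.05, rng=random):
    """Constricted velocity update for one particle (standard PSO form, assumed)."""
    out = []
    for vd, xd, pd, gd in zip(v, x, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        out.append(eta * (omega * vd
                          + c1 * r1 * (pd - xd)
                          + c2 * r2 * (gd - xd)))
    return out
```

With this reading, a large relative change between successive global bests gives a small factor and damps the step, while a stalled global best gives a factor close to 1.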

3 Case Study

The proposed neural network model was applied in Wuhan city to identify mobile micro emissions. Wuhan is a metropolis in central China, which consists of three areas: Hankou, Wuchang and Hanyang. Hankou is a commercial area, Wuchang is an upper-income and more developed area, while Hanyang is a relatively low-income and less developed area. The testing route was carefully selected to cover typical roadways, such as freeways, arterial roadways and residential roadways. Fig. 2 shows the three different areas where testing activities were conducted. The red lines indicate highway roads, the blue lines indicate arterial roads, and the green lines indicate residential roads [16]. Eight passenger cars were driven by professional drivers, which reduces the effect of different driver behaviors. The drivers were asked to drive randomly through the assigned route. To ensure that the most representative data were collected, the data were collected from 8:00 to 11:00 am and from 2:00 to 5:00 pm, from November to December 2006. The on-road measurement system used in this study consists of two parts: the OEM-2100 and GPS instrumentation.

Fig. 2. Sectors where activity study was conducted in Wuhan, China

Dynamic PSO-Neural Network


Table 1. The parameters of PSO and DPSO

Parameters   k     ω           c1     c2     r1     r2     φ     η
PSO          200   1           2      2      rand   rand
DPSO         200   [0.3,1.4]   2.05   2.05   rand   rand   4.1   0.729

Instantaneous concentrations of CO, HC and NOx were measured by the OEM-2100 system. The OEM-2100 system is designed to measure vehicle mass exhaust emissions under actual on-road driving conditions, using vehicle and engine operating data and the concentrations of pollutants in exhaust gas sampled from the tailpipe. It provides second-by-second emissions, engine rpm, temperature and other parameters. The data output has a 12-second delay. The error of the gas analyzer was less than 4% for all of the gas pollutants. The GPS instrumentation and a laptop record the vehicle speed on a second-by-second basis and also provide the position of the vehicle, i.e., latitude, longitude and altitude. A total of 600 samples were used to train and test the neural network, which has three layers: input, hidden and output. The activation functions of the hidden layer and the output layer are the log-sigmoid function and the pure linear function, respectively. For comparison of the simulation results, both the proposed PSO and the basic PSO were used to train the neural network. The parameters of PSO and DPSO are shown in Table 1.
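The way a particle position is turned into a fitness value can be sketched as follows. The sketch assumes a flat parameter vector laid out unit by unit (weights followed by bias), hypothetical layer sizes passed as arguments, and three outputs for CO, HC and NOx; none of these layout details are given in the text.

```python
import math

def n_params(n_in, n_hid, n_out):
    """Length of the particle vector for a three-layer network with biases."""
    return n_hid * (n_in + 1) + n_out * (n_hid + 1)

def forward(theta, x, n_in, n_hid, n_out):
    """Map one input vector through log-sigmoid hidden units and linear outputs."""
    pos = 0
    hidden = []
    for _ in range(n_hid):
        w, b = theta[pos:pos + n_in], theta[pos + n_in]
        pos += n_in + 1
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        s = max(-60.0, min(60.0, s))          # avoid overflow in exp
        hidden.append(1.0 / (1.0 + math.exp(-s)))
    out = []
    for _ in range(n_out):
        w, b = theta[pos:pos + n_hid], theta[pos + n_hid]
        pos += n_hid + 1
        out.append(sum(wi * hi for wi, hi in zip(w, hidden)) + b)
    return out

def fitness(theta, X, Y, n_in, n_hid, n_out):
    """MSE over all samples and outputs; the quantity each particle minimizes."""
    total, count = 0.0, 0
    for x, y in zip(X, Y):
        for o, t in zip(forward(theta, x, n_in, n_hid, n_out), y):
            total += (o - t) ** 2
            count += 1
    return total / count
```

A swarm optimizer would then treat each particle as one `theta` of length `n_params(...)` and rank particles by `fitness`.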

4 Results Analysis

Among the 600 samples, 540 were used to train the network and 60 were used to test it. Fig. 3 shows the training results with DPSO. For comparison, the simulation results and the original samples are shown together. The results indicate that the network can reflect the relationship between the input and output variables. The simulation error of HC is the lowest, followed by NOx and CO.

Fig. 3. The training results of CO, HC, and NOx

Fig. 4. The test results of CO, HC, and NOx

Fig. 5. The comparing of PSO and DPSO with training results of CO, HC, and NOx

The basic PSO was also used to train the neural network for comparison with DPSO. Figs. 5 and 6 show the simulation results for the training samples and the test samples. The results indicate that DPSO has a better generalization property than PSO. For CO and NOx, the PSO curve contains some opposite peaks that do not appear in the DPSO curve. Furthermore, the simulation errors can be used to compare the methods quantitatively. The MSE values of the neural network are calculated and shown in Table 2.


Fig. 6. Comparison of PSO and DPSO test results for CO, HC, and NOx

Table 2. The MSE of network training with PSO and DPSO

        TEST SAMPLES                     TRAINING SAMPLES
        CO       HC        NOX           CO       HC        NOX
PSO     0.0394   7.80E-07  3.10E-05      0.0252   6.40E-07  3.00E-05
DPSO    0.0304   7.70E-07  2.10E-05      0.0219   7.20E-07  2.40E-05
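To make the differences in Table 2 easier to read, the relative change in MSE from PSO to DPSO can be computed directly from the tabulated values. The sketch below only restates the table; the helper name `relative_reduction` is introduced here for illustration.

```python
# MSE values copied from Table 2: (PSO, DPSO) for each pollutant.
table2 = {
    "test": {"CO": (0.0394, 0.0304), "HC": (7.80e-07, 7.70e-07), "NOx": (3.10e-05, 2.10e-05)},
    "training": {"CO": (0.0252, 0.0219), "HC": (6.40e-07, 7.20e-07), "NOx": (3.00e-05, 2.40e-05)},
}

def relative_reduction(pso, dpso):
    """Fraction by which DPSO lowers the MSE relative to PSO (negative = higher MSE)."""
    return (pso - dpso) / pso

for split, rows in table2.items():
    for gas, (pso, dpso) in rows.items():
        print(f"{split:8s} {gas:3s} {100 * relative_reduction(pso, dpso):6.1f}%")
```

On the test samples the reduction is roughly 23% for CO and 32% for NOx, while HC changes by about 1%. On the training samples, the HC value for DPSO is slightly higher than for PSO.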

Table 3. Comparison of the average training time

                          LM     DPSO
COMPUTATIONAL TIME (s)    3.6    4.2

The generalization property was also tested, using the 60 samples that were held out of the training set. Fig. 4 shows the simulation results. As with the training samples, the HC output follows the original samples well. However, the simulation results for CO are unsatisfactory. For the test samples, the difference between PSO and DPSO is more obvious than for the training samples. Fig. 5 shows that the simulated emission rates of CO and NOx with DPSO are closer to the test samples than those with PSO.

5 Conclusions


The proposed method was also compared with the conventional training method (LM-trained). On average, the training time of the LM-trained network is shorter than that of the DPSO-trained network; Table 3 shows the average computational time of the two training methods. However, for the test sample data, the output of the DPSO-trained network is much better than that of the LM-trained network, which means that the network trained by DPSO has better generalization capability.

This study has shown the potential of a neural network model for uncovering micro mobile emissions. The model can reflect the relationship between the operational status of vehicles and their emission rates, including nitrogen oxides (NOx), hydrocarbons (HC), and carbon monoxide (CO). The training of the neural network converges when the dynamic searching particle swarm optimization method is introduced. A case study in Wuhan city was developed to test the applicability of the proposed model. The results indicate that the dynamic searching particle swarm optimization method can generate more useful solutions than the basic particle swarm optimization method. This study is a new attempt, and the proposed model can be used for analyzing the micro mobile emissions associated with different requirements of traffic management and environmental emission. Although the results suggest that this hybrid approach is applicable to practical problems that involve a large number of uncertainties, the proposed model could be further enhanced by introducing other factors (such as temperature and vehicle type) into its framework.

Acknowledgements. The authors are extremely grateful to the graduate students for their on-road experiments. This study is sponsored by the National Basic Research Program of China (2005CB724205) and the Fund of Hubei Science and Technology.


An Improvement to Ant Colony Optimization Heuristic⋆

Youmei Li1, Zongben Xu2, and Feilong Cao3

1 Department of Computer Science, Shaoxing College of Arts and Sciences, 312000 Shaoxing, China
2 Institute for Information and System Sciences, Faculty of Science, Xi’an Jiaotong University, 710049 Xi’an, Shaan’xi, China
3 Department of Information and Mathematics Sciences, College of Science, China Jiliang University, 310018 Hangzhou, China
li [email protected]

Abstract. The Ant Colony Optimization (ACO) heuristic provides a relatively easy and direct method to handle a problem’s constraints (through the so-called solution construction process), whereas in other heuristics constraint handling is normally sophisticated. However, this makes its solving process slow, because the solution construction process occupies most of its computation time. In this paper, we propose a strategy to hybridize Hopfield discrete neural networks (HDNN) with the ACO heuristic for maximum independent set (MIS) problems. Several simulation instances show that the strategy can greatly improve the performance of the ACO heuristic, not only in time cost but also in solution quality.

Keywords: Ant colony optimization, Hopfield discrete neural network, Maximum independent set.

1 Introduction

The ACO heuristic was inspired by the observation of the foraging behavior of real ant colonies: ants can often find the shortest path between a food source and their nest. It comprises three main components: heuristic information assignment, pheromone trail update, and the solution construction process. The heuristic information represents a priori problem-specific local information, the pheromone trail is the information acquired by ants about optimal solutions (it reflects their experience), and the solution construction process simulates the ants’ foraging behavior and constructs a solution of a problem based on the heuristic and pheromone information. Any set of parameters and rules for specifying these three components defines a specific ACO heuristic. The ACO heuristic has been widely used to solve NP-hard combinatorial optimization problems. The maximum independent set (MIS) problem is one of them. In past years, MIS problems have been solved by almost all modern heuristics, such as genetic algorithms (GA), simulated annealing (SA), and tabu search (TS) ([7,8,9]). The problem can be described as follows:

⋆ Supported by the Nature Science Foundation of China under Grant 60473034.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 816–825, 2008. © Springer-Verlag Berlin Heidelberg 2008


Given a graph $G = (V, E)$ with $V = \{V_1, V_2, \cdots, V_n\}$ the set of vertices and $E \subset V \times V$ the set of edges, let $n = |V|$ be the order of $G$. The maximum independent set problem is to determine a set $V^* \subseteq V$ such that for any $V_i, V_j \in V^*$ the edge $(V_i, V_j) \notin E$, and $|V^*|$ is maximum.

The typical ACO heuristic has also been applied to MIS problems. Leguizamon proposed an ant system for solving the MIS problem in [4]. Fenet and Solnon designed an ACO heuristic to solve the maximum clique problem in [3]. From the results obtained with ACO, it can be seen that the ACO heuristic has three distinguishing characteristics. First, its optimization performance is very attractive and encouraging ([2,3,4]). Second, ACO provides a relatively easy and direct method to handle a problem’s constraints (through the so-called solution construction process), whereas in other heuristics constraint handling is normally sophisticated. Third, and this is its weakness, its solving process is slow, because the solution construction process takes up most of the total computation effort. In view of this, we propose in this paper to apply Hopfield discrete neural networks (HDNN) to alleviate this overhead while preserving the optimization performance of the ACO heuristic. This is possible because the management of pheromone trails in ACO provides an effective way to choose initial points for the HDNN, and in turn the very fast convergence of the HDNN may direct the solution construction of ACO more towards a global solution and thus make the ACO search more focused.
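The MIS definition given above can be made concrete with a brute-force check. The sketch below is for illustration only: the function names and the five-vertex cycle are hypothetical, and exhaustive search is used purely because it is easy to verify, not because it is practical.

```python
from itertools import combinations

def is_independent(subset, edges):
    """True if no edge of the graph joins two vertices of `subset`."""
    chosen = set(subset)
    return not any(u in chosen and v in chosen for u, v in edges)

def maximum_independent_set(n, edges):
    """Exhaustive search from the largest size down; only viable for small n."""
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if is_independent(subset, edges):
                return set(subset)
    return set()

# Hypothetical example graph: a cycle on five vertices.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
best = maximum_independent_set(5, cycle5)
print(best, len(best))
```

The number of candidate subsets grows exponentially with the number of vertices, which is why heuristics such as ACO are of interest for this problem.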

2 Hopfield Discrete Neural Networks for MIS Problem

We noticed that the Hopfield discrete neural network (HDNN) proposed in [1] has proven to be a very fast convergent local minimizer for MIS problems. For a given graph $G$, let $A = (a_{ij})_{n \times n}$ be its adjacency matrix, defined by

$$a_{ij} = \begin{cases} 1, & \text{if vertices } V_i \text{ and } V_j \text{ are adjacent}, \\ 0, & \text{otherwise}, \end{cases}$$

and, furthermore, with any subset $V'$ of vertices of $G$ we associate the $n$-vector $U = (u_1, u_2, \cdots, u_n)$ defined by

$$u_i = \begin{cases} 1, & \text{if vertex } V_i \text{ is in } V', \\ 0, & \text{otherwise}. \end{cases}$$

Then it is known from [1] that finding an MIS of $G$ is equivalent to finding an $n$-vector $U^* = (u_1^*, u_2^*, \cdots, u_n^*)$ that solves the following minimization problem for any $K \ge 2(2n + 1)$:

$$\min_{U \in \{0,1\}^n} E(U) = -\sum_{i=1}^{n} \sum_{j=1}^{n} u_i u_j + \frac{K}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} u_i u_j. \tag{1}$$

To solve problem (1), the following Hopfield discrete neural network (HDNN) algorithm was further suggested in [1]:

$$u_i(t+1) = \operatorname{sgn}(H_i(t)) = \begin{cases} 1, & \text{if } H_i(t) > 0, \\ 0, & \text{if } H_i(t) < 0, \end{cases} \qquad i = 1, 2, \ldots, n, \tag{2}$$

where

$$W = I - \frac{K}{2} A, \qquad H_i(t) = \sum_{j=1}^{n} w_{ij} u_j(t),$$

$I$ is the identity matrix, and $U(0) = (u_1(0), u_2(0), \ldots, u_n(0))$ is any initial point in $\{0, 1\}^n$. The algorithm is to be implemented in serial mode, that is, only one state $u_i(t+1)$ is updated at a time (note that $u_i(t+1)$ remains unchanged whenever $H_i(t) = 0$). According to the analysis in [1], the HDNN algorithm is globally convergent (i.e., it converges to a stable state from any initial point) and, furthermore, every stable state is an independent set. Hence it is an effective local optimizer for the MIS problem.
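The serial update rule can be written out directly. The following sketch follows rule (2) as printed, with $W = I - (K/2)A$, and assumes a graph without self-loops. The function name `hdnn`, the sweep order (vertices visited in index order), and the four-vertex path graph are illustrative assumptions.

```python
def hdnn(adj, u0, K=None, max_sweeps=1000):
    """Serial HDNN iteration with W = I - (K/2) A, following rule (2) as printed."""
    n = len(adj)
    if K is None:
        K = 2 * (2 * n + 1)          # smallest K allowed by K >= 2(2n + 1)
    u = list(u0)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):           # serial mode: one neuron at a time
            # H_i = sum_j w_ij u_j with w_ii = 1 and w_ij = -(K/2) a_ij
            h = u[i] - (K / 2) * sum(adj[i][j] * u[j] for j in range(n))
            new = 1 if h > 0 else 0 if h < 0 else u[i]   # unchanged when H_i = 0
            if new != u[i]:
                u[i], changed = new, True
        if not changed:
            return u                 # stable state reached
    return u

# Hypothetical example: the path graph 0 - 1 - 2 - 3.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
print(hdnn(path, [1, 1, 1, 1]))   # every vertex switched on initially
print(hdnn(path, [1, 0, 1, 0]))   # an initial point that is already independent
```

In this sketch the two starting points reach stable states of different sizes, which illustrates why the choice of the initial point matters for the quality of the independent set that is found.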

3 The Improvement to ACO for MIS Problem

As long as the initial point of the HDNN is selected carefully and falls into the attraction basin of a global optimal solution, the HDNN can find that global optimal solution very quickly. Moreover, in the ACO heuristic the pheromone information collected by ants reflects the relation between vertices and the optimal solution, which can provide some guidance on the choice of the initial point. It can therefore be expected that incorporating the ACO learning mechanism into the HDNN improves the quality of the solutions found by the HDNN, while the computation expense of the ACO heuristic is decreased considerably. This observation motivates us to modify the HDNN algorithm with a carefully designed initial point assignment strategy.

Pheromone trails and heuristic information play a very important role in how the ACO heuristic learns the characteristics of a problem. Here we adopt the same approach as in [3,4] and store the pheromone trail and the heuristic information for the MIS problem on each vertex $V_i \in V$, denoted by $\tau_i$ and $\eta_i$ respectively, with the intended meaning that vertices with a higher value are more profitable.

Pheromone Trail Updating Rule

The pheromone trail $\tau_i$ guides the algorithm to explore new and promising regions of the search space. Here we suggest that for each vertex $V_i$ the pheromone trail is updated as follows:

$$\tau_i(t+1) = (1 - \rho)\,\tau_i(t) + \sum_{k} \Delta\tau_i^k, \tag{3}$$

where

$$\Delta\tau_i^k = \begin{cases} Q|S_k|, & \text{if } V_i \in S_k, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

$|S_k|$ is the cardinality of the independent set $S_k$ currently found by ant $k$, and $Q$ is a positive constant. Informally, the intensity of pheromone gives a measure of how desirable it is to insert a given element in a solution. A higher probability is assigned to an element with a strong pheromone trail. However, to prevent $\tau_i$ from growing so large that the ACO heuristic quickly stagnates, and to keep it from differing too much from $\eta_i$, we augment the pheromone update rule by

$$\tau_i(t+1) \Longleftarrow \tau_i(t+1) - \min_{V_j \in V} \tau_j(t+1) + \min_{V_j \in V} \eta_j. \tag{5}$$
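Rules (3) to (5) amount to an evaporation step, a deposit proportional to the size of each ant's independent set, and a shift that aligns the smallest pheromone value with the smallest heuristic value. A minimal sketch follows; the function name, the default values of `rho` and `Q` (taken from the settings reported in Section 4), and the toy numbers are assumptions made for illustration.

```python
def update_pheromone(tau, eta, solutions, rho=0.8, Q=0.05):
    """Apply rules (3)-(5): evaporate, deposit Q*|S_k| per ant, then shift."""
    n = len(tau)
    deposit = [0.0] * n
    for s_k in solutions:                 # one independent set per ant
        for i in s_k:
            deposit[i] += Q * len(s_k)    # rule (4)
    new_tau = [(1 - rho) * tau[i] + deposit[i] for i in range(n)]   # rule (3)
    shift = min(eta) - min(new_tau)       # rule (5): align the minima of tau and eta
    return [t + shift for t in new_tau]

# Toy data: four vertices, two ants that found the sets {0, 2} and {0, 3}.
tau = [1.0, 1.0, 1.0, 1.0]
eta = [1.0, 2.0, 2.0, 1.0]
new_tau = update_pheromone(tau, eta, [{0, 2}, {0, 3}])
print(new_tau)
```

Vertex 0, which appears in both solutions, ends up with the largest trail, and the smallest trail equals the smallest heuristic value, so the two quantities stay on a comparable scale.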

Heuristic Information ($\eta_i$)

The heuristic information $\eta_i$ is derived from the problem itself and reflects to what extent the vertex $V_i$ is likely to belong to an optimal solution. For any element $V_i \in V$, let $\deg V_i$ be the number of edges incident to $V_i$, and let $N(V_i)$ be the set of vertices adjacent to $V_i$. We proposed in [5] to assign the heuristic information according to the following formula:

$$\eta_i = \frac{\sum_{V_h \in N(V_i)} \deg V_h}{\deg V_i}. \tag{6}$$

Solution Construction Process

The solution construction process, which simulates the ants’ foraging process, aims to construct a feasible solution of the problem based on the heuristic information $\{\eta_i\}$ and the pheromone information $\{\tau_i\}$ collected by ants. Generally, the solution construction process can be described formally as the following SCP1 procedure:

solution construction procedure 1:
  ant $k$ selects an initial point $V_{i_1}$; let $S_k = \{V_{i_1}\}$;
  let $\tilde{S}_k = \{V_j \mid V_j \in V - S_k \text{ and } (V_i, V_j) \notin E \text{ for any } V_i \in S_k\}$;
  while ($\tilde{S}_k \neq \emptyset$)
    select an element $V_j$ from $\tilde{S}_k$ with the probability
      $$P_j^k = \frac{\tau_j^{\alpha} \eta_j^{\beta}}{\sum_{h \in \tilde{S}_k} \tau_h^{\alpha} \eta_h^{\beta}}; \tag{7}$$
    update $S_k \cup \{V_j\} \to S_k$;
    recompute $\tilde{S}_k$;
  end
  an independent set $S_k$ is output.

In the SCP1 procedure, the set $\tilde{S}_k$, called the feasible element list, is associated with each ant; its function is to prevent an ant from producing an infeasible solution. The parameters $\alpha$ and $\beta$ need to be chosen properly so as to reflect the relative impacts of the pheromone and heuristic information on the search process.
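Formula (6) and the SCP1 procedure can be sketched together as follows. Several details are assumptions rather than part of the description above: the initial vertex is drawn uniformly at random, isolated vertices (for which formula (6) is undefined) receive a heuristic value of 1, and the names `heuristic_info` and `scp1` are introduced only for this illustration.

```python
import random

def heuristic_info(adj):
    """eta_i = (sum of the degrees of the neighbours of V_i) / deg(V_i), formula (6)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Isolated vertices have deg = 0; formula (6) is undefined there, so this
    # sketch assigns them eta = 1 (an assumption, not from the paper).
    return [sum(deg[h] for h in range(n) if adj[i][h]) / deg[i] if deg[i] else 1.0
            for i in range(n)]

def scp1(adj, tau, eta, alpha=0.2, beta=0.5, rng=random):
    """Build one independent set vertex by vertex, as in SCP1."""
    n = len(adj)
    start = rng.randrange(n)                  # initial point (uniform choice: an assumption)
    s_k = {start}
    feasible = [j for j in range(n) if j != start and not adj[start][j]]
    while feasible:
        # Unnormalised weights of formula (7); random.choices normalises them.
        weights = [tau[j] ** alpha * eta[j] ** beta for j in feasible]
        j = rng.choices(feasible, weights=weights, k=1)[0]
        s_k.add(j)
        # Recompute the feasible element list: drop j and its neighbours.
        feasible = [h for h in feasible if h != j and not adj[j][h]]
    return s_k

# Hypothetical example: the path graph 0 - 1 - 2 - 3 with uniform pheromone.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
eta = heuristic_info(path)
print(eta, scp1(path, [1.0] * 4, eta, rng=random.Random(1)))
```

Each pass through the loop rescans the feasible list, so the cost of one construction grows with both the graph size and the size of the set being built.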


The SCP1 procedure generally works well for small problems. For large or even medium-sized problems, however, its computation expense is often prohibitively high. Since SCP1 has to be executed $m$ times in each iterative loop of the ACO heuristic, this high computation cost severely restricts its applicability and considerably lowers the overall performance of the ACO heuristic.

solution construction procedure 2:

First we explain how a carefully chosen initial point $U(0)$ can be assigned to the above HDNN algorithm in light of the heuristic and pheromone information provided by the ACO heuristic. Suppose that at loop $T$ of the ACO heuristic the heuristic and pheromone information are $\eta(T) = (\eta_1(T), \eta_2(T), \cdots, \eta_n(T))$ and $\tau(T) = (\tau_1(T), \tau_2(T), \cdots, \tau_n(T))$, and that the cardinality of the best solution found so far is $M(T)$. When $T = 0$, we can simply set $M(0)$ to an estimate of the cardinality of the MIS. The probability $P_T = (p_T(1), p_T(2), \cdots, p_T(n))$ of each vertex being in an optimal solution can then be estimated as

$$p_T(i) = \frac{\tau_i^{\alpha}(T)\,\eta_i^{\beta}(T)}{\sum_{k=1}^{n} \tau_k^{\alpha}(T)\,\eta_k^{\beta}(T)}. \tag{8}$$

Therefore, for any $T \ge 0$, we suggest selecting $U(0) = (u_1(0), u_2(0), \cdots, u_n(0)) \in \{0, 1\}^n$ with $u_i(0) = 1$ for any $i \in \{i_1, i_2, \cdots, i_{M'}\}$, where $\{i_1, i_2, \cdots, i_{M'}\}$ are successively selected from $\{1, 2, \cdots, n\}$ according to the probability

$$P(i_1, i_2, \cdots, i_{M'}) = p(i_1)\, p(i_2 \mid i_1) \cdots p(i_{M'} \mid i_1, i_2, \cdots, i_{M'-1}),$$

where

$$p(i_k \mid i_1, i_2, \cdots, i_{k-1}) = \frac{p(i_k)}{\sum_{i \neq i_1, i_2, \cdots, i_{k-1}} p(i)} \tag{9}$$

and $M' = M(T) + \kappa \cdot \mathrm{rand}(1)$. This formula states that once the vertices $\{i_1, i_2, \cdots, i_{k-1}\}$ have been selected, the next vertex is chosen from $V - \{i_1, i_2, \cdots, i_{k-1}\}$ according to the probability $p(i_k) / \sum_{i \neq i_1, i_2, \cdots, i_{k-1}} p(i)$. Here $\kappa$ is a parameter ($\kappa = 4$ is suggested), and $\mathrm{rand}(1)$ is a random number in $(0, 1)$. Such an initial point assignment shows that the heuristic attempts to improve the best known solution among its $\kappa/2$-neighbors. Combining the HDNN algorithm with the initial point assignment strategy specified above, we can state our SCP2 algorithm as follows.

SCP2 Procedure:
  input $M(T)$, $\kappa$;
  compute $P_T = (p_T(1), p_T(2), \cdots, p_T(n))$ according to (8);
  select an initial state $U(0)$ according to (9);
  implement HDNN (2);
  end;
  an independent set is output.

The SCP2 procedure can always yield a feasible solution for any MIS problem. With the ACO components specified as above, we finally summarize our new ACO heuristic as follows:
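The initial point assignment in (8) and (9) is a weighted draw without replacement: after each pick, the probabilities are renormalised over the vertices that remain. A minimal sketch is given below. Rounding $M'$ to an integer and clipping it to the range 1 to $n$ are assumptions made here so that the sketch is well defined; the function name and the toy inputs are likewise illustrative.

```python
import random

def initial_point(tau, eta, m_best, alpha=0.2, beta=0.5, kappa=4, rng=random):
    """Draw U(0) with M' ones, sampling vertices without replacement via (8)-(9)."""
    n = len(tau)
    weights = [tau[i] ** alpha * eta[i] ** beta for i in range(n)]
    total = sum(weights)
    p = [w / total for w in weights]                    # formula (8)
    # M' = M(T) + kappa * rand(1); rounding and clipping to [1, n] are assumptions.
    m_prime = max(1, min(n, round(m_best + kappa * rng.random())))
    remaining = list(range(n))
    u0 = [0] * n
    for _ in range(m_prime):
        # formula (9): renormalise p over the vertices not yet selected
        i = rng.choices(remaining, weights=[p[r] for r in remaining], k=1)[0]
        u0[i] = 1
        remaining.remove(i)
    return u0

# Toy inputs for a six-vertex graph; the best solution so far has two vertices.
tau = [1.0, 1.5, 2.0, 1.0, 3.0, 1.2]
eta = [2.0, 1.5, 1.5, 2.0, 1.0, 1.0]
u0 = initial_point(tau, eta, m_best=2, rng=random.Random(3))
print(u0, sum(u0))
```

Because the draw switches on at least as many vertices as the best solution found so far, the subsequent HDNN run starts from a state at least as large as the current best, which may contain conflicting vertices that the network then resolves.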


Hybrid ACO heuristic for MIS problem:
  Initialize;
  compute heuristic information with formula (6);
  while (termination criterion not satisfied)
    for k = 1 to m
      call solution construction procedure SCP2 to find a solution $S_k$;
      calculate the cardinality $|S_k|$ of the solution generated by ant k;
    end
    save the best solution found so far;
    update the pheromone level on all elements by (3)-(5);
  end

We refer to this hybrid ACO heuristic as ACO-HDNN below. Likewise, whenever SCP1 is used to perform solution construction in place of SCP2, the corresponding ACO heuristic will be referred to as ACO-Old. In the next section, we demonstrate through simulations that ACO-HDNN outperforms ACO-Old.
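The loop summarized above can be sketched end to end in compact form. This is an illustrative reading of the procedure rather than the authors' implementation. It assumes a fixed number of iterations instead of the termination criteria used in the simulations, takes $M(0) = 0$, rounds $M'$ to an integer, and assigns a heuristic value of 1 to isolated vertices. The HDNN step uses $W = I - (K/2)A$ as printed in rule (2).

```python
import random

def aco_hdnn(adj, m=5, iters=20, alpha=0.2, beta=0.5, rho=0.8, Q=0.05,
             kappa=4, seed=0):
    """Compact ACO-HDNN sketch; returns the best independent set found."""
    rng = random.Random(seed)
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Heuristic information, formula (6); eta = 1 for isolated vertices is an assumption.
    eta = [sum(deg[h] for h in range(n) if adj[i][h]) / deg[i] if deg[i] else 1.0
           for i in range(n)]
    tau = [rng.uniform(1, n) for _ in range(n)]       # initial pheromone in [1, n]
    half_k = 2 * n + 1                                # K/2 with K = 2(2n + 1)
    best = set()
    for _ in range(iters):
        solutions = []
        for _ant in range(m):
            # SCP2, step 1: sample M' distinct vertices with weights tau^alpha * eta^beta.
            w = [tau[i] ** alpha * eta[i] ** beta for i in range(n)]
            m_prime = max(1, min(n, round(len(best) + kappa * rng.random())))
            pool, u = list(range(n)), [0] * n
            for _pick in range(m_prime):
                i = rng.choices(pool, weights=[w[r] for r in pool], k=1)[0]
                u[i] = 1
                pool.remove(i)
            # SCP2, step 2: serial HDNN updates, rule (2), until a stable state.
            changed = True
            while changed:
                changed = False
                for i in range(n):
                    h = u[i] - half_k * sum(adj[i][j] * u[j] for j in range(n))
                    new = 1 if h > 0 else 0 if h < 0 else u[i]
                    if new != u[i]:
                        u[i], changed = new, True
            solutions.append({i for i in range(n) if u[i]})
        best = max(solutions + [best], key=len)
        # Pheromone update, rules (3)-(5).
        dep = [sum(Q * len(s) for s in solutions if i in s) for i in range(n)]
        tau = [(1 - rho) * tau[i] + dep[i] for i in range(n)]
        shift = min(eta) - min(tau)
        tau = [t + shift for t in tau]
    return best

# Hypothetical test graph: a cycle on six vertices (maximum independent set size 3).
n = 6
cycle6 = [[1 if abs(i - j) in (1, n - 1) else 0 for j in range(n)] for i in range(n)]
best = aco_hdnn(cycle6)
print(sorted(best), len(best))
```

The sketch keeps every component deliberately simple so that the data flow between the sampling step, the HDNN step, and the pheromone update is visible in one place.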

4 Simulations and Comparisons

The simulations are conducted on a set of 10 benchmark MIS problems: Graphs 1 to 7, taken from reference [1], and Graphs 8 to 10, taken from http://www.research.att.com/ njas/doc/graphs.html (1tc64.txt, 1tc128.txt and ldc128.txt). The numbers of vertices/edges of the problems range from 7/12 to 128/1471, and the global optima of all problems are known. This makes performance evaluation fair and direct. We have organized the simulations into two groups, each evaluating one aspect of the hybrid ACO heuristic:

– Effect of the new solution construction procedure SCP2;
– Global optimization ability and solution quality.

In the simulations, the parameters involved are uniformly set as follows:

(i) The parameters $\alpha$ and $\beta$. Since the parameters $\alpha$ and $\beta$ determine the relative influence of the pheromone and the heuristic information in the solution construction process, we keep the values of $\tau_i$ and $\eta_i$ of the same order of magnitude. Thus, since the heuristic information computed according to rule (6) satisfies $1 \le \eta_i \le n$, the pheromone trail $\tau_i$ is restricted to the interval $[1, n]$, and its initial value is set to a random number in $[1, n]$. We set $\alpha = 0.2$, $\beta = 0.5$ for Graphs 1 to 8, and $\alpha = 0.1$, $\beta = 0.6$ for Graphs 9 to 10.

(ii) The parameters $Q$ and $\rho$. As formula (4) shows, the parameter $Q$ should be taken very small so that $\Delta\tau_i$ increases very slowly and the ACO heuristic is able to explore the search space thoroughly. Therefore, in the simulations, we set $Q = 0.05$, $\rho = 0.8$ for Graphs 1 to 8, and $Q = 0.025$, $\rho = 0.8$ for Graphs 9 to 10. The value $\rho = 0.8$ is the same as in [2].

[Figure 1: log-scale plot of time cost versus ant number (0 to 60) for graphs G7, G8 and G9; legend: ACO−HDNN, ACO−GA]

Fig. 1. Comparison of SCP1 and SCP2 in terms of computation complexity, where the curve (in log-plot) marked as Gi corresponds to the result for graph i (i=7,8,9)

(iii) The termination criterion. Since the optimal solutions of all problems are known, we terminate the algorithm whenever it finds the optimal solution. Besides this natural termination criterion, we also terminate the algorithm whenever it exceeds a predetermined run time or number of iteration steps.

With the above parameter settings, all algorithms were written in MATLAB and run on a Pentium 4/1.6GHz/128MB personal computer. The simulation results are reported below.

Group 1: Evaluation of the effect of the new solution construction algorithm SCP2

This group of experiments compares the performance of the two solution construction procedures SCP1 and SCP2, using the two corresponding ACO heuristics ACO-Old and ACO-HDNN. The comparison is based on the computation complexity, measured by the time each heuristic spends within a fixed number of iteration steps, and on the optimization capability, measured by the percentage of runs that produce a global optimal solution within a fixed number of simulation runs. To compare the computation complexity of SCP1 and SCP2, the time (in CPU seconds) each heuristic spent within 50 iteration steps was recorded for each fixed number of ants. The results are shown in Figure 1, where Graphs 7, 8 and 9 are selected for demonstration; the simulations with all 10 problems behave in nearly the same way. We can see from Figure 1 that the time cost of SCP2 is significantly lower than that of SCP1, particularly when the number of ants is large. The comparison of SCP1 and SCP2 in terms of optimization capability is shown in Tables 1 and 2, which list the occurrence percentage of the optimal solution out of 10 independent runs of ACO-Old and ACO-HDNN for each problem. In Table 1, the sign “*” means that no result could be obtained because of the prohibitively high computation time when the number of ants exceeds 35. Examining Tables 1 and 2, we can see that for small problems (say, Graphs 1 to 7) ACO-Old and ACO-HDNN have nearly the same performance. The high computation overhead, however, makes ACO-Old very ineffective for the larger problems, Graphs 9 and 10. On the other hand, although ACO-HDNN is slightly worse than ACO-Old for Graph 7, it is much better than ACO-Old for the larger problems.

Group 2: Evaluation of global optimization capability and solution quality

This group of experiments compares the performance of ACO-HDNN against two other typical heuristics for the MIS problem: the Hopfield discrete neural network algorithm (HDNN) and the canonical genetic algorithm (CGA). In the simulations, the parameters of CGA were set as follows: the population size is 20 for Graphs 1-2, 50 for Graphs 3-7, and 100 for Graphs 8-10; the crossover probability is 0.6 and the mutation probability is 0.1 for all problems. HDNN, ACO-HDNN and CGA are clearly different types of algorithms, in the sense that the first is a single-point algorithm while the other two are population-based. To make the comparison fair, a standard trial is defined as follows. For ACO-HDNN, one standard trial is a complete run to obtain accepted solutions, starting from a fixed set of initial ants. Let $T$ be the time that one standard trial of ACO-HDNN takes (say, for Graph 9, $T$ = 512.1620 seconds, the longest run time of ACO-HDNN to find the best solution). Then one standard trial for HDNN consists of all runs that HDNN performs continuously, with reinitialization, within time $T$. Likewise, one standard trial for CGA is defined as the whole process of evolution within time $T$, starting from any given initial population. Under this convention, we have simulated the three algorithms with 10 standard trials.
The cardinality of the best solutions found by each algorithm, $|MS|$, together with the occurrence percentage of the known global optimal solutions in the 10 trials, $MS\%$, were recorded and are shown in Table 3. The index $|MS|$ is used to assess the solution quality of each algorithm, and $MS\%$ is used to evaluate its global optimization capability. In Table 3, $k$ is the minimum number of iteration steps at which the algorithm obtained the known global optimal solution, and the figure in parentheses after $k$ is the number of ants used in the simulation. From the $MS\%$ columns in Table 3, we can see that ACO-HDNN finds global optimal solutions in 100% of the trials for problems 1-6, 8 and 10, and in 90% for problems 7 and 9; HDNN finds global optimal solutions in 100% of the trials for problems 1-4 and in 80% for problems 5-6, but in none for problems 7, 9 and 10; CGA finds them with percentages ranging from 10% to 100% for the first five problems, but in none for the last five. Similarly, comparing the $|MS|$ columns with the $|MIS|$ column, we see that the best solutions found by ACO-HDNN are all global optimal solutions, while those found by HDNN are also global optimal solutions except for problems 7, 9 and 10. In contrast, except for problems 1-5, none of the best solutions found by CGA is a global optimal solution. This shows that, in both global optimization capability and solution quality, ACO-HDNN is never worse than HDNN and CGA on these benchmarks and is clearly better on the larger problems.

Table 1. Optimization capability of ACO-Old heuristic

            ant number
instance    5    10   15   20   25   30   35   40   45   50   55   60
Graph 1     100
Graph 2     100  100
Graph 3     100  100
Graph 4     100  100  100
Graph 5     100  100  100  100  100  100
Graph 6     100  100  100  100  100  100  100  100  100
Graph 7     50   50   70   100  100  100  100  100  100
Graph 8     100  100  100  100  100  100  100
Graph 9     0    0    0    0    0    0    *    *    *    *    *    *
Graph 10    0    0    0    10   10   10   *    *    *    *    *    *

Table 2. Optimization capability of ACO-HDNN heuristic

            ant number
instance    5    10   15   20   25   30   35   40   45   50   55   60
Graph 1     100
Graph 2     100  100
Graph 3     100  100
Graph 4     100  100  100
Graph 5     100  100  100  100  100  100
Graph 6     100  100  100  100  100  100  100  100  100
Graph 7     30   40   40   40   50   70   90   50   80
Graph 8     100  100  100  100  100  100  100  100  100  100  100  100
Graph 9     0    0    0    0    10   20   30   30   50   50   70   90
Graph 10    0    0    10   30   50   50   50   70   70   80   80   100

Table 3. Comparison of ACO-HDNN, CGA and HDNN

                                   ACO-HDNN              CGA           HDNN
instance    |V|   |E|    |MIS|     |MS|  MS%   k(.)      |MS|  MS%     |MS|  MS%
Graph 1     7     12     3         3     100   1(2)      3     100     3     100
Graph 2     11    28     4         4     100   1(2)      4     100     4     100
Graph 3     14    21     7         7     100   1(5)      7     80      7     100
Graph 4     16    15     9         9     100   1(5)      9     25      9     100
Graph 5     34    56     14        14    100   1(5)      14    10      14    80
Graph 6     46    66     19        19    100   1(5)      18    0       19    80
Graph 7     46    69     19        19    90    2(35)     18    0       18    0
Graph 8     64    192    20        20    100   1(5)      16    0       20    30
Graph 9     128   512    38        38    90    17(60)    30    0       35    0
Graph 10    128   1471   16        16    100   9(60)     15    0       15    0

5 Conclusion

Our main purpose in this work is to discuss an acceleration strategy for the ACO heuristic. By combining the very fast local convergence of Hopfield discrete neural networks (HDNN) with a sophisticated initial point specification strategy motivated by the pheromone and heuristic information collected in the ACO search, a new hybrid heuristic, ACO-HDNN, has been proposed for MIS problems. The proposed algorithm efficiently overcomes the time-consuming solution construction process that is common in ACO heuristics. The simulations on a set of 10 benchmark MIS problems show that not only is the computation cost of the proposed ACO heuristic decreased dramatically, but the solution quality obtained is also significantly improved.

References

1. Xu, Z.B., Hu, G.Q., Kwong, C.P.: Asymmetric Hopfield-type Networks: Theory and Applications. Neural Networks 9, 483–501 (1996)
2. Dorigo, M., Gambardella, L.M.: Ant Colony System: a Cooperative Learning Approach to the Travelling Salesman Problem. IEEE Trans. on Evolutionary Computation 1, 53–66 (1997)
3. Fenet, S., Solnon, C.: Searching for Maximum Cliques with Ant Colony Optimization. In: Raidl, G.R., Cagnoni, S., Cardalda, J.J.R., Corne, D.W., Gottlieb, J., Guillot, A., Hart, E., Johnson, C.G., Marchiori, E., Meyer, J.-A., Middendorf, M. (eds.) EvoIASP 2003, EvoWorkshops 2003, EvoSTIM 2003, EvoROB/EvoRobot 2003, EvoCOP 2003, EvoBIO 2003, and EvoMUSART 2003. LNCS, vol. 2611, pp. 236–245. Springer, Heidelberg (2003)
4. Leguizamon, G., Michalewicz, Z., Schütz, M.: An Ant System for the Maximum Independent Set Problem, http://ls11-www.cs.uni-dortmund.de/ downloads/papers/LeSz02.pdf
5. Li, Y.M., Xu, Z.B.: An Ant Colony Optimization Heuristic for Solving Maximum Independent Set Problems. In: IEEE Proceedings of Fifth International Conference on Computational Intelligence and Multimedia Applications, pp. 206–211. IEEE Press, Xi’an (2003)
6. Li, Y.M., Cao, F.L.: New Heuristic Algorithm for CPMP. In: IEEE Proceedings of 6th International Conference on Intelligent Systems Design and Application, pp. 1129–1132. IEEE Press, Jinan (2006)
7. Francois, O.: Global Optimization with Exploration/Selection Algorithms and Simulated Annealing. Annals of Applied Probability 12, 248–271 (2002)
8. Harik, G., Lobo, F., Goldberg, D.: The Compact Genetic Algorithm. IEEE Trans. on Evolutionary Computation 3, 287–297 (1999)
9. Battiti, R., Protasi, M.: Reactive Local Search for the Maximum Clique Problem. Algorithmica 29, 610–637 (2001)

Extension of a Polynomial Time Mehrotra-Type Predictor-Corrector Safeguarded Algorithm to Monotone Linear Complementarity Problems

Mingwang Zhang and Yanli Lv

College of Science, China Three Gorges University, 443002 Yichang, Hubei, China
[email protected]

Abstract. Mehrotra-type predictor-corrector algorithms are the basis of interior point method software packages. In their recent work, Salahi et al. have shown that a feasible version of a variation of Mehrotra’s second order algorithm for linear optimization may be forced to make many small steps. This motivated them to introduce certain safeguards, which allowed them to prove polynomial iteration complexity while keeping the algorithm's practical efficiency. In this paper, we extend their algorithm to monotone linear complementarity problems. Our algorithm differs from their method in the way the barrier parameter is updated and in the complexity analysis, and we also show that the polynomial complexity of our algorithm is $O\left(n^2 \log \frac{(x^0)^T s^0}{\varepsilon}\right)$.

1 Introduction

The Mehrotra-type predictor-corrector algorithm is one of the most remarkable interior point methods for linear optimization (LO), quadratic optimization (QO) and linear complementarity problems (LCPs), and it is also the basis of interior point method (IPM) software packages such as [2],[9],[10] and many others. Recently, the authors of [6] analyzed a feasible variant of Mehrotra’s second order algorithm [5]. With a numerical example presented in [7], they showed that this algorithm for linear optimization, which uses an adaptive update of the barrier parameter, may take very small steps in order to keep the iterates in a certain neighborhood of the central path, and may subsequently require a large number of iterations before it stops with a desired solution. For practical purposes, they considered an adaptive update of the barrier parameter and, to control possible bad behavior of the algorithm, they introduced some safeguards, which allowed them to prove polynomial iteration complexity while keeping the algorithm's practical efficiency. We call their method the second order Mehrotra-type predictor-corrector safeguarded algorithm. In this paper, we extend their method to monotone linear complementarity problems (MLCPs). Our algorithm differs from their method in the way the barrier parameter is updated and in the complexity analysis, and we also show that the iteration complexity of our algorithm is $O\left(n^2 \log \frac{(x^0)^T s^0}{\varepsilon}\right)$.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 826–835, 2008. © Springer-Verlag Berlin Heidelberg 2008

Extension of a Polynomial Time Mehrotra-Type


Throughout the paper we consider the linear complementarity problem: find vectors $(x, s) \in \mathbb{R}^n \times \mathbb{R}^n$ that satisfy the constraints

$$s = Mx + q, \quad xs = 0, \quad (x, s) \ge 0, \tag{1}$$

where $M \in \mathbb{R}^{n \times n}$, $q \in \mathbb{R}^n$, and $xs$ denotes the componentwise product of the vectors $x$ and $s$. The linear complementarity problem belongs to the class of NP-complete problems, since the feasibility problem of linear equations with binary variables can be described as an LCP [4]. Therefore, we cannot expect an efficient solution method for the linear complementarity problem without special properties of the matrix $M$. In this paper, we always assume that $M$ is a positive semidefinite matrix, i.e., for any $x \in \mathbb{R}^n$,

$$\sum_{i \in I_+} x_i (Mx)_i + \sum_{i \in I_-} x_i (Mx)_i \ge 0,$$

where $I_+ = \{i \mid 1 \le i \le n,\ x_i (Mx)_i \ge 0\}$ and $I_- = \{i \mid 1 \le i \le n,\ x_i (Mx)_i < 0\}$.

Linear complementarity problems have many applications in mathematical programming and equilibrium problems. Indeed, it is known that, by exploiting the first-order optimality conditions, any differentiable convex quadratic program can be formulated as a monotone LCP, and vice versa [8]. Variational inequality problems are widely used in the study of equilibria in economics, transportation planning, control theory and game theory, and they also have a close connection to the LCP. The reader can refer to [1,3] for the basic theory, algorithms, and applications.

Before going into the details of the algorithm, let us give a brief introduction to primal-dual IPMs. Without loss of generality [4], we may assume that (1) satisfies the interior point condition (IPC), i.e., there exists an $(x^0, s^0)$ such that

$$s^0 = Mx^0 + q, \quad (x^0, s^0) > 0.$$

The basic idea of primal-dual IPMs is to replace the complementarity condition by the parameterized equation $xs = \mu e$, where $e = (1, 1, \ldots, 1)^T$ and $\mu > 0$. This leads to the following system:

$$s = Mx + q, \quad xs = \mu e, \quad (x, s) > 0. \tag{2}$$

If the IPC holds and the matrix $M$ is positive semidefinite, then the system (2) has a unique solution. This solution, denoted by $(x(\mu), s(\mu))$, is called the $\mu$-center of (1). The set of $\mu$-centers with all $\mu > 0$ gives the central path of (1). Primal-dual IPMs follow the central path $\{(x(\mu), s(\mu)) \mid \mu > 0\}$ approximately and approach the solution set of the LCP (1) as $\mu$ goes to zero [4].

Now, based on [6], we briefly describe the variation of Mehrotra's second order predictor-corrector algorithm for monotone linear complementarity problems. In the predictor step, the affine scaling search direction

$$M \Delta x^a = \Delta s^a, \tag{3a}$$


M. Zhang and Y. Lv

$$s \Delta x^a + x \Delta s^a = -xs, \tag{3b}$$

is computed and the maximum step size in this direction is calculated so that $(x + \alpha_a \Delta x^a, s + \alpha_a \Delta s^a) \ge 0$. However, the algorithm does not take such a step right away. Using the information from the predictor step, it computes the corrector direction by solving the following system:

$$M \Delta x = \Delta s, \tag{4a}$$

$$s \Delta x + x \Delta s = \mu e - \Delta x^a \Delta s^a, \tag{4b}$$

where $\mu$ is defined adaptively as

$$\mu = \left(\frac{g_a}{g}\right)^2 \frac{g_a}{n}, \tag{5}$$

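where $g_a = (x + \alpha_a \Delta x^a)^T (s + \alpha_a \Delta s^a)$ and $g = x^T s$. Finally, the maximum step size $\alpha$ is computed so that the next iterate given by

$$x(\alpha) := x + \alpha \Delta x^a + \alpha^2 \Delta x, \tag{6a}$$

$$s(\alpha) := s + \alpha \Delta s^a + \alpha^2 \Delta s \tag{6b}$$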

belongs to a certain neighborhood of the central path.

Now we make a crucial observation. By the definition of $\mu$ and Lemma A.2, we have

$$\mu = \frac{\left((1 - \alpha_a) x^T s + \alpha_a^2 (\Delta x^a)^T \Delta s^a\right)^3}{n (x^T s)^2} \le \left(1 - \frac{3}{4}\alpha_a\right)^3 \mu_g.$$

Similarly, we can get $\mu \ge \left(1 - \frac{5}{4}\alpha_a\right)^3 \mu_g$. Therefore, one has

$$\left(1 - \frac{5}{4}\alpha_a\right)^3 \mu_g \le \mu \le \left(1 - \frac{3}{4}\alpha_a\right)^3 \mu_g.$$

Remark. The above estimate on $\mu$ will play a fundamental role in our later analysis.

This paper is organized as follows. First, in Section 2, we present some technical lemmas and establish the polynomial iteration complexity of the new algorithm for monotone linear complementarity problems. Then, we close the paper with concluding remarks in Section 3. For completeness, we list in the Appendix four lemmas that are used frequently in the paper.

We use the following notation throughout the paper:

$$I_+ = \{i \in I \mid \Delta x_i^a \Delta s_i^a \ge 0\}, \quad I_- = \{i \in I \mid \Delta x_i^a \Delta s_i^a < 0\},$$

$$\Omega_{++} = \{(x, s) \in \mathbb{R}^n \times \mathbb{R}^n \mid s = Mx + q,\ (x, s) > 0\},$$

$$\Delta x^c = \Delta x^a + \Delta x, \quad \Delta s^c = \Delta s^a + \Delta s, \quad \mu_g = \frac{x^T s}{n}.$$
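The predictor and corrector computations described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`solve`, `matvec`, `newton_direction`, `max_feasible_step`, `predictor_corrector_directions`) and the 2-by-2 test problem are hypothetical, and the Newton systems (3) and (4) are reduced to $(S + XM)\Delta x = r$ by eliminating $\Delta s = M\Delta x$.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def newton_direction(M, x, s, rhs):
    """Solve M dx = ds and s*dx + x*ds = rhs, i.e. (S + X M) dx = rhs."""
    n = len(x)
    A = [[x[i] * M[i][j] + (s[i] if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    dx = solve(A, rhs)
    return dx, matvec(M, dx)


def max_feasible_step(x, s, dx, ds):
    """Largest alpha in (0, 1] with (x + alpha*dx, s + alpha*ds) >= 0."""
    ratios = [-v / d for v, d in zip(x + s, dx + ds) if d < 0]
    return min([1.0] + ratios)


def predictor_corrector_directions(M, x, s):
    n = len(x)
    g = sum(xi * si for xi, si in zip(x, s))
    # Predictor: system (3) with right-hand side -xs.
    dxa, dsa = newton_direction(M, x, s, [-xi * si for xi, si in zip(x, s)])
    alpha_a = max_feasible_step(x, s, dxa, dsa)
    g_a = sum((xi + alpha_a * a) * (si + alpha_a * b)
              for xi, si, a, b in zip(x, s, dxa, dsa))
    mu = (g_a / g) ** 2 * g_a / n  # adaptive update (5)
    # Corrector: system (4) with right-hand side mu*e - dxa*dsa.
    dx, ds = newton_direction(M, x, s, [mu - a * b for a, b in zip(dxa, dsa)])
    return dxa, dsa, alpha_a, mu, dx, ds


# Hypothetical test problem: M is positive definite and (x, s) is strictly feasible.
M = [[2.0, 1.0], [1.0, 2.0]]
q = [-1.0, -1.0]
x = [1.0, 1.0]
s = [m + qi for m, qi in zip(matvec(M, x), q)]
dxa, dsa, alpha_a, mu, dx, ds = predictor_corrector_directions(M, x, s)
```

On this small example the computed $\mu$ can be compared directly with the two-sided bound derived in the crucial observation above.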

2 Algorithm and Complexity Analysis

It is well known that wide-neighborhood IPMs perform much better in implementation than their small-neighborhood counterparts. Therefore, in this paper, we consider the negative infinity norm neighborhood defined by

$$\mathcal{N}_\infty^-(\gamma) := \{(x, s) \in \Omega_{++} : x_i s_i \ge \gamma \mu_g,\ \forall i \in I\}, \tag{7}$$

where $\gamma \in (0, 1)$ is a constant independent of $n$. The following theorem shows that there always exists a positive step size in the predictor step.

Theorem 2.1. Assume that the current iterate $(x, s) \in \mathcal{N}_\infty^-(\gamma)$ and $(\Delta x^a, \Delta s^a)$ is the solution of (3). Then the maximum feasible step size, $\alpha_a \in (0, 1]$, satisfies

$$\alpha_a \ge \frac{2\sqrt{\gamma^2 + n\gamma} - 2\gamma}{n}. \tag{8}$$
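The right-hand side of (8) is the positive root of the quadratic $n\alpha^2 + 4\gamma\alpha - 4\gamma = 0$ that appears in the proof. A short numerical check, given here only as an illustrative sketch (the function name and the sample values of $\gamma$ and $n$ are assumptions), confirms this and shows how the bound shrinks as $n$ grows.

```python
import math


def predictor_step_lower_bound(gamma, n):
    """Right-hand side of (8): positive root of n*a^2 + 4*gamma*a - 4*gamma = 0."""
    return (2.0 * math.sqrt(gamma * gamma + n * gamma) - 2.0 * gamma) / n


gamma = 0.1  # sample value, chosen only for illustration
bounds = {n: predictor_step_lower_bound(gamma, n) for n in (2, 10, 100, 1000)}
residuals = {n: n * a * a + 4 * gamma * a - 4 * gamma for n, a in bounds.items()}
for n, a in bounds.items():
    print(n, a, residuals[n])
```

The printed residuals should be zero up to rounding, and the bound decays roughly like $2\sqrt{\gamma / n}$ for large $n$.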

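Proof. Since $(\Delta x^a, \Delta s^a)$ satisfies the system (3), we have

$$(x_i + \alpha \Delta x_i^a)(s_i + \alpha \Delta s_i^a) = (1 - \alpha) x_i s_i + \alpha^2 \Delta x_i^a \Delta s_i^a.$$

Noting that $(x, s) \in \mathcal{N}_\infty^-(\gamma)$ and using Lemma A.2, we have

$$(1 - \alpha) x_i s_i + \alpha^2 \Delta x_i^a \Delta s_i^a \ge (1 - \alpha) \gamma \frac{x^T s}{n} - \frac{1}{4} \alpha^2 x^T s.$$

Thus, to show $(x + \alpha \Delta x^a, s + \alpha \Delta s^a) \ge 0$, it suffices to prove that $n\alpha^2 + 4\gamma\alpha - 4\gamma \le 0$. Therefore, it is easy to show that the maximum feasible step size $\alpha_a \in (0, 1]$ satisfies

$$\alpha_a \ge \frac{2\sqrt{\gamma^2 + n\gamma} - 2\gamma}{n}.$$

The following technical lemma will be used in the step size estimation for the corrector step of the new algorithm.

Lemma 2.2. Assume that the current iterate $(x, s) \in \mathcal{N}_\infty^-(\gamma)$, $(\Delta x^a, \Delta s^a)$ is the solution of (3) and $(\Delta x, \Delta s)$ is the solution of (4) with $\mu \ge 0$. Then

$$\|\Delta x^a \Delta s\| \le \left(\frac{1}{2\gamma}\left(\frac{\mu}{\mu_g}\right)^2 + \frac{\mu}{4\gamma\mu_g} + \frac{1}{32} + \frac{1}{16\gamma}\right)^{1/2} n^{3/2} \mu_g$$

and

$$\|\Delta x \Delta s^a\| \le \left(\frac{1}{2\gamma}\left(\frac{\mu}{\mu_g}\right)^2 + \frac{\mu}{4\gamma\mu_g} + \frac{1}{32} + \frac{1}{16\gamma}\right)^{1/2} n^{3/2} \mu_g.$$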

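Proof. Let $D = (X^{-1}S)^{1/2}$. Then, multiplying both sides of (3b) by $(XS)^{-1/2}$, we have

$$D\Delta x^a + D^{-1}\Delta s^a = -(xs)^{1/2}.$$

Squaring both sides of this equation, one gets

$$\|D\Delta x^a\|^2 + 2(\Delta x^a)^T \Delta s^a + \|D^{-1}\Delta s^a\|^2 = x^T s.$$

Since $(\Delta x^a)^T \Delta s^a = (\Delta x^a)^T M \Delta x^a \ge 0$ by (3a) and the positive semidefiniteness of $M$, we have

$$\|D\Delta x^a\|^2 + \|D^{-1}\Delta s^a\|^2 \le x^T s.$$

This implies

$$\|D\Delta x^a\| \le \sqrt{n\mu_g}, \quad \|D^{-1}\Delta s^a\| \le \sqrt{n\mu_g}. \tag{9}$$

Applying the same procedure to (4b), one has

$$D\Delta x + D^{-1}\Delta s = (XS)^{-1/2}(\mu e - \Delta X^a \Delta s^a)$$

and

$$\max\{\|D\Delta x\|, \|D^{-1}\Delta s\|\} \le \|(XS)^{-1/2}(\mu e - \Delta X^a \Delta s^a)\|.$$

Thus, noting $(x, s) \in \mathcal{N}_\infty^-(\gamma)$ and using Lemma A.2, we have

$$\|(XS)^{-1/2}(\mu e - \Delta X^a \Delta s^a)\| \le \left(\frac{1}{2\gamma}\left(\frac{\mu}{\mu_g}\right)^2 + \frac{\mu}{4\gamma\mu_g} + \frac{1}{32} + \frac{1}{16\gamma}\right)^{1/2} n \sqrt{\mu_g}. \tag{10}$$

Therefore, one has

$$\|D\Delta x\| \le \left(\frac{1}{2\gamma}\left(\frac{\mu}{\mu_g}\right)^2 + \frac{\mu}{4\gamma\mu_g} + \frac{1}{32} + \frac{1}{16\gamma}\right)^{1/2} n \sqrt{\mu_g}, \tag{11}$$

$$\|D^{-1}\Delta s\| \le \left(\frac{1}{2\gamma}\left(\frac{\mu}{\mu_g}\right)^2 + \frac{\mu}{4\gamma\mu_g} + \frac{1}{32} + \frac{1}{16\gamma}\right)^{1/2} n \sqrt{\mu_g}. \tag{12}$$

Finally, using the fact that $D$ is diagonal and inequalities (9)-(12), we have

$$\|\Delta x^a \Delta s\| \le \|D\Delta x^a\| \, \|D^{-1}\Delta s\| \le \left(\frac{1}{2\gamma}\left(\frac{\mu}{\mu_g}\right)^2 + \frac{\mu}{4\gamma\mu_g} + \frac{1}{32} + \frac{1}{16\gamma}\right)^{1/2} n^{3/2} \mu_g,$$

and analogously one has the second result of the lemma. This completes the proof.

Lemma 2.3. Assume that the current iterate $(x, s) \in \mathcal{N}_\infty^-(\gamma)$ and $(\Delta x, \Delta s)$ is the solution of (4) with $\mu \ge 0$. Then we have

$$\|\Delta x \Delta s\| \le \frac{1}{2\sqrt{2}}\left(\frac{n\mu^2}{\gamma\mu_g} + \frac{n\mu}{2\gamma} + \frac{n\mu_g}{16} + \frac{n^2\mu_g}{16\gamma}\right).$$

Proof. By multiplying both sides of (4b) by $(XS)^{-1/2}$, then by Lemma A.4 and (10), we may obtain the desired result.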


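In the following theorem we obtain a bound on $\alpha_a$, the violation of which might imply a very small step size in the corrector step of the new algorithm.

Theorem 2.4. Assume that the current iterate $(x, s) \in \mathcal{N}_\infty^-(\gamma)$, $(\Delta x^a, \Delta s^a)$ is the solution of (3) and $(\Delta x, \Delta s)$ is the solution of (4) with $\mu$ as defined by (5). Then, for $\alpha_a \in (0, 1]$ satisfying

$$\alpha_a < \frac{4}{5}\left(1 - \left(\frac{\gamma\left(t + \frac{1}{2}\right)}{1 - \gamma}\right)^{1/3}\right), \tag{13}$$

the maximum step size in the corrector step is strictly positive, where

$$t = \max_{i \in I} \frac{\Delta x_i^a \Delta s_i^a}{x_i s_i}. \tag{14}$$

Proof. Our aim now is to find a step size $\alpha \in (0, 1]$ such that

$$x_i(\alpha) s_i(\alpha) \ge \gamma \mu_g(\alpha), \quad \forall i \in I, \tag{15}$$

where $\mu_g(\alpha) = \frac{x(\alpha)^T s(\alpha)}{n}$. By simple computation, we obtain

$$x_i(\alpha) s_i(\alpha) = (1 - \alpha) x_i s_i + \alpha^2 \mu + \alpha^3 (\Delta x_i^a \Delta s_i + \Delta x_i \Delta s_i^a) + \alpha^4 \Delta x_i \Delta s_i$$

and

$$\mu_g(\alpha) = (1 - \alpha)\mu_g + \alpha^2 \mu + \frac{\alpha^3}{n}\left((\Delta x^a)^T \Delta s + \Delta x^T \Delta s^a\right) + \frac{\alpha^4}{n} \Delta x^T \Delta s.$$

Therefore, if we assume that $\Delta x_i \Delta s_i < 0$, which is the worst case, then, for $i \in I_+$, we have

$$x_i(\alpha) s_i(\alpha) \ge (1 - \alpha) x_i s_i + \alpha^2 \mu - \alpha^2 \Delta x_i^a \Delta s_i^a + \alpha^3 \Delta x_i^c \Delta s_i^c.$$

On the other hand, by (9) and (10) in the proof of Lemma 2.2, we have

$$(\Delta x^a)^T \Delta s + \Delta x^T \Delta s^a \le \|D\Delta x^a\| \, \|D^{-1}\Delta s\| + \|D\Delta x\| \, \|D^{-1}\Delta s^a\| \le 2\left(\frac{1}{2\gamma} + \frac{1}{4\gamma} + \frac{1}{32} + \frac{1}{16\gamma}\right)^{1/2} n^{3/2} \mu_g \le \frac{3\sqrt{6}}{4} n^{3/2} \gamma^{-1/2} \mu_g. \tag{16}$$

Similarly, by Lemma 2.3, it is easily verified that

$$\Delta x^T \Delta s \le \frac{1}{2\sqrt{2}}\left(\frac{1}{2\gamma}\left(\frac{\mu}{\mu_g}\right)^2 + \frac{\mu}{4\gamma\mu_g} + \frac{1}{32} + \frac{1}{16\gamma}\right) n^2 \mu_g \le \frac{27}{64\sqrt{2}} n^2 \gamma^{-1} \mu_g. \tag{17}$$

By the above discussion, to prove (15), it suffices to show that

$$(1 - \alpha) x_i s_i + \alpha^2 \mu - \alpha^2 \Delta x_i^a \Delta s_i^a + \alpha^3 \Delta x_i^c \Delta s_i^c \ge \gamma(1 - \alpha)\mu_g + \gamma\alpha^2\mu + \frac{3\sqrt{6}}{4} \alpha^3 \gamma^{1/2} n^{1/2} \mu_g + \frac{27}{64\sqrt{2}} \alpha^4 n \mu_g.$$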


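Since $\Delta x_i^a \Delta s_i^a \le t x_i s_i$ by the definition of $t$ given in (14), it is sufficient to have $\alpha \ge 0$ for which

$$(1 - \alpha - \alpha^2 t) x_i s_i + \alpha^2 \mu + \alpha^3 \Delta x_i^c \Delta s_i^c \ge \gamma(1 - \alpha)\mu_g + \gamma\alpha^2\mu + \frac{3\sqrt{6}}{4} \alpha^3 \gamma^{1/2} n^{1/2} \mu_g + \frac{27}{64\sqrt{2}} \alpha^4 n \mu_g. \tag{18}$$

Using the fact that $x_i s_i \ge \gamma\mu_g$ and $t \le \frac{1}{4}$, together with Lemma A.1, we find that (18) holds for $\alpha \le \frac{2\gamma^{1/2}}{9 n^{1/2}}$ whenever

$$(1 - \alpha - \alpha^2 t)\gamma\mu_g + \alpha^2 \mu + \alpha^3 \Delta x_i^c \Delta s_i^c \ge \gamma(1 - \alpha)\mu_g + \gamma\alpha^2\mu + \frac{1}{2}\alpha^2\gamma\mu_g,$$

or

$$(1 - \gamma)\mu - \gamma\left(t + \frac{1}{2}\right)\mu_g + \alpha \Delta x_i^c \Delta s_i^c \ge 0. \tag{19}$$

Finally, we also assume the worst case $\Delta x_i^c \Delta s_i^c \le 0$ and $\mu = \left(1 - \frac{5}{4}\alpha_a\right)^3 \mu_g$; then one has to have $(1 - \gamma)\mu - \gamma\left(t + \frac{1}{2}\right)\mu_g > 0$. Therefore, when

$$\alpha_a < \frac{4}{5}\left(1 - \left(\frac{\gamma\left(t + \frac{1}{2}\right)}{1 - \gamma}\right)^{1/3}\right),$$

this definitely holds. The corresponding inequalities in (15) for $i \in I_-$ also hold for these values of $\alpha_a$, which completes the proof.

To obtain an explicit strictly positive lower bound for the maximum step size in the corrector step, instead of (13), we let

$$\alpha_a = \frac{4}{5}\left(1 - \left(\frac{\gamma}{1 - \gamma}\right)^{1/3}\right) \quad \text{and} \quad \mu = \left(1 - \frac{5}{4}\alpha_a\right)^3 \mu_g = \frac{\gamma}{1 - \gamma}\mu_g.$$

Using Lemmas 2.2 and 2.3, we have the following corollaries, which are useful in the next theorem.

Corollary 2.5. Let $\mu = \frac{\gamma}{1 - \gamma}\mu_g$, then

$$\|\Delta x^a \Delta s\| \le \frac{3}{2\sqrt{\gamma}} n^{3/2} \mu_g, \quad \|\Delta x \Delta s^a\| \le \frac{3}{2\sqrt{\gamma}} n^{3/2} \mu_g.$$

Proof. By Lemma 2.2 and noting $\gamma < \frac{1}{5}$, we have

$$\|\Delta x^a \Delta s\| \le \left(\frac{1}{2\gamma}\left(\frac{\mu}{\mu_g}\right)^2 + \frac{\mu}{4\gamma\mu_g} + \frac{1}{32} + \frac{1}{16\gamma}\right)^{1/2} n^{3/2} \mu_g \le \frac{3}{2\sqrt{\gamma}} n^{3/2} \mu_g.$$

Similarly, we may prove the other inequality. This completes the proof.

Corollary 2.6. Let $\mu = \frac{\gamma}{1 - \gamma}\mu_g$, then $\|\Delta x \Delta s\| \le \frac{1}{8\sqrt{2}\sqrt{\gamma}} n^2 \mu_g$.

Proof. The proof is analogous to the proof of Corollary 2.5.

Theorem 2.7. Assume that the current iterate $(x, s) \in \mathcal{N}_\infty^-(\gamma)$, $(\Delta x^a, \Delta s^a)$ is the solution of (3) and $(\Delta x, \Delta s)$ is the solution of (4) with $\mu = \frac{\gamma}{1 - \gamma}\mu_g$. Then

$$\alpha \ge \frac{2\gamma^2}{3n^2}.$$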


Proof. Our goal is to find the maximum step size $\alpha \in (0, 1]$ in the corrector step such that (15) holds. Following an analysis similar to that of Theorem 2.4, it is sufficient to have $\alpha \ge 0$ for which

$$-\left(t + \frac{1}{2}\right)\gamma\mu_g + (1 - \gamma)\mu + \alpha \Delta x_i^c \Delta s_i^c \ge 0.$$

Using the fact that $\mu = \frac{\gamma}{1 - \gamma}\mu_g$ and $t \le \frac{1}{4}$, the previous inequality holds whenever $\frac{1}{4}\gamma\mu_g + \alpha \Delta x_i^c \Delta s_i^c \ge 0$. By the definition of $\Delta x^c$ and $\Delta s^c$, Corollaries 2.5 and 2.6, and the relation given in Lemma A.3, $\|\Delta x^a \Delta s^a\| \le \frac{1}{\sqrt{8}} x^T s = \frac{1}{2\sqrt{2}} n\mu_g$, it is sufficient to have

$$\frac{1}{4}\gamma\mu_g - \alpha\left(\frac{1}{2\sqrt{2}} n\mu_g + \frac{3}{\sqrt{\gamma}} n^{3/2}\mu_g + \frac{1}{8\sqrt{2}\sqrt{\gamma}} n^2 \mu_g\right) \ge 0.$$

This inequality definitely holds for $\alpha = \frac{2\gamma^2}{3n^2}$, which completes the proof.

Now we can outline a variant of the Mehrotra-type predictor-corrector safeguarded algorithm for monotone linear complementarity problems.

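Algorithm 1
Input:
    A proximity parameter $\gamma \in \left(0, \frac{1}{5}\right)$;
    an accuracy parameter $\varepsilon > 0$;
    a starting point $(x^0, s^0) \in \mathcal{N}_\infty^-(\gamma)$.
begin
    while $x^T s \ge \varepsilon$ do
    begin
        Predictor Step
            Solve (3) and compute the maximum step size $\alpha_a$ such that $(x(\alpha_a), s(\alpha_a)) \in \Omega_{++}$;
        end
        begin
        Corrector Step
            If $\alpha_a \ge 0.1$, then solve (4) with $\mu = \left(\frac{g_a}{g}\right)^2 \frac{g_a}{n}$ and compute the maximum step size $\alpha$ such that $(x(\alpha), s(\alpha)) \in \mathcal{N}_\infty^-(\gamma)$;
                If $\alpha < \frac{2\gamma^2}{3n^2}$, then solve (4) with $\mu = \frac{\gamma}{1 - \gamma}\mu_g$ and compute the maximum step size $\alpha$ such that $(x(\alpha), s(\alpha)) \in \mathcal{N}_\infty^-(\gamma)$;
                end
            else
                Solve (4) with $\mu = \frac{\gamma}{1 - \gamma}\mu_g$ and compute the maximum step size $\alpha$ such that $(x(\alpha), s(\alpha)) \in \mathcal{N}_\infty^-(\gamma)$;
        end
        Set $(x, s) = (x(\alpha), s(\alpha))$.
    end
end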
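The safeguard logic of the corrector step in Algorithm 1 can be isolated from the linear algebra. The sketch below is an illustration only: `corrector_step` is a hypothetical callback that is assumed to solve (4) for a given $\mu$ and return the maximum step size $\alpha$ keeping the iterate in the neighborhood (7); the function names and the sample numbers are likewise assumptions.

```python
def in_neighborhood(x, s, gamma):
    """Membership test for the neighborhood (7), assuming s = M x + q already holds."""
    n = len(x)
    mu_g = sum(xi * si for xi, si in zip(x, s)) / n
    return all(xi > 0 and si > 0 and xi * si >= gamma * mu_g
               for xi, si in zip(x, s))


def safeguarded_corrector(alpha_a, g_a, g, n, gamma, corrector_step):
    """Choose mu and the step size alpha as in the corrector step of Algorithm 1.

    corrector_step(mu) is a hypothetical callback: it solves (4) for the given
    mu and returns the maximum alpha that keeps the iterate in the neighborhood.
    """
    mu_g = g / n
    mu_safe = gamma / (1.0 - gamma) * mu_g
    threshold = 2.0 * gamma ** 2 / (3.0 * n ** 2)
    if alpha_a >= 0.1:
        mu = (g_a / g) ** 2 * g_a / n  # adaptive update (5)
        alpha = corrector_step(mu)
        if alpha >= threshold:
            return mu, alpha
    # Safeguard: alpha_a < 0.1, or the adaptive choice gave too short a step.
    return mu_safe, corrector_step(mu_safe)


# Sample numbers (assumed): n = 2, g = x^T s = 4, g_a = 0.96, gamma = 0.1.
accepted = safeguarded_corrector(1.0, 0.96, 4.0, 2, 0.1, lambda mu: 0.5)
fallback = safeguarded_corrector(
    1.0, 0.96, 4.0, 2, 0.1, lambda mu: 1e-9 if mu < 0.1 else 0.3)
small_predictor = safeguarded_corrector(0.05, 3.8, 4.0, 2, 0.1, lambda mu: 0.3)
```

In the first call the adaptive $\mu$ is accepted; in the other two the safeguarded value $\frac{\gamma}{1-\gamma}\mu_g$ is used, either because the adaptive step fell below $\frac{2\gamma^2}{3n^2}$ or because $\alpha_a < 0.1$.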


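In the following theorem we get an upper bound for the total number of iterations.

Theorem 2.8. Algorithm 1 stops after at most $O\left(n^2 \log \frac{(x^0)^T s^0}{\varepsilon}\right)$ iterations with a solution for which $x^T s \le \varepsilon$.

Proof. If $\alpha_a \ge 0.1$ and $\alpha \ge \frac{2\gamma^2}{3n^2}$, then by (16), (17) and $\gamma < \frac{1}{5}$, we have

$$\mu_g(\alpha) = (1 - \alpha)\mu_g + \alpha^2\mu + \alpha^3 n^{-1}\left((\Delta x^a)^T \Delta s + \Delta x^T \Delta s^a\right) + \alpha^4 n^{-1} \Delta x^T \Delta s$$

$$\le \left(1 - \alpha + \alpha^2\left(1 - \frac{3}{4}\alpha_a\right)^3 + \frac{1}{2}\alpha^2\gamma^2\right)\mu_g \le \left(1 - \frac{13\gamma^2}{25n^2}\right)\mu_g.$$

If $\alpha_a \ge 0.1$ and $\alpha < \frac{2\gamma^2}{3n^2}$, then

$$\mu_g(\alpha) \le \left(1 - \alpha + \alpha^2\frac{\mu}{\mu_g} + \frac{1}{2}\alpha^2\gamma^2\right)\mu_g \le \left(1 - \frac{2 - 4\gamma - \gamma^2 - \gamma^3}{3(1 - \gamma)n^2}\right)\mu_g.$$

Finally, if $\alpha_a < 0.1$, then again we have

$$\mu_g(\alpha) \le \left(1 - \alpha + \alpha^2\frac{\gamma}{1 - \gamma} + \frac{1}{2}\alpha^2\gamma^2\right)\mu_g \le \left(1 - \frac{2 - 4\gamma - \gamma^2 - \gamma^3}{3(1 - \gamma)n^2}\right)\mu_g.$$

This completes the proof.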
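The step from a per-iteration reduction to the iteration bound is standard: if every iteration satisfies $\mu_g(\alpha) \le (1 - c/n^2)\mu_g$ for some constant $c > 0$, then $k \ge \frac{n^2}{c}\log\frac{(x^0)^T s^0}{\varepsilon}$ iterations suffice, because $\log(1 - \delta) \le -\delta$. The sketch below illustrates this numerically; the function names and the sample values of $c$, $n$, the initial gap, and $\varepsilon$ are assumptions made for the example.

```python
import math


def iteration_bound(n, gap0, eps, c):
    """Sufficient iteration count k >= (n^2 / c) * log(gap0 / eps)."""
    return max(0, math.ceil(n * n / c * math.log(gap0 / eps)))


def iterations_needed(n, gap0, eps, c):
    """Iterations used when the gap shrinks by exactly (1 - c / n^2) each time."""
    k, gap = 0, gap0
    while gap > eps:
        gap *= 1.0 - c / (n * n)
        k += 1
    return k


gamma = 0.1
c = 13 * gamma ** 2 / 25  # constant from the first case of the proof
used = iterations_needed(5, 10.0, 1e-6, c)
bound = iteration_bound(5, 10.0, 1e-6, c)
print(used, bound)
```

The simulated count never exceeds the bound, and both scale like $n^2 \log\frac{(x^0)^T s^0}{\varepsilon}$.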

3 Conclusion

In this paper, we have extended the second order Mehrotra-type predictor-corrector safeguarded algorithm to monotone LCPs. Since monotone LCPs are a generalization of linear programming, we lose the orthogonality of the vectors $\Delta x$ and $\Delta s$, so the analysis is different from the one in the linear programming case [6]. An $O\left(n^2 \log \frac{(x^0)^T s^0}{\varepsilon}\right)$ worst-case iteration complexity bound of our new algorithm is established.

Acknowledgement

This project is supported by the Natural Science Foundation of the Educational Commission of Hubei Province of China (No. D200613009).

References

1. Billups, S.C., Murty, K.G.: Complementarity Problems. Journal of Computational and Applied Mathematics 124, 303–318 (2000)
2. Czyzyk, J., Mehrotra, S., Wagner, M., Wright, S.J.: PCX: An Interior-point Code for Linear Programming. Optimization Methods and Software 11–12, 397–430 (1999)
3. Cottle, R.W., Pang, J.S., Stone, R.E.: The Linear Complementarity Problem. Academic Press Inc., San Diego (1992)
4. Kojima, M., Megiddo, N., Noma, T., Yoshise, A.: A Unified Approach to Interior Point Algorithms for Linear Complementarity Problems. In: Kojima, M., Noma, T., Megiddo, N., Yoshise, A. (eds.) A Unified Approach to Interior Point Algorithms for Linear Complementarity Problems. LNCS, vol. 538, Springer, Heidelberg (1991)
5. Mehrotra, S.: On the Implementation of a (primal-dual) Interior Point Method. SIAM Journal on Optimization 2, 575–601 (1992)
6. Salahi, M., Mahdavi-Amiri, N.: Polynomial Time Second Order Mehrotra-type Predictor-corrector Algorithms. Applied Mathematics and Computation 183, 646–658 (2006)
7. Salahi, M., Peng, J., Terlaky, T.: On Mehrotra Type Predictor-corrector Algorithms. Technical Report, Advanced Optimization Lab, Department of Computing and Software, McMaster University (2005)
8. Wright, S.J.: Primal-dual Interior-point Methods. SIAM, Philadelphia (1997)
9. Zhang, Y.: Solving Large-scale Linear Programs by Interior Point Methods Under the Matlab Environment. Optimization Methods and Software 10, 1–31 (1999)
10. Zhu, X., Peng, J., Terlaky, T., Zhang, G.: On Implementing Self-regular Proximity Based Feasible IPMs. Technical Report, Advanced Optimization Lab, Department of Computing and Software, McMaster University (2003)

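Appendix

In this section, we provide some technical results. We state the first lemma without proof since the proof is similar to that of Lemma A.1 in [6].

Lemma A.1. Let $(\Delta x^a, \Delta s^a)$ be the solution of (3), then

$$\Delta x_i^a \Delta s_i^a \le \frac{1}{4} x_i s_i, \quad i \in I_+.$$

Lemma A.2. Let $(\Delta x^a, \Delta s^a)$ be the solution of (3), then one has

$$\sum_{i \in I_+} \Delta x_i^a \Delta s_i^a \le \frac{x^T s}{4}, \quad \sum_{i \in I_-} |\Delta x_i^a \Delta s_i^a| \le \frac{1}{4} x^T s.$$

Lemma A.3. Let $(\Delta x^a, \Delta s^a)$ be the solution of (3), then one has

$$\|\Delta x^a \Delta s^a\| \le \frac{1}{2\sqrt{2}} x^T s.$$

Lemma A.4. Let $(\Delta x, \Delta s)$ be the solution of (4) with $\mu > 0$, then

$$\|\Delta x \Delta s\| \le \frac{1}{2\sqrt{2}} \left\|(xs)^{-1/2}(\mu e - \Delta x^a \Delta s^a)\right\|^2$$

and

$$\Delta x^T \Delta s \le \frac{1}{4} \left\|(xs)^{-1/2}(\mu e - \Delta x^a \Delta s^a)\right\|^2.$$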

QoS Route Discovery of Ad Hoc Networks Based on Intelligence Computing Cong Jin1 and Shu-Wei Jin2 1

Department of Computer Science, Central China Normal University, Wuhan 430079, P.R. China 2 No.1 Middle School Attached to Central China Normal University, Wuhan 430223, P.R. China [email protected]

Abstract. To address the dynamic environment and the limited node capabilities of Ad Hoc networks (AHN), a new quality of service (QoS) route discovery method, named QGAANT, based on the quantum genetic algorithm (QGA) and the ant colony algorithm (ACA), is presented. A probability-based route switching policy reduces the network overhead caused by flooding. QGA is used to adjust the search direction so as to avoid the problem of stagnant routes. Simulations were carried out to compare QGAANT, an algorithm based on ant colony only, and a traditional on-demand routing algorithm. The simulation results show that QGAANT is better suited to AHN.

Keywords: QoS, route discovery, Ad Hoc network, QGA.

1 Introduction

An AHN is usually a self-organizing and self-configuring "multi-hop" network that does not require any fixed infrastructure. In an AHN, all nodes are dynamic and arbitrarily located, and are required to relay packets for other nodes in order to deliver data across the network. For a link request, QoS routing means discovering, establishing, and maintaining an available path that has sufficient resources to satisfy the given requirements. Because AHN nodes are dynamic and arbitrarily located, network performance degrades if a flooding-based route discovery method is adopted. In a dynamic network environment, the status information of the network is hard to collect. Therefore, the convergence speed of the algorithm is more important than its accuracy. ACA can adapt to a dynamic network environment, but it suffers from weak local search ability, slow convergence, and stagnation. To overcome these weaknesses, we propose a new heuristic route discovery method named QGAANT. It integrates a distributed route discovery mechanism based on ACA with a local search optimization mechanism based on QGA.

2 QoS Routing Model of AHN

In AHN, the nodes communicate via wireless links. All the links in the network are bi-directional. They are characterized by a bandwidth and a transmission delay.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 836–844, 2008. © Springer-Verlag Berlin Heidelberg 2008


Although wireless communication is subject to radio effects such as noise, fading, and interference, these are not taken into consideration; we assume that the links are reliable. We can describe an AHN as follows. Let a weighted undirected graph G = (V, E) represent an AHN, where V is the finite set of network nodes and E is the set of connections. For a link e ∈ E, we have e = (vi, vj), where vi ∈ V, vj ∈ V, i ≠ j, and vi and vj are adjacent nodes. In an AHN, the sets V and E change dynamically. If s ∈ V is a source node, then a ∈ {V − {s}} is the corresponding destination node. For every link e ∈ E, the corresponding metric functions are defined as follows.

Bandwidth function B(e): E → R+;
Delay function D(e): E → R+;
Cost function C(e): E → R+;
Packet drop rate function PL(e): E → R+.

Because of the dynamic nature of an AHN, for ∀e ∈ E the parameters B(e), D(e), C(e), and PL(e) may all change continuously, so the following assumptions are made. (1) A node can discover its neighbor nodes by using some mechanism; (2) a node can record in real time the state information of its adjacent links. So, for a unicast QoS path l = (vi, vi+1, ..., vk), (vi, vi+1, ..., vk) ∈ V, k > i, the QoS parameters D(l) and C(l) are additive, B(l) is a max-min parameter, and PL(l) is a multiplicative parameter. These parameters are all computable.

Definition 1. In an AHN, if s is a source node and a is a destination node, then the unicast QoS route discovery problem from s to a is defined as searching for a feasible route l from s to a that satisfies the following condition:

(D(l) ≤ d) ∧ (C(l) ≤ c) ∧ (PL(l) ≤ p) ∧ (B(l) ≥ b)    (1)

where the constants d, c, p, b are the QoS bounds on delay, cost, packet drop rate, and bandwidth, respectively, according to the demand of the link request.
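The path-level parameters and the feasibility test (1) can be sketched as follows. This is an illustration under stated assumptions: the dictionary keys and function names are hypothetical, and the multiplicative packet drop rate is taken to be one minus the product of per-link delivery probabilities, which assumes independent losses on each hop (the text only states that PL(l) is multiplicative).

```python
from math import prod


def path_metrics(links):
    """Aggregate per-link QoS values along a path.

    links: list of dicts with keys 'bandwidth', 'delay', 'cost', 'loss'.
    """
    return {
        "delay": sum(e["delay"] for e in links),          # additive
        "cost": sum(e["cost"] for e in links),            # additive
        "bandwidth": min(e["bandwidth"] for e in links),  # bottleneck link
        # Assumed composition rule: independent drops on each hop.
        "loss": 1.0 - prod(1.0 - e["loss"] for e in links),
    }


def satisfies_qos(links, d, c, p, b):
    """Condition (1): D(l) <= d, C(l) <= c, PL(l) <= p and B(l) >= b."""
    m = path_metrics(links)
    return (m["delay"] <= d and m["cost"] <= c
            and m["loss"] <= p and m["bandwidth"] >= b)


# Hypothetical two-hop path.
path = [
    {"bandwidth": 10.0, "delay": 2.0, "cost": 1.0, "loss": 0.1},
    {"bandwidth": 5.0, "delay": 3.0, "cost": 4.0, "loss": 0.2},
]
metrics = path_metrics(path)
```

A route discovery procedure would call `satisfies_qos` on each candidate path returned by the ants and keep only the feasible ones.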

3 Definition and Description of QoS Routing Based on ACA

The ACA is an algorithm for finding optimal paths that is based on the behavior of ants searching for food. At first, the ants wander randomly. When an ant finds a source of food, it walks back to the colony leaving "markers" (pheromones) that show the path has food. When other ants come across the markers, they are likely to follow the path with a certain probability. If they do, they then populate the path with their own markers as they bring the food back. As more ants find the path, it gets stronger until there are a few streams of ants traveling to various food sources near the colony. Because the ants drop pheromones every time they bring food, shorter paths are more likely to be stronger [1], hence optimizing the "solution". In the meantime, some ants are still randomly scouting for closer food sources. Because the ant colony works on a very dynamic system, the ACA works very well in graphs with changing topologies.


A. QoS Routing for AHN Based on ACA

The ACA takes its ideas from the ant colony paradigm. When we apply them to route discovery, the artificial ants are exploration packets. From the pheromone statistics on the paths and a heuristic factor [2], we can calculate an ant's transition probability on every route. After many iterations, the route with the largest pheromone value is an optimized solution. Based on the ACA principle [1] and the route model of AHN using ACA [3], we obtain the transition probability function of a data packet from node vi to node vj as follows:

$$p_{v_i v_j}^k(t) = \begin{cases} \dfrac{\tau_{v_i v_j}^p(t)\, \eta_{v_i v_j}^q(t)}{\sum_{\mu \in V_{allow}} \tau_{v_i \mu}^p(t)\, \eta_{v_i \mu}^q(t)}, & v_j \in V_{allow} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $p_{v_i v_j}^k(t)$ is the transition probability of data packet k from node vi to node vj at time t, $V_{allow} \subset V$ is the set of nodes that the ant can reach in the next step, $\tau_{v_i v_j}(t)$ denotes the pheromone on the path (vi, vj) at time t, $\eta_{v_i v_j}(t)$ is a heuristic factor, and the parameters p and q express the importance of the pheromone and of the heuristic factor of the selected route, respectively.

Definition 2. After n transition steps, ant k arrives at the destination node, and l is its path. The pheromone updating formulas on the path l are then

$$\tau_{v_i v_j}(t + n) = \rho\, \tau_{v_i v_j}(t) + \Delta\tau_{v_i v_j}(t, t + n) \quad \text{and} \quad \Delta\tau_{v_i v_j}(t, t + n) = \sum_{k \in S} \Delta\tau_{v_i v_j}^k(t, t + n),$$

where $\Delta\tau_{v_i v_j}(t, t + n)$ denotes the increment of the pheromone on the path (vi, vj) in a loop, and ρ is a control factor for limiting infinite increase of the pheromone, 1< ρ T , then ant k is killed.

Step 2: When ant k (k ≤ m) arrives at node v', then
    If v' ≠ destination ID
        If v' ∈ k.ergodicNodes
            After deleting the first v' from k.ergodicNodes, ant k is retransmitted according to the transition probability $p_{v'\mu}^k(t)$ in PRtable_{v'},
        else ant k is transmitted according to the transition probability $p_{v'\mu}^k(t)$ in PRtable_{v'};
    else go to Step 3.
Step 3: If v' = destination ID, the inverse path l'_k is calculated from the forward path l_k recorded in the queue k.ergodicNodes. Ant k is sent to the source node vs.
Step 4: When the source node vs receives ant k, every node on the path l_k of PRtable is updated according to the methods of Section 3.1. Path l_k is added to the available path stack AVstack_{vs}.
Step 5: When the source node has received all m ants, or this loop has exceeded the time threshold, the following operations are executed.
    1) If the stack contains paths satisfying the QoS constraint condition, let the path with the maximum pheromone value be avPath, then go to Step 6. Otherwise, call QGA (see Section 4).
    2) If the avPath' obtained after running QGA satisfies the QoS constraint condition, all nodes on the path avPath' of PRtable are updated by calculating the maximum pheromone average value, then go to Step 1. Otherwise, let avPath = avPath'.
Step 6: Output avPath.
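The transition rule (2) and the pheromone update of Definition 2 used by the steps above can be sketched as follows. The function names, the dictionary-based data layout, and the default exponents are hypothetical choices made only for illustration; the paper does not prescribe them.

```python
def transition_probabilities(tau, eta, allowed, p=1.0, q=2.0):
    """Rule (2): probability of moving from the current node to each allowed node.

    tau, eta: dicts mapping a neighbor to the pheromone and heuristic values
    of the link from the current node to that neighbor.
    Nodes outside `allowed` implicitly get probability 0.
    """
    weights = {v: (tau[v] ** p) * (eta[v] ** q) for v in allowed}
    total = sum(weights.values())
    if total == 0:
        return {}
    return {v: w / total for v, w in weights.items()}


def update_pheromone(tau_old, deposits, rho):
    """Definition 2: tau(t + n) = rho * tau(t) + sum of the ants' increments."""
    return rho * tau_old + sum(deposits)


# Hypothetical values for a node with two reachable neighbors.
probs = transition_probabilities(
    tau={"b": 1.0, "c": 3.0}, eta={"b": 1.0, "c": 1.0},
    allowed=["b", "c"], p=1.0, q=1.0)
new_tau = update_pheromone(2.0, [0.5, 0.25], rho=0.5)
```

An ant at the current node would then draw its next hop from `probs`, for example with `random.choices`.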

4 Quantum Genetic Algorithm

A. Preliminary Knowledge

There are two significant differences between a classical computer and a quantum computer. The first is in storing information: classical bits versus quantum qubits. The second is the quantum mechanical feature known as entanglement, which allows a measurement on some qubits to affect the value of other qubits. A classical bit is in one of two states, 0 or 1. A quantum qubit can be in a superposition of the 0 and 1 states. This is often written as α|0⟩ + β|1⟩, where α and β are the probability amplitudes associated with the 0 state and the 1 state. Therefore, the values |α|² and |β|² represent the probability of seeing a 0 and a 1, respectively. As such, the equation |α|² + |β|² = 1 is a physical requirement. The interesting part is that until the qubit is measured it is effectively in both states. For example, any calculation using this qubit produces as an answer a superposition combining the results of the calculation having been applied to a 0 and to a 1. Thus, the calculation for both the 0 and the 1 is performed simultaneously. Unfortunately, when the result is examined only one value can be seen. The probability of measuring the answer corresponding to an original 0 bit is |α|² and the probability of measuring the answer corresponding to an original 1 bit is |β|².

B. Quantum Bit Encode

In QGA, qubits are used for storing and expressing genes. The state of a gene may be 1, 0, or any superposition of them. In other words, the gene no longer expresses a single definite piece of information; it may include all possible information, and any operation on the gene affects all possible information simultaneously. Therefore, a multi-state gene can be encoded by multiple qubits as follows:

$$l_j^t = \begin{pmatrix} \alpha_1^t & \alpha_2^t & \alpha_3^t & \cdots & \alpha_n^t \\ \beta_1^t & \beta_2^t & \beta_3^t & \cdots & \beta_n^t \end{pmatrix} \tag{4}$$

where $l_j^t$ is the j-th chromosome of the t-th generation population, and n is the number of qubits of the chromosome.

C. Quantum Rotation Gate

The quantum rotation gate carries out the evolution operation, and the qubits are updated by

$$\begin{pmatrix} \alpha_i^t \\ \beta_i^t \end{pmatrix} = \begin{pmatrix} \cos\theta_i & -\sin\theta_i \\ \sin\theta_i & \cos\theta_i \end{pmatrix} \begin{pmatrix} \alpha_i \\ \beta_i \end{pmatrix} = U(t) \begin{pmatrix} \alpha_i \\ \beta_i \end{pmatrix} \tag{5}$$

where (αi, βi) is the i-th qubit in the chromosome, θi is the angle of rotation, and its size and direction are determined by the updating strategy [5] given in Table 1.

Table 1. Updating Strategy of Rotating Angle

                                      S(αi βi)
li   ηi   f(l) ≥ f(η)   Δθi       αi βi > 0   αi βi < 0   αi = 0   βi = 0
0    0    False         0          0           0           0        0
0    0    True          0          0           0           0        0
0    1    False         0          0           0           0        0
0    1    True          0.05π     -1          +1          ±1        0
1    0    False         0.01π     -1          +1          ±1        0
1    0    True          0.025π    +1          -1           0       ±1
1    1    False         0.005π    +1          -1           0       ±1
1    1    True          0.025π    +1          -1           0       ±1


The angle of rotation is θi = S(αi βi)Δθi, where S(αi βi) and Δθi give the direction and the size of the rotation, respectively, and their values are determined by the updating strategy in Table 1. The updating process is as follows. After measuring chromosome $l_j^t$, we calculate its fitness value f(li) and compare it with the fitness value f(ηi) of the current target. According to the comparison result, the corresponding qubit of $l_j^t$ is updated. If f(li) < f(ηi), the fitness of li is not good enough, and the probability amplitude pair (αi, βi) is evolved in the direction that favors the appearance of ηi; otherwise, (αi, βi) evolves in the direction that favors the appearance of li.

D. QGA Detail Description

Based on the above analysis, QGA can be described in detail as follows.

Algorithm 2. QGA's detail description
Input: AVstack_{vs}.
Output: avPath'.
Initialization. Let the energy function f(·) be the fitness function in QGA. The m stack elements of AVstack_{vs} compose the initial population $L(t_0) = \{l_1^{t_0}, l_2^{t_0}, \ldots, l_m^{t_0}\}$.
Step 1: For the chromosomes $l_i^{t_0}$, we let

$$l_i^{t_0} = \begin{pmatrix} \alpha_1^{t_0} & \alpha_2^{t_0} & \cdots & \alpha_n^{t_0} \\ \beta_1^{t_0} & \beta_2^{t_0} & \cdots & \beta_n^{t_0} \end{pmatrix}, \quad i = 1, 2, \ldots, m \tag{6}$$

where both $\alpha_j^{t_0}$ and $\beta_j^{t_0}$ are complex constants, and $|\alpha_j^{t_0}|^2 + |\beta_j^{t_0}|^2 = 1$. All $(\alpha_j^{t_0}, \beta_j^{t_0})$ are initialized to $(1/\sqrt{2}, 1/\sqrt{2})$. Meanwhile, $\alpha_j^{t_0} = (\alpha_{j1}^{t_0}, \alpha_{j2}^{t_0}, \ldots, \alpha_{ju}^{t_0})$ and $\beta_j^{t_0} = (\beta_{j1}^{t_0}, \beta_{j2}^{t_0}, \ldots, \beta_{ju}^{t_0})$, j = 1, 2, ..., n, where all $\alpha_j^{t_0}$ and $\beta_j^{t_0}$ are binary strings of length u.
Step 2: By equation (6), we get R(t), where $R(t) = (\alpha_1^t, \alpha_2^t, \ldots, \alpha_n^t)$, or $R(t) = (\beta_1^t, \beta_2^t, \ldots, \beta_n^t)$.
Step 3: Use the fitness function f(·) to evaluate every chromosome of the population $L(t) = \{l_1^t, l_2^t, \ldots, l_m^t\}$, and keep the optimal chromosome. If a satisfactory solution is obtained, QGA stops; go to Step 6. Otherwise, go to Step 4.
Step 4: Use the quantum rotation gate U(t) to update L(t).
Step 5: Let t = t + 1, go to Step 2.
Step 6: Output avPath'.
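A single-qubit version of the measurement and of the rotation-gate update (5), with the angle looked up from Table 1, can be sketched as follows. This is an illustration under stated assumptions: amplitudes are taken to be real, the names `DELTA_THETA`, `NEGATIVE_FIRST`, `rotation_angle`, `rotate`, and `measure` are hypothetical, and wherever Table 1 lists ±1 the sketch arbitrarily picks +1.

```python
import math
import random

# Magnitudes delta_theta_i from Table 1, keyed by (l_i, eta_i, f(l) >= f(eta)).
# Rows not listed here have delta_theta_i = 0.
DELTA_THETA = {
    (0, 1, True): 0.05 * math.pi,
    (1, 0, False): 0.01 * math.pi,
    (1, 0, True): 0.025 * math.pi,
    (1, 1, False): 0.005 * math.pi,
    (1, 1, True): 0.025 * math.pi,
}
# Rows whose sign columns read (-1, +1, +-1, 0); the remaining nonzero rows
# read (+1, -1, 0, +-1).
NEGATIVE_FIRST = {(0, 1, True), (1, 0, False)}


def rotation_angle(l_bit, eta_bit, l_at_least_as_fit, alpha, beta):
    """theta_i = S(alpha_i * beta_i) * delta_theta_i according to Table 1."""
    key = (l_bit, eta_bit, l_at_least_as_fit)
    delta = DELTA_THETA.get(key, 0.0)
    if delta == 0.0:
        return 0.0
    first = key in NEGATIVE_FIRST
    product = alpha * beta
    if product > 0:
        sign = -1 if first else 1
    elif product < 0:
        sign = 1 if first else -1
    elif alpha == 0:
        sign = 1 if first else 0   # Table 1 gives +-1; +1 is an arbitrary choice
    else:                          # beta == 0
        sign = 0 if first else 1   # Table 1 gives +-1; +1 is an arbitrary choice
    return sign * delta


def rotate(alpha, beta, theta):
    """Apply the rotation gate (5) to one qubit with real amplitudes."""
    return (math.cos(theta) * alpha - math.sin(theta) * beta,
            math.sin(theta) * alpha + math.cos(theta) * beta)


def measure(alpha, rng=random):
    """Collapse a qubit: 0 with probability alpha^2, otherwise 1."""
    return 0 if rng.random() < alpha * alpha else 1


amp = 1.0 / math.sqrt(2.0)  # initial amplitudes used in Step 1
theta = rotation_angle(1, 1, True, amp, amp)
new_alpha, new_beta = rotate(amp, amp, theta)
```

Because the gate is a rotation, the updated amplitudes still satisfy the normalization requirement, so the qubit can be measured again in the next generation.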

5 Experiment Results

NS2 is a widely used discrete event simulator for wired and wireless networks. In this paper, it is used to build the simulated AHN environment. The environment parameters of the experiment are listed in Table 2.

Table 2. Environment Parameters

Parameter Item           Parameter Collocation
Plane Size               500 × 500
Node Number              100
MAC Layer                IEEE 802.11, Equipotent Self-Organizing Scheme
Transmit Layer           User Data Packet Protocols
Block Length             512
Transferability Model    Waypoint Model

To evaluate the performance of the proposed QGAANT, we compare it with the locally optimizing algorithm ANT and with the routing protocol AODV, a routing protocol for ad hoc networks designed with mobile wireless devices in mind and based on a flooding route discovery method. The comparison results are shown in Figs. 1-3. From Fig. 1, we see that all three algorithms have an obvious initial delay after the route request and before the route information is established. Afterwards, the delay of AODV is the largest and decreases the most slowly. Since QGA is introduced into the route discovery method, it increases the convergence rate of the route discovery algorithm and so avoids the problem of stagnant routes. The mobility performance of the three algorithms in AHN is evaluated in Fig. 2. We find that the packet delivery success rates of all three algorithms decrease to different degrees as the mobile speed increases. Among them, the packet delivery success rate of QGAANT is the highest, and its performance degrades the most slowly.

Fig. 1. Average Packet Delay Compare Results

Fig. 2. Packet Deliver Success Rate Compare Results

Fig. 3. Handling Capacity Compare Results


Under different network loads, the AHN handling capacity is compared in Fig. 3. From Fig. 3 we can see that 1) when the load is small, the performances of the three algorithms are similar; 2) when the load is large, the handling capacity of QGAANT decreases more slowly than that of the other two algorithms. According to these experiments, QGAANT is better suited to AHN. Based on these experiments, we also compare the traditional GA with QGA when used in route discovery; the results are given in Table 3. To remove stochastic interference, all results are averages over 7000 runs.

Table 3. QGA and Tradition GA Results Comparison

Maximal iteration steps                                             20        50        100
Tradition GA   Success Rate                                         0.6782    0.9195    0.9762
               The iteration steps of finding optimal solution      6.7410    10.6465   11.2034
QGA            Success Rate                                         0.7853    0.9598    0.9945
               The iteration steps of finding optimal solution      4.8304    9.4721    9.7803

From the experimental results, we find that QGA needs fewer iteration steps than GA to find the optimal solution and that its success rate is higher than GA's. Therefore, QGA has better performance.

6 Conclusion

The special environment of AHN places higher design demands on QoS routing algorithms. In this paper, to address the existing problems of route methods using ACA, we propose a new heuristic route discovery method based on QGA and ANT. The new method overcomes the weaknesses of ACA and increases the convergence rate of the existing route discovery algorithm. The experimental results show that the proposed method performs clearly better than the traditional methods, which indicates that QGAANT adapts better to the AHN environment.

References

1. Dorigo, M., Gambardella, L.M.: Ant colonies for the traveling salesman problem. Biosystems 43, 73–81 (1997)
2. Dorigo, M., Di Caro, G., Gambardella, L.M.: Ant algorithms for discrete optimization. Artificial Life 5(2), 137–172 (1999)
3. Shen, C.C., Jaikaeo, C.: Ad Hoc multicast routing algorithm with swarm intelligence. ACM Mobile Networks and Applications Journal 10(1-2), 47–59 (2005)
4. Grover, L.: A fast quantum mechanical algorithm for database search. In: Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, pp. 212–219 (1996)
5. Yang, J., Zhuang, Z.Q.: Research of quantum genetic algorithm and its application in blind source separation. Journal of Electronics 20(1), 62–68 (2003)

Memetic Algorithm-Based Image Watermarking Scheme

Qingzhou Zhang, Ziqiang Wang, and Dexian Zhang

School of Information Science and Engineering, Henan University of Technology, Zheng Zhou 450052, China
[email protected]

Abstract. Watermarking technology is the most efficient way to protect the ownership of multimedia data. In this paper, a novel image watermarking scheme using the Discrete Wavelet Transform (DWT) and the memetic algorithm (MA) is introduced. The watermark is embedded into the subband coefficients of a subimage that is extracted from the original image by using the DWT, and watermark extraction is efficiently performed via the memetic algorithm. Experimental results show that the proposed watermarking scheme produces an almost invisible difference between the watermarked image and the original image, and is robust to common image processing operations.

Keywords: Memetic algorithm, Image watermarking, Discrete wavelet transform.

1 Introduction

Due to the availability and popularity of the Internet and the powerful computing capabilities of modern computers and recent programs, digital multimedia, including image, audio, and video, can be modified, copied, and redistributed easily. The success of the Internet allows multimedia data to be distributed widely and effortlessly. As a result, the protection of such content has recently become an important issue, and copyright protection schemes are needed to address these problems. One solution is digital watermarking, i.e., the insertion of information into the image data in such a way that the added information is not visible and yet resistant to image alterations. A variety of techniques has already been proposed; an overview of the subject can be found in [1].

Image watermarking techniques can be broadly classified into two categories: spatial-domain and transform-domain techniques. Spatial-domain techniques hide the watermark information in an image by modifying the pixels of the image. Their embedding and extraction algorithms are easy to devise and therefore have low computational complexity. Their main disadvantage is that they are fragile under image manipulation or malicious attacks. To overcome this disadvantage, many image watermarking methods have been developed in frequency domains, such as the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT), and the Discrete Wavelet Transform (DWT). Nowadays, the DWT is widely used in image processing applications, such as watermarking verification and the upcoming image compression standard JPEG2000. Recently, a large number of image watermarking methods developed in the wavelet domain have achieved prominent results [2]. As a result, more research effort has been devoted to DWT-based watermarking in the last few years. In this paper, we propose an image watermarking scheme based on the memetic algorithm (MA) in the DWT domain.

It is today widely accepted that robust image watermarking techniques should largely exploit the characteristics of the human visual system (HVS) [3] in order to hide a robust watermark more effectively. A widely used technique exhibiting a strong similarity to the way the HVS processes images is the Discrete Wavelet Transform (DWT). As a matter of fact, the next-generation image coding standard JPEG2000 will strongly rely on the DWT for obtaining good quality images at low coding rates. In addition, the memetic algorithm (MA) [4], which has emerged recently as a new meta-heuristic derived from nature, has attracted the interest of many researchers. The algorithm has been successfully applied to several complex optimization problems and to neural network training. Nevertheless, the use of the memetic algorithm for image watermarking remains a research area that few people have explored.

In this paper, a novel watermarking scheme using the Discrete Wavelet Transform (DWT) and the memetic algorithm (MA) is introduced. The watermark is embedded into the discrete multiwavelet transform (DMT) coefficients larger than some threshold values, and watermark extraction is efficiently performed via the memetic algorithm. Experimental results show that the proposed watermarking scheme results in an almost invisible difference between the watermarked image and the original image, and is robust to common image processing operations and JPEG lossy compression.

The remainder of this paper is organized as follows. In the next section, we briefly review the memetic algorithm (MA). In Section 3, the watermark embedding algorithm in the wavelet domain is described. In Section 4, the MA-based watermark extraction method is proposed. Experimental results are given in Section 5. Finally, the paper ends with conclusions and future research directions.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 845–853, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Brief Review of Memetic Algorithm (MA)

Memetic algorithms (MAs) are inspired by Dawkins' notion of a meme [4,5]. Formally, an MA is defined as an evolutionary algorithm (EA) [5] that includes one or more local search phases within its evolutionary cycle. The name is inspired by the concept of a meme, which represents a unit of cultural evolution that can exhibit local refinement. MAs are similar to genetic algorithms (GAs), but the elements that form a chromosome are called memes, not genes. The unique aspect of MAs is that all chromosomes and offspring are allowed to gain some experience, through a local search, before being involved in the evolutionary process. MAs have been shown to be more efficient and effective than traditional EAs for solving complex optimization problems. The pseudocode for the MA procedure [4,6] is as follows:

  Begin;
    Generate random population of P solutions (chromosomes);
    For each individual i ∈ P: calculate fitness(i);
    For each individual i ∈ P: do local-search(i);
    For i = 1 to number of generations;
      Randomly select an operation (crossover or mutation);
      If crossover;
        Select two parents ia and ib at random;
        Generate an offspring ic = crossover(ia, ib);
        ic = local-search(ic);
      Else if mutation;
        Select one chromosome i at random;
        Generate an offspring ic = mutation(i);
        ic = local-search(ic);
      End if;
      Calculate the fitness of the offspring;
      If ic is better than the worst chromosome then replace the worst chromosome with ic;
    Next i;
    Check if termination = true;
  End.

From the above pseudocode of the memetic algorithm (MA), we can observe that the parameters involved in MA are the same four parameters used in GA (population size, number of generations, crossover rate, and mutation rate), in addition to a local-search mechanism.
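The pseudocode above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the real-valued encoding, the initialization range, the one-point crossover, the Gaussian mutation, and every function and parameter name are hypothetical choices made for the example.

```python
import random


def local_search(x, fitness, step=0.1):
    """Coordinate-wise refinement: keep a perturbation only if fitness improves."""
    best = list(x)
    best_fit = fitness(best)
    for j in range(len(best)):
        d = step * random.random()      # incremental value d = a * rand()
        trial = list(best)
        trial[j] += d
        trial_fit = fitness(trial)
        if trial_fit > best_fit:        # otherwise the change is discarded
            best, best_fit = trial, trial_fit
    return best


def crossover(a, b):
    """One-point crossover (an assumed operator; requires len(a) >= 2)."""
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:]


def mutation(a, sigma=0.5):
    """Gaussian perturbation of one randomly chosen variable (assumed operator)."""
    child = list(a)
    j = random.randrange(len(child))
    child[j] += random.gauss(0.0, sigma)
    return child


def memetic_algorithm(fitness, dim, pop_size=20, generations=200,
                      crossover_rate=0.8, seed=0):
    """Maximize `fitness` with the steady-state loop of the pseudocode."""
    random.seed(seed)
    # Random initial population; the range [-5, 5] is an arbitrary assumption.
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    pop = [local_search(ind, fitness) for ind in pop]
    for _ in range(generations):
        if random.random() < crossover_rate:
            a, b = random.sample(pop, 2)
            child = crossover(a, b)
        else:
            child = mutation(random.choice(pop))
        child = local_search(child, fitness)
        # Replace the worst chromosome only if the offspring is better.
        worst = min(range(pop_size), key=lambda k: fitness(pop[k]))
        if fitness(child) > fitness(pop[worst]):
            pop[worst] = child
    return max(pop, key=fitness)
```

For example, `memetic_algorithm(lambda x: -sum(v * v for v in x), dim=3)` searches for the maximum of a negated sphere function. Because an offspring only ever replaces a worse chromosome, the best fitness in the population never decreases.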

3 The Watermark Embedding Algorithm

The watermark insertion is performed in the discrete multiwavelet transform (DMT) domain by applying a three-level multiwavelet decomposition based on the well-known DGHM multiwavelet from [7] and the optimal prefilters from [8]. The reason is that DGHM multiwavelets simultaneously possess orthogonality, compact support, an approximation order of 2, and symmetry. Since the approximation subband contains the high-energy components of the image, we do not embed the watermark in this subband, to avoid visible degradation of the watermarked image. Furthermore, to increase the robustness of the watermark, it is not embedded in the subbands of the finest scale, which contain only low-energy components. In the other subbands, we choose all DMT coefficients that are greater than the embedding threshold $T_e$. These coefficients are denoted $V_i$ and are modified according to the following equation:

$$V_i^w = V_i + \beta(f_1, f_2)\, V_i\, w_i \tag{1}$$

where $i$ runs over all DWT coefficients greater than $T_e$, $V_i$ denotes the corresponding DWT coefficients of the original image, and $V_i^w$ denotes the coefficients of the watermarked image. The variable $w_i$ denotes the watermark signal, which is generated from a Gaussian distribution with zero mean and unit variance, and $\beta(f_1, f_2)$ is the embedding strength used to control the watermark energy to be inserted. The watermarking algorithm is adaptive by making use of human visual system (HVS) characteristics, which increases robustness and invisibility at the same time. The HVS model $\beta(f_1, f_2)$ can be represented by [9,10]:

$$\beta(f_1, f_2) = 5.05\, e^{-0.178 (f_1 + f_2)} \left( e^{0.1 (f_1 + f_2)} - 1 \right) \tag{2}$$

where $f_1$ and $f_2$ are the spatial frequencies (cycles/visual angle). However, the watermark will be spatially localized at high-resolution levels of the host image, which makes the watermark more robust. Finally, the inverse DWT is applied to form the watermarked image.
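Equations (1) and (2) can be illustrated with a short Python sketch. It assumes that the transform coefficients of one subband are already available as a flat list and that a single pair of spatial frequencies applies to that subband; the function names and the plain-list representation are hypothetical and not part of the original scheme.

```python
import math
import random


def hvs_strength(f1, f2):
    """Embedding strength beta(f1, f2) from Eq. (2)."""
    s = f1 + f2
    return 5.05 * math.exp(-0.178 * s) * (math.exp(0.1 * s) - 1.0)


def make_watermark(length, seed=0):
    """Watermark samples drawn from a zero-mean, unit-variance Gaussian."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


def embed(coeffs, watermark, f1, f2, t_e):
    """Apply Eq. (1) to every coefficient greater than the threshold t_e.

    Returns the modified coefficients and the indices that were changed.
    The watermark must hold at least as many samples as there are
    selected coefficients.
    """
    beta = hvs_strength(f1, f2)
    marked = list(coeffs)
    selected = [i for i, v in enumerate(coeffs) if v > t_e]
    for k, i in enumerate(selected):
        marked[i] = coeffs[i] + beta * coeffs[i] * watermark[k]
    return marked, selected
```

Calling `embed([1.0, 10.0, 20.0], make_watermark(2), 1.0, 1.0, 5.0)` leaves the first coefficient untouched, because it is below the threshold, and scales the other two by `1 + beta * w`.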

4 The MA-Based Watermark Detection Algorithm

The MA-based watermark detection algorithm consists of two steps. First, the suspected image is decomposed into three levels using the DMT. We choose all the DMT coefficients greater than the detection threshold $T_d$ from all subbands except the approximation subband and the subbands of the finest scale. For robustness, the detection threshold $T_d$ must be strictly larger than the embedding threshold $T_e$, since some coefficients that were originally below $T_e$ may become greater than $T_e$ due to image manipulations. Then, the watermark is extracted from the selected DMT coefficients.

The MA realizes efficient watermark extraction by selecting superior memes, mating and mutating them, and allowing all chromosomes and offspring to gain some experience through a local search. The proposed evolutionary algorithm adopts MA to extract the watermark from an image deformed by geometric attacks such as rotation and translation. Initially, 50 memes (parents) are generated and applied to the attacked image to reverse the attacks. Then, the embedded watermark is extracted and the fitness of the extracted watermark is measured. The memes with the best fitness are selected and used for the next generation, in which 50 memes are generated by crossover and mutation with predetermined rates. The process is repeated until the best meme is found. In the proposed algorithm, the chromosome consists of a 16-bit string of 0s and 1s: two groups of 4 bits encode the translations in the horizontal and vertical directions, respectively, and 8 bits encode the rotation attack.

The performance of the watermarking methods under consideration is investigated by measuring their imperceptibility and robustness. For imperceptibility, a quantitative index, the Peak Signal-to-Noise Ratio (PSNR) [11,12], is employed to evaluate the difference between an original image $I_{ori}$ and a watermarked image $I'_{ori}$. For robustness, the Mean Absolute Error (MAE) measures the difference between an original watermark $w$ and the corresponding extracted watermark $w'$. The PSNR and the MAE are, respectively, defined as follows:

$$PSNR = 10 \log_{10} \frac{255^2}{MSE} \ (\mathrm{dB}) \tag{3}$$

$$MSE = \frac{1}{m \times n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left( X_{ij} - X'_{ij} \right)^2 \tag{4}$$

where $X_{ij}$ and $X'_{ij}$ represent the pixel values of the original image and the attacked image, respectively. It should be noted that the larger the PSNR, the better the quality of the image.

$$MAE(w, w') = \frac{\sum_{i=0}^{|w|-1} \left| w_i - w'_i \right|}{|w|} \tag{5}$$

A lower MAE indicates that the extracted watermark $w'$ resembles the original watermark $w$ more closely. The robustness of a watermarking method is assessed by comparing $w$ with $w'$, where $w'$ is extracted from the watermarked image $I'_{ori}$ after it has been further degraded by attacks. A method with a lower $MAE(w, w')$ is more robust. After obtaining the PSNR of the watermarked image and the MAE values after the attacks, we are ready to start the MA training process. According to the definition of MA, we assign the fitness function in the $i$th iteration as follows:

$$fitness(i) = PSNR_i + (1 - MAE_i) \tag{6}$$
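The quality measures of Eqs. (3)-(5) and the fitness of Eq. (6) translate directly into code. The following Python sketch assumes 8-bit grayscale images stored as nested lists of equal size and watermarks stored as equal-length sequences; the function names are illustrative only.

```python
import math


def mse(original, attacked):
    """Mean squared error between two m x n images, Eq. (4)."""
    m, n = len(original), len(original[0])
    total = sum((original[i][j] - attacked[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)


def psnr(original, attacked):
    """Peak signal-to-noise ratio in dB for 8-bit images, Eq. (3)."""
    err = mse(original, attacked)
    if err == 0:
        return math.inf          # identical images
    return 10.0 * math.log10(255.0 ** 2 / err)


def mae(w, w_extracted):
    """Mean absolute error between two watermarks, Eq. (5)."""
    return sum(abs(a - b) for a, b in zip(w, w_extracted)) / len(w)


def fitness(psnr_value, mae_value):
    """Fitness of one MA iteration, Eq. (6)."""
    return psnr_value + (1.0 - mae_value)
```

With PSNR values of around 30 dB and MAE values below 1, as reported in the experiments, the PSNR term dominates the sum in Eq. (6), so the MAE term mainly separates candidates of similar image quality.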

The steps of the MA-based watermark detection algorithm are described as follows.

  Begin;
    Generate random population of P solutions (chromosomes);
    For each individual i ∈ P: calculate fitness(i);
    For each individual i ∈ P: do local-search(i);
    For i = 1 to number of generations;
      Randomly select an operation (crossover or mutation);
      If crossover;
        Select two parents ia and ib at random;
        Generate an offspring ic = crossover(ia, ib);
        ic = local-search(ic);
      Else if mutation;
        Select one chromosome i at random;
        Generate an offspring ic = mutation(i);
        ic = local-search(ic);
      End if;
      Calculate the fitness of the offspring in terms of Eq. (6);
      If ic is better than the worst chromosome then replace the worst chromosome with ic;
    Next i;
    Check if the iteration number has reached the predefined maximum;
  End.

The pseudocode for the above local-search procedure is as follows:

  Begin;
    Select an incremental value d = a ∗ rand(), where a is a constant that suits the variable values;
    For a given chromosome i ∈ P: calculate fitness(i);
    For j = 1 to number of variables in chromosome i;
      value(j) = value(j) + d;
      If chromosome fitness not improved then value(j) = value(j) − d;
    Next j;
  End.
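The 16-bit chromosome described earlier in this section (4 bits for horizontal translation, 4 bits for vertical translation, and 8 bits for rotation) can be decoded into the variables on which the local search operates. The Python sketch below is a hypothetical illustration: the paper does not state the bit order, the units of the translation, or how the 8 rotation bits map to an angle, so the unsigned-integer decoding and the 360/256-degree resolution are assumptions.

```python
def decode_chromosome(bits):
    """Decode a 16-character bit string into (tx, ty, angle_in_degrees).

    Assumed layout: bits 0-3 horizontal shift, bits 4-7 vertical shift,
    bits 8-15 rotation, each read as an unsigned binary integer.
    """
    if len(bits) != 16 or any(b not in "01" for b in bits):
        raise ValueError("expected a 16-character string of 0s and 1s")
    tx = int(bits[0:4], 2)                      # 0..15 (assumed pixels)
    ty = int(bits[4:8], 2)                      # 0..15 (assumed pixels)
    angle = int(bits[8:16], 2) * 360.0 / 256.0  # assumed angular resolution
    return tx, ty, angle
```

Under these assumptions, a candidate geometric correction is obtained by undoing the decoded translation and rotation before the watermark is extracted and the fitness of Eq. (6) is evaluated.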

5 Experimental Results

The algorithm has been extensively tested on the standard image 'Lena' under different kinds of attacks. In this section, some of the most significant results are shown. For the experiments presented in the following, the "db4" wavelet has been used for computing the DWT. To estimate the quality of our method, we used the peak signal-to-noise ratio (PSNR) to evaluate the distortion of the watermarked image. The Mean Absolute Error (MAE) is also used to evaluate the similarity between the original watermark and the extracted one. The parameters of the memetic algorithm (MA) are: population size = 50, maximum number of generations = 100, selection rate = 0.85, crossover rate = 0.8, and mutation rate = 0.05. The watermarked images obtained with our proposed algorithm are depicted in Fig. 1 and Fig. 2, which show the watermarked image at the 0th and 100th iterations of the MA, with PSNR values of 30.21 dB and 34.92 dB, respectively. We can observe the improvement in the watermarked image quality after certain attacks with the aid of MA. In addition, the PSNR and MAE values for increasing iteration numbers are tabulated in Table 1. The PSNR values increase and the MAE values decrease as the number of iterations increases. The watermarked copy and the original are evidently indistinguishable, thus demonstrating the effectiveness of the MA-based DWT watermarking scheme.

Fig. 1. The watermarked image at the 0th iteration in MA

Fig. 2. The watermarked image at the 100th iteration in MA

Table 1. Results of PSNR and MAE for Lena under different MA iterations

Iteration    PSNR     MAE
0            30.21    0.09175
25           34.68    0.02591
50           34.84    0.01423
100          34.92    0.00652

In addition, a good watermarking technique should be robust against different kinds of attacks. In the following experiments, several common attacks, such as JPEG compression, sharpening, blurring, and cropping, are used to measure the robustness of the proposed scheme. The detailed experimental results are shown in Table 2. We can see that the experimental results are acceptable.

Table 2. The experimental results under different attacks

Attack                  PSNR     MAE
JPEG attack             32.6     0.0007
Scaling attack          32.12    0.0006
Noise adding attack     30.56    0.0009
Blurring attack         30.72    0.0010
Sharpening attack       30.23    0.0008

6 Conclusions

With the rapid development of computer and communication networks, digital media, including images, audio, and video, are easily acquired in our daily life. Watermarking technology is the most efficient way to protect the ownership of multimedia data. In this paper, a novel watermarking scheme using the Discrete Wavelet Transform (DWT) and the memetic algorithm (MA) is introduced. The experimental results show that the proposed algorithm yields a watermark that is invisible to human eyes and robust to various image manipulations.

References

1. Hartung, F., Kutter, M.: Multimedia Watermarking Techniques. Proceedings of IEEE 87, 1079–1107 (1999)
2. Paquet, A.H., Ward, R.K., Pitas, I.: Wavelet Packets-Based Digital Watermarking for Image Verification and Authentication. Signal Processing 83, 2117–2132 (2003)
3. Jayant, N.J., Johnston, J., Safranek, R.: Signal Compression Based on Models of the Human Perception. Proceedings of IEEE 81, 1385–1422 (1993)
4. Moscato, P.: On Evolution, Search, Optimization, Genetic Algorithms and Martial Arts: Towards Memetic Algorithms. Technical Report, Caltech Concurrent Computation Program (1989)
5. Holland, J.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975)
6. Elbeltagi, E., Hegazy, T., Grierson, D.: Comparison among Five Evolutionary-Based Optimization Algorithms. Advanced Engineering Informatics 19, 43–53 (2005)
7. Geronimo, J.S., Hardin, D.P., Massopust, P.R.: Fractal Functions and Wavelet Expansions Based on Several Scaling Functions. Journal of Approximation Theory 78, 373–401 (1994)
8. Attakitmongcol, K., Hardin, D.P., Wilkes, D.M.: Multiwavelet Prefilters II: Optimal Orthogonal Prefilters. IEEE Transactions on Image Processing 10, 1476–1487 (2001)
9. Barni, M., Bartolini, F.: Improved Wavelet-Based Watermarking through Pixel-Wise Masking. IEEE Transactions on Image Processing 10, 783–791 (2001)
10. Clark, R.: An Introduction to JPEG 2000 and Watermarking. In: 2000 IEE Seminar on Secure Images and Image Authentication, pp. 3–6. IEEE Press, New York (2000)
11. Hsieh, S.-L., Huang, B.-Y.: A Copyright Protection Scheme for Gray-Level Images Based on Image Secret Sharing and Wavelet Transformation. In: 2004 International Symposium on Computer, pp. 661–666. IEEE Press, New York (2004)
12. Chang, C.-C., Chung, J.-C.: An Image Intellectual Property Protection Scheme for Gray-Level Images Using Visual Secret Sharing Strategy. Pattern Recognition Letters 23, 931–941 (2002)

A Genetic Algorithm Using a Mixed Crossover Strategy*

Li-yan Zhuang (1), Hong-bin Dong (2), Jing-qing Jiang (1, ∗∗), and Chu-yi Song (1)

(1) College of Mathematics and Computer Science, Inner Mongolia University for Nationalities, Tongliao 028043, P.R. China
[email protected]
(2) Department of Computer Science, Harbin Normal University, Harbin 150080, P.R. China
[email protected]

Abstract. Function optimization is a typical problem. A mixed crossover strategy genetic algorithm for function optimization is proposed in this paper. Four crossover strategies are mixed in this algorithm, and the performance is improved compared with a traditional genetic algorithm using a single crossover strategy. Numerical experiments are carried out on nine traditional functions, and the results show that the proposed algorithm is superior to four single pure crossover strategy genetic algorithms in convergence rate for function optimization problems.

Keywords: Genetic Algorithm; Crossover Strategy; Function Optimization.

1 Introduction

Function optimization is a typical application area of genetic algorithms, and it is a common numerical benchmark for evaluating their performance. The complexity of a function optimization problem depends on the number and distribution of its local minima, the distribution of their function values, and their attraction regions [1]. A global optimization problem can be described as follows:

$$\text{Minimize } f(x) = f(x_1, \ldots, x_n), \quad \text{subject to } l_i \le x_i \le u_i, \; i = 1, 2, \ldots, n, \tag{1}$$

where $x = (x_1, x_2, \ldots, x_n)$ is a solution vector, $n$ is the dimension of $x$, $f$ is the function to be optimized, and $S \subseteq R^n$. The problem is to find a point $x_{min} \in S$ such that $f(x_{min})$ is the global minimum in $S$. If $F$ denotes the feasible region and $S$ denotes the whole search space, then $F \subseteq S$.

Genetic algorithms, as a stochastic search and optimization technology, have been studied in depth by researchers at home and abroad. They provide a general framework for solving optimization problems in complicated systems, have been applied widely in several fields, and have become one of the most powerful tools for solving global optimization problems. The remarkable characteristic of genetic algorithms is the use of a gene-string encoding and the introduction of crossover operators as the main search operators [2, 3], and the design and selection of crossover operators is one of the main ways to improve genetic algorithms [4, 5]. Recently, many papers have studied how to improve the search capability and convergence of genetic algorithms by improving a single crossover operator, but there is still room for improvement. In this paper, a new genetic operator, the mixed crossover strategy operator, is proposed. Experiments carried out on nine traditional functions show that this algorithm is superior to several single crossover strategy genetic algorithms in convergence rate for function optimization problems.

* This work is supported by the Nature Science Foundation of Inner Mongolian in P.R.China (200711020807), by the Scientific Research Project of Inner Mongolia University for Nationalities (MDB2007132, YB0706, MDK2007032).
∗∗ Corresponding author.
F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 854–863, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Genetic Algorithm Based on Mixed Crossover Strategy

2.1 Game Theory

Game theory (GT), also called strategy theory, is a mathematical science that studies optimal solutions to problems of conflict and competition [6]. GT has become an important method in social research and has been applied widely in several fields. Mixed strategies in GT are defined as follows [6]:

In a normal-form game $G = \{S_1, \ldots, S_n; u_1, \ldots, u_n\}$ with $n$ players, let the strategy space of player $i$ be $S_i = \{s_{i1}, \ldots, s_{ik}\}$. Player $i$ selects a strategy at random from the $k$ available strategies according to a probability distribution $p_i = (p_{i1}, \ldots, p_{ik})$, where $0 \le p_{ij} \le 1$ for $j = 1, \ldots, k$ and $p_{i1} + \cdots + p_{ik} = 1$. The strategies obtained in this way are called mixed strategies, while the original strategies in GT are called pure strategies. A mixed strategy need not include every pure strategy. A mixed strategy can be superior to a pure strategy even if this pure strategy is better than the others.

2.2 Mixed Crossover Strategy

The crossover operator is the most important genetic operator in GA, so research on crossover operators reflects the progress of research on genetic algorithms. At present, the main traditional crossover operators are one-point, two-point, uniform, and uniform two-point crossover, among others. The performance of these crossover operators has been analyzed and compared by researchers [2, 7-10]. For binary encoding, no crossover operator is efficient in all cases. The number of crossover points should depend on the number of strategy variables in the function optimization problem; generally, it is roughly proportional to the number of strategy variables. Of course, besides the number of crossover points, the implementation method should also be considered.

A new crossover operator, the mixed crossover strategy, is proposed by mixing the four different crossover operators. Each crossover operator is called a pure crossover strategy within the mixed crossover strategy and has its own probability of being chosen for crossover. The probability of each crossover strategy is strengthened or weakened according to the current population, so that the mixed strategy is adjusted. The algorithm thus chooses the crossover strategy automatically, and its performance becomes more stable and effective. The details are described in Section 3.
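As a reminder of what the pure crossover strategies do on binary strings, the following Python sketch implements the three standard operators named above. The uniform two-point operator is not defined in this section, so it is deliberately left out; the function names and the list-of-bits representation are assumptions made for the example.

```python
import random


def one_point(a, b, rng):
    """Swap the tails of two parents after one random cut point."""
    p = rng.randint(1, len(a) - 1)
    return a[:p] + b[p:], b[:p] + a[p:]


def two_point(a, b, rng):
    """Swap the segment between two distinct cut points (needs len(a) >= 3)."""
    p, q = sorted(rng.sample(range(1, len(a)), 2))
    return a[:p] + b[p:q] + a[q:], b[:p] + a[p:q] + b[q:]


def uniform(a, b, rng):
    """Swap each gene independently with probability 0.5."""
    c, d = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        c.append(x)
        d.append(y)
    return c, d
```

Each operator takes two parents of equal length and a `random.Random` instance and returns two offspring, so any of them can be plugged into the same generation loop and selected at run time.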

3 Function Optimization Algorithms

Based on the idea of mixed strategy, a mixed crossover strategy genetic algorithm (MCSGA) for function optimization problems is designed in this section. MCSGA mixes four different crossover strategies: one-point, two-point, uniform, and uniform two-point crossover. The algorithm is described as follows:

Algorithm 1: Genetic Algorithm Based on Mixed Crossover Strategy for Solving Function Optimization.

(1) Initialization: An initial population of $\mu$ individuals is generated randomly. Each individual is represented by a real vector $x_i$, $i = 1, 2, \ldots, \mu$. Vector $x_i$ has $m$ components: $x_i = \{x_i(1), x_i(2), \ldots, x_i(m)\}$, $i = 1, \ldots, \mu$. Every component of vector $x_i$ is expressed by a binary string: $x_i(j) = \{s_{ij}(1), s_{ij}(2), \ldots, s_{ij}(w)\}$, $j = 1, \ldots, m$, where $w$ is the number of bits in the binary string. For each generation $g$, the mixed strategy vector is initialized as

$$\rho_g = \rho(h), \quad h \in \{1, 2, 3, 4\}, \quad g = 1, \ldots, \text{max gen}, \tag{2}$$

where "max gen" denotes the largest generation, and $\rho(1)$, $\rho(2)$, $\rho(3)$ and $\rho(4)$ represent the probabilities of choosing the one-point, two-point, uniform, and uniform two-point crossover strategies, respectively. In the experiment, they are all set to the same initial value, i.e., 0.25.

(2) Fitness calculation and evaluation: Calculate the fitness of the individuals in the population and assign every individual in the current population a score. Convert the raw fitness scores returned by the fitness function to values in a range suitable for the selection function. Calculate the expectation value of every individual, which is its probability of being selected as a parent.

(3) q-Tournament selection.

(4) Replication operation: The individuals with the best fitness are passed directly to the next generation.

(5) Crossover operation: In every generation $g$, select a crossover strategy $h$ according to the mixed strategy vector $\rho$, apply crossover strategy $h$ to the individuals, and generate offspring.

(6) Calculate the offspring fitness values and rank the offspring according to fitness.

(7) The mixed strategy is adjusted in the offspring generation as follows [11, 12]. Using crossover strategy $h$, $h \in \{1, 2, 3, 4\}$: if, after crossover, the number of offspring whose fitness is greater than that of their parents is more than half of the number of parents, then strengthen the crossover strategy, that is,

$$\rho_l(k+1) = \rho_l(k) - \rho_l(k) \times \gamma \quad \forall l \ne h, \qquad \rho_h(k+1) = \rho_h(k) + (1 - \rho_h(k)) \times \gamma.$$

Otherwise, weaken the crossover strategy, that is,

$$\rho_l(k+1) = \rho_l(k) + \frac{1}{no\_strategy - 1} \times \rho_l(k) \times \gamma \quad \forall l \ne h, \qquad \rho_h(k+1) = \rho_h(k) - \rho_h(k) \times \gamma,$$

where $\gamma \in (0, 1)$ is used to adjust the probability distribution of the mixed strategies and "no_strategy" is the number of pure crossover strategies being mixed. Here $\gamma = 1/2$.

(8) Mutation: Select the Gaussian mutation function to mutate the parent population.

(9) Migration: After a certain number of generations, the best individual of one subpopulation replaces the worst individual of another subpopulation, following the assigned migration direction.

(10) The offspring replace the current parents and form the next-generation population.

(11) Steps 2-10 are repeated until the stopping criterion is satisfied.
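Steps (5) and (7) can be sketched in Python as below. The sketch follows the update formulas as they are written in step (7). Note that the weakening rule, taken literally, keeps the probabilities summing to one only when the weakened strategy has probability 1/no_strategy, so the final renormalization is an assumption added here and not part of the published rule. Function names are hypothetical.

```python
import random


def choose_strategy(rho, rng):
    """Pick a crossover strategy index according to the mixed strategy rho."""
    return rng.choices(range(len(rho)), weights=rho, k=1)[0]


def update_mixed_strategy(rho, h, improved, gamma=0.5):
    """Strengthen (improved=True) or weaken (improved=False) strategy h."""
    n = len(rho)                      # no_strategy in the text
    new = list(rho)
    for l in range(n):
        if improved:
            if l == h:
                new[l] = rho[l] + (1.0 - rho[l]) * gamma
            else:
                new[l] = rho[l] - rho[l] * gamma
        else:
            if l == h:
                new[l] = rho[l] - rho[l] * gamma
            else:
                new[l] = rho[l] + rho[l] * gamma / (n - 1)
    total = sum(new)                  # assumed renormalization step
    return [p / total for p in new]
```

Starting from the uniform vector `[0.25, 0.25, 0.25, 0.25]` with gamma = 1/2, strengthening strategy 0 gives `[0.625, 0.125, 0.125, 0.125]`, so a strategy that has just produced better offspring becomes the most likely choice in the next generation.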

4 Experiments Results Analysis

4.1 Experiments Condition and Results

To verify the performance of the algorithm, nine traditional functions are selected. These functions include low-dimensional multimodal functions and multidimensional unimodal functions. The functions are described in detail in the appendix. The main purpose of the experiments is to compare the convergence rate of MCSGA with that of the pure crossover strategy GAs (that is, OCGA, TCGA, UCGA, and UTCGA). For all algorithms, the population size is 100, the number of generations is 1500, the crossover probability is 0.95, and the elite count is 1. The other parameters of the algorithm are shown in Table 1. The algorithms are run on a computer with WINDOWS XP2, 512 MB of memory, an 80 GB hard disk, and an AMD 2800+ processor. Under the same conditions, every algorithm is run fifty times independently, the average results are recorded, and the results of MCSGA are compared with those of the four pure crossover strategy GAs. The mean and standard deviation obtained by running the five algorithms fifty times on the nine traditional functions are listed in Table 2.

Table 1. Experiment Parameters of MCSGA

Parameter              Value
Pm                     0.05
Scale                  0.5
Shrink                 0.7
Migration Direction    'forward'
Migration Interval     20
Migration Fraction     0.2
StallGen Limit         100
StallTime Limit        50
Fitness limit          -Inf

Table 2. Comparison of MCSGA with OCGA, TCGA, UCGA and UTCGA on f1-f9, where "Mean best" indicates the mean best function value found in the last generation, and "Std Dev" stands for standard deviation.

            OCGA                 TCGA                 UCGA                 UTCGA                MCSGA
Function    Mean best  Std Dev   Mean best  Std Dev   Mean best  Std Dev   Mean best  Std Dev   Mean best  Std Dev
f1          0.0164     0.1256    0.0144     0.1139    0.0154     0.1211    0.0138     0.1079    0.0124     0.1028
f2          0.0734     0.1062    0.0148     0.1100    0.0168     0.1307    0.0174     0.1307    0.0144     0.1074
f3          0.3356     0.6502    0.3264     2.5541    0.2452     2.0689    0.2640     2.1739    0.2308     1.9393
f4          0.0138     0.1043    0.0142     0.1152    0.0168     0.1225    0.0152     0.1214    0.0138     0.1075
f5          0.0554     0.1223    0.0152     0.1190    0.0144     0.1128    0.0192     0.1441    0.0142     0.1132
f6          0.0103     0.0764    0.0093     0.0725    0.0099     0.0731    0.0092     0.0729    0.0092     0.0709
f7          -1.9520    0.0077    -1.9517    0.0115    -1.9520    0.0077    -1.9513    0.0154    -1.9524    0.0039
f8          -5.0015    0.1076    -5.0015    0.1134    -5.0005    0.1058    -5.0035    0.0935    -5.0095    0.0398
f9          -5.0076    0.0797    -5.0066    0.0779    -5.0036    0.1136    -5.0046    0.1037    -5.0076    0.0679

Note: A result in boldface stands for a better result or that the global optimum (or best known solution) has been reached.

We observe the following from the results given in Table 2. On all nine functions, the mean best value of MCSGA is at least as good as that of OCGA, TCGA, UCGA, and UTCGA, indicating that for these benchmarks the performance of MCSGA is better than that of the four pure strategies. This result confirms our conjecture that the mixed strategy performs better than the best of the four pure strategies considered here. Unlike a pure strategy, which performs well on some functions but very poorly on others, the mixed strategy works very well on all the test functions. The results of Table 2 also show that the standard deviation of the MCSGA solution is almost always smaller than that of the other solutions. In three of the functions (f2, f4, and f5), the smallest standard deviation is given by OCGA or UCGA instead. For these functions, however, the standard deviation of the MCSGA solution is still smaller than that of the remaining three algorithms. These facts indicate that MCSGA has a stable performance on all the benchmark functions over 50 independent runs.

4.2 Results Analysis

The nine traditional test functions are classified into two categories. The first consists of high-dimensional functions (f1-f6), where f1-f3 are unimodal functions, f4 is a step function that has one minimum and is discontinuous, and f5 and f6 are multimodal functions whose number of local minima increases exponentially with the dimension. For unimodal functions, the convergence rate of the five algorithms is more meaningful than the final optimization results. The final results on multimodal functions reflect the capability of an algorithm to escape local optima and find high-quality near-global optima. Thus, the convergence rate is very important for multimodal functions. The second category consists of low-dimensional functions (f7-f9) that have only a few local minima.


This paper presents the main experimental results for the high-dimensional functions in detail. Due to the limit on paper length, the comparative analysis is given for three functions only: f1, f3 and f6.

1. Functions f1 and f3

Fig. 1 shows the evolution of the best individual's fitness value under OCGA, TCGA, UCGA, UTCGA and MCSGA for f1 and f3, respectively. The results are the means over 50 runs. For f1 (Fig. 1(a)), the convergence rate of MCSGA is clearly greater than that of the pure crossover strategy genetic algorithms. MCSGA converges rapidly to the optimal solution. Owing to its better global convergence capability, MCSGA shows a rapid convergence rate from the start. In the fourth generation, OCGA and MCSGA reach the same convergence rate and obtain the optimal solution 0.01. At that point, UCGA, UTCGA and TCGA reach only 1, 1 and 4, respectively. The main reason that MCSGA converges rapidly to the optimal solution is that MCSGA identifies the best pure crossover strategy in every generation and increases the probability that this strategy is selected as the crossover strategy of the next generation.

Fig. 1. Evolution processes of OCGA, TCGA, UCGA, UTCGA and MCSGA for (a) f1 and (b) f3. The vertical axis is the mean fitness value of the best individual, and the horizontal axis is the number of generations. The results are the means over 50 independent runs. “yellow-triangle-line”, “purple-circle-line”, “green-square-line”, “blue-dot-line” and “red-star-line” indicate the results of OCGA, TCGA, UCGA, UTCGA and MCSGA, respectively.

For f3 (Fig. 1(b)), MCSGA converges rapidly in the first two generations, but slows down after the second generation. Only after the fourth generation does the convergence rate of MCSGA gradually increase again. Finally, MCSGA converges to the optimal solution 0.00599 in the seventh generation. A possible reason is that the number of mutations in MCSGA decreases generation by generation, so the variety among the offspring becomes smaller and the convergence rate of MCSGA becomes slower.

2. Function f6 with multiple local minima

Fig. 2 shows the evolution of the best individual's fitness value under OCGA, TCGA, UCGA, UTCGA and MCSGA for function f6. The results are the means over 50 runs. It can be seen clearly from Fig. 2 that MCSGA is superior to the four pure crossover strategy genetic algorithms in convergence rate, while the four pure crossover strategy genetic algorithms are better than MCSGA in maintaining population variety. MCSGA keeps the faster convergence rate for f6. UTCGA and MCSGA reach the same convergence rate in the second generation, and obtain the optimal solution 0.003293 in about the third generation. At that time, OCGA, TCGA and UCGA have reached 1, 1.2 and 1.4, respectively. The main reason that MCSGA converges rapidly in the initial stage is that the crossover strategy that generates the best individual is strengthened by the mixed strategy in MCSGA. Its probability of being selected as the crossover strategy for the next generation therefore increases, and convergence speeds up.

Fig. 2. Evolution processes of OCGA, TCGA, UCGA, UTCGA and MCSGA for f6. The vertical axis is the mean fitness value of the best individual, and the horizontal axis is the number of generations. The results are the means over 50 independent runs. “yellow-triangle-line”, “purple-circle-line”, “green-square-line”, “blue-dot-line” and “red-star-line” indicate the results of OCGA, TCGA, UCGA, UTCGA and MCSGA, respectively.

4.3 Results of Comparing MCSGA with MPOCGA

Furthermore, MCSGA is compared with MPOCGA [13] (Genetic Algorithm with Multi-point Orthogonal Crossover Operation) to verify its convergence. The two algorithms are run independently 50 times on functions f1-f6 under the same conditions (the experimental environment is the same as in Section 4.1). The best value, the average best value, the number of function evaluations, the average mean, the average standard deviation and the average number of evolutionary generations are recorded. The comparison results are shown in Table 3. The average number of evolutionary generations at convergence is used as a measure of the convergence rate of an algorithm [14].

Table 3 shows a significant difference between the average results of the two algorithms under the same conditions. MCSGA reaches the best value in the solution space within a few evolutionary generations, and in most cases obtains a smaller mean, standard deviation and number of function evaluations. According to this measure of convergence, under the same conditions MCSGA converges rapidly to the neighborhood of the solution, and then converges to the optimal solution within a short time (that is, it decreases the time complexity of the algorithm). The reason is that the algorithm adjusts the choice of crossover strategy automatically by strengthening and weakening the probability distribution. The probability of the crossover strategy that generates the best offspring increases, so this crossover strategy is more likely to be selected in the next generation. As a result, the proposed algorithm converges to the optimal solution rapidly.

Table 3. The average results for comparing MCSGA with MPOCGA on functions f1-f6

Test Function  Algorithm  Best Value  AveBest Value  NumofCal AveFunValue  AveMean  AveStaDev  AverEvoGen
f1             MPOCGA     0           0              5.9902e+003           0.0156   0.1164     104.5000
f1             MCSGA      0           0              5.8730e+003           0.0154   0.1163     103.6000
f2             MPOCGA     0           0              6.0369e+003           0.0148   0.1140     103.5200
f2             MCSGA      0           0              5.7789e+003           0.0116   0.0935     102.8600
f3             MPOCGA     0           0              8.2644e+003           0.2834   2.2746     120.3200
f3             MCSGA      0           0              7.5281e+003           0.2748   1.9757     116.3800
f4             MPOCGA     0           0              5.9987e+003           0.0158   0.1198     104.5800
f4             MCSGA      0           0              5.8837e+003           0.0156   0.1123     103.7400
f5             MPOCGA     0           0              5.8074e+003           0.0174   0.1271     103.2000
f5             MCSGA      0           0              5.9163e+003           0.0154   0.1130     104.0400
f6             MPOCGA     0           0              6.0121e+003           0.0246   0.0747     104.7200
f6             MCSGA      0           0              5.7611e+003           0.0097   0.0744     102.7200

Note: a boldface number represents the best value.
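The strengthen-and-weaken adjustment of the crossover-strategy probabilities described in Section 4.3 can be illustrated with a minimal Python sketch. The exact update rule of MCSGA is not given in this section, so the rule below (scale all probabilities by 1 - gamma and add gamma to the winning strategy) is only an assumed stand-in, and the names `update_strategy_probs`, `pick_strategy` and `gamma` are hypothetical.

```python
import random

STRATEGIES = ["one-point", "two-point", "uniform", "uniform two-point"]

def update_strategy_probs(probs, winner, gamma=0.1):
    """Strengthen the strategy that produced the best offspring.

    Assumed rule (not taken from the paper): every probability is
    weakened by the factor (1 - gamma) and the winner receives gamma,
    so the distribution still sums to 1.
    """
    updated = [p * (1.0 - gamma) for p in probs]
    updated[winner] += gamma
    return updated

def pick_strategy(probs, rng=random):
    """Draw the index of the crossover strategy for the next generation."""
    return rng.choices(range(len(probs)), weights=probs)[0]

# Start from a uniform mixed strategy over the four pure strategies.
probs = [0.25] * len(STRATEGIES)
probs = update_strategy_probs(probs, winner=2)
```

With this rule a strategy that repeatedly produces the best offspring is selected more and more often, which matches the qualitative behavior reported above.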

5 Conclusion

Several papers have studied how to enhance the search and convergence capability of GA by improving a single crossover operator, but these approaches have shortcomings. A mixed crossover strategy genetic algorithm (MCSGA) is proposed in this paper. The algorithm is verified on nine typical functions. The results of MCSGA are compared with those of traditional pure crossover strategy genetic algorithms and of MPOCGA. The results show that the proposed algorithm improves the convergence rate of GA and its performance in solving function optimization problems. It also extends the work of Dong et al. in [11].

References

1. Wang, X.P., Cao, L.L.: Genetic Algorithms-Theory, Application and Software Implementation. Ci’an Communication University Press, Ci’an China (2002)
2. Holland, J.H.: Adaptation in Natural and Artificial System. Ann Arbor., 211 (1975)
3. Goldberg, D.E.: Genetic Algorithms in Search Optimization and Machine Learning, p. 432. Addison-Wesley, New York (1989)
4. Mernik, M., Crepinsek, M., Zumer, V.: A Meta-Evolutionary Approach in Searching of the Best Combination of Crossover Operators for the TSP. In: Proceedings of the IASTED ICNN, Pittsburgh, Pennsylvania, pp. 32–36. IASTED/ACTA Press (2000)
5. Chen, H.F., Ji, S.M., Ye, H., et al.: Image Token Correspondence Based on Genetic Algorithm. Journal of Nanjing University (Natural Sciences) 36(2), 171–176 (2000)
6. Fan, R.G., Han, M.C.: Game Theory. Wuhan University Press, Wuhan (2006)
7. Syswerda, G.: Uniform Crossover in Genetic Algorithms. In: Proceedings of the Third ICGA, pp. 2–9. Morgan Kaufmann, San Mateo (1989)
8. Xu, H.Z., Chen, G.L., Zhang, F.E.: Comparison of One-point-crossover with Two-point-crossover in Genetic Algorithms. Journal of Harbin Institute of Technology 30(2), 64–67 (1998)
9. Zhang, J.Y., Xu, J., Bao, Z.: Attainability of Genetic Crossover Operator. Acta Automatica Sinica 28(1), 120–125 (2002)
10. Yang, D.D., Zhang, C.T.: Genetic Algorithm of Uniform Two-point Crossover. Journal of Chongqing Normal University (Natural Science Edition) 21(1), 26–29 (2004)
11. Dong, H.B., He, J., Huang, H.K., Hou, W.: Evolutionary Programming Using a Mixed Mutation Strategy. Information Sciences 177(1), 312–327 (2007)
12. Dong, H.B., He, J., Huang, H.K., Hou, W.: An Evolutionary Programming to Solve Constrained Optimization Problems. Journal of Computer Research and Development 43(5), 841–850 (2006)
13. Liu, Q., Liao, Z., Sheng, H.Y., et al.: Genetic Algorithm with Multi-point Orthogonal Crossover Operation. Journal of Nanjing Normal University (Engineering and Technology) 31(24), 151, 158 (2005)
14. Zhang, L.F., Li, M., Zhou, L.X.: Genetic Algorithms with Hybrid Float-code & Gray-code. Journal of Nanchang Institute of Aeronautical Technology 15(2), 27–30 (2001)

Appendix: Nine Typical Functions



(1) Sphere Model

\[ f_1(x) = \sum_{i=1}^{n} x_i^2, \quad x_i \in [-100, 100], \quad n = 30, \quad \min(f_1) = f_1(0, \ldots, 0) = 0. \]

(2) Schwefel’s Problem 2.22

\[ f_2(x) = \sum_{i=1}^{30} |x_i| + \prod_{i=1}^{30} |x_i|, \quad x_i \in [-10, 10], \quad n = 30, \quad \min(f_2) = f_2(0, \ldots, 0) = 0. \]

(3) Schwefel’s Problem 1.2

\[ f_3(x) = \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{i} x_j \Bigr)^2, \quad x_i \in [-100, 100], \quad n = 30, \quad \min(f_3) = f_3(0, \ldots, 0) = 0. \]

(4) Step Function

\[ f_4(x) = \sum_{i=1}^{n} \bigl( \lfloor x_i + 0.5 \rfloor \bigr)^2, \quad x_i \in [-100, 100], \quad n = 30, \quad \min(f_4) = f_4(0, \ldots, 0) = 0. \]

(5) Generalized Rastrigin’s Function

\[ f_5(x) = \sum_{i=1}^{n} \bigl[ x_i^2 - 10 \cos(2 \pi x_i) + 10 \bigr], \quad x_i \in [-5.12, 5.12], \quad n = 30, \quad \min(f_5) = f_5(0, \ldots, 0) = 0. \]

(6) Ackley’s Function

\[ f_6(x) = -20 \exp\Bigl( -0.2 \sqrt{ \tfrac{1}{30} \sum_{i=1}^{n} x_i^2 } \Bigr) - \exp\Bigl( \tfrac{1}{30} \sum_{i=1}^{n} \cos(2 \pi x_i) \Bigr) + 20 + e, \]
\[ x_i \in [-32, 32], \quad n = 30, \quad \min(f_6) = f_6(0, \ldots, 0) = 0. \]
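The six high-dimensional benchmark functions (1)-(6) above can be written as a short Python sketch using only the standard library. The function names are chosen here for illustration, and the factor 1/30 in Ackley's function is written as 1/n, which is an assumption that coincides with the definition above for n = 30.

```python
import math

def sphere(x):                      # f1
    return sum(v * v for v in x)

def schwefel_2_22(x):               # f2: sum plus product of absolute values
    a = [abs(v) for v in x]
    return sum(a) + math.prod(a)

def schwefel_1_2(x):                # f3: squared partial sums
    total, partial = 0.0, 0.0
    for v in x:
        partial += v
        total += partial * partial
    return total

def step(x):                        # f4
    return sum(math.floor(v + 0.5) ** 2 for v in x)

def rastrigin(x):                   # f5
    return sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) + 10.0 for v in x)

def ackley(x):                      # f6, with 1/30 generalized to 1/n
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    mean_cos = sum(math.cos(2.0 * math.pi * v) for v in x) / n
    return (-20.0 * math.exp(-0.2 * math.sqrt(mean_sq))
            - math.exp(mean_cos) + 20.0 + math.e)

# All six functions take their minimum value 0 at the origin (n = 30).
origin = [0.0] * 30
values_at_origin = [f(origin) for f in
                    (sphere, schwefel_2_22, schwefel_1_2, step, rastrigin, ackley)]
```

Evaluating each function at the origin is a quick sanity check against the stated minima before the functions are used as GA fitness functions.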

(7) Hartman’s Family

\[ f_7(x) = -\sum_{i=1}^{4} c_i \exp\Bigl[ -\sum_{j=1}^{n} a_{ij} (x_j - p_{ij})^2 \Bigr], \quad x_j \in [0, 1], \quad n = 3, \]
\[ \min(f_7) = f_7(0.114, 0.556, 0.852) = -3.86, \quad c_i = [1 \;\; 1.2 \;\; 3 \;\; 3.2]', \]
\[ a_{ij} = \begin{bmatrix} 3 & 10 & 30 \\ 0.1 & 10 & 35 \\ 3 & 10 & 30 \\ 0.1 & 10 & 35 \end{bmatrix}, \quad
p_{ij} = \begin{bmatrix} 0.3689 & 0.1170 & 0.2673 \\ 0.4699 & 0.4387 & 0.7470 \\ 0.1091 & 0.8732 & 0.5547 \\ 0.038150 & 0.5743 & 0.8828 \end{bmatrix}. \]

(8) Shekel’s Family

\[ f_8(x) = -\sum_{i=1}^{7} \bigl[ (x - a_i)(x - a_i)^T + c_i \bigr]^{-1}, \quad x_i \in [0, 10], \quad n = 4, \]
\[ x_{\mathrm{local\_opt}} \approx a_i, \quad \min(f_8) = f(x_{\mathrm{local\_opt}}) = \frac{1}{c_i}, \]
\[ a_{ij} = \begin{bmatrix} 4 & 4 & 4 & 4 \\ 1 & 1 & 1 & 1 \\ 8 & 8 & 8 & 8 \\ 6 & 6 & 6 & 6 \\ 3 & 7 & 3 & 7 \\ 2 & 9 & 2 & 9 \\ 5 & 5 & 3 & 3 \end{bmatrix}, \quad
c_i = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.2 \\ 0.4 \\ 0.4 \\ 0.6 \\ 0.3 \end{bmatrix}. \]

(9) Shekel’s Family

\[ f_9(x) = -\sum_{i=1}^{10} \bigl[ (x - a_i)(x - a_i)^T + c_i \bigr]^{-1}, \quad x_i \in [0, 10], \quad n = 4, \]
\[ x_{\mathrm{local\_opt}} \approx a_i, \quad \min(f_9) = f(x_{\mathrm{local\_opt}}) = \frac{1}{c_i}, \]
\[ a_{ij} = \begin{bmatrix} 4 & 4 & 4 & 4 \\ 1 & 1 & 1 & 1 \\ 8 & 8 & 8 & 8 \\ 6 & 6 & 6 & 6 \\ 3 & 7 & 3 & 7 \\ 2 & 9 & 2 & 9 \\ 5 & 5 & 3 & 3 \\ 8 & 1 & 8 & 1 \\ 6 & 2 & 6 & 2 \\ 7 & 3.6 & 7 & 3.6 \end{bmatrix}, \quad
c_i = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.2 \\ 0.4 \\ 0.4 \\ 0.6 \\ 0.3 \\ 0.7 \\ 0.5 \\ 0.5 \end{bmatrix}. \]

Condition Prediction of Hydroelectric Generating Unit Based on Immune Optimized RBFNN

Zhong Liu1, Shuyun Zou1, Shuangquan Liu2, Fenghua Jin1, and Xuxiang Lu1

1 Changsha University of Science and Technology, 410076 Changsha, China
2 Huazhong University of Science and Technology, 430074 Wuhan, China
[email protected]

Abstract. Establishing the condition prediction model of characteristic parameters is one of the key parts in the implementation of condition based maintenance (CBM) of the hydroelectric generating unit (HGU). The prediction performance of a radial basis function neural network (RBFNN) mainly depends on the number and locations of the data centers at the hidden layer. A novel approach inspired by immune optimization principles is proposed in this paper and used to determine and optimize the structure of the hidden layer. The immune optimized RBFNN has been applied to the vibration condition prediction of the hydroturbine guiding bearing. The prediction results are compared with those of some other intelligent algorithms and with the actual values, which shows the effectiveness and the precision of the proposed immune optimized RBFNN.

Keywords: condition prediction; hydroelectric generating unit (HGU); immune optimization; radial basis function neural network (RBFNN).

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 864–872, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Hydropower plays an important role among power sources as an economical, renewable and clean energy. The conversion from hydraulic energy into mechanical energy, and finally into electric energy, is performed by the hydroelectric generating unit (HGU). Its operating condition has a significant influence on the health of field engineers, the power quality of the grid, and the economic income of the hydropower plant. Condition based maintenance (CBM) of HGU aims to collect and analyze the condition symptoms and to make maintenance decisions considering the possible damages and the trends of the faults [1]. Establishing the condition prediction model [2] of HGU is one of the key parts in the implementation of CBM of HGU. Compared with thermoturbine generating units, an HGU runs at a much lower speed, below 500 rpm. Its condition varies gradually and slowly, which makes trend analysis and condition prediction feasible.

With the development of artificial intelligence technologies, neural networks are widely used in prediction. The radial basis function neural network (RBFNN) is especially notable for its universal approximation capability and its simple and practicable algorithms. The behavior of an RBFNN strongly depends on the number and positions of the neurons at the hidden layer, as well as on the dispersion of each radial basis function (RBF) [3]. Random definition of the center locations means no control of relevance and redundancy, while self-organizing approaches such as k-means increase the risk of getting trapped in poor local minima [4]. Thus, more advanced techniques should be studied to determine the parameters at the hidden layer of RBFNN, so as to improve the precision and accuracy.

The immune system [5, 6] is a highly complex, distributed, cooperative and self-adaptive system. It has advantages in feature extraction, self-organized learning and memorizing. Although they have different functions, the biological neural network and the immune system have many similarities at the level of system behavior [7]. Approaches inspired by the immune system therefore have potential for configuring RBFNN. A novel approach based on the diversity of immune cells and the variety of information processing mechanisms is proposed, which leads to the immune optimized RBFNN.

This paper is organized as follows. The principles and the learning stages of RBFNN are reviewed in Section 2. The implementation of the immune optimized RBFNN is described in Section 3. In Section 4, this RBFNN model is applied to the vibration prediction of the hydroturbine guiding bearings of an HGU, and the prediction results are compared with those of other approaches such as the backward propagation neural network (BPNN) and the k-means RBFNN. Some conclusions are drawn in Section 5.

2 RBFNN

RBFNN is a feed-forward network consisting of three layers: input layer, hidden layer and output layer. The transfer function is nonlinear from the input space to the hidden space and linear from the hidden space to the output space. As a nonnegative, nonlinear function that decays symmetrically from its center, the Gaussian kernel is often used as the hidden-layer transfer function. The structure of RBFNN is illustrated in Figure 1, where ni, nh and no are the numbers of neurons at the input, hidden, and output layers, respectively. The output hi of the ith hidden neuron is typically defined by the Gaussian function

\[ h_i = \exp\bigl( -\| x - c_i \|^2 / \sigma_i^2 \bigr), \quad 1 \le i \le n_h, \]  (1)

where x is the input vector, x={x1,x2,…,xni}, and ci and σi are the center vector and the dispersion of the ith hidden neuron, respectively. The output of the ith output neuron is the linear combination of the outputs of the hidden neurons,

\[ y_i(x) = \sum_{j=1}^{n_h} h_j w_{ji}, \quad 1 \le i \le n_o, \]  (2)

where wji is the weight between the jth hidden neuron and the ith output neuron. Usually, RBFNN is trained by first determining the centers and the dispersions of the hidden neurons, and then finding the weights between the hidden and output layers by the so-called pseudo-inverse method [8]. The method has been shown to be very effective in terms of training speed.
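Equations (1) and (2) and the second training stage can be sketched in plain Python for a single output neuron. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the pseudo-inverse solution is computed through the normal equations (H^T H) w = H^T y, which is an assumption that holds only when the hidden-output matrix H has full column rank.

```python
import math

def hidden_outputs(x, centers, sigmas):
    """Equation (1): h_i = exp(-||x - c_i||^2 / sigma_i^2)."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / s ** 2)
            for c, s in zip(centers, sigmas)]

def solve_linear(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][k] * w[k] for k in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w

def fit_output_weights(inputs, targets, centers, sigmas):
    """Least-squares output weights via the normal equations."""
    h = [hidden_outputs(x, centers, sigmas) for x in inputs]
    nh = len(centers)
    hth = [[sum(row[i] * row[j] for row in h) for j in range(nh)]
           for i in range(nh)]
    hty = [sum(row[i] * t for row, t in zip(h, targets)) for i in range(nh)]
    return solve_linear(hth, hty)

def predict(x, centers, sigmas, weights):
    """Equation (2) for one output neuron."""
    h = hidden_outputs(x, centers, sigmas)
    return sum(hj * wj for hj, wj in zip(h, weights))

# Toy usage: two centers, two training samples (illustrative values only).
centers, sigmas = [[0.0], [1.0]], [1.0, 1.0]
weights = fit_output_weights([[0.0], [1.0]], [1.0, 2.0], centers, sigmas)
```

Because the centers and dispersions are fixed before this step, the weights follow from a single linear solve, which is why the second stage is fast.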


Fig. 1. Structure of RBFNN

3 Immune Optimized RBFNN

3.1 Immune Optimization Principles

Facing various kinds of known or unknown invaders (antigens) from outside the body, the immune system can generate a dynamic, limited and compact collection of lymphocytes (memory antibodies) via its self-adaptation mechanism and its antibody diversity mechanism [9]. The memory antibodies can quickly recognize and eliminate the invaders so as to maintain the balance of the whole organism. Affinity maturation is one of the most important features of the immune system and the immune response. Four points of the affinity maturation process are worth listing, since they can guide the development of optimization algorithms that combine global and local search.

1) Part of the individuals with higher affinity turn into activated cells, which clone, proliferate and mutate to further the affinity maturation.
2) The diversity of antibodies is guaranteed by the supplement mechanism of the immune system, which enables the organism to withstand various kinds of antigens.
3) Immune cloning is cell self-reproduction. There is no crossover among the cloned individuals.
4) The existence of immune memory ensures the preservation of excellent cells and their information.

3.2 Feasibility Analysis

RBFNN was proposed to implement function mapping with locally receptive fields [4]. As shown in Figure 2, the receptive fields have on-center off-surround characteristics, i.e. their responses decrease monotonically with the distance from a central point. The closer to the central point, the stronger the activation. The Gaussian function in Equation (1) describes this appropriately: ci denotes the center of the receptive field, and σi its influence scope. The interactions between antigens and (or) antibodies have similar characteristics. The stimulation or suppression between immune cells usually depends on their distances.

Fig. 2. On-center off-surround characteristics (vertical axis: activated degree)

Mathematically, either an antibody or an antigen can be represented by a set of coordinates m = <m1, m2, …, mL>, which can be regarded as a point in an L-dimensional real-valued shape space. The physical meaning of each parameter is not relevant and depends on the studied problem. The affinity degree Affij is used to measure the matching degree between antibody Abi and antigen Agj, which reflects the probability of a clonal response:

\[ \mathit{Aff}_{ij} = \frac{1}{1 + \| Ab_i - Ag_j \|}, \]  (3)

where ||·|| represents the Euclidean distance. The value of Affij is between 0 and 1. The larger the value, the closer the antibody is to the antigen, and the better the recognition. The similarity degree Simij is adopted to measure the similarity between antibodies Abi and Abj, or between antigens Agi and Agj, in the same way as Equation (3). The value of Simij is between 0 and 1. The larger the value, the more similar the two antibodies (antigens), and the more they suppress each other.

The training of RBFNN [10] essentially means a continuous and dynamic adjustment of the structure and parameters of the hidden layer. Suppose N sets of data X={x1,x2,...,xN} are the inputs of an RBFNN, where xi=[xi1,xi2,…,xik]T, i=1,2,…,N. The optimal configuration process is to determine M sets of data C={c1,c2,...,cM} as the centers of the hidden layer, where ci=[ci1,ci2,…,cik]T. C is not necessarily a subset of X, but is expected to form an internal image of X and to reflect its distribution. M is far less than N. Inspired by immune optimization principles, the mapping relationships shown in Table 1 can be established.

Table 1. Mapping of immune system and RBFNN

Immune System               RBFNN
antigens                    input data xi
antibodies                  candidate centers
memory antibodies           determined centers ci at hidden layer
antibody’s concentration    influence scope σi
immune evolution            optimization of centers and their parameters

3.3 Implementation of Immune Optimization

There is no need to predefine the number of centers at the hidden layer when immune optimization principles are used to determine them. The implementation steps can be summarized as follows.

1) Initialization. Empty the center set Abmem. Initialize the candidate center set Abcan with the number of its members Nr, the number of centers with the highest affinity degree n1, the natural death threshold δd, the suppression threshold δs, and the number of newly supplemented candidate centers d. Predefine the stopping criteria. They may be the maximum number of iterations, the maximum number of elements in Abmem, the mean error between two sequential iterations, or a combination of these items.
2) Judge whether to enter the next iteration according to the predefined stopping criteria. If they are met, finish the evolution. Otherwise, continue to step 3).
3) Generate d candidate centers randomly to replace the same number of centers with the lowest affinity degree in Abcan.
4) For each set of input data (antigen):
   a. Determine the affinities to all the candidate centers;
   b. Select the n1 centers with the highest affinity degree;
   c. Clone these n1 selected centers. The higher the affinity degree, the larger the number of clones;
   d. Perform mutation on the clones;
   e. Reselect the centers with the highest affinity to form a temporary memory center set Abtem;
   f. Eliminate the centers with affinity less than δd, to reduce the size of Abtem;
   g. Calculate the similarity degrees among the centers in Abtem. Eliminate those with similarity degree larger than δs;
   h. Concatenate Abnew (the centers remaining in Abtem) and Abmem, i.e.

\[ Ab_{mem} = Ab_{mem} \cup Ab_{new}. \]  (4)

5) Network suppression. Recalculate the similarity degrees in Abmem, and eliminate those centers with similarity degree larger than δs.
6) Jump to step 2).

The elements of the final center set Abmem after the above steps are the optimal centers at the hidden layer of RBFNN. The whole space of input data can be covered by this limited number of centers. Once the number and locations of the centers are determined, the shape parameter σi of each RBF can be calculated as follows.

\[ \sigma_i = \frac{1}{k} \sum_{j=1}^{k} \| c_i - c_j \|, \quad i = 1, 2, \ldots, M, \]  (5)

where k is the predefined number of data centers nearest to the data center ci, and cj ranges over these k nearest centers. The weights between the hidden layer and the output layer can be calculated with the pseudo-inverse method. At this point, an RBFNN with an immune optimized structure for prediction has been established.
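The affinity of Equation (3), the network suppression of step 5) and the dispersion of Equation (5) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the suppression is done greedily in list order (the paper does not specify which of two similar centers is kept), and the k nearest centers in Equation (5) are assumed to exclude the center itself, so at least two centers are required.

```python
import math

def distance(a, b):
    """Euclidean distance between two points of the shape space."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def affinity(antibody, antigen):
    """Equation (3); the same form is used for the similarity degree."""
    return 1.0 / (1.0 + distance(antibody, antigen))

def suppress(centers, delta_s):
    """Keep a center only if its similarity to every kept center
    does not exceed the suppression threshold delta_s (greedy order)."""
    kept = []
    for c in centers:
        if all(affinity(c, other) <= delta_s for other in kept):
            kept.append(c)
    return kept

def dispersions(centers, k):
    """Equation (5): mean distance from c_i to its k nearest other centers."""
    sigmas = []
    for i, c in enumerate(centers):
        dists = sorted(distance(c, o) for j, o in enumerate(centers) if j != i)
        nearest = dists[:k]
        sigmas.append(sum(nearest) / len(nearest))
    return sigmas

# Toy usage with three two-dimensional centers (illustrative values only).
toy_centers = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
toy_kept = suppress(toy_centers, delta_s=0.45)
toy_sigmas = dispersions(toy_centers, k=1)
```

In the toy example the third center lies at distance 1 from the first, giving a similarity of 0.5, so it is removed when the threshold is 0.45.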

4 Application to Condition Prediction of HGU

4.1 Characteristic Parameters Selection

About eighty percent of the faults and accidents of HGU are revealed by anomalies or threshold violations in vibration or pressure pulsation signals. These signals reflect configuration variations in the frequency domain and can be collected conveniently [11]. Characteristic parameters are therefore often extracted from them to form time sequences for future condition prediction. The root mean square (RMS) in the time domain, the mean peak-to-peak amplitude and compound features are usually adopted according to practical demands.

4.2 Evaluation Index of Condition Prediction

For a time sequence {y0,y1,y2,...,yn-1}, the problem of predicting the nth value based on the former m values can be represented as

\[ y_n = f(y_{n-1}, y_{n-2}, \ldots, y_{n-m}). \]  (6)

Input data (y1,y2,...,ym) give the output ym+1, (y2,y3,...,ym+1) give ym+2, and so on. After the immune optimized RBFNN is established, the upcoming values are predicted as

\[ y^*_{n+1} = f(y_n, y_{n-1}, \ldots, y_{n-m+1}), \]
\[ y^*_{n+2} = f(y_{n+1}, y_n, y_{n-1}, \ldots, y_{n-m+2}), \]  (7)
\[ \ldots \]

Here the mean relative error eMAP is adopted as the precision evaluation index:

\[ e_{MAP} = \frac{1}{N_s} \sum_{i=1}^{N_s} \frac{| y_i - y^*_i |}{y_i}, \]  (8)

where yi and y*i are the actual value and the predicted value, respectively, and Ns is the number of predictions.

4.3 Engineering Application

The stability condition of an HGU was monitored, and the vibration signals from several vital parts of the unit were collected and recorded. Characteristic parameters extracted at a fixed interval formed single-variable sequences in the time domain.


Taking the vibration condition prediction at the hydroturbine guiding bearings as an example [12], the RMS values were selected as the characteristic parameters. Theoretically they should be values at the same water level at sequential times. Due to the lack of relevant experimental data, data with a water level difference of less than 3 meters were treated as being at the same water level. The data from the first to the tenth week were selected as training samples to establish an RBFNN with five inputs and one output via the immune optimization approach proposed in this paper. The data from the eleventh to the fifteenth week were used as testing samples. After several trials, the parameters of the method were set as follows: Nr=50, n1=, δd=0.95, δs=0.85, and d=8.

To verify the effectiveness and the precision of the proposed approach, the k-means clustering RBFNN approach and a BPNN with five inputs, 10 neurons at the single hidden layer and one output were also adopted. The prediction results and the actual values are illustrated in Figure 3, and the prediction precisions evaluated according to Equation (8) are shown in Table 2.

Fig. 3. Vibration prediction results of HGU (horizontal axis: time in weeks; vertical axis: vibration amplitude in μm; curves: actual value, immune RBFNN, k-means RBFNN, BPNN)

Table 2. Result comparison via different approaches

No.    Actual values (μm)    Prediction results (μm)
                             BPNN      k-means RBFNN    Immune RBFNN
11     131.6                 127.0     128.6            129.3
12     132.6                 131.1     135.9            131.9
13     130.0                 129.2     133.0            131.0
14     129.6                 131.1     131.7            131.0
15     131.9                 131.0     128.7            131.3
eMAP                         8.46%     11.12%           4.58%

As can be seen, the prediction results of the immune optimized RBFNN are the closest to the real values. Its error index eMAP is 4.58%, lower than 8.46% for BPNN and 11.12% for the k-means RBFNN.
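The sliding-window formulation of Equation (6) and the error index of Equation (8) can be sketched in Python as below. The function names and the toy numbers are hypothetical and serve only to illustrate the two formulas; the windows follow the chronological ordering described in Section 4.2, where (y1,...,ym) is paired with ym+1.

```python
def make_windows(series, m):
    """Pair each run of m consecutive values with the value that follows it."""
    return [(series[i:i + m], series[i + m]) for i in range(len(series) - m)]

def mean_relative_error(actual, predicted):
    """Equation (8): average of |y_i - y*_i| / y_i over all predictions.

    Assumes positive actual values, as is the case for vibration amplitudes.
    """
    terms = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(terms) / len(terms)

# Toy usage (illustrative numbers, not measurement data).
toy_windows = make_windows([1, 2, 3, 4], m=2)
toy_error = mean_relative_error([100.0, 200.0], [110.0, 190.0])
```

With m = 5, `make_windows` yields the five-input, one-output training samples used for the networks in Section 4.3.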

5 Conclusions

Establishing the condition prediction model of the characteristic parameters of HGU is vital for advancing fault prevention and maintenance decisions in hydropower plants. RBFNN has potential in prediction and approximation problems owing to its particular advantages. The number and locations of the data centers at the hidden layer have a key effect on its performance. The similarities between RBFNN structure design and the immune system have been analyzed, and it is feasible to apply immune optimization principles to configure the hidden layer of RBFNN. The implementation of the immune optimized RBFNN has been described in detail. The whole space of input data can be covered by a limited number of centers based on the immune memory mechanism (corresponding to global search), the self-adjustment mechanism and the diversity of immune antibodies. The immune optimized RBFNN has been applied to the vibration condition prediction of an HGU. The prediction results are compared with those of BPNN and the k-means RBFNN, which verifies the correctness and precision of the proposed approach.

References

1. Wang, S., Fan, W., Zhong, D.: Condition Maintenance Decision-Making System Based on Condition Monitoring of Main Equipment of Hydroelectric Plant. Automation of Electric Power Systems 16, 45–47 (2001)
2. Liu, Z., Zhou, J., Zou, M.: Condition Based Maintenance System of Hydroelectric Generating Unit. In: The 2006 IEEE International Conference on Industrial Technology, pp. 80–85. IEEE Press, Bombay (2006)
3. Hwang, Y., Bang, S.: An Efficient Method to Construct a Radial Basis Function Neural Network Classifier. Neural Network 9, 1495–1503 (1997)
4. Moody, J.E., Darken, C.: Fast Learning in Networks of Locally Tuned Processing Units. Neural Computation 2, 281–294 (1989)
5. Dasgupta, D.: Advances in Artificial Immune Systems. IEEE Computational Intelligence Magazine 4, 40–49 (2006)
6. Timmis, J., Knight, T., de Castro, L.N., Hart, E.: An Overview of Artificial Immune Systems. Natural Computation Series, pp. 51–86. Springer, Heidelberg (2004)
7. Dasgupta, D.: Artificial Neural Network and Artificial Immune Systems: Similarities and Differences. In: Dasgupta, D. (ed.) the IEEE International Conference on System, Man, and Cybernetics, pp. 873–878. IEEE Press, Orlando (1997)
8. Wu, S., Chow, T.W.S.: Induction Machine Fault Detection Using SOM-Based RBF Neural Networks. IEEE Trans. Industrial Electronics 51, 183–194 (2004)
9. de Castro, L.N., Von Zuben, F.J.: Learning and Optimization Using the Clonal Selection Principle. IEEE Trans. on Evolutionary Computation 3, 239–251 (2002)
10. Barra, T.V., Bezerra, G.B., de Castro, L.N., Von Zuben, F.J.: An Immunological Density-Preserving Approach to the Synthesis of RBF Neural Networks for Classification. In: International Joint Conference on Neural Networks, pp. 929–935. IEEE Press, Vancouver (2006)
11. Liu, Z., Zhou, J., Zhang, Y., Zou, M.: Compound Characteristic Extraction Based Fault Diagnosis of Hydroelectric Generating Unit Using RBF Neural Network. Automation of Electric Power Systems 11, 87–91 (2007)
12. Dai, K.: Forecasting of the Feature-Based Hydroelectric Unit State. Master Dissertation, Huazhong University of Science and Technology, Wuhan (2005)

Synthesis of a Hybrid Five-Bar Mechanism with Particle Swarm Optimization Algorithm

Ke Zhang

School of Mechanical and Automation Engineering, Shanghai Institute of Technology
120 Caobao Road, 200235 Shanghai, China
[email protected]

Abstract. The hybrid mechanism is a new type of mechanism with flexible transmission behavior, and the hybrid five-bar mechanism is its most representative form. In this paper, the modeling and analysis of a hybrid five-bar mechanism based on power bond graph theory is introduced. An optimal dimensional synthesis of the hybrid mechanism is performed with reference to a dynamics objective function. Compared with conventional optimum evaluation methods such as simplex search and the Powell method, the Particle Swarm Optimization (PSO) algorithm can improve the efficiency of searching the whole field by gradually shrinking the area of the optimization variables. Compared to GA, PSO is easy to implement and has few parameters to adjust. To solve the synthesis problem, the PSO algorithm is integrated with the MATLAB Optimization Toolbox, which handles the constraint equations. Optimum link dimensions are obtained assuming there are no dimensional tolerances or clearances. Finally, a numerical example is carried out, and the simulation results show that the optimization method is feasible and satisfactory for the hybrid mechanism.

Keywords: Synthesis, mechanism, optimization, PSO algorithm.

1 Introduction Hybrid five bar mechanism is a planar parallel robot that combines the motions of two characteristically different motors by means of a five bar mechanism to produce programmable output. Where one of the motions coming from a constant speed motor provides the main power, a small servomotor introduces programmability to the resultant actuator. It is the most representative one of hybrid mechanism. The idea of hybrid machines was initially investigated by Tokuz and Jones [1], and is a field of study with full potential. Such machines will introduce to users greater flexibility with programmability option, and energy utilization will be realized at maximum. Now some practical applications of hybrid machines idea have already used with different hybrid machine configurations in industry field. Injection moulding machine, printing machine, cut to length machine and stamping press are the industrial examples [2]. Although some points are partially explored, there is still a need for dynamics analysis and optimal design studies to guide potential users for possible industrial applications with hybrid machines. F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 873–882, 2008. © Springer-Verlag Berlin Heidelberg 2008


K. Zhang

Previous work on hybrid machines can be found in several studies and publications. Tokuz and Jones [1] used a hybrid machine configuration to produce a reciprocating motion. A slider-crank mechanism was driven by a differential gearbox having two separate inputs, a constant-speed motor and a pancake servomotor, to simulate a stamping press. A mathematical model was developed for the hybrid machine-motor system. The model results were compared with the experimental ones, and model validation was achieved. Greenough et al. [3] presented a study on the optimization design of hybrid machines in which a Svoboda linkage is considered as a two-degree-of-freedom mechanism. Kinematic analysis of the Svoboda linkage was presented together with the inverse kinematics issue. Kireçci and Dülger [4] described a different hybrid actuator configuration consisting of a servomotor-driven seven-link mechanism with an adjustable crank. Wang [5] designed a variable structure generalized linkage, which is optimized to perform exactly the complicated motion required. The above designs of hybrid mechanisms mainly employed traditional optimization design methods. However, these traditional methods have drawbacks in finding the global optimal solution, because they are easily trapped in local minimum points [6]. Particle Swarm Optimization (PSO) is an evolutionary computation technique developed by Dr. Eberhart and Dr. Kennedy in 1995 [7], inspired by the social behavior of bird flocking or fish schooling. Compared to GA, the advantages of PSO are that it is easy to implement and has few parameters to adjust. In recent years, many reported works have focused on PSO, which has been applied widely in function optimization, artificial neural network training, pattern recognition, fuzzy control and other fields [8] where GA can be applied. Dimensional synthesis is an important subject in designing a hybrid mechanism.
The main purpose of this paper is to present the modeling and analysis of a hybrid mechanism system, and to investigate the optimal dimensional synthesis of the mechanism by means of its mathematical model. Using dynamics objective functions, optimum dimensional synthesis of the hybrid five-bar mechanism is carried out with a PSO algorithm. The results and analysis of an example are also presented.

2 Hybrid Mechanism Description
Fig. 1 shows a five-link mechanism in which all joints are revolute except for one slider on the output link, together with its positional, geometrical and dynamic relationships. The notation shown in Fig. 1 is applied throughout the study. The hybrid mechanism has an adjustable link designed to include a power screw mechanism that converts rotary motion to linear motion by means of a small slider. The slider is assumed to move on a frictionless plane. The crank is driven by a main motor (DC motor) through a reduction gearbox; the slider is driven by a lead screw coupled to an assist motor (servomotor). Here the main motor is operated as a constant-speed motor with a constant-speed motion profile. Point-to-point positioning is achieved for both motors, and the system output is taken from the last link.

Synthesis of a Hybrid Five-Bar Mechanism with PSO Algorithm


$a, b, d, e$: link lengths of the mechanism (m)
$\phi, \theta, \varphi, \psi$: angular displacements of the links (rad)
$\dot{\phi}, \dot{\theta}, \dot{\varphi}, \dot{\psi}$: angular velocities of the links (rad/s)
$\ddot{\phi}, \ddot{\theta}, \ddot{\varphi}, \ddot{\psi}$: angular accelerations of the links (rad/s²)
$S_{ix}, S_{iy}$ $(i = a, b, e, l)$: positions of the centres of gravity in local coordinates (m)
$m_i$: masses of the links (kg)
$G_i$: gravity forces of the links (N)
$J_{iS}$: moments of inertia of the links about their mass centres (kg m²)
$L, \dot{L}, \ddot{L}$: displacement, velocity and acceleration of the slider on the output link (m, m/s, m/s²)
$F$: the assist driving force (N)
$M_0$: the main driving torque (Nm)
$M_\psi$: drag torque on the output link (Nm)
$R_{Qi}^{X}, R_{Qi}^{Y}$: inertia forces of the links (N)
$M_{Qi}$: inertia torques of the links (Nm)
$X_i, Y_i$: positions of the mass centres of the links in fixed coordinates (m)

Fig. 1. Configuration and mechanical properties of hybrid five-bar mechanism

3 Modeling and Analysis of Hybrid Mechanism
In general, a mechanical system can be modeled simply as an inertial rigid system. Simplifying assumptions are required while developing the mathematical model: friction and clearance in all joints are neglected, and the mechanism operates in a vertical plane with gravity effects included. Figure 2 shows the bond graph model of the hybrid five-bar mechanism [9]. It is composed of three parts: (1) the multiport element MECHANISM, (2) the inertial field, and (3) the source field. In Fig. 2 there are N 1-junctions corresponding to the velocity vector of the mechanism, $\dot{q}_K$. According to bond graph theory [10], the algebraic sum of all efforts ($e_i$) on the bonds attached to a 1-junction is zero. From the bond graph in Fig. 2, the effort summations at the two 1-junctions associated with the independent generalized velocity vector $\dot{q}_{KI}$ are written as follows:

$$ e_{SI} = e_{KD} + e_{PI}. \tag{1} $$

The effort summations at the (N−2) 1-junctions associated with the dependent velocity vector $\dot{q}_{KD}$ are written as follows:

$$ e_{SD} = -e_{KI} + e_{PD}. \tag{2} $$

According to [10] and equations (1) and (2), we obtain

$$ (e_{SI} - e_{PI}) = T^{T}(q_K, \Psi)\,(e_{PD} - e_{SD}). \tag{3} $$

From (3), the dynamic equation of the hybrid mechanism is found in the form

$$ A_1 \ddot{q}_{KI} + A_2 \dot{q}_{KI} + A_3 U_1 + A_4 U_2 = 0, \tag{4} $$

where $A_1$, $A_2$, $A_3$ and $A_4$ are the coefficient matrices of the dynamic equation, $U_1$ represents the vector of input torques (forces), and $U_2$ represents the other torques (forces) acting on the hybrid mechanism. Thus, the 2-vector $U_1$ can be found from equation (4):

$$ U_1 = (-1) A_3^{-1} \left( A_1 \ddot{q}_{KI} + A_2 \dot{q}_{KI} + A_4 U_2 + A_5 \right). \tag{5} $$

Fig. 2. Bond graph model of hybrid five bar mechanism

Kinematic analysis of the five-bar linkage is needed while carrying out the dynamic analysis. The mechanism is shown with its position vectors in Fig. 1. The output of the system depends on the two separate motor inputs and on the geometry of the five-bar mechanism. Referring to Fig. 1, the loop closure equation is written as

$$ \overrightarrow{AB} + \overrightarrow{BC} + \overrightarrow{CD} = \overrightarrow{AE} + \overrightarrow{ED}. \tag{6} $$

By solving the vector loop equation (6), we obtain the angular position of each link. Having found the angular displacement of each link in the five-bar linkage, time derivatives can be taken to find the angular velocities and accelerations, which are also needed in the analysis of the dynamic model.

4 Optimum Synthesis of Hybrid Mechanism
4.1 Design Variables
The hybrid mechanism can be determined by selecting a design vector as follows:

$$ x = [x_1, x_2, x_3, x_4, x_5]^T, \tag{7} $$

where $x_1 = a/d$, $x_2 = b/d$, $x_3 = e/d$, $x_4 = \phi_0$, $x_5 = \psi_0$.

4.2 Objective Functions
The problem of determining the mechanism dimensions can be expressed as a constrained optimization problem. To keep the driving power of the assist motion low under the constraint functions, the objective functions can be designed as follows:

$$ \min f_1 = \max(F\dot{L}), \tag{8} $$

or

$$ \min f_2 = \max\bigl(\max(F\dot{L}) / \max(M_0\dot{\phi})\bigr). \tag{9} $$

4.3 Constraint Functions
These functions are inequality constraints written in the standard form required by the MATLAB Optimization Toolbox. They are functions of the design variables.
(1) Inequality constraints related to the movability condition of the hybrid mechanism. To ensure the existence of the hybrid five-bar mechanism, the following inequality constraints must be satisfied:

$$ \begin{aligned}
a + d - b - \sqrt{e^2 + L^2} &< 0, \\
a + \sqrt{e^2 + L^2} - b - d &< 0, \\
a + b - \sqrt{e^2 + L^2} - d &< 0, \\
a - \min\bigl(b, \sqrt{e^2 + L^2}, d\bigr) &< 0.
\end{aligned} \tag{10} $$

(2) Inequality constraints due to the transmission angle:

$$ \begin{aligned}
\frac{b^2 + c^2 + d^2 - (d - a)^2}{2 \cdot b \cdot \sqrt{e^2 + L^2}} - \cos[\gamma] &\le 0, \\
\frac{b^2 + c^2 + d^2 - (d + a)^2}{2 \cdot b \cdot \sqrt{e^2 + L^2}} - \cos[\gamma] &\le 0,
\end{aligned} \tag{11} $$

where $[\gamma]$ is the allowable transmission angle of the mechanism.
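The movability conditions (10) are simple enough to evaluate directly. The following is a minimal Python sketch, not the author's MATLAB implementation; the function names are hypothetical, the link lengths are those listed in Table 1, and the slider displacement L = 0.0 is an assumed value chosen only for illustration.

```python
import math


def movability_constraints(a, b, d, e, L):
    """Return the left-hand sides of the four inequalities in (10).

    The mechanism is movable only if every returned value is negative.
    """
    c = math.sqrt(e ** 2 + L ** 2)  # shorthand for the sqrt(e^2 + L^2) term in (10)
    return [
        a + d - b - c,
        a + c - b - d,
        a + b - c - d,
        a - min(b, c, d),
    ]


def is_movable(a, b, d, e, L):
    """True when all four constraint values are strictly negative."""
    return all(g < 0 for g in movability_constraints(a, b, d, e, L))


# Link lengths from Table 1 (in metres); L = 0.0 is an assumed slider displacement.
print(is_movable(a=0.04, b=0.3902, d=0.4, e=0.0694, L=0.0))
```

In an optimizer these four values would be passed as the inequality-constraint vector, so that a candidate design is rejected as soon as one of them becomes non-negative.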


5 Hybrid Optimization Algorithm
Swarm intelligence is an emerging field of biologically inspired artificial intelligence based on the behavioral models of social insects such as ants, bees, wasps and termites. This approach uses simple and flexible agents that form a collective intelligence as a group. Since the 1990s, swarm intelligence has become a new research focus, and swarm-like algorithms such as PSO have been applied successfully to real-world optimization problems in engineering and telecommunication. The PSO algorithm does not use filtering operations (such as crossover and mutation), and the members of the entire population are maintained throughout the search procedure. By integrating the PSO algorithm with the MATLAB Optimization Toolbox, we have written a hybrid optimization design program.

5.1 Particle Swarm Optimization
PSO simulates social behavior in which a population of individuals exists. These individuals (also called "particles") are "evolved" by cooperation and competition among themselves through generations. In PSO, each potential solution is assigned a randomized velocity and is "flown" through the problem space. Each particle adjusts its flight according to its own flying experience and that of its companions. The ith particle is represented as $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ and is treated as a point in a D-dimensional space. The best previous position of each particle (the position giving its best fitness value, called pBest) is recorded and represented as $P_i = (p_{i1}, p_{i2}, \ldots, p_{iD})$. Another "best" value (called gBest) is the best position found so far by all the particles in the population; this location is represented as $P_g = (p_{g1}, p_{g2}, \ldots, p_{gD})$. At each time step, the rate of position change (velocity) of the ith particle is represented as $V_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$. Each particle moves toward its pBest and gBest locations. The performance of each particle is measured according to a fitness function, which is related to the problem to be solved. The particles are manipulated according to the following equations:

$$ v_{id} = w \cdot v_{id} + c_1 \cdot \mathrm{rand}() \cdot (p_{id} - x_{id}) + c_2 \cdot \mathrm{Rand}() \cdot (p_{gd} - x_{id}), \tag{12} $$

$$ x_{id} = x_{id} + v_{id}, \tag{13} $$

where $c_1$ and $c_2$ in equation (12) are two positive constants, which represent the weighting of the stochastic acceleration terms that pull each particle toward the pBest and gBest positions. Low values allow particles to roam far from the target regions before being tugged back. High values, on the other hand, cause abrupt movement toward, or past, the target regions. Proper values of $c_1$ and $c_2$ can quicken convergence. rand() and Rand() are two random functions in the range [0, 1]. The inertia weight $w$ provides a balance between global and local exploration and results in fewer iterations to find an optimal solution.
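The update rules (12) and (13) can be sketched for a single particle as follows. This is an illustrative Python sketch rather than the program used in the paper; the function name `pso_step` and its argument names are hypothetical, and positions are stored as plain lists.

```python
import random


def pso_step(x, v, p_best, g_best, w, c1=2.0, c2=2.0, rng=random):
    """Apply one velocity/position update, eqs. (12)-(13), to a single particle.

    x, v    : current position and velocity (lists of length D)
    p_best  : best position found so far by this particle (pBest)
    g_best  : best position found so far by the swarm (gBest)
    w       : inertia weight; c1, c2 : acceleration constants
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        vd = (w * v[d]
              + c1 * rng.random() * (p_best[d] - x[d])    # pull toward pBest
              + c2 * rng.random() * (g_best[d] - x[d]))   # pull toward gBest
        new_v.append(vd)           # eq. (12)
        new_x.append(x[d] + vd)    # eq. (13)
    return new_x, new_v
```

A particle that already sits on both pBest and gBest is moved only by its inertia term, which is a convenient sanity check for an implementation.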


5.2 Parameters Setting
The main parameters of PSO are set as follows.
(1) The inertia parameter. The inertia parameter $w$ is introduced to control the impact of the previous history of velocities on the current velocity. A larger inertia weight facilitates global optimization, while a smaller inertia weight facilitates local optimization. In this paper, the inertia weight decreases linearly with the iteration number and is obtained from the following equation:

$$ W = W_{\max} - \frac{W_{\max} - W_{\min}}{Iter} \cdot iter, \tag{14} $$

where $W_{\max}$ is the maximum value of $W$, $W_{\min}$ is the minimum value of $W$, $Iter$ is the maximum iteration number of PSO, and $iter$ is the current iteration number of PSO.
(2) The parameters $c_1$ and $c_2$. The acceleration constants $c_1$ and $c_2$ control the maximum step size a particle can take; they are set to 2, following the experience of other researchers.

5.3 Search Algorithm Hybridization

Being a global search method, the PSO algorithm is expected to give its best results if it is augmented with a local search method that is responsible for the fine search. The PSO algorithm is therefore hybridized with a gradient-based search. The "candidate" solution found by the PSO algorithm at each iteration step is used as the initial solution for a gradient-based search using the "fmincon" function in MATLAB [11], which is based on the interior-reflective Newton method. The "fmincon" function can handle the constraints of the present problem.

5.4 Hybrid Optimization Algorithm
The procedure of the hybrid optimization algorithm is as follows.
Step 1. Initialize V and X of each particle.
Step 2. Calculate the fitness of each particle.
Step 3. If the fitness value is better than the best fitness value (pBest) in history, set the current value as the new pBest.
Step 4. Choose the particle with the best fitness value of all the particles as the gBest.
Step 5. Calculate the particle velocity according to equation (12), and update the particle position according to equation (13).
Step 6. Take the above solution as an initial solution, and carry out a gradient-based search using the "fmincon" function in the MATLAB Optimization Toolbox.
Step 7. When the maximum number of iterations or the minimum error criterion is attained, the particle with the best fitness value in X is the approximate solution and the algorithm stops; otherwise let iter = iter + 1 and return to Step 2, where iter denotes the current iteration number.

Particles' velocities on each dimension are clamped to a maximum velocity $v_{\max}$, which is a parameter specified by the user. If the sum of accelerations would cause the velocity on a dimension to exceed $v_{\max}$, the velocity on that dimension is limited to $v_{\max}$.
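Steps 1 to 7, the inertia schedule (14) and the velocity clamp can be put together in a short sketch. The Python code below is only an illustration under stated assumptions: the gradient-based "fmincon" refinement is replaced by a simple derivative-free compass search (`local_refine`), the clamp value `v_max` is assumed to be 20% of each variable range, the function names are hypothetical, and a shifted sphere function stands in for the mechanism objective.

```python
import random


def local_refine(f, x, lo, hi, step=0.5, tol=1e-6):
    """Compass search from x; a simple stand-in for the fmincon refinement."""
    x, fx = list(x), f(x)
    while step > tol:
        improved = False
        for d in range(len(x)):
            for delta in (step, -step):
                y = list(x)
                y[d] = min(hi[d], max(lo[d], y[d] + delta))
                fy = f(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
        if not improved:
            step *= 0.5  # shrink the pattern when no move helps
    return x, fx


def hybrid_pso(f, lo, hi, n_particles=20, max_iter=40,
               w_max=0.9, w_min=0.4, c1=2.0, c2=2.0, seed=0):
    rng = random.Random(seed)
    dim = len(lo)
    v_max = [0.2 * (hi[d] - lo[d]) for d in range(dim)]  # assumed clamp value
    # Step 1: initialise positions X and velocities V
    X = [[rng.uniform(lo[d], hi[d]) for d in range(dim)]
         for _ in range(n_particles)]
    V = [[rng.uniform(-v_max[d], v_max[d]) for d in range(dim)]
         for _ in range(n_particles)]
    # Steps 2-4: initial fitness, pBest and gBest
    P = [list(x) for x in X]
    p_fit = [f(x) for x in X]
    best = min(range(n_particles), key=p_fit.__getitem__)
    g_best, g_fit = list(P[best]), p_fit[best]
    for it in range(max_iter):
        w = w_max - (w_max - w_min) * it / max_iter  # eq. (14)
        for i in range(n_particles):
            for d in range(dim):
                vel = (w * V[i][d]
                       + c1 * rng.random() * (P[i][d] - X[i][d])
                       + c2 * rng.random() * (g_best[d] - X[i][d]))  # eq. (12)
                V[i][d] = max(-v_max[d], min(v_max[d], vel))         # clamp to v_max
                # eq. (13), with the position kept inside the bounds
                X[i][d] = min(hi[d], max(lo[d], X[i][d] + V[i][d]))
            fit = f(X[i])
            if fit < p_fit[i]:  # Step 3: update pBest
                P[i], p_fit[i] = list(X[i]), fit
        best = min(range(n_particles), key=p_fit.__getitem__)
        if p_fit[best] < g_fit:  # Step 4: update gBest
            g_best, g_fit = list(P[best]), p_fit[best]
        cand, cand_fit = local_refine(f, g_best, lo, hi)  # Step 6: local search
        if cand_fit < g_fit:
            g_best, g_fit = cand, cand_fit
    return g_best, g_fit


def shifted_sphere(x):
    """Test objective with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


best_x, best_f = hybrid_pso(shifted_sphere, lo=[-5.0, -5.0], hi=[5.0, 5.0])
print(best_x, best_f)
```

Refining only the global best keeps the number of local searches at one per iteration; refining every particle would follow the same pattern at a higher cost.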

6 Numerical Examples
To test the validity of the proposed procedure and its ability to provide better performance, an example problem was solved. The experimental model in Fig. 1 is designed so that the link dimensions can be adjusted statically relative to the crank, while the masses of the links, their inertias and the positions of their mass centres in local coordinates are independent of the static adjustment. The mechanical properties of the five-bar mechanism, namely the link lengths, the positions of the centre of gravity of each link, the link masses and the link inertias about the mass centres, are shown in Table 1. These link lengths and angle values for the hybrid five-bar mechanism were obtained by Wang [5] in studies of optimal kinematics design. From the point of view of kinematics, the hybrid mechanism has better kinematic performance. However, a dynamic analysis of the hybrid five-bar mechanism shows that the maximum assist driving power is on the high side. As seen in Table 3, the maximum of the assist driving power is equal to 806.4 W, and the maximum of the main driving power is equal to 2392 W. These values are unfavorable for the control of the assist motor. Here we perform the optimal dynamics dimensional synthesis of the hybrid five-bar mechanism to address this problem.

Table 1. Mechanical properties of five link mechanism
$a = 0.04$ m, $b = 0.3902$ m, $e = 0.0694$ m, $d = 0.4$ m;
$\phi_0 = 8.8 \times \pi/180$ rad, $\dot{\phi} = 240 \times \pi/30$ rad/s, $\ddot{\phi} \approx 0$ rad/s²;
$m_a = 12.9$ kg, $m_b = 2.3$ kg, $m_e = 1.8$ kg, $m_l = 15.0$ kg;
$J_{aS} = 34.2 \times 10^{-3}$ kg m², $J_{bS} = 70.4 \times 10^{-3}$ kg m², $J_{eS} = 3.6 \times 10^{-3}$ kg m², $J_{lS} = 447 \times 10^{-3}$ kg m²;
$S_{ax} = 0.0$ m, $S_{ay} = 0.0$ m, $S_{bx} = 0.2405$ m, $S_{by} = 0.0$ m, $S_{ex} = 0.0227$ m, $S_{ey} = 0.0$ m, $S_{lx} = 0.0$ m, $S_{ly} = 0.0$ m.

Initial values of the design variables in equation (7) can be obtained from the link lengths and the angle $\phi_0$ in Table 1. The output motion profile is designed as a sine acceleration law. With $a = 0.04$ m and $[\gamma] = 45°$, optimal synthesis using the PSO algorithm is carried out based on Section 4. The link dimensions and performance parameters obtained for the hybrid five-bar mechanism in this study are shown in Table 2 and Table 3. Fig. 3(a)–(d) present the dynamics analysis results for the output motion. Fig. 3(a) and (b) show the assist driving torques and the main driving forces for the assist motor driving the slider on the lead screw. Fig. 3(c) and (d) show the assist and the main driving powers. In Fig. 3(a)–(d), curves 1, 2 and 3 represent respectively the dynamics calculation results for optimal kinematics design, and optimal dynamics


Table 2. The optimal values of design variables for optimal dynamic design

           a (m)    b (m)     e (m)     d (m)     φ0 (º)    ψ0 (º)
Initial    0.04     0.3902    0.0694    0.4       8.8       -10.99
min f1     0.04     0.3697    0.0771    0.3865    10.60     -11.95
min f2     0.04     0.3735    0.0772    0.3902    10.49     -9.86

Table 3. The minimum of objective functions and performance parameters for optimal dynamic design

           lmax-lmin (mm)   Fmax (N)   M0max (Nm)   max($F\dot{L}$) (W)   max($M_0\dot{\phi}$) (W)   min f2
Initial    8.24             3014       96.15        809.7                 2392                       0.3324
min f1     13.3             2615       84.86        567.4                 2138                       0.2618
min f2     13.3             2619       84.97        568.1                 2140                       0.2616

Fig. 3. Dynamics analysis results of hybrid mechanism (panels (a)–(d), each plotted against φ)

design by using the objective functions in equations (8) and (9). Curve 2 in Fig. 3(c) and (d) displays the driving power curves of the assist and main motions obtained by using the first objective function (min f1). The maximum of the assist driving power is equal to 567.4 W, and the maximum of the main driving power is equal to 2138 W. Compared with the optimal kinematics design in Table 3, the optimal dynamic design thus reduces the peak assist driving power to about 70% of its initial value (from 809.7 W to 567.4 W) and the peak main driving power to about 89% of its initial value (from 2392 W to 2138 W). As seen in Fig. 3(a)–(b), the assist driving torques and the main driving forces are also clearly reduced by the optimal dynamic design. Moreover, as shown in Table 3, the results obtained by using the second objective function (min f2) are also satisfactory. In particular, this objective is better suited to selecting the assist driving power when the main driving power is known. In addition, we found that the optimization using the PSO algorithm performed better than GA.

7 Conclusions
This paper has described the optimal dimensional synthesis of a hybrid five-bar mechanism, together with its dynamics and kinematics analysis and the optimization algorithm. The modeling and analysis based on the power bond graph is simple and convenient for solving the main driving power and the assist driving power of the hybrid mechanism. In terms of dynamic objective functions, optimal synthesis of the hybrid five-bar mechanism is carried out using a hybrid optimization algorithm built on PSO. Although some simplifications are made during the derivations, this study illustrates how well the method works. The comparisons show that better performance has been obtained in terms of the dynamics objective functions. The method presented here benefits the control of the assist motion and the development of hybrid mechanisms.

References
1. Tokuz, L.C., Jones, J.R.: Hybrid Machine Modelling and Control. Ph.D. thesis, Liverpool Polytechnic (1992)
2. Conner, A.M.: The Synthesis of Hybrid Mechanism Using Genetic Algorithm. Ph.D. thesis, Liverpool John Moores University (1996)
3. Greenough, J.D., Bradshaw, W.K., Gilmartin, M.J.: Design of Hybrid Machines. In: 9th World Congress on the Theory of Machines and Mechanisms, vol. 4, pp. 2501–2505. Milan (1995)
4. Kireçci, A., Dülger, L.C.: A Study on a Hybrid Actuator. Mechanism and Machine Theory 35, 1141–1149 (2000)
5. Wang, S.Z., Gao, X.: Research on Kinematics Design for a Variable Structure Generalized Linkage. In: Proceedings of Young Scientists Conference on Manufacturing Science and Technology for New Century, pp. 404–408. Wuhan (1998)
6. Singiresu, S.R.: Engineering Optimization. John Wiley & Sons, New York (1996)
7. Eberhart, R., Kennedy, J.: A New Optimizer Using Particle Swarm Theory. In: 6th International Symposium on Micro Machine and Human Science, Nagoya, pp. 39–43 (1995)
8. Trelea, I.C.: The Particle Swarm Optimization Algorithm: Convergence Analysis and Parameter Selection. Information Processing Letters 6, 317–325 (2003)
9. Zhang, K., Wang, S.Z.: Dynamics Analysis of a Controllable Mechanism. In: Proceedings of the ASME International Engineering Technical Conferences and Computers and Information in Engineering Conference, 7A, pp. 133–140. ASME Press, New York (2005)
10. Karnopp, D.C., Margolis, D.L., Rosenberg, R.C.: System Dynamics: A Unified Approach. John Wiley & Sons, New York (1990)
11. Grace, A.: Optimization Toolbox for Use with Matlab, User's Guide. Math Works Inc. (1994)

Robust Model Predictive Control Using a Discrete-Time Recurrent Neural Network
Yunpeng Pan and Jun Wang
Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
{yppan,jwang}@mae.cuhk.edu.hk

Abstract. Robust model predictive control (MPC) has been investigated widely in the literature. However, current robust MPC methods are too complex to employ in industrial applications. In this paper, a discrete-time recurrent neural network model is presented to solve the minimax optimization problem involved in robust MPC. The neural network has the global exponential convergence property and can be easily implemented using simple hardware. A numerical example is provided to illustrate the effectiveness and efficiency of the proposed approach.
Keywords: Robust model predictive control, recurrent neural network, minimax optimization.

1 Introduction
Model predictive control (MPC) is a powerful technique for optimizing the performance of control systems, with several advantages over other control methods [1]. MPC applies on-line optimization to the model of a system: taking the current state as the initial state, an optimization problem is solved at each sample time, and at the next computation time interval the calculation is repeated with a new state. MPC that takes uncertainties in the process model into consideration is called robust MPC. One way to deal with uncertainties in MPC is the worst-case approach, which obtains a sequence of feedback control laws that minimizes the worst-case cost. In industrial processes, this requires the real-time solution of a minimax optimization problem. Although the robustness of MPC has been studied and is now well understood, the research outcomes are conceptual controllers that can work in principle but are not suitable for hardware implementation [2]. As a result, further investigation of a more implementable controller is needed. One very promising approach to dynamic optimization is to apply recurrent neural networks. Recurrent neural networks are brain-like computational models for solving optimization problems in real time. Compared with traditional numerical methods for constrained optimization, neural networks have several advantages: first, they can solve many optimization problems with time-varying parameters; second, they can handle large-scale problems thanks to their parallelizable structure; third, they can be implemented effectively using VLSI or optical technologies. Neural networks for optimization and their engineering applications have been widely investigated in the past two decades. Many neural network models have been proposed for solving both linear and nonlinear programming problems. In this paper, we present a discrete-time recurrent neural network model for solving the quadratic minimax optimization problem associated with robust MPC. The neural network is globally exponentially convergent to the saddle point of the objective function. The rest of this paper is organized as follows. In Section 2, we formulate robust MPC as a quadratic minimax optimization problem. In Section 3, we propose a recurrent neural network model for minimax optimization, prove its global exponential convergence property, and present a control scheme for robust MPC. In Section 4, we provide an example from an industrial application to illustrate the performance of the proposed approach, and compare the proposed approach with the linear matrix inequalities approach. Finally, Section 5 concludes this paper.
F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 883–892, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Problem Formulation
2.1 Process Model
Consider the following discrete-time linear system with globally bounded uncertainties:

$$ x(k+1) = Ax(k) + Bu(k), \qquad y(k) = Cx(k) + Dw(k), \tag{1} $$

with the constraints

$$ \begin{aligned}
u_{\min} &\le u(k) \le u_{\max}, \\
\Delta u_{\min} &\le \Delta u(k) \le \Delta u_{\max}, \\
w_{\min} &\le w(k) \le w_{\max}, \\
y_{\min} &\le y(k) \le y_{\max},
\end{aligned} \tag{2} $$

where $k \ge 0$, $x(k) \in \Re^n$ is the state vector, $u(k) \in \Re^m$ is the input vector, and $y(k) \in \Re^p$ is the output vector. $w(k) \in \Re^q$ denotes the vector of bounded uncertainties. $u_{\min} \le u_{\max}$, $w_{\min} \le w_{\max}$ and $y_{\min} \le y_{\max}$ are vectors of lower and upper bounds.

2.2 Robust MPC Design
MPC is a step-by-step optimization technique: at each sampling time $k$, the current state is measured or estimated, and the optimal input vector is obtained by solving an optimization problem. When bounded uncertainties are considered explicitly, a robust MPC law can be derived by minimizing the maximum cost within the model described by the uncertainty set. The optimal control action is obtained by solving a minimax optimization problem:

$$ \min_{\Delta u} \max_{w} J(\Delta u, w), \tag{3} $$

subject to the constraints in (2).


The objective function $J(\Delta u, w)$ can use a finite or infinite horizon and a linear or quadratic norm criterion. In this paper, we consider an objective function with a finite-horizon quadratic criterion:

$$ J(\Delta u, w) = \sum_{j=1}^{N} [r(k+j|k) - y(k+j|k)]^T \Phi\, [r(k+j|k) - y(k+j|k)] + \sum_{j=0}^{N_u-1} [\Delta u(k+j|k)]^T \Psi\, [\Delta u(k+j|k)], \tag{4} $$

where $k$ is the current time step, $y(k+j|k)$ denotes the predicted output, $r(k+j|k)$ denotes the reference trajectory of the output signal (desired output), and $\Delta u(k+j|k)$ denotes the input increment, where $\Delta u(k+j|k) = u(k+j|k) - u(k-1+j|k)$. $\Phi \in \Re^{p \times p}$ and $\Psi \in \Re^{m \times m}$ are appropriate weighting matrices. $N$ denotes the prediction horizon ($1 \le N$). $N_u$ denotes the control horizon ($0 < N_u \le N$). After $N_u$ control moves, $\Delta u(k+j|k)$ becomes zero. According to the process model (1),

$$ y(k+j) = CA^{j} x(k) + C \sum_{i=0}^{j-1} A^{i} B\, u(k+j-i-1) + D w(k+j), \qquad j = 1, \ldots, N. \tag{5} $$
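The closed-form prediction (5) can be checked against a step-by-step simulation of the process model (1). The sketch below does this in Python for a scalar system (n = m = p = q = 1); the function names and all numerical values are assumptions made for the illustration, and a single disturbance value w stands for w(k + j).

```python
def predict_output(A, B, C, D, x, u_seq, w, j):
    """Closed-form j-step prediction of eq. (5) for a scalar system.

    u_seq[t] holds u(k + t); w is the disturbance acting at step k + j.
    """
    forced = sum(A ** i * B * u_seq[j - i - 1] for i in range(j))
    return C * A ** j * x + C * forced + D * w


def simulate_output(A, B, C, D, x, u_seq, w, j):
    """Iterate the process model (1) for j steps, then read the output."""
    for t in range(j):
        x = A * x + B * u_seq[t]
    return C * x + D * w


# Hypothetical scalar system and input sequence, used only for the check.
A, B, C, D = 0.9, 0.5, 2.0, 0.1
inputs = [0.2, -0.1, 0.4, 0.0]
for j in range(1, 5):
    closed = predict_output(A, B, C, D, 1.0, inputs, 0.3, j)
    stepped = simulate_output(A, B, C, D, 1.0, inputs, 0.3, j)
    print(j, closed, stepped)
```

Both columns agree up to rounding, which is the property the stacked prediction matrices rely on.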

Define the following vectors:

$$ \begin{aligned}
\bar{y}(k) &= [\,y(k+1|k) \;\cdots\; y(k+N|k)\,]^T \in \Re^{Np}, \\
\bar{u}(k) &= [\,u(k|k) \;\cdots\; u(k+N_u-1|k)\,]^T \in \Re^{N_u m}, \\
\Delta\bar{u}(k) &= [\,\Delta u(k|k) \;\cdots\; \Delta u(k+N_u-1|k)\,]^T \in \Re^{N_u m}, \\
\bar{r}(k) &= [\,r(k+1|k) \;\cdots\; r(k+N|k)\,]^T \in \Re^{Np},
\end{aligned} \tag{6} $$

where the reference trajectory $\bar{r}(k)$ is known in advance. The predicted output $\bar{y}(k)$ is expressed in the following form:

$$ \bar{y}(k) = Sx(k) + M\bar{u}(k) + Ew(k) = Sx(k) + M\Delta\bar{u}(k) + Vu(k-1) + Ew(k), \tag{7} $$

where

$$ S = [\,CA \;\; CA^2 \;\cdots\; CA^N\,]^T \in \Re^{Np \times n}, \qquad E = [\,D \;\; D \;\cdots\; D\,]^T \in \Re^{Np \times q}, $$

$$ V = \begin{bmatrix}
CB \\
C(A+I)B \\
\vdots \\
C(A^{N_u-1} + \cdots + A + I)B \\
C(A^{N_u} + \cdots + A + I)B \\
\vdots \\
C(A^{N-1} + \cdots + A + I)B
\end{bmatrix} \in \Re^{Np \times m}, $$


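$$ M = \begin{bmatrix}
CB & \cdots & 0 \\
C(A+I)B & \cdots & 0 \\
\vdots & \ddots & \vdots \\
C(A^{N_u-1} + \cdots + I)B & \cdots & CB \\
C(A^{N_u} + \cdots + I)B & \cdots & C(A+I)B \\
\vdots & \ddots & \vdots \\
C(A^{N-1} + \cdots + I)B & \cdots & C(A^{N-N_u} + \cdots + I)B
\end{bmatrix} \in \Re^{Np \times N_u m}, $$

and $I$ denotes the identity matrix. Define the vectors

$$ \begin{aligned}
\Delta\bar{u}_{\min} &= [\,\Delta u_{\min} \cdots \Delta u_{\min}\,]^T \in \Re^{N_u m}, &
\Delta\bar{u}_{\max} &= [\,\Delta u_{\max} \cdots \Delta u_{\max}\,]^T \in \Re^{N_u m}, \\
\bar{u}_{\min} &= [\,u_{\min} \cdots u_{\min}\,]^T \in \Re^{N_u m}, &
\bar{u}_{\max} &= [\,u_{\max} \cdots u_{\max}\,]^T \in \Re^{N_u m}, \\
\bar{y}_{\min} &= [\,y_{\min} \cdots y_{\min}\,]^T \in \Re^{Np}, &
\bar{y}_{\max} &= [\,y_{\max} \cdots y_{\max}\,]^T \in \Re^{Np},
\end{aligned} $$

$$ \tilde{I} = \begin{bmatrix}
I & 0 & \cdots & 0 \\
I & I & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
I & I & \cdots & I
\end{bmatrix} \in \Re^{N_u m \times N_u m}. $$

Thus, the original minimax optimization can be expressed in the following form:

$$ \begin{aligned}
\min_{\Delta u} \max_{w} \quad & [\bar{r}(k) - Sx(k) - M\Delta\bar{u}(k) - Vu(k-1) - Ew(k)]^T \Phi\, [\bar{r}(k) - Sx(k) - M\Delta\bar{u}(k) - Vu(k-1) - Ew(k)] + \Delta\bar{u}^T(k) \Psi \Delta\bar{u}(k) \\
\text{s.t.} \quad & \bar{u}_{\min} \le \bar{u}(k) + \tilde{I}\Delta\bar{u}(k) \le \bar{u}_{\max}, \\
& \Delta\bar{u}_{\min} \le \Delta\bar{u}(k) \le \Delta\bar{u}_{\max}, \\
& \bar{w}_{\min} \le \bar{w}(k) \le \bar{w}_{\max}, \\
& \bar{y}_{\min} \le \bar{y}(k) + M(k)\Delta\bar{u}(k) \le \bar{y}_{\max}.
\end{aligned} \tag{8} $$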

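By defining the variable vectors $u = \Delta\bar{u}(k) \in \Re^{N_u m}$ and $w = w(k) \in \Re^{q}$, and by neglecting the constraints on $u(k)$ and $y(k)$, problem (8) can be rewritten as a minimax quadratic programming problem:

$$ \begin{aligned}
\min_{u} \max_{w} \quad & \tfrac{1}{2} u^T Q u + c^T u - u^T H w - \tfrac{1}{2} w^T R w - b^T w \\
\text{s.t.} \quad & u \in U, \; w \in W,
\end{aligned} \tag{9} $$

where $U$ and $W$ are two box sets defined as $U = \{u \in \Re^{N_u m} \mid \Delta\bar{u}_{\min} \le u \le \Delta\bar{u}_{\max}\}$ and $W = \{w \in \Re^{q} \mid \bar{w}_{\min} \le w \le \bar{w}_{\max}\}$. The coefficient matrices and vectors are

$$ \begin{aligned}
Q &= 2(M^T \Phi M + \Psi) \in \Re^{N_u m \times N_u m}, \\
c &= -2M^T \Phi (\bar{r}(k) - Sx(k) - Vu(k-1)) \in \Re^{N_u m}, \\
R &= 2E^T \Phi E \in \Re^{q \times q}, \\
b &= \Phi(\bar{r}(k) - Sx(k) - Vu(k-1)) \in \Re^{q}, \\
H &= 2M^T \Phi E \in \Re^{N_u m \times q}.
\end{aligned} $$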


The solution to the minimax quadratic programming problem (9) gives the vector of control actions $\Delta\bar{u}(k)$. The control law is given by $\bar{u}(k) = f(\Delta\bar{u}(k) + \bar{u}(k-1))$, where $f(\cdot)$ is defined as

$$ f(\varepsilon_i) = \begin{cases} \varepsilon_i, & (G\varepsilon)_i \le l_i, \\ l_i, & (G\varepsilon)_i > l_i, \end{cases} \tag{10} $$

and $G$ and $l$ are defined as

$$ G = [\,-\tilde{I} \;\; \tilde{I} \;\; -M \;\; M\,]^T \in \Re^{(2N_u m + 2Np) \times N_u m}, \qquad
l = \begin{bmatrix} -\bar{u}_{\min} + \bar{u}(k) \\ \bar{u}_{\max} - \bar{u}(k) \\ -\bar{y}_{\min} + \bar{y}(k) \\ \bar{y}_{\max} - \bar{y}(k) \end{bmatrix} \in \Re^{2N_u m + 2Np}. $$

The first element $u(k|k)$ is used as the control signal. In industrial control processes, solving large-scale minimax optimization problems in real time is a major obstacle for robust MPC. In the next section, we propose a recurrent neural network for solving (9).

3 Recurrent Neural Network Approach
3.1 Neural Network Model
In recent years, many neural network models have been proposed for solving optimization problems [3,4,5,6,7]. In particular, continuous-time neural networks for solving minimax problems have been investigated in [8,9,10]. However, in view of the availability of digital hardware and the compatibility with digital computers, a discrete-time neural network is more desirable for practical implementation. In this section, we propose a discrete-time recurrent neural network for the minimax problem (9). By the saddle point condition [11], (9) can be formulated as a linear variational inequality (LVI):

$$ (s - s^*)^T (M s^* + q) \ge 0, \qquad \forall s \in \Omega, \tag{11} $$

where

$$ M = \begin{bmatrix} Q & -H \\ H^T & R \end{bmatrix}, \qquad q = \begin{bmatrix} c \\ b \end{bmatrix}, \qquad \Omega = U \times W. \tag{12} $$

According to the well-known saddle point theorem [11], $s^* = (u^*, w^*)$ is a saddle point of $J(u, w)$ if it satisfies

$$ J(u^*, w) \le J(u^*, w^*) \le J(u, w^*), \qquad \forall (u, w) \in \Omega. \tag{13} $$

We define the saddle point set $\Omega^* = \{(u^*, w^*) \in \Omega \mid (u^*, w^*) \text{ satisfy } (13)\}$ and assume that $\Omega^*$ is not empty. If $(u^*, w^*) \in \Omega^*$, then $(u^*, w^*)$ is the optimal solution to the minimax problem (9).


According to inequalities (13), $u^*$ is a global minimizer of the objective function $J(u, w^*)$ with respect to $U$, while $w^*$ is a global maximizer of $J(u^*, w)$ with respect to $W$. As a result, the following LVIs hold:

$$ (u - u^*)^T (Qu^* + c - Hw^*) \ge 0, \qquad \forall u \in U, \tag{14} $$

$$ (w - w^*)^T (Rw^* + b + H^T u^*) \ge 0, \qquad \forall w \in W. \tag{15} $$

The projection mapping onto a closed convex set has the basic property

$$ [z - P_\Omega(z)]^T [P_\Omega(z) - v] \ge 0, \qquad \forall z \in \Re^n, \; v \in \Omega. \tag{16} $$

Based on (14)-(16) and Lemma 1 in [9], $(u^*, w^*) \in \Omega^*$ if and only if the following equations hold:

$$ u^* = P_U[u^* - \alpha(Qu^* + c - Hw^*)], \tag{17} $$

$$ w^* = P_W[w^* - \alpha(Rw^* + b + H^T u^*)], \tag{18} $$

where $\alpha > 0$ is a scaling constant, and $P_U(\cdot)$ and $P_W(\cdot)$ are piecewise-linear activation functions defined as

$$ P_U(\varepsilon_i) = \begin{cases} \Delta u_{\min}, & \varepsilon_i < \Delta u_{\min}; \\ \varepsilon_i, & \Delta u_{\min} \le \varepsilon_i \le \Delta u_{\max}; \\ \Delta u_{\max}, & \varepsilon_i > \Delta u_{\max}, \end{cases} \qquad
P_W(\varepsilon_i) = \begin{cases} w_{\min}, & \varepsilon_i < w_{\min}; \\ \varepsilon_i, & w_{\min} \le \varepsilon_i \le w_{\max}; \\ w_{\max}, & \varepsilon_i > w_{\max}. \end{cases} \tag{19} $$

Based on equations (17) and (18), we propose a recurrent neural network for solving (9) as follows:

$$ \begin{cases} u(t+1) = P_U[u(t) - \alpha(Qu(t) + c - Hw(t))], \\ w(t+1) = P_W[w(t) - \alpha(Rw(t) + b + H^T u(t))]. \end{cases} \tag{20} $$

The proposed recurrent neural network has a simple structure and can be easily implemented using digital hardware. In the next section, we prove that the proposed neural network has the global exponential convergence property under some mild conditions.

3.2 Convergence Analysis
Definition 1. Neural network (20) is said to be globally exponentially convergent to the equilibrium point $(u^e, w^e)$ if both $u^e$ and $w^e$ satisfy

$$ \begin{aligned}
\|u(t) - u^e\| &\le c_0 \|u(0) - u^e\| e^{-\eta t}, & \forall t \ge 1; \\
\|w(t) - w^e\| &\le b_0 \|w(0) - w^e\| e^{-\eta t}, & \forall t \ge 1;
\end{aligned} \tag{21} $$

where $\eta$ is a positive constant independent of the initial point, and $c_0$ and $b_0$ are positive constants dependent on the initial point.
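The iteration (20) with the box projections (19) can be sketched in a few lines. The Python code below is an illustrative sketch only, not the authors' implementation: the function names are hypothetical, the matrices are stored as nested lists, the iteration starts from the zero vector, and the scalar problem data and the step size alpha = 0.5 (chosen below 2 divided by the largest eigenvalue of Q and of R) are assumed values.

```python
def clip(vec, lo, hi):
    """Box projection of eq. (19), applied component-wise."""
    return [min(h, max(l, z)) for z, l, h in zip(vec, lo, hi)]


def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def rnn_minimax(Q, c, H, R, b, u_box, w_box, alpha, steps=200):
    """Iterate the discrete-time network (20) for a fixed number of steps.

    u_box and w_box are (lower, upper) pairs of bound vectors.
    """
    u = [0.0] * len(c)
    w = [0.0] * len(b)
    Ht = transpose(H)
    for _ in range(steps):
        Qu, Hw = matvec(Q, u), matvec(H, w)
        Rw, Htu = matvec(R, w), matvec(Ht, u)
        # Both updates use the old values u(t) and w(t), as in (20).
        u_next = clip([ui - alpha * (qi + ci - hi)
                       for ui, qi, ci, hi in zip(u, Qu, c, Hw)], *u_box)
        w_next = clip([wi - alpha * (ri + bi + hi)
                       for wi, ri, bi, hi in zip(w, Rw, b, Htu)], *w_box)
        u, w = u_next, w_next
    return u, w


# Assumed scalar problem data: Q = 2, c = -1, H = 0.5, R = 1, b = 0.2.
u_star, w_star = rnn_minimax(Q=[[2.0]], c=[-1.0], H=[[0.5]], R=[[1.0]], b=[0.2],
                             u_box=([-1.0], [1.0]), w_box=([-1.0], [1.0]),
                             alpha=0.5)
print(u_star, w_star)
```

For these data the unconstrained stationarity conditions Qu + c - Hw = 0 and Rw + b + H^T u = 0 give u = 0.4 and w = -0.4, which lie inside both boxes, so the iterates should settle there; tightening the upper bound on u to 0.3 moves the equilibrium onto the boundary of U.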

Robust MPC Using a Discrete-Time Recurrent Neural Network


Lemma 1. The neural network (20) has a unique equilibrium point, which is the saddle point of J(u, w).

Proof. Similar to the proof in [12], we can establish that the neural network (20) has a unique equilibrium point $(u^e, w^e)$. Define the equilibrium point set $\Omega^e = \{(u^e, w^e) \in \Omega \mid (u^e, w^e) \text{ satisfies (17) and (18)}\}$. According to the above derivation, equations (17) and (18) are equivalent to (13) for all (u, w) ∈ Ω. From the definition of Ω∗, we get $\Omega^e = \Omega^*$, which means that the equilibrium point of (20) is the saddle point of J(u, w).

Lemma 2. For all $v, z \in \Re^n$,

$$\|P_U(v) - P_U(z)\|^2 \le \|v - z\|^2, \qquad \|P_W(v) - P_W(z)\|^2 \le \|v - z\|^2.$$

Proof. From inequality (16) we can easily prove that

$$\begin{aligned} \|P_U(v) - P_U(z)\|^2 &\le (v - z)^T [P_U(v) - P_U(z)] \le \|v - z\|^2, \\ \|P_W(v) - P_W(z)\|^2 &\le (v - z)^T [P_W(v) - P_W(z)] \le \|v - z\|^2, \end{aligned} \quad \forall v, z \in \Re^n. \tag{22}$$

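Define $\lambda_i^Q > 0$ $(i = 1, \ldots, N_u m)$ and $\lambda_j^R > 0$ $(j = 1, \ldots, Nq)$ as the eigenvalues of Q and R, respectively, and let $\lambda_{\min}^Q$, $\lambda_{\max}^Q$, $\lambda_{\min}^R$, $\lambda_{\max}^R$ be the smallest and largest eigenvalues of Q and R. Define two functions

$$\psi^Q(\alpha) = \begin{cases} 1 - \lambda_{\min}^Q \alpha, & 0 < \alpha \le 2/(\lambda_{\min}^Q + \lambda_{\max}^Q) \\ \lambda_{\max}^Q \alpha - 1, & 2/(\lambda_{\min}^Q + \lambda_{\max}^Q) \le \alpha < +\infty \end{cases} \tag{23}$$

$$\psi^R(\alpha) = \begin{cases} 1 - \lambda_{\min}^R \alpha, & 0 < \alpha \le 2/(\lambda_{\min}^R + \lambda_{\max}^R) \\ \lambda_{\max}^R \alpha - 1, & 2/(\lambda_{\min}^R + \lambda_{\max}^R) \le \alpha < +\infty \end{cases} \tag{24}$$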
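The functions (23) and (24) are easy to evaluate numerically, which gives a quick way to check a candidate scaling constant α before running the network. The sketch below assumes Q and R are symmetric (so that `numpy.linalg.eigvalsh` applies); the matrices are hypothetical examples and the helper names `psi` and `alpha_upper_bound` are illustrative only.

```python
import numpy as np

def psi(eigs, alpha):
    """Piecewise function of (23)/(24) for a given set of positive eigenvalues."""
    lam_min, lam_max = float(np.min(eigs)), float(np.max(eigs))
    if alpha <= 2.0 / (lam_min + lam_max):
        return 1.0 - lam_min * alpha
    return lam_max * alpha - 1.0

def alpha_upper_bound(Q, R):
    """Largest admissible step size: min{2 / lambda_max(Q), 2 / lambda_max(R)}."""
    return min(2.0 / np.linalg.eigvalsh(Q).max(),
               2.0 / np.linalg.eigvalsh(R).max())

# Hypothetical symmetric positive definite matrices.
Q = np.diag([2.0, 3.0])
R = np.diag([1.0, 2.0])
eig_Q = np.linalg.eigvalsh(Q)
eig_R = np.linalg.eigvalsh(R)
bound = alpha_upper_bound(Q, R)

for alpha in (0.1, 0.2, 0.5, 0.7):
    # psi equals the largest |1 - alpha * lambda_i| over the eigenvalues.
    print(alpha, psi(eig_Q, alpha), psi(eig_R, alpha), alpha < bound)
```

The piecewise form is just a closed-form expression for the largest value of |1 − αλ| over the spectrum, which is why it drops below 1 exactly when α is smaller than 2 divided by the largest eigenvalue.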

Then we give the following lemma:

Lemma 3. $\psi^Q(\alpha) < 1$ and $\psi^R(\alpha) < 1$ if and only if

$$0 < \alpha < \min\{2/\lambda_{\max}^Q,\ 2/\lambda_{\max}^R\}. \tag{25}$$

Proof. From Theorem 2 in [13], $\psi^Q(\alpha) < 1$ if and only if $\alpha \in (0, 2/\lambda_{\max}^Q)$; similarly, $\psi^R(\alpha) < 1$ if and only if $\alpha \in (0, 2/\lambda_{\max}^R)$. It is easy to verify that the necessary and sufficient condition for both $\psi^Q(\alpha) < 1$ and $\psi^R(\alpha) < 1$ is $0 < \alpha < \min\{2/\lambda_{\max}^Q, 2/\lambda_{\max}^R\}$.

Theorem 1. With any α that satisfies (25), the neural network (20) is globally exponentially convergent to the saddle point of J(u, w).

Proof. From (23) and (24), we obtain $\psi^Q(\alpha)^2 = \max\{(1 - \alpha\lambda_1^Q)^2, \ldots, (1 - \alpha\lambda_{N_u m}^Q)^2\}$ and $\psi^R(\alpha)^2 = \max\{(1 - \alpha\lambda_1^R)^2, \ldots, (1 - \alpha\lambda_{Nq}^R)^2\}$.


By Lemma 2:

$$\begin{aligned} \|u(t) - u^*\|^2 &= \|P_U[u(t-1) - \alpha(Qu(t-1) + c - Hw(t-1))] - P_U[u^* - \alpha(Qu^* + c - Hw^*)]\|^2 \\ &\le \|(I - \alpha Q)(u(t-1) - u^*)\|^2 \\ &\le \max\{(1 - \alpha\lambda_1^Q)^2, \ldots, (1 - \alpha\lambda_{N_u m}^Q)^2\}\, \|u(t-1) - u^*\|^2 \\ &= \psi^Q(\alpha)^2 \|u(t-1) - u^*\|^2 \end{aligned} \tag{26}$$

$$\Longrightarrow \|u(t) - u^*\| \le \psi^Q(\alpha) \|u(t-1) - u^*\| \le \psi^Q(\alpha)^t \|u(0) - u^*\| \le e^{-\eta^Q(\alpha) t} \|u(0) - u^*\|.$$

Similarly, $\|w(t) - w^*\| \le e^{-\eta^R(\alpha) t} \|w(0) - w^*\|$. From Lemma 3, $\eta^Q(\alpha) > 0$ ($\psi^Q(\alpha) < 1$) and $\eta^R(\alpha) > 0$ ($\psi^R(\alpha) < 1$) for all α that satisfy (25). From the above proof and Lemma 1, for any α that satisfies (25), the neural network (20) is globally exponentially convergent to the unique equilibrium point (u∗, w∗), which is the saddle point of J(u, w).

3.3 Control Scheme

The control scheme based on the proposed recurrent neural network can be summarized as follows:

1. Let k = 1. Set the terminal time T, sample time t, prediction horizon N, control horizon $N_u$, and weighting matrices Φ and Ψ.
2. Calculate the process model matrices S, E, V, M and the neural network parameters Q, R, H, c, b.
3. Solve the quadratic minimax problem (9) using the proposed recurrent neural network, obtaining the optimal control action $\Delta\bar{u}(k)$.
4. Calculate the optimal input vector $\bar{u}(k) = f(\Delta\bar{u}(k) + \bar{u}(k-1))$; the first element u(k|k) is sent to the process.
5. If k < T, set k = k + 1 and return to step 2; otherwise, end.

4 Numerical Example

Consider a two-tank system described in [14], which is a two-input, two-output system, with the flow rates of the two inlet streams as the two inputs and the liquid level in each tank as the two output variables. By sampling at 0.2 min using a zero-order hold, the following discrete-time state-space model can be obtained:

$$x(k+1) = \begin{bmatrix} -\frac{0.5}{3} & \frac{0.2}{3} \\ \frac{0.5}{2} & -\frac{0.5}{2} \end{bmatrix} x(k) + \begin{bmatrix} \frac{1}{3} & 0 \\ 0 & \frac{1}{2} \end{bmatrix} u(k), \qquad y(k) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} x(k) \tag{27}$$


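The set-points for the liquid levels (outputs) of tanks 1 and 2 are 0.8 and 0.7, respectively; the prediction and control horizons are N = 10 and $N_u$ = 4; the weighting matrices are Φ = I and Ψ = 5I; the scaling constant is α = 0.2; an uncertainty −0.02 ≤ w ≤ 0.02 is considered to affect both liquid levels of tanks 1 and 2; moreover, the following constraints are considered:

$$\begin{bmatrix} 0 \\ 0 \end{bmatrix} \le u(k) \le \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}, \qquad \begin{bmatrix} 0 \\ 0 \end{bmatrix} \le y(k) \le \begin{bmatrix} 0.6 \\ 0.7 \end{bmatrix}, \qquad \begin{bmatrix} -0.05 \\ -0.05 \end{bmatrix} \le \Delta u(k) \le \begin{bmatrix} 0.05 \\ 0.05 \end{bmatrix} \tag{28}$$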

Fig. 1. Input signals of tanks 1 and 2 using the proposed RNN approach and LMI approach (two panels, "Input u1" and "Input u2", plotted against samples k from 0 to 100; legend: RNN, LMI)

Fig. 2. Output responses of tanks 1 and 2 using the proposed RNN approach and LMI approach (two panels, "Output y1" and "Output y2", plotted against samples k from 0 to 100; legend: RNN, LMI)

To assess the effectiveness and efficiency of the proposed approach, a linear matrix inequalities (LMI) approach [1] is also applied to the process. The simulation results are shown in Figs. 1-2. The proposed neural network approach gives better set-point tracking performance, with output responses that settle faster.


5 Conclusion

This paper presents a new approach to robust MPC based on a discrete-time recurrent neural network that solves a minimax optimization problem. The neural network is proved to be globally exponentially convergent. In the simulation, the proposed neural network approach gives better set-point tracking performance than a linear matrix inequalities approach.

References

1. Camacho, E., Bordons, C.: Model Predictive Control. Springer, Heidelberg (2004)
2. Mayne, D., Rawlings, J., Rao, C., Scokaert, P.: Constrained model predictive control: Stability and optimality. Automatica 36, 789–814 (2000)
3. Zhang, Y., Wang, J.: A dual neural network for convex quadratic programming subject to linear equality and inequality constraints. Physics Letters A 298, 271–278 (2002)
4. Xia, Y., Feng, G., Wang, J.: A recurrent neural network with exponential convergence for solving convex quadratic program and related linear piecewise equations. Neural Networks 17, 1003–1015 (2004)
5. Liu, S., Wang, J.: A simplified dual neural network for quadratic programming with its KWTA application. IEEE Trans. Neural Netw. 17, 1500–1510 (2006)
6. Hu, X., Wang, J.: Solving pseudomonotone variational inequalities and pseudoconvex optimization problems using the projection neural network. IEEE Trans. Neural Netw. 17, 1487–1499 (2006)
7. Liu, Q., Wang, J.: A one-layer recurrent neural network with a discontinuous hard-limiting activation function for quadratic programming. IEEE Trans. Neural Netw. 19, 558–570 (2008)
8. Tao, Q., Fang, T.: The neural network model for solving minimax problems with constraints. Control Theory Applicat. 17, 82–84 (2000)
9. Gao, X., Liao, L., Xue, W.: A neural network for a class of convex quadratic minimax problems with constraints. IEEE Trans. Neural Netw. 15, 622–628 (2004)
10. Gao, X., Liao, L.: A novel neural network for a class of convex quadratic minimax problems. Neural Computation 18, 1818–1846 (2006)
11. Bazaraa, M., Sherali, H., Shetty, C.: Nonlinear programming: theory and algorithms (1993)
12. Perez-Ilzarbe, M.: Convergence analysis of a discrete-time recurrent neural network to perform quadratic real optimization with bound constraints. IEEE Trans. Neural Netw. 9, 1344–1351 (1998)
13. Tan, K., Tang, H., Yi, Z.: Global exponential stability of discrete-time neural networks for constrained quadratic optimization. Neurocomputing 56, 399–406 (2004)
14. Alamo, T., Ramírez, D., Camacho, E.: Efficient implementation of constrained min–max model predictive control with bounded uncertainties: a vertex rejection approach. Journal of Process Control 15, 149–158 (2005)

A PSO-Based Method for Min-ε Approximation of Closed Contour Curves

Bin Wang1,⋆, Chaojian Shi2,3, and Jing Li4

1 Key Laboratory of Electronic Business, Nanjing University of Finance and Economics, Nanjing 210003, P.R. China
[email protected]
2 Merchant Marine College, Shanghai Maritime University, Shanghai, 200135, P.R. China
3 Department of Computer Science and Engineering, Fudan University, Shanghai, 200433, P.R. China
4 Alcatel Shanghai Bell Company Limited, Shanghai 201206, P.R. China

Abstract. Finding a polygon that approximates a contour curve with the minimal approximation error ε under a pre-specified number of vertices is termed the min-ε problem. It is an important issue in image analysis and pattern recognition. A discrete version of the particle swarm optimization (PSO) algorithm is proposed to solve this problem. In this method, the position of each particle is represented as a binary string that corresponds to an approximating polygon. Many particles form a swarm that flies through the solution space to seek the best solution. For particles that fly out of the feasible region, the traditional split and merge techniques are applied to adjust their positions; this not only moves the particles from the infeasible solution space to the feasible region, but also relocates them to better sites. The experimental results show that the proposed PSO-based method outperforms the GA-based methods.

Keywords: Closed contour curve; PSO; Position adjustment; Polygonal approximation.

1 Introduction

For a contour curve, finding an optimal polygon that represents it with the minimal approximation error ε under a pre-specified number of vertices is a hot topic in pattern recognition and image analysis. This problem is usually termed min-ε approximation. Owing to its simplicity, generality and compactness, this curve representation scheme has found wide application in areas such as planar object recognition [1], shape understanding [2] and image matching [3].

In the past decades, many algorithms have been proposed to solve the min-ε approximation problem. Some of them are developed to seek the optimal solution. The representative algorithm is dynamic programming (DP) [4,5,6]. Although these algorithms can always obtain exact optimal solutions, their computational cost is very high because they adopt an exhaustive search scheme. Among these algorithms, Perez and Vidal [6] propose a DP-based algorithm to solve the min-ε problem. Its time complexity reaches $O(MN^4)$ for a closed contour curve [7], where M and N are the number of polygon vertices and the number of curve points, respectively. Therefore, these algorithms are unsuitable for real applications, because closed contour curves usually have a large number of points.

To save computational cost, many more methods aim to seek near-optimal solutions using local search heuristics. These methods can be divided into the following three groups: (1) sequential tracing approaches [8,9,10,11]; (2) the split method [12], the merge method [13] and the split-and-merge method [14]; (3) dominant point or angle detection approaches [15,16,17]. These methods work fast; however, because they only take local information into account, the search process may get trapped in a local optimum. To overcome this problem, nature-inspired algorithms such as genetic algorithms (GA) [18,19] and ant colony optimization (ACO) [20] have been applied to the min-ε problem with encouraging results.

In recent years, a novel nature-inspired algorithm, termed particle swarm optimization (PSO), has been proposed by Eberhart and Kennedy [21] to solve various optimization problems. PSO is inspired by observations of the social behavior of animals, such as bird flocking and fish schooling, and by swarm theory. It is initialized with a swarm of randomly generated particles that correspond to candidate solutions. Each particle flies through the solution space with a velocity that is dynamically adjusted according to its own and its companions' historical experience, in order to seek the best solution.

In this paper, we consider using PSO to solve the min-ε problem. The main contributions of our work are as follows. (1) Although PSO has been widely applied in various fields, these applications mainly concern continuous optimization problems, and research on combinatorial optimization problems is relatively scarce. Our work on applying PSO to the min-ε problem therefore extends this research. (2) Coping with infeasible solutions is a difficult problem when using PSO to solve the min-ε problem, since a particle may fly into the infeasible region. Another problem with PSO is that, although it possesses strong global search ability, its local search ability is poor. In this paper, the traditional split and merge techniques are combined with PSO. This scheme has two advantages: first, particles can easily be moved from the infeasible region to the feasible solution space; second, particles can be relocated to relatively better positions.

⋆ Corresponding author.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263, pp. 893–902, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Problem Statement

A closed contour curve C can be represented as an ordered set $C = \{p_1, p_2, \ldots, p_N\}$, where $p_{i+1}$ is the next point after $p_i$ and, since the curve is closed, the next point after $p_N$ is $p_1$. Let the ordered subset $\widehat{p_i p_j} = \{p_i, p_{i+1}, \ldots, p_j\}$ represent the arc of curve C that starts at point $p_i$ and ends at point $p_j$ in the clockwise direction, and let $\overline{p_i p_j}$ denote the chord of C that connects $p_i$ and $p_j$. The approximation error between the arc $\widehat{p_i p_j}$ and the chord $\overline{p_i p_j}$ can be measured as

$$e(\widehat{p_i p_j}, \overline{p_i p_j}) = \sum_{p_k \in \widehat{p_i p_j}} d^2(p_k, \overline{p_i p_j}),$$

where $d(p_k, \overline{p_i p_j})$ is the perpendicular distance from point $p_k$ to the chord $\overline{p_i p_j}$. The polygon V approximating the contour $C = \{p_1, p_2, \ldots, p_N\}$ is a set of ordered line segments $V = \{\overline{p_{t_1} p_{t_2}}, \overline{p_{t_2} p_{t_3}}, \ldots, \overline{p_{t_{M-1}} p_{t_M}}, \overline{p_{t_M} p_{t_1}}\}$ such that $t_1 < t_2 < \ldots < t_M$ and $\{p_{t_1}, p_{t_2}, \ldots, p_{t_M}\} \subseteq \{p_1, p_2, \ldots, p_N\}$, where M is the number of vertices of the polygon V. The approximation error between the curve C and the approximating polygon V is measured by the integral square error (ISE), which is defined as

$$ISE(V, C) = \sum_{i=1}^{M} e(\widehat{p_{t_i} p_{t_{i+1}}}, \overline{p_{t_i} p_{t_{i+1}}}),$$

where $t_{M+1}$ is taken as $t_1$ because the polygon is closed. The min-ε approximation can then be stated as follows: for a pre-specified integer M (3 ≤ M ≤ N), assume that Ω denotes the set of all approximating polygons of curve C. Let $\psi = \{V \mid V \in \Omega \wedge |V| = M\}$, where |V| denotes the cardinality of V. Find a polygon P ∈ ψ such that

$$ISE(P, C) = \min_{V \in \psi} ISE(V, C).$$
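The ISE of a candidate polygon can be computed directly from these definitions. The following is a minimal sketch under a few assumptions that are not in the paper: points are (x, y) tuples, indices are 0-based, the perpendicular distance is measured to the infinite line through the chord endpoints, and the function names are illustrative.

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (to a itself if a == b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length

def arc_error(curve, i, j):
    """Sum of squared distances from the arc points i..j (cyclic) to chord p_i p_j."""
    n = len(curve)
    err = 0.0
    k = i
    while k != j:
        err += perpendicular_distance(curve[k], curve[i], curve[j]) ** 2
        k = (k + 1) % n
    return err  # the end point p_j lies on the chord and contributes 0

def ise(curve, vertices):
    """Integral square error of the closed polygon given by sorted vertex indices."""
    m = len(vertices)
    return sum(arc_error(curve, vertices[s], vertices[(s + 1) % m])
               for s in range(m))

# A small closed contour: the corners and edge midpoints of a square.
curve = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
print(ise(curve, [0, 2, 4, 6]))  # the four corners reproduce the square exactly
print(ise(curve, [0, 2, 4]))     # a triangle cuts off one corner
```

For the triangle, the three points cut off by the diagonal chord contribute squared distances 0.5, 2 and 0.5, so the two calls illustrate how dropping a vertex raises the error from zero.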

3 Particle Swarm Optimization (PSO)

Here, we review the PSO method proposed by Eberhart and Kennedy [21]. Assume that the search space is N-dimensional and that M particles form the swarm. The ith particle is represented as an N-dimensional string $X_i$, which means that the ith particle is located at $X_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$ $(i = 1, 2, \ldots, M)$ in the search space. The position of each particle represents a candidate solution. The fitness value of each particle is calculated by substituting its position into a designated objective function; the higher the fitness value, the better the corresponding $X_i$. Each particle flies through the search space with a velocity, which is also an N-dimensional vector, denoted as $V_i = (v_{i1}, v_{i2}, \ldots, v_{iN})$ $(i = 1, 2, \ldots, M)$. Let $v_{ij}(t)$, $j = 1, 2, \ldots, N$, denote the velocity of the ith particle at time t. $Pb_i = (pb_{i1}, pb_{i2}, \ldots, pb_{iN})$ is the previous position that yielded the best fitness value for the ith particle, and gbest, with components $gb_j$, is the best position discovered by the whole population. $c_1$ and $c_2$ are the acceleration constants, and $r_{1j}$ and $r_{2j}$ are two random numbers in the range [0,1]. The velocity $v_{ij}$ is restricted by a maximum threshold $v_{\max}$. Then the new velocity at time t + 1 is calculated as

$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j} (pb_{ij} - x_{ij}(t)) + c_2 r_{2j} (gb_j - x_{ij}(t)), \quad j = 1, 2, \ldots, N, \tag{1}$$

and the position is updated as

$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1), \quad j = 1, 2, \ldots, N. \tag{2}$$

The above computing model can only cope with continuous optimization problems. Recently, Eberhart and Kennedy [21] proposed a discrete binary version of PSO for discrete optimization problems. In this scheme, each particle is represented as an N-dimensional binary string. Eq. 1 remains unchanged. However, since the velocity is not an integer, the resulting change in position is defined as follows, where rand() is a random number in the range [0,1]:

$$x_{ij} = \begin{cases} 1 & \text{if } rand() < 1/(1 + e^{-v_{ij}}), \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$
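One step of the binary update, combining Eq. 1 with the sigmoid rule of Eq. 3, might look as follows in NumPy. This is a sketch under assumptions: the swarm is stored as a K-by-N array, velocities are clamped to [−v_max, v_max], fresh random numbers are drawn for every particle and dimension, and the function name `binary_pso_step` is illustrative. The default constants echo the values reported later in the experiments.

```python
import numpy as np

def binary_pso_step(x, v, pbest, gbest, rng, c1=2.0, c2=2.0, v_max=6.0):
    """One velocity update (Eq. 1) followed by the binary position rule (Eq. 3)."""
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v_new = np.clip(v_new, -v_max, v_max)        # restrict by the threshold v_max
    prob_one = 1.0 / (1.0 + np.exp(-v_new))      # sigmoid of the velocity
    x_new = (rng.random(x.shape) < prob_one).astype(int)
    return x_new, v_new

# Illustrative swarm: 20 particles on a curve with 45 points.
rng = np.random.default_rng(0)
K, N = 20, 45
x = rng.integers(0, 2, size=(K, N))
v = np.zeros((K, N))
pbest = x.copy()
gbest = x[0].copy()

x_new, v_new = binary_pso_step(x, v, pbest, gbest, rng)
print(x_new.shape, int(x_new.sum(axis=1).min()), int(x_new.sum(axis=1).max()))
```

Note that nothing in this update keeps the number of ones per row equal to the required number of vertices, which is exactly why a position adjustment step is needed afterwards.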

4 The Proposed PSO-Based Method

4.1 Particle Representation and Fitness Evaluation

Each particle is represented by a binary string $X_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$ that corresponds to a candidate solution: the jth point $p_j$ of the curve is chosen as a vertex of the approximating polygon if and only if $x_{ij} = 1$, where N is the number of curve points. Thus, the particle representation indicates which points are chosen from the curve to construct the polygon. For instance, given a curve $C = \{p_1, p_2, \ldots, p_{10}\}$ and a particle with $x_i = (1, 0, 0, 0, 1, 0, 0, 0, 1, 0)$, the approximating polygon that the particle represents is $\{\overline{p_1 p_5}, \overline{p_5 p_9}, \overline{p_9 p_1}\}$. Each particle has a fitness value. From the definition of the min-ε problem, the smaller the approximation error, the better the approximating polygon. So, we define the fitness function of each particle as follows. Assume that $\alpha_i$ is the solution that a particle $x_i$ represents. Then the fitness function is defined as

$$f(x_i) = -ISE(\alpha_i). \tag{4}$$
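The binary encoding can be decoded with a few lines of Python. The sketch below reproduces the ten-point example from the text; the function name `decode` and the use of 1-based indices (to match the subscripts of the points) are illustrative choices.

```python
def decode(bits):
    """Return the 1-based vertex indices and the polygon sides as index pairs."""
    vertices = [j + 1 for j, bit in enumerate(bits) if bit == 1]
    sides = [(vertices[s], vertices[(s + 1) % len(vertices)])
             for s in range(len(vertices))]
    return vertices, sides

vertices, sides = decode((1, 0, 0, 0, 1, 0, 0, 0, 1, 0))
print(vertices)  # [1, 5, 9]
print(sides)     # [(1, 5), (5, 9), (9, 1)]
```

The fitness of the particle is then the negative ISE of the decoded polygon, as in Eq. 4.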

Eq. 4 means that the smaller the approximation error, the higher the fitness value.

4.2 Particle's Position Adjustment

A particle may fly out of the feasible region and yield an infeasible solution. Moreover, if the search relies only on the particle's own experience and the social experience, it takes a long time to reach better search areas because of the poor local search ability of PSO. The traditional split and merge techniques, which have strong local search ability, are therefore used to adjust the particle's position during the search process.

Split Technique: The traditional split technique is a very simple method for generating an approximating polygon. It starts from an initial curve segmentation and then recursively partitions the segments into smaller portions at selected points until the pre-specified constraint is satisfied. The split procedure can be described as follows. Suppose that curve C has been partitioned into M arcs $\widehat{p_{t_1} p_{t_2}}, \ldots, \widehat{p_{t_{M-1}} p_{t_M}}, \widehat{p_{t_M} p_{t_1}}$, where $p_{t_i}$ is the ith segment point. Then a split operation on the curve C is: for each point $p_i \in C$, assume that $p_i \in \widehat{p_{t_j} p_{t_{j+1}}}$ and calculate the distance $D(p_i) = d(p_i, \overline{p_{t_j} p_{t_{j+1}}})$. Find a point $p_u$ on the curve that satisfies $D(p_u) = \max_{p_i \in C} D(p_i)$. Assume that the selected point falls into the arc $\widehat{p_{t_k} p_{t_{k+1}}}$. Then the arc $\widehat{p_{t_k} p_{t_{k+1}}}$ is segmented at the point $p_u$ into two smaller arcs $\widehat{p_{t_k} p_u}$ and $\widehat{p_u p_{t_{k+1}}}$, and the point $p_u$ is added to the set of segment points.

Merge Technique: The merge technique is another simple method for producing an approximating polygon of a digital curve. It is a recursive method that starts with an initial polygon taking all the points of the curve as its vertices. At each iteration, a merge procedure merges two selected adjacent segments. This procedure is repeated until the obtained polygon satisfies the pre-specified constraint. The merge procedure is described as follows. Assume that curve C has been segmented into M arcs $\widehat{p_{t_1} p_{t_2}}, \ldots, \widehat{p_{t_{M-1}} p_{t_M}}, \widehat{p_{t_M} p_{t_1}}$, where $p_{t_i}$ is the ith segment point. Then a merge operation on curve C is defined as follows: for each segment point $p_{t_i}$, calculate the distance $Q(p_{t_i}) = d(p_{t_i}, \overline{p_{t_{i-1}} p_{t_{i+1}}})$, where $p_{t_{i-1}}$ and $p_{t_{i+1}}$ are the two segment points adjacent to $p_{t_i}$. Select a segment point $p_{t_j}$ that satisfies $Q(p_{t_j}) = \min_{p_{t_i} \in V} Q(p_{t_i})$, where $V = \{p_{t_1}, p_{t_2}, \ldots, p_{t_M}\}$. Then the two arcs $\widehat{p_{t_{j-1}} p_{t_j}}$ and $\widehat{p_{t_j} p_{t_{j+1}}}$ are merged into a single arc $\widehat{p_{t_{j-1}} p_{t_{j+1}}}$, and the segment point $p_{t_j}$ is removed from the set of current segment points.

Position adjustment: Here, we use the above split and merge techniques to adjust the position of the particles, i.e., to move a particle from the infeasible solution space to the feasible region. Assume that the pre-specified number of sides of the approximating polygon is M. For a particle $x_i$ that flies out of the feasible region, let $\alpha_i$ be the solution that the particle represents. Since $\alpha_i$ is an infeasible solution, we have $|\alpha_i| \ne M$, where $|\alpha_i|$ denotes the number of sides of the approximating polygon $\alpha_i$. The infeasible solution $\alpha_i$ is then repaired as follows: if $|\alpha_i| > M$, the merge operation is applied repeatedly until $|\alpha_i| = M$; if $|\alpha_i| < M$, the split operation is applied repeatedly until $|\alpha_i| = M$. This mending process shows that, using the split and merge techniques, an infeasible solution can easily be transformed into a feasible one. Moreover, because the split technique tries to find new promising vertices for the approximating polygon and the merge technique aims to remove possibly redundant vertices from it in a heuristic way, the transformed feasible solution maintains relative optimality. In other words, for particles that fly out of the feasible region, the split and merge process moves them from the infeasible solution space to the feasible region and, at the same time, relocates them to relatively better positions in the solution space.

Fig. 1. The flow of the proposed PSO

Fig. 2. Four benchmark curves: (a) figure-8, (b) chromosome, (c) semicircle, (d) leaf

4.3 Algorithm Flow

Let $v_{\max}$ denote the maximal velocity, G the maximal number of iterations, and K the number of particles in the swarm. The algorithm flow is plotted in Fig. 1.
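The position adjustment of Section 4.2, which the flow relies on whenever a particle leaves the feasible region, can be sketched as below. This is an illustrative reading of the split and merge operations rather than the authors' code: it assumes 0-based indices, (x, y) tuples, distance measured to the line through the chord endpoints, first-found tie-breaking, and at least two selected vertices; all function names are hypothetical.

```python
import math

def line_distance(p, a, b):
    """Distance from p to the line through a and b (to a itself if a == b)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    return abs(dx * (p[1] - a[1]) - dy * (p[0] - a[0])) / length

def split_once(curve, vertices):
    """Add the curve point that is farthest from the chord of its own arc."""
    n, m = len(curve), len(vertices)
    best_k, best_d = None, -1.0
    for s in range(m):
        a, b = vertices[s], vertices[(s + 1) % m]
        k = (a + 1) % n
        while k != b:                      # interior points of the arc a..b
            d = line_distance(curve[k], curve[a], curve[b])
            if d > best_d:
                best_k, best_d = k, d
            k = (k + 1) % n
    return sorted(vertices + [best_k])

def merge_once(curve, vertices):
    """Remove the vertex closest to the chord joining its two neighbours."""
    m = len(vertices)
    def q(s):
        return line_distance(curve[vertices[s]],
                             curve[vertices[s - 1]],
                             curve[vertices[(s + 1) % m]])
    s_min = min(range(m), key=q)
    return vertices[:s_min] + vertices[s_min + 1:]

def repair(curve, bits, m_target):
    """Merge or split until exactly m_target vertices remain; return the new bits."""
    vertices = [j for j, bit in enumerate(bits) if bit == 1]
    if len(vertices) < 2:
        raise ValueError("this sketch assumes at least two selected vertices")
    while len(vertices) > m_target:
        vertices = merge_once(curve, vertices)
    while len(vertices) < m_target:
        vertices = split_once(curve, vertices)
    chosen = set(vertices)
    return [1 if j in chosen else 0 for j in range(len(curve))]

# Corners and edge midpoints of a square, with a target of M = 4 vertices.
curve = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
print(repair(curve, [1] * 8, 4))                    # too many vertices: merge
print(repair(curve, [1, 0, 0, 0, 1, 0, 0, 0], 4))   # too few vertices: split
```

In both cases the repair ends at the four corners of the square, which illustrates the claim that the adjustment not only restores feasibility but also tends to move the particle toward a good polygon.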

5 Experimental Results and Discussion

Here, a group of benchmark curves (see Fig. 2; their chain codes can be obtained from [15]) is used to evaluate the performance of the proposed method. The numbers of points of these curves are 45, 60, 102 and 120, respectively. We have run the existing GA-based methods of Chen [18] and Sarkar [19] on these curves for comparison with our method. Since GA and PSO both adopt a probabilistic search scheme, each competing method is run ten times and the best result of these runs is reported. The experimental platform is a PC with a Pentium 4 2.4 GHz CPU running Windows XP, and all the competing methods are coded in Borland Delphi 7.0. The parameters of the proposed PSO are as follows: the swarm size, i.e., the number of particles, is 20; the acceleration constants are $c_1 = c_2 = 2$; the maximal velocity is $v_{\max} = 6$; and the maximal number of iterations is G = 60. The parameters of the methods of Chen [18] and Sarkar [19] are set to the values given in those papers.

The integral square error ISE and the number of vertices M reflect the quality of a polygonal approximation in terms of precision and compactness, respectively. They provide an absolute measure, namely fixing one and using the other alone for measurement. To assess the relative merits of the various methods, Rosin [22] proposed a unified performance measure. This scheme is based on two measures, fidelity (error measurement) and efficiency (compression ratio), which mainly consider how far the obtained solution is from the optimal one. Rosin [22] defines them as follows:

$$Fidelity = \frac{E_{opt}}{E_{appr}} \times 100 \tag{5}$$

$$Efficiency = \frac{M_{opt}}{M_{appr}} \times 100 \tag{6}$$

where $E_{appr}$ and $M_{appr}$ denote the approximation error and the number of vertices of the polygon obtained by the tested algorithm, respectively. $E_{opt}$ is the approximation error incurred by the optimal algorithm when it generates the same number of vertices as the tested algorithm. $M_{opt}$ denotes the number of vertices that the optimal algorithm requires to generate the same error $E_{appr}$ as the tested algorithm. However, an exact value of $M_{opt}$ is usually not available; it can be estimated by linear interpolation of the two closest integer values of $M_{opt}$. A unified measure that combines Fidelity and Efficiency is then defined as

$$Merit = \sqrt{Fidelity \times Efficiency} \tag{7}$$
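Eqs. 5 to 7 translate directly into code. The short sketch below uses illustrative function names and assumes that the optimal error and vertex count have already been obtained from an exact method; as a sanity check, it recomputes the merit from the fidelity and efficiency reported for Sarkar [19] on the figure-8 curve in Table 1.

```python
import math

def fidelity(e_opt, e_appr):
    """Eq. 5: optimal error relative to the achieved error, in percent."""
    return e_opt / e_appr * 100.0

def efficiency(m_opt, m_appr):
    """Eq. 6: optimal vertex count relative to the achieved count, in percent."""
    return m_opt / m_appr * 100.0

def merit(fid, eff):
    """Eq. 7: geometric mean of fidelity and efficiency."""
    return math.sqrt(fid * eff)

# Fidelity 85.2 and efficiency 94.2 are the values listed in Table 1
# for Sarkar [19] on the figure-8 curve with M = 11.
print(round(merit(85.2, 94.2), 1))
```

Because the merit is a geometric mean, it reaches 100 only when both fidelity and efficiency are 100, i.e., when the tested solution is itself optimal.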

Note that, to calculate the above three measures, an optimal algorithm such as the dynamic programming method [6] has to be run to obtain the optimal polygons and their errors for various numbers of vertices. Rosin [22] used this scheme to test 31 sub-optimal algorithms and ranked them according to the merit value. The best solutions of ten independent runs of each competing method on all the test cases are listed in Table 1, where M is the specified number of vertices and ISE is the approximation error of the best solution obtained. The three measures, fidelity, efficiency and merit, are also calculated for each best solution using Eq. 5, Eq. 6 and Eq. 7, respectively, and listed in the table. Regarding computational cost, for all the cases in Table 1, Chen [18] requires 1.478 seconds and Sarkar [19] requires 0.711 seconds, while the proposed PSO requires only 0.155 seconds. From the comparative experimental results, we can see that: (1) the proposed PSO outperforms the GA-based methods of Chen [18] and Sarkar [19] in solution quality, namely, for the same specified number of vertices, the proposed PSO produces the approximating polygon with the least approximation error; (2) on all the test cases, PSO obtains the highest merit value among the competing methods, and in many cases the merit value reaches 100, which shows that in such cases the solutions obtained are exactly optimal; (3) the proposed PSO has higher computational efficiency than the competing GA-based methods.

Table 1. The results of methods Chen [18], Sarkar [19] and the proposed PSO

Curves      Method       M   ISE    Fidelity  Efficiency  Merit
Figure-8    Sarkar [19]  11  3.40   85.2      94.2        89.6
Figure-8    Chen [18]    11  2.90   100       100         100
Figure-8    PSO          11  2.90   100       100         100
Figure-8    Sarkar [19]  12  2.54   94.3      97.6        95.9
Figure-8    Chen [18]    12  2.40   100       100         100
Figure-8    PSO          12  2.40   100       100         100
Figure-8    Sarkar [19]  13  2.18   93.7      97.0        95.3
Figure-8    Chen [18]    13  2.04   100       100         100
Figure-8    PSO          13  2.04   100       100         100
Chromosome  Sarkar [19]  15  3.91   97.1      98.0        97.5
Chromosome  Chen [18]    15  3.80   100       100         100
Chromosome  PSO          15  3.80   100       100         100
Chromosome  Sarkar [19]  17  3.18   98.4      99.0        98.7
Chromosome  Chen [18]    17  3.13   100       100         100
Chromosome  PSO          17  3.13   100       100         100
Chromosome  Sarkar [19]  18  2.88   98.1      99.0        98.6
Chromosome  Chen [18]    18  2.83   100       100         100
Chromosome  PSO          18  2.83   100       100         100
Semicircle  Sarkar [19]  22  8.06   87.0      95.1        91.0
Semicircle  Chen [18]    22  7.19   97.5      99.2        99.3
Semicircle  PSO          22  7.04   99.6      99.9        99.7
Semicircle  Sarkar [19]  26  4.79   84.6      95.2        89.8
Semicircle  Chen [18]    26  4.73   85.7      95.6        90.5
Semicircle  PSO          26  4.05   100       100         100
Semicircle  Sarkar [19]  27  4.70   78.8      92.2        85.2
Semicircle  Chen [18]    27  3.74   99.0      99.6        99.3
Semicircle  PSO          27  3.70   100       100         100
Leaf        Sarkar [19]  23  11.77  80.4      91.8        85.9
Leaf        Chen [18]    23  9.87   95.9      98.5        97.2
Leaf        PSO          23  9.53   99.3      99.8        99.5
Leaf        Sarkar [19]  29  6.36   90.0      95.4        92.7
Leaf        Chen [18]    29  5.86   97.6      99.0        98.3
Leaf        PSO          29  5.72   100       100         100
Leaf        Sarkar [19]  32  5.00   89.1      95.6        92.3
Leaf        Chen [18]    32  4.68   95.2      98.2        96.7
Leaf        PSO          32  4.45   100       100         100

6 Conclusion

A discrete version of the particle swarm optimization (PSO) algorithm has been proposed for the min-ε approximation problem. Although PSO has been applied successfully to continuous optimization problems, there is little research on its use for combinatorial optimization. We have extended PSO to solve the min-ε problem. To overcome the problem that particles may fly out of the feasible region, we use the traditional split and merge techniques to move such a particle from the infeasible solution space to the feasible region and to relocate it to a relatively better site. The experimental results show that the proposed PSO-based method outperforms the GA-based methods.

References

1. Lourakis, M., Halkidis, S., Orphanoudakis, S.: Matching Disparate Views of Planar Surfaces Using Projective Invariants. In: British Machine Vision Conference, vol. 1, pp. 94–104 (1993)
2. Attneave, F.: Some informational aspects of visual perception. Psychological Review 61, 183–193 (1954)
3. Yuen, P.C.: Dominant Point Matching Algorithm. Electronic Letters 29, 2023–2024 (1993)
4. Dunham, J.G.: Optimum Uniform Piecewise Linear Approximation of Planar Curves. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 67–75 (1986)
5. Sato, Y.: Piecewise Linear Approximation of Planes by Perimeter Optimization. Pattern Recognition 25, 1535–1543 (1992)
6. Perez, J.C., Vidal, E.: Optimum Polygonal Approximation of Digitized Curves. Pattern Recognition Letters 15, 743–750 (1994)
7. Horng, J.-H.: Improving Fitting Quality of Polygonal Approximation by Using the Dynamic Programming Technique. Pattern Recognition Letters 23, 1657–1673 (2002)
8. Sklansky, J., Chasin, R.L., Hansen, B.J.: Minimum Perimeter Polygons of Digitized Silhouettes. IEEE Trans. Computers 23, 1355–1364 (1972)
9. Williams, C.M.: An Efficient Algorithm for the Piecewise Linear Approximation of Planar Curves. Computer Graphics and Image Processing 8, 286–293 (1978)
10. Sklansky, J., Gonzalez, V.: Fast Polygonal Approximation of Digitized Curves. Pattern Recognition 12, 327–331 (1980)
11. Wall, K., Danielsson, P.E.: A Fast Sequential Method for Polygonal Approximation of Digitized Curves. Computer Vision, Graphics, and Image Processing 28, 220–227 (1984)
12. Douglas, D.H., Peucker, T.K.: Algorithm for the Reduction of the Number of Points Required to Represent a Line or Its Caricature. The Canadian Cartographer 12, 112–122 (1973)
13. Leu, J.G., Chen, L.: Polygonal Approximation of 2D Shapes through Boundary Merging. Pattern Recognition Letters 7, 231–238 (1998)
14. Ray, B.K., Ray, K.S.: A New Split-and-merge Technique for Polygonal Approximation of Chain Coded Curves. Pattern Recognition Letters 16, 161–169 (1995)
15. Teh, H.C., Chin, R.T.: On Detection of Dominant Points on Digital Curves. IEEE Trans. Pattern Anal. Mach. Intell. 11, 859–872 (1991)
16. Wu, W.Y., Wang, M.J.: Detecting the Dominant Points by the Curvature-based Polygonal Approximation. CVGIP: Graph. Models Imag. Process 55, 79–88 (1993)
17. Held, A., Abe, K., Arcelli, C.: Towards a Hierarchical Contour Description via Dominant Point Detection. IEEE Trans. Syst. Man Cybern. 24, 942–949 (1994)
18. Ho, S.-Y., Chen, Y.-C.: An Efficient Evolutionary Algorithm for Accurate Polygonal Approximation. Pattern Recognition 34, 2305–2317 (2001)
19. Sarkar, B., Singh, L.K., Sarkar, D.: A Genetic Algorithm-based Approach for Detection of Significant Vertices for Polygonal Approximation of Digital Curves. International Journal of Image and Graphics 4, 223–239 (2004)
20. Yin, P.Y.: Ant Colony Search Algorithms for Optimal Polygonal Approximation of Plane Curves. Pattern Recognition 36, 1783–1797 (2003)
21. Eberhart, R.C., Kennedy, J.: A New Optimizer Using Particle Swarm Theory. In: Proc. 6th Symp. Micro Machine and Human Science, Nagoya, Japan, pp. 39–43 (1995)
22. Rosin, P.L.: Techniques for Assessing Polygonal Approximations of Curves. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 659–666 (1997)

Author Index

Alencar, Marcelo S. I-452 An, Dong I-168 An, Xueli I-786, II-11 Azevedo, Carlos R.B. I-452 Bi, Gexin I-275 Bie, Rongfang I-491 Bispo Junior, Esdras L. I-452

Cai, Wei II-658, II-794 Cai, Xingquan II-419 Cao, Feilong I-816 Cao, Fengwen II-351, II-359 Cao, Jianting I-237 Cao, Yuan I-472 Carter, Jonathan N. I-400 Cartes, David A. II-119 Chai, Tianyou II-148 Chang, Guoliang I-347 Chang, Yeon-Pun II-180 Chao, Kuei-Hsiang II-227 Chen, Anpin I-87 Chen, Chaolin II-74 Chen, Chuanliang I-491 Chen, Dingguo I-299, II-516 Chen, Gang I-618 Chen, Guangyi II-376, II-384 Chen, Hung-Han I-512 Chen, Jianye I-555, I-674 Chen, Jie II-351 Chen, Jun-Yu II-764 Chen, Ke I-117 Chen, Lichao II-100, II-624 Chen, Ning II-268 Chen, Peng II-284, II-473 Chen, Songcan I-501, II-57 Chen, Xiaoqian II-702 Chen, Xinyu I-610 Chen, Yan I-374 Chen, Yarui I-432 Chen, Yi-Wei II-180 Chen, Yichang I-87 Chen, Yonggang I-128 Chen, Yuanling I-176 Chen, Yuehui I-30

Cheng, Chuanjin II-165 Cheng, Hao II-321 Cheng, Shijie I-472 Cheng, Wei-Chen II-402 Cheng, Xiefeng II-650 Cheng, Zunshui I-40 Chu, Jinyu II-410 Chu, Ming-Huei II-180 Chu, Renxin I-97, I-107 Cichocki, Andrzej I-237, II-772 Cui, Baoxia I-391 Dai, Shucheng II-81 Das, Anupam I-255 Deng, Beixing I-97, I-107 Deng, Wanyin I-55 Ding, Jinliang II-148 Ding, Jundi II-57 Ding, Linge II-268 Ding, Qian II-607 Ding, Shifei II-783 Ding, Yongshan II-313 Ding, Zichun I-715 Dong, Fang I-275 Dong, G.M. I-674 Dong, Hong-bin I-854 Dong, Jianshe II-91, II-331 Dong, Xiangjun II-730 Du, Junping II-67 Duan, Ailing I-691 Duan, Shukai I-357, II-580 Duan, Yong I-391 Eaton, Matthew D. I-400

Fan, Binbin II-483 Fan, Yanfeng I-691 Fan, Zhongshan I-569 Fang, Gang II-21 Fang, Shengle I-138 Fasanghari, Mehdi II-615 Fei, Shumin II-801 Feng, Hailin II-220 Feng, Qigao I-168

Feng, Shidong I-462 Feng, Wei I-338 Ferreira, Tiago A.E. I-452 Franklin, Simon J. I-400 Fu, Longsheng I-168 Fu, Wenfang I-138 Fu, Xiaoyang II-294 Fukumoto, Shinya I-521 Gan, Tian II-830 Gao, Jingli I-442 Gao, Shangkai I-97, I-107 Gao, Shubiao II-439 Gao, Xiaorong I-97, I-107 Gao, Xiaozhi I-491 Ge, Fei I-579 Geng, Runian II-730 Goddard, Anthony J.H. I-400 Gong, Jing I-806 Gong, Yunchao I-491 Gu, Wenjin II-190 Gu, Yingkui II-526, II-533 Guan, Genzhi II-465 Guan, Weimin II-465 Guo, Chen II-138, II-294 Guo, Cuicui I-715 Guo, Jun I-663 Guo, L. I-674 Guo, Ping I-610 Guo, Qianjin II-809 Guo, Xiaodong II-483 Guo, Xiaojiang I-47 Guo, Yufeng II-650 Guo, Zhaozheng I-222 Han, Gyu-Sik I-655 Han, Seung-Soo II-367 He, Haibo I-472 He, Hong I-417 He, Hui I-786 He, Kaijian I-148 He, Shan II-560 He, Yaoyao I-786 He, Yong II-588 He, Zhaoshui I-237 Ho, Tien II-570 Hong, Liangyou II-313 Hong, Zhiguo II-598 Honggui, Han I-762 Hossain, Md. Shohrab I-255

Hu, Cheng II-560 Hu, Chonghai I-753 Hu, Hong I-212 Hu, Jian II-40 Hu, Jingtao II-809 Hu, Senqi I-1 Hu, Wei II-809 Hu, Xiaolin I-309 Huang, Hui I-231 Huang, Panfeng II-171 Huang, Qian II-313 Huang, Tingwen I-231 Huang, Wenhan II-91, II-331 Huang, Yaping II-449 Huang, Yongfeng I-97, I-107 Huang, Yourui II-542 Idesawa, Marsanori I-69

Ji, Geng I-319 Ji, Yu II-692 Jia, Guangfeng I-30 Jia, Lei I-723 Jia, Peifa II-200, II-210 Jia, Weikuan II-783 Jiang, Dongxiang II-313 Jiang, Haijun I-246 Jiang, Jing-qing I-854 Jiang, Minghui I-138 Jiang, Shan I-400 Jin, Cong I-836 Jin, Fenghua I-864 Jin, Shu-Wei I-836 Jin, Yinlai I-158 Jin, Zhixing I-97, I-107 Junfei, Qiao I-762 Kang, Yuan II-180 Karri, Vishy II-570

Lai, Kinkeung I-148 Lao, Jian II-304 Lee, Hyun-Joo I-655 Lee, Jaewook I-655 Lee, KinHong I-539 Lee, Woobeom II-429 Lei, Shengyong I-796 Leung, KwongSak I-539 Li, Bo II-243 Li, Chaoshun I-786, II-259

Li, Chun-Xiang II-1 Li, Dongming II-392 Li, Fengjun I-384 Li, Fuxin I-645 Li, Gang II-658, II-794 Li, Haohao I-555 Li, Heming II-498 Li, Jianning I-701 Li, Jing I-893 Li, Jinhong II-419 Li, Ju II-483 Li, Lei I-600, I-618 Li, Min II-658, II-794 Li, Qingqing II-110, II-259 Li, Shaoyuan II-119 Li, Tao I-330 Li, Wei I-555 Li, Wenjiang I-266 Li, Xiao-yan II-658, II-794 Li, Xiaoli II-809 Li, Yansong I-1 Li, Yinghai I-63, II-11 Li, Yinghong I-741 Li, Youmei I-816 Li, Yue II-588 Li, Yujun II-410 Li, Zhe I-715 Liang, Hua I-682 Liao, Shizhong I-432, I-723 Liao, Wudai I-291 Liao, Xiaofeng I-231 Lin, Dong-mei II-674 Lin, Lanxin I-347 Lin, Qiu-Hua II-764 Lin, Xiaofeng I-796 Ling, Liuyi II-542 Liou, Cheng-Yuan II-402 Liu, Baolin I-97, I-107 Liu, Bohan I-531 Liu, Changxin II-148 Liu, Changzheng II-607 Liu, Derong I-796, II-128 Liu, Fei II-492 Liu, Gang II-171 Liu, Hongzhao I-364 Liu, Huaping I-422 Liu, Jingneng I-176 Liu, Ju II-410 Liu, Juanjuan II-533 Liu, Li I-63, II-11, II-110, II-119

Liu, Lijun I-561 Liu, Luzhou I-78 Liu, Qiang II-11 Liu, Shuang I-733 Liu, Shuangquan I-864 Liu, Ting II-588 Liu, Wenhuang I-531 Liu, Wenxin II-119 Liu, Xiangyang II-552 Liu, Xiaodong I-196, I-204 Liu, Yan II-30 Liu, Yankui I-776 Liu, Yushu I-462 Liu, Zhigang II-666 Liu, Zhong I-864 Lu, Fangcheng II-498 Lu, Funing II-74 Lu, Hongtao II-237, II-552 Lu, Jiangang I-753 Lu, Wei II-237 Lu, Xuxiang I-864 Lun, Shuxian I-222 Luo, Siwei II-449 Luo, Zhigao II-483 Luo, Zhimeng II-110 Lv, Yanli I-826 Ma, Jinwen I-579, I-589, I-600, I-618 Ma, Liang I-627 Ma, Runing II-57 Ma, Xiaoping II-822 Madeiro, Francisco I-452 Mei, Xuehui I-246 Men, Changqian I-709 Meng, Li-Min II-1 Meng, Xin II-30 Meng, Yanmei I-176, II-74 Menzel, Wolfgang II-830 Miao, Dandan I-55 Miao, Yanzi II-822 Miike, Toshiaki I-521 Min, Lequan II-439, II-682, II-692 Minami, Mamoru I-364 Miyajima, Hiromi I-521 Mohler, Ronald R. I-299 Montazer, Gholam Ali II-615 Mu, Chaoxu I-682 Muhammad Abdullah, Saeed I-255 Neruda, Roman I-549 Nguyen, Quoc-Dat II-367

Ning, Bo II-304 Niu, Dong-Xiao II-1 Niu, Dongxiao II-642 Pain, Christopher C. I-400 Pan, Haipeng I-701 Pan, Lihu II-100 Pan, Y.N. I-674 Pan, Yunpeng I-883 Park, Dong-Chul II-367 Phan, Anh Huy II-772 Phillips, Heather J. I-400 Qiao, Jianping II-410 Qiao, Shaojie II-81 Qin, Rui I-776 Qin, Tiheng I-128 Qiu, Jianlong I-158 Qiu, JianPing II-624 Qiu, Tianshuang I-561 Qu, Liguo II-542 Qu, Lili I-374 Ran, Feng II-50 Ren, Zhijie I-589 Rong, Lili II-740 Rynkiewicz, Joseph I-186

Shang, Fengjun II-632 Shang, Li II-351, II-359 Shao, Chenxi II-220 Shen, Minfen I-347 Shen, Siyuan I-776 Shen, Zhipeng II-138 Shi, Bertram E. I-47 Shi, Chaojian I-893 Shi, Guangchuan I-11 Shi, Guoyou I-733 Shi, Minyong II-598 Shi, Zhongzhi I-212, II-783 Shigei, Noritaka I-521 Si, Jibo I-168 Song, Chu-yi I-854 Song, Chunning I-796 Song, Guo-jie II-560 Song, Huazhu I-715 Song, Jingwei II-473 Song, Qinghua II-801 Song, Shaojian I-796 Song, Yinbin I-482 Sossa, Humberto II-341 Strassner, John II-30 Su, Chunyang II-783 Su, Jianjun II-74 Su, Yongnei II-682 Su, Zhitong II-419 Sun, Changyin I-682 Sun, Fuchun I-422, II-268, II-712 Sun, Jingtao II-91, II-331 Sun, Ming I-168 Sun, Wei II-237 Sun, Yao II-607 Sun, Yi-zhou II-560 Sun, Youxian I-753 Takahashi, Norikazu I-663 Tan, Wen II-712 Tang, Changjie II-81 Tang, Shuyun II-526, II-533 Tang, Zhihong I-176 Tao, Yewei II-650 Tie, Jun I-561 Tu, Xuyan II-67 Ul Islam, Rashed I-255

Vázquez, Roberto A. II-341 Vidnerová, Petra I-549 Wang, Bin I-893 Wang, Chengqun I-753 Wang, Cong II-321 Wang, Danling II-692 Wang, Dianhong II-392 Wang, Fuquan II-757 Wang, Hongqiao II-268 Wang, Huaqing II-284, II-473 Wang, Jianjun I-636 Wang, JinFeng I-539 Wang, Jun I-883 Wang, Lian-zhou II-50 Wang, Lidan I-357, II-580 Wang, Lihong I-482 Wang, Nini I-196, I-204 Wang, Qi II-666 Wang, Qin I-69 Wang, Qingquan II-740 Wang, Shixing II-165 Wang, Weijun II-642 Wang, Weiyu I-21 Wang, Wenjian I-627, I-709

Wang, Xiang II-483 Wang, Xiaoling II-410 Wang, Xin II-119 Wang, Yaonan II-712 Wang, Yingchang II-506 Wang, Yongbin II-598 Wang, Yongli II-642 Wang, Yongqiang II-498 Wang, Zhenyuan I-539 Wang, Zhiliang II-158 Wang, Ziqiang I-691, I-845 Wei, Qinglai II-128 Wei, Xunkai I-741 Wei, Yaoguang I-168 Wei, Zukuan II-21 Wen, Chenglin I-442, II-506 Wen, Jinyu I-472 Wen, Lintao I-610 Wen, Ming II-148 Wen, Shiping II-720 Woo, Dong-Min II-367 Wu, Chaozhong I-806 Wu, Haixia I-338 Wu, Luheng II-526 Wu, Peng I-30 Wu, Qiang I-11 Wu, Shuanhu I-482 Wu, Zhengjia I-63 Xia, Changjun II-190 Xia, Yongxiang II-158 Xiang, Xiuqiao II-259 Xiao, Jian I-78 Xiao, Ming II-757 Xie, Chi I-148 Xie, Kun-qing II-560 Xie, Lun II-158 Xin, Shuai I-97, I-107 Xinyuan, Li I-762 Xiong, Jianping II-757 Xiuxia Yang II-190 Xu, Chengwei I-806 Xu, Hua II-200, II-210 Xu, Mei-hua II-50 Xu, Min II-243 Xu, Sixin II-588 Xu, Wenbo II-730 Xu, Xiaobin II-506 Xu, Yang I-266 Xu, Yangsheng II-171

Xu, Yi II-220 Xu, Zongben I-816 Xue, Hui I-501 Xue, Yanmin I-364 Yan, Jun II-392 Yan, Li II-702 Yan, Sijie I-176, II-74 Yan, Xiaowen II-91, II-331 Yan, Xinping I-806 Yang, Chan-Yun I-636 Yang, Huaiqing I-391 Yang, Hui II-119 Yang, Jiaben I-299 Yang, Jingyu II-57 Yang, Jr-Syu I-636 Yang, Junjie I-63 Yang, Li II-110 Yang, Qiang I-501 Yang, Seung-Ho I-655 Yang, Weiwei II-702 Yang, Yanwu I-645 Yang, Zhiyong II-190 Yang-Li, Xiang II-40 Ye, Xiaoling I-330 Ye, Yongan II-439 Yi, Chenfu I-117 Yin, Hui I-569, II-449 Yin, Jianchuan I-196, I-204 Yin, Qian II-21 Yu, Guo-Ding I-636 Yu, Haibin II-809 Yu, Jinyong II-165 Yu, Jun II-158 Yu, Kai II-740 Yu, Long I-78 Yuan, Bo I-531 Yuan, Jianping II-171 Yuan, Shengzhong I-417 Yuan, Zhanting II-91, II-331 Yue, Shuai I-117 Zdunek, Rafal I-237 Zeng, Guangping II-67 Zeng, Lingfa II-720 Zeng, Qingshang I-482 Zeng, Zhe-zhao II-674 Zeng, Zhigang I-309, II-720 Zha, Daifeng I-283, II-748 Zhang, Bo I-309

Zhang, Dexian I-691, I-845 Zhang, Fan II-253 Zhang, Guobao II-801 Zhang, Hailong II-465 Zhang, Hongmei I-569 Zhang, Houxiang II-822 Zhang, Huaguang I-222, II-128 Zhang, Jianwei II-822, II-830 Zhang, Jinfeng II-359 Zhang, Jing I-1, II-30 Zhang, Ke I-873 Zhang, Lei II-243 Zhang, Liqing I-11 Zhang, Liwen II-783 Zhang, Mingwang I-826 Zhang, Ning II-138 Zhang, Qingzhou I-845 Zhang, Qiuyu II-91, II-331 Zhang, Qizhi I-410 Zhang, Shiqing II-457 Zhang, Suwen I-55 Zhang, Wei I-338 Zhang, Wuyi I-291 Zhang, Xiaohui I-364 Zhang, Xinchun II-304 Zhang, Xinhong II-253 Zhang, Xuejun II-650 Zhang, Xueping I-569 Zhang, Yajun II-666 Zhang, Yanjie I-482 Zhang, Yi II-190 Zhang, Yibo I-701 Zhang, Yingchao I-330

Zhang, Yingjun II-100, II-624 Zhang, Yunong I-117 Zhao, Jianye II-304 Zhao, Jing II-730 Zhao, Yong II-702 Zhao, Zhong-Gai II-492 Zhao, Zhongxiang II-822 Zheng, Binglun II-81 Zheng, Chunhou II-243 Zheng, Qingyu I-158 Zhong, Luo I-715 Zhou, Jianzhon II-110 Zhou, Jianzhong I-63, I-786, II-11, II-259 Zhou, Liang I-645 Zhou, Renlai I-1 Zhou, Shaowu II-712 Zhou, Shibin I-462 Zhou, Xiong II-473 Zhou, Yali I-410 Zhou, Yipeng II-67 Zhu, Kejun II-588 Zhu, Mingfang II-81 Zhu, Wei II-674 Zhu, Wei-Ping II-376, II-384 Zhu, Y. I-674 Zhuang, Li-yan I-854 Zhuo, Xinjian II-682 Ziver, Ahmet K. I-400 Zou, Li I-266 Zou, Ling I-1 Zou, Shuyun I-864 Zuo, Jinlong II-276