English Pages 1210 [1208] Year 2006
Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4232
Irwin King Jun Wang Laiwan Chan DeLiang Wang (Eds.)
Neural Information Processing 13th International Conference, ICONIP 2006 Hong Kong, China, October 3-6, 2006 Proceedings, Part I
Volume Editors Irwin King Laiwan Chan Chinese University of Hong Kong Department of Computer Science and Engineering Shatin, New Territories, Hong Kong E-mail:{king,lwchan}@cse.cuhk.edu.hk Jun Wang Chinese University of Hong Kong Department of Automation and Computer-Aided Engineering Shatin, New Territories, Hong Kong E-mail: [email protected] DeLiang Wang Ohio State University Department of Computer Science and Engineering Columbus, Ohio, USA E-mail: [email protected]
Library of Congress Control Number: 2006933758
CR Subject Classification (1998): F.1, I.2, I.5, I.4, G.3, J.3, C.2.1, C.1.3, C.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-46479-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-46479-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2006 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 11893028 06/3142 543210
Preface
This book and its companion volumes constitute the Proceedings of the 13th International Conference on Neural Information Processing (ICONIP 2006) held in Hong Kong during October 3–6, 2006. ICONIP is the annual flagship conference of the Asia Pacific Neural Network Assembly (APNNA) with the past events held in Seoul (1994), Beijing (1995), Hong Kong (1996), Dunedin (1997), Kitakyushu (1998), Perth (1999), Taejon (2000), Shanghai (2001), Singapore (2002), Istanbul (2003), Calcutta (2004), and Taipei (2005). Over the years, ICONIP has matured into a well-established series of international conferences on neural information processing and related fields in the Asia and Pacific regions. Following the tradition, ICONIP 2006 provided an academic forum for the participants to disseminate their new research findings and discuss emerging areas of research. It also created a stimulating environment for the participants to interact and exchange information on future challenges and opportunities of neural network research. ICONIP 2006 received 1,175 submissions from about 2,000 authors in 42 countries and regions (Argentina, Australia, Austria, Bangladesh, Belgium, Brazil, Canada, China, Hong Kong, Macao, Taiwan, Colombia, Costa Rica, Croatia, Egypt, Finland, France, Germany, Greece, India, Iran, Ireland, Israel, Italy, Japan, South Korea, Malaysia, Mexico, New Zealand, Poland, Portugal, Qatar, Romania, Russian Federation, Singapore, South Africa, Spain, Sweden, Thailand, Turkey, UK, and USA) across six continents (Asia, Europe, North America, South America, Africa, and Oceania). Based on rigorous reviews by the Program Committee members and reviewers, 386 high-quality papers were selected for publication in the proceedings, with an acceptance rate of less than 33%. The papers are organized in 22 cohesive sections covering all major topics of neural network research and development.
In addition to the contributed papers, the ICONIP 2006 technical program included two plenary speeches by Shun-ichi Amari and Russell Eberhart. The program also included invited talks by the leaders of the technical co-sponsors, including Wlodzislaw Duch (President of the European Neural Network Society), Vincenzo Piuri (President of the IEEE Computational Intelligence Society), Shiro Usui (President of the Japanese Neural Network Society), DeLiang Wang (President of the International Neural Network Society), and Shoujue Wang (President of the China Neural Networks Council). ICONIP 2006 also launched the APNNA Presidential Lecture Series with invited talks by past APNNA Presidents and the K.C. Wong Distinguished Lecture Series with invited talks by eminent Chinese scholars. Furthermore, the program included six excellent tutorials, open to all conference delegates, by Amir Atiya, Russell Eberhart, Mahesan Niranjan, Alex Smola, Koji Tsuda, and Xuegong Zhang. Besides the regular sessions, ICONIP 2006 featured ten special sessions focusing on emerging topics.
ICONIP 2006 would not have achieved its success without the generous contributions of many volunteers and organizations. ICONIP 2006 organizers would like to express sincere thanks to APNNA for the sponsorship, to the China Neural Networks Council, European Neural Network Society, IEEE Computational Intelligence Society, IEEE Hong Kong Section, International Neural Network Society, and Japanese Neural Network Society for their technical co-sponsorship, to the Chinese University of Hong Kong for its financial and logistical support, and to the K.C. Wong Education Foundation of Hong Kong for its financial support. The organizers would also like to thank the members of the Advisory Committee for their guidance, the members of the International Program Committee and additional reviewers for reviewing the papers, and members of the Publications Committee for checking the accepted papers in a short period of time. In particular, the organizers would like to thank the proceedings publisher, Springer, for publishing the proceedings in the prestigious series of Lecture Notes in Computer Science. Special mention must be made of a group of dedicated students and associates, Haixuan Yang, Zhenjiang Lin, Zenglin Xu, Xiang Peng, Po Shan Cheng, and Terence Wong, who worked tirelessly behind the scenes to make the mission possible. There are still many more colleagues, associates, friends, and supporters who helped us in immeasurable ways; we express our sincere thanks to them all. Last but not least, the organizers would like to thank all the speakers and authors for their active participation at ICONIP 2006, which made it a great success.
October 2006
Irwin King Jun Wang Laiwan Chan DeLiang Wang
Organization
Organizer The Chinese University of Hong Kong
Sponsor Asia Pacific Neural Network Assembly
Financial Co-sponsor K.C. Wong Education Foundation of Hong Kong
Technical Co-sponsors IEEE Computational Intelligence Society International Neural Network Society European Neural Network Society Japanese Neural Network Society China Neural Networks Council IEEE Hong Kong Section
Honorary Chair and Co-chair Lei Xu, Hong Kong
Shun-ichi Amari, Japan
Advisory Board Walter J. Freeman, USA Toshio Fukuda, Japan Kunihiko Fukushima, Japan Tom Gedeon, Australia Zhen-ya He, China Nik Kasabov, New Zealand Okyay Kaynak, Turkey Anthony Kuh, USA Sun-Yuan Kung, USA Soo-Young Lee, Korea Chin-Teng Lin, Taiwan Erkki Oja, Finland
Nikhil R. Pal, India Marios M. Polycarpou, USA Shiro Usui, Japan Benjamin W. Wah, USA Lipo Wang, Singapore Shoujue Wang, China Paul J. Werbos, USA You-Shou Wu, China Donald C. Wunsch II, USA Xin Yao, UK Yixin Zhong, China Jacek M. Zurada, USA
General Chair and Co-chair Jun Wang, Hong Kong
Laiwan Chan, Hong Kong
Organizing Chair Man-Wai Mak, Hong Kong
Finance and Registration Chair Kai-Pui Lam, Hong Kong
Workshops and Tutorials Chair James Kwok, Hong Kong
Publications and Special Sessions Chair and Co-chair Frank H. Leung, Hong Kong
Jianwei Zhang, Germany
Publicity Chair and Co-chairs Jeffrey Xu Yu, Hong Kong
Derong Liu, USA
Chris C. Yang, Hong Kong
Wlodzislaw Duch, Poland
Local Arrangements Chair and Co-chair Andrew Chi-Sing Leung, Hong Kong
Eric Yu, Hong Kong
Secretary Haixuan Yang, Hong Kong
Program Chair and Co-chair Irwin King, Hong Kong
DeLiang Wang, USA
Program Committee Shigeo Abe, Japan Peter Andras, UK Sabri Arik, Turkey Abdesselam Bouzerdoum, Australia Ke Chen, UK Liang Chen, Canada Luonan Chen, Japan Zheru Chi, Hong Kong Sung-Bae Cho, Korea Sungzoon Cho, Korea Seungjin Choi, Korea Andrzej Cichocki, Japan Chuangyin Dang, Hong Kong Wai-Keung Fung, Canada Takeshi Furuhashi, Japan Artur d'Avila Garcez, UK Daniel W.C. Ho, Hong Kong Edward Ho, Hong Kong Sanqing Hu, USA Guang-Bin Huang, Singapore Kaizhu Huang, China Malik Magdon Ismail, USA Takashi Kanamaru, Japan James Kwok, Hong Kong James Lam, Hong Kong Kai-Pui Lam, Hong Kong Doheon Lee, Korea Minho Lee, Korea Andrew Leung, Hong Kong Frank Leung, Hong Kong Yangmin Li, Macau
Xun Liang, China Yanchun Liang, China Xiaofeng Liao, China Chih-Jen Lin, Taiwan Xiuwen Liu, USA Bao-Liang Lu, China Wenlian Lu, China Jinwen Ma, China Man-Wai Mak, Hong Kong Sushmita Mitra, India Paul Pang, New Zealand Jagath C. Rajapakse, Singapore Bertram Shi, Hong Kong Daming Shi, Singapore Michael Small, Hong Kong Michael Stiber, USA Ponnuthurai N. Suganthan, Singapore Fuchun Sun, China Ron Sun, USA Johan A.K. Suykens, Belgium Norikazu Takahashi, Japan Michel Verleysen, Belgium Si Wu, UK Chris Yang, Hong Kong Hujun Yin, UK Eric Yu, Hong Kong Jeffrey Yu, Hong Kong Gerson Zaverucha, Brazil Byoung-Tak Zhang, Korea Liqing Zhang, China
Reviewers Shotaro Akaho Toshio Akimitsu Damminda Alahakoon Aimee Betker Charles Brown Gavin Brown Jianting Cao Jinde Cao Hyi-Taek Ceong
Pat Chan Samuel Chan Aiyou Chen Hongjun Chen Lihui Chen Shu-Heng Chen Xue-Wen Chen Chong-Ho Choi Jin-Young Choi
M.H. Chu Sven Crone Bruce Curry Rohit Dhawan Deniz Erdogmus Ken Ferens Robert Fildes Tetsuo Furukawa John Q. Gan
Kosuke Hamaguchi Yangbo He Steven Hoi Pingkui Hou Zeng-Guang Hou Justin Huang Ya-Chi Huang Kunhuang Huarng Arthur Hsu Kazushi Ikeda Masumi Ishikawa Jaeseung Jeong Liu Ju Christian Jutten Mahmoud Kaboudan Sotaro Kawata Dae-Won Kim Dong-Hwa Kim Cleve Ku Shuichi Kurogi Cherry Lam Stanley Lam Toby Lam Hyoung-Joo Lee Raymond Lee Yuh-Jye Lee Chi-Hong Leung Bresley Lim Heui-Seok Lim Hsuan-Tien Lin Wei Lin Wilfred Lin Rujie Liu Xiuxin Liu Xiwei Liu Zhi-Yong Liu
Hongtao Lu Xuerong Mao Naoki Masuda Yicong Meng Zhiqing Meng Yutaka Nakamura Nicolas Navet Raymond Ng Rock Ng Edith Ngai Minh-Nhut Nguyen Kyosuke Nishida Yugang Niu YewSoon Ong Neyir Ozcan Keeneth Pao Ju H. Park Mario Pavone Renzo Perfetti Dinh-Tuan Pham Tu-Minh Phuong Libin Rong Akihiro Sato Xizhong Shen Jinhua Sheng Qiang Sheng Xizhi Shi Noritaka Shigei Hyunjung Shin Vimal Singh Vladimir Spinko Robert Stahlbock Hiromichi Suetant Jun Sun Yanfeng Sun Takashi Takenouchi
Yin Tang Thomas Trappenberg Chueh-Yung Tsao Satoki Uchiyama Feng Wan Dan Wang Rubin Wang Ruiqi Wang Yong Wang Hua Wen Michael K.Y. Wong Chunguo Wu Guoding Wu Qingxiang Wu Wei Wu Cheng Xiang Botong Xu Xu Xu Lin Yan Shaoze Yan Simon X. Yang Michael Yiu Junichiro Yoshimoto Enzhe Yu Fenghua Yuan Huaguang Zhang Jianyu Zhang Kun Zhang Liqing Zhang Peter G. Zhang Ya Zhang Ding-Xuan Zhou Jian Zhou Jin Zhou Jianke Zhu
Table of Contents – Part I
Neurobiological Modeling and Analysis

How Reward Can Induce Reverse Replay of Behavioral Sequences in the Hippocampus . . . 1
    Colin Molter, Naoyuki Sato, Utku Salihoglu, Yoko Yamaguchi
Analysis of Early Hypoxia EEG Based on a Novel Chaotic Neural Network . . . 11
    Meng Hu, Jiaojie Li, Guang Li, Walter J. Freeman
BCM-Type Synaptic Plasticity Model Using a Linear Summation of Calcium Elevations as a Sliding Threshold . . . 19
    Hiroki Kurashige, Yutaka Sakai
A New Method for Multiple Spike Train Analysis Based on Information Discrepancy . . . 30
    Guang-Li Wang, Xue Liu, Pu-Ming Zhang, Pei-Ji Liang
Self-organizing Rhythmic Patterns with Spatio-temporal Spikes in Class I and Class II Neural Networks . . . 39
    Ryosuke Hosaka, Tohru Ikeguchi, Kazuyuki Aihara
Self-organization Through Spike-Timing Dependent Plasticity Using Localized Synfire-Chain Patterns . . . 49
    Toshio Akimitsu, Akira Hirose, Yoichi Okabe
Comparison of Spike-Train Responses of a Pair of Coupled Neurons Under the External Stimulus . . . 59
    Wuyin Jin, Zhiyuan Rui, Yaobing Wei, Changfeng Yan
Fatigue-Induced Reversed Hemispheric Plasticity During Motor Repetitions: A Brain Electrophysiological Study . . . 65
    Ling-Fu Meng, Chiu-Ping Lu, Bo-Wei Chen, Ching-Horng Chen
Functional Differences Between the Spatio-temporal Learning Rule (STLR) and Hebb Type (HEBB) in Single Pyramidal Cells in the Hippocampal CA1 Area . . . 72
    Minoru Tsukada, Yoshiyuki Yamazaki
Ratio of Average Inhibitory to Excitatory Conductance Modulates the Response of Simple Cell . . . 82
    Akhil R. Garg, Basabi Bhaumik
Modeling of LTP-Related Phenomena Using an Artificial Firing Cell . . . 90
    Beata Grzyb, Jacek Bialowas
Time Variant Causality Model Applied in Brain Connectivity Network Based on Event Related Potential . . . 97
    Kai Yin, Xiao-Jie Zhao, Li Yao
An Electromechanical Neural Network Robotic Model of the Human Body and Brain: Sensory-Motor Control by Reverse Engineering Biological Somatic Sensors . . . 105
    Alan Rosen, David B. Rosen

Cognitive Processing

Sequence Disambiguation by Functionally Divided Hippocampal CA3 Model . . . 117
    Toshikazu Samura, Motonobu Hattori, Shun Ishizaki
A Time-Dependent Model of Information Capacity of Visual Attention . . . 127
    Xiaodi Hou, Liqing Zhang
Learning with Incrementality . . . 137
    Abdelhamid Bouchachia
Lability of Reactivated Human Declarative Memory . . . 147
    Fumiko Tanabe, Ken Mogi
A Neuropsychologically-Inspired Computational Approach to the Generalization of Cerebellar Learning . . . 155
    Sintiani Dewi Teddy, Edmund Ming-Kit Lai, Chai Quek
A Neural Model for Stereo Transparency with the Population of the Disparity Energy Models . . . 165
    Osamu Watanabe
Functional Connectivity in the Resting Brain: An Analysis Based on ICA . . . 175
    Xia Wu, Li Yao, Zhi-ying Long, Jie Lu, Kun-cheng Li
Semantic Addressable Encoding . . . 183
    Cheng-Yuan Liou, Jau-Chi Huang, Wen-Chie Yang
Top-Down Attention Guided Object Detection . . . 193
    Mei Tian, Si-Wei Luo, Ling-Zhi Liao, Lian-Wei Zhao
Absolute Quantification of Brain Creatine Concentration Using Long Echo Time PRESS Sequence with an External Standard and LCModel: Verification with in Vitro HPLC Method . . . 203
    Y. Lin, Y.P. Zhang, R.H. Wu, H. Li, Z.W. Shen, X.K. Chen, K. Huang, G. Guo
Learning in Neural Network - Unusual Effects of “Artificial Dreams” . . . 211
    Ryszard Tadeusiewicz, Andrzej Izworski
Spatial Attention in Early Vision Alternates Direction-of-Figure . . . 219
    Nobuhiko Wagatsuma, Ko Sakai
Language Learnability by Feedback Self-Organizing Maps . . . 228
    Fuminori Mizushima, Takashi Toyoshima
Text Representation by a Computational Model of Reading . . . 237
    J. Ignacio Serrano, M. Dolores del Castillo
Mental Representation and Processing Involved in Comprehending Korean Regular and Irregular Verb Eojeols: An fMRI and Reaction Time Study . . . 247
    Hyungwook Yim, Changsu Park, Heuiseok Lim, Kichun Nam
Binding Mechanism of Frequency-Specific ITD and IID Information in Sound Localization of Barn Owl . . . 255
    Hidetaka Morisawa, Daisuke Hirayama, Kazuhisa Fujita, Yoshiki Kashimori, Takeshi Kambara
Effect of Feedback Signals on Tuning Shifts of Subcortical Neurons in Echolocation of Bat . . . 263
    Seiichi Hirooka, Kazuhisa Fujita, Yoshiki Kashimori
A Novel Speech Processing Algorithm for Cochlear Implant Based on Selective Fundamental Frequency Control . . . 272
    Tian Guan, Qin Gong, Datian Ye
Connectionist Approaches for Predicting Mouse Gene Function from Gene Expression . . . 280
    Emad Andrews Shenouda, Quaid Morris, Anthony J. Bonner
Semantic Activation and Cortical Areas Related to the Lexical Ambiguity and Idiomatic Ambiguity . . . 290
    Gisoon Yu, Choong-Myung Kim, Dong Hwee Kim, Kichun Nam
Intelligent System for Automatic Recognition and Evaluation of Speech Commands . . . 298
    Wojciech Kacalak, Maciej Majewski
A New Mechanism on Brain Information Processing—Energy Coding . . . 306
    Rubin Wang, Zhikang Zhang

Mathematical Modeling and Analysis

A Solution to the Curse of Dimensionality Problem in Pairwise Scoring Techniques . . . 314
    Man-Wai Mak, Sun-Yuan Kung
First Passage Time Problem for the Ornstein-Uhlenbeck Neuronal Model . . . 324
    C.F. Lo, T.K. Chung
Delay-Dependent Exponential Estimates of Stochastic Neural Networks with Time Delay . . . 332
    Zhan Shu, James Lam
Imbalanced Learning in Relevance Feedback with Biased Minimax Probability Machine for Image Retrieval Tasks . . . 342
    Xiang Peng, Irwin King
Improvement of the Perfect Recall Rate of Block Splitting Type Morphological Associative Memory Using a Majority Logic Approach . . . 352
    Takashi Saeki, Tsutomu Miki
Two Methods for Sparsifying Probabilistic Canonical Correlation Analysis . . . 361
    Colin Fyfe, Gayle Leen
Turbo Decoding as an Instance of Expectation Maximization Algorithm . . . 371
    Saejoon Kim
Dynamical Behaviors of a Large Class of Delayed Differential Systems with Discontinuous Right-Hand Side . . . 379
    Wenlian Lu, Tianping Chen
Reinforcement Learning Algorithm with CTRNN in Continuous Action Space . . . 387
    Hiroaki Arie, Jun Namikawa, Tetsuya Ogata, Jun Tani, Shigeki Sugano
Entropy Based Associative Model . . . 397
    Masahiro Nakagawa
Free Energy of Stochastic Context Free Grammar on Variational Bayes . . . 407
    Tikara Hosino, Kazuho Watanabe, Sumio Watanabe
Asymptotic Behavior of Stochastic Complexity of Complete Bipartite Graph-Type Boltzmann Machines . . . 417
    Yu Nishiyama, Sumio Watanabe
Improving VG-RAM Neural Networks Performance Using Knowledge Correlation . . . 427
    Raphael V. Carneiro, Stiven S. Dias, Dijalma Fardin Jr., Hallysson Oliveira, Artur S. d’Avila Garcez, Alberto F. De Souza
The Bifurcating Neuron Network 3 as Coloring Problem Solver and N-Ary Associative Memory . . . 437
    Jinhyuk Choi, Geehyuk Lee
Coupling Adaboost and Random Subspace for Diversified Fisher Linear Discriminant . . . 447
    Hui Kong, Jian-Gang Wang
Self-organized Path Constraint Neural Network: Structure and Algorithm . . . 457
    Hengqing Tong, Li Xiong, Hui Peng
Gauss Wavelet Chaotic Neural Networks . . . 467
    Yao-qun Xu, Ming Sun, Ji-hong Shen
Training RBFs Networks: A Comparison Among Supervised and Not Supervised Algorithms . . . 477
    Mercedes Fernández-Redondo, Joaquín Torres-Sospedra, Carlos Hernández-Espinosa
A Cooperation Online Reinforcement Learning Approach in Ant-Q . . . 487
    SeungGwan Lee
Soft Analyzer Modeling for Dearomatization Unit Using KPCR with Online Eigenspace Decomposition . . . 495
    Haiqing Wang, Daoying Pi, Ning Jiang, Steven X. Ding
A Soft Computing Based Approach for Modeling of Chaotic Time Series . . . 505
    Jayashri Vajpai, Arun JB.
Predicting Nonstationary Time Series with Multi-scale Gaussian Processes Model . . . 513
    Yatong Zhou, Taiyi Zhang, Xiaohe Li
Prediction Error of a Fault Tolerant Neural Network . . . 521
    John Sum, Chi-sing Leung, Kevin Ho
Delay-Dependent and Delay-Independent Stability Conditions of Delayed Cellular Neural Networks . . . 529
    Wudai Liao, Dongyun Wang, Yulin Xu, Xiaoxin Liao
The Perfect Recalling Rate of Morphological Associative Memory . . . 537
    Ke Zhang, Sadayuki Murashima
Analysis of Dynamics of Cultured Neuronal Networks Using I & F Model . . . 547
    Akio Kawana, Shinya Tsuzuki, Osamu Watanabe
Convergence Study of Discrete Neural Networks with Delay . . . 554
    Run-Nian Ma, Guo-Qiang Bai
Convergence of Batch BP Algorithm with Penalty for FNN Training . . . 562
    Wei Wu, Hongmei Shao, Zhengxue Li
New Results for Global Stability of Cohen-Grossberg Neural Networks with Discrete Time Delays . . . 570
    Zeynep Orman, Sabri Arik
Global Stability of Cohen-Grossberg Neural Networks with Distributed Delays . . . 580
    Weirui Zhao, Huanshui Zhang
Exponential Stability of Neural Networks with Distributed Time Delays and Strongly Nonlinear Activation Functions . . . 591
    Chaojin Fu, Zhongsheng Wang
Global Stability of Bidirectional Associative Memory Neural Networks with Variable Coefficients and S-Type Distributed Delays . . . 598
    Yonggui Kao, Cunchen Gao, Lu Wu, Qinghe Ming
Stability Analysis for Higher Order Complex-Valued Hopfield Neural Network . . . 608
    Deepak Mishra, Arvind Tolambiya, Amit Shukla, Prem K. Kalra

Learning Algorithms

Mixture of Neural Networks: Some Experiments with the Multilayer Feedforward Architecture . . . 616
    Joaquín Torres-Sospedra, Carlos Hernández-Espinosa, Mercedes Fernández-Redondo
Collective Information-Theoretic Competitive Learning: Emergency of Improved Performance by Collectively Treated Neurons
    Ryotaro Kamimura, Fumihiko Yoshida, Ryozo Kitajima
A Fast Bit-Parallel Algorithm for Gapped String Kernels
    Chuanhuan Yin, Shengfeng Tian, Shaomin Mu
A RBFN Based Cogging Force Estimator of PMLSM with q [...]

[...] ω). In the presence of a behavioral input, burst activations appear (scenario B or C): ΩECIIi ∈ [βECII − ω, βECII + ω] (to avoid a continuous activation). However, since we are working with a dynamical system, these inequalities can be relaxed during short transients without affecting the expected behavior. Parameters have been set to the following values: ω0 = 1, βEC = 1.5 and k0EC = 1.5. During firing, the native angular frequency increases monotonically from 0.6 to 1.4.

2.4 Learning Sequences in the CA3 Recurrent Network
The firing patterns generated in the EC layer are learned through a Hebbian mechanism in the CA3 layer. However, to reflect neurophysiological evidence [7], [8], an asymmetric time window (ranging from 0 to 50 ms) with respect to the time difference between presynaptic and postsynaptic firing is taken into account. As a result, the phase differences generated by the phase precession mechanism lead to asymmetric connections, as observed in Figure 3. The synaptic plasticity of a connection from a unit j to a unit i is given by:
$$\dot{w}_{ij}(t) = \{\alpha P(\phi_i(t))\,P(\phi_j(t-\tau)) - \gamma\,(P(\phi_i(t)) + P(\phi_j(t-\tau)))\}\,(R - w_{ij}(t)^2)\,w_{ij}(t). \qquad (8)$$

Fig. 2. [Panels A to D; plot residue removed, beginning of caption missing] ... β + ω: a stable fixed point is reached for a positive value of the cosine and means sustained activation.
The second term is a damping factor that reflects the weight decrease due to independent activity. A saturation factor multiplies the whole expression and keeps the weights in the range ]0, √R[. Parameters have been set to the following values: α = 1, τ = 10 ms, γ = 0.01 and R = 1. Equation 4 gives the sum of all inputs impinging on a CA3 unit. During the learning stage, in the absence of behavioral inputs, no firing activity is expected.
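The behavior of the learning rule in Eq. 8 can be illustrated with a minimal sketch. It is not the authors' implementation: the function P is not defined in this excerpt, so its values are passed in directly as numbers (`p_post` for P(φi(t)), `p_pre_delayed` for P(φj(t − τ))), and the forward-Euler integrator, the step size, the constant activity levels and the dimensionless time are all assumptions made for illustration. Only α = 1, γ = 0.01 and R = 1 come from the text.

```python
def weight_derivative(w, p_post, p_pre_delayed, alpha=1.0, gamma=0.01, R=1.0):
    """Right-hand side of Eq. (8) for one connection from unit j to unit i.

    p_post        stands for P(phi_i(t))        (assumed value in [0, 1])
    p_pre_delayed stands for P(phi_j(t - tau))  (assumed value in [0, 1])
    """
    hebbian = alpha * p_post * p_pre_delayed      # coincident activity strengthens
    damping = gamma * (p_post + p_pre_delayed)    # independent activity weakens
    saturation = (R - w * w) * w                  # zero at w = 0 and at w = sqrt(R)
    return (hebbian - damping) * saturation


def integrate(w0, p_post, p_pre_delayed, dt=0.01, steps=1000):
    """Forward-Euler integration under constant activity (illustrative only)."""
    w = w0
    for _ in range(steps):
        w += dt * weight_derivative(w, p_post, p_pre_delayed)
    return w


# Both units active: the weight grows and saturates close to sqrt(R) = 1.
w_paired = integrate(0.1, 1.0, 1.0)
# Only the postsynaptic unit active: the damping term slowly shrinks the weight.
w_unpaired = integrate(0.1, 1.0, 0.0)
print(w_paired, w_unpaired)
```

With these assumed inputs, paired activity drives the weight toward the upper bound √R, while unpaired activity lets it decay slowly toward 0, which is the role the text assigns to the damping and saturation factors.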
[Figure 3: four panels (idx = 8, 9, 10, 11); x-axis: CA3 units; y-axis: weight distribution.]

Fig. 3. Efferent (wi,idx, in bold) and afferent (widx,i) weight distributions of a unit of index idx after single-trial learning appear on each plot. Weight asymmetries appear clearly. The symmetry of the behavioral input is reflected in these distributions, which are identical for all units.
In the presence of behavioral inputs, EC units fire and are expected to fire the corresponding CA3 units. To avoid the influence of recurrent connections during the learning phase, the parameter k33 is decreased to a value close to zero. As a consequence, the synaptic plasticity occurring in this layer directly reflects the firing sequence of the EC layer. The parameters of Equation 4 have been set to the following values: k0CA3 = 0.2, kEC = 3, k33 = 0.001.
3 Computational Experiments
The following experiments analyze the retrieval phase of the CA3 network after one-trial learning of the behavioral sequence given in Section 2.2. Accordingly, in all experiments the CA3 weights w33ij are the same and are given in Figure 3. The aim of this paper is to analyze recall during sharp waves (SPWs). To prepare this study, some simple propagation properties are first analyzed when trigger inputs are given to specific CA3 units through the activation of ECII units by a motionless behavioral input. This study helps to choose the CA3 parameters βCA3 and k33. The second experiment depicts how the CA3 network is affected by intermittent SPWs, and the last experiment shows how a reward mechanism affects the firing sequence during SPWs. Since recall is assumed to occur during SPWs and motionless activity, all experiments are performed in the absence of theta (the LFP is shut down: k0EC = k0CA3 = 0). The Runge-Kutta-Gill integration method is used with a time step δt = 0.08. Since the angular frequency ω0 is set to 1 and represents the 8 Hz theta rhythm, one computational time step represents 1.59 ms (δt × 0.125 s / 2π).

3.1 Reverse Replay with Specific Inputs
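All experiments in this section share the integration setup stated above: the Runge-Kutta-Gill method with δt = 0.08, where ω0 = 1 stands for the 8 Hz theta rhythm. The following is a minimal sketch of that setup, not the authors' code. The function names are hypothetical, the Gill coefficients are the standard textbook ones, and the test equation dy/dt = −y is only a stand-in used to check the integrator, since the full model equations are not part of this excerpt.

```python
import math

DT = 0.08                # integration step in model time units (from the text)
THETA_PERIOD_S = 0.125   # one cycle of the 8 Hz theta rhythm, in seconds
OMEGA_0 = 1.0            # angular frequency representing theta (from the text)


def step_in_ms(dt=DT):
    """Real time covered by one step: a full cycle spans 2*pi/OMEGA_0 model units."""
    model_units_per_cycle = 2.0 * math.pi / OMEGA_0
    return dt * THETA_PERIOD_S / model_units_per_cycle * 1000.0


def rk_gill_step(f, t, y, h):
    """One step of the fourth-order Runge-Kutta-Gill scheme for dy/dt = f(t, y)."""
    s = 1.0 / math.sqrt(2.0)
    k1 = f(t, y)
    k2 = f(t + h / 2.0, y + h * k1 / 2.0)
    k3 = f(t + h / 2.0, y + h * ((s - 0.5) * k1 + (1.0 - s) * k2))
    k4 = f(t + h, y + h * (-s * k2 + (1.0 + s) * k3))
    return y + h / 6.0 * (k1 + (2.0 - math.sqrt(2.0)) * k2
                          + (2.0 + math.sqrt(2.0)) * k3 + k4)


# Sanity check on a stand-in equation: dy/dt = -y from t = 0 to t = 25 * DT = 2.
y = 1.0
for n in range(25):
    y = rk_gill_step(lambda t, v: -v, n * DT, y, DT)

print(round(step_in_ms(), 2), y, math.exp(-2.0))
```

The conversion reproduces the 1.59 ms per step quoted in the text, and the stand-in problem shows that a step of 0.08 keeps the integration error small for a smooth right-hand side.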
CA3 firing activities are observed when a specific behavioral input feeds the network. Three different place fields are tested: when the animal is at the beginning of the track, at its end and in an in-between position. Each time two input units are triggered. Figures 4 show the different scenarios obtained for two different parameters of k33 (in Eq. 4, the global parameter depicting the strengths of the recurrent connections). For small k33 values, only replay is observed while for larger k33 values, both replay and reverse replay occur. This impact of the k33 parameter is related to the Figure 2 scenarios A and B-C, and to the presence of asymmetric weight connections. Figures 5(a) and 5(b) show how the parameters k33 and βCA3 affect reverse replay in opposite ways. For increasing k33 , scenarios A to D are traversed, while increasing β tends to prevent any network activity and brings the network back to the scenario A. No reverse replay is observed in Figure 5(a) for k33 = 0.1: the action of the 2 cells to the preceding ones is too much inhibited by the small value of k33 . When disinhibiting slightly the net by increasing k33 , activity is propagated backward. However, since we are nearby the stability, the phase velocity departs very slowly and a very slow reverse replay is observed. In fact, the reverse replay is so slow that forward replay takes place: when a
How Reward Can Induce Reverse Replay of Behavioral Sequences
[Figure 4 panels: (a) k33 = 0.15, βCA3 = 1.05; (b) k33 = 0.3, βCA3 = 1.05. Vertical axis: CA3 units; horizontal axis: seconds.]
Fig. 4. Forward and reverse replays observed for specific local input and various k33 values (which appear to be critical). In (a), only replay is observed: the weight connections associated with the reverse path are not strengthened enough. By increasing k33, both replay and reverse replay are observed (b).
unit stops its activity and is ready to fire again, the subsequent cells are still active and can give the initiating signal. When k33 is increased, the speed of the reverse replay increases. For k33 = 0.9, the reverse replay is so fast that units are reactivated in a forward replay before the end of their activation. For larger k33, sustained activation appears. Similar observations hold for decreasing β (Fig. 5(b)).
3.2 Reverse Replay During Sharp Waves
[Figure 5(a) panels: k33 = 0.1, 0.2, 0.3, 0.7, 0.9, 1. Vertical axis: CA3 units; horizontal axis: 0.0 to 1.0.]
Intermittent hippocampal sharp waves of 40 to 120 ms duration have been observed during various behavioral activities, including slow wave sleep, awake immobility, drinking, eating, and grooming [13], [14].
[Figure 5(b) panels: β = 1.01, 1.05, 1.1, 1.3, 1.37, 1.4. Vertical axis: CA3 units; horizontal axis: seconds, 0.0 to 1.0.]
Fig. 5. Reverse replays obtained for different k33 and βCA3 parameter configurations. βCA3 = 1.05 in top figures. k33 = 0.7 in bottom figures.
C. Molter et al.
Fig. 6. Excitation of the CA3 units during sharp waves. Different figures are obtained with different samples of a Gaussian input (μ = 0, σ = 0.35). Activation of the CA3 “excitable pathway” is observed. βCA3 = 1.15, k33 = 0.65.
In the following experiments, SPWs were simulated by assuming direct nonspecific inputs to CA3 units during a period of 100 ms (footnote 2), followed by no-input periods. Different scenarios are observed depending on the parameter values of the input distribution. When all inputs are lower than βCA3 − ω0, no activation appears in CA3 units. When all inputs are larger than βCA3 − ω0, all CA3 units are activated at the same time during the SPWs. More interesting behaviors are observed for inputs with a small mean and large variance. At each time step, only a small fraction of CA3 units are activated: the ones receiving an input strong enough to move from scenario A to scenario B. For the “structural stability” CA3 parameters obtained from the previous analysis, these activations are propagated along the learned sequence in either the forward or the reverse direction (the CA3 “excitable pathway”). A typical example is given in Figure 6.
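The input scheme described above can be sketched as follows. This is only an illustration, not the authors' simulation: the function names are hypothetical, the number of units (20) is an assumption read off the figure axes, and the threshold test ignores the recurrent CA3 dynamics entirely. It also reproduces the conversion between milliseconds and computational time steps.

```python
import math
import random

DT = 0.08                 # integration time step (dimensionless, from the text)
THETA_PERIOD_S = 0.125    # one 8 Hz theta cycle, in seconds
OMEGA_0 = 1.0             # angular frequency used in the model
BETA_CA3 = 1.15           # CA3 threshold parameter quoted for Fig. 6
N_UNITS = 20              # ASSUMPTION: number of CA3 units, read off the figure axes

# One computational step in milliseconds: dt * 0.125 s / (2*pi)
STEP_MS = DT * THETA_PERIOD_S / (2 * math.pi) * 1000.0
# Number of computational steps covering a 100 ms sharp wave
SPW_STEPS = round(100.0 / STEP_MS)


def spw_input(n_steps, n_units, mu=0.0, sigma=0.35, rng=None):
    """Rectified Gaussian input: negative samples are set to zero."""
    rng = rng or random.Random(0)
    return [[max(0.0, rng.gauss(mu, sigma)) for _ in range(n_units)]
            for _ in range(n_steps)]


def suprathreshold_units(inputs, threshold):
    """For each time step, list the units whose input exceeds the threshold."""
    return [[u for u, x in enumerate(row) if x > threshold] for row in inputs]


if __name__ == "__main__":
    inputs = spw_input(SPW_STEPS, N_UNITS)
    active = suprathreshold_units(inputs, BETA_CA3 - OMEGA_0)
    mean_active = sum(len(a) for a in active) / len(active)
    print(f"step = {STEP_MS:.2f} ms, SPW = {SPW_STEPS} steps")
    print(f"mean number of supra-threshold units per step: {mean_active:.1f}")
```

With a small mean and a large variance, only part of the population exceeds βCA3 − ω0 at any step, which is the regime the text describes as the interesting one.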
Fig. 7. Excitation of the CA3 units during different sharp wave bursts (simulated by Gaussian input, μ = 0, σ = 0.35). The threshold parameter of the last two units is slightly decreased to reflect the accumulation of place fields at the reward location. βCA3 = 1.15, βCA3^reward = 1.05, k33 = 0.65. Reverse replays are observed.
Footnote 2: 100 ms = 63 computational time steps. At each computational time step, each CA3 unit receives an input whose amplitude is drawn from a Gaussian distribution (μ and σ) and whose negative values are set to zero.
3.3 Reward Inducing Reverse Replay During Sharp Waves
Foster and Wilson’s experiment [1] is examined here: reward is given to the animal when it reaches the end of the track. In the computer simulation this reward is simulated by a small decrease of the βCA3 parameter of the last two (reward) CA3 units to βCA3^reward. This leads to an increase in their firing probability during SPWs. With the same parameters as in the previous section, these activations are propagated along the CA3 “excitable pathway”, which leads here to a reverse replay (Fig. 7). Since Eq. 2 shows that βCA3 and ΩCA3 have opposite effects, the βCA3 decrease might be related to experimental evidence showing that place fields at reward locations are associated with larger neural population proportions [15], [16]. Indeed, this larger neural population is triggered more often during SPW activity, which would be reflected by an increase of Ω.
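To make the claimed increase in firing probability concrete, the sketch below computes the probability that a single Gaussian input sample exceeds the threshold β − ω0 for an ordinary unit and for a reward unit, using the parameter values quoted for Fig. 7. Treating β − ω0 as a hard threshold on the input alone is a simplifying assumption (recurrent input is ignored), and the helper name `prob_above` is hypothetical.

```python
import math


def prob_above(threshold, mu=0.0, sigma=0.35):
    """P(x > threshold) for x ~ N(mu, sigma^2), via the complementary error function."""
    z = (threshold - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


OMEGA_0 = 1.0
BETA_CA3 = 1.15      # ordinary units (Fig. 7)
BETA_REWARD = 1.05   # the two reward units (Fig. 7)

# Per-sample probability of receiving a supra-threshold input
p_plain = prob_above(BETA_CA3 - OMEGA_0)
p_reward = prob_above(BETA_REWARD - OMEGA_0)

if __name__ == "__main__":
    print(f"ordinary unit: {p_plain:.3f}")
    print(f"reward unit:   {p_reward:.3f}")
```

Under these assumptions the reward units are above threshold on roughly 44% of the samples against roughly 33% for the other units, so activity is more likely to start at the end of the track.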
4 Conclusions
Reverse replay of behavioral sequences in hippocampal place cells has been observed in rodents during the non-running awake state, in coincidence with sharp wave ripples [1]. In this paper, by relying on a biologically plausible hippocampal model [6], an original reward mechanism is proposed to induce these reverse replay patterns of activity. The reward is simulated by a small decrease of the firing activation threshold of the units associated with the locations where biological reward is experienced. This small decrease might reflect the accumulation of place fields at reward locations [15], [16]. As a consequence, we predict that such reverse replay mechanisms should be observed during any kind of SPW activity: both during awake activity and during slow wave sleep. This reverse replay was demonstrated to appear after just one trial and is suggestive of a supervisory role of the hippocampus in reinforcement learning. This reinforces Buzsáki’s hypothesis [14] that the activation of the experienced behavioral sequence during sharp waves could be central to memory consolidation. What a beautiful way of learning during dreaming.
References
1. D.J. Foster and M.A. Wilson. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, online, February 2006.
2. J. O’Keefe and L. Nadel. The hippocampus as a cognitive map. Clarendon Press, Oxford, 1978.
3. J. O’Keefe and M.L. Recce. Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3:317–330, 1993.
4. W.E. Skaggs, B.L. McNaughton, M.A. Wilson, and C.A. Barnes. Theta phase precession in hippocampal neuronal populations and the compression of temporal sequences. Hippocampus, 6:149–172, 1996.
5. Y. Yamaguchi and B.L. McNaughton. Nonlinear dynamics generating theta phase precession in hippocampal closed circuit and generation of episodic memory. In S. Usui and T. Omori, editors, ICONIP98-JNNS98, volume 2, pages 781–784, Japan, 1998. Burke, VA: IOS Press.
6. Y. Yamaguchi. A theory of hippocampal memory based on theta phase precession. Biol. Cybern., 89:1–9, 2003.
7. W.B. Levy and O. Steward. Temporal contiguity requirements for long term associative potentiation/depression in the hippocampus. Neuroscience, 8:791–797, 1983.
8. G. Bi and M. Poo. Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401:792–796, 1999.
9. N. Sato and Y. Yamaguchi. Memory encoding by “theta phase precession” in the hippocampal network. Neural Computation, 15:2379–2397, 2003.
10. H. Wagatsuma and Y. Yamaguchi. Cognitive map formation through sequence encoding by theta phase precession. Neural Computation, 16(12):2665–2697, 2004.
11. Z. Wu and Y. Yamaguchi. Input-dependent learning rule for the memory of spatiotemporal sequences in hippocampal network with theta phase precession. Biol. Cybern., 90:113–124, 2004.
12. A.K. Lee and M.A. Wilson. Memory of sequential experience in the hippocampus during slow wave sleep. Neuron, 36:1183–1194, 2002.
13. G. Buzsáki. Hippocampal sharp waves: their origin and significance. Brain Res., 398:242–252, 1986.
14. G. Buzsáki, A. Bragin, J.J. Chrobak, Z. Nádasdy, A. Sik, M. Hsu, and A. Ylinen. Oscillatory and intermittent synchrony in the hippocampus: relevance to memory trace formation. In G. Buzsáki, R. Llinás, W. Singer, A. Berthoz, and Y. Christen, editors, Temporal coding in the brain, pages 145–172. Springer-Verlag, 1994.
15. T. Kobayashi, H. Nishijo, M. Fukuda, J. Bures, and T. Ono. Task-dependent representations in rat hippocampal place neurons. J Neurophysiol, 78:597–613, 1997.
16. S.A. Hollup, S. Molden, J.G. Donnett, M.B. Moser, and E.I. Moser. Accumulation of hippocampal place fields at the goal location in an annular watermaze task. The Journal of Neuroscience, 21(5):1635–1644, 2001.
17. E.I. Moser. Spatial maps in hippocampal and parahippocampal cortices. Abstract Viewer/Itinerary Planner, Washington, DC: Society for Neuroscience. Program No. 466, 2005.
Analysis of Early Hypoxia EEG Based on a Novel Chaotic Neural Network
Meng Hu (1), Jiaojie Li (2), Guang Li (3), and Walter J. Freeman (4)
(1) Department of Physics, Zhejiang University, Hangzhou 310027, China
(2) Hangzhou Sanitarium of PLA Airforce, Hangzhou 310013, China
(3) National Laboratory of Industrial Control Technology, Institute of Advanced Process Control, Zhejiang University, Hangzhou 310027, China, [email protected]
(4) Division of Neurobiology, University of California at Berkeley, LSA 142, Berkeley, CA, 94720-3200, USA
Abstract. This paper presents an experiment to recognize early hypoxia based on EEG analyses. A chaotic neural network, the KIII model, initially designed to model olfactory neural systems, is utilized for pattern classification. The experimental results show that the EEG pattern can be clearly detected at an early stage of hypoxia for individual subjects.
Keywords: Chaotic neural network; Hypoxia; EEG.
1 Introduction
It is well known that hypoxia disrupts intracellular processes and impairs cellular function. Brain cells, with their uniquely high oxygen demand, are most susceptible to low oxygen tension. Intellectual impairment is currently considered an early sign of hypoxia, which is particularly dangerous for pilots because the signs and symptoms do not usually cause the discomfort or pain that would make them recognize their own disability. While numerous physiological indicators, such as the neurobehavioral evaluation (NE) used in our research, are available to evaluate hypoxia, the EEG signal is one of the most predictive and reliable methods that may assess hypoxia on-line [1]. EEGs are dynamic, stochastic, non-linear and non-stationary, and exhibit significant complex behavior [2], [3]. Considering this, traditional methods may not be appropriate for characterizing the intrinsic nature of the EEG. The architecture of the olfactory system is followed to construct a high-dimensional chaotic network, the KIII model, in which the interactions of globally connected nodes are shaped by reinforcement learning to support a global landscape of high-dimensional chaotic attractors. Each low-dimensional local basin of attraction corresponds to a learned class of stimulus patterns. Convergence to an attractor constitutes abstraction and generalization from an example to the class. The KIII model has performed well on several complex pattern recognition tasks [4], [5], [6].
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 11 – 18, 2006. © Springer-Verlag Berlin Heidelberg 2006
M. Hu et al.
Hypoxic EEG collected after the subjects had stayed in an environment simulating 3500 m altitude for 25 minutes could be distinguished from normal EEG when the hypoxia was confirmed by the NE test [7]. However, it is more valuable to predict hypoxia earlier, before the NE scores decay. In this paper, the KIII model is used as a pattern classifier to diagnose hypoxia at an early stage, before significant NE changes occur. Prior to classification, the feature vectors are extracted from the wavelet packet tree coefficients of the 30-60 Hz sub-band obtained by wavelet packet decomposition.
2 KIII Model Description
Biologically, the central olfactory neural system is composed of the olfactory bulb (OB), the anterior olfactory nucleus (AON) and the prepyriform cortex (PC). In accordance with this anatomic architecture, the KIII network is a multi-layer neural network model composed of several K0, KI and KII units [8]. In these models, every node is described by a second-order differential equation. The parameters in the KIII network are optimized to fulfill criteria that were deduced from electrophysiological experiments [9]. In the KIII network, Gaussian noise is introduced to simulate the peripheral and central biological noise sources, respectively; the peripheral noise is rectified to simulate the excitatory action of input axons. The additive noise eliminates the numerical instability of the KIII model and makes the system trajectory stable and robust under statistical measures. Because of this kind of stochastic chaos, the KIII network can approximate real biological intelligence for pattern recognition [10].
3 Application to Subtle Hypoxic EEG Recognition
3.1 Data Acquisition
A mixture of nitrogen and oxygen at normal atmospheric pressure, which simulates the atmosphere at different altitudes by adjusting the oxygen partial pressure, is provided to subjects via a pilot mask. On the first day, the subject stays at normal atmosphere and carries out the auditory digit span test while his EEG is recorded. On the second day, the subject stays in an environment simulating 3500 m altitude for 25 minutes. The NE tests were performed at the 16th and 25th minute, respectively, while the EEGs were recorded. The experiment is carried out at the same time each day. Five healthy male volunteers around 22 years old are taken as subjects. EEG segments of 1.5 seconds immediately after the neurobehavioral evaluations are recorded for analysis under both normal oxygen partial pressure and 3500 m altitude. EEG data were taken from 30 channels: FP1, FP2, F7, F3, FZ, F4, F8, FT7, FC3, FCZ, FC4, FT8, T3, C3, CZ, C4, T4, TP7, CP3, CPZ, CP4, TP8, T5, P3, PZ, P4, T6, O1, OZ and O2 (10/20 system). The reference was (A1+A2)/2 (A1 = left mastoid, A2 = right mastoid). The EEG amplifier used was a NuAmps Digital Amplifier (Model 7181) purchased from Neuroscan Compumedics Limited, Texas, USA. The sampling rate was 250 S/s. All values are in μV.
3.2 Evaluation of the Severity of the Effects of Hypoxia by Neurobehavioral Testing
NE is a sensitive and reliable tool for early detection of adverse effects of environmental hazards on the central nervous system. In the normal and simulated 3500 m altitude experiments, auditory digit span was utilized to evaluate the degree of hypoxia. Auditory digit span is a common measure of short-term memory: it is the number of digits a person can absorb and recall in correct serial order after hearing them. As is usual in short-term memory tasks, the person has to remember a small amount of information for a relatively short time, and the order of recall is important. The result of the test is shown in Table 1. T-tests were performed on the NE scores under normal and hypoxia conditions. As a result, the NE scores of normal and hypoxia at the 25th minute were different observably (p0.2).
Table 1. Performance of NES under normal and hypoxia states (auditory digit span scores)
Subject   Normal   Hypoxia (16th minute)   Hypoxia (25th minute)
1         30       31                      29
2         28       27                      24
3         25       27                      21
4         32       29                      23
5         19       19                      9
3.3 Feature Vector Extraction
By wavelet packet decomposition, the original waveform can be reconstructed from a set of analysis coefficients that capture all of the time (or space) and frequency information in the waveform. In our analysis, we use the COIF5 wavelet. The number of decomposition levels is chosen as two, and the wavelet packet tree coefficients of the 30-60 Hz sub-band are extracted. The feature vector is a 30-dimensional vector because there are 30 EEG channels. For each channel, the squares of the wavelet packet tree coefficients are summed to give one dimension of the feature vector. According to the topology of the EEG channels, feature vectors can be transformed into a feature topography. A typical feature topography comparing normal and hypoxic EEGs collected from the same subject is illustrated in Fig. 1 [7].
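The feature described above (one sub-band energy per channel) can be sketched as follows. To keep the example dependency-free it swaps in the Haar wavelet for the COIF5 wavelet used in the paper, so it shows the structure of the computation rather than the authors' exact features. The function names are hypothetical. At 250 S/s, a two-level packet tree splits 0-125 Hz into four bands of 31.25 Hz; the second-lowest band (the high-pass branch of the low-pass branch) nominally covers 31.25-62.5 Hz, which is taken here as the "30-60 Hz" sub-band.

```python
import math


def haar_step(x):
    """One Haar analysis step: returns (approximation, detail) at half length."""
    s = math.sqrt(2.0)
    pairs = range(0, len(x) - 1, 2)
    approx = [(x[i] + x[i + 1]) / s for i in pairs]
    detail = [(x[i] - x[i + 1]) / s for i in pairs]
    return approx, detail


def band_energy(signal):
    """Sum of squared coefficients of the level-2 packet node 'ad'
    (detail of the approximation), i.e. the second-lowest frequency band."""
    approx, _ = haar_step(signal)     # level 1: keep the low-pass branch
    _, node_ad = haar_step(approx)    # level 2: high-pass of the low-pass branch
    return sum(c * c for c in node_ad)


def feature_vector(channels):
    """One energy value per EEG channel (30 channels give a 30-dimensional vector)."""
    return [band_energy(ch) for ch in channels]


if __name__ == "__main__":
    flat = [1.0] * 8                       # constant signal: no energy in the band
    square = [1.0, 1.0, -1.0, -1.0] * 2    # period-4 square wave
    print(feature_vector([flat, square]))
```

A real analysis would replace `haar_step` with a COIF5 filter pair from a wavelet library; the per-channel summation of squared coefficients stays the same.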
Fig. 1. A feature vector topography of the normal and hypoxia EEG
3.4 Learning Rule
There are two main learning processes: Hebbian associative learning and habituation. Hebbian reinforcement learning is used for establishing the memory basins of certain patterns, while habituation is used to reduce the impact of environmental noise or of non-informative signals input to the KIII network. The output of the KIII network at the mitral level (M) is taken as the activity measure of the system. The activity of the ith channel is represented by SDai, which is the mean standard deviation of the output of the ith mitral node (Mi) over the period of the presentation of input patterns, as in Eq. (1). The response period with input patterns is divided into S equal segments, and the standard deviation of the kth segment is calculated as SDaik; SDai is the mean value over these S segments. SDam is the mean activity measure over the whole OB layer with n nodes (Eq. (2)).
$$SD_{ai} = \frac{1}{S} \sum_{k=1}^{S} SD_{aik} \tag{1}$$
$$SD_{am} = \frac{1}{n} \sum_{i=1}^{n} SD_{ai} \tag{2}$$
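A minimal sketch of the activity measures in Eqs. (1) and (2) is given below. The function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator is used.

```python
from statistics import pstdev


def channel_activity(output, n_segments):
    """SDai: mean of the per-segment standard deviations of one mitral output (Eq. 1).
    ASSUMPTION: population standard deviation; trailing samples that do not fill
    a whole segment are ignored."""
    seg_len = len(output) // n_segments
    sds = [pstdev(output[k * seg_len:(k + 1) * seg_len]) for k in range(n_segments)]
    return sum(sds) / n_segments


def layer_activity(outputs, n_segments):
    """Return ([SDa1, ..., SDan], SDam), with SDam the mean over all n nodes (Eq. 2)."""
    sda = [channel_activity(o, n_segments) for o in outputs]
    return sda, sum(sda) / len(sda)


if __name__ == "__main__":
    outputs = [[0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
    print(layer_activity(outputs, n_segments=2))
```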
The modified Hebbian rule holds that each pair of M nodes that are co-activated by the stimulus have their lateral connections W(mml)ij strengthened. Here W(mml)ij stands for the connection weights both from Mi to Mj and from Mj to Mi. Those nodes whose activities are larger than the mean activity of the OB layer are considered activated; those whose activity levels are less than the mean are considered not to be activated. Also, to avoid saturation of the weight space, a bias coefficient K is defined in the modified Hebbian learning rule, as in Eq. (3). W(mml)ij is multiplied by a coefficient r (r > 1) to represent the Hebbian reinforcement.
IF $SD_{ai} > (1 + K)\,SD_{am}$ and $SD_{aj} > (1 + K)\,SD_{am}$
THEN $W(mml)_{ij} = W(mml)_{high}$ and $W(mml)_{ji} = W(mml)_{high}$
OR $W(mml)_{ij} = r \times W(mml)_{ij}$ and $W(mml)_{ji} = r \times W(mml)_{ji}$. (3)
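The rule in Eq. (3) can be sketched as below, with the two alternatives (set to a fixed high value, or multiply by r) selected by which argument is supplied. The function name, the argument names and the list-of-lists weight representation are hypothetical choices for illustration.

```python
def hebbian_update(w, sda, k_bias, r=None, w_high=None):
    """Return a new lateral weight matrix after one application of Eq. (3).

    w       -- square matrix (list of lists) of lateral weights W(mml)
    sda     -- activity measure SDai of every mitral node
    k_bias  -- bias coefficient K
    w_high  -- if given, co-activated pairs are set to this fixed value
    r       -- if given (r > 1), co-activated pairs are multiplied by r
    """
    if (r is None) == (w_high is None):
        raise ValueError("give exactly one of r or w_high")
    n = len(sda)
    sdam = sum(sda) / n                               # mean activity, Eq. (2)
    active = [a > (1.0 + k_bias) * sdam for a in sda]
    new_w = [row[:] for row in w]
    for i in range(n):
        for j in range(n):
            # visiting every ordered pair updates both W_ij and W_ji
            if i != j and active[i] and active[j]:
                new_w[i][j] = w_high if w_high is not None else r * w[i][j]
    return new_w


if __name__ == "__main__":
    w = [[0.0, 0.1, 0.1], [0.1, 0.0, 0.1], [0.1, 0.1, 0.0]]
    sda = [1.0, 1.0, 0.1]
    print(hebbian_update(w, sda, k_bias=0.2, r=2.0))
    print(hebbian_update(w, sda, k_bias=0.2, w_high=5.0))
```

Only the weights between nodes 0 and 1 change in this example, because node 2 stays below (1 + K) times the mean activity.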
Two algorithms to increase the connection weight are presented: algorithm 1 sets the weight to a fixed high value W(mml)high, as in previous references, and algorithm 2 is a new algorithm that multiplies the original value by an increasing rate. Habituation constitutes an adaptive filter to reduce the impact of environmental noise that is continuous and uninformative. Habituation exists at the synapses of the M1 nodes on other nodes in the OB and in the lateral connections within the M1 layer. It is implemented by incremental weight decay (multiply with a coefficient hhab 2) and analyze the synchronization pattern among the neurons, where the analytical result mostly depends on the experimental data and is affected little by subjective interference.
1 Introduction
Our current knowledge about neurons and their functional properties is mostly derived from single neuron recordings using microelectrode techniques. Although the classical methods have been very useful for understanding the cellular mechanisms underlying relevant processes, it has become clear that, more often than not, neural information is processed through the cooperation and integration of relevant neuron populations [1],[2]. An important technique in neuroscience research, multi-unit recording, has therefore been developed to measure the activity of a group of neurons simultaneously, which offers a window to explore how neurons work in concert to encode specific information [3],[4]. However, the ability to understand the relevant coding function is significantly hampered because current methods fall short of the requirements of multi-dimensional data analysis. During the past decades, several methods have been proposed [5]-[7], but they were basically designed for pair-wise train analysis and do not meet the need to deal with multiple train sequences obtained from multi-electrode recordings. Although techniques were also explored to deal with multiple spike trains at the same time [8],[9], these methods depend heavily on the selection of parameters. Therefore, the present challenge is to develop methods that allow researchers to perform objective multivariate analysis on multiple spike train data.
Corresponding author.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 30–38, 2006. © Springer-Verlag Berlin Heidelberg 2006
A New Method for Multiple Spike Train Analysis
In this paper, a new method employing a measurement of information discrepancy [10],[11], which is based on the comparisons of subsequence distributions of the symbolized sequences obtained from the original spike trains, is applied to deal with the multiple spike trains simultaneously recorded from a group of neurons and analyze the synchronization pattern among these neurons. It can be used to effectively make comparison among many spike sequences at the same time, where the analytical result mostly depends on the experimental data and is affected little by subjective interference.
2 Method
2.1 Experiment
The isolation of the retina and the recording of ganglion cell spike trains were conducted as previously described [12]. In brief, eyes were obtained from newly hatched chickens (about 2-4 days old). For electrical recording, a small piece of retinal segment was attached with the ganglion cell side to the surface of an 8 × 8 multi-electrode array (MEA60, Multi Channel Systems MCS GmbH, Reutlingen, Germany). The light stimulus was generated using a computer monitor and was focused via a lens system to form a 0.71 × 0.71 mm image on the isolated retina, which covered the MEA area. The light stimuli consisted of 1000 ms full-field, uniformly illuminating flashes separated by 1000 ms “light-off” inter-stimulus intervals. Multi-unit photo responses from ganglion cells were simultaneously recorded from the MEA electrodes and were amplified with a 64-channel amplifier (1200×). The selected recording channels, along with one stimulus signal, were digitized with a commercial multiplexed data acquisition system (MCRack) and stored in a Pentium-based computer. The data were sampled at a rate of 20 kHz, plotted on the monitor screen instantaneously, and then stored on the hard disk for off-line analyses. Spike events recorded from each electrode were then detected and classified into neuronal activities based on the methods previously described [13],[14].
2.2 Symbolization of Spike Trains
The measurement of information discrepancy is based on the comparison of subsequence distributions. Since spikes are “all-or-none” events, a spike train can be symbolized into a “0-1” sequence. For the purpose of determining the information discrepancy, the symbolizing procedure is applied as follows: Let the recording start at time 0 and finish T time units later. Select a suitable time bin t; the whole time duration [0, T] can then be divided into r (r = T/t) bins of the same width, where each bin contains at most one spike. If there is a spike in a bin, the bin is assigned the symbol “1”; otherwise it is assigned the symbol “0”. This results in a sequence of length r over two different symbols [15]. In the present study, a bin width of 5 ms is selected.
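The symbolization step can be sketched as follows, assuming spike times are given in milliseconds from the start of the recording. The function name `symbolize` is hypothetical. If a bin happens to contain more than one spike it is still marked "1", which is an assumption about a case the text excludes by its choice of bin width.

```python
def symbolize(spike_times_ms, duration_ms, bin_ms=5.0):
    """Convert spike times into a 0-1 sequence with one symbol per time bin."""
    n_bins = int(duration_ms // bin_ms)
    seq = [0] * n_bins
    for t in spike_times_ms:
        idx = int(t // bin_ms)
        if 0 <= idx < n_bins:    # ignore spikes outside [0, T)
            seq[idx] = 1
    return seq


if __name__ == "__main__":
    # four spikes in a 30 ms window, 5 ms bins
    print(symbolize([1.0, 12.5, 13.0, 29.9], duration_ms=30.0))
    # an 80 s recording gives a sequence of 16000 symbols
    print(len(symbolize([], duration_ms=80_000.0)))
```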
G.-L. Wang et al.
2.3 Function of Degree of Disagreement of the Symbolized Spike Trains
Function of Degree of Disagreement (FDOD) is a measure of the difference among the sequences under investigation [10]. It has been successfully applied in bioinformatics [16]. This measure is similar in form to Kullback’s information and has many good properties, such as non-negativity, identity, symmetry, boundedness, symmetric recursiveness, and monotonicity [17]. When it is applied to measure discrepancy among multiple sequences, the constructive information of a sequence is transformed into a set of subsequence distributions called the complete information set (CIS), which is defined as follows [11]: Let $\Sigma = \{a_1, a_2, \ldots, a_m\}$ be an alphabet of $m$ symbols, and suppose $S = \{S_1, S_2, \ldots, S_s\}$ is a set of sequences formed from the symbol set $\Sigma$. Denote the set of all different sequences formed from $\Sigma$ with length $l$ by $\Theta^l$. Then the number $m(l)$ of all sequences of $\Theta^l$ equals $m^l$. For a sequence $S_k \in S$, let $L_k$ be its length and let $n^l_{ik}$ denote the number of contiguous subsequences in $S_k$ that match the $i$-th sequence of $\Theta^l$, $l \le L_k$. It is easy to see that $\sum_{i=1}^{m(l)} n^l_{ik} = L_k - l + 1$ for each $l \le L_k$ and $k$. Letting $p^l_{ik} = n^l_{ik}/(L_k - l + 1)$, we obtain a distribution:
$$U^l_{S_k} := (p^l_{1k}, p^l_{2k}, \ldots, p^l_{m(l)k})^T \tag{1}$$
where $\sum_{i=1}^{m(l)} p^l_{ik} = 1$ for each $l \le L_k$ and $k$, and $T$ is the transposition operator. Let $\Gamma^l$ denote the set of all distributions satisfying $\sum_{i=1}^{m(l)} p^l_{ik} = 1$; we get:
$$\Gamma^l := \left\{ (p^l_{1k}, p^l_{2k}, \ldots, p^l_{m(l)k})^T \;\middle|\; \sum_{i=1}^{m(l)} p^l_{ik} = 1,\; p^l_{ik} \ge 0 \right\}, \quad (l = 1, 2, 3, \ldots) \tag{2}$$
Thus, for each sequence $S_k$, we can get a unique set of distributions:
$$(U^1_{S_k}, U^2_{S_k}, \ldots, U^{L_k}_{S_k}), \quad U^l_{S_k} \in \Gamma^l, \quad (l = 1, 2, \ldots, L_k) \tag{3}$$
which contains all the composition information of sequence $S_k$ and forms the complete information set of the sequence $S_k$. Different sequences have different complete information sets (and vice versa). For neuronal firing activities, the alphabet for the “0-1” sequence is $\Sigma = \{a_1, \ldots, a_m\} = \{0, 1\}$. Suppose $S$ is a sequence of $L$ bins; given the subsequence length $l$, the subsequence distribution of $S$ is:
$$U^l_S = (p^l_1, p^l_2, \ldots, p^l_{m(l)})^T \tag{4}$$
where $T$ is the transposition operator, $\sum_{i=1}^{m(l)} p^l_i = 1$, $l \le L$ and $m(l) = 2^l$. Given a set of distributions of $s$ sequences:
$$U^l_{S_1} = (p^l_{11}, p^l_{21}, \ldots, p^l_{m(l)1})^T,\quad U^l_{S_2} = (p^l_{12}, p^l_{22}, \ldots, p^l_{m(l)2})^T,\quad \ldots,\quad U^l_{S_s} = (p^l_{1s}, p^l_{2s}, \ldots, p^l_{m(l)s})^T \tag{5}$$
where $\sum_{i=1}^{m(l)} p^l_{ik} = 1$, $k = 1, 2, \ldots, s$.
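The subsequence distribution of Eq. (4) can be computed with a sliding window, as in the sketch below. The function name is hypothetical, and the ordering of the $m^l$ words (lexicographic, as produced by `itertools.product`) is an arbitrary choice; any fixed ordering works as long as it is the same for all sequences being compared.

```python
from itertools import product


def subsequence_distribution(seq, l, alphabet=(0, 1)):
    """Relative frequencies of all length-l words among the contiguous
    subsequences of seq, listed in lexicographic word order."""
    if not 1 <= l <= len(seq):
        raise ValueError("l must satisfy 1 <= l <= len(seq)")
    words = list(product(alphabet, repeat=l))      # the m**l words of Theta^l
    counts = dict.fromkeys(words, 0)
    for start in range(len(seq) - l + 1):          # L - l + 1 windows in total
        counts[tuple(seq[start:start + l])] += 1
    total = len(seq) - l + 1
    return [counts[w] / total for w in words]


if __name__ == "__main__":
    # windows of length 2 in 0,1,0,1,1 are 01, 10, 01, 11
    print(subsequence_distribution([0, 1, 0, 1, 1], l=2))
```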
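The FDOD measurement is defined as:
$$B(U^l_{S_1}, U^l_{S_2}, \ldots, U^l_{S_s}) = \sum_{k=1}^{s} \sum_{i=1}^{m(l)} p^l_{ik} \log \frac{p^l_{ik}}{\sum_{k'=1}^{s} p^l_{ik'}/s} \tag{6}$$
$$B_k(U^l_{S_1}, U^l_{S_2}, \ldots, U^l_{S_s}) = \sum_{i=1}^{m(l)} p^l_{ik} \log \frac{p^l_{ik}}{\sum_{k'=1}^{s} p^l_{ik'}/s} \tag{7}$$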
Based on $B(U^l_{S_1}, U^l_{S_2}, \ldots, U^l_{S_s})$, we can also get another measurement (the discrepancy ratio $R$):
$$R(U^l_{S_1}, U^l_{S_2}, \ldots, U^l_{S_s}) = \left( \sum_{k=1}^{s} \sum_{i=1}^{m(l)} p^l_{ik} \log \frac{p^l_{ik}}{\sum_{k'=1}^{s} p^l_{ik'}/s} \right) \Big/ \big(s \log(s)\big) \le 1 \tag{8}$$
where $0 \log(0)$ and $0 \log(0/0)$ are both assigned to be zero [11]. The indices $B$ and $B_k$ calculated using (6) and (7) are similar in form to Kullback’s information and are considered measurements of disagreement [10]. $B(U^l_{S_1}, \ldots, U^l_{S_s})$ and $R(U^l_{S_1}, \ldots, U^l_{S_s})$ measure the discrepancy among the $s$ sequences, while $B_k(U^l_{S_1}, \ldots, U^l_{S_s})$ measures the discrepancy between the $k$-th sequence and the average of all sequences in the group. Based on (5) and (7), a corollary can be obtained as follows:
Corollary 1: If
$$U^l_{S_i} = U^l_{S_j} \tag{9}$$
then
$$B_i(U^l_{S_1}, U^l_{S_2}, \ldots, U^l_{S_s}) = B_j(U^l_{S_1}, U^l_{S_2}, \ldots, U^l_{S_s}) \tag{10}$$
where $i \ne j$ and $i, j \le s$. So different spike sequences that have similar subsequence distributions will have similar information discrepancy, but the converse of the corollary does not hold. Since the FDOD measurement satisfies the measurement conditions of the complete information set, it is not necessary to consider all distributions of the complete information set. In practice, the distribution for subsequences of a suitable length $l$ should be sufficient for analysis [11].
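Equations (6) to (8) translate directly into code once each sequence has been reduced to its subsequence distribution. The sketch below takes the distributions as plain lists of probabilities over the same word ordering; the function name `fdod` and the return convention are hypothetical, and the natural logarithm is an assumption (the base cancels in R).

```python
import math


def fdod(distributions):
    """Return (B, [B_1, ..., B_s], R) for s distributions over the same words."""
    s = len(distributions)
    if s < 2:
        raise ValueError("at least two distributions are needed")
    m = len(distributions[0])
    # average distribution: sum over k of p_ik, divided by s
    mean = [sum(d[i] for d in distributions) / s for i in range(m)]
    b_k = []
    for d in distributions:
        total = 0.0
        for p, avg in zip(d, mean):
            if p > 0.0:               # 0*log(0) and 0*log(0/0) are taken as zero
                total += p * math.log(p / avg)
        b_k.append(total)             # Eq. (7)
    b = sum(b_k)                      # Eq. (6)
    r = b / (s * math.log(s))         # Eq. (8)
    return b, b_k, r


if __name__ == "__main__":
    print(fdod([[0.5, 0.5], [0.5, 0.5]]))   # identical distributions
    print(fdod([[1.0, 0.0], [0.0, 1.0]]))   # disjoint distributions
```

Identical distributions give B = 0 and R = 0, while distributions with disjoint support reach the upper bound R = 1, which is consistent with the inequality in Eq. (8).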
3 Results
In this study, the algorithm for measuring information discrepancy is used to analyze the spatiotemporal pattern of multiple spike trains from a group of adjacent neurons. Fig. 1 illustrates the geometric position of 8 neurons (N1, N2, . . . , N8, respectively) recorded using MEA. The experimental recording lasted for about 80 seconds, and the spike trains (S1, S2, . . . , S8, respectively) are plotted in Fig. 2, where each episode contains 40 traces and each trace corresponds to 2-s recording of periodic response.
Fig. 1. Geometric position of neurons recorded
Fig. 2. Raster plot of responses from 8 retinal ganglion cells. The spike trains (S1, S2, . . . , S8) correspond respectively to the relevant neurons (N1, N2, . . . , N8) recorded using MEA as shown in Fig. 1. The recording lasted for 80 s, with each spike train containing 40 traces and each trace representing 2-s recording of periodic response.
Fig. 3. Measurement of discrepancy between each individual sequence and the average of all 8 sequences
After symbolizing these spike trains into “0-1” sequences, FDOD is calculated for these 8 sequences with subsequence lengths l of 2, 6, and 10, respectively. Fig. 3 illustrates the values of Bk(U^l_S1, U^l_S2, ..., U^l_S8) (k = 1, 2, ..., 8) for l = 2, 6, 10. It can be seen that spike sequences S3, S5 and S6 have similar Bk values for all subsequence lengths l (l = 2, 6, 10). This is also the case for spike sequences S4 and S8, and for spike sequences S1 and S7. In addition, the values of Bk (k = 1, 2, ..., 8) show the same tendency for each subsequence length l (l = 2, 6, 10), as shown in Fig. 3. Therefore, based on Corollary 1, this suggests that among these 8 recorded neurons, three groups of spike sequences, i.e. (S3, S5 and S6), (S4 and S8) and (S1 and S7), might have synchronized firing activities. To test whether this is the case, we further calculate the discrepancy ratio among relevant combinations. To test whether the group (S3, S5, and S6) is the most synchronized triplet, we calculate the discrepancy ratios among relevant triplets. Since B5 and B6 are the lowest and the most similar of all Bk values (for all three l values), the triplet combinations are built with S5 and S6 plus one other spike sequence (S1, S2, S3, S4, S7, or S8). The values of the discrepancy ratio R (l = 10) of each triplet are presented in Fig. 4(a). Analogous calculations are also performed for pair-wise combinations (S8 vs. each of the others, and S7 vs. each of the others). The discrepancy ratios R (l = 10) of these pair-wise combinations are given in Fig. 4(b) and Fig. 4(c), respectively.
Fig. 4. Discrepancy ratio (l = 10) of the selected groups. (a) Discrepancy ratio of triplet combinations built with S5 and S6 plus one other spike sequence (S1, S2, S3, S4, S7, or S8). (b) Discrepancy ratio of pair-wise combinations built with S8 plus one of the remaining spike sequences. (c) Discrepancy ratio of pair-wise combinations built with S7 plus one of the remaining spike sequences.
It can be seen from Fig. 4(a) that the discrepancy ratio of the group (S3, S5 and S6) is the smallest, which suggests that the group (S3, S5 and S6) is more synchronized than any other triplet combination. Cross-correlation analysis, as shown in Fig. 5, also reveals that synchronization occurred among S3, S5 and S6, but not among any other triplet groups. Synchronized firings are also found between S8 and S4, as shown by the cross-correlation function plotted in Fig. 6(a), which is compatible with the result given in Fig. 4(b)
Fig. 5. The correlation of the spike sequences. The central panel is the auto-correlation function for spike sequence S5, and the other panels are the cross-correlation functions between spike sequence S5 and relevant sequences as arranged in Fig. 1.
Fig. 6. The correlation of the spike sequences. (a) The cross-correlation function between spike sequences S4 and S8. (b) The cross-correlation function between spike sequences S1 and S7.
that the discrepancy ratio is the lowest between S8 and S4, as compared to other pair-wise sequences. No synchronization was found between S8 and any other sequence. On the other hand, although B1 and B7 have similar values in Fig. 3, Fig. 4(c) shows that the pair-wise discrepancy ratio between S1 and S7 is fairly large compared to other pairs. This suggests that the temporal
structure is quite different between these two sequences, and therefore S1 and S7 should not be synchronized. This is confirmed by the cross-correlation analysis shown in Fig. 6(b).
4 Discussions
In this paper, a new measurement of information discrepancy is applied to analyze the spatiotemporal pattern of multiple spike trains simultaneously recorded from several adjacent neurons. This method is efficient in comparing the synchronization pattern among many sequences: it reduces the computational work as compared to the existing pair-wise analysis algorithms, and at the same time yields reliable results. More importantly, the result relies mostly on the experimental data and is little affected by subjective interference. Although the information discrepancy can be calculated on an objective basis, some care is needed in preparing the sequences and in judging the similarity among the sequences. Firstly, the bin size chosen for spike train symbolization is one of the crucial parameters for this method. An over-large bin size will produce complicated “words” constructed from more than two “letters”, whereas an over-small one will result in elongated sequences and introduce exaggerated discrepancies. In our present study, a suitable bin size is determined by referring to the maximum firing rate of the neurons investigated. Secondly, the analytical result depends on the length l of the subsequences. A larger l value results in more significant discrepancies, and at the same time it consumes more computational power. For an appropriate selection of the length, the following empirical formula was suggested [16]:
$$l \le a + \mathrm{Int}[\log L / \log m], \qquad (11)$$
where L denotes the total length of the sequences to be compared and m is the number of symbols in the symbol set; a = 2 was suggested if L ≤ 1000, and a = 0 otherwise. Following this suggestion, a subsequence length l ≤ 13 is suitable for the analysis performed in the present study (L = 16000, m = 2). In practice, a minimal length of l = 2 is sufficient to calculate the discrepancies among the spiking sequences in our study. This is confirmed by the fact that the analytical results are qualitatively similar for the various values of l (l = 2, 6, 10) applied. Thirdly, some inaccuracies may be unavoidable in identifying the possibly synchronized sequences, because there is no index to quantify the similarity between relevant values of $B_k(U^l_{S_1}, U^l_{S_2}, \ldots, U^l_{S_s})$. Such effects can, however, be limited by moderately extending the range of values that is regarded as similar.
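As a worked check of Eq. (11), the following minimal sketch evaluates the upper bound on l. The function name `max_subsequence_length` is illustrative and does not come from the paper. It is assumed here that Int[·] means truncation to an integer; an integer loop is used instead of a floating-point logarithm so that exact powers of m are handled without rounding problems.

```python
def max_subsequence_length(L: int, m: int) -> int:
    """Upper bound on the subsequence length l from Eq. (11): a + Int[log L / log m]."""
    if L < 1 or m < 2:
        raise ValueError("need L >= 1 and m >= 2")
    a = 2 if L <= 1000 else 0   # offset suggested in the text
    # Integer part of log_m(L): the largest k with m**k <= L.
    k = 0
    power = m
    while power <= L:
        k += 1
        power *= m
    return a + k


# Values used in the present study: L = 16000 bins, binary symbols (m = 2).
print(max_subsequence_length(16000, 2))
```

For L = 16000 and m = 2 this gives the bound l ≤ 13 quoted in the text, so the lengths l = 2, 6 and 10 used in the analysis all lie within it.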
Acknowledgement
This work is supported by grants from the National Basic Research Program of China (2005CB724301), the National Natural Science Foundation of China (No. 60375039), the National Natural Science Foundation of China (No. 30400088), and the Ministry of Education (No. 20040248062).
References
1. Nirenberg, S., Latham, P. E.: Population Coding in the Retina. Current Opinion in Neurobiology 8(4) (1998) 488-493
2. Petersen, R. S., Panzeri, S., Diamond, M. E.: Population Coding in Somatosensory Cortex. Current Opinion in Neurobiology 12(4) (2002) 441-447
3. Meister, M., Pine, J., Baylor, D. A.: Multi-neuronal Signals from the Retina: Acquisition and Analysis. J. Neurosci. Methods 51 (1994) 95-106
4. Brown, E. N., Kass, R. E., Mitra, P. P.: Multiple Neural Spike Train Data Analysis: State-of-the-Art and Future Challenges. Nature Neuroscience 7(5) (2004) 456-461
5. Mastronarde, D. N.: Interactions Between Ganglion Cells in Cat Retina. J. Neurophysiol. 49(2) (1983) 350-365
6. Aertsen, A. M., Gerstein, G. L., Habib, M. K., Palm, G.: Dynamics of Neuronal Firing Correlation: Modulation of ‘Effective Connectivity’. J. Neurophysiol. 61(5) (1989) 900-917
7. Konig, P.: A Method for the Quantification of Synchrony and Oscillatory Properties of Neuronal Activity. J. Neurosci. Methods 54 (1994) 31-37
8. Gerstein, G. L., Aertsen, A. M.: Representation of Cooperative Firing Activity among Simultaneously Recorded Neurons. J. Neurophysiol. 54(6) (1985) 1513-1528
9. Schnitzer, M. J., Meister, M.: Multineuronal Firing Patterns in the Signal from Eye to Brain. Neuron 37 (2003) 499-511
10. Fang, W. W.: Disagreement Degree of Multi-person Judgements in an Additive Structure. Mathematical Social Sciences 28 (1994) 85-111
11. Fang, W. W., Roberts, F. S., Ma, Z. R.: A Measure of Discrepancy of Multiple Sequences. Information Sciences 137 (2001) 75-102
12. Chen, A. H., Zhou, Y., Gong, H. Q., Liang, P. J.: Firing Rates and Dynamic Correlated Activities of Ganglion Cells Both Contribute to Retinal Information Processing. Brain Res. 1017 (2004) 13-20
13. Zhang, P. M., Wu, J. Y., Zhou, Y., Liang, P. J., Yuan, J. Q.: Spike Sorting Based on Automatic Template Reconstruction with A Partial Solution to the Overlapping Problem. J. Neurosci. Methods 135(1-2) (2004) 55-65
14. Wang, G. L., Zhou, Y., Chen, A. H., Zhang, P. M., Liang, P. J.: A Robust Method for Spike Sorting with Automatic Overlap Decomposition. IEEE Trans. Biomed. Eng. 53(6) (2006) 1195-1198
15. Szczepanski, J., Amigo, J. M., Wajnryb, E., Sanchez-Vives, M. V.: Application of Lempel-Ziv Complexity to the Analysis of Neural Discharges. Network: Comput. Neural Syst. 14 (2003) 335-350
16. Li, W., Fang, W. W., Ling, L. J., Wang, J. H., Xuan, Z., Chen, R. S.: Phylogeny Based on Whole Genome as Inferred from Complete Information Set Analysis. Journal of Biological Physics 28 (2002) 439-447
17. Fang, W. W.: The Characterization of A Measure of Information Discrepancy. Information Sciences 125 (2000) 207-232
Self-organizing Rhythmic Patterns with Spatio-temporal Spikes in Class I and Class II Neural Networks
Ryosuke Hosaka1,2,3,4, Tohru Ikeguchi1, and Kazuyuki Aihara3,2
1 Graduate School of Science and Engineering, Saitama University, 255 Shimo-Ohkubo, Sakura-ku, Saitama 338–8570, Japan
2 Aihara Complexity Modelling Project, ERATO, JST, 3-23-5-201 Uehara, Shibuya-ku, Tokyo 151–0064, Japan
3 Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153–8505, Japan
4 [email protected]
Abstract. Regularly spiking neurons are classified into two categories, Class I and Class II, by their firing properties for constant inputs. To investigate how the firing properties of single neurons affect ensemble rhythmic activities in neural networks, we constructed different types of neural networks whose excitatory neurons are Class I neurons or Class II neurons. The networks were driven by random inputs and developed with STDP learning. As a result, the Class I and the Class II neural networks generate different types of rhythmic activities: the Class I neural network generates slow rhythmic activities, and the Class II neural network generates fast rhythmic activities.
1 Introduction
Ensembles of neurons, for example neural synchrony, cell assemblies, and neural rhythmic synchrony, have received much attention because they are thought to play a significant role in the nervous system. However, the mechanisms that generate them remain unclear. We have already reported a possible generating mechanism of neural synchrony [7]. In this paper, we therefore focus on neural rhythmic synchrony. Rhythmic synchronies are often observed in the brain [10,11,12], and they are categorized into several groups by their frequencies [2]: delta rhythm (1.5 ∼ 4 [Hz]), theta rhythm (4 ∼ 10 [Hz]), beta rhythm (10 ∼ 30 [Hz]), and gamma rhythm (30 ∼ 80 [Hz]). In the rodent hippocampus, theta rhythms are observed and are thought to represent important information [12]. In cat and human neocortex, gamma rhythms are observed [11]. These rhythms are also observed in brain waves [10]. Recently, Izhikevich demonstrated by computer simulations that a spiking neural network can generate the delta and the gamma rhythms [9]. The neural network is composed of 800 regularly spiking neurons for excitatory neurons and
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 39–48, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. Firing frequency of Class I and Class II neurons: (a) Class I excitability, (b) Class II excitability. The vertical axes show the spike frequency. The lower lines indicate the strength of the input I(t).
200 fast spiking neurons for inhibitory neurons, and the neural network develops with STDP learning, which modifies synaptic connections depending on the neural activities [1]. However, network activities generally depend on the properties of the element neurons. Therefore, in this paper, we investigate how network activities vary if we change the properties of the excitatory neurons. In mammalian neocortex, six fundamental classes of firing patterns are observed [3,4,5]: regularly spiking neurons; intrinsically bursting neurons; chattering neurons; fast spiking interneurons; low-threshold spiking neurons; and late spiking neurons. Among them, the regularly spiking neuron is the most common. Hodgkin stimulated regularly spiking neurons with a constant current and observed their firing frequency [6]. By their excitability, Hodgkin classified the regularly spiking neurons into two sub-categories: Class I and Class II. Figure 1 shows schematic representations of the firing frequency characteristics when the injected constant input is ramped up slowly. The Class I neurons start to fire with a low frequency once the input passes a critical point of firing. In contrast, the Class II neurons start to fire with a high frequency that remains relatively constant even when the magnitude of the injected current increases. The Class I and the Class II excitabilities are realized by different bifurcation structures [13]: the Class I excitability occurs when a neuron exhibits a saddle-node bifurcation; the Class II excitability occurs when a neuron exhibits a Hopf bifurcation. It is therefore important to analyze how these differences affect the macroscopic rhythms produced by neural networks. To investigate this issue, we constructed three types of neural networks with the Class I, the Class II, or both types of neurons as excitatory neurons, and stimulated them with random inputs.
The first and second ones are homogeneous neural networks in which all excitatory neurons are Class I or Class II, respectively. The third one is a heterogeneous neural network in which the excitatory neurons are a mixture of Class I and Class II neurons. In the third type of neural network, we vary the mixing ratio of the two classes. The
neurons are connected through chemical synapses, and the connection strengths of the synapses change dynamically depending on the activities of the neurons. This dynamic change of synaptic connections is called Spike-Timing-Dependent synaptic Plasticity (STDP) [1].
2 Methods
2.1 Neural Network
In this paper, we used a neuron model proposed by Izhikevich [8], which is described as follows:
$$\dot{v} = 0.04v^2 + 5v + 140 - u + I(t), \qquad (1)$$
$$\dot{u} = a(bv - u), \qquad (2)$$
with an auxiliary after-spike resetting condition:
$$\text{if } v = 30\ \text{[mV], then } v \leftarrow c,\; u \leftarrow u + d, \qquad (3)$$
where v and u are dimensionless variables, a, b, c and d are dimensionless parameters, and the dot represents d/dt, where t is the time ([ms]). The variable v represents the membrane potential ([mV]) of the neuron and u represents a membrane recovery variable, which accounts for the activation of K+ ionic currents and the inactivation of Na+ ionic currents, and provides negative feedback to v. We constructed neural networks in the following way. Each network is composed of 1,000 neurons, of which 80% are excitatory and 20% are inhibitory, as in the cortex. The first neural network has Class I excitatory neurons. The second neural network has Class II excitatory neurons. The third neural network has both Class I and Class II excitatory neurons. Nullclines of the Class I and the Class II neurons are shown in Fig.2(a), and the firing frequencies of the Class I and the Class II neurons are shown in Fig.2(b). The properties of the inhibitory neurons are common to all networks. The inhibitory neurons have Class II excitability, and their time constant is much shorter than that of the excitatory neurons, as in the cortex. We applied an STDP rule (details are described below) only to excitatory-to-excitatory connections, while the other connections are fixed. Each neuron is connected to only 100 other neurons. For simplicity, time is assumed to be discrete (the time step is 1 [ms]). Then, the dynamics of the neural networks develops as follows:
$$v_j(t+1) = v_j(t) + 0.04 v_j(t)^2 + 5 v_j(t) + 140 - u_j(t) + I_j(t) + \sum_{i=1}^{N} w_{ij}\, h(v_i(t - d_{ij}) - 30), \qquad (4)$$
$$u_j(t+1) = u_j(t) + a_j \left( b_j v_j(t) - u_j(t) + e_j \right), \qquad (5)$$
with the auxiliary after-spike resetting
Fig. 2. Nullclines and firing frequency of Class I and Class II neurons. (a) Nullclines of v and u of the neurons in the (v, u) plane. The lines represent the sets of equilibrium points of u ($\dot{u} = 0$) for the Class I and Class II neurons. The quadratic curve represents the set of equilibrium points of v ($\dot{v} = 0$) for both Class I and Class II neurons. (b) Firing frequency (Hz) of Class I and Class II neurons in response to constant inputs of varying strength.
$$\text{if } v_j(t) = 30\ \text{[mV], then } v_j(t) \leftarrow c_j,\; u_j(t) \leftarrow u_j(t) + d_j, \qquad (6)$$
where $v_j(t)$ is the membrane potential of the j-th neuron, $u_j(t)$ is the recovery variable of the j-th neuron, and $a_j$, $b_j$, $c_j$, $d_j$ and $e_j$ are dimensionless parameters; $e_j$ was introduced to regulate the firing rate of the neural network. For the Class I excitatory neurons, $a_j$ = 0.02, $b_j$ = −0.1, $c_j$ = −65.0, $d_j$ = 8.0 and $e_j$ = −22. For the Class II excitatory neurons, $a_j$ = 0.02, $b_j$ = 0.26, $c_j$ = −65.0, $d_j$ = 8.0 and $e_j$ = 2. For the inhibitory neurons, $a_j$ = 0.1, $b_j$ = 0.2, $c_j$ = −65.0, $d_j$ = 2.0 and $e_j$ = 0. $w_{ij}$ is the synaptic connection from the i-th neuron to the j-th neuron. The synaptic weights from excitatory neurons are initially set to 5.0. The synaptic weights from inhibitory neurons are set to −6.0. If the i-th neuron and the j-th neuron are not connected, $w_{ij}$ = 0. The self-connection $w_{ii}$ is also 0. h(·) is the Heaviside step function. $d_{ij}$ is the synaptic transmission delay, which is chosen randomly between 1 and 20 [ms]. $I_j(t)$ (= 0 or 20) represents the external input to the j-th neuron, and $I_j(t)$ follows a Poisson process whose mean ISI is 1000 [ms].
2.2 STDP Learning Rule
Several experimental studies have reported window functions of STDP learning (see e.g., Ref.[1]). In this paper, we used a typical function (Fig.3) [14]. The magnitude of the synaptic weight modification (Δw) decreases exponentially with the temporal difference (Δt) between the arrival time of a pre-synaptic action potential and the occurrence of its corresponding post-synaptic action potential:
$$\Delta t = t_{pre} + d_{pre,post} - t_{post}, \qquad (7)$$
where $t_{pre}$ is the spike time of the pre-synaptic neuron, $t_{post}$ is the spike time of the post-synaptic neuron, and $d_{pre,post}$ is the delay of spike transmission from the
Fig. 3. The learning window for the STDP learning (horizontal axis: Δt; vertical axis: Δw). Δw is determined by Δt, which is the temporal difference between the arrival time of a pre-synaptic action potential and the occurrence time of its corresponding post-synaptic action potential. Nearest-neighbor pairs of arrival and occurrence of spikes, not all pairs, are assumed to contribute to plasticity.
pre-synaptic neuron to the post-synaptic neuron. Then, the synaptic modification Δw is described by the following equation,
$$\Delta w(\Delta t) = \begin{cases} A_p\, e^{\Delta t/\tau_p} & (\Delta t < 0), \\ -A_d\, e^{-\Delta t/\tau_d} & (\Delta t \ge 0), \end{cases} \qquad (8)$$
where $A_p$ and $A_d$ are the maximum rates of modification ($A_p$ = 0.1, $A_d$ = 0.12), and $\tau_p$ and $\tau_d$ are the time constants for potentiation and depression, respectively ($\tau_p = \tau_d$ = 20 [ms]). We assumed that the synaptic efficacy is limited to the range 0 ≤ $w_{ij}$ ≤ 10, because otherwise the STDP learning rule could drive the synaptic weights to arbitrarily large or small values through repeated potentiation or depression.
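The two building blocks of this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' simulation code. The first function iterates Eqs. (4)-(6) for a single isolated neuron, omitting the synaptic coupling term, the delays and the Poisson input. The second evaluates the STDP window of Eqs. (7)-(8) with the weight bounds given above. The function names, the starting point on the u-nullcline, and the use of a threshold test `v >= 30` in place of the equality in Eq. (6) are assumptions made for illustration.

```python
import math

# Parameter sets given in Section 2.1.
CLASS_I = dict(a=0.02, b=-0.1, c=-65.0, d=8.0, e=-22.0)
CLASS_II = dict(a=0.02, b=0.26, c=-65.0, d=8.0, e=2.0)

# STDP constants given in Section 2.2.
A_P, A_D = 0.1, 0.12          # maximum potentiation / depression
TAU_P = TAU_D = 20.0          # time constants [ms]
W_MIN, W_MAX = 0.0, 10.0      # bounds on the synaptic efficacy


def count_spikes(a, b, c, d, e, current, steps=1000, v0=-65.0, u0=None):
    """Iterate Eqs. (4)-(6) for one isolated neuron with a 1 ms step."""
    v = v0
    u = b * v0 + e if u0 is None else u0    # assumption: start on the u-nullcline
    spikes = 0
    for _ in range(steps):
        if v >= 30.0:                        # assumption: ">=" rather than "=" in Eq. (6)
            spikes += 1
            v, u = c, u + d                  # after-spike reset
        # Both updates use the values from the current step (tuple assignment).
        v, u = (v + 0.04 * v * v + 5.0 * v + 140.0 - u + current,   # Eq. (4), no coupling
                u + a * (b * v - u + e))                             # Eq. (5)
    return spikes


def stdp_dw(t_pre, t_post, delay):
    """Weight change of Eq. (8) for one pre/post spike pair (times in ms)."""
    dt = t_pre + delay - t_post              # Eq. (7)
    if dt < 0:                               # input arrived before the post-synaptic spike
        return A_P * math.exp(dt / TAU_P)
    return -A_D * math.exp(-dt / TAU_D)      # input arrived at or after the post-synaptic spike


def clip_weight(w):
    """Keep the synaptic efficacy in the range 0 <= w <= 10."""
    return min(W_MAX, max(W_MIN, w))


for name, params in (("Class I", CLASS_I), ("Class II", CLASS_II)):
    counts = [count_spikes(current=i, **params) for i in (0.0, 2.0, 5.0, 10.0)]
    print(name, counts)                      # spikes in 1000 ms for each constant input

w = 5.0                                      # initial excitatory weight
w = clip_weight(w + stdp_dw(t_pre=100.0, t_post=115.0, delay=5.0))   # dt = -10 ms
print(w)
```

Counting spikes over 1000 steps of 1 ms gives a rough firing rate in Hz for each constant input, which is the quantity plotted in Fig. 2(b). The 1 ms forward step is the one stated in the text and is coarse, so the counts are indicative only.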
3 Results
3.1 Homogeneous Networks
Figure 4 shows raster plots of the network activities and their power spectra. Each dot on a raster plot indicates the firing of a neuron. In each raster plot, indices from 1 to 800 on the vertical axis indicate the excitatory neurons, and the rest indicate the inhibitory neurons. At the beginning of the simulations (in Fig.4, at sec=1), both the Class I and the Class II networks show slow rhythmic activities with frequencies of 4 ∼ 6 [Hz]. These slow rhythms correspond to the theta rhythm (4 ∼ 10 [Hz]) that is often observed in the hippocampus [12]. With
Fig. 4. Activities of the Class I and Class II neural networks and the power spectra of the rhythms. Each spectrum is estimated for corresponding temporal epoch.
Fig. 5. Spectrograms of the network activities of the Class I and Class II neural networks. Horizontal axes indicate time and vertical axes indicate frequencies. Colors correspond to normalized powers. As time evolves, the Class I network exhibits slower rhythms in the delta band (1.5 ∼ 4 [Hz]), and the Class II network exhibits faster rhythms in the gamma band (30 ∼ 80 [Hz]).
Fig. 6. Spectrograms of the network activities of the heterogeneous neural networks. As the number of the Class II neurons increases, the final frequencies exhibited by the neural networks become higher.
time evolution, the neurons come to fire in faster rhythms (in Fig.4, at sec=2 and sec=5). Then, the rhythm of the Class I neural network begins to slow down (in Fig.4, at sec=200). Finally, the rhythm of the Class I neural network settles in frequency bands lower than 4 [Hz] (left column of Fig.4, at sec=3600), and the rhythm corresponds to the delta rhythm (1.5 ∼ 4 [Hz]). In contrast, the Class II neural network generates rhythms in high frequency bands at the end of the simulations. The frequency of the fast rhythm of the Class II network corresponds to the gamma rhythm (30 ∼ 80 [Hz]). We summarized the transition of the power spectra of the rhythms observed in the neural networks as spectrograms (Fig.5). In the Class II neural network, the rhythm converges monotonically to a high frequency. On the other hand, the rhythm of the Class I neural network converges to a low frequency and the transition is non-monotonic. In the rhythm of the Class I neural network, the power at high frequencies does not become completely zero. This is because the slow rhythm is composed of successive synchronies (see left column of Fig.4, at sec=3600).
3.2 Heterogeneous Networks
Generally, neurons in biological neural networks are not homogeneous; several types of neurons are mixed. We therefore constructed a neural network composed of both Class I and Class II neurons. Figure 6 shows the spectrograms of the network rhythms observed in the heterogeneous neural network. In the networks with 80% and 60% Class I neurons, slow rhythmic activities are strongly observed. On the other hand, in the networks with 40% and 20% Class I neurons, fast rhythmic activities are strongly observed.
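The text does not state how the power spectra and spectrograms were estimated. As one possible illustration, the sketch below finds the dominant frequency of a population activity trace with a direct discrete Fourier transform and assigns it to one of the frequency bands listed in the Introduction. The function names, the 1 kHz sampling rate (matching the 1 ms time step), the 100 Hz search limit, and the half-open band boundaries are all assumptions.

```python
import cmath
import math

# Frequency bands from the Introduction, taken here as half-open intervals [low, high).
BANDS = (("delta", 1.5, 4.0), ("theta", 4.0, 10.0),
         ("beta", 10.0, 30.0), ("gamma", 30.0, 80.0))


def band_name(freq_hz):
    """Return the name of the band containing freq_hz, or 'other'."""
    for name, low, high in BANDS:
        if low <= freq_hz < high:
            return name
    return "other"


def dominant_frequency(counts, fs_hz=1000.0, f_max_hz=100.0):
    """Frequency (Hz) with the largest DFT power in a spike-count time series.

    counts[t] is the number of spikes in time bin t; fs_hz is the bin rate.
    """
    n = len(counts)
    mean = sum(counts) / n                       # remove the DC component
    k_max = min(n // 2, int(f_max_hz * n / fs_hz))
    best_k, best_power = 0, -1.0
    for k in range(1, k_max + 1):
        coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(counts))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * fs_hz / n


# Synthetic example: population activity modulated at 5 Hz for one second.
activity = [10.0 + 10.0 * math.cos(2.0 * math.pi * 5.0 * t / 1000.0) for t in range(1000)]
f = dominant_frequency(activity)
print(f, band_name(f))
```

Applying such a function to successive time windows of the summed spike raster would give one column of a spectrogram per window. A fast Fourier transform would be preferable for long recordings; the direct sum is used here only to keep the sketch free of dependencies.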
4 Discussions
We constructed three neural networks (homogeneous Class I, homogeneous Class II, and heterogeneous), stimulated them with random inputs, and compared their rhythmic activities using spectrograms. As a result, the Class I neural network shows slow rhythmic activities, while the Class II neural network shows fast rhythmic activities. In our simulations, the rhythms are reproduced not only by the homogeneous networks (Fig.5) but also by the heterogeneous networks (Fig.6), and the dominant neurons in the neural network determine the rhythm of the network. Biological neural networks are also heterogeneous. Our result raises the possibility that the dominant neuron type determines the rhythm of a neural network. If the Class I neurons are dominant, the network generates slow (delta) rhythms. If the Class II neurons are dominant, the network generates fast (gamma) rhythms. We observed three types of rhythms, delta, theta, and gamma rhythms, in this study (Fig.5). The theta rhythm and the gamma rhythm are often observed in in vivo and in vitro experiments in the hippocampus and neocortex, respectively [12,11]. However, the delta rhythms are observed only in brain waves, and they are rarely observed in in vivo and in vitro experiments,
which means that it is difficult to locate the region where the delta rhythms originate, because brain waves reflect the activity of the whole brain. Our results suggest that anatomical information may reveal the origins of the delta rhythms. Namely, if many Class I neurons are found in a region, the delta rhythm may be generated in that region.
References
1. G. Bi and M. Poo. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18:10464–10472, 1998.
2. G. Buzsáki and A. Draguhn. Neural oscillations in cortical networks. Science, 304:1926–1929, 2004.
3. B. W. Connors and M. J. Gutnick. Intrinsic firing patterns of diverse neocortical neurons. Trends in Neuroscience, 13:99–104, 1990.
4. J. R. Gibson, M. Belerlein, and B. W. Connors. Two networks of electrically coupled inhibitory neurons in neocortex. Nature, 402:75–79, 1999.
5. C. M. Gray and C. A. McCormick. Chattering cells: superficial pyramidal neurons contributing to the generation of synchronous oscillations in the visual cortex. Science, 274:109–113, 1996.
6. A. L. Hodgkin. The local electric changes associated with repetitive action in a non-medullated axon. Journal of Physiology, 107:165–181, 1948.
7. Ryosuke Hosaka, Osamu Araki, and Tohru Ikeguchi. Spike-timing-dependent synaptic plasticity makes a source of synfire chain. Submitted to Neural Computation, 2006.
8. E. M. Izhikevich. Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14:1569–1572, 2003.
9. E. M. Izhikevich. Polychronization: Computation with spikes. Neural Computation, 18:245–282, 2006.
10. H. Miyakawa and M. Inoue. Biophysics of Neurons. Maruzen, 2003. In Japanese.
11. M. A. L. Nicolelis, editor. Advances in Neural Population Coding. Elsevier, 2001.
12. J. O’Keefe and M. L. Recce. Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3:317–330, 1993.
13. J. Rinzel and B. B. Ermentrout. Analysis of neural excitability and oscillations. In C. Koch and I. Segev, editors, Methods in Neuronal Modeling. MIT Press, 1989.
14. S. Song, K. D. Miller, and L. F. Abbott. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3:919–926, 2000.
Self-organization Through Spike-Timing Dependent Plasticity Using Localized Synfire-Chain Patterns
Toshio Akimitsu1, Akira Hirose1, and Yoichi Okabe2
1 Department of Electronics Engineering, The University of Tokyo, Bunkyo-ku, Tokyo 113-8656, Japan
2 The University of the Air, Chiba City, Chiba 261-8586, Japan
Abstract. Many experimental results suggest that precise spike timing is significant in neural information processing. From this point of view, we construct a self-organization model using spatiotemporal patterns, in which Spike-Timing Dependent Plasticity (STDP) tunes the conduction delays between neurons. STDP forms a smoother map from spatially random and dispersed patterns, whereas it produces spatially distributed clustering patterns from spatially continuous and synchronous inputs. These results suggest that STDP forms highly synchronous cell assemblies that change with external stimuli, which may help to solve the binding problem.
1 Introduction
In the cerebral cortex, neurons are arranged so as to preserve the topological structure of the sensory input. Such neuronal structures are formed during development and modified during adulthood[1]. There are many computational models, based on Hebbian learning, of how this topological mapping is formed. Recent experimental evidence from several different preparations suggests that both the direction and the magnitude of the synaptic modification arising from repeated pairing of pre- and postsynaptic action potentials depend on the relative spike timing[2,3]. Spike-Timing Dependent Plasticity (STDP) forces synapses to compete with each other for control of the timing of postsynaptic action potentials. Song et al. showed that an orderly topological map can arise solely through STDP from random initial conditions without global constraints on synaptic efficiencies or additional forms of plasticity[4]. However, although the model operates on a millisecond scale, the patterns are composed of high firing-rate Poisson inputs, and coincident pre- and postsynaptic activity occurs by chance. Therefore, the meaning of the temporal causal relationship is not clear. On the other hand, many experimental results suggest that more precise spike timing plays a key role in the brain. For example, multiunit recording studies from the frontal cortex of behaving monkeys suggested that a spatio-temporal pattern of highly synchronous firing of neural populations can propagate through several tens of synaptic connections without losing high
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 49–58, 2006. © Springer-Verlag Berlin Heidelberg 2006
synchronicity[5]. This phenomenon is called “synfire-chain”; it implies that neurons with long time constants can convey information while preserving precise spike-time information. Diesmann et al. showed that in a multilayered feed-forward network of Integrate & Fire neurons, a pulse packet can propagate stably in the presence of background noise if the number of neurons in a pool is large enough and the igniting pulse packet is sufficiently synchronized [6]. This propagation is due to the exact timing of the excitatory inputs. Hence, it reflects temporal coding. Nevertheless, this result suggests that a pulse packet either becomes synchronized or is dispersed. Therefore, the information of a signal is finally reduced to 1 bit (0 or 1) through the propagation. Additionally, several studies show that a heterogeneous network structure such as Mexican-hat-type connectivity can convey quantitative information[7]. Similarly, Aviel et al. showed that by adding an inhibitory pool, a synfire chain can be embedded in a balanced network[8]. This network can also utilize quantitative information. In this paper, we show that by using localized synfire-chain patterns, STDP can tune the conduction delays between neurons and then form a self-organizing map whose patterns are expressed as firing clusters.
2 Pattern Refinement in Coincidence Detection Network
We use a simple Integrate & Fire neuron model, in which the membrane potential V is determined by
$$\tau_V \dot{V} = -(V - V_S) + G_E(t)(V - V_E) + G_I(t)(V - V_I), \qquad (1)$$
with $V_S = V_I = -70.0$ mV, $V_E = 0.0$ mV, and $\tau_V = 5$ msec. The synaptic inputs $G_E$ and $G_I$ are expressed as a spatio-temporal integration of synaptic efficiencies characterized by a step rise and an exponential decay:
$$G_{E,I}(t) = \sum_j W^{E,I}_{ij} \sum_{s_k} \Theta(t - s_k) \exp\left(-(t - s_k)/\tau_{E,I}\right), \qquad (2)$$
where Θ(t) is the step function and the time constants are chosen as $\tau_E = \tau_I = 5.0$ msec. The synaptic strength $W_{ij}$ is the transmission efficiency of the connection. All efficiencies from inhibitory neurons are assumed to have negative values (inhibitory synapses), while all those from excitatory neurons are positive (excitatory synapses). The $W^I_{ij}$ of the inhibitory synapses are constants chosen from the range [0.18, 0.22]. On the other hand, the $W^E_{ij}$ of the excitatory synapses are modified via STDP within the range [0, 0.05]. When the membrane potential V reaches the threshold value $V_{thr} = -54$ mV, the neuron fires and the membrane potential is reset to $V_{res} = -60$ mV. After firing, $G_{E,I}$ is kept at zero for 3 msec (absolute refractory period). Under these conditions, about 20 coincident excitatory spikes elicit firing. The model neural network is shown schematically in Fig.1(a). There are 100 excitatory neurons and 50 inhibitory neurons connected to each other. The input
Fig. 1. (a) The layered network consists of 100 excitatory and 50 inhibitory neurons connected to each other. (b) Raster plot: synchronized firing of the inhibitory neurons suppresses the following activities of the excitatory neurons (recurrent inhibition). Consequently, this network detects coincident firings.
neurons of the network generate spikes. We consider 100 excitatory input neurons, of which 25 neurons fire synchronously (with dispersion σ), and the other 75 neurons fire randomly (10 Hz Poisson spikes). After a 25 msec interval, a different set of 25 neurons fires synchronously. At first, we chose a set of 25 spatially contiguous neurons. After each interval, the neighboring 25 neurons were chosen by shifting the set by one neuron. This corresponds to a continuously changing pattern in which the center position of the active neurons represents the stimulus. This is a simple case; the general case is discussed in Section 4. The output neurons also receive 100 excitatory and 50 inhibitory inputs firing randomly (10 Hz Poisson inputs). In many brain areas, the temporal precision of spikes during stimulus-locked responses can be in the millisecond range. Reproducible temporal structure can also be found. Therefore, a “delay tuning mechanism” is needed. From this viewpoint, we regard the role of STDP as a tuning of the conduction delays of the
Fig. 2. (a) Window function of STDP (ΔW as a function of Δt = t1 − t2, with LTP and LTD branches). (b) Time diagram showing that only the spike pairs connected by arrows contribute to plasticity (near-neighbor interaction).
neurons. We determine the conduction delay $D^{OE}_{ij}$ from input neuron i to excitatory output neuron j to be proportional to the distance between them, in such a manner that the periodic boundary condition is satisfied. That is,
$$D^{OE}_{ij} \propto |i - j| \bmod N, \qquad (3)$$
where i and j are neuron indices, and we define
$$|i - j| \bmod N \equiv \min(|i - j|,\, N - |i - j|). \qquad (4)$$
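The distance of Eq. (4) and a delay proportional to it can be sketched as follows. The function names are illustrative. The proportionality constant is an assumption: it is chosen so that the largest possible ring distance, N/2, maps to the 3 ms maximum delay mentioned in this section, and a distance of zero maps to the 0 ms minimum. The inhibitory case uses the index 2j, following the description of the delay to the inhibitory neurons.

```python
N = 100              # number of input neurons
MAX_DELAY_MS = 3.0   # maximum conduction delay stated in the text


def ring_distance(i, j, n=N):
    """Distance between indices i and j on a ring of n neurons, Eq. (4)."""
    diff = abs(i - j) % n
    return min(diff, n - diff)


def delay_to_excitatory(i, j, n=N):
    """Delay (ms) from input neuron i to excitatory output neuron j, Eq. (3).

    Assumption: linear scaling so that the largest distance n/2 gives MAX_DELAY_MS.
    """
    return MAX_DELAY_MS * ring_distance(i, j, n) / (n // 2)


def delay_to_inhibitory(i, j, n=N):
    """Delay (ms) from input neuron i to inhibitory output neuron j (0 <= j < n/2).

    The inhibitory index is doubled so that the 50 inhibitory neurons span the ring.
    """
    return MAX_DELAY_MS * ring_distance(i, 2 * j, n) / (n // 2)


print(ring_distance(0, 99))          # neighbours across the periodic boundary
print(delay_to_excitatory(0, 50))    # farthest pair on the ring
```

With the periodic boundary, neurons 0 and 99 are nearest neighbours, so their delay is the shortest non-zero one rather than the longest.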
Since the number of inhibitory output neurons is half the number of input neurons, the delay $D^{OI}_{ij}$ from input neuron i to inhibitory neuron j is set proportional to |i − 2j| mod N, which also satisfies the periodic boundary condition. The maximum conduction delay is 3 ms, while the minimum is 0 ms. We consider the case in which the recurrent inhibitory conduction delays $D^I$ have an identical short value (0.5 msec in Sec.2, or 1.0 msec in Sec.3, 4). As the inhibitory neurons receive common inputs with the excitatory ones, their firing patterns are similar. After the short delay $D^I$, both of them receive inhibitory recurrent spikes, which suppress the firings of neurons whose postsynaptic spike latencies are large. As a result, this network detects coincident firings (Fig.1(b)). The probability that an input neuron is connected to an output neuron is 0.8, and the initial values of the synaptic strengths are chosen to be about half of the maximum. STDP was implemented only for the excitatory synapses of the output layer’s neurons,
$$\Delta W = \begin{cases} A_+ \exp(-\Delta t/\tau_+) & \Delta t > 0, \\ A_- \exp(-\Delta t/\tau_-) & \text{otherwise}, \end{cases} \qquad (5)$$
where $A_+$ and $A_-$ are the sizes of the synaptic modification by a single STDP event. We chose $A_+ = 0.02$, $A_- = 0.022$, and $\tau_+ = \tau_- = 20$ msec. LTD is
Self-organization Through Spike-Timing Dependent Plasticity
implemented only after the latest firing and LTP is implemented after the last firing (near-neighbor interaction). When one firing pattern is presented, the input spikes elicit a postsynaptic response, triggering the STDP rule. Synapses carrying input spikes just preceding the postsynaptic ones are potentiated, and later ones are depressed. This modification causes a decrease of the postsynaptic spike latency. Hence, the next time this input pattern is presented, the firing threshold will be reached sooner. Consequently, some previously potentiated synapses are depressed, while other synapses that carry even earlier spikes are potentiated. Over repeated presentations, the postsynaptic spike latency tends to settle at a minimal value, while the synapses contributing to the firing front become fully potentiated and those contributing to later firings become fully depressed [9]. In this network, inhibitory neurons receive inputs similar to those of the excitatory neurons, and the excitatory synapses onto the inhibitory neurons are also modified. The changes of the postsynaptic spike latencies of the inhibitory neurons therefore almost keep pace with those of the excitatory ones. Therefore, this network can detect coincident firings, even if the synaptic efficiency has changed during learning.
We first study the case of highly synchronized and spatially smoothed input patterns with no recurrent excitatory inputs. The dispersion of the synchronicity is chosen as σ = 0.5 msec. This result is shown in Fig 3. To investigate the degree of spatio-temporal clustering, we calculated a “coincident clustering histogram”. If the difference between the firing times of two neurons is within a time bin ΔT, we consider these two neurons to be coincident, and we compute the spatial difference histogram $\sum_k c_k(i - j)$,

$$c_k(i - j) = \begin{cases} 1 & |t_i^k - t_j^k| < \Delta T \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

where $t_i^k$ denotes the firing time of neuron i for input pattern k. The simulation result is shown in Fig.5. Only the synapses with the shortest conduction delays survive, and the others are pruned. This also reduces noise firing. As a result, STDP refines the patterns.
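The histogram of Eq. (6) can be sketched in Python as follows. This is only an illustration under stated assumptions: each neuron contributes at most one firing time per pattern (NaN marks a silent neuron), self-pairs i = j are excluded, index differences are wrapped onto the ring of N neurons, and the counts are left unnormalized. The function name `coincidence_histogram` is hypothetical.

```python
import numpy as np


def coincidence_histogram(times, bin_width):
    """Count coincident neuron pairs per index difference i - j, summed over patterns.

    times: array of shape (n_patterns, n_neurons) with firing times in ms,
           NaN where a neuron did not fire for that pattern.
    Returns (offsets, counts), where offsets are wrapped index differences.
    """
    times = np.asarray(times, dtype=float)
    n = times.shape[1]
    half = n // 2
    counts = np.zeros(n)
    for t in times:
        with np.errstate(invalid="ignore"):
            # c_k = 1 when |t_i - t_j| < bin_width; comparisons with NaN are False
            coincident = np.abs(t[:, None] - t[None, :]) < bin_width
        i, j = np.nonzero(coincident)
        keep = i != j  # drop self-pairs (an assumption of this sketch)
        d = (i[keep] - j[keep] + half) % n - half  # wrap to [-half, n - half)
        np.add.at(counts, d + half, 1)
    return np.arange(-half, n - half), counts


# Tiny example: neurons 0 and 1 fire within 1 ms of each other, neuron 3 is silent.
offsets, counts = coincidence_histogram([[0.0, 0.2, 5.0, np.nan]], bin_width=1.0)
```

A histogram concentrated near zero offset indicates that coincident neurons are also spatial neighbors.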
3 Forming Distributed Firing Patterns
In the previous section, we showed that STDP could tune the earliest conduction delays. However, such tuning works best if all neurons fire synchronously. We next consider the local excitatory patterns obtained by adding feedback excitatory inputs. To form a topological map, recurrent excitatory synapses are needed to cluster similar input patterns. However, if the firings from the previous layer are highly synchronous, the neurons with short latency come to fire in response to the external inputs alone. The recurrent excitatory spikes therefore arrive only after the firing of that population, which causes depression of the recurrent connections. How, then, can the recurrent excitatory connections have an effect? The first possibility is that the input spikes through neighboring excitatory synapses evoke a burst-like subsequent firing. Nevertheless, since this firing does
Fig. 3. Refinement map is formed after learning. (a) Spike Raster of input neurons. (b) Spike Raster of output neurons after learning. (c) Coincident clustering histogram for 100 input neurons. (d) Coincident clustering histogram for output neurons (after learning). (e) Conduction delays from input neurons to output neurons. (f) Weight distribution after learning.
not depend on the external inputs, such self-excitatory loops lead to uncontrollable network activity. Another possibility is that the recurrent excitatory inputs raise the firing of another population. After firing, the membrane potentials of the neurons are reset and are not affected by later input signals for 3 msec. Meanwhile, the membrane potentials of the other neurons, which were suppressed by the inhibitory feedback inputs, are close to the threshold value. Such neurons therefore fire more easily, and the excitatory inputs can cause the firing of another population.
Fig. 4 shows a representative result, where the feedback excitatory conduction delays are chosen as DE = 1.5 msec for all the connections. The two peaks for the 3 msec bin in Fig. 4(d) show the existence of another population firing a few milliseconds later. In this model, the conduction delays are identical for all the connections, so this histogram has gentle peaks. However, sharper peaks, corresponding to distinctive coupling sets, can occur depending on the conduction delays between the recurrent neurons.
Fig. 4. Distributed firing patterns are caused by recurrent excitatory connections. (a) Raster plot of the excitatory neurons. (b) Weight distribution of the recurrent excitatory connections. (c, d) Coincident clustering histogram for different time bins.
4 Forming a Topological Map
Next, we consider the more general case to relate this model to the topological map model. Here, we denote the firing patterns of these input neurons as $\xi^k = \{\xi_1^k, \cdots, \xi_j^k, \cdots, \xi_N^k\}$, where j is the neuron index (the maximum N = 100) and k is the pattern index, switched at regular intervals. If neuron j fires synchronously, we denote $\xi_j^k = 1$. Then, $\xi^k$ satisfies

$$\sum_j \xi_j^k = R \quad (\forall k = 1, 2, \cdots, N) \qquad (7)$$

$$\xi^1 \cdot \xi^2 = \xi^2 \cdot \xi^3 = \cdots = \xi^N \cdot \xi^1 = Rm \qquad (8)$$
where m is the overlap of patterns. A high m means a continuous pattern shift. In the previous sections, these parameters were set to R = 25, m = 0.96. In Section 2, the spatial network structure was assumed to have ordered conduction delays. Additionally, the set of synchronously firing neurons was spatially continuous. Here, as a substitute for random conduction delays, we consider the case of shuffled input patterns. We choose a pair of neurons randomly and exchange their indices, repeating this 200 times. After this shuffling, the firing pattern $\eta^k$ also satisfies the conditions

$$\sum_j \eta_j^k = R \quad (\forall k = 1, 2, \cdots, N) \qquad (9)$$

$$\eta^1 \cdot \eta^2 = \eta^2 \cdot \eta^3 = \cdots = \eta^N \cdot \eta^1 = Rm \qquad (10)$$
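The pattern conditions can be checked numerically. The following Python sketch constructs one-neuron-shifted patterns with R = 25 on a ring of N = 100 neurons, applies 200 random index exchanges, and verifies that the row sums and consecutive overlaps are unchanged. Reading "shuffling" as one fixed permutation of neuron indices applied to every pattern is an assumption, and all function names are hypothetical.

```python
import numpy as np


def shifted_patterns(n=100, r=25):
    """Row k is a binary pattern with r consecutive active neurons starting at k (ring)."""
    xi = np.zeros((n, n), dtype=int)
    for k in range(n):
        xi[k, (k + np.arange(r)) % n] = 1
    return xi


def shuffle_indices(xi, n_swaps=200, seed=0):
    """Exchange the indices of randomly chosen neuron pairs n_swaps times."""
    rng = np.random.default_rng(seed)
    perm = np.arange(xi.shape[1])
    for _ in range(n_swaps):
        a, b = rng.choice(perm.size, size=2, replace=False)
        perm[[a, b]] = perm[[b, a]]
    return xi[:, perm]  # the same permutation is applied to every pattern


def consecutive_overlaps(p):
    """Dot products of pattern k with pattern k+1, with the last paired to the first."""
    return (p * np.roll(p, -1, axis=0)).sum(axis=1)


xi = shifted_patterns()
eta = shuffle_indices(xi)
```

With a one-neuron shift, consecutive patterns share 24 of their 25 active neurons, which gives Rm = 24 and hence m = 0.96. A permutation of neuron indices changes neither the row sums nor the dot products, so the shuffled patterns satisfy the same two conditions.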
We determined the dispersion of the synchronized pattern as σ = 3.0 msec. These patterns with large dispersion are considered to correspond to the Poisson input patterns [4]. Besides, each synchronized set is presented twice. Recurrent synapses are needed to form a smoothed map. We determine the conduction delay $D^{EE}_{ij}$ from excitatory neuron i to excitatory neuron j in proportion to |i − j| mod N in the range of [0.5, 1.5]. We also determined the
[Fig. 5 graphics. Panels (a) and (b): raster plots of neuron index (#Input Neuron, #Output Neuron) against time (ms). Panels (c) and (d): histograms against the index distance of two neurons (−50 to 50), bin = 1.0 msec.]
Fig. 5. (a) Raster plot of the shuffled input neurons. (b) Raster plot of the output excitatory neurons. Two cycles are shown. (c, d) Coincident clustering histogram for (c) input neurons, and (d) output neurons.
delay $D^{EI}_{ij}$ from excitatory neuron i to inhibitory neuron j in proportion to |i − 2j| mod N in the same range. Hence, some of the neighboring connections are shorter than the identical inhibitory conduction delay DI = 1.0 ms. The initial synaptic efficiencies are also determined in proportion to $N/2 - |i - j| \bmod N$ in the range of [0.25 Wmax, 0.75 Wmax], where Wmax is the maximum efficiency (Wmax = 0.05). Even under these conditions, most of the connections were pruned, and a smoothed map could not be formed. Therefore, we set the sizes of the synaptic modification for recurrent excitatory synapses to B+ = 0.02, B− = 0.015. The result is shown in Fig. 5. The smoothed topological map was formed. Furthermore, Fig. 5(d) shows that a highly synchronous topological map is formed.
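One way to see why the choice of amplitudes matters is to compare the total area under the STDP window for the two parameter sets. The Python sketch below is a hedged illustration, not the authors' implementation. It assumes the conventional exponential window in which potentiation is positive for dt = t_post − t_pre > 0 and depression is negative otherwise, both decaying with |dt|. It also assumes that the recurrent synapses use the same 20 msec time constants, which the text does not state. The near-neighbor pairing restriction is not modeled, and the function names are hypothetical.

```python
import numpy as np


def stdp_dw(dt, a_plus, a_minus, tau_plus=20.0, tau_minus=20.0):
    """Weight change for a spike pair with dt = t_post - t_pre in ms (assumed convention)."""
    dt = np.asarray(dt, dtype=float)
    ltp = a_plus * np.exp(-np.abs(dt) / tau_plus)      # pre before post
    ltd = -a_minus * np.exp(-np.abs(dt) / tau_minus)   # post before pre
    return np.where(dt > 0, ltp, ltd)


def window_area(a_plus, a_minus, tau_plus=20.0, tau_minus=20.0):
    """Integral of the window over all dt: A+ * tau+ - A- * tau-."""
    return a_plus * tau_plus - a_minus * tau_minus


feedforward_area = window_area(0.02, 0.022)  # A+ = 0.02, A- = 0.022
recurrent_area = window_area(0.02, 0.015)    # B+ = 0.02, B- = 0.015
```

Under these assumptions the feedforward window has a negative area (−0.04), so uncorrelated spike pairs weaken a synapse on average, whereas the recurrent window with B+ and B− has a positive area (+0.1). This is consistent with the observation that the recurrent connections were pruned until the depression amplitude was reduced.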
5 Conclusion
We demonstrated that self-organization with locally synchronized patterns can be performed by tuning the conduction delays between neurons. In such a tuned network, patterns become more synchronized as they propagate through layers, and locally synchronized patterns are then transformed into distributed multi-clusters. That is, spatially and temporally random patterns are first bunched. Next, spatially continuous patterns are synchronized as they propagate through layers. Then the highly synchronized local patterns are transformed into distributed multi-clusters. Thus, a distributed information representation is considered to be achieved in the higher visual areas. We also note that such a network can use not only spike-timing information but also population information, as a degree of temporal pattern matching. An important question in neural research is how information is encoded and processed in the brain. These results suggest that, as the temporal rate is translated into firing frequency, the neural network utilizes both temporal and population codes.
References
1. LeVay, S., Wiesel, T. N. and Hubel, D.: The Development of Ocular Dominance columns in normal and visually deprived monkeys. J. Comp. Neurol. 191 (1980) 1-51
2. Bi, G. and Poo, M.: Activity-Induced synaptic modifications in hippocampal culture: dependence on spike timing synaptic strength and cell type. J. Neuroscience 18 (1998) 10464-10472
3. Markram, H., Lübke, J., Frotscher, M. and Sakmann, B.: Regulation of Synaptic Efficacy by Coincidence of Postsynaptic APs and EPSPs. Science 275 (1997) 213-215
4. Song, S. and Abbott, L. F.: Cortical Development and Remapping through Spike Timing-Dependent Plasticity. Neuron 32 (2001) 339-350
5. Abeles, M., Bergman, H., Margalit, E. and Vaadia, E.: Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol 70 (1993) 1629-38
6. Diesmann, M., Gewaltig, M. O. and Aertsen, A.: Stable Propagation of synchronous spiking in cortical neural network. Nature 402 (1999) 529-533
7. Hamaguchi, K. and Aihara, K.: Quantitative information transfer through layers of spiking neurons connected by Mexican-Hat-type connectivity. Neurocomputing 58-60 (2004) 85-90
8. Aviel, Y., Horn, D. and Abeles, M.: Synfire waves in small balanced networks. Neurocomputing 58-60 (2004) 123-127
9. Guyonneau, R., VanRullen, R. and Thorpe, S. J.: Neurons Tune to the Earliest Spikes Through STDP. Neural Comput 17 (2005) 859-879
Comparison of Spike-Train Responses of a Pair of Coupled Neurons Under the External Stimulus
Wuyin Jin, Zhiyuan Rui, Yaobing Wei, and Changfeng Yan
School of Mechano-Electronic Engineering, Lanzhou University of Technology, Lanzhou 730050, China
[email protected]
Abstract. Numerical calculations have been made on the consistent spike-train response of a pair of locus ceruleus (LC) neurons coupled by a synapse. The coupled, excitable LC neurons are assumed to receive a constant, periodic, or chaotic external stimulus at the dendrite of the first neuron, whose soma potential drives the other neuron along the axon. With appropriate stimulus and coupling strength, the synchronized oscillation between the two neurons is well preserved even when the external stimulus is chaotic. For a small time scale stimulus, one inspiring simulation result is that the wave shape or chaotic attractor of the stimulus can be transmitted completely by the neuronal interspike interval (ISI) sequence, including the periodic and chaotic characteristics of the stimulus, such as its phase space or chaotic attractor; this phenomenon disappears for a big time scale stimulus.
1 Introduction
As is well known, neurons communicate by producing sequences of action potentials to carry out the many operations that extract meaningful information from sensory receptor arrays at the organism’s periphery and translate it into action, imagery and memory, and the neural system completes these processes by encoding and decoding [1-3]. Coding is a question of fundamental importance in neuroscience. Synchronization of coupled systems is a commonly used mode, and the presence, absence or degree of synchronization can be an important part of the function or dysfunction of a biological system [4]. Neuronal synchronization phenomena have received much attention recently, and many inspiring simulation results have been achieved, such as synchrony-dependent spike frequency [1,5], asynchrony and synchrony in sustained neural activity related to working memory [6], the relation between synchrony and coding [7], and the finding that massively synchronous activity can elicit long-term potentiation of synaptic transmission [8]. In this paper, we mainly focus on the spike trains of a pair of LC neurons coupled by an excitatory synapse without time delay, under external stimulation.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 59–64, 2006. © Springer-Verlag Berlin Heidelberg 2006
W. Jin et al.
2 Adopted Model
As in our previous work [1], we adopt a simple neural system consisting of a pair of neurons, numbered 1 and 2. The neurons, which are described by the LC model with identical parameters, are coupled by a synapse, as shown in Fig. 1; axon-1 of neuron-1 and the dendrite of neuron-2 act as the presynaptic and postsynaptic sides, respectively. The model takes into account that action potentials can propagate not only forwards from their initiation site along the axon, but also backwards into the dendrites [9]. In this work, we adopt axon-dendrite coupling, with the gap junction placed at the synapse. The membrane potentials of the soma and dendrite obey the current balance equations, and the parameters are explained in detail in Ref. [1].
Fig. 1. Schematic diagram of the coupled LC neurons. A stimulus current I drives the activity of neuron-1. When the membrane potential of neuron-1 reaches the threshold, soma-1 generates a spike, which is output through axon-1 to the synapse and generates an input current at the synapse to drive neuron-2.
3 Simulation Results
3.1 Periodic and Chaotic Applied Stimulus
Provided that the requirement of synchronous spiking in the two neurons is fulfilled, in this section we show how periodic and chaotic stimulus signals are transmitted by the coupled neuronal system. First, we studied the response of the neurons while changing the time scale of the stimulus, with a fixed coupling strength g = 5 μF/cm2 and an applied current shift Ishift = 3.0 μA to excite the neuron. After that, a bigger shift of the applied current was also studied.
Periodic stimulus. Figure 2 shows two typical phase dynamics of the neurons under an external sinusoidal stimulus. As Fig. 2a and 2b show, with a small time scale stimulus (frequency 0.8 Hz), the neurons exhibit synchronous firing of action potentials; that is, the neurons’ ISIs vary in anti-phase with the external stimulus current, and this state does not change until the stimulus rate becomes greater than about 1 Hz. With increasing frequency of the sinusoidal stimulus, an alternating periodic-chaotic ISI sequence is observed, including period-1, period-2, beat phenomena, burst spikes (shown in Fig. 2c and 2d, with a stimulus frequency of 10 Hz) and chaos. These phenomena can also be found in the activity of a single Hodgkin-Huxley (HH) neuron [10].
Fig. 2. Two typical phase dynamics of the neurons under an external sinusoidal stimulus. With a low-rate stimulus of frequency 0.8 Hz (a), the two coupled neurons exhibit anti-phase synchronous firing of action potentials (b); when the frequency of the sinusoidal stimulus is raised to 10 Hz (c), the neuronal response changes to a burst spike rhythm (d). ISI(1) is marked by + and ISI(2) is marked by · in (b) and (d).
Chaotic stimulus. The neuronal response to a chaotic stimulus current was also studied. The chaotic current is modified from the x-component of the chaotic Lorenz system, and the time scale of the stimulus is controlled by the integration time step used to solve the Lorenz system. The big time scale stimulus, integrated with a time step of 0.0001, is shown in Fig. 3a, and the small time scale stimulus, integrated with a time step of 0.00001, is shown in Fig. 4a; the corresponding synchronous neuronal ISIs are shown in Fig. 3b and 4b, respectively. Comparing Fig. 3b with Fig. 4b reveals the difference: the time courses of the two neurons’ ISIs in Fig. 4b vary with the external stimulus current, whereas the irregularly oscillating ISIs in Fig. 3b seem unrelated to the stimulus. The difference is also expressed by the reconstructed attractors. In Fig. 3, the original attractor (Fig. 3c) becomes a dark cloud in the attractors reconstructed from ISI(1) (Fig. 3d) and ISI(2) (Fig. 3e). In contrast, in Fig. 4 the form and structure of the original attractor (Fig. 4c) are preserved by ISI(1) and ISI(2) of the coupled two-neuron system, whose attractors are shown in Fig. 4d and 4e, respectively.
Summary. Finally, even with a small time scale, the coupled neuronal system cannot transmit a bigger stimulus completely. For example, we enlarged only the applied current shift to 6.0 μA, with the same fixed coupling strength g = 5 μF/cm2, and drove the coupled neuron system with the small time scale periodic stimulus (similar to that in Fig. 2a) and the chaotic stimulus (similar to that in Fig. 4a), respectively. Under these conditions, the firing frequency of the neurons becomes higher, and ISI(1) keeps the rhythm of the stimulus, as shown in Fig. 5a (marked by +) for the
Fig. 3. The big time scale stimulus transmitted by the ISIs of the coupled neuronal system. (a) The big time scale stimulus modified from the x-component of the chaotic Lorenz system, integrated with a time step of 0.0001, with applied current shift Ishift = 3.0 μA; (b) synchronously varying ISI(1) (marked by +) and ISI(2) (marked by ·), overlapping each other; (c) the attractor reconstructed from the applied current shown in (a), with embedding dimension m = 3 and delay time τ = 1500; (d-e) the attractors reconstructed from ISI(1) and ISI(2) shown in (b), respectively, with the same embedding dimension m = 3 and delay time τ = 17. The attractors in (d) and (e) do not resemble the input attractor (c) in shape or structure.
Fig. 4. The low-rate afferent stimulus transmitted by the ISIs of the coupled neuronal system. (a) The low-rate afferent stimulus modified from the x-component of the Lorenz system, integrated with a time step of 0.00001, with applied current shift Ishift = 3.0 μA; (b) synchronously varying ISI(1) (marked by +) and ISI(2) (marked by ·), overlapping each other and keeping the rhythm of the stimulus; (c) the attractor reconstructed from the applied current shown in (a), with embedding dimension m = 3 and delay time τ = 15000; (d-e) the attractors reconstructed from ISI(1) and ISI(2) shown in (b), with the same embedding dimension m = 3 and delay time τ = 7. The attractors in (d) and (e) are similar to the input attractor (c) in shape and structure.
periodic case and Fig. 5b for the chaotic case. ISI(2), however, shows bifurcation behavior, with a time course that is separated but similar to the stimulus, as shown in Fig. 5a (marked by ·) for the periodic case and Fig. 5c for the chaotic case. The chaotic attractor reconstructed from the ISI(1) sequence is similar to the input attractor, but the ISI(2) chaotic attractor becomes four pieces of cloud, all of which retain only a little information about the stimulus.
Fig. 5. The effect of enlarging the applied current shift on the ISIs and reconstructed attractors in response to the periodic and chaotic low-rate afferent stimuli, with Ishift = 6.0 μA, in contrast with the neuronal responses shown in Fig. 2 and Fig. 4; (a) for the periodic case, ISI(1) (marked by +) keeps the rhythm of the stimulus, but ISI(2) (marked by ·) shows bifurcation behavior; (b-e) for the chaotic case, (b) shows ISI(1) also keeping the rhythm of the stimulus and (c) shows ISI(2) with a separated time course similar to the stimulus.
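The attractor reconstructions referred to in Figs. 3 to 5 use delay-coordinate embedding with embedding dimension m = 3 and a delay τ given in samples. The Python sketch below illustrates that procedure together with a simple way of generating the Lorenz x-component. The integration scheme (forward Euler), the classic Lorenz parameters (σ = 10, ρ = 28, β = 8/3), and the initial state are assumptions, since the text specifies only the time step; how the x-component is modified into a stimulus current is not modeled. The function names are hypothetical.

```python
import numpy as np


def lorenz_x(n_steps, dt, state=(1.0, 1.0, 1.0), sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """x-component of the Lorenz system by forward Euler (scheme and parameters assumed)."""
    x, y, z = state
    xs = np.empty(n_steps)
    for n in range(n_steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        xs[n] = x
    return xs


def delay_embed(series, dim=3, tau=1):
    """Rows are delay vectors (s[n], s[n + tau], ..., s[n + (dim - 1) * tau])."""
    series = np.asarray(series, dtype=float)
    n_vectors = series.size - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for this dim and tau")
    return np.column_stack([series[k * tau: k * tau + n_vectors] for k in range(dim)])


stimulus = lorenz_x(n_steps=1000, dt=0.001)        # short run, for illustration only
embedded = delay_embed(np.arange(10), dim=3, tau=2)
```

The same `delay_embed` call applies to an ISI sequence, for example with `dim=3` and `tau=7` as in Fig. 4; comparing the resulting point cloud with the embedding of the stimulus is the kind of visual comparison described in the captions.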
4 Conclusion and Discussion
We have performed a numerical investigation of the spike-train responses of a pair of LC neurons coupled by a synapse. The response of the coupled, excitable LC neurons to constant, periodic and chaotic inputs shows a rich variety. The response strongly depends on the stimulus strength, the coupling strength and the time scale of the stimulus, yielding bifurcation, multistability and chaotic behavior. The results shown in Section 3 suggest that the response ISIs vary with the style of the stimulus. We applied two types of periodic and chaotic inputs, i.e., small and big time scale stimuli. For the big time scale stimulus, the coupled neurons spike in synchrony, but the input signals cannot be transmitted completely by the ISIs, and the received chaotic attractor becomes a dark cloud. On the other hand, if the coupled neuronal system receives the small time scale stimulus, the afferent stimulus can be transmitted successfully by the sequence of ISIs, and the input chaotic attractor is transmitted completely in phase space by the ISIs. With increased stimulus strength, i.e., a larger applied current shift Ishift, however, the time course of the ISIs separates while remaining similar to the stimulus, and the chaotic attractor
becomes separated too. We therefore conclude that the time course of the ISIs plays an important role in neuronal coding, and we suggest that the nervous system can transmit only small time scale stimuli through the intervals between synchronous neuronal action potentials.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 10572056) and the Natural Science Foundation of Gansu Province (No. 3ZS042-B25-019).
References
1. Jin W.Y., Xu J.X., Wu Y., Hong L.: Rate of afferent stimulus dependent synchronization and coding in coupled neurons system. Chaos, Solitons and Fractals 21 (2004) 1221-1229
2. Koch C., Segev I.: The role of single neurons in information processing. Nat. Neurosci. 3 (2000) 1171-1177
3. Moore T., Armstrong K.M.: Selective gating of visual signals by microstimulation of frontal cortex. Nat. 421 (2003) 370-373
4. Ashwin P.: Synchronization from chaos. Nat. 422 (2003) 384-385
5. Alvarez V.A., Chow C.C., Van Bockstaele E.J., et al.: Frequency-dependent synchrony in locus ceruleus: Role of electrotonic coupling. Neurobiology 99 (2002) 4032-4036
6. Gutkin B. S., Laing C. R., Colby C. L.: Turning on and off with excitation: The role spike-timing asynchrony and synchrony in sustained neural activity. J. Comp. Neuronsci. 11 (2001) 121-134
7. Singer W.: Neuronal synchrony: A versatile code for the definition relation. Neuron 24 (1999) 49-65
8. Paulsen O., Sejnowski T. J.: Natural pattern of activity and long-term synaptic plasticity. Neurobio. 10 (2000) 172-179
9. Koch C.: Computation and the signal neuron. Nat. 385 (1997) 207-210
10. Jin W.Y., Xu J.X., Wu Y., Hong L.: An alternating periodic-chaotic ISI sequence of H-H neuron under external sinusoidal stimulus. Chinese Physics 13 (2004) 335-340
Fatigue-Induced Reversed Hemispheric Plasticity During Motor Repetitions: A Brain Electrophysiological Study
Ling-Fu Meng, Chiu-Ping Lu, Bo-Wei Chen, and Ching-Horng Chen
Department of Occupational Therapy and Institute of Clinical Behavioral Science, Chang Gung University, 259 Wen-Hua 1st Road, Guei-Shan, 333, Taoyuan, Taiwan
{lfmeng, cplu}@mail.cgu.edu.tw, [email protected], [email protected]
Abstract. Based on a preliminary case study, we conducted an event-related potential (ERP) study to explore the relationship between repetitive finger tapping and brain electrophysiological potentials. The present study found that errors increased with motor repetitions during tapping tasks performed with the right hand, especially in the third stage. We defined this stage as the fatigue stage and the first stage as the initial stage. In the fatigue stage, decreased N1 amplitudes (30-80 ms) at the right fronto-central and right central electrodes (FC4 and C4) were observed, compared with the initial stage. Moreover, a pronounced P2 amplitude (150-200 ms) and increased signal with time over the right hemisphere (F4 and C4 electrodes) were noticed in the fatigue state. Conversely, the contralateral left electrodes (FC3, C3, and F3) did not show the aforementioned N1 and P2 differences between the two stages. After using the Frequency Extraction method, a clear lateralized pattern in the fatigue stage was found. The left hemisphere showed lower and the right hemisphere showed higher alpha frequency phase content evolution. It was concluded that fatigue did lower the involvement of some areas in the brain but also made the right hemisphere take on more workload during the tapping task with the right hand. We call this compensatory change “fatigue-induced asymmetric hemispheric plasticity”. In addition, less signal change between the two hemispheres was found in the fatigue stage. Therefore, the mechanism of transcallosal interaction is strongly related to the fatigue state induced by motor repetitions.
1 Introduction
Motor repetitions can result in fatigue and induce changes in neural networks [1-8], especially in the absence of skill learning [1,5,8]. Furthermore, fatigue might lower the involvement of most areas in the brain [1,2,3,5] or decrease inter-hemispheric communication and so interfere with performance [2,6,7,8]. The event-related potential study conducted by Boksem et al. [3] showed that N1 amplitude decreased with the state of mental fatigue. Benwell et al. [5] conducted a functional MRI study and found a significant reduction in the number of voxels activated in primary sensorimotor cortex in the hemisphere contralateral to movement of the fatigued hand. In addition, some studies mentioned that different, repetition-sensitive neural mechanisms are involved in the fatigue mechanism [2,3,4,5].
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 65 – 71, 2006. © Springer-Verlag Berlin Heidelberg 2006
L.-F. Meng et al.
Unlike the aforementioned studies, the present study focused on the effect of fatigue on the dynamically changing process of lateralization. This lateralization issue has not yet been clearly substantiated. Howard and Reggia [7] found that learning rates for the two hemispheres changed independently over time to create a time-varying asymmetry in plasticity. Moreover, we hypothesized that the left hemisphere might become inactivated in the fatigue stage (after tapping many times with the right hand). Simultaneously, the right hemisphere might activate more to compensate for the disadvantage of the left hemisphere so that the task can be kept going longer. Therefore, we proposed that the non-dominant hemisphere could compensate by taking on more workload while the dominant hemisphere becomes fatigued [1,3,6]. We termed this phenomenon “fatigue induced right hemisphere’s compensation plasticity”. In order to confirm the mechanism we proposed, an ERP study was conducted.
2 Methods
2.1 Participants
Nine college students (3 males and 6 females) from age 20-24 without any neuromuscular or cerebral disease participated in this study. Six right-handed, two left-handed, and one mixed-handed were included in this study.
2.2 Variables
The stage was the independent variable including initial and fatigue stages. Dependent variables included the reaction time and the accuracy. The amplitudes of ERPs within specified time windows (30-80ms, 150-200ms, and 325-600 ms) and the phase description of the selected 8-12 Hz range from 0-500 ms were the dependent variables.
2.3 Experimental Design
The experimental research design was used to answer the research question. Each participant used right hand's fingers to press the computer keys. While seeing the number 2 on the screen, the participants used their index to press the corresponded key on the left side of the keyboard. They pressed the corresponded key on the mid position of the keyboard with the middle finger and pressed the key on the right side with the fourth finger respectively while seeing the number 3 and 4. Each finger pressed 200 times and the order of 600 trials was completely randomized.
2.4 Stimulus Presentation
The timing of the stimulus presentation was controlled and subject responses (accuracy and reaction time) were recorded using Stim II Software (Neuroscan, Inc. Sterling, VA, USA).
2.5 Electroencephalogram (EEG) Acquisition and ERP Recording
EEGs were recorded from Quick-Cap 32 Channel-Sintered electrodes located in standard 10-20 system. All electrodes impedance was brought to below 10 kΩ. The EEG
was band pass filtered (1-30 Hz) and digitized at a sampling rate of 1000 samples/s. The baseline for ERP measurements was the mean voltage of a 100ms pre-stimulus interval. Trials exceeding ±100 μV at horizontal and vertical Electro-Oculogram (EOG) were excluded first. Furthermore, trials containing eye blinks, eye movement deflections, exceeding ±60 μV at any electrode were also excluded from ERP averages.
2.6 Procedure
Each participant used right hand’s finger to tap 200 times in the initial, middle, and fatigue stage respectively. Totally 600 trials were performed. While seeing the number of 2, 3 and 4 showed on the screen, the participant pressed left, middle and right key respectively.
2.7 Statistics
ERP comparisons (12 sites) were conducted with paired t-test on the mean amplitude of the average ERPs within specified time windows (30-80ms, 150-200ms, and 325-600 ms).
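The processing chain described above (baseline correction over the 100 ms pre-stimulus interval, amplitude-based trial rejection, mean amplitude within a time window, and a paired t-test across participants) can be sketched for a single electrode as follows. This is an illustrative sketch under assumptions, not the Neuroscan pipeline used in the study: epochs are assumed to start 100 ms before stimulus onset, rejection is applied to already baseline-corrected epochs, and all function names are hypothetical.

```python
import numpy as np

FS = 1000      # sampling rate in samples per second, as stated in the text
PRE_MS = 100   # length of the pre-stimulus baseline in ms, as stated in the text


def baseline_correct(epochs):
    """Subtract each trial's mean pre-stimulus voltage.

    epochs: array (n_trials, n_samples) in microvolts for one electrode,
            assumed to begin PRE_MS before stimulus onset.
    """
    n_pre = PRE_MS * FS // 1000
    return epochs - epochs[:, :n_pre].mean(axis=1, keepdims=True)


def reject_artifacts(epochs, limit_uv=60.0):
    """Keep only trials whose absolute amplitude never exceeds limit_uv."""
    keep = np.abs(epochs).max(axis=1) <= limit_uv
    return epochs[keep]


def mean_amplitude(erp, start_ms, end_ms):
    """Mean of an averaged ERP between start_ms and end_ms after stimulus onset."""
    a = (PRE_MS + start_ms) * FS // 1000
    b = (PRE_MS + end_ms) * FS // 1000
    return erp[a:b].mean()


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two matched samples."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size)), d.size - 1
```

With nine participants, `paired_t` applied to the per-participant mean amplitudes of the initial and fatigue stages has 8 degrees of freedom.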
3 Results
3.1 Early and Late ERPs
The right fronto-central (FC4) and right central (C4) electrodes had less N1 (N30-80 ms) amplitudes in the fatigue stage than those in the initial stage (Table 1 and Fig 1).
Table 1. Summary of the t-tests conducted on the mean early ERPs in both stages

         N1 (30-80ms)
Site     Initial       Fatigue       t (8)     Initial      Fatigue      t (8)
F3       -0.08±1.10    0.29±0.60     -1.26     0.75±1.89
Fz       -0.84±1.13    0.43±0.79     -1.85     1.68±2.47
F4       0.20±1.96     0.49±1.64     -1.37     0.55±2.92
FC3      -1.12±1.09    -0.68±0.66    -1.63     2.28±2.19    2.58±2.15    -1.30
FCz      -1.53±1.01    -1.05±0.69    -2.14     2.78±2.72    3.44±2.72    -2.24
FC4      -1.21±1.11    -0.80±0.87    -2.68*    2.45±2.85    3.17±2.86    -2.30*
C3       -1.41±1.01    -0.93±0.67    -2.13     2.61±2.17    2.56±2.39    0.17
Cz       -1.80±1.00    -1.35±0.86    -2.02     3.21±2.73    3.52±2.88    -0.87
C4       -1.59±0.98    -1.14±0.70    -2.59*    2.94±2.76    3.22±2.75    -0.81
P3       -0.84±0.89    -0.73±1.00    -0.51     1.50±1.33    0.35±2.55    2.02
Pz       -1.38±0.97    -0.92±1.20    -2.24     2.23±1.60    1.96±2.04    0.57
P4       -1.02±0.90    -0.76±0.96    -1.92     2.02±1.63    1.34±1.87    1.49
*p 0.3). These results suggest that homosynaptic LTP by the present pairing protocol was induced under the condition inhibiting activation of dendritic Na+ channels and are consistent with a previous report that a brief application of TTX has no effect on the potentiation induced by high-frequency presynaptic stimulation (Thomas et al., 1998 [15]). However, in the same preparation, the magnitude of the heterosynaptically induced LTP in association with conditioning bursts was reduced, while a considerable amount of the LTP was preserved in the presence of low TTX (Fig.5B, C; (-), 149.8 ± 9.6 %; TTX, 122.1 ± 5.8 %; P < 0.05). No significant difference in the magnitude of induced LTP between conditioning and test pathways (P > 0.2; paired t-test) was observed under the control condition, whereas that of the heterosynaptic LTP was significantly decreased compared with that of the homosynaptic LTP in the presence of low TTX (P < 0.05; paired t-test). These results suggest that Na+ channel activation in the apical dendrites plays a significant role in the propagation of the generated depolarization from one set of synapses subjected to conditioning bursts to another subjected to test pulses in the induction of heterosynaptic plasticity in CA1 neurons.
Fig. 5. Low TTX-sensitivity in the induction of heterosynaptic associative LTP. A, B: Effects of low TTX on low-frequency pairing-induced LTP in the conditioning (A) and test (B) pathway. Top: Averaged 5 typical traces of fEPSPs before (thin) and 35-40 min after (thick) the pairing in the absence (-) and presence (TTX) of 10-20 nM TTX. The pair of traces in conditioning and test pathway (represented in A and B, respectively) was obtained from the same preparation under each condition. Bottom: Summarized low-frequency pairing-induced LTP in the control (open circles) and low TTX (filled circles) conditions. C: Comparison of the effects of low TTX on the magnitude of LTP in the conditioning and test pathway at 40 min after pairing. The number of recorded slices for each group is shown in parentheses.
4 Discussion

STLR (nonHebb) and Hebb can coexist in the CA1 pyramidal cells of the hippocampus

The spatiotemporal learning rule (nonHebb, Fig.6) proposed by Tsukada et al. (1994 [17], 1996 [1], 2005 [2]) consists of two defining factors: “cooperative plasticity without a postsynaptic spike” and its temporal summation. For the temporal summation, we have obtained evidence in neurophysiological experiments by applying temporal stimuli to the Schaffer collaterals of CA3 (Tsukada et al., 1994 [17], 1996 [1]), while the cooperative plasticity without postsynaptic spikes has not been
M. Tsukada and Y. Yamazaki
tested. In this paper, the coincidence of spike timing of paired stimuli to the Schaffer collaterals of CA3 played a crucial role in inducing associative LTP (Fig.3). The homosynaptic and heterosynaptic associative LTP could be induced under conditions which inhibited the activation of dendritic Na+ channels (Fig.5). Our results show that LTP can indeed occur at synapses on dendrites of hippocampal CA1 pyramidal neurons, even in the absence of a postsynaptic somatic spike. These results suggest that if the two inputs synchronize at the dendritic synapse of CA1 pyramidal cells, the synapse is strengthened and a functional connection is organized on the dendrite; if the two inputs are asynchronous, the connection is weakened.
Fig. 6. The spatiotemporal learning rule (STLR), where wij(t) is the value of a weight from neuron j to neuron i prior to adjustment; Δwij(t) is the change in that weight produced by the adjustment; η is the learning rate coefficient; xj(t) is the level of excitation of input to neuron j; yi(t) is the output of neuron i; Iij(t) is the value of spatial coincidence from neuron j to neuron i; h(u) is a sigmoid function of the potentiation force; θ denotes the thresholds; and λ2 is the time decay constant of temporal summation, which is a slow dynamic process (λ2 = 223 ms) (Aihara et al., 2000).
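The quantities named in the caption of Fig. 6 can be combined into a small numerical sketch. The Python code below is only an illustration under stated assumptions, not the authors' implementation: the product form of the coincidence term, the tanh-shaped h, the single threshold `theta`, the clipping of weights to [0, 1], and all parameter values except the 223 ms decay constant are hypothetical choices. It shows that synchronous inputs raise the weights while asynchronous inputs lower them, with no reference to a postsynaptic spike.

```python
import numpy as np

def spatial_coincidence(w, x):
    """Assumed form of I_ij for one postsynaptic cell: each input's own
    weighted drive times the summed weighted drive of all other inputs."""
    drive = w * x
    return drive * (drive.sum() - drive)

def stlr_run(w0, inputs, eta=0.05, theta=0.1, tau_slow=223.0, dt=1.0, gain=10.0):
    """Apply the sketched rule to a sequence of input vectors (one per step of dt ms)."""
    w = np.asarray(w0, dtype=float).copy()
    trace = np.zeros_like(w)            # temporal summation of the coincidence term
    decay = np.exp(-dt / tau_slow)      # slow process (decay value quoted in Fig. 6)
    for x in inputs:
        trace = trace * decay + spatial_coincidence(w, np.asarray(x, dtype=float))
        # h(u): hypothetical sigmoid, positive above theta and negative below it
        w = np.clip(w + eta * np.tanh(gain * (trace - theta)), 0.0, 1.0)
    return w

w0 = np.array([0.5, 0.5])
w_sync = stlr_run(w0, [[1, 1]] * 5)             # both inputs arrive together
w_async = stlr_run(w0, [[1, 0], [0, 1]] * 3)    # the inputs never coincide
```

In this toy setting `w_sync` ends above the initial weights and `w_async` below them, which mirrors the connection/disconnection behaviour described in the text.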
A schematic representation is drawn in Fig.7. The functional connection/disconnection depends on the input-input timing dependent LTP. This is different from the Hebbian learning rule, which requires coactivity of presynaptic and postsynaptic neurons. The spatiotemporal learning rule (nonHebb) incorporates two dynamic processes: fast (10 to 30 ms) and slow (150 to 250 ms). The fast process works as a time window to detect a spatial coincidence among various inputs projected to a weight space of the hippocampal CA1 dendrites, while the slow process works as a temporal integrator of a sequence of events. In a previous paper (T. Aihara et al., 2000 [16]), by parameter fitting to the physiological data on the time scale of LTP, the decay constant of the fast dynamics was identified as 17 ms, which matches the period of the hippocampal gamma oscillation. The decay constant of the slow dynamics is 169 ms, which corresponds to a theta rhythm. This suggests that cell assemblies are synchronized at two time scales in the hippocampal-cortical memory system, and that this synchronization is closely related to the memory formation of spatio-temporal context. On the other hand, Hebbian learning is characterized by coincident pre- and postsynaptic activity; the interconnected weights which contribute to firing a postsynaptic neuron are strengthened according to the delta rule. Supporting this point of view, a series of experiments have shown that synaptic modification can be induced by repetitive pairing of EPSPs and back-propagating dendritic spikes (BPDSs), providing
Functional Differences Between the STLR and HEBB Type
Fig. 7. A schematic representation of functional connection/disconnection, depending on input-input timing dependent LTP/LTD
direct empirical evidence to support Hebb’s proposal (Markram et al., 1997 [4]; Magee and Johnston, 1997 [18]; Zhang et al., 1998 [5]; Debanne et al., 1998 [10]; Bi and Poo, 1998 [11]; Feldman, 2000 [6]; Boettiger and Doupe, 2001 [7]; Sjostrom et al., 2001 [8]; Froemke and Dan, 2002 [9]). In this paper, spike timing dependent LTP was induced in the CA1 area of a hippocampal slice, as observed with optical imaging, when back-propagating spikes (Stim. B) were applied within a time window of 15 ms before and after the onset of Stim. A (Schaffer-commissural collateral of CA3). The heterosynaptically induced LTP in association with conditioning bursts was significantly reduced in the presence of low TTX (Fig.5B, C). From these experimental results, it is concluded that two learning rules, STLR and Hebb, must coexist in single pyramidal neurons of the hippocampal CA1 area.

The Functional Differences between STLR and Hebb

We applied the two rules to a single-layered neural network and compared its ability to separate spatiotemporal patterns with that of other rules, including the Hebbian learning rule and its extended rules. The simulated results (Tsukada and Pan, 2005 [2]) showed that the spatiotemporal learning rule, not the Hebbian learning rule (including its extended rules), had the highest efficiency in discriminating spatiotemporal pattern sequences. In the Hebbian rule, there is a natural tendency to map all of the spatio-temporal input patterns with an identical firing rate into one output pattern. In comparison, the spatio-temporal rule produced different output patterns depending on each individual input pattern. From this it is concluded that STLR has a high ability to separate spatio-temporal patterns, while this ability is lower for the Hebbian learning rule. We also extend the results of the theoretical simulation to a phenomenon occurring in a dendrite-soma system in single pyramidal cells with many independent local dendrites in the CA1 area of the hippocampus.
This system includes a spine structure, NMDA receptors (NMDAR), and sodium and calcium channels. The pyramidal cell integrates all of these local dendrite functions. The Hebbian learning rule leads to the following (Fig.8A). If the post-neuron fires due to some spatial input pattern, then all synapses that received its input are strengthened. The weight change substantially influences whether the next input makes the neuron fire. Then, if similar inputs arrive at the presynapses of the dendrite, the post-neuron continues to fire, and the same synapses are further strengthened. The next input must continue to add to the increasing synaptic weight. Since the same synapse continues to be
strengthened in this way, pattern separation becomes increasingly difficult. On the other hand, the neuron gains the ability to attract analogous patterns to its synapses. In contrast, STLR leads to the following (Fig.8B). The synaptic weight is strengthened according to the correlation level between the stimulus pattern and the postsynaptic weight, independent of the firing of the post-neuron. If the stimulus changes slightly, the synaptic weight in a different area is strengthened because the synaptic weights are randomly distributed, and different synapses are strengthened depending on which input pattern arrives repeatedly. This gives the learning rule its increased ability to separate spatio-temporal patterns.
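The contrast drawn above between the two rules can be made concrete with a toy update step. The sketch below is a hypothetical illustration, not the simulation of Tsukada and Pan (2005): `hebb_delta` uses a plain product of pre- and postsynaptic activity, `stlr_delta` uses an assumed input-coincidence term, and all numerical values are arbitrary. It shows only that the Hebbian change vanishes when the postsynaptic cell is silent, whereas the coincidence-driven change does not.

```python
import numpy as np

def hebb_delta(x, y, eta=0.1):
    """Hebbian change: needs presynaptic input x AND postsynaptic output y."""
    return eta * y * np.asarray(x, dtype=float)

def stlr_delta(w, x, eta=0.1):
    """Coincidence-driven change (assumed form): depends on the inputs and the
    current weights only, not on whether the postsynaptic cell fired."""
    drive = np.asarray(w, dtype=float) * np.asarray(x, dtype=float)
    return eta * drive * (drive.sum() - drive)

w = np.array([0.5, 0.5, 0.5])
x = np.array([1.0, 1.0, 0.0])           # inputs 0 and 1 are coactive, input 2 is silent

dw_hebb_silent = hebb_delta(x, y=0.0)   # no postsynaptic spike: no change
dw_hebb_firing = hebb_delta(x, y=1.0)   # postsynaptic spike: active inputs strengthened
dw_stlr = stlr_delta(w, x)              # no postsynaptic term at all
```

Under these assumptions only the coactive inputs change under the coincidence-driven update, while the Hebbian update strengthens every active input whenever the cell fires, which is the tendency the text links to reduced pattern separation.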
Fig. 8. Functional differences between Hebb (A), and STLR (B), and their interaction in a dendrite(local)-soma(global) system of single pyramidal cells of the CA1
Dendrite(local)-soma(global) interactions in single pyramidal cells of CA1

These results indicate that STLR and Hebb coexist in single pyramidal neurons of the hippocampal CA1 area. In STLR, synaptic weight changes are determined by the “synchrony” level of input neurons (bottom-up), while in Hebb, the soma fires by integrating dendritic local potentials or by top-down information such as environmental sensitivity, awareness, and consciousness (Fig.8C). When we are confronted by certain situations, we naturally compare them to our previous experiences, attempt to predict what may happen, and plan our actions with respect to those predicted outcomes that we find favorable. In this way, our past, present, and pre-future memory act as one and determine our actions. If these actions do not fit, then a new hypothesis is formulated, new data is reasoned, and the previous model is amended. The coexistence of STLR (local information) and Hebb (global information) may support this dynamic process, which repeats itself until the internal model fits the outer environment. In reinforcement learning, the dendrite-soma interaction in single pyramidal neurons of the hippocampal CA1 area can play an important role in the context formation of policy, reward, and value (Fig.8C).

Acknowledgments. This study was supported by The 21st Century Center of Excellence Program (Integrative Human Science Program, Tamagawa Univ.) and by a Grant-in-Aid for Scientific Research on Priority Areas (Integrative Brain Science Project) from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
References
1. Tsukada M, Aihara T, Saito H, Kato H (1996) Hippocampal LTP depends on spatial and temporal correlation of inputs. Neural Networks 9: 1357-1365
2. Tsukada M, Pan X (2005) The spatiotemporal learning rule and its efficiency in separating spatiotemporal patterns. Biol. Cybern. 92: 139-146
3. Hebb DO (1949) The Organization of Behavior. New York: John Wiley
4. Markram H, Lubke J, Frotscher M, Sakmann B (1997) Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 275: 213-215
5. Zhang LI, Tao HW, Holt CE, Harris WA, Poo M (1998) A critical window for cooperation and competition among developing retino-tectal synapses. Nature 395: 37-44
6. Feldman DE (2000) Timing based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron 27: 45-56
7. Boettiger CA, Doupe AJ (2001) Developmentally restricted synaptic plasticity in a songbird nucleus required for song learning. Neuron 31: 809-818
8. Sjostrom PJ et al. (2001) Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron 32: 1149-1164
9. Froemke RC, Dan Y (2002) Spike-timing-dependent synaptic modification induced by natural spike trains. Nature 416: 433-438
10. Debanne D, Thompson SM (1998) Associative long-term depression in the hippocampus in vitro. Hippocampus 6: 9-16
11. Bi G, Poo M (1998) Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic type. J. Neurosci. 18: 10464-10472
12. Aihara T, Kobayashi Y, Matsuda H, Sasaki H, Tsukada M (1998) Optical imaging of LTP and LTD induced simultaneously by temporal stimulus in hippocampal CA1 area. Soc Neurosci Abs 24: 1070
13. Huang YY, Pittenger C, Kandel ER (2004) A form of long-lasting, learning-related synaptic plasticity in the hippocampus induced by heterosynaptic low-frequency pairing. Proc Natl Acad Sci U S A 101(3): 859-64
14. Golding NL, Staff NP, Spruston N (2002) Dendritic spikes as a mechanism for cooperative long-term potentiation. Nature 418(6895): 326-31
15. Thomas MJ, Watabe AM, Moody TD, Makhinson M, O'Dell TJ (1998) Postsynaptic complex spike bursting enables the induction of LTP by theta frequency synaptic stimulation. J. Neurosci. 18(18): 7118-26
16. Aihara T, Tsukada M, Matsda H (2000) Two dynamic processes for the induction of long-term in hippocampal CA1 neurons. Biol. Cybern. 82: 189-195
17. Tsukada M, Aihara T, Mizuro M, Kato H, Ito K (1994) Temporal pattern sensitivity of long-term potentiation in hippocampal CA1 neurons. Biol. Cybern. 70: 495-503
18. Magee JC, Johnston D (1997) A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science 275(5297): 209-13
Ratio of Average Inhibitory to Excitatory Conductance Modulates the Response of Simple Cell

Akhil R. Garg1 and Basabi Bhaumik2

1 Department of Electrical Engineering, J.N.V. University, Jodhpur, India
2 Department of Electrical Engineering, I.I.T. Delhi, India
[email protected], [email protected]
Abstract. A recent experimental study reports the existence of complex-type interneurons in the primary visual cortex. The response of these inhibitory cells depends mainly upon feed-forward LGN inputs. The goal of this study is to determine the role of these cells in modulating the response of simple cells. Here we demonstrate that if the inhibitory contribution due to these cells balances the feed-forward excitatory inputs, the spike response of the cortical cell becomes sharply tuned. Using a single-cell integrate-and-fire neuron model, we show that the ratio of average inhibitory to excitatory conductance controls the balance between excitation and inhibition. We find that many different values of the ratio can result in a balanced condition. However, the response of the cell is not sharply tuned for each of these ratios. In this study we explicitly determine the best value of the ratio needed to make the response of the cell sharply tuned.

Keywords: Visual Cortex, Simple cells, Integrate and Fire, Complex cells.
1 Introduction

Information about different aspects of a visual scene is extracted by specialized groups of cells. One such group, the simple cells, is associated with the detection of oriented lines composing the input scene. These cells perform a computation that involves integration of feed-forward as well as recurrent excitatory and inhibitory inputs to produce responses that are sharply tuned for the orientation of the visual stimulus. Ever since the discovery of these cells by Hubel and Wiesel [1], many theoretical and experimental studies have been performed to understand the circuitry underlying their response properties. Many of these studies suggest that the orientation selectivity of these cells is shaped by feed-forward connections [1][2][3]. However, extracellular measurements show sharper orientation selectivity than that estimated from the spread of excitatory feed-forward inputs [4]. It is generally believed that the spike threshold of a neuron, together with recurrent excitatory, recurrent inhibitory and feed-forward inhibitory inputs to the cortical cell, modulates the feed-forward excitatory inputs to generate the desired sharpness in orientation selectivity. Experimental studies have indeed suggested that dynamic regulation of spike threshold enhances orientation selectivity [5]. However, the role of recurrent excitatory, inhibitory and feed-forward inhibitory inputs in generating orientation selectivity remains controversial [6][7][8].

I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 82 – 89, 2006. © Springer-Verlag Berlin Heidelberg 2006
A recent experimental study has shown the existence of two functionally distinct types of cells in layer IV; these are called simple-type and complex-type inhibitory cells [9]. The responses of these cells depend mainly upon feed-forward LGN inputs, and the cells are responsible for providing tuned and untuned feed-forward inhibitory inputs to simple cells [9][10]. The role of these inhibitory cells has been studied theoretically: it has been shown that the inhibitory inputs provided by these cells, in combination with feed-forward excitatory inputs, are sufficient to explain sharp contrast-invariant orientation tuning and low-pass temporal tuning of simple cells [10][11][12][13]. In our previous work [12][13] we showed that untuned inhibitory inputs due to complex-type inhibitory cells could be used to balance feed-forward excitatory inputs to produce contrast-invariant orientation tuning. In this study, to better understand the function of these inhibitory cells, we use a single-cell integrate-and-fire neuron model to examine how the balance between inhibition from these cells and feed-forward excitation sharpens the orientation selectivity of the simple cell. The values of the average excitatory and inhibitory conductance are considered important in influencing the average membrane potential of an integrate-and-fire neuron. We show that the ratio of average inhibitory to excitatory conductance (and not their exact values) is important in controlling the balance of excitation and inhibition. Furthermore, this ratio also controls the value of the average membrane potential of the cortical cell. We find that there can be a number of combinations of average excitatory conductance $\langle G_{ex} \rangle$ and average inhibitory conductance $\langle G_{in} \rangle$ that result in almost the same ratio and thereby a similar average membrane potential. Most importantly, we observe that many different values of the ratio can lead to a condition of balanced excitation and inhibition.
However, only higher values of the ratio make the response of the cell sharply tuned for orientation. Furthermore, we show that by simultaneously varying excitatory and inhibitory inputs we can obtain a sharply tuned response with variable firing rate, provided the required balance between excitation and inhibition is maintained.
2 Material and Methods

Three hierarchical layers, retina, LGN and cortex, are considered as a model of the visual pathway. The retinal layer is modeled as two separate 2D sheets of ganglion cells lying one over the other, one sheet consisting of ON-center and the other of OFF-center ganglion cells. Retinal ganglion cells (RGCs) have a center-surround receptive field structure, with center fields being 30′ wide and center-to-center spacing between the cells being 12′ of visual angle [14]. The cells were modeled to perform a linear spatio-temporal integration of the presented stimulus. There was a one-to-one correspondence between RGCs and LGN cells, so the response of each RGC was uniquely passed on to one LGN cell of the same type (ON/OFF). Firing rates of RGCs were used to generate LGN spikes ($A_i^{ON}$ and $A_i^{OFF}$) for the LGN cell at location “i” in the ON/OFF LGN sheet using a Poisson process. The details of the retinal cells' spatial receptive field, temporal function and the mechanism to generate LGN spikes were taken from the model of Wörgötter and Koch (1991) [15], who used this model earlier to produce realistic temporal responses to visual stimuli.

Simple cell RFs were modeled as Gabor functions; a Gabor function is a two-dimensional Gaussian multiplied by a sinusoid. The positive values of the Gabor function were taken to be the ON sub-region, yielding connections from ON-type LGN cells, and the negative values were taken to be the OFF sub-region, yielding connections from OFF-type LGN cells. The synaptic strength between LGN cell “i” in the ON sub-region and the cortical cell is described by its peak synaptic conductance $g_i^{ON}$. Similarly, the synaptic strength between LGN cell “i” in the OFF sub-region and the same cortical cell is described by $g_i^{OFF}$. We also assume that each simple cell, in addition to feed-forward excitatory inputs, receives feed-forward inhibitory inputs from “N” inhibitory cells. The synaptic strength of the connection between each inhibitory cell and the cortical cell is represented by $g^{in}$ (inhibitory peak synaptic conductance). The model simple cell in the cortical layer was a single-compartment integrate-and-fire neuron. The membrane potential of the model neuron changes according to
\[ \tau_m \frac{dV}{dt} = V_{rest} - V(t) + G_{ex}(t)\,\bigl(E_{ex} - V(t)\bigr) + G_{in}(t)\,\bigl(E_{in} - V(t)\bigr) \tag{1} \]
with $\tau_m = 20$ ms, $V_{rest} = -70$ mV, $E_{ex} = 0$ mV and $E_{in} = -70$ mV. $E_{ex}$ and $E_{in}$ are the reversal potentials for the excitatory and inhibitory synapses respectively. $V(t)$ is the membrane potential of the cortical cell at time step t. When the membrane potential of the neuron reaches the threshold value of −54 mV, the neuron fires an action potential and the membrane potential is subsequently reset to −60 mV (parameters taken from Song et al. (2000) [16]). Here, $G_{ex}(t)$ is the excitatory synaptic conductance of the cortical cell at time step t. It is measured in units of the leak conductance $g_l$ of the neuron. Whenever a particular ON (OFF) type LGN cell fires, the corresponding peak synaptic conductance contributes to the value of the excitatory synaptic conductance $G_{ex}$:

\[ G_{ex}(t+1) = G_{ex}(t) + \sum_{i=1}^{M} g_i^{ON}(t)\, A_i^{ON}(t) + \sum_{i=1}^{M} g_i^{OFF}(t)\, A_i^{OFF}(t) \tag{2} \]
Here M is the total number of ON (OFF) type LGN cells connected to a particular cortical cell. Similarly, $G_{in}(t)$ is the inhibitory synaptic conductance of the cortical cell at time step t. It is measured in units of the leak conductance $g_l$ of the neuron. Whenever any of the inhibitory cells fires a spike, it contributes to the value of the total inhibitory conductance $G_{in}$:

\[ G_{in}(t+1) = G_{in}(t) + \sum_{j=1}^{N} g_j^{in}(t)\, A_j^{inh}(t) \tag{3} \]
Otherwise, the inhibitory synaptic conductance decays exponentially, i.e.

\[ \tau_{in} \frac{dG_{in}}{dt} = -G_{in} \]

where $\tau_{in} = 5$ ms and $A_j^{inh}(t)$ is the activity of the jth inhibitory cell.
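Equations (1)-(3) and the exponential decay can be turned into a short simulation. The following Python sketch is a hedged illustration rather than the authors' code: the input counts, peak conductances, time step and the excitatory decay constant `tau_ex` (assumed equal to the inhibitory one) are hypothetical choices, and pooled Poisson spike counts stand in for the individual LGN and inhibitory cells.

```python
import numpy as np

def simulate_lif(rate_ex, rate_in, n_ex=100, n_in=50, g_ex=0.05, g_in=0.2,
                 t_max=1000.0, dt=0.1, seed=0):
    """Euler simulation of eqs. (1)-(3); times in ms, rates in Hz,
    conductances in units of the leak conductance.
    Returns (output firing rate in Hz, mean membrane potential in mV)."""
    tau_m, v_rest, e_ex, e_in = 20.0, -70.0, 0.0, -70.0    # values from the text
    v_thresh, v_reset, tau_in = -54.0, -60.0, 5.0           # values from the text
    tau_ex = 5.0                                            # assumption: same as tau_in
    rng = np.random.default_rng(seed)
    n_steps = int(round(t_max / dt))
    # Poisson input: number of presynaptic spikes arriving in each time step
    k_ex = rng.poisson(n_ex * rate_ex * dt / 1000.0, n_steps)
    k_in = rng.poisson(n_in * rate_in * dt / 1000.0, n_steps)
    v, G_ex, G_in = v_rest, 0.0, 0.0
    n_spikes, v_sum = 0, 0.0
    for step in range(n_steps):
        # exponential decay, then a jump of one peak conductance per input spike
        G_ex = G_ex * np.exp(-dt / tau_ex) + g_ex * k_ex[step]
        G_in = G_in * np.exp(-dt / tau_in) + g_in * k_in[step]
        # eq. (1), forward Euler step
        v += dt / tau_m * (v_rest - v + G_ex * (e_ex - v) + G_in * (e_in - v))
        if v >= v_thresh:               # threshold crossing: count spike and reset
            n_spikes += 1
            v = v_reset
        v_sum += v
    return n_spikes / (t_max / 1000.0), v_sum / n_steps

rate_silent, v_silent = simulate_lif(rate_ex=0.0, rate_in=0.0)
rate_exc, _ = simulate_lif(rate_ex=20.0, rate_in=0.0)
rate_inh, _ = simulate_lif(rate_ex=20.0, rate_in=40.0)
```

With no input the cell stays at rest; adding the inhibitory input lowers the output rate obtained with excitation alone. The inhibitory firing rate plays the role of the free parameter described in the text.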
The output of each of these inhibitory cells was in the form of spikes generated by an independent Poisson process. The firing frequency used for generating spikes was a free parameter; it was used to control the balance between excitation and inhibition. For a constant value of the average membrane potential of the cell, making the average excitatory and inhibitory currents to the cortical cell equal gives one condition for balance between excitation and inhibition. From equation (1), for $V = V_m$ and equating excitatory and inhibitory currents, we get
\[ \langle G_{ex} \rangle (E_{ex} - V_m) = -\langle G_{in} \rangle (E_{in} - V_m) \tag{4} \]

\[ \langle G_{in} \rangle = -\langle G_{ex} \rangle \frac{E_{ex} - V_m}{E_{in} - V_m} \tag{5} \]

where $\langle G_{ex} \rangle = M \langle f_{ex} \rangle g^{ON/OFF}$ and $\langle G_{in} \rangle = N \langle f_{in} \rangle g^{in}$ are the excitatory and inhibitory synaptic conductances, temporally averaged for every stimulus condition. They depend upon the mean firing rate, the number (M/N) and the value of the peak synaptic conductance ($g^{ON/OFF}$/$g^{in}$) of the excitatory and inhibitory inputs respectively. Therefore, from equation (5),

\[ \langle f_{in} \rangle = -K \langle f_{ex} \rangle \frac{E_{ex} - V_m}{E_{in} - V_m}, \qquad K = \frac{M\, g^{ON/OFF}}{N\, g^{in}} \tag{6} \]
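Equations (5) and (6) are simple enough to evaluate directly. The Python sketch below, with hypothetical function names and illustrative values for the target potential, the input counts and the peak conductances, computes the conductance ratio and the inhibitory firing rate required for balance. With the reversal potentials given above, a target potential of −56 mV corresponds to a ratio of 4.

```python
E_EX, E_IN = 0.0, -70.0   # reversal potentials from the text (mV)

def balanced_ratio(v_m, e_ex=E_EX, e_in=E_IN):
    """<G_in>/<G_ex> that equates the two currents at potential v_m, eq. (5)."""
    return -(e_ex - v_m) / (e_in - v_m)

def balancing_inhibitory_rate(f_ex, v_m, m, n, g_ex, g_in, e_ex=E_EX, e_in=E_IN):
    """Mean inhibitory rate <f_in> from eq. (6), with K = M g_ex / (N g_in)."""
    k = (m * g_ex) / (n * g_in)
    return -k * f_ex * (e_ex - v_m) / (e_in - v_m)

ratio = balanced_ratio(-56.0)   # 56 / 14 = 4.0
# Illustrative (hypothetical) input counts and peak conductances
f_in = balancing_inhibitory_rate(f_ex=20.0, v_m=-56.0, m=100, n=50, g_ex=0.05, g_in=0.2)
```

Because eq. (5) depends on the conductances only through their ratio, scaling both averages by the same factor leaves the balance condition unchanged.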
As M, N, $g^{ON/OFF}$, $g^{in}$ and $V_m$ are constants, we get a relationship between the average firing rate of inhibitory cells and the average firing rate of LGN cells. In order to compare the orientation selectivity of the cell for various conditions we used a measure called circular variance (CV). Higher values of circular variance signify a broadly tuned cell. CV is calculated from the mean firing rate of the neuron according to CV = 1 − |R|, where

\[ R = \frac{\sum_k r_k\, e^{i 2\theta_k}}{\sum_k r_k} \]

In the above equation, $r_k$ is the mean firing rate at orientation k and $\theta_k$ is the orientation in radians.
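The circular variance defined above can be computed in a few lines. The following Python sketch uses a hypothetical function name and made-up tuning curves purely to illustrate the two extremes of the measure: a response confined to one orientation gives CV = 0, and a flat response over evenly spaced orientations gives CV = 1.

```python
import numpy as np

def circular_variance(rates, thetas):
    """CV = 1 - |R|, with R = sum_k r_k exp(i*2*theta_k) / sum_k r_k.
    rates: mean firing rate per orientation; thetas: orientations in radians."""
    rates = np.asarray(rates, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    resultant = np.sum(rates * np.exp(2j * thetas)) / np.sum(rates)
    return 1.0 - np.abs(resultant)

thetas = np.linspace(0.0, np.pi, 8, endpoint=False)    # 8 evenly spaced orientations
cv_sharp = circular_variance([0, 0, 0, 10, 0, 0, 0, 0], thetas)   # single orientation
cv_flat = circular_variance(np.ones(8), thetas)                    # untuned response
```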
3 Results

A moving sinusoidal grating of different orientations and a particular spatial frequency was used as the input stimulus. Depending upon the firing rate of the excitatory cells, we used equation (6) to calculate $f_{in}$, the firing rate of the inhibitory cells needed to balance the feed-forward excitation. By varying the value of $f_{in}$ around this calculated value we varied the average inhibitory input to the cortical cell. The average excitatory input to the cortical cell depends on the thalamocortical pattern of connectivity and the orientation of the input stimulus. Therefore, the instantaneous excitatory input to the cortical cell depends on the orientation of the input stimulus. The cortical cell varies its discharge rate as a consequence of changes in the excitatory and inhibitory inputs. The values of the average excitatory and inhibitory conductance and their ratio determine the value of the average membrane potential of the cell, i.e. its response in the presence of excitatory and inhibitory inputs. We examined the selectivity of a cortical cell for different values of the ratio of average inhibitory to average excitatory conductance, i.e. $\langle G_{in} \rangle / \langle G_{ex} \rangle$. We also call this ratio the inh/exc ratio. As shown in figure 1(a), the cell has the same preference for orientation at different values of the ratio. However, the tuning of the cell improves as the ratio increases. We also observe that for the value of CV to be 0.2 or below, the ratio should be 4 or more (figure 1(b)). We next obtained the excitatory and inhibitory currents at different times for different orientations of the input stimulus. Figure 2(a) shows the variation of the absolute values of excitation and inhibition with time for different orientations of the input stimulus for ratio = 4.5. Similarly, figure 2(b) shows the variation of the absolute values of the excitatory and inhibitory inputs with time for different orientations of the input stimulus for ratio = 2.4. These plots show that excitation and inhibition are large or small at the same time for almost all orientations of the input stimulus and for both ratios. Thus the condition of balance between excitation and inhibition is maintained for different ratios and orientations of the input stimulus. It is also seen that for both of these ratios, the excitatory and inhibitory currents are larger for a stimulus of the preferred orientation, i.e. around 180°. We also observe that for ratio = 4.5 the difference between these currents at every instant is smaller than the difference between the two for ratio = 2.4.
Fig. 1. (a) Orientation tuning curves for different inhibition/excitation ratios (b) Plot showing variation of circular variance with inhibition/excitation ratio
We next obtained the response of the cortical cell for different average excitatory inputs. We varied the average excitation by multiplying the synaptic strength of the connections between LGN cells and the cortical cell by a constant. Furthermore, we varied the inhibition in such a manner that different values of $\langle G_{ex} \rangle$ and $\langle G_{in} \rangle$ resulted in an almost equal ratio $\langle G_{ex} \rangle / \langle G_{in} \rangle$. As shown in figure 3, we obtain a similar kind of tuning with different excitation and inhibition but a similar ratio. Furthermore, the output firing rate changes with changes in the excitatory and inhibitory inputs.
Fig. 2. (a) Variation of absolute excitatory and inhibitory currents with time for ratio=4.5. Each subplot in this figure is for a particular orientation of input stimulus. (b) Variation of absolute excitatory and inhibitory currents with time for ratio=2.4. Each subplot in this figure is for a particular orientation of input stimulus.
For a low excitatory input we need a low inhibitory input to obtain the same ratio, which results in a sharply tuned response and a low output firing rate. For high excitatory inputs we need a high inhibitory input to obtain the same ratio, which results in a sharply tuned response and a high output firing rate. As shown in figure 3, for an average excitatory conductance of 0.69 the maximum output firing rate is 11 spikes/s, which occurs for a stimulus of the preferred orientation. On the other hand, for an average excitatory conductance of 2.32 the maximum output firing rate of the cortical cell exceeds 100 spikes/s, again for a stimulus of the preferred orientation. These results suggest that the balance is more important than the exact values of $\langle G_{ex} \rangle$ and $\langle G_{in} \rangle$. Most importantly, these results suggest that a sharply tuned response with variable firing rate can be obtained by varying excitation and inhibition in such a manner that the required balance between excitation and inhibition is maintained.
4 Conclusion

We studied the role of untuned inhibition in the response of a cortical cell. We find that incorporating inhibition improves the selectivity of the cell. An important finding of our study is that for a cell to be highly selective for orientation, the excitatory and inhibitory inputs to the cortical cell must be balanced. Recent studies have emphasized that balanced excitatory and inhibitory inputs are needed for shaping the response of the cortical cell [8][12].
Fig. 3. Receptive field of a cortical cell and orientation-tuning curves for different AEC (Average excitatory conductance), AIC (average inhibitory conductance). Also for each of these cases we obtain almost similar AMP (average membrane potential).
Shadlen and Newsome (1998) [17] used this condition of balance to obtain variable discharge of cortical neurons and suggested several methods by which the net excitatory and inhibitory inputs to the cortical cell may be approximately balanced. Experimental evidence favoring such a balance emerges from intracellular recordings performed in simple and complex cells of the cat visual cortex [18]. In that study it was shown that IPSPs and EPSPs were elicited predominantly at the same time and by a bar of the preferred orientation of the cell. The important finding of the present study is that a balanced condition can be obtained for different inhibitory to excitatory ratios. However, the ratio of inhibitory to excitatory conductance should be around 4 for a cell to have a sharply tuned response. A recent experimental study has shown the existence of two functionally distinct types of cells in layer IV, called simple-type and complex-type inhibitory cells [9]. The response of these cells depends mainly upon the feed-forward LGN inputs, and we believe that complex-type inhibitory cells may provide the untuned inhibition needed to balance the excitation.
References
1. D.H. Hubel, T.N. Wiesel Receptive fields, binocular interaction and functional architecture in the cat's visual cortex J. Physiol. 160 (1962) 106-154.
2. D. Ferster, S. Chung, H. Wheat Orientation selectivity of thalamic input to simple cells of cat visual cortex Nature 380 (1996) 249-252.
3. P. Kara, J.S. Pezaris, S. Yurgenson, R.C. Reid The spatial receptive field of thalamic inputs to single cortical simple cells revealed by the interaction of visual and electrical stimulation Proc Natl Acad Sci USA 99 (2002) 16261-16266.
4. J.L. Gardner, A. Anzai, I. Ohzawa, R.D. Freeman Linear and nonlinear contributions to orientation tuning of simple cells in the cat’s striate cortex. Vis. Neurosci. 16 (1999) 1115-1121.
5. R. Azouz, C.M. Gray Adaptive coincidence detection and dynamic gain control in visual cortical neurons in vivo Neuron 37 (2003) 513-523.
6. D. Ferster, K.D. Miller Neural mechanisms of orientation selectivity in the visual cortex. Annu. Rev. Neurosci. 23 (2000) 441-471.
7. R. Shapley, M. Hawken, D.L. Ringach Dynamics of orientation selectivity in the primary visual cortex and the importance of cortical inhibition Neuron 38 (2003) 689-699.
8. J. Mariño, J. Schummers, D.C. Lyon, L. Schwabe, O. Beck, P. Wiesing, K. Obermayer, M. Sur Invariant computations in local cortical networks with balanced excitation and inhibition Nat. Neurosci. 8 (2005) 194-201.
9. J.A. Hirsch, L.M. Martinez, C. Pillai, J.M. Alonso, Q. Wang, F.T. Sommer Functionally distinct inhibitory neurons at the first stage of visual cortical processing Nat. Neurosci. 12 (2003) 1300-1308.
10. T.Z. Lauritzen, K.D. Miller Different roles for simple cell and complex cell inhibition in V1. J. Neurosci. 32 (2003) 10201-10213.
11. T.W. Troyer, A. Krukowski, N.J. Priebe, K.D. Miller Contrast-invariant orientation tuning in cat visual cortex: thalamocortical input tuning and correlation-based intracortical connectivity J. Neurosci. 18 (1998) 5908-5927.
12. A.R. Garg, B. Bhaumik, K. Obermayer The balance between excitation and inhibition not only leads to variable discharge of cortical neurons but also to contrast invariant orientation tuning In Lecture Notes in Computer Science (ICONIP), 3316 (2004) 90-95. Springer-Verlag.
13. A.R. Garg, B. Bhaumik, K. Obermayer Variable discharge pattern and contrast invariant orientation tuning of a simple cell: A modeling study Neur. Inf. Proc. Let. & Rev. 6 (2005) 59-68.
14. D. Somers, S. Nelson, M. Sur An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci. 15 (1995) 5448-5465.
15. F. Wörgötter, C. Koch A detailed model of the primary visual pathway in the cat: Comparison of afferent excitatory and intracortical inhibitory connection schemes for orientation selectivity. J. Neurosci. 11 (7) (1991) 1959-1979.
16. S. Song, K.D. Miller, L.F. Abbott Competitive Hebbian learning through spike timing dependent plasticity Nat. Neurosci. 3 (2000) 919-926.
17. M.N. Shadlen, W.T. Newsome The variable discharge of cortical neurons: Implications for connectivity, computation and information coding J. Neurosci. 18 (1998) 3870-3896.
18. D. Ferster Orientation selectivity of synaptic potentials in neurons of cat primary visual cortex. J. Neurosci. 6 (5) (1986) 1284-1301.
Modeling of LTP-Related Phenomena Using an Artificial Firing Cell
Beata Grzyb1 and Jacek Bialowas2
1 Faculty of Mathematics, Physics and Computer Science, Maria Curie-Sklodowska University, Pl. Marii Curie-Sklodowskiej 1, 20-031 Lublin, Poland [email protected]
2 Dept. of Anatomy and Neurobiology, Medical University of Gdansk, Debinki 1, 80-211 Gdansk, Poland
Abstract. We present a computational model of a neuron, called the firing cell (FC), that is a compromise between biological plausibility and computational efficiency and is aimed at simulating spike-train processing in living neuronal tissue. FC covers such phenomena as attenuation of receptors for external stimuli, delay and decay of postsynaptic potentials, modification of internal weights due to propagation of postsynaptic potentials through the dendrite, modification of the properties of the analog memory for each input due to a pattern of long-term synaptic potentiation (LTP), output-spike generation when the sum of all inputs exceeds a threshold, and refraction. We show that, depending on the phase of the input signals, the FC's output frequency demonstrates various types of behavior, from regular to chaotic.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 90 – 96, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction
In computational models of a neuron, biological plausibility and computational efficiency are conflicting objectives. We propose a compromise approach aimed at developing a tool that supports the interpretation of recordings of electric potentials in living neural tissue. The model we develop, called Firing Cell (FC), reflects only the most essential properties of biological neurons and their assemblies, so as not to increase the required computational power beyond necessity [1]. We have assumed that, for a single neuron, at least the following neurophysiological facts should be covered in a model of the discussed kind:
1. Each input signal causes changes of the excitatory (EPSP) or inhibitory (IPSP) postsynaptic potential in the area of the input point; there are typical patterns of their rise and decay. There is a set of commonly accepted neurophysiological data: time courses, amplitudes of various postsynaptic potentials, action-potential amplitude, and the thresholds for neural activation causing action potentials and removing the block of NMDA channels [2][3][4].
2. If (after the arrival of an action potential at the excitatory synapse) the accumulated local potential in the postsynaptic region of the input synapse rises above the local threshold (about -70 mV), the NMDA receptor opens the previously blocked channel for an influx of calcium ions [4][5], which leads to the phenomenon of Long-Term Potentiation (LTP) and the activation of input-correlated long-term memory.
3. Internal weights change due to propagation of postsynaptic potentials through the dendrite [6] and non-linear summation for the threshold function.
Currently available simulators of life-like neural circuits can simulate circuits built of thousands of cells [7], but they tend to go deeply into the detailed biophysics and biochemistry of a single neuron. When large neural networks are simulated, the complexity of the neuron's mathematical description therefore demands computational power that is hardly available; see, e.g., Hines [8], Sikora [9], Traub et al. [10], and Bower & Beeman [11]. We propose that in neural simulations the computationally expensive model of biophysical phenomena can be replaced with a model based on a set of three shift registers, and that no essential output properties are lost because of this replacement. The most significant neuron properties we consider are frequency coding, a memory of a single input value due to the patterns of EPSP or IPSP, the LTP-related memory, and a processing algorithm that facilitates non-linear potential summation.
2 Firing Cell Structure
Although the Firing Cell (FC) may occur in three versions (excitatory, inhibitory, and receptory), in this paper we discuss only the first one. FC consists of a dendrite, a body, and an axon (Fig. 1). The dendrite is a string of compartments, each with an input synapse. Each synapse is checked and, when an action potential is detected, the values in the table of a certain shift register are changed according to the typical time course of an EPSP or IPSP. Each register is a string of data-boxes. The first data-box of one of the registers is reserved for the actual value of activation of the related compartment. Every 0.5 msec of simulated time, the actual values of all registers are checked and their weighted sums are compared with the related thresholds. The weight correlates positively with the proximity of the related synapse to the cell body. At a given clock the output takes the value 1 if and only if the accumulated postsynaptic potential at the level of the neuron's body is at least as high as the threshold (-50 mV). The axon provides the output signal to the compartments of the dendrites of the destination FCs. The period between two state updates, i.e., between a given clock and the previous clock, is equivalent to 0.5 msec of simulated time. When an action potential is generated, the registers are reset to the value of the resting potential and the activity of all compartments is then inhibited for a 1.5 msec period of refraction. The values of the resting potential and threshold, as well as other initial parameters, can be set independently for each neuron before each simulation session. The minimal value of a single postsynaptic potential is calculated by dividing the difference between the threshold and the potassium equilibrium potential (-90 mV) by the number of inputs. For practical reasons, in the first simulations we enhanced this value for EPSP by two.
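The update cycle described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `FiringCell`, `epsp_kernel` and `minimal_psp`, the piecewise-linear EPSP shape, and the choice to ignore inputs during refraction are hypothetical. Only the 0.5 msec clock, the -50 mV threshold, the -90 mV potassium equilibrium potential, the reset on firing, and the 1.5 msec refraction come from the text; the -80 mV resting potential is the value shown in Fig. 2, and the 30-step register length corresponds to the 15 msec EPSP duration.

```python
DT_MS = 0.5                 # clock step (from the text)
THRESHOLD_MV = -50.0        # firing threshold at the cell body (from the text)
REST_MV = -80.0             # resting potential (value shown in Fig. 2)
E_K_MV = -90.0              # potassium equilibrium potential (from the text)
REGISTER_STEPS = 30         # 15 msec of PSP history / 0.5 msec per step
REFRACTORY_STEPS = 3        # 1.5 msec / 0.5 msec per step


def minimal_psp(n_inputs, threshold_mv=THRESHOLD_MV, e_k_mv=E_K_MV):
    """Minimal single PSP: (threshold - E_K) / number of inputs."""
    return (threshold_mv - e_k_mv) / n_inputs


def epsp_kernel(amplitude_mv, steps=REGISTER_STEPS, rise_steps=4):
    """Hypothetical EPSP time course: linear rise to the peak, then linear decay."""
    kernel = []
    for k in range(steps):
        if k < rise_steps:
            kernel.append(amplitude_mv * (k + 1) / rise_steps)
        else:
            kernel.append(amplitude_mv * (steps - k) / (steps - rise_steps + 1))
    return kernel


class FiringCell:
    """Excitatory FC with one shift register per input compartment (a sketch)."""

    def __init__(self, weights, amplitude_mv):
        self.weights = list(weights)
        self.kernel = epsp_kernel(amplitude_mv)
        self.refractory = 0
        self._reset_registers()

    def _reset_registers(self):
        # Registers hold deviations from rest, so "reset to rest" means all zeros.
        self.registers = [[0.0] * REGISTER_STEPS for _ in self.weights]

    def step(self, spikes):
        """Advance one clock; spikes[i] is True when input i receives a spike."""
        if self.refractory > 0:          # assumption: inputs are ignored while refractory
            self.refractory -= 1
            return 0
        for register, spike in zip(self.registers, spikes):
            if spike:                    # write the PSP time course into the register
                for k, value in enumerate(self.kernel):
                    register[k] += value
        # The first data-box of each register is the current activation.
        potential = REST_MV + sum(
            w * register[0] for w, register in zip(self.weights, self.registers)
        )
        if potential >= THRESHOLD_MV:
            self._reset_registers()
            self.refractory = REFRACTORY_STEPS
            return 1
        for register in self.registers:  # shift every register by one data-box
            register.pop(0)
            register.append(0.0)
        return 0
```

For a cell with 16 inputs, `minimal_psp(16)` evaluates to 2.5 mV. A list of per-input weights that grows toward the cell body would reproduce the positive correlation between weight and proximity mentioned in the text; unit weights are used in the check below only for simplicity.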
Fig. 1. Firing cell (FC) configuration used in experiments as excitatory cell. Labels in the figure: excitatory inputs, inhibitory inputs, synapses, dendrite, body, axonic hillock, axon, output.
We can regard the shift registers mentioned above as a very short-term local memory unit that remembers related events for up to 15 msec. The biological nervous system uses such memory to avoid the troublesome necessity of strict synchronization that takes place in conventional computers. We also need, however, a memory related to each input, and this memory must work according to the LTP rule. In each state-updating cycle the program calculates (based on the actual values in the registers and the appropriate weights) an accumulated local potential for the postsynaptic region of each input synapse. If, at the moment an action potential arrives at a particular input, the postsynaptic potential is above the local threshold (ca. -70 mV), the second register (which simulates the rise of calcium ion content inside the cell) changes the values of its data-boxes. Thus, the function operating on the registers substitutes for the time-consuming solution of differential equations relating the strength of synaptic potentiation to the calcium ion charge. The strength remains enhanced for a time calculated as a power function of the charge. If an additional charge appears after some time, the period of the calculated synaptic potentiation increases substantially. In the same manner we can easily include for each input a third register that simulates the slower influence of neuromodulatory substances.
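The LTP rule in this paragraph can be illustrated with a small per-input memory object. The sketch below is hypothetical throughout: the class `LtpMemory`, the function `potentiation_steps`, the unit calcium influx per qualifying spike, and the power-law constants `scale` and `exponent` are assumptions chosen for illustration, since the text states only that the duration is a power function of the charge. The -70 mV local threshold is the one value taken from the text.

```python
LOCAL_THRESHOLD_MV = -70.0   # local NMDA threshold (from the text)


def potentiation_steps(charge, scale=10.0, exponent=2.0):
    """Hypothetical power law: clock steps of potentiation for a given charge."""
    return int(scale * charge ** exponent)


class LtpMemory:
    """Per-input long-term memory driven by a calcium 'charge' counter (a sketch)."""

    def __init__(self):
        self.charge = 0.0
        self.remaining_steps = 0

    def on_input_spike(self, local_potential_mv, influx=1.0):
        """Called when an action potential arrives at this input."""
        if local_potential_mv > LOCAL_THRESHOLD_MV:   # NMDA channel unblocked
            self.charge += influx
            self.remaining_steps = potentiation_steps(self.charge)

    def tick(self):
        """Advance one clock step; potentiation ends when the timer runs out."""
        if self.remaining_steps > 0:
            self.remaining_steps -= 1
            if self.remaining_steps == 0:
                self.charge = 0.0

    @property
    def potentiated(self):
        return self.remaining_steps > 0
```

With an exponent above one, a second qualifying spike extends the potentiation period more than proportionally, which is one possible way to obtain the substantial increase described in the text.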
3 Firing Cell Behavior
To examine the usefulness of the Firing Cell, we used it as a model of a pyramidal hippocampal neuron under conditions such as those in the classical LTP experiment by Bliss and Lomo [12]. We configured the FC with 13 excitatory inputs and 3 inhibitory inputs [1]. Two cases were tested. In the first case the assumed EPSP amplitude was 5 mV (Fig. 2); in the second it was 7 mV. In the first case each FC synapse substituted for ca. 1000 biological synapses, in the second for ca. 1400. The results of the experiment confirmed the biological plausibility of FC with respect to information-processing-related mechanisms. In both tested cases the frequency of action-potential generation and the values of LTP increased after the training, and for the EPSP amplitude of 7 mV the frequency was substantially higher both before and after the training.
Fig. 2. Behavior of an untrained FC (a) and a trained FC (b) for EPSP 5 mV. The peaks located on the input lines show the history of action potential arrivals, while the vertical peaks above the threshold (-50 mV) show action potential generation by FC. The zig-zag-shaped line between the threshold and the resting potential (-80 mV) is the plot of the accumulated postsynaptic potential. The training of 400 msec at 100 Hz was with action potential spike trains on the 7th, 8th and 9th input (b, thick lines). The frequency of action potentials and the state of the long-term memory associated with the related input (real numbers on the left) increase after training (b). An animation of 2 sec of the firing cell's work is available at the web page of GABRI (http://www.gabri.org).
To demonstrate regular, periodic, or chaotic behavior of a spiking neuron we used Poincaré return maps. Our experiments showed changes in the neuron's behavior before, during, and after the training (Fig. 3).
Fig. 3. The Poincaré return maps from a total simulation time of 2.5 sec, for EPSP 5 mV (a) and 7 mV (b). I(n): interval between two subsequent spikes. Q: quantity of points with identical parameters. Note the passages between various patterns of spike trains; edited after [1].
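A return map of the kind shown in Fig. 3 can be built from a list of output spike times. The sketch below assumes that the map plots each interspike interval I(n) against the next one, I(n+1), and that Q counts how many points share the same pair; the function names and the use of integer clock steps as time units are choices made for this illustration.

```python
from collections import Counter


def interspike_intervals(spike_steps):
    """Intervals I(n) between subsequent spikes, in clock steps."""
    return [b - a for a, b in zip(spike_steps, spike_steps[1:])]


def return_map(spike_steps):
    """Count Q of each point (I(n), I(n+1)) of the return map."""
    intervals = interspike_intervals(spike_steps)
    return Counter(zip(intervals, intervals[1:]))


# Spikes at clock steps 0, 5, 10, 15 and 22 give intervals 5, 5, 5, 7.
points = return_map([0, 5, 10, 15, 22])
```

A perfectly regular spike train collapses to a single point with a large Q, a periodic pattern gives a few points visited in turn, and irregular firing scatters points over the plane.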
4 Concluding Remarks
Some neuroscientists suggest that encoding information in firing rates is a coding scheme widely used by the cerebral cortex [13]. The FC model follows this idea by elaborating mainly the mechanisms that facilitate frequency-based coding of information. Briefly, if an action potential arrives at a synapse, the resulting change of postsynaptic potential remains available to the model neuron for further computation for the duration of a typical EPSP or IPSP. For EPSP this is 15 msec (30 clock steps) in FC. The number of interspike-interval combinations depends on the number of action potentials that, given the refraction, can arrive at the same synapse during this time. If we regard a difference of one clock step in one interval as significant for further computing, we obtain 6272 combinations, ranging from a single spike to a dense burst of 8 spikes. If we regard 1 msec as a significant difference, we have 623 combinations. These figures are the same for the modeled biological neuron and for our model (see Appendix 1). Realistic modeling requires a lot of computational power. Note that Amaral & Ishizuka [14] calculated 12,000 synapses on a single rat hippocampal neuron. Representing nearly 1000 biological synapses by each FC synapse is justified by the fact that, during experiments in vivo, a single electrode for technical reasons excites multiple fibers simultaneously. The set of functions employed in the FC model should be the subject of further implementation in hardware, for which analog memories and field transistors (as processing devices) are being considered. The silicon neuron by Mahowald and Douglas [15] seems to solve the problem of action-potential generation well. A satisfactory model must react properly to the frequency and phase of action-potential spike trains. Some reported solutions, e.g. Elias & Northmore [16], seem to be a step in this direction.
References
1. Bialowas J., Grzyb B. & Poszumski P. (2006) Firing Cell: An Artificial neuron with a simulation of Long-Term-Potentiation-Related Memory, The Eleventh International Symposium of Artificial Life and Robotics, Beppu, pp. 731-734.
2. Atwood H. L. & MacKay W. A. (1989) Essentials of Neurophysiology, Toronto: B.C. Decker Inc.
3. Schmidt R. F. (1976) in R. F. Schmidt (Ed.) Fundamentals of Neurophysiology, New York, Heidelberg, Berlin: Springer-Verlag, pp. 64-92.
4. Muller D., Joly M. & Lynch G. (1988) Contributions of quisqualate and NMDA receptors to the induction and expression of LTP, Science 242, pp. 1694-1697.
5. Bliss T. V. P. & Collingridge G. L. (1993) A synaptic model of memory: long-term potentiation in the hippocampus, Nature 361, pp. 31-39.
6. Bekkers J. M. & Stevens Ch. F. (1990) Two different ways evolution makes neurons larger, In: J. Storm-Mathisen, J. Zimmer, & O. P. Ottersen (Eds.) Progress in Brain Research 83, Amsterdam-New York-Oxford: Elsevier, pp. 37-45.
7. Izhikevich E. M. (2003) Simple model of spiking neurons, IEEE Transactions on Neural Networks, 14, pp. 1569-1572.
8. Hines M. (1989) A program for simulation of nerve equations with branching geometries, International Journal of Biomedical Comput. 24, pp. 55-68.
9. Sikora M. A., Gottesman J., Miller R. F. (2005) A computational model of the ribbon synapse, Journal of Neuroscience Methods 145, pp. 47-61.
10. Traub R. D., Contreras D., Cunningham M. O., et al. (2005) Single-Column Thalamocortical Network Model Exhibiting Gamma Oscillations, Sleep Spindles, and Epileptogenic Bursts, Journal of Neurophysiology 93, pp. 2194-2232.
11. Bower J.M. & Beeman D. (2003) The Book of GENESIS: Exploring Realistic Neural Models with the GEneral NEural SImulation System, http://www.genesissim.org/GENESIS/iBoG/iBoGpdf/index.html
12. Bliss T. V. P. & Lomo T. J. (1973) Long lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path, J. Physiol. (London) 232, pp. 331-356.
13. Rolls E. T. & Treves A. (1998) Neural Networks and Brain Function, Oxford: Oxford University Press.
14. Amaral D. G. & Ishizuka N. (1990) Neurons, numbers and the hippocampal network, In: J. Storm-Mathisen, J. Zimmer & O. P. Ottersen (Eds.) Progress in Brain Research 83, Amsterdam-New York-Oxford: Elsevier, pp. 1-12.
15. Mahowald M. & Douglas R. (1991) A silicon neuron, Nature 354, pp. 515-518.
16. Elias J. G. & Northmore D. P. M. (1998) Building Silicon Nervous Systems with Dendritic Tree Neuromorphs, In: W. Maass & Ch. M. Bishop (Eds.), Pulsed Neural Networks, Cambridge, Mass.: The MIT Press, pp. 135-156.
Appendix 1
Let the EPSP last 15 msec, the refraction last 2 msec, and the significant interspike-interval difference be [S] = 0.5 msec or 1 msec. We define the times [T] as $T_{EPSP} = EPSP/S$ and $T_{Refrac} = Refraction/S$. The maximal number of action potentials possible within $T_{EPSP}$ is then $MaxQAP = 1 + (T_{EPSP} - 1) \ \mathrm{div}\ T_{Refrac}$. [CQ] denotes the number of combinations possible for a given number of action potentials [QAP], and ASum(1, i) denotes the sum of the first i terms of the arithmetic progression.

For QAP = 1: $CQ = 1$
For QAP = 2: $CQ = T_{EPSP} - T_{Refrac}$
For QAP = 3: $CQ = \mathrm{ASum}(1, N)$, with $N = T_{EPSP} - 2\,T_{Refrac}$
For QAP = 4: $CQ = \sum_{i=1}^{N} \mathrm{ASum}(1, i)$, with $N = T_{EPSP} - 3\,T_{Refrac}$
For QAP = 5: $CQ = \sum_{i=1}^{N} \sum_{j=1}^{i} \mathrm{ASum}(1, j)$, with $N = T_{EPSP} - 4\,T_{Refrac}$
For QAP = 6: $CQ = \sum_{i=1}^{N} \sum_{j=1}^{i} \sum_{k=1}^{j} \mathrm{ASum}(1, k)$, with $N = T_{EPSP} - 5\,T_{Refrac}$
For QAP = 7: $CQ = \sum_{i=1}^{N} \sum_{j=1}^{i} \sum_{k=1}^{j} \sum_{l=1}^{k} \mathrm{ASum}(1, l)$, with $N = T_{EPSP} - 6\,T_{Refrac}$
For maxQAP = 8: $CQ = \sum_{i=1}^{N} \sum_{j=1}^{i} \sum_{k=1}^{j} \sum_{l=1}^{k} \sum_{m=1}^{l} \mathrm{ASum}(1, m)$, with $N = T_{EPSP} - 7\,T_{Refrac}$

$$Max[CQ] = \sum_{QAP=1}^{maxQAP} CQ[QAP]$$
Fig. 4. Calculations for [S] = 0.5 msec and 1 msec are shown in logarithmic diagrams (horizontal axis: QAP from 1 to 8; curves: S = 0.5 and S = 1). Note the tenfold decrease of Max[CQ] for S = 1 msec as compared with S = 0.5 msec.
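The counting scheme of this appendix can be checked numerically. The sketch below is a literal reading of the formulas under two assumptions that the text does not spell out: the arithmetic progression is 1, 2, 3, ..., so that ASum(1, i) = i(i+1)/2, and the upper limit of each inner sum is the index of the sum enclosing it. The function names are chosen for this illustration.

```python
from itertools import accumulate


def asum(n):
    """Sum of the first n terms of the progression 1, 2, 3, ... (assumed)."""
    return n * (n + 1) // 2


def max_qap(t_epsp, t_refrac):
    """Maximal number of action potentials that fit into T_EPSP."""
    return 1 + (t_epsp - 1) // t_refrac


def cq(qap, t_epsp, t_refrac):
    """Number of combinations CQ for a given number of action potentials QAP."""
    if qap == 1:
        return 1
    n = t_epsp - (qap - 1) * t_refrac
    if qap == 2:
        return n
    # ASum(1, i) for i = 1..N; each further spike adds one nested summation,
    # which is a running (prefix) sum over the previous list of terms.
    terms = [asum(i) for i in range(1, n + 1)]
    for _ in range(qap - 3):
        terms = list(accumulate(terms))
    return terms[-1]


def total_combinations(t_epsp, t_refrac):
    """Max[CQ]: sum of CQ over QAP = 1 .. maxQAP."""
    return sum(cq(q, t_epsp, t_refrac)
               for q in range(1, max_qap(t_epsp, t_refrac) + 1))


# S = 0.5 msec: T_EPSP = 15 / 0.5 = 30 steps, T_Refrac = 2 / 0.5 = 4 steps.
total_half_msec = total_combinations(30, 4)
```

For S = 0.5 msec this reading gives maxQAP = 8 and a total of 6272, which agrees with the figure quoted in the Concluding Remarks. For S = 1 msec (T_EPSP = 15, T_Refrac = 2) the same literal reading yields 610 rather than the quoted 623, so the authors may have counted that case slightly differently.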
Time Variant Causality Model Applied in Brain Connectivity Network Based on Event Related Potential
Kai Yin1, Xiao-Jie Zhao1,2,*, and Li Yao1,2
1 Department of Electronics, Beijing Normal University, Beijing, China, 100088
2 State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing, China, 100088
*[email protected]
Abstract. The Granger causality model, mostly used to find interactions between different time series, is increasingly being applied to natural neural networks. The brain connectivity network, which can reveal interaction and coordination between different brain regions, is a focus of research on brain function. Synchronization and correlation are usually used to reveal the connectivity network from event-related potential (ERP) signals. However, these methods provide no further information such as the direction of the connectivity. In this paper, we present an approach that detects the direction by means of the Granger causality model. Considering the non-stationarity of ERP data, we used the traditional recursive least square (RLS) algorithm to calculate time-variant Granger causality. In particular, we extended the method for testing the significance of causality measures in order to make the results more reasonable. These approaches were applied to the classic Stroop cognitive experiment to establish the causality network related to the attention process.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 97 – 104, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction
The study of the localization of brain functions and of the coordination mechanisms among different neuronal structures is important to neuroscience. The electroencephalogram (EEG) is a spontaneous biological potential recorded as a temporal electric signal at scalp positions. The event-related potential (ERP), also called the evoked potential, is a kind of cognitive EEG. The potential evoked by a single stimulus is often so weak that it has to be enhanced by repeating the stimulus many times and averaging across the EEG trials caused by the same stimulus. In other words, the ERP signal is the average across a series of trials, each of which is related to the same stimulus. ERP directly reflects the electric activity of neuron assemblies at different scalp electrodes during a cognitive task. Traditional analysis of ERP signals usually focuses on specific ERP components at specific electrodes, and the interactions between different electrodes are at present usually analyzed by synchronization or correlation. However, these approaches reveal nothing about how different brain regions communicate with each other or what causal relations exist. Characterizing brain activity requires a causal model, by which regions and connections of interest are specified [1]. The best-known approach to causal relations, so-called Granger causality, was introduced by Wiener (1956) and formulated by Granger (1969) in the form of a linear autoregressive (AR) model [2][3]. The original concept refers to the improvement in predictability of a time series when knowledge of the past of another time series is considered. That is, if a time series X causes a time series Y, the variance of the prediction error for Y obtained from the past of both time series will be smaller than that obtained from the past of Y alone.
With Granger causality applied in neuroscience, cortical networks have come under study. Many researchers have used this concept of causality to study cortical connectivity networks based on EEG or ERP data by means of various measures of Granger causality, given by Geweke (1982) and others [4][5][6]. However, these methods require stationary time series, and in cognitive experiments EEG or ERP signals are usually not stationary. To solve this problem, Ding (2000) and Liang (2000) developed adaptive vector autoregressive models using short-time windows for multiple trials [7][8]. Moller (2001) and Hesse (2003) introduced a generalized recursive least square (RLS) approach [9][10]. Another problem is the statistical significance test that finally estimates the significance of the causality. In the above papers, the statistical significance test was performed by constructing surrogate data. Their results might have been more convincing had they considered the temporal structure within the original time series.
In the present study, with the aim of both finding the transient direction of connectivity networks and overcoming the non-stationarity of EEG time series, we used Granger causality to analyze ERP data by means of the traditional RLS algorithm. We also improved the statistical significance test by using the value of the correlation coefficient of the surrogate data. In addition, this approach was applied to ERP data recorded during the classic psychological Stroop experiment in order to investigate the causal connectivity network, since the event-related potential technique can increase the signal-to-noise ratio.
2 Method
2.1 Time Variant Granger Causality Model
Let $X = \{x(t)\}$ and $Y = \{y(t)\}$ be the time series. The univariate AR models of X and Y are:

$$x(t) = \sum_{i=1}^{p} a_1(i)\,x(t-i) + \varepsilon_1(t), \qquad y(t) = \sum_{i=1}^{p} a_2(i)\,y(t-i) + \varepsilon_2(t). \tag{1}$$

Here, the parameters $a_1(i)$ and $a_2(i)$ are the time-variant model coefficients, and $\varepsilon_1(t)$ and $\varepsilon_2(t)$ are their time-variant prediction errors. The variances of $\varepsilon_1(t)$ and $\varepsilon_2(t)$ are $\Sigma_{X|X^-}(t)$ and $\Sigma_{Y|Y^-}(t)$ respectively. The bivariate AR models of X and Y are:

$$x(t) = \sum_{i=1}^{p} a_{11}(i)\,x(t-i) + \sum_{i=1}^{p} a_{12}(i)\,y(t-i) + \eta_1(t), \qquad y(t) = \sum_{i=1}^{p} a_{21}(i)\,y(t-i) + \sum_{i=1}^{p} a_{22}(i)\,x(t-i) + \eta_2(t). \tag{2}$$

where the parameters $a_{jl}(i)$, $j, l = 1, 2$, are the time-variant model coefficients, and $\eta_1(t)$ and $\eta_2(t)$ are their time-variant prediction errors. The variances of $\eta_1(t)$ and $\eta_2(t)$ are $\Sigma_{X|X^-,Y^-}(t)$ and $\Sigma_{Y|Y^-,X^-}(t)$ respectively.
When X causes Y, the variance of the prediction error $\Sigma_{Y|Y^-,X^-}(t)$ will be less than $\Sigma_{Y|Y^-}(t)$. The measure of Granger causality from X to Y is defined as:

$$F_{X \to Y}(t) = \ln \frac{\Sigma_{Y|Y^-}(t)}{\Sigma_{Y|Y^-,X^-}(t)}. \tag{3}$$

Symmetrically, the measure of Granger causality from Y to X is defined as:

$$F_{Y \to X}(t) = \ln \frac{\Sigma_{X|X^-}(t)}{\Sigma_{X|X^-,Y^-}(t)}. \tag{4}$$
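To make the definitions in Equations (1) to (4) concrete, the following sketch estimates the two causality measures on synthetic data. It is a deliberately simplified, stationary version: the AR models are fitted once by ordinary least squares over the whole series, not by the time-variant RLS estimate used in the paper. The helper names, the small Gaussian-elimination solver, and the synthetic data (in which x drives y with a one-sample delay) are all assumptions made for this illustration.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def residual_variance(target, predictors, p):
    """Least-squares AR fit of target on p lags of each predictor series."""
    rows, ys = [], []
    for t in range(p, len(target)):
        rows.append([s[t - i] for s in predictors for i in range(1, p + 1)])
        ys.append(target[t])
    d = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    coef = solve(ata, aty)
    residuals = [y - sum(c * v for c, v in zip(coef, r)) for r, y in zip(rows, ys)]
    return sum(e * e for e in residuals) / len(residuals)


def granger(x, y, p):
    """F_{X->Y} = ln( var(Y | past Y) / var(Y | past Y, past X) ), cf. Eq. (3)."""
    return math.log(residual_variance(y, [y], p) / residual_variance(y, [y, x], p))


# Synthetic example (an assumption): x is white noise and drives y with lag 1.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(500)]
y = [0.0] + [0.8 * x[t - 1] + random.gauss(0, 0.5) for t in range(1, 500)]
f_xy = granger(x, y, p=2)   # expected to be clearly positive
f_yx = granger(y, x, p=2)   # expected to be close to zero
```

Because the bivariate model contains the univariate one, the in-sample ratio is never below one, so the measure is non-negative in this least-squares setting; a value near zero indicates that the past of the other series adds almost nothing to the prediction.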
The recursive least square (RLS) algorithm can be used to estimate the time-variant Granger causality. In this paper, we used the traditional RLS algorithm with a forgetting factor to calculate Equations (3) and (4). According to Moller (2001), the recursive computation is:

$$\Sigma(t) = (1 - \lambda)\,\Sigma(t-1) + \lambda\, z(t)^2, \tag{5}$$

with $0 < \lambda < 1$, where $z(t)$ denotes the prediction error and $\Sigma(t)$ denotes the variance of the prediction error. For the ERP data analysis, the model was fitted with $\lambda = 0.025$.
2.2 Order Determination of AR Model
To determine the optimal model order p in Equation (2), we use the Akaike information criterion (AIC), defined as:

$$AIC(i) = N \ln(\det(\Sigma_i)) + 2\,i\,L^2, \tag{6}$$

where L is the number of variables, N is the length of the data and $\Sigma_i$ is the variance of the prediction error of the i-th order model. For the ERP data, the AR models of the candidate orders were fitted by least square estimation, and the optimal order was then used in the RLS algorithm. In principle, the optimal model order is the one for which the AIC reaches its minimum. In most cases, however, the AIC curve kept decreasing as the model order increased. Fig. 1 shows an example of the AIC determination for the electrode pair P3 and CP3. The order p = 12 is a suitable choice, since there is little change beyond that value and a similar order has appeared in other studies of EEG data [6].
Fig. 1. The AIC curve of the bivariate AR model for electrode pair P3 and CP3 according to equation (6)
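The order selection can be sketched as follows. The function `aic` implements Equation (6) directly. The rule in `pick_order`, which returns the smallest order after which the AIC improves by less than a tolerance, is an assumption introduced here to mimic the reading of the flattening curve; the text does not specify a numerical criterion, and the tolerance is a hypothetical parameter.

```python
import math


def aic(n_samples, det_sigma, order, n_vars):
    """AIC(i) = N ln(det(Sigma_i)) + 2 i L^2, as in Equation (6)."""
    return n_samples * math.log(det_sigma) + 2 * order * n_vars ** 2


def pick_order(aic_values, tol):
    """aic_values[k] is the AIC of order k + 1.

    Returns the smallest order after which the AIC decreases by less than
    tol (hypothetical flattening rule), or the largest order tried.
    """
    for k in range(len(aic_values) - 1):
        if aic_values[k] - aic_values[k + 1] < tol:
            return k + 1
    return len(aic_values)


# Made-up AIC curve that flattens after order 2.
chosen = pick_order([10.0, 5.0, 4.9, 4.85], tol=1.0)
```

If the AIC does reach a true minimum within the tested range, the same rule stops there, because the next difference becomes negative.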
2.3 Statistical Test of Significance
Because the distribution of the Granger causality measures in Equations (3) and (4) is not easily established, we used surrogate data to construct an empirical distribution [11]. According to this method, one time series was shuffled without replacement to produce one surrogate series, and we then calculated the value of Granger causality between the surrogate series and the other time series. After this procedure had been carried out 100 times, we created the empirical distribution of Granger causality and set the threshold at the 5% significance level. We concluded that causality was present if the value of Equation (3) or (4) was larger than the threshold. In addition, since the brain does not remain in the same state throughout the experiment, the ERP time series must have a temporal structure. Although the shuffle procedure conserves the distributional properties of the time series, it cannot conserve the temporal structure. To avoid destroying the original temporal structure, we calculated the correlation coefficient between the surrogate data and the time series, assuming that the correlation coefficient of the surrogate data was uniformly distributed on [-1, 1]. We considered only the surrogate data whose correlation coefficient lay in [-1, -0.05] or [0.05, 1] for 5% significance.
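The surrogate procedure can be sketched as below. The code shuffles one series, keeps only surrogates whose correlation with the original series has an absolute value of at least 0.05, evaluates a causality statistic against the other series, and takes the 95th percentile of 100 accepted surrogates as the threshold. The function names, the `max_attempts` guard, and the percentile indexing are assumptions of this illustration; `statistic` stands for any causality measure, such as an estimate of Equation (3), supplied by the caller.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def surrogate_threshold(x, y, statistic, n_surrogates=100, alpha=0.05,
                        min_abs_corr=0.05, rng=None, max_attempts=10000):
    """Empirical (1 - alpha) threshold of statistic(surrogate_of_x, y)."""
    rng = rng or random.Random()
    values = []
    for _ in range(max_attempts):
        surrogate = list(x)
        rng.shuffle(surrogate)               # shuffle without replacement
        if abs(pearson(surrogate, x)) < min_abs_corr:
            continue                         # surrogate rejected by the filter
        values.append(statistic(surrogate, y))
        if len(values) == n_surrogates:
            break
    else:
        raise RuntimeError("not enough surrogates passed the correlation filter")
    values.sort()
    return values[round((1 - alpha) * n_surrogates) - 1]
```

An observed causality value is then called significant when it exceeds the returned threshold. For long series very few shuffles pass the correlation filter, which is why the sketch stops with an error after `max_attempts` tries.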
3 Experiment and Results
3.1 Cognitive Experiment and Data Acquisition
The cognitive experiment was designed as the classic psychological Stroop task. When a subject is asked to identify the display color of a word that names a color, his or her reaction is affected by the word's meaning. The stimuli were 4 Chinese characters meaning colors (red, yellow, blue and green), displayed in different colors. Subjects were asked to identify the display color of the Chinese character with a keystroke. There were six blocks in the experiment and 96 stimuli in each block, with color-meaning consistent words and color-meaning inconsistent words appearing randomly. The continuous EEG was recorded from 32 electrodes using an ESI 128 channel workshop (NeuroScan, USA), with two reference electrodes at the two mastoids (band-pass 0.05~30 Hz, sampling rate 500 Hz). The analyzed time course extended to about 1000 ms after stimulus onset, with a baseline of 200 ms before the stimulus. The ERP signals were obtained by averaging across the EEG trials associated with a correct response [12].
3.2 Data Analysis
We first analyzed the phase synchronization between all electrodes and found strong synchronization among P3, CP3, P7, T7, P4, and CP4 [13]. Fig. 2 shows the distribution of the phase synchrony index between P3 and CP3.
Fig. 2. The time-frequency representation of phase synchrony index between P3-CP3, the red color means the strong synchronization
According to the above synchronization results, we analyzed the causality among these electrodes. The time-variant Granger causality analysis from P3 to CP3 was shown in Fig. 3. It showed that the value of causality was larger than the threshold after the time 160 ms and the causality occurs mostly from 160 ms to 862 ms.
Fig. 3. Time variant Granger causality from P3 to CP3. The bold line represents the values of Granger causality and another line represents the values of threshold.
We found three connectivity networks: P3-CP3-P7, P3-P4-CP4 and T7-CP3-P7. The details of the causality are displayed in Table 1, which lists the time intervals in which causality mostly occurs for each electrode pair. Fig. 4 shows the spatial map of the causal connectivity networks on the whole scalp.
Table 1. The time intervals in which causality mostly occurs between electrode pairs
Electrode pair: P3→P7, CP3→P7, P7→CP3, P3→CP3, CP3→P3, P3→P4, P4→P3, CP4→P3, P4→CP4, CP4→P4, CP3→T7, P7→T7
Time interval (ms): 184-712, 496-598, 236-416, 160-862, 116-272, 362-684, 424-470, 570-632, 104-272, 728-836, 226-276, 368-612, 278-352, 494-574, 272-442, 460-500, 256-268, 296-414
Fig. 4. The brain map of Granger causality before 250 ms (left) and after 250 ms (right)
4 Discussion
Applying the time-variant Granger causality model to ERP data is one way of exploring the interactivity of brain function. Two aspects of our results deserve discussion. The first is related to the non-stationarity of the signal. Although the RLS algorithm for Granger causality has high time resolution for non-stationary data, there were fluctuations in the causality results (Fig. 3). For example, the causality between some electrode pairs occurred at time 0 ms, and negative values sometimes appeared. This may be for the following reasons. (1) Because the RLS algorithm iterates around the real values, the values of Granger causality may fluctuate at the beginning, but the fluctuation does not last long. (2) Signal noise is a significant influential factor; the noise can increase the value of the prediction error. The statistical test of significance is the second important issue when the statistical validity of the causality is considered. Because the statistical properties of Granger causality are unknown, in this paper the threshold is provided by the use of surrogate data. The main idea is, on the one hand, to shuffle the series and establish an empirical distribution of Granger causality for the significance test. On the other hand, to avoid destroying the temporal structure within the time series of ERP signals, we use the value of the correlation coefficient to choose the surrogate data. Surrogate data with a very small correlation coefficient are not considered.
The color-word Stroop experiment is a classic paradigm for studying the attentional network. Current related research has depended mainly on traditional methods [14][15]. In our work, there were clearly three causality networks: P3-CP3-P7, P3-P4-CP4 and T7-CP3-P7 (Fig. 4 and Table 1). The primary causality in them mostly took place around 300 ms. The time and position of the causality were consistent with those of the P300 component in this experiment [12]. In the network P3-CP3-P7, the causality from P3 to P7 and from P3 to CP3 lasted a long time and remained strong. P3 also caused P4 in the right hemisphere in the network P3-P4-CP4. This implies that the brain region around P3 mainly affected the other regions. Furthermore, there was also evident causality between the two hemispheres in the network P3-P4-CP4, and the causality from P4 to P3 occurred very early. That may be related to the attentional function shared between the two hemispheres. Further work will focus on nonlinear models and more advanced statistical tests, and the cognitive significance of the causality network needs to be discussed in more depth.
Acknowledgment. This work is supported by National Natural Science Foundation of China (60472016) and National Natural Science Foundation of Beijing (4061004) and Technology and Science Project of Ministry of Education of China (105008).
References 1. Lee, L., Harrison, L.M., Mechelli, A.: The functional brain connectivity workshop: report and commentary. Comput. Neural. Syst. 14 (2003) R1-R15 2. Wiener, N.: The theory of prediction. In: Beckenbach EF, editors. Modern Mathermatics for Engineers. New York: McGraw-Hill (1956) 3. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37 (1969) 424-438 4. Geweke, J.: Measurement of linear dependence and feedback between multiple time series. J. Am. Stat. Assoc. 77 (1982) 304-324 5. Kaminski, M.J., Ding, M., Truccolo, W.A., Bressler, S.L.: Evaluating causal relations in neural systems: Granger causality, directed transfer function and statistical assessment of significance. Biol. Cybern. 85 (2001) 145-157 6. Brovelli, A., Ding, M., Ledberg, A., Chen, Y., Nakamura, R., Bressler, S.L.: Beta oscillatory in a large-scale sensorimotor cortical network: directional influences revealed by Granger causality. Aroc. Natl. Acad. Sci. USA 101 (2004) 9849-54
7. Ding, M., Bressler, S.L., Yang, W., Liang, H.: Short-window spectral analysis of cortical event-related potentials by adaptive multivariate autoregressive modelling: data preprocessing, model validation, and variability assessment. Biol. Cybern. 83 (2000) 35-45
8. Liang, H., Ding, M., Nakamura, R., Bressler, S.L.: Causal influences in primate cerebral cortex during visual pattern discrimination. NeuroReport 11 (2000) 2875-80
9. Moller, E., Schack, B., Arnold, M., Witte, H.: Instantaneous multivariate EEG coherence analysis by means of adaptive high-dimensional autoregressive models. J. Neurosci. Methods 105 (2001) 143-58
10. Hesse, W., Moller, E., Arnold, M., Schack, B.: The use of time-variant EEG Granger causality for inspecting directed interdependencies of neural assemblies. J. Neurosci. Methods 124 (2003) 27-44
11. Theiler, J., Eubank, S., Longtin, A., Galdrikian, B., Farmer, J.: Testing for nonlinearity in time series: the method of surrogate data. Physica D 58 (1992) 77-94
12. Peng, D.L., Guo, T.M., Wei, J.H., Xiao, L.H.: An ERP Study on Processing Stages of Children's Stroop Effect. Sci. Tech. Engng. 4(2) (2004) 84-88
13. Wen, X.T., Zhao, X.J., Yao, L.: Synchrony of Basic Neuronal Network Based on Event Related EEG. In: Wang, J., Liao, X., Yi, Z. (eds.): International Symposium on Neural Network. Lecture Notes in Computer Science, Vol. 3498. Springer-Verlag, Berlin Heidelberg New York (2005) 725-730
14. Shalev, L., Algom, D.: Stroop and Garner effects in and out of Posner's beam: reconciling two conceptions of selective attention. J. Exp. Psychol. Hum. Percept. Perform. 26 (2000) 997-1017
15. Markela-Lerenc, J., Ille, N., Kaiser, S.: Prefrontal-cingulate activation during executive control: which comes first? Brain Res. Cogn. Brain Res. 18 (2004) 278-287
An Electromechanical Neural Network Robotic Model of the Human Body and Brain: Sensory-Motor Control by Reverse Engineering Biological Somatic Sensors
Alan Rosen and David B. Rosen
Machine Consciousness Inc., Redondo Beach, California, USA
[email protected], [email protected]
Abstract. This paper presents an electromechanical robotic model of the human body and brain. The model is designed to reverse engineer some biological functional aspects of the human body and brain. The functional aspects that are reverse engineered include (a) biological perception by means of “sensory monitoring” of the external world, (b) “self awareness” by means of monitoring the location and identification of all parts of the robotic body, and (c) “biological sensory motor control” by means of feedback monitoring of the internal reaction of the robotic body to external forces. The model consists of a mechanical robot body controlled by a neural network based controller.
1 Introduction
This paper presents a functional design of an electromechanical robotic model that is based on human biological functions. The reverse engineered model is shown in Figure 1. The portrayed robotic system is designed as a humanoid, volitional, multitasking robotic system that may be programmed to perform any task, from that of a mail delivery postman to that of an expert basketball player. However, the system is not a high level design that even comes close to present day state of the art standards. Caveat: our goal was to reverse engineer a biological adaptation, which requires merely a building path [1] (see footnote 1) for a humanoid robot. Thus the robotic body is a simplistic design (19th-20th century technology) of motors and sensors, with one simple torque generating motor per degree of freedom, operating on a simplified structure that has not been calculated to carry even the weight of the robot. The robotic controller, known as a Relational Robotic Controller (RRC) (see footnote 2), is a hybrid circuit made up of neural networks and microprocessor based components. The neural network portion is controlled by simplified, very basic neural network equations (vintage 1980, generated by Teuvo Kohonen [2] and Helge Ritter [3]). However, the engineering design of the controller is complete, albeit inefficient and cumbersome by present day standards. The system is unique because the RRC-robot reverse engineers the human brain and the muscles of the human body, and is designed to perform humanoid actions with its body and limbs (see Figure 1). The controller is a giant parallel processor that controls all the motors and joints of the robotic body simultaneously, with a response time of 1/30 second and with synchronization and coordination of all body parts. The pressure transducers, uniformly distributed on the robotic body, simulate the tactile sensory system that constantly monitors the peripheral surface of the body for tactile activations. In the following sections, we shall show that the RRC-robot shares the following four characteristics of the human body and brain.
1. Similar to the biological brain, the controller has within it a reflection of the external coordinate frame in which the robotic motors are operating. The perceived tactile-activation data originating in the pressure transducers in the external frame are transformed into the coordinate frame located within the RRC-controller.
2. The measure of the internal coordinates is calibrated with the measure of the 3-dimensional space in which the robot is operating.
3. The “robotic self” and the motion of the mechanical limbs of the robot with respect to the center of mass of the “robotic self” are fully defined and controlled in the internal coordinate frame as well as the external coordinate frame.
4. The robot has the capability to be trained to perform a diverse set of actions limited only by the sophistication of the neural networks in the controller and the design and the range of motion of all robotic moveable parts. For example, an RRC-model may be programmed (trained) to perform multitasking sequences that range from digging ditches to playing basketball.

Fig. 1. A reverse engineered building path of a humanoid mechanical robotic body controlled by a hybrid neural net based controller (RRC). The mechanoreceptors and nociceptors are reverse engineered by pressure transducers uniformly distributed on the robotic (skin) surface. The proprioceptors are reverse engineered by angle measuring transducers that are associated with the angular position of the shaft of each motor. The vestibular sensors are reverse engineered by circular rings on the controller (head) section of the robot. The nervous system is reverse engineered by thin wires that connect all the sensors, via cable wire bundles, to the controller (see insert). The modalities of the camera/eyes (not discussed in this paper) have been studied by Rosen and Rosen [5]. The connectivity of the system is assumed to adhere to the biological “labeled line” principle [6] (see footnote 3).

Footnote 1. The description of the robotic body adheres to Daniel Dennett’s reverse engineering requirement: “No functional analysis is complete until it has confirmed that a building path has been specified” [1].
Footnote 2. The RRC has been designed, reduced to practice and patented (Patent no. US 6,560,512B dated May 6, 2003). A more detailed description of the RRC may be viewed at the MCon site www.mcon.org [4].
Footnote 3. The “labeled line” principle [6] and the “Law of Specific Nerve Energy” [7] ensure that each type of sensor responds specifically to the appropriate form of stimulus that gives rise to a specific sensation. In the biological system the specificity of each modality is maintained in the central connections of sensory axons, so that stimulus modality is represented by the receptors, afferent axons, and central pathways that it activates. In the biological case, the labeled line principle is often used to explain the unique “conscious sensation” that each modality generates [6][7]. In this case, low level and high level activation thresholds in the pressure transducers simulate the modalities of “touch-feeling” and “tactile pain,” respectively.

I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 105 – 116, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Main Results
The main results consist of four sections, 2.1 to 2.4, showing that the RRC-robot shares the four characteristics of the human body and brain enumerated above.

2.1 Similar to the Biological Brain: The Controller Has Within It a Reflection of the External Coordinate Frame
Figure 2A illustrates the transformation of the neuronal folds in the brain into the 3-dimensional external (mirror) nodal map containing the homunculus of the robot. Validation of such transformations may be obtained by reference to most textbooks in cognitive neuroscience [8][10]. The mapping of the folds of the brain into a homunculus is shown in Figure 2A. In the reverse engineered controller, the pressure transducers located on the robotic surface (skin) are mapped onto electronic receiving neurons identified by indexed locations determined by the 3-d coordinate location of each pressure transducer. Those indexed locations form the internal nodal coordinate frame within the controller. Thus, each electronic neuron, located at one of the indexed coordinate locations, forms a portion of a neural network configured by the indexed locations of all pressure transducers.
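The mapping from pressure transducers to indexed receiving neurons can be illustrated with a minimal sketch. It is not the RRC implementation: the grid spacing, the two threshold values, and all function names are hypothetical, chosen only to show how a 3-d transducer position could determine an indexed location and how low and high activation thresholds could separate “touch-feeling” from “tactile pain”.

```python
from typing import Dict, Iterable, Tuple

Index = Tuple[int, int, int]

# All three constants are illustrative assumptions, not values from the paper.
GRID_STEP = 0.01        # spacing of indexed locations, in metres
TOUCH_THRESHOLD = 0.1   # low-level activation threshold (normalised pressure)
PAIN_THRESHOLD = 0.8    # high-level activation threshold (normalised pressure)


def to_index(position: Tuple[float, float, float], step: float = GRID_STEP) -> Index:
    """Quantise a 3-d transducer position to an indexed (integer) location."""
    return tuple(round(c / step) for c in position)


def build_nodal_map(positions: Iterable[Tuple[float, float, float]]) -> Dict[Index, int]:
    """Assign one receiving neuron (keyed by indexed location) to each transducer.

    A collision is rejected so that the map stays one-to-one.
    """
    nodal_map: Dict[Index, int] = {}
    for transducer_id, position in enumerate(positions):
        index = to_index(position)
        if index in nodal_map:
            raise ValueError(f"two transducers share indexed location {index}")
        nodal_map[index] = transducer_id
    return nodal_map


def classify(pressure: float) -> str:
    """Label an activation using the low and high thresholds."""
    if pressure >= PAIN_THRESHOLD:
        return "tactile pain"
    if pressure >= TOUCH_THRESHOLD:
        return "touch-feeling"
    return "dormant"
```

In this sketch a lookup such as `build_nodal_map(positions)[to_index(p)]` recovers which transducer sits at position `p`, which is the sense in which the indexed locations act as an internal coordinate frame.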
Fig. 2. A coordinate frame within the controller. A: Transforming the cortical folds in the brain into a 3-dimensional nodal mapping. B: A neuronal world-mapping: an indexed coordinate frame within the brain. The positions of flailing limbs are also shown. [Figure labels: Motor Cortex; Somatosensory Cortex; Mirror Nodal Map; Somatotopic Mapping; Location of brain neurons that determine the self; Location of brain neurons that define the near space (grid); Center of internal coordinate frame; Center of external coordinate frame.]
2.2 The Measure of the Internal Coordinates Is Calibrated with the Measure of the 3-Dimensional Space in Which the Robot Is Operating
Figure 2A shows the indexed locations of the electronic neurons (part of the neural network) that define the robotic self. The primary constraint imposed on the design of a world map-coordinate frame is that the topographic ordering of neural network neurons within the controller forms a one-to-one correspondence with the external world space that defines the boundaries (skin surface) of the robot.

2.3 The “Robotic Self” Is Fully Defined in the Controller
Figure 2B shows that the “robotic self” is fully defined in an indexed coordinate frame within the controller. The “near space” around the robotic self is defined by flailing limbs. Regions of the near space unoccupied by flailing limbs are defined by dormant receiving neurons. For example, the positions of the robotic fingers in the near space are determined by the angle measuring transducers located on the shaft of each motor. A configured neural network is located within the controller, with individual neurons of the neural network located at indexed locations that form a map of the robotic body (the configuration of neurons may be like the folds in the brain or like the 3-d homunculus shown in the figure). The configured neural network is that part of the RRC that reverse engineers the connectivity of the biological brain.

2.4 The Robot Has the Capability to Be Trained to Perform a Diverse Set of Actions
This section is divided into 5 parts: part 2.4.1, the flow through the configured neural network (neurons located at indexed locations) and the RRC; part 2.4.2, the block diagram of the RRC; part 2.4.3, training a Nodal Map Module with robotic self knowledge; part 2.4.4, the solution to the neural net equations (for the neural network portion of the RRC) associated with a single joint Nodal Map Module; and part 2.4.5, training a volitional multitasking robot.

2.4.1 The Flow Through the Configured Neural Network (Each Neuron Located at an Indexed Location) and the RRC
In this paper, the RRC-robot is trained to perform itch-scratch type actions. The indexed location of an end joint used for scratching, such as a robotic finger, is called q-initial. The itch-point, possibly the indexed location of a pressure transducer, is labeled q-final. Figure 3 shows the flow of q-signals emanating from the pressure transducers, and the control pulse signals, p-signals, that are transmitted from the controller to the motors. The itch-scratch trajectory from q-initial to q-final is shown in Figure 3 as a sequence of control signals, p0, p1, p2. A dedicated Nodal Map Module is associated with each robotic joint, and the indexed location of the end-joint, q-initial, is always recorded in the dedicated Nodal Map Module.
Fig. 3. A flow diagram of the q-vector and p-vector through a configured neural network and thence to the RRC that simulates the functionality of the human brain. The output of the Nodal Map Module goes to the external mirror nodal map via the Sequence Stepper and the Controlsignal Output Module.
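The idea of an itch-scratch trajectory as a sequence of small steps between indexed locations can be sketched as follows. This is only an illustration under stated assumptions: the paper does not specify the search procedure, so a plain breadth-first search on an integer 3-d grid is used here as a stand-in, and the names `plan_trajectory`, `nodal_transitions`, `occupied` and `bounds` are hypothetical.

```python
from collections import deque
from typing import List, Optional, Set, Tuple

Node = Tuple[int, int, int]

# The six unit steps between neighbouring indexed locations (an assumption:
# the paper does not say which neighbourhood a nodal transition may use).
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def plan_trajectory(q_initial: Node, q_final: Node,
                    occupied: Set[Node], bounds: Node) -> Optional[List[Node]]:
    """Breadth-first search from q_initial to q_final on an indexed grid.

    Returns the visited nodes in order, or None if q_final is unreachable.
    Nodes in `occupied` are treated as obstacles and are never entered.
    """
    parents = {q_initial: None}
    queue = deque([q_initial])
    while queue:
        node = queue.popleft()
        if node == q_final:
            path = []
            while node is not None:      # walk back to q_initial
                path.append(node)
                node = parents[node]
            return path[::-1]
        for step in STEPS:
            nxt = tuple(n + s for n, s in zip(node, step))
            if nxt in parents or nxt in occupied:
                continue
            if not all(0 <= c < b for c, b in zip(nxt, bounds)):
                continue
            parents[nxt] = node
            queue.append(nxt)
    return None


def nodal_transitions(path: List[Node]) -> List[Node]:
    """Differences between consecutive nodes, playing the role of p0, p1, p2, ..."""
    return [tuple(b - a for a, b in zip(p, q)) for p, q in zip(path, path[1:])]
```

Because each element returned by `nodal_transitions` is a single unit step, executing one element per frame period would correspond to one nodal transition per frame.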
To satisfy the volitional constraint (motion must be pre-planned and goal directed), the trajectory of motion of any end-joint is divided into small nodal transitions. The total trajectory is a sequence of nodal p-signals and q-initial positions between the first node and the final node, q-final, of the trajectory. Only the first of a pre-planned sequence of nodes may be activated during any frame period. Thus, the maximum speed of operation is one nodal transition per frame period (with all joints (q-initial nodes) activated simultaneously).

2.4.2 Block Diagram of the RRC (Hybrid Circuit)
The RRC is a hybrid circuit made up of a set of microprocessor based modules, programmed by sequential algorithmic programming (taking up approximately half the physical space of the controller), and a set of neural network modules that take up the remaining physical space within the controller. A microprocessor based module is dedicated to each joint of the robotic body (21 joints require 21 modules). The q-initial motion of the end-joint is controlled in each module during each frame period. All the programming of the Nodal Map Modules, Task Selector Modules, Sequence Stepper Modules, and Control-signal Output Modules is based on indexed locations in the 3-d space, determined by the programming/training of the configured neural networks shown as the top half of the physical space within the controller (see Figure 4). A Nodal Map Module associated with each joint is made up of indexed locations covering the range of motion of the end-joint. Twenty one Nodal Map Modules are required to control all the joints of the robot shown in Figure 1. Figure 1 shows the 21 joints and the motors present at each joint (a total of 39 motors with one p-signal per motor). Thus, given a q-initial position located at an indexed location of a Nodal Map Module, the Task Selector Module generates a q-final location. The Sequence Stepper Module is activated by q-final to search the region between q-initial and q-final and generate an obstacle avoiding p-q sequence that represents the pre-planned trajectory between q-initial and q-final. The Control-signal Output Module may then (conditionally) transmit all 39 p-signals to all the motors in order to generate the first nodal transition of the pre-planned sequence of p-signals (that control motion from q-initial to q-final).

Fig. 4. Hierarchical array of RRCs. All the Nodal Map Modules, Task Selector Modules, Sequence Stepper Modules and Control-signal Output Modules (associated with each joint in the body) operate simultaneously during each frame period.

2.4.3 Training a Nodal Map Module: Robotic Self Knowledge
A block diagram explaining the training procedure is shown in Figure 5. Two paths are shown in the figure, a training path and an operational path. Training is performed on all twenty one Nodal Map Modules simultaneously. The itch-scratch trajectory is used repeatedly to train the robot with a “self identification and location” form of knowledge. This form of knowledge is also called “robotic self-knowledge”. Robotic self knowledge is implemented by training the robot to identify and locate any and every body part of the robot by means of the itch-scratch trajectory of motion. The training consists of first teaching the Nodal Map Module associated with the end-joint of the robotic finger to scratch all possible itch points that can be reached by the end-joint, and then training the remaining twenty Nodal Map Modules to scratch, with the aid of each associated end-joint, all possible itch points. In the training path of the end-joint Nodal Map Module, the set of p-signals (39 p-signals, one to each motor of the robot) are trained repeatedly until the displacement error CFI 0) is defined in consideration of the activity of receptors that underlies STDP.
In this study, we employ the semi-nearest-neighbor manner for pairing spikes [8]. That is, for each presynaptic spike, we consider only one preceding postsynaptic spike and ignore all earlier spikes. All postsynaptic spikes subsequent to the presynaptic spike are also considered. For each pre-/postsynaptic spike pair, the synaptic weight is updated as follows:

\[ \Delta w_{ij} = 0.81\left(1 - C_{STDP}\,(0.12\,\Delta t_{ij})^{2}\right) e^{-(0.12\,\Delta t_{ij})^{2}/2} + 1, \tag{9} \]
\[ w_{ij}(t+\Delta t) = m_i\,\Delta w_{ij}\,w_{ij}(t), \tag{10} \]
\[ m_i = \frac{C}{\sum_{j}^{n} \Delta w_{ij}\,w_{ij}(t)}, \tag{11} \]

where \(C_{STDP}\) denotes the constant of STDP. When a synapse between the ith postsynaptic neuron and the jth presynaptic one is updated by asymmetric profile STDP, the constant is defined by

\[ C_{STDP} = \begin{cases} 0.01 & \Delta t_{ij} \ge 0 \\ 0.65 & \Delta t_{ij} < 0. \end{cases} \tag{12} \]
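As a rough illustration, Eqs. (9)-(12) can be written as a short Python sketch. Several points are assumptions rather than statements from the paper: the exponent is read as \(-(0.12\,\Delta t_{ij})^2/2\), every incoming synapse of neuron i is given one spike-time difference and updated in the same call, and the function and argument names are hypothetical. The value 0.65 used for the symmetric profile is the one stated in the text.

```python
import math
from typing import List


def c_stdp(dt: float, profile: str) -> float:
    """STDP constant: Eq. (12) for the asymmetric profile, 0.65 for the symmetric one."""
    if profile == "symmetric":
        return 0.65
    if profile == "asymmetric":
        return 0.01 if dt >= 0 else 0.65
    raise ValueError("profile must be 'symmetric' or 'asymmetric'")


def delta_w(dt: float, profile: str) -> float:
    """Multiplicative weight change of Eq. (9) for one spike-time difference dt."""
    x = 0.12 * dt
    return 0.81 * (1.0 - c_stdp(dt, profile) * x * x) * math.exp(-x * x / 2.0) + 1.0


def update_weights(weights: List[float], dts: List[float],
                   profile: str, c_norm: float) -> List[float]:
    """Apply Eqs. (10) and (11) to all incoming synapses of one neuron.

    weights[j] is w_ij(t), dts[j] is the spike-time difference for synapse j,
    and c_norm is the normalising constant C.
    """
    scaled = [delta_w(dt, profile) * w for w, dt in zip(weights, dts)]
    m_i = c_norm / sum(scaled)           # Eq. (11)
    return [m_i * s for s in scaled]     # Eq. (10)
```

With this normalisation the updated weights always sum to `c_norm`, which is how the coefficient \(m_i\) keeps the total synaptic weight of a neuron constant.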
When a synapse is updated by symmetric profile STDP, the constant is set to 0.65. C is a normalizing constant which depends on the location of the neuron in the CA3. Owing to the coefficient m_i, the sum of synaptic weights is conserved.

4.3 Structure of Hippocampal CA3 Model
The proposed hippocampal CA3 model is composed of N spiking neuron models, and each neuron has RC. A CA3c neuron receives inputs from CA3b neurons. As shown in Fig. 2, we draw a 45-degree line from a CA3c neuron (destination neuron) toward the upper right. Then we choose the neuron where the line crosses the center of CA3b as a center neuron. We consider that the destination neuron receives inputs from the surrounding neurons (source neurons) which are within H of the center neuron. These connections are modified by asymmetric profile STDP. In a similar manner, a CA3b neuron has connections from CA3a neurons. Moreover, it also receives inputs from its surrounding neurons in CA3b. Similarly, a CA3a neuron receives inputs from its surroundings and from CA3b neurons. The connections of CA3a and CA3b are changed by symmetric profile STDP.
Fig. 2. Formation of RC in the proposed model (cell: neuron, dark gray cell: destination neuron, light gray cell: source neuron, arrow: connection between source neurons and destination neuron)
5 Computer Simulations
5.1 Conditions
First of all, we set the parameters as shown in Table 1 and constructed the proposed model from 245 spiking neuron models. Next, we defined 8 fixed patterns (A–H) and random patterns, which compose sequences. Each fixed pattern was represented by the activation of 48 neurons and there was no overlap between them. The random pattern was represented by the activation of 30 randomly selected neurons. Finally, we defined two overlapping sequences (sequences I and II). Sequence I was A → B → C → D → E → F → G; sequence II was A → B → C → E → D → F → H.

Table 1. Parameters for the simulation

Parameter               Value     Parameter               Value
N                       245       H                       5
δ_CA3b→CA3c             8         η_CA3b→CA3c             15
δ_CA3b→CA3b             7         η_CA3b→CA3b             -3
δ_CA3a→CA3b             8         η_CA3a→CA3b             -2
δ_CA3a→CA3a             7         η_CA3a→CA3a             -2
δ_CA3b→CA3a             13        η_CA3b→CA3a             4
τ_m^CA3b, τ_m^CA3a      20        τ_m^CA3c                5
θ                       0.047     V_init                  0.0
σ                       2         τ_s                     2.5
τ_inh                   2         R_firing^RC             0.5
R_firing^ext            0.0001    f                       12.5
C_CA3b, C_CA3a          1.6       C_CA3c                  1.0
ι                       30
The hippocampus receives inputs from the cortex via EC. Moreover, theta phase precession emerges from EC; it adjusts the inputs to the hippocampus and contributes to the memorization of sequences [9]. Theta phase precession is the phenomenon in which the phase of neuronal firing gradually advances with every cycle of the theta wave observed in the hippocampus. In particular, a neuron always begins to fire at a certain phase, and its firing phase advances to an earlier phase in the next cycle. When the total phase advance of the neuron reaches about one cycle of the theta wave, the firing of the neuron ceases. In this simulation, we applied a sequence to the proposed model on the assumption of this precession (Fig. 3). Specifically, we defined one cycle of the theta wave as 40 unit times. A new pattern was applied at the latest phase of each cycle, and the pattern advanced by 10 unit times every cycle. As shown in Fig. 3, the first pattern of a sequence was applied to the model at the latest phase of the first cycle. Next, in the second cycle, the pattern was applied with an advance of 10 unit times. Continuing in this manner, in the 5th cycle the total phase advance exceeded one cycle of the theta wave, and the input of the pattern disappeared. Sequences were applied in this way, and 10 cycles were needed until the last pattern of a sequence was extinguished. This means that a sequence was applied over 10 cycles in this simulation. Moreover, we defined a nonfirable period of 40 unit times between cycles in order to separate the cycles. Each sequence was applied to the model twice. Then the model memorized each sequence (Memorization Period). The connections from DG to CA3 contribute to memorization [10]; hence we supposed that the two sequences are applied to the model from DG. The sequences were input to all parts of the CA3. Following the memorization, we applied a part of each sequence, A–F, to the model (Retrieval Period). Then, we confirmed that the model could output a distinction between pattern F of Sequence I and pattern F of Sequence II by using the difference in their input order. For the retrieval period, we hypothesized that the partial sequences are applied to the model from EC, because the connections from EC to the CA3 are required for retrieval [10]. Note that EC neurons are connected only to CA3a and CA3b. Therefore, we applied the patterns which compose each sequence to CA3a and CA3b.
             early ←→ late
1st cycle    ∗ → ∗ → ∗ → A
2nd cycle    ∗ → ∗ → A → B
3rd cycle    ∗ → A → B → C
4th cycle    A → B → C → D
5th cycle    B → C → D → E
6th cycle    C → D → E → F
7th cycle    D → E → F → G
8th cycle    E → F → G → ∗
9th cycle    F → G → ∗ → ∗
10th cycle   G → ∗ → ∗ → ∗

Fig. 3. Presentation manner in the case of Sequence I (letter: pattern, →: transition, ∗: random pattern)
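The presentation manner of Fig. 3 can be reproduced with a short sketch. It assumes, following the description of the simulation, that a theta cycle of 40 unit times is split into four phase slots of 10 unit times, that a new pattern enters at the latest slot, and that every pattern moves one slot earlier per cycle; the function name `presentation_schedule` and the filler symbol are hypothetical.

```python
from typing import List, Sequence

# 40 unit times per theta cycle divided by a phase advance of 10 unit times.
SLOTS_PER_CYCLE = 4


def presentation_schedule(sequence: Sequence[str], filler: str = "*") -> List[List[str]]:
    """Return one row per theta cycle, ordered from the earliest to the latest phase.

    Slots not occupied by a pattern of the sequence are marked with `filler`,
    standing for a random pattern.
    """
    n_cycles = len(sequence) + SLOTS_PER_CYCLE - 1
    schedule = []
    for cycle in range(n_cycles):
        row = []
        for slot in range(SLOTS_PER_CYCLE):          # slot 0 is the earliest phase
            k = cycle + slot - (SLOTS_PER_CYCLE - 1)  # index of the pattern in this slot
            row.append(sequence[k] if 0 <= k < len(sequence) else filler)
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for number, row in enumerate(presentation_schedule("ABCDEFG"), start=1):
        print(f"cycle {number:2d}: " + " -> ".join(row))
```

For a sequence of seven patterns the sketch yields 7 + 4 - 1 = 10 cycles, matching the number of cycles needed before the last pattern disappears.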
5.2 Retrieval Period
During this period, the part of the sequence (A–F) was applied to part of the model, CA3b and CA3a, because of the restriction of the connections from EC. Since CA3a is closer to EC than CA3b, CA3a received inputs from EC earlier than CA3b in this simulation. This time lag was set to 1 unit time. Figures 4(a) and 4(b) show the similarity between a certain pattern and the output of each subregion in the 6th cycle of each sequence. The similarity is the proportion of a pattern that is included in the output of each subregion. At the beginning of the cycle, pattern C was input to CA3b and CA3a. After that, the next patterns were applied to them every 10 unit times in the order of each sequence. Although each pattern was applied only once in the cycle, as shown in these figures, the patterns showed periodic activation in CA3b and CA3a. The periodic activation arose from the autoassociative connections between CA3b and CA3a. Then, we compared the similarity of the CA3c output in the two sequences. As shown in Fig. 4(a), when pattern F of sequence I was applied to the model, CA3c output pattern G, whereas CA3c output pattern H when pattern F of sequence II was applied (Fig. 4(b)). This means that the proposed model generated different activities according to the differences between the sequences, in spite of the same pattern F.
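The similarity measure used above can be written down directly. The sketch below assumes that a pattern and a subregion output are both given as collections of active neuron indices; the function name is hypothetical.

```python
from typing import Iterable


def similarity(pattern: Iterable[int], output: Iterable[int]) -> float:
    """Fraction of the neurons of `pattern` that are active in `output`."""
    pattern_set = set(pattern)
    if not pattern_set:
        raise ValueError("pattern must contain at least one neuron")
    return len(pattern_set & set(output)) / len(pattern_set)
```

Under this definition a value of 1.0 means the whole pattern is present in the output, regardless of how many additional neurons are firing.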
Fig. 4. The output of the proposed model for retrieving sequences
6 Conclusions
In this paper, we focused on the location dependencies derived from very detailed anatomical findings on the CA3 and physiological findings on STDP, and suggested that the CA3 is functionally divided into two functions: heteroassociative memory and autoassociative memory. Moreover, we showed in computer simulation that the functionally divided CA3 could generate a code for sequence disambiguation. In the proposed model, previously input patterns were periodically retrieved in CA3a and CA3b by autoassociative memory. Thus, the model could buffer the differences between sequences. Then the information in the buffer was transmitted to CA3c through heteroassociative connections which reflected the order of inputs. Therefore, the order difference in the buffer caused the difference in the pattern retrieved in CA3c; in other words, a code which dissociates the same pattern in different sequences was generated by the functionally divided hippocampal CA3 model according to the difference in previous inputs. The hippocampal CA1 region receives inputs from the CA3 region, and Yoshida et al. suggested that CA1 selectivity for the inputs is tuned by CA3 outputs in computer simulations [11]. Thus, the CA1 could reflect the CA3 code for sequence disambiguation. Consequently, we suggest that sequence disambiguation can be realized between the CA3 and the hippocampal CA1 region.
References
1. Eichenbaum, H., Dudchenko, P., Wood, E., Shapiro, M., Tanila, H.: The Hippocampus, Memory, and Place Cells: Is It Spatial Memory or a Memory Space? Neuron 23 (1999) 209–226
2. Ishizuka, N., Weber, J., Amaral, D.G.: Organization of Intrahippocampal Projections Originating From CA3 Pyramidal Cells in the Rat. The Journal of Comparative Neurology 295 (1990) 580–623
3. Ishizuka, N., Maxwell, W., Amaral, D.G.: A Quantitative Analysis of the Dendritic Organization of Pyramidal Cells in the Rat Hippocampus. The Journal of Comparative Neurology 362 (1995) 17–45
4. Debanne, D., Gähwiler, B.H., Thompson, S.M.: Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. Journal of Physiology 507 (1998) 237–247
5. Tsukada, M., Aihara, T., Kobayashi, Y., Shimazaki, H.: Spatial Analysis of Spike-Timing-Dependent LTP and LTD in the CA1 Area of Hippocampal Slices Using Optical Imaging. Hippocampus 15 (2005) 104–109
6. Gulyás, A.I., Miles, R., Hájos, N., Freund, T.F.: Precision and Variability in Postsynaptic Target Selection of Inhibitory Cells in the Hippocampal CA3 Region. European Journal of Neuroscience 5 (1993) 1729–1751
7. August, D.A., Levy, W.B.: Temporal Sequence Compression by an Integrate-and-Fire Model of Hippocampal Area CA3. Journal of Computational Neuroscience 6 (1999) 71–90
8. Izhikevich, E.M., Desai, N.S.: Relating STDP to BCM. Neural Computation 15 (2003) 1511–1523
9. Yamaguchi, Y.: Theta phase coding and memory in the hippocampus. SEITAI NO KAGAKU 55 (2002) 33–42
10. Treves, A., Rolls, E.T.: Computational Constraints Suggest the Need for Two Distinct Input Systems to the Hippocampal CA3 Network. Hippocampus 2 (1992) 189–200
11. Yoshida, M., Hayashi, H.: Encoding Temporal Sequences by Spatiotemporal Pattern of Activity. Technical Report of IEICE NLP2003-112 (2003) 7–12
A Time-Dependent Model of Information Capacity of Visual Attention Xiaodi Hou and Liqing Zhang Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China [email protected], [email protected]
Abstract. What does a human’s eye tell a human’s brain? In this paper, we analyze the information capacity of visual attention. Our hypothesis is that the limit of perceptible spatial frequency is related to observing time. Given more time, one can obtain higher resolution, that is, higher spatial frequency information, from the presented visual stimuli. We designed an experiment to simulate natural viewing conditions, in which time dependent characteristics of attention can be evoked, and we recorded the temporal responses of 6 subjects. Based on the experimental results, we propose a person-independent model that characterizes the behavior of the eyes, relating visual spatial resolution to the duration of attentional concentration. This model suggests that the information capacity of visual attention is time-dependent.
1 Introduction
How much information is gained through one glimpse? There have been many attempts to answer this question [1]. To describe the model from an information perspective, we consider the human visual perception pathway as an information channel. Any visual information whose spatial frequency is higher than the capacity of one’s perception cannot be transmitted through this channel. From this point of view, one can assert that what we “see” is the information that passes the band-limiting filter of the visual channel [2].

1.1 Attention
Treisman and her colleagues in 1977 [3] classified the visual perception process into two categories, the pre-attentive process and the attentive process. Generally speaking, the pre-attentive process is a parallel mechanism with coarse resolution and simple feature analysis. The attentive process, on the other hand, is a serial process, much slower but with higher resolution. In tasks that require careful discrimination, our perception capacity is subject to attention. An effective description of the behavior of attention is the “Zoom Lens” theory [4]. This theory proposes that the size of the attentional focus can be adjusted to meet the requirements of successful perception. Recent research has even demonstrated a physiological correlate of the “Zoom Lens” model [5]. In the “zoom lens” model, only two factors determine the information capacity of attention: the area that attention covers, and its spatial resolution.

I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 127–136, 2006. © Springer-Verlag Berlin Heidelberg 2006

1.2 Information Capacity of Attention
Previous studies indicated that the information capacity of attention is almost constant under uni-scaled visual stimuli [6]. However, when objects of different sizes are contained in one stimulus, the performance of attention varies [7]. To be specific, for finer patterns, a longer time is required to concentrate one’s attention. It is easy for us to read several words of a newspaper headline in a short glimpse, but with the same observing time it is hard to see even one single letter of the body text, which has a much higher spatial frequency than the headline. At an empirical level, this inequality in information capacity can be explained by a zooming mechanism: with finer resolution, the observer needs more time to tune his/her attention. In this paper, we aim at constructing a general model to quantify the information capacity of attention. More specifically, we try to (1) analyze the zooming mechanism of attention with respect to time and (2) develop a quantitative formula that relates the viewing time duration to the resolution of attention. The results of our experiment show a clear dependency between response time and the spatial resolution of attention. We propose a model that describes the response time of attention under stimuli of different spatial frequencies, and discuss our model in the context of cognitive science. The introduction of temporal characteristics of attention enriches our understanding of human vision, and may also advance current quantitative models. Currently, there is scant literature pertaining to the time-varying performance of our visual system. With the results of our experiments, it is possible for a researcher to deduce the frequency of incoming information given the observation time. Conversely, given the resolution of a stimulus, researchers may also predict the shortest viewing time that permits reliable perception.
2 The Experiment
The goal of the experiment is to record the exact time duration required for successful attentional perception. In our experiment, we recorded the time duration of a counting task. In a counting task, a subject is told to count several identical items that are placed side by side. Normally, the subject has to shift his/her attention continuously and sequentially, like scanning. The serial counting task has a major advantage: it has a clear boundary over time, which opens opportunities for quantitative analysis.
The stimuli in our experiment are strings of identical numeral characters, such as 0000 or 9999999. The length of each string is randomly chosen from 4 to 8. Numerals are also chosen at random, so that the performance of a subject is not hampered by particular features of certain numerals. The spatial frequency is tuned by using different font sizes for the string. Quantifying the response time for smaller stimuli is neither possible in the experiment nor valuable in applications. Since many researchers have suggested that, in a counting task, the smallest interval should not be smaller than 5 arcmin [10], which corresponds to 5px under the conditions of our experiment, the possible sizes of a character in our experiment are 8px, 10px, 15px, 20px, 30px and 50px. The font sizes and the viewing distance are deliberately chosen so that there is no risk of over-pixelization [8], even at the smallest scale of 8px.
2.1 Evoking Concentration of Attention
The key to the experiment is to make the attention concentrate anew in each trial. If the experimental task instead became a continuous process without interruption, a subject would benefit from his/her previous attentional state, and his/her performance would relate only weakly to the scale parameters [6]. In other words, in order to quantify attention, we have to divert it from the state of being tuned to a particular position or a particular scale. Thus, we give each string a random position of appearance, rather than making it pop up at the center of the monitor. In the experiment, we also set an anchor point, which is located beside the screen. The subject is told to fix his/her attention on the anchor point just before and immediately after a counting trial. Naturally, when a new trial starts, the subject releases his/her attention from the anchor point, “zooms out” to search for the string, and re-concentrates his/her attention. In addition to the spatial disparity between the anchor point and the stimuli, a difference in depth also requires the subject to shift the optical focal plane, which further shuffles the subject’s attention.
2.2 Experiment Configurations
Subjects and Environments: Six subjects (all of whom are college students with normal or corrected-to-normal vision) took part in the experiment; each was exposed to approximately 30 min of stimuli. Five of them were naïve to the purpose of the experiments. Subjects were instructed to “count the number of characters as fast as possible, and, when finished, look at the anchor point.” All visual stimuli were displayed on a calibrated 19-inch LCD monitor with a viewable size of 376mm × 301mm and a resolution of 1280px × 1024px. The distance from the monitor to the subject is 1m. The anchor point is located 1.3m away from the subject, to the left of the screen. When concentrating attention on the anchor point, a subject moved his/her attention leftwards without turning the head, so that the ocular muscle activities could be recorded by our apparatus.
Data Recording: We used the NeuroScan system to collect the electro-oculogram (EOG) at a sampling rate of 100Hz. EOG has been proved effective in tracking eye movements [9]. Our system has a temporal resolution high enough to distinguish whether a subject is looking at the anchor point or at the stimuli. Fig. 1 shows the recorded data from 3 of our subjects.
Fig. 1. Pieces of horizontal EOG data (three panels; horizontal axis: Time (ms); vertical axis: Potential (μV)). High potential corresponds to the activation of the ocular muscle that makes the eye move rightwards. Although the potential is vulnerable to electrical activities of other muscles, the steep rises and falls of the curve are obvious. It is therefore easy to give a qualitative interpretation of the response time of attention.
3 Data Analysis
Since the viewing angle between the screen and the anchor point is 20°, shifting attention between the stimuli and the anchor point results in marked rises and falls in the horizontal EOG signal. The rising edge of an EOG signal corresponds to the arrival of the gaze on the screen from the anchor point, while the falling edge corresponds to the departure of the gaze from the screen. Thus the “counting time” in each trial is the duration of the square wave. In processing, we trimmed the periods overwhelmed by electrical activities of other muscles. In the end, 1462 identifiable trials were included in our data set. More generally, we prefer to interpret the data in the frequency domain. Measured in c/deg, the frequency f corresponding to a particular font size s is given by $f = 60/s$. The conversion from the size domain to the frequency domain is shown in Figure 2.
Fig. 2. A comparison of size domain and frequency domain representations
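The size-to-frequency conversion and the reading of counting times from the EOG square wave can be sketched as follows. This is a minimal illustration and not the authors' processing pipeline: the function names, the threshold value, and the synthetic trace are assumptions; only the relation f = 60/s, the font sizes, and the 100Hz sampling rate come from the text.

```python
FONT_SIZES_PX = [8, 10, 15, 20, 30, 50]   # character sizes used in the experiment
SAMPLING_RATE_HZ = 100                    # EOG sampling rate reported in the text


def spatial_frequency(size_px):
    """Convert a font size s to a frequency in c/deg using f = 60 / s."""
    return 60.0 / size_px


def counting_times_ms(eog, threshold, rate_hz=SAMPLING_RATE_HZ):
    """Return the duration (ms) of every interval where eog > threshold.

    A rising edge marks the arrival of the gaze on the screen and the
    following falling edge marks its departure. Simple thresholding is an
    assumption made for this sketch.
    """
    durations = []
    start = None
    for i, value in enumerate(eog):
        above = value > threshold
        if above and start is None:
            start = i                                          # rising edge
        elif not above and start is not None:
            durations.append((i - start) * 1000.0 / rate_hz)   # falling edge
            start = None
    return durations


frequencies = [spatial_frequency(s) for s in FONT_SIZES_PX]

# Synthetic trace: 3 samples at the anchor, 5 on the screen, 2 at the anchor.
trace = [0, 0, 0, 900, 900, 900, 900, 900, 0, 0]
durations = counting_times_ms(trace, threshold=450)
```

With real recordings, the periods contaminated by other muscle activity would have to be excluded first, as described above.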
3.1 Two Stages of Response Time
We divide the behavior of attention into two stages: localizing and counting. The localizing stage lasts from the departure of attention from the anchor point to the moment when the subject detects the first numeral of the string. Since this process is determined by the spatial frequency of the stimuli only, we denote its duration as L(f). The second part of the response time is the “counting” part, whose duration we denote as C(f, n). One simplification can be made by assuming that C(f, n) is a linear function of n, that is, C(f, n) = n · C(f). This linearity has been discussed in previous studies [10]. In sum, given the frequency f and the string length n of a trial, the response time T(f, n) is

$$T(f, n) = L(f) + (n - k) \cdot C(f), \qquad (1)$$

in which n − k denotes the number of jumps from one character to its neighbor. It is possible for a subject to “apprehend” the length of a string from its shape without counting [10]. In that case, the actual number of counting steps may be less than the length of the string. For example, a subject may start counting from the third or the fourth numeral, or may grasp the number of the last two or three numerals at once, so that the trial is finished early. The variable k is introduced to take such “apprehensive counting” into consideration.
3.2 Normalization Invariance
Although different subjects behaved differently in our experiment, the normalized responses along the frequency axis all led to an identical shape. We define the normalization as

$$N(f, n) = \frac{T(f, n)}{\max_i T(f_i, n)}.$$

Figure 3 shows the normalized response times: the response time varies with the spatial frequency of the stimuli in a way that does not depend on the string length. We call this property the normalization invariance.
Fig. 3. Normalized data of different subjects (six panels: Subject 1 to Subject 6)
This invariance can be defined as follows:

$$\frac{T(f_i, n_i)}{T(f_j, n_i)} = \frac{T(f_i, n_j)}{T(f_j, n_j)}. \qquad (2)$$

Substituting T(f, n) from Equation 1, we obtain

$$L(f_i) \cdot C(f_j) = L(f_j) \cdot C(f_i), \qquad (3)$$

or

$$\frac{L(f_i)}{L(f_j)} = \frac{C(f_i)}{C(f_j)}. \qquad (4)$$

3.3 A Computational Model of Attention
We adopt the exponential function to describe L(f) and C(f) as follows:

$$L(f) = c_1 \cdot c_2^{f}, \qquad (5)$$

$$C(f) = c_3 \cdot c_2^{f}, \qquad (6)$$

where f is the spatial frequency of the stimuli, and $c_1$, $c_2$, $c_3$ are parameters that characterize the behavior of a particular person. L(f) and C(f) satisfy the normalization invariance of Equation 4, since

$$L(f_1) \cdot C(f_2) = c_1 c_2^{f_1} \cdot c_3 c_2^{f_2} = c_1 c_3 \, c_2^{f_1 + f_2} = L(f_2) \cdot C(f_1).$$
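The model of Equations 1, 5 and 6 can be evaluated numerically with a few lines of code. The sketch below is only an illustration: the function names are assumptions, the parameter values are those reported for the “All” data set in Table 1, and k = 4 is the choice discussed later in this section. It also checks the normalization invariance of Equation 2 for one pair of frequencies and string lengths.

```python
def localizing_time(f, c1, c2):
    """L(f) = c1 * c2**f (Equation 5)."""
    return c1 * c2 ** f


def counting_time(f, c2, c3):
    """C(f) = c3 * c2**f (Equation 6)."""
    return c3 * c2 ** f


def response_time(f, n, c1, c2, c3, k=4):
    """T(f, n) = L(f) + (n - k) * C(f) (Equation 1)."""
    return localizing_time(f, c1, c2) + (n - k) * counting_time(f, c2, c3)


# Parameters of the "All" data set in Table 1.
c1, c2, c3 = 112.09, 1.0633, 33.57

f_i, f_j = 7.5, 2.0   # c/deg, i.e. 8px and 30px characters (f = 60 / s)
n_i, n_j = 5, 8       # two string lengths

# Normalization invariance (Equation 2): the ratio of response times at two
# frequencies is the same for every string length.
ratio_ni = response_time(f_i, n_i, c1, c2, c3) / response_time(f_j, n_i, c1, c2, c3)
ratio_nj = response_time(f_i, n_j, c1, c2, c3) / response_time(f_j, n_j, c1, c2, c3)
```

Because both stages share the factor $c_2^{f}$, each ratio reduces to $c_2^{f_i - f_j}$, whatever the values of $c_1$, $c_3$, k and n.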
In this model, L(f) and C(f) share a common parameter $c_2$. Empirically, we interpret this parameter as a factor that summarizes the eye condition and observing habits of one subject across different spatial frequencies. We define the error function as

$$e = \frac{\left(T(f, n) - \hat{T}(f, n)\right)^2}{N_{f,n}}, \qquad (7)$$

in which T(f, n) denotes the measured response time, $\hat{T}(f, n)$ denotes the value estimated by our model, and $N_{f,n}$ denotes the number of trials for the corresponding frequency and string length.
In our experiment, the choice of k is not directly derived from our experimental results. However, since

$$T(f, n) = c_1 c_2^{f} + (n - k)\, c_3 c_2^{f} = (c_1 - k c_3)\, c_2^{f} + n\, c_3 c_2^{f},$$

the values of $c_2$ and $c_3$ are independent of k. This allows us to choose an arbitrary k and still discuss $c_2$ and $c_3$ safely. Previous studies indicated that a subject cannot enumerate more than 4 objects simultaneously in a counting task [10]. Accordingly, an acceptable choice of k in our framework is 4. We plotted the response curve of each subject and compared our prediction with the actual records. The fitted curves (Fig. 4) show that the exponential function captures the characteristics of the subjects' responses.

Table 1. Parameters of different data sets

Data set     c1       c2       c3      e
Subject 1    76.76    1.0691   31.87   841.6
Subject 2    61.16    1.0987   32.30   1311.8
Subject 3    157.12   1.0362   25.68   1006.6
Subject 4    118.94   1.0533   25.83   1986.3
Subject 5    219.76   1.0334   38.86   841.6
Subject 6    39.49    1.1552   32.09   1794.9
All          112.09   1.0633   33.57   4627.6

3.4 Speculations on the Concentration Process
We may interpret $c_1$ and $c_3$ as the condition parameters of attentional concentration. Before the stimulus was detected, the attention had been concentrated at the anchor point. Once the subject detected the string that popped up, he/she had to move his/her gaze over a relatively long distance from the anchor point to the stimulus. In this process, the concentrated attention may not be maintained due to the rapid movement of the eyes. In the localization stage, the attention would thus be re-initialized from a totally diffused state to the desired scale. We denote this process of attentional concentration as S(∞) → S(f), in which S(∞) describes the worst condition of attentional concentration, and S(f) describes the degree of concentration at which the counting stage starts. In the second stage of counting, a subject did not need to shuffle his/her attention because the eye moved in much smaller increments, and the concentrated attention may, to some degree, remain and be reused in the counting of the neighboring numeral. Similarly, we denote this process as R(f) → R(f).
Fig. 4. Fitting experimental results with the model
4 Discussions
4.1 Of Human Vision
Inspired by Shannon’s theory, many scholars have analyzed human vision from the perspective of entropy [11]. This research opened a new frontier for inspecting the gap between humans and machines. It is true that television programs are more attractive than a blank wall, but in order to calculate and compare the amount of information in different patterns, many studies oversimplified the human observer to a camera. This camera prototype is questionable when we investigate, for example, the human attentional responses to grasslands and faces. Both images have a similar amount of information at high frequencies, but in general human eyes are more inclined to attach to the latter. This phenomenon can be explained by our temporal analysis of attention. Faces are rich in both low and high frequency information, whereas a well-cultivated grassland is extremely abundant in high frequency information but very scarce in low frequency information. According to our theory, the observer’s attention has a very coarse resolution at first glance, and is thus “blind” to the high frequency information of the grass, such as the veins and shapes of the leaves. The low frequency component, however, catches the eye and serves as an entrance. In the absence of low frequency information, one is less likely to initiate a careful observation of a certain region.
4.2 Of Machine Vision
As a counterpart to human vision, machine vision in many respects aims at simulating human behavior. Nevertheless, before computing in a humanoid way, we must be sure that the information we provide to an artificial processing system is identical to what our eyes provide to our brain. If not, it would be groundless to expect artificial processors to behave like human beings. We do not believe that a machine vision system should intentionally implement all the visual defects of humans; electronic devices do not need to gaze at patterns to enhance resolution. However, we should be aware of which tasks are complex and which are simple for the human vision system. In some daily tasks, a human brain may triumph over artificial devices, not because we have sharper sensors or faster processors, but because we have the wisdom to select the proper data to process. The time-dependent capacity of attention can help us find out what kind of information is usually perceived by the human vision system. Knowing that a human brain still performs well with very limited information, we can feed an artificial system with information of less quantity but higher quality.
Acknowledgements
The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the National Natural Science Foundation of China (Grant No. 60375015).
References
1. Sperling, G.: The information available in brief visual presentation. Psychological Monographs (1960)
2. Ruderman, D.: The statistics of natural images. Network: Computation in Neural Systems 5, 517-548 (1994)
3. Treisman, A., Sykes, M., Gelade, D.: Selective attention and stimulus integration. In: Donic, S. (Ed.) Attention and Performance VI, 333-361 (1977)
4. Eriksen, C.W., St. James, J.D.: Visual attention within and around the field of focal attention: a zoom lens model. Perception and Psychophysics (40) 225-240 (1986)
5. Müller, N.G., Bartelt, O.A., Donner, T.H., Villringer, A., Brandt, S.A.: A physiological correlate of the “Zoom Lens” of visual attention. Journal of Neuroscience 23 (9) 3561-3563 (2003)
6. Verghese, P., Pelli, D.G.: The information capacity of visual attention. Vision Research 32 (5) 983-995 (1992)
7. Verghese, P., Pelli, D.G.: The scale bandwidth of visual search. Vision Research 34 (7) 955-962 (1994)
8. Cha, K., Horch, K.W., Normann, R.A.: Reading speed with a pixelized vision system. Journal of the Optical Society of America A-Optics & Image Science 9 (5) 673-677 (1992)
9. Coughlin, M.J., Cutmore, T.R.H., Hine, T.J.: Automated eye tracking system calibration using artificial neural networks. 76 207-220 (2004)
10. Intriligator, J., Cavanagh, P.: The spatial resolution of visual attention. Cognitive Psychology 43 171-216 (2001)
11. Reinagel, P., Zador, A.M.: Natural scene statistics at the centre of gaze. Network: Computation in Neural System 10 1-10 (1999)
Learning with Incrementality
Abdelhamid Bouchachia
University of Klagenfurt, Dept. of Informatics-Systems, Universitaetsstr. 65, A-9020 Klagenfurt, Austria
[email protected]
Abstract. Learning with adaptivity is a key issue in many of today's applications. The most important aspect of this issue is incremental learning (IL), which seeks to equip learning algorithms with the ability to deal with data arriving over long periods of time. Once used during the learning process, old data is never used in subsequent learning stages. This paper suggests a new IL algorithm which generates categories, each of which is associated with one class. To show the efficiency of the algorithm, several experiments are carried out.
1 Introduction
One of the fundamental aspects of adaptivity is incremental learning. In learning-based applications where data arrives over long periods of time or where storage capacities are very limited, IL turns out to be a vital aspect. In such applications, processing all the data at once is impossible. However, most of the available literature on machine learning reports on learning models that are one-shot experiences and, therefore, lack adaptivity. Thus, learning algorithms with an IL ability are of increasing importance in many of today's on-line data stream applications, such as sensors, video streams, and stock market indexes, and in data-starved applications where the acquisition of training data is costly and requires much time.
In this paper, we present a novel IL algorithm based on function decomposition (ILFD) that is realized using neural networks. ILFD is capable of life-long learning and is dedicated to classification tasks. It uses clustering and vector quantization techniques. Hence, training data is partitioned into an unknown number of partitions which belong to the classes to be learned. Furthermore, it is characterized by:
– the ability of on-line learning,
– the ability to deal with the problem of plasticity and stability,
– old data is never used in subsequent stages (i.e., once processed, it is discarded),
– the ability to incrementally tune the structure of the network,
– no prototype initialization is required, and
– no prior knowledge is required about the topological structure of the neural network, the data and its statistical properties, or the number of classes and their corresponding prototype sets.
These aspects are fully discussed in [1]. The main motivation behind ILFD is to enable an on-line classification of data lying in different regions of the space, allowing the generation of non-convex partitions and, more generally, of disconnected partitions (not lying in the same contiguous space).
The basic idea here is that each class can be approximated by a sufficient number of categories centered around their respective prototypes. Hence, each category j of class i, $C_j^i$, is represented as a vector $w_j^i$, $i = 1 \cdots n$, where n is the number of class labels currently known. In other words, each classifier can be approximated by a number of local classifiers, where each local classifier is described by a prototype. One further requirement of ILFD is that these partitions have to be minimally described. Thus, the data vectors cannot be stored and used in subsequent stages. Ideally, only the prototypes are saved, together with at most a small amount of descriptive information, as is done in this work. The rest of the paper is organized as follows. Section 2 describes the ILFD algorithm. Section 3 discusses the empirical evaluation of ILFD.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 137–146, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Overview of the ILFD Method
Learning consists in inferring a classification rule in the form of a function F from a training data set $X = \{(x_k, y_k) \mid y_k \in Y,\ |Y| = I\}$, where each data point $x_k = (x_{k1}, \ldots, x_{kd})$ is a d-dimensional vector and Y is the set of labels. Each component of $x_k$ represents a feature. The function F maps each data point in the input space to a corresponding class y in the output space:

$$F : X \subset \mathbb{R}^d \to Y \subset \mathbb{N} \qquad (1)$$

ILFD tries to refine this mapping by decomposing it into two mappings:

$$G : X \to W, \qquad H : W \to Y, \qquad F = H \circ G \qquad (2)$$

This decomposition is possible if there is a set of partitions represented by a set of prototypes, W. The function G maps the input data, X, onto the set of prototype indices. This function is a clustering function. The second function, H, maps the set of prototypes onto the set of class labels, Y, and is a labeling function. In other words, given a labeled data vector $(x_k, y_k)$ such that $x_k \in X$ and $y_k \in Y$, if $F(x_k) = y_k$, there must exist a prototype j, $w_j$, such that $G(x_k) = w_j$ and $H(w_j) = y_k$. The prototype is therefore labeled according to the label of the points assigned to it.
The functions G and H are realized as a neural network which structurally consists of three layers (Fig. 1). The first layer, $L^{(1)}$, is simply the input layer, the second layer, $L^{(2)}$, is the category layer, while the third, $L^{(3)}$, is the class layer. Since an input vector can be part of any category, the network is fully connected between these two layers. Furthermore, the nodes of $L^{(2)}$ are selectively connected to one node on the output (class) layer $L^{(3)}$, since a category must belong to one class at most. Recall that the category and the class nodes are dynamically and incrementally created as data is presented to the network.
Fig. 1. The architecture of the neural network (input layer, category layer, class layer)
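The decomposition F = H ∘ G of Eq. 2 can be illustrated with a toy nearest-prototype classifier. This is a sketch under stated assumptions rather than the ILFD network itself: G is taken here to return the index of the prototype closest to the input in Euclidean distance, and the prototypes and labels are hypothetical values chosen so that one class is covered by two disconnected categories.

```python
import math


def G(x, prototypes):
    """Clustering function: index of the prototype closest to x."""
    return min(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))


def H(j, prototype_labels):
    """Labeling function: class label attached to prototype j."""
    return prototype_labels[j]


def F(x, prototypes, prototype_labels):
    """Classification rule obtained as the composition F = H o G."""
    return H(G(x, prototypes), prototype_labels)


# Hypothetical prototypes: class 0 consists of two disconnected categories.
prototypes = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
prototype_labels = [0, 0, 1]

predictions = [F(x, prototypes, prototype_labels)
               for x in [(0.5, -0.2), (9.0, 11.0), (4.0, 6.0)]]
```

Keeping the label map H separate from the clustering G is what allows several prototypes, possibly far apart, to stand for the same class.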
The connections from $L^{(1)}$ to the units of $L^{(2)}$ carry modifiable weights W that are to be learned. The connection between the pth unit of $L^{(1)}$ and the jth unit of $L^{(2)}$ is denoted $w_{jp}$. The connections between $L^{(1)}$ and $L^{(2)}$ realize the function G (Eq. 2). Furthermore, the connections, $V = \{v_{ij}\}$, between each category unit and each class unit are fixed and encoded as follows:

$$v_{ij} = \begin{cases} 1 & \text{if the category } j \text{ is dedicated to encode input from class } i \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

Hence, the class and the category layers are fully connected. This encoding via the connections, V, aims at realizing the function H (Eq. 2). Indeed, during classification, each class unit computes the sum of the outputs of the category units connected to it:

$$net_i = \sum_{j=1}^{c} z_j \, v_{ij} \qquad (4)$$

where $z_j$, the output of the jth category unit, is computed using a given similarity measure such as the Dice, Jaccard, overlap, or cosine measure. In the current work, however, we adopted the following measure:

$$Sim(x, y) = \frac{1}{1 + dist(x, y)^2} \qquad (5)$$
where dist(x, y) can be any distance. While in this work we apply the Euclidean distance ($\|x - y\|_2^2$), it is possible to apply any other distance provided that it takes the incrementality into account. This has been thoroughly discussed in [2]. Note also that we use the identity (linear) activation function in the hidden layer. Furthermore, the topology of the network is updated either by adding new category nodes, by adding new class nodes, or by deleting redundant and dead units of $L^{(2)}$. A category node is said to be redundant if it can be covered by other nearby nodes without deteriorating the classification accuracy. A category node is said to be dead if it remains inactive for a long period of time.
The network is trained as follows. If at time t a new input $x_k$ with a label $y_i$ is presented to the network and no class node of that label is available, then a new category node j and a corresponding new class node i are inserted. In addition, the weight $v_{ij}$ is set to 1 (Eq. 4), $v_{i'j}$ ($i' \neq i$) is set to 0, and the weights are set to $w_{jp} = x_{kp}$ for $p = 1 \cdots d$. This process is repeated with each new incoming training input vector with a different label, setting the weights to the successive input such that $w_{jp} = x_{kp}$, $p = 1 \cdots d$. In contrast, if the label of the new input vector is already known, a second check is performed. It consists of matching the input $x_k$ against all categories connected to the class node i. If there is no category that is sufficiently similar to $x_k$, a new category $C_j^i$ is created, its initial weights are set as explained earlier, and the connection between this new category and the class node i is set to 1. If there is a matching category node, the weights $W_j$ are updated according to the following rule:

$$w_j^i(t) = w_j^i(t-1) + \alpha(t)\,\bigl(x_k - w_j^i(t-1)\bigr) \qquad (6)$$

$$w_{j'}^{i'}(t) = w_{j'}^{i'}(t-1) - \beta(t)\,\bigl(x_k - w_{j'}^{i'}(t-1)\bigr) \qquad (7)$$
where α and β are two learning rates, with α > β and α, β ∈ [0, 1]. The index j′ indicates the most similar category from a class i′ ≠ i. Although differently applied here, this formulation is that of the rival penalized competitive learning [6], which can be regarded as an unsupervised extension of Kohonen’s supervised learning vector quantization algorithm LVQ2 [4]. It stipulates that after presenting a labeled input sample, $x_k$, the network computes the similarity of that sample to all categories with the same label. The category with the highest similarity value (i.e., minimal distance value) is retained as a winner, $w_1$. A second winner among all categories connected to the other classes is retained; let it be $w_2$. Then, if the sample, $x_k$, is sufficiently similar to $w_1$ (but also to the categories connected to other classes), $w_1$ is updated with a reinforcement signal that is a function of the distance between the input sample and the prototype $w_1$ (Eq. 6), and $w_2$ is updated with a weakening signal that is also a function of the distance between $w_2$ and $x_k$ (Eq. 7). The idea here is to move the competitor with a different label away from the region of the current input vector so that the classes of the competing winners are kept as distinct as possible.
Note that to perform learning we need to keep in memory a description vector, $U_j^i$, for each category j. This vector consists of: a d-dimensional prototype, $w_j^i$; a number that indicates how many times it has won the competition, $n_j^i$; and a staleness value, $s_j^i$, that indicates how long the category has not won the competition. Furthermore, to cope efficiently with IL, in particular with plasticity and stability [3], the problem of node proliferation in $L^{(2)}$ must be considered. To avoid the constant creation of nodes and to resist over-fitting, the creation has to be controlled.
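The similarity measure of Eq. 5 and the reinforcement and weakening updates of Eqs. 6 and 7 can be sketched in a few lines. The sketch is illustrative only: the function names, the sample, the two prototypes, and the learning rates are assumptions, and the learning rates are held constant although the text allows them to depend on t.

```python
def sim(x, w):
    """Eq. 5 with the squared Euclidean distance: 1 / (1 + ||x - w||^2)."""
    squared_dist = sum((a - b) ** 2 for a, b in zip(x, w))
    return 1.0 / (1.0 + squared_dist)


def reinforce(w, x, alpha):
    """Eq. 6: move the winner of the correct class toward the input x."""
    return [wi + alpha * (xi - wi) for wi, xi in zip(w, x)]


def weaken(w, x, beta):
    """Eq. 7: move the rival from another class away from the input x."""
    return [wi - beta * (xi - wi) for wi, xi in zip(w, x)]


x = [1.0, 1.0]
winner = [0.0, 0.0]       # closest category with the same label as x
rival = [2.0, 2.0]        # closest category with a different label
alpha, beta = 0.5, 0.25   # assumed learning rates, with alpha > beta

new_winner = reinforce(winner, x, alpha)
new_rival = weaken(rival, x, beta)
```

After the update the winner is more similar to the sample and the rival is less similar, which is the separation effect described above.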
The key question is to define the criteria under which a new category node is needed and which of the old ones become obsolete or redundant. It is often the case that, at the beginning of the learning process, many category nodes are created if the incoming points are far away from each other. But as more data points arrive sequentially and the statistical tendency of the data gets clearer, only a few categories are created and most of the older ones are no longer updated. As explained, if the input sample is sufficiently dissimilar to the categories of its class and to those of other classes, we create a new node for it. But this controlling mechanism is insufficient. Thus, we suggest two further criteria, which are basically correction mechanisms. These can be used either as conditions for creating a new category or as a post-processing step after creating it. In the latter case, they can be applied either periodically or whenever a new category is created. In the current implementation, we apply the last alternative. The two mechanisms are:
– Dispersion Test: This mechanism aims at checking whether the categories are well positioned within their class. The idea is to locate redundant categories that can be merged. This is done by finding the closest pair of categories among all pairs. The pair is then merged. However, the merge operation takes place only if their similarity is larger than that of each of them with its closest neighbor emanating from a neighboring class.
– Staleness Test: The second correction criterion is related to the staleness of categories. This procedure allows pruning categories that become stale. The staleness of a category, $C_j^i$, is measured by the time during which $C_j^i$ has not been assigned an input vector. This corresponds to the number of input vectors, $s_j^i$, that arrive consecutively without being assigned to $C_j^i$. If the staleness of a category is larger than a certain threshold, that category must be pruned.
Algorithmically, ILFD consists of the following steps:
1. Competition step: aims at finding, at time t, the most similar category $C_j^i$ to the presented input with label i, called the first winner, and at finding the most similar category among all other classes, called the second winner:
  (a) Compute the similarity $sim(x_k, w_j^i(t))$, $j = 1 \cdots |C_i|$, between $x_k$ and each of the existing categories with the same label i and find the winner j:

$$j = \arg\max_{p,\ p = 1 \cdots |C_i|} \{ sim(x_k, w_p^i) \} \qquad (8)$$

  (b) Compute the similarity $sim(x_k, w_{j'}^{i'})$, $i' = 1 \cdots |I|$ and $j' = 1 \cdots |C_{i'}|$, between $x_k$ and each of the existing categories with a label $i' \neq i$ and find the winning category, j′, which has a label i′ (≠ i), as follows:

$$j' = \arg\max_{j,\ j = 1 \cdots |C_{i'}|,\ i' = 1 \cdots |I|,\ i' \neq i} \{ sim(x_k, w_j^{i'}) \} \qquad (9)$$

2. Learning step: aims at reinforcing the first winner by updating its weights and at penalizing the second winner by weakening its weights. Let test1 be $\bigl(sim(x_k, w_j^i(t)) - sim(x_k, w_{j'}^{i'}(t))\bigr) < M$ and let test2 be $sim(x_k, w_j^i(t)) > R$, where M and R are user-defined thresholds that play the role of confidence and confusion controllers. Then:
  if test1 = True, then update the description vectors associated with the first winner j and the second winner j′ as follows:

$$w_j^i(t) = w_j^i(t-1) + \alpha(t)\,\bigl(x_k - w_j^i(t-1)\bigr) \qquad (10)$$

$$w_{j'}^{i'}(t) = w_{j'}^{i'}(t-1) - \beta(t)\,\bigl(x_k - w_{j'}^{i'}(t-1)\bigr) \qquad (11)$$

$$n_p^i(t) = \begin{cases} n_p^i(t-1) + 1 & \text{if } j = p \\ n_p^i(t-1) & \text{otherwise} \end{cases} \qquad (12)$$

$$s_p^i(t) = \begin{cases} 0 & \text{if } j = p \\ s_p^i(t-1) + 1 & \text{otherwise} \end{cases} \qquad (13)$$

  else, if test2 = True, create a new prototype and update the best competitor j′:

$$w_j^i(t) = x_k \qquad (14)$$

$$w_{j'}^{i'}(t) = w_{j'}^{i'}(t-1) - \beta(t)\,\bigl(x_k - w_{j'}^{i'}(t-1)\bigr) \qquad (15)$$

$$n_p^i(t) = \begin{cases} 1 & \text{if } j = p \\ n_p^i(t-1) & \text{otherwise} \end{cases} \qquad (16)$$

  and update $s_j^i(t)$ using (13);
  else, create a new node without updating the competitor, using Eqs. 14, 16 and 13.
  Note that if the label of $x_k$, let it be $y_i$, is not yet known, then a class node and a category node are created as follows:

$$w_j^i(t) = x_k; \quad n_j^i(t) = 1; \quad v_{ij}(t) = 1; \quad s_j^i(t) = 0 \qquad (17)$$

3. Pruning step: can be executed periodically and consists of two tests:
  (a) Dispersion test:
    i. Find the categories j and j′ that are the most similar among all pairs:

$$sim(w_j^i(t), w_{j'}^i(t)) \geq sim(w_l^i(t), w_{l'}^i(t)) \qquad (18)$$

    such that $\{j, j'\} \neq \{l, l'\}$.
    ii. Find the closest competitor of each of j and j′ emanating from neighboring classes of i; let them be f and g respectively, with prototypes $w_f$ and $w_g$.
    iii. If $\max\{ sim(w_j^i(t), w_f(t)),\ sim(w_{j'}^i(t), w_g(t)) \} \leq sim(w_j^i(t), w_{j'}^i(t))$, then merge $C_j^i$ and $C_{j'}^i$ and compute the new description vector $U_j^i$ of the category j:

$$w_j^i(t) = \frac{n_j^i(t)\, w_j^i(t-1) + n_{j'}^i(t)\, w_{j'}^i(t-1)}{n_j^i(t) + n_{j'}^i(t)}, \quad n_j^i(t) = n_j^i(t-1) + n_{j'}^i(t-1), \quad s_j^i(t) = \min\bigl(s_j^i(t), s_{j'}^i(t)\bigr) \qquad (19)$$

  (b) Staleness test:
    i. Find the category j with the highest staleness value $s_j^i(t)$ and prune it if $s_j^i(t) \geq \gamma \cdot n_j^i(t)$, where γ > 0 is a user-defined parameter specifying the proportionality between the staleness and the number of times the category has won the competition.
    ii. Find the category j′ most similar to j, $j' = \arg\max_{l,\ l = 1 \cdots |C_i|,\ l \neq j} \{ sim(w_l^i(t), w_j^i(t)) \}$, merge $C_j^i(t)$ with $C_{j'}^i(t)$, and update the description vector $U_{j'}^i$ of the category j′ using Eq. 19.
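The two operations used by the pruning step, the merge of Eq. 19 and the staleness condition, can be sketched as follows. This is a minimal illustration under assumptions: the `Category` class, the function names, and the sample values are hypothetical, and the search for the pair to merge and for neighboring-class competitors is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Category:
    """Description vector U of a category: prototype, win count, staleness."""
    w: List[float]
    n: int
    s: int


def merge(a: Category, b: Category) -> Category:
    """Merge two categories of the same class following Eq. 19."""
    total = a.n + b.n
    # Prototype: average of the two prototypes weighted by their win counts.
    w = [(a.n * wa + b.n * wb) / total for wa, wb in zip(a.w, b.w)]
    return Category(w=w, n=total, s=min(a.s, b.s))


def is_stale(c: Category, gamma: float) -> bool:
    """Staleness test: a category is pruned when s >= gamma * n."""
    return c.s >= gamma * c.n


a = Category(w=[0.0, 0.0], n=3, s=5)
b = Category(w=[4.0, 8.0], n=1, s=2)
merged = merge(a, b)

old = Category(w=[1.0, 1.0], n=2, s=6)
stale_flags = [is_stale(c, gamma=3.0) for c in (a, b, merged, old)]
```

Weighting by the win counts keeps the merged prototype close to the category that has absorbed more samples, and taking the minimum staleness treats the merged category as being as recently active as the fresher of the two.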
3 Simulation Results This section outlines the behavior of the algorithm using four artificial data sets which are Normal mixture (NM) data, Circle-in-the-square (CS) data, chessboard (CB) data, and the multi-circles data (MC)(see Tab. 2) and two real-world data sets. NM is a twodimension data consisting of three classes as shown in Fig. 3, while the CB, CS, and MC data sets consist of two classes each. The real world data sets are: Wisconsin diagnostic breast cancer, and image segmentation. Both data sets are from the UCI repository [5]. Image segmentation consists of 2310 instances that are randomly drawn from a database of 7 outdoor images. Each instance is a 3x3 region and described by 19 continuous features. Wisconsin diagnostic
Learning with Incrementality
143
Data set Training Testing Total Normal mixture (NM) 840 360 1200 Circle-in-Square (CS) 840 360 1200 Chessboard (CB) 1750 750 2500 multiple circles (MC) 1680 720 2400 Fig. 2. Synthetic data characteristics
Fig. 3. Synthetic data used (four panels: NM data, CS data, CB data, MC data)
breast cancer data consists of 9 features, which are computed from a digitized image of a fine needle aspirate of a breast mass. The data set contains 690 samples, each classed as benign or malignant.

These data sets are used to analyze the following aspects:
- the evolution of category centers,
- the effect of the pruning mechanisms, and
- the classification accuracy of the algorithm.

First, we discuss the results obtained on the synthetic data, which present particular difficulties, before dealing with the real-world data. Since we are concerned with incrementality, no particular ordering of the data can be assumed. Therefore, the classification accuracy is averaged over many runs. To measure it, the data is split into two sets, a training set and a testing set, as shown in Fig. 2. In the following, the values of R and M (see the learning step) are set in all experiments to 0.6 and 0.01 respectively, while the parameter γ (see the staleness test) is set to 3.

Figures 4(a) and 4(d) illustrate how the prototypes evolve over time as new data arrives. Due to space limitations, we show only the last two data sets. The prototypes converge to the center of the clusters. It is worth stressing that very similar results have been obtained even when the order of arrival of the input vectors changes (i.e., using different simulations). This results from the fact that the greater the amount of data, the more capable the algorithm is of learning the distribution (the statistical properties) of the data. Indeed, the prototypes (i.e., the weights of the network) occupy the optimal position within the class. Figures 4(c) and 4(f) illustrate the final position of the prototypes after exhausting the entire training data. The results are highly significant and show the performance of the ILFD approach.
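The averaging protocol described above (a random arrival order per run, accuracy measured on a held-out test set, mean and standard deviation over the runs) can be sketched as follows. This is a minimal illustration under assumptions: `RunningMeanClassifier` is a toy stand-in for an incremental learner and is not ILFD, and the `partial_fit`/`predict` interface and the function name `averaged_accuracy` are hypothetical.

```python
import random
import statistics


class RunningMeanClassifier:
    """Toy incremental learner (NOT ILFD): one running mean per class."""

    def __init__(self):
        self.means = {}
        self.counts = {}

    def partial_fit(self, x, y):
        # Update the running mean of class y with a single sample x.
        n = self.counts.get(y, 0) + 1
        mean = self.means.get(y, [0.0] * len(x))
        self.means[y] = [m + (xi - m) / n for m, xi in zip(mean, x)]
        self.counts[y] = n

    def predict(self, x):
        # Return the class whose mean is closest to x.
        return min(self.means,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, self.means[y])))


def averaged_accuracy(make_model, train, test, runs=20, seed=0):
    """Train on a freshly shuffled arrival order in each run; return (mean, std)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        order = train[:]
        rng.shuffle(order)               # a different order of arrival per run
        model = make_model()
        for x, y in order:               # samples are presented one at a time
            model.partial_fit(x, y)
        correct = sum(model.predict(x) == y for x, y in test)
        scores.append(correct / len(test))
    return statistics.mean(scores), statistics.pstdev(scores)
```

Any learner exposing the same two methods could be passed as `make_model`; the toy classifier only serves to make the sketch runnable.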
Fig. 4. Progress of the training process
To avoid the problem of category proliferation, we examined the effect of both control mechanisms, staleness and dispersion. The goal is to show the contribution of each of these mechanisms. The staleness parameter γ is set to 3. Figures 4(b) and 4(e) illustrate the results after applying the staleness mechanism. The results look suboptimal, but still acceptable. On the other hand, Figs. 4(c) and 4(f) show the final prototypes after applying the dispersion control mechanism. It is clear that the effect of the dispersion mechanism is more pronounced than that of the staleness mechanism.

The next aspect to be investigated is the classification accuracy of the ILFD approach. As pointed out earlier, to obtain a more precise estimate of the classification accuracy, we average the classification rate over 20 runs of each experiment. To see the effect of the confidence parameter R (learning step) on the classification, we set the confusion threshold M and γ to 0.01 and 3 respectively. Table 1 shows the classification accuracy (in %) and the corresponding number of generated categories obtained with different values of the confidence parameter R. The standard deviation is also included since we are averaging the results of 20 simulations. The major outcome of this experiment is that the classification accuracy is affected by R, but also depends on the data. Indeed, for the NM data the accuracy is high (96%), and R does not seem to affect the results. On the other hand, with the CB, CS, and MC data sets, the accuracy values increase as R increases, going up from 50% to 89%, 89%, and 98% respectively. Note that higher accuracy values are only obtained with higher values of R (0.6, 0.8, 0.95). Furthermore, the table illustrates that there is a correlation between the classification accuracy and the number of categories generated. The best classification values are
Table 1. Classification error and proliferation - Effect of R

Data  R     Accu. (m)  Accu. (std)  Cat. # (m)  Cat. # (std)
NM    0.20  0.95       0.00         3.00        0.00
NM    0.40  0.96       0.01         3.00        0.00
NM    0.60  0.96       0.01         3.00        0.00
NM    0.80  0.96       0.04         4.05        0.94
NM    0.95  0.96       0.03         3.60        0.82
CS    0.20  0.51       0.03         2.00        0.00
CS    0.40  0.53       0.02         2.00        0.00
CS    0.60  0.73       0.03         4.14        0.35
CS    0.80  0.89       0.03         12.78       1.67
CS    0.95  0.89       0.03         35.52       14.04
CB    0.20  0.50       0.02         2.10        0.31
CB    0.40  0.50       0.03         8.20        0.41
CB    0.60  0.56       0.03         14.47       1.07
CB    0.80  0.84       0.05         23.10       0.92
CB    0.95  0.89       0.03         26.87       1.38
MC    0.20  0.50       0.03         2.00        0.00
MC    0.40  0.50       0.04         2.00        0.00
MC    0.60  0.53       0.04         4.12        0.33
MC    0.80  0.98       0.01         16.24       0.43
MC    0.95  0.96       0.02         53.52       15.70

Table 2. Classification error and proliferation - Effect of γ

Data  γ     Accu. (m)  Accu. (std)  Cat. # (m)  Cat. # (std)
NM    0.50  0.96       0.00         3.00        0.00
NM    1.00  0.97       0.02         3.15        0.49
NM    1.50  0.96       0.01         3.10        0.32
NM    2.00  0.97       0.01         3.30        0.67
NM    2.50  0.96       0.03         3.90        1.29
CS    0.50  0.84       0.03         11.00       1.76
CS    1.00  0.89       0.03         12.78       1.67
CS    1.50  0.91       0.02         13.10       0.99
CS    2.00  0.89       0.04         12.90       1.10
CS    2.50  0.91       0.01         13.80       0.79
CB    0.50  0.76       0.09         22.30       2.29
CB    1.00  0.89       0.08         25.13       1.20
CB    1.50  0.94       0.02         25.37       0.67
CB    2.00  0.94       0.01         25.93       0.94
CB    2.50  0.94       0.01         25.77       0.73
MC    0.50  0.98       0.01         16.10       0.30
MC    1.00  0.98       0.01         16.18       0.39
MC    1.50  0.98       0.01         16.28       0.50
MC    2.00  0.98       0.01         16.42       0.61
MC    2.50  0.98       0.01         16.48       0.68
always those corresponding to the natural number of categories, as with CB and MC (see Fig. 3). With CS it is hard to estimate the optimal number of categories.

To assess the effect of γ (pruning step), which stands for the proportionality of the staleness value to the number of times the category's prototype was reinforced, Tab. 2 shows the classification accuracy (in %) and the corresponding number of categories generated for different values of γ, with R = 0.95 (which is on average the best confidence value) and confusion threshold M = 0.01. The experiments showed that increasing the value of γ yields an increase in the number of categories and better accuracy values with CS and CB. In contrast, with the NM and MC data, the effect of γ on the behavior of ILFD is not clear.

To complete the picture, we ran the same experiments on the two real-world data sets: image segmentation and breast cancer. Tables 3 and 4 summarize the classification results obtained by changing the confidence and the staleness parameter values. While the accuracy values are very high with the cancer data, those obtained on the image data are lower. However, the effect of R is more visible on the image data: as R increases, the accuracy increases monotonically. Similarly, the effect of γ on the image data is clearer than that on the cancer data, despite the high classification rate obtained on the latter.
Table 3. Classification results - Real world data - Effect of R

Data    R     Accu. (m)  Accu. (std)  Cat. # (m)  Cat. # (std)
Image   0.20  0.70       0.04         27.40       5.54
Image   0.40  0.80       0.03         70.80       12.09
Image   0.60  0.81       0.02         87.70       19.58
Image   0.80  0.83       0.02         103.55      19.50
Image   0.95  0.83       0.02         122.35      19.87
Cancer  0.20  0.95       0.02         2.80        1.64
Cancer  0.40  0.97       0.00         2.80        1.82
Cancer  0.60  0.96       0.02         2.40        0.75
Cancer  0.80  0.97       0.01         4.10        4.12
Cancer  0.95  0.96       0.01         3.15        2.21

Table 4. Classification results - Real world data - Effect of γ

Data    γ     Accu. (m)  Accu. (std)  Cat. # (m)  Cat. # (std)
Image   0.50  0.71       0.04         44.55       11.76
Image   1.00  0.76       0.03         64.70       12.37
Image   1.50  0.78       0.04         80.35       17.17
Image   2.00  0.81       0.03         94.50       18.03
Image   2.50  0.84       0.02         108.75      16.20
Cancer  0.50  0.95       0.01         2.40        1.14
Cancer  1.00  0.95       0.02         2.20        0.52
Cancer  1.50  0.96       0.00         2.05        0.22
Cancer  2.00  0.96       0.01         2.65        1.42
Cancer  2.50  0.96       0.00         2.30        0.57
4 Conclusion

This paper introduces a new incremental learning algorithm based on a neural network (ILFD). The aim was to develop a classifier that consists of a set of local classifiers that evolve over time and that can achieve good classification results, as shown in Sec. 3. Here, we used synthetic data sets for the purpose of illustration, and also because such data sets are quite complicated, together with two real-world data sets. However, extensive evaluation is still required. Moreover, as a further step in this work, ILFD will be compared against other neural networks that present most of the incremental learning characteristics described in Sec. 1.
References

1. A. Bouchachia. On Adaptive Learning. In Proc. of the 6th International Joint Conference on Recent Advances in Soft Computing, pages 30-35, July 10-12, 2006.
2. A. Bouchachia and R. Mittermeir. Towards Fuzzy Incremental Classifiers. Soft Computing, in press, 2006.
3. S. Grossberg. Nonlinear neural networks: Principles, mechanism, and architectures. Neural Networks, 1:17-61, 1988.
4. T. Kohonen. Self-organizing Maps. Springer, Berlin, 1997.
5. J. Merz and P. Murphy. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1996.
6. L. Xu, A. Krzyzak, and E. Oja. Rival Penalized Competitive Learning for Clustering Analysis, RBF Net, and Curve Detection. IEEE Trans. on Neural Networks, 4(4):636-649, 1993.
Lability of Reactivated Human Declarative Memory

Fumiko Tanabe1,2 and Ken Mogi1,2

1 Department of Computational Intelligence and System Science, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama 226-8502 Japan
[email protected]
2 Sony Computer Science Laboratories, Takanawa Muse Bldg., 3-14-13 Shinagawa-ku, Tokyo 141-0022 Japan
[email protected]
Abstract. Memory consolidation is an increasingly important cortical process in which a new memory is transferred to a stable state over time. Memory is said to be labile when certain cognitive and/or pharmacological processes lead to its partial or total destruction. Classical theory held that once consolidated, memory is not susceptible to disruption. However, recent experiments have suggested that when a consolidated memory is reactivated, it can become labile again (Nader 2003). A process of reconsolidation is likely to be required for the activated memory (“reconsolidation hypothesis”). Here we investigate the lability of human declarative memory. The results show that a reactivated declarative memory becomes labile under certain conditions. It is suggested that declarative memory in an active recollected state becomes susceptible to modification, editing, and even erasure in some extreme cases. Keywords: long-term memory, consolidation, reconsolidation, declarative memory, re-activation.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 147-154, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Memory in humans can be classified into short-term memory (STM), which lasts for seconds to minutes, and long-term memory (LTM), which is kept for hours to years. It has been suggested that distinct molecular mechanisms are involved in these two different sorts of memory. Short-term memory is likely to be supported by a temporary change in the transmission efficiency of synapses. On the other hand, synaptic structural changes that require protein synthesis and gene transcription are likely to be involved in the formation of long-term memory [1]. The transition from unstable short-term memory to stable long-term memory is called ‘consolidation’ [2]. Classical theory held that once consolidated, memory is insensitive to disruption. However, recent evidence has suggested that when a previously consolidated memory is reactivated, it returns to a labile state, where the process of ‘reconsolidation’ is necessary for the stabilization of the activated memory [3].

Disruptions of the reconsolidation of previously consolidated memory have been reported in animals. Misanin et al. gave an electroconvulsive shock (ECS) to rats
twenty-four hours after a fear-conditioning trial, preceded by a brief presentation of conditioned stimuli [4]. The loss of the previously consolidated fear response was observed, which suggests that the reactivated memory became labile through the presentation of conditioned stimuli, the following reconsolidation process being disrupted by the ECS. It has been suggested that protein synthesis is needed in the reconsolidation process as well as in the consolidation process [5]. The protein synthesis inhibitor anisomycin was injected into the lateral nucleus of the amygdala (LA) of rats, where protein synthesis is required for consolidation in auditory fear conditioning. The application of anisomycin itself had no effect on subsequent tests. However, the infusion of anisomycin after a memory reactivation procedure induced by auditory conditioning stimuli one day after fear conditioning caused amnesia. Thus, protein synthesis in the LA is necessary for reconsolidating the memory made labile by reactivation.

Using a finger-tapping task, Walker and his colleagues found that the reactivation of consolidated memory turned it into a fragile state that requires reconsolidation [6], uncovering the lability of activated procedural memory in humans. The degree of interference between competing tapping sequences was found to correlate with the memory stability, where the performance was evaluated by the speed and accuracy of tapping. When a second tapping sequence was trained immediately after the rehearsal of the first sequence twenty-four hours after training, interference between the two procedures was observed, resulting in a poorer performance of the first sequence. On the other hand, no significant lability of the memory of the first sequence was observed when the second sequence was learned 6 hours after training, indicating that the motor memory is labile immediately after training but becomes stable again after 6 hours.
In addition, when the training of the second sequence was given 24 hours after training of the first, immediately after retraining of the first sequence, it was observed to interfere with the performance of the first sequence even 24 hours later (thus 48 hours after initial training).

In summary, previous studies have suggested that a reconsolidation process is necessary when a memory gets reactivated and becomes labile. It is of great interest to investigate whether the lability of activated memory is specific to certain modalities of memory, or is a ubiquitous property of the memory system, and if so, how and why. The reconsolidation hypothesis has been verified in a number of animal studies. In humans, the experiment conducted by Walker et al. on procedural memory is the first and only example to show reconsolidation, except for clinical studies using electroconvulsive shock. To the best of our knowledge, the lability involved in the reconsolidation hypothesis has not been studied in human declarative memory.

It has been considered that there are two kinds of LTM, i.e., the conscious and explicit declarative memory and the unconscious and implicit procedural memory. Evidence suggests that distinct neural circuits are involved in each of these [7]. It is known that the medial temporal lobe (MTL) is involved in the formation of declarative memory such as episodic and semantic memory. The cerebellum and the premotor cortex are likely to be relevant to procedural memory such as motor learning and conditioned reflexes.
[Figure 1 panels: Training phase (memorize object and location); Test phase (object recognition and location recall, answered by key press)]
Fig. 1. The training and test phases. In the training phase, subjects learned the objects and their locations (top, bottom, left, right). In the test phase, subjects were tested for the recognition of objects and location by means of key press.
The consolidation and the retrieval of distinct forms of memories are also known to use discrete memory systems. The same is considered to be true for reconsolidation. In this paper, we report our study on the reconsolidation process in human declarative memory, following our preliminary report [8].
2 Experiment 1

2.1 Materials and Methods

In Experiment 1, we studied whether competing stimuli have interference effects on the memory of previously learned stimuli. We used black-and-white line drawings [9] displayed on a computer screen. The stimuli were classified into 16 categories, e.g., furniture, fruits, etc. For these stimuli, the familiarity, frequency of contact in daily life, and complexity were determined by a questionnaire survey of 40 college students, in which the figures were rated on a scale from 1 (low) to 5 (high) [9].

There were two tasks, i.e., the target task and the competing task. Each task consisted of 30 drawings. The stability of memory was judged in terms of the interference caused by the competing task [6]. The competing stimuli were chosen in such a way that their categories were the same as those of the target task. More than 80% of the competing stimuli, i.e., over 24 drawings, were those whose familiarity and complexity were within ±1.0 of those of the targets.

In the training phase, the subjects were requested to memorize the presented objects and their locations. Each drawing was presented at one of four locations, i.e., the top, bottom, left and right positions on the computer display, for 2000 msec with an inter-trial interval (ITI) of 1000 msec. The 30 drawings were presented twice in a random order. In the test phase, 60 drawings were displayed at the center of the screen. Half of them were previously learned, while the other half were unlearned. The subjects
reported whether or not each drawing had been learned (object recognition task). If it was judged to be a learned stimulus, they pressed an arrow key corresponding to its location during the training phase (location recall). Otherwise, they pressed the return key to report that it was novel. The stimuli were displayed until the subjects pressed an arbitrary key (Figure 1).
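The training and test procedure described above can be sketched as a small trial generator and scoring rule. This is an illustration only, not the software used in the experiment: the names `training_schedule` and `score_response` are hypothetical, and it is assumed that each drawing keeps one fixed location and that all 60 training presentations are shuffled together.

```python
import random

LOCATIONS = ("top", "bottom", "left", "right")


def training_schedule(drawings, rng, repeats=2, stim_ms=2000, iti_ms=1000):
    """Give each drawing one location, then present all drawings `repeats` times."""
    location = {d: rng.choice(LOCATIONS) for d in drawings}
    trials = [d for d in drawings for _ in range(repeats)]
    rng.shuffle(trials)                      # random presentation order
    return location, [(d, location[d], stim_ms, iti_ms) for d in trials]


def score_response(drawing, response, location):
    """response is one of LOCATIONS (arrow key) or "novel" (return key).

    Returns (object recognition correct, location recall correct).
    """
    learned = drawing in location
    recognized = (response != "novel") == learned
    located = learned and response == location[drawing]
    return recognized, located


rng = random.Random(0)
targets = [f"target_{i:02d}" for i in range(30)]   # placeholder stimulus names
location, trials = training_schedule(targets, rng)
total_ms = sum(stim + iti for _, _, stim, iti in trials)
```

With 30 drawings shown twice at 2000 msec plus a 1000 msec ITI, the training phase in this sketch lasts 180 seconds in total.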
Fig. 2. Experimental procedure (a) Experiment 1. The interference effect of the competing stimuli (task B) on the target stimuli (task A) was examined. All subjects (n=8) were trained in task A. Immediately after the training, half of them (n=4) were trained in task B (B training), while others (n=4) were not (Control). Both groups were tested immediately (Test0hr) and 24 hours (Test24hr) after the training. (b) Experiment 2. The stability of memory after reactivation was tested. All subjects (n=15) were trained in task A. They were then divided into three groups. Twenty-four hours after training in task A, Group 1 (n=5) was tested (Test24hr: reactivation phase), Group 2 (n=5) was tested similarly, followed by a training in task B. Group 3 (n=5) was trained in task B without the test. All groups were tested 48 hours after initial training (Test48hr).
All subjects learned the target task A. Some of them were trained in the competing task B immediately afterwards (B_training), while the others were not (Control) (Figure 2a). Both groups of subjects were tested immediately after the initial training (Test0hr) and 24 hours later (Test24hr). Eight subjects (14 to 45 years old) participated in this experiment.

2.2 Results

We found no significant difference in the correct rate for task A in Control conditions at Test0hr in both the object recognition and location recall test, whereas the correct rate of task A in the location recall test was significantly lower for the "B_training"
conditions than for the "Control" condition (B_training: M=0.46, SD=0.11; Control: M=0.73, SD=0.11 paired t-test: p < 0.01). We also compared the change in correct rate between Test0hr and Test24hr (Figure 3). The change in the correct rate between Test0hr and Test24hr {(Correct rate for Test0hr – correct rate for Test24hr) / Correct rate for Test0hr} was calculated. A significant decrease in correct rate change in the location recall test was observed in "B_training" condition, compared with the Control (B_training: M=-0.28, SD=0.16; Control: M=-0.17, SD=0.07; paired t-test: p < 0.01). These results indicate that the performance in target task A got worse when subjects learned the competing task B immediately after training target task A, suggesting that training in the competing task B interfered with the memory of task A.
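The change measure used above can be written out as a short calculation. The sketch follows the expression exactly as printed, (Test0hr - Test24hr) / Test0hr; the function name `relative_change` is hypothetical and the per-subject rates are made-up numbers, not data from the experiment. With this expression a drop in correct rate gives a positive value, so the negative means reported above would correspond to the opposite sign convention; which convention was used is not stated.

```python
import statistics


def relative_change(rate_first, rate_second):
    """(first - second) / first, following the expression as printed in the text."""
    if rate_first <= 0:
        raise ValueError("first-session correct rate must be positive")
    return (rate_first - rate_second) / rate_first


# Made-up per-subject correct rates for illustration (NOT experimental data).
test0 = [0.80, 0.70, 0.90, 0.75]
test24 = [0.60, 0.63, 0.72, 0.60]

changes = [relative_change(a, b) for a, b in zip(test0, test24)]
group_mean = statistics.mean(changes)
group_sd = statistics.stdev(changes)     # sample standard deviation
```

Normalizing by the Test0hr rate makes the change comparable between subjects whose initial performance differs.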
3 Experiment 2

3.1 Materials and Methods

In Experiment 2, we studied the interference effects on the memory of target stimuli caused by the presentation of competing stimuli immediately after recalling the target 24 hours after the initial training (Figure 2b). The subjects were also tested 24 hours after recalling the target stimuli (thus, 48 hours after the initial training). The subjects were divided into three groups. Twenty-four hours after training in target task A, subjects in Group1 (n=5) conducted the test for target stimuli only. Those in Group2 (n=5) conducted the test, immediately followed by training in the competing task B. Those in Group3 (n=5) learned the competing task B and did not conduct the test in task A. Forty-eight hours after the initial training, all subjects conducted the test in task A. All other conditions were the same as those used in Experiment 1. Fifteen subjects (14 to 59 years old) participated in this experiment.

3.2 Results

Comparing the performance of Group1 with that of Group2, we found no significant difference in the correct rate at Test24hr, in both the object recognition task and the location recall task. On the other hand, the correct rate at Test48hr in the location recall test was significantly lower for Group2 than for Group1 (Group1: M=0.53, SD=0.16; Group2: M=0.22, SD=0.11; paired t-test: p < 0.01). No significant difference was observed in the object recognition task. The change in correct rate between Test24hr and Test48hr in the location recall test was significantly lower for Group2 than for Group1 (Group1: M=-0.02, SD=0.11; Group2: M=0.44, SD=0.34; paired t-test: p < 0.05). No significant change was observed in the object recognition task. These results indicate that training in the competing task B immediately after reactivating the memory of target task A results in a worse performance for task A, suggesting that the reactivated memory of task A was made labile and disrupted by task B.
Comparing the performance of Group2 with Group3, the correct rate at Test48hr in the location recall test was significantly lower for Group2 than for Group3 (Group2: M=0.02, SD=0.11; Group3: M=0.47, SD=0.13; paired t-test: p < 0.01), whereas no significant change was observed in the object recognition task. These results indicate
Fig. 3. Change in the correct rate between Test0hr and Test24hr in Experiment 1. The group trained in task B immediately after training in task A revealed a significant decrease in the correct rate for the location recall test, suggesting that the competing task interfered with the target task. Asterisks represent statistical significance (p < 0.01); NS, non-significant. Error bars indicate SEM. The same notations are used for all subsequent figures.
Fig. 4. Correct rate at Test48hr in Experiment 2. Comparing the performance for Group 1 with that for Group 2, it is concluded that the training of the competing task immediately after reactivating the target task led to a worse performance of the target task. Thus, it is suggested that the reactivated memory of the target task was disrupted by the competing task. Comparing Group 2 with Group3, we conclude that the reactivated memory of the target task received more interference compared to that in the non-reactivated case. The reactivation of memory for the target task turned it into a labile state.
that the memory for task A was interfered with significantly when the memory was recalled just before training in task B, in contrast with the non-recalled case. Taken together, we conclude that the performance for task A 48 hours after the initial training was significantly affected by training in the competing task B just after recalling task A, compared with the cases where the subjects only recalled task A or were only trained in the competing task B. These results suggest that the reactivation of memory for task A turns it into a labile state, in which it becomes susceptible to interference from task B (Figure 4).
4 Discussion

In this experiment, we investigated whether human declarative memory becomes labile when reactivated, and thus susceptible to interference from a competing task. Here we summarize the main findings of this experiment.

(1) Exposure to the competing task B, consisting of drawings similar to those of the target task A in terms of category, familiarity and complexity of objects, had an interference effect on the consolidation of task A when presented immediately after the training for the target.

(2) The presentation of task B interfered with task A only when task A was recalled just before training in task B. The isolated recall of task A, or training in task B 24 hours after the initial training of task A, had no interference effect on the performance of task A 48 hours after training.

These results suggest, for the first time to the best of our knowledge, that human declarative memory can become labile when reactivated, and thus susceptible to interference from a competing task. Our results are consistent with the hypothesis that the reactivation of the memory of task A turns it into a labile state, where a reconsolidation process is necessary to return it to a stable state.

A number of previous studies on animal memory have suggested that when previously consolidated memories are reactivated, the memory trace becomes labile and reconsolidation is necessary to return it to a stable state. Additionally, it has been suggested that protein synthesis and gene transcription are required in the consolidation and reconsolidation processes, where distinctive mechanisms are involved in the two processes. Furthermore, Walker et al. have shown that the consolidated memory of a motor task was interfered with by a competing task immediately after recall, suggesting that a reconsolidation process is required in human procedural memory as well. Our study reveals similar dynamics of reactivation, lability, and reconsolidation in human declarative memory.
There has been some circumstantial evidence on the nature of lability in the reconsolidation of human memory. For example, patients with obsessive-compulsive disorder (OCD) or hallucinations were administered an electroconvulsive shock (ECS) during the experience of their obsessions or hallucinations, resulting in an improvement in the symptoms [10]. However, it has not been clear whether this phenomenon occurs under more general conditions.
It is known that relatively recent memories require the workings of the hippocampus. Information retrieval dependent on the hippocampus is known to last only for a limited period. As time progresses, the memory trace becomes independent of the hippocampus, in a process that requires protein synthesis (‘cellular consolidation’). In this process, the neocortical sites gain control over the memory formation. In addition to the consolidation on the cellular level, it is considered that a systems-level consolidation process is necessary (‘systems consolidation’). The interplay between the cellular and systems-level processes is also involved in the reconsolidation of memory [11].

Our studies reported here suggest that the declarative memory involved in the recognition of objects and locations turns into a labile state when reactivated, becoming susceptible to interference from a competing task, as predicted by the reconsolidation hypothesis. Such an effect reveals the nature of the underlying memory dynamics in the brain. It is interesting to consider the functional significance of such a process. When a previously consolidated memory is reactivated, it might be made labile so that it can be modified, reorganized and refined, in adaptation to the variety of situations we face in daily life. In this context, it is of interest to further investigate whether perceptual similarities and differences between the memory items involved significantly affect the interference observed in the reactivation and reconsolidation of human declarative memory.
References

1. McGaugh, J.L.: Memory - a century of consolidation. Science, 287 (2000) 248-251
2. Miller, R.R., Matzel, L.D.: Memory involves far more than 'consolidation'. Nat Rev Neurosci, 1 (2000) 214-216
3. Nader, K.: Memory traces unbound. Trends Neurosci., 26 (2003) 65-72
4. Misanin, J.R., Miller, R.R., Lewis, D.J.: Retrograde amnesia produced by electroconvulsive shock after reactivation of a consolidated memory trace. Science, 160 (1968) 554-555
5. Nader, K., Schafe, G.E., Le Doux, J.E.: Fear memories require protein synthesis in the amygdala for reconsolidation after retrieval. Nature, 406 (2000) 722-726
6. Walker, M.P., Brakefield, T., Hobson, J.A., Stickgold, R.: Dissociable stages of human memory consolidation and reconsolidation. Nature, 425 (2003) 616-620
7. Squire, L.R., Zola, S.M.: Structure and function of declarative and nondeclarative memory systems. Proc Natl Acad Sci U S A, 93 (1996) 13515-13522
8. Tanabe, F., Mogi, K.: Reactivation and consolidation in long-term memory. Abstracts of the 35th Annual meeting of the Society for Neuroscience, (2005)
9. Snodgrass, J.G., Vanderwart, M.: A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. J Exp Psychol [Human Learn Mem], 6 (1980) 174-215
10. Nader, K.: Re-recording human memories. Nature, 425 (2003) 571-572
11. Debiec, J., LeDoux, J.E., Nader, K.: Cellular and systems reconsolidation in the hippocampus. Neuron, 36 (2002) 527-538
A Neuropsychologically-Inspired Computational Approach to the Generalization of Cerebellar Learning S.D. Teddy1, E.M.-K. Lai2 , and C. Quek3 Centre for Computational Intelligence, School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798 1 [email protected], {2 asmklai, 3 ashcquek}@ntu.edu.sg
Abstract. The CMAC neural network is a well-established computational model of the human cerebellum. A major advantage is its localized generalization property, which allows for efficient computations. However, there are also two major problems associated with this localized associative property. Firstly, it is difficult to fully train a CMAC network, as the training data has to cover the entire set of CMAC memory cells. Secondly, the untrained CMAC cells give rise to undesirable network output when presented with inputs that the network has not previously been trained for. To the best of the authors' knowledge, these issues have not been sufficiently addressed. In this paper, we propose a neuropsychologically-inspired computational approach to alleviate the above-mentioned problems. Motivated by psychological studies on human motor skill learning, a “patching” algorithm is developed to construct a plausible memory surface for the untrained cells in the CMAC network. We demonstrate through the modeling of the human glucose metabolic process that the “patching” of untrained cells offers a satisfactory solution to incomplete training in CMAC.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 155-164, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

The human cerebellum is a brain region in which the neuronal connectivity is sufficiently regular to facilitate a substantially comprehensive understanding of its functional properties. It constitutes a part of the human brain that is important for motor control and a number of cognitive functions [1], including motor learning and memory. The human cerebellum is postulated to function as a movement calibrator [2], which is involved in the detection of movement error and the subsequent coordination of the appropriate skeletal responses to reduce the error [3]. It has been established that the human cerebellum functions by performing associative mappings between the input sensory information and the cerebellar output required for the production of temporal-dependent precise behaviors [4]. The Marr-Albus-Ito model [5] describes how the climbing fibers of the cerebellum perform this function by transmitting moment-to-moment changes in sensory information for movement control.

The Cerebellar Model Articulation Controller (CMAC) [6] is a neural network inspired by the neurophysiological properties of the human cerebellum and is recognized for its localized generalization and rapid algorithmic computations. As a computational model of the human cerebellum, CMAC manifests as an associative memory
156
S.D. Teddy, E.M.-K. Lai, and C. Quek
network [7], which employs error correction signals to drive the network learning and memory formation process. This allows for simple computation, fast training, local generalization and ease of hardware implementation [2], and subsequently motivates the prevalent use of CMAC-based systems [8,9,10].

However, there are two significant issues associated with the effective utilization of the CMAC network. Firstly, it is difficult to fully train the entire CMAC network. As CMAC is a local-learning network [11], comprehensive planning is required to generate a training dataset that trains all of the network cells. Furthermore, the construction of such a training dataset is not always feasible, such as in the modeling of ill-defined problems for which only a limited number of observations is available. Secondly, the behavior of a CMAC network is undefined for the untrained regions of the network. Although the learning convergence property of the CMAC network has been well established, this merely implies that the learning stability of a CMAC-based system is guaranteed within the trained regions. Consequently, the performance of a CMAC-based system remains very much dependent on the careful planning of the network training profile. To the best of the authors' knowledge, there has been no previous attempt to address the problem of incomplete training in the CMAC network.

Neurophysiological studies on the human brain have established that the cerebellum plays a significant role in the learning and acquisition of motor skills [12]. Behavioral research on skill learning has provided evidence that humans as well as animals have the innate ability to adapt and generalize skills acquired in a well-trained motor task to novel but similar situations [12,13,14]. There are generally two types of motor skill generalization: motor adaptation [14] and contextual interference [12]. Motor adaptation refers to the capability to adapt the execution of a well-trained motor task to changes in the external environment where the task is to be performed. Contextual interference, on the other hand, refers to the phenomenon whereby the training acquired on a specific motor task influences the learning process of another novel but similar task. Physiological as well as psychological evidence of generalized learning in motor skill acquisition has been presented in the literature [15,16]. Motor skill generalization offers a plausible insight into human behavioral responses towards novel stimuli and changing work environments.

Inspired by the neurophysiology of the human cerebellum as a movement coordinator and the corresponding psychological studies on generalization and adaptation in human motor skill acquisition, we propose a computational approach to alleviate the problem of incomplete training in the CMAC network. This approach, referred to as "patching" in the paper, constructs a plausible memory surface for the untrained memory cells in a CMAC network. The proposed "patching" technique is subsequently evaluated through the modeling of the human glucose metabolic process using the CMAC network.

The rest of the paper is organized as follows. Section 2 briefly describes the neurophysiological aspects of the cerebellar learning process and outlines the basic principles of the CMAC neural network. Section 3 presents the proposed "patching" technique. The modeling of the human glucose metabolic process using a CMAC network is presented in Section 4 to evaluate the effectiveness of the "patching" technique. Section 5 concludes this paper.
A Neuropsychologically-Inspired Computational Approach
2 CMAC Network and the Cerebellar Learning Process

The human cerebellum functions primarily as a movement regulator; although it is not essential for motor control, it is crucial for the precise, rapid and smooth coordination of movements [2]. In order to effectively accomplish its motor regulatory functions, the cerebellum is provided with an extensive repertoire of information about the objectives (intentions), actions (motor commands) and outcomes (feedback signals) associated with a physical movement. The cerebellum evaluates the disparities between the formulated intention and the executed action and subsequently adjusts the operations of the motor centers to affect and regulate the ongoing movement. Studies in neuroscience have established that the cerebellum performs an associative mapping from the input sensory afferent and cerebral efferent signals to the cerebellar output, which is subsequently transmitted back to the cerebral cortex and spinal cord through the thalamus [17,18]. This physiological process of constructing an associative pattern map constitutes the underlying neuronal mechanism of learning in the human cerebellum.

The human cerebellum has been classically modelled by the Cerebellar Model Articulation Controller (CMAC) [6,7]. The model was proposed to explain the information-processing characteristics of its biological counterpart. The CMAC network functions as an associative memory that models the non-linear mapping between the mossy fiber inputs and the Purkinje cell outputs of the cerebellum. The massive mesh of granule cell encoders in the cerebellum corresponds to an association layer that generates a sparse and extended representation of the mossy fiber inputs. The synaptic connections between the parallel fibers and the dendrites of the Purkinje cells form an array of modifiable synaptic weights that motivates the grid-like CMAC computing structure. In the human cerebellum, these modifiable synaptic weights are linearly combined by the Purkinje cells to form the cerebellar output. In CMAC, the network output is computed by aggregating the memory contents of the active computing cells.

The CMAC network is essentially a multi-dimensional memory array, where an input acts as the address decoder to access the respective memory (computing) cells containing the adjustable weight parameters that constitute the corresponding output. In the CMAC network, the memory cells are uniformly quantized to cover the entire input space. The operation of the CMAC network is then characterized by the table lookup access of its memory cells. Each input vector to the CMAC network selects a set of active computing cells from which the output of the network is computed. Similarly, CMAC learns the correct output response to each input vector by modifying the contents of the selected memory locations.

This paper employs a generic cerebellar associative memory model which is based on a single-layered implementation of the CMAC network. Such an associative network has only one layer of network cells, but maintains the computational principles of the CMAC network by adopting a neighborhood-based activation of its computing cells. The layered cell activations in the original CMAC network contribute to three significant computational principles: (1) smoothing of the computed output; (2) facilitation of a distributed learning paradigm; and (3) activation of similar or highly correlated computing cells in the CMAC input space. These three modeling principles are similarly conserved in the single-layered cerebellar associative memory via the introduction of neighborhood-based computations. The activation of neighboring cells
corresponds to the simultaneous activation of the highly correlated cells in its multilayered counterpart, and it also contributes to the smoothing of the computed output, since the neighborhood-based activation process results in continuity of the output surface. Figure 1 depicts the memory cell structure of such a single-layered implementation of the CMAC network.

Fig. 1. The memory cell structure of a 2-input CMAC network

The single-layered CMAC network employs a Weighted Gaussian Neighborhood Output (WGNO) computational process, where a set of neighborhood-bounded computing cells is activated to derive an output response to the input stimulus. For each input stimulus X, the computed output is derived as follows:

Step 1: Determine the region of activation. Each input stimulus X activates a neighborhood of CMAC computing cells. The neighborhood size is governed by the neighborhood constant parameter N, and the activated neighborhood is centered at the input stimulus.
Step 2: Compute the Gaussian weighting factors. Each activated cell has a varied degree of activation that is inversely proportional to its distance from the input stimulus. These degrees of activation function as weighting factors for the memory contents of the active cells.
Step 3: Retrieve the CMAC output. The output is the weighted sum of the memory contents of the active cells.

Following this, the single-layered CMAC network adopts a modified Widrow-Hoff learning rule [19] to implement a Weighted Gaussian Neighborhood Update (WGNU) learning process. The network update process is briefly described as follows:

Step 1: Computation of the network output. The output of the network corresponding to the input stimulus X is computed based on the WGNO process.
Step 2: Computation of the learning error. The learning error is defined as the difference between the expected output and the computed output of the network.
Step 3: Update of active cells. The learning error is subsequently distributed to all of the activated cells based on their respective weighting factors. Each active cell then updates its memory content.
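The WGNO and WGNU steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the input has already been quantized to a cell index, uses a Gaussian of width `sigma` over a square neighborhood of radius `n` that is normalized to sum to one, and scales the distributed error by a learning constant `lr`. The function names and these parameter choices are hypothetical.

```python
import numpy as np


def gaussian_neighborhood(shape, center, n, sigma):
    """Indices and normalized Gaussian weights of the cells within n cells of
    `center` along every axis (assumed neighborhood shape, clipped at borders)."""
    ranges = [np.arange(max(0, c - n), min(s, c + n + 1)) for c, s in zip(center, shape)]
    grids = np.meshgrid(*ranges, indexing="ij")
    idx = tuple(g.ravel() for g in grids)
    sq_dist = sum((g.ravel() - c) ** 2 for g, c in zip(grids, center))
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return idx, weights / weights.sum()


def wgno(memory, center, n=1, sigma=1.0):
    """WGNO: weighted sum of the memory contents of the activated cells."""
    idx, w = gaussian_neighborhood(memory.shape, center, n, sigma)
    return float(np.dot(w, memory[idx]))


def wgnu(memory, center, target, lr=0.1, n=1, sigma=1.0):
    """WGNU: distribute the learning error to the activated cells in
    proportion to their weighting factors. Returns the error before the update."""
    idx, w = gaussian_neighborhood(memory.shape, center, n, sigma)
    error = target - float(np.dot(w, memory[idx]))
    memory[idx] += lr * error * w
    return error


memory = np.zeros((8, 8))                 # 2-input CMAC, initialized to zero
for _ in range(1000):
    wgnu(memory, center=(3, 4), target=2.0)
print(wgno(memory, center=(3, 4)))        # approaches the target of 2.0
print(wgno(memory, center=(7, 0)))        # untrained neighborhood: output stays 0.0
```

The last line illustrates the empty cell problem addressed in the next section: a query whose neighborhood contains only untrained cells returns the initial memory content.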
3 Learning of Untrained CMAC Cells

It is not always feasible to generate a training profile that trains all the memory cells in the CMAC network. In such cases, the empty cell phenomenon occurs whenever the test input falls within the clusters of untrained cells, resulting in an undesirable network output. However, this problem can be alleviated by constructing a plausible memory surface for the untrained cells of the CMAC network. Such a construction process is referred to as "patching" in this paper. The "patching" algorithm proposed in the paper is inspired by the neuropsychological aspects of human motor skill learning.

In human and animal behavioral research, the transfer of learning [13], or motor skill generalization, is a well-established phenomenon of skill acquisition. Humans, as well as animals, have innate abilities to adapt and generalize skills acquired in a well-trained motor task to novel but similar situations [12,13]. The motor skill generalization ability observed in humans can be broadly categorized into: (1) motor adaptation [14] and (2) contextual interference [12].

Motor adaptation refers to the capacity to adapt the execution of a well-trained motor task to changes in the external environment where the task is to be performed [15]. Such generalization capability was demonstrated in a study conducted by Palmer and Meyer [20]. In that study, experienced pianists were first asked to learn a new piece of music. They were subsequently asked to play a variation of the melody which required different combinations of hand and finger movements. The study concluded that motor skill learning is not simply a matter of acquiring specific muscle movements, because experienced learners are able to transfer their skills to new situations that require them to produce the same general pattern of movements with different muscle groups [20]. Contextual interference, on the other hand, covers a broader scope of skill generalization.
It refers to the phenomenon whereby the training acquired on a specific motor task influences the learning process of another novel but similar task. Such generalization capability was demonstrated and studied in [21,16]. Although much less is known about the exact neurophysiological processes underlying these motor generalization phenomena, psychological studies [12] have suggested that there is a correlation between the amount of skill transfer and the similarity in the skill executions. Generally, the more similar the two tasks are, the greater the influence of one on the other [13].

The local generalization characteristic of the CMAC network is based on the computational principle that similar inputs should produce similar outputs. Governed by this notion, this paper proposes a "patching" approach to construct a plausible memory surface for the untrained CMAC memory cells to address the problem of incomplete training. The proposed "patching" algorithm interpolates the memory surfaces from the trained memory cells to the regions of untrained memory cells in the CMAC network. Starting from the outermost edge of an untrained region, the memory content of an untrained cell is defined as the weighted average of the contents of its trained direct neighbors. For an arbitrary cell $c_{i,j}$ at the edge of an untrained region (see Figure 2), the "patched" value $w_{i,j}$ is computed as:

$$w_{i,j} = \sum_{k=i-1}^{i+1} \sum_{l=j-1}^{j+1} \frac{1/d_{k,l}}{\sum_{m=i-1}^{i+1} \sum_{n=j-1}^{j+1} 1/d_{m,n}} \, w_{k,l} \tag{1}$$
where $d_{k,l}$ denotes the distance from the empty cell $c_{i,j}$ to its fully-trained neighboring cell $c_{k,l}$. The interpolated values are then propagated iteratively towards the center of the untrained region (see Figure 2). The "patching" process results in smooth transitions of the constructed characteristic surface in the clusters of untrained cells.

Fig. 2. The workings of the proposed "patching" algorithm (legend: fully-trained cells; untrained cells patched in the 1st iteration; untrained cells patched in the 2nd iteration)
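The iterative procedure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the CMAC memory is a 2-D array, a Boolean mask marks the trained cells, the direct neighbors are the eight surrounding cells, and $d_{k,l}$ is taken to be the Euclidean distance between cell indices (1 for edge neighbors, $\sqrt{2}$ for diagonal ones). The function name `patch_untrained` is hypothetical.

```python
import numpy as np


def patch_untrained(memory, trained):
    """Fill untrained cells with the inverse-distance-weighted average of their
    trained direct neighbors (Eq. 1), sweeping inward one layer per iteration."""
    memory = np.array(memory, dtype=float)
    trained = np.array(trained, dtype=bool)
    rows, cols = memory.shape
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    while not trained.all():
        updates = {}
        for i, j in zip(*np.nonzero(~trained)):
            num = den = 0.0
            for di, dj in offsets:
                nr, nc = i + di, j + dj
                if 0 <= nr < rows and 0 <= nc < cols and trained[nr, nc]:
                    inv_d = 1.0 / np.hypot(di, dj)   # assumed Euclidean cell distance
                    num += inv_d * memory[nr, nc]
                    den += inv_d
            if den > 0.0:                            # cell lies on the edge of the region
                updates[(i, j)] = num / den
        if not updates:                              # no trained cell to interpolate from
            break
        for (i, j), value in updates.items():        # commit the whole layer at once
            memory[i, j] = value
            trained[i, j] = True
    return memory


row = np.array([[1.0, 0.0, 0.0, 0.0, 5.0]])
mask = np.array([[True, False, False, False, True]])
print(patch_untrained(row, mask))   # edge cells first, then the center cell
```

Committing each layer only after the sweep keeps a cell patched in the current iteration from influencing its neighbors until the next one, which matches the first-iteration and second-iteration layers shown in Figure 2.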
4 Case Study: Modeling of Human Glucose Metabolism

Diabetes is a chronic disease in which the body is unable to properly and efficiently regulate the use and storage of glucose in the blood. This results in large perturbations of the plasma glucose level, leading to hyperglycemia (elevated glucose level) or hypoglycemia (depressed glucose level). Chronic hyperglycemia causes severe damage to the eyes, kidneys, nerves, heart and blood vessels of diabetic patients, while severe hypoglycemia can deprive the body of energy and cause a patient to lose consciousness, which can eventually become life-threatening.

Currently, the treatment of diabetes is based on a two-pronged approach: strict dietary control and insulin medication. The key to successful management of diabetes is essentially the ability to maintain a long-term near-normoglycemic state in the patient. With respect to this objective, the therapeutic effect of discrete insulin injections is not ideal, as the regulation of insulin is an open-loop process. Continuous insulin infusion through an insulin pump, on the other hand, is a more viable approach due to its controllable infusion rate [22]. Such insulin pumps are algorithm-driven, with an avalanche of techniques proposed, investigated and reported in the literature over the years [23,24]. All such proposed methods require some form of modeling of the patient's glucose metabolic process before a suitable control regime can be devised.

In this section, the performance of the proposed "patching" algorithm is evaluated through the modeling of the dynamics of the human blood glucose cycle using a CMAC network. For this application, it is not easy to collect a dataset of observations that captures every combination of factors affecting the blood glucose level.
4.1 Materials and Method

Due to the lack of real-life patient data and the logistical difficulties and ethical issues involved in the collection of such data, a well-known web-based simulator named GlucoSim [25] from the Illinois Institute of Technology (IIT) is employed to simulate a healthy subject and to generate the blood glucose data needed for the construction of the human glucose metabolism model. The objective of the experiment is to apply the CMAC network to the modeling of the glucose metabolism of a healthy person. The simulated healthy person, Subject A, is a typical middle-aged Asian male. His body mass index (BMI) is 23.0, which is within the recommended range for Asians. Based on the personal profile of Subject A, his recommended daily allowance (RDA) of carbohydrate intake from meals is obtained from the Health Promotion Board of Singapore website [26]. According to his sex, age, weight and lifestyle, the recommended daily carbohydrate intake for Subject A is approximately 346.9 g.

It is hypothesized that the human blood glucose level at any given time is a nonlinear function of prior food intakes and the historical traces of the insulin and blood glucose levels. To properly account for the effects of prior food ingestion on the body's blood glucose level, a historical window of six hours is adopted. A soft-windowing strategy is employed to partition the six-hour historical window into three conceptual segments, namely: Recent Window (i.e. previous 1 hour), Intermediate Past Window (i.e. previous 1 to 3 hours) and Long Ago Window (i.e. previous 3 to 6 hours). Based on these windows, three normalized weighting functions are introduced to compute the carbohydrate content of the meal(s) taken within the recent, intermediate past or long ago periods. The details of the data collection process are reported in [27]. In summary, including the past blood glucose and insulin levels, there are a total of five inputs to the CMAC network.

A total of 100 days of glucose metabolic data for Subject A are generated using GlucoSim. The carbohydrate contents and the timings of the daily meals were varied on a daily basis during the data collection phase. This ensures that the CMAC network is not trained on a cyclical data set, but is employed to learn the inherent relationships between the food intakes and the glucose metabolic process of a healthy person. The collected data set is partitioned into two non-overlapping groups: 80 days of data for training and the remaining 20 days for evaluation.

4.2 Results

A CMAC network with a memory size of 8 cells per dimension was constructed to model the blood glucose dynamics of Subject A. The network was trained using the training dataset for 1000 training iterations with a learning constant of 0.1. A Root Mean Squared Error (RMSE) of 6.3187 mg/ml and a Pearson correlation of 98.97% were achieved. The trained network was subsequently evaluated with the 20-day test set. Figure 3 gives a 3-day snapshot of the modeling accuracy of the trained CMAC network on the test set.

The observed empty cell phenomena in the CMAC-based glucose metabolism model are highlighted in Figure 4, which depicts a one-day snapshot of the modeled blood glucose cycle during the evaluation phase. As the CMAC network was initialized to
zero prior to training, the activation of the untrained CMAC cells results in zero network outputs. One can observe that the empty cell phenomena result in poor and inaccurate performance of the CMAC glucose metabolic model.

The proposed "patching" technique is subsequently applied to the incompletely trained CMAC network to fill in the untrained cells. Figure 5 depicts the performance of the "patched" network for the same day in the evaluation period as Figure 4. It can be observed that the "patching" technique eliminates the empty cell phenomena and significantly improves the performance of the CMAC network. Table 1 quantitatively outlines the performance of the CMAC network before and after the "patching" process. A 76.51% improvement in the maximum error is noted. The simulation results presented in Figure 5 and Table 1 demonstrate the effectiveness of the proposed "patching" technique in addressing the problem of incomplete training in the CMAC network.

Fig. 3. Modeling results of the CMAC network on the glucose metabolic process of Subject A (computed vs. expected blood glucose in mg/ml against simulation time in minutes)

Fig. 4. 1-day snapshot of the modeling results of the CMAC network before "patching" (computed vs. expected blood glucose in mg/ml against simulation time in minutes; the access of empty cells is marked)
Fig. 5. 1-day snapshot of the modeling results of the CMAC network after "patching" (computed vs. expected blood glucose in mg/ml against simulation time in minutes; the access of reconstructed cells is marked)

Table 1. Testing results of the CMAC-based blood glucose modeling

CMAC Network         Maximum Error (mg/ml)   RMSE (mg/ml)   Pearson Correlation [%]
Before "patching"    190.006                 10.7788        96.73
After "patching"     44.635                  8.2548         98.08
Gain                 76.51%                  23.42%         1.40%
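The three measures in Table 1 and its Gain row can be checked with a short Python sketch. It assumes that the maximum error is the largest absolute deviation between the computed and expected series and that each gain is the relative change with respect to the value before "patching"; the paper does not spell out either definition, and the function names are hypothetical.

```python
import numpy as np


def evaluation_metrics(expected, computed):
    """Maximum absolute error, RMSE and Pearson correlation (in percent)."""
    expected = np.asarray(expected, dtype=float)
    computed = np.asarray(computed, dtype=float)
    error = computed - expected
    return {
        "max_error": float(np.max(np.abs(error))),
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        "pearson_pct": float(100.0 * np.corrcoef(expected, computed)[0, 1]),
    }


def relative_gain(before, after, lower_is_better=True):
    """Improvement relative to the value before patching, in percent (assumed definition)."""
    change = before - after if lower_is_better else after - before
    return 100.0 * change / before


# Values copied from Table 1.
gains = {
    "max_error": relative_gain(190.006, 44.635),
    "rmse": relative_gain(10.7788, 8.2548),
    "pearson": relative_gain(96.73, 98.08, lower_is_better=False),
}
print({name: round(value, 2) for name, value in gains.items()})
```

Under these assumptions the computed gains agree with the Gain row of Table 1 to two decimal places.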
5 Conclusions

In this paper, we have presented a novel neuropsychologically-inspired computational approach to address the problem of incomplete training in a CMAC-based system. An empty cell phenomenon occurs whenever the CMAC inputs fall within the clusters of untrained memory cells, and results in undesirable network outputs. The proposed "patching" technique alleviates this deficiency by interpolating the memory surfaces around the regions of untrained cells to construct plausible memory contents for these untrained cells. The "patching" technique was evaluated through the modeling of the human glucose metabolic process with the CMAC network. The evaluation results demonstrate the effectiveness of the "patching" technique, as significant improvements were noted in the performance of the "patched" CMAC network. Further research in this direction includes a more detailed evaluation of the "patching" technique as well as the exploration of more sophisticated interpolation functions.
References

1. Middleton, F.A., Strick, P.L.: The cerebellum: An overview. Trends in Cognitive Sciences 27(9) (1998) 305–306
2. Albus, J.S.: Marr and Albus theories of the cerebellum: two early models of associative memory. Proc. IEEE Compcon (1989)
3. Albus, J.S.: A theory of cerebellar function. Math. Biosci. 10(1) (1971) 25–61
4. Kandel, E.R., Schwartz, J.H., Jessell, T.M.: Principles of Neural Science. 4th edn. McGraw-Hill (2000)
5. Marr, D.: A theory of cerebellar cortex. J. Physiol. London 202 (1969) 437–470
6. Albus, J.S.: A new approach to manipulator control: The Cerebellar Model Articulation Controller (CMAC). J. Dyn. Syst. Meas. Control, Trans. ASME (1975) 220–227
7. Albus, J.S.: Data storage in the Cerebellar Model Articulation Controller (CMAC). J. Dyn. Syst. Meas. Control, Trans. ASME (1975) 228–233
8. Yamamoto, T., Kaneda, M.: Intelligent controller using CMACs with self-organized structure and its application for a process system. IEICE Trans. Fundamentals 82(5) (1999) 856–860
9. Wahab, A., Tan, E.C., Abut, H.: HCMAC amplitude spectral subtraction for noise cancellation. Intl. Conf. Neural Inform. Processing (2001)
10. Huang, K.L., Hsieh, S.C., Fu, H.C.: Cascade-CMAC neural network applications on the color scanner to printer calibration. Intl. Conf. Neural Networks 1 (1997) 10–15
11. Miller, W.T., Glanz, F.H., Kraft, L.G.: CMAC: An associative neural network alternative to backpropagation. Proc. IEEE 78(10) (1990) 1561–1657
12. Tomporowski, P.D.: The Psychology of Skill: A Life-Span Approach. Praeger (2003)
13. Mazur, J.E.: Learning and Behavior. Pearson/Prentice Hall (2006)
14. Scheidt, R.A., Dingwell, J.B., Mussa-Ivaldi, F.A.: Learning to move amid uncertainty. Journal of Neurophysiology 86 (2001) 971–985
15. Lam, T., Dietz, V.: Transfer of motor performance in an obstacle avoidance task to different walking conditions. Journal of Neurophysiology 92 (2004) 2010–2016
16. Chen, Y., et al.: The interaction of a new motor skill and an old one: H-reflex conditioning and locomotion in rats. Journal of Neuroscience 25(29) (2005) 6898–6906
17. Houk, J.C., Buckingham, J.T., Barto, A.G.: Models of the cerebellum and motor learning. Behavioral and Brain Sciences 19(3) (1996) 368–383
18. Tyrrell, T., Willshaw, D.: Cerebellar cortex: Its simulation and the relevance of Marr's theory. Philosophical Transactions: Biological Sciences 336(1277) (1992) 239–257
19. Widrow, B., Stearns, S.D.: Adaptive Signal Processing. Prentice-Hall (1985)
20. Palmer, C., Meyer, R.K.: Conceptual and motor learning in music performance. Psychological Science 11(1) (2000) 63–68
21. Weigelt, C., et al.: Transfer of motor skill learning in association football. Ergonomics 43(10) (2000) 1698–1707
22. Fletcher, L., et al.: Feasibility of an implanted, closed-loop, blood-glucose control device. Immunology 230 (2001)
23. Schetky, L.M., Jardine, P., Moussy, F.: A closed loop implantable artificial pancreas using thin film nitinol MEMS pumps. Proceedings of International Conference on Shape Memory and Superelastic Technologies (SMST-2003) (2003)
24. Sorensen, J.T.: A Physiologic Model of Glucose Metabolism in Man and its Use to Design and Assess Improved Insulin Therapies for Diabetes. PhD thesis, Department of Chemical Engineering, MIT (1985)
25. Illinois Institute of Technology: GlucoSim: A web-based educational simulation package for glucose-insulin levels in the human body. (Online: http://216.47.139.198/glucosim/gsimul.html)
26. Health Promotion Board Singapore. (Online: http://www.hpb.gov.sg)
27. Tung, W.L., Teddy, S.D., Zhao, G.: Neuro-cognitive approaches to the control and regulation of insulin for the treatment of diabetes mellitus. Phase 1: Neurologically inspired modeling of the human glucose metabolic process. Technical Report C2i-TR-05/002, Center for Computational Intelligence, School of Computer Engineering, Nanyang Technological University, Singapore (2005)
A Neural Model for Stereo Transparency with the Population of the Disparity Energy Models

Osamu Watanabe

Muroran Institute of Technology, Muroran, Hokkaido 050–8585, Japan
[email protected]
Abstract. The disparity energy model can quantitatively explain the physiological properties of binocular neurons in early visual cortex. Therefore, many physiologically plausible models for binocular stereopsis have employed the disparity energy model as a model neuron. These models can explain a variety of psychological data. However, most of them cannot handle stereo transparency. Here, we develop a simple model for transparency perception with the disparity energy model, and examine its ability to detect overlapping disparities. Computer simulations showed that the transparency detection properties of the model are consistent with many psychophysical findings.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 165–174, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

The disparity energy model is a hierarchical model for binocular neurons in early visual cortex and consists of simple and complex cells [4]. One of the important properties of the disparity energy model is that the receptive fields (RFs) of the simple cells are characterized by Gabor functions. This property suggests that, theoretically, there are three types of encoding methods for binocular disparity [1], i.e., positional coding, phase coding, and hybrid coding. Positional coding is equivalent to the disparity coding method employed by many conventional models for binocular stereopsis, that is, disparities are detected by the positional discrepancy between the left and right RFs. In phase coding, on the other hand, the RF positions of both eyes are identical, and disparities are encoded by the phase shift between the two Gabor functions corresponding to the left and right RFs. In hybrid coding, the preferred disparity of each neuron is determined by both the position and the phase shifts between the two RFs.

It is known that the disparity energy model can explain the physiological properties of binocular neurons quantitatively [4], and it was reported that binocular disparities are detected by phase or hybrid coding rather than positional coding [1]. Based on these physiological studies, many biologically plausible models for binocular stereopsis have employed the disparity energy model with phase or hybrid coding. These models elucidated many aspects of stereo computation in the brain, but some open issues remain.

One of the important problems that have been left unsolved is stereo transparency. When fusing a stereogram generated by overlapping two random-dot stereograms (RDSs) with different disparities, observers can perceive the two disparities
simultaneously at the same region of the visual field. Transparency perception is not only a detection problem, but is also related to the biological mechanism that handles such contradictory percepts. This phenomenon suggests that simple models that reconstruct a single-valued disparity map cannot model the stereo mechanism in the brain.

In the present study, we propose a modification of Tsai and Victor's stereo model [8] that employs the hybrid-type disparity energy model, and examine the model's ability to detect stereo transparency. Computer simulations show that the proposed model can qualitatively explain well-known properties of stereo transparency.
2 Neural Model for Stereo Transparency

2.1 Definition of the Disparity Energy Model
The disparity energy model consists of two layers corresponding to simple and complex cells [4]. In simple cells, the left and right eyes' images, $I_l(x)$ and $I_r(x)$, are filtered with RFs, $f_l(x)$ and $f_r(x)$, given by Gabor functions as follows:

$$f_l(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{x^2}{2\sigma^2}} \cos(\omega x + \phi_l), \tag{1}$$

$$f_r(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x+d)^2}{2\sigma^2}} \cos(\omega (x+d) + \phi_r), \tag{2}$$

where $\omega$ represents the preferred frequency, $\phi_l$ and $\phi_r$ the RF phases, $d$ the position shift between the two eyes, and $\sigma$ the Gaussian width determining the RF size. For simplicity, one-dimensional RFs are considered. The response of the simple cell is given by

$$r_s = \int dx \, \left[ f_l(x) I_l(x) + f_r(x) I_r(x) \right]. \tag{3}$$

The response of the complex cell is given by the sum of the squared responses of two simple cells, $r_s$ and $\bar{r}_s$, whose RF phases differ by 90°:

$$r_c = (r_s)^2 + (\bar{r}_s)^2. \tag{4}$$
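Equations (1) to (4) can be evaluated numerically as in the following Python sketch. It is only an illustration under stated assumptions: the integral in Eq. (3) is approximated by a sum over pixel positions, the quadrature partner is obtained by adding 90° to both RF phases, and the parameter values, the random-dot pattern and the function names are hypothetical.

```python
import numpy as np


def gabor(x, sigma, omega, phase, shift=0.0):
    """Gabor RF of Eqs. (1)-(2): Gaussian envelope times a cosine carrier."""
    envelope = np.exp(-((x + shift) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return envelope * np.cos(omega * (x + shift) + phase)


def simple_response(x, img_l, img_r, sigma, omega, phi_l, phi_r, d):
    """Eq. (3), with the integral replaced by a Riemann sum over pixels."""
    dx = x[1] - x[0]
    f_l = gabor(x, sigma, omega, phi_l)
    f_r = gabor(x, sigma, omega, phi_r, shift=d)
    return float(np.sum(f_l * img_l + f_r * img_r) * dx)


def complex_response(x, img_l, img_r, sigma, omega, dphi, d, phi_l=0.0):
    """Eq. (4): energy of a quadrature pair of simple cells with phase shift dphi."""
    phi_r = phi_l - dphi                      # dphi = phi_l - phi_r
    r_s = simple_response(x, img_l, img_r, sigma, omega, phi_l, phi_r, d)
    r_q = simple_response(x, img_l, img_r, sigma, omega,
                          phi_l + np.pi / 2.0, phi_r + np.pi / 2.0, d)
    return r_s ** 2 + r_q ** 2


x = np.arange(-64.0, 65.0)                    # pixel positions
rng = np.random.default_rng(0)
dots = rng.choice([-1.0, 1.0], size=x.size)   # 1-D random-dot pattern shown to both eyes
omega, sigma = 2.0 * np.pi / 16.0, 8.0        # hypothetical tuning parameters
in_phase = complex_response(x, dots, dots, sigma, omega, dphi=0.0, d=0.0)
anti_phase = complex_response(x, dots, dots, sigma, omega, dphi=np.pi, d=0.0)
print(in_phase, anti_phase)
```

With identical images in both eyes and no position shift, a phase shift of $\pi$ makes the right RF the negative of the left RF, so the anti-phase response vanishes up to rounding error while the in-phase response does not.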
The preferred disparity of the complex cell is determined by the phase shift $\Delta\phi = \phi_l - \phi_r$ and the position shift $d$. In the positional coding model, all neurons have $\Delta\phi = 0$, whereas in the phase coding model, all neurons have $d = 0$.

2.2 Population Activity to Stereo Images
When an input stimulus $I(x)$ is presented with a disparity $D$, the stereo pair can be represented as $I_l(x) = I(x)$ and $I_r(x) = I(x + D)$. In this case, the output of the disparity energy model after pooling is approximately written as a sinusoidal function as follows [8]:

$$r_c \approx \mu \left[ 1 + \lambda \cos(\Delta\phi + \omega d - \theta) \right]. \tag{5}$$
The input disparity $D$ is encoded by the amplitude $\lambda = e^{-(D-d)^2/(4\sigma^2)}$ of the population response as well as by its peak position $\theta = \omega D$. $\mu$ represents the average response of the population and depends on the energy of the input pattern $I(x)$. An example of a population response to an RDS with a disparity of 0 pixels is shown in Fig. 1a. This result shows that Eq. (5) approximates the model response well.

A transparent RDS is generated by overlapping two dot patterns, $I_1(x)$ and $I_2(x)$, each of which has a single disparity, $D_1$ and $D_2$, respectively. The stereo pair of the transparent RDS is given by $I_l(x) = I_1(x) + I_2(x)$ and $I_r(x) = I_1(x + D_1) + I_2(x + D_2)$. When the two dot patterns, $I_1(x)$ and $I_2(x)$, are uncorrelated, the population response to the transparent RDS equals the sum of the responses to the individual patterns [10]. Figure 1b illustrates the population response to a transparent RDS. There were two peaks in the population activity. In general, however, these peaks do not correspond exactly to the two input disparities, because the responses to the individual patterns interfere with each other. For example, in Fig. 1b, the amplitude at $d = 0$ becomes 0 because the responses to the two disparities cancel each other out. Therefore, multiple disparities should be detected from the whole shape of the population activity.

Fig. 1. Population responses of the disparity energy model (axes: position shift $d$ in pixels, phase shift $\Delta\phi$ in rad, normalized response $r_c$). (a) Response to an RDS with 0 disparity. (b) Response to a transparent RDS. Filled circles on the $d$-$r_c$ plane represent the amplitudes of the sub-populations in which all neurons have the same position shift. Thin and thick curves represent the theoretical values of the amplitudes for the individual surfaces and for their sum, respectively.

2.3 Reading Population Codes
In this section, we describe a disparity decoding method that uses the whole population activity of the disparity energy models. In general, a neural response rc varies from trial to trial even if the same disparity D is presented, because the response contains noise. Let the statistical property of the neural response to a disparity D be represented by a conditional probability P[rc|D]. According to Bayes' theorem, the disparity estimate $\hat{D}$ can be derived from a population response rc as follows [5]:

$$\hat{D} = \arg\max_{D} P[D \mid r_c] = \arg\max_{D} P[r_c \mid D] \cdot P[D]. \qquad (6)$$
In the present model, we assume that the responses of individual neurons are independent (i.e., $P[r_c \mid D] = \prod_{\omega,\Delta\phi,d} P[r_c(\omega,\Delta\phi,d) \mid D]$) and that the input disparity D is distributed uniformly (i.e., P[D] = const.). Employing a normal distribution as the conditional probability P[rc|D], Eq. (6) can be represented as

$$\hat{D} = \arg\min_{D} \sum_{\omega,\Delta\phi,d} \left\{ r_c(\omega,\Delta\phi,d) - \mu(\omega) \left[ 1 + e^{-\frac{(D-d)^2}{4\sigma^2}} \cos(\Delta\phi + \omega d - \omega D) \right] \right\}^2, \qquad (7)$$

where rc(ω, Δφ, d) represents the response of a neuron whose preferred spatial frequency, phase shift, and position shift are ω, Δφ, and d, respectively, and μ(ω) represents the average response of the neurons whose preferred frequency is ω. Equation (7) is equivalent to the template matching procedure [8]; an arbitrary population rc is assigned a disparity by finding which canonical, or template, population (Eq. (5)) matches it best. In computer simulations, we calculated the sum of squared errors (SSEs) given by Eq. (7) and regarded the disparities that minimize the SSEs as the model estimates. We used multiple frequency channels for estimation. If an input stimulus does not contain a certain frequency band, the average response μ(ω) of the channel tuned to that frequency becomes 0. In this case, the channel cannot contribute to disparity estimation. This property corresponds to “the stimulus dependent weight” in the Tsai and Victor model [8].
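As a rough illustration, the template matching of Eq. (7) can be sketched in Python with NumPy. The single-channel population, the additive Gaussian noise level, the grid of candidate disparities, and the names `template` and `decode` are assumptions made for this sketch; they are not the settings or code of the original simulations.

```python
import numpy as np

def template(D, omega, dphi, d, mu=1.0):
    """Canonical population response for disparity D (the bracketed term of Eq. (7))."""
    sigma = np.pi / omega  # assumed RF size, sigma = pi / omega
    envelope = np.exp(-(D - d) ** 2 / (4.0 * sigma ** 2))
    return mu * (1.0 + envelope * np.cos(dphi + omega * d - omega * D))

def decode(r, candidates, omega, dphi, d, mu=1.0):
    """Return the candidate disparity with the smallest SSE, plus the SSE curve."""
    sse = np.array([np.sum((r - template(D, omega, dphi, d, mu)) ** 2)
                    for D in candidates])
    return candidates[int(np.argmin(sse))], sse

# Hypothetical single-channel population (one preferred spatial frequency).
omega = 2.0 * np.pi / 32.0                        # 1/32 cycle/pixel
dphi_axis = np.arange(-np.pi, np.pi, np.pi / 8)   # phase shifts (rad)
d_axis = np.arange(-64.0, 65.0, 4.0)              # position shifts (pixel)
dphi, d = np.meshgrid(dphi_axis, d_axis, indexing="ij")

# Noisy response to a stimulus with a known disparity (noise level is arbitrary).
rng = np.random.default_rng(0)
true_D = 8.0
r = template(true_D, omega, dphi, d) + 0.05 * rng.standard_normal(dphi.shape)

candidates = np.arange(-50.0, 51.0, 1.0)
D_hat, sse = decode(r, candidates, omega, dphi, d)
print(D_hat)
```

Under a Gaussian noise model with a uniform prior, maximizing the posterior in Eq. (6) reduces to minimizing the squared error, which is why the sketch only needs an `argmin` over the SSE curve.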
3 Performance of Disparity Detection
In this section, we show the results of computer simulations and compare them to psychophysical findings. In the computer simulations, the preferred frequency ω/(2π) ranged from 1/16 to 1/128 cycle/pixel and was sampled at half-octave intervals; therefore, 7 frequency channels were used for the calculations. For all frequency channels, the phase shifts Δφ varied from −π to +7π/8 in steps of π/8, and the position shifts d varied from −64 to +64 pixels in steps of 4 pixels. The RF sizes were set at σ = π/ω to fix the frequency bandwidths at 1.14 octaves for all frequency channels. Population responses were obtained by pooling the responses of 1,000 individual trials. The dot densities of the input RDSs were 25%.
3.1 Single Surface
We first consider a single channel tuned to a frequency ω. The SSE between the population response to an RDS with a disparity D and the template response to a disparity D' is approximately given by

$$E(D'; D, \omega) \propto \mu(\omega)\,\sigma \left[ 1 - e^{-\frac{(D-D')^2}{8\sigma^2}} \cos(\omega D - \omega D') \right]. \qquad (8)$$

Therefore, when we use multiple channels, the SSE curve is given by

$$E_1(D'; D) = \sum_{\omega} E(D'; D, \omega). \qquad (9)$$
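The analytic SSE curves of Eqs. (8) and (9) can be evaluated directly. The following is a minimal sketch that assumes a flat average response μ(ω) = 1 for every channel and drops the proportionality constant; the function names are chosen for illustration only.

```python
import numpy as np

# Seven half-octave channels from 1/16 to 1/128 cycle/pixel.
freqs = 2.0 ** (-np.arange(4.0, 7.5, 0.5))
omegas = 2.0 * np.pi * freqs

def sse_single(D_template, D, omega, mu=1.0):
    """Right-hand side of Eq. (8), up to the proportionality constant."""
    sigma = np.pi / omega
    delta = D - D_template
    return mu * sigma * (1.0 - np.exp(-delta ** 2 / (8.0 * sigma ** 2))
                         * np.cos(omega * delta))

def sse_multi(D_template, D, omegas):
    """Eq. (9): sum of the single-channel curves over all channels."""
    return sum(sse_single(D_template, D, w) for w in omegas)

D_template = np.arange(-50.0, 51.0, 1.0)   # template disparities D' (pixel)
E1 = sse_multi(D_template, 0.0, omegas)    # stimulus disparity D = 0
D_hat = D_template[int(np.argmin(E1))]
print(D_hat)
```

Because each single-channel term is non-negative and vanishes only at D' = D, the summed curve has its global minimum at the stimulus disparity.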
[Figure 2: panels (a) and (b). Horizontal axis: template disparity D' (pixel); vertical axis: normalized SSE.]
Fig. 2. Sum of squared errors (SSE) for RDSs with a single disparity. The disparities at local minima are regarded as the estimated disparities. (a) SSEs for RDSs with a disparity of 0 pixel. Solid and dashed curves represent the results when 0% and 40% of the dots were assigned as noise, respectively. (b) SSEs for correlated (solid curve), un-correlated (dashed curve), and anti-correlated (dotted curve) RDSs.
According to Eq. (8), the SSE function for a single frequency channel is weighted by σ = π/ω. Therefore, the contributions of low frequency channels to disparity estimation are greater than those of high frequency channels. This property corresponds to the “1/f weighting” in the Tsai and Victor model. Figure 2 shows the SSE curves obtained by computer simulations. As shown in Fig. 2a, even if noise dots were added to the stimuli, the minima were located correctly at the input disparity. Note that, when the absolute value of the template disparity D' was greater than about 50 pixels, the SSEs decreased quickly. This is because few neurons were tuned to such a large disparity D'; in this case, many of the matching errors were not included in the SSE calculations. Therefore, in the computer simulations, we only considered disparities within ±50 pixels. This disparity detection limit depends on the model parameters, i.e., the range of the preferred disparities of the model neurons. Figure 2b shows the SSE curves when the left and right images were un-correlated and anti-correlated. When observers fuse these stereograms, they cannot perceive depth planes, and binocular rivalry occurs. As shown in Fig. 2b, no well-defined minima appeared for the un-correlated and anti-correlated RDSs, a result consistent with human perception.
3.2 Transparent Surface
In this section, we consider the model response to two overlapping surfaces. The SSE for two overlapping surfaces whose disparities are D1 and D2 is approximately derived as

$$E_2(D'; D_1, D_2) \approx \rho E_1(D'; D_1) + (1-\rho) E_1(D'; D_2) - C, \qquad (10)$$

where C represents a constant and ρ (0 ≤ ρ ≤ 1) represents the proportion of dots that lie on the first surface, whose disparity is D1. The above equation indicates that the SSE for two overlapping surfaces is equivalent, up to a constant, to the weighted sum of the two SSEs for the individual surfaces. Therefore, it is expected that two local minima arise.
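Equation (10) can be explored numerically with a short sketch. It assumes μ(ω) = 1 for all seven channels, sets C = 0, and uses an illustrative helper `local_minima`; none of these choices come from the original simulations.

```python
import numpy as np

# Seven half-octave channels, 1/16 to 1/128 cycle/pixel (angular frequencies).
omegas = 2.0 * np.pi * 2.0 ** (-np.arange(4.0, 7.5, 0.5))

def e1(D_template, D):
    """Multi-channel single-surface SSE (Eqs. (8)-(9)) with mu(omega) = 1 assumed."""
    total = np.zeros_like(D_template, dtype=float)
    for omega in omegas:
        sigma = np.pi / omega
        delta = D - D_template
        total += sigma * (1.0 - np.exp(-delta ** 2 / (8.0 * sigma ** 2))
                          * np.cos(omega * delta))
    return total

def e2(D_template, D1, D2, rho=0.5, C=0.0):
    """Eq. (10): weighted sum of the two single-surface SSE curves minus a constant."""
    return rho * e1(D_template, D1) + (1.0 - rho) * e1(D_template, D2) - C

def local_minima(x, y):
    """Grid points whose value is lower than that of both neighbours."""
    idx = np.where((y[1:-1] < y[:-2]) & (y[1:-1] < y[2:]))[0] + 1
    return x[idx]

D_template = np.arange(-50.0, 51.0, 1.0)
minima = local_minima(D_template, e2(D_template, -20.0, 20.0))
print(minima)

# Unequal dot proportions: 60% of the dots on the surface at D1 = -20.
E_unequal = e2(D_template, -20.0, 20.0, rho=0.6)
```

With ρ = 0.6, the value of the curve at the disparity of the sparser surface is larger than at the disparity of the denser surface, which is the asymmetry that Eq. (10) predicts for unequal dot densities.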
[Figure 3: panel (a), normalized SSE in gray scale; horizontal axis: template disparity D' (pixel); vertical axis: disparity D1 (pixel). Panel (b), normalized SSE versus template disparity D' (pixel).]
Fig. 3. SSE for a transparent RDS. (a) Normalized SSE represented in gray scale. The horizontal axis represents the disparity of the template pattern, and the vertical axis represents half of the disparity difference between the two overlapping surfaces; e.g., 20 indicates that the overlapping surfaces have disparities of ±20 pixels. (b) SSE for the transparent RDS with disparities of ±20 pixels (corresponding to the white line in (a)).
[Figure 4: panel (a), normalized SSE in gray scale; horizontal axis: template disparity D' (pixel); vertical axis: disparity D1 (pixel). Panel (b), normalized SSE versus template disparity D' (pixel).]
Fig. 4. SSE for a transparent RDS in which the dot densities of the overlapping surfaces were unequal. The dot density of the nearer surface was 1.5 times greater than that of the farther surface. (a) Normalized SSE. (b) The SSE curve for the transparent RDS with disparities of ±20 pixels.
A simulation result is shown in Fig. 3. When the overlapping disparities were ±20 pixels, there were two minima at nearly ±20 pixels, and the model could detect the overlapping disparities.
Dot Density. Psychophysical studies have reported that, when the dot densities of two overlapping surfaces are unequal, it is difficult to perceive the lower-density surface [2]. According to Eq. (10), the squared error at the local minimum corresponding to the lower-density surface is predicted to be greater than that corresponding to the higher-density surface. Figure 4 shows the simulation result. As predicted, the local minimum corresponding to the higher-density surface was more clearly defined than that corresponding to the lower-density surface. If the difference in dot densities is greater than in Fig. 4, it should become more difficult to find the local minimum corresponding to the lower-density surface. This result is consistent with the psychophysical finding.
Gap Resolution. Observers can perceive stereo transparency if the disparity difference between two surfaces is greater than about 3 min. The threshold for transparency perception has been termed the gap resolution acuity. Stevenson et al. [6] investigated depth perception when disparity differences were less than the gap resolution acuity. They reported that, when disparity differences were greater than 30 sec, observers perceived a thickened surface; this threshold has been termed the superresolution acuity. Figure 5 shows the SSEs at small disparity differences. For disparities of ±2 pixels (Fig. 5b), the SSE curve had a sharp minimum at 0 pixel, and this SSE curve was not distinguishable from that for a single disparity (see Fig. 2). This result corresponds to the perception under the superresolution acuity. For disparities of ±6 pixels (Fig. 5c), the SSE had a relatively broad minimum. This may be interpreted as thickened slab perception. For disparities of ±7 pixels (Fig. 5d), two local minima appeared. In the present simulation, disparity differences of 6 and 14 pixels corresponded to the superresolution and the gap resolution acuity, respectively. The Tsai and Victor model, which employs the phase model, cannot reproduce this thickened slab perception. The present result indicates that the perception under the gap resolution acuity is well explained by using the hybrid model.
[Figure 5: panel (a), SSE in gray scale; horizontal axis: template disparity D' (pixel); vertical axis: disparity D1 (pixel). Panels (b)–(d), normalized SSE versus template disparity D' (pixel).]
Fig. 5. SSE for a transparent RDS whose disparity difference was smaller than the gap resolution. (a) SSE represented in gray scale. (b)–(d) SSE curves for the transparent RDSs with disparities of ±2, ±6, and ±7 pixels.
Attraction/Repulsion Effect. Stevenson et al. [7] investigated the attraction/repulsion effect in stereo transparency, in which the disparity differences between overlapping surfaces are perceived as smaller/greater than the actual differences. When two RDSs were overlapped, both attraction and repulsion effects occurred. The attraction effect arose when the separation, or the disparity difference between two surfaces, was smaller than about 5 min, and the repulsion effect occurred when the separation was between 5 and 10 min. The maximum shift of depth perception in both attraction and repulsion effects was 0.5–1 min. On the other hand, if one of the overlapping surfaces was an anti-correlated RDS, only the
repulsion effect occurred. The repulsion effect arose when the separation was smaller than 6–8 min. In this case, the smaller the separation between the two overlapping surfaces, the greater the perceived shift in the repulsion effect, and the maximum shift was about 1–2 min. The present model can interpret these effects qualitatively. When the SSE curves for the individual surfaces are superimposed (see Eq. (10)), their minimum positions are expected to shift slightly. Figure 6a shows the result of the computer simulations. This result is similar to the psychological data, although the maximum values of the perceived shifts were relatively small in comparison with the separations inducing the attraction/repulsion effects. In addition, the model predicted that, when high frequency channels cannot be utilized for disparity estimation, the attraction/repulsion effect becomes larger (the dotted curve in Fig. 6a). Figure 6b illustrates the result of a preliminary psychophysical experiment confirming the model prediction (Urasaki and Watanabe, private communication). Although this subject’s data did not show the attraction effect clearly, the repulsion effect became greater when Gaussian blobs were used as matching primitives of the RDSs. This result is consistent with the model prediction.
[Figure 6: panel (a), perceived shift D1 − D̂1 (pixel) versus separation D1 − D2 (pixel). Panel (b), perceived shift (min) versus separation (min) for subject NY; legend: Dot, Gaussian.]
Fig. 6. Attraction/repulsion effect. (a) Simulation results. The perceived shift is plotted against the separation, or disparity difference, between overlapping surfaces. Attraction and repulsion effects are plotted as positive and negative values on the vertical axis, respectively. The solid and dotted curves represent the results when both surfaces were correlated RDSs. Only the 3 lower frequency channels were used to obtain the dotted curve. The dashed curve represents the result when a correlated and an anti-correlated RDS were superimposed. (b) Result of the preliminary psychophysical experiment. The filled and the open circles represent the results when dots and Gaussian blobs (cutoff frequency, 3.8 deg) were used as matching primitives in the RDSs, respectively. Other experimental methods were designed according to Stevenson et al. [7].
LPDS. Watanabe [9] investigated the depth perception in the locally-paired-dot stereogram (LPDS) generated by overlapping two identical dot patterns with different disparities. The LPDS can be regarded as a random-dot version of the double-nail illusion, and the stereogram has potential matches leading to both transparent and non-transparent (or unitary) surface perception. Figure 7a shows the result of the psychophysical experiment. When all paired dots on two surfaces had the same signs of contrast (C-LPDS), similar to the perception in the
double-nail illusion, a unitary surface with an average disparity was perceived. In addition, Watanabe [9] studied the depth perception when the contrast polarities of the two overlapping patterns were reversed. It was expected that the contrast reversal might act as a surface segregation cue, so that observers could easily perceive transparency if all paired dots had opposite signs of contrast. However, the result showed that the ability to perceive transparency when all paired dots had the opposite contrast polarity (A-LPDS) was worse than when half of the paired dots had opposite signs of contrast (U-LPDS). This result suggests that depth perception in LPDSs is affected by the global property, or the correlation, of the overlapping patterns. As shown in Fig. 7b, the result of the present model is qualitatively consistent with the psychological finding. When C-LPDS and U-LPDS were presented, single and transparent disparities were detected, respectively. On the other hand, there was no well-defined minimum in the SSE of A-LPDS, and it was difficult to find a good template match stably. This is because the correlation between overlapping surfaces affects the response of the disparity energy model. In the case of U-LPDS, the two overlapping patterns are uncorrelated, and the population response is equivalent to the sum of the responses to the individual depth planes [10]. However, in the cases of C-LPDS (the two patterns are correlated) and A-LPDS (anti-correlated), the population responses were similar to those to correlated and anti-correlated RDSs with 0 disparity, respectively (results not shown).
[Figure 7: panel (a), perceived depth (deg) by stimulus type (C-LPDS, U-LPDS, A-LPDS). Panel (b), normalized SSE versus template disparity D' (pixel).]
Fig. 7. Locally paired dot stereogram (LPDS). (a) Psychophysical result [9]. Perceived disparities of the nearer surfaces in three types of LPDSs are plotted. The disparities of the two overlapping surfaces were ±0.14 deg (error bars represent ±1 S.E.). (b) Simulation result. Solid, dashed, and dotted curves represent SSEs for C-LPDS, U-LPDS, and A-LPDS, respectively.
4 Conclusion
In the present study, we have proposed a modification of Tsai and Victor’s stereo model that uses hybrid-type disparity energy models, and have shown that the model can explain a variety of psychophysical properties of stereo transparency. Further research should include introducing a more elaborate model for binocular neurons, e.g., the 2-dimensional RF model proposed by Mikaelian and Qian [3] that can explain the distance-dependent attraction/repulsion effect without introducing ad hoc interactions between neurons.
References
1. Anzai, A., Ohzawa, I., Freeman, R.D.: Neural mechanisms for encoding binocular disparity: receptive field position vs. phase. J. Neurophysiol. 82 (1999) 874–890
2. Gepshtein, S., Cooperman, A.: Stereoscopic transparency: a test for binocular vision’s disambiguating power. Vision Res. 38 (1998) 2913–2932
3. Mikaelian, S., Qian, N.: A physiologically-based explanation of disparity attraction and repulsion. Vision Res. 40 (2000) 2999–3016
4. Ohzawa, I., DeAngelis, G.C., Freeman, R.D.: Stereoscopic depth discrimination in the visual cortex: neurons ideally suited as disparity detector. Science 249 (1990) 1037–1041
5. Pouget, A., Dayan, P., Zemel, R.: Information processing with population codes. Nature Rev. Neurosci. 1 (2000) 125–132
6. Stevenson, S.B., Cormack, L.K., Schor, C.M.: Hyperacuity, superresolution and gap resolution in human stereopsis. Vision Res. 29 (1989) 1597–1605
7. Stevenson, S.B., Cormack, L.K., Schor, C.M.: Depth attraction and repulsion in random dot stereograms. Vision Res. 31 (1991) 805–813
8. Tsai, J.J., Victor, J.D.: Reading a population code: a multi-scale neural model for representing binocular disparity. Vision Res. 43 (2003) 445–466
9. Watanabe, O.: Effect of the correlation between overlapping dot patterns on stereo transparency. Perception 34 suppl. (2005) 186–187
10. Watanabe, O., Idesawa, M.: Computational model for neural representation of multiple disparities. Neural Netw. 16 (2003) 25–37
Functional Connectivity in the Resting Brain: An Analysis Based on ICA
Xia Wu1, Li Yao1,2, Zhi-ying Long3, Jie Lu4, and Kun-cheng Li4
1 State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing, China, 100875
2 College of Information and Computer Technology, Beijing Normal University, Beijing, China
{wuxia, yaoli}@bnu.edu.cn
3 Center for Human Development, University of California at San Diego, CA, USA, 92093
[email protected]
4 Department of Radiology, Xuan Wu Hospital of Beijing, Beijing, China, 100053
Abstract. The functional connectivity of the resting state, or default mode, of the human brain has been a research focus, because it is reportedly altered in many neurological and psychiatric disorders. Among the methods used to assess the functional connectivity of the resting brain, independent component analysis (ICA) has been very useful. However, how to choose the optimal number of separated components and how to identify the best-fit component of the default-mode network remain open problems. In this paper, we used three different numbers of independent components to separate the fMRI data of the resting brain and three criteria to choose the best-fit component. Furthermore, we proposed a new approach to obtain the best-fit component. The result of the new approach is consistent with the default-mode network.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 175–182, 2006. © Springer-Verlag Berlin Heidelberg 2006
1 Introduction
Cortical functional connectivity of the resting brain has been a research focus, because it can indicate the concurrent spontaneous activity of spatially segregated regions [1] and identify the default-mode network of the brain, which can be used as a biomarker of many neurological and psychiatric disorders. Many methods have been used to assess the functional connectivity of the resting brain, such as the “seed voxel” approach [2, 3], hierarchical clustering [4], PCA [5], the self-organizing map [6], and ICA [7, 8, 9, 10]. Among these, ICA is one of the most widely used methods. In the applications of spatial ICA (SICA) to fMRI data, the observed 4-D signals are usually modeled as linear mixtures of unknown, spatially independent processes (e.g., BOLD fluctuations, head movements, artifacts, etc.), each contributing to the dataset with an unknown time profile [11, 12, 13, 14]. The fMRI data are decomposed into spatially independent components (ICs), each of which has a unique time course. Each spatial IC can be treated as a functional connectivity map. ICA has been applied to resting state data of the adult human brain [7]. SICA has also been applied to resting state data of anesthetized child patients, in which components in sensory and motor cortices and large vessels were identified [8]. In [9], SICA was used to assess cortical functional connectivity maps from resting state data; it yielded connectivity maps of several parietal regions including the superior parietal cortex and the posterior cingulate gyrus, bilateral auditory, motor and visual cortices, etc. A default-mode network was identified from resting fMRI data by ICA, and the activity in this network can be used as a biomarker for incipient Alzheimer’s disease [10].
In the studies above, two important points still need further investigation. The first is how to choose the optimal number of ICs. There is no consensus on this, and methods for estimating the number of ICs are an area of active investigation [14, 17]. For resting fMRI data, because there is no time profile, selecting the desired functional connectivity network from numerous ICs is very difficult. Using a small number of ICs reduces the number of estimated components and eases the burden of computation and interpretation. In [17], twenty was chosen as the number of ICs, which is far less than the number of time points. In [10], approximately one-fourth to one-fifth of the number of time points was considered the optimal number of ICs. In [8] and [9], the numbers of ICs were approximately half of the numbers of time points. Hence, the selection of the number of ICs is a tradeoff between the burden of computation and the loss of information. The second is how to choose, after ICA separation, the best-fit component among the numerous ICs, i.e., the component that should be treated as the default-mode network. If the number of ICs is small, the best-fit component can be chosen manually. Otherwise, automated selection procedures based on spatial or frequency information have been developed [9, 10, 18]. Approaches that use spatial information include those based on the number of activated voxels in the region of interest (ROI) [18] and those based on the z scores of the voxels [9, 10].
Building on our previous work and experience [19, 20, 21], in this paper we separated the resting fMRI data into different numbers of ICs. We then used three criteria to choose the best-fit component. Finally, we proposed a new approach to obtain an accurate network from the resting fMRI data. The results were consistent with the default-mode network in [10].
2 Method
2.1 Subjects and Task
Fifteen healthy subjects with no history of neurological or psychiatric disorder participated in the study (8 males; mean age, 21.1; age range, 18-26 years). All subjects were right-handed, as assessed using the Edinburgh Handedness Inventory [22]. The aim of the study was explained to the subjects, and all subjects gave written informed consent before measurement. Before the experiment, the subjects were told to relax but remain awake, to keep their eyes open, and not to think of anything in particular. The scan lasted 4 minutes and 24 seconds.
2.2 Imaging Sequence and Parameters
The fMRI scanning was performed with a Siemens 1.5 Tesla scanner at Xuan Wu Hospital in Beijing. A gradient echo EPI sequence was used during functional scans (TR = 2000 ms, TE = 40 ms, flip angle = 90°, field of view (FOV) = 220 × 220 mm², matrix = 64 × 64, 6 mm slice thickness with a 1.2 mm gap, and 20 slices covering the whole brain). In each session lasting 264 s, 132 volumes of images were acquired.
2.3 Data Processing
Data were preprocessed by using SPM2 (www.fil.ion.ucl.ac.uk/spm). Images were corrected for movement by using least-squares minimization without higher-order corrections for spin history and normalized [23] to the stereotaxic coordinates of Talairach and Tournoux [24]. Images were then resampled into 3 × 3 × 4 mm³ voxels and smoothed with an 8-mm Gaussian kernel. In this study, the smoothed normalized fMRI images were concatenated across time to form a single 4D image for each subject. The first 5 time points were eliminated to allow for equilibration of the magnetic field. For each subject, the 4D dataset was analyzed with a fixed-point ICA algorithm [16] (www.cis.hut.fi/projects/ica/fastica). The number of ICs was chosen to be 20 (approximately one-sixth of the number of time points), 60 (approximately half), and 90 (approximately three-fourths, because the algorithm could not converge at 100), respectively. The ICs were calculated with the default FastICA parameters (approach: deflation; stabilization: off), with the exception of the nonlinearity function, g = tanh. The values of the resulting ICs were then transformed into z-scores. Voxels that presented z-score values of at least 2.4 (P=0.05) were considered significant and were presented in spatial localization maps. After separation, in order to choose the best-fit component, we used three criteria based on spatial information. Criterion I is the number of significant voxels in the ROI, Criterion II is the difference between the mean z score of the voxels in the ROI and the mean z score of the voxels outside the ROI, and Criterion III is the mean absolute z score of the voxels in the ROI. Criteria I, II, and III were each used to rank the ICs, and the top-ranked IC map was taken as the default-mode network (referred to as the single-map).
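The three selection criteria can be written compactly for a matrix of IC maps. The sketch below is only an illustration: it assumes that z-scores are computed per IC across voxels, that the ROI is supplied as a boolean voxel mask, and that the toy data and the names `to_z_scores` and `rank_components` are hypothetical.

```python
import numpy as np

def to_z_scores(ic_maps):
    """Standardize each IC map (one row per IC) across voxels."""
    mean = ic_maps.mean(axis=1, keepdims=True)
    std = ic_maps.std(axis=1, keepdims=True)
    return (ic_maps - mean) / std

def rank_components(z_maps, roi_mask, z_threshold=2.4):
    """Score every IC with Criteria I-III and return the top IC index per criterion."""
    inside = z_maps[:, roi_mask]
    outside = z_maps[:, ~roi_mask]
    scores = {
        "I": (inside >= z_threshold).sum(axis=1),          # significant voxels in the ROI
        "II": inside.mean(axis=1) - outside.mean(axis=1),  # mean z inside minus outside
        "III": np.abs(inside).mean(axis=1),                # mean absolute z inside the ROI
    }
    best = {name: int(np.argmax(values)) for name, values in scores.items()}
    return scores, best

# Toy example: 5 ICs over 1000 voxels, with the first 50 voxels forming the ROI.
rng = np.random.default_rng(0)
roi_mask = np.zeros(1000, dtype=bool)
roi_mask[:50] = True
ic_maps = rng.standard_normal((5, 1000))
ic_maps[3, roi_mask] += 4.0            # IC 3 is strongly expressed inside the ROI
z_maps = to_z_scores(ic_maps)
scores, best = rank_components(z_maps, roi_mask)
print(best)
```

In this toy case all three criteria agree, but on real data they may rank the components differently, which is the situation the comparison of criteria is meant to examine.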
In this study, we used the posterior cingulate cortex (PCC) as the ROI [10]. The PCC template was obtained from the WFU PickAtlas tool in SPM2 [25]. Furthermore, we proposed a new method to assess the default-mode network. When the number of ICs was 60 or 90, we added up the first several ICs into a sum-map. The ICs included were those in which the number of voxels whose z-scores were above the threshold exceeded a designated number (in this study, we used 8). In the sum-map, we used a higher z-score threshold (5, P
Semantic Addressable Encoding
C.-Y. Liou, J.-C. Huang, and W.-C. Yang
Multidimensional Scaling (MDS) Space
We select a set of $R_s$ eigenvectors, $\{f_r, r = 1 \sim R_s\}$, from all R eigenvectors to build a reduced feature space:

$$F^{s}_{R \times R_s} \equiv [f_1, f_2, \ldots, f_{R_s}]_{R \times R_s}. \qquad (9)$$

This selection is based on the distribution of the projections of the codes on each eigenvector. An ideal distribution is an even distribution with large variance. We select those eigenvectors, $\{f_r, r = 1 \sim R_s\}$, that have large eigenvalues. The MDS space is

$$MDS \equiv \mathrm{span}(F^{s}). \qquad (10)$$

These selected features are independent and significant. The new code of each word in this space is

$$w_n^{s} = {F^{s}}^{T} w_n \qquad (11)$$

or

$$W^{s}_{R \times N} = {F^{s}}^{T} W_{R \times N}. \qquad (12)$$
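The projection in Eqs. (9)–(12) can be sketched with NumPy. Since the matrix whose eigenvectors are used is defined earlier in the paper, the sketch simply assumes the symmetric matrix $W W^{T}$ as a stand-in; this choice, the toy data, and the function name `mds_projection` are assumptions for illustration.

```python
import numpy as np

def mds_projection(W, n_selected):
    """Project codes W (R x N) onto the n_selected eigenvectors with largest eigenvalues."""
    # Assumption: the eigenvectors are those of the symmetric R x R matrix W W^T.
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)
    order = np.argsort(eigvals)[::-1][:n_selected]   # indices of the largest eigenvalues
    F_s = eigvecs[:, order]                          # Eq. (9): selected eigenvectors as columns
    W_s = F_s.T @ W                                  # Eq. (12): codes in the MDS space
    return F_s, W_s

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 20))    # toy codes: R = 8 features, N = 20 words
F_s, W_s = mds_projection(W, n_selected=3)
print(F_s.shape, W_s.shape)
```

Each column of `W_s` is the reduced code of one word, i.e., Eq. (11) applied to the corresponding column of `W`.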
Representative Vector of a Whole Document
A document, denoted as D, usually contains more than one word. A representative vector should contain the semantic meaning of the whole document. Two such measures are defined [23]. They are the peak-preferred measure,

$$v_D^{a} = [w_1^{a}, w_2^{a}, \ldots, w_R^{a}]^{T}, \quad \text{where } w_r^{a} = \max_{w_n^{s} \in D} |w_{rn}^{s}|, \; r = 1 \sim R,$$

and the average-preferred measure,

$$v_D^{b} = \sum_{w_n^{s} \in D} w_n^{s} = [w_1^{b}, w_2^{b}, \ldots, w_R^{b}]^{T}, \quad \text{where } w_r^{b} = \sum_{w_n^{s} \in D} w_{rn}^{s}, \; r = 1 \sim R. \qquad (13)$$

The magnitude is normalized as follows:

$$v_D = \| v_D^{b} \|^{-1} v_D^{b}. \qquad (14)$$

The normalized measure, $v_D$, is used here to represent the whole document. A representative vector, $v_Q$, for a whole query can be obtained similarly by using equations (13) and (14).
Relation Comparison
The relation score is defined as follows:

$$RS_Q(D) = \frac{\langle v_D, v_Q \rangle}{\| v_D \| \times \| v_Q \|} = \langle v_D, v_Q \rangle. \qquad (15)$$
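The two representative measures and the relation score of Eqs. (13)–(15) can be illustrated with a small sketch. The toy word codes and the function names are assumptions; each column of a `codes` array stands for the code of one word in the document or query.

```python
import numpy as np

def peak_preferred(codes):
    """Peak-preferred measure: largest absolute value of each feature over the words."""
    return np.max(np.abs(codes), axis=1)

def average_preferred(codes):
    """Eq. (13) followed by Eq. (14): feature-wise sum of the codes, scaled to unit length."""
    v = codes.sum(axis=1)
    return v / np.linalg.norm(v)

def relation_score(v_doc, v_query):
    """Eq. (15): normalized inner product of the document and query vectors."""
    return float(v_doc @ v_query / (np.linalg.norm(v_doc) * np.linalg.norm(v_query)))

# Toy codes with R = 3 features: a two-word document and a one-word query.
doc = np.array([[1.0, 3.0],
                [0.0, -2.0],
                [2.0, 0.0]])
query = np.array([[2.0],
                  [-1.0],
                  [1.0]])

v_doc = average_preferred(doc)
v_query = average_preferred(query)
score = relation_score(v_doc, v_query)
print(peak_preferred(doc), score)
```

Because both vectors are already normalized by Eq. (14), the denominator in Eq. (15) equals one, which is why the score reduces to the plain inner product.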
Iterative Re-Encoding
Since Elman’s method, which was designed for sentences generated with a simple fixed syntax, Noun + Verb + Noun, cannot be applied appropriately to more complex sentences, we modified his method. In our approach, each word initially has a random lexical code, $w_n^{j=0} = [w_{n1}, w_{n2}, \ldots, w_{nR}]^{T}$. After the j-th training epoch, a new raw code is calculated as follows:

$$w_n^{raw} = \frac{1}{|s_n|} \sum_{\substack{w(t) = w_n \\ w(t) \in D}} \varphi(U_{oh} H(w(t-1))), \quad n = 1 \sim N, \qquad (16)$$

where $|s_n|$ is the total number of words in the set $s_n$. This set contains all the predictions for the word $w_n$ based on all its preceding words, $s_n = \{\varphi(U_{oh} H(w(t-1))) \mid w(t) = w_n, \text{ and } w(t) \in D\}$. This equation has a form slightly different from that in (4). Namely, we directly average all the prediction vectors for a specific word. The hidden layer may have a flexible number of neurons in our modified method. Note that there exist other promising methods to obtain an updated code from the set $s_n$, such as the self-organizing map [10] and the multi-layer perceptron [12]. After each epoch, all the codes are normalized with the following two equations:

$$W^{ave}_{R \times N} = W^{raw}_{R \times N} - \frac{1}{N} W^{raw}_{R \times N} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & 1 & \vdots \\ 1 & \cdots & 1 \end{bmatrix}_{N \times N}, \qquad (17)$$

$$w_n^{j} = w_n^{nom} = \| w_n^{ave} \|^{-1} w_n^{ave}, \quad \text{where } \| w_n \| = (w_n^{T} w_n)^{0.5}, \; n = 1 \sim N. \qquad (18)$$

This normalization can prevent a diminished solution, $\{w_n \sim 0, n = 1 \sim N\}$, derived by the back-propagation algorithm. In summary, the process starts with a set of random lexical codes for all of the stemmed words in a specific corpus. In each epoch, we use all the sentences in the corpus to train [12][13][14][15][18] an Elman network four times. We then compute the new code, $w_n^{j}$, for each word using equations (16), (17), and (18). The training phase is stopped at the J-th epoch, when there is no significant code difference between two successive epochs. We expect that such iterative encoding can extract certain salient features, in addition to word frequencies, in the sentence sequence that contain the writing style of the author or work. This writing behavior is unlikely to be consciously manipulated by the author and may serve as a robust stylistic signature. The trained code after the J-th epoch, $w_n = w_n^{J} = [w_{1n}, w_{2n}, \ldots, w_{Rn}]^{T}$, which is a vector with R features, is used in the semantic matrix $W_{R \times N}$ in (5) and the average-preferred measure (13). The normalization step (14) and the relation score (15) are then calculated based on this vector.
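The averaging and normalization steps of Eqs. (16)–(18) can be sketched as follows. The Elman network itself is not implemented here: the array `predictions` is a hypothetical stand-in for the network outputs $\varphi(U_{oh} H(w(t-1)))$, and the function names and toy values are assumptions.

```python
import numpy as np

def average_predictions(word_ids, predictions, n_words):
    """Eq. (16): average the prediction vectors collected for each word."""
    n_features = predictions.shape[1]
    W_raw = np.zeros((n_features, n_words))
    counts = np.zeros(n_words)
    for word, vec in zip(word_ids, predictions):
        W_raw[:, word] += vec
        counts[word] += 1
    return W_raw / counts          # assumes every word occurs at least once

def normalize_codes(W_raw):
    """Eqs. (17)-(18): subtract the mean code, then scale each code to unit length."""
    W_ave = W_raw - W_raw.mean(axis=1, keepdims=True)
    return W_ave / np.linalg.norm(W_ave, axis=0, keepdims=True)

# Toy run: 3 words, 2 features, 4 predictions (word 0 is predicted twice).
word_ids = [0, 1, 0, 2]
predictions = np.array([[1.0, 0.0],
                        [0.0, 2.0],
                        [3.0, 0.0],
                        [3.0, 1.0]])
W_raw = average_predictions(word_ids, predictions, n_words=3)
W_new = normalize_codes(W_raw)
print(W_raw)
```

Multiplying $W^{raw}$ by the all-ones matrix and dividing by N gives the mean code repeated in every column, so Eq. (17) amounts to the row-wise mean subtraction used in `normalize_codes`.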
3 Example of Literature Categorization
In this experiment, we test the ability to classify 36 plays written by William Shakespeare. A trained code set was generated using a training corpus that contained the 36 works. We considered each play as the query input and computed the relation score between this query and one other play. Fig. 2 shows the relation tree of the 36 plays.
Fig. 2. Categorization of Shakespeare’s plays
This tree was constructed by applying the methods in [5][8][19] to the 630 scores of all pairs of plays. We also include the genre of each play in the right column of the figure, where ‘h’ denotes ‘history,’ ‘t’ denotes ‘tragedy,’ ‘c’ denotes ‘comedy,’ and ‘r’ denotes ‘romance.’ The categorization result is very consistent with the genre [1][11][16][22]. In this example, we set Di = 1, ..., 36, Qi = 1, ..., 36, N = 10,000 (words with high frequencies of occurrence), Lh = Lc = 200, and Lo = Li = RS = R = 64 (features). The numbers in the figure indicate the publication years of the plays. We provide a semantic search tool using the corpus of Shakespeare’s comedies and tragedies at http://red.csie.ntu.edu.tw/literature/SAS.htm. Two search results are listed in Table 1. In this search, we set Di = 1, ..., 7777 (the 7,777 longest conversations in the 23 tragedies and comedies), N = 10000, Lo = Li = R = 100, Lh = Lc = 200, and RS = 64. Each query indexed one conversation.
Table 1. Search results by semantic associative search
Query: she loves kiss
Search result: BENVOLIO: Tut, you saw her fair, none else being by herself poised with herself in either eye; but in that crystal scales let there be weigh’d. Your lady’s love against some other maid that I will show you shining at this feast, and she shall scant show well that now shows best. – Romeo and Juliet
Query: Armies die in blood
Search result: MARCUS ANDRONICUS: Which of your hands hath not defended Rome, and rear’d aloft the bloody battle-axe, writing destruction on the enemy’s castle? O, none of both but are of high desert my hand hath been but idle; let it serve. To ransom my two nephews from their death; then have I kept it to a worthy end. – Titus Andronicus
Summary
In summary, we have explored the concept of semantic addressable encoding and completed a design for it that includes automatic encoding methods. We have applied the methods to literary works and presented the results. The trained semantic codes can facilitate other research, such as studies on personalized codes, linguistic analysis, authorship identification, and categorization. The encoding process can also be modified to handle polysemous words by resolving the multiple meanings of a single word.
Acknowledgement This work was supported by the National Science Council under projects NSC 94-2213-E-002-034 and NSC 92-2213-E-002-002.
References
1. Bloom, H.: Shakespeare: The Invention of the Human. Riverhead Books, New York (1998)
2. Burrows, J.: Questions of Authorship: Attribution and Beyond. A Lecture Delivered on the Occasion of the Roberto Busa Award, ACH-ALLC 2001, New York. Computers and the Humanities 37 (2003) 5-32
3. Elman, J.L., Bates, E.A., Johnson, M.H., Karmiloff-Smith, A., Parisi, D., Plunkett, K.: Rethinking Innateness. The MIT Press, Cambridge, Massachusetts (1996)
4. Elman, J.L.: Generalization, Simple Recurrent Networks, and the Emergence of Structure. The 20th Annual Conference of the Cognitive Science Society, Mahwah, New Jersey (1998)
5. Felsenstein, J.: PHYLIP (Phylogeny Inference Package) version 3.5c [Program]. Department of Genetics, University of Washington, Seattle (1993)
6. Frakes, W.B.: Stemming Algorithms. In: Frakes, W.B., Ricardo, B.-Y. (Eds.): Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, New Jersey (1992) 131-160
7. Holmes, D.I.: The Evolution of Stylometry in Humanities Scholarship. Literary and Linguistic Computing 13 (1998) 111
8. Huffman, D.A.: A Method for the Construction of Minimum-redundancy Codes. Proceedings of the I.R.E. 40 (1952) 1098-1102
9. Jordan, M.I.: Serial Order: a Parallel Distributed Processing Approach. Cognitive Science Institute Tech. Rep. 8604, San Diego (1986)
10. Kohonen, T.: Clustering, Taxonomy, and Topological Maps of Patterns. Proceedings of the Sixth Int’l Conference on Pattern Recognition, Silver Spring (1982) 114-125
11. Lee, D.D., and Seung, H.S.: Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature 401 (1999) 788-791
12. Liou, C.-Y., and Yu, W.-J.: Ambiguous Binary Representation in Multilayer Neural Network. Proceedings of Int’l Conference on Neural Networks (ICNN), Perth, Australia 1 (1995) 379-384
13. Liou, C.-Y., Huang, J.-C., and Kuo, Y.-T.: Geometrical Perspective on Learning Behavior. Journal of Information Science and Engineering 21 (2005) 721-732
14. Liou, C.-Y., and Lin, S.-L.: Finite Memory Loading in Hairy Neurons. Natural Computing 5(1) (2006) 15-42
15. Liou, C.-Y.: Backbone Structure of Hairy Memory. International Conference on Artificial Neural Networks (ICANN), Athens, Greece (2006), Lecture Notes in Computer Science, Part I, LNCS 4131, 688-697, Springer
16. McEnery, T., and Oakes, M.: Authorship Identification and Computational Stylometry. In: Handbook of Natural Language Processing, Marcel Dekker, Inc. (2000) 545-562
17. Porter, M.F.: An Algorithm for Suffix Stripping. Program 14 (1980) 130-137
18. Rumelhart, D.E., McClelland, J.L. (Eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, Massachusetts (1986)
19. Saitou, N., and Nei, M.: The Neighbor-Joining Method: a New Method for Reconstructing Phylogenetic Trees. Molecular Biology and Evolution 4 (1987) 406-425
20. Tweedie, F.J., and Baayen, R.H.: How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities 32 (1998) 323-352
21. Williams, C.B.: Mendenhall’s Studies of Word-Length Distribution in the Works of Shakespeare and Bacon. Biometrika 62 (1975) 207-212
22. Yang, C.-C., Peng, C.-K., Yien, H.-W., and Goldberger, A.L.: Information Categorization Approach to Literary Authorship Disputes. Physica A 329 (2003) 473-483
23. Yoshida, N., Kiyoki, Y., and Kitagawa, T.: An Associative Search Method Based on Symbolic Filtering and Semantic Ordering for Database Systems. Proceedings of 7th IFIP 2.6 Working Conference on Database Semantics (DS-7), Leysin, Switzerland (1997) 215-237
Top-Down Attention Guided Object Detection Mei Tian, Si-Wei Luo, Ling-Zhi Liao, and Lian-Wei Zhao School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China [email protected]
Abstract. Existing attention models concentrate on bottom-up attention guidance and lack an effective definition of top-down attention information. In this paper we define a new holistic scene representation and use it as top-down attention information, which works in three ways. The first is to discriminate between close-up and open scene categories. The second and third are to provide reliable priors for the presence or absence of an object and for its location. Compared with traditional attention guidance algorithms, our algorithm shows how scene classification, and working directly on the entire scene without segmentation stages, facilitates object detection. The two stages of pre-attention and focus attention enhance detection performance and are better suited to high-level visual information processing. Experimental results demonstrate the effectiveness of our algorithm.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 193-202, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction
In Marr’s computational theory, visual perception can be defined as an information processing task that aims to find what objects are in the external world and where they are [1]. The perception of objects and of their spatial locations has therefore been the primary topic in research on the information processing theory of the visual perception system. The function of visual attention is to direct our gaze rapidly towards objects of interest in our visual environment [2]. In recent years, an increasing number of bottom-up attention models have been proposed [3-5], but little success has been achieved in modeling complex top-down attention guidance. Current bottom-up attention models face two main obstacles. First, they become less effective when the image is so degraded that the information from the object itself is not sufficient for reliable detection. Second, because the same object can have different appearances and locations, and these models do not make good use of priors, they need exhaustive exploration of a large image space. To overcome these difficulties, top-down attention control has been introduced to detect and recognize objects [6-10]. Rybak proposed a “semantic significance” term in the definition of object saliency [7]. However, this “semantic significance” was predefined only to emphasize the areas of important meaning in the image when high-level visual structures lack top-down attention control. Salah introduced observable Markov models to simulate task-driven attention [8], and experimental results in digit and face recognition demonstrate their effectiveness. The “high level information” used in these top-down attention models is limited to simplified quantities such as thresholds and weights.
A strong relationship exists between the category of an object and the whole scene context that contains it [11]. Scenes such as streets and forests are not random but show considerable regular structure. This structure can be described by a holistic scene representation, which can be extracted without an image segmentation stage. The holistic scene representation provides strong priors on the scene category and on the most likely category and location of an object. In addition, it can help to disambiguate the identity of an object when local features are lacking. Therefore, this paper defines a new holistic representation of scenes as top-down attention information. The main goal of the paper is to study the effect of top-down attention guidance on object detection. When the image is a close-up view of an object, the object tends to occupy a significant portion of the image and the effect of top-down attention is very limited. We therefore use a spectral attribute based on the holistic scene representation to discriminate between close-up and open scene categories. Then, for open scene categories, top-down attention gives prior knowledge of the presence or absence of the object and predicts its location in the scene.

The paper is organized as follows. Section 2 defines the two levels of scene description that are used in top-down and bottom-up attention. Section 3 provides the algorithm of top-down biasing for detecting the object. Section 4 details the top-down attention guided object detection algorithm. Experimental results and discussion are given in Sections 5 and 6.
2 Scene Description
In this section, we consider simple definitions of scene structure based on descriptions of holistic textural patterns and of local spatial arrangement. Two levels of scene description are therefore discussed. The first level, the feature coefficients of the power spectrum, provides the holistic contextual information of the initial image. The second level, the weighted sum of all feature maps, provides the dominant orientations and scales in the image.

2.1 Holistic Scene Representation
In this section, we define the holistic scene representation, starting from the discrete Fourier transform (DFT)

$$F(u, v) = \sum_{(x, y)} \frac{I(x, y) - \tau}{\tau}\, w(x, y)\, e^{2\pi i (ux + vy)/N} \tag{1}$$
where $I(x, y)$ denotes the intensity distribution of the image at spatial variables $(x, y)$ in the $N \times N$ range, and $u, v$ are the spatial frequency variables in the horizontal and vertical directions, respectively (in cycles per image). Boundary effects were reduced by applying a circular Kaiser-Bessel window $w(x, y)$. The weighted mean intensity $\tau$ is defined as

$$\tau = \frac{\sum_{(x, y)} I(x, y)\, w(x, y)}{\sum_{(x, y)} w(x, y)}$$

to avoid leakage in the spectral transformation [12].
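The windowed, contrast-normalized transform of Eq. (1) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the raised-cosine `circular_window` is a stand-in for the circular Kaiser-Bessel window (whose shape parameter the paper does not give), the function names are hypothetical, and the direct summation is only practical for very small images.

```python
import cmath
import math

def circular_window(n):
    """Circular raised-cosine taper; an assumed stand-in for the Kaiser-Bessel window."""
    centre = (n - 1) / 2.0
    radius = n / 2.0
    window = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            r = math.hypot(x - centre, y - centre)
            if r < radius:
                window[y][x] = 0.5 * (1.0 + math.cos(math.pi * r / radius))
    return window

def weighted_mean(image, window):
    """tau = sum(I * w) / sum(w)."""
    n = len(image)
    num = sum(image[y][x] * window[y][x] for y in range(n) for x in range(n))
    den = sum(window[y][x] for y in range(n) for x in range(n))
    return num / den

def windowed_dft(image):
    """F(u, v) of Eq. (1) by direct summation (O(N^4), small images only)."""
    n = len(image)
    window = circular_window(n)
    tau = weighted_mean(image, window)
    # Contrast-normalised, windowed intensity: (I - tau) / tau * w.
    contrast = [[(image[y][x] - tau) / tau * window[y][x] for x in range(n)]
                for y in range(n)]
    spectrum = [[0j] * n for _ in range(n)]
    for v in range(n):
        for u in range(n):
            spectrum[v][u] = sum(
                contrast[y][x] * cmath.exp(2j * math.pi * (u * x + v * y) / n)
                for y in range(n) for x in range(n))
    return spectrum, tau
```

With this choice of $\tau$ the windowed contrast sums to zero, so $F(0, 0)$ vanishes up to rounding error; this is one way to see how subtracting the weighted mean keeps the mean intensity from leaking into the spectrum.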
Then we compute the power spectrum $S$ of the image as the squared magnitude of its DFT

$$S(u, v) = \frac{|F(u, v)|^2}{N^2\, \Upsilon(u, v)} \tag{2}$$
where $\Upsilon$ is the correction factor [12]. The holistic scene representation given by $S(u, v)$ is a high-dimensional representation of the input image $I(x, y)$. Based on higher-order statistics, we further reduce the dimensionality of the representation by applying independent component analysis (ICA) [13]. Each $S$ serves as a sample, and the rows of each sample are concatenated into a column vector, top row first. The power spectral independent components (PSIC) decompose the power spectrum of the initial image as

$$S'(u, v) = PSIC(u, v) \cdot HSR(u, v) \tag{3}$$

where $S'$ is an $m \times 1$ column vector representing a sample ($m = N \times N$). $PSIC = (psic_1, psic_2, \ldots, psic_n)$ is an $m \times n$ matrix whose columns are the linear basis functions, and $n$ is the number of basis functions. $HSR = (hsr_1, hsr_2, \ldots, hsr_n)'$, where each $hsr_i$ is a statistical feature coefficient. With ICA, different samples share the same linear basis functions but have different coefficient vectors, so the coefficient vectors can be used to represent the samples. We use the fast fixed-point algorithm [13] to compute the feature coefficients $HSR$; each $hsr_i$ is defined as $hsr_i = \langle W_i, S' \rangle$, where $W = (W_1, \ldots, W_i, \ldots, W_n)' = PSIC^{-1}$ is the transformation matrix. The initial image is now represented by the feature coefficients $hsr_i$. Because all regions of the initial image contribute to the coefficient vector, and the coefficients encode the whole scene without splitting it into objects, the vector $HSR$ is holistic. We define the coefficient vector $HSR$ as the holistic scene representation, which provides top-down attention information for detection tasks.

2.2 Local Scene Representation
The specific local spatial arrangement is an essential aspect of an image representation. It includes the dominant orientations and scales of the initial image, which the holistic scene representation does not encode. In this paper, the Gabor filter is chosen for its ability to simulate the early stages of the visual pathway. For each filter, the output response to image $I(x, y)$ is defined as

$$v(x, y) = I(x, y) * \psi(x - x_0, y - y_0) \tag{4}$$

where $\psi(x, y)$ is the Gabor filter defined by $\psi(x, y) = e^{-(\alpha^2 {x'}^2 + \beta^2 {y'}^2)}\, e^{j 2\pi f_0 x'}$, with $x' = x\cos\varphi + y\sin\varphi$ and $y' = -x\sin\varphi + y\cos\varphi$. Here $\alpha$ is the sharpness of the Gaussian major axis, $f_0$ is the frequency of the sinusoid, $\varphi$ is the rotation of the Gaussian and sinusoid, and $\beta$ is the sharpness of the Gaussian minor axis. In this definition, $(x_0, y_0)$ is the center of the receptive field. The outputs of all filters can be described as $\{v^k_{i,j}(x, y),\ i, j = 1, 2, 3, 4,\ k = 1, \ldots, 16\}$, where $i$ and $j$ are the indices of orientation and frequency, respectively. They are defined as 16 feature maps. We define the input image $I(x, y)$ as the 17th feature map, and all 17 feature maps are normalized to the same range. Then, for each feature map, we compute the global amplification factor $(M - m)^2$ [14] and use it as the weight, where $M$ is the map's global maximum and $m$ is the average of all its other local maxima. The final saliency map, denoted $LSR$, is the weighted sum of all feature maps and is defined as the local scene representation.
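The Gabor filter of Eq. (4) and the $(M - m)^2$ weighting can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, a local maximum is assumed to be a value strictly greater than its 4-connected neighbours, and $m$ is assumed to be 0 when a map has no other local maximum; the paper does not specify these details.

```python
import cmath
import math

def gabor(x, y, f0, phi, alpha, beta):
    """Complex Gabor value psi(x, y) as defined for Eq. (4)."""
    xr = x * math.cos(phi) + y * math.sin(phi)
    yr = -x * math.sin(phi) + y * math.cos(phi)
    envelope = math.exp(-(alpha ** 2 * xr ** 2 + beta ** 2 * yr ** 2))
    carrier = cmath.exp(2j * math.pi * f0 * xr)
    return envelope * carrier

def local_maxima(fmap):
    """Values strictly greater than their 4-connected neighbours (assumed definition)."""
    h, w = len(fmap), len(fmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            neighbours = [fmap[ny][nx]
                          for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                          if 0 <= ny < h and 0 <= nx < w]
            if all(fmap[y][x] > nb for nb in neighbours):
                peaks.append(fmap[y][x])
    return peaks

def amplification_weight(fmap):
    """(M - m)^2 with M the global maximum and m the mean of the other local maxima."""
    peaks = sorted(local_maxima(fmap), reverse=True)
    big_m = max(max(row) for row in fmap)
    others = peaks[1:] if peaks and peaks[0] == big_m else peaks
    small_m = sum(others) / len(others) if others else 0.0  # assumption: no other peak
    return (big_m - small_m) ** 2

def weighted_sum(feature_maps):
    """LSR as the weighted sum of feature maps already normalised to a common range."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    lsr = [[0.0] * w for _ in range(h)]
    for fmap in feature_maps:
        wt = amplification_weight(fmap)
        for y in range(h):
            for x in range(w):
                lsr[y][x] += wt * fmap[y][x]
    return lsr
```

A map with one dominant peak receives a large weight, whereas a map with many peaks of similar height receives a weight near zero, which is the intent of the amplification factor in [14].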
3 Top-Down Biasing for Detecting the Object
Because $HSR$ can provide strong priors on scene categories, in this section we look for the most appropriate HSR-based spectral attribute for discriminating between close-up and open scene categories. The estimation of scene categories from $HSR$ can be learnt by applying Fisher linear discriminant analysis [15]. We select a set of 200 images that contain a certain object. The images come from the Database of Cars and Faces in Context. The training set consists of the feature coefficients $\{HSR_1, HSR_2, \ldots, HSR_{200}\}$: $b_1 = 50$ of them, in subset $D_1$, belong to close-up scenes, and $b_2 = 150$, in subset $D_2$, belong to open scenes. The estimate of a spectral attribute $z$ from $HSR$ can be written as

$$z = v^T \cdot HSR = \sum_{i=1}^{n} v_i \cdot hsr_i \tag{5}$$
Thus, a corresponding set of 200 samples $\{z_1, z_2, \ldots, z_{200}\}$, divided into the subsets $Z_1$ and $Z_2$, is computed. The criterion function of the Fisher linear discriminant is

$$J(v) = \frac{|e_1 - e_2|^2}{g_1^2 + g_2^2} \tag{6}$$

where $e_i = \frac{1}{b_i} \sum_{z \in Z_i} z$ and $g_i^2 = \sum_{z \in Z_i} (z - e_i)^2$. The parameter vector $v$ that maximizes the criterion function is

$$v = G_W^{-1} (e_1 - e_2) \tag{7}$$

where $G_W = \sum_{hsr \in D_1} (hsr - e_1)(hsr - e_1)^t + \sum_{hsr \in D_2} (hsr - e_2)(hsr - e_2)^t$ and, in Eq. (7) and in $G_W$, $e_i = \frac{1}{b_i} \sum_{hsr \in D_i} hsr$ denotes the mean vector of subset $D_i$ (the scalar $e_i$ of Eq. (6) is its projection onto $v$). Thus, the n-dimensional binary classification problem has been converted to a one-dimensional one. Then, according to the optimal decision rule [15], we can find the threshold $th$ that separates the projected points in the one-dimensional subspace. If $z$ exceeds the threshold $th$, we decide that the image is a close-up view; otherwise, we decide that it is an open scene.

When the entire scene is a close-up view of the object, the object tends to have a large flat surface and to occupy a significant portion of the image. Thus, the scene structure is mostly determined by the object, and $LSR$ is sufficient for detecting the object. In this situation, object detection can rely on bottom-up attention alone. However, when the object is not the main portion of the image, the context information is mostly determined by the background and not by the object. It is then necessary to combine bottom-up and top-down attention in a way that makes full use of both $HSR$ and $LSR$. The next section is dedicated to this top-down attentional effect on object detection.
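The projection of Eqs. (5)-(7) can be illustrated with a small two-dimensional sketch. The toy samples, the function names, and the midpoint threshold are assumptions made for illustration; the paper derives $th$ from the optimal decision rule in [15] and works with n-dimensional $HSR$ vectors, for which a general linear solver would replace the explicit 2 x 2 inverse used here.

```python
def mean_vector(samples):
    """Class mean vector e_i."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]

def scatter_2d(samples, mean):
    """Sum of outer products (x - mean)(x - mean)^t for 2-D samples."""
    g = [[0.0, 0.0], [0.0, 0.0]]
    for s in samples:
        d = [s[0] - mean[0], s[1] - mean[1]]
        for r in range(2):
            for c in range(2):
                g[r][c] += d[r] * d[c]
    return g

def fisher_direction_2d(d1, d2):
    """v = G_W^{-1} (e1 - e2), Eq. (7), restricted to two dimensions."""
    e1, e2 = mean_vector(d1), mean_vector(d2)
    s1, s2 = scatter_2d(d1, e1), scatter_2d(d2, e2)
    gw = [[s1[r][c] + s2[r][c] for c in range(2)] for r in range(2)]
    det = gw[0][0] * gw[1][1] - gw[0][1] * gw[1][0]
    if det == 0.0:
        raise ValueError("within-class scatter matrix is singular")
    diff = [e1[0] - e2[0], e1[1] - e2[1]]
    # Explicit 2x2 inverse applied to (e1 - e2).
    v = [(gw[1][1] * diff[0] - gw[0][1] * diff[1]) / det,
         (-gw[1][0] * diff[0] + gw[0][0] * diff[1]) / det]
    return v, e1, e2

def project(v, sample):
    """Spectral attribute z = v^T x, Eq. (5)."""
    return sum(vi * xi for vi, xi in zip(v, sample))

def midpoint_threshold(v, e1, e2):
    """Assumed threshold: midpoint of the projected class means."""
    return 0.5 * (project(v, e1) + project(v, e2))

# Hypothetical toy data standing in for HSR vectors of the two scene classes.
close_up = [(2.0, 0.0), (4.0, 0.0), (3.0, 1.0), (3.0, -1.0)]
open_scene = [(-2.0, 0.0), (-4.0, 0.0), (-3.0, 1.0), (-3.0, -1.0)]
v, e1, e2 = fisher_direction_2d(close_up, open_scene)
th = midpoint_threshold(v, e1, e2)
is_close_up = project(v, (1.0, 5.0)) > th
```

In this toy case the classes differ only along the first coordinate, so the learned direction ignores the second coordinate and the test sample is assigned to the close-up class.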
4 Object Detection Algorithm
In this section, we describe a statistical way of introducing top-down attention into object detection. Given the holistic scene representation $HSR$, and considering the category property $s$ and the location property $l = (x, y)$ of the object $O$, the conditional probability $p(O | HSR)$ can be represented by the product of $p(s | HSR)$ and $p(l | s, HSR)$. Here $p(s | HSR)$ gives the probability of the presence of object category $s$, and $p(l | s, HSR)$ gives the probability of each location given that an object of category $s$ is present. To achieve top-down attention control, we use the conditional probability $p(s | HSR)$ in the pre-attention stage to decide whether a search operation should follow. Then, in the focus attention stage, $p(l | s, HSR)$ is estimated to guide bottom-up attention and direct it to the focus attention region, which is the region most likely to contain the object. These two stages facilitate the object detection procedure.

4.1 Top-Down Attention Control
The procedure of top-down attention can be divided into pre-attention and focus attention. In the pre-attention stage, since the training set consists of a large number of images, we approximate the prior probabilities by $p(s) = p(\neg s) = 1/2$. Accordingly, applying Bayes' rule gives $p(s | HSR) = p(HSR | s) / (p(HSR | s) + p(HSR | \neg s))$. For learning $p(HSR | s)$, we use $\{HSR_1, HSR_2, \ldots, HSR_M\}$ as the training data, where $M$ is the number of training pictures. The same algorithm holds for the likelihood function $p(HSR | \neg s)$. Introducing contextual category information $C = \{C_i\}_{i=1,\ldots,K}$, the likelihood function $p(HSR | s)$ can be represented as

$$p(HSR | s) = \prod_{t=1}^{M} \sum_{i=1}^{K} \omega_i\, p(HSR_t | C_i, s), \qquad \sum_{i=1}^{K} \omega_i = 1 \tag{8}$$

where $\omega_i$ represents the prior probability of the $i$-th contextual category. The Gaussian mixture model is then selected for learning $p(HSR | s)$:

$$p(HSR_t | C_i, s) = p(HSR_t | C_i, \theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{(HSR_t - \mu_i)^2}{2\sigma_i^2}\right] \tag{9}$$

where the parameter set for object $s$ is $\Theta = \{\theta_i, \omega_i\}_{i=1,\ldots,K}$, and the parameters of the $i$-th Gaussian distribution are $\theta_i = (\mu_i, \sigma_i)$. We define $L(\Theta | HSR)$ as

$$L(\Theta | HSR) = \sum_{t=1}^{M} \log\left[\sum_{i=1}^{K} \omega_i\, p(HSR_t | C_i, \theta_i)\right] \tag{10}$$
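As a hedged illustration of Eqs. (8)-(10) and of the pre-attention posterior with equal priors, the sketch below treats $HSR$ as a scalar, as Eq. (9) is written. The two mixture models and all function names are hypothetical; in the paper the parameters would come from training.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density of Eq. (9)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_density(x, weights, mus, sigmas):
    """Inner sum of Eq. (8): sum_i w_i p(x | C_i, theta_i)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def log_likelihood(samples, weights, mus, sigmas):
    """L(Theta | HSR) of Eq. (10)."""
    return sum(math.log(mixture_density(x, weights, mus, sigmas)) for x in samples)

def posterior_present(x, model_present, model_absent):
    """p(s | HSR) under equal priors p(s) = p(not s) = 1/2."""
    p_s = mixture_density(x, *model_present)
    p_not_s = mixture_density(x, *model_absent)
    return p_s / (p_s + p_not_s)

# Hypothetical K = 2 models given as (weights, means, standard deviations).
model_present = ([0.5, 0.5], [2.0, 4.0], [1.0, 1.0])
model_absent = ([0.5, 0.5], [-2.0, -4.0], [1.0, 1.0])
p_mid = posterior_present(0.0, model_present, model_absent)
p_high = posterior_present(3.0, model_present, model_absent)
```

A value of $HSR$ that is equally well explained by both models gives a posterior of one half, while a value lying deep inside the "present" model gives a posterior close to 1.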
Learning is performed using the EM algorithm [16]. Given the newest parameters $\hat\theta_i = (\hat\mu_i, \hat\sigma_i)$ and $\hat\omega_i$, the E-step computes the expectation of the complete-data log-likelihood $L(\Theta | HSR, C)$ [16], which is defined as

$$Q(\Theta, \hat\Theta) = E\left[\log p(HSR, C | \Theta) \mid HSR, \hat\Theta\right] \tag{11}$$

where $\hat\Theta$ represents the current parameters. The M-step selects the value of $\Theta$ that maximizes $Q(\Theta, \hat\Theta)$:

$$\hat\omega_i^{new} = \frac{1}{M} \sum_{t=1}^{M} p(C_i | HSR_t, \hat\theta_i) \tag{12}$$

$$\hat\mu_i^{new} = \frac{\sum_{t=1}^{M} p(C_i | HSR_t, \hat\theta_i)\, HSR_t}{\sum_{t=1}^{M} p(C_i | HSR_t, \hat\theta_i)} \tag{13}$$

$$\hat\sigma_i^{2\,new} = \frac{\sum_{t=1}^{M} p(C_i | HSR_t, \hat\theta_i)\, (HSR_t - \hat\mu_i^{new})(HSR_t - \hat\mu_i^{new})^T}{\sum_{t=1}^{M} p(C_i | HSR_t, \hat\theta_i)} \tag{14}$$
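The E-step and the updates of Eqs. (12)-(14) can be sketched for scalar samples, where the outer product in Eq. (14) reduces to a squared deviation. The toy data, the initial parameters, the fixed iteration count, and the variance floor `min_var` are assumptions added for illustration and numerical safety; they are not taken from the paper.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density, as in Eq. (9)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def em_step(samples, weights, mus, sigmas, min_var=1e-6):
    """One EM iteration for a 1-D Gaussian mixture."""
    k, m = len(weights), len(samples)
    # E-step: responsibilities p(C_i | HSR_t, theta_i) for every sample.
    resp = []
    for x in samples:
        joint = [weights[i] * gaussian_pdf(x, mus[i], sigmas[i]) for i in range(k)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: Eqs. (12)-(14).
    new_w, new_mu, new_sigma = [], [], []
    for i in range(k):
        r_sum = sum(resp[t][i] for t in range(m))
        mu_i = sum(resp[t][i] * samples[t] for t in range(m)) / r_sum
        var_i = sum(resp[t][i] * (samples[t] - mu_i) ** 2 for t in range(m)) / r_sum
        new_w.append(r_sum / m)
        new_mu.append(mu_i)
        new_sigma.append(math.sqrt(max(var_i, min_var)))  # floor is an added safeguard
    return new_w, new_mu, new_sigma

def fit_gmm(samples, weights, mus, sigmas, iterations=50):
    """Repeat EM steps a fixed number of times (a convergence test could be used instead)."""
    for _ in range(iterations):
        weights, mus, sigmas = em_step(samples, weights, mus, sigmas)
    return weights, mus, sigmas

# Hypothetical scalar data with two well-separated clusters (K = 2).
data = [0.0, 0.1, -0.1, 10.0, 10.1, 9.9]
w, mu, sigma = fit_gmm(data, [0.5, 0.5], [1.0, 9.0], [1.0, 1.0])
```

On this toy data the two means settle near 0 and 10 with equal mixing weights, which matches what the update equations predict for two equally sized, well-separated clusters.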
These two steps are repeated until $\Theta$ converges to a stable value. The final result of pre-attention is obtained by computing $p(s | HSR)$.

In the focus attention stage, $p(l | s, HSR)$ is defined as $p(l, HSR | s)$ divided by $p(HSR | s)$. $p(l, HSR | s)$ is also modeled by a Gaussian mixture model. The likelihood function $p(l_t, HSR_t | s)$ can be represented as

$$p(l_t, HSR_t | s) = \sum_{i=1}^{K} \omega_i\, p(HSR_t | C_i, \theta_i)\, p(l_t | \delta_i) \tag{15}$$

where $K$ is the number of Gaussian clusters, and each cluster is decomposed into two Gaussian functions, corresponding to the holistic scene representation ($\theta_i = (\mu_i, \sigma_i)$) and the location information ($\delta_i = (\mu'_i, \sigma'_i)$). Thus, the parameter set of the model is $\Theta = \{\theta_i, \delta_i, \omega_i\}_{i=1,\ldots,K}$. The training process for $\Theta$ is similar to the process in pre-attention. Finally, we compute $p(l | s, HSR)$ for each location.

4.2 Top-Down Attention Guided Object Detection Algorithm
We present a top-down attention guided object detection algorithm based on the holistic and local scene representations. First, if the HSR-based spectral attribute $z$ exceeds the threshold $th$, the test image is a close-up view, and object detection can rely on bottom-up attention alone (this case is not detailed in this paper). If the attribute is less than $th$, top-down attention guidance is needed, as described in the following steps. Second, in the pre-attention stage, the estimated $p(s | HSR)$ is used to decide whether a search operation should follow. Third, the focus attention region is detected using the integrated information $IR$, and the final result can be drawn from $IR$.

To find the focus attention region, we define $l_{x_0, y_0} = \sum_{l_{x,y}} l_{x,y}\, p(l_{x,y} | s, HSR)$ as the center of the focus attention region. According to the distribution of $p(l | s, HSR)$ in the test image, we set the width of the region to the width of the image. The height is determined from

$$\Delta_+ = \frac{\sum_{x=0}^{255} p(l_{x,y} | s, HSR)}{\sum_{x=0}^{255} p(l_{x,y+1} | s, HSR)}, \qquad y = y + 1 \tag{16}$$

where $\Delta_+$ is computed iteratively and the initial value of $y$ is $y_0$. Once $\Delta_+ > 10$, the iteration stops and $y_+$ is set to the current value of $y$. A similar scheme holds for $\Delta_-$, and $y_-$ is obtained when that iteration stops. Finally, the height of the focus attention region is defined as $h = y_+ + y_-$. Supposing the focus attention region can be split into small $n \times n$ sub-blocks and the number of blocks is $bl$, we use $4bl$ random blocks to overlay the initial region so that the reconstruction error is minimal. The integrated information $IR$ is defined as

$$IR = p(l | s, HSR) \cdot LSR \tag{17}$$

The $IR$ in each block is regarded as a sample and transformed into a vector. Then, we compute the difference between each sample's $IR$ and every other sample's $IR$, and use the sum of the squared differences as the saliency of the corresponding sample.
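The focus attention steps above can be sketched as follows. This is one possible reading under stated assumptions: `p[y][x]` is a location map that sums to 1, the row scan returns the last row reached before the row-sum ratio of Eq. (16) exceeds 10, a zero row sum also stops the scan, and blocks are given as plain lists of $IR$ values. All function names are hypothetical, and the random block overlay is not reproduced.

```python
def expected_location(p):
    """Centre (x0, y0) as the probability-weighted mean location."""
    height, width = len(p), len(p[0])
    x0 = sum(x * p[y][x] for y in range(height) for x in range(width))
    y0 = sum(y * p[y][x] for y in range(height) for x in range(width))
    return x0, y0

def scan_boundary(p, y_start, step, ratio_stop=10.0):
    """Walk rows from y_start in direction step until the row-sum ratio exceeds ratio_stop."""
    row_sums = [sum(row) for row in p]
    y = y_start
    while 0 <= y + step < len(row_sums):
        nxt = row_sums[y + step]
        if nxt == 0.0 or row_sums[y] / nxt > ratio_stop:
            break
        y += step
    return y

def integrated_information(p, lsr):
    """IR = p(l | s, HSR) * LSR, element by element, Eq. (17)."""
    return [[p[y][x] * lsr[y][x] for x in range(len(p[0]))] for y in range(len(p))]

def block_saliency(blocks):
    """For each block, the sum over all other blocks of squared IR differences."""
    saliency = []
    for i, a in enumerate(blocks):
        total = 0.0
        for j, b in enumerate(blocks):
            if i != j:
                total += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        saliency.append(total)
    return saliency

# Hypothetical 5 x 2 location map concentrated in rows 1 to 3.
p_map = [[0.0005, 0.0005], [0.15, 0.15], [0.2, 0.2], [0.149, 0.149], [0.0005, 0.0005]]
x0, y0 = expected_location(p_map)
y_plus = scan_boundary(p_map, round(y0), +1)
y_minus = scan_boundary(p_map, round(y0), -1)
```

In this toy map the scan stops at rows 3 and 1, so rows 1 to 3 form the focus attention region; a block whose $IR$ vector differs from all the others receives the highest saliency.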
5 Experiment Result
The images in our experiments are 256 × 256 pixels in size and come from the Database of Cars and Faces in Context. For each object, we use about 200 images for training. Fig. 1 illustrates the distribution of $P(s | HSR)$ when the task is to look for cars in 16 images. The parameter $K$ of the Gaussian mixture model is set to 2 in our experiments (indoors and outdoors). In Fig. 1, almost all values of $P(s | HSR)$ are close to 0 or 1, so pre-attention can provide reliable prior knowledge about the presence or absence of the object for most of the images. When $P(s | HSR) \approx 0$, we can quickly and reliably ascertain the absence of cars before scanning the whole image. When $P(s | HSR) \approx 1$, we can assert the presence of cars in the scene. Interestingly, even when cars are absent from a scene, the value of $P(s | HSR)$ for some images may be close to 1. This demonstrates the effectiveness of top-down attention guidance.
Fig. 1. The distribution of $P(s | HSR)$ for cars
We chose 200 images at random from the database to compute $P(s | HSR)$ when the task is to look for cars. Of these, 116 images satisfy $P(s | HSR) > 0.95$ and 68 satisfy $P(s | HSR) < 0.05$. To guarantee reliable object detection, we set the threshold $TH = 0.05$. If $P(s | HSR) < 0.05$, the system stops the detection process immediately. This stage speeds up the detection procedure. Some experimental results of our algorithm and Itti’s are shown in Fig. 2, which presents two example images for which the task is to look for cars and people. The parameter $n$ in our algorithm is set to 16. The local scene representation $LSR$ is suppressed in the black regions in Fig. 2(b) because the top-down attention guidance factor $p(l | s, HSR) \approx 0$ there. The remaining regions in Fig. 2(b) belong to the focus attention region, where the probability of the presence of cars is very high. In Fig. 2(c) and Fig. 2(d), some salient regions are labeled in descending order of saliency. Unlike Itti’s algorithm, object detection in our algorithm is guided by top-down attention, so the detection result is not affected by objects outside the focus attention region. For images in the database with $P(s | HSR) > 0.05$, our algorithm guarantees that 91% of the objects fall within the focus attention region.

Fig. 2. Experiment results of our algorithm and Itti’s algorithm: (a) input image; (b) integrated information; (c) result of this paper; (d) result of Itti
6 Conclusion
In this paper we present a new top-down attention guided object detection algorithm. The new holistic scene representation is defined as top-down attention information and is first used to provide reliable priors for scene categories. Then, the object detection algorithm is applied to open scenes, and top-down attention information is used in pre-attention and focus attention. Our pre-attention differs from that defined in the bottom-up attention mechanism [2]. In [2], early visual features are computed with just the elementary information available at the pre-attentive stage, in the form of low-level feature maps tuned to color, orientation, and intensity. Our pre-attention is driven only by top-down attention information and is independent of the bottom-up attention information that belongs to the object. This can save computing resources and enhance object detection performance. Experiments on 600 natural images demonstrate its effectiveness.

Perceptual grouping can provide useful middle-level information. It is based on local information and can be guided by high-level information. Unifying the rules of perceptual grouping into a grouping framework to design a more robust object detection algorithm is our future work.
Acknowledgements The authors would like to thank Itti and his iLab for providing source codes which were used for comparison in this paper. This research is supported by the National Natural Science Foundation of China under Grant No. 60373029 and the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No. 20050004001.
References
1. Marr, D.: Vision: a Computational Investigation into the Human Representation and Processing of Visual Information. San Francisco: Freeman, W. H. (1982)
2. Itti, L. and Koch, C.: Computational Modeling of Visual Attention. Nature Reviews Neuroscience, Vol.2 (2001) 194-230
3. Itti, L., Gold, C., Koch, C.: Visual Attention and Target Detection in Cluttered Natural Scenes. Optical Engineering, Vol.40 (2001) 1784-1793
4. Zhang, P. and Wang, R.S.: Detecting Salient Regions Based on Location Shift and Extent Trace. Journal of Software, Vol.15 (2004) 891-898
5. Frintrop, S. and Rome, E.: Simulating Visual Attention for Object Recognition. Proceedings of the Workshop on Early Cognitive Vision (2004)
6. Long, F.H. and Zheng, N.N.: A Visual Computing Model Based on Attention Mechanism. Journal of Image and Graphics, Vol.3 (1998) 592-595
7. Rybak, I.A., Gusakova, V.I., Golovan, A.V., Podladchikova, L.N., Shevtsova, N.A.: A Model of Attention-Guided Visual Perception and Recognition. Vision Research, Vol.38 (1998) 2387-2400
8. Salah, A.A., Alpaydin, E., Akarun, L.: A Selective Attention-Based Method for Visual Pattern Recognition with Application to Handwritten Digit Recognition and Face Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.24 (2002) 420-425
9. Itti, L.: Models of Bottom-Up and Top-Down Visual Attention. Pasadena: California Institute of Technology (2000)
10. Navalpakkam, V. and Itti, L.: A Goal Oriented Attention Guidance Model. Lecture Notes in Computer Science, Vol.2525 (2002) 453-461
11. Henderson, J.M.: Human Gaze Control during Real-World Scene Perception. Trends in Cognitive Sciences, Vol.7 (2003) 498-504
12. Schaaf, A. and Hateren, J.H.: Modelling the Power Spectra of Natural Images: Statistics and Information. Vision Research, Vol.36 (1996) 2759-2770
13. Hyvärinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks, Vol.10 (1999) 626-634
14. Itti, L. and Koch, C.: Feature Combination Strategies for Saliency-Based Visual Attention Systems. Journal of Electronic Imaging, Vol.10 (2001) 161-169
15. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. 2nd edn. ISBN: 0-471-05669-3. Wiley Interscience (2001)
16. Bilmes, J.A.: A Gentle Tutorial of the EM Algorithm and Its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report, ICSI-TR97-021, Berkeley: University of Berkeley (1998)
Absolute Quantification of Brain Creatine Concentration Using Long Echo Time PRESS Sequence with an External Standard and LCModel: Verification with in Vitro HPLC Method
Y. Lin¹, Y.P. Zhang¹, R.H. Wu¹,*, H. Li², Z.W. Shen¹, X.K. Chen¹, K. Huang¹, and G. Guo¹
¹ Department of Medical Imaging, The 2nd Hospital, Shantou University Medical College, Shantou 515041, China [email protected]
² Central Laboratory, Shantou University Medical College, Shantou 515041, China
Abstract. The aim of this study was to investigate the accuracy of absolute quantification of brain creatine (Cr) concentration using an in vivo long echo time PRESS sequence with an external standard. Ten swine and an external standard phantom were examined with a 1.5T GE Signa scanner and the standard head coil. 1H-MRS data were acquired from two VOIs (2 × 2 × 2 cm³) placed in the swine brain and in the external standard solution, using a PRESS sequence with TE = 135 msec, TR = 1500 msec, and 128 scan averages. In vivo Cr was evaluated with LCModel. In vitro Cr concentration was analyzed by the HPLC method. In the 1H-MRS group, the Cr concentration was 9.37±0.137 mmol/kg. In the HPLC group, the Cr concentration was 8.905±0.126 mmol/kg. Good agreement was obtained between the two methods (P=0.491), which indicates that a long echo time PRESS sequence with an external standard can accurately measure brain Cr concentration. The application of LCModel makes MRS quantification more convenient.
Keywords: PRESS sequence; creatine; external standard; LCModel; HPLC.
* Corresponding author.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 203-210, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction
Total creatine (Cr), resonating at chemical shifts of 3.03 ppm and 3.94 ppm, represents the quantity of phosphocreatine (PCr) and creatine (Cr) in neurones and glial cells [1]. As a store and carrier of phosphate-bound energy, Cr plays an essential role in energy metabolism. It undergoes a phosphorylation-dephosphorylation reaction catalyzed by the enzyme creatine kinase: ADP + PCr ⇌ ATP + Cr. If oxidative phosphorylation cannot be maintained to supply ATP, PCr can donate its phosphate group to ADP to form ATP and thereby reduce the extent of non-oxidative glucose consumption. This can reduce neuronal death, which is mostly due to a delayed decrease of ATP under hypoxic stress, and protect normal brain function against the accumulation of lactate [2]. Decreased brain Cr has been linked to anoxic seizure [3]. An inborn deficiency of guanidinoacetate methyltransferase (GAMT) leads to creatine deficiency, which manifests as mental retardation, movement disorder, developmental delay, and epilepsy. Oral Cr can replenish a large proportion of the cerebral Cr and PCr pool, which buffers the cellular ATP concentration and delays its depletion under energy compromise, and has led to marked clinical improvement and recovery [4]. Increased levels of Cr may be associated with ageing-related mild cognitive impairment and, in more extreme cases, frank dementia [5]. Cr and Cr analogues have also been reported to have anti-tumor, anti-viral, and anti-diabetic effects [6]. Therefore, non-invasive detection of brain Cr concentration using 1H-MRS has important clinical significance for the diagnosis and treatment of brain diseases related to changes in Cr content.

In localized brain MR spectroscopy, measurement results were long expressed as metabolite ratios. However, this posed a dilemma, because metabolite ratios are not comparable with quantitative results obtained from biopsy samples and in vitro animal studies. Internal standards such as water have been used to obtain Cr concentrations [7,8]. However, this method has been reported to have a number of potential errors, so its results are not accurate. For this reason, the MR external standard method is preferable. It was previously used with a short echo time (TE) STEAM sequence [9] and a short TE PRESS sequence [10] to quantify brain Cr concentration. In terms of signal-to-noise ratio, the PRESS sequence is better than the STEAM sequence. A short TE has the advantage of allowing short-T2 metabolites such as glutamate + glutamine (Glx) and myo-inositol (Ins) to be investigated [11]. Because Cr is a long-T2 metabolite, a long TE PRESS sequence can be used to detect the Cr peak and obtain the maximum signal-to-noise ratio.
LCModel is an automated, user-independent curve-fitting software package for absolute quantification of cerebral metabolites from MR spectra. LCModel has also been employed with the external standard method [10], but its application has not been reported in mainland China. In vitro high performance liquid chromatography (HPLC) is a reliable technique for analyzing metabolite concentrations and is suitable for assessing the accuracy of metabolite quantification by in vivo 1H-MRS. Therefore, the purpose of this study was to investigate the accuracy of absolute quantification of brain Cr concentration using an in vivo long TE PRESS sequence with an external standard and LCModel. In vitro Cr concentration was measured by HPLC. By evaluating the correlation between the in vivo MRS method and the in vitro HPLC method, an accurate MR spectroscopy technique can be determined.
2 Methodology
Ten swine (3.13 ± 0.59 kg, mean ± SD) were investigated in this study using a 1.5T GE Signa scanner and the standard quadrature head coil. All studies were performed in accordance with animal protection guidelines and approved by the governmental authority. Prior to the MRI examination, all animals were intravenously anesthetized with 1 ml/kg of a mixture of chlorpromazine hydrochloride and procarbazine hydrochloride, then immobilized with a restraint system and placed supine on the scanner bed with the head firmly fixed. A 125 ml spherical phantom containing
Absolute Quantification of Brain Creatine Concentration
5 mmol NAA, 5 mmol γ-aminobutyric acid, 2.5 mmol glutamine, 2.5 mmol glutamate, 4 mmol creatine, 1 mmol choline chloride and 2.5 mmol myo-inositol of the highest purity (Sigma Chemie) in physiological saline was used as the external standard. The phantom was placed adjacent to the animal’s head in the detection coil, with its axis of symmetry parallel to the static magnetic field during MR scanning, and lay within the image field of view. 1H-MRS data were acquired from two 20-mm cubic VOIs, placed in the swine brain and in the external standard solution (Fig. 1), using a PRESS sequence with TE = 135 msec, TR = 1500 msec, and 128 scan averages.
Fig. 1. Voxels (2 x 2 x 2 cm3) were placed in the swine brain and the external standard solution
After data acquisition, the spectroscopic data were transferred to an SGI O2 workstation, where fully automated, user-independent Cr quantification was performed by LCModel with the imported basis set for TE 135 msec. To measure the swine brain Cr concentration accurately, LCModel requires calibration against a standard of known concentration [12]. Cr analysis data acquired from the external standard were used to calculate a calibration factor. Let $C^{s}_{\mathrm{lcm}}$ be the concentration output by LCModel for the external standard phantom and $C^{s}_{\mathrm{true}}$ the true concentration in the phantom (32 mmol/L); the calibration factor is then:
$$F_{\mathrm{calib}} = \frac{C^{s}_{\mathrm{true}}}{C^{s}_{\mathrm{lcm}}} \qquad (1)$$
The in vivo creatine concentration of the swine brain in the VOI, $C^{b}_{\mathrm{true}}$, was therefore calculated as:
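$$C^{b}_{\mathrm{true}} = F_{\mathrm{calib}} \times C^{b}_{\mathrm{lcm}} = \frac{C^{s}_{\mathrm{true}}}{C^{s}_{\mathrm{lcm}}} \times C^{b}_{\mathrm{lcm}} \qquad (2)$$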
(where the superscripts s and b denote the standard solution and the brain, respectively). Finally, the concentration is converted to millimoles per kilogram wet weight by dividing by $P_{\mathrm{brain}}$ = 1.00 kg/L, the specific gravity of swine brain tissue measured in our lab. The fraction of cerebrospinal fluid (CSF) in the VOI, $f_{\mathrm{csf}}$, should also be corrected for, following DeCarli [13]:
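$$C^{b}_{\mathrm{true}} = \frac{C^{s}_{\mathrm{true}} \times C^{b}_{\mathrm{lcm}}}{C^{s}_{\mathrm{lcm}} \times (1 - f_{\mathrm{csf}}) \times P_{\mathrm{brain}}} \qquad (3)$$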
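The calibration chain of equations (1)-(3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and are not part of LCModel, and the LCModel readings in the usage example (16.0 and 4.0) and the CSF fraction (0.1) are made-up values, not measurements from this study. Only the phantom content (4 mmol creatine in 125 ml) and P_brain = 1.00 kg/L come from the text.

```python
def phantom_concentration(amount_mmol, volume_ml):
    """True concentration of the external standard in mmol/L."""
    return amount_mmol / (volume_ml / 1000.0)


def calibration_factor(c_true_std, c_lcm_std):
    """Equation (1): ratio of true to LCModel-reported standard concentration."""
    return c_true_std / c_lcm_std


def brain_concentration(c_lcm_brain, c_true_std, c_lcm_std, f_csf=0.0, p_brain=1.00):
    """Equation (3): calibrated brain concentration in mmol/kg wet weight.

    With f_csf = 0 and p_brain = 1 this reduces to equation (2).
    """
    f_calib = calibration_factor(c_true_std, c_lcm_std)
    return f_calib * c_lcm_brain / ((1.0 - f_csf) * p_brain)


# Phantom from the text: 4 mmol creatine in 125 ml gives 32 mmol/L.
c_true_std = phantom_concentration(4.0, 125.0)

# Hypothetical LCModel outputs and CSF fraction, for illustration only.
c_brain = brain_concentration(c_lcm_brain=4.0, c_true_std=c_true_std,
                              c_lcm_std=16.0, f_csf=0.1)
print(c_true_std, round(c_brain, 3))
```

Keeping the CSF fraction and specific gravity as explicit parameters makes it easy to see how each correction changes the final value.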
After the MRS examination, each animal was sacrificed and a 2 x 2 x 2 cm3 brain sample corresponding to the location of the voxel defined by MR spectroscopy was dissected with a sharp knife. All specimens were wrapped in plastic film, immediately immersed in liquid nitrogen and stored at -70 °C until preparation for HPLC measurement, as described by Ai-Ming Sun with some modifications [14]. Brain tissues were placed in 20 ml of 0.42 mol/L perchloric acid and homogenized while cooled on ice. The homogenates were centrifuged at 4000 rpm for 10 min at -10 °C. An aliquot of 0.2 ml supernatant was pipetted into a 1.5-ml plastic microcentrifuge tube and 0.085 ml of 1 mol/L KOH was added; the tube was capped tightly, vortex-mixed for 1 min and centrifuged again at 4000 rpm for 10 min at -10 °C. Then 0.02 ml of supernatant was ready for HPLC analysis. Cr concentrations from the in vivo MR method and the in vitro HPLC method were compared by single-factor analysis of variance (ANOVA). Differences were deemed significant if the p value was less than 0.05.
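For readers unfamiliar with the single-factor ANOVA used for the comparison, the F statistic can be computed as in the minimal sketch below. The function name is hypothetical and the two groups are toy numbers chosen for easy arithmetic; they are not the study's measurements. The p value then follows from the F distribution with the returned degrees of freedom, which a statistics package (for example `scipy.stats.f_oneway`) would normally provide.

```python
def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data only (not from the paper): two groups of three values each.
f_stat, df_b, df_w = one_way_anova_f([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(f_stat, df_b, df_w)  # 1.5 1 4
```

With only two groups, as in this study, the test is equivalent to a two-sample t test with pooled variance (F equals t squared).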
3 Results
Fig. 2 shows examples of proton magnetic resonance spectra of metabolites in swine brain (a) and the external standard solution (b). The resonances of the major metabolites are clearly visible. Fig. 3 shows an example chromatogram of the brain homogenate. Similar chromatograms were obtained for all brain samples in the study. Table 1 lists the Cr concentration obtained by the MR spectroscopy method and the HPLC method for each swine. In the MR spectroscopy group, the mean Cr concentration was 9.37 ± 0.137 mmol/kg. In the HPLC group, the mean Cr concentration was 8.905 ± 0.126 mmol/kg. There was no statistically significant difference between the two methods (p = 0.491), which indicates that a long echo time PRESS sequence with an external standard and LCModel software can accurately quantify brain Cr concentration.
Fig. 2. Proton magnetic resonance spectra of metabolites in swine brain (a) and external standard solution (b), acquired using a PRESS sequence with TE 135 msec, TR 1500 msec and 128 averages
The in vivo experiment was also performed on 27 healthy subjects aged 20–72 years and three elderly people with subtle cognitive decrement aged > 65 years. The voxel was placed in the left frontal lobe in both groups. No abnormal signal was found on T1WI, T2WI, FLAIR or DWI in any healthy subject, but Cr concentrations in the frontal lobe were higher in the elderly people than in the young subjects. Subjects with subtle cognitive decrements who appeared normal on conventional imaging had significantly higher Cr concentration in the frontal lobe than controls.
Fig. 3. The chromatogram showed a good baseline; the Cr peak was well separated from other peaks
Table 1. Comparison of swine brain Cr concentration (mmol/kg, VOI) measured by PRESS sequence with an external standard method and by the in vitro HPLC method
4 Discussion
Total Cr plays an essential role in energy metabolism in the brain, so non-invasive detection of Cr concentration using 1H-MRS has important clinical significance for the diagnosis and treatment of brain diseases related to changes in Cr content.
In localized brain MR spectroscopy, metabolite ratios have long been used to express changes in metabolites. The most compelling reason is that ratios correct for several experimental conditions that are unknown, difficult to obtain, or uncontrollable, e.g. B1 inhomogeneities, instrumental gain drifts, differences between imagers and localization methods, and voxel partial-volume contamination from CSF [15]. However, the main drawback of ratios is that the results lack objectivity, accuracy and comparability. If the concentration ratio of two metabolites increases, it may not be possible to tell whether this is due to an increase in the numerator or a decrease in the denominator [7]. For metabolite analysis in clinical MR spectroscopy studies, absolute concentrations therefore have advantages over metabolite ratios. The internal reference method, which exploits the known water content of tissue, has commonly been used to obtain absolute metabolite concentrations [7, 8]. This method is conceptually simple and direct, because the metabolite and internal reference signals are both acquired from the same VOI under the same loading conditions. The internal reference method is also insensitive to many of the experimental factors affecting the performance of quantitative techniques, including effects related to loading, standing waves, B1 inhomogeneities, practical issues of phantom positioning, user expertise and examination duration [16]. However, the assumption of a constant internal reference has a drawback: the fraction of NMR-visible water does not remain constant across physiological and pathophysiological states. Using water as an internal reference also turned out to be particularly sensitive to baseline distortions of the strong water resonance in the reference acquisition without water suppression; this sensitivity severely complicated the exact definition of the boundaries for signal integration [9].
Thus, using water as an internal reference cannot accurately quantify metabolite concentrations, and the MR external standard method is preferable for quantifying brain metabolite concentrations. It has been used with the stimulated echo acquisition mode (STEAM) sequence to measure Cr concentration since 1993 [9]. In terms of signal-to-noise ratio, the PRESS sequence is better than the STEAM sequence. The PRESS sequence is based on one 90 degree pulse and two refocusing 180 degree orthogonal slice-selective pulses; because of the 180 degree pulses it acquires the entire signal, and hence yields twice the S/N of STEAM. The external standard method using a short TE PRESS sequence to measure brain Cr concentration has been reported [10], and the results tended to be low compared with quantitative results obtained from biopsy samples or in vitro animal studies. A short TE is advantageous for investigating short-T2 metabolites, such as glutamate + glutamine (Glx) and myo-inositol (Ins) [11]. Cr is a long-T2 metabolite, so a long TE PRESS sequence can be used to detect the Cr peak and obtain the maximum signal-to-noise ratio. In our study, we used a long TE PRESS sequence with an external standard and LCModel software to quantify swine brain Cr concentration. To examine the accuracy of the in vivo MRS method, the in vitro Cr value was measured by HPLC. Good agreement was obtained between the two methods, which indicates that a long TE PRESS sequence with an external standard and LCModel can accurately quantify the absolute Cr concentration. This experiment was also carried out in the clinic.
Further information will be found for the diagnosis and treatment of brain diseases related to changes in Cr content. However, MRS with the external standard method has been reported to have a number of potential errors [16]: the external phantom itself increases magnetic field inhomogeneities and correspondingly impairs water suppression, producing spectral distortions, and it can lead to eddy currents and loading-related effects; subjects also feel less comfortable because of the phantom positioning. In our study, however, we used the PRESS sequence to acquire the data, which, owing to the two refocusing 180 degree pulses, could correct for field inhomogeneities, instrumental gain drifts, and imager and localization method differences. We also used a relatively small phantom (125 ml) to eliminate standing wave effects and other sources of discomfort. The external phantom itself can also be used to correct for residual signal variations due to B1 inhomogeneity [16]. The in vivo Cr value is close to the in vitro value. However, a slightly higher value was observed in the MR spectroscopy group than in the HPLC group; the Cr peaks contain contributions from aminobutyric acid, lysine and glutathione, which could explain the higher Cr values obtained by the in vivo MR spectroscopy method, although the difference was not statistically significant. Further studies are needed to obtain more detailed information.
5 Conclusions
The in vitro HPLC method is reliable and suitable for assessing the performance of absolute quantification of cerebral metabolites using in vivo 1H-MRS. The dilemma of comparing in vitro and in vivo results could thus be solved. The long echo time PRESS sequence performed with an external standard is demonstrated to be an accurate and scientific technique for detecting brain Cr concentration, which helps to provide further insights into the diagnosis and treatment of brain diseases related to changes in Cr content. The use of LCModel software makes 1H-MRS quantification more convenient.
Acknowledgements
The present study was supported by grants from the National Science Foundation of China (30270379/C010511 and 30170262/C010511) and the Li Ka Shing Foundation.
References
1. Urenjak, J., Williams, S.R., Gadian, D.G.: Proton nuclear magnetic resonance spectroscopy unambiguously identifies different neural cell types. J Neurosci. 13 (1993) 981–989
2. Mich, T., Wick, M., Fuji, H.: Proton MRS of oral creatine supplementation in rats: cerebral metabolite concentrations and ischemic challenge. NMR Biomed. 12 (1999) 309–314
3. Wilken, B., Ramirez, J.M., Probst, I.: Creatine protects the central respiratory network of mammals under anoxic conditions. Pediatr Res. 43 (1998) 8–14
4. Stöckler, S., Hanefeld, F., Frahm, J.: Creatine replacement therapy in guanidinoacetate methyltransferase deficiency, a novel inborn error of metabolism. Lancet. 348 (1996) 789–790
5. Wyss, M., Kaddurah-Daouk, R.: Creatine and creatinine metabolism. [Review]. Physiol Rev. 80 (2000) 1107–1213
6. Lubec, B., Aufricht, C.: Creatine reduces collagen accumulation in the kidneys of diabetic db/db mice. Nephron. 67 (1994) 214–217
7. Tong, Z., Yamaki, T., Herkner, K.: In vivo quantification of the metabolites in normal brain and brain tumors by proton MR spectroscopy using water as an internal standard. Magn Reson Imaging. 22 (2004) 735–742
8. Christiansen, P., Henriksen, O., Stubgaard, M.: In vivo quantification of brain metabolites by 1H-MRS using water as an internal standard. Magn Reson Imaging. 11 (1993) 107–118
9. Mich, T.: Absolute Concentrations of Metabolites in the Adult Human Brain in vivo: Quantification of Localized Proton MR Spectra. Radiology. 187 (1993) 219–227
10. Friedrich, G., Woermann, M.D., Mary, A.: Short Echo Time Single-Voxel 1H Magnetic Resonance Spectroscopy in Magnetic Resonance Imaging–Negative Temporal Lobe Epilepsy: Different Biochemical Profile Compared with Hippocampal Sclerosis. Ann Neurol. 45 (1999) 369–376
11. Woermann, F.G., Mclean, M.A.: Quantitative short echo time proton magnetic resonance spectroscopic imaging study of malformations of cortical development causing epilepsy. Brain. 124 (2001) 427–436
12. Provencher, S.W.: LCModel User’s Manual, version: 6.1-6.4 (2005)
13. DeCarli, C., Maisog, J., Murphy, D.G.: Method for quantification of brain, ventricular, and subarachnoid CSF volumes from MR images. Journal of Computer Assisted Tomography. 16 (1992) 274–284
14. Sun, A.M., Wang, E.R., Mao, B.Y.: Determination of Creatine, Phosphocreatine and Adenosinephosphates in Experimental Hydrocephalus Tissue by Reversed-Phase High Performance Liquid Chromatography. J Sichuan Univ (Med Sci Edi). 35 (2004) 113–116
15. Li, B.S., Wang, H.: Metabolite ratios to assumed stable creatine level may confound the quantification of proton brain MR spectroscopy. Magn Reson Imaging. 21 (2003) 923–928
16. Keevil, S.F., Barbiroli, B., Brooks, J.W.C.: Absolute metabolite quantification by in vivo NMR Spectroscopy: II. A multicentre trial of protocols for in vivo localized proton studies of human brain. Magnetic Resonance Imaging. 16 (1998) 1093–1106
Learning in Neural Network - Unusual Effects of “Artificial Dreams” Ryszard Tadeusiewicz and Andrzej Izworski Department of Automatics, AGH University of Science and Technology Al. Mickiewicza 30, 30-059 Kraków, Poland {rtad, izwa}@agh.edu.pl
Abstract. Most researchers, focused on a particular result, ignore the intermediate stages of the learning process of neural networks. The unstable and transitory phenomena discovered in neural networks during learning can be very interesting, especially when we can associate some psychological interpretation with them. They occur long after the initial stage of learning, when the network knows nothing because all weights have random values, and long before the final stage, when the network knows (almost) everything. Some "immature" neurons exhibit behavior that can be interpreted as a source of "artificial dreams". The article presents examples of simple neural networks with capabilities that might explain the origins of dreams and myths.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 211 – 218, 2006. © Springer-Verlag Berlin Heidelberg 2006
1 Introduction
Most papers describing methods and results of neural network applications are goal-oriented. Their authors try to obtain the best result in solving a specified problem (e.g. building a neural-network-based model of some process, or finding a neural solution to a pattern recognition problem), taking into account only the final result (e.g. the quality of the model or the correctness of classification). In such works nobody looks at the details of network behavior during the learning process, because only the final effect of learning seems interesting. Research on learning itself is also mainly dedicated to speeding up the learning process or to increasing the quality of the final result (e.g. by avoiding the local minima problem), and does not take into account what happens in the network during learning. Nevertheless, the unstable and transitory phenomena discovered in neural networks during learning can be very interesting, especially when we can associate some psychological interpretation with them. They occur long after the initial stage of learning, when the network knows nothing because all weights have random values, and long before the final stage, when the network knows (almost) everything. Some of them can be interpreted as “artificial dreams” of the artificial neural network and can provide us with a new interpretation of the human capacity for imagination, fantasy and also poetry. This can be shown even on the basis of very simple neural network models, but of course the most interesting results can be obtained with networks built with a high level of
similarity to the real brain structures, which implies a high level of complexity of the neural structure and also complicated forms of the observed phenomena. The problems described and evaluated in this paper have never been published before in scientific articles, so we decided to present first a very simple example of such phenomena, based on self-learning in a one-layer neural network trained by means of the Hebbian rule. Even in such a simple network we can observe very interesting transient processes, which (under a proper interpretation) can be regarded as “artificial dreams”. We also have much more interesting (but sometimes ambiguous…) observations obtained in more complicated networks during more sophisticated learning processes. But when introducing new ideas, the simplest situations must be presented (and discussed) first, and this is the subject of this paper.
2 Description of the Example Network and Its Learning
The phenomena described in this paper can be found, as mentioned above, in almost all types of neural networks and for almost all methods of learning (both supervised and unsupervised). In this paper we have decided to consider the simplest situation: a one-layer linear neural network. The input to the network is an n-dimensional vector $X = \langle x_1, x_2, \dots, x_n \rangle$, the knowledge of the network is represented by the collection of weight vectors $W_j = \langle w_{1j}, w_{2j}, \dots, w_{nj} \rangle$ for all neurons (j = 1, 2, …, L), and the outputs are obtained by means of the simplest and very well known equation:

$$y_j = \sum_{i=1}^{n} w_{ij} x_i \qquad (1)$$
The network learns on the basis of a simple Hebbian rule: if at step p we obtain the input vector $X^{p} = \langle x_1^{p}, x_2^{p}, \dots, x_n^{p} \rangle$, then the correction of the weight vector $\Delta W_j(p)$ depends on the output value $y_j^{p}$ calculated by the j-th neuron for $X^{p}$ according to equation (1), and on the input vector $X^{p}$ itself, according to the formula:

$$\Delta W_j(p) = \eta \, y_j^{p} X^{p} \qquad (2)$$

where η is the learning rate coefficient (η < 1). The new value of the weight vector $W_j$ at the next step (p + 1) of the self-learning process is then calculated by means of the formula:

$$W_j(p+1) = W_j(p) + \Delta W_j(p) = W_j(p) + \eta \, y_j^{p} X^{p} \qquad (3)$$
which must be applied to all neurons (for all j = 1, 2, …, L). It is easy to see that the result of such a calculation differs between neurons with a positive output $y_j^{p}$, calculated in response to the input signal $X^{p}$, and neurons with a negative output. In the first case the weight vector $W_j(p)$ is moved toward the position of the current input signal $X^{p}$ (attraction); in the second case the weight vector $W_j(p)$ is moved away from the position of the current input signal $X^{p}$ (repulsion). This process is presented in Fig. 1, where the big ring denotes the position of the input signal
$X^{p}$, and the small squares denote the positions of the weight vectors of the neurons. The “migration” of the weight vectors can be observed on this plot: some are attracted towards the input signal, while others are pushed in the opposite direction. The same process performed by a big population of self-learning neurons is presented in Fig. 2. It is well known what results are obtained after many steps of such a self-learning process, performed by a network connected to a real data stream. If the data are not uniformly distributed, the neurons divide (spontaneously!) into groups, where every group is dedicated to one cluster of the input data. Moreover, the weight vectors of the neurons belonging to each group are located more or less precisely at the center of the corresponding cluster of the data.
Fig. 1. Migration of the weight vectors during one step of the self-learning process
This means that after the self-learning process the neural network contains neurons that can be used as detectors (or sentinels) for every cluster (group of similar signals) present in the observed data stream and automatically recognized by the network. The process described above is not ideal because, as is well known, spontaneous migration of the weight vector of every independent neuron leads to many pathologies: every attractor has many neurons as detectors (overrepresentation), and sometimes important attractors are omitted (no neuron decides to point to that region of the input space). It is also well known how to solve this problem: a much better solution is to use a Kohonen network and the methodology of self-organizing maps. In this work, however, we do not try to build the best self-organized representation of the data. Our goal is quite different: we are looking for a very simple model of neural network learning, because on the basis of this model we try to show how (and why) a learning network sometimes exhibits behavior that can be interpreted as “artificial dreams”.
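One step of the rule in equations (1)-(3), including the attraction and repulsion effect, can be sketched as follows. This is a minimal illustration, not the authors' simulation program: the function names, the two-dimensional input, the two starting weight vectors and the learning rate 0.1 are all assumptions chosen for the example.

```python
def neuron_outputs(weights, x):
    """Equation (1): y_j = sum_i w_ij * x_i for every neuron j."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]


def hebbian_step(weights, x, eta=0.1):
    """Equations (2)-(3): W_j <- W_j + eta * y_j * X for every neuron j."""
    ys = neuron_outputs(weights, x)
    return [[w_i + eta * y * x_i for w_i, x_i in zip(w, x)]
            for w, y in zip(weights, ys)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


# Hypothetical values: one input signal and two neurons.
x = [1.0, 1.0]
weights = [[0.5, 0.2],     # positive output for x -> attracted
           [-0.3, -0.4]]   # negative output for x -> repelled
new_weights = hebbian_step(weights, x, eta=0.1)

for old, new in zip(weights, new_weights):
    print(distance(old, x), "->", distance(new, x))
```

For these values the first neuron ends up closer to the input and the second one farther away. Note that the plain rule contains no normalisation, so repeated steps can make the weights grow without bound; whether the authors bound or normalise the weights in their simulations is not stated in the text.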
Fig. 2. Migration of the weight vectors in a bigger self-learning network
3 Description of the Example Problem
Let us now consider a very simple example problem to be solved by the neural network during the self-learning process. In this example problem we assume that there are four clusters in the input data. For clear and easy graphical presentation of the results, let us assume that the attractors present in the data are located exactly at the centers of the four quarters of the input space (Fig. 3). In this case the self-learning process in the simulated neural network leads, after some thousands of learning steps, to a situation in which almost every neuron becomes a member of one of four separate groups, located (in the sense of the position of the weight vectors) at the points corresponding to the centers of the clusters discovered in the input data stream. Three snapshots from the learning process are presented in Fig. 4. A typical user of neural networks considers mainly the last snapshot, which shows how many neurons are located in the proper positions after the learning process and how precisely the real coordinates of the attractors are reproduced by the neuron parameters. For our considerations the middle snapshot is the most interesting, because it presents something strange: a situation in which the knowledge of the network is definitely not complete, but the initial chaos has been partially removed. This stage of the learning process is usually skipped by neural network researchers, because apparently one cannot find anything interesting in these plots: the learning process is not finished yet, that is all. Apparently.
Fig. 3. In the example problem the input data form four clusters located at the centers of the four quarters of the signal space (left). Accordingly, a properly trained network consists of four groups of neurons whose weight vectors are located at the centers of the quarters of the weight space (right).
Fig. 4. Three stages of the self learning process
4 Special Interpretation of the Intermediate Stages of Learning Process
In all goal-oriented investigations using neural networks, researchers are interested in the final result of the learning process, which must be useful and accurate.
Almost nobody takes into consideration the intermediate stages shown in Fig. 4. But when we try to understand what the form of this plot, repeated in Fig. 5, can actually mean, we find that although it is not a real dream, it can be interpreted as a very exciting model of an artificial dream. On the plot presented in Fig. 5 we can point out the locations of the neurons that can recognize some (named) objects from the real world. After learning, all neurons will be attributed to real-world objects, like girls, fishes and birds. But at a very early stage of the learning process we find in the population of neurons both real-world related detectors and fantasy-world related detectors. On the line connecting the point representing, for example, girls with the point representing fishes, we can find neurons that are ready to recognize objects whose parameters (features) are partially similar to girls’ shapes and partially include features taken from other real objects, for example fishes (e.g. tails). Isn’t that something familiar? Obviously, objects like this cannot exist in the real world. Objects with such properties also cannot be elements of the learning data stream, because the input information for the network is always taken from real-world examples. Nevertheless, the learning process forms in the neural network structure neurons that want to observe and recognize such unreal objects. Isn’t this some kind of dream?
Fig. 5. Localization of neuron parameters describing real and unreal objects
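The idea of a "hybrid" detector lying on the line between two real prototypes can be made concrete with a small sketch. Everything here is hypothetical: the helper name, the two-dimensional feature space, and the coordinates assigned to the "girl" and "fish" cluster centers and to the immature neuron are illustrative assumptions, not values from the authors' experiment.

```python
def hybrid_fraction(w, a, b):
    """Locate weight vector w relative to the line from prototype a to prototype b.

    Returns (t, residual): t = 0 means w sits at a, t = 1 means w sits at b,
    and residual is the distance of w from the line through a and b.
    """
    ab = [q - p for p, q in zip(a, b)]
    aw = [v - p for p, v in zip(a, w)]
    t = sum(u * v for u, v in zip(aw, ab)) / sum(u * u for u in ab)
    closest = [p + t * u for p, u in zip(a, ab)]
    residual = sum((v - c) ** 2 for v, c in zip(w, closest)) ** 0.5
    return t, residual


girl = [0.5, 0.5]     # hypothetical cluster center
fish = [-0.5, 0.5]    # hypothetical cluster center
neuron = [0.1, 0.5]   # hypothetical weight vector of an immature neuron

t, residual = hybrid_fraction(neuron, girl, fish)
print(round(t, 3), round(residual, 3))  # 0.4 0.0
```

A neuron with t strictly between 0 and 1 and a small residual mixes the features of both prototypes, which is the situation the text interprets as a detector for an unreal, mermaid-like object.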
A very interesting fact is that the fantasy-oriented objects encountered during the learning process, like those presented in Fig. 5, are never unrestricted or simply random. We can find only neurons that are able to recognize hybrids: fantastic, but built from real elements. Isn’t this analogous to fairy tales or myths? The limited volume of this presentation does not allow us to present many other examples of the “artificial dreams” encountered during learning processes in neural networks. But one more example may also be interesting, because it shows
another kind of fantasy identified in neural network behavior. This form of fantasy can be called “gigantomania”. An example of such behavior of the learning network is presented in Fig. 6. When the network is trained on examples of real-world objects, prototypes of these objects are formed and enhanced in the neural structures. This process takes place in a big population of neurons and leads to the formation of an internal representation (in the neural structures) of particular real objects. Neurons belonging to this representation can recognize every real object of the type under consideration. This is a well-known and regular process.
Fig. 6. Real and fantastic objects created during learning process
But sometimes, in contrast to this regular pattern, we can observe single neurons whose parameters are formed in a way that leads to a surprise after interpretation. Let us assume that the real objects on which the network was trained during the experiment illustrated in Fig. 6 were tigers. The network can “see” many tigers (of course as collections of parameters representing selected data about a tiger, e.g. how tall the tiger is, how long and sharp its teeth are, and so on). After some learning period we have inside the network an image of a real tiger. This image, given as a collection of parameters (neuron weights), enables us to recognize every real tiger. But some neurons have parameters that enable them to recognize a surreal tiger, much bigger than a real one, with bigger teeth and much more dangerous claws. The relations and proportions between the parameters are the same as for real tigers (see in Fig. 6 the relations between the parameters of real objects and the relations between the parameters imprinted in the weights of the “refugee” neuron holding the image of the “giant”; both lie on the same line through the origin of the coordinate system), but such a big tiger cannot exist. Nevertheless, we can find a neuron ready to recognize this giant, although it does not exist!
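The statement that the "giant" keeps the proportions of the real object, i.e. lies on the same line through the origin, can be checked numerically. The sketch below uses a hypothetical helper and made-up three-feature vectors (height, tooth length, claw length); none of these numbers come from the paper.

```python
def scale_factor_if_collinear(w, prototype, tol=1e-9):
    """Return k if w equals k * prototype (same proportions), otherwise None."""
    dot = sum(a * b for a, b in zip(w, prototype))
    norm_sq = sum(b * b for b in prototype)
    k = dot / norm_sq
    residual = sum((a - k * b) ** 2 for a, b in zip(w, prototype)) ** 0.5
    return k if residual <= tol else None


# Hypothetical feature vectors: [height, tooth length, claw length].
tiger = [1.0, 0.5, 0.25]     # prototype learned from real examples
giant = [2.0, 1.0, 0.5]      # weights of the outlying neuron
distorted = [2.0, 0.5, 0.25] # bigger, but with different proportions

print(scale_factor_if_collinear(giant, tiger))      # 2.0
print(scale_factor_if_collinear(distorted, tiger))  # None
```

A scale factor greater than 1 corresponds to the "gigantomania" case: every feature is enlarged by the same factor, so the proportions of the real object are preserved.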
5 Conclusions
The facts and comments presented in this paper are definitely not very important from the scientific point of view, and they are not applicable to practical problem solving with neural networks. But as long as we use neural networks as artificial systems very similar to the structures discovered in the human brain, we keep thinking about analogies between processes in our psyche and in neurocomputers. The results of the simulations presented in this paper give us a new starting point for such considerations, and we hope they can be interesting for the many neural network researchers who are bored with new learning paradigms, new network structures and new neurocomputing applications and are searching for something absolutely different from the serious and boring standards. This paper is something for them!
Spatial Attention in Early Vision Alternates Direction-of-Figure Nobuhiko Wagatsuma and Ko Sakai Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Ten-nodai, Tsukuba, Ibaraki, 305-8577, Japan [email protected] [email protected] http://www.cvs.cs.tsukuba.ac.jp/
Abstract. We propose a computational model consisting of mutually connected V1, V2, and PP modules to realize the effect of attention on the determination of border-ownership (BO), which indicates which side of a contour owns the border. The V2 module determines BO from the surrounding contrast extracted by the V1 module, which can be affected by top-down spatial attention from the PP module. The simulation results show that spatial attention modifies the direction of figure, and that the direction of figure is even flipped in ambiguous figures such as Rubin’s vase, although the attention is applied only to enhance local contrast in V1. These results show that the activities of BO selective cells in V2 are modified significantly when spatial attention functions in the early visual area V1.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 219–227, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction
Visual attention is a function that boosts our perception [e.g. 1] and leads us to attend to the most important information of the moment. Attention even alters the perception of an object, or a figure, which is apparent in ambiguous figures such as Rubin’s vase. We propose that attention alters the local contrast in early vision, and that the modified contrast then alters the border-ownership (BO) signals that are essential for the determination of the figure direction [2, 3]. If the effect of attention is significant, the activity of BO selective neurons is facilitated or suppressed so that the figure direction is flipped.
Visual attention functions in two distinct modes, spatial attention and object-based attention, both of which have been shown to boost human perception in a number of respects [4]. In particular, spatial attention has been reported to alter contrast gain in early visual areas [5], and mechanisms for this effect have been proposed in several modeling works [e.g. 6]. These models focus on the interaction between visual attention and lower visual functions such as contrast sensitivity; however, they cannot account for more complex perception, including figure-ground segregation.
It has been reported that a majority of neurons in monkey V2 and V4 show selectivity to BO: their responses depend on which side of a border owns the contour [2]. Computational studies have suggested that the surround suppression/facilitation observed in early visual areas underlies BO coding, and thus that luminance contrast around the classical receptive field is crucial for the determination of BO [3, 7]. These models, however, do not represent the perception of BO for ambiguous figures, in which BO flips alternately.
These recent physiological and computational studies led us to propose the following hypothesis: (1) spatial attention modulates contrast gain, and (2) the activities of BO selective cells are determined from the surrounding contrast. Therefore, it is plausible that spatial attention alters contrast gain, which then modifies the activities of BO selective cells. Although a number of previous studies report significant effects of attention in visual areas V2 [8] and V4 [8, 9], we focus on the early visual area V1 to investigate bottom-up attention that biases BO selective cells. Bottom-up attention appears to be crucial for BO determination, because the latency of the BO signal is short [2] and the switch of figures is achieved automatically [10].
We propose a computational model consisting of mutually connected V1, V2, and PP modules to investigate the effect of attention on the determination of BO. Top-down spatial attention from PP alters contrast gain in V1. The change in the contrast signal then significantly modifies the activities of BO selective cells in V2, because BO is determined solely from the surrounding contrast. The results showed that the direction of figure was flipped by spatial attention in ambiguous stimuli such as Rubin’s vase. This suggests that the perception of the direction of figure is altered when spatial attention functions in the early visual area V1.
2 The Model
In our model, spatial attention increases the contrast sensitivity, which then facilitates the activities of BO selective cells. The model consists of three modules, the V1, V2 and Posterior Parietal (PP) modules, with both top-down and bottom-up pathways connecting them, as illustrated in Fig. 1. Each module consists of 100 × 100 model cells. In the absence of external input, the activity of a cell at time t, A(t), is given by

$$\tau \frac{\partial A(t)}{\partial t} = -A(t) + \mu F(A(t)) \tag{1}$$

where the first term on the right side is a decay, and the second term takes into account the excitatory, recurrent signals among the excitatory neurons. The non-linear function F(x) is given by

$$F(x(t)) = \frac{1}{T_r - \tau \log\left(1 - \frac{1}{\tau x(t)}\right)} \tag{2}$$

where $\tau$ is a membrane time constant and $T_r$ is the absolute refractory time. The dynamics of this equation, as well as appropriate values for the constants, have been widely studied [11].

Fig. 1. An illustration of the architecture of the model consisting of three modules, V1, V2 and PP, with mutual connections between them

2.1 V1 Module
The V1 module, which models the primary visual cortex, extracts local contrast from an input stimulus. The input image, Input, is a 124 × 124 pixel gray-scale image with intensity values ranging between zero and one. First, the local contrast, $C_{\theta\omega}(p, q, t)$, is extracted by the convolution of the image with a Gabor filter, $G_{\theta\omega}$,
$$C_{\theta\omega}(p, q, t) = \mathrm{Input}(p, q) * G_{\theta\omega}(p, q) \tag{3}$$

where the indices p and q represent spatial positions, and $\omega$ denotes spatial frequency. The orientation in space, $\theta$, was chosen from 0, $\pi/2$, $\pi$ and $3\pi/2$. In the V1 module, spatial attention increases contrast gain, thus the contrast at the attended location is enhanced. The activity of a cell in the V1 module, $A^{V1}_{\theta\omega pq}$, is given by

$$\tau \frac{\partial A^{V1}_{\theta\omega pq}(t)}{\partial t} = -A^{V1}_{\theta\omega pq}(t) + \mu F(A^{V1}_{\theta\omega pq}(t)) + I^{V1-V2}_{pq}(t) + I^{V1,exct}_{\theta\omega pq}(t) + I_o \tag{4}$$
where $I^{V1-V2}_{pq}$ represents the feedback input from V2 to V1, $I_o$ is random noise, and $\mu$ is a scaling constant. The local contrast, $C_{\theta\omega}$, is modulated by the feedback from PP to V1, $I^{V1-PP}_{pq}$, as given by the following equations [6]:

$$I^{V1,exct}_{\theta\omega pq}(t) = \frac{\left(C_{\theta\omega}(p, q, t)\right)^{\gamma I^{V1-PP}_{pq}(t)}}{S^{\delta I^{V1-PP}_{pq}(t)} + \sum_{\theta\omega}\left(\frac{1}{(2I+1)(2J+1)} \sum_{j=-J}^{J} \sum_{i=-I}^{I} C_{\theta\omega}(p+j, q+i, t)\right)^{\delta I^{V1-PP}_{pq}(t)}} \tag{5}$$

$$I^{V1-PP}_{pq}(t) = \chi \sum_{i=-6}^{6} \sum_{j=-6}^{6} W_{pqij} F(A^{PP}_{(p-i)(q-j)}(t)) \tag{6}$$

$$W_{pqij} = \alpha \exp\left(-\frac{(p-i)^2 + (q-j)^2}{2\sigma_w^2}\right) - \beta \tag{7}$$

where $W_{pqij}$ are Gaussian connection weights with standard deviation $\sigma_w$.
$\alpha$, $\beta$, $\chi$, $\delta$ and $\gamma$ are constants, and S in Eq. (5) prevents the denominator from being zero. In the V1 module, spatial attention increases contrast gain; therefore the contrast at the attended location is enhanced.

2.2 V2 Module
The V2 module consists of BO selective cells. This module receives feedforward input from the V1 module. BO is determined from the surrounding contrast information extracted in the V1 module, as illustrated in Fig. 2 [3]. Each BO selective cell has a single excitatory region and a single inhibitory region. The location and shape of these regions determine the selectivity of BO cells in terms of stimulus preference. The activity of a BO selective cell is given by

$$\tau \frac{\partial A^{V2,BO}_{pqN}(t)}{\partial t} = -A^{V2,BO}_{pqN}(t) + \mu F(A^{V2,BO}_{pqN}(t)) - \gamma F(A^{V2,inh}(t)) + I^{V2-V1,BO}_{pq}(t) + I_o \tag{8}$$
where $I^{V2-V1,BO}_{pq}$ represents feedforward input from V1. The index BO represents left- or right-BO selectivity, and N represents the number of BO selective cells. If BO-left selective cells are more active than BO-right cells, a figure is judged as located on the left side. The third term represents the input from inhibitory neurons. The activity of an inhibitory V2 cell is given by

$$\tau \frac{\partial A^{V2,inh}(t)}{\partial t} = -A^{V2,inh}(t) + \mu F(A^{V2,inh}(t)) + \kappa \sum_{Npq} F(A^{V2,BO}_{pqN}(t)) \tag{9}$$

where $\kappa$ is a constant. This inhibitory neuron receives inputs from excitatory neurons in V2, and inhibits these neurons.

Fig. 2. A mechanism of a BO-right selective cell [3]. If there is contrast in an excitatory surrounding region, the activity of the cell is enhanced. If the contrast exists in an inhibitory region, the activity is suppressed. The balance of the BO-right and BO-left cells determines the direction of figure.
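The rate dynamics shared by all modules (Eqs. 1 and 2) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the transfer function has the form $F(x) = 1/(T_r - \tau \log(1 - 1/(\tau x)))$, returns zero whenever $\tau x \le 1$ (where the logarithm is undefined), and uses placeholder values for `tau`, `t_ref`, `mu` and `dt`, none of which are given in the paper. The function names are hypothetical.

```python
import math


def transfer(x, tau=0.01, t_ref=0.002):
    """Assumed form of Eq. (2): F(x) = 1 / (T_r - tau * log(1 - 1/(tau * x))).

    Returns 0.0 when tau * x <= 1, where the logarithm is undefined
    (an assumption of this sketch, not stated in the paper).
    """
    if tau * x <= 1.0:
        return 0.0
    return 1.0 / (t_ref - tau * math.log(1.0 - 1.0 / (tau * x)))


def euler_step(a, ext_input=0.0, mu=1.0, tau=0.01, dt=0.001):
    """One forward-Euler step of tau * dA/dt = -A + mu * F(A) + ext_input.

    With ext_input = 0 this is Eq. (1); ext_input stands in for the extra
    input terms that appear in the module-specific equations.
    """
    return a + (dt / tau) * (-a + mu * transfer(a, tau) + ext_input)


if __name__ == "__main__":
    activity = 0.0
    for _ in range(5):
        activity = euler_step(activity, ext_input=1.0)
    print(activity)
```

Forward Euler is chosen only for brevity; the paper does not state which integration scheme was used.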
2.3 Posterior Parietal Module (PP)
The PP module encodes spatial location and facilitates the processing for the attended location. Spatial attention is given by hand in this module, which then enhances contrast gain in the V1 module. The PP module receives bottom-up inputs from the V1 and V2 modules, and spatial attention as a top-down bias. The activity of an excitatory cell in the PP module is given by

$$\tau \frac{\partial A^{PP}_{pq}(t)}{\partial t} = -A^{PP}_{pq}(t) + \mu F(A^{PP}_{pq}(t)) - \gamma F(A^{PP,inh}(t)) + I^{PP-V1}_{pq}(t) + I^{PP-V2}_{pq}(t) + I^{PP,A}_{pq}(t) + I_o \tag{10}$$

where $I^{PP,A}_{pq}$ represents the top-down bias of spatial attention with a Gaussian shape. The fourth term, $I^{PP-V1}_{pq}$, is the feedforward input from V1 to PP, and the fifth term, $I^{PP-V2}_{pq}$, is the feedforward input from V2 to PP. The third term represents input from inhibitory PP neurons, whose activity is given by

$$\tau \frac{\partial A^{PP,inh}(t)}{\partial t} = -A^{PP,inh}(t) + \mu F(A^{PP,inh}(t)) + \kappa \sum_{pq} F(A^{PP}_{pq}(t)) \tag{11}$$

When there is no spatial attention, $I^{PP,A}_{pq} = 0$, and the PP module is activated by feedforward signals from V1 and V2.
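To make the attention pathway concrete, the following sketch evaluates the Gaussian feedback weight of Eq. (7) and the attention-modulated contrast of Eq. (5) for a single cell. It assumes Eq. (5) is a divisive normalization in which the PP feedback scales both exponents, as in the formulation of [6]. All constants are placeholders, and the function names and the `pooled_means` argument (one windowed mean contrast per orientation/frequency channel) are hypothetical conveniences.

```python
import math


def gaussian_weight(p, q, i, j, alpha=1.0, beta=0.0, sigma_w=2.0):
    """Eq. (7): Gaussian connection weight with a constant offset beta."""
    dist_sq = (p - i) ** 2 + (q - j) ** 2
    return alpha * math.exp(-dist_sq / (2.0 * sigma_w ** 2)) - beta


def attended_contrast(center, pooled_means, feedback,
                      gamma=2.0, delta=1.0, s=1.0):
    """Assumed reading of Eq. (5) for one V1 cell.

    center       -- local contrast C(p, q, t) of the cell's own channel
    pooled_means -- mean contrast over the (2I+1)(2J+1) window, one value
                    per (theta, omega) channel
    feedback     -- PP-to-V1 feedback I^{V1-PP}_{pq}(t), which scales both
                    exponents
    """
    numerator = center ** (gamma * feedback)
    denominator = s ** (delta * feedback) + sum(
        m ** (delta * feedback) for m in pooled_means
    )
    return numerator / denominator


if __name__ == "__main__":
    print(gaussian_weight(3, 4, 0, 0, sigma_w=5.0))
    print(attended_contrast(0.5, [0.5, 0.5], feedback=1.0))
```

Whether a larger feedback value raises or lowers the output depends on the contrast range and on the relative sizes of `gamma` and `delta`, so the sketch makes no claim about the sign of the effect for these placeholder values.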
3 Simulation Results
We carried out simulations of the model with a variety of stimuli, in order to test quantitatively the characteristics of the model in various situations. Specifically, we examined whether the model reproduces the activities of the BO selective neurons reported in physiological experiments [2], and whether human perception of the direction of figure is reproduced in ambiguous figures. In the present paper, we show two examples of the results: two overlapping squares (Fig. 3a) [2] and Rubin’s vase (Fig. 3b). For both stimuli, we determined the direction of figure by comparing the activities of BO-left and BO-right cells. Because a wide variety of BO selectivity has been reported in physiology, we implemented five types of BO-left and BO-right selective cells. We define the evaluation function to estimate a population response of the full set of BO selective cells:

$$h(t) = \frac{\mathrm{sum\_right}(t) - \mathrm{sum\_left}(t)}{\mathrm{sum\_right}(t) + \mathrm{sum\_left}(t)} \tag{12}$$
, where sum_left(t) is the summation of the responses of all BO-left selective cells and sum_right(t) is that of BO-right cells. A negative h(t) indicates that BO-left cells are dominant with respect to BO-right cells. On the other hand, a positive h(t) shows that BO-right cells are dominant. If h(t) is zero, the activities of BO-right and BO-left cells are equal. We consider that the direction of figure is left, if − 1.0 ≤ h(t ) ≤ −0.3 . If − 0.3 < h(t ) < 0.3 , then no figure direction is given, and if 0.3 ≤ h(t ) ≤ 1.0 , then the direction of figure is right.
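The evaluation function and the three decision ranges translate directly into code. The sketch below uses hypothetical function names. Returning 0.0 when both sums are zero is an assumption added to avoid division by zero, since the paper does not discuss that case.

```python
def population_index(sum_right, sum_left):
    """Eq. (12): normalized difference between BO-right and BO-left responses."""
    total = sum_right + sum_left
    if total == 0:
        return 0.0  # assumption: no response at all counts as "no preference"
    return (sum_right - sum_left) / total


def figure_direction(h, threshold=0.3):
    """Map h(t) to a figure direction using the ranges given in the text."""
    if h <= -threshold:
        return "left"
    if h >= threshold:
        return "right"
    return "undetermined"


if __name__ == "__main__":
    h = population_index(sum_right=3.0, sum_left=1.0)
    print(h, figure_direction(h))
```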
Fig. 3. Examples of stimuli: two overlapping squares similar to that used in physiological experiment [2] (a), and the Rubin's vase as one of the most famous ambiguous figures (b). Grey circles indicate the location and extent of the receptive field of BO selective cells.
3.1 Correspondence with Physiological Experiments
We carried out simulations of the model with a variety of stimuli similar to those tested in physiological experiments [2]. As an example, the result for two overlapping squares (Fig. 3a) is shown in this section. For the determination of figure direction at the border of the two squares, this stimulus is difficult in terms of not only its contour arrangement but also its contrast configuration. However, we can always perceive the black square as a figure and the gray square as a ground. We tested the model with three conditions of spatial attention, applied to the black square, to the gray square, or to nowhere, to investigate the characteristics of the model responses under these conditions.
The simulation result for the two overlapping squares is shown in Fig. 4. Regardless of the location of spatial attention, the responses of h(t) show that BO-right selective cells are dominant, indicating that the model determined the black square as a figure, which agrees with human perception. However, spatial attention changes the activity of BO selective cells to some degree. When attention is applied to the black square (Fig. 4B), BO-right selective cells are more dominant than when no attention is provided. Even when spatial attention is applied to the gray square (Fig. 4C), BO-right selective cells are still dominant, although the magnitude of the activity is somewhat smaller than when attending to the black square. Spatial attention facilitated the correct determination of the direction of figure. Our model reproduces human perception, which determines the black square as a figure. These simulation results also showed that spatial attention alters the activities of BO selective cells through the enhancement of contrast gain in early vision.

Fig. 4. Simulation results for two overlapping squares. The height of bars shows h(t) for the stimulus shown below. Positive values indicate that BO-right cells are more active than BO-left cells at the location. “X” icon indicates the attending location; no spatial attention (A), attendance to the black square (B), and to the gray square (C).
Fig. 5. Simulation results for the Rubin’s vase. The height of bars shows the activity of the BO selective cell for the stimulus shown above. Negative values indicate that BO-left cells are more active than BO-right cells at the location, indicating that the vase is a figure. “X” icon indicates the attending location; no spatial attention (A), attendance to the center of the vase (B), and to the face (C).
3.2 Ambiguous Figures – A Case of Rubin’s Vase
We carried out simulations of the model with several ambiguous figures, including variations of the Necker cube and Rubin’s vase. As an example, we show the results for Rubin’s vase in this section. When we view Rubin’s vase (Fig. 3b), we can perceive two objects alternately, a vase and two faces facing each other, but we cannot perceive these two objects simultaneously. If the direction of figure is to the left with respect to the receptive field (RF: indicated by a small circle in the figure), we will perceive a vase. If it is to the right, we will perceive a face. We examined whether the model reproduces this switch when spatial attention is applied to the location of a face or a vase.
The simulation results for Rubin’s vase are shown in Fig. 5. The results show that BO-left cells are dominant regardless of the location of spatial attention. When we do not provide spatial attention (Fig. 5A), the model determines the direction of figure as left, indicating the perception of a vase. When we apply attention to the center of the vase (Fig. 5B), BO-left cells become more active than in the previous case without attention, suggesting that the perception of the vase is facilitated by attending to the center area. Even when we apply attention to the face on the right (Fig. 5C), BO-left cells are still dominant. However, the magnitude of h(t) is very small, i.e. the activities of BO-left cells are so weak that they are almost the same as those of BO-right cells. The direction of figure might not be determined at this activity level. The early-level processing might not evoke a full switch of figure in Rubin’s vase, because higher-level vision is involved in the processing of human faces, as discussed in the next section. In this simulation, we changed the location of spatial attention. The spatial attention altered the activities of BO selective cells in the direction consistent with human perception. These results suggest that the direction of figure could be switched depending on the location of spatial attention in ambiguous figures.
4 Discussions
The hypothesis we pursued here is that spatial attention increases contrast gain in early vision, and that the modified contrast then leads to the modulation of the activity of BO selective cells, which could flip the direction of figure. We constructed a network model that consists of three modules corresponding to the V1, V2 and PP areas, together with mutual connections between them that include both bottom-up and top-down flows. We tested the model with two types of stimuli: the stimuli used in physiological experiments, and ambiguous figures. When two overlapping squares were presented, the BO response was biased toward the direction of the attention, but the response did not flip to indicate a change of figure direction. This suggests that spatial attention modulates the direction of figure only to a degree that does not alter the perception, which is consistent with human perception. Next, we tested the model with Rubin’s vase, one of the most famous ambiguous figures, with the expectation that the figure direction may flip depending on the location of attention. Although our model showed a significant modulation in BO signaling depending on the attention, the model did not indicate a turnover of BO to the face side. It has been suggested that specialized neurons in higher visual areas such as IT and TEO underlie the processing of human faces. We consider that the feedback from such higher visual areas might further modulate BO selective cells to switch to the perception of a face. These results suggest that spatial attention affects the contrast gain in early visual areas, which in turn modulates BO selective neurons in intermediate areas, perhaps together with a direct influence on BO coding. Therefore, the contrast gain affected by spatial attention leads to the modulation, or even the switch, of the perception of the direction of figure.
Our model predicts that spatial attention alters the direction of figure through the change in contrast sensitivity. This prediction should be examined from both psychophysical and physiological viewpoints. Psychophysical experiments would be valuable for testing quantitatively the relation between contrast and BO determination in ambiguous figures. Physiological data from fMRI would also be valuable for identifying cortical regions relevant to BO determination and attention. Our results provide essential and testable predictions for fundamental problems of figure/ground segregation and attention.
References
1. M. I. Posner: Orienting of attention, The Quarterly Journal of Experimental Psychology, Vol. 32 (1980)
2. H. Zhou, H. S. Friedman and R. von der Heydt: Coding of border ownership in monkey visual cortex, Journal of Neuroscience, Vol. 20 (2000) 6594–6611
3. H. Nishimura and K. Sakai: The computational model for border-ownership determination consisting of surrounding suppression and facilitation in early vision, Neurocomputing, Vol. 65–66 (2005) 77–83
4. G. Deco and T. S. Lee: The role of early visual cortex in visual integration: a neural model of recurrent interaction, European Journal of Neuroscience, Vol. 20 (2004) 1089–1100
5. M. Carrasco, S. Ling, and S. Read: Attention alters appearance, Nature Neuroscience, Vol. 7 (2004) 308–313
6. D. K. Lee, L. Itti, C. Koch and J. Braun: Attention activates winner-take-all competition among visual filters, Nature Neuroscience, Vol. 2 (1999) 375–381
7. K. Sakai and H. Nishimura: Surrounding suppression and facilitation in the determination of border ownership, Journal of Cognitive Neuroscience, Vol. 18 (2006) 562–579
8. J. H. Reynolds, L. Chelazzi, and R. Desimone: Competitive Mechanisms Subserve Attention in Macaque Areas V2 and V4, The Journal of Neuroscience, Vol. 19 (1999) 1736–1753
9. J. H. Reynolds, T. Pasternak, R. Desimone: Attention Increases Sensitivity of V4 Neurons, Neuron, Vol. 26 (2000) 703–714
10. N. Wade: The art and science of visual illusions (1982)
11. W. Gerstner: Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking, Neural Computation, Vol. 12 (2000) 43–89
Language Learnability by Feedback Self-Organizing Maps
Fuminori Mizushima (1) and Takashi Toyoshima (1, 2)
(1) Dept. Brain Science and Engineering, (2) Dept. Human Sciences, Kyushu Institute of Technology, 2–4 Hibikino, Wakamatsu, Kitakyushu, Fukuoka 808–0196 Japan [email protected], [email protected]
Abstract. We experiment with identification tasks on multi-layered self-organizing maps with feedback, using strings of category symbols of English sentences. Training is carried out in a setting that approximates language acquisition by human infants as closely as possible. Novel strings that are not used in training are then tested to see whether they can be identified as grammatical or not. Test strings can be longer than the trained strings, with no finite bound, and can contain recursions within themselves, just as natural language syntax allows.
Keywords: language identification, self-organizing map, feedback, natural language, variable data length, recursion.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 228–236, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction
Language acquisition has been a central concern in linguistics, psychology, brain sciences, as well as information sciences. In machine learning, it is studied from different approaches in a variety of learning paradigms and distinct interaction settings: computationalism vs. connectionism, supervised vs. unsupervised, reinforcement, queries, etc. As a neural network model of language learning, Elman [1] proposed the Simple Recurrent Network (SRN), which is a multi-layered perceptron with a context layer that is fed back from the hidden layer to reflect the previous state of the network. After training with English sentences, the SRN was able to predict the word that follows the current input word. Furthermore, the hidden unit activations were able to represent categorical relationships of the input words. Training of the SRN was supervised, using the backpropagation algorithm.
Natural language acquisition by human infants, in contrast, is primarily unsupervised [2–5]. Until infants begin to speak, they are only exposed to speech data in the environment. When infants start to produce minimal telegraphic speech, they do make errors. Yet they are not always corrected, or even told when they produce ungrammatical utterances, let alone what is wrong about them. Care-takers may correct an error or provide a grammatical model when infants produce an ungrammatical utterance, but the care-takers are just guessing what the infants want to say. In other words, infants have no way of knowing what the correct utterance really is for what they want to say. That is, infants are not consistently corrected, rewarded, or penalized for what they produce. Thus, there is not enough reliable negative evidence that indicates whether or not a given utterance is grammatical. In short, natural language acquisition by human infants is essentially unsupervised.
Perhaps the most popular artificial neural network model used for unsupervised learning is the Self-Organizing Map (SOM) proposed by Kohonen [6]. The basic SOM is a feedforward model with a single competition layer, which is not capable of sequential information processing. Thus, there have been various proposals for modifying SOM with feedback, possibly from an additional context layer [7–9]. Wakuya, et al. [10] have added an output layer to a feedback SOM with a context layer, calling it an Elman-type feedback SOM. Using backpropagation, they have experimented on the recognition of proper names written in Japanese Braille. They also tested elasticity on the time scale, with the same proper names encoded in multiples of the original Braille dots in time-line sequence. Although the length of the expressions was varied, the informational entropy was kept constant and all the data had the same entropy.
In this paper, we report on our simulations of a language identification task [11] by a Feedback SOM (FSOM) with a context layer. We have trained the FSOM in a setting that roughly approximates language acquisition by human infants, using strings of category symbols (part-of-speech labels) of English sentences. Then, we subjected the trained FSOM to novel strings of various lengths with possible recursions in themselves, and checked whether the test strings pass through the Best Matching Units (BMU) obtained by training.
If a test string passes through only BMUs, we consider it grammatical. If a test string wanders off the BMUs even once in a sequence of category symbols, we judge it ungrammatical. With these settings and criteria, the FSOM was able to distinguish, with an average accuracy of more than 90%, grammatical strings from ungrammatical strings of random sequences of category symbols, both of arbitrary length.
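The grammaticality criterion described above reduces to a set-membership check once the winning unit for every symbol of a test string is known. The sketch below assumes that the trained map has already produced a list of winner coordinates for the string and a set of BMU coordinates collected during training. The function names and the coordinate representation are hypothetical.

```python
def is_grammatical(winner_path, trained_bmus):
    """Accept a string only if every winning unit was a BMU during training."""
    return all(unit in trained_bmus for unit in winner_path)


def identification_accuracy(judgements, labels):
    """Fraction of strings whose judgement matches the true label."""
    correct = sum(1 for j, y in zip(judgements, labels) if j == y)
    return correct / len(labels)


if __name__ == "__main__":
    trained = {(0, 0), (0, 1), (2, 3)}
    paths = [[(0, 0), (2, 3)], [(0, 0), (4, 4)]]
    judged = [is_grammatical(path, trained) for path in paths]
    print(judged, identification_accuracy(judged, [True, False]))
```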
2 Training and Test Data
2.1 Vector Encoding of the Category Symbols
Unlike Elman’s experiments on predicting the word that follows the current input word, the task we have set out is to judge the grammaticality of strings of category symbols that represent English sentences, that is, language identification [11]. Current theories of transformational generative grammar [12, 13] recognize at least eight morpho-syntactic categories (parts of speech): nouns (N), verbs (V), adjectives/adverbs (A), prepositions/postpositions (P), determiners (D), light verbs (v), inflections (I), and complementizers (C). The first four are called lexical categories; they carry more or less substantive “meaning” and should be familiar from traditional school grammar. The latter four are called functional categories; they form closed classes with a relatively limited number of items, often only a handful.
Determiners include articles, demonstratives, and pronouns, for example. Light verbs rarely manifest themselves overtly with sounds in English. They divide the structures of lexical verb phrases (VP) into subcategories of intransitives, transitives, and ditransitives. Inflections include the auxiliary verbs, such as “will”, “can”, etc., as well as inflectional suffixes of verbs, such as {-s}, {-ed}, etc. Complementizers make a clause into an object or a subject of a verb in another sentence. For declarative finite sentences, “that” alternates with an empty form devoid of any sound; “if” and “whether” are for interrogatives, and “for” is for non-finite clauses. The four lexical categories, N, V, A, and P, are traditionally expressed with a pair of binary features [± N, ± V], enabling cross-categorical classifications (Table 1). The [– N] categories are potential Case-assigners, and the [+ V] categories are syntactic predicates, for example. Functional categories can also be classified with the [± N, ± V] features, correspondingly (Table 2).

Table 1. Feature specification

        +N    –N
  +V    A     V
  –V    N     P

Table 2. Corresponding functional categories

        +N    –N
  +V    I     v
  –V    D     C

Adding a [± F] feature to express the functional/lexical distinction yields the three binary features [± F, ± N, ± V], which can easily be converted into 3-bit vector representations. To mark that a particular string has ended, we append a period mark, also encoded as a 3-bit vector, [± 0, ± 0, ± 0] (Table 3).

Table 3. 3-bit vector encoding of morpho-syntactic categories

  category      features            3-bit vector
  N             [– F, + N, – V]     [– 1, + 1, – 1]
  V             [– F, – N, + V]     [– 1, – 1, + 1]
  A             [– F, + N, + V]     [– 1, + 1, + 1]
  P             [– F, – N, – V]     [– 1, – 1, – 1]
  D             [+ F, + N, – V]     [+ 1, + 1, – 1]
  v             [+ F, – N, + V]     [+ 1, – 1, + 1]
  I             [+ F, + N, + V]     [+ 1, + 1, + 1]
  C             [+ F, – N, – V]     [+ 1, – 1, – 1]
  . (period)                        [± 0, ± 0, ± 0]
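The encoding in Table 3 amounts to a lookup table. The sketch below is one way to express it: the dictionary and function names are hypothetical, the vector components are ordered [F, N, V], and the end-of-string period is appended automatically.

```python
# 3-bit encoding of Table 3; component order is [F, N, V].
CATEGORY_VECTORS = {
    "N": (-1, +1, -1),
    "V": (-1, -1, +1),
    "A": (-1, +1, +1),
    "P": (-1, -1, -1),
    "D": (+1, +1, -1),
    "v": (+1, -1, +1),
    "I": (+1, +1, +1),
    "C": (+1, -1, -1),
    ".": (0, 0, 0),  # end-of-string period mark
}


def encode(symbols):
    """Convert a list of category symbols into 3-bit vectors plus a final period."""
    return [CATEGORY_VECTORS[s] for s in symbols] + [CATEGORY_VECTORS["."]]


if __name__ == "__main__":
    # "D N I V" could stand for a sentence such as "the boy has left".
    print(encode(["D", "N", "I", "V"]))
```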
2.2 Generation of Grammatical Strings
As a preliminary study, we used the following context-free rewriting rules to generate the strings of category symbols that correspond to grammatical English sentences. Parentheses enclose optional constituents, and braces enclose alternatives.

(1) CP → C IP
(2) IP → DP I {VP | AP} (PP)
(3) DP → (DP) D (NP)
(4) PP → P (DP)
(5) NP → A* N ({PP | CP})
(6) VP → A* V ({PP | DP | CP}) ({PP | DP | CP})
(7) AP → A* A ({PP | CP})
These rules are not exhaustive, and there are strings of category symbols that are not generated but correspond to grammatical English sentences. In particular, they do not generate strings of category symbols that correspond to sentences with what is called “movement” in the tradition of transformational generative theories, the most essential property of natural languages. Thus, sentences such as the following are not considered in this study.

(8) What  do     you  __   think  that  the  boy  has  given  ____  to  me?
    D     C + I  D    ___  V      C     D    N    I    V      ____  P   D

Yet CP is a full sentence, and as it appears on the right-hand side of the rules (5–7), a sentence can be recursively embedded, potentially infinitely many times. Since it never manifests itself phonetically as a stand-alone “word” in English, v is not included in the above rules. The * mark in (5–7) is the Kleene star, which indicates that the category may be concatenated any non-negative integer number of times. Thus, A* covers the absence of the A category as well as arbitrarily many consecutively concatenated A’s. Unless A is predicative, it does not expand from adjectival phrases (AP), and it manifests itself as an adverb if it modifies [+ V] phrases [14].
2.3 Training Data Set of Strings
In a preparatory experiment, we generated a set of simplex grammatical strings, using the rewriting rules (1–7) without any recursion of phrases within the same categories, and with the Kleene star limited to 1, that is, no A or a single A. That yielded 792 strings. The longest of those 792 strings had 19 category symbols. To create the training data set of strings, we allowed CP to recur twice, that is, one embedding of a CP into another CP, but did not allow any other phrasal category to recur within phrases of the same category, and limited the Kleene star to 1. That potentially yields more than 3,140,000 strings, from which we randomly picked 1,000 strings. The longest of those randomly picked 1,000 strings had 58 category symbols. Those 1,000 strings were converted into 1,000 sequences of 3-bit vectors, with [± 0, ± 0, ± 0] appended at the end of each string as a period mark.
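Random generation from rewriting rules such as (1–7) can be sketched as a recursive expansion. The grammar table below encodes one possible reading of the rules and should be treated as an assumption, in particular the optional leading DP in rule (3) and the two optional complements in rule (6). The depth bound is also a simplification: the paper limits recursion per phrasal category, whereas this sketch simply stops expanding optional constituents below `max_depth`. All names are hypothetical.

```python
import random

# Each rule is a list of (alternatives, kind) slots, where kind is
# "one" (obligatory), "opt" (optional) or "star" (Kleene star on a terminal).
# Assumed reading of rules (1)-(7); not confirmed by the paper.
RULES = {
    "CP": [(("C",), "one"), (("IP",), "one")],
    "IP": [(("DP",), "one"), (("I",), "one"), (("VP", "AP"), "one"), (("PP",), "opt")],
    "DP": [(("DP",), "opt"), (("D",), "one"), (("NP",), "opt")],
    "PP": [(("P",), "one"), (("DP",), "opt")],
    "NP": [(("A",), "star"), (("N",), "one"), (("PP", "CP"), "opt")],
    "VP": [(("A",), "star"), (("V",), "one"),
           (("PP", "DP", "CP"), "opt"), (("PP", "DP", "CP"), "opt")],
    "AP": [(("A",), "star"), (("A",), "one"), (("PP", "CP"), "opt")],
}


def expand(symbol, rng, depth=0, max_depth=3, max_star=1):
    """Recursively rewrite a symbol into a list of terminal category symbols."""
    if symbol not in RULES:
        return [symbol]  # terminal category symbol
    out = []
    for alternatives, kind in RULES[symbol]:
        if kind == "star":
            repeats = rng.randint(0, max_star)
        elif kind == "opt":
            # Optional constituents are dropped once the depth bound is reached,
            # which guarantees termination.
            repeats = rng.randint(0, 1) if depth < max_depth else 0
        else:
            repeats = 1
        for _ in range(repeats):
            out.extend(expand(rng.choice(alternatives), rng, depth + 1, max_depth, max_star))
    return out


def generate_string(rng, max_depth=3, max_star=1):
    """Generate one string of category symbols starting from CP."""
    return expand("CP", rng, 0, max_depth, max_star)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(generate_string(rng)))
```

Uniform random choice among alternatives is an arbitrary design decision here. The paper instead enumerates the admissible strings and samples from that set, which gives every string equal probability.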
2.4 Test Data Sets of Strings
We created two test data sets, one of grammatical strings and the other of ungrammatical strings. The test data set of grammatical strings was generated by the same rewriting rules (1–7) as the training data set, but allowing one additional recursion of CP, allowing one recursion of all the other categories, and limiting the Kleene star to 2. We randomly picked 20,000 strings that are distinct from the 1,000 strings in the training data set. The longest of those randomly picked 20,000 strings had 661 category symbols. Those 20,000 strings were converted into 20,000 sequences of 3-bit vectors, with [± 0, ± 0, ± 0] appended at the end of each string. To generate ungrammatical strings, each category symbol of the 20,000 grammatical strings generated above was replaced by a randomly chosen category symbol, excluding v. The 20,000 ungrammatical strings thus generated were likewise converted into 20,000 sequences of 3-bit vectors, with [± 0, ± 0, ± 0] appended at the end of each string.
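The construction of the ungrammatical test set can be sketched as a symbol-wise random replacement that preserves string length. The names below are hypothetical. The sketch does not check whether a scrambled string happens to be grammatical by chance, and the text does not say whether such a filter was applied.

```python
import random

# Category symbols available for replacement; the light verb "v" is excluded.
CATEGORIES = ["N", "V", "A", "P", "D", "I", "C"]


def scramble(symbols, rng):
    """Replace every category symbol with a randomly chosen one, keeping the length."""
    return [rng.choice(CATEGORIES) for _ in symbols]


if __name__ == "__main__":
    rng = random.Random(1)
    grammatical = ["C", "D", "N", "I", "V", "P", "D"]
    print(scramble(grammatical, rng))
```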
Fig. 1. A schematic representation of the FSOM architecture used in this study
3 The Design and the Algorithm of Feedback SOM
We have designed a Self-Organizing Map with Feedback (FSOM) from an additional context layer, as diagrammed in Figure 1. The competition layer L and the context layer M are both two-dimensional planes, tessellated with u × v units in a square array. An input string $x_i = [x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,j}]$ is fed to the input layer, one symbol at a time, where $x_{i,j}$ is a category symbol represented as a k-dimensional vector, a 3-dimensional
vector in this study, encoded as a 3-bit vector. The vector of the input symbol $x_{i,j}$ and the context vectors $C$ jointly determine the Best Matching Unit (BMU) $L^{*}_{u,v}$ on the competition layer L, where $C = [C_{M_{1,1}}, C_{M_{1,2}}, \ldots, C_{M_{2,1}}, C_{M_{2,2}}, \ldots, C_{M_{u,v}}]$ collects the context vectors of all the units $M_{u,v}$, located at coordinates u, v on the context layer M. The weight vector between the input vector $x_{i,j}$ and each unit $L_{u,v}$, located at coordinates u, v on the competition layer L, is $W^{\mathrm{in}}_{L_{u,v}}$. The weight vectors $W^{\mathrm{con}}_{L_{u,v}} = [W^{\mathrm{con}}_{L_{u,v}:M_{1,1}}, W^{\mathrm{con}}_{L_{u,v}:M_{1,2}}, \ldots, W^{\mathrm{con}}_{L_{u,v}:M_{2,1}}, W^{\mathrm{con}}_{L_{u,v}:M_{2,2}}, \ldots, W^{\mathrm{con}}_{L_{u,v}:M_{u,v}}]$ are the weight vectors between each unit $L_{u,v}$ on the competition layer L and all the context vectors $C$ of the units $M_{u,v}$ on the context layer M. The BMU $L^{*}_{u,v}$ at the state s will be the competition unit with the smallest sum of the Euclidean distances between the input vector $x_{i,j}$ and the weight vector $W^{\mathrm{in}}_{L_{u,v}}$, and between the context vectors $C$ and the weight vectors $W^{\mathrm{con}}_{L_{u,v}}$, each of which is depreciated by the apportion rates, γ and (1 – γ), respectively, in accord with the following formula:

$$L^{*}_{u,v} = \arg\min_{L_{u,v}} \left\{ \gamma \left\| x_{i,j}(s) - W^{\mathrm{in}}_{L_{u,v}}(s) \right\| + (1 - \gamma) \left\| C(s-1) - W^{\mathrm{con}}_{L_{u,v}}(s) \right\| \right\} \qquad (9)$$
The apportion ratio of γ to (1 – γ) sets the relative importance between the input and the context “knowledge” (or “experience”) attained up to the state s, in determining the BMU $L^{*}_{u,v}$. Once the BMU $L^{*}_{u,v}$ has been determined for an input symbol $x_{i,j}$ at the state s, its weight vectors and the weight vectors of its neighboring units are updated towards the input vector $x_{i,j}$ and the context vectors $C = [C_{M_{1,1}}, C_{M_{1,2}}, \ldots, C_{M_{2,1}}, C_{M_{2,2}}, \ldots, C_{M_{u,v}}]$, by the following formulae:

$$W^{\mathrm{in}}_{L_{u,v}}(s+1) = W^{\mathrm{in}}_{L_{u,v}}(s) + \alpha(T)\, h[T, L^{*}_{u,v}, L_{u,v}] \left( x_{i,j}(s) - W^{\mathrm{in}}_{L_{u,v}}(s) \right) \qquad (10)$$

$$W^{\mathrm{con}}_{L_{u,v}}(s+1) = W^{\mathrm{con}}_{L_{u,v}}(s) + \alpha(T)\, h[T, L^{*}_{u,v}, L_{u,v}] \left( C(s-1) - W^{\mathrm{con}}_{L_{u,v}}(s) \right) \qquad (11)$$
T is the number of cycles in each of which all the i strings of the training data set are fed once, α(T) is a monotonically decreasing learning coefficient, and h is the neighborhood function. Then, the context vectors $C = [C_{M_{1,1}}, C_{M_{1,2}}, \ldots, C_{M_{2,1}}, C_{M_{2,2}}, \ldots, C_{M_{u,v}}]$ of all the units $M_{u,v}$ on the context layer M are attenuated by the rate β, and further, a dividend of the input vector $x_{i,j}$ at the rate of (1 – β) is added to the context unit $M^{*}_{u,v}$, located on the same coordinates u, v that correspond to the coordinates of the BMU $L^{*}_{u,v}$ on the competition layer L, as follows:

$$C_{M^{*}_{u,v}}(s) = \beta \times C_{M^{*}_{u,v}}(s-1) + (1 - \beta) \times x_{i,j}(s) \qquad (12)$$

$$C_{M_{u,v}}(s) = \beta \times C_{M_{u,v}}(s-1) \qquad (13)$$
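As an illustration of how Eqs. (9)–(13) fit together for a single input symbol, the following is a minimal pure-Python sketch, not the implementation used in this study. The function name `fsom_step`, the dictionary-based data layout, the tiny 2 × 2 map, the BMU-only neighborhood and the tie-breaking rule (first unit in sorted order, rather than a forced center unit) are all assumptions made for the example.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def fsom_step(x, context, w_in, w_con, gamma, beta, alpha, h):
    """One training step for a single input symbol x (sketch of Eqs. 9-13).

    context: dict unit -> context vector C_M (same dimension as x)
    w_in:    dict unit -> input weight vector W_in
    w_con:   dict unit -> dict unit -> context weight vector W_con
    h:       function (bmu, unit) -> neighborhood value in [0, 1]
    """
    units = sorted(w_in)
    # Flatten C(s-1) in a fixed unit order so it can be compared with W_con.
    c_flat = [v for m in units for v in context[m]]

    def cost(u):
        wc_flat = [v for m in units for v in w_con[u][m]]
        return gamma * dist(x, w_in[u]) + (1 - gamma) * dist(c_flat, wc_flat)

    # Eq. (9): ties are broken by taking the first unit in sorted order.
    bmu = min(units, key=cost)

    for u in units:
        rate = alpha * h(bmu, u)
        # Eq. (10): move the input weights towards the input vector.
        w_in[u] = [w + rate * (xi - w) for w, xi in zip(w_in[u], x)]
        # Eq. (11): move the context weights towards C(s-1).
        for m in units:
            w_con[u][m] = [w + rate * (c - w)
                           for w, c in zip(w_con[u][m], context[m])]

    # Eq. (13): attenuate every context unit by beta ...
    for m in units:
        context[m] = [beta * c for c in context[m]]
    # ... Eq. (12): and add the input share to the BMU's context unit.
    context[bmu] = [c + (1 - beta) * xi for c, xi in zip(context[bmu], x)]
    return bmu


# Tiny 2 x 2 map with 0-initialization and a neighborhood of the BMU only.
units = [(r, c) for r in range(2) for c in range(2)]
w_in = {u: [0.0, 0.0, 0.0] for u in units}
w_con = {u: {m: [0.0, 0.0, 0.0] for m in units} for u in units}
context = {m: [0.0, 0.0, 0.0] for m in units}


def only_bmu(bmu, unit):
    return 1.0 if unit == bmu else 0.0


x = [1.0, -1.0, 1.0]  # made-up 3-dimensional input symbol
first = fsom_step(x, context, w_in, w_con, 0.8, 0.9, 0.9, only_bmu)
second = fsom_step(x, context, w_in, w_con, 0.8, 0.9, 0.9, only_bmu)
```

In this sketch the context is attenuated only after the weights have been updated, so that Eq. (11) sees C(s – 1) as written.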
This process is repeated for each category symbol, one by one. When a full stop period mark is encountered, that is, when an input string has been fed completely, all the weight vectors are adjusted and the context vectors $C = [C_{M_{1,1}}, C_{M_{1,2}}, \ldots, C_{M_{2,1}}, C_{M_{2,2}}, \ldots, C_{M_{u,v}}]$ of all the units $M_{u,v}$ on the context layer M are flushed. Then, the next input string is fed, one symbol at a time. This means that no contextual information about the categorical sequences is carried over to the next string. This is because, unlike semantic and/or pragmatic information, no syntactic information has to be passed on to the next sentence. In other words, syntactic well-formedness does not depend on the syntactic information of the immediately preceding sentence. When all the i strings of the training data set have been fed, the next cycle T + 1 of the training session begins. The order of the i strings of the training data set is randomized, so that FSOM is fed with the same string tokens, but in a distinct order in each training cycle T. The learning coefficient α(T) is monotonically decreased every training cycle T, and the neighborhood domain of the BMU shrinks, so that fewer units are affected and the learning becomes less efficient. When a predetermined number T of training cycles is through, the training session ends, and all the weight vectors $W^{\mathrm{in}}_{L_{u,v}}$ and $W^{\mathrm{con}}_{L_{u,v}} = [W^{\mathrm{con}}_{L_{u,v}:M_{1,1}}, W^{\mathrm{con}}_{L_{u,v}:M_{1,2}}, \ldots, W^{\mathrm{con}}_{L_{u,v}:M_{2,1}}, W^{\mathrm{con}}_{L_{u,v}:M_{2,2}}, \ldots, W^{\mathrm{con}}_{L_{u,v}:M_{u,v}}]$ are frozen. Then, all the i strings of the training data set are fed again, just one final time, determining BMUs for each symbol, without updating any weight vectors.
Instead of updating any weight vectors, the system records on the BMU List the coordinates u, v of the BMUs for each symbol, and then, just as in the training session, the context vectors $C = [C_{M_{1,1}}, C_{M_{1,2}}, \ldots, C_{M_{2,1}}, C_{M_{2,2}}, \ldots, C_{M_{u,v}}]$ of all the units $M_{u,v}$ on the context layer M are attenuated, further giving an input dividend to the context unit $M^{*}_{u,v}$, located on the same coordinates u, v as each BMU, $L^{*}_{u,v}$.
5 Simulations and Results
With the settings presented above, we conducted experiments on language identification tasks. In our preparatory experiments, we determined that a map size of 15 × 15 units in a square array is good for our simulations. Thus, both the competition layer L and the context layer M are two-dimensional planes of 225 units each. Also, we found that the attenuation rate β of 0.9 and the apportion rate γ of 0.8 yield the best result. That is, the apportion ratio is 0.8 to 0.2; the input vector $x_{i,j}$ is four times as important as the context vectors $C$ in determining the BMU $L^{*}_{u,v}$. In one training session, FSOM was presented with the training data set of 1,000 grammatical strings, encoded in 3-bit vectors, for T = 100 cycles. In every cycle T, all the 1,000 strings are presented in a random order. The learning coefficient α starts from 0.9 at T = 1, and linearly decreases to 0.15 at the end of the training, T = 100. The neighborhood function $h[T, L^{*}_{u,v}, L_{u,v}]$ covers approximately three-quarters of the competition layer L centered around the BMU at T = 1, and it linearly shrinks to pick out only the BMU toward the end of the training session, T = 100. Before starting the training, all the layers are initialized. The standard initialization is to randomize the weight vectors of all the units. Instead, we used 0-initialization, setting the weight vectors of all the units to [± 0, ± 0, ± 0]. This is to approximate the
tabula rasa of Aquinas-Lockean empiricism, to see how well the proposed FSOM behaves. Since all the layers are 0-initialized before the training session starts, all the units are at the same Euclidean distance from the first input symbol $x_{1,1}$ and would equally qualify as the winning BMU. Thus, we force the center unit $L_{8,8}$ to be the winner BMU by default. With these conditions, FSOM was trained with 1,000 strings randomly picked from the strings generated with the rules (1–7) with one CP embedding, without recursion of any other phrasal categories, and with the Kleene star limited to 1, as explained in section 2.3. The longest string among those 1,000 strings had 58 category symbols. After training for T = 100 cycles, the 1,000 strings used for the training were presented one more time, without updating any weight vectors. Yet, each symbol in these 1,000 strings determines its own BMU, which is recorded on the BMU List. Since the map size is 15 × 15, there are 225 units that can potentially be the BMU, but only 17 BMUs were recorded. That is, the contextual “knowledge” has converged on 17/225 units. To test FSOM’s ability to judge the un/grammaticality of strings that have not been used for training, we created two test data sets, one of grammatical strings and the other of ungrammatical strings. The test data sets of 20,000 randomly picked grammatical strings and 20,000 ungrammatical strings, 40,000 strings in total, were generated as explained in section 2.4. The longest among those 40,000 strings had 661 category symbols. In each of the grammatical and ungrammatical test data sets, 10,136 strings were equal to or shorter than the longest training string, which had 58 symbols, and 9,864 strings were longer than 58 symbols. Then, all the 40,000 strings are fed to the FSOM frozen after training.
If all the BMUs of a test string are on the BMU List, FSOM judges the string to be grammatical; if any of the BMUs of a test string, even a single one, is not on the BMU List, FSOM judges the string to be ungrammatical. With these judgment criteria, 98.73% of the grammatical strings were correctly judged grammatical, whereas 86.43% of the ungrammatical strings of random symbol sequences were correctly judged ungrammatical. That is, 92.58% accuracy on average. With this high accuracy, we may conclude, at least in this simulation, that FSOM was able to distinguish grammatical strings from ungrammatical strings, both of arbitrary length.
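The judgment criterion and the reported average can be restated as a short sketch. The function name `judge` and the BMU coordinates are made up for illustration; only the two percentages are taken from the results above.

```python
def judge(bmu_trace, bmu_list):
    """Judge a string grammatical only if every BMU it visits is on the BMU List."""
    return all(bmu in bmu_list for bmu in bmu_trace)


# Made-up BMU List and BMU traces (coordinates u, v), for illustration only.
bmu_list = {(8, 8), (7, 8), (8, 9)}
trace_ok = [(8, 8), (7, 8), (8, 8)]
trace_bad = [(8, 8), (3, 12), (8, 9)]  # one BMU not on the list

judged_ok = judge(trace_ok, bmu_list)
judged_bad = judge(trace_bad, bmu_list)

# Both test sets have 20,000 strings, so the overall accuracy is the plain mean.
average_accuracy = (98.73 + 86.43) / 2
```

Because the two test sets are of equal size, the unweighted mean of the two rates equals the accuracy over all 40,000 strings.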
6 Concluding Remarks
We have proposed an experimental design for a language identification task by Feedback Self-Organizing Maps, which is substantially different from the recurrent network model of Elman’s SRN; the two may be able to complement each other. Elman used English sentences at the word level. The SRN learned to predict the word that follows, and Elman extracted hierarchical category relationships from the activation levels of the hidden units of the network. In contrast, our design uses category symbols to judge the grammaticality of the entire sequence. As a preliminary study, the design yielded results better than expected. With subsymbolic processing, FSOM acquired a sort of “suprasymbolic” grammar, so to speak, in the sense that 4-dimensional information (temporal sequence of 3-dimensional
information) was reduced to 2-dimensional clusters of sublimed syntactic rules, which appear to simulate states of nondeterministic automata.
Acknowledgements. The research reported here is partially supported by a 21st Century Center of Excellence Program (#J19), granted to Kyushu Institute of Technology, by the Ministry of Education, Culture, Sports, Science and Technology, Japan. The standard disclaimers apply.
References
1. Elman, J. L.: Finding Structure in Time. Cognitive Science 14 (1990) 179–211
2. Chomsky, N.: Rules and Representation. Basil Blackwell, Oxford (1980)
3. Baker, C. L. (ed.): The Logical Problem of Language Acquisition. MIT Press, Cambridge (1981)
4. Pinker, S.: Language Learnability and Language Development. Harvard University Press, Cambridge (1984)
5. Gordon, P.: Learnability and Feedback. Developmental Psychology 26 (1990) 217–220
6. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin Heidelberg New York (1995)
7. Chappell, J. G. and J. G. Taylor: The Temporal Kohonen Map. Neural Networks 6 (1993) 441–445
8. Versta, M., J. Heikkonen, and J. Del Ruiz Millán: Context Learning with the Self-Organizing Map. Proceedings of the Workshop on Self-Organizing Maps ’97 (1997) 197–202
9. Horio, K. and T. Yamakawa: Feedback Self-Organizing Map and Its Application to Spatio-Temporal Pattern Classification. International Journal of Computational Intelligence and Applications 1 (2001) 1–18
10. Wakuya, H., H. Harada, and K. Shida: An Architecture of Self-Organizing Map for Temporal Signal Processing and Its Application to a Braille Recognition Task. IEICE Transactions on Information and Systems J87DII (2004) 884–892
11. Gold, E. M.: Language Identification in the Limit. Information and Control 10 (1967) 447–474
12. Chomsky, N.: The Minimalist Program. MIT Press, Cambridge (1995)
13. Cinque, G.: Adverbs and Functional Heads: A Cross-Linguistic Perspective. Oxford University Press, Oxford New York (1999)
14. Radford, A.: Transformational Grammar. Cambridge University Press, Cambridge (1988)
Text Representation by a Computational Model of Reading
J. Ignacio Serrano and M. Dolores del Castillo
Instituto de Automática Industrial, Spanish Council for Scientific Research, Ctra. Campo Real km 0.200 – La Poveda, 28500 Arganda del Rey, Madrid, Spain
{nachosm, lola}@iai.csic.es
Abstract. Traditional document indexing methods, although useful, do not take into account some important aspects of language, such as syntax and semantics. In contrast, semantic hyperspaces are mathematical and statistical techniques that do. However, although they are an improvement on traditional methods, the output representation is still vector-like. This paper proposes a computational model of text reading, called Cognitive Reading Indexing (CRIM), inspired by some aspects of human reading cognition, such as sequential perception, temporality, memory, forgetting and inferences. The model produces not vectors but nets of activated concepts. This paper focuses on indexing or representing documents in that way so that they can be labeled or retrieved, and presents promising results. The system was applied to model human subjects as well, and some interesting results were obtained.
1 Introduction
Owing to the growing amount of digital information stored in natural language, systems that automatically process text are of crucial importance and extremely useful. There is currently a considerable amount of research work using a large variety of machine learning algorithms that are applied to text categorization (automatic labeling of texts according to category) and information retrieval (retrieval of texts similar to a given cue), either from databases or from the World Wide Web. Until fairly recently, most of these systems used the highly common electronic text representation, “bag of words” [12]. This representation considers texts as vectors of size n, n being the total number of words that appear within a given text collection. Accordingly, if the word k appears in a text, then the representation of that text will contain a certain value in position k of the corresponding vector. Otherwise, this value in position k will be equal to zero. There are different ways of calculating the values of the vector, such as the number of times the word occurs in the text, the relative frequency, or the frequency multiplied by the inverse of the global word frequency, the well-known tf · idf (term frequency · inverse document frequency). These vectors are the input to the training and validation stages of the knowledge discovery algorithms.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 237–246, 2006. © Springer-Verlag Berlin Heidelberg 2006
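A minimal sketch of the “bag of words” representation with tf · idf weights is given below for illustration. The three toy documents are made up, and the weighting uses one common variant, idf = log(N / df), which is an assumption; several other variants are in use.

```python
import math
from collections import Counter

# Toy collection (made up); each document is a list of words.
docs = [
    "the match ended in a draw".split(),
    "the bank raised interest rates".split(),
    "the team won the match".split(),
]

# One vector position per distinct word in the collection.
vocab = sorted({word for doc in docs for word in doc})
# Document frequency: in how many documents each word appears.
df = Counter(word for doc in docs for word in set(doc))


def tfidf_vector(doc):
    """Vector of size n = len(vocab); position k holds tf * idf of word k."""
    counts = Counter(doc)
    return [counts[word] * math.log(len(docs) / df[word]) for word in vocab]


vectors = [tfidf_vector(doc) for doc in docs]
```

With this variant, a word that occurs in every document (here “the”) receives weight zero, and a word absent from a text leaves a zero in its position.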
2 Related Work
In the mid-nineties, word hyperspaces were proposed as an alternative to the traditional approach. LSA (Latent Semantic Analysis) [4] was the first of these systems, followed by HAL (Hyperspace Analogue to Language) [1], PMI-IR [14], Random Indexing [2], WAS (Word Association Space) [13] and ICAN (Incremental Construction of an Associative Network) [6]. These kinds of systems build a representation, a matrix, of the linguistic knowledge contained in a given text collection. The main differences among these approaches are the ways in which they obtain and represent this knowledge. The representation, or hyperspace, takes into account the relationship between words and the syntactic and semantic context where they occur, and this is the main difference from the common “bag of words” representation. However, once the hyperspace has been built, word hyperspace systems represent the text as a vector with a size equal to the size of the hyperspace by using the information hidden in it, and by doing operations with the rows and the columns of the matrix corresponding to the words in the texts. Although the hyperspace representation contains much more information than the traditional representation because the vector values are the result of word and context interaction, texts are still a set of numbers without a structure. However, this approach has been shown to be a real improvement on the classical representation. Only ICAN introduces a structural representation and does not store linguistic knowledge as a matrix but as a net of associated words. These associations have a weight calculated from probabilities of co-occurrence and non-co-occurrence between word pairs. This model makes it possible to incrementally add new words without retraining and recalculating the knowledge, which is psychologically more plausible. This approach proposes representing linguistic knowledge as a net of concepts associated by context.
In ICAN, texts are subnets of the global knowledge net, formed by the nodes corresponding to the words in the texts and their associations. Texts are thus compared by calculating the average (or any other function) similarity within the subnets for all the words they contain. Although ICAN authors state that the construction of text representation from the words in the text is not done directly in their system, a fact that is psychologically plausible, the opposite seems true if we think about the subnet representation that they propose. In spite of the progress made with word hyperspaces, human beings continue to do text classification and information retrieval tasks much better than machines, although of course more slowly. It is hard to believe that linguistic knowledge is represented as a matrix in the human mind and that text reading is carried out by mathematical operations on this matrix. Human reading is a process of sequential perception over time, during which the mind builds mental images and inferences which are reinforced, updated or discarded until the end of the text [9]. At that moment, this mental image allows humans to summarize and classify the text, to retrieve similar texts or simply to talk about the text by expressing opinions. The model presented here is inspired by the ICAN connectionist approach, where words and texts do not share the same structure of representation, unlike the systems mentioned above. The notion of context and the way of weighting associations are some of the differences from the ICAN approach, although the main difference lies in the text-to-representation process. What is proposed here is to build text representations as a
result of a process over time, with a structure that makes it possible to indirectly describe the salience and relations of words at every instant during the reading process. Other computational models of reading exist which seek to assess a theory of reading rather than to serve a real data-intensive application. Most of them are based on connectionist networks inspired by the Construction-Integration model [3] and focus on different stages of reading and targets: the representation and understanding of fiction in an associative net, the interaction of different knowledge sources at sentence level during reading and the representation of language for complex narrative understanding are presented in [11]. In [5], the reminding process during reading is explained by inferences and disambiguation, and a connectionist model of episodic memory is proposed. A modification of the Construction-Integration model for narrative comprehension is also explained in [11]. In [7] the importance of text structure and writing style for comprehension is highlighted. Even creativity is the target of studies, through the comprehension of novel concepts [11]. The works just mentioned show that there are a large number of complex cognitive processes underlying reading. The model proposed here, called CRI (Cognitive Reading Indexing), is a simple model that takes into account only a few cognitive processes, and although it is aimed at a real application, it is inspired by and closer to humans than the other systems in the same application field.
3 CRIM: Cognitive Reading Indexing Model
The previous knowledge required by CRIM collects the semantics and the way in which words are related to each other during the reader’s previous experience. Since the model presented belongs to the connectionist paradigm, the representation selected for this linguistic information is a net of concepts that are associated with each other by weighted connections. A net concept is considered here as a single lemmatized word. As already stated, this knowledge must be acquired from previous experience. This experience is provided as a set of texts representing a certain level of linguistic knowledge. The collected texts are then analyzed: all the words in the texts that do not appear in a stop list are lemmatized by Porter’s algorithm [10] and then added to the net as concepts. The concepts that co-occur within the same context are associated by adding a connection between them. The definition of the context highlights a difference between this model and similar models. In this case, since it is not a fixed window, the size of the context depends on the texts themselves and on how they are written. Thus a local context for a word is bounded by the sentence in which it occurs, and all the concepts co-occurring in the same sentence are associated. Accordingly, the aim is to capture the grammar explicitly, unlike other systems that do not expressly do so but are afterwards shown to have done so. The next step consisted of setting the association weights between net concepts. These associations are not symmetrical. The association weight between concept A and concept B does not have to be the same as the association weight between concept B and concept A. This is another difference from similar representation systems. Given the total number of occurrences of a concept in the text collection, its association weights are established as the proportions of co-occurrences of the concept and its associated concepts within
the local context. For example, if the word “A” appears 10 times in the texts and it co-occurs 6 times with the word “B” and 4 times with the word “C”, the weights will be 0.6 for the A-B association and 0.4 for the A-C association. Once the linguistic knowledge has been built, it is used to represent new texts in a human-like fashion. In [9] some of the well-known cognitive processes during human reading are mentioned: working memory management, forgetting and inferences. The model presented assumes all these processes during the reading task over time. Given the input document written in natural language, when the model reads a word from the text, its corresponding concept is sought among the linguistic knowledge in order to determine whether the system “knows” the word. If it does, the concept is activated and retrieved to the working memory with a base activation value. If the concept was already allocated in the working memory, its activation value is increased by the base value. The current activation of the concept is then propagated to all its associated concepts. The propagated value is equal to the activation of the concept multiplied by the association weight. The propagated activation is added to the activation of the receptor concept if it is in the working memory. If not, this neighbor concept is retrieved to the working memory and adopts the propagated activation as its own. This concept then propagates its current activation to its associations and so on, until the propagated activation is lower than a certain threshold or the level of propagation is higher than another threshold. The level of propagation is defined as the number of nodes that the activation passes through. Given the activation of a concept, the activation spreading can be viewed as the inference process during which the concepts affected by the spreading are the inferred concepts.
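The asymmetric association weights can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `association_weights` is hypothetical, sentences are assumed to be already lemmatized and filtered, and occurrences and co-occurrences are counted once per sentence.

```python
from collections import Counter, defaultdict
from itertools import permutations


def association_weights(sentences):
    """weights[a][b] = (sentences containing a and b) / (sentences containing a)."""
    occurrences = Counter()
    together = defaultdict(Counter)
    for sentence in sentences:
        concepts = set(sentence)  # each concept counted once per sentence
        occurrences.update(concepts)
        for a, b in permutations(concepts, 2):
            together[a][b] += 1
    return {
        a: {b: n / occurrences[a] for b, n in neighbours.items()}
        for a, neighbours in together.items()
    }


# The example from the text: "A" occurs 10 times, 6 times with "B", 4 with "C".
sentences = [["A", "B"]] * 6 + [["A", "C"]] * 4
weights = association_weights(sentences)
```

Here the A-B weight is 0.6 while the B-A weight is 1.0, since “B” never occurs without “A” in this toy collection, which shows why the associations are not symmetrical.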
If the inferred concepts are already in the working memory this means that they are expected to appear, and the processing and retrieval is faster than if they are not. The thresholds previously mentioned control inference depth and degree, and they are the targets of the experiments performed. Some inference theories are also indicated in [9]: the first is the selective access model in which only inferred concepts already in the context (i.e. in the working memory) are considered and retrieved. The second is the multiple access model in which all possible inferences are quickly accessed and retrieved and then a process selects only one on the basis of the context. The third possibility is the limited multiple access model in which inferences are done depending on the relative frequency of the inferred meanings, and only the most frequent meaning is accessed, although this frequency is variable and dependent on the current context. The model presented here assumes a hybrid approach of these theories by accessing all possible inferences and weighting them depending on the relative fixed frequencies of meanings. The selection of the most appropriate inference is carried out by the context over time, as explained below. Human memory is limited, so humans cannot retain everything that they have read. To model this issue the forgetting factor analogous to temperature in [8] is introduced. At specific time intervals the activations of the concepts in the working memory are decreased by a factor, whose optimal value is also a target of the experiments performed as is the definition of the time interval in terms of number of words. If the activation of a concept falls below the propagation threshold, then the concept is taken out of the working memory and accordingly forgotten. Let us imagine some concepts inferred from concept A and retrieved from the linguistic knowledge to the working memory. 
If one of the inferred concepts is indirectly activated by further concepts during text reading, this will
“survive” over the other inferred concepts, which will finally be forgotten. It is also worth remarking that the concept most related to concept A is the concept most likely to be kept, because its initial activation in the working memory is higher. The context thus selects the most appropriate inferences from all those retrieved at the initial stage for any word. Given that the model generates all possible inferences, it is necessary to determine the order of this generation, because it affects the inference activation of concepts. The order is defined by the spreading method. Two possibilities are considered: to propagate activation by levels or by depth. In the first, the activation is propagated to all the associated neighbors, and then each of them, sorted by association strength, propagates the activation to its associates in the same way, and so on. In the second, the activation is recursively propagated through the most strongly associated neighbor first, then through the next most strongly associated neighbor, and so on. Experiments were carried out to compare the two inference generation methods. Thus, each word in the text is read, either retrieving it from the linguistic knowledge (long-term working memory [3]) to the working memory or increasing its activation value by a base amount, and then spreading this activation to its neighbors to generate inferences. After a specific time interval, all the concepts currently in the working memory lose part of their activation because of the forgetting factor. At every moment during reading, the working memory contains the mental representation of the text as a net of related concepts with levels of activation. Once the last word of the text has been read, the working memory contains the final mental representation of the text. This representation is the result of a somewhat top-down-top process, as human reading is thought to be in [15].
From the word graphemes, semantic concepts are retrieved, concept inferences are generated, and finally, next word graphemes reinforce some inferences and discard others in a perception-reasoning sequence over time. This is the main difference with existing text indexing systems.
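The reading process described in this section can be summarized in a minimal sketch. Several points are assumptions made for illustration rather than details confirmed by the text: the function names `spread` and `read`, the default parameter values, spreading by depth only, forgetting implemented as multiplication of every activation by the forgetting factor, and a concept being “known” when it is a key of the weight dictionary.

```python
def spread(memory, weights, concept, threshold, max_level, level=1):
    """Spread activation by depth, strongest association first."""
    if level > max_level:
        return
    source = memory[concept]
    neighbours = sorted(weights.get(concept, {}).items(),
                        key=lambda item: -item[1])
    for neighbour, weight in neighbours:
        propagated = source * weight
        if propagated < threshold:
            continue  # too weak to be inferred
        # Add to an existing activation, or retrieve the concept with it.
        memory[neighbour] = memory.get(neighbour, 0.0) + propagated
        spread(memory, weights, neighbour, threshold, max_level, level + 1)


def read(words, weights, base=1.0, threshold=0.3, max_level=3,
         forgetting=0.9, interval=15):
    """Return the working memory (concept -> activation) after reading."""
    memory = {}
    for position, word in enumerate(words, start=1):
        if word in weights:  # the system "knows" the word
            memory[word] = memory.get(word, 0.0) + base
            spread(memory, weights, word, threshold, max_level)
        if position % interval == 0:
            # Forgetting: decay every activation, drop what falls too low.
            memory = {c: a * forgetting for c, a in memory.items()
                      if a * forgetting >= threshold}
    return memory


# Toy linguistic knowledge (made up), with asymmetric weights.
weights = {"A": {"B": 0.6, "C": 0.4}, "B": {"A": 1.0}, "C": {"A": 1.0}}
memory = read(["A", "B"], weights, max_level=1, interval=2)
```

In this toy run both “B” and “C” are inferred from “A”, but only “B” is later reinforced by being read, so it ends with a higher activation than “C”, mirroring how the context selects among inferences over time.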
4 Experiments
Two kinds of experiments were carried out. The first was intended to optimize the parameters of the CRI model for text classification. The second compared the model with humans, by identifying the configuration that is most similar to the average human. For the first experiment, a corpus of recent Spanish texts was collected from Google News. It consisted of 150 news items equally distributed among five thematic categories: Science, Sports, Economy, Culture and Health. The corpus was divided into three subsets of equal size; two subsets were used as the training set and the other as the test set. The linguistic knowledge was built from each corresponding training set and a collection of about 500 general culture texts with a high academic level, in order to endow the model with background knowledge wider than that contained in 100 training texts. For each training set, the whole corpus was indexed using the corresponding linguistic knowledge, and then the training and test sets were given as input to three supervised learning algorithms: Naïve Bayes, Support Vector Machines and K-Nearest Neighbors. These methods have been shown to be the best ones for text classification [12]. The performance measurements considered were the F-measure, a combination of the
[Plots: x-axis, propagation threshold; y-axes, F-measure (%) and correlation; one line per forgetting factor value]
Fig. 1. a) Average F-measure and b) average correlation, for different combinations of values for the forgetting factor and the propagation threshold
percentage of examples correctly classified (precision) and the percentage of examples of each category correctly identified (recall), and the correlation between the true labels and the predicted ones. These measures were macro-averaged over the three divisions of the corpus and the three algorithms. The parameters were then modified and the same process was repeated, and so on. The final results show which parameter values produce the text representations that are classified best. Since the input of the algorithms must be a vector, the indexed texts are represented as vectors with the activation values of the final concept net in the corresponding positions. Fig. 1 shows the results for the optimization of the propagation threshold and the forgetting factor. Each line in the figure corresponds to a value of the forgetting factor, and the propagation threshold is represented on the x-axis. Fig. 1a) presents the average classification results in terms of F-measure, and Fig. 1b) in terms of correlation. Obviously, the propagation threshold must always be lower than the forgetting factor so as not to forget the concepts at the moment when they are brought to the working memory. The results show that the higher the forgetting factor is, the better the classification results are, and the reverse for the propagation threshold. Thus a large memory and wide inferences seem to be very useful for the categorization task. Next, using the best values for the forgetting factor (0.9) and the propagation threshold (0.3), the level of propagation and the way of activation spreading were tested. Fig. 2 shows the classification results, a) F-measure and b) correlation, for both ways of activation spreading, one per line, and for different values of the maximum level of propagation on the x-axis. As can be seen, the variation in the results is very small. However, activation spreading by depth seems to work better than by levels.
The correlation results show that inferring indirect concepts up to the third level is best for classification tasks. Then, using the best values found for propagating the activation, the time interval for forgetting was tested. Time is counted here in terms of words read. There are two options: to forget after a fixed number of words or to forget after every sentence. All earlier experiments were carried out using the sentence interval. Fig. 3 presents the correlation results for different fixed sizes of the interval. Forgetting every 15 words obtains the best performance, with an average correlation of 0.56, against the maximum correlation of 0.49 previously obtained with the sentence interval.
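The evaluation measures used above can be illustrated with a minimal sketch. It assumes that the F-measure is the balanced harmonic mean of precision and recall (F1) and that macro-averaging is an unweighted mean; the category names and scores below are illustrative values, not results from the experiments.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over categories (or corpus divisions, or algorithms)."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical per-category (precision, recall) pairs, in percent.
scores = {"cat_a": (80.0, 60.0), "cat_b": (50.0, 50.0), "cat_c": (0.0, 0.0)}

per_category_f = {c: f_measure(p, r) for c, (p, r) in scores.items()}
macro_p = macro_average(p for p, _ in scores.values())
macro_r = macro_average(r for _, r in scores.values())

print(per_category_f)
print("macro precision:", macro_p, "macro recall:", macro_r)
print("F from macro-averaged precision and recall:", f_measure(macro_p, macro_r))
```

As a consistency check, the "acq" values reported for the bag-of-words representation in Table 1 (precision 22.66, recall 92.89) give 36.43 under this formula, which matches the F value reported there.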
Fig. 2. a) Average F-measure and b) average correlation, for different combinations of activation spreading method and maximum level of propagation
[Plots omitted: a) F-measure (%) and b) correlation, each against the maximum level of propagation (1 to 3), with one line per activation spreading method (by depth, by levels).]
Fig. 3. Average correlation for different values of the forgetting interval size
[Plot omitted: correlation against the number of words in the forgetting interval (5 to 19).]
Finally, the representation produced by the model was compared with the traditional “bag of words” representation. The Reuters-21578 collection was used to test classification performance. Categories with fewer than 200 documents were discarded, leaving eleven categories. Then, the dataset was divided into three parts of the same size in order to carry out a 3-fold cross-validation. In each of the three executions, the linguistic knowledge was built using the training documents and all the examples were represented according to the corresponding knowledge. Table 1 shows the average precision, recall, F-measure, accuracy and correlation results of the SVM classifier for each of the eleven categories and the average for all of them. The indexing produced by the CRI model outperforms the traditional representation in average F-measure (10.25 against 7.98), accuracy (38.91 against 22.11) and correlation (0.125 against 0.032).

Table 1. Average precision (Pr), recall (Rc) and F-measure (F) values for each category and all categories, and accuracy and correlation for “bag of words” representation and CRI representation, on Reuters collection

              “bag of words”             CRI
Category      Pr      Rc      F          Pr      Rc      F
acq           22.66   92.89   36.43      41.38   25.37   31.45
corn           0.00    0.00    0.00       0.00    0.00    0.00
crude          4.90    1.11    1.81       9.44    3.47    5.08
dlr            0.00    0.00    0.00       0.00    0.00    0.00
earn          31.23    1.30    2.50      39.46   87.61   54.41
grain          3.81    0.64    1.09       0.00    0.00    0.00
interest       0.00    0.00    0.00       0.00    0.00    0.00
money-fx       7.64    0.62    1.15      16.67    0.12    0.24
ship           0.00    0.00    0.00       0.00    0.00    0.00
trade          9.22    1.46    2.52       2.08    0.18    0.34
wheat          0.00    0.00    0.00       0.00    0.00    0.00
Average        7.22    8.91    7.98       9.91   10.61   10.25
Correlation    0.032                      0.125
Accuracy      22.11                      38.91

In order to test the similarity of the CRI model with humans, the following experiment was carried out: five texts, not included in the Google news corpus, belonging to each of the five categories considered, were given to 15 individuals. They were asked to read each text carefully just once and try not to remember anything. They were also told that they would have to write an informal summary of the text highlighting the most salient aspects. After that, the same texts were represented with the CRI model using linguistic knowledge built from the entire corpus and the general culture texts. The model parameters were varied in the same way as in the experiments mentioned above. Then, each representation from the model was transformed into a list sorted from the highest activation level of the concepts it contained to the lowest. After that, all the CRI representations of the texts for each category were compared with the summaries of the same category written by the individuals. The comparison was done by computing the average distance between the words in both texts. Since the texts are sorted by salience, the distance for a word is the difference between the relative positions of the word in both texts. Thus, the average similarity over all individuals was calculated for each CRI representation with different parameters, and the parameter values of the most similar representation were taken as the nearest to humans.
Fig. 4a) presents the average similarity measurements for the propagation threshold and forgetting factor parameters, the same parameters as in the previous optimization experiment. The results show that a forgetting factor of 0.9 and a propagation threshold of 0.5 are the values that make the model most similar to humans. These values are very close to the optimum ones for classification (0.9 and 0.3). It is also remarkable that the curves resemble those of Fig. 1b), which suggests that CRI might successfully model human reading, in an approximate way, of course. Fig. 4b) presents the same results for the activation spreading method and propagation level. In this case, propagation by depth is more similar to humans than propagation by levels, the same result as in the optimization experiment. However, the best maximum level of propagation is 1, contrary to the best value obtained for classification. Moreover, the higher the level of propagation, the more the model differs from the individuals, which means that the model most similar to humans only uses inferences from direct associations.
For activation spreading by levels, the maximum level of propagation does not seem to have any effect, since the similarity remains constant, as in the optimization results. Fig. 4c) shows that the forgetting interval most similar to humans has a size of 11 words. The curve is also similar to that of Fig. 3, but classification needs a larger interval of 15 words.

Fig. 4. Average similarity of human subjects with CRI representations obtained with different a) combinations of forgetting factor and propagation threshold, b) activation spreading method and maximum level of propagation, and c) size of forgetting interval
[Plots omitted: average similarity against a) the propagation threshold (0.05 to 0.7), with one line per forgetting factor, b) the maximum level of propagation (1 to 3), with one line per activation spreading method (by depth, by levels), and c) the number of words in the forgetting interval (5 to 19).]
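The comparison between model representations and human summaries described above can be sketched as follows. This is only an illustration under stated assumptions: relative positions are scaled to the range 0 to 1, only words present in both lists are compared, and similarity is taken as one minus the mean distance. The text does not specify these details, and the function names and word lists are hypothetical.

```python
def relative_positions(ranked_words):
    """Map each word to its position scaled to [0, 1] (0 = most salient)."""
    n = len(ranked_words)
    if n == 1:
        return {ranked_words[0]: 0.0}
    return {word: index / (n - 1) for index, word in enumerate(ranked_words)}


def rank_similarity(model_ranking, human_ranking):
    """One minus the mean absolute difference in relative position (shared words only)."""
    model_pos = relative_positions(model_ranking)
    human_pos = relative_positions(human_ranking)
    shared = [word for word in model_ranking if word in human_pos]
    if not shared:
        return 0.0
    mean_distance = sum(abs(model_pos[w] - human_pos[w]) for w in shared) / len(shared)
    return 1.0 - mean_distance


def average_similarity(model_ranking, human_rankings):
    """Mean similarity of one model representation to several human summaries."""
    return sum(rank_similarity(model_ranking, h) for h in human_rankings) / len(human_rankings)


# Hypothetical example: concepts sorted by activation versus two summaries
# whose words are sorted by salience.
model = ["oil", "price", "market", "opec"]
humans = [["price", "oil", "opec", "market"], ["oil", "price", "market", "opec"]]
print(average_similarity(model, humans))
```

Repeating this computation for every parameter setting and keeping the setting with the highest average similarity corresponds to the selection procedure described in the text.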
5 Conclusions and Future Work
A computational model of reading, CRI, has been presented. This model tries to simulate in part the high-level cognitive processes of the human mind over time. First, the model generates a representation of the input text as a net of concepts, in which each concept has an activation value reflecting its salience in the text. This representation is then used to index documents so that they can be categorized automatically by a supervised learning algorithm. Traditional indexing methods represent texts as the result of a series of mathematical operations. Since humans are able to classify texts much better than machines, the model tries to approximate human cognition in order to improve language tasks. The results show that, once the model parameters have been optimized, the representation obtained is an improvement on traditional indexing techniques. Some experiments were also carried out to compare the model with humans, and promising results were obtained. The structural representation of texts is planned to be used to compare and summarize them, and also for question answering systems. Another future goal of this research is to model individuals in order to detect and/or repair some language disorders related to reading.
Mental Representation and Processing Involved in Comprehending Korean Regular and Irregular Verb Eojeols: An fMRI and Reaction Time Study
Hyungwook Yim1, Changsu Park1, Heuiseok Lim2, Kichun Nam1,*
1 Department of Psychology, Korea University, Seoul
2 Department of Software, Hanshin University, Osan
[email protected]
Abstract. The purpose of this study is to investigate the cortical areas involved in comprehending Korean regular and irregular verb Eojeols. An Eojeol is the spacing unit of a sentence, which is bigger than a word but smaller than a phrase. Using neuroimaging and behavioral methods, this study showed that there is a distinction between the processing and representation of regularly and irregularly inflected verbs in Korean.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 247–254, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction
There have been several hypotheses on the processing and representation of the morphological information stored in the lexicon. The decomposition hypothesis holds that inflected and derived words are stored separately in the lexicon, in the form of a stem and its counterpart[1,2]. On the other hand, the full-list hypothesis asserts that the modified word forms are stored as a whole rather than in parts[3-5]. Combining the two, hybrid hypotheses suggest that the processing of a polymorphemic word could vary depending on its frequency, regularity or transparency[6-8].
With respect to regularity of inflection, the main issue is whether the processing and representation of a regularly inflected verb differ from those of an irregularly inflected verb[8-12]. One side of the so-called ‘English past tense debate’ claims that regularly inflected verbs are processed by rules while irregularly inflected verbs are stored as full forms in associative memory. The other side of the debate insists that both kinds of inflected words are processed as full forms. However, Korean belongs to a different language group, so applying this debate to Korean requires care and knowledge of Korean morphemes.
In Korean there is a spacing unit of a sentence called the ‘Eojeol’, which is bigger than a word but smaller than a phrase. The Eojeol usually consists of content morphemes and grammatical morphemes, so morphemes represent not only semantic information but also syntactic information. For example, ‘먹다’ [meok-da] (the root form of the word ‘eat’) can be inflected by changing the ending morpheme ‘-다’ [da] into ‘-고’ [go] (an ending morpheme meaning ‘and’), which gives ‘먹고’ [meok-go]. Inserting the past tense morpheme ‘-었-’ [-eot-] in the middle of ‘먹고’ [meok-go] gives ‘먹었고’ [meok-eot-go], which means ‘ate + and’. Moreover, the same root form can be inflected regularly or irregularly depending on the morpheme that follows the stem. For example, ‘잇다’ [it-da] (the root form of the word ‘connect’) is inflected regularly when the ending morpheme ‘-고’ [go] is attached, but when ‘-어’ [eo] is attached the spelling must be ‘이어’ [i-eo] instead of ‘잇어’ [it-eo]. The transformation of the ending of an irregularly inflected verb is driven by ease of pronunciation. Therefore, the irregular transformation of ‘잇-’ + ‘-어’ to ‘이어’ is based on the restrictions of the Korean phonological system.
The current study suggests that there is a distinction between the processing and representation of regularly and irregularly inflected verbs in Korean. However, the difference is not of the same kind as in the Indo-European languages. Both behavioral (reaction time) and brain imaging (fMRI) paradigms were used to investigate the hypothesis.
2 Experiment I
Since the frequency of a word affects reaction time, it can be inferred that if Korean regular verb Eojeols are represented in a full word form, the frequency of the full word form will affect the reaction time in a lexical decision task. However, if Korean regular verb Eojeols are represented in a morpheme-based decomposed form, the usage frequency of each morpheme (stem) will be more influential than the frequency of the full word form, and vice versa for the irregular verb Eojeols. In Experiment I, a lexical decision task (LDT) was used to investigate this hypothesis. Because no suitable Korean corpus was available, word frequency was replaced by subjective word familiarity ratings, which former studies have verified as a more accurate measure[13,14].

2.1 Method
Subject. Seventy-six undergraduates at Korea University, all native speakers of Korean with normal vision, participated in the experiment in exchange for extra credit.
Materials. A set of 20 Korean verbs that can be inflected both regularly and irregularly, 30 filler words that were not in the form of verb Eojeols, and 50 non-words served as stimuli, all with two syllables. The materials were divided randomly into two subsets, and each subset was composed of 10 regular verb Eojeols, 10 irregular verb Eojeols, 30 filler words, and 50 non-words. The filler words and non-words in each subset were identical. Additionally, 96 Korea University undergraduates who did not participate in Experiment I rated the subjective familiarity of the words. They were instructed to rate the subjective familiarity of verb morphemes (stems) and verb Eojeols on a 7-point scale (1: very unfamiliar, 7: very familiar). The mean rating of each of the 20 Korean verbs was used as the frequency index of the item.
Procedure. The subjects were tested individually. Prime and target stimuli were presented with SuperLab on a personal computer (Intel Pentium 4, CPU 2.80GHz), with visual stimuli shown in the center of a monitor with a 75Hz refresh rate. Each trial consisted of the following sequence, presented at the same monitor location. First, a fixation point (***) was presented for 1000 milliseconds. This was immediately followed by the auditorily presented prime, and the target was presented visually after the prime context. The prime, which consisted of a context of two or three Eojeols, was presented auditorily over headphones. This procedure was used to prevent ambiguity of the target words, which had been found in earlier pilot experiments. The subjects had to decide whether the presented letter string was a word or not by pressing a key on the keyboard. They were asked to respond as quickly and accurately as possible when the target word was presented. The key assignment for the response was counterbalanced across subjects. The subsets of material words were assigned at random across subjects.

2.2 Results
Only correct responses from the regularly and irregularly inflected verb conditions were analyzed. Mean response time was longer in the regularly inflected condition than in the irregularly inflected condition (regular = 580.31 (56.52), irregular = 564.53 (46.52)), but the difference was not statistically significant (t(38) = 0.964, p = 0.341). For the regularly inflected Eojeols, the partial correlation between the median reaction time and the word familiarity of the stem form (with the familiarity of the Eojeol held fixed) was marginally significant (r(17) = -.300, p = .212), while that for the familiarity of the full Eojeol form (with the familiarity of the stem held fixed) was not (r(17) = .166, p = .497).
On the other hand, for the irregularly inflected Eojeols, the partial correlation between the median reaction time and the familiarity of the full word form (with the familiarity of the stem held fixed) was marginally significant (r(17) = -.321, p = .180), whereas that for the familiarity of the stem form (with the familiarity of the full word held fixed) was not (r(17) = .164, p = .502). The partial correlation analyses did not reach statistical significance because of multicollinearity between the independent variables. However, a stepwise regression analysis demonstrated that the model for response time in the regularly inflected condition was statistically significant only when the familiarity of the stem (F(1,18) = 19.22, p < .001) was entered in the model. In contrast, the model for response time in the irregularly inflected condition was statistically significant only when the familiarity of the full word form (F(1,18) = 11.55, p < .005) was entered.
The results indicate that the familiarity of the irregular full verb Eojeol is more influential than the familiarity of the verb stem in recognizing Korean irregular verb Eojeols, while the familiarity of the regular verb stem is more influential than the familiarity of the regular full verb Eojeol in recognizing Korean regular verbs. In conclusion, it can be asserted that irregularly inflected Eojeols are represented in the full-list form and regularly inflected Eojeols are represented in the decomposed form. However, the results of Experiment I could not distinguish memory-based processing from rule-based processing in regularly inflected verbs. Experiment II was performed to investigate this problem.
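The partial correlations reported above can be sketched with the standard first-order formula, which removes the linear effect of one familiarity measure from the other two variables. With 20 items the statistic has 20 - 3 = 17 degrees of freedom, which matches the r(17) values in the text. The data below are synthetic and purely hypothetical (they are not the study's ratings or reaction times), and the variable names are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z removed from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic items: stem and Eojeol familiarity are deliberately collinear,
# and reaction time depends on stem familiarity only (hypothetical values).
rng = random.Random(42)
stem_fam = [rng.uniform(3.0, 7.0) for _ in range(20)]
eojeol_fam = [s + rng.gauss(0.0, 0.5) for s in stem_fam]
median_rt = [700.0 - 25.0 * s + rng.gauss(0.0, 20.0) for s in stem_fam]

degrees_of_freedom = len(median_rt) - 3
print("df:", degrees_of_freedom)
print("RT vs stem | Eojeol fixed:", partial_corr(median_rt, stem_fam, eojeol_fam))
print("RT vs Eojeol | stem fixed:", partial_corr(median_rt, eojeol_fam, stem_fam))
```

When the two familiarity measures are strongly correlated, as in this synthetic data, each partial correlation becomes unstable, which is the multicollinearity problem the text mentions as the reason for turning to stepwise regression.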
3 Experiment II
In Experiment II, brain imaging was used to show that, despite the formal difference between regularly and irregularly inflected verbs, rules are not involved in comprehending Korean regular verb Eojeols, contrary to previous studies. We hypothesized that the regular transformation of a verb is based merely on meaning, using associative memory rather than rules.

3.1 Methods
Subject. Ten undergraduates (five male, five female) at Korea University, who did not participate in Experiment I, volunteered for the experiment. All subjects were native speakers of Korean, strongly right-handed with normal vision, and had no history of neurological or psychiatric disorders.
Materials and Procedure. The stimuli were presented in 10 activation blocks of 30 s alternating with 12 baseline blocks of 30 s, divided into two sessions. Each activation block contained 7 Korean verbs and 3 non-words used in Experiment I, in random order. In each activation block, the words were either regularly or irregularly inflected verbs, and the task instruction was the same as in Experiment I. In each control block, three-character strings made up of combinations of “#”, “$”, and “%” were presented, and participants were asked to decide whether a specific character, announced before the session, appeared in the string. One trial took three seconds, and one string was presented for one second during a trial. Subjects had to respond ‘yes’ or ‘no’ using a button, and practice trials were performed before the main experiment. Session order and response hand were counterbalanced across the participants.
Data Acquisition and Analysis. Echo-planar imaging data were collected on a 1.5 Tesla GE medical system scanner with TR = 3000 msec, TE = 60 msec, 64 × 64 matrix, FOV 24 × 24 cm, flip angle 90 degrees, and twenty contiguous 3.75-mm-thick axial slices with no gap. Two scans of 114 phases each (342 seconds, including 12 seconds of dummy phases that were excluded from the analysis) were collected from each subject. Image analysis was conducted using SPM2 (Wellcome Department of Cognitive Neurology, UK), which allows for realignment, coregistration, normalization and smoothing in sequence. Analysis was carried out using a general linear model with a boxcar waveform convolved with the hemodynamic response function. A mean contrast image was produced per subject by comparing the experimental conditions to the control condition. For the group analyses, statistical inference was based on random Gaussian fields, and a random effect model for p-value correction was applied. Reported activations meet a statistical threshold of p < 0.001 and a minimum extent of 20 voxels, adjusted for search volume.

3.2 Results
The results for each condition after subtraction of the control condition are shown in Fig. 1, Fig. 2 and Table 1. In the regular verb condition, significant activations in the left hemisphere were observed in the precentral gyrus(BA4), paracentral lobule(BA5), medial frontal gyrus(BA6), inferior occipital gyrus(BA18), postcentral gyrus(BA3), transverse temporal gyrus(BA42), superior temporal gyrus(BA42), and the middle temporal gyrus(BA21), while significant activations in the right hemisphere were observed in the insula(BA13), medial frontal gyrus(BA6), parahippocampal gyrus(BA35), lingual gyrus(BA18) and the postcentral gyrus(BA5). In the irregular verb condition, significant activations in the left hemisphere were found in the middle temporal gyrus(BA22), postcentral gyrus(BA5), and the insula(BA13), while significant activations in the right hemisphere were observed in the inferior frontal gyrus(BA47), insula(BA13), middle frontal gyrus(BA11), sub-gyral(hippocampus), superior temporal gyrus(BA22), sub-gyral(BA20), and the middle temporal gyrus(BA39, BA21).
The results show that the activation patterns of the two conditions differ, and that these patterns are unlike those from former studies with English and German. Studies dealing with the past tense debate in English or German showed that regularly inflected verbs are related to the left inferior frontal lobe whereas irregularly inflected verbs are related to the temporal lobe[8]. However, the present study shows that regularly inflected verbs are not related to the left inferior frontal lobe. Both regularly and irregularly inflected verbs are related to the temporal lobe, along with areas such as the postcentral gyrus, precentral gyrus and the insula, which reflect the phonological characteristics of the Korean language itself[15-17]. The temporal lobe, including the adjacent parahippocampal area, is well known for its relation to memory[18,19]. The precentral regions and the middle frontal gyri are known from both fMRI and lesion studies to be involved in retrieving words that denote actions (i.e., verbs)[20].
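The block-design analysis described in Section 3.1 (a boxcar waveform convolved with a hemodynamic response function and fitted with a general linear model) can be sketched as follows. This is a simplified illustration, not SPM2's implementation: the double-gamma response shape, the block order within a session, and the synthetic voxel time series are all assumptions. The timing follows the text: TR = 3 s, 30 s blocks (10 scans each), and 110 scans per session once the 12 s of dummy phases are excluded.

```python
import math
import random

TR = 3.0               # seconds per scan (from the text)
SCANS_PER_BLOCK = 10   # 30 s block / 3 s TR
# One session after dropping dummy scans: 11 blocks. Starting and ending
# with baseline (6 baseline + 5 activation) is an assumed ordering.
BLOCKS = [0, 1] * 5 + [0]


def boxcar(blocks, scans_per_block):
    """Expand block labels (0 = baseline, 1 = activation) to one value per scan."""
    return [float(b) for b in blocks for _ in range(scans_per_block)]


def hrf(t):
    """Double-gamma response with a peak near 5 s and a late undershoot (assumed shape)."""
    if t <= 0:
        return 0.0
    peak = t ** 5 * math.exp(-t) / math.gamma(6)
    undershoot = t ** 15 * math.exp(-t) / math.gamma(16)
    return peak - undershoot / 6.0


def convolve(signal, kernel):
    """Causal discrete convolution, truncated to the length of the signal."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j in range(min(i + 1, len(kernel))):
            acc += signal[i - j] * kernel[j]
        out.append(acc)
    return out


def fit_glm(x, y):
    """Ordinary least squares for the model y = beta * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    return beta, mean_y - beta * mean_x


kernel = [hrf(k * TR) for k in range(11)]  # response sampled once per TR for 30 s
regressor = convolve(boxcar(BLOCKS, SCANS_PER_BLOCK), kernel)

# Synthetic voxel: true effect 2.0 on a baseline of 100.0 plus Gaussian noise.
rng = random.Random(0)
voxel = [100.0 + 2.0 * r + rng.gauss(0.0, 0.05) for r in regressor]
beta, intercept = fit_glm(regressor, voxel)
print(len(regressor), beta, intercept)
```

The estimated beta plays the role of the per-voxel contrast between activation and baseline. In a full analysis such estimates would be computed for every voxel and every subject and then carried to the group level.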
Fig. 1. Regular verb condition > control condition in 10 subjects (threshold at p < 0.001, clusters of at least 20 voxels)
Fig. 2. Irregular verb condition > control condition in 10 subjects (threshold at p < 0.001, clusters of at least 20 voxels)

Table 1. Regional brain activity

Brain region                BA            Side   Voxels   Z score   Talairach coordinates (x, y, z)
Regular > Control
Precentral Gyrus            BA 4          L      23       3.41      -38  -14   44
Paracentral Lobule          BA 5          L      79       3.58      -12  -34   52
Medial Frontal Gyrus        BA 6          L      89       4.6        -8  -14   54
Inferior Occipital Gyrus    BA 18         L      24       4.05      -40  -90   -4
Postcentral Gyrus           BA 3          L      112      3.86      -40  -28   60
Transverse Temporal         BA 42         L      23       3.68      -58  -16   12
Superior Temporal Gyrus     BA 42         L      31       3.98      -64  -30    8
Middle Temporal Gyrus       BA 21         L      35       3.5       -58   -2  -10
Insula                      BA 13         R      61       3.79       44   -4    2
Medial Frontal Gyrus        BA 6          R      113      3.84        8  -24   58
Parahippocampal Gyrus       BA 35         R      30       3.85       18  -32   -8
Lingual Gyrus               BA 18         R      49       3.52       14  -78   -4
Irregular > Control
Middle Temporal Gyrus       BA 22         L      91       3.78      -56  -32    2
Postcentral Gyrus           BA 5          L      24       3.46      -14  -42   64
Insula                      BA 13         L      308      4.82      -38  -14   14
Inferior Frontal Gyrus      BA 47         R      27       3.45       44   26    0
Insula                      BA 13         R      31       4.03       42  -20    0
Middle Frontal Gyrus        BA 11         R      20       3.53       22   26  -14
Insula                      BA 13         R      40       4.37       44   -2   -4
Insula                      BA 13         R      97       4.63       40   -8   22
Sub-Gyral                   Hippocampus   R      24       3.74       36  -30   -6
Superior Temporal Gyrus     BA 22         R      24       3.6        68  -36   20
Sub-Gyral                   BA 20         R      29       3.56       38  -10  -20
Middle Temporal Gyrus       BA 39         R      47       3.68       58  -60    8
Middle Temporal Gyrus       BA 21         R      134      3.97       58   28    6

All areas were significant to p < 0.001, uncorrected. L: left, R: right. Coordinates are given for the stereotactic space of Talairach & Tournoux (1998)[24].

4 General Discussions
In sum, the behavioral experiment shows that the frequency of the stem is more influential than that of the full word form in recognizing regular verb Eojeols, whereas the frequency of the full word is more influential in recognizing irregular verb Eojeols. The results indicate that regularly inflected verb Eojeols are represented separately in the form of the stem and the ending, while irregularly transformed verb Eojeols are represented as a full word form. It might therefore be assumed that morphological rules are needed to comprehend Korean regular verb Eojeols. However, the fMRI results for both conditions show that the activated regions are mostly in the temporal lobe, which is related to human memory and not to rules[20,21]. The activation patterns are not consistent with the results from English and German materials, in which regular verbs activate the inferior frontal lobe while irregular verbs activate the temporal lobe.
Taken together, the two experiments suggest that Korean inflected verb Eojeols are processed differently depending on the regularity of inflection. However, regularly and irregularly inflected verbs are both processed on the basis of meaning rather than rules. Korean is classified as an agglutinative language, while Indo-European languages like English or German are classified as inflectional languages. An agglutinative language has richer morphemes and more complex morpheme formation rules. Thus, the Korean language may use a meaning-based inflection system because, during inflection, it would be more efficient to find a fitting morpheme using a semantically ordered index than using rules.
Moreover, many studies using the LDT show that the left fusiform gyrus is associated with word form processing, and that BA 44 and BA 45 are involved in grapheme-to-phoneme conversion and explicit lexical search, respectively[22,23]. However, the present study found no activation in these areas, which suggests that grapheme-to-phoneme conversion and explicit lexical search do not occur strongly in the Korean LDT. This suggestion is supported by the fact that Experiment II showed activations in areas related to phonological information and meaning. As mentioned at the beginning, Korean regularly inflected verbs are made by attaching an ending that varies with its meaning, and irregularly inflected verbs are made by attaching an ending in the same way but with a slight transformation based on phonological features.
Thus the inflection of Korean verb Eojeols could be strongly based on meaning, which decides the form of the ending, and on phonological information, which decides whether the word will be transformed or not.
Acknowledgement
This work was supported by the KOSEF (R01-2006-000-10733-0).
References
1. Taft M. Interactive activation as a framework for understanding morphological processing. Language and Cognitive Processes 1994; 9:271~294.
2. Taft M, Forster KI. Lexical storage and retrieval of prefixed words. Journal of Verbal Learning and Verbal Behavior 1975; 14:638~647.
3. Henderson L, Wallis J, Knight D. Morphemic structure and lexical access. In Attention and performance X. Bouma H, Bouwhuis DG, (editors). London: Erlbaum; 1984. p. 211~226.
4. Butterworth B. Lexical representation. In Language Production Vol. 2: Development, Writing and Other Language Process. Butterworth B, (editor). London: Academic Press; 1983. p. 257~294.
5. McClelland JL, Rumelhart DE. An interactive activation model of context effects in letter perception. Part 1. An account of basic findings. Psychological Review 1981; 88:375~405.
6. Caramazza A, Laudanna A, Romani C. Lexical access and inflectional morphology. Cognition 1988; 28:297~332.
7. Baayen H, Schreuder R. War and peace: Morphemes and full forms in a Noninteractive Activation Parallel Dual-Route Model. Brain and Language 1999; 68:27~32.
8. Pinker S, Ullman MT. The past and future of the past tense. Trends in Cognitive Sciences 2002; 6(11):456~463.
9. Marslen-Wilson WD, Tyler LK. Rules, representation, and the English past tense. Trends in Cognitive Sciences 1998; 2:428~435.
10. McClelland JL, Patterson K. Rules or connections in past-tense inflections: what does the evidence rule out? Trends in Cognitive Sciences 2002; 6(11):465~472.
11. Dalal RH, Loeb DF. Imitative production of regular past tense -ed by English-speaking children with specific language impairment. International Journal of Language and Communication Disorders 2005; 40(1):67~82.
12. Dhond RP, Marinkovic K, Dale AM, Witzel T, Halgren E. Spatiotemporal maps of past-tense verb inflection. Neuroimage 2003; 19:91~100.
13. Gernsbacher MA. Resolving twenty years of inconsistent interactions between lexical familiarity and orthography, concreteness, and polysemy. Journal of Experimental Psychology: General 1984; 113:256~281.
14. Park T, Kim M, Park J. Subjective frequency estimates of Korean words and its effect on lexical decision. In 2003 Spring Conference of The Korean Society for Cognitive Science. 2003. p. 141~146.
15. Yoon HW, Cho K-D, Park HW. Brain Activation of Reading Korean Words and Recognizing Pictures by Korean Native Speakers: a Functional Magnetic Resonance Imaging Study. International Journal of Neuroscience 2005; 115:757~768.
16. Yoon HW, Cho K-D, Chung J-Y, Park HW. Neural mechanisms of Korean word reading: a functional magnetic resonance imaging study. Neuroscience Letters 2005; 373:206~211.
17. Kim M, Krick C, Reith W. The Korean Writing "Hangul" and its Cerebral Organization. An fMRI Study. Klinische Neurophysiologie 2004; 35(3).
18. Brown CM, Hagoort P. The Neurocognition of Language. Oxford University Press: Oxford, 1999.
19. Khader P, Burke M, Bien S, Ranganath C, Rosler F. Content-specific activation during associative long-term memory retrieval. Neuroimage 2005; 27:805~816.
20. Grabowski TJ, Damasio H, Damasio AR. Premotor and Prefrontal Correlates of Category-Related Lexical Retrieval. Neuroimage 1998; 7:232~243.
21. Pulvermüller F. Brain reflections of words and their meaning. Trends in Cognitive Sciences 2001; 5(12):517~524.
22. Cohen L, Dehaene S, Naccache L, et al. The visual word form area: spatial and temporal characterization of an initial stage of reading in normal subjects and posterior split-brain patients. Brain 2000; 123:291~307.
23. Heim S, Alter K, Ischebeck AK, et al. The role of the left Brodmann's areas 44 and 45 in reading words and pseudowords. Cognitive Brain Research 2005; 25:982~993.
24. Talairach J, Tournoux P. Co-planar stereotaxic atlas of the human brain. Thieme: New York, 1998.
Binding Mechanism of Frequency-Specific ITD and IID Information in Sound Localization of Barn Owl
Hidetaka Morisawa1, Daisuke Hirayama1, Kazuhisa Fujita1, Yoshiki Kashimori1,2, and Takeshi Kambara1,2
1 Department of Information Network Sciences, Graduate School of Information Systems, Univ. of Electro-Communications, Chofu, Tokyo, 182-8585 Japan [email protected]
2 Department of Applied Physics and Chemistry, Univ. of Electro-Communications, Chofu, Tokyo, 182-8585 Japan
Abstract. Barn owls perform sound localization based on analyses of interaural differences in arrival time and intensity of sound. Two kinds of neural signals representing the interaural time difference (ITD) and interaural intensity difference (IID) are processed in parallel in anatomically separate pathways. The values of ITD and IID change considerably depending on the frequency components of sound. The neural circuits involved in sound localization must bind the frequency-specific ITD and IID information to determine the spatial direction of a sound source. However, little is known about the binding mechanism. We present here a neural network model of ICc ls, in which the signals representing ITD and IID are first combined with each other. Using our model, we show how the neural maps can be generated in ICc ls by the excitatory inputs from ICc core representing ITD and the inhibitory inputs from VLVps representing IID. We also show that an ICx neuron detects the spatial direction of a sound source by making synaptic connections with ICc ls neurons encoding ITD and IID information of the sound source, based on Hebbian learning.
1 Introduction
Barn owls perform sound localization based on analyses of interaural differences in arrival time and intensity of sound. Two kinds of neural signals representing the interaural time difference (ITD) and the interaural intensity difference (IID) are processed in anatomically separate pathways that start from the cochlear nuclei in both ears. Both signals are processed in parallel along the neural pathways to the higher sensory brain modalities. ITD is mainly used for detecting the horizontal direction of a sound source and IID for the vertical direction. The neural map for detecting the spatial direction of a sound source is formed in the brain of barn owls, based on the interaural arrival time and intensity information[1].
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 255–262, 2006. © Springer-Verlag Berlin Heidelberg 2006
The sound localization of the barn owl with respect to horizontal direction has been observed to be remarkably accurate, based on analysis of ITD[1]. We have shown[2] that this hyperacuity for detection of ITD is accomplished through a combination of four kinds of functional factors: (1) the functional structure of the neural unit (CD unit) for coincidence detection; (2) projection of CD units randomly arranged in every array of the units in nucleus laminaris (NL) to ITD sensitive neurons arranged randomly and densely in a few single arrays in the core of the central nucleus of the inferior colliculus (ICc core); (3) convergence of outputs of all the arrays tuned to a single sound frequency in ICc core into ITD sensitive neurons arranged regularly in a single array in the lateral shell of the central nucleus of the inferior colliculus (ICc ls); and (4) integration of outputs of frequency-tuned arrays in ICc ls over all frequencies in the external nucleus of the inferior colliculus (ICx).
The sound localization of the owl with respect to vertical direction is based on analysis of IID. The neural information processing for detection of IID is carried out on a neural pathway parallel with the pathway of ITD detection before both signals arrive at ICc ls. The pathway for the IID detection is the angular nucleus in the cochlear nucleus → the nucleus ventralis lemnisci lateralis pars posterior (VLVp) → ICc ls (the first site of convergence of ITD and IID information) → ICx[1,3]. In order to clarify the neural mechanism of detection of IID, we presented a neural model of VLVp[4] in which the signals of sound intensities coming from both ears are combined and compared with each other. In the model, the information of IID is represented by the position of the firing zone edge in the chains of IID sensitive neurons in both right and left VLVp units. We have shown[4] that the mutual inhibitory coupling between both VLVp units[3,5] can induce the cooperative formation of a clear firing zone edge in both VLVp units, so that the firing zones in the two units do not overlap with each other but the firing zone gap becomes as narrow as possible.
The values of ITD and IID received by the eardrum depend on sound frequency. The frequency dependence comes from the fact that an auditory signal is generated by the shadowing and collecting effects of the head and external ears. The sound localization cues at different frequencies lead to the firing of ICc ls neurons at different locations. The optic tectum (OT) has been shown to have a complex map of IID and ITD depending on sound frequency[6], suggesting that the information of ITD and IID, depending on sound frequency, is bound in ICc ls. To perform sound localization well, ICx must detect a unique value of the spatial direction of the sound source by binding the information of ICc ls that depends on sound frequency. How does the ICx bind the frequency-specific information of ICc ls?
In the present paper, using the neural models of ICc core[2] and VLVp[4] for detection of ITD and IID, respectively, we investigated how the two kinds of signals coming from ICc core and VLVp are combined in ICc ls and how ICx binds the frequency-specific information of ICc ls to make the neural map representing the spatial direction of the sound source.
2 Model
2.1 Neural Model of ICc ls Binding ITD Information with IID One
We propose in the present paper a model (Fig. 1) showing how the map can be generated in ICc ls by the excitatory inputs from ICc core and the inhibitory inputs from the right and left VLVp units. The values of ITD and IID are represented along mutually perpendicular axes, as shown in Fig. 1. The main neurons in ICc ls are arranged in the form of a lattice. The neurons in each row receive excitatory inputs from an interneuron gathering the outputs of the relevant neurons in the ICc core chains, which is tuned to a single value of ITD, as shown in Fig. 1. The neurons in each column receive inhibitory inputs from an interneuron gathering bilaterally the outputs of the right and left VLVp units at the relevant position through excitatory synapses, as shown in Fig. 1. The membrane potential $V_s(i,j;t)$ and the output $U_s(i,j;t)$ of the ICc ls neuron at the (i, j) site of the lattice are determined by
$$ \frac{dV_s(i,j;t)}{dt} = \frac{1}{\tau_m}\Bigl(-\bigl(V_s(i,j;t) - V_{rest}\bigr) + w_{ITD}\,U_{ITD}(i;t) - w_{IID}\,U_{IID}(j;t)\Bigr), \qquad (1) $$
$$ U_s(i,j;t) = \frac{1}{1 + \exp[-(V_s(i,j;t) - V_{thr})/T]}, \qquad (2) $$
where $V_{rest}$ is the potential in the resting state, $\tau_m$ is the time constant of the membrane potential, $U_{ITD}(i;t)$ and $U_{IID}(j;t)$ are the outputs of the ith ITD and jth IID interneurons, respectively, $w_{ITD}$ and $w_{IID}$ are the strengths of the relevant synaptic connections, and $V_{thr}$ and $T$ are the threshold value of the membrane potential and the rate constant, respectively. The membrane potentials $V_{ITD}(i;t)$ and $V_{IID}(j;t)$ of the ith ITD and jth IID interneurons, respectively, are also obtained using the leaky integrator neuron model used in Eq. (1). The output of each interneuron is represented by 1 or 0, and the probability that $U_{ITD}(i;t)$ is 1 is given by
$$ \mathrm{Prob}\bigl(U_{ITD}(i;t) = 1\bigr) = \frac{1}{1 + \exp[-(V_{ITD}(i;t) - V_{thr})/T]}. \qquad (3) $$
$U_{IID}(j;t)$ is obtained from $V_{IID}(j;t)$ using the equivalent equation. Thus, the lattice array of main neurons in ICc ls functions as a map in which the value of ITD is represented along the column direction and the value of IID is represented along the row direction, as shown in Fig. 1. Under application of a binaural sound stimulus, only one neuron group in the chains within ICc core fires, where the ITD selectivity of the neuron group corresponds to the value of ITD of the stimulus. Therefore, only the neurons in the row of the ICc ls lattice corresponding to the ITD value receive the excitatory inputs through the relevant interneuron. On the other hand, each neuron in the row receives inhibitory inputs from almost all of the IID interneurons, as seen in Fig. 1. The interneuron receiving signals from VLVp neurons in the narrow gap, whose position corresponds to the value of IID of the stimulus, does not fire, as shown in Fig. 1. Therefore, the neurons in the column corresponding to the value of IID are not inhibited by the
Fig. 1. Schematic description of functional connections of the lattice of main neurons in ICc ls with the chains of ITD sensitive neurons within ICc core and with the chains of IID sensitive neurons within right and left VLVp units. Those connections are made through a single ITD interneuron for each value of ITD and through a single IID interneuron for each value of IID. Arrows in the ICc ls network denote the excitatory synapses and short bars denote the inhibitory synapses.
outputs of the R- and L-VLVp units. Thus, the neuron in the lattice that fires under application of a sound with a pair of definite values of ITD and IID can represent the value of ITD by its position along the column direction and the value of IID by its position along the row direction.
2.2 Neural Model of ICx for Integrating Information of ICc ls over Sound Frequencies
Figure 2 shows a neural network model for detecting the spatial direction of a sound source, which consists of ICc ls, ICx, and OT. The ICc ls network has multiple layers of the ICc ls model described in subsection 2.1. ICx makes a map of the location of a sound source around the barn owl, in which the value of azimuthal position is represented along the row direction and the value of elevational position is represented along the column direction, as shown in Fig. 2. ICx neurons are topographically connected with ICc ls neurons in each layer. The OT has the same map as that of ICx. It receives visual information and feeds its outputs back to ICx neurons. The membrane potential of the ICx neuron at the (i, j) site of the lattice is determined by
$$ \tau_V \frac{dV^{ICx}(i,j;t)}{dt} = -V^{ICx}(i,j;t) + \sum_{L}\sum_{k}\sum_{l} w(i,j,k,l,L;t)\,X(k,l,L;t) + I^{OT}, \qquad (4) $$
where $X(k,l,L;t)$ is the output of the ICc ls neuron at the (k, l) site in the Lth layer, and $w(i,j,k,l,L;t)$ is the weight of the synaptic connection between the ICx neuron at
Fig. 2. Neural network model for detecting the spatial direction of a sound source. The ICc ls has multiple layers of the two-dimensional networks shown in Fig. 1, each of which consists of neurons tuned to a specific frequency of sound. The ICx receives not only the feedforward signals from the ICc ls networks, but also the feedback signals from OT.
(i, j) site and the ICc ls neuron at the (k, l) site in the Lth layer. $X(k,l,L;t)$ is described by a sigmoidal function similar to Eq. (2). $I^{OT}$ is the input from OT, and $\tau_V$ is the time constant of $V^{ICx}(i,j;t)$. The synaptic weight $w(i,j,k,l,L;t)$ is given by
$$ \tau_w \frac{dw(i,j,k,l,L;t)}{dt} = -w(i,j,k,l,L;t) + \alpha\,X(i,j;t)\,X(k,l,L;t), \qquad (5) $$
where $X(i,j;t)$ is the output of the ICx neuron at the (i, j) site, and $\tau_w$ is the time constant. OT receives the visual information, and then only the one OT neuron representing the azimuthal and elevational position of the sound source fires. The output of the OT neuron is fed back to the ICx neuron to determine the spatial direction of the sound source. On the other hand, the multiple layers of ICc ls networks receive the ITD and IID information relevant to the frequency components of the sound. Each ICc ls network has its firing neuron located at a different position on the network, depending on the frequency component. Synaptic connections between ICx and the ICc ls networks are made based on the Hebbian learning given by Eq. (5). Thus, the information about the spatial direction of the sound source in ICx is consistently bound with the ITD and IID information encoded in the multiple layers of the ICc ls network.
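The behavior of the Hebbian rule in Eq. (5) can be illustrated with a minimal sketch. It integrates the equation for a single synapse by forward Euler, assuming constant pre- and postsynaptic outputs; the function name `hebbian_weight` and all parameter values (`tau_w`, `alpha`, `dt`, `steps`) are hypothetical choices for illustration and are not taken from the paper.

```python
def hebbian_weight(x_icx, x_ls, tau_w=50.0, alpha=1.0, dt=1.0, steps=500, w0=0.0):
    """Forward-Euler integration of Eq. (5) for one synapse:
    tau_w * dw/dt = -w + alpha * x_icx * x_ls.

    x_icx: output of the ICx neuron (held constant here, an assumption)
    x_ls:  output of the ICc ls neuron (held constant here, an assumption)
    """
    w = w0
    for _ in range(steps):
        w += dt / tau_w * (-w + alpha * x_icx * x_ls)
    return w

# ICx neuron driven by OT feedback, ICc ls neuron driven by ITD/IID input:
w_paired = hebbian_weight(1.0, 1.0)
# Same ICx neuron, but an ICc ls neuron that receives no ITD/IID input:
w_unpaired = hebbian_weight(1.0, 0.0)
print(w_paired, w_unpaired)
```

Under these assumptions the weight relaxes toward alpha times the product of the two outputs, so only the synapse whose ICc ls neuron is active together with the ICx neuron is strengthened.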
3 Results
3.1 Binding of ITD and IID Information in ICc ls
To investigate how ICc ls neurons bind values of IID represented by the VLVp pair with values of ITD represented by ICc core, we calculated the firing rates of ICc ls neurons using our neural models of the VLVp pair[4] and ICc core[2]. The result is shown in Fig. 3. Figure 3a shows the firing probability $U_{ITD}(i)$ (i = 1 ∼ 61) of ITD interneuron i, which receives the outputs of ICc core neurons tuned to ITD during a finite stimulation period. Figure 3b shows the firing probability $U_{IID}(j)$ (j = 1 ∼ 31) of IID interneuron j, which receives the outputs of IID sensitive neurons in both VLVp chains during the same period. Figures 3c and 3d show the firing probability of the ICc ls network for sound frequencies of 3 kHz and 6 kHz, respectively. The peak of firing frequency appears at the position (center) corresponding to ITD=0 and IID=0. However, the representation of ITD and IID in ICc ls is not unique, because there exist two other peaks. This problem is solved by making synaptic connections between the ICx neuron and ICc ls neurons tuned to different frequency components of sound.
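How a single lattice site is selected by Eqs. (1) and (2) can be sketched as follows. The example evaluates the steady state of Eq. (1) (dV/dt = 0) for binary interneuron outputs and passes it through the sigmoid of Eq. (2). The lattice size, the stimulus pattern, and all parameter values are illustrative assumptions; the sketch does not reproduce the additional peaks seen in Fig. 3, which arise from the ICc core and VLVp models that are not simulated here.

```python
import math

def sigmoid(v, v_thr, temp):
    """Logistic output of Eq. (2): 1 / (1 + exp(-(v - v_thr) / temp))."""
    return 1.0 / (1.0 + math.exp(-(v - v_thr) / temp))

def icc_ls_response(u_itd, u_iid, w_itd=1.0, w_iid=1.0,
                    v_rest=0.0, v_thr=0.5, temp=0.1):
    """Steady state of Eq. (1) passed through Eq. (2).

    u_itd[i]: output of the i-th ITD interneuron (excites row i)
    u_iid[j]: output of the j-th IID interneuron (inhibits column j)
    All parameter values are hypothetical.
    """
    return [[sigmoid(v_rest + w_itd * ui - w_iid * uj, v_thr, temp)
             for uj in u_iid] for ui in u_itd]

# Hypothetical stimulus: only ITD interneuron 2 fires; every IID
# interneuron fires except the one at the VLVp firing-zone gap (index 1).
u_itd = [0.0, 0.0, 1.0, 0.0, 0.0]
u_iid = [1.0, 0.0, 1.0, 1.0]
resp = icc_ls_response(u_itd, u_iid)
sites = [(i, j) for i in range(len(u_itd)) for j in range(len(u_iid))]
peak = max(sites, key=lambda ij: resp[ij[0]][ij[1]])
print(peak, resp[peak[0]][peak[1]])
```

Only the neuron that is both excited by its row interneuron and spared by its column interneuron reaches a high output, which is the binding of one ITD value with one IID value within a single frequency layer.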
Fig. 3. Response probability of ICc ls to a pair of ITD and IID inputs (ITD=0, IID=0). Firing probabilities of (a) ITD interneurons (i = 1 ∼ 61) (b) IID interneurons (j = 1 ∼ 31) (c),(d) Firing probabilities of ICc ls neurons for (c) frequency = 3kHz and (d) frequency = 6kHz.
3.2 Binding of Information of ICc ls for Various Sound Frequencies
Figure 4 shows the temporal variations of synaptic weights between the ICx neuron receiving the outputs of the OT neuron and the ICc ls neurons, in the case where the sound source was placed at the spatial position (azimuthal angle = 0°, elevational angle = 0°). The values of ITD and IID, which change depending on the frequency components of the sound, were determined by the relationship of sound localization cue values to the auditory space tuning of a single OT neuron[6]. The feedforward signals of ITD and IID information evoke the firing of ICc ls neurons located at different positions in each ICc ls layer, depending on the sound frequency. The ICx neuron fires due to the feedback signal from OT, which represents the information of the azimuthal and elevational position of the sound source. The information of the spatial direction of the sound source in ICx and the ITD and IID
Fig. 4. Temporal variations of synaptic weights between ICx and ICc ls neurons. The synaptic weights between ICx neuron receiving the signal from OT and ICc ls neuron receiving ITD and IID information are increased ((a) and (c)), but other synapses are not increased ((b) and (d)).
information of ICc ls for various frequencies are bound by Hebbian learning, as shown in Fig. 4. Fig. 4(a) shows the temporal variation of the synaptic weight between the ICx neuron receiving the feedback signal from OT and the ICc ls neuron receiving ITD and IID information in the ICc ls network tuned to w = 3 kHz. The synaptic weight increases as time proceeds. The synaptic weight of another neuron in the same ICc ls network does not change, as shown in Fig. 4(b), because that neuron does not receive ITD and IID information. Fig. 4(c) and (d) show the temporal variations of the synaptic weights of two ICc ls neurons tuned to w = 6 kHz. The synaptic changes exhibit results similar to those shown in Fig. 4(a) and (b). Thus, the ICc ls neuron in each layer is connected to the ICx neuron receiving the input from OT, to represent the spatial direction of sound in the ICx map.
4 Concluding Remarks
In order to clarify the neural mechanism by which frequency-specific information of ICc ls is bound in ICx so that the neural map of sound localization is generated in ICx, we presented a neural network model of ICc ls and a binding mechanism of frequency-specific information of ICc ls, based on Hebbian learning. The ICc ls neuron sensitive specifically to a pair of specific values of ITD and IID is produced by the excitatory inputs from ICc core encoding ITD and the inhibitory inputs from bilateral VLVps encoding IID. We also showed that an ICx neuron detects the spatial direction of a sound source by making synaptic connections with ICc ls neurons encoding ITD and IID information of the sound source, based on Hebbian learning.
References
1. Konishi, M.: Listening with two ears. Scientific American 268 (1993) 66–73
2. Inoue, S., Yoshizawa, T., Kashimori, Y., Kambara, T.: An essential role of random arrangement of coincidence detection neurons in hyper accurate sound location of owl. Neurocomputing 38-40 (2001) 675–682
3. Takahashi, T. T., Keller, C. H.: Commissural connections mediate inhibition for the computation of interaural level difference in the barn owl. J. Comp. Physiol. A 170 (1992) 161–169
4. Hirayama, D., Fujita, K., Inoue, S., Kashimori, Y., Kambara, T.: Functional role of bilateral mutual inhibitory interaction in encoding interaural intensity difference for sound localization. ICONIP 2005, Proceedings (2005) 307–311
5. Adolphs, R.: Bilateral inhibition generates neuronal responses tuned to interaural level differences in the auditory brainstem of the barn owl. J. Neurosci. 13 (1993) 3647–3668
6. Brainard, M. S., Knudsen, E., and Esterly, S. D.: Neural derivation of sound source location: Resolution of spatial ambiguities in binaural cues. J. Acoust. Soc. Am. 91 (1992) 1015–1027
Effect of Feedback Signals on Tuning Shifts of Subcortical Neurons in Echolocation of Bat
Seiichi Hirooka1, Kazuhisa Fujita1, and Yoshiki Kashimori1,2
1 Department of Information Network Sciences, Graduate School of Information Systems, Univ. of Electro-Communications, Chofu, Tokyo, 182-8585 Japan s1 [email protected]
2 Department of Applied Physics and Chemistry, Univ. of Electro-Communications, Chofu, Tokyo, 182-8585 Japan
Abstract. Most species of echolocating bats use the Doppler-shifted frequency of the ultrasonic echo pulse to measure the velocity of a target. The neural circuits for detecting the target velocity are specialized for fine-frequency analysis of the second harmonic constant frequency (CF2) component of Doppler-shifted echoes. To perform the fine-frequency analysis, the feedback signals from cortex to subcortical and peripheral areas are needed. The feedback signals are known to modulate the tuning property of subcortical neurons. However, the neural mechanism for the modulation of the tuning property is not yet clear. We present here a neural model for detecting the Doppler-shifted frequency of echo sound reflected from a target. We show that the model qualitatively reproduces the experimental results on the modulation of tuning shifts of subcortical neurons. We also clarify the neural mechanism by which the tuning property is changed depending on the feedback connections between cortical and subcortical neurons.
1 Introduction
Animals usually receive complex sensory signals from the external world. To perform sensory perception, they must actively select the sensory information relevant to their behavior. To extract such information from complex signals, the feedback signals from cortex to subcortical and peripheral regions are needed. However, it is not yet clear how the feedback signals contribute to the selection of sensory information. To address this issue, we study echolocation of mustached bats, because the functions of the cortical areas have been well characterized[5], and the physiological properties of neuronal activities modulated by the feedback signals have been actively investigated[2][4][7][8].
Mustached bats emit ultrasonic pulses and listen to returning echoes for orientation and for hunting flying insects. The bats analyze the correlation between the emitted pulses and their echoes and extract detailed information about flying insects based on the analysis. This behavior is called echolocation. The neuronal circuits underlying echolocation detect the velocity of a target with an accuracy of 1 cm/sec and the distance of a target with an accuracy of 1 mm. To extract the various kinds of information about flying insects, mustached bats emit a complex biosonar signal that consists of a long constant-frequency (CF) component followed by a short frequency-modulated (FM) component. Each pulse contains four harmonics and therefore eight components, represented by (CF1, CF2, CF3, CF4, and FM1, FM2, FM3, FM4), as shown in Fig. 1[3]. The information of target distance and velocity is processed separately along different pathways in the brain by using the four FM components and the four CF components, respectively[5].
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 263–271, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. Schematized sonagrams of mustached bat biosonar pulses. The four harmonics of the pulses each contain a long CF component (CF1-4) followed by a short FM component (FM1-4).
In natural situations, large natural objects in the environment, like bushes or trees, produce complex stochastic echoes, which can be characterized by the echo roughness. The echo signal reflected from a target insect is embedded in this complex signal. Even in such an environment, bats can accurately detect detailed information about a flying insect. To extract the information about insects, the feedback signals from cortex to subcortical areas are needed. To investigate the role of feedback signals in extracting the information about an insect, we study the neural pathway for detecting the velocity of a target, which consists of cochlea, inferior colliculus (IC), and Doppler-shifted constant frequency (DSCF) area. The cochlea is remarkably specialized for fine-frequency analysis of the second harmonic CF component (CF2) of Doppler-shifted echoes. The information about echo CF2 (ECF2) is transmitted to IC, and the relative velocity of the target insect is detected in the DSCF area by analyzing the Doppler-shifted frequency. Xiao and Suga[8] have shown an intriguing property of feedback signals: the electric stimulation of DSCF neurons evokes best frequency (BF) shifts of IC neurons away from the BF of the stimulated DSCF neuron (centrifugal BF shift), and bicuculline (an antagonist of inhibitory GABA receptors) applied to the stimulation site changes the centrifugal BF shifts into BF shifts towards the BF of the stimulated DSCF neurons (centripetal BF shift). Although these BF shifts are generated by the feedback signals from DSCF neurons to IC neurons, it is not yet clear how the feedback signals determine the direction of the BF shift.
In the present study, we propose a neural network model for detecting the Doppler-shifted frequency of sound echoes. Using the model, we show the neural mechanism by which the centripetal and centrifugal BF shifts are elicited.
2 Model
The network model for detecting the Doppler-shifted frequency consists of cochlea, inferior colliculus (IC), and Doppler-shifted constant frequency (DSCF) area, each of which is a linear array of frequency-tuned neurons, as shown in Fig. 2. The neurons in the three layers are tuned to specific echo frequencies ranging from 60 to 63 kHz, which corresponds to the frequency range of the second harmonic CF2. The bat uses the Doppler-shifted frequency of the echo sound ECF2 to detect the relative velocity of the target. IC and DSCF neurons are both tuned to specific frequencies of ECF2, but have different functions in processing the information of ECF2. The IC neurons control the gain of neurons by means of the feedback signals from DSCF neurons, and thereby modulate the contrast of the frequency distribution of the echo sound. On the other hand, the DSCF neurons are more sharply tuned to specific frequencies of ECF2, leading to accurate detection of target velocity.
2.1 Model of Cochlea
The function of the cochlea is to decompose the echo sound into the intensities of its frequency components. The cochlea neurons are tuned to specific frequencies. Because of the broad tuning of the cochlea, the firing rate of the ith cochlea neuron is described by
$$ I_i = I_0\, e^{-(i - i_0)^2/\sigma_1^2}, \qquad (1) $$
where $i_0$ is the location of the cochlea neuron that maximally responds to the echo sound ECF2, and $I_0$ is the maximum firing rate.
2.2 Model of IC
As shown in Fig. 2, the IC neurons are tonotopically connected with cochlea neurons, and receive the outputs of cochlea neurons. They also receive the feedback signals from DSCF neurons.
The membrane potential of the ith IC neuron, $V_i^{IC}$, is determined by
$$ \frac{dV_i^{IC}}{dt} = -\frac{1}{\tau_{IC}} V_i^{IC} + \sum_{j=1}^{N} w_{ij}^{FF} I_j + \sum_{k=1}^{N} w_{ik}^{FB} X_k^{DSCF}(t), \qquad (2) $$
$$ w_{ij}^{FF} = w_0\, e^{-(j-i)^2/\sigma^2}, \qquad (3) $$
$$ w_{ik}^{FB} = a\, e^{-(k-i)^2/\sigma_a^2} - b\, e^{-(k-i)^2/\sigma_b^2} \quad (a, b > 0), \qquad (4) $$
$$ X_k^{DSCF}(t) = \sum_{m=0}^{M} \frac{t - t_{mk}}{\tau_s}\, e^{-(t - t_{mk})/\tau_s}, \qquad (5) $$
where $\tau_{IC}$ is the time constant of $V_i^{IC}$, $w_{ij}^{FF}$ is the weight of the synaptic connection from the jth cochlea neuron to the ith IC neuron, and $w_{ik}^{FB}$ is the synaptic weight of the feedback connection from the kth DSCF neuron to the ith IC neuron. An action potential occurs when the voltage reaches threshold; immediately afterwards, the voltage is reset to zero. The IC neuron integrates the outputs of cochlea neurons with the ON center receptive field described in Eq. (3), and receives the feedback signals from DSCF neurons with the ON center-OFF surrounding connections given by Eq. (4). $X_k^{DSCF}(t)$ is the postsynaptic potential (PSP) evoked by spikes of the kth DSCF neuron, which is described by a sum of α-functions. $t_{mk}$ is the firing time of the mth spike of the kth DSCF neuron, and $M$ is the number of spikes generated by the kth DSCF neuron until the latest firing time.
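The α-function sum in Eq. (5) can be made concrete with a short sketch. It evaluates the postsynaptic potential produced by a list of spike times; the function name `alpha_psp`, the value of `tau_s`, and the sampling step are hypothetical and chosen only for illustration.

```python
import math

def alpha_psp(t, spike_times, tau_s=2.0):
    """Sum of alpha functions as in Eq. (5):
    sum over spikes of (s / tau_s) * exp(-s / tau_s), with s = t - t_m."""
    total = 0.0
    for t_m in spike_times:
        s = t - t_m
        if s >= 0.0:  # only spikes that have already occurred contribute
            total += (s / tau_s) * math.exp(-s / tau_s)
    return total

# PSP of a single spike at t = 0, sampled every 0.1 time units.
single = [alpha_psp(n * 0.1, [0.0]) for n in range(200)]
t_peak = 0.1 * max(range(200), key=lambda n: single[n])

# Two spikes close together summate.
pair_value = alpha_psp(3.0, [0.0, 1.0])
print(t_peak, max(single), pair_value)
```

Each spike contributes a PSP that rises, peaks one synaptic time constant after the spike with height 1/e, and then decays, so closely spaced DSCF spikes produce a larger feedback signal at the IC neuron.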
2.3 Model of DSCF
The network model of DSCF is constructed with a linear array of frequency-tuned neurons, each of which is tonotopically connected with IC neurons, as shown in Fig. 2. The DSCF neuron feeds its outputs back to IC neurons with the ON center-OFF surrounding connections given by Eq. (4). The DSCF neurons mutually inhibit each other. The function of the DSCF neurons is to decode the Doppler-shifted frequency of ECF2 on the frequency map of the DSCF area in order to detect the relative velocity of the target[1][6]. The model of the DSCF neuron is based on the leaky integrate-and-fire neuron model. The membrane potential of the ith DSCF neuron, $V_i^{DSCF}$, is determined by
$$ \tau_{DSCF} \frac{dV_i^{DSCF}}{dt} = -V_i^{DSCF} + \alpha X_i^{IC}(t) + \sum_{k=1}^{N} w_{ik}^{DSCF} X_k^{DSCF}(t), \qquad (6) $$
where $\tau_{DSCF}$ is the time constant of $V_i^{DSCF}$. $X_i^{IC}(t)$ and $X_k^{DSCF}(t)$ are the output of the ith IC neuron and that of the kth DSCF neuron, respectively, which are given by sums of α-functions similar to that given by Eq. (5). $w_{ik}^{DSCF}$ is the weight of the inhibitory synapse from the kth DSCF neuron to the ith one.
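The leaky integrate-and-fire dynamics underlying Eq. (6) can be illustrated with a minimal sketch. It assumes a constant drive standing in for the right-hand input terms of Eq. (6), forward-Euler integration, and a reset to zero at threshold (the rule stated for the IC neurons, applied here to the DSCF neuron as an assumption). The function name and all parameter values are hypothetical.

```python
def lif_spike_times(drive, tau=10.0, v_thr=1.0, dt=0.1, t_end=100.0):
    """Forward-Euler leaky integrate-and-fire neuron:
    tau * dV/dt = -V + drive, with V reset to 0 when it reaches v_thr.

    drive: constant input standing in for the synaptic terms of Eq. (6)
    Returns the list of spike times.
    """
    v = 0.0
    spikes = []
    n_steps = int(round(t_end / dt))
    for n in range(n_steps):
        v += dt / tau * (-v + drive)
        if v >= v_thr:
            spikes.append((n + 1) * dt)
            v = 0.0  # reset after the action potential
    return spikes

weak = lif_spike_times(0.8)    # drive stays below threshold
strong = lif_spike_times(2.0)  # drive exceeds threshold
print(len(weak), len(strong))
```

With a drive below threshold the membrane potential saturates without firing, whereas a suprathreshold drive yields regular spiking. In the full model, inhibitory input from neighboring DSCF neurons would lower the effective drive and thereby sharpen the tuning.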
Fig. 2. Neural model for detecting the Doppler-shifted frequency of echo sound. The excitatory synaptic connections are denoted by solid lines and the inhibitory synaptic connections are denoted by dotted lines.
3 Results
3.1 BF Shifts of IC Neurons Caused by Feedback Signals
Figure 3a shows the raster plot of spikes of IC neurons in the case where tone stimulus with echo frequency of 60.6 kHz is delivered and electric stimulus (ES) is applied to DSCF neuron tuned to 60.9 kHz. In Fig.3a, the ES results in the shift of firing zone away from the IC neuron tuned to the frequency of the stimulated DSCF neuron, indicating that BF of IC neurons shift away from the BF of the stimulated DSCF neuron, or centrifugal shift. Figure 3b shows the change in the tuning property of IC neurons. Before the ES, the IC neurons maximally respond to 60.6 kHz. When DSCF neuron tuned to 60.9 kHz was electrically stimulated, the BF of IC neuron was shifted from 60.6 kHz to 60.5 kHz. That is, the IC neurons showed a centrifugal shift. Figure 4 shows the tuning properties of IC neurons when the antagonist of GABA, bicuculline, was applied to the DSCF neurons before the ES. The inhibition of GABA leads to the BF shift of the IC neuron towards the BF of the stimulated DSCF neuron. The BF of IC neurons shifts from 60.6 kHz to 60.8 kHz. That is, the IC neurons showed a centripetal BF shift. The results reproduce qualitatively the experimental results of BF shifts[8].
Fig. 3. (a)Raster plot of spikes of IC neurons. Electric stimulation(ES) is applied to DSCF neuron at t = 50 msec. The dotted lines indicate the center of firing patterns. (b)Tuning properties of IC neurons. They were calculated before ES (upper), and after ES (lower).
Fig. 4. Tuning properties of IC neurons. They were calculated before ES (upper) and after ES+bicuculline injection (lower).
3.2 Neural Mechanism of BF Shifts
We consider here how the centrifugal and centripetal BF shifts of IC neurons are evoked by the feedback signals from DSCF neurons. Figure 5a shows the tuning curves of two adjacent IC neurons, which are tuned to the BFs f1 and f2, respectively. After electrical stimulation of the DSCF neuron, the tuning curve of the IC neuron tuned to the frequency f1 is shifted upward, as shown by the dashed line in Fig. 5a, because the IC neuron tuned to f1 receives the inhibitory feedback signals from the stimulated DSCF neuron. As a result, the IC neuron tuned to f2 can fire at a lower input current in response to an echo stimulus with frequency f1, as shown by the point A in Fig. 5a. Thus, the BF of the IC neuron tuned to f1 apparently shifts to f2, which is a BF shift away from the frequency fES of the stimulated DSCF neuron, or centrifugal shift. Figure 5b shows the shift of the tuning curves of two adjacent IC neurons in the case where bicuculline is focally injected into the DSCF neurons. The tuning curve of the IC neuron tuned to f3 is shifted downward, because bicuculline suppresses the inhibitory feedback signals and leaves only the excitatory signal given by the first term of Eq. (4). Hence, the IC neuron tuned to f3 has a lower threshold for firing in response to an echo stimulus with frequency f1, as shown by the point B in Fig. 5b. The BF of the IC neuron tuned to f1 apparently changes to f3, which is a shift towards the frequency fES of the stimulated DSCF neuron, or centripetal shift.
Fig. 5. Schematized tuning curves of IC neurons and synaptic weights of the feedback connection from the kth DSCF neuron to the ith IC neurons, under (a) the application of electric stimulation (ES) and (b) of ES+bicuculline. The horizontal axis represents the location of the IC neuron in the linear array, which also represents the tuning frequency of IC neurons.
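The direction of the shift described above can be illustrated with a rate-based caricature, which is not the spiking model of Section 2. The sketch adds the feedback kernel of Eq. (4), centered on a stimulated DSCF neuron, to a Gaussian feedforward drive of the form of Eq. (1), and reports which IC neuron receives the largest net drive. The array size, the positions `i0` and `k_es`, and the values of `a`, `b`, and the three widths are all hypothetical; the sizes of the resulting shifts depend entirely on these choices.

```python
import math

def gaussian(d, sigma):
    """exp(-d^2 / sigma^2), the Gaussian form used in Eqs. (1), (3), (4)."""
    return math.exp(-(d * d) / (sigma * sigma))

def feedback_weight(i, k, a, b, sigma_a=1.0, sigma_b=6.0):
    """Eq. (4): ON-center (a) minus OFF-surround (b) weight
    from DSCF neuron k to IC neuron i."""
    return a * gaussian(k - i, sigma_a) - b * gaussian(k - i, sigma_b)

def best_index(i0, k_es, a, b, n=41, sigma_1=3.0):
    """Index of the IC neuron with the largest net drive:
    Gaussian echo input centered at i0 plus feedback from DSCF neuron k_es."""
    drive = [gaussian(i - i0, sigma_1) + feedback_weight(i, k_es, a, b)
             for i in range(n)]
    return max(range(n), key=lambda i: drive[i])

i0, k_es = 20, 25  # echo best position and stimulated DSCF position (assumed)
control = best_index(i0, k_es, a=0.0, b=0.0)      # no feedback
es = best_index(i0, k_es, a=1.2, b=1.0)           # ES: center plus surround
bicuculline = best_index(i0, k_es, a=1.2, b=0.0)  # surround inhibition removed
print(control, es, bicuculline)
```

In this sketch the inhibitory surround lowers the drive more strongly on the side facing the stimulated neuron, so the maximum moves away from it; removing the inhibitory term leaves only the excitatory center, and the maximum moves towards it.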
4 Concluding Remarks
We have presented here the neural model for detecting Doppler-shifted frequency of echo sound reflecting from flying insect. We showed that the model well reproduces the two kinds of BF shifts of IC neurons, or centripetal and centrifugal BF shifts. We also clarified the neural mechanism by which the direction of BF shift is determined by the feedback signals from DSCF neurons.
References
1. Liu, W., Suga, N.: Binaural and commissural organization of the primary auditory cortex of the mustached bat. J. Comp. Physiol. A 181 (1997) 599–605
2. Ma, X., Suga, N.: Long-term cortical plasticity evoked by electric stimulation and acetylcholine applied to the auditory cortex. PNAS 102 (2005) 9335–9340
3. O'Neill, W. E., Suga, N.: Encoding of target range and its representation in the auditory cortex of the mustached bat. J. Neurosci. 2 (1982) 17–31
4. Sakai, M., Suga, N.: Plasticity of the cochleotopic (frequency) map in specialized and nonspecialized auditory cortices. PNAS 98 (2001) 3507–3512
5. Suga, N.: Cortical computational maps for auditory imaging. Neural Networks 3 (1990) 3–21
6. Suga, N., Manabe, T.: Neural basis of amplitude-spectrum representation in auditory cortex of the mustached bat. J. Neurophysiol. 47 (1982) 225–255
7. Suga, N., Ma, X.: Multiparametric corticofugal modulation and plasticity in the auditory system. Nature Reviews Neuroscience 4 (2003) 783–794
8. Xiao, Z., Suga, N.: Reorganization of the cochleotopic map in the bat's auditory system by inhibition. PNAS 99 (2002) 15743–15748
A Novel Speech Processing Algorithm for Cochlear Implant Based on Selective Fundamental Frequency Control
Tian Guan1, Qin Gong1, and Datian Ye2
1 Department of Biomedical Engineering, Tsinghua Univ., Beijing 100084, P.R. China
2 Research Center of Biomedical Engineering, Graduate School at Shenzhen, Tsinghua Univ., Shenzhen 518055, P.R. China
[email protected]
Abstract. The speech recognition ability of most cochlear implant (CI) users in noisy environments remains quite poor, especially for those who speak a tonal language such as Mandarin. Based on the results of acoustic research on Mandarin, a novel algorithm using the fundamental frequency (F0) is proposed, which adds a principle of frequency band selection. Under this principle, F0 is encoded only in the selected frequency bands, and its effectiveness was confirmed in acoustic simulation experiments with white noise or mixed speech backgrounds. Compared with the traditional algorithm, which transmits only envelope cues, this novel strategy achieved remarkable improvement for Mandarin vowels, words, sentences and mixed sentences alike. Moreover, the amount of transmitted information in this algorithm decreased to 62.5% of that of similar algorithms which transmit the fundamental frequency in all channels.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 272 – 279, 2006. © Springer-Verlag Berlin Heidelberg 2006
1 Introduction
The cochlear implant is the available medical device for restoring hearing to totally deaf people. Although modern multi-channel devices produce speech recognition scores around 75% for sentences in quiet, the ability of most implant users to understand speech in noisy environments remains quite poor [1][2], especially for those who speak a tonal language such as Mandarin [3][4], in which pitch variations convey lexically different meanings. Modern multi-channel devices mostly extract the temporal envelope but discard spectral information such as pitch cues. Although the evidence shows that the availability of pitch cues has little effect on vowel and consonant recognition in English [5], there is no doubt that for a tonal language such as Mandarin, pitch information makes a significant contribution to speech perception. The same consonant-vowel syllable may have totally different meanings depending on the fundamental frequency variation, which may be a flat, rising, falling-rising, or falling tone.
Many investigators have focused on building novel speech processing algorithms which convey not only the temporal envelope cues but also the tonal
information. Lan et al. [6] developed a novel algorithm that extracts and encodes both the envelope and the fundamental frequency (F0) of the speech signal. F0 was used to modulate the center frequency of the sinusoidal waves in every channel in acoustic simulation experiments. This algorithm produced a significant improvement in Mandarin speech recognition. However, based on three findings from phonetics research, we assumed that transmitting this kind of information in every channel is redundant, and that a more compact algorithm could be obtained by reducing the redundancy in conveying spectral information.
First, the pathways that convey the tonal information of Mandarin are redundant. Both the temporal envelope and the pitch information of speech signals contribute to the recognition of the four Mandarin tones [4]. Many investigations found that temporal envelope information such as vowel duration and amplitude contours contributes to Mandarin tone recognition [7][8]. This contribution, while significant, is relatively weak when the spectral pitch information evoked by the fundamental frequency and its harmonics is present [9]. So there are multiple pathways conveying the tonal information, and perfect tone recognition scores can be obtained even if some of them are removed.
Second, perfect tone recognition can be achieved by extracting and encoding only the temporal and spectral information in the low frequency range. Previous work found that perfect tone recognition could be achieved either directly, by preserving the fundamental frequency with low-pass filtering at 300 Hz, or indirectly, by the residual pitch derived from the harmonic structure, which also lies in the low frequency range [10]. Therefore, conveying the temporal and tonal information in the low frequency bands may be enough to obtain perfect speech recognition.
Last, the tonal information encoded by traditional algorithms in high frequency bands can hardly be perceived. In [6], the tonal information (F0) was used to modulate the central frequency of the sinusoidal waves in the acoustic simulation model. In the high frequency bands, the range over which the tonal information varies is therefore relatively negligible with respect to the central frequency of the corresponding sinusoidal waves. For example, the ratios of F0 to the central frequency of the sinusoidal wave of an 8-channel cochlear implant, from the lowest to the highest frequency band, were 47.4%, 28.4%, 17.5%, 11.1%, 7%, 4.5%, 3% and 1.9%.
Based on these findings, it is assumed that perfect speech recognition can be achieved when we encode the temporal envelope and tonal information in the lower frequency bands but encode only the temporal envelope in the higher frequency bands. So we propose to encode the fundamental frequency more effectively in certain frequency bands selected by the principle of frequency band selection. This novel algorithm is called the selective fundamental frequency control (SFFC) algorithm.
In the following section, we first present the structure of SFFC in more detail. The algorithm’s accuracy and efficiency are then verified using its acoustic simulation model in Section 3, and the results are evaluated in Section 4.
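The shrinking ratio of F0 to the carrier frequency across bands can be illustrated with a minimal sketch. It assumes band edges equally spaced in cochlear position according to Greenwood's human frequency-position function between 100 Hz and 6000 Hz, carriers at the geometric centre of each band, and an illustrative F0 of 100 Hz. The function names, the choice of band centre and the F0 value are assumptions made here, so the printed percentages in the text are not expected to be reproduced exactly; only the monotonic decrease is.

```python
import math

# Greenwood (1990) human frequency-position map: f = A * (10**(a * x) - K),
# where x is the fractional distance from the apex (0 to 1).
A, a, K = 165.4, 2.1, 0.88


def position_to_hz(x):
    return A * (10 ** (a * x) - K)


def hz_to_position(f_hz):
    return math.log10(f_hz / A + K) / a


def band_edges(f_low, f_high, n_bands):
    """Edges of n_bands bands equally spaced in cochlear position."""
    x_low, x_high = hz_to_position(f_low), hz_to_position(f_high)
    step = (x_high - x_low) / n_bands
    return [position_to_hz(x_low + i * step) for i in range(n_bands + 1)]


def f0_ratios(f0_hz, edges):
    """Ratio of F0 to each band's carrier (assumed: geometric band centre)."""
    centres = [math.sqrt(lo * hi) for lo, hi in zip(edges, edges[1:])]
    return [f0_hz / c for c in centres]


edges = band_edges(100.0, 6000.0, 8)
ratios = f0_ratios(100.0, edges)  # F0 = 100 Hz is an illustrative value
```

Printing `ratios` shows the same qualitative trend as the percentages quoted above: the ratio falls steadily from the lowest to the highest band.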
2 Algorithm
The SFFC algorithm extracts and encodes the fundamental frequency of the speech signal. This algorithm has two signal pathways, including the traditional
envelope extraction, as in the CIS algorithm [11], and additional fundamental frequency processing. In one signal pathway, which is similar to the standard CIS algorithm, the speech signal is pre-processed and then divided into frequency bands, from which the envelopes are extracted. In the other signal pathway, the fundamental frequency is extracted and used to modulate the central frequency of the sine signals that re-synthesize the speech. Figure 1 shows a basic block diagram of the two pathways.
Fig. 1. Functional block diagram of the proposed speech processing algorithm
2.1 Pre-process and Envelope Extraction in One Pathway
The first pre-processing stage balanced the spectral amplitudes by attenuating the lower frequencies with a 1st-order Butterworth high-pass filter with a cutoff frequency of 1200 Hz. The signals were then divided into data frames by a Hamming window and split into 8 frequency bands ranging from 100 Hz to 6000 Hz. The corner frequencies of the analysis bands were determined according to Greenwood’s formula; all analysis filters were 8th-order Butterworth band-pass filters. The temporal envelope of each analysis band was extracted by full-wave rectification and low-pass filtering (4th-order Butterworth low-pass filter at 50 Hz), and was used to modulate the amplitude of the sine signals used to re-synthesize the speech signal in the acoustic simulation model.
2.2 Fundamental Frequency Extraction in the Other Pathway
In the case of Chinese vowels, pitch may be assumed to be equal to the fundamental frequency (F0) of the tone. F0 can be extracted by either time-domain or frequency-domain techniques. In this article, the lifting scheme [12] is adopted, combined with a pitch extraction algorithm, to extract F0 of the speech signal. The lifting scheme does not rely on the Fourier transform. In the analysis stage, the signal $S_j$ is decomposed by the lifting scheme into a low-frequency component $S_{j-1}$ and a high-frequency component $d_{j-1}$. The band of $S_{j-1}$ extends from zero to half of the signal’s highest frequency, and $d_{j-1}$ covers the remaining band. If we analyze the signal repeatedly, the band of the signal will be cut down to
the range of the voice characteristics (50-500 Hz), which provides a good basis for extracting the fundamental frequency later. Subsequently, the fundamental frequency of the signal after the third lifting step (0-700 Hz) is extracted by means of the short-time Average Magnitude Difference Function (AMDF). In this way we obtain smooth fundamental frequency contours of the signals after simple post-processing.
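The AMDF step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it omits the lifting-scheme band limiting and the post-processing, the 50-500 Hz search range is taken from the text, and the valley-picking rule (the `depth` parameter and the walk to the valley bottom) is an assumption added here to avoid choosing a multiple of the true period.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference between the frame and its lagged copy."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def estimate_f0(frame, fs, f_min=50.0, f_max=500.0, depth=0.2):
    """Return F0 in Hz from the first deep AMDF valley in the lag range."""
    lags = list(range(int(fs / f_max), min(int(fs / f_min), len(frame) - 1) + 1))
    values = [amdf(frame, k) for k in lags]
    low, high = min(values), max(values)
    threshold = low + depth * (high - low)
    # First lag whose AMDF is close to the global minimum (assumed rule):
    # this avoids picking a multiple of the true period (octave error).
    j = next(i for i, v in enumerate(values) if v <= threshold)
    # Walk down to the bottom of that valley.
    while j + 1 < len(values) and values[j + 1] < values[j]:
        j += 1
    return fs / lags[j]


# Usage on a synthetic 200 Hz tone sampled at 8 kHz (illustrative values).
fs = 8000
frame = [math.sin(2 * math.pi * 200 * i / fs) for i in range(800)]
f0 = estimate_f0(frame, fs)
```

Applying `estimate_f0` to successive frames of a recording would give an F0 contour; real speech would normally also need a voiced/unvoiced decision, which is not shown.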
Fig. 2. Fundamental frequency contours of Mandarin. (a) Original speech waveform of two Mandarin words by male with two different tones ( “ ” and “/”), (b) Corresponding fundamental frequency contours extracted by methods mentioned above.
Because of its advantages (such as speed, parallelism, high efficiency and in-place implementation) [12], the lifting scheme offers bright prospects for pitch extraction in speech processing strategies for cochlear implants based on the characteristics of the Chinese language.
2.3 The Control Principle of Frequency Bands Selection
The SFFC algorithm encodes the temporal envelope and tonal information in the lower frequency bands but encodes only the temporal envelope in the higher frequency bands. This gives the control principle of frequency band selection, in which the tonal information is encoded as frequency modulation in the low frequency part (near the apex of the cochlea) while it is not used in the high frequency part (near the base). The number of frequency bands, counted from the apex, to which the tonal information is added (defined as the parameter S) is determined by the results of the acoustic simulation experiment.
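The band selection principle can be sketched for the acoustic simulation as below. The sketch assumes that the envelopes and the F0 track are already available as per-sample lists, and that F0 modulates a carrier by being added to its centre frequency. The text does not give the exact modulation law, so this additive form, like the function and parameter names, is an assumption made for illustration.

```python
import math


def synthesize_band(envelope, f0_track, centre_hz, fs, use_f0):
    """Sine carrier for one band: amplitude follows the envelope, and the
    frequency is the band centre, shifted by F0 only if the band is selected."""
    phase, out = 0.0, []
    for env, f0 in zip(envelope, f0_track):
        freq = centre_hz + f0 if use_f0 else centre_hz  # assumed modulation law
        phase += 2.0 * math.pi * freq / fs
        out.append(env * math.sin(phase))
    return out


def sffc_synthesize(envelopes, f0_track, centres_hz, fs, s):
    """Sum all bands; only the s lowest-frequency bands carry F0."""
    order = sorted(range(len(centres_hz)), key=lambda b: centres_hz[b])
    selected = set(order[:s])
    mix = [0.0] * len(f0_track)
    for b, (env, centre) in enumerate(zip(envelopes, centres_hz)):
        band = synthesize_band(env, f0_track, centre, fs, b in selected)
        for t, value in enumerate(band):
            mix[t] += value
    return mix
```

With `s = 0` every band uses a fixed carrier, which corresponds to an envelope-only simulation, while `s` equal to the number of bands places F0 in every channel; intermediate values give the selective behaviour described above.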
3 Acoustic Simulation Experiment
We verified the SFFC speech processing algorithm in an acoustic simulation experiment, in particular under white noise and mixed speech conditions. 24 young native Mandarin speakers who reported normal hearing participated in this
experiment. All tests were carried out in a quiet laboratory. The simulation sounds were presented to the listener through Sennheiser HD457 headphones. The acoustic simulation experiment included four parts: vowels, words, sentences and mixed sentences experiments.
The vowels experiment (closed-set) consisted of 70 questions and adopted the testing forms of the minimal auditory capabilities in Chinese (MACC) [13]. In order to reveal the characteristics of Chinese vowels and to comply with the phonetic rules of Chinese, the words selected from the word list were chosen to keep the phonetic balance of vowels as much as possible. Each question offered four choices with identical consonants and pitches but different vowels. One of them was the pronounced word and the other three were distractor words.
In the words experiment (open-set), 70 questions were presented to the subjects. Phonetic balance was also considered in the selection of words. No choices were provided. The subjects were instructed to write the Chinese characters or pinyin they heard on the answer sheets and were encouraged to guess when they were not sure of the answer. The word recognition rate was the ratio of the number of correctly written words to the total number of words.
In the sentences experiment (open-set), listeners were presented with 70 questions. Each sentence contained between 2 and 10 keywords (5.4 keywords on average). Sentences in each group were similar in length, arranged randomly and narrated smoothly. The sentence recognition rate was the ratio of the number of correctly recorded keywords to the total number of keywords.
Finally, in the mixed speech experiment (open-set), listeners were presented with 70 questions. The mixed speech overlapped a male and a female voice.
The mixed speech was composed of short Mandarin sentences, each containing between 2 and 10 keywords (5.4 keywords on average). The subjects were instructed to write on the answer sheets the Chinese characters or pinyin of the speech with relatively high intensity in the mixture they heard. The recognition rate of the masking experiment was calculated as the number of correctly written keywords divided by the total number of keywords.
Algorithms: The speech processing algorithms adopted here can be divided into two types: the amplitude information algorithm (CIS) and the tonal information algorithm (SFFC).
SNR level selection: The SNRs of the overlapping white noise (TMR in the mixed speech experiment) were -5dB, 0dB, 0dB and 5dB in the vowels, words, sentences and mixed speech experiments, respectively. The sampling rate was 16kHz. The number of channels was fixed at 8.
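Two pieces of the procedure above lend themselves to a short sketch: adding white noise at a prescribed SNR, and scoring open-set responses by keywords. The sketch assumes that the SNR is defined on RMS levels over the whole utterance and that the noise is Gaussian; neither detail is stated in the text, and the function names are illustrative.

```python
import math
import random


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


def add_white_noise(speech, snr_db, seed=0):
    """Return (noisy signal, scaled noise) with the requested RMS-based SNR."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20.0))
    gain = target_noise_rms / rms(noise)
    noise = [gain * v for v in noise]
    return [s + n for s, n in zip(speech, noise)], noise


def keyword_recognition_rate(correct_per_sentence, total_per_sentence):
    """Correctly written keywords divided by the total number of keywords."""
    return sum(correct_per_sentence) / sum(total_per_sentence)
```

For instance, `add_white_noise(speech, -5.0)` corresponds to the vowel condition listed above, and `keyword_recognition_rate` pools keywords over all sentences rather than averaging per-sentence rates.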
4 Results
The results of the recognition rate based on various algorithms at different S levels using different language materials are shown in Fig. 3.
Fig. 3. Recognition rate of different language materials of SFFC (different parameter S) compared with CIS in white noise and mixed speech background
Fig. 3 compares the performance of SFFC for different values of the parameter S in white noise and mixed speech environments. In Table 1, the data obtained with the SFFC algorithm in the different situations are analyzed by ANOVA.
Table 1. The results of SFFC algorithm based on ANOVA analysis
CIS vs S=8 S=7 vs S=8 S=6 vs S=8 S=5 vs S=8 S=4 vs S=8 S=3 vs S=8
vowels F P 57.63
1) with $l_1$ more than about 20 has small $L^{(l_1:l_d)}$, which
although fluctuates for small $l_d$. In other words, we may say that the CAN2 ensemble with a larger $l_d$ provides a more stable loss, or more stable prediction performance, while the single CAN2 with $l_d = 1$ provides a fluctuating loss. However, an $l_d$ larger than the optimal one provides a larger loss, or worse prediction performance, although it provides stability. Thus, it may actually be hard to obtain $(l_1 : l_d)$ with stable and small $L^{(l_1:l_d)}$ even if we know $L^{(l_1:l_d)}$.
Next, let us examine the differential loss $L_D^{(l_1:l_d,m)}$, which can be obtained without knowing the target values in the test dataset. From Fig. 4, we can see that the surface of $L_D^{(l_1:l_d,1)}$ with m = 1 fluctuates, and besides, $L_D^{(l_1:l_d,1)}$ decreases with fluctuation as $l_1$ becomes larger. So, it may be hard to decide which part of $L_D^{(l_1:l_d,1)}$ indicates the effective $(l_1, l_d)$. However, the surface of $L_D^{(l_1:l_d,3)}$ with m = 3 is much smoother than $L_D^{(l_1:l_d,1)}$ because it is a kind of moving average of $L_D^{(l_1:l_d,1)}$. Although the surface of $L_D^{(l_1:l_d,3)}$ seems to be similar to that of $L^{(l_1:l_d)}$, it decreases with fluctuation as $l_1$ becomes larger, as for $L_D^{(l_1:l_d,1)}$, while $L^{(l_1:l_d)}$ has the smallest value in the region shown. Thus, we have to utilize the variance shown in Eq.(12). Another problem is the selection of m: if we use a larger m, we sometimes obtain a result worse than the desirable one, because the quadratic approximation of the loss $L^{(l_1:l_d)}$ is not applicable to a large area, as explained before.
5 Conclusion
We have introduced the CAN2 ensemble and presented a method to select an effective number of units for the ensemble. By means of numerical experiments with a benchmark function, we have shown the effectiveness of the CAN2 ensemble and the method for selecting an effective number of units. We would like to examine the present methods much more with other functions or other datasets in many applications. Furthermore, we hope the idea of the present methods for the CAN2 can be applicable to other artificial neural networks. Finally, we would like to note that our research on the CAN2 was partially supported by the Grant-in-Aid for Scientific Research (B) 16300070 of the Japanese Ministry of Education, Science, Sports and Culture.
References
1. Ahalt, A.C., Krishnamurthy, A.K., Chen, P. and Melton, D.E.: Competitive learning algorithms for vector quantization. Neural Networks 3 (1990) 277–290
2. Kohonen, T.: Associative Memory. Springer Verlag (1977)
3. Kurogi, S. and Ren, S.: Competitive associative networks for function approximation and control of plants. Proc. NOLTA’97 (1997) 775–778
4. Kurogi, S., Tou, M. and Terada, S.: Rainfall estimation using competitive associative net. Proc. 2001 IEICE General Conference (in Japanese). SD-1 (2001) 260–261
5. Kurogi, S.: Asymptotic optimality of competitive associative nets for their learning in function approximation. Proc. ICONIP2002 1 (2002) 507–511
6. Kurogi, S.: Asymptotic optimality of competitive associative nets and its application to incremental learning of nonlinear functions. Trans. of IEICE D-II (in Japanese). J86-D-II(2) (2003) 184–194
7. Kurogi, S., Ueno, T. and Sawa, M.: A batch learning method for competitive associative net and its application to function approximation. Proc. of SCI2004 V (2004) 24–28
8. Kurogi, S., Ueno, T. and Sawa, M.: Batch learning competitive associative net and its application to time series prediction. Proc. of IJCNN 2004 (2004) in CD-ROM
9. http://predict.kyb.tuebingen.mpg.de/pages/home.php
10. Farmer, J.D. and Sidorowich, J.J.: Predicting chaotic time series. Phys. Rev. Lett. 59 (1987) 845–848
11. Chandrasekaran, H. and Manry, M.T.: Convergent design of a piecewise linear neural network. Proc. IJCNN1999 2 (1999) 1339–1344
12. Jacobs, R.A., Jordan, M.I., Nowlan, S.J. and Hinton, G.E.: Adaptive mixtures of local experts. Neural Computation 3 (1991) 79–87
13. Friedman, J.H.: Multivariate adaptive regression splines. Ann Stat 19 (1991) 1–50
14. Opitz, D. and Maclin, R.: Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research 11 (1999) 169–198
15. Lendasse, A., Werz, V., Simon, G. and Verleysen, M.: Fast bootstrap applied to LS-SVM for long term prediction of time series. Proc. of IJCNN2004 (2004) 705–710
Using Weighted Combination-Based Methods in Ensembles with Different Levels of Diversity
Thiago Dutra, Anne M.P. Canuto, and Marcilo C.P. de Souto
Informatics and Applied Mathematics Department – Federal University of Rio Grande do Norte (UFRN), Natal, RN - Brazil, 59072-970
[email protected], [email protected], [email protected]
Abstract. There are two main approaches to combining the outputs of classifiers within a multi-classifier system: combination-based and selection-based methods. This paper investigates the use of weights in some simple non-trainable combination-based methods applied to ensembles with different levels of diversity. The aim is to analyse whether the use of weights can decrease the dependency of ensembles on the diversity of their members. …
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 708 – 717, 2006. © Springer-Verlag Berlin Heidelberg 2006
1 Introduction
In an attempt to improve the recognition performance of individual classifiers, a common approach is to combine multiple classifiers, forming Multi-Classifier Systems (MCSs). MCSs, also known as ensembles or committees, exploit the idea that a pool of different classifiers, referred to individually as experts or recognition modules, can offer complementary information about the patterns to be classified, improving the effectiveness of the overall recognition process [9]. In the literature, MCSs have been widely used in several pattern recognition tasks [2],[4],[11].
In the context of MCSs, one aspect that has been acknowledged as very important is the diversity of MCSs [10],[14]. For example, there is clearly no accuracy gain in an ensemble that is composed of a set of identical classifiers. Thus, if there are many different classifiers to be combined, one would expect an increase in the overall accuracy when combining them, as long as they are diverse (i.e., the errors produced by these classifiers are uncorrelated).
Confidence values can be associated with every sample to denote the confidence of the classifier in question in assigning the input pattern to a particular class. Confidence values can also be associated with all other classes, defining a degree of belonging of a sample to each of them. These confidence values (confidence degrees) are provided by the classifiers and can be considered among the most valuable information that can be extracted from the outputs of the classifiers [5]. Confidence measures have been used in several combination-based methods to investigate the benefits of using these measures in the performance of these methods [2],[5],[12].
However, none of them has investigated the use of weights to decrease the dependency of the ensemble on the diversity of its members. In this paper, the performance of two simple non-trainable combination-based methods (Max and Majority Vote) using weights is investigated. The main aim is to analyze the influence of two confidence measures (weights) on the performance of the combination-based methods. Several ensemble configurations are analyzed, using hybrid (different types of classifiers as components) and non-hybrid (same type of classifier) configurations. According to [3], ensembles that use the same type of classifier tend to be less diverse than hybrid ensembles. This investigation verifies whether the use of weights is affected by the variation of the diversity level in the ensembles. The expectation is that the use of weights decreases the dependency of the combination-based methods on the diversity obtained by the ensemble members.
2 Multi-classifier Systems
The study of Multi-Classifier Systems (MCSs), also known as ensembles or committees of classifiers, has emerged from the need for computational systems that perform pattern recognition efficiently [2],[9]. The goal of using MCSs is to improve the performance of a pattern recognition system in terms of better generalization and/or increased efficiency and clearer design. There are three main choices in the design of an MCS: the organization of its components, the system components and the combination methods that will be used.
In terms of the organization of its components, an MCS can be modular or ensemble. In the modular approach, each classifier becomes responsible for a part of the whole system and the classifiers are usually linked in a serial way. In contrast, in the ensemble approach, all classifiers are able to answer the same task in a parallel or redundant way. Moreover, there exists a combination module that is responsible for providing the overall system output [9]. In this paper, the kind of MCS analyzed is of the ensemble type. Thus, hereafter, the terms ensembles and MCSs will be used interchangeably.
With respect to the choice of the components of the MCS, the correct choice of the set of classifiers is fundamental to the overall performance of a multi-classifier system. As already mentioned, the main aim of combining classifiers is to improve their generalization ability and recognition performance. However, the combination of a set of identical classifiers will not outperform the individual members. A set of distinct classifiers can be achieved by varying the classifier’s structure (topology and parameters), varying the data and/or using different types of classifiers as components of an ensemble. In terms of using different types of classifiers, ensembles can be classified as hybrid and non-hybrid.
When an ensemble is composed of classifiers of the same type, it is called a non-hybrid or homogeneous ensemble. An ensemble composed of more than one type of classifier, on the other hand, is called a hybrid or heterogeneous ensemble.
Once a set of classifiers has been created and the strategy for organizing them has been defined, the next step is to choose an effective way of combining their outputs. There are a great number of combination methods reported in the literature [4],
[9],[14]. According to their functioning, three main strategies of combination methods are discussed in the literature on classifier combination: selection-based, combination-based and hybrid methods.
2.1 Selection-Based Methods
In selection-based methods, only one classifier is needed to correctly classify the input pattern. To that end, it is important to define a process for choosing the member of the ensemble that makes the decision. The choice of the classifier that labels the pattern is made during the operation phase. This choice is typically based on the certainty of the current decision, and preference is given to more certain classifiers. One of the main methods of classifier selection is dynamic classifier selection (DCS) [15].
2.2 Combination-Based Methods
There are a vast number of combination-based methods reported in the literature. They can be classified according to their characteristics as linear or non-linear.
• Linear combination methods. Currently, the simplest ways to combine multiple classifiers are the sum and average of the classifiers’ outputs [9].
• Non-linear methods. This class includes rank-based combiners, such as Borda Count [9], majority voting strategies [9], the Dempster-Shafer technique [13], fuzzy integral [2], neural networks [2] and genetic algorithms [9].
Majority Vote (Maj), Maximum (MAX): As already mentioned, these combination methods are non-trainable: once the classifiers are trained, no further training is required [9]. In the majority vote combination, each classifier assigns 1 to the class label that wins for the input pattern and 0 to the others. In the maximum combination, each classifier assigns to its winning class the value of its highest output, and 0 to all other classes. In both combination methods, the values provided by all classifiers are summed for each class, and the class with the highest sum is chosen as the overall class label of the system.
2.3 Hybrid Methods
Hybrid methods are those in which selection and fusion techniques are used together in order to provide the most suitable output for classifying the input pattern. The main idea is to use selection if and only if the best classifier is really good at classifying the test pattern. Otherwise, a combination method is used [13]. Two main examples of hybrid methods are Dynamic Classifier Selection based on multiple classifier behavior (Dcs-MCS) [7] and Dynamic Classifier Selection using also Decision Templates (Dcs-DT) [8].
2.4 Diversity in Ensembles
As already mentioned, there is no gain in a combination (MCS) that is composed of a set of identical classifiers. The ideal situation, in terms of combining classifiers, would be a
set of classifiers that present uncorrelated errors. In other words, the ensemble must show diversity among its members in order to improve on the performance of the individual classifiers. Diversity can be reached in three different ways:
• Parameters of the classifiers: Varying the initial and learning parameters of a learning model is an attempt to get diversity in ensembles [14];
• Training dataset of the classifiers: Learning strategies such as Bagging and Boosting [9] are examples of reaching diversity through the dataset;
• Type of classifier: The use of different types of classifiers as members of an ensemble (hybrid ensembles) is an attempt to get diversity through the ensemble members.
In this paper, diversity in ensembles will be reached using different initial parameters (non-hybrid approaches) and/or different types of classifiers (hybrid approaches). There are different diversity measures available from different fields of research. In [10],[14], for instance, ten diversity measures are defined.
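The two non-trainable rules described in Section 2.2 (Majority Vote and Maximum) can be sketched as follows. The sketch assumes each classifier returns one numeric output per class; ties are broken in favour of the lowest class index, which is a choice made here and not something the text specifies. All names are illustrative.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def majority_votes(outputs):
    """outputs[i][c] is the output of classifier i for class c.
    Each classifier gives 1 to its winning class and 0 to the others."""
    return [[1.0 if c == argmax(row) else 0.0 for c in range(len(row))]
            for row in outputs]


def max_votes(outputs):
    """Each classifier gives its highest output value to its winning class
    and 0 to the others."""
    return [[row[c] if c == argmax(row) else 0.0 for c in range(len(row))]
            for row in outputs]


def combine(votes):
    """Sum the per-classifier votes and return (winning class, class totals)."""
    totals = [sum(column) for column in zip(*votes)]
    return argmax(totals), totals


# Three classifiers, three classes (made-up outputs for illustration).
outputs = [[0.40, 0.35, 0.25],
           [0.40, 0.30, 0.30],
           [0.05, 0.95, 0.00]]
maj_class, maj_totals = combine(majority_votes(outputs))
max_class, max_totals = combine(max_votes(outputs))
```

In this example the two rules disagree: two weakly confident classifiers outvote the third under Majority Vote, while the single strongly confident classifier wins under Maximum.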
3 Calculating Confidence
Different ways of calculating the confidence of each class for each classifier can be used in determining the relative contribution of each expert within the system. In [2],[5], some examples of calculating weights were presented. In this paper, two confidence measures taken from [2],[5] will be used. In order to calculate these parameters, it is necessary to add an extra phase in addition to training and recall, which can be called the evaluation or validation phase. First of all, it is important to define the output of an ensemble, which can be described as:
$$\mathrm{Ensemble} = \max_{c=1,\ldots,C}\Big(Out_c = \sum_{i=1}^{NC} O_{ic} \cdot Conf_{ic}\Big) \qquad (1)$$
In Equation 1, the output of classifier i for class c, $O_{ic}$, is used along with the relative confidence of the classifier in the corresponding class, $Conf_{ic}$, to define the final output for that class. In the next two subsections, two confidence measures (the Conf variable of Equation 1) are described in order to be used in the combination methods.
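Equation 1 can be sketched directly. The sketch assumes that NC is the number of classifiers and C the number of classes (the text does not define NC explicitly), and that the entries O_ic are the per-classifier votes produced by the Majority Vote or Maximum rule. The function name is illustrative.

```python
def weighted_ensemble(outputs, conf):
    """outputs[i][c] = O_ic and conf[i][c] = Conf_ic for classifier i, class c.
    Returns the winning class index and the list of Out_c values."""
    n_classes = len(outputs[0])
    out = [sum(o_i[c] * conf_i[c] for o_i, conf_i in zip(outputs, conf))
           for c in range(n_classes)]
    winner = max(range(n_classes), key=out.__getitem__)
    return winner, out


# Majority-vote style outputs from three classifiers over two classes.
votes = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
uniform = [[1.0, 1.0]] * 3                       # no weighting
skewed = [[0.5, 0.5], [0.5, 0.5], [2.5, 2.5]]    # third classifier trusted more
```

With `uniform` weights the rule reduces to the unweighted combination, and with `skewed` weights the third classifier alone can overturn the other two.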
3.1 Class Strength
The class strength of a classifier is represented by the value of its highest output. Class strength represents valuable information since it not only indicates whether or not the classifier recognizes the input pattern as belonging to the right class, but also the intensity of the recognition. The confidence measure (Conf variable of Equation 1) based on the class strength of a particular classifier can be described as:
$$Conf_{ic} = C_{ic} = \frac{\sum_{l=1}^{V} \Theta_{icl}}{Cr_{ic}} \qquad (2)$$
where
• $\Theta_{icl}$ is the value of the highest output (class c) of classifier i for the l-th pattern in the validation phase;
• $Cr_{ic}$ is the recognition rate of classifier i for class c in the validation phase;
• V is the number of patterns in the validation phase.
3.2 Strength Relative to the Closest Class
Assuming that the outputs of classifier i for a certain input pattern p are $\Theta_{i1p}, \Theta_{i2p}, \ldots, \Theta_{inp}$, the strength relative to the closest class is the difference between the highest output and the second highest output ($\Theta_{icp} - \Theta_{imp}$, where c is the class with the highest output and m is the class with the second highest output). This measure provides valuable information since it defines how sure a classifier is about the identity of the input pattern. If the input pattern has close similarity with more than one class, the classifier will generate similar outputs, leading to a small value for the relative strength index. Otherwise, if the input pattern matches well to just one class, then a large value of relative strength will be measured. This parameter can be described as:
$$Conf_{ic} = CD_{ic} = \frac{\sum_{l=1}^{V} (\Theta_{icl} - \Theta_{iml})}{Cr_{ic}} \qquad (3)$$
where c and m are the classes with the two highest outputs of classifier i for the l-th validation pattern.
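Both confidence measures can be computed in one pass over the validation set. The sketch below follows one literal reading of the two formulas: for every validation pattern, the top output (or the gap between the top two outputs) is added to the sum of the winning class, and each sum is then divided by the recognition rate of that class. How patterns are attributed to classes and how the recognition rate is computed are assumptions made here, as are all names.

```python
def confidence_measures(outputs, labels, n_classes):
    """outputs[l][c]: one classifier's output for validation pattern l, class c.
    labels[l]: true class of pattern l.
    Returns (class_strength, relative_strength), each a list indexed by class."""
    top_sum = [0.0] * n_classes   # sums of highest outputs, per winning class
    gap_sum = [0.0] * n_classes   # sums of (highest - second highest)
    correct = [0] * n_classes     # correctly recognised patterns per true class
    count = [0] * n_classes       # validation patterns per true class
    for row, y in zip(outputs, labels):
        ranked = sorted(range(n_classes), key=row.__getitem__, reverse=True)
        c, m = ranked[0], ranked[1]
        top_sum[c] += row[c]
        gap_sum[c] += row[c] - row[m]
        count[y] += 1
        if c == y:
            correct[y] += 1
    strength, relative = [], []
    for c in range(n_classes):
        rate = correct[c] / count[c] if count[c] else 0.0  # assumed Cr_ic
        strength.append(top_sum[c] / rate if rate else 0.0)
        relative.append(gap_sum[c] / rate if rate else 0.0)
    return strength, relative
```

The returned lists give one row of the Conf matrix of Equation 1, so the function would be called once per classifier. Classes with a zero recognition rate are given zero confidence here to avoid a division by zero, which is a choice of this sketch.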
4 Experimental Work
In order to investigate the benefits of using weights in the combination-based methods, an empirical comparison of several classifiers is performed. MLP (Multi-Layer Perceptron), JRIP (Optimized IREP), RBF (radial basis function), SVM (Support Vector Machine) and Naïve Bayesian classifiers have been used as base classifiers in this investigation [13]. These classifiers were chosen because they present different learning biases. In addition, ensemble methods using combinations of the classifiers have also been used.
Experiments were conducted using two different ensemble sizes, with 3 and 5 base classifiers. A small number of classifiers was chosen because diversity affects accuracy more when ensembles have fewer than ten classifiers [9]. In this sense, it is believed that the combination methods are more sensitive to variations of the ensemble members when a small number of members is used.
For each ensemble size, hybrid and non-hybrid configurations of ensembles were created. For the non-hybrid configurations, variations of the same type of classification method were used (for all five methods except Naïve Bayesian), varying the initial and training parameters of each method. For the hybrid configurations, different classification methods were used. For all ensemble configurations, six different combination methods were used: the original combination-based methods (Maj and MAX), the methods with class strength (the Conf variable of Equation 1) and the methods with strength relative to the closest class (the Conf variable of Equation 1).
As can be noticed, there are a large number of possible configurations that can be used in an investigation, mainly in the hybrid approaches. For simplicity, only a
Using Weighted Combination-Based Methods in Ensembles
small fraction of all possible configurations is used in this investigation. The main idea is to investigate the performance of the ensembles when using 1, 3 and 5 different types of classifiers for systems with five classifiers, and 1, 2 and 3 types of classifiers for systems with three classifiers. The aim is to vary the diversity level of the ensembles and to analyze the influence of the weights on the accuracy of these ensembles. Accordingly, for each ensemble size (3 and 5), five configurations are used, of which two are non-hybrid and three are hybrid.

All the learning and combination methods used in this study were implemented in Java. A 10-fold cross-validation technique was applied to all ensembles [9]. In order to evaluate the significance of the results provided by the ensembles, hypothesis tests (t-tests) are applied to the results of the ensembles [13].
4.1 Databases

Two different databases are used in this investigation:

• Database A: primate splice-junction gene sequences (DNA) with an associated imperfect domain theory, obtained from [2]. A total of 3190 instances with 60 attributes were used. These attributes describe sequences of DNA used in the process of creating proteins.
• Database B (proteins): a protein database which represents a manually detailed hierarchical classification of known protein structures. The proteins are organized according to their evolutionary and structural relationships. The main protein classes are all-α, all-β, α/β, α+β and small. It is an unbalanced database with a total of 582 patterns, of which 111 belong to class all-α, 177 to class all-β, 203 to class α/β, 46 to class α+β and 45 to class small.
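The degree of imbalance of Database B can be made explicit with a short calculation on the class counts quoted above. This is only an illustrative sketch; the dictionary keys are hypothetical ASCII spellings of the class names.

```python
# Class counts of Database B as quoted in the text.
class_counts = {
    "all-alpha": 111,
    "all-beta": 177,
    "alpha/beta": 203,
    "alpha+beta": 46,
    "small": 45,
}

total = sum(class_counts.values())
# Share of each class in the database.
proportions = {name: count / total for name, count in class_counts.items()}
# Ratio between the largest and the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for name, share in proportions.items():
    print(f"{name:12s} {share:.1%}")
```

The largest class (α/β) holds roughly a third of the patterns, while the two smallest classes together hold less than a sixth, so the largest class is about 4.5 times the size of the smallest.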
5 An Analysis of the Results

Before investigating the performance of the ensembles, it is important to analyze the performance of the individual classifiers. Table 1 shows the correct mean (CM) and standard deviation (SD) of the individual classifiers employed in the ensembles. As already mentioned, five variations of each classifier are used, apart from Naïve Bayesian. For simplicity, only the highest accuracy for each model is shown in Table 1.

Table 1. Performance of the individual classifiers applied to databases A and B

            Database A        Database B
            CM      SD        CM      SD
MLP         86.45   3.55      80.51   3.30
RBF         89.24   2.74      76.09   3.86
JRIP        90.48   2.07      77.36   5.19
SMO         89.48   3.42      80.57   5.04
NAIVE       87.10   1.53      78.30   3.59
T. Dutra, A.M.P. Canuto, and M.C.P. de Souto
According to the correct means of the classifiers, all classifiers delivered a similar pattern of performance for both databases. The JRIP classifier provided the highest correct mean for database A (90.48), while SMO (80.57) and MLP (80.51) delivered the highest correct means for database B.
5.2 Ensembles with Three Base Classifiers

Table 2 shows the correct mean (CM) and standard deviation (SD) of the ensembles with three classifiers applied to databases A and B. In this table, six combination methods are analyzed: the original methods (Maj and MAX), methods with class strength (C-Maj and C-MAX) and methods with Strength Relative to the Closest Class (CC-Maj and CC-MAX). For each combination method, the performance of five configurations (three hybrid and two non-hybrid) was investigated. The approaches are: JJJ (3 JRIPs), MMM (3 MLPs), NRR (1 naïve and 2 RBF), NJM (1 naïve, 1 JRIP and 1 MLP) and MJR (1 MLP, 1 JRIP and 1 RBF).

Table 2. Performance of the ensembles with three base classifiers applied to databases A and B (O: original method; C: class strength; CC: strength relative to the closest class)

                 MAX                                     Maj
         O           C           CC          O           C           CC
Database A
NRM      82.47±3.1   83.79±3.1   85.11±5.7   82.08±5.0   82.45±6.4   82.08±5.3
NJM      80.02±5.2   83.79±5.6   84.17±5.3   82.27±6.2   83.40±4.9   83.02±4.3
MJP      81.91±2.0   84.17±2.9   83.79±2.1   82.08±3.4   82.27±4.5   82.27±5.3
JJJ      81.09±4.2   82.47±4.4   82.47±5.6   79.44±5.2   80.36±5.0   81.30±4.9
MMM      82.76±4.1   83.00±3.8   83.19±3.3   80.08±3.1   80.76±3.4   80.76±3.4
Database B
NRM      88.21±1.1   89.10±7.5   91.21±6.0   87.86±4.0   88.66±4.0   88.52±4.1
NJM      89.03±1.3   89.76±5.3   90.14±2.6   93.76±2.0   94.00±1.7   93.97±1.6
MJP      88.28±1.1   92.14±6.7   93.17±7.7   89.45±4.1   90.90±5.5   90.17±4.0
JJJ      92.34±1.3   92.17±1.5   92.93±2.7   90.65±2.0   91.38±2.1   91.59±2.0
MMM      91.93±3.8   94.00±2.8   93.58±3.6   92.45±4.5   92.45±3.9   92.72±4.5
As can be noticed from Table 2, the general correct mean of the weighted methods was higher than that of the original combination methods. Also, the weighted methods improved the MAX combination method more than the majority vote method. The improvement obtained for database B is smaller than for database A. It is believed that this is because database B is unbalanced and the use of weights enhances the difference among the classes. Of the weighted methods, the closest class weight methods (CC, equation 3) in general delivered the highest correct mean and the lowest standard deviation. Comparing the improvement of the weighted methods over the original methods in hybrid and non-hybrid approaches, the improvement was, on average, higher for the hybrid approaches than for the non-hybrid ones.

In order to analyze the improvement obtained by the use of weights, a hypothesis test (t-test) is performed at a confidence level of 95% (α = 0.05). The accuracy of the original method is compared with the lowest accuracy of the weighted methods. The idea
is that, if the difference is statistically significant for the weighted method with the lowest accuracy, it is also significant for the other weighted method. For the cases in which the difference is not statistically significant for the lowest method, the hypothesis test is applied to the weighted method with the highest accuracy.

For database A, the difference between both weighted methods and the original ones is statistically significant in 0 cases (out of 4: 2 for MAX and 2 for Maj) for the non-hybrid approaches, and in 2 cases (out of 6: 3 for MAX and 3 for Maj) for the hybrid approaches. The difference between the highest weighted method and the original one is statistically significant in only 1 case for the non-hybrid approaches and in only 1 case for the hybrid approaches. For all other cases, there is no statistical evidence that the difference obtained by the weighted methods is significant for database A.

For database B, the difference between both weighted methods and the original ones is statistically significant in 0 cases (out of 4) for the non-hybrid approaches, and in only 1 case (out of 6) for the hybrid approaches. The difference between the highest weighted method and the original one is statistically significant in 1 case for the non-hybrid approaches and in 2 cases (out of 6) for the hybrid approaches. For all other cases, there is no statistical evidence that the difference obtained by the weighted methods is significant for database B.

It is important to emphasize that almost all cases in which the improvement was statistically significant used the MAX combination method. In only one case is the difference between the highest weighted Majority method and the original one statistically significant.
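The two-step testing procedure described above can be sketched in Python. The text does not say which variant of the t-test was used, so this sketch assumes a paired t-test over the 10 cross-validation folds and compares the statistic with 2.262, the two-sided 5% critical value of Student's t with 9 degrees of freedom. All fold accuracies and function names below are hypothetical.

```python
import math
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # two-sided 5% critical value, 9 degrees of freedom (10 folds)


def paired_t(a, b):
    """Paired t statistic for per-fold accuracies a and b (assumed pairing)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def is_significant(original, weighted, t_crit=T_CRIT_DF9):
    return abs(paired_t(weighted, original)) > t_crit


def two_step_test(original, weighted_by_name):
    """Test the lowest weighted method first, then the highest if needed."""
    ranked = sorted(weighted_by_name, key=lambda name: mean(weighted_by_name[name]))
    lowest, highest = ranked[0], ranked[-1]
    if is_significant(original, weighted_by_name[lowest]):
        return "both"  # significance is taken to carry over to the other method
    if is_significant(original, weighted_by_name[highest]):
        return "highest only"
    return "none"


# Hypothetical fold accuracies for one configuration.
original = [80, 81, 79, 82, 80, 81, 80, 79, 81, 80]
c_method = [o + g for o, g in zip(original, [1, -1, 1, -1, 0, 1, -1, 0, 1, -1])]
cc_method = [o + g for o, g in zip(original, [2, 3, 2, 3, 2, 3, 2, 3, 2, 3])]
result = two_step_test(original, {"C": c_method, "CC": cc_method})
```

In this made-up example the C method does not differ from the original on average, so only the CC method is counted as a significant improvement.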
5.3 Ensembles with Five Base Classifiers

Table 3 shows the correct mean (CM) and standard deviation (SD) of the ensembles with five classifiers applied to databases A and B. As in the previous section, six combination methods are analyzed. For each combination method, the performance of five configurations (three hybrid and two non-hybrid) was investigated. The approaches are: JJJJJ (5 JRIPs), MMMMM (5 MLPs), NRRMM (1 Naïve, 2 RBFs and 2 MLPs), NNMMJ (2 Naïve, 2 MLPs and 1 JRIP) and JMNRS (1 JRIP, 1 MLP, 1 Naïve, 1 RBF and 1 SVM).

Table 3. Performance of the ensembles with five base classifiers applied to databases A and B (O: original method; C: class strength; CC: strength relative to the closest class)

                 MAX                                     Maj
         O           C           CC          O           C           CC
Database A
JMNRS    78.87±3.4   85.09±5.2   85.28±5.1   82.59±4.2   83.21±4.5   83.78±4.5
NRRMM    83.58±3.3   85.85±3.1   86.42±3.8   82.83±4.7   83.40±4.9   82.64±5.5
NNMMJ    81.42±3.7   82.19±3.7   86.57±3.8   81.46±4.2   82.64±3.2   82.83±3.5
JJJJJ    82.45±3.2   83.96±1.0   84.72±1.5   80.19±3.0   80.11±2.3   80.11±2.2
MMMMM    81.11±3.0   82.68±2.0   84.06±2.0   81.64±3.4   81.89±2.6   81.89±2.9
Database B
JMNRS    90.31±3.8   88.35±3.4   90.86±2.8   91.66±8.2   89.76±9.8   90.62±9.4
NRRMM    89.28±1.3   93.66±0.9   91.21±0.8   90.55±5.5   89.03±6.3   90.76±5.2
NNMMJ    87.35±1.3   91.38±0.7   89.73±1.0   88.97±4.3   87.45±2.3   88.86±4.4
JJJJJ    91.76±2.7   93.28±2.5   93.07±2.3   92.31±2.5   92.38±2.4   92.24±2.4
MMMMM    91.21±3.6   93.34±2.0   92.48±2.0   91.97±2.9   94.59±1.7   92.03±2.0
As can be noticed from Table 3, the general correct means of all ensembles are higher than those of the ensembles with three classifiers. As in the previous section, the performance (correct mean and standard deviation) of the weighted methods was, on average, better than that of the original methods. Also as in the previous section, the weighted methods improved the MAX combination method more than the majority vote method, and the improvement obtained for database B is smaller than for database A. Of the weighted methods, the closest class weight methods (CC, equation 3) in general delivered the highest correct mean and the lowest standard deviation. As in the previous section, the improvement of the weighted methods over the corresponding original methods is higher for the hybrid approaches than for the non-hybrid ones.

In order to analyze the improvement obtained by the use of weights, a hypothesis test (t-test) is performed at a confidence level of 95% (α = 0.05). The accuracy of the original method is compared with the lowest accuracy of the weighted methods. For the cases in which the difference is not statistically significant for the lowest method, the hypothesis test is applied to the weighted method with the highest accuracy.

For database A, the difference between both weighted methods and the original ones is statistically significant in 0 cases (out of 4: 2 for MAX and 2 for Maj) for the non-hybrid approaches, and in 2 cases (out of 6: 3 for MAX and 3 for Maj) for the hybrid approaches. The difference between the highest weighted method and the original one is statistically significant in 2 cases for the non-hybrid approaches and in only 1 case for the hybrid approaches. For all other cases, there is no statistical evidence that the difference obtained by the weighted methods is significant for database A.

For database B, the difference between both weighted methods and the original ones is statistically significant in 0 cases (out of 4) for the non-hybrid approaches, and in 2 cases (out of 6) for the hybrid approaches. The difference between the highest weighted method and the original one is statistically significant in 3 cases (out of 4) for the non-hybrid approaches and in only 1 case (out of 6) for the hybrid approaches. For all other cases, there is no statistical evidence that the difference obtained by the weighted methods is significant for database B. As in the previous section, in only one case is the difference between the highest weighted Majority method and the original one statistically significant. In all other cases, the statistically significant improvement was obtained with the MAX combination methods.
6 Final Remarks

This paper investigated how the use of weights can improve the accuracy of two combination-based ensemble methods. To this end, several ensemble configurations were analyzed, using hybrid (different types of classifiers as components) and non-hybrid (the same type of classifier) configurations. Two ensemble sizes were investigated (3 and 5 base classifiers), and the ensembles were applied to two different databases.

This analysis showed that the MAX combination method benefited more from the use of weights than the Maj method. When applying the hypothesis test, the improvement obtained by at least one weighted method was statistically significant in 18 cases (out of 40: 20 for MAX and 20 for Maj). However, the improvement was statistically significant in only 2 cases for the majority combination methods; all other 16 cases were for the MAX combination methods. In this sense, it is possible to conclude that the use of weights was positive for the linear combination-based method,
while it did not significantly improve the accuracy of the non-linear combination-based method. Also, as one of the databases (database B) is unbalanced, the use of weights enhanced the difference among the classes and the improvement was not as strong as it was for the other database. However, contrary to what was expected, the use of weights was more beneficial to the hybrid approaches, where the diversity of the ensembles tends to be high, than to the non-hybrid ones. In this sense, the use of weights was also affected by variations in the diversity of the members. Although the use of weights has not decreased the dependency of the ensembles on diversity, it has improved the accuracy of the ensembles, which is positive for their performance.
References

1. Blake, C.L. and Merz, C.J.: UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.
2. Canuto, A.: Combining neural networks and fuzzy logic for applications in character recognition. PhD thesis, University of Kent, 2001.
3. Canuto, A.M.P., Abreu, M.C., Oliveira, L.M., Xavier Junior, J.C. and Santos, A.M.: "Investigating the Importance of the Ensemble Members in the Performance of Selection-based and Fusion-based Combination Methods", submitted to Pattern Recognition Letters in December, 2005.
4. Czyz, J., Sadeghi, M., Kittler, J. and Vandendorpe, L.: Decision fusion for face authentication, Proc First Int Conf on Biometric Authentication, 686-693, 2004.
5. Canuto, A., Howells, G. and Fairhurst, M.: The Use of Confidence Measures to Enhance Combination Strategies in Multi-Network Neuro-Fuzzy Systems. Connection Science, 12(3/4), pp. 315-331, 2000.
6. Giacinto, G. and Roli, F.: "Design of effective neural network ensembles for image classification", Image and Vision Computing Journal, 19(9-10), pp. 697-705, 2001.
7. Giacinto, G. and Roli, F.: "Dynamic Classifier Selection based on Multiple Classifier Behaviour". Pattern Recognition, vol. 34, 1879-1881, 2001.
8. Kuncheva, L.: Switching Between Selection and Fusion in Combining Classifiers: An Experiment. IEEE Trans on Systems, Man and Cybernetics, Part B, Vol. 32, N. 2, 146-155, 2002.
9. Kuncheva, L.I.: Combining Pattern Classifiers. Methods and Algorithms, Wiley, 2004.
10. Kuncheva, L. and Whitaker, C.: Measures of diversity in classifier ensembles, Mach Learning, 51, 181-207, 2003.
11. Lemieux, A. and Parizeau, M.: "Flexible multi-classifier architecture for face recognition systems". The 16th International Conference on Vision Interface, 2003.
12. Muhlbaier, M., Topalis, A. and Polikar, R.: "Ensemble confidence estimates posterior probability", 6th Int. Workshop on Multiple Classifier Systems (MCS 2005), LNCS, vol. 3541, pp. 326-335, 2005.
13. Mitchell, T.: Machine Learning. McGraw-Hill, 1997.
14. Shipp, C.A. and Kuncheva, L.I.: Relationships between combination methods and measures of diversity in combining classifiers, Inf Fusion, 3(2), 135-148, 2002.
15. Woods, K., Kegelmeyer, W. and Bowyer, K.: Combination of Multiple Classifiers using Local Accuracy estimates, IEEE Trans on Patt Analysis and Mach Intelligence, 19(4), 405-410, 1997.
Integrated Neural Network Ensemble Algorithm Based on Clustering Technology

Bingjie Liu and Changhua Hu
Xi'an Institute of Hi-Tech, Xi'an 710025, P.R. China
[email protected], [email protected]
Abstract. Neural network ensemble (NNE) research focuses on two aspects: how to generate component NNs and how to combine them. These two interacting aspects greatly affect the performance of an NNE. Unfortunately, they were investigated separately in almost all previous work. This paper proposes an integrated neural network ensemble (InNNE), an integrated ensemble algorithm that not only dynamically adjusts the weights of an ensemble but also generates component NNs based on clustering technology. InNNE partitions the training set into subsets with clustering technology, and these subsets are used to train different component NNs. The weights of the ensemble are adjusted according to the correlation between the input data and the centers of the training subsets. InNNE can increase the diversity of component NNs and decrease the generalization error of the ensemble. The paper provides both analytical and experimental evidence supporting the new algorithm.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 718–726, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Neural Network Ensembles (NNEs) have been widely investigated by researchers all over the world since they were proposed in 1990. The study of NNEs has focused on two aspects: how to generate component NNs and how to combine them. Zhou [1] presented a selective NNE based on a genetic algorithm, called GASEN, which ensembles many of the component NNs instead of all of them. Experiments validated that GASEN performs better than Boosting and Bagging. However, GASEN brought no improvement to the generation of component NNs. Li et al. [2] and Li et al. [3] presented a selective NNE based on clustering techniques, which can reduce the size of the ensemble by computing the diversity of the component NNs in an ensemble. However, several parameters used in the algorithm were difficult to determine. Md. et al. [4] presented a constructive algorithm for training cooperative neural network ensembles (CNNEs). CNNE combines ensemble architecture design with cooperative training of the component NNs in ensembles, and puts emphasis on both accuracy and diversity among the component NNs. In order to maintain accuracy among component NNs and generalization performance, CNNE automatically determines the number of component NNs in an ensemble as well as the number of hidden nodes in each NN using a constructive approach. CNNE can improve the accuracy of an NNE with a relatively small ensemble, but it has several shortcomings; for example, the number of epochs for training the component NNs is determined subjectively. Another problem of CNNE is that it is time-consuming, because the number of component NNs and the number of hidden-layer neurons are incremented one by one.

There are many other works on improving NNEs. Wang [5] improved GASEN by adding a bias. Pitoyo [6] presented an adaptive NNE, which improved the accuracy of the NNE by eliminating inadequate samples. Yao et al. [7] proposed an algorithm which adjusts the weights of an NNE using an evolutionary algorithm. Many works thus aimed at improving the accuracy and generalization performance of ensembles. However, most previous works addressed the three parts of an NNE separately: the architecture of the ensemble (such as the number of component NNs and the number of hidden neurons of each component), the weight assignment of the ensemble, and the method of generating component NNs. Few works address both the generation of component NNs and weight assignment.

This paper presents an integrated algorithm based on clustering technology, called InNNE. Unlike previous works, InNNE is a method not only for adaptively adjusting the weights of an NNE according to the input data, but also for generating component NNs. Li et al. [2] and Li et al. [3] also introduced clustering technology into NNEs. The difference is that InNNE uses clustering technology to partition the sample set into subclasses for training different component NNs, whereas the algorithm of Li et al. [2] and Li et al. [3] uses clustering technology to group component NNs into classes and eliminate correlated NNs whose diversity is relatively small.

The number of component NNs was often predefined or fixed in previous works. However, the number of NNs in an ensemble has a major impact on the performance of the ensemble. If the number of component NNs is too small, the ensemble is likely to perform poorly in terms of generalization. If the number of component NNs is too large, the computational cost of training the ensemble increases. In some cases, the ensemble performance may even deteriorate in an overly large ensemble. To address this problem, the InNNE proposed in this paper automatically determines the number of component NNs in an ensemble using clustering technology.

The rest of this paper is organized as follows. Clustering technology is introduced in Section 2. InNNE is described in detail in Section 3. Section 4 gives examples that validate the algorithm. Section 5 presents a discussion and related work. Conclusions are drawn in the final section.
2 Clustering Technology

Many clustering techniques are available for classifying samples [9],[10]. This paper uses k-means clustering analysis to classify samples. FCM (Fuzzy C-means) is one of the most important branches of k-means clustering. The main shortcoming of FCM is its sensitivity to outlying (dissociative) data, because FCM uses the following formula to compute the centers of samples:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (1)$$

$\bar{x}$ changes greatly when noise or outlying data appear. This paper uses the medians of the data sets to represent the centers of classes, because medians are insensitive to noise and outlying data. Assume that $X = \{(x_1, d_1), (x_2, d_2), \ldots, (x_n, d_n)\}$ is the sample set, where $x_i$ is the ith input sample, $d_i$ is the ith output sample, and n is the number of samples. Here $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$, where $x_{ik}$ is the kth input of the ith input sample, $1 \le k \le m$, and m is the number of inputs. The algorithm proceeds as follows:

1. Standardize the samples [11]:

$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{R_j}, \quad 1 \le i \le n, \; 1 \le j \le m \qquad (2)$$

where $\bar{x}_j$ is the median of the jth input:

$$\bar{x}_j = X_j\left(\frac{n+1}{2}\right) \text{ for odd } n; \qquad \bar{x}_j = X_j\left(\frac{n}{2}\right) \text{ for even } n \qquad (3)$$

2. Compute the similarity matrix of the samples:

$$D = \begin{pmatrix} 1 & d_{12} & \cdots & d_{1n} \\ & 1 & \cdots & \vdots \\ & & 1 & d_{n-1,n} \\ & & & 1 \end{pmatrix} \qquad (4)$$

$$d_{ij} = \sum_{k=1}^{m} |\tilde{x}_{ik} - \tilde{x}_{jk}| \qquad (5)$$

D is a symmetric matrix, with $d_{ij} = d_{ji}$ and $d_{ii} = 1$.

3. Compute $D = D^{2k}$, where $D^{2k} = D^{k} * D^{k}$. When $D^{2k} = D^{2(k+1)}$ is satisfied, $D = D^{2k}$.

4. Given a threshold $\lambda$, when $d_{ij} < \lambda$, the ith and jth samples are clustered into the same class.
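The clustering steps above can be approximated with a short Python sketch. Several choices here are assumptions rather than the authors' method: $R_j$ is taken to be the range (maximum minus minimum) of the jth input, the median comes from Python's `statistics.median` (which averages the two middle values for even n), and the repeated matrix composition of step 3 is replaced by a union-find pass that chains together all samples whose pairwise distance is below the threshold. The function names are hypothetical.

```python
from statistics import median


def standardize(samples):
    """Center each input on its median and divide by its range (assumed R_j)."""
    columns = list(zip(*samples))
    medians = [median(col) for col in columns]
    ranges = [(max(col) - min(col)) or 1.0 for col in columns]  # avoid division by zero
    return [[(x - m) / r for x, m, r in zip(row, medians, ranges)] for row in samples]


def l1_distance(a, b):
    """Sum of absolute differences, as in the definition of d_ij."""
    return sum(abs(x - y) for x, y in zip(a, b))


def threshold_clusters(samples, lam):
    """Assign a class label to each sample; samples closer than lam share a class."""
    z = standardize(samples)
    n = len(z)
    parent = list(range(n))  # every sample starts in its own class

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if l1_distance(z[i], z[j]) < lam:
                parent[find(i)] = find(j)  # merge the two classes

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Hypothetical two-dimensional samples forming two obvious groups.
samples = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
classes = threshold_clusters(samples, lam=0.5)
```

The threshold controls the number of classes: a smaller value of `lam` yields more, tighter classes, which in InNNE would translate into more component NNs.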
3 Integrated Neural Network Ensemble, InNNE

The diversity among component NNs depends on the diversity of their training sets. In InNNE, the training set is partitioned with clustering technology into subsets with maximal diversity among them. Different component NNs learn from different training subsets, so the number of component NNs equals the number of training subsets. The weights of InNNE are adjusted according to the correlation between the input data and the training subsets. Both theoretical and empirical studies have demonstrated that the generalization performance of an NNE depends greatly on both the accuracy of and the diversity among the component NNs in an ensemble [4].

Suppose that the sample set is partitioned into m subsets $X_1, X_2, \ldots, X_m$ with centers $c_1, c_2, \ldots, c_m$, respectively, where center $c_i$ has the components $c_{i1}, c_{i2}, \ldots, c_{ik}$, $0 < i \le m$. For input data $d = (d_1, d_2, \ldots, d_k)$, the distance between d and center $c_j$ is:

$$\mathrm{diff}(d, c_j) = \sum_{i=1}^{k} |d_i - c_{ji}|, \quad 0 < j \le m \qquad (6)$$

The jth component NN of the ensemble is trained on $X_j$, and its weight is determined by the following formula:

$$w_j = 1 - \frac{\mathrm{diff}(d, c_j)}{\sum_{l=1}^{m} \mathrm{diff}(d, c_l)} \qquad (7)$$

where $w_j \ge 0$. Suppose that $V_j$ is the output of the jth NN; the output of the ensemble is:

$$\bar{V} = \sum_{j=1}^{m} w_j V_j \qquad (8)$$

The process of InNNE is as follows:

1. Partition the training set into several subsets using the clustering technology, and record the center of every class. An optimal clustering result is needed for the following steps.
2. Train different NNs on different training subsets; the number of component NNs equals the number of training subsets.
3. Compute the correlation of the input data with each training subset and determine the weights of the ensemble.
4. Compute the result of the ensemble.
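Steps 3 and 4 of this process, that is equations (6) to (8), can be sketched in Python as follows. The function names and the example numbers are hypothetical, and the handling of the degenerate case in which all distances are zero is an assumption, since equation (7) is undefined there.

```python
def diff(d, center):
    """L1 distance between input d and a cluster center, as in equation (6)."""
    return sum(abs(di - ci) for di, ci in zip(d, center))


def innne_weights(d, centers):
    """Input-dependent weights, as in equation (7)."""
    dists = [diff(d, c) for c in centers]
    total = sum(dists)
    if total == 0:
        # Assumption: every center coincides with d, so all weights are equal.
        return [1.0 - 1.0 / len(centers)] * len(centers)
    return [1.0 - dist / total for dist in dists]


def ensemble_output(d, centers, component_outputs):
    """Weighted sum of the component outputs, as in equation (8)."""
    weights = innne_weights(d, centers)
    return sum(w * v for w, v in zip(weights, component_outputs))


# Hypothetical input, three cluster centers and three component outputs.
d = [0.0, 0.0]
centers = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
weights = innne_weights(d, centers)            # distances 0, 2 and 6
output = ensemble_output(d, centers, [10.0, 20.0, 40.0])
```

The component whose training subset is centered closest to the input receives the largest weight. Note that the weights defined by equation (7) sum to m − 1 rather than to 1 (here 1 + 0.75 + 0.25 = 2), so the result of equation (8) as written is not a weighted average; dividing by m − 1 would give one, but the text does not state such a normalization.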
4 Examples

Four data sets are used to validate InNNE. All of them are regression data sets [12].

Data set 1 (Friedman#1):

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon \qquad (9)$$

where $x_i$, $i = 1, \ldots, 5$, obey the uniform distribution in [0,1] and $\varepsilon$ obeys the normal distribution N(0,1).

Data set 2 (Plane):

$$y = 0.6x_1 + 0.3x_2 + \varepsilon \qquad (10)$$

where $x_1, x_2$ obey the uniform distribution in [0,1], and $\varepsilon$ obeys the normal distribution N(0,1).

Data set 3 (Friedman#2):

$$y = \sqrt{x_1^2 + \left(x_2 x_3 - \frac{1}{x_2 x_4}\right)^2} \qquad (11)$$

where $x_1$ obeys the uniform distribution in [0,100], $x_2$ obeys the uniform distribution in [40π, 560π], $x_3$ obeys the uniform distribution in [0,1], $x_4$ obeys the uniform distribution in [1,11], and $\varepsilon$ obeys the normal distribution N(0,1).

Data set 4 (Multi):

$$y = 0.79 + 1.29x_1 x_2 + 1.56x_1 x_4 + 3.42x_2 x_5 + 2.06x_3 x_4 x_5 \qquad (12)$$
where $x_i$, $i = 1, \ldots, 5$, obey the uniform distribution in [0,1].

Four thousand samples are produced for each data set. The 4000 samples are partitioned into five, eight and ten subsets using clustering technology in order to test the influence of the clustering result on InNNE. Eighty percent of each data class is used to train the NNs, and the rest is used to test them. The component NNs are BP NNs with one hidden layer of ten neurons. The experiment is performed using Bagging, GASEN, CNNE and InNNE on the same data sets. The methods of generating component NNs are Bagging, CNNE and InNNE; the methods of weight assignment are the average method, GASEN and InNNE. The relative generalization error (RGE) is used to evaluate the performance of these algorithms. The testing results are shown in Tables 1-4.

Table 1. The testing result of Friedman 1 with different weight assignment algorithms and component NN generation algorithms

Generating        Weight assignment
method            Average   GASEN    InNNE    CNNE
Bagging           1.0556    0.8920   0.8755   0.8342
InNNE(5)          0.9110    0.8630   0.8133   0.8568
InNNE(8)          0.8910    0.8520   0.8021   0.8444
InNNE(10)         0.9010    0.8760   0.8104   0.7424
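The four synthetic data sets used in this section can be generated with the following Python sketch. The generator and helper names are hypothetical. Friedman #2 is implemented in its standard noise-free form; the 80/20 split shown here is a simple sequential split and does not reproduce the per-cluster split described in the text.

```python
import math
import random


def friedman1(rng):
    x = [rng.random() for _ in range(5)]
    y = (10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2
         + 10 * x[3] + 5 * x[4] + rng.gauss(0, 1))
    return x, y


def plane(rng):
    x = [rng.random() for _ in range(2)]
    return x, 0.6 * x[0] + 0.3 * x[1] + rng.gauss(0, 1)


def friedman2(rng):
    x = [rng.uniform(0, 100), rng.uniform(40 * math.pi, 560 * math.pi),
         rng.random(), rng.uniform(1, 11)]
    # Standard Friedman #2 form, written here without a noise term.
    y = math.sqrt(x[0] ** 2 + (x[1] * x[2] - 1 / (x[1] * x[3])) ** 2)
    return x, y


def multi(rng):
    x = [rng.random() for _ in range(5)]
    y = (0.79 + 1.29 * x[0] * x[1] + 1.56 * x[0] * x[3]
         + 3.42 * x[1] * x[4] + 2.06 * x[2] * x[3] * x[4])
    return x, y


def make_dataset(generator, n=4000, seed=0):
    """Draw n (input, output) pairs from one of the generators above."""
    rng = random.Random(seed)
    return [generator(rng) for _ in range(n)]


def split(data, train_fraction=0.8):
    """Sequential train/test split (assumption: no per-cluster stratification)."""
    cut = round(train_fraction * len(data))
    return data[:cut], data[cut:]


train, test = split(make_dataset(friedman1))
```

With the default arguments this yields 3200 training and 800 test samples per data set, matching the eighty percent proportion stated above.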
As Tables 1-4 show, the RGE of InNNE is smaller than that of the other methods in most cases. When InNNE is used both for weight assignment and for generating component NNs, the RGE is smallest. The RGE of InNNE weight assignment is smaller than that of the other weight assignment methods. It is noticeable that the clustering result greatly influences the generalization performance of the NNE, as shown in Tables 1-4. The RGE is smallest with eight classes, InNNE(8), and it increases when the clustering result is ten subclasses. If there are too many clusters, the diversity among component NNs is reduced and the computational cost of training the ensemble increases. If there are too few clusters, the generalization performance of InNNE cannot be guaranteed. An optimal clustering result is therefore needed for the generalization performance of InNNE. An optimal clustering result means that the diversity within a training subset is minimal and the diversity among different training subsets is maximal. The four tables also show that CNNE is a good approach with low RGE. In fact, the accuracy of CNNE is higher than that of InNNE when the component NNs are generated by Bagging.

Table 2. The testing result of Plane with different weight assignment algorithms and component NN generation algorithms

Generating        Weight assignment
method            Average   GASEN    InNNE    CNNE
Bagging           0.8310    0.8127   0.8174   0.7985
InNNE(5)          0.8224    0.8100   0.8004   0.8236
InNNE(8)          0.8108    0.8020   0.7912   0.7978
InNNE(10)         0.8210    0.8121   0.8010   0.8354

Table 3. The testing result of Friedman 2 with different weight assignment algorithms and component NN generation algorithms

Generating        Weight assignment
method            Average   GASEN    InNNE    CNNE
Bagging           0.9950    0.8240   0.8242   0.8009
InNNE(5)          0.9720    0.8112   0.8100   0.8368
InNNE(8)          0.9556    0.8006   0.7918   0.8074
InNNE(10)         0.9631    0.8121   0.8115   0.8755
Table 4. The testing result of Multi with different weight assignment algorithms and component NN generation algorithms

Generating        Weight assignment
method            Average   GASEN    InNNE    CNNE
Bagging           0.9873    0.9880   0.9869   0.9203
InNNE(5)          0.9854    0.9772   0.9527   0.9665
InNNE(8)          0.9788    0.9700   0.9498   0.9412
InNNE(10)         0.9866    0.9774   0.9602   0.9854

5 Discussion and Related Works

InNNE is an integrated approach not only for adaptively adjusting weights in an ensemble, but also for generating component NNs. This section discusses InNNE in detail from two aspects: generating component NNs and combining them.

Boosting and Bagging are the most important approaches for generating component NNs. Bagging randomly generates several training sets from the original one and then trains a component NN on each of those training sets. Training samples can be selected repeatedly in Bagging. Boosting generates a series of component NNs whose training sets are determined by the performance of the former ones. Training instances that are wrongly predicted by former networks play more important roles in the training of later networks. Both Boosting and Bagging rely on probabilistic sampling. The same training samples can appear, with a certain probability, in different training sets that are used to train different NNs. The diversity of component NNs in an ensemble is weakened because several component NNs learn the same samples. In InNNE, the same sample cannot appear in different training sets. The diversity of the training sets is therefore greater than that of training sets generated by probabilistic sampling, as in Bagging and Boosting. It has been demonstrated that the generalization performance of an ensemble depends greatly on the diversity of its component NNs [4]. The generalization performance of an ensemble with greater diversity is better than that of one with smaller diversity. So, the generalization performance of InNNE is better than that of methods that do not cluster the training set, such as Bagging and Boosting.

Numerous approaches to weight assignment have been presented in the literature [1],[4],[6]. These approaches include simple average, weighted average, absolute majority vote and relative majority vote. They perform well in regression estimation and classification. Their main shortcoming is that the weight of each component NN is independent of the input data. In many cases, the performance of each component NN changes with the input data. Different component NNs have different abilities on different input data, because they learn from different training sets. The training set determines the performance of an NN, which contains knowledge about one profile of the learning problem. If the input data is close to the knowledge contained in the jth NN, the jth NN performs well on that input. If the input data is unrelated to the knowledge contained in the jth NN, the jth NN performs badly on that input.

The weights of an NNE have a major impact on its accuracy. In traditional approaches with fixed weights, the contribution of an NN with a low weight but good accuracy on the current input is weakened, while the contribution of an NN with a high weight but poor accuracy is enhanced. In InNNE, the weights are dynamically adjusted according to the input data. The weight of the jth NN with higher accuracy on the input data is relatively large, and the weight of the kth NN with lower accuracy on the input data is relatively small. The accuracy of the NNE can thus increase with the weight assignment of InNNE.
In fact, the weight assignment method of an NNE depends greatly on the method used to generate the component NNs. In Bagging, the training set of each component NN is drawn randomly from the original one, and samples may be selected repeatedly, so simple averaging is suitable for Bagging. InNNE is an integrated NNE approach that not only generates component NNs with greater diversity than Bagging and Boosting, but also adjusts the weights dynamically according to the input data.
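The difference between fixed and input-dependent weights can be illustrated with a small sketch. The weighting rule below (weights inversely proportional to the distance between the input and each cluster center) is only an assumption made for illustration; this section does not give the exact InNNE rule, and the names `distance_weights` and `dynamic_ensemble` are hypothetical.

```python
import math

def simple_average(component_outputs):
    """Fixed, input-independent weights: every component NN counts equally."""
    return sum(component_outputs) / len(component_outputs)

def distance_weights(x, centers, eps=1e-12):
    """Assumed rule: weight of component j is proportional to 1 / distance(x, center_j).

    eps only guards against division by zero when x coincides with a center.
    """
    inverse = [1.0 / (math.dist(x, c) + eps) for c in centers]
    total = sum(inverse)
    return [v / total for v in inverse]

def dynamic_ensemble(x, centers, component_outputs):
    """Input-dependent combination of the component outputs."""
    weights = distance_weights(x, centers)
    return sum(w * o for w, o in zip(weights, component_outputs))

# Hypothetical data: two cluster centers and the outputs of the two
# component NNs (each trained on one cluster) for the input x.
centers = [(0.0, 0.0), (10.0, 0.0)]
outputs = [1.0, 5.0]
x = (1.0, 0.0)

fixed = simple_average(outputs)                   # 3.0
dynamic = dynamic_ensemble(x, centers, outputs)   # close to 1.4: weights 0.9 and 0.1
print(fixed, dynamic)
```

With the input close to the first center, the first component dominates the combined output, whereas simple averaging ignores where the input lies.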
6 Conclusions
Ensembles were introduced to the NN community more than a decade ago. Most ensemble training algorithms either adjust the weights of an ensemble or train the component NNs, but few earlier works have addressed both aspects of an NNE at the same time. We proposed an integrated method that both adjusts the weights of the component NNs according to the input data and generates component NNs with great diversity. In InNNE, the training set is partitioned into several training subsets by clustering, and the subsets are used to train different component NNs. The weights of the ensemble are adjusted dynamically according to the correlation between the input data and the centers of the training subsets. In most cases the generalization performance of InNNE is better than that of other NNE algorithms. It should be noted that a good clustering result is necessary for good generalization performance. The clustering result strongly influences the diversity among the component NNs, and this diversity in turn strongly influences the generalization performance of the NNE. The clustering result is therefore the most important factor contributing to the generalization performance of InNNE. The distances among the subsets have a great influence on the clustering result. Although clustering cannot guarantee an optimal result, the diversity among components generated by clustering is greater than that obtained by random generation, as in Bagging and bootstrap. Many effective NNE algorithms, such as CNNE [4], could be incorporated into InNNE. Combining InNNE with CNNE is a promising approach to increasing the accuracy of the ensemble.
References
1. Zhou Z.H., Wu J.X. and Tang W.: Ensembling Neural Networks: Many Could Be Better Than All. Artificial Intelligence. 137 (2002) 239–263
2. Li K. and Huang H.K.: A Selective Approach to Neural Network Ensemble Based on Clustering Technique. Journal of Computer Research and Development. 42 (2005) 594–598
3. Li G.Z., Yang J., Kong A.S. and Chen N.Y.: Clustering Algorithm Based Selective Ensemble. Journal of Fudan University. 43 (2004) 689–695
4. Md. Monirul Islam, Yao X., and Kazuyuki M.: A Constructive Algorithm for Training Cooperative Neural Network Ensembles. IEEE Trans. Neural Networks. 14 (2003) 820–834
5. Wang Z.Q., Chen S.F., Chen Z.Q., and Xie J.Y.: A Parallel Learning Approach for Neural Network Ensemble. Artificial Intelligence. (2004) 1200–1205
6. Pitoyo Hartono and Shuji Hashimoto: Adaptive Neural Network Ensemble That Learns From Imperfect Supervisor. Proceedings of the 9th International Conference on Neural Information Processing Artificial Intelligence. (2002) 2561–2565
7. Yao X., Manfred Fischert and Gavin Brown: Neural Network Ensembles and Their Application to Traffic Flow Prediction in Telecommunications Networks. IEEE Trans. Evol. Comp. 4 (2001) 693–698
8. Liu Y. and Yao X.: Towards Designing Neural Network Ensembles by Evolution. Artificial Intelligence. J. Artificial Intelligence (1998) 623–632
9. Jiang S.Y. and Li Q.H.: Research on Dissimilarity for Clustering Analysis. J. Computer Engineering and Applications. 11 (2005) 146–149
10. Li C.X. and Yu J.: Study on the Classification of A Kind of Fuzzy Clustering Analysis. Journal of Beijing Jiaotong University. 29 (2005) 17–21
11. Gui X.L., Jin W.Z. and Hu Y.C.: Fuzzy Clustering Analysis and Its Application in Traffic Plan. J. Traffic and Computer. 22 (2005) 80–83
12. Zhou Z.H. and Chen S.F.: Neural Network Ensemble. Chinese Journal of Computers. 25 (2002) 1–8
13. Wang Z.Q., Chen S.F. and Chen Z.Q.: Parallel Learning Approach for Neural Network Ensemble. J. Chinese Journal of Computers. 28 (2005) 402–408
The Use of Stability Principle for Kernel Determination in Relevance Vector Machines
Dmitry Kropotov1, Dmitry Vetrov1, Nikita Ptashko2, and Oleg Vasiliev2
1 Dorodnicyn Computing Centre of the Russian Academy of Sciences, Vavilova str. 40, 119991, GSP-1, Moscow, Russia
[email protected], [email protected]
http://vetrovd.narod.ru, http://dkropotov.narod.ru
2 Moscow State University, Computational Mathematics and Cybernetics Department, Vorob’evy gori, 119992, Moscow, Russia
[email protected], [email protected]
Abstract. The task of RBF kernel selection in Relevance Vector Machines (RVM) is considered. RVM exploits a probabilistic Bayesian learning framework that offers a number of advantages over state-of-the-art Support Vector Machines. In particular, RVM avoids determining the regularization coefficient C by means of evidence maximization. In this paper we show that RBF kernel selection in the Bayesian framework requires an extension of the algorithmic model. In the new model, integration over the posterior probability becomes intractable, so a point estimate of the posterior probability is used. In RVM the evidence value is calculated via Laplace approximation. However, the extended model does not allow maximization of the posterior probability, because the dimension of the optimization parameter space becomes too high. Hence Laplace approximation can no longer be used in the new model. We propose a local evidence estimation method which establishes a compromise between accuracy and stability of the algorithm. We first briefly describe the maximal evidence principle, present the model of kernel algorithms together with our approximations for evidence estimation, and then give the results of experimental evaluation. Both classification and regression cases are considered.
Keywords: Kernel Selection; Stability of Classifiers; Bayesian Inference; Relevance Vector Machines.
1 Introduction
Support Vector Machines (SVM) [1] have proved to be a state-of-the-art technique for solving classification and regression problems. However, successful application of SVM requires choosing a particular kernel function as well as the regularization coefficient C (or its analogue). Different values of C and different forms of the kernel function lead to different behaviour of SVM on a particular task. Usually the parameters of the kernel function and the coefficient C are determined by a cross-validation procedure. This may be too computationally expensive. Moreover, the cross-validation estimates of performance, although unbiased [2], may have large variance due to the limited size of the learning sample.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 727–736, 2006. © Springer-Verlag Berlin Heidelberg 2006
Several methods for model selection in SVM and SVM-like models have been proposed, e.g. in [9,10,6,7,11,8]. A popular way is to apply the Bayesian learning framework and the maximal evidence principle [3]. Usually some probabilistic interpretation of SVM is provided, which is then used to adapt the maximal evidence principle [11,8]. However, such a probabilistic interpretation requires various approximations and changes to the initial SVM training algorithm. Here we consider an SVM-like algorithm which is constructed directly from a probabilistic model: Relevance Vector Machines (RVM), proposed by Tipping [4]. This approach does not require setting the coefficient C to restrict the values of the weights, as the corresponding regularization coefficients are adjusted automatically during training. However, the problem of kernel selection still remains. We focus on the most popular RBF kernel functions $K(x, z) = \exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$ and on the selection of the parameter σ, the width of the Gaussian. We show that applying the Bayesian framework to kernel selection requires an extension of the model of algorithms, namely the inclusion of kernel centers. Integration over the posterior probability in the new model becomes intractable, and hence a point estimate of the posterior probability is used. Laplace approximation for evidence estimation requires maximization of the posterior probability as well as computation of its Hessian. However, in the new model the high dimension of the optimization parameter space and the fact that the posterior probability is a multi-modal function make the application of Laplace approximation impossible. Instead, we propose a method of local evidence estimation which leads to a compromise between stability and training accuracy of the algorithm.
The paper is organized as follows. Section 2 briefly summarizes the ideas of Bayesian learning, the maximal evidence principle and Relevance Vector Machines. Section 3 presents the extended family of algorithms and our kernel selection procedure. In Section 4 experimental results on toy problems and real data are provided, while the last section gives the conclusion and discussion.
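As a small illustration of the role of the width σ, the RBF kernel above can be written in a few lines. This is a minimal sketch; the function name `rbf_kernel` and the sample points are chosen here for illustration only.

```python
import math

def rbf_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)) for two points given as sequences."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, z = (0.0, 0.0), (1.0, 0.0)   # two hypothetical points at unit distance
for sigma in (0.1, 1.0, 10.0):
    # Small sigma: the kernel is almost zero away from its center.
    # Large sigma: the kernel is almost constant over the whole region.
    print(sigma, rbf_kernel(x, z, sigma))
```

For points at unit distance the kernel value is negligible for σ = 0.1 and close to one for σ = 10, which is why the choice of σ controls how local the resulting decision function is.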
2 Relevance Vector Machines
Let $D_{train} = \{x, t\} = \{x_i, t_i\}_{i=1}^m$ be a training sample, where $x_i = (x_i^1, \dots, x_i^n)$ are feature vectors in n-dimensional real space and $t_i$ are hidden components, either real (for regression) or from $\{-1, 1\}$ (for classification). Consider the family of algorithms $h(x_{new}, w) = \sum_{i=1}^m w_i K(x_{new}, x_i) + w_0$, where $\{w_i\}_{i=0}^m$ are some real parameters or weights. Establish a normal prior distribution on the weights, $P(w_i|\alpha_i) \sim N(0, \alpha_i^{-1})$. The set of parameters α determines the model in which the posterior distribution over weights is looked for. For this model the evidence (or marginal likelihood) is given by the following equation:

$$P(t|x, \alpha) = \int_{W(\alpha)} P(t|x, w, \alpha) P(w|\alpha)\, dw \qquad (1)$$

where $P(t|x, w, \alpha)$ is the likelihood of the training data (or, more exactly, the likelihood of the hidden components configuration) with respect to the given algorithm, and $W(\alpha)$ is the weights space in the model determined by α. The likelihood function is determined by the expression $\prod_{i=1}^m \exp\left(-\frac{(t_i - h(x_i, w))^2}{2\lambda^2}\right)$ in the case of regression and is calculated as $\prod_{i=1}^m \frac{1}{1+\exp(-t_i h(x_i, w))}$ in the case of classification.
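The decision function and the two likelihood expressions can be sketched as follows. Working with logarithms of the likelihoods is a choice made here for numerical convenience, not something prescribed by the text, and all function names are illustrative.

```python
import math

def rbf_kernel(x, z, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision(x_new, xs, w, w0, sigma):
    """h(x_new, w) = sum_i w_i K(x_new, x_i) + w_0, kernels centred at the training objects xs."""
    return w0 + sum(wi * rbf_kernel(x_new, xi, sigma) for wi, xi in zip(w, xs))

def log_sigmoid(u):
    """Numerically stable log(1 / (1 + exp(-u)))."""
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def log_likelihood_classification(xs, ts, w, w0, sigma):
    """Log of prod_i 1 / (1 + exp(-t_i h(x_i, w))), with labels t_i in {-1, 1}."""
    return sum(log_sigmoid(t * decision(x, xs, w, w0, sigma)) for x, t in zip(xs, ts))

def log_likelihood_regression(xs, ts, w, w0, sigma, lam):
    """Log of prod_i exp(-(t_i - h(x_i, w))^2 / (2 lam^2)); no normalizing constant is included."""
    return sum(-(t - decision(x, xs, w, w0, sigma)) ** 2 / (2.0 * lam ** 2)
               for x, t in zip(xs, ts))

# Hypothetical two-object training sample.
xs = [(0.0,), (5.0,)]
ts = [1, -1]
print(log_likelihood_classification(xs, ts, [2.0, -2.0], 0.0, 1.0))
print(log_likelihood_classification(xs, ts, [0.0, 0.0], 0.0, 1.0))
```

With all weights equal to zero every object has likelihood 1/2, so the second printed value is 2 log(1/2); weights with the correct signs give a larger value.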
Using the known maximal evidence principle, we should select α by maximizing (1) and then obtain the posterior distribution $P(w|t, x, \alpha) \propto P(t|x, w, \alpha)P(w|\alpha)$. For classification problems direct calculation of (1) is impossible because the integral is intractable. Tipping used Laplace approximation for its estimation. The function $L_\alpha(w) = \log Q_\alpha(w) = \log(P(t|x, w, \alpha)P(w|\alpha))$ is approximated by a quadratic function using its Taylor decomposition with respect to w at the point of maximum $w_{MP}$. Such an approximation can then be integrated, yielding

$$P(t|x, \alpha) \approx Q_\alpha(w_{MP})\, |\Sigma|^{1/2}, \qquad (2)$$

$$\Sigma = \left(-\nabla_w \nabla_w L_\alpha(w)\big|_{w=w_{MP}}\right)^{-1} = \left(-\nabla_w \nabla_w \log P(t|x, w, \alpha)\big|_{w=w_{MP}} + A\right)^{-1} \qquad (3)$$

where $A = \mathrm{diag}(\alpha_1, \dots, \alpha_m)$; the term $+A$ comes from the Gaussian prior, whose logarithm contributes $-\frac{1}{2} w^T A w$ to $L_\alpha(w)$. Note that for regression problems expression (2) becomes an exact equality. Differentiating expression (2) with respect to α and setting the derivatives to zero leads to the following iterative re-estimation equations:

$$\alpha_i^{new} = \frac{\gamma_i}{w_{MP,i}^2} \qquad (4)$$

$$\gamma_i = 1 - \alpha_i^{old} \Sigma_{ii} \qquad (5)$$
Here $\gamma_i$ is the so-called effective weight of the i-th parameter. It shows how much the corresponding weight is constrained by the regularization term established by the prior. It can easily be shown that $\gamma_i \in [0, 1]$. If $\alpha_i$ is close to zero, $w_i$ is almost unconstrained and $\gamma_i$ is close to one. Conversely, for large $\alpha_i$ the corresponding parameter $w_i$ is close to zero and is not much affected by the training information, so its effective weight tends to zero. The training procedure consists of three iterative steps. First we search for the maximum point $w_{MP}$ of $L_\alpha(w)$. Then we estimate Σ according to (3) and use (4), (5) to get the new α values. The steps are repeated until the process converges. In the Bayesian framework the decision is made by integrating over all algorithms within the model with respect to the probabilistic measure given by the posterior probability $P(w|t, x, \alpha)$:

$$P(t_{new}|x_{new}, t, x, \alpha) = \int_{W(\alpha)} P(t_{new}|x_{new}, w, \alpha) P(w|t, x, \alpha)\, dw \qquad (6)$$

In RVM the posterior distribution is approximated by setting $P(w|t, x, \alpha) \approx \delta(w - w_{MP})$, resulting in the expression:

$$P(t_{new}|x_{new}, t, x, \alpha) = P(t_{new}|x_{new}, w_{MP}, \alpha) \qquad (7)$$

It was shown [4] that RVM provides approximately the same quality as SVM. Moreover, RVM turned out to be much sparser, i.e. the fraction of non-zero weights (relevance vectors) is significantly smaller than the fraction of support vectors.
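The re-estimation step (4)-(5) of the training loop can be sketched in isolation. The sketch assumes that $w_{MP}$ and the diagonal of Σ have already been computed by the first two steps; the function name `reestimate_alpha`, the guard `eps` and the numbers in the example are assumptions made here for illustration.

```python
def reestimate_alpha(alpha_old, w_mp, sigma_diag, eps=1e-12):
    """One pass of equations (4)-(5).

    gamma_i     = 1 - alpha_i^old * Sigma_ii
    alpha_i^new = gamma_i / w_MP,i^2
    eps is an assumed guard that avoids division by zero for pruned weights.
    """
    gammas, alpha_new = [], []
    for a, w, s in zip(alpha_old, w_mp, sigma_diag):
        gamma = 1.0 - a * s
        gammas.append(gamma)
        alpha_new.append(gamma / max(w * w, eps))
    return gammas, alpha_new

# Hypothetical values: the first weight is well determined by the data,
# the second is dominated by its prior.
gammas, alpha_new = reestimate_alpha(
    alpha_old=[0.5, 100.0],
    w_mp=[2.0, 0.01],
    sigma_diag=[0.2, 0.0099],
)
print(gammas)     # approximately [0.9, 0.01]
print(alpha_new)  # approximately [0.225, 100.0]
```

The first parameter has an effective weight close to one and receives a small α, while the second stays strongly regularized, which is the mechanism that makes the resulting model sparse.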
3 Kernel Selection
Although the maximal evidence principle is formulated entirely in probabilistic terms, we may suggest another interpretation of it. Equation (2) can be viewed as a compromise between the accuracy of the algorithm on the training sample (the value of $Q_\alpha(w_{MP})$) and its stability with respect to small changes of the parameters (expressed by the square root of the inverse Hessian determinant). Then we may formulate the stability principle: the more "stable" the algorithm is, the better its generalization ability becomes. The notion of stability is quite informal. Different definitions of stability and their relation to generalization ability were investigated in [13,14]. Here we understand stability as the ability to keep the likelihood (or, more exactly, the values of $Q_\alpha(w)$) large for as long as possible when moving away from the point of maximum in the parameter space of the algorithms. Such a view allows us to modify the concept of Bayesian regularization for cases where its direct application is impossible or not reasonable.
In the straightforward approach the kernel parameter σ can be treated as one more meta-parameter (like α), and the evidence maximization procedure can be used for its determination [11,17]. However, in this way too small values of σ can be chosen. Indeed, small σ values lead to overfitting and high accuracy on the training sample (a high value of the first term in (2)). In this case almost all objects from the training set have non-zero weights and the influence of neighboring objects can be neglected. Small variations of an object's weight just change the height of the corresponding kernel function, but do not change the classification of the object at the kernel center (Fig. 1 (a)). This means that a small modification of a weight cannot change $L_\alpha(w)$ much, and the likelihood after the modification is still very high. At the same time the second term in (2) even encourages small σ, as the algorithm becomes more stable with respect to changes of the weights.
However, if we start moving the position of the kernel center, the likelihood of the training object changes dramatically (Fig. 1 (b)). So small σ makes classification unstable with respect to shifts of the kernel centers. In fact, stability with respect to weight changes is important for the selection of the regularization coefficients α, whereas the kernel parameter σ is responsible for stability with respect to kernel shifts. Hence kernel selection requires the inclusion of the kernel centers into the decision model, resulting in $h_E(x_{new}, w, z) = \sum_{i=1}^m w_i K(x_{new}, z_i) + w_0$. In the extended model direct calculation of the evidence (1) becomes impossible even for the regression case. Laplace approximation for the evidence requires additional optimization w.r.t. the kernel locations z, maximizing

$$L_{\sigma,\alpha}(w, z) = \log(P(t|x, w, z, \alpha) P(w|\alpha) P(z)) \qquad (8)$$
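The contrast between weight changes and center shifts for a narrow kernel can be checked numerically on a single training object. All numbers below (the weight, the 10% weight change, the shift of 0.3) are hypothetical and chosen only to illustrate the effect; the bias $w_0$ is set to zero.

```python
import math

def rbf_kernel(x, z, sigma):
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))

def object_likelihood(t, x, w, z, sigma):
    """Classification likelihood 1 / (1 + exp(-t h(x))) for a single kernel with w0 = 0."""
    h = w * rbf_kernel(x, z, sigma)
    return 1.0 / (1.0 + math.exp(-t * h))

sigma = 0.1          # narrow kernel
x1, t1 = 0.0, 1      # one training object and its label
w1, z1 = 3.0, 0.0    # its weight; the kernel is centred at the object itself

base = object_likelihood(t1, x1, w1, z1, sigma)
after_weight_change = object_likelihood(t1, x1, 0.9 * w1, z1, sigma)
after_center_shift = object_likelihood(t1, x1, w1, z1 + 0.3, sigma)
print(base, after_weight_change, after_center_shift)
```

Reducing the weight by 10% leaves the likelihood of the object above 0.93, while shifting the center by 0.3 (three kernel widths) drops it to roughly one half, i.e. the object is no longer classified with any confidence.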
Unfortunately, optimization of $L_{\sigma,\alpha}(w, z)$ with respect to z is too difficult because of the large number of dimensions, as $z \in R^{mn}$. Moreover, unlike $h(x, w)$, the function $h_E(x, w, z)$ is non-linear with respect to the kernel centers z, and hence $L_{\sigma,\alpha}(w, z)$ is a multi-modal function. This makes optimization even harder.

[Figure 1: two panels plotting h(x) against x with the training objects x1, x2, x3 marked. Panel (a): initial h(x) and h(x) after small weight modification. Panel (b): initial h(x) and h(x) after small kernel center modification.]
Fig. 1. The likelihood of the training sample is a product of the likelihoods at each training object x1, x2, x3. For small σ values, a small change of weight still keeps the likelihood of the corresponding object high enough (a), while small shifts of the relevant point (Gaussian center) make the likelihood significantly lower (b).

In the Bayesian framework the decision rule is constructed with the aid of equation (6). But in our case (6) is an intractable integral, and hence we would prefer to use only the algorithm which was obtained via maximization of $L_{\sigma,\alpha}(w, z)$. If the function $L_{\sigma,\alpha}(w, z)$ were unimodal, it could be approximated by its local behaviour at the maximum point $(w_{MP}, z_{MP})$. Now consider the following situation. Our solution is located in a narrow peak at the point $(w_{MP}, z_{MP})$, but there is a good stable algorithm somewhere else within the model. The evidence of the obtained answer will be high, but the generalization ability of this single algorithm is poor (see Fig. 2). Exact evidence calculation makes sense when we are able to perform the integration (6). However, we can use only the point estimate (7). In the stability approach only local characteristics of the point taken as the final solution should be considered. Such characteristics are the value of the function $L_{\sigma,\alpha}(w, z)$ and its derivatives, which represent an instability measure. The analogies with the Bayesian framework can be used to unite these values into one equation.
Optimization of kernel locations is a very difficult and time-consuming task. Moreover, our experiments show that such optimization gives nearly no gain in accuracy while the training time increases significantly. So we propose keeping the kernel centers at the training objects while estimating the algorithm's stability with respect to hypothetical kernel shifts. Then $w_{MP}$ can be treated as a constant which does not depend on z. Assuming that there are no prior constraints on the location of the centers, we establish an improper uniform prior $P(z) = const$. Denote by $A_i$ the stability of $P(t|x, w_{MP}, z)$ with respect to the kernel located at $z_i$. We assume that it may be decomposed as if the stabilities with respect to different coordinates were independent:

$$A_i = \prod_{j=1}^n A_{ij}$$
Fig. 2. Example of a model which has a large evidence value but a quite poor point estimate. There is no benefit from a large evidence value if we use only the algorithm with $(w, z) = (w_{MP}, z_{MP})$. At the same time, local characteristics of the point $(w_{MP}, z_{MP})$, such as $\nabla_{w,z}\nabla_{w,z} Q(w, z)|_{(w,z)=(w_{MP}, z_{MP})}$, penalize the obtained algorithm belonging to the model.
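$A_{ij}$ is determined by integrating the approximation of $\log P(t|x, w_{MP}, z, \alpha)$ by a parabolic function, obtained from its Taylor decomposition at the point z = x with respect to $z_{ij}$:

$$A_{ij} = \begin{cases} |a|^{-1}, & \text{if } b \le 0 \\ \frac{1}{2}\sqrt{\frac{2\pi}{b}}\, \exp\left(\frac{a^2}{2b}\right)\left(1 - \mathrm{erf}\left(\frac{|a|}{\sqrt{2b}}\right)\right), & \text{otherwise} \end{cases} \qquad (9)$$

where

$$a = \frac{\partial \log P(t|x, w_{MP}, z, \alpha)}{\partial z_{ij}}, \qquad b = -\frac{\partial^2 \log P(t|x, w_{MP}, z, \alpha)}{(\partial z_{ij})^2}$$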
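One way to read the stability measure is as the integral over $s \ge 0$ of $\exp(-|a| s - b s^2 / 2)$, i.e. the area under the decreasing tail of the exponentiated linear or parabolic approximation, normalized by its value at the expansion point. The sketch below writes the corresponding closed form, $1/|a|$ for $b \le 0$ and $\frac{1}{2}\sqrt{2\pi/b}\,\exp(a^2/(2b))\,(1 - \mathrm{erf}(|a|/\sqrt{2b}))$ otherwise, and compares it with a direct numerical integration. This reading of the formula, the function names and the test values are assumptions made here; `math.erfc(u)` equals `1 - math.erf(u)`.

```python
import math

def stability(a, b):
    """Closed-form tail integral for first derivative a and negated second derivative b.

    Raises ZeroDivisionError when a == 0 and b <= 0, where the tail has infinite area.
    """
    if b <= 0:
        return 1.0 / abs(a)
    return (0.5 * math.sqrt(2.0 * math.pi / b)
            * math.exp(a * a / (2.0 * b))
            * math.erfc(abs(a) / math.sqrt(2.0 * b)))

def tail_integral(a, b, upper=40.0, steps=100_000):
    """Trapezoidal estimate of the integral of exp(-|a| s - b s^2 / 2) over 0 <= s <= upper.

    For b <= 0 the quadratic term is dropped (linear approximation).
    """
    b = max(b, 0.0)
    h = upper / steps

    def f(s):
        return math.exp(-abs(a) * s - 0.5 * b * s * s)

    total = 0.5 * (f(0.0) + f(upper))
    for k in range(1, steps):
        total += f(k * h)
    return total * h

for a, b in [(1.0, 0.0), (0.5, 2.0), (0.0, 1.0), (-2.0, 0.5)]:
    print(a, b, stability(a, b), tail_integral(a, b))
```

The two columns agree to several digits for each pair, which supports this reading of the closed form. A steep slope (large |a|) or strong curvature (large b) gives a small stability value, i.e. the likelihood decays quickly when the kernel center is shifted.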
The meaning of equation (9) is shown in Fig. 3. When estimating the algorithm's stability, we would first of all like to insure ourselves against accuracy degradation on the test sample. So $f(z_{ij}) = \log P(t|x, w_{MP}, z, \alpha)$ is approximated by a negative parabola, or by a line (if the second derivative is non-negative), at the point $z_{ij} = x_i^j$, and the decreasing tail of the approximation is integrated, yielding the stability measure $A_{ij}$. If $x_i^j$ were an extremum point of $f(z_{ij})$, then $A_{ij}$ would be proportional to the result of Laplace approximation taken along the $x_i^j$ coordinate. To unite stability and accuracy in one expression we should consider the weight of each kernel. Indeed, if the weight of a kernel is close to zero, its stability does not play an important role. Taking into consideration the effective weights (5) of each kernel, $\gamma_i$, which vary from 0 to 1, we get the expression for the total stability of the likelihood with respect to all kernels:

$$Z = \prod_{i=1}^m A_i^{\gamma_i} = \prod_{i=1}^m \left(\prod_{j=1}^n A_{ij}\right)^{\gamma_i} \qquad (10)$$
Fig. 3. Algorithm stability (grey area) is expressed as the integral of the tail in the Laplace approximation of Q(w, z) for each $z_{ij}$
Multiplying Z by the value of the likelihood at the point $w_{MP}$, we get the kernel validity value

$$KV = P(t|x, w_{MP}, z, \alpha)\, Z \qquad (11)$$

The kernel function which corresponds to the largest validity value is supposed to be the best one for the particular task. Thus the procedure for selecting the width parameter σ in the Gaussian parametric family of kernel functions becomes the following:
1. Choose some σ value.
2. Put z = x.
3. Train the RVM algorithm with the selected σ.
4. At the point $w_{MP}$ calculate the kernel validity (11), where the components $A_{ij}$ are taken from (9), while the effective weights $\gamma_i$ are determined by (5).
The σ value corresponding to the largest validity value is considered to be the optimal one.
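The last step of the procedure and the final choice of σ can be sketched as follows, assuming that RVM training has already produced, for each candidate σ, the log-likelihood at $w_{MP}$, the stabilities $A_{ij}$ and the effective weights $\gamma_i$. Using the logarithm of KV instead of KV itself is a choice made here to avoid numerical underflow; it does not change which σ is selected. The function names and all numbers are hypothetical.

```python
import math

def log_kernel_validity(log_likelihood, stabilities, gammas):
    """log KV = log P(t|x, w_MP, z, alpha) + sum_i gamma_i * sum_j log A_ij, from (10)-(11).

    stabilities[i][j] holds A_ij for kernel i and coordinate j; gammas[i] is gamma_i.
    """
    log_z = sum(g * sum(math.log(a_ij) for a_ij in row)
                for row, g in zip(stabilities, gammas))
    return log_likelihood + log_z

def select_sigma(candidates):
    """candidates maps sigma -> (log_likelihood, stabilities, gammas); returns the best sigma."""
    return max(candidates, key=lambda s: log_kernel_validity(*candidates[s]))

# Hypothetical results for two widths, one kernel with two coordinates each:
# the narrow kernel fits the training sample better but is far less stable.
candidates = {
    0.1: (-1.0, [[0.05, 0.05]], [1.0]),
    1.0: (-3.0, [[0.8, 0.9]], [0.9]),
}
best = select_sigma(candidates)
print(best)
```

In this made-up example the wider kernel is selected: its lower training log-likelihood is outweighed by the much larger stability term.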
4 Experimental Results
We compare the kernel selection performance of the kernel validity index with that of cross-validation using 9 classification and 5 regression problems from the UCI repository. For each task we randomly split the data 20 times into train (33%) and test (67%) sets and use RVM with kernels of different widths (σ = 0.01, 0.1, 0.3, 1, 2, 3, 4, 5, 7, 10). Test errors and sums of squared deviations corresponding to the kernels with maximum validity and with the best cross-validation estimate, averaged over the 20 pairs of train/test tables, are shown together with their standard deviations in Tables 1 and 2. Columns RVM CV and SVM CV show the averaged test error with kernel selection according to 5-fold cross-validation for RVM and SVM. RVM MV shows the averaged test errors corresponding to the maximum kernel validity index. Column SVM MV shows how SVM performs with the same kernels as in RVM MV. This column helps us to check whether the optimal kernel width is defined only by the problem itself or also by the training algorithm. Finally, the MinTestError column contains the minimal possible test error. The results from Table 1 were rated in the following way. The smallest test error was given one point, the second smallest two points, etc. The worst result was assigned four points. Total results are shown in the last line of the table.
Experimental results show that RVM and SVM have competitive performance, although RVM generated 5-8 times fewer kernels than the corresponding SVM. Also, our kernel validity measure works at least as well as the cross-validation alternative for classification and slightly worse for regression. But our approach has two advantages. The algorithm needs to be trained only once, thus requiring significantly less training time. Another good property of the proposed index is its unimodality. Unlike the cross-validation measure, which has many local extrema, KV(σ) may be optimized using gradient or quasi-gradient methods. A very interesting effect is the poor quality of SVM performance with the kernels which were considered to be the best (in the sense of our validity measure) for RVM. This shows that kernel validity depends strongly on the method of training the vector machine classifier. We should also mention that neither cross-validation nor the maximum validity index leads to the minimum possible test error. This can be connected both with peculiarities of the training sample and with the fact that the test sample may be biased with respect to the universal set.
Table 1. Experimental results for classification problems (error rates and standard deviations)

Sample Name    RVM CV         SVM CV         RVM MV         SVM MV         MinTestError
AUSTRALIAN     15.5 ± 1.2     16.5 ± 1.9     18.6 ± 4.35    21 ± 3.6       13.4
BUPA           41 ± 0.4       37.5 ± 2.5     39 ± 3.6       37.6 ± 3.8     31
CLEVELAND      18.6 ± 1.8     21 ± 2.7       20 ± 3.5       28 ± 5.6       17
CREDIT         17.3 ± 2.7     18 ± 1.6       16.9 ± 2.4     20 ± 2.9       14.5
HEPATITIS      43 ± 5.6       39.17 ± 3.8    39 ± 3.9       39.21 ± 4.6    36
HUNGARY        22 ± 4.4       20 ± 2.3       24 ± 5.3       26 ± 4         18
LONG BEACH     25.25 ± 0.5    25.18 ± 0.9    27 ± 4.7       26 ± 4.6       24.5
PIMA           34 ± 2.7       30 ± 2         27 ± 2.5       29.6 ± 2.9     23
SWITZERLAND    6.4 ± 1.6      8 ± 1.8        7 ± 2          7.6 ± 2.3      5.8
Total          21             20             20             29
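The point totals in the last line of Table 1 can be reproduced from the mean error rates with the rating rule described in the text (one point for the smallest error, four for the largest). The sketch below uses the mean values transcribed from the table; the helper name `rank_points` is illustrative.

```python
methods = ("RVM CV", "SVM CV", "RVM MV", "SVM MV")

# Mean test error rates from Table 1, in the order of `methods`.
errors = {
    "AUSTRALIAN":  (15.5, 16.5, 18.6, 21.0),
    "BUPA":        (41.0, 37.5, 39.0, 37.6),
    "CLEVELAND":   (18.6, 21.0, 20.0, 28.0),
    "CREDIT":      (17.3, 18.0, 16.9, 20.0),
    "HEPATITIS":   (43.0, 39.17, 39.0, 39.21),
    "HUNGARY":     (22.0, 20.0, 24.0, 26.0),
    "LONG BEACH":  (25.25, 25.18, 27.0, 26.0),
    "PIMA":        (34.0, 30.0, 27.0, 29.6),
    "SWITZERLAND": (6.4, 8.0, 7.0, 7.6),
}

def rank_points(row):
    """1 point for the smallest error in the row, 2 for the next, and so on."""
    order = sorted(range(len(row)), key=lambda k: row[k])
    points = [0] * len(row)
    for place, k in enumerate(order, start=1):
        points[k] = place
    return points

totals = [0] * len(methods)
for row in errors.values():
    for k, p in enumerate(rank_points(row)):
        totals[k] += p

print(dict(zip(methods, totals)))
# {'RVM CV': 21, 'SVM CV': 20, 'RVM MV': 20, 'SVM MV': 29}
```

There are no ties in any row, so the ranking is unambiguous, and the computed totals match the last line of the table.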
Table 2. Experimental results for regression problems (sum of squared deviations). The last column presents the ratio of quality deterioration with respect to the minimal possible test error for MV and CV, given in percent.

Sample Name    # Tr. obj.   RVM CV              RVM MV              MinTestError   MV/CV ratio
BOSTON         100          3.7694 ± 0.5821     3.909 ± 0.5488      3.4351         141%
PYRIMIDINES    52           0.0652 ± 0.0112     0.0682 ± 0.0096     0.0587         146%
SERVO          117          0.9784 ± 0.0765     1.0351 ± 0.0544     0.8973         169%
TRIAZINES      131          0.1154 ± 0.0076     0.1137 ± 0.0073     0.1085         75%
WISCONSIN      33           18.2432 ± 2.1945    19.2693 ± 1.3362    17.0904        189%
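The last column of Table 2 appears to be consistent with the ratio (MV − MinTestError) / (CV − MinTestError), truncated to a whole percent. This formula is inferred here from the tabulated numbers and is not stated explicitly in the text; the sketch below checks it against the mean values of the table.

```python
# (RVM CV, RVM MV, MinTestError) mean values from Table 2.
rows = {
    "BOSTON":      (3.7694, 3.909, 3.4351),
    "PYRIMIDINES": (0.0652, 0.0682, 0.0587),
    "SERVO":       (0.9784, 1.0351, 0.8973),
    "TRIAZINES":   (0.1154, 0.1137, 0.1085),
    "WISCONSIN":   (18.2432, 19.2693, 17.0904),
}

def deterioration_ratio(cv, mv, best):
    """Excess error of MV over the minimum, relative to the excess error of CV, in percent."""
    return 100.0 * (mv - best) / (cv - best)

# Truncation to whole percent is an assumption that matches the tabulated values.
ratios = {name: int(deterioration_ratio(*values)) for name, values in rows.items()}
print(ratios)
# {'BOSTON': 141, 'PYRIMIDINES': 146, 'SERVO': 169, 'TRIAZINES': 75, 'WISCONSIN': 189}
```

A value above 100% means that the validity index loses more accuracy relative to the best possible kernel than cross-validation does; only TRIAZINES falls below 100%.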
5 Discussion and Conclusion
Unlike structural risk minimization [2], which restricts too flexible classifiers, and the minimum description length approach [16], which penalizes algorithmic complexity, the concept of Bayesian regularization (and its modification described above) tries to establish a model in which the solution is stable with respect to changes of the classifier parameters. We decided to move away from the probabilistic approach and concentrate directly on the idea of stability rather than on applying the maximal likelihood principle to models (i.e. maximizing evidence). The proposed kernel validity characteristic does not show how good the kernel is for a particular task in general. It can only serve to estimate the utility of a kernel for a fixed training procedure (in our case RVM). This happens because we do not estimate the validity of the whole model (as we use only one classifier with $w = w_{MP}$) but consider only the local stability of $Q_\alpha(w)$ at the point $w_{MP}$. The idea of taking into consideration both the model of algorithms and the particular training procedure (our ability to find a good algorithm inside the model) when estimating an algorithm's quality is not novel. For example, Vapnik proposed the so-called effective VC dimension [2]. Unlike the traditional VC dimension, the new notion takes the training sample into account and considers only those algorithms which can be obtained inside the model using the particular training sample. As a result, error bounds become more accurate. The popular boosting and bagging techniques are said to increase both the training accuracy and the generalization ability of algorithms. These methods make the model of algorithms considerably more complex. Nevertheless, an effective way of choosing a particular algorithm inside the extended model avoids the drawbacks of such complication. Explicit consideration of the training procedure together with the properties of the model has led to a new theory of algorithm quality estimates based on a combinatorial approach [15].
In our case we are not able to consider all possible algorithms inside the model (i.e. to integrate over the posterior probability $P(w|t, x, \alpha)$). However, consideration of the local stability of Q(w, z) at the point $w_{MP}$ (our ability to find a good algorithm inside the model) gives us an appropriate technique for the kernel selection task. This method seems to be quite general and could probably be applied to other complex machine learning algorithms for tuning their model parameters.
Acknowledgements
This work was supported by the Russian Foundation for Basic Research (projects No. 06-01-00492, 06-01-08045, 05-07-90333, 05-01-00332, 04-01-00161) and INTAS (YS 04-83-2942, 04-77-7036).
References
1. Burges, C.J.S.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121–167
2. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag New York (1995)
3. MacKay, D.J.C.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press (2003)
4. Tipping, M.E.: Sparse Bayesian Learning and the Relevance Vector Machines. Journal of Machine Learning Research 1 (2001) 211–244
5. Murphy, P.M., Aha, D.W.: UCI Repository of Machine Learning Databases [Machine Readable Data Repository]. Univ. of California, Dept. of Information and Computer Science, Irvine, Calif. (1996)
6. Ayat, N.E., Cheriet, M., Suen, C.Y.: Optimization of SVM Kernels using an Empirical Error Minimization Scheme. Proc. of the First International Workshop on Pattern Recognition with Support Vector Machines (2002)
7. Friedrichs, F., Igel, C.: Evolutionary Tuning of Multiple SVM Parameters. Neurocomputing 64 (2005) 107–117
8. Gold, C., Sollich, P.: Model Selection for Support Vector Machine Classification. Neurocomputing 55(1-2) (2003) 221–249
9. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature Selection for Support Vector Machines. Proc. of 15th International Conference on Pattern Recognition, Vol. 2 (2000)
10. Chapelle, O., Vapnik, V.: Model Selection for Support Vector Machines. Advances in Neural Information Processing Systems 12, ed. S.A. Solla, T.K. Leen and K.-R. Muller, MIT Press (2000)
11. Kwok, J.T.-Y.: The Evidence Framework Applied to Support Vector Machines. IEEE-NN 11(5) (2000)
12. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer (2001)
13. Kutin, S., Niyogi, P.: Almost-everywhere algorithmic stability and generalization error. Tech. Rep. TR-2002-03: University of Chicago (2002)
14. Bousquet, O., Elisseeff, A.: Algorithmic stability and generalization performance. Advances in Neural Information Processing Systems 13 (2001)
15. Vorontsov, K.V.: Combinatorial substantiation of learning algorithms. Journal of Comp. Maths Math. Phys. 44(11) (2004) 1997–2009 http://www.ccas.ru/frc/papers/voron04jvm-eng.pdf
16. Rissanen J.: Modelling by the shortest data description. Automatica 14 (1978)
17. Van Gestel, T., Suykens, J., Lanckriet, G., Lambrechts, A., De Moor, B., Vandewalle, J.: Bayesian Framework for Least Squares Support Vector Machine Classifiers, Gaussian Processes and Kernel Fisher Discriminant Analysis. Neural Computation 15(5) (2002) 1115–1148
Adaptive Kernel Learning Networks with Application to Nonlinear System Identification
Haiqing Wang1,2, Ping Li1, Zhihuan Song1, and Steven X. Ding2
1 National Lab of Industrial Control Technology, Zhejiang University, Hangzhou, 310027, P.R. China
{hqwang, pli, zhsong}@iipc.zju.edu.cn
2 University of Duisburg-Essen, Inst. Auto. Cont. and Comp. Sys., Bismarckstr. 81, BB-413, 47057 Duisburg, Germany
[email protected]
Abstract. By kernelizing the traditional least-squares based identification method, an adaptive kernel learning (AKL) network is proposed for nonlinear process modeling, which utilizes kernel mapping and geometric angle to build the network topology adaptively. The generalization ability of the AKL network is controlled by introducing a regularized optimization function. Two forms of learning strategies are addressed and their corresponding recursive algorithms are derived. Numerical simulations show that these simple AKL networks can learn the process nonlinearities from very small samples, and have excellent modeling performance in both deterministic and stochastic environments.
1 Introduction
Model-based automation technologies find wide application in industry, and thus the availability and quality of the plant model are crucial. However, an industrial process is generally a nonlinear MIMO system; further, it is not uncommon that only limited and delayed noisy measurements are available, especially for variables concerned with product quality. Therefore, building a practicable model of an industrial process is still a challenge in practice, although nonlinear system identification (NSI) theory has been studied intensively for decades [1~4]. The contributions of this paper are twofold. Firstly, a new kernelized NSI method, the adaptive kernel learning (AKL) network, is proposed. This adaptive and complexity-controlled NSI framework is motivated by two existing elements from the field of statistical learning theory (SLT) and kernel methods [5~9], namely training set approximation and regularization, reformulated specially for the NSI problem. Secondly, two forms of learning strategies are separately addressed under the proposed framework, and their corresponding recursive algorithms are derived. Numerical simulations, compared with other computational intelligence NSI methods such as neural networks and fuzzy systems [3,4], show that the proposed AKL networks have clear methodological advantages in model complexity and structure selection, and have excellent online modeling performance in both deterministic and stochastic environments with very small samples.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 737–746, 2006. © Springer-Verlag Berlin Heidelberg 2006
H. Wang et al.
2 Problem Description
A conventional linear identification framework can generally be expressed as
$$ y_t = \alpha^T x_t + \varepsilon_t \tag{1} $$
where $y_t$ denotes the output measurement at time instant $t$ (for convenience it is written here in single-output form; the multiple-output form of the AKL network is straightforward), and $x_t$ is a general input vector that may contain several variables measured at time $t$, usually combined with their delayed values and the constant term 1, or with delayed outputs. The symbols $\alpha$ and $\varepsilon$ denote the model parameter vector and the process noise, respectively. To extend this to the nonlinear case, the well-known "kernel trick" is used [5, 6]:
$$ y_t = \alpha^T \phi(x_t) + \varepsilon_t = \langle \alpha, \phi(x_t) \rangle + \varepsilon_t \tag{2} $$
where $\phi: x \to F$ is a nonlinear operator that projects the original input into a much higher dimensional feature space $F$. This mapping operator cannot be selected arbitrarily; it must satisfy Mercer's theorem to ensure the computational properties exploited later [5,6]. The same notation is kept for the other terms in Eq. (2) to avoid introducing too many symbols. To obtain the parameter vector $\alpha$, a quadratic cost function is generally optimized:
$$ J(\alpha) = \sum_{i=1}^{t} (f(x_i) - y_i)^2 = \| \Phi_t^T \alpha - y_t \|^2 \tag{3} $$
where $f(x_i) = \alpha^T \phi(x_i)$, $\Phi_t = [\phi(x_1), \ldots, \phi(x_t)]$ and $y_t = [y_1, \ldots, y_t]^T$. The least squares (LS) solution to Eq. (3) has the form [2]: $\alpha_t = (\Phi_t \Phi_t^T)^{-1} \Phi_t y_t$.
However, unlike its linear counterpart, the dimension of the matrix $\Phi_t$ may be very high or even infinite, which makes the direct LS solution of Eq. (3) intractable. A common remedy is to transfer the optimization problem of Eq. (3) to its dual. According to the Representer Theorem [6], the solution $\alpha$ of such a quadratic cost function is a linear combination of all feature vectors:
$$ \alpha_t = \sum_{i=1}^{t} \beta_i \phi(x_i) = \Phi_t \beta_t \tag{4} $$
where $\beta_t = [\beta_1, \ldots, \beta_t]^T$ is referred to as the dual model vector. Substituting the vector $\alpha$ by its dual counterpart gives a new cost function in place of Eq. (3):
$$ J(\beta) = \| \Phi_t^T \Phi_t \beta_t - y_t \|^2 = \| K_t \beta_t - y_t \|^2 \tag{5} $$
where $K_t$ is a kernel matrix [6] whose elements are defined as $K_t(i,j) = \langle \phi(x_i), \phi(x_j) \rangle$, $i, j = 1, \ldots, t$. The LS solution to Eq. (5) is now tractable and yields
$$ \beta_t = K_t^{+} y_t \tag{6} $$
where $K_t^{+}$ denotes the Moore-Penrose inverse of $K_t$.
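The dual solution of Eq. (6) can be illustrated with a minimal sketch. It assumes scalar inputs, a Gaussian kernel with an arbitrarily chosen width, made-up sample values, and a nonsingular kernel matrix, so that a plain linear solve stands in for the Moore-Penrose inverse. The helper names (`gaussian_kernel`, `solve`, `predict`) are hypothetical and not part of the original method.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel for scalar inputs (assumed form, width chosen arbitrarily)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Illustrative training data (not from the paper).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.8, 0.9, 0.1]

# Kernel matrix K_t(i, j) = <phi(x_i), phi(x_j)> evaluated through the kernel.
K = [[gaussian_kernel(a, b) for b in xs] for a in xs]

# Dual model vector: here K is nonsingular, so K^+ y reduces to solving K beta = y.
beta = solve(K, ys)

def predict(x):
    """Model output as a weighted sum of kernel evaluations against all samples."""
    return sum(b * gaussian_kernel(xi, x) for b, xi in zip(beta, xs))

fitted = [predict(x) for x in xs]
```

Note that `beta` has one entry per training sample, which is the growth in model order that the adverse factors discussed next refer to.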
AKL Networks with Application to Nonlinear System Identification
So far a kernelized NSI framework has been formulated, and its solution amounts to solving a traditional linear identification problem in the high dimensional feature space introduced by the kernel transform. However, several adverse factors make solving Eqs. (5) and (6) directly infeasible or meaningless. (1) The length of the parameter vector $\beta_t$, which can be loosely understood as the order of the identification model, equals the number of identification data used. That is, as identification continues online (e.g. in a recursive manner), the complexity of the model increases steadily, which is unreasonable and computationally impracticable. (2) Because measurements in an industrial environment are limited and corrupted by noise, the direct LS solution of Eq. (6) in the high dimensional feature space overfits and generalizes poorly. (3) The projections of input signals at different time instants may be linearly dependent in the feature space, which makes the kernel matrix rank deficient. This in turn causes serious numerical instability when solving for the vector $\beta$ using Eq. (6).
3 Basic Framework of AKL Networks
3.1 Dependence Detection and Network Topology
In this paper the linearly independent vectors in the feature space are referred to as "nodes", and their set at time $t$ is denoted as $N_t = \{\phi(\tilde{x}_1), \ldots, \phi(\tilde{x}_{n_t})\}$, $t = 1, 2, \ldots$, with $n_t \le t$, since the later derivation shows that the proposed AKL algorithm also has a network-like topology. The space spanned by the vectors in the node set is denoted as $F$, with some abuse of notation. The study of multicollinearity of the mapped features is far from new, and several researchers have used different (though similar) methods to treat this issue [6~9]. Here a new viewpoint based on the geometric relationship is proposed: the space angle index (SAI), which defines the sine of the angle between the $t$-th input feature vector $\phi(x_t)$ and the node set $N_{t-1}$,
$$ \mathrm{Sin}(\theta_t) = \min_{a_t} \frac{\left\| \phi(x_t) - \sum_{k=1}^{n_{t-1}} a_k \phi(\tilde{x}_k) \right\|}{\| \phi(x_t) \|} \tag{7} $$
where $a_t = [a_1, \ldots, a_{n_{t-1}}]^T$ is the corresponding vector of linear combination coefficients. This geometric index has the advantage of simplicity and intuition: if $\mathrm{Sin}(\theta_t) > v_0$, where $v_0$ is a predefined very small value with $0 \le v_0 \le 1$, then the vector $\phi(x_t)$ is introduced into the node set, which consequently becomes $N_t = \{N_{t-1}, \phi(x_t)\}$; otherwise, this vector is rejected and the node set remains unchanged, i.e., $N_t = N_{t-1}$. At the beginning of the identification, there is only one feature vector in the node set, i.e., $N_1 = \{\phi(\tilde{x}_1)\} = \{\phi(x_1)\}$. Later feature vectors are added to the set using the automatic SAI criterion.
Note that the minimization in Eq. (7) concerns only the numerator, whose square is referred to as
$$ \varphi_t = \min_{a_t} \| \phi(x_t) - \tilde{\Phi}_{t-1} a_t \|^2 = \min_{a_t} \{ k_{tt} - 2 a_t^T k_t + a_t^T \tilde{K}_{t-1} a_t \} \tag{8} $$
where $\tilde{\Phi}_{t-1} = [\phi(\tilde{x}_1), \ldots, \phi(\tilde{x}_{n_{t-1}})]$ is the node matrix at time $t-1$. The quantities $k_{tt} = \langle \phi(x_t), \phi(x_t) \rangle$, $k_t(i,1) = \langle \phi(\tilde{x}_i), \phi(x_t) \rangle$, $i = 1, \ldots, n_{t-1}$, and $\tilde{K}_{t-1}(i,j) = \langle \phi(\tilde{x}_i), \phi(\tilde{x}_j) \rangle$, $i, j = 1, \ldots, n_{t-1}$, are all corresponding kernel transforms, as in Eq. (5). The minimization of Eq. (8) is straightforward and yields
$$ \varphi_t = k_{tt} - a_t^T k_t, \quad a_t = \tilde{K}_{t-1}^{-1} k_t \tag{9} $$
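Therefore, applying the SAI of Eq. (7) amounts to checking the following condition, in which the ratio $\varphi_t / k_{tt}$ is the square of $\mathrm{Sin}(\theta_t)$:
$$ \varphi_t / k_{tt} \le v_0 \tag{10} $$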
By calculating the SAI of Eq. (10) online, the node set adaptively learns new dynamics of the process by adding new "nodes" to itself. Meanwhile, the inputs in the feature space can be greatly compressed by using the node set of the process:
$$ \Phi_t = \tilde{\Phi}_t [a_1, \ldots, a_t] = \tilde{\Phi}_t A_t^T \tag{11} $$
where the transform matrix is $A_t = [a_1, \ldots, a_t]^T$: its $i$-th row holds the coefficient vector $a_i$, and its remaining elements are equal to 0.
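The node selection rule of Eqs. (9) and (10) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: inputs are scalars, the kernel is Gaussian with an arbitrary width, and the function names (`kernel`, `solve`, `select_nodes`) are hypothetical.

```python
import math

def kernel(x, z, sigma=1.0):
    """Gaussian kernel for scalar inputs (assumed form)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def select_nodes(xs, v0, sigma=1.0):
    """Grow the node set: keep x_t only if phi_t / k_tt exceeds v0."""
    nodes = [xs[0]]  # the node set starts with the first input
    for x in xs[1:]:
        K_nodes = [[kernel(a, b, sigma) for b in nodes] for a in nodes]
        k_t = [kernel(a, x, sigma) for a in nodes]
        a_t = solve(K_nodes, k_t)                       # a_t = K^{-1} k_t, Eq. (9)
        k_tt = kernel(x, x, sigma)
        phi_t = k_tt - sum(ai * ki for ai, ki in zip(a_t, k_t))
        if phi_t / k_tt > v0:                           # negation of Eq. (10)
            nodes.append(x)
    return nodes

# Repeated inputs are linearly dependent in feature space and are rejected.
nodes = select_nodes([0.0, 0.0, 3.0, 3.0, 0.0], v0=1e-3)
```

In this toy sequence only the first occurrence of each distinct input becomes a node, which is the compression effect described above.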
3.2 Regularized LS
Sparseness alone cannot guarantee good generalization. To control the generalization ability of the NSI method, a regularizing term [5,6] is explicitly introduced and the optimization problem of Eq. (5) is first rewritten as
$$ J(\beta) = \frac{1}{2} \| K_t \beta_t - y_t \|^2 + \frac{c}{2} \| \beta_t \|^2 \tag{12} $$
where the constant $c > 0$ is the regularizing parameter. The balance between the approximation performance and the generalization ability of the AKL model is controlled by adjusting $c$, similarly to regularized networks and ridge regression [6]. However, the presence of $c$ makes the later recursive formulas for AKL more complicated. Using the node set representation of Eq. (11), the cost function of Eq. (12) changes to
$$ J(w) = \frac{1}{2} \| A_t \tilde{K}_t A_t^T \beta_t - y_t \|^2 + \frac{c}{2} \| A_t^T \beta_t \|^2 = \frac{1}{2} \| A_t \tilde{K}_t w_t - y_t \|^2 + \frac{c}{2} \| w_t \|^2 \tag{13} $$
where $w_t = A_t^T \beta_t$. Note that its length equals the number of nodes in the node set at time $t$ and is very short compared with that of $\beta_t$. Differentiating $J(w)$ with respect to $w$ gives $\partial J(w) / \partial w = (A_t \tilde{K}_t)^T (A_t \tilde{K}_t w_t - y_t) + c w_t = 0$. Thus the model coefficient vector is obtained as
$$ w_t = (cI + \tilde{K}_t A_t^T A_t \tilde{K}_t)^{-1} \tilde{K}_t A_t^T y_t \tag{14} $$
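The step from Eq. (12) to Eqs. (13) and (14) rests on a factorization of the full kernel matrix that follows from Eq. (11). Written out as a short worked derivation (using a tilde for node-set quantities and assuming the node kernel matrix is symmetric, as any kernel matrix is):

```latex
\begin{aligned}
K_t &= \Phi_t^T \Phi_t
     = (\tilde{\Phi}_t A_t^T)^T (\tilde{\Phi}_t A_t^T)
     = A_t \tilde{\Phi}_t^T \tilde{\Phi}_t A_t^T
     = A_t \tilde{K}_t A_t^T, \\
K_t \beta_t &= A_t \tilde{K}_t (A_t^T \beta_t) = A_t \tilde{K}_t w_t, \\
0 &= \tilde{K}_t A_t^T (A_t \tilde{K}_t w_t - y_t) + c\, w_t
\;\Longrightarrow\;
(cI + \tilde{K}_t A_t^T A_t \tilde{K}_t)\, w_t = \tilde{K}_t A_t^T y_t .
\end{aligned}
```

The last line is the normal equation whose solution is Eq. (14); the matrix to be inverted has the size of the node set rather than the size of the data set.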
4 Recursive Algorithm
4.1 Stable Learning Stage: $\mathrm{Sin}(\theta_t) \le v_0$
After the dominant dynamics of the process have been learned (characterized by the node set ceasing to grow), the AKL network enters a relatively "stable" learning stage. Mathematically, this scenario means that the relation $\mathrm{Sin}(\theta_t) \le v_0$ holds, which in practice is verified by the relation $\varphi_t / k_{tt} \le v_0$. Since no new node appears, the structure of the AKL network remains unchanged, i.e., $\tilde{K}_t = \tilde{K}_{t-1}$, and only the coefficient vector $w_t$ is adjusted online to optimize Eq. (12). To calculate Eq. (14) recursively, define
$$ P_t = cI + \tilde{K}_t A_t^T A_t \tilde{K}_t \tag{15} $$
and note that $A_t = [A_{t-1}^T, a_t]^T$; then the following recursive relation holds:
$$ A_t^T A_t = A_{t-1}^T A_{t-1} + a_t a_t^T \tag{16} $$
Therefore, the matrices $A_i$, $i = 1, \ldots, t$, need not be retained, which greatly reduces the memory requirements. Eq. (15) can then be rewritten as
$$ P_t = cI + \tilde{K}_{t-1} [A_{t-1}^T, a_t][A_{t-1}^T, a_t]^T \tilde{K}_{t-1} = cI + \tilde{K}_{t-1} A_{t-1}^T A_{t-1} \tilde{K}_{t-1} + \tilde{K}_{t-1} a_t a_t^T \tilde{K}_{t-1} = P_{t-1} + k_t k_t^T \tag{17} $$
where the last equality follows from the second equation of Eq. (9). According to the matrix inverse formula for a rank-one modification [10], $P_t^{-1} = P_{t-1}^{-1} - r \cdot P_{t-1}^{-1} k_t k_t^T P_{t-1}^{-1}$, where $r = 1/(1 + k_t^T P_{t-1}^{-1} k_t)$ is a scalar. Substituting this relation into Eq. (14) yields
$$ \begin{aligned} w_t &= P_t^{-1} \tilde{K}_t A_t^T y_t = (I - r \cdot P_{t-1}^{-1} k_t k_t^T) P_{t-1}^{-1} \tilde{K}_{t-1} [A_{t-1}^T, a_t][y_{t-1}^T, y_t]^T \\ &= (I - r \cdot P_{t-1}^{-1} k_t k_t^T) P_{t-1}^{-1} (\tilde{K}_{t-1} A_{t-1}^T y_{t-1} + \tilde{K}_{t-1} a_t y_t) = (I - r \cdot P_{t-1}^{-1} k_t k_t^T)(w_{t-1} + P_{t-1}^{-1} k_t y_t) \\ &= w_{t-1} + r \cdot P_{t-1}^{-1} k_t (y_t - k_t^T w_{t-1}) = w_{t-1} + G_t (y_t - k_t^T w_{t-1}) \end{aligned} \tag{18} $$
where $G_t = r \cdot P_{t-1}^{-1} k_t$ can be viewed as a gain vector. For a more numerically stable version, the matrix $P_t^{-1}$ can be updated as $P_t^{-1} = P_{t-1}^{-1} - G_t G_t^T / r$.
4.2 Revising Learning Stage: $\mathrm{Sin}(\theta_t) > v_0$
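At the early phase of the identification procedure, or when the dynamics of the process change, new nodes are added to the AKL network. In this case, both the structure and the coefficients of the network are adjusted online. If at time $t$ the SAI condition of Eq. (10) detects that a new node appears (which is calculated using only the input vector $x_t$), the quantities $\tilde{K}$ and $A$ change accordingly. In this scenario $\tilde{K}_t = \begin{bmatrix} \tilde{K}_{t-1} & k_t \\ k_t^T & k_{tt} \end{bmatrix}$, and since the new input vector is expressed exactly by itself as the last node in the AKL network, $a_t = [\underbrace{0, \ldots, 0}_{n_{t-1}}, 1]^T$ and $A_t = \begin{bmatrix} A_{t-1} & 0 \\ 0^T & 1 \end{bmatrix}$. The matrix $P_t$ can in this case be written in the recursive form
$$ \begin{aligned} P_t &= cI + \tilde{K}_t A_t^T A_t \tilde{K}_t = \begin{bmatrix} cI + \tilde{K}_{t-1} A_{t-1}^T A_{t-1} \tilde{K}_{t-1} + k_t k_t^T & \tilde{K}_{t-1} A_{t-1}^T A_{t-1} k_t + k_t k_{tt} \\ k_t^T A_{t-1}^T A_{t-1} \tilde{K}_{t-1} + k_{tt} k_t^T & c + k_t^T A_{t-1}^T A_{t-1} k_t + k_{tt}^2 \end{bmatrix} \\ &= \begin{bmatrix} P_{t-1} & \tilde{K}_{t-1} (A_{t-1}^T A_{t-1}) k_t \\ k_t^T (A_{t-1}^T A_{t-1}) \tilde{K}_{t-1} & k_t^T (A_{t-1}^T A_{t-1}) k_t + c \end{bmatrix} + \begin{bmatrix} k_t \\ k_{tt} \end{bmatrix} \begin{bmatrix} k_t^T & k_{tt} \end{bmatrix} \end{aligned} \tag{19} $$
where the term $A_t^T A_t$ can also be written in recursive form:
$$ A_t^T A_t = \begin{bmatrix} A_{t-1}^T A_{t-1} & 0 \\ 0^T & 1 \end{bmatrix} \tag{20} $$
Define $M = \begin{bmatrix} P_{t-1} & \tilde{K}_{t-1} (A_{t-1}^T A_{t-1}) k_t \\ k_t^T (A_{t-1}^T A_{t-1}) \tilde{K}_{t-1} & k_t^T (A_{t-1}^T A_{t-1}) k_t + c \end{bmatrix}$; using the matrix inverse formula for a rank-one modification again [10], the inverse of $P_t$ can be written as
$$ \begin{aligned} P_t^{-1} &= M^{-1} - M^{-1} \begin{bmatrix} k_t \\ k_{tt} \end{bmatrix} \begin{bmatrix} k_t^T & k_{tt} \end{bmatrix} M^{-1} \Big/ \left\{ 1 + \begin{bmatrix} k_t^T & k_{tt} \end{bmatrix} M^{-1} \begin{bmatrix} k_t \\ k_{tt} \end{bmatrix} \right\} \\ &= \left\{ I - M^{-1} \begin{bmatrix} k_t \\ k_{tt} \end{bmatrix} \begin{bmatrix} k_t^T & k_{tt} \end{bmatrix} \Big/ \left\{ 1 + \begin{bmatrix} k_t^T & k_{tt} \end{bmatrix} M^{-1} \begin{bmatrix} k_t \\ k_{tt} \end{bmatrix} \right\} \right\} \times M^{-1} = \Delta M \, M^{-1} \end{aligned} \tag{21} $$
where $\Delta M$ denotes the first factor in the second expression above. Defining $s = \tilde{K}_{t-1} (A_{t-1}^T A_{t-1}) k_t$ and $q = c + k_t^T (A_{t-1}^T A_{t-1}) k_t$ in Eq. (19), the inverse of $M$ can be obtained from the inverse formula for a block matrix [10]:
$$ M^{-1} = \begin{bmatrix} P_{t-1} & s \\ s^T & q \end{bmatrix}^{-1} = \vartheta \begin{bmatrix} \frac{1}{\vartheta} P_{t-1}^{-1} + P_{t-1}^{-1} s s^T P_{t-1}^{-1} & -P_{t-1}^{-1} s \\ -s^T P_{t-1}^{-1} & 1 \end{bmatrix} \tag{22} $$
where $\vartheta = 1/[c + k_t^T (A_{t-1}^T A_{t-1}) k_t - s^T P_{t-1}^{-1} s]$. Thus the corresponding updated model vector is
$$ w_t = P_t^{-1} \tilde{K}_t A_t^T y_t = P_t^{-1} \begin{bmatrix} \tilde{K}_{t-1} A_{t-1}^T y_{t-1} + k_t y_t \\ k_t^T A_{t-1}^T y_{t-1} + k_{tt} y_t \end{bmatrix} = P_t^{-1} \begin{bmatrix} P_{t-1} w_{t-1} + k_t y_t \\ k_t^T \tilde{K}_{t-1}^{-1} P_{t-1} w_{t-1} + k_{tt} y_t \end{bmatrix} \tag{23} $$
where $\tilde{K}_t^{-1} = \begin{bmatrix} \tilde{K}_{t-1} & k_t \\ k_t^T & k_{tt} \end{bmatrix}^{-1} = \frac{1}{\varphi_t} \begin{bmatrix} \varphi_t \tilde{K}_{t-1}^{-1} + a_t a_t^T & -a_t \\ -a_t^T & 1 \end{bmatrix}$; $\varphi_t$ and $a_t$ are defined as in Eq. (9).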
Noting that all the terms $P$, $P^{-1}$ and $\tilde{K}^{-1}$ in Eq. (23) can now be calculated recursively (in both learning stages), Eq. (23) already gives the recursive form for $w_t$ when a new node needs to be added to the AKL identification network. The one-step prediction output of the AKL network model is straightforward and given as $\hat{y}_t = k_t^T w_{t-1}$, which is a weighted sum of kernel transforms of the input signal. Therefore, just like neural networks, fuzzy networks and the later wavelet networks [1,3,4], the AKL model can also be understood as a kind of "network". On the other hand, in both stages Eqs. (18) and (23) only involve matrices whose dimension is the number of nodes in the AKL network, regardless of the scale of the MIMO system, which is very attractive for industrial application.
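The stable-stage recursion of Eq. (18) can be checked numerically against the batch solution of Eq. (14). The sketch below is illustrative only: it assumes a fixed node set of two scalar nodes, a Gaussian kernel, made-up data, and the initialization $P_0 = cI$, $w_0 = 0$ (an assumption; in the method itself the initial values come from the revising stage). With a fixed node set, $\tilde{K} a_t = k_t$, so the batch problem reduces to $(cI + \sum_t k_t k_t^T) w = \sum_t k_t y_t$.

```python
import math

def kernel(x, z, sigma=1.0):
    """Gaussian kernel for scalar inputs (assumed form)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))

def stable_update(P_inv, w, k_t, y_t):
    """One step of Eq. (18): rank-one update of P^{-1} and of the weights w."""
    n = len(w)
    Pk = [sum(P_inv[i][j] * k_t[j] for j in range(n)) for i in range(n)]
    r = 1.0 / (1.0 + sum(k_t[i] * Pk[i] for i in range(n)))
    G = [r * v for v in Pk]                               # gain vector G_t
    err = y_t - sum(k_t[i] * w[i] for i in range(n))      # y_t - k_t^T w_{t-1}
    w_new = [w[i] + G[i] * err for i in range(n)]
    # P_t^{-1} = P_{t-1}^{-1} - G_t G_t^T / r
    P_new = [[P_inv[i][j] - G[i] * G[j] / r for j in range(n)] for i in range(n)]
    return P_new, w_new

nodes = [0.0, 2.0]        # hypothetical fixed node set
c = 0.1                   # regularizing parameter (illustrative value)
samples = [(0.0, 0.0), (0.5, 0.4), (1.0, 0.8), (1.5, 0.9), (2.0, 0.7)]

# Recursive solution, starting from P_0 = c I and w_0 = 0 (assumed initialization).
P_inv = [[1.0 / c, 0.0], [0.0, 1.0 / c]]
w = [0.0, 0.0]
for x, y in samples:
    k_t = [kernel(z, x) for z in nodes]
    P_inv, w = stable_update(P_inv, w, k_t, y)

# Batch solution: solve (c I + sum k k^T) w = sum k y with Cramer's rule (2 x 2).
P = [[c, 0.0], [0.0, c]]
b = [0.0, 0.0]
for x, y in samples:
    k_t = [kernel(z, x) for z in nodes]
    for i in range(2):
        b[i] += k_t[i] * y
        for j in range(2):
            P[i][j] += k_t[i] * k_t[j]
det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
w_batch = [(P[1][1] * b[0] - P[0][1] * b[1]) / det,
           (P[0][0] * b[1] - P[1][0] * b[0]) / det]
```

Both routes give the same weights up to rounding, while the recursion never stores past samples and only handles matrices of node-set size.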
5 Simulations
To compare the proposed AKL network algorithm with other NSI methods, this paper is restricted to one plant that was used in both Refs. [3, 4]. However, unlike these two well-known works, both the deterministic and stochastic situations are considered here, and the issue of identification with small samples is specifically explored. The plant considered is [3, 4]: $y(k+1) = g[y(k), y(k-1), y(k-2), u(k), u(k-1)]$, where the unknown nonlinear function $g$ and the input signal have the following forms, respectively:
$$ g(x_1, x_2, x_3, x_4, x_5) = \frac{x_1 x_2 x_3 x_5 (x_3 - 1) + x_4}{1 + x_2^2 + x_3^2}, \qquad u(k) = \begin{cases} \sin(2\pi k/250), & k \le 500 \\ 0.8 \sin(2\pi k/250) + 0.2 \sin(2\pi k/25), & k > 500 \end{cases} $$
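For readers who want to reproduce the benchmark data, the plant and input signal above can be simulated with a few lines. This is a sketch under assumptions not stated in the text: zero initial conditions for the delayed outputs and inputs, and a horizon of 800 steps chosen to match the sample axis of the figures. The function names are hypothetical.

```python
import math

def g(x1, x2, x3, x4, x5):
    """Nonlinear plant function as given in the text."""
    return (x1 * x2 * x3 * x5 * (x3 - 1.0) + x4) / (1.0 + x2 ** 2 + x3 ** 2)

def u(k):
    """Input signal: single sinusoid up to k = 500, sum of two sinusoids afterwards."""
    if k <= 500:
        return math.sin(2.0 * math.pi * k / 250.0)
    return (0.8 * math.sin(2.0 * math.pi * k / 250.0)
            + 0.2 * math.sin(2.0 * math.pi * k / 25.0))

def simulate(n_steps=800):
    """Return y(0), ..., y(n_steps), assuming zero initial conditions."""
    y = [0.0] * (n_steps + 1)
    for k in range(n_steps):
        y_k1 = y[k - 1] if k >= 1 else 0.0   # y(k-1)
        y_k2 = y[k - 2] if k >= 2 else 0.0   # y(k-2)
        u_k1 = u(k - 1) if k >= 1 else 0.0   # u(k-1)
        y[k + 1] = g(y[k], y_k1, y_k2, u(k), u_k1)
    return y

y = simulate()
```

In a series-parallel identification setup, the regressor at step k would be the vector of the five arguments passed to `g`, with `y[k + 1]` as the target.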
Case 1. Deterministic Identification
In this scenario, both the input and output of the plant are deterministic and no noise is involved. As in [3], the series-parallel structure is adopted to organize the input signal for the identification. The Gaussian kernel function is used in all simulations: $k(x_1, x_2) = \exp[-\|x_1 - x_2\|^2 / (2\sigma^2)]$.
Because the Gaussian kernel is translation invariant, $k_{tt} = 1$ and the proposed SAI condition reduces to checking the inequality $\varphi_t \le v_0$. Further, the regularizing parameter $c$ in Eq. (12) is set to 0 since the environment is deterministic and a large sample is available. The AKL network is then greatly simplified to a form similar to the kernel recursive LS method [7], in which no explicit regularization is considered. Thus, only two parameters need to be specified, namely the geometric angle parameter $v_0$ and the kernel parameter $\sigma$. The parameters adopted here are $\sigma = 14.80$ and $v_0 = 1.5 \times 10^{-4}$. There is no rigorous parameter selection theory available for AKL networks; fortunately, in the deterministic case these two parameters work well over a wide range of values. The values given here are just one of many parameter pairs that give satisfactory results, and no optimality is guaranteed.
Five nodes are selected during the online learning, at sample numbers [4, 7, 22, 52, 156]; the whole identification procedure takes 1.773 s (on a P4 1.6 GHz CPU). The identification result is shown in Fig. 1. It can be seen that the AKL network accurately traces the dynamics of the process from the beginning of the learning, then adds nodes to the network as needed, and the structure learning ends after the last node is obtained at the 156th sample. The identification result of the AKL network is clearly very satisfactory.
[Figure: Process/AKL Output versus Sample Number; curves: Process, AKL, Nodes]
Fig. 1. Identification result of AKL with c = 0
[Figure: SAI Error versus Sample Number; curves: SAI, v0]
Fig. 2. SAI error of AKL Network with c = 0
A comparative study of the AKL network with two other methods, i.e., the adaptive neural network [3] and the adaptive fuzzy system [4], is given in Table 1. The AKL network outperforms the other two kinds of networks in both training time and training mode, and it uses only five nodes (and two training parameters), far fewer than the other two networks. Further, judging from the corresponding figures in [3, 4], the approximation error of the AKL network should also be much smaller than that of the other two, although no specific performance indices are provided in those references.
Table 1. Identification performance of three different methods
Algorithms                    Para. Nums.   Train. Time   Train. Mode   RMSE     Max. Error
Adaptive Neural Network [3]   200+2         10^5 steps    Offline       /        /
Adaptive Fuzzy System [4]     120+2         10^5 steps    Offline       /        /
AKL Network                   5+2 (c=0)     156 steps     Online        0.0279   0.0930
AKL Network                   2+3 (c≠0)     16 steps      Online        0.0541   0.2892
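The two error indices reported in Table 1 are not defined explicitly in the text. Assuming the usual definitions (root mean squared error and largest absolute error between the process output and the model output), they can be computed as in the following sketch; the sample values are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)

def max_error(y_true, y_pred):
    """Largest absolute deviation between the two sequences."""
    return max(abs(a - b) for a, b in zip(y_true, y_pred))

# Illustrative values only (not data from the paper).
y_true = [0.0, 0.5, 1.0, 0.5]
y_pred = [0.0, 0.4, 1.2, 0.5]
score_rmse = rmse(y_true, y_pred)
score_max = max_error(y_true, y_pred)
```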
It should be noted that both the fuzzy system and the neural network can of course also be trained online using some adaptive algorithm. In that situation, however, their performance is usually much poorer than that obtained offline [3, 4]. The SAI approximation error, i.e., $\varphi_t$, is shown in Fig. 2.
Case 2. Stochastic Identification
To simulate the industrial environment, both the input and output signals are corrupted by independent Gaussian noise with a variance of 0.05, giving a noise-to-signal variance ratio of about 11.5% for the input (the nominal "variances" of the input and output signals are 0.4339 and 0.2373, respectively). Further, samples are available only every 15 steps; that is, over the whole identification procedure the identifier is fed with only about 50 sample pairs. The parameters adopted by the AKL network are $\sigma = 13.0$, $v_0 = 2.1 \times 10^{-3}$ and $c = 10^{-6}$. Because the number of parameters is very small, they can easily be determined by trial and error in simulation. From the viewpoint of SLT, the kernel parameter $\sigma$ has the dominant effect on the final performance of the AKL network, so in practice one should first set the other two parameters to relatively relaxed values and then adjust $\sigma$. The identification result is shown in Fig. 3, in which the signal named "Process" denotes the nominal noise-free output of the process; for clarity, the very noisy output signal used to train the AKL network is not shown.
[Figure: Process/AKL Output versus Sample Number; curves: Process, AKL, Nodes]
Fig. 3. Identification result of AKL with c = 10^{-6}
[Figure: SAI Error versus Sample Number; curves: SAI, v0]
Fig. 4. SAI error of AKL Network with c = 10^{-6}
The AKL network shows a relatively smooth output even in this errors-in-variables environment. In fact, owing to the effect of the regularizing term, the AKL network learns only one node in this case, namely the second available data vector at the 16th step; the other node is the one at the 4th step, where initialization begins. This keeps the overall complexity of the AKL network under control, and good generalization is thus obtained. The whole identification procedure takes 2.213 s. The identification performance of the AKL network in this case is also summarized in Table 1 (last row). It can be seen that the general performance is comparable with that of Case 1, which uses a large, noise-free sample set, although the RMSE is somewhat larger. The SAI approximation error, $\varphi_t$, is shown in Fig. 4.
6 Conclusions
Developing a qualified plant model under practical industrial conditions is crucial for all kinds of model-based automation technologies. The kernel method provides quite a different idea and solution to the NSI problem compared with traditional identification theories [1~4]. However, direct application of kernel methods to NSI problems is still limited (in the authors' opinion, "regression" is too general a topic), in spite of their success in many other fields. The proposed AKL network can learn adaptively and consequently build the model online with small computational cost and using a small sample set. More importantly, the order of the model does not need to be determined in advance. Further theoretical study of its parameter selection will improve this NSI method.
Acknowledgements. This work was sponsored by the National Natural Science Foundation of China (Projects No. 20576116 and 20206028) and the Alexander von Humboldt Foundation (H. Q. Wang), which are gratefully acknowledged.
References
1. Sjöberg, J., Zhang, Q.H., Benveniste, A., et al.: Nonlinear Black-box Modeling in System Identification: a Unified Overview. Automatica, Vol. 31 (1995) 1691-1724
2. Ljung, L.: System Identification: Theory for the User (2nd edition). Prentice-Hall, New Jersey (1999)
3. Narendra, K.S., Parthasarathy, K.: Identification and Control of Dynamical Systems Using Neural Networks. IEEE Trans. Neural Networks, Vol. 1 (1990) 4-27
4. Wang, Lixin: Adaptive Fuzzy Systems and Control: Design and Stability Analysis. Prentice-Hall, New Jersey (1994)
5. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)
6. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press (2002)
7. Engel, Y., Mannor, S., Meir, R.: Kernel Recursive Least Squares, available at http://www.ee.technion.ac.il/~rmeir/Publications/KrlsReport.pdf (2003)
8. Csató, L., Opper, M.: Sparse Representation for Gaussian Process Models. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.): Advances in Neural Information Processing Systems, Vol. 13. MIT Press (2001) 444-450
9. Franc, V., Hlavac, V.: Training Set Approximation for Kernel Methods. Proceedings of the 8th Computer Vision Winter Workshop, Prague, Czech Republic. Czech Pattern Recognition Society (2003) 121-126
10. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edition. The Johns Hopkins University Press, Baltimore, Maryland (1996)
Probabilistic Kernel Principal Component Analysis Through Time
Mauricio Alvarez (1) and Ricardo Henao (2)
(1) Auxiliar Lecturer, Program of Electrical Engineering
(2) Auxiliar Lecturer, School of Electrical Technology
Universidad Tecnológica de Pereira, Colombia
Abstract. This paper introduces a temporal version of Probabilistic Kernel Principal Component Analysis by using a hidden Markov model in order to obtain optimized representations of observed data through time. Recently introduced, Probabilistic Kernel Principal Component Analysis overcomes the two main disadvantages of standard Principal Component Analysis, namely, the absence of a probability density model and the lack of high-order statistical information due to its linear structure. We extend this probabilistic approach of KPCA to mixture models in time, to enhance the capabilities of transformation and reduction of time series vectors. Results on voice disorder databases show improvements in classification accuracy even with highly reduced representations.
1 Introduction
Principal Component Analysis (PCA) is a popular and powerful technique for feature extraction and dimensionality reduction, and probably the most widely used technique of multivariate analysis [1]. One of the most common definitions of PCA is that, for a set of observed d-dimensional data vectors $X = \{x_n\}$, $n = 1, \ldots, N$, the $p$ principal axes $w_i$, $i = 1, \ldots, p$, are those orthonormal axes onto which the retained variance under linear projection is maximal. However, PCA has several disadvantages, among them the absence of an associated probability density or generative model, the restriction of the subspace to a linear mapping, which discards high-order statistical information, and the assumption that the observed data are independent, which is inadequate when modeling time series. The first two disadvantages were overcome by Probabilistic Principal Component Analysis (PPCA) [2] and Kernel Principal Component Analysis (KPCA) [3], respectively. Probabilistic Kernel Principal Component Analysis (PKPCA) has also been proposed to deal with the first two disadvantages at the same time [4,5,6]. To overcome the third disadvantage, temporal versions of PKPCA and PPCA are introduced in this paper, using a hidden Markov model (HMM) to obtain an optimized representation of the observed data through time. With this scheme, every data point has an associated local representation corresponding to the most probable state produced by a trained HMM. The integration between the HMM and the PCA variants is done by incorporating a factorization of the training data and the responsibility weights into the covariance of the observation model of the HMM, in such a way that PPCA and PKPCA can be readily calculated. The paper is organized as follows: Sections 2 and 3 review PKPCA and HMMs, respectively. In Section 4, extensions of PPCA and PKPCA through time are presented. In Section 5, experimental results for voice disorder databases are reported, and finally the conclusions are given in Section 6.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 747–754, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Probabilistic Kernel Principal Component Analysis
PCA is intended to work in the original observation data space $\mathbb{R}^d$. KPCA, on the other hand, operates in a high-dimensional feature space $F$, which is related to the input space by a possibly nonlinear map $\phi(\cdot): \mathbb{R}^d \to F$, where the dimension $f$ of $F$ is greater than $d$ and possibly infinite. The mapped set in $F$ is denoted by $\Phi = [\phi(x_1), \ldots, \phi(x_N)]$. In the feature space, PCA operates on the covariance matrix of the feature vectors, defined as
$$ S_F = \frac{1}{N} \sum_{i=1}^{N} (\phi(x_i) - \bar{\phi})(\phi(x_i) - \bar{\phi})^T \tag{1} $$
where $\bar{\phi} = \sum_{i=1}^{N} \phi(x_i)/N$ is the sample mean in feature space. Equivalently, (1) can be expressed as $S_F = \Phi H H^T \Phi^T$, where $H = (I - \frac{1}{N} 1 1^T)/N^{1/2}$. Since the matrix $\Phi H H^T \Phi^T$ has the same nonzero eigenvalues as $H^T \Phi^T \Phi H = H^T K H$, the kernel trick makes KPCA work on $H^T K H$ instead of $S_F$, so explicit knowledge of the mapping function $\phi(\cdot)$ is no longer necessary [3]. A kernel representation $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ can be used to calculate the dot matrix $K = \Phi^T \Phi$, with $K_{ij} = k(x_i, x_j)$. The existence of such a kernel function is guaranteed by Mercer's theorem [7]. One of the most popular kernels is the Gaussian RBF kernel, defined as $k(x_i, x_j) = \exp\{-\|x_i - x_j\|^2 / \sigma_K^2\}$, where $\sigma_K^2$ controls the RBF width. In order to extend KPCA to its probabilistic version (PKPCA) following the Factor Analysis (FA) perspective [8], the feature vector $\phi(x) \in \mathbb{R}^N$ can be expressed as a linear combination of basis vectors plus noise:
$$ \phi(x) = W_F y + \mu_F + \epsilon_F \tag{2} $$
where $y$ is a $p$-dimensional data vector, $W_F$ is an $N \times p$ matrix relating the two sets of variables $\phi(x)$ and $y$, and the vector $\mu_F$ allows the model to have nonzero mean. The variance matrix $\Psi$ of the noise process $\epsilon_F$ should be isotropic, $p(\epsilon_F) \sim \mathcal{N}(0, N \sigma_F^2 I / f)$, and the latent variables $y_1, \ldots, y_p$ are independent Gaussians of the form $p(y) \sim \mathcal{N}(0, N I / f)$. The difficulty is that $f$ is usually unknown. However, it is possible to derive an estimation procedure for $W_F$ and $\sigma_F^2$ that does not depend on $f$, using the kernel trick [6]. Since there is no closed-form analytic solution for the PKPCA model, the model should be obtained by iterative maximization of $L_F$,
$$ L_F = \log p(\Phi | W_F, \mu_F, \sigma_F^2) = -\frac{1}{2} \left\{ \log |C_F| + \frac{f}{N} \mathrm{tr}(C_F^{-1} S_F) \right\} $$
with some constant terms omitted, where $C_F = W_F W_F^T + N \sigma_F^2 I / f$ and $S_F$ from (1) is the covariance matrix of the feature vectors. The ML estimator for $\mu_F$ is the sample mean in feature space, so $\mu_{F(ML)} = \bar{\phi}$. The estimates of $W_F$ and $\sigma_F^2$ can be obtained by iterative maximization of $L_F$ in terms of the kernel matrix $K$ only, and not of the feature space dimension $f$, as in [6]:
$$ \tilde{W}_F = H^T K H W_F (N \sigma_F^2 I + M_F^{-1} W_F^T H^T K H W_F)^{-1} \tag{3} $$
$$ \tilde{\sigma}_F^2 = \frac{1}{N^2} \mathrm{tr}(H^T K H - H^T K H W_F M_F^{-1} \tilde{W}_F^T) \tag{4} $$
where the tilde in $\tilde{W}_F$ means "new" and $M_F = W_F^T W_F + \sigma_F^2 I$. The equations in (3) and (4) are iterated in sequence until the algorithm is judged to have converged. In order to obtain standard PPCA [2] from the above formulation, the model in (2) should be modified by setting $\phi(x) = x$. In addition, $H^T K H$ should be replaced by $S = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$, the covariance matrix in the original observation data space, where $\bar{x} = \sum_n x_n / N$. It can be shown [2] that for standard PPCA, the estimates of $W$ and $\sigma^2$ are obtained similarly to (3) and (4) as
$$ \tilde{W} = S W (\sigma^2 I + M^{-1} W^T S W)^{-1}, \qquad \tilde{\sigma}^2 = \frac{1}{d} \mathrm{tr}(S - S W M^{-1} \tilde{W}^T) \tag{5} $$
Notice that we have removed the subindex of the feature space for clarity.
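The iteration for standard PPCA can be sketched for the simplest case of one latent dimension ($p = 1$), where $M$ and the bracketed term in the update of $W$ are scalars and no matrix inversion is needed. The sketch follows the EM updates of Tipping and Bishop [2] with the usual term $S W M^{-1} \tilde{W}^T$ in the noise update; the data set, the starting values and the function name `ppca_em_1d` are illustrative assumptions.

```python
import math

def ppca_em_1d(S, w, sigma2, n_iter=200):
    """EM updates for PPCA with p = 1, so M and the bracketed term are scalars."""
    d = len(S)
    for _ in range(n_iter):
        Sw = [sum(S[i][j] * w[j] for j in range(d)) for i in range(d)]
        M = sum(v * v for v in w) + sigma2                  # M = W^T W + sigma^2
        wSw = sum(w[i] * Sw[i] for i in range(d))           # W^T S W
        w_new = [v / (sigma2 + wSw / M) for v in Sw]        # new W
        trace_S = sum(S[i][i] for i in range(d))
        # For p = 1, tr(S W M^{-1} W_new^T) = (S w) . w_new / M
        sigma2 = (trace_S - sum(Sw[i] * w_new[i] for i in range(d)) / M) / d
        w = w_new
    return w, sigma2

# Four illustrative points whose sample covariance is diag(4, 1).
data = [[2.0, 1.0], [-2.0, 1.0], [2.0, -1.0], [-2.0, -1.0]]
N = len(data)
mean = [sum(x[i] for x in data) / N for i in range(2)]
S = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in data) / N
      for j in range(2)] for i in range(2)]

w, sigma2 = ppca_em_1d(S, w=[1.0, 1.0], sigma2=1.0)
```

For this covariance the iteration settles on the leading principal axis with squared length equal to the leading eigenvalue minus the noise variance, and on a noise variance equal to the discarded eigenvalue. The kernel version would use $H^T K H$ in place of `S`, with the scaling factors of (3) and (4).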
3 Hidden Markov Models
A hidden Markov model is basically a Markov chain in which the output observation is a random variable generated according to an output probabilistic function associated with each state [9]. A hidden Markov model with $N_s$ states is defined by a set of parameters $\lambda = (A, B, \pi)$, where $\pi$ is the initial state distribution, $A$ is the transition probability matrix and $B$ is the output probability matrix. Output probabilities can be modeled through continuous probability density functions. To train an HMM, Expectation-Maximization (EM) updates can be used in conjunction with the forward-backward algorithm [9]. In particular, when the observation model is a mixture of Gaussians (for state $j$),
$$ p_\lambda(x_n | j) = \sum_{m=1}^{M} c_{jm} \mathcal{N}(x_n; \mu_{jm}, S_{jm}) \tag{6} $$
where $M$ is the number of components in the mixture, it can be shown that the reestimation formulas are given by
$$ \tilde{c}_{jk} = \frac{\sum_{n=1}^{N} \gamma_n(j,k)}{\sum_{n=1}^{N} \sum_{k=1}^{M} \gamma_n(j,k)}, \qquad \tilde{\mu}_{jk} = \frac{\sum_{n=1}^{N} \gamma_n(j,k) x_n}{\sum_{n=1}^{N} \gamma_n(j,k)}, \qquad \tilde{S}_{jk} = \frac{\sum_{n=1}^{N} \gamma_n(j,k)(x_n - \tilde{\mu}_{jk})(x_n - \tilde{\mu}_{jk})^T}{\sum_{n=1}^{N} \gamma_n(j,k)} \tag{7} $$
where $\gamma_n(j,k)$ is the probability of being in state $j$ at instant $n$ with the $k$-th mixture component accounting for $x_n$. To find the single best state sequence, we use the Viterbi algorithm [9].
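The mean and covariance reestimation in (7) amounts to weighted sample statistics. A minimal sketch for a single (state, mixture) pair follows; the responsibilities are made-up numbers standing in for the output of the forward-backward pass, and the function name `reestimate` is hypothetical.

```python
def reestimate(xs, gammas):
    """Weighted mean and covariance for one (state, mixture component) pair."""
    total = sum(gammas)
    d = len(xs[0])
    mu = [sum(g * x[i] for g, x in zip(gammas, xs)) / total for i in range(d)]
    S = [[sum(g * (x[i] - mu[i]) * (x[j] - mu[j]) for g, x in zip(gammas, xs)) / total
          for j in range(d)] for i in range(d)]
    return mu, S

# Three 2-D observations and illustrative responsibilities gamma_n(j, k).
xs = [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0]]
gammas = [0.5, 0.25, 0.25]
mu, S = reestimate(xs, gammas)
```

Observations with larger responsibility pull the mean and covariance of that component more strongly, which is how the temporal structure learned by the HMM enters the local statistics.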
4 Building PCA Models Through Time
A natural link between HMM and PCA arises from the fact that both the observation model for the HMM in (6) and the formulation of PCA require the computation of a sample covariance matrix such as (1). It should be noted that the expression for $\tilde{S}_{jk}$ in (7) represents a weighted version of $S_F$, although it is preferable to obtain an expression in which the observed data (or kernel) and the weights $\gamma_n(j,k)$ belong to different matrix factors, in a similar way to $H^T K H$. To do this, it can be shown from (7) that
$$ \tilde{S}_{jk} = \sum_{n=1}^{N} r_n(j,k)(x_n - \tilde{\mu}_{jk})(x_n - \tilde{\mu}_{jk})^T = X R_{jk} R_{jk}^T X^T \tag{8} $$
where $\tilde{S}_{jk}$ is the local weighted sample covariance matrix for state $j$ and the $k$-th mixture component. The terms on the right-hand side are defined as
$$ r_n(j,k) = \frac{\gamma_n(j,k)}{\sum_{n=1}^{N} \gamma_n(j,k)}, \qquad \tilde{\mu}_{jk} = \sum_{n=1}^{N} r_n(j,k) x_n, \qquad R_{jk} = (I - r(j,k) 1^T) D_{jk} $$
where the vector $r(j,k) = [r_1(j,k), \ldots, r_N(j,k)]^T$ and $D_{jk}$ is a diagonal matrix with entries $r_1(j,k)^{1/2}, \ldots, r_N(j,k)^{1/2}$. The matrix $R_{jk}$ can be seen as the responsibility matrix for the $k$-th mixture component and state $j$. Since $X R_{jk} R_{jk}^T X^T$ in (8) has the same nonzero eigenvalues as $R_{jk}^T X^T X R_{jk}$, where $X^T X$ is the dot matrix of the observations, it is easy to see that models for PPCA and PKPCA can be built for each mixture component of each state in the HMM. The structure of the PCA models then contains the information in time provided by the matrix $R_{jk}$ that results from the HMM training process.
4.1 PKPCA Model Through Time
The model for PKPCA can be built by performing iterative evaluations of $\tilde{W}_{Fjk}$ and $\tilde{\sigma}^2_{Fjk}$ using (3) and (4), with the term $H^T K H$ replaced by $R_{jk}^T K R_{jk}$ in order to conform to the mixture model. The $p$-dimensional reduced representation of $x$ corresponding to the $k$-th mixture in state $j$ can be obtained as
$$ y_{jk} = M_{Fjk}^{-1} W_{Fjk}^T \tilde{k}(x_i, x) \tag{9} $$
where $\tilde{k}(x_i, x) = (I - r(j,k) 1^T)^T (k_x - K r(j,k))$ and $k_x = [k(x_1, x), \ldots, k(x_N, x)]^T$. Additionally, to build the PPCA model through time, iterative evaluations of $\tilde{W}_{jk}$ and $\tilde{\sigma}^2_{jk}$ should be calculated using (5), but replacing $S$ by $\tilde{S}_{jk}$ from (8). The $p$-dimensional reduced representation of the observed vector $x$ is calculated using $y_{jk} = M_{jk}^{-1} W_{jk}^T (x - \tilde{\mu}_{jk})$.
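The factorization in (8), and the replacement of the centered kernel matrix by its responsibility-weighted counterpart, can be verified numerically on a small example. The sketch below assumes a linear kernel (so that $K = X^T X$), three made-up observations stored as columns of $X$, and normalized weights that sum to one. The helper names are hypothetical.

```python
import math

def responsibility_matrix(r):
    """R = (I - r 1^T) D with D = diag(sqrt(r_n)); r is assumed to sum to one."""
    N = len(r)
    return [[((1.0 if i == n else 0.0) - r[i]) * math.sqrt(r[n]) for n in range(N)]
            for i in range(N)]

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Columns of X are the observations x_n (d = 2, N = 3); values are illustrative.
X = [[0.0, 2.0, 2.0],
     [0.0, 0.0, 2.0]]
r = [0.5, 0.25, 0.25]                       # normalized weights r_n(j, k)

R = responsibility_matrix(r)
XR = matmul(X, R)
S_factored = matmul(XR, transpose(XR))      # X R R^T X^T, right side of (8)

# Direct weighted covariance, left side of (8).
mu = [sum(r[n] * X[i][n] for n in range(3)) for i in range(2)]
S_direct = [[sum(r[n] * (X[i][n] - mu[i]) * (X[j][n] - mu[j]) for n in range(3))
             for j in range(2)] for i in range(2)]

# Kernel-side counterpart with a linear kernel K = X^T X: R^T K R is N x N and
# shares its trace (and nonzero eigenvalues) with the d x d matrix above.
K = matmul(transpose(X), X)
RtKR = matmul(matmul(transpose(R), K), R)
trace_kernel = sum(RtKR[i][i] for i in range(3))
trace_cov = S_factored[0][0] + S_factored[1][1]
```

The equality of the two traces is a cheap consistency check; only the $N \times N$ matrix is available when a nonlinear kernel is used.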
4.2 Time Series Classification
The application of the described methodology for PKPCA is summarized in Algorithm 1. A similar algorithm for PPCA can be applied to time series classification using the corresponding equations.
Algorithm 1. PKPCA through time
Require: Training observations $X$, $N_s$, $M$ and $\sigma_K$
1: Calculate $K$
2: Initialize PKPCA models ($W_{Fjk}$, $\sigma^2_{Fjk}$) using PCA
3: Initialize one HMM per class with Gaussian observation models
4: % Train HMM models
5: repeat
6:   Reestimate HMM model parameters $\lambda$ using EM updates
7:   Reestimate $W_{Fjk}$ and $\sigma^2_{Fjk}$ using $S_{jk}$ (equations (3) and (4))
8: until convergence
9: % Obtain the transformed and/or reduced representation
10: path = viterbi($X$, $\lambda$) % path is the best state sequence
11: Project every $x_i \in X$ using the PKPCA model given by path$_i$ (equation (9))
12: Train the HMM-based classifier $\lambda_C$ using the EM algorithm and the transformed observations
13: return $W_{Fjk}$, $\sigma^2_{Fjk}$ and $\lambda$ % PKPCA through time parameters
14: return $\lambda_C$ % HMM classifier
In order to test unseen observations, $\widehat{W}_{Fjk}$, $\widehat{\sigma}^2_{Fjk}$ and λ can be used to obtain the transformed and/or reduced representation, and $\lambda_C$ to classify them.
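Step 10 of Algorithm 1 relies on the Viterbi algorithm to pick, for every frame, the state whose PKPCA model is then used for the projection. The sketch below is a generic log-domain Viterbi decoder, not the authors' implementation; it assumes the per-frame log-likelihoods `log_B` have already been computed from the observation densities, and all argument names are hypothetical.

```python
import math


def viterbi(log_pi, log_A, log_B):
    """Return the most likely HMM state sequence.

    log_pi[j]   : log initial probability of state j
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][j] : log likelihood of the observation at time t under state j
    """
    T, N = len(log_B), len(log_pi)
    delta = [log_pi[j] + log_B[0][j] for j in range(N)]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for t in range(1, T):
        prev = delta
        ptr, delta = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: prev[i] + log_A[i][j])
            ptr.append(best)
            delta.append(prev[best] + log_A[best][j] + log_B[t][j])
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(range(N), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

The returned list plays the role of `path` in Algorithm 1: entry t indexes the state-specific model used to project frame t.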
5 Experimental Results
5.1 Databases
Two different databases are used for all experiments. Database DB1 belongs to Universidad Nacional de Colombia, Manizales, Colombia, and contains 80 cases of the sustained vowel /a/, pronounced by 40 normal speech patients and 40 dysphonic speech patients. Database DB2 belongs to Universidad Politécnica de Madrid, Spain, and contains 160 samples of the sustained vowel /a/, pronounced by 80 normal speech patients and 80 patients with different speech pathologies (nodules, polyps, oedemas, cysts, sulcus, carcinomas).
5.2 Feature Extraction
Speech samples are windowed using frames of 30 milliseconds (ms) length with an overlap of 20 ms. For each frame, 12 Mel-Frequency Cepstrum Coefficients (MFCC) and an energy coefficient are extracted. First and second order deltas are also included, so the final observation vector has 39 variables for each frame.
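The framing arithmetic above can be sketched as follows. The sampling rate is not stated in the text, so `sample_rate` is a free parameter here, and both function names are hypothetical.

```python
def frame_bounds(n_samples, sample_rate, frame_ms=30, overlap_ms=20):
    """Return (start, end) sample indices of each analysis frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    # A 20 ms overlap between 30 ms frames means a 10 ms hop.
    hop = int(sample_rate * (frame_ms - overlap_ms) / 1000)
    return [(s, s + frame_len) for s in range(0, n_samples - frame_len + 1, hop)]


def observation_dim(n_mfcc=12, with_energy=True, delta_orders=2):
    """Static coefficients plus first and second order deltas."""
    base = n_mfcc + (1 if with_energy else 0)
    return base * (1 + delta_orders)
```

With the defaults, `observation_dim()` gives (12 + 1) × 3 = 39, matching the dimension quoted in the text.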
5.3 Classification
For the classification of the different sets of vector time series (original MFCC, MFCC transformed with PKPCA and MFCC transformed with PPCA), we use hidden Markov models with Gaussian observation densities (see equation (6)). HMM topologies are ergodic [9], and different numbers of states for the Markov chain are examined. Experiments are done using databases DB1 and DB2. The parameters of the model, namely the number of states $N_s$, the number of mixtures M and the width of the Gaussian kernel $\sigma_K^2$, are shown for the least complex model (see footnote 1) with the best resulting accuracy under a 5-fold cross-validation scheme. More detailed classification results may be found in [10].
5.4 Experiment 1. Accuracies Using the Complete Space
Three different sets of time series vectors are used as features for training, namely the original MFCC vectors, the MFCC vectors transformed using PKPCA models and the MFCC vectors transformed using PPCA models. All transformations are made using the method explained in Section 4.2. In this case, the dimension of all time series is 39, the original dimensionality of the multivariate MFCC coefficients. Accuracies for both databases are shown in Table 1. For database DB1, the HMM classifiers trained with the MFCC vectors transformed using PKPCA and PPCA outperform the one trained with the original raw dataset, even in terms of model complexity. For models with three states, PKPCA and PPCA performances do not seem to differ, although for more states the performance of PPCA degrades while PKPCA keeps its high recognition rate. For database DB2, the classifiers trained with the MFCC vectors transformed using PKPCA and PPCA again outperform the one trained with the original raw dataset, even in terms of model complexity. PKPCA shows better results than PPCA, for which a high standard deviation in accuracy is obtained.

Table 1. Accuracy results for databases DB1 and DB2

            DB1                                  DB2
Dataset     Ns   M   σK²   Accuracy %            Ns   M   σK²   Accuracy %
Original     3   5    -     90.00 ± 9.88          3   5    -     70.63 ± 4.74
PPCA         3   1    -    100.00 ± 0.00         10   1    -     91.25 ± 7.46
PKPCA        3   1    5    100.00 ± 0.00          3   1    5    100.00 ± 0.00

5.5 Experiment 2. Performance for Reduced Dimensionality
We evaluate the classification performance of time series vectors using the MFCC vectors transformed and reduced with PKPCA and PPCA models. Different values for p are used, even p > 39, i.e. dimensionality values greater than the original. The results obtained for both databases are shown in Table 2. In this experiment, the number of mixtures was manually fixed to 1.

Table 2. Results for databases DB1 and DB2 using reduced time series vectors

                DB1                             DB2
 p   Dataset    Ns   σK²   Accuracy %           Ns   σK²   Accuracy %
 5   PKPCA       3   15     96.25 ± 5.59         3   15     82.50 ± 3.56
15   PKPCA       3   15    100.00 ± 0.00         3   15     98.75 ± 1.71
30   PPCA        5    -     96.25 ± 3.42         5    -     87.50 ± 3.12
30   PKPCA       3    5    100.00 ± 0.00         3    5    100.00 ± 0.00
33   PPCA        3    -    100.00 ± 0.00         5    -     93.12 ± 9.21
33   PKPCA       3    5    100.00 ± 0.00         3    5    100.00 ± 0.00
36   PPCA        3    -     98.75 ± 2.80         5    -     90.62 ± 6.62
36   PKPCA       3    5    100.00 ± 0.00         3   15    100.00 ± 0.00
42   PKPCA       3    5    100.00 ± 0.00         3    5    100.00 ± 0.00
45   PKPCA       3    5    100.00 ± 0.00         3   15    100.00 ± 0.00
48   PKPCA       3    5    100.00 ± 0.00         3    5    100.00 ± 0.00

Footnote 1: In this context, less complex means the model with fewer states and mixtures that does not lose accuracy.
The results in Table 2 show that with just 15 of the 39 dimensions in DB1 and 30 of the 39 in DB2, the trained classifier reaches accuracies of 100% using the MFCC vectors transformed and reduced with the PKPCA model. This is not a coincidence: for the selected dimensionalities between 15 and 48 in DB1, and between 30 and 48 in DB2, the accuracy remains the same, even with the same number of states. Table 2 also shows that the PPCA model needs more states to achieve satisfactory results, while PKPCA obtains better accuracies with just 3 states. Intuitively, the PKPCA model requires a less complex structure than the PPCA model to explain the current data. For database DB2, accuracies using PPCA have large standard deviations, mainly due to the nature of the samples.
6 Conclusions
Classification results show that the proposed methodology greatly improves the performance of the hidden Markov model classifier. In particular, transforming time series vectors using a time-dependent probabilistic kernel principal component analyzer yields extremely high accuracies with low variance. Even with few components in the transformed multivariate sequences, it is still possible to discriminate between both classes in each database. Model selection remains a difficult open question. First, the number of states in the Markov chain that best explains the dynamic behavior must be carefully chosen. Second, a common problem with kernel methods is the selection of a suitable kernel. Even when a kernel is chosen, the issue of parameter selection remains, which is well known to be a difficult matter. We have examined some alternatives for this parameter and validated them using computationally expensive cross-validation techniques. It is not clear how the parameters of the kernel could be obtained in a simple way. For the choice of the reduced dimensionality of the time series vectors, applying sparse-oriented principal component analyzers may be a direction to explore. Finally, straightforward extensions of the proposed methodology could be formulated for latent variable models relying on covariance-like matrices.
Acknowledgements
The authors would like to thank CIE (Centro de Investigaciones y Extensión) and the Faculty of Technology of Universidad Tecnológica de Pereira for partially supporting this research project.
References
1. Jolliffe, I.: Principal Component Analysis. Second edn. Springer Verlag (2002)
2. Tipping, M., Bishop, C.: Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3) (1999) 611–622
3. Schölkopf, B., Smola, A., Müller, K.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5) (1998) 1299–1319
4. Tipping, M.: Sparse kernel principal component analysis. In Press, M., ed.: Neural Information Processing Systems, NIPS'00. (2000) 633–639
5. Zhou, C.: Probabilistic analysis of kernel principal components: mixture modeling and classification. CfAR technical report, CAR-TR-993, University of Maryland, Department of Electrical and Computer Engineering, Maryland (2003)
6. Zhang, Z., Wang, G., Yeung, D., Kwok, J.: Probabilistic kernel principal component analysis. Technical report HKUST-CS04-03, The Hong Kong University of Science and Technology, Department of Computer Science, Hong Kong (2004)
7. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA (2002)
8. Härdle, W., Simar, L.: Applied Multivariate Statistical Analysis. Springer, N.Y. (2003)
9. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2) (1989)
10. Alvarez, M., Henao, R.: PCA for time series classification - supplementary material. Technical report, Universidad Tecnológica de Pereira, Pereira, Colombia. Available at http://ohm.utp.edu.co/~rhenao/adminsite/elements/files/ kpcatt sm.pdf. (2006)
Confidence Intervals for the Risks of Regression Models
Imhoi Koo and Rhee Man Kil
Division of Applied Mathematics, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, Korea
[email protected], [email protected]
Abstract. The empirical risks of regression models are not accurate since they are evaluated from the finite number of samples. In this context, we investigate the confidence intervals for the risks of regression models, that is, the intervals between the expected and empirical risks. The suggested method of estimating confidence intervals can provide a tool for predicting the performance of regression models.
1 Introduction
In regression models, the goal is to minimize the general error (or the expected risk) over the whole distribution of the sample space, not just over a set of training samples. This is referred to as the generalization problem. The regression of real-valued functions is frequently tackled by incremental learning algorithms [1,2,3,4], in which the necessary computational units, usually kernel functions with locality such as Gaussian kernel functions, are recruited during the learning procedure. In this incremental learning, the optimal number of kernel functions should be sought in the sense of minimizing the expected risk, which depends on the given target function embedded in the samples and on the estimation function (or regression model). However, the expected risk cannot be estimated from the finite number of samples that are usually given for the learning of regression models. Instead, we measure the empirical risk from the finite number of samples. Usually a validation set, a part of the given samples, is extracted and used to estimate the risk. However, the empirical risk is not accurate. To investigate the accuracy of the evaluated risks, we can consider the confidence interval defined by the distance between the expected and empirical risks with a certain confidence (or probability). In the statistical estimation of confidence intervals, we assume a probability distribution of the samples, for instance a normal distribution or a t-distribution. This method can be applied to determine the confidence intervals of risk estimates in regression models [5]. However, in general, we do not know the error distribution. Furthermore, even if we know the error distribution, the formulas for the confidence intervals do not reveal how the confidence intervals change as the structure of the regression model, for instance the number of kernel functions, is changed. In this context, we suggest a way of estimating confidence intervals using risk estimates that can be described by the structural information of regression models, such as the number of kernel functions and some coefficients related to the target and estimation functions. As a result, we can predict the performance of regression models according to the risk estimates, which enables us to determine the optimal regression models. Through a simulation of function approximation using a regression model with Gaussian kernel functions, we show the validity of our method by demonstrating the performance prediction of regression models.

I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 755–764, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Confidence Intervals for Risk Functions
The performance of regression models is usually evaluated using a finite number of samples. First, let us consider the following l samples $(x_1, y_1), (x_2, y_2), \cdots, (x_l, y_l)$, which are drawn randomly and independently from an unknown probability distribution. We assume that the K-dimensional input pattern $x_i \in X$ (an input space, an arbitrary subset of $\mathbb{R}^K$) and the output pattern $y \in Y$ (an output space, an arbitrary subset of $\mathbb{R}$) have the following functional relationship:
$$y(x) = f_0(x) + \epsilon \tag{1}$$
where $f_0(x)$ represents the target function in the target function space F, and $\epsilon$ represents the randomly drawn noise term with mean zero and variance $\sigma_\epsilon^2$. Here, let us assume that the vectors $x_i$ are drawn independently and identically distributed in accordance with a probability distribution P(x). Then, $y_i$ is drawn from random trials in accordance with P(y|x), and the probability distribution P(x, y) defined on X × Y can be described by
$$P(x, y) = P(x)P(y|x). \tag{2}$$
We define a loss function as
$$Q(x, n) = (y(x) - f_n(x))^2 \tag{3}$$
where $f_n(x)$ represents the estimation function in $F_n$, the hypothesis space (or structure) with n parameters. The expected risk $R_\epsilon(f_n)$ for the noisy output pattern y(x) is defined by
$$R_\epsilon(f_n) = E[Q(x, n)] = E[(y(x) - f_n(x))^2] = \int_{X \times Y} (y(x) - f_n(x))^2 \, dP(x, y). \tag{4}$$
If $\epsilon$ has a normal distribution, the selection of a hypothesis which minimizes the mean square error of (4) is optimal, that is,
$$f_0(x) = E[y|x]. \tag{5}$$
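To make explicit why the minimizer of (4) is the target function, the expected risk can be expanded as below. This short derivation is a sketch that additionally assumes the noise $\epsilon$ is independent of x, which the text does not state explicitly; the cross term then vanishes because $\epsilon$ has mean zero.

$$
\begin{aligned}
R_\epsilon(f_n) &= E\big[(f_0(x) + \epsilon - f_n(x))^2\big] \\
&= E\big[(f_0(x) - f_n(x))^2\big] + 2\,E[\epsilon]\,E\big[f_0(x) - f_n(x)\big] + E[\epsilon^2] \\
&= E\big[(f_0(x) - f_n(x))^2\big] + \sigma_\epsilon^2 .
\end{aligned}
$$

Under this assumption the expected risk is bounded below by $\sigma_\epsilon^2$, and the bound is attained when $f_n = f_0$.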
However, we cannot estimate (4) in advance. Instead, we get the empirical risk $R_{\epsilon,emp}$ for l noisy samples, defined by
$$R_{\epsilon,emp}(f_n) = \frac{1}{l}\sum_{i=1}^{l} Q(x_i, n) = \frac{1}{l}\sum_{i=1}^{l} (y_i - f_n(x_i))^2 \tag{6}$$
where $x_i$ and $f_n(x_i)$ represent the i-th input pattern and the corresponding output of the estimation function, respectively. Let us also define an estimation error for the estimator $f_n(x)$ as
$$e_n(x) = y(x) - f_n(x). \tag{7}$$
Then $e_n(x)$ represents a random variable associated with the error of the estimator $f_n(x)$. Here, let us analyze the relationship between the true and empirical risks in terms of the probability distribution of $e_n(x)$ when we evaluate the performance of regression models. In particular, we consider the confidence interval defined by the absolute value of the difference between the expected and empirical risks. For this confidence interval, we propose the following theorem:

Theorem 1. The confidence intervals for risk functions of the given $f_n$ are bounded by the following inequality with a probability of at least 1 − δ:
$$|R_\epsilon(f_n) - R_{\epsilon,emp}(f_n)| \le Z_\delta \sqrt{\frac{k_n - 1}{l}}\, R_\epsilon(f_n) \tag{8}$$
where $Z_\delta$ represents a constant bounded by
$$Z_\delta \le \sqrt{\frac{1}{\delta}} \tag{9}$$
and $k_n$ represents a constant dependent upon the distribution of estimation errors, defined by
$$k_n = \frac{E[e_n^4(x)]}{E^2[e_n^2(x)]}. \tag{10}$$

Proof. Let us define a random variable $z_n(x)$ as
$$z_n(x) = (y(x) - f_n(x))^2 = e_n^2(x). \tag{11}$$
Then, the expected value of $z_n(x)$ is given by
$$E[z_n(x)] = E[e_n^2(x)] = R_\epsilon(f_n) \tag{12}$$
and the variance of $z_n(x)$ is given by
$$Var(z_n(x)) = E[z_n^2(x)] - E^2[z_n(x)]. \tag{13}$$
Here, we assume that there exists a constant $k_n$ such that
$$k_n = \frac{E[z_n^2(x)]}{E^2[z_n(x)]} = \frac{E[e_n^4(x)]}{E^2[e_n^2(x)]} < \infty. \tag{14}$$
Then,
$$Var(z_n(x)) = (k_n - 1)E^2[z_n(x)] = (k_n - 1)R_\epsilon^2(f_n). \tag{15}$$
In general, the constant $k_n$ is greater than or equal to 1, since the variance of $z_n(x)$ is always nonnegative. Let us also define a random variable $\bar{z}_n$ which is the sample mean of the $z_n(x_i)$, that is,
$$\bar{z}_n = \frac{1}{l}\sum_{i=1}^{l} z_n(x_i) = R_{\epsilon,emp}(f_n). \tag{16}$$
Then, the expected value of $\bar{z}_n$ is given by
$$E[\bar{z}_n] = E[R_{\epsilon,emp}(f_n)] = E[z_n(x)] = R_\epsilon(f_n) \tag{17}$$
and the variance of $\bar{z}_n$ is given by
$$Var(\bar{z}_n) = Var(R_{\epsilon,emp}(f_n)) = \frac{1}{l}Var(z_n(x)) = \frac{k_n - 1}{l} R_\epsilon^2(f_n) \tag{18}$$
since the $z_n(x_i)$ are independent of each other. This is due to the fact that measurable functions of independent random variables are also independent. Then, according to Chebyshev's inequality,
$$\Pr\{|E[\bar{z}_n] - \bar{z}_n| > \epsilon_n\} \le \frac{Var(\bar{z}_n)}{\epsilon_n^2}. \tag{19}$$
The above inequality is always satisfied regardless of the distribution of $\bar{z}_n = R_{\epsilon,emp}(f_n)$. Equivalently, the following is always satisfied:
$$\Pr\{|R_\epsilon(f_n) - R_{\epsilon,emp}(f_n)| \le \epsilon_n\} \ge 1 - \frac{Var(R_{\epsilon,emp}(f_n))}{\epsilon_n^2}. \tag{20}$$
Let us set the confidence parameter as
$$\delta = \frac{Var(R_{\epsilon,emp}(f_n))}{\epsilon_n^2}. \tag{21}$$
This confidence parameter satisfies the following equation from (18):
$$\delta = \frac{k_n - 1}{\epsilon_n^2\, l} R_\epsilon^2(f_n). \tag{22}$$
Then, the bound $\epsilon_n$ becomes
$$\epsilon_n = \sqrt{\frac{k_n - 1}{l\delta}}\, R_\epsilon(f_n). \tag{23}$$
Therefore, with a probability of at least 1 − δ,
$$|R_\epsilon(f_n) - R_{\epsilon,emp}(f_n)| \le Z_\delta \sqrt{\frac{k_n - 1}{l}}\, R_\epsilon(f_n) \tag{24}$$
where $Z_\delta \le \sqrt{1/\delta}$ because of (20) and (21). Q.E.D.
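The distribution-free form of the bound can be evaluated directly. The following sketch computes the half-width on the right-hand side of (8) with $Z_\delta$ set to its Chebyshev upper limit $\sqrt{1/\delta}$ from (9); the function name and argument names are hypothetical.

```python
import math


def chebyshev_risk_halfwidth(risk, k_n, l, delta):
    """Half-width Z_delta * sqrt((k_n - 1) / l) * risk with Z_delta = sqrt(1 / delta)."""
    if k_n < 1.0:
        raise ValueError("k_n is at least 1 because Var(z_n) is nonnegative")
    if l <= 0 or not 0.0 < delta <= 1.0:
        raise ValueError("need l > 0 and 0 < delta <= 1")
    z_delta = math.sqrt(1.0 / delta)
    return z_delta * math.sqrt((k_n - 1.0) / l) * risk
```

Because Chebyshev's inequality holds for any distribution, this half-width is conservative; the distribution-specific choices of $Z_\delta$ discussed in the text give tighter intervals.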
In the suggested theorem, the value of $Z_\delta$ depends on the probability distribution of $R_{\epsilon,emp}(f_n)$. Note that the right-hand side of (8) is equivalent to the standard deviation of $R_{\epsilon,emp}(f_n)$ multiplied by $Z_\delta$. If the number of samples l is reasonably large (usually more than 30), $R_{\epsilon,emp}(f_n)$ can be approximated by a normal distribution according to the central limit theorem, that is, the following can be approximated by a unit normal distribution:
$$\frac{R_{\epsilon,emp}(f_n) - R_\epsilon(f_n)}{\sqrt{Var(R_{\epsilon,emp}(f_n))}} \;\dot\sim\; N(0, 1). \tag{25}$$
In practice, we usually calculate the following sample variance $S_\epsilon^2(f_n)$ instead of the exact value of the variance of $R_{\epsilon,emp}$:
$$S_\epsilon^2(f_n) = \frac{1}{l - 1}\sum_{i=1}^{l} \big(e_n^2(x_i) - R_{\epsilon,emp}(f_n)\big)^2. \tag{26}$$
In this case, $R_{\epsilon,emp}(f_n)$ can be approximated by a t distribution with l − 1 degrees of freedom, that is,
$$\frac{R_{\epsilon,emp}(f_n) - R_\epsilon(f_n)}{S_\epsilon(f_n)/\sqrt{l}} \;\dot\sim\; t_{l-1}. \tag{27}$$
This implies that $Z_\delta$ in (8) can be approximated as
$$Z_\delta \approx t_{\delta/2,\, l-1} \tag{28}$$
when we calculate the empirical risks and the sample variances of $R_{\epsilon,emp}(f_n)$ for a reasonably large number of samples. Another interesting point of the suggested theorem is that the value of $k_n$ depends on the probability distribution of $e_n(x)$. For instance, if the random variable $e_n(x)$ has a Gaussian probability density function,
$$E[e_n^4(x)] = \frac{1}{\sqrt{2\pi}\,\sigma_n}\int_{-\infty}^{+\infty} x^4 e^{-(x - \eta_n)^2/(2\sigma_n^2)}\, dx = 3\sigma_n^4 + 6\sigma_n^2\eta_n^2 + \eta_n^4 \le 3(\sigma_n^2 + \eta_n^2)^2 = 3R_\epsilon^2(f_n) \tag{29}$$
where $\eta_n = E[e_n(x)]$ and $\sigma_n^2 = Var(e_n(x))$. That is, $k_n$ is less than or equal to 3. Note that $k_n$ equals 1.8 in the case of a zero-mean uniform probability density function. If the estimation error $e_n(x)$ has a Laplacian-type probability density function, $k_n$ is larger than 3.
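The values of $k_n$ quoted above can be checked numerically. The sketch below evaluates $k = E[e^4]/E^2[e^2]$ with a midpoint rule for three zero-mean densities; the function names, integration ranges and step count are choices made here, not taken from the paper.

```python
import math


def kurtosis_ratio(pdf, lo, hi, steps=200_000):
    """Approximate E[e^4] / E[e^2]^2 for a density supported (effectively) on [lo, hi]."""
    h = (hi - lo) / steps
    m2 = m4 = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h      # midpoint of the i-th cell
        w = pdf(x) * h              # probability mass of the cell
        m2 += x * x * w
        m4 += x ** 4 * w
    return m4 / (m2 * m2)


def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def uniform_pdf(x):
    return 0.5                      # uniform density on [-1, 1]


def laplace_pdf(x):
    return 0.5 * math.exp(-abs(x))


k_gauss = kurtosis_ratio(gaussian_pdf, -12.0, 12.0)
k_unif = kurtosis_ratio(uniform_pdf, -1.0, 1.0)
k_lap = kurtosis_ratio(laplace_pdf, -60.0, 60.0)
```

The three results should land close to 3, 1.8 and 6 respectively, in line with the Gaussian bound, the uniform value and the statement that a Laplacian error gives $k_n > 3$.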
Fig. 1. The distribution of $R_{\epsilon,emp}(f_n)$: the probability of $R_{\epsilon,emp}(f_n)$ lying within $R_\epsilon(f_n) \pm \epsilon_n$ is at least 1 − δ

The concept of the suggested theorem is illustrated in Figure 1: 1) the empirical risk $R_{\epsilon,emp}(f_n)$ can be treated as a random variable since it is estimated from l randomly drawn samples, 2) the expectation of $R_{\epsilon,emp}(f_n)$ is the true risk $R_\epsilon(f_n)$, and 3) with a probability of at least 1 − δ, the empirical risk $R_{\epsilon,emp}(f_n)$ lies between $R_\epsilon(f_n) - \epsilon_n$ and $R_\epsilon(f_n) + \epsilon_n$. In other words, the suggested confidence interval indicates the maximum distance from the empirical risk to the true risk with a probability of at least 1 − δ.
3 Simulation
To check the validity of the suggested theorem, we performed a simulation for the regression of samples generated from the following two-dimensional function:
$$f(x_1, x_2) = 0.4x_1 \sin(4\pi x_1) + 0.6x_2 \cos(6\pi x_2) \tag{30}$$
where the values of $x_1$ and $x_2$ were selected between 0 and 1. For this regression, we chose a learning model with Gaussian kernel functions. The training of this learning model was done by the incremental learning algorithm [4], in which the necessary Gaussian kernel functions (GKFs) were recruited at the positions where the learning model made large estimation errors. In this model, the training was composed of two learning processes: the process of recruiting the necessary number of GKFs and the process of estimating the parameters associated with the GKFs. The goal of the recruiting process is to recruit the necessary number of kernel functions. Through this process, we adjusted the shape parameters (or kernel widths) and the positions of the kernel functions representing the reference points of the learning samples. After the recruiting process, the weight parameters associated with the kernel functions were adjusted in such a way as to minimize the mean squared error between the desired and actual outputs.

For the training of the learning model, one training set of 500 samples and another training set of 1000 samples were generated randomly from (30) and applied to the incremental learning algorithm, in which the number of kernel functions n was increased from 10 to 300. A test set of 500 samples that did not overlap with the training sets was also generated randomly from (30). We measured the empirical risk $R_{\epsilon,emp}(f_n)$ for the given number of kernel functions n using the test set. Then, the parameters associated with the following risk estimate [6] were determined:
$$\widehat{R}_\epsilon(f_n) = C_1 \left(\frac{1}{n}\right)^{\alpha} + C_2 \frac{n^{\beta}}{\sqrt{l_2}} + \theta \tag{31}$$
where $C_1$, $C_2$, α, β, and θ were coefficients to be estimated. Here, $l_2$ was the number of training samples. The value of α indicated the convergence speed of the learning model, which depended on the smoothness of the target function and the regression model. If α was greater than or equal to 1, it was referred to as the fast rate of approximation. Usually, the value of α was between 1/2 and 1. The range of β was between 1/2 and 2 due to the range of the VC dimension of artificial neural networks, as described in [7,8]. The coefficients of (31) were adapted in such a way that the mean square error of $R_{\epsilon,emp}(f_n) - \widehat{R}_\epsilon(f_n)$ over the number of kernel functions n was minimized. Once the coefficients were estimated, the risk estimates for various numbers of training samples could be predicted. For a detailed description of the form of (31) and the estimation procedure, refer to [6].

From the risk estimate of (31), the 95 percent confidence intervals were estimated using the suggested theorem as follows:
$$|R_\epsilon(f_n) - R_{\epsilon,emp}(f_n)| \le Z_\delta \sqrt{\frac{\widehat{k}_n - 1}{l_1 - 1}}\, \widehat{R}_\epsilon(f_n) \tag{32}$$
where $Z_\delta$ was set as $Z_\delta = Z_{0.05} = t_{0.025,499}$ from (28). Here, $l_1$ was the number of test samples, and the estimate of $k_n$ was determined using the form of (10) as follows:
$$\widehat{k}_n = \frac{1}{l_1}\sum_{j=1}^{l_1} e_n^4(x_j) \Bigg/ \left(\frac{1}{l_1}\sum_{k=1}^{l_1} e_n^2(x_k)\right)^2 \tag{33}$$
where $e_n(x_j)$ was the error for the j-th sample in the test set. Note that the term $l_1 - 1$ was used in (32) instead of $l_1$ since the right-hand side of (32) was equivalent to the sample standard deviation of $R_{\epsilon,emp}(f_n)$, that is, $S_\epsilon(f_n)/\sqrt{l_1}$ in (27), multiplied by $Z_\delta$.

The estimation results are shown in Figure 2. In this estimation, the value of $k_n$ started from 2.6 and saturated around 26.1. These results showed that the probability distribution of the error function $e_n(x)$ was closer to normal when the number of kernel functions was small, while, as the number of kernel functions increased, it became closer to Laplacian, that is, with a sharp concentration near 0 but with a longer tail along the error axis compared to the normal distribution.

Fig. 2. The plot of $k_n$ versus the number of kernel functions n: the transition of $k_n$ shows that the probability distribution of the error function is closer to normal for small n and becomes closer to Laplacian as n increases

After the adaptation of the coefficients associated with the risk estimate and confidence intervals using the training set of 500 samples and the test set of 500 samples, we compared the predicted risk estimates and confidence intervals with the measured data when the number of training samples was 500 and 1000. These results are plotted in Figures 3 and 4. In all figures, the predicted risk estimates are drawn with a solid line, while the predicted confidence intervals are drawn with a dashed line. We also compared the results of the performance prediction with the risks calculated by numerical integration of the loss functions and with the 95% confidence intervals obtained using bootstrap percentiles [9], one of the widely used resampling methods for confidence interval estimation. The simulation results showed that 1) the predicted risk estimates fitted the measured empirical risks well, even when we changed the number of training samples, and 2) the predicted confidence intervals fitted the confidence intervals measured with the bootstrap percentile method well, especially when the predicted risk estimates were close to the measured empirical risks.

Fig. 3. The plot of measured and predicted risks, and confidence intervals, versus the number of kernel functions n in the case of 500 training samples. The circle represents the risk value at $n^* = 206$.

Fig. 4. The plot of measured and predicted risks, and confidence intervals, versus the number of kernel functions n in the case of 1000 training samples. The circle represents the risk value at $n^* = 278$.

As a result of this simulation, the optimal number of kernel functions for 1000 training samples was estimated as 278 from the predicted risk estimate. With this number of kernel functions, the predicted risk estimate was 0.0066 and the predicted 95% confidence interval was 0.0066 ± 0.0029. In this case, the risk value calculated by the numerical integration method was $R_\epsilon(f_n) = 0.0086$ at n = 278. This value was included in the predicted confidence interval, as shown in Figure 4. These results showed that the suggested method could be applied to predict the performance of regression models and possibly to predict the optimal number of kernel functions in the sense of minimizing the expected risk.
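The interval of (32) and (33) and the bootstrap percentile interval it is compared against can be sketched on synthetic data as follows. This is an illustration under several assumptions, not a reproduction of the experiment: a constant zero predictor stands in for the trained kernel model, the samples are noise-free, the empirical risk replaces the fitted estimate of (31), and 1.96 is used as a rough stand-in for $t_{0.025,499}$. All function names are hypothetical.

```python
import math
import random


def target(x1, x2):
    """Target function (30)."""
    return 0.4 * x1 * math.sin(4 * math.pi * x1) + 0.6 * x2 * math.cos(6 * math.pi * x2)


def suggested_interval(errors, z_delta):
    """Empirical risk, estimate of k_n as in (33), and half-width in the form of (32)."""
    l1 = len(errors)
    sq = [e * e for e in errors]
    risk = sum(sq) / l1                                  # empirical risk
    k_hat = (sum(s * s for s in sq) / l1) / (risk * risk)
    half = z_delta * math.sqrt((k_hat - 1.0) / (l1 - 1)) * risk
    return risk, k_hat, half


def bootstrap_percentile(errors, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for the mean squared error by resampling with replacement."""
    rng = random.Random(seed)
    sq = [e * e for e in errors]
    l1 = len(sq)
    means = sorted(sum(rng.choices(sq, k=l1)) / l1 for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


rng = random.Random(1)
# Errors of a constant zero predictor on 500 test points (a stand-in model).
errors = [target(rng.random(), rng.random()) - 0.0 for _ in range(500)]
risk, k_hat, half = suggested_interval(errors, z_delta=1.96)
lo, hi = bootstrap_percentile(errors)
```

Comparing `risk ± half` with `(lo, hi)` mirrors, in miniature, the comparison between predicted intervals and bootstrap percentiles described in the text.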
4 Conclusion
We have suggested the confidence intervals for the risks of regression models. The suggested confidence intervals indicate the range of the expected risks when we measure the empirical risks of regression models. Through the simulation for function approximation, we have shown that the suggested confidence intervals are well fitted to the measured data. The suggested method of estimating confidence intervals can provide a tool for predicting the performance of regression models.
References
1. J. Moody and C. J. Darken: Fast learning in networks of locally-tuned processing units. Neural Computation, (1989) 1(2) 281–294
2. E. J. Hartman, J. D. Keeler, and J. M. Kowalski: Layered neural networks with gaussian hidden units as universal approximations. Neural Computation, (1990) 2(2) 210–215
3. S. Lee and R. M. Kil: A gaussian potential function network with hierarchically self-organizing learning. Neural Networks, (1991) 4(2) 207–224
4. R. M. Kil: Function approximation based on a network with kernel functions of bounds and locality: An approach of non-parametric estimation. ETRI Journal, (1993) 15(2) 35–51
5. R. Dybowski and S. Roberts: Confidence intervals and prediction intervals for feedforward neural networks. In R. Dybowski and V. Gant (eds.) Clinical Applications of Artificial Neural Networks. Cambridge: Cambridge University Press (2001)
6. R. M. Kil and I. Koo: True risk bounds for the regression of real-valued functions. In Proceedings of IEEE International Joint Conference on Neural Networks, Portland, U.S.A., July, (2003) 507–512
7. M. Karpinski and A. Macintyre: Polynomial bounds for VC dimension of sigmoidal neural networks. In Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, (1995) 200–208
8. A. Sakurai: Polynomial bounds for the VC-dimension of sigmoidal, radial basis function, and sigma-pi networks. In Proceedings of the World Congress on Neural Networks, (1995) volume 1, 58–63
9. B. Efron and R. Tibshirani: An Introduction to the Bootstrap. Chapman and Hall (1993)
A Divide-and-Conquer Approach to the Pairwise Opposite Class-Nearest Neighbor (POC-NN) Algorithm for Regression Problem
Thanapant Raicharoen (1), Chidchanok Lursinsap (1), and Frank Lin (2)
(1) Advanced Virtual and Intelligent Computing Center (AVIC), Department of Mathematics, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
[email protected], [email protected]
(2) Department of Mathematics and Computer Science, University of Maryland Eastern Shore, Kiah Hall, Princess Anne, Maryland 21853-1299, U.S.A.
[email protected]
Abstract. This paper presents a method for regression problems based on a divide-and-conquer approach to the selection of a set of prototypes from the training set for the nearest neighbor rule. This method aims at detecting and eliminating redundancies in a given data set while preserving the significant data. A reduced prototype set contains Pairwise Opposite Class-Nearest Neighbor (POC-NN) prototypes, which are used instead of the whole given data. Before finding POC-NN prototypes, all sampling data have to be separated into two classes according to whether the sample number is odd or even. POC-NN prototypes are then obtained by iteratively separating and analyzing the training data into two regions until each region is correctly grouped and classified. The separability is determined by the POC-NN prototypes, which are essential to define the function approximator for the local sampling data located near these POC-NN prototypes. The reported experiments and results show the effectiveness of this technique and its performance, in both accuracy and prototype rate, compared to those obtained by classical nearest neighbor techniques.
1 Introduction
The Nearest Neighbor (NN) rule proposed by Cover and Hart [1,2] is one of the most attractive non-parametric decision rules, or instance-based learning rules, for classification and pattern recognition, since it requires no a priori knowledge of the underlying distributions of the data. Because non-parametric decision rules are highly unstructured, they are typically not useful for understanding the relationship between the features and the class outcome. As black-box predictors, however, they can be very effective and are often among the best performers on real problems. The NN technique can also be used in regression and works reasonably well for low-dimensional problems. With high-dimensional features, however, the bias-variance tradeoff does not work as favorably for nearest neighbor regression as it does for classification. Moreover, the original NN rule imposes a heavy computational load in both time (finding the neighbor) and space (storing the entire training data set). Reducing storage requirements is therefore important and remains an ongoing research issue.

I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 765–772, 2006. © Springer-Verlag Berlin Heidelberg 2006

The k-Nearest Neighbors technique can be viewed as a form of local risk minimization. In this method, the function is estimated by taking a local average of the data, where "local" refers to the k data points nearest to the estimated point. The other well-known regression technique is linear interpolation, which estimates all intermediate values as lying on the straight line between the given points. However, it suffers from drawbacks similar to those of the NN technique. Firstly, a large memory space is required, as the entire data set has to be stored and all straight lines between the given points have to be computed. Secondly, it is sensitive to noisy data. To alleviate these drawbacks, one has to eliminate the redundant data while preserving the significant data, so that only a selected prototype set is used instead of the whole given data.

A divide-and-conquer approach to the Pairwise Opposite Class-Nearest Neighbor (POC-NN) algorithm, proposed in [3], has been shown to be very effective for classification and pattern recognition problems. In this paper, the POC-NN algorithm is generalized so that it can handle regression problems. This work focuses on developing a new method for obtaining a set of selected prototypes for regression problems. Our proposed method is based on a divide-and-conquer approach. That is, the original training patterns are partitioned into smaller regions, POC-NN prototypes are found for each region, and the prototypes of all regions are then combined into a set of selected prototypes.

The rest of this paper is organized as follows. Section 2 presents our proposed POC-NN method. In Section 3, some experiments are described which evaluate the performance of our method. Section 4 concludes the paper.
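The two baseline estimators described in the introduction can be illustrated with a minimal one-dimensional sketch. The function name `knn_regress` and the toy data are hypothetical and chosen only for illustration; linear interpolation is delegated to NumPy's `np.interp`.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k):
    """Estimate f(x_query) as the average of the k nearest training targets."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    nearest = np.argsort(np.abs(x_train - x_query))[:k]  # indices of the k closest points
    return y_train[nearest].mean()

# Toy data (illustrative only): samples of y = x^2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2

knn_estimate = knn_regress(x, y, 1.2, k=2)   # local average of the 2 nearest targets
linear_estimate = np.interp(1.2, x, y)       # value on the segment between x=1 and x=2
```

Both estimators keep the whole training set in memory, which is the storage cost that prototype selection aims to reduce.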
2 The Methodology of POC-NN for Regression Problem
The idea is to find a subset of the original sampling set that suffices for linear interpolation, and to throw away the remaining data. Intuitively, it seems reasonable to keep the original points (patterns) that are used for building the linear interpolation line, while points lying on or near this line should be discarded.
2.1 Finding POC-NN Patterns Algorithm for Regression
Consider n samples of a given data set S,

$$X = \{x_1, \ldots, x_n\}, \quad x_i \in \mathbb{R}^d, \qquad (1)$$

and their corresponding function values

$$Y = \{y_1 = f(x_1), \ldots, y_n = f(x_n)\}, \quad y_i \in \mathbb{R}, \qquad (2)$$

where $\mathbb{R}$ denotes the set of real numbers.
Before finding prototypes for the regression problem for a given sampling set S with dimension d, the whole sampling set S has to be separated into two parts (classes), namely class 1 and class 2. To obtain an even distribution of the sampling data, a simple criterion based on odd and even sample indices is used to obtain a new sampling set S' with dimension d+1, where

$$S' = \begin{cases} S^{(1)} = \{(x_i, y_i)\}, & \text{if } i \text{ is odd}, \\ S^{(2)} = \{(x_j, y_j)\}, & \text{if } j \text{ is even}. \end{cases} \qquad (3)$$

The algorithm to find the Pairwise Opposite Class-Nearest Neighbor for regression is given as follows:

Function FIND-POC-NN-R (S' : Dataset)
1. Let S^(1) and S^(2) be the training sets defined by Eq. (3), and d+1 be the dimension of S'.
2. Let x_1 ∈ S^(1) be the first element.
3. Let x_p2 ∈ S^(2) be the nearest pattern to x_1.
4. For 1 ≤ i ≤ d do
5.    Z = {x_p1i | x_p1i ∈ S^(1) and x_p1i is the ith nearest pattern to x_p2}.
6. End
7. Return (Z, x_p2) as POC-NN prototypes.
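The steps of FIND-POC-NN-R can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: distances are taken to be Euclidean in the (d+1)-dimensional space of (x, y) pairs, the odd/even split uses 1-based sample numbers, and the helper names `split_odd_even` and `find_poc_nn_r` are hypothetical.

```python
import numpy as np

def split_odd_even(X, y):
    """Build S' from (x_i, y_i) pairs and split it by 1-based sample number (Eq. 3)."""
    S = np.column_stack([X, y])   # each row is a (d+1)-dimensional pattern
    return S[0::2], S[1::2]       # S1: odd-numbered samples, S2: even-numbered samples

def find_poc_nn_r(S1, S2, d):
    """Return (Z, x_p2): the d patterns of S1 nearest to x_p2, and x_p2 itself."""
    x1 = S1[0]                                             # step 2: first element of S1
    x_p2 = S2[np.argmin(np.linalg.norm(S2 - x1, axis=1))]  # step 3: nearest in S2 to x1
    order = np.argsort(np.linalg.norm(S1 - x_p2, axis=1))  # steps 4-6: rank S1 by distance to x_p2
    Z = S1[order[:d]]
    return Z, x_p2

# Toy one-dimensional data (illustrative only).
X = np.array([0.0, 1.0, 2.5, 3.0])
y = np.zeros(4)
S1, S2 = split_odd_even(X, y)
Z, x_p2 = find_poc_nn_r(S1, S2, d=1)
```

The Euclidean metric is an assumption: the listing above only speaks of the "nearest pattern" and does not fix the distance.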
[Figure 1: plot of f(x) against x, with "+" marking Class 1 and "×" marking Class 2; the patterns x_1, x_p1 and x_p2 are labeled.]
Fig. 1. This example shows how to find the prototypes for regression
Figure 1 shows an example of how the algorithm for finding POC-NN prototypes for regression works. The given sampling set S consists of 11 samples of a one-dimensional sine function. The data set S' is separated into S^(1) and S^(2), whose patterns are denoted by the symbols "+" and "×", respectively. The POC-NN prototypes (x_p11, x_p2) are enclosed in circles "◦". Once the POC-NN patterns are found, a separating hyperplane is generated and placed between them. This hyperplane has a useful characteristic for both classification and regression problems: it acts as a perceptron for local classification [3] and as a function approximator for the local sampling data located near these POC-NN prototypes. Although this characteristic is simple, it is useful for both problems. When the dimension d is greater than one, it is necessary to find d+1 POC-NN prototypes (x_p11, x_p12, ..., x_p1d, x_p1d+1, x_p2) and a separating hyperplane that is generated and placed among all d+1 POC-NN prototypes.
2.2 The POC-NN Algorithm for Regression Problem
The algorithm for selecting POC-NN patterns as a selected (compressed) data set for regression is as follows. Initially, let S be a sample set with dimension d, let S' be the training set defined by Eq. (3), and let POC-NN-SET be an initially empty set of POC-NN prototypes for the regression problem.

Function SELECT-POC-NN-R (S' : Dataset)
1. Find POC-NN prototypes in S' by using (x_p1i, x_p2) = FIND-POC-NN-R (S').
2. Create a hyperplane H: {x | w · x − b = 0}, x_i ∈ S', connecting all d+1 POC-NN prototypes.
3. Save all d+1 POC-NN prototypes and the corresponding H into the POC-NN-SET.
4. Divide all patterns x_i of S' into two regions, namely R1 and R2, where R1 = {x_i ∈ S' | w · x_i − b ≥ 0} and R2 = {x_i ∈ S' | w · x_i − b < 0}.
5. Find any misclassification in both regions.
6. If any misclassification exists in region R1 Then
7.    Consider all data in R1 as a new data set. Call SELECT-POC-NN-R (R1). End
8. If any misclassification exists in region R2 Then
9.    Consider all data in R2 as a new data set. Call SELECT-POC-NN-R (R2). End
10. If no more misclassification exists Then
11.    Return POC-NN-SET as a set of selected prototypes.
12.    Stop. End
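Steps 2 and 4 of SELECT-POC-NN-R can be sketched in isolation. The code below is a hedged illustration: it assumes that "connecting all d+1 prototypes" means the hyperplane passes through those d+1 points of the (d+1)-dimensional space, and it obtains (w, b) as a null vector of the stacked system. The function names are hypothetical, and the misclassification test of step 5 is not sketched because the listing does not define it precisely.

```python
import numpy as np

def hyperplane_through(points):
    """Return (w, b) with w . p - b = 0 for every row p of `points`.

    `points` holds m = d+1 affinely independent points of R^(d+1).
    """
    P = np.asarray(points, dtype=float)
    A = np.hstack([P, -np.ones((P.shape[0], 1))])  # rows [p, -1], unknowns (w, b)
    _, _, vt = np.linalg.svd(A)                    # last right-singular vector spans the null space
    w, b = vt[-1, :-1], vt[-1, -1]
    scale = np.linalg.norm(w)                      # normalise so that ||w|| = 1
    return w / scale, b / scale

def split_regions(S, w, b):
    """Step 4: R1 = {x : w . x - b >= 0}, R2 = {x : w . x - b < 0}."""
    S = np.asarray(S, dtype=float)
    side = S @ w - b
    return S[side >= 0], S[side < 0]

# Illustrative prototypes in R^2 (d = 1): the hyperplane is the line y = x.
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
w, b = hyperplane_through(prototypes)
R1, R2 = split_regions([[0.0, 1.0], [1.0, 0.0]], w, b)
```

The sign of w returned by the SVD is arbitrary, so which side is called R1 may flip between runs of different data; only the partition itself is meaningful in this sketch.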
2.3 Approximating Algorithm by POC-NN
To reconstruct the values of the original data and to estimate the value of an untrained sample x_u, the Nearest Neighbor (NN) rule is used: the nearest distance is measured between x_u and the set of all POC-NN prototypes, whose dimension must be reduced from d+1 to d. Let x_u ∈ R^d be an untrained data pattern, and x_p ∈ R^d be a POC-NN prototype whose dimension is reduced from d+1 to d. The details of this algorithm are as follows:

Approximating Algorithm
1. For each point x_u Do
2.    Identify the nearest POC-NN prototype (x_p) to x_u and its corresponding hyperplane H as a function approximator f˜.
3.    Calculate the reconstruction/approximation value by y_u = f˜(x_p).
4. End
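One possible reading of the approximating algorithm is sketched below. It rests on two assumptions that go beyond the listing: the approximator f˜ is obtained by solving the hyperplane equation w · (x, y) − b = 0 for the last coordinate y, and it is evaluated at the query point x_u (the listing writes f˜(x_p)). The function name `approximate` and the toy prototypes are hypothetical.

```python
import numpy as np

def approximate(x_u, prototypes, planes):
    """Estimate y_u from the hyperplane attached to the nearest prototype.

    prototypes : (m, d) array of prototypes with the y-coordinate dropped.
    planes     : list of m pairs (w, b), w in R^(d+1), one per prototype.
    """
    x_u = np.atleast_1d(np.asarray(x_u, dtype=float))
    nearest = int(np.argmin(np.linalg.norm(prototypes - x_u, axis=1)))
    w, b = planes[nearest]
    # Solve w[:d] . x + w[d] * y - b = 0 for y (assumes w[d] != 0).
    return (b - w[:-1] @ x_u) / w[-1]

# Illustrative data: two prototypes on the line y = 2x + 1, one on the line y = 5.
line_a = (np.array([2.0, -1.0]), -1.0)   # 2x - y + 1 = 0
line_b = (np.array([0.0, 1.0]), 5.0)     # y - 5 = 0
prototypes = np.array([[0.0], [1.0], [4.0]])
planes = [line_a, line_a, line_b]

y_near_origin = approximate(0.4, prototypes, planes)
y_far = approximate(3.5, prototypes, planes)
```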
3 Experimental Results
This section presents the data sets and experimental results for regression. The performance of POC-NN, in both accuracy rate and training time, is evaluated and compared with the 1-Nearest Neighbor (1-NN) and Linear Interpolation techniques.
3.1 Data Sets
All algorithms are tested and evaluated on a number of standard benchmark regression data sets, both artificial and real. The classical one-dimensional and two-dimensional sinc|x| functions are defined as follows. The one-dimensional function

$$f(x) = \mathrm{sinc}|x| = \frac{\sin|x|}{|x|} \qquad (4)$$

is considered on the basis of a sequence of 128 measurements without noise on a uniform lattice. The two-dimensional function

$$f(x, y) = \mathrm{sinc}\sqrt{x^2 + y^2} \qquad (5)$$

is considered on the basis of a sequence of 1,681 measurements without noise on a uniform lattice. The well-known chaotic time series, the Mackey-Glass, is described by the delay-differential equation [4]:

$$\frac{dx(t)}{dt} = -0.1\,x(t) + \frac{0.2\,x(t - \Delta)}{1 + x(t - \Delta)^{10}}, \qquad (6)$$
with parameters Δ = 17 and 30. These two time series are denoted by MG17 and MG30. The time series of the Lorenz differential equation [5,6] are also considered. Moreover, two real-world data sets, the Titanium [7] and the Sunspot [8,9] series from 1700-1799, are used in the experiments. The properties of the experimental data sets are given in Table 1.

Table 1. Properties of the data sets used for regression problem

Data sets       No. of trg. patterns   No. of test patterns
1. Sinc 1d               85                     43
2. Sinc 2d             1120                    561
3. MG17                 133                     67
4. MG30                 133                     67
5. Lorenz               200                    100
6. Titanium              32                     17
7. Sunspot               66                     34

The original data sets are separated into training and test sets (2:1 ratio) by using the following criterion. The first pattern belongs to the test set, the second and third patterns belong to the training set, the fourth pattern belongs to the test set, the fifth and sixth patterns belong to the training set, and so on.

The Percent Root Mean Square Difference (PRD) is an important performance index of any regression or function approximation algorithm. PRD is defined by Eq. (7):

$$PRD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \tilde{x}_i)^2}{\sum_{i=1}^{n} x_i^2}} \times 100, \qquad (7)$$

where x_i and x˜_i are samples of the original and reconstructed/estimated data sequences, respectively. A small value of PRD indicates the success of the algorithm. The other performance index of a regression algorithm is the prototype rate (PR), computed as the percentage of prototypes among all training patterns. A small value of PR indicates the capability of the algorithm for data reduction or data compression.
3.2 Accuracy of Approximation
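The evaluation protocol of Section 3.1 (the 2:1 split and the PRD and PR indices used in the comparisons) can be sketched as follows. The function names are hypothetical, and the split is written with 0-based indices, so that the "first, fourth, seventh, ..." patterns are those whose index is a multiple of three.

```python
import numpy as np

def split_2_to_1(n):
    """Return (train_idx, test_idx): pattern 1 is test, patterns 2 and 3 are training, and so on."""
    idx = np.arange(n)
    return idx[idx % 3 != 0], idx[idx % 3 == 0]

def prd(x, x_hat):
    """Percent Root Mean Square Difference, Eq. (7)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return 100.0 * np.sqrt(np.sum((x - x_hat) ** 2) / np.sum(x ** 2))

def prototype_rate(n_prototypes, n_train):
    """Prototype rate (PR): percentage of training patterns kept as prototypes."""
    return 100.0 * n_prototypes / n_train

train_idx, test_idx = split_2_to_1(128)   # sizes for the 128-sample Sinc 1d series
```

Applied to 128 and 1,681 samples, this split yields the training and test sizes listed for Sinc 1d and Sinc 2d in Table 1.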
To compare the POC-NN algorithm with the classical 1-NN algorithm, all methods were trained on the training set of patterns and run on the test set. The results of POC-NN, 1-NN, and Linear Interpolation on these data sets are summarized in Table 2. POC-NN gives better results than the 1-NN method in both accuracy (PRD) and prototype rate (PR) in all cases, and better results than Linear Interpolation for the Sunspot data.
3.3 Evaluate the POC-NN Algorithm
To increase the statistical significance of the results, K-fold cross-validation, which is one of the simplest and most widely used methods for estimating prediction error [10], was conducted. To estimate the difference in accuracy, three-fold and five-fold cross-validation were conducted to arrive at the average cross-validation estimate of PRD. The performance comparisons among POC-NN and the other methods are summarized in Table 3. The symbols "***", "**" and "*" indicate the 99 %, 95 % and 90 % confidence levels, respectively, for the difference between the accuracy of POC-NN and 1-NN using a one-tailed paired t-test [11].

Table 2. The comparison results of POC-NN, 1-NN, and Linear Interpolation. All PR % of NN and Linear in all cases are 100.

Data sets       POC-NN(PR%)   POC-NN(PRD%)   NN(PRD%)   Linear(PRD%)
1. Sinc 1d         72.94           2.31        13.91        1.31
2. Sinc 2d         54.46          83.69       203.49        4.1192
3. MG17            79.70          16.56        18.14        9.25
4. MG30            93.23          14.49        16.37        9.54
5. Lorenz          73.00          21.42        26.51        6.00
6. Titanium        87.50           8.11        12.20        6.00
7. Sunspot         93.94          20.65        26.65       22.20

Table 3. The comparison results of POC-NN, 1-NN, and Linear Interpolation. All PR % of NN and Linear in all cases are 100.

K-Fold   Data sets       POC-NN(PR%)   POC-NN(PRD%)   NN(PRD%)   Linear(PRD%)
K=3      1. Sinc 1d         75.78           2.46        13.96        1.31
         2. Sinc 2d         45.33         151.38       203.49        4.12
         3. MG17            81.01          16.85        19.15        9.78
         4. MG30            96.00          18.12        17.51        9.92
         5. Lorenz          72.67         *26.80        27.96        7.59
         6. Titanium        67.42        **25.48        15.89        5.65
         7. Sunspot         73.99       ***37.13        26.65       22.20
K=5      1. Sinc 1d         76.96           4.42        13.91        1.30
         2. Sinc 2d         54.46          83.68       203.49        4.11
         3. MG17            73.25          21.74        19.14        9.73
         4. MG30            74.00          18.61        17.52        9.92
         5. Lorenz          72.33         *23.19        28.02        7.52
         6. Titanium        68.36        **11.96        15.59        5.53
         7. Sunspot         67.50        **33.23        38.41       18.78

4 Conclusion
A new POC-NN method based on a divide-and-conquer approach to prototype selection has been extended from classification to regression problems. With our method, both classification and regression problems can be solved quickly and easily. The proposed method showed better performance in both prototype rate and accuracy than the classical 1-NN method. Although Linear Interpolation attains a lower PRD on most data sets, POC-NN requires fewer prototypes than Linear Interpolation in all cases, which leads to data compression.
References
1. Cover, T., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13 (1967) 21–27
2. Hart, P.: An asymptotic analysis of the nearest-neighbor decision rule. Stanford Electron. Lab. Stanford Calif. Tech. Rep. SEL-66-016 (1966) 1828–1832
3. Raicharoen, T., Lursinsap, C.: A divide-and-conquer approach to the Pairwise Opposite Class-Nearest Neighbor (POC-NN) algorithm. Pattern Recognition Letter 35 (2005) 505–513
4. Mackey, M., Glass, J.: Oscillation and chaos in physiological control systems. Science. (1997) 197–287
5. Lorenz, E.: Deterministic non-periodic Flow. Atoms. Sci. 26 (1963) 639
6. Sparrow, C.: The Lorenz equations, New York. Springer-Verlag (1982)
7. Dierckx, P.: Curve and Surface Fitting with Splines. Monographs on Numerical Analysis. Oxford. Clarendon Press. (1993)
8. Yule, G.: On a Method of Investigating Periodicities in Disturbed Series, with Special Reference to Wolfer's Sunspot Numbers. Phil. Trans. Roy. Soc. London A226 (1927) 267–298
9. Wan, E.: Combining fossil and sunspot data: Committee predictions. International Conference On Neural Networks (1997)
10. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer-Verlag. (2001)
11. Mitchell, T.: Machine Learning. McGraw Hill. (1994)
A Heuristic Weight-Setting Algorithm for Robust Weighted Least Squares Support Vector Regression
Wen Wen1,*, Zhifeng Hao2,3, Zhuangfeng Shao3, Xiaowei Yang3,4, and Ming Chen2
1 College of Computer Science and Engineering, South China University of Technology, Guangzhou, 510641, China [email protected]
2 National Mobile Communications Research Laboratory, Southeast University Nanjing 210096, China
3 School of Mathematical Science, South China University of Technology, Guangzhou, 510641, China
4 Faculty of Information Technology, University of Technology, Sydney, PO Box 123, Broadway, NSW 2007, Australia
Abstract. Firstly, a heuristic algorithm for labeling the “outlierness” of samples is presented in this paper. Then based on it, a heuristic weight-setting algorithm for least squares support vector machine (LS-SVM) is proposed to obtain the robust estimations. In the proposed algorithm, the weights are set according to the changes of the observed value in the neighborhood of a sample’s input space. Numerical experiments show that the heuristic weight-setting algorithm is able to set appropriate weights on noisy data and hence effectively improves the robustness of LS-SVM.
1 Introduction
Support Vector Machine (SVM), introduced by Vapnik [1], is a useful tool for data mining, especially in the fields of pattern recognition and regression. During the past few years, its solid theoretical foundation and good behavior have attracted a number of researchers, and it has been demonstrated to be an effective method for solving real-life problems [2-3]. According to Vapnik's The Nature of Statistical Learning Theory [1], by using tactics such as introducing a kernel function, both nonlinear pattern recognition problems and regression problems can be converted into linear ones and finally reduced to Quadratic Programming (QP) problems. For the regression problem, Vapnik proposed the epsilon-insensitive loss function, which may lead to a sparse approximation. However, it requires solving a QP with inequality constraints, which is very complicated. In order to simplify the model, Suykens [4] proposed another version of SVM, LS-SVM, which considers equality constraints instead of inequality ones. As a result, the solution follows directly from solving a set of linear equations instead of a quadratic program. Experiments show that LS-SVM can obtain good results from data without noise. However, since LS-SVM uses a sum squared error (SSE) cost function, it is less robust; in other words, it is sensitive to noise.

I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 773 – 781, 2006. © Springer-Verlag Berlin Heidelberg 2006

In order to avoid this drawback, Suykens [5] proposed a weighted version of LS-SVM (WLS-SVM), in which different weights are put on the error variables. In order to obtain a robust estimate, Suykens first trained the samples using the classical LS-SVM, then calculated the weight for each sample according to its error variable, and finally solved the WLS-SVM. This provides a novel approach to obtaining a robust LS-SVM. However, there are two drawbacks in this version: one is that it drastically increases the computational time; the other is that its performance depends heavily on the distribution of the noise.

Other researchers have also aimed to obtain a robust estimator by improving SVM. Lin and Wang proposed a fuzzy version of the support vector machine (FSVM) for dealing with noisy data [6-8]. Their basic idea is to assign to each sample a fuzzy membership, which is determined by the importance of the sample. To some extent, FSVM is similar to WLS-SVM: both improve the classical SVM by adding different weights to the loss function (the fuzzy memberships in FSVM can be viewed as the weights in WLS-SVM). The only difference is that FSVM uses a heuristic method to determine the fuzzy memberships, while WLS-SVM depends on pre-training the classical LS-SVM. Therefore, the key to FSVM is choosing a good heuristic method for determining the fuzzy memberships. Researchers have developed various methods for determining the fuzzy memberships [6-9], but most of them concentrate on classification problems and are somewhat subjective.

In this paper, we propose a heuristic algorithm for setting the weights in WLS-SVM, which aims to build a robust estimator against noisy data. Using the information from the distance matrix, this method can distinguish outliers from useful samples, set appropriate weights on their error variables, and thus produce robust estimates.

This paper is organized as follows. Section 2 provides a brief review of WLS-SVM. In Section 3 we first propose the novel outlier-labeling strategy in one-dimensional space and then present a heuristic weight-setting strategy for WLS-SVM. The experimental results and discussions are given in Section 4. Finally, the conclusions are drawn in Section 5.
2 Brief Review to WLS-SVM
In order to obtain a robust estimator from noisy data, Suykens [5] proposed the WLS-SVM. This model can be stated as follows:

$$\min \ \frac{1}{2}\|w\|^2 + \frac{1}{2} C \sum_{k=1}^{l} v_k e_k^2 \quad \text{s.t.} \quad y_k = w^T \varphi(x_k) + b + e_k, \quad k = 1, \ldots, l \qquad (1)$$

where v_k is determined by the following formula:

$$v_k = \begin{cases} 1 & \text{if } |e_k / \hat{s}| \le c_1, \\ \dfrac{c_2 - |e_k / \hat{s}|}{c_2 - c_1} & \text{if } c_1 \le |e_k / \hat{s}| \le c_2, \\ 10^{-4} & \text{otherwise.} \end{cases} \qquad (2)$$

Under the assumption of a Gaussian distribution of e_k, ŝ can be given by formula (3) or formula (4):

$$\hat{s} = \frac{IQR}{2 \times 0.6745} \qquad (3)$$

where IQR stands for the interquartile range, that is, the difference between the 75th percentile and the 25th percentile;

$$\hat{s} = 1.483\,MAD(x_i) \qquad (4)$$

where MAD(x_i) stands for the median absolute deviation. To obtain the values of v_k, one should first train on the training data set with the classical LS-SVM, and then compute ŝ from the distribution of e_k. Since in many cases the noise does not necessarily follow a Gaussian distribution, the form of ŝ given by formula (3) or (4) does not always produce a good estimator. Additionally, this method requires pre-training of the classical LS-SVM, which usually leads to high computational effort when the data set is large.
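The weighting rule of formulas (2) to (4) can be sketched as below, given the error variables e_k from a pre-trained LS-SVM. This is an illustration under assumptions: the thresholds c1 = 2.5 and c2 = 3.0 are placeholder defaults not stated in this paper, the MAD is computed on the error variables, and the function names are hypothetical.

```python
import numpy as np

def robust_scale(e, method="iqr"):
    """Robust estimate of the standard deviation of the errors, formula (3) or (4)."""
    e = np.asarray(e, dtype=float)
    if method == "iqr":
        q75, q25 = np.percentile(e, [75, 25])
        return (q75 - q25) / (2 * 0.6745)
    # Median absolute deviation of the errors around their median.
    return 1.483 * np.median(np.abs(e - np.median(e)))

def weights_from_ratio(r, c1=2.5, c2=3.0):
    """Formula (2) applied to the standardized absolute errors r = |e_k / s_hat|."""
    r = np.asarray(r, dtype=float)
    v = np.full(r.shape, 1e-4)                 # default: strongly down-weighted
    mid = (r > c1) & (r <= c2)
    v[mid] = (c2 - r[mid]) / (c2 - c1)         # linear decay between c1 and c2
    v[r <= c1] = 1.0                           # small errors keep full weight
    return v

def wls_svm_weights(e, c1=2.5, c2=3.0, method="iqr"):
    """Weights v_k for the error variables e_k of a pre-trained LS-SVM."""
    e = np.asarray(e, dtype=float)
    return weights_from_ratio(np.abs(e / robust_scale(e, method)), c1, c2)

# Illustrative residuals with one gross outlier.
v = wls_svm_weights([-1.0, -0.5, 0.0, 0.5, 1.0, 100.0])
```

The sketch makes the dependence on pre-training explicit: the weights cannot be computed until the residuals e_k of a first LS-SVM fit are available.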
3 The Proposed Weight-Setting Algorithm
The essential objective of WLS-SVM is to set appropriate weights on the error variables so that the "outlying" samples are allowed to have large estimation errors, thereby producing an estimator that is robust in a noisy environment. In order to avoid pre-training the classical LS-SVM as in the original WLS-SVM, we use a heuristic method to set the weights for regression problems.
3.1 A Heuristic Method to Label the Outlierness of Samples
The basic idea of our method is to label the "outlierness" of a sample according to its distances from other samples. Generally, the more outlying a sample is, the more large distances it has from other samples. Thus we first determine a threshold that describes a "large" distance. In our method, we define half of the maximum distance between the given sample and the other samples as the threshold of "large" distance. To make this clearer, we take Fig. 1 as an example. A, B, C, D, E are 5 samples with the one-dimensional coordinates 1, 2, 3, 6, 8, respectively. It is visible that D and E are probably outliers, and that the "outlierness" of E is larger than that of D.

Fig. 1. Example

Consider their distance matrix:

$$Dist = \begin{bmatrix} 0 & 1 & 2 & 5 & 7 \\ 1 & 0 & 1 & 4 & 6 \\ 2 & 1 & 0 & 3 & 5 \\ 5 & 4 & 3 & 0 & 2 \\ 7 & 6 & 5 & 2 & 0 \end{bmatrix}$$

For row i, if we set

$$\delta_i = \max_{j=1,\ldots,5}\{dist(i, j)\} / 2, \qquad (5)$$

and let

$$N_1(i) = \{\, j \mid dist(i, j) \le \delta_i \,\}, \qquad N_2(i) = \{\, j \mid dist(i, j) > \delta_i \,\},$$

we may find that for i = 1, 2, 3, |N_1(i)| > |N_2(i)|, while for i = 4, 5, |N_1(i)| < |N_2(i)|. This suggests that outlying samples have more large values than small values in their rows of the distance matrix. Furthermore, considering samples D and E, we find that |N_1(4)| = |N_1(5)| and |N_2(4)| = |N_2(5)|, but it is obvious that sample E is more outlying than sample D. If we simply use the information from N_1(i) and N_2(i), we cannot distinguish the different "outlierness" of D and E. Therefore, distance is introduced into the labeling procedure: the more outlying a sample is, the larger its distance values are. Hence we define

$$\varsigma_i = \frac{|N_2(i)|}{|N_1(i)|} \cdot \delta_i \qquad (6)$$
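The labeling rule of formulas (5) and (6) can be sketched for the five samples of Fig. 1. The function name `outlierness` is hypothetical; as in the distance matrix above, each sample is counted in its own row (its zero self-distance falls into N_1).

```python
import numpy as np

def outlierness(points):
    """Return the label of formula (6) for every sample."""
    p = np.asarray(points, dtype=float).reshape(len(points), -1)
    dist = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)  # pairwise distance matrix
    delta = dist.max(axis=1) / 2                                   # formula (5), one threshold per row
    n1 = (dist <= delta[:, None]).sum(axis=1)                      # |N1(i)|, includes the sample itself
    n2 = (dist > delta[:, None]).sum(axis=1)                       # |N2(i)|
    return n2 / n1 * delta                                         # formula (6)

# Samples A, B, C, D, E with coordinates 1, 2, 3, 6, 8.
scores = outlierness([1, 2, 3, 6, 8])
# D and E obtain 3.75 and 5.25; A, B and C stay below 2.5.
```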
Obviously, ς_i satisfies the requirements of an "outlierness" label: the larger |N_2(i)| and δ_i are, the higher ς_i is; the larger |N_1(i)| is, the lower ς_i is. Still taking samples D and E as examples, we have ς_4 = (3/2) · 2.5 = 3.75 and ς_5 = (3/2) · 3.5 = 5.25, leading to ς_4 < ς_5, which correctly labels the outlierness of D and E.
3.2 The Proposed Weight-Setting Strategy
In regression problems, useful samples are distributed near the objective regression curve, while outliers have large deviations from it. If we limit the input features to a reasonably small interval θ, the change of y should be mild. If the change of y is drastic, the sample is probably an outlier. In WLS-SVM, small weights are imposed on the outliers in order to reduce their negative influence, so weight setting in fact becomes an outlier-labeling problem. The weight, however, should be the inverse of the outlierness: the more outlying a sample is, the smaller its weight should be. Given the training data set {s_i | s_i = (x_i, y_i), i = 1, 2, ..., l}, x_i = (x_i1, x_i2, ..., x_iD) is the input vector and y_i is the observed value. In this paper, we only consider the case in which y_i is one-dimensional. Derived from the outlier-labeling procedure proposed in Section 3.1, a weight-setting strategy is presented as follows:
Algorithm: (The proposed heuristic WLS-SVM)

1. Initialize an appropriate threshold θ for the distances in the input space.
2. Calculate the input and output distances between samples according to formula (7) and formula (8). Store them in the input distance matrix DistX and the output distance matrix DistY, respectively.

$$DistX(i, j) = \sqrt{\sum_{d=1}^{D} (x_{id} - x_{jd})^2} \qquad (7)$$

$$DistY(i, j) = |y_i - y_j| \qquad (8)$$

3. For the ith row of DistX, find the sample group Ω_i, which satisfies

$$\Omega_i = \{\, j \mid DistX(x_i, x_j) < \theta \,\} \qquad (9)$$

For k ∈ Ω_i, let

$$\delta_{y_i} = \frac{1}{2} \cdot \max_{k \in \Omega_i}\{DistY(y_i, y_k)\},$$

and count the sample numbers N_1 and N_2, in which

$$N_1 = |\{\, k \mid DistY(y_i, y_k) \le \delta_{y_i},\ k \in \Omega_i \,\}|, \qquad N_2 = |\{\, k \mid DistY(y_i, y_k) > \delta_{y_i},\ k \in \Omega_i \,\}|.$$

If N_1 ≥ N_2, let v_i = 1; else let

$$\eta_i = \frac{N_1}{N_2} \cdot \frac{1}{\delta_{y_i}} \cdot SCALE, \qquad v_i = \begin{cases} \eta_i & \text{if } \eta_i < 1, \\ 1 & \text{else.} \end{cases}$$

4. Use the weighted samples {s'_i | s'_i = (x_i, y_i, v_i), i = 1, 2, ..., l} to train the WLS-SVM.
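Steps 1 to 3 of the weight-setting algorithm can be sketched as follows. This is an illustration rather than the authors' C++ implementation: the function name `heuristic_weights` is hypothetical, the input distance is assumed to be Euclidean, and each sample is assumed to belong to its own group Ω_i (its self-distance 0 is below θ).

```python
import numpy as np

def heuristic_weights(X, y, theta, scale):
    """Return the weights v_i from the input threshold theta and the constant SCALE."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    dist_x = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))  # formula (7)
    dist_y = np.abs(y[:, None] - y[None, :])                               # formula (8)
    v = np.ones(len(y))
    for i in range(len(y)):
        omega = np.flatnonzero(dist_x[i] < theta)      # formula (9): neighbors in input space
        dy = dist_y[i, omega]
        delta = 0.5 * dy.max()                         # half of the largest output distance
        n1 = np.count_nonzero(dy <= delta)
        n2 = np.count_nonzero(dy > delta)
        if n1 < n2:                                    # n2 > 0 implies delta > 0 here
            eta = (n1 / n2) * scale / delta
            v[i] = min(eta, 1.0)
    return v

# Illustrative data: five close inputs, the last one with an outlying output.
v = heuristic_weights([0.0, 0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0, 5.0],
                      theta=1.0, scale=0.5)
```

The resulting vector would then be passed as the weights v_k of the WLS-SVM in step 4, which is not sketched here.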
As a matter of fact, η_i can be viewed as the inverse of ς_i. The only difference is that a constant term SCALE is introduced into η_i, which prevents the value of η_i from decreasing unreasonably when the observed values y_i are distributed over a large interval. For example, if y_i lies in the interval [−10, 10], a value of δ_y = 1 might be viewed as a regular change of the observed value. But if y_i lies in the interval [−1, 1], a value of δ_y = 1 is probably caused by an abnormal disturbance. Therefore, when the interval is large, SCALE should be correspondingly large, which guarantees that v_i obtains a meaningful value. For convenience, we refer to the heuristic WLS-SVM algorithm as HWLS-SVM.

Note that the major computations in the heuristic weighting stage of HWLS-SVM come from calculating the distance matrices and searching for the group-maximum distances. This demands much less computation than the pre-training stage of the standard WLS-SVM, because pre-training of LS-SVM is essentially a quadratic programming problem and requires solving a linear system AX = B. Besides, pre-training of LS-SVM does not avoid calculating a distance matrix, because kernel functions essentially require it. Additionally, after pre-training of LS-SVM, setting appropriate weights on the samples still demands a statistical analysis of the error variables. Therefore, the standard WLS-SVM is more complex and also more computationally expensive than HWLS-SVM.
4 Numerical Experiments
To check the validity of the proposed algorithm, programs including the weight-setting modules and the LS-SVM module [10-11] were written in C++, using Microsoft's Visual C++ 6.0 compiler. The experiments were run on an HP personal computer with a 3.06 GHz Pentium IV processor and a maximum of 512 MB of memory available, running Windows XP. In order to evaluate the performance of the proposed weight-setting algorithm, three commonly used instances are tested: the training data are generated by f_1, f_2 and f_3, respectively, plus a few random noises across the input space; the test data are uniformly sampled from the objective regression functions f_1^*, f_2^* and f_3^*.

$$f_1(x) = \frac{\sin(2x)}{2x} + \xi, \qquad f_2(x) = x^3 + \xi, \qquad f_3(x) = \frac{\sin\sqrt{x_1^2 + x_2^2}}{\sqrt{x_1^2 + x_2^2}} + \xi$$

(ξ is a uniformly distributed random variable in the interval [-0.05, 0.05]).

$$f_1^*(x) = \frac{\sin(2x)}{2x}, \qquad f_2^*(x) = x^3, \qquad f_3^*(x) = \frac{\sin\sqrt{x_1^2 + x_2^2}}{\sqrt{x_1^2 + x_2^2}}$$
There are 165, 165 and 577 training samples and 100, 100 and 200 test samples in instances 1, 2 and 3, respectively. The RBF kernel is used in the experiments. Fig. 2 illustrates the comparison between the Classical LS-SVM and HWLS-SVM on the three simulated instances. It is visible that the Classical LS-SVM is very sensitive to noisy data, which makes it less robust, whereas HWLS-SVM is much more robust and is hardly influenced by the noisy data. Table 1 shows the details of our experiments: in the training stage, the Classical LS-SVM attains the better training accuracy, but in the test stage it produces much worse results. This is because the Classical LS-SVM leans unduly towards the noisy data and produces over-fitted results.

Furthermore, the motorcycle data [12], a real-world benchmark data set in statistics, is used to test our algorithm. The parameters are set as follows: θ = 2, SCALE = 10; γ = 2, σ = 6.6 according to [5]. Fig. 3 illustrates the regression curves obtained by the Classical LS-SVM and HWLS-SVM. It is visible that the Classical LS-SVM tends to be influenced by the noisy data. For example, its curve is bent towards the exceptionally noisy sample at (35.2, -54.9). HWLS-SVM, in contrast, is hardly influenced by these outlying data and remains relatively robust. Fig. 4 shows the histograms of the residuals. If the samples with large absolute residuals are ignored, HWLS-SVM obtains a much more symmetric histogram of residuals, which is close to a Gaussian distribution. Besides, experiments show that if the five most noisy samples are discarded, the Classical LS-SVM obtains an average absolute residual of 15.87, while HWLS-SVM obtains 14.95. This indicates that it is fairly reasonable to believe that HWLS-SVM is able to resist the outliers by setting appropriate weights on their error variables.
Fig. 2. Tests on three benchmark instances: (a) results from instance 1; (b) results from instance 2; (c) results from instance 3

Table 1. Comparison of Classical LS-SVM and HWLS-SVM

        Classical LS-SVM       HWLS-SVM             Parameters
Ins.    MSEtr      MSEts       MSEtr      MSEts     θ       SCALE
1       0.0330     0.0060      0.0385     0.0002    0.2     0.1
2       0.7837     0.1064      0.9247     0.0010    0.2     0.6
3       0.0135     0.0098      0.0325     0.0012    1.0     0.1
LS-SVM parameters: σ = 0.25, γ = 3

Fig. 3. Motorcycle Dataset: comparison between Classical LS-SVM and HWLS-SVM
Fig. 4. Motorcycle Dataset: (Left) histogram of residues for the Classical LS-SVM. (Right) histogram of residues for HWLS-SVM.
Notes on parameters θ and SCALE: In this paper parameters θ and SCALE are selected empirically. Generally speaking, the value of θ should ensure that within such a small interval there are about five to ten samples on average. And the value of SCALE is about 5 to 15 percent of the interval length of Y.
5 Conclusions
In this paper we first proposed a heuristic method to label the "outlierness" of samples. Based on it, a heuristic WLS-SVM (HWLS-SVM) was then presented. Tests on three commonly used simulated data sets and one benchmark data set in statistics show that HWLS-SVM produces robust estimates. Moreover, HWLS-SVM outperforms the standard WLS-SVM in that it does not require pre-training of the classical LS-SVM, which may save a lot of computational effort.
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and advice. This work has been supported by the National Natural Science Foundation of China (10471045, 60433020), the program for New Century Excellent Talents in University (NCET-05-0734), the Natural Science Foundation of Guangdong Province (031360, 04020079), the Excellent Young Teachers Program of the Ministry of Education of China, the Fok Ying Tong Education Foundation (91005), the Social Science Research Foundation of MOE (2005-241), the Key Technology Research and Development Program of Guangdong Province (2005B10101010, 2005B70101118), the Key Technology Research and Development Program of Tianhe District (051G041), the Natural Science Foundation of South China University of Technology (B13-E5050190), and the Open Research Fund of National Mobile Communications Research Laboratory (A200605).
References
[1] Vapnik, V.: The Nature of Statistical Learning Theory, John Wiley, New York, USA, 1995.
[2] Wu, C. H.: Travel-Time Prediction with Support Vector Regression, IEEE Transactions on Intelligent Transportation Systems, 5 (2004) 276-281.
[3] Yang, H. Q., Chan, L. W., King, I.: Support Vector Machine Regression for Volatile Stock Market Prediction, Proceedings of the Intelligent Data Engineering and Automated Learning 2002: Third International Conference, (2002) 391-396.
[4] Suykens, J. A. K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers, Neural Processing Letters, 9 (1999) 293-300.
[5] Suykens, J. A. K., Brabanter, J. D., Lukas, L., Vandewalle, J.: Weighted Least Squares Support Vector Machines: Robustness and Sparse Approximation, Neurocomputing, 48 (2002) 85-105.
[6] Lin, C. F., Wang, S. D.: Fuzzy Support Vector Machines, IEEE Transactions on Neural Networks, 13 (2002) 464-471.
[7] Lin, C. F., Wang, S. D.: Training algorithms for fuzzy support vector machines with noisy data, Pattern Recognition Letters, 25 (2004) 1647-1656.
[8] Lin, C. F., Wang, S. D.: Fuzzy support vector machines with automatic membership setting, StudFuzz, 17 (2005) 233-254.
[9] Huang, H. P., Liu, Y. H.: Fuzzy Support Vector Machines for Pattern Recognition and Data Mining, International Journal of Fuzzy Systems, 4 (2002) 826-835.
[10] Liu, J. H., Chen, J. P., Jiang, S., Cheng, J. S.: Online LS-SVM for function estimation and classification, Journal of University of Science and Technology Beijing, 10 (2003) 73-77.
[11] Yu, S., Yang, X. W., Hao, Z. F., Liang, Y. C.: An Adaptive Support Vector Machine Learning Algorithm for Large Classification Problem, International Symposium Neural Networks, 2006. (In press)
[12] Eubank, R. L.: Nonparametric regression and spline smoothing, Statistics: textbooks and monographs, 2nd edition, Marcel Dekker, New York, 157 (1999).
Feature Selection Using SVM Probabilistic Outputs
Kai Quan Shen1, Chong Jin Ong1, Xiao Ping Li2, Hui Zheng1, and Einar P.V. Wilder-Smith3
1 Department of Mechanical Engineering, National University of Singapore, EA, #04-24, 9 Engineering Drive 1, Singapore 117576 {shen, mpeongcj, zhenghui}@nus.edu.sg
2 Department of Mechanical Engineering and Division of Bioengineering, National University of Singapore, EA, #07-08, 9 Engineering Drive 1, Singapore 117576 [email protected]
3 Department of Medicine, National University of Singapore, Singapore 119074 [email protected]
Abstract. A ranking criterion based on the posterior probability is proposed for feature selection on support vector machines (SVM). This criterion has the advantage that it is directly related to the importance of the features. Four approximations are proposed for the evaluation of this criterion. The performances of these approximations, used in the recursive feature elimination (RFE) approach, are evaluated on various artificial and real-world problems. Three of the proposed approximations show good performances consistently, with one having a slight edge over the other two. Their performances compare favorably with feature selection methods in the literature.
1 Introduction
Feature selection is important for improving generalization, meeting system specifications or constraints and enhancing system interpretability [1]-[3]. Many methods for feature selection have been proposed and a good review of these can be found in the paper by Guyon and Elisseeff [4]. The focus of this paper is on feature selection methods derived from SVM [1]-[2], [5]-[6]. Past research in this area includes the use of the cost function in the SVM formulation [7], [8], the leave-one-out (LOO) error bound, the radius/margin bound [8], [9], [10] and the “span estimate” [8], [10], [11]. In many of these methods, the sensitivity of some suitable estimate with respect to a feature is used as a criterion for ranking the features. Guyon et al. [7] use the cost function of the SVM formulation and the idea of recursive feature elimination (SVM-RFE) for feature selection. Similarly, Weston et al. [10] use the radius/margin bound and the span estimate and propose an efficient feature selection algorithm through the use of virtual scaling factors. In addition, Rakotomamonjy [8] extends the SVM-RFE algorithm using the radius/margin bound and the span estimate and proposes feature selection methods based on their zero-order and first-order sensitivity with respect to the features.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 782 – 791, 2006. © Springer-Verlag Berlin Heidelberg 2006
This paper proposes the use of posterior probability as the main criterion for feature selection and its sensitivity with respect to a feature for ranking features. It has the advantage of providing a direct relationship between the ranking criterion and the feature importance. The evaluation of this criterion is investigated using four approximations. These approximations are combined with the recursive feature elimination approach to yield an overall feature selection algorithm based on the SVM output. The choice of two of the approximations is partly motivated by the random forest (RF) [12], [13] feature selection method. The proposed methods are tested on both artificial and real-world problems, including some well-established benchmark challenge problems from the NIPS 2003 feature selection competition [14]. Numerical comparisons with several other feature selection methods for SVM are also presented. One of the four proposed approximations performs consistently well on these datasets and compares favorably with some of the best methods available in the literature.
This paper considers the typical two-class classification problem with dataset D in the form of $\{x_j, y_j\}_{j=1}^{N} \in \mathbb{R}^d \times \{-1, 1\}$, where $x_j$ is the jth sample and $y_j$ the corresponding class label. Also, $x^i$ denotes the ith feature (element) of vector x; hence, $x^i_j$ is the ith feature of the jth sample and $x_{-i} \in \mathbb{R}^{d-1}$ is the vector obtained from x with the ith feature removed. The double-subscripted variable $x_{-i,j}$ is also used and refers to the jth sample of variable $x_{-i}$.
This paper is organized as follows. Some preliminary results from the literature needed for the subsequent sections are collected in Section 2. Section 3 provides the basis of the proposed criterion and the descriptions of the four approximations of the criterion. Section 4 outlines the overall feature selection scheme using the RFE approach. Extensive experimental results are reported in Section 5, followed by the conclusion in Section 6.
2 Preliminaries
We assume the availability of the solution of a standard SVM for the two-class classification problem (in its dual form):
$$\min_{\alpha}\ \frac{1}{2}\sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{i=1}^{N} \alpha_i, \tag{1}$$
subject to $\sum_{i=1}^{N} y_i \alpha_i = 0$ and $0 \le \alpha_i \le C$, $i = 1, 2, \ldots, N$, with the decision function
$$f(x) = \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + b. \tag{2}$$
The choice of the kernel is general but we will illustrate the ideas using the popular Gaussian kernel:
$$K(x_k, x_j) = \exp(-\gamma \|x_k - x_j\|^2). \tag{3}$$
Platt [15] obtains a probabilistic measure from the SVM outputs through the use of the following sigmoid:
$$\hat{p}_i = 1/[1 + \exp(A f_i + B)], \tag{4}$$
where the parameters A and B of the sigmoid are found by minimizing the negative log-likelihood of the training data (or the cross-entropy error function) and $f_i = f(x_i)$ is the SVM output for $x_i$ as given by (2). Our implementation of Platt’s probabilistic SVM outputs includes the modifications suggested by Lin et al. [16] for numerical stability. Hereafter, $\hat{p}(x)$ refers to the estimated posterior probability of x obtained from (4).
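To make the role of A and B in (4) concrete, the following is a minimal sketch of fitting the sigmoid to a set of SVM outputs. It uses plain gradient descent on the negative log-likelihood, which is an assumption made for brevity; it is not the algorithm of [15] or the stabilized variant of [16]. The function names `platt_prob` and `fit_platt`, the step size and the iteration count are hypothetical.

```python
import math

def platt_prob(f, A, B):
    """Sigmoid of Eq. (4): 1 / (1 + exp(A*f + B)), evaluated without overflow."""
    z = A * f + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def fit_platt(outputs, labels, iters=2000, lr=0.1):
    """Fit A and B by gradient descent on the negative log-likelihood.

    outputs: SVM decision values f_i; labels: class labels in {-1, +1}.
    Assumption: plain gradient descent stands in for the solvers of [15], [16].
    """
    targets = [(y + 1) / 2 for y in labels]  # map {-1, +1} to {0, 1}
    A, B = 0.0, 0.0
    n = len(outputs)
    for _ in range(iters):
        grad_A = grad_B = 0.0
        for f, t in zip(outputs, targets):
            p = platt_prob(f, A, B)
            # With z = A*f + B, d(NLL)/dz = t - p for this parameterization.
            grad_A += (t - p) * f
            grad_B += (t - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B
```

When positive SVM outputs mostly correspond to the positive class, the fitted A is negative, so the estimated probability increases with f.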
3 The Ranking Criterion Based on Posterior Probabilities
Let p(c|x) denote the conditional probability of belonging to class c given x. The proposed ranking criterion for the ith feature is
$$C_t(i) = \int |p(c|x) - p(c|x_{-i})|\, p(x)\, dx, \tag{5}$$
where $x_{-i} \in \mathbb{R}^{d-1}$ is the vector derived from x with the ith feature removed. The motivation of the above criterion is clear: the greater the absolute difference between p(c|x) and $p(c|x_{-i})$ over the space of x, the more important is the ith feature. As the true value p(c|x) is unknown, it is approximated by $\hat{p}(c|x)$, the probabilistic output of the SVM. The value of $p(c|x_{-i})$ requires retraining the SVM using data $\{x_{-i,j}, y_j\}_{j=1}^{N}$ in place of $\{x_j, y_j\}_{j=1}^{N}$ for each i, which is computationally expensive. This and the next section present four approximations of (5) that avoid the retraining process. These approximations are termed FSPP1-FSPP4 respectively, where FSPP stands for feature-based sensitivity analysis of posterior probability.
Motivated by the RF method, the first two approximations involve a process that randomly redistributes (RR) the values of a feature. Specifically, the values of the ith feature of x are randomly redistributed over the N examples. Let $x_{(i)} \in \mathbb{R}^d$ be the vector derived from x with the ith feature randomly redistributed. Then the following holds.
Theorem 1.
$$p(c|x_{(i)}) = p(c|x_{-i}). \tag{6}$$
Proof: Since the RR process does not affect the distribution of $x^i$,
$$p(x^i_{(i)}) = p(x^i). \tag{7}$$
Using the above, we have
$$p(x_{(i)}) = p(x^i_{(i)}, x_{-i}) = p(x^i_{(i)})\, p(x_{-i}) = p(x^i)\, p(x_{-i}), \tag{8}$$
where the second equality follows from the fact that $x^i_{(i)}$ is independent of $x_{-i}$ following the RR process. Using a similar argument, we have
$$p(x_{(i)}, c) = p(x^i_{(i)})\, p(x_{-i}, c) = p(x^i)\, p(x_{-i}, c). \tag{9}$$
Hence,
$$p(c|x_{(i)}) = \frac{p(c, x_{(i)})}{p(x_{(i)})} = \frac{p(x^i)\, p(x_{-i}, c)}{p(x^i)\, p(x_{-i})} = p(c|x_{-i}). \tag{10}$$
∎
A corollary of Theorem 1 is the mutual information equality $I(c, x_{(i)}) = I(c, x_{-i})$, which can be readily proved by using (8) and (9). Theorem 1 and its corollary show that the RR process has the same effect as removing the contribution of that feature for classification. Using this fact, criterion (5) can be equivalently stated as
$$C_t(i) = \int |p(c|x) - p(c|x_{(i)})|\, p(x)\, dx. \tag{11}$$
Method 1 (FSPP1): Threshold function approximation
The first method uses a threshold function for the approximation of (11) as follows:
$$p(c|x) \approx \varphi(f(x)); \tag{12}$$
$$p(c|x_{(i)}) \approx \varphi(f(x_{(i)})), \tag{13}$$
where f(·) is the output function of the SVM as given by (2) and φ(·) is the threshold function given by
$$\varphi(f) = \begin{cases} 1 & \text{if } f \ge 0, \\ 0 & \text{if } f < 0. \end{cases} \tag{14}$$
It is worth noting that (12) and (13) use the same function f and do not involve retraining the SVM. Further approximation of the integration over x in (11) yields
$$\mathrm{FSPP1}(i) = \frac{1}{N}\sum_{j=1}^{N} |\varphi(f(x_j)) - \varphi(f(x_{(i),j}))|, \tag{15}$$
where $x_{(i),j}$ refers to the jth example of the input data where the ith feature has been randomly redistributed.
Method 2 (FSPP2): SVM probabilistic outputs approximation
Method 2 differs from Method 1 in that the threshold function (14) is replaced by the sigmoid function used in (4). The resulting approximation of p(c|x) becomes that of Platt’s probabilistic output, $\hat{p}(c|x)$. Obviously, other methods that obtain probabilistic outputs from SVM [17], [18] can also be used. Similarly, $p(c|x_{(i)})$ is approximated by $\hat{p}(c|x_{(i)})$ using the same trained SVM and the same trained sigmoid for Platt’s probabilistic output, i.e., without retraining the SVM or re-fitting the sigmoid of (4). Hence,
$$\mathrm{FSPP2}(i) = \frac{1}{N}\sum_{j=1}^{N} |\hat{p}(c|x_j) - \hat{p}(c|x_{(i),j})|. \tag{16}$$
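The two RR-based scores in (15) and (16) can be sketched as follows. This is an illustrative sketch only: the decision function `f` and the probability function `prob` are assumed to be supplied by an already trained SVM and sigmoid, samples are assumed to be plain Python lists, and the helper names (`redistribute`, `fspp1`, `fspp2`) are hypothetical.

```python
import random

def redistribute(X, i, rng):
    """Copy of X with the values of feature i shuffled across the N samples (RR)."""
    column = [row[i] for row in X]
    rng.shuffle(column)
    return [row[:i] + [v] + row[i + 1:] for row, v in zip(X, column)]

def fspp1(f, X, i, rng):
    """Eq. (15): mean absolute change of the thresholded SVM output."""
    step = lambda s: 1 if s >= 0 else 0  # threshold function of Eq. (14)
    X_rr = redistribute(X, i, rng)
    return sum(abs(step(f(x)) - step(f(xr))) for x, xr in zip(X, X_rr)) / len(X)

def fspp2(prob, X, i, rng):
    """Eq. (16): mean absolute change of the estimated posterior probability."""
    X_rr = redistribute(X, i, rng)
    return sum(abs(prob(x) - prob(xr)) for x, xr in zip(X, X_rr)) / len(X)
```

In both functions the same `f` or `prob` is evaluated on the original and on the redistributed samples, so no retraining takes place. A feature that the model ignores receives a score of exactly zero.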
FSPP1 and FSPP2 are similar to the criteria employed in the RF method in the sense that both are estimates of a feature’s contribution to the posterior probabilities in the context of the other features, obtained through the RR process of the values of that feature.
Method 3 (FSPP3): Approximation via virtual vector v
The use of an additional virtual vector $v \in \mathbb{R}^d$ for the purpose of feature selection has been attempted in the literature [8], [10]. This approach uses one $v_i$, having a nominal value of 1, for each feature and replaces every $x^i$ by $v_i x^i$. Let $vx = [v_1 x^1\ v_2 x^2\ \ldots\ v_d x^d]^T$ and let $v_{-i}x$ refer to vx with $v_i = 0$. In this setting, equation (5) can be approximated by
$$C_t(i) = \int |p(c|vx) - p(c|v_{-i}x)|\, p(x)\, dx. \tag{17}$$
Using the standard approximation, the third method is
$$\mathrm{FSPP3}(i) = \frac{1}{N}\sum_{j=1}^{N} |\hat{p}(c|vx_j) - \hat{p}(c|v_{-i}x_j)|, \tag{18}$$
where $\hat{p}(c|vx_j)$ refers to Platt’s posterior probability of the jth example, $\hat{p}(c|v_{-i}x) = (1 + \exp(A f(v_{-i}x) + B))^{-1}$ as given by (4), and f(·) is the SVM output expression (2) obtained from the training set $\{x_i, y_i\}_{i=1}^{N}$.
Method 4 (FSPP4): Approximation based on derivative of p(c|vx) with respect to v
The criterion of (17) can also be represented, under the assumption that p(c|vx) is a $C^1$ function of v, by
$$C_t(i) = \int \left| \int_{v_i=1}^{v_i=0} \frac{\partial p(c|vx)}{\partial v_i}\, dv_i \right| p(x)\, dx. \tag{19}$$
Instead of the integral over $v_i$ from 1 to 0, FSPP4 uses the sensitivity with respect to $v_i$ evaluated at $v_i = 1$. It is important to note that, when p(c|x) is approximated by $\hat{p}(c|x)$ of (4), $\partial \hat{p}(c|vx)/\partial v_i$ admits a closed-form expression using the results of (2) and (4). Due to limited space, its expression and derivation will be given in a future publication. Hence, the fourth method is
$$\mathrm{FSPP4}(i) = \frac{1}{N}\sum_{j=1}^{N} \left| \frac{\partial \hat{p}(c|vx_j)}{\partial v_i} \right|_{v_i=1}. \tag{20}$$
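The scores (18) and (20) can be sketched in the same style. Since the closed-form derivative is not given in this paper, the sketch below substitutes a central finite difference in $v_i$ around 1; this substitution, the step size `h` and the function names are assumptions for illustration, and `prob` is again assumed to come from a trained SVM and sigmoid.

```python
def fspp3(prob, X, i):
    """Eq. (18): mean absolute change in probability when v_i is set to 0."""
    total = 0.0
    for x in X:
        x_zeroed = list(x)
        x_zeroed[i] = 0.0  # v_i = 0 removes the ith feature's contribution
        total += abs(prob(x) - prob(x_zeroed))
    return total / len(X)

def fspp4(prob, X, i, h=1e-5):
    """Eq. (20), approximated: |d prob / d v_i| at v_i = 1 by central difference.

    Assumption: a finite difference replaces the closed-form derivative,
    which is not stated in the paper.
    """
    total = 0.0
    for x in X:
        x_plus, x_minus = list(x), list(x)
        x_plus[i] = (1.0 + h) * x[i]   # v_i = 1 + h
        x_minus[i] = (1.0 - h) * x[i]  # v_i = 1 - h
        total += abs((prob(x_plus) - prob(x_minus)) / (2.0 * h))
    return total / len(X)
```

FSPP3 compares two points that may be far apart in input space, whereas FSPP4 only probes the local slope at $v_i = 1$, which is one way to see why the two scores can rank features differently.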
4 Feature Selection Methods
This section presents one approach using FSPP1-FSPP4 in an overall scheme for feature selection: the FSPP-based recursive feature elimination (FSPP-RFE) approach. The FSPP-RFE approach assumes that an SVM output function f is available and that all hyperparameters, C and γ in (1) and (3) or others, have been determined through a proper model selection process. For the cases of FSPP2 to FSPP4, it is also assumed that the posterior probabilities are available according to (4). The FSPP-RFE approach is similar to the one given by Guyon [7] but with FSPPm used as the ranking criterion. The steps involved in this approach are summarized as follows. The inputs are the dataset D and m, with the output being the ranked list of features $J_R$.
FSPP-RFE(D, m):
1. Let I = {1, 2, ..., d} and ℓ = d.
2. If I ≠ ∅, for each i ∈ I, compute FSPPm(i); compute the ranked list $J_m = \{j_1, j_2, \ldots, j_\ell\}$ with $j_k \in I$ and FSPPm($j_k$) ≥ FSPPm($j_{k+1}$) for k = 1, …, ℓ−1; else stop.
3. Let the last element of $J_m$ be $\hat{k}$. Assign $\hat{k}$ to the ℓth element of $J_R$.
4. Let $I = I \setminus \{\hat{k}\}$, ℓ = ℓ − 1 and $D = \{x_{-\hat{k},j}, y_j\}_{j=1}^{N}$.
5. Retrain the SVM with D and obtain the posterior probabilities from (4) if needed. Go to 2.
It is worth noting that Steps 3 and 4 of FSPP-RFE(D, m) above remove one feature (the one with the lowest FSPPm score) from the dataset at a time. Obviously, more than one feature can be removed at one time with slight modifications to Steps 3 and 4.
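The elimination loop above can be sketched as follows, removing one feature per recursion. The callables `train` (which fits a model on the current feature subset and returns a probability function) and `score` (which evaluates one of the FSPPm criteria) are hypothetical placeholders supplied by the caller; the sketch uses 0-based feature indices.

```python
def fspp_rfe(X, y, train, score):
    """Rank features by recursive elimination; most important feature first.

    train(X_sub, y) -> prob  : hypothetical trainer returning a probability function
    score(prob, X_sub, pos)  : hypothetical FSPPm score of the feature at column pos
    """
    remaining = list(range(len(X[0])))  # the index set I
    ranked = [None] * len(remaining)    # the output list J_R
    while remaining:
        # Restrict the data to the surviving features and (re)train: Steps 4 and 5.
        X_sub = [[row[k] for k in remaining] for row in X]
        prob = train(X_sub, y)
        # Step 2: score every surviving feature.
        scores = [score(prob, X_sub, pos) for pos in range(len(remaining))]
        # Step 3: the lowest score fills the last open slot of J_R.
        worst = min(range(len(remaining)), key=scores.__getitem__)
        slot = len(remaining) - 1
        ranked[slot] = remaining.pop(worst)
    return ranked
```

Computing `slot` before calling `pop` matters, because `pop` shortens `remaining` and would otherwise shift the slot index by one.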
5 Experiments
Extensive experiments on both artificial and real-world benchmark problems are carried out using the proposed methods. As in other studies, an artificial problem is used because its key features are known, which makes it suitable for a comparative study of the four FSPPs. Two real-world problems are chosen as they have been used by other feature selection methods [7], [8] and serve as a common reference for comparison. Finally, the proposed methods are tested on two challenge problems used in the NIPS 2003 feature selection competition [14]. In general, our method requires, for each problem, three subsets of data in the form of Dtra, Dval and Dtes for training, validation and testing purposes. The subset Dtra is normalized to zero mean and unit standard deviation. Its normalizing parameters are also used to normalize Dval and Dtes. The subset Dtra is meant for the training of the SVM, including the determination of the optimal C and γ using a 5-fold cross-validation procedure. The subset Dval is needed for the determination of parameters A and B in (4). The Dtes subset is used for obtaining an unbiased testing accuracy of the underlying method. In cases where there are 100 realizations of a given dataset, the procedure of [19] is followed: parameters C and γ are chosen as the median of the five sets of (C, γ) of the first five realizations. Here each set of (C, γ) is obtained by standard 5-fold cross-validation for one realization.
5.1 Artificial Problem
We follow the procedure given in [10] and generate 10,000 samples of 12 features each. Only the first two (x1, x2) are relevant while the rest are random noise. Dtra and Dval contain 100 random samples each and the rest are included in Dtes for one realization of the dataset. Average feature selection performances over 100 realizations are shown in Fig. 1. The result shows that FSPP1-RFE and FSPP2-RFE correctly identify the two key features, as the test error rates are the lowest with only two surviving features. However, FSPP3-RFE and FSPP4-RFE produce less appealing results. The poor performance of FSPP4 could be due to the fact that $\hat{p}(c|vx)$ as a function of $v_i$ is highly nonlinear and not well approximated by $\partial \hat{p}(c|vx)/\partial v_i$ evaluated at $v_i = 1$ as in (20).
[Plot: average test error rate (0 to 0.35) versus number of top-ranked features (1 to 10) for Methods 1 to 4]
Fig. 1. Average performance of feature selection on Weston’s nonlinear dataset in terms of average test error rates against top-ranked features where the top-ranked features are chosen by FSPPm-RFE (C = 32.0, γ = 0.03125)
5.2 Real-World Benchmark Problems
The real-world benchmark problems are the breast cancer and heart datasets obtained from [20], used also by Rätsch et al. [19], [21] and Rakotomamonjy [8] in their experiments. Sizes of feature/Dtra/Dval/Dtes are 9/140/60/77 and 13/119/51/100 respectively and each problem has 100 realizations. The format of presentation of the results by [8] is adopted: the mean test error rates of the SVM are plotted against a decreasing number of top-ranked features, and each plot is the mean over 100 realizations. For comparison purposes, the performances of existing feature selection methods are also included. The SVM-RFE by [7] and ∇‖w‖² by [8] feature selection methods are chosen because they appear to be the best performing methods reported in [8] and [10]. Their performances are reproduced together with those using FSPP1-4.
Fig. 2 shows the results for the breast cancer dataset. Several interesting results are noted. In general, the performances of FSPP1 and FSPP2 are similar, with the edge going to FSPP2 for fewer features. FSPP1 and FSPP2 also perform comparably to, if not better than, SVM-RFE and ∇‖w‖². The performance of FSPP3 is again not as appealing. FSPP4-RFE is not shown in Fig. 2 as the computation of (20) fails (the reason will be covered in a follow-up publication). The results for the heart dataset show similar trends to Fig. 2 and are hence not shown.
The difference between the performances of FSPP2 and FSPP3 is most interesting and deserves attention. Both criteria use the same $\hat{p}$ expression obtained from (4) but differ in that $\hat{p}(c|x_{(i),j})$ is used in FSPP2 and $\hat{p}(c|v_{-i}x_j)$ in FSPP3. The sample $x_{(i),j}$ has the ith element taking a value that is randomly redistributed while $v_{-i}x_j$ has the ith element set to 0. The better performance of FSPP2 over FSPP3 appears to suggest and reaffirm the correctness of Theorem 1. It also means that $\hat{p}(c|v_{-i}x)$ differs more from $p(c|x_{-i})$ than $\hat{p}(c|x_{(i)})$ does.
[Plot: average test error rate (0.24 to 0.31) versus number of top-ranked features (1 to 9) for Method 1, Method 2, Method 3, SVM-RFE and Grad w2]
Fig. 2. Average test error rates against top-ranked features over 100 realizations of breast cancer dataset where the top-ranked features are chosen by FSPPm-RFE (C = 2.83, γ = 0.05632) or other best-performing feature selection methods for SVM
5.3 NIPS Challenge Problems
In view of time and space constraints, only the results of FSPP2-RFE on two datasets from the NIPS feature selection competition [14] are reported in the present study. The details of these two datasets, ARCENE and MADELON, are given in Table 1. ARCENE is probably the most challenging dataset among all the datasets from the NIPS feature selection competition as it has the smallest ratio of training set size to number of features (100/10000), while MADELON is a relatively easy dataset with a bigger ratio (2000/500). Based on the experimental results obtained thus far, we use FSPP2-RFE for these problems. Our version of FSPP2-RFE uses a three-tier removal of features for MADELON: 100 features at each recursion until 100 features are left, then 20 features at each recursion until 20 features are left, followed by one feature at each recursion. As for ARCENE, a more aggressive scheme is used: 1000 features are deleted at each recursion. For each dataset, our result of FSPP2-RFE having the best validation accuracy is chosen. Our entries are respectively ranked 1st and 2nd (as of October 14th 2005) in the MADELON and ARCENE group of entries. A comparison between our results and the best entries by other participants of the challenge can be found online in [14].
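The three-tier removal scheme described for MADELON can be written as a small schedule generator. This is only a sketch of the bookkeeping: the function name `removal_schedule` and the tier encoding are hypothetical, and stopping at a single remaining feature is an assumption, since the text does not state where the last tier ends.

```python
def removal_schedule(n_features, tiers):
    """Number of surviving features after each recursion.

    tiers: list of (chunk, floor) pairs; remove `chunk` features per recursion
    until `floor` features are left, then move on to the next tier.
    """
    counts = [n_features]
    n = n_features
    for chunk, floor in tiers:
        while n > floor:
            n = max(n - chunk, floor)
            counts.append(n)
    return counts

# MADELON-style schedule: 100 per step down to 100, 20 per step down to 20,
# then one at a time (final floor of 1 is an assumption).
madelon = removal_schedule(500, [(100, 100), (20, 20), (1, 1)])
```

Under these assumptions the schedule visits 500, 400, 300, 200, 100, 80, 60, 40, 20 and then every count from 19 down to 1, so the SVM is retrained far fewer times than with one-at-a-time removal from 500 features.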
6 Conclusions
This paper introduces a new feature ranking criterion based on the posterior probability of the SVM output. Four approximations, with some motivated by the random forests feature selection method, are proposed for the evaluation of the criterion. These approximations are used in an overall feature selection scheme, FSPP-RFE. The experimental results on various datasets show that three of the four approximations (FSPP1, 2 and 3) yield good overall performances. Among them, FSPP2 has the overall edge in terms of accuracy and shows feature selection performances that are comparable with some of the best methods in the literature.

Table 1. Description of ARCENE and MADELON datasets

Dataset    Type    Features   Dtra   Dval   Dtes
MADELON    Dense   500        2000   600    1800
ARCENE     Dense   10000      100    100    700
References 1. Boser, B., Guyon, I., Vapnik, V. N.: A training algorithm for optimal margin classifiers. The Fifth Annual Workshop on Computational Learning Theory (1992) 144-152 2. Cortes, C., Vapnik, V. N.: Support vector networks. Machine Learning 20 (1995) 273-297 3. Vapnik, V. N.: Statistical Learning Theory. Wiley, New York (1998) 4. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003) 1157-1182 5. Vapnik, V. N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995) 6. Cristianini, N., Shawe-Taylor, J.: Introduction to Support Vector Machines. Cambridge University Press (2000) 7. Guyon, I., Weston, J., Barnhill, S., Vapnik, V. N.: Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3) (2002) 389-422 8. Rakotomamonjy, A.: Variable selection using SVM-based criteria. Journal of Machine Learning Research 3 (2003) 1357-1370 9. Vapnik, V. N.: Estimation of Dependences Based on Empirical Data. Springer-Verlag (1982) 10. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V. N.: Feature selection for SVMs. in Advances in Neural Information Processing Systems 13 (2001) 11. Vapnik, V. N., Chapelle, O.: Bounds on error expectation for support vector machine. Neural Computation 12(9) (2000) 2013-2036
12. Breiman, L.: Bagging predictors. Machine Learning 26 (1996) 123-140 13. Breiman, L.: Random Forests. Machine Learning 45 (2001) 5-32 14. Guyon, I., et al.: NIPS 2003 feature selection competition. Challenge website available: http://www.nipsfsc.ecs.soton.ac.uk/datasets/ (2003) 15. Platt, J.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola, A., Bartlett, P., Schölkopf, B., Schuurmans, D. (eds.): Advances in Large Margin Classifiers. MIT Press, Cambridge, MA (2000) 16. Lin, H. T., Lin, C. J., Weng, R. C.: A note on Platt’s probabilistic outputs for support vector machines. Technical Report, Department of Computer Science, National Taiwan University. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/plattprob.ps (2003) 17. Vapnik, V. N.: Statistical Learning Theory. Wiley (1998) 18. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. Technical report, Stanford University and University of Toronto (1996) 19. Rätsch, G., Onoda, T., Müller, K.-R.: Soft margins for AdaBoost. Machine Learning 43(3) (2001) 287-320 20. Rätsch, G.: Benchmark Repository. Available: http://ida.first.fhg.de/projects/bench/benchmarks.htm 21. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.-R.: Fisher discriminant analysis with kernels. In: Hu, Y.-H., Larsen, J., Wilson, E., Douglas, S. (eds.) Neural Networks for Signal Processing IX, IEEE (1999) 41-48
Unified Kernel Function and Its Training Method for SVM Ha-Nam Nguyen and Syng-Yup Ohn Department of Computer Engineering Hankuk Aviation University, Seoul, Korea {nghanam, syohn}@hau.ac.kr
Abstract. This paper proposes a unified kernel function for the support vector machine, together with a learning method that converges quickly and gives good classification performance. We define the unified kernel function as the weighted sum of a set of different types of basis kernel functions, such as neural, radial, and polynomial kernels, whose weights are trained by a new learning method based on a genetic algorithm. The weights of the basis kernel functions in the unified kernel are determined in the learning phase and used as the parameters of the decision model in the classification phase. The unified kernel and the learning method were applied to obtain the optimal decision model for the classification of two public data sets for the diagnosis of cancer diseases. The experiments showed fast convergence in the learning phase and resulted in an optimal decision model with better performance than the other kernels. The proposed kernel function therefore has greater flexibility in representing a problem space than the other kernel functions.
1 Introduction Support vector machine [1-6] (SVM) is a learning method that uses a hypothesis space of linear functions in a high dimensional feature space. This learning strategy, introduced by Vapnik [2], is a principled and powerful method. In the simplest and linear form, a SVM is the hyperplane that separates a set of positive samples from a set of negative samples with the largest margin. The margin is defined by the distance between the hyperplanes supporting the nearest positive and negative samples. The output formula of a linear case is
$$y = w \cdot x - b \tag{1}$$
where w is a normal vector to the hyperplane and x is an input vector. The separating hyperplane is the plane y = 0, and the two supporting hyperplanes parallel to it at equal distances are
$$H_1: y = w \cdot x - b = +1, \qquad H_2: y = w \cdot x - b = -1. \tag{2}$$
Thus, the margin M is defined as
$$M = 2/\|w\|. \tag{3}$$
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 792 – 800, 2006. © Springer-Verlag Berlin Heidelberg 2006
In order to find the optimal separating hyperplane having a maximal margin, a learning machine should minimize ||w|| subject to the inequality constraints $y_i (w \cdot x_i - b) \ge 1$, i = 1, …, N. This is a classic nonlinear optimization problem with inequality constraints. It can be solved by finding the saddle point of the Lagrange function
$$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left( y_i [w^T x_i - b] - 1 \right) \tag{4}$$
where $\alpha_i \ge 0$ are Lagrange multipliers.
Fig. 1. An input space can be transformed into a linearly separable feature space by an appropriate kernel function
However, the limited computational power of linear learning machines was highlighted in the 1960s by Minsky and Papert [7]. Real-world applications clearly require a more extensive and flexible hypothesis space than linear functions. Such a limitation can be overcome by the multilayer neural networks proposed by Rumelhart, Hinton and Williams, as mentioned in [3]. Kernel functions offer an alternative solution by projecting the data into a high dimensional feature space to increase the computational power of linear learning machines. A nonlinear mapping from the input space to a high dimensional feature space can be implicitly performed by an appropriate kernel function (see Fig. 1). One of the advantages of the kernel method is that knowledge specific to the application area can be encoded simply into the structure of an appropriate kernel function, which the learning algorithm can then exploit [1].
Genetic algorithm [8-10] is an optimization algorithm based on the mechanism of natural evolution. Most genetic algorithms share a common conceptual base of simulating the evolution of individual structures via the processes of selection, mutation, and reproduction. In each generation, a new population is selected based on the fitness values representing the performances of the individuals belonging to the generation, and some individuals of the population are given the chance to undergo alterations by means of crossover and mutation to form new individuals. In this way, GA performs a multi-directional search by maintaining a population of potential solutions and encourages the formation and exchange of information among different directions. GA is generally applied to problems with a large search space. It differs from random algorithms since it combines elements of directed and stochastic search. Furthermore, GA is also known to be more robust than directed search methods.
Recently, SVM and GA have been combined for the classification of biological data related to the diagnosis of cancer diseases and have achieved good performance. GA was used to select the optimal set of features [13, 14], and a recognition accuracy of 80% was achieved in the case of the colon data set. In [15], GA was used to optimize an ensemble of multiple classifiers to improve the classification performance. In this paper, we propose a unified kernel function, defined as a linear combination of basis kernel functions, and a new learning method for the kernel function. In the proposed learning method, GA is applied to derive the optimal decision model for the classification of patterns, which consists of the set of weights for the basis kernels in the unified kernel. The unified kernel and the learning method were applied to classify two clinical data sets related to cancer diagnosis and showed better performance and more stable classification accuracy than single basis kernels. This paper is organized as follows. In Section 2, our new unified kernel and its learning method are presented in detail. In Section 3, the performances of the unified kernel and other kernels are compared by experiments on the classification of clinical datasets on cancer diseases such as colon cancer and leukemia. Finally, conclusions are presented in Section 4.
2 Proposed Approach
2.1 The Proposed Kernel Function
A kernel function provides a flexible and effective learning mechanism in SVM, and the choice of a kernel function should reflect prior knowledge about the problem at hand. However, it is often difficult to exploit prior knowledge on patterns to choose a kernel function, and how to choose the best kernel function for a given data set remains an open question. According to the no free lunch theorem [4] on machine learning, there is no superior kernel function in general, and the performance of a kernel function depends on the application. In our case, a unified kernel function is defined as the weighted sum of a set of different basis kernel functions. The proposed kernel function has the form
$$K_c = \sum_{i=1}^{m} \beta_i \times K_i \tag{5}$$
where $\beta_i \in [0, 1]$ for i = 1, ..., m, $\sum_{i=1}^{m} \beta_i = 1$, and $\{K_i \mid i = 1, \ldots, m\}$ is the set of basis kernel functions to be combined. Table 1 shows the mathematical formulas of the basis kernel functions used to construct the unified kernel function. It can be proved that (5) satisfies the conditions required for kernel functions by Mercer’s theorem [1].

Table 1. Basis kernels chosen for the experiments in our study

Kernel function   Formula
Polynomial        $(\langle x, y \rangle + 1)^2$
Radial            $e^{-\gamma \|x - y\|^2}$
Neural            $\tanh(s \cdot \langle x, y \rangle - c)$

The coefficients $\beta_i$ play the important role of fitting the unified kernel function to a training data set. In the learning phase, the structure of the training sample space is learned by the unified kernel, and this knowledge is embedded in the set of coefficients $\beta_i$. In the learning phase of our approach, GA is applied to obtain the optimal set of coefficients $\beta_i$ that minimizes the generalization error of the classifier. At the end of the learning phase, we obtain the optimal decision model, which is used to classify new pattern samples in the classification phase.
2.2 Learning Method
The overall structure of the classification procedure based on the proposed unified kernel and the learning method is depicted in Fig. 2. The procedure consists of preprocessing, learning, and classification phases.
Fig. 2. Overall framework of proposed method
Firstly, in the preprocessing stage, feature selection methods [16, 17] were used to reduce the dimensionality of the feature space of the input data. Also, the training and testing sets, consisting of a number of cancer and normal patterns, are selected and passed to the learning phase. Secondly, in the learning phase, we applied a learning method based on GA and SVM techniques to obtain the optimal decision model for classification. GA generates a set of chromosomes representing decision models by evolutionary procedures. The fitness value of each chromosome is evaluated by measuring the classification accuracy of the SVM that uses the decision model associated with the chromosome. An n-fold validation was used to evaluate the fitness of a chromosome to reduce overfitting [4]. In the evolutionary procedure of GA, only the chromosomes with good fitness values are selected and given the chance to survive and improve in further generations. The roulette wheel rule [8] is used for the selection of chromosomes in our learning phase. Some of the selected chromosomes are given the chance to undergo alterations by means of crossover and mutation to form new chromosomes. In our approach, one-point crossover is used, and the probabilities for crossover and mutation are 0.8 and 0.015 respectively. The procedure is repeated for a predefined number of generations. At the end of the GA procedure, the chromosome with the highest accuracy is chosen as the optimal decision model. Finally, the optimal decision model obtained in the learning phase is used in the SVM for the classification of new samples in the classification phase, and the performance of the model is evaluated against test samples.
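The weighted-sum kernel of Eq. (5) and the GA operators named above (roulette wheel selection, one-point crossover, mutation) can be sketched as follows. The paper does not specify the chromosome encoding, so this sketch assumes a chromosome is simply the list of weights β, renormalized to sum to one; the kernel parameters `gamma`, `s` and `c` and all function names are hypothetical.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly(x, y):
    return (dot(x, y) + 1.0) ** 2

def rbf(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def neural(x, y, s=0.1, c=0.0):
    return math.tanh(s * dot(x, y) - c)

def unified_kernel(beta, kernels):
    """Eq. (5): weighted sum of basis kernels with weights beta."""
    return lambda x, y: sum(b * k(x, y) for b, k in zip(beta, kernels))

def normalize(beta):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(beta)
    if total == 0:
        return [1.0 / len(beta)] * len(beta)
    return [b / total for b in beta]

def roulette_select(population, fitness, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitness))
    acc = 0.0
    for chrom, fit in zip(population, fitness):
        acc += fit
        if acc >= pick:
            return chrom
    return population[-1]

def one_point_crossover(a, b, rng, p_cross=0.8):
    """Swap the tails of two parents after a random cut point."""
    if len(a) > 1 and rng.random() < p_cross:
        cut = rng.randint(1, len(a) - 1)
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return list(a), list(b)

def mutate(chrom, rng, p_mut=0.015):
    """Replace each gene by a fresh random weight with probability p_mut."""
    return [rng.random() if rng.random() < p_mut else g for g in chrom]
```

In a full learning loop, the fitness of each chromosome would be the cross-validated accuracy of an SVM using `unified_kernel(normalize(chrom), [poly, rbf, neural])`; that evaluation is omitted here because it depends on the SVM implementation.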
3 Experiments and Analysis
In this section, we present the classification results for the model trained with the unified kernel and the new learning method. We also compare the performance of the classification model using the unified kernel with that of models using other kernels.

3.1 Environments for Experiment
All the experiments were conducted on a Pentium IV 1.8 GHz computer. The experiments consist of preprocessing of samples, learning by GA to obtain the optimal decision model, and classification. For the GA, we used the roulette wheel rule as the selection method. Our proposed method was executed with 100 chromosomes for 50 generations. The unified kernel function and the three other kernel functions in Table 1 were trained by GA in the learning phase with the training set. The three kernel functions were chosen because they are known to perform well in the field of bioinformatics [4, 6, 13-15]. We used 5-fold cross validation to measure the fitness in order to reduce overfitting [4]. The optimal decision model obtained after 50 generations of GA was used to classify the set of test samples. The experiments for each kernel function were repeated 50 times to obtain generalized results.

3.2 Colon Tumor Cancer
The colon cancer dataset [11] contains gene expression information extracted from DNA microarrays. The dataset consists of 22 normal and 40 cancer tissue samples, and each sample has 2000 features (available at: http://sdmc.lit.org.sg/GEDatasets/Data/ColonTumor.zip). 42 samples were chosen randomly as training samples and the remaining samples were used as testing samples. We chose the first 50 features based on the t-test statistic. Fig. 3 shows the feature importance [16, 17, 18] of the first 15 features in decreasing order. Each column is labeled with the index of the feature in the data set.

[Figure: bar chart; vertical axis –lg(p), 0 to 7; columns labeled with feature indices 1772, 1582, 1771, 780, 513, 138, 515, 625, 1325, 43, 1060, 1153, 964, 399, 72]
Fig. 3. The –lg(p) value of the first 15 features

[Figure: line chart; vertical axis 0% to 100%; horizontal axis ticks 1 to 49; series: Polynomial, Radial, Neural, Weighted Kernel]
Fig. 4. The comparison of hit rate in classification phase of the unified kernel function case (bold line) with other kernel functions
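The t-test based feature ranking used in this subsection could be sketched as follows. The paper does not state which variant of the t-test was used, so this sketch assumes the pooled-variance two-sample statistic with class labels 0 and 1; the function names are hypothetical. With a pooled variance the degrees of freedom are the same for every feature, so ranking by |t| gives the same order as ranking by –lg(p); computing the p-values themselves would additionally need a t-distribution CDF, which is left out here.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Pooled-variance two-sample t statistic (assumed variant)."""
    na, nb = len(a), len(b)
    diff = mean(a) - mean(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        # Constant feature within both classes: perfect separator or useless.
        return math.copysign(math.inf, diff) if diff else 0.0
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))

def rank_features(X, y, top_k):
    """Return the indices of the top_k features with the largest |t|."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((abs(t_statistic(a, b)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_k]]

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 4.0], [3.0, 2.0], [3.1, 5.5], [2.9, 1.5]]
y = [0, 0, 0, 1, 1, 1]
selected = rank_features(X, y, top_k=1)
```

In the setting of the paper, `top_k` would be 50 and `X` would hold the training samples only.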
For the colon dataset, the proposed method with the unified kernel function showed more stable and higher accuracy than the other kernels (see Fig. 4). In Table 2, the unified kernel function showed the best average performance, with a recognition rate of 86.23%. Table 2. The averages of classification accuracies based on unified kernel and three other kernels
                          Unified kernel   Polynomial      Radial          Neural
Classification accuracy   86.23% ± 6.05    72.84% ± 7.07   82.94% ± 6.19   62.41% ± 5.74
Table 3. The best prediction rate of some studies in case of Colon dataset
Type of classifier          Prediction rate ± S.D. (%)
GA\SVM [12]                 84.7 ± 9.1
Bootstrapped GA\SVM [13]    80.0
Unified Kernel              86.23 ± 6.05
The comparison of our experiments with the results of previous studies [12, 13] is given in Table 3. Our experiments showed accuracy comparable to the previous ones, and the standard deviation of the prediction rate for the unified kernel is smaller than that of GA\SVM (see Table 3). This suggests that the new kernel and the learning method yield more stable classification accuracy than the previous method.

3.3 Leukemia Cancer
The Leukemia dataset [12] consists of 72 samples that have to be discriminated into two classes, ALL and AML. There are 47 ALL and 25 AML samples and each sample contains 7129 features. The dataset was divided into a training set with 38 samples (27 ALL and 11 AML) and a test set with 34 samples (20 ALL and 14 AML) (available at: http://sdmc.lit.org.sg/GEDatasets/Data/ALL-AML_Leukemia.zip).

[Figure: bar chart; vertical axis –lg(p), 0 to 7; columns labeled with feature indices V5772, V4328, V2020, V6281, V1306, V5593, V2354, V6471, V2642, V4535, V6974, V149, V2441, V1630, V5254]
Fig. 5. The –lg(p) value of the first 15 features

[Figure: line chart; vertical axis 0% to 100%; horizontal axis ticks 1 to 49; series: Polynomial, Radial, Neural, Weighted Kernel]
Fig. 6. The comparison of accuracy of the unified kernel function case (bold line) with single kernel functions in classification phase
As in the experiment on the colon cancer data set, we chose 50 features based on the t-test statistic for the experiment on the Leukemia data set. Fig. 5 shows the feature importance of the first 15 features in decreasing order. Each column is labeled with the index of the feature in the data set. The experiments show that the unified kernel and the proposed learning method result in more stable and higher accuracies than the other kernels (see Fig. 6). According to Table 4, the unified kernel shows the best average performance, with a recognition accuracy of 96.07%. Table 4. The average of classification accuracy using the decision model obtained from GA
                          Weighted kernel   Polynomial       Radial          Neural
Classification accuracy   96.07% ± 3.41     60.93% ± 16.11   95.13% ± 7.00   82.20% ± 12.88
In Table 5, we compare the prediction results of our method with the results of other studies performed on the same dataset [11, 13]; our results are comparable to the others. Table 5. The best prediction rate of some studies in case of Leukemia dataset
Type of classifier          Prediction rate ± S.D. (%)
Weighted voting [11]        94.1
Bootstrapped GA\SVM [13]    97.0
Weighted Kernel             96.07 ± 3.14
4 Conclusions
In this paper, we proposed the unified kernel function, which combines a set of basis kernel functions for SVM, and a GA-based learning method that obtains the optimal decision model for classification. The kernel function plays an important role in mapping the problem feature space into a new feature space so that the performance of the SVM classifier is improved. The unified kernel function and the proposed learning method were applied to the classification of clinical datasets to test their performance. In the comparison between the unified kernel and three other kernel functions, the unified kernel function achieved higher and more stable accuracy in the classification phase than the other kernels. This suggests that our kernel function has greater flexibility in representing a problem space than the other kernel functions.
Acknowledgement This research was supported by RIC (Regional Innovation Center) in Hankuk Aviation University. RIC is a Kyounggi-Province Regional Research Center designated by Korea Science and Engineering Foundation and Ministry of Science & Technology.
References
1. N. Cristianini and J. Shawe-Taylor: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge, 2000.
2. V.N. Vapnik et al.: Theory of Support Vector Machines, Technical Report CSD TR-9617, Univ. of London, 1996.
3. Vojislav Kecman: Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models (Complex Adaptive Systems), The MIT Press, 2001.
4. Richard O. Duda, Peter E. Hart, David G. Stork: Pattern Classification (2nd Edition), John Wiley & Sons Inc., 2001.
5. Thorsten Joachims: Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, 1999.
6. Bernhard Schölkopf, Alexander J. Smola: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning), MIT Press, 2002.
7. M.L. Minsky and S.A. Papert: Perceptrons, MIT Press, 1969.
8. Z. Michalewicz: Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, 3rd rev. and extended ed., 1996.
9. D. E. Goldberg: Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley, 1989.
10. Melanie Mitchell: Introduction to Genetic Algorithms, MIT Press, fifth printing, 1999.
11. U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays, Proceedings of the National Academy of Sciences of the United States of America, vol. 96, pp. 6745-6750, 1999.
12. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield and E. S. Lander: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, vol. 286, pp. 531–537, 1999.
13. H. Fröhlich, O. Chapelle, B. Schölkopf: Feature selection for support vector machines by means of genetic algorithm, Tools with Artificial Intelligence, Proceedings. 15th IEEE International Conference, pp. 142–148, 2003.
14. Xue-wen Chen: Gene selection for cancer classification using bootstrapped genetic algorithms and support vector machines, The Computational Systems Bioinformatics Conference. Proceedings IEEE International Conference, pp. 504–505, 2003.
15. Chanho Park and Sung-Bae Cho: Genetic search for optimal ensemble of feature-classifier pairs in DNA gene expression profiles, Neural Networks, 2003. Proceedings of the International Joint Conference, vol. 3, pp. 1702–1707, 2003.
16. Stefan Rüping: mySVM-Manual, University of Dortmund, Lehrstuhl Informatik, 2000. [online] (Available: http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM)
17. Kohavi, R. and John, G.H.: Wrappers for Feature Subset Selection, Artificial Intelligence (1997) pages: 273-324
18. Blum, A. L. and Langley, P.: Selection of Relevant Features and Examples in Machine Learning, Artificial Intelligence (1997) pages: 245-271
19. Tom M. Mitchell: Machine Learning, McGraw Hill (1997)
A Spectrum-Based Support Vector Algorithm for Relational Data Semi-supervised Classification
Ling Ping1,2, Wang Zhe1, and Zhou Chunguang1,*
1 College of Computer Science, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Changchun 130012, China [email protected]
2 School of Computer Science, Xuzhou Normal University, Xuzhou, 221116, China [email protected]
Abstract. A Spectrum-based Support Vector Algorithm (SSVA) for semi-supervised classification of relational data is presented in this paper. SSVA extracts data representatives and groups them with spectral analysis. Label assignment is done according to affinities between data and data representatives. The Kernel function encoded in SSVA is adapted to the relational setting and parameterized by supervisory information. In addition, the penalty coefficient and the Kernel scale parameter are self-tuned, which eliminates the need to search parameter spaces. Experiments on real datasets demonstrate the performance and efficiency of SSVA.
1 Introduction
Recently, data mining has been extended to Relational Data Mining (RDM) [1], where a relation is expressed by a table. RDM looks for patterns that involve several connected tables. In such a more structured data environment, one straightforward solution is to integrate all involved tables into a single comprehensive table, so that traditional mining algorithms can work on it. This, however, loses structural information and produces a huge table that is unwieldy to operate on. Naturally, RDM requires mining from multiple tables directly.
Kernel functions are often used to tackle the mining tasks of RDM, because a Kernel implicitly defines the nonlinear map from the original space to the feature space. A suitably designed Kernel yields a linear version of a problem that would otherwise be nonlinear in input space, without introducing extra computational effort. This has made Kernels and Kernel-based methods an active research area of RDM. For example, Support Vector Clustering (SVC) [2] and Spectral Clustering (SC) [3] exhibit impressive performance in their application areas.
The description of relational data may cover many tables, so definite label information might be unavailable in advance. Often some supervisory information is provided in the form of pairwise similarity [4]. With this weak information, or side-information, the classification problem is equivalent to semi-supervised clustering, where the decision model is learned from both weakly labeled and unlabeled data. This paper proposes a Spectrum-based Support Vector Algorithm (SSVA) to address semi-supervised classification. The Kernel function designed for SSVA is tailored to the relation schema and is learned from side-information. SSVA is equipped with tuning strategies for the penalty parameter and the Kernel scale. This spares SSVA the costly search of parameter spaces that traditional optimization algorithms require.

* Corresponding author.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 801 – 810, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Related Work
First we review SVC. Let $x_i \in X$, where $X = \Re^n$ is the input space, and let $\Phi$ be the nonlinear transformation from $X$ to the feature space. To find the minimum hypersphere that encloses all data, the objective function with slack variables $\xi_i$ is designed as:

$$\min_{R,\xi} \; R^2 + C \sum_i \xi_i \quad \text{s.t.} \quad \|\Phi(x_i) - a\|^2 \le R^2 + \xi_i, \;\; \xi_i \ge 0. \qquad (1)$$

Here $a$ and $R$ are the center and radius of the sphere, and $C$ is the penalty parameter. Transforming the Lagrangian into its Wolfe dual and introducing the Kernel trick leads to:

$$\max_{\beta} \; \sum_i \beta_i K(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_i \beta_i = 1, \;\; 0 \le \beta_i \le C. \qquad (2)$$

Points within clusters satisfy $\xi_i = 0$ and $\beta_i = 0$. Points with $\xi_i = 0$ and $0 < \beta_i < C$ are referred to as non-bounded Support Vectors (nbSV), and they describe cluster contours. Points with $\xi_i > 0$, $\gamma_i = 0$ and $\beta_i = C$ are bounded Support Vectors (bSV), where $\gamma_i$ is the Lagrange multiplier associated with the constraint $\xi_i \ge 0$.
The Gaussian Kernel $K(x_i, x_j) = \exp(-q \|x_i - x_j\|^2)$ is adopted in (2). Cluster assignment is done by computing an adjacency matrix A, and label membership is identified according to the connected components of the graph induced by A.
We next review the Spectral Clustering (SC) approach. It obtains the spectral projections of the data by eigen-decomposing an affinity matrix, and then groups the spectral coordinates with a simple method. Each point receives the label of its spectral coordinate. The main steps are:
a) Compute the affinity matrix H, and normalize H into H' in a desired fashion;
b) Conduct the eigen-decomposition of H', and select the top M eigenvectors. Form the spectral embedding matrix S by stacking the M eigenvectors in columns;
c) Perform K-means on the rows of S, which are the points' spectral coordinates. Assign the cluster label of the ith point according to the cluster membership of the ith row.
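Step a) of the spectral clustering procedure could be sketched as follows. This is an illustration under stated assumptions: the affinity is taken to be the Gaussian Kernel given above, and the normalization is the symmetric form $H' = D^{-1/2} H D^{-1/2}$ with $D_{ii} = \sum_j H_{ij}$, which is one common choice. The function names are hypothetical. Steps b) and c) (eigen-decomposition and K-means) need a linear algebra library and are not shown.

```python
import math

def gaussian_affinity(points, q):
    """H[i][j] = exp(-q * ||x_i - x_j||^2) for every pair of points."""
    return [
        [math.exp(-q * sum((a - b) ** 2 for a, b in zip(p, r))) for r in points]
        for p in points
    ]

def normalize_affinity(H):
    """Symmetric normalization H' = D^{-1/2} H D^{-1/2}, D_ii = sum_j H_ij."""
    degree = [sum(row) for row in H]
    n = len(H)
    return [
        [H[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]

points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
H = gaussian_affinity(points, q=1.0)
H_norm = normalize_affinity(H)
```

The scale `q` is passed in explicitly here; choosing it is the subject of the self-tuning strategy the paper goes on to describe.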
3 SSVA Algorithm
3.1 Algorithm Steps
The idea of SSVA is to first extract data representatives, which is accomplished by a self-tuning SVC procedure. A spectral clustering procedure is then conducted on the representatives to classify them into groups. Label assignment is done by checking the affinities between a point and the data representatives. The steps are as follows:
Step 1: Perform tuning SVC to produce data representatives.
Step 2: Do spectral analysis on the representatives to group them into clusters.
Step 3: Each point shares the membership of its most similar representative.
Note that in Step 1 a self-tuning scale is employed in SVC, and this makes the generated nbSVs serve as data representatives. The whole extraction is a one-run procedure. It therefore differs from traditional SVC, which runs the optimization several times with varying scale in order to select the best case. The tuning approach is described in Section 3.2. The spectral clustering method used in Step 2 is the NJW version [3], whose normalization is $H' = D^{-1/2} H D^{-1/2}$, where $D$ is diagonal with $D_{ii} = \sum_{j=1}^{n} H_{ij}$. The number of clusters is discovered according to the maximum gap in the magnitude of the eigenvalues. In Step 3, again because of the self-tuning Kernel scale, SVC produces nbSVs that express both data boundaries and important locations that reveal dramatic changes in distribution density. This provides a good reason for giving a point the same membership as its nearest data representative.

3.2 Parameterization of Scale q
For algorithms based on the Gaussian Kernel, the width q plays an important role, since it controls the scale of the affinity measurement. Conventionally, algorithms have to be run several times to find a good setting. In a relational environment, distribution density varies from table to table. If one fixed scale is specified, accurate affinities can only be formulated on a few tables, because a single scale works well only under certain distributions. To avoid this drawback, this paper sets the q value based on the local data context. For each table, we design an automatic parameterization approach that learns an appropriate setting from each data point's neighborhood. That is, for a point x we set its density factor $\sigma_x = \|x - x_r\|$ to provide local distribution information. Here $x_r$ is the rth nearest point of x. In other words, $x_r$ is the rth point in the list of other points sorted by ascending distance from x. Intuitively, given r, if $\|x - x_r\|$ [...]

[...] 2) classes of patterns is to be partitioned into two subsets $s_1$ and $s_2$, which are initialized to empty sets. For any $(x, y) \in S$, where $x \in R^n$, $y \in \{1, \ldots, c\}$:
(a) if there is an $(x', y') \in s_i$ such that $y = y'$, assign $(x, y)$ to $s_i$;
(b) else, if there is an $s_i = \emptyset$, assign $(x, y)$ to $s_i$;
(c) else, find the nearest neighbor $(x', y') \in \{s_1, s_2\}$ of $(x, y)$, and assign $(x, y)$ to the subset which contains $(x', y')$.
In this way the teacher signals of the examples for the current node are constructed.
Table 1. Construction process of FS-CSVMT
FSCSVMT(S, Cγ_{p,m}, Sr*)
Input: training set S; confusion cross factor Cγ_{p,m}; threshold of feature selection ratio Sr*.
Output: FS-CSVMT for binary classification.
Procedure:
  Initialize(T, S, Cγ_{p,m}, Sr*);
  if BuildLeaf(S) == TRUE                            % Build a leaf node if training patterns belong to one class.
    return(T);
  else
    currentSVM = TrainSVM(S);                        % Train an SVM for current subsample
    S_selected = FS(currentSVM, S, Sr*)              % Build feature space on selected features for current node
    currentNode = TrainNode(S_selected)              % Build current internal node
    [leftS, rightS] = Cross(currentNode, S, Cγ_{p,m});   % Confusion cross is performed.
    Cγ_{p_son,m+1} = DecreaseOn(Cγ_{p,m});           % Decrease confusion cross factor as depicted in Eq. (1)
    FSCSVMT(leftS, Cγ_{p_son,m+1}, Sr*);             % Construct left sub-tree of FS-CSVMT
    FSCSVMT(rightS, Cγ_{p_son,m+1}, Sr*);            % Construct right sub-tree of FS-CSVMT
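The rule (a) to (c) that splits a multi-class sample into two subsets before a node is trained could be sketched as follows. This is a minimal reading of the rule, not the authors' code: samples are assumed to be processed in the order given, Euclidean distance is assumed for the nearest-neighbor step, and the function name is hypothetical.

```python
import math

def partition_two_subsets(samples):
    """Split (x, y) pairs into two subsets following rules (a), (b), (c)."""
    subsets = [[], []]
    for x, y in samples:
        # (a) a subset that already contains a pattern of the same class
        target = next(
            (s for s in subsets if any(label == y for _, label in s)), None
        )
        if target is None:
            # (b) otherwise, a subset that is still empty
            target = next((s for s in subsets if not s), None)
        if target is None:
            # (c) otherwise, the subset holding the nearest already assigned pattern
            target = min(
                subsets, key=lambda s: min(math.dist(x, x2) for x2, _ in s)
            )
        target.append((x, y))
    return subsets

samples = [((0.0, 0.0), 1), ((10.0, 10.0), 2), ((1.0, 1.0), 3), ((0.5, 0.0), 1)]
s1, s2 = partition_two_subsets(samples)
```

The subset index of each pattern (0 or 1) then serves as its binary teacher signal for the SVM at the current node.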
4 Experimental Results
In this section, we report experimental results on the optical recognition of handwritten digits data set available in the UCI data repository. The 32x32 bitmaps are divided into nonoverlapping blocks of 4x4 and the number of on pixels is counted in each block. This generates 64 (8x8) features, where each element is an integer in the range {0, 1, …, 16}. The data set consists of 5620 instances distributed in 10 classes. We use 2/3 of the instances for training and the rest for testing. The numerical results averaged over 20 runs are listed in Table 2, where $Sr$, $N_{size}$, $T_{size}$, $N_F$ and Accuracy are the feature selection ratio, the average number of support vectors for each internal node, the average number of internal nodes in the support vector machine tree, the average number of input features for each internal node, and the test accuracy, respectively. Test efficiency is evaluated by $O(N_{size} T_{size} N_F)$. Table 2 shows that the average number of input features for each internal node of FS-CSVMT is much smaller (53.89) than that of CSVMT (64), even for $Sr = 1.0$. This is because features with credit 0 are considered unnecessary for the local decision and are therefore discarded. The results indicate that, for $Sr = 0.55$, FS-CSVMT can achieve a model with simpler internal nodes (86.80/112.78), fewer input features (15.76/64), competitive test recognition accuracy and the same tree size (9.0/9.0) compared to CSVMT. We also compare the performance of the proposed FS-CSVMT with that of three multiclass SVM methods, i.e. 1-v-r SVM, 1-v-1 SVM and DAGSVM, analyzed in [18]. The results are listed in Table 3, where $N_{sv}$ is the average number of support vectors for each SVM, and $N_{svm}$ is the average number of SVMs trained for the multi-classification problem. Table 3 indicates that FS-CSVMT reaches comparable recognition accuracy with a condensed model structure compared with the three multiclass SVM models. Fig. 3 plots the performance of FS-CSVMT over the range of the feature selection ratio. We observe that the performance increases only slowly once the feature selection ratio reaches a certain value (about 0.55), i.e. an average of 15.76 features for each internal node may be sufficient for the local decision. Fig. 4 compares the performance of FS-CSVMT with that of CSVMT over the range of test cost, evaluated by the normalized product of $N_{size}$, $T_{size}$ and $N_F$. It confirms that the proposed model reaches good test recognition accuracy at lower test cost. Table 2. Numerical results of FS-CSVMT and CSVMT
Model      Sr     N_size   T_size   N_F     Accuracy (%)
FS-CSVMT   0.35   62.12    16.20    7.64    92.1
FS-CSVMT   0.40   55.54    21.00    9.20    94.2
FS-CSVMT   0.45   78.81    11.20    11.12   95.8
FS-CSVMT   0.50   90.56    9.00     13.72   96.1
FS-CSVMT   0.55   86.80    9.00     15.76   97.7
FS-CSVMT   0.60   88.50    9.00     17.56   97.8
FS-CSVMT   0.65   87.49    9.00     19.67   98.3
FS-CSVMT   0.70   87.06    9.00     21.94   98.6
FS-CSVMT   0.75   94.91    9.00     25.38   98.4
FS-CSVMT   0.85   104.93   9.00     32.47   98.6
FS-CSVMT   1.00   112.78   9.00     53.89   98.9
CSVMT      –      107.11   9.00     64.00   98.8
Table 3. Comparative performance of FS-CSVMT
Methods     N_sv     N_svm   N_F     Accuracy (%)
1-v-r SVM   461.77   10.00   64.00   98.4
1-v-1 SVM   204.73   45.00   64.00   99.0
DAGSVM      204.73   45.00   64.00   99.0
FS-CSVMT    119.17   9.00    54.61   98.9
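The 64 block-count features described at the start of this section could be computed as in the sketch below. It assumes a bitmap is given as a 32x32 list of 0/1 rows and that features are emitted block by block in row-major order; the function name and the ordering are illustrative assumptions, not taken from the data set documentation.

```python
def block_count_features(bitmap, block=4):
    """Count the on pixels in each nonoverlapping block x block tile of a square bitmap."""
    size = len(bitmap)
    features = []
    for top in range(0, size, block):
        for left in range(0, size, block):
            features.append(
                sum(
                    bitmap[r][c]
                    for r in range(top, top + block)
                    for c in range(left, left + block)
                )
            )
    return features

# A 32x32 bitmap with a single on pixel at row 5, column 9.
bitmap = [[0] * 32 for _ in range(32)]
bitmap[5][9] = 1
features = block_count_features(bitmap)
```

Each feature is an integer between 0 and 16, matching the range stated in the text.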
Fig. 3. Performance of FS-CSVMT for feature selection ratio
Fig. 4. Comparative performance of FS-CSVMT and CSVMT for normalized test cost
5 Conclusion
In this paper, we proposed a hybrid model named FS-CSVMT, based on previous research on CSVMT. During the learning of each internal node, feature credits measured by derivative-based sensitivity are evaluated, and features are selected according to the feature selection ratio. Features with low credits are considered unnecessary for the local decision and are discarded. In this way, we can construct a CSVMT with relatively simple internal nodes. The experimental results suggest that the FS-CSVMT learning approach can achieve a model with competitive performance and better test efficiency than the CSVMT learning approach.
References
1. Mizuno, S., Zhao, Q.F.: Neural network trees with nodes of limited inputs are good for learning and understanding, In: Proc. 4th Asia-Pacific Conference on Simulated Evolution And Learning, Singapore. (2002) 573–576
2. Zhou, Z.H., Chen, Z.Q.: Hybrid decision tree, Knowledge-Based Systems. 15 (2002) 515–518
3. Xu, Q.Z., Zhao, Q.F., Pei, W.J., Yang, L.X., He, Z.Y.: Design interpretable neural network trees through self-organized learning of features, In: Proc. International Joint Conference on Neural Networks, Hungary. (2004) 1433–1438
4. Brent, R.P.: Fast training algorithm for multilayer neural nets, IEEE Trans. Neural Networks. 2 (1991) 346–354
5. Kubat, M.: Decision trees can initialize radial-basis function networks, IEEE Trans. Neural Networks. 9 (1998) 813–821
6. Tsang, E.C.C., Wang, X.Z., Yeung, D.S.: Improving Learning Accuracy of Fuzzy Decision Trees by Hybrid Neural Networks, IEEE Trans. Fuzzy Systems. 8 (2000) 601–614
7. Guo, H., Gelfand, S.B.: Classification trees with neural network feature extraction, IEEE Trans. Neural Networks. 3 (1992) 923–933
8. Krishnan, R., Sivakumar, G., Bhattacharya, P.: Extracting decision trees from trained neural networks, Pattern Recognition. 32 (1999) 1999–2009
9. Schmitz, G.P.J., Aldrich, C., Gouws, F.S.: ANN-DT: an algorithm for extraction of decision trees from artificial neural networks, IEEE Trans. Neural Networks. 10 (1999) 1392–1401
10. Cheong, S., Oh, S.H., Lee, S.Y.: Support vector machines with binary tree architecture for multi-class classification, Neural Information Processing–Letters and Reviews. 2 (2004) 47–51
11. Zhao, Q.F.: A new method for efficient design of neural network trees, Technical report of IEICE, PRMU2004-115. (2004) 59–64
12. Bennett, K., Cristianini, N., Shawe-Taylor, J., Wu, D.: Enlarging margins in perceptron decision trees, Machine Learning. 41 (2000) 295–313
13. Xu, Q.Z., Song, A.G., Pei, W.J., Yang, L.X., He, Z.Y.: Tree-structured Support Vector Machine With Confusion Cross For Complex Pattern Recognition Problems, In: Proc. IEEE International Workshop on VLSI Design and Video Technology. (2005) 195–198
14. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization, In: B. Schölkopf, C. Burges, and A. Smola (eds.), Advances in kernel methods: support vector learning, MIT Press: Cambridge, MA. (1999) 185–208
15. Belousov, A.I., Verzakov, S.A., von Frese, J.: A flexible classification approach with optimal generalization performance: Support vector machines, Chemometrics and Intelligent Laboratory Systems. 64 (2002) 15–25
16. Keerthi, S.S., Lin, C.J.: Asymptotic behaviors of support vector machines with Gaussian kernel, Neural Computation. 15 (2003) 1667–1689
17. Sindhwani, V., Rakshit, S., Deodhare, D., Erdogmus, D., Principe, J.C., Niyogi, P.: Feature selection in MLPs and SVMs based on maximum output information, IEEE Trans. Neural Networks. 15 (2004) 937–948
18. Hsu, C.W., Lin, C.J.: A Comparison of Methods for Multiclass Support Vector Machines, IEEE Trans. Neural Networks. 13 (2002) 415–425
A Competitive Co-evolving Support Vector Clustering
Sung-Hae Jun1 and Kyung-Whan Oh2
1 Department of Bioinformatics & Statistics, Cheongju University, Chungbuk, Korea [email protected]
2 Department of Computer Science, Sogang University, Seoul, Korea [email protected]
Abstract. The goal of clustering is to partition objects into groups that are internally homogeneous and heterogeneous from group to group. Clustering is an important tool for diverse intelligent systems, and it has been studied extensively in machine learning. However, some problems remain. One of them is determining the optimal number of clusters: in the K-means algorithm, the number of clusters K is chosen subjectively by the researcher. Another problem is overfitting of learning models, from which the majority of learning algorithms for clustering are not free. Therefore, we propose a competitive co-evolving support vector clustering. Using competitive co-evolutionary computing, we overcome the overfitting problem of support vector clustering, which is a good learning model for clustering. The number of clusters is efficiently determined by our competitive co-evolving support vector clustering. To verify the improved performance of our method, we compare competitive co-evolving support vector clustering with established clustering methods using data sets from the UCI machine learning repository.
1 Introduction
Support vector clustering(SVC) is a clustering algorithm based on the support vector machine(SVM)[4],[20],[21],[25]. SVC has been applied to diverse clustering problems[1],[16],[22],[27]. The basic concept of SVC is to map data points by a Gaussian kernel to a high dimensional feature space and to find a sphere with minimal radius that contains most of the mapped data points in the feature space. When mapped back to data space, this sphere can separate into several components, each enclosing a separate cluster of points. SVC is able to determine the number of clusters by support vectors[26]. This advantage of SVC is the starting point of our clustering study. Clustering has been studied extensively in machine learning, but some problems remain. Firstly, it is difficult to determine the optimal number of clusters. For example, in K-means clustering, the number of clusters K is chosen subjectively by the researcher. Another problem is overfitting of learning models, from which the majority of learning algorithms for clustering are not free. So, we propose a competitive co-evolving support vector clustering(CE-SVC). Using competitive co-evolutionary computing, we overcome the overfitting problem of SVC. The number of clusters can be efficiently determined by our CE-SVC.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 864 – 873, 2006. © Springer-Verlag Berlin Heidelberg 2006
Statistical learning theory(SLT) was developed by Vapnik[14],[24],[25]. It is a learning theory based on the Vapnik-Chervonenkis(VC) dimension, and it has been used as an analytical tool in learning models. In general, learning methods suffer from several problems. One of them is the local optima problem, which is partly caused by overfitting. SLT-based methods have the same problems, because the kernel type, the kernel parameters, and the regularization constant C are determined subjectively by the researcher. So, we propose CE-SVC as an evolutionary SLT method to address these problems of the original SLT. Our algorithm is constructed by combining evolutionary computing with SLT. One of the goals of CE-SVC is supervised applications such as web usage mining and pattern classification. So, in our experiments, we compare CE-SVC with popular supervised and unsupervised learning algorithms. We verify the improved performance of the CE-SVC algorithm using data sets from the UCI machine learning repository.
2 Related Works
2.1 Co-evolutionary Computing
In co-evolution, the fitness of the main population is influenced by another population, and the main population in turn affects the fitness of the other one[8]. Without co-evolution, the fitness landscape is fixed, and the same individual always has the same fitness. With co-evolution, the fitness landscape is not fixed, and the fitness of an individual depends on other individuals. Therefore, the same individual may not have the same fitness in different populations. That is, co-evolution can be regarded as a kind of landscape coupling, where adaptive moves by one individual deform the landscape of others.

2.2 Support Vector Clustering
SVC is a clustering method using support vectors[1]. In SVC, data points are mapped by means of a Gaussian kernel to a high dimensional feature space, where the minimal enclosing sphere is found. When this sphere is mapped back to data space, it forms a set of contours which enclose the data points. These contours are interpreted as cluster boundaries. Points enclosed by each separate contour belong to the same cluster. Let $\{x_i\} \subseteq X$ (data space, $X \subseteq R^d$) be a data set of n points. The smallest enclosing sphere of radius R can be found using a nonlinear transformation $\Phi$ from X to some high dimensional feature space. This is represented as follows:

$$\|\Phi(x_j) - o\|^2 \le R^2 \quad \forall j \qquad (1)$$

where $o$ is the center of the sphere. Soft constraints are incorporated by adding slack variables $\xi_j$:

$$\|\Phi(x_j) - o\|^2 \le R^2 + \xi_j \quad \forall j \qquad (2)$$

where $\xi_j \ge 0$. This problem can be solved using the following Lagrangian[11]:

$$L = R^2 - \sum_j \left(R^2 + \xi_j - \|\Phi(x_j) - o\|^2\right)\beta_j - \sum_j \xi_j \mu_j + C \sum_j \xi_j \qquad (3)$$

where $\beta_j$ and $\mu_j$ are non-negative Lagrange multipliers, C is a constant, and $C \sum_j \xi_j$ is a penalty term. Cluster assignment uses a geometric approach involving $R(x)$, the distance of the image of x from the center of the sphere in feature space. It is based on the following observation: given a pair of data points that belong to different clusters, any path connecting them must exit from the sphere in feature space. Such a path therefore contains a segment of points y with $R(y) > R$. The clusters can be defined using the following adjacency matrix $A_{ij}$ between $x_i$ and $x_j$:

$$A_{ij} = \begin{cases} 1 & \text{if } R(y) \le R \text{ for all } y \text{ on the segment connecting } x_i \text{ and } x_j \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

In other words, the clustering of SVC is performed using the above $A_{ij}$.
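The cluster labeling defined by the adjacency matrix in Eq. (4) could be sketched as follows. This is an illustration under stated assumptions: the segment between two points is checked at a finite number of sample points (`n_samples` is an arbitrary choice), and `radius_fn` is a hypothetical callable standing in for R(y), which in a real SVC would be computed from the trained kernel expansion. The toy `radius_fn` below is only a stand-in so the example runs.

```python
import math

def svc_adjacency(points, radius_fn, R, n_samples=10):
    """A[i][j] = 1 if every sampled y on the segment x_i -> x_j has R(y) <= R."""
    n = len(points)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            inside = True
            for s in range(n_samples + 1):
                t = s / n_samples
                y = [a + t * (b - a) for a, b in zip(points[i], points[j])]
                if radius_fn(y) > R:
                    inside = False
                    break
            A[i][j] = A[j][i] = int(inside)
    return A

def connected_components(A):
    """Label each point with the index of its connected component in the graph A."""
    n = len(A)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if A[u][v] and labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

# Toy stand-in for R(y): distance to the nearer of two fixed centers.
centers = [(0.25, 0.0), (5.25, 0.0)]
toy_radius = lambda y: min(math.dist(y, c) for c in centers)
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0), (5.5, 0.0)]
labels = connected_components(svc_adjacency(points, toy_radius, R=1.0))
```

The number of distinct labels is the number of clusters found, which is how SVC determines the cluster count without fixing it in advance.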
3 A Competitive Co-evolving Support Vector Clustering 3.1 Optimal Clustering The cluster is a collection of data objects. Clustering is the process of grouping the data clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters[12]. That is, clustering algorithms attempt to optimize the placement of like objects into homogeneous classes or clusters[2]. In the clustering, the number of clusters has been significantly considered for looking forward to good clustering results in general. But there are not completely satisfactory method for determining the number of clusters for any type of clustering[3],[9],[13]. The number of clusters has been subjectively determined by the art of researchers. However, this approach was not only an inefficient approach but also an annoying problem in clustering[12],[18],[19]. So, the objective criteria have been needed for an efficient clustering. The goal of our research is to solve the problems in clustering. A proposed method is an evolutionary SCV combined competitive co-evolving into SCV. It is a method for determining the optimal number of clusters and clustering with good accuracy. 3.2 Competitive Co-evolving SVC for Optimal Clustering The SVC depends on a kernel based clustering. So, the parameter of kernel functions plays on important role in the SVC. Though the researches for determining it automatically, most searching processes are performed by the art of researchers[6]. The regularization constant C is also plays a crucial part in the result of SVC. But, as well
this parameter is determined subjectively. Therefore, we propose CE-SVC for optimal clustering. Our algorithm is constructed by incorporating competitive co-evolution into SVC. Using CE-SVC, we are able to determine the optimal number of clusters and to cluster a given data set efficiently. Developing automatic algorithms for a problem is one of the important issues in computer science and mathematics. As in engineering, where looking at nature's solutions has always been a source of inspiration, copying natural problem solvers is a stream within these disciplines[8]. The algorithm is based first on the genetic algorithm (GA). GA provides an analytical method motivated by an analogy to biological evolution[18]. A general GA computes fitness with respect to a given, fixed environment. In contrast to the traditional GA, the co-evolving approach is an evolutionary mechanism of the natural world with competition or cooperation: the organism and the environment that includes it evolve together[18]. We apply competition, not cooperation, in our proposed model. Our competitive co-evolving approach uses host-parasite co-evolution. The host and the parasites model CE-SVC and the training data set, respectively; that is, CE-SVC and the training data set are treated as the organism and the environment that includes it, and the evolution of CE-SVC follows the evolution of the host. The initial parameters of the CE-SVC model are drawn as uniform random numbers from -1 to 1. A good clustering result has high intra-cluster similarity and low inter-cluster similarity[15].

Step I (Competitive Co-Evolving). In this paper, we introduce a criterion for evaluating the results of clustering. The criterion is composed of two parts: the variance of the points within clusters and a penalty for an excessive increase in the number of clusters. Using this criterion, we define the fitness function of CE-SVC as follows.
\[
f_{host}(M) = \frac{1}{M}\sum_{i=1}^{M} v_i + \frac{1}{V_M} \tag{5}
\]
In the above, $M$ is the number of clusters and $v_i$ is the average of the variances of the points in the $i$th cluster. $V_M$ is the variance of the $M$ clusters, defined as follows.
\[
V_M = \frac{1}{M-1}\sum_{j=1}^{M} (c_j - \bar{c})^2 \tag{6}
\]
In equation (6), $c_j$ is the center of the $j$th cluster and $\bar{c}$ is the average of the centers of the $M$ clusters. The smaller the value of $f_{host}(M)$, the better the clustering result. The fitness function of the parasites is defined as the inverse form of the fitness of the host, as follows.
\[
f_{parasites}(M) = \frac{K}{\dfrac{1}{M}\sum_{i=1}^{M} v_i + \dfrac{1}{V_M}} \tag{7}
\]
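The fitness criteria (5) to (7) can be evaluated directly from a labeled partition. The sketch below makes two assumptions that the text does not settle: $v_i$ is taken as the per-feature variance of cluster $i$ averaged over the features, and the between-cluster term enters (5) as $1/V_M$. The function names and the default `K=1.0` are illustrative only.

```python
import numpy as np

def host_fitness(X, labels):
    """Host fitness of Eq. (5): mean within-cluster variance plus 1 / V_M (smaller is better)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    M = len(ids)
    if M < 2:
        raise ValueError("V_M in Eq. (6) needs at least two clusters")
    # v_i: assumed here to be the per-feature variance averaged over the features
    v = np.array([X[labels == k].var(axis=0).mean() for k in ids])
    centers = np.array([X[labels == k].mean(axis=0) for k in ids])
    # V_M of Eq. (6): squared distances of the centers from their mean, divided by M - 1
    V_M = np.sum((centers - centers.mean(axis=0)) ** 2) / (M - 1)
    return v.mean() + 1.0 / V_M

def parasite_fitness(X, labels, K=1.0):
    """Parasite fitness of Eq. (7): the constant K divided by the host fitness."""
    return K / host_fitness(X, labels)
```

For two tight, well-separated clusters the host fitness is small and the parasite fitness correspondingly large, which is the competitive relation the two criteria are meant to express.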
In (7), $K$ is a constant with which we are able to control the level of competition between host and parasites. The evolutionary processes of CE-SVC and of the training data set are competitive. In other words, the proposed model is a competitive co-evolution of two different groups: one is the parasite evolution of the given training data set, the other is the host evolution of CE-SVC. In this step, we determine the parameters, namely the kernel parameter $q$ and the regularization constant $C$.

Step II (SVC). In this SVC step, we find the cluster boundaries by mapping the data points into a high-dimensional feature space with a Gaussian kernel and computing the minimal-radius enclosing sphere. When mapped back into input space, the sphere becomes a set of contours enveloping the input data, which determines the cluster boundary rule. The shape of the contours depends on the kernel width and the regularization parameter. For SVC, the dual problem in the Lagrange multipliers $\alpha_j$ is defined as follows.
\[
\text{Maximize } \sum_{j} k(x_j, x_j)\,\alpha_j - \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j) \quad \text{subject to } 0 \le \alpha_j \le C, \ \ \sum_{j} \alpha_j = 1, \ \ j = 1, \dots, N \tag{8}
\]
where the Gaussian kernel is given by
\[
k(x_i, x_j) = e^{-q \| x_i - x_j \|^2} \tag{9}
\]
The kernel and regularization parameters used in this step are determined in the previous step.

Step III (Clustering Rule Sets). The similarities between the centers of clusters are computed in this step. We use the following similarity measure[17].
\[
SIM(center_i, center_j) = \sum_{k=1}^{n} \sum_{l=1}^{m} \frac{k(s_{ik}, s_{jl})}{m \cdot n} \tag{10}
\]
where $center_i = \{s_{i1}, \dots, s_{in}\}$ and $center_j = \{s_{j1}, \dots, s_{jm}\}$ with $s_i, s_j \in \{\text{Support Vectors}\}$. For combining two cluster centers, the average affinity among the support vectors is used as the threshold; the choice of this threshold remains to be studied.

Step IV (Assigning Clusters). Data points are assigned using the following similarity formula[17].
\[
SIM(x, center_i) = \sum_{k=1}^{n} \frac{k(x, s_{ik})}{n} \tag{11}
\]
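The two similarity measures (10) and (11) are averages of Gaussian kernel values, which a short sketch makes concrete. It assumes that each center is given as a list of its support vectors and that a point is assigned to the center with the largest similarity; the assignment rule and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, q):
    """Gaussian kernel of Eq. (9): exp(-q * ||a - b||^2)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.exp(-q * np.sum((a - b) ** 2)))

def sim_centers(center_i, center_j, q):
    """Eq. (10): average kernel value over all pairs of support vectors."""
    total = sum(gaussian_kernel(s, t, q) for s in center_i for t in center_j)
    return total / (len(center_i) * len(center_j))

def sim_point(x, center_i, q):
    """Eq. (11): average kernel value between x and the support vectors of one center."""
    return sum(gaussian_kernel(x, s, q) for s in center_i) / len(center_i)

def assign(x, centers, q):
    """Assumed rule: assign x to the center with the largest similarity."""
    return int(np.argmax([sim_point(x, c, q) for c in centers]))
```

With `centers = [[[0, 0], [0, 1]], [[5, 5], [5, 6]]]`, a call such as `assign([0.2, 0.4], centers, q=1.0)` picks the first center, since the kernel values to the second are negligible.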
This measure is used in this step. By these steps, we obtain an efficient clustering result with the optimal number of clusters and an efficient assignment of data points. The following figure shows the process of CE-SVC.
[Figure: Step I; Step II: data space, mapping by Gaussian kernel to a high-dimensional feature space, search for the minimal enclosing sphere, back to data space, clustering (contours); Step III: clustering rule sets; Step IV: (1) determining the number of clusters, (2) clustering data points]

Fig. 1. CE-SVC Process
In the above figure, CE-SVC and the training data set are co-evolved. During the evolution for weight optimization of CE-SVC, competitive co-evolution occurs between the evolving CE-SVC model and the evolving training data set. Here, our model uses co-evolutionary computation to determine the parameters of traditional SVC for optimal clustering. The following is the pseudo-code of CE-SVC.

BEGIN
  1. INITIALIZE population (q,C) from U[-1,1]
  2. Competitive Co-Evolving,
     1) CE-SVC model by fhost(x);
     2) training data set by fparasites(x);
  REPEAT UNTIL (TERMINATION CONDITION is satisfied) DO
     1. SELECT parents;
     2. MUTATE the resulting offspring;
     3. EVALUATE new candidates;
     4. SELECT individuals for the next generation;
  Loop
  Determining Parameters
     1. Kernel Parameter(s), q
     2. Regularization Constant, C
  Clustering Rule Sets
     1. K-means Operator
  Assigning Clusters
     1. Determining the number of clusters
     2. Clustering Data Points
END

Fig. 2. Pseudo-code of CE-SVC

In the above pseudo-code, the population of kernel and regularization parameters is initialized from the uniform distribution on [-1, 1]. Evolutionary algorithms of this kind rely mainly on mutation as the variation operator. For that reason, we use the mutation operator and do not use recombination or crossover. In this way, we are able to perform optimal clustering using the CE-SVC algorithm.
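The evolutionary loop of Fig. 2 can be sketched as a mutation-only search over the pair $(q, C)$. The sketch is deliberately simplified and rests on several assumptions: the fitness is any user-supplied function to be minimized (in CE-SVC it would be the host fitness of the clustering obtained with those parameters), survivors are chosen by keeping the best of parents and offspring, the Gaussian mutation width `sigma` is a hypothetical setting, and the parasite side (the evolving training data set) is not modeled at all.

```python
import numpy as np

def evolve(fitness, pop_size=20, n_generations=50, sigma=0.1, seed=0):
    """Mutation-only evolutionary search over (q, C); lower fitness is better."""
    rng = np.random.default_rng(seed)
    # 1. INITIALIZE population (q, C) from U[-1, 1], as in the pseudo-code
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, 2))
    scores = np.array([fitness(q, C) for q, C in pop])
    for _ in range(n_generations):
        # SELECT parents uniformly at random (an assumed selection scheme)
        parents = pop[rng.integers(0, pop_size, size=pop_size)]
        # MUTATE: Gaussian perturbation, no recombination or crossover
        offspring = parents + rng.normal(0.0, sigma, size=parents.shape)
        # EVALUATE new candidates
        off_scores = np.array([fitness(q, C) for q, C in offspring])
        # SELECT individuals for the next generation: best of parents + offspring
        merged = np.vstack([pop, offspring])
        merged_scores = np.concatenate([scores, off_scores])
        keep = np.argsort(merged_scores)[:pop_size]
        pop, scores = merged[keep], merged_scores[keep]
    best = int(np.argmin(scores))
    return pop[best], float(scores[best])
```

Because parents compete with their offspring for survival, the best score never gets worse from one generation to the next, which makes the loop easy to check on a toy fitness before plugging in an SVC-based one.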
4 Experimental Results

To verify the improved performance of CE-SVC, we carried out experiments using data sets from the UCI machine learning repository[23]. One goal of CE-SVC is its use in supervised applications such as web usage mining and pattern classification, which are applications we are also interested in. For that reason, in our experiments we compare CE-SVC with popular supervised and unsupervised learning algorithms. We therefore consider supervised training data sets, namely the Arrhythmia, Iris plant, and Glass identification databases. The following table summarizes these data sets.

Table 1. Data summary

Data sets              No. of attributes   No. of classes   No. of points
Arrhythmia             279                 16               452
Iris plant             4                   3                150
Glass identification   8                   6                214
The Iris plant data set is widely used for evaluating machine learning algorithms. Because the data sets differ in their numbers of attributes, we are also able to consider the performance of CE-SVC according to attribute size. In the experiments, we compare CE-SVC with established machine learning algorithms, namely support vector machine (SVM), SVC, multilayer perceptron (MLP), K-nearest neighbor (K-NN), and hierarchical clustering[5],[10].

4.1 Experiment 1: Accuracy

In this section, we report accuracy as the number of misclassified points for each learning algorithm. For the experiment, the given data are divided into training and validation sets. We use one-third of the given data for validation and the other two-thirds for training[18].

Table 2. Number of misclassified points

               Arrhythmia              Iris plant              Glass identification
Algorithms     Training   Validation   Training   Validation   Training   Validation
CE-SVC         6          9            1          2            3          5
SVM            13         18           4          8            8          16
SVC            16         24           4          7            7          15
MLP            18         46           5          11           11         19
K-NN           21         29           4          9            17         25
Hierarchical   25         38           6          9            13         19
From the above table, we find that the number of misclassified points of CE-SVC is the smallest among the compared methods. The difference between training and validation in the number of misclassified points is also smaller for CE-SVC than for the others. CE-SVC is therefore able to mitigate, to some extent, the overfitting problem of machine learning algorithms.

4.2 Experiment 2: Training Time

The training time of CE-SVC is considerable. Therefore, we measure the training time needed to reach an error rate of 0%, that is, the time until all points are classified correctly. The experimental result is shown in the following table. K-NN and the hierarchical clustering algorithm are excluded from this experiment because the computing time measured here is that of repeated learning.

Table 3. Training time (seconds) to error rate = 0%

Algorithms                 Arrhythmia   Iris plant   Glass identification
CE-SVC with bootstrap      2,360        201          432
CE-SVC without bootstrap   6,603        685          1,021
SVM                        1,852        165          326
SVC                        1,910        169          315
MLP                        1,280        154          302
To decrease the training time of CE-SVC, we use the bootstrap re-sampling technique[7]. In this experiment, we perform 10% random sampling with replacement from the given data. The technique is simple, but it decreases the computing time of CE-SVC considerably. From the above result, we see that the training time of CE-SVC with bootstrap is at a level comparable to that of SVM and SVC. We therefore ran an experiment to measure the difference in accuracy with and without the bootstrap method. The result is in the following table.

Table 4. Number of misclassified points with and without bootstrap

Data sets              With bootstrap   Without bootstrap
Arrhythmia             11               6
Iris plant             3                1
Glass identification   7                3

This result confirms a discrepancy in the number of misclassified points between CE-SVC with and without the bootstrap method. Nevertheless, the accuracy of CE-SVC with bootstrap is still better than that of the existing methods. We thus verify the improved performance of CE-SVC.
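The 10% sampling with replacement described above amounts to drawing row indices uniformly at random. The following is a minimal sketch; the function name, the rounding of the sample size, and the `seed` parameter are illustrative choices, not details given in the paper.

```python
import numpy as np

def bootstrap_sample(X, fraction=0.10, seed=None):
    """Draw round(fraction * n) rows of X uniformly at random, with replacement."""
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    n_draw = max(1, int(round(fraction * len(X))))
    idx = rng.integers(0, len(X), size=n_draw)  # indices may repeat
    return X[idx], idx
```

For the 150-point Iris plant data set this yields 15 rows per draw, which is consistent with the large reduction in training time reported in Table 3.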
5 Conclusion

In this paper, we proposed the CE-SVC algorithm for optimal clustering. Our algorithm incorporates competitive co-evolution into SVC. With CE-SVC, the accuracy of the clustering results increased. In addition, the overfitting problem of existing machine learning algorithms was partly resolved by CE-SVC, as confirmed by the difference between the training and validation results. In future work, we expect to improve CE-SVC by applying more advanced re-sampling methods, such as the jackknife, to it. In addition, we are going to compare our method with popular clustering algorithms using more objective clustering measures, such as functions of the coefficient of determination ($R^2$) and the cubic clustering criterion (CCC).
References

1. Ben-Hur, A., Horn, D., Siegelmann, H. T., Vapnik, V.: Support Vector Clustering, Journal of Machine Learning Research, vol. 2, pp. 125-137, (2001)
2. Bezdek, J. C., Boggavarapu, S., Hall, L. O., Bensaid, A.: Genetic algorithm guided clustering, IEEE World Congress on Computational Intelligence, vol. 1, pp. 34-39, (1994)
3. Bock, H. H.: On Some Significance Tests in Cluster Analysis, Journal of Classification, vol. 2, pp. 77-108, (1985)
4. Burges, C. J.: A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, (1998)
5. Cherkassky, V., Mulier, F.: Learning from Data - Concepts, Theory, and Methods, John Wiley & Sons, Inc., (1998)
6. Chiang, J. C., Wang, J. S.: A Validity-Guided Support Vector Clustering Algorithm for Identification of Optimal Cluster Configuration, IEEE International Conference on Systems, Man and Cybernetics, pp. 3613-3618, (2004)
7. Davison, A. C.: Bootstrap methods and their application, Cambridge University Press, (1997)
8. Eiben, A. E., Smith, J. E.: Introduction to Evolutionary Computing, Springer, (2003)
9. Everitt, B. S.: Unresolved Problems in Cluster Analysis, Biometrics, vol. 35, pp. 169-181, (1979)
10. Everitt, B. S., Landau, S., Leese, M.: Cluster Analysis, Arnold, (2001)
11. Fletcher, R.: Practical Methods of Optimization, John Wiley & Sons, (1989)
12. Han, J., Kamber, M.: Data Mining Concepts and Techniques, Morgan Kaufmann, (2001)
13. Hartigan, J. A.: Statistical Theory in Clustering, Journal of Classification, vol. 2, pp. 63-76, (1985)
14. Haykin, S.: Neural Networks, Prentice Hall, (1999)
15. Jun, S. H.: Web Usage Mining Using Support Vector Machine, Lecture Notes in Computer Science, vol. 3512, pp. 349-356, (2005)
16. Lee, J., Lee, D.: An Improved Cluster Labeling Method for Support Vector Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 461-464, (2005)
17. Ling, P., Wang, Y., Lu, N., Wang, J. Y., Liang, S., Zhou, C. G.: Two-Phase Support Vector Clustering for Multi-Relational Data Mining, Proceedings of the International Conference on Cyberworlds, (2005)
18. Mitchell, T. M.: Machine Learning, McGraw-Hill, (1997)
19. Mitchell, M.: An Introduction to Genetic Algorithms, MIT Press, (1998)
20. Ribeiro, B.: On Data Based Learning using Support Vector Clustering, Proceedings of the 9th International Conference on Neural Information Processing, vol. 5, pp. 2516-2521, (2002)
21. Sun, B. Y., Huang, D. S.: Support Vector Clustering for Multiclass Classification Problems, IEEE Evolutionary Computation Congress, vol. 2, pp. 1480-1485, (2003)
22. Tax, D. M. J., Duin, R. P. W.: Support Vector Domain Description, Pattern Recognition Letters, vol. 20, pp. 1191-1199, (1999)
23. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
24. Vapnik, V. N.: Statistical Learning Theory, John Wiley & Sons, Inc., (1998)
25. Vapnik, V.: An Overview of Statistical Learning Theory, IEEE Transactions on Neural Networks, vol. 10, pp. 988-999, (1999)
26. Wang, J., Wu, X., Zhang, C.: Support vector machine based on K-means clustering for real-time business intelligence systems, International Journal of Business Intelligence and Data Mining, vol. 1, no. 1, pp. 54-64, (2005)
27. Yang, J., Estivill-Castro, V., Chalup, S. K.: Support Vector Clustering Through Proximity Graph Modeling, Proceedings of Ninth International Conference of Neural Information Processing, pp. 898-903, (2002)
Fuzzy Probability C-Regression Estimation Based on Least Squares Support Vector Machine

Zonghai Sun

College of Automation Science and Engineering, South China University of Technology, Guangzhou, 510640, China
[email protected]
Abstract. The problem of regression estimation for a large data set is viewed as a problem of multiple-model estimation. In this paper, the method of fuzzy probability c-regression based on the least squares support vector machine is proposed to classify the multiple models while estimating them. An algorithm for solving it is also provided. A numerical example illustrates that our approach can be used to fit nonlinear models to a mixed data set. The simulation results demonstrate that fuzzy probability c-regression based on the least squares support vector machine can discriminate the multiple regression models with a fuzzy partition of the data set while fitting these models perfectly, and can overcome the problem of initialization that often causes termination at local minima.
1 Introduction

For a given data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, one hopes that a single regression model can completely describe the information contained in the data set. Unfortunately, this is not realistic in applications, especially for data sets contaminated by noise. The main reasons are as follows [1], [2]:

1. A single model cannot fit a large data set. It is highly likely that multiple models are needed to fit it. In other words, a large data set may not be precisely modeled by a single structure.
2. Classical regression analysis is based on stringent model assumptions. In practice, however, a large data set does not behave as nicely as the assumptions stipulate. It is very common that inliers are outnumbered by outliers, which makes many robust methods fail.

To overcome the above difficulties, we may view this complicated data set as a mixture of many populations and suppose that each population can be described by a regression model. The regression estimation problem for the large data set thus becomes a problem of mixture modeling [1], [2]. In the literature, maximum-likelihood estimation [2] and the expectation-maximization method [3] are the most widely adopted approaches to multiple-regression modeling. However, the computational cost of maximum-likelihood estimation is large, and the set of probability density functions available for maximum-likelihood estimation is very limited. Although the expectation-maximization method reduces the computational cost, its convergence rate is low. Hathaway R. J. et al. [4] and Menard M. et al. [5] adopted a fuzzy strategy to solve this problem of mixture modeling, but this approach suffers from the problem of local minima and does not provide a method of initialization. For regression-class estimation, we provide in this paper the method of fuzzy probability c-regression (FPCR) based on least squares SVM. This method can cluster the samples while estimating multiple regressions. For convenience, all vectors will be column vectors unless transposed to a row vector by a superscript $T$. For an $m \times n$ matrix $A$, $A_i$ will denote the $i$th row of $A$ and $A_{\cdot j}$ will denote the $j$th column of $A$. A column vector of ones of arbitrary dimension will be denoted by $\gamma$.

I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 874-881, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 FPCR Based on Least Squares SVM

We assume that the given data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ derives from $c$ models. We view the mixture models as switching models, and suppose that the functional forms of the $c$ models are as follows,
\[
y = w_i \varphi_i(x) + b_i, \quad 1 \le i \le c \tag{1}
\]
where $b \in R^{c}$ denotes the threshold values and $w \in R^{c \times N}$ denotes the weight matrix. First, the data set $D$ should be clustered, and each cluster should be fitted by (1). When adopting fuzzy clustering based on SVM, the objective function is as follows,
\[
\min L = \frac{1}{2} \sum_{i=1}^{c} w_i w_i^T + C \sum_{i=1}^{c} \sum_{j=1}^{N} \left( \mu_{i,j}^{m} + t_{i,j}^{\eta} \right) e_{i,j}^{2} \tag{2}
\]
where $\mu_{i,j}$ denotes the fuzzy membership of the $j$th sample in the $i$th model; $t_{i,j}$ denotes the possibility of the $j$th sample lying in the $i$th model, which is viewed as the typicality of $(x_j, y_j)$; $e_{i,j}$ denotes the fitting error; and $\eta > 1$, $m > 1$ denote the constants that control the clustering result. Bezdek J. C. [9] considered $m = 2$ the best option in most applications. The objective function (2) is subject to
\[
\begin{cases}
\sum_{i=1}^{c} \mu_{i,j} = 1, & j = 1, 2, \dots, N \\
\sum_{j=1}^{N} t_{i,j} = 1, & i = 1, 2, \dots, c \\
w_i \varphi_i(x_j) + b_i - y_j + e_{i,j} = 0, & j = 1, 2, \dots, N, \ i = 1, 2, \dots, c
\end{cases} \tag{3}
\]
In order to solve the optimization problem (2), (3), one defines the Lagrangian
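\[
\max L_{m,\eta} = \frac{1}{2} \sum_{i=1}^{c} w_i w_i^T - \sum_{i=1}^{c} \sum_{j=1}^{N} \lambda_{i,j} \left( w_i \varphi_i(x_j) + b_i - y_j + e_{i,j} \right) - \sum_{j=1}^{N} \tau_j \left( \sum_{i=1}^{c} \mu_{i,j} - 1 \right) - \sum_{i=1}^{c} r_i \left( \sum_{j=1}^{N} t_{i,j} - 1 \right) + C \sum_{i=1}^{c} \sum_{j=1}^{N} \left( \mu_{i,j}^{m} + t_{i,j}^{\eta} \right) e_{i,j}^{2} \tag{4}
\]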
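with Lagrange multipliers $\lambda \in R^{c \times N}$ (called support values), $\tau \in R^{N}$, $r \in R^{c}$. The conditions for optimality are given by
\[
\begin{cases}
\dfrac{\partial L_{m,\eta}}{\partial w_i} = 0 \ \rightarrow \ w_i^T - \sum_{j=1}^{N} \lambda_{i,j} \varphi_i(x_j) = 0, & i = 1, 2, \dots, c \\
\dfrac{\partial L_{m,\eta}}{\partial b_i} = 0 \ \rightarrow \ \sum_{j=1}^{N} \lambda_{i,j} = 0, & i = 1, 2, \dots, c \\
\dfrac{\partial L_{m,\eta}}{\partial e_{i,j}} = 0 \ \rightarrow \ \left( \mu_{i,j}^{m} + t_{i,j}^{\eta} \right) e_{i,j} - \lambda_{i,j} = 0, & j = 1, 2, \dots, N, \ i = 1, 2, \dots, c \\
\dfrac{\partial L_{m,\eta}}{\partial \mu_{i,j}} = 0 \ \rightarrow \ \dfrac{1}{2} m \mu_{i,j}^{m-1} e_{i,j}^{2} - \tau_j = 0, & j = 1, 2, \dots, N, \ i = 1, 2, \dots, c \\
\dfrac{\partial L_{m,\eta}}{\partial t_{i,j}} = 0 \ \rightarrow \ \dfrac{1}{2} \eta\, t_{i,j}^{\eta-1} e_{i,j}^{2} - r_i = 0, & j = 1, 2, \dots, N, \ i = 1, 2, \dots, c \\
\dfrac{\partial L_{m,\eta}}{\partial \lambda_{i,j}} = 0 \ \rightarrow \ w_i \varphi_i(x_j) + b_i - y_j + e_{i,j} = 0, & j = 1, 2, \dots, N, \ i = 1, 2, \dots, c \\
\dfrac{\partial L_{m,\eta}}{\partial \tau_j} = 0 \ \rightarrow \ \sum_{i=1}^{c} \mu_{i,j} = 1, & j = 1, 2, \dots, N \\
\dfrac{\partial L_{m,\eta}}{\partial r_i} = 0 \ \rightarrow \ \sum_{j=1}^{N} t_{i,j} = 1, & i = 1, 2, \dots, c
\end{cases} \tag{5}
\]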
Combining the 4th and 7th equalities in (5), $\mu_{i,j}$ can be obtained as follows,
\[
\mu_{i,j} = \frac{\left( 1 / e_{i,j}^{2} \right)^{1/(m-1)}}{\sum_{k=1}^{c} \left( 1 / e_{k,j}^{2} \right)^{1/(m-1)}}, \quad j = 1, 2, \dots, N, \ i = 1, 2, \dots, c. \tag{6}
\]
In a similar way, we can obtain the probability $t_{i,j}$ as follows,
\[
t_{i,j} = \frac{\left( 1 / e_{i,j}^{2} \right)^{1/(\eta-1)}}{\sum_{k=1}^{N} \left( 1 / e_{i,k}^{2} \right)^{1/(\eta-1)}}, \quad j = 1, 2, \dots, N, \ i = 1, 2, \dots, c. \tag{7}
\]
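The update rules (6) and (7) differ only in the exponent and in the direction of normalization, which a vectorized sketch makes visible. It assumes the errors are stored as a $c \times N$ array with no zero entries (the singular case needs separate handling); the function name is illustrative.

```python
import numpy as np

def update_memberships(e, m=2.0, eta=2.0):
    """Return (mu, t) from a (c, N) array of nonzero fitting errors e[i, j]."""
    inv_sq = 1.0 / np.square(e)
    u = inv_sq ** (1.0 / (m - 1.0))
    mu = u / u.sum(axis=0, keepdims=True)   # Eq. (6): normalize over models i
    v = inv_sq ** (1.0 / (eta - 1.0))
    t = v / v.sum(axis=1, keepdims=True)    # Eq. (7): normalize over samples j
    return mu, t
```

Each column of `mu` sums to one (a sample distributes its membership over the $c$ models), while each row of `t` sums to one (a model distributes its typicality over the $N$ samples), matching the first two constraints in (3).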
According to the 1st equality in (5), $w_i$ is obtained as follows,
\[
w_i = \sum_{j=1}^{N} \lambda_{i,j} \varphi_i^T(x_j), \quad i = 1, 2, \dots, c. \tag{8}
\]
According to the 3rd equality in (5), one obtains
\[
e_{i,j} = \frac{\lambda_{i,j}}{\mu_{i,j}^{m} + t_{i,j}^{\eta}}, \quad j = 1, 2, \dots, N, \ i = 1, 2, \dots, c. \tag{9}
\]
The 2nd equality in (5) may be written in vector form as
\[
\lambda_i \gamma = 0, \quad i = 1, 2, \dots, c. \tag{10}
\]
Eliminating $w_i$ from the 6th equality in (5) with (8) gives
\[
\sum_{k=1}^{N} \lambda_{i,k} K_i(x_j, x_k) + b_i + e_{i,j} = y_j, \quad j = 1, 2, \dots, N, \ i = 1, 2, \dots, c \tag{11}
\]
where the kernel function $K_i(x_j, x_k) = \varphi_i(x_j)^T \varphi_i(x_k)$ may be linear, polynomial, radial basis function, or any other function that satisfies Mercer's condition. According to (11), (9), (10), elimination of $e$ gives the following linear system,
\[
\begin{bmatrix} b_i & \lambda_i \end{bmatrix}
\begin{bmatrix} 0 & \gamma^T \\ \gamma & \Omega_i(x, x) \end{bmatrix}
= \begin{bmatrix} 0 & y^T \end{bmatrix} \tag{12}
\]
where
\[
\Omega_i = \begin{bmatrix}
K_i(x_1, x_1) + \dfrac{1}{\mu_{i,1}^{m} + t_{i,1}^{\eta}} & K_i(x_1, x_2) & \cdots & K_i(x_1, x_N) \\
K_i(x_2, x_1) & K_i(x_2, x_2) + \dfrac{1}{\mu_{i,2}^{m} + t_{i,2}^{\eta}} & \cdots & K_i(x_2, x_N) \\
\vdots & \vdots & \ddots & \vdots \\
K_i(x_N, x_1) & K_i(x_N, x_2) & \cdots & K_i(x_N, x_N) + \dfrac{1}{\mu_{i,N}^{m} + t_{i,N}^{\eta}}
\end{bmatrix}, \quad i = 1, 2, \dots, c \tag{13}
\]
Solving the linear system (12) yields the solutions of the SVM.
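For one model $i$, the system (12) with $\Omega_i$ from (13) is a bordered symmetric linear system and can be solved with a standard dense solver. The sketch below is an illustration under stated assumptions: the kernel matrix `K` is precomputed and symmetric, `mu_i` and `t_i` are the current memberships of model $i$ as length-$N$ arrays, and the row-vector form of (12) is transposed into the usual column form before calling `numpy.linalg.solve`.

```python
import numpy as np

def solve_lssvm(K, y, mu_i, t_i, m=2.0, eta=2.0):
    """Solve Eq. (12) for one model: returns the threshold b_i and support values lambda_i."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    # Omega_i of Eq. (13): kernel matrix plus 1 / (mu^m + t^eta) on the diagonal
    Omega = np.asarray(K, dtype=float) + np.diag(1.0 / (mu_i ** m + t_i ** eta))
    # Bordered matrix [[0, gamma^T], [gamma, Omega_i]]; it is symmetric, so the
    # row-vector system (12) is equivalent to M @ [b_i, lambda_i]^T = [0, y]^T
    M = np.zeros((N + 1, N + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = Omega
    rhs = np.concatenate([[0.0], y])
    sol = np.linalg.solve(M, rhs)
    return sol[0], sol[1:]
```

The first row of the system reproduces the constraint (10), so the returned support values sum to zero, and the remaining rows reproduce (11) once $e_{i,j}$ is recovered from (9).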
3 FPCR Based on Least Squares SVM Algorithm

In Section 2 the FPCR based on least squares SVM was discussed. Perhaps the most popular algorithm for approximating solutions of (2) and (3) is Picard iteration through (12), (7) and (6). When the error $e_{i,j} = 0$ for one or more $i$ and $j$, singularities occur. In this case, (6) and (7) cannot be used. For sample $j$, when a singularity occurs, set $\mu_{i,j} = 0$ for each $i$ with $e_{i,j} \ne 0$, and distribute the memberships arbitrarily across the models for which $e_{i,j} = 0$, subject to the constraint $\sum_{i=1}^{c} \mu_{i,j} = 1$. Similarly, for prototype $i$, when a singularity emerges, set $t_{i,j} = 0$ for each $j$ with $e_{i,j} \ne 0$, and distribute the memberships arbitrarily across the samples for which $e_{i,j} = 0$, subject to the constraint $\sum_{j=1}^{N} t_{i,j} = 1$. The algorithm is as follows:

1) For the given data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, set the number of clusters $c$, the fuzzy control parameter $m$, the probability control parameter $\eta$, and the termination precisions $\delta$, $\varepsilon$;
2) Initialize $\mu$ and $t$, and calculate the kernel functions $K_i$, $i = 1, 2, \dots, c$. Let $p = 1$;
3) Calculate $\Omega_i$ with (13), $i = 1, 2, \dots, c$;
4) Calculate $b^{(p)}$, $\lambda^{(p)}$ with (12), where $b^{(p)}$, $\lambda^{(p)}$ denote the $p$th calculated values of $b$, $\lambda$;
5) Calculate $e$ with (11), i.e. $e_{i,j} = y_j - \sum_{k=1}^{N} \lambda_{i,k} K_i(x_j, x_k) - b_i$, $j = 1, 2, \dots, N$, $i = 1, 2, \dots, c$;
6) Define $S_j = \{ i \mid 1 \le i \le c;\ e_{i,j} = 0 \}$ and $\bar{S}_j = \{1, 2, \dots, c\} - S_j$, $j = 1, 2, \dots, N$. If $S_j = \emptyset$, update $\mu_{i,j}^{(p-1)} \rightarrow \mu_{i,j}^{(p)}$ with (6), $j = 1, 2, \dots, N$, $i = 1, 2, \dots, c$, where $\mu_{i,j}^{(p)}$ denotes the $p$th calculated value of $\mu_{i,j}$; if $S_j \ne \emptyset$, set $\mu_{i,j}^{(p)} = 0$, $\forall i \in \bar{S}_j$, and $\sum_{i \in S_j} \mu_{i,j}^{(p)} = 1$;
7) Define $T_i = \{ j \mid 1 \le j \le N;\ e_{i,j} = 0 \}$ and $\bar{T}_i = \{1, 2, \dots, N\} - T_i$, $i = 1, 2, \dots, c$. If $T_i = \emptyset$, update $t_{i,j}^{(p-1)} \rightarrow t_{i,j}^{(p)}$ with (7), $j = 1, 2, \dots, N$, $i = 1, 2, \dots, c$, where $t_{i,j}^{(p)}$ denotes the $p$th calculated value of $t_{i,j}$; if $T_i \ne \emptyset$, set $t_{i,j}^{(p)} = 0$, $\forall j \in \bar{T}_i$, and $\sum_{j \in T_i} t_{i,j}^{(p)} = 1$;
8) Check for termination in some convenient matrix norm. If $\| \mu^{(p)} - \mu^{(p-1)} \| \le \delta$, then stop; otherwise set $p = p + 1$ and return to step 3.

We state without proof some limit results for FPCR based on least squares SVM:
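\[
\lim_{m \to 1} \mu_{i,j} = \begin{cases} 1; & e_{i,j} < e_{k,j}, \ \forall k \ne i \\ 0; & \text{otherwise} \end{cases}, \quad j = 1, 2, \dots, N, \ i = 1, 2, \dots, c. \tag{14}
\]
\[
\lim_{m \to \infty} \mu_{i,j} = 1/c, \quad j = 1, 2, \dots, N, \ i = 1, 2, \dots, c. \tag{15}
\]
\[
\lim_{\eta \to 1} t_{i,j} = \begin{cases} 1; & e_{i,j} < e_{i,k}, \ \forall k \ne j \\ 0; & \text{otherwise} \end{cases}, \quad j = 1, 2, \dots, N, \ i = 1, 2, \dots, c. \tag{16}
\]
\[
\lim_{\eta \to \infty} t_{i,j} = 1/N, \quad j = 1, 2, \dots, N, \ i = 1, 2, \dots, c. \tag{17}
\]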
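The limits (14) and (15) can be checked numerically for a single sample by evaluating (6) with $m$ close to 1 and with very large $m$. The sketch below is only an illustration with made-up error values; it evaluates (6) in log space, a numerical convenience that is not part of the paper, so that the large exponents near $m = 1$ do not overflow.

```python
import numpy as np

def mu_column(e_col, m):
    """Memberships of one sample over the c models, Eq. (6), computed in log space."""
    e_col = np.asarray(e_col, dtype=float)
    # (1 / e^2)^(1 / (m - 1)) = exp(-log(e^2) / (m - 1))
    logits = -np.log(np.square(e_col)) / (m - 1.0)
    logits = logits - logits.max()   # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

errors = [0.5, 1.0, 2.0]             # hypothetical errors of one sample under c = 3 models
near_one = mu_column(errors, m=1.001)
very_large = mu_column(errors, m=1e6)
```

Here `near_one` is numerically a hard assignment to the model with the smallest error, and `very_large` is numerically uniform at $1/c$; the same check applies to $t_{i,j}$ with $\eta$ in place of $m$ and normalization over the samples.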
When the number $N$ of data points is large, the numerical values of the probabilities may be very small. Thus, after the FPCR based on least squares SVM algorithm terminates, the probability values may need to be scaled up. Scaling is not a necessary step in FPCR based on least squares SVM, but it may be helpful for interpreting the probability values. In the next section, examples are used to demonstrate the effect of FPCR based on least squares SVM without scaling.
4 Numerical Simulations

We give a numerical example to illustrate FPCR based on least squares SVM. The example uses FPCR based on least squares SVM to fit $c = 2$ simple quadratic regression models. The two nonlinear models are as follows,
\[
\begin{cases}
y = \beta_{11} x^{2} + \beta_{12} x + \beta_{13} + \varsigma \\
y = \beta_{21} x^{2} + \beta_{22} x + \beta_{23} + \varsigma
\end{cases} \tag{18}
\]
where $\varsigma \sim N(0, \sigma^{2})$. Three data sets, named A, B and C and specified in Table 1, were generated for the tests. Each of the three data sets was generated by computing $y$ from (18) at $N/2$ fixed, equally spaced $x$ values across the interval given in column 3 of Table 1, with $\sigma = 2$ for data sets A and B, and $\sigma = 16$ for data set C. This results in sets of $N$ points (which we treat as unlabelled), half of which were generated from each of the two quadratics specified by the parameters in columns 4 and 5 of Table 1. Here $c = 2$, $m = 2$, $\eta = 2$, $\delta = 10^{-6}$, $\varepsilon = 10^{-8}$, and the kernel functions are $K_1(x, x_j) = \exp\left( -\frac{(x - x_j)^{2}}{\sigma^{2}} \right)$ and $K_2(x, x_j) = (x x_j + \alpha)^{\beta}$ with $\alpha = 2$, $\beta = 2$. The three data sets are plotted in Fig. 1, 2 and 3, respectively. The iteration of FPCR based on least squares SVM was stopped as soon as the 2-norm of the change in the fuzzy membership matrix was less than or equal to $\delta = 10^{-6}$ and the 2-norm of the change in the probability matrix was less than or equal to $\varepsilon = 10^{-8}$. Fig. 1, 2 and 3 show the scatter diagrams (data points are shown as dots) of the three data sets and the terminal regression models of FPCR based on least squares SVM. The two models are clearly discriminated in Fig. 1, 2 and 3, where their points are shown as squares and stars, respectively. The simulation results show that FPCR based on least squares SVM can classify the two nonlinear functions while fitting these models perfectly. There is also no problem of trapping in local minima caused by the initialization of $\mu$, as reported for the fuzzy clustering method [4], [5]: in none of the 25 trials did termination occur at a point other than the global minimum. Here $\mu$ and $t$ were initialized randomly. This shows that the method provided in this paper overcomes the problem of initialization.
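The data generation from (18) and the two kernels $K_1$, $K_2$ can be sketched as follows. The function names, the seed handling, and the coefficient values in the usage note are hypothetical, since the actual parameters are those of Table 1; the sketch only mirrors the described procedure of $N/2$ equally spaced $x$ values per quadratic with Gaussian noise.

```python
import numpy as np

def make_switching_data(beta1, beta2, x_range, N, sigma, seed=0):
    """Generate N unlabelled points, N/2 from each quadratic of Eq. (18)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(x_range[0], x_range[1], N // 2)   # fixed, equally spaced x values
    parts = []
    for b in (beta1, beta2):
        noise = rng.normal(0.0, sigma, size=x.shape)  # additive N(0, sigma^2) noise
        y = b[0] * x ** 2 + b[1] * x + b[2] + noise
        parts.append(np.column_stack([x, y]))
    return np.vstack(parts)

def rbf_kernel(x, xj, sigma):
    """K_1(x, x_j) = exp(-(x - x_j)^2 / sigma^2) for scalar inputs, as a matrix."""
    return np.exp(-np.subtract.outer(x, xj) ** 2 / sigma ** 2)

def poly_kernel(x, xj, alpha=2.0, beta=2.0):
    """K_2(x, x_j) = (x * x_j + alpha)^beta for scalar inputs, as a matrix."""
    return (np.multiply.outer(x, xj) + alpha) ** beta
```

A call such as `make_switching_data((1, 0, 0), (-1, 0, 10), (-3, 3), N=40, sigma=2.0)` produces a 40-by-2 array of mixed points, whose first column can be passed to either kernel function to build the matrices $K_i$ used in (13).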
Table 1. Description of data generated from the quadratic models $y = \beta_{i1} x^{2} + \beta_{i2} x + \beta_{i3} + \varsigma$
Fig. 1. Simulation result of FPCR based on least squares SVM for data set A
Fig. 2. Simulation result of FPCR based on least squares SVM for data set B
Fig. 3. Simulation result of FPCR based on least squares SVM for data set C
5 Conclusions

In this paper, regression estimation for large data sets is discussed. The problem of regression estimation is viewed as a problem of multiple-model estimation. In order to classify the multiple models while estimating them, the method of fuzzy probability clustering based on least squares SVM is proposed, and an algorithm for solving it is provided. The method can discriminate the multiple regression models while fitting these models perfectly. There is also no problem of trapping in local minima caused by poor initialization of the fuzzy memberships.
Acknowledgment

This work is sponsored by the National Science Foundation of China under grant No.60574019 and the Primer Foundation of South China University of Technology under grant No.B08-D6060020.
References

1. Zonghai Sun, Lixin Gao, Youxian Sun: Using Support Vector Machines for Mining Regression Classes in Large Data Sets, Proceedings of IEEE TENCON'02, Vol. A, (2002) 89-92
2. Yee Leung, Ma J. H., Zhang W. X.: A new method for mining regression classes in large data sets, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23(1), (2001) 5-21
3. De Veaux R. D.: Mixtures of linear regressions, Computational Statistics and Data Analysis, Vol. 8, (1989) 227-245
4. Hathaway R. J., Bezdek J. C.: Switching regression models and fuzzy clustering, IEEE Transactions on Fuzzy Systems, Vol. 1(3), (1993) 195-204
5. Menard M., Dardignac P. A., Courboulay V.: Switching regression models using ambiguity and distance rejects: application to ionogram analysis, Proceedings of the 15th International Conference on Pattern Recognition, Vol. 2, (2000) 688-691
6. Vapnik V.: Statistical learning theory, John Wiley and Sons, New York, (1998)
7. Redner R. A., Walker H. F.: Mixture densities, maximum likelihood and the EM algorithm, SIAM Rev., Vol. 26(2), (1984) 195-239
8. Caudill S. B., Acharya R. N.: Maximum-Likelihood estimation of a mixture of normal regressions: starting values and singularities, Comm. Statistics-Simulation, Vol. 27(3), (1998) 667-674
9. Bezdek J. C.: Pattern recognition with fuzzy objective function algorithms, Plenum, New York, (1981)
10. Zonghai Sun: Study on support vector machine and its application in control, doctoral dissertation, Zhejiang University, (2003) 43-47
Filtering E-Mail Based on Fuzzy Support Vector Machines and Aggregation Operator

Jilin Yang, Hong Peng, and Zheng Pei

School of Mathematics & Computer Science, Xihua University, Chengdu, Sichuan, 610039, China
[email protected]
Abstract. How to filter emails is a problem for Internet users. The support vector machine (SVM) is an effective method for filtering emails. As is well known, there is uncertainty in how Internet users decide whether an email is legitimate. To formalize this uncertainty, the legitimate email is treated in this paper as a fuzzy concept on a set of email samples; its membership function is obtained by aggregating the opinions of Internet users, and the aggregation operator is the ordered weighted averaging (OWA) operator. Because the email training samples carry membership degrees of legitimacy, the fuzzy support vector machine (FSVM) is adopted to classify emails, and the penalty factor of the FSVM is decided by content-specific misclassification costs. The advantages of our method are: 1) the uncertainty of the legitimate email, i.e., the membership degree, is considered in classifying emails, and a method to obtain the membership degree is given; 2) content-specific misclassification costs are used to decide the penalty factor of the FSVM. Simulation experiments show the effectiveness and human consistency of our method.
1 Introduction

With the rapid development of the Internet and its applications, spam has become a serious problem for its users. It harms the legal rights of email customers, threatens Internet information security, and causes great losses to the national economy every year. To solve this problem, many researchers have proposed various email filtering methods [1], such as machine learning methods [2]-[5]. The support vector machine (SVM) [6] is a relatively new machine learning method based on statistical learning theory. It follows the structural risk minimization principle, which aims to improve the generalization ability of the learning machine: if the error is small on a limited set of training samples, the error should remain small on independent test samples. Training an SVM is a convex optimization problem, so a local optimal solution is guaranteed to be the global optimal solution. For most users, spam is a nuisance, while the misclassification of a legitimate email is much more serious. The optimal filter is therefore defined as one that rejects a maximum amount of spam while passing all legitimate emails. Kolcz A. et al. proposed an email misclassification cost that takes the specific content into consideration [4]. More specifically, the content of emails should be clearly distinguished and given separate penalty factor values; in this way, the misclassification of legitimate emails can be prevented to the fullest extent.

We find that current machine learning methods classify emails as legitimate or spam with certainty. In practice, however, different server-side users hold different opinions on whether an email is legitimate, and to what extent. Research on email filtering should therefore deal with these uncertainties, and it is necessary to obtain a reasonable comprehensive evaluation value (a fuzzy membership degree) when classifying emails with machine learning methods. For this reason the OWA operator is applied in this paper. Given the problems mentioned above and the effectiveness of content-specific misclassification costs in filtering, this paper presents an integrated filtering method that combines the OWA operator, content-specific misclassification costs and SVM. Firstly, we introduce the content-specific misclassification costs of [4]. Secondly, a reasonable comprehensive evaluation value is obtained as each email's membership degree by using the OWA operator. To obtain a more reasonable comprehensive evaluation value, each category of emails is aggregated separately. Finally, since email filtering should deal with the uncertainties, we adopt the FSVM discussed in [6] as the classifier.

I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 882-891, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 FSVM and OWA Operator
Suppose a set of training points [7],[8] is

$$D = \{(x_1, y_1, s_1), \ldots, (x_n, y_n, s_n)\}. \tag{1}$$

Each training point $x_i \in R^N$ is given a label $y_i \in \{-1, 1\}$ and a fuzzy membership $s_i \in [\sigma, 1]$, $i = 1, \ldots, n$, with sufficiently small $\sigma > 0$. The fuzzy membership $s_i$ is the attitude of the corresponding point $x_i$ toward one class. Let $z = \phi(x)$ denote the corresponding feature space vector, with a mapping $\phi$ from $R^N$ to a feature space $Z$. The optimal hyperplane problem is then regarded as the solution to

$$\begin{aligned} \text{Minimize} \quad & \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} s_i \xi_i \\ \text{s.t.} \quad & y_i (w \cdot z_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \end{aligned} \tag{2}$$

where $C$ is a parameter that balances the amount of misclassification against the size of the margin; it is therefore also called the penalty factor. Problem (2) can be transformed into

$$\begin{aligned} \text{Maximize} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\ \text{s.t.} \quad & \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le s_i C, \quad i = 1, \ldots, n. \end{aligned} \tag{3}$$
The decision function is given by

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\Big). \tag{4}$$
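As a rough illustration of Eqs. (3) and (4), the following sketch checks the dual constraints for a given set of multipliers and evaluates the decision function. It is not a trainer: the Gaussian kernel, the value of `gamma`, the function names, and the toy multipliers are all assumptions made for the example.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    # Gaussian kernel; the kernel choice and gamma are illustrative assumptions.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def satisfies_dual_constraints(alphas, labels, memberships, C, tol=1e-9):
    # Constraints of Eq. (3): sum_i y_i alpha_i = 0 and 0 <= alpha_i <= s_i * C.
    balanced = abs(sum(a * y for a, y in zip(alphas, labels))) <= tol
    boxed = all(0.0 <= a <= s * C for a, s in zip(alphas, memberships))
    return balanced and boxed

def decide(x, points, labels, alphas, b, kernel=rbf_kernel):
    # Eq. (4): sign of the kernel expansion; a value of exactly 0 is mapped to +1 here.
    total = sum(a * y * kernel(p, x) for p, y, a in zip(points, labels, alphas))
    return 1 if total + b >= 0 else -1

# Toy, hand-picked values (not the output of a trained model).
points = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
memberships = [1.0, 0.9]
alphas = [0.8, 0.8]
```

Because the upper bound on each multiplier is $s_i C$, a point with a low membership can contribute less to the decision function than a point with a membership close to 1.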
The point $x_i$ with corresponding $\alpha_i > 0$ is called a support vector. There are two types of support vectors: one with $0 < \alpha_i < s_i C$ lies on the margin of the hyperplane, and one with $\alpha_i = s_i C$ is misclassified. An important difference between SVM and FSVM is that points with the same value of $\alpha_i$ may indicate different types of support vectors in FSVM, owing to the factor $s_i$.

The ordered weighted averaging (OWA) operator was proposed by Yager [9]. It allows the degree of "anding" and "oring" implicit in the aggregation to be adjusted easily, and it can be used to aggregate many groups of fuzzy and uncertain information [10]. A mapping $F: R^m \to R$ is called an OWA operator of dimension $m$ if it has an associated weighting vector $w = (w_1, w_2, \ldots, w_m)$ with $w_i \in [0, 1]$, $1 \le i \le m$, and $\sum_{i=1}^{m} w_i = w_1 + w_2 + \ldots + w_m = 1$, such that

$$F(a_1, a_2, \ldots, a_m) = \sum_{i=1}^{m} w_i b_i,$$

where $b_i$ is the $i$th largest element of the collection $(a_1, a_2, \ldots, a_m)$. The weighting vector $w = (w_1, w_2, \ldots, w_m)$ can be obtained from the following formula:

$$w_i = Q\Big(\frac{i}{m}\Big) - Q\Big(\frac{i-1}{m}\Big), \quad i = 1, 2, \ldots, m, \tag{5}$$

where $Q$ is a fuzzy quantifier, which can be represented as

$$Q(r) = \begin{cases} 0, & \text{if } r < \alpha, \\ \dfrac{r - \alpha}{\beta - \alpha}, & \text{if } \alpha \le r \le \beta, \\ 1, & \text{if } r > \beta, \end{cases} \tag{6}$$

with $\alpha, \beta, r \in [0, 1]$. Such quantifiers express linguistic terms such as "most", "at least half", and "as many as possible". In this paper, the parameters $(\alpha, \beta)$ are $(0, 0.5)$, $(0.3, 0.8)$, and $(0.5, 1)$ for different categories of email, respectively.
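Eqs. (5) and (6) can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes $\alpha < \beta$, which holds for the three parameter pairs used in the paper.

```python
def quantifier(r, alpha, beta):
    # Fuzzy quantifier Q(r) of Eq. (6); assumes alpha < beta.
    if r < alpha:
        return 0.0
    if r > beta:
        return 1.0
    return (r - alpha) / (beta - alpha)

def owa_weights(m, alpha, beta):
    # Weights of Eq. (5): w_i = Q(i/m) - Q((i-1)/m), i = 1..m.
    return [quantifier(i / m, alpha, beta) - quantifier((i - 1) / m, alpha, beta)
            for i in range(1, m + 1)]

def owa(values, alpha, beta):
    # OWA aggregation: weights are applied to the values sorted from largest to smallest.
    ordered = sorted(values, reverse=True)
    weights = owa_weights(len(values), alpha, beta)
    return sum(w * b for w, b in zip(weights, ordered))
```

With $m = 10$ and $(\alpha, \beta) = (0, 0.5)$, all the weight falls on the five largest values, whereas $(0.5, 1)$ puts it on the five smallest, so the same set of opinions can be aggregated optimistically or pessimistically.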
3 The Main Idea of Content-Specific Misclassification Costs
To prevent misclassification of legitimate email, Kolcz et al. [4] proposed a content-specific cost model for spam filtering, in which the cost of misclassifying a legitimate message is content-specific, while the cost of misclassifying spam is assumed to be uniform. Let $D$ be the set of email samples. Let $C_{s \to l}(x)$ denote the cost incurred by the system when a spam message $x$ is classified as legitimate and, conversely, let $C_{l \to s}(x)$ denote the cost incurred by the system when a legitimate message $x$ is classified as spam. We assume that there is no cost associated with a correct classification of a message, i.e., $C_{l \to l}(x) = C_{s \to s}(x) = 0$. Then the expected costs of classifying $x$ as legitimate or as spam are given by

$$C_{legit}(x) = C_{s \to l}(x) P(spam \mid x), \qquad C_{spam}(x) = C_{l \to s}(x) P(legit \mid x). \tag{7}$$
In [4], the cost of admitting a spam message is considered fairly nominal (at least within a certain window) and is set to $C_{s \to l}(x) = 1$. On the other hand, the consequences of losing an important message may be devastating, so rejection of a legitimate email must remain a truly rare event. Most research in spam filtering assigns a uniform cost value to $C_{l \to s}(x)$. It is clear, however, that the true costs are likely to depend on the type of message rejected, and the authors of [4] concentrate on content-specific misclassification costs. In general, for any legitimate message $x$, its misclassification cost is given by $C_{l \to s}(x)$. However, since exact measurement of such a cost is very difficult, considering truly message-specific costs is largely impractical. It is possible, though, to estimate the costs associated with misclassifying legitimate messages belonging to several broad content categories (see [4]):

1. Sensitive personal messages. Set $C_{l \to s}(x \mid x \in personal) = 1000$.
2. Business related messages. Set $C_{l \to s}(x \mid x \in business) = 500$.
3. E-commerce related messages. Set $C_{l \to s}(x \mid x \in e\text{-}commerce) = 100$.
4. Special interest mailing lists/discussion forums. Set $C_{l \to s}(x \mid x \in lists) = 50$.
5. Promotional offers. Set $C_{l \to s}(x \mid x \in promotional) = 25$.
The categories $C = \{personal, \ldots, promotional\}$ are mutually exclusive and completely cover the legitimate email class. We believe that they capture the basic usage of email today and the importance of weighting the costs of rejecting legitimate email according to content. The use of broad content categories should facilitate the measurement of costs in a real-world environment. The cost of misclassifying a message as spam (see Eq. (7)) can then be restated as

$$C_{spam}(x) = P(legit \mid x) \sum_{cat \in C} C^{cat}_{l \to s} P(cat \mid legit, x). \tag{8}$$
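A small numerical sketch of Eqs. (7) and (8) follows. The category costs are those listed above; the function names and the probabilities passed in are hypothetical, since in practice they would come from a probabilistic classifier.

```python
# Category costs C^cat_{l->s} as listed in the text.
CATEGORY_COST = {
    "personal": 1000,
    "business": 500,
    "e-commerce": 100,
    "lists": 50,
    "promotional": 25,
}

def cost_of_rejecting(p_legit, p_cat_given_legit):
    # Eq. (8): C_spam(x) = P(legit|x) * sum_cat C^cat_{l->s} * P(cat|legit, x).
    expected = sum(CATEGORY_COST[cat] * p for cat, p in p_cat_given_legit.items())
    return p_legit * expected

def cost_of_admitting(p_legit, c_spam_to_legit=1.0):
    # Eq. (7): C_legit(x) = C_{s->l}(x) * P(spam|x), with C_{s->l}(x) = 1 as in [4].
    return c_spam_to_legit * (1.0 - p_legit)
```

Even a message that is only 5% likely to be legitimate can carry a large expected rejection cost if it is probably a personal message, which is the asymmetry the cost model is meant to capture.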
In [4], the content-specific misclassification costs are used in a standard SVM. In this paper, because email filtering is treated as a problem involving uncertainty, we adopt the FSVM as the classifier and apply the misclassification costs to it. The free parameter $C$ in FSVM controls the tradeoff between maximizing the margin and minimizing the amount of misclassification. A larger $C$ makes FSVM training produce fewer misclassifications and a narrower margin; a smaller $C$ makes the FSVM ignore more training points and obtain a wider margin. Here, the above cost model is introduced into FSVM. Recall that in the proposed cost model $C_{s \to l}(x) = 1$, while $C_{l \to s}(x)$ depends on the type of legitimate message. Consequently, with $C_{s \to l}(x)$ as a reference, each type of $C_{l \to s}(x)$ can be expressed as a fixed multiple thereof. The misclassification penalty term in (2) is then

$$\sum_{i=1}^{n} C_i s_i \xi_i = C^{-} \sum_{cat \in C} C^{cat}_{l \to s} \sum_{i:\, x_i \in legit \wedge cat} \xi_i s_i. \tag{9}$$
The value of $C^{-}$ in (9) is not a cost but, rather, a regularization parameter that has to be chosen so that the expected performance of the classifier is maximized.
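One way to read Eq. (9) is that every legitimate training sample receives its own slack coefficient $C^{-} C^{cat}_{l \to s} s_i$. The sketch below computes these coefficients. The multiplier of 1 for spam samples is an assumption based on $C_{s \to l}(x) = 1$ being the reference cost, and the value of `c_minus` used in any call is purely illustrative.

```python
CATEGORY_COST = {
    "personal": 1000,
    "business": 500,
    "e-commerce": 100,
    "lists": 50,
    "promotional": 25,
}

def penalty_coefficients(samples, c_minus, category_cost=CATEGORY_COST):
    # samples: list of (category, membership) pairs.
    # Categories missing from category_cost (e.g. "spam") get multiplier 1 (assumption).
    coefficients = []
    for category, membership in samples:
        multiplier = category_cost.get(category, 1.0)
        coefficients.append(c_minus * multiplier * membership)
    return coefficients
```

Each coefficient would then act as the upper bound on the corresponding dual variable, in the same way that $s_i C$ does in Eq. (3).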
4 Filtering Method of Email Based on FSVM and OWA Operator
In email filtering, when a traditional learning machine trains the classifier, each email sample is assigned a class label, i.e., legitimate or spam. However, different users hold different opinions on whether an email is legitimate or spam, depending on their understanding and interests. In real-world practice, therefore, there is uncertainty in deciding whether an email is legitimate or spam. In this paper, this uncertainty is expressed by a degree, i.e., $(x_j, s_j)$, in which $x_j$ is an email and $s_j$ is its degree of legitimacy. Intuitively, $s_j$ ($0 < s_j \le 1$) can be understood as the fuzzy membership degree to which $x_j$ belongs to the legitimate class; under this view, "legitimate" is a fuzzy concept on the email samples, and spam filtering becomes a problem involving uncertainty. The remaining question is how to determine $s_j$. In this section, the OWA operator is used to obtain $s_j$ and the FSVM is used to classify.

4.1 Obtain Fuzzy Membership Degree by OWA Operator
Let $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a finite set of email samples, each classified as spam or legitimate. Let $U = \{u_1, u_2, \ldots, u_m\}$ ($m \ge 2$) be the set of users, where $u_k$ denotes the $k$th user. Because every user has his or her own ideas and interests, different users give different evaluations. Assume that the evaluation vector of $u_k$ is $S^k = (s_1^k, s_2^k, \ldots, s_n^k)^T$ ($k = 1, \ldots, m$), in which $s_j^k \in (0, 1]$ denotes the evaluation of the $j$th email sample, $j = 1, \ldots, n$. Then $B^j = (s_j^1, s_j^2, \ldots, s_j^m)^T$ collects all users' evaluation values for the $j$th email sample $x_j$. To obtain $s_j$ of $x_j$, the values $s_j^1, s_j^2, \ldots, s_j^m$ are aggregated by the OWA operator, i.e.,

$$s_j = F(s_j^1, s_j^2, \ldots, s_j^m) = H^j (B^j)^T = \sum_{k=1}^{m} w_{jk}\, s_j^{\sigma(k)}, \quad j = 1, 2, \ldots, n, \tag{10}$$

where $(s_j^{\sigma(1)}, s_j^{\sigma(2)}, \ldots, s_j^{\sigma(m)})^T$ is the ordered vector of $B^j$: each element $s_j^{\sigma(k)} \in (0, 1]$ and $s_j^{\sigma(l)} \le s_j^{\sigma(k)}$ for all $l \ge k$. In other words, $s_j^{\sigma(k)}$ is the $k$th element when the values are arranged from largest to smallest.

In Eq. (10), $w_{jk}$ is called the weighting and is determined as follows. In email filtering, misclassifying a legitimate email is more serious than misclassifying spam, so the optimal filter is defined as one that rejects a maximum amount of spam while passing all legitimate emails. Under this condition, an appropriate weighting is needed to prevent misclassification of legitimate emails. In general, $w_{jk}$ can be determined by the fuzzy quantifiers proposed in [9]. Here, the selection of fuzzy quantifiers relies on the misclassification cost idea proposed in [4].
The more important a legitimate email is, the higher its fuzzy membership degree in the legitimate class should be. In [4], legitimate emails are divided into five sub-categories, which are assumed to be mutually exclusive and to completely cover the legitimate email class. In this paper, according to the properties of the sub-categories, the five sub-categories are grouped into two classes, and a different fuzzy quantifier is used to determine the membership degree of the emails in each class.

(1) The first three sub-categories (sensitive personal messages, business related messages, and e-commerce related messages) are legitimate for almost all users, which means that their membership degrees should be as large as possible. From a logical point of view, an S-norm is used to aggregate the users' evaluation information. Hence, the fuzzy quantifier of the OWA operator is selected so that its "orness" tends to 1. Here, the "at least half" fuzzy quantifier, i.e., $(\alpha, \beta) = (0, 0.5)$, is selected to aggregate the users' evaluation information.

(2) For special interest mailing lists/discussion forums and promotional offers, server-side users may treat them very differently. For example, one user may have a special interest in horror content, while other users would not like to see such information. When aggregating the users' evaluation information, the opinions of most users need to be considered. Correspondingly, the fuzzy quantifier of the OWA operator is selected so that its "andness" tends to 1. Here, the "most" fuzzy quantifier, i.e., $(\alpha, \beta) = (0.3, 0.8)$, is selected to aggregate the users' evaluation information.

(3) For other emails that are not included in the above five sub-categories, the "as many as possible" fuzzy quantifier, i.e., $(\alpha, \beta) = (0.5, 1)$, is used to aggregate the users' evaluation information.

Therefore, for each email, the OWA operator with the quantifier of the sub-category it belongs to is used to aggregate its membership degree.

4.2 FSVM as the Email Classifier
Based on the OWA operator, the set of email samples is expressed as

$$D_e = \{(x_1, s_1), (x_2, s_2), \ldots, (x_n, s_n)\}, \tag{11}$$

where $x_j$ ($j = 1, \ldots, n$) is an email and $s_j$ is the membership degree of $x_j$ determined by Eq. (10). Compared with the training samples of FSVM (Eq. (1)), a label $y_j$ still needs to be assigned to each $(x_j, s_j)$ before $D_e$ can be classified by FSVM. In this paper, $y_j$ is obtained as

$$y_j = \begin{cases} 1, & \text{if } s_j \ge \lambda, \\ -1, & \text{if } s_j < \lambda, \end{cases} \tag{12}$$

where $\lambda$ is a parameter determined by experts, by learning from experiments, and so on. If $y_j = -1$, then intuitively $x_j$ is spam; however, $s_j$ expresses the degree to which $x_j$ belongs to the legitimate emails, so the meanings of $y_j = -1$ and $s_j$ contradict each other. The membership degree is therefore modified as follows:

$$s_j \leftarrow \begin{cases} s_j, & \text{if } y_j = 1, \\ 1 - s_j, & \text{if } y_j = -1. \end{cases} \tag{13}$$

From the fuzzy logic point of view, spam is understood as the negation of legitimate email; hence, if $y_j = -1$, the degree to which $x_j$ belongs to spam is computed as $1 - s_j$. Based on Eqs. (11)-(13), the set of email samples is modified to

$$D_e = \{(x_1, y_1, s_1), (x_2, y_2, s_2), \ldots, (x_n, y_n, s_n)\}, \tag{14}$$

and in this paper $D_e$ is used as the training set of FSVM. The optimal hyperplane of the email samples is the solution of

$$\begin{aligned} \text{Minimize} \quad & \frac{1}{2} w \cdot w + C^{-} \sum_{cat \in C} C^{cat}_{l \to s} \sum_{j:\, x_j \in legit \wedge cat} \xi_j s_j \\ \text{s.t.} \quad & y_j (w \cdot z_j + b) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad j = 1, \ldots, n, \end{aligned} \tag{15}$$

in which $z_j = \phi(x_j)$ denotes the corresponding feature space vector, with a mapping $\phi$ from $R^N$ to a feature space $Z$, and the penalty term $C^{-} \sum_{cat \in C} C^{cat}_{l \to s} \sum_{j:\, x_j \in legit \wedge cat} \xi_j s_j$ is given by Eq. (9). By using the Lagrangian function, the decision function is obtained as

$$f(x) = \mathrm{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j K(x_j, x) + b\Big). \tag{16}$$

The point $x_j$ with corresponding $\alpha_j > 0$ is called a support vector, and $0 < \alpha_j \le s_j C^{-} \sum_{cat \in C} C^{cat}_{l \to s} \sum_{j:\, x_j \in legit \wedge cat} \xi_j s_j$.
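The labeling rule of Eq. (12) and the membership adjustment of Eq. (13) can be sketched as follows. The function names and the default threshold `lam=0.5` are assumptions for illustration; the paper leaves $\lambda$ to experts or to experiment.

```python
def label_and_membership(s, lam):
    # Eq. (12): label +1 (legitimate) if s >= lambda, otherwise -1 (spam).
    # Eq. (13): for spam, the membership becomes 1 - s (negation of "legitimate").
    if s >= lam:
        return 1, s
    return -1, 1.0 - s

def build_training_set(emails, aggregated, lam=0.5):
    # Turns pairs (x_j, s_j) as in Eq. (11) into triples (x_j, y_j, s_j) as in Eq. (14).
    triples = []
    for x, s in zip(emails, aggregated):
        y, membership = label_and_membership(s, lam)
        triples.append((x, y, membership))
    return triples
```

After this step, every membership expresses how strongly a sample belongs to its own assigned class, which matches the way FSVM interprets $s_i$ in Eq. (2).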
5 Experiments and Results
The email samples were gathered from a commercial email server. The legitimate emails include personal messages, business related messages, e-commerce related messages, special interest messages, promotional messages, and so on. Overall, the data set contained 743 spam messages and 859 legitimate messages. Table 1 details the distribution of message categories in the legitimate portion of the data. Suppose all server-side users evaluate each email training sample, giving the evaluation information $s_j^k$. Then, with our proposed method, we obtain the comprehensive evaluation value of the $j$th sample, i.e., $s_j$. In the experiment, we assume ten users on the email server, i.e., $m = 10$.

In this paper, we adopt legitimate precision (LP), legitimate recall (LR), and F1 to evaluate the method's performance. Legitimate precision denotes the percentage of messages in the test data classified as legitimate that truly are legitimate:

$$LP = \frac{N_{l \to l}}{N_{l \to l} + N_{s \to l}}. \tag{17}$$
Table 1. Distribution of the legitimate portion of the dataset

Category      Message Count   P(Category | Legit)
Private       93              0.108
Business      116             0.135
E-commerce    73              0.087
Special       431             0.501
Promo         146             0.169
Legitimate recall denotes the proportion of actual legitimate messages in the test set that are categorized as legitimate:

$$LR = \frac{N_{l \to l}}{N_{l \to l} + N_{l \to s}}. \tag{18}$$

Moreover, we use F1 to evaluate the overall performance:

$$F1 = \frac{2\, LP \cdot LR}{LP + LR}. \tag{19}$$
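Eqs. (17)-(19) translate directly into code. The function and argument names below are hypothetical; `n_ll`, `n_sl`, and `n_ls` stand for $N_{l \to l}$, $N_{s \to l}$, and $N_{l \to s}$.

```python
def legitimate_precision(n_ll, n_sl):
    # Eq. (17): share of messages classified as legitimate that truly are legitimate.
    return n_ll / (n_ll + n_sl)

def legitimate_recall(n_ll, n_ls):
    # Eq. (18): share of truly legitimate messages that are classified as legitimate.
    return n_ll / (n_ll + n_ls)

def f1_score(lp, lr):
    # Eq. (19): harmonic mean of legitimate precision and legitimate recall.
    return 2 * lp * lr / (lp + lr)
```

Since Eq. (19) is scale-invariant, it can be applied to percentages as well as to fractions: `f1_score(91.21, 96.88)` rounds to 93.96.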
In the experiment, the 1502 emails are divided into 6 groups of about 250 emails each. Five groups are used for training and the remaining one for testing; the split is performed at random. This is repeated 6 times, once per held-out group, and the precision and recall are averaged over the 6 runs. Table 2 shows the experimental results.

Table 2. The experiment results

Order   LP       LR
1       90.38%   96.64%
2       92.89%   96.78%
3       93.28%   98.17%
4       89.76%   95.56%
5       91.02%   96.91%
6       89.92%   97.23%
Avg     91.21%   96.88%
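The rotation scheme described above can be sketched with the standard library alone. The function names, the fixed seed, and the way the groups are formed (striding over a shuffled index list) are assumptions; the paper only states that the split is random.

```python
import random

def rotating_splits(n_items, k=6, seed=0):
    # Shuffle the indices once, cut them into k groups, and hold out each group in turn.
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test

def mean(values):
    # Simple average, as used for the "Avg" row of Table 2.
    return sum(values) / len(values)
```

Averaging the six LP values in Table 2 with `mean` gives 91.21 after rounding, and the six LR values give 96.88, in agreement with the "Avg" row.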
We ran the same set of tests for FSVM, SVM, and Naïve Bayes; Table 3 compares their performance. Because the OWA operator is used when training the classifier, our method appreciably increases the training time. However, the comparison shows that the legitimate precision of our method is slightly higher than that of SVM and Naïve Bayes, and its legitimate recall is notably better than that of the other methods. Our method also achieves the highest F1 value. Consequently, the performance of FSVM is superior to that of SVM and Naïve Bayes.
Table 3. The contrasted experiments

Method        LP       LR       F1
Naïve Bayes   85.66%   92.03%   88.73%
SVM           90.12%   93.34%   91.70%
FSVM          91.21%   96.88%   93.96%

6 Conclusion
In this paper, because server-side users hold different opinions on whether an email is legitimate, i.e., there is uncertainty in deciding which email is legitimate, "legitimate email" is understood as a fuzzy concept on a set of email samples, and every email is given a fuzzy membership degree in the legitimate class. The fuzzy membership degree is obtained by using the OWA operator to aggregate the opinions of Internet users. Because the membership degree is regarded as the attitude of the email training samples, we adopt FSVM as the classifier, and the penalty factor of the FSVM is determined by content-specific misclassification costs. Finally, the experimental results show that our method is more effective and more consistent with human judgment than SVM and Naïve Bayes.
Acknowledgements

This work is supported by the importance project foundation of the education department of Sichuan province (No. 2005A117) and the young foundation of Sichuan province (No. 06ZQ026-037), China.
References

1. Cohen, W.: Learning Rules That Classify E-mail. AAAI Spring Symposium on Machine Learning in Information Access, (1996)
2. Sahami, M., Dumais, S., Heckerman, D., et al.: A Bayesian Approach to Filtering Junk E-mail. In Proceedings of AAAI Workshop on Learning for Text Categorization, (1998) 55-62
3. Drucker, H., Wu, D., Vapnik, V.: Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10 (1999) 1048-1054
4. Kolcz, A., Alspector, J.: SVM-based Filtering of E-mail Spam with Content-specific Misclassification Costs. In Proceedings of The TextDM'01 Workshop on Text Mining, held at The 2001 IEEE International Conference on Data Mining, (2001) 309-347
5. Camastra, F., Verri, A.: A Novel Kernel Method for Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5) (2005) 801-805
6. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (2001)
7. Lin, C., Wang, S.: Fuzzy Support Vector Machines. IEEE Transactions on Neural Networks, 13(2), (2002) 464-471
8. Lin, C., Wang, S.: Training Algorithms for Fuzzy Support Vector Machines with Noisy Data. Pattern Recognition Letters, 25, (2004) 1647-1656
9. Yager, R.: On Ordered Weighted Averaging Aggregation Operators in Multicriteria Decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics, 18, (1988) 183-190
10. Yager, R.R.: Families of OWA Operators. Fuzzy Sets and Systems, 59(1), (1993) 125-148
A Distributed Support Vector Machines Architecture for Chaotic Time Series Prediction

Jian Cheng, Jian-sheng Qian, and Yi-nan Guo

School of Information and Electrical Engineering, China University of Mining and Technology, 221008, Xu Zhou, China
[email protected]
Abstract. Chaos limits predictability, which makes the prediction of chaotic time series very difficult. Building on the idea of combining several models to improve prediction accuracy and robustness, a new approach is presented to model and predict chaotic time series based on distributed support vector machines in the embedding phase space. A three-stage architecture of the distributed support vector machines is proposed to improve prediction accuracy and generalization performance for chaotic time series. In the first stage, the fuzzy c-means clustering algorithm is adopted to partition the input dataset into several subsets. In the second stage, a submodel is constructed for each subset by the least squares support vector machine that best fits it, with a Gaussian radial basis function kernel and optimal free parameters. In the third stage, a fuzzy synthesis algorithm combines the outputs of the submodels to obtain the final output, in which the degrees of membership are generated from the relationship between a new input sample and each subset center. All the models are evaluated on coal mine gas concentration data in the experiment. The simulation shows that the distributed support vector machines achieve a significant improvement in generalization performance and storage consumption in comparison with a single support vector machine model.
1 Introduction

Nonlinear and chaotic time series prediction is a practical technique for studying the characteristics of complicated systems from recorded data [1]. As most real-world systems are nonlinear, chaotic signals are often found in areas such as economic and business planning, weather forecasting, signal processing, and industrial automation and control. As a result, interest in chaotic time series prediction has increased; however, the nonlinear and chaotic nature of most practical time series makes conventional linear prediction methods inapplicable. Although neural networks have been developed for chaotic time series prediction, some inherent drawbacks, e.g., the multiple local minima problem, the choice of the number of hidden units, and the danger of overfitting, make it difficult to put neural networks into practice [2].

I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 892–899, 2006. © Springer-Verlag Berlin Heidelberg 2006
In recent years, the support vector machine (SVM) has been proposed as a novel technique for time series prediction. SVM is a pattern recognition approach founded on the structural risk minimization principle: it estimates a function by minimizing an upper bound of the generalization error, using kernel functions and the sparsity of the solution [3]. In many machine learning problems, SVM achieves higher generalization performance than traditional neural networks, which implement the empirical risk minimization principle. Another key characteristic of SVM is that training is equivalent to solving a linearly constrained quadratic programming problem, so the solution is always unique and globally optimal. One disadvantage of SVM is that the training time scales somewhere between quadratically and cubically with the number of training samples [4], so large problems require a large amount of computation time and storage. Furthermore, a single model is a poor match for data with different noise levels in different input regions. The multiple-modeling method, proposed by Bates and Granger [5], can enhance robustness and generalization by combining several models, and it is now widely accepted that multiple models are a simple and effective way to obtain better modeling performance [6]. Following the idea of the multiple-modeling method, a potential solution to the above problems is proposed: a distributed support vector machines (SVMs) architecture based on the fuzzy c-means (FCM) clustering algorithm [7] and a fuzzy synthesis algorithm. In this method, several submodels, which have different characteristics and the same objective, are constructed from subsets of the training sample data. The outputs of the submodels are then integrated to improve the overall estimation performance. Simulation results and an application indicate that the distributed SVMs outperform the single SVM model.

This paper presents a novel distributed SVMs architecture based on FCM. Section 2 presents the embedding phase space reconstruction and nonlinear function approximation. The architecture and algorithm of the distributed SVMs are given in Section 3. Section 4 presents the results and discussion of the experimental validation. Finally, some concluding remarks are drawn in Section 5.
2 Embedding Phase Space Reconstruction

Prediction of a chaotic time series amounts to approximating the unknown nonlinear functional mapping of a chaotic signal. The laws underlying the chaotic time series can be expressed as a deterministic dynamical system, but these deterministic equations are usually not given explicitly. Farmer and Sidorowich [8] suggest reconstructing the dynamics in phase space by choosing a suitable embedding dimension and time delay. According to the embedding theorem [9], in phase space reconstruction an observed chaotic time series $x(t)$, given as a scalar time series $\{x_t\}$, $t = 1, \ldots, N$, with sampling time $\Delta t$, is converted to its phase space using the method of delays:

$$X_t = [x_t, x_{t+1}, \ldots, x_{t+(d-1)}] \tag{1}$$
where $t = 1, 2, \ldots, N - (d-1)\tau/\Delta t$, $d$ is the embedding dimension, and $\tau$ is the delay time. In other words, embedding phase space reconstruction techniques convert a single scalar time series to a state-vector representation using the embedding dimension ($d$) and delay time ($\tau$). This reconstruction is required for both characterization and forecasting. In this study, the recently celebrated LS-SVM is employed to capture the dynamics depicted in Eq. (1), with the purpose of producing reliable predictions:

$$X_{t+T} = f(X_t), \quad f: (x_t, x_{t+1}, \ldots, x_{t+(d-1)}) \to (x_{t+\tau}, x_{t+1+\tau}, \ldots, x_{t+(d-1)+\tau}) \tag{2}$$

So $\hat{y} = x_{t+(d-1)+\tau} = g(x_t, x_{t+1}, \ldots, x_{t+(d-1)})$ is obtained. The training data consists of a $d$-dimensional vector, $x \in R^d$, and the output, $y \in R$. The goal of the learning machine, then, is to estimate an unknown continuous, real-valued function $g(x)$ that is capable of making accurate predictions of an output $y$ for previously unseen values of $x$, thus using information about the dynamics of system behavior in the embedding phase space representation to forecast future system states in observation space. Here $\tau$ is the prediction step: $\tau = 1$ means one-step-ahead prediction, and $\tau > 1$ means multi-step prediction. In this paper, we apply the distributed SVMs to estimate the unknown function $g(x)$.
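The construction of training pairs from Eqs. (1) and (2) can be sketched as follows. The function name is hypothetical, and the sketch assumes zero-based indexing, consecutive samples inside each window, and `tau` acting as the prediction step, as in the paragraph above.

```python
def embed(series, d, tau=1):
    # Input window: (x_t, ..., x_{t+d-1}); target: x_{t+(d-1)+tau}.
    pairs = []
    last_start = len(series) - (d - 1) - tau
    for t in range(last_start):
        window = series[t:t + d]
        target = series[t + (d - 1) + tau]
        pairs.append((window, target))
    return pairs
```

For a series of length $N$ this yields $N - (d - 1) - \tau$ pairs; with `tau=1` the target is simply the sample that follows the window.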
3 The Distributed Support Vector Machines

3.1 Constructing the Distributed SVMs

The basic idea of the distributed SVMs is to build several independent submodels with relevant features, to estimate a given input sample by obtaining the output of each individual SVM, and then to use a combination method to decide the final output. Fig. 1 shows how the distributed SVMs are constructed. Here, the fuzzy c-means (FCM) clustering algorithm is employed as a building block for the distributed SVMs. The training sample data is clustered into $s$ subsets via FCM, and each submodel is then built by a least squares support vector machine (LS-SVM) and trained on its own subset. In this study, the Gaussian radial basis function (RBF) is used as the kernel function

$$K(x, x_k) = \exp\big(-\|x - x_k\|^2 / \sigma^2\big) \tag{3}$$

where $\sigma$ is a positive real constant. Note that in the case of RBF kernels, there are only two tuning parameters: $\sigma^2$ in Eq. (3) and $\gamma$, a positive real constant that acts as a regularization parameter for avoiding overfitting in LS-SVM.

3.2 The Fuzzy Synthesis Algorithm
There are many methods for combining multiple models, as described in [10]. Here, a new fuzzy synthesis algorithm is proposed, which is divided into two steps: first, degrees of membership are generated from the relationship between a new sample and each subset; second, the outputs of the submodels are synthesized with these degrees to obtain the final output of the distributed SVMs.

Fig. 1. Architecture of the distributed SVMs

Suppose that the distributed SVMs have $s$ submodels. The output $Y$ of the distributed SVMs is

$$Y = \sum_{i=1}^{s} \mu_i Y_i \tag{4}$$

where $\mu_i$ is the degree of membership of a new input $X$ to the $i$-th submodel ($i = 1, 2, \ldots, s$), and $Y_i$ is the output of the $i$-th submodel. When estimating online, the degrees $\{\mu_1, \mu_2, \ldots, \mu_s\}$ serve as the submodel weights, and the final output of the distributed SVMs is calculated by Eq. (4). The degree $\mu_i$ is related to the Euclidean distance between a new input sample $X$ and $z_i$, the center of the $i$-th subset, and is calculated as follows:

$$\begin{cases} \mu_i = 1,\ \mu_{j \ne i} = 0, & \text{if } d_i = 0, \\ \mu_i = \Big(\dfrac{1}{d_i}\Big) \Big/ \Big(\sum_{j=1}^{s} \dfrac{1}{d_j}\Big), & \text{otherwise}, \end{cases} \quad i = 1, \ldots, s \tag{5}$$

where $d_i = \|X - z_i\|_2$, $\sum_{i=1}^{s} \mu_i = 1$, and $\mu_i \in [0, 1]$.
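A minimal sketch of Eqs. (4) and (5) is given below. It assumes that $d_i$ is the plain Euclidean distance to the subset center and uses hypothetical function names; the submodel outputs are passed in as numbers rather than computed by LS-SVM.

```python
import math

def memberships(x, centers):
    # Eq. (5): inverse-distance weights; an exact hit on a center gets weight 1.
    dists = [math.dist(x, z) for z in centers]
    for i, d in enumerate(dists):
        if d == 0.0:
            return [1.0 if j == i else 0.0 for j in range(len(centers))]
    inverse = [1.0 / d for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]

def synthesize(x, centers, submodel_outputs):
    # Eq. (4): membership-weighted sum of the submodel outputs Y_i.
    mu = memberships(x, centers)
    return sum(m * y for m, y in zip(mu, submodel_outputs))
```

A sample that lies three times closer to one center than to another receives weights 0.75 and 0.25, so nearby submodels dominate the prediction without the others being switched off entirely.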
3.3 The Proposed Algorithm for the Distributed SVMs
The process of predicting chaotic time series is now fully specified. The detailed steps of the distributed SVMs learning algorithm are as follows:

Step 1. Select the embedding dimension $d$ and delay time $\tau$ to reconstruct the phase space; then choose the training, validating, and checking sample data sets in the phase space.

Step 2. Partition the training sample data set into $s$ subsets with corresponding subset centers $z_i$ ($i = 1, 2, \ldots, s$) by using FCM with appropriate values of the weighted exponent $m$ and the number of clusters $s$.
Step 3. Build, train, and validate each submodel on its subset to determine the kernel parameters $\sigma^2$ and $\gamma$ of LS-SVM. For each subset, choose the LS-SVM that produces the smallest error on the validating data set.

Step 4. For a new input sample $X$, obtain the submodel outputs $Y_i$ ($i = 1, 2, \ldots, s$) and calculate the degrees of membership $\mu_i$ ($i = 1, 2, \ldots, s$) by Eq. (5).

Step 5. Synthesize the final output of the distributed SVMs by Eq. (4). Recover the predicted phase space points to the time domain to complete the prediction of the chaotic time series.
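Step 2 relies on fuzzy c-means. The following pure-Python sketch shows the standard alternating update of memberships and centers; it is a generic textbook form rather than the authors' implementation, and the function names, the fixed iteration count, and the caller-supplied initial centers are assumptions.

```python
import math

def _memberships(points, centers, m):
    # u[k][i]: degree to which point k belongs to cluster i; each row sums to 1.
    exponent = 2.0 / (m - 1.0)
    u = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # The point coincides with one or more centers: share membership among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            u.append([h / sum(hits) for h in hits])
        else:
            u.append([1.0 / sum((di / dj) ** exponent for dj in dists) for di in dists])
    return u

def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    # Alternate membership and center updates for a fixed number of iterations.
    centers = [list(c) for c in initial_centers]
    dim = len(points[0])
    for _ in range(iterations):
        u = _memberships(points, centers, m)
        for i in range(len(centers)):
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            if total > 0.0:  # leave a center unchanged if no point supports it
                centers[i] = [sum(w * p[k] for w, p in zip(weights, points)) / total
                              for k in range(dim)]
    return centers, _memberships(points, centers, m)
```

The returned centers play the role of $z_i$ in Eq. (5), and each training sample can be assigned to the subset in which its membership is largest before the submodels of Step 3 are trained.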
4 Simulation Results

Gas concentration, which is in essence a chaotic time series, is one of the key factors endangering production in coal mines. Strengthening the forecasting and control of coal mine gas concentration therefore has high social and economic value. In this study, 4010 samples were collected from an online underground sensor in a coal mine, after abnormal data were eliminated. The goal of the task is to use known values of the time series up to the point $x = t$ to predict the value at some point in the future, $x = t + \tau$. To make it simpler than Eq. (3), the method of prediction is to create a mapping from $d$ points of the time series spaced $\tau$ apart, that is, $(x(t - (d-1)\tau), \ldots, x(t - \tau), x(t))$, to a predicted future value $x(t + \tau)$. Through several trials, the embedding phase space is reconstructed with the parameter values $d = 6$ and $\tau = 4$ in the experiment. From the gas concentration time series $x(t)$, we extracted 3900 input-output data pairs of the following format:

$$[x(t-20), x(t-16), x(t-12), x(t-8), x(t-4), x(t), x(t+4)] \tag{6}$$
where t = 20 to t = 3919 . The first 3500 pairs is used as the training data set, the second 200 pairs is used as validating data set for finding the optimal parameters of LS-SVM, while the remaining 200 pairs are used as checking data set for testing the predictive power of the model. The prediction performance is evaluated using by the root mean squared error (RMSE):
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{7} $$
where n is the total number of data points in the data set, and $y_i$ and $\hat{y}_i$ are the actual value and the predicted value, respectively. The training sample data are partitioned by FCM; the weighted exponent m = 2 and the number of subsets s = 8 were chosen through several simulation trials. The results of the clustering analysis are shown in Table 1.
A Distributed SVMs Architecture for Chaotic Time Series Prediction
Table 1. The results of FCM in gas concentration chaotic time series (columns x(t-20) to x(t) give the subset center)

Subset s | x(t-20) | x(t-16) | x(t-12) | x(t-8) | x(t-4) | x(t)   | Sample number
1        | 0.2275  | 0.2260  | 0.2253  | 0.2253 | 0.2261 | 0.2278 | 491
2        | 0.2374  | 0.2379  | 0.2382  | 0.2381 | 0.2379 | 0.2373 | 126
3        | 0.2245  | 0.2227  | 0.2216  | 0.2216 | 0.2226 | 0.2246 | 914
4        | 0.2384  | 0.2393  | 0.2397  | 0.2397 | 0.2394 | 0.2385 | 243
5        | 0.2505  | 0.2534  | 0.2556  | 0.2559 | 0.2540 | 0.2511 | 607
6        | 0.2362  | 0.2363  | 0.2364  | 0.2364 | 0.2363 | 0.2360 | 464
7        | 0.2250  | 0.2233  | 0.2223  | 0.2223 | 0.2233 | 0.2251 | 136
8        | 0.2425  | 0.2444  | 0.2449  | 0.2451 | 0.2444 | 0.2424 | 519
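The data preparation of Eq. (6) and the error measure of Eq. (7) can be sketched as follows. This is a minimal illustration, assuming zero-based indexing of the series; the synthetic series and the persistence baseline (predicting x(t+4) by x(t)) are stand-ins introduced only for the example and are not the gas concentration data or the LS-SVM model.

```python
import math


def build_pairs(series, d=6, tau=4, t_start=20, t_end=3919):
    """Build (inputs, target) pairs in the format of Eq. (6)."""
    pairs = []
    for t in range(t_start, t_end + 1):
        # x(t - (d-1)*tau), ..., x(t - tau), x(t)
        inputs = [series[t - (d - 1 - j) * tau] for j in range(d)]
        target = series[t + tau]  # x(t + tau)
        pairs.append((inputs, target))
    return pairs


def rmse(actual, predicted):
    """Root mean squared error of Eq. (7)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Synthetic stand-in for the 4010 samples (not the real measurements).
series = [0.23 + 0.01 * math.sin(0.05 * i) for i in range(4010)]
pairs = build_pairs(series)
train, valid, check = pairs[:3500], pairs[3500:3700], pairs[3700:]

# Persistence baseline on the checking set; illustrative only.
actual = [target for _, target in check]
predicted = [inputs[-1] for inputs, _ in check]
baseline_rmse = rmse(actual, predicted)
```

With d = 6, τ = 4 and t running from 20 to 3919, the sketch yields the 3900 pairs stated in the text, split 3500/200/200.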
As the dynamics of gas concentration are strongly nonlinear, the Gaussian RBF kernel is used as the kernel function of the LS-SVM in this investigation. Former work [11] shows that, when LS-SVM is used for regression estimation, Gaussian kernels tend to give good performance under general smoothness assumptions; consequently, they are especially useful if no additional knowledge of the data is available. To obtain the best prediction performance from the LS-SVM models, the kernel parameters (σ², γ) were tuned through several trials; the resulting values are shown in Table 2.

Table 2. The kernel parameters and the training and checking RMSE of model

Model                | σ²  | γ    | Training RMSE | Checking RMSE
submodel 1           | 30  | 0.20 | 0.0125        | 0.0194
submodel 2           | 5   | 0.10 | 0.0163        | 0.0557
submodel 3           | 5   | 0.08 | 0.0173        | 0.0425
submodel 4           | 2   | 0.05 | 0.0164        | 0.0188
submodel 5           | 4   | 0.30 | 0.0190        | 0.0207
submodel 6           | 25  | 0.01 | 0.0205        | 0.0398
submodel 7           | 5   | 0.30 | 0.0171        | 0.0303
submodel 8           | 15  | 0.05 | 0.0167        | 0.0148
Single SVM           | 10  | 0.04 | 0.0198        | 0.0376
The distributed SVMs |     |      | 0.0184        | 0.0195
From the simulation results, it can be observed that the distributed SVMs forecast closer to the actual values than the single SVM over most of the checking data set. Accordingly, the absolute prediction errors of the distributed SVMs (the solid line) are smaller than those of the single SVM model (the dotted line), as illustrated in Fig. 2. The predicted values of the distributed SVMs and the actual values for the checking data are essentially the same, and their differences can only be seen on a finer scale.
Fig. 2. The absolute prediction errors in the single SVM model (the dotted line) and the distributed SVMs model (the solid line)
The results of the single SVM and the distributed SVMs models are given in Table 2. Comparing the two, the distributed SVMs achieve a much smaller checking RMSE than the single SVM model (0.0195 versus 0.0376), together with a slightly smaller training RMSE (0.0184 versus 0.0198). In addition, the CPU time used, the storage consumption and the number of converged support vectors are lower for the distributed SVMs than for the single SVM model.
5 Conclusions
This paper describes a novel methodology, distributed SVMs based on FCM, to model and predict chaotic time series. The distributed SVMs model is developed by integrating SVMs with a fuzzy synthesis algorithm in a three-stage architecture inspired by the multiple-modeling method. This investigation shows several advantages of the distributed SVMs. First, the generalization behaviors of the submodels differ, and combining the submodels improves robustness by sharing and averaging out their errors. Second, because the training data set of each submodel is smaller, the convergence speed of the SVM is greatly increased and the storage consumption is greatly decreased. Third, the distributed SVMs converge to fewer support vectors. The distributed SVMs model has been evaluated on the prediction of coal mine gas concentration, and its superiority is demonstrated by comparing it with the single SVM model. The simulation results show that the distributed SVMs model is more effective and efficient in predicting chaotic time series than the single SVM model. On the other hand, some issues should be investigated in future work, such as how to determine the number of subsets in the input space, which strongly affects the performance of the whole model, and how to construct the kernel function and determine the optimal kernel parameters.
Acknowledgements
This research is supported by the National Natural Science Foundation of China under grant 70533050, the Postdoctoral Science Foundation of China under grant 2005037225, the Postdoctoral Science Foundation of Jiangsu Province under grant [2004]300 and the Young Science Foundation of CUMT under grant OC4465.
References
1. A. S. Weigend, N. A. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past, Reading, MA: Addison-Wesley (1994)
2. H. Leung, T. Lo, S. Wang, Prediction of Noisy Chaotic Time Series Using an Optimal Radial Basis Function Neural Network, IEEE Transactions on Neural Networks, Vol. 12, No. 5, (2001) 1163-1172
3. V. N. Vapnik, An Overview of Statistical Learning Theory, IEEE Transactions on Neural Networks, Vol. 10, No. 5, (1999) 988-999
4. N. Cristianini, J. S. Taylor, An Introduction to Support Vector Machines: and Other Kernel-based Learning Methods, Cambridge University Press, New York (2000)
5. S. B. Cho, J. H. Kim, Combining Multiple Neural Networks by Fuzzy Integral for Recognition, IEEE Transactions on System, Man and Cybernetics, Vol. 25, No. 2, (1992) 380-384
6. W. P. Kegelmeyer, K. Bowyer, Combination of Multiple Classifiers Using Local Accuracy Estimates, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 4, (1997) 405-410
7. J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press (1981)
8. J. D. Farmer, J. J. Sidorowich, Predicting Chaotic Time Series, Phys. Rev. Lett., Vol. 59, (1987) 845-848
9. F. Takens, On the Numerical Determination of the Dimension of An Attractor, in Dynamical Systems and Turbulence, Lecture Notes in Mathematics, Berlin, Germany: Springer-Verlag, Vol. 898, (1981) 230-241
10. Z. Ahmad, J. Zhang, A Comparison of Different Methods for Combining Multiple Neural Networks Models, Proceedings of the 2002 International Joint Conference on Neural Networks, Vol. 1 (2002) 828-833
11. K. J. Kim, Financial Time Series Forecasting Using Support Vector Machines, Neurocomputing, Vol. 55, (2003) 307-319
Nonlinear Noise Reduction of Chaotic Time Series Based on Multi-dimensional Recurrent Least Squares Support Vector Machines
Jiancheng Sun, Yatong Zhou, Yaohui Bai, and Jianguo Luo
Dept. of Communication Eng., Jiangxi University of Finance and Economics, Nanchang, 330013, Jiangxi, China
[email protected]
Abstract. To address noise reduction in chaotic time series, a novel method based on a Multi-dimensional version of the Recurrent Least Squares Support Vector Machine (MDRLS-SVM) is proposed in this paper. By analyzing the relationship between function approximation and noise reduction, we find that noise reduction can be implemented by function approximation techniques. On the basis of the MDRLS-SVM and the theory of reconstructed embedding phase space, function approximation is carried out in the high dimensional embedding phase space and noise reduction is achieved simultaneously.
1 Introduction
Chaotic systems are an important class of nonlinear dynamical systems, arising in economics, meteorology, chemical processes, biology and many other fields. During the past decade, various noise reduction techniques for chaotic time series data have been developed. The most promising technique for chaotic systems with a moderate amount of noise is the local geometric projection algorithm [1,2,3,4]. In addition, shadowing methods and optimization techniques form another important approach to noise reduction [5,6]. More recent work has focused on the appropriate selection of the local subspace [7]. In previous research, the noise reduction procedure is usually considered separately from the approximation of the dynamics, and prior knowledge of the system is necessary [5,6]. In this paper, we take a preliminary look at these problems and confirm that noise reduction and function approximation can be regarded as a single procedure. Based on the theory of reconstructed embedding phase space, we develop a new method for noise reduction based on a Multi-dimensional version of the Recurrent Least Squares Support Vector Machine (MDRLS-SVM), a state-of-the-art technique within the machine learning community for regression estimation. We expect the proposed approach to resolve these problems owing to the strengths of the kernel method.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 900–906, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Noise Reduction and Function Approximation
To carry out the function approximation, the phase space needs to be reconstructed, since in many cases only a scalar time series can be acquired [9]. Deterministic dynamical systems describe the time evolution of a system in some phase space $\Gamma \subset \mathbb{R}^m$; for simplicity we assume that the phase space is a finite dimensional vector space. Since the (usually scalar) sequence does not by itself properly represent the multidimensional phase space of the dynamical system, we have to employ some technique to unfold the multidimensional structure using the available data. Fortunately, the Takens embedding theorem guarantees the reconstruction of a state space representation from a scalar signal alone [9]. Once the scalar time series has been embedded, we can model the dynamics by fitting a function that maps points forward in time. There are a variety of methods for approximating the mapping function, either locally or globally; however, the basic idea is the same in each case. Each method defines a space of candidate functions f for the map
$$ \mathbf{x}_{k+1} = f(\mathbf{x}_k). \tag{1} $$
We wish to choose the function that best approximates the dynamics. Generally this means minimizing the error terms, which are defined as
$$ \mathbf{e}_k = \mathbf{x}_k - f(\mathbf{x}_{k-1}) \tag{2} $$
where $\mathbf{e}_k$ is the approximation error associated with $\mathbf{x}_k$, the kth point in the series, and f(x) is the approximating function. In common approaches to noise reduction, function approximation and noise reduction are viewed as two independent procedures. To explain this point, we review Farmer and Sidorowich's approach [5] and its simplified version proposed by Davies [6]. Farmer and Sidorowich argue that noise reduction can be realized by finding the deterministic orbit that is closest to the noisy orbit, namely "optimal shadowing". The problem can be formulated with the following Lagrangian function:
$$ S = \sum_{k=1}^{N} \|\mathbf{y}_k - \mathbf{x}_k\|^2 + 2\sum_{k=1}^{N-1} [f(\mathbf{x}_k) - \mathbf{x}_{k+1}]^T \boldsymbol{\lambda}_k \tag{3} $$
where f(x) is the mapping function, $\mathbf{x}_k$ is the deterministic trajectory, $\mathbf{y}_k$ is the noisy trajectory and $\boldsymbol{\lambda}_k$ are Lagrange multipliers. To solve this optimization problem, f(x) must be given or approximated in advance; only then can the noise reduction be implemented. Unfortunately, the problem can be ill-conditioned because of homoclinic tangencies and the chaotic character of the dynamical system. Davies therefore proposed a simplified strategy to avoid this problem. He showed that the noise can be reduced by finding a less noisy orbit close to the original noisy data instead of the deterministic orbit. The noise reduction can then be formulated as the following multi-dimensional minimization problem:
$$ \min H = \sum_{k=1}^{N} \|\mathbf{e}_k\|^2 \tag{4} $$
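The roles of Eq. (2) and Eq. (4) can be illustrated with a short sketch for the scalar case. The logistic map used here is only an illustrative choice of f and does not come from the paper; the function names and the zero-based indexing are likewise assumptions of the example.

```python
def one_step_errors(xs, f):
    """One-step errors e_k = x_k - f(x_{k-1}) of Eq. (2), scalar case."""
    return [xs[k] - f(xs[k - 1]) for k in range(1, len(xs))]


def cost_h(xs, f):
    """Sum of squared one-step errors, the quantity minimized in Eq. (4)."""
    return sum(e * e for e in one_step_errors(xs, f))


def logistic(x):
    """Illustrative chaotic map (an assumption of this example)."""
    return 4.0 * x * (1.0 - x)


# A clean orbit satisfies x_k = f(x_{k-1}) exactly, so its cost is zero.
clean = [0.3]
for _ in range(50):
    clean.append(logistic(clean[-1]))

# A perturbed copy of the orbit no longer follows the map.
noisy = [x + 0.01 * (-1) ** k for k, x in enumerate(clean)]

h_clean = cost_h(clean, logistic)
h_noisy = cost_h(noisy, logistic)
```

In this toy setting a noise reduction scheme in the spirit of Davies would adjust the noisy points so that the cost decreases toward the value obtained for the clean orbit.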
Although Davies's method simplifies the problem, the function approximation is still separated from the noise reduction. Ideally, we would like an algorithm that performs function approximation and noise reduction simultaneously. The composite method presented below, namely the Multi-dimensional RLS-SVM, appears to meet these requirements.
3 Noise Reduction Based on Multi-dimensional RLS-SVM
Based on the above discussion, achieving noise reduction and function approximation simultaneously is the central problem addressed by our method. Suykens and Vandewalle proposed the Recurrent Least Squares Support Vector Machine (RLS-SVM), based on the sum squared error (SSE), to deal with problems in function approximation and prediction [8]. Nevertheless, the RLS-SVM cannot work correctly in the high dimensional reconstructed embedding phase space. In this paper, a multi-dimensional version of the RLS-SVM is proposed to deal with this problem. First of all, we review the classical RLS-SVM to explain our method. Let $\{s(t)\}$, $t = 1, 2, \ldots, T$, be a scalar time series generated by a dynamical system. Choose the optimal embedding dimension $m$, the time delay $\tau_d$ and a set $Q \subseteq \{1, \ldots, N\}$; the training data are then constructed, based on the theory of reconstructed embedding phase space, as
$$ D = \{ ((s_{k-(m-1)\tau_d}, \ldots, s_{k-\tau_d}, s_k),\ s_{k+1}) \in \mathbb{R}^m \times \mathbb{R} \mid k \in Q \} \tag{5} $$
Given the initial condition $\hat{s}_i = s_i$ for $i = 1, 2, \ldots, m$, where $\hat{s}_i$ denotes the estimated output, the function approximation problem is given by:
$$ \hat{s}_{k+1} = f(\hat{s}_{k-(m-1)\tau_d}, \ldots, \hat{s}_{k-\tau_d}, \hat{s}_k) = w^T \varphi(\hat{\mathbf{s}}_k) + b \tag{6} $$
where $\varphi(\cdot): \mathbb{R}^m \to \mathbb{R}^{n_h}$ is a nonlinear mapping to the feature space, $w \in \mathbb{R}^{n_h}$ is the output weight vector, $b \in \mathbb{R}$ is the bias term, and $\hat{\mathbf{s}}_k = [\hat{s}_{k-(m-1)\tau_d}; \ldots; \hat{s}_{k-\tau_d}; \hat{s}_k]$. The function f(·) is thus estimated from the training data. In [8], Suykens and Vandewalle showed that Eq. (6) can be converted into the following optimization problem:
$$ \min_{w,e} J(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=m+1}^{N+m} e_k^2 \tag{7} $$
subject to
$$ s_{k+1} - e_{k+1} = w^T \varphi(\mathbf{s}_k - \mathbf{e}_k) + b, \quad k = m+1, \ldots, N+m \tag{8} $$
where $e_k = s_k - \hat{s}_k$, $\mathbf{s}_k = [s_{k-(m-1)\tau_d}; \ldots; s_{k-\tau_d}; s_k]$, $\mathbf{e}_k = [e_{k-(m-1)\tau_d}; \ldots; e_{k-\tau_d}; e_k]$, and γ is an adjustable constant. The basic idea of the mapping function φ(·) is to map the data into a high-dimensional feature space and to do linear regression in this space. By Mercer's theorem, the mapping function φ(·) can be expressed through a kernel function K(·,·), which means that
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j) \tag{9} $$
With RBF kernels one employs
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right) \tag{10} $$
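As a small illustration of Eq. (10), the RBF kernel and the kernel (Gram) matrix it induces on a set of points can be computed as below. The function names and the sample points are hypothetical and serve only this example.

```python
import math


def rbf_kernel(xi, xj, sigma):
    """RBF kernel of Eq. (10): exp(-||xi - xj||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(points, sigma):
    """Matrix of pairwise kernel values K(x_i, x_j)."""
    return [[rbf_kernel(p, q, sigma) for q in points] for p in points]


# Hypothetical delay vectors in a two-dimensional embedding space.
points = [(0.0, 0.0), (1.0, 1.0), (0.5, -0.5)]
K = gram_matrix(points, sigma=1.0)
```

The kernel value depends only on the distance between the two points, so the resulting matrix is symmetric with ones on the diagonal.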
Comparing Eq. (1) with Eq. (6), the difference is obvious: the left hand side of Eq. (1) is a vector, while a scalar quantity appears in Eq. (6). What this difference means is the most important problem to be solved. The practical meaning of Eq. (1) is clear, since it describes the relationship between two points in the embedding phase space. In contrast, Eq. (6) has no equally exact practical meaning, and it does not make adequate use of the information in the high dimensional embedding space. A multi-dimensional regression estimation problem therefore arises, in which the output variable on the left hand side of Eq. (6) is a vector $\mathbf{s} \in \mathbb{R}^k$ instead of a scalar quantity. Fortunately, for the RLS-SVM the multi-dimensional regression problem can be divided into k one-dimensional problems, since minimum variance estimation is equivalent to the multidimensional estimate. For the SVM this is not the case, because the insensitive zone defined around the estimate does not treat every training sample equally. We therefore propose the multi-dimensional version of the RLS-SVM, formulated as follows. The multi-dimensional function approximation problem is reconsidered in the embedding phase space, which is reconstructed from the observed scalar time series $s = \{s_0, s_1, \ldots, s_{N-1}\}$. The method is presented as follows:
$$ \hat{\mathbf{s}}_k = \mathbf{f}(\hat{\mathbf{s}}_{k-1}), \quad k = 0, 1, \ldots, N-m, \quad \hat{\mathbf{s}}_k = [\hat{s}_{k-(m-1)\tau_d}, \hat{s}_{k-(m-2)\tau_d}, \ldots, \hat{s}_k] \tag{11} $$
where $m$ and $\tau_d$ are referred to as the embedding dimension and the time delay, respectively. The time delay $\tau_d$ is set to one for simplicity. The function is then written as:
$$ \hat{\mathbf{s}}_k = W^T \varphi(\hat{\mathbf{s}}_{k-1}) + \mathbf{b} \tag{12} $$
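where $W = [w_1, w_2, \ldots, w_m]$, $\mathbf{b} = [b_1, b_2, \ldots, b_m]$ and $\varphi(\cdot): \mathbb{R}^m \to \mathbb{R}^{n_h}$ is the nonlinear mapping to the feature space. The estimation of the above function can then be stated similarly as the quadratic optimization problem
$$ \min_{w,e} J(w, e) = \frac{1}{2} \sum_{j=1}^{m} \|w_j\|^2 + \gamma \frac{1}{2} \sum_{k=1}^{N-m} \|\mathbf{e}_k\|^2 \tag{13} $$
subject to
$$ \mathbf{s}_k - \mathbf{e}_k = W^T \varphi(\mathbf{s}_{k-1} - \mathbf{e}_{k-1}) + \mathbf{b}, \quad k = 1, 2, \ldots, N-m \tag{14} $$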
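where $\mathbf{e}_k = \mathbf{s}_k - \hat{\mathbf{s}}_k$ has the same meaning as in Eq. (2). Introducing Lagrange multipliers leads to the minimization of
$$ L(w, e, b, \alpha) = \frac{1}{2} \sum_{j=1}^{m} \|w_j\|^2 + \gamma \frac{1}{2} \sum_{k=1}^{N-m} \|\mathbf{e}_k\|^2 + \sum_{k=1}^{N-m} \alpha_{k-m} \left[ \mathbf{s}_k - \mathbf{e}_k - W^T \varphi(\mathbf{s}_{k-1} - \mathbf{e}_{k-1}) - \mathbf{b} \right] \tag{15} $$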
with respect to $w_j$, $b_j$ and $e_j$, and its maximization with respect to the Lagrange multipliers α. The solution to this problem is given by the Karush-Kuhn-Tucker theorem. The multi-dimensional version of the RLS-SVM can then be presented as
$$ \hat{\mathbf{s}}_k = \sum_{p=1}^{m} \sum_{l=1}^{N-m} \alpha_l K(\mathbf{z}_l, \hat{\mathbf{s}}_{k-1}) + b_p \tag{16} $$
where $\mathbf{z}_l = \mathbf{s}_l - \mathbf{e}_l$ and K(·,·) is the RBF kernel function stated in Eq. (10). As discussed above, our algorithm carries out the function approximation task in the high dimensional space, which carries more physical and geometrical information than the scalar time series. Better performance should therefore be obtained in theory.
4 Simulation and Discussion
4.1 Experimental Results
To test the performance of our noise reduction method, we used the Ikeda map:
$$ x_{t+1} = 1 + \mu (x_t \cos s - y_t \sin s), \quad y_{t+1} = \mu (x_t \sin s + y_t \cos s) \tag{17} $$
where $s = 0.4 - \frac{6.0}{1 + x_t^2 + y_t^2}$. The Ikeda attractor is generated by taking μ = 0.7. A trajectory was generated from the initial condition [0.1; 0.1], and a random number generator then added noise to the orbit with SNR = 30 dB. Fig. 1 demonstrates the result of noise reduction with no prior knowledge, learning directly from the noisy time series. Fig. 1a shows a phase plot of a portion of the 1500 points in the noisy time series. We iterated the procedure of learning the dynamics and reducing the noise to produce the noise reduced phase plot shown in Fig. 1b. For comparison, the original uncontaminated phase plot is shown in Fig. 1c. Comparing (b) and (c), there are significant discrepancies in some segments, caused as usual by inaccuracies in the approximation algorithm. However, the majority of the points correspond in detail. Fig. 1 shows that our method can learn the features of the dynamical system effectively and that noise reduction can be realized simultaneously.
[Fig. 1, panels (a), (b) and (c): phase plots; horizontal axis from 0.2 to 1.2, vertical axis from −0.8 to 0.6.]
Fig. 1. (a) A phase plot showing 1500 points of a noisy time series, obtained by adding uniformly distributed noise with SNR=30dB to each component of the data shown in (c). (b) The result of applying our noise reduction method to the data in (a). (c) A plot of 1500 successive iterates of the Ikeda attractor. These are the "true" data for (a), before adding noise.
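The test data described above (Ikeda map of Eq. (17) with μ = 0.7, initial condition [0.1; 0.1], uniformly distributed noise at 30 dB) can be generated with a sketch like the following. The SNR convention used here, signal variance divided by noise variance, is an assumption, since the text does not define it; the function names and the random seed are also illustrative.

```python
import math
import random


def ikeda_orbit(n, mu=0.7, x0=0.1, y0=0.1):
    """Iterate the Ikeda map of Eq. (17) n times from (x0, y0)."""
    x, y = x0, y0
    orbit = []
    for _ in range(n):
        s = 0.4 - 6.0 / (1.0 + x * x + y * y)
        x, y = (1.0 + mu * (x * math.cos(s) - y * math.sin(s)),
                mu * (x * math.sin(s) + y * math.cos(s)))
        orbit.append((x, y))
    return orbit


def add_uniform_noise(values, snr_db, rng):
    """Add zero-mean uniform noise scaled to a target SNR in dB.

    Assumption: SNR = 10 log10(signal variance / noise variance).
    """
    mean = sum(values) / len(values)
    signal_power = sum((v - mean) ** 2 for v in values) / len(values)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    half_width = math.sqrt(3.0 * noise_power)  # variance of U(-a, a) is a^2 / 3
    return [v + rng.uniform(-half_width, half_width) for v in values]


rng = random.Random(0)  # fixed seed so the example is repeatable
orbit = ikeda_orbit(1500)
xs = [p[0] for p in orbit]
ys = [p[1] for p in orbit]
noisy_xs = add_uniform_noise(xs, 30.0, rng)
noisy_ys = add_uniform_noise(ys, 30.0, rng)
```

The noise is added to each component separately, matching the description in the caption of Fig. 1.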
4.2 Comparison with Previous Methods and Discussion
In [5] and [6], better performance can be obtained when the function of the system is known in advance or can be approximated from clean data. In fact, in most practical applications the dynamics of the system are unknown and must be learned directly from the noisy data. Comparing our algorithm with those of [5] and [6] in this case, we found that similar performance can be obtained. Most importantly, however, our method needs no prior knowledge of the system, and the function approximation and noise reduction are carried out simultaneously; that is, the two procedures are one and the same. Nevertheless, it should be noted that this study focuses on proposing the novel idea. Limitations of this study include the complexity of the algorithm and the lack of research on finding an appropriate kernel function for chaotic systems. The performance should therefore improve with deeper investigation of these techniques, namely further research on multi-dimensional function approximation based on kernel methods to resolve noise reduction in the high dimensional embedding phase space. In addition, we think that the MDRLS-SVM may be effective when a "true" multi-dimensional case is considered, such as multidimensional economic time series and multiple-input multiple-output (MIMO) systems, which are formed without using the techniques of reconstructed embedding phase space.
5 Conclusion
For a long time, the noise reduction procedure has been considered separately from the approximation of the dynamics. In this paper, however, we show that noise reduction can be realized by using function approximation techniques. The function approximation is implemented with the Multi-dimensional version of the Recurrent Least Squares Support Vector Machine (MDRLS-SVM), a novel machine learning method. Simulation results show that noise reduction can be performed with our method without prior knowledge of the system.
References
1. T. Schreiber and P. Grassberger, A simple noise-reduction method for real data, Phys. Lett. A 160, 411-418, 1991
2. R. Cawley and G.H. Hsu, Local-geometric-projection method for noise reduction in chaotic maps and flows, Phys. Rev. A 46, 3057-3082, 1992
3. H. Kantz, T. Schreiber, I. Hoffmann, T. Buzug, G. Pfister, L. G. Flepp, J. Simonet, R. Badii, and E. Brun, Nonlinear noise reduction: A case study on experimental data, Phys. Rev. E 48, 1529-1538, 1993
4. A. Leontitsis, T. Bountis, and J. Pagge, An adaptive way for improving noise reduction using Local Geometric Projection, CHAOS 14(1), 106-110, 2004
5. J.D. Farmer and J.J. Sidorowich, Optimal shadowing and noise reduction, Physica D 47, 373-392, 1991
6. M.E. Davies, Noise reduction schemes for chaotic time series, Physica D 79, 174-192, 1994
7. A. Kern, W.H. Steeb, and R. Stoop, Projective noise cleaning with dynamic neighborhood selection, Int. J. Mod. Phys. C 11, 125-146, 2000
8. J. A. K. Suykens and J. Vandewalle, Recurrent Least Squares Support Vector Machines, IEEE Tran. on Circuits and System-I: Fundamental Theory and Applications, Vol. 47, No. 7, pp. 1109-1114, 2000
9. F. Takens, Detecting strange attractors in fluid turbulence, in D. Rand and L.S. Young, editors, Dynamical systems and turbulence, pp. 366-381, Springer-Verlag, Berlin, 1981
Self-Organizing Map with Input Data Represented as Graph
Takeshi Yamakawa (1), Keiichi Horio (1), and Masaharu Hoshino (2)
(1) Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, Hibikino, Wakamatsu, Kitakyushu, Fukuoka 808-0196, Japan, {yamakawa, horio}@brain.kyutech.ac.jp
(2) Ricoh Co., Ltd.
Abstract. This paper proposes a new Self-Organizing Map (SOM) method in which the input space is represented as a graph, obtained by modifying the distance measure and the updating rule. The distance between an input node and a reference element is defined as the shortest distance between them in the graph. The reference elements are updated along the shortest path to the input node. The effectiveness of the proposed method is verified by applying it to a Traveling Salesman Problem.
1 Introduction
A self-organizing map (SOM) is one of the most popular neural networks, and has been successfully applied in many areas [1][2]. The most distinctive feature of the SOM is that it realizes a mapping from a high dimensional input vector space to a low dimensional space while preserving topology. In almost all research on the SOM, the input data are defined in Euclidean space. Recently, an expanded version of the SOM, called the modular network SOM, was proposed to generalize the input space from a Euclidean vector space to a function space, a map space and so on [3][4]. In these methods, however, the input data spaces are continuous, and a noncontiguous space cannot be treated as an input space. On the other hand, many things in the real world cannot be represented in a continuous space, for example train route maps, electronic circuits and the structure of the WWW. They should be represented with nodes and links, and a graph is a major candidate for representing such structures. A graph is a mathematical object that abstractly represents elements and the connections between them. In this paper, we propose a new SOM learning algorithm in which the input space is represented as a graph. In the proposed method, the reference elements, which correspond to the reference vectors of the conventional SOM, can exist only on the nodes and links of the graph. The proposed SOM differs from conventional SOMs in the definition of the distance between input elements and reference elements and in the updating rule of the reference elements. The distance between the input and reference elements is calculated based on the shortest path on the graph, and the reference elements are updated along the shortest path to the input element. The effectiveness of the proposed method is verified by applying it to the Traveling Salesman Problem (TSP) in a space with obstacles.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 907–914, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 Self-Organizing Map and Graph
In this section, the SOM and the graph, which are the fundamental components of the proposed method, are briefly explained.
2.1 Self-Organizing Map
The SOM consists of an input layer and a competitive layer, which contain n and N units, respectively, as shown in Fig. 1. Unit i in the competitive layer has the reference vector $w_i$. In the batch learning algorithm of the SOM, the reference vectors are updated toward the center of gravity of the input vectors weighted by the neighborhood function. The batch learning algorithm with L input vectors is summarized as follows.
0) The reference vectors of all the units in the competitive layer are initialized using random values.
1) An input vector $x_l$ is selected from the set of L input vectors and is applied to the input layer.
2) A winner unit $c_l$ is selected as the unit with the smallest Euclidean distance:
$$ c_l = \arg\min_i \|x_l - w_i\| \tag{1} $$
3) A neighborhood function $h_{il}$ is calculated by:
$$ h_{il} = \exp\left(-\frac{\|r_i - r_{c_l}\|}{2\sigma(t)^2}\right) \tag{2} $$
[Fig. 1 labels: input layer with inputs $x_1, \ldots, x_i, \ldots, x_n$; competitive layer; unit j with reference vector $w_j$.]
Fig. 1. Structure of the SOM
[Fig. 2, panels (a) and (b): graph drawings with nodes v1 to v5 and links e1 to e6.]
Fig. 2. An example of a graph. vi and ej are a node and a link, respectively.
where $r_i$ and $r_{c_l}$ are the coordinates of unit i and the winner unit $c_l$ in the competitive layer, respectively, and $\sigma(t)$ is the width of the neighborhood function at learning step t.
4) The neighborhood functions for all input vectors are calculated by repeating steps 1) to 3).
5) The reference vectors are updated by:
$$ w_i^{new} = w_i^{old} + \alpha \frac{\sum_{l=1}^{L} h_{il} x_l}{\sum_{l=1}^{L} h_{il}} \tag{3} $$
where $w_i^{new}$ and $w_i^{old}$ are the reference vectors after and before updating, respectively.
6) Steps 1) to 5) are repeated while decreasing the width of the neighborhood function $\sigma(t)$.
The mapping from the input vector space to a 1- or 2-D competitive layer can be constructed by the learning above.
2.2 Graph
A graph is a model representing phenomena that include connections, and it is used to analyze such phenomena. By using a graph, the connection properties of, for example, an electronic circuit or a computer network can be represented. A graph is defined by a set of nodes V and a set of links E. Fig. 2(a) shows an example of a graph, in which $v_j$ and $e_k$ are a node and a link, respectively, and the graph is defined by:
$$ G = (V, E) \tag{4} $$
[Fig. 3 legend: node of graph; input node for SOM; reference element; competitive layer.]
Fig. 3. Structure of the proposed SOM. The input vector set is represented as a graph, and the competitive layer is a two dimensional Euclidean space, the same as in the ordinary SOM.
where $V = \{v_1, v_2, v_3, v_4, v_5\}$ and $E = \{e_1, e_2, e_3, e_4, e_5, e_6\}$. In the graph, the distance between two nodes is defined by the shortest path between them. For example, the distance $d_G(v_1, v_5)$ between node $v_1$ and node $v_5$ is 3.
3 Proposed Method
To realize a mapping from an input space represented as a graph to a one or two dimensional Euclidean space, we propose a new learning algorithm for the SOM. The structure of the proposed SOM is shown in Fig. 3. We define a subset of the node set of the graph as the set of input nodes $X = \{x_l \mid l = 1, \ldots, L\}$. In the ordinary SOM, learning is divided into three processes: definition of the winner unit, definition of the neighboring units, and update of the reference vectors. The definition of the neighboring units is the same as in the ordinary SOM, because the competitive layer is a Euclidean space. In the proposed SOM, new methods for defining the winner unit and for updating the reference vectors in the graph must be realized. The reference element, which corresponds to the reference vector in the ordinary SOM, can exist only on the nodes or links of the graph, so we define the reference element by:
$$ w_i = (v_j, v_k, \beta) \tag{5} $$
This means that the reference element $w_i$ lies on the link between nodes $v_j$ and $v_k$, at the point that divides the link in the ratio β to 1 − β, where 0 ≤ β ≤ 1. Thus the distance between an input node x and the reference element $w_i$ is defined by:
$$ d_G(x, w_i) = \min\big( d_G(x, v_j) + \beta\, d_G(v_j, v_k),\ d_G(x, v_k) + (1-\beta)\, d_G(v_j, v_k) \big) \tag{6} $$
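The distance of Eq. (6) can be sketched with a standard shortest-path search. The sketch below uses Dijkstra's algorithm, which is a choice made for this example (the text does not name a shortest-path algorithm), and it assumes that $d_G(v_j, v_k)$ equals the length of the link carrying the reference element. The four-node graph with unit link lengths is hypothetical.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm; graph is {node: {neighbour: link_length}}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, length in graph[u].items():
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def distance_to_reference(graph, x, ref):
    """d_G(x, w_i) of Eq. (6) for a reference element ref = (v_j, v_k, beta)."""
    vj, vk, beta = ref
    dist = shortest_distances(graph, x)
    link = graph[vj][vk]  # assumed equal to d_G(v_j, v_k)
    return min(dist[vj] + beta * link, dist[vk] + (1.0 - beta) * link)


# Hypothetical graph: a cycle a-b-c-d-a with unit link lengths.
graph = {
    "a": {"b": 1.0, "d": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0, "a": 1.0},
}
ref = ("b", "c", 0.25)  # a quarter of the way from b to c

d_from_a = distance_to_reference(graph, "a", ref)
d_from_d = distance_to_reference(graph, "d", ref)
```

In this example the shortest route from node a reaches the reference element through b, while the shortest route from node d reaches it through c, which is the case distinction that Eq. (6) resolves with the minimum.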
[Fig. 4, panels (a) and (b), labels: input vector; input node; shortest path; reference element before update; reference element after update.]
Fig. 4. Illustration of updating of a weight vector. (a) Ordinary SOM. (b) Proposed SOM in which the input vector set is represented as a graph.
In the ordinary SOM, the reference vector $w_i$ is updated along the line segment defined by x and $w_i$, as shown in Fig. 4(a). The line segment between x and $w_i$ in Euclidean space is the shortest path between them. In other words, the reference vector is updated along the shortest path to the input vector. By this analogy, the reference elements should be updated along the shortest path to the input node in the graph, as shown in Fig. 4(b). We propose the following update rule:
$$ w_i^{new} = \mathrm{move}(x, w_i^{old}, \alpha(t) h_{i,c}(t)) \tag{7} $$
where $w_i^{new}$ and $w_i^{old}$ are the reference elements after and before the update, respectively, and $\alpha(t)$ and $h_{i,c}(t)$ are the learning rate and the neighborhood function, respectively. The function move generates the position of the reference element after the update, and is described as follows:

[Condition 1] In case of ($d_G(x, v_j) + \beta d_G(v_j, v_k) < d_G(x, v_k) + (1-\beta) d_G(v_j, v_k)$ and $\alpha(t) h_{i,c}(t) d_G(x, w_i^{old}) < \beta d_G(v_j, v_k)$),

$$ w_i^{new} = \left( v_j,\; v_k,\; \frac{\beta d_G(v_j, v_k) - \alpha(t) h_{i,c}(t) d_G(x, w_i^{old})}{d_G(v_j, v_k)} \right) \tag{8} $$

[Condition 2] In case of ($d_G(x, v_k) + (1-\beta) d_G(v_j, v_k) \le d_G(x, v_j) + \beta d_G(v_j, v_k)$ and $\alpha(t) h_{i,c}(t) d_G(x, w_i^{old}) < (1-\beta) d_G(v_j, v_k)$),

$$ w_i^{new} = \left( v_j,\; v_k,\; \frac{\beta d_G(v_j, v_k) - \alpha(t) h_{i,c}(t) d_G(x, w_i^{old})}{d_G(v_j, v_k)} \right) \tag{9} $$

[Condition 3] In case of ($d_G(x, v_j) + \beta d_G(v_j, v_k) < d_G(x, v_k) + (1-\beta) d_G(v_j, v_k)$ and $\alpha(t) h_{i,c}(t) d_G(x, w_i^{old}) < \beta d_G(v_j, v_k)$ and $d_G(x, v_s) < \alpha(t) h_{i,c}(t) d_G(x, w_i^{old}) - \beta d_G(v_j, v_k) < d_G(x, v_{s+1})$),

$$ w_i^{new} = \left( v_s,\; v_{s+1},\; \frac{d_G(v_s, w_i^{old}) + \beta d_G(v_j, v_k) - \alpha(t) h_{i,c}(t) d_G(x, w_i^{old})}{d_G(v_s, w_i^{old})} \right) \tag{10} $$

[Condition 4] In case of ($d_G(x, v_k) + (1-\beta) d_G(v_j, v_k) \le d_G(x, v_j) + \beta d_G(v_j, v_k)$ and $\alpha(t) h_{i,c}(t) d_G(x, w_i^{old}) < (1-\beta) d_G(v_j, v_k)$ and $d_G(x, v_s) < \alpha(t) h_{i,c}(t) d_G(x, w_i^{old}) - (1-\beta) d_G(v_j, v_k) < d_G(x, v_{s+1})$),

$$ w_i^{new} = \left( v_s,\; v_{s+1},\; \frac{d_G(v_s, w_i^{old}) + (1-\beta) d_G(v_j, v_k) - \alpha(t) h_{i,c}(t) d_G(x, w_i^{old})}{d_G(v_s, w_i^{old})} \right) \tag{11} $$

Examples of the update of the reference element are shown in Fig. 5. Fig. 5(a) shows the update of the reference element under Conditions 1 and 2, and Fig. 5(b) shows the update under Conditions 3 and 4. By modifying the distance measure and the update rules, the SOM can be applied to a graph input space.

Fig. 5. Update of the reference element. (a) Condition 1 and 2. (b) Condition 3 and 4.
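The idea of moving a reference element along the shortest path can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the graph is assumed to be an undirected adjacency dictionary, a reference element is assumed to be a triple (vj, vk, beta) where beta is the fractional offset from vj along the edge, and the names `distances_from`, `move` and `GRAPH` are hypothetical. When the shortest path leaves through vk, the sketch simply relabels the edge as (vk, vj, 1 - beta) and reuses the same-edge formula of Eq. (8); when the step passes one or more nodes, it walks along the shortest path and reports the offset from the last node passed, which is the sketch's own parametrisation rather than Eqs. (10) and (11).

```python
import heapq

def distances_from(graph, source):
    """Dijkstra from `source`: returns (dist, toward), where toward[v] is the
    neighbour of v on a shortest path from v back to `source`."""
    dist = {source: 0.0}
    toward = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, length in graph[u].items():
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                toward[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, toward

def move(graph, x, w_old, rate):
    """Move w_old = (vj, vk, beta) toward input node x by rate * d_G(x, w_old).
    `rate` plays the role of alpha(t) * h_ic(t) and is assumed to lie in [0, 1]."""
    dist, toward = distances_from(graph, x)
    vj, vk, beta = w_old
    edge = graph[vj][vk]
    via_j = dist[vj] + beta * edge
    via_k = dist[vk] + (1.0 - beta) * edge
    if via_k <= via_j:
        # Shortest path leaves through vk: relabel so that it leaves through vj.
        vj, vk, beta = vk, vj, 1.0 - beta
    step = rate * min(via_j, via_k)
    to_vj = beta * edge
    if step < to_vj:
        return (vj, vk, (to_vj - step) / edge)  # stays on the edge, cf. Eq. (8)
    step -= to_vj
    position = (vj, vk, 0.0)  # exactly at vj
    node = vj
    while node != x:
        nxt = toward[node]
        length = graph[node][nxt]
        if step <= length:
            return (node, nxt, step / length)  # offset measured from `node`
        step -= length
        position = (node, nxt, 1.0)
        node = nxt
    return position

# Hypothetical toy graph: the direct edge a-c is long, so paths to c go via b.
GRAPH = {"a": {"b": 2.0, "c": 10.0},
         "b": {"a": 2.0, "c": 2.0},
         "c": {"a": 10.0, "b": 2.0}}
print(move(GRAPH, "c", ("a", "b", 0.5), 0.5))  # ('b', 'c', 0.25)
```

Running Dijkstra once from the input node gives every distance needed for the conditions, so the cost of one update is dominated by a single shortest-path computation.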
4 Experimental Results
In order to verify the effectiveness of the proposed method, it is applied to the traveling salesman problem (TSP) in a space with obstacles. The TSP is to find the shortest route by which a salesman visits all cities and returns to the home city. The TSP is one of the most well-studied combinatorial optimization problems, and it is well known to be NP-hard. The TSP is applied in many fields, for example, genomic analysis, control of astrometrical telescopes, path planning for hole making in printed boards, and so on. Many studies of the TSP in which the cities are arranged in two-dimensional Euclidean space have been reported, and the SOM has been successfully applied to the TSP [5]. In the real world, however, situations in which obstacles exist in the space should be considered. For example, when a salesman moves from city to city by car, he cannot move in a straight line from city to city. In such a realistic case, the ordinary SOM cannot be applied. The proposed SOM learning algorithm is applied to the TSP with the limitation on movement mentioned above.

Fig. 6(a) is a two-dimensional Euclidean working space with obstacles. The salesman can move in the gray colored space. Imagine that the white area is a lake or river. If the salesman moves by car, he should make a roundabout trip to the next city. Fig. 6(b) shows a graph representing the working space generated by the topology representing network [6]. This graph is the input space of the proposed SOM. Fig. 6(c) shows cities arranged on the working space. The number of cities is 50. Fig. 6(d) shows a pathway found by the ordinary SOM. Paths through the obstacles are generated, because the obstacles are not considered. Fig. 6(e) is a pathway generated by the proposed SOM. All paths avoid the obstacles, and a reasonable pathway is obtained.

Fig. 6. Experimental results. (a) Two dimensional Euclidean working space with obstacles. (b) Graph in the limited space generated by a topology representing network (TRN). (c) Cities arranged on the working space. (Black circle: City). (d) Pathway generated by the ordinary SOM. The obstacles can not be considered. (e) Pathway generated by the proposed SOM.
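To make the "roundabout trip" concrete, the cost of a closed tour can be measured with shortest-path distances in the graph instead of straight-line distances. The following is a minimal sketch under assumptions that are not in the paper: the working space is given as an undirected adjacency dictionary, and the names `graph_distance`, `tour_length` and `SQUARE` are hypothetical.

```python
import heapq

def graph_distance(graph, source, target):
    """Shortest-path length between two nodes (Dijkstra with early exit)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist[u]:
            continue  # stale heap entry
        for v, length in graph[u].items():
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # target unreachable

def tour_length(graph, tour):
    """Length of the closed tour, using graph distances between consecutive cities."""
    legs = zip(tour, tour[1:] + tour[:1])
    return sum(graph_distance(graph, a, b) for a, b in legs)

# Hypothetical unit square whose side d-a is blocked by an obstacle.
SQUARE = {"a": {"b": 1.0},
          "b": {"a": 1.0, "c": 1.0},
          "c": {"b": 1.0, "d": 1.0},
          "d": {"c": 1.0}}
print(tour_length(SQUARE, ["a", "b", "c", "d"]))  # 6.0
```

Without the obstacle the same tour would have length 4; the blocked side forces the return leg from d to a to go back through c and b.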
5 Conclusions
In this paper, we proposed a new Self-Organizing Map (SOM) method for an input space represented as a graph. In the proposed method, the distance between an input node and a reference element is defined as the shortest distance in the graph, and the reference elements are updated along the shortest path to the input node. The proposed method was applied to the TSP with obstacles, and it was shown that a reasonable pathway which avoids the obstacles can be obtained. Implementation of a batch learning algorithm is left for future work.
Acknowledgment This work was supported by a Center of Excellence Program center#J19 granted in 2003 to Department of Brain Science and Engineering, (Graduate School of Life Science and Systems Engineering), Kyushu Institute of Technology by Japan Ministry of Education, Culture, Sports, Science and Technology.
References
1. Kohonen, T.: Self-Organizing Formation of Topology Correct Feature Map. Biological Cybernetics 43 (1982) 59-69
2. Kohonen, T.: Self-Organizing Maps. Springer-Verlag (1995)
3. Tokunaga, K., Furukawa, T., Yasui, S.: Modular network SOM: Self-Organizing Maps in Function Space. Neural Information Processing - Letters and Reviews 9 (2005) 15-22
4. Furukawa, T.: SOM of SOMs: Self-Organizing Map Which Maps a Group of Self-Organizing Maps. Lecture Notes in Computer Science 3696 (2005) 391-396
5. Angeniol, B., Vaubois, G., Texier, J.: Self-Organizing Feature Maps and the Traveling Salesman Problem. Neural Networks 1 (1988) 289
6. Martinetz, T., Schulten, K.: Topology Representing Networks. Neural Networks 7 (1994) 507-522
Semi-supervised Learning of Dynamic Self-Organising Maps Arthur Hsu and Saman K. Halgamuge Dynamic Systems and Control Group, Department of Mechanical and Manufacturing Engineering, University of Melbourne, Victoria, Australia 3010 [email protected], [email protected]
Abstract. We present a semi-supervised learning method for the Growing Self-Organising Map (GSOM) that allows fast visualisation of data class structure on the 2D network. Instead of discarding data with missing values, the network can be trained from data with up to 60% of their class labels and 25% of attribute values missing, while still being able to make class predictions with over 90% accuracy for the benchmark datasets used.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 915–924, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

When all information regarding the measurement values and the type of the class are known, supervised learning is the primary technique that is used for building classifiers. The term supervised comes from the fact that when training or building classifiers, the predicted results from the classifier are compared with the known results and the errors are fed back to the classifier to improve the accuracy, like a supervisor guiding the training. In data mining terminology, supervised learning is also referred to as directed data mining. The classification problem has the goal of maximising the generalised classification accuracy, such that high prediction accuracy for both the training data and new data can be obtained. Beyond merely boosting the classification accuracy, it can often be useful to exploit the understanding of the class structure in the labelled data. This can be done by supervised learning of topology-preserving networks like SOM, where the complexity of the class structure, in terms of similarity and degree of overlapping of classes, can be visually identified on the two-dimensional grid.

However, complete data with all entries labelled and without missing measurements are always difficult and expensive to gather. Therefore, it often occurs that the collected data are incomplete, missing either measurement values or labels. In classical supervised learning, these incomplete data entries are discarded, but there are many algorithms that can learn from partially labelled data (a dataset that contains items with both complete and incomplete information) [1][2][3][4] and that combine both unsupervised and supervised learning to make full use of the collected data. In many cases, a semi-supervised algorithm that uses both labelled and unlabelled data improves the performance of the resulting classifier. Therefore learning from partially labelled data has become an important area of research and a recent workshop - ICML
2005 LPCTD (Learning with Partially Classified Training Data) Workshop, Germany - was held with this as the theme.

Previous studies of GSOM [5,6,7,8] have all focused on unsupervised clustering tasks. In this paper, we propose to fuse a modified form of the supervised learning architecture proposed by Fritzke [9] with the Growing Self-Organising Map (GSOM) [10], thus taking advantage of a co-evolving topology-preserving network and the proposed supervised learning network. The modifications made to Fritzke's supervised learning architecture involve changes to the error calculation formula to enable processing of data that have missing labels. After the modifications, the algorithm becomes semi-supervised. Most importantly, when all labels are present it behaves identically to a supervised one, yet when all labels are missing it functions the same as an unsupervised one, thereby maximising the use of all information present in the data.

Three good reasons for using GSOM as the topology-preserving network are:
– dynamic allocation of nodes to accommodate both complex class structure and data similarity;
– a constantly visualisable two-dimensional grid for better and easier understanding of complexity, with overlaps and data structure in the labelled data space;
– SOM has demonstrated the ability to process data with missing measurement values (datasets with up to 25% missing values can still produce good clustering [11]), an ability that is inherited by GSOM.

The remaining sections of this paper are organised as follows. Section 2 describes the GSOM algorithm and the proposed semi-supervised learning architecture. In Section 3, we present the simulation results of the proposed algorithm applied to benchmark datasets, together with a general discussion of the simulation results for each dataset used. Section 4 gives the conclusion and possible future directions for this work.
2 Background and Algorithm

The Dynamic Self-Organising Map (GSOM) [10] is a variant of Kohonen's Self-Organising Map (SOM) that can grow into an adaptive size and shape with controllable spread. Initially, the GSOM network has only one lattice of nodes (e.g. four nodes for a rectangular lattice) as shown in Figure 1(a). The GSOM algorithm consists of three phases of training: a growing phase, a rough-tuning phase and a fine-tuning phase. Prior to training, a growth control variable that is called the Spread Factor (SF, where SF ∈ [0,1] and 0 represents minimum and 1 represents maximum growth) needs to be chosen. SF governs the Growth Threshold (GT) that is given by:

$$ GT = -D \times \ln(SF) \tag{1} $$
where D is the dimensionality of data vectors. GT is the maximum accumulated error that a node can have. In the growing phase, when the winning node is identified, an accumulated error counter E of the winning node is updated by the following rule:

$$ E(t+1) = E(t) + \| I - w_{winner} \| \tag{2} $$
Fig. 1. Growing of GSOM network. (a) Initial. (b) Grow Two Nodes. (c) Grow Three Nodes.
where I is the input vector, $w_{winner}$ is the weight vector of the winning node and the norm $\| I - w_{winner} \|$ is the Euclidean distance between the vectors. When the winning node's accumulated error counter exceeds GT, there can be two conditions:
1. the winning node is on the boundary of the map, then new nodes are inserted to complete the lattice (white circles in Figure 1(b) and 1(c)) around the winning node (node 'Error' in Figure 1). New connections are also created to constrain the new nodes with any node that is in their immediate neighbourhood (grey lines in Figure 1(b) and 1(c)).
2. the winning node is not on the boundary of the map, then half of $E_{winner}$ is distributed to neighbouring nodes (Equation 3) and each of $E_{neighbours}$ receives an equal share of the halved error counter (Equation 4).

$$ E_{winner}(t+1) = \frac{E_{winner}(t)}{2} \tag{3} $$

$$ E_{neighbours}(t+1) = E_{neighbours}(t) + \frac{1}{N} \times \frac{E_{winner}(t)}{2} \tag{4} $$

where N is the number of immediate neighbours of the winning node. After the growing phase, the map should reach an appropriate size, as specified by SF. Then GSOM follows the Kohonen SOM's learning rule in the tuning phases.

As a newer development, the GSOM algorithm combines a number of merits from different incremental self-organising networks that have been proposed. Firstly, it maintains the two-dimensional grid of Kohonen's SOM for easier data visualisation and topology preservation. Secondly, it increments the network size parsimoniously, similar to Fritzke's Growing Cell Structures (GCS) [9]. Only a few new nodes are created at once, thus imposing a low computational requirement. Thirdly, the insertion of nodes occurs only when required, at the node with the highest accumulated error that has exceeded a predefined threshold.

The proposed semi-supervised learning algorithm for GSOM is based on Fritzke's supervised learning architecture that is used on GCS. This supervised learning architecture has been chosen due to a number of advantages. Firstly, both GSOM and GCS are incrementally growing networks, so the supervised learning algorithm is highly compatible with GSOM. Secondly, the learning of both the supervised layer and the topology-preserving map occur simultaneously, with the supervised layer governing the size of the feature map. Thirdly, the delta learning rule that is used for training the supervised layer is fast and effective, and thus does not impose too much computational cost.
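The growth bookkeeping of Equations 1 to 4 can be sketched as follows. This is an illustrative sketch only: the function names and the dictionary of per-node error counters are assumptions made for the example, and the insertion of new boundary nodes is not shown.

```python
import math

def growth_threshold(dim, spread_factor):
    """Equation 1: GT = -D * ln(SF), for 0 < SF <= 1."""
    return -dim * math.log(spread_factor)

def accumulate_error(error, input_vec, winner_weights):
    """Equation 2: add the Euclidean distance between input and winner."""
    return error + math.dist(input_vec, winner_weights)

def distribute_error(errors, winner, neighbours):
    """Equations 3 and 4: the winner keeps half of its error counter and the
    other half is shared equally among its N immediate neighbours."""
    half = errors[winner] / 2.0
    share = half / len(neighbours)
    errors[winner] = half
    for node in neighbours:
        errors[node] += share

# Hypothetical usage: 4-dimensional data and a fairly low spread factor.
gt = growth_threshold(4, 0.1)
errors = {"n1": 0.0, "n2": 1.0, "n3": 0.0}
errors["n1"] = accumulate_error(errors["n1"], [0.0, 0.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
if errors["n1"] > gt:  # not the case here: 5.0 is below -4 * ln(0.1)
    distribute_error(errors, "n1", ["n2", "n3"])
```

A smaller SF gives a larger GT, so nodes must accumulate more error before the map grows, which matches the description of SF = 0 as minimum growth.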
The architecture of the semi-supervised learning layer attached to a GSOM in its initial stage with a rectangular lattice is illustrated in Figure 2. Every node c on the GSOM has a Gaussian function and a set of weights $w_{c1} \cdots w_{cm}$ connected to the m output nodes for classification. The width of the Gaussian function, $\sigma_c$, is defined by the average distance from node c to its topologically adjacent neighbours. There are $A \times m$ links between the GSOM layer and the output layer (where A is the total number of nodes on the GSOM at any time), and the weights $w_{cm}$ are adjusted by the semi-supervised learning algorithm to increase the output activation in the target class and decrease the activation in other classes. When new nodes are added to the GSOM during training, new connections are also created to link the newly attached nodes to the output layer. These new connections are not initialised with random, zero or median values, but follow the same interpolation from neighbouring nodes as the new nodes' weight initialisation.
Fig. 2. Illustration of Supervised GSOM
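The forward pass through this architecture can be sketched as below, using the Gaussian activation and weighted-sum class score that this section gives as Equations 5 and 6. It is a minimal sketch, not the authors' code: the names `activation` and `class_scores`, the list-based data layout, and the use of `None` to mark a missing attribute are all assumptions made for the example.

```python
import math

def activation(x, w, sigma):
    """Gaussian activation of one GSOM node (cf. Equation 5).
    Components of x that are None (missing) are skipped."""
    sq_dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w) if xi is not None)
    return math.exp(-sq_dist / sigma ** 2)

def class_scores(x, node_weights, sigmas, out_weights):
    """Score of every output class (cf. Equation 6).
    out_weights[c][k] is the link weight from GSOM node c to output class k."""
    acts = [activation(x, w, s) for w, s in zip(node_weights, sigmas)]
    n_classes = len(out_weights[0])
    scores = [sum(out_weights[c][k] * acts[c] for c in range(len(acts)))
              for k in range(n_classes)]
    return acts, scores

# Hypothetical two-node map with two output classes.
node_weights = [[0.0, 0.0], [1.0, 0.0]]
sigmas = [1.0, 1.0]
out_weights = [[1.0, 0.0], [0.0, 1.0]]
acts, scores = class_scores([0.0, 0.0], node_weights, sigmas, out_weights)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

Returning the activations alongside the scores is a convenience of the sketch, since both are needed again when the output weights are trained.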
Training of the semi-supervised learning algorithm will now be described in further detail, given the input space ($\xi \in \mathbb{R}^n$) where we have n attributes and an output space having m known classes. In each iteration, the co-evolving GSOM and semi-supervised layers perform the following computations:

1. An input I is drawn from $\xi$ and presented to the GSOM network.

2. The input vector is compared with the weight vectors of the GSOM nodes to find the best matching node (winner). When missing attributes are encountered, the missing component(s) in the vector is (are) ignored when computing similarity.

3. The activation level of each GSOM node c, $o_c(I)$, is computed using Equation 5,

$$ o_c(I) = \exp\left( -\frac{\| I - w_c \|^2}{\sigma_c^2} \right) \tag{5} $$

where $w_c$ is the weight vector of neuron c and $\sigma_c$ is the 'width of activation' of the exponential function, which takes the value of the average distance from neuron c to all its direct neighbours. Then, the prediction score for each class m in the output layer, $O_m$, can be evaluated by using Equation 6,

$$ O_m = \sum_{c=1}^{A} w_{cm}\, o_c \tag{6} $$

where A is the current size of the GSOM network and $w_{cm}$ is the weight of the link between GSOM node c and output node m. Again, missing components of the input vector are ignored. By doing so, the activation level will be lower for data with missing attributes, which is correct due to the fact that the confidence of prediction is also lower. The class that has the maximum activation level $O_m$ is the predicted class.

4. Depending on whether the input vector's class label is missing and on the correctness of the prediction, the accumulative error counter of the winning neuron is updated using Equation 7.

$$ \Delta E_{winner} = \begin{cases} 0 & \text{Case 1: has label and correct prediction} \\ \sqrt{D}/2 & \text{Case 2: has label and wrong prediction} \\ \| I - w_{winner} \| & \text{Case 3: missing label} \end{cases} \tag{7} $$
The increment value of the accumulative error counter is selected to reflect and conform with the definition of the Growth Threshold (GT). Case 1 in Equation 7 deals with correctly predicted results, where the GSOM does not require an increase of resolution to resolve conflicting predictions, thus a zero value restrains the GSOM from growing. Case 2 in Equation 7 is used when misclassification occurs. Since the maximum error an input can induce on a node is $\sqrt{D}$ (D is the dimensionality of the input space), we take the median error $\sqrt{D}/2$ as the error induced when the prediction is incorrect, to encourage map growth. Finally, Case 3 in Equation 7 is used when the input data does not have a class label and cannot be compared to the predicted result. In this case, the accumulated error counter is incremented by $\| I - w_{winner} \|$. Considering that $\| I - w_{winner} \|$ will be less than $\sqrt{D}/2$, particularly toward the second half of training when the map is better self-organised and stochastically approaching the input vectors, this value only promotes slow map growth. Furthermore, if all labels are missing, the resulting map will be identical to that of unsupervised learning. Though prediction is no longer possible when this occurs, the semi-supervised learning algorithm can always be used for training GSOM, thus making use of all information in the dataset.

5. When the error counter exceeds the GT and the algorithm is in the growing phase, new GSOM nodes are grown and links from the newly grown GSOM nodes are attached to the output layer. This is shown in Figure 2 by the dotted lines that link the newly created GSOM nodes n5 and n6 to the output layer.

6. Weight vectors of nodes within the neighbourhood kernel are updated in the same way as in unsupervised learning.

7. $\sigma_c$ of the GSOM nodes whose weights have been modified are updated (since the average distance to direct neighbours has also changed).

8. The learning rate $\alpha$ and the neighbourhood kernel radius for the GSOM network are decreased.

9. Weights of the links to the output nodes are updated $\lambda$ times (where $\lambda$ is a predefined integer constant that is typically ≤ 5), using the simple delta rule in Equation 8,

$$ \Delta w_{cm} = \eta\,(\zeta_m - O_m)\,D_c \tag{8} $$
where $\eta$ is the delta rule learning rate (typically $\eta = 0.1$) and $\zeta_m$ is the level of the desired output activation for class m, where $\zeta_i = 1$ if the input vector is of class i and $\zeta_m = 0$ when $m \neq i$ or the input class label is missing. The GSOM layer and the semi-supervised learning layer operate independently to some degree: steps 1, 2, 5, 6 and 8 are computed for the GSOM layer, and steps 3, 7 and 9 are semi-supervised learning layer operations. The two layers are brought together by the accumulative error update that governs the growth of the GSOM (step 4). While the GSOM layer provides topology preservation of the input space, the semi-supervised components classify the inputs in a localised area near the winning node, due to the decreased activation level far from the winning node.
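Steps 4 and 9 can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, a missing label is represented by `None`, and $D_c$ in Equation 8 is assumed to denote the activation $o_c$ of GSOM node c, which the text does not state explicitly. The sketch also assumes that the class scores are recomputed before each of the λ repetitions.

```python
import math

def error_increment(dim, dist_to_winner, label, predicted):
    """Equation 7: increment of the winner's accumulative error counter."""
    if label is None:
        return dist_to_winner        # Case 3: missing label
    if label == predicted:
        return 0.0                   # Case 1: labelled, correct prediction
    return math.sqrt(dim) / 2.0      # Case 2: labelled, wrong prediction

def train_output_layer(out_weights, acts, label, eta=0.1, repeats=3):
    """Equation 8 applied `repeats` (lambda) times, in place.
    out_weights[c][k] links GSOM node c to output class k; acts[c] is the
    activation of node c (assumed here to be D_c)."""
    n_classes = len(out_weights[0])
    # Target is 1 for the labelled class and 0 otherwise (all 0 if unlabelled).
    targets = [1.0 if label is not None and k == label else 0.0
               for k in range(n_classes)]
    for _ in range(repeats):
        scores = [sum(out_weights[c][k] * acts[c] for c in range(len(acts)))
                  for k in range(n_classes)]
        for c, a in enumerate(acts):
            for k in range(n_classes):
                out_weights[c][k] += eta * (targets[k] - scores[k]) * a

# Hypothetical single-node example with two classes and a class-0 input.
weights = [[0.0, 0.0]]
train_output_layer(weights, acts=[1.0], label=0)
```

With one fully active node, the weight toward the labelled class moves from 0 to 0.1, 0.19 and 0.271 over the three repetitions, while the other weight stays at 0.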
3 Results and Discussions

Prior to testing the semi-supervised learning algorithm, we will first illustrate the visualisation of class structures in Section 3.1, which is the result of combining topology preservation, data visualisation and classification (fully supervised learning, with no missing data). Since this dataset is only for illustration purposes, no testing set is generated. Later in this section, the Iris flower dataset [12] is used to test the semi-supervised learning algorithm. The original dataset is complete with no missing attributes or labels. Furthermore, to demonstrate the semi-supervised learning and the ability to handle missing attributes, three additional copies of the training and testing data are made. The first copy has randomly masked attribute values, the second contains randomly masked labels and the third contains both missing attribute values and labels. The learning rate $\eta$ of the weight links to the supervised layer is 0.1, and the number of times $\lambda$ to perform delta rule learning is 3 for all simulations in this paper.

3.1 Two Spirals Benchmark Dataset

This benchmark dataset is recreated from Carnegie Mellon University's benchmark collection [13] as shown in Figure 3.
Fig. 3. Two Spirals Dataset; Black - Spiral 1, Gray - Spiral 2
This dataset creates sufficient complexity in the class structure, which is ideal to demonstrate the ability of class structure visualisation of GSOM with the supervised learning component of the proposed algorithm. The same dataset was also used by Fritzke in his work to illustrate the supervised GCS algorithm [9].
After thirty epochs of the growing phase and 60 epochs of the tuning phase, a total of ninety epochs was used to achieve 100% classification and our simulation produced very smooth decision regions (Figure 4(a)), similar to ones in [9]. While it took 180 epochs to reach the results reported in Fritzke’s simulation, GSOM with supervised learning only used half the training time. Additionally, the two-dimensional grid of GSOM (Figure 4(b)), which has the nodes coloured to show their class, also presents two vivid spirals that encircle the centre three times.
Fig. 4. Supervised Learning using Two Spirals Dataset. (a) Decision Regions. (b) Class Structure Visualisation.
3.2 Iris Dataset

The Iris flower dataset used for clustering tests in Chapter 5 is a labelled dataset and we will use it here again in the context of semi-supervised learning. The dataset contains three classes and 150 entries, with fifty entries for each class of Iris flower. Following the usual practice of splitting data into equal halves, the original dataset is divided into training and testing datasets. When splitting the dataset, 25 entries are randomly selected from each class without replacement to form the training data, and the remaining entries are the testing data.

As the objective is to demonstrate the semi-supervised learning algorithm, the training data is modified to contain both missing attribute values and data labels. To mask attribute values, 20% of data entries (fifteen in total, five from each class) are randomly selected and have at least one, and up to 25%, of their randomly selected attributes masked. To mask data labels, 20% of data entries (again, fifteen in total, five from each class) are randomly selected and have their class labels masked. Both masking operations are applied independently to the data. This implies that a data entry with a missing attribute value can also have its class label missing. All masking operations used in this paper follow the same routine of randomly selecting data entries, attribute values and data labels for masking.

For comparison and benchmarking purposes, a classifier is trained from the complete data and the class visualisation is presented in Figure 5(a). The classification accuracies for the training and testing data are 98.67% and 97.33% respectively. The class visualisation shows reasonably well-separated classes, despite slight confusion between Iris-Versicolor and Iris-Virginica, whose data do contain more complex and linearly non-separable entries.
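The masking routine described above can be sketched as follows. This is an assumed reconstruction for illustration, not the authors' script: the function name, the use of `None` for a masked value, the fixed random seed and the synthetic placeholder data (which is not the real Iris data) are all choices made for the example.

```python
import random

def mask_training_data(rows, labels, per_class=5, max_attr_fraction=0.25, seed=0):
    """Return masked copies of rows and labels.
    For every class, `per_class` entries get between 1 and
    max_attr_fraction * n_attributes attributes set to None, and,
    independently, `per_class` entries get their label set to None."""
    rng = random.Random(seed)
    masked_rows = [list(r) for r in rows]   # copies, inputs stay untouched
    masked_labels = list(labels)
    n_attr = len(rows[0])
    max_masked = max(1, int(max_attr_fraction * n_attr))
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for i in rng.sample(members, per_class):        # attribute masking
            how_many = rng.randint(1, max_masked)
            for j in rng.sample(range(n_attr), how_many):
                masked_rows[i][j] = None
        for i in rng.sample(members, per_class):        # label masking
            masked_labels[i] = None
    return masked_rows, masked_labels

# Placeholder training set: 3 classes x 25 entries x 4 attributes.
rows = [[float(i), 1.0, 2.0, 3.0] for i in range(75)]
labels = [i // 25 for i in range(75)]
masked_rows, masked_labels = mask_training_data(rows, labels)
```

With four attributes, "up to 25%" means exactly one masked attribute per selected entry. Because the two selections are drawn independently, an entry can end up with both a masked attribute and a masked label.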
We experimented with different types of masking on the training data, to verify the feasibility of the proposed semi-supervised learning. Firstly, we tested whether masking of a specific class significantly influences the outcome of classification. Secondly, the masking process is split into three sets: Set 1 - masking only attribute values, Set 2 - masking only data labels, Set 3 - masking both attribute values and data labels. While Sets 1 and 2 are masked independently, Set 3 is formed by combining Sets 1 and 2, i.e. the same missing values in Sets 1 and 2 are present in Set 3. We also create a fourth set of training data (Set 4) by discarding any data entry that has a missing value in Set 3. This illustrates the capability of the semi-supervised learning algorithm in handling missing attributes and/or data labels and allows us to compare with the traditional approach of discarding data entries with missing values. All trained classifiers are tested on the same testing data.

We produce the four sets mentioned above for each of the Iris-Setosa, Iris-Virginica and Iris-Versicolor classes. The semi-supervised learning algorithm produces promising results, with all classification accuracies well above 90% and out-performing the approach of discarding masked data. This shows that the semi-supervised learning algorithm is effective in handling missing data. Also, missing attributes of up to 25% do not have a significant impact on classification. Since only 2 out of 4 attributes in the Iris dataset [12] are sufficient to give > 90% classification accuracy, the likelihood of having all critical attributes masked is very low.

The semi-supervised algorithm is now tested by using training data that contains missing values for all classes. The training data is created by combining the training sets in the previous tests. For example, Set 1 here contains all missing values in the Set 1s for Iris-Setosa, Iris-Virginica and Iris-Versicolor.
This is to ensure that we have a controlled testing set up and to determine if the trained classifier will be compromised by more missing values in more classes.
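The traditional baseline, Set 4, amounts to a simple filter over the masked training data. A minimal sketch, assuming that masked attribute values and masked labels are represented by `None` and using a hypothetical function name:

```python
def discard_incomplete(rows, labels):
    """Keep only entries with a label and without any missing attribute."""
    kept = [(r, y) for r, y in zip(rows, labels)
            if y is not None and all(v is not None for v in r)]
    return [r for r, _ in kept], [y for _, y in kept]

# Hypothetical three-entry example: only the first entry is complete.
rows = [[5.1, 3.5], [None, 3.0], [6.2, 2.9]]
labels = ["setosa", "versicolor", None]
complete_rows, complete_labels = discard_incomplete(rows, labels)
```

Every entry removed here is one that the semi-supervised algorithm would still have used, either through its remaining attributes or through its unlabelled position in the input space.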
Fig. 5. Semi-supervised learning using iris training data with all classes containing masked data: white - Iris-Versicolor, grey - Iris-Virginica, black - Iris-Setosa. (a) Complete Data. (b) Missing Attributes. (c) Missing Labels. (d) Both Missing. (e) Discard Masked.
The class visualisation results are shown in Figure 5 and classification results are tabulated in Table 1. The class visualisations have more 'contaminated' class boundaries for Iris-Versicolor and Iris-Virginica, which is most likely due to the substantially increased amount of missing values in the data. However, the classification accuracies, though lower than in previous tests of masking individual classes, are still above 90%.

Table 1. Classification results of the iris dataset with all classes containing masked data

Datasets                  Training Set   Testing Set
Complete Data             98.67%         97.33%
Mask Attributes           98.67%         98.67%
Mask Labels               98.67%         94.67%
Mask Both                 98.67%         93.33%
Discard Masked            97.83%         92.00%
Mask Both (40% labels)    86.67%         85.33%
Mask Both (60% labels)    82.67%         76.00%

As missing attributes and/or class labels in all classes are the more realistic type of incomplete data, we shall compare the classification accuracies in more detail. Recall that the classification accuracies for the training and testing data for the fully-supervised learning are 98.67% and 97.33% respectively. The semi-supervised learning achieved 98.67% and 93.33% for training and testing data respectively when there are missing attribute values and class labels. The same classification accuracy was attained on the training set, but the accuracy on the testing set is lower when there is missing data. This indicates that although semi-supervised learning can make use of incomplete information, the missing data will still deteriorate the final classification, especially when used for prediction. Nevertheless, semi-supervised learning is generally better than discarding data entries with missing values.

Furthermore, we also attempt to increase the number of missing labels in the training set. Two additional training sets are created, also with 20% of data containing up to 25% missing attributes, with 40% and 60% of data having their class labels missing. The results are also included in Table 1. They show decreasing classification accuracy as the number of missing labels increases, which is as expected, as the number of labelled samples becomes insufficient to build an accurate classifier.
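The testing accuracies can be read as counts of correctly classified entries. The short sketch below does this arithmetic under one assumption taken from Section 3.2, namely that the testing set holds 75 entries (half of the 150 Iris entries); the function name is hypothetical.

```python
def correct_count(accuracy_percent, n_entries=75):
    """Number of correctly classified entries implied by a reported accuracy."""
    return round(accuracy_percent / 100.0 * n_entries)

# Testing-set accuracies as reported in Table 1.
testing_accuracy = {
    "Complete Data": 97.33,
    "Mask Both": 93.33,
    "Discard Masked": 92.00,
    "Mask Both (60% labels)": 76.00,
}
counts = {name: correct_count(acc) for name, acc in testing_accuracy.items()}
```

Under that assumption the gap between complete data (73 of 75) and masking both attributes and labels (70 of 75) is three testing entries, and discarding masked data (69 of 75) loses one more.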
4 Conclusions

In this paper, we present a semi-supervised learning algorithm for GSOM which is tested against benchmark data. The proposed semi-supervised learning algorithm can train classifiers from incomplete data and provide class structure visualisation. Incompleteness of data, in the form of masked attribute values and/or class labels, is introduced in the datasets and the classification results are compared to the results obtained when using complete data and when discarding missing data. When the percentage of masked data is no more than 20%, the classification accuracies for the training data are generally high, all above 90% accuracy. The proposed semi-supervised learning algorithm gives results that are comparable to fully supervised learning and, in most cases, better than discarding missing data. Due to the simplicity of the learning algorithm, there is little computational overhead. The decision boundary visualisation is usually more computationally costly, but is not necessary when the GSOM provides a 2D visualisation of class structure. However, by including more data with missing values, class visualisation on the GSOM grid can be compromised. Furthermore, processing of data with very few class labels can result in false or highly uncertain predictions. As a future development, it
is desirable to be able to evaluate the confidence of prediction, such that nodes can be labelled as unknown when the confidence is low.
References
1. M.-R. Amini and P. Gallinari, "The use of unlabeled data to improve supervised learning for text summarization," in Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 105–113, 2002.
2. T. Jaakkola and M. Szummer, "Information regularization with partially labeled data," in Advances in Neural Information Processing Systems 15, 2002.
3. I. Cohen, N. Sebe, F. G. Cozman, and T. S. Huang, "Semi-supervised learning for facial expression recognition," in Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, (Berkeley, California), pp. 17–22, 2003.
4. M. Belkin and P. Niyogi, "Using manifold structure for partially labelled classification," in Proceedings of Advances in Neural Information Processing Systems, vol. 15, 2003.
5. L. Wickramasinghe and L. Alahakoon, "Dynamic self organizing maps for discovery and sharing of knowledge in multi agent systems," Web Intelligence and Agent Systems: An International Journal, vol. 3, no. 1, 2005.
6. D. Alahakoon, "Controlling the spread of dynamic self organising maps," Neural Computing and Applications, vol. 13, no. 2, pp. 168–174, 2004.
7. A. Hsu and S. K. Halgamuge, "Enhancement of topology preservation and hierarchical dynamic self-organising maps for data visualisation," International Journal of Approximate Reasoning, vol. 23, no. 2-3, pp. 259–279, 2003.
8. A. Hsu, S. Tang, and S. K. Halgamuge, "An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data," Bioinformatics, vol. 19, no. 16, pp. 2131–2140, 2003.
9. B. Fritzke, "Growing cell structures - a self-organising network for unsupervised and supervised learning," Neural Networks, vol. 7, no. 9, pp. 1441–1460, 1994.
10. D. Alahakoon, S. K. Halgamuge, and B. Srinivasan, "Dynamic self-organising maps with controlled growth for knowledge discovery," IEEE Transactions on Neural Networks, Special Issue on Knowledge Discovery and Data Mining, vol. 11, no. 3, 2000.
11. S. Kaski and T. Kohonen, "Exploratory data analysis by the self-organizing map: Structures of welfare and poverty in the world," in Proceedings of the Third International Conference on Neural Networks in the Capital Markets, (Singapore), 1996.
12. C. L. Blake and C. J. Merz, "UCI repository of machine learning databases," 1998.
13. S. E. Fahlman, "CMU benchmark collection of benchmark problems for neural-net learning algorithms," 1993.
Self-organizing by Information Maximization: Realizing Self-Organizing Maps by Information-Theoretic Competitive Learning Ryotaro Kamimura Information Science Laboratory Information Technology Center, Tokai University, 1117 Kitakaname Hiratsuka Kanagawa 259-1292, Japan [email protected]
Abstract. The present paper shows that a self-organizing process can be realized simply by maximizing information between input patterns and competitive units. We have already shown that information maximization corresponds to competitive processes. Thus, if cooperation processes can be incorporated in information maximization, self-organizing maps can naturally be realized by information maximization. By using the weighted sum of distances among neurons, or collective distance, we successfully incorporate cooperation processes in the main mechanism of information maximization. To compare our method with the standard SOM, we applied the method to well-known artificial data and show that clear feature maps can be obtained by maximizing information.

Keywords: mutual information maximization, competition, cooperation, self-organizing maps, collective distance, collective activation, collective information.
1 Introduction
Information-theoretic approaches have had significant influence on the development of neural computing [1],[2],[3],[4]. Among others, Linsker's infomax principle [5] has had significant impact on neural computing. Linsker stated that living systems try to preserve as much information as possible. Though this hypothesis seems to be reasonable, his principle has not yet provided simple and practical procedures for maximizing information in neural networks [6], [7]. In this context, the main contribution of this paper to neural computing is to show that a self-organizing map can be realized simply by maximizing mutual information between input patterns and connection weights. We have demonstrated that information maximization corresponds to competitive learning [8], [9], [10]. If information is completely maximized, information maximization is equivalent to competitive learning with the winner-takes-all algorithm. On the other hand, if information is not so large, a very soft type of competition appears in which many different neurons compete with each other. Thus, information maximization can provide a flexible type of competitive learning. Then, if we take into account information on neighboring neurons, it is possible to realize self-organizing processes by maximizing information. To incorporate information on neighboring neurons, we treat neighboring neurons collectively. Thus,
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 925–934, 2006. © Springer-Verlag Berlin Heidelberg 2006
information content obtained by taking into account neighboring neurons reflects the responses of all neighboring neurons to input patterns. We call this information reflecting neighboring neurons “collective information.” The activation of a neuron is considered to be the weighted sum of the activations of individual neighboring neurons. By this weighted sum, a process of cooperation is automatically built into the main mechanism of information maximization, and a self-organizing process can be considered as a process of information maximization or, more exactly, collective information maximization.
2 Theory and Computational Methods
Competition is realized by maximizing mutual information between input patterns and competitive units. As shown in Figure 1, the $s$th input pattern with $L$ elements, represented by $x_k^s$, is given to a network. There are $M$ competitive units, and connection weights from the $k$th input unit to the $j$th competitive unit are represented by $w_{jk}$. Competitive unit activation is represented by $v_j^s$, and the normalized one is given by $p(j \mid s)$. The normalized activations can be considered to represent the probability of the $j$th neuron's firing, given the $s$th input pattern. Information is represented by the decrease in uncertainty of competitive units on receiving input patterns [11], [12], [13]. Now, suppose $p(j)$ and $p(s)$ denote the probability of firing of the $j$th unit and the probability of the $s$th input pattern, respectively. Information $I$ is defined by

$$ I = -\sum_{j=1}^{M} p(j)\log p(j) + \sum_{s=1}^{S}\sum_{j=1}^{M} p(s)\,p(j \mid s)\log p(j \mid s). \tag{1} $$
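Fig. 1. A network architecture for information maximization. [Figure: $S$ input patterns $x_k^s$ are presented to $L$ input units, which are connected through weights $w_{jk}$ to $M$ competitive units with activations $v_j^s$ and normalized outputs $p(j \mid s)$.]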
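As a minimal sketch of Eq. (1), the following Python function computes the information content from a table of firing probabilities. The function name `mutual_information`, the uniform default for p(s), and the convention 0 log 0 = 0 are assumptions made for illustration; the marginal p(j) is taken to be the usual sum over patterns of p(s) p(j|s), which the text does not state explicitly.

```python
import math

def mutual_information(p_j_given_s, p_s=None):
    """Information I of Eq. (1).

    p_j_given_s[s][j] is p(j|s); p_s[s] is p(s) (uniform if omitted,
    which is an assumption of this sketch).
    """
    S = len(p_j_given_s)
    M = len(p_j_given_s[0])
    if p_s is None:
        p_s = [1.0 / S] * S
    # Marginal firing probability p(j) = sum_s p(s) p(j|s).
    p_j = [sum(p_s[s] * p_j_given_s[s][j] for s in range(S)) for j in range(M)]

    def xlogx(p):
        # Convention 0 log 0 = 0 (assumption of this sketch).
        return p * math.log(p) if p > 0.0 else 0.0

    first = -sum(xlogx(p) for p in p_j)
    second = sum(p_s[s] * xlogx(p_j_given_s[s][j])
                 for s in range(S) for j in range(M))
    return first + second

hard = [[1.0, 0.0], [0.0, 1.0]]  # winner-takes-all responses
soft = [[0.5, 0.5], [0.5, 0.5]]  # units do not discriminate the patterns
print(mutual_information(hard), mutual_information(soft))
```

With two patterns and two units, the winner-takes-all table gives log 2, the largest value possible for this size, while the undiscriminating table gives zero (up to rounding), which matches the hard and soft ends of competition described in the introduction.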
To compute competitive unit activations, we must consider the effect of neighboring neurons, that is, neurons must behave similarly to neighboring neurons. Following
Kohonen [14], we introduce the lateral distance $d_{jm}$ between the $j$th and $m$th neurons. Suppose that the discrete vector $\mathbf{r}_j$ denotes the position of the $j$th neuron in a two-dimensional lattice; then the squared distance is defined by

$$ d_{jm} = \lVert \mathbf{r}_j - \mathbf{r}_m \rVert^2. \tag{2} $$

By using this distance function, we define a topological neighborhood function. To define the function, we use the Gaussian function with the parameter $\sigma$ representing the effective width of the topological neighborhood. Then, the topological neighborhood function is defined by

$$ \phi_{jm} = \exp\left(-\frac{d_{jm}}{2\sigma^2}\right). \tag{3} $$

The distance between a neuron and an input pattern is considered to be the weighted sum of the distances of all neighboring neurons. We call this distance, which takes neighboring neurons into account, the “collective distance.” Thus, the collective distance is defined by

$$ D_j^s = \sum_{m=1}^{M} \phi_{jm} \sum_{k=1}^{L} (x_k^s - w_{mk})^2. \tag{4} $$

The distance between the $j$th neuron and the $s$th input pattern is thus the weighted sum over all neurons surrounding the $j$th neuron. By using this collective distance, the competitive unit activation is defined by

$$ v_j^s = \frac{1}{D_j^s}. \tag{5} $$

This equation means that the smaller the distance between input patterns and connection weights, the stronger the activation. The probability of firing of the $j$th neuron is obtained by normalizing the activation and is defined by

$$ p(j \mid s) = \frac{v_j^s}{\sum_{m=1}^{M} v_m^s}. \tag{6} $$
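The chain from lattice distance to firing probability in Eqs. (2)-(6) can be sketched in plain Python as follows. The function names, the small `eps` guard against division by zero, and the toy one-dimensional lattice in the usage lines are assumptions for illustration and are not taken from the paper.

```python
import math

def lattice_positions(rows, cols):
    """Positions r_j of the neurons on a rows x cols lattice."""
    return [(r, c) for r in range(rows) for c in range(cols)]

def neighborhood(positions, sigma):
    """phi[j][m] = exp(-d_jm / (2 sigma^2)) with d_jm = ||r_j - r_m||^2 (Eqs. 2-3)."""
    phi = []
    for rj in positions:
        row = []
        for rm in positions:
            d_jm = sum((a - b) ** 2 for a, b in zip(rj, rm))
            row.append(math.exp(-d_jm / (2.0 * sigma ** 2)))
        phi.append(row)
    return phi

def firing_probabilities(x, weights, phi, eps=1e-12):
    """Return (p, D): p[j] = p(j|s) and D[j] = D_j^s for one pattern x (Eqs. 4-6)."""
    M = len(weights)
    # Plain squared Euclidean distance of every unit m to the input.
    sq = [sum((xk - wk) ** 2 for xk, wk in zip(x, w)) for w in weights]
    # Collective distance: neighborhood-weighted sum of those distances.
    D = [sum(phi[j][m] * sq[m] for m in range(M)) for j in range(M)]
    # Inverse-distance activation; eps is an assumption to avoid 1/0.
    v = [1.0 / (d + eps) for d in D]
    total = sum(v)
    return [vj / total for vj in v], D

# Toy usage: three units on a 1 x 3 lattice, one-dimensional input.
phi = neighborhood(lattice_positions(1, 3), sigma=0.5)
p, D = firing_probabilities([0.0], [[0.0], [0.5], [1.0]], phi)
print(p)
```

Because each unit's distance also includes contributions from its lattice neighbors, a unit next to a well-matched unit receives a higher firing probability than it would from its own weights alone, which is how cooperation enters the competition.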
Because input patterns are given uniformly to networks, information is defined by

$$ I = -\sum_{j=1}^{M} p(j)\log p(j) + \frac{1}{S}\sum_{s=1}^{S}\sum_{j=1}^{M} p(j \mid s)\log p(j \mid s). \tag{7} $$

By differentiating the information function and adding the learning parameter $\beta$, we have the final update rules:

$$ \Delta w_{jk} = -\beta \sum_{s=1}^{S}\left(\log p(j) - \sum_{m=1}^{M} p(m \mid s)\log p(m)\right) Q_{jk}^s + \beta \sum_{s=1}^{S}\left(\log p(j \mid s) - \sum_{m=1}^{M} p(m \mid s)\log p(m \mid s)\right) Q_{jk}^s, \tag{8} $$
where

$$ Q_{jk}^s = \frac{2\,p(j \mid s)\,(x_k^s - w_{jk})}{S\,D_j^s}. \tag{9} $$
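A minimal sketch of one batch evaluation of the update rule in Eqs. (8) and (9) is given below. The function name `weight_updates` and the data layout are assumptions, and all probabilities are assumed strictly positive so that the logarithms are defined. In the toy usage the distance table `D` is the plain squared distance, that is, the neighborhood is switched off, purely to keep the example short.

```python
import math

def weight_updates(X, W, P, D, beta):
    """Delta w_jk of Eq. (8).

    X[s][k]: input patterns, W[j][k]: weights,
    P[s][j]: p(j|s), D[s][j]: collective distance D_j^s.
    """
    S, M, L = len(X), len(W), len(W[0])
    # p(j) under uniformly presented patterns.
    p_j = [sum(P[s][j] for s in range(S)) / S for j in range(M)]
    dW = [[0.0] * L for _ in range(M)]
    for s in range(S):
        # Expectations over units, shared by every j for this pattern.
        mean_log_pj = sum(P[s][m] * math.log(p_j[m]) for m in range(M))
        mean_log_pjs = sum(P[s][m] * math.log(P[s][m]) for m in range(M))
        for j in range(M):
            first = math.log(p_j[j]) - mean_log_pj
            second = math.log(P[s][j]) - mean_log_pjs
            for k in range(L):
                Q = 2.0 * P[s][j] * (X[s][k] - W[j][k]) / (S * D[s][j])  # Eq. (9)
                dW[j][k] += -beta * first * Q + beta * second * Q
    return dW

# Toy usage: two patterns, two units, one input dimension.
X = [[0.0], [1.0]]
W = [[0.2], [0.8]]
D = [[sum((xk - wk) ** 2 for xk, wk in zip(x, w)) for w in W] for x in X]
P = [[(1.0 / d) / sum(1.0 / e for e in row) for d in row] for row in D]
dW = weight_updates(X, W, P, D, beta=0.01)
print(dW)
```

In this symmetric toy case the first unit is pulled toward the pattern at 0 and the second toward the pattern at 1, so each unit moves toward the pattern it already responds to most strongly.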
For actual experiments, we changed the parameters for cooperative processes following the method by M. M. Van Hulle [15], because more stable learning could be obtained. In this method, the Gaussian width becomes smaller as time goes on. Suppose that $\sigma(0)$ is half of the linear dimension of the lattice and $t_{max}$ is the maximum number of epochs; then the Gaussian width is defined by

$$ \sigma(t) = \sigma(0)\exp\left(-2\sigma(0)\frac{t}{t_{max}}\right). \tag{10} $$

The Gaussian width is gradually decreased from the initial value of $\sigma(0)$ to smaller values as time goes on.
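The schedule of Eq. (10) is short enough to sketch directly. The function name `gaussian_width` and the example values (a lattice of linear dimension 5, hence an initial width of 2.5, and 500 epochs) are assumptions chosen only for illustration.

```python
import math

def gaussian_width(t, t_max, sigma0):
    """sigma(t) = sigma(0) * exp(-2 * sigma(0) * t / t_max), Eq. (10)."""
    return sigma0 * math.exp(-2.0 * sigma0 * t / t_max)

sigma0 = 2.5   # half of an assumed lattice dimension of 5
t_max = 500    # assumed maximum number of epochs
widths = [gaussian_width(t, t_max, sigma0) for t in range(0, t_max + 1, 100)]
print(widths)
```

Note that the decay rate itself scales with the initial width, so a larger lattice starts with a wider neighborhood but also shrinks it faster in relative terms.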
3 Results and Discussion
To compare the results of our method easily with those of the conventional self-organizing map, we use three well-known artificial data sets. The first data set is shown in Figure 3(a), and the learning parameter β was set to 0.01¹. The maximum number of training epochs was set to 500. Figure 2 shows information as a function of the number of epochs.

Fig. 2. Information as a function of the number of epochs for the artificial data. The parameter β is 0.01. [Plot: information (0 to 0.8) against epoch (0 to 350).]

¹ This value was chosen because some oscillation in the process of information maximization was observed when the parameter was larger. However, the oscillation was observed only in the later stage of learning. This means that larger learning parameter values can be used.
Fig. 3. Original data (a) and the development of feature maps (b)-(h) as information is increased. All figures (a)-(h) are plotted in the same scale. [Panels: (a) Data; (b) 0.1; (c) 0.2; (d) 0.3; (e) 0.4; (f) 0.5; (g) 0.6; (h) 0.73. Axis label: W(i,1).]
Information remains very small during the first 150 epochs, and then it increases rapidly and reaches a stable point. As information increases from 0.1 to 0.73 (final), connection weights are gradually unfolded and come close to the original input pattern shown in Figure 3(a). Next, we demonstrate our method in a case where the dimension of the input space is larger than that of the output space. Figure 4 shows information as a function of the number of epochs. As shown in the figure, information is close to zero during the first 500 epochs and then gradually increases up to 0.5. Figure 5(a) shows original data drawn from a uniform distribution. The number of neurons was 50, arranged in a one-dimensional lattice. Figure 5(b)-(f) show feature maps with different values of information content. As information is increased, the feature map is gradually unfolded and finally approaches a square corresponding to the original data.

Fig. 4. Information as a function of the number of epochs for the artificial data. The learning parameter β is one. [Plot: information (0 to 0.5) against epoch (0 to 2000).]
The third example is probably the most famous one, that is, a two-dimensional lattice trained on a two-dimensional distribution. The network was trained with two-dimensional input vectors drawn from a uniform distribution. Neurons are arranged in a two-dimensional lattice with 5 rows and columns. Figure 6 shows information as a function of the number of epochs. Information increases from around the 200th epoch and reaches a stable point at about 500 epochs. Figure 7 shows the original data (a) and five feature maps with different information content. The map is small when information is 0.1 (Figure 7(b)). Then, the map grows gradually when information is 0.2 (Figure 7(c)). When information is 0.3 and 0.4, the map seems to grow further (Figure 7(d) and (e)). Then, when information is 0.5 (final), the map tends to capture the irregularities observed in the input data (Figure 7(f)). We have used well-known artificial data to show that our information-theoretic method can produce feature maps close to those obtained by the standard SOM.

Fig. 5. Original data (a) and development of feature maps (b)-(f) as information is increased. All figures (a)-(f) are plotted in the same scale. [Panels: (a) Data; (b) 0.1; (c) 0.2; (d) 0.3; (e) 0.4; (f) 0.5.]

Fig. 6. Information as a function of the number of epochs for the artificial data. The parameter β is 0.05. [Plot: information (0 to 0.5) against epoch (0 to 500).]

However, our method is based upon a very soft type of competitive learning. As already mentioned, when information is completely maximized, our method is equivalent to standard competitive learning with the winner-takes-all algorithm. As information becomes smaller, competition becomes softer, and many competitive units compete with each other. This soft competition certainly has some influence on the final feature maps, and we can detect some differences in the final representations. Thus, an extensive comparison study will be needed to show clearly the characteristics of our method. In addition, these experimental results reveal some problems that must be solved to realize clearer feature maps. First, the learning parameter β was set to small values for stable learning processes. It seems to us that the optimal values of the learning parameter vary greatly, depending upon the given problem. Thus, we should carefully examine the relations between the learning parameter and the final maps. Second, the final maps change significantly when the Gaussian width σ is changed. Thus, the relations between the final maps and the Gaussian width need to be determined more exactly. In addition, though we have used the inverse of the Euclidean distance as the activation function, some other activation functions such as the sigmoid or the Gaussian function may be useful for some problems.
4 Conclusion
In this paper, we have shown that information maximization can realize self-organizing processes by incorporating information on neighboring neurons. Information on neighboring neurons is incorporated by the weighted sum of distances, or collective distance, between input patterns and connection weights. With this collective distance, information maximization naturally reflects a process of cooperation. We have applied the method to three well-known problems. In all problems, we have shown that feature maps are gradually unfolded in a process of information maximization. Though some problems such as computational complexity must be solved for practical large-scale applications, we have shown that information maximization plays a very important role in neural computing.

Fig. 7. Original data (a) and development of feature maps (b)-(f) as information is increased. All figures (a)-(f) are plotted in the same scale. [Panels: (a) Data; (b) 0.1; (c) 0.2; (d) 0.3; (e) 0.4; (f) 0.5.]
References
1. A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, vol. 7, no. 6, pp. 1129–1159, 1995.
2. H. B. Barlow, “Unsupervised learning,” Neural Computation, vol. 1, pp. 295–311, 1989.
3. D. E. T. Lehn-Schioler, Anant Hegde and J. C. Principe, “Vector-quantization using information theoretic concepts,” Natural Computation, vol. 4, no. 1, pp. 39–51, 2004.
4. K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, pp. 1415–1438, 2003.
5. R. Linsker, “Self-organization in a perceptual network,” Computer, vol. 21, pp. 105–117, 1988.
6. R. Linsker, “How to generate ordered maps by maximizing the mutual information between input and output,” Neural Computation, vol. 1, pp. 402–411, 1989.
7. R. Linsker, “Local synaptic rules suffice to maximize mutual information in a linear network,” Neural Computation, vol. 4, pp. 691–702, 1992.
8. R. Kamimura, “Information-theoretic competitive learning with inverse euclidean distance,” Neural Processing Letters, vol. 18, pp. 163–184, 2003.
9. R. Kamimura, T. Kamimura, and O. Uchida, “Flexible feature discovery and structural information,” Connection Science, vol. 13, no. 4, pp. 323–347, 2001.
10. R. Kamimura, T. Kamimura, and H. Takeuchi, “Greedy information acquisition algorithm: A new information theoretic approach to dynamic information acquisition in neural networks,” Connection Science, vol. 14, no. 2, pp. 137–162, 2002.
11. T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley and Sons, INC., 1991.
12. C. E. Shannon and W. Weaver, The mathematical theory of communication. University of Illinois Press, 1949.
13. L. L. Gatlin, Information Theory and Living Systems. Columbia University Press, 1972.
14. T. Kohonen, Self-Organizing Maps. Springer-Verlag, 1995.
15. M. M. Van Hulle, “The formation of topographic maps that maximize the average mutual information of the output responses to noiseless input signals,” Neural Computation, vol. 9, no. 3, pp. 595–606, 1997.
An Online Adaptation Control System Using mnSOM Shuhei Nishida1, Kazuo Ishii2, and Tetsuo Furukawa2 1
Mechanical Systems and Environmental Engineering, The University of Kitakyushu, 1-1 Hibikino, Kitakyushu, Fukuoka 808-0135, Japan [email protected] 2 Brain Science and Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Kitakyushu, Fukuoka 808-0196, Japan {ishii, furukwa}@brain.kyutech.ac.jp
Abstract. Autonomous Underwater Vehicles (AUVs) are attractive tools for surveys in earth science and oceanography; however, many problems remain to be solved, such as motion control, acquisition of sensor data, decision-making, navigation without collision, self-localization and so on. In order to realize useful and practical robots, underwater vehicles should act by judging the changing conditions from their own sensors and actuators, and it is desirable that they determine their own behavior, because of features caused by the working environment. We have been investigating the application of brain-inspired technologies such as Neural Networks (NNs) and the Self-Organizing Map (SOM) to AUVs. A new controller system for AUVs using the Modular Network SOM (mnSOM) proposed by Tokunaga et al. is discussed in this paper. The proposed system is developed using a recurrent NN type mnSOM. The efficiency of the system is investigated through simulations. Keywords: Adaptive Control, mnSOM, Autonomous Underwater Vehicle.
1 Introduction
Autonomous underwater vehicles (AUVs) have great advantages for activities in deep oceans [1], and are expected to become attractive tools for near-future underwater development and investigation. However, AUVs have various problems that should be solved, concerning motion control, acquisition of sensor information, behavioural decision, navigation without collision, self-localization and so on. We have been investigating the application of neural network technology to AUVs, focusing on capabilities of neural networks (NNs) such as learning and nonlinear mapping. Considering the various AUV-specific features, several methods have been proposed using NNs, the Self-Organizing Map [2] and so on [3], [4]. In order to realize useful and practical robots that can work in the ocean, underwater vehicles should act by judging the changing conditions from their own sensors and actuators, and it is desirable that they determine their behaviors with limited effort from the operators, because of the features caused by the working environment. Therefore, AUVs should be autonomous and able to adapt to their environment. AUVs have nonlinear
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 935 – 942, 2006. © Springer-Verlag Berlin Heidelberg 2006
coupled dynamics in six degrees of freedom, and changes in the equipment of a robot influence the control system. In the previous adaptive control method in [3], the information on the initial states is gradually lost during the process of adaptation. Therefore, a new method that keeps the information on the initial state or previous environment while adapting to a new environment should be developed, in order to increase the efficiency of learning and reduce the learning cost by using the environmental information that the robot has already learned. Human beings are assumed to have a kind of modular architecture for dynamics and controllers. This modular architecture is called MOdule Selection And Identification Control (MOSAIC) [5]. This method allows multiple pairs of dynamics and controller modules to be obtained. MOSAIC and reinforcement learning have been applied to the task of swinging up a pendulum [6]. In this paper, a new self-organizing decision-making system for AUVs using the Modular Network Self-Organizing Map (mnSOM) [7] proposed by Tokunaga et al. is described. The mnSOM is an extension of the conventional SOM in which each vector unit is replaced by a function module such as an NN or an SOM. Several applications have been reported [8]-[11]. The proposed system is developed using a recurrent NN type mnSOM. The efficiency of the system is investigated through simulations.
2 Adaptive Control System Using mnSOM
The proposed controller of the robot consists of a Recurrent Neural Network (RNN). As shown in Fig. 1, the adaptive controller is realized using RNN-mnSOM. The control system is built in the following three steps: (a) Identification of Forward Model Modules, (b) Adaptation of Controller Modules using the Forward Model Modules, and (c) Implementation of the Control Module to Robot Control. In process (a), shown in Fig. 1-(a), Forward Model Modules (FMMs) are acquired. Several time series of motion data are prepared in advance, each representing different dynamics, that is, a different relationship between the control signal and the states of the robot, so that one module comes to represent one dynamics property. These time series are fed into the RNN-mnSOM, and the FMMs are obtained. In process (b), shown in Fig. 1-(b), Controller Modules (CMs) are acquired using the fixed FMMs obtained in process (a). The target state variables are given to the CMs, and the output data (control signals) calculated in the CMs are given to all FMMs. The CMs are optimized by the back-propagation method using the square error between the target states and the states estimated by the FMMs, regarding an FMM and a CM as one NN. Figure 1-(c) shows process (c). The condition of the robot is determined as the best-matching module (BMM) by feeding a certain time series into each FMM. After the FMM is selected, the output of the CM corresponding to that FMM is given to the robot. The adaptive controller using mnSOM is realized according to processes (a)-(c).
Fig. 1. Learning Processes of an Adaptive Controller System using RNN-mnSOM. [Panels: (a) Identification of Forward Model Modules (FMMs); (b) Adaptation of Controller Modules using the Forward Model Modules (off-line adaptation: target, CM, force, FMM, state, evaluation); (c) Implementation of the Control Module to Robot Control (on-line adaptation: target, best matching module with FMM and CM, force, state, FMM update and CM adaptation).]
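Process (c), selecting the best-matching module and handing control to its paired controller, can be sketched as follows. This is only an illustration under stated assumptions: the trained RNN forward models are replaced by hypothetical one-step predictors built from a quadratic-drag model with an assumed time step, the error measure is the mean squared one-step prediction error, and the controller modules are placeholder labels. None of these names come from the paper.

```python
def make_forward_model(M, C, dt=0.1):
    """Hypothetical stand-in for a trained forward model module (FMM).

    Predicts the next velocity from the current velocity v and force F,
    assuming F = M * a + C * v * |v| and an explicit Euler step.
    """
    def predict(v, F):
        a = (F - C * v * abs(v)) / M
        return v + dt * a
    return predict

def best_matching_module(forward_models, velocities, forces):
    """Index of the FMM with the smallest mean squared prediction error."""
    errors = []
    for predict in forward_models:
        sq = [(predict(velocities[t], forces[t]) - velocities[t + 1]) ** 2
              for t in range(len(forces))]
        errors.append(sum(sq) / len(sq))
    best = min(range(len(errors)), key=lambda i: errors[i])
    return best, errors

# Three hypothetical modules and their paired controller placeholders.
params = [(80, 25), (90, 50), (100, 100)]
fmms = [make_forward_model(M, C) for M, C in params]
controllers = ["CM for (80, 25)", "CM for (90, 50)", "CM for (100, 100)"]

# Observed time series generated by a plant that matches the second module.
plant = make_forward_model(90, 50)
forces = [10.0] * 20 + [-10.0] * 20
velocities = [0.0]
for F in forces:
    velocities.append(plant(velocities[-1], F))

best, errors = best_matching_module(fmms, velocities, forces)
print(best, controllers[best])
```

Here the second module reproduces the observed series exactly, so it is selected and its controller would be the one applied to the robot; in the actual system the comparison would be made over the outputs of the trained recurrent modules.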
3 Simulations
3.1 Forward Model Modules
In order to evaluate the identification capability of RNN-mnSOM, several sets of time series data are prepared by changing the parameters M and C in the following equation of motion.
$$ F = M\ddot{x} + C\dot{x}\lvert\dot{x}\rvert \tag{1} $$

Here, $F$ is the force, $\dot{x}$ is the velocity, and $\ddot{x}$ is the acceleration of the robot. $M$ and $C$ are the mass including added mass and the drag coefficient, respectively. A limit-cycle test is carried out in which the force is inverted if the absolute value of the velocity exceeds 0.2 [m/s]. Nine sets of limit-cycle time series are prepared by changing the parameters as shown in Table 1. Each time series is measured for 50 [sec] at a sampling rate of 10 [Hz], and the series are fed into an RNN-mnSOM with a 6x6 lattice of modules. Figure 2 shows the result after 100,000 learning iterations. In Fig. 2, each square represents a module, and its color indicates the distance to its neighbors. A black square means that the module has dynamics different from those of its neighbors, and a white square means that they are similar. In each BMM, the time series data are plotted; velocity is the thin line and acceleration is the bold one. The BMMs for the teaching data are located at the positions shown in Table 2. The parameters of the input data and those estimated from the FMMs are compared in Fig. 3. The squares are the parameters of the input time series, and the crosses are the parameters estimated from the FMMs using the least-squares method. Figure 3 shows that the parameters of the FMMs are distributed among those of the input data. In Fig. 4, the relationship between acceleration and velocity is plotted. The position of each graph corresponds to that of the FMM. These parallelograms become smaller in area from the left side to the right, which indicates that parameter M becomes larger. The gradient becomes larger from the upper side to the lower, which indicates that parameter C becomes larger.

Fig. 2. A Forward Model Map Obtained from the Time Series of Limit Cycle Simulation Data. [Map axes: x, y.]

3.2 Controller Model Modules
CMs connected to FMMs with fixed weights are optimized. The target position is 0.5 [m] during 0 ~ 25 [sec] and -0.5 [m] during 25 ~ 50 [sec]. The target velocity is 0.0 [m/s], and the sampling rate is 10 [Hz]. The result after 15,000 learning iterations is shown in Fig. 5. In each square, which represents a CM, the time series obtained using that CM is plotted; the horizontal axis is time. On the vertical axis, the dashed line is the position of the robot, the gray line is the target position, and the solid line is the control force. All CMs follow the target, so that control suited to the corresponding FMMs is obtained.
Table 1. Coefficient M and C for Limit-Cycle Motion, Di: (M, C)

D0: (80, 25)     D3: (90, 25)     D6: (100, 25)
D1: (80, 50)     D4: (90, 50)     D7: (100, 50)
D2: (80, 100)    D5: (90, 100)    D8: (100, 100)

Table 2. Best Matching Module for each Data Class, Di is located in (x, y)

D0: (0, 0)       D3: (4, 0)       D6: (6, 1)
D1: (0, 0)       D4: (4, 1)       D7: (6, 3)
D2: (2, 6)       D5: (4, 6)       D8: (6, 6)
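The limit-cycle data generation of Section 3.1 and a least-squares estimate of M and C can be sketched as below. Several details are assumptions not given in the text: the drag term is taken to be quadratic (C times velocity times its absolute value), the force magnitude is set to a hypothetical 20, integration uses an explicit Euler step, and the force is inverted only when it still pushes in the direction of motion, so that it flips once per excursion. The estimate here is made from the simulated data themselves, whereas the paper estimates the parameters from the outputs of the FMMs.

```python
def simulate_limit_cycle(M, C, force=20.0, v_limit=0.2, duration=50.0, rate=10.0):
    """Limit-cycle test: invert the force when |velocity| exceeds v_limit."""
    dt = 1.0 / rate
    v, F = 0.0, force
    forces, vels, accs = [], [], []
    for _ in range(int(duration * rate)):
        # Flip once per excursion: only while the force still accelerates
        # the vehicle in its current direction (assumption of this sketch).
        if abs(v) > v_limit and v * F > 0.0:
            F = -F
        a = (F - C * v * abs(v)) / M   # from F = M*a + C*v*|v|
        forces.append(F)
        vels.append(v)
        accs.append(a)
        v += a * dt                    # explicit Euler step
    return forces, vels, accs

def estimate_M_C(forces, vels, accs):
    """Least-squares fit of F = M*a + C*v*|v| via the 2x2 normal equations."""
    b = [v * abs(v) for v in vels]
    s_aa = sum(a * a for a in accs)
    s_ab = sum(a * bi for a, bi in zip(accs, b))
    s_bb = sum(bi * bi for bi in b)
    s_aF = sum(a * F for a, F in zip(accs, forces))
    s_bF = sum(bi * F for bi, F in zip(b, forces))
    det = s_aa * s_bb - s_ab * s_ab
    M = (s_aF * s_bb - s_bF * s_ab) / det
    C = (s_bF * s_aa - s_aF * s_ab) / det
    return M, C

# The nine parameter sets D0..D8 of Table 1.
table1 = [(M, C) for M in (80, 90, 100) for C in (25, 50, 100)]
estimates = [estimate_M_C(*simulate_limit_cycle(M, C)) for M, C in table1]
for (M, C), (M_hat, C_hat) in zip(table1, estimates):
    print((M, C), round(M_hat, 3), round(C_hat, 3))
```

Each series has 500 samples (50 s at 10 Hz), and because the simulated data satisfy the model exactly, the fit recovers the parameters of Table 1 up to rounding error; with a trained module the same fit would give the crosses plotted in the M-C plane.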
3.3 On-Line Adaptive Simulation
A simulation is carried out to compare the adaptability to unlearned data of the adaptive controller proposed in ref. [3] (hereafter, the reference system) with that of the proposed controller. Figure 6 shows the transition of the evaluation values. In these graphs, the horizontal axis is the learning step and the vertical axis is the evaluation value on a log scale. The upper graphs show the forward model error and the lower ones the controller error. Solid lines are obtained from the proposed system and dotted lines from the reference system.

Fig. 3. Forward Model Map Evaluation in M-C Space by the Least Square Method. [Axes: M (80 to 100) and C (25 to 100); legend: Forward Model Module, Input Data.]
Fig. 4. Acceleration-Velocity Relationship Obtained from Limit Cycle Simulation with FMMs
In the reference system, the evaluation value of the forward model becomes large at an early stage of learning and then becomes smaller. As the forward model error decreases, the controller error becomes large. The controller then adapts to the input at 20 [sec] in the case of 25, and at 125 [sec] in the case of 120. In the proposed system, a forward model module that expresses the given time series already exists. Therefore, at the early stage of learning, the proposed system needs little adaptation. The adaptability of the proposed system is better than that of the reference system.
Fig. 5. Acquisition Simulation of 6x6 Controller Network Modules
Fig. 6. Transition of Evaluation Values
4 Conclusions
An adaptive controller using mnSOM is proposed. The Forward Model Map for dynamics identification and the Controller Map are introduced to realize the adaptive controller system. In the FMMs, the characteristics of several input data and the interpolations among them are expressed. In the CMMs, suitable controllers corresponding to the FMMs are obtained. The efficiency of the proposed system will be investigated through experiments using an AUV.
Acknowledgment This work was supported by a 21st Century Center of Excellence Program, “World of Brain Computing Interwoven out of Animals and Robots (PI: T. Yamakawa)” granted in 2003 to Department of Brain Science and Engineering, (Graduate School of Life Science and Systems Engineering), Kyushu Institute of Technology by Japan Ministry of Education, Culture, Sports, Science and Technology.
References
1. T. Ura (1989). “Free Swimming Vehicle PTEROA for Deep Sea Survey,” Proc. of ROV'89, pp.263-268
2. T. Kohonen, (1982), “Self-organized formation of topologically correct feature maps,” Biological cybernetics, vol. 43, pp.59-69
3. K. Ishii and T. Ura (2000). “An adaptive neural-net controller system for an underwater vehicle,” Journal of IFAC Control Engineering Practice, Vol. 8, pp.177-184
4. S. Nishida, K. Ishii and T. Ura, (2004), “A Self-Organizing Map Based Navigation System for an Underwater Robot,” IEEE International Conference on Robotics and Automation, pp.4466-4471
5. M. Haruno, D.M. Wolpert and M. Kawato, (2001), “Mosaic: Module selection and identification for control,” Neural Computation, vol.13 no.10, pp.2201-2220
6. K. Doya, K. Samejima, K. Katagiri and M.Kawato, (2002), “Multiple Model-based Reinforcement Learning,” Neural Computation, vol.14 pp.1347-1369
7. K. Tokunaga, T. Furukawa and S. Yasui, (2003), “Modular Network SOM: Extension of SOM to the realm of function space,” WSOM'03, pp.173-178
8. T. Furukawa, K. Tokunaga, K. Moroshita and S. Yasui, (2005), “Modular Network SOM (mnSOM): From Vector Space to Function Space,” International Joint Conference on Neural Networks
9. K. Tokunaga, T. Furukawa, (2005), “Nonlinear ASSOM Constituted of Autoassociative Neural Modules,” 5th Workshop on Self-Organizing Maps
10. T. Furukawa, T. Tokunaga, S. Kaneko, K. Kimotsuki and S. Yasui, (2004), “Generalized Self-Organizing Maps (mnSOM) for Dealing with Dynamical Systems,” International Symposium on Nonlinear Theory and its Applications, pp.231-234
11. T. Minatohara, T. Furukawa, (2005), “Self-Organizing Adaptive Controllers: Application to the Inverted Pendulum” 5th Workshop on Self-Organizing Maps, pp.44-48
Generalization of the Self-Organizing Map: From Artificial Neural Networks to Artificial Cortexes Tetsuo Furukawa and Kazuhiro Tokunaga Kyushu Institute of Technology, Kitakyushu 808-0196, Japan [email protected], [email protected] http://www.brain.kyutech.ac.jp/~furukawa
Abstract. This paper presents a generalized framework of a self-organizing map (SOM) applicable to more extended data classes rather than vector data. A modular structure is adopted to realize such generalization; thus, it is called a modular network SOM (mnSOM), in which each reference vector unit of a conventional SOM is replaced by a functional module. Since users can choose the functional module from any trainable architecture such as neural networks, the mnSOM has a lot of flexibility as well as high data processing ability. In this paper, the essential idea is first introduced and then its theory is described.
1 Introduction
In this paper, a generalized framework of Kohonen's self-organizing map (SOM) is presented. The generalization is realized by adopting a modular structure; thus it is called the modular network SOM (mnSOM), which was first proposed by Tokunaga et al. [1,2]. Our aim is to develop a generalized SOM algorithm that allows users to generate a map of given objects that are not only vector data but also functions, dynamical systems, controllers, associative memories and so on. We also aim to give the capability of information processing to every nodal unit of a SOM. Thus, unlike the conventional SOM, the map generated by the mnSOM is no longer static and can be an assembly of information processors that can dynamically process data. The idea of the mnSOM is simple: every reference vector unit of the conventional SOM is replaced by a trainable functional module such as a neural network (Figure 1). The functional modules can be designed to suit each application while keeping the backbone algorithm of the SOM untouched. This generalization strategy provides high degrees of both design flexibility and reliability to SOM users, because the mnSOM allows one to choose the functional modules from the great number of already proposed trainable architectures, and at the same time the consistent extension method assures the theoretical consistency, e.g., statistical properties, of the result. As an example, let us consider a case in which an mnSOM user wants to make a map of controllers for a set of n controlled objects. For this, all the user has to do is (i) determine the architecture of the controllers (it should be trainable, e.g., neural network controllers) as functional modules of the mnSOM, and (ii) define an appropriate distance measure that determines the distance between two controllers. The task of the mnSOM is to train those functional modules to be the desired controllers for the n objects, while at the same time generating a feature map that indicates the similarities
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 943–949, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. The architecture of mnSOM
or differences between those controllers. If the controlled objects A and B have similar dynamics, then the corresponding controllers should be located near each other in the map space of the mnSOM, whereas if controlled objects C and D have quite different dynamics, then those controllers should be arranged further apart. Additionally, the intermediate modules are expected to become controllers for objects whose dynamics are intermediate between the given ones. After the training has finished, the user can then use the map as an assembly of controller modules that can adapt to dynamic changes of the target object. This is a new aspect that is not found in the conventional SOM. Therefore, the mnSOM is expected to greatly enlarge the number of fields for applications of SOMs.
2 Architecture of an mnSOM
The architecture of an mnSOM is shown in Figure 1. The architecture is such that each vector unit of a conventional SOM is replaced by a trainable functional module. These modules are usually arrayed on a lattice that represents the coordinates of the feature map. (Modifications such as a growing map, a hierarchical map and a neural gas network are all available; here, we consider the simplest map structure.) Figure 1 illustrates the case of multilayer perceptron (MLP) modules as a typical case, but many other module types are available. Table 1 shows a catalogue of module types that have been tried. In the case of MLP modules, i.e., MLP-mnSOM, each MLP module represents a nonlinear function, and as a result the entire mnSOM generates a map of functions. Therefore, MLP-mnSOM is an SOM in function space rather than vector space [1,2,3,4]. Users can also employ radial basis function network modules (RBF-mnSOM) instead of the conventional MLPs. In such cases, the distance measure is defined in the function space. All siblings of the MLP can serve as mnSOM modules. For example, recurrent neural networks (e.g., Jordan type and Elman type) and autoassociative neural networks (i.e., 3- or 5-layer autoencoder MLPs with a sand-clock structure) can be employed as well [5,6,7,8].
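To make the idea of a distance measured in function space concrete, here is a small sketch. It assumes, purely for illustration, that a module is any callable from input to output and that distance is the mean squared difference over a set of sample inputs; the names `function_distance` and `best_matching_module`, and the linear toy modules, are hypothetical and do not come from the paper.

```python
def function_distance(f, g, inputs):
    """Mean squared difference between two modules over sample inputs."""
    return sum((f(x) - g(x)) ** 2 for x in inputs) / len(inputs)

def best_matching_module(modules, samples):
    """Index of the module that best reproduces the (input, output) samples."""
    errors = [sum((f(x) - y) ** 2 for x, y in samples) / len(samples)
              for f in modules]
    return min(range(len(modules)), key=lambda i: errors[i])

# Toy modules: linear functions standing in for trained MLP modules.
modules = [lambda x, a=a: a * x for a in (0.5, 1.0, 2.0)]
inputs = [i / 10.0 for i in range(-10, 11)]

# A data class generated by an unknown function close to the third module.
samples = [(x, 1.9 * x) for x in inputs]

bmm = best_matching_module(modules, samples)
d01 = function_distance(modules[0], modules[1], inputs)
d02 = function_distance(modules[0], modules[2], inputs)
print(bmm, d01 < d02)
```

In this reading, each data class is matched to the module that reproduces its input-output relation best, and neighboring modules on the lattice are expected to represent nearby functions in the same sense.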
Table 1. Examples of module types and their applications
Module type (Name) | Object type | Applications
Layer type
– Multilayer perceptron (MLP-mnSOM) | nonlinear functions | Weather dynamics [1,2,3]; Bifurcation map of logistic mapping [5,6]
– Autoassociative network (ANN-mnSOM) | manifolds | 2D images of 3D objects [7]; Texture map [8]; Periodical waveforms
– Recurrent network (RNN-mnSOM) | dynamical systems | Damped oscillatory systems [7]; Bifurcation map of BVP model [5,6]; Autonomous mobile robots [15,16]
– Predictor & controller pair (SOAC) | adaptive controllers | Inverted pendulums [17]; Autonomous underwater vehicle [19,20,21]
– RBF network (RBF-mnSOM) | nonlinear functions |
– Single-layer perceptron (Operator map) | linear operators | [12]
SOM type
– SOM (SOM²) | manifolds | 2D images of 3D objects [9,10]; Face image recognition [9]; Shape classification [11]
– Neural gas network (NG², NG-SOM) | density functions | Handwritten character recognition [11]
– Local linear map | nonlinear functions |
Stochastic type
– Hopfield network | associative memories |
– Boltzmann machine | |
Component analysis type
– PCA (ASSOM) | linear subspaces | [13]
– Nonlinear PCA | nonlinear subspaces | See autoassociative network module
Single neuron type
– Hebbian neuron (Basic SOM) | static vectors |
Other module type
– Image filter | visual image filters | Adaptive visual filter [22]
Another big group of mnSOMs is made up of the SOM module types proposed by Furukawa [9,10]. The SOM-module-mnSOM, called SOM², is a “self-organizing homotopy” rather than a “self-organizing map” [11]. One of the prominent properties of this group is that this type of mnSOM can have a nested structure like a Russian doll. For example, an SOM² can be a module of a meta-mnSOM. Thus, an mnSOM with SOM² modules, i.e., SOM³, is also possible. It is easy to extend this to the n-th order, i.e., SOMⁿ as an mnSOM with SOMⁿ⁻¹ modules. SOM² and its family demonstrate their potential when tasks involve nonlinear manifolds or nonlinear subspaces. Stochastic type networks such as a Boltzmann machine and a Hopfield network make up another group. They are expected to generate a “map of memories”.
The mnSOM includes some variations of the SOM that have been proposed previously. If one employs a linear operator module, then it is an Operator Map as proposed by Kohonen [12]. When a principal component analysis (PCA) module is used, the mnSOM becomes an ASSOM [13]. If one employs Hebbian neurons as the functional modules, then the mnSOM becomes a conventional SOM [14]. Therefore, the mnSOM is a generalization of the SOM rather than an extension, because it includes the conventional cases. Many architectures have not been tried yet, but they would also be available as modules of the mnSOM. Users can derive the algorithm theoretically, as described in the next section, without resorting to heuristics.
3 Theory of mnSOM
Now let us describe the generalized theory of the mnSOM algorithm. Suppose that an mnSOM user is trying to map a set of $I$ objects, $\mathcal{O} = \{O_1, \dots, O_I\}$. In a conventional SOM, each data vector is a mapping object, whereas in the case of an MLP-mnSOM, each object is a nonlinear function, i.e., an input–output relation. There is one big difference between the conventional SOM and the generalized SOM. In the conventional SOM, all the mapping objects, i.e., the data vectors, are known and there is no need to estimate them. In the generalized case, however, the entities of the objects are often unknown. For example, consider a case in which a user is trying to map a set of dynamical systems. The observed input and output signals of the systems are usually given, but their dynamics are unknown in most cases. Therefore, the user must identify those dynamics in parallel with generating their self-organizing map. Thus the mnSOM must solve a simultaneous estimation problem. Let us assume that $D_i = \{r_{i,1}, \dots, r_{i,J}\}$ is the dataset observed from the $i$-th object $O_i$. If the mapping objects are systems or functions, then $r_{i,j}$ is defined as a pair of input–output vectors $r_{i,j} = (x_{i,j}, y_{i,j})$ observed from the $i$-th system. Suppose that the mnSOM has $K$ functional modules $\{M^1, \dots, M^K\}$, which are designed to be capable of regenerating, or mimicking, the objects. In other words, a module is capable of approximating an object $O_i$ after training on $D_i$. Suppose further that the property of each functional module $M^k$ is determined by a parameter set $\theta^k$. In the case of the MLP-mnSOM, $\theta^k$ is the weight vector of the $k$-th MLP module. Each functional module $M^k$ is given a fixed position $\xi^k$ in the map space. Therefore, $\xi^k$ gives the coordinates of $M^k$ in the map space, while $\theta^k$ determines its position in the data space.
In this situation, the tasks of the mnSOM are (i) to identify the entities of $\{O_i\}$ from the observed datasets $\{D_i\}$ by training the functional modules $\{M^k\}$, and (ii) to generate a map that shows the degrees of similarity and difference between the objects. These two tasks should be processed in parallel. Note that the map generated by the mnSOM is expected to show the relationships between the entities of the objects; direct comparisons between the datasets are meaningless. The mnSOM user needs to define an appropriate distance measure $L^2(O_i, M^k)$ that signifies the difference between an object $O_i$ and a module $M^k$. Since the distance measure depends on how the user wants to define similarities and differences between two objects, the measure should be chosen according to the user’s purpose.
By using the distance measure, an important derived concept, namely the center of mass, can be defined as follows.
\[ \bar{O}(m, \mathcal{O}) \triangleq \arg\min_{O} \sum_{i=1}^{I} m_i \, L^2(O_i, O) \tag{1} \]
Here $\bar{O}$ is the center of mass of the given objects $\mathcal{O} = \{O_1, \dots, O_I\}$ with the weights $m = (m_1, \dots, m_I)$. If $\mathcal{O}$ belongs to a vector space, then $\bar{O}$ is given by
\[ \bar{O} = \frac{m_1 O_1 + \cdots + m_I O_I}{m_1 + \cdots + m_I}. \tag{2} \]
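As a brief worked step, Eq. (2) follows from Eq. (1) under the assumption (made here only for illustration) that the distance measure is the squared Euclidean distance, $L^2(O_i, O) = \|O_i - O\|^2$, and that the weights satisfy $\sum_i m_i > 0$:

```latex
% Assumption: L^2(O_i, O) = \| O_i - O \|^2 and \sum_i m_i > 0.
\begin{align*}
F(O) &= \sum_{i=1}^{I} m_i \,\| O_i - O \|^2 , \\
\frac{\partial F}{\partial O} &= -2 \sum_{i=1}^{I} m_i \,( O_i - O ) = 0
\quad\Longrightarrow\quad
O \sum_{i=1}^{I} m_i = \sum_{i=1}^{I} m_i O_i , \\
\bar{O} &= \frac{\sum_{i=1}^{I} m_i O_i}{\sum_{i=1}^{I} m_i} .
\end{align*}
```

Since $F$ is a convex quadratic in $O$ when all $m_i \ge 0$, this stationary point is the minimizer.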
Since the entities of the objects are assumed to be unknown, we can measure only the distance between an estimated object and a module, i.e., $L^2(\hat{O}(D_i), M^k)$. Here $\hat{O}(D_i)$ is the object entity estimated from $D_i$. Each module is updated so as to be the center of mass, the weights of which are given by the neighborhood function. Thus, the update algorithm of the mnSOM is described as
\[ \theta^k(t+1) = \arg\min_{\theta} \sum_{i=1}^{I} \phi_i^k(t) \, L^2(\hat{O}(D_i), M(\theta)). \tag{3} \]
$\phi_i^k(t)$ is the mass of the $i$-th object for the $k$-th module given by the neighborhood function at time $t$. Usually a Gaussian function is used as the neighborhood function:
\[ \phi_i^k(t) = \exp\left[ -\frac{\|\xi_i^* - \xi^k\|^2}{2\sigma^2(t)} \right] \tag{4} \]
Here $\xi_i^*$ denotes the coordinate of the winner module of the $i$-th object. In many cases (but not always) the estimated distance $L^2(\hat{O}(D_i), M^k)$ can be approximated by the mean square error between the module $M^k$ and the dataset $D_i$, such as
\[ L^2(O_i, M^k) \simeq E_i^k \triangleq \frac{1}{J} \sum_{j=1}^{J} \left( e_{i,j}^k \right)^2 . \tag{5} \]
Here $e_{i,j}^k$ is the error between the data vector $r_{i,j}$ and the corresponding output of the $k$-th module. In such cases, the update algorithm (3) becomes much easier, as follows.
\[ \theta^k(t+1) = \arg\min_{\theta} \sum_{i=1}^{I} \sum_{j=1}^{J} \phi_i^k(t) \left( e_{i,j}^k \right)^2 . \tag{6} \]
The algorithm of the generalized SOM consists of three processes, as in the case of the conventional SOM. First, in the competitive process, the module with the least average error becomes the “winner” or the “best matching module” (BMM) for a given dataset. The BMM is determined for every dataset. Second, in the cooperative process, the learning weight is determined using the neighborhood function. This process is identical with
the conventional one. Finally, in the adaptive process, all modules are updated so as to be the mass center of the objects with the weights $\{\phi_i^k\}$. These three processes are iterated while reducing the neighborhood size until the network reaches a steady state. The algorithm described above is the general case; now we look at the case of an MLP-mnSOM as an example. Since MLPs represent nonlinear functions, the distance measure is defined in function space.
\[ L^2(O_i, M^k) = \int \left\| f_i(x) - g^k(x) \right\|^2 p(x) \, dx \tag{7} \]
Here $f_i(x)$ is the $i$-th object, i.e., the $i$-th nonlinear function, and $g^k(x)$ is the function represented by the $k$-th MLP module. $f_i(x)$ is assumed to be unknown, and the observed input–output data are supposed to be given. Thus, $D_i = \{r_{i,j}\} = \{(x_{i,j}, y_{i,j})\}$ is available as the training dataset. In this case, the mean square error is determined by the error between the output of the $k$-th module $g^k(x_{i,j})$ and the actual (desired) output $y_{i,j}$, i.e.,
\[ e_{i,j}^k = g^k(x_{i,j}) - y_{i,j}. \tag{8} \]
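The relation between the function-space distance (7) and the mean square error (5) can be illustrated numerically. The following is a minimal sketch, not the authors' code: the functions `f` and `g`, the uniform input density and the helper names are all assumptions chosen for illustration. With noise-free observations $y_{i,j} = f_i(x_{i,j})$ drawn with $x_{i,j} \sim p(x)$, the mean square error is a Monte Carlo estimate of the integral in (7).

```python
import numpy as np

def function_distance(f, g, sample_x):
    """Monte Carlo estimate of Eq. (7): integral of (f(x) - g(x))^2 p(x) dx,
    where sample_x holds draws from p(x)."""
    diff = f(sample_x) - g(sample_x)
    return float(np.mean(diff ** 2))

def mean_square_error(g, x, y):
    """Eqs. (5) and (8): (1/J) * sum_j (g(x_j) - y_j)^2."""
    return float(np.mean((g(x) - y) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200_000)  # assumed p(x): uniform on [-1, 1]

def f(x):
    # Hypothetical stand-in for the unknown object f_i
    return x ** 2

def g(x):
    # Hypothetical stand-in for the function g^k of a module
    return 0.5 * x

y = f(x)  # noise-free observations y_{i,j} = f_i(x_{i,j})

d_fun = function_distance(f, g, x)
d_mse = mean_square_error(g, x, y)
print(d_fun, d_mse)  # both estimate the same integral (17/60 for this choice)
```

With observation noise the two quantities differ by the noise variance, which is the same for every module and therefore does not change which module wins.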
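The module with the least mean square error for $D_i$ is determined as the winner (BMM) of the $i$-th object. In this case, the weight vectors of the MLP modules are updated by the backpropagation algorithm.
\[ \Delta\theta^k = -\eta \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{\phi_i^k(t)}{\phi_1^k + \cdots + \phi_I^k} \, \frac{\partial \left( e_{i,j}^k \right)^2}{\partial \theta^k} \tag{9} \]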
Note that (9) updates the MLP toward the center of mass of the given functions with the weights $\{\phi_i^k\}$. This backpropagation learning is iterated several times over all data vectors (not only once) with $\phi_i^k(t)$ held fixed, so that $\theta^k$ is updated sufficiently. The detailed algorithms of individual module types have been described in previous works [2,3,6,7,9,10].
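The competitive, cooperative and adaptive processes can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the MLP-mnSOM itself: each module is a quadratic polynomial (linear in its parameters) instead of an MLP, so the arg min in Eq. (6) has a closed-form weighted least-squares solution and no backpropagation is needed. The function names, the 1-D lattice, and the annealing schedule for $\sigma(t)$ are all hypothetical choices.

```python
import numpy as np

def features(x):
    # Quadratic polynomial basis; each module is g^k(x) = features(x) @ theta^k.
    return np.stack([np.ones_like(x), x, x ** 2], axis=1)

def train_mnsom(xs, ys, n_modules=5, n_epochs=30,
                sigma0=2.0, sigma_min=0.3, tau=8.0, seed=0):
    """xs, ys: lists with one (J,) input array and one (J,) output array per object."""
    rng = np.random.default_rng(seed)
    feats = [features(x) for x in xs]
    theta = rng.normal(scale=0.1, size=(n_modules, 3))
    xi = np.arange(n_modules, dtype=float)  # module coordinates on a 1-D lattice
    # Per-object terms of the normal equations: F_i^T F_i and F_i^T y_i
    gram = np.stack([F.T @ F for F in feats])                 # (I, 3, 3)
    moment = np.stack([F.T @ y for F, y in zip(feats, ys)])   # (I, 3)
    for t in range(n_epochs):
        sigma = sigma_min + (sigma0 - sigma_min) * np.exp(-t / tau)
        # Competitive process: mean square error E_i^k and the BMM of each object
        err = np.array([[np.mean((F @ th - y) ** 2) for th in theta]
                        for F, y in zip(feats, ys)])          # (I, K)
        winners = err.argmin(axis=1)
        # Cooperative process: Gaussian neighborhood phi_i^k, cf. Eq. (4)
        phi = np.exp(-(xi[:, None] - xi[winners][None, :]) ** 2
                     / (2.0 * sigma ** 2))                    # (K, I)
        # Adaptive process: Eq. (6) solved as weighted least squares per module
        for k in range(n_modules):
            A = np.einsum("i,iab->ab", phi[k], gram)
            b = phi[k] @ moment
            theta[k] = np.linalg.solve(A, b)
    return theta, winners

# Three hypothetical objects f_i(x) = a_i * x^2, observed without noise.
rng = np.random.default_rng(1)
coeffs = [-1.0, 0.0, 1.0]
xs = [rng.uniform(-1.0, 1.0, size=50) for _ in coeffs]
ys = [a * x ** 2 for a, x in zip(coeffs, xs)]
theta, winners = train_mnsom(xs, ys)
fit_err = [float(np.mean((features(x) @ theta[k] - y) ** 2))
           for x, y, k in zip(xs, ys, winners)]
print(winners, fit_err)
```

Modules that win no object end up with parameters between those of their neighbors, which corresponds to the intermediate modules described in the text. For MLP modules the closed-form solve would be replaced by several backpropagation sweeps of Eq. (9) with the neighborhood weights held fixed.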
4 Conclusion: Can Our mnSOM Be an Artificial Cortex?
The mnSOM has several advantages compared with other neural network architectures. First, the mnSOM can process larger tasks than single neural networks, and it suffers less interference between memories because of its modular structure. Second, the entire output of an mnSOM is trained in a supervised manner with given datasets, while maps of functions are organized in an unsupervised manner. Therefore the mnSOM seems to transcend the dualism of supervised and unsupervised learning. Finally, the mnSOM is a meta-learning framework that governs an assembly of functional modules. The flexibility in the type of module is also an advantage inherent in the mnSOM. Interestingly, the architecture of the mnSOM looks similar to the column structure of our cortex. Each module looks like a functional column of the cortex, and the map in the mnSOM corresponds to the map of a brain. Of course the mnSOM was not invented by mimicking the cortex; it is just a straightforward generalization of Kohonen’s SOM. But considering the above advantages, our mnSOM is expected to be a good platform from which to initiate an artificial cortex, though much remains to be done to realize that far-off goal.
Acknowledgement
This work was partially supported by a Center of Excellence Program (Center #J19) granted by MEXT of Japan. This work was also partially supported by a Grant-in-Aid for Scientific Research (C) granted by MEXT of Japan.
References
1. Tokunaga, K., Furukawa, T., Yasui, S.: Modular network SOM: Extension of SOM to the realm of function space. Proc. of WSOM2003 (2003) 173–178
2. Tokunaga, K., Furukawa, T., Yasui, S.: Modular network SOM: Self-organizing maps in function space. Neural Information Processing – Letters and Reviews 9(1) (2005) 15–22
3. Furukawa, T., Tokunaga, K., Morishita, K., Yasui, S.: Modular network SOM (mnSOM): From vector space to function space. Proc. of IJCNN2005 (2005) 1581–1586
4. Tokunaga, K., Furukawa, T.: Modular network SOM: Theory, algorithm and applications. Proc. of ICONIP2006 (2006)
5. Furukawa, T., Tokunaga, K., Kaneko, S., Kimotsuki, K., Yasui, S.: Generalized self-organizing maps (mnSOM) for dealing with dynamical systems. Proc. of NOLTA2004 (2004) 231–234
6. Kaneko, S., Tokunaga, K., Furukawa, T.: Modular network SOM: The architecture, the algorithm and applications to nonlinear dynamical systems. Proc. of WSOM2005 (2005) 537–544
7. Tokunaga, K., Furukawa, T.: Nonlinear ASSOM constituted of autoassociative neural modules. Proc. of WSOM2005 (2005) 637–644
8. Tokunaga, K., Furukawa, T.: Realizing the nonlinear adaptive subspace SOM (NL-ASSOM). Proc. of BrainIT 2005 (2005) 76
9. Furukawa, T.: SOM² as “SOM of SOMs”. Proc. of WSOM2005 (2005) 545–552
10. Furukawa, T.: SOM of SOMs: Self-organizing map which maps a group of self-organizing maps. Lecture Notes in Computer Science, 3696 (2005) 391–396
11. Furukawa, T.: SOM of SOMs: An extension of SOM from ‘map’ to ‘homotopy’. Proc. of ICONIP2006 (2006)
12. Kohonen, T.: Generalization of the Self-organizing map. Proc. of IJCNN93 (1993) 457–462
13. Kohonen, T., Kaski, S., Lappalainen, H.: Self-organized formation of various invariant-feature filters in the adaptive-subspace SOM. Neural Computation 9 (1997) 1321–1344
14. Kohonen, T.: Self-Organizing Maps, 3rd ed., Springer (2001)
15. Aziz Muslim, M., Ishikawa, M., Furukawa, T.: A new approach to task segmentation in mobile robots by mnSOM. Proc. of IJCNN2006 (2006)
16. Aziz Muslim, M., Ishikawa, M., Furukawa, T.: Task segmentation in a mobile robot by mnSOM: A new approach to training expert modules. Proc. of ICONIP2006 (2006)
17. Minatohara, T., Furukawa, T.: Self-organizing adaptive controllers: Application to the inverted pendulum. Proc. of WSOM2005 (2005) 41–48
18. Minatohara, T., Furukawa, T.: A proposal of self-organizing adaptive controller (SOAC). Proc. of BrainIT2005 (2005) 56
19. Nishida, S., Ishii, K.: An adaptive controller system using mnSOM. Proc. of BrainIT 2005 (2005) 85
20. Nishida, S., Ishii, K.: An adaptive neural network control system using mnSOM. Proc. of OCEANS2006 (in press)
21. Nishida, S., Ishii, K.: An Online Adaptation Control System using mnSOM. Proc. of ICONIP2006 (2006)
22. Horio, K., Suetake, N.: Inverse halftoning based on pattern information and filters constructed by mnSOM. Proc. of BrainIT 2005 (2005) 102
SOM of SOMs: An Extension of SOM from ‘Map’ to ‘Homotopy’
Tetsuo Furukawa
Kyushu Institute of Technology, Kitakyushu 808-0196, Japan
[email protected], http://www.brain.kyutech.ac.jp/~furukawa
Abstract. This paper proposes an extension of the SOM called the “SOM of SOMs,” or SOM², in which the objects to be mapped are self-organizing maps. In SOM², each nodal unit of a conventional SOM is replaced by an SOM function module. Therefore, SOM² can be regarded as a variation of the modular network SOM (mnSOM). Since each child SOM module in SOM² is trained to represent an individual map, the parent map in SOM² generates a self-organizing map representing the continuous change of the child maps. Thus SOM² is an extension of the SOM that generates a ‘self-organizing homotopy’ rather than a map. This extension is easily generalized to the case of SOMⁿ, such as “SOM³ as SOM of SOM²s”, corresponding to the n-th order of homotopy. This paper proposes a homotopy theory of SOM² together with new simulation results.
1 Introduction
SOM² is an extension of Kohonen’s self-organizing map (SOM) aimed at generating a self-organizing map of a set of self-organizing maps [1,2]. An SOM² consists of an assembly of basic (conventional) SOM modules arrayed on a lattice, which replace the reference vectors of the basic SOM. Thus SOM² is called the ‘SOM of SOMs’; the name may sound a bit eccentric, perplexing or even curious. However, despite its strange name, SOM² is a straightforward extension of a conventional SOM and is a powerful tool based on a sound mathematical theory. Since a basic SOM represents a map from a high-dimensional data space to a low-dimensional feature space, the actual task of SOM² is to represent the continuous change of those maps, i.e., a homotopy. Thus, SOM² is an extension from a “self-organizing map” to a “self-organizing homotopy”. When a group of datasets is given, SOM² approximates their distributions by using a set of child SOMs, and simultaneously the parent SOM generates a map of those child maps. If the distributions of two datasets are comparatively similar (or different), then those two datasets are located at nearer (or further) positions in the parent map. Such an architecture is useful when a set of data vectors observed from the same object forms a corresponding manifold in the data space. A typical example is face classification for a set of 2-dimensional photographs. In this case, a set of photographs taken of a single person from various viewpoints forms a manifold that is unique to that person.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 950–957, 2006. © Springer-Verlag Berlin Heidelberg 2006
Fig. 1. (a) The architecture of SOM². In this case, the SOM² has 5 × 1 child SOMs, each of which has 5 × 5 reference vectors. Thus, the parent map space is 5 × 1, while the child map spaces are 5 × 5. (b) A simulation result when datasets $\{D_1, D_2, D_3\}$ are given. $\{M_1, \dots, M_5\}$ are the child maps, and $M_1, M_3, M_5$ are the best matching maps (BMMs) of the datasets $D_1, D_2, D_3$, respectively. The dashed lines between child maps are called ‘fibers’, and these connect the reference vectors with the same index in each child SOM.
Therefore, if there are n people, then one obtains n face image manifolds that can be classified by an SOM². This ability of SOM² has been shown previously [1]. The purpose of this paper is to propose a homotopy theory of SOM². In addition, a new application field of SOM² is presented, namely, shape classification. The theory and an algorithm of SOM² are presented first, followed by some simulation results.
2 Algorithm and Theory of SOM²
2.1 What Is SOM²?
Like a conventional SOM, an SOM² has an arrayed structure of reference units on a lattice. In the conventional SOM, each reference unit represents a vector in the data space, whereas each reference unit in an SOM² represents an SOM. Figure 1 (a) shows an architecture of SOM² that has 5 × 1 reference maps (i.e., child SOMs) $M^1, \dots, M^5$, each of which has 5 × 5 reference vectors. In other words, the parent SOM has a 5 × 1 map size, while each child SOM has a 5 × 5 map size. SOM² is regarded as an SOM with a modular structure, the modules of which are basic SOMs. Therefore, SOM² is a variation of the modular network SOM (mnSOM), i.e., an SOM-module-mnSOM [3,4,5,6]. To clarify the purpose of the algorithm, we take a typical case in which a family of three datasets $\{D_1, D_2, D_3\}$ is given to the SOM² (Figure 1 (b)). In this case, the data vectors are distributed in 2-dimensional squares, which are topologically congruent but differ in position and orientation in the 3-dimensional data space. Figure 1 (b) also shows an actual simulation result. The child map that best represents a class distribution is called a ‘winner map’ or a ‘best matching map’ (BMM). In the case of Figure 1, the child maps $M^1, M^3, M^5$ became the BMMs of $D_1, D_2, D_3$, respectively. As a result, the three squares were arranged in the parent map in the desired order $D_1 \to D_2 \to D_3$. The child maps $M^2$ and $M^4$, which lost the competition, formed ‘intermediate squares’ that represent the continuous change of square positions and orientations. Thus, the homotopy was successfully organized in the SOM². The lines connecting the reference vectors with the same index are the so-called ‘fibers’ (the dashed lines in Figure 1 (b)). As shown in the figure, the fibers connect the corresponding points of the five child maps. Therefore, SOM² not only generates a map of objects, but also finds correspondences between the given objects.
In other words, SOM² can represent a set of data distributions by a bundle of fibers. Therefore, SOM² can also be regarded as an extension of an SOM that represents a “fiber bundle” rather than a manifold. The task of SOM² described above can be summarized as follows. (i) For a given family of sets $\{D_1, D_2, \dots\}$, represent those distributions using child maps $\{M^1, M^2, \dots\}$. (ii) Map those datasets in the parent map. (iii) Find corresponding points in the objects and connect them by fibers. These three tasks are conducted simultaneously.
2.2 Algorithm of SOM²
The algorithm of SOM² is based on the batch learning algorithm of the conventional SOM, which can be described as follows.
\[ w^l(t+1) = \sum_i \alpha_i^l(t) \, x_i \tag{1} \]
Here $w^l$ and $x_i$ denote the $l$-th reference vector and the $i$-th data vector, respectively, and $\alpha_i^l$ is determined by the neighborhood function. $\alpha_i^l$ is normalized so that $\sum_i \alpha_i^l = 1$.
Thus,
\[ \alpha_i^l = \frac{\exp\left[ -\|\xi_i^* - \xi^l\|^2 / 2\sigma^2(t) \right]}{\sum_{i'} \exp\left[ -\|\xi_{i'}^* - \xi^l\|^2 / 2\sigma^2(t) \right]} . \tag{2} \]
Here $\xi_i^*$ and $\xi^l$ are the coordinates in the map space of the BMU of $x_i$ and of the $l$-th reference unit, respectively. The algorithm of SOM² has been described in an earlier work, but here I derive it in another way. Suppose that an SOM² has $K$ child maps, each of which has $L$ reference vectors. Let $w^{k,l}$ denote the $l$-th reference vector of the $k$-th child map, and let $W^k = (w^{k,1}, \dots, w^{k,L})$ be a vector obtained by joining all reference vectors belonging to the $k$-th child map. Thus $W^k$ is the joint reference vector of the $k$-th child map $M^k$. By regarding these joint reference vectors of the child SOMs as the reference vectors of the parent SOM, the entire SOM² can be regarded as a conventional SOM with reference vectors $\{W^1, W^2, \dots, W^K\}$. Suppose $\{D_1, \dots, D_I\}$ is a family of sets observed from $I$ objects. We now suppose that we have another set of conventional SOMs, the reference vectors of which are $V_i = (v_i^1, v_i^2, \dots, v_i^L)$. Further suppose that $V_i$ learns only the dataset $D_i$. Let us call $V_i$ a “class map of $D_i$”, since $V_i$ organizes a map specialized to the $i$-th class. Under this condition, a naive algorithm for SOM² is to train the parent SOM on $\{V_1, V_2, \dots, V_I\}$, regarding them as ordinary data vectors. Thus, the class maps are calculated in advance, and then the joint reference vector $W^k$ is updated as
\[ W^k(t+1) = \sum_i \alpha_i^k(t) \, V_i . \tag{3} \]
This naive version of the SOM² algorithm offers useful suggestions for a better solution, though it has a fatal defect. (i) Usually, it is not easy to define a distance measure between two manifolds. This method defines the measure as the Euclidean distance between two joint reference vectors. It is a natural definition because the squared distance equals the sum of the squared lengths of the fibers between two manifolds. (ii) It is also easy to define the “median point” of a set of manifolds, which is given by the median point of the joint reference vectors. However, the fatal defect is that an SOM can organize several equivalent solutions for the reference vectors, e.g., a map rotated by 180 degrees or a map turned over. Therefore, it is meaningless to measure the distance between two manifolds without matching the corresponding points. This means that it is necessary to ascertain good correspondences between manifolds, i.e., to determine good “fibers” between child maps and class maps. To resolve this problem, one needs to estimate both child and class maps simultaneously. In such a case, an expectation-maximization (EM) style algorithm is available, i.e., the class maps and the child maps are estimated alternately. In the initial state, both class and child maps are set randomly, and the tentative class map is estimated from the datasets. Then the child maps are updated using the tentative class map, after which the class maps are estimated from the BMMs. Now suppose that the $i$-th dataset $D_i$ is picked up at time $t$. Then the child map with the least quantization error becomes the BMM of $D_i$, and the joint reference vector of the BMM is denoted by $W_i^*(t)$. Next the class map $\tilde{V}_i$ is estimated from the BMM
$W_i^*(t)$. That is, $W_i^*(t)$ is substituted into $\tilde{V}_i(t)$ as the initial state, and $\tilde{V}_i(t)$ is then updated using the batch learning SOM algorithm. Here batch learning is assumed to be executed in one step as follows.
\[ \tilde{v}_i^l(t) = \sum_j \beta_{i,j}^l(t) \, x_{i,j} \tag{4} \]
Here $\beta_{i,j}^l$ is given by the normalized neighborhood function that determines how $x_{i,j}$ affects $\tilde{v}_i^l$, and it satisfies $\sum_j \beta_{i,j}^l = 1$. Now that the estimated class maps are obtained, the child maps are updated from these estimated class maps.
\[ W^k(t+1) = \sum_i \alpha_i^k \, \tilde{V}_i(t) \tag{5} \]
This equation is equivalent to (3) with the exception that the class maps are tentatively estimated ones. By combining (4) and (5), we obtain
\[ w^{k,l}(t+1) = \sum_i \sum_j \alpha_i^k(t) \, \beta_{i,j}^l(t) \, x_{i,j} \tag{6} \]
\[ \phantom{w^{k,l}(t+1)} = \sum_i \alpha_i^k(t) \left\{ \sum_j \beta_{i,j}^l(t) \, x_{i,j} \right\} . \tag{7} \]
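One batch iteration of the update (6)–(7) can be sketched as follows. This is a minimal illustration rather than the author's implementation: the function names, the array layout, the lattice coordinates and the annealing schedule are assumptions. The inner product `beta @ X` computes the estimated class map of Eq. (4), and the final contraction applies Eq. (5).

```python
import numpy as np

def normalized_gauss(sq_dist, sigma, axis):
    """Gaussian neighborhood weights, normalized to sum to 1 along `axis`."""
    h = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return h / h.sum(axis=axis, keepdims=True)

def som2_step(datasets, W, xi_parent, xi_child, sigma_parent, sigma_child):
    """One batch update of all child maps.

    datasets : list of I arrays, each of shape (J_i, dim)
    W        : (K, L, dim) reference vectors w^{k,l}
    xi_parent: (K, p) coordinates of the child maps in the parent map
    xi_child : (L, c) coordinates of the reference units inside a child map
    """
    K, L, dim = W.shape
    bmm = np.empty(len(datasets), dtype=int)
    class_maps = np.empty((len(datasets), L, dim))
    for i, X in enumerate(datasets):
        # Squared distances of every data vector to every reference vector: (K, J, L)
        d2 = ((X[None, :, None, :] - W[:, None, :, :]) ** 2).sum(axis=-1)
        # BMM: the child map with the least mean quantization error
        bmm[i] = d2.min(axis=2).mean(axis=1).argmin()
        # BMU of each x_{i,j} inside the BMM
        bmu = d2[bmm[i]].argmin(axis=1)                                   # (J,)
        # beta^l_{i,j}, normalized over j for each l: (L, J)
        g2 = ((xi_child[:, None, :] - xi_child[bmu][None, :, :]) ** 2).sum(axis=-1)
        beta = normalized_gauss(g2, sigma_child, axis=1)
        class_maps[i] = beta @ X                                          # Eq. (4)
    # alpha^k_i, normalized over i for each k: (K, I)
    p2 = ((xi_parent[:, None, :] - xi_parent[bmm][None, :, :]) ** 2).sum(axis=-1)
    alpha = normalized_gauss(p2, sigma_parent, axis=1)
    W_new = np.einsum("ki,ild->kld", alpha, class_maps)                   # Eq. (5)
    return W_new, bmm

# Hypothetical setup: three unit squares at different heights in 3-D,
# a 5 x 1 parent map and 5 x 5 child maps.
rng = np.random.default_rng(0)
datasets = [np.column_stack([rng.uniform(0, 1, 200),
                             rng.uniform(0, 1, 200),
                             np.full(200, z)]) for z in (0.0, 1.0, 2.0)]
xi_parent = np.arange(5, dtype=float)[:, None]
xi_child = np.stack(np.meshgrid(np.arange(5), np.arange(5), indexing="ij"),
                    axis=-1).reshape(-1, 2).astype(float)
W = rng.normal(scale=0.1, size=(5, 25, 3))
for t in range(20):
    sp = 0.5 + 2.0 * np.exp(-t / 5.0)   # assumed schedule for the parent map
    sc = 0.7 + 2.0 * np.exp(-t / 5.0)   # assumed schedule for the child maps
    W, bmm = som2_step(datasets, W, xi_parent, xi_child, sp, sc)
print(bmm)
```

Because both weight sets are non-negative and normalized, every updated reference vector is a convex combination of data vectors. The sketch keeps the neighborhood widths well above zero; with very small widths the normalization could divide by an underflowed sum, which a practical implementation would need to guard against.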
Equation (6) is the update algorithm for SOM². Note that the estimated class maps $\{\tilde{V}_i\}$ are no longer needed to update the child maps $\{W^k\}$, because they were introduced only to derive the algorithm. This update is iterated while reducing the neighborhood size until both the parent and child maps reach a steady state. The update algorithm (6) has a recursive structure like a Russian doll. Therefore, it is easy to extend it to SOM³, SOM⁴, . . . , by further nesting.
2.3 Theory of SOM²
Here let us consider a theoretical aspect underlying SOM² from the point of view of topology. Let us assume that the data vectors dealt with by SOM² are distributed on a set of manifolds $\{U_i\}$ that are homotopic. Let us further suppose that the manifold $U_i$ is obtained by a continuous surjective map $\varphi_i$ from a base space $B = I^n$ ($n < m$, and $n$ equals the dimension of the child SOMs), as follows.
(8)
ξ → xi
(9)
Now let the nonlinear maps {ϕi } be obtained by a continuous change of an intrinsic parameter θ. Thus, xi = Φ(ξ, θi ) = ϕi (ξ).
(10)
Fig. 2. A result of the shape classification task. (a) 15 contours are given to an SOM2 . The contours are represented by a set of dots that are the data vectors. (b) The map generated by the SOM2 . The SOM2 successfully generated a map of contours, indicating the continuous changes of shapes, sizes and orientations.
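Here $\Phi(\xi, \theta_i)$ is the homotopy. Under this condition, the distance between two manifolds $U_1$ and $U_2$ can be defined as follows.
\[ L^2(U_1, U_2) \triangleq \int_{\xi \in B} \left\| \varphi_1(\xi) - \varphi_2(\xi) \right\|^2 p(\xi) \, d\xi \tag{11} \]
\[ \phantom{L^2(U_1, U_2)} = \int_{\xi \in B} \left\| \Phi(\xi, \theta_1) - \Phi(\xi, \theta_2) \right\|^2 p(\xi) \, d\xi . \tag{12} \]
Here $p(\xi)$ gives the density of $\xi$. By employing this definition, the distance between a data class $D_i$ and a child map $W^k$ is approximated by
\[ L^2(D_i, W^k) \simeq \frac{1}{L} \sum_{l=1}^{L} \left\| v_i^l - w^{k,l} \right\|^2 = \frac{1}{L} \left\| V_i - W^k \right\|^2 , \tag{13} \]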
and we obtain the algorithm of SOM². SOM² is thus an unsupervised learning machine that ascertains the homotopy $\Phi$ from a family of sets $\{D_1, \dots, D_I\}$. It can also be regarded as using a fiber bundle to represent data distributions, the sections of which represent the data classes.
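The role of the fibers in the distance (13) can be checked with a small numerical example. The sketch below is illustrative only; the helper name `map_distance` and the 5 × 5 grid are assumptions. It compares a map with itself and with a copy whose unit indices are rotated by 180 degrees, i.e., the same set of reference vectors with mismatched correspondences.

```python
import numpy as np

def map_distance(V, W):
    """Eq. (13): (1/L) * ||V - W||^2 for joint reference vectors of shape (L, dim)."""
    L = V.shape[0]
    return float(((V - W) ** 2).sum() / L)

# A hypothetical 5 x 5 child map that covers the unit square exactly.
g = np.linspace(0.0, 1.0, 5)
V = np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1)   # (5, 5, 2)

# Same reference vectors, but with the unit indices rotated by 180 degrees.
rotated = V[::-1, ::-1]

same = map_distance(V.reshape(-1, 2), V.reshape(-1, 2))
flipped = map_distance(V.reshape(-1, 2), rotated.reshape(-1, 2))
print(same, flipped)   # 0.0 for matched indices, 1.0 for the rotated indexing
```

Both maps represent the same square, yet the index-wise distance is far from zero for the rotated copy. This is the defect of the naive algorithm described in Section 2.2, and it is why the class maps are initialized from the BMM so that corresponding units stay aligned.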
3 Simulation Results
The ability of SOM² has been shown in cases of artificial manifolds and 2D images projected from 3D objects [1,2]. SOM² has also been applied to facial image recognition [1].
Fig. 3. The map of alphabet generated by NG-SOM
In this paper, another application field of SOM² is presented, namely, shape classification. It is known that a conventional SOM can be used for shape representation. In such cases, data vectors are assumed to be distributed on the surface of the object, and a conventional SOM learns the distribution of the data vectors. If one has n objects, then one needs n conventional SOMs to represent those shapes. Consequently, an SOM² can make a map of these SOMs, i.e., of a set of shapes. It is expected that objects with similar shapes are mapped nearer to each other, while those with different shapes are mapped further apart in the SOM². The advantage of this method is that a user can deal directly with “shapes of objects” without employing any heuristic vectorization. Figure 2 shows a result of a simulation of shape classification. In this case, 15 contours are given to an SOM². Each contour consists of a set of small dots that corresponds to a dataset. Thus, the $i$-th contour corresponds to the $i$-th dataset $D_i = \{x_{i,1}, x_{i,2}, \dots, x_{i,J}\}$, and $x_{i,j} = (x_{ij}, y_{ij})$ represents the coordinates of the $j$-th dot in the $i$-th contour. Each child SOM has a one-dimensional closed ring structure to represent a contour, and the parent SOM has 7 × 7 child SOMs. Figure 2 (b) shows the map generated by the SOM². The SOM² successfully generated a map of contours that shows continuous changes of the shapes, sizes and orientations of the objects. An advantage of this method is that the result is robust to small changes in the position or orientation of the contours. Figure 3 is a tentative result of handwritten character classification. Since the topological structures of the characters are all different, neural gas networks are employed instead of child SOMs for this task; namely, an NG-SOM was used. The handwritten data were also represented by sets of small dots. In the case of Figure 3, 26 characters written by one person were given to the NG-SOM. After the map was generated,
another 9 sets of characters written by 9 people, i.e., 26 × 9 = 234 characters, were given to the NG-SOM. The recognition rate was 93.2 ± 6.7% (mean ± SD, n = 9). In this case, the training dataset has only one sample for each letter, whereas the test dataset has nine samples for each letter, written by different people with slightly different positions, shapes and sizes. Note that neither heuristic preprocessing nor an additional feature-extraction algorithm was used.
4 Conclusion
In this paper we have proposed an extension of the SOM called SOM². Despite the eccentric impression given by the name ‘SOM of SOMs’, SOM² is a straightforward extension of a conventional SOM from a ‘map’ to a ‘homotopy’. As a closing remark, I have some comments about SOM². First, some people may think that SOM² is a supervised algorithm, because it requires a labeled dataset. However, such an understanding is not accurate. The mapping objects of SOM² are distributions of data vectors, and each distribution must be estimated from a set of data vectors. Each data vector in the conventional SOM corresponds to a data distribution in an SOM². Second, our aim is not to develop an alternative algorithm that supersedes the conventional one. The concept of SOM² tells us that we have a family of SOM¹ (the conventional SOM), SOM², SOM³, . . . , and users can choose an appropriate order from the SOMⁿ family depending on their purpose. For some tasks an SOM² would be the best solution, and for others a conventional SOM would be appropriate. Therefore the idea of SOM² will further enlarge the application fields for SOMs.
Acknowledgement
This work was supported by a COE program (center #J19) granted by MEXT of Japan. This work was also partially supported by a Grant-in-Aid for Scientific Research (C) granted by MEXT of Japan.
References
1. Furukawa, T.: SOM² as “SOM of SOMs”. Proc. of WSOM2005 (2005) 545–552
2. Furukawa, T.: SOM of SOMs: Self-organizing map which maps a group of self-organizing maps. Lecture Notes in Computer Science, 3696 (2005) 391–396
3. Tokunaga, K., Furukawa, T., Yasui, S.: Modular network SOM: Extension of SOM to the realm of function space. Proc. of WSOM2003 (2003) 173–178
4. Furukawa, T., Tokunaga, K., Morishita, K., Yasui, S.: Modular network SOM (mnSOM): From vector space to function space. Proc. of IJCNN2005 (2005) 1581–1586
5. Kaneko, S., Tokunaga, K., Furukawa, T.: Modular network SOM: The architecture, the algorithm and applications to nonlinear dynamical systems. Proc. of WSOM2005 (2005) 537–544
6. Furukawa, T.: Generalization of the Self-Organizing Map: From Artificial Neural Networks to Artificial Cortexes. Proc. of ICONIP2006 (2006)
Modular Network SOM: Theory, Algorithm and Applications
Kazuhiro Tokunaga and Tetsuo Furukawa
Kyushu Institute of Technology, Kitakyushu 808-0196, Japan
{tokunaga, furukawa}@brain.kyutech.ac.jp
Abstract. The modular network SOM (mnSOM) proposed by the authors is an extension and generalization of the conventional SOM in which each nodal unit is replaced by a module such as a neural network. It is expected that the mnSOM will extend the area of applications beyond that of the conventional SOM. We set out to establish the theory and algorithm of the mnSOM, and to apply it to several research topics, to create a fundamental technology that is generally usable only in expensive studies. In this paper, the theory and the algorithm of the mnSOM are reported; moreover, the results of applications of the mnSOM are presented.
1 Introduction
Kohonen’s self-organizing map (SOM) performs a topology-preserving transformation from a higher-dimensional vector space to a lower-dimensional one, usually two-dimensional, and generates a map that visually displays the similarity between vectors [1]. In addition, the units of a SOM can interpolate intermediate vectors between the input vectors. Because of these features, the SOM has been applied in various fields such as medical treatment, information and communication, control systems, and image and speech analysis.
Although the SOM has been used in these various areas, the objects that it deals with are limited to vector data distributed in a vector space. With the conventional SOM, it is difficult to generate a map of objects such as a set of time-series data or a set of manifolds. A generalized SOM that can generate a map for various object types is therefore needed. Further, if a generalized SOM can generate the intermediate objects in a self-organizing manner, then SOMs will become more powerful tools.
Kohonen has described the necessity of generalizing the SOM and proposed the self-organizing operator map (SOOM) as such a generalization [2]. In the SOOM, each nodal unit of the SOM is replaced by a linear operator such as an AR model in order to generate a map of dynamic signals. In other words, the network structure of the SOOM is a modular network in which each module is a linear operator. Kohonen tried to derive a general principle of the SOM from the SOOM. However, we believe that it cannot truly be described as a generalization of the SOM, since the theory is limited to the SOOM, each module of which is a linear operator.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 958–967, 2006. © Springer-Verlag Berlin Heidelberg 2006
Recently, we proposed the modular network SOM (mnSOM), in which each nodal unit of the conventional SOM is replaced by a module such as a neural network [3]. The module of the mnSOM can be freely designed to suit the objects that are to be mapped by the topology-preserving transformation. For example, when each module of the mnSOM is a multi-layer perceptron (MLP), which represents a nonlinear function, the mnSOM generates a map in function space [4]. Moreover, when each module is a recurrent neural network that represents a function of a dynamical system, the map generated by the mnSOM gives the interrelationships between the functions of the dynamical systems [5]. Therefore, the mnSOM can be regarded as an extension and generalization of Kohonen's SOM; moreover, it is expected that the mnSOM will become a fundamental technology for neural network models and, as a generalized SOM, will expand the fields to which SOMs can be applied.
In our past studies, the theory and the algorithm of the mnSOM based on the MLP module (MLP-mnSOM), the theoretical framework of which is simple, were established to prove that the mnSOM behaves as a generalized SOM. Moreover, not only the MLP-mnSOM but also mnSOMs based on various kinds of modules have been applied to a variety of research topics. As a result, it has been proven that the characteristics of the maps of mnSOMs and SOMs are equal. This suggests that our mnSOM is a generalization of the SOM. Further, it has been noted that the mnSOM has characteristics of both supervised and unsupervised learning; that is, the mnSOM performs not only supervised learning corresponding to the given target signals, but also unsupervised learning in which intermediate objects are generated in a self-organizing manner.
In this paper, the theory and the algorithm of the MLP-mnSOM are presented. Moreover, the characteristics of the maps in mnSOMs and SOMs are compared, and results from various research areas obtained with several variants of the mnSOM are shown.
2 Theory and Algorithm
2.1 Theory and Framework
The mnSOM has the structure of a modular network, the modules of which are arranged in a lattice (Fig. 1(a)). In this paper, each module of the mnSOM is a multi-layer perceptron (MLP) that represents a nonlinear function; this variant is called the “MLP-mnSOM”. The MLP-mnSOM generates a map that presents the similarity relationships between functions. In other words, neighboring modules in the mnSOM acquire similar functions through training, while distant modules acquire different functions. Fig. 1(b) shows the framework of the MLP-mnSOM. Suppose that there are $I$ static systems whose input-output characteristics are defined by functions $f_i(\cdot)$, $i = 1, ..., I$, each with $d_{in}$ inputs and $d_{out}$ outputs. In addition, suppose that input-output vector datasets $D_i = \{(x_{ij}, y_{ij})\}$, $j = 1, ..., J$, are observed from the systems, so that $y_{ij} = f_i(x_{ij})$. Only the input-output vector datasets $D_i$ are known; the functions $f_i(\cdot)$ of the systems are unknown.
[Figure 1. Panel (a): modules arranged on a lattice, each module being a multilayer perceptron (MLP) with input, hidden and output layers; the Best Matching Module (BMM) is marked. Panel (b): Systems 1, ..., I with unknown functions f_1, ..., f_I and known input-output datasets D_i, which are used to train the map generated by the MLP-mnSOM.]
Fig. 1. (a) The architecture of an MLP-mnSOM. (b) The framework of an MLP-mnSOM.
Meanwhile, the mnSOM is composed of $M$ MLP modules, each of which has $d_{in}$ input-layer units, $d_{hidden}$ hidden-layer units and $d_{out}$ output-layer units. Under these conditions, the MLP-mnSOM simultaneously executes the following tasks:
(1) To identify the functions $f_i(\cdot)$ from the datasets $D_i$, $i = 1, ..., I$.
(2) To generate a map based on the similarity measures between functions.
Task (1) means that the function of each system is estimated through the learning of its Best Matching Module (BMM). Here, the BMM of a system is the module whose output error with respect to the desired outputs of that system's dataset is the smallest.
Task (2) means that if the $i$-th and the $i'$-th systems have similar characteristics, then the two corresponding BMMs are located near each other in the lattice. Here, the similarity measure $L^2(g, f)$ between functions $g(x)$ and $f(x)$ is defined as follows:

$$L^2(g, f) = \int \| g(x) - f(x) \|^2 \, p(x) \, dx, \tag{1}$$

where $p(x)$ denotes the probability density function of $x$. Moreover, the modules between the BMMs of the $i$-th and the $i'$-th systems become “intermediate systems” by interpolation. These two tasks are processed in parallel.

2.2 Algorithm
The algorithm for the mnSOM consists of four processes: (1) the evaluative process, (2) the competitive process, (3) the cooperative process, and (4) the adaptive process. In this paper, the algorithm of the MLP-mnSOM is shown.

(1) Evaluative process
The outputs of all modules are evaluated for each input-output data vector pair. Suppose that an input data vector $x_{ij}$ is picked up; then the output of the $k$-th module, $\tilde{y}^k_{ij}$, is calculated for that input. This calculation is repeated for $k = 1, ..., M$ using the same input $x_{ij}$. After evaluating all outputs for all inputs, the errors of all modules for each data class are evaluated. Let $E_i^k$ be the error of the $k$-th module for the dataset from the $i$-th system, i.e.,

$$E_i^k = \frac{1}{J} \sum_{j=1}^{J} \| \tilde{y}^k_{ij} - y_{ij} \|^2 = \frac{1}{J} \sum_{j=1}^{J} \| g^k(x_{ij}) - f_i(x_{ij}) \|^2. \tag{2}$$

If $J$ is large enough, then the distance between the $k$-th module and the $i$-th system in the function space is approximated by the error $E_i^k$ as follows:

$$L^2(g^k, f_i) = \int \| g^k(x) - f_i(x) \|^2 \, p_i(x) \, dx \simeq E_i^k. \tag{3}$$

In this paper it is assumed that $\{p_i(x)\}$ for $i = 1, ..., I$ are approximately the same as $p(x)$, due to normalization of the data distribution for each class.

(2) Competitive process
The module that reproduces $\{y_{ij}\}$ best is defined as the BMM, i.e. the winner module for the $i$-th system. Let $k_i^*$ be the module number of the BMM for the $i$-th system; then $k_i^*$ is defined as

$$k_i^* = \arg\min_k E_i^k. \tag{4}$$
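The evaluative and competitive processes can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: inputs and outputs are scalars, and each "module" is a fixed stand-in function rather than a trained MLP. The names `module_errors` and `best_matching_modules` are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Dataset = List[Tuple[float, float]]   # pairs (x_ij, y_ij); scalar case for brevity
Module = Callable[[float], float]     # g^k: maps an input to a predicted output


def module_errors(modules: Sequence[Module],
                  datasets: Sequence[Dataset]) -> List[List[float]]:
    """Return E[i][k], the mean squared error of module k on dataset i (Eq. 2)."""
    return [
        [sum((g(x) - y) ** 2 for x, y in data) / len(data) for g in modules]
        for data in datasets
    ]


def best_matching_modules(errors: Sequence[Sequence[float]]) -> List[int]:
    """Return k_i^* = argmin_k E[i][k] for every system i (Eq. 4)."""
    return [min(range(len(row)), key=lambda k, row=row: row[k]) for row in errors]


# Toy usage: three fixed stand-in "modules" and two noiseless datasets.
modules = [lambda x: x, lambda x: x ** 2, lambda x: -x]
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
datasets = [[(x, x ** 2) for x in xs], [(x, -x) for x in xs]]
errors = module_errors(modules, datasets)
bmm = best_matching_modules(errors)
print(bmm)
```

In this toy setting the second module reproduces the first dataset exactly and the third module reproduces the second dataset, so they are selected as the respective BMMs.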
(3) Cooperative process
The learning rate of each module is calculated by using the neighborhood function. Usually, a BMM and its neighbor modules gain larger learning rates than other modules. Let $\psi_i^k(T)$ denote the learning rate of the $k$-th module for the $i$-th system at learning time $T$. Then $\psi_i^k(T)$ is given by

$$\psi_i^k = \frac{h(l(k, k_i^*); T)}{\sum_{i'=1}^{I} h(l(k, k_{i'}^*); T)}, \tag{5}$$

$$h(l; T) = \exp\left[ -\frac{l^2}{2\sigma^2(T)} \right]. \tag{6}$$

Here, $l(k, k_i^*)$ expresses the distance between the $k$-th module and the BMM for the $i$-th system in the map space, i.e. the distance on the lattice of the mnSOM. $h(l; T)$ is a neighborhood function that shrinks with the calculation time $T$, and $\sigma^2(T)$ is the width of the neighborhood function at time $T$. $\sigma^2(T)$ decreases monotonically with $T$ as

$$\sigma^2(T) = \sigma_{min} + (\sigma_{max} - \sigma_{min}) \exp\left[ -\frac{T}{\tau} \right], \tag{7}$$

where $\sigma_{max}$, $\sigma_{min}$ and $\tau$ are constants.

(4) Adaptive process
All modules are trained by the backpropagation learning algorithm as

$$\Delta w^k = -\eta \sum_{i=1}^{I} \psi_i^k \frac{\partial E_i^k}{\partial w^k} = -\eta \frac{\partial E^k}{\partial w^k}. \tag{8}$$

Here, $w^k$ denotes the weight vector of the $k$-th MLP module and $E^k = \sum_i \psi_i^k E_i^k$. Note that $E^k$ has a global minimum point at which $g^k(x)$ satisfies

$$g^k(x) = \sum_{i=1}^{I} \psi_i^k f_i(x). \tag{9}$$

Therefore, $g^k(x)$ is updated so as to converge to the interior division of $\{f_i(x)\}$ with the weights $\{\psi_i^k\}$. During the training of the MLPs, each input vector $x_{ij}$ is presented one by one as the input, and the corresponding output $y_{ij}$ is presented as the desired signal.
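The cooperative process and the fixed point of Eq. (9) can be illustrated with a short Python sketch. It rests on several assumptions that are not in the paper: modules sit on a square lattice with Euclidean lattice distance, the value of the schedule in Eq. (7) is inserted directly where Eq. (6) has $\sigma^2(T)$, and the functions `neighborhood_width`, `learning_rates` and `interior_division` are hypothetical names chosen for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def neighborhood_width(T: float, s_max: float, s_min: float, tau: float) -> float:
    """Decaying schedule of Eq. (7): starts at s_max and approaches s_min."""
    return s_min + (s_max - s_min) * math.exp(-T / tau)


def learning_rates(positions: Sequence[Tuple[float, float]],
                   bmm: Sequence[int], width: float) -> List[List[float]]:
    """Return psi[i][k] (Eq. 5): Gaussian neighborhood (Eq. 6) normalized over systems i.

    `width` is used where Eq. (6) has sigma^2(T) (an assumption of this sketch).
    """
    h = []
    for k_star in bmm:
        cx, cy = positions[k_star]
        h.append([math.exp(-((px - cx) ** 2 + (py - cy) ** 2) / (2.0 * width))
                  for px, py in positions])
    n_modules = len(positions)
    totals = [sum(row[k] for row in h) for k in range(n_modules)]
    return [[row[k] / totals[k] for k in range(n_modules)] for row in h]


def interior_division(psi: Sequence[Sequence[float]],
                      functions: Sequence[Callable[[float], float]],
                      k: int, x: float) -> float:
    """Target value of module k at input x according to Eq. (9)."""
    return sum(psi[i][k] * f(x) for i, f in enumerate(functions))


# Toy usage: 3 x 3 lattice, two systems whose BMMs sit in opposite corners.
positions = [(float(r), float(c)) for r in range(3) for c in range(3)]
bmm = [0, 8]
width = neighborhood_width(T=0.0, s_max=10.0, s_min=2.0, tau=300.0)
psi = learning_rates(positions, bmm, width)
center_target = interior_division(psi, [lambda x: x, lambda x: -x], k=4, x=0.5)
```

Because the rates are normalized over the systems, they sum to one for every module, which is what makes the fixed point of Eq. (9) a weighted average (an interior division) of the system functions. The center module is equidistant from both BMMs, so its target is the midpoint of the two functions.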
3 Computer Simulation
This section presents the results of several simulations: the maps of a family of cubic functions generated by the MLP-mnSOM, and the results of applying several variants of the mnSOM to various research topics.
3.1 Maps of Cubic Functions with MLP-mnSOM
First, the MLP-mnSOM generated maps corresponding to datasets observed from systems whose input-output characteristics were defined by cubic functions $y = ax^3 + bx^2 + cx$. The simulations were made under two different conditions. In the first case (simulation 1), there was a small number of datasets ($I = 6$) with a large number of data samples ($J = 200$), whereas in the second case (simulation 2) there was a large number of datasets ($I = 126$) with a small number of data samples ($J = 8$). Each dataset $D_i = \{(x_{ij}, y_{ij})\}$ was sampled randomly, with the probability density function $p(x)$ distributed uniformly on $[-1, +1]$. In addition, Gaussian white noise was added to $\{y_{ij}\}$ (standard deviation of the noise $\sigma_{noise} = 0.04$). Figs. 2(a) and 3(a) show examples of the datasets in simulations 1 and 2. It is easy to identify the individual functions from each of the datasets in Fig. 2(a), whereas in Fig. 3(a) identifying the individual functions is difficult. However, the interpolation between datasets by the cooperative process of the MLP-mnSOM is expected to facilitate the identification of the true functions. The MLP module has three layers, with one input unit, eight hidden units and one output unit. Other details are presented in Table 1. Figs. 2(b) and 3(b) show the results of simulations 1 and 2, respectively. The curve depicted in each box represents the function acquired by the corresponding module after training. The MLP-mnSOM generated similar maps in both cases. Neighboring modules acquired similar function forms, and the modules in the corners show opposite functions. The modules indicated by thick frames in Fig. 2(b) are the BMMs of the given six datasets. All other functions acquired by the rest of the modules were interpolated by the mnSOM in order to make a continuous map. Also, in simulation 2 (Fig. 3(b)), the mnSOM succeeded in generating a map of cubic functions, despite there being only a small number of data samples.
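The data-generation procedure described for simulations 1 and 2 can be sketched as follows. The paper does not list the coefficients $(a, b, c)$ of the cubic functions, so the ones drawn here are made up for illustration, and `sample_cubic_dataset` is a hypothetical helper name.

```python
import random
from typing import List, Optional, Tuple


def sample_cubic_dataset(a: float, b: float, c: float, n_samples: int,
                         noise_std: float = 0.04,
                         rng: Optional[random.Random] = None) -> List[Tuple[float, float]]:
    """Sample (x, y) pairs from y = a x^3 + b x^2 + c x with x uniform on [-1, 1]
    and additive Gaussian noise of standard deviation noise_std."""
    rng = rng or random.Random()
    data = []
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        y = a * x ** 3 + b * x ** 2 + c * x + rng.gauss(0.0, noise_std)
        data.append((x, y))
    return data


rng = random.Random(0)
# Setting of simulation 1: I = 6 systems, J = 200 samples each.
# The coefficient ranges below are an assumption, not taken from the paper.
coefficients = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
                for _ in range(6)]
datasets = [sample_cubic_dataset(a, b, c, 200, rng=rng) for a, b, c in coefficients]
```

Setting `n_samples=8` and drawing 126 coefficient triples would give the sparse setting of simulation 2.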
These results from simulations 1 and 2 suggest that the identification of functions was performed not only by the supervised learning in each module, but also by unsupervised learning. Incidentally, it is important that the essence of the maps generated by the mnSOM and by the SOM is the same. If the essence of the map of the SOM were lost by replacing each unit of the SOM with a module, then it could not be confidently said that the mnSOM is a generalization of the SOM. To investigate the essence of the map of the mnSOM, we considered a case in which an orthonormal functional expansion is employed for vectorization. Let $\{P_i(\cdot)\}$ be a set of orthonormal functions. Then the function $f(\cdot)$ is transformed to a coefficient vector $a = (a_1, ..., a_n)$, where $f(x) = a_1 P_1(x) + a_2 P_2(x) + ... + a_n P_n(x)$. Terms higher than the $n$-th order are assumed to be negligible. Under this condition, the distance $L_f$ in the function space is identical to the distance $L_v$ in the coefficient vector space:

$$L_f^2(f_i, g^k) = (a_{i1} - b_{k1})^2 + \cdots + (a_{in} - b_{kn})^2 = L_v^2(a_i, b_k). \tag{10}$$
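The identity in Eq. (10) can be checked numerically for cubic functions. The sketch below assumes Legendre polynomials normalized to be orthonormal under the uniform density $p(x) = 1/2$ on $[-1, 1]$, and it keeps the constant ($P_0$) term so that the identity holds exactly; the paper itself reports three coefficients per function. The helper names are hypothetical.

```python
import math
from typing import Callable, List


def legendre_coefficients(a: float, b: float, c: float) -> List[float]:
    """Coefficients of f(x) = a x^3 + b x^2 + c x in the Legendre basis
    sqrt(2n + 1) P_n(x), n = 0..3, which is orthonormal under p(x) = 1/2."""
    return [
        b / 3.0,                                  # P_0 component
        (3.0 * a / 5.0 + c) / math.sqrt(3.0),     # P_1 component
        (2.0 * b / 3.0) / math.sqrt(5.0),         # P_2 component
        (2.0 * a / 5.0) / math.sqrt(7.0),         # P_3 component
    ]


def cubic(a: float, b: float, c: float) -> Callable[[float], float]:
    return lambda x: a * x ** 3 + b * x ** 2 + c * x


def l2_distance_sq(f: Callable[[float], float], g: Callable[[float], float],
                   n: int = 20000) -> float:
    """Midpoint-rule estimate of the integral of (f - g)^2 p(x) dx, p(x) = 1/2."""
    h = 2.0 / n
    total = 0.0
    for j in range(n):
        x = -1.0 + (j + 0.5) * h
        total += (f(x) - g(x)) ** 2
    return 0.5 * total * h


f_params = (1.0, -0.5, 0.3)    # arbitrary example functions
g_params = (-0.4, 0.8, 0.1)
dist_function_space = l2_distance_sq(cubic(*f_params), cubic(*g_params))
dist_coefficient_space = sum(
    (p - q) ** 2
    for p, q in zip(legendre_coefficients(*f_params), legendre_coefficients(*g_params))
)
```

Up to the quadrature error, the two distances agree, which is the property that lets a conventional SOM on coefficient vectors serve as a reference for the mnSOM map.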
In Eq. (10), $a_i$ and $b_k$ denote the coefficient vectors of the given $i$-th function and of the $k$-th module of the mnSOM, respectively. In this situation, the mnSOM and the SOM should produce the same results, since the learning algorithm is identical for the two. Therefore, the mnSOM and the SOM share the same essence of the map. In simulation 1, the six functions were converted to the coefficient vectors $\{a_i\} = \{(a_{i1}, a_{i2}, a_{i3})\}$ by orthogonal expansion (in this paper, Legendre expansions are used). The map generated by a SOM for $\{a_i\}$ is shown in Fig. 4(a). The figure shows the position of each reference vector in the coefficient vector space. The map generated by the mnSOM is shown in Fig. 4(b). The figure shows the map of the coefficient vectors $b_k = (b_{k1}, b_{k2}, b_{k3})$ of the Legendre expansion corresponding to the functions acquired by the modules. The results in Fig. 4(a) and (b) are roughly equal. Since the accuracy of the function approximation by the MLP is low, some distortion appears in the map of Fig. 4(b).

Table 1. Experimental conditions
Parameter              Value
Input layer            1
Hidden layer           8
Output layer           1
Learning rate η        0.05
Map size K             100 (10 × 10)
σ0                     10.0
σ∞ (simulation 1)      2.0
σ∞ (simulation 2)      1.0
τ                      300

Fig. 2. (a) Training datasets sampled from the six cubic functions in simulation 1. (b) Map of cubic functions generated by the MLP-mnSOM in simulation 1.

3.2 Applications of the mnSOM
In our past studies, we have applied the mnSOM to various research topics (Fig. 5). The design of the modules differs from one application to another, whereas the architecture and the algorithm are the same in all applications. The results of these applications are described below.
Fig. 3. (a) Example of training datasets in simulation 2. (b) Map of cubic functions generated by the MLP-mnSOM in simulation 2.
Fig. 4. Map of cubic functions plotted in the coefficient vector space of the Legendre expansion. (a) Map generated by the SOM (axes a1, a2, a3). (b) Map generated by the mnSOM (axes b1, b2, b3).
Fig. 5(a) shows the “Map of weather in the Kyushu area of Japan” generated by an mnSOM in which each module is a neural network that predicts the dynamics of weather. In this simulation, the mnSOM was given only the weather datasets of nine cities in the Kyushu area of Japan; nevertheless, the resulting map represented the relationship between the geographic positions of the cities in Kyushu. Fig. 5(b) shows the “Map of damped oscillators”. In this simulation, each module of the mnSOM is a recurrent neural network that represents a function of a dynamical system. The mnSOM was given only the datasets obtained from nine damped oscillators whose input-output characteristics differ from one another. The modules of the mnSOM acquired the characteristics of the nine damped oscillators by training; moreover, the intermediate characteristics between the nine damped oscillators were also acquired. Fig. 5(c) shows the “Map of periodical waveforms” generated by an mnSOM based on a five-layer autoassociative neural network module. This map was generated by training the mnSOM on several periodical waveforms. As a result, a map in which the waveforms and frequencies are separated was generated. Fig. 5(d) shows the “Map of 3D objects”. In this simulation, only sets of 2-dimensional images were given to the mnSOM. Although the mnSOM was not taught how to reconstruct a solid from 2-dimensional images, it generated a map of 3D objects.

Fig. 5. Results of applications of the mnSOM. (a) Map of weather in the Kyushu area of Japan. (b) Map of damped oscillators. (c) Map of periodical waveforms. (d) Map of 3D objects.
4 Concluding Remarks
We have presented the theory and the algorithm of our mnSOM. From the results, we proved that the characteristics of the maps of mnSOMs and SOMs are equal. Therefore, our mnSOM can be regarded as a generalization of Kohonen's SOM. Additionally, it is expected that the mnSOM will become a fundamental technology for neural network models, since the mnSOM has characteristics of both supervised and unsupervised learning. The mnSOM can be applied to various research fields in which conventional methods are too difficult to apply. We expect to be able to use our mnSOM for extensive studies. A specific theory for a generalized SOM is explained by Furukawa.
Acknowledgments This work was supported by a COE program (center #J19) granted to Kyushu Institute of Technology by MEXT of Japan. This work was also partially supported by a Grant-in-Aid for Scientific Research (C) granted by MEXT of Japan. This work was also supported by the Okawa Foundation for Information and Telecommunications.
References
1. Kohonen, T.: Self-organizing maps. 3rd ed. Springer-Verlag, Berlin Heidelberg New York (2003)
2. Kohonen, T.: Generalization of the self-organizing map. Proc. of International Joint Conference on Neural Networks (1993) 457–462
3. Tokunaga, K., Furukawa, T., Yasui, S.: Modular network SOM: Extension of SOM to the realm of function space. Proc. of Workshop on Self-Organizing Maps (2003) 173–178
4. Furukawa, T., Tokunaga, K., Morishita, K., Yasui, S.: Modular network SOM (mnSOM): From vector space to function space. Proc. of International Joint Conference on Neural Networks (2005) 1581–1586
5. Kaneko, S., Tokunaga, K., Furukawa, T.: Modular network SOM: The architecture, the algorithm and applications to nonlinear dynamical systems. Proc. of Workshop on Self-Organizing Maps (2005) 537–544
Evaluation-Based Topology Representing Network for Accurate Learning of Self-Organizing Relationship Network
Takeshi Yamakawa¹, Keiichi Horio¹, and Takahiro Tanaka²
¹ Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, Hibikino, Wakamatsu, Kitakyushu, Fukuoka 808-0196, Japan {horio, yamakawa}@brain.kyutech.ac.jp
² FANUC LTD, Oshino, Yamanashi 401-0597, Japan
Abstract. A Self-Organizing Relationship (SOR) network approximates a desirable input-output (I/O) relationship of a target system using I/O vector pairs and their evaluations. However, when the topology of the network differs from that of the data set, the SOR network cannot precisely represent the topology of the data set or generate desirable outputs, because the topology of the SOR network is fixed to a one- or two-dimensional surface during learning. On the other hand, a Topology Representing Network (TRN) precisely represents the topology of the data set by a graph using Competitive Hebbian Learning. In this paper, we propose a novel method that represents the topology of a data set with evaluation by creating a fusion of the SOR network and the TRN.
1 Introduction
The Self-Organizing Relationship (SOR) network approximates a desirable Input/Output (I/O) relationship of a target system using I/O pairs and their evaluations [1][2]. The SOR network is effective when the desirable I/O relationship of the target system is unknown but the I/O pairs obtained by trial and error can be evaluated. The SOR network learns efficiently because it uses both the successful data and the failed data. After learning, when an input vector is applied, a desirable output is generated by the execution mode. However, to obtain good learning results it is preferable that the topology of the data set is known in advance. When the topology of the network differs from that of the data set, the SOR network cannot represent the topology of the data set, because the topology of the SOR network depends on the structure of the network, which is fixed during learning. On the other hand, the Topology Representing Network (TRN) precisely represents the topology of the data set by a graph [3]. The TRN is a combination of the Neural-Gas Network (NGN) [4], which quantizes the data set, and the Competitive Hebbian Learning rule (CHL) [5], which generates topological connections between neural units whose weight vectors are near each other.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 968–977, 2006. © Springer-Verlag Berlin Heidelberg 2006
In this paper, we propose a novel method that approximates the desirable I/O relationship of the target system using I/O pairs and their evaluations by creating a fusion of the SOR network and the TRN. The main ideas of the proposed method are the evaluation-based CHL (E-CHL) and an execution mode that generates desirable outputs based on the graph. The E-CHL introduces the concept of a connection strength for the links between units. The links are strengthened or weakened depending on the evaluation of the data. Therefore, the E-CHL can generate topological connections that avoid the data with negative evaluation and represent the data with positive evaluation. In the execution mode, the proposed method generates desirable outputs based on the topological connections generated by the E-CHL. Therefore, a target I/O relationship with a complex topology can be approximated by the proposed method, even if the relationship is a multi-valued function. The effectiveness of the proposed method is verified by some experiments.
2 Self-Organizing Relationship Network
The SOR network consists of the input layer, the output layer, and a one- or two-dimensional competitive layer, which contain $n$, $m$, and $N$ units, respectively. The $i$-th unit in the competitive layer is fully connected with the input and output layers through the weight vector $v_i = [w_i, u_i]$, where $w_i$ and $u_i$ are the weight vectors to the input and output layers, respectively. In the SOR network, the I/O vector pairs of the target system are used as learning vectors. In the SOM, only the input topology in a high-dimensional input space is mapped to the one- or two-dimensional competitive layer. In the SOR network, on the other hand, the I/O topology in a high-dimensional space is mapped to the competitive layer. A feature of the SOR network is that the update of the weight vectors is determined by the evaluations of the I/O vector pairs. If the evaluation is positive, the weight vectors are updated so as to come close to the learning vector. This operation is called “attractive learning”. If the evaluation is negative, the weight vectors are updated so as to move away from the learning vector. This operation is called “repulsive learning”. The desirable I/O relationship is extracted by the attractive and repulsive learning. The learning of the SOR network proceeds as follows.
0) All weight vectors $v_i$ are initialized by random numbers.
1) One I/O vector pair $I_l = [x_l, y_l]$ is randomly selected from the set of $L$ I/O vector pairs and applied to the input and output layers as a learning vector.
2) The winner unit $l^*$ for the learning vector $I_l$ is selected as the unit with the smallest Euclidean distance:

$$l^* = \arg\min_i \| I_l - v_i \| \tag{1}$$

3) The coefficient $h_{i,l}$ of the neighboring effect is calculated by the following neighborhood function:

$$h_{i,l} = |E_l| \exp\left( -\frac{\| p_i - p_{l^*} \|^2}{2\sigma(t)^2} \right) \tag{2}$$
where $E_l$ is the evaluation of the learning vector $I_l$, $p_i$ and $p_{l^*}$ are the positions of the $i$-th unit and the winner unit on the competitive layer, respectively, and $\sigma(t)$ is the width of the neighborhood function at learning step $t$.
4) The weight vectors are updated by the following equation:

$$v_i^{new} = \begin{cases} v_i + \alpha(t)\, h_{i,l}\, (I_l - v_i) & \text{for } 0 \le E_l \\ v_i - \beta(t)\, h_{i,l} \exp(-\| I_l - v_i \|)\, \mathrm{sgn}(I_l - v_i) & \text{for } E_l \le 0, \end{cases} \tag{3}$$

where $v_i^{new}$ is the weight vector after the update, and $\alpha(t)$ and $\beta(t)$ are the learning rates for attractive and repulsive learning at learning step $t$, respectively.
5) Steps 1) to 4) are repeated while decreasing the learning rates $\alpha(t)$ and $\beta(t)$ and the width of the neighborhood function $\sigma(t)$.

After the learning, the SOR network is ready for use as an I/O generator. When an input vector $x^*$ is applied to the input layer, the similarity $s_i$ between $x^*$ and the weight vector $w_i$ is calculated by

$$s_i = \exp\left( -\frac{\| x^* - w_i \|^2}{\delta^2} \right), \tag{4}$$

where $\delta$ is a positive parameter. The $k$-th element $y_k^*$ of the output vector is obtained by

$$y_k^* = \frac{\sum_{i=1}^{N} s_i u_{ki}}{\sum_{i=1}^{N} s_i}. \tag{5}$$

The resulting output vector is expected to be a desirable one. The SOR network has been successfully applied to the design of control systems, image enhancement and so on. However, when the topology of the desirable data set is unknown or the I/O relationship is a multi-valued function, the learning result of the SOR network sometimes includes a large error. Fig. 1 shows the results of learning and execution for a case in which the desirable I/O relationship is a multi-valued function. Fig. 1(a) and (b) show the desirable I/O relationship and the set of learning vectors. Fig. 1(c) shows the distribution of the weight vectors after the learning. In the learning, the number of competing units is 30 and they are arranged on a 1-dimensional competitive layer. Fig. 1(d) shows the I/O relationship generated by the execution mode of the SOR network. The desirable I/O relationship cannot be extracted, because its topology is not represented by that of the SOR network. To cope with this problem, the topology of the network should be extracted in a self-organizing manner.

Fig. 1. An example of the learning of the SOR network. (a) The desirable I/O relationship. (b) The learning vectors with a positive evaluation (◦) and a negative evaluation (•). (c) The distribution of the weight vectors (×) after learning. (d) The curve generated by the execution mode.
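The learning step and the execution mode of the SOR network described in this section can be sketched as below. This is a minimal illustration, assuming a 1-dimensional competitive layer with scalar unit positions, an element-wise sign function in the repulsive branch, and hypothetical function names (`winner`, `sor_update`, `sor_execute`); it is not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = List[float]


def _sign(d: float) -> float:
    return float((d > 0) - (d < 0))


def winner(units: Sequence[Vector], sample: Vector) -> int:
    """Index of the unit closest to the learning vector (Eq. 1)."""
    return min(range(len(units)),
               key=lambda i: sum((s - v) ** 2 for s, v in zip(sample, units[i])))


def sor_update(units: Sequence[Vector], positions: Sequence[float], sample: Vector,
               evaluation: float, alpha: float, beta: float, sigma: float) -> List[Vector]:
    """One learning step (Eqs. 1-3); returns the updated weight vectors."""
    w = winner(units, sample)
    updated = []
    for v, p in zip(units, positions):
        h = abs(evaluation) * math.exp(-((p - positions[w]) ** 2) / (2.0 * sigma ** 2))
        diff = [s - vi for s, vi in zip(sample, v)]
        if evaluation >= 0:   # attractive learning: move toward the sample
            updated.append([vi + alpha * h * d for vi, d in zip(v, diff)])
        else:                 # repulsive learning: move away from the sample
            decay = math.exp(-math.sqrt(sum(d * d for d in diff)))
            updated.append([vi - beta * h * decay * _sign(d) for vi, d in zip(v, diff)])
    return updated


def sor_execute(units: Sequence[Vector], n_in: int, x: Vector, delta: float) -> Vector:
    """Execution mode (Eqs. 4-5): similarity-weighted average of the output weights."""
    sims = [math.exp(-sum((xi - wi) ** 2 for xi, wi in zip(x, v[:n_in])) / delta ** 2)
            for v in units]
    total = sum(sims)
    n_out = len(units[0]) - n_in
    return [sum(s * v[n_in + k] for s, v in zip(sims, units)) / total
            for k in range(n_out)]


# Toy usage: two units with one input and one output dimension each.
units = [[0.0, 0.0], [1.0, 1.0]]
positions = [0.0, 1.0]
attracted = sor_update(units, positions, [0.2, 0.2], 1.0, alpha=0.5, beta=0.1, sigma=1.0)
repelled = sor_update(units, positions, [0.2, 0.2], -1.0, alpha=0.5, beta=0.1, sigma=1.0)
output = sor_execute(units, n_in=1, x=[0.5], delta=0.5)
```

Because Eq. (5) averages the output weights of all similar units, two units with nearly the same input weight but different output weights are blended into one value, which is one way to see why a multi-valued I/O relationship is hard to reproduce in the execution mode.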
3 Topology Representing Network
The TRN is inspired by the SOM. The essential difference between the SOM and the TRN is the definition of the neighborhood. In each learning step, the TRN algorithm determines the neighborhood relationships from the ranking of the units, which is given by the distances between the learning vector $I$ and the weight vectors $v_i$ as follows:

$$\| I - v_{i_0} \| \le \| I - v_{i_1} \| \le \cdots \le \| I - v_{i_{N-1}} \| \tag{6}$$

While adapting the weight vectors, the TRN configures the neighborhood relationships by creating links between units whose weight vectors are near each other and by removing links. In particular, the link between the rank-0 and rank-1 units is generated, and the links between the rank-0 unit and the units of rank 2, 3, ..., N − 1 are removed. This rule for generating and removing links is called Competitive Hebbian Learning (CHL). As a result of learning, the topology of the set of learning vectors can be precisely represented by the set of weight vectors and links.
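The ranking of Eq. (6) and the link rule as described in the paragraph above can be sketched as follows. The sketch follows the simplified description in the text literally (create the rank-0/rank-1 link, remove the other links of the rank-0 unit) and stores links as unordered pairs; the names `rank_units` and `chl_step` are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Set


def rank_units(weights: Sequence[Sequence[float]], sample: Sequence[float]) -> List[int]:
    """Unit indices ordered by increasing distance to the learning vector (Eq. 6).

    Squared distances are used because they give the same ordering.
    """
    return sorted(range(len(weights)),
                  key=lambda i: sum((s - w) ** 2 for s, w in zip(sample, weights[i])))


def chl_step(links: Set[FrozenSet[int]], weights: Sequence[Sequence[float]],
             sample: Sequence[float]) -> None:
    """Update the link set in place for one learning vector, following the text."""
    ranking = rank_units(weights, sample)
    nearest, second = ranking[0], ranking[1]
    links.add(frozenset((nearest, second)))          # connect rank 0 and rank 1
    for other in ranking[2:]:
        links.discard(frozenset((nearest, other)))   # drop rank 0 to rank 2, 3, ...


# Toy usage: a stale link between units 0 and 2 is replaced by a link 0-1.
weights = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
links = {frozenset((0, 2))}
chl_step(links, weights, [0.4, 0.0])
```

Storing each link as a `frozenset` makes the pair order irrelevant, so the link between units 0 and 1 is the same object regardless of which unit is the winner.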
4 Evaluation-Based Topology Representing Network
In this paper, we propose the E-TRN, which represents the topology of a data set with evaluation and approximates a desirable I/O relationship of a target system based on the graph. In the E-TRN, there are $N$ neural units, and unit $i$ has the weight vector $v_i$:

$$v_i = (w_i, u_i), \quad i = 1, \cdots, N, \tag{7}$$

where $w_i$ and $u_i$ are the weight vectors defined in the input and output spaces, respectively. In addition, a connection strength $C_{ij}$ is defined between units $i$ and $j$: $0 \le C_{ij}$ means that units $i$ and $j$ are close to each other, whereas $C_{ij} < 0$ means that they are not. The operations of the E-TRN are divided into three modes: the learning mode, the link generation mode, and the execution mode. In the learning mode, the E-TRN extracts a desirable I/O relationship of a target system using I/O pairs and their evaluations. In the link generation mode, the E-TRN generates topological connections using the E-CHL, which strengthens and weakens the connection strengths depending on the I/O pairs and their evaluations. In the execution mode, the E-TRN generates desirable output vectors based on the topological connections.
4.1 Learning Mode
In the learning mode, the I/O vector pairs obtained by trial and error are applied as learning vectors $I_l$:

$$I_l = (x_l, y_l), \quad l = 1, \cdots, L, \tag{8}$$

where $x_l$ and $y_l$ are the input and output vectors, respectively. Each learning vector has an evaluation value $E_l$ with respect to the desirable I/O relationship. $E_l$ is determined subjectively or objectively, and is assumed in this paper to be a continuous value between −1 and 1. If $E_l$ is positive, i.e. $I_l$ is successful data, $I_l$ attracts the weight vectors, whereas $I_l$ repulses the weight vectors if $E_l$ is negative, i.e. $I_l$ is failed data. We call these two learning methods attractive learning and repulsive learning, respectively. The learning mode of the E-TRN is summarized as follows.
0) All weight vectors $v_i$ are initialized by random numbers.
1) All learning vectors $I_l$ and their evaluation values $E_l$ are applied to the I/O space.
2) The Euclidean distances $d_{li}$ between each learning vector and all weight vectors are calculated.
3) The neighborhood ranking $k(l, i)$ for each learning vector $I_l$ is determined as follows:

$$d_{l i_0} \le d_{l i_1} \le \cdots \le d_{l i_{N-1}}. \tag{9}$$

$v_{i_0}$ is the weight vector closest to $I_l$, i.e. $k(l, i_0) = 0$.
4) A coefficient of neighboring effect $\phi_{i,l}$ and a coefficient of the repulsive learning $\psi_i$ are calculated by the following equations:

$$\phi_{i,l} = \begin{cases} E_l \, g_{E\text{-}TRN}(i, l^*) & \text{for } 0 \le E_l \\ E_l \, \dfrac{\alpha_{E\text{-}TRN}}{\sigma^2_{E\text{-}TRN}} \exp\left( -\dfrac{\| v_{l^*} - I_l \|^2}{\sigma^2_{E\text{-}TRN}} \right) g_{E\text{-}TRN}(i, l^*) & \text{for } E_l < 0, \end{cases} \tag{10}$$

$$\psi_i = 2 \sum_{l \in \chi_i^-} | \phi_{i,l} |, \tag{11}$$

$$g_{E\text{-}TRN}(l, i) = \exp\left( -\frac{k(l, i)}{\lambda(t)} \right), \tag{12}$$

$$\lambda(t) = \lambda_i \left( \frac{\lambda_f}{\lambda_i} \right)^{t / t_{max}}. \tag{13}$$

Here, $\alpha_{E\text{-}TRN}$ and $\sigma_{E\text{-}TRN}$ are the parameters that determine the strength and the region of the repulsion, respectively. $\chi_i^-$ is defined as the set of learning vectors with negative evaluation in the Voronoi region of the weight vector $v_i$. $g_{E\text{-}TRN}(l, i)$ is the neighborhood function, which decreases monotonically with increasing $k(l, i)$ with a characteristic decay constant. $\lambda_i$ and $\lambda_f$ are the initial and final values of the width of the neighborhood function, whose value at learning step $t$ is $\lambda(t)$.
5) After all learning vectors are applied, the weight vectors are updated by the following equation:

$$v_i(t + 1) = (1 - \varepsilon_{E\text{-}TRN})\, v_i(t) + \varepsilon_{E\text{-}TRN} \frac{\sum_{l=1}^{L} \phi_{i,l} I_l + \sum_{i=1}^{N} \psi_i v_i(t)}{\sum_{l=1}^{L} \phi_{i,l}}, \tag{14}$$

where $t$ is the learning step and $\varepsilon_{E\text{-}TRN}$ is a learning rate.
6) Steps 1) to 5) are repeated while decreasing the width of the neighborhood function $\lambda(t)$.

4.2 Link Generation Mode
The CHL generates topological connections between units and allows the TRN to form a topology-preserving map. However, the CHL cannot be applied to a data set with evaluation. We therefore propose the E-CHL, which represents the topology of a data set with evaluation. The E-CHL abolishes the concept of the age of each link and introduces the following two concepts: (1) the connection strength $C_{ij}$ of each link, and (2) the degree of influence $f_l$ that each data point has on strengthening and weakening $C_{ij}$. In the CHL, $C_{ij}$ is always 0 or more. In the E-CHL, by contrast, $C_{ij}$ is a real number, because the E-CHL strengthens or weakens $C_{ij}$ depending on whether the evaluation of each learning vector is positive or negative. If $E_l$ is positive, $I_l$ strengthens the connection strength, whereas $I_l$ weakens it if $E_l$ is negative. In addition, $f_l$ of each data point is 1 in the CHL, whereas in the E-CHL $f_l$ is a continuous value between 0 and 1: the closer the learning vector is to the link in the I/O space, the larger $f_l$ is. The amount by which $C_{ij}$ is updated depends on the evaluation of each learning vector and on the degree of influence. Therefore, the E-CHL can generate topological connections that avoid the data with negative evaluation and represent the data with positive evaluation, and it can represent the various topologies of an unknown data set with evaluation. The link generation mode of the E-TRN is summarized as follows.
0) The connection strength $C_{ij}$ between each pair of units is initialized to 0.
1) One learning vector $I_l^+$ is randomly selected from the data set with positive evaluation $E_l^+$ and applied to the network.
2) The Euclidean distances $d_i$ between the learning vector $I_l^+$ and all weight vectors are calculated.
3) The neighborhood ranking $k(l, i)$ for the learning vector $I_l^+$ is determined from the ordering
$$d_{i_0} \le d_{i_1} \le \cdots \le d_{i_{N-1}}. \qquad (15)$$
4) The midpoint $m_{i_0 i_1}$ between the closest weight vector $v_{i_0}$ and the second closest weight vector $v_{i_1}$ is calculated.
T. Yamakawa, K. Horio, and T. Tanaka
Fig. 2. Amount of update $\Delta C_{ij}$ of the connection strength by the learning vectors $I_l^+$ and $I_l^-$ (vertical axis: $\Delta C_{i,j}$; horizontal axis: input/output space)
5) The degree of influence $f_l$ of the learning vector $I_l^+$ is calculated by the following equation:
$$f_l = \begin{cases} 1 - \left(\dfrac{\|I_l^+ - m_{i_0 i_1}\|}{\gamma_f}\right)^2 & \text{for } \|I_l^+ - m_{i_0 i_1}\| \le \gamma_f \\ 0 & \text{for } \|I_l^+ - m_{i_0 i_1}\| > \gamma_f, \end{cases} \qquad (16)$$
where $\gamma_f$ is the parameter that defines the degree of influence $f_l$.
6) The connection strength $C_{i_0 i_1}$ between $v_{i_0}$ and $v_{i_1}$ is updated by the following equation. The connection strength is strengthened as shown in Fig. 2.
$$C^{new}_{i_0 i_1} = C_{i_0 i_1} + f_l E_l \qquad (17)$$
Here, $C^{new}_{i_0 i_1}$ is the connection strength after updating.
7) Steps 1 to 6 are repeated until all learning vectors with positive evaluation have been applied.
8) One learning vector $I_l^-$ is randomly selected from the data set with negative evaluation $E_l^-$ and applied to the network.
9) Among the links with positive connection strength ($C_{ij} > 0$), the link whose midpoint $m_{ij}$ is closest to the learning vector $I_l^-$ is selected.
10) The degree of influence $f_l$ of the learning vector $I_l^-$ is calculated by the following equation using the selected midpoint $m_{i_0 i_1}$:
$$f_l = \begin{cases} 1 - \left(\dfrac{\|I_l^- - m_{i_0 i_1}\|}{\gamma_f}\right)^2 & \text{for } \|I_l^- - m_{i_0 i_1}\| \le \gamma_f \\ 0 & \text{for } \|I_l^- - m_{i_0 i_1}\| > \gamma_f \end{cases} \qquad (18)$$
11) The connection strength $C_{i_0 i_1}$ of the selected link is updated by equation (17). The connection strength is weakened as shown in Fig. 2.
12) Steps 8 to 11 are repeated until all learning vectors with negative evaluation have been applied.
13) If $C_{ij}$ is finally positive, a link is generated between units $i$ and $j$.
The topological connections are generated by this procedure. Here, we call steps 1 to 6 the link strengthening phase, and steps 8 to 11 the link weakening phase.
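The link generation steps above can be sketched compactly as follows. This is only a minimal illustration under stated assumptions: the names `influence` and `generate_links`, the symmetric matrix used to store $C_{ij}$, and the use of a random permutation to realize "randomly selected until all vectors are applied" are choices made for this sketch.

```python
import numpy as np

def influence(vector, midpoint, gamma_f):
    """Equations (16)/(18): 1 - (dist / gamma_f)^2 within radius gamma_f, else 0."""
    dist = np.linalg.norm(vector - midpoint)
    return 1.0 - (dist / gamma_f) ** 2 if dist <= gamma_f else 0.0

def generate_links(weights, vectors, evals, gamma_f, rng=None):
    """Return a boolean adjacency matrix of the generated links (step 13)."""
    rng = np.random.default_rng(0) if rng is None else rng
    evals = np.asarray(evals, dtype=float)
    n = len(weights)
    strength = np.zeros((n, n))                      # step 0: C_ij = 0

    # Link strengthening phase (steps 1-7): vectors with positive evaluation.
    for l in rng.permutation(np.flatnonzero(evals > 0)):
        dists = np.linalg.norm(weights - vectors[l], axis=1)
        i0, i1 = np.argsort(dists)[:2]               # closest and second closest
        mid = 0.5 * (weights[i0] + weights[i1])
        delta = influence(vectors[l], mid, gamma_f) * evals[l]   # equation (17)
        strength[i0, i1] += delta
        strength[i1, i0] += delta

    # Link weakening phase (steps 8-12): vectors with negative evaluation.
    for l in rng.permutation(np.flatnonzero(evals < 0)):
        ii, jj = np.nonzero(np.triu(strength, k=1) > 0)   # links with C_ij > 0
        if len(ii) == 0:
            break
        mids = 0.5 * (weights[ii] + weights[jj])
        c = np.argmin(np.linalg.norm(mids - vectors[l], axis=1))
        delta = influence(vectors[l], mids[c], gamma_f) * evals[l]
        strength[ii[c], jj[c]] += delta
        strength[jj[c], ii[c]] += delta

    return strength > 0                              # step 13
```

Because the evaluation of a failed datum is negative, the same update (17) lowers the strength in the second loop, so a link that passes close to failed data can drop to zero or below and is then not generated.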
4.3 Execution Mode
In the execution mode, the output is calculated as the average of the weight vectors weighted by a similarity measure. The SOR network cannot generate multiple outputs, because all units are used in calculating the weighted average. The E-TRN can generate multiple outputs by taking advantage of the graph. Therefore, the E-TRN can approximate a target I/O relationship with a complex topology, even if it is a multi-valued function. In addition, the number of outputs does not need to be known in advance. The execution mode of the E-TRN is summarized as follows.
0) An actual input vector $x^*$ is applied to the E-TRN after the learning.
1) The similarity $z_i$ between $x^*$ and each weight vector in the input space $w_i$ is calculated by
$$z_i = \begin{cases} 1 - \dfrac{\|x^* - w_i\|^2}{\gamma^2} & \text{for } \|x^* - w_i\| \le \gamma \\ 0 & \text{for } \|x^* - w_i\| > \gamma, \end{cases} \qquad (19)$$
where $\gamma$ is a parameter representing the fuzziness of the similarity.
2) The units that have positive similarity ($0 < z_i$) are classified based on the graph: connected units are classified into the same class, and unconnected units into different classes.
3) The output vector $y^{C*}$ is generated by the following equation using the similarities of the units classified into class $C$. This operation is performed for all classes.
$$y_k^{C*} = \frac{\sum_{i \in C} z_i^C u_{ik}^C}{\sum_{i \in C} z_i^C}, \qquad k = 1, \cdots, m, \qquad (20)$$
Here, $z_i^C$ is the similarity of a unit classified into class $C$, and $u_i^C$ is the weight vector in the output space of that unit.
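A minimal sketch of the execution mode is given below. It assumes that the classes of step 2 are the connected components of the link graph restricted to the units with positive similarity; this reading, the function names, and the boolean adjacency matrix `links` are assumptions of the sketch.

```python
import numpy as np

def similarity(x, w, gamma):
    """Equation (19): similarity of input x to every input-space weight vector."""
    dist = np.linalg.norm(w - x, axis=1)
    return np.where(dist <= gamma, 1.0 - dist**2 / gamma**2, 0.0)

def connected_classes(active, links):
    """Group the active unit indices into connected components of the link graph."""
    remaining = set(int(i) for i in active)
    classes = []
    while remaining:
        stack = [remaining.pop()]
        comp = []
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in map(int, np.flatnonzero(links[i])):
                if j in remaining:
                    remaining.discard(j)
                    stack.append(j)
        classes.append(sorted(comp))
    return classes

def execute(x, w, u, links, gamma):
    """Return one output vector per class, following equation (20)."""
    z = similarity(x, w, gamma)
    active = np.flatnonzero(z > 0)
    outputs = []
    for comp in connected_classes(active, links):
        zc = z[comp]
        outputs.append(zc @ u[comp] / zc.sum())
    return outputs
```

For a multi-valued relationship, an input that is close to two unconnected branches of the graph yields two output vectors, one per branch.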
5 Experimental Results
In order to verify the effectiveness of the proposed method, the following two experiments were conducted.
5.1 Effect for High Dimensional I/O Space
The topology of the conventional SOR network is fixed to a one- or two-dimensional surface because of the fixed arrangement of the units on the competitive layer. In the proposed method, on the other hand, the topology of the data set is precisely preserved by the E-CHL. To verify the effectiveness of this topology-preserving ability, the following four I/O relationships are approximated by the conventional SOR network and by the proposed method.
Fig. 3. RMSE of the conventional SOR network and the proposed method (E-TRN) for 1-input/1-output, 2-input/1-output, 3-input/1-output and 4-input/1-output (vertical axis: RMSE; horizontal axis: dimension of the dataset)
$$y = 0.5 \sin(\pi x_1) \qquad (21)$$
$$y = 0.5 \sin(\pi x_1) + 0.5 \cos(0.5 \pi x_2) \qquad (22)$$
$$y = 0.5 \sin(\pi x_1) + 0.5 \cos(0.5 \pi x_2) - 0.5 \sin(\pi x_3) \qquad (23)$$
$$y = 0.5 \sin(\pi x_1) + 0.5 \cos(0.5 \pi x_2) - 0.5 \sin(\pi x_3) - 0.5 \cos(0.5 \pi x_4) \qquad (24)$$
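For reference, the four target relationships (21) to (24) and the RMSE criterion can be written compactly as below. The helper names `target` and `rmse` and the convention that inputs are stored one sample per row are assumptions of this sketch; the input range used in the experiments is not stated here and is therefore left to the caller.

```python
import numpy as np

def target(x):
    """Equations (21)-(24), selected by the number of input columns (1 to 4).

    x has shape (samples, inputs).
    """
    terms = [
        lambda c: 0.5 * np.sin(np.pi * c),
        lambda c: 0.5 * np.cos(0.5 * np.pi * c),
        lambda c: -0.5 * np.sin(np.pi * c),
        lambda c: -0.5 * np.cos(0.5 * np.pi * c),
    ]
    x = np.atleast_2d(x)
    return sum(f(x[:, j]) for j, f in enumerate(terms[: x.shape[1]]))

def rmse(y_true, y_pred):
    """Root-mean-square error between desired and generated outputs."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff**2)))
```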
For both methods, the learning data consist of 3000 randomly generated samples, the number of learning iterations is 100, and the number of units is 100 (10×10 for the SOR network). For each function, ten simulations were run, and the averages of the root-mean-square error (RMSE) are shown in Fig. 3. In the 2-input/1-output approximation, the topology of the desired I/O relationship is a 2-dimensional surface, so the RMSEs of the two methods are almost the same. In the 1-input/1-output, 3-input/1-output and 4-input/1-output approximations, on the other hand, the SOR network with its two-dimensional topology has larger RMSEs than the proposed method. This shows that the conventional SOR network cannot precisely approximate the I/O relationship when the dimension of the I/O relationship becomes large.
5.2 Effect of Link Weakening
One of the main features of the SOR network is its use of learning data with negative evaluation. In the proposed method, the links between units are strengthened and weakened by the learning data with positive and negative evaluations, respectively. Here we show the effectiveness of the link weakening by simulation. The proposed method is trained using the learning data shown in Fig. 1 (b). Fig. 4 (a) shows the weight vectors and links after learning without link weakening. Some links that should be removed remain, so the I/O relationship generated by the execution mode is not desirable, as shown in Fig. 4 (b). On the other hand, Fig. 4 (c) shows the weight vectors and links after learning with link weakening. In this case a desirable I/O relationship is generated by the execution mode, as shown in Fig. 4 (d).
Fig. 4. Effectiveness of the link weakening by the learning data with negative evaluation. (a) Weight vectors and links after learning in which the links are not weakened by learning data with negative evaluation, and (b) I/O relationship generated by the execution mode. (c) Weight vectors and links after learning in which the links are weakened by learning data with negative evaluation, and (d) I/O relationship generated by the execution mode. Panels (a) and (c) are plotted with horizontal axis $w_i$ and vertical axis $u_i$; panels (b) and (d) with horizontal axis $x$ and vertical axis $y$.
6 Conclusions
In this paper, we proposed the E-TRN to improve the learning accuracy of the SOR network. In the E-TRN, the links between units are strengthened or weakened by the learning data according to their evaluations. By employing the E-TRN, the learning accuracy of the SOR network is improved, especially when the I/O space is high dimensional or the I/O relationship is a multi-valued function. We verified the effectiveness of the proposed method by simulations.
Acknowledgment This work was supported by a COE program (center #J19) granted to Kyushu Institute of Technology by MEXT of Japan.
References
1. Yamakawa, T., Horio, K.: Self-Organizing Relationship (SOR) Network. IEICE Trans. on Fundamentals E82-A (1999) 1674-1677
2. Furukawa, T., Sonoh, S., Horio, K., Yamakawa, T.: Batch Learning Algorithm of SOM with Attractive and Repulsive Data. Proc. of Workshop on Self-Organizing Maps (WSOM) (2005) 413-420
3. Martinetz, T., Schulten, K.: Topology Representing Networks. Neural Networks 7 (1994) 507-522
4. Martinetz, T., Berkovich, S., Schulten, K.: Neural-Gas Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Trans. on Neural Networks 4 (1993) 558-569
5. Martinetz, T.: Competitive Hebbian learning rule forms perfectly topology preserving maps. Proc. of Int. Conf. on Artificial Neural Networks (ICANN) (1993) 427-434
Adaptively Incremental Self-organizing Isometric Embedding*
Hou Yuexian, Gong Kefei, and He Pilian
School of Computer Science & Technology, Tianjin University, China
Abstract. In this paper, we propose an adaptive incremental nonlinear dimensionality reduction algorithm for data streams in the adaptive Self-organizing Isometric Embedding [1][3] framework. Assuming that each sampling point of the underlying manifold and its adaptive neighbors [3] preserve the principal directions of the region in which they reside, our algorithm need only update the geodesic distances between the anchors and all the other points, as well as the distances between the neighbors of the incremental points and all the other points, when a new point arrives. Under this assumption, our algorithm embeds incremental points in approximately linear time and provides an effective tradeoff between embedding precision and time cost.
* This work was supported in part by Sci.-Tech. Development Project of Tianjin (Grant 04310941R) and Applied Basic Research Project of Tianjin (Grant 05YFJMJC11700).
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 978–986, 2006. © Springer-Verlag Berlin Heidelberg 2006
1 Introduction
Data mining usually involves understanding the intrinsic low-dimensional manifold structure of high-dimensional data sets. The earliest nonlinear dimensionality reduction technique is Sammon's mapping [7]. More recently, many nonlinear methods have been proposed, such as self-organizing maps (SOM) [8], principal curves and their extensions [9,10], auto-encoder neural networks [12][13], kernel principal component analysis (KPCA) [11], isometric feature mapping (Isomap) [5], locally linear embedding (LLE) [4], self-organizing isometric embedding (SIE) [1] and so on. Most popular nonlinear dimensionality reduction algorithms work in batch mode, which means that all data points need to be available during embedding. But in many data mining applications, information is collected sequentially through a data stream. When data points arrive sequentially, batch methods are computationally expensive [2]. The goal of this paper is to investigate how we can efficiently recover a nonlinear manifold given a data stream.
Suppose that some sampling points of the underlying manifold are already available. When a new data point arrives, it may change the neighborhood structure and create improved shortest paths. The complexity of updating the geodesic distances obviously depends on the sufficiency of the sampling. Hence we discuss three methods of processing newly arriving points based on different degrees of sampling sufficiency.
Suppose there is already a sampling set X = {x1, x2, …, xn} on the manifold and it is far from a sufficient sampling of the underlying manifold. In this case, the arrival of a new point u may change the global topology of the manifold's graph representation, and there might be many shortest paths to be updated, so we should compute the coordinates of the new point based on the updated geodesic distances. The flaw of this method is its exhaustive computation cost [2].
When the sampling {x1, x2, …, xn} on the manifold is sufficient and u is not a noisy point, the arrival of a new point u will not affect the existing shortest paths remarkably. So we need only compute the shortest paths from the newly arrived point to the other points using formula (1):
$$d_{geo}(u, x_i) := \min\{\, d(u, u_{nb}) + d_{geo}(u_{nb}, x_i) \mid u_{nb} \text{ is an adaptive neighbor of } u \,\} \qquad (1)$$
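Formula (1) amounts to a single minimization over the neighbors of the new point. A minimal sketch follows; the function name, the precomputed geodesic distance matrix `geo`, and the index list `neighbor_idx` of the adaptive neighbors are assumptions of this illustration.

```python
import numpy as np

def geodesic_to_new_point(u, points, geo, neighbor_idx):
    """Formula (1): d_geo(u, x_i) = min over neighbors of d(u, u_nb) + d_geo(u_nb, x_i).

    points: (n, D) existing samples; geo: (n, n) geodesic distances among them;
    neighbor_idx: indices of the adaptive neighbors of the new point u.
    """
    local = np.linalg.norm(points[neighbor_idx] - u, axis=1)   # d(u, u_nb)
    return np.min(local[:, None] + geo[neighbor_idx, :], axis=0)
```

The cost is proportional to the number of neighbors times the number of existing points, and none of the existing shortest paths are touched.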
But the assumption of sampling sufficiency is often too restrictive. A unified framework of out-of-sample extensions for LLE, Isomap, MDS, Laplacian eigenmaps, and spectral clustering has been provided for extending these algorithms to newly arriving points [18]. Its key idea is to see these algorithms as learning the eigenfunctions of a data-dependent kernel. To obtain an embedding for a new data point, it uses formula (2) to compute the eigenfunction:
$$f_k(x) = \frac{\sqrt{n}}{\lambda_k} \sum_{i=1}^{n} v_{ki}\, K(x, x_i) \qquad (2)$$
Then it uses $f_k(x)$ to obtain the embedding of the new point directly. This framework also assumes that the sample is sufficient, which is too restrictive. In this paper, a compromise assumption is adopted to trade off time complexity against embedding precision. Suppose that each point $x_i$ and its adaptive neighbors preserve the locally principal directions of the region in which they reside. In this case {x1, x2, …, xn} need not be globally sufficient, e.g., there may be holes in the set of sampling points. When a new point u arrives, it may change the neighborhood structure and create improved shortest paths, so we have to recompute the embedding coordinates of {x1, x2, …, xn, u}. But under the adaptive SIE [1][3] framework and the above Locally Principal Directions Preservation (LPDP) assumption, the geodesic distances between the anchors and the other points can be updated in linear complexity. In the following, we briefly introduce the basic algorithm of SIE and adaptive neighbor finding in Sec. 2, and describe the proposed Adaptively Incremental Self-organizing Isometric Embedding (AISIE) algorithm in Sec. 3. In Sec. 4, empirical results on several examples are provided to illustrate the validity of our proposal. We conclude with a discussion of further work.
2 Adaptive Neighbor Selection and Introduction to SIE
We denote the number of training points by n, the number of anchors by m, the number of incremental points by l, the maximal number of adaptive neighbors by k, the dimensionality of the input space by D, and the dimensionality of the embedding space by d.
2.1 Find Adaptive Neighbors
To find adaptive neighbors, we filter every neighbor in a neighbor set according to its distance to the locally principal direction, i.e., the linear subspace spanned by the directions of greatest variance. Neighbors that are relatively far from the locally principal direction are deleted from the neighbor set even if they are near the central point of the neighbor set. To be specific, let $L = \{v_{i1}, v_{i2}, \ldots, v_{id}\}$ be the eigenvectors of the covariance matrix of $x_i$'s neighbor set that correspond to the $d$ largest eigenvalues. Then the locally principal direction centered at $x_i$ is spanned by $L$, and for every point $x_{ij}$ in $x_i$'s neighbor set, the vector $y_{ij} = x_{ij} - x_i$ can be approximately represented as a linear combination of the vectors in $L$ as long as $x_{ij}$ is near the locally principal direction of $x_i$. Formally,
$$y_{ij} = V_i w_{ij} + \varepsilon_{ij} \qquad (3)$$
where $V_i = [v_{i1}\ v_{i2} \ldots v_{id}]$ is a $D \times d$ matrix, $w_{ij}$ is a $d \times 1$ vector of fitting coefficients of the linear combination, and $\varepsilon_{ij}$ is a $D \times 1$ fitting error vector. The remaining task is to solve for the optimal $w_{ij}$ for $y_{ij}$, $j = 1, 2, \ldots, k$, where $k$ is the number of points in $x_i$'s neighbor set. We solve for the optimal $w_{ij}$ in the least-squares sense. It then becomes a quadratic programming problem, and $w^*_{ij}$ can be obtained analytically by formula (4):
$$w^*_{ij} = (V_i^T V_i)^{-1} V_i^T y_{ij} \qquad (4)$$
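The least-squares fit of formula (4) and the fitting error of formula (3) can be sketched as follows. Several points are assumptions of this sketch rather than statements of the original algorithm: the principal directions are taken as the eigenvectors with the largest eigenvalues (the high-variance directions), the error norm is normalized by the norm of $y_{ij}$, and the acceptance `threshold` and the function name are hypothetical.

```python
import numpy as np

def adaptive_neighbors(x_i, neighbors, d, threshold):
    """Keep the neighbors whose normalized fitting error is at most `threshold`.

    neighbors: (k, D) array of the candidate neighbors x_ij of x_i.
    Returns (keep_mask, normalized_error).
    """
    y = neighbors - x_i                            # rows are y_ij = x_ij - x_i
    cov = np.cov(neighbors, rowvar=False)          # D x D covariance of the neighbor set
    _, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    v = eigvecs[:, -d:]                            # D x d basis V_i
    w, *_ = np.linalg.lstsq(v, y.T, rcond=None)    # formula (4) for all j at once
    resid = y.T - v @ w                            # fitting errors of formula (3)
    norm_y = np.maximum(np.linalg.norm(y, axis=1), 1e-12)
    score = np.linalg.norm(resid, axis=0) / norm_y
    return score <= threshold, score
```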
Once the optimal $w^*_{ij}$ is obtained, $\varepsilon_{ij}$ is computed by formula (3), and the normalized norm of $\varepsilon_{ij}$ becomes a criterion for checking whether $x_{ij}$ is near the locally principal direction centered at $x_i$.
2.2 Basic Algorithm of SIE
Let $P_1, P_2, \ldots, P_n$ be an n-point configuration in $R^M$. We need to embed $P_1, P_2, \ldots, P_n$ in a lower-dimensional embedding space $R^m$ to form an n-point configuration $Q_1, Q_2, \ldots, Q_n$ that preserves the pairwise manifold distance, e.g., geodesic distance [2], among all points in $P_1, P_2, \ldots, P_n$. Define the cost function $S = \sum |d_M(i,j) - d_m(i,j)|$, $1 \le i$ …
… $> m$) is the given data (observation) matrix, $A = [a_1, \cdots, a_n] \in R^{m \times n}$ is the unknown mixing (basis) matrix (not necessarily sparse), and $S \in R^{n \times T}$ is the unknown matrix representing the sparse sources or hidden components; $T$ is the number of available samples, $m$ the number of observations and $n$ the number of sources. Our main objective is to find a reasonable basis matrix $A$ such that the coefficients in the matrix $S$ are as sparse as possible. Here we mainly discuss SCA with $k$ ($1 < k \le m - 1$) dominant components. We use a two-stage method to solve
the SCA problem: in the first step, the basis matrix $A$ is estimated using the proposed K-hyperplanes clustering approach, and in the next step the coefficient matrix $S$ is estimated using the following Optimal $k$-dimension Subspace Projection (OSP):
(1) Denote the $K$ ($K = C_n^k$) $m \times k$ submatrices of $A$ as $A_1, A_2, \cdots, A_K$.
(2) Calculate each distance $d(x(t), A_i)$, $i = 1, \cdots, K$, from the observation sample point $x(t)$ to the linear space $span\{A_i\}$ using the distance formula (9). The observation sample point $x(t) \in \theta(A_i)$ if and only if $d(x(t), A_i) = \min\{d(x(t), A_j),\ j = 1, \cdots, K\}$, where $\theta(A_i)$ is a data vector set. By this means, the $T$ observation sample points $x(1), \cdots, x(T)$ in the observation matrix $X = [x(1), \cdots, x(T)]$ are assigned to $K$ different clusters $\theta(A_i)$, $i = 1, \cdots, K$.
(3) Suppose $A_i = [a_{i_1}, \cdots, a_{i_k}]$. If $x(t) \in \theta(A_i)$, the column vector $s(t)$, $t = 1, \cdots, T$, of the coefficient matrix $S$ is estimated as follows:
$$\begin{cases} \begin{pmatrix} s(i_1, t) \\ \vdots \\ s(i_k, t) \end{pmatrix} = (A_i^T A_i)^{-1} A_i^T x(t), \\ s(j, t) = 0, \quad \text{for } j \ne i_1, \cdots, i_k. \end{cases} \qquad (16)$$
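The three OSP steps can be sketched as below. Since the distance formula (9) is not reproduced in this part of the text, the sketch assumes that the distance from $x(t)$ to $span\{A_i\}$ is the Euclidean norm of the least-squares residual; this choice and the function name `osp_estimate` are assumptions of the illustration.

```python
import numpy as np
from itertools import combinations

def osp_estimate(x, a, k):
    """Estimate S column by column from X (m x T) and A (m x n) following (16)."""
    n = a.shape[1]
    s = np.zeros((n, x.shape[1]))
    subsets = [list(c) for c in combinations(range(n), k)]   # the K = C(n, k) index sets
    for t in range(x.shape[1]):
        best = None
        for idx in subsets:
            a_i = a[:, idx]
            # Least-squares coefficients, i.e. (A_i^T A_i)^{-1} A_i^T x(t).
            coef, *_ = np.linalg.lstsq(a_i, x[:, t], rcond=None)
            dist = np.linalg.norm(x[:, t] - a_i @ coef)      # assumed distance to span{A_i}
            if best is None or dist < best[0]:
                best = (dist, idx, coef)
        s[best[1], t] = best[2]                              # other entries stay 0
    return s
```

The exhaustive loop over all $C_n^k$ submatrices is only practical for small $n$ and $k$, which is the regime the text describes.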
6 Numerical Experiments and Result Analysis
In order to demonstrate the performance of the K-hyperplanes clustering algorithm, we give a sparse component analysis experiment. To check how well the mixing matrix is estimated, we introduce the following Biased Angles Sum (BAS), defined as the sum of the angles between the column vectors of the mixing matrix and their corresponding estimates:
$$BAS(A, \hat{A}) = \sum_{i=1}^{n} acos(\langle a_i, \hat{a}_i \rangle), \qquad (17)$$
where $acos(\cdot)$ denotes the inverse cosine function, …
… 0 and ψ > 0 in the iterations, which means that the adaptive learning rate η(k) is reasonable. At the same time, ϕ > 0 and ψ > 0 will be very small. So, roughly, we can let the adaptive learning rate be ηa(k) = ψ(k)/φ(k) in practical computation.
4 Simulation Results
Since the source images have nonnegative pixel values, a blind image separation problem is used to confirm our theoretical results. Three 256×256 original images and the corresponding mixed images are shown in Fig. 1 and Fig. 2, respectively. The mixed images are obtained by mixing the three source images after the pixel intensities were scaled to unit variance. The source sequence s(k) is the sequence of pixel values of the source images from top left to bottom right, and the observed sequence x(k) is the sequence of pixel values of the mixed images from top left to bottom right.
Fig. 1. Source images and histograms used for the nonnegative ICA algorithm
As in [7], two performance measures are used to evaluate the separation. The first is the nonnegative reconstruction error
$$e_{NNR} = \frac{1}{np} \left\| Z - W^T g(Y) \right\|_F^2,$$
where $Z$ and $Y$ are matrices whose columns are the $p = 256^2$ pre-whitened observation vectors $z$ of dimension $n = 3$ and the vectors $y = Wz$, respectively. The second is the cross-talk error
$$e_{XT} = \frac{1}{n^2} \left\| abs(WVA)^T abs(WVA) - I_n \right\|_F^2,$$
M. Ye, X. Fan, and Q. Liu
Fig. 2. The mixed images and their histograms
where $abs(WVA)$ is the matrix of absolute values of the elements of $WVA$. This measure is zero if and only if the sources have been successfully separated. The performance curves in Fig. 3 are drawn for the adaptive learning rate $\eta_a(k)$ and for a large constant learning rate $\eta_1 = 0.9$. Fig. 3 shows that the convergence of algorithm (1) is guaranteed if the adaptive learning rate $\eta_a(k)$ is used. Compared with gradient nonnegative ICA algorithms, the nonnegative ICA algorithm derived by the geodesic method is somewhat robust to the selection of the learning rate. However, we can still observe that a large learning rate such as $\eta_1$ may lead the algorithm to divergence or oscillation.
Fig. 3. The learning curves for the geodesic nonnegative ICA algorithm. The circle and solid curves are drawn for the algorithm with the adaptive learning rate η(k) and the large constant learning rate η1, respectively, showing the nonnegative reconstruction error (lower curve) and the cross-talk error (upper curve). Vertical axis: Error (logarithmic scale); horizontal axis: Epoch, t.
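The two performance measures defined above might be computed as in the following sketch. The function $g$ is not defined in this part of the text; the sketch assumes it is the element-wise rectification $g(y) = \max(y, 0)$, and the function names are likewise assumptions.

```python
import numpy as np

def nonneg_reconstruction_error(z, w):
    """e_NNR = ||Z - W^T g(Y)||_F^2 / (n p), with Y = W Z and g assumed to be max(., 0)."""
    n, p = z.shape
    y = w @ z
    return np.linalg.norm(z - w.T @ np.maximum(y, 0.0), "fro") ** 2 / (n * p)

def cross_talk_error(w, v, a):
    """e_XT = ||abs(WVA)^T abs(WVA) - I_n||_F^2 / n^2."""
    n = w.shape[0]
    q = np.abs(w @ v @ a)
    return np.linalg.norm(q.T @ q - np.eye(n), "fro") ** 2 / n**2
```

A permutation matrix for $WVA$ gives a cross-talk error of zero, whereas a 45-degree rotation of two sources gives 0.5.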
Monotonic Convergence of a Nonnegative ICA Algorithm
Fig. 4. The curve of the adaptive learning rate (vertical axis: Eta; horizontal axis: Epoch, t)
The geodesic nonnegative ICA algorithm (1) obtains good performance if the adaptive learning rate $\eta_a(k)$ is used. Fig. 4 shows why this good performance is obtained.
5 Conclusions
The convergence conditions on the learning rate and the initial weight vector of a geodesic nonnegative ICA algorithm are derived. A rigorous mathematical proof is given. Our theory will be very effective in practical blind source separation. Simulation results confirm our convergence theories. The techniques used in this paper may give some clues to analyze general ICA algorithms on Stiefel manifold.
Acknowledgement This work was supported in part by the National Natural Science Foundation of China (TianYuan) under grant numbers A0324638 and Youth Science and Technology Foundation of UESTC YF020801.
References
1. Cichocki, A. and Amari, S.-i.: Adaptive Blind Signal and Image Processing. John Wiley and Sons (2002)
2. Edelman, A., Arias, T. A., and Smith, S. T.: The Geometry of Algorithms with Orthogonality Constraints. SIAM J. Matrix Anal. Applicat. 20(2) (1998) 303-353
3. Fiori, S.: A Theory for Learning by Weight Flow on Stiefel-Grassman Manifold. Neural Comput. 13 (2001) 1625-1647
4. Golub, G. H., Van Loan, C. F.: Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland (1996)
5. Hyvärinen, A., Karhunen, J. and Oja, E.: Independent Component Analysis. John Wiley and Sons (2001)
6. Nishimori, Y.: Learning Algorithm for ICA by Geodesic Flows on Orthogonal Group. Proc. Int. Joint Conf. Neural Networks, Washington, DC 2 (1999) 933-938
7. Oja, E. and Plumbley, M. D.: Blind Separation of Positive Sources by Globally Convergent Gradient Search. Neural Computation 16(9) (2004) 1811-1825
8. Oja, E. and Plumbley, M. D.: Blind Separation of Positive Sources using Nonnegative PCA. Proc. Int. Conf. on Independent Component Analysis and Blind Signal Separation (ICA 2003) (2003) 11-16
9. Plumbley, M. D.: Conditions for Nonnegative Independent Component Analysis. IEEE Signal Processing Letters 9(6) (2002) 177-180
10. Plumbley, M. D.: Algorithms for Nonnegative Independent Component Analysis. IEEE Trans. Neural Networks 14(3) (2003) 534-543
11. Plumbley, M. D.: Lie Group Methods for Optimization with Orthogonality Constraints. Proc. Int. Conf. on Independent Component Analysis and Blind Signal Separation (ICA 2004) 1245-1252
12. Vidyasagar, M.: Nonlinear Systems Analysis. Prentice-Hall, Englewood Cliffs, NJ (1993)
13. Ye, M.: Global Convergence Analysis of a Discrete Time Nonnegative ICA Algorithm. IEEE Trans. Neural Networks 17(1) (2006) 253-256
14. Zufiria, P. J.: On the Discrete Time Dynamics of the Basic Hebbian Neural Network Node. IEEE Trans. Neural Networks 13(6) (2002) 1342-1352
Exterior Penalty Function Method Based ICA Algorithm for Hybrid Sources Using GKNN Estimation
Fasong Wang1, Hongwei Li1, and Rui Li2
1 School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, P.R. China
2 School of Mathematics and Physics, Henan University of Technology, Zhengzhou 450052, P.R. China
Abstract. A novel independent component analysis (ICA) algorithm for the separation of hybrid sources, based on constrained optimization by the exterior penalty function method, is proposed. The proposed exterior penalty ICA algorithm works within the framework of constrained ICA (cICA) and solves the constrained optimization problem by the exterior penalty function method. To choose nonlinear functions as the probability density function (PDF) estimates of the source signals, generalized k-nearest neighbor (GKNN) PDF estimation is proposed, which can separate hybrid mixtures of source signals using only a flexible model and, more importantly, is completely blind to the sources. The proposed EX-cICA algorithm opens the way to wider applications of ICA methods in real-world signal processing. Simulations confirm the effectiveness of the proposed algorithm.
1 Introduction
Since Comon [1] gave a good insight into the ICA problem from the statistical viewpoint, ICA has become a highly popular research topic in statistical signal processing and unsupervised neural networks. A set of efficient ICA algorithms has emerged, such as the Infomax algorithm [2], the Natural Gradient algorithm [3], the Parametric algorithm [4], the Grading Learning algorithm [5], the FastICA algorithm [6], the Nonparametric ICA algorithm [7] and so on. These algorithms provide ideal solutions that satisfy the independent component conditions after the learning has converged. Recently, the technique of constrained ICA (cICA) [8-10] was proposed as a general framework for incorporating additional requirements and available prior knowledge into classical ICA. These requirements and this knowledge are added to the ICA contrast function in the form of equality and inequality constraints, which are handled here using the exterior penalty function method. To choose nonlinear functions as the PDF estimates of the source signals, unlike [7], we propose the generalized k-nearest neighbor (GKNN) PDF estimation, which can separate hybrid mixtures of source signals using only a flexible model and, more importantly, is completely blind to the sources.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 1107–1116, 2006. © Springer-Verlag Berlin Heidelberg 2006
This paper is organized as follows: section 2 briefly introduces the ICA model and the cICA framework; in section 3, we use a GKNN estimation based non-parametric method to estimate the PDF of the sources and obtain the score function; in section 4, we describe the exterior penalty function method based EX-cICA algorithm in detail; simulations illustrating the good performance of the proposed method are given in section 5; finally, section 6 concludes the paper.
2 ICA Model and cICA Framework
Here we consider the linear, instantaneous, noiseless ICA model of the form
$$x(t) = A s(t), \qquad (1)$$
where $s(t) = [s_1(t), s_2(t), \cdots, s_N(t)]^T$ is the unknown source vector and the source signals $s_i(t)$ are stationary processes. The matrix $A \in R^{N \times N}$ is an unknown real-valued, non-singular mixing matrix. The observed mixtures $x(t)$ are sometimes called sensor outputs. The following assumptions are needed for the model to be identifiable [1]: a) the sources are statistically mutually independent; b) at most one of the sources has a Gaussian distribution; c) $A$ is invertible. The goal of ICA is to find the demixing matrix $W$ that maps the observation $x(t)$ to $y(t)$ by the linear transform
$$y(t) = W x(t) = W A s(t), \qquad (2)$$
where $W \in R^{N \times N}$ is a separating matrix and $y(t)$ is an estimate of $s(t)$. Classic ICA algorithms inherently have two indeterminacies: scaling and permutation ambiguity [1]. The performance matrix is $P = WA$. Following the important contribution of Amari et al. [3], we can multiply the gradient by $W^T W$ and thereby obtain a natural gradient updating rule for $W$. The general learning rule can be written as
$$\Delta W = \eta\, [I - \varphi(y) y^T]\, W. \qquad (3)$$
In (3), $\eta > 0$ is the learning rate, which can be chosen by the method in [5], and $\varphi(\cdot)$ is the vector of score functions, whose optimal components are
$$\varphi_i(y_i) = -\frac{d}{dy_i} \log p_i(y_i) = -\frac{p_i'(y_i)}{p_i(y_i)}. \qquad (4)$$
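One batch step of the learning rule (3) might look as follows. The sample average over the batch and the default score function `np.tanh` (a common choice for super-Gaussian sources) are assumptions of this sketch; they stand in for the optimal score function (4), which requires the source densities.

```python
import numpy as np

def natural_gradient_step(w, x, eta, score=np.tanh):
    """One update W <- W + eta * (I - E[phi(y) y^T]) W, cf. learning rule (3).

    w: (N, N) separating matrix; x: (N, T) batch of observations (columns are samples).
    """
    y = w @ x                                     # y(t) = W x(t) for the whole batch
    n, t = y.shape
    grad = np.eye(n) - (score(y) @ y.T) / t       # batch average of I - phi(y) y^T
    return w + eta * grad @ w
```

The update reaches a stationary point when the batch average of $\varphi(y) y^T$ equals the identity, which is the usual equilibrium condition of the natural gradient rule.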
The purpose of cICA is to provide a more systematic and flexible way to incorporate further assumptions and prior information, if available, into the contrast function during the separation process, so that the ill-posed ICA problem is converted into a better-posed one, which opens the way to wider applications of ICA methods in real-world signal processing [8-10]. Classical ICA is an ill-posed problem because of the scaling and permutation indeterminacy of the solution. Incorporating prior knowledge and further requirements converts the ill-posed ICA problem into a well-posed problem. A procedure that produces unity transform operators can recover the exact original signals. To avoid an arbitrary ordering of the output components, statistical measures can be used as indices to sort them and eventually obtain the desired signals [8-9]. As in [10], constraints may be adopted to reduce the dimensionality of the output of the ICA. Incorporating prior information, such as statistical properties or rough templates of the sources, avoids local minima and increases the quality of the separation. The constrained optimization problem for cICA is defined as follows:
minimize: $C(y) = C(y_1, \cdots, y_N)$
subject to: $g(y : W) \ge 0$ and/or $h(y : W) = 0$,
where $C(y) = C(y_1, \cdots, y_N)$ represents an ICA contrast function (the definition of a contrast function can be found in [1]), $g(y : W) = (g_1(y : W), g_2(y : W), \cdots, g_m(y : W))^T$ defines a set of $m$ inequality constraints, and $h(y : W) = (h_1(y : W), h_2(y : W), \cdots, h_n(y : W))^T$ defines a set of $n$ equality constraints. The remaining question is to find proper constraints and an efficient nonlinear constrained optimization approach. These will be given in sections 4 and 5 in detail.
3 The GKNN Estimation of PDF
As shown by (4), the original source distributions must be known. In most real-world applications, however, this knowledge is not available. As indicated in [7], ICA can nevertheless be realized without knowing the source distributions, since these distributions may be learnt from the sample data together with the linear transformation. In this paper, we propose the GKNN non-parametric estimation method to estimate the PDF of the source signals [11]. The basic idea of the k-nearest neighbor (KNN) method is to control the degree of smoothing in the density estimate based on the size of the box required to contain a given number of observations. The size of this box is controlled by an integer $k$ that is considerably smaller than the sample size. Suppose we have the ordered data sample $x_{(1)}, \cdots, x_{(n)}$. For any point $x$ on the line we define the distances $d_i(x) = |x_i - x|$, ordered so that $d_1(x) \le \cdots \le d_n(x)$. Then we define the KNN density estimate
$$\hat{f}(x) = \frac{k-1}{2 n d_k(x)}.$$
By definition we expect $k - 1$ observations in $[x - d_k(x), x + d_k(x)]$. Near the center of the distribution $d_k$ will be smaller than in the tails. In applications we often use the GKNN estimate
$$\hat{f}(x) = \frac{1}{n d_k(x)} \sum_{j=1}^{n} K\!\left(\frac{x - X_j}{d_k(x)}\right),$$
where $K(\cdot)$ is a window or kernel function. Common choices of kernel are the Gaussian kernel, $K(u) = \phi(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)$, the Epanechnikov kernel, and so on. The window width $d_k(x)$ is an important characteristic of the GKNN density estimate, but there is no firm rule for choosing it. One chooses $d_k(x)$ keeping in mind the tradeoff between too much variability and roughness on the one hand and a lack of fidelity and increased bias on the other.
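The KNN and GKNN density estimates described above can be sketched for one-dimensional samples as follows. The function names are assumptions of this illustration, and the sketch does not guard against $d_k(x) = 0$, which occurs when at least $k$ sample points coincide with $x$.

```python
import numpy as np

def kth_distance(x, sample, k):
    """d_k(x): distance from x to its k-th nearest sample point."""
    return np.sort(np.abs(sample - x))[k - 1]

def knn_density(x, sample, k):
    """KNN estimate (k - 1) / (2 n d_k(x))."""
    return (k - 1) / (2 * len(sample) * kth_distance(x, sample, k))

def gknn_density(x, sample, k):
    """GKNN estimate with the Gaussian kernel and window width d_k(x)."""
    d = kth_distance(x, sample, k)
    u = (x - sample) / d
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return np.sum(kernel) / (len(sample) * d)
```

Because the window width is the distance to the k-th nearest sample, the estimate smooths more strongly in the tails, where samples are sparse, than near the center of the distribution.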
F. Wang, H. Li, and R. Li
Given a batch of sample data of size M, the marginal distribution of an arbitrary reconstructed signal is approximated as follows:

$$\hat{p}_i(y_i) = \frac{1}{M d_k(y_i)} \sum_{j=1}^{M} K\left(\frac{y_i - Y_{ij}}{d_k(y_i)}\right), \quad i = 1, 2, \cdots, N,$$

where N is the number of source signals, $K(u) = \phi(u) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{u^2}{2})$, $Y_{ij} = W_i (X^T)_j = \sum_{n=1}^{N} w_{in} x_{nj}$, and $W_i$ and $(X^T)_j$ are the ith row of the separating matrix W and the jth column of the mixture $X^T$. This estimator $\hat{p}_i(y_i)$ is an asymptotically unbiased and mean-square consistent estimate of $p_i(y_i)$, so in the sense of mean-square convergence it converges to the true PDF. Moreover, it is a continuous and differentiable function of the unmixing matrix W:

$$\hat{p}_i(y_i) = \hat{p}_i(W_i (X^T)_n) = \frac{1}{M d_k(W_i (X^T)_n)} \sum_{j=1}^{M} K\left(\frac{W_i ((X^T)_n - (X^T)_j)}{d_k(W_i (X^T)_n)}\right). \qquad (5)$$
From (4), to obtain the score function we need the derivative of (5):

$$\hat{p}_i'(y_i) = \left[\frac{1}{M d_k(y_i)} \sum_{j=1}^{M} K\left(\frac{y_i - Y_{ij}}{d_k(y_i)}\right)\right]' = \frac{1}{M (d_k(y_i))^2} \sum_{j=1}^{M} K'\left(\frac{y_i - Y_{ij}}{d_k(y_i)}\right).$$

Then, for the Gaussian kernel,

$$\hat{p}_i'(y_i) = -\frac{1}{M (d_k(y_i))^2} \sum_{j=1}^{M} \frac{y_i - Y_{ij}}{d_k(y_i)} K\left(\frac{y_i - Y_{ij}}{d_k(y_i)}\right).$$

We use the derivative of the estimated PDF as the estimate of the PDF's derivative, that is, $(\hat{p}_i(y_i))' = \hat{p}_i'(y_i)$. So from (4) and (5) we can easily get the score function

$$\varphi(y_i) = -\frac{\partial p(y_i) / \partial y}{p(y_i)} = -\frac{\hat{p}_i'(y_i)}{\hat{p}_i(y_i)} = \frac{\sum_{j=1}^{M} \frac{y_i - Y_{ij}}{d_k(y_i)} K\left(\frac{y_i - Y_{ij}}{d_k(y_i)}\right)}{d_k(y_i) \sum_{j=1}^{M} K\left(\frac{y_i - Y_{ij}}{d_k(y_i)}\right)}. \qquad (6)$$

Substituting (6) into (3), we get the GKNN estimation based ICA separation algorithm. During the iteration, to keep the algorithm effective, the score function must be updated online at every step.
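The estimate and score function above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' MATLAB implementation: the function names (`knn_width`, `gknn_density`, `gknn_score`) are hypothetical, the window width $d_k(y)$ is taken to be the distance from y to its k-th nearest sample, and, as in the derivation, $d_k$ is treated as a constant when differentiating.

```python
import math

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def knn_width(y, samples, k):
    """Window width d_k(y): distance from y to its k-th nearest sample (assumed reading)."""
    dists = sorted(abs(s - y) for s in samples)
    d = dists[k - 1]
    if d == 0.0:
        raise ValueError("d_k(y) is zero; choose a larger k")
    return d

def gknn_density(y, samples, k):
    """GKNN density estimate: (1 / (M d_k)) * sum_j K((y - Y_j) / d_k)."""
    d = knn_width(y, samples, k)
    return sum(gaussian_kernel((y - s) / d) for s in samples) / (len(samples) * d)

def gknn_score(y, samples, k):
    """Score estimate -p'(y)/p(y) for the Gaussian kernel, in the form of Eq. (6)."""
    d = knn_width(y, samples, k)
    us = [(y - s) / d for s in samples]
    numerator = sum(u * gaussian_kernel(u) for u in us)
    denominator = d * sum(gaussian_kernel(u) for u in us)
    return numerator / denominator

if __name__ == "__main__":
    data = [-2.0, -1.0, 0.0, 1.0, 2.0]
    print(gknn_density(0.0, data, k=3), gknn_score(1.5, data, k=3))
```

For a sample that is symmetric about a point, the estimated score vanishes at that point and is positive to its right, which is the behaviour expected of the score function of a unimodal density.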
4 Exterior Penalty Function Based Constrained Optimization Method
In this section, we consider the constrained optimization problem: minimize: $C(y) = C(y_1, \cdots, y_N)$ subject to: $g(y : W) \ge 0$ and/or $h(y : W) = 0$, where C(y) is the contrast function, $g_i(y : W)$, $i = 1, \cdots, m$, define a set of m inequality constraints and $h_j(y : W)$, $j = 1, \cdots, n$, define a set of n equality
Exterior Penalty Function Method Based ICA Algorithm
constraints. All of C(·), g(·) and h(·) are assumed to be (nonlinear) differentiable, real-valued, continuous functions. Because the above problem is a constrained nonlinear optimization problem, it cannot be reduced to an unconstrained problem by the conventional elimination method. So when dealing with this problem, we must not only decrease the value of the objective function but also satisfy the constraints. One way to achieve this is to build a new auxiliary function from the objective function and the constraint functions. The original constrained optimization problem is thereby converted into an unconstrained one, which is solved by minimizing the auxiliary function. First, consider the equality constrained problem:

minimize: $C(y) = C(y_1, \cdots, y_N)$
subject to: $h_j(y : W) = 0, \; j = 1, \cdots, n$

Define the objective function

$$F(W, \sigma) = C(y : W) + \sigma \sum_{j=1}^{n} h_j^2(y : W), \qquad (7)$$

where σ (> 0) is the penalty parameter, which is a very large number, and the equality constrained optimization problem transforms to the unconstrained one:

$$\min_W F(W, \sigma). \qquad (8)$$
Obviously, the optimal solution of (8) must drive the value of $h_j(y : W)$ to zero; otherwise, the second term of (7) would be very large and the current point would not be a minimizer. Therefore, once we get the solution of (8), an approximate solution of the equality constrained problem is obtained too. Next, consider the inequality constrained problem:

minimize: $C(y) = C(y_1, \cdots, y_N)$
subject to: $g_i(y : W) \ge 0, \; i = 1, \cdots, m$

The auxiliary function of the inequality constrained problem is not the same as that of the equality constrained one, but the idea behind its construction is the same. Define the objective function

$$F(W, \sigma) = C(y : W) + \sigma \sum_{i=1}^{m} (\max(0, -g_i(y : W)))^2. \qquad (9)$$

When W is a feasible solution, $\max(0, -g_i(y : W)) = 0$. When W is not a feasible solution, $\max(0, -g_i(y : W)) = -g_i(y : W)$ for each violated constraint. The inequality constrained optimization problem is thus also transformed to an unconstrained one:

$$\min_W F(W, \sigma). \qquad (10)$$
Generalizing the above idea, we define the auxiliary function for the general case including both equality and inequality constraints:

$$F(W, \sigma) = C(y : W) + \sigma P(y : W), \qquad (11)$$

where P(y : W) has the form

$$P(y : W) = \sum_{i=1}^{m} \phi(g_i(y : W)) + \sum_{j=1}^{n} \psi(h_j(y : W)) \qquad (12)$$

where φ(·) and ψ(·) are continuous functions which satisfy the following conditions:

$$\phi(x) = 0, \; x \ge 0; \quad \phi(x) > 0, \; x < 0; \qquad \psi(x) = 0, \; x = 0; \quad \psi(x) > 0, \; x \ne 0.$$

Typically, the functions φ(·) and ψ(·) can be

$$\phi(x) = (\max(0, -x))^{\alpha}, \qquad \psi(x) = |x|^{\beta},$$

where α ≥ 1 and β ≥ 1 are given constants, usually α = β = 2. Thus the general nonlinear constrained optimization problem can be converted into an unconstrained one:

$$\min_W F(W, \sigma) = C(y : W) + \sigma P(y : W), \qquad (13)$$

where the parameter σ is a very large number and P(y : W) is a continuous function called the penalty function. By definition, when W is a feasible solution, P(y : W) = 0 and F(W, σ) = C(y : W). When W is not a feasible solution, σP(y : W) at the current point W will be a very large number. So, once the solution of (13) is obtained, the constrained ICA problem is solved too. During the computation, the choice of the penalty parameter σ is very important. When σ is too large, minimizing the penalized objective becomes more difficult; when σ is too small, the minimizer of the penalized objective is far from the optimal solution, which results in low computational efficiency. So, starting from some $\sigma_1$, we can use a monotonically increasing sequence $\{\sigma_k\}_{k=1}^{\infty}$ and compute

$$\min_W C(y : W) + \sigma_k P(y : W), \quad k = 1, 2, \cdots \qquad (14)$$

Then a sequence of minimizers approaching the optimal solution is obtained.
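As a toy illustration of the scheme in (13)-(14), the sketch below applies the exterior penalty method to a one-dimensional problem rather than to the ICA contrast function. Everything in it is an assumption made for illustration: the objective $C(w) = (w - 2)^2$, the single inequality constraint $g(w) = 1 - w \ge 0$, the choice α = 2, the tenfold growth of $\sigma_k$, and the plain gradient descent used as inner solver.

```python
def grad_penalized(w, sigma):
    """Gradient of F(w, sigma) = (w - 2)^2 + sigma * max(0, -g(w))^2 with g(w) = 1 - w."""
    violation = max(0.0, -(1.0 - w))       # max(0, -g(w)); zero when w is feasible
    return 2.0 * (w - 2.0) + 2.0 * sigma * violation

def minimize_penalized(w, sigma, iters=200):
    """Inner unconstrained minimization by gradient descent with a step scaled to sigma."""
    step = 1.0 / (2.0 + 2.0 * sigma)
    for _ in range(iters):
        w -= step * grad_penalized(w, sigma)
    return w

def exterior_penalty(w0=0.0, sigma1=1.0, growth=10.0, rounds=6):
    """Solve a sequence of penalized problems with increasing sigma_k, warm-starting each one."""
    w, sigma = w0, sigma1
    history = []
    for _ in range(rounds):
        w = minimize_penalized(w, sigma)
        history.append((sigma, w))
        sigma *= growth
    return history

if __name__ == "__main__":
    for sigma, w in exterior_penalty():
        print(f"sigma = {sigma:g}, minimizer = {w:.6f}")
```

In this toy problem the constrained optimum is w = 1, and each penalized minimizer is $(2 + \sigma)/(1 + \sigma)$: the iterates stay slightly infeasible and approach the boundary from outside as $\sigma_k$ grows, which is why the method is called an exterior penalty method.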
5 GKNN Based EX-cICA Algorithm
As discussed in Section 4, the EX-cICA objective function is given by (11)-(12). Differentiating the objective function in (11) with respect to W, the gradient of F(W, σ) is given by

$$\nabla_W F(W, \sigma) = \nabla_W C(y : W) + \sigma_k \nabla_W P(y : W), \quad k = 1, 2, \cdots \qquad (15)$$

where the matrix $\nabla_W C(y : W) = [\nabla_{w_1} C(y : W) \; \nabla_{w_2} C(y : W) \; \cdots \; \nabla_{w_N} C(y : W)]^T$ denotes the gradient of C(y : W) with respect to W, and the term $\nabla_W P(y : W) = [\nabla_{w_1} P(y : W) \; \nabla_{w_2} P(y : W) \; \cdots \; \nabla_{w_N} P(y : W)]^T$ has elements $\nabla_{w_n} P(y : W) = \sum_{i=1}^{m} \nabla_{w_n} \phi(g_i(y : W)) + \sum_{j=1}^{n} \nabla_{w_n} \psi(h_j(y : W))$. Thus, the gradient descent learning equation for W is obtained as [9]

$$\Delta W = \eta [W^{-T} - \varphi(y) x^T + \Upsilon(y : W)] \qquad (16)$$

where $\Upsilon(y : W) = \sigma_k \nabla_W P(y : W)$. The learning equation (16) was obtained using the simple stochastic gradient descent technique; more efficient learning can be achieved by using the natural gradient technique proposed by Amari et al. [3]. The corresponding natural gradient descent learning rule is obtained by multiplying the right-hand side of Eq. (16) by $W^T W$:

$$\Delta W = \eta [I - \varphi(y) y^T + \Upsilon(y : W) W^T] W \qquad (17)$$

which rescales the gradient, simplifies the learning rule by removing the matrix inversion, and thereby speeds up the convergence. Here the score function φ(y) is obtained from the GKNN estimate in Eq. (6). The statistical properties (e.g., consistency, asymptotic variance) of the cICA algorithm depend on the choice of the contrast function and the constraints involved in the objective function [10]. Any function whose optimization enables the estimation of the ICs can be treated as the ICA contrast function. However, the constraints that are used to define or restrict the properties of the ICs should not infringe the independence criteria. This can be confirmed by verifying the formulated cICA objective functions against the ICA equivariant properties. Several constraints have been adopted in [8-10]: elimination of permutation and dilation indeterminacy was discussed in [9]; under-complete ICA and ICA with reference were given in [10]. These constraints can be used in the proposed EX-cICA algorithm too.
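A single update of the natural gradient rule (17) can be sketched as follows. The sketch is illustrative only: the helper names are hypothetical, W is assumed square (N × N), the update is written for one observation x, the score function is passed in as a callable (it could be the GKNN estimate of Eq. (6)), and the constraint term Υ is passed in as a precomputed N × N matrix or omitted.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def natural_gradient_step(W, x, score, eta, upsilon=None):
    """One update W <- W + eta * [I - score(y) y^T + Upsilon W^T] W, with y = W x."""
    n = len(W)
    y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
    phi = [score(v) for v in y]
    # Bracketed term without the constraint: I - score(y) y^T.
    bracket = [[(1.0 if i == j else 0.0) - phi[i] * y[j] for j in range(n)]
               for i in range(n)]
    if upsilon is not None:
        # Constraint contribution Upsilon(y : W) W^T, added entrywise.
        uw = matmul(upsilon, transpose(W))
        bracket = [[bracket[i][j] + uw[i][j] for j in range(n)] for i in range(n)]
    delta = matmul(bracket, W)
    return [[W[i][j] + eta * delta[i][j] for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    W0 = [[1.0, 0.0], [0.0, 1.0]]
    print(natural_gradient_step(W0, [1.0, 0.0], score=lambda v: v, eta=0.1))
```

In practice the bracketed term would be averaged over a batch of observations before the update; the single-sample form is kept here only to make the correspondence with (17) easy to read.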
6 Simulation Results
In order to confirm the validity of the proposed EX-cICA algorithm, simulations using MATLAB are reported below with six source signals which have different waveforms. The input signals were generated by mixing the six simulated sources with a 6 × 6 random mixing matrix whose elements were distributed uniformly. The sources and mixtures are displayed in Figs. 1(a) and (b), respectively. The statistical performance, or accuracy of the separation of the input signals into the ICs, was measured by a performance index (PI) of the permutation error defined as [10]:

$$PI = \sum_{i=1}^{N} rPI_i + \sum_{j=1}^{N} cPI_j,$$

where

$$rPI_i = \sum_{j=1}^{N} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \quad \text{and} \quad cPI_j = \sum_{i=1}^{N} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1,$$

in which $p_{ij}$ denotes the (i, j)th element of the permutation matrix P = WA. The term $rPI_i$ gives the error of the separation of the output component $y_i$ with respect to the sources, and $cPI_j$ measures the degree to which the desired IC, $s_j$, appears multiple times at the output. PI is zero when the desired subset of ICs is perfectly separated. PI is close to zero when all ICs are perfectly separated.

[Figure 1, plots not reproduced: (a) sources s1 to s6; (b) mixtures x1 to x6; (c) recovered signals y1 to y6. Horizontal axis of every panel: sample no., 0 to 5000.]

Fig. 1. Experiment Results Showing the Separation of Symmetric and Asymmetric Sources Using the Proposed EX-cICA Algorithm

The learning algorithm that sorts the estimated ICs according to their skewness values, using Eq. (17) with the constraint

$$\Upsilon_i(y : W) = (\mu_{i-1} - \mu_i) U'(y_i) x, \; \forall i = 1, 2, \cdots, N, \quad \text{and} \quad U(y_i) = \frac{m_3(y_i)}{\sqrt{m_2^3(y_i)}},$$

where $U'(y_i)$ is the first derivative of $U(y_i)$, converged in 200 iterations. The initial separating matrix W is also randomly generated. The learning rate is η = 0.01 at the first stage of the separation process and is then reduced to η = 0.001 at the last stage of the optimization, because this procedure makes the algorithm converge fast and robustly [5]. The recovered signals y(t) are plotted in Fig. 1(c). For comparison, we also process the mixed signals with different ICA algorithms: the FastICA algorithm [6], the Natural Gradient ICA algorithm [4], and the Adaptive Spline Neural Network (ASNN) algorithm [12]. As shown in Table 1, the proposed algorithm separated and sorted the output components in decreasing order of skewness values. The final PI of 0.0296 indicates a good separation of the original sources from the inputs. The other algorithms also achieve good PI values, but they cannot order the sources.

Table 1. Separation results for various ICA algorithms (ordering of the ICs according to skewness value)
Algorithm   y1        y2        y3        y4          y5        y6        PI
EX-cICA     5.4982    3.8769    1.9933    -4.3e-016   -0.0598   -0.6818   0.0296
FastICA     -0.0600   0         3.8349    5.5033      1.9933    -0.6820   0.0434
NG-ICA      3.8770    -0.0599   0         5.4892      -0.6799   1.9936    0.0391
ASNN-ICA    1.9923    -0.0594   5.4984    -0.6813     3.8771    0         0.0298

7 Conclusion
In this paper, a novel ICA algorithm for hybrid source separation based on constrained optimization, the exterior penalty function method, is proposed. The proposed EX-cICA algorithm falls under the framework of cICA, which is a general framework for incorporating additional knowledge into the classical ICA algorithm. We demonstrated an application of EX-cICA: ordering the ICs according to skewness. In some sense, EX-cICA is a semi-blind approach and has many advantages and applications over the classical ICA methods when useful constraints are incorporated into the contrast function. But not all kinds of constraints can be used in cICA, because some may infringe the ICA equivariant properties. The constraints should be selected and formulated properly in the sense of being consistent with the independence criteria [10]. To choose the nonlinear functions used as the PDF estimate of the source signals, GKNN PDF estimation is proposed, which can separate hybrid mixtures of source signals using only a flexible model and, more importantly, is completely blind to the sources. The proposed EX-cICA algorithm opens the way to wider applications of ICA methods in real-world signal processing. Simulations confirm the effectiveness of the proposed algorithm.
Acknowledgment This work is partially supported by National Natural Science Foundation of China(Grant No. 60472062).
References
1. Comon P., Independent component analysis: A new concept? Signal Processing, vol.36, no.3, pp.287-314, 1994.
2. Bell A., Sejnowski T., An information maximization approach to blind separation and blind deconvolution. Neural Computation, vol.7, no.6, pp.1129-1159, 1995.
3. Amari S., Natural gradient works efficiently in learning. Neural Computation, vol.10, no.1, pp.251-276, 1998.
4. Xu L., Cheung C.C. and Amari S.I., Learned Parametric Mixture Based ICA Algorithm. Neurocomputing, vol.22, no.1-3, pp.69-80, 1998.
5. Zhang X.D., Zhu X.L. and Bao Z., Grading learning for blind source separation. Science in China (Series F), vol.46, no.1, pp.31-44, 2003.
6. Hyvarinen A., Oja E., A fast fixed-point algorithm for independent component analysis. Neural Computation, vol.9, no.7, pp.1483-1492, 1998.
7. Boscolo R., Pan H. and Roychowdhury V.P., Independent Component Analysis Based on Nonparametric Density Estimation. IEEE Trans. on Neural Networks, vol.15, no.1, pp.55-65, 2004.
8. Lu W., Rajapakse J.C., Constrained independent component analysis, in: Leen T., Dietterich T., Tresp V. (Eds.), Advances in Neural Information Processing Systems, vol.10, MIT Press, Cambridge, MA, pp.570-576, 2001.
9. Lu W., Rajapakse J.C., Eliminating indeterminacy in ICA. Neurocomputing, vol.50, pp.271-290, 2003.
10. Lu W., Rajapakse J.C., Approach and Applications of Constrained ICA. IEEE Trans. on Neural Networks, vol.16, no.1, pp.203-212, 2005.
11. Silverman B.W., Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall, 1985.
12. Pierani A., Piazza F., Solazzi M. and Uncini A., Low complexity adaptive nonlinear function for blind signal separation, in Proc. IEEE Int. Joint Conf. Neural Networks, Italy, vol.3, pp.333-338, 2000.
Denoising by Anisotropic Diffusion of Independent Component Coefficients
Yinghong Luo1, Caixia Tao1, Xiaohu Qiang1, and Xiangyan Zeng2
1 School of Information and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
2 Department of Biological Sciences, University of California Davis, CA 95616, USA
[email protected], [email protected], [email protected], [email protected]
Abstract. In this paper, we propose an image denoising method that combines anisotropic diffusion and independent component analysis (ICA) techniques. An image is decomposed into independent component coefficients, and anisotropic diffusion is applied to filter the IC coefficients. The proposed method achieves much better noise suppression with minimal edge blurring compared with other denoising methods, such as the original anisotropic diffusion filter and wavelet shrinkage. The effectiveness of the proposed method is demonstrated by simulation experiments on medical image denoising.
Keywords: Image denoising, anisotropic diffusion, independent component analysis, shrinkage.
1 Introduction
Denoising is an active research field in image processing and has attracted much attention in recent years. In the conventional literature, to denoise a digital image is to filter out the noise. This is directly reflected in low-pass filtering (smoothing) methods. The Gaussian filter is probably the most representative example; however, it tends to blur the image while smoothing the noise. Better performance is achieved with the Wiener filter, which uses local means and autocorrelation functions to estimate the underlying image. More recent techniques include isotropic and anisotropic diffusion methods that are related to adaptive smoothing [1][2]. Anisotropic diffusion can smooth noise without blurring edges. However, its effectiveness is sensitive to the image gradient threshold, and the iterative diffusion brings about an enlarged edge effect in the original image. Another line of research is the wavelet shrinkage methods [3]. The basic motivation behind these methods is that the wavelet coefficients of many signals are often very sparse, so that one can remove noise in the wavelet domain. As an alternative to the wavelet transform, the independent component transform also gives sparse components and is data-adaptive [4][5]. It has been reported that ICA-domain filtering reduces Gaussian noise in non-Gaussian signals [6][7]. Similar to wavelet methods, the ICA-domain shrinkage approach applies a soft-threshold operator to the sparse ICA components.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 1117-1124, 2006. © Springer-Verlag Berlin Heidelberg 2006
Since the thresholding is done on each pixel, the spatial relation of the coefficients is not used in traditional shrinkage methods. Following a line of research close to shrinkage methods, we propose a new denoising method in the ICA transform space, in which the shrinkage operation is replaced by a diffusion operation. The paper is organized as follows: in Section 2, we review related denoising algorithms. Section 3 gives a new denoising algorithm using anisotropic diffusion of independent component coefficients. The experimental results of medical image denoising are shown in Section 4. Concluding remarks are given in the final section.
2 Related Research
2.1 Anisotropic Diffusion
The diffusion equation is a partial differential equation which describes the flow of particles or energy. For a two-dimensional system, the basic diffusion equation is

$$\frac{\partial I}{\partial t} = c \nabla^2 I \qquad (1)$$

where I = I(x, y, t), $\nabla^2$ is the Laplacian operator and c > 0 is a conduction coefficient. The solution of the diffusion equation (1) is a family of derived images obtained by convolving the initial image $I_0(x, y)$ with Gaussian kernels. Large values of t correspond to images at coarse resolutions, which is the basic idea of denoising by an iterative diffusion process. A constant conduction coefficient c leads to Gaussian smoothing. On the other hand, conduction coefficients chosen locally as a function of the magnitude of the gradient of the brightness,

$$c(x, y, t) = g(\|\nabla I(x, y, t)\|) \qquad (2)$$

will not only preserve but also sharpen the edges during the smoothing process if an appropriate function g(·) is used. For instance, the following function

$$g(x) = \exp(-(x / K)^2) \qquad (3)$$
where K is a constant, preserves the edges by privileging high-contrast edges over low-contrast ones. Using the four nearest neighbors in the calculation of the Laplacian operator in the diffusion equation (1) gives a family of smoothed images at different times t:

$$I^{t}_{i,j} = I^{t-1}_{i,j} + \lambda \{ c_N \cdot \nabla_N I_{i,j} + c_S \cdot \nabla_S I_{i,j} + c_E \cdot \nabla_E I_{i,j} + c_W \cdot \nabla_W I_{i,j} \} \qquad (4)$$

where 0 ≤ λ ≤ 1/4 is a constant factor, N, S, E, W are the subscripts for north, south, east and west, and the nearest-neighbor differences and the corresponding conduction coefficients in these directions are defined as

$$c_N = g(|\nabla_N I_{i,j}|), \quad c_S = g(|\nabla_S I_{i,j}|), \quad c_E = g(|\nabla_E I_{i,j}|), \quad c_W = g(|\nabla_W I_{i,j}|),$$

$$\nabla_N I_{i,j} = I_{i-1,j} - I_{i,j}, \quad \nabla_S I_{i,j} = I_{i+1,j} - I_{i,j}, \quad \nabla_E I_{i,j} = I_{i,j+1} - I_{i,j}, \quad \nabla_W I_{i,j} = I_{i,j-1} - I_{i,j}.$$
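One iteration of the discrete update (4) with the conduction function (3) might look as follows in Python. This is a minimal sketch, not the authors' code: the image is a plain list of rows, the function names are hypothetical, and the border is handled by replicating edge pixels, which is an assumption since the text does not specify a boundary condition.

```python
import math

def conduction(d, K):
    """Conduction function g(x) = exp(-(x / K)^2) of Eq. (3)."""
    return math.exp(-(d / K) ** 2)

def diffusion_step(img, lam=0.2, K=10.0):
    """One four-neighbour anisotropic diffusion update of Eq. (4) on a 2-D list of floats."""
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(rows):
        for j in range(cols):
            center = img[i][j]
            # Nearest-neighbour differences; border pixels are replicated (assumption).
            diffs = (
                img[max(i - 1, 0)][j] - center,         # north
                img[min(i + 1, rows - 1)][j] - center,  # south
                img[i][min(j + 1, cols - 1)] - center,  # east
                img[i][max(j - 1, 0)] - center,         # west
            )
            out[i][j] = center + lam * sum(conduction(abs(d), K) * d for d in diffs)
    return out

if __name__ == "__main__":
    print(diffusion_step([[0.0, 1.0, 0.0]]))
    print(diffusion_step([[0.0, 0.0, 100.0, 100.0]]))
```

With K = 10, a unit-sized fluctuation is pulled toward its neighbours, whereas a jump of 100 receives a conduction coefficient of about exp(−100) and is left essentially untouched, which is the edge-preserving behaviour described above.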
2.2 Independent Components Shrinkage Method
Different variants of shrinkage methods have been proposed for reducing noise. They generally consist of three steps: (1) calculate the wavelet (ICA) transform of the noisy image; (2) threshold the noisy coefficients by a shrinkage function; (3) compute the inverse transform using the modified coefficients. Hyvarinen and his co-workers proposed an ICA-based denoising method, which is very close to wavelet shrinkage methods but has the important benefit over wavelet methods that the representation is determined by the statistical properties of the data sets. The filter bank is obtained from statistical analysis of image patches. Each image patch is reshaped row-by-row into a column $z = (z_1, z_2, \cdots, z_{64})$. ICA is used to find a matrix W so that the elements of the resulting vector

$$x = Wz \qquad (5)$$

are statistically as independent as possible over the image patches. Each row of W is reshaped into a two-dimensional filter. The ICA shrinkage function also depends on the statistical properties of images. Hyvarinen et al. use the ICA-domain shrinkage function

$$f(u) = \mathrm{sign}(u) \max\left(0, |u| - \sqrt{2} \sigma^2 / d\right) \qquad (6)$$

where u is the ICA coefficient of a pixel, d is a scaling factor, and $\sigma^2$ is the variance of the Gaussian noise in the coefficients, which can be estimated from the mean absolute deviation of the very sparsest coefficients. The effect of the function is to reduce the absolute value of its argument by a certain amount which depends on the noise level. The shrinkage operation has successfully reduced Gaussian noise in images.
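The shrinkage operation is a soft threshold, which the following sketch makes concrete. It is an illustration only: the function name `soft_shrink` is hypothetical, and the threshold is passed in as a single number, so the sketch does not depend on how the threshold in (6) is computed from σ and d.

```python
def soft_shrink(u, threshold):
    """Soft-threshold shrinkage: sign(u) * max(0, |u| - threshold)."""
    magnitude = max(0.0, abs(u) - threshold)
    if u > 0:
        return magnitude
    if u < 0:
        return -magnitude
    return 0.0

def shrink_coefficients(coeffs, threshold):
    """Apply the shrinkage independently to every coefficient, as shrinkage methods do."""
    return [soft_shrink(u, threshold) for u in coeffs]

if __name__ == "__main__":
    print(shrink_coefficients([3.0, -3.0, 0.5, -0.2], threshold=1.0))
```

Large coefficients are reduced in magnitude by the threshold, while every coefficient at or below it is set exactly to zero; each coefficient is treated on its own, without reference to its spatial neighbours.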
3 Anisotropic Diffusion of Independent Components
Although the ICA coefficients are relatively sparse, which makes the noise components easier to remove, the shrinkage function is calculated from the intensity of each pixel alone and no neighborhood spatial information is involved. Determining the ICA coefficients individually may not be appropriate; for instance, setting coefficients whose magnitude falls below the shrinkage threshold to zero results in information loss. To incorporate the spatial relation in ICA space, we propose to use anisotropic diffusion with a conduction coefficient similar to (2),

$$c(u_{x,y}, t) = g(\|\nabla(u_{x,y}, t)\|) \qquad (7)$$

to replace the shrinkage operation on the ICA coefficients, where g(·) is defined as in Eq. (3). Compared with shrinkage methods, which take out the assumed noise components, the proposed method updates the ICA coefficients iteratively and diffuses the noise in each step. The latter is expected to introduce fewer artifacts in the noise removal process. The proposed method performs denoising in the following two steps.
Step 1. Estimate an orthogonal ICA transformation matrix W using a set of natural scene patches.
Step 2. For each image x, the denoising procedure is:
1) ICA transform using each row of W as a filter: $y = Wx$
2) Iterative anisotropic diffusion of each coefficient image $y_i$
3) Reverse transform: $s = W^T y$
In contrast to the shrinkage methods, which threshold each pixel in step 2), the anisotropic diffusion operation is applied to the whole image.
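The three parts of Step 2 can be sketched on a toy one-dimensional example. Everything concrete here is an assumption made for illustration: the patches are length-2 vectors instead of 8 × 8 image patches, W is a 2 × 2 rotation standing in for a learned orthogonal ICA matrix, and each coefficient sequence is smoothed with a two-neighbour version of the diffusion update rather than the four-neighbour image version.

```python
import math

def conduction(d, K):
    """Conduction function g(x) = exp(-(x / K)^2)."""
    return math.exp(-(d / K) ** 2)

def diffuse_1d(seq, iters, lam=0.2, K=1.0):
    """Two-neighbour anisotropic diffusion of one coefficient sequence (1-D stand-in)."""
    cur = list(seq)
    for _ in range(iters):
        nxt = cur[:]
        for t in range(len(cur)):
            left = cur[max(t - 1, 0)] - cur[t]
            right = cur[min(t + 1, len(cur) - 1)] - cur[t]
            nxt[t] = cur[t] + lam * (conduction(abs(left), K) * left
                                     + conduction(abs(right), K) * right)
        cur = nxt
    return cur

def denoise(patches, W, iters):
    """Transform each patch with W, diffuse every coefficient sequence, transform back with W^T."""
    n = len(W)
    # 1) y = W x for every patch, stored as one sequence per coefficient.
    coeffs = [[sum(W[i][j] * p[j] for j in range(n)) for p in patches] for i in range(n)]
    # 2) Diffuse each coefficient sequence across neighbouring patches.
    coeffs = [diffuse_1d(c, iters) for c in coeffs]
    # 3) s = W^T y for every patch position.
    return [[sum(W[i][j] * coeffs[i][t] for i in range(n)) for j in range(n)]
            for t in range(len(patches))]

if __name__ == "__main__":
    theta = 0.3
    W = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
    print(denoise([[1.0, 2.0], [1.5, 2.5], [0.5, 1.0]], W, iters=5))
```

Because W is orthogonal, running the pipeline with zero diffusion iterations returns the input patches, so any change in the output comes from the diffusion applied in the coefficient domain.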
4 Experimental Results of Medical Image Noise Reduction
In this section, we present a comparison of the proposed method, anisotropic diffusion, and ICA shrinkage methods. Denoising results on test images and clinical X-ray images are used to evaluate the performance. In the anisotropic diffusion, we use the parameter λ = 0.1 and set K to the mean gradient at every iteration. To obtain the ICA transform matrix, we apply the FastICA algorithm [8] proposed by A. Hyvarinen et al. to 8000 natural image patches of size 8 × 8. The 64 ICA filters are shown in Fig. 1. For the proposed method and the ICA shrinkage method, after the ICA transformation, the update to the ICA coefficients is performed for all the possible 8 × 8 neighborhood windows of each pixel, and thus 64 reconstructions are obtained for each pixel. We noticed that a smaller number of reconstructions (≥ 20) improves the efficiency of the algorithm and has similar denoising performance. The final result is the mean of these reconstructions.

Fig. 1. 64 ICA filters obtained by training an ensemble of 8 × 8 natural image patches

We first apply the algorithms to a test image degraded manually by Gaussian noise. The original image and the noisy image are shown in Fig. 2. The signal-to-noise ratio (SNR) and the mean square error (MSE) of the three methods are summarized in Table 1. The proposed method achieves the highest SNR and lowest MSE.

Fig. 2. (a) Original image; (b) noisy image

Table 1. SNR and MSE comparison
Method                     SNR (dB)   MSE
Noisy image                19.6       206.4
Anisotropic diffusion      20.6       164.6
ICA shrinkage              24.6       65.3
Our denoising algorithm    29.3       22.2

Experiments are then carried out on two real clinical X-ray images which are contaminated by noise. Denoising results are shown in Fig. 3 and Fig. 4. It is visually observed that the proposed method is capable of producing better noise-removal results than the ICA shrinkage and anisotropic diffusion methods. For example, the blood flow is clear in the result of the proposed method, while it is degraded in the other two methods. The anisotropic diffusion does not blur the main edges, but the detail information is corrupted because the edges are coarsened. Due to the sparseness of the ICA transform space and the edge-preserving ability of anisotropic diffusion, it is not surprising that the proposed method gives the best results.

Fig. 3. (a) original image, (b) wavelet denoising, (c) anisotropic diffusion denoising, (d) proposed method

Fig. 4. (a) original image, (b) wavelet denoising, (c) anisotropic diffusion denoising, (d) proposed method
5 Conclusion
In this paper, we propose a new approach for reducing image noise. The proposed method applies anisotropic diffusion to modify the ICA-transformed coefficients and uses the inverse transform to reconstruct images. Combining the sparseness of the ICA coefficients with the edge-preserving property of anisotropic diffusion leads to much improved denoising results.
References
1. P. Perona and J. Malik: Scale-Space and Edge Detection Using Anisotropic Diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, (1990) 629-639.
2. P.S. Marc, J.S. Chen and G. Medioni: Adaptive Smoothing: A General Tool for Early Vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 6, (1991) 514-529.
3. R.D. Nowak and R. Baraniuk: Wavelet Domain Filtering for Photon Imaging Systems. IEEE Transactions on Image Processing, vol. 8, no. 5, (1999) 666-678.
4. A.J. Bell and T.J. Sejnowski: The 'independent components' of Natural Scenes Are Edge Filters. Vision Research, vol. 37, (1997) 3327-3338.
5. X.-Y. Zeng, Y.-W. Chen, D. van Alphen, and Z. Nakao: Selection of ICA Features for Texture Classification. Lecture Notes in Computer Science, Springer-Verlag, vol. 3497, (2005) 262-267.
6. A. Hyvarinen, P. Hoyer and E. Oja: Sparse Code Shrinkage for Image Denoising. Proc. IEEE Int. Conf. Neural Networks, (1998) 859-864.
7. A. Hyvarinen: Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation. Neural Computation, vol. 11, no. 7, (1999) 1739-1768.
8. A. Hyvarinen and E. Oja: A Fast Fixed-Point Algorithm for Independent Component Analysis. Neural Computation, vol. 9, (1997) 1483-1492.
Blind Separation of Digital Signal Sources in Noise Circumstance*
Beihai Tan and Xiaolu Li
College of Electronic and Communication Engineering, South China University of Technology 510640, China
[email protected]
Abstract. During blind separation, noise exists and affects the result. This paper presents novel techniques for blind separation of instantaneously mixed digital sources in a noisy environment, based on the characteristics of digital signals. The blind separation and denoising algorithm includes two steps. First, an existing adaptive blind separation algorithm is used to separate the sources, but noise still remains in the separated signals; then the second step removes the noise according to the characteristics of digital signals. Simulations at the end illustrate the good performance of the algorithm.
1 Introduction
The blind source separation (BSS) problem is currently receiving increased interest [1],[2],[3] in numerous engineering applications. The problem consists in restoring n unknown, statistically independent random sources from n available observations that are linear combinations of these sources. In recent years, blind source separation has been a hot topic in the signal processing and neural network fields [4],[5]; furthermore, it is widely applied to wireless communication, radar, image processing, speech, medicine, seismic signal processing, etc. After more than ten years of research, there are many methods and adaptive algorithms for blind separation, e.g. [6],[7],[8] and the references therein. Among the problems of BSS, separability theory is central. For example, the authors of [9] discussed the separability of blind source separation in the linear mixture case. By using the information of the mixing matrix, they obtained results on whether the source signals can be extracted and how many source signals can be extracted, which enriches the separability theory of blind source separation. But most existing blind separation algorithms [10],[11],[12] are proposed for the noise-free case, which is an ideal case and is at odds with the real world.

* The work is supported by the National Natural Science Foundation of China for Excellent Youth (Grant 60325310), the Guangdong Province Science Foundation for Program of Research Team (Grant 04205783), the National Natural Science Foundation of China (Grant 60505005), the Natural Science Fund of Guangdong Province, China (Grant 05103553), and the Specialized Prophasic Basic Research Projects of Ministry of Science and Technology, China (Grant 2005CCA04100).
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 1125 – 1132, 2006. © Springer-Verlag Berlin Heidelberg 2006
Moreover, their computations are complicated. As digital signals are applied more and more extensively, it is very important to study blind separation of digital signals, and especially blind separation of digital signals with noise. This paper presents a two-step method of blind separation and denoising based on the characteristics of digital signals, and its good denoising performance is shown in the simulations at the end.
2 Blind Separation and Denoising Algorithm

In this section we consider the blind separation model in the noisy case. An adaptive blind separation algorithm is used to search for the separating matrix, which yields separated signals that still contain noise. The noise is then removed by denoising rules based on the $p_i$ values of each separated signal. The following subsections present the algorithm in detail.

2.1 Blind Separation Mathematics Model

The blind separation problem can be stated mathematically as

$$x(t) = A s(t) + n(t), \quad t = 1, 2, \ldots, N, \tag{1}$$

$$u(t) = W x(t), \quad t = 1, 2, \ldots, N, \tag{2}$$

where $x(t) = (x_1(t), \ldots, x_n(t))^T$ are the sensor signals, $s(t) = (s_1(t), \ldots, s_n(t))^T$ are the source signals, $s_i(t) \in K_i \equiv \{l_1, l_2, \ldots, l_{p_i}\}$ $(i = 1, 2, \ldots, n)$, and $l_j$ $(j = 1, 2, \ldots, p_i)$ are the $p_i$ values of $s_i(t)$. $n(t) = (n_1(t), \ldots, n_n(t))^T$ is additive Gaussian white noise with zero mean and variances $\sigma_i^2$, and $A = (a_{ij})_{n \times n}$ is the unknown mixing matrix. Binary signals and analog signals can be regarded as special digital signals with $p_i = 2$ and $p_i = \infty$, respectively. First, we carry out blind separation ignoring the noise. Blind separation aims to adjust the separating matrix $W$ so that the separated signals $\hat{s}(t) = (\hat{s}_1(t), \hat{s}_2(t), \ldots, \hat{s}_n(t))^T$ are independent and their waveforms match those of the source signals, namely

$$\hat{s}(t) = W A s(t) = P D s(t). \tag{3}$$

Since noise is present, combining (1) and (2) gives

$$\hat{s}(t) = W x(t) = W A s(t) + W n(t) = P D s(t) + W n(t), \tag{4}$$

where $P$ is a permutation matrix and $D$ is a diagonal matrix. To reach this goal, many adaptive blind separation algorithms have been proposed for learning the separating matrix $W$, such as the Amari-Cichocki adaptive blind separation algorithm. These algorithms can be summarized in the following form [6], namely
Blind Separation of Digital Signal Sources in Noise Circumstance
$$W(k+1) = W(k) + \eta(k)\left(I - G(u(k))\,u(k)^{T}\right) W(k) \tag{5}$$
where $G(u) = (g_1(u_1), \ldots, g_n(u_n))^T$ are nonlinear functions, $\eta(k)$ is the step size of the $k$-th iteration, and

$$g_i(u_i) = -\frac{d \log p_i(u_i)}{d u_i} = -\frac{p_i'(u_i)}{p_i(u_i)}. \tag{6}$$

The separation matrix $W$ can thus be obtained by the above adaptive blind separation algorithms and, without loss of generality, (4) can be written as

$$\hat{s}_i(t) = \lambda_i s_i(t) + \sum_j w_{ij} n_j(t), \quad i = 1, 2, \ldots, n, \tag{7}$$

where $\lambda_i$ is a scale factor between $\hat{s}_i(t)$ and $s_i(t)$, which does not change the waveform. Equation (7) shows that every separated signal $\hat{s}_i$ $(i = 1, 2, \ldots, n)$ contains noise. To remove this noise, we adopt the second step.

2.2 Density Estimation and p Values Estimation
To remove the noise from every $\hat{s}_i$ $(i = 1, 2, \ldots, n)$, the following operations are applied to each $\hat{s}_i$. Without loss of generality, we take $\hat{s}_i$ as an example. From (7), $\hat{s}_i$ contains the noise term $\sum_j w_{ij} n_j(t)$. Because the $n_j(t)$ $(j = 1, 2, \ldots, n)$ have zero mean and variances $\sigma_j^2$, respectively, $\sum_j w_{ij} n_j(t)$ is also additive Gaussian white noise with zero mean and variance $\sum_j (w_{ij}\sigma_j)^2$. For simplicity, we rewrite (7) as

$$\hat{s}_i(t) = \tilde{s}_i(t) + \tilde{n}_i(t), \quad i = 1, 2, \ldots, n, \tag{8}$$

where $\tilde{s}_i(t) = \lambda_i s_i(t)$ and $\tilde{n}_i(t) = \sum_j w_{ij} n_j(t)$ $(i = 1, 2, \ldots, n)$.

The samples of a zero-mean Gaussian random variable concentrate in a neighborhood of the origin. For example, if $\tilde{n}_i(t)$ follows the one-dimensional standard normal distribution, the probability of the event $\{|\tilde{n}_i(t)| > 3\}$ is less than 0.05 (approximately 0.003). In (8), $s_i(t)$ is a $p_i$-valued digital signal, so it belongs to $K_i \equiv \{l_1, l_2, \ldots, l_{p_i}\}$. As long as the noise energy is not very large, the density function of $\hat{s}_i(t)$ has local maxima at the points $l_1, l_2, \ldots, l_{p_i}$. It has been shown [13],[14] that the density function of model (8) is
$$p(\hat{s}_i(t)) = \sum_{j=1}^{p_i} \frac{P_j}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\hat{s}_i(t) - l_j)^2}{2\sigma^2}\right), \tag{9}$$

where $l_1, l_2, \ldots, l_{p_i}$ are the $p_i$ different digital values, $P_j$ is the probability that the noise-free signal takes the value $l_j$, and $\sigma^2$ is the variance of the noise $\tilde{n}_i(t)$. If the noise energy is small enough, the density function $p(\hat{s}_i(t))$ has $p_i$ distinct peaks; otherwise, some peaks disappear. From (9), $p(\hat{s}_i(t))$ has $p_i$ local maxima, and they can be estimated by solving the optimization problem

$$\max_{l_1, \ldots, l_{p_i}} p(\hat{s}_i(t)). \tag{10}$$

Moreover,

$$\frac{\partial p(\hat{s}_i(t))}{\partial l_j} = \frac{(\hat{s}_i(t) - l_j) P_j}{\sqrt{2\pi}\,\sigma^3} \exp\left(-\frac{(\hat{s}_i(t) - l_j)^2}{2\sigma^2}\right), \quad j = 1, 2, \ldots, p_i. \tag{11}$$
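As a rough illustration of (9) and (11), the sketch below evaluates the mixture density and its partial derivative with respect to one level, and compares the latter with a central finite difference. The function names and the numerical values (two levels at ±1, equal probabilities, $\sigma = 0.3$) are assumptions chosen for the example, not values from the paper.

```python
import math

def mixture_density(s, levels, probs, sigma):
    """Eq. (9): Gaussian mixture with one component per digital level."""
    norm = math.sqrt(2.0 * math.pi) * sigma
    return sum(p / norm * math.exp(-(s - l) ** 2 / (2.0 * sigma ** 2))
               for l, p in zip(levels, probs))

def density_gradient(s, levels, probs, sigma, j):
    """Eq. (11): partial derivative of the density with respect to level l_j."""
    l, p = levels[j], probs[j]
    return ((s - l) * p / (math.sqrt(2.0 * math.pi) * sigma ** 3)
            * math.exp(-(s - l) ** 2 / (2.0 * sigma ** 2)))

# Illustrative values (assumptions, not taken from the paper).
levels, probs, sigma = [-1.0, 1.0], [0.5, 0.5], 0.3
s, j, h = 0.8, 1, 1e-6

# Central finite difference in l_j as an independent check of Eq. (11).
up = levels[:j] + [levels[j] + h] + levels[j + 1:]
down = levels[:j] + [levels[j] - h] + levels[j + 1:]
numeric = (mixture_density(s, up, probs, sigma)
           - mixture_density(s, down, probs, sigma)) / (2.0 * h)
analytic = density_gradient(s, levels, probs, sigma, j)
print(analytic, numeric)
```

The gradient is negative here because the sample 0.8 lies below the level 1.0, so moving the level upward lowers the density at that sample.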
In this paper, a gradient ascent algorithm is used to solve (10). From (11), we obtain the gradient ascent algorithm

$$l_j(k+1) = l_j(k) + \alpha_j(k)\frac{\partial p(\hat{s}_i(t))}{\partial l_j} = l_j(k) + \frac{\hat{\alpha}_j(k)\,(\hat{s}_i(k) - l_j(k))}{\sqrt{2\pi}\,\sigma^3} \exp\left(-\frac{(\hat{s}_i(k) - l_j(k))^2}{2\sigma^2}\right), \quad j = 1, 2, \ldots, p_i, \tag{12}$$

where $\alpha_j(k)$ and $\hat{\alpha}_j(k) = \alpha_j(k) P_j$ are both step sizes of the $k$-th iteration. To carry out the gradient algorithm (12), we first estimate the density function $p(\hat{s}_i(t))$ of $\hat{s}_i(t)$. Suppose that there are $N_0$ sample points of $\hat{s}_i(t)$, denoted as a set $X$, where $N_0 \gg p_i$, and let $a$ and $b$ be the minimum and maximum of $X$, respectively. The interval $[a, b]$ is divided equally into $M$ sub-intervals, which are $[a + i\delta, a + (i+1)\delta]$, $i = 0, 1, 2, \ldots, M-2$, and $[a + (M-1)\delta, b]$, where $\delta = \frac{b-a}{M}$ and $M$ is a sufficiently large positive integer. Counting the sample points in each sub-interval, denoted by $m_i$ for the $i$-th interval, gives the probability that $\hat{s}_i(t)$ belongs to the $i$-th interval, that is,

$$\tilde{P}_i = \frac{m_i}{N_0}, \quad i = 1, 2, \ldots, M. \tag{13}$$

To make the estimated pdf smooth, we use the following filter:

$$\hat{P}_k = \frac{1}{16}\left(\tilde{P}_{k-2} + 4\tilde{P}_{k-1} + 6\tilde{P}_k + 4\tilde{P}_{k+1} + \tilde{P}_{k+2}\right). \tag{14}$$

Repeating the smoothing several times may sometimes be useful.
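The histogram estimate (13) and the smoothing filter (14) can be sketched as follows. This is a minimal illustration: bins are indexed from 0, and bins outside the range are treated as zero when the filter reaches past either end, which is an assumption since the paper does not say how the boundary is handled.

```python
def histogram_probabilities(samples, m):
    """Eq. (13): fraction of samples in each of m equal-width bins on [a, b].

    Returns (probabilities, a, delta) with 0-based bins
    [a + i*delta, a + (i+1)*delta); the last bin also contains b.
    """
    a, b = min(samples), max(samples)
    if a == b:
        raise ValueError("all samples are equal; cannot form bins")
    delta = (b - a) / m
    counts = [0] * m
    for s in samples:
        i = min(int((s - a) / delta), m - 1)  # clamp s == b into the last bin
        counts[i] += 1
    n0 = len(samples)
    return [c / n0 for c in counts], a, delta

def smooth(p):
    """Eq. (14): 5-tap filter (1, 4, 6, 4, 1)/16 with zero padding (assumed)."""
    m = len(p)

    def get(k):
        return p[k] if 0 <= k < m else 0.0

    return [(get(k - 2) + 4 * get(k - 1) + 6 * get(k)
             + 4 * get(k + 1) + get(k + 2)) / 16 for k in range(m)]

probs, a, delta = histogram_probabilities([0.0, 0.1, 0.2, 0.9, 1.0], 2)
print(probs, a, delta)
print(smooth([0, 0, 16, 0, 0]))
```

Calling `smooth` repeatedly on its own output corresponds to the repeated smoothing mentioned above.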
To obtain the $p_i$ different values $l_1, l_2, \ldots, l_{p_i}$ more accurately, the gradient algorithm should start from good initial values, denoted $l_1(0), l_2(0), \ldots, l_{p_i}(0)$. We now describe how to find them. Good initial values require approximate knowledge of the positions of the $p_i$ peaks, so the following rule is used. If the conditions

A1) $\hat{P}_i + \hat{P}_{i+1} + \hat{P}_{i+2} \le \hat{P}_{i+1} + \hat{P}_{i+2} + \hat{P}_{i+3}$;
A2) $\hat{P}_{i+1} + \hat{P}_{i+2} + \hat{P}_{i+3} > \hat{P}_{i+2} + \hat{P}_{i+3} + \hat{P}_{i+4}$;
A3) $\hat{P}_{i+1} > \varepsilon$, or $\hat{P}_{i+2} > \varepsilon$, or $\hat{P}_{i+3} > \varepsilon$,

are satisfied, we take $\hat{P}_{i+2}$ as a peak value, where $\varepsilon$ is a given gate value. In addition, if $\hat{P}_1 + \hat{P}_2 + \hat{P}_3 > \hat{P}_2 + \hat{P}_3 + \hat{P}_4$ and $\hat{P}_1 > \varepsilon$, or $\hat{P}_2 > \varepsilon$, or $\hat{P}_3 > \varepsilon$, then $\hat{P}_2$ is also taken as a peak value. Likewise, if $\hat{P}_{M-2} + \hat{P}_{M-1} + \hat{P}_M > \hat{P}_{M-3} + \hat{P}_{M-2} + \hat{P}_{M-1}$ and $\hat{P}_{M-2} > \varepsilon$, or $\hat{P}_{M-1} > \varepsilon$, or $\hat{P}_M > \varepsilon$, then $\hat{P}_{M-1}$ is a peak value too. This rule gives the estimated number of peaks $p_i$ and the $p_i$ different peak values $\hat{P}_{j_1}, \hat{P}_{j_2}, \ldots, \hat{P}_{j_{p_i}}$. We then let

$$l_k(0) = \frac{(a + j_k\delta) + [a + (j_k + 1)\delta]}{2} = a + \frac{(2 j_k + 1)\delta}{2}, \quad k = 1, 2, \ldots, p_i, \tag{15}$$

which provides good initial values for the following iteration algorithm. The iteration then starts:

$$l_j(k+1) = l_j(k) + \frac{\hat{\alpha}_j(k)\,(\hat{s}_i(k) - l_j(k))}{\sqrt{2\pi}\,\sigma^3} \exp\left(-\frac{(\hat{s}_i(k) - l_j(k))^2}{2\sigma^2}\right), \quad j = 1, 2, \ldots, p_i, \tag{16}$$

where $\hat{\alpha}_j(k)$ is the step size of the $k$-th iteration. All $l_j$ $(j = 1, 2, \ldots, p_i)$ are obtained by this algorithm.

2.3 Denoising Rules
For each sample $\hat{s}_i(t)$, $t = 1, 2, \ldots, N$, let

$$s'_i(t) = l_{j^*}, \quad j^* = \arg\min_{j \in \{1, 2, \ldots, p_i\}} |\hat{s}_i(t) - l_j|,$$

so that $s'_i(t)$, $t = 1, 2, \ldots, N$, is the digital signal after denoising. Because the first-step blind separation yields $n$ separated signals with noise, this second step is applied to every $\hat{s}_i(t)$, $i \in \{1, 2, \ldots, n\}$. The result is the set of source digital signals $s'_i(t)$, $i = 1, 2, \ldots, n$, $t = 1, 2, \ldots, N$, which completes the denoising. Note that if the $p_i$ values of the digital signals are known before transmission, denoising can be done directly by the above rule.
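The second step (peak rule A1-A3 with its two boundary cases, initial values (15), iteration (16) and the nearest-level rule of Section 2.3) can be sketched as below. This is an illustrative reading, not the authors' code: bins are 0-based, the smoothed histogram `p_hat` is hand-made, and the noise level, `sigma` and `step` are assumed values.

```python
import math
import random

def window(p, c):
    """Sum of the three neighbouring bins centred at 0-based index c."""
    return p[c - 1] + p[c] + p[c + 1]

def find_peaks(p_hat, eps):
    """0-based bin indices accepted as peaks by A1-A3 and the boundary rules."""
    m = len(p_hat)
    if m < 5:
        raise ValueError("need at least 5 bins")
    peaks = []
    for c in range(1, m - 1):
        if max(p_hat[c - 1:c + 2]) <= eps:      # A3: gate on the window
            continue
        w = window(p_hat, c)
        if c == 1:                               # left boundary rule
            is_peak = w > window(p_hat, 2)
        elif c == m - 2:                         # right boundary rule
            is_peak = w > window(p_hat, m - 3)
        else:                                    # A1 and A2
            is_peak = window(p_hat, c - 1) <= w and w > window(p_hat, c + 1)
        if is_peak:
            peaks.append(c)
    return peaks

def initial_levels(peaks, a, delta):
    """Eq. (15): midpoint of each peak bin [a + j*delta, a + (j+1)*delta]."""
    return [a + (2 * j + 1) * delta / 2 for j in peaks]

def refine_levels(samples, levels, sigma, step):
    """Eq. (16): one pass of sample-by-sample gradient ascent."""
    levels = list(levels)
    c = math.sqrt(2.0 * math.pi) * sigma ** 3
    for s in samples:
        for j, l in enumerate(levels):
            levels[j] = l + step * (s - l) / c * math.exp(-(s - l) ** 2 / (2.0 * sigma ** 2))
    return levels

def denoise(samples, levels):
    """Section 2.3: replace each sample by the nearest estimated level."""
    return [min(levels, key=lambda l: abs(s - l)) for s in samples]

# Hand-made smoothed histogram on [-1.5, 1.5] with 12 bins (assumed data).
p_hat = [0.0, 0.05, 0.30, 0.10, 0.02, 0.0, 0.0, 0.03, 0.10, 0.30, 0.10, 0.0]
peaks = find_peaks(p_hat, eps=0.05)
start = initial_levels(peaks, a=-1.5, delta=0.25)

# Binary source at +/-1 with Gaussian noise (assumed test signal).
rng = random.Random(0)
noisy = [rng.choice([-1.0, 1.0]) + rng.gauss(0.0, 0.1) for _ in range(1000)]
levels = refine_levels(noisy, start, sigma=0.2, step=0.001)
print(peaks, start, levels)
```

In this sketch the refinement only needs the samples and a noise scale `sigma`; how $\sigma$ is obtained in practice is not specified in the text, so it is passed in as a parameter.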
3 Simulation Results

To show the performance and efficiency of the algorithm, we carry out simulations with digital images. In the experiment, $s_1(t), s_2(t), s_3(t)$ are three black-and-white binary source images with 128 × 128 pixels. The mixing matrix is a randomly generated matrix,

$$A = \begin{pmatrix} -0.4326 & 0.2877 & 1.1892 \\ -1.6656 & -1.1465 & -0.0376 \\ 0.1253 & 1.1909 & 0.3273 \end{pmatrix},$$

and Gaussian white noise with power 30dB, zero mean and unit variance is added. Using the adaptive blind separation algorithm [6]

$$W_{k+1} = W_k + \eta_k\left[I - \Phi(u_k)\,u_k^{T}\right] W_k, \tag{17}$$

where $\eta_k$ is the step size of the $k$-th iteration and $\Phi$ is a nonlinear function, we obtain the general permutation matrix

$$\begin{pmatrix} -5.6666 & -3.8606 & 5.5871 \\ -7.5638 & 2.7175 & -5.6654 \\ -0.6408 & 8.7514 & 5.2822 \end{pmatrix}$$

by the adaptive learning algorithm (17) in the noisy setting. Then, according to (13), (14), (15), (16) and the denoising rules, the results of blind separation and denoising are shown in the following figures.
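A single step of the update (17) can be sketched as below for small matrices stored as nested lists. The cubic nonlinearity used for $\Phi$ is only an assumption for illustration (the text does not state which nonlinear function was used), and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product for nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def natural_gradient_step(w, x, eta, phi=lambda v: v ** 3):
    """One update W <- W + eta * (I - phi(u) u^T) W with u = W x, as in Eq. (17).

    phi is the elementwise nonlinearity; the cubic default is an assumption.
    """
    n = len(w)
    u = [sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]
    g = [phi(ui) for ui in u]
    # Build I - phi(u) u^T.
    m = [[(1.0 if i == j else 0.0) - g[i] * u[j] for j in range(n)]
         for i in range(n)]
    dw = matmul(m, w)
    return [[w[i][j] + eta * dw[i][j] for j in range(n)] for i in range(n)]

identity = [[1.0, 0.0], [0.0, 1.0]]
w_next = natural_gradient_step(identity, [1.0, 0.0], eta=0.1)
print(w_next)
```

In an actual run the step would be repeated over all mixed samples $x(t)$, typically with a decreasing step size $\eta_k$.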
Fig. 1. Three black and white binary source images
Fig. 2. Mixture of three images
Finally, to assess the performance of the algorithm, we define the signal-to-noise ratio (SNR) as

$$\mathrm{SNR} = 10 \log \frac{\|S\|^2}{\|\hat{S} - S\|^2}, \tag{18}$$

where $\hat{S}$ is the estimate of $S$.
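The measure (18) is straightforward to compute. The sketch below assumes a base-10 logarithm and treats an image as a flat list of pixel values; both are assumptions made for the example.

```python
import math

def snr_db(source, estimate):
    """Eq. (18): 10 log10(||S||^2 / ||S_hat - S||^2) for flat sequences."""
    signal = sum(v * v for v in source)
    error = sum((e - v) ** 2 for v, e in zip(source, estimate))
    if error == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / error)

print(snr_db([1, -1, 1, -1], [1.1, -1, 1, -1]))
```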
Fig. 3. Available separating images with noise
Fig. 4. Three recovered images by the denoising algorithm
According to this definition, the SNRs of the three images in Fig. 3 are 15.2941, 13.6504 and 11.7063, respectively, whereas the SNRs of the three images in Fig. 4 are 76.3964, 67.5954 and 64.0332, respectively, which shows that the denoising algorithm is effective.
4 Conclusions

In this paper, we give a new algorithm that removes noise after adaptive blind separation of digital signals by exploiting their characteristics. First, the separation matrix $W$ is estimated by an existing blind separation algorithm, giving separated signals with noise. Second, the density functions of the separated signals are estimated. Third, the density functions are smoothed. Fourth, the peaks of the density functions are located to obtain the $p_i$ values, $i \in \{1, 2, \ldots, n\}$. Finally, the noise is removed by the rule of Section 2.3. The method requires little computation, tolerates strong noise, and does not need to know the numbers $p_i$, $i \in \{1, 2, \ldots, n\}$, or the corresponding values in advance. When the $p$ values of the digital signals are unknown and noise is present in transmission, the algorithm shows good denoising performance and efficiency after blind separation.
References

1. Hyvarinen, A., Oja, E.: Independent component analysis: Algorithms and applications. Neural Networks, 38(13), (2000) 411-430
2. Stone, J.: Blind source separation using temporal predictability. Neural Computation, No.13, (2001) 1559-1574
3. Bofill, P., Zibulevsky, M.: Underdetermined source separation using sparse representations. Signal Processing, 81, (2001) 2353-2362
4. Xie, S.L., He, Z.S., Fu, Y.L.: A note on Stone's conjecture of blind signal separation. Neural Computation, Vol. 17 Issue 2, (2005) 321-330
5. Zhang, J.L., Xie, S.L., He, Z.S.: Separability theory for blind signal separation. Zidonghua Xuebao/Acta Automatica Sinica, v30, n 3, May (2004) 337-344
6. Amari, S., Cichocki, A., Yang, H.: A new learning algorithm for blind signal separation. In D.S. Touretzky, M.E. Hasselmo (Eds.), Advances in Neural Information Processing Systems, vol.8, (1996) 757-763
7. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), (1995) 1129-1159
8. Pham, D.T.: Mutual information approach to blind separation of stationary sources. IEEE Trans. Inform. Theory, vol. 48, July (2002) 1–12
9. Zhang, J.L., Xie, S.L., He, Z.S.: Separability theory for blind signal separation. Zidonghua Xuebao/Acta Automatica Sinica, v30, n 3, May (2004) 337-344
10. Anand, K., Reddy, V.U.: Maximum likelihood estimation of constellation vectors for blind separation of co-channel BPSK signals and its performance analysis. IEEE Trans. Signal Processing, Vol.45, No.7, (1997) 1736-1741
11. Pajunen, P.: Blind separation of binary sources with less sensors than sources. Proceedings of 1997 International Conference on Neural Networks, Houston, Texas, USA, (1997)
12. Belouchrani, A., Cichocki, A.: An adaptive algorithm for the separation of finite alphabet signals. Proceedings of Workshop on Signal Processing and Applications 2000, Brisbane, Australia, Dec (2000)
13. Li, Y., Cichocki, A., Zhang, L.: Blind separation and extraction of binary sources. Communication and Computer Sciences, Vol.E86-A(3), (2003) 580-590
14. Li, Y., Zhang, L.: New algorithm for blind enhancement of digital images. Control Theory and Applications, 20(4), (2003) 525-528
Performance Evaluation of Directionally Constrained Filterbank ICA on Blind Source Separation of Noisy Observations

Chandra Shekhar Dhir(1,3,@), Hyung-Min Park(∗), and Soo-Young Lee(1,2,3)

1 Department of Biosystems
2 Department of Electrical Engineering and Computer Science
3 Brain Science Research Center
Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea
∗ Language Technologies Institute and Department of Electrical and Computer Engg., Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
Phone: +82-42-869-5351, Fax: +82-42-869-8490
@ [email protected]
Abstract. The separation performance of directionally constrained filterbank ICA is evaluated in the presence of noise with different spectral properties. The stationarity of the mixing channels is exploited to introduce a directional constraint on the adaptive subband separation networks of the filterbank-based blind source separation approach. Directional constraints on the demixing network improve the separation of source signals from noisy convolved mixtures when significant spectral overlap exists between the noise and the convolved mixtures. Observations corrupted with low-frequency noise exhibit only slight improvement in separation performance, as there is less spectral overlap. Initializing and constraining the subband demixing network in accordance with the spatial location of the source signals results in faster convergence and effective permutation correction, irrespective of the nature of the additive noise.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 1133–1142, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Blind source separation (BSS) of independent components from acoustically convolved mixtures of speech signals is motivated by the classical cocktail party problem [1,2]. BSS can be achieved by independent component analysis (ICA), a powerful unsupervised learning algorithm that exploits statistical independence among the constituent signals [2,3,4]. Information-theoretic approaches using ICA for BSS of convolved mixtures can be categorized as full-band time-domain approaches, frequency-domain approaches and filterbank approaches. The oversampled filterbank approach to ICA converges faster, with reduced computational complexity, than time-domain approaches and gives better separation performance than frequency-domain approaches [5]. Figure 1 shows the 2 × 2 network for the oversampled filterbank-based ICA method. At convergence, the subband demixing filter tap with maximum amplitude corresponds to the arrival time difference of the source signals in a noise-free environment [6]. The arrival time difference of a source signal is defined as the relative time it takes to propagate between the two microphones. Using the precedence effect, a theoretical formulation can be established to confirm the relation between the demixing (separation) network and the arrival time difference of the source signals [7,8]. However, in a strongly noisy environment the traditional subband demixing network does not converge to correct solutions. The presence of noise may perturb the adaptation of the demixing filters, resulting in slower convergence and inferior separation performance. A priori information about source locations in a spatially constrained mixing environment provides an additional cue to the ICA-based learning algorithm. The arrival time difference of the source signals, which is closely related to their spatial location, can be estimated by sound localization methods based on binaural processing [7]. The binaural auditory model for sound localization has a strong structural resemblance to the filterbank ICA architecture, as shown in Fig. 2.
Fig. 1. A 2 × 2 network for the oversampled filterbank-based ICA methods
Fig. 2. Structural similarity between binaural auditory model and filterbank ICA
In an attempt to apply binaural processing to convolutive blind source separation, directionally constrained filterbank ICA (DC-FBICA) has been proposed by Dhir et al. [7,8]. DC-FBICA initializes the demixing network in accordance with the spatial location of the source signals and also imposes an additional spatial constraint on the adaptation of the subband demixing filters. For noisy observations, it shows faster convergence and improved separation performance compared with the traditional model of oversampled filterbank ICA. However, this performance improvement was reported for BSS of convolved mixtures corrupted with additive white Gaussian noise [7]. In this paper, the performance of directionally constrained filterbank ICA is studied under different noisy environments, i.e., additive noise with different spectral properties. Section 2 briefly reviews DC-FBICA and highlights the proper choice of directional constraint, without which the final performance can be degraded. Section 3 describes the experimental setup, which includes the recording arrangement, the choice of parameters used in the learning algorithm and the noise signals used. Experimental results using DC-FBICA on noisy real room recordings are compared with existing methods in Section 4, which is followed by conclusions in Section 5.
2 Directionally Constrained Filterbank ICA

An observation is modeled as a linear combination of filtered independent sources

$$x_i(n) = \sum_{j=1}^{N} \sum_{l=0}^{L_m - 1} a_{ij}(l)\, s_j(n - l), \tag{1}$$

where $a_{ij}(l)$ denotes a mixing filter of length $L_m$ between the source $s_j$ and the observation $x_i$. In the DC-FBICA approach, the observations $x_i(n)$ are split into subband signals $x_i(k, n)$ by analysis filters. Here $k$ and $n$ refer to the subband index and the decimated time index, respectively. A feedback architecture is chosen at each ICA network and the constituent subband independent source signals are separated using the entropy maximization algorithm [5,9]

$$u_i(k, n) = \sum_{m=0}^{L_a} w_{ii}(k, m)\, x_i(k, n - m) + \sum_{j=1, j \neq i}^{N} \sum_{m=1}^{L_a} w_{ij}(k, m)\, u_j(k, n - m), \tag{2}$$
where the adaptive filters $w_{ij}(k, m)$ generate the subband outputs $u_i(k, n)$ from $x_i(k, n)$, and $L_a$ is the length of the subband demixing filters. For noisy observations, initialization of the subband demixing filters alone may not ensure their proper convergence [6,7]. An additional directional constraint is therefore imposed on the unsupervised learning algorithm: the subband demixing filter tap $w_{ij}(k, \gamma_j)$ corresponding to the arrival time difference of the source $s_j$ is fixed to an appropriate maximum value, while the remaining filter taps at $m \neq \gamma_j$ are adapted. Here $\gamma_j$ is the nearest integer to the ratio of the estimated arrival time difference $\rho_j$ to the decimation rate $M$. The modified learning algorithm for adapting the subband demixing cross-filters is [5,7,9]

$$\Delta w_{ij}(k, m) \propto \begin{cases} 0, & m = \gamma_j = \mathrm{nint}\left(\frac{\rho_j}{M}\right), \\ -\varphi(u_i(k, n))\, u_j^{*}(k, n - m), & i \neq j,\ m \neq \gamma_j, \end{cases} \tag{3}$$

where $\varphi(\cdot)$ is the score function, defined as

$$\varphi(u_i(k, n)) = -\frac{\partial p(|u_i(k, n)|) / \partial |u_i(k, n)|}{p(|u_i(k, n)|)} \exp(j \cdot \angle u_i(k, n)). \tag{4}$$

The subband demixing networks shown in Fig. 1 are adapted using (3) and (4). The invariance of the interaural time delay (ITD) over the spectral range of the signals and the law of the first wavefront suggest that the arrival time difference must be the same for all subbands [8]. Temporal whitening of the recovered signals can be avoided by forcing the direct demixing filters $w_{ii}(k, m)$ to be scalars [9]. The estimated arrival time differences are also used to initialize the subband demixing filters:

$$w_{ij}^{\mathrm{initial}}(k, m) = \begin{cases} 0, & m \neq \gamma_j, \\ \beta, & m = \gamma_j. \end{cases} \tag{5}$$

Imposing additional directional constraints on the adaptive demixing filters $w_{ij}(k, m)$ requires a proper choice of $\beta$ and a correct estimate of the arrival time difference $\rho_j$. The proper choice of $\beta$ was found empirically by repeating BSS experiments with different constant values of $\beta$. For most cases $\beta = 0.6$ was a proper initialization value [8]. The arrival time difference can be estimated by cross-correlating windowed subband signals [6].
3 Experimental Setup

Recordings of 5 seconds at a 16kHz sampling rate were made in a real office room, with two streams of Korean speech used as source signals. The locations of the speakers and microphones in the room, along with the room dimensions, are shown in Fig. 3. An 8-channel oversampled filterbank is constructed from a prototype filter with 220 taps. The prototype filter is adapted for a decimation factor of $M = 10$ [5,10]. Figure 4 shows the frequency response of the analysis filters of the uniform 16-channel oversampled filter bank. The subband demixing filters $w_{ij}(k, m)$, of length $L_a = 205$, are adapted using (3) and (4) and initialized using (5). $\mathrm{sgn}(u_i(k, n))$ is used as the score function owing to the Laplacian distribution of the speech signals concerned. Blind source separation performance with directional constraints on the subband separation network is compared in terms of the signal-to-interference ratio (SIR). For a 2 × 2 mixing/demixing system, the SIR is defined as the ratio of the signal power to the interference power at the outputs [11],

$$\mathrm{SIR(dB)} = \frac{1}{2} \cdot 10 \log\left(\frac{\langle (u_{1,s_1}(n))^2 \rangle}{\langle (u_{1,s_2}(n))^2 \rangle} \cdot \frac{\langle (u_{2,s_2}(n))^2 \rangle}{\langle (u_{2,s_1}(n))^2 \rangle}\right). \tag{6}$$
Fig. 3. Recordings arrangement for 2 speakers and 2 microphones in normal office room (measurement in centimeters)

Fig. 4. Frequency response of analysis filters of a uniform 8-channel oversampled filterbank (axes: normalized frequency vs. amplitude response in dB)
Here $u_{j,s_i}(n)$ denotes the $j$th output of the cascaded mixing-demixing system when only $s_i(n)$ is active, i.e., all the other sources $s_j(n)$, $j \neq i$, are set to zero.
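For a 2 × 2 system, the SIR in (6) can be computed from the four single-source responses. The sketch below assumes a base-10 logarithm and takes "power" to mean the mean squared magnitude over time; the function names are hypothetical.

```python
import math

def power(x):
    """Mean squared magnitude of a sequence."""
    return sum(abs(v) ** 2 for v in x) / len(x)

def sir_db(u1_s1, u1_s2, u2_s1, u2_s2):
    """Eq. (6): uj_si is output j when only source i is active."""
    ratio = (power(u1_s1) / power(u1_s2)) * (power(u2_s2) / power(u2_s1))
    return 0.5 * 10.0 * math.log10(ratio)

# Each output carries its target 20 dB above the interfering source.
print(sir_db([1, -1], [0.1, -0.1], [0.2, 0.2], [2, 2]))
```

The factor 1/2 makes the result the average, in dB, of the signal-to-interference ratios of the two outputs.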
4 Results and Discussions

In this section, experiments are performed on real-world audio recordings obtained in the office room, where the configuration of recordings and speakers is given in Fig. 3. Since the recordings did not contain significant additive noise, noises with different spectral properties were intentionally added (one at a time) such that the resultant mixture has a 5dB signal-to-noise ratio (SNR). The noises used in the experimental study are as follows:
Fig. 5. Learning curve of DC-FBICA for BSS on noisy audio recordings with effective 5dB SNR (axes: sweeps vs. SIR in dB; curves: no additive noise, white Gaussian noise, car noise, F-16 fighter noise, voice babble as noise, unwanted speech as noise)
Fig. 6. Panels: (a) SIR (axes: sweeps vs. SIR in dB; curves: directionally constrained filterbank ICA, subband demixing filters initialized to estimated arrival time difference, subband demixing filters initialized to zero); (b) $w_{21}(k, m)$: initialized to zero; (c) $w_{ij}(k, m)$: initialized using arrival time difference; (d) $w_{ij}(k, m)$: directionally constrained. Sub-panels in (b)-(d) are labeled $w_{12}(k, m)$ and $w_{21}(k, m)$.
a) Comparison of directionally constrained filterbank based BSS on normal office room recordings corrupted with F-16 fighter noise at effective 5dB SNR. b) In the absence of the directional constraint, the subband demixing filters do not align themselves to the arrival time difference of the source signals, as can be seen in the higher subbands, i.e., k = 4-8 (the top subband demixing filter corresponds to the first subband). c) Proper initialization of the subband demixing filters accelerates the separation; however, it does not ensure proper convergence. The final separation performance is slightly improved, as the converged results are better than in the case when the adaptive parameters are set to zero. d) The directional constraint ensures that the demixing filters adapt in accordance with the static mixing environment. All the demixing filter taps with maximum amplitude are aligned to the arrival time difference of the source signals.
1. Speech as noise: The speech signal is obtained from the TIMIT speech database and is a female voice of 5sec length (equal to the length of the clean observation) at a sampling rate of 16kHz.
2. F-16 fighter noise, car noise and babble noise: These stationary signals are obtained from the NOISEX-92 CD-ROMs. The length of the additive noise signal is 5sec at 16kHz.
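Adding a noise signal so that the mixture reaches a prescribed SNR, such as the 5dB used here, amounts to rescaling the noise before summation. The sketch below shows one way to do this; the function names are hypothetical and the paper does not describe its exact procedure.

```python
import math

def power(x):
    """Mean squared value of a sequence."""
    return sum(v * v for v in x) / len(x)

def scale_noise(signal, noise, snr_db):
    """Rescale noise so that power(signal) / power(scaled noise) matches snr_db."""
    gain = math.sqrt(power(signal) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [gain * v for v in noise]

signal = [1.0, -1.0, 1.0, -1.0]
noise = [2.0, -2.0, 2.0, -2.0]
scaled = scale_noise(signal, noise, snr_db=5.0)
noisy = [s + v for s, v in zip(signal, scaled)]
achieved = 10.0 * math.log10(power(signal) / power(scaled))
print(achieved)
```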
The performance of the directionally constrained filterbank approach to BSS of noisy observations is compared in Fig. 5 for the different added noises. The separation performance for clean observations is also reported. Separation performance is best for clean recordings and depends on the characteristics of the noise signal concerned. Acoustic mixtures corrupted with voice babble or unwanted speech signals show poor separation performance because their spectral overlap with the audio mixture is higher than that of the other noises. Figures 6-9 show the learning curves of the directionally constrained filterbank ICA approach when the additive noise was F-16 fighter noise, voice babble, unwanted
Fig. 7. Comparison of directionally constrained filterbank based BSS on normal office room recordings corrupted with voice babble at effective 5dB SNR (axes: sweeps vs. SIR in dB; curves: directionally constrained filterbank ICA, subband demixing filters initialized to estimated arrival time difference, subband demixing filters initialized to zero)

Fig. 8. Comparison of directionally constrained filterbank based BSS on normal office room recordings corrupted with unwanted speech at effective 5dB SNR (axes: sweeps vs. SIR in dB; curves: directionally constrained filterbank ICA, subband demixing filters initialized to estimated arrival time difference, subband demixing filters initialized to zero)
Fig. 9. Panels: (a) SIR (axes: sweeps vs. SIR in dB; curves: directionally constrained filterbank ICA, subband demixing filters initialized to estimated arrival time difference, subband demixing filters initialized to zero); (b) $w_{21}(k, m)$: initialized to zero; (c) $w_{ij}(k, m)$: initialized using arrival time difference; (d) $w_{ij}(k, m)$: directionally constrained. Sub-panels in (b)-(d) are labeled $w_{12}(k, m)$ and $w_{21}(k, m)$.
a) Comparison of directionally constrained filterbank based BSS on normal office room recordings corrupted with car noise at effective 5dB SNR. b) The subband demixing filters are not affected by the presence of noise, as there is very little spectral overlap between the car noise and the audio recordings. Most of the subband demixing filters converge in accordance with the spatial constraints (the top subband demixing filter corresponds to the first subband). c) Initialization of the subband demixing filters accelerates the separation; in addition, temporal alignment of the subband demixing filters is observed at convergence. d) The directional constraint ensures that the demixing filters adapt in accordance with the mixing environment. This further helps the learning algorithm and provides a slight improvement in the final SIR.
speech and car noise, respectively. For comparison, a simulation study without directional constraints is also presented for the following two cases:
– The subband demixing filters $w_{ij}(k, m)$ are initialized using (5)
– $w_{ij}(k, m) = 0$, $\forall k, m$
The adaptation of the subband demixing filters is perturbed by noises whose spectral content overlaps with that of the acoustically convolved mixtures. Figure 6(b) shows the effect of F-16 fighter noise on the adaptation of the subband demixing filters in the absence of initialization and directional constraint. Initializing the subband demixing filters according to (5) gives a slight performance improvement along with faster convergence; however, the converged separation network is not in accordance with the temporally invariant source locations, as can be seen from Fig. 6(c). Figure 6(d) shows that the subband demixing filter taps with maximum amplitude are aligned to the arrival time difference of the source signals, which was not seen in the previous cases. When acoustic signals are corrupted with low-frequency noises such as car noise, there is very little overlap between the spectral content of the acoustic mixture and that of the noise. By inspecting the converged demixing filters in Fig. 9(b), we were able to confirm that the subband demixing filter taps corresponding to the estimated arrival time difference were already aligned and had maximum amplitude without the additional directional constraint. Therefore, the performance of the traditional filterbank approach is not severely degraded in the presence of additive low-frequency noise. Figures 9(c) and 9(d) show the converged subband demixing filters when initialization and directional constraints are introduced, respectively. A slight performance improvement along with faster convergence is observed in Fig. 9(a) when the subband demixing filters are directionally constrained.
5 Conclusions

The performance of the directionally constrained filterbank approach to BSS of convolved mixtures is evaluated under different noisy conditions. The DC-FBICA approach gives better separation performance for noisy observations when there is substantial spectral overlap between the noise and the real recorded observations. However, when there is very little spectral overlap between the noise signal (low-frequency noise) and the real room recordings, only a slight improvement is observed in the final separation performance. Imposing directional constraints along with initialization of the subband demixing filters results in faster convergence, and this property is independent of the characteristics of the noise signal. The directional constraint can also be exploited for solving the permutation problem.
Acknowledgment

This research was supported as a Brain Neuroinformatics Research Program by the Korean Ministry of Commerce, Industry and Energy.
References

1. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, Inc. (2001)
2. Lee, T. W.: Independent component analysis - Theory and applications. Boston: Kluwer Academic Publisher. (1998)
3. Comon, P.: Independent component analysis, a new concept?. Signal Processing 36 (1994) 287–314
4. Haykin, S. (Ed.): Unsupervised Adaptive Filtering. John Wiley & Sons, Inc. (2000)
5. Park, H.-M., Dhir, C. S., Oh, S.-H., and Lee, S.-Y.: A filter bank approach to independent component analysis for convolved mixtures. Neurocomputing (in press).
6. Park, H.-M., Dhir, C. S., Oh, D.-K. and Lee, S.-Y.: Filterbank-based BSS with estimated sound direction. Proc. IEEE Int. Symp. on Circuits and Systems (2005) 5874-5877
7. Dhir, C. S., Park, H.-M., Oh, D.-K. and Lee, S.-Y.: Blind source separation of noisy observations using directionally constrained Filterbank ICA. Int. conf. on Neural Information Processing (2005) 544–549
8. Dhir, C. S.: Performance improvement of filterbank ICA with sound localization. M.S thesis, KAIST, Daejeon (2006)
9. Torkkola, K.: Blind separation of convolved sources based on information maximization. Proc. of IEEE Int. workshop on NNSP (1996) 423–432
10. Weiß, S.: On Adaptive Filtering on Oversampled Subbands. Ph.D. thesis, Signal Processing Division, Univ. Strathclyde, Glasgow, (May 1998)
11. Schobben, D., Torkkola, K. and Smaragdis, P.: Evaluation of blind signal separation methods. Proc. Int. Conf. on ICA and BSS, (1999) 261–266
Author Index
Acar, Levent II-543 Ahn, ChangWook III-807 Ahn, Hyunchul III-420 Ahn, Tae-Chon III-993, III-1079 Aihara, Kazuyuki I-39 Akhter, Shamim II-430 Akimitsu, Toshio I-49 Akyol, Derya Eren III-553 Alahakoon, Damminda II-814 Aliev, Rafik II-860 Aliev, Rashad II-860 Alsina, Pablo Javier II-159 Alvarez, Mauricio I-747 Amarasiri, Rasika II-814 Amin, M. Ashraful II-430 An, GaoYun II-217 Araujo, Carlos A. Paz de III-1160 Arie, Hiroaki I-387 Arik, Sabri I-570 Arun, JB. I-505 Assaad, Mohammad II-831 Assun¸ca ˜o, Edvaldo III-118, III-1131 Atiya, Amir F. II-116, II-764 Bae, Suk Joo II-746 Bai, Guo-Qiang I-554 Bai, Hongliang II-448 Bai, Yaohui I-900 Bao, Zheng I-811 Barman, P.C. II-703 Barrile, Vincenzo II-909 Bascle, B. II-294 Basu, S.K. III-781, III-946 Bayhan, G. Mirac III-553 Becerikli, Yasar III-1105 Benton, Ryan G. II-604 Bernier, O. II-294 Bertoni, Fabiana Cristina III-826 Be¸sdok, Erkan II-632 Bhatia, A.K. III-781, III-946 Bhaumik, Basabi I-82 Bialowas, Jacek I-90 Bittencourt, Valnaide G. III-21 Blachnik, Marcin III-1028
Bo, Liefeng II-404 Boné, Romuald II-831 Bonner, Anthony J. I-280 Boo, Chang-Jin III-938 Botelho, Silvia III-40 Bouchachia, Abdelhamid I-137 Bouzerdoum, Abdesselam II-207, II-260 Bresolin, Adriano de A. II-159 Brown, Warick II-841 Brückner, Bernd III-251 Burken, John III-684 Cacciola, Matteo II-353, II-909 Cai, Sheng Zhen II-379 Cai, Wei II-713 Cai, Xiongcai II-324 Cai, Xun II-661 Cai, Zixing III-711 Canuto, Anne M.P. I-708 Cao, Fei III-596 Cao, Heng III-702 Cao, Zhiguo III-110 Cao, Zhitong I-642 Cardim, Rodrigo III-1131 Cardot, Hubert II-831 Carneiro, Raphael V. I-427 Carvalho, Aparecido A. III-118 Carvalho, Marcio II-1061 Castillo, M. Dolores del I-237 Chan, Lai-Wan III-400 Chang, Bao Rong II-925, III-478 Chang, Chuan-Wei II-850 Chang, Chuan-Yu II-244 Chang, Hong-Hao II-244 Chao, Ruey-Ming III-312 Chau, Kwok-wing II-1101 Chau, Rowena III-295 Che, Yanqiu II-1071 Chen, Angela Hsiang-Ling II-1183 Chen, Bo I-811 Chen, Bo-Wei I-65 Chen, Ching-Horng I-65 Chen, Chung-Ming II-1079 Chen, Guangju III-518
Chen, Haishan II-236 Chen, Hung-Ching(Justin) III-360 Chen, JunYing II-1148 Chen, Kai-Ju II-60 Chen, Luonan II-952 Chen, Ming I-773 Chen, Ruey-Maw II-1108 Chen, Shi-Huang II-925 Chen, Shifu III-754 Chen, Shu-Heng III-450 Chen, SongCan II-369 Chen, Tai-Liang III-469 Chen, Tianping I-379 Chen, Toly III-581, III-974 Chen, Wanhai II-596 Chen, Wenying III-920 Chen, X.K. I-203 Chen, Xi-Jun III-721 Chen, Xin II-1051 Chen, Xucan II-661 Chen, Xue-wen II-140 Chen, Yi-Wei II-850 Chen, Yuehui III-137, III-209 Chen, Yun Wen II-314 Chen, Zhaoqian III-754 Cheng, Ching-Hsue III-469 Cheng, Jian I-892 Chiang, Chen-Han III-469 Chikagawa, Takeshi II-420 Chiou, Hou-Kai III-605 Cho, Sehyeong III-234 Cho, Sung-Bae III-155, III-892 Cho, Sungzoon I-837, II-21 Cho, Tae-jun III-731 Choi, Gyunghyun II-746 Choi, Hwan-Soo II-964 Choi, Jinhyuk I-437 Choi, Jun Rim III-1206 Chow, Tommy W.S. II-679 Chu, Chee-hung Henry II-604 Chu, Fulei III-920 Chu, Ming-Hui II-850 Chuang, Cheng-Long II-1079 Chun, Myung-Geun II-107 Chung, T.K. I-324 Churan, Jan I-1048 Chyu, Chiuh-Cheng II-1183 Cichocki, Andrzej I-1038, III-92 Çivicioğlu, Pinar II-632 Cottrell, M. II-40
Costa, Jose A.F. III-21 Covacic, Márcio R. III-1131 Cox, Pedro Henrique III-1113 Cui, Gang III-817 D’Amico, Sebastiano II-909 da Silva, Ivan Nunes III-826 Daejeon, Technology II-703 Dai, Hongwei II-1071 Dai, Ruwei II-31 de Carvalho, Aparecido Augusto III-1113, III-1131 de Carvalho, Francisco de A.T. II-50, II-934, III-1012 de Souto, Marcilio C.P. III-21 de Souza, Danilo L. II-729 de Souza, Renata M.C.R. II-50 Deaecto, Grace S. III-118 Demirkol, Askin II-543 Deng, Yanfang I-660 Dhir, Chandra Shekhar I-1133 Dias, Stiven S. I-427 Ding, Feng II-721 Ding, Jundi II-369 Ding, Kang III-535 Ding, Steven X. I-495, I-737 Don, An-Jin II-60 Dong, Huailin II-236 Du, Ding III-702 Du, Ji-Xiang II-80 Duan, Gaoyan II-1090 Duan, Zhuohua III-711 Duch, Wlodzislaw III-1028 Dutra, Thiago I-708 Eom, Jae-Hong III-30 Er, Meng Joo III-1002 Ertunc, H. Metin III-508 Fan, Ding III-572 Fan, Hou-bin III-322 Fan, Ling II-236 Fan, Shu II-952 Fan, Xuqian I-1098 Fan, Zhi-Gang II-187 Fang, Rui III-692 Fardin, Dijalma Jr. I-427 Farias, Uender C. III-118 Fayed, Hatem II-116 Fazlollahi, Bijan II-860
Fei, Shumin III-664 Feng, Boqin III-498 Féraud, Raphael II-693 Fernández-Redondo, Mercedes I-477, I-616, I-688 Fischer, T. II-40 Freeman, Walter J. I-11 Fu, Chaojin I-591 Fu, Hui III-268 Fu, Zhumu III-664 Fujimoto, Katsuhito II-88 Fujita, Kazuhisa I-255, I-263 Funaya, Hiroyuki III-746 Fung, Chun Che II-841 Furukawa, Tetsuo I-935, I-943, I-950, I-958, II-278 Fyfe, Colin I-361 Gaino, Ruberlei III-118 Gao, Cao III-322 Gao, Cunchen I-598 Gao, Fang III-817 Gao, Fei III-772 Gao, Jianpo II-270 Gao, Shao-xia III-322 Gao, Shihai II-515 Gao, Ying II-1174 Garcez, Artur S. d'Avila I-427 Garg, Akhil R. I-82 Gedeon, Tamás II-841 Geweniger, T. II-40 Gomes, Gecynalda Soares S. II-737 Gong, Jianhua II-890 Gong, Kefei I-978 Gong, Qin I-272 Greco, Antonino II-353, II-909 Gruber, Peter I-1048 Grzyb, Beata I-90 Gu, Ren-Min II-253 Gu, Weikang II-1128 Gu, Wenjin III-702 Guan, Di II-596 Guan, Jing-Hua II-1164 Guan, Tian I-272 Guerreiro, Ana M.G. III-1160 Gui, Chao II-506 Guirimov, Babek II-860 Guo, Bing III-1189 Guo, G. I-203 Guo, Jun I-827
Guo, Meng-shu II-1032 Guo, Ping II-286 Guo, Qiang II-596 Guo, Shen-Bo II-80 Guo, X.C. II-1138 Guo, Yi-nan I-892
Halgamuge, Saman K. I-915, III-260, III-526 Hammer, B. II-40 Han, Jialing III-225 Han, Seung-Soo III-234 Hao, Jingbo III-866 Hao, Zhifeng I-773 Hashem, Sherif II-116 Hashimoto, Hideki III-856, III-874 Hattori, Motonobu I-117 He, Bo III-1055 He, Pilian I-978 He, Zhaoshui I-1038 He, Zhenya I-856 Henao, Ricardo I-747 Herbarth, Olf III-278 Hernández-Espinosa, Carlos I-477, I-616, I-688 Hirayama, Daisuke I-255 Hirooka, Seiichi I-263 Hirose, Akira I-49 Ho, Kevin I-521 Hong, Chuleui II-1118 Hong, Jin-Hyuk III-155, III-892 Hong, Jin Keun III-1122 Honma, Shun’ichi II-420 Horiike, Fumiaki II-420 Horio, Keiichi I-907, I-968, III-1168 Hosaka, Ryosuke I-39 Hoshino, Masaharu I-907 Hosino, Tikara I-407 Hotta, Yoshinobu II-88 Hou, Gang III-225 Hou, Xiaodi I-127 Hou, Yuexian I-978 Hou, Zeng-Guang III-721 Hsieh, Kun-Lin III-48 Hsieh, Sheng-Ta I-1078 Hsu, Arthur I-915, III-260, III-526 Hu, Changhua I-718 Hu, Meng I-11 Hu, Wei II-448 Hu, Xiaolin II-994
Hu, Yun-an II-1022 Huang, D. II-679 Huang, De-Shuang II-80 Huang, Jau-Chi I-183, I-1012 Huang, Jie II-481 Huang, K. I-203 Huang, Kaizhu II-88 Huang, Kou-Yuan II-60 Huang, Lijuan I-1022 Huang, Liyu II-533, III-58 Huang, Qinghua I-1058 Huang, Shian-Chang III-390 Huang, Tingwen III-1070 Huang, Wei III-380 Huang, Yueh-Min II-1108 Huang, Zhi-Kai II-80 Hüper, Knut I-1068 Hwang, Hyung-Soo III-993 Hwang, Kyu-Baek I-670 Hwang, Soochan III-341 Ikeda, Kazushi III-746 Ikeguchi, Tohru I-39 In, Myung Ho I-1088 Inuso, G. III-82 Iqbal, Nadeem II-703 Ishii, Kazuo I-935 Ishizaki, Shun I-117 Islam, Md. Atiqul II-430 Itoh, Katsuyoshi III-563 Izworski, Andrzej I-211 Jang, Kyung-Won III-1079 Jeng, Cheng-Chang III-48 Jeong, Sungmoon II-466 Ji, Jian II-387, II-394 Jia, Yinshan I-819 Jia, Yunde III-268 Jian, Jigui III-674 Jiang, Huilan II-124 Jiang, Jiayan II-278 Jiang, L. III-589 Jiang, Minghu III-285 Jiang, Ning I-495 Jiang, Qi II-499 Jiang, Weijin II-870 Jianhong, Chen III-572 Jianjun, Li III-572 Jiao, Licheng II-404 Jin, Dongming III-1199
Jin, TaeSeok III-856, III-874 Jin, Wuyin I-59 Jordaan, Jaco II-974 Ju, Fang II-499 Juang, Jih-Gau III-605, III-654 Jun, Sung-Hae I-864 Jung, Chai-Yeoung III-789 Jung, Soonyoung III-331 Kabe, Takahiro III-1141 Kacalak, Wojciech I-298 Kalra, Prem K. I-608 Kambara, Takeshi I-255 Kamimura, Ryotaro I-626, I-925, II-897 Kang, Min-Jae II-1014, III-938 Kang, Pilsung I-837 Kang, Rae-Goo III-789 Kang, Yuan II-850 Kao, Yonggui I-598 Karabıyık, Ali III-1095 Karadeniz, Tayfun II-824 Karim, M.A. III-526 Kashimori, Yoshiki I-255, I-263 Katsumata, Naoto II-420 Kawana, Akio I-547 Keck, Ingo R. I-1048 Khan, Farrukh Aslam II-1014 Khan, Usman II-651 Kil, Rhee Man I-755 Kim, Byoung-Hee I-670 Kim, Byung-Joo III-192 Kim, C.H. III-964 Kim, Choong-Myung I-290 Kim, Dong Hwee I-290 Kim, Eun Ju II-489, III-350 Kim, H.J. III-964 Kim, Harksoo II-150 Kim, Ho-Chan II-1014, III-938 Kim, Ho-Joon II-177 Kim, HyungJun II-917 Kim, Il Kon III-192 Kim, Kwang-Baek II-167 Kim, Kyoung-jae III-420 Kim, Myung Won II-489, III-350, III-797 Kim, S.S. III-964 Kim, Saejoon I-371 Kim, SangJoo III-856 Kim, Sungcheol II-1118 Kim, Sungshin II-167
Kim, Tae-Seong I-1088 Kim, Yonghwan III-341 Kinane, Andrew III-1178 King, Irwin I-342 Kiong, Loo Chu II-70 Kitajima, Ryozo I-626 Koh, Stephen C.L. III-145 Kong, Feng III-1038, III-1046 Kong, Hui I-447 Kong, Jun III-225 Koo, Imhoi I-755 Koyama, Ryohei I-698 Kraipeerapun, Pawalai II-841 Kropotov, Dmitry I-727 Kui-Dai III-217 Kung, Sun-Yuan I-314 Kuo, Yen-Ting I-1012 Kurashige, Hiroki I-19 Kurogi, Shuichi I-698, III-563 Kuwahara, Daisuke III-563 Kwok, James T. III-11 Kwon, Man-Jun II-107 Kwon, YouAn III-331 La Foresta, F. III-82 Lai, Edmund Ming-Kit I-155, III-370 Lai, Hung-Lin II-60 Lai, Kin Keung III-380, III-928 Lai, Weng Kin II-776 Lam, James I-332 Lang, Elmar W. I-1048 Larkin, Daniel III-1178 Lassez, Jean-Louis II-824 Lee, Dae-Jong II-107 Lee, Geehyuk I-437 Lee, Hyoung-joo II-21 Lee, Hyunjung II-150 Lee, Juho II-177 Lee, Kan-Yuan I-1078 Lee, Kichun III-420 Lee, Kin-Hong II-796 Lee, Kwan-Houng III-731 Lee, Minho II-466 Lee, Sang-joon II-1014 Lee, SeungGwan I-487 Lee, Soo Yeol I-1088 Lee, Soo-Young I-1133, II-703 Lee, Sungyoung I-1088 Lee, Young-Koo I-1088 Leen, Gayle I-361
Lemaire, Vincent II-294, II-693 Leng, Yan II-227 Leung, Chi-sing I-521 Leung, Kwong-Sak II-796 Li, Bo II-458 Li, Bo Yu II-314 Li, Chao-feng II-132 Li, Cheng Hua III-302 Li, Donghai II-515 Li, F. III-589 Li, G. II-586 Li, Guang I-11 Li, H. I-203 Li, Hongwei I-1107 Li, Jianyu I-987 Li, Jiaojie I-11 Li, Jie II-523 Li, Jing II-1022 Li, Kun-cheng I-175 Li, Li II-99 Li, Min II-713 Li, Ping I-737 Li, Qian-Mu III-545 Li, Rui I-1107 Li, Shutao III-11 Li, Sikun II-661 Li, Weihua III-535 Li, Wenye II-796 Li, Xiaobin II-474 Li, Xiaohe I-513 Li, Xiaoli III-66 Li, Xiaolu I-1125 Li, Xiao Ping I-782 Li, Yangmin II-1051 Li, Zhaohui II-1174 Li, Zheng II-596 Li, Zhengxue I-562 Li, Zhishu III-1189 Li, Ziling I-660 Liang, Pei-Ji I-30 Liang, Xun III-410 Liang, Y.C. II-1138 Liang, Yun-Chia II-1183 Liao, Chen III-11 Liao, Guanglan I-1030 Liao, Ling-Zhi I-193, II-343 Liao, Shasha III-285 Liao, Wudai I-529 Liao, Xiaoxin I-529 Lim, Chee Peng II-776
Lim, HeuiSeok III-331 Lim, Heuiseok I-247 Lim, Sungsoo III-892 Lim, Sungwoo II-651 Lin, Chun-Ling I-1078 Lin, Chun-Nan III-48 Lin, Frank I-765 Lin, L. II-586 Lin, Lili II-1128, III-763 Lin, Lizong III-702 Lin, Pan II-379 Lin, Xiaohong II-870 Lin, Y. I-203 Lin, Yao III-498 Ling, Ping I-801 Liou, Cheng-Yuan I-183, I-1012 Liu, Ben-yu II-942 Liu, Bingjie I-718 Liu, Chan-Cheng I-1078 Liu, Changping II-448 Liu, Chunbo II-880 Liu, Da-You II-1164 Liu, Fan-Xin III-461 Liu, Fan-Yong III-461 Liu, Fang II-880 Liu, Feng-Yu III-545 Liu, H. III-589 Liu, Hanxing III-209 Liu, Hongwei I-811, III-817 Liu, Hongyan III-1038, III-1046 Liu, Huailiang II-1174 Liu, Jing II-1042, II-1156 Liu, Li II-880 Liu, Qihe I-1098 Liu, Qingshan II-1004 Liu, Shiyuan I-1030 Liu, Si-Pei II-1164 Liu, Tangsheng II-124 Liu, Wei II-481 Liu, Wen-Kai III-654 Liu, Xiabi III-268 Liu, Xue I-30 Liu, Yanheng III-201 Liu, Yunfeng III-596 Liu, Zhi-Qiang II-671 Lo, C.F. I-324 Lo, Shih-Tang II-1108 Long, Chong II-88 Long, Fei II-236, III-664 Long, Zhi-ying I-175
Loy, Chen Change II-776 Lu, Bao-Liang II-187 Lu, Chiu-Ping I-65 Lu, Huaxiang III-488 Lu, Jie I-175 Lu, Yinghua III-225 Lu, Li I-1030 Lu, Wenlian I-379 Lu, Yinghua III-285 Ludermir, Teresa B. II-737, II-934, II-1061, III-884 Lunga, Dalton III-440 Luo, Dijun II-1 Luo, Jianguo I-900 Luo, Si-Wei I-193, II-343 Luo, Siwei I-987, III-692 Luo, Yinghong I-1117 Lursinsap, Chidchanok I-765 Ma, Aidong III-692 Ma, RuNing II-369 Ma, Run-Nian I-554 Ma, Weimin III-1022 Ma, Xiaoyan III-488 Ma, Xin II-499 Ma, Yunfeng III-1055 Madabhushi, Anant III-165 Magdon-Ismail, Malik III-360 Maia, André Luis S. II-934 Majewski, Maciej I-298 Mak, Man-Wai I-314 Mammone, N. III-82 Man, Kim-Fung II-568 Mani, V. III-964 Mao, Chengxiong II-952 Mao, Ye III-498 Martín-Merino, Manuel I-995 Marwala, Tshilidzi III-430, III-440, III-684, III-1087 Mata, Wilson da II-729 Matsuyama, Yasuo II-420 Meng, Fankun II-515 Meng, Ling-Fu I-65 Meng, Qing-lei II-458 Miao, Dong III-596 Miao, Sheng II-942 Miki, Tsutomu I-352 Mimata, Mitsuru III-563 Ming, Qinghe I-598 Minku, Fernanda L. III-884
Minohara, Takashi II-786 Mishra, Deepak I-608 Mizushima, Fuminori I-228 Mogi, Ken I-147 Molter, Colin I-1 Moon, Jaekyoung II-466 Morabito, Francesco Carlo II-353, II-909, III-82 Morisawa, Hidetaka I-255 Morris, Quaid I-280 Mu, Shaomin I-634, III-184 Mukkamala, Srinivas II-824 Murashima, Sadayuki I-537 Mursalin, Tamnun E. II-430 Nakagawa, Masahiro I-397 Nakajima, Shinichi I-650 Nakamura, Tomohiro II-420 Nam, KiChun I-247, I-290, III-331 Namikawa, Jun I-387 Naoi, Satoshi II-88 Navet, Nicolas III-450 Neto, Adrião Duarte D. II-159, II-729 Ng, G.S. III-145 Ng, S.C. III-165 Nguyen, Ha-Nam I-792, III-1 Nishi, Tetsuo I-827 Nishida, Shuhei I-935 Nishida, Takeshi I-698, III-563 Nishiyama, Yu I-417 Niu, Yanmin II-197 O’Connor, Noel III-1178 Ogata, Tetsuya I-387 Oh, Kyung-Whan I-864 Oh, Sanghoun III-807 Oh, Sung-Kwun III-1079 Ohashi, Fuminori II-420 Ohn, Syng-Yup I-792, III-1 Okabe, Yoichi I-49 Oki, Nobuo III-1131 Oliveira, Hallysson I-427 Ong, Chong Jin I-782 Orman, Zeynep I-570 Oshime, Tetsunari I-1004 Pan, Li II-99 Panchal, Rinku III-127 Park, Changsu I-247 Park, Dong-Chul II-439, II-641, II-964 Park, Ho-Sung III-1079
Park, Hyung-Min I-1133 Park, KiNam III-331 Park, Kyeongmo II-1118 Park, Soon Choel III-302 Parrillo, Francesco II-909 Patel, Pretesh B. III-430 Pei, Wenjiang I-856 Pei, Zheng I-882 Peng, Bo III-110 Peng, Hong I-882 Peng, Hui I-457 Peng, Lizhi III-209 Peng, Wendong II-622 Peng, Xiang I-342 Peng, Yunhui III-596 Phung, S.L. II-207 Pi, Daoying I-495 Piao, Cheng-Ri III-234 Poh, Chen Li II-70 Ptashko, Nikita I-727 Puntonet, Carlos G. I-1048 Qian, Jian-sheng I-892 Qiang, Xiaohu I-1117 Qiao, X.Y. II-586 Qin, Sheng-Feng II-651 Qin, Zheng II-1148 Qiu, Fang-Peng II-880 Quan, Zhong-Hua II-80 Quek, Chai I-155, III-145, III-370 Raicharoen, Thanapant I-765 Rajapakse, Jagath C. II-361, III-102, III-983 Ramakrishnan, A.G. II-361 Ramakrishna, R.S. III-807 Rao, M.V.C. II-70 Rasheed, Tahir I-1088 Ren, Guang III-616 Ren, Guoqiao I-847 Ren, Mingming III-1055 Roeder, Stefan III-278 Roh, Seok-Beom III-993 Rolle-Kampczyk, Ulrike III-278 Román, Jesus I-995 Rosen, Alan I-105 Rosen, David B. I-105 Ruan, QiuQi II-217 Rui, Zhiyuan I-59 Ryu, Joung Woo II-489, III-797
Saad, David II-754 Saeki, Takashi I-352 Sahin, Suhap III-1105 Saito, Toshimichi I-1004, III-738, III-1141 Sakai, Ko I-219 Sakai, Yutaka I-19 Salihoglu, Utku I-1 Samura, Toshikazu I-117 Saranathan, Manojkumar II-361 Sato, Naoyuki I-1 Savran, Aydoğan III-1095 Schleif, F.-M. II-40 Senf, Alexander II-140 Seo, Jungyun II-150 Serrano, J. Ignacio I-237 Shao, Bo I-642 Shao, Hongmei I-562 Shao, Zhuangfeng I-773 Shen, Hao I-1068 Shen, Ji-hong I-467 Shen, Kai Quan I-782 Shen, Lincheng III-900 Shen, Yan III-1189 Shen, Yanjun III-674 Shen, Z.W. I-203 Shenouda, Emad Andrews I-280 Shi, Chaojian II-334 Shi, Tielin III-535 Shieh, Grace S. II-1079 Shu, Zhan I-332 Shukla, Amit I-608 Shyr, Wen-Jye III-954 Silva, Fabio C.D. II-50 Silva, Joyce Q. II-50 Silveira, Patricia III-40 Simas, Gisele III-40 Singare, Sekou II-533, III-58 Sinha, Neelam II-361 Smith, A.J.R. III-526 Smith-Miles, Kate II-814, III-295 Smola, Alexander J. I-1068 Soares, Fola III-684 Son, YoungDae III-874 Song, Baoquan II-612 Song, Tao II-890 Song, Wang-cheol II-1014 Song, Zhihuan I-737 Souto, Marcilo C.P. de I-708 Souza, Alberto F. De I-427
Sowmya, Arcot II-324 Su, Jianbo II-622 Sugano, Shigeki I-387 Sui, Jianghua III-616 Suk, Jung-Hee III-1206 Sum, John I-521 Sun, Baolin II-506 Sun, Jiancheng I-900 Sun, Jun II-88, II-1042, II-1156 Sun, Lei II-304 Sun, Ming I-467, II-1032 Sun, Tsung-Ying I-1078 Sun, Wei II-984 Sun, Zonghai I-874 Suzuki, Satoshi III-738 Tadeusiewicz, Ryszard I-211 Takahashi, Norikazu I-827 Takamatsu, Shingo I-650 Tamukoh, Hakaru III-1168 Tan, Beihai I-1125 Tan, Min III-721 Tan, T.Z. III-145 Tan, Zheng II-713 Tanabe, Fumiko I-147 Tanaka, Shinya I-698 Tanaka, Takahiro I-968 Tang, Zheng II-1071 Tani, Jun I-387 Tao, Caixia I-1117 Teddy, Sintiani Dewi I-155, III-370 Teixeira, Marcelo C.M. III-118, III-1131 Terchi, Abdelaziz II-651 Tettey, Thando III-1087 Theis, Fabian J. I-1048 Tian, Daxin III-201 Tian, Jing III-900 Tian, Jun II-721 Tian, Mei I-193, II-343 Tian, Shengfeng I-634, III-184 Tian, Zheng II-387, II-394, II-474 Tivive, Fok Hing Chi II-260 Tokunaga, Kazuhiro I-943, I-958 Tolambiya, Arvind I-608 Tomisaki, Hiroaki III-563 Tong, Hengqing I-457, I-660, III-772 Tong, Xiaofeng II-448 Torikai, Hiroyuki I-1004, III-1141 Torres-Huitzil, César III-1150
Torres-Sospedra, Joaquín I-477, I-616, I-688 Toyoshima, Takashi I-228 Tran, Chung Nguyen II-439, II-964 Tsai, Hsiu Fen II-925, III-478 Tsukada, Minoru I-72 Tsuzuki, Shinya I-547 Tu, Kai-Ti III-654 Tu, Yung-Chin II-244 Ukil, Abhisek II-578, II-974
Vajpai, Jayashri I-505 Vasiliev, Oleg I-727 Verma, Brijesh III-127 Versaci, Mario II-353 Vetrov, Dmitry I-727 Vialatte, François B. III-92 Villmann, Th. II-40 Wagatsuma, Nobuhiko I-219 Wan, Guojin II-560 Wang, Bin II-334 Wang, Bo II-369 Wang, C.Y. II-1138 Wang, Chi-Shun III-312 Wang, Chunheng II-31 Wang, Cong III-165 Wang, Daojun II-890 Wang, Dianhui III-1189 Wang, Dingwei III-836 Wang, Dongyun I-529 Wang, Fasong I-1107 Wang, Feng II-515 Wang, Gang III-1063 Wang, Guang-Li I-30 Wang, Haiqing I-495, I-737 Wang, Hongfeng III-836 Wang, Hsing-Wen III-390 Wang, Huiyan III-176 Wang, Huiyuan II-227 Wang, J.Z. III-589 Wang, Jian III-201 Wang, Jian-Gang I-447 Wang, Jie III-176 Wang, Jing-Xin III-217 Wang, Jingmin I-847 Wang, Jun II-994, II-1004 Wang, Keyun III-920 Wang, Liang III-692
Wang, Liansheng II-661 Wang, Limin I-680 Wang, Lin III-285 Wang, Ling II-404 Wang, Mengbin II-124 Wang, Peng III-1199 Wang, R.C. III-589 Wang, Rubin I-306 Wang, Shi-tong II-132 Wang, Shouyang III-380, III-928 Wang, Tao II-304, II-412, II-448 Wang, Weirong II-533, III-58 Wang, Xiaodong III-910 Wang, Xin III-636 Wang, Xuchu II-197 Wang, Yan III-137 Wang, Yea-Ping II-850 Wang, Yi II-671 Wang, Yujian II-270 Wang, Yumei I-819 Wang, Zhe I-801 Wang, Zengfeng II-227 Wang, Zheng-you II-132 Wang, Zhengzhi II-612 Wang, Zhi-Ying III-217 Wang, Zhongsheng I-591 Watanabe, Kazuho I-407 Watanabe, Osamu I-165, I-547 Watanabe, Sumio I-407, I-417, I-650 Wei, Pengcheng III-243 Wei, Yaobing I-59 Wen, Wen I-773 Weng, ZuMao II-379 Wilder-Smith, Einar P.V. I-782 Wong, K.Y. Michael II-754 Wong, Kok Wai II-841 Woo, Dong-Min II-641 Woodley, Robert S. II-543 Wright, David II-651 Wu, C.G. II-1138 Wu, Jianhua II-560 Wu, Lu I-598 Wu, M. III-589 Wu, R.H. I-203 Wu, Wei I-562 Wu, Xia I-175 Wu, Xiaojuan II-227 Wu, Xiaopei III-74 Wu, Xiu-Ling II-523 Wu, Yan II-253
Wu, Yiqiang II-560 Wu, Yunfeng III-165 Wu, Zhenyang II-270 Xi, Lixia II-1090 Xia, Chang-Liang III-626, III-645 Xiao, Baihua II-31 Xiao, Mei-ling II-942 Xie, Shengli I-1038, II-806 Xiong, Li I-457 Xiong, Rong II-1 Xiu, Jie III-626, III-645 Xu, Anbang II-286 Xu, Lei II-31 Xu, Man-Wu III-545 Xu, Qing II-304 Xu, Qinzhen I-856 Xu, Wenbo II-1042, II-1156 Xu, Yao-qun I-467, II-1032 Xu, Yuetong I-642 Xu, Yulin I-529 Xuan, Jianping I-1030 Yamaguchi, Nobuhiko II-11 Yamaguchi, Yoko I-1 Yamakawa, Takeshi I-907, I-968, III-1168 Yamazaki, Yoshiyuki I-72 Yan, Changfeng I-59 Yan, Shaoze III-920 Yang, Bo III-137, III-209 Yang, Bojun II-1090 Yang, Chao III-1055 Yang, Chenguang II-984 Yang, Degang III-243 Yang, Huaqian III-243 Yang, Hui III-636 Yang, Hui-zhong II-721 Yang, Hyun-Seung II-177 Yang, I-Ching III-48 Yang, Jian III-410 Yang, Jie I-1058 Yang, Jihua III-1063 Yang, Jilin I-882 Yang, Jun-Jie II-880 Yang, Luxi I-856 Yang, Shuzhong I-987 Yang, Wen-Chie I-183 Yang, Xiaogang III-596 Yang, Xiaowei I-773
Yang, Yu II-1071 Yang, Yulong III-225 Yang, Zhiyong III-702 Yao, Chun-lian II-458 Yao, Li I-97, I-175 Yao, Wangshu III-754 Yazici, Suleyman III-1105 Ye, Datian I-272 Ye, Liao-yuan II-942 Ye, Mao I-1098, III-498 Ye, Zhongfu III-74 Yeh, Chung-Hsing III-295 Yeo, Jiyoung II-466 Yeung, C.H. II-754 Yeung, Sai-Ho II-568 Yim, Hyungwook I-247 Yin, Chuanhuan I-634, III-184 Yin, Hujun II-671 Yin, Jianping III-866 Yin, Kai I-97 Yin, Yuhai III-702 Yoon, PalJoo II-466 Yoshida, Fumihiko I-626, II-897 You, Jiun-De II-60 Youn, Jin-Seon III-1206 Yu, Gang II-379 Yu, Gisoon I-290 Yu, Hui III-674 Yu, Jinxia III-711 Yu, Lean III-380, III-928 Yu, Li II-1090 Yu, Shi III-572 Yu, Xuegang III-201 Yu, Zhongyuan II-1090 Yuan, Haiying III-518 Yuan, Runzhang III-209 Yuan, Xu-dong III-322 Yuan, Zhanhui III-1063 Yumak, Onder III-508 Zeng, Delu II-806 Zeng, Xiangyan I-1117 Zhang, Anne II-140 Zhang, Boyun III-866 Zhang, Byoung-Tak I-670, III-30 Zhang, Changjiang III-910 Zhang, Haisheng III-410 Zhang, Hong III-545 Zhang, Huanshui I-580 Zhang, Jiadong II-952
Zhang, Jun-ben II-132 Zhang, Ke I-537, III-845 Zhang, Kun III-400 Zhang, Liming II-278 Zhang, Liqing I-127, II-523 Zhang, Pu-Ming I-30 Zhang, Taiyi I-513 Zhang, Wei III-243 Zhang, Xiaoguang II-1090 Zhang, Xin II-612 Zhang, Xingzhou II-596 Zhang, Xun III-1199 Zhang, Yan II-304, II-412 Zhang, Ye II-560 Zhang, Yimin II-448 Zhang, Y.P. I-203 Zhang, Yong Sheng II-412 Zhang, Yun-Chu III-721 Zhang, Zhi-Lin II-523 Zhang, Zhikang I-306 Zhao, Chuang II-553 Zhao, Lian-Wei I-193, II-343 Zhao, Qibin II-523 Zhao, Weirui I-580 Zhao, Xiao-Jie I-97 Zhao, Yibiao III-692
Zhao, Yongjun II-481, II-553 Zheng, Bo III-102 Zheng, Hong II-99 Zheng, Hui I-782, II-1174 Zheng, Shiyou III-664 Zheng, XiaoJian II-379 Zheng Zhai, Yu III-260 Zhong, Yixin III-165 Zhou, Chengxiong III-928 Zhou, Chunguang I-801 Zhou, Jian-Zhong II-880 Zhou, Juan II-361, III-983 Zhou, Li-Zhu II-671 Zhou, Weidong II-499 Zhou, Wenhui II-1128 Zhou, Yatong I-513, I-900 Zhou, Yi III-1002 Zhou, Yue I-1058 Zhou, Zhiheng II-806 Zhu, Jun II-890 Zhu, Li III-498 Zhu, Xiaoyan II-88 Zhuang, Li II-88 Zou, An-Min III-721 Zuo, Bin II-1022
ERRATUM
First Passage Time Problem for the Ornstein-Uhlenbeck Neuronal Model

C.F. Lo∗ and T.K. Chung

Institute of Theoretical Physics and Department of Physics, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
∗ [email protected]
Abstract. In this paper we propose a simple and efficient method for computing accurate estimates (in closed form) of the first passage time density of the Ornstein-Uhlenbeck neuronal model through a fixed boundary (i.e. the interspike statistics of the stochastic leaky integrate-and-fire neuron model). This new approach can also provide very tight upper and lower bounds (in closed form) for the exact first passage time density in a systematic manner. Unlike previous approximate analytical attempts, this novel approximation scheme not only goes beyond the linear response and weak noise limit, but it can also be systematically improved to yield the exact results. Furthermore, it is straightforward to extend our approach to study the more general case of a deterministically modulated boundary.
1 Introduction
The leaky integrate-and-fire (LIF) model is one of the most widely used spiking neuron models.[1] In spite of its simplicity, the model appears to be a good compromise between tractability and realism. Mathematically speaking, the LIF model provides a first-order approximation of the full set of Hodgkin-Huxley equations used to describe neuronal behaviour. While the LIF model does not provide a complete description of real neurons, it has been applied successfully to explain the high temporal precision achieved in the auditory[2] and visual[3] systems notwithstanding comparatively long membrane time constants. It has also been employed widely in the ongoing debate about the origin of spike-rate variability in cortical neurons[4-6] and as a spike generator in studies on synaptic gain control[7]. Consequently, the LIF model has become the most widespread model in studies on neuronal information processing. Under specific assumptions[8], the stochastic formulation of the LIF model coincides with the time-dependent Ornstein-Uhlenbeck process¹ (abbreviated as OU-process) with an absorbing boundary. The first passage time density (FPTD) of the time-dependent OU-process corresponds to the distribution function of the interspike intervals of this neuron model. Unfortunately, despite the importance and wide applications of the OU-process, explicit analytic solutions
¹ This is a generalization of the Ornstein-Uhlenbeck process with time-dependent model parameters.
I. King et al. (Eds.): ICONIP 2006, Part I, LNCS 4232, pp. 324–333, 2006. © Springer-Verlag Berlin Heidelberg 2006
to such a first passage time problem are not known except for a few specific instances. As summarized by Alili et al.[9], three representations of analytical nature have been obtained for the FPTD of an OU-process through a constant threshold. The first one is based on an eigenfunction expansion involving zeros of the parabolic cylinder functions, the second one is an integral representation involving some special functions, and the third one is given in terms of a functional of a three-dimensional Bessel bridge. In addition to the numerical methods, e.g. the finite-difference approach and direct Monte-Carlo simulation, these three representations suggest alternative ways to approximate the FPTD. Nevertheless, these three representations are valid for an OU-process with constant model parameters only.

In this paper we derive the closed-form formula for the FPTD of a time-dependent OU-process to a parametric class of moving boundaries. We then apply the results to develop a simple and efficient method for computing accurate estimates (in closed form) of the FPTD through a fixed boundary. This new approach can also provide very tight upper and lower bounds (in closed form) for the exact FPTD in a systematic manner. Unlike previous approximate analytical attempts[10-21], our approximation scheme detailed below not only goes beyond the linear response and weak noise limit, but it can also be systematically improved to yield the exact results.
2 First Passage Time Density
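To begin with, we consider the Fokker-Planck equation (FPE) associated with a time-dependent OU-process[22]:

$$\frac{\partial P(x,t)}{\partial t} = \frac{1}{2}\,\sigma(t)^2\,\frac{\partial^2 P(x,t)}{\partial x^2} - \frac{\partial}{\partial x}\Big\{\left[\mu + A\cos(\omega t) - \lambda x\right] P(x,t)\Big\} \tag{1}$$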
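As a rough illustration of the direct Monte-Carlo route mentioned in the Introduction, the sketch below integrates the Langevin equation dx = [μ + A cos(ωt) − λx] dt + σ dW, whose probability density obeys Eq. (1), with the Euler-Maruyama method and records when each path first reaches a fixed threshold. The parameter values are those quoted in the caption of Fig. 1, with σ taken as constant. The threshold at x = 1, the step size, the number of paths and the function name `simulate_first_passage` are illustrative assumptions and not the authors' procedure.

```python
import math
import random

# Parameter values quoted in the Fig. 1 caption; sigma is assumed constant.
MU, LAM, A, SIGMA2, OMEGA = 0.4608, 0.5202, 0.1466, 0.1396, 2.0 * math.pi


def simulate_first_passage(n_paths=200, dt=0.01, t_max=8.0,
                           threshold=1.0, x0=0.0, seed=0):
    """Return the first-passage times of the paths that reach `threshold`.

    Euler-Maruyama discretisation of
        dx = [mu + A cos(omega t) - lam x] dt + sigma dW.
    Paths that have not crossed by t_max contribute no entry.
    """
    rng = random.Random(seed)               # seeded for reproducibility
    noise_scale = math.sqrt(SIGMA2 * dt)    # sigma * sqrt(dt)
    n_steps = int(round(t_max / dt))
    times = []
    for _ in range(n_paths):
        x = x0
        for k in range(n_steps):
            drift = MU + A * math.cos(OMEGA * k * dt) - LAM * x
            x += drift * dt + noise_scale * rng.gauss(0.0, 1.0)
            if x >= threshold:              # absorbed: record the crossing time
                times.append((k + 1) * dt)
                break
    return times


if __name__ == "__main__":
    fpt = simulate_first_passage()
    print(f"{len(fpt)} of 200 paths crossed; "
          f"mean first-passage time {sum(fpt) / len(fpt):.3f}")
```

A histogram of the returned times gives a crude estimate of the FPTD. Because the threshold is only checked once per step, excursions between steps are missed, so the estimate is biased towards later crossings unless dt is small.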
It is straightforward to show that its solution corresponding to the so-called natural boundary condition is given by

$$P(x,t) = \int_{-\infty}^{\infty} K(x,t;x',0)\,P(x',0)\,dx' \tag{2}$$
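where

$$K(x,t;x',0) = \frac{1}{\sqrt{4\pi\eta(t)}}\,\exp\left\{-\frac{\left[x e^{\lambda t} + \gamma(t) - x'\right]^2}{4\eta(t)} + \lambda t\right\} \tag{3}$$

$$\eta(t) = \frac{1}{2}\int_0^t \sigma^2 e^{2\lambda t'}\,dt' = \frac{\sigma^2}{4\lambda}\left(e^{2\lambda t} - 1\right) \tag{4}$$

$$\gamma(t) = -\int_0^t \left[\mu + A\cos(\omega t')\right] e^{\lambda t'}\,dt' = -\frac{A}{\lambda^2+\omega^2}\left\{\left[\lambda\cos(\omega t) + \omega\sin(\omega t)\right] e^{\lambda t} - \lambda\right\} - \frac{\mu}{\lambda}\left(e^{\lambda t} - 1\right). \tag{5}$$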
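The closed forms in Eqs. (3)-(5) are easy to check numerically. The following minimal sketch evaluates η(t), γ(t) and the kernel K for the parameter values quoted in the caption of Fig. 1 (σ assumed constant), compares the closed form of γ(t) with a direct quadrature of its defining integral, and confirms that K integrates to one over x. The helper `simpson` and all function names are illustrative choices, not part of the paper.

```python
import math

# Parameter values quoted in the Fig. 1 caption; sigma is assumed constant.
MU, LAM, A, SIGMA2, OMEGA = 0.4608, 0.5202, 0.1466, 0.1396, 2.0 * math.pi


def eta(t):
    """Eq. (4): eta(t) = sigma^2 (exp(2 lam t) - 1) / (4 lam)."""
    return SIGMA2 * math.expm1(2.0 * LAM * t) / (4.0 * LAM)


def gamma(t):
    """Eq. (5): closed form of -int_0^t [mu + A cos(omega s)] exp(lam s) ds."""
    osc = ((LAM * math.cos(OMEGA * t) + OMEGA * math.sin(OMEGA * t))
           * math.exp(LAM * t) - LAM)
    return -A * osc / (LAM ** 2 + OMEGA ** 2) - MU * math.expm1(LAM * t) / LAM


def kernel(x, t, x_prime):
    """Eq. (3): propagator K(x, t; x', 0) for t > 0."""
    h = eta(t)
    z = x * math.exp(LAM * t) + gamma(t) - x_prime
    return math.exp(LAM * t - z * z / (4.0 * h)) / math.sqrt(4.0 * math.pi * h)


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


if __name__ == "__main__":
    t = 0.7
    gamma_num = -simpson(
        lambda s: (MU + A * math.cos(OMEGA * s)) * math.exp(LAM * s), 0.0, t)
    print("gamma closed form:", gamma(t), "quadrature:", gamma_num)
    print("integral of K over x:", simpson(lambda x: kernel(x, t, 0.0), -6.0, 6.0))
```

The factor exp(λt) in Eq. (3) is what keeps the kernel normalised in x, since the Gaussian is written in the rescaled variable x exp(λt).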
By the method of images we are also able to derive the solution

$$P(x,t) = \int_{-\infty}^{1} \left\{K(x-1,t;x'-1,0) - K(x-1,t;-x'+1,0)\,e^{-2\beta(x'-1)}\right\} P(x',0)\,dx' \tag{6}$$
which vanishes at $x = x^*(t) \equiv 1 - \left[\gamma(t) + 2\beta\eta(t)\right] e^{-\lambda t}$ at any time $t \ge 0$. Here $\beta$ is a real adjustable parameter. This solution is valid for the interval $-\infty < x \le x^*(t)$. Hence, we have obtained a parametric class of closed-form solutions of Eq.(1) with a moving absorbing boundary whose movement is controlled by the parameter $\beta$. Accordingly, the corresponding FPTD conditional to $P(x,0) = \delta(x - x_0)$ can be analytically obtained in closed form as follows:

$$P_{fp}(x_0,t) = 1 - \int_{-\infty}^{x^*(t)} \left\{K(x-1,t;x_0-1,0) - K(x-1,t;-x_0+1,0)\,e^{-2\beta(x_0-1)}\right\} dx = N\!\left(\frac{2\beta\eta(t) + x_0 - 1}{\sqrt{2\eta(t)}}\right) + N\!\left(-\frac{2\beta\eta(t) - x_0 + 1}{\sqrt{2\eta(t)}}\right) e^{-2\beta(x_0-1)} \tag{7}$$
where $N(\cdot)$ is the cumulative normal distribution function. In order to approximate the FPTD through a fixed boundary at $x = 1$, we could choose an optimal value of the adjustable parameter $\beta$ in such a way that the integral

$$\int_0^\tau \left[x^*(t) - 1\right]^2 dt$$

is minimum. In other words, we try to minimize the deviation of the moving boundary from the fixed boundary by varying the parameter $\beta$. Here $\tau$ denotes the time at which the solution of the FPE is evaluated. Simple algebraic manipulations then yield the optimal value of $\beta$ as follows:

$$\beta_{opt} = -\frac{\int_0^\tau \gamma(t)\,\eta(t)\,e^{-2\lambda t}\,dt}{2\int_0^\tau \eta(t)^2\,e^{-2\lambda t}\,dt}. \tag{8}$$

Making use of the maximum principle for parabolic partial differential equations, we can also determine the upper and lower bounds for the exact solution associated with the fixed boundary. It is not difficult to show² that the upper bound can be provided by the solution of the FPE associated with a moving
² The proof is based upon the maximum principle for parabolic partial differential equations (see the appendix of Lo et al. (2003) for the relevant proof).
boundary whose $x^*(t)$ is always larger than or equal to unity for the duration of interest. Similarly, the solution of the FPE associated with a moving boundary whose $x^*(t)$ is always smaller than or equal to unity for the duration of interest can serve as the lower bound. Furthermore, the upper and lower bounds can be optimized by adjusting the corresponding values of the parameter $\beta$.³ The FPTD corresponding to the “upper-bound” solution is smaller than the exact FPTD, whilst the one derived from the “lower-bound” solution is larger than the exact value.
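A minimal sketch of how Eqs. (7) and (8) might be evaluated in practice is given below. It codes the formulas as printed, using the parameter values quoted in the caption of Fig. 1 with σ assumed constant; whether these raw values correspond to the authors' normalised parameters is not known, so the numbers it prints are illustrative only. The Simpson helper, the horizon τ = 1 and all function names are assumptions made for the example.

```python
import math

# Parameter values quoted in the Fig. 1 caption; sigma is assumed constant.
MU, LAM, A, SIGMA2, OMEGA = 0.4608, 0.5202, 0.1466, 0.1396, 2.0 * math.pi
X0 = 0.0


def eta(t):
    """Eq. (4)."""
    return SIGMA2 * math.expm1(2.0 * LAM * t) / (4.0 * LAM)


def gamma(t):
    """Eq. (5)."""
    osc = ((LAM * math.cos(OMEGA * t) + OMEGA * math.sin(OMEGA * t))
           * math.exp(LAM * t) - LAM)
    return -A * osc / (LAM ** 2 + OMEGA ** 2) - MU * math.expm1(LAM * t) / LAM


def norm_cdf(z):
    """Cumulative normal distribution N(z); erfc keeps the tails accurate."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def x_star(t, beta):
    """Moving boundary x*(t) = 1 - [gamma(t) + 2 beta eta(t)] exp(-lam t)."""
    return 1.0 - (gamma(t) + 2.0 * beta * eta(t)) * math.exp(-LAM * t)


def p_fp(t, beta, x0=X0):
    """Eq. (7), valid for t > 0."""
    h = eta(t)
    s = math.sqrt(2.0 * h)
    return (norm_cdf((2.0 * beta * h + x0 - 1.0) / s)
            + norm_cdf(-(2.0 * beta * h - x0 + 1.0) / s)
            * math.exp(-2.0 * beta * (x0 - 1.0)))


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def beta_opt(tau):
    """Eq. (8): least-squares choice of beta on [0, tau]."""
    num = simpson(lambda t: gamma(t) * eta(t) * math.exp(-2.0 * LAM * t), 0.0, tau)
    den = simpson(lambda t: eta(t) ** 2 * math.exp(-2.0 * LAM * t), 0.0, tau)
    return -num / (2.0 * den)


def boundary_misfit(beta, tau):
    """Integral of [x*(t) - 1]^2 over [0, tau], the quantity Eq. (8) minimises."""
    return simpson(lambda t: (x_star(t, beta) - 1.0) ** 2, 0.0, tau)


if __name__ == "__main__":
    tau = 1.0  # hypothetical evaluation time
    b = beta_opt(tau)
    print("beta_opt:", b, "misfit:", boundary_misfit(b, tau))
    print("P_fp at tau:", p_fp(tau, b))
```

Since the misfit is quadratic in β, Eq. (8) is simply the stationary point of a parabola, which is why nudging β in either direction can only increase `boundary_misfit`.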
3 Multi-stage Approximation
Now, we propose a systematic multi-stage scheme to approximate the exact solution of the FPE with a fixed absorbing boundary at $x = 1$. This approximation scheme has been successfully applied to compute tight upper and lower bounds of barrier option prices with time-dependent parameters very efficiently, where the underlying asset prices follow the lognormal process and the constant elasticity of variance process[23,24]. For demonstration, we consider the evaluation of the approximate FPTD in two stages.

Stage 1: the time interval $[0, \tau/2]$

We choose an appropriate value of the parameter $\beta$, denoted by $\beta_1$, such that $x^*(t=0) = x^*(t=\tau/2) = 1$. This determines the movement of the boundary within the time interval $[0, \tau/2]$. The corresponding solution is given by

$$P(x, 0 \le t \le \tau/2) = \int_{-\infty}^{1} G(x,t;x',0;\beta_1)\,P(x',0)\,dx', \tag{9}$$

where

$$G(x,t;x',0;\beta_1) = K(x-1,t;x'-1,0) - K(x-1,t;-x'+1,0)\,e^{-2\beta_1(x'-1)}. \tag{10}$$
Stage 2: the time interval $[\tau/2, \tau]$

We repeat the procedure in stage 1 such that $x^*(t=\tau/2) = x^*(t=\tau) = 1$. This will give us another value of $\beta$, denoted by $\beta_2$, and determine the moving boundary’s trajectory for the time interval $[\tau/2, \tau]$. Then, the corresponding solution is evaluated as follows:

$$P(x, \tau/2 \le t \le \tau) = \int_{-\infty}^{1} \bar{G}(x,t;x',\tau/2;\beta_2)\,P(x',\tau/2)\,dx', \tag{11}$$
³ Each of the moving barriers associated with the upper and lower bounds could be determined by requiring that either the moving barrier returns to its initial position and merges with the fixed barrier at time $t = \tau$, i.e. $x^*(t=\tau) = x^*(t=0)$, or the instantaneous rate of change of $x^*(t)$ is zero at time $t = 0$. Both criteria ensure that the deviation from the fixed barrier is minimal.
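where

$$\bar{G}(x,t;x',\tau/2;\beta_2) = \bar{K}(x-1,t;x'-1,\tau/2) - \bar{K}(x-1,t;-x'+1,\tau/2)\,e^{-2\beta_2(x'-1)} \tag{12}$$

$$\bar{K}(x,t;x',\tau/2) = \frac{1}{\sqrt{4\pi\eta(t)}}\,\exp\left\{-\frac{\left[x e^{\lambda(t-\tau/2)} + \gamma(t) - x'\right]^2}{4\eta(t)} + \lambda\left(t - \frac{\tau}{2}\right)\right\} \tag{13}$$

$$\eta(t) = \frac{1}{2}\int_{\tau/2}^{t} \sigma(t')^2 e^{2\lambda t'}\,dt' = \frac{\sigma^2}{4\lambda}\left(e^{2\lambda t} - e^{\lambda\tau}\right) \tag{14}$$

$$\gamma(t) = -\int_{\tau/2}^{t} \left[\mu + A\cos(\omega t')\right] e^{\lambda t'}\,dt' = -\frac{\mu}{\lambda}\left(e^{\lambda t} - e^{\lambda\tau/2}\right) - \frac{A}{\lambda^2+\omega^2}\left\{\left[\lambda\cos(\omega t) + \omega\sin(\omega t)\right] e^{\lambda t} - \left[\lambda\cos(\omega\tau/2) + \omega\sin(\omega\tau/2)\right] e^{\lambda\tau/2}\right\}. \tag{15}$$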
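The stage-1 condition can be made concrete with a short sketch. Because γ(0) = η(0) = 0, the requirement x*(0) = 1 holds for any β, so β₁ follows from x*(τ/2) = 1 alone, i.e. γ(τ/2) + 2β₁η(τ/2) = 0. The code below computes β₁ for a first stage of length 0.2T (the step size quoted for Fig. 1, with T = 2π/ω = 1) and reports how far the barrier strays from 1 inside the stage. The parameter values come from the caption of Fig. 1 with σ assumed constant; the function names are illustrative, and the resulting β need not match the labels in Fig. 2, since the authors' normalisation is not spelled out.

```python
import math

# Parameter values quoted in the Fig. 1 caption; sigma is assumed constant.
MU, LAM, A, SIGMA2, OMEGA = 0.4608, 0.5202, 0.1466, 0.1396, 2.0 * math.pi
T_STAGE = 0.2  # end of the first stage (0.2 T with T = 2 pi / omega = 1)


def eta(t):
    """Eq. (4)."""
    return SIGMA2 * math.expm1(2.0 * LAM * t) / (4.0 * LAM)


def gamma(t):
    """Eq. (5)."""
    osc = ((LAM * math.cos(OMEGA * t) + OMEGA * math.sin(OMEGA * t))
           * math.exp(LAM * t) - LAM)
    return -A * osc / (LAM ** 2 + OMEGA ** 2) - MU * math.expm1(LAM * t) / LAM


def x_star(t, beta):
    """Moving boundary x*(t) = 1 - [gamma(t) + 2 beta eta(t)] exp(-lam t)."""
    return 1.0 - (gamma(t) + 2.0 * beta * eta(t)) * math.exp(-LAM * t)


def beta_for_stage(t_end):
    """beta_1 from x*(t_end) = 1, i.e. gamma(t_end) + 2 beta eta(t_end) = 0."""
    return -gamma(t_end) / (2.0 * eta(t_end))


def max_deviation(beta, t_end, samples=200):
    """Largest |x*(t) - 1| on a uniform grid over [0, t_end]."""
    return max(abs(x_star(t_end * i / samples, beta) - 1.0)
               for i in range(samples + 1))


if __name__ == "__main__":
    beta1 = beta_for_stage(T_STAGE)
    print("beta_1:", beta1)
    print("max |x*(t) - 1| on the stage:", max_deviation(beta1, T_STAGE))
```

Shortening the stage shrinks the excursion of x*(t) away from 1, which is the mechanism by which adding stages tightens the approximation; later stages would repeat the same step with the restarted quantities of Eqs. (13)-(15).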
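As a result, the associated FPTD is found to be

$$P_{fp}(x_0, 0 \le t \le \tau/2) = 1 - \int_{-\infty}^{1-[\gamma(t)+2\beta_1\eta(t)]e^{-\lambda t}} G(x,t;x_0,0;\beta_1)\,dx = N\!\left(\frac{2\beta_1\eta(t) + x_0 - 1}{\sqrt{2\eta(t)}}\right) + N\!\left(-\frac{2\beta_1\eta(t) - x_0 + 1}{\sqrt{2\eta(t)}}\right) e^{-2\beta_1(x_0-1)} \tag{16}$$

and

$$P_{fp}(x_0, \tau/2 \le t \le \tau) = 1 - \int_{-\infty}^{1-[\gamma(t)+2\beta_2\eta(t)]e^{-\lambda(t-\tau/2)}} \int_{-\infty}^{1} \bar{G}(x,t;x',\tau/2;\beta_2)\, G(x',\tau/2;x_0,0;\beta_1)\,dx'\,dx$$

$$= 1 - \int_{-\infty}^{1} \left\{N\!\left(-\frac{2\beta_2\eta(t) + x' - 1}{\sqrt{2\eta(t)}}\right) - e^{-2\beta_2(x'-1)}\, N\!\left(-\frac{2\beta_2\eta(t) - x' + 1}{\sqrt{2\eta(t)}}\right)\right\} G(x',\tau/2;x_0,0;\beta_1)\,dx'. \tag{17}$$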
The integration can be performed analytically and the result can be expressed in closed form in terms of the cumulative bivariate normal distribution function $N_2(\cdot)$. However, in practice it is also very efficient to calculate the integral numerically, e.g. using the Gauss quadrature method.

Clearly, one can further improve the estimate by splitting the evaluation process into four stages instead, namely $[0, \tau/4]$, $[\tau/4, \tau/2]$, $[\tau/2, 3\tau/4]$ and $[3\tau/4, \tau]$. One then needs to determine the corresponding values of $\beta$ for these four stages and perform successive integrations similar to the one in the two-stage approximation. The final expression of the associated FPTD can be expressed in closed form in terms of the $N(\cdot)$, $N_2(\cdot)$, $N_3(\cdot)$ and $N_4(\cdot)$ functions.

In summary, the essence of this multi-stage approximation scheme is to replace the smooth barrier track by a continuous and piecewise smooth trajectory in order that the deviation from the fixed barrier is minimized in a systematic manner. We then need to perform some simple one-dimensional numerical integrations (e.g. using the Gauss quadrature method)⁴ at the connecting points of the piecewise smooth barrier in order to evaluate the approximate value of the FPTD. By construction, it is expected that the multi-stage approximation becomes better and better as the number N of stages increases; in fact, the error is asymptotically reduced to zero. In practice, even a rather low-order approximation can yield very accurate estimates of the FPTD.

Finally, for illustration, we apply the multi-stage approximation scheme to estimate the FPTD of the example depicted in Figure 1. In this example, all model parameters are adapted from Bulsara et al.’s paper[10].⁵ The corresponding barrier track used for the calculations is shown in Figure 2. Obviously, the selected moving barrier provides a very good proxy for the fixed boundary.
As shown in Figure 1, the multi-stage approximation indeed produces a very accurate estimate of the (numerically) exact FPTD obtained by the Crank-Nicolson method; the square-dotted curve and the solid curve practically coincide. To give a clearer picture of the accuracy and efficiency of the multi-stage approximation scheme, Table 1 compares numerical results for various orders of the multi-stage approximation with the (numerically) exact results. Even with a step size of 0.25T, the multi-stage approximation scheme generates estimates of the FPTD with an error of less than 2.5%, and the discrepancy shrinks rapidly as we go to higher-order approximations. Furthermore, we have also tested the robustness of the multi-stage approximation for different sets of model parameters, and we shall report those results elsewhere.

⁴ The integration can be performed analytically and the result can be expressed in closed form in terms of the multivariate normal distribution functions. However, in practice the numerical integrations are indeed very efficient.
⁵ We have normalised the parameters in our calculations such that the fixed boundary is located at x = 1.
C.F. Lo and T.K. Chung
[Figure 1: plot of the instantaneous rate of change of the FPTD (vertical axis, 0 to 0.3) against t/T (horizontal axis, 0 to 8).]
Fig. 1. Instantaneous rate of change of the first passage time density g(t) obtained by the systematic multi-stage approximation scheme with step size equal to 0.2T (square-dotted line) and numerical results obtained by the CN method (solid curve). The two curves practically coincide. Input parameters: x0 = 0, μ = 0.4608, λ = 0.5202, A = 0.1466, σ² = 0.1396, ω = 2π and T = 2π/ω.
[Figure 2: plot of the barrier track x(t) (vertical axis, 0.994 to 1.006) against t/T (horizontal axis, 0 to 3). Segment labels shown in the plot: β = -0.2233, β = -0.1762, β = 0.0828, β = -0.3830.]
Fig. 2. Barrier track of the multi-stage approximation scheme with step size equal to 0.2T (corresponding to Fig. 1).
Table 1 (a). Comparison of the single-stage approximation scheme with numerical results obtained by the Crank-Nicolson (CN) method. Input parameters are the same as in Fig. 1. (In all CN simulations, Δx = 0.0001, Δt = 0.0001.)

| t (T) | CN method | Single-stage approximation scheme | Optimal β |
|-------|-----------|-----------------------------------|-----------|
| 2 | 0.27850 | 0.29460 (6.4%) | -0.02345 |
| 4 | 0.17029 | 0.14128 (-17%) | -0.00480 |
| 6 | 0.07899 | 0.05163 (-34%) | -0.00142 |
| 8 | 0.03552 | 0.01833 (-48%) | -0.00047 |
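For orientation, a Crank-Nicolson computation of a first passage time density can be sketched as follows. This is an illustrative Python sketch rather than the solver used for Table 1: it assumes an unforced Ornstein-Uhlenbeck drift μ − λx (the parameters A and ω listed in Fig. 1 are not used), replaces the point initial condition by a narrow Gaussian, truncates the domain at a hypothetical lower cut-off x_min, and uses a much coarser grid than Δx = Δt = 0.0001 so that it runs quickly. All function and argument names are assumptions.

```python
import numpy as np


def ou_fptd_crank_nicolson(mu=0.4608, lam=0.5202, sigma2=0.1396,
                           barrier=1.0, x_min=-3.0, x0=0.0,
                           dx=0.01, dt=0.005, t_end=8.0, init_sd=0.05):
    """Crank-Nicolson solution of the (assumed) Fokker-Planck equation

        dp/dt = -d/dx[(mu - lam*x) p] + (sigma2/2) d2p/dx2

    with p = 0 at x = barrier (absorbing) and at x = x_min (cut-off).
    Returns the time grid, the survival probability S(t) and the
    first passage time density estimated as g(t) = -dS/dt.
    """
    n = int(round((barrier - x_min) / dx)) - 1   # interior grid points
    x = x_min + dx * np.arange(1, n + 1)
    drift = mu - lam * x
    diff = 0.5 * sigma2

    # Spatial operator: central differences, drift in conservative form.
    L = np.zeros((n, n))
    i = np.arange(n)
    L[i, i] = -2.0 * diff / dx**2
    L[i[:-1], i[:-1] + 1] = diff / dx**2 - drift[1:] / (2.0 * dx)
    L[i[1:], i[1:] - 1] = diff / dx**2 + drift[:-1] / (2.0 * dx)

    # One CN step: (I - dt/2 L) p_new = (I + dt/2 L) p_old.
    eye = np.eye(n)
    step = np.linalg.solve(eye - 0.5 * dt * L, eye + 0.5 * dt * L)

    # Narrow Gaussian stand-in for a point start at x0, normalised to 1.
    p = np.exp(-0.5 * ((x - x0) / init_sd) ** 2)
    p /= p.sum() * dx

    n_steps = int(round(t_end / dt))
    survival = np.empty(n_steps + 1)
    survival[0] = p.sum() * dx
    for k in range(1, n_steps + 1):
        p = step @ p
        survival[k] = p.sum() * dx

    times = dt * np.arange(n_steps + 1)
    fptd = -np.gradient(survival, dt)   # finite-difference estimate of -dS/dt
    return times, survival, fptd


times, survival, fptd = ou_fptd_crank_nicolson()
```

Because of the simplifications listed above, the output of this sketch is not expected to reproduce the CN column of Table 1; it only shows the structure of the time stepping and how the density is read off from the decay of the survival probability.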
Table 1 (b). Comparison of the multi-stage approximation scheme with numerical results obtained by the CN method. The percentage error is defined as (estimate – CN result)/CN result × 100%. Same input parameters as Table 1(a).

| t (T) | CN method | Multistage, step = 1T | Multistage, step = 0.25T | Multistage, step = 0.1T | Multistage, step = 0.01T |
|-------|-----------|-----------------------|--------------------------|-------------------------|--------------------------|
| 2 | 0.27850 | 0.33859 (21.5%) | 0.28439 (2.11%) | 0.27865 (0.053%) | 0.27850 (0%) |
| 4 | 0.17029 | 0.21143 (24.1%) | 0.17408 (2.22%) | 0.17040 (0.064%) | 0.17029 (0%) |
| 6 | 0.07899 | 0.09780 (23.8%) | 0.08081 (2.30%) | 0.07906 (0.088%) | 0.07899 (0%) |
| 8 | 0.03552 | 0.04376 (23.1%) | 0.03637 (2.39%) | 0.03555 (0.084%) | 0.03552 (0%) |
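The percentage error defined in the caption of Table 1(b) can be recomputed directly from the tabulated values. The short Python sketch below does so; the variable and function names are hypothetical, and because the tabulated entries are rounded to five decimals, the recomputed percentages may differ from the printed ones in the last digit.

```python
# CN reference values and multi-stage estimates copied from Table 1(b),
# listed for t = 2T, 4T, 6T and 8T.
cn = [0.27850, 0.17029, 0.07899, 0.03552]
estimates = {
    "1T": [0.33859, 0.21143, 0.09780, 0.04376],
    "0.25T": [0.28439, 0.17408, 0.08081, 0.03637],
    "0.1T": [0.27865, 0.17040, 0.07906, 0.03555],
    "0.01T": [0.27850, 0.17029, 0.07899, 0.03552],
}


def percentage_error(estimate, reference):
    """(estimate - CN result) / CN result x 100, as defined in the caption."""
    return (estimate - reference) / reference * 100.0


# Percentage errors per step size, in the same row order as the table.
errors = {
    step: [percentage_error(e, r) for e, r in zip(values, cn)]
    for step, values in estimates.items()
}

for step, errs in errors.items():
    print(step, [round(e, 3) for e in errs])
```

Taking the largest absolute entry of `errors["0.25T"]` gives a quick check of the statement that a step size of 0.25T already keeps the error below 2.5%.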
4 Conclusion
In this paper we have proposed a simple and efficient method for computing accurate estimates (in closed form) of the FPTD of the Ornstein-Uhlenbeck neuronal model through a fixed boundary (i.e. the interspike statistics of the stochastic LIF neuron model). This new approach can also provide very tight upper and lower bounds (in closed form) for the exact FPTD in a systematic manner. Unlike previous approximate analytical attempts, our approximation scheme not only goes beyond the linear-response and weak-noise limits, but can also be systematically improved to yield the exact results. Furthermore, it is straightforward to extend our approach to the more general case of a deterministically modulated boundary.
References
1. Tuckwell, H.C.: Stochastic Processes in the Neurosciences (SIAM, Philadelphia, 1989)
2. Gerstner, W., Kempter, R., van Hemmen, J.L., Wagner, H.: A neuronal learning rule for sub-millisecond temporal coding. Nature 383 (1996) 76-78
3. Maršálek, P., Koch, C., Maunsell, J.: On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. USA 94 (1997) 735-740
4. Troyer, T.W., Miller, K.D.: Physiological Gain Leads to High ISI Variability in a Simple Model of a Cortical Regular Spiking Cell. Neural Computation 9 (1997) 971-983
5. Bugmann, G., Christodoulou, C., Taylor, J.G.: Role of the Temporal Integration and Fluctuation Detection in the Highly Irregular Firing of a Leaky Integrator Neuron with Partial Reset. Neural Computation 9 (1997) 985-1000
6. Feng, J.: Behaviors of Spike Output Jitter in the Integrate-and-Fire Model. Phys. Rev. Lett. 79 (1997) 4505-4508
7. Abbott, L.F., Varela, J.A., Sen, K., Nelson, S.B.: Synaptic depression and cortical gain control. Science 275 (1997) 220-223
8. Lansky, P.: On approximations of Stein’s neuronal model. J. Theor. Biol. 107 (1984) 631-647
9. Alili, L., Patie, P., Pedersen, J.L.: Representations of first hitting time density of an Ornstein-Uhlenbeck process. Stoch. Models 21 (2005) 967-980
10. Bulsara, A.R., Elston, T.C., Doering, C.R., Lowen, S.B., Lindenberg, K.: Cooperative behavior in periodically driven noisy integrate-fire models of neuronal dynamics. Phys. Rev. E 53 (1996) 3958-3969
11. Plesser, H.E., Gerstner, W.: Noise in integrate-and-fire neurons: from stochastic input to escape rates. Neurocomputing 32-33 (2000) 219-224
12. Plesser, H.E., Geisel, Y.: Bandpass properties of integrate-fire neurons. Phys. Rev. E 59 (1999) 7008-7017
13. Hänggi, P., Talkner, P., Borkovec, M.: Reaction-rate theory: Fifty years after Kramers. Rev. Mod. Phys. 62 (1990) 251-341
14. Lindner, B., Schimansky-Geier, L.: Transmission of Noise Coded versus Additive Signals through a Neuronal Ensemble. Phys. Rev. Lett. 86 (2001) 2934-2937
15. Fourcaud, N., Brunel, N.: Dynamics of the Firing Probability of Noisy Integrate-and-Fire Neurons. Neural Comput. 14 (2002) 2057-2110
16. Lindner, B., Garcia-Ojalvo, J., Neiman, A., Schimansky-Geier, L.: Effects of noise in excitable systems. Phys. Rep. 392 (2004) 321-424
17. Jung, P., Hänggi, P.: Amplification of small signals via stochastic resonance. Phys. Rev. A 44 (1991) 8032-8042
18. Shneidman, V.A., Jung, P., Hänggi, P.: Weak-noise limit of stochastic resonance. Phys. Rev. Lett. 72 (1994) 2682-2685
19. Lehmann, J., Reimann, P., Hänggi, P.: Surmounting Oscillating Barriers. Phys. Rev. Lett. 84 (2000) 1639-1642
20. Nikitin, A., Stocks, N.G., Bulsara, A.R.: Phys. Rev. E 68 (2003) 016103
21. Casado-Pascual, J., Gomez-Ordonez, J., Morillo, M., Hänggi, P.: Two-State Theory of Nonlinear Stochastic Resonance. Phys. Rev. Lett. 91 (2003) 210601
22. Gardiner, C.W.: Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences, 3rd ed. (Springer-Verlag, Berlin, 2003)
23. Lo, C.F., Lee, H.C., Hui, C.H.: A simple approach for pricing barrier options with time-dependent parameters. Quant. Finance 3 (2003) 98-107
24. Lo, C.F., Tang, H.M., Ku, K.C., Hui, C.H.: Valuation of CEV barrier options with time-dependent model parameters. Proceedings of the 2nd IASTED International Conference on Financial Engineering and Applications, 8-10 November 2004, Cambridge, MA, USA (2004) 34-39