Lecture Notes in Artificial Intelligence 3327
Edited by J. G. Carbonell and J. Siekmann
Subseries of Lecture Notes in Computer Science
Yong Shi Weixuan Xu Zhengxin Chen (Eds.)
Data Mining and Knowledge Management Chinese Academy of Sciences Symposium CASDMKM 2004 Beijing, China, July 12-14, 2004 Revised Papers
Springer
eBook ISBN: 3-540-30537-8
Print ISBN: 3-540-23987-1
©2005 Springer Science + Business Media, Inc.
Print ©2004 Springer-Verlag Berlin Heidelberg All rights reserved
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher
Created in the United States of America
Visit Springer's eBookstore at: http://ebooks.springerlink.com
and the Springer Global Website Online at: http://www.springeronline.com
Preface

Toward an Integrated Study of Data Mining and Knowledge Management

Data mining (DM) and knowledge management (KM) are two important research areas, but with different emphases. Research and practice in these two areas have been largely conducted in parallel. The Chinese Academy of Sciences Symposium on Data Mining and Knowledge Management 2004 (CASDMKM 2004), held in Beijing, China (July 12–14, 2004), provided a unique opportunity for scholars to exchange ideas in these two areas. CASDMKM is a forum for discussing research findings and case studies in data mining, knowledge management and related fields such as machine learning and optimization problems. It promotes data mining technology, knowledge management tools and their real-life applications in the global economy.

This volume of symposium postproceedings contains 3 invited talks, as well as 25 papers selected from 60 original research papers submitted to the symposium. Contributions in this volume come from scholars within China as well as from abroad, with diverse backgrounds, addressing a wide range of issues. The papers in this volume address various aspects of data mining and knowledge management. We believe the publication of this volume will stimulate the integrated study of these two important areas in the future.

Although both data mining and knowledge management have been active areas in research and practice, there is still a lack of idea exchange between these two camps. CASDMKM aims to bridge this gap. Numerous issues need to be studied in regard to data mining and knowledge management. For example, how to manage the knowledge mined from different data mining methods? From the knowledge management perspective, what kinds of knowledge need to be discovered? What are the similarities and differences for data mining applications and knowledge management applications? What are the issues not yet explored on the boundary of data mining and knowledge management? This list of questions goes on and on. Of course papers in this volume cannot answer all of these questions. Nevertheless, we believe that CASDMKM 2004 served as an exciting platform to foster an integrated study of data mining and knowledge management in the near future.

The papers included in this volume are organized into the following categories:

Data mining methods: Various theoretical aspects of data mining were examined from different perspectives such as fuzzy set theory, linear and non-linear programming, etc.

Practical issues of data mining: Complementary to theoretical studies of data mining, there are also papers exploring aspects of implementing and applying data mining methods.
Data mining for bioinformatics: As a new field, bioinformatics has shown great potential for applications of data mining. The papers included in this category focus on applying data mining methods for microarray data analysis.

Data mining applications: In addition to bioinformatics, data mining methods have also been applied to many other areas. In particular, multiple-criteria linear and nonlinear programming has proven to be a very useful approach.

Knowledge management for enterprise: These papers address various issues related to the application of knowledge management in corporations using various techniques. A particular emphasis here is on coordination and cooperation.

Risk management: Better knowledge management also requires more advanced techniques for risk management, to identify, control, and minimize the impact of uncertain events, as shown in these papers, using fuzzy set theory and other approaches for better risk management.

Integration of data mining and knowledge management: As indicated earlier, the integration of these two research fields is still in the early stage. Nevertheless, as shown in the papers selected in this volume, researchers have endeavored to integrate data mining methods such as neural networks with various aspects related to knowledge management, such as decision support systems and expert systems, for better knowledge management.

September 2004
Yong Shi Weixuan Xu Zhengxin Chen
CASDMKM 2004 Organization
Hosted by
Institute of Policy and Management at the Chinese Academy of Sciences
Graduate School of the Chinese Academy of Sciences
International Journal of Information Technology and Decision Making

Sponsored by
Chinese Academy of Sciences
National Natural Science Foundation of China
University of Nebraska at Omaha, USA

Conference Chairs
Weixuan Xu, Chinese Academy of Sciences, China
Yong Shi, University of Nebraska at Omaha, USA

Advisory Committee
Siwei Cheng, Natural Science Foundation, China
Ruwei Dai, Chinese Academy of Sciences, China
Masao Fukushima, Kyoto University, Japan
Bezalel Gavish, Southern Methodist University, USA
Jiali Ge, Petroleum University, China
Fred Glover, University of Colorado, USA
Jifa Gu, Chinese Academy of Sciences, China
Finn V. Jensen, Aalborg University, Denmark
Peter Keen, Delft University, Netherlands
Ralph Keeney, Duke University, USA
Kin Keung Lai, City University of Hong Kong, Hong Kong, China
Alexander V. Lotov, Russian Academy of Sciences, Russia
Robert Nease, Washington University School of Medicine, USA
Hasan Pirkul, University of Texas at Dallas, USA
David Poole, University of British Columbia, Canada
Thomas Saaty, University of Pittsburgh, USA
Mindia E. Salukvadze, Georgian Academy of Sciences, Georgia
Elie Sanchez, University of Mediterranée, France
Prakash P. Shenoy, University of Kansas, USA
Zhongzhi Shi, Chinese Academy of Sciences, China
Jian Song, Chinese Academy of Engineering, China
Ralph E. Steuer, University of Georgia, USA
Peizhuang Wang, Beijing Normal University, China
Andrew B. Whinston, University of Texas at Austin, USA
Po-Lung Yu, National Chiao Tung University, Taiwan, and University of Kansas, USA
Philip S. Yu, IBM T.J. Watson Research Center, USA
Lotfi A. Zadeh, University of California at Berkeley, USA
Milan Zeleny, Fordham University, USA
Hans-Jürgen Zimmermann, Aachen Institute of Technology, Germany

Program Committee
Hesham Ali, University of Nebraska at Omaha, USA
Daobin Chen, Industrial and Commercial Bank of China, China
Jian Chen, Tsinghua University, China
Xiaojun Chen, Hirosaki University, Japan
Zhengxin Chen, University of Nebraska at Omaha, USA
Chao-Hsien Chu, Pennsylvania State University, USA
John Chuang, University of California at Berkeley, USA
Xiaotie Deng, City University of Hong Kong, Hong Kong, China
Jiawei Han, University of Illinois at Urbana-Champaign, USA
Xirui Hao, Vision Software Inc., USA
Chongfu Huang, Beijing Normal University, China
Haijun Huang, Natural Science Foundation, China
Zhimin Huang, Adelphi University, USA
Deepak Khazanchi, University of Nebraska at Omaha, USA
Wikil Kwak, University of Nebraska at Omaha, USA
Heeseok Lee, Korea Advanced Institute of Science and Technology, Korea
Hongyu Li, Fudan University, China
Shanling Li, McGill University, Canada
Keying Ye, Virginia Polytechnic Institute and State University, USA
Yachen Lin, First North American Bank, USA
Jiming Liu, Hong Kong Baptist University, Hong Kong, China
Xiaohui Liu, Brunel University, UK
Yoshiteru Nakamori, Japan Advanced Institute of Science and Technology, Japan
David L. Olson, University of Nebraska at Lincoln, USA
Fuji Ren, Tokushima University, Japan
Hongchi Shi, University of Missouri-Columbia, USA
Minghua Shi, Dagong Global Credit Rating Co., China
Chengzheng Sun, Griffith University, Australia
Di Sun, China Construction Bank, China
Minghe Sun, University of Texas at San Antonio, USA
Tieniu Tan, Chinese Academy of Sciences, China
Zixiang Tan, Syracuse University, USA
Xiaowo Tang, Chinese University of Electronic Science and Technology, China
Xijing Tang, Chinese Academy of Sciences, China
James Wang, Pennsylvania State University, USA
Shouyang Wang, Chinese Academy of Sciences, China
Zhengyuan Wang, University of Nebraska at Omaha, USA
Yiming Wei, Chinese Academy of Sciences, China
Xindong Wu, University of Vermont, USA
Youmin Xi, Xi’an Jiaotong University, China
Lan Xue, Tsinghua University, China
Xiaoguang Yang, Chinese Academy of Sciences, China
Yixian Yang, Beijing University of Posts and Telecommunications, China
Zheng Yang, Sichuan University, China
Gang Yu, University of Texas at Austin, USA
Shuming Zhao, Nanjing University, China
Jianlin Zheng, University of Nebraska Medical Center, USA
Ning Zhong, Maebashi Institute of Technology, Japan
Zongfang Zhou, Chinese University of Electronic Science and Technology, China
Yangyong Zhu, Fudan University, China
Table of Contents
Keynote Lectures

Visualization-Based Data Mining Tool and Its Web Application
Alexander V. Lotov, Alexander A. Kistanov, Alexander D. Zaitsev ..... 1

Knowledge Management, Habitual Domains, and Innovation Dynamics
P. L. Yu, T. C. Lai ..... 11

Knowledge-Information Circulation Through the Enterprise: Forward to the Roots of Knowledge Management
Milan Zeleny ..... 22

Data Mining Methodology

A Hybrid Nonlinear Classifier Based on Generalized Choquet Integrals
Zhenyuan Wang, Hai-Feng Guo, Yong Shi, Kwong-Sak Leung ..... 34

Fuzzy Classification Using Self-Organizing Map and Learning Vector Quantization
Ning Chen ..... 41

Solving Discriminant Models Using Interior Point Algorithm
Siming Huang, Guoliang Yang, Chao Su ..... 51

A Method for Solving Optimization Problem in Continuous Space Using Improved Ant Colony Algorithm
Ling Chen, Jie Shen, Ling Qin, Jin Fan ..... 61

Practical Issues of Data Mining

Data Set Balancing
David L. Olson ..... 71

Computation of Least Square Estimates Without Matrix Manipulation
Yachen Lin, Chung Chen ..... 81

Ensuring Serializability for Mobile Data Mining on Multimedia Objects
Shin Parker, Zhengxin Chen, Eugene Sheng ..... 90

Data Mining for Bioinformatics

“Copasetic Clustering”: Making Sense of Large-Scale Images
Karl Fraser, Paul O’Neill, Zidong Wang, Xiaohui Liu ..... 99

Ranking Gene Regulatory Network Models with Microarray Data and Bayesian Network
Hongqiang Li, Mi Zhou, Yan Cui ..... 109

On Efficiency of Experimental Designs for Single Factor cDNA Microarray Experiments
Xiao Yang, Keying Ye ..... 119

Data Mining Applications

Data Mining Approach in Scientific Research Organizations Evaluation Via Clustering
Jingli Liu, Jianping Li, Weixuan Xu, Yong Shi ..... 128

Heuristics to Scenario-Based Capacity Expansion Problem of PWB Assembly Systems
Zhongsheng Hua, Liang Liang ..... 135

A Multiple-Criteria Quadratic Programming Approach to Network Intrusion Detection
Gang Kou, Yi Peng, Yong Shi, Zhengxin Chen, Xiaojun Chen ..... 145

Classifications of Credit Cardholder Behavior by Using Multiple Criteria Non-linear Programming
Jing He, Yong Shi, Weixuan Xu ..... 154

Multiple Criteria Linear Programming Data Mining Approach: An Application for Bankruptcy Prediction
Wikil Kwak, Yong Shi, John J. Cheh, Heeseok Lee ..... 164

Knowledge Management for Enterprise

Coordination and Cooperation in Manufacturer-Retailer Supply Chains
Zhimin Huang, Susan X. Li ..... 174

Development of Enterprises’ Capability Based on Cooperative Knowledge Network
Junyu Cheng, Hanhui Hu ..... 187

Information Mechanism, Knowledge Management and Arrangement of Corporate Strategem
Zhengqing Tang, Jianping Li, Zetao Yan ..... 195

An Integrating Model of Experts’ Opinions
Jun Tian, Shaochuan Cheng, Kanliang Wang, Yingluo Wang ..... 204

Risk Management

Cartographic Representation of the Uncertainty Related to Natural Disaster Risk: Overview and State of the Art
Junxiang Zhang, Chongfu Huang ..... 213

A Multi-objective Decision-Making Method for Commercial Banks Loan Portfolio
Zhanqin Guo, Zongfang Zhou ..... 221

A Multi-factors Evaluation Method on Credit Evaluation of Commerce Banks
Zongfang Zhou, Xiaowo Tang, Yong Shi ..... 229

Integration of Data Mining and Knowledge Management

A Novel Hybrid AI System Framework for Crude Oil Price Forecasting
Shouyang Wang, Lean Yu, K. K. Lai ..... 233

A Neural Network and Web-Based Decision Support System for Forex Forecasting and Trading
K. K. Lai, Lean Yu, Shouyang Wang ..... 243

XML-Based Schemes for Business Project Portfolio Selection
Jichang Dong, K. K. Lai, Shouyang Wang ..... 254

Author Index ..... 263
Visualization-Based Data Mining Tool and Its Web Application

Alexander V. Lotov (1), Alexander A. Kistanov (2), and Alexander D. Zaitsev (2)

(1) State University – Higher School of Economics, Moscow, Russia; Russian Academy of Sciences, Dorodnicyn Computing Centre; and Lomonosov Moscow State University
[email protected]
http://www.ccas.ru/mmes/mmeda/

(2) Lomonosov Moscow State University, Department of Systems Analysis
Abstract. The paper is devoted to a visualization-based data mining tool that helps to explore properties of large volumes of data given in the form of relational databases. It is shown how the tool can support the exploration of data properties and the selection of a small number of preferable items from the database by applying a graphic form of goal programming. The graphic Web application server that implements the data mining tool via the Internet is also considered. Its current and future applications are discussed.
1 Introduction

Data mining is a well-known approach to studying large volumes of data collected in databases. Statistical methods that are usually used in data mining help to discover new knowledge concerning the data. In this paper we consider a method for data mining that does not use statistical concepts, but supports the discovery of new information concerning the data collected in a relational database by computer visualization. Computer visualization of information has proved to be a convenient and effective technique that can help people to assess information. Usually one understands visualization as a transformation of symbolic data into geometric figures that are supposed to help human beings to form a mental picture of the symbolic data. About one half of the human brain’s neurons are associated with vision, and this fact provides a solid basis for successful application of visualization techniques. One can consider computer visualization of information as a direct way to its understanding. The visualization method considered in this paper is called the Interactive Decision Maps (IDM) technique. Along with other data mining techniques, the IDM technique helps to find new knowledge in large volumes of data (and even in mathematical models). However, in contrast to the usual data mining techniques that reveal some laws hidden in data volumes, the IDM technique provides information on the frontiers of the data variety. Moreover, being combined with the goal programming approach, the IDM technique helps to select small volumes of data which correspond to the interests of the user. By this, information on the data responsible for the form of the frontiers is discovered. The IDM technique proved to be compatible with the Internet and was implemented on the Web in the framework of a server-client structure.
The main idea of the IDM technique in the case of large relational databases consists in the transformation of the rows of a database into multi-dimensional points, in enveloping them, and in the subsequent exploration of the envelope. To be precise, it is assumed that the relational database contains a large list of items described by their attributes. Any item is associated with a row of the database, while the columns of the database represent attributes. Several (three to seven) numerical attributes specified by the user are considered as the selection criteria. Then, rows are associated with points in the criterion space. The IDM technique is based on enveloping the variety of criterion points (constructing the convex hull of the variety) and on-line visualization of the Pareto frontier of the envelope in the form of multiple decision maps. Applying the IDM technique, the user obtains information on feasible criterion values and on envelope-related criterion tradeoffs. An interactive exploration of the Pareto frontier with the help of the IDM technique is usually combined with goal identification: the user has to specify the preferable combination of criterion values (the goal). However, due to the IDM technique, the goal can be identified on a decision map directly on the display. Then, several rows from the list, which are close to the identified goal, are provided (Reasonable Goals Method, RGM).

The IDM/RGM technique was implemented in the form of a graphic Web application server. The Web application server uses the fundamental feature of the IDM technique that consists in separating the phase of enveloping the points from the phase of human study of the Pareto frontier and identification of the goal. Such a feature makes it possible to apply the IDM/RGM technique in the framework of a server-client structure. On the Web, such a structure is applied by using the opportunities of Java. Enveloping of the points is performed at the server, and a Java applet provides visualization of the Pareto frontier on-line and identification of the goal at the user’s computer.

The idea to visualize the Pareto frontier was introduced by S. Gass and T. Saaty in 1955 [1]. This idea was transformed into an important form of the multi-criteria methods by J. Cohon [2]. The IDM technique was introduced in the 1980s in the framework of the Feasible Goals Method (FGM) and applied in various economic and environmental studies (see, for example, [3-5]). The FGM is usually used to explore the Pareto frontier and select a reasonable decision in cases where mathematical models can be used. In contrast, the IDM/RGM technique, introduced in the 1990s [6], is aimed at the exploration of relational databases. It was used in several studies including water management in Russia [7] and national energy planning at the Israeli Ministry of National Infrastructures [8]. Other applications of the RGM/IDM technique are possible, too (see, for example, [9]). They are summarized in [5] and include selecting from large lists of environmental, technical, financial, personal, medical, and other decision alternatives. In this paper we concentrate on the Web application of the IDM technique. Experimental application of the IDM technique on the Web started as early as 1996 [3]. Its refined version based on Java technology was developed in 2000 in the form of a Web application server [10].
The first real-life application of the Web application server is related to supporting remote negotiations and decision making in regional water management (the Werra project, Germany). However, a wide range of applications of such a Web tool can be considered. Actually, any relational database can now be
analyzed through the Web in a simple way using the IDM/RGM technique. A database may contain statistical data (medical, demographic, environmental, financial, etc.), the results of experiments with technical or natural systems (say, data on a device performance), or the results of simulation experiments with models. The Web tool can be applied in e-commerce, too: for example, it can support selecting goods or services from large lists such as lists of real estate, second-hand cars, tourist tours, etc. The IDM/RGM technique can be used for the visualization of temporal databases, and so a graphic Web tool can be coded that studies temporal data by animation of the Pareto frontier via the Internet. This option can be especially important in financial management. The concept of the IDM/RGM technique and its Web implementation are described in the paper. First, the mathematical description of the technique is provided. Then, a description of the demo Web application server is given.
2 Mathematical Description

Let us consider a mathematical description of the IDM/RGM technique. We consider a table that contains N rows and several columns, any of which is related to an attribute. Let us suppose that the user has specified m attributes to be the selection criteria. Then, each row can be associated with a point of the m-dimensional linear criterion space $\mathbb{R}^m$. Criterion values for the row number j are described by the point $y^j$ whose coordinates are $(y_1^j, \dots, y_m^j)$. Since N rows are considered, we have got N criterion points $y^1, \dots, y^N$. The RGM is based on enveloping of the points, i.e. on constructing their convex hull, defined as

$$C = \Big\{ y \in \mathbb{R}^m : y = \sum_{j=1}^{N} \lambda_j y^j, \ \lambda_j \ge 0, \ \sum_{j=1}^{N} \lambda_j = 1 \Big\}.$$

Let us suppose that the maximization of the criterion values is preferable. In this case, the point $y'$ dominates (is better than) the point $y$ if $y' \ge y$ and $y' \ne y$. Then, the Pareto-efficient (non-dominated) frontier of $C$ is the variety of points of $C$ that are not dominated, i.e.

$$P(C) = \{ y \in C : \text{there is no } y' \in C \text{ such that } y' \ge y, \ y' \ne y \}.$$

The Edgeworth-Pareto Hull of the convex hull (CEPH), denoted by $C^*$, is the convex hull of the points broadened by the dominated points, i.e.

$$C^* = C - \mathbb{R}^m_+,$$

where $\mathbb{R}^m_+$ is the non-negative cone of $\mathbb{R}^m$. It is important that the efficiency frontier of the CEPH is the same as for the convex hull, but the dominated frontiers disappear. For this reason, the RGM applies an approximation of the variety $C^*$ instead of $C$. Approximation methods are described in detail in [5]. A two-criterion slice of $C^*$ passing through a point $z$ is defined as follows. Let us consider a pair of criteria, say u and v. Let $z^*$ denote the values of the rest of the criteria at the point $z$. Then, a two-criterion slice of the set $C^*$ related to the pair
(u, v) and passing through the point $z$ can be defined as (we do not care about the order of the criteria)

$$G(z^*) = \{ (y_u, y_v) : (y_u, y_v, z^*) \in C^* \}.$$
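Not part of the original paper: as a minimal sketch of how such a slice could be evaluated numerically, note that membership of a criterion combination in the CEPH reduces to a linear feasibility problem over the coefficients of the convex combination. The sketch below assumes a NumPy array `Y` of criterion points (one row per database record, all criteria to be maximized) and uses SciPy's `linprog`; the function and variable names are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

def in_ceph(Y, z):
    """Test whether point z belongs to the Edgeworth-Pareto hull
    C* = conv{y^1,...,y^N} - R^m_+ (maximization case).

    z is in C* iff there exist lambda_j >= 0 with sum(lambda_j) = 1
    such that sum_j lambda_j * y^j >= z componentwise.
    """
    N, m = Y.shape
    res = linprog(
        c=np.zeros(N),                            # pure feasibility problem
        A_ub=-Y.T, b_ub=-np.asarray(z, float),    # enforce Y.T @ lam >= z
        A_eq=np.ones((1, N)), b_eq=[1.0],         # convex combination
        bounds=[(0, None)] * N,
        method="highs",
    )
    return res.success

# A slice through fixed values z_rest of the remaining criteria can be
# rasterized by scanning a grid over the two displayed criteria:
# in_ceph(Y, np.r_[u, v, z_rest]) for each grid point (u, v).
```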
A collection of slices, for which the value of only one of the remaining criteria changes, constitutes a decision map. To identify the goal directly on the decision map, the user has to select a convenient decision map and a slice on it (by this, the values of all criteria except two are fixed). Then the identification of a goal vector is reduced to fixing the values of the two criteria given on the axes. It can be done by a click of the computer mouse. By this the goal vector $y^*$ is identified. Several points, which are close to the identified goal, are selected, and the related rows are provided to the user. Different variants of the concept of proximity can be applied. In our recent studies we apply the weighted Tchebycheff metric as the measure of distance between the goal $y^*$ and a point $y$:

$$\rho(y, y^*) = \max_{i=1,\dots,m} \alpha_i \, |y_i^* - y_i|,$$
where the parameters (weights) $\alpha_i$ are non-negative. The Tchebycheff metric has the sense of the maximal deviation of weighted criterion values. Usually the weighted Tchebycheff metric is applied with given values of the parameters $\alpha_i$ (see [11]). In this case, a point with the minimal value of the Tchebycheff metric is found. Clearly it is non-dominated. In our case, the only information provided by the user is the goal $y^*$. No information about the parameters is supposed to be provided by the user. To solve this problem, several approaches could be proposed. Here we describe one of them [6], which is used in the Web application server. All points are found that could be optimal if the whole variety of positive parameters $\alpha_i$ is used. Certainly it is impossible to solve the infinite number of optimization problems with all different sets of parameters. However, the following simple procedure may be used instead of solving an infinite number of optimization problems. In the first step of the procedure, a modified point is constructed for any original point in the following way: if a criterion value in the original point is better than the criterion value in the user-identified goal, the criterion value of the identified goal is substituted for the criterion value of the original point.

Fig. 1. Selection procedure of the RGM (it is preferable to increase the criterion values)
In the second step of the procedure, the Pareto domination rule is applied to the modified points. As a result, non-dominated points are selected from the modified points. Finally, the original feasible points that originated the non-dominated modified points are selected. The procedure is illustrated in Fig. 1 for the case of two criteria, which are subject to maximization. The reasonable goal identified by the user is denoted by the filled square symbol. Original points are represented by the filled circles. For any original point, a modified point is constructed: if a criterion value in an original point is better than in the reasonable goal, the goal value is substituted for the criterion value. Modified points are represented by hollow circles in Fig. 1. So, the feasible point 1 originates the modified point 1’, etc. If all criterion values for an original point are less than the aspiration levels (for example, point 3), the modified point coincides with the original one. Then, the Pareto domination rule is applied to the modified points: non-dominated points are selected among them. Point 2’ dominates point 1’, and point 4’ dominates point 5’. So, three non-dominated modified points are selected: 2’, 3, and 4’. Finally, the original feasible points which originated the non-dominated modified points are selected. In Fig. 1, these points are 2, 3 and 4. It is clear that the points selected through this procedure represent non-dominated rows (in the usual Pareto sense). One can see that the approximation of the CEPH and its exploration may be easily separated in time and space in the framework of the RGM/IDM technique. This feature of the RGM/IDM technique is effectively used in the Web application server.
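The two-step selection just described (clipping each point at the goal, then applying the Pareto domination rule) can be sketched in a few lines of code. The following is not from the original paper; it is a minimal Python illustration under the assumption that the criterion values are stored row-wise in an array `Y`, all criteria are to be maximized, and `goal` is the vector identified on the decision map. All names are illustrative.

```python
import numpy as np

def rgm_select(Y, goal):
    """Reasonable Goals Method selection: return indices of the rows
    whose modified points are Pareto non-dominated (maximization)."""
    Y = np.asarray(Y, dtype=float)
    goal = np.asarray(goal, dtype=float)

    # Step 1: clip each point at the goal (criterion values better than
    # the goal are replaced by the goal value).
    modified = np.minimum(Y, goal)

    # Step 2: keep rows whose modified points are not Pareto-dominated.
    selected = []
    for i, p in enumerate(modified):
        dominated = np.any(
            np.all(modified >= p, axis=1) & np.any(modified > p, axis=1)
        )
        if not dominated:
            selected.append(i)
    return selected

# Example: three criteria, goal picked on a decision map.
rows = [[5, 3, 2], [4, 4, 4], [1, 1, 9], [6, 2, 1]]
print(rgm_select(rows, goal=[5, 3, 3]))   # -> [0, 1]
```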
3 Web Application Server

The current Web application server based on the RGM/IDM technique is a prototype version of future Web application servers that will support easy selection of preferable alternatives from various tables using a simple graphic interface. This service can either be of general use, suited for any prepared table data, or domain specific, which enables some useful features and deeper integration with domain data. The Web service implements a multi-tier architecture and consists of the calculation server, the Web server application and the graphic presentation. The calculation server is an executable module coded in C++. It processes the given table data and builds the approximation of the CEPH. The calculation server is ANSI C++ compliant, so it can be compiled and executed on any platform. The main graphic presentation window is a Java applet executed inside the user's browser. MS Internet Explorer, v. 4.0 or higher, may be used to display it. The Web application is coded in Java and JSP and serves several interfacing purposes: it helps the user to prepare a table with alternatives, invokes the calculation server to process it, displays the applet with the calculated data and handles the user's choice to generate the selected alternatives. The Web application can be executed on any Web server that supports JSP and Java servlets. The Web tool is located at http://www.ccas.ru/mmes/mmeda/rgdb/index.htm

The user first has to specify the table to be explored. After the data input is completed and the query is submitted, the server envelops the criterion points and sends the Java applet along with the CEPH to the user's computer. The user can explore the decision maps (Fig. 2) for different numbers of bedrooms and bathrooms by specifying these numbers by moving the sliders of the scroll bars.
Fig. 2. A black-and-white copy of a decision map that describes feasible lot size and age for several values of price (in color on the display; shading here; thousands of US$) for the whole real estate table (as specified by the scroll bars, not less than two bedrooms and one bathroom are required at the moment)
The user may want to use animation (automatic movement of the sliders). He/she can see how the numbers of bedrooms and bathrooms influence the possible combinations of lot size, age and price. Using the slider of the color (shading) scroll bar, the user can specify a desired price. The slider is given on the color palette. The colors of the palette are practically not seen in the black-and-white picture given in the paper. Therefore, we propose to visit our Web application server to play with the animation of color decision maps. To identify a goal, the preferred numbers of bedrooms and bathrooms must be identified by the user first. The related decision map given in Fig. 3 differs from the one given in Fig. 2: several shadings (colors on the display) have disappeared. Then, the user has to identify a preferable combination of the values of the two criteria given in the map. To do it, the user has to use the cross that helps to identify the reasonable goal (Fig. 3). Once again, it is necessary to stress that a full freedom of choice with respect to the efficient combinations of criteria is given to the user, as is done in all methods for generating the Pareto-efficient frontier.
After the preferred position of the cross is specified, the user has to use the Fixation button. Then, the applet transmits the goal to the server, and the server returns the selected rows to the user.
Fig. 3. A decision map for preferred numbers of bedrooms and bathrooms with the cross that helps to identify the goal
The user receives a list of rows that are equal from the point of view of the Web server, but surely are not equal to the user. He/she has to choose one of these options by him/herself. Various methods for selecting the criterion points (rows of the table) can easily be applied in customized versions, but they were not included in the demo version. The demo version is restricted to 500 alternatives and five criteria. The full version can have up to seven criteria and several hundred thousand rows. Moreover, matrices of decision maps can be used in a customized version. Our prototype implements the general approach to the exploration of any table data. However, this service can be integrated into existing Web sites, Internet stores, online shops, portals, etc., everywhere where selection should be performed from large tables of homogeneous goods or services.
Integration with other Web sites can be done rather simply by modifying or replacing our Web application tier and integrating it with another one. In this case, the data can be prepared by or come from another Web application or Web site; then it is processed via the calculation server, examined via our graphic applet, a decision is made, and the selection results are displayed by means of the other Web site and in its context. For example, in most applications a lot of additional information about the alternatives, such as detailed descriptions, pictures, etc., must be shown with the list of selected alternatives. The calculation server (RGDB server) may reside on a special dedicated high-performance server or servers and communicate with the Web application via a simple external interface. This leads to an Application Service Provider (ASP) or utility architecture.
Fig. 4. ASP architecture
There is a server (the RGDB server in Fig. 4) with a well-defined interface, and users or other programs and Web servers that use the server. Hence, the Web application becomes independent of the calculation server and can be developed and deployed independently.
4 Summary The RGM/IDM technique helps to explore various relational databases. Due to enveloping, the user has an opportunity to explore the whole variety visually and select several interesting rows by a simple click of the computer mouse. It is
important that the RGM procedure is scalable – it can be used in the case of databases that contain even millions of rows. Many applications of the IDM/RGM technique for databases can be found. In addition to e-commerce, problems of e-logistics can be studied (partner selection, etc.). Another network application may be related to supporting network traders in various exchanges. For example, a day trader can visually monitor technical indicators of stocks provided via the network. A graphic display of stock information may help the trader to be the first to buy an advantageous security during a trading session. Important applications of the technique may be related to mobile DSS. Visualization provides a natural tool for informing remote users and inquiring about their preferences.
Acknowledgments Fraunhofer Institute for Autonomous Intelligent Systems, Germany, has partially supported coding of the demo version of the Web application server described here. We are grateful to Drs. Hans Voss, Natalia and Gennady Andrienko. Our research was supported by the Russian State Program for Supporting Scientific Schools (grant NSh-1843.2003.1), by Russian Foundation for Basic Research (grant 04-01-00662) and by Program no. 3 for Fundamental Research of Department of Mathematical Sciences of Russian Academy of Sciences.
References

1. Gass, S., Saaty, T.: The computational algorithm for the parametric objective function. Naval Research Logistics Quarterly 2 (1955) 39-51
2. Cohon, J.: Multiobjective Programming and Planning. John Wiley, New York (1978)
3. Lotov, A., Bushenkov, V., Chernov, A., Gusev, D., Kamenev, G.: INTERNET, GIS, and Interactive Decision Maps. J. of Geographical Information and Decision Analysis 1 (1997) 119-143, http://www.geodec.org/gida_2.htm
4. Lotov, A., Bushenkov, V., Kamenev, G.: Feasible Goals Method. Mellen Press, Lewiston, NY (1999, in Russian)
5. Lotov, A.V., Bushenkov, V.A., Kamenev, G.K.: Interactive Decision Maps. Kluwer Academic Publishers, Boston (2004)
6. Gusev, D.V., Lotov, A.V.: Methods for Decision Support in Finite Choice Problems. In: Ivanilov, Ju. (ed.): Operations Research. Models, Systems, Decisions. Computing Center of Russian Academy of Sciences, Moscow, Russia (1994, in Russian) 15-43
7. Bourmistrova, L., Efremov, R., Lotov, A.: A Visual Decision Making Support Technique and Its Application in Water Resources Management Systems. J. of Computer and System Science Int. 41 (2002) 759-769
8. Soloveichik, D., Ben-Aderet, N., Grinman, M., Lotov, A.: Multi-objective Optimization and Marginal Abatement Cost in the Electricity Sector – an Israeli Case Study. European J. of Operational Research 140 (2002) 571-583
9. Jankowski, P., Lotov, A., Gusev, D.: Multiple Criteria Trade-off Approach to Spatial Decision Making. In: Thill, J.-C. (ed.): Spatial Multicriteria Decision Making and Analysis: A Geographical Information Sciences Approach. Brookfield, VT (1999) 127-148
10. Lotov, A.V., Kistanov, A.A., Zaitsev, A.D.: Client Support in E-commerce: Graphic Search for Bargains in Large Lists. Working Paper N34, Fachbereich Wirtschaftwissenschaften, Institute fuer Wirtschaftsinformatik, University of Siegen, Germany (2001)
11. Steuer, R.E.: Multiple Criteria Optimization. John Wiley, New York (1986)
Knowledge Management, Habitual Domains, and Innovation Dynamics*

P. L. Yu (1) and T. C. Lai (2)

(1, 2) Institute of Information Management, National Chiao Tung University, 1001 Ta Hsueh Road, HsinChu City 300, Taiwan
[email protected]
[email protected]
Abstract. Knowledge Management (KM) with information technology (IT) has made tremendous progress in recent years. It has helped many people in making decisions and transactions. Nevertheless, without continuously expanding and upgrading our habitual domains (HD) and competence sets (CS), KM may lead us to decision traps and to making wrong decisions. This article introduces the concepts of habitual domains and competence set analysis in such a way that we can see where KM can commit decision traps and how to avoid them. Innovation dynamics, as an overall picture of continued enterprise innovation, is also introduced so that we can know the areas and directions in which KM can make maximum contributions and create value. Empowering KM with HD can make KM even more powerful.
1 Introduction

With the rapid advancement of Information Technology (IT), Knowledge Management (KM) has enjoyed rapid growth [4]. In the market, there are many software packages available to help people make decisions or transactions, such as supply chain management (SCM), enterprise resource planning (ERP), customer relationship management (CRM), accounting information systems (AIS), etc. [3], [7], [12]. In a nutshell, KM is useful because it can help certain people to relieve the pains and frustrations of obtaining useful information to make certain decisions or transactions. For salespersons, KM could provide useful information to help close sales. For credit card companies, KM could provide useful information about card holders’ credibility. For supply chain management, KM can efficiently provide where to get the needed materials, where to produce, and how to transport the product and manage the cash flow, etc.
* This research was supported by the National Science Council of the Republic of China, NSC92-2416-H009-009.
(1) Distinguished Chair Professor, Institute of Information Management, National Chiao Tung University, Taiwan, and C. A. Scupin Distinguished Professor, School of Business, University of Kansas, Kansas.
(2) Ph.D. Student, Institute of Information Management, National Chiao Tung University, Taiwan.
It seems that KM could do “almost everything” to help people make “any decision” with good results. Let us consider the following example.

Example 1: Breeding Mighty Horses. For centuries, many biologists paid attention and worked hard to breed endurable, mighty working horses, so that the new horse would be durable, controllable and would not have to eat. To their great surprise, their dream was realized by mechanists, who invented a kind of “working horse”: tractors. The biologists’ decision trap and decision blind are obvious. The biologists habitually thought that to produce the mighty horses, they had to use “breeding methods”, a bio-tech: a decision trap in their minds. Certainly, they made progress. However, their dream could not be realized.

IT or KM, to a certain degree, is similar to breeding, a bio-tech. One wonders: is it possible that IT or KM could create traps for people so as to make wrong decisions or transactions? If it is possible, how could we design a good KM that can minimize the possibility of decision traps and maximize the benefits for the people who use it? Since humans are involved, habitual domains (HD) and competence set analysis must be addressed in order to answer the above questions. We shall discuss these concepts in the next section.

As KM is based on IT, its modules can handle only “routine” or “mixed routine” problems. It may help solve “fuzzy problems”. But we must decompose these fuzzy problems into series of routine problems first. For challenging decision problems, the solutions are beyond our HD and KM. Certain insights are needed. We shall discuss these topics in Section 3. Finally, for an enterprise to continuously prosper and be competitive, it needs continuous innovation in technology, management, marketing, financing, distribution logistics, etc. [5]. For a systematic view of innovation, we introduce “Innovation Dynamics” in Section 4. The introduction will help us to locate the areas and directions in which KM can be developed so as to maximize its utilization and create value. At the end, some concluding remarks will be offered.
2 Habitual Domains and Competence Set Analysis From Example 1, we see that one’s judging and responding could be inefficient or inaccurate if his or her ways of thinking get trapped rigidly within a small domain. To further expound this concept, let us describe the known concept of Habitual Domains. For details, see Refs. [8] and [9].
2.1 Habitual Domains Each person has a unique set of behavioral patterns resulting from his or her ways of thinking, judging, responding, and handling problems, which gradually stabilized within a certain boundary over a period of time. This collection of ways of thinking, judging, etc., accompanied with its formation, interaction, and dynamics, is called habitual domain (HD). Let us take a look at an example. Example 2: Chairman Ingenuity. A retiring corporate chairman invited to his ranch two finalists, say A and B, from whom he would select his replacement using a horse
race. A and B, equally skillful in horseback riding, were given a black and a white horse, respectively. The chairman laid out the course for the horse race and said, “Starting at the same time now, whoever’s horse is slower in completing the course will be selected as the next Chairman!” After a puzzling period, A jumped on B’s horse and rode as fast as he could to the finish line while leaving his own horse behind. When B realized what was going on, it was too late! Naturally, A was the new Chairman. Most people assume that the faster horse will be the winner of a horse race (a habitual domain). When a problem is not in our HD, we are bewildered. The above example makes it clear that one’s habitual domain can be helpful in solving problems, but it can also get in the way of his or her thinking. Moreover, one may be distorting information in a different way. Our habitual domains go wherever we go and have a great impact on our decision making. As our HD will, over a period of time, gradually become stabilized, unless there is an occurrence of extraordinary events or we purposely try to expand it, our thinking and behavior will reach some kind of steady state and become predictable.
At any point in time habitual domains, denoted by will mean the collection of the above four subsets. That is, In general, the actual domain is only a small portion of the reachable domain; in turn, the reachable domain is only a small portion of potential domain, and only a small portion of the actual domain is observable. Note that changes with time. We will take an example to illustrate and Example 3. Assume we are taking an iceberg scenic trip. At the moment of seeing an iceberg, we can merely see the small part of the iceberg which is above sea level and faces us. We cannot see the part of iceberg under sea level, nor see the seal behind the back of iceberg (see Fig. 1). Let us assume t is the point of time when we see the iceberg, the portion which we actually see may be considered as the actual domain in turn, the reachable domain could be the part of iceberg above sea level including the seal. The potential domain could be the whole of the iceberg including those under the sea level.
Fig. 1. Illustration of PD_t, RD_t, and AD_t
At time t, if we do not pay attention to the back side of the iceberg, we will never find the seal. In addition, we can never see the spectacular iceberg if we do not dive into the sea. Some people might argue that it is nothing special to see a seal on the iceberg. But what if it is a box of jewelry rather than a live seal! This example illustrates that the actual domain can easily get trapped in a small domain as a result of concentrating our attention on solving certain problems. In doing so, we might overlook the tremendous power of the reachable domain and the potential domain. In the information era, even though the advances of IT and KM can help solve people’s decision problems, our actual domain could still easily get trapped, leading us to make a wrong decision or action.

Example 4: Dog Food. A dog food company designed a special package that not only was nutritious, but also could reduce dogs’ weight. The statistical market testing was positive. The company started “mass production”. Its dog food supply was far short of meeting the overwhelming demand. Therefore, the company doubled its capacity. To their big surprise, after one to two months of excellent sales, the customers and the wholesalers began to return the dog food packages, because the dogs did not like to eat the food.
Clearly, a decision trap was committed by using statistics on the buyers, not on the final users (the dogs). The KM used a statistical method on the “wrong” subject and committed the trap. If the RD (reachable domain) of the KM could include both the buyers and the users, the decision traps and wrong decisions might be avoided. There are many methods for helping us to improve or expand our habitual domains and avoid decision traps. We list some of them in the following two tables. The interested reader is referred to Refs. [8] and [9] for more detail.
2.2 Competence Set and Cores of Competence Set

For each decision problem or event E, there is a competence set consisting of the ideas, knowledge, skills, and resources needed for its effective solution. When the decision maker (DM) thinks he/she has already acquired and mastered the competence set as perceived, he/she would feel comfortable making the decision. Note that, conceptually, the competence set of a problem may be regarded as a projection of a habitual domain on the problem. Thus, it also has a potential domain, an actual domain, a reachable domain, and an activation probability, as described in Sec. 2.1. Also note that through training, education, and experience, a competence set can be expanded and enriched (i.e. its number of elements can be increased and their corresponding activation probabilities can become larger) [8, 9, 10, 11].
Given an event or a decision problem E which catches our attention at time t, the probability or propensity that an idea I or element of Sk (or HD) can be activated is denoted by $P_t(I|E)$. Like a conditional probability, we know that $P_t(I|E) = 0$ if I is unrelated to E or I is not an element of $PD_t$ (the potential domain) at time t, and that $P_t(I|E) = 1$ if I is automatically activated in the thinking process whenever E is presented. Empirically, like probability functions, $P_t(I|E)$ may be estimated by determining its relative frequency. For instance, if I is activated 7 out of 10 times whenever E is presented, then $P_t(I|E)$ may be estimated at 0.7. Probability theory and statistics can then be used to estimate $P_t(I|E)$. The $\alpha$-core of the competence set Sk at time t, denoted by $\mathrm{core}_\alpha(Sk_t)$, is defined to be the collection of skills or elements of Sk that can be activated with a propensity larger than or equal to $\alpha$. That is,

$$\mathrm{core}_\alpha(Sk_t) = \{ I \in Sk_t : P_t(I|E) \ge \alpha \}.$$
3 Classification of Decision Problems

Let the truly needed competence set at time t, the acquired skill set at time t, and the $\alpha$-core of the acquired skill set at time t be denoted by $Tr_t$, $Sk_t$ and $\mathrm{core}_\alpha(Sk_t)$, respectively. Depending on $Tr_t$, $Sk_t$ and $\mathrm{core}_\alpha(Sk_t)$, we may classify decision problems into the following categories:

1. If $Tr_t$ is well known and contained in $Sk_t$ or $\mathrm{core}_\alpha(Sk_t)$ with a high value of $\alpha$, then the problem is a routine problem, for which satisfactory solutions are readily known and routinely used.
2. A mixed-routine problem consists of a number of routine sub-problems; we may decompose it into a number of routine problems to which the current IT can provide the solutions.
3. If $Tr_t$ is only fuzzily known and may not be contained in $\mathrm{core}_\alpha(Sk_t)$ with a high value of $\alpha$, then the problem is a fuzzy problem, for which solutions are only fuzzily known. Note that once $Tr_t$ is gradually clarified and contained in $\mathrm{core}_\alpha(Sk_t)$ with a high value of $\alpha$, the fuzzy problem may gradually become a routine problem.
4. If $Tr_t$ is very large relative to $Sk_t$, no matter how small $\alpha$ is, or if $Tr_t$ is unknown and difficult to know, then the problem is a challenging problem.
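As a rough illustration of these categories (not from the original article), the sketch below encodes a competence set as a set of skill labels with activation propensities and classifies a problem by how much of the truly needed set is covered by the alpha-core. The thresholds, skill names and function names are illustrative assumptions only, and the decomposition of mixed-routine problems is omitted.

```python
def alpha_core(skills, propensity, alpha):
    """Skills in Sk_t that can be activated with propensity >= alpha."""
    return {s for s in skills if propensity.get(s, 0.0) >= alpha}

def classify(truly_needed, acquired, propensity, alpha=0.8):
    """Rough classification of a decision problem (illustrative only)."""
    if truly_needed is None:
        return "challenging"          # Tr_t unknown or hard to know
    core = alpha_core(acquired, propensity, alpha)
    missing = truly_needed - core
    if not missing:
        return "routine"              # Tr_t covered by the alpha-core
    if len(missing) >= len(truly_needed):
        return "challenging"          # Tr_t far beyond Sk_t
    return "fuzzy"                    # partially covered, partially unclear

# Example: selling a new product needs four skills, two are well mastered.
needed = {"pricing", "negotiation", "logistics", "regulation"}
acquired = {"pricing", "negotiation", "logistics"}
prop = {"pricing": 0.9, "negotiation": 0.85, "logistics": 0.4}
print(classify(needed, acquired, prop))   # -> "fuzzy"
```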
So far, KM with IT can accelerate decisions for the routine or mixed-routine problems. There still are many challenging problems, which cannot be easily solved by KM with IT. This is because the needed competence set of a challenging problem is unknown or only partially known, especially when humans are involved. The following illustrates this fact. Example 5: Alinsky’s Strategy (Adapted from Alinsky [1]). In 1960 African Americans living in Chicago had little political power and were subject to discriminatory treatment in just about every aspect of their lives. Leaders of the black community invited Alinsky, a great social movement leader, to participate in their effort. Alinsky clearly was aware of deep knowledge principles. Working with black leaders
he came up with a strategy so alien to the city leaders that they would be powerless to anticipate it. He would mobilize a large number of people to legally occupy all the public restrooms of O’Hare Airport. Imagine the thousands of individuals who visit the airport daily, hydraulically loaded (a very high level of charge), rushing for the restrooms, but with no place for all these persons to relieve themselves. How embarrassing it would be when the newspapers and media around the world headlined and dramatized the situation. As it turned out, the plan never was put into operation. City authorities found out about Alinsky’s strategy and, realizing their inability to prevent its implementation and its potential for damaging the city’s reputation, met with the black leaders and promised to fulfill several of their key demands.

The above example shows us the importance of understanding one’s potential domain. At the beginning, the African Americans did not entirely know the habitual domain of the city authorities (a challenging problem). Their campaigns, such as demonstrations, hunger strikes, etc., failed to reach their goal (an actual domain). Alinsky observed a potentially high level of charge of the city authorities and the public opinion (potential domain) that could force them to act. As a result, the authorities agreed to meet the key demands of the black community, with both sides claiming a victory.
4 Innovation Dynamics

Without creative ideas and innovation, our lives will be bound within a certain domain and become stable. Similarly, without continuous innovation, our business will lose its vitality and competitive edge [2], [6]. Bill Gates indicated that Microsoft would collapse in about two years if it did not continue to innovate. In this section, we are going to explore innovation dynamics based on Habitual Domains (HD) and Competence Set (CS) analysis so as to increase the competitive edge.

From HD theory and CS analysis, all things and humans can release pains and frustrations for certain groups of people in certain situations and at certain times. Thus all humans and things carry competence (in a broad sense, including skills, attitudes, resources, and functionalities). For instance, a cup is useful when we need a container to carry water, so as to release our pains and frustrations of having no cup. The competitive edge of an organization or a human can be defined as the capability to provide the right services and products at the right price to the target customers earlier than the competitors, so as to release their pains and frustrations and make them satisfied and happy. To be competitive, we therefore need to know what the customers’ needs would be, so as to produce the right products or services at a lower cost and faster than the competitors. At the same time, given a product or service of certain competence or functionality, we need to know how to reach out to the potential customers so as to create value (the value is usually positively related to how much we could release the customers’ pains and frustrations).

If we abstractly regard all humans and things as sets of different CS, then producing new products or services can be regarded as a transformation of the existing CS into a new form of CS. Based on this, we could draw the clockwise innovation dynamics as in Fig. 2:
Fig. 2. Clockwise Innovation Dynamics
Although Fig. 2 is self-explanatory, the following notes are worth mentioning (the numbers correspond to those of the figure):

Note 1: According to HD theory, unfavorable discrepancies between the current states and the ideal goals (for instance, losing money instead of making money, or being technologically behind, instead of ahead of, the competitors) will create mental charges which can prompt us to work harder to reach our ideal goals.

Note 2: Producing products and services is a matter of transforming the CS from its existing form to a new form.

Note 3: Our product could release the charges and pains of a certain group of people and make them satisfied and happy.

Note 4: The organization can create or release the charges of a certain group of people through advertising, marketing and selling.

Note 5: The target group of people will experience the change of charges. When their pains and frustrations are relieved by buying our products or services and they become happy, the products and services can create value, which is Note 6.

Note 7 and Note 8 are, respectively, the distribution of the created value and reinvestment. To gain the competitive edge, products and services need to be continuously upgraded and changed. The reinvestment, Note 8, is needed for research and development for producing new products and services.
In contrast, the innovation dynamics can also be counter-clockwise. We could draw the counter-clockwise innovation dynamics as in Fig. 3:
Fig. 3. Counter-clockwise Innovation Dynamic
Note 1: According to HD theory, unfavorable discrepancies between the current states and the ideal goals will create mental charges which can prompt us to work harder to reach our ideal goals.

Note 2: In order to make a profit, an organization must create value.

Note 3: According to CS analysis, all things carry competence which can release pains and frustrations for certain groups of people in certain situations and at certain times.

Note 4: New business opportunities could be found by understanding and analyzing the pains and frustrations of certain groups of people.

Note 5: Reallocation or expansion of the competence set is needed for innovating products or services that release people’s pains and frustrations.

Innovation needs creative ideas, which are outside the existing HD and must be able to relieve the pains and frustrations of certain people. From this point of view, the methods of expanding and upgrading our HDs become readily applicable. Innovation can be defined as the work and process of transforming creative ideas into reality so as to create the expected value. It includes planning, executing (building structures, organization, processes, etc.), and adjustment. It could demand hard work, perseverance, persistence and competences. Innovation is, therefore, a process of transforming the existing CS toward a desired CS (product or service).
5 Conclusions

With the advances of information technologies, including computers, networks, etc., Knowledge Management (KM) has been rapidly innovated in recent years. KM is useless if it cannot help some people release their frustrations and pains, or if it cannot help them make better decisions. The challenge for KM nowadays is how it can be developed to maximize its value in helping to solve complex problems. This article discussed four categories of decision problems: routine, mixed-routine, fuzzy, and challenging problems. Many routine problems can be solved by KM/IT. Mixed-routine and fuzzy problems may be decomposed into a number of solvable routine sub-problems. As to challenging problems, one must expand his/her habitual domain, or think deeper into the reachable domain or even the potential domain, to find effective solutions and avoid decision traps. This paper also addressed “Innovation Dynamics” for a systematic view of innovation. Though KM can clarify what the needed competence set is, and may speed up the process of expanding the competence set, KM/IT may also lead us into traps and so into making wrong decisions or transactions. This is most likely when we are confronted with challenging problems and we are in a state of high charge. Many research problems are open for exploration. For instance, in the innovation dynamics, each link of Figs. 2 and 3 involves a number of routine, fuzzy and challenging problems. How do we use KM/IT, HD, and CS to help the decision maker make good (optimal) decisions easily and quickly, so that we could relieve their pains and frustrations, and create value?
Knowledge-Information Circulation Through the Enterprise: Forward to the Roots of Knowledge Management

Milan Zeleny

Fordham University, New York, USA
Tomas Bata University, Zlín, CR
[email protected]
[email protected]
Abstract. The field of Knowledge Management (KM) has already completed its initiatory phase, characterized by operational confusion between knowledge and information, stemming from the tenuous notion of “explicit knowledge”. Consequently, the progress of KM has been much slower than would the significance of knowledge management in a modern enterprise indicate. Here we propose and discuss four cornerstones for returning to the roots of knowledge management and so moving forward towards a new phase of KM. We discuss the roots of reliable knowledge thinking and theory in economics, management and philosophy. Then we formulate clear, unambiguous and pragmatic definitions and distinctions of knowledge and information, establish simple and natural measures of the value of knowledge and propose the Knowledge-Information (KnowIn) continuum and its circulatory nature in managing knowledge of the enterprise. Autopoietic cycle A-C-I-S is elaborated to that purpose. We conclude the paper by discussing some implications of the new KM for strategy and strategic management.
1 Introduction

The field of Knowledge Management (KM) has already completed its initial cycle of relative euphoria and fashion with rather unimpressive practical results. This is because KM lacked a reliable and self-confident definition and differentiation from information, information management and IT applications. This allowed an “easy entry” of a large variety of enthusiasts who were able to interpret “knowledge” in whichever way suited them. This phenomenon is well documented by an unusual swell of thousands of KM books and articles. Opportunistic entries resulted in equally opportunistic exits. Consequently, the field of KM has lost its way [8]. Yet, knowledge-based strategy, and therefore also KM, undoubtedly represents one of the most significant advances in economics, management and business enterprise of the modern era. The earliest expositions and formulations of Knowledge Management come from the 1980s, as for example in [12, 13]. At least four cornerstones have to be re-established before fully capitalizing on the KM promise of such import and magnitude:
1. Return to the firm roots of reliable knowledge thinking and theory in economics, management and philosophy.
2. Formulate clear, unambiguous and pragmatic definitions and distinctions of knowledge and information.
3. Establish simple and natural measures of the value of knowledge.
4. Propose the Knowledge-Information (KnowIn) continuum and its circulatory nature in the enterprise.
Other aspects, like strategy, technology, human resources and organizational environment are also important, but can be more or less derived from the above four cornerstones of conceptual foundations of KM. Observe that all four cornerstones are interconnected in a legacy progression, the next always based on the preceding one. In this paper we concentrate on outlining the four cornerstones, with a short conclusion exploring the nature of strategy and strategic management from the vantage point of the new KM.
2 Forward to the Roots of Knowledge

Although we have to return to the roots [3, 4], in the context of KM such a move represents a step forward. This apparent contradiction is intentional. A useful and practical philosophical foundation of knowledge comes from American pragmatists, especially from C. I. Lewis’s system of conceptualistic pragmatism [5], rooted in the thought of Peirce, James and Dewey [2]. Pragmatist philosophical roots firmly established that knowledge is:
1. Action oriented
2. Socially established
3. Relatively interpreted
First, knowledge is action. This is also echoed in Polanyi’s “All knowledge is tacit” [9]. There is no “explicit” knowledge, only information. Second, knowledge is consensually social, and without a social context there can be no knowledge. Third, although the “given” of sensory data and experience remains absolute, its classification and its relation to other things are relative to a given context of experience and intended action. Lewis captured the social dimension of knowledge through his term community of action. Congruity of behavior and consensual human cooperation are the ultimate tests of shared knowledge. The purpose of communication is coordination of action and behavior; it is therefore essential that all of its aspects remain consensual. Knowledge cannot be separated from the process of knowing (establishing relationships). Knowledge and knowing are identical: knowledge is process. What is meant when we say that somebody knows or possesses knowledge? We imply that we expect one to be capable of coordinated action towards some goals and objectives. Coordinated action is the test of possessing knowledge. Knowledge without action reduces to simple information or data. Maturana and Varela [6] put it very succinctly: All doing is knowing, and all knowing is doing. Clearly, “explicit knowledge”, repositories of data and information (data banks, encyclopaedias, expert systems) are only passive recordings, descriptions of
knowledge. Only coordinated human action, i. e., process of relating such components into coherent patterns, which turn out to be successful in achieving goals and purposes, qualifies as knowledge. Among the myriads of possible postulated relationships among objects, only some result in a coordinated action. Every act of knowing brings forth a world. We “bring forth” a hypothesis about the relationships and test it through action; if we succeed in reaching our goal - we know. Bringing forth a world of coordinated action is human knowledge. Bringing forth a world manifests itself in all our action and all our being. Knowing is effective [i. e., coordinated and “successful”] action. Knowledge as an effective action enables a living (human) being to persist in its coordinated existence in a specific environment from which it continually brings forth its own world of action. All knowing is coordinated action by the knower and therefore depends on the “structure” of the knower. The way knowledge can be brought forth in doing depends on the nature of “doing” as it is implied by the organization of the knower and his circumstance (working environment).
3 Definition of Knowledge

A clear, unambiguous and operational definition of knowledge is essential; without it the field of KM cannot progress in either theory or practice. Based on the preceding philosophical foundations, we can advance the simplest possible definitions for the purposes of effective KM [14].

Knowledge is purposeful coordination of action. The quality and effectiveness of the achieved purpose is the evidence (and measure) of knowledge.

Information is symbolic description of action. Any action, past, current or future, can be described and captured through symbols. All such descriptions are information. All those rules, formulas, frames, plans, scripts, and semantic networks are information, not forms of knowledge. It is not a set of rules or a formal representation of knowledge, i. e. information, that is critical to intelligence, but rather the mind’s coordination of the body’s experiences and actions, i. e. knowledge. Knowledge is rooted in each individual’s actions, behavior and experiences and is therefore partially embedded in the process that is being coordinated.

The differences between knowledge and information are significant, qualitative and striking – as the differences between action and its description should be. I know because I do. I have information because I describe. There can be too much information (information overload), but there can never be too much knowledge: there is no knowledge overload. Information is only one of the inputs into the coordination of the process. Knowledge is the coordination itself. There can be too many inputs, but coordination can only be better or worse. Information can be correct or incorrect, right or wrong, true or misleading. Knowledge can only be more or less effective.
Knowledge is always gradual, from less to more (effective). In this sense, it is not correct or incorrect: it is not an input. Knowledge refers to the processing of inputs through coordination of action. The rules of coordination (sequences, patterns, levels of performance), derived from experience, observation, consensus or social prescription, are characteristic of knowledge, not of information. What these rules are and how they are followed are among the determinants of the forms of knowledge.

Skills. If the rules are internally determined and controlled by the subject, we speak of skills. Skills can be validated by the action’s outcome only. There is no need for social sanction or approval of the rules. Robinson Crusoe has skills, as all autodidacts have skills. Neither has knowledge.

Knowledge. If the rules adhered to are established externally, in a social context and validation, then we can speak of knowledge rather than skills. Knowledge is recognized and validated socially. (One cannot say “I know” – unless one is an autodidact (amateur or dilettante) and thus self-exempt from the rules. Only others – family, community or society – can testify to one’s knowledge.) One cannot claim knowledge without proper social validation.

Expertise. If the external rules are mastered and performed at a socially respected degree, and if the actor can reflect upon the rules with respect to their improvement or change, then knowledge becomes expertise. An expert gains socially sanctioned power over the rules so that they no longer need to be obeyed. Expertise is an acquired ability to change the rules.

Observe that the difference between skills and knowledge is not based on the outcome. A skillful person can sometimes achieve a better outcome than a knowledgeable person, but it is not equally socially recognized and valued. Skill is based on the outcome only. Knowledge is based on both the outcome and the process leading to it. Expertise is masterful knowledge and cannot grow out of skills.

While skills, knowledge and expertise are all related to know-how – how to achieve a given or stated purpose – or to know-what – how to state or select a purpose to be pursued – the notion of wisdom is related to know-why. Knowledge is related to both efficiency (know-how) and effectiveness (know-what), while wisdom is related to explicability (know-why). Having information is far from being knowledgeable. Being knowledgeable still does not imply wisdom. One can be knowledgeable without being wise. Many use information and follow given rules efficiently: they acquire dexterity and become specialists. Others choose their goals and change the rules with the approval of others – and become experts. But even the masters of rules and purposes are not wise if they cannot satisfactorily explain why particular purposes, rules or courses of action should be chosen or rejected. Wisdom is socially accepted or experience-validated explication of purpose. Enhancing human wisdom, pursuing practices and systems that are not only efficient or effective, but also wise, i. e., building wisdom systems, is the next frontier of the long and tortuous progression from data and information to knowledge and wisdom.

It is probably useful to expand on a definition of communication.
Communication is closely related to both knowledge and information. Conventional wisdom would weaken the usefulness of the concept of communication by including any information transfer in its domain. We communicate with each other through language. Language is a system of symbolic descriptions of action. We exchange these symbolic labels (information) in order to coordinate our action and modify behavior. When such coordination or modification occurs, we communicate. When it does not, we just transfer information. Communication occurs when the result of a particular exchange of information (e. g., linguistic labels) is the coordination of action (doings, operations) or modification of behavior. Clearly, language is not a system of communication, yet communication occurs through language. What is the difference between action and behavior? Action is the result of deliberate decision making [15] within new contexts and circumstances. Behavior is a habitual or automated response to repeating circumstances within a known context. Both are affected by communication. Communication is consequential exchange of information.
4 Natural Measure of Knowledge

Knowledge must be measured in a simple, natural way, not through a complex artificial formula or construct. Based on the definition of knowledge as purposeful coordination of action, one can derive a natural measure of knowledge as the value attributed to coordination. Knowledge is neither intangible nor abstract, and it is not difficult to measure. Knowledge produces very tangible outcomes of real value to the approving society. Information, as a description of action, may be difficult to measure – it has no tangible outcome per se. The value of information is intangible, unless it becomes an input into measurable action, i. e. knowledge. Action itself (knowledge) is eminently measurable because its outcomes can be observed, measured and valued.

Knowledge is measured by the value that our coordination of effort, action and process adds to inputs of material, technology, energy, services, information, time, etc. Knowledge is measured by added value. The value of any produced item, product or service, is a combination of purchased or otherwise externally or internally acquired inputs and of work and labor (coordinated performance of the operations constituting the process). This value has to be socially recognized and accepted: by the market, by the purchaser, sponsor, peer group, community, family and so on. If nobody wants my product, then it is irrelevant how many inputs and how much time and effort I have expended: my knowledge has no value. If somebody pays for my product (in money or in kind), then its market or social value has been established. To derive the value of knowledge, we have to correct the value of the product by subtracting all (including information) external and internal
purchases (their market value) or used and otherwise valued acquisitions. In a corporate setting, we also subtract operating cost and general administrative cost. As a result we obtain added value (to inputs), or added value per hour or per worker. Added value so conceived is due to action or process, its performance and coordination.

There are three components of added value: labor, work and coordination. One has to pay wages for labor (performance of externally coordinated operations) and work (internally coordinated operations). In addition, one has to pay salaries for any employed coordination services. Observe that both wages and salaries can only be covered from the added value. Labor, work and management are not (or should not be) inputs, but forms of coordination and performance of the process. If no value has been added, no payment of wages and salaries can be sustained.

“Work” can be defined as economically purposeful activity requiring substantial human coordination of task and action. “Job” designates the kind of work that is performed contractually, that is, explicitly for remuneration and in the employ of others. “Labor” (often used as a synonym for hard work or toil) can more properly be related to performing simplified work components or tasks without engaging in their substantial coordination towards given purposes. Work often involves labor but not vice versa. Work involves coordination of tasks, while labor relates only to their performance. After we subtract from added value the cost of labor (considered a material input), what remains is the value of knowledge applied to the process. Added value measures knowledge, the contribution of coordination of action through work and management.

The relativity of the value of knowledge is clear. The same expenditure of coordination effort, time, skills and work can have great value in one context and no value in another. The same level of knowledge can have great value in New York and no value in Prague – and vice versa. All knowledge is relative and its value is derived from the context of its application. This is why knowledge cannot be measured from inputs and through a priori expenditures of time, effort and skills. Knowledge is not primary but secondary, a derived category: derived from the value of its outcome. The amount of knowledge does not determine the value of its outcome; rather, the value of the outcome determines the value of the knowledge applied. No amount of information, duration of study, hard work or dedicated effort can guarantee the value of knowledge. All such effort has to be socially accepted and sanctioned, its value affirmed and validated. Otherwise it can be wrong, misplaced, useless and unvalued – regardless of the effort.

In education we mostly acquire information (description of action), not knowledge (action itself). We study cookbooks but rarely learn to cook. Information is necessary, potentially useful, and easy to transmit. But information is not knowledge. In a world of global communications and information sharing we are less and less going to be paid for having information and more and more for knowing, for being able to coordinate action successfully (pay for knowledge). The value of education rooted in information is going to decline; education for knowledge is going to rise. In this context, it becomes apparent that confusing information with knowledge is rapidly becoming counterproductive. After reading hundreds of cookbooks, I am still not a viable chef.
I still do not know how to coordinate action, my own or others’.
After reading hundreds of textbooks on management, I am still not a manager. I still do not know how to manage an enterprise, my own or others’. One of the cruelest outcomes of education is instilling in inexperienced novices the feeling that information is knowledge. Studying a description of action does not guarantee knowledge of the action. This is why even the oxymoronic connection “explicit knowledge”, implying that somehow a symbolic description is some sort of “knowledge”, is not only confusing and unscientific, but also damaging and fundamentally untrue. Witness K. E. Sveiby [10]: “All knowledge is either tacit or rooted in tacit knowledge. All our knowledge therefore rests in the tacit dimension,” or M. Polanyi [9]: “Knowledge is an activity which would be better described as a process of knowing.” So it would be. To know is to do. The field of KM has to abandon its initial cycle and leap forward to its roots.
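As a hedged illustration of the added-value measure described in this section, the arithmetic reduces to a chain of subtractions; all figures and cost categories below are invented for the example and are not taken from the paper.

```python
# Hypothetical yearly figures for one product line (illustration only).
revenue = 1_000_000           # socially validated value: what buyers actually paid
external_purchases = 420_000  # materials, energy, services, information bought in
operating_cost = 150_000
admin_cost = 80_000

# Added value is what remains to cover wages (labor, work) and salaries (coordination).
added_value = revenue - external_purchases - operating_cost - admin_cost

labor_cost = 200_000          # labor treated as a material-like input, per the text
value_of_knowledge = added_value - labor_cost

print("added value:", added_value)                 # 350000
print("value of knowledge:", value_of_knowledge)   # 150000
```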
5 KnowIn Circulatory System

It is important that knowledge and information become interconnected in an integrated, mutually enhancing system: an autopoietic self-production cycle of KnowIn circulation. Clearly, there is a useful connection between action and its description, between knowledge and information. While knowledge management should include information management, information management cannot include knowledge management. A process can include its inputs, but no single input can include its process. Knowledge produces more knowledge with the help of intermediate information. The purpose is to produce more knowledge, not more information. In order to do that effectively, we have to integrate knowledge and information (KnowIn) flows into a unified system of transformations. It is insufficient, although necessary, to manage, manipulate, mine and massage data and information. It is incomplete and inadequate to manage knowledge without managing its descriptions. It is both necessary and sufficient to manage integrated and interdependent KnowIn flows. The purpose of knowledge is more knowledge, not more information. Useful knowledge is codified into its recording or description. Obtained information is combined and adjusted to yield actionable information. Actionable information forms an input into effective coordination of action (knowledge). Effective knowledge is then socialized and shared, transformed into useful knowledge. In short, the cycle can be broken into its constituent transformations:
1. Articulation: knowledge → information
2. Combination: information → information
3. Internalization: information → knowledge
4. Socialization: knowledge → knowledge
These labels are due to Nonaka’s [7] transitions of knowledge: tacit to explicit, Articulation; explicit to explicit, Combination; explicit to tacit, Internalization; and tacit to tacit, Socialization. They are not separate dimensions and should not be separately treated. The above sequence A-C-I-S of KnowIn flows is continually repeated in a circular organization of knowledge production. Every enterprise, individual or collective, is engaged in two types of production:
1. Production of the other (products, services), heteropoiesis
2. Production of itself (ability to produce, knowledge), autopoiesis
Production of the other is dependent on the production of itself. Any successful, sustainable enterprise must continually produce itself, its own ability to produce, in order to produce the other, its products and services. Production, renewal and improvement of the knowledge to produce is necessary for producing anything. Knowledge production (production of itself) has traditionally been left unmanaged and uncoordinated. The focus used to be on the product or service, on “the other”. In the era of global competition the omission of knowledge management is no longer affordable. Knowledge production leads to sustained competitive products and services, but not the other way around. Even the most successful products do not guarantee a sustained knowledge base and competitiveness of the enterprise. The A-C-I-S cycle is concerned with autopoiesis [18], the production of itself. Traditional management is focused on its products and services, while neglecting its own continued ability to produce the requisite knowledge for their production. Therein lies the imperative for knowledge management in the global era: information is becoming abundant, more accessible and cheaper, while knowledge is an increasingly scarce, valued and more expensive commodity. There are too many people with a lot of information, but too few with useful and effective knowledge.

A-C-I-S Cycle. We can now characterize all four essential transformations in greater detail:

1. Articulation: the transformation (knowledge → information) is designed to describe, record and preserve the acquired, tested and provenly effective knowledge and experience in the form of a symbolic description. All such symbolic descriptions, like records, manuals, recipes, databases, graphs, diagrams, digital captures and expert systems, but also books, “cookbooks” and procedures, help to create the symbolic memory of the enterprise. This phase creates the information necessary for its subsequent combination and recombination into forms suitable for new and effective action.

2. Combination: the transformation (information → information) is the simplest, as it is the only one taking place entirely in the symbolic domain. This is the content of traditional information management and technology (IT). It transforms one symbolic description into another, more suitable (actionable) symbolic description. It involves data and information processing, data mining, data warehousing, documentation, databases and other combinations. The purpose is to make information actionable, a useful input into the coordination process.

3. Internalization: the transformation (information → knowledge) is the most important and demanding phase of the cycle: how to use information for effective action, for useful knowledge. Symbolic memory should not be passive,
information just lying about in libraries, databases, computers and networks. Information has to be actively internalized in human abilities, coordinations, activities, operations and decisions – in human action. Only through action does information attain value, gain context and interpretation and – connected with the experience of the actor – become reflected in the quality of the achieved results.

4. Socialization: the transformation (knowledge → knowledge) is related to sharing, propagating, learning and the transfer of knowledge among various actors, coordinators and decision makers. Without such sharing through the community of action, knowledge loses its social dimension and becomes ineffective. Through intra- and inter-company communities, markets, fairs and incubators we connect experts with novices, customers with specialists, and employees with management for the purposes of learning through example, practice, training, instruction and debate. A learning organization can emerge and become effective only through the socialization of knowledge.

The A-C-I-S cycle is continually repeated and renewed on improved, more effective levels through each iteration. All phases, not just the traditional combination of IT, have to be managed and coordinated as a system. Circular KnowIn flows are stimulated, coordinated and maintained by the catalytic function of a Knowledge Exchange Hub (KEH). This KEH functions under the supervision of a KM Coordinator who is responsible for maintaining the four transformations A-C-I-S.

For the first two transformations, Tuggle and Goldfinger [11] developed a partial methodology for externalizing (or articulating) knowledge embedded in organizational processes. Any such externalization produces useful information [1]. It consists of four steps. First, a process important to the organization is selected. Second, a map of the selected process is produced (by specifying its steps and operations and identifying who is involved in executing the process and what the inputs and the outputs are). Third, the accuracy of the process map needs to be verified. Fourth, we examine the process map to extract the embedded information: What does the process reveal about the characteristics of the person executing the process? What about the nature of the work performed? What about the organization in which this process occurs? Why is this process important to the organization in question? What benefit (added value) does the process contribute to the organization?

There are two forms of information extracted from the process mapping. The first extraction produces information about process structure, while the second extraction produces information about process coordination. By producing a map of the process, a symbolic description of action, one extracts information about the process. The second extraction works with the process map directly (extracting information from information), i. e. shifting into the Combination phase of A-C-I-S. It describes properties of the agent conducting the process, insights regarding the steps carried out in executing the process, and revealed understandings about the communications going on during the execution of the process. This methodology involves only the A-C portion of the A-C-I-S cycle. The all-important stages of Internalization and Socialization are not yet addressed. This incompleteness is probably due to the Nonaka [7] induced habit of treating the dimensions of A-C-I-S as separate, autonomous and independent. They form an autopoietic cycle and cannot be separated.
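A minimal sketch of how a process map produced by the externalization methodology above might be represented and mined; the field names and the two extraction helpers are hypothetical, chosen only to mirror the two forms of extracted information (process structure and process coordination) described in the text.

```python
# Hypothetical, verified process map (the output of Articulation).
process_map = {
    "name": "customer order fulfilment",
    "steps": [
        {"op": "receive order", "actor": "sales clerk",
         "inputs": ["order form"], "outputs": ["order record"]},
        {"op": "check stock", "actor": "warehouse staff",
         "inputs": ["order record"], "outputs": ["availability note"]},
        {"op": "ship goods", "actor": "logistics",
         "inputs": ["availability note"], "outputs": ["shipment"]},
    ],
}

def extract_structure(pmap):
    """First extraction: information about the process structure."""
    return [(s["op"], s["inputs"], s["outputs"]) for s in pmap["steps"]]

def extract_coordination(pmap):
    """Second extraction (Combination): who hands which artifact to whom."""
    steps = pmap["steps"]
    return [(a["actor"], b["actor"], sorted(set(a["outputs"]) & set(b["inputs"])))
            for a, b in zip(steps, steps[1:])]

print(extract_structure(process_map))
print(extract_coordination(process_map))
```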
The A-C-I-S cycle has an autopoietic organization [16, 17], defined as a network of processes of:
1) Knowledge Production (Poiesis): the rules governing the process of creation of new knowledge through the Internalization of information.
2) Knowledge Bonding (Linkage): the rules governing the process of Socialization of knowledge within the enterprise.
3) Knowledge Degradation (Information Renewal and Replenishment): the rules associated with the process of transforming knowledge into information through Articulation and Combination.
All three types of constitutive processes must be well balanced and functioning in harmony. If one of the three types is missing, or if one or two types predominate (an out-of-balance system), then the organization can be either heteropoietic or allopoietic, i. e., capable of producing only “the other” rather than itself. Any self-sustaining system will have the processes of production, bonding and degradation concatenated in a balanced way, so that the production rate does not significantly exceed the replenishment rate, and vice versa. Self-sustaining systems will be autopoietic in an environment of shared or common resources; such a business enterprise would resemble a living organism rather than mechanistic machinery.

Autopoietic knowledge systems, in spite of their rich metaphoric and anthropomorphic meanings and intuitions, are simply networks characterized by the inner coordination of individual actions achieved through communication among temporary member-agents. The key words are coordination, communication, and the limited individual life span of members. Coordinated behavior includes both cooperation and competition. So we, as individuals, can coordinate our own actions in the environment only if we coordinate them with the actions of other participants in the same, intersecting or shared network. In order to achieve this, we have to in-form (change) the environment so that the actions of others are suitably modified: we have to communicate. As all other individuals are attempting to do the same, a knowledge network of coordination emerges and, if successful, is “selected” and persists. Such a network then improves our ability to coordinate our own actions effectively. Cooperation, competition, altruism, and self-interest are inseparable. The business enterprise becomes a living organism.

Any self-sustainable system must secure, enhance and preserve communication (and thus coordinated action) among its components or agents, as well as their own coordination and self-coordination competencies. Systems with limited or curtailed communication can be sustained and coordinated only through external commands or feedback; they are not self-sustaining. Hierarchies of command are sustainable but not self-sustaining. Their organization is machine-like, based on processing information, not on producing knowledge. We have established that consensual (unforced) and purposeful (goal-directed) coordination of action is knowledge. Self-sustainable systems must maintain their ability to coordinate their own actions – producing knowledge. Self-sustaining systems must be knowledge-producing entities, not only information-, labor- or money-consuming ones.
6 Knowledge Based Strategy

One of the main implications of the new KM is the realization that strategy should be based on knowledge rather than information, and rooted in action rather than its symbolic description. Traditionally, the organization’s executives prepare a set of statements, descriptions of future action: a mission, a vision, a set of goals, a plan or pattern for action, and similar artefacts. Observe that all these statements are nothing but information. It all remains to be translated into action. That is where most organization executives stumble. How do you transform information into knowledge? How do you carry out the Internalization phase of A-C-I-S? They can all write statements, but can they do? All the statements, from mission to plan, are “above the cloud line”. They do not see from the high clear skies of information down into the confusing reality of knowledge. So, it does not work. So, we have to start anew.

Strategy is about what you do, not about what you say you do or desire to do. Strategy is about action, not about description of action. Strategy is about doing, not about talking about it. Your strategy is what you do. And what you do is your strategy. All the rest is words. All organizations do, and so all organizations have a strategy, whether or not they realize it. Executives have to stop managing information through issuing statements and start managing knowledge through coordinating action. There are no strategic, tactical and operational levels: everything takes place below the cloud line separating information from knowledge. Everything useful is operational.

First, one has to create a detailed map of corporate activities to find out what the company is doing, to reveal its own strategy. Remarkably, many corporations do not know what they do, do not know their own strategy. They only know what they say, their own statements. Second, after creating the activity map, one has to analyze the activities by benchmarking them with respect to competitors, industry standards or stated aspirations. Third, value-curve maps are created in order to differentiate one’s activities from those of the competition. Differentiation, not imitation, is the key to competitiveness and strategy. Fourth, selected activities are changed in order to fill the spaces revealed by the value-curve maps as most effective for successful differentiation. So, we change our action, and thus our strategy, without ever leaving the action domain. Our strategy remains what we are doing, even though we are doing something else. There is no need to implement or execute our “strategy” (set of statements) – it has already been enacted.

Executives “execute” their strategic statements. Their strategies are hard to execute. They are probably created “above the cloud line”, far removed from the doing, and should not be executed at all. Their effective (forced) execution is likely to damage the corporation and its strategic resilience.
Once we have effectively changed our activities and differentiated our action, there is nothing to prevent executives from describing the newly created strategy: they can derive their missions and visions as a description of true action, from the bottom up, reflecting a real strategy – and take them above the cloud line. Their company will prosper. Strategic management is all about doing, producing and creating. How do we produce knowledge, capability, core values, alliances, and networks? How do we do? Therein lies the new promise and challenge of Knowledge Management.
References

1. Desouza, K. C.: Facilitating Tacit Knowledge Exchange. Communications of the ACM, 46 (6) (2003) 85-88
2. Dewey, J. and Bentley, A. F.: Knowing and the Known. Beacon Press, Boston (1949)
3. Hayek, F. A.: The Use of Knowledge in Society. The American Economic Review, 35 (1945) 519-530
4. Hayek, F. A.: Economics and Knowledge. Economica, February (1937) 33-45
5. Lewis, C. I.: Mind and the World-Order (1929), 2nd ed., Dover Publ., New York (1956)
6. Maturana, H. R. and Varela, F. J.: The Tree of Knowledge. Shambhala Publications, Inc., Boston (1987)
7. Nonaka, I.: The Knowledge-Creating Company. Harvard Business Review, 69 (6) Nov.-Dec. (1991) 96-104
8. Prusak, L.: What's up with knowledge management: A personal view. In: Cortada, J. W., Woods, J. A. (eds.): The Knowledge Management Yearbook 1999-2000, Butterworth-Heinemann, Woburn, MA (1999) 3-7
9. Polanyi, M.: The Tacit Dimension. Routledge and Kegan Paul, London, England (1966) (Peter Smith Publ., June 1983)
10. Sveiby, K. E.: Tacit knowledge. In: Cortada, J. W., Woods, J. A. (eds.): The Knowledge Management Yearbook 1999-2000, Butterworth-Heinemann, Woburn, MA (1999) 18-27
11. Tuggle, F. D., Goldfinger, W. E.: A Methodology for Mining Embedded Knowledge from Process Maps. Human Systems Management (2004)
12. Zeleny, M.: Management Support Systems: Towards Integrated Knowledge Management. Human Systems Management, 7 (1) (1987) 59-70
13. Zeleny, M.: Knowledge as a New Form of Capital, Part 1: Division and Reintegration of Knowledge. Human Systems Management, 8 (1) (1989) 45-58; Knowledge as a New Form of Capital, Part 2: Knowledge-Based Management Systems. Human Systems Management, 8 (2) (1989) 129-143
14. Zeleny, M.: Knowledge versus Information. In: Zeleny, M. (ed.), IEBM Handbook of Information Technology in Business, Thomson, London (2000) 162-168
15. Zeleny, M.: Multiple Criteria Decision Making. McGraw-Hill, New York (1982)
16. Zeleny, M.: Autopoiesis, Dissipative Structures, and Spontaneous Social Orders. Westview Press, Boulder, CO (1980)
17. Zeleny, M.: Autopoiesis: A Theory of Living Organization. North-Holland, New York (1981)
18. Zeleny, M.: Autopoiesis (Self-Production). In: Zeleny, M. (ed.), IEBM Handbook of Information Technology in Business. Thomson, London (2000) 283-290
A Hybrid Nonlinear Classifier Based on Generalized Choquet Integrals

Zhenyuan Wang 1, Hai-Feng Guo 2, Yong Shi 3, and Kwong-Sak Leung 4

1 Department of Mathematics, University of Nebraska at Omaha, Omaha, NE 68182, USA
2 Department of Computer Science, University of Nebraska at Omaha, Omaha, NE 68182, USA
3 Department of Information Systems and Quantitative Analysis, University of Nebraska at Omaha, Omaha, NE 68182, USA
{zhenyuanwang, haifengguo, yshi}@mail.unomaha.edu
4 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong
[email protected]
Abstract. In this new hybrid model of nonlinear classifier, unlike the classical linear classifier where the feature attributes influence the classifying attribute independently, the interaction among the influences from the feature attributes toward the classifying attribute is described by a signed fuzzy measure. An optimized Choquet integral with respect to an optimized signed fuzzy measure is adopted as a nonlinear projector to map each observation from the sample space onto a one-dimensional space. Thus, combining a criterion concerning the weighted Euclidean distance, the new nonlinear classifier also takes account of the elliptic-clustering character of the classes and, therefore, is much more powerful than some existing classifiers. Such a classifier can be applied to deal with data even having classes with complex geometrical shapes, such as crescent (cashew-shaped) classes.
1 Introduction
Classification is one of the important methods for pattern recognition [1, 10]. It has been widely applied in data mining. The simplest and most fundamental model of classification is the two-class linear classifier, which divides the feature space into two parts by a hyperplane. Applying any linear classifier requires a basic assumption that there is no interaction among the strengths of the influences from individual feature attributes toward the classifying attribute; that is, the joint influence from a set of feature attributes toward the classifying attribute is just a linear combination of the influences from the individual feature attributes in the set toward the classifying attribute. However, the above-mentioned interaction cannot be ignored in many real problems. Due to the interaction, the geometrical shapes of the classes may vary. Some previous works [2, 3, 11, 12] have adopted nonadditive set functions to describe such an interaction among the feature attributes directly and used nonlinear integrals (such as the
Choquet integral [4, 7, 8]) as a projector from the feature space to a one-dimensional space. As a continuation of these previous works, this paper provides a new hybrid model of nonlinear classifier that combines the Choquet deviation and the weighted Euclidean distance. Thus, the new nonlinear classifier also takes account of the elliptic-clustering character of the classes and, therefore, is much more powerful than some existing classifiers. Such a classifier can be applied to deal with data even having classes with some complex geometrical shapes such as crescent (cashew-shaped) classes.
2 Basic Concepts and Notations
Consider m feature attributes, $x_1, x_2, \ldots, x_m$, and a classifying attribute, y, in a database. Attributes $x_1, x_2, \ldots, x_m$ are numerical, while y is categorical with a range $C = \{c_1, c_2, \ldots, c_n\}$, where each $c_k$ is the indicator of a class and may be symbolic or numerical. Denote $\{x_1, x_2, \ldots, x_m\}$ by X. The m-dimensional Euclidean space is the feature space. An n-classifier is an n-partition of the feature space with a one-to-one correspondence to C. It should be determined based on a sufficient data set. The data set consists of l observations of $x_1, x_2, \ldots, x_m$ and y, where the i-th observation is $(f_{i1}, f_{i2}, \ldots, f_{im}, y_i)$ with $y_i \in C$ for $i = 1, 2, \ldots, l$ and $j = 1, 2, \ldots, m$. Positive integer l is called the size of the data and should be much larger than m. Each observation of $x_1, x_2, \ldots, x_m$ can be regarded as a function $f: X \to (-\infty, \infty)$; it is a point in the feature space. Thus, the i-th observation of $x_1, x_2, \ldots, x_m$ is denoted by $f_i$, and we write $f_i(x_j) = f_{ij}$, $j = 1, 2, \ldots, m$, for $i = 1, 2, \ldots, l$. Let $I_k = \{i \mid y_i = c_k\}$, $k = 1, 2, \ldots, n$. Then $\{I_1, I_2, \ldots, I_n\}$ is an n-partition of the index set $\{1, 2, \ldots, l\}$.

The interaction among the influences of the feature attributes toward the classifying attribute is described by a set function $\mu$ defined on the power set of X, P(X), satisfying the condition of vanishing at the empty set, i.e., $\mu(\emptyset) = 0$. For convenience, we also require that $\mu(X) = 1$. Set function $\mu$ may not be additive or nonnegative, or neither. Such a set function is called a pseudo-regular signed fuzzy measure. A pseudo-regular signed fuzzy measure defined on P(X) is identified with a vector $(\mu_1, \mu_2, \ldots, \mu_{2^m-1})$ of order $2^m - 1$, where $\mu_h = \mu(\{x_j \mid h_j = 1\})$ if h is expressed in binary digits as $h = h_m h_{m-1} \cdots h_1$ for every $h = 1, 2, \ldots, 2^m - 1$.
The generalized Choquet integral of a function, f, with respect to a pseudo-regular signed fuzzy measure $\mu$ is defined by

$(c)\int f \, d\mu = \int_{-\infty}^{0} [\mu(F_\alpha) - \mu(X)] \, d\alpha + \int_{0}^{\infty} \mu(F_\alpha) \, d\alpha$

when not both terms on the right-hand side are infinite, where set $F_\alpha = \{x \mid f(x) \ge \alpha\}$ is called the $\alpha$-level set of function f for any $\alpha \in (-\infty, \infty)$. The generalized Choquet integral can be calculated through the following formula:

$(c)\int f \, d\mu = \sum_{h=1}^{2^m - 1} z_h \mu_h ,$

where

$z_h = \min_{j:\, h_j = 1} f(x_j) - \max_{j:\, h_j = 0} f(x_j)$

if this difference is positive or $h = 2^m - 1$, and $z_h = 0$ otherwise, in which $h = h_m h_{m-1} \cdots h_1$ in binary digits. In the above expression, a convention that the maximum taken on an empty set has value zero is adopted.

Given constants and given vectors a and b, regarded as functions on X, together with a pseudo-regular signed fuzzy measure $\mu$, an inequality on the generalized Choquet integral of $a + b f$ represents a subset of the feature space. Such a region can reflect the interaction among the influences from the feature attributes toward the classifying attribute and, therefore, can be adopted as a frame for each component of the n-partition of the feature space. Simply, we denote $a(x_j)$ by $a_j$ and $b(x_j)$ by $b_j$ for $j = 1, 2, \ldots, m$, i.e., $a = (a_1, a_2, \ldots, a_m)$ and $b = (b_1, b_2, \ldots, b_m)$.
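The sketch below computes the generalized Choquet integral on a finite attribute set using the familiar sorting-based evaluation of the defining integrals; it stores the signed fuzzy measure as a dictionary keyed by subsets rather than the $2^m - 1$ binary-indexed vector used in the paper, and it is an illustrative reading of the definition rather than the paper's own code.

```python
def choquet(f, mu, X):
    """Generalized (translatable) Choquet integral of f with respect to a
    signed fuzzy measure mu on the finite attribute set X.

    f  : dict mapping each attribute in X to a real value
    mu : dict mapping frozensets of attributes to real numbers,
         with mu[frozenset()] == 0 (vanishing at the empty set)
    """
    order = sorted(X, key=lambda x: f[x])   # attributes by nondecreasing f-value
    total, prev = 0.0, 0.0                  # prev plays the role of f(x_(0)) = 0
    for i, x in enumerate(order):
        level_set = frozenset(order[i:])    # {x : f(x) >= f(order[i])}
        total += (f[x] - prev) * mu[level_set]
        prev = f[x]
    return total

# Toy example with m = 2 attributes and a nonadditive (interaction-carrying) measure.
X = ["x1", "x2"]
mu = {frozenset(): 0.0, frozenset({"x1"}): 0.7,
      frozenset({"x2"}): 0.5, frozenset({"x1", "x2"}): 1.0}
print(choquet({"x1": -1.0, "x2": 3.0}, mu, X))   # 1.0
```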
3 Main Algorithm
(1) Input the data set. Create l, m, n, and the nonempty sets $I_1, I_2, \ldots, I_n$.
(2) Choose a large prime s as the seed for the random number generator. Set the value for each parameter listed in the following.

The bit length of each gene, i.e., the number of bits used for expressing each gene. It depends on the required precision of the results; e.g., a bit length of 10 means that the precision is almost $2^{-10}$. Its default is 10.
P and q: The population sizes used in GAI and GAII. They should be large positive even integers. Their defaults are 200 and 100 respectively.
The probabilities used in a random switch in GAI and GAII to control the choice of genetic operators for producing offspring from selected parents. They should satisfy the required consistency conditions for i = 1, 2. Their defaults are 0.2, 0.5, 0.2, and 0.5 respectively.
Small positive numbers used in the stopping controller in GAI.
The limit numbers of generations that have no significant progression successively in GAI and GAII. Their defaults are 10 and 100 respectively.
w: the weight for using the weighted Euclidean distance in GAI. Its default is 0.1.

(3) For each nonempty set $I_k$, take the corresponding data as D and run GAI. The obtained result is saved.
(4) Rearrange the classes according to a permutation of (1, 2, ..., n) determined from the obtained results.
(5) From k = 1 to k = n successively, take the adjusted data of the k-th class as A and the rest of the adjusted data of the classes i = k, k + 1, ..., n as B, and run GAII, where the adjustment of a data set is its remainder after erasing those data that have already been classified into the previous classes. The obtained boundary constants are saved for k = 1, 2, ..., n.
(6) Stop.
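A control-flow skeleton of the main algorithm as reconstructed above. The functions run_GAI and run_GAII are hypothetical stand-ins for the genetic algorithms of Sections 4 and 5, their return values are assumed interfaces, and the bookkeeping of the adjusted data sets is only one plausible reading of step (5).

```python
def train_classifier(data_by_class, run_GAI, run_GAII):
    """Sketch: data_by_class maps a class index k to the list of observations
    of that class; run_GAI returns the location/shape parameters of one class,
    run_GAII returns its boundary (size) constants plus the points it captures."""
    # Step (3): determine the location and shape of every class with GAI.
    shapes = {k: run_GAI(D) for k, D in data_by_class.items()}

    # Step (4): fix a processing order of the classes (some permutation of 1..n).
    order = sorted(data_by_class)            # placeholder ordering criterion

    # Step (5): determine the size of each class with GAII on the data that
    # has not yet been classified into the previous classes.
    remaining = {k: list(D) for k, D in data_by_class.items()}
    sizes = {}
    for k in order:
        A = remaining[k]                                        # adjusted data of class k
        B = [f for j in order if j != k for f in remaining[j]]  # adjusted data of the rest
        sizes[k], captured = run_GAII(A, B, shapes[k])
        for j in remaining:                                     # erase classified data
            remaining[j] = [f for f in remaining[j] if f not in captured]
    return shapes, sizes, order
```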
4 Genetic Algorithm I (GAI for Determining the Location and Shape of Each Class)
Given a nonempty subset of data, $D = \{f_i \mid i \in I\}$, where I is a nonempty index set, its location and shape may be described by vectors a and b, constant c, and a pseudo-regular signed fuzzy measure $\mu$. These parameters can be determined according to the criterion that the average squared deviation of the generalized Choquet integrals of the data in D from c is minimized, where the average is taken over $|I|$, the cardinality of I. Constant c is called the Choquet projected mean of D, $\mu$ is called the eigenmeasure of D, and the minimized quantity is called the Choquet projected variance of D. The Choquet centroid of D is a point in the
feature space whose j-th coordinate is determined from the optimized parameters when the corresponding coefficient $b_j$ is nonzero, $j = 1, 2, \ldots, m$. Any j-th dimension of the feature space with $b_j = 0$ will be rejected, and then the Choquet centroid will be considered in the relevant lower-dimensional space.

Minimizing the Choquet projected variance is the main criterion of the optimization. However, another criterion, that the data should be as close to the Choquet centroid as possible, will also be considered. Thus, a convex combination of the two criteria, that is, of the Choquet projected variance and the weighted Euclidean distance to the Choquet centroid, will be adopted as the objective of the optimization, where 1 - w and w are the weights chosen before running the program. A genetic algorithm [5, 6, 9] is used to determine a, b, c, and $\mu$, with the relevant quantities, that minimize the objective for the given subset of data, D.
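A sketch of the objective minimized by GAI under the convex combination described above. The exact functional forms of the Choquet projected variance and of the distance term are not recoverable from this copy, so the versions below (mean squared deviation of the projections of a + bf from c, and mean weighted Euclidean distance to the centroid) are assumptions; the `choquet` routine is the integral sketch given after Section 2.

```python
import math

def gai_objective(D, a, b, c, mu, centroid, weights, w, X, choquet):
    """(1 - w) * assumed Choquet projected variance + w * mean weighted distance.

    D        : list of observations, each a dict attribute -> value
    a, b     : dicts of per-attribute constants (location/scaling)
    mu       : candidate eigenmeasure (dict of frozensets -> values)
    centroid : dict giving the Choquet centroid coordinates
    weights  : per-attribute weights of the Euclidean distance
    choquet  : the Choquet-integral routine sketched in Section 2
    """
    n = len(D)
    projections = [choquet({x: a[x] + b[x] * f[x] for x in X}, mu, X) for f in D]
    variance = sum((p - c) ** 2 for p in projections) / n

    mean_dist = sum(
        math.sqrt(sum(weights[x] * (f[x] - centroid[x]) ** 2 for x in X))
        for f in D
    ) / n
    return (1 - w) * variance + w * mean_dist
```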
5 Genetic Algorithm II (GAII for Determining the Size of Each Class)
Once the location and the shape of a class are determined, the size of the class that will be adopted for the classification may be described by the relative Choquet deviation and the relative weighted Euclidean distance from the Choquet centroid of a point f in the feature space. The optimal size of the class will be determined by minimizing the misclassification rate. The problem now is simplified to be a two-class classification, i.e., to determine the best boundaries of the class with data set A, which will be used to distinguish points in this class from the others with data set B. The boundaries are expressed in terms of the relative Choquet deviation and the relative weighted Euclidean distance from the Choquet centroid when the data sets A (or adjusted A) and B are given and the information on A's location and shape is available. A genetic algorithm is used to determine the optimal constants for dominating the relative Choquet deviation and the relative weighted Euclidean distance from the Choquet centroid, respectively.
6 Application
Once a new observation is available, it can be classified into some class as follows.
(1) Set the value of switch SW. SW = 0 means classifying any outlier obligatorily into one of the given n classes, while SW = 1 means classifying any outlier into an additional class. Its default is 1.
(2) Set the value of the balancing parameter. It is a factor used for balancing the relative Choquet deviation and the weighted Euclidean distance to the Choquet centroid of a class when an outlier must be classified into one of the given n classes. Its default is 2.
(3) Input the new observation, f. Based on the result obtained in the main algorithm, and according to the order of the classes, f is classified into the k-th class if it has not been classified into a previous class and satisfies both boundary conditions, on the relative Choquet deviation and on the relative weighted Euclidean distance. Then go to step (5). In case no such k exists for f, go to the next step.
(4) If SW = 0, calculate the balanced combination of the relative Choquet deviation and the relative weighted Euclidean distance for k = 1, 2, ..., n; then f is classified into the class with the minimal value. If SW = 1, classify f into the additional class.
(5) Output the class index h, where h is one of 1, 2, ..., n, and n + 1. Then stop.
References

1. Devijver, P. A., Kittler, J.: Pattern Recognition: A Statistical Approach, Prentice Hall (1982)
2. Grabisch, M., Nicolas, J. M.: Classification by fuzzy integral: Performance and tests, Fuzzy Sets and Systems, Vol. 65 (1994) 255-271
3. Mikenina, L., Zimmermann, H.-J.: Improved feature selection and classification by the 2-additive fuzzy measure, Fuzzy Sets and Systems, Vol. 107 (1999) 197-218
4. Murofushi, T., Sugeno, M.: An interpretation of fuzzy measure and the Choquet integral as an integral with respect to a fuzzy measure, Fuzzy Sets and Systems, Vol. 29 (1989) 201-227
5. Wang, W., Wang, Z., Klir, G. J.: Genetic algorithm for determining fuzzy measures from data, Journal of Intelligent and Fuzzy Systems, Vol. 6 (1998) 171-183
6. Wang, Z.: A new genetic algorithm for nonlinear multiregressions based on generalized Choquet integrals, Proc. FUZZ-IEEE (2003) 819-821
7. Wang, Z.: Convergence theorems for sequences of Choquet integrals, Int. J. General Systems, Vol. 26 (1997) 133-143
8. Wang, Z., Klir, G. J.: Fuzzy Measure Theory, Plenum, New York (1992)
9. Wang, Z., Leung, K. S., Wang, J.: A genetic algorithm for determining nonadditive set functions in information fusion, Fuzzy Sets and Systems, Vol. 102 (1999) 463-469
10. Weiss, S. M., Kapouleas, I.: An empirical comparison of pattern recognition, neural nets, and machine learning classification methods, Proc. IJCAI (1989) 781-787
11. Xu, K., Wang, Z., Heng, P. A., Leung, K. S.: Using generalized Choquet integrals in projection pursuit based classification, Proc. IFSA/NAFIPS (2001) 506-511
12. Xu, K., Wang, Z., Leung, K. S.: Classification by nonlinear integral projections, IEEE Trans. Fuzzy Systems, Vol. 11, No. 2 (2003) 187-201
Fuzzy Classification Using Self-Organizing Map and Learning Vector Quantization

Ning Chen

Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Portugal
Institute of Mechanics, Chinese Academy of Sciences, P. R. China
[email protected]
Abstract. Fuzzy classification proposes an approach to solving the uncertainty problem in classification tasks. It assigns an instance to more than one class with different degrees, instead of to the single definite class given by crisp classification. This paper studies the use of a fuzzy strategy in classification. Two fuzzy algorithms for the sequential self-organizing map and learning vector quantization are proposed, based on fuzzy projection and learning rules. The derived classifiers are able to provide fuzzy classes when classifying new data. Experiments show the effectiveness of the proposed algorithms in terms of classification accuracy.

Keywords: fuzzy classification, self-organizing map (SOM), learning vector quantization (LVQ).
1 Introduction
Classification is a supervised machine learning method to derive models between features (independent variables) and a class (target variable). In the past decades, classification has been widely applied to solve a great variety of classifying tasks, e.g., product marketing, medical diagnosis, credit approval, image segmentation, qualitative prediction, and customer attrition cause analysis. The process of classification is to first produce models from training data, in which each sample is assumed to have a predefined class, and then use the models to classify new data in which the class label is unknown. Classification can be divided into crisp classification and fuzzy classification. Crisp classification produces a number of classical sets of data, in which an element is classified into only one class. However, there also exist uncertainty cases in which samples are not clear members of any class [4]. Fuzzy classification is advantageous over crisp classification in solving uncertainty problems by assigning a sample to multiple classes with different membership degrees. The membership degrees given by fuzzy classification provide some valuable information, e.g., the significance of a sample belonging to a class.

Self-organizing map (SOM) and learning vector quantization (LVQ) are artificial neural network algorithms. SOM is trained in an unsupervised way. It projects the data onto neurons through a topology-preserving transformation so that neurons close to each other have similar features in the input space. Although SOM is an unsupervised learning method by nature, it can also be used for supervised tasks after the map is labeled. LVQ is performed in a supervised way, defining class regions in the data space rather than preserving the topological properties of the data.
Fuzzy SOM and LVQ have been studied in the literature. FLVQ [2] is a batch SOM algorithm combining an online weight adaptation rule with fuzzy membership assignment. The relative membership is calculated directly from the distances between the input instance and the map neurons rather than from the topological neighbors. Replacing crisp classes with fuzzy class memberships for both input samples and map neurons, a fuzzy SOM classifier is presented in [6]. The crisp labels of input samples are fuzzified by a k-nearest neighbor rule. After training, each map neuron is assigned to a class with a membership degree based on the typicalness of the patterns projected on it. In this method, the fuzzy paradigm is only used in the labeling phase and has no impact on map organization. Some competitive algorithms, FALVQ 1, FALVQ 2, and FALVQ 3, support fuzzy classification using different membership functions [3]. These algorithms optimize fuzzy cost functions formed as the weighted sum of squared Euclidean distances between input vectors and the reference vectors of neurons. However, it is noted that the optimization procedures are plagued with local minima.

In this paper, we propose two sequential algorithms for fuzzy classification using SOM and LVQ. Sequential algorithms are ‘on-line’ in the sense that the neurons are updated after the presentation of each input. Discarding each sample once it has been used, sequential algorithms avoid the storage of the complete data set. The fuzzy SOM and LVQ algorithms are based on fuzzy projection, in which the membership values are calculated from the distances between input samples and map neurons. As opposed to [6], the fuzzy paradigm is used in both model training and classification. In the training phase, the neurons are updated according to the membership values. In the classifying phase, each instance is assigned to multiple classes with different degrees. Finally, a hybrid classifier is presented combining fuzzy SOM and LVQ to hold both the topology-preserving property and pattern recognition capability. The performance of the proposed fuzzy algorithms is investigated in terms of classification accuracy.

In the remainder of the paper, Section 2 describes the methodology of the fuzzy classification algorithms. Experiments and results are given in Section 3. Lastly, Section 4 concludes the paper.
2 Fuzzy Sequential SOM and LVQ

2.1 Fuzzy Projection
SOM is an artificial neural network (ANN) which attempts to represent the input data in a low-dimensional grid space through a topology-preserving mapping [5]. The neurons are organized on a regular grid, usually of one or two dimensions. Each neuron is associated with the input samples by a reference vector and connected to adjacent neurons by a neighborhood function. Suppose x is an input vector and $m_j$ is the reference vector of the j-th neuron. Crisp SOM projects x to the map unit which best matches it, i.e., the unit whose reference vector has the minimal Euclidean distance to x. Fuzzy SOM projects x to each neuron with a membership degree, the memberships lying in [0, 1] and summing to one over the map. Given a fuzzy parameter which controls the degree of fuzziness, the membership matrix can be calculated from the distances between the input vector and the reference vectors [1].
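The exact form of Equation 1 does not survive in this copy; the sketch below uses the familiar fuzzy-c-means-style membership computed from distances, which matches the stated properties (values in [0, 1], summing to one, controlled by a fuzzy parameter q) but should be read as an assumption rather than as the paper's formula.

```python
import numpy as np

def fuzzy_membership(x, reference_vectors, q=2.0, eps=1e-12):
    """Membership of input x with respect to every neuron, computed from the
    distances between x and the reference vectors (fuzzy-c-means-style rule)."""
    d2 = np.array([np.sum((x - m) ** 2) for m in reference_vectors]) + eps
    u = d2 ** (-1.0 / (q - 1.0))
    return u / u.sum()               # values in [0, 1], summing to 1

# As q approaches its crisp limit the membership concentrates on the
# best-matching unit; larger q gives a fuzzier projection.
```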
2.2 Fuzzy Sequential SOM (FSSOM)
Fuzzy sequential SOM uses a fuzzy match instead of a crisp match and updates the reference vectors according to membership degrees. The map is trained iteratively by updating the reference vectors according to the input samples. An input vector is assigned to each neuron with a membership degree obtained by Equation 1. Then the units are updated towards the input case by a proportion of the distance between them; the incremental vector is weighted by the neighborhood function values and by the membership degrees raised to the fuzzy exponent. After training, the neurons become topologically ordered on the map. Finally, the map neurons are labeled with specific classes according to the classified samples. For the purpose of constructing a classifier, the fuzzy sequential SOM algorithm for model derivation is described as follows:
Step 1: Initialize the map with a lattice of neurons and reference vectors.
Step 2: Choose a sample from the training data set at time t.
Step 3: Calculate the distances between the sample and the reference vectors.
Step 4: Compute the membership degrees of the neurons with respect to the sample.
Step 5: Update the reference vectors of all neurons using the fuzzy update rule (see below), in which the learning rate at time t and the neighborhood function with its radius are both non-increasing functions of time.
Step 6: Repeat Steps 2 to 5 for enough iterations, until the status of the map is stable.
Step 7: Input the samples of classified data and project them to their best-matching units.
Step 8: Label each neuron with the class of maximal frequency among the samples projected onto it.
Usually the neighborhood radius and the learning rate take larger values at first and decrease to zero as training proceeds [5]. When the fuzzy parameter approaches 1, the method produces a hard projection (all of the membership goes to the best-matching unit, which has the minimal distance to the input), and FSSOM is equivalent to the classic sequential SOM.
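A fuzzy update rule consistent with the description in Step 5 (the notation below is an assumption, not the original display) is

\[
m_j(t+1) \;=\; m_j(t) \;+\; \alpha(t)\, h_{cj}(t)\, u_j^{\,q}\,\bigl[x(t)-m_j(t)\bigr],
\]

where x(t) is the sample chosen at time t, m_j the reference vector of neuron j, u_j its membership for x(t), alpha(t) the learning rate, and h_cj(t) the neighborhood function centered at the best-matching unit c.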
2.3 Fuzzy Sequential LVQ (FSLVQ)
Learning vector quantizer (LVQ), a variant of SOM, uses a supervised approach during learning. The map units are assigned by class labels in the initialization and then updated
at each training step. The update depends on the match of class labels between the best-matching unit and the input: if the unit has the same class as the input, its reference vector is moved closer to the input; otherwise, it is moved away from the input. In contrast to crisp LVQ, which updates only the best-matching unit, FSLVQ updates all units according to their memberships. As an extension of LVQ1 [5], one of the basic LVQ algorithms, the fuzzy sequential LVQ algorithm is described as follows:
Step 1: Initialize the reference vector and class label of each neuron.
Step 2: Choose a sample from the training data set at time t.
Step 3: Calculate the distances between the sample and the reference vectors.
Step 4: Compute the membership degrees of the neurons with respect to the sample.
Step 5: Update the reference vectors of all neurons using the fuzzy LVQ update rule (see below).
Step 6: Repeat Steps 2 to 5 for enough iterations.
When the fuzzy parameter approaches 1, only the best-matching unit is effectively updated, so that FSLVQ is essentially equivalent to crisp LVQ1.
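In the same assumed notation as before, an FSLVQ update consistent with Step 5 moves every reference vector towards the input when its label agrees with the sample's class and away from it otherwise:

\[
m_j(t+1) \;=\; m_j(t) \;+\; s_j\,\alpha(t)\, u_j^{\,q}\,\bigl[x(t)-m_j(t)\bigr],
\qquad
s_j \;=\;
\begin{cases}
+1 & \text{if neuron } j \text{ carries the class of } x(t),\\
-1 & \text{otherwise.}
\end{cases}
\]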
2.4 Fuzzy Classifying
Once a map is trained and labeled, it can be used as a classifier for unclassified data. Fuzzy classification offers decision makers more insight into the class assignment. After calculating the membership degrees of a sample with respect to all units, the degree of the sample with respect to a class is calculated as the sum of the membership degrees of the units carrying that class. When a crisp assignment is needed, the classification can be made according to the class with the maximal degree; if more than one class attains the maximal degree, the first one is chosen. In the limiting crisp case, the classification simply assigns the input to the class of its best-matching unit. Given the set of class labels, the class of a sample is determined as follows:
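A minimal Python sketch of this fuzzy-to-crisp assignment is given below. The function and variable names are assumptions, and the memberships are computed with the fuzzy c-means style formula discussed in Section 2.1.

```python
import numpy as np

def fuzzy_classify(x, neurons, labels, q=2.0):
    """Assign fuzzy class degrees to sample x and return a crisp label.

    x       : 1-D feature vector
    neurons : reference vectors of the labeled map units, shape (M, d)
    labels  : class label of each unit, length M
    q       : fuzzy parameter (> 1)
    """
    d = np.linalg.norm(neurons - x, axis=1)
    d = np.maximum(d, 1e-12)                      # guard against zero distance
    # fuzzy c-means style memberships of x with respect to all units
    u = 1.0 / np.sum((d[:, None] / d[None, :]) ** (2.0 / (q - 1.0)), axis=1)
    degrees = {}
    for label, ui in zip(labels, u):
        degrees[label] = degrees.get(label, 0.0) + ui   # sum memberships per class
    crisp = max(degrees, key=degrees.get)               # class with maximal degree
    return degrees, crisp
```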
3 Experiments and Results
3.1 Hybrid Fuzzy Classifier
Although both SOM and LVQ can be used for classification, they are different in some aspects. Firstly, SOM attempts to approximately preserve the neighborhood relationship of data in a topological order fashion. LVQ tries to recognize the patterns of class with respect to other features; Secondly, SOM is trained from an initial map without class labels. LVQ needs the assignment of labels for neurons in the initialization; Next, SOM
is trained in an unsupervised way without the direction of class labels. The training process is extreme data driven based on intrinsic similarity of data. LVQ is trained in a supervised way under the direction of class information. In order to possess both topology preserving property and pattern recognition capability, the unsupervised and supervised scheme can be combined in either simultaneous manner, e.g. LVQ-SOM [5], HLVQ [7], or successive manner [9]. In the hybrid fuzzy classifier, the models are derived using FSSOM followed by FSLVQ. The combination of FSSOM and FSLVQ is inspired by three reasons. First, the local neighborhood properties of trained SOM contribute to easier pattern recognition tasks, hence no pre-classified samples are required in the initial training, and only a limited number of known samples is needed in the labeling phases. This feature makes SOM particularly suitable for classification cases where there are few classified samples and allow users to avoid the expensive and tedious process of known sample collection [9]. Next, the objective of SOM is to preserve topology property of data without any consideration of class assignment. FSLVQ can be used to adjust the map neurons for better performance on pattern recognition. With labels and reference vectors induced from data clustering, FSSOM offers a better starting condition for FSLVQ training than random initialization. Next, FSSOM is very close to FSLVQ in data structure and learning scheme. In fact, to stabilize the status of FSSOM, neighborhood region usually shrinks to zero in fine-tuning step so that it is easy to change to FSLVQ in a straightforward way.
3.2 Effectiveness Study
The proposed classification algorithms are implemented based on SOM & LVQ software [8]. The following experiments are performed on Iris data set in a machine with 256M memory and intel celeron 1.03 GHz processor running windows XP professional operating system. Iris data set has 150 Iris flowers, described by four numeric features: sepal length, sepal width, petal length and petal width. The samples belong to three classes respectively: ‘setosa’, ‘versicolor’, and ‘virginica’. The experiments are performed in four steps. Step 1: The performance of proposed algorithms is evaluated using 10-fold cross validation. In each trial, nine folds are used for model exploration and the remaining is for model validation. Step 2: For each training data, a map is initialized linearly in the two-dimensional subspace corresponding to the largest eigenvalues of autocorrelation matrix of the training data. Afterwards, the map is trained by FSSOM using a variant of fuzzy parameter from 1 to 10 in an unsupervised manner, and then labeled according to the known samples in a supervised manner. After that, FSLVQ is performed on the resultant map with the same parameters as previous training. Step 3: In the validation, each sample of the test data set is compared to map units and assigned by the label of best-matching unit. Then the accuracy is calculated as the percent of the correctly classified samples. Step 4: The final accuracy is obtained by calculating the average results on distinct trials. Table 1 lists the arguments used in the experiment. The intermediate values of learning rate and neighborhood radius are linearly interpolated from the initial values to the
end values. After the labeling phase, some units are not labeled because no sample is projected on them. Although these neurons maybe useful on recognizing uncertain cases in future decision making, the existence of non-labeled neurons will influence the classification accuracy. This problem is exacerbated by big maps which probably result in more unlabeled neurons. Hence, these neurons are discarded before classifying. In Table 2, the average accuracy ratios for three classes and whole test data at a varied fuzzy parameter are given. It was observed that the accuracy of fuzzy configuration has an obvious increase compared to crisp configuration. The overall accuracy increases from 94.67% to 97.33%. Fuzzy parameter over 3 do not end up with any improvement on accuracy. Starting from the resulting map of FSSOM, FSLVQ does not result in significant improvement (less than 1 %) on the accuracy. This is due to the fact that Iris data has an almost unmixed cluster formulation of class regions so that FSSOM classifier performs as well as hybrid classifier. In Figure 1, the test data is projected to a 2-dimensional subspace spanned by its two eigenvectors with greatest eigenvalues using principal component analysis (PCA). Figure 2 is the projection of classified data using crisp SOM. Figure 3 is the projection of classified data using FSSOM at fuzzy parameter of 2. In each visualization, three classes are plotted in different markers: for ‘setosa’, × for ‘versicolor’ and for ‘virginica’. For the sake of easy detection, the misclassified samples are marked by in the last two figures. Compared to the crisp classifier which misclassifies two samples, fuzzy classifier results in only one error. It was also found that the misclassified samples occur on the boundary of class regions, where uncertain cases usually locate.
3.3 Fuzzy Classifying
In the following experiment, we use a more complex approach to classification phase which takes fuzzy parameter into account. The membership degrees of a sample to the units are calculated by Equation 1 and then the class with maximum degree is obtained by Equation 4. In each trial, a model is trained by FSSOM and FSLVQ using a random fuzzy parameter between 1 and 10 and a map size of [4 × 3]. Each obtained model is validated by the same test data using different fuzzy parameters. Increasing the fuzzy parameter from 1 to 3 with a step of 0.2, the results achieved on test data are listed in Table 3. It was observed the result is quite good when fuzzy parameter is below 3, showing
Fig. 1. PCA projection of test data
Fig. 2. PCA projection of crisp classified test data (2 errors)
Fig. 3. PCA projection of fuzzy classified test data (1 error)
that the fuzziness of classification does not degrade the accuracy while providing more information about the class assignment. Figure 4 shows the test data and map neurons in a 2-dimensional subspace. The neurons are displayed with different markers according to their labels, and the samples are shown by their numbers. For each sample, the class assignment of crisp classification and the class memberships of fuzzy classification are given in Table 4. From the memberships, the significance of an instance belonging to a class is known. Some misclassified samples are classified correctly under the fuzzy strategy; for example, sample 8 is misclassified as 'virginica' in the crisp case, while it is assigned to 'versicolor' in the fuzzy case. Also, samples 4 and 10 are both members of 'setosa', but the latter has a larger membership (0.98) than the former (0.84); in fact, the latter is much closer to the representative neurons of 'setosa' in Figure 4. It can be stated that replacing exact projection with fuzzy projection at a certain level in classification does not compromise the benefit of the models.
Fig. 4. PCA projection of test data and map neurons in a 2-dimensional subspace. Three classes of neurons are shown in different markers: for ‘setosa’, for ‘ versicolor’ and for ‘virginica’. The samples of test data are marked by their numbers
4 Conclusion
Fuzzy classification is an extension of crisp classification using fuzzy set theory. In this paper, two fuzzy classification algorithms are proposed using sequential SOM and LVQ based on fuzzy projection. The resulting map of SOM can be used as the initialization
of LVQ in a hybrid classifier, which can improve the pattern recognition ability while approximately preserving the topology of the data. Experimental results show that the proposed algorithms, at a certain fuzzy level, improve the classification accuracy compared with the crisp algorithms. It can be stated that fuzzy classification addresses the uncertainty of samples belonging to several classes and improves classification accuracy for future decision making. Future work will mainly focus on the qualitative description of classification models.
Acknowledgements Parts of this reported work were supported by NSFC-RGC #70201003 from National Science Foundation of China and Head Fund #0347SZ from Institute of Policy and Management, CAS.
References
1. Bezdek, James C.: Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York (1981)
2. Bezdek, James C., Pal, Nikhil R.: Two soft relatives of learning vector quantization. Neural Networks 8(5) (1995) 729-743
3. Karayiannis, Nicolaos B., Pai, Pin-I: Fuzzy algorithms for learning vector quantization: generalizations and extensions. In: Steven K. Rogers (ed.): Applications and Science of Artificial Neural Networks. Proceedings of SPIE, Air Force Institute of Technology, Wright-Patterson AFB, OH, USA 2492 (1995) 264-275
4. Keller, James M., Gray, Michael R., Givens, James A.: A fuzzy k-nearest neighbor algorithm. IEEE Trans. on Systems, Man, and Cybernetics 15(4) (1985) 580-585
5. Kohonen, T.: Self-organizing maps. Springer Verlag, Berlin, Second edition (1997)
6. Sohn, S., Dagli, Cihan H.: Self-organizing map with fuzzy class memberships. In: Proceedings of the SPIE International Symposium on AeroSense 4390 (2001) 150-157
7. Solaiman, B., Mouchot, Marie C., Maillard, Eric P.: A hybrid algorithm (HLVQ) combining unsupervised and supervised learning approaches. In: Proceedings of the IEEE International Conference on Neural Networks (ICNN), Orlando, USA (1994) 1772-1778
8. Laboratory of Computer and Information Sciences & Neural Networks Research Center, Helsinki University of Technology: SOM Toolbox 2.0. http://www.cis.hut.fi/projects/somtoolbox/
9. Visa, A., Valkealahti, K., Iivarinen, J., Simula, O.: Experiences from operational cloud classifier based on self-organising map. In: Proceedings of SPIE, Orlando, Florida, Applications of Artificial Neural Networks V 2243 (1994) 484-495
Solving Discriminant Models Using Interior Point Algorithm Siming Huang, Guoliang Yang, and Chao Su Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100080, China [email protected]
Abstract. In this paper we first survey the linear programming based discriminant models in the literature. We then propose an interior point algorithm to solve the resulting linear programs. The algorithm is polynomial and uses a simple starting point.
1 Introduction
The two-group discriminant problem has applications in many areas, for example, differentiating between good credit risks and poor ones, between promising new firms and those likely to fail, or between patients with strong prospects for recovery and those highly at risk. The two-group classification problem, sometimes referred to as the two-group discriminant problem, involves classifying an observation into one of two a priori groups based on the attributes of the observation. In some recent papers, Freed and Glover [5], Banks and Abad [2], and Lee and Ord [10] proposed linear programming-based models and solution algorithms. While the proposed algorithms are promising, the objective of this paper is to propose a new optimization algorithm to solve the discriminant models. This paper presents an efficient LP-based solution approach, an interior point algorithm for linear programming, to optimally solve the proposed discriminant models, motivated by the efficiency of interior point methods. The recent research will be used as a benchmark for the proposed interior point algorithm.
2 Proposed Discriminant Models In this section a set of proposed discriminant models will be listed and explained.
2.1 MMD (Minimize Maximum Deviation) Model
The first and simplest formulation, hereafter referred to as the MMD model [5], can be summarized as follows:
where
the variables satisfy:
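The MMD model of Freed and Glover [5] is commonly stated as follows; the notation here (A_i for the attribute row vector of observation i, x for the weight vector, b for the boundary value, d for the maximum deviation) is an assumption rather than the original display:

\[
\begin{aligned}
\min\;& d\\
\text{s.t.}\;& A_i x \le b + d, \qquad i \in \text{Group 1},\\
& A_i x \ge b - d, \qquad i \in \text{Group 2},\\
& d \ge 0 .
\end{aligned}
\]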
2.2 MSID (Minimize the Sum of Interior Distance) Model The second LP formulation, hereafter referred to as the MSID model[5], seeks to:
where the variables satisfy:
The objective is essentially two-fold: Find the discriminant function ( x ) and the boundary ( b ) that will minimize group overlap and maximize the total (interior) distance of group members from the designated boundary hyperplane ( Ax = b). A parametric procedure was used to produce the set of ranges on the values of H which would generate the full family of distinct MSID solutions. Thus, a given two-group problem might produce 5 or 6 solutions (i.e., discriminant functions, x), each corresponding to a specific range of values assigned to H.
2.3 MSD (Minimize Sum of Deviations) Model
The third LP model, hereafter referred to as the MSD model [5], has the following form:
where the variables satisfy: The objective here focuses on the minimization of total (vs. maximal) group overlap. The objective has value zero when the paired groups can be separated by a hyperplane. This formulation has the advantage (over the preceding formulation) of not requiring parametric adjustment.
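A common statement of the MSD model, in the same assumed notation as above, replaces the single maximum deviation with one deviation variable per observation:

\[
\begin{aligned}
\min\;& \sum_i d_i\\
\text{s.t.}\;& A_i x \le b + d_i, \qquad i \in \text{Group 1},\\
& A_i x \ge b - d_i, \qquad i \in \text{Group 2},\\
& d_i \ge 0 \;\text{ for all } i .
\end{aligned}
\]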
2.4 LAD (Least Absolute Deviations) Model An alternative approach to the least squares method in regression analysis is the LAD method [10]. Instead of minimizing the sum of squares, a linear model is fitted by minimizing the sum of absolute deviations.
For regression analysis, many studies have compared the relative performance of the LAD procedure with the least squares method. In general, these studies suggest that the LAD estimators are more efficient when the error terms have a Laplace or some other heavy-tailed distribution. The LAD approach may be justified directly or by a likelihood argument, as we now demonstrate. Suppose the error terms are independent and follow the double exponential (or Laplace) distribution with zero mean and a fixed mean deviation, so that the density of each error is proportional to the exponential of minus its absolute value divided by the mean deviation. As before, we consider a model in which the group indicator takes the value 0 or 1 and the attribute values are measured about their mean. From (2), the likelihood is a decreasing function of the sum of absolute deviations, and by inspection we see that the likelihood estimators are given by minimizing the sum of the absolute deviations. This may be formulated as an LP problem with dummy variables for the positive and negative parts of each deviation.
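A standard LP statement of this LAD fit, with e_i^+ and e_i^- denoting the positive and negative parts of the ith deviation (notation assumed here), is

\[
\begin{aligned}
\min\;& \sum_{i} \bigl(e_i^{+} + e_i^{-}\bigr)\\
\text{s.t.}\;& y_i \;=\; \beta_0 + \sum_{j} \beta_j x_{ij} + e_i^{+} - e_i^{-}, \qquad i = 1,\dots,n,\\
& e_i^{+} \ge 0,\; e_i^{-} \ge 0 .
\end{aligned}
\]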
2.5 The Hybrid Discriminant Model We restrict attention to the two-group discriminant problem, observing that the multigroup problem can be handled by a sequential variant of the two-group approach. Notationally, we represent each data point by where (Group 1) and
(Group 2). We seek a weighting vector x and a scalar b , providing a
hyperplane of the form Ax = b .(This interpretation arises retrospectively by supposing variables x and b have already been determined and thus may be treated as constants; A is a row vector of variables.) The goal is to assure as nearly as possible that the points of Group 1 lie on one side of the hyperplane and those of Group 2 on the other (e.g., for and for Upon introducing objective function coefficients to discourage external deviations and
encourage internal deviations, and defining hybrid model [6] as follows.
we may express the
The objective function coefficients are assumed nonnegative, and appropriate values of these coefficients must satisfy a set of inequalities relating the coefficient for i = 0 to the coefficients for the individual observations. In general, the coefficients would be chosen to make each of the foregoing inequalities strict. These restrictions can be understood by direct analysis, but can also be justified as conditions essential to assuring feasibility of the linear programming dual.
2.6 MIP (Mixed Integer Programming) Model The MIP model [2] is:
where =the value for observation i of attribute j , C =the unknown cutoff value(unrestricted on sign), =the prior probability of being from group p =the cost of misclassifying an observation from group p =the number of observations in group p
=the correct classification deviation, =the misclassification deviation, =the unknown weight for attribute j (unrestricted on sign), Z =the value of the objective function,
The objective function represents the expected cost of misclassification. The terms are 0-1 variables. The value of is to be 1 if observation i is misclassified and 0 if observation i is correctly classified. The term represents the proportion of misclassifications in sample group p .In order to obtain the expected misclassification proportion for the population, the sample proportions are multiplied by the population prior probabilities of group membership, .Including costs of misclassification, the objective function represents the expected cost of misclassification. The equivalent linear programming (ELP) model is:
2.7 PMM (Practical Minimal Misclassifications) Model The practical minimal misclassifications (PMM) [2] is:
For at least one i,
where
P = a positive number,
=a small positive number relative to P ,and C =are unconstrained on sign.
3 An Infeasible Interior Point Algorithm for Proposed Models In this section we introduce a homogeneous feasibility model for linear programming for solving the models. Consider the primal and dual linear programming in standard form: Primal problem(LP):
Dual problem(LD):
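In the usual notation (A an m-by-n matrix, b a vector of length m, c a vector of length n), the standard-form pair referred to here can be written as

\[
\text{(LP)}\quad \min\; c^{T}x \;\;\text{s.t.}\;\; Ax = b,\; x \ge 0,
\qquad\qquad
\text{(LD)}\quad \max\; b^{T}y \;\;\text{s.t.}\;\; A^{T}y + s = c,\; s \ge 0 .
\]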
We consider the following homogeneous feasibility model (HLP) introduced in Huang [8]:
The above model is an extension of the homogeneous and self-dual model for linear programming developed by Goldman and Tucker. A similar formulation was introduced by Potra and Sheng for solving semidefinite programming. It is also closely related to the shifted homogeneous primal-dual model considered by Nesterov. It is easy to see: Therefore we can obtain the following theorem immediately. Theorem 1. The (LP) problem has a solution if and only if the (HLP) has a feasible solution such that
Let the current iterate be any interior point. We denote the residuals with respect to (HLP) as follows:
We define the infeasible central path neighborhood of the homogeneous problem by
The search direction is defined by the following linear system:
where is a parameter. It is easy to show that the above linear system has a unique solution. Let be the solution of above linear system with and
We can then prove the following result:
Theorem 2. (i) (ii) (iii)
Corollary 1. Let the sequence of iterates be generated by the path-following algorithm; then
We define the set of approximate solutions of (LP)-(LD) by
and the stopping criterion of the algorithm by
where the tolerance is a prescribed small positive number.
4 Conclusions
In this paper we introduce an interior point algorithm for solving the linear programming based discriminant models in the literature. The algorithm is polynomial in its complexity and uses a simple starting point. It has been implemented in practice with good efficiency, and we therefore expect it to solve the LP discriminant models efficiently as well. We have applied it to some data mining problems (support vector machines), and the results showed that the interior point algorithm is among the best approaches for solving such problems. The results will appear in an upcoming paper.
References [1] Bajgier, S. M., Hill, A. V.: An Experimental Comparison of Statistical and Linear Programming Approaches to the Discriminant Problem. Decision Sciences, 13 (1982) 604-618 [2] Banks, W. J., Abad, P. L.: An Efficient Optimal Solution Algorithms for the Classfication Problem. Decision Sciences, 21 (1991) 1008-1023 [3] Freed, N., Glover, F.: A Linear Programming Approach to the Discriminant Problem. Decision Sciences, 12 (1981) 68-74 [4] Freed, N., Glover, F.: Resolving Certain Difficulties and Improving the Classification Power of LP Discriminant Analysis Formulations. Decision Sciences, 17 (1986) 589-595 [5] Freed, N., Glover, F.: Evaluating Alternative Linear Programming Models to Solve the Two-group Discriminant Problem. Decision Sciences, 17 (1986) 151-162 [6] Glover, F., Keene, S., Duea, B.: A New Class of Models For the Discriminant Problem. Decision Sciences, 19 (1988) 269-280. [7] Glover, F.: Improved Linear Programming Models for Discriminant Analysis. Decision Sciences 21 (1990) 771-784. [8] Huang, S.: Probabilistic Analysis of Infeasible Interior-Point Algorithms for Linear Programming. Working Paper, Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100080, China (2003)
[9] Koehler, G. J., Erenguc, S. S.: Minimizing Misclassifications in Linear Discriminant Analysis. Decision Sciences 21 (1990) 62-85 [10] Lee, C. K., Ord, J. K.: Discriminant Analysis Using Least Absolute Deviations, Decision Sciences 21 (1990) 86-96 [11] Markowski, E. P., Markowski, C. A.: Some Difficulties and Improvements in applying linear programming Formulations to the Discriminant Problem. Decision Sciences 16 (1985) 237-247 [12] Ragsdate, G. T., Stam, A.: Mathematical Programming Formulations for the Discriminant Problem: An Old Dog Does New Tricks. Decision Sciences 22 (1991) [13] Rubin, P. A.: A Comparison of Linear Programming and Parametric Approaches to the Two-Group Discriminant Problem. Decision Sciences 21 (1990) 373-386 [14] Rubin, P. A.: Separation Failure in Linear Programming Discriminant Models. Decision Sciences, 22(1991) 519-535 [15] Stam, A., Joachimsthaler, E. A.: Solving the classification Problem in Discriminant Analysis Via Linear and Nonlinear Programming Methods. Decision Sciences (1989) 285-293 [16] Yarnold, P. R., Soltsik, R. C.: Refining Two-Group Multivariable Classification Models Using Univariate Optimal Discriminant Analysis. Decision Sciences 22 (1991) 1158-1164
A Method for Solving Optimization Problem in Continuous Space Using Improved Ant Colony Algorithm*
Ling Chen 1,2, Jie Shen 1, Ling Qin 1, and Jin Fan 1
1 Department of Computer Science, Yangzhou University, Yangzhou 225009
2 National Key Lab of Advanced Software Tech, Nanjing Univ., Nanjing 210093
{lchen, yzshenjie, fanjin}@yzcn.net, [email protected]
Abstract. A method for solving optimization problem with continuous parameters using improved ant colony algorithm is presented. In the method, groups of candidate values of the components are constructed, and each value in the group has its trail information. In each iteration of the ant colony algorithm, the method first chooses initial values of the components using the trail information. Then, crossover and mutation can determine the values of the components in the solution. Our experimental results of the problem of nonlinear programming show that our method has much higher convergence speed and stability than that of GA, and the drawback of ant colony algorithm of not being suitable for solving continuous optimization problems is overcome.
1 Introduction
Ant colony algorithm (AC) has emerged recently as a new meta-heuristic for hard combinatorial optimization problems. This meta-heuristic belongs, together with evolutionary algorithms, neural networks, and simulated annealing, to the class of problem-solving strategies derived from nature. The first AC algorithm was introduced by Dorigo, Maniezzo, and Colorni [1], using the Traveling Salesman Problem (TSP) [2] as an example. With further study in this area, the ant colony algorithm has been widely applied to problems such as job-shop scheduling [3], the Quadratic Assignment Problem (QAP) [4,5], the Sequential Ordering Problem (SOP) [6], and other NP-complete hard problems [7-11]. It demonstrates its superiority in solving complicated combinatorial optimization problems. The AC algorithm is basically a multi-agent system in which low-level interactions between single agents, i.e., artificial ants, result in a complex behavior of the whole ant colony. AC algorithms have been inspired by colonies of real ants, which deposit a chemical substance called pheromone on the ground. This substance influences the choices they make: the larger the amount of pheromone on a particular path, the larger the probability that an ant selects that path.
* This research was supported in part by the Chinese National Science Foundation and the Science Foundation of the Jiangsu Educational Commission, China.
Artificial ants in AC algorithms behave in a similar way. These behaviors of the ant colony thus construct a positive feedback loop, and the pheromone is used to communicate information among individuals for finding the shortest path from a food source to the nest. The ant colony algorithm simulates this mechanism of optimization, and it can find optimal solutions by means of communication and cooperation among the ants. Here we briefly introduce AC and its application to the TSP. In the TSP, each city of a given set of n cities has to be visited exactly once and the tour ends in the initial city. We denote the distance between cities i and j, and let the intensity of trail information on edge (i, j) at time t simulate the pheromone of real ants. Suppose m is the total number of ants; at time t the kth ant at its current city i selects the city j to move to according to the following probability distribution:
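In the standard ant system [1] this transition rule is usually written as follows, with tau_ij the trail intensity, eta_ij the visibility, and alpha, beta the weighting parameters:

\[
p_{ij}^{k}(t) \;=\;
\begin{cases}
\dfrac{\bigl[\tau_{ij}(t)\bigr]^{\alpha}\,\bigl[\eta_{ij}\bigr]^{\beta}}
{\sum_{l \in \mathrm{allowed}_k}\bigl[\tau_{il}(t)\bigr]^{\alpha}\,\bigl[\eta_{il}\bigr]^{\beta}} & \text{if } j \in \mathrm{allowed}_k,\\[2ex]
0 & \text{otherwise,}
\end{cases}
\]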
where the feasible set contains the cities that can be chosen by the kth ant at city i for the next step, the heuristic function is defined as the visibility of the path between cities i and j (for instance, the reciprocal of their distance), and the parameters alpha and beta determine the relative influence of the trail information and the visibility. The algorithm becomes a traditional greedy algorithm when the trail information is given no weight, and a purely positive-feedback heuristic when the visibility is given no weight. It takes an ant n steps to complete a tour traversing all the cities, and for every ant the path traversing all the cities forms a solution. The intensity of trail information is then changed by the updating formula:
where the first term represents the evaporation of trail information between time t and t+1 and the second term is the increment of trail information deposited on edge (i, j). The increment contributed by the kth ant on the path between cities i and j in step t takes different forms depending on the model used. For example, in the most popularly used model, called the "ant-cycle system", it is given as:
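In the notation of [1] (rho is the trail persistence, so 1 - rho corresponds to evaporation), the ant-cycle update is commonly written as

\[
\tau_{ij}(t+1) \;=\; \rho\,\tau_{ij}(t) \;+\; \Delta\tau_{ij},
\qquad
\Delta\tau_{ij} \;=\; \sum_{k=1}^{m}\Delta\tau_{ij}^{k},
\qquad
\Delta\tau_{ij}^{k} \;=\;
\begin{cases}
Q/L_{k} & \text{if ant } k \text{ used edge } (i,j),\\
0 & \text{otherwise,}
\end{cases}
\]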
where Q is a constant and L_k is the total length of the current tour traveled by the kth ant. These iterations end when a certain termination condition is satisfied. The essence of the optimization process in the ant colony algorithm is as follows:
1. Learning mechanism: the more trail information an edge has, the higher the probability that it is selected;
2. Updating mechanism: the intensity of trail information on an edge is increased by the passing ants and decreased by evaporation;
3. Cooperative mechanism: communication and cooperation between individuals through trail information give the ant colony algorithm a strong capability of finding the best solutions.
However, the classical ant colony algorithm also has its defects. For instance, since the moving of the ants is stochastic, while the population size is large enough, it could take quite a long time to find a better solution of a path. Based on the algorithm introduced above M.Dorigo et al improved the classical AC algorithm and proposed an much more general algorithm called “Ant-Q System”[12,13]. This algorithm allow the path which has the most intensive trail information to has much higher probability to be selected so as to make full use of the learning mechanism and stronger exploitation of the feedback information of the best solution. In order to overcome the stagnation behavior in the Ant-Q System, T. Stutzle et al presented the MAX-MIN Ant System[14] in which the Allowed range of the trail information in each edge is limited to a certain interval. On the other hand, L.M.Gambardella et al proposed the Hybrid Ant System (HAS)[15]. In each iteration of HAS, local probe is made for each solution by constructed by ants was used so as to find the local optimal. The previous solutions of the ants would be replaced by these local optimal solutions so as to improve the quality of the solutions quickly. Moreover, H.M.Botee et al make a thorough study on the adjustment of the parameter m and and obtained the optimal combination of these parameters using GA [16]. L. Chen et al also have proposed an ant colony algorithm based on equilibrium of distribution and ant colony algorithm with characteristics of sensation and consciousness [17,18], which enhance the optimization ability of ACA and broaden its application area. The problems of the convergence and parallelization of the ant colony algorithm have also been studied to speedup the ant colony optimization [19,20]. Since the ants can only choose limited and fixed paths at each stage, the ant colony algorithm demand of discrete search space and is not suitable for solving continuous optimization problems such as linear or non-linear programming. The main issue when extending the basic approach to deal with continuous search spaces is how to model a continuous nest neighborhood with a discrete structure. Recently, some research works have extended AC algorithm to real function optimization problems. Bilchev and Parmee [21]. for example, proposed an adaptation of the Ant Colony model for continuous problems. They proposed to represent a finite number of directions whose origin is a common base point called the nest. Since the idea is to cover eventually all the continuous search space, these vectors evolve over time according to the fitness values of the ants. In this paper, we present a new method for solving optimization problem in continuous space. In this method, the set of candidate paths the ants to select is dynamic and unfixed. Since genetic operations of crossover and mutate are used to determine the value of each component, the solutions obtained will
tend to be more diverse and global, and the values of the components can be selected in a continuous space. Group of candidate values for each component is constructed and each value in the group has its trail information. In each iteration of the algorithm, the method first chooses initial values of the components from its candidate values using the trail information, then the values of the components in the solution can be determined by the operations of crossover and mutation. Our experimental results on non-linear programming (NLP) problem show that our method has much higher convergence speed and stability than that of GA. Our method can also be applied to other problems with continuous parameters or problems with massive candidates in each stage, it widens the applying scope of ant colony system.
2 Solving Continuous Optimization Problems In this section, we introduce our method using the following nonlinear programming (NLP) problem as an example:
Here, the objective function G is a given non-linear function, and the constraint conditions, which are represented by a set of inequalities, form a convex domain. We can obtain the minimal n-dimensional hyper-cube which encloses this convex domain by transforming the inequalities. The hyper-cube obtained can be defined by the following inequalities:
Let the total number of ants in the system be m, and let the m initial solution vectors be chosen at random. The ith components of these initial solution vectors together form a group of candidate values for the ith component of the solution vector. If we use n vertices to represent the n components and the edges between vertex i and vertex i+1 to represent the candidate values of component i, then a path from the start vertex to the last vertex represents a solution vector whose n edges represent the n components of the solution vector. We denote the jth edge between vertices i and i+1 as (i, j) and consider the intensity of trail information on edge (i, j) at time t. Each ant starts at the first vertex, selects n edges in turn according to a certain probability distribution, and reaches the last vertex to complete a tour. In order to enhance the diversity of the component values, we update the selected m values of each component using the genetic operations of crossover and mutation. When all components have obtained their m new values, m new solutions are obtained. Then we update the trail information of the edges on each path. After each iteration, each component has 2m candidate values, consisting of its m former candidate values and its m new values. From these 2m values, the m candidate values with the highest trail information are selected to form the new set of candidate values. This process is iterated until a certain termination condition is satisfied. The framework of our algorithm is described as follows:
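A compact Python sketch consistent with this description is given below. All identifiers are assumptions, the step comments mirror the step numbers referred to in the text, and the crossover and mutation operators are simplified placeholders for the adaptive versions of Section 3.

```python
import numpy as np

def continuous_aco(obj, lower, upper, m=25, alpha=1.0, rho=0.7, q_deposit=1.0,
                   iterations=1000, seed=0):
    """Minimize obj over the box [lower, upper]; a sketch of the described method."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n = lower.size
    cand = rng.uniform(lower, upper, size=(m, n))   # candidate value group for each component
    trail = np.ones((m, n))                         # trail information of every candidate value
    best_x, best_f = None, np.inf

    for _ in range(iterations):
        # 2.1.1: every ant chooses, for each component, one value from the
        # candidate group with probability proportional to its trail information.
        new = np.empty((m, n))
        for i in range(n):
            p = trail[:, i] ** alpha
            p = p / p.sum()
            new[:, i] = cand[rng.choice(m, size=m, p=p), i]

        # 2.1.2: crossover and mutation on the selected component values
        # (plain operators here; the paper uses fitness-adaptive ones, Section 3).
        partners = rng.permutation(m)
        lam = rng.random((m, n))
        new = lam * new + (1.0 - lam) * new[partners]
        mutate = rng.random((m, n)) < 0.05
        new = new + mutate * rng.uniform(-1.0, 1.0, size=(m, n)) * 0.01 * (upper - lower)
        new = np.clip(new, lower, upper)

        # 2.2: evaluate the m new solutions, record the best, and update trails so
        # that better solutions deposit more trail on their component values.
        f = np.array([obj(row) for row in new])
        if f.min() < best_f:
            best_f, best_x = float(f.min()), new[f.argmin()].copy()
        deposit = q_deposit / (1.0 + f - f.min())
        trail = rho * trail + deposit[:, None]

        # 2.3: form the next candidate groups (simplified: the new values replace
        # the old ones; the paper keeps the m of 2m values with the most trail).
        cand = new

    return best_x, best_f
```

For example, continuous_aco(lambda x: float(np.sum(x ** 2)), [-5.0] * 10, [5.0] * 10) searches a 10-dimensional box for the minimum of a simple quadratic.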
In line 2.1.1 of the algorithm, an ant chooses the jth value in the candidate group of component i at time t according to the following probability distribution:
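One plausible form of this distribution, by analogy with the TSP rule in Section 1 but using only the trail information of the m candidate values (the notation is an assumption), is

\[
p_{ij}(t) \;=\; \frac{\bigl[\tau_{ij}(t)\bigr]^{\alpha}}{\sum_{l=1}^{m}\bigl[\tau_{il}(t)\bigr]^{\alpha}},
\]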
Here, tau_ij(t) is the intensity of trail information of the jth candidate value of the group for component i at time t; it changes dynamically as the algorithm proceeds.
3 Adaptive Crossover and Mutation
Line 2.1.2 of the algorithm introduced above performs crossover and mutation on the m values selected for component i. To obtain a high speed of convergence and avoid excessive local optimization, we adopt self-adaptive crossover and mutation operations: the probabilities of the operations and the range of the resulting values are determined dynamically by the quality of the solutions. Suppose two initial values of component i are to be crossed over, and consider the fitness of the two solutions they belong to. Given the crossover probability set by the system, the actual crossover probability for the two values is scaled according to these fitness values relative to the maximal fitness of the current generation: the higher the fitness of the two solutions, the lower the actual crossover probability. A random number is generated, and the crossover operation is carried out only if it does not exceed this actual probability. In the operation of crossover, we first generate two stochastic coefficients, and the results of the crossover are obtained as the corresponding affine combinations of the two parent values. It can easily be seen that the lower the fitness of the parents, the larger the value range the crossover operation covers. In the mutation of a component value x, the actual mutation probability is determined according to the fitness f of the solution to which x belongs. Given the mutation probability set by the system, a solution with a much higher fitness f yields a much lower mutation probability for its component values, so the values of components in better solutions have a greater chance of being preserved. A random number is generated, and the mutation operation is implemented only if it does not exceed the actual probability. The mutated value is obtained from x by adding a perturbation that is kept within the feasible bounds of the component. The perturbation involves a stochastic number between [-1, 1] that tends towards 0 as the fitness f and the generation number t increase; it is determined by the following formula:
Here r is also a stochastic number between [-1, 1], and the parameter that decides the extent of diversity and adjusts the area of the local probe can be selected between 0.005 and 0.01. As a result, the component values derived from a solution with higher fitness f have a smaller mutation scope, which causes a local probe; on the contrary, the process becomes a global probe when the mutation scope is wide enough. The mutation scope of the component value is reduced gradually as the number of iterations increases. In this way, the convergence process can be controlled when the generation number is very large, so as to accelerate the speed of convergence.
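A short Python sketch of adaptive operators in the spirit of this section is given below; the fitness-dependent scalings are illustrative assumptions rather than the paper's exact formulas, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def adaptive_crossover(v1, v2, f1, f2, f_max, pc=0.8):
    """Affine-combination crossover; fitter parents are crossed over less often."""
    pc_actual = pc * (1.0 - 0.5 * (f1 + f2) / (2.0 * f_max))   # assumed scaling
    if rng.random() > pc_actual:
        return v1, v2
    a, b = rng.random(), rng.random()
    return a * v1 + (1.0 - a) * v2, b * v2 + (1.0 - b) * v1

def adaptive_mutation(x, f, f_max, lo, hi, t, t_max, pm=0.1, delta=0.01):
    """Mutation whose probability and scope shrink with fitness and with time."""
    pm_actual = pm * (1.0 - f / f_max)                         # assumed scaling
    if rng.random() > pm_actual:
        return x
    scope = delta * (hi - lo) * (1.0 - f / f_max) * (1.0 - t / t_max)
    r = rng.uniform(-1.0, 1.0)
    return float(np.clip(x + r * scope, lo, hi))
```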
4 Experimental Results The algorithm presented in this paper was coded in C language performed on a Pentium PC. The computational testing of the new algorithm was carried out applying the code to the standard test problems G1 to G5 from the literature, and comparing the results to those of GA algorithm, under identical experimental conditions. We let m=25, and perform 20 trials on each problem. The data given as follows is the average numbers of iteration required to find the best solution.
Table 1 shows that the average number of iterations needed to reach the best solution on G1 to G5 is 4285.084 using GA, while it is 3198.938 using our algorithm. This fact shows that our algorithm has a stronger capability of finding
optimal solutions than GA, and it saves much more computing time. The experimental results illustrate that our algorithm is very effective in solving continuous optimization problems. Fig. 1 shows the progress of the best solution of G1 using both our algorithm and GA. It is obvious that the speed of reaching the best solution using our algorithm is higher than that of GA. In our algorithm, after reaching the best solution, the curve fluctuates within a very narrow scope around the best solution, which confirms the conclusion that our algorithm has good convergence. We also test G1 and G2 with our algorithm and GA in 10 trials, each of which has 500 iterations on G1 and 1500
Fig. 1. The process of the best solution
iterations on G2. The best objective function values of each trial, their average value, and their standard deviation (SD), are listed in table 2. The experimental results of 10 trials show that the standard deviation is much lower using our algorithm than that of using GA, this demonstrates that our algorithm is superior in stability. The reason our algorithm has much higher convergence and stability is that component values of the solutions are obtained by genetic operations on the values of the components of different solutions in the previous generation, this leads to solutions of new generation be more diverse and global so as to avoid the prematurity. On the other hand, since each component chooses its values of the next generation from a group of candidate values in which most of the component values evolved are from solutions with better fitness, this makes the new solutions have strong capability of optimization in all directions so that better solutions can be constructed in a short period of time.
5 Conclusions
To overcome the drawback that the classical ant colony algorithm is not suitable for continuous optimization, a method for solving optimization problems with continuous parameters using an improved ant colony algorithm is presented. In the method, the candidate paths to be selected by the ants change dynamically rather than being fixed, and the solutions tend to be diverse and global by means of genetic operations on the component values of the paths at every stage. Groups of candidate values of the components are constructed, and each value in a group has its own trail information. In each iteration of the algorithm, the method first chooses initial values of the components using the trail information; then crossover and mutation determine the values of the components in the solution. This enables us to select the component values from a continuous space. Thus, our method can solve continuous optimization problems using the ant colony algorithm successfully. Our experimental results on nonlinear programming problems show that our method has much higher convergence speed and stability than GA. Furthermore, our method can also be applied to other problems with continuous parameters.
References 1. Dorigo, M., Maniezzo, V., Colorni A.: Ant system: Optimization by a colony of coorperating agents. IEEE Trans. On SMC, 26(1) (1996) 28-41 2. Dorigo, M, Gambardella, L. M.: Ant Colony System: A cooperative learning approach to the traveling salesman problem. IEEE Trans. On Evolutionary Computing, 1(1) (1997) 53-56 3. Colorni, A., Dorigo, M., Maniezzo, V.: Ant colony system for job-shop scheduling. Belgian J. of Operations Research Statistics and Computer Science, 34(1) (1994) 39-53 4. Maniezzo V, Exact and approximate nonditerministic tree search procedures for the quadratic assignment problem. INFORMS J. Comput. 11 (1999) 358-369 5. Maniezzo V, Carbonaro A.: An ANTS heuristic for the frequency assignment problem, Future Generation Computer Systems, 16 (2000) 927-935
6. Gambardella, L. M., Dorigo, M.: HAS-SOP: An Hybrid Ant System for the Sequential Ordering Problem. Tech. Rep. No. IDSIA 97-11, IDSIA, Lugano, Switzerland (1997) 7. Hadeli, Valckenaers, P., Kollingbaum, M., Van Brussel, H.: Multi-agent coordination and control using stigmergy. Computers in Industry, 53(1) (2004) 75-96 8. Eggers, J., Feillet, D., Kehl, S., Wagner, M. O., Yannou, B.: Optimization of the keyboard arrangement problem using an Ant Colony algorithm. European Journal of Operational Research, 148(3) (2003) 672-686 9. Gravel, M., Price, W. L., Gagné, C.: Scheduling continuous casting of aluminum using a multiple objective ant colony optimization metaheuristic. European Journal of Operational Research, 143(1) (2002) 218-229 10. Shelokar, P. S., Jayaraman, V. K., Kulkarni, B. D.: An ant colony classifier system: application to some process engineering problems, Computers & Chemical Engineering, 28(9) (2004) 1577-1584 11. Scheuermann, B., So, K., Guntsch, M., Middendorf, M., Diessel, O., ElGindy, H., Schmeck, H.: FPGA implementation of population-based ant colony optimization, Applied Soft Computing, 4(3) (2004) 303-322 12. Gambardella, L., Dorigo, M.: Ant-Q: A reinforcement learning approach to the traveling salesman problem. Proceedings of the International Conference on Evolutionary Computation (1996) 616-621 13. Dorigo, M, Luca, M.: A study of Ant-Q. Proceedings of International Conference on Parallel Problem from Nature, Springer Verlag, Berlin (199) 656-665 14. Stutzle, T., Hoos, H. H.: Improvements on the Ant System: Introducting the MAX-MIN Ant System. Artificial Neural Networks and Genetic Algorithms. Springer Verlag, New York (1988) 245-249 15. Gambaradella L.M., Dorigo M.: HAS-SOP: Hybrid ant system for the sequential ordering problem. Technical Report, IDSIA (1997) 16. Botee H. M., Bonabeau E.: Evolving ant colony optimization. Adv. Complex Systems, 1 (1998) 149-159 17. 17 Chen, L., Shen, J., Qin, L.: An Adaptive Ant Colony Algorithm Based on Equilibrium of Distribution. Journal of Software, 14(6) (2003) 1148-1151 18. Chen, L., Qin, L. et al. Ant colony algorithm with characteristics of sensation and consciousness. Journal of System Simulation, 15(10) (2003) 1418-1425 19. Gutjahr, W. J.: ACO algorithms with guaranteed convergence to the optimal solution. Information Processing Letters, 82(3) (2002) 145-153 20. Randall, M., Lewis, A.: A Parallel Implementation of Ant Colony Optimization. Journal of Parallel and Distributed Computing, 62(9) (2002) 1421-1432 21. Bilchev, G., Parmee, I. C.: The ant colony metaphor for searching continuous design spaces. In: Fogarty, Y., ed., Lecture Notes in Computer Science 993, (1995) 25-39. 22. Shen, J., Chen, L.: A new approach to solving nonlinear programming, Journal of Systems Science and Systems Engineering, 11(1) 2002 (28-36)
Data Set Balancing
David L. Olson
University of Nebraska, Department of Management, Lincoln, NE 68588-0491 USA
[email protected]
Abstract. This paper conducts experiments with three skewed data sets, seeking to demonstrate problems when skewed data is used, and identifying counter problems when data is balanced. The basic data mining algorithms of decision tree, regression-based, and neural network models are considered, using both categorical and continuous data. Two of the data sets have binary outcomes, while the third has a set of four possible outcomes. Key findings are that when the data is highly unbalanced, algorithms tend to degenerate by assigning all cases to the most common outcome. When data is balanced, accuracy rates tend to decline. If data is balanced, that reduces the training set size, and can lead to the degeneracy of model failure through omission of cases encountered in the test set. Decision tree algorithms were found to be the most robust with respect to the degree of balancing applied.
1 Introduction
Data mining technology is used increasingly by many companies to analyze large databases in order to discover previously unknown and actionable information that is then used to make crucial business decisions. This is the basis for the term "knowledge discovery". Data mining can be performed through a number of techniques, such as association, classification, clustering, prediction, and sequential patterns. Data mining algorithms draw upon various fields such as statistics, decision trees, neural networks, fuzzy logic, and linear programming. There are many data mining software product suites, including Enterprise Miner (SAS), Intelligent Miner (IBM), Clementine (SPSS), and PolyAnalyst (Megaputer). There are also specialty software products for specific algorithms, such as CART and See5 for decision trees, and other products for various phases of the data mining process. Data mining has proven valuable in almost every academic discipline. Understanding business applications of data mining is necessary to expose business college students to current analytic information technology. Data mining has been instrumental in customer relationship management [1] [2], financial analysis [3], credit card management [4], banking [5], insurance [6], tourism [7], and many other areas of statistical support to business. Business data mining is made possible by the generation of masses of data from computer information systems. Understanding this information generation and the tools available for its analysis is fundamental for business students in the 21st century. There are many highly useful applications in practically every field of scientific study, and data mining support is required to make sense of the masses of business data generated by computer technology.
A major problem in many of these applications is that data is often skewed. For instance, insurance companies hope that only a small portion of claims are fraudulent. Physicians hope that only a small portion of tested patients have cancerous tumors. Banks hope that only a small portion of their loans will turn out to have repayment problems. This paper examines the relative impact of such skewed data sets on common data mining algorithms for two different types of data – categorical and continuous.
2 Data Sets The paper presents results of experiments on outcome balancing using three simulated data sets representative of common applications of data mining in business. While simulated, these data sets were designed to have realistic correlations across variables. The first model includes loan applicants, the second data set insurance claims, and the third records of job performance.
2.1 Loan Application Data This data set consists of information on applicants for appliance loans. The full data set involves 650 past observations, of which 400 were used for the full training set, and 250 for testing. Applicant information on age, income, assets, debts, and credit rating (from a credit bureau, with red for bad credit, yellow for some credit problems, and green for clean credit record) is assumed available from loan applications. Variable Want is the amount requested in the appliance loan application. For past observations, variable On-Time is 1 if all payments were received on time, and 0 if not (Late or Default). The majority of past loans were paid on time. Data was transformed to obtain categorical data for some of the techniques. Age was grouped by less than 30 (young), 60 and over (old), and in between (middle aged). Income was grouped as less than or equal to $30,000 per year and lower (low income), $80,000 per year or more (high income), and average in between. Asset, debt, and loan amount (variable Want) are used by rule to generate categorical variable risk. Risk was categorized as high if debts exceeded assets, as low if assets exceeded the sum of debts plus the borrowing amount requested, and average in between. The categorical data thus consisted of four variables, each with three levels. The continuous data set transformed the original data to a 0-1 scale with 1 representing ideal and 0 the nadir for each variable.
2.2 Insurance Fraud Data The second data set involves insurance claims. The full data set includes 5000 past claims with known outcomes, of which 4000 were available for training and 1000 reserved for testing. Variables include claimant age, gender, amount of insurance claim, number of traffic tickets currently on record (less than 3 years old), number of prior accident claims of the type insured, and Attorney (if any). Outcome variable Fraud was 0 if fraud was not detected, and 1 if fraud was detected.
The categorical data set was generated by grouping Claimant Age into three levels and Claim amount into three levels. Gender was binary, while number of tickets and prior claims were both integer (from 0 to 3). The Attorney variable was left as five discrete values. Outcome was binary. The continuous data set transformed the original data to a 0-1 scale with 1 representing ideal and 0 the nadir for each variable.
2.3 Job Application Data The third data set involves 500 past job applicants, of which 250 were used for the full training set and 250 reserved for testing. This data set varies from the first two in that there are four possible outcomes (unacceptable, minimal, adequate, and excellent, in order of attractiveness). Some of these variables were quantitative and others are nominal. State, degree, and major were nominal. There is no information content intended by state or major. State was not expected to have a specific order prior to analysis, nor was major. (The analysis may conclude that there is a relationship between state, major, and outcome, however.) Degree was ordinal, in that MS and MBA are higher degrees than BS. However, as with state and major, the analysis may find a reverse relationship with outcome. The categorical data set was created by generating three age groups, two state outcomes (binary), five degree categories, three majors, and three experience levels. The continuous data set transformed the original data to a 0-1 scale with 1 representing ideal and 0 the nadir for each variable.
3 Experiments These data sets represent instances where there can be a high degree of imbalance in the data. Data mining was applied for categorical and continuous forms of all three data sets. For categorical data, decision tree models were obtained using See5, logistic regression from Clementine, and Clementine’s neural network model applied. For continuous data sets, See5 was used for a regression tree, and Clementine for regression (discriminant analysis) and neural network. In each case, the training data was sorted so that a controlled experiment could be conducted. First, the full model was run. Then the training set was reduced in size by deleting cases with the most common outcome until the desired imbalance was obtained. The correct classification rate was obtained by dividing the correctly classified test cases by the total number of test cases. This is not the only useful error metric, especially when there is high differential in the cost by error type. However, other error metrics would yield different solutions. Thus for our purposes, correct classification rate serves the purpose of examining the degradation of accuracy expected from reducing the training set in order to balance the data.
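The balancing procedure amounts to undersampling the majority outcome; a small Python sketch of it and of the correct classification rate is given below (function names are assumptions).

```python
import numpy as np

def undersample(X, y, minority_label, target_proportion, rng=np.random.default_rng(0)):
    """Delete majority-class cases until the minority class reaches the target proportion."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    # keep all minority cases; solve n_min / (n_min + n_majority_kept) = target_proportion
    n_keep = int(round(len(minority) * (1.0 - target_proportion) / target_proportion))
    kept = rng.choice(majority, size=min(n_keep, len(majority)), replace=False)
    idx = np.concatenate([minority, kept])
    return X[idx], y[idx]

def correct_classification_rate(y_true, y_pred):
    """Correctly classified test cases divided by the total number of test cases."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

For instance, with the loan training data and the late cases as the minority class, a target proportion of 0.15 keeps all 45 late cases and 255 on-time cases, giving the 300-case training set described below.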
3.1 Loan Data Results The loan application training set included 45 late cases of 400, for a balance proportion of 0.1125 (45/400). Keeping the 45 late cases for all training sets, the training set size was reduced by deleting cases with on-time outcomes, for late-case
proportions of 0.15 (300 total), 0.2 (225 total), 0.25 (180 total), and 0.3 (150 total). The correct classification rates and cost results are shown in Tables 1 through 6. The first test is shown in Table 1, using a decision tree model on categorical data. The full training set had a relatively low proportion of late cases (0.1125). This training set yielded a model predicting all cases to be on-time, which was correct for 0.92 of the 250 test cases. As the training set was balanced, the correct classification rate deteriorated, although some cases were now assigned to the late category. Note that this trend did not hold throughout the experiment: when the training set was reduced to 150 cases, the correct classification rate actually increased over the results for training set sizes of 180 and 225. Table 2 shows the results of the logistic regression model on categorical data. Here the full training set was again best. Balancing the data yielded the same results from then on. Table 3 shows the results for a neural network model on categorical data.
The results for this model were consistent with expectations. Reducing the training set to balance the outcomes yielded less and less accurate results.
Tests were also conducted on continuous data with the same three algorithms. Table 4 gives the results for a linear regression model on continuous data. These results were similar to those obtained with categorical data. Here there was an anomaly with the training set of 180 observations, but results were not much different from expectations. Table 5 shows results for a discriminant analysis model applied to the continuous data. These results were slightly better than those obtained for categorical data, exhibiting the expected trend of decreased accuracy with smaller training sets. Table 6 shows the results for the neural network model applied to continuous data.
The neural network model for continuous data was slightly less accurate than the results obtained from applying a neural network model to categorical data. The trend in accuracy was as expected. As expected, the full training set yielded the highest correct classification rate, except for two anomalies. Data mining software has the capability of including a cost function that could be used to direct algorithms in the case of decision trees. That was not used in this case, but it is expected to yield parallel results (greater accuracy
according to the metric driving the algorithm would be obtained with larger data sets). The best of the six models was the decision tree using categorical data, pruning the training set to only 150 observations. Continuous data might be expected to provide greater accuracy, as it is more precise than categorical data. However, this was not borne out by the results. Continuous data is more vulnerable to error induced by smaller data sets, which could have been one factor.
3.2 Fraud Data Set The fraud data set was more severely imbalanced, including only 60 fraudulent cases in the full training set of 4000. Training sets of 3000 (0.02 fraudulent), 2000 (0.03 fraudulent), 1000 (0.06 fraudulent), 600 (0.1 fraudulent), 300 (0.2 fraudulent), and 120 (0.5 fraudulent) were generated. Table 7 shows the decision tree model results. Only two sets of results were obtained. The outcome based on the larger training sets was degenerate, assigning all cases to be OK (not fraudulent). This yielded a very good correct classification rate, as only 22 of the 1000 test cases were fraudulent. Table 8 gives results for the logistic regression model.
Balancing the data from 4000 to 3000 training cases actually yielded an improved correct classification rate. This degenerated when the training file size was reduced to
1000, and the model yielded very poor results when the training data set was completely balanced, as only 120 observations were left. For the logistic regression model, this led to a case where the test set contained 31 cases not covered by the training set. Table 9 shows results for the neural network model applied to categorical data. The neural network model applied to categorical data was quite stable until the last training set where there were only 120 observations. At that point, model accuracy became very bad. Table 10 displays results for the regression tree applied to continuous data. The regression tree for continuous data had results very similar to those of the decision tree applied to categorical data. For the smaller training sets, the continuous data yielded slightly inferior results. Table 11 gives results for the discriminant analysis model.
The discriminant analysis model using continuous data had results with fewer anomalies than logistic regression obtained with categorical data, but was slightly less accurate. It also was not very good when based upon the smallest training set. The neural network model based on continuous data was not as good as the neural network model applied to categorical data, except that the degeneration for the training set of 120 was not as severe. Table 12 shows relative accuracy for the neural network model applied to the continuous data.
Overall, application of models to the highly imbalanced fraud data set behaved as expected for the most part. The best fit was obtained with logistic regression and neural network models applied to categorical data. Almost all of the models over the original data set were degenerate, in that they called all outcomes OK. The exceptions were the logistic regression and neural network models over continuous data. The set of runs demonstrated the reverse problem of having too small a data set. The neural network models for both categorical and continuous data had very high error rates for the equally balanced training set, as did the logistic regression model for categorical data. There was a clear degradation of the correct classification rate as the training set was reduced, along with improved cost results, except for these extreme instances.
3.3 Job Applicant Data Results This data set was far more complex, with correct classification requiring consideration of a four by four outcome matrix. The original data set was small, with only 250 training observations, only 7 of which were excellent (135 were adequate, 79 minimal, and 29 unacceptable). Training sets of 140 (7 excellent, 66 adequate, 38 minimal, and 29 unacceptable), 70 (7 excellent and 21 for each of the other three categories), 35 (7 excellent, 10 adequate, and 9 for the other two categories), and 28 (7 cases in each category) were generated. Results are shown in Table 13. The proportion correct increased as the training set size increased. Classification was also harder here because there were three ways for a forecast to be wrong; a naive forecast would be expected to be correct 0.25 of the time. The correct classification rate was more erratic in this case. Smaller training sets tended to have lower correct classification rates, but the extremely small size of the smaller sets led to anomalies in results from the decision tree model applied to categorical data. The results from the logistic
regression model were superior to those of the decision tree for the training sets of size 250 and 140. The other results, however, were far inferior, and for the very small training sets the model was degenerate, with no results reported. Neural network model results over categorical data were quite good, and relatively stable for smaller data sets. There was, however, an anomaly for the training data set of 70 observations.
Results for the regression tree model applied to continuous data were inferior to those of the decision tree applied to categorical data, except for the largest training set (which was very close in result). Discriminant analysis applied to continuous data also performed quite well, and did not degenerate when applied to the smaller data sets. The neural network model applied to continuous data was again erratic. Neural network models worked better for the data sets with more training observations.
4 Results The logistic regression model had the best overall fit, using the full training set. However, this model failed when the data set was reduced to the point where the training set did not include cases that appeared in the test set. The categorical decision tree model was very good when 140 or 250 observations were used for training, but when the training set was reduced to 70, it was very bad (as were all categorical models). The decision tree model again seemed the most robust. Models based upon
continuous data did not have results as good as those based on categorical data for most training sets. Table 14 provides a comparison of data set features based upon these results.
5 Conclusions Key findings are that when the data is highly unbalanced, algorithms tend to degenerate by assigning all cases to the most common outcome. When data is balanced, accuracy rates tend to decline. Balancing also reduces the training set size, which can lead to model failure through omission of cases encountered in the test set. Decision tree algorithms were found to be the most robust with respect to the degree of balancing applied. Simulated data sets representing important data mining applications in business were used. The positive feature of this approach is that expected data characteristics were controlled (no correlation of outcome with gender or state, for instance; positive correlations for educational level and major). However, it obviously would be better to use real data. Given access to such real data, similar testing is attractive. For now, however, this set of experiments has identified some characteristics of data mining tools with respect to the issue of balancing data sets.
Computation of Least Square Estimates Without Matrix Manipulation Yachen Lin1 and Chung Chen2 1
JPM Chase, Division of Private Label Cards, 225 Chastain Meadows Ct, Kennesaw, GA 30152, USA [email protected] 2
School of Management, Syracuse University, Syracuse, NY 13244, USA
Abstract. The least square approach is undoubtedly one of the best known methods in statistics and related disciplines such as optimization, artificial intelligence, and data mining. The core of the traditional least square approach is to find the inverse of the product of the design matrix and its transpose. It therefore requires storing at least two matrices: the design matrix and the inverse of the product. In some applications, for example high-frequency financial data in the capital market and transactional data in the credit card market, the design matrix is huge and on-line updating is desirable. Such cases present a difficulty for the traditional matrix version of the least square approach, for two reasons: (1) it is a cumbersome task to manipulate the huge matrix; (2) it is difficult to utilize the latest information and update the estimates on the fly. Therefore, a new method is needed. In this paper, the authors apply the idea of CIO (component-wise iterative optimization) and propose an algorithm that computes a least square estimate without matrix manipulation, i.e., it requires no storage for the design matrix or the inverse of the product, and it can update the estimates on the fly. It is also rigorously shown that the solution obtained by the algorithm is truly a least square estimate.
1 Introduction In recent years, e-commerce has grown exponentially. As the amount of information recorded and stored electronically grows ever larger, it becomes increasingly useful and essential to develop better and more efficient ways to store, extract, and process information. In fact, it is clear that our achievements in acquiring and storing data have so far been far ahead of those in analyzing data. With the availability of huge amounts of information, it has been found in many cases that well-known methodologies are difficult to apply directly because of the size of the information being processed or the way the information is utilized. Such difficulties in real applications present an uncompromising challenge for academic researchers and practitioners. At the same time, business and industry have been demanding new tools for their tasks of in-depth data analysis. The birth of a
research area called data mining is partially due to demands of this kind. People in the field of data mining no longer confine themselves to assumptions about the underlying probability space and distributions. They treat the data simply as a set of closely related objects for given tasks; certainly, some tasks may adopt the same statistical assumptions. Why does the size of data bring such a revolutionary change? In statistical textbooks, a nice result is usually obtained when the sample size tends to infinity under regularity assumptions such as a given probability space and an underlying distribution. People may tend to think it is fine to apply the asymptotic results when the sample size is larger than a few hundred. Indeed, it may be useful in many cases to apply the asymptotic results when the sample size is between a few hundred and a few hundred thousand, but for some problems it may not be useful at all when the sample size is larger than a few million (Lin (2001)). The fact is that size matters! A penny may not be significant, but when it is amplified by millions, it becomes significant. Such an effect of large quantities in business is called the Gigabyte effect, due to the fact that the information recorded and processed needs a few gigabytes to store. This phenomenon, in which a change of quantity becomes a change of quality, reveals an underlying truth: the assumption of a given probability space and distribution is not sufficient. This insufficiency, we believe, is partially caused by entangled signals in the data. When the size becomes extremely large, the entangled signals will be significant not only in the desired dimension but also in unexpected dimensions. The size of data being processed for a regular OLAP (On-Line Analytical Processing) task in a credit card issuer of medium size is usually a few million observations. An OLAP task may come from one of the following categories: (1) decisions for potential customers; (2) decisions for existing non-default customers; (3) decisions for default customers. The first category focuses on whether or not to extend credit, and if so, by how much, which is known as acquisition decision making. The second category covers decisions such as increasing or lowering the credit limit, authorization, reissue, and promotion. The third category contains several decisions regarding delinquent or default customers, known as collection or recovery, such as sending a computer-generated letter, making a phone call, sending a business or attorney letter, or assigning the account to an external collection agency. Any action taken is based on the assessment of risk or profit at the individual account level and usually involves computing values from predictive models. There are three categories of predictive models in terms of their final outcomes: (a) predicted likelihoods, (b) expected values, and (c) real values. The last two categories usually involve computing estimates of real values, and for these cases the least square approach is often one of the preferred methods. Considering transaction-based real-value predictive models, which are used to directly predict the profit or revenue of accounts or to update account profiles, millions of transactions are processed through these models every single day.
If the traditional least square approach were used in this computation, it would suffer from (1) large computing resource requirements due to the stored matrices, (2) unstable results in some cases due to the singularity of the matrices, and (3) the virtual impossibility of updating the least square estimates on the fly
due to the lack of such an algorithm. Therefore, a natural question is raised: how can we overcome the barriers of the traditional matrix version of the least square approach and find a least square solution without these problems? In order to answer the question, we need to re-examine the core of the traditional least square approach: finding the inverse of the product of the design matrix and its transpose. Such an approach requires storing at least two matrices, the design matrix and the inverse of the product (the many variations of matrix manipulation for least square estimates in the literature all require storing these matrices, although their sizes have been reduced in various ways as the algorithms have improved). As such, we can say that the heart of the traditional least square approach is matrix manipulation, which, although neat, brings the potential problems stated above in computing resources, in stability, and in ongoing updating when the matrix is large. Based on these observations, new algorithms should come from the non-matrix category, that is, algorithms without matrix manipulation. In the current study, we propose an algorithm that computes a least square estimate without matrix manipulation, i.e., it requires no storage for the design matrix or the inverse of the product, and it can update the estimates on the fly. The proposed algorithm is based on the principle we call CIO, the component-wise iterative optimization, which will become clear in the next section. Section 3 gives a theoretical proof that the proposed algorithm indeed generates a least square solution.
2 The Component-Wise Iterative Optimization and Algorithm

Given a linear model,

$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$,  (2.1)

where $\varepsilon_i$ is a random term, i = 1, 2, …, n, and its objective function is

$Q(\beta) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{l=1}^{p} \beta_l x_{il} \right)^2$.  (2.2)

A least square estimate $\hat{\beta}$ of $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ is defined as a solution of

$Q(\hat{\beta}) = \min_{\beta} Q(\beta)$.

Theoretically, $\hat{\beta}$ has a nice closed form in terms of X and Y, where X and Y are the design matrix and response vector, respectively. It is well-known that $\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$. In theory, we have solved the problem. But in practice, we still
need to find the inverse, $(X^{T}X)^{-1}$. Thus, there are many algorithms proposed to solve for the inverse under different conditions. However, all those algorithms, no matter how different they appear, have the following two items in common: (1) all observations need to be present in memory; (2) the inverse of the matrix needs to be in memory as well. Due to these two constraints, an OLAP task using the current algorithms for least square solutions has to be kept to a relatively small or medium size of data, i.e., up to a few hundred thousand observations. Also, to update the least square estimate, we have to make all the observations available at the time a transaction occurs. For a relatively large credit card issuer, this means huge memory and CPU resources and is sometimes impractical. Considering the current approaches, the key is the matrix manipulation. If we can find a way around it, we may be able to save memory and CPU resources and make the approach more applicable. To avoid storing a huge matrix, we have to explore a totally new way to solve the least square problem.

Starting from the early 1980s, Gibbs sampling has drawn much attention and has later been developed into a field called MCMC. The idea in Gibbs sampling is that the underlying population may be either unknown or very complicated; however, given any component, the conditional distribution is known or relatively easy to simulate. The basic procedure is the following:

Step 0. Pick $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)})$ in the support of the underlying population.

Step 1. Compute $\theta^{(1)}$ in the following way:
(1) Draw $\theta_1^{(1)}$ from the conditional distribution of $\theta_1$ given $\theta_2^{(0)}, \ldots, \theta_p^{(0)}$;
(2) Draw $\theta_2^{(1)}$ from the conditional distribution of $\theta_2$ given $\theta_1^{(1)}, \theta_3^{(0)}, \ldots, \theta_p^{(0)}$, and continue the same procedure to the last component;
(3) Draw $\theta_p^{(1)}$ from the conditional distribution of $\theta_p$ given $\theta_1^{(1)}, \ldots, \theta_{p-1}^{(1)}$.
Thus, the first cycle of Gibbs sampling is achieved and we can update $\theta^{(0)}$ to $\theta^{(1)}$. Repeat Step 1 for n-1 times, and then get to Step n.

Step n. Following the same procedure, we can get $\theta^{(n)}$. Under some regular assumptions, we can prove that $\theta^{(n)}$ converges in distribution to the underlying population.

These steps of Gibbs sampling present a typical procedure of CIO, the component-wise iterative optimization. The idea of CIO can also be seen in many other topics such as EM algorithms. In the E-step, the conditional expectation of the complete-data likelihood given the observed data is calculated over the missing components. In the M-step, this expectation is maximized for the underlying model. CIO proposes the principle that a complicated problem can be solved iteratively by a series of relatively simple solutions. A general study has been carried out, and some fruitful results, including more rigorous studies of its theoretical foundation, can be found in Lin (2003).
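As a concrete illustration of the component-wise idea, the sketch below runs a Gibbs sampler for a bivariate normal distribution with correlation rho, drawing each coordinate from its conditional distribution given the other. The bivariate-normal example and its parameters are our own illustration and are not taken from the paper.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_cycles=5000, seed=0):
    """Gibbs sampler for (X1, X2) ~ N(0, [[1, rho], [rho, 1]]).

    Each conditional is univariate normal:
    X1 | X2 = x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for X2.
    """
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                     # Step 0: an arbitrary starting point
    sd = np.sqrt(1.0 - rho ** 2)
    samples = np.empty((n_cycles, 2))
    for t in range(n_cycles):
        x1 = rng.normal(rho * x2, sd)     # draw component 1 given component 2
        x2 = rng.normal(rho * x1, sd)     # draw component 2 given the new component 1
        samples[t] = (x1, x2)
    return samples

# The empirical correlation of the draws approaches rho as the number of cycles grows:
# print(np.corrcoef(gibbs_bivariate_normal(0.8).T)[0, 1])
```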
Applying the idea of CIO to the matrix version of the least square problem discussed at the beginning of this section, we propose the following algorithm. From this algorithm, readers can get a much clearer picture of how the procedure of CIO paves the way for a new algorithm.

Non-matrix Version of the Algorithm for Solving the Least Square Estimates

Given (2.1) and (2.2), the following algorithm generates a solution of least square estimates for $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$.

Step 1. Objective Function: Take the sum of squares (2.2) as the objective function; the goal is to find a solution $\hat{\beta}$ of $\beta$ such that $Q(\hat{\beta}) = \min_{\beta} Q(\beta)$.

Step 2. Initial Value: Choose an initial value $\beta^{(0)} = (\beta_0^{(0)}, \beta_1^{(0)}, \ldots, \beta_p^{(0)})$ for $\beta$ randomly or by certain optimal rules.

Step 3. Updating the Intercept: (1) Given the $k$-th iteration $\beta^{(k)}$ and the $i$-th sample pattern $(x_{i1}, \ldots, x_{ip}, y_i)$, compute $e_i^{(k)} = y_i - \sum_{l=1}^{p} \beta_l^{(k)} x_{il}$, i = 1, …, n, and let the intercept at the $(k+1)$-th iteration be $\beta_0^{(k+1)} = \frac{1}{n}\sum_{i=1}^{n} e_i^{(k)}$. (2) Repeat (1) of Step 3 for all sample patterns.

Step 4. Updating the Slopes: (1) Given the updated intercept $\beta_0^{(k+1)}$, the slopes already updated in this cycle, and the $i$-th sample pattern, take the value for the slope $\beta_l$, l = 1, …, p, at the $(k+1)$-th iteration as $\beta_l^{(k+1)} = \sum_{i=1}^{n} x_{il} u_{il} \big/ \sum_{i=1}^{n} x_{il}^2$, where $u_{il} = y_i - \beta_0^{(k+1)} - \sum_{m<l} \beta_m^{(k+1)} x_{im} - \sum_{m>l} \beta_m^{(k)} x_{im}$, i = 1, …, n. (2) Repeat (1) of Step 4 for all patterns.

Step 5. Updating $\beta^{(k)}$: Set $\beta^{(k+1)} = (\beta_0^{(k+1)}, \beta_1^{(k+1)}, \ldots, \beta_p^{(k+1)})$ and repeat Steps 3 and 4 until the objective function values of successive iterations stabilize.
Remarks: (1) From the algorithm, we can see that it follows exactly the procedure of CIO. Each component of the parameter is optimized iteratively given the others, and finally the whole parameter is optimized. (2) Each updating cycle or iteration is finished by updating the components one by one. After a certain number of cycles, say $k$, $\beta^{(k)}$ and $\beta^{(k+1)}$ generate similar values for the given objective function. If we change the size of the set of sample patterns, we obtain a series of estimates. (3) The implementation can be carried out in a much simpler way, so that the case of the denominator being zero is no longer an issue.
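A minimal sketch of the non-matrix algorithm is given below, under the reading that each component update is the exact coordinate-wise minimizer of the sum of squares; since the displayed formulas are not fully recoverable from this copy, the update expressions should be taken as our reconstruction rather than the authors' exact implementation. For clarity the sketch holds the data in memory, whereas the on-the-fly variant described in the paper would accumulate the same quantities pattern by pattern; in either case no matrix product or inverse is ever formed.

```python
import numpy as np

def cio_least_squares(x, y, n_cycles=200):
    """Component-wise iterative optimization for least squares.

    x is an (n, p) array of predictors and y an (n,) response.  Each cycle
    updates the intercept and then each slope in turn, holding the other
    components fixed at their latest values (coordinate-wise minimizers).
    """
    n, p = x.shape
    beta0 = 0.0
    beta = np.zeros(p)
    resid = y - beta0 - x @ beta              # residuals under the current estimates
    for _ in range(n_cycles):
        # Step 3: intercept update (mean of the current residuals)
        shift = resid.mean()
        beta0 += shift
        resid -= shift
        # Step 4: slope updates, one component at a time
        for l in range(p):
            denom = np.dot(x[:, l], x[:, l])
            if denom == 0.0:
                continue
            delta = np.dot(x[:, l], resid) / denom
            beta[l] += delta
            resid -= delta * x[:, l]
    return beta0, beta
```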
3 The Proposed Algorithm Generates a Least Square Solution

In the last section, we mentioned that the proposed algorithm possesses some useful asymptotic properties such as strong consistency and asymptotic normality. In this section, we will prove that the proposed algorithm generates a least square solution computationally. From the proof, readers will have a chance to appreciate the insightful nature of the procedure of CIO.

Theorem. Given the setting of (2.1) and the objective function of (2.2), in the $k$-th and $(k+1)$-th iterations for estimating $\beta$ using the proposed algorithm of Section 2, where $\beta^{(k)}$ and $\beta^{(k+1)}$ denote the estimates in the $k$-th and $(k+1)$-th iterations, respectively, we have that

$Q(\beta^{(k+1)}) \le Q(\beta^{(k)})$,  (3.1)

where the equality holds if and only if $\beta_i^{(k+1)} = \beta_i^{(k)}$, i = 0, 1, …, p. Under the equality, $\beta^{(k)}$ and $\beta^{(k+1)}$ are least square estimates.

Proof. Two parts of the proof will be carried out: (1) proving (3.1) to be true for any k. If (3.1) is true, then it implies that either an improvement is made in each iteration or equality holds. If a significant improvement is still present after an iteration, then more iterations are needed. (2) If equality holds, or no significant improvement is present after an iteration, then $\beta_i^{(k+1)} = \beta_i^{(k)}$, i = 0, 1, …, p. At this point, $\beta^{(k+1)}$ will be a least square solution. In order to prove (3.1) to be true, by mathematical induction, it is sufficient to prove it to be true when k = 0. That is,

$Q(\beta^{(1)}) \le Q(\beta^{(0)})$.  (3.2)
By the procedure of CIO, we optimize one component at a time in the proposed algorithm. Thus, to prove (3.2), we will prove that the following sequence of inequalities holds:

$Q(\beta_0^{(1)}, \beta_1^{(1)}, \ldots, \beta_p^{(1)}) \le \cdots \le Q(\beta_0^{(1)}, \beta_1^{(1)}, \beta_2^{(0)}, \ldots, \beta_p^{(0)}) \le Q(\beta_0^{(1)}, \beta_1^{(0)}, \ldots, \beta_p^{(0)}) \le Q(\beta_0^{(0)}, \beta_1^{(0)}, \ldots, \beta_p^{(0)})$.  (3.3)

First, we prove that

$Q(\beta_0^{(1)}, \beta_1^{(0)}, \ldots, \beta_p^{(0)}) \le Q(\beta_0^{(0)}, \beta_1^{(0)}, \ldots, \beta_p^{(0)})$,  (3.4)

where $e_i = y_i - \sum_{l=1}^{p} \beta_l^{(0)} x_{il}$, i = 1, …, n, and $\beta_0^{(1)} = \frac{1}{n}\sum_{i=1}^{n} e_i$ by (2) of Step 3 in the proposed algorithm. That is, (3.4) reads

$\sum_{i=1}^{n} (e_i - \beta_0^{(1)})^2 \le \sum_{i=1}^{n} (e_i - \beta_0^{(0)})^2$.

Let LHS and RHS denote the left-hand side and the right-hand side of (3.4), respectively. To prove that LHS $\le$ RHS, consider the following expansion into equivalent forms:

$\text{RHS} = \sum_{i=1}^{n} (e_i - \beta_0^{(1)} + \beta_0^{(1)} - \beta_0^{(0)})^2 = \sum_{i=1}^{n} (e_i - \beta_0^{(1)})^2 + n(\beta_0^{(1)} - \beta_0^{(0)})^2$,

since the cross term vanishes by the definition of $\beta_0^{(1)}$. Combining LHS and RHS, we have the following equivalent inequalities:

$\text{LHS} \le \text{RHS} \iff 0 \le n(\beta_0^{(1)} - \beta_0^{(0)})^2$.
The last inequality holds; therefore, we have proved that (3.4) is true. Furthermore, the equality in (3.4) holds if and only if $\beta_0^{(1)} = \beta_0^{(0)}$. Next, we prove that the following inequality holds:

$Q(\beta_0^{(1)}, \beta_1^{(1)}, \beta_2^{(0)}, \ldots, \beta_p^{(0)}) \le Q(\beta_0^{(1)}, \beta_1^{(0)}, \beta_2^{(0)}, \ldots, \beta_p^{(0)})$,  (3.5)

where $\beta_1^{(1)} = \sum_{i=1}^{n} x_{i1} u_i \big/ \sum_{i=1}^{n} x_{i1}^2$ by (2) of Step 4 in the proposed algorithm. Let

$u_i = y_i - \beta_0^{(1)} - \sum_{l=2}^{p} \beta_l^{(0)} x_{il}$, i = 1, …, n;

then (3.5) reads

$\sum_{i=1}^{n} (u_i - \beta_1^{(1)} x_{i1})^2 \le \sum_{i=1}^{n} (u_i - \beta_1^{(0)} x_{i1})^2$.

Let LHS and RHS denote the left-hand side and the right-hand side of (3.5), respectively. It is easy to see that

$\text{RHS} = \sum_{i=1}^{n} \left( u_i - \beta_1^{(1)} x_{i1} + (\beta_1^{(1)} - \beta_1^{(0)}) x_{i1} \right)^2$

and, since the cross term vanishes by the definition of $\beta_1^{(1)}$,

$\text{RHS} = \text{LHS} + (\beta_1^{(1)} - \beta_1^{(0)})^2 \sum_{i=1}^{n} x_{i1}^2$.

Combining LHS and RHS, we have the following equivalent inequalities:

$\text{LHS} \le \text{RHS} \iff 0 \le (\beta_1^{(1)} - \beta_1^{(0)})^2 \sum_{i=1}^{n} x_{i1}^2$.

The last inequality holds; therefore, we have proved that (3.5) is true. Furthermore, the equality in (3.5) holds if and only if $\beta_1^{(1)} = \beta_1^{(0)}$. Using the same reasoning, it is easy to prove that for any j we have

$Q(\beta_0^{(1)}, \ldots, \beta_j^{(1)}, \beta_{j+1}^{(0)}, \ldots, \beta_p^{(0)}) \le Q(\beta_0^{(1)}, \ldots, \beta_{j-1}^{(1)}, \beta_j^{(0)}, \ldots, \beta_p^{(0)})$,
where j = 2, …, p. It is clear that the equality holds if and only if $\beta_j^{(1)} = \beta_j^{(0)}$. Thus, we have proved that (3.3) holds. Therefore, (3.2) holds. For the second part of the proof, we assume that $\beta^{(k+1)} = \beta^{(k)}$ and prove that $\beta^{(k)}$ is a solution of the least square normal equations. If we assume that $\beta_0^{(k+1)} = \beta_0^{(k)}$, then

$\beta_0^{(k)} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \sum_{l=1}^{p} \beta_l^{(k)} x_{il} \right)$,

which is equivalent to

$\sum_{i=1}^{n} \left( y_i - \beta_0^{(k)} - \sum_{l=1}^{p} \beta_l^{(k)} x_{il} \right) = 0$.  (3.6)

Similarly, if $\beta_l^{(k+1)} = \beta_l^{(k)}$, l = 1, …, p, we have

$\sum_{i=1}^{n} x_{il} \left( y_i - \beta_0^{(k)} - \sum_{m=1}^{p} \beta_m^{(k)} x_{im} \right) = 0, \quad l = 1, \ldots, p.$  (3.7)

(3.6) and (3.7) are simply the normal equations for the least square solutions. Therefore, we come to the conclusion that if $\beta_0^{(k+1)} = \beta_0^{(k)}$ and $\beta_l^{(k+1)} = \beta_l^{(k)}$, l = 1, …, p, then $\beta^{(k)}$ is a least square solution. Remark: In real applications, it may require a few more iterations rather than one iteration.
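The monotonicity in (3.1) and the normal equations (3.6) and (3.7) can also be checked numerically. The short script below, an illustrative check that is not part of the original paper, runs component-wise sweeps on simulated data, asserts that the objective never increases, and confirms that at convergence the residuals satisfy the normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
x = rng.normal(size=(n, p))
y = x @ rng.normal(size=p) + 2.0 + rng.normal(size=n)

beta0, beta = 0.0, np.zeros(p)
objective = lambda b0, b: np.sum((y - b0 - x @ b) ** 2)

previous = objective(beta0, beta)
for _ in range(500):
    beta0 = np.mean(y - x @ beta)                    # coordinate minimizer for the intercept
    for l in range(p):
        partial = y - beta0 - x @ beta + beta[l] * x[:, l]
        beta[l] = np.dot(x[:, l], partial) / np.dot(x[:, l], x[:, l])
    current = objective(beta0, beta)
    assert current <= previous + 1e-9                # inequality (3.1): Q never increases
    previous = current

resid = y - beta0 - x @ beta
print(abs(resid.sum()))                              # (3.6): the residuals sum to ~0
print(np.abs(x.T @ resid).max())                     # (3.7): X'resid is ~0 componentwise
```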
References
1. Lin, Y.: Success or Failure? Another Look at the Statistical Significance Test in Credit Scoring. Technical report, First North American National Bank (2001)
2. Lin, Y.: Introduction to Component-wise Iterative Optimization. Technical report, First North American National Bank (2003)
3. Lin, Y.: Estimation of Parameters in Nonlinear Regressions and Neural Networks. Invited talk, North American New Researchers' Meeting, Kingston, Canada, July 5-8 (1995)
4. Lin, Y.: Feed-forward Neural Networks: Learning Algorithms, Statistical Properties, and Applications. Ph.D. dissertation, Department of Mathematics, Syracuse University (1996)
5. Lin, Y.: Statistical Behavior of Two-Stage Learning for Feed-forward Neural Networks with a Single Hidden Layer. 1998 Proceedings of the American Statistical Association (1998)
Ensuring Serializability for Mobile Data Mining on Multimedia Objects Shin Parker1, Zhengxin Chen1, and Eugene Sheng2 1
Department of Computer Science, University of Nebraska at Omaha, Omaha, NE 68182-0500 [email protected]
2
Department of Computer Science, Northern Illinois University, DeKalb, IL 60115 [email protected]
Abstract. Data mining is usually considered an application task conducted on top of database management systems. However, this may not always be true. To illustrate this, in this article we examine the issue of conducting data mining in mobile computing environments, where multiple physical copies of the same data object may exist in client caches at the same time, with the server as the primary owner of all data objects. By demonstrating what can be mined in such an environment, we point out the important connection of data mining with database implementation. This leads us to take a look at the issue of extending traditional invalid-access prevention policy protocols, which are needed to ensure serializability involving data updates in mobile environments. Furthermore, we provide examples to illustrate how such research can shed light on mobile data mining.
1 Introduction We start this paper with the following interesting question related to data mining: For (database-centric) data mining, is it always a database application task? In other words, is data mining always restricted to a front-end tool? A more general form of this question is: Could data mining be related to database implementation in any way? In this paper we address these issues by taking a look at the case of mobile data mining, which is concerned with conducting data mining in a mobile computing environment. We present our opinion on these questions, arguing that there is a need to consider issues related to data mining and the implementation of database management systems (DBMSs). As an example of our viewpoint, we first review a recent work on extending serializability support to handle multiple copies of objects so that concurrent execution of multiple transactions is conflict-equivalent to serial execution of these transactions. We then examine the implications of this work for data mining.
The balance of this paper is organized as follows. In Section 2 we provide an example to illustrate the need for examining the relationship between data mining and DBMS implementation. This leads us to take a look at the general data management issues for mobile data mining in Section 3. In order to deal with the problems raised in Section 2, we examine the problem of ensuring serializability in the mobile computing environment in Section 4, and further discuss the implications of this research for mobile data mining in Section 5. We conclude the paper in Section 6, where related future research work is discussed.
2 Mobile Data Mining and DBMS Implementation Mobile data mining can be considered a special case of distributed data mining (DDM) [7, 9], which is concerned with mining data using distributed resources. However, mobile data mining has unique features, and researchers have started exploring this area (e.g., [12]). One interesting issue of data mining in mobile computing environments is mining on mobile users, such as their behavior in mobile environments. The importance of using mobile user movement data to derive knowledge about mobile users in the e-commerce application domain has been noted [8]. So what is the role of the DBMS in mobile data mining? Techniques for mobile query processing and optimization, such as location-dependent query processing [14], have been addressed by various researchers. Since mobile data mining can take advantage of this kind of technical advance, it seems not too difficult to argue that query processing can contribute to mobile data mining. But how about the relevance of other aspects of DBMS implementation, such as transaction processing? The connection between mobile data mining and transaction processing seems to be quite remote. However, consider the following scenario. In order to derive knowledge about mobile users in the e-commerce application domain, if the same multimedia object is accessed or even updated by different mobile users, then in order to mine the user profiles, shopping habits, or other useful knowledge about users, it is crucial to handle multiple copies of objects correctly. Such a prerequisite can be stated as a problem of ensuring serializability in a mobile client-server environment, a problem within the realm of transaction processing. For example, we may want to know what kind of mobile users are interested in a certain kind of video, or what kind of video a certain kind of mobile user is interested in. It is thus important to recognize multiple versions of the same mobile object. The mobile data mining community has the responsibility to identify such problems. Furthermore, the mobile mining community may even have to develop solutions for some problems identified by themselves (rather than just waiting for solutions provided by others such as DBMS developers), because such requirements may be closely integrated into the data mining task at hand. For example, we may want to know which kinds of message boards (or other shared objects) are updated most frequently by mobile users; in this case, we need to trace and count the version numbers of the objects. Therefore, although version numbers are usually handled by the database implementation, mobile data mining algorithms may have to do something about them.
3 Data Management Issues for Mobile Data Mining In order to understand the importance of the observation made in Section 2, we can incorporate it into the general discussion of data management issues for mobile data mining as identified by Lim et al. [8]:

Developing infrastructure for mobile data mining, to explore the design issues involving a warehouse for mobile data. This may include algorithms for efficient aggregation and transformation of mobile data. The challenge in this case is to deal with heterogeneous data formats from different mobile devices at different locations with different bandwidth and computing resources.

Developing algorithms for mobile data mining, for example algorithms that find knowledge to improve the efficiency of mobile applications and queries.

Incorporating mobile mining results into operational systems, to integrate knowledge obtained from data mining with the operational systems. The challenge is to develop algorithms that evaluate which data mining results are 'actionable', and then apply them in a timely fashion to ensure the effectiveness of the mobile data mining system.

In addition to the above issues raised in [8], we also advocate the study of the following aspect: conducting research in support of query processing/optimization and transaction processing, so that mobile data mining can be effectively conducted. The rest of this paper further examines this last issue.
4 Extending Invalid-Access Prevention Policy As a concrete example of conducting research on transaction processing for mobile data mining, we now return to the question raised in Section 2, namely, how to correctly handle multiple object copies so that serializability can be ensured in mobile computing environments. One approach could be to attach an additional field to each record of the system log so that the identifier of the mobile user who has accessed (or even updated) an object can be recorded along with the object itself. However, this approach may not be realistic, because the introduction of such an additional field may not be justified by other system applications (i.e., such information may not be useful for other applications). Therefore, in the rest of this paper we consider how to extend a recent study on extending the traditional invalid-access prevention policy for mobile data mining concerning multiple-copy objects. This is done in two steps. First, in this section (Section 4) we discuss the extension process itself (based on the discussion of Parker and Chen [11]) without addressing any issues directly related to mobile data mining. Then, in Section 5, we revisit the issue of mobile data mining concerning multiple-copy objects by pointing out how we can take advantage of the results obtained in this section.
4.1 Review of Traditional Invalid-Access Prevention Policy In a typical client/server computing architecture, there may exist multiple physical copies of the same data object at the same time in the network, with the server as the primary owner of all data objects. The existence of multiple copies of the same multimedia object in client caches is possible when there is no data conflict in the network. In managing multiple clients' concurrent read/write operations on a multimedia object, no transaction that accessed an old version should be allowed to commit. On this basis of the invalid-access prevention policy, several protocols have been proposed. The purpose of these protocols is to create an illusion of a single, logical, multimedia data object in the face of multiple physical copies in the client/server network when a data conflict situation arises. When the server becomes aware of a network-wide data conflict, it initiates a cache consistency request to remote clients on behalf of the transaction that caused the data conflict. The basic ideas of three known protocols under this policy are summarized below (for a summary of these methods, see [11]; for other related work, see [2,3,4,5,6,10,13,15]). Server-Based Two-Phase Locking (S2PL): The S2PL uses a detection-based algorithm and supports inter-transaction caching. It validates cached pages synchronously on a transaction's initial access to the page. Before a transaction is allowed to commit, it must first access the primary copies from the server for each data item that it has read at the client. The new value must be installed at the client if the client's cache version is outdated. The server is aware only of a list of clients who requested locks, and no broadcast is used by the server to communicate with clients. The client is aware of a list of object version numbers, and no local lock is used by the client. This protocol carries no overhead in page table maintenance at the price of not being able to detect multiple versions in a network. Call-Back Locking (CBL): CBL is an avoidance-based protocol that supports inter-transactional page caching. Transactions executing under an avoidance-based scheme must obey the read-once write-all (ROWA) replica management approach, which guarantees the correctness of data from the client cache by enforcing that all existing copies of an updated object have the same value when an updating transaction commits. Therefore an interaction with the server is required only at a client cache-miss or for updating its cache copy. The global nature of ROWA implies that consistency actions may be required at one or more remote clients before the server can register a write permit for a local client. An update transaction cannot commit until all of the necessary consistency operations have been completed at remote clients. When all consistency operations are complete, there exist only two copies of the multimedia object, a primary copy at the server and a secondary copy at the client with a write permit, ready for a commit. Note that a write-write data conflict does not exist in CBL. In the process of securing a write permit for a local client, all remote clients' cache copies are invalidated and no new download is possible while the write permit locks the row in the database. Optimistic Two-Phase Locking (O2PL): This is avoidance-based and is more optimistic about the existence of data contention in the network than CBL. It defers
the write intention declaration until the end of a transaction's execution phase. Under the ROWA protocol, an interaction with the server is required under O2PL only at a client cache-miss or for committing a cache copy. As in CBL, all clients must inform the server when they erase a page from their buffer so that the server can update its page list. All three of the above invalid-access prevention policy protocols ensure that any update transactions that previously accessed old-version data are aborted by the server. A general difference among them is the varying number of client messages sent to the server and the server's abort frequency.
4.2 What's New in the Mobile Computing Environment? The inherent limitations of mobile computing systems present a challenge to the traditional problems of database management, especially when the client/server communication is unexpectedly severed at the client site. From the summary presented in Section 4.1, we can make the following observations. In an S2PL network, multiple versions of the same object ID can exist in the network due to automatic and as-needed replication, without the server's overall awareness of data conflicts in the network. This may result in frequent transaction aborts in the S2PL protocol. In a CBL or O2PL network, during the brief slice of time when an object is being committed to the database by a row-server with a write lock, there exists only one physical copy at the primary site and one copy at the secondary site, provided that all cached clients have invalidated their locally cached copies. However, some cached mobile clients may not have the opportunity to invalidate their obsolete copies or replicate updated copies, due to their disconnection from the network while the object is being committed by another node. This causes multiple cached versions of the same object ID to exist in the network when the disconnected clients return to the network, which is not desirable in applications requiring the ROWA protocol. Serializability is the most commonly used correctness criterion for concurrent transaction execution at a global level in database applications. The standard policy does not enforce serializability in the mobile computing environment. Transactions executing under an avoidance-based scheme must obey the ROWA principle, which guarantees the correctness of the data from the client cache under the CBL or O2PL protocol. The standard CBL and O2PL protocols can neither guarantee the currency of the mobile clients' cache copies nor prevent serializability violations when the clients reconnect to the network. Figure 1 illustrates how the error condition (appearing toward the end of the figure) arises in a CBL environment after mobile clients properly exit the client application.
4.3 Extended Invalid-Access Prevention Protocols for the Mobile Environment To prevent the serializability failure scenario described above, extended invalid-access prevention policy protocols that guarantee serializability are proposed here for mobile client/server environments. The extended invalid-access prevention
policy should include additional attributes: version numbers, recreation/release of page table rows, relinquishing of unused locks at sign-off, and a maximum lock duration.
Fig. 1. CBL Failure Analysis Tree in Mobile Environment
With the four extended attributes, the obsolete cached copies held by mobile clients that were disconnected from the network are logically invalidated while a connected client updates the same multimedia object. The logical invalidation is accomplished via the first two attributes, implementing sequential version numbers in the CBL and O2PL protocols and recreating page table entries only when the version number is current. The remaining two attributes, relinquishing unused permits or locks at exit or shutdown of the client application and a maximum lock duration to help the server recover from indefinite stalls, handle the effects of normal and abnormal program terminations so that the server does not wait indefinitely to hear from its constituent clients after multicasts. With the implementation of version numbers in the CBL and O2PL protocols, the server can now determine which cached objects of returning mobile clients are obsolete and can discard the corresponding page table entries from both the client and the server page tables. Experiments have shown that the extended invalid-access prevention policy algorithms enforce guaranteed serializability of multimedia objects in RDBMS applications under a mobile client/server environment. For more details, see [11]. Figure 2 depicts a page table procedure for logical invalidation that deals with the serializability problem through a page table consistency check.
Fig. 2. Consistency check of the page table
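A highly simplified sketch of a version-number check in the spirit of this procedure is shown below. The idea of sequential version numbers and of discarding obsolete page table entries on reconnection follows the description in the text, but the specific data structures and method names are our own illustration rather than the procedure of Fig. 2.

```python
class VersionedPageTable:
    """Server-side record of which client caches which version of an object."""

    def __init__(self):
        self.current = {}      # object_id -> latest committed version number
        self.cached = {}       # (client_id, object_id) -> version cached at that client

    def commit_update(self, object_id):
        # A connected client commits a new copy: bump the sequential version number.
        self.current[object_id] = self.current.get(object_id, 0) + 1
        return self.current[object_id]

    def register_cache(self, client_id, object_id, version):
        # Recreate a page table row only when the client's copy is current.
        if version == self.current.get(object_id, 0):
            self.cached[(client_id, object_id)] = version
            return True
        return False

    def reconnect(self, client_id, cached_copies):
        """Logically invalidate every obsolete copy held by a returning client.

        cached_copies maps object_id -> version held by the client while it was
        disconnected.  Returns the objects the client must discard and re-fetch.
        """
        stale = [oid for oid, ver in cached_copies.items()
                 if ver < self.current.get(oid, 0)]
        for oid in stale:
            self.cached.pop((client_id, oid), None)   # drop the obsolete page table entry
        for oid, ver in cached_copies.items():
            if oid not in stale:
                self.cached[(client_id, oid)] = ver
        return stale
```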
5 Potential Contribution of Extended Invalid-Access Prevention Protocols to Mobile Data Mining Based on the discussion presented above, we can now revisit the issues raised earlier, to analyze scenarios and answer queries like the following:

What kind of mobile users are interested in a certain kind of video?
What kind of video are certain kinds of mobile users interested in?
Which kinds of message boards (or other shared objects) are updated most frequently by mobile users?
What kind of mobile users (i.e., with what characteristics) have accessed and updated a particular object most frequently?
What kind of multimedia objects (i.e., with what characteristics) have the largest number of copies existing at the same time?
What kind of multimedia objects are likely to be updated as a consequence of updating some other multimedia objects?

To answer these and many other questions, we need some details of the database implementation, as discussed earlier. For example, in order to find out "what kind of mobile users are interested in a certain kind of video," we need to count how frequently certain videos are accessed by mobile users. Although in this kind of application a user may not be able to update the contents of a video, and thus not be able to create new versions of an object, in many other cases (such as dealing with a message board) updates do occur and multiple versions of objects do exist. Therefore, the research work presented in Section 4 may make an important contribution to the related data mining tasks.
6 Conclusions and Future Research In this paper we raised and discussed these issues by examining the case of mobile data mining. Data mining may not be limited to "front end" applications. To the best of our knowledge, these issues have not been discussed before, but they are worth exploring. In particular, we have focused on how to ensure serializability when dealing with multiple copies of multimedia objects as an example to illustrate the relationship between mobile data mining and DBMS transaction processing. For more details of our proposed approach, see [11]. There are many outstanding research issues that need to be addressed before data mining in mobile environments can be effectively used. Although we have addressed the issue of ensuring the serializability of large multimedia objects, doing so will inevitably invalidate many transactions that access modified objects. To relieve this problem, we need an elaborate scheme to divide a large object into smaller pieces and subsequently organize and manage these pieces effectively. This will allow multiple users (transactions) to simultaneously update different parts of the same object without being rolled back. Another related issue that needs to be further studied is how to address serializability if we are mining from different data sources.
References
1. Barbara, D.: Mobile Computing and Databases – A Survey. IEEE Transactions on Knowledge and Data Engineering, 11(1) (1999) 108-117
2. Breitbart, Y., Komondoor, R., Rastogi, R., Seshadri, S., Silberschatz, A.: Update Propagation Protocols for Replicated Databases. Proceedings ACM SIGMOD (1999) 97-108
3. Dunham, M. H., Helal, A., Balakrishnan, T.: A mobile transaction model that captures both the data and movement behavior. Mobile Networks and Applications, 2(2) (1997) 149-162
4. Franklin, M. J., Carey, M. J., Livny, M.: Transactional client-server cache consistency: alternatives and performance. ACM Transactions on Database Systems, 22(3) (1997) 315-363
5. Holiday, J., Agrawal, D., Abbadi, A.: Disconnection Modes for Mobile Databases. Wireless Networks (2002) 391-402
6. Jensen, C. S., Lomet, D. B.: Transaction Timestamping in Temporal Databases. Proceedings of the International Conference on Very Large Data Bases (2001) 441-450
7. Kargupta, H., Joshi, A.: Data Mining "To Go": Ubiquitous KDD for Mobile and Distributed Environments. KDD Tutorial (2001)
8. Lim, E.-P., Wang, Y., Ong, K.-L., Hwang, S.-Y.: In Search of Knowledge about Mobile Users. Center for Advanced Information Systems, Nanyang Technological University, Singapore (2003) http://www.ercim.org/publication/Ercim_News/enw54/lim.html
9. Liu, K., Kargupta, H., Ryan, J.: Distributed data mining bibliography. University of Maryland Baltimore County (2004) http://www.cs.umbc.edu/~hillol/DDMBIB/ddmbib.pdf
10. Pacitti, E., Minet, P., Simon, E.: Fast Algorithms for Maintaining Replica Consistency in Lazy Master Replicated Databases. Proceedings of the International Conference on Very Large Data Bases (1999) 126-137
11. Parker, S., Chen, Z.: Extending Invalid-Access Prevention Policy Protocols for Mobile-Client Data Caching. Proc. ACM SAC (2004) 1171-1176
12. Saygin, Y., Ulusoy, Ö.: Exploiting Data Mining Techniques for Broadcasting Data in Mobile Computing Environments. IEEE TKDE (2002) 1387-1399
13. Schuldt, H.: Process Locking: A Protocol Based on Ordered Shared Locks for the Execution of Transactional Processes. Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2001) 289-300
14. Seydim, A. Y., Dunham, M. H., Kumar, V.: Location dependent query processing. Proceedings of the 2nd ACM International Workshop on Data Engineering for Wireless and Mobile Access (2001) 47-53
15. Shanmugasundaram, J., Nithrakashyap, A., Sivasankaran, R., Ramamritham, K.: Efficient Concurrency Control for Broadcast Environments. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (1999) 85-96
“Copasetic Clustering”: Making Sense of Large-Scale Images Karl Fraser*, Paul O’Neill*, Zidong Wang, and Xiaohui Liu Department of Information Systems and Computing, Brunel University, Uxbridge, Middlesex, UB8 3PH, U.K. {karl, paul, zidong, hui}@ida-research.net
Abstract. In an information-rich world, the task of data analysis is becoming ever more complex. Even with the processing capability of modern technology, more often than not important details become saturated and thus lost amongst the volume of data. With analysis problems ranging from discovering credit card fraud to tracking terrorist activities, the phrase "a needle in a haystack" has never been more apt. In order to deal with large data sets, current approaches require that the data be sampled or summarised before true analysis can take place. In this paper we propose a novel pyramidic method, namely copasetic clustering, which focuses on the problem of applying traditional clustering techniques to large-scale data sets while using limited resources. A further benefit of the technique is its transparency into the intermediate clustering steps; when applied to spatial data sets this allows the capture of contextual information. The abilities of this technique are demonstrated using both synthetic and biological data.
1 Introduction
In order to deal with large data sets, current clustering approaches require that the data be sampled or summarised before true analysis can take place. In this paper we deal with the problem of applying clustering techniques to large-scale data sets under limited resources. The technique was designed such that traditional clustering metrics could not only be applied to large-scale data sets, but would also produce results that were comparable to their un-scaled relatives on appropriately small data sets. This novel technique also gives the advantage of rendering transparent the 'internal decision processes' of the clustering methods. When applied to spatial data sets this results in the capture of contextual information that traditional methods would have lost, producing a far more accurate output. This improved accuracy is shown to exist when the copasetic clustering technique is applied to both synthetic and biological data sets. Clustering [1] is a prominent unsupervised technique used for segmentation that allows us to search for and classify the similarity between many thousands
The authors contributed equally to the research presented in this paper.
of 'objects'. Techniques such as k-means [2] and fuzzy c-means [3] are good representatives of the clustering ideology and are still in widespread use today; for example, k-means partitions data by initially creating random cluster centres which define the boundaries for each partition in the hyper-volume. Fuzzy logic based algorithms [3], on the other hand, were specifically designed to give computer-based systems the ability to account for the grey or fuzzy decision processes that are often seen in reality. Fuzzy logic based clustering therefore offers inherent advantages over non-fuzzy methods, as it can cope with problem spaces which have no well-defined boundaries. Like many other techniques, the two clustering methods mentioned suffer a variety of problems: they are heavily influenced by initial starting conditions, can become trapped in local minima, do not lend themselves to distributed processing methods and are restricted in their application to the memory resources of a single workstation. Pavel Berkhin [1] detailed these issues and classified the current solutions into three groups: incremental mining [4], data squashing [5] and reliable sampling [6] methods. The main problem with these implementations is that by reducing the number of elements in the data sets, important data will have been lost, and so there is a need for developing techniques that can be scaled to these problems. This paper focuses on solutions to two of the key challenges in the clustering field. The first of these is associated with the scaling issue of current techniques and thus attempts to meet the demands of modern image sets. Secondly, we detail a method whereby, rather than discarding the wealth of spatial image information, we harness it. The next section then describes the process whereby existing techniques are scaled to larger data sets, not only making their application feasible but also providing extra information about the internal clustering decision process that would traditionally be unavailable. The proposed solution is then tested on both synthetic and real-world biological data, with the results described in detail and the proposed algorithm showing great promise.
2 Copasetic Clustering
The Copasetic Clustering (CC) method is a technique which facilitates the application of traditional clustering algorithms to large-scale data sets; it also has the additional ability of capturing spatial information, allowing the refinement of groupings. Initially it arbitrarily divides the image into spatially related areas (normally very small grid squares). Each of these areas is then clustered using a traditional technique such as k-means or fuzzy c-means, and the result is stored. Representatives are then calculated for each of the clusters that exist; these are further clustered in the next generation. Such a process is repeated until all sub-clustered groups have been merged. The final result is that every pixel will have been clustered into one of k groups, and on small data sets the output is comparable with traditional techniques. The aforementioned idea can be illustrated using the conceptual diagram shown in Fig. 1. In the input, we can
see twelve items which are to be clustered. Creating two groups with traditional clustering methods would compare all twelve shapes and is likely to output one group of triangles and one containing a mixture of circles and crosses, as shown in Fig. 1 (left). CC, on the other hand, would divide this set into sub-groups of a given size, in this example into three sets of four, as Fig. 1 (right) shows. Each of these sub-groups would be clustered into one of two classes (represented by the checkerboard pattern, or lack thereof, within layer '0' under each shape). In layer 1, the previously generated representatives for each of these groups have been clustered together. The final result can now be calculated by re-traversing the pyramid structure and tracking the shaded members.
Fig. 1. Conceptual diagram of traditional clustering techniques (left) and CC (right)
Studying the three groups in this example closely, it can be seen that the shapes were chosen to illustrate how this sub-clustering can provide useful contextual information. The first group of circles and crosses can easily be clustered as two separate shape classes with no problems (as shown in layer 0). With the introduction of triangles in the second group, we see that the circle and cross are now clustered together; here this is the optimum arrangement due to shape similarity. By layer 1, the groups are the same as those of a traditional clustering algorithm, with all circles and crosses in one group and the triangles in the other. Unlike a traditional clustering algorithm, information from previous layers can now be used to ascertain that in some situations certain items should not have been grouped together. Here the circles and crosses could have formed two distinct groups if it were not for the presence of the triangles. Note that this information is context specific and relies on a spatial data set, for example an order-dependent vector or an image. A pseudo-code implementation of the CC algorithm is described in Fig. 2. For clarity, this implementation has been based around clustering an image, using traditional techniques, into two groups containing either signal or noise elements, although it can easily be modified to work with multiple groups. As mentioned, one of the fundamental strengths of the CC approach over traditional clustering techniques is that CC effectively renders the ‘internal decision processes’ transparent.
Fig. 2. Copasetic clustering pseudo code description
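To make the layered scheme concrete, the following minimal Python sketch (our own naming, assuming scikit-learn's KMeans as the base clusterer and a one-dimensional list of items for brevity) clusters items in small groups, clusters the group representatives in turn, and keeps every layer's labels for later review; it illustrates the idea rather than reproducing the authors' implementation in Fig. 2.

import numpy as np
from sklearn.cluster import KMeans

def copasetic_clustering(values, group_size=4, n_clusters=2, seed=0):
    """Sketch of layered ('copasetic') clustering on a 1-D set of items.
    Items are clustered in small groups, the mean of each within-group
    cluster becomes a representative, and the representatives are
    clustered again until a single top-level grouping remains.
    Returns one label array per layer (layers[-1] is the final result)."""
    def km(x):
        return KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit(np.asarray(x).reshape(-1, 1))

    reps = np.asarray(values, dtype=float)    # current representatives
    owner = np.arange(len(reps))              # representative index per item
    layers = []                               # per-item labels at each layer

    while len(reps) > group_size:
        labels = np.empty(len(reps), dtype=int)
        remap = np.empty(len(reps), dtype=int)
        new_reps = []
        for start in range(0, len(reps), group_size):
            idx = np.arange(start, min(start + group_size, len(reps)))
            if len(idx) < n_clusters:         # tiny leftover group
                labels[idx] = 0
                remap[idx] = len(new_reps)
                new_reps.append(reps[idx].mean())
                continue
            fit = km(reps[idx])
            labels[idx] = fit.labels_
            for c in range(n_clusters):
                members = idx[fit.labels_ == c]
                if len(members):
                    remap[members] = len(new_reps)
                    new_reps.append(reps[members].mean())
        layers.append(labels[owner])          # record this layer per item
        owner = remap[owner]
        reps = np.array(new_reps)

    top = km(reps).labels_                    # final clustering of survivors
    layers.append(top[owner])
    return layers

# Example: twelve items in three groups of four (cf. Fig. 1)
if __name__ == "__main__":
    items = [1, 1, 9, 9, 1, 9, 5, 5, 5, 5, 9, 1]
    for depth, lab in enumerate(copasetic_clustering(items)):
        print("layer", depth, lab)

On an image, the items would be the pixel intensities within each grid square; the per-layer labels are what allows the historical review described next.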
This transparency is achieved by way of the pyramidal layering concept, which gives the ability to review historical information and thereby improve the final output. Traditional clustering algorithms lack this information, and the problems that arise from this can be illustrated with a simple test image which has a gradient-based (more realistic) background. The left of Fig. 3 shows an image with well defined signal (the arrows) throughout. However, towards the bottom of the image the local background intensity tends towards that of the signal, which will therefore be grouped with this noise element. Note that the question mark signifies the complete lack of internal knowledge available from the traditional clustering techniques.
Fig. 3. Traditional clustering (left) and CC’s result including historical data (right)
The right of Fig. 3 shows the various layers of the CC algorithm, where we can clearly see there are three intermediate layers between the input and
output data. This diagram illustrates two ways in which the raw output from CC can be used. The simplest method is to take the final layer results and use them directly, which is equivalent to traditional clustering techniques. An alternative way is to harness the information from all of these layers, producing an amalgamated result from a consensus of the previous layers. This result accounts for more of the detail in the original input than would otherwise have been possible using traditional methods. The consensus layer was generated by calculating the mean value for each pixel across the previous layers and then discretising it using a threshold of 0.66 (i.e. agreement between two thirds of the layers). For a data set which consists of two or more clusters a median could be used to the same effect, although more complex methods of forming an agreement are not ruled out.
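As a small illustration of this consensus step (our own code, not the authors'), the per-pixel mean of the binary layers can be thresholded at 0.66 so that a pixel is kept as signal only when at least two thirds of the layers agree:

import numpy as np

def consensus_layer(layers, threshold=0.66):
    """Combine binary (signal/noise) CC layers into a single mask.
    `layers` is an iterable of equally shaped 0/1 arrays; a pixel is
    marked as signal when its mean across layers reaches `threshold`."""
    stack = np.stack([np.asarray(l, dtype=float) for l in layers])
    return (stack.mean(axis=0) >= threshold).astype(np.uint8)

# Example with three 2x2 layers: agreement in the top-left pixel only
if __name__ == "__main__":
    l0 = np.array([[1, 0], [0, 1]])
    l1 = np.array([[1, 1], [0, 0]])
    l2 = np.array([[1, 0], [1, 0]])
    print(consensus_layer([l0, l1, l2]))   # [[1 0] [0 0]]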
3 Results
This section details the results of numerous experiments which test the CC technique using both k-means and fuzzy c-means clustering when applied to various data sets. The first set of experiments consists of a series of synthetic imagery, designed to emphasise both the weaknesses and strengths of each technique. These methods are then applied to biological gene expression analysis data, with comparisons drawn between the two clustering results and that of the human expert.
3.1 Synthetic Data Set Experiments
Fig. 4 illustrates a selection of images with various degrees of solid white noise. In the uniform set of images the objects have been rendered with an intensity value of 50% while the background was originally set to 0%. Over the course of the series the background noise is steadily increased in increments of 5%. The gradient set of images is based on the aforementioned set, with the obvious difference being that the average intensity of the background is gradually increased down the image surface. This has the effect of making lower-region background pixels more similar to the signal of the objects than the background in the upper region. As we know the true morphology of the imagery involved, the techniques can be evaluated by calculating the percentage error. Here this percentage error is defined as:

Error = (100 / N) * Sum_i |S_i - M_i|
Fig. 4. Two example data sets shown with varying intensities of solid white noise
Fig. 5. Results for the uniform synthetic data, k-means (left) and fuzzy c-means (right)
where N is the number of pixels in S and M (such that |S| = |M| = N) for the image's signal S and mask M. Fig. 5 shows the absolute error results when using the images with a uniform background, plotted against the intensity of the noise, for both k-means and fuzzy c-means. Each graph consists of three plots: the grey line shows the traditional clustering technique when applied to each image, the black line shows the performance obtained when CC is applied and, finally, the dotted line shows the results when a consensus of the historical CC information is utilised. Generally all three methods perform admirably, producing comparable results. However, the consensus implementation of CC appears to degrade more gracefully as noise levels increase. As this data set contains a uniform spread of noise elements, it was envisioned that the results of the experiment would be similar between the techniques. Due to this uniformity of the background, foreground pixels are clearly defined and hence easier to cluster. However, in many real data sets this would not be the case; rather, variations in the background mean that pixels can only be classified correctly when taken in the context of their local region. Traditional clustering techniques currently fail in this situation because every pixel is treated in isolation; CC, in contrast, is inherently spatially aware and therefore able to capture this contextual information. Fig. 6 shows the percentage error when the same techniques are applied to the gradient-based synthetic data, which represents a more realistic scenario. As before, both traditional clustering and CC have performed on a par with one another, which is the expected result. However, for this data set the historical information plays a much bigger role and it can be seen that there is a substantial improvement across the entire series. To further illustrate this, some example images are shown in Fig. 7. In these images the benefit of using historical information can clearly be seen. Although there is still a lot of noise present, the objects in the image have been clearly defined and could be used for further analysis such as edge detection. In contrast, the traditional clustering results have lost this information completely and nothing further can be done to clarify the lower region without reanalysis.
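For reference, the percentage error used above reduces to the fraction of pixels on which the clustering mask and the known ground truth disagree; a straightforward sketch, assuming both are 0/1 arrays of equal shape:

import numpy as np

def percentage_error(mask, truth):
    """Percentage of pixels whose cluster label disagrees with the
    known ground truth (both arguments are 0/1 arrays of equal shape)."""
    mask, truth = np.asarray(mask), np.asarray(truth)
    if mask.shape != truth.shape:
        raise ValueError("mask and truth must have the same shape")
    return 100.0 * np.mean(mask != truth)

# Example: two wrong pixels out of eight -> 25% error
if __name__ == "__main__":
    truth = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
    mask  = np.array([[0, 1, 1, 1], [0, 0, 0, 1]])
    print(percentage_error(mask, truth))   # 25.0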
Fig. 6. Results for the gradient synthetic data for k-means (left) and fuzzy c-means (right)
Fig. 7. Representative clustering results of the processes
3.2 Microarray Data Set Experiments
The final set of results represents a real-world image analysis problem in the biological domain, more specifically that associated with microarray data. Until recently, when biologists wanted to discover which genes were being used in a given process, they would have to focus on one gene at a time. The ability to analyse genes in this manner is extremely useful, but due to the lack of functional knowledge for the majority of genes it is pragmatically restrictive. With the use of microarray technology [7], biologists can analyse many thousands of genes simultaneously on a single chip. For a detailed explanation readers may find references [8] and [9] of interest. An example usage of this technology is the comparison between cells from a patient before and after infection by a disease. If particular genes are used more (highly expressed) after infection, it can be surmised that these genes may play an important role in the life cycle of this disease. The microarray slide is then digitised using a dual-laser scanning device, producing 5000×2000, two-channel, 16-bit grey-scale images, an example of which is shown in the left of Fig. 8. One of the mainstream methods used to analyse microarrays is that provided by Axon Instruments in their GenePix package. Initially the operator defines a template of gene spot locations, which the package then uses to define the centre of a circle that is applied to every gene, with a simple threshold used to calculate its diameter. All of these are then manually checked and realigned if necessary, with the median value of these pixels used as the intensity of the gene spot. The technique samples the surrounding background by placing four rectangular regions in the diagonal space between this and adjacent spots
Fig. 8. Example microarray images (left). Template example based on the method employed by GenePix (right)
(valleys). Again, the median value of all pixels within these valleys is taken to be the background. The final stage is to subtract the background from the foreground domain. The right of Fig. 8 shows an example of this template-based approach. This process makes the assumption that there is little variation both within the gene spot and in the surrounding background. Unfortunately, this is not always the case, as various types of artefact are commonly found, as can be seen in the background regions of Fig. 8. Yang et al. [10] present a review of this and other manual methods as applied to microarray analysis. Rather than this template approach, a better method of dividing the foreground and background domains would be that of clustering. To this end we have investigated a technique that could facilitate the requirement of automatic full-slide processing, while avoiding the associated overheads. Using CC, the entire microarray image can be classified in terms of signal and noise, which can be used to help distinguish the gene spot boundaries. Having calculated a mask for the gene spots across the entire slide, it can be verified against human-defined areas for each of the gene spots using the peak signal-to-noise ratio (PSNR) [11]. This metric is defined for greyscale images as:

PSNR = 20 log10(I_max / RMSE)
where the RMSE (root mean squared error) represents the norm of the difference between the original signal and the mask, and I_max is the maximum possible pixel intensity. The PSNR is the ratio between the maximum mean squared difference that can exist between two images and the mean squared difference actually observed between them. Therefore, the higher the PSNR value, the more accurately the mask fits the raw imagery (for all images presented, the proposed framework gave more accurate results). This gives a measure of the accuracy between the signal as defined by a trained biologist and the signal defined by the clustering technique. From Fig. 9, where we directly compare the PSNR values determined by GenePix and CC for the individual images, it can be seen that on average CC shows a marked 1-3 dB improvement over the template approach. Essentially, the CC process has consistently outperformed the human expert using GenePix in terms of gene spot identification. We have shown above that the CC technique can apply clustering methods to data sets that are traditionally infeasible while also producing a significant improvement over the manually guided process.
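The PSNR calculation itself takes only a few lines; the sketch below (our own naming) assumes 16-bit greyscale data, so the peak value defaults to 65535:

import numpy as np

def psnr(reference, mask, peak=65535.0):
    """Peak signal-to-noise ratio (in dB) between a reference image
    (e.g. the human-defined gene spot signal) and a clustering mask,
    assuming 16-bit greyscale data by default."""
    reference = np.asarray(reference, dtype=float)
    mask = np.asarray(mask, dtype=float)
    rmse = np.sqrt(np.mean((reference - mask) ** 2))
    if rmse == 0:
        return float("inf")            # identical images
    return 20.0 * np.log10(peak / rmse)

if __name__ == "__main__":
    ref = np.array([[0, 40000], [0, 42000]], dtype=float)
    out = np.array([[0, 41000], [0, 41000]], dtype=float)
    print(round(psnr(ref, out), 1))    # higher values indicate a closer fit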
Fig. 9. PSNR Comparison between GenePix and CC Results

4 Conclusions
This paper presented a new method which can be used to scale existing clustering techniques so that they can process larger images. The proposed method is based entirely on the distribution of the image's pixel intensities and is thus immune to spatial and morphological issues. With this approach, we have shown that the copasetic clustering algorithm can improve the spatial classification results over those derived from traditional techniques on both real and synthetic image data. Through this technique's inherent transparency we have also shown that more detail is captured by the method when compared with the traditional techniques on which it is based. In future we would like to further refine the historical capabilities in order to maximise the benefits of the approach. When applied to microarray imagery, this improved accuracy in the classification of the slide surface reduces the difficulty in defining the structural composition. In this paper we focused primarily on applying the copasetic clustering algorithm in order to ascertain how it compares to traditional techniques with a view to scalability. Given the promising results presented here, the next step will be to explore the direct effect that ‘window size’ has on the resulting quality. At present we have used the smallest ‘window size’ (WindowSz = 2) with a view to high-quality output; however, this may not be practical for real-world applications due to their dimensionality, and this parameter will always be a trade-off between quality and speed. This leads to another important issue, that of resource allocation, such as memory and processor load, compared to traditional techniques. Currently, copasetic clustering takes significantly longer to process than traditional techniques. However, in its present form the code is un-optimised and should be taken in the context that it is able to process data sets which are
several orders of magnitude larger. Our ultimate goal is to develop these techniques into a truly automated microarray processing system. This study brings that goal one step closer as it addresses the challenging issues associated with the initial stages of analysis.
Acknowledgements For kindly providing the data sets used, the authors would like to thank Paul Kellam from the Dept. of Immunology and Molecular Pathology, University College London.
References

1. Berkhin, P.: Survey of clustering data mining techniques. Accrue Software, San Jose, CA (2002)
2. MacQueen, J.: Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (1967) 281–297
3. Dunn, J. C.: A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, Vol. 3, No. 3 (1974) 32–57
4. Wann, D. C., Thomopoulos, A. S.: A comparative study of self-organising clustering algorithms Dignet and ART2. Neural Networks, Vol. 10, No. 4 (1997) 737–743
5. DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., Pregibon, D.: Squashing flat files flatter. Proceedings of the 5th ACM SIGKDD (1999) 6–15
6. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press (1995)
7. Moore, S. K.: Making chips. IEEE Spectrum (2001) 54–60
8. Orengo, C. A., Jones, D. T., Thornton, J. M.: Bioinformatics: Genes, Proteins & Computers. BIOS Scientific Publishers Limited (2003) 217–244
9. The Chipping Forecast II. Nature Genetics Supplement (2002) 461–552
10. Yang, Y. H., Buckley, M. J., Dudoit, S., Speed, T. P.: Comparison of methods for image analysis on cDNA microarray data. J. Comput. Graphical Stat., Vol. 11 (2002) 108–136
11. Netravali, A. N., Haskell, B. G.: Digital Pictures: Representation, Compression and Standards (2nd Ed). Plenum Press, New York, NY (1995)
Ranking Gene Regulatory Network Models with Microarray Data and Bayesian Network
Hongqiang Li, Mi Zhou, and Yan Cui*
Department of Molecular Sciences, Center of Genomics and Bioinformatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
{Hli7, Mzhou3, Ycui2}@utmem.edu
Abstract. Researchers often have several different hypotheses on the possible structures of the gene regulatory network (GRN) underlying the biological model they study. It would be very helpful to be able to rank these hypotheses using existing data. Microarray technologies enable us to monitor the expression levels of tens of thousands of genes simultaneously. Given the expression levels of almost all of the well-substantiated genes in an organism under many experimental conditions, it is possible to evaluate hypothetical gene regulatory networks with statistical methods. We present RankGRN, a web-based tool for ranking hypothetical gene regulatory networks. RankGRN scores gene regulatory network models against microarray data using Bayesian Network methods. The score reflects how well a gene network model explains the microarray data. A posterior probability is calculated for each network based on the scores. The networks are then ranked by their posterior probabilities. RankGRN is available online at [http://GeneNet.org/bn]. It is a useful tool for evaluating how well hypothetical gene network models explain the observed gene expression data (i.e. the microarray data). Users can select the gene network model that best explains the microarray data.
1 Introduction

The gene expression programs encoded in DNA sequences determine the development of organisms and their responses to external stimuli at the cellular level. These programs are executed via gene regulatory networks. A gene regulatory network is a group of genes that interact through directed transcriptional regulation.1,2 Many diseases are related to malfunctions of part of the gene regulatory network. A better understanding of the structures of gene regulatory networks will improve our understanding of the complex processes involved in higher-order biological functions and will profoundly affect research on many diseases. The structures of the gene regulatory networks of eukaryotes remain largely unknown, except for a few intensively studied pathways. Recently, the applications of microarray technologies have generated huge amounts of gene expression data,
* To whom correspondence should be addressed.
and have therefore made it practical to discover gene regulatory relations through statistical and computational approaches. Researchers often have several different hypotheses on the possible structures of the gene regulatory network underlying the biological model they study. It would be very helpful to be able to rank these hypotheses using existing data.3 Bayesian Network methods4 have been used to infer the structures of gene regulatory networks from microarray data.5-8, 21-28 In previous work, Bayesian networks were used mainly for reconstructing gene regulatory networks without prior knowledge of the possible structures of the networks. Due to the huge number of possible networks that need to be evaluated and the limited amount of microarray data, it is impractical to learn the whole gene regulatory network from microarray data alone. However, it is much more feasible to evaluate a number of alternative hypotheses about the structure of a regulatory network against microarray data. In this work, we developed a web-based program that ranks gene network models with Bayesian Network methods.
2 Methods

2.1 Data Preparation

Huge amounts of microarray data are stored in public databases, for example the Gene Expression Omnibus (GEO),9 the Stanford Microarray Database,10,11 ArrayExpress12 and ExpressDB.13,14 As of July 3, 2003, microarray data for 6,418 samples using 229 platforms were available at GEO. The most frequently used platforms for human and mouse microarray data in GEO are the Affymetrix GeneChip Human Genome U95 Set HG-U95A (290 samples, 4/30/2003) and the Affymetrix GeneChip Murine Genome U74 Version 2 Set MG-U74A (476 samples, 4/30/2003). We downloaded and compiled the microarray datasets of HG-U95A (human) and MG-U74Av2 (mouse) from GEO.9 We also downloaded the estimated relative abundances (ERAs) for 213 yeast conditions from ExpressDB.13,14 The downloaded microarray data were then normalized, discretized and saved in a local database. The transformed microarray data are used for scoring hypothetical gene networks. We first normalize the gene expression data for each sample to have the same mean and standard deviation. Then the microarray data are discretized into three levels. We calculate the mean and standard deviation of each gene's expression values and derive a lower and an upper cutoff from them. If an expression value is below the lower cutoff, it belongs to level 0; if it lies between the two cutoffs, it belongs to level 1; if it is above the upper cutoff, it belongs to level 2.
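A compact sketch of the per-sample normalisation and three-level discretisation described above follows; the helper names are ours, and the use of one standard deviation around each gene's mean as the cutoff is an assumption rather than a value taken from the text:

import numpy as np

def normalize_samples(expr):
    """Normalise each sample (column) of a genes x samples matrix to zero
    mean and unit standard deviation."""
    expr = np.asarray(expr, dtype=float)
    return (expr - expr.mean(axis=0)) / expr.std(axis=0)

def discretize_genes(expr, k=1.0):
    """Map each gene (row) to levels 0/1/2 using cutoffs at mean -/+ k*std.
    The one-sigma cutoff (k=1.0) is an assumption, not taken from the paper."""
    expr = np.asarray(expr, dtype=float)
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    levels = np.ones(expr.shape, dtype=int)   # default: level 1 (typical)
    levels[expr < mu - k * sd] = 0            # under-expressed
    levels[expr > mu + k * sd] = 2            # over-expressed
    return levels

if __name__ == "__main__":
    data = np.array([[1.0, 2.0, 9.0, 2.0, 1.5],
                     [5.0, 5.5, 4.5, 0.1, 5.2]])
    print(discretize_genes(normalize_samples(data)))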
2.2 Bayesian Network

RankGRN uses Bayesian Network methods to rank hypothetical gene networks. A Bayesian network4 is a probabilistic graphical model of dependencies between multiple variables. It has been considered an effective tool for learning genetic networks from gene expression profiles.19 Directed acyclic graphs are used to depict the dependencies and conditional independencies between variables, in our case the
expression levels of the genes. A causal interpretation for Bayesian networks has been proposed20 -- the parents of a variable are its immediate causes. The structure of Bayesian networks can be learned from microarray data.5-8, 21-28 Given the microarray dataset D, we want to find which hypothetical gene regulatory network best matches D. We used a Bayesian score to evaluate a network G:7

Score(G) = log P(D | G) + log P(G)

The Bayesian score for the entire network is decomposable as a sum of scores for each edge (parent-child relationship) under the assumption of complete data. Thus, the score can be written as:7

Score(G) = Sum_i ScoreContribution(x_i, U_i)

where x_i is the expression level of Gene i and U_i are the expression levels of the regulators (i.e. the parents) of Gene i according to the network G. The contribution of each gene to the total score depends only on Gene i and its regulators. In the case of a discrete Bayesian network with multinomial local conditional probability distributions, the local contribution for each gene can be computed using a closed-form equation:7,29

ScoreContribution(x_i, U_i) = log P(U_i -> x_i) + Sum_j [ log Gamma(a_ij) - log Gamma(a_ij + N_ij) + Sum_k ( log Gamma(a_ijk + N_ijk) - log Gamma(a_ijk) ) ]

where a_ijk is the (non-informative) parameter prior, N_ijk is the number of occurrences of gene i in state k given parent configuration j, a_ij = Sum_k a_ijk, N_ij = Sum_k N_ijk, and Gamma is the gamma function. The first term, P(U_i -> x_i), is the prior probability assigned to the choice of the set U_i as the parents of x_i; RankGRN uses a uniform structure prior. The posterior probability of a hypothetical network G_n is

P(G_n | D) = exp(Score(G_n)) / Sum_{m=1..N} exp(Score(G_m))

where N is the number of hypothetical networks.
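The family-wise closed form above is straightforward to compute; the following generic sketch (our own naming, a uniform Dirichlet prior and natural logarithms, not the RankGRN code itself) scores candidate networks over discretised data and converts the scores into posterior probabilities under a uniform structure prior:

import numpy as np
from itertools import product
from math import lgamma

def family_score(data, child, parents, levels=3, alpha=1.0):
    """Log marginal likelihood of one child gene given its parents under a
    multinomial model with a uniform Dirichlet prior (a BDe-style score).
    `data` is an observations x genes array of discrete levels 0..levels-1."""
    score = 0.0
    for state in product(range(levels), repeat=len(parents)):
        rows = (np.all(data[:, parents] == state, axis=1)
                if parents else np.ones(len(data), dtype=bool))
        n_jk = np.array([np.sum(data[rows, child] == k) for k in range(levels)])
        a_jk = np.full(levels, alpha)
        # Gamma-function closed form for this parent configuration j
        score += lgamma(a_jk.sum()) - lgamma(a_jk.sum() + n_jk.sum())
        score += sum(lgamma(a + n) - lgamma(a) for a, n in zip(a_jk, n_jk))
    return score

def network_score(data, network):
    """Sum of family scores; `network` maps each child gene index to the
    list of its parent gene indices."""
    return sum(family_score(data, c, p) for c, p in network.items())

def posteriors(log_scores):
    """Posterior probability of each candidate network from its log score,
    assuming a uniform structure prior."""
    s = np.asarray(log_scores, dtype=float)
    w = np.exp(s - s.max())                  # subtract the max for stability
    return w / w.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = rng.integers(0, 3, size=(50, 3))     # toy discretised expression data
    g1 = {0: [], 1: [0], 2: [0]}             # gene 0 regulates genes 1 and 2
    g2 = {0: [1], 1: [], 2: []}              # an alternative hypothesis
    print(posteriors([network_score(d, g1), network_score(d, g2)]))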
2.3 Visualization of Gene Networks

The Dot program in Graphviz,30 an open-source graph drawing package, was used to visualize the gene regulatory networks.
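As a minimal illustration (not the RankGRN code), a network given as a list of regulator-target pairs can be written out in DOT format and rendered with Graphviz's dot program:

def write_dot(edges, path="network.dot", name="GRN"):
    """Write a directed graph in DOT format; `edges` is a list of
    (regulator, target) gene-name pairs.
    Render with: dot -Tpng network.dot -o network.png"""
    lines = [f"digraph {name} {{"]
    lines += [f'    "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    # Network #1 of Fig. 1: Leu3 regulates Bap2, Gdh1 and Leu2; Toa2 regulates Leu2
    write_dot([("Leu3", "Bap2"), ("Leu3", "Gdh1"),
               ("Leu3", "Leu2"), ("Toa2", "Leu2")])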
3 Results

3.1 Gene Network Model Selection

RankGRN takes as input hypothetical gene networks in the format of a network description file. For example, six hypothetical yeast gene network structures are described in the following format:
The hypothetical networks consist of five genes that are listed at the top of the file. Each network is a directed graph. The directed edges may represent either positive regulatory relations (activation) or negative regulatory relations (repression). The starting (left, the regulator) and ending (right, the target gene) points of each directed edge are listed in the file. Thus, the structures of the hypothetical networks are completely determined by the network description file. RankGRN first parses the network description file, then scores each hypothetical gene network against microarray data using Bayesian Network methods (see the Methods section). The score allows us to directly compare the merits of alternative models.15 A posterior probability is calculated for each network based on its score. The posterior probability reflects our confidence in the hypothetical network given the microarray data. The hypothetical gene networks are then ranked by their posterior probabilities. The structures of the six networks are shown in Fig. 1.
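As a purely hypothetical illustration of such a description file (the layout below is our own and is not RankGRN's actual syntax), the genes could be listed first, followed by one regulator-target pair per edge; a small parser is included:

# Hypothetical layout for network #1 of Fig. 1 -- not RankGRN's actual syntax
EXAMPLE = """\
genes: Leu3 Bap2 Gdh1 Leu2 Toa2
network: 1
Leu3 Bap2
Leu3 Gdh1
Leu3 Leu2
Toa2 Leu2
"""

def parse_networks(text):
    """Parse the hypothetical description into a gene list and a mapping
    from network name to its list of (regulator, target) edges."""
    genes, networks, current = [], {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("genes:"):
            genes = line.split(":", 1)[1].split()
        elif line.startswith("network:"):
            current = line.split(":", 1)[1].strip()
            networks[current] = []
        else:
            regulator, target = line.split()
            networks[current].append((regulator, target))
    return genes, networks

if __name__ == "__main__":
    print(parse_networks(EXAMPLE))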
Fig. 1. Six yeast gene regulatory network models and their posterior probabilities
The networks contain five genes – Leu3 encodes a zinc finger transcription factor of the Zn(2)-Cys(6) binuclear cluster domain type which regulates genes involved in branched-chain amino acid biosynthesis and in ammonia assimilation; Bap2 encodes an amino acid permease for leucine, valine, and isoleucine (putative); Gdh1 encodes NADP-specific glutamate dehydrogenase; Leu2 encodes beta-IPM (isopropylmalate) dehydrogenase which is involved in leucine biosynthesis; Toa2 encodes transcription factor IIA subunit beta.16 Four regulatory relations (between the five genes) are found by searching the TransFac database17 – Leu3 regulates Bap2, Gdh1 and Leu2; Toa2 regulates Leu2. We assembled hypothetical gene network #1 (Fig. 1) using these four regulatory relations. The other five hypothetical gene networks are constructed using both true and false regulatory relations (Fig. 1).
RankGRN assigns a uniform prior probability to all hypothetical networks. The prior probability reflects our confidence in a hypothetical gene network before observing the microarray data. We use a uniform prior probability because we do not want to bias towards any network. The probability of a network may increase or decrease after considering the microarray data; the updated probability is called the posterior probability. As shown in Fig. 1, the most likely gene network is network #1 (with the largest posterior probability), which contains, and only contains, the regulatory relations extracted from the TransFac database. In previous work, a Boolean Network model and Genetic Algorithms were used to evaluate hypothetical models of gene regulatory networks.3 The advantage of Bayesian Networks is that the gene expression levels are modelled as random variables, which naturally deals with the stochastic aspects of gene expression and measurement noise.19
4 Discussion

Bayesian networks are a powerful tool for ranking hypothetical gene regulatory networks using microarray data. However, they cannot distinguish equivalent networks. In general, two network structures are equivalent if and only if they have the same undirected structure and the same v-structures. A v-structure is an ordered tuple (X, Y, Z) such that there is a directed edge from X to Y and from Z to Y, but no edge between X and Z.18 For example, in Fig. 2, the first five networks (networks #1, 7, 8, 9 and 6) are equivalent. RankGRN takes probe set IDs as input because each gene may be represented by more than one probe set in Affymetrix gene chips. Here we list the genes represented by the five probe set IDs in this figure:
92271_at: Pax6, paired box gene 6
98027_at: Col9a2, procollagen, type IX, alpha 2
102168_at: Gabbr1, gamma-aminobutyric acid (GABA-B) receptor, 1
100154_at: Tapbp, TAP binding protein
92674_at: Foxn1, forkhead box N1
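The equivalence criterion above (the same skeleton and the same v-structures) is easy to check programmatically; the sketch below is self-contained and uses illustrative edges between the genes listed here, not the actual edges of Fig. 2:

from itertools import combinations

def skeleton(edges):
    """Undirected edge set of a directed graph given as (src, dst) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All triples (X, Y, Z) with X -> Y <- Z and no edge between X and Z."""
    und = skeleton(edges)
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    found = set()
    for y, ps in parents.items():
        for x, z in combinations(sorted(ps), 2):
            if frozenset((x, z)) not in und:
                found.add((x, y, z))
    return found

def equivalent(edges1, edges2):
    """Markov equivalence: same skeleton and same v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

if __name__ == "__main__":
    net_a = [("Pax6", "Col9a2"), ("Pax6", "Gabbr1")]
    net_b = [("Col9a2", "Pax6"), ("Pax6", "Gabbr1")]   # one edge reversed
    print(equivalent(net_a, net_b))   # True: the reversal creates no v-structure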
Of these five equivalent networks, network #1 contains only the regulatory relations retrieved from the TransFac database. The other four equivalent networks are constructed by reversing one of the directed edges in network #1. If some hypothetical networks are equivalent, they will have the same posterior probabilities (Fig. 2). RankGRN cannot tell which one (of the five equivalent networks) is better. However, we can use model averaging to evaluate the features of the networks. Specifically, we calculated a posterior probability for each regulatory relation. An indicator function f was introduced: if a network G contains the regulatory relation, f(G) = 1; otherwise, f(G) = 0. The posterior probability of a regulatory relation is

P(f | D) = Sum_G f(G) P(G | D)
where P(G | D) is the posterior probability of network G. There are 12 regulatory relations in the 9 gene network models shown in Fig. 2.
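Given per-network posteriors, the edge posterior above is just a weighted sum of indicator values; a short sketch with illustrative numbers (not those of Table 1):

def edge_posterior(edge, networks, posteriors):
    """P(edge | D) = sum over networks G of f(G) * P(G | D), where f(G) is 1
    if G contains the directed edge and 0 otherwise (model averaging)."""
    return sum(p for g, p in zip(networks, posteriors) if edge in g)

if __name__ == "__main__":
    # Hypothetical networks as sets of directed (regulator, target) edges,
    # with assumed posterior probabilities that sum to one
    nets = [{("Pax6", "Col9a2"), ("Pax6", "Gabbr1")},
            {("Pax6", "Col9a2")},
            {("Col9a2", "Pax6")}]
    post = [0.5, 0.25, 0.25]
    print(edge_posterior(("Pax6", "Col9a2"), nets, post))   # 0.75
    print(edge_posterior(("Col9a2", "Pax6"), nets, post))   # 0.25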
The posterior probabilities of the 12 regulatory relations are listed in Table 1. The first four regulatory relations were retrieved from the TransFac database. The posterior probabilities of the four true regulatory relations (>0.79) are significantly higher than those of the false regulatory relations (