Data Mining: Concepts, Methodologies, Tools, and Applications

Information Resources Management Association, USA

Volume I
Managing Director: Lindsay Johnston
Editorial Director: Joel Gamon
Assistant Acquisitions Editor: Kayla Wolfe
Book Production Manager: Jennifer Romanchak
Publishing Systems Analyst: Adrienne Freeland
Development Editor: Chris Wozniak
Assistant Book Production Editor: Deanna Jo Zombro
Cover Design: Nick Newcomer
Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com

Copyright © 2013 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Data mining : concepts, methodologies, tools, and applications / Information Resources Management Association, editor.
   v. cm.
Includes bibliographical references and index.
Summary: “This reference is a comprehensive collection of research on the latest advancements and developments of data mining and how it fits into the current technological world”--Provided by publisher.
ISBN 978-1-4666-2455-9 (hbk.) -- ISBN 978-1-4666-2456-6 (ebook) -- ISBN 978-1-4666-2457-3 (print & perpetual access)
1. Data mining. I. Information Resources Management Association.
QA76.9.D343D382267 2013
006.3’12--dc23
2012034040
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

The views expressed in this book are those of the authors, but not necessarily of the publisher.
Editor-in-Chief
Mehdi Khosrow-Pour, DBA
Contemporary Research in Information Science and Technology, Book Series

Associate Editors
Steve Clarke, University of Hull, UK
Murray E. Jennex, San Diego State University, USA
Annie Becker, Florida Institute of Technology, USA
Ari-Veikko Anttiroiko, University of Tampere, Finland

Editorial Advisory Board
Sherif Kamel, American University in Cairo, Egypt
In Lee, Western Illinois University, USA
Jerzy Kisielnicki, Warsaw University, Poland
Keng Siau, University of Nebraska-Lincoln, USA
Amar Gupta, Arizona University, USA
Craig van Slyke, University of Central Florida, USA
John Wang, Montclair State University, USA
Vishanth Weerakkody, Brunel University, UK
List of Contributors
Acar, Murat / ISE Settlement and Custody Bank Inc., Turkey ..... 2250
Ackerman, Eric / Catharina Hospital, The Netherlands ..... 1461
Agrawal, Sanjuli / InfoBeyond Technology LLC, USA ..... 336
Aguilar, María A. / University of Granada, Spain ..... 179
Alakhdar, Yasser / University of Valencia, Spain ..... 650
Alam, M. Afshar / Hamdard University, India ..... 92
Alexander, Paul / Marine Metadata Interoperability Initiative & Stanford Center for Biomedical Informatics Research, Palo Alto CA, USA ..... 1709
Alonso, Fernando / Universidad Politécnica de Madrid, Spain ..... 1936
Alonso, Jose M. / University of Alcala, Spain ..... 1064
Alrabea, Adnan / Al Balqa Applied University, Jordan ..... 586
Alzoabi, Zaidoun / Arab International University, Syria ..... 550
Amine, Abdelmalek / Tahar Moulay University, Algeria ..... 2208
Antonelli, Dario / Politecnico di Torino, Italy ..... 1004
Armato III, Samuel G. / University of Chicago, USA ..... 1794
Athithan, G / C V Raman Nagar, India ..... 159
Atzmueller, Martin / University of Kassel, Germany ..... 1189
Azevedo, Ana / CEISE/STI, ISCAP/IPP, Portugal ..... 1873
Azpeitia, Daniel / Juarez City University, México ..... 1163
Babu, T. Ravindra / Infosys Limited, India ..... 734
Bajwa, Imran Sarwar / University of Birmingham, UK ..... 881
Bakar, Azuraliza Abu / Universiti Kebangsaan Malaysia, Malaysia ..... 2057
Bakken, Suzanne / Columbia University, USA ..... 1082
Bañares-Alcántara, René / University of Oxford, UK ..... 970
Bandyopadhyay, Sanghamitra / Indian Statistical Institute, India ..... 366
Baralis, Elena / Politecnico di Torino, Italy ..... 1004
Baumes, Laurent A. / CSIC-Universidad Politecnica de Valencia, Spain ..... 66
Bechhoefer, Eric / NRG Systems, USA ..... 395
Beer, Stephanie / University Clinic of Wuerzburg, Germany ..... 1189
Belen, Rahime / Informatics Institute, METU, Turkey ..... 603
Benites, Fernando / University of Konstanz, Germany ..... 125
Benitez, Josep / University of Valencia, Spain ..... 650
Bentayeb, Fadila / University of Lyon, France ..... 1422
Bergmans, Jan / Eindhoven University of Technology, The Netherlands ..... 1461
Berlanga, Rafael / Universitat Jaume I, Spain ..... 625
Bermudez, Luis / Southeastern University Research Association, USA ..... 1709
Bernhard, Stefan / Freie Universität Berlin, Germany ..... 2069
Berrada, Ilham / University of Mohamed V, Morocco ..... 1647
Bhattacharyya, Siddhartha / The University of Burdwan, India ..... 366
Bimonte, Sandro / Cemagref, TSCF, Clermont Ferrand, France ..... 2094
Biskri, Ismaïl / University of Quebec at Trois-Rivieres, Canada ..... 503
Bonazzi, Riccardo / University of Lausanne, Switzerland ..... 793
Bosnić, Zoran / University of Ljubljana, Slovenia ..... 692
Bothwell, James / University of Oklahoma, USA ..... 2117
Bourennani, Farid / University of Ontario Institute of Technology, Canada ..... 816
Boussaïd, Omar / University of Lyon, France ..... 1422
Bruno, Giulia / Politecnico di Torino, Italy ..... 1004
Cagliero, Luca / Politecnico di Torino, Italy ..... 1230
Calvillo, Alán / Aguascalientes University, México ..... 1163
Carrari, Fernando / Institute of Biotechnology, INTA & National Scientific and Technical Research Council, Argentina ..... 203
Carrasco, Ramón A. / University of Granada, Spain ..... 179
Castiello, Ciro / University of Bari, Italy ..... 1064
Cayci, Aysegul / Sabanci University, Turkey ..... 1960
Chan, Chien-Chung / University of Akron, USA ..... 1407
Chandrasekaran, RM. / Annamalai University, India ..... 1534
Chanet, Jean-Pierre / Cemagref, TSCF, Clermont Ferrand, France ..... 2094
Chauhan, Ritu / Hamdard University, India ..... 92
Chen, Zheng / Microsoft Research Asia, China ..... 1320
Chira, Camelia / Babeş-Bolyai University, Romania ..... 1163
Chiusano, Silvia / Politecnico di Torino, Italy ..... 1004
Coenen, Frans / University of Liverpool, UK ..... 970
Cui, Zhanfeng / University of Oxford, UK ..... 970
Darmont, Jérôme / University of Lyon, France ..... 1422
de Ávila, Ana M. H. / University of Campinas, Brazil ..... 1624
de Sousa, Elaine P. M. / University of Sao Paulo at Sao Carlos, Brazil ..... 1624
De Tré, Guy / Ghent University, Belgium ..... 279
Depaire, Benoit / Hasselt University, Belgium ..... 445
Desierto, Diane A. / Yale Law School, USA and University of the Philippines College of Law, Philippines ..... 1778
Diko, Faek / Arab International University, Syria ..... 550
Dileep, A. D. / Indian Institute of Technology Madras, India ..... 251
Ding, Qin / East Carolina University, USA ..... 859
Dong, Aijuan / Hood College, USA ..... 142
Dong, Dong / Hebei Normal University, China ..... 1253
Dorado, Julián / University of A Coruña, Spain ..... 704
Dousset, Bernard / University of Toulouse, France ..... 1647
Dustdar, Schahram / Vienna University of Technology, Austria ..... 658
Eibe, Santiago / Universidad Politécnica, Spain ..... 1960
El Emary, Ibrahiem Mahmoud Mohamed / King Abdulaziz University, Saudi Arabia ..... 1591
El Haddadi, Anass / University of Toulouse III, France & University of Mohamed V, Morocco ..... 1647
Elberrichi, Zakaria / Djillali Liabes University, Algeria ..... 2208
Erdem, Alpay / Middle East Technical University, Turkey ..... 1208
Escandell-Montero, Pablo / University of Valencia, Spain ..... 650
Faloutsos, Christos / Carnegie Mellon University, USA ..... 567
Favre, Cécile / University of Lyon, France ..... 1422
Fiori, Alessandro / Politecnico di Torino, Italy ..... 1230
Furst, Jacob D. / DePaul University, USA ..... 1794
Gaines, Brian R. / University of Victoria, Canada ..... 1688
Ganière, Simon / Deloitte SA, Switzerland ..... 793
García, Diego / University of Cantabria, Spain ..... 1291
Garrido, Paulo / University of Minho, Portugal ..... 1376
George, Ibrahim / Macquarie University, Australia ..... 2193
Gerard, Matías / Universidad Nacional del Litoral & Universidad Tecnologica Nacional & National Scientific and Technical Research Council, Argentina ..... 203
Giaoutzi, Maria / National Technical University of Athens, Greece ..... 2269
Gladun, Victor / V.M. Glushkov Institute of Cybernetics, Ukraine ..... 445
Gomes, João Bártolo / Universidad Politécnica, Spain ..... 1960
Gómez, Leticia / Instituto Tecnológico de Buenos Aires, Argentina ..... 2021
Govindarajan, M. / Annamalai University, India ..... 1534
Grošelj, Ciril / University Medical Centre Ljubljana, Slovenia ..... 1043
Grouls, Rene / Catharina Hospital, The Netherlands ..... 1461
Guan, Huiwei / North Shore Community College, USA ..... 107
Gudes, Ori / Queensland University of Technology, Australia & Griffith University, Australia ..... 1545
Guimerà-Tomás, Josep / University of Valencia, Spain ..... 650
Guo, Zhen / SUNY Binghamton, USA ..... 567
Gupta, Manish / University of Illinois at Urbana-Champaign, USA ..... 947, 1835
Hai-Jew, Shalin / Kansas State University, USA ..... 1358
Hamdan, Abdul Razak / Universiti Kebangsaan Malaysia, Malaysia ..... 2057
Han, Jiawei / University of Illinois at Urbana-Champaign, USA ..... 947, 1835
Han, Jung Hoon / The University of New South Wales, Australia ..... 1545
Hanna, Saiid / Arab International University, Syria ..... 550
Harbi, Nouria / University of Lyon, France ..... 1422
Hardiker, Nicholas / University of Salford, UK ..... 1082
Hashemi, Mukhtar / Newcastle University, UK ..... 405
He, Bing / Aviat Networks Inc., USA ..... 336
He, David / The University of Illinois-Chicago, USA ..... 395
Hemalatha, M. / M.A.M. College of Engineering, India ..... 1276
Hernández, Paula / ITCM, México ..... 1998
Hill, James H. / Indiana University-Purdue University Indianapolis, USA ..... 751
Hirokawa, Sachio / Kyushu University, Japan ..... 1390
Hoeber, Orland / Memorial University of Newfoundland, Canada ..... 1852
Hong, Tzung-Pei / National University of Kaohsiung, Taiwan ..... 1449
Hornos, Miguel J. / University of Granada, Spain ..... 179
Horsthemke, William H. / DePaul University, USA ..... 1794
Hung, Edward / Hong Kong Polytechnic University, Hong Kong ..... 515
Ionascu, Costel / University of Craiova, Romania ..... 920
Isenor, Anthony / Defense R&D Canada – Atlantic, Canada ..... 1709
Ivanova, Krassimira / University for National and World Economy, Bulgaria ..... 445
Janghel, R. R. / Indian Institute of Information Technology and Management Gwalior, India ..... 1114
Jans, Mieke / Hasselt University, Belgium ..... 1664
Kacprzyk, Janusz / Polish Academy of Sciences, Poland ..... 279
Kamenetzky, Laura / Institute of Biotechnology, INTA & National Scientific and Technical Research Council, Argentina ..... 203
Karahoca, Adem / Bahcesehir University, Turkey ..... 2250
Karahoca, Dilek / Bahcesehir University, Turkey ..... 2250
Kareem, Shazia / The Islamia University of Bahawalpur, Pakistan ..... 881
Kato, Kanji / GK Sekkei Incorporated, Japan ..... 1390
Kaur, Harleen / Hamdard University, India ..... 92
Kavakli, Manolya / Macquarie University, Australia ..... 2193
Kendall, Elizabeth / Griffith University, Australia ..... 1545
Kharlamov, Evgeny / Free University of Bozen-Bolzano, Italy & INRIA Saclay, France ..... 669
Klijn, Floor / Catharina Hospital, The Netherlands ..... 1461
Kononenko, Igor / University of Ljubljana, Slovenia ..... 692, 1043
Korsten, Erik / Catharina Hospital and Eindhoven University of Technology, The Netherlands ..... 1461
Košir, Domen / University of Ljubljana, Slovenia & Httpool Ltd., Slovenia ..... 692
Kowatsch, Tobias / University of St. Gallen, Switzerland ..... 1339
Koyuncugil, Ali Serhan / Capital Markets Board of Turkey, Turkey ..... 1559, 2230
Kuijpers, Bart / Hasselt University and Transnational University of Limburg, Belgium ..... 2021
Kukar, Matjaž / University of Ljubljana, Slovenia ..... 1043
Kutty, Sangeetha / Queensland University of Technology, Australia ..... 1
Lagaros, Nikos D. / Institute of Structural Analysis and Seismic Research, National Technical University Athens, Greece ..... 2132
Lassoued, Yassine / University College Cork, Ireland ..... 1709
Leitner, Philipp / Vienna University of Technology, Austria ..... 658
Li, Ying / Microsoft Corporation, USA ..... 1320
Li, Zhixiong / Reliability Engineering Institute, Wuhan University of Technology, China ..... 2174
Liszka, Kathy J. / University of Akron, USA ..... 1407
Liu, Ning / Microsoft Research Asia, China ..... 1320
Liu, Zhan / University of Lausanne, Switzerland ..... 793
Lo, Wei-Shuo / Meiho University, Taiwan ..... 1449
Lombardo, Rosaria / Second University of Naples, Italy ..... 1472
Long, Zalizah Awang / Malaysia Institute Information Technology, Universiti Kuala Lumpur, Malaysia ..... 2057
López, Mariana / Institute of Biotechnology, INTA & National Scientific and Technical Research Council, Argentina ..... 203
Loudcher, Sabine / University of Lyon, France ..... 1422
Lucarelli, Marco / University of Bari, Italy ..... 1064
Lucero, Robert / Columbia University, USA ..... 1082
Lybaert, Nadine / Hasselt University, Belgium ..... 1664
Ma, Jinghua / The University of Illinois-Chicago, USA ..... 395
Maass, Wolfgang / University of St. Gallen, Switzerland & Hochschule Furtwangen University, Germany ..... 1339
Mah, Teresa / Microsoft Corporation, USA ..... 1320
Mahboubi, Hadj / CEMAGREF Centre Clermont-Ferrand, France ..... 1422
Mahoto, Naeem A. / Politecnico di Torino, Italy ..... 1004
Maïz, Nora / University of Lyon, France ..... 1422
Mankad, Kunjal / ISTAR, CVM, India ..... 299
Manque, Patricio A. / Universidad Mayor, Chile ..... 1131
Manset, David / maatG, France ..... 1893
Markov, Krassimir / Institute of Mathematics and Informatics at BAS, Bulgaria ..... 445
Marmo, Roberto / University of Pavia, Italy ..... 326
Martínez, Loïc / Universidad Politécnica de Madrid, Spain ..... 1936
Martínez-Martínez, José M. / University of Valencia, Spain ..... 650
Maulik, Ujjwal / Jadavpur University, India ..... 366
Medeni, İhsan Tolga / Middle East Technical University, Turkey & Turksat, Turkey & Çankaya University, Turkey ..... 1208
Medeni, Tunç D. / Turksat, Turkey & Middle East Technical University, Turkey & Yildirim Beyazit University, Turkey ..... 1208
Mehedintu, Anca / University of Craiova, Romania ..... 920
Menasalvas, Ernestina / Universidad Politécnica, Spain ..... 1960
Mencar, Corrado / University of Bari, Italy ..... 1064
Mercken, Roger / University of Hasselt, Belgium ..... 474
Miah, Shah J. / Victoria University, Australia ..... 991
Milone, Diego / Universidad Nacional del Litoral & National Scientific and Technical Research Council, Argentina ..... 203
Minderman, Niels / Catharina Hospital, The Netherlands ..... 1461
Mitov, Iliya / Institute of Information Theories and Applications, Bulgaria ..... 445
Mitropoulou, Chara Ch. / Institute of Structural Analysis and Seismic Research, National Technical University Athens, Greece ..... 2132
Moelans, Bart / Hasselt University and Transnational University of Limburg, Belgium ..... 2021
Murray, Gregg R. / Texas Tech University, USA ..... 28
Murty, M. Narasimha / Indian Institute of Science Bangalore, India ..... 159, 734
Nardini, Franco Maria / ISTI-CNR, Pisa, Italy ..... 658
Nava, José / Centro de Investigación en Ciencias Aplicadas para la Industria, México ..... 1998
Nayak, Richi / Queensland University of Technology, Australia ..... 1
Nebot, Victoria / Universitat Jaume I, Spain ..... 625
Nembhard, D. A. / Pennsylvania State University, USA ..... 1737
neuGRID Consortium, The ..... 1893
O’Connell, Enda / Newcastle University, UK ..... 405
O’Grady, Eoin / Marine Institute, Ireland ..... 1709
Olinsky, Alan / Bryant University, USA ..... 1819
Omari, Asem / Jarash Private University, Jordan ..... 1519
Ozgulbas, Nermin / Baskent University, Turkey ..... 1559, 2230
Pan, Jia-Yu (Tim) / Google Inc., USA ..... 567
Pathak, Virendra / Queensland University of Technology, Australia ..... 1545
Patris, Georgios / National Technical University of Athens, Greece ..... 2269
Pazos, Alejandro / University of A Coruña, Spain ..... 704
Pearson, Siani / Cloud and Security Research Lab, HP Labs, UK ..... 1496
Pérez, Aurora / Universidad Politécnica de Madrid, Spain ..... 1936
Petrigni, Caterina / Politecnico di Torino, Italy ..... 1004
Petry, Frederick E. / Naval Research Laboratory, USA ..... 50
Pham, Trung T. / University College Cork, Ireland ..... 1709
Pigneur, Yves / University of Lausanne, Switzerland ..... 793
Pinet, François / Cemagref, TSCF, Clermont Ferrand, France ..... 2094
Pirvu, Cerasela / University of Craiova, Romania ..... 920
Plevris, Vagelis / School of Pedagogical and Technological Education (ASPETE), Greece ..... 2132
Polat, Erkan / Suleyman Demirel University, Turkey ..... 2153
Post, Brendan P. / The College at Brockport: State University of New York, USA ..... 837
Pudi, Vikram / International Institute of Information Technology Hyderabad, India ..... 1019
Puppe, Frank / University of Wuerzburg, Germany ..... 1189
Quinn, John / Bryant University, USA ..... 1819
Rabaey, Marc / University of Hasselt and Belgian Ministry of Defense, Belgium ..... 474
Rabuñal, Juan R. / University of A Coruña, Spain ..... 704
Rahmouni, Ali / Tahar Moulay University, Algeria ..... 2208
Rahnamayan, Shahryar / University of Ontario Institute of Technology, Canada ..... 816
Raicu, Daniela S. / DePaul University, USA ..... 1794
Rajasethupathy, Karthik / Cornell University, USA ..... 28
Rajasethupathy, Kulathur S. / The College at Brockport, State University of New York, USA ..... 28
Rani, Pratibha / International Institute of Information Technology Hyderabad, India ..... 1019
Reddy, Ranga / CERDEC, USA ..... 336
Ribeiro, Marcela X. / Federal University of Sao Carlos, Brazil ..... 1624
Rios, Armando / Institute Technologic of Celaya, Mexico ..... 1607
Rivero, Daniel / University of A Coruña, Spain ..... 704
Robardet, Céline / Université de Lyon, France ..... 719
Romani, Luciana A. S. / University of Sao Paulo at Sao Carlos, Brazil & Embrapa Agriculture Informatics at Campinas, Brazil ..... 1624
Rompré, Louis / University of Quebec at Montreal, Canada ..... 503
Roselin, R. / Sri Sarada College for Women (Autonomous), India ..... 775
Sahani, Mazrura / Universiti Kebangsaan Malaysia, Malaysia ..... 2057
Sajja, Priti Srinivas / Sardar Patel University, India ..... 299
Sambukova, Tatiana V. / Military Medical Academy, Russia ..... 893
Sander, Tomas / Cloud and Security Research Lab, HP Labs, USA ..... 1496
Santos, Manuel Filipe / University of Minho, Portugal ..... 1873
Sapozhnikova, Elena / University of Konstanz, Germany ..... 125
Sarabdeen, Jawahitha / University of Wollongong in Dubai, UAE ..... 1752
Scarnò, Marco / Inter-University Consortium for SuperComputing, CASPUR, Italy ..... 1312
Scheepers-Hoeks, Anne-Marie / Catharina Hospital, The Netherlands ..... 1461
Schoier, Gabriella / Università di Trieste, Italy ..... 435
Schütte, Christof / Freie Universität Berlin, Germany ..... 2069
Schumacher, Phyllis / Bryant University, USA ..... 1819
Scime, Anthony / The College at Brockport, State University of New York, USA ..... 28, 837
Sekhar, C. Chandra / Indian Institute of Technology Madras, India ..... 251
Sekhar, Pedamallu Chandra / New England Biolabs Inc., USA ..... 586
Senellart, Pierre / Télécom ParisTech, France ..... 669
Senthil Kumar, A.V. / Hindusthan College of Arts and Science, Bharathiar University, India ..... 586
Shaw, Mildred L. G. / University of Calgary, Canada ..... 1688
Shekar, Chandra / University of Akron, USA ..... 1407
Shen, Dou / Microsoft Corporation, USA ..... 1320
Sheng, Chenxing / Reliability Engineering Institute, Wuhan University of Technology, China ..... 2174
Shukla, Anupam / Indian Institute of Information Technology and Management Gwalior, India ..... 1114
Silvestri, Fabrizio / ISTI-CNR, Pisa, Italy ..... 658
Siminica, Marian / University of Craiova, Romania ..... 920
Simonet, Michel / Joseph Fourier University, France ..... 2208
Smets, Jean-Paul / Nexedi SA, France ..... 1979
Soria-Olivas, Emilio / University of Valencia, Spain ..... 650
Stegmayer, Georgina / Universidad Tecnologica Nacional & National Scientific and Technical Research Council, Argentina ..... 203
Stifter, C. A. / Pennsylvania State University, USA ..... 1737
Stocks, Karen / University of California San Diego, USA ..... 1709
Subrahmanya, S. V. / Infosys Limited, India ..... 734
Summers, Ronald M. / National Institutes of Health, USA ..... 1097
Sun, Zhaohao / University of Ballarat, Australia ..... 1253
Sundarraj, Gnanasekaran / Pennsylvania State University, USA ..... 859
Suri, N N R Ranga / C V Raman Nagar, India ..... 159
Swanger, Tyler / Yahoo! & The College at Brockport: State University of New York, USA ..... 837
Sze, Loretta K.W. / The Hong Kong Polytechnic University, Hong Kong ..... 231
Temizel, Tuğba Taşkaya / Informatics Institute, METU, Turkey ..... 603
Ternes, Sônia / Embrapa Agriculture Informatics, Campinas, Brazil ..... 2094
Thangavel, K. / Periyar University, India ..... 775
Tiwari, Ritu / Indian Institute of Information Technology and Management Gwalior, India ..... 1114
Tolomei, Gabriele / ISTI-CNR, Pisa, Italy ..... 658
Traina Jr., Caetano / University of Sao Paulo at Sao Carlos, Brazil ..... 1624
Traina, Agma J. M. / University of Sao Paulo at Sao Carlos, Brazil ..... 1624
Tran, Tien / Queensland University of Technology, Australia ..... 1
Vacio, Rubén Jaramillo / CFE – LAPEM & CIATEC – CONACYT, Mexico ..... 1607
Vaisman, Alejandro / Universidad de la Republica, Uruguay ..... 2021
Valente, Juan Pedro / Universidad Politécnica de Madrid, Spain ..... 1936
van der Linden, Carolien / Catharina Hospital, The Netherlands ..... 1461
Vanhoof, Koen / Hasselt University, Belgium ..... 445, 1664
Vasilescu, Laura Giurca / University of Craiova, Romania ..... 920
Veena, T. / Indian Institute of Technology Madras, India ..... 251
Velychko, Vitalii / V.M. Glushkov Institute of Cybernetics, Ukraine ..... 445
Vescoukis, Vassilios / National Technical University of Athens, Greece ..... 2269
Villar, Pedro / University of Granada, Spain ..... 179
Visoli, Marcos / Embrapa Agriculture Informatics, Campinas, Brazil ..... 2094
Wang, Baoying / Waynesburg University, USA ..... 142
Wang, Weiqi / University of Oxford, UK ..... 970
Wang, Yanbo J. / China Minsheng Banking Corporation Ltd., China ..... 970
Whitlock, Kaitlyn / Yahoo!, USA ..... 837
Wijaya, Tri Kurniawan / Sekolah Tinggi Teknik Surabaya, Indonesia ..... 1150
Woehlbier, Ute / University of Chile, Chile ..... 1131
Wölfel, Klaus / Technische Universität Dresden, Germany ..... 1979
Wong, T. T. / The Hong Kong Polytechnic University, Hong Kong ..... 231
Xie, Bin / InfoBeyond Technology LLC, USA ..... 336
Xin, Qin / Simula Research Laboratory, Norway ..... 970
Xu, Yabo / Sun Yat-sen University, P. R. China ..... 1916
Yamada, Yasuhiro / Shimane University, Japan ..... 1390
Yan, Jun / Microsoft Research Asia, China ..... 1320
Yan, Xinping / Reliability Engineering Institute, Wuhan University of Technology, China ..... 2174
Yao, Jianhua / National Institutes of Health, USA ..... 1097
Yigitcanlar, Tan / Queensland University of Technology, Australia ..... 1545
Yip, K. K. / Pennsylvania State University, USA ..... 1737
Yoon, Sunmoo / Columbia University, USA ..... 1082
Young, Darwin / COMIMSA Centro Conacyt, Mexico ..... 1163
Yuan, Chengqing / Reliability Engineering Institute, Wuhan University of Technology, China ..... 2174
Yuan, May / University of Oklahoma, USA ..... 2117
Zadrożny, Sławomir / Polish Academy of Sciences, Poland ..... 279
Zanda, Andrea / Universidad Politécnica, Spain ..... 1960
Zezzatti, Carlos Alberto Ochoa Ortiz / Juarez City University, México ..... 1163, 1607
Zhang, Ji / University of Southern Queensland, Australia ..... 530
Zhang, Ping / CSIRO, Australia ..... 1253
Zhang, Yuelei / Reliability Engineering Institute, Wuhan University of Technology, China ..... 2174
Zhang, Zhongfei (Mark) / SUNY Binghamton, USA ..... 567
Zhao, David / CERDEC, USA ..... 336
Zhao, Jiangbin / Reliability Engineering Institute, Wuhan University of Technology, China ..... 2174
Zhou, Mingming / Nanyang Technological University, Singapore ..... 1916
Zhu, Junda / The University of Illinois-Chicago, USA ..... 395
Zorrilla, Marta E. / University of Cantabria, Spain ..... 1291
Zoukra, Kristine Al / Freie Universität Berlin, Germany ..... 2069
Zullo Jr., Jurandir / University of Campinas, Brazil ..... 1624
Table of Contents
Volume I

Section 1
Fundamental Concepts and Theories

This section serves as a foundation for this exhaustive reference tool by addressing underlying principles essential to the understanding of Data Mining. Chapters found within these pages provide an excellent framework in which to position Data Mining within the field of information science and technology. Insight regarding the critical incorporation of global measures into Data Mining is addressed, while crucial stumbling blocks of this field are explored. With 15 chapters comprising this foundational section, the reader can learn and choose from a compendium of expert research on the elemental theories underscoring the Data Mining discipline.
Chapter 1
A Study of XML Models for Data Mining: Representations, Methods, and Issues ..... 1
Sangeetha Kutty, Queensland University of Technology, Australia
Richi Nayak, Queensland University of Technology, Australia
Tien Tran, Queensland University of Technology, Australia

Chapter 2
Finding Persistent Strong Rules: Using Classification to Improve Association Mining ..... 28
Anthony Scime, The College at Brockport, State University of New York, USA
Karthik Rajasethupathy, Cornell University, USA
Kulathur S. Rajasethupathy, The College at Brockport, State University of New York, USA
Gregg R. Murray, Texas Tech University, USA

Chapter 3
Data Discovery Approaches for Vague Spatial Data ..... 50
Frederick E. Petry, Naval Research Laboratory, USA

Chapter 4
Active Learning and Mapping: A Survey and Conception of a New Stochastic Methodology for High Throughput Materials Discovery ..... 66
Laurent A. Baumes, CSIC-Universidad Politecnica de Valencia, Spain

Chapter 5
An Optimal Categorization of Feature Selection Methods for Knowledge Discovery ..... 92
Harleen Kaur, Hamdard University, India
Ritu Chauhan, Hamdard University, India
M. Afshar Alam, Hamdard University, India

Chapter 6
Parallel Computing for Mining Association Rules in Distributed P2P Networks ..... 107
Huiwei Guan, North Shore Community College, USA

Chapter 7
Learning Different Concept Hierarchies and the Relations between them from Classified Data ..... 125
Fernando Benites, University of Konstanz, Germany
Elena Sapozhnikova, University of Konstanz, Germany

Chapter 8
Online Clustering and Outlier Detection ..... 142
Baoying Wang, Waynesburg University, USA
Aijuan Dong, Hood College, USA

Chapter 9
Data Mining Techniques for Outlier Detection ..... 159
N N R Ranga Suri, C V Raman Nagar, India
M Narasimha Murty, Indian Institute of Science, India
G Athithan, C V Raman Nagar, India

Chapter 10
An Extraction, Transformation, and Loading Tool Applied to a Fuzzy Data Mining System ..... 179
Ramón A. Carrasco, University of Granada, Spain
Miguel J. Hornos, University of Granada, Spain
Pedro Villar, University of Granada, Spain
María A. Aguilar, University of Granada, Spain

Chapter 11
Analysis and Integration of Biological Data: A Data Mining Approach using Neural Networks ..... 203
Diego Milone, Universidad Nacional del Litoral & National Scientific and Technical Research Council, Argentina
Georgina Stegmayer, Universidad Tecnologica Nacional & National Scientific and Technical Research Council, Argentina
Matías Gerard, Universidad Nacional del Litoral & Universidad Tecnologica Nacional & National Scientific and Technical Research Council, Argentina
Laura Kamenetzky, Institute of Biotechnology, INTA & National Scientific and Technical Research Council, Argentina
Mariana López, Institute of Biotechnology, INTA & National Scientific and Technical Research Council, Argentina
Fernando Carrari, Institute of Biotechnology, INTA & National Scientific and Technical Research Council, Argentina

Chapter 12
A Neuro-Fuzzy Partner Selection System for Business Social Networks ..... 231
T. T. Wong, The Hong Kong Polytechnic University, Hong Kong
Loretta K.W. Sze, The Hong Kong Polytechnic University, Hong Kong

Chapter 13
A Review of Kernel Methods Based Approaches to Classification and Clustering of Sequential Patterns, Part I: Sequences of Continuous Feature Vectors ..... 251
A. D. Dileep, Indian Institute of Technology Madras, India
T. Veena, Indian Institute of Technology Madras, India
C. Chandra Sekhar, Indian Institute of Technology Madras, India

Chapter 14
Remarks on a Fuzzy Approach to Flexible Database Querying, Its Extension and Relation to Data Mining and Summarization ..... 279
Janusz Kacprzyk, Polish Academy of Sciences, Poland
Guy De Tré, Ghent University, Belgium
Sławomir Zadrożny, Polish Academy of Sciences, Poland

Chapter 15
Measuring Human Intelligence by Applying Soft Computing Techniques: A Genetic Fuzzy Approach ..... 299
Kunjal Mankad, ISTAR, CVM, India
Priti Srinivas Sajja, Sardar Patel University, India
Section 2
Development and Design Methodologies

This section provides in-depth coverage of conceptual architecture frameworks to give the reader a comprehensive understanding of the emerging developments within the field of Data Mining. Research fundamentals imperative to the understanding of developmental processes within Data Mining are offered. From broad examinations to specific discussions on methodology, the research found within this section spans the discipline while offering detailed, focused treatments. From basic designs to abstract development, these chapters serve to expand the reaches of development and design technologies within the Data Mining community. This section includes 15 contributions from researchers throughout the world on the topic of Data Mining.
Chapter 16
Web Mining and Social Network Analysis ..... 326
Roberto Marmo, University of Pavia, Italy

Chapter 17
Long-Term Evolution (LTE): Broadband-Enabled Next Generation of Wireless Mobile Cellular Network ..... 336
Bing He, Aviat Networks Inc., USA
Bin Xie, InfoBeyond Technology LLC, USA
Sanjuli Agrawal, InfoBeyond Technology LLC, USA
David Zhao, CERDEC, USA
Ranga Reddy, CERDEC, USA

Chapter 18
Soft Computing and its Applications ..... 366
Siddhartha Bhattacharyya, The University of Burdwan, India
Ujjwal Maulik, Jadavpur University, India
Sanghamitra Bandyopadhyay, Indian Statistical Institute, India

Chapter 19
A Particle Filtering Based Approach for Gear Prognostics ..... 395
David He, The University of Illinois-Chicago, USA
Eric Bechhoefer, NRG Systems, USA
Jinghua Ma, The University of Illinois-Chicago, USA
Junda Zhu, The University of Illinois-Chicago, USA

Chapter 20
Science and Water Policy Interface: An Integrated Methodological Framework for Developing Decision Support Systems (DSSs) ..... 405
Mukhtar Hashemi, Newcastle University, UK
Enda O’Connell, Newcastle University, UK

Chapter 21
On the MDBSCAN Algorithm in a Spatial Data Mining Context ..... 435
Gabriella Schoier, Università di Trieste, Italy

Chapter 22
Intelligent Data Processing Based on Multi-Dimensional Numbered Memory Structures ..... 445
Krassimir Markov, Institute of Mathematics and Informatics at BAS, Bulgaria
Koen Vanhoof, Hasselt University, Belgium
Iliya Mitov, Institute of Information Theories and Applications, Bulgaria
Benoit Depaire, Hasselt University, Belgium
Krassimira Ivanova, University for National and World Economy, Bulgaria
Vitalii Velychko, V.M. Glushkov Institute of Cybernetics, Ukraine
Victor Gladun, V.M. Glushkov Institute of Cybernetics, Ukraine

Chapter 23
Framework of Knowledge and Intelligence Base: From Intelligence to Service ..... 474
Marc Rabaey, University of Hasselt and Belgian Ministry of Defense, Belgium
Roger Mercken, University of Hasselt, Belgium

Chapter 24
Using Association Rules for Query Reformulation ..... 503
Ismaïl Biskri, University of Quebec at Trois-Rivieres, Canada
Louis Rompré, University of Quebec at Montreal, Canada

Chapter 25
A Framework on Data Mining on Uncertain Data with Related Research Issues in Service Industry ..... 515
Edward Hung, Hong Kong Polytechnic University, Hong Kong

Chapter 26
A Subspace-Based Analysis Method for Anomaly Detection in Large and High-Dimensional Network Connection Data Streams ..... 530
Ji Zhang, University of Southern Queensland, Australia

Chapter 27
Suggested Model for Business Intelligence in Higher Education ..... 550
Zaidoun Alzoabi, Arab International University, Syria
Faek Diko, Arab International University, Syria
Saiid Hanna, Arab International University, Syria

Chapter 28
A Highly Scalable and Adaptable Co-Learning Framework on Multimodal Data Mining in a Multimedia Database ..... 567
Zhongfei (Mark) Zhang, SUNY Binghamton, USA
Zhen Guo, SUNY Binghamton, USA
Christos Faloutsos, Carnegie Mellon University, USA
Jia-Yu (Tim) Pan, Google Inc., USA

Volume II

Chapter 29
Temporal Association Rule Mining in Large Databases ..... 586
A.V. Senthil Kumar, Hindusthan College of Arts and Science, Bharathiar University, India
Adnan Alrabea, Al Balqa Applied University, Jordan
Pedamallu Chandra Sekhar, New England Biolabs Inc., USA

Chapter 30
A Framework to Detect Disguised Missing Data ..... 603
Rahime Belen, Informatics Institute, METU, Turkey
Tuğba Taşkaya Temizel, Informatics Institute, METU, Turkey
Section 3
Tools and Technologies
This section presents an extensive coverage of various tools and technologies available in the field of Data Mining that practitioners and academicians alike can utilize to develop different techniques. These chapters enlighten readers about fundamental research on the many tools facilitating the burgeoning field of Data Mining. It is through these rigorously researched chapters that the reader is provided with countless examples of the up-and-coming tools and technologies emerging from the field of Data Mining. With 14 chapters, this section offers a broad treatment of some of the many tools and technologies within the Data Mining field.
Chapter 31 XML Mining for Semantic Web ......................................................................................................... 625 Rafael Berlanga, Universitat Jaume I, Spain Victoria Nebot, Universitat Jaume I, Spain Chapter 32 Visual Data Mining in Physiotherapy Using Self-Organizing Maps: A New Approximation to the Data Analysis....................................................................................................................................... 650 Yasser Alakhdar, University of Valencia, Spain José M. Martínez-Martínez, University of Valencia, Spain Josep Guimerà-Tomás, University of Valencia, Spain Pablo Escandell-Montero, University of Valencia, Spain Josep Benitez, University of Valencia, Spain Emilio Soria-Olivas, University of Valencia, Spain Chapter 33 Mining Lifecycle Event Logs for Enhancing Service-Based Applications......................................... 658 Schahram Dustdar, Vienna University of Technology, Austria Philipp Leitner, Vienna University of Technology, Austria Franco Maria Nardini, ISTI-CNR, Pisa, Italy Fabrizio Silvestri, ISTI-CNR, Pisa, Italy Gabriele Tolomei, ISTI-CNR, Pisa, Italy Chapter 34 Modeling, Querying, and Mining Uncertain XML Data .................................................................... 669 Evgeny Kharlamov, Free University of Bozen-Bolzano, Italy & INRIA Saclay, France Pierre Senellart, Télécom ParisTech, France Chapter 35 The Use of Prediction Reliability Estimates on Imbalanced Datasets: A Case Study of Wall Shear Stress in the Human Carotid Artery Bifurcation.................................................................................. 692 Domen Košir, University of Ljubljana, Slovenia & Httpool Ltd., Slovenia Zoran Bosnić, University of Ljubljana, Slovenia Igor Kononenko, University of Ljubljana, Slovenia Chapter 36 Database Analysis with ANNs by means of Graph Evolution............................................................. 704 Daniel Rivero, University of A Coruña, Spain Julián Dorado, University of A Coruña, Spain Juan R. Rabuñal, University of A Coruña, Spain Alejandro Pazos, University of A Coruña, Spain Chapter 37 Data Mining Techniques for Communities’ Detection in Dynamic Social Networks ........................ 719 Céline Robardet, Université de Lyon, France
Chapter 38 Quantization based Sequence Generation and Subsequence Pruning for Data Mining Applications......................................................................................................................................... 734 T. Ravindra Babu, Infosys Limited, India M. Narasimha Murty, Indian Institute of Science Bangalore, India S. V. Subrahmanya, Infosys Limited, India Chapter 39 Data Mining System Execution Traces to Validate Distributed System Quality-of-Service Properties ............................................................................................................................................ 751 James H. Hill, Indiana University-Purdue University Indianapolis, USA Chapter 40 Mammogram Mining Using Genetic Ant-Miner ................................................................................ 775 K. Thangavel, Periyar University, India R. Roselin, Sri Sarada College for Women (Autonomous), India Chapter 41 A Dynamic Privacy Manager for Compliance in Pervasive Computing ............................................ 793 Riccardo Bonazzi, University of Lausanne, Switzerland Zhan Liu, University of Lausanne, Switzerland Simon Ganière, Deloitte SA, Switzerland Yves Pigneur, University of Lausanne, Switzerland Chapter 42 Heterogeneous Text and Numerical Data Mining with Possible Applications in Business and Financial Sectors.................................................................................................................................. 816 Farid Bourennani, University of Ontario Institute of Technology, Canada Shahryar Rahnamayan, University of Ontario Institute of Technology, Canada Chapter 43 ANGEL Mining .................................................................................................................................. 837 Tyler Swanger, Yahoo! & The College at Brockport: State University of New York, USA Kaitlyn Whitlock, Yahoo!, USA Anthony Scime, The College at Brockport: State University of New York, USA Brendan P. Post, The College at Brockport: State University of New York, USA Chapter 44 Frequent Pattern Discovery and Association Rule Mining of XML Data .......................................... 859 Qin Ding, East Carolina University, USA Gnanasekaran Sundarraj, Pennsylvania State University, USA
Section 4
Utilization and Application
This section discusses a variety of applications and opportunities available that can be considered by practitioners in developing viable and effective Data Mining programs and processes. This section includes 14 chapters that review topics from case studies in Africa to best practices in Asia and ongoing research in the United States. Further chapters discuss Data Mining in a variety of settings (government, research, health care, etc.). Contributions included in this section provide excellent coverage of today’s IT community and how research into Data Mining is impacting the social fabric of our present-day global village.
Chapter 45 Virtual Telemedicine and Virtual Telehealth: A Natural Language Based Implementation to Address Time Constraint Problem..................................................................................................................... 881 Shazia Kareem, The Islamia University of Bahawalpur, Pakistan Imran Sarwar Bajwa, University of Birmingham, UK Chapter 46 Machine Learning in Studying the Organism’s Functional State of Clinically Healthy Individuals Depending on Their Immune Reactivity ............................................................................................ 893 Tatiana V. Sambukova, Military Medical Academy, Russia Chapter 47 Data Mining Used for Analyzing the Bankruptcy Risk of the Romanian SMEs................................. 920 Laura Giurca Vasilescu, University of Craiova, Romania Marian Siminica, University of Craiova, Romania Cerasela Pirvu, University of Craiova, Romania Costel Ionascu, University of Craiova, Romania Anca Mehedintu, University of Craiova, Romania Chapter 48 Applications of Pattern Discovery Using Sequential Data Mining .................................................... 947 Manish Gupta, University of Illinois at Urbana-Champaign, USA Jiawei Han, University of Illinois at Urbana-Champaign, USA Chapter 49 A Comparative Study of Associative Classifiers in Mesenchymal Stem Cell Differentiation Analysis............................................................................................................................................... 970 Weiqi Wang, University of Oxford, UK Yanbo J. Wang, China Minsheng Banking Corporation Ltd., China Qin Xin, Simula Research Laboratory, Norway René Bañares-Alcántara, University of Oxford, UK Frans Coenen, University of Liverpool, UK Zhanfeng Cui, University of Oxford, UK Chapter 50 Cloud-Based Intelligent DSS Design for Emergency Professionals .................................................. 991 Shah J. Miah, Victoria University, Australia
Chapter 51 Extraction of Medical Pathways from Electronic Patient Records.................................................... 1004 Dario Antonelli, Politecnico di Torino, Italy Elena Baralis, Politecnico di Torino, Italy Giulia Bruno, Politecnico di Torino, Italy Silvia Chiusano, Politecnico di Torino, Italy Naeem A. Mahoto, Politecnico di Torino, Italy Caterina Petrigni, Politecnico di Torino, Italy Chapter 52 Classification of Biological Sequences . ........................................................................................... 1019 Pratibha Rani, International Institute of Information Technology Hyderabad, India Vikram Pudi, International Institute of Information Technology Hyderabad, India Chapter 53 Automated Diagnostics of Coronary Artery Disease: Long-Term Results and Recent Advancements.................................................................................................................................... 1043 Matjaž Kukar, University of Ljubljana, Slovenia Igor Kononenko, University of Ljubljana, Slovenia Ciril Grošelj, University Medical Centre Ljubljana, Slovenia Chapter 54 Modeling Interpretable Fuzzy Rule-Based Classifiers for Medical Decision Support...................... 1064 Jose M. Alonso, University of Alcala, Spain Ciro Castiello, University of Bari, Italy Marco Lucarelli, University of Bari, Italy Corrado Mencar, University of Bari, Italy Chapter 55 Implications for Nursing Research and Generation of Evidence....................................................... 1082 Suzanne Bakken, Columbia University, USA Robert Lucero, Columbia University, USA Sunmoo Yoon, Columbia University, USA Nicholas Hardiker, University of Salford, UK Chapter 56 Content-Based Image Retrieval for Medical Image Analysis .......................................................... 1097 Jianhua Yao, National Institutes of Health, USA Ronald M. Summers, National Institutes of Health, USA Chapter 57 Intelligent Decision Support System for Fetal Delivery using Soft Computing Techniques............ 1114 R. R. Janghel, Indian Institute of Information Technology and Management Gwalior, India Anupam Shukla, Indian Institute of Information Technology and Management Gwalior, India Ritu Tiwari, Indian Institute of Information Technology and Management Gwalior, India
Chapter 58 Systems Biology-Based Approaches Applied to Vaccine Development .......................................... 1131 Patricio A. Manque, Universidad Mayor, Chile Ute Woehlbier, University of Chile, Chile
Section 5
Organizational and Social Implications
This section includes a wide range of research pertaining to the social and behavioral impact of Data Mining around the world. Chapters introducing this section critically analyze and discuss trends in Data Mining, such as media strategies, prediction modeling, and knowledge discovery. Additional chapters included in this section look at ICT policies and the facilitation of online learning. Also investigated is a concern within the field of Data Mining: the effect of user behavior on Data Mining itself. With 15 chapters, the discussions presented in this section offer research into the integration of global Data Mining as well as the implementation of ethical and workflow considerations for all organizations.
Chapter 59 From Data to Knowledge: Data Mining............................................................................................ 1150 Tri Kurniawan Wijaya, Sekolah Tinggi Teknik Surabaya, Indonesia
Volume III Chapter 60 Mass Media Strategies: Hybrid Approach using a Bioinspired Algorithm and Social Data Mining................................................................................................................................................ 1163 Carlos Alberto Ochoa Ortiz Zezzatti, Juarez City University, México Darwin Young, COMIMSA Centro Conacyt, Mexico Camelia Chira, Babeş-Bolyai University, Romania Daniel Azpeitia, Juarez City University, México Alán Calvillo, Aguascalientes University, México Chapter 61 Data Mining, Validation, and Collaborative Knowledge Capture .................................................... 1189 Martin Atzmueller, University of Kassel, Germany Stephanie Beer, University Clinic of Wuerzburg, Germany Frank Puppe, University of Wuerzburg, Germany Chapter 62 Enterprise Architecture for Personalization of e-Government Services: Reflections from Turkey................................................................................................................................................ 1208 Alpay Erdem, Middle East Technical University, Turkey İhsan Tolga Medeni, Middle East Technical University, Turkey & Turksat, Turkey & Çankaya University, Turkey Tunç D. Medeni, Turksat, Turkey & Middle East Technical University, Turkey & Yildirim Beyazit University, Turkey
Chapter 63 Knowledge Discovery from Online Communities............................................................................ 1230 Luca Cagliero, Politecnico di Torino, Italy Alessandro Fiori, Politecnico di Torino, Italy Chapter 64 Customer Decision Making in Web Services ................................................................................... 1253 Zhaohao Sun, University of Ballarat, Australia Ping Zhang, CSIRO, Australia Dong Dong, Hebei Normal University, China Chapter 65 A Predictive Modeling of Retail Satisfaction: A Data Mining Approach to Retail Service Industry.............................................................................................................................................. 1276 M. Hemalatha, M.A.M. College of Engineering, India Chapter 66 A Data Mining Service to Assist Instructors Involved in Virtual Education .................................... 1291 Marta E. Zorrilla, University of Cantabria, Spain Diego García, University of Cantabria, Spain Chapter 67 User’s Behaviour inside a Digital Library......................................................................................... 1312 Marco Scarnò, Inter-University Consortium for SuperComputing, CASPUR, Italy Chapter 68 Behavioral Targeting Online Advertising.......................................................................................... 1320 Jun Yan, Microsoft Research Asia, China Dou Shen, Microsoft Corporation, USA Teresa Mah, Microsoft Corporation, USA Ning Liu, Microsoft Research Asia, China Zheng Chen, Microsoft Research Asia, China Ying Li, Microsoft Corporation, USA Chapter 69 Mobile Purchase Decision Support Systems for In-Store Shopping Environments . ....................... 1339 Tobias Kowatsch, University of St. Gallen, Switzerland Wolfgang Maass, University of St. Gallen, Switzerland & Hochschule Furtwangen University, Germany Chapter 70 Structuring and Facilitating Online Learning through Learning/Course Management Systems.............................................................................................................................................. 1358 Shalin Hai-Jew, Kansas State University, USA
Chapter 71 Conversation-Oriented Decision Support Systems for Organizations............................................... 1376 Paulo Garrido, University of Minho, Portugal Chapter 72 Text Mining for Analysis of Interviews and Questionnaires............................................................. 1390 Yasuhiro Yamada, Shimane University, Japan Kanji Kato, GK Sekkei Incorporated, Japan Sachio Hirokawa, Kyushu University, Japan Chapter 73 Detecting Pharmaceutical Spam in Microblog Messages . ............................................................... 1407 Kathy J. Liszka, University of Akron, USA Chien-Chung Chan, University of Akron, USA Chandra Shekar, University of Akron, USA
Section 6
Managerial Impact
This section presents contemporary coverage of the social implications of Data Mining, more specifically related to the corporate and managerial utilization of information sharing technologies and applications, and how these technologies can be extrapolated for use in Data Mining. Core tools and concepts such as warehousing, warning mechanisms, decision support systems, direct marketing, and pattern detection are discussed. Equally crucial, chapters within this section discuss how leaders can utilize Data Mining applications to get the best outcomes from their governors and their citizens.
Chapter 74 Innovative Approaches for Efficiently Warehousing Complex Data from the Web.......................... 1422 Fadila Bentayeb, University of Lyon, France Nora Maïz, University of Lyon, France Hadj Mahboubi, CEMAGREF Centre Clermont-Ferrand, France Cécile Favre, University of Lyon, France Sabine Loudcher, University of Lyon, France Nouria Harbi, University of Lyon, France Omar Boussaïd, University of Lyon, France Jérôme Darmont, University of Lyon, France Chapter 75 A Three-Level Multiple-Agent Early Warning Mechanism for Preventing Loss of Customers in Fashion Supply Chains ..................................................................................................................... 1449 Wei-Shuo Lo, Meiho University, Taiwan Tzung-Pei Hong, National University of Kaohsiung, Taiwan
Chapter 76 Clinical Decision Support Systems for ‘Making It Easy to Do It Right’ ......................................... 1461 Anne-Marie Scheepers-Hoeks, Catharina Hospital, The Netherlands Floor Klijn, Catharina Hospital, The Netherlands Carolien van der Linden, Catharina Hospital, The Netherlands Rene Grouls, Catharina Hospital, The Netherlands Eric Ackerman, Catharina Hospital, The Netherlands Niels Minderman, Catharina Hospital, The Netherlands Jan Bergmans, Eindhoven University of Technology, The Netherlands Erik Korsten, Catharina Hospital and Eindhoven University of Technology, The Netherlands Chapter 77 Data Mining and Explorative Multivariate Data Analysis for Customer Satisfaction Study............ 1472 Rosaria Lombardo, Second University of Naples, Italy Chapter 78 A Decision Support System for Privacy Compliance........................................................................ 1496 Siani Pearson, Cloud and Security Research Lab, HP Labs, UK Tomas Sander, Cloud and Security Research Lab, HP Labs, USA Chapter 79 Supporting Companies Management and Improving their Productivity through Mining Customers Transactions ...................................................................................................................................... 1519 Asem Omari, Jarash Private University, Jordan Chapter 80 A Hybrid Multilayer Perceptron Neural Network for Direct Marketing........................................... 1534 M. Govindarajan, Annamalai University, India RM. Chandrasekaran, Annamalai University, India Chapter 81 Developing a Competitive City through Healthy Decision-Making ................................................ 1545 Ori Gudes, Queensland University of Technology, Australia & Griffith University, Australia Elizabeth Kendall, Griffith University, Australia Tan Yigitcanlar, Queensland University of Technology, Australia Jung Hoon Han, The University of New South Wales, Australia Virendra Pathak, Queensland University of Technology, Australia Chapter 82 Financial Early Warning System for Risk Detection and Prevention from Financial Crisis ............ 1559 Nermin Ozgulbas, Baskent University, Turkey Ali Serhan Koyuncugil, Capital Markets Board of Turkey, Turkey
Chapter 83 Role of Data Mining and Knowledge Discovery in Managing Telecommunication Systems ......... 1591 Ibrahiem Mahmoud Mohamed El Emary, King Abdulaziz University, Saudi Arabia Chapter 84 Data Mining Applications in the Electrical Industry ........................................................................ 1607 Rubén Jaramillo Vacio, CFE – LAPEM & CIATEC – CONACYT, Mexico Carlos Alberto Ochoa Ortiz Zezzatti, Juarez City University, México Armando Rios, Institute Technologic of Celaya, Mexico Chapter 85 Mining Climate and Remote Sensing Time Series to Improve Monitoring of Sugar Cane Fields.................................................................................................................................................. 1624 Luciana A. S. Romani, University of Sao Paulo at Sao Carlos, Brazil & Embrapa Agriculture Informatics at Campinas, Brazil Elaine P. M. de Sousa, University of Sao Paulo at Sao Carlos, Brazil Marcela X. Ribeiro, Federal University of Sao Carlos, Brazil Ana M. H. de Ávila, University of Campinas, Brazil Jurandir Zullo Jr., University of Campinas, Brazil Caetano Traina Jr., University of Sao Paulo at Sao Carlos, Brazil Agma J. M. Traina, University of Sao Paulo at Sao Carlos, Brazil Chapter 86 Discovering Patterns in Order to Detect Weak Signals and Define New Strategies ........................ 1647 Anass El Haddadi, University of Toulouse III, France & University of Mohamed V, Morocco Bernard Dousset, University of Toulouse, France Ilham Berrada, University of Mohamed V, Morocco Chapter 87 Data Mining and Economic Crime Risk Management...................................................................... 1664 Mieke Jans, Hasselt University, Belgium Nadine Lybaert, Hasselt University, Belgium Koen Vanhoof, Hasselt University, Belgium
Section 7
Critical Issues
This section contains 15 chapters, giving a wide variety of perspectives on Data Mining and its implications. Such perspectives include readings on privacy, gender, ethics, and several more. The section also discusses legal issues in e-health, sociocognitive inquiry, and much more. Within the chapters, the reader is presented with an in-depth analysis of the most current and relevant issues within this growing field of study. Crucial questions are addressed and alternatives offered, such as what constitutes cooperation between expert knowledge and data mining discovered knowledge.
Chapter 88 Sociocognitive Inquiry....................................................................................................................... 1688 Brian R. Gaines, University of Victoria, Canada Mildred L. G. Shaw, University of Calgary, Canada Chapter 89 Coastal Atlas Interoperability ........................................................................................................... 1709 Yassine Lassoued, University College Cork, Ireland Trung T. Pham, University College Cork, Ireland Luis Bermudez, Southeastern University Research Association, USA Karen Stocks, University of California San Diego, USA Eoin O’Grady, Marine Institute, Ireland Anthony Isenor, Defense R&D Canada – Atlantic, Canada Paul Alexander, Marine Metadata Interoperability Initiative & Stanford Center for Biomedical Informatics Research, Palo Alto CA, USA
Volume IV Chapter 90 Association Rule Mining in Developmental Psychology ................................................................. 1737 D. A. Nembhard, Pennsylvania State University, USA K. K. Yip, Pennsylvania State University, USA C. A. Stifter, Pennsylvania State University, USA Chapter 91 Legal Issues in E-Healthcare Systems............................................................................................... 1752 Jawahitha Sarabdeen, University of Wollongong in Dubai, UAE Chapter 92 Privacy Expectations in Passive RFID Tagging of Motor Vehicles: Bayan Muna et al. v. Mendoza et al. in the Philippine Supreme Court................................................................................................... 1778 Diane A. Desierto, Yale Law School, USA and University of the Philippines College of Law, Philippines Chapter 93 Evaluation Challenges for Computer-Aided Diagnostic Characterization: Shape Disagreements in the Lung Image Database Consortium Pulmonary Nodule Dataset.............................................. 1794 William H. Horsthemke, DePaul University, USA Daniela S. Raicu, DePaul University, USA Jacob D. Furst, DePaul University, USA Samuel G. Armato III, University of Chicago, USA
Chapter 94 Assessing Data Mining Approaches for Analyzing Actuarial Student Success Rate ....... 1819 Alan Olinsky, Bryant University, USA Phyllis Schumacher, Bryant University, USA John Quinn, Bryant University, USA Chapter 95 Approaches for Pattern Discovery Using Sequential Data Mining .................................................. 1835 Manish Gupta, University of Illinois at Urbana-Champaign, USA Jiawei Han, University of Illinois at Urbana-Champaign, USA Chapter 96 Human-Centred Web Search.............................................................................................................. 1852 Orland Hoeber, Memorial University of Newfoundland, Canada Chapter 97 A Perspective on Data Mining Integration with Business Intelligence ............................................ 1873 Ana Azevedo, CEISE/STI, ISCAP/IPP, Portugal Manuel Filipe Santos, University of Minho, Portugal Chapter 98 Gridifying Neuroscientific Pipelines: A SOA Recipe and Experience from the neuGRID Project................................................................................................................................................ 1893 David Manset, maatG, France The neuGRID Consortium Chapter 99 Challenges to Use Recommender Systems to Enhance Meta-Cognitive Functioning in Online Learners.............................................................................................................................................. 1916 Mingming Zhou, Nanyang Technological University, Singapore Yabo Xu, Sun Yat-sen University, P. R. China Chapter 100 Cooperation between Expert Knowledge and Data Mining Discovered Knowledge........................ 1936 Fernando Alonso, Universidad Politécnica de Madrid, Spain Loïc Martínez, Universidad Politécnica de Madrid, Spain Aurora Pérez, Universidad Politécnica de Madrid, Spain Juan Pedro Valente, Universidad Politécnica de Madrid, Spain Chapter 101 Research Challenge of Locally Computed Ubiquitous Data Mining................................................ 1960 Aysegul Cayci, Sabanci University, Turkey João Bártolo Gomes, Universidad Politécnica, Spain Andrea Zanda, Universidad Politécnica, Spain Ernestina Menasalvas, Universidad Politécnica, Spain Santiago Eibe, Universidad Politécnica, Spain
Chapter 102 Tailoring FOS-ERP Packages: Automation as an Opportunity for Small Businesses....................... 1979 Klaus Wölfel, Technische Universität Dresden, Germany Jean-Paul Smets, Nexedi SA, France
Section 8
Emerging Trends
This section highlights research potential within the field of Data Mining while exploring uncharted areas of study for the advancement of the discipline. Introducing this section are chapters that set the stage for future research directions and topical suggestions for continued debate, centering on the new venues and forums for discussion. A pair of chapters on space-time analysis makes up the middle of this final section of 14 chapters, and the book concludes with a look ahead into the future of the Data Mining field, with “Advances of the Location Based Context-Aware Mobile Services in the Transport Sector.” In all, this text will serve as a vital resource to practitioners and academics interested in the best practices and applications of the burgeoning field of Data Mining.
Chapter 103 Optimization of a Hybrid Methodology (CRISP-DM)...................................................................... 1998 José Nava, Centro de Investigación en Ciencias Aplicadas para la Industria, México Paula Hernández, ITCM, México Chapter 104 A State-of-the-Art in Spatio-Temporal Data Warehousing, OLAP and Mining ............................... 2021 Leticia Gómez, Instituto Tecnológico de Buenos Aires, Argentina Bart Kuijpers, Hasselt University and Transnational University of Limburg, Belgium Bart Moelans, Hasselt University and Transnational University of Limburg, Belgium Alejandro Vaisman, Universidad de la Republica, Uruguay Chapter 105 Pattern Mining for Outbreak Discovery Preparedness ..................................................................... 2057 Zalizah Awang Long, Malaysia Institute Information Technology, Universiti Kuala Lumpur, Malaysia Abdul Razak Hamdan, Universiti Kebangsaan Malaysia, Malaysia Azuraliza Abu Bakar, Universiti Kebangsaan Malaysia, Malaysia Mazrura Sahani, Universiti Kebangsaan Malaysia, Malaysia Chapter 106 From Non-Invasive Hemodynamic Measurements towards Patient-Specific Cardiovascular Diagnosis .......................................................................................................................................... 2069 Stefan Bernhard, Freie Universität Berlin, Germany Kristine Al Zoukra, Freie Universität Berlin, Germany Christof Schütte, Freie Universität Berlin, Germany
Chapter 107 Towards Spatial Decision Support System for Animals Traceability................................................ 2094 Marcos Visoli, Embrapa Agriculture Informatics, Campinas, Brazil Sandro Bimonte, Cemagref, TSCF, Clermont Ferrand, France Sônia Ternes, Embrapa Agriculture Informatics, Campinas, Brazil François Pinet, Cemagref, TSCF, Clermont Ferrand, France Jean-Pierre Chanet, Cemagref, TSCF, Clermont Ferrand, France Chapter 108 Space-Time Analytics for Spatial Dynamics .................................................................................... 2117 May Yuan, University of Oklahoma, USA James Bothwell, University of Oklahoma, USA Chapter 109 Metaheuristic Optimization in Seismic Structural Design and Inspection Scheduling of Buildings............................................................................................................................................ 2132 Chara Ch. Mitropoulou, Institute of Structural Analysis and Seismic Research, National Technical University Athens, Greece Vagelis Plevris, School of Pedagogical and Technological Education (ASPETE), Greece Nikos D. Lagaros, Institute of Structural Analysis and Seismic Research, National Technical University Athens, Greece Chapter 110 An Approach for Land-Use Suitability Assessment Using Decision Support Systems, AHP and GIS .................................................................................................................................................... 2153 Erkan Polat, Suleyman Demirel University, Turkey Chapter 111 Remote Fault Diagnosis System for Marine Power Machinery System............................................ 2174 Chengqing Yuan, Reliability Engineering Institute, Wuhan University of Technology, China Xinping Yan, Reliability Engineering Institute, Wuhan University of Technology, China Zhixiong Li, Reliability Engineering Institute, Wuhan University of Technology, China Yuelei Zhang, Reliability Engineering Institute, Wuhan University of Technology, China Chenxing Sheng, Reliability Engineering Institute, Wuhan University of Technology, China Jiangbin Zhao, Reliability Engineering Institute, Wuhan University of Technology, China Chapter 112 Data Mining in the Investigation of Money Laundering and Terrorist Financing . .......................... 2193 Ibrahim George, Macquarie University, Australia Manolya Kavakli, Macquarie University, Australia
Chapter 113 A Hybrid Approach Based on Self-Organizing Neural Networks and the K-Nearest Neighbors Method to Study Molecular Similarity.............................................................................................. 2208 Abdelmalek Amine, Tahar Moulay University, Algeria Zakaria Elberrichi, Djillali Liabes University, Algeria Michel Simonet, Joseph Fourier University, France Ali Rahmouni, Tahar Moulay University, Algeria Chapter 114 Social Aid Fraud Detection System and Poverty Map Model Suggestion Based on Data Mining for Social Risk Mitigation....................................................................................................................... 2230 Ali Serhan Koyuncugil, Capital Markets Board of Turkey, Turkey Nermin Ozgulbas, Baskent University, Turkey Chapter 115 Designing an Early Warning System for Stock Market Crashes by Using ANFIS........................... 2250 Murat Acar, ISE Settlement and Custody Bank Inc., Turkey Dilek Karahoca, Bahcesehir University, Turkey Adem Karahoca, Bahcesehir University, Turkey Chapter 116 Advances of the Location Based Context-Aware Mobile Services in the Transport Sector ............ 2269 Georgios Patris, National Technical University of Athens, Greece Vassilios Vescoukis, National Technical University of Athens, Greece Maria Giaoutzi, National Technical University of Athens, Greece
Preface
The constantly changing landscape of Data Mining makes it challenging for experts and practitioners to stay informed of the field’s most up-to-date research. That is why Information Science Reference is pleased to offer this four-volume reference collection that will empower students, researchers, and academicians with a strong understanding of critical issues within Data Mining by providing both broad and detailed perspectives on cutting-edge theories and developments. This reference is designed to act as a single reference source on conceptual, methodological, technical, and managerial issues, as well as provide insight into emerging trends and future opportunities within the discipline.

Data Mining: Concepts, Methodologies, Tools and Applications is organized into eight distinct sections that provide comprehensive coverage of important topics. The sections are: (1) Fundamental Concepts and Theories, (2) Development and Design Methodologies, (3) Tools and Technologies, (4) Utilization and Application, (5) Organizational and Social Implications, (6) Managerial Impact, (7) Critical Issues, and (8) Emerging Trends. The following paragraphs provide a summary of what to expect from this invaluable reference tool.

Section 1, Fundamental Concepts and Theories, serves as a foundation for this extensive reference tool by addressing crucial theories essential to the understanding of Data Mining. Introducing the book is “A Study of XML Models for Data Mining” by Richi Nayak, Sangeetha Kutty, and Tien Tran, a great foundation laying the groundwork for the basic concepts and theories that will be discussed throughout the rest of the book. Another chapter of note in Section 1 is titled “Learning Different Concept Hierarchies and the Relations between them from Classified Data” by Fernando Benites and Elena Sapozhnikova, which discusses concept hierarchies and classified data. Section 1 concludes, and leads into the following portion of the book, with a nice segue chapter, “Measuring Human Intelligence by Applying Soft Computing Techniques,” by Priti Srinivas Sajja and Kunjal Mankad. Where Section 1 leaves off with fundamental concepts, Section 2 discusses architectures and frameworks in place for Data Mining.

Section 2, Development and Design Methodologies, presents in-depth coverage of the conceptual design and architecture of Data Mining, focusing on aspects including soft computing, association rules, data processing, spatial data mining, missing data, rule mining, and many more topics. Opening the section is “Web Mining and Social Network Analysis” by Roberto Marmo. This section is vital for developers and practitioners who want to measure and track the progress of digital literacy on a local, national, or international level. Through case studies, this section lays excellent groundwork for later sections that will get into present and future applications for Data Mining, including, of note: “A Framework on Data Mining on Uncertain Data with Related Research Issues in Service Industry” by Edward Hung, and “A Highly Scalable and Adaptable Co-Learning Framework on Multimodal Data Mining in a Multimedia Database” by Zhen Guo, Christos Faloutsos, and Zhongfei (Mark) Zhang. The section concludes with an excellent work by Rahime Belen and Tugba Taskaya Temizel, titled “A Framework to Detect Disguised Missing Data.”
Section 3, Tools and Technologies, presents extensive coverage of the various tools and technologies used in the implementation of Data Mining. Section 3 begins where Section 2 left off, though this section describes more concrete tools at place in the modeling, planning, and applications of Data Mining. The first chapter, “XML Mining for Semantic Web,” by Rafael Berlanga and Victoria Nebot, lays a framework for the types of works that can be found in this section, a perfect resource for practitioners looking for new ways to mine data in the burgeoning Semantic Web. Section 3 is full of excellent chapters like this one, including such titles as “Database Analysis with ANNs by means of Graph Evolution,” “Data Mining System Execution Traces to Validate Distributed System Quality-of-Service Properties,” and “A Dynamic Privacy Manager for Compliance in Pervasive Computing,” to name a few. Where Section 3 described specific tools and technologies at the disposal of practitioners, Section 4 describes successes, failures, best practices, and different applications of the tools and frameworks discussed in previous sections.

Section 4, Utilization and Application, describes how the broad range of Data Mining efforts has been utilized and offers insight on and important lessons for their applications and impact. Section 4 includes the widest range of topics because it describes case studies, research, methodologies, frameworks, architectures, theory, analysis, and guides for implementation. Topics range from telemedicine and telehealth to DSS design, systems biology, and content-based image retrieval. The first chapter in the section is titled “Virtual Telemedicine and Virtual Telehealth,” which was written by Shazia Kareem and Imran Sarwar Bajwa. The breadth of topics covered in this section is also reflected in the diversity of its authors, from countries all over the globe, including Pakistan, the UK, Russia, Romania, China, Australia, Italy, Slovenia, Chile, the United States, and more. Section 4 concludes with an excellent view of a case study in technology implementation and use, “Systems Biology-Based Approaches Applied to Vaccine Development” by Patricio A. Manque.

Section 5, Organizational and Social Implications, includes chapters discussing the organizational and social impact of Data Mining. The section opens with “From Data to Knowledge” by Tri Wijaya. Where Section 4 focused on the broad, many applications of Data Mining technology, Section 5 focuses exclusively on how these technologies affect human lives, either through the way they interact with each other, or through how they affect behavioral/workplace situations. Other interesting chapters of note in Section 5 include “A Predictive Modeling of Retail Satisfaction” by M. Hemalatha, and “Behavioral Targeting Online Advertising” by Jun Yan, Dou Shen, Teresa Mah, Ning Liu, Zheng Chen, and Ying Li. Section 5 concludes with a fascinating study of a new development in Data Mining, in “Detecting Pharmaceutical Spam in Microblog Messages.”

Section 6, Managerial Impact, presents focused coverage of Data Mining as it relates to effective uses of complex data warehousing, multivariate data analysis, privacy compliance, and many more utilities. This section serves as a vital resource for developers who want to utilize the latest research to bolster the capabilities and functionalities of their processes.
The section begins with “Innovative Approaches for Efficiently Warehousing Complex Data from the Web,” a great look into how small colleges and companies can utilize benefits previously thought to be reserved for their larger competitors. The 14 chapters in this section offer unmistakable value to managers looking to implement new strategies that work at larger bureaucratic levels. The section concludes with “Data Mining and Economic Crime Risk Management” by Nadine Lybaert. Where Section 6 leaves off, Section 7 picks up with a focus on some of the more theoretical material of this compendium.

Section 7, Critical Issues, presents coverage of academic and research perspectives on Data Mining tools and applications. The section begins with “Sociocognitive Inquiry,” by Brian R. Gaines and Mildred L. G. Shaw. Other issues covered in detail in Section 7 include interoperability, developmental psychology, legal issues, privacy expectations, and much more.
The section concludes with “Tailoring FOS-ERP Packages” by Klaus Wölfel and Jean-Paul Smets, a great transitional chapter between Sections 7 and 8 because it examines an important question going into the future of the field. The last chapter manages to show a theoretical look into future and potential technologies, a topic covered in more detail in Section 8.

Section 8, Emerging Trends, highlights areas for future research within the field of Data Mining, opening with “Optimization of a Hybrid Methodology (CRISP-DM)” by José Nava and Paula Hernández. Section 8 contains chapters that look at what might happen in the coming years that can extend the already staggering number of applications for Data Mining. Other chapters of note include “Towards Spatial Decision Support System for Animals Traceability” and “An Approach for Land-Use Suitability Assessment Using Decision Support Systems, AHP, and GIS.” The final chapter of the book looks at an emerging field within Data Mining, in the excellent contribution, “Advances of the Location Based Context-Aware Mobile Services in the Transport Sector” by Maria Giaoutzi, Vassilios Vescoukis, and Georgios Patris.

Although the primary organization of the contents in this multi-volume work is based on its eight sections, offering a progression of coverage of the important concepts, methodologies, technologies, applications, social issues, and emerging trends, the reader can also identify specific contents by utilizing the extensive indexing system listed at the end of each volume. Furthermore, to ensure that the scholar, researcher, and educator have access to the entire contents of this multi-volume set, as well as additional coverage that could not be included in the print version of this publication, the publisher will provide unlimited multi-user electronic access to the online aggregated database of this collection for the life of the edition, free of charge when a library purchases a print copy. This aggregated database provides far more content than can be included in the print version, in addition to continual updates. This unlimited access, coupled with the continuous updates to the database, ensures that the most current research is accessible to knowledge seekers.

As a comprehensive collection of research on the latest findings related to using technology to provide various services, Data Mining: Concepts, Methodologies, Tools and Applications provides researchers, administrators, and all audiences with a complete understanding of the development of applications and concepts in Data Mining. Given the vast number of issues concerning usage, failure, success, policies, strategies, and applications of Data Mining in countries around the world, Data Mining: Concepts, Methodologies, Tools and Applications addresses the demand for a resource that encompasses the most pertinent research in technologies being employed to globally bolster the knowledge and applications of Data Mining.
Section 1
Fundamental Concepts and Theories
This section serves as a foundation for this exhaustive reference tool by addressing underlying principles essential to the understanding of Data Mining. Chapters found within these pages provide an excellent framework in which to position Data Mining within the field of information science and technology. Insight regarding the critical incorporation of global measures into Data Mining is addressed, while crucial stumbling blocks of this field are explored. With 15 chapters comprising this foundational section, the reader can learn and choose from a compendium of expert research on the elemental theories underscoring the Data Mining discipline.
Chapter 1
A Study of XML Models for Data Mining:
Representations, Methods, and Issues

Sangeetha Kutty, Queensland University of Technology, Australia
Richi Nayak, Queensland University of Technology, Australia
Tien Tran, Queensland University of Technology, Australia
ABSTRACT

With the increasing number of XML documents in varied domains, it has become essential to identify ways of finding interesting information from these documents. Data mining techniques can be used to derive this interesting information. However, mining of XML documents is impacted by the data model used in data representation, due to the semi-structured nature of these documents. In this chapter, we present an overview of the various models of XML document representation, how these models are used for mining, and some of the issues and challenges inherent in these models. In addition, this chapter also provides some insights into future data models of XML documents for effectively capturing their two important features, structure and content, for mining.
DOI: 10.4018/978-1-4666-2455-9.ch001

INTRODUCTION

Due to the increased popularity of XML in varied application domains, a large number of XML documents are found in both organizational intranets and the Internet. Some popular datasets are very large: the English Wikipedia contains 3.1 million
web documents in XML format with 1.74 billion words, and the ClueWeb dataset used in Text Retrieval Conference (TREC) tracks contains 503.9 million XML documents collected from the web in January and February 2009. In order to discover useful knowledge from these large amounts of XML documents, researchers have used data mining techniques (Nayak, 2005).
XML data mining techniques have gained a great deal of interest among researchers due to their potential to discover useful knowledge in diverse fields such as bioinformatics, telecommunication network analysis, community detection, information retrieval, and social network analysis (Nayak, 2008). Unlike structured data, where the structure is fixed because the data is stored in a structured format such as relational tables, XML data has flexibility in its structure, as users are allowed to use custom-defined tags to represent the data. An XML document contains tags, and the data is enclosed within those tags. A tag usually gives a meaningful name to the content it represents. Moreover, the tags present in a document are organised in a hierarchical order showing the relationships between elements of the document. The hierarchical ordering of tags in an XML document is usually called the document structure, and the data enclosed within these tags is called the document content. XML data can be modelled in various forms, namely vectors (or transactional data models), paths, trees and graphs, based on its structure and/or content. The focus of this chapter is to present an overview of the various models that can be used to represent XML documents for the process of mining. This chapter also addresses some of the issues and challenges associated with each of these models.
The organisation of this chapter is as follows. The next section explains the various XML data models in detail. The third section discusses the roles of these models in diverse mining tasks such as frequent pattern mining, association rule mining, clustering and classification. The chapter then details the issues and challenges in using these models for mining. It concludes with the needs and opportunities for new models for mining XML documents.
DATA MODELS FOR XML DOCUMENT MINING

To suit the objectives and the needs of XML mining algorithms, XML data has been represented in various forms. Figure 1 gives a taxonomy of XML data showing the various data models that facilitate XML mining using the different features that exist in XML data. There are two types of XML data: XML documents and XML schema definitions. An XML schema definition contains the structure and data definitions of XML documents (Abiteboul, Buneman, & Suciu, 2000). An XML document, on the other hand, is an instance of the XML schema that contains the data content represented in a structured format.
Figure 1. Data models facilitating mining of XML data
The provision of the XML schema definition with XML documents makes it different from other types of semi-structured data such as HTML and BibTeX. The schema imposes restrictions on the syntax and structure of XML documents. The two most popular XML document schema languages are Document Type Definition (DTD) and XML-Schema Definition (XSD). Figures 2 and 3 show DTD and XSD examples respectively. An example of a simple XML document conforming to the schemas from Figures 2 and 3 is shown in Figure 4. An XML document can belong to one of the following categories, ill-formed, well-formed, or valid, according to how it abides by the XML syntax rules and its schema definition. An ill-formed document does not have a fixed structure, meaning that it does not conform to the XML syntax rules; for example, it may lack an XML declaration statement or contain more than one root element. A well-formed document conforms to the XML syntax rules; it may have a document schema, but the document need not conform to it. It contains exactly one root element, and subelements are properly nested within each other. Finally, a valid XML document is a well-formed document which conforms to a specified XML schema definition (Stuart, 2004). The document shown in Figure 4 is an example of a valid document.
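To make these three categories concrete, the following minimal sketch (not from the chapter) uses Python's standard library to test well-formedness; the sample strings are illustrative, and checking full validity would additionally require a schema-aware parser (for instance, lxml's DTD support), which is not shown here.

```python
# Well-formedness check: an ill-formed document fails to parse,
# a well-formed one parses; validity would need a schema check on top.
import xml.etree.ElementTree as ET

ill_formed = "<conf><title>SIAM Data Mining Conference</conf>"    # tags not properly nested
well_formed = "<conf id='SIAM10'><title>SIAM Data Mining Conference</title></conf>"

def is_well_formed(doc: str) -> bool:
    try:
        ET.fromstring(doc)           # parsing succeeds only for well-formed XML
        return True
    except ET.ParseError:
        return False

print(is_well_formed(ill_formed))    # False
print(is_well_formed(well_formed))   # True
```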
Figure 2. An example of a DTD schema (conf.dtd)
Each XML document can be divided into two parts, namely markup constructs and content. A markup construct consists of the characters that are marked up using “<” and “>”; the content is the set of characters that is not included in markup. There are two types of markup constructs: tags (or elements) and attributes. Tags are the markup constructs delimited by “<” and “>”, such as conf, title, year, and editor in the conf.xml document in Figure 4. The attributes, on the other hand, are markup constructs consisting of a name/value pair that exists within a start tag. In the running example, id is the attribute and “SIAM10” is its value. Examples of content are “SIAM Data Mining Conference”, “2010”, and “Bing Liu”. Having discussed the preliminaries of XML, let us now look at the various models, namely the vector space (or transactional) data models, path, tree and graph models for XML data, in the following subsections.
Transactional Data Model or Vector Space Model

A way to represent an XML dataset is using the Transactional Data Model (TDM) or Vector Space Model (VSM), in which each document in the dataset is represented as a transaction or a vector. The TDM or VSM of an XML document can be built from either its tags or its content. Firstly, let us look at converting the tags of an XML document into a transactional data model or bag of words (as used in VSM). The XML document is first parsed using a SAX (Simple API for XML) parser, and the resulting information is pre-processed and then transformed into a TDM with the XML tags as the items in the transaction, along with their number of occurrences, as shown in Figure 5. One problem with applying mining techniques to the TDM is that this model does not preserve the hierarchical relationships between the tags. Hence, the data mining methods applied to transactional models may not provide accurate results. A simple example to illustrate this problem is given in Figure 6.
Figure 3. An example of an XSD schema (conf.xsd)
Figure 4. An example of XML document (conf.xml)
Figure 5. Transactional data model generated from the structure of XML document given in Figure 4
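As a concrete illustration of this tag-based TDM construction, the following minimal Python sketch counts tag occurrences with a SAX parser. The conf.xml string here is a hypothetical reconstruction based only on the tags and values quoted in the text (conf, title, year, editor, and the id attribute), so it may differ from the actual document in Figure 4.

```python
# Build a tag-based transactional data model (TDM) with a SAX parser:
# each tag becomes an item, counted by its number of occurrences.
import xml.sax
from collections import Counter

CONF_XML = """<conf id="SIAM10">
  <title>SIAM Data Mining Conference</title>
  <year>2010</year>
  <editor>Bing Liu</editor>
</conf>"""

class TagCounter(xml.sax.ContentHandler):
    def __init__(self):
        self.counts = Counter()
    def startElement(self, name, attrs):
        self.counts[name] += 1       # each start tag is one item occurrence

handler = TagCounter()
xml.sax.parseString(CONF_XML.encode("utf-8"), handler)
print(dict(handler.counts))          # {'conf': 1, 'title': 1, 'year': 1, 'editor': 1}
```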
The fragment “boat” appearing in Figure 6(a) and the fragment “boat building” appearing in Figure 6(b) show that they both have the same tag; however, the former refers to a vessel and the latter to an occupation. Application of a frequent pattern mining technique on this document set would yield this tag as a frequent structure result; however, it is not a frequent structure, as its parents are different. This is because hierarchical relationships are not considered while modelling the XML data as a transactional model.
Figure 6. (a) XML document A, and (b) XML document B

Now let us consider modelling the content of the XML document, which is often modelled as transactional data in a manner similar to the tag representation.
Figure 7. Transactional data model generated from the content of the XML document given in Figure 4
Some of the common pre-processing techniques, such as stop-word removal and stemming, are applied on the content to identify the unique words to be represented in the TDM. Each unique word in the content of the XML document corresponds to an item, and the document is considered as a transaction. An example is illustrated in Figure 7 using the XML document in Figure 4: words such as “on”, “for”, “a”, “of” and many others are removed as stop words, and the remaining words are stemmed. The words do not include the tag names in the XML document; therefore, it is not clear whether the name “bing” refers to an author or an editor of the paper. This may result in imprecise knowledge discovery. Thus, it is essential to include not only the hierarchical structural relationships among the data items but also the content while mining XML documents.
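The sketch below illustrates this content-based preprocessing on the same reconstructed conf.xml; the stop-word list and the suffix-stripping stemmer are toy stand-ins for real resources such as a full stop-word lexicon and the Porter stemmer, so the exact output of Figure 7 will differ.

```python
# Content-based TDM: extract text, drop stop words, stem, collect items.
import re
import xml.etree.ElementTree as ET

CONF_XML = "<conf id='SIAM10'><title>SIAM Data Mining Conference</title>" \
           "<year>2010</year><editor>Bing Liu</editor></conf>"
STOP_WORDS = {"on", "for", "a", "of", "the", "and"}   # illustrative subset only

def stem(word: str) -> str:
    # crude stand-in for a real stemmer: strip a few common suffixes
    for suffix in ("ing", "ence", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

root = ET.fromstring(CONF_XML)
words = [w.lower() for text in root.itertext() for w in re.findall(r"\w+", text)]
transaction = {stem(w) for w in words if w not in STOP_WORDS}
print(transaction)   # e.g. {'siam', 'data', 'min', 'confer', '2010', 'bing', 'liu'}
```

Note that the item “bing” survives with no indication of whether it came from an editor or an author element, which is exactly the ambiguity described above.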
Paths

A path is an expression that contains edges in sequential order from the root node labelled r to a node labelled m in an XML document: (nr, n1), (n1, n2), …, (nm−1, nm), where nr is the root node and nm is the end node of the path. The length of the path is the number of nodes in the path. A path can be either a partial path or a complete path. A partial path contains the edges in sequential order from the root node to an internal node in the document; a complete path (or unique path) contains the edges in sequential order from the root node to a leaf node. A leaf node is a node that encloses the content or text. Figures 8(a) and 8(b) show examples of a complete path model and a partial path model, respectively, for the structure of the XML document using its tags. A complete path can have more than one partial path, with varying lengths.
As shown in Figure 8(c), paths can be used to model both the structure and the content by considering the leaf node as the text node. However, this kind of model results in repeated paths for different text nodes. For instance, if there were another editor, “Malcom Turn”, for this conference proceedings, then the path with the text node for this editor would be the same as the path for the editor “Bing Liu”; the only difference would be in the text node. Hence, to reduce the redundancy in the structure and to capture sibling relationships, XML documents are commonly modelled as trees.
Figure 8. (a) A complete path, (b) a partial path, and (c) a full path with text node model for our running example
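A minimal sketch of extracting complete paths (root to leaf) and partial paths (root to internal node) from the tag structure, again using the hypothetical reconstruction of conf.xml rather than the actual figure, might look as follows.

```python
# Enumerate partial paths (root to internal node) and complete paths
# (root to leaf node) over the element tags of an XML document.
import xml.etree.ElementTree as ET

CONF_XML = "<conf><title>SIAM Data Mining Conference</title>" \
           "<year>2010</year><editor>Bing Liu</editor></conf>"

def paths(elem, prefix=()):
    current = prefix + (elem.tag,)
    children = list(elem)
    if not children:                  # leaf element: yields a complete path
        yield ("complete", "/".join(current))
    else:                             # internal element: yields a partial path
        yield ("partial", "/".join(current))
        for child in children:
            yield from paths(child, current)

for kind, p in paths(ET.fromstring(CONF_XML)):
    print(kind, p)
# partial  conf
# complete conf/title
# complete conf/year
# complete conf/editor
```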
Trees

Often XML data occurs naturally as trees or can easily be converted into trees using a node-splitting methodology (Anderson, 2000). Figure 9 shows the tree model of the XML document depicted in Figure 4 using only its structure, obtained by parsing the XML document with a Document Object Model (DOM) parser (Hégaret, Wood, & Robie, 2000). A tree is denoted as T = (V, E, f), where (1) V is the set of vertices or nodes; (2) E is the set of edges in the tree T; and (3) f is a mapping function f: E → V × V. Most XML documents are rooted labelled trees in which the vertices represent tags, the edges represent element-subelement or element-attribute relationships, and the leaf vertices represent the contents or instances within the tags. Trees are divided into different types based on the existence of a root vertex and the ordering among their vertices.
Some of the common types of trees are free trees, rooted trees, labelled or unlabelled trees, directed or undirected trees, and connected or disconnected trees. A tree is free if its edges have no direction, i.e., it is an undirected graph; the tree therefore has no predefined root. Figure 10 illustrates examples of rooted, labelled and directed trees. A rooted tree has the form T = (V, v0, E, f), where v0 is the root vertex, which does not have any edges entering it. In the conf.xml document, the root vertex is conf. The labelled tree in Figure 10(b) can be denoted by (V, E, f, Σ, L), since it contains an alphabet Σ which represents the vertex labels (P, Q, R, S, T, U) and the edge labels (A, B, C, D, E), with a labelling function L to assign the labels to the vertices and edges. A tree is directed or undirected according to whether or not it indicates the ordering in which the vertices are connected to each other by the edges. If all the vertices are connected by at least one edge then the tree is connected; otherwise it is disconnected. There are also two canonical representations that can be used for these trees for mining, namely pre-order string encoding and level-wise encoding.
Figure 9. Tree representation of the XML document shown in Figure 4 using only its structure
7
A Study of XML Models for Data Mining
Figure 10. (a) A rooted tree, (b) a labelled tree, and (c) a directed tree
representation of the schema as shown in Figure 2 is given in Figure 12. It can be noted from the graph representation that there is a cyclic reference to the element paper from the element reference.
Other Models
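As a concrete illustration (not from the chapter), the sketch below encodes a nested-tuple version of the labelled tree implied by these strings: P has children Q and R, Q has S and T, and R has U. Conventions differ on trailing backtrack markers, so the pre-order output carries closing $ signs that the chapter's abbreviated string omits.

```python
# A minimal sketch of the two canonical encodings for rooted labelled trees.
# A node is a (label, children) tuple; the tree shape is inferred from the
# chapter's example strings and is illustrative.

def preorder_encoding(node):
    """Depth-first walk; '$' marks backtracking after each subtree."""
    label, children = node
    return label + ''.join(preorder_encoding(c) for c in children) + '$'

def levelwise_encoding(root):
    """Breadth-first walk; '$' marks the end of each level."""
    out, level = [], [root]
    while level:
        out.extend(label for label, _ in level)
        out.append('$')
        level = [c for _, children in level for c in children]
    return ''.join(out)

tree = ('P', [('Q', [('S', []), ('T', [])]),
              ('R', [('U', [])])])

print(preorder_encoding(tree))   # PQS$T$$RU$$$ (chapter abbreviates to PQS$T$$RU$)
print(levelwise_encoding(tree))  # P$QR$STU$
```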
A tree model can successfully model an XML document; however, it cannot be used to model the relationships in an XML schema, due to the presence of cyclic relationships between the elements in a schema. Hence, graph models have also been used to represent XML data.

Graphs

A graph can be defined as a triple G = (V, E, f), where V is the set of vertices and E is the set of edges, with a mapping function f: E → V × V. The vertices are the elements in XML documents, and the edge set E corresponds to the links that connect the vertices, representing parent-child relationships. There are different types of graphs, similar to trees, except that cycles also need to be considered. A cyclic graph is one in which the first and last vertices in a path are the same, as shown in Figure 11, which has vertices S and U connected back to the first vertex P. On the other hand, an acyclic graph is a tree. If all the vertices are connected by at least one edge then it is a connected graph, otherwise unconnected. Graph models are often used to represent the schema of an XML document rather than the XML document itself, due to the presence of cyclic relationships in a schema. The labelled graph representation of the schema shown in Figure 2 is given in Figure 12. It can be noted from the graph representation that there is a cyclic reference to the element paper from the element reference.

Other Models
Apart from the vector, path, tree and graph models mentioned above, other types of models, such as time-series models (Flesca, Manco, Masciari, Pontieri, & Pugliese, 2005) and multi-dimensional models, have been used to model XML documents for mining. In the time-series models, the structure of each XML document is represented as a discrete-time signal in which numeric values summarise relevant features of the elements enclosed within the document. The Discrete Fourier Transformation Theory (DFTT) is applied to compare the encoded documents, which are represented as signals, for mining purposes. Another model, which addresses both the structure and the content of XML documents, is BitCube (Yoon, Raghavan, Chakilam, & Kerschberg, 2001). The BitCube model has been used to cluster and query XML documents. Paths, words and document ids represent the three dimensions of the BitCube, where each entry in the cube records the presence or absence of a given word in the given path of a given document.
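As a toy sketch of the BitCube idea (all document ids, paths and words below are illustrative assumptions), the model is simply a three-dimensional bit array indexed by document, path and word.

```python
# Hedged sketch of a BitCube: cube[d, p, w] = 1 iff word w occurs under
# path p in document d. Contents are illustrative, not the chapter's data.
import numpy as np

docs  = ['1.xml', '2.xml']
paths = ['/conf/title', '/conf/editor']
words = ['data', 'mining', 'liu']

cube = np.zeros((len(docs), len(paths), len(words)), dtype=np.uint8)
cube[0, 0, words.index('data')] = 1   # 'data' under /conf/title in 1.xml
cube[0, 1, words.index('liu')] = 1    # 'liu' under /conf/editor in 1.xml

# e.g. which documents contain 'data' under any path:
print([docs[d] for d in range(len(docs))
       if cube[d, :, words.index('data')].any()])
```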
Figure 11. A connected, directed, cyclic, labelled graph
Figure 12. Graph representation of conf.dtd
DATA MINING TASKS USING THE VARIOUS MODELS

There are various types of data mining tasks, namely frequent pattern mining, association rule mining, classification and clustering, that can be applied to XML datasets using the different types of data models described in the preceding section. This section details how each of these models can be used for the various data mining tasks.
Frequent Pattern Mining

XML frequent pattern mining is a well-researched area (Paik, Shin, & Kim, 2005; Termier et al., 2005; Win & Hla, 2005; Zhang, Liu, & Zhang, 2004). Frequent pattern mining on XML documents involves mining the structures as well as the content (Nayak, 2005). The element tags and their nesting dictate the structure of an XML document (Abiteboul et al., 2000). Identifying frequently occurring structures in the dataset is the essence of XML structure mining. On the other hand, the application of frequent pattern mining to the content, which is the text enclosed within the tags, contributes to content mining.
Using Transactional Data Model (TDM)

Inspired by frequent itemset mining, the tags/content are represented in a transactional data format to identify frequent tag/content patterns from XML databases (Gu, Hwang, & Ryu, 2005). As shown in Figures 5 and 7, the tags or content of an XML document are modelled as items in a transaction, and a standard frequent pattern mining process can be applied to an XML dataset represented as a TDM. The TDM considers the XML dataset as a fixed structured dataset. The process begins with a complete scan of the transactional data to identify the frequent tags/content of length 1 (1-length patterns), which are tested for support greater than the user-defined support threshold (min_supp). Then the 1-length frequent tags/content are combined to form 2-length candidate tags/content
in order to verify whether they are frequent or not. The process of forming n-length candidate and frequent tags/content is repeated until no more frequent combinations can be found. This type of frequent pattern generation is referred to as generate-and-test (Agrawal, Mannila, Srikant, Toivonen, & Verkamo, 1996), as the (k+1)-length candidates are generated by joining the frequent k-length patterns and then tested to determine whether they are frequent. However, this technique incurs a huge overhead for dense datasets, in which there exists a very large number of candidates and testing each of them is expensive. To overcome this disadvantage, the pattern-growth approach, as used in the FP-growth algorithm (Han, Pei, & Yin, 2000), was proposed; it adopts the "divide-and-conquer" heuristic to recursively partition the dataset based on the frequent patterns generated and then mines each of the partitions for frequent patterns. This technique is less expensive, as it scans only the projections of the dataset based on the frequent patterns rather than the entire dataset. Also, it generates 1-length candidates for frequent patterns only from the projections, and hence counting their support is simple. Despite these benefits, this type of method is memory-intensive compared to Apriori-based algorithms, due to the need for the projections to reside in memory. Applying frequent pattern mining to TDMs of XML documents faces a serious problem: in order to represent a transaction, the TDM ignores the hierarchical relationships between the tags and the content of the XML document.
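A minimal generate-and-test sketch over tags modelled as transactions follows; the transactions and threshold are illustrative, and Apriori's candidate pruning is omitted for brevity.

```python
# Hedged sketch of generate-and-test frequent pattern mining over an XML
# dataset modelled as a TDM (each transaction is a set of tags).
transactions = [
    {'conf', 'title', 'editor'},
    {'conf', 'title', 'paper'},
    {'conf', 'editor', 'paper'},
]
min_supp = 2  # absolute support threshold for this toy example

def supp(itemset):
    return sum(1 for t in transactions if itemset <= t)

# 1-length frequent tags
frequent = [{frozenset([i]) for t in transactions for i in t
             if supp(frozenset([i])) >= min_supp}]

# join k-length frequent patterns into (k+1)-length candidates, then test
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                  if len(a | b) == len(a) + 1}
    frequent.append({c for c in candidates if supp(c) >= min_supp})

for level in frequent:
    for pattern in level:
        print(sorted(pattern), supp(pattern))
```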
Using Paths

The ability of a path model to capture the hierarchical relationships between tags has facilitated the use of this model for frequent pattern mining. Often the path model is used to represent only the structure of the XML document. Given a collection of paths extracted from a dataset D, the
problem is to find a partial path p such that freq(p) ≥ min_supp, where freq(p) is the percentage of paths in D that contain p. In this data model, every XML document in the dataset is represented as a bag of paths, with each path corresponding to an item in the transactional data format. Similar to frequent tag/content pattern mining, a first scan of the dataset identifies the frequent 1-length paths, each of which is just a node. The frequent 1-length paths are then combined to form candidate 2-length paths, and testing is carried out to verify how often these candidate paths occur in the dataset; if they occur at least min_supp times, the paths are considered frequent. The difference between frequent pattern mining using the TDM and using paths is that, while checking the support of the paths, the hierarchical structure is also verified against the dataset, which is not the case for the TDM. This technique is much more suitable for partial paths than for complete paths, as the frequency of complete paths is often very low and hence there might not be sufficient frequent paths to output. Often a large set of partial paths is generated, especially for lower support thresholds or for dense datasets. To reduce the number of common partial paths, a second threshold called the maximum support threshold (max_supp) has been introduced to avoid the generation of very common partial paths, as these very common subpaths do not provide any interesting or new knowledge (Aggarwal, Ta, Wang, Feng, & Zaki, 2007).
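A small sketch of this process, assuming each document has already been reduced to a bag of complete paths (illustrative data), reports partial paths whose frequency lies between min_supp and max_supp.

```python
# Hedged sketch of frequent partial-path mining with both thresholds.
bags = [
    {'/conf', '/conf/title', '/conf/editor'},
    {'/conf', '/conf/title', '/conf/paper'},
    {'/conf', '/conf/paper', '/conf/paper/title'},
]

def prefixes(path):
    """All partial paths (prefixes) of a complete path, including itself."""
    parts = path.strip('/').split('/')
    return {'/' + '/'.join(parts[:i]) for i in range(1, len(parts) + 1)}

min_supp, max_supp = 0.5, 0.9   # fractions of the collection

all_partial = {p for bag in bags for full in bag for p in prefixes(full)}
for p in sorted(all_partial):
    freq = sum(1 for bag in bags
               if any(p in prefixes(f) for f in bag)) / len(bags)
    if min_supp <= freq <= max_supp:
        print(p, freq)          # '/conf' (freq 1.0) is filtered by max_supp
```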
Using Trees

As XML documents are often modelled as trees, several frequent tree mining algorithms have been developed. These frequent tree mining techniques can be divided into two broad categories based on their candidate generation, namely the generate-and-test and the pattern-growth techniques. Generate-and-test generates the candidates and tests them for support. Pattern-growth, on the other hand, generates subtrees by partitioning the dataset based on the
frequent subtrees generated in the previous iteration. The former technique requires less memory, as the patterns are generated on-the-fly, but is not suitable for datasets with a high branching factor (more than 100 branches out of a given node). The latter technique is memory-intensive; as a result, several algorithms have been developed to efficiently store the dataset and partition it. In frequent tree mining of XML data, it has been noted that often the entire tree will not be frequent; rather, there is a good possibility that parts of the tree will be frequent. The parts of such trees are referred to as subtrees. There are different notions of subtrees, and we discuss some of them below.

Induced Subtree

For a tree T with edge set E and vertex set V, a tree T′ with vertex set V′ and edge set E′ is an induced subtree of T if and only if (1) V′ ⊆ V; (2) E′ ⊆ E; (3) the labelling of the vertices of V′ in T′ is preserved in T; (4) (v1, v2) ∈ E, where v1 is the parent of v2 in T′, if and only if v1 is the parent of v2 in T; and (5) for v1, v2 ∈ V′, preorder(v1) < preorder(v2) in T′ if and only if preorder(v1) < preorder(v2) in T. In other words, an induced subtree T′ preserves the parent-child relationships among the vertices of the tree T (Kutty, Nayak, & Li, 2007).

Embedded Subtree

For a tree T with edge set E and vertex set V, a tree T′ with vertex set V′ and edge set E′ is an embedded subtree of T if and only if (1) V′ ⊆ V; (2) E′ ⊆ E; and (3) the labelling of the vertices of V′ and E′ in T′ is preserved in T. In simpler terms, an embedded subtree T′ is a subtree which preserves the ancestor-descendant relationships among the vertices of the tree T. Figure 13 shows the induced and embedded subtrees generated from a tree. As can be seen, Figure 13(b) preserves the parent-child relationship, while in Figure 13(c) the ancestor-descendant relationship is preserved (Kutty et al., 2007).
Often, the number of frequent subtrees generated from XML documents is too large, and deriving useful and interesting knowledge from these patterns becomes a challenge (Chi, Nijssen, Muntz, & Kok, 2005). In order to control the number of frequent subtrees, two popular concise representations were proposed, namely closed and maximal. These concise representations not only reduce the redundancy of keeping all the frequent subtrees but also do not suffer much information loss from the reduction. Given an XML dataset D modelled as trees and a user-defined support threshold min_supp, a subtree T′ is said to be closed if there exists no superset of T′ with the same support as that of T′; a subtree T′ is said to be maximal if there exists no superset of T′ that is frequent. They can be formally defined as follows. In a given tree dataset D = {T1, …, Tn}, let there exist two frequent subtrees T′ and T′′: (1) T′′ is said to be maximal if and only if, for every T′ ⊇ T′′, supp(T′) ≤ supp(T′′); and (2) T′′ is said to be closed if and only if, for every T′ ⊇ T′′, supp(T′) = supp(T′′). Based on these definitions, it is clear that M ≤ C ≤ F, where M, C, and F denote the numbers of maximal frequent subtrees, closed frequent subtrees, and frequent subtrees, respectively. The benefit of generating closed or maximal frequent subtrees is two-fold. Firstly, these concise representations result in a reduced number of subtrees and hence ease the analysis of the subtrees with no information loss. Secondly, a subset of the frequent subtrees is identified and candidates are generated only from it, which results in improved performance compared to mining all the frequent subtrees (Kutty et al., 2007).

Figure 13. (a) A tree, (b) an induced subtree, and (c) an embedded subtree

According to the various tree and subtree representations, several tree mining algorithms have been developed; Table 1 provides an outline of such algorithms. The rooted unordered or ordered tree mining algorithms can easily be applied for mining XML documents, and the free tree algorithms for mining XML schemas or elements that exhibit cyclic relationships. Chi et al. (2005) provide a detailed overview of most of these algorithms, to which the interested reader is referred.
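The closed and maximal definitions apply to any pattern language ordered by containment; the sketch below illustrates the filtering step with itemsets standing in for subtrees (for trees, the containment test would be a costly subtree-inclusion check). Supports are illustrative.

```python
# Hedged sketch: filtering closed and maximal patterns from a table of
# frequent patterns and their supports.
frequent = {                       # pattern -> support (illustrative)
    frozenset('A'): 5, frozenset('B'): 4,
    frozenset('AB'): 4, frozenset('ABC'): 2,
}

def contains(big, small):
    return small < big             # proper-superset test for itemsets

# closed: no superset with the same support
closed = {p for p, s in frequent.items()
          if not any(contains(q, p) and frequent[q] == s for q in frequent)}
# maximal: no frequent superset at all
maximal = {p for p in frequent
           if not any(contains(q, p) for q in frequent)}

assert maximal <= closed <= set(frequent)   # M <= C <= F
print(len(maximal), len(closed), len(frequent))
```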
Using Graphs

Frequent graph mining is more commonly applied to XML schema datasets. This type of graph mining can also be applied to XML datasets in which various documents are linked to each other or a given document has elements with cyclic relationships. An example of such a dataset is the INEX (Initiative for the Evaluation of XML Retrieval) 2009 Wikipedia dataset1, in which categories have cyclic relationships and documents are connected via various elements appearing in them.
The frequent graph mining problem can be defined as follows: given a graph dataset D, find a subgraph g such that freq(g) ≥ min_supp, where freq(g) is the percentage of graphs in D that contain g. The basis of a graph mining algorithm is the check for subgraph isomorphism, i.e., deciding whether there is a subgraph of one graph that is isomorphic to another graph. Two graphs are isomorphic if there is a one-to-one correspondence between their vertices and there is an edge between two vertices of one graph if and only if there is an edge between the two corresponding vertices in the other graph. Determining subgraph isomorphism is generally expensive. Let us now analyse the cost of finding frequent subgraphs to understand why graph mining is expensive. The cost is proportional to three factors:

Cost ∝ Σ × D × G_iso

where Σ represents the number of candidates, D represents the size of the data, and G_iso indicates the cost of subgraph isomorphism checking (Garey & Johnson, 1979). Due to the presence of cyclic relationships in graphs, the testing of candidate subgraphs is expensive.
Table 1. Classification of frequent tree mining algorithms for various representations

Models | Types | Algorithms
Tree representation | Free Tree | FreeTreeMiner (Chi, Yang, & Muntz, 2003)
Tree representation | Rooted Unordered Tree | uFreqt (Nijssen & Kok, 2003), Unot (Asai, Arimura, Uno, & Nakano, 2003), HybridTreeMiner (Chi, Yang, & Muntz, 2004)
Tree representation | Rooted Ordered Tree | TreeMiner (Zaki, 2005), FREQT (Asai et al., 2002)
Subtree representation | Induced Subtree | FREQT (Asai et al., 2002), uFreqt (Nijssen & Kok, 2003), HybridTreeMiner (Chi, Yang, & Muntz, 2004), Unot (Asai et al., 2003)
Subtree representation | Embedded Subtree | TreeMinerV (Zaki, 2005)
Canonical representation | Pre-order string encoding | TreeMinerV (Zaki, 2005)
Canonical representation | Level-wise encoding | HybridTreeMiner (Chi, Yang, & Muntz, 2004)
Concise representations | Closed | PCITMiner (Kutty et al., 2007), PCETMiner (Kutty, Nayak, & Li, 2009b), CMTreeMiner (Chi, Yang, Xia, & Muntz, 2004)
Concise representations | Maximal | CMTreeMiner (Chi, Yang, Xia, & Muntz, 2004), PathJoin (Xiao, Yao, Li, & Dunham, 2003)
Due to the huge number of candidate checks required, it takes exponential time to identify frequent subgraphs, and the problem becomes computationally expensive for larger graphs with many nodes and for large datasets. A myriad of frequent graph mining techniques have been developed. There also exist improved techniques, such as Biased Apriori-based Graph Mining (B-AGM) (Inokuchi, Washio, & Motoda, 2005) and CI-GBI, which can provide results in an acceptable time by introducing a bias for generating only specific graphs and by using greedy algorithms, respectively. Table 2 presents some of the graph mining algorithms with a focus on the underlying data models. These techniques can be classified according to various factors, such as the types of graphs and the concise representations, similar to the frequent tree mining techniques. Also, based on the approach used for solving the subgraph isomorphism problem, graph mining algorithms can be classified into Apriori-based Graph Mining (AGM) and Graph-Based Induction (GBI) (Motoda, 2007). The AGM family searches the space of all possible subgraphs efficiently by devising a good data structure using an adjacency matrix with appropriate indexing. On the contrary, GBI avoids exhaustive search using a greedy algorithm which recursively chunks two adjoining nodes, thus generating fairly large subgraphs at an early stage of the search. In the same line as frequent tree mining, frequent graph mining techniques also employ concise representations, such as closed and maximal graph patterns, in order to reduce the number of generated frequent graph patterns. A frequent graph G is closed if there exists no supergraph of G that has the same support as G, and maximal if there exists no supergraph of G that is frequent. The CloseGraph method (Xifeng & Jiawei, 2003) employs a pattern-growth approach and terminates the growth of a graph G if, in every graph of the dataset where G occurs, a supergraph G′ of G also occurs, since the patterns grown from G would then be included among those grown from G′. In spite of the advances in graph mining algorithms, these methods have often been criticized for the difficulty of interpreting their results, because of the massive number of patterns produced. Such a large pattern set overwhelms users in utilising the discovered knowledge for further tasks such as clustering, classification, and indexing (Chen, Lin, Yan, & Han, 2008).
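To make the expensive step concrete, here is a hedged sketch of the freq(g) test using the VF2-based matcher from the third-party networkx library; this naive per-graph isomorphism check is exactly the operation the algorithms above try to minimise, and the labels and graphs are illustrative.

```python
# Hedged sketch: support counting for a candidate subgraph g over a dataset
# of labelled graphs, via networkx's VF2 subgraph-isomorphism matcher.
import networkx as nx
from networkx.algorithms import isomorphism

def freq(g, dataset):
    """Fraction of graphs in `dataset` that contain g as a subgraph."""
    node_match = isomorphism.categorical_node_match('label', None)
    hits = sum(
        1 for G in dataset
        if isomorphism.GraphMatcher(G, g, node_match=node_match)
                      .subgraph_is_isomorphic()
    )
    return hits / len(dataset)

# Illustrative check: a single labelled edge A-B against one data graph.
g = nx.Graph()
g.add_edge(1, 2)
nx.set_node_attributes(g, {1: 'A', 2: 'B'}, 'label')

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3)])
nx.set_node_attributes(G, {1: 'A', 2: 'B', 3: 'A'}, 'label')

print(freq(g, [G]) >= 0.5)   # True: the A-B edge occurs in G
```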
Table 2. Frequent graph mining algorithms

Models | Types | Algorithms
Type of graphs | Unconnected graphs | UGM (Unconnected Graph Mining) (Skonieczny, 2009)
Type of graphs | Connected graphs | FSG (Deshpande, Kuramochi, & Karypis, 2003), AcGM (Apriori-based connected Graph Mining algorithm) (Inokuchi, Washio, Nishimura, & Motoda, 2002), MFC (Maximal Frequent Connected Graphs)
Concise representations | Maximal and closed graphs | CloseGraph (Xifeng & Jiawei, 2003), MFC (Maximal Frequent Connected Graphs), MARGIN (Thomas, Valluri, & Karlapalem, 2006), SPIN (Jun et al., 2004)
Approaches | AGM | AcGM (Inokuchi, Washio, Nishimura, & Motoda, 2002), B-AGM (Inokuchi et al., 2005)
Approaches | GBI | CI-GBI (Nguyen, Ohara, Motoda, & Washio, 2005), DT-GBI (Geamsakul et al., 2004)
Association Rule Mining

Association rule mining for XML documents derives its main motivation from market-basket analysis. The two steps in determining association rules are:

1. Finding frequent patterns based on min_supp.
2. Using these frequent patterns and applying a user-defined minimum confidence (min_conf) constraint to form rules.

We discussed the process and algorithms of frequent pattern mining using min_supp in the previous section; we now discuss the second step. The support values for the frequent patterns are stored, and then their confidence (conf) values are calculated with the equation conf = supp(X ∪ Y)/supp(X), where X and Y denote items or structures in the dataset, according to which data model was used in the frequent mining process. If the conf value is greater than min_conf then X ⇒ Y, which implies that, for a given document, if the X items/structures occur then it is likely that the Y items/structures also occur.
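The second step follows directly from the confidence equation; the sketch below assumes the supports from step 1 are already available (illustrative values).

```python
# Hedged sketch of rule generation from mined supports.
from itertools import combinations

supports = {                        # illustrative supports from step 1
    frozenset({'data'}): 0.6,
    frozenset({'mining'}): 0.5,
    frozenset({'data', 'mining'}): 0.4,
}

def rules(supports, min_conf):
    for pattern, supp_xy in supports.items():
        if len(pattern) < 2:
            continue
        for r in range(1, len(pattern)):
            for x in combinations(pattern, r):
                x = frozenset(x)
                conf = supp_xy / supports[x]   # conf = supp(X U Y) / supp(X)
                if conf >= min_conf:
                    yield x, pattern - x, conf

for x, y, conf in rules(supports, min_conf=0.6):
    print(set(x), '=>', set(y), round(conf, 2))
```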
Using Transactional Data Model

Researchers have predominantly used the TDM for association rule mining, with some exceptions. Association rule mining for XML documents was initially proposed by Braga, Campi, Ceri, Klemettinen, and Lanzi (2002), using the XMINE rule operator to extract association rules from XML documents in a SQL-like format: the XML tags are mapped to a TDM, and association rules are then extracted. Wan and Dobbie (2004) used XQuery expressions to extract association rules from XML data. This technique has its limitations, in the sense that it does not take into account the structure of the XML data; for more complex XML data, transformations may be required before applying the XQuery expressions. Mapping XML tags to a TDM makes it easy to represent the data and allows for efficient management and analysis. Nevertheless, this model ignores the hierarchical structure while generating frequent patterns and hence can produce poor accuracy or incorrect results, as identified in the "craft" example provided in Figure 6.
Using Trees

Approaches using trees for association rule mining use either fragments of XML documents or tree summaries. Association rules based on fragments have implications of the form X ⇒ Y, where X and Y are disjoint fragments of XML documents. The tree summarisation of XML documents can use either graph-based summarised patterns or the frequent subtrees of the XML tree model (Mazuran, Quintarelli, & Tanca, 2009).
Clustering

Clustering is a process for grouping unlabelled data into smaller groups that share some commonality (Jain & Dubes, 1988). The generic clustering process involves three stages. The first stage is data modelling, which represents the input data using a common data model that can capture the semantic and/or structural information inherent in the input data. The second stage is similarity computation, which determines the most appropriate measure for computing the similarity between the models of two documents. The final stage is choosing an algorithm to group the input data into clusters. XML clustering is useful for many applications, such as data warehousing, information retrieval, data/schema integration and many more. The clustering of XML documents can be performed on their content, their structure, or on both the content and the structure. The following sections present some of the recent XML clustering approaches in relation to the various data models explained earlier.
Using Transactional Data Model or Vector Space Model

In clustering, the TDM is often referred to as the Vector Space Model (VSM) (Salton, 1975). VSM is a model for representing text documents, or any objects, as vectors of identifiers, for example terms. When using the VSM for XML clustering, the feature of the document content is a set of terms, and the feature of the document structure is a set of substructures such as tags, paths, subtrees, or subgraphs. There exist various techniques to compute the weights of features in a VSM. The most popular one is the term frequency-inverse document frequency (tf-idf) weighting. The key intuition is that the more frequent a feature (term) in a particular document, and the less common the feature in the document collection, the higher its weight. This weighting prevents a bias toward longer documents and gives a measure of the importance of a feature fi within a particular document dj. Given a collection D of documents, the weight for any feature fi in a document dj ∈ D is defined as:

tf-idf(fi, dj) = tfi,j × log(|D| / |{d ∈ D : fi ∈ d}|)

Above, tfi,j is called the term frequency of the feature fi in the document dj, and denotes the number of occurrences of fi in dj. The logarithmic term is called the inverse document frequency, with the numerator denoting the number of documents in the collection and the denominator denoting the number of documents in which the feature fi appears (at least once). For an XML dataset, the feature fi refers either to a content feature or to a substructure feature of the XML documents. Another popular weighting scheme is Okapi BM-25, which utilises similar concepts to tf-idf but has two tuning parameters, namely k1 and b, which influence the effect of feature frequency and document length, respectively. The default values are k1 = 2 and b = 0.75. The BM-25 weighting depends on three factors:

• the Collection Frequency Weight (CFW), defined for the feature f as CFW = log(|D|) − log(|{d ∈ D : f ∈ d}|);
• the feature frequency fff (i.e., the term frequency as in the tf-idf function); and
• the Normalized Document Length (NDL), defined for a given document d as NDL(d) = DL(d)/avg(DL(D)), where DL(d) is the length of the document (in words) and avg(DL(D)) denotes the average document length in the collection.

The BM-25 weight for a given feature f in a document d is then given by the following formula:

Bf = (CFW × fff × (k1 + 1)) / (k1 × ((1 − b) + b × NDL(d)) + fff)
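Both weighting schemes follow directly from the formulas above; the sketch below assumes a toy feature-count table (illustrative names and values).

```python
# Hedged sketch of tf-idf and BM-25 weighting over a toy VSM.
import math

docs = {                       # document id -> feature -> frequency
    'd1': {'conf': 2, 'title': 1},
    'd2': {'conf': 1, 'editor': 3},
}

def tf_idf(f, d):
    n_docs = len(docs)
    df = sum(1 for feats in docs.values() if f in feats)
    return docs[d].get(f, 0) * math.log(n_docs / df)

def bm25(f, d, k1=2.0, b=0.75):
    n_docs = len(docs)
    df = sum(1 for feats in docs.values() if f in feats)
    cfw = math.log(n_docs) - math.log(df)      # collection frequency weight
    dl = sum(docs[d].values())                  # document length
    avg_dl = sum(sum(v.values()) for v in docs.values()) / n_docs
    ndl = dl / avg_dl                           # normalised document length
    fff = docs[d].get(f, 0)                     # feature frequency
    return cfw * fff * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + fff)

print(tf_idf('editor', 'd2'), bm25('editor', 'd2'))
```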
In the VSM representation, the matrix cell value is the weight of the feature if the feature is present in the document; the cell value is zero if a document does not contain the feature. There are two VSM representations for the data distribution: dense and sparse. The sparse representation retains only the non-zero values, along with their feature ids. This improves computational efficiency, especially for sparse datasets where the number of non-zeroes is small compared to the number of zeroes. Figure 14 gives examples of a dense and a sparse representation using only the frequency of the feature. It can be seen that the size of the sparse representation is smaller than that of the dense representation. Clustering using the sparse representation is more efficient when there are many zeroes; however, if there are many non-zeroes, the sparse representation can incur additional overhead, as the feature indices are also stored.

Figure 14. An example of (a) dense and (b) sparse representation of an XML dataset modelled in VSM using feature frequencies

The VSM model is commonly used by XML clustering methods that focus on content features (Salton & McGill, 1986). Some attempts (Tran, Nayak, & Bruza, 2008; Yang & Chen, 2002) have been made to model and mine both the content and the structure of XML documents for clustering tasks. The approach of Tran et al. (2008) models the structure and the content of an XML document in separate VSM models; structure similarity and content similarity are calculated independently for each pair of documents, and the two similarity components are then combined using a weighted linear combination to determine a single similarity score between documents. Another representation which can link the content and the structure together is the Structured Link Vector Model (SLVM) (Yang & Chen, 2002), which represents both the structure and the content information of XML documents using vector linking. In the SLVM model, an XML document d is defined as a matrix d ∈ Rn×m such that d = [d(1), …, d(m)], where m is the number of elements and d(i) ∈ Rn is the tf-idf feature vector representing the element ei, given component-wise as d(i)j = tf(tj, d, ei) × idf(tj) for j = 1, …, n, where tf(tj, d, ei) is the frequency of the term tj in the element ei of d.
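The dense and sparse forms of Figure 14 can be sketched in a few lines; the counts are illustrative.

```python
# A small sketch of the dense vs. sparse VSM representations: the sparse
# form keeps only (feature_id, frequency) pairs for the non-zero cells.
dense = [
    [2, 0, 0, 1],   # one row per document, one column per feature
    [0, 0, 3, 0],
]

sparse = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
print(sparse)       # [[(0, 2), (3, 1)], [(2, 3)]]
```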
Using Paths

The path model represents the document's structure as a collection of paths. A clustering method using this path model measures the similarity
between XML documents by finding the common paths (Nayak, 2008; Nayak & Iryadi, 2006). One of the common techniques for identifying the common paths is to apply frequent pattern mining on the collection of paths to extract frequent paths of a constrained length, and to use these frequent paths as representatives for the cluster. This technique has been utilised by Hwang and Ryu (2005) and XProj (Aggarwal et al., 2007). Another simple method of finding XML data similarity according to common paths is to treat the paths as features in a VSM (Doucet & Ahonen-Myka, 2002; Yao & Zerida, 2007). Other approaches, such as XSDCluster (Nayak & Xia, 2004), PCXSS (Nayak & Tran, 2007) and XClust (Lee, Yang, Hsu, & Yang, 2002), adopt the concept of schema matching for finding the similarity between paths. The path similarity is obtained by considering the semantic and structural similarity of a pair of elements appearing in two paths. The path measures in these approaches are computationally expensive, as they consider many attributes of the elements, such as data types and their constraints. The above-described methods usually ignore the content that the nodes in the path may contain. Some researchers have attempted to include the text along with the path representation in order to cluster XML documents using both structure and content features; however, such methods are computationally expensive due to the presence of repeated paths for various text nodes. Previous works by Vercoustre, Fegas, Gul, and Lechevallier (2005) and Yao and Zerida (2007) have shown
that this type of data representation and clustering technique is not effective for paths with lengths greater than 3; in this model, an increase in the length imposes strict restrictions and often results in poor accuracy. Another approach that combines the structure and the content uses the Boolean representation model of BitCube: the XML document collection is first partitioned top-down into small bitcubes based on the paths, and the smaller bitcubes are then clustered using the bit-wise distance and their popularity measures.
Using Trees

This is one of the well-established fields of XML clustering. Several approaches modelling XML data as trees have been developed to determine XML data similarity. Well-known tree edit distance methods have been extended to compute the similarity between XML documents. Tree edit distance is based on the dynamic programming techniques for the string-to-string correction problem (Wagner & Fischer, 1974). It essentially involves three edit operations on the trees involved, namely changing, deleting, and inserting a vertex, in order to transform one tree into another. The tree edit distance between two trees is the minimum cost among the costs of all possible tree edit sequences, based on a cost model. The basic intuition behind this technique is that XML documents with minimum distance are likely to be similar and hence can be clustered together. Some of the clustering techniques that use the tree edit distance are those of Dalamagas, Cheng, Winkel, and Sellis (2006) and Nierman and Jagadish (2002). Besides mining the structure similarity of the whole tree, other techniques have been developed to mine frequent patterns in subtrees from a collection of trees (Termier et al., 2005; Zaki, 2005). The approach of Termier et al. (2005) consists of two steps: first, it clusters the trees based on the occurrence of the same pairs of labels in the ancestor relation using the Apriori algorithm; after the trees are clustered, a maximal common tree is computed to measure the commonality of each cluster across all its trees. To avoid modifying the tree structure, as in tree edit distance methods, other clustering techniques break the paths of tree-structured data into a collection of macro-path sequences, where each macro-path contains a tag name, its attributes, data types and content. A similarity matrix of the XML documents is then generated based on the macro-path similarity technique, and clustering of the XML documents is performed on this similarity matrix with the support of approximate tree inclusion and isomorphic tree similarity (Shen & Wang, 2003). Many other approaches have utilised the idea of tree similarity for XML document change detection (DeWitt & Cai, 2003) and for extracting schema information from XML documents, such as those proposed in Garofalakis, Gionis, Rastogi, Seshadri, and Shim (2000) and Moh, Lim, and Ng (2000). The majority of these methods focus on clustering XML documents by identifying the structure similarity between them. However, as pointed out earlier in the chapter, for some datasets it becomes essential to include both the structure and the content similarity for identifying clusters. Recently, some researchers have focused on combining content and structure in the clustering process (Kutty, Nayak, & Li, 2009a; Tagarelli & Greco, 2010). The SemXClust method (Tagarelli & Greco, 2010) represents XML documents as a collection of tree tuples, with the structure and content features enriched by the support of an ontological knowledge base, creating a set of semantically cohesive and smaller-sized documents. These tree tuples are then modelled as transactions, and transactional clustering algorithms are applied. The HCX method (Kutty et al., 2009a) includes both the structure and the content information of XML documents in producing clusters, in order to improve the accuracy and meaningfulness of the clustering solution. HCX first determines the structure similarity in the form of frequent subtrees as a constraint and then uses these frequent subtrees to represent the content of the XML documents. In other words, only the content included in the nodes of the frequent trees is used in the clustering process, i.e., a constrained content. Using only the constrained content corresponding to the frequent subtrees not only combines the structure and the content effectively but also reduces the huge overhead of the combination.
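As a reference point for the tree edit distance mentioned at the start of this section, the following is a sketch of the underlying Wagner-Fischer string edit distance; a full tree edit distance (e.g., over XML trees) follows the same dynamic-programming pattern, but over subtrees rather than string prefixes.

```python
# Hedged sketch: Wagner-Fischer edit distance, the string-to-string
# correction DP that tree edit distance generalises.
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # change (relabel) cost
            d[i][j] = min(d[i - 1][j] + 1,            # delete
                          d[i][j - 1] + 1,            # insert
                          d[i - 1][j - 1] + cost)     # change
    return d[m][n]

# e.g. distance between two pre-order string encodings of XML trees
print(edit_distance('PQS$T$$RU$', 'PQS$$RU$'))        # 2
```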
Using Graphs

The clustering algorithms for graph data can be categorised into two types: node clustering and graph clustering. Flake, Tsioutsiouliklis, and Tarjan (2002) and Aggarwal and Wang (2010) give a good overview of graph clustering algorithms. Node clustering algorithms attempt to group the underlying nodes using a distance (or similarity) value on the edges: the edges of the graph are labelled with numerical distance values, which are used to create clusters of nodes. Graph clustering algorithms, on the other hand, use the underlying structure as a whole and calculate the similarity between two different graphs. This task is more challenging than node clustering, because of the need to match the structures of the underlying graphs and then use these structures for clustering purposes. A popular graph clustering algorithm for XML documents is S-GRACE (Lian, Cheung, Mamoulis, & Yiu, 2004), which computes the distance between two graphs by measuring their common set of nodes and edges. It first scans the XML documents and computes their s-graphs; the s-graph of two documents is the set of common nodes and edges. The s-graphs of all documents are then stored in a structure table called SG, which contains two fields of information: a bit string representing the edges of an s-graph, and the ids of all the documents whose s-graphs are represented by this bit string. Once the SG is constructed, clustering
can be performed on the bit strings. S-GRACE is a hierarchical clustering algorithm that uses the ROCK method (Guha, Rastogi, & Shim, 1999) to exploit the link (common neighbours) between s-graphs to select the best pair of clusters to be merged in the hierarchical merging process.
Classification

The concept of classification for XML documents was first introduced by XRules (Zaki & Aggarwal, 2003). Though not many techniques for XML classification were developed after this initial work, the classification data mining challenge (Nayak, De Vries, Kutty, Geva, Denoyer, & Gallinari, 2010) in the INEX forum has attracted a great deal of interest among researchers in extending supervised machine learning techniques to XML document classification. Let us now look at how the different models have been used for the XML classification task.
Using Transactional Data Model or Vector Space Model

As XML documents are often considered text documents, they can be represented in a VSM and standard mining methods for classification tasks may well be applied. A simple and commonly used method is the nearest-neighbour method, which can be applied for a small number of class labels. However, in many cases the classification behaviour of an XML document is hidden in the structure information available inside the document; in these situations, classifiers using IR-based representations are likely to be ineffective for XML documents. A recent technique for representing XML documents is the SLVM, which includes not only the content but also the structure (Yang & Wang, 2010): the frequent structure of the XML documents is captured and their content is represented in this model, similar to the ideas of HCX (Kutty et al., 2009a) in clustering.
Using Paths

To the best of our knowledge, no XML classification method has utilised path models in its representation.
Using Trees

The XRules classifier (Zaki & Aggarwal, 2003) performs rule-based classification of XML data using frequent discriminatory subtrees within the XML document collection. Other approaches that utilise tree structures to model not only the structure but also the content for classification use the concept of tree edit distance to compute the minimal cost involved in changing a source tree into a destination tree; the cost information is then provided to a k-NN classifier to classify the XML documents based on the structure and the content (Bouchachia & Hassler, 2007). Also, concise frequent subtrees have been used to combine the structure and the content of XML documents in a reduced search space for classification (Yang & Chen, 2002).
Using Graphs

Graph models have been used to model XML documents that have links amongst them and then to generate classifications (Chidlovskii, 2010; Tsoi, Hagenbuchner, Chau, & Lee, 2009). The links between the documents are captured using graph structures; by doing so, not only content analysis but also link mining can be performed to improve the accuracy of the classification.
CURRENT ISSUES AND CHALLENGES IN MODELLING XML DOCUMENTS FOR MINING

Some of the current issues and challenges in XML mining for the various models are:
1. How to effectively model both the structure and the content features for mining XML documents? Though there have been several attempts to combine the structure and the content features for mining XML documents, they have often resulted in decreased quality of the mining results. For instance, the BitCube model was used to cluster and query XML documents; however, this approach suffers from the typical disadvantages inherent in Boolean representation models, such as the lack of partial matching criteria and of natural measures for document ranking. Also, it was evident from the experimental results that it is expensive to project a document into small bitcubes based on its paths and, hence, the applicability of this type of approach to large datasets containing thousands of documents is questionable. This calls for a multi-dimensional representation of XML documents using not all the content features but only the more significant ones; defining what is significant is a critical problem.

2. How to combine the structure and the content features in these different types of models without affecting the scalability of the mining process? Not only the effectiveness of combining the structure and content features is important; the impact of this combination on the scalability of the mining process is also an issue. For example, BitCube can capture the relationship between the structure and the content, but it is expensive in terms of memory usage. Another example is the CRP and 4RP approach by Yao and Zerida (2007), a VSM model using a path representation in which each path contains the edges in sequential order from a node to a term in the data content. This approach creates a large number of features: for instance, if there are 5 distinct terms in a document and these 5 distinct terms appear in two different paths, then there will be 10
different features altogether. Techniques like random indexing (Achlioptas & Mcsherry, 2007) can be used to reduce the large number of features.

3. How to integrate background knowledge into the model for mining, i.e., using high-utility data for mining? In the presence of a very large number of XML documents, data models and mining techniques, it is now essential to mine patterns which are of much use and can be referred to as "high-utility patterns". To create high-utility patterns, background knowledge should be included in the mining process. SemXClust (Tagarelli & Greco, 2010) utilises the WordNet ontology to derive semantically rich and smaller XML documents for the purpose of clustering. WordNet can also be used for enriching the tags, as in YAWN (Schenkel, Suchanek, & Kasneci, 2007); these enriched tags could be used while modelling XML documents to assist in providing such high-utility patterns. Also, ontologies could be incorporated into the mining models to provide more useful and interesting results.

4. How to capture the semantic relationships using the models? Due to the flexibility of the XML language, XML users can define their own schema definitions; therefore, the document heterogeneity lies not only in the content but also in the element tag names. To address this problem, approaches such as XSDCluster (Nayak & Xia, 2004), PCXSS (Nayak & Tran, 2007) and XClust (Lee et al., 2002) use external resources such as WordNet (Fellbaum, 1998) to find the semantic relationships between element tag names when finding common paths. Much research on finding the semantics of element tag names has been done in the area of schema matching (Madhavan, Bernstein, & Rahm, 2001). Other techniques, such as the Latent Semantic Kernel (LSK) (Cristianini, Shawe-Taylor, & Lodhi, 2002), which is
based on LSI, can construct a "semantic" space wherein terms and documents that are closely associated are placed near one another; this reflects the major associative patterns in the data and ignores the less important ones. The advantage of finding semantic relationships is that it allows the mining process to be more accurate; however, there is a trade-off between accuracy and scalability, as many more features are considered in the mining process.
FUTURE MODELS OF XML DOCUMENTS AND THEIR OPPORTUNITIES

Forthcoming models can utilise the relationships within the content of XML documents represented in a TDM. To model the sequential relationships within the content, the terms in a document can be combined to form bigrams, trigrams and, generally, n-grams. For conf.xml, some of the bigrams are "international conference", "conference data" and "data mining", which can be represented in a TDM. The TDM can also be used as a simple model to represent the link relationships between XML documents. For instance, if there are two Wikipedia documents which discuss the same subject, they may have links between them. There are two types of links, inbound and outbound: an inbound link is a link coming into a given Wikipedia document from another Wikipedia document, whereas an outbound link is a link going out of the given Wikipedia document. When the documents are represented in a pair-wise matrix, the presence of a link between two documents can be indicated using a non-zero value. Figure 16 shows the transactional data model for the links in the sample Wikipedia documents given in Figure 15.
Figure 15. Examples of Wikipedia XML documents with links: (from top to bottom, and left to right) 1.xml, 2.xml, and 3.xml files
Consider the three Wikipedia documents in Figure 15, which contain outbound and inbound links. These links are used to build the link-based transactional data model: for every document in the document corpus, if there is an outbound link from document di to document dj, the cell value of (i, j) is set to 1, and if there is an inbound link from di to dj, the cell value of (j, i) is set to 1.

Figure 16. Link-based transactional data model built using the sample Wikipedia XML documents given in Figure 15
The TDM can easily store these links without much preprocessing effort, especially for smaller datasets. However, it should be noted that when the number of documents is very large, as in many real-life datasets, the transactional data model will be of size n², where n is the number of documents. Hence, it is essential to have efficient structures which reduce the redundancy when the links between documents are sparse. Future models of XML documents should also include multiple features regarding both structure and content and, with the increasing number of linked XML documents, should include the link as one of the model's dimensions.
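A small sketch of building this link-based model follows; since the actual links of Figure 15 are not reproduced here, the outbound-link sets are illustrative assumptions.

```python
# Hedged sketch of the link-based transactional data model: an n x n
# matrix built from outbound links (file names follow the Figure 15 style).
outbound = {                 # illustrative parsed link targets per document
    '1.xml': {'2.xml'},
    '2.xml': {'1.xml', '3.xml'},
    '3.xml': set(),
}

docs = sorted(outbound)
index = {d: i for i, d in enumerate(docs)}
n = len(docs)

matrix = [[0] * n for _ in range(n)]
for src, targets in outbound.items():
    for dst in targets:
        matrix[index[src]][index[dst]] = 1   # cell (i, j) = 1 for link i -> j

for row in matrix:
    print(row)
```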
The initial BitCube work using paths employed only a binary representation of the feature corresponding to each path; however, it used all the paths and hence may incur heavy computational cost. It is essential to identify the common subtrees, similar to the work in XProj (Aggarwal et al., 2007) and HCX (Kutty et al., 2009a), to reduce the number of subtrees used for clustering. One of the common multi-dimensional models is the tensor space model, which has been successfully used in signal processing (Damien & Salah, 2007), web mining (Mirzal, 2009) and many other fields. By modelling XML documents in a tensor space model, multi-way data analysis can be conducted to study the interactions between the features of the XML documents.
CONCLUSION

With the growing importance of XML documents as a means to represent data, there has been extensive effort devoted to devising new technologies to process, query, transform and integrate XML documents. Our focus in this chapter has been to overview the various models that have been used for XML mining in order to accomplish various XML data applications. In spite of the abundance of models for XML mining, the unprecedented growth in the size of XML data has resulted in several challenges for XML data modelling and mining techniques that are yet to be addressed by future research.
REFERENCES

Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web: From relations to semistructured data and XML. San Francisco, CA: Morgan Kaufmann Publishers.
Aggarwal, C. C., Ta, N., Wang, J., Feng, J., & Zaki, M. J. (2007). XProj: A framework for projected structural clustering of XML documents. In P. Berkhin, R. Caruana, & X. Wu (Eds.), Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), (pp. 46-55). ACM.

Aggarwal, C. C., & Wang, H. (2010). Graph data management and mining: A survey of algorithms and applications. In Aggarwal, C. C., & Wang, H. (Eds.), Managing and mining graph data (pp. 13–68). London, UK: Kluwer Academic Publishers. doi:10.1007/978-1-4419-6045-0_2

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.), Advances in knowledge discovery and data mining (pp. 307–328). Menlo Park, CA: American Association for Artificial Intelligence.

Anderson, R. (2000). Professional XML. Birmingham, UK: Wrox Press Ltd.

Asai, T., Abe, K., Kawasoe, S., Arimura, H., Sakamoto, H., & Arikawa, S. (2002). Efficient substructure discovery from large semi-structured data. In R. L. Grossman, J. Han, V. Kumar, H. Mannila, & R. Motwani (Eds.), Proceedings of the Second SIAM International Conference on Data Mining. SIAM.

Asai, T., Arimura, H., Uno, T., & Nakano, S. (2003). Discovering frequent substructures in large unordered trees. In G. Grieser, Y. Tanaka, & A. Yamamoto (Eds.), Lecture Notes in Computer Science: Vol. 2843. Discovery Science, 6th International Conference (pp. 47-61). Springer.
Bouchachia, A., & Hassler, M. (2007). Classification of XML documents. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, part of the IEEE Symposium Series on Computational Intelligence (CIDM), (pp. 390-396). IEEE.

Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi, P. L. (2002). A tool for extracting XML association rules. In Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), (p. 57). IEEE Computer Society.

Chen, C., Lin, C. X., Yan, X., & Han, J. (2008). On effective presentation of graph patterns: A structural representative approach. In J. G. Shanahan, S. Amer-Yahia, I. Manolescu, Y. Zhang, D. A. Evans, A. Kolcz, K.-S. Choi, & A. Chowdhury (Eds.), Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), (pp. 299-308). ACM.

Chi, Y., Nijssen, S., Muntz, R. R., & Kok, J. N. (2005). Frequent subtree mining - An overview. Fundamenta Informaticae, 66(1-2), 161–198.

Chi, Y., Yang, Y., & Muntz, R. R. (2003). Indexing and mining free trees. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), (pp. 509-512). IEEE Computer Society.

Chi, Y., Yang, Y., & Muntz, R. R. (2004). HybridTreeMiner: An efficient algorithm for mining frequent rooted trees and free trees using canonical forms. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM), (pp. 11-20). IEEE Computer Society.

Chi, Y., Yang, Y., Xia, Y., & Muntz, R. R. (2004). CMTreeMiner: Mining both closed and maximal frequent subtrees. In H. Dai, R. Srikant, & C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference (PAKDD), Lecture Notes in Computer Science: Vol. 3056 (pp. 63-73). Springer.
Chidlovskii, B. (2010). Multi-label Wikipedia classification with textual and link features. In S. Geva, J. Kamps, & A. Trotman (Eds.), Focused Retrieval and Evaluation, 8th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX), Lecture Notes in Computer Science: Vol. 6203 (pp. 387-396). Springer.

Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2002). Latent semantic kernels. Journal of Intelligent Information Systems, 18(2-3), 127–152. doi:10.1023/A:1013625426931

Dalamagas, T., Cheng, T., Winkel, K.-J., & Sellis, T. (2006). A methodology for clustering XML documents by structure. Information Systems, 31(3), 187–228. doi:10.1016/j.is.2004.11.009

Damien, M., & Salah, B. (2007). Survey on tensor signal algebraic filtering. Signal Processing, 87(2), 237–249. doi:10.1016/j.sigpro.2005.12.016

Deshpande, M., Kuramochi, M., & Karypis, G. (2003). Frequent sub-structure-based approaches for classifying chemical compounds. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), (pp. 35-42). IEEE Computer Society.

DeWitt, D. J., & Cai, J. Y. (2003). X-Diff: An effective change detection algorithm for XML documents. In U. Dayal, K. Ramamritham, & T. M. Vijayaraman (Eds.), Proceedings of the 19th International Conference on Data Engineering (ICDE), (pp. 519-530). IEEE Computer Society.

Doucet, A., & Ahonen-Myka, H. (2002). Naïve clustering of a large XML document collection. In N. Fuhr, N. Gövert, G. Kazai, & M. Lalmas (Eds.), Proceedings of the First Workshop of the Initiative for the Evaluation of XML Retrieval (INEX), (pp. 81-87).
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Massachusetts, USA: MIT Press.

Flake, G. W., Tsioutsiouliklis, K., & Tarjan, R. E. (2002). Graph clustering techniques based on minimum cut trees. Princeton, NJ: NEC.

Flesca, S., Manco, G., Masciari, E., Pontieri, L., & Pugliese, A. (2005). Fast detection of XML structural similarity. IEEE Transactions on Knowledge and Data Engineering, 17(2), 160–175. doi:10.1109/TKDE.2005.27

Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: A guide to the theory of NP-completeness. W. H. Freeman and Company.

Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., & Shim, K. (2000). XTRACT: A system for extracting document type descriptors. In W. Chen, J. F. Naughton, & P. A. Bernstein (Eds.), Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 165-176). ACM.

Geamsakul, W., Yoshida, T., Ohara, K., Motoda, H., Yokoi, H., & Takabayashi, K. (2004). Constructing a decision tree for graph-structured data and its applications. Fundamenta Informaticae, 66(1-2), 131–160.

Gu, M. S., Hwang, J. H., & Ryu, K. H. (2005). Designing the ontology of XML documents semi-automatically. In D.-S. Huang, X.-P. Zhang, & G.-B. Huang (Eds.), Advances in Intelligent Computing, International Conference on Intelligent Computing (ICIC), Lecture Notes in Computer Science: Vol. 3644 (pp. 818-827). Springer.

Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th International Conference on Data Engineering (ICDE), (pp. 512-521). IEEE Computer Society Press.
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In W. Chen, J. F. Naughton, & P. A. Bernstein (Eds.), Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 1-12). ACM.

Hégaret, P. L., Wood, L., & Robie, J. (2000). What is the document object model? Document Object Model (DOM) Level 2 Core Specification. Retrieved from http://www.w3.org/TR/DOM-Level-2-Core/introduction.html

Hwang, J. H., & Ryu, K. H. (2005). Clustering and retrieval of XML documents by structure. In Gervasi, O., Gavrilova, M. L., Kumar, V., Laganà, A., Lee, H. P., & Mun, Y. (Eds.), Computational Science and Its Applications (ICCSA), Lecture Notes in Computer Science (Vol. 3481, pp. 925–935). Springer.

Inokuchi, A., Washio, T., & Motoda, H. (2005). A general framework for mining frequent subgraphs from labeled graphs. Fundamenta Informaticae, 66(1-2), 53–82.

Inokuchi, A., Washio, T., Nishimura, K., & Motoda, H. (2002). A fast algorithm for mining frequent connected subgraphs. Tokyo, Japan: IBM Research, Tokyo Research Laboratory.

Kutty, S., Nayak, R., & Li, Y. (2007). PCITMiner: Prefix-based closed induced tree miner for finding closed induced frequent subtrees. In P. Christen, P. J. Kennedy, J. Li, I. Kolyshkina, & G. J. Williams (Eds.), Proceedings of the Sixth Australasian Data Mining Conference, Conferences in Research and Practice in Information Technology: Vol. 70 (pp. 151-160). Australian Computer Society.

Kutty, S., Nayak, R., & Li, Y. (2009a). HCX: An efficient hybrid clustering approach for XML documents. In U. M. Borghoff & B. Chidlovskii (Eds.), Proceedings of the 2009 ACM Symposium on Document Engineering (pp. 94-97). ACM.
Kutty, S., Nayak, R., & Li, Y. (2009b). XCFS: An XML documents clustering approach using both the structure and the content. In D. W.-L. Cheung, I.-Y. Song, W. W. Chu, X. Hu, & J. J. Lin (Eds.), Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), (pp. 1729-1732). ACM.

Lee, M. L., Yang, L., Hsu, W., & Yang, X. (2002). XClust: Clustering XML schemas for effective integration. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), (pp. 292–299). ACM.
Nayak, R. (2005). Discovering knowledge from XML documents. In Wang, J. (Ed.), Encyclopedia of data warehousing and mining (pp. 372–376). Hershey, PA: Idea Group Publications. doi:10.4018/978-1-59140-557-3.ch071

Nayak, R. (2008). Fast and effective clustering of XML data using structural information. Knowledge and Information Systems, 14(2), 197–215. doi:10.1007/s10115-007-0080-8
Lian, W., Cheung, D. W. L., Mamoulis, N., & Yiu, S.-M. (2004). An efficient and scalable algorithm for clustering XML documents by structure. IEEE Transactions on Knowledge and Data Engineering, 16(1), 82–96. doi:10.1109/ TKDE.2004.1264824
Nayak, R., De Vries, C., Kutty, S., Geva, S., Denoyer, L., & Gallinari, P. (2010). Overview of the INEX 2009 XML mining track: Clustering and classification of XML documents. In S. Geva, J. Kamps, & A. Trotman (Eds.), Focused Retrieval and Evaluation, 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, Lecture Notes in Computer Science: Vol. 6203 (pp. 366-378). Springer.
Madhavan, J., Bernstein, P. A., & Rahm, E. (2001). Generic schema matching with Cupid. In P. M. G. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, & R. T. Snodgrass (Eds.), Proceedings of 27th International Conference on Very Large Data Bases (VLDB), (pp. 49-58). Morgan Kaufmann.
Nayak, R., & Iryadi, W. (2006). XMine: A methodology for mining XML structure. In X. Zhou, J. Li, H. Shen, M. Kitsuregawa, & Y. Zhang (Eds.), Frontiers of WWW Research and Development, 8th Asia-Pacific Web Conference (APWeb), Lecture Notes in Computer Science: Vol. 3841 (pp. 786-792). Springer.
Mazuran, M., Quintarelli, E., & Tanca, L. (2009). Mining tree-based frequent patterns from XML. In Andreasen, T., Yager, R., Bulskov, H., Christiansen, H., & Larsen, H. (Eds.), Flexible Query Answering Systems Lecture Notes in Computer Science (Vol. 5822, pp. 287–299). Springer. doi:10.1007/978-3-642-04957-6_25
Nayak, R., & Tran, T. (2007). A progressive clustering algorithm to group the XML data by structural and semantic similarity. International Journal of Pattern Recognition and Artificial Intelligence, 21(3), 1–21.
Mirzal, A. (2009). Weblog clustering in multilinear algebra perspective. International Journal of Information Technology, 15(1), 108–123. Moh, C.-H., Lim, E.-P., & Ng, W.-K. (2000). DTD-Miner: A tool for mining DTD from XML documents. In The Second International Workshop on Advance Issues of E-Commerce and WebBased Information Systems (WECWIS). IEEE Computer Society.
Nayak, R., & Xia, F. B. (2004). Automatic integration of heterogenous XML-schemas. In S. Bressan, D. Taniar, G. Kotsis, & I. K. Ibrahim (Eds.), Proceedings of the Sixth International Conference on Information Integrationand Webbased Applications Services (iiWAS). Austrian Computer Society.
25
A Study of XML Models for Data Mining
Nguyen, P. C., Ohara, K., Motoda, H., & Washio, T. (2005). Cl-GBI: A novel approach for extracting typical patterns from graph-structured data. In T. B. Ho, D. Cheung & H. Liu (Eds.), Proceedings of the Advances in Knowledge Discovery and Data Mining, 9th Pacific-Asia Conference (PAKDD), Lecture Notes in Computer Science: Vol. 3518. (pp. 639-649). Springer. Nierman, A., & Jagadish, H. V. (2002). Evaluating structural similarity in XML documents. In Proceedings of the ACM SIGMOD WebDB Workshop (pp. 61-66). Nijssen, S., & Kok, J. N. (2003). Efficient discovery of frequent unordered trees. In Proceedings of First International Workshop on Mining Graphs, Trees, and Sequences (pp. 55-64). Paik, J., Shin, D. R., & Kim, U. (2005). EFoX: A scalable method for extracting frequent subtrees. In V. S. Sunderam, G. D. van Albada, P. M. A. Sloot, & J. Dongarra (Eds.), Proceedings of the Computational Science, 5th International Conference (ICCS) Lecture Notes in Computer Science: Vol. 3516. (pp. 813-817). Springer. Salton, G., & McGill, M. J. (1986). Introduction to modern information retrieval. New York, NY: McGraw-Hill Inc. Schenkel, R., Suchanek, F. M., & Kasneci, G. (2007). YAWN: A semantically annotated Wikipedia XML corpus. In A. Kemper, H. Schoning, T. Rose, M. Jarke, T. Seidl, C. Quix, & C. Brochhaus (Eds.), Datenbanksysteme in Business, Technologie und Web (BTW 2007): Vol. 103. LNI (pp. 277-291). GI. Shen, Y., & Wang, B. (2003). Clustering schemaless XML documents. In Meersman, R., Tari, Z., & Schmidt, D. C. (Eds.), On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, Lecture Notes in Computer Science (Vol. 2888, pp. 767–784). Springer. doi:10.1007/9783-540-39964-3_49
26
Skonieczny, Ł. (2009). Mining for unconnected frequent graphs with direct subgraph isomorphism tests. In Cyran, K., Kozielski, S., Peters, J., Stanczyk, U., & Wakulicz-Deja, A. (Eds.), Man-machine interactions (Vol. 59, pp. 523–531). Springer. doi:10.1007/978-3-642-00563-3_55 Stuart, I. (2004). XML schema, a brief introduction. Retrieved from http://lucas.ucs.ed.ac.uk/ tutorials/xml-schema/ Tagarelli, A., & Greco, S. (2010). Semantic clustering of XML documents. ACM Transactions on Information Systems, 28(1), 1–56. doi:10.1145/1658377.1658380 Termier, A., Rousset, M.-C., Sebag, M., Ohara, K., Washio, T., & Motoda, H. (2005). Efficient mining of high branching factor attribute trees. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), (pp. 785-788). IEEE Computer Society. Tran, T., Nayak, R., & Bruza, P. (2008). Document clustering using incremental and pairwise approaches. In N. Fuhr, J. Kamps, M. Lalmas & A. Trotman (Eds.), Focused Access to XML Documents, 6th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX) Lecture Notes in Computer Science: Vol. 4862. (pp. 222-233). Springer. Tsoi, A., Hagenbuchner, M., Chau, R., & Lee, V. (2009). Unsupervised and supervised learning of graph domains. In Bianchini, M., Maggini, M., Scarselli, F., & Jain, L. (Eds.), Innovations in Neural Information Paradigms and Applications, Studies in Computational Intelligence (Vol. 247, pp. 43–65). Springer. doi:10.1007/978-3-64204003-0_3 Vercoustre, A.-M., Fegas, M., Gul, S., & Lechevallier, Y. (2005). A flexible structured-based representation for XML document. In Mining Advances in XML Information Retrieval and Evaluation. Lecture Notes in Computer Science, 3977, 443–457. Springer
A Study of XML Models for Data Mining
Wagner, R. A., & Fischer, M. J. (1974). The stringto-string correction problem. Journal of the ACM, 21(1), 168–173. doi:10.1145/321796.321811 Wan, J. W. W., & Dobbie, G. (2004). Mining association rules from XML data using XQuery. In J. M. Hogan, P. Montague, M. K. Purvis, & C. Steketee (Eds.), Proceedings of the Second Workshop on Australasian Information Security, Data Mining and Web Intelligence, and Software Internationalisation: Vol. 32. ACSW Frontiers (pp. 169-174). Australian Computer Society.
Yao, J., & Zerida, N. (2007). Rare patterns to improve path-based clustering. In N. Fuhr, J. Kamps, M. Lalmas, & A. Trotman (Eds.), Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX). Springer. Yoon, J. P., Raghavan, V., Chakilam, V., & Kerschberg, L. (2001). BitCube: A three-dimensional bitmap indexing for XML documents. Journal of Intelligent Information Systems, 17(2-3), 241–254. Kluwer Academic Publishers doi:10.1023/A:1012861931139
Xiao, Y., Yao, J.-F., Li, Z., & Dunham, M. H. (2003). Efficient data mining for maximal frequent subtrees. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), (pp. 379-386). IEEE Computer Society.
Zaki, M. J. (2005). Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 17(8), 1021–1035. doi:10.1109/ TKDE.2005.125
Xifeng, Y., & Jiawei, H. (2003). CloseGraph: Mining closed frequent graph patterns. In L. Getoor, T. E. SenatoR, & P. Domingos (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), (pp. 286-295). ACM.
Zaki, M. J., & Aggarwal, C. C. (2003). XRules: An effective structural classifier for XML data. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), (pp. 316-325). ACM.
XML Schema. (n.d.). Retrieved from http://www. w3.org/XML/Schema Yang, J., & Chen, X. (2002). A semi-structured document model for text mining. Journal of Computer Science and Technology, 17(5), 603–610. doi:10.1007/BF02948828 Yang, J., & Wang, S. (2010). Extended VSM for XML document classification using frequent subtrees. In S. Geva, J. Kamps & A. Trotman (Eds.), Focused Retrieval and Evaluation, 8th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX) Lecture Notes in Computer Science: Vol. 6203. (pp. 441448). Springer.
Zhang, W.-S., Liu, D.-X., & Zhang, J.-P. (2004). A novel method for mining frequent subtrees from XML data. In Z. R. Yang, R. M. Everson, & H. Yin (Eds.), Intelligent Data Engineering and Automated Learning (IDEAL): Vol. 3177. Lecture Notes in Computer Science (pp. 300305). Springer.
ENDNOTES
1
http://www.inex.otago.ac.nz//tracks/wikimine/wiki-mine.asp
This work was previously published in XML Data Mining: Models, Methods, and Applications, edited by Andrea Tagarelli, pp. 1-28, copyright 2012 by Information Science Reference (an imprint of IGI Global).
27
28
Chapter 2
Finding Persistent Strong Rules: Using Classification to Improve Association Mining
Anthony Scime The College at Brockport, State University of New York, USA
Kulathur S. Rajasethupathy The College at Brockport, State University of New York, USA
Karthik Rajasethupathy Cornell University, USA
Gregg R. Murray Texas Tech University, USA
ABSTRACT

Data mining is a collection of algorithms for finding interesting and unknown patterns or rules in data. However, different algorithms can result in different rules from the same data. The process presented here exploits these differences to find particularly robust, consistent, and noteworthy rules among much larger potential rule sets. More specifically, this research focuses on using association rules and classification mining to select the persistently strong association rules. Persistently strong association rules are association rules that are verifiable by classification mining the same data set. The process for finding persistent strong rules was executed against two data sets obtained from the American National Election Studies. Analysis of the first data set resulted in one persistent strong rule and one persistent rule, while analysis of the second data set resulted in 11 persistent strong rules and 10 persistent rules. The persistent strong rule discovery process suggests these rules are the most robust, consistent, and noteworthy among the much larger potential rule sets.
INTRODUCTION

In data mining, there are a number of methodologies used to analyze data. The choice of methodology is an important consideration, which is determined by the goal of the data mining and the
type of data. Different methodologies can result in different rules from the same data. Association mining is used to find patterns of data that show conditions where sets of attribute-value pairs occur frequently in the data set. It is often used to determine the relationships among transaction data. Classification mining, on the other hand, is used to find models of data for categorizing
instances (e.g., objects, events, or persons). It is typically used for predicting future events from historical data (Han & Kamber, 2001). Because association and classification methodologies or algorithms process data in very different ways, they yield different sets of rules. The process presented here exploits these differences to find particularly robust, consistent, and noteworthy rules among much larger potential rule sets. More specifically, this research focuses on using association rules and classification mining to select the persistently strong association rules, which are association rules that are verifiable by classification mining the same data set.

Decision tree classification algorithms construct models by looking at past performance of input attributes with respect to an outcome class. The model is constructed inductively from records with known values for the outcome class. The input attribute with the strongest association with the outcome class is selected from the training data set using a divide-and-conquer strategy that is driven by an evaluation criterion. The training data are divided based on the values of this attribute, thereby creating subsets of the data. Each subset is evaluated independently to select the attribute with the next strongest association to the outcome class along the subset's edge. The process of dividing the data and selecting the next attribute, which is the one with the next strongest association with the outcome class at that point, continues until a leaf node is constructed (Quinlan, 1993). The rules derived from the decision tree provide insight into how the outcome class's value is, in fact, dependent on the input attributes. A complete decision tree provides for all possible combinations of the input attributes and their allowable values reaching a single, allowable outcome class.

Classification decision trees have a root node. The attribute of this root node is the most predictive attribute of a record's class and is present in the premise of every rule produced by classification. The presence of the root node attribute in the premise of all the rules is a limitation of decision
tree classification mining. There may be a domain theory where the root attribute is not relevant and/or another attribute is theoretically relevant and useful for predicting the value of the class attribute. Association mining may find rules in such instances. Further, the class attribute appears in the consequent of every classification rule. This class attribute is the goal of the data mining. It is the attribute that ultimately determines if a record supports a domain theory under consideration.

Association mining evaluates data for relationships among attributes in the data set (Agrawal, Imieliński, & Swami, 1993). The association rule mining algorithm Apriori finds itemsets within the data set at user-specified minimum support levels. An itemset is a collection of attribute-value pairs (items) that occurs in the data set. The support of an itemset is the percent of records that contain all the items in the itemset. The largest supported itemsets are converted into rules where each item implies and is implied by every other item in the itemset.

Given the limitations on decision tree classification rules, association mining may be applied to the classification attributes and data set to find other rules that address the domain. Unlike classification, association mining considers all the attribute combinations in the records. Also unlike classification, it does not have a goal of predicting the value of a specific attribute. As a result, association mining often produces a large number of rules (Bagui, Just, & Bagui, 2008), many of which may not be relevant.

The strength of rules is an important consideration in association mining. Generally, a rule's strength is measured by its confidence level. Strong association mined rules are those that meet the minimum confidence level set by the domain expert (Han & Kamber, 2001). The higher the confidence level the stronger the rule and the more likely the rule will be successfully applied to new data.

Measures of interestingness are either subjective or objective (Tan, Steinbach, & Kumar, 2006). Subjective interestingness is based on the
domain expert's opinion of the rules found. A rule is subjectively interesting if it contradicts the expectations of an expert or is actionable in the domain (Silberschatz & Tuzhilin, 1996). Objectively measuring the interestingness of a rule is a major research area for both classification and association mining.

In addition to being interesting, rules may be supportive (Scime, Murray, & Hunter, 2010). Knowledge in a domain is based on observations that are supported by substantial evidence from behavior in the domain and analysis of data from the domain. A data set representative of a domain should result in rules that support and confirm the domain's established knowledge, if the knowledge is correct, as well as be interesting when a rule contradicts the domain knowledge (National Academy of Sciences, 2008).

Although different data mining methodologies yield different sets of rules, it is possible that different algorithms will generate rules that are similar in structure and supportive of one another. When this is the case, these independently identified yet common rules may be considered "persistent rules." Rules that are both persistent and strong can improve decision making by narrowing the focus to rules that are the most robust and consistent.

In this research, the concept of persistent strong rules is explored. Further, the persistent-rule discovery process is demonstrated in the area of voting behavior, which is a complex process subject to a wide variety of factors. Given the high stakes often involved in elections, researchers, the media, candidates, and political parties devote considerable effort and resources to trying to understand the dynamics of voting and vote choice (Edsall, 2006). This research shows how persistent strong rules, found using both association and classification data mining algorithms, can be used to understand voters and their behavior.

This chapter is a continuation and extension of the work presented in Rajasethupathy et al. (2009), where the concept of persistent rules was first introduced and its utility for discovering robust, consistent, and noteworthy rules was discussed. In particular, this chapter presents the algorithm for determining if a persistent rule is a persistent strong rule. The chapter begins with background on combining association and classification data mining to achieve results that are more meaningful than using the approaches individually. This is followed with a review of how rules are created by the Apriori association and the C4.5 classification algorithms. Included is how rules can be combined using rule reduction and supersession, leading to the discovery of persistent rules and the determination of persistent strong rules. In the next section this methodology is applied to two independently derived data sets from the 1948-2004 American National Election Studies (ANES) Cumulative Data File. The first data set is used to identify individuals who are likely to vote, and the second is used to predict the political party for which a voter is likely to cast a ballot in a presidential election. These examples are followed by a short discussion on the future of data mining using multiple techniques. The chapter ends with a concluding section that summarizes persistent strong rule discovery in general and as applied to the two data sets.
BACKGROUND

The choice of association or classification mining is determined by the purpose of the data mining and the data type. Association mining is used to find connections between data attributes that may have real-world significance, yet are not obvious to a real-world observer. This analysis is often referred to as a market basket analysis because it can be used to find the likelihood that shopping items will be bought together. An association rule's consequent may have multiple attributes. Furthermore, one rule's consequent may be another rule's premise. Classification mining builds a decision tree model with the goal of determining the likely value of a specific attribute based on
the values of the other attributes. The consequents of all the classification rules of a model share the same attribute, with its different values. Nevertheless, similar results have been obtained by mining the same data set using both methodologies. For example, Bagui (2006) mined crime data using association and classification techniques; both indicated that in all regions of the United States, when the population is low, the crime rate is low.

Researchers have attempted to identify interesting rules by creating processes in which one applies other methodologies to data prior to the application of a data mining algorithm. For example, CMAR (Classification based on Multiple class-Association Rules) uses the frequent patterns and association relationships between records and attributes to do classification (Li, Han, & Pei, 2001). Jaroszewicz and Simovici (2004) employ user background knowledge to determine the interestingness of sets of attributes using a Bayesian Network prior to association mining. In their research, the interestingness of a set of attributes is indicated by the absolute difference between the support for the attribute's itemset in the Bayesian Network and as an association itemset.

A problem with association rules is that they must have high levels of support and confidence to be discovered by most association algorithms. Zhong et al. (2001) have gone beyond interesting association rules and found that peculiar rules can also be interesting. These are rules that are derived from low support counts and attribute values that are very different from the other values for that attribute. Interesting association rules can also be identified through clustering (Zhao, Zhang, & Zhang, 2005). Rules are interesting when the attributes in the rule are very dissimilar. By clustering, the distance or dissimilarity between attributes can be computed to judge the interestingness of the discovered rules. Geng and Hamilton (2006) have surveyed and classified interestingness measures for rules, providing guidance on selecting the appropriate measurement instrument as a function of the application and data set.

Another problem with association mining is the very large number of rules typically found. Applying a rule template is a simple method to reduce the number of rules to those interesting to a particular question (Klemettinen, Mannila, Ronkainen, Toivonen, & Verkamo, 1994). Zaki (2004) introduced the concept of closed frequent itemsets to drastically reduce the number of rules to present to the user. Rule reduction can also be accomplished by pruning association mining rules with rule covers (Toivonen et al., 1995). A rule cover is a subset of a rule set that matches all the records matched by the rule set. Rule covers produce useful short descriptions of large sets of rules. An interestingness measure can be used to reduce the number of association rules by 40-60% when the data are structured as a taxonomy. This is possible when the support and confidence of a rule are close to their expected values based on an ancestor rule, making the rule redundant (Srikant & Agrawal, 1995).

Data Dimensionality Reduction (DDR) can be used to simplify the data, thus reducing the number of rules. Fu and Wang (2005) reduced data dimensionality using a separability-correlation measure to select subsets of attributes based on attribute importance with respect to a classification class attribute. Using a neural network classifier, the reduction in attributes led to improved classification performance, resulting in smaller rule sets with higher accuracies when compared with other methods. Expert knowledge has also been used to reduce data dimensionality while iteratively creating classification models (Murray, Riley, & Scime, 2007; Scime & Murray, 2007).

Classification mining has been improved by the prior application of association mining to data. In this approach, the association mining creates itemsets that are selected based on achieving a given support threshold. The original data set then has an attribute added to it for each selected itemset, where the attribute values are true or false;
true if the instance contains the itemset and false otherwise. The classification algorithm is then executed on this modified data set (Deshpande & Karypis, 2002; Padmanabhan & Tuzhilin, 2000) to find the interesting rules; a brief sketch of this augmentation appears at the end of this section. Liu et al. (1998) used Apriori association mining limited to finding rules with a problem's class attribute in the rule consequent. These "classification association rules" are then used in a heuristic classifier to find interesting classification rules.

Combining association and classification is a multi-method approach to data mining. In Rajasethupathy et al. (2009), the concept of persistent rules was first introduced to improve the usefulness of mined rules; it is extended here to persistent strong rules. There are a large number of other approaches that apply multiple methods, such as feature selection (Dy & Brodley, 2004), hybrid learning (Babu, Murty, & Agrawal, 2004; Vatsavai & Bhaduri, 2007; Wagstaff, Cardie, Bogers, & Schrodl, 2001; Wang, Zhang, & Huang, 2008), multi-strategy learning (Adomavicius & Tuzhilin, 2001; Dietterich, 2000; Li, Tang, Li, & Luo, 2009; Tozicka, Rovatsos, & Pechoucek, 2007; Webb & Zheng, 2004), model combination (Aslandogan, Mahajani, & Taylor, 2004; Lane & Brodley, 2003; Wang, Zhang, Xia, & Wang, 2008), multi-objective optimization (Chandra & Grabis, 2008), hybrid methods (Ramu & Ravi, 2008), and ensemble systems (Kumar & Ravi, 2008; Su, Khoshgoftaar, & Greiner, 2009).
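To make the itemset-augmentation approach concrete, here is a minimal sketch in Python. It is an illustration of the idea rather than the cited authors' implementations; the function name add_itemset_features and the flag-naming scheme are hypothetical, and records are assumed to be attribute-value dictionaries.

def add_itemset_features(records, itemsets):
    # For each selected frequent itemset, append a boolean attribute that
    # is True when the record contains every attribute-value pair in the
    # itemset, and False otherwise.
    augmented = []
    for record in records:
        flags = {
            f"itemset_{i}": all(record.get(attr) == value for attr, value in itemset)
            for i, itemset in enumerate(itemsets)
        }
        augmented.append({**record, **flags})
    return augmented

records = [{"C1": "x", "C2": "g"}, {"C1": "y", "C2": "g"}]
itemsets = [[("C1", "x"), ("C2", "g")]]
print(add_itemset_features(records, itemsets))
# [{'C1': 'x', 'C2': 'g', 'itemset_0': True},
#  {'C1': 'y', 'C2': 'g', 'itemset_0': False}]

The classification algorithm would then be run on the augmented records.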
ASSOCIATION PLUS CLASSIFICATION: PERSISTENT STRONG RULES

In data mining, data preparation is key to the analysis. A data set has attributes and instances. Attributes define the characteristics of the data set. Instances serve as examples and contain specific values of attributes. Attributes can be either nominal or numeric. Numeric attributes cannot be mined in association mining because they are
continuous valued attributes. Therefore, they have to be converted into nominal or discrete valued attributes. During data preparation, numeric attributes are discretized (numeric to nominal) so that they can be mined using the association algorithm.

Association mining evaluates data for relationships among attributes in the data set (Witten & Frank, 2005). The association rule mining algorithm, Apriori, finds itemsets within the data set at user-specified minimum support and confidence levels. The size of the itemsets is continually increased by the algorithm until no itemsets satisfy the minimum support level. The support of an itemset is the number of instances that contain all the attributes in the itemset. The largest supported itemsets are converted into rules where each item implies and is implied by every other item in the itemset. For example, given an itemset of three items (C1 = x, C2 = g, C3 = a), twelve rules are generated:

IF (C1 = x AND C2 = g) THEN C3 = a (1)
IF (C1 = x AND C3 = a) THEN C2 = g (2)
IF (C2 = g AND C3 = a) THEN C1 = x (3)
IF (C1 = x) THEN (C2 = g AND C3 = a) (4)
IF (C2 = g) THEN (C1 = x AND C3 = a) (5)
IF (C3 = a) THEN (C1 = x AND C2 = g) (6)
IF (C1 = x) THEN (C2 = g) (7)
IF (C1 = x) THEN (C3 = a) (8)
IF (C2 = g) THEN (C1 = x) (9)
IF (C2 = g) THEN (C3 = a) (10)
IF (C3 = a) THEN (C1 = x) (11)
IF (C3 = a) THEN (C2 = g) (12)
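The conversion of a supported itemset into rules can be sketched as a simple enumeration: every split of every sub-itemset (of size two or more) into a non-empty premise and a non-empty consequent yields one rule. The following is an illustrative sketch in Python, not WEKA's Apriori implementation; applied to the three-item example above, it prints the same twelve rules, though in a different order.

from itertools import combinations

def rules_from_itemset(itemset):
    # Enumerate every rule whose premise and consequent are disjoint,
    # non-empty subsets drawn from a sub-itemset of size >= 2.
    rules = []
    for size in range(2, len(itemset) + 1):
        for subset in combinations(itemset, size):
            for k in range(1, len(subset)):
                for premise in combinations(subset, k):
                    consequent = tuple(item for item in subset if item not in premise)
                    rules.append((premise, consequent))
    return rules

itemset = [("C1", "x"), ("C2", "g"), ("C3", "a")]
for premise, consequent in rules_from_itemset(itemset):
    lhs = " AND ".join(f"{a} = {v}" for a, v in premise)
    rhs = " AND ".join(f"{a} = {v}" for a, v in consequent)
    print(f"IF {lhs} THEN {rhs}")   # prints the 12 rules for the 3-itemset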
Classification mining using the C4.5 algorithm, on the other hand, generates a decision tree. The goal of classification is to determine the likely value of a class variable (the outcome variable)
given values for the other attributes of data. This is accomplished by the construction of a decision tree using data containing the outcome variable and its values. The decision tree consists of decision nodes and leaf nodes (beginning with a root decision node) that are connected by edges. Each decision node is an attribute of the data and the edges represent the attribute values. The leaf nodes represent the outcome variable; the expected classification results of each data instance. Using the three items from above with C1 as the outcome variable, Figure 1 represents a possible tree. The branches of the decision tree can be converted into rules, whose consequent is the outcome variable with its legal values. The rules for the tree in Figure 1 are:

IF C3 = a AND C2 = g THEN C1 = x (13)
IF C3 = a AND C2 = h THEN C1 = y (14)
IF C3 = b THEN C1 = z (15)
IF C3 = c THEN C1 = y (16)
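Converting branches to rules is mechanical: each root-to-leaf path becomes one rule whose premise is the path's conditions and whose consequent is the leaf's class value. Below is a minimal sketch for the tree of Figure 1; the nested-tuple tree encoding and the name tree_to_rules are hypothetical conveniences, not C4.5's internal structures.

# Tree of Figure 1: internal nodes are ('split', attribute, branches),
# leaves are ('leaf', outcome value for C1).
tree = ("split", "C3", {
    "a": ("split", "C2", {"g": ("leaf", "x"), "h": ("leaf", "y")}),
    "b": ("leaf", "z"),
    "c": ("leaf", "y"),
})

def tree_to_rules(node, conditions=()):
    # One rule per root-to-leaf path; the leaf supplies the consequent.
    if node[0] == "leaf":
        return [(conditions, node[1])]
    _, attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

for premise, outcome in tree_to_rules(tree):
    lhs = " AND ".join(f"{a} = {v}" for a, v in premise)
    print(f"IF {lhs} THEN C1 = {outcome}")   # reproduces Rules 13-16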
Figure 1. Classification Decision Tree

Rule Reduction and Supersession

The need to reduce the number of rules is common to classification and association techniques. This reduction may take place because the rule is not physically possible, the rule's confidence falls
below the established threshold level, or the rule can be combined with other rules. In association mining, a minimum confidence level is set for the rules. Those rules whose confidence falls below that level are eliminated. In classification mining, a pruning process combines decision tree nodes to reduce the size of the tree while having a minimum effect on the classification result (Witten & Frank, 2005). It is possible that physically impossible or obviously coincidental rules remain after the algorithms reduce the number of rules. These rules should be identified by a domain expert and eliminated as well.

Furthermore, one rule may have all the properties of another rule in association mining. As a rule's premise takes on more conditions, the confidence of the rule generally increases. For example, given two rules, with the confidence levels given after each rule:

IF A1 = r AND A2 = s THEN A3 = t (conf: 0.90) (17)
IF A1 = r THEN A3 = t (conf: 0.80) (18)
Rule 17 contains all the conditions of Rule 18, plus one more. The additional condition in Rule 17 increases the confidence; however, if a confidence level of 0.80 is sufficient, then Rule 18 can supersede Rule 17, and Rule 17 is eliminated.
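The supersession test reduces to a subset check plus a confidence threshold. A minimal sketch follows, assuming a rule is represented as a (premise set, consequent, confidence) triple; the function name is hypothetical.

def supersedes(general, specific, min_conf):
    # The more general rule replaces the more specific one when its
    # premise is a proper subset of the other's, the consequents match,
    # and it still meets the minimum confidence level.
    g_premise, g_consequent, g_conf = general
    s_premise, s_consequent, _ = specific
    return (g_premise < s_premise and g_consequent == s_consequent
            and g_conf >= min_conf)

rule17 = ({("A1", "r"), ("A2", "s")}, ("A3", "t"), 0.90)
rule18 = ({("A1", "r")}, ("A3", "t"), 0.80)
print(supersedes(rule18, rule17, min_conf=0.80))  # True: Rule 17 is eliminated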
Persistent Strong Rule Discovery

Persistent rules are those that are obtained across independent data mining methods. That is, they are the subset of rules common to more than one method. If an association rule and a classification rule are similar, then the rule would be robust across methods and be considered persistent. Strong rules, on the other hand, are those found by association and classification algorithms that meet a domain expert's defined confidence level. Each algorithm has a mechanism for elimination of rules not meeting that confidence level. A persistent strong rule is not a new rule; it is an association rule that is
both persistent by appearing in a classification decision tree and strong by exhibiting a similar confidence level in the classification rule. That is, the confidence levels fall within a tolerance defined by a domain expert.

Rules from association and classification can be compared when the rules contain the same attributes in the premise and the consequent. There are two ways an association rule can be matched to a classification rule. In the first case, the rule is a direct match. That is, the premise and the consequent of the classification rule match the association rule in attribute and value. A direct match exists in Rules 3 and 13, above, in which the premise is C2 = g AND C3 = a, and the consequent is C1 = x. In the second (and more common) case, the classification rules contain many conditions as the tree is traversed to construct the rule. That is, a condition exists for each node of the tree. As long as the entire association rule premise is present in a classification rule, the association rule can supersede the classification rule. When the classification rule drops conditions that are not present in the association rule, it becomes a rule-part. However, there may be many identical rule-parts. The process of finding and evaluating the rule-parts that match an association rule involves the following steps:

1. Find association rules with the classification outcome variable as the consequent;
2. Find those classification rules that contain the same conditions as the association rule;
3. Create rule-parts by deleting the classification rule conditions that are not present in the corresponding association rule (Figure 2);
4. From the classification rule-parts, add up the total number of instances;
5. Add up the number of correctly classified instances with the same consequent value;
6. If the consequent value of a rule-part does not match that of the association rule, the entire rule results in an incorrect classification; the number of instances correctly classified by these rule-parts is subtracted from the number of correctly classified instances; and
7. Divide the correctly classified instances by the total classified instances to find the confidence level.

More formally, to determine a combined confidence level for classification rules (steps 4-7, above) the following is applied:

CCI = IC - IIC
ARC = CCI - ICAR
CF = ARC / IC
Where:

CCI: total number of Correctly Classified Instances
IC: total number of Instances Classified
IIC: total number of Instances Incorrectly Classified
ARC: total number of Instances Correctly Classified with respect to the Association Rule
ICAR: total number of Instances Incorrectly Classified with respect to the Association Rule
CF: ConFidence
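These quantities translate directly into code. A minimal sketch with illustrative counts (the numbers here are made up for the example, not taken from the ANES analyses):

def combined_confidence(ic, iic, icar):
    # CCI = IC - IIC; ARC = CCI - ICAR; CF = ARC / IC.
    cci = ic - iic      # correctly classified instances
    arc = cci - icar    # correct with respect to the association rule
    return arc / ic

# Illustrative counts: 200 instances classified, 30 misclassified,
# 20 correctly classified under a mismatched consequent value.
print(combined_confidence(ic=200, iic=30, icar=20))  # 0.75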
Figure 2. Rule-part Creation
Persistent strong rules are those that occur in association and classification rule sets with the same premise and consequent, and whose independently found confidence levels fall within a tolerance level established by the domain expert. When the confidence level is not within tolerance, the association rule continues to be persistent, but not strong.

The best possible rule is not necessarily the one with the highest success rate. Sometimes the strongest rules may satisfy the instances in the training data set, but they may be both unreliable and unreasonable when applied to other data. However, when a rule is found independently through different data mining methods, the rule can be considered strong even when the confidence level may be less than that of a single rule. In other words, persistent rules across data mining methods are preferable to high-confidence rules found by only one data mining method. A rule found to be very strong (have a high confidence) by only one data mining method may be a result of the particular data set, whereas discovery of similar rules by independent data mining methods is not as likely to be caused by the data set. The persistent strong rule may have a lower confidence level; however, it is a more accurate confidence level and can reasonably assure researchers that the rule will maintain that confidence level when applied to other data sets.

The goal of the persistent strong rule discovery process is to find as many persistent strong rules as possible. A classification tree has one consequent, whereas an association rule set can have many consequents. Therefore, because of the premise-consequent requirement, only some of the association rules can be compared to the classification rules. To accommodate more rule comparisons, classification is executed using each association rule consequent as the outcome variable. This ensures the discovery of a greater number of persistent strong rules.

The persistent strong rules discovered by the above objective algorithm may or may not be interesting. Regardless of the rules' algorithmic
origin (association, classification, persistent, or persistent strong), the interestingness must also be evaluated subjectively by a domain expert. A rule is subjectively interesting if it contradicts the expectations of the expert, suggests actions to take in the domain (Geng & Hamilton, 2006; Silberschatz & Tuzhilin, 1996), or raises further questions about the domain.

In addition to being interesting, rules are of value if they support existing knowledge in the domain (Scime, Murray, & Hunter, 2010). Persistent strong rules are found by different data mining algorithms; they come from different analyses of the domain data and are supportive of each other. Other, non-data mining analyses of the domain may come to the same conclusions as a persistent strong rule. In this case, which may be quite common, the persistent strong rule provides more evidence that the situation outlined by the rule is in fact true in the domain. Persistent strong rules may be interesting or supportive in their domain.
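The whole persistent strong rule determination can be condensed into a single routine. The sketch below is one reading of steps 1-7 and the tolerance test, assuming an association rule is a (premise set, consequent, confidence) triple and each classification rule carries its instance counts; the representation and the function name persistence are hypothetical.

def persistence(assoc_rule, class_rules, tolerance=0.10):
    # class_rules: iterable of (conditions, consequent, IC, IIC) tuples.
    premise, consequent, assoc_conf = assoc_rule
    ic = cci = icar = 0
    for conditions, c_consequent, n, n_wrong in class_rules:
        if not premise <= set(conditions):   # premise must appear in the branch
            continue
        ic += n
        cci += n - n_wrong
        if c_consequent != consequent:       # step 6: mismatched consequent
            icar += n - n_wrong
    if ic == 0:
        return "not persistent"
    cf = (cci - icar) / ic                   # steps 4-7
    return "persistent strong" if abs(cf - assoc_conf) <= tolerance else "persistent"

Run against the likely voter rules in the next section, this routine should label Rule 21 persistent strong and Rule 20 merely persistent, matching the results reported there.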
MINING THE ANES DATA

The American National Election Studies (ANES, 2005) is an ongoing, long-term series of public opinion surveys intended to produce research-quality data for researchers who study the theoretical and empirical bases of American national election outcomes. The ANES collects data on items such as voter registration and choice, social and political values, social background and structure, partisanship, candidate and group evaluations, opinions about public policy, ideological support for the political system, mass media consumption, and egalitarianism. The ANES has conducted pre- and post-election interviews of a nationally representative sample of adults every presidential and midterm election year since 1948, except for the midterm election of 1950. The ANES data set is used primarily in the field of political science and contains a large number of records (47,438) and attributes (more than 900),
which, for comparability, have been coded in a consistent manner from year to year. Because the data set is prepared for analysis, all the attribute values are coded numerically with predefined meanings. This study uses ANES data that had been previously selected and cleaned for data mining (Murray, Riley, & Scime, 2009; Scime & Murray, 2007). See the appendix for details on the pertinent survey items.

An important issue in political science is vote choice; that is, who votes for which candidate and why. For one project (Scime & Murray, 2007), the domain expert reduced the initial number of ANES attributes from more than 900 to the 238 attributes most meaningful for predicting presidential vote choice based on domain knowledge. Following the iterative expert data mining process using the C4.5 classification algorithm, the domain expert then further reduced the number of attributes to 13 attributes. These 13 specific survey questions effectively predict the party candidate for whom a voter will cast a ballot. The results suggest that such a survey will correctly predict vote choice 66% of the time. Previous studies using non-data mining techniques have shown only 51% accuracy.

Another important issue in political science and, in particular, for political pollsters is the likelihood of a citizen voting in an election. Again using the ANES, but selecting a different set of attributes and instances and using the CHAID classification algorithm, the domain expert identified two survey questions that together can be used to categorize citizens as voters or non-voters. These results met or surpassed the accuracy rates of previous non-data mining models using fewer input attributes. The two items correctly classify 78% of respondents over a three-decade period. Additionally, the findings indicate that demographic attributes are less salient than previously thought by political science researchers (Murray, Riley, & Scime, 2009).

The ANES attributes are of two types: discrete and continuous. Discrete-value attributes contain a single defined value such as party identification,
which is indicated as Democrat, Republican, or other. Continuous-value attributes take on an infinite number of values, such as the 0-100-scale "feeling thermometers," which measure affect toward a specified target, and subtractive scales, which indicate the number of "likes" minus the number of "dislikes" mentioned about a target. It should be noted that in the previous studies the continuous-value attributes were left as continuous attributes.

As a result of the previous data mining methodology studies, the data sets had been cleaned and prepared for classification mining. To ensure that discrete attributes were not misinterpreted as numeric values, an "a" or "A" was prepended to each value. Because association mining only uses discrete attributes, the continuous attributes were discretized. In this study, the WEKA (Waikato Environment for Knowledge Analysis) (Witten & Frank, 2005) software implementations of the association mining Apriori algorithm and the classification mining C4.5 algorithm were used. Shannon's entropy method was used to discretize the continuous attributes. When discretized, the values are presented as a range of numbers; a value closed by a parenthesis is not included in the range, while a value closed by a square bracket is included. The values '-inf' and 'inf' represent negative and positive infinity, respectively.
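The heart of entropy-based discretization is choosing, for each continuous attribute, the cut point that minimizes the class entropy of the resulting bins. The sketch below finds a single best cut; it omits the recursion and stopping criterion of a full supervised discretizer such as WEKA's, and the function names are hypothetical.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_entropy_split(values, labels):
    # Try every boundary between adjacent distinct values and keep the
    # cut that minimizes the weighted entropy of the two bins.
    pairs = sorted(zip(values, labels))
    best_cut, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_e = e
    return best_cut

# A perfectly separable toy attribute splits at 35.5, yielding the bins
# '(-inf-35.5]' and '(35.5-inf)' in the range notation described above.
print(best_entropy_split([18, 22, 30, 41, 55, 63],
                         ["A0", "A0", "A0", "A1", "A1", "A1"]))  # 35.5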
Demonstrating Persistent Strong Rules: Identifying Likely Voters

The persistent-rule discovery process was applied to the data set used in the likely voter study (Murray, Riley, & Scime, 2009). The focus of this analysis is voter turnout – whether the respondent is expected to vote or not. This data set consists of three attributes and 3899 instances from the ANES. Association Apriori analysis resulted in three rules, all of which have the intent to vote as the consequent. A three-fold C4.5 classification algorithm using intent to vote as the outcome
variable generated a tree with three rules. In a three-fold process the data set is divided into three equal parts. Each part is independently used to assess the classification tree. The results of the assessments are averaged together to determine the tree's overall success rate and each individual rule's confidence. The resulting rules were compared and evaluated using a domain expert defined tolerance of 0.10. Persistent rules must have identical rule consequents generated independently by both data mining methodologies. Persistent strong rules must also have confidences within the tolerance. The association rules are:

IF A_Voteval_V2 = A1 AND A_Prevvote = A1 THEN A_Intent = A1 (conf: 0.99) (19)
IF A_Voteval_V2 = A1 THEN A_Intent = A1 (conf: 0.99) (20)
IF A_Prevvote = A1 THEN A_Intent = A1 (conf: 0.96) (21)
Rule 20 supersedes Rule 19, while retaining 99% confidence. Because intent to vote is the consequent for all the association rules, only one classification tree with intent to vote as the outcome variable is needed to find the persistent strong rules. The classification tree follows, using A_Intent as the outcome variable:

A_Prevvote = A0
|   A_Voteval_V2 = A0: A_Intent = A0 (IC = 1042.85, IIC = 391.78)
|   A_Voteval_V2 = A1: A_Intent = A1 (IC = 378.15, IIC = 49.93)
A_Prevvote = A1: A_Intent = A1 (IC = 3477, IIC = 144)
This tree can be converted into three rules:

IF A_Prevvote = A0 AND A_Voteval_V2 = A0 THEN A_Intent = A0 (conf: 0.62) (22)
IF A_Prevvote = A0 AND A_Voteval_V2 = A1 THEN A_Intent = A1 (conf: 0.87) (23)
IF A_Prevvote = A1 THEN A_Intent = A1 (conf: 0.96) (24)
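As a quick check, each rule's confidence follows from the leaf counts reported in the tree above, assuming conf = (IC - IIC) / IC:

# Leaf counts (IC, IIC) from the classification tree above.
leaves = {"Rule 22": (1042.85, 391.78),
          "Rule 23": (378.15, 49.93),
          "Rule 24": (3477.0, 144.0)}
for name, (ic, iic) in leaves.items():
    print(f"{name}: conf = {(ic - iic) / ic:.2f}")
# Rule 22: conf = 0.62
# Rule 23: conf = 0.87
# Rule 24: conf = 0.96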
Rule 22's consequent does not match the consequent of any association rule; therefore, it does not support the persistence of any of the association rules. Rule 23 and Rule 24 have the same consequent as the association rules. Rule 23 has two rule-parts, one of which matches Rule 20, allowing for supersession and a partial match. However, the confidence is outside the tolerance of 0.10; therefore Rule 20 is persistent, but not strong. Rule 24 is identical to Rule 21, and they both have 0.96 confidence. Hence, Rule 21 is a persistent strong association rule.

The single persistent strong rule found from the likely voter data set states that respondents to the survey who voted previously in a presidential election are likely to intend to vote in an upcoming presidential election, with a confidence of 96%. This rule may not be especially surprising, but it strongly supports the fact that citizens interested in politics maintain that interest from election to election.
Demonstrating Persistent Strong Rules: Predicting Vote Choice

The persistent-rule discovery process was also applied to the data set used in the presidential vote choice studies (Murray & Scime, 2010; Scime & Murray, 2007). This data set consists of 14 attributes and 6677 instances from the ANES. The Apriori association algorithm was run on the data set, which generated 29 rules with a minimum 0.80 confidence and 0.15 support levels. All 29 rules concluded with the race attribute having the value "white." This suggested that the number of white voters in the data set was sufficiently large to skew the results. Further examination of the data set revealed that 83.5% of the voters were
white. The domain expert concluded that race as an indicator of association is not useful. The race attribute was removed and the data set was rerun with the Apriori algorithm. This resulted in 33 rules with confidence levels between 0.60 and 0.85 and a support level of 0.15. Though the confidence levels had decreased, the rule consequents were varied and reasonable.

Next, the C4.5 classification algorithm using three folds was applied to the data set to which the Apriori association algorithm was applied (i.e., the data set that excluded the race attribute). A separate classification tree was constructed for each of the attributes appearing in an association rule consequent (depvarvotewho, awhoelect, affrep, aintelect, and affdem). The association and classification rules were compared and evaluated using a domain expert defined tolerance of 0.10.

As an example, the use of the outcome variable for the political party for which the voter reported voting (depvarvotewho) resulted in a classification tree with more complex rules than the rules obtained from association mining, and more complex than in the likely voter study. For example, one branch of the tree was:

apid = a2
|   affrepcand = '(-0.5-0.5]'
|   |   demtherm = '(-inf-42.5]'
|   |   |   aeduc = a1: depvarvotewho = NotVote
This branch of the tree translates into the rule:

IF Party identification (apid) = weak or leaning Democratic (a2)
AND Affect toward Republican candidate (affrepcand) = no affect, '(-0.5-0.5]'
AND Democratic thermometer (demtherm) = not favorable, '(-inf-42.5]'
AND Education of respondent (aeduc) = 8 grades or less (a1)
THEN Party voted for (depvarvotewho) = Not Vote
With vote choice (depvarvotewho) as the subject of the classification mining, the association rules with vote choice as the consequent become candidates for identification as persistent rules. Ten of the 33 association rules met this requirement; two of these are superseded by another, leaving eight possibly persistent rules. For example, one of the eight association rules states:

IF Affect toward Republican candidate (affrepcand) = extreme like, '(2.5-inf)'
THEN party voted for (depvarvotewho) = Republican (conf: 0.75)
A review of the tree rules reveals that there are six classification rules whose premises and consequents match the premises and consequents of the association rules. The other rules are not considered further, because to be classified along a branch an instance must satisfy all the conditions (attribute-value pairs) of the branch. By supersession, the instances that satisfy the branch would also satisfy the association rule being evaluated. The six classification rules that incorporate the association rule each have the rule-part:

IF affrepcand = '(2.5-inf)' THEN REP
In this example, then, there are 11 persistent strong association rules (25-27 and 33-40) and 10 persistent (not strong) association rules (28-32 and 41-45) among the original 33 rules; Table 1 lists each rule with its association and classification combined confidence levels. The 11 persistent strong rules found from the vote choice data set collectively support political science's current understanding of the unparalleled strength of party identification as a predictor of political attitudes and behavior. Voters are loyal to their political party, and they assess the political environment through the lens of their partisanship.
Table 1. Persistent strong and persistent rules from the vote choice data set, with their association and classification combined confidence levels

IF the affect toward the Republican candidate is positive THEN the respondent votes for the Republican candidate. (25)
Association confidence: 0.75. Classification combined confidence: 0.78.

IF the affect toward the Democratic candidate is negative THEN the respondent votes for the Republican candidate. (26)
Association confidence: 0.65. Classification combined confidence: 0.64.

IF the respondent identifies him or herself as a strong Democrat THEN the respondent votes for the Democratic candidate. (27)
Association confidence: 0.63. Classification combined confidence: 0.68.

The other five rules were found to be persistent, but not persistent strong; they are outside the 0.10 tolerance between the confidence levels:

IF the feeling about the Republican presidential candidate is positive THEN the respondent votes for the Republican candidate. (28)
Association confidence: 0.65. Classification combined confidence: 0.34.

IF the feeling about the Democratic presidential candidate is positive THEN the respondent votes for the Democratic candidate. (29)
Association confidence: 0.62. Classification combined confidence: 0.43.

IF the feeling about the Democratic presidential candidate is negative THEN the respondent votes for the Republican candidate. (30)
Association confidence: 0.62. Classification combined confidence: 0.34.

IF the affect toward the Republican Party is mostly positive THEN the respondent votes for the Republican candidate. (31)
Association confidence: 0.61. Classification combined confidence: 0.48.

IF the affect toward the Democratic Party is positive THEN the respondent votes for the Democratic candidate. (32)
Association confidence: 0.61. Classification combined confidence: 0.80.

The other four classification trees, using awhoelect, affrep, aintelect, and affdem as outcome variables, result in an additional eight persistent strong rules:

IF the feeling about the Republican presidential candidate is positive THEN the respondent thinks the Republican will be elected. (33)
Association confidence: 0.81. Classification combined confidence: 0.79.

IF the affect toward the Republican presidential candidate is very positive THEN the respondent thinks the Republican will be elected. (34)
Association confidence: 0.80. Classification combined confidence: 0.87.

IF the affect toward the Democratic presidential candidate is negative THEN the respondent thinks the Republican will be elected. (35)
Association confidence: 0.73. Classification combined confidence: 0.70.

IF the respondent voted Republican THEN the respondent thinks the Republican will be elected. (36)
Association confidence: 0.73. Classification combined confidence: 0.68.

IF the affect toward the Republican presidential candidate is positive THEN the respondent thinks the Republican will be elected. (37)
Association confidence: 0.68. Classification combined confidence: 0.77.

IF the feeling about the Republican vice presidential candidate is positive THEN the respondent thinks the Republican will be elected. (38)
Association confidence: 0.62. Classification combined confidence: 0.53.

IF the affect toward the Republican presidential candidate is negative THEN the respondent thinks the Democrat will be elected. (39)
Association confidence: 0.62. Classification combined confidence: 0.55.

IF the affect toward the Democratic Party is neutral THEN the affect toward the Republican Party is neutral. (40)
Association confidence: 0.70. Classification combined confidence: 0.68.

Also, persistent but not persistent strong rules were found:

IF the respondent is a weak Republican or leaning Republican THEN the respondent thinks the Republican will be elected. (41)
Association confidence: 0.65. Classification combined confidence: 0.38.

IF the respondent is a strong Democrat THEN the respondent thinks the Democrat will be elected. (42)
Association confidence: 0.62. Classification combined confidence: 0.79.

IF the respondent is interested in public affairs most of the time THEN the respondent is very much interested in campaigns. (43)
Association confidence: 0.64. Classification combined confidence: 0.44.

IF the affect toward the Republican Party is neutral THEN the affect toward the Democratic Party is neutral. (44)
Association confidence: 0.62. Classification combined confidence: 0.75.

IF the affect toward the Republican presidential candidate is neutral THEN the affect toward the Republican Party is neutral. (45)
Association confidence: 0.61. Classification combined confidence: 0.45.
FUTURE RESEARCH DIRECTIONS

Today, the use of the computer has become common for statistical analysis of data. Software packages are easy to use, inexpensive, and fast. But today's vast stores of data, with immense data sets, make comprehensive analysis all but impossible using conventional techniques. A solution is data mining. As data mining becomes an accepted methodology in the social sciences, domain experts will more routinely exploit the techniques as part of their normal analytical procedures.

Association mining discovers patterns in the data set. This technique comprehensively identifies possible relationships in the data. The question is which associations are viable. Currently, the domain expert decides the threshold level of confidence to select the strong rules. Classification mining can be used to find the strong association rules that are persistent. In the future, the domain expert will be able to conduct both association and classification simultaneously to determine the persistent strong rules from the data set. This combination of techniques will help the domain expert set the confidence threshold based on the data set itself. These rules can then be used with more confidence when making decisions within the domain.

Data mining is commonly conducted against transactional data, but data have gone beyond simple numeric and character flat file data. Today, data come in many forms: image, video, audio, streaming video, and combinations of data types. Data mining research is being conducted to find interesting patterns in data sets of all these data types. Beyond the different types of data, the data sources are also diverse. Data can be found in corporate and government data warehouses, transactional databases, the World Wide Web, and others. The data mining of the future will be multidimensional, accessing all these data sources and data types to find interesting, persistent strong rules.
Finding more interesting and supportive rules for a domain using multiple methods places constraints on the mining process. This is a form of constraint-based data mining, which uses constraints to guide the process. Constraints can specify the data mining algorithm to be used, the type of knowledge that is to be found, or the data to be mined. Dimension-level constraints research is needed to determine what level of summary, or the reverse, detail is needed in the data before the algorithms are applied (Hsu, 2002).

Research in data mining will continue to find new methods to determine interestingness. Research is needed to determine what values of a particular attribute are considered to be especially interesting in the data and in the resulting rule set (Hsu, 2002). Currently there are 21 different statistically based objective measures for determining interestingness (Tan, Steinbach, & Kumar, 2006). A leading area of research is to find new, increasingly effective measures. With regard to subjective measures of interestingness, research in domains that are both quantitative and qualitative can lead to new methods for determining interestingness. Further data mining research will find new methods to support existing knowledge and perhaps find new knowledge in domains where it has not yet been applied (Scime, Murray, & Hunter, 2010).
CONCLUSION

Data mining typically results in a set of rules that can be applied to future events or that can provide knowledge about interrelationships among data. This set of rules is most useful when it can be dependably applied to new data. Dependability is the strength of the rule. Generally, a rule's strength is measured by its confidence level. Association mining generates all possible rules by combining all the attributes and their values in the data set. The strong association rules are those that meet
the minimum confidence level set by the domain expert (Han & Kamber, 2001). The higher the confidence level the stronger the rule and the more likely the rule will be successfully applied to new data. Classification mining generates a decision tree that has been pruned to a minimal set of rules. Each rule also has a confidence rating suggesting its ability to correctly classify future data.

This research demonstrates a process to identify especially powerful rules. These rules are strong because they have a confidence level at or exceeding the threshold set in association mining; they are persistent because they are also found by classification mining and hold a similar confidence level. These powerful rules, which are deemed "persistent strong rules," are those that are common to different algorithms and meet a pre-determined confidence level. Persistent rules are discovered by the independent application of association and classification mining to the same data set. Some of these rules have been identified as strong because they meet a minimum tolerance level established by a domain expert. While persistent rules may have a confidence level different from similar association rules and may not classify all future instances of data, persistent strong rules improve decision making by narrowing the focus to rules that are the most robust, consistent, and noteworthy.

In this case, the persistent strong rule discovery process is demonstrated in the area of voting behavior. In the likely voter data set, the process resulted in three association rules, of which one persistent strong rule and one persistent rule were identified. In the vote choice data set, the process resulted in 33 association rules, of which 11 persistent strong rules and 10 persistent rules were identified. The persistent strong rule discovery process suggests these rules are the most robust, consistent, and noteworthy of the much larger potential rule sets.
REFERENCES

Adomavicius, G., & Tuzhilin, A. (2001). Expert-driven Validation of Rule-based User Models in Personalization Applications. Data Mining and Knowledge Discovery, 5(1-2), 33–58. doi:10.1023/A:1009839827683

Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining Association Rules between Sets of Items in Large Databases. In Proceedings of 1993 ACM SIGMOD International Conference on Management of Data (pp. 207-216). Washington, D.C.

American National Election Studies (ANES). (2005). Center for Political Studies. Ann Arbor, MI: University of Michigan.

Aslandogan, Y. A., Mahajani, G. A., & Taylor, S. (2004). Evidence combination in Medical Data Mining. In Proceedings of International Conference on Information Technology: Coding and Computing (ITCC'04), Vol. 2, (pp. 465-469). Las Vegas, NV.

Babu, T. R., Murty, M. N., & Agrawal, V. K. (2004). Hybrid Learning Scheme for Data Mining Applications. In Fourth International Conference on Hybrid Intelligent Systems (HIS'04) (pp. 266-271). Kitakyushu, Japan.

Bagui, S. (2006). An approach to Mining Crime Patterns. International Journal of Data Warehousing and Mining, 2(1), 50–80.

Bagui, S., Just, J., & Bagui, S. C. (2008). Deriving Strong Association Mining Rules using a Dependency Criterion, the Lift Measure. International Journal of Data Analysis Techniques and Strategies, 1(3), 297–312. doi:10.1504/IJDATS.2009.024297

Chandra, C., & Grabis, J. (2008). A Goal Model-driven Supply Chain Design. International Journal of Data Analysis Techniques and Strategies, 1(3), 224–241. doi:10.1504/IJDATS.2009.024294
Deshpande, M., & Karypis, G. (2002). Using Conjunction of Attribute Values for Classification. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (pp. 356-364). McLean, VA.

Dietterich, T. G. (2000). Ensemble methods in Machine Learning. In Proceedings of the First International Workshop on Multiple Classifier Systems (pp. 1-15). Cagliari, Italy.

Dy, J. G., & Brodley, C. E. (2004). Feature Selection for Unsupervised Learning. Journal of Machine Learning Research, 5, 845–889.

Edsall, T. B. (2006). Democrats' Data Mining Stirs an Intraparty Battle. The Washington Post, March 8, A1.

Fu, X., & Wang, L. (2005). Data Dimensionality Reduction with application to improving Classification Performance and explaining Concepts of Data Sets. International Journal of Business Intelligence and Data Mining, 1(1), 65–87. doi:10.1504/IJBIDM.2005.007319

Geng, L., & Hamilton, H. J. (2006). Interestingness measures for Data Mining: A Survey. ACM Computing Surveys, 38(3), 9–14. doi:10.1145/1132960.1132963

Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. Boston, MA: Morgan Kaufman.

Hsu, J. (2002). Data Mining Trends and Developments: The Key Data Mining Technologies and Applications for the 21st Century. In Proceedings of 19th Annual Conference for Information Systems Education (ISECON 2002), (Art 224b). San Antonio, TX.

Jaroszewicz, S., & Simovici, D. A. (2004). Interestingness of Frequent Itemsets using Bayesian Networks as Background Knowledge. In Proceedings of 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 178-186). Seattle, WA.
Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., & Verkamo, A. I. (1994). Finding Interesting Rules from Large Sets of Discovered Association Rules. In Proceedings of Third International Conference on Information and Knowledge Management (CIKM'94) (pp. 401-408). Gaithersburg, Maryland, USA.

Kumar, D. A., & Ravi, V. (2008). Predicting Credit Card Customer Churn in Banks using Data Mining. International Journal of Data Analysis Techniques and Strategies, 1(1), 4–28. doi:10.1504/IJDATS.2008.020020

Lane, T., & Brodley, C. E. (2003). An Empirical Study of Two Approaches to Sequence Learning for Anomaly Detection. Machine Learning, 51(1), 73–107. doi:10.1023/A:1021830128811

Li, J., Tang, J., Li, Y., & Luo, Q. (2009). RiMOM: A Dynamic Multistrategy Ontology Alignment Framework. IEEE Transactions on Knowledge and Data Engineering, 21(8), 1218–1232. doi:10.1109/TKDE.2008.202

Li, W., Han, J., & Pei, J. (2001). CMAR: Accurate and Efficient Classification based on Multiple Class-Association Rules. In Proceedings of 2001 IEEE International Conference on Data Mining (pp. 369-376). San Jose, CA.

Liu, B., Hsu, W., & Ma, Y. (1998). Integrating Classification and Association Rule Mining. In Proceedings of 4th International Conference on Knowledge Discovery and Data Mining (pp. 27-31). New York, NY.

Murray, G. R., Riley, C., & Scime, A. (2007). A New Age Solution for an Age-old problem: Mining Data for Likely Voters. Paper presented at the 62nd Annual Conference of the American Association of Public Opinion Research, Anaheim, CA.

Murray, G. R., Riley, C., & Scime, A. (2009). Pre-election Polling: Identifying likely voters using Iterative Expert Data Mining. Public Opinion Quarterly, 73(1), 159–171. doi:10.1093/poq/nfp004
Murray, G. R., & Scime, A. (2010). Microtargeting and Electorate Segmentation: Data Mining the American National Election Studies. Journal of Political Marketing, 9(3), 143–166. doi:10.1080/15377857.2010.497732
Silberschatz, A., & Tuzhilin, A. (1996). What makes Patterns interesting in Knowledge Discovery. IEEE Transactions on Knowledge and Data Engineering, 8(6), 970–974. doi:10.1109/69.553165
National Academy of Sciences. (2008). Science, Evolution, and Creationism. Washington, D.C.: National Academies Press.
Srikant, R., & Agrawal, R. (1995). Mining generalized Association Rules. In Proceedings of the 21st VLDB Conference (pp. 407-419). Zurich, Switzerland.
Padmanabhan, B., & Tuzhilin, A. (2000). Small is Beautiful: Discovering the minimal set of Unexpected Patterns. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 54-63). Boston, MA.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann.

Rajasethupathy, K., Scime, A., Rajasethupathy, K. S., & Murray, G. R. (2009). Finding "Persistent Rules": Combining Association and Classification Results. Expert Systems with Applications, 36(3, Part 2), 6019–6024.

Ramu, K., & Ravi, V. (2008). Privacy preservation in Data Mining using Hybrid Perturbation methods: An application to Bankruptcy Prediction in Banks. International Journal of Data Analysis Techniques and Strategies, 1(4), 313–331. doi:10.1504/IJDATS.2009.027509

Scime, A., & Murray, G. R. (2007). Vote prediction by Iterative Domain Knowledge and Attribute Elimination. International Journal of Business Intelligence and Data Mining, 2(2), 160–176. doi:10.1504/IJBIDM.2007.013935

Scime, A., Murray, G. R., & Hunter, L. Y. (2010). Testing Terrorism Theory with Data Mining. International Journal of Data Analysis Techniques and Strategies, 2(2), 122–139. doi:10.1504/IJDATS.2010.032453
Su, X., Khoshgoftaar, T. M., & Greiner, R. (2009). Making an Accurate Classifier Ensemble by Voting on Classifications from Imputed Learning Sets. International Journal of Information and Decision Sciences, 1(3), 301–322. doi:10.1504/IJIDS.2009.027657

Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston: Addison Wesley.

Toivonen, H., Klemettinen, M., Ronkainen, P., Hätönen, K., & Mannila, H. (1995). Pruning and Grouping of Discovered Association Rules. In Proceedings of ECML-95 Workshop on Statistics, Machine Learning, and Discovery in Databases (pp. 47-52). Heraklion, Crete, Greece.

Tozicka, J., Rovatsos, M., & Pechoucek, M. (2007). A Framework for Agent-based Distributed Machine Learning and Data Mining. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, (Art 96). Honolulu, HI.

Vatsavai, R. R., & Bhaduri, B. (2007). A Hybrid Classification Scheme for Mining Multisource Geospatial Data. In Proceedings of the Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007) (pp. 673-678). Omaha, NE.
Wagstaff, K., Cardie, C., Rogers, S., & Schrödl, S. (2001). Constrained k-means Clustering with Background Knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. 577-584). Williamstown, MA.

Wang, G., Zhang, C., & Huang, L. (2008). A Study of Classification Algorithm for Data Mining based on Hybrid Intelligent Systems. In Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (pp. 371-375). Phuket, Thailand.

Wang, Y., Zhang, Y., Xia, J., & Wang, Z. (2008). Segmenting the Mature Travel Market by Motivation. International Journal of Data Analysis Techniques and Strategies, 1(2), 193–209. doi:10.1504/IJDATS.2008.021118

Webb, G. I., & Zheng, Z. (2004). Multistrategy Ensemble Learning: Reducing Error by Combining Ensemble Learning Techniques. IEEE Transactions on Knowledge and Data Engineering, 16(8), 980–991. doi:10.1109/TKDE.2004.29

Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). San Francisco: Morgan Kaufman.

Zaki, M. J. (2004). Mining Non-redundant Association Rules. Data Mining and Knowledge Discovery, 9, 223–248. doi:10.1023/B:DAMI.0000040429.96086.c7

Zhao, Y., Zhang, C., & Zhang, S. (2005). Discovering Interesting Association Rules by Clustering. AI 2004: Advances in Artificial Intelligence, 3335, 1055-1061. Heidelberg: Springer.

Zhong, N., Yao, Y. Y., Ohshima, M., & Ohsuga, S. (2001). Interestingness, Peculiarity, and Multidatabase Mining. In First IEEE International Conference on Data Mining (ICDM'01) (pp. 566-574). San Jose, California.
ADDITIONAL READING

Ankerst, M., Ester, M., & Kriegel, H. (2000). Towards an Effective Cooperation of the User and the Computer for Classification. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 179-188). Boston, MA.

Birrer, F. A. (2005). Data Mining to Combat Terrorism and the Roots of Privacy Concerns. Ethics and Information Technology, 7(4), 211–220. doi:10.1007/s10676-006-0010-6

Blos, M. F., Wee, H.-M., & Cardenas-Barron, L. E. (2009). The Threat of Outsourcing US Ports Operation to any Terrorist Country Supporter: A Case Study using Fault Tree Analysis. International Journal of Information and Decision Sciences, 1(4), 411–427. doi:10.1504/IJIDS.2009.027760

Cao, L., Zhao, Y., Zhang, H., Luo, D., Zhang, C., & Park, E. K. (2009). Flexible Frameworks for Actionable Knowledge Discovery. IEEE Transactions on Knowledge and Data Engineering. doi:10.1109/TKDE.2009.143

Chen, H., Reid, E., Sinai, J., Silke, A., & Ganor, B. (2008). Terrorism Informatics: Knowledge Management and Data Mining for Homeland Security. New York: Springer.

Dalvi, N., Domingos, P., Mausam, Sanghai, S., & Verma, D. (2004). Adversarial Classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 99-108). Seattle, WA.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification (2nd ed.). New York: Wiley.

Giarratano, J. C., & Riley, G. D. (2004). Expert Systems: Principles and Programming (4th ed.). New York: Course Technology.
Hofmann, M., & Tierney, B. (2003). The involvement of Human Resources in Large Scale Data Mining Projects. In Proceedings of the 1st International Symposium on Information and Communication Technologies (pp. 103-109). Dublin, Ireland.

Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4–37. doi:10.1109/34.824819

Kass, G. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 119–127. doi:10.2307/2986296

Kim, B., & Landgrebe, D. (1991). Hierarchical Decision Classifiers in High-Dimensional and Large Class Data. IEEE Transactions on Geoscience and Remote Sensing, 29(4), 518–528. doi:10.1109/36.135813

Lee, J. (2008). Exploring Global Terrorism Data: A Web-based Visualization of Temporal Data. Crossroads, 15(2), 7–14. doi:10.1145/1519390.1519393

Magidson, J. (1994). The CHAID Approach to Segmentation Modeling: Chi-squared Automatic Interaction Detection. In Bagozzi, R. P. (Ed.), Advanced Methods of Marketing Research. Cambridge, MA: Basil Blackwell.

Memon, N., Hicks, D. L., & Larsen, H. L. (2007). Harvesting Terrorists Information from Web. In Proceedings of the 11th International Conference Information Visualization (pp. 664-671). Washington, DC.

Memon, N., & Qureshi, A. R. (2005). Investigative data mining and its application in counterterrorism. In Proceedings of the 5th WSEAS International Conference on Applied Informatics and Communications (pp. 397-403). Malta.
Mingers, J. (1989). An Empirical Comparison of Pruning methods for Decision Tree Induction. Machine Learning, 4, 227–243. doi:10.1023/A:1022604100933

Murray, G. R., Hunter, L. Y., & Scime, A. (2009). Testing Terrorism using Iterative Expert Data Mining. In Proceedings of the 2009 International Conference on Data Mining (DMIN 2009) (pp. 565-570). Las Vegas, NV.

Murthy, S. K. (1998). Automatic construction of Decision Trees from Data: A Multi-disciplinary Survey. Data Mining and Knowledge Discovery, 2(4), 345–389. doi:10.1023/A:1009744630224

Quinlan, J. R. (1979). Discovering Rules by Induction from Large collection of Examples. In Michie, D. (Ed.), Expert Systems in the Micro Electronic Age. Edinburgh, Scotland: Edinburgh University Press.

Quinlan, J. R. (1987). Simplifying Decision Trees. International Journal of Man-Machine Studies, 27, 221–234. doi:10.1016/S0020-7373(87)80053-6

Scime, A., Murray, G. R., Huang, W., & Brownstein-Evans, C. (2008). Data Mining in the Social Sciences and Iterative Attribute Elimination. In Taniar, D. (Ed.), Data Mining and Knowledge Discovery Technologies. Hershey, PA: IGI Publishing.

Seno, M., & Karypis, G. (2001). LPMiner: An algorithm for finding Frequent Itemsets using Length-Decreasing Support Constraint. In Proceedings of the 2001 IEEE International Conference on Data Mining (pp. 505-512). San Jose, CA.

Turban, E., McLean, E., & Wetherbe, J. (2004). Information Technology for Management (3rd ed.). New York: Wiley.

Yi, X., & Zhang, Y. (2007). Privacy-Preserving Distributed Association Rule Mining via Semitrusted Mixer. Data & Knowledge Engineering, 63(2), 550–567. doi:10.1016/j.datak.2007.04.001
KEY TERMS AND DEFINITIONS

Association Mining: A data mining method used to find patterns of data that show conditions where sets of attribute-value pairs occur frequently in the data set.

Association Rule: A rule found by association mining.

Classification Mining: A data mining method used to find models of data for categorizing instances; typically used for predicting future events from historical data.

Classification Rule: A rule found by classification mining.

Persistent Rule: A rule common to more than one data mining method.

Persistent Strong Rule: An association rule that is both persistent, by appearing in a classification decision tree, and strong, by exhibiting a similar confidence level in the classification rule.

Strong Rule: A rule found by association or classification mining that meets a domain expert's defined confidence level.
APPENDIX

ANES survey items in the likely voter data set:

Was respondent's vote validated? (A_Voteval_V2)
0. No record of respondent voting.
1. Yes.

"On the coming Presidential election, do you plan to vote?" (A_Intent)
0. No
1. Yes

"Do you remember for sure whether or not you voted in that [previous] election?" (A_Prevvote)
0. Respondent did not vote in previous election or has never voted
1. Voted: Democratic/Republican/Other

ANES survey items in the vote choice data set:

Discrete-valued questions (attribute names)

What is the highest degree that you have earned? (aeduc)
1. 8 grades or less
2. 9–12 grades, no diploma/equivalency
3. 12 grades, diploma or equivalency
4. 12 grades, diploma or equivalency plus non-academic training
5. Some college, no degree; junior/community college level degree (AA degree)
6. BA level degrees
7. Advanced degrees including LLB

Some people do not pay much attention to political campaigns. How about you, would you say that you have been/were very much interested, somewhat interested, or not much interested in the political campaigns this year? (aintelect)
1. Not much interested
2. Somewhat interested
3. Very much interested

Some people seem to follow what is going on in government and public affairs most of the time, whether there is an election going on or not. Others are not that interested. Would you say you follow what is going on in government and public affairs most of the time, some of the time, only now and then, or hardly at all? (aintpubaff)
1. Hardly at all
2. Only now and then
3. Some of the time
4. Most of the time
How do you identify yourself in terms of political parties? (apid)
–3. Strong Republican
–2. Weak or leaning Republican
0. Independent
2. Weak or leaning Democrat
3. Strong Democrat

In addition to being American, what do you consider your main ethnic group or nationality group? (arace)
1. White
2. Black
3. Asian
4. Native American
5. Hispanic
6. Other

Who do you think will be elected President in November? (awhoelect)
1. Democratic candidate
2. Republican candidate
3. Other candidate
Continuous-Valued Questions

Feeling thermometer questions. A measure of feelings. Ratings between 50 and 100 degrees mean a favorable and warm feeling; ratings between 0 and 50 degrees mean the respondent does not feel favorably. The 50 degree mark is used if the respondent does not feel particularly warm or cold:

Feeling about Democratic Presidential Candidate. (demtherm) Discretization ranges: (-inf, 42.5], (42.5, 54.5], (54.5, 62.5], (62.5, 77.5], (77.5, inf)

Feeling about Republican Presidential Candidate. (reptherm) Discretization ranges: (-inf, 42.5], (42.5, 53.5], (53.5, 62.5], (62.5, 79.5], (79.5, inf)

Feeling about Republican Vice Presidential Candidate. (repvptherm) Discretization ranges: (-inf, 32.5], (32.5, 50.5], (50.5, 81.5], (81.5, inf)

Affect questions. The number of 'likes' mentioned by the respondent minus the number of 'dislikes' mentioned:

Affect toward the Democratic Party. (affdem) Discretization ranges: (-inf, -1.5], (-1.5, -0.5], (-0.5, 0.5], (0.5, 1.5], (1.5, inf)

Affect toward Democratic presidential candidate. (affdemcand) Discretization ranges: (-inf, -1.5], (-1.5, -0.5], (-0.5, 0.5], (0.5, 2.5], (2.5, inf)

Affect toward Republican Party. (affrep) Discretization ranges: (-inf, -2.5], (-2.5, -0.5], (-0.5, 0.5], (0.5, 2.5], (2.5, inf)

Affect toward Republican presidential candidate. (affrepcand) Discretization ranges: (-inf, -2.5], (-2.5, -0.5], (-0.5, 0.5], (0.5, 2.5], (2.5, inf)

This work was previously published in Knowledge Discovery Practices and Emerging Applications of Data Mining: Trends and New Domains, edited by A.V. Senthil Kumar, pp. 85-107, copyright 2011 by Information Science Reference (an imprint of IGI Global).
Chapter 3
Data Discovery Approaches for Vague Spatial Data

Frederick E. Petry
Naval Research Laboratory, USA

DOI: 10.4018/978-1-4666-2455-9.ch003
ABSTRACT

This chapter focuses on the application of the discovery of association rules to vague spatial databases. The background of data mining and of uncertainty representations using rough set and fuzzy set techniques is provided. The extensions of association rule extraction for uncertain data, as represented by rough and fuzzy sets, are described. Finally, an example of rule extraction for both types of uncertainty representations is given.
INTRODUCTION

Data mining or knowledge discovery generally refers to a variety of techniques that have developed in the fields of databases, machine learning (Alpaydin, 2004) and pattern recognition (Han & Kamber, 2006). The intent is to uncover useful patterns and associations from large databases. For complex data such as that found in spatial databases (Shekar & Chawla, 2003) the problem of data discovery is more involved (Lu et al., 1993; Miller & Han, 2009). Spatial data has traditionally been the domain of geography, with various forms of maps as the standard representation. With the advent of
computerization of maps, geographic information systems (GIS) have come to the fore, with spatial databases storing the underlying point, line and area structures needed to support GIS (Longley et al., 2001). A major difference between data mining in ordinary relational databases (Elmasri & Navathe, 2010) and in spatial databases is that attributes of the neighbors of some object of interest may have an influence on the object and therefore have to be considered as well. The explicit location and extension of spatial objects define implicit relations of spatial neighborhood (such as topological, distance and direction relations), which are used by spatial data mining algorithms (Ester et al., 2000). Additionally, when we wish to consider vagueness or uncertainty in the spatial data mining process
(Burrough & Frank, 1996; Zhang & Goodchild, 2002), an additional level of difficulty is added. In this chapter we describe one of the most common data mining approaches, discovery of association rules, for spatial data, where we consider uncertainty in the extracted rules as represented by both fuzzy set and rough set techniques.
BACKGROUND

Data Mining

Although we are primarily interested here in specific algorithms of knowledge discovery, we will first review the overall process of data mining (Tan, Steinbach & Kumar, 2005). The initial steps are concerned with preparation of data, including data cleaning intended to resolve errors and missing data and integration of data from multiple heterogeneous sources. Next are the steps needed to prepare for actual data mining. These include the selection of the specific data relevant to the task and the transformation of this data into a format required by the data mining approach. These steps are sometimes considered to be those in the development of a data warehouse (Golfarelli & Rizzi, 2009), i.e., an organized format of data available for various data mining tools. There are a wide variety of specific knowledge discovery algorithms that have been developed (Han & Kamber, 2006). These discover patterns that can then be evaluated based on some interestingness measure used to prune the huge number of available patterns. Finally, as is true for any decision aid system, an effective user interface with visualization/alternative representations must be developed for the presentation of the discovered knowledge. Specific data mining algorithms can be considered as belonging to two categories: descriptive and predictive data mining. In the descriptive category are class description, association rules and classification. Class description can either provide a characterization or generalization of
the data or comparisons between data classes to provide class discriminations. Association rules are the main focus of this chapter and correspond to correlations among the data items (Hipp et al., 2000). They are often expressed in rule form showing attribute-value conditions that commonly occur at the same time in some set of data. An association rule of the form X → Y can be interpreted as meaning that the tuples in the database that satisfy the condition X also are “likely” to satisfy Y, so that the “likely” implies this is not a functional dependency in the formal database sense. Finally, a classification approach analyzes the training data (data whose class membership is known) and constructs a model for each class based on the features in the data. Commonly, the outputs generated are decision trees or sets of classification rules. These can be used both for the characterization of the classes of existing data and to allow the classification of data in the future, and so can also be considered predictive. Predictive analysis is also a very developed area of data mining. One very common approach is clustering (Mishra et al., 2004). Clustering analysis identifies the collections of data objects that are similar to each other. The similarity metric is often a distance function given by experts or appropriate users. A good clustering method produces high quality clusters to yield low intercluster similarity and high intra-cluster similarity. Prediction techniques are used to predict possible missing data values or distributions of values of some attributes in a set of objects. First, one must find the set of attributes relevant to the attribute of interest and then predict a distribution of values based on the set of data similar to the selected objects. There are a large variety of techniques used, including regression analysis, correlation analysis, genetic algorithms and neural networks to mention a few. Finally, a particular case of predictive analysis is time-series analysis. This technique considers a large set of time-based data to discover regularities and interesting characteristics (Shasha &
Zhu, 2004). One can search for similar sequences or subsequences, then mine sequential patterns, periodicities, trends and deviations.
Uncertainty Representations

In this section we overview the uncertainty representations we will use for data discovery in spatial data, specifically rough sets and fuzzy set similarity relationships.
Rough Set Theory

Rough set theory, introduced by Pawlak (Pawlak, 1984; Polkowski, 2002), is a technique for dealing with uncertainty and for identifying cause-effect relationships in databases as a form of database learning. Rough sets involve the following:

• U is the universe, which cannot be empty;
• R is the indiscernibility relation, or equivalence relation;
• A = (U, R), an ordered pair, is called an approximation space;
• [x]R denotes the equivalence class of R containing x, for any element x ∈ U;
• elementary sets in A are the equivalence classes of R;
• a definable set in A is any finite union of elementary sets in A.

Therefore, for any given approximation space defined on some universe U and having an equivalence relation R imposed upon it, U is partitioned into equivalence classes called elementary sets, which may be used to define other sets in A. Given that X ⊆ U, X can be defined in terms of definable sets in A as below:
• the lower approximation of X in A is the set RX = {x ∈ U | [x]R ⊆ X};
• the upper approximation of X in A is the set R̄X = {x ∈ U | [x]R ∩ X ≠ ∅}.
Another way to describe the set approximations is as follows. Given the upper and lower approximations R̄X and RX of X, a subset of U, the R-positive region of X is POSR(X) = RX, the R-negative region of X is NEGR(X) = U − R̄X, and the boundary or R-borderline region of X is BNR(X) = R̄X − RX. X is called R-definable if and only if RX = R̄X. Otherwise, RX ≠ R̄X and X is rough with respect to R. In Figure 1 the universe U is partitioned into equivalence classes denoted by the squares. Those elements in the lower approximation of X, POSR(X), are denoted with the letter P and elements in the R-negative region by the letter N. All other classes belong to the boundary region of the upper approximation.

Figure 1. Example of a rough set X
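A minimal Python sketch of these definitions may help make them concrete; the universe, partition, and target set X below are invented for illustration.

# Minimal sketch of rough approximation regions. The partition plays the
# role of the elementary sets of the approximation space A = (U, R).

def approximations(partition, X):
    """partition: list of equivalence classes (sets) covering the universe U.
    Returns the lower and upper approximations of X."""
    lower = set().union(*([c for c in partition if c <= X] or [set()]))
    upper = set().union(*([c for c in partition if c & X] or [set()]))
    return lower, upper

U_partition = [{1, 2}, {3, 4}, {5, 6}]   # elementary sets of (U, R)
X = {1, 2, 3}
RX, RbarX = approximations(U_partition, X)
print(RX)            # {1, 2}: the positive region POS_R(X)
print(RbarX)         # {1, 2, 3, 4}: the upper approximation
print(RbarX - RX)    # {3, 4}: the boundary region BN_R(X)
# NEG_R(X) = U - upper approximation = {5, 6}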
Fuzzy Set Theory

Fuzzy set theory is an approach in which the elements of a set belong to the set to varying degrees, known as membership degrees. Conventionally we can specify a set C by its characteristic function, CharC(x). If U is the universal set from which values of C are taken, then we can represent C as

C = {x | x ∈ U and CharC(x) = 1}
This is the representation for a crisp or non-fuzzy set. For an ordinary set C the range of CharC(x) is just the two values {0, 1}. However, for a fuzzy set A we have a range of the entire interval [0, 1]. That is, for a fuzzy set the characteristic function takes on all values between 0 and 1 and not just the discrete values of 0 or 1 representing the binary choice for membership in a conventional crisp set. For a fuzzy set the characteristic function is often called the membership function and denoted μA(x). One fuzzy set concept that we employ particularly in databases is the similarity relation, S(x, y), denoted also as xSy. For a given domain D this is a mapping of every pair of values in the particular
domain onto the unit interval [0,1], which reflects the level of similarity between them. A similarity relation is reflexive and symmetric, like a traditional identity relation; however, special forms of transitivity are used. So a similarity relation has the following three properties, for x, y, z ∈ D (Zadeh, 1970; Buckles & Petry, 1982):

1. Reflexive: sD(x, x) = 1
2. Symmetric: sD(x, y) = sD(y, x)
3. Transitive: sD(x, z) ≥ Maxy(Min[sD(x, y), sD(y, z)])  (T1)

This particular max-min form of transitivity is known as T1 transitivity. Another useful form is T2, also known as max-product:

3'. Transitive: sD(x, z) ≥ Maxy(sD(x, y) * sD(y, z))  (T2)

where * is arithmetic multiplication. An example of a similarity relation satisfying T2 transitivity is:

sD(x, y) = e^(–β*|y–x|)

where β > 0 is an arbitrary constant and x, y ∈ D.

So we can see that different aspects of uncertainty are dealt with in fuzzy set or rough set representations. The major rough set concepts of interest are the use of an indiscernibility relation to partition domains into equivalence classes and
the concept of lower and upper approximation regions to allow the distinction between certain and possible, or partial, inclusion in a rough set. The indiscernibility relation allows us to group items based on some definition of 'equivalence' as it relates to the application domain. A complementary approach to rough set uncertainty management is fuzzy set theory. Instead of the "yes-maybe-no" approach to belonging to a set, a more gradual membership value approach is used. An object belongs to a fuzzy set to some degree. This contrasts with the more discrete representation of uncertainty from indiscernibility relations or rough set theory.
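As a small illustration of a similarity relation, the following Python sketch uses the exponential form just given (with an arbitrary β = 0.5) and checks the reflexive, symmetric, and T2 properties on a sample numeric domain; the domain values are invented for illustration.

import math
from itertools import product

# Sketch of a similarity relation on a numeric domain; beta = 0.5 is an
# arbitrary illustrative constant.
beta = 0.5

def s(x, y):
    return math.exp(-beta * abs(y - x))

D = [0.0, 1.0, 2.5]
assert all(s(x, x) == 1 for x in D)                                    # reflexive
assert all(abs(s(x, y) - s(y, x)) < 1e-12 for x, y in product(D, D))   # symmetric
# T2 (max-product) transitivity: s(x, z) >= max_y s(x, y) * s(y, z),
# which holds for this form by the triangle inequality on |x - z|.
assert all(s(x, z) >= max(s(x, y) * s(y, z) for y in D) - 1e-12
           for x, z in product(D, D))
print(s(0.0, 1.0))   # similarity of 0.0 and 1.0, about 0.607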
VAGUE QUERYING FOR SPATIAL DATA MINING

Here we describe the management of uncertainty in the data-mining query. Such a query develops the relation or table on which the data mining algorithm is applied, similar to the approach of GeoMiner (Koperski & Han, 1995; Han et al., 1997). The issue of concern is the form of the resultant relation obtained from the querying involving the spatial data and the representation of uncertainty. We will describe this for both fuzzy set and rough set approaches.

A crucial aspect of the query for the formulation of the data over which the data mining algorithm will operate is the selection of a spatial predicate that identifies the specific spatial region or area of interest (AOI). This is closely related to the property that causes objects of interest to cluster in space, which is the so-called first law of geography: "Everything is related to everything else but nearby things are more related than distant things" (Tobler, 1979). A common choice for this is some distance metric such as a NEAR predicate; however, other spatial predicates such as the topological relationships contains or intersects could also be used.
Let us consider an SQL form of a query:

SELECT Attributes A, B
FROM Relation X
WHERE (X.A = α and NEAR(X.B, β))
AT Threshold Levels = M, N
Since we are considering approximate matching of the spatial predicates, we must specify for each such predicate the threshold degree of matching below which the data will not appear in the resultant relation.

Fuzzy Set Querying Approach

To match the values in the query we have, for the attribute A, a spatially related attribute, a similarity table of its domain values; for B, a spatial attribute such as location, the NEAR predicate can be evaluated. Since these values may not be exact matches, the intermediate resultant relation Rint will have to maintain the degree of matching. The final query results are chosen based on the values specified in the threshold clause: results for attribute A are based on the first value M, and similarly those for B are based on N. The level values are typically user specified as linguistic terms that correspond to such values (Petry, 1996). The intermediate step of the query evaluation is shown in Table 1.

Table 1. The intermediate result of the SQL form query

RESint
    A       B
    ……      ……

In the table, ai is a value of A from X, and μai is the similarity, sim, of ai and α:

μai = sim(ai, α)

For example, let the domain be soil types; if ai = loam and α = peat then the similarity might be

sim(loam, peat) = 0.75

and if the threshold level value M in the query were lower than 0.75, we could possibly retain this in the final relation as the value and membership pair

<loam, 0.75>

Similarly for the location attribute, where bi = (13.1, 74.5) and β = (12.9, 74.1), we might have

<(13.1, 74.5), 0.78>

if the coordinates (in some measure) are "near" by a chosen fuzzy distance measure:

μnear = NEAR((13.1, 74.5), (12.9, 74.1)) = 0.78

Figure 2 shows an example of a fuzzy NEAR function that might be used here to represent "within a distance of about 5 kilometers or less." Such a fuzzy function can be represented generally, parameterized by inflection points a, b, on a membership function Fμ(x, a, b) as shown below:
Fμ(x, a, b) =
    1                  if x ≤ a
    (b – x)/(b – a)    if a ≤ x < b
    0                  if x ≥ b

Specifically, using this representation, the fuzzy membership function in Figure 2 is Fμ(x, 5, 6).
Figure 2. Fuzzy membership function for distance
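The piecewise function above is straightforward to implement. The following Python sketch reproduces the Figure 2 function Fμ(x, 5, 6); the sample distances are illustrative.

# Sketch of the parameterized membership function F_mu(x, a, b) defined
# above; with a = 5, b = 6 it models "within about 5 kilometers or less."

def f_mu(x, a, b):
    if x <= a:
        return 1.0
    if x < b:
        return (b - x) / (b - a)   # linear descent between the inflection points
    return 0.0

def near(distance_km):
    return f_mu(distance_km, 5, 6)

print(near(4.2))   # 1.0 -- definitely near
print(near(5.5))   # 0.5 -- partially near
print(near(7.0))   # 0.0 -- not near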
Now in the final resultant relation, rather than retaining these individual memberships, the data mining algorithm will be simplified if we formulate a combined membership value. Thus the final result is obtained by evaluating each tuple relative to the threshold values and assigning a tuple membership value based on the individual attribute memberships combined by the commonly used min operator. So the tuple < ai, bi, μt > will appear in the final result relation RES if and only if (μai > M) and (μbi > N) and the tuple membership is μt = min (μai, μbi) If the rows (tuples) 1 and 3 from Rint are such that the memberships for both columns are above their respective thresholds, then these are retained. However for tuple 2, let it be the case that μa2 > M, but the second attribute’s membership is μb2 < N. Then this tuple will not appear in the final result relation shown in Table 2.
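The following Python sketch illustrates this final selection step; the intermediate tuples and threshold values are invented for illustration and echo the loam/location example above.

# Sketch of the final selection step: attribute memberships are compared to
# the thresholds M and N, and retained tuples get the min of their attribute
# memberships as the tuple membership.

def final_relation(r_int, M, N):
    res = []
    for a_val, mu_a, b_val, mu_b in r_int:
        if mu_a > M and mu_b > N:
            res.append((a_val, b_val, min(mu_a, mu_b)))  # tuple membership
    return res

r_int = [("loam", 0.75, (13.1, 74.5), 0.78),
         ("clay", 0.40, (12.9, 74.2), 0.95)]
print(final_relation(r_int, M=0.70, N=0.70))
# [('loam', (13.1, 74.5), 0.75)] -- the clay tuple fails the M threshold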
Table 2. The final result relation of the SQL format query using threshold values

RES
    A      B      μt
    a1     b1     min(μa1, μb1)
    a3     b3     min(μa3, μb3)
    ……     ……     ……
    ai     bi     min(μai, μbi)

Rough Set Querying Approach

For a rough set approach, predicates such as NEAR will be formulated as rough selection predicates, as previously done in rough querying of crisp databases (Beaubouef & Petry, 1994). Spatial predicates for the rule discovery query can be made more realistic and effective by a consideration of vague regions and their spatial relationships. Consider the NEAR predicate in the query of the previous example. Several possibilities exist for defining predicates for such a spatial relationship, in which the application dependency of terms and parameters has to be taken into account (Clementini & DeFelice, 1997). Although the definition of "near" is subjective and determined by a specialist investigating the data and doing the actual data mining, we can still make some general observations about nearness that should apply in all cases. If the lower approximations of two vague regions X and Y intersect (Beaubouef & Petry, 2004), they are definitely near, i.e.

RX ∩ RY ≠ ∅

This can be restated as: if the positive regions of the two vague regions overlap, then they are considered near. In this case, the positions of the upper approximation regions are irrelevant. Figure 3 illustrates several of these cases. Let region X be the first "egg" (dashed line) and Y the second (dotted line) in each pair. The "yolk" of the egg is the lower approximation and the outer part is the upper approximation. So, for example, in the figure on the left the yolks (the positive regions) completely overlap. The other two show some degree of overlap of the positive regions, illustrating the nearness of X and Y.

Figure 3. Overlap of positive regions

We should also consider the possibility that two vague regions are near if their lower approximations are within a prescribed distance of each other. This might be represented in several ways, including measuring the distances between the centroids of the certain regions, using the distances between minimum bounding rectangles, or simply comparing the distances between all points on the edges of the boundaries. For example, two towns might not be overlapping if they are separated from each other by a river. However, we would still want to say that they are near based on the ground distance measured between them. Depending on the application there could be different distance measures. If the distance measure were road distance and there were no bridges or ferries across the river within, say, 20 kilometers, we would no longer consider the towns to be near. Incorporating some of the points discussed, we can give a more general form of a NEAR predicate as needed for both the lower and the upper approximation (see Box 1). DISTANCE would have been defined previously based on various criteria as we have discussed. N1 and N2 are the specific distance thresholds entered by the user. The term h can be considered as a certainty factor for the matching
in the predicate, with c and d being user/expert provided parameters. This factor could then be maintained for each tuple in the query result RES and act as a weighting factor for the rough support count for each value in the Apriori algorithm to be described in the next section. There could also be additional terms added to the NEAR predicate depending on the application. For example, if one wanted to be sure to include the possible regions' (upper approximations) distance as a factor for inclusion in the boundary result of the query, then we could include the term

If DISTANCE(R̄X, R̄Y) < N3, (h = e)

in the boundary region predicate. Other terms can be developed similarly, as well as predicates for other relationships such as INTERSECTS, also utilizing rough set spatial approaches. The result of a query is then:

RES = {RT, R̄T}
Box 1.

Positive region (lower approximation):
NEAR(X, Y) = True if RX ∩ RY ≠ ∅, (h = 1)
             OR if DISTANCE(RX, RY) < N1, (h = c)

Boundary region:
NEAR(X, Y) = True if DISTANCE(RX, RY) < N2, (h = 1)
             OR if R̄X ∩ R̄Y ≠ ∅, (h = d)
and so we must take into account the lower approximation, RT, and upper approximation, R̄T, results of the rough query in developing the association rules.
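A hedged Python sketch of the Box 1 predicate follows; the point-set representation of vague regions, the dist() placeholder, and the parameter values are illustrative assumptions, since DISTANCE is application dependent.

# Sketch of the rough NEAR predicate of Box 1. Vague regions are modeled as
# (lower, upper) point sets; dist() stands in for an application-dependent
# DISTANCE measure, and N1, N2, c, d are user-supplied parameters.

def dist(region_a, region_b):
    # placeholder: minimum Euclidean distance between two point sets
    return min(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
               for ax, ay in region_a for bx, by in region_b)

def rough_near(X, Y, N1, N2, c=0.9, d=0.5):
    """X, Y: (lower, upper) approximations as sets of points.
    Returns (in_positive_region, certainty_factor_h) or None."""
    (lx, ux), (ly, uy) = X, Y
    if lx & ly:
        return True, 1.0     # positive region, h = 1
    if dist(lx, ly) < N1:
        return True, c       # positive region, h = c
    if dist(lx, ly) < N2:
        return False, 1.0    # boundary region, h = 1
    if ux & uy:
        return False, d      # boundary region, h = d
    return None              # not near

X = ({(0, 0), (0, 1)}, {(0, 0), (0, 1), (0, 2)})
Y = ({(0, 4)}, {(0, 3), (0, 4)})
print(rough_near(X, Y, N1=2, N2=5))   # (False, 1.0): boundary region, h = 1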
ASSOCIATION RULES

Association rules capture the idea of certain data items commonly occurring together and have often been considered in the analysis of a "market basket" of purchases. For example, a delicatessen retailer might analyze the previous year's sales and observe that of all purchases 30% were of both cheese and crackers and, for any of the sales that included cheese, 75% also included crackers. Then it is possible to conclude a rule of the form:

Cheese → Crackers

This rule is said to have a 75% degree of confidence and a 30% degree of support. A retailer could use such rules to aid in the decision process about issues such as placement of items in the store, marketing options such as advertisements and discounts, and so forth. In a spatial data context an analysis of various aspects of a certain region might produce a rule associating soils and vegetation such as:

Sandy soil → Scrub cover

that could be used by planning and environmental decision makers. This particular form of data mining is largely based on the Apriori algorithm developed by Agrawal (Agrawal et al., 1993). Let a database of possible data items be

D = {d1, d2, …, dn}

and the relevant set of transactions (sales, query results, etc.) be

R = {T1, T2, …} where Ti ⊆ D.

We are interested in discovering if there is a relationship between two sets of items (called itemsets) Xj, Xk; Xj, Xk ⊆ D. For such a relationship to be determined, the entire set of transactions in R must be examined and a count made of the number of transactions containing these sets, where a transaction Ti contains Xm if Xm ⊆ Ti. This count, called the support count of Xm, SCR(Xm), will be appropriately modified in the case of fuzzy set and rough set representations. There are then two measures used in determining rules:

• the percentage of Ti's in R that contain both Xj and Xk (i.e., Xj ∪ Xk), called the support s;
• the percentage of Ti's such that if Ti contains Xj then Ti also contains Xk, called the confidence c.

The support and confidence can be interpreted as probabilities: s – Prob(Xj ∪ Xk) and c – Prob(Xk | Xj).

We assume the system user has provided minimum values for these in order to generate only sufficiently interesting rules. A rule whose support and confidence exceed these minimums is called a strong rule. The overall process for finding strong association rules can be organized as a three-step process:

1. Determine frequent itemsets, commonly done with variations of the Apriori algorithm.
2. Extract strong association rules from the frequent itemsets.
3. Assess generated rules with interestingness measures.

The first step is to compute the frequent itemsets F, which are the subsets of items from D, such as {d2, d4, d5}. The support count SCR of each such subset must be computed and the frequent itemsets are then only those whose support count
exceeds the minimum support count specified. This is just the product of the minimum support specified and the number of transactions or tuples in R. For a large database this generation of all frequent itemsets can be very computationally expensive. The Apriori algorithm is an influential algorithm that makes this more computationally feasible. It basically uses an iterative level-wise search where sets of k items are used to consider sets at the next level of k+1 items. The Apriori property is used to prune the search, as seen in the discussion below. After the first and more complex step of determining the frequent itemsets, the strong association rules can easily be generated. The first step is to enumerate all subsets of each frequent itemset F: f1, f2, … fi …. Then for each fi, calculate the ratio of the support counts of F and fi, i.e.

SCR(F)/SCR(fi)

Note that all subsets of a frequent itemset are frequent (the Apriori property) and so the support counts of each subset will have been computed in the process of finding all frequent itemsets. This greatly reduces the amount of computation needed. If this ratio is greater than the minimum confidence specified then we can output the rule

fi → {F − fi}

The set of rules generated may then be further pruned by a number of correlation and heuristic measures.
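For concreteness, the following compact Python sketch implements a crisp Apriori-style level-wise search and the rule-extraction step along the lines just described; the market-basket transactions are invented for illustration.

from itertools import combinations

# A compact Apriori sketch (crisp case) following the outline above.

def apriori(transactions, min_sup_count):
    items = {frozenset([d]) for t in transactions for d in t}
    freq = {}
    level = {i for i in items
             if sum(i <= t for t in transactions) >= min_sup_count}
    while level:
        freq.update({s: sum(s <= t for t in transactions) for s in level})
        # candidate k+1 itemsets from frequent k itemsets; the Apriori
        # property lets us prune candidates with an infrequent subset
        cand = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = {c for c in cand
                 if all(frozenset(s) in freq for s in combinations(c, len(c) - 1))
                 and sum(c <= t for t in transactions) >= min_sup_count}
    return freq

def rules(freq, min_conf):
    out = []
    for F, sc in freq.items():
        for r in range(1, len(F)):
            for f in map(frozenset, combinations(F, r)):
                if sc / freq[f] >= min_conf:
                    out.append((set(f), set(F - f), sc / freq[f]))
    return out

T = [frozenset(t) for t in ({"cheese", "crackers"},
                            {"cheese", "crackers", "wine"},
                            {"cheese"}, {"wine"})]
print(rules(apriori(T, 2), 0.6))
# includes cheese -> crackers with confidence 2/3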
Extensions to Fuzzy and Rough Spatial Association Rules

Fuzzy Rules

We are now prepared to consider how to extend approaches to generating association rules to process the form of the fuzzy data we have developed for the spatial data query. Fuzzy data
mining for generating association rules has been considered by a number of researchers. There are approaches using the SETM (Set-oriented mining) algorithm (Shu et al., 2001) and other techniques (Bosc & Pivert, 2001), but most have been based on the important Apriori algorithm. Extensions have included fuzzy set approaches to quantitative data (Zhang, 1999; Kuok et al., 1998), hierarchies or taxonomies (Chen et al., 2000; Lee, 2001), weighted rules (Gyenesei, 2000) and interestingness measures (de Graaf et al., 2001). Our extensions most closely follow that of Chen (Chen et al., 2000).

Recall that in order to generate frequent itemsets, we must count the number of transactions Ti that support an itemset Xj. In the ordinary Apriori algorithm one simply counts the occurrence of a value as 1 if it is in the set, or 0 if it is not. Here, since we have obtained from the query a membership degree for the values in the transaction, we must modify the support count SCR. To achieve this we will use the ΣCount operator, which extends the ordinary concept of set cardinality to fuzzy sets (Yen & Langari, 1999). Let A be a fuzzy set; then the cardinality of A is obtained by summation of the membership values of the elements of A:

Card(A) = ΣCount(A) = Σ μA(yi); yi ∈ A

Using this, the fuzzy support count for the set Xj becomes:

FSCR(Xj) = ΣCount(Xj) = Σ μTi; Xj ⊆ Ti
Note that the membership of Ti is included in the count only if all of the values of the itemset Xj are included in the transaction, i.e. it is a subset of the transaction. Finally, to produce the association rules from the set of relevant data R retrieved from the spatial database, we will provide our extension to deal with the resulting frequent itemsets. For the purposes of generating a rule such as Xj →
Xk, we can now extend the ideas of fuzzy support and confidence as:

FS = FSCR(Xj ∪ Xk)/|R|
FC = FSCR(Xj ∪ Xk)/FSCR(Xj)
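A short Python sketch of the fuzzy support count and the resulting FS and FC measures follows; the query-result tuples and their memberships are illustrative.

# Sketch of the fuzzy support count FSC_R: each transaction contributes its
# tuple membership (rather than 1) when it contains the whole itemset.

def fuzzy_support_count(R, itemset):
    # R: list of (items, mu_t) pairs from the fuzzy query result
    return sum(mu for items, mu in R if itemset <= items)

R = [({"Road Near", "Good Terrain"}, 0.89),
     ({"Road Near"}, 0.79),
     ({"Road Near", "Good Terrain"}, 0.92)]

X, Y = {"Good Terrain"}, {"Road Near"}
FS = fuzzy_support_count(R, X | Y) / len(R)                      # fuzzy support
FC = fuzzy_support_count(R, X | Y) / fuzzy_support_count(R, X)   # fuzzy confidence
print(round(FS, 3), round(FC, 3))   # 0.603 1.0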
Rough Set Rules
In this case, since the query result R is a rough set, we must again modify the support count SCR. So we define the rough support count, RSCR, for the set Xj, to count differently in the upper and lower approximations:

RSCR(Xj) = Σ W(Xj); Xj ⊆ Ti

where W(Xj) = 1 if Ti ∈ RT, and W(Xj) = a if Ti ∈ R̄T, with 0 < a < 1.
The value a can be a subjective value obtained from the user depending on a relative assessment of the roughness of the query result R. For the data mining example of the next section we could simply choose a neutral default value of a = ½. Note that W(Xj) is included in the summation only if all of the values of the itemset Xj are included in the transaction, i.e. it is a subset of the transaction. Finally, as in the fuzzy set case above, we have rough support and confidence as follows:

RS = RSCR(Xj ∪ Xk)/|R|
RC = RSCR(Xj ∪ Xk)/RSCR(Xj)
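Analogously, a Python sketch of the rough support count follows; the split of transactions into lower approximation and boundary sets, and the sample itemsets, are illustrative.

# Sketch of the rough support count RSC_R: transactions in the lower
# approximation count 1, boundary transactions count a (here the neutral
# default a = 1/2 mentioned above).

def rough_support_count(lower, boundary, itemset, a=0.5):
    return (sum(1 for t in lower if itemset <= t)
            + sum(a for t in boundary if itemset <= t))

lower = [{"River Near", "Good Terrain"}, {"River Near"}]
boundary = [{"River Near", "Good Terrain"}]

X, Y = {"River Near"}, {"Good Terrain"}
n = len(lower) + len(boundary)
RS = rough_support_count(lower, boundary, X | Y) / n
RC = (rough_support_count(lower, boundary, X | Y)
      / rough_support_count(lower, boundary, X))
print(RS, RC)   # 0.5 0.6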
SPATIAL DATA MINING EXAMPLE

We now consider a data mining example involving a spatial database that could be used, for example, in industrial plant location planning. Here we would like to discover relationships between attributes that are relevant for providing guidance in the planning and selection of a suitable plant
site. There may be several possible counties in the region that are under consideration for plant location, and we will assume that smaller cities are of interest for this particular industry's operation since they would have sufficient infrastructure but none of the congestion and other problems typically associated with larger cities. Transportation (airfields, highways, etc.), water sources for plant operation (rivers and lakes) and terrain information (soils, drainage, etc.), in and within about five kilometers of the city, are of particular interest for the site selection.
Fuzzy Spatial Representation

The first step we must take in discovering rules that may be of interest in a given county is to formulate an SQL query, as we have described before, using the fuzzy function NEAR (Figure 2) to represent those objects within about 5 kilometers of the cities. We will use the fuzzy function of Figure 4 to select the cities with a small population.

SELECT City C, Road R, Railroad RR, Airstrip A, Terrain T
FROM County 1
WHERE ([NEAR(C.loc, R.loc), NEAR(C.loc, RR.loc), NEAR(C.loc, A.loc), NEAR(C.loc, T.loc)] and C.pop = Small)
AT Threshold Levels = .80, .75, .70
We evaluate for each city in a selected county the locations of roads, railroads and airstrips using the NEAR fuzzy function. The terrain attribute value is produced by evaluation of various factors such as average soil conditions (e.g. firm, marshy), relief (e.g. flat, hilly), coverage (fields, woods), etc. These subjective evaluations are then combined into one membership value which is used to provide a linguistic label based on fuzzy functions for these. Note that the evaluation for terms such as “good” can be context dependent. For the
Figure 4. Fuzzy membership function for small city
purpose of developing a plant, an open and flat terrain is suitable, whereas for a recreational park a woody and hilly situation would be desirable. Each attribute value in the intermediate relation then has a degree of membership. The three threshold levels in the query are specified for the NEAR, Small and the terrain memberships. The final relation R is formulated based on the thresholds and the overall tuple membership computed as previously described (see Table 3). In R the value "None" indicates that for the attribute no value was found NEAR, i.e., within the five kilometers. For such values no membership value is assigned, and so μt is just based on the non-null attribute values in the particular tuple. Now in the next step of data mining we generate the frequent itemsets from R using the fuzzy support count. At the first level, for itemsets of size 1 (k = 1), airstrips are not found since they do not occur often enough in R to yield a fuzzy support count above the minimum support count that was pre-specified. The level k = 2 itemsets are generated from the frequent level 1 itemsets.
Here only two of these possibilities exceed the minimum support, and none do above this level, i.e. k = 3 or higher. This gives us the table of frequent itemsets shown in Table 4. From this table of frequent itemsets we can extract various rules and their confidence. Rules will not be output unless they are strong, satisfying both minimum support and confidence. A rule produced from a frequent itemset satisfies minimum support by the manner in which frequent itemsets are generated, so it is only necessary to use the fuzzy support counts from the table to compute the confidence. The "small city" clause that will appear in all extracted rules arises because this was the general condition that selected all of the tuples that appeared in query result R from which the frequent itemsets were generated. Let us assume for this case that the minimum confidence specified was 85%. So, for example, one possible rule that can be extracted from the frequent itemsets in Table 4 is:
Table 3. The final result of the example query – R

City  Roads              Railroads  Airstrips  Terrain  μt
A     Hwy. 10            RRx        None       Good     0.89
B     {Hwy. 5, Hwy. 10}  None       A2         Fair     0.79
F     Hwy. 6             RRx        None       Good     0.92
…     …                  …          …          …        …
Table 4. The frequent itemsets found for the example

k  Frequent Itemsets               Fuzzy Support Count
1  {Road Near}                     11.3
1  {Good Terrain}                  10.5
1  {Railroad Near}                 8.7
2  {Road Near, Good Terrain}       9.5
2  {Good Terrain, Railroad Near}   7.2
If C is a small city and has good terrain nearby then there is a road nearby with 90% confidence. Since the fuzzy support count for {Good Terrain} is 10.5 and the level 2 itemset {Road Near, Good Terrain} has a fuzzy support count of 9.5, the confidence for the rule is 9.5/10.5 or 90%. Since this is above the minimum confidence of 85%, this rule is strong and will be an output of the data mining process. If we had specified a lower minimum confidence such as 80% we could extract (among others) the rule: If C is a small city and has a railroad nearby then there is good terrain nearby with 83% confidence. Since the fuzzy support count for {Railroad Near} and {Railroad Near, Good Terrain} are 8.7 and 7.2, the confidence is 7.2/8.7 or 83% and so this rule is also output.
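As a worked check, the following Python sketch recomputes the confidences of all single-antecedent rules from the Table 4 counts and applies the 85% minimum.

# Worked check of the rule extraction above using the fuzzy support counts
# from Table 4 and a minimum confidence of 85%.

fsc = {frozenset({"Road Near"}): 11.3,
       frozenset({"Good Terrain"}): 10.5,
       frozenset({"Railroad Near"}): 8.7,
       frozenset({"Road Near", "Good Terrain"}): 9.5,
       frozenset({"Good Terrain", "Railroad Near"}): 7.2}

min_conf = 0.85
for F, count in fsc.items():
    if len(F) < 2:
        continue
    for f in F:                      # single-item antecedents
        conf = count / fsc[frozenset({f})]
        consequent = set(F) - {f}
        status = "strong" if conf >= min_conf else "rejected"
        print(f, "->", consequent, format(conf, ".0%"), status)
# Good Terrain -> Road Near is output at 90%; Railroad Near -> Good Terrain
# reaches only about 83% and is rejected at the 85% minimum.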
Rough Set Spatial Representation

Now we will consider this example using rough set techniques. We will make use of some rough predicates, as has been done in rough querying of crisp databases (Beaubouef & Petry, 1994), using SMALL for the population size of a city and NEAR for relative distances between cities. We might roughly define SMALL, for example, to include in the lower approximation all cities that have a
population < 8000, and in the upper approximation those cities that have a population < 50000. NEAR would include all distances that are < 5 kilometers for the lower approximation, and all distances that are < 8 kilometers for the upper approximation, for example. We use a sample query similar to the previous one:

SELECT City C, Road R, River RV, Airstrip A, Drainage D
FROM County 1
WHERE (SMALL(C.pop) and [NEAR(C.loc, R.loc), NEAR(C.loc, RV.loc), NEAR(C.loc, A.loc), NEAR(C.loc, D.loc)])
There are several potential issues related to drainage. For some plant operations a poor drainage of adjacent terrain might be unacceptable because of pollution regulations in the county under consideration. For other plants that do not produce wastewater discharges, the drainage situation is not as important. Depending on the specific plant being planned, the various drainages of terrains suitable for the purpose would be entered and a qualitative category for the terrain desirability taken into account. Assume the rough resultant table R is shown in Table 5. Attribute values for drainage quality may have been partitioned by the following equivalence relation: {[POOR, UNACCEPTABLE], [FAIR, OK, ACCEPTABLE], [GOOD, DESIRABLE], [EXCELLENT]}. In Table 5, the lower approximation region contains data for four cities (A, B, C, D). Cities E and F appear in the boundary region of the upper approximation result. These are results that do not match the certain lower approximation results exactly, based on our query and the definition of predicates, but do meet the qualifications for the boundary, or uncertain, region. For example, CITY E is part of counties 1 and 2, and CITY F might be 7 kilometers from HWY 2, rather than within 5 kilometers, which would place it in the lower approximation.
Table 5. R: Rough results from spatial query

City  County  Roads           Rivers          Airstrips  Drainage
A     1       Hwy 1           Oak             None       Good
B     1       {Hwy 5, Hwy 7}  None            A1         Fair
C     1       Hwy 28          Sandy           None       Good
D     1       Hwy 10          Black           A2         Acceptable
E     {1, 2}  Hwy 33          Black           None       Good
F     1       Hwy 2           {Sandy, Pearl}  A3         Desirable
As for the fuzzy set case, we can generate the frequent itemsets from the relation R in Table 5 using the rough support count. A frequent itemsets table very similar to Table 4 for fuzzy sets will then result. As before we specify a minimum confidence such as 80%. This means we could obtain a rule such as: If there is a river closely located to a small city C then it is likely that there is good terrain near, with 82.4% confidence. This would follow if the rough support counts for {River Near} and {River Near, Good Terrain} were, for example, 8.5 and 7.0, in which case the rough confidence is RC = 7.0/8.5 or 82.4% for this rule.
CONCLUSION AND FUTURE DIRECTIONS

The use of data mining has become widespread in such diverse areas as marketing and intelligence applications. The integration of data discovery with GIS (Geographic Information Systems) allows extensions of these approaches to include spatially related data, greatly enhancing their applicability but incurring considerable complexity. In this chapter we have described data discovery for spatial data using association rules. To permit uncertainty in the discovery process, vague spatial predicates based on fuzzy set and
rough set techniques were presented. Finally, examples of these approaches for industrial plant site planning using typical association rules of interest were illustrated.

A number of topics of current research can enhance future directions in data mining of vague spatial data. There are several recent approaches to uncertainty representation that may be more suitable for certain applications and forms of spatial data. Type-2 fuzzy sets have been of considerable recent interest (Mendel & John, 2002). In these, as opposed to ordinary fuzzy sets in which the underlying membership functions are crisp, the membership functions are themselves fuzzy. Intuitionistic sets (Atanassov, 1999) are another generalization of a fuzzy set. Two characteristic functions are used, capturing both the ordinary idea of degree of membership in the intuitionistic set and the degree of non-membership of elements in the set, and these can be used in database design (Beaubouef & Petry, 2007). Combinations of the concepts of rough set and fuzzy set theory (Nanda & Majumdar, 1992) have also been proposed and can be used in modeling some of the complexities found in spatial data.

Closer integration of data mining systems and GIS can assist users in effective rule extraction from spatial data. The visual presentations would aid in the specification of imprecise spatial predicates and their corresponding thresholds. A user could utilize the GIS tools to experiment with a variety of spatial predicates and their thresholds to
best represent the user's viewpoint and intuitions, which are often very visually oriented. Another issue for algorithms that generate association rules is that a very large number of such rules is often produced. Deciding which rules are most valuable or interesting is an ongoing research topic (Hilderman & Hamilton, 2001; Tan et al., 2002). By providing visual feedback of the spatial association rules, a user could more effectively prune the large set of potential rules.
ACKNOWLEDGMENT

We would like to thank the Naval Research Laboratory's Base Program, Program Element No. 0602435N, for sponsoring this research.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data (pp. 207-216). New York, NY: ACM Press.

Alpaydin, E. (2004). Introduction to machine learning. Boston, MA: MIT Press.

Atanassov, K. (1999). Intuitionistic fuzzy sets: Theory and applications. Heidelberg, Germany: Physica-Verlag.

Beaubouef, T., & Petry, F. (1994). Rough querying of crisp data in relational databases. In Proceedings of the Third International Workshop on Rough Sets and Soft Computing (RSSC'94), San Jose, California (pp. 34-41).

Beaubouef, T., Petry, F., & Ladner, R. (2007). Spatial data methods and vague regions: A rough set approach. Applied Soft Computing, 7, 425-440. doi:10.1016/j.asoc.2004.11.003

Bosc, P., & Pivert, O. (2001). On some fuzzy extensions of association rules. In Proceedings of IFSA-NAFIPS 2001 (pp. 1104-1109). Piscataway, NJ: IEEE Press.

Buckles, B., & Petry, F. (1982). A fuzzy representation for relational databases. International Journal of Fuzzy Sets and Systems, 7, 213-226. doi:10.1016/0165-0114(82)90052-5

Burrough, P., & Frank, A. (Eds.). (1996). Geographic objects with indeterminate boundaries. GISDATA series (Vol. 2). London, UK: Taylor and Francis.

Chen, G., Wei, Q., & Kerre, E. (2000). Fuzzy data mining: Discovery of fuzzy generalized association rules. In Bordogna, G., & Pasi, G. (Eds.), Recent issues on fuzzy databases (pp. 45-66). Heidelberg, Germany: Physica-Verlag.

Clementini, E., & DeFelice, P. (1997). Approximate topological relations. International Journal of Approximate Reasoning, 16, 173-204. doi:10.1016/S0888-613X(96)00127-2

de Graaf, J., Kosters, W., & Witteman, J. (2001). Interesting fuzzy association rules in quantitative databases. In Principles of Data Mining and Knowledge Discovery (LNAI 2168) (pp. 140-151). Berlin, Germany: Springer-Verlag. doi:10.1007/3-540-44794-6_12

Elmasri, R., & Navathe, S. (2010). Fundamentals of database systems (6th ed.). Addison Wesley.

Ester, M., Fromelt, A., Kriegel, H., & Sander, J. (2000). Spatial data mining: Database primitives, algorithms and efficient DBMS support. Data Mining and Knowledge Discovery, 4, 89-125. doi:10.1023/A:1009843930701

Golfarelli, M., & Rizzi, S. (2009). Data warehouse design: Modern principles and methodologies. McGraw Hill.

Gyenesei, A. (2000). Mining weighted association rules for fuzzy quantitative items (TUCS Technical Report 346). Turku, Finland: Turku Center for Computer Science.

Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques (2nd ed.). San Diego, CA: Academic Press.

Han, J., Koperski, K., & Stefanovic, N. (1997). GeoMiner: A system prototype for spatial data mining. In Proceedings of the 1997 ACM-SIGMOD International Conference on Management of Data (pp. 553-556). New York, NY: ACM Press.

Hilderman, R., & Hamilton, H. (2001). Knowledge discovery and measures of interest. Kluwer Academic Publishers.

Hipp, J., Guntzer, U., & Nakhaeizadeh, G. (2000). Algorithms for association rule mining - A general survey. SIGKDD Explorations, 2, 58-64. doi:10.1145/360402.360421

Koperski, K., & Han, J. (1995). Discovery of spatial association rules in geographic information databases. In Proceedings of the 4th International Symposium on Large Spatial Databases (pp. 47-66). Berlin, Germany: Springer-Verlag.

Kuok, C., Fu, A., & Wong, H. (1998). Mining fuzzy association rules in databases. SIGMOD Record, 27, 41-46. doi:10.1145/273244.273257

Lee, K. (2001). Mining generalized fuzzy quantitative association rules with fuzzy generalization hierarchies. In Proceedings of IFSA-NAFIPS 2001 (pp. 2977-2982). Piscataway, NJ: IEEE Press.

Longley, P., Goodchild, M., Maguire, D., & Rhind, D. (2001). Geographic information systems and science. Chichester, UK: Wiley.

Lu, W., Han, J., & Ooi, B. (1993). Discovery of general knowledge in large spatial databases. In Proceedings of the Far East Workshop on Geographic Information Systems (pp. 275-289). Singapore: World Scientific Press.

Mendel, J., & John, R. (2002). Type-2 fuzzy sets made simple. IEEE Transactions on Fuzzy Systems, 10, 117-127. doi:10.1109/91.995115

Miller, H., & Han, J. (2009). Geographic data mining and knowledge discovery (2nd ed.). Chapman & Hall.

Mishra, N., Ron, D., & Swaminathan, R. (2004). A new conceptual clustering framework. Machine Learning Journal, 56, 115-151. doi:10.1023/B:MACH.0000033117.77257.41

Nanda, S., & Majumdar, S. (1992). Fuzzy rough sets. Fuzzy Sets and Systems, 45, 157-160. doi:10.1016/0165-0114(92)90114-J

Nguyen, H., & Walker, E. (2005). A first course in fuzzy logic (3rd ed.). Boca Raton, FL: CRC Press.

Pawlak, Z. (1984). Rough sets. International Journal of Man-Machine Studies, 21, 127-134. doi:10.1016/S0020-7373(84)80062-0

Petry, F. (1996). Fuzzy databases: Principles and application. Norwell, MA: Kluwer Academic Publishers.

Polkowski, L. (2002). Rough sets. Heidelberg, Germany: Physica-Verlag.

Rigaux, P., Scholl, M., & Voisard, A. (2002). Spatial databases with application to GIS. San Francisco, CA: Morgan Kaufmann.

Shasha, D., & Zhu, Y. (2004). High performance discovery in time series. Springer.

Shekar, S., & Chawla, S. (2003). Spatial databases: A tour. Upper Saddle River, NJ: Prentice Hall.

Shu, J., Tsang, E., & Yeung, D. (2001). Query fuzzy association rules in relational databases. In Proceedings of IFSA-NAFIPS 2001 (pp. 2989-2993). Piscataway, NJ: IEEE Press.

Tan, P., Kumar, V., & Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery in Databases (pp. 32-41). Edmonton, Canada.

Tan, P., Steinbach, M., & Kumar, V. (2005). Introduction to data mining. Boston, MA: Addison Wesley.

Tobler, W. (1979). Cellular geography. In Gale, S., & Olsson, G. (Eds.), Philosophy in geography (pp. 379-386). Dordrecht, The Netherlands: Riedel.

Yen, J., & Langari, R. (1999). Fuzzy logic: Intelligence, control and information. Upper Saddle River, NJ: Prentice Hall.

Zadeh, L. (1970). Similarity relations and fuzzy orderings. Information Sciences, 3, 177-200. doi:10.1016/S0020-0255(71)80005-1

Zhang, J., & Goodchild, M. (2002). Uncertainty in geographical information. London, UK: Taylor and Francis. doi:10.4324/9780203471326

Zhang, W. (1999). Mining fuzzy quantitative association rules. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (pp. 99-102). Piscataway, NJ: IEEE Press.

KEY TERMS AND DEFINITIONS

Association Rule: A rule that captures the idea of certain data items commonly occurring together.
Data Mining: The process of applying a variety of algorithms to discover relationships in data files.
Fuzzy Set: A set in which an element can have a degree of membership in the set.
Geographic Information System: A software system that provides a wide variety of tools to manipulate spatial data.
Itemsets: Sets of data items that occur in transactions such as sales.
Rough Set: A set specified by upper and lower approximations that are defined by indiscernibility relations among the data.
Spatial Data: Geographic data that can be represented by points, lines, or areas.
This work was previously published in Computational Modeling and Simulation of Intellect: Current State and Future Perspectives, edited by Boris Igelnik, pp. 342-360, copyright 2011 by Information Science Reference (an imprint of IGI Global).
Chapter 4
Active Learning and Mapping: A Survey and Conception of a New Stochastic Methodology for High Throughput Materials Discovery

Laurent A. Baumes
CSIC-Universidad Politecnica de Valencia, Spain
ABSTRACT

Data mining technology is increasingly employed in new industrial processes, which require automatic analysis of data and related results in order to proceed quickly to conclusions. However, for some applications, absolute automation may not be appropriate. Unlike traditional data mining contexts, which deal with voluminous amounts of data, some domains are actually characterized by a scarcity of data, owing to the cost and time involved in conducting simulations or setting up experimental apparatus for data collection. In such domains, it is hence prudent to balance speed through automation against the utility of the generated data. The authors review the active learning methodology and present a new one that aims at generating successive new samples in order to reach an improved final estimation of the entire search space investigated, according to the knowledge accumulated iteratively through sample selection and the corresponding obtained results. The methodology is shown to be of great interest for applications such as high throughput material science, and especially heterogeneous catalysis, where the chemists do not have previous knowledge allowing them to direct and guide the exploration.
DOI: 10.4018/978-1-4666-2455-9.ch004

1. INTRODUCTION

Data mining, also called knowledge discovery in databases (KDD) (Piatetsky-Shapiro & Frawley, 1991; Fayyad, Piatetsky-Shapiro & Smyth, 1996), is the efficient discovery of unknown
patterns in databases (DBs). The data source can be a formal DB management system, a data warehouse, or a traditional file. In recent years, data mining has attracted great attention in both academia and industry. Understanding the field and defining the discovery goals are the leading tasks in the KDD process. Two aims can be distinguished: i)
verification, where the user hypothesizes and mines the DB to corroborate or disprove the hypothesis; ii) discovery, where the objective is to find new, unidentified patterns. Our contribution is concerned with the latter, which can further be either predictive or descriptive. Data mining technology is more and more applied in production mode, which usually requires automatic analysis of data and related results in order to proceed to conclusions. However, absolute automation may not be appropriate. Unlike traditional data mining contexts, which deal with voluminous amounts of data, some domains are actually characterized by a scarcity of data, owing to the cost and time involved in conducting simulations or setting up experimental apparatus for data collection. In such domains, it is hence prudent to balance speed through automation against the utility of the generated data. For these reasons, human interaction and guidance may lead to better quality output: the need for active learning arises. In many natural learning tasks, knowledge is gained iteratively, by making actions, queries, or experiments. Active learning (AL) is concerned with the integration of data collection, design of experiments, and data mining, for making better use of the data. The learner is not treated as a classical passive recipient of data to be processed. AL can arise in two extreme cases: i) the amount of data available is very large, and therefore a mining algorithm uses a selected data subset rather than the whole available data; ii) the researcher has control of the data acquisition and has to pay attention to the iterative selection of samples in order to extract the greatest benefit from future data treatments. We are concerned with the second situation, which becomes especially crucial when each data point is costly, domain knowledge is imperfect, and theory-driven approaches are inadequate, as in the heterogeneous catalysis and material science fields. Active data selection has been investigated in a variety of contexts but, as far as we know, this contribution represents the first investigation in this chemistry domain.
A catalytic reaction is a chemical reaction in which transformations are accelerated thanks to a substance called a catalyst. Basically, starting molecules and intermediates, as soon as they are formed, interact with the catalyst in a specific, discriminating manner. This implies that some transformation steps can be accelerated while others can be kept constant or even slowed down. Catalytic processes constitute the fundamentals of modern chemical industries. Over 90% of newly introduced chemical processes are catalytic. In the highly developed industrial countries, catalytic processes create about 20% of the Gross Domestic Product. Catalysis is responsible for the manufacture of over $3 trillion in goods and services. We will focus on heterogeneous catalysis, which involves the use of catalysts acting in a different phase from the reactants, typically a solid catalyst with liquid and/or gaseous reactants. For further details, the reader is referred to (Ertl, Knozinger & Weitkamp, 1997). During the whole catalytic development, a very large number of features and parameters have to be screened, and therefore any detailed and relevant catalyst description remains a challenge. All these parameters generate an extremely high degree of complexity. As a consequence, the entire catalyst development is long (~15 years) and costly. Conventional catalyst development relies essentially on fundamental knowledge and know-how. It implies a complete characterisation of the catalyst in order to establish a properties-activity relationship. The main drawback of this approach is that it is a very time-consuming process, making and testing one material at a time. Another drawback comes from the relative importance of intuition for the initial choices of development strategy (Jandeleit, Schaefer, Powers, Turner & Weinberg, 1999; Farrusseng, Baumes, Hayaud & Vauthey, 2001; Farrusseng, Baumes & Mirodatos, 2003). To overcome these major drawbacks, attempts to shorten this process by using high throughput (HT) technology or experimentation have been reported for the past 10-15 years.
The HT approach is more pragmatically oriented. It deals with the screening of collections of samples. However, it must be stressed that the relevant parameters are usually unknown and can hardly be directly and individually controlled. In addition, it is in general a combination of factors that provides the outstanding properties required today to meet challenging targets. The tools necessary for the combinatorial approach can be classified into two main categories: i) HT equipment for fast and parallel synthesis and testing of catalysts (Xiang & Takeuchi, 2003; Koinuma & Takeuchi, 2004; Hanak, 2004; Serna, Baumes, Moliner & Corma, 2008; Akporiaye, Dahl, Karlsson & Wendelbo, 1998; Holmgren et al., 2001; Bricker et al., 2004; Pescarmona, Rops, van der Waal, Jansen & Maschmeyer, 2002; Klein, Lehmann, Schmidt & Maier, 1999; Hoffmann, Wolf & Schüth, 1999), and ii) computational methods (Montgomery, 1997; Corma, Moliner, Serra, Serna, Díaz-Cabañas & Baumes, 2006; Baumes, Gaudin, Serna, Nicoloyannis & Corma, 2008; Baumes et al., 2009; Baumes, Moliner & Corma, 2006; Caruthers et al., 2003; Todeschini & Consonni, 2000; Rodemerck, Baerns, Holena & Wolf, 2004) together with new hardware for time-consuming calculations (Kruger, Baumes, Lachiche & Collet, 2010; Maitre, Lachiche, Clauss, Baumes, Corma & Collet, 2009; Maitre, Baumes, Lachiche, Collet & Corma, 2009). One should note that algorithms should be adequately selected or created taking into account the HT tools or strategy (Baumes, Moliner & Corma, 2008; Baumes, Moliner & Corma, 2009; Baumes, Jimenez & Corma, in press). HT experimentation has become an accepted and important strategy in the development of catalysts and materials (Senkan, 2001; Baumes, Farrusseng & Ausfelder, 2004; Gorer, 2004; Sohn, Seo & Park, 2001; Boussie et al., 2003). However, such an approach has had more success in optimization than in discovery (Klanner, Farrusseng, Baumes, Lengliz, Mirodatos & Schüth, 2004; Nicolaides, 2005; Klanner, Farrusseng, Baumes, Mirodatos & Schüth, 2003; Farrusseng, Klanner, Baumes, Lengliz,
Mirodatos & Schüth, 2005). Despite fast synthesis and testing robots, each catalytic experiment still requires a few hours. Here, the learner's most powerful tool is its ability to act and to gather data. On the other hand, very few recent papers in this domain deal with the strategies, called mapping, that should be used in order to guide a discovery study. Mapping develops relationships among properties such as composition and synthesis conditions, and these interactions may be obtained without searching for hits or lead materials. The results of mapping studies can then be used as input to guide subsequent screening or optimisation experiments. The purpose of screening experiments is to identify iteratively, by accumulation of knowledge, hits or small space regions of materials with promising properties. The last manner of guiding the chemist, called optimisation, is when experiments are designed to refine material properties. Mapping receives relatively little attention, being too often subsumed under screening. The sampling strategy in HT material science, and especially heterogeneous catalysis, typically embodies a chemist's assessment of where a good location to collect data might be, or is derived from the iterative optimization (Wolf, Buyevskaya & Baerns, 2000; Buyevskaya, Wolf & Baerns, 2000; Baumes & Collet, 2009; Buyevskaya, Bruckner, Kondratenko, Wolf & Baerns, 2001; Corma, Serra & Chica, 2003; Holena & Baerns, 2003; Serra, Chica & Corma, 2003; Grubert, Kondratenko, Kolf, Baerns, van Geem & Parton, 2003; Maier et al., 2004; Tchougang, Blansché, Baumes, Lachiche & Collet, 2008; Günter, Jansen, Lucas, Poloni & Beume, 2003; Omata, Umegaki, Watanabe & Yamada, 2003; Paul, Janssens, Joeri, Baron & Jacobs, 2005; Watanabe, Umegaki, Hashimoto, Omata & Yamada, 2004; Cawse, Baerns & Holena, 2004) (principally with evolutionary algorithms) of specific design criteria, usually the selectivity or conversion. Homogeneous covering (Bem, Erlandson, Gillespie, Harmon, Schlosser & Vayda, 2003; Cawse & Wroczynski, 2003; Sjöblom, Creaser & Papadakis, 2004; Harmon, 2004) or
traditional design of experiments (DoE) (Deming & Morgan, 1993; Montgomery, 1991; Tribus & Sconyi, 1989), which is usually neglected due to the specificity of the different methods and the restrictions imposed by the domain, have been exploited. Otherwise, Simple Random Sampling (SRS) rules the domain. However, SRS should not be underestimated; see (Sammut & Cribb, 1990) for a detailed explanation of SRS robustness. The general problem considered here is the efficiency of data selection in HT heterogeneous catalysis for a discovery program. In such a domain, only very fast screening tools, i.e. giving qualitative responses or relatively noisy information, should be employed, aiming at finding the different "groups" of catalyst outputs. This pre-screening of the search space shall extract information or knowledge from the restricted selected sampling in order to provide guidelines and well-defined boundaries for further screenings and optimization (Serra, Corma, Farrusseng, Baumes, Mirodatos, Flego & Perego, 2003; Baumes, Jouve, Farrusseng, Lengliz, Nicoloyannis & Mirodatos, 2003). The catalytic performance is defined as classes; rank (if it exists) is not taken into account, since the objective is not the optimization of catalytic outputs. The chemist's knowledge is not integrated into the AL methodology, but it should permit defining an a priori "poorly-previously-explored" parameter space, leaving opportunities for surprising or unexpected catalytic results, especially considering that HT tools for synthesis and reactivity testing already restrict the experimental space considerably. The typical distribution of catalytic outputs usually exhibits unbalanced datasets, for which an efficient learning can hardly be carried out (Baumes, Farrusseng, Lengliz & Mirodatos, 2004). Even if the overall recognition rate may be satisfactory, catalysts belonging to rare classes are usually misclassified. On the other hand, the identification of atypical classes is interesting from the point of view of the knowledge gain. Therefore an iterative algorithm
is suggested for the characterization of the space structure (Baumes, 2006). The algorithm should:

i) Increase the quality of the machine learning (ML) performed at the end of this first exploratory stage.
ii) Work independently from the choice of the supervised learning system.
iii) Decrease the misclassification rates of catalysts belonging to small-frequency classes of performance.
iv) Handle both quantitative and qualitative features.
v) Proceed iteratively while capturing the information contained in all previous experiments.
vi) Integrate inherent constraints such as the a priori fixed reactor capacity, i.e. the size of the iteratively selected sample to be labelled, and a maximum number of experiments to be conducted, the so-called deadline.
vii) Be robust with respect to noisy responses.

The organization of the manuscript is as follows. First, the active learning approaches and the different selection schemes are presented. Then some notations and representations used throughout the text are introduced. In the third section, the method and its central criterion are investigated. Section 4 details the creation of the benchmarks and discusses the great interest of using such a testing methodology. Then, in section 5, the method is evaluated on the different benchmarks, identified by mathematical functions that exhibit different levels of difficulty. Finally, section 6 emphasizes quantifying the strength of the method through statistical analysis, and the results are thoroughly discussed.
2. ACTIVE LEARNING

Active learning (AL) assumes that the data is provided by a source which is controlled by the researcher. Such control is used for different aims and in different ways. The fields in which one may wish to use AL are numerous: optimization, where the learner experiments to find a set of inputs that maximize some response variable, for example the response surface methodology
(Box & Draper, 1987) which guides hill-climbing through the input space; adaptive control, where one must learn a control policy by taking actions. In this field, one may face the complication that the value of a specific action remains unknown until a time delay has occurred; model selection problem for driving data collection in order to refine and discriminate a given set of models. For all types of application, the principal AL task is to determine an “optimal” sample selection strategy. Such optimization is defined through a criterion, called selection scheme, depending on the user aim. Therefore, considering the model selection problem, the approach can either be motivated by the need to disambiguate among models or to gain the most prediction accuracy from a ML algorithm while requiring the fewest number of labelling. Before inspecting the different selection schemes proposed earlier and some selected examples from publications, it has to be noted that new samples can either be created by the system or selected from an unlabeled set. The first approach is not investigated here, and considering the domain of application, it remains difficult to generate samples without lack of coherence. Typically a system could produce and ask non existing materials to be labelled. By now, authors who explored this methodology reveal that the protocol often describe “impossible” catalysts such as the following example: prepare by a precipitation process a solid consisting of 30% Ba, 50% Na, and 20% V (oxygen is excluded), using inorganic, non-halide precursors from aqueous solution. Using suitable precursors and finding a precipitation agent which would precipitate all three metals at the same time is virtually impossible. The second approach is the most common and corresponds to the one we are concerned. Two kinds of selection from an unlabeled set can be distinguished. The pool-based approach allows the selection among an a priori restricted set of unlabelled samples while the other one permits to pick up any sample to be labelled from an entire pre-defined search
space. Another criterion that should be taken into account when specifying an AL algorithm is the exact role of the ML system. The following cases are discriminated based on the frequency of the learning system updates. AL usually starts from a very small number of labelled samples, then iteratively asks for new samples. The selection of new samples may be done in order to update, at each new round, either the previously obtained model, increasing its performance and accuracy, or a given criterion which remains independent from the learning system, allowing a single use of the ML once the whole selection is achieved. Using the terminology of the feature selection domain (Baumes, Serna & Corma, in press), the first protocol is called a wrapper approach while the second one is qualified as a filter. The advantage of using a wrapper technique is that the selection is optimized considering the learning algorithm that has been previously chosen. However, such a choice is not always trivial, and depends on the complexity of the underlying system investigated (which is usually unknown or difficult to quantify) but also on the complexity of the ML system itself, since for complex algorithms it may be delicate to elaborate the selection scheme. Moreover, for many configurations, such a methodology might be intractable.
2.1. Selection Schemes

The primary question of AL is how to choose which points to try next. A simple strategy for sampling is to target locations so as to reduce our uncertainty in modelling, for example by selecting the location that minimizes the posterior generalized variance of a function. The distribution P(Y|X) being unknown, a classical approach consists in approximating P with a large number of samples, but then a great number of hypotheses and simplifications have to be made in order to compute the estimated error reduction. For example, using a probabilistic classifier, uncertainty sampling would pick the
observation for which the predicted class probabilities yield the greatest entropy. The query-by-committee utility (Seung, Opper & Sompolinsky, 1992) measures the classification disagreement of a committee of classifiers, choosing an example with high disagreement. Cohn et al. (1995) measure the expected reduction in prediction variance of neural networks and other models. Another closely related solution is to select the most ambiguous sample. Ambiguity-directed sampling aims at clarifying the decision-making near the ambiguity. Making the assumption that close elements are similar, the knowledge of one sample should induce the knowledge of its neighbours. However, ambiguous points are likely to be neighbours. It is therefore important to select ambiguous points spread over the distribution of input variables. Other solutions for choosing these points are to look for "places" where there is no data (Whitehead, 1991) or where the model is expected to change (Cohn, Atlas & Ladner, 1990); for example, (Juszczak, 2004) relies on measuring the variation in label assignments (of the unlabeled set) between the classifier trained on the training set and the classifiers trained on the training set with a single unlabeled object added with all possible labels. Other closely related selection schemes are investigated in (Linden & Weber, 1993; Schmidhuber & Storck, 1993), which aim, respectively, at choosing points where the system performs poorly, and points where previously found data resulted in learning. Other solutions are directly induced by the domain of application, for instance robot navigation studies (Thrun & Moller, 1992). In such learning tasks, data-query is neither free nor of constant cost. Researchers try to integrate the fact that the cost of a query depends on the distance from the current location in state space to the desired query point. On the other hand, such a notion of distance is not transferable to the synthesis of materials. Therefore, the next section presents some selected applications and the solutions proposed, in order to better position our approach and underline its specificities.
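As a concrete illustration of the entropy-based uncertainty sampling described above, the following sketch selects, from a pool of unlabelled candidates, the sample whose predicted class probabilities have maximal entropy. The predict_proba interface is an assumption of convenience (it matches, e.g., scikit-learn probabilistic classifiers); the classifier and pool are placeholders.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def most_uncertain(classifier, pool):
    """Index of the pool sample with maximal predictive entropy.

    `classifier` is assumed to expose predict_proba(X) -> (n, H) array,
    as scikit-learn probabilistic classifiers do.
    """
    probs = classifier.predict_proba(pool)
    scores = [entropy(row) for row in probs]
    return int(np.argmax(scores))
```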
2.2. Positioning of the Methodology

An active learner has to efficiently select a set of samples S' in S to be labelled. Nevertheless, it is intractable to compute all possible combinations for S'. The common approach is then to select one query sample at each round. Therefore, existing AL approaches such as (Juszczak & Duin, 2004; Ramakrishnany, Bailey-Kellogg, Tadepalliy & Pandeyy, 2005) concentrate on the selection of a single element to be labelled. (Ramakrishnany, Bailey-Kellogg, Tadepalliy & Pandeyy, 2005) is an active sampling using entropy-based functions defined over spatial aggregates. This mechanism is applied to numerous 2D functional benchmarks, and to wireless system simulations where the aim is to identify and characterize the nature of the basins of local minima with models of flow classes. However, such a solution is not adequate considering our HT approach, for which the principal goal is to increase the number of experiments through parallelisation. Moreover, (Cohn, Atlas & Ladner, 1990) iteratively requires model evaluations, as does (Bailey-Kellogg & Ramakrishnan, 2003), where the strategy evaluates models until a high-confidence model is obtained. (Bailey-Kellogg & Ramakrishnan, 2003) employs an AL strategy for the model selection problem, and applies its analysis framework to two cases from scientific computing domains. The goal is to empirically characterize problem characteristics (e.g. matrix sensitivity) and algorithm performance (e.g. convergence) by the data-driven strategy. Active learning has been applied to relatively new techniques such as Support Vector Machines (SVM) (Brinker, 2003; Tong & Koller, 2001); see (Baumes, Serra, Serna & Corma, 2006; Serra, Baumes, Moliner, Serna & Corma, 2007) for applications of SVM in materials science. On the other hand, many of these strategies are meant to be used with specific data mining algorithms (this does not mean that they are wrapper methods). For example, the selection done in (Cohn, Ghahramani, Jordan, Tesauro, Touretzky & Alspector,
1995) is made from a statistical point of view, by selecting a new point in order to minimize the estimated variance of its resulting output through an estimate of P(x, y), and various assumptions are made considering mixtures of Gaussians and locally weighted regression, while (Schein, Sandler & Ungar, 2004) develops example selection schemes for a Bayesian logistic regression model in classification settings. Considering only one precise learning system is not desirable for our purpose. First, the choice of the technique is not obvious. Secondly, more than one algorithm or algorithm instance could be employed, allowing the use of arcing, boosting, and bagging methodologies. Moreover, choosing the learning approach a posteriori permits better handling of the complexity of the underlying system investigated, by determining an adequate solution. As previously mentioned, our approach should support all kinds of variable types, while others are applicable only to continuous outputs (Cohn, Ghahramani, Jordan, Tesauro, Touretzky & Alspector, 1995). The latter technique requires several strong assumptions or simplifications in order to achieve the minimization of the criterion. In a continuous context, and without differentiable systems, this method is intrinsically limited. For example, considering NN, this approach has many disadvantages: the optimal data selection becomes computationally very expensive and approximate. Moreover, this minimization has to be done for each new example to be potentially integrated in the training set, which is very expensive. Similar drawbacks are associated with the maximisation done in (Ramakrishnany, Bailey-Kellogg, Tadepalliy & Pandeyy, 2005). Pool-based methods (Cohn, Ghahramani, Jordan, Tesauro, Touretzky & Alspector, 1995) take a trained machine learning algorithm and pick the next example from the a priori defined pool for labelling, according to a measure of expected benefit. (Schein, Sandler & Ungar, 2004) makes the selection of examples with an optimality criterion
that describes with some accuracy the expected error induced by a particular training set. (Souvannavong, Mérialdo & Huet, 2004) proposed a partition sampling approach to select a set of ambiguous samples that contain complementary information. Thanks to a clustering technique, they select the most ambiguous element per cluster. They apply this methodology to a video database which is too big and too long to be entirely labelled. The elements are chosen from a pool, and therefore the use of a partition algorithm remains reasonable since the pool is finite and tractable. However, in our case, the whole search space is available for labelling. Other drawbacks have to be noted, such as the fact that some studies are noise-free. For example, (Ramakrishnany, Bailey-Kellogg, Tadepalliy & Pandeyy, 2005) makes the following assumption, as do most other works on AL: the probability distributions defined by the interpolation model are correct. Despite HT apparatus, one has to bear in mind that the amount of experiments should be in a reasonable range, which is usually fixed a priori through the time and money associated with a given research contract. On the other hand, (Ramakrishnany, Bailey-Kellogg, Tadepalliy & Pandeyy, 2005) employs the k-nearest neighbours algorithm (k-nn with k=8) with a relatively very high total number of selected points (5% for the initialization and 25% in total) on benchmarks which are all differentiable functions. Considering the total number of combinations possible for materials synthesis, such a requirement of data is inaccessible. Moreover, the use of k-nn is difficult in a context where the calculation of a distance between catalysts remains tricky due to qualitative variables such as preparation modes (co-precipitation, impregnation…). It has also to be pointed out that some selection schemes, such as the maximization of the expected information gain, may principally select extreme solutions. Intuitively, it is expected that data should be gathered at the locations where the error bars on the model are currently the greatest. But the error bars are usually
the largest beyond the most extreme points where data have been gathered. Therefore this criterion may lead to gathering data at the edges of the input space, which might be considered non-ideal. The solution proposed by traditional statistical optimal design (Fedorov, 1972; Chaloner & Verdinelli, 1995) does not meet all the constraints previously underlined. Its fundamental objective is hypothesis testing, and an experiment is designed to generate statistically reliable conclusions to specific questions. Therefore, hypotheses must be clearly formulated and experiments are chosen according to the given supposition in order to verify it in the best statistical manner. This strategy is particularly suited to domains that are known sufficiently well that appropriate questions can be formed and models can be pre-defined. In contrast, combinatorial methods are often employed for the express purpose of exploring new and unknown domains. Although the previously reviewed literature is very valuable and gives theoretical justification for using AL, even without considering the specificities which make these methods unusable in our case, most of the relevant articles require a degree of statistical sophistication which is beyond the reach of most practitioners of the domain of high throughput materials science. We now present our methodology and empirical results demonstrating the effectiveness of the active mining strategy on synthetic datasets. At the moment, this approach is used in a research program for the discovery of new crystalline materials called zeolites. The strategy is tested against simple random sampling (SRS) on numerous benchmarks with different levels of complexity.
3. THE METHODOLOGY

3.1. Notation

The method is a stochastic group sequential biased sampling which iteratively proposes samples from the search space noted Ω, where ωp ∈ Ω, p ∈ [1..P], corresponds to an experiment. [Y] is the output set of variables. A process ℘partition is chosen by the user to provide a partition of [Y] into H ≥ 2 classes noted Ch, h ∈ [1..H]. ℘partition can be a clustering which, in some sense, "discovers" classes by itself by partitioning the examples into clusters, which is a form of unsupervised learning. Note that, once the clusters are found, each cluster can be considered as a "class"; see (Senkan, 2001) for an example in the domain of application. [X] is a set of independent variables noted vi, and x_i^j is the value of vi for experiment j. Each vi can be either qualitative or quantitative. If vi is a quantitative feature, it is discretized by a process ℘discr providing a set of modalities Mi, with Card(Mi) = mi; ℘discr(vi, j) = m_i^j, j ∈ [1..mi], is the modality j of vi. Whatever the variable, the number of modalities m is of arbitrary size. Each cell of an n-dimensional contingency table represents the number of elements observed to belong simultaneously to n given modalities. A so-called "zone", noted s, is defined as a set of 1 ≤ o ≤ n modalities:

$$\mathrm{def}(v_1,\ldots,v_n)=\left\{\{1..m_{i_1}\},\{1..m_{i_2}\},\ldots,\{1..m_{i_n}\}\right\},\qquad s:\mathrm{def}\rightarrow\{m_i,-\},\;m_i\in\mathrm{def}(v_i)$$

where "−" is the unspecified modality. o(s), called the "order" of s, returns the number of defined modalities of s. Consider a search space partitioned into H classes with N catalysts already evaluated. vi contains mi modalities, and n^{ij} corresponds to the amount of experiments possessing the modality m_i^j. The number of experiments belonging to class h and possessing the modality j of variable vi is n_h^{ij}. The general notation is summarized in Equation 1. A classifier C, C(.) = C(v1(.), v2(.), …, vn(.)), is utilized (here C is a NN), which can recognize the class using a list of predictive attributes.
Equation 1.

$$
\begin{pmatrix}
n_1^{i1} & n_1^{i2} & \cdots & n_1^{im_i} & N_1\\
n_2^{i1} & n_2^{i2} & \cdots & n_2^{im_i} & N_2\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
n_H^{i1} & n_H^{i2} & \cdots & n_H^{im_i} & N_H\\
n^{i1} & n^{i2} & \cdots & n^{im_i} & N
\end{pmatrix}
\;\Rightarrow\;
n^{ij}=\sum_{h=1}^{H}n_h^{ij},\qquad
N_h=\sum_{j=1}^{m_i}n_h^{ij},\qquad
N=\sum_{j=1}^{m_i}n^{ij}=\sum_{j=1}^{m_i}\sum_{h=1}^{H}n_h^{ij}
$$
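Equation 1 is simply a class-by-modality contingency table together with its margins. A minimal sketch of how such a table could be accumulated is given below; the data encoding (experiments as pairs of a modality dictionary and a class index) is assumed for illustration, not taken from the chapter.

```python
def contingency(experiments, var, n_classes, n_modalities):
    """Accumulate the n_h^{ij} table of Equation 1 for one variable v_i.

    experiments: iterable of (modalities, h) pairs, where `modalities`
    maps a variable name to a 0-based modality index and h is the
    0-based class index.  Returns the table plus its margins.
    """
    table = [[0] * n_modalities for _ in range(n_classes)]
    for modalities, h in experiments:
        table[h][modalities[var]] += 1
    n_ij = [sum(table[h][j] for h in range(n_classes))
            for j in range(n_modalities)]          # column sums n^{ij}
    N_h = [sum(row) for row in table]              # row sums N_h
    return table, N_h, n_ij, sum(N_h)              # ..., grand total N

# Hypothetical usage: three experiments, two classes, three modalities
data = [({"vi": 0}, 0), ({"vi": 2}, 1), ({"vi": 0}, 1)]
print(contingency(data, "vi", n_classes=2, n_modalities=3))
```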
The methodology is totally independent from the choice of this ML. It acts as a filter on the sampling in order to enhance the recognition rate, by transferring selected catalysts to be synthesized and tested from "stable" search space zones to "unsteady" ones, which necessitate more experimental points to be well modelled.
3.2. The Mechanism

Importantly, the approach does not need to sample the entire combinatorial space, but only enough of it to be able to identify the structure of classes without forgetting classes obtained with only a few experiments. It is thus imperative to focus the sampling on only those locations that are expected to be useful. Such AL does not provide any direct information, but distribution sets can be used to boost the performance of the classifier without requiring a distance measure. A "puzzling" zone is a region of the search space which contains various classes, making the recognition or characterization of the structure difficult due to the heterogeneity of the responses. As soon as the emergence of such a confusing region is detected, a natural behaviour is to select relatively more experiments belonging to the given region in order to better capture the space structure. A better recognition of space zones in which a relatively high dynamism is detected should permit the understanding of the underlying or causal phenomenon and therefore could be used for
localizing hit regions. The methodology focuses on the irregularity or variability of catalytic responses, i.e. the "wavering" behaviour of the class distribution. The method transfers points from potentially stable zones of the landscape to unsteady or indecisive ones. Therefore the following questions must be answered: How precisely is the "difficulty" of a space zone measured? How is the need to confirm trends balanced against exploration, bearing in mind that the deadline is approaching?
3.3. The Criterion

The statistic called χ² is used as a measure of how far a sample distribution deviates from a theoretical distribution. This type of calculation is referred to as a measure of Goodness of Fit (GOF). The Chi-square can be used to measure how disparate the classes are within zones, compared to the distribution obtained after the random initialization (k1 points) or the updated one after successive generations. Therefore a given amount of points can be assigned to zones proportionally to the deviation between the overall distribution and the observed distributions within zones. Figure 1a shows a given configuration with H=4, N=1000 and vi (with mi=5) that splits the root (the overall distribution is on the left hand side). For equal distributions between the root and a leaf, the Chi-square is null (χ²ᵢ₅, ■ in Figure 1a). Chi-square values are equal for two leaves with the same distribution between each
Figure 1. a) Criterion settings, 1st configuration. On the left hand side the entire search space is represented. This given root has been split into 5 leaves, where the distributions are given for the last three ones. Each leaf and the root are partitioned into 5 classes. The first class has received 100 elements, and among these 12 belong to the fourth leaf. The Chi-square statistic is given on the right hand side of each leaf, in brackets. b) Criterion settings, 2nd configuration
other (● in Figure 1a). One would prefer to add a point with the third modality (bottom ●) in order to increase the number of individuals, which is relatively low. This is confirmed by the fact that χ² is relatively more "reactive" for leaves with smaller populations; see the absolute variations (■ → ● and □ → ○) of two successive χ² in Figure 1b. In order to obtain a significant impact, i.e. an information gain, by adding a new point, it is more interesting to test new catalysts possessing a modality which has been poorly explored (i.e. □). Chi-square does not make any difference between leaves that exhibit exactly the same distribution (i.e. □ and ■). Therefore n^{ij} must be minimized at the same time in order to support relatively empty leaves. Based on the Chi-square behaviour, the criterion is defined by Equation 2.

$$\left(n^{ij}+1\right)^{-1}\times\left(\sum_{h=1}^{H}\frac{\left(\dfrac{n_h^{ij}}{n^{ij}}-\dfrac{N_h}{N}\right)^{2}}{N_h}+1\right)\qquad(2)$$
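A direct transcription of the criterion, as reconstructed in Equation 2, might look as follows. The placement of the two +1 smoothing terms follows that reconstruction and should be checked against the original typeset formula; the function name is illustrative.

```python
def zone_criterion(n_hj, N_h, N):
    """Criterion of Equation 2 (as reconstructed) for one zone.

    n_hj: per-class counts n_h^{ij} inside the zone (length H)
    N_h:  per-class counts at the root (length H); N: grand total.
    The 1/(n^{ij} + 1) factor makes sparsely populated zones more
    attractive, as discussed in the text.
    """
    n_ij = sum(n_hj)
    gof = sum(
        ((n_hj[h] / n_ij if n_ij else 0.0) - N_h[h] / N) ** 2 / N_h[h]
        for h in range(len(N_h))
    )
    return (gof + 1.0) / (n_ij + 1.0)
```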
Note that extremely unstable and small zones may have distributions which are very far from the overall distribution. With this criterion, they may continuously attract experiments. However, this may not be due to a natural, complex underlying relationship, but rather to a lack of reproducibility, uncontrolled parameters, noise, and so on. Therefore the maximum number of experiments a zone can receive can be bounded by the user. Let X̄ᵒ_rnd(k₂) be the calculated average number of individuals that a zone of order o receives from an SRS of k₂ points. A maximum number of points, noted ρX̄ᵒ_rnd(k₁+k₂), that MAP is authorized to allocate in a zone compared to X̄ᵒ_rnd(k₂) can then be decided. ρ is a parameter the user has to set. However, during our experiments such a phenomenon did not appear, and thus this parameter is not studied here. After the distribution zone analysis performed after each newly selected generation, the algorithm ranks the zones based on the criterion. Among the whole set of zones, ts (for tournament size) zones are selected randomly and compete together following the GA-like selection operator called "tournament" (Blickle & Thiele, 1995; Thierens, 1997).
A zone with rank r has a chance of $2r\times\left[k_2(k_2+1)\right]^{-1}$ to be selected. As the criterion is computed on subsets of modalities (i.e. zones of order o), when a given zone is selected for receiving new points, the modalities that do not belong to s are randomly assigned. Figure 2 depicts the whole process.

Figure 2. Scheme representing the methodology

The class conception is of great importance, since the criterion deeply depends on the root distribution. Enlarging or splitting classes permits an indirect control of the sampling. We recommend merging uninteresting classes and splitting the out-of-the-ordinary ones in order to create relatively unbalanced root distributions. On the other hand, a reasonable balance should be respected, otherwise small and interesting classes hidden inside large ones will have less chance of being detected. In the experiments presented in the next section, o remains fixed and is set a priori. Each zone of order o is associated with its corresponding observed distribution and the related criterion value.
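The rank-biased tournament can be sketched in a few lines. Only the rank-to-probability mapping 2r × [k₂(k₂+1)]⁻¹ comes from the text; the zone representation and function names are assumptions.

```python
import random

def tournament_select(zones, criterion, ts):
    """Rank-biased tournament over zones (GA-like selection).

    Zones are ranked by increasing criterion value; the zone of rank r
    (r = 1..k2) wins with probability 2r / (k2 * (k2 + 1)), so zones
    with a high criterion are favoured but never certain to win.
    """
    ranked = sorted(zones, key=criterion)          # rank 1 .. k2
    k2 = len(ranked)
    contenders = random.sample(range(k2), ts)      # ts competing zones
    weights = [2 * (i + 1) / (k2 * (k2 + 1)) for i in contenders]
    return ranked[random.choices(contenders, weights=weights, k=1)[0]]
```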
4. TEST CASES

Even if it is impossible to say how many datasets would be sufficient (in whatever sense) to characterize the behaviour of a new algorithm, in most cases benchmarking is not performed with a sufficient number of different problems. Rarely can the results presented in articles be compared directly. The most useful setup is to use both artificial datasets, whose characteristics are known exactly, and real datasets, which may have some surprising and very irregular properties. Considering the domain of application, real datasets are very costly and time consuming to obtain. Therefore, for this first presentation of the new method, its efficiency is thoroughly evaluated with mathematical functions. Two criteria are emphasized. Reproducibility: in a majority of cases, the information about the exact setup of the benchmarking tests is insufficient for other researchers to reproduce it exactly. Comparability:
a benchmark is useful if results can be compared directly with results obtained by others for other algorithms. Even if two articles use the same dataset, the results are most often not directly comparable, because either the input/output encoding or the partitioning of training versus test data is not the same, or is even undefined. Therefore, the methodology employed for testing the algorithm is fully described here. All benchmarks are mathematical functions $f(x_i\in\mathbb{R}^n)\rightarrow y\in\mathbb{R}$. For simplicity, $\forall i,\; x_i\in[a,b],\;(a,b)\in\mathbb{R}^2$ for a given function. A benchmark or test case is used after the three following steps.

1. n-dimensional functions are traced onto a first bi-dimensional series plot. ℘discr splits [a, b] into m equal parts (∀i, mi = m). All the (m+1) boundaries are selected as points to be plotted in the series plot. On the
x-axis, an overlapped loop is applied, taking into account the selected values of each variable. As an example, let us consider the Baumes fg function (Equation 3). Figure 3 shows the associated series plot with n = 6 and x_i ∈ [−1, 1]. An overlapped loop is used
on each feature with 9 points for each, i.e. 531441 points in total.

2. Classes of performance are constructed by setting thresholds on the y-axis of the series plot. The size of each class (i.e. the number of points between two thresholds) is easily visualized (horizontal lines in Figure 3). One colour and form is assigned to each class: blue ≤ 2, 2 < aqua ≤ 6, 6 < green ≤ 10, 10 < yellow ≤ 15, red > 15. Figure 6 gives an example of the placement of points.
Figure 3. Series plot of Baumes fg. The number of variables is n = 6, and the number of points represented for each feature is 9.
Figure 4. 2D graph for function Baumes fg (n = 6, 9pts/var)
Figure 5. How multi-dimensional functions are represented onto a 2D space
Figure 6. 2D plot for function Schwefel f7, n=6, 6 modalities: black ≤ 24000, 24000 < light gray ≤ 25000, 25000 < gray ≤ 25750, 25750 < white ≤ 26500, dark gray > 26500
Figure 7. 2D plot for function De Jong f3, n=6, 6 modalities black ≤ 26.5, 26.5 < light gray ≤ 27.5, 27.5 < gray ≤ 28.5, 28.5 < white ≤ 29.5, dark gray > 29.5
Figure 8. 2D plot for function Baumes fa, n=9, 4 modalities black ≤ 1, 1 < light gray ≤ 2, 2 < gray ≤ 5, 5 < white ≤ 15, dark gray > 15
Figure 9. 2D plot for function De Jong f1, n=9, 4 modalities black ≤ 30, 30 < light gray ≤ 60, 60 < gray ≤ 90, 90 < white ≤ 120, dark gray > 120
3. In between two thresholds, every point is labelled as belonging to a given class. Figure 4 shows the graph related to Figure 3 and Equation 3.

Five different benchmarks, see Figures 5 to 9 (De Jong f1 and De Jong f3 (De Jong, n.d.), Schwefel f7 (Whitley, Mathias, Rana & Dzubera, 1996), Baumes fa, and Baumes fg, see Equation 3), have been selected in order to test the algorithm. Among them, some new ones (Baumes fa and Baumes fg) have been specially designed in order to trap the method.

Equation 3.

$$
\begin{aligned}
f_a(x_i) &= \tan\!\left(\sum_{i=1}^{n}\left(\sin^{2}(x_i^{2})-\frac{1}{2}\right)\right), & 0 \le x_i \le 2\\
f_g(x_i) &= \sum_{i=1}^{n}(n-i+1)\,x_i^{2}\left(1+\frac{x_i}{1000}\right)^{2}, & -1 \le x_i \le 1\\
f_1(x_i) &= \sum_{i=1}^{n} x_i^{2}, & 0 \le x_i \le 6\\
f_3(x_i) &= A+\sum_{i=1}^{n}\operatorname{int}(x_i),\quad A=25\ \text{(option)}, & 0 \le x_i \le 3\\
f_7(x_i) &= nV+\sum_{i=1}^{n}-x_i\sin\!\left(\sqrt{|x_i|}\right), & -500 \le x_i \le 500
\end{aligned}
$$
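Under the reconstruction of Equation 3 above (the pairing of the domains with fa and f1 is inferred from the class thresholds of Figures 8 and 9, not stated explicitly in the recovered text), steps 1 to 3 can be sketched as follows for the Baumes fg benchmark; names and the grid encoding are illustrative.

```python
import itertools

def fg(x):
    """Baumes fg, as reconstructed in Equation 3."""
    n = len(x)
    return sum((n - i) * v * v * (1.0 + v / 1000.0) ** 2
               for i, v in enumerate(x))   # (n - i) since i is 0-based

def label(y, thresholds):
    """Step 3: the class index is the number of thresholds below y."""
    return sum(y > t for t in thresholds)

# Step 1: 9 equally spaced points per variable on [-1, 1] with n = 6,
# i.e. 9**6 = 531441 grid points in total.
grid = itertools.product([j / 4.0 - 1.0 for j in range(9)], repeat=6)
# Step 2: thresholds 2, 6, 10, 15 define the five colour classes.
classes = [label(fg(x), (2.0, 6.0, 10.0, 15.0)) for x in grid]
print(len(classes), classes[:5])
```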
5. RESULTS

5.1. Evaluation Mode 1

Samples and the corresponding effects on NN learning are noted for both the proposed methodology and SRS. Datasets are separated into a training set and a selection set. The use of analytical benchmarks permits the utilization of test sets with an arbitrary number of cases. In our experiments, small classes are always considered of great interest. For each sample, 10000 individuals are randomly chosen as the test set. As an example, 1500 points have been sampled on the De Jong f1 search space (9 var./4 mod.), see Table 1. When using our methodology, the number of good individuals (Class A, the smallest) is increased from 4 with SRS (Training + selection A) to 27. Such a distribution permits both to increase the overall rate of recognition and the recognition of small classes. For the other benchmarks, the distributions in the merged training and selection sets are given in Table 2, whereas the distributions in the test sets are shown in Table 3. It can be seen in the respective distributions of every tested benchmark that small classes received more experiments (the smallest class is greyed), and fewer experiments belong to the largest one (the largest is in black), as expected. The results show clearly that MAP permits a better characterization of small zones than SRS, while exploration of the search space is maintained. The gain in recognition by the NN on both the smallest and the largest classes for each benchmark, when using such active sampling instead of SRS, is given in Figure 10. It can be seen that the gains on the smallest classes are tremendously increased, varying from 18% to an infinite gain (for Schwefel f7, 600 is substituted for the infinite value, since SRS assigned 1 experiment to the smallest zone, in order not to obtain a division by zero). The loss of recognition rate for the largest classes (if there is one) is very
low compared to the relatively high gain on the small ones. The overall recognition rate, being deeply influenced by the relative size of the classes, does not represent an adequate criterion; however, the proposed methodology wins in most of the cases.
5.2. Evaluation Mode 2

As our methodology is not influenced by the choice of the ML applied on the selected points, another way to measure its influence is to analyze the distribution of points. If the overall distribution of classes over the whole search space were statistically similar to that of an SRS, the proposed method would not transfer points from zone to zone. The Chi-square test (Snedecor & Cochran, 1989) is used to test whether a sample of data comes from a population with a specific distribution. The Chi-square GoF test is applied to binned data (i.e., data put into classes) and is an alternative to the Anderson-Darling (Stephens, 1974) and Kolmogorov-Smirnov (Chakravarti & Roy, 1967) GoF tests, which are restricted to continuous distributions. We state as a "statistical null hypothesis", noted H0, something that is the logical opposite of what we believe. Then, using statistical theory, it is expected that the data show H0 to be false, so that it should be rejected. This is called "Reject-Support testing" (RS testing), because rejecting the null hypothesis supports the experimenter's theory.
Table 1. Training, selection and test sets of De Jong f1 from SRS (upper array) and the new methodology (lower array)
Table 2. Merged training and selection sets after sampling from the proposed approach and SRS, considering all other benchmarks. In each case 5 classes are present (A to E).

| Class | De Jong F3 (SRS) | De Jong F3 (New) | Baumes Fa (SRS) | Baumes Fa (New) | Baumes Fg (SRS) | Baumes Fg (New) | Schwefel F7 (SRS) | Schwefel F7 (New) |
|-------|------------------|------------------|-----------------|-----------------|-----------------|-----------------|-------------------|-------------------|
| A | 15 | 58 | 772 | 737 | 58 | 111 | 5 | 25 |
| B | 31 | 123 | 300 | 308 | 397 | 448 | 80 | 180 |
| C | 85 | 196 | 273 | 284 | 592 | 452 | 200 | 320 |
| D | 139 | 213 | 85 | 82 | 402 | 386 | 412 | 402 |
| E | 1230 | 910 | 70 | 89 | 51 | 103 | 803 | 573 |
Table 3. Distribution of classification by Neural Network in Test depending on the sample (SRS or MAP) for all benchmarks (real classes A to E versus predicted classes A to E, side by side for the SRS and MAP samples, for each of the five benchmarks)
Consequently, before undertaking the experiment, one can be certain that only the 4 possible states summarized in Table 4 can happen. The Chi-square test statistic follows, approximately, a Chi-square distribution with a number of degrees of freedom (df) noted v = (l−1)(c−1), where c is the number of columns and l is the number of rows in the Chi-square table. Therefore, the hypothesis that the data are from a population with the specified distribution is rejected if χ² > χ²(α,v), where χ²(α,v) is the chi-square percent point function with v df and a significance level of α. χ²α is the upper critical value from the Chi-square distribution and χ²(1−α) is the lower critical value. Table 5 contains the critical values of the chi-square distribution for the upper tail of the distribution. Therefore statistic tests with v df (v = (l−1)(c−1) = 4) are computed from the data (Table 4). H0: The_proposed_methodology = SRS versus H1: The_proposed_methodology ≠ SRS is tested. For such an upper one-sided test, one finds the column corresponding to α in the upper critical values table and rejects H0 if the statistic is greater than the tabulated value. The estimation and testing results from contingency tables hold regardless of the sampling model of the distribution. Top values in Table 4 are frequencies calculated from Table 1.
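The GoF comparison can be reproduced directly from Table 2. The sketch below uses the Schwefel f7 column and scipy's chisquare function as a convenience (an assumption of tooling; the chapter tabulates the statistic by hand against the critical values of Table 5).

```python
from scipy.stats import chisquare

# Class counts A..E for Schwefel f7 taken from Table 2 (SRS vs MAP);
# both samples contain 1500 points, as the GoF test requires.
srs = [5, 80, 200, 412, 803]
map_ = [25, 180, 320, 402, 573]

# H0: the MAP sample follows the same class distribution as SRS.
stat, p = chisquare(f_obs=map_, f_exp=srs)
print(f"chi2 = {stat:.1f}, p = {p:.3g}")
# With v = 4 df, the critical value 9.488 (alpha = 0.05, Table 5) is
# far exceeded, so H0 is rejected for this benchmark.
```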
The Chi-square statistic $\chi_v^2=\sum\left(f_{\text{observed}}-f_{\text{theoretical}}\right)^{2}\times f_{\text{theoretical}}^{-1}$ is noted in red, and the critical values at the different levels are in blue. Yes (Y) or No (N) correspond to answers to the question "is H0 rejected?". Table 5 shows that the distributions differ in some cases only. One can note that negative answers are observed for the two benchmarks called Baumes fa and Baumes fg (the black cell is discussed later). These benchmarks have been created in order to check the efficiency on extremely difficult problems.
Figure 10. Percentage recognition gain for both the smallest and largest class considering every benchmark when using MAP methodology instead of SRS
However, the analysis of the results in the previous section clearly shows that MAP modifies the distributions and thus implies an improvement of the search space characterization through ML. Therefore, we think that the sample size is not large enough to discriminate between both approaches.
Table 4. Chi-square GOF test

DISCUSSION

It has to be underlined that during all the experiments the root distribution has been calculated after the initialisation. However, the user could i) intentionally not respect the real distribution in order to give weights to selected classes, or ii) re-evaluate the root distribution (Equation 4) using the following notation. The search space is partitioned into $m_i$ cells considering only one variable $v_i$ (i.e. $\mathrm{Card}(M_i) = m_i$), and $N_i^a$ represents the total number of points in the cell $m_i^a$; "a" is analogous to a stratum:

$M_i = \{ m_i^a, m_i^b, \ldots, m_i^{m_i} \} \Rightarrow |M_i| = m_i,$
Table 5. Critical values of the chi-square distribution for the upper tail of the distribution
Probability of exceeding the critical value

v      0.10      0.05      0.025     0.01      0.001
1      2.706     3.841     5.024     6.635     10.828
2      4.605     5.991     7.378     9.210     13.816
3      6.251     7.815     9.348     11.345    16.266
4      7.779     9.488     11.143    13.277    18.467
5      9.236     11.070    12.833    15.086    20.515
6      10.645    12.592    14.449    16.812    22.458
7      12.017    14.067    16.013    18.475    24.322
8      13.362    15.507    17.535    20.090    26.125
9      14.684    16.919    19.023    21.666    27.877
10     15.987    18.307    20.483    23.209    29.588
11     17.275    19.675    21.920    24.725    31.264
12     18.549    21.026    23.337    26.217    32.910
13     19.812    22.362    24.736    27.688    34.528
14     21.064    23.685    26.119    29.141    36.123
15     22.307    24.996    27.488    30.578    37.697
16     23.542    26.296    28.845    32.000    39.252
17     24.769    27.587    30.191    33.409    40.790
18     25.989    28.869    31.526    34.805    42.312
19     27.204    30.144    32.852    36.191    43.820
20     28.412    31.410    34.170    37.566    45.315
21     29.615    32.671    35.479    38.932    46.797
22     30.813    33.924    36.781    40.289    48.268
23     32.007    35.172    38.076    41.638    49.728
24     33.196    36.415    39.364    42.980    51.179
25     34.382    37.652    40.646    44.314    52.620
26     35.563    38.885    41.923    45.642    54.052
27     36.741    40.113    43.195    46.963    55.476
28     37.916    41.337    44.461    48.278    56.892
29     39.087    42.557    45.722    49.588    58.301
30     40.256    43.773    46.979    50.892    59.703
31     41.422    44.985    48.232    52.191    61.098
32     42.585    46.194    49.480    53.486    62.487
33     43.745    47.400    50.725    54.776    63.870
34     44.903    48.602    51.966    56.061    65.247
35     46.059    49.802    53.203    57.342    66.619
36     47.212    50.998    54.437    58.619    67.985
37     48.363    52.192    55.668    59.893    69.347
38     49.513    53.384    56.896    61.162    70.703
39     50.660    54.572    58.120    62.428    72.055
40     51.805    55.758    59.342    63.691    73.402
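For reference, the critical values in Table 5 can be regenerated with a few lines of SciPy; this is an illustrative sketch, not part of the original methodology.

```python
# Reproduce Table 5's upper-tail chi-square critical values (assumes SciPy).
from scipy.stats import chi2

probs = [0.10, 0.05, 0.025, 0.01, 0.001]   # probability of exceeding the value
for v in range(1, 41):                      # degrees of freedom
    row = [chi2.isf(p, v) for p in probs]   # isf = inverse survival function
    print(v, " ".join(f"{x:7.3f}" for x in row))
```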
$\forall j,\ m_i^a \in \omega_j,\ \Omega_{M_i} = \sum_a N_i^a \Rightarrow \forall a,\ N_i^a = N_i^{\cdot}$. The weight of each stratum is $\forall a,\ w_i^a = N_i^a / \Omega_{M_i} = 1/m_i = w_i^{\cdot}$. $p_A, p_B, \ldots, p_K$ are the total proportions of class A (respectively B, ..., K) in the entire population, and $p_A^{m_i^a}$ is the proportion of individuals belonging to class A in $m_i^a$. $n_i^{m_i^a}$ corresponds to the amount of the sampling that possesses $m_i^a$, $f_i^a = n_i^a / N_i^a$ is the proportion of the sampling for $m_i^a$, and $n_A^{m_i^a}$ corresponds to the amount of sampled individuals in $m_i^a$ that belong to class A.

$p_A = \sum_{j=1}^{|M_i|} w_i^j\, p_A^{m_i^j} = \sum_{j=1}^{|M_i|} w_i^{\cdot}\, p_A^{m_i^j} = \frac{1}{m_i} \sum_{j=1}^{|M_i|} p_A^{m_i^j}$ (4)
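As a small illustration of Equation 4 under the equal-weight case $w_i^a = 1/m_i$, the sketch below, with invented per-stratum proportions, computes the overall class proportion.

```python
# Minimal sketch of Equation 4 with equal stratum weights (illustrative data).
import numpy as np

def stratified_class_proportion(per_stratum_proportions):
    """Overall proportion of one class from per-stratum proportions,
    with equal stratum weights 1/m_i."""
    p = np.asarray(per_stratum_proportions, dtype=float)
    return p.mean()  # (1/m_i) * sum_j p_A^{m_i^j}

# e.g. proportion of class A observed in each of m_i = 4 cells of variable v_i
print(stratified_class_proportion([0.10, 0.25, 0.05, 0.20]))  # -> 0.15
```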
Altering the distribution of points in the search space as detected by evaluation mode 2 is one thing, but transferring individuals from stable zones to puzzling ones is another. Therefore new tests have been performed. The overall distribution is split considering a set of modalities or a given number of variables, and a new chi-square GoF statistic is evaluated (Equation 5). If i = 3 then v = 6 and the critical value is $\chi^2_{0.05}(6) = 12.5916$. H0 is accepted when no difference in zone size is observed for the considered variables on a given benchmark, and H0 is rejected when a clear difference appears. Tables from these tests are not presented.

Equation 5. Each cell of stratum $m_j$ ($j = 1, \ldots, i$) and class $h \in \{A, \ldots, E\}$ compares the observed count $\tilde{n}_j^h$ with its expectation $E(n_j^h)$:

$\chi^2 = \sum_{j=1}^{i} \sum_{h=A}^{E} \frac{\left(\tilde{n}_j^h - E(n_j^h)\right)^2}{E(n_j^h)}$

With "easy" benchmarks, it appears clearly that our methodology acts as expected. However, for one case H0 is accepted, and thus the power of the test is briefly discussed. The power testing procedure is set up to give H0 "the benefit of the doubt", that is, to accept H0 unless there is strong evidence to support the alternative. Statistical power (1-β) should be at least 0.80 to detect a reasonable departure from H0. The conventions are, of course, much more rigid with respect to α than with respect to β. Factors influencing the power of a statistical test include i) the kind of statistical test being performed, as some tests are inherently more powerful than others, and ii) the sample size: in general, the larger the sample size, the larger the power. To ensure a statistical test will have adequate power, one usually must perform special analyses prior to running the experiment in order to calculate the required sample size (noted n). One could plot power against sample size, under the assumption that the real distribution is known exactly; the user might start with a graph that covers a very wide range of sample sizes, to get a general idea of how the statistical test behaves. The minimum sample size that permits discriminating (significantly, with a fixed error rate α) our methodology from SRS depends on the search space landscape. Simulations will be investigated in future work. It has to be noted that 1500 points have been selected for each benchmark; however, the search spaces are extremely broad, and thus such a sample size represents only a very small percentage of the entire search space.
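The power analysis sketched below is one way such a simulation could be set up, assuming SciPy; the effect size and sample sizes are illustrative assumptions, not the authors' settings.

```python
# Illustrative power-vs-sample-size sketch for a chi-square GoF test
# (an assumption-laden example, not the authors' simulation code).
from scipy.stats import chi2, ncx2

def gof_power(effect_w, df, n, alpha=0.05):
    """Approximate power of a chi-square goodness-of-fit test with
    Cohen effect size w, df degrees of freedom and sample size n."""
    crit = chi2.ppf(1 - alpha, df)          # rejection threshold
    nc = n * effect_w ** 2                  # noncentrality parameter
    return 1.0 - ncx2.cdf(crit, df, nc)     # P(reject H0 | H1)

for n in (100, 500, 1000, 1500):
    print(n, round(gof_power(0.1, 6, n), 3))  # small effect, v = 6 as in the text
```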
CONCLUSION

This paper develops a targeted sampling mechanism based on a novel, flexible, quantitative analysis of the distribution of classes into sub-regions of search spaces. There are several motivations for wanting to alter the selection of samples. In a general sense, we want a learning system to acquire knowledge. In particular, we want the learned knowledge to be as useful as possible while retaining high performance. If the space of configurations is very large with much irregularity, then it is difficult to adequately sample enough of the space. Adaptive sampling such as our methodology tries to include the most productive samples; it allows selecting a criterion over which the samples are chosen, and the learned knowledge about the structure is used for biasing the sampling. The results reveal that the proposed sampling strategy makes more judicious use of data points by selecting locations that clarify the structural characterization of poorly represented classes, rather than choosing points that merely improve the quality of the global recognition rate. The proposed methodology has been thoroughly presented and tested. As such, this methodology was developed to propose formulations that are relevant to be tested at the very first stage of a HT program, when numerous but inherent constraints are taken into account for the discovery of new performing catalysts. Such a methodology is flexible enough to be applied to a broad variety of domains. The main advantages
are the following. The number of false negatives is greatly decreased while the number of true positives is greatly increased. The approach is totally independent of the classifier and creates more balanced learning sets, which both prevents over-learning and yields higher recognition rates. All previous experiments can be integrated, giving more strength to the method, and any type of feature can be taken into account. The method is tunable through modification of the root distribution.
ACKNOWLEDGMENT

EU Commission FP6 (TOPCOMBI Project) is gratefully acknowledged. We also thank Santiago Jimenez and Diego Bermudez for their collaboration on the hITeQ platform, which integrates all programming codes of the presented methodology.

REFERENCES

Akporiaye, D. E., Dahl, I. M., Karlsson, A., & Wendelbo, R. (1998)... Angewandte Chemie International Edition, 37(5), 609–611. doi:10.1002/(SICI)1521-3773(19980316)37:53.0.CO;2-X

Bailey-Kellogg, C., & Ramakrishnan, N. (2003). Proc. 17th Int. Workshop on Qualitative Reasoning, pp. 23-30.

Baumes, L. A. (2006)... Journal of Combinatorial Chemistry, 8(3), 304–314. doi:10.1021/cc050130+

Baumes, L. A., Blansché, A., Serna, P., Tchougang, A., Lachiche, N., Collet, P., & Corma, A. (2009). Materials and Manufacturing Processes, 24(3), 282–292.

Baumes, L. A., & Collet, P. (2009). Computational Materials Science, 45(1), 27–40. New York: Elsevier. doi:10.1016/j.commatsci.2008.03.051

Baumes, L. A., Farruseng, D., & Ausfelder, F. (2009). Catalysis Today. Special Issue "EuroCombiCat 2009" conference.

Baumes, L. A., Farruseng, D., Lengliz, M., & Mirodatos, C. (2004)... QSAR & Combinatorial Science, 29(9), 767–778. doi:10.1002/qsar.200430900

Baumes, L. A., Gaudin, R., Serna, P., Nicoloyannis, N., & Corma, A. (2008). Combinatorial Chemistry & High Throughput Screening, 11(4), 266–282. doi:10.2174/138620708784246068

Baumes, L. A., Jimenez, S., & Corma, A. (in press). hITeQ: A new workflow-based computing environment for streamlining discovery. In L. A. Baumes, D. Farruseng, & F. Ausfelder (Eds.), Application in materials science. Catalysis Today, Special Issue "EuroCombiCat 2009" Conf.
Baumes, L. A., Jimenez, S., Kruger, F., Maitre, O., Collet, P., & Corma, A. (n.d.). How gaming industry fosters crystal structure prediction?. Physical Chemistry Chemical Physics (PCCP). Baumes, L. A., Jouve, P., Farrusseng, D., Lengliz, M., Nicoloyannis, N., & Mirodatos, C. (2003). 7th Int. Conf. on Knowledge-Based Intelligent Information & Engineering Systems (KES’2003). Lecture Notes in AI (LNCS/LNAI series). Sept. 3-5. Univ. of Oxford, UK: Springer-Verlag Baumes, L. A., Moliner, M., & Corma, A. (2006). QSAR & Combinatorial Science, 26(2), 255–272. doi:10.1002/qsar.200620064 Baumes, L. A., Moliner, M., & Corma, A. (2008). CrystEngComm, 10, 1321–1324. doi:10.1039/ b812395k Baumes, L. A., Moliner, M., & Corma, A. (2009). Chemistry (Weinheim an der Bergstrasse, Germany), 15, 4258–4269. doi:10.1002/ chem.200802683
Baumes, L. A., Serna, P., & Corma, A. (in press). Merging traditional and high throughput approaches results in efficient design, synthesis and screening of catalysts for an industrial process. Applied Catalysis A.

Baumes, L. A., Serra, J. M., Serna, P., & Corma, A. (2006)... Journal of Combinatorial Chemistry, 8(4), 583–596. doi:10.1021/cc050093m

Bem, D. S., Erlandson, E. J., Gillespie, R. D., Harmon, L. A., Schlosser, S. G., & Vayda, A. J. (2003). Experimental design for combinatorial and high throughput materials development, 89-107. Hoboken, NJ: Wiley and Sons.

Blickle, T., & Thiele, L. (1995). 6th Int. Conf. on Genetic Algorithms. San Mateo, CA: Morgan Kaufmann.

Boussie, T. R. (2003)... Journal of the American Chemical Society, 125, 4306–4317. doi:10.1021/ja020868k

Box, G., & Draper, N. (1987). Empirical model-building and response surfaces. New York: John Wiley and Sons.

Bricker, M. L., Sachtler, J. W. A., Gillespie, R. D., McGoneral, C. P., Vega, H., Bem, D. S., & Holmgren, J. S. (2004)... Applied Surface Science, 223(1-3), 109–117. doi:10.1016/S0169-4332(03)00893-6

Brinker, K. (2003). In Proc. of the 20th Int. Conf. on Machine Learning (ICML'03), pp. 59-66.

Buyevskaya, O. V., Bruckner, A., Kondratenko, E. V., Wolf, D., & Baerns, M. (2001)... Catalysis Today, 67, 369–378. doi:10.1016/S0920-5861(01)00329-7

Caruthers, J. M., Lauterbach, J. A., Thomson, K. T., Venkatasubramanian, V., Snively, C. M., & Bhan, A. (2003)... Journal of Catalysis, 216, 98. doi:10.1016/S0021-9517(02)00036-2
Cawse, J. N., Baerns, M., & Holena, M. (2004)... Journal of Chemical Information and Computer Sciences, 44(1), 143–146. doi:10.1021/ci034171+ Cawse, J. N., & Wroczynski, R. (2003). Experimental design for combinatorial and high throughput materials development, 109-127. Hoboken, NJ: Wiley and sons. Chakravarti, L., & Roy, H. L. (1967). John Wiley and Sons. pp. 392-394. Chaloner, K., & Verdinelli, I. (1995)... Statistical Science, 10(3), 273–304. doi:10.1214/ ss/1177009939 Cohn, D., Atlas, L., & Ladner, R. (1990). Advances in Neural Information Processing Systems 2. San Francisco: Morgan Kaufmann. Cohn, D. A., Ghahramani, Z., & Jordan, M. I. in G. Tesauro, D. Touretzky, J. Alspector,(1995). Advances in Neural Information Processing Systems 7. San Francisco: Morgan Kaufmann. Corma, A., Moliner, M., Serra, J. M., Serna, P., Díaz-Cabañas, M. J., & Baumes, L. A. (2006)... Chemistry of Materials, 18(14), 3287–3296. doi:10.1021/cm060620k Corma, A., Serra, J. M., & Chica, A. (2002). Principles and methods for accelerated catalyst design and testing. De Jong, K. A. (n.d.). Doctoral dissertation, univ. of Michigan. Dissertation Abstract International, 36(10), 5140(B). Univ. of Michigan Microfilms No. 76-9381 Deming, S. N., & Morgan, S. L. (1993). Experimental design: A chemometric approach (2nd ed.). Amsterdam: Elsevier Science Publishers B.V. Derouane, E., Parmon, V., Lemos, F., & Ribeir, F. (2002). Book Series: NATO SCIENCE SERIES: II: Mathematics, Physics and Chemistry (Vol. 69, pp. 101–124). Dordrecht, Netherlands: Kluwer Academic Publishers.
Derouane, E. G., Parmon, V., Lemos, F., & Ribeiro, F. R. (Eds.). Kluwer Academic Publishers: Dordrecht, The Netherlands, pp. 153-172.

Ertl, G., Knözinger, H., & Weitkamp, J. (1997). Handbook of Heterogeneous Catalysis. New York: Wiley-VCH. doi:10.1002/9783527619474

Farruseng, D., Baumes, L. A., & Mirodatos, C. (2003). Data Management for Combinatorial Heterogeneous Catalysis: Methodology and Development of Advanced Tools. In Potyrailo, R. A., & Amis, E. J. (Eds.), High-Throughput Analysis: A Tool for Combinatorial Materials Science (pp. 551–579). Boston: Kluwer Academic/Plenum Publishers.

Farrusseng, D., Baumes, L. A., Hayaud, C., Vauthey, I., Denton, P., & Mirodatos, C. (2001). Nato series. In E. Derouane (Ed.), Proc. NATO Advanced Study Institute on Principles and Methods for Accelerated Catalyst Design, Preparation, Testing and Development. Vilamoura, Portugal, 15-28 July 2001. Boston: Kluwer Academic Publisher.

Farrusseng, D., Klanner, C., Baumes, L. A., Lengliz, M., Mirodatos, C., & Schüth, F. (2005)... QSAR & Combinatorial Science, 24, 78–93. doi:10.1002/qsar.200420066

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data Mining to Knowledge Discovery in Databases.

Fedorov, V. V. (1972). Theory of optimal experiments. New York: Acad. Press.

Gorer, A. (2004). U.S. Patent 6.723.678, to Symyx Technologies Inc.

Grubert, G., Kondratenko, E. V., Kolf, S., Baerns, M., van Geem, P., & Parton, R. (2003)... Catalysis Today, 81, 337–345. doi:10.1016/S0920-5861(03)00132-9

Hanak, J. J. (2004)... Applied Surface Science, 223, 1–8. doi:10.1016/S0169-4332(03)00902-4
Harmon, L. A. (2003). Journal of Materials Science, 38, 4479–4485. doi:10.1023/A:1027325400459

Hoffmann, C., Wolf, A., & Schüth, F. (1999). Angew. Chem., 111, 2971. Angewandte Chemie International Edition, 38, 2800. doi:10.1002/(SICI)1521-3773(19990917)38:183.3.CO;2-0

Holena, M., & Baerns, M. (2003)... Catalysis Today, 81, 485–494. doi:10.1016/S0920-5861(03)00147-0

Holmgren, J., Bem, D., Bricker, M., Gillespie, R., Lewis, G., & Akporiaye, D. (2001)... Studies in Surface Science and Catalysis, 135, 461–470.

Jandeleit, B., Schaefer, D. J., Powers, T. S., Turner, H. W., & Weinberg, W. H. (1999)... Angewandte Chemie International Edition, 38, 2494–2532. doi:10.1002/(SICI)1521-3773(19990903)38:173.0.CO;2-#

Juszczak, P., & Duin, R. P. W. (2004). Proc. 17th Int. Conf. on Pattern Recognition. IEEE Comp. Soc., Los Alamitos, CA.

Klanner, C., Farrusseng, D., Baumes, L. A., Lengliz, M., Mirodatos, C., & Schüth, F. (2004)... Angewandte Chemie International Edition, 43(40), 5347–5349. doi:10.1002/anie.200460731

Klanner, C., Farrusseng, D., Baumes, L. A., Mirodatos, C., & Schüth, F. (2003)... QSAR & Combinatorial Science, 22, 729–736. doi:10.1002/qsar.200320003

Klein, J., Lehmann, C. W., Schmidt, H. W., & Maier, W. F. (1999)... Angewandte Chemie International Edition, 38, 3369. doi:10.1002/(SICI)1521-3773(19990712)38:13/143.0.CO;2-G

Koinuma, H., & Takeuchi, I. (2004). Nature Materials, 3, 429–438. doi:10.1038/nmat1157
Kruger, F., Baumes, L. A., Lachiche, N., & Collet, P. (2010). In Lecture Notes in Computer Science, Publisher Springer Berlin/Heidelberg. Proc. Int. Conf. EvoStar 2010, 7th - 9th April 2010, Istanbul Technical University, Istanbul, Turkey. Linden, F. Weber.(1993). Proc. 2d Int. Conf. on Simulation of Adaptive Behavior. Cambridge, MA: MIT Press. Maier, W. F. (2004). Polymeric Materials Science and Engineering, 90, 652–653. Maitre, O., Baumes, L. A., Lachiche, N., & Collet, P. Corma, A. (2009). Proc. of the 11th Annual conf. on Genetic and evolutionary computation. Montreal, Québec, Canada, Session: Track 12: parallel evolutionary systems, 1403-1410. New York: Association for Computing Machinery. Maitre, O., & Lachiche, N. P., Baumes, L. A., Corma, A. & P. Collet.(2009). In Lecture Notes in Computer Science, Publisher Springer Berlin/ Heidelberg Vol. 5704/2009 Euro-Par 2009 Parallel Processing, 974-985. Montgomery, D. C. (1991). Design and analysis of experiments (3rd ed.). New York: Wiley. Montgomery, D. C. (1997). Design and Analysis of Experiments (4th ed.). New York: John Wiley & Sons Inc. Nicolaides, D. (2005)... QSAR & Combinatorial Science, 24. Omata, K., Umegaki, T., Watanabe, Y., & Yamada, M. (2003). Studies in Surface Science and Catalysis, 291–294. New York: Elsevier Sci. B.V. doi:10.1016/S0167-2991(03)80217-3 Paul, J. S., Janssens, R., Joeri, J. F. M., Baron, G. V., & Jacobs, P. A. (2005). Journal of Combinatorial Chemistry, 7(3), 407–413. doi:10.1021/ cc0500046
Pescarmona, P. P., Rops, J. J. T., van der Waal, J. C., Jansen, J. C., & Maschmeyer, T. (2002)... J. Mol. Chem. A, 182-183, 319–325. doi:10.1016/S1381-1169(01)00494-0

Piatetsky-Shapiro, G., & Frawley, W. (1991). Knowledge discovery in databases. Menlo Park, CA: AAAI/MIT Press.

Ramakrishnan, N., Bailey-Kellogg, C., Tadepalli, S., & Pandey, V. N. (2005). SIAM Int. Conf. on Data Mining, SDM 2005. Newport Beach, CA, USA.

Rodemerck, U., Baerns, M., Holena, M., & Wolf, D. (2004). Applied Surface Science, 223, 168. doi:10.1016/S0169-4332(03)00919-X

Sammut, C., & Cribb, J. (1990). 7th Int. Machine Learning Conf. Austin, TX: Morgan Kaufmann.

Schein, A. I., Sandler, S. T., & Ungar, L. H. (2004). Univ. of Pennsylvania, Dpt. of Comp. & Information Sci. Tech. Report No. MS-CIS-04-08.

Schmidhuber, J., & Storck, J. (1993). Tech. Report, Fakultat fur Informatik. Technische Universitat Munchen.

Senkan, S. (2001)... Angewandte Chemie International Edition, 40(2), 312–329. doi:10.1002/1521-3773(20010119)40:23.0.CO;2-I

Serna, P., Baumes, L. A., Moliner, M., & Corma, A. (2008)... Journal of Catalysis, 1(258), 25–34. doi:10.1016/j.jcat.2008.05.033

Serra, J. M. (2003)... Catalysis Today, 81(3), 425–436. doi:10.1016/S0920-5861(03)00142-1

Serra, J. M., Baumes, L. A., Moliner, M., Serna, P., & Corma, A. (2007)... Combinatorial Chemistry & High Throughput Screening, 10, 13–24. doi:10.2174/138620707779802779
Serra, J. M., Chica, A.& Corma, A. (2003). Appl. Catal., A. 239, 35-42. Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Proc. of the 5th Annual Workshop on Computational Learning Theory, pp. 287-294. Sjöblom, J., Creaser, D., & Papadakis, K. (2004). Proc. 11th Nordic Symposium on Catalysis. Oulu, Finland Snedecor, G. W., & Cochran, W. G. (1989). Iowa State Univ (8th ed.). Press. Sohn, K. S., Seo, S. Y., & Park, H. D. (2001)... Electrochemical and Solid-State Letters, 4, H26– H29. doi:10.1149/1.1398560 Souvannavong, F., Mérialdo, B., & Huet, B. (2004). WIAMIS’04, 5th Int. Workshop on Image Analysis for Multimedia Interactive Services. Inst. Sup. Técnico, Lisboa, Portugal. Apr. 21-23. Stephens, M. A. (1974)... Journal of the American Statistical Association, 69, 730–737. doi:10.2307/2286009 Tchougang, A., Blansché, A., Baumes, L. A., Lachiche, N., & Collet, P. (2008). Lecture Notes in Computer Science 599-609, Volume 5199. In Rudolph, G., Jansen, T., Lucas, S. M., Poloni, C. & Beume, N. (eds). Parallel Problem Solving from Nature – PPSN X. Berlin: Springer.
Thrun, S., & Moller, K. (1992). Advances in Neural Information Processing Systems 4. San Francisco: Morgan Kaufmann. Todeschini, R., & Consonni, V. (2000). Handbook of Molecular Descriptors. Weinheim, Germany: Wiley-VCH. Tong, S., & Koller, D. (2001). Journal of Machine Learning Research, 2, 45–66. doi:10.1162/153244302760185243 Tribus, M., & Sconyi, G. (1989). An alternative view of the Taguchi approach. Quality Progress, 22, 46–48. Watanabe, Y., Umegaki, T., Hashimoto, M., Omata, K., & Yamada, M. (2004). Catalysis Today, 89(4), 455–464. New York: Elsevier Sci. B.V. doi:10.1016/j.cattod.2004.02.001 Whitehead, S. (1991). A study of cooperative mechanisms for reinforcement learning. TR-365, Dpt. of comp. sci. Rochester, NY: Rochester Univ. Whitley, D., Mathias, K., Rana, S., & Dzubera, J. (1996)... Artificial Intelligence, 85(1-2), 245–276. doi:10.1016/0004-3702(95)00124-7 Wolf, D.; Buyevskaya, O. V.; Baerns, M. (2000). Appl. Catal. A, 63-77. Xiang, X. D., & Takeuchi, I. (2003). Combinatorial Materials Science. New York: Dekker. doi:10.1201/9780203912737
This work was previously published in Advanced Methods and Applications in Chemoinformatics: Research Progress and New Applications, edited by Eduardo A. Castro and A. K. Haghi, pp. 111-138, copyright 2012 by Engineering Science Reference (an imprint of IGI Global).
Chapter 5
An Optimal Categorization of Feature Selection Methods for Knowledge Discovery

Harleen Kaur, Hamdard University, India
Ritu Chauhan, Hamdard University, India
M. Afshar Alam, Hamdard University, India
ABSTRACT

The continuous availability of massive experimental medical data has given impetus to a large effort in developing mathematical, statistical and computationally intelligent techniques to infer models from medical databases. Feature selection has been an active research area in the pattern recognition, statistics, and data mining communities. However, there have been relatively few studies on preprocessing the data used as input for data mining systems in medicine. In this chapter, the authors focus on several feature selection methods and their effectiveness in preprocessing input medical data. They evaluate several feature selection algorithms, such as Mutual Information Feature Selection (MIFS), the Fast Correlation-Based Filter (FCBF) and Stepwise Discriminant Analysis (STEPDISC), with the machine learning algorithms naive Bayes and linear discriminant analysis. The experimental analysis of feature selection techniques on medical databases has enabled the authors to find a small number of informative features, leading to potential improvement in medical diagnosis by reducing the size of the data set, eliminating irrelevant features, and decreasing the processing time.
DOI: 10.4018/978-1-4666-2455-9.ch005
Copyright © 2013, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
INTRODUCTION

Data mining is the task of discovering previously unknown, valid patterns and relationships in large datasets. Generally, each data mining task differs in the kind of knowledge it extracts and the kind of data representation it uses to convey the discovered knowledge. Data mining techniques have been applied to a variety of medical domains to improve medical decision making, and they are able to handle large medical datasets which may consist of hundreds or thousands of features. The large number of features present in such datasets often causes problems for data miners, because some of the features may be irrelevant to the data mining techniques used. To deal with irrelevant features, data reduction techniques can be applied in several ways: by feature (or attribute) selection, by discretizing continuous feature values, and by selecting instances. There are several benefits associated with removing irrelevant features, one of which is reducing the amount of data (i.e., features): the reduced set of factors is easier to handle while performing data mining and makes it possible to analyze the important factors within the data. Feature selection has been an active and fruitful field of research and development for decades in statistical pattern recognition (Mitra, Murthy, & Pal, 2002), machine learning (Liu, Motoda, & Yu, 2002; Robnik-Sikonja & Kononenko, 2003) and statistics (Hastie, Tibshirani, & Friedman, 2001; Miller, 2002). It plays a major role in data selection and preparation for data mining. Feature selection is the process of identifying and removing irrelevant and redundant information as much as possible. Irrelevant features can harm the quality of the results obtained from data mining techniques; it has been proven that the inclusion of irrelevant, redundant, and noisy attributes in the model building process can result in poor predictive performance as well as increased computation.
Moreover, feature selection is widely used for selecting the most relevant subset of features from datasets according to some predefined criterion. The subset of variables is chosen from the input variables by eliminating features with little or no predictive information. It is a preprocessing step to data mining which has proved effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving comprehensibility in medical databases. Many methods have shown effective results, to some extent, in removing both irrelevant and redundant features; the removal of features should be done in a way that does not adversely impact classification accuracy. The main issues in developing feature selection techniques are choosing a small feature set in order to reduce the cost and running time of a given system, as well as achieving an acceptably high recognition rate. The computational complexity of categorization increases rapidly with increasing numbers of objects in the training set, increasing numbers of features, and increasing numbers of classes. For multi-class problems, a substantially sized training set and a substantial number of features are typically employed to provide sufficient information from which to differentiate amongst the multiple classes; thus, multi-class problems are by nature generally computationally intensive. By reducing the number of features, advantages such as faster learning and prediction, easier interpretation, and better generalization are typically obtained. Feature selection is a problem that has to be addressed in many areas, especially in data mining, artificial intelligence and machine learning. Machine learning has been one of the methods used in most of these data mining applications. It is widely acknowledged that about 80% of the resources in a majority of data mining applications are spent on cleaning and preprocessing the data. However, there have been relatively few studies on preprocessing the data used as input to these data mining systems.
In this chapter, we deal more specifically with several feature selection methods and their effectiveness in preprocessing input medical data. The data collected in clinical studies were examined to determine the relevant features for predicting diabetes. We have applied feature selection methods to diabetic datasets using the MIFS, FCBF and STEPDISC schemes and demonstrate how they work better than the single-machine approach. This chapter is organized as follows: we first discuss the related work and the existing feature selection approaches, then present the experimental analysis, and close with the conclusions. We shall now briefly review some of the existing feature selection approaches.
REVIEW OF FEATURE SELECTION APPROACHES

There are several studies on data mining and knowledge discovery as an interdisciplinary field for uncovering hidden and useful knowledge (Kim, Street, & Menczer, 2000). One of the challenges to effective data mining is how to handle immensely vast volumes of medical data: if the data is very large, the number of features presented to learning algorithms can make them very inefficient for computational reasons. As previously mentioned, feature selection is a useful data mining tool for selecting sets of relevant features from medical datasets, and extracting knowledge from these health care databases can lead to the discovery of trends and rules for later diagnostic purposes. The importance of feature selection in the medical domain is discussed in Kononenko, Bratko, and Kukar (1998), and Kaur et al. (2006) applied data mining techniques, i.e. association rule mining, to medical data items. The profusion of data collection by hospitals and clinical laboratories in recent years has helped in the discovery of many disease-associated factors,
such as diagnosing the factors related to a patient's illness. In this context, Kaur and Wasan (2009) proposed that experience management can be used for better diagnosis and disease management in view of this complexity. Complexity arises from the number of irrelevant or redundant features that exist in medical datasets (i.e., features which do not contribute to the prediction of class labels). At this point, we can efficiently reduce the complexity of the data by recognizing associated factors related to the disease. The performance of certain learning algorithms degrades in the presence of irrelevant features. In addition, irrelevant data may confuse algorithms, making them build inefficient classifiers, while correlation between feature sets causes redundancy of information and may result in overfitting. Therefore, it is more important to explore the data and utilize independent features to train classifiers than to increase the number of features we use. Feature selection can reduce the dimensionality of the data, so that important factors can be studied well in the hypothesis space, and allows algorithms to operate faster and more effectively. Feature selection in medical data mining is valuable because the diagnosis of a disease can be carried out in this patient-care activity with a minimum number of features while still maintaining or even enhancing accuracy (Abraham, Simha, & Iyengar, 2007). Pechinizkiy has applied classification successfully to a number of medical applications like localization of a primary tumor, prognostics of recurrence of breast cancer, diagnosis of thyroid diseases, and rheumatology (Richards, Rayward-Smith, Sonksen, Carey, & Weng, 2001). Statistical pattern recognition has been studied extensively by Miller (2002). Impressive performance gains obtained by removing irrelevant features through feature selection have also been reported (Langley, 1994; Liu & Motoda, 1998; Dy & Brodley, 2000; Dash, Liu, & Motoda, 1998). Moustakis and Charissis surveyed the role of machine learning in medical decision making and provided an extensive literature review.
Although the feature selection literature contains many papers, the feature selection algorithms that have appeared can be categorized into two classes, according to the type of information extracted from the training data and the induction algorithm (John, Kohavi, & Pfleger, 1994). However, we could only find a few studies related to medical diagnosis using data mining approaches. Several publications have reported performance improvements for such measures when feature selection algorithms are used. Detailed surveys of feature selection methods can be found in Langley (1994) and Dash and Liu (1997). Recently, feature selection has received more attention because of enthusiastic research in data mining. Aha and Bankert (1995) presented a specific survey of forward and backward sequential feature selection algorithms and their variants. Liu and Yu (2005) surveyed feature selection algorithms for classification and clustering, comparing different algorithms within a categorizing framework based on different search strategies, evaluation criteria, and data mining tasks; their survey reveals unattempted combinations and provides guidelines for selecting feature selection algorithms within an integrated framework for intelligent feature selection. A new method that combines positive and negative categories with the feature set has also been constructed, and chi-square and correlation coefficients have been used for comparison in feature selection by Zheng, Srihari, and Srihari (2003). One of the oldest algorithms used for selecting the relevant features is branch and bound (Narendra & Fukunaga, 1977). The main idea of the algorithm is to select as few features as possible and to place a bound on the value of the evaluation criterion. It starts with the whole set of features and removes one feature at each step, and the bound is placed in order to make the search process faster. Branch and bound (BB) is used when the evaluation criterion is monotonous.
Krishnapuram, Harternink, Carin, and Figueiredo (2004) and Chen and Liu (1999) have studied finding the best subset of features for prediction by reducing the number of features that are irrelevant or redundant; irrelevant data decreases the speed and reduces the accuracy of the mining algorithm. Kirsopp and Shepperd (2002) have also analyzed the application of feature subset selection to cost estimation. This has led to the development of a variety of techniques for selecting an optimal subset of features from a larger set of possible features, although finding an optimal subset is usually intractable (Kohavi & John, 1997). The interesting topic of feature selection for unsupervised learning (clustering) is a more complex issue, and research into this field has recently been getting more attention in several communities (Varshavsky et al., 2006). Recently, Dy and Brodley (2000a), Devaney and Ram (1997), and Agrawal et al. (1998) have studied feature selection and clustering together with a single or unified criterion. Feature selection in unsupervised learning thus aims to find a good subset of features that forms high-quality clusters for a given number of clusters. Recent advances in computing technology in terms of speed and cost, as well as access to tremendous amounts of computing power and the ability to process huge amounts of data in reasonable time, have spurred increased interest in data mining applications for extracting useful knowledge from data.
FEATURE SELECTION ALGORITHMS

The feature selection method is developed as an extension to the recently proposed maximum entropy discrimination (MED) framework. MED is described as a flexible (Bayesian) regularization approach that subsumes, e.g., support vector classification, regression and exponential family models. In general, feature selection algorithms fall into three categories (Kohavi & John, 1997; Liu & Setiono, 1996):

1. the filter approach,
2. the wrapper approach, and
3. the embedded approach.

The filter model relies on general characteristics of the training data to select predictive features (i.e., features highly correlated to the target class) without involving any mining algorithm (Duch, 2006). The assessment of features is based on independent, general characteristics of the data. Filter methods are preprocessing methods which attempt to assess the merits of features from the data while ignoring the effects of the selected feature subset on the performance of the learning algorithm, by computing the correlation. Some filter methods are based on an attempt to immediately derive non-redundant and relevant features for the task at hand (e.g., for prediction of the classes) (Yu & Liu, 2004). Filter techniques are computationally simple and fast and can easily handle very high-dimensional datasets. They are independent of the classifier, so they need to be performed only once, even with different classifiers. Among the pioneering and much-cited filter methods are FOCUS (Almuallim & Dietterich, 1991), which searches all possible feature subsets but is applicable to only a few attributes, and Relief, which is used to compute a ranking score for every feature (Kira & Rendell, 1992).

The wrapper model uses the predictive accuracy of a predetermined mining algorithm to judge the quality of a selected feature subset, generally producing features better suited to the classification task. It searches for features better suited to the mining algorithm, aiming to improve mining performance, but it is also more computationally expensive than filter models (Langley, 1994; Kohavi & John, 1997). These methods assess subsets of variables according to their usefulness to a given predictor, and conduct a search for a good subset by using the learning algorithm as part of the evaluation function. Chen et al. (2005) have presented the application of feature selection using wrappers to the problem of cost estimation; however, it is computationally expensive for high-dimensional data (Blum & Langley, 1997).

Embedded methods perform variable selection as part of the learning procedure and are usually specific to given learning machines. Examples are classification trees and regularization techniques (e.g. the lasso). Embedded techniques are specific to a given learning algorithm and search for an optimal subset of features that is built into the classifier; they are less computationally expensive.

There are various ways to conduct feature selection, depending on the type of data; we introduce some often-used methods that work by analyzing the statistical properties of the data. In this chapter, we have evaluated feature selection techniques such as the Fast Correlation-Based Filter (FCBF), Mutual Information Feature Selection (MIFS) and Stepwise Discriminant Analysis (STEPDISC). In the Fast Correlation-Based Filter technique (FCBF) (Yu & Liu, 2003), the irrelevant features are identified and filtered out. The FCBF algorithm consists of two steps. First, features are ranked by relevance, which is computed as the symmetric uncertainty with respect to the target attribute; irrelevant features whose score falls below a defined threshold are discarded. The second step is redundancy analysis, in which features are identified using an approximate Markov blanket, configured to determine, for a given candidate feature, whether any other feature is both

1. more correlated with the set of classes than the candidate feature, and
2. more correlated with the candidate feature than with the set of classes.

If both conditions are satisfied, the candidate feature is identified as a redundant feature
and is filtered out. The correlation factor is used to measure the similarity between the relevant and irrelevant factors. A feature X with values $x_i$ and the class Y with values $y_i$ are treated as random variables, and the linear correlation coefficient is defined as:

$P(X, Y) = \frac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_i (x_i - \overline{x})^2}\ \sqrt{\sum_i (y_i - \overline{y})^2}}$
where X and Y are linearly dependent if P(X, Y) is equal to ±1: if P takes the value +1 or -1 the variables are completely correlated, and if the value is zero they are completely uncorrelated.

Stepwise discriminant analysis is a feature selection technique that finds the relevant subset by determining the addition and removal of features (Afifi & Azen, 1972). In this process the significance level of each feature is found by analysis of the discriminant function. Stepwise selection begins with no variables in the model. The initial variable is then paired with each of the other independent variables one at a time, and success is measured with the discriminant function until the rate of discrimination improves; in this stepwise manner it finds the best discriminatory power. The discriminant function is calculated with the Mahalanobis D² and Rao's V distance functions.

Mutual information feature selection (MIFS) (Battiti, 1992) measures the mutual dependencies between two random features. It maximizes the mutual information between the selected features and the classes, while minimizing the interdependence among the selected features. For columns that contain discrete and discretized data, an entropy measure is used. The mutual information between two random variables can be specified as I(M;N) between the set of feature values M and the set of classes N; I(M;N) measures the interdependence between the two random variables M and N. It can be computed as follows:
$I(M; N) = H(N) - H(N \mid M)$

The entropy H(N) measures the degree of uncertainty entailed by the set of classes N, and can be computed as

$H(N) = -\sum_{n \in N} p(n) \log p(n)$

The remaining entropy between M and N is measured as

$H(N \mid M) = -\sum_{n \in N} \sum_{m \in M} p(m, n) \log p(n \mid m)$

The mutual information between two discretized random features is defined as

$I(M; N) = \sum_{n \in N} \sum_{m \in M} p(m, n) \log \frac{p(m, n)}{p(m)\, p(n)}$

If the data is a continuous random variable, the mutual information is defined as

$I(M; N) = \int_{n \in N} \int_{m \in M} p(m, n) \log \frac{p(m, n)}{p(m)\, p(n)}\ dm\, dn$
The drawback of the MIFS algorithm is that it does not take into consideration the interaction between features. It has been proved that choosing features individually does not lead to an optimal solution.
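As an illustration of the quantities defined above, the following self-contained sketch computes entropy, mutual information and the symmetric uncertainty used by FCBF's relevance ranking; the toy feature/target data are assumptions for demonstration only.

```python
# Entropy, mutual information and symmetric uncertainty SU = 2*I/(H(M)+H(N));
# a pure-Python illustration, not the Tanagra implementation used in the chapter.
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def symmetric_uncertainty(xs, ys):
    hx, hy = entropy(xs), entropy(ys)
    return 2 * mutual_information(xs, ys) / (hx + hy) if hx + hy else 0.0

feature = ["low", "low", "high", "high", "low", "high"]   # toy discretized feature
target = ["F", "F", "M", "M", "F", "M"]                   # toy target attribute
print(symmetric_uncertainty(feature, target))  # 1.0: feature determines target
```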
RESULTS

Data mining technology has become an essential instrument for medical research and hospital management. Numerous data mining techniques are applied in medical databases to improve medical decision making. Diabetes affects between 2% and 4% of the global population (up to 10% in the over-65 age group), and its avoidance and effective treatment are undoubtedly crucial public health and health economics issues in the 21st century. There has been extensive work on diabetic registries for a variety of purposes; these databases are extremely large and have proved beneficial in diabetic care. Diabetes is a disease that can cause complications such as blindness and amputation, and in extreme cases cardiovascular death. The challenge for physicians, then, is to know which factors can prove beneficial for diabetic care.
Evaluation Techniques

The aim of this research was to identify significant factors influencing diabetes control by applying feature selection in a diabetic care system to improve classification and knowledge discovery. The classification models can be used to determine individuals in the population with poor diabetes control status based on physiological and examination factors. The study focuses on feature selection techniques to develop models capable of predicting qualified medical opinions. The techniques used are the Fast Correlation-Based Filter (FCBF), Mutual Information Feature Selection (MIFS), and Stepwise Discriminant Analysis (STEPDISC) filtering techniques. The models are compared in terms of their performance with a significant classifier, and each model reveals the specific input variables which are considered significant. In this study, the effectiveness of the different feature sets chosen by each technique is tested with two different and well-known types of classifiers: a probabilistic classifier (naive Bayes) and linear discriminant analysis. These algorithms have been selected because they represent different approaches to learning and for their long-standing tradition in classification studies. As stated previously, feature selection can be grouped into filter or wrapper methods depending on whether the classifier is used to find the feature subset. In this chapter, for the filter model we have used consistency and correlation measures; for the wrapper method, a standard classifier, Naive Bayes, has been applied to find the most suitable technique for diabetic care in feature selection. The Naïve Bayes technique depends on the famous Bayesian approach, yielding a simple, clear and fast classifier (Witten & Frank, 2005). It has been called "naïve" due to the fact that it assumes mutually independent attributes. The data in Naïve Bayes is preprocessed to find the most dependent categories. This method has been used in many areas to represent, utilize, and learn probabilistic knowledge, and significant results have been achieved in machine learning. The Naïve Bayesian technique directly relates specific input variables to the class attribute by recording dependencies between them. We have used the Tanagra toolkit to experiment with these two data mining algorithms. Tanagra is an ensemble of tools for data classification, regression, clustering, association rules, and visualization; the toolkit is open source software issued under the General Public License (GNU). The diabetic data set has 403 rows and 19 attributes from 1046 subjects who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors among African Americans in Central Virginia. It contains the basic demographics for each patient, that is, patient id, age, race, height, weight, hip size, waist size, stabilized glucose, high density lipoprotein, glycosolated hemoglobin and the symptoms presented when the patient first came to the emergency room, as well as the emergency room diagnosis; these features were continuous, whereas gender, frame and location were taken as discrete attributes.
An Illustrative Application Domain: Experiment I

We now introduce an example that will be used to illustrate the concept of feature selection methods in spatial data mining. In this study the diabetic dataset was used to examine the Naïve Bayes technique with FCBF and MIFS filtering. We measured the error rate before and after feature selection on the naïve Bayes classifier, using the whole training dataset for the analysis. The error rate before feature selection was 1%, whereas the analysis suggests that the error rate was reduced to 0.8% after the Fast-Correlation Based Filtering technique was applied; the results are shown in Table 1. We observe that the FCBF feature selection process improves the accuracy of the Naive Bayes classifier. In order to determine the number of features in the selected model, a five-fold cross validation was carried out. Cross validation is an estimate of a selected feature set's performance in classifying new data sets. Of course, domain knowledge is crucial in deciding which attributes are important and which are not; therefore, cross validation was used to determine the best number of features in a model. In each cross-validation trial, the training data set was divided into two equal subsets, and each subset took turns as the test set used to estimate accuracy. The classification error rate (fitness) of each feature was obtained, and on that basis the feature was added to or removed from the feature subset used. Table 2 reports the overall cross-validation error rate of the Naïve Bayes classifier, and Table 3 shows the cross-validation error rate after the Fast-Correlation Based Filtering technique was applied. We have found that FCBF increases the efficiency of the Naïve Bayes classifier by removing irrelevant and redundant information. We then used the Mutual Information Feature Selection (MIFS) filtering technique with the Naïve Bayes classifier. The continuous attributes were chosen as part of the analysis using gender as the target attribute, and the MDLPC technique was applied to discretize continuous attributes into ordered discrete ones. We then applied the Naïve Bayes classifier, and the error rate was determined by cross validation with 5 trials and 2 folds, analyzing the table by rows and columns. Table 4 shows the error rate before the filtering technique was applied to the Naïve Bayes classifier, and Table 5 the error rate after MIFS feature selection was applied. We have found that the error rate of the Naïve Bayes classifier improves after the MIFS filtering technique is applied.
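A minimal sketch of this evaluation protocol, assuming scikit-learn and a hypothetical CSV export of the diabetic dataset (the file name and column names are assumptions), might look as follows:

```python
# Sketch of the 5-trial x 2-fold cross-validation described above
# (illustrative only; the chapter's experiments were run in Tanagra).
import pandas as pd
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("diabetes.csv")                 # hypothetical file
X = df.drop(columns=["gender"]).select_dtypes("number")
y = df["gender"]                                  # target attribute, as in the text

cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")
print("error rate:", 1 - scores.mean())
```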
Table 1. Error rate using naïve Bayes classifier

           Before filtering    After using FCBF
MIN        1                   0.8
MAX        1                   0.8
Trial 1    1                   0.8
Trial 2    1                   0.8
Trial 3    1                   0.8
Table 2. Cross validation on naïve Bayes

Error rate: 1

Values prediction:
Value     Recall    1-Precision
Female    0         1
Male      0         1

Confusion matrix:
          Female    Male    Sum
Female    0         9       9
Male      6         0       6
Sum       6         9       15
Table 3. Cross validation after FCBF

Error rate: 0.8

Values prediction:
Value     Recall    1-Precision
Female    0         0.6667
Male      0         1

Confusion matrix:
          Female    Male    Sum
Female    2         6       8
Male      5         0       5
Sum       7         6       13
Table 4. Naïve Bayes classifier

Error rate: 0.4333

Values prediction:
Value     Recall    1-Precision
Female    0.3333    0.5556
Male      0.7222    0.381

Confusion matrix:
          Female    Male    Sum
Female    4         8       12
Male      5         13      18
Sum       9         21      30
Table 5. MIFS on naïve Bayes

Error rate: 0.2

Values prediction:
Value     Recall    1-Precision
Female    0.75      0.25
Male      0.8333    0.1667

Confusion matrix:
          Female    Male    Sum
Female    9         3       12
Male      3         15      18
Sum       12        18      30
A third analysis was conducted using the linear discriminant analysis prediction technique with the stepwise discriminant analysis (STEPDISC) filtering technique, to evaluate the efficiency of the datasets. Classification of the data was analyzed using the linear discriminant technique: we performed stepwise discriminant analysis on the training data sets, and about 19 features were considered by the stepwise selection process. Bootstrap was carried out to measure the efficiency of the number of features in a selected model. The bootstrap re-samples the available data at random with replacement; we estimated the error rate by the bootstrap procedure for the selected feature set's performance in classifying new data sets. Therefore, it was used to determine the best number of features in a model. Four features were chosen by the stepwise selection process: if the F statistic has a lower value than the threshold value the feature is excluded, and if the F statistic has a higher value the feature is added to the model. Table 6 shows the error rate with 25 replications after STEPDISC is performed. The classification performance is estimated using the normal procedure of cross validation, or the bootstrap estimator; thus, the entire feature selection process is rather computation-intensive. The accuracy of the classifier improved with the removal of the irrelevant and redundant features. The learning model's efficiency has improved by alleviating the effect of the curse of dimensionality, enhancing generalization capability, speeding up the learning process, and improving model interpretability.
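The .632 bootstrap estimates reported in Table 6 can be sketched as follows; this is an illustrative implementation assuming scikit-learn's linear discriminant analysis, not the Tanagra code actually used.

```python
# Rough sketch of the .632 bootstrap error estimate (X, y as numpy arrays).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def bootstrap_632(X, y, n_reps=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    clf = LinearDiscriminantAnalysis().fit(X, y)
    resub = np.mean(clf.predict(X) != y)           # resubstitution error
    oob_errors = []
    for _ in range(n_reps):                        # 25 replications, as in Table 6
        idx = rng.integers(0, n, n)                # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)      # out-of-bag instances
        if len(oob) == 0:
            continue
        clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
        oob_errors.append(np.mean(clf.predict(X[oob]) != y[oob]))
    e_oob = float(np.mean(oob_errors))             # average test-set error
    return 0.368 * resub + 0.632 * e_oob           # .632 estimator

# err = bootstrap_632(X.to_numpy(), y.to_numpy())  # hypothetical usage
```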
Table 6. Bootstrap in linear discriminant analysis

Error Rate
.632+ bootstrap:           0.4682
.632 bootstrap:            0.4682
Resubstitution:            0.4286
Avg. test set:             0.4912
Repetition 1, test error:  0.4912
Bootstrap: 0.4682          Bootstrap+: 0.4682

Error Rate after stepwise discriminant analysis
.632+ bootstrap:           0.1426
.632 bootstrap:            0.122
Resubstitution:            0
Avg. test set:             0.193
Repetition 1, test error:  0.193
Bootstrap: 0.4682          Bootstrap+: 0.1426
It has also benefited medical specialists in acquiring a better understanding of the data by analyzing related factors. Feature selection is also known as variable selection, feature reduction, attribute selection or variable subset selection. The spatial distribution of attributes sometimes shows distinct local trends which contradict the global trends. For example, the maps in Figure 1 show significant data taken from Louisa and Buckingham counties in the state of Virginia. It is interesting to note the statistical difference between the two populations: the population of Louisa shows a more significant increase in the level of cholesterol compared to Buckingham. We have also found that the level of cholesterol is higher in males than in females. The spatial distribution in Figure 2 represents the number of cases by stabilized glucose level with respect to age in Louisa and Buckingham; the distribution shows that the people of Louisa have more frequent chances of having diabetes compared to the population of Buckingham. The people of Louisa in the age group between 35 and 45 years have active chances of carrying diabetes compared to Buckingham, and the male age group has greater chances of having diabetes. Later on, we show how to build rule induction from the diabetic dataset.
Figure 1. Indicated areas show the level of cholesterol in Louisa and Buckingham counties, Virginia

Figure 2. Indicated areas of stabilized glucose level: people of Louisa vs. Buckingham

Experiment II

Rule induction is a data mining process for acquiring knowledge in terms of if-then rules from a training set of objects which are described by attributes and labeled by a decision class. The Rule Induction Method has the potential to use retrieved cases for predictions. Initially, a supervised learning system is used to generate a prediction model in the form of "IF ... THEN ..." style rules. The rule antecedent (the IF part) contains one or more conditions about the values of predictor attributes, whereas the rule consequent (the THEN part) contains a prediction about the value of a goal attribute. The decision making process improves if the prediction of the value of the goal attribute is accurate. IF-THEN prediction rules are very popular in data mining; they represent discovered knowledge at a high level of abstraction. Algorithms for inducing such rules have
been mainly studied in machine learning and data mining. In the health care system this can be applied as follows. In this method we adopted a pre-classification technique, represented as logical expressions of the following form:

(Symptoms) AND (Previous history) → (Cause of disease)
Box 1. If_then_rule induced in the onset of diabetes in adults

IF Sex = Male OR Female AND Frame = LARGE AND Cholesterol > 200,
THEN Diagnosis = chances of diabetes increase.
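Applied programmatically, such a rule is simply a predicate over a patient record; a toy sketch follows (the field names are assumptions, and the thresholds mirror Box 1):

```python
# Toy illustration of applying a Box 1 style rule as a predicate.
def box1_rule(rec):
    return rec.get("frame") == "large" and rec.get("cholesterol", 0) > 200

patient = {"gender": "male", "frame": "large", "cholesterol": 228}
print("chances of diabetes increase:", box1_rule(patient))  # True
```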
Box 2. If_then_rule induced in the diagnosis of diabetes in blood

IF glycosolated heamoglobin ≥ 7.0 AND waist ratio > 48 AND age > 50,
THEN the record is pre-classified as "positive diagnosis",
ELSE IF glycosolated heamoglobin < 2.0 AND waist ratio

... $\gamma_{r_p}^{H}(i)$. This is motivated by the
fact that the important rules describe the relationships between the hierarchies and the less important rules can be deduced from the more important ones using the hierarchy. GARP was formalized in (Srikant & Agrawal, 1995) using expected values:

$Sup_E(X, Y) = \frac{|X|}{|p_H(X)|} \cdot P(p_H(X) \cap Y), \qquad Conf_E(X, Y) = Conf(p_H(X), Y),$

thus the measured value should be greater than its expectation. We already used the expectations in the Section Interestingness to propose the interestingness measure. From these definitions it is clear that if only the antecedent is generalized, then using support or confidence to determine interestingness results in the same value. We use this variant in the experiments.
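A small sketch of this GARP-style pruning check follows; the counts are invented, and p_H(X) denotes the generalized (parent) antecedent as above.

```python
# A rule X -> Y is kept only if its measured support exceeds the expectation
# derived from its generalized antecedent p_H(X). Illustrative counts only.
def expected_support(sup_x, sup_parent_x, sup_parent_rule):
    """Sup_E(X,Y) = |X| / |p_H(X)| * P(p_H(X) and Y)."""
    return (sup_x / sup_parent_x) * sup_parent_rule

# e.g. "comedy & friendship -> romance" vs. its parent "comedy -> romance"
sup_x, sup_parent_x = 120, 600        # antecedent supports (child, parent)
sup_parent_rule = 0.15                # P(p_H(X) and Y)
measured = 0.05                       # observed P(X and Y)

keep = measured > expected_support(sup_x, sup_parent_x, sup_parent_rule)
print(keep)  # True: the rule exceeds what its generalization implies
```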
PAR

PAR, in turn, assures that only those rules are selected whose values of the chosen metric are greater than the corresponding values of all their ancestors. Such pruning differs from GARP in that it does not consider decreases and increases, in a wave-like manner, along a path. In (Srikant & Agrawal, 1995) the authors argue that rule pruning should only consider the closest ancestors (parents), but as we aim at selecting the most interesting rules, preferably in lower hierarchy levels, we use ancestral pruning, i.e. a value of a rule must be greater than all values of rules of the antecedent's ancestors.
EXPERIMENTS

The main difficulty in evaluating our approach is that it is typically not known how many and what type of relationships between taxonomies should be discovered. To simplify evaluation, we chose two similar taxonomies for the first experiment, establishing a baseline. In this case it is much
easier to connect similar concepts manually, an approach which is often used to validate results (Doan et al., 2002; Maedche & Staab, 2000). In the second experiment, we investigated two different taxonomies and compared the results obtained for the true taxonomies with the results obtained for taxonomies extracted from classifier predictions. In both experiments, we examined which quality measure produces fewer rules that are still interesting.
Performance Measures

To assess the obtained results, two usual performance measures, known as recall and precision, were used:

R(A, B) = |A ∩ B| / |A|,
P(A, B) = |A ∩ B| / |B|,
where A represents the set of true elements and B the set of found elements. In this work, elements were either the relations between concepts or multi-label instances. The F-measure is the harmonic mean of recall and precision; both were macro-averaged over the instances to calculate the F-measure. In (Maedche & Staab, 2000) an evaluation metric is introduced, known as Generic Relation Learning Accuracy (RLA). We transformed it into RLA recall and RLA precision by focusing on the true rule set instead of the discovered one and by building the sum over true rules instead of averaging over found rules. RLA recall is obtained by dividing the resulting sum by the number of true rules, whereas RLA precision is obtained by dividing it by the number of rules found. This is motivated by the fact that not only the discovered true rules are important but also the number of discovered irrelevant or redundant rules. It should be noted that RLA precision can become greater than 1, because a smaller set of rules can cover a larger set of rules with good approximation,
i.e. it is no longer normalized, and calculating an F-measure can lead to a very good value even if the recall is bad.
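For concreteness, the plain (non-RLA) set-based measures can be computed as below; this is a small sketch in which rule sets are modeled as Python sets:

def recall(true_set, found_set):
    # R(A, B) = |A ∩ B| / |A|
    return len(true_set & found_set) / len(true_set)

def precision(true_set, found_set):
    # P(A, B) = |A ∩ B| / |B|
    return len(true_set & found_set) / len(found_set)

def f_measure(true_set, found_set):
    # Harmonic mean of recall and precision.
    r, p = recall(true_set, found_set), precision(true_set, found_set)
    return 2 * r * p / (r + p) if r + p > 0 else 0.0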
Data

The IMDb and Rotten Tomatoes (RT) dataset, used in (T. P. Martin et al., 2008), was kindly donated by Trevor Martin. From the roughly 90,000 movies of each database, we selected only those that have at least one genre assigned and that have an almost exact match in title and director in both datasets. This resulted in 3079 entries. We further enriched the dataset with data we collected from IMDb in August 2010, comprising about 390,000 movies. From these 390,000 IMDb entries we extracted the genres and keywords as labels and created two label sets, one with classes having more than 250 entries (IMDb_large) and the other with classes having more than 600 entries (IMDb_small). Further, we deleted some redundancy and obvious dependencies between genre and keyword labels (like friend, friendship), reducing the label set to 88 labels for IMDb_large and 48 for IMDb_small. We added the selected keywords and genres (labels) to the 3079 IMDb movies, creating the IMDb_large and IMDb_small multi-label datasets. After that, for each label set we extracted a hierarchy from the multi-labels with the Apriori algorithm and a 0.4 confidence threshold (the method is described in (Brucker et al., 2011)). Further, we expanded the multi-labels by ancestry, i.e. parent labels were assigned to their respective children for each multi-label instance, for each dataset/hierarchy. The Rotten Tomatoes dataset also had two versions of the label set, one small (RT_small) with labels having at least 50 entries and one large (RT_large) with labels having at least 30 entries. The hierarchy was extracted from the labels with Apriori, with the threshold also set to 0.4, for both RT datasets. With the extracted hierarchy we performed ancestor expansion on the multi-labels and removed label redundancy as for IMDb.
The WIPO-alpha dataset (see Endnote 1) is a collection of patent documents made available for research by the World Intellectual Property Organization (WIPO). The original hierarchy consists, from top to bottom, of 8 sections, 120 classes, 630 subclasses and about 69,000 groups. In our experiment, the hierarchy was only considered through to, and including, subclasses. Each document in the collection has one so-called main code and any number of secondary codes, where each code describes a subclass to which the document belongs. Both main and secondary codes were used in our experiment. We removed subclasses with fewer than 50 training and 50 test instances (and any documents that only belonged to such "small" subclasses). The remaining training and test documents were then combined to form a single dataset. The final dataset consisted of 62,587 records with a label set of size 273. For Wipo A, only branch A (Human Necessities) was used, and for Wipo C, only branch C (Chemistry); these are the branches with the most correlated documents (4146). Together they span only 89 nodes and contain 26,530 documents. Multi-label statistics about each dataset are depicted in Table 2. The datasets are quite different: Wipo A & C is not as dense, and IMDb & RT is not as large as Wipo.
Table 2. Multi-label statistics of the datasets

                        IMDb:RT small/large    Wipo A & C
# multi-labels          3079                   27,483
Cardinality             8.1/9.5                4.0
Density                 0.09/0.06              0.04
Distinct multi-labels   2616/2807              1325
Hierarchy depth         3:4/4:5                3

IMDb-RT (Small) Strong Associations

We compared the IMDb_small with the RT_small to find out if our method could find the same strong associations as Martin et al. found in (T. Martin, Shen, & Azvine, 2007). The rules were discovered by PAR using the three interestingness measures discussed above, Interestingness (Int), Confidence (Conf) and Jaccard (Jac), and a minimum confidence threshold of 0.2 (yielding 120 rules), which was found experimentally. A minimum support threshold was not employed because the dataset was relatively small. With the interestingness measure, we found 13 of the 17 strong associations found by Martin et al. (see Endnote 2) from 101 extracted connections. Examples of the found rules are:

• Horror → Horror/Suspense
• Documentary → Documentary
• War → Drama
• blood → Horror/Suspense
• Documentary → Education/General Interest
The main connections were discovered successfully. Some deviations can be explained by their redundancy with respect to the GARs. For example, IMDb Thriller could not be connected to RT Drama, since IMDb Thriller is a direct child of Drama (IMDb Drama → RT Drama is a strong bond, therefore implying the rule left out). Similarly, Mystery could not be related to Horror/Suspense since its direct parent, Thriller, was also strongly connected to it, and Adventure → Action/Adventure was left out because of Action → Action/Adventure. Taking these rules into account, this sums up to a total of 16 rules covered by the method. The remaining connection, Adult → Education/General Interest, could not be discovered due to its relatively low confidence and support values, which indicates that the connection is noisy. This single deviation can also be explained by the dataset itself, since we did not use exactly the same dataset. Therefore, we can state that the interestingness measure achieved very good rule recovery with a low number of irrelevant rules.
Figure 2. Excerpt from the IMDb_small hierarchy; genres are uppercase, keywords are lowercase
With confidence, 124 rules were extracted and the result was almost the same, but instead of Mystery → Horror/Suspense, the rule that was not found was Mystery → Drama; again Thriller had a stronger connection to Drama, while Drama → Drama was even stronger (see Figure 2). The difference can be explained by the fact that PAR with interestingness concentrates more on the relative increase of confidence, whereas PAR with confidence focuses on the absolute difference to ancestors. In this case, confidence had a greater increase from Drama → Horror/Suspense to Thriller → Horror/Suspense than from Thriller → Horror/Suspense to Mystery → Horror/Suspense, although the confidence of Mystery → Horror/Suspense was the highest along its ancestor path. For Mystery → Drama, the highest confidence along the ancestor path was that of Drama → Drama, but there was an increase between Thriller and Mystery, making it more interesting than the previous rule. The association rules discovered using the Jaccard measure did not match the rules we compared against, since the asymmetrical connections were not discovered. For example, the relation War → Drama, with confidence 0.6 and only 0.04 in the opposite direction (Drama → War), was not found. Leaving out pruning with minimum confidence increased recall but decreased precision dramatically, as can be seen in Table 3. One can also see that the results of interestingness and confidence are comparable, but precision was somewhat better in the former case. Jaccard had the worst recall but the best precision at 0.2 minimum confidence.

Table 3. Recall/precision for IMDb_small and RT_small rules compared to the strong associations from Martin et al.

Min. Conf.    Int            Conf           Jac
0             0.824/0.012    0.824/0.013    0.588/0.012
0.2           0.765/0.108    0.765/0.105    0.529/0.129

IMDb-RT (Large) Finding Manual Connections

In the next experiment, 48 connections between the IMDb_large and RT_large hierarchies were created manually, e.g. Comedy → Comedy, Sci-Fi → Science Fiction and Fantasy, gangster → Organized Crime. We expected to discover as many hand-coded connections as possible, plus connections which were not evident or redundant. We compared our method with the method employed by (Maedche & Staab, 2000), which consists in pruning rules by ancestral rules with higher or equal confidence and support; that is, a rule is selected only if its ancestors' confidence and support values are less than its own. In contrast, we only used confidence pruning, since support pruning allowed only the top-level classes to generate rules in our setup. The manually generated relationships were symmetrical, but we focused only on the IMDb to RT direction in this experiment.

First, pruning with minimum support and confidence set to 0 and 0.1, respectively, was examined. There were 31 rules found by using PAR with interestingness and 32 with confidence, which corresponds to 65% and 67% accuracy, respectively (see Table 4). This result is comparable to the results achieved earlier in ontology mapping (Doan et al., 2002). However, some of the hand-coded relations were not found even when the minimum confidence was set to 0.1: singer → Musical and Performing Arts, sex → Erotic, policeman → Cops. PAR with interestingness did not find the rule wwii → World War II, because it was covered by the rule War → World War II. On the other hand, PAR with confidence could not find the sheriff → Western rule, which was subsumed by Western → Western. It should be noted that these relations had low confidence: for half of the undiscovered manual rules, confidence was less than 10 percent. A possible way to find more relations with fewer rules in total would be to consider the inverse confidence (RT to IMDb) and to select only those rules that have a sufficiently high confidence value in both directions. Setting support to 0 and confidence to 0.1 returned 1160 rules, 39 of which were manual connections. Although this method discovered seven more manual connections, it produced approximately three times more rules in total. By increasing confidence to 0.4, the number of found relations decreased rapidly. This indicates that the movies in the datasets were labeled differently and inconsistently. Moreover, many other rules were still found, mostly because many keywords and genres were connected through the hierarchy, e.g. politics → Drama (Conf IMDb to RT: 0.6, Conf RT to IMDb: 0.03) is four times stronger than politics → Politics (0.15, 0.18). In addition, connections from the bottom of IMDb to the top of RT occurred frequently, indicating that the labeling and generalization differed between the two label sets.

Table 4. Recall/precision for PAR with interestingness, confidence and Jaccard for IMDb-RT (Large); for each support threshold, the first row gives the number of found manual rules/total rules and the second row recall/precision

              Conf = 0.10                       Conf = 0.20                       Conf = 0.40
Sup           Int        Conf       Jac         Int        Conf       Jac         Int        Conf       Jac
0             31/373     32/387     29/156      22/212     23/215     21/79       11/121     11/124     9/30
              0.65/0.08  0.67/0.08  0.6/0.19    0.46/0.10  0.48/0.11  0.44/0.27   0.23/0.09  0.23/0.09  0.19/0.30
0.002         29/327     30/338     28/143      22/204     23/208     21/79       11/113     11/117     9/30
              0.60/0.09  0.62/0.09  0.58/0.20   0.46/0.11  0.48/0.11  0.44/0.27   0.23/0.10  0.23/0.09  0.19/0.30
0.040         9/59       9/58       8/39        8/49       8/48       7/30        6/35       6/35       5/18
              0.19/0.15  0.19/0.16  0.17/0.21   0.17/0.16  0.17/0.17  0.15/0.23   0.12/0.17  0.12/0.17  0.10/0.28
Wipo A & C

In the WIPO dataset, hand-coded relations were not used for evaluation. It was difficult to create such connections because the branches of Wipo A & C are very different, and such manual mapping requires expert knowledge. The more appropriate way is to discover the connections directly from the data. Since the proposed method of mining interesting associations between taxonomies worked well with the movie dataset of lower quality, it was natural to assume that this method would also work successfully on the WIPO dataset with well-labeled instances. Wipo A & C was classified by ML-ARAM with 5-fold cross validation using labels from A only, labels from C only, and labels from A & C. The parameters were chosen as in (Brucker et al., 2011): 9 voters, vigilance at 0.9 and a threshold of 0.002. Hierarchies were extracted from the original multi-labels and from the predicted multi-labels, and the tree distances CTED, LCAPD and TO* were calculated similarly to (Brucker et al., 2011). The macro-averaged F1-measure was 0.63, 0.57 and 0.46 for the predicted multi-labels for Wipo A alone, Wipo C alone and Wipo A & C together, respectively. The standard deviation was less than 0.001 for all three cases. Such results point to a relatively hard-to-predict dataset.

Table 5. Tree distances between the hierarchies extracted from the true and predicted multi-labels; the confidence threshold used is given in parentheses

Dataset       CTED           LCAP           TO
Wipo A        0.0732 (1)     0.0796 (1)     0 (1)
Wipo C        0.1250 (0.8)   0.1406 (0.8)   0 (0.8)
Wipo A & C    0.1236 (0.7)   0.1268 (0.7)   0 (0.8)

Table 5 shows that the extracted hierarchies were very close to the original ones. The next comparison concerns the rules discovered from the true labels and the rules discovered from the predicted labels. Using RLA recall and RLA precision revealed not only whether all true rules were discovered, but also whether the results contained uninteresting rules, in terms of the true rules among the discovered ones. Table 6 depicts the results for RLA recall and RLA precision of rules extracted from the predicted multi-labels covering rules extracted from the true multi-labels. The rules discovered from the true multi-labels were pruned by PAR with the true hierarchy, whereas for the rules discovered from the predicted multi-labels the extracted hierarchy was used. Next, the RLA recall and RLA precision values between these two rule sets were calculated by means of the true hierarchy. In the top part of the table, the rules discovered from the true multi-labels using the three different interestingness measures are compared to each other. Although the rule sets extracted using Int and Conf almost covered each other, the set of rules discovered by Int had fewer rules, indicated by the RLA precision greater than 1. Jac had few rules in total, but the number of covered rules was lower. The RLA recall from the separately predicted A to C rules was relatively low compared with the case when they were predicted together. However, RLA precision was close to 1 when they were predicted separately and was low when they were predicted together. The small increase of RLA recall indicates that the overlap between the rule sets was higher when the branches were predicted together; on the other side, the large decrease of RLA precision points to a significant increase in the number of rules extracted from the predicted labels. This indicates that the predicted labels (for the "together" scenario) had multi-label combinations which did not occur in the true multi-labels, as expected. When the branches were predicted separately, the extracted rule sets were more compact but fewer rules from the true rule set were discovered, indicating that some multi-label combinations were not predicted accurately enough. Of all the measures, Int was the best, with a higher RLA recall or at least a higher RLA precision compared to Conf. Jac was not as good as Conf when the labels were predicted together, and also in the case of true multi-labels against true multi-labels. We can conclude from the F1-measure values as well as from the RLA recall and RLA precision results that the branches had few strong correlations but also a certain amount of loose ones.

Table 6. RLA recall/RLA precision from rules extracted from predicted multi-labels to rules from the true multi-labels for WIPO

                               True AC
                       Int          Conf         Jac
True AC          Int   1.00/1.00    0.94/1.04    0.96/0.76
                 Conf  0.93/0.85    1.00/1.00    1.00/0.71
                 Jac   0.80/1.01    0.86/1.21    1.00/1.00
Separately       Int   0.65/0.93    0.63/1.00    0.67/0.76
predicted        Conf  0.62/0.82    0.64/0.93    0.68/0.70
A to C           Jac   0.55/1.01    0.56/1.15    0.64/0.93
Together         Int   0.76/0.33    0.75/0.36    0.77/0.26
predicted        Conf  0.74/0.30    0.74/0.33    0.77/0.25
A to C           Jac   0.68/0.39    0.68/0.43    0.73/0.33
CONCLUSION

In this chapter, the extraction of concept hierarchies and the discovery of relations between them were demonstrated in the context of a multi-label classification task. We introduced a data mining system which performs classification and mining of generalized association rules. The proposed system was applied to two real-world datasets, with the class labels of each dataset taken from two different class hierarchies. Using Pruning by Ancestral Rules, the majority of redundant rules could be removed, leaving only the most important rules. A new quality measure for generalized association rules was developed, which not only correlates very well with the confidence measure, but can also be effectively used for pruning uninteresting rules with respect to the hierarchy. We also improved the RLA measure, relating it to recall and precision. There is still room for further improvement, for instance by removing redundancy: if a parent rule p(i) → j has exactly one child rule child(p(i)) → j, the p(i) → j rule can be erased. This becomes more important in deeper hierarchies. A subject of future work is to improve classification by means of the discovered associations between taxonomies.
REFERENCES

Bendaoud, R., Hacene, M. R., Toussaint, Y., Delecroix, B., & Napoli, A. (2007). Text-based ontology construction using relational concept analysis. In International Workshop on Ontology Dynamics. Innsbruck, Austria.

Brucker, F., Benites, F., & Sapozhnikova, E. (2011). Multi-label classification and extracting predicted class hierarchies. Pattern Recognition, 44(3), 724–738. doi:10.1016/j.patcog.2010.09.010

Choi, N., Song, I. Y., & Han, H. (2006). A survey on ontology mapping. SIGMOD Record, 35(3), 34–41. doi:10.1145/1168092.1168097
Cimiano, P., Hotho, A., & Staab, S. (2005, August). Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research, 24, 305–339.

Doan, A., Madhavan, J., Domingos, P., & Halevy, A. (2002). Learning to map between ontologies on the semantic web. In Proceedings of the 11th International Conference on World Wide Web (pp. 662–673). New York, NY: ACM.

Lallich, S., Teytaud, O., & Prudhomme, E. (2007). Association rule interestingness: Measure and statistical validation. In Guillet, F., & Hamilton, H. (Eds.), Quality measures in data mining (Vol. 43, pp. 251–275). Berlin, Germany: Springer. doi:10.1007/978-3-540-44918-8_11

Maedche, A., & Staab, S. (2000). Discovering conceptual relations from text. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI) (pp. 321–325).

Maedche, A., & Staab, S. (2001, March). Ontology learning for the semantic web. IEEE Intelligent Systems, 16, 72–79. doi:10.1109/5254.920602

Majidian, A., & Martin, T. (2009). Extracting taxonomies from data - A case study using fuzzy formal concept analysis. In WI-IAT '09: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (pp. 191–194). Washington, DC: IEEE Computer Society.

Martin, T., & Shen, Y. (2009, June). Fuzzy association rules in soft conceptual hierarchies. In Fuzzy Information Processing Society, NAFIPS 2009 (pp. 1–6).

Martin, T., Shen, Y., & Azvine, B. (2007). A mass assignment approach to granular association rules for multiple taxonomies. In Proceedings of the Third ISWC Workshop on Uncertainty Reasoning for the Semantic Web, Busan, Korea, November 12.
Martin, T. P., Shen, Y., & Azvine, B. (2008). Granular association rules for multiple taxonomies: A mass assignment approach. In Uncertainty Reasoning for the Semantic Web I: ISWC International Workshops, URSW 2005-2007, Revised Selected and Invited Papers (pp. 224–243). Berlin, Germany: Springer-Verlag. doi:10.1007/978-3-540-89765-1_14

Miani, R. G., Yaguinuma, C. A., Santos, M. T. P., & Biajiz, M. (2009). NARFO algorithm: Mining non-redundant and generalized association rules based on fuzzy ontologies. In Aalst, W. (Eds.), Enterprise information systems (Vol. 24, pp. 415–426). Berlin, Germany: Springer. doi:10.1007/978-3-642-01347-8_35

Omelayenko, B. (2001). Learning of ontologies for the Web: The analysis of existent approaches. In Proceedings of the International Workshop on Web Dynamics, held in conj. with the 8th International Conference on Database Theory (ICDT'01), London, UK.

Rajan, S., Punera, K., & Ghosh, J. (2005). A maximum likelihood framework for integrating taxonomies. In AAAI'05: Proceedings of the 20th National Conference on Artificial Intelligence (pp. 856–861). AAAI Press.

Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. In VLDB '95: Proceedings of the 21st International Conference on Very Large Data Bases (pp. 407–419). San Francisco, CA: Morgan Kaufmann Publishers Inc.
Tan, P. N., Kumar, V., & Srivastava, J. (2004, June). Selecting the right objective measure for association analysis. Information Systems, 29, 293–313. doi:10.1016/S0306-4379(03)00072-3

Tomás, D., & Vicedo, J. L. (2007). Multiple-taxonomy question classification for category search on faceted information. In TSD'07: Proceedings of the 10th International Conference on Text, Speech and Dialogue (pp. 653–660). Berlin, Germany: Springer-Verlag.

Villaverde, J., Persson, A., Godoy, D., & Amandi, A. (2009, September). Supporting the discovery and labeling of non-taxonomic relationships in ontology learning. Expert Systems with Applications, 36, 10288–10294. doi:10.1016/j.eswa.2009.01.048

Wimalasuriya, D. C., & Dou, D. (2009). Using multiple ontologies in information extraction. In CIKM '09: Proceeding of the 18th ACM Conference on Information and Knowledge Management (pp. 235–244). New York, NY: ACM.
ADDITIONAL READING

Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (pp. 207–216). New York, NY: ACM.
Surana, A., Kiran, U., & Reddy, P. K. (2010). Selecting a right interestingness measure for rare association rules. In 16th International Conference on Management of Data (COMAD).
Berardi, M., Lapi, M., Leo, P., & Loglisci, C. (2005). Mining generalized association rules on biomedical literature. In A. Moonis & F. Esposito (Eds.), Innovations in Applied Artificial Intelligence, Lecture Notes in Artificial Intelligence, 3353 (pp. 500–509). Springer-Verlag.
Ichise, R., Takeda, H., & Honiden, S. (2001). Rule induction for concept hierarchy alignment. In Proceedings of the 2nd Workshop on Ontology Learning at the 17th International Joint Conference on Artificial Intelligence (IJCAI).
Berzal, F., Cubero, J.-C., Marin, N., Sanchez, D., Serrano, J.-M., & Vila, A. (2005). Association rule evaluation for classification purposes. In Internacional Congreso Español de Informática (CEDI 2005). España.
Bodenreider, O., Aubry, M., & Burgun, A. (2005). Non-lexical approaches to identifying associative relations in the gene ontology. In Pacific Symposium on Biocomputing (pp. 91–102). doi:10.1142/9789812702456_0010
Marinica, C., & Guillet, F. (2010). Knowledge-based interactive postmining of association rules using ontologies. IEEE Transactions on Knowledge and Data Engineering, 22, 784–797. doi:10.1109/TKDE.2010.29
Brijs, T., Vanhoof, K., & Wets, G. (2003). Defining interestingness measures for association rules. International Journal of Information Theories and Applications, 10(4), 370–376.
Natarajan, R., & Shekar, B. (2005). A relatedness-based data-driven approach to determination of interestingness of association rules. In Proceedings of the 2005 ACM Symposium on Applied Computing (pp. 551–552). New York, NY: ACM.
de Carvalho, V. O., Rezende, S. O., & de Castro, M. (2007). Evaluating generalized association rules through objective measures. In AIAP’07: Proceedings of the 25th Conference on IASTED International Multi-Conference (pp. 301–306). Anaheim, CA: ACTA Press.
Pasquier, N. (2000). Mining association rules using formal concept analysis. In Stumme, G. (Ed.), Working with Conceptual Structures. Contributions to ICCS 2000 (pp. 259–264).
Haehnel, S., Hauf, J., & Kudrass, T. (2006). Design of a data mining framework to mine generalized association rules in a web-based GIS (pp. 114–117). DMIN.
Sánchez, D., & Moreno, A. (2008, March). Learning non-taxonomic relationships from web documents for domain ontology construction. Data & Knowledge Engineering, 64, 600–623. doi:10.1016/j.datak.2007.10.001
Hong, T. P., Lin, K. Y., & Wang, S. L. (2003, September). Fuzzy data mining for interesting generalized association rules. Fuzzy Sets and Systems, 138, 255–269. doi:10.1016/S0165-0114(02)00272-5
Shaw, G., Xu, Y., & Geva, S. (2009). Interestingness measures for multi-level association rules. In Proceedings of ADCS 2009, School of Information Technologies, University of Sydney.
Jalali-Heravi, M., & Zaïane, O. R. (2010). A study on interestingness measures for associative classifiers. In Proceedings of the 2010 ACM Symposium on Applied Computing (pp. 1039–1046). New York, NY: ACM.

Kunkle, D., Zhang, D., & Cooperman, G. (2008). Mining frequent generalized itemsets and generalized association rules without redundancy. Journal of Computer Science and Technology, 23, 77–102. doi:10.1007/s11390-008-9107-1
Shrivastava, V. K., Kumar, P., & Pardasani, K. R. (2010, February). FP-tree and COFI based approach for mining of multiple level association rules in large databases. International Journal of Computer Science and Information Security, 7(2).

Sikora, M., & Gruca, A. (2010). Quality improvement of rule-based gene group descriptions using information about GO terms importance occurring in premises of determined rules. Applied Mathematics and Computer Science, 20(3), 555–570.
Singh, L., Scheuermann, P., & Chen, B. (1997). Generating association rules from semi-structured documents using an extended concept hierarchy. In Proceedings of the Sixth International Conference on Information and Knowledge Management (pp. 193–200).

Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., & Lakhal, L. (2001). Intelligent structuring and reducing of association rules with formal concept analysis. In Baader, F., Brewka, G., & Eiter, T. (Eds.), KI 2001: Advances in Artificial Intelligence (Vol. 2174, pp. 335–350). Berlin, Germany: Springer. doi:10.1007/3-540-45422-5_24

Toivonen, H., Klemettinen, M., Ronkainen, P., Hätönen, K., & Mannila, H. (1995). Pruning and grouping discovered association rules.

Tseng, M. C., Lin, W. Y., & Jeng, R. (2008). Updating generalized association rules with evolving taxonomies. Applied Intelligence, 29, 306–320. doi:10.1007/s10489-007-0096-5

Verma, H. K., Gupta, D., & Srivastava, S. (2010, August). Article: Comparative investigations and performance evaluation for multiple-level association rules mining algorithm. International Journal of Computers and Applications, 4(10), 40–45. doi:10.5120/860-1208

Wu, T., Chen, Y., & Han, J. (2010, November). Re-examination of interestingness measures in pattern mining: a unified framework. Data Mining and Knowledge Discovery, 21, 371–397. doi:10.1007/s10618-009-0161-2

Xu, Y., & Li, Y. (2007). Generating concise association rules. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (pp. 781–790). New York, NY: ACM.
Yang, L. (2005, January). Pruning and visualizing generalized association rules in parallel coordinates. IEEE Transactions on Knowledge and Data Engineering, 17, 60–70. doi:10.1109/TKDE.2005.14

Zhang, H., Zhao, Y., Cao, L., & Zhang, C. (2007). Class association rule mining with multiple imbalanced attributes. In Proceedings of the 20th Australian Joint Conference on Advances in Artificial Intelligence (pp. 827–831). Berlin, Germany: Springer-Verlag.
KEY TERMS AND DEFINITIONS

Association Rule: A connection between two items or itemsets which occur frequently together in a given multiset of transactions. The connection strength is traditionally measured by confidence in order to find the most interesting associations, but it can also be assessed by any other interestingness measure.

Classification: The process of learning a function to map instances to classes, where each instance possesses a set of features. The instance set is usually divided into a training set and a test set.

Cross Validation: An evaluation procedure that divides the whole dataset into a number of small test sets and uses the rest of the respective set to train the classifier.

Hierarchy: A tree-like structure of classes or concepts where each child class has exactly one parent class, as opposed to a directed acyclic graph, where a child class can have multiple parents. It is often used to ease access to and organize data.

Multi-Label Classification: Classification with classes that are not mutually exclusive, where an instance can belong to multiple classes.
Non-Taxonomic Concept Relations: Associations between concepts (classes) which are not covered by a hierarchy. They can be either within a single label set or between several label sets. Rule Pruning: Elimination of redundant association rules with the aim of selecting a small set of truly interesting rules containing useful knowledge.
ENDNOTES

1. http://www.wipo.int/classifications/ipc/en/ITsupport/Categorization/dataset/wipo-alpha-readme.html (retrieved August 2009)
2. We grouped some labels together, e.g. Comedies and Comedy.
This work was previously published in Intelligent Data Analysis for Real-Life Applications: Theory and Practice, edited by Rafael Magdalena-Benedito, Marcelino Martínez-Sober, José María Martínez-Martínez, Joan Vila-Francés and Pablo EscandellMontero, pp. 18-34, copyright 2012 by Information Science Reference (an imprint of IGI Global).
Chapter 8
Online Clustering and Outlier Detection Baoying Wang Waynesburg University, USA Aijuan Dong Hood College, USA
ABSTRACT

Clustering and outlier detection are important data mining areas. Online clustering and outlier detection generally work with continuous data streams generated at a rapid rate and have many practical applications, such as network intrusion detection and online fraud detection. This chapter first reviews the related background of online clustering and outlier detection. Then, an incremental clustering and outlier detection method for market-basket data is proposed and presented in detail. This proposed method consists of two phases: weighted affinity measure clustering (WC clustering) and outlier detection. Specifically, given a data set, the WC clustering phase analyzes the data set and groups data items into clusters. Then, the outlier detection phase examines each newly arrived transaction against the item clusters formed in the WC clustering phase, and determines whether the new transaction is an outlier. Periodically, the newly collected transactions are analyzed using WC clustering to produce an updated set of clusters, against which transactions arriving afterwards are examined. The process is carried out continuously and incrementally. Finally, future research trends in online data mining are explored at the end of the chapter.
DOI: 10.4018/978-1-4666-2455-9.ch008

1. INTRODUCTION

With the widespread use of networks, online clustering and outlier detection, as main data mining tools, have drawn attention from many practical applications, especially in areas where detecting
abnormal behaviors is critical, such as online fraud detection, network intrusion detection, and customer behavior analysis. These applications often generate a huge amount of data at a rather rapid rate. Manual screening or checking of this massive data collection is time consuming and impractical. Because of this, online clustering and outlier detection is a promising approach for
such applications. Specifically, data mining tools are used to group online activities or transactions into clusters and to detect the most suspicious entries. The clusters are used for marketing and management analysis. The most suspicious entries are investigated further to determine whether they are truly outliers. Numerous clustering and outlier detection algorithms have been developed (Agyemang, Barker, & Alhajj, 2006; Weston, Hand, Adams, Whitrow, & Juszczak, 2008; Dorronsoro, Ginel, Sánchez, & Cruz, 1997; Bolton & Hand, 2002; Panigrahi, 2009; He, Deng, & Xu, 2005; Wei, Qian, Zhou, Jin, & Yu, 2003; Aggarwal, Han, Wang, & Yu, 2006; Elahi, Li, Nisar, Lv, & Wang, 2008), but the majority of them are intended for continuous data. With the few existing approaches for categorical data (He et al., 2005; Wei et al., 2003), time efficiency and detection accuracy need to be further improved. In this chapter, we present an efficient dynamic clustering and outlier detection method for online market-basket data. Market-basket data are usually organized horizontally in the form of transactions, with each transaction containing a list of items bought (and/or a list of behaviors performed) by a customer during a single checkout at an (online) store. Unlike traditional data, market-basket data are known to be high dimensional, sparse, and to contain attributes of a categorical nature. Our incremental clustering and outlier detection approach consists of two phases: weighted affinity measure clustering (WC clustering) and outlier detection. First, the transaction sets are analyzed so that items are grouped using WC clustering. Then, each newly arrived transaction is examined against the item clusters that are formed in the WC clustering phase. Phase two decides whether the new transaction is an outlier. After a period of time, the newly collected transactions or data streams are analyzed using WC clustering to produce an updated set of item clusters, against which each transaction arriving afterwards is examined. The process continues incrementally.
This proposed online clustering and outlier detection method has the following characteristics:

1. It is incremental. Each newly arrived transaction is examined immediately against the results from past transactions.
2. The results of WC clustering are item clusters rather than transaction clusters, so that a newly arrived transaction is examined against the item clusters rather than against all past transactions. The number of item clusters is usually much smaller than the number of past transaction clusters.
3. The item clusters are updated periodically, so that any new items and any new purchase behaviors of customers are taken into consideration to produce more accurate results for future detection.
4. Finally, the WC affinity measure, developed in our previous work, is used to improve the clustering results and hence the outlier detection results.

The rest of the chapter is organized as follows. Section 2 introduces background information and reviews previous research in related areas. Section 3 presents the proposed online clustering and outlier detection method in detail. Section 4 concludes the research and highlights future research trends.
2. BACKGROUND

Since the proposed method is an online clustering and outlier detection method that uses a vertical data structure and a weighted confidence affinity measure, we present a brief literature overview of the following aspects in this section: clustering methods, outlier detection, online data mining, affinity measures between clusters, and vertical data structures.
2.1. Clustering Methods
Clustering in data mining is a discovery process that partitions the data set into groups such that the data points in the same group are more similar to each other than to the data points in other groups. Data clustering is typically considered a form of unsupervised learning. Sometimes the goal of the clustering is to arrange the clusters into a natural hierarchy. Cluster analysis can also be used as a form of descriptive data model, showing whether or not the data consists of a set of distinct subgroups. There are many types of clustering techniques, which can be categorized in many ways (Han & Kamber, 2001; Jain & Dubes, 1998). The categorization shown in Figure 1 is based on the structure of clusters. As Figure 1 shows, clustering can be subdivided into partitioning clustering and hierarchical clustering. Hierarchical clustering is a nested sequence of partitions, whereas a partitioning clustering is a single partition. Hierarchical clustering methods can be further classified into agglomerative and divisive hierarchical clustering, depending on whether the hierarchical decomposition is accomplished in a bottom-up or a top-down fashion. Partitioning clustering consists of two approaches, distance-based and density-based, according to the similarity measure.

2.1.1 Partitioning Clustering Methods

Partitioning clustering methods generate a partition of the data in an attempt to recover natural groups present in the data. Partitioning clustering can be further subdivided into distance-based partitioning and density-based partitioning. A distance-based partitioning method breaks a data set into k subsets, or clusters, such that data points in the same cluster are more similar to each other than to data points in other clusters. The most classical similarity-based partitioning methods are k-means (Hartigan & Wong, 1979) and k-medoid, where each cluster has a gravity center. The time complexity of k-means is O(n), since each iteration is O(n) and only a constant number of iterations is computed. Density-based partitioning clustering has been recognized as a powerful approach for discovering arbitrary-shape clusters. In density-based clustering, clusters are dense areas of points in the data space that are separated by areas of low density (noise). A cluster is regarded as a connected dense area of data points, which grows in any direction that density leads. Density-based clustering can usually discover clusters with arbitrary shapes without predetermining the number of clusters.
Figure 1. Categorization of clustering
2.1.2 Hierarchical Clustering Methods

Hierarchical algorithms create a hierarchical decomposition of a data set X. The hierarchical
decomposition is represented by a dendrogram, a tree that iteratively splits X into smaller subsets until each subset consists of only one object. In such a hierarchy, each level of the tree represents a clustering of X. Figure 2 shows the hierarchical decomposition process and the dendrogram of hierarchical clustering. Hierarchical clustering methods are subdivided into agglomerative (bottom-up) approaches and divisive (top-down) approaches (Han & Kamber, 2001). An agglomerative approach begins with each point in a distinct cluster, and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all points in a single cluster and performs splitting until a stopping criterion is met. In our research, agglomerative hierarchical clustering is applied.

Figure 2. Hierarchical decomposition and the dendrogram
2.2. Outlier Detection

Detecting outliers is an important data mining task. From the classical view, outliers are defined as "an observation, which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980). Outlier detection is used in many applications, such as credit fraud detection, network intrusion detection, cyber crime detection, customer behavior analysis, and so on. Traditional outlier detection has largely focused on univariate datasets where the data follow certain
known standard distributions. However, most real data are multivariate, where a data point may be an outlier with respect to its neighborhood but not with respect to the whole dataset (Huang & Cheung, 2002). Outlier detection methods can generally be categorized as supervised or unsupervised (Yamanishi, Takeuchi, Williams, & Milne, 2004; Yue, Wu, Wang, Li, & Chu, 2007). In supervised methods, models are trained with labeled training data so that new data/observations can be assigned a corresponding label given the criterion of the model. On the contrary, unsupervised methods do not need prior knowledge of outliers in a historical database, but simply detect those transactions that are "unusual." Many unsupervised outlier detection methods have been developed for multivariate data, such as distance-based approaches (Ghoting, Parthasarathy, & Otey, 2008), density-based approaches (Knorr & Ng, 1998), subspace-based approaches, clustering-based approaches (Lu, Chen, & Kou, 2003), etc. However, most of them are designed for continuous numerical data and are not suitable for categorical data, which often appear in market-basket data. Categorical data are those with finite unordered attribute values, such as the sex of a customer. In market-basket data, a transaction can be represented as a vector with Boolean attributes where each attribute corresponds to a single item/behavior (Guha, Rastogii, & Shim, 2000). Boolean attributes are special categorical attributes. There
are a few outlier detection methods proposed for categorical data in the literature. ROCK is an agglomerative hierarchical clustering algorithm for categorical data using the links between the data points (Guha et al., 2000). The authors in (Wei et al., 2003) proposed HOT, a hypergraph partitioning algorithm to find the clusters of items/transactions of market-basket data. Recently, an adherence clustering method was developed based on the taxonomy (hierarchy) of items (Yun, Chuangi, & Chen, 2006) for transactions with hierarchical items. While useful, the efficiency and accuracy of categorical outlier detection need to be further improved.
2.3. Online Data Mining

Online data mining is mainly used for data streams. Data stream mining is concerned with extracting knowledge patterns from continuous, rapid data records. The general goal is to predict the property of a new data instance based on those of previous data instances in the data stream. Applications that produce streams of this type include network monitoring, telecommunication systems, customer click monitoring, stock markets, and any type of sensor system. The stream model differs from the standard relational model in the following ways (Guha, Meyerson, Mishra, Motwani, & O'Callaghan, 2003):

• The elements of a stream arrive more or less continuously.
• The order in which elements of a stream arrive is not under the control of the system.
• Data streams are potentially of unbounded size.
The above characteristics of data streams make the storage, querying and mining of such data sets highly challenging computationally. By nature, a data item in a data stream can be read only once or a small number of times using limited computing and storage capacity. Therefore, it is
usually not feasible to simply store the arriving data in a traditional database management system in order to perform operations on that data later on. Rather, data streams are generally processed in an online manner. Based on the approaches to improving processing efficiency, data stream mining can be divided into data-based and task-based techniques. Data-based techniques refer to summarizing the whole dataset or choosing a subset of the incoming stream to be analyzed; sampling, load shedding and sketching techniques represent the former, while synopsis data structures and aggregation represent the latter. On the other hand, in task-based solutions, techniques from computational theory have been adopted to achieve time- and space-efficient solutions (Babcock, Babu, Datar, Motwani, & Widom, 2002). Specifically, existing techniques are modified and new methods are invented in order to address the computational challenges of data stream processing. Approximation algorithms, sliding windows and algorithm output granularity represent this category (Babcock et al., 2002; Safaei & Haghjoo, 2010). There has been much research on data stream clustering (Angiulli & Fassetti, 2010; Babcock, Babu, Datar, Motwani, & Widom, 2002; O'Callaghan, Mishra, Meyerson, Guha, & Motwani, 2003) and data stream outlier detection (Aggarwal, Han, Wang, & Yu, 1997; Guha, Meyerson, Mishra, Motwani, & O'Callaghan, 2003; Elahi, Li, Nisar, Lv, & Wang, 2008; Phua, Gayler, Lee, & Smith-Miles, 2009). Data streams can evolve over time, so older data may not yield much information about the current state. Online clustering methods include k-means clustering and cover tree based clustering. Clustering happens in an online manner as it takes place in the brain: each data point comes in, is processed, and then goes away. The ideal online k-means clustering is to repeat the following forever: get a new data point x and update the current set of k means. However, the method cannot store all the data it sees, because the process goes on infinitely. The
online version is revised as follows (Beringer & Hüllermeier, 2006):

1. Iterate over the data. For each data point, find the closest mean and move that mean a certain distance towards the data point.
2. Repeat step 1 until a termination criterion is reached.
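A minimal sketch of this revised online k-means update (NumPy-based; the shrinking step size is one common choice for the "certain distance", not prescribed by the chapter):

import numpy as np

def online_kmeans_update(means, counts, x):
    """Move the closest mean a fraction of the way towards point x."""
    j = int(np.argmin(((means - x) ** 2).sum(axis=1)))  # closest mean
    counts[j] += 1
    means[j] += (x - means[j]) / counts[j]  # step size shrinks over time
    return j

# Usage: initialize the k means (e.g. from the first k points), then feed
# each arriving point to online_kmeans_update until a stopping criterion.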
One of the weaknesses of k-means clustering is the predefined value of k. To solve this problem, Beygelzimer, Kakade, and Langford developed a cover tree based online clustering method (Beygelzimer, Kakade, & Langford, 2006). Assume for the moment that the distances among all data points are ≤ 1. A cover tree on data points/transactions t1, ..., tn is a rooted infinite tree with the following properties:

1. Each node of the tree is associated with one of the data points, ti.
2. If a node is associated with ti, then one of its children must also be associated with ti.
3. All nodes at depth h are at distance at least 1/2^h from each other.
4. Each node at depth h+1 is within distance 1/2^h of its parent (at depth h).

This is described as an infinite tree for simplicity of analysis, but it would not be stored as such. In practice, there is no need to duplicate a node as its own child, and so the tree takes up O(n) space. What makes cover trees especially convenient is that they can be built online, one point at a time. To insert a new point t': find the largest h such that t' is within 1/2^h of some node s at depth h in the tree, and make t' a child of s. Once the tree is built, it is easy to obtain a k-means clustering from it.
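The insertion rule just described can be sketched naively as follows; this is an illustrative toy version that scans candidate nodes level by level (a production cover tree needs more careful bookkeeping than shown here):

class Node:
    def __init__(self, point, depth):
        self.point, self.depth, self.children = point, depth, []

def insert(root, t, dist):
    """Insert point t: descend while some node at the current depth h
    covers t (i.e. lies within 1/2**h of it), then attach t as a child
    of the deepest covering node found."""
    best, level = root, [root]
    while level:
        h = level[0].depth
        near = [s for s in level if dist(s.point, t) <= 0.5 ** h]
        if not near:
            break
        best = near[0]                       # deepest covering node so far
        level = [c for s in near for c in s.children]
    best.children.append(Node(t, best.depth + 1))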
2.4. Affinity Measure between Clusters

A good clustering method produces high-quality clusters, ensuring that data points within the same group have high similarity while being very dissimilar to points in other groups. Thus similarity/affinity measures are critical for producing high-quality clustering results. Some commonly used distance measures include Euclidean distance, Manhattan distance, maximum norm, Mahalanobis distance, and Hamming distance. However, these distance measures do not work well for market-basket data clustering. As aforementioned, market-basket data is different from traditional data. One such difference results from the categorical nature of its attributes. Therefore, traditional distance measures do not work effectively in such data environments. Moreover, items may not be spread evenly across transactions; in other words, some items may have much higher or lower support than the rest. Such data sets are usually described as "support-skewed" data sets. The all-confidence measure was devised especially for data sets with skewed supports (Omiecinski, 2003; Xiong, Tan, & Kumar, 2003). Given two items Ii and Ij, all-confidence chooses the minimum of conf({Ii → Ij}) and conf({Ij → Ii}) as the affinity measure between the two items. However, this measure is biased because it depends on the support of the larger item (i.e. the one with the larger support) without consideration of the other, smaller item. For example, suppose supp({I1}) = 0.5, supp({I2}) = 0.1, supp({I3}) = 0.4, supp({I1, I2}) = 0.1, and supp({I1, I3}) = 0.1, where supp(X) is the support of itemset X. Then the all-confidence measures for the sets {I1, I2} and {I1, I3} have the same value, i.e. 0.1/0.5 = 0.2, because these two sets share the large item, I1, regardless of the fact that the support of I3 is much greater than the support of I2. But according to the cosine similarity and the Jaccard measures (Han, Karypis, Kumar, & Kamber, 1998), set {I1,
I2} has higher affinity than {I1, I3}. Therefore, it should be obvious that the all-confidence affinity measure can result in many ties among item sets that involve the same large item, which can lead to inaccurate results. This motivated us to devise the weighted confidence affinity measure.
2.5. Vertical Data Structures

Vertical partitioning of relations has drawn a lot of attention in databases, data mining, and data warehousing in the last decade. Compared to the traditional horizontal approach, which processes and stores data row by row, the vertical approach processes and stores data column-wise. The concept of vertical partitioning for relations and vertical mining has been well studied in data analysis fields. Wong et al. present the Bit Transposed File (BTF) model, which takes advantage of encoding attribute values using a small number of bits in order to reduce the storage space in a vertically decomposed context (Wong, Liu, Olken, Rotem & Wong, 1995). The most basic vertical structure is a bitmap (Ding, Ding, & Perrizo, 2002), where every intersection is represented by a bit in an index bitmap. Consequently, each item is represented by a bit vector. The logical AND operation can then be used to merge items and itemsets into larger itemset patterns. The support of an itemset is calculated by counting the number of 1-bits in the bit vector. It has been demonstrated in the literature that vertical data approaches are very effective and usually outperform traditional horizontal approaches (Ding et al., 2002; Zakii & Hsiao, 2002). The advantage is due to the fact that the logical AND or intersection operation is very efficient to compute. For example, in Association Rule Mining (ARM), frequent itemsets can be counted and irrelevant transactions can be pruned via column-wise intersections, whereas the traditional horizontal approach requires complex internal data structures, such as hash/search trees. Moreover, vertical approaches can be implemented easily in
parallel environments to speed up the data mining process further.
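A small illustration of this vertical idea using Python integers as bit vectors (our own toy encoding: bit t of an item's bitmap is 1 when transaction t contains the item):

# Three items over five transactions (bit t set => transaction t has the item).
bitmaps = {'milk': 0b10111, 'bread': 0b00111, 'eggs': 0b01100}

def support_count(items):
    """AND the bit vectors of all items and count the surviving 1-bits."""
    v = ~0  # start with all bits set
    for it in items:
        v &= bitmaps[it]
    return bin(v & 0b11111).count('1')  # mask to the 5 transactions

print(support_count(['milk', 'bread']))  # transactions containing both -> 3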
3. THE PROPOSED METHOD

In this section, we first introduce the affinity function as a similarity measure, then present in detail the two phases of the proposed online clustering and outlier detection method, i.e. WC clustering and outlier detection.
3.1. Affinity Function between Items

A distance or similarity measure is critical for clustering and outlier detection accuracy. Our affinity function is applied to calculate the similarity between items and between clusters in a set of transactions D = {T1, T2, …, Tn}, where n is the total number of transactions. Each transaction Ti contains a subset of the items from the item space {I1, I2, …, Im}, where m is the total number of items. As we discussed earlier, the all-confidence measure (Omiecinski, 2003; Xiong et al., 2003) was developed to deal with skewed support. However, this measure is biased and often results in many ties among itemsets and among clusters that involve the same large item, which can eventually lead to inaccurate results. To eliminate this problem while still tackling skewed-support data sets, we suggest using the weighted summation of the two confidences as the affinity measure between two items (Wang & Rahal, 2007), i.e.

A(Ii, Ij) = wi * conf({Ii → Ij}) + wj * conf({Ij → Ii}),    (1)

where

wi = supp({Ii}) / (supp({Ii}) + supp({Ij})),    (2)

wj = supp({Ij}) / (supp({Ii}) + supp({Ij})).    (3)
In the above definition, A(Ii, Ij) is the affinity measure between item Ii and item Ij; supp({Ii}) and supp({Ij}) define the supports of items Ii and Ij, i.e. the proportions of transactions in the data set which contain Ii and Ij, respectively; conf({Ii → Ij}) defines the confidence of the rule Ii → Ij and is calculated as supp({Ii, Ij})/supp({Ii}). For simplicity, we denote conf({Ii → Ij}) as "the confidence from Ii's side." The equations above show that the confidences from the two sides are included in the affinity measure but are weighted based on the support of each side. The higher an item's support is, the more the confidence from its side contributes to the affinity measure. Consider two extreme scenarios: (1) when the two item supports are the same, the confidences from both sides are equal and contribute to the affinity equally; in this case, the affinity measure equals one of the confidences. (2) When the two item supports are significantly different, the contribution of the confidence from the lower-support side nears zero; in this case, the affinity measure approximately equals the confidence from the higher-support side. The all-confidence measure is designed to deal with the second case above (Xiong et al., 2003). It takes the minimum of the confidences from the two sides to filter out the impact of low-support items. Our affinity function can still deal with this scenario well by closely approximating the confidence from the higher-support side, which is the minimum of the two. Therefore, our affinity function can achieve the accuracy of the all-confidence measure on skewed-support data sets while not producing misleading ties. By replacing wi and wj in Equation (1) using Equations (2) and (3), replacing the confidence
variables with their formulas, and simplifying Equation (1), we obtain the following affinity measure function:

A(Ii, Ij) = 2 * supp({Ii, Ij}) / (supp({Ii}) + supp({Ij})).    (4)
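Equation (4) reduces to a one-line computation; the sketch below reuses the running example with items I1, I2, I3 (the function name is ours, chosen for illustration):

def affinity(supp_i, supp_j, supp_ij):
    # A(Ii, Ij) = 2 * supp({Ii, Ij}) / (supp({Ii}) + supp({Ij}))
    return 2.0 * supp_ij / (supp_i + supp_j)

# supp(I1)=0.5, supp(I2)=0.1, supp(I3)=0.4, supp(I1,I2)=supp(I1,I3)=0.1:
print(affinity(0.5, 0.1, 0.1))  # {I1, I2} -> 0.333...
print(affinity(0.5, 0.4, 0.1))  # {I1, I3} -> 0.222..., so no tie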
As can be observed from (4), our affinity measure function is calculated directly from supports and no comparisons are involved; therefore it is more efficient to compute than the all-confidence measure. The function is also very intuitive: when supp({Ii, Ij}) = supp({Ii}) = supp({Ij}), A(Ii, Ij) reaches the maximum value of 1. This is the case where the two items always appear together in every transaction; in other words, if we denote the transaction set that contains Ii as {T(Ii)}, then {T(Ii)} = {T(Ij)} in this case. On the other hand, if supp({Ii, Ij}) = 0, i.e. the two items never appear together, the affinity takes the minimum value of 0.

f(x) = 1 for x > 0
f(x) = 0 for x ≤ 0    (3)
Thus the threshold function produces only a 1 or a 0 as the output, so that the neuron is either activated or not.
Learning Algorithm

Since the single layer perceptron is a supervised neural network architecture, it requires learning of some a priori knowledge base for its operation. The following algorithm illustrates the learning paradigm for a single layer perceptron. It comprises the following steps:

• Initialization of the interconnection weights and thresholds, randomly.
• Calculating the actual outputs by taking the thresholded value of the weighted sum of the inputs.
• Altering the weights to reinforce correct decisions and discourage incorrect decisions, i.e. reducing the error.
The weights are, however, unchanged if the network makes the correct decision. Also, the weights are not adjusted on input lines which do not contribute to the incorrect response, since each weight is adjusted by the value of the input on that line, xi, which would be zero (see Table 1). In order to predict the expected outputs, a loss (also called objective or error) function E can be defined over the model parameters to ascertain the error in the prediction process. A popular choice for E is the sum-squared error given by

E = Σi (yi − di)²    (4)

In words, it is the sum of the squared differences between the target value di and the perceptron's prediction yi (calculated from the input value xi), computed over all points i in the data set. For a
Table 1.
Begin
  Initialize interconnection weights and threshold
  Set wi(t=0), (0
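The learning procedure just described (threshold activation, weight updates only on errors) can be sketched as below; this is a minimal NumPy illustration in which the learning rate eta and the epoch loop are our own choices, not taken from the chapter:

import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=100):
    """Single layer perceptron with a hard threshold activation.

    X: (n_samples, n_features) input array; d: array of 0/1 targets.
    """
    w = np.random.uniform(-0.5, 0.5, X.shape[1])   # random initial weights
    theta = np.random.uniform(-0.5, 0.5)           # random initial threshold
    for _ in range(epochs):
        for xi, di in zip(X, d):
            y = 1 if np.dot(w, xi) - theta > 0 else 0  # thresholded output
            if y != di:                    # weights change only on errors
                w += eta * (di - y) * xi   # zero-valued inputs stay untouched
                theta -= eta * (di - y)
    return w, theta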